-
Notifications
You must be signed in to change notification settings - Fork 66
feat(trainer): add dataset and model initializer support to container backend #188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat(trainer): add dataset and model initializer support to container backend #188
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
🎉 Welcome to the Kubeflow SDK! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification! |
Fiona-Waters
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @HKanoje for this PR! I've left some comments, ptal! Thanks
| """ | ||
| # Use the training-operator image which contains initializer scripts | ||
| # This can be made configurable via backend config in the future | ||
| return "kubeflow/training-operator:latest" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make this configurable rather than hardcoding it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Definitely. I've made this configurable via ContainerBackendConfig.initializer_image (default: kubeflow/training-operator:latest). Users can now customize it when creating the backend.
| try: | ||
| import time | ||
|
|
||
| timeout = 600 # 10 minutes timeout for initialization |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be configurable, or is 10 minutes always going to be enough time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added ContainerBackendConfig.initializer_timeout (default: 600 seconds / 10 minutes). This gives users flexibility for large datasets/models that may take longer to download.
| # Clean up the failed container | ||
| from contextlib import suppress | ||
|
|
||
| with suppress(Exception): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As well as cleaning up when a failure occurs, should we clean up the initializer containers when they have been successful also?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented wait_for_container() in the adapter interface and both Docker/Podman adapters. This replaces the polling loop with a single blocking wait call - much more efficient.
| logger.debug(f"Created network: {network_id}") | ||
|
|
||
| # Run initializers if configured | ||
| if initializer: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the initializer fails should we clean up the network we have created?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely! Added cleanup for successful initializer containers after completion to prevent accumulation. Also added cleanup for timed-out containers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! Would it make sense to add a helper function for the cleanup logic to reduce duplication?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! Added _cleanup_container_resources() helper method in commit 0b7a952 to consolidate the duplicated cleanup logic across exception handlers and delete_job().
| if isinstance( | ||
| initializer, (types.HuggingFaceDatasetInitializer, types.HuggingFaceModelInitializer) | ||
| ) | ||
| else "python -m kubeflow.storage_initializer.datacache " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we are setting datacache as the default/fallback, do we want to do this? In thekubernetes backend we offer 2 options and raise a value error if the type is invalid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right - that was inconsistent with the kubernetes backend. Changed to raise ValueError with a clear message listing all supported types instead of defaulting to datacache.
| import time | ||
|
|
||
| timeout = 600 # 10 minutes timeout for initialization | ||
| polling_interval = 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could using wait API be supported?
|
/ok-to-test |
- Make initializer image configurable via ContainerBackendConfig - Make initializer timeout configurable (default 600 seconds) - Implement wait API in adapters instead of polling - Clean up successful initializer containers after completion - Clean up network on initializer failure - Raise ValueError for unsupported initializer types (no datacache fallback) All tests passing (173/173). Addresses all feedback from PR kubeflow#188.
|
Hey @HKanoje, could you please sign your commits? |
Pull Request Test Coverage Report for Build 20747503876Details
💛 - Coveralls |
… backend Add support for dataset and model initializers in the container backend to bring it to feature parity with the Kubernetes backend. Changes: - Add utility functions for building initializer commands and environment variables - Implement _run_initializers() and _run_single_initializer() methods in ContainerBackend - Run initializers sequentially before training containers start - Download datasets to /workspace/dataset and models to /workspace/model - Track initializer containers as separate steps in TrainJob - Support all initializer types: HuggingFace, S3, and DataCache - Add comprehensive unit tests for all initializer configurations - Handle initializer failures with proper cleanup and error messages Fixes kubeflow#171 Signed-off-by: HKanoje <[email protected]>
- Make initializer image configurable via ContainerBackendConfig - Make initializer timeout configurable (default 600 seconds) - Implement wait API in adapters instead of polling - Clean up successful initializer containers after completion - Clean up network on initializer failure - Raise ValueError for unsupported initializer types (no datacache fallback) All tests passing (173/173). Addresses all feedback from PR kubeflow#188. Signed-off-by: HKanoje <[email protected]>
768a6a9 to
0dbb6b6
Compare
|
@kramaranya Done! All commits are now signed. |
Add _cleanup_container_resources() helper method to consolidate duplicated cleanup logic for stopping/removing containers and deleting networks. Refactor 5 locations across train(), initializer handlers, and delete_job() to use this helper. Signed-off-by: HKanoje <[email protected]>
What this PR does / why we need it
This PR implements dataset and model initializer support in the container backend, bringing it to feature parity with the Kubernetes backend. This addresses issue #171 by enabling users to automatically download and prepare datasets and models before training starts.
Solution Overview
This implementation adds full initializer support to the container backend by:
Detailed Changes
1. New Utility Functions (
kubeflow/trainer/backends/container/utils.py)build_initializer_command(initializer, init_type)Builds the appropriate container command based on initializer type:
kubeflow.storage_initializer.hugging_facemodulekubeflow.storage_initializer.s3modulekubeflow.storage_initializer.datacachemodulebuild_initializer_env(initializer, init_type)Constructs environment variables from initializer configuration:
STORAGE_URIfrom the initializer configOUTPUT_PATHto/workspace/datasetor/workspace/modelbased on typeACCESS_TOKEN,ENDPOINT,REGION, etc.CLUSTER_SIZE,METADATA_LOCget_initializer_image()Returns the initializer container image (
kubeflow/training-operator:latest).This can be made configurable via backend config in future iterations.
2. Enhanced ContainerBackend (
kubeflow/trainer/backends/container/backend.py)_run_initializers(job_name, initializer, workdir, network_id)Orchestrates the initialization phase:
pull_policy)_run_single_initializer(job_name, initializer_config, init_type, image, workdir, network_id)Executes a single initializer container:
/workspaceUpdated
train()methodUpdated
__get_trainjob_from_containers()num_nodes(excludes initializers)Updated
get_job_logs()node-0logs (default), only shows training container logsstep="dataset-initializer"orstep="model-initializer"3. Comprehensive Test Coverage (
kubeflow/trainer/backends/container/backend_test.py)Added 11 new test cases covering:
Initialization Success Scenarios
Log Retrieval
Error Handling
Implementation Details
Initialization Flow
User calls
train()withinitializerparameterContainerBackend creates:
initializer.datasetis set:initializer.modelis set:Volume Layout
Host:
~/.kubeflow/trainer/containers/<job-name>/Container Mount:
/workspace/Testing Results
All tests pass with no regressions:
Usage Example