-
Notifications
You must be signed in to change notification settings - Fork 66
feat(trainer): Support namespaced TrainingRuntime in the SDK #130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat(trainer): Support namespaced TrainingRuntime in the SDK #130
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
24f00a7 to
1740535
Compare
|
/ok-to-test |
Signed-off-by: Moeed Shaik <[email protected]>
Signed-off-by: Moeed Shaik <[email protected]>
8f0b6d5 to
de2ad1b
Compare
Signed-off-by: Moeed Shaik <[email protected]>
|
Thank you @shaikmoeed for this! |
|
|
||
| def get_runtime(self, name: str) -> types.Runtime: | ||
| """Get the the Runtime object""" | ||
| """Get the the Runtime object prefer namespaced, fall-back to cluster-scoped""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| """Get the the Runtime object prefer namespaced, fall-back to cluster-scoped""" | |
| """Get the Runtime object prefer namespaced, fall-back to cluster-scoped""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same change goes for each occurence here
| ) | ||
|
|
||
| cluster_runtime_list = models.TrainerV1alpha1ClusterTrainingRuntimeList.from_dict( | ||
| cluster_thread.get(constants.DEFAULT_TIMEOUT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| cluster_thread.get(constants.DEFAULT_TIMEOUT) | |
| cluster_thread.get(common_constants.DEFAULT_TIMEOUT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! Reason for test case failures!
| def create_training_runtime( | ||
| name: str, | ||
| namespace: str = "default", | ||
| ) -> models.TrainerV1alpha1TrainingRuntime: | ||
| """Create a mock namespaced TrainingRuntime object (not cluster-scoped).""" | ||
| return models.TrainerV1alpha1TrainingRuntime( | ||
| apiVersion=constants.API_VERSION, | ||
| kind="TrainingRuntime", | ||
| metadata=models.IoK8sApimachineryPkgApisMetaV1ObjectMeta( | ||
| name=name, | ||
| namespace=namespace, | ||
| labels={constants.RUNTIME_FRAMEWORK_LABEL: name}, | ||
| ), | ||
| spec=models.TrainerV1alpha1TrainingRuntimeSpec( | ||
| mlPolicy=models.TrainerV1alpha1MLPolicy( | ||
| torch=models.TrainerV1alpha1TorchMLPolicySource( | ||
| numProcPerNode=models.IoK8sApimachineryPkgUtilIntstrIntOrString(2) | ||
| ), | ||
| numNodes=2, | ||
| ), | ||
| template=models.TrainerV1alpha1JobSetTemplateSpec( | ||
| metadata=models.IoK8sApimachineryPkgApisMetaV1ObjectMeta( | ||
| name=name, | ||
| namespace=namespace, | ||
| ), | ||
| spec=models.JobsetV1alpha2JobSetSpec(replicatedJobs=[get_replicated_job()]), | ||
| ), | ||
| ), | ||
| ) | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you mean to create this in kubernetes/backend_test.py?
this is not a test function and I believe it should be added to the TrainerClient and propagated to the different backends.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to create_train_job, I thought to use create_training_runtime to pass here instead of empty list. Please correct me, if this was not intended?
Signed-off-by: Moeed Shaik <[email protected]>
Signed-off-by: Moeed Shaik <[email protected]>
d1ca707 to
32b18fd
Compare
Signed-off-by: Moeed <[email protected]>
Pull Request Test Coverage Report for Build 19512318478Details
💛 - Coveralls |
Signed-off-by: Moeed Shaik <[email protected]>
|
@abhijeet-dhumal, can you review it again during your free time? |
kramaranya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @shaikmoeed!
I'd like all of us to spend more time on designing this properly, since there are a lot of items to consider
/assign @kubeflow/kubeflow-sdk-team
| except multiprocessing.TimeoutError as e: | ||
| raise TimeoutError(f"Timeout to list {constants.CLUSTER_TRAINING_RUNTIME_KIND}s") from e | ||
| raise TimeoutError( | ||
| "Timeout to list " | ||
| f"{constants.CLUSTER_TRAINING_RUNTIME_KIND}s/{constants.TRAINING_RUNTIME_KIND}s " | ||
| f"in namespace: {self.namespace}" | ||
| ) from e | ||
| except Exception as e: | ||
| raise RuntimeError(f"Failed to list {constants.CLUSTER_TRAINING_RUNTIME_KIND}s") from e | ||
| raise RuntimeError( | ||
| "Failed to list " | ||
| f"{constants.CLUSTER_TRAINING_RUNTIME_KIND}s/{constants.TRAINING_RUNTIME_KIND}s " | ||
| f"in namespace: {self.namespace}" | ||
| ) from e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What If only cluster runtimes exist (no TrainingRuntime CRD)? The entire list_runtimes will fail with RuntimeError, which we don't want, right?
| except multiprocessing.TimeoutError as e: | ||
| raise TimeoutError(f"Timeout to list {constants.CLUSTER_TRAINING_RUNTIME_KIND}s") from e | ||
| raise TimeoutError( | ||
| "Timeout to list " | ||
| f"{constants.CLUSTER_TRAINING_RUNTIME_KIND}s/{constants.TRAINING_RUNTIME_KIND}s " | ||
| f"in namespace: {self.namespace}" | ||
| ) from e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if one type times out and the other succeeds? I think we still should return partial results
| "Timeout to list " | ||
| f"{constants.CLUSTER_TRAINING_RUNTIME_KIND}s/{constants.TRAINING_RUNTIME_KIND}s " | ||
| f"in namespace: {self.namespace}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: cluster scoped runtimes are not really in a namespace
| except Exception as e: | ||
| logger.warning( | ||
| f"Namespaced TrainingRuntime '{self.namespace}/{name}' not found " | ||
| f"({type(e).__name__}: {e}); falling back to cluster-scoped runtime." | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if it was for example TimeoutError? We will still silently fallback to cluster scoped runtimes, right? I would suggest only treating not found / missing CRD as fall back.
|
|
||
| return result | ||
|
|
||
| def get_runtime(self, name: str) -> types.Runtime: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to list_runtimes(), either case will cause a full failure
| # The Kind name for the TrainingRuntime. | ||
| TRAINING_RUNTIME_KIND = "TrainingRuntime" | ||
|
|
||
| # The plural for the ClusterTrainingRuntime. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # The plural for the ClusterTrainingRuntime. | |
| # The plural for the TrainingRuntime. |
| return result | ||
|
|
||
| for runtime in runtime_list.items: | ||
| for runtime in namespace_runtime_list.items + cluster_runtime_list.items: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shaikmoeed
Quick question : What if runtimes with the same name exists in both cluster and namespace scoped ?
IIUC you have implemented namespace scoped priority in get_runtime() where trainingRuntimes get's first priority..
thinking should it be same case for list_runtimes method too ?
And one more thing, In case of list_runtimes it's just appending both even if duplicates comes in..
So for end user how user will be able to know the kind of runtime via list_runtimes's list items ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we introduce kind and namespace(optional) params in Runtime dataclass here :
sdk/kubeflow/trainer/types/types.py
Line 251 in 49a5087
| class Runtime: |
So that
for runtime in runtimes:
print(f"{runtime.name} ({runtime.kind}, ns={runtime.namespace})")
# Output:
# torch-runtime (TrainingRuntime, ns=team-a)
# torch-runtime (ClusterTrainingRuntime, ns=None)
# custom-runtime (TrainingRuntime, ns=team-a)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WDYT? @kramaranya @szaher ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be in favor of adding a scope field instead, with values like "project" and "cluster", so we don't expose any k8s terminology to users.
Regarding namespace I don't think AI practitioners need to see it, since they already know they're working in their project context (set at client init time).
cc @andreyvelich @astefanutti
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for a scope field, though I wonder whether it could be more symmetrical to the resolution logic in the get_runtime and not returning the ClusterTrainingRuntimes that are "overwritten" in the client namespace. This way I don't think any additional field would be needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we configure this setting during TrainerClient() initialization?
I am curious if we ever have a use-case when users want to create a single TrainerClient() which can work with TrainingRuntime and ClusterTrainingRuntime ?
|
Hey, @shaikmoeed! Are you still working on this PR? Did you have a chance to address the feedback? |
@kramaranya Haven't got a chance to address the feedback. Will update the PR by next week. |
What this PR does / why we need it:
Add support to list/get namespaced TrainingRuntime.
Which issue(s) this PR fixes:
Fixes #88
Checklist: