
Conversation


@HKanoje HKanoje commented Nov 17, 2025

Description

Problem:

The get_job() API currently returns multiple Pods for the same TrainJob component
(e.g., dataset-initializer, trainer-node-0) when Kubernetes recreates Pods under a
batch Job's restart policy.

This causes users to see duplicate components with conflicting statuses: for example,
one Pod may show "Failed" while another shows "Running", leading to confusion about
the actual state of the training job.

Solution:

This PR improves the get_job() API to filter duplicate Pods and display only the most recently created Pod for each TrainJob component.

Key Improvements

1. Groups Pods by role

  • Uses JOBSET_RJOB_NAME_LABEL for initializer Pods
  • Uses a combination of JOBSET_RJOB_NAME_LABEL + JOB_INDEX_LABEL for training-node Pods
    (ensures correct grouping across multi-node trainer replicas)

2. Selects the most recent Pod

  • For each group, the API now selects the Pod with the latest creation_timestamp
  • Eliminates stale or restarted Pods that would otherwise appear as duplicates
    (e.g., old Pods in Failed state)

3. Maintains backward compatibility

  • No changes to the API schema or response format
  • Behavior only differs when duplicate Pods exist, improving clarity for end users

This ensures users see clean, de-duplicated component statuses that accurately represent the current state of their training job.
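The grouping-and-selection approach described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the label values and the dedupe_pods helper are placeholders standing in for the SDK's internal constants and method.

```python
from collections import defaultdict

# Placeholder label keys standing in for the SDK's constants.
JOBSET_RJOB_NAME_LABEL = "jobset.sigs.k8s.io/replicatedjob-name"
JOB_INDEX_LABEL = "batch.kubernetes.io/job-completion-index"

def dedupe_pods(pods):
    """Group Pods by component role and keep only the newest Pod per group."""
    groups = defaultdict(list)
    for pod in pods:
        labels = pod.metadata.labels
        # Trainer-node Pods also carry a job index, so multi-node
        # replicas (node-0, node-1, ...) land in separate groups.
        key = (labels[JOBSET_RJOB_NAME_LABEL], labels.get(JOB_INDEX_LABEL))
        groups[key].append(pod)
    # A restarted Pod shares its predecessor's key; the newest one wins.
    return [
        max(group, key=lambda p: p.metadata.creation_timestamp)
        for group in groups.values()
    ]
```

With this shape, an old Failed Pod and its Running replacement collapse into a single entry, which is the de-duplication behavior the PR describes.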

Example Impact:

Before this fix:

job = client.get_job("my-job")
# Shows duplicate components with conflicting statuses
job.steps = [
    Step(name='dataset-initializer', status='Failed'),    # Old pod
    Step(name='dataset-initializer', status='Running'),   # New pod
    Step(name='node-0', status='Failed'),                 # Old pod
    Step(name='node-0', status='Running'),                # New pod
]

After this fix:

job = client.get_job("my-job")
# Shows only current components
job.steps = [
    Step(name='dataset-initializer', status='Running'),   # Latest only ✓
    Step(name='node-0', status='Running'),                # Latest only ✓
]

Changes Made

Modified Files


backend.py

  • Updated the __get_trainjob_from_cr() method to implement Pod de-duplication and filtering logic
  • Added comprehensive inline comments explaining the grouping and selection approach
  • Groups Pods by component role:
    • Initializers grouped by JOBSET_RJOB_NAME_LABEL
    • Training nodes grouped by JOBSET_RJOB_NAME_LABEL + JOB_INDEX_LABEL
  • For each group, selects the most recent Pod based on creation_timestamp

backend_test.py

  • Added a new test: test_get_job_with_pod_restarts()
  • Simulates Pod restart scenarios where Kubernetes creates duplicate Pods
  • Verifies that only the most recent Pod per component is returned
  • Covers mixed scenarios:
    • Some components with restarts
    • Some components without restarts
  • Ensures correct behavior and backward compatibility
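The restart scenario this test exercises can be sketched in a self-contained way. The latest_per_role helper and the label shapes below are illustrative stand-ins, not the actual backend or test code:

```python
import datetime
from types import SimpleNamespace

def latest_per_role(pods):
    """Keep only the most recently created Pod for each role."""
    best = {}
    for pod in pods:
        role = pod.metadata.labels["role"]
        current = best.get(role)
        if current is None or pod.metadata.creation_timestamp > current.metadata.creation_timestamp:
            best[role] = pod
    return best

def make_pod(role, phase, hour):
    return SimpleNamespace(
        metadata=SimpleNamespace(
            labels={"role": role},
            creation_timestamp=datetime.datetime(2025, 11, 17, hour, tzinfo=datetime.timezone.utc),
        ),
        status=SimpleNamespace(phase=phase),
    )

def test_restarted_pods_are_deduplicated():
    pods = [
        make_pod("dataset-initializer", "Failed", 10),   # old Pod, later restarted
        make_pod("dataset-initializer", "Running", 11),  # replacement Pod
        make_pod("node-0", "Running", 10),               # no restart
    ]
    best = latest_per_role(pods)
    assert len(best) == 2                                # one entry per role
    assert best["dataset-initializer"].status.phase == "Running"
    assert best["node-0"].status.phase == "Running"
```

This mirrors the mixed case the PR describes: one component with a restart, one without, and only the newest Pod surviving per role.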

Testing

All tests passing:

  • make verify: PASSED (lint + format checks)
  • test_get_job_with_pod_restarts: PASSED (new test for Pod restart filtering)
  • test_get_job: PASSED (existing behavior remains compatible)
  • All 36 Kubernetes backend tests: PASSED
  • All 163 Python unit tests: PASSED

Test Coverage

  • Pod restart scenarios with duplicate Pods having different creation_timestamp values
  • Mixed scenarios where:
    • Some components have restarts
    • Others have no duplicates
  • Verified that the API selects only the newest Pod per component
  • Confirmed that statuses come from the latest Pods, not older failed ones

Checklist

  • Follows Conventional Commits specification
  • Code follows project style guidelines (make verify passes)
  • All tests pass locally (make test-python)
  • Added comprehensive unit tests for new functionality
  • Updated documentation and inline comments where needed
  • No breaking changes to public APIs
  • Fully backward compatible with existing behavior

Related Issues

Fixes #25

When Kubernetes recreates Pods due to restart policies, multiple Pods
with the same role can exist simultaneously. This causes get_job() to
return duplicate TrainJob components with different statuses, creating
confusion for users.

This change groups Pods by their component role and selects only the
most recently created Pod for each component based on creation_timestamp.
This ensures users see the current state of their TrainJob after any
Pod restarts.

Changes:
- Group Pods by role identifier (initializer name or node+index)
- Select most recent Pod from each group using creation_timestamp
- Add comprehensive test for Pod restart scenarios

Fixes kubeflow#25

Signed-off-by: HKanoje <[email protected]>
@HKanoje HKanoje force-pushed the fix/filter-duplicate-pods-in-get-job branch from 978b209 to faf96a5 on November 17, 2025 03:36
pod_groups[key] = []
pod_groups[key].append(pod)

# Select the most recently created Pod from each group.
Contributor

I think to make it more robust we could select the pod based on the status as well as the timestamp, something like this: Fiona-Waters@b48277f
wdyt?
It would return a pod that actually reflects the true state of each TrainJob component, rather than just the newest pod.

Author

Absolutely agree! I've implemented exactly that approach from your commit b48277f. The current implementation now:

  • Prioritizes by status first: Running (4) > Succeeded (3) > Failed (2) > Pending (1) > Unknown (0)
  • Uses timestamp as a tiebreaker: among pods with the same status, selects the most recent one

This ensures we return a pod that reflects the true state of the TrainJob component (preferring Running/Succeeded pods over Failed ones), rather than blindly picking the newest pod regardless of its state.

For example, if we have:

  • Pod A: Failed (created at 11:00)
  • Pod B: Running (created at 10:00)

The old logic would return Pod A (newest), but the new logic correctly returns Pod B (Running status has higher priority).
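The selection rule described here can be sketched as a small standalone function. This is an illustrative sketch using plain phase strings; the real code is assumed to use the SDK's constants and live inside the backend:

```python
import datetime

# Status ranks as described above (higher = preferred).
STATUS_PRIORITY = {"Running": 4, "Succeeded": 3, "Failed": 2, "Pending": 1, "Unknown": 0}

def select_best_pod(pods):
    """Prefer the healthiest phase; break ties with the newest timestamp."""
    if not pods:
        return None
    # Timezone-aware fallback so Pods without a timestamp still sort safely.
    epoch = datetime.datetime.min.replace(tzinfo=datetime.timezone.utc)
    return max(
        pods,
        key=lambda p: (
            STATUS_PRIORITY.get(p.status.phase, 0),
            p.metadata.creation_timestamp or epoch,
        ),
    )
```

With the Pod A/B example above, select_best_pod would return Pod B: its Running phase outranks Failed even though Pod A is newer.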

…e safety

Apply code quality improvements based on review feedback:

- Use status-based priority for pod selection (Running > Succeeded > Failed > Pending > Unknown)
- Add datetime.min fallback for safer timestamp sorting (prevents TypeError)
- Add precise type hints to internal dicts for better type checking
- Use consistent .get() access for JOB_INDEX_LABEL with default fallback
- Add pod phase constants (POD_RUNNING, POD_FAILED, POD_PENDING, POD_UNKNOWN)

These changes improve robustness, type safety, and maintainability while
maintaining the same behavior of selecting the best pod for each role.

Signed-off-by: HKanoje <[email protected]>

# Sort by creation timestamp (most recent first)
candidate_pods.sort(
key=lambda p: p.metadata.creation_timestamp or datetime.datetime.min, reverse=True
Contributor

This could cause an issue in newer Python versions - do we want it to be timezone-naive or set to UTC?

Suggested change
key=lambda p: p.metadata.creation_timestamp or datetime.datetime.min, reverse=True
key=lambda p: (p.metadata.creation_timestamp or datetime.datetime.min.replace(tzinfo=timezone.utc))

Author

Good catch! I've applied your suggestion to use datetime.datetime.min.replace(tzinfo=timezone.utc) instead of the timezone-naive datetime.datetime.min.

Thanks for the review!
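A quick illustration of why the timezone-aware fallback matters, assuming (as the follow-up commit notes) that the Kubernetes client returns timezone-aware timestamps in UTC:

```python
import datetime

# Kubernetes creation_timestamps are timezone-aware (UTC).
aware_ts = datetime.datetime(2025, 11, 17, tzinfo=datetime.timezone.utc)

try:
    # Comparing an aware datetime against the naive datetime.min raises
    # TypeError, which would crash the sort when a timestamp is missing.
    aware_ts > datetime.datetime.min
except TypeError:
    print("mixing naive and aware datetimes raises TypeError")

# The timezone-aware fallback compares safely.
aware_min = datetime.datetime.min.replace(tzinfo=datetime.timezone.utc)
print(aware_ts > aware_min)
```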

Contributor

Great. Don't forget to import timezone too
from datetime import timezone

Author

Done! Added the timezone import. Thanks for catching that! 👍

@Fiona-Waters
Contributor

Fiona-Waters commented Nov 17, 2025

@HKanoje left one more comment but otherwise it looks good to me.
@andreyvelich @astefanutti @kramaranya please review when you can. Thanks.

Use datetime.datetime.min.replace(tzinfo=timezone.utc) instead of
datetime.datetime.min to prevent TypeError when comparing timezone-aware
and timezone-naive datetimes in Python 3.9+.

The Kubernetes API returns creation_timestamp as timezone-aware datetime
objects in UTC, so the fallback should also be timezone-aware for safe
comparison.

Signed-off-by: HKanoje <[email protected]>
Import timezone from datetime module to use timezone.utc directly
instead of datetime.timezone.utc for better readability.

Signed-off-by: HKanoje <[email protected]>
@astefanutti
Contributor

/lgtm

Thanks @HKanoje @Fiona-Waters!

/assign @kubeflow/kubeflow-sdk-team

@astefanutti
Contributor

/ok-to-test

@coveralls

coveralls commented Nov 27, 2025

Pull Request Test Coverage Report for Build 19878220233

Details

  • 58 of 61 (95.08%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.4%) to 67.024%

Changes missing coverage:
  kubeflow/trainer/backends/kubernetes/backend.py: 32 of 35 changed/added lines covered (91.43%)

Totals:
  Change from base Build 19828346095: +0.4%
  Covered Lines: 2561
  Relevant Lines: 3821

💛 - Coveralls

Contributor

@kramaranya kramaranya left a comment

Thank you @HKanoje!
I've left a few comments

Comment on lines 67 to 82
    Priority order:
    1. Running or Succeeded Pods (prefer most recent)
    2. Failed Pods (prefer most recent)
    3. Pending Pods (prefer most recent)
    4. Unknown Pods (prefer most recent)
    """
    if not pods:
        return None

    # Pod status priority (higher number = higher priority)
    status_priority = {
        constants.POD_RUNNING: 4,  # Highest priority
        constants.POD_SUCCEEDED: 3,  # Second highest
        constants.POD_FAILED: 2,  # Third priority
        constants.POD_PENDING: 1,  # Low priority
        constants.POD_UNKNOWN: 0,  # Lowest priority
Contributor

Do running and succeeded statuses have the same priority? The docstring doesn't match the actual priorities

Comment on lines 78 to 79
constants.POD_RUNNING: 4, # Highest priority
constants.POD_SUCCEEDED: 3, # Second highest
Contributor

Shall we consider those two to be equal priority? Since both running and succeeded are healthy pods, I think we should care about the most recent one. wdyt @HKanoje @andreyvelich @astefanutti

trainjob.runtime,
pod.metadata.labels[constants.JOBSET_RJOB_NAME_LABEL],
int(pod.metadata.labels[constants.JOB_INDEX_LABEL]),
int(pod.metadata.labels.get(constants.JOB_INDEX_LABEL, "0")),
Contributor

Don't those pods always have this label?


self.namespace = cfg.namespace

def _select_best_pod_for_role(
Contributor

Can you move this after public methods?

Author

@kramaranya Thank you for the thorough review! I've addressed all your comments:

Changes Made:

1. Docstring & Priority Design

  • Updated docstring to explicitly state: "Running or Succeeded Pods (equal priority, prefer most recent)"
  • Changed POD_SUCCEEDED priority from 3 to 4 (now equal to POD_RUNNING)
  • Added clarification: "Both Running and Succeeded are considered healthy states with equal priority"

2. JOB_INDEX_LABEL

  • Removed .get(constants.JOB_INDEX_LABEL, "0") in both locations
  • Now using direct access: pod.metadata.labels[constants.JOB_INDEX_LABEL]

3. Method Placement

  • Moved _select_best_pod_for_role after all public methods (after delete_job)
  • Now positioned before _read_pod_logs, following project convention

Testing:

  • make verify passes
  • All 36 Kubernetes backend tests pass
  • All 163 Python tests pass

HKanoje added a commit to HKanoje/sdk that referenced this pull request Dec 2, 2025
- Give Running and Succeeded pods equal priority (both are healthy states)
- Update docstring to clearly explain equal priority and timestamp tiebreaker
- Remove JOB_INDEX_LABEL .get() default, use direct access
- Move _select_best_pod_for_role method after public methods per convention

Addresses review comments from @kramaranya on PR kubeflow#160
@google-oss-prow

New changes are detected. LGTM label has been removed.

@google-oss-prow google-oss-prow bot removed the lgtm label Dec 2, 2025
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from astefanutti. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

- Give Running and Succeeded pods equal priority (both are healthy states)
- Update docstring to clearly explain equal priority and timestamp tiebreaker
- Remove JOB_INDEX_LABEL .get() default, use direct access
- Move _select_best_pod_for_role method after public methods per convention

Addresses review comments from @kramaranya on PR kubeflow#160

Signed-off-by: HKanoje <[email protected]>
@HKanoje HKanoje force-pushed the fix/filter-duplicate-pods-in-get-job branch from a22ed47 to 3ec8093 on December 3, 2025 00:35
@HKanoje
Author

HKanoje commented Jan 1, 2026

@astefanutti I have made the new changes. Please review, and then it can be tested.
