
Conversation


@HKanoje HKanoje commented Nov 17, 2025

Description

Problem:

The get_job() API currently returns multiple Pods for the same TrainJob component
(e.g., dataset-initializer, trainer-node-0) when Kubernetes recreates Pods under a
batch Job's restart policy.

This causes users to see duplicate components with conflicting statuses: for example,
one Pod may show "Failed" while another shows "Running", leading to confusion about
the actual state of the training job.

Solution:

This PR improves the get_job() API to filter duplicate Pods and display only the most recently created Pod for each TrainJob component.

Key Improvements

1. Groups Pods by role

  • Uses JOBSET_RJOB_NAME_LABEL for initializer Pods
  • Uses a combination of JOBSET_RJOB_NAME_LABEL + JOB_INDEX_LABEL for training-node Pods
    (ensures correct grouping across multi-node trainer replicas)

2. Selects the most recent Pod

  • For each group, the API now selects the Pod with the latest creation_timestamp
  • Eliminates stale or restarted Pods that would otherwise appear as duplicates
    (e.g., old Pods in Failed state)

3. Maintains backward compatibility

  • No changes to the API schema or response format
  • Behavior only differs when duplicate Pods exist, improving clarity for end users

This ensures users see clean, de-duplicated component statuses that accurately represent the current state of their training job.
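The grouping-and-selection approach described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the label values and the dedupe_pods helper are placeholders standing in for the SDK's internal constants and method.

```python
from collections import defaultdict

# Placeholder label keys standing in for the SDK's constants.
JOBSET_RJOB_NAME_LABEL = "jobset.sigs.k8s.io/replicatedjob-name"
JOB_INDEX_LABEL = "batch.kubernetes.io/job-completion-index"

def dedupe_pods(pods):
    """Group Pods by component role and keep only the newest Pod per group."""
    groups = defaultdict(list)
    for pod in pods:
        labels = pod.metadata.labels
        # Trainer-node Pods also carry a job index, so multi-node
        # replicas (node-0, node-1, ...) land in separate groups.
        key = (labels[JOBSET_RJOB_NAME_LABEL], labels.get(JOB_INDEX_LABEL))
        groups[key].append(pod)
    # A restarted Pod shares its predecessor's key; the newest one wins.
    return [
        max(group, key=lambda p: p.metadata.creation_timestamp)
        for group in groups.values()
    ]
```

With this shape, an old Failed Pod and its Running replacement collapse into a single entry, which is the de-duplication behavior the PR describes.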

Example Impact:

Before this fix:

job = client.get_job("my-job")
# Shows duplicate components with conflicting statuses
job.steps = [
    Step(name='dataset-initializer', status='Failed'),    # Old pod
    Step(name='dataset-initializer', status='Running'),   # New pod
    Step(name='node-0', status='Failed'),                 # Old pod
    Step(name='node-0', status='Running'),                # New pod
]

After this fix:

job = client.get_job("my-job")
# Shows only current components
job.steps = [
    Step(name='dataset-initializer', status='Running'),   # Latest only ✓
    Step(name='node-0', status='Running'),                # Latest only ✓
]

Changes Made

Modified Files


backend.py

  • Updated the __get_trainjob_from_cr() method to implement Pod de-duplication and filtering logic
  • Added comprehensive inline comments explaining the grouping and selection approach
  • Groups Pods by component role:
    • Initializers grouped by JOBSET_RJOB_NAME_LABEL
    • Training nodes grouped by JOBSET_RJOB_NAME_LABEL + JOB_INDEX_LABEL
  • For each group, selects the most recent Pod based on creation_timestamp

backend_test.py

  • Added a new test: test_get_job_with_pod_restarts()
  • Simulates Pod restart scenarios where Kubernetes creates duplicate Pods
  • Verifies that only the most recent Pod per component is returned
  • Covers mixed scenarios:
    • Some components with restarts
    • Some components without restarts
  • Ensures correct behavior and backward compatibility
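The restart scenario this test exercises can be sketched in a self-contained way. The latest_per_role helper and the label shapes below are illustrative stand-ins, not the actual backend or test code:

```python
import datetime
from types import SimpleNamespace

def latest_per_role(pods):
    """Keep only the most recently created Pod for each role."""
    best = {}
    for pod in pods:
        role = pod.metadata.labels["role"]
        current = best.get(role)
        if current is None or pod.metadata.creation_timestamp > current.metadata.creation_timestamp:
            best[role] = pod
    return best

def make_pod(role, phase, hour):
    return SimpleNamespace(
        metadata=SimpleNamespace(
            labels={"role": role},
            creation_timestamp=datetime.datetime(2025, 11, 17, hour, tzinfo=datetime.timezone.utc),
        ),
        status=SimpleNamespace(phase=phase),
    )

def test_restarted_pods_are_deduplicated():
    pods = [
        make_pod("dataset-initializer", "Failed", 10),   # old Pod, later restarted
        make_pod("dataset-initializer", "Running", 11),  # replacement Pod
        make_pod("node-0", "Running", 10),               # no restart
    ]
    best = latest_per_role(pods)
    assert len(best) == 2                                # one entry per role
    assert best["dataset-initializer"].status.phase == "Running"
    assert best["node-0"].status.phase == "Running"
```

This mirrors the mixed case the PR describes: one component with a restart, one without, and only the newest Pod surviving per role.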

Testing

All tests passing:

  • make verify: PASSED (lint + format checks)
  • test_get_job_with_pod_restarts: PASSED (new test for Pod restart filtering)
  • test_get_job: PASSED (existing behavior remains compatible)
  • All 36 Kubernetes backend tests: PASSED
  • All 163 Python unit tests: PASSED

Test Coverage

  • Pod restart scenarios with duplicate Pods having different creation_timestamp values
  • Mixed scenarios where:
    • Some components have restarts
    • Others have no duplicates
  • Verified that the API selects only the newest Pod per component
  • Confirmed that statuses come from the latest Pods, not older failed ones

Checklist

  • Follows Conventional Commits specification
  • Code follows project style guidelines (make verify passes)
  • All tests pass locally (make test-python)
  • Added comprehensive unit tests for new functionality
  • Updated documentation and inline comments where needed
  • No breaking changes to public APIs
  • Fully backward compatible with existing behavior

Related Issues

Fixes #25

When Kubernetes recreates Pods due to restart policies, multiple Pods
with the same role can exist simultaneously. This causes get_job() to
return duplicate TrainJob components with different statuses, creating
confusion for users.

This change groups Pods by their component role and selects only the
most recently created Pod for each component based on creation_timestamp.
This ensures users see the current state of their TrainJob after any
Pod restarts.

Changes:
- Group Pods by role identifier (initializer name or node+index)
- Select most recent Pod from each group using creation_timestamp
- Add comprehensive test for Pod restart scenarios

Fixes kubeflow#25

Signed-off-by: HKanoje <[email protected]>
@HKanoje HKanoje force-pushed the fix/filter-duplicate-pods-in-get-job branch from 978b209 to faf96a5 on November 17, 2025 03:36
pod_groups[key] = []
pod_groups[key].append(pod)

# Select the most recently created Pod from each group.
Contributor

I think to make it more robust we could select the pod based on the status as well as the timestamp, something like this: Fiona-Waters@b48277f
wdyt?
It would return a pod that actually reflects the true state of each TrainJob component, rather than just the newest pod.

Author

Absolutely agree! I've implemented exactly that approach from your commit b48277f. The current implementation now:

  • Prioritizes by status first: Running (4) > Succeeded (3) > Failed (2) > Pending (1) > Unknown (0)
  • Uses timestamp as a tiebreaker: among pods with the same status, selects the most recent one

This ensures we return a pod that reflects the true state of the TrainJob component (preferring Running/Succeeded pods over Failed ones), rather than blindly picking the newest pod regardless of its state.

For example, if we have:

  • Pod A: Failed (created at 11:00)
  • Pod B: Running (created at 10:00)

The old logic would return Pod A (newest), but the new logic correctly returns Pod B (Running status has higher priority).
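The selection rule described here can be sketched as a small standalone function. This is an illustrative sketch using plain phase strings; the real code is assumed to use the SDK's constants and live inside the backend:

```python
import datetime

# Status ranks as described above (higher = preferred).
STATUS_PRIORITY = {"Running": 4, "Succeeded": 3, "Failed": 2, "Pending": 1, "Unknown": 0}

def select_best_pod(pods):
    """Prefer the healthiest phase; break ties with the newest timestamp."""
    if not pods:
        return None
    # Timezone-aware fallback so Pods without a timestamp still sort safely.
    epoch = datetime.datetime.min.replace(tzinfo=datetime.timezone.utc)
    return max(
        pods,
        key=lambda p: (
            STATUS_PRIORITY.get(p.status.phase, 0),
            p.metadata.creation_timestamp or epoch,
        ),
    )
```

With the Pod A/B example above, select_best_pod would return Pod B: its Running phase outranks Failed even though Pod A is newer.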

…e safety

Apply code quality improvements based on review feedback:

- Use status-based priority for pod selection (Running > Succeeded > Failed > Pending > Unknown)
- Add datetime.min fallback for safer timestamp sorting (prevents TypeError)
- Add precise type hints to internal dicts for better type checking
- Use consistent .get() access for JOB_INDEX_LABEL with default fallback
- Add pod phase constants (POD_RUNNING, POD_FAILED, POD_PENDING, POD_UNKNOWN)

These changes improve robustness, type safety, and maintainability while
maintaining the same behavior of selecting the best pod for each role.

Signed-off-by: HKanoje <[email protected]>

# Sort by creation timestamp (most recent first)
candidate_pods.sort(
key=lambda p: p.metadata.creation_timestamp or datetime.datetime.min, reverse=True
Contributor

This could cause an issue in newer Python versions - do we want it to be timezone-naive or set to UTC?

Suggested change
key=lambda p: p.metadata.creation_timestamp or datetime.datetime.min, reverse=True
key=lambda p: (p.metadata.creation_timestamp or datetime.datetime.min.replace(tzinfo=timezone.utc))

Author

Good catch! I've applied your suggestion to use datetime.datetime.min.replace(tzinfo=timezone.utc) instead of the timezone-naive datetime.datetime.min.

Thanks for the review!
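A quick illustration of why the timezone-aware fallback matters, assuming (as the follow-up commit notes) that the Kubernetes client returns timezone-aware timestamps in UTC:

```python
import datetime

# Kubernetes creation_timestamps are timezone-aware (UTC).
aware_ts = datetime.datetime(2025, 11, 17, tzinfo=datetime.timezone.utc)

try:
    # Comparing an aware datetime against the naive datetime.min raises
    # TypeError, which would crash the sort when a timestamp is missing.
    aware_ts > datetime.datetime.min
except TypeError:
    print("mixing naive and aware datetimes raises TypeError")

# The timezone-aware fallback compares safely.
aware_min = datetime.datetime.min.replace(tzinfo=datetime.timezone.utc)
print(aware_ts > aware_min)
```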

Contributor

Great. Don't forget to import timezone too
from datetime import timezone

Author

Done! Added the timezone import. Thanks for catching that! 👍

@Fiona-Waters
Contributor

Fiona-Waters commented Nov 17, 2025

@HKanoje left one more comment but otherwise it looks good to me.
@andreyvelich @astefanutti @kramaranya please review when you can. Thanks.

Use datetime.datetime.min.replace(tzinfo=timezone.utc) instead of
datetime.datetime.min to prevent TypeError when comparing timezone-aware
and timezone-naive datetimes in Python 3.9+.

The Kubernetes API returns creation_timestamp as timezone-aware datetime
objects in UTC, so the fallback should also be timezone-aware for safe
comparison.

Signed-off-by: HKanoje <[email protected]>
Import timezone from datetime module to use timezone.utc directly
instead of datetime.timezone.utc for better readability.

Signed-off-by: HKanoje <[email protected]>
@astefanutti
Contributor

/lgtm

Thanks @HKanoje @Fiona-Waters!

/assign @kubeflow/kubeflow-sdk-team

@astefanutti
Contributor

/ok-to-test

@coveralls

coveralls commented Nov 27, 2025

Pull Request Test Coverage Report for Build 19878220233

Details

  • 58 of 61 (95.08%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.4%) to 67.024%

Changes missing coverage:
  kubeflow/trainer/backends/kubernetes/backend.py: 32 of 35 changed/added lines covered (91.43%)

Totals:
  Change from base Build 19828346095: +0.4%
  Covered Lines: 2561
  Relevant Lines: 3821

💛 - Coveralls

Contributor

@kramaranya kramaranya left a comment

Thank you @HKanoje!
I've left a few comments

Comment on lines 67 to 82
    Priority order:
    1. Running or Succeeded Pods (prefer most recent)
    2. Failed Pods (prefer most recent)
    3. Pending Pods (prefer most recent)
    4. Unknown Pods (prefer most recent)
    """
    if not pods:
        return None

    # Pod status priority (higher number = higher priority)
    status_priority = {
        constants.POD_RUNNING: 4,  # Highest priority
        constants.POD_SUCCEEDED: 3,  # Second highest
        constants.POD_FAILED: 2,  # Third priority
        constants.POD_PENDING: 1,  # Low priority
        constants.POD_UNKNOWN: 0,  # Lowest priority
Contributor

Do running and succeeded statuses have the same priority? The docstring doesn't match the actual priorities

Comment on lines 78 to 79
constants.POD_RUNNING: 4, # Highest priority
constants.POD_SUCCEEDED: 3, # Second highest
Contributor

Shall we consider those two to be equal priority? Since both running and succeeded are healthy pods, I think we should care about the most recent one. wdyt @HKanoje @andreyvelich @astefanutti

trainjob.runtime,
pod.metadata.labels[constants.JOBSET_RJOB_NAME_LABEL],
int(pod.metadata.labels[constants.JOB_INDEX_LABEL]),
int(pod.metadata.labels.get(constants.JOB_INDEX_LABEL, "0")),
Contributor

Don't those pods always have this label?


self.namespace = cfg.namespace

def _select_best_pod_for_role(
Contributor

Can you move this after public methods?

Author

@kramaranya Thank you for the thorough review! I've addressed all your comments:

Changes Made:

1. Docstring & Priority Design

  • Updated docstring to explicitly state: "Running or Succeeded Pods (equal priority, prefer most recent)"
  • Changed POD_SUCCEEDED priority from 3 to 4 (now equal to POD_RUNNING)
  • Added clarification: "Both Running and Succeeded are considered healthy states with equal priority"

2. JOB_INDEX_LABEL

  • Removed .get(constants.JOB_INDEX_LABEL, "0") in both locations
  • Now using direct access: pod.metadata.labels[constants.JOB_INDEX_LABEL]

3. Method Placement

  • Moved _select_best_pod_for_role after all public methods (after delete_job)
  • Now positioned before _read_pod_logs, following project convention

Testing:

  • make verify passes
  • All 36 Kubernetes backend tests pass
  • All 163 Python tests pass

HKanoje added a commit to HKanoje/sdk that referenced this pull request Dec 2, 2025
- Give Running and Succeeded pods equal priority (both are healthy states)
- Update docstring to clearly explain equal priority and timestamp tiebreaker
- Remove JOB_INDEX_LABEL .get() default, use direct access
- Move _select_best_pod_for_role method after public methods per convention

Addresses review comments from @kramaranya on PR kubeflow#160
@google-oss-prow

New changes are detected. LGTM label has been removed.

@google-oss-prow google-oss-prow bot removed the lgtm label Dec 2, 2025
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from astefanutti. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

- Give Running and Succeeded pods equal priority (both are healthy states)
- Update docstring to clearly explain equal priority and timestamp tiebreaker
- Remove JOB_INDEX_LABEL .get() default, use direct access
- Move _select_best_pod_for_role method after public methods per convention

Addresses review comments from @kramaranya on PR kubeflow#160

Signed-off-by: HKanoje <[email protected]>
@HKanoje HKanoje force-pushed the fix/filter-duplicate-pods-in-get-job branch from a22ed47 to 3ec8093 on December 3, 2025 00:35
@HKanoje
Author

HKanoje commented Jan 1, 2026

@astefanutti I have made the new changes. Please review, and then it can be tested.
