@NarayanaSabari

Fixes #153

Summary

Implements Container Backend for OptimizerClient to enable local hyperparameter optimization without Kubernetes. Uses Optuna TPE for adaptive sampling and Docker/Podman for trial execution.

Implementation

Core Files

kubeflow/optimizer/backends/container/backend.py (~500 lines)

  • Implements OptimizerBackend interface
  • Integrates Optuna TPE sampler with SQLite persistence
  • Uses TrainerClient Container Backend for trial execution
  • Parallel execution via ThreadPoolExecutor
  • Metric extraction from container logs via regex

kubeflow/optimizer/backends/container/storage.py (~200 lines)

  • Local JSON file persistence for job/trial metadata
  • File structure: {storage_path}/{job_name}/experiment.json + trials/*.json
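
The layout above can be pictured with a minimal sketch. This is not the code in storage.py: the helper names (`save_experiment`, `save_trial`, `load_trials`) and the JSON fields are hypothetical, and only the `{storage_path}/{job_name}/experiment.json` plus `trials/*.json` structure comes from this PR.

```python
import json
from pathlib import Path


def save_experiment(storage_path: Path, job_name: str, metadata: dict) -> Path:
    """Write job-level metadata to {storage_path}/{job_name}/experiment.json."""
    job_dir = storage_path / job_name
    # Create the job directory and its trials/ subdirectory in one step.
    (job_dir / "trials").mkdir(parents=True, exist_ok=True)
    path = job_dir / "experiment.json"
    path.write_text(json.dumps(metadata, indent=2))
    return path


def save_trial(storage_path: Path, job_name: str, trial_name: str, record: dict) -> Path:
    """Write one trial record to {storage_path}/{job_name}/trials/{trial_name}.json."""
    path = storage_path / job_name / "trials" / f"{trial_name}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record, indent=2))
    return path


def load_trials(storage_path: Path, job_name: str) -> list[dict]:
    """Read every trial record for a job, ordered by file name."""
    trials_dir = storage_path / job_name / "trials"
    return [json.loads(p.read_text()) for p in sorted(trials_dir.glob("*.json"))]
```

Keeping one file per trial means a single trial can be inspected or deleted by hand, which is the "debuggable" property the JSON approach is aiming for.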

kubeflow/optimizer/backends/container/types.py (~100 lines)

  • ContainerBackendConfig with validation
  • Parameters: storage_path, max_parallel_trials, pull_policy, container_runtime
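
As a rough illustration of what "config with validation" could look like, here is a standalone sketch. It is not the `ContainerBackendConfig` from types.py: the class name, the defaults, and the allowed values for `pull_policy` and `container_runtime` are assumptions; only the four parameter names come from this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchContainerBackendConfig:
    """Hypothetical stand-in for the real config; defaults are assumed."""

    storage_path: str = "./optimizer_jobs"   # assumed default location
    max_parallel_trials: int = 1             # assumed default
    pull_policy: str = "IfNotPresent"        # assumed set of allowed values below
    container_runtime: Optional[str] = None  # None is assumed to mean auto-detect

    def __post_init__(self) -> None:
        # Reject values that could never produce a working job.
        if self.max_parallel_trials < 1:
            raise ValueError("max_parallel_trials must be >= 1")
        if self.pull_policy not in {"Always", "IfNotPresent", "Never"}:
            raise ValueError(f"unsupported pull_policy: {self.pull_policy!r}")
        if self.container_runtime not in {None, "docker", "podman"}:
            raise ValueError(f"unsupported container_runtime: {self.container_runtime!r}")
```

Validating in `__post_init__` surfaces a bad value when the config is constructed rather than when the first trial container is launched.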

API Changes

kubeflow/optimizer/__init__.py

  • Added export: ContainerBackendConfig
  • No breaking changes to existing APIs

Examples & Documentation

  • examples/optimizer/simple-local-example.py - Working example (3 trials, tested)
  • examples/optimizer/LOCAL_OPTIMIZATION_GUIDE.md - Technical guide
  • kubeflow/optimizer/backends/container/README.md - Implementation details

Usage

from kubeflow.optimizer import OptimizerClient, ContainerBackendConfig, Search, TrialConfig, Objective

# Select the container backend and cap the number of trials that run at once.
client = OptimizerClient(backend_config=ContainerBackendConfig(
    max_parallel_trials=3
))

# `template` is the trial template; it must be defined before this call (not shown here).
job_name = client.optimize(
    trial_template=template,
    search_space={"learning_rate": Search.loguniform(0.001, 0.1)},
    objectives=[Objective(metric="accuracy", direction="maximize")],
    trial_config=TrialConfig(num_trials=10, parallel_trials=3),
)

# Retrieve the best results for the job.
results = client.get_best_results(job_name)

Key Design Decisions

  1. Optuna Integration - Mature TPE algorithm, SQLite persistence, resume support
  2. TrainerBackend Reuse - Leverages existing container orchestration for trials
  3. JSON Storage - Simple, debuggable state persistence (vs K8s CRDs)
  4. ThreadPoolExecutor - Simple parallelism without distributed coordination
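
To make decision 4 concrete, here is a minimal sketch of bounded parallelism with `ThreadPoolExecutor`. It is not the code in backend.py: `run_trial` is a hypothetical placeholder that returns a made-up score instead of launching a container, and the parameter sets are fixed up front, whereas the real backend samples them through Optuna.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_trial(params: dict) -> float:
    """Placeholder for one trial; a real backend would start a container here."""
    lr = params["learning_rate"]
    # Made-up objective that peaks at learning_rate == 0.01.
    return 1.0 - abs(lr - 0.01)


def run_trials(param_sets: list[dict], max_parallel_trials: int) -> dict[int, float]:
    """Run all trials with at most max_parallel_trials in flight at once."""
    results: dict[int, float] = {}
    with ThreadPoolExecutor(max_workers=max_parallel_trials) as pool:
        # Map each future back to the index of the trial it belongs to.
        futures = {pool.submit(run_trial, p): i for i, p in enumerate(param_sets)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

Threads are a reasonable fit when each worker mostly waits on a container to finish, since that wait is I/O-bound and needs no cross-process coordination.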

Testing

  • ✅ End-to-end tested with 3 and 10 trial runs
  • ✅ Sequential and parallel execution validated
  • ✅ Metric extraction working (metric_name: value pattern)
  • ✅ Job lifecycle complete: create → optimize → results → cleanup
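
For reference, the `metric_name: value` pattern can be matched with a short regex. This is a sketch only, not the extraction code in backend.py: the function name `extract_metric`, the exact number format, and the choice to keep the last reported value are assumptions.

```python
import re
from typing import Optional


def extract_metric(logs: str, metric_name: str) -> Optional[float]:
    """Return the last `metric_name: value` reading found in the logs, or None."""
    pattern = re.compile(
        # \b keeps "accuracy" from matching inside "val_accuracy".
        rf"\b{re.escape(metric_name)}\s*:\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)"
    )
    matches = pattern.findall(logs)
    # Assumption: when a metric is printed several times, the final value wins.
    return float(matches[-1]) if matches else None
```

A trial whose logs never print the metric yields `None`, which a caller can treat as a failed or incomplete trial.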

Dependencies

  • optuna>=3.0.0 (new, for hyperparameter optimization)
  • docker>=6.0.0 (existing, via [docker] extras)

Breaking Changes

None. New backend addition only.

Files to Remove After Review

  • Added some README files to aid review; they can all be deleted once the review is done.

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign astefanutti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@NarayanaSabari force-pushed the optimizer-container branch 2 times, most recently from dce137b to def51a3 on November 9, 2025 07:03
Signed-off-by: narayanasabari <[email protected]>
Signed-off-by: narayanasabari <[email protected]>
@astefanutti
Contributor

/ok-to-test

@astefanutti
Contributor

/assign @kubeflow/kubeflow-sdk-team

@NarayanaSabari
Author

@astefanutti Thanks for running the unit tests. I will fix all the unit test errors by the end of next week.

Signed-off-by: narayanasabari <[email protected]>
Signed-off-by: narayanasabari <[email protected]>
Signed-off-by: narayanasabari <[email protected]>
@coveralls

Pull Request Test Coverage Report for Build 19736393425

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 2 of 611 (0.33%) changed or added relevant lines in 7 files are covered.
  • 60 unchanged lines in 3 files lost coverage.
  • Overall coverage decreased (-9.3%) to 57.332%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| kubeflow/optimizer/__init__.py | 0 | 1 | 0.0% |
| kubeflow/optimizer/backends/container/__init__.py | 0 | 3 | 0.0% |
| kubeflow/optimizer/api/optimizer_client.py | 0 | 5 | 0.0% |
| kubeflow/optimizer/backends/container/types.py | 0 | 31 | 0.0% |
| kubeflow/optimizer/backends/container/storage.py | 0 | 143 | 0.0% |
| kubeflow/optimizer/backends/container/backend.py | 0 | 426 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| kubeflow/trainer/backends/kubernetes/backend_test.py | 6 | 96.62% |
| kubeflow/trainer/backends/kubernetes/utils.py | 21 | 75.23% |
| kubeflow/trainer/backends/kubernetes/backend.py | 33 | 79.92% |

| Totals | |
|---|---|
| Change from base Build 19171217698 | -9.3% |
| Covered Lines | 2506 |
| Relevant Lines | 4371 |

💛 - Coveralls

Development

Successfully merging this pull request may close these issues.

Support Local Execution for Optimizer

3 participants