Refactor codebase #56

parashardhapola · 2026-01-10T18:11:28Z

PR Type

Enhancement, Tests

Description

Reorganized codebase into modular subpackages (api, core, preprocessing)
Refactored main.py to use new modular imports and simplified logic
Added comprehensive integration tests with fixtures and mocking
Improved error handling with specific exception types and better messages
Enhanced API client with transport layer, progress display, and result fetching
Updated dependencies and Python version requirement to 3.12

Diagram Walkthrough

flowchart LR
  main["main.py<br/>CyteType class"]
  api["api/<br/>Client & Transport"]
  core["core/<br/>Payload & Results"]
  preproc["preprocessing/<br/>Validation & Extraction"]
  
  main -->|imports| api
  main -->|imports| core
  main -->|imports| preproc
  
  api -->|HTTP requests| server["API Server"]
  core -->|stores/loads| adata["AnnData Object"]
  preproc -->|validates| adata

File Walkthrough

Relevant files

Enhancement

12 files

main.py `Refactored to use modular subpackage imports`	+117/-296
__init__.py `New API client module with public exports`	+43/-0
client.py `Core API client functions for job submission and polling`	+232/-0
transport.py `HTTP transport layer with error handling`	+97/-0
progress.py `Terminal progress display during job polling`	+118/-0
schemas.py `Added error response and job status schemas`	+32/-1
payload.py `Payload building and validation logic`	+88/-0
results.py `Result storage and retrieval functions`	+301/-0
__init__.py `Preprocessing module public API exports`	+11/-0
validation.py `AnnData validation with gene symbol checking`	+183/-0
extraction.py `Marker gene and coordinate extraction functions`	+160/-0
aggregation.py `Expression percentage and metadata aggregation`	+108/-0

Error handling

1 files

exceptions.py `Comprehensive exception hierarchy for API errors`	+101/-0

Documentation

1 files

__init__.py `Core module initialization with documentation`	+6/-0

Tests

2 files

conftest.py `Pytest fixtures for mock AnnData and API responses`	+103/-0
test_cytetype_integration.py `Comprehensive integration tests for CyteType class`	+419/-0

Configuration changes

1 files

__init__.py `Updated version to 0.13.0`	+1/-1

Dependencies

1 files

pyproject.toml `Updated dependencies and Python version to 3.12`	+15/-12

Additional files

13 files

anndata_helpers.py	+0/-413
api.py	+0/-281
client.py	+0/-373
config.py	+0/-6
display.py	+0/-104
exceptions.py	+0/-25
test_anndata_helpers.py	+0/-492
test_anndata_helpers_missing.py	+0/-357
test_client.py	+0/-307
test_client_missing.py	+0/-247
test_config.py	+0/-92
test_exceptions.py	+0/-77
test_main.py	+0/-545

Reorganized codebase into modular subpackages: moved API logic to cytetype/api/, core logic to cytetype/core/, and preprocessing to cytetype/preprocessing/. Improved error handling with specific exceptions, enhanced server communication, and updated imports. Removed legacy modules and tests, and updated version to 0.13.0.

qodo-code-review · 2026-01-10T18:12:07Z

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
⚪	SSRF via base_url Description: The new HTTP client constructs request URLs by directly concatenating user-controlled `base_url` and path components (including `job_id` via callers), which can enable SSRF and path/URL manipulation (e.g., a crafted `base_url` targeting internal services or a `job_id` containing unexpected path segments) if these values can be influenced by untrusted input. transport.py [11-93] Referred Code def __init__(self, base_url: str, auth_token: str \| None = None): self.base_url = base_url.rstrip("/") self.auth_token = auth_token self.session = requests.Session() def _build_headers(self, content_type: bool = False) -> dict[str, str]: """Build request headers with auth token.""" headers = {} if self.auth_token: headers["Authorization"] = f"Bearer {self.auth_token}" if content_type: headers["Content-Type"] = "application/json" return headers def _parse_error(self, response: requests.Response) -> None: """Parse server error response and raise appropriate exception.""" try: error_data = response.json() # Check if error is nested under "detail" key if "detail" in error_data and isinstance(error_data["detail"], dict): ... (clipped 62 lines)
Ticket Compliance
⚪	🎫 No ticket provided Create ticket/issue
Codebase Duplication Compliance
⚪	Codebase context is not defined Follow the guide to enable codebase context checks.
Custom Compliance
🟢	Generic: Meaningful Naming and Self-Documenting Code Objective: Ensure all identifiers clearly express their purpose and intent, making code self-documenting Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
🔴	Generic: Robust Error Handling and Edge Case Management Objective: Ensure comprehensive error handling that provides meaningful context and graceful degradation Status: Unbound variable: In `wait_for_completion`, the generic exception handler references `cluster_status` even when it may be unset (e.g., if `get_job_status` raises before assignment), causing an `UnboundLocalError` instead of graceful retry behavior. Referred Code while (time.time() - start_time) < timeout: try: status_data = get_job_status(base_url, auth_token, job_id) job_status = status_data.get("jobStatus") cluster_status = status_data.get("clusterStatus", {}) # Reset 404 counter on valid response if job_status != "not_found": consecutive_not_found = 0 if job_status == "completed": if progress: progress.finalize(cluster_status) logger.info(f"Job {job_id} completed successfully.") return fetch_job_results(base_url, auth_token, job_id) elif job_status == "failed": if progress: progress.finalize(cluster_status) raise JobFailedError(f"Job {job_id} failed") ... (clipped 38 lines) Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Secure Error Handling Objective: To prevent the leakage of sensitive system information through error messages while providing sufficient detail for internal debugging. Status: Exposes response text: Error handling falls back to raising `NetworkError` with `response.text` included, which can expose internal server details or sensitive data in user-visible exception messages. Referred Code def _parse_error(self, response: requests.Response) -> None: """Parse server error response and raise appropriate exception.""" try: error_data = response.json() # Check if error is nested under "detail" key if "detail" in error_data and isinstance(error_data["detail"], dict): error_data = error_data["detail"] error = ErrorResponse(**error_data) raise create_api_exception(error.error_code, error.message) except ValueError: # Not JSON or missing fields - fallback to HTTP status raise NetworkError( f"HTTP {response.status_code}: {response.text}", error_code="HTTP_ERROR", ) Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Secure Logging Practices Objective: To ensure logs are useful for debugging and auditing without exposing sensitive information like PII, PHI, or cardholder data. Status: Unfiltered error content: The fallback error path logs/raises messages containing raw `response.text` (potentially including sensitive payloads or identifiers), which can end up in logs or surfaced errors without redaction. Referred Code def _parse_error(self, response: requests.Response) -> None: """Parse server error response and raise appropriate exception.""" try: error_data = response.json() # Check if error is nested under "detail" key if "detail" in error_data and isinstance(error_data["detail"], dict): error_data = error_data["detail"] error = ErrorResponse(**error_data) raise create_api_exception(error.error_code, error.message) except ValueError: # Not JSON or missing fields - fallback to HTTP status raise NetworkError( f"HTTP {response.status_code}: {response.text}", error_code="HTTP_ERROR", ) Learn more about managing compliance generic rules or creating your own custom rules
⚪	Generic: Comprehensive Audit Trails Objective: To create a detailed and reliable record of critical system actions for security analysis and compliance. Status: Missing user context: The new API job submission and result fetching flow logs operational events but does not include any user identity context (e.g., user ID) needed to form an audit trail for critical actions. Referred Code logger.info("Calculating expression percentages...") self.expression_percentages = aggregate_expression_percentages( adata=adata, clusters=self.clusters, batch_size=pcent_batch_size, gene_names=adata.var[self.gene_symbols_column].tolist(), ) logger.info("Extracting marker genes...") self.marker_genes = extract_marker_genes( adata=self.adata, cell_group_key=self.group_key, rank_genes_key=self.rank_key, cluster_map=self.cluster_map, n_top_genes=n_top_genes, gene_symbols_col=self.gene_symbols_column, ) if aggregate_metadata: logger.info("Aggregating cluster metadata...") self.group_metadata = aggregate_cluster_metadata( ... (clipped 157 lines) Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Security-First Input Validation and Data Handling Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent vulnerabilities Status: Sensitive data written: The new default `save_query` behavior writes the full `input_data` payload to disk via `save_query_to_file`, which may include sensitive study metadata and should be reviewed for secure storage expectations (permissions, encryption, and user consent). Referred Code def save_query_to_file(input_data: dict[str, Any], filename: str) -> None: """Save query to JSON file. Args: input_data: The input data dictionary to save filename: Path to save the query file """ with open(filename, "w") as f: json.dump(input_data, f) Learn more about managing compliance generic rules or creating your own custom rules
Update

Compliance status legend

🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

qodo-code-review · 2026-01-10T18:14:04Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
Possible issue	Initialize variable to prevent NameError In `wait_for_completion`, initialize `cluster_status` before the `try` block to prevent a `NameError` in the exception handler if the initial API call fails. cytetype/api/client.py [143-232] def wait_for_completion( base_url: str, auth_token: str \| None, job_id: str, poll_interval: int, timeout: int, show_progress: bool = True, ) -> dict[str, Any]: """Poll job until completion and return results.""" progress = ProgressDisplay() if show_progress else None start_time = time.time() logger.info(f"CyteType job (id: {job_id}) submitted. Polling for results...") # Initial delay time.sleep(5) # Show report URL report_url = f"{base_url}/report/{job_id}" logger.info(f"Report (updates automatically) available at: {report_url}") logger.info( "If network disconnects, the results can still be fetched:\n" "`results = annotator.get_results()`" ) consecutive_not_found = 0 + cluster_status: dict[str, str] = {} while (time.time() - start_time) < timeout: try: status_data = get_job_status(base_url, auth_token, job_id) job_status = status_data.get("jobStatus") cluster_status = status_data.get("clusterStatus", {}) ... except Exception as e: # Network errors - log and retry logger.debug(f"Error during polling: {e}. Retrying...") retry_interval = min(poll_interval, 5) _sleep_with_spinner(retry_interval, progress, cluster_status) # Timeout reached if progress: progress.finalize({}) raise TimeoutError(f"Job {job_id} did not complete within {timeout}s") `[To ensure code accuracy, apply this suggestion manually]` Suggestion importance[1-10]: 8 __ Why: This suggestion correctly identifies a potential `UnboundLocalError` that would crash the polling loop during a network error, and the proposed fix of initializing the variable before the loop is the correct solution.	Medium
Possible issue	Use local variables for method-specific overrides In the `run` method, use local variables for `api_url` and `auth_token` overrides instead of permanently modifying the instance attributes `self.api_url` and `self.auth_token`. cytetype/main.py [255-258] -if api_url: - self.api_url = api_url.strip("/") -if auth_token: - self.auth_token = auth_token +run_api_url = (api_url or self.api_url).strip("/") +run_auth_token = auth_token or self.auth_token Apply / Chat Suggestion importance[1-10]: 6 __ Why: The suggestion correctly identifies that the `run` method undesirably modifies instance state, which is a design flaw. Using local variables for the overrides improves the class's predictability and prevents side effects.	Low
High-level	Consider a more explicit client-server architecture Create a dedicated `ApiClient` class to handle all API communication, separating it from the data processing logic in the `CyteType` class. The `CyteType` class would then use an instance of this new client. Examples: cytetype/main.py [178-302] def run( self, study_context: str, llm_configs: list[dict[str, Any]] \| None = None, metadata: dict[str, Any] \| None = None, n_parallel_clusters: int = 2, results_prefix: str = "cytetype", poll_interval_seconds: int = 10, timeout_seconds: int = 7200, api_url: str \| None = None, ... (clipped 115 lines) Solution Walkthrough: Before: # In cytetype/main.py from .api import submit_annotation_job, wait_for_completion class CyteType: def __init__(self, ..., api_url, auth_token): self.api_url = api_url self.auth_token = auth_token # ... data preprocessing ... def run(self, ...): # ... if api_url: self.api_url = api_url if auth_token: self.auth_token = auth_token payload = build_annotation_payload(...) job_id = submit_annotation_job(self.api_url, self.auth_token, payload) result = wait_for_completion(self.api_url, self.auth_token, job_id, ...) # ... After: # In a new cytetype/api/client.py class ApiClient: def __init__(self, base_url, auth_token): self.transport = HTTPTransport(base_url, auth_token) def submit_job(self, payload): ... def wait_for_completion(self, job_id, ...): ... # In cytetype/main.py from .api import ApiClient class CyteType: def __init__(self, ..., api_url, auth_token): self.api_client = ApiClient(api_url, auth_token) # ... data preprocessing ... def run(self, ...): # ... if api_url or auth_token: # Handle overrides self.api_client = ApiClient(api_url or self.api_client.base_url, ...) payload = build_annotation_payload(...) job_id = self.api_client.submit_job(payload) result = self.api_client.wait_for_completion(job_id, ...) # ... Suggestion importance[1-10]: 7 __ Why: This is a valid design suggestion that would further improve the separation of concerns, which is the main goal of this PR, by encapsulating all API interactions within a dedicated client class.	Medium
General	Use stored API URL for fetching In `get_results`, when fetching remote results, use the `api_url` stored in the job's details instead of the current instance's `self.api_url`. cytetype/main.py [341-349] +api_url = self.adata.uns[job_details_key].get("api_url", self.api_url) return fetch_remote_results( self.adata, job_id, - self.api_url, + api_url, self.auth_token, results_prefix, self.group_key, self.clusters, ) Apply / Chat Suggestion importance[1-10]: 7 __ Why: This is a significant correctness improvement. It ensures that when fetching results for a past job, the client contacts the same API endpoint that the job was originally submitted to, which is crucial if multiple API endpoints are used.	Medium
	Save full payload to file In the `run` method, save the entire `payload` dictionary to the query file, not just the `input_data` part, to ensure `llm_configs` are also included. cytetype/main.py [274-275] if save_query: - save_query_to_file(payload["input_data"], query_filename) + save_query_to_file(payload, query_filename) Apply / Chat Suggestion importance[1-10]: 5 __ Why: The suggestion correctly points out that only a part of the API payload is being saved. Saving the full payload, including `llm_configs`, makes the saved query file more complete and useful for debugging or re-running jobs.	Low
	Avoid shadowing parameter with local function In `_add_annotation_column`, avoid shadowing the `field_getter` parameter by assigning the locally defined function to a new variable, such as `getter`. cytetype/core/results.py [247-280] def _add_annotation_column( adata: anndata.AnnData, result_data: dict[str, Any], field_key: str, column_name: str, clusters: list[str], default_value: str, field_getter: Callable[[dict[str, Any]], Any] \| None = None, ) -> None: """Add a single annotation column to adata.obs (DRY helper). ... """ - if field_getter is None: + getter = field_getter + if getter is None: - def field_getter(item: dict[str, Any]) -> Any: + def default_getter(item: dict[str, Any]) -> Any: return item.get(field_key, default_value) + getter = default_getter + field_map = { - item["clusterId"]: field_getter(item) + item["clusterId"]: getter(item) for item in result_data.get("annotations", []) } adata.obs[column_name] = pd.Series( [field_map.get(cluster_id, default_value) for cluster_id in clusters], index=adata.obs.index, ).astype("category") `[To ensure code accuracy, apply this suggestion manually]` Suggestion importance[1-10]: 4 __ Why: The suggestion correctly points out that shadowing the `field_getter` parameter with a local function definition is confusing and poor practice. The proposed change improves code clarity and maintainability by avoiding this shadow.	Low
	Add file encoding to JSON write In `save_query_to_file`, specify `encoding="utf-8"` when opening the file for writing to ensure cross-platform consistency. cytetype/core/payload.py [87-88] -with open(filename, "w") as f: +with open(filename, "w", encoding="utf-8") as f: json.dump(input_data, f) Apply / Chat Suggestion importance[1-10]: 4 __ Why: This is a good practice suggestion that improves the robustness of file I/O by explicitly setting the encoding to `utf-8`, preventing potential issues on different operating systems with varying default encodings.	Low
	Remove null `llm_configs` In `build_annotation_payload`, conditionally add the `llm_configs` key to the payload only if it has a value, instead of sending it as `null` when empty. cytetype/core/payload.py [74-77] -return { - "input_data": input_data, - "llm_configs": validated_llm_configs if validated_llm_configs else None, -} +payload = {"input_data": input_data} +if validated_llm_configs: + payload["llm_configs"] = validated_llm_configs +return payload Apply / Chat Suggestion importance[1-10]: 3 __ Why: This is a minor improvement for creating a cleaner API payload by omitting an optional key when it's empty, which can be considered better practice than sending `null`.	Low
Update

Updated pyproject.toml to exclude build, dist, and egg-info directories from mypy checks. Added explicit return type annotations to test fixtures in tests/conftest.py for improved type safety.

parashardhapola added 2 commits January 10, 2026 19:09

Update .gitignore

db8cb26

qodo-code-review bot added the Review effort 4/5 label Jan 10, 2026

Add mypy exclude patterns and type annotations in tests

ccdd497

Updated pyproject.toml to exclude build, dist, and egg-info directories from mypy checks. Added explicit return type annotations to test fixtures in tests/conftest.py for improved type safety.

parashardhapola merged commit 5992425 into master Jan 10, 2026
1 check passed

parashardhapola deleted the refactor_codebase branch January 10, 2026 18:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Refactor codebase #56

Refactor codebase #56

Uh oh!

parashardhapola commented Jan 10, 2026 •

edited by qodo-code-review bot

Loading

Uh oh!

qodo-code-review bot commented Jan 10, 2026 •

edited

Loading

Uh oh!

qodo-code-review bot commented Jan 10, 2026 •

edited

Loading

Examples:

Solution Walkthrough:

Before:

After:

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Refactor codebase #56

Refactor codebase #56

Uh oh!

Conversation

parashardhapola commented Jan 10, 2026 • edited by qodo-code-review bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Type

Description

Diagram Walkthrough

File Walkthrough

Uh oh!

qodo-code-review bot commented Jan 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Compliance Guide 🔍

Uh oh!

qodo-code-review bot commented Jan 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Code Suggestions ✨

Examples:

Solution Walkthrough:

Before:

After:

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

parashardhapola commented Jan 10, 2026 •

edited by qodo-code-review bot

Loading

qodo-code-review bot commented Jan 10, 2026 •

edited

Loading

qodo-code-review bot commented Jan 10, 2026 •

edited

Loading