⚠ This page is served via a proxy. Original site: https://github.com
This service does not collect credentials or authentication data.
Skip to content

Conversation

@parashardhapola
Copy link
Member

@parashardhapola parashardhapola commented Jan 10, 2026

PR Type

Enhancement, Tests


Description

  • Reorganized codebase into modular subpackages (api, core, preprocessing)

  • Refactored main.py to use new modular imports and simplified logic

  • Added comprehensive integration tests with fixtures and mocking

  • Improved error handling with specific exception types and better messages

  • Enhanced API client with transport layer, progress display, and result fetching

  • Updated dependencies and Python version requirement to 3.12


Diagram Walkthrough

flowchart LR
  main["main.py<br/>CyteType class"]
  api["api/<br/>Client & Transport"]
  core["core/<br/>Payload & Results"]
  preproc["preprocessing/<br/>Validation & Extraction"]
  
  main -->|imports| api
  main -->|imports| core
  main -->|imports| preproc
  
  api -->|HTTP requests| server["API Server"]
  core -->|stores/loads| adata["AnnData Object"]
  preproc -->|validates| adata
Loading

File Walkthrough

Relevant files
Enhancement
12 files
main.py
Refactored to use modular subpackage imports                         
+117/-296
__init__.py
New API client module with public exports                               
+43/-0   
client.py
Core API client functions for job submission and polling 
+232/-0 
transport.py
HTTP transport layer with error handling                                 
+97/-0   
progress.py
Terminal progress display during job polling                         
+118/-0 
schemas.py
Added error response and job status schemas                           
+32/-1   
payload.py
Payload building and validation logic                                       
+88/-0   
results.py
Result storage and retrieval functions                                     
+301/-0 
__init__.py
Preprocessing module public API exports                                   
+11/-0   
validation.py
AnnData validation with gene symbol checking                         
+183/-0 
extraction.py
Marker gene and coordinate extraction functions                   
+160/-0 
aggregation.py
Expression percentage and metadata aggregation                     
+108/-0 
Error handling
1 files
exceptions.py
Comprehensive exception hierarchy for API errors                 
+101/-0 
Documentation
1 files
__init__.py
Core module initialization with documentation                       
+6/-0     
Tests
2 files
conftest.py
Pytest fixtures for mock AnnData and API responses             
+103/-0 
test_cytetype_integration.py
Comprehensive integration tests for CyteType class             
+419/-0 
Configuration changes
1 files
__init__.py
Updated version to 0.13.0                                                               
+1/-1     
Dependencies
1 files
pyproject.toml
Updated dependencies and Python version to 3.12                   
+15/-12 
Additional files
13 files
anndata_helpers.py +0/-413 
api.py +0/-281 
client.py +0/-373 
config.py +0/-6     
display.py +0/-104 
exceptions.py +0/-25   
test_anndata_helpers.py +0/-492 
test_anndata_helpers_missing.py +0/-357 
test_client.py +0/-307 
test_client_missing.py +0/-247 
test_config.py +0/-92   
test_exceptions.py +0/-77   
test_main.py +0/-545 

Reorganized codebase into modular subpackages: moved API logic to cytetype/api/, core logic to cytetype/core/, and preprocessing to cytetype/preprocessing/. Improved error handling with specific exceptions, enhanced server communication, and updated imports. Removed legacy modules and tests, and updated version to 0.13.0.
@qodo-code-review
Copy link

qodo-code-review bot commented Jan 10, 2026

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
SSRF via base_url

Description: The new HTTP client constructs request URLs by directly concatenating user-controlled
base_url and path components (including job_id via callers), which can enable SSRF and
path/URL manipulation (e.g., a crafted base_url targeting internal services or a job_id
containing unexpected path segments) if these values can be influenced by untrusted input.

transport.py [11-93]

Referred Code
def __init__(self, base_url: str, auth_token: str | None = None):
    self.base_url = base_url.rstrip("/")
    self.auth_token = auth_token
    self.session = requests.Session()

def _build_headers(self, content_type: bool = False) -> dict[str, str]:
    """Build request headers with auth token."""
    headers = {}
    if self.auth_token:
        headers["Authorization"] = f"Bearer {self.auth_token}"
    if content_type:
        headers["Content-Type"] = "application/json"
    return headers

def _parse_error(self, response: requests.Response) -> None:
    """Parse server error response and raise appropriate exception."""
    try:
        error_data = response.json()

        # Check if error is nested under "detail" key
        if "detail" in error_data and isinstance(error_data["detail"], dict):


 ... (clipped 62 lines)
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

🔴
Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Unbound variable: In wait_for_completion, the generic exception handler references cluster_status even when
it may be unset (e.g., if get_job_status raises before assignment), causing an
UnboundLocalError instead of graceful retry behavior.

Referred Code
while (time.time() - start_time) < timeout:
    try:
        status_data = get_job_status(base_url, auth_token, job_id)
        job_status = status_data.get("jobStatus")
        cluster_status = status_data.get("clusterStatus", {})

        # Reset 404 counter on valid response
        if job_status != "not_found":
            consecutive_not_found = 0

        if job_status == "completed":
            if progress:
                progress.finalize(cluster_status)
            logger.info(f"Job {job_id} completed successfully.")
            return fetch_job_results(base_url, auth_token, job_id)

        elif job_status == "failed":
            if progress:
                progress.finalize(cluster_status)
            raise JobFailedError(f"Job {job_id} failed")



 ... (clipped 38 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status:
Exposes response text: Error handling falls back to raising NetworkError with response.text included, which can
expose internal server details or sensitive data in user-visible exception messages.

Referred Code
def _parse_error(self, response: requests.Response) -> None:
    """Parse server error response and raise appropriate exception."""
    try:
        error_data = response.json()

        # Check if error is nested under "detail" key
        if "detail" in error_data and isinstance(error_data["detail"], dict):
            error_data = error_data["detail"]

        error = ErrorResponse(**error_data)
        raise create_api_exception(error.error_code, error.message)
    except ValueError:
        # Not JSON or missing fields - fallback to HTTP status
        raise NetworkError(
            f"HTTP {response.status_code}: {response.text}",
            error_code="HTTP_ERROR",
        )

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Unfiltered error content: The fallback error path logs/raises messages containing raw response.text (potentially
including sensitive payloads or identifiers), which can end up in logs or surfaced errors
without redaction.

Referred Code
def _parse_error(self, response: requests.Response) -> None:
    """Parse server error response and raise appropriate exception."""
    try:
        error_data = response.json()

        # Check if error is nested under "detail" key
        if "detail" in error_data and isinstance(error_data["detail"], dict):
            error_data = error_data["detail"]

        error = ErrorResponse(**error_data)
        raise create_api_exception(error.error_code, error.message)
    except ValueError:
        # Not JSON or missing fields - fallback to HTTP status
        raise NetworkError(
            f"HTTP {response.status_code}: {response.text}",
            error_code="HTTP_ERROR",
        )

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Missing user context: The new API job submission and result fetching flow logs operational events but does not
include any user identity context (e.g., user ID) needed to form an audit trail for
critical actions.

Referred Code
logger.info("Calculating expression percentages...")
self.expression_percentages = aggregate_expression_percentages(
    adata=adata,
    clusters=self.clusters,
    batch_size=pcent_batch_size,
    gene_names=adata.var[self.gene_symbols_column].tolist(),
)

logger.info("Extracting marker genes...")
self.marker_genes = extract_marker_genes(
    adata=self.adata,
    cell_group_key=self.group_key,
    rank_genes_key=self.rank_key,
    cluster_map=self.cluster_map,
    n_top_genes=n_top_genes,
    gene_symbols_col=self.gene_symbols_column,
)

if aggregate_metadata:
    logger.info("Aggregating cluster metadata...")
    self.group_metadata = aggregate_cluster_metadata(


 ... (clipped 157 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Sensitive data written: The new default save_query behavior writes the full input_data payload to disk via
save_query_to_file, which may include sensitive study metadata and should be reviewed for
secure storage expectations (permissions, encryption, and user consent).

Referred Code
def save_query_to_file(input_data: dict[str, Any], filename: str) -> None:
    """Save query to JSON file.

    Args:
        input_data: The input data dictionary to save
        filename: Path to save the query file
    """
    with open(filename, "w") as f:
        json.dump(input_data, f)

Learn more about managing compliance generic rules or creating your own custom rules

  • Update
Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-code-review
Copy link

qodo-code-review bot commented Jan 10, 2026

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Initialize variable to prevent NameError

In wait_for_completion, initialize cluster_status before the try block to
prevent a NameError in the exception handler if the initial API call fails.

cytetype/api/client.py [143-232]

 def wait_for_completion(
     base_url: str,
     auth_token: str | None,
     job_id: str,
     poll_interval: int,
     timeout: int,
     show_progress: bool = True,
 ) -> dict[str, Any]:
     """Poll job until completion and return results."""
     progress = ProgressDisplay() if show_progress else None
     start_time = time.time()
 
     logger.info(f"CyteType job (id: {job_id}) submitted. Polling for results...")
 
     # Initial delay
     time.sleep(5)
 
     # Show report URL
     report_url = f"{base_url}/report/{job_id}"
     logger.info(f"Report (updates automatically) available at: {report_url}")
     logger.info(
         "If network disconnects, the results can still be fetched:\n"
         "`results = annotator.get_results()`"
     )
 
     consecutive_not_found = 0
+    cluster_status: dict[str, str] = {}
 
     while (time.time() - start_time) < timeout:
         try:
             status_data = get_job_status(base_url, auth_token, job_id)
             job_status = status_data.get("jobStatus")
             cluster_status = status_data.get("clusterStatus", {})
 ...
         except Exception as e:
             # Network errors - log and retry
             logger.debug(f"Error during polling: {e}. Retrying...")
             retry_interval = min(poll_interval, 5)
             _sleep_with_spinner(retry_interval, progress, cluster_status)
 
     # Timeout reached
     if progress:
         progress.finalize({})
     raise TimeoutError(f"Job {job_id} did not complete within {timeout}s")

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 8

__

Why: This suggestion correctly identifies a potential UnboundLocalError that would crash the polling loop during a network error, and the proposed fix of initializing the variable before the loop is the correct solution.

Medium
Use local variables for method-specific overrides

In the run method, use local variables for api_url and auth_token overrides
instead of permanently modifying the instance attributes self.api_url and
self.auth_token.

cytetype/main.py [255-258]

-if api_url:
-    self.api_url = api_url.strip("/")
-if auth_token:
-    self.auth_token = auth_token
+run_api_url = (api_url or self.api_url).strip("/")
+run_auth_token = auth_token or self.auth_token
  • Apply / Chat
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies that the run method undesirably modifies instance state, which is a design flaw. Using local variables for the overrides improves the class's predictability and prevents side effects.

Low
High-level
Consider a more explicit client-server architecture

Create a dedicated ApiClient class to handle all API communication, separating
it from the data processing logic in the CyteType class. The CyteType class
would then use an instance of this new client.

Examples:

cytetype/main.py [178-302]
    def run(
        self,
        study_context: str,
        llm_configs: list[dict[str, Any]] | None = None,
        metadata: dict[str, Any] | None = None,
        n_parallel_clusters: int = 2,
        results_prefix: str = "cytetype",
        poll_interval_seconds: int = 10,
        timeout_seconds: int = 7200,
        api_url: str | None = None,

 ... (clipped 115 lines)

Solution Walkthrough:

Before:

# In cytetype/main.py
from .api import submit_annotation_job, wait_for_completion

class CyteType:
    def __init__(self, ..., api_url, auth_token):
        self.api_url = api_url
        self.auth_token = auth_token
        # ... data preprocessing ...

    def run(self, ...):
        # ...
        if api_url: self.api_url = api_url
        if auth_token: self.auth_token = auth_token
        
        payload = build_annotation_payload(...)
        job_id = submit_annotation_job(self.api_url, self.auth_token, payload)
        result = wait_for_completion(self.api_url, self.auth_token, job_id, ...)
        # ...

After:

# In a new cytetype/api/client.py
class ApiClient:
    def __init__(self, base_url, auth_token):
        self.transport = HTTPTransport(base_url, auth_token)

    def submit_job(self, payload): ...
    def wait_for_completion(self, job_id, ...): ...

# In cytetype/main.py
from .api import ApiClient

class CyteType:
    def __init__(self, ..., api_url, auth_token):
        self.api_client = ApiClient(api_url, auth_token)
        # ... data preprocessing ...

    def run(self, ...):
        # ...
        if api_url or auth_token: # Handle overrides
            self.api_client = ApiClient(api_url or self.api_client.base_url, ...)

        payload = build_annotation_payload(...)
        job_id = self.api_client.submit_job(payload)
        result = self.api_client.wait_for_completion(job_id, ...)
        # ...
Suggestion importance[1-10]: 7

__

Why: This is a valid design suggestion that would further improve the separation of concerns, which is the main goal of this PR, by encapsulating all API interactions within a dedicated client class.

Medium
General
Use stored API URL for fetching

In get_results, when fetching remote results, use the api_url stored in the
job's details instead of the current instance's self.api_url.

cytetype/main.py [341-349]

+api_url = self.adata.uns[job_details_key].get("api_url", self.api_url)
 return fetch_remote_results(
     self.adata,
     job_id,
-    self.api_url,
+    api_url,
     self.auth_token,
     results_prefix,
     self.group_key,
     self.clusters,
 )
  • Apply / Chat
Suggestion importance[1-10]: 7

__

Why: This is a significant correctness improvement. It ensures that when fetching results for a past job, the client contacts the same API endpoint that the job was originally submitted to, which is crucial if multiple API endpoints are used.

Medium
Save full payload to file

In the run method, save the entire payload dictionary to the query file, not
just the input_data part, to ensure llm_configs are also included.

cytetype/main.py [274-275]

 if save_query:
-    save_query_to_file(payload["input_data"], query_filename)
+    save_query_to_file(payload, query_filename)
  • Apply / Chat
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly points out that only a part of the API payload is being saved. Saving the full payload, including llm_configs, makes the saved query file more complete and useful for debugging or re-running jobs.

Low
Avoid shadowing parameter with local function

In _add_annotation_column, avoid shadowing the field_getter parameter by
assigning the locally defined function to a new variable, such as getter.

cytetype/core/results.py [247-280]

 def _add_annotation_column(
     adata: anndata.AnnData,
     result_data: dict[str, Any],
     field_key: str,
     column_name: str,
     clusters: list[str],
     default_value: str,
     field_getter: Callable[[dict[str, Any]], Any] | None = None,
 ) -> None:
     """Add a single annotation column to adata.obs (DRY helper).
 ...
     """
-    if field_getter is None:
+    getter = field_getter
+    if getter is None:
 
-        def field_getter(item: dict[str, Any]) -> Any:
+        def default_getter(item: dict[str, Any]) -> Any:
             return item.get(field_key, default_value)
 
+        getter = default_getter
+
     field_map = {
-        item["clusterId"]: field_getter(item)
+        item["clusterId"]: getter(item)
         for item in result_data.get("annotations", [])
     }
 
     adata.obs[column_name] = pd.Series(
         [field_map.get(cluster_id, default_value) for cluster_id in clusters],
         index=adata.obs.index,
     ).astype("category")

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 4

__

Why: The suggestion correctly points out that shadowing the field_getter parameter with a local function definition is confusing and poor practice. The proposed change improves code clarity and maintainability by avoiding this shadow.

Low
Add file encoding to JSON write

In save_query_to_file, specify encoding="utf-8" when opening the file for
writing to ensure cross-platform consistency.

cytetype/core/payload.py [87-88]

-with open(filename, "w") as f:
+with open(filename, "w", encoding="utf-8") as f:
     json.dump(input_data, f)
  • Apply / Chat
Suggestion importance[1-10]: 4

__

Why: This is a good practice suggestion that improves the robustness of file I/O by explicitly setting the encoding to utf-8, preventing potential issues on different operating systems with varying default encodings.

Low
Remove null llm_configs

In build_annotation_payload, conditionally add the llm_configs key to the
payload only if it has a value, instead of sending it as null when empty.

cytetype/core/payload.py [74-77]

-return {
-    "input_data": input_data,
-    "llm_configs": validated_llm_configs if validated_llm_configs else None,
-}
+payload = {"input_data": input_data}
+if validated_llm_configs:
+    payload["llm_configs"] = validated_llm_configs
+return payload
  • Apply / Chat
Suggestion importance[1-10]: 3

__

Why: This is a minor improvement for creating a cleaner API payload by omitting an optional key when it's empty, which can be considered better practice than sending null.

Low
  • Update

Updated pyproject.toml to exclude build, dist, and egg-info directories from mypy checks. Added explicit return type annotations to test fixtures in tests/conftest.py for improved type safety.
@parashardhapola parashardhapola merged commit 5992425 into master Jan 10, 2026
1 check passed
@parashardhapola parashardhapola deleted the refactor_codebase branch January 10, 2026 18:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants