
Conversation

@kinjalpatel27
Contributor

@kinjalpatel27 kinjalpatel27 commented Jan 21, 2026

What does this PR do?

Type of change: new feature

Overview:

  • Added support for reloading an HF-exported checkpoint with ModelOpt state in vLLM fakequant
  • Added support for reloading all quantizer parameters via QUANT_FILE_PATH instead of only amax

Usage

cd $PWD/examples/llm_ptq
python hf_ptq.py --pyt_ckpt_path meta-llama/Llama-3.2-3B-Instruct --qformat nvfp4 --export_fmt hf --dataset cnn_dailymail --export_path llama3.2-3b --trust_remote_code --inference_pipeline_parallel 1 --batch_size 1 --calib_size 512 --kv_cache_qformat nvfp4_affine --export_vllm_fq

cd $PWD/examples/vllm_serve
MODELOPT_STATE_PATH=../llm_ptq/llama3.2-3b/vllm_fq_modelopt_state.pth python vllm_serve_fakequant.py ../llm_ptq/llama3.2-3b/ -tp 1 --served-model-name llama3.2-3b --host 0.0.0.0 --port 8001 --trust-remote-code --enforce-eager --disable-custom-all-reduce --gpu-memory-utilization 0.8
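
With --export_vllm_fq, the first command also writes vllm_fq_modelopt_state.pth alongside the exported HF checkpoint; the second command serves that checkpoint with vLLM fakequant, using MODELOPT_STATE_PATH to restore the exported quantizer state instead of re-calibrating.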

Testing

  • Exported a checkpoint with hf_ptq, reloaded it in the vLLM fakequant example, and manually checked the quantizer values
  • Repeated the above with mixed quantization by disabling quantization for a few layers
  • Exported a checkpoint with Megatron-LM using NVFP4 and reloaded it with vLLM

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: No
  • Did you add or update any necessary documentation?: Yes
  • Did you update Changelog?: Yes

Additional Information

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for vLLM fakequant reload using ModelOpt state for HuggingFace models
    • Introduced --export_vllm_fq flag to enable exporting vLLM-compatible fakequant checkpoints
  • Documentation

    • Updated serving instructions with new environment variables and step-by-step export guidance for HuggingFace and Megatron model formats
  • Refactor

    • Modernized quantization state handling workflow for improved checkpoint compatibility

✏️ Tip: You can customize this high-level summary in your review settings.

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Jan 21, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Jan 21, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

This PR adds support for vLLM fakequant reload using ModelOpt state for HuggingFace models. It introduces a new export pathway, replaces amax-based quantization handling with modelopt state loading, adds utilities for format conversion and tensor parallelism sharding, and updates the post-restore API to pass model context.

Changes

Cohort / File(s): Summary

  • Documentation Updates (CHANGELOG.rst, examples/vllm_serve/README.md): Added a new feature entry for vLLM fakequant reload. Updated the README with new environment variables (QUANT_FILE_PATH, MODELOPT_STATE_PATH, CALIB_BATCH_SIZE), revised calibration/serving instructions, and clarified MCore/HF export workflows.
  • Example Scripts & Export (examples/llm_ptq/hf_ptq.py, examples/vllm_serve/vllm_serve_fakequant.py): Added the --export_vllm_fq CLI flag to enable vLLM fakequant checkpoint export. Replaced AMAX_FILE_PATH with QUANT_FILE_PATH and added the MODELOPT_STATE_PATH and CALIB_BATCH_SIZE environment variables.
  • Core Serving Logic (examples/vllm_serve/fakequant_worker.py): Refactored the quantization workflow to load ModelOpt state when provided, otherwise falling back to standard quantization/calibration. Removed legacy amax merging; integrated the new ModelOpt state loading and conversion utilities. Updated barrier and warm-up triggers.
  • State Conversion Utilities (examples/vllm_serve/vllm_reload_utils.py): New module providing conversion of HF-style quantizer state to vLLM format, state dict key mapping, tensor merging strategies, and tensor-parallelism-aware sharding via process_state_dict_for_tp.
  • Export Plugins (modelopt/torch/export/plugins/vllm_fakequant_hf.py, modelopt/torch/export/plugins/vllm_fakequant_megatron.py): Replaced amax-based state saving with composite ModelOpt state export. The HF plugin now saves vllm_fq_modelopt_state.pth with quantizer_state injection; the Megatron plugin gathers quantizer state dicts and adds a _get_quantized_state helper.
  • Quantization API (modelopt/torch/quantization/conversion.py, modelopt/torch/quantization/nn/modules/quant_module.py): Updated the QuantModule.modelopt_post_restore signature to accept an optional model parameter, enabling device detection via the parent model context during state restoration (see the sketch after this list).
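
As a minimal sketch of the Quantization API change above (assuming the new argument is simply threaded through for device lookup; the method body here is illustrative, not the actual modelopt code):

from torch import nn


class QuantModule(nn.Module):
    """Sketch of the updated hook; the real class lives in modelopt/torch/quantization/nn/modules/quant_module.py."""

    def modelopt_post_restore(self, prefix: str = "", model: nn.Module | None = None) -> None:
        # New optional `model` argument: the parent model provides context during restore,
        # e.g. a device on which to place restored quantizer buffers.
        device = None
        if model is not None:
            # Assumed heuristic: use the device of the first parameter found on the parent model.
            for param in model.parameters():
                device = param.device
                break
        # ... restore quantizer state, moving buffers to `device` when known (omitted) ...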

Sequence Diagram(s)

sequenceDiagram
    participant W as vLLM Worker
    participant MO as ModelOpt State<br/>(Disk)
    participant Conv as State Converter<br/>(vllm_reload_utils)
    participant Model as vLLM Model
    participant Q as Quantizer State

    W->>MO: Load modelopt_state.pth
    MO-->>W: modelopt_state_dict
    W->>Conv: convert_modelopt_state_to_vllm()
    Conv->>Conv: _group_keys_for_vllm()
    Conv->>Conv: _merge_values_by_max_or_concat()
    Conv-->>W: vllm_compatible_state
    W->>Conv: process_state_dict_for_tp()
    Conv->>Conv: Shard tensors for TP
    Conv-->>W: sharded_state
    W->>Model: restore_from_modelopt_state()
    Model->>Q: Apply quantizer state
    Q-->>Model: Quantizers configured
    Model-->>W: Ready for inference
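
The diagram above maps to roughly the following flow. This is a hedged sketch: the helper names convert_modelopt_state_to_vllm and process_state_dict_for_tp come from the walkthrough and restore_from_modelopt_state is ModelOpt's restore entry point, but the glue code, argument names, and the reload_fakequant_state wrapper are assumptions, not the actual fakequant_worker implementation.

import os

import torch
import modelopt.torch.opt as mto

# Helper names are taken from the walkthrough; their exact signatures are assumed here.
from vllm_reload_utils import convert_modelopt_state_to_vllm, process_state_dict_for_tp


def reload_fakequant_state(model, tp_rank: int, tp_size: int):
    # Load the exported ModelOpt state on CPU so it restores regardless of the
    # GPU mapping used at export time.
    modelopt_state = torch.load(
        os.environ["MODELOPT_STATE_PATH"], weights_only=False, map_location="cpu"
    )
    # Convert HF-style quantizer keys/values to the vLLM module layout
    # (key grouping plus max-or-concat merging, per the walkthrough).
    vllm_state = convert_modelopt_state_to_vllm(modelopt_state)
    # Shard quantizer tensors to match this worker's tensor-parallel slice.
    vllm_state = process_state_dict_for_tp(vllm_state, tp_rank, tp_size)
    # Re-attach quantizers and apply their state to the vLLM model.
    mto.restore_from_modelopt_state(model, vllm_state)
    return model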

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 60.87%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Title check: ❓ Inconclusive. The PR title 'Kinjal/vllm modelopt reload' is vague and uses a branch-naming pattern rather than a clear description of the main change. Resolution: revise the title to describe the feature, e.g., 'Add support for reloading HF modelopt state checkpoints in vLLM fakequant' or 'Enable vLLM fakequant reload from HF-exported modelopt state'.
✅ Passed checks (1 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov bot commented Jan 21, 2026

Codecov Report

❌ Patch coverage is 25.00000% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.09%. Comparing base (5cc2a54) to head (d170ed0).

Files with missing lines Patch % Lines
...lopt/torch/quantization/nn/modules/quant_module.py 18.18% 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #805      +/-   ##
==========================================
- Coverage   74.13%   74.09%   -0.04%     
==========================================
  Files         192      192              
  Lines       19263    19273      +10     
==========================================
+ Hits        14280    14281       +1     
- Misses       4983     4992       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
@kinjalpatel27 kinjalpatel27 marked this pull request as ready for review January 22, 2026 22:27
@kinjalpatel27 kinjalpatel27 requested review from a team as code owners January 22, 2026 22:27
@kinjalpatel27 kinjalpatel27 changed the title Kinjal/vllm modelopt reload vllm fakequant reload with modelopt state for HF Jan 22, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@modelopt/torch/quantization/nn/modules/quant_module.py`:
- Around lines 62-72: The fallback block currently overwrites non_tq_param_or_buffer unconditionally. Guard it so it only runs when non_tq_param_or_buffer is None (e.g., if non_tq_param_or_buffer is None and model is not None), so the first-found parameter is not clobbered.
- When computing the parent module, use model.get_submodule(parent_prefix) if parent_prefix else model, to avoid calling get_submodule with an empty string.
- Either implement the intended filtering that skips TensorQuantizer-owned parameters (continue inside the loop when a parameter belongs to a TensorQuantizer), or update the comment to reflect that the code simply takes the first parameter. Reference non_tq_param_or_buffer, prefix, and model.get_submodule to locate where to change.
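
A hedged sketch of the guard change described above follows. Only the names mentioned in the prompt (non_tq_param_or_buffer, prefix, model.get_submodule) are taken from it; the _find_fallback_param helper and its structure are hypothetical, not the actual quant_module.py code.

from torch import nn


def _find_fallback_param(model: nn.Module, prefix: str, non_tq_param_or_buffer=None):
    """Hypothetical helper showing the suggested fallback behavior."""
    # Run the fallback only when nothing was found yet and a model is available,
    # so an earlier hit is not clobbered.
    if non_tq_param_or_buffer is None and model is not None:
        parent_prefix = prefix.rpartition(".")[0]
        # Per the review note, avoid calling get_submodule with an empty string.
        parent = model.get_submodule(parent_prefix) if parent_prefix else model
        for _, candidate in parent.named_parameters():
            # Either skip TensorQuantizer-owned parameters here, or document that
            # the first parameter found is taken as-is.
            non_tq_param_or_buffer = candidate
            break
    return non_tq_param_or_buffer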
🧹 Nitpick comments (5)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)

33-43: Docstring is outdated and references amax instead of quantizer state.

The docstring still describes extracting "amax values" but the implementation now saves the complete quantizer state dict and modelopt state. Update to reflect the actual behavior.

📝 Suggested docstring update
-    """Exports the torch model weights and amax values separately.
+    """Exports the torch model weights and quantizer state separately for vLLM fakequant.

     This function:
-    1. Extracts amax values for calibration
+    1. Extracts quantizer state dict and modelopt state
     2. Deletes all quantizer parameters from state dict to store only weights in original dtype
     3. Saves the model weights

     Args:
         model: The quantized model to export
-        export_dir: Directory to save the amax values
+        export_dir: Directory to save the model and quantizer state

     """
modelopt/torch/export/plugins/vllm_fakequant_megatron.py (1)

46-64: Comments reference "amax" but code now handles full quantizer state.

Several comments still reference "amax" (lines 46, 51, 63) but the code now handles the complete quantizer state dictionary. Consider updating for clarity.

📝 Suggested comment updates
-    # Gather all amax dicts to rank 0
+    # Gather all quantizer state dicts to rank 0
     world_size = torch.distributed.get_world_size()
     rank = torch.distributed.get_rank()

     if rank == 0:
-        # Rank 0 will collect all amax values
+        # Rank 0 will collect all quantizer state values
         all_quantizer_state_dicts = [None] * world_size
         torch.distributed.gather_object(quantizer_state_dict, all_quantizer_state_dicts, dst=0)
         ...
     else:
-        # Other ranks just send their amax values
+        # Other ranks send their quantizer state values
         torch.distributed.gather_object(quantizer_state_dict, None, dst=0)
examples/vllm_serve/vllm_reload_utils.py (1)

175-185: Docstring references non-existent parameter fuse_experts.

The docstring mentions fuse_experts parameter but the function only has state_dict and merge_mode parameters.

📝 Suggested docstring fix
     """
     Common implementation for converting quantizer state from HF to vLLM format.

     Args:
         state_dict: Input state dict
-        fuse_experts: Whether to fuse expert projections
         merge_mode: Mode to merge grouped values, "max_or_concat" or "require_identical"
+
+    Returns:
+        Converted state dict in vLLM format.
     """
examples/vllm_serve/fakequant_worker.py (2)

115-124: Consider adding a comment about weights_only=False security implications.

Using weights_only=False in torch.load is necessary for loading complex modelopt state, but it allows arbitrary code execution from untrusted files. The current code is fine for trusted checkpoints, but a brief comment noting this would be helpful for future maintainers.

💡 Optional: Add security note
         # Load on CPU to avoid failures when the checkpoint was saved from a different
         # GPU mapping
+        # Note: weights_only=False is required for modelopt state but should only be used
+        # with trusted checkpoint files.
         modelopt_state = torch.load(
             quant_config["modelopt_state_path"], weights_only=False, map_location="cpu"
         )

235-242: Asymmetric key validation is intentional but could use a comment.

The code raises an error when model keys are missing from the checkpoint but only warns when checkpoint has extra keys. This asymmetry makes sense (model requires all its quantizers to be loaded), but a brief comment explaining the rationale would improve clarity.

💡 Optional: Add clarifying comment
+            # Checkpoint may have extra keys (e.g., from PP sharding), but model must have
+            # all its quantizer keys present in the checkpoint for correct loading
             for key in checkpoint_quant_keys:
                 if key not in model_quant_keys:
                     print(f"Key {key} not found in model state dict, but exists in checkpoint")
             for key in model_quant_keys:
                 if key not in checkpoint_quant_keys:
                     raise ValueError(
                         f"Key {key} not found in checkpoint state dict, but exists in model"
                     )

kinjalpatel27 and others added 3 commits January 22, 2026 23:33
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>