
Conversation

@kinjalpatel27
Contributor

@kinjalpatel27 kinjalpatel27 commented Jan 21, 2026

What does this PR do?

Type of change: new feature

Overview:

  • Added support for reloading an HF-exported checkpoint with ModelOpt state in vLLM fakequant
  • Added support for reloading all quantizer parameters via QUANT_FILE_PATH instead of only amax

Usage

cd $PWD/examples/llm_ptq
python hf_ptq.py --pyt_ckpt_path meta-llama/Llama-3.2-3B-Instruct --qformat nvfp4 --export_fmt hf --dataset cnn_dailymail --export_path llama3.2-3b --trust_remote_code --inference_pipeline_parallel 1 --batch_size 1 --calib_size 512 --kv_cache_qformat nvfp4_affine --export_vllm_fq

cd $PWD/examples/vllm_serve
MODELOPT_STATE_PATH=../llm_ptq/llama3.2-3b/vllm_fq_modelopt_state.pth python vllm_serve_fakequant.py ../llm_ptq/llama3.2-3b/ -tp 1 --served-model-name llama3.2-3b --host 0.0.0.0 --port 8001 --trust-remote-code --enforce-eager --disable-custom-all-reduce --gpu-memory-utilization 0.8
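
With --export_vllm_fq, the first command also writes vllm_fq_modelopt_state.pth alongside the exported HF checkpoint; the second command serves that checkpoint with vLLM fakequant, using MODELOPT_STATE_PATH to restore the exported quantizer state instead of re-calibrating.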

Testing

  • Exported a checkpoint with hf_ptq, reloaded it in the vLLM fakequant example, and manually checked the quantizer values
  • Repeated the above with mixed quantization by disabling quantization for a few layers
  • Exported a checkpoint with Megatron-LM using NVFP4 and reloaded it with vLLM

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: No
  • Did you add or update any necessary documentation?: Yes
  • Did you update Changelog?: Yes

Additional Information

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for vLLM fakequant reload using ModelOpt state for HuggingFace models
    • Introduced --export_vllm_fq flag to enable exporting vLLM-compatible fakequant checkpoints
  • Documentation

    • Updated serving instructions with new environment variables and step-by-step export guidance for HuggingFace and Megatron model formats
  • Refactor

    • Modernized quantization state handling workflow for improved checkpoint compatibility

✏️ Tip: You can customize this high-level summary in your review settings.

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Jan 21, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Jan 21, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

This PR adds support for vLLM fakequant reload using ModelOpt state for HuggingFace models. It introduces a new export pathway, replaces amax-based quantization handling with modelopt state loading, adds utilities for format conversion and tensor parallelism sharding, and updates the post-restore API to pass model context.

Changes

Cohort / File(s): Summary

  • Documentation Updates (CHANGELOG.rst, examples/vllm_serve/README.md): Added a new feature entry for vLLM fakequant reload. Updated the README with new environment variables (QUANT_FILE_PATH, MODELOPT_STATE_PATH, CALIB_BATCH_SIZE), revised calibration/serving instructions, and clarified MCore/HF export workflows.
  • Example Scripts & Export (examples/llm_ptq/hf_ptq.py, examples/vllm_serve/vllm_serve_fakequant.py): Added the --export_vllm_fq CLI flag to enable vLLM fakequant checkpoint export. Replaced AMAX_FILE_PATH with QUANT_FILE_PATH and added the MODELOPT_STATE_PATH and CALIB_BATCH_SIZE environment variables.
  • Core Serving Logic (examples/vllm_serve/fakequant_worker.py): Refactored the quantization workflow to load ModelOpt state when provided, otherwise falling back to standard quantization/calibration. Removed legacy amax merging; integrated the new ModelOpt state loading and conversion utilities. Updated barrier and warm-up triggers.
  • State Conversion Utilities (examples/vllm_serve/vllm_reload_utils.py): New module providing conversion of HF-style quantizer state to vLLM format, state dict key mapping, tensor merging strategies, and tensor-parallelism-aware sharding via process_state_dict_for_tp.
  • Export Plugins (modelopt/torch/export/plugins/vllm_fakequant_hf.py, modelopt/torch/export/plugins/vllm_fakequant_megatron.py): Replaced amax-based state saving with composite ModelOpt state export. The HF plugin now saves vllm_fq_modelopt_state.pth with quantizer_state injection; the Megatron plugin gathers quantizer state dicts and adds a _get_quantized_state helper.
  • Quantization API (modelopt/torch/quantization/conversion.py, modelopt/torch/quantization/nn/modules/quant_module.py): Updated the QuantModule.modelopt_post_restore signature to accept an optional model parameter, enabling device detection via the parent model context during state restoration (see the sketch after this list).
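
As a minimal sketch of the Quantization API change above (assuming the new argument is simply threaded through for device lookup; the method body here is illustrative, not the actual modelopt code):

from torch import nn


class QuantModule(nn.Module):
    """Sketch of the updated hook; the real class lives in modelopt/torch/quantization/nn/modules/quant_module.py."""

    def modelopt_post_restore(self, prefix: str = "", model: nn.Module | None = None) -> None:
        # New optional `model` argument: the parent model provides context during restore,
        # e.g. a device on which to place restored quantizer buffers.
        device = None
        if model is not None:
            # Assumed heuristic: use the device of the first parameter found on the parent model.
            for param in model.parameters():
                device = param.device
                break
        # ... restore quantizer state, moving buffers to `device` when known (omitted) ...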

Sequence Diagram(s)

sequenceDiagram
    participant W as vLLM Worker
    participant MO as ModelOpt State<br/>(Disk)
    participant Conv as State Converter<br/>(vllm_reload_utils)
    participant Model as vLLM Model
    participant Q as Quantizer State

    W->>MO: Load modelopt_state.pth
    MO-->>W: modelopt_state_dict
    W->>Conv: convert_modelopt_state_to_vllm()
    Conv->>Conv: _group_keys_for_vllm()
    Conv->>Conv: _merge_values_by_max_or_concat()
    Conv-->>W: vllm_compatible_state
    W->>Conv: process_state_dict_for_tp()
    Conv->>Conv: Shard tensors for TP
    Conv-->>W: sharded_state
    W->>Model: restore_from_modelopt_state()
    Model->>Q: Apply quantizer state
    Q-->>Model: Quantizers configured
    Model-->>W: Ready for inference
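
The diagram above maps to roughly the following flow. This is a hedged sketch: the helper names convert_modelopt_state_to_vllm and process_state_dict_for_tp come from the walkthrough and restore_from_modelopt_state is ModelOpt's restore entry point, but the glue code, argument names, and the reload_fakequant_state wrapper are assumptions, not the actual fakequant_worker implementation.

import os

import torch
import modelopt.torch.opt as mto

# Helper names are taken from the walkthrough; their exact signatures are assumed here.
from vllm_reload_utils import convert_modelopt_state_to_vllm, process_state_dict_for_tp


def reload_fakequant_state(model, tp_rank: int, tp_size: int):
    # Load the exported ModelOpt state on CPU so it restores regardless of the
    # GPU mapping used at export time.
    modelopt_state = torch.load(
        os.environ["MODELOPT_STATE_PATH"], weights_only=False, map_location="cpu"
    )
    # Convert HF-style quantizer keys/values to the vLLM module layout
    # (key grouping plus max-or-concat merging, per the walkthrough).
    vllm_state = convert_modelopt_state_to_vllm(modelopt_state)
    # Shard quantizer tensors to match this worker's tensor-parallel slice.
    vllm_state = process_state_dict_for_tp(vllm_state, tp_rank, tp_size)
    # Re-attach quantizers and apply their state to the vLLM model.
    mto.restore_from_modelopt_state(model, vllm_state)
    return model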

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 60.87%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Title check: ❓ Inconclusive. The PR title 'Kinjal/vllm modelopt reload' is vague and uses a branch-naming pattern rather than a clear description of the main change. Resolution: revise the title to describe the feature, e.g., 'Add support for reloading HF modelopt state checkpoints in vLLM fakequant' or 'Enable vLLM fakequant reload from HF-exported modelopt state'.
✅ Passed checks (1 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov bot commented Jan 21, 2026

Codecov Report

❌ Patch coverage is 25.00000% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.09%. Comparing base (5cc2a54) to head (d170ed0).

Files with missing lines Patch % Lines
...lopt/torch/quantization/nn/modules/quant_module.py 18.18% 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #805      +/-   ##
==========================================
- Coverage   74.13%   74.09%   -0.04%     
==========================================
  Files         192      192              
  Lines       19263    19273      +10     
==========================================
+ Hits        14280    14281       +1     
- Misses       4983     4992       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
@kinjalpatel27 kinjalpatel27 marked this pull request as ready for review January 22, 2026 22:27
@kinjalpatel27 kinjalpatel27 requested review from a team as code owners January 22, 2026 22:27
@kinjalpatel27 kinjalpatel27 changed the title Kinjal/vllm modelopt reload vllm fakequant reload with modelopt state for HF Jan 22, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@modelopt/torch/quantization/nn/modules/quant_module.py`:
- Around lines 62-72: The fallback block currently overwrites non_tq_param_or_buffer unconditionally. Guard it so it only runs when non_tq_param_or_buffer is None (e.g., if non_tq_param_or_buffer is None and model is not None), so the first-found parameter is not clobbered.
- When computing the parent module, use model.get_submodule(parent_prefix) if parent_prefix else model, to avoid calling get_submodule with an empty string.
- Either implement the intended filtering that skips TensorQuantizer-owned parameters (continue inside the loop when a parameter belongs to a TensorQuantizer), or update the comment to reflect that the code simply takes the first parameter. Reference non_tq_param_or_buffer, prefix, and model.get_submodule to locate where to change.
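
A hedged sketch of the guard change described above follows. Only the names mentioned in the prompt (non_tq_param_or_buffer, prefix, model.get_submodule) are taken from it; the _find_fallback_param helper and its structure are hypothetical, not the actual quant_module.py code.

from torch import nn


def _find_fallback_param(model: nn.Module, prefix: str, non_tq_param_or_buffer=None):
    """Hypothetical helper showing the suggested fallback behavior."""
    # Run the fallback only when nothing was found yet and a model is available,
    # so an earlier hit is not clobbered.
    if non_tq_param_or_buffer is None and model is not None:
        parent_prefix = prefix.rpartition(".")[0]
        # Per the review note, avoid calling get_submodule with an empty string.
        parent = model.get_submodule(parent_prefix) if parent_prefix else model
        for _, candidate in parent.named_parameters():
            # Either skip TensorQuantizer-owned parameters here, or document that
            # the first parameter found is taken as-is.
            non_tq_param_or_buffer = candidate
            break
    return non_tq_param_or_buffer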
🧹 Nitpick comments (5)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)

33-43: Docstring is outdated and references amax instead of quantizer state.

The docstring still describes extracting "amax values" but the implementation now saves the complete quantizer state dict and modelopt state. Update to reflect the actual behavior.

📝 Suggested docstring update
-    """Exports the torch model weights and amax values separately.
+    """Exports the torch model weights and quantizer state separately for vLLM fakequant.

     This function:
-    1. Extracts amax values for calibration
+    1. Extracts quantizer state dict and modelopt state
     2. Deletes all quantizer parameters from state dict to store only weights in original dtype
     3. Saves the model weights

     Args:
         model: The quantized model to export
-        export_dir: Directory to save the amax values
+        export_dir: Directory to save the model and quantizer state

     """
modelopt/torch/export/plugins/vllm_fakequant_megatron.py (1)

46-64: Comments reference "amax" but code now handles full quantizer state.

Several comments still reference "amax" (lines 46, 51, 63) but the code now handles the complete quantizer state dictionary. Consider updating for clarity.

📝 Suggested comment updates
-    # Gather all amax dicts to rank 0
+    # Gather all quantizer state dicts to rank 0
     world_size = torch.distributed.get_world_size()
     rank = torch.distributed.get_rank()

     if rank == 0:
-        # Rank 0 will collect all amax values
+        # Rank 0 will collect all quantizer state values
         all_quantizer_state_dicts = [None] * world_size
         torch.distributed.gather_object(quantizer_state_dict, all_quantizer_state_dicts, dst=0)
         ...
     else:
-        # Other ranks just send their amax values
+        # Other ranks send their quantizer state values
         torch.distributed.gather_object(quantizer_state_dict, None, dst=0)
examples/vllm_serve/vllm_reload_utils.py (1)

175-185: Docstring references non-existent parameter fuse_experts.

The docstring mentions fuse_experts parameter but the function only has state_dict and merge_mode parameters.

📝 Suggested docstring fix
     """
     Common implementation for converting quantizer state from HF to vLLM format.

     Args:
         state_dict: Input state dict
-        fuse_experts: Whether to fuse expert projections
         merge_mode: Mode to merge grouped values, "max_or_concat" or "require_identical"
+
+    Returns:
+        Converted state dict in vLLM format.
     """
examples/vllm_serve/fakequant_worker.py (2)

115-124: Consider adding a comment about weights_only=False security implications.

Using weights_only=False in torch.load is necessary for loading complex modelopt state, but it allows arbitrary code execution from untrusted files. The current code is fine for trusted checkpoints, but a brief comment noting this would be helpful for future maintainers.

💡 Optional: Add security note
         # Load on CPU to avoid failures when the checkpoint was saved from a different
         # GPU mapping
+        # Note: weights_only=False is required for modelopt state but should only be used
+        # with trusted checkpoint files.
         modelopt_state = torch.load(
             quant_config["modelopt_state_path"], weights_only=False, map_location="cpu"
         )

235-242: Asymmetric key validation is intentional but could use a comment.

The code raises an error when model keys are missing from the checkpoint but only warns when checkpoint has extra keys. This asymmetry makes sense (model requires all its quantizers to be loaded), but a brief comment explaining the rationale would improve clarity.

💡 Optional: Add clarifying comment
+            # Checkpoint may have extra keys (e.g., from PP sharding), but model must have
+            # all its quantizer keys present in the checkpoint for correct loading
             for key in checkpoint_quant_keys:
                 if key not in model_quant_keys:
                     print(f"Key {key} not found in model state dict, but exists in checkpoint")
             for key in model_quant_keys:
                 if key not in checkpoint_quant_keys:
                     raise ValueError(
                         f"Key {key} not found in checkpoint state dict, but exists in model"
                     )

kinjalpatel27 and others added 3 commits January 22, 2026 23:33
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>