vllm fakequant reload with modelopt state for HF #805
base: main
Conversation
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
Important: Review skipped. Auto incremental reviews are disabled on this repository; please check the settings in the CodeRabbit UI.

📝 Walkthrough

This PR adds support for vLLM fakequant reload using ModelOpt state for HuggingFace models. It introduces a new export pathway, replaces amax-based quantization handling with modelopt state loading, adds utilities for format conversion and tensor-parallelism sharding, and updates the post-restore API to pass model context.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant W as vLLM Worker
    participant MO as ModelOpt State<br/>(Disk)
    participant Conv as State Converter<br/>(vllm_reload_utils)
    participant Model as vLLM Model
    participant Q as Quantizer State

    W->>MO: Load modelopt_state.pth
    MO-->>W: modelopt_state_dict
    W->>Conv: convert_modelopt_state_to_vllm()
    Conv->>Conv: _group_keys_for_vllm()
    Conv->>Conv: _merge_values_by_max_or_concat()
    Conv-->>W: vllm_compatible_state
    W->>Conv: process_state_dict_for_tp()
    Conv->>Conv: Shard tensors for TP
    Conv-->>W: sharded_state
    W->>Model: restore_from_modelopt_state()
    Model->>Q: Apply quantizer state
    Q-->>Model: Quantizers configured
    Model-->>W: Ready for inference
```
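Read as worker-side code, the flow above might look roughly like the sketch below. The helper signatures (`convert_modelopt_state_to_vllm`, `process_state_dict_for_tp`) and the argument wiring are assumptions taken from the names in the diagram; only `mto.restore_from_modelopt_state` is a pre-existing ModelOpt API, and the PR's actual code may differ.

```python
# Minimal sketch of the reload flow in the diagram above. The vllm_reload_utils
# helper signatures are assumptions based on the names shown; only
# mto.restore_from_modelopt_state is a pre-existing ModelOpt API.
import torch
import modelopt.torch.opt as mto

from vllm_reload_utils import (  # assumed import from examples/vllm_serve in this PR
    convert_modelopt_state_to_vllm,
    process_state_dict_for_tp,
)


def reload_fakequant(model, modelopt_state_path: str, tp_size: int, tp_rank: int):
    """Restore quantizers on an already-loaded vLLM model from saved ModelOpt state."""
    # Load on CPU; weights_only=False is needed for the nested state objects,
    # so only load trusted checkpoints.
    modelopt_state = torch.load(modelopt_state_path, weights_only=False, map_location="cpu")

    # Remap HF-style keys to vLLM's fused-module layout, merging grouped
    # quantizer values by max or concatenation (per the diagram's helpers).
    vllm_state = convert_modelopt_state_to_vllm(modelopt_state)

    # Slice per-channel tensors so each tensor-parallel rank gets its shard.
    sharded_state = process_state_dict_for_tp(vllm_state, tp_size, tp_rank)

    # Re-create and configure quantizer modules on the vLLM model.
    mto.restore_from_modelopt_state(model, sharded_state)
    return model
```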
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~55 minutes

🚥 Pre-merge checks: ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #805      +/-   ##
==========================================
- Coverage   74.13%   74.09%   -0.04%
==========================================
  Files         192      192
  Lines       19263    19273      +10
==========================================
+ Hits        14280    14281       +1
- Misses       4983     4992       +9
```

☔ View full report in Codecov by Sentry.
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@modelopt/torch/quantization/nn/modules/quant_module.py`:
- Around lines 62-72: The fallback block currently unconditionally overwrites
  `non_tq_param_or_buffer`. Change the guard to only run when `non_tq_param_or_buffer`
  is None (e.g., `if non_tq_param_or_buffer is None and model is not None`) so the
  first-found parameter isn't clobbered. When computing the parent module, use
  `model.get_submodule(parent_prefix) if parent_prefix else model` to avoid calling
  `get_submodule` with an empty string. Also either implement the intended filtering
  to skip TensorQuantizer-owned params (add a check inside the loop to `continue` if
  the param belongs to a TensorQuantizer) or update the comment to reflect that the
  code simply takes the first parameter. Use `non_tq_param_or_buffer`, `prefix`, and
  `model.get_submodule` to locate where to change.
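For orientation, a guarded fallback along the lines the comment describes could look like the following. This is a hypothetical sketch, not the repository's actual `quant_module.py` code; the variable names (`non_tq_param_or_buffer`, `model`) come from the review comment and the helper function is invented for illustration.

```python
# Hypothetical illustration of the suggested fix; not the actual quant_module.py code.
from modelopt.torch.quantization.nn import TensorQuantizer


def _find_non_tq_param_or_buffer(model, non_tq_param_or_buffer=None):
    # Only fall back when nothing was found yet, so an earlier match isn't clobbered.
    if non_tq_param_or_buffer is None and model is not None:
        for name, param in model.named_parameters():
            parent_prefix = name.rpartition(".")[0]
            # Avoid get_submodule("") by falling back to the model itself.
            parent = model.get_submodule(parent_prefix) if parent_prefix else model
            # Skip parameters owned by a TensorQuantizer, as the comment intends.
            if isinstance(parent, TensorQuantizer):
                continue
            non_tq_param_or_buffer = param
            break
    return non_tq_param_or_buffer
```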
🧹 Nitpick comments (5)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)
33-43: Docstring is outdated and references amax instead of quantizer state. The docstring still describes extracting "amax values" but the implementation now saves the complete quantizer state dict and modelopt state. Update to reflect the actual behavior.
📝 Suggested docstring update
- """Exports the torch model weights and amax values separately. + """Exports the torch model weights and quantizer state separately for vLLM fakequant. This function: - 1. Extracts amax values for calibration + 1. Extracts quantizer state dict and modelopt state 2. Deletes all quantizer parameters from state dict to store only weights in original dtype 3. Saves the model weights Args: model: The quantized model to export - export_dir: Directory to save the amax values + export_dir: Directory to save the model and quantizer state """modelopt/torch/export/plugins/vllm_fakequant_megatron.py (1)
46-64: Comments reference "amax" but the code now handles the full quantizer state. Several comments still reference "amax" (lines 46, 51, 63) but the code now handles the complete quantizer state dictionary. Consider updating for clarity.
📝 Suggested comment updates
```diff
-    # Gather all amax dicts to rank 0
+    # Gather all quantizer state dicts to rank 0
     world_size = torch.distributed.get_world_size()
     rank = torch.distributed.get_rank()
     if rank == 0:
-        # Rank 0 will collect all amax values
+        # Rank 0 will collect all quantizer state values
         all_quantizer_state_dicts = [None] * world_size
         torch.distributed.gather_object(quantizer_state_dict, all_quantizer_state_dicts, dst=0)
         ...
     else:
-        # Other ranks just send their amax values
+        # Other ranks send their quantizer state values
         torch.distributed.gather_object(quantizer_state_dict, None, dst=0)
```

examples/vllm_serve/vllm_reload_utils.py (1)
175-185: Docstring references non-existent parameter `fuse_experts`. The docstring mentions a `fuse_experts` parameter, but the function only has `state_dict` and `merge_mode` parameters.

📝 Suggested docstring fix
""" Common implementation for converting quantizer state from HF to vLLM format. Args: state_dict: Input state dict - fuse_experts: Whether to fuse expert projections merge_mode: Mode to merge grouped values, "max_or_concat" or "require_identical" + + Returns: + Converted state dict in vLLM format. """examples/vllm_serve/fakequant_worker.py (2)
115-124: Consider adding a comment about the security implications of `weights_only=False`. Using `weights_only=False` in `torch.load` is necessary for loading complex modelopt state, but it allows arbitrary code execution from untrusted files. The current code is fine for trusted checkpoints, but a brief comment noting this would be helpful for future maintainers.

💡 Optional: Add security note
```diff
     # Load on CPU to avoid failures when the checkpoint was saved from a different
     # GPU mapping
+    # Note: weights_only=False is required for modelopt state but should only be used
+    # with trusted checkpoint files.
     modelopt_state = torch.load(
         quant_config["modelopt_state_path"], weights_only=False, map_location="cpu"
     )
```
235-242: Asymmetric key validation is intentional but could use a comment. The code raises an error when model keys are missing from the checkpoint but only warns when the checkpoint has extra keys. This asymmetry makes sense (the model requires all its quantizers to be loaded), but a brief comment explaining the rationale would improve clarity.
💡 Optional: Add clarifying comment
```diff
+    # Checkpoint may have extra keys (e.g., from PP sharding), but model must have
+    # all its quantizer keys present in the checkpoint for correct loading
     for key in checkpoint_quant_keys:
         if key not in model_quant_keys:
             print(f"Key {key} not found in model state dict, but exists in checkpoint")
     for key in model_quant_keys:
         if key not in checkpoint_quant_keys:
             raise ValueError(
                 f"Key {key} not found in checkpoint state dict, but exists in model"
             )
```
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
What does this PR do?
Type of change: new feature
Overview:
- Save the quantizer state dict to `QUANT_FILE_PATH` instead of only amax.

Usage
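The command-level usage details are not captured above; purely as an illustration, the export side might be driven as below. The exporter name `export_vllm_fakequant_checkpoint` is hypothetical (the real entry point lives in `modelopt/torch/export/plugins/vllm_fakequant_hf.py` and, per its docstring, takes the quantized model and an export directory), and the quantization call simply mirrors the standard ModelOpt PTQ flow. In the example scripts this export is gated behind the new `--export_vllm_fq` flag.

```python
# Illustrative only: quantize an HF model with ModelOpt, then export the weights
# plus quantizer/modelopt state for later vLLM fakequant reload.
# `export_vllm_fakequant_checkpoint` is a hypothetical name for this PR's exporter.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("<hf_model_dir>")

# Calibrate/quantize with a standard ModelOpt config (calibration loop omitted here).
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=None)

# New export pathway from this PR: saves weights in the original dtype plus the
# quantizer state dict and modelopt state alongside them.
from modelopt.torch.export.plugins.vllm_fakequant_hf import (  # hypothetical name
    export_vllm_fakequant_checkpoint,
)

export_vllm_fakequant_checkpoint(model, export_dir="<export_dir>")
```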
Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
Release Notes
New Features
- `--export_vllm_fq` flag to enable exporting vLLM-compatible fakequant checkpoints

Documentation
Refactor