Qualcomm AI Engine Direct - Support multimodal(VLM) runner #16536
Conversation
Summary:
- Runtime support for models
  - SmolVLM 500M
  - InternVL3 1B
- add hybrid mode runtime requantization in multimodal runner
- CI
  - refactor VLM test script
  - add VLM acc/perf runtime tests
- Refactor(VLM)
  - rename embedding forward input for CPU quantization
  - update VLM vision encoder architecture to align with upcoming transformers 5.0 changes
- Documentation
  - add readme for multimodal VLM
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16536
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 19da734 with merge base 9ba1b5d. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "release notes: qualcomm"
Hi @cccclai,
SmolVLM 500M (Hybrid):
Image: https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg
I 00:00:09.501139 executorch:multimodal_runner.cpp:646] RSS after finishing text generation: 419.035156 MiB (0 if unsupported)
I 00:00:09.501241 executorch:stats.h:143] Prompt Tokens: 81 Generated Tokens: 454
I 00:00:09.501285 executorch:stats.h:149] Model Load Time: 0.370000 (seconds)
I 00:00:09.501327 executorch:stats.h:159] Total inference time: 8.935000 (seconds) Rate: 50.811416 (tokens/second)
I 00:00:09.501369 executorch:stats.h:167] Prompt evaluation: 0.117000 (seconds) Rate: 692.307692 (tokens/second)
I 00:00:09.501412 executorch:stats.h:178] Generated 454 tokens: 8.818000 (seconds) Rate: 51.485598 (tokens/second)
I 00:00:09.501472 executorch:stats.h:186] Time to first generated token: 0.117000 (seconds)
I 00:00:09.501512 executorch:stats.h:193] Sampling time over 535 tokens: 0.782000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
PyTorchObserver {"prompt_tokens":81,"generated_tokens":454,"model_load_start_ms":1751021368585,"model_load_end_ms":1751021368955,"inference_start_ms":1751021368955,"inference_end_ms":1751021377890,"prompt_eval_end_ms":1751021369072,"first_token_ms":1751021369072,"aggregate_sampling_time_ms":782,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
/data/local/tmp/yuyazhua/executorch/single_llama/outputs/: 2 files pulled. 1.1 MB/s (2809 bytes in 0.002s)
INFO:root:Results[0]:
<|im_start|>User:<fake_token_around_image><global-img><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><fake_token_around_image>Can you describe this image?<end_of_utterance>
Assistant: The image depicts a serene and picturesque scene of a cityscape. The focal point of the image is a prominent, rectangular structure that appears to be a monument or a significant landmark. This structure is situated on a small, elevated platform, which is likely part of a monument or a memorial. The monument is rectangular in shape and has a smooth, reflective surface, suggesting it might be made of stone or a similar material.
The monument is surrounded by a small, circular area, which is likely a small plaza or a small park. This area is enclosed by a low, low-rise building, which is partially visible in the background. The building has a modern design, with a flat roof and a few windows visible.
In the foreground, there is a large, rectangular stone slab that seems to be part of the monument's base. The stone slab is smooth and reflective, indicating it might be made of granite or another similar material.
The sky above the monument is clear and blue, with no visible clouds, suggesting it is a sunny day. The overall scene is calm and peaceful, with no signs of human activity or movement.
The image does not contain any people, vehicles, or other objects, which helps to focus the viewer's attention on the monument and its surroundings. The absence of any urban elements, such as buildings or roads, also helps to keep the focus on the monument itself.
Given the description, a pure text model can answer questions related to the image by providing a detailed and logical analysis of the elements present in the image. For example, if asked about the type of monument, the model can explain that it is a rectangular stone structure with a smooth, reflective surface. Additionally, if asked about the surrounding area, the model can describe the small, circular plaza or park enclosed by the building.
In summary, the image depicts a rectangular stone monument with a smooth, reflective surface, surrounded by a small, circular plaza or park enclosed by a low, low-rise building. The sky is clear and blue, with no visible clouds, and the overall scene is calm and peaceful. The absence of any human activity or other elements helps to keep the focus on the monument itself.<end_of_utterance>
InternVL3 1B (Hybrid):
Image: http://images.cocodataset.org/val2017/000000039769.jpg
I 00:00:03.613379 executorch:multimodal_runner.cpp:627] RSS after finishing text generation: 612.617188 MiB (0 if unsupported)
I 00:00:03.613412 executorch:stats.h:143] Prompt Tokens: 272 Generated Tokens: 118
I 00:00:03.613422 executorch:stats.h:149] Model Load Time: 0.761000 (seconds)
I 00:00:03.613432 executorch:stats.h:159] Total inference time: 1.770000 (seconds) Rate: 66.666667 (tokens/second)
I 00:00:03.613441 executorch:stats.h:167] Prompt evaluation: 0.197000 (seconds) Rate: 1380.710660 (tokens/second)
I 00:00:03.613451 executorch:stats.h:178] Generated 118 tokens: 1.573000 (seconds) Rate: 75.015893 (tokens/second)
I 00:00:03.613462 executorch:stats.h:186] Time to first generated token: 0.197000 (seconds)
I 00:00:03.613476 executorch:stats.h:193] Sampling time over 390 tokens: 0.097000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
PyTorchObserver {"prompt_tokens":272,"generated_tokens":118,"model_load_start_ms":1749836595295,"model_load_end_ms":1749836596056,"inference_start_ms":1749836596056,"inference_end_ms":1749836597826,"prompt_eval_end_ms":1749836596253,"first_token_ms":1749836596253,"aggregate_sampling_time_ms":97,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
/data/local/tmp/yuyazhua/executorch/single_llama/outputs/: 2 files pulled. 1.6 MB/s (3960 bytes in 0.002s)
INFO:root:Results[0]:
<|im_start|>user:
<img><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT></img>
Can you describe this image?<|im_end|>assistant
The image shows two cats lying on a pink surface, which appears to be a bed or a couch. The cat on the left is a tabby with dark stripes and is lying on its side with its front paws stretched out. The cat on the right is also tabby but with a mix of brown and black stripes. Both cats are relaxed and appear to be sleeping or lying down comfortably. There are two remote controls placed on the pink surface near the cats. The overall scene suggests a cozy and peaceful setting, ideal for a cat to rest.<|im_end|>
cc: @haowhsu-quic Please have a look!
@DannyYuyang-quic Thank you for this PR! I just tested it on SoC SM8550, and it failed at a very late stage. Here is the output of the 1st command; the full log is also attached. The 2nd command failed with the same error. I noticed that adb seemed to disconnect and reconnect, as seen in the Ubuntu taskbar. Here is an error from the 3rd command:
It looks like the issue is related to the device connection failing. Could you share the exact command you used? If your device is connected to a remote host, make sure to include the -H ${host_name} option. Based on your log, the compilation completed successfully, so you can use the --pre_gen_pte option:
python backends/qualcomm/tests/test_qnn_delegate.py TestExampleMultimodalityScript.test_static_vlm -v -b build-android -H ${host_name} -m SM8550 -s ${SERIAL_NUM} -a . --model_name smolvlm_500m_instruct --executorch_root . --pre_gen_pte .
@DannyYuyang-quic Thank you for your reply! I was using the following command. The Android phone is connected to the Linux PC via Type-C.
Hi @luffy-yu,
git submodule sync
git submodule update --init
./install_executorch.sh
./backends/qualcomm/scripts/build.sh
I noticed you're using a very old PyTorch version: 2.8.0+cu128. After rebuilding, could you also run a simple model test to confirm everything works?
python backends/qualcomm/tests/test_qnn_delegate.py TestQNNQuantizedOperator.test_qnn_backend_linear -b build-android -m SM8550 -s ${SERIAL_NUM} -a . --executorch_root .
@DannyYuyang-quic Thank you for your suggestion. I re-cloned this repo and rebuilt it. Then I found the root issue was adb itself.
Adb Issue
I installed adb on Linux via … The working adb is the one from android-tools-adb installed via …
Submodule Command
The submodule command should be …
Test Results on SM8550 - All OK
Some packages need to be installed before running the tests: …
Thank you for your hard work! It helps my project a lot.
cccclai left a comment:
Thank you for the contribution! Great job on this. I'd just like to understand a few more details.
@dataclass(frozen=True)
class MLLMSpecs:
    max_seq_len: int
    SM8650: float
What does SM8650 mean here, and why is it a float?
It's the tokens/second threshold that SM8650 must reach to pass the test.
The token rate is a floating-point number, which is why it's a float.
I see, maybe let's rename it to make it clearer
Sounds good, I’ll rename it to sm8650_token_rate.
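For illustration only, a minimal sketch of what the renamed spec could look like; the field name beyond max_seq_len and the example values are assumptions based on this thread, not code from the PR:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MLLMSpecs:
    # Maximum sequence length the static graph is compiled with.
    max_seq_len: int
    # Tokens/second threshold a run on SM8650 must reach for the perf test to pass.
    sm8650_token_rate: float


# Hypothetical usage with made-up numbers:
smolvlm_500m = MLLMSpecs(max_seq_len=1024, sm8650_token_rate=50.0)
```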
def prequant_algorithm(model, prefill_config, args):
    # TODO: use dtype of model checkpoint
    model = model.to(device=args.device, dtype=torch.float)
    inputs = model.get_example_inputs(use_kv_cache=False)
Does it mean we remove KV cache mode?
No. use_kv_cache is set during model initialization, so get_example_inputs doesn't need to set it again; we reuse self.use_kv_cache inside get_example_inputs:
https://github.com/pytorch/executorch/blob/main/examples/qualcomm/oss_scripts/llama/model/static_llama.py#L541
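For context, a simplified sketch of the pattern described above (not the actual static_llama.py code): the KV-cache flag is captured at construction time and the example-input helper just reads it back.

```python
import torch


class StaticDecoderSketch(torch.nn.Module):
    """Illustrative only; names and shapes are assumptions."""

    def __init__(self, use_kv_cache: bool = True, max_seq_len: int = 128):
        super().__init__()
        # KV-cache mode is decided once, at model initialization...
        self.use_kv_cache = use_kv_cache
        self.max_seq_len = max_seq_len

    def get_example_inputs(self):
        # ...so example inputs simply reuse the stored flag instead of
        # requiring callers to pass it again.
        seq_len = 1 if self.use_kv_cache else self.max_seq_len
        tokens = torch.zeros(1, seq_len, dtype=torch.int64)
        return (tokens,)
```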
pte_path=(
    pte_path
    if not is_modality
    else [pte_path, encoder_pte_path, text_embedding_pte_path]
Are we generating 3 .pte files for multimodal models?
Yes, so users can mix and match different precisions for the encoder or embedding when composing with text_decoder without having to recompile, since most of the runtime cost is in the decoder.
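As a rough illustration of that design choice (hypothetical helper name and file names, not code from this PR), the path handling amounts to:

```python
# Hypothetical helper mirroring the branch in the diff above.
def collect_pte_paths(
    is_modality: bool,
    pte_path: str,
    encoder_pte_path: str | None = None,
    text_embedding_pte_path: str | None = None,
):
    # Text-only models ship a single program; multimodal models ship three,
    # so the encoder / embedding can be swapped (e.g. exported at a different
    # precision) without recompiling the expensive text decoder.
    if not is_modality:
        return pte_path
    return [pte_path, encoder_pte_path, text_embedding_pte_path]


print(collect_pte_paths(True, "decoder.pte", "encoder.pte", "embedding.pte"))
```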
        super().__init__(hidden_size, eps=eps)


@register_norm("gemma3")
Is it expected?
Yes. We set model_config.norm_type to rmsnorm in wrappers.py (https://github.com/pytorch/executorch/blob/main/examples/qualcomm/oss_scripts/llama/wrappers.py#L439-L441) before registration, so there's no need to add extra model names to the decorator here.
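A minimal sketch of the registry pattern being discussed, using assumed names rather than the real wrappers.py implementation:

```python
_NORM_REGISTRY = {}


def register_norm(*names):
    # Decorator mapping one or more names to a norm implementation.
    def wrapper(cls):
        for name in names:
            _NORM_REGISTRY[name] = cls
        return cls
    return wrapper


@register_norm("rmsnorm")
class RMSNormSketch:
    def __init__(self, hidden_size, eps=1e-6):
        self.hidden_size, self.eps = hidden_size, eps


# If the wrapper rewrites model_config.norm_type to "rmsnorm" before the
# lookup, a new model family resolves to the existing entry without adding
# its name to any decorator.
norm_cls = _NORM_REGISTRY["rmsnorm"]
```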
DEFINE_string(
    embedding_path,
    "embedding.pte",
    "Path to embedding model serialized in flatbuffer format.");
| "Path to embedding model serialized in flatbuffer format."); | |
| "Path to embedding model serialized in .pte format."); |
Thanks for applying the suggestion!
I noticed that mainline still uses flatbuffer in many places instead of .pte. Should we standardize on one format~?
I see, maybe let's do it in a separate PR...
namespace example {

template <typename T>
void MultimodalLhdTokenGenerator<T>::prepare_io(
Does it mean we also support lookahead with multimodal?
Yes, lookahead works with multimodal and can be run. However, it hasn't been calibrated on a full dataset yet, only with a single prompt, so accuracy isn't great for now.
// Extend DecoderModelVersion enum with multimodal models
enum MultimodalDecoderModelVersion {
  kSmolvlm = 0,
  kInternvl3,
Why is kSmolvlm set to 0 while kInternvl3 has no value defined?
kSmolVLM is 0 because we set it explicitly for clarity. kInternVL3 is defined and has the implicit value 1, but I can make both values explicit to avoid ambiguity~
self.meta = self.decoder.get_metadata()

# check if sharding required
if self.decoder and self.config.num_sharding > 1:
Do we shard any of the multimodal models?
Yes, but we only shard the text decoder part of the multimodal model.
@luffy-yu, thank you for sharing the ADB issue you encountered. Glad to hear you got it running successfully and really appreciate the detailed explanation of how you solved it!
@DannyYuyang-quic, I found a potential issue with the command: inference with --pre_gen_pte. The fix is as follows.
@luffy-yu, good catch, and thanks for the patch. I'll land it and test it.
…rence with --pre_gen_pte
Fix lint error
Summary:
annotate_prefill_kv_output effectively narrows the output gap between hybrid mode and KV mode. However, applying the same method to multimodal models does not work (bad results). To achieve decent results in hybrid mode, we dequantize the KV cache right after prefilling and re-quantize it based on the decoder's input cache at runtime.
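A rough sketch of the dequantize/re-quantize step described above, with arbitrary per-tensor scales and zero-points; the actual runner performs this in C++ inside the multimodal runner, so this is illustrative only:

```python
import numpy as np


def requantize_kv(kv_q, prefill_scale, prefill_zp, decode_scale, decode_zp):
    # 1) Dequantize the KV cache produced by the prefill graph...
    kv_fp = (kv_q.astype(np.float32) - prefill_zp) * prefill_scale
    # 2) ...then re-quantize with the decoder's input-cache parameters so the
    #    decode graph sees values in the range it was calibrated for.
    kv_requant = np.round(kv_fp / decode_scale) + decode_zp
    return np.clip(kv_requant, 0, 255).astype(np.uint8)


# Arbitrary example parameters (not taken from the models in this PR):
kv = np.array([10, 200, 37], dtype=np.uint8)
print(requantize_kv(kv, prefill_scale=0.02, prefill_zp=128, decode_scale=0.03, decode_zp=128))
```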
Test plan
SmolVLM
Perf: ~63 TPS on SM8750
InternVL3
Perf: ~17 TPS on SM8750
Script
SmolVLM
InternVL3