Qualcomm AI Engine Direct - Support multimodal(VLM) runner #16536
Conversation
Summary:
- Runtime support for models
  - SmolVLM 500M
  - InternVL3 1B
- add hybrid mode runtime requantization in multimodal runner
- CI
  - refactor VLM test script
  - add VLM acc/perf runtime tests
- Refactor(VLM)
  - rename embedding forward input for CPU quantization
  - update VLM vision encoder architecture to align with upcoming transformers 5.0 changes
- Documentation
  - add readme for multimodal VLM
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16536
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 19da734 with merge base 9ba1b5d. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "release notes: qualcomm"
Hi @cccclai,
SmolVLM 500M (Hybrid):
Image: https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg
I 00:00:09.501139 executorch:multimodal_runner.cpp:646] RSS after finishing text generation: 419.035156 MiB (0 if unsupported)
I 00:00:09.501241 executorch:stats.h:143] Prompt Tokens: 81 Generated Tokens: 454
I 00:00:09.501285 executorch:stats.h:149] Model Load Time: 0.370000 (seconds)
I 00:00:09.501327 executorch:stats.h:159] Total inference time: 8.935000 (seconds) Rate: 50.811416 (tokens/second)
I 00:00:09.501369 executorch:stats.h:167] Prompt evaluation: 0.117000 (seconds) Rate: 692.307692 (tokens/second)
I 00:00:09.501412 executorch:stats.h:178] Generated 454 tokens: 8.818000 (seconds) Rate: 51.485598 (tokens/second)
I 00:00:09.501472 executorch:stats.h:186] Time to first generated token: 0.117000 (seconds)
I 00:00:09.501512 executorch:stats.h:193] Sampling time over 535 tokens: 0.782000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
PyTorchObserver {"prompt_tokens":81,"generated_tokens":454,"model_load_start_ms":1751021368585,"model_load_end_ms":1751021368955,"inference_start_ms":1751021368955,"inference_end_ms":1751021377890,"prompt_eval_end_ms":1751021369072,"first_token_ms":1751021369072,"aggregate_sampling_time_ms":782,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
/data/local/tmp/yuyazhua/executorch/single_llama/outputs/: 2 files pulled. 1.1 MB/s (2809 bytes in 0.002s)
INFO:root:Results[0]:
<|im_start|>User:<fake_token_around_image><global-img><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><fake_token_around_image>Can you describe this image?<end_of_utterance>
Assistant: The image depicts a serene and picturesque scene of a cityscape. The focal point of the image is a prominent, rectangular structure that appears to be a monument or a significant landmark. This structure is situated on a small, elevated platform, which is likely part of a monument or a memorial. The monument is rectangular in shape and has a smooth, reflective surface, suggesting it might be made of stone or a similar material.
The monument is surrounded by a small, circular area, which is likely a small plaza or a small park. This area is enclosed by a low, low-rise building, which is partially visible in the background. The building has a modern design, with a flat roof and a few windows visible.
In the foreground, there is a large, rectangular stone slab that seems to be part of the monument's base. The stone slab is smooth and reflective, indicating it might be made of granite or another similar material.
The sky above the monument is clear and blue, with no visible clouds, suggesting it is a sunny day. The overall scene is calm and peaceful, with no signs of human activity or movement.
The image does not contain any people, vehicles, or other objects, which helps to focus the viewer's attention on the monument and its surroundings. The absence of any urban elements, such as buildings or roads, also helps to keep the focus on the monument itself.
Given the description, a pure text model can answer questions related to the image by providing a detailed and logical analysis of the elements present in the image. For example, if asked about the type of monument, the model can explain that it is a rectangular stone structure with a smooth, reflective surface. Additionally, if asked about the surrounding area, the model can describe the small, circular plaza or park enclosed by the building.
In summary, the image depicts a rectangular stone monument with a smooth, reflective surface, surrounded by a small, circular plaza or park enclosed by a low, low-rise building. The sky is clear and blue, with no visible clouds, and the overall scene is calm and peaceful. The absence of any human activity or other elements helps to keep the focus on the monument itself.<end_of_utterance>
InternVL3 1B (Hybrid):
Image: http://images.cocodataset.org/val2017/000000039769.jpg
I 00:00:03.613379 executorch:multimodal_runner.cpp:627] RSS after finishing text generation: 612.617188 MiB (0 if unsupported)
I 00:00:03.613412 executorch:stats.h:143] Prompt Tokens: 272 Generated Tokens: 118
I 00:00:03.613422 executorch:stats.h:149] Model Load Time: 0.761000 (seconds)
I 00:00:03.613432 executorch:stats.h:159] Total inference time: 1.770000 (seconds) Rate: 66.666667 (tokens/second)
I 00:00:03.613441 executorch:stats.h:167] Prompt evaluation: 0.197000 (seconds) Rate: 1380.710660 (tokens/second)
I 00:00:03.613451 executorch:stats.h:178] Generated 118 tokens: 1.573000 (seconds) Rate: 75.015893 (tokens/second)
I 00:00:03.613462 executorch:stats.h:186] Time to first generated token: 0.197000 (seconds)
I 00:00:03.613476 executorch:stats.h:193] Sampling time over 390 tokens: 0.097000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
PyTorchObserver {"prompt_tokens":272,"generated_tokens":118,"model_load_start_ms":1749836595295,"model_load_end_ms":1749836596056,"inference_start_ms":1749836596056,"inference_end_ms":1749836597826,"prompt_eval_end_ms":1749836596253,"first_token_ms":1749836596253,"aggregate_sampling_time_ms":97,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
/data/local/tmp/yuyazhua/executorch/single_llama/outputs/: 2 files pulled. 1.6 MB/s (3960 bytes in 0.002s)
INFO:root:Results[0]:
<|im_start|>user:
<img><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT></img>
Can you describe this image?<|im_end|>assistant
The image shows two cats lying on a pink surface, which appears to be a bed or a couch. The cat on the left is a tabby with dark stripes and is lying on its side with its front paws stretched out. The cat on the right is also tabby but with a mix of brown and black stripes. Both cats are relaxed and appear to be sleeping or lying down comfortably. There are two remote controls placed on the pink surface near the cats. The overall scene suggests a cozy and peaceful setting, ideal for a cat to rest.<|im_end|>
cc: @haowhsu-quic Please have a look!
@DannyYuyang-quic Thank you for this PR! I just tested it on SoC SM8550, and it failed at a very late stage. Here is the output of the 1st command; the full log is also attached. The 2nd command failed with the same error. I noticed that adb seemed to disconnect and reconnect, as seen in the Ubuntu taskbar. Here is an error from the 3rd command:
It looks like the issue is related to the device connection failing. Could you share the exact command you used? If your device is connected to a remote host, make sure to include the -H ${host_name} option. Based on your log, the compilation completed successfully, so you can use the --pre_gen_pte option:
python backends/qualcomm/tests/test_qnn_delegate.py TestExampleMultimodalityScript.test_static_vlm -v -b build-android -H ${host_name} -m SM8550 -s ${SERIAL_NUM} -a . --model_name smolvlm_500m_instruct --executorch_root . --pre_gen_pte .
@DannyYuyang-quic Thank you for your reply! I was using the following command. The Android phone is connected to the Linux PC via Type-C.
Hi @luffy-yu,
git submodule sync
git submodule update --init
./install_executorch.sh
./backends/qualcomm/scripts/build.sh
I noticed you're using a very old PyTorch version: 2.8.0+cu128. After rebuilding, could you also run a simple model test to confirm everything works?
python backends/qualcomm/tests/test_qnn_delegate.py TestQNNQuantizedOperator.test_qnn_backend_linear -b build-android -m SM8550 -s ${SERIAL_NUM} -a . --executorch_root .
@DannyYuyang-quic Thank you for your suggestion. I re-cloned this repo and rebuilt it. Then I found the root issue was adb itself.
Adb Issue
I installed adb on Linux via … The working adb is the one from android-tools-adb installed via …
Submodule Command
The submodule command should be …
Test Results on SM8550 - All OK
Some packages need to be installed before running the tests: …
Thank you for your hard work! It helps my project a lot.
cccclai left a comment:
Thank you for the contribution! Great job on this. I'd just like to understand a few more details.
@dataclass(frozen=True)
class MLLMSpecs:
    max_seq_len: int
    SM8650: float
What does SM8650 mean here, and why is it a float?
It's the tokens/second threshold that SM8650 must reach to pass the test.
The token rate is a floating-point number, which is why it's a float.
I see, maybe let's rename it to make it clearer
Sounds good, I’ll rename it to sm8650_token_rate.
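For illustration only, a minimal sketch of what the renamed spec could look like; the field name beyond max_seq_len and the example values are assumptions based on this thread, not code from the PR:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MLLMSpecs:
    # Maximum sequence length the static graph is compiled with.
    max_seq_len: int
    # Tokens/second threshold a run on SM8650 must reach for the perf test to pass.
    sm8650_token_rate: float


# Hypothetical usage with made-up numbers:
smolvlm_500m = MLLMSpecs(max_seq_len=1024, sm8650_token_rate=50.0)
```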
def prequant_algorithm(model, prefill_config, args):
    # TODO: use dtype of model checkpoint
    model = model.to(device=args.device, dtype=torch.float)
    inputs = model.get_example_inputs(use_kv_cache=False)
Does it mean we remove KV cache mode?
No. use_kv_cache is set during model initialization, so get_example_inputs doesn't need to set it again; we reuse self.use_kv_cache inside get_example_inputs:
https://github.com/pytorch/executorch/blob/main/examples/qualcomm/oss_scripts/llama/model/static_llama.py#L541
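For context, a simplified sketch of the pattern described above (not the actual static_llama.py code): the KV-cache flag is captured at construction time and the example-input helper just reads it back.

```python
import torch


class StaticDecoderSketch(torch.nn.Module):
    """Illustrative only; names and shapes are assumptions."""

    def __init__(self, use_kv_cache: bool = True, max_seq_len: int = 128):
        super().__init__()
        # KV-cache mode is decided once, at model initialization...
        self.use_kv_cache = use_kv_cache
        self.max_seq_len = max_seq_len

    def get_example_inputs(self):
        # ...so example inputs simply reuse the stored flag instead of
        # requiring callers to pass it again.
        seq_len = 1 if self.use_kv_cache else self.max_seq_len
        tokens = torch.zeros(1, seq_len, dtype=torch.int64)
        return (tokens,)
```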
pte_path=(
    pte_path
    if not is_modality
    else [pte_path, encoder_pte_path, text_embedding_pte_path]
Are we generating 3 .pte files for multimodal models?
Yes, so users can mix and match different precisions for the encoder or embedding when composing with text_decoder without having to recompile, since most of the runtime cost is in the decoder.
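As a rough illustration of that design choice (hypothetical helper name and file names, not code from this PR), the path handling amounts to:

```python
# Hypothetical helper mirroring the branch in the diff above.
def collect_pte_paths(
    is_modality: bool,
    pte_path: str,
    encoder_pte_path: str | None = None,
    text_embedding_pte_path: str | None = None,
):
    # Text-only models ship a single program; multimodal models ship three,
    # so the encoder / embedding can be swapped (e.g. exported at a different
    # precision) without recompiling the expensive text decoder.
    if not is_modality:
        return pte_path
    return [pte_path, encoder_pte_path, text_embedding_pte_path]


print(collect_pte_paths(True, "decoder.pte", "encoder.pte", "embedding.pte"))
```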
        super().__init__(hidden_size, eps=eps)


@register_norm("gemma3")
Is it expected?
Yes. We set model_config.norm_type to rmsnorm in wrappers.py (https://github.com/pytorch/executorch/blob/main/examples/qualcomm/oss_scripts/llama/wrappers.py#L439-L441) before registration, so there's no need to add extra model names to the decorator here.
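A minimal sketch of the registry pattern being discussed, using assumed names rather than the real wrappers.py implementation:

```python
_NORM_REGISTRY = {}


def register_norm(*names):
    # Decorator mapping one or more names to a norm implementation.
    def wrapper(cls):
        for name in names:
            _NORM_REGISTRY[name] = cls
        return cls
    return wrapper


@register_norm("rmsnorm")
class RMSNormSketch:
    def __init__(self, hidden_size, eps=1e-6):
        self.hidden_size, self.eps = hidden_size, eps


# If the wrapper rewrites model_config.norm_type to "rmsnorm" before the
# lookup, a new model family resolves to the existing entry without adding
# its name to any decorator.
norm_cls = _NORM_REGISTRY["rmsnorm"]
```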
DEFINE_string(
    embedding_path,
    "embedding.pte",
    "Path to embedding model serialized in flatbuffer format.");
| "Path to embedding model serialized in flatbuffer format."); | |
| "Path to embedding model serialized in .pte format."); |
Thanks for applying the suggestion!
I noticed that mainline still uses flatbuffer in many places instead of .pte. Should we standardize on one format~?
I see, maybe let's do it in a separate PR...
namespace example {

template <typename T>
void MultimodalLhdTokenGenerator<T>::prepare_io(
Does it mean we also support lookahead with multimodal?
Yes, lookahead works with multimodal and can be run. However, it hasn't been calibrated on a full dataset yet, only with a single prompt, so accuracy isn't great for now.
// Extend DecoderModelVersion enum with multimodal models
enum MultimodalDecoderModelVersion {
  kSmolvlm = 0,
  kInternvl3,
Why is kSmolvlm set to 0 while kInternvl3 has no value defined?
kSmolVLM is 0 because we set it explicitly for clarity. kInternVL3 is defined and has the implicit value 1, but I can make both values explicit to avoid ambiguity~
self.meta = self.decoder.get_metadata()

# check if sharding required
if self.decoder and self.config.num_sharding > 1:
Do we shard any of the multimodal models?
Yes, but we only shard the text decoder part of the multimodal model.
@luffy-yu, thank you for sharing the ADB issue you encountered. Glad to hear you got it running successfully and really appreciate the detailed explanation of how you solved it!
@DannyYuyang-quic, I found a potential issue with the command: inference with --pre_gen_pte. The fix is as follows.
@luffy-yu, good catch, and thanks for the patch. I'll land it and test it.
…rence with --pre_gen_pte
Fix lint error
Summary:
annotate_prefill_kv_output effectively narrows the output gap between hybrid mode and KV mode. However, applying the same method to multimodal models does not work (bad results). To achieve decent results in hybrid mode, we dequantize the KV cache right after prefilling and re-quantize it based on the decoder's input cache at runtime.
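A rough sketch of the dequantize/re-quantize step described above, with arbitrary per-tensor scales and zero-points; the actual runner performs this in C++ inside the multimodal runner, so this is illustrative only:

```python
import numpy as np


def requantize_kv(kv_q, prefill_scale, prefill_zp, decode_scale, decode_zp):
    # 1) Dequantize the KV cache produced by the prefill graph...
    kv_fp = (kv_q.astype(np.float32) - prefill_zp) * prefill_scale
    # 2) ...then re-quantize with the decoder's input-cache parameters so the
    #    decode graph sees values in the range it was calibrated for.
    kv_requant = np.round(kv_fp / decode_scale) + decode_zp
    return np.clip(kv_requant, 0, 255).astype(np.uint8)


# Arbitrary example parameters (not taken from the models in this PR):
kv = np.array([10, 200, 37], dtype=np.uint8)
print(requantize_kv(kv, prefill_scale=0.02, prefill_zp=128, decode_scale=0.03, decode_zp=128))
```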
Test plan
SmolVLM
Perf: ~63 TPS on SM8750
InternVL3
Perf: ~17 TPS on SM8750
Script
SmolVLM
InternVL3