
Conversation

@zhxchen17 zhxchen17 commented Oct 21, 2025

Summary:

Fixes issue #27283.

Enabling `enable_prompt_embeds` causes the `input_ids` argument to be `None` instead of a tensor, which invalidates the compile cache at the vLLM level. Previously this wasn't an issue because Inductor has its own cache validation that serves as the last line of defence.

Now that AOT compilation is enabled, the Dynamo bytecode is also cached, so we need to guard it against input type changes (e.g. `Tensor` -> `None` here).
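To illustrate the failure mode, here is a minimal, hypothetical sketch of the kind of type guard the cached bytecode would need. This is not Dynamo's real guard machinery; `make_type_guard` and the stand-in inputs are invented for illustration only.

```python
# Hypothetical sketch: a cached compiled artifact must be guarded on the
# *type* of each input, because bytecode specialized for a Tensor input_ids
# is invalid when input_ids becomes None.
def make_type_guard(example_inputs):
    expected = [type(x) for x in example_inputs]

    def check(actual_inputs):
        # Cache hit is only safe if every input has the type seen at compile time.
        return [type(x) for x in actual_inputs] == expected

    return check

guard = make_type_guard([object()])  # object() stands in for a Tensor input_ids
assert guard([object()])             # same input type: reuse is safe
assert not guard([None])             # Tensor -> None: must recompile
```

Because vLLM discards Dynamo's guards, nothing performs this check today, which is why the cache key itself has to change instead.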

There are two ways to do this:

  1. Use Dynamo guards, so the change is guarded at the torch.compile level.
  2. Add `enable_prompt_embeds` to `compute_hash`, so the change is guarded at the vLLM level.

In the short term, option 2 seems to be the better approach, because vLLM already throws away all the guards from Dynamo, and re-enabling them would be a non-trivial change to the existing code.
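The effect of option 2 can be sketched as follows. This is a simplified stand-in for vLLM's factors-list `compute_hash` pattern, not the actual implementation; the function signature and field choices here are assumptions for illustration.

```python
import hashlib

# Minimal sketch of a factors-list compute_hash: every config field appended
# to `factors` becomes part of the cache key, so flipping enable_prompt_embeds
# yields a different hash and forces recompilation instead of reusing a stale
# cached artifact.
def compute_hash(rope_scaling, rope_theta, enable_prompt_embeds):
    factors = [rope_scaling, rope_theta, enable_prompt_embeds]
    return hashlib.sha256(repr(factors).encode()).hexdigest()

key_embeds_off = compute_hash(None, 10000.0, False)
key_embeds_on = compute_hash(None, 10000.0, True)
assert key_embeds_off != key_embeds_on  # distinct cache entries per flag value
```

Any config field that changes the traced graph or the input types must be folded into the key this way, since the key is the only invalidation mechanism left once Dynamo's guards are discarded.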

Test Plan:
(with torch 2.10.dev)
`pytest tests/basic_correctness/test_basic_correctness.py`



@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly addresses a potential cache invalidation issue by including enable_prompt_embeds in the compilation hash. This ensures that changes to this flag, which can alter the input types to the model, will properly trigger a re-compilation. I've added one suggestion to also include runner_type and convert_type in the hash, as they also seem to have a significant impact on the computation graph and could lead to similar caching problems if not included. Overall, this is a good fix.

factors.append(self.rope_scaling)
factors.append(self.rope_theta)
factors.append(self.video_pruning_rate)
factors.append(self.enable_prompt_embeds)
Severity: high

Good catch adding enable_prompt_embeds to the compilation hash.

While reviewing this, I noticed that runner_type and convert_type also seem to affect the computation graph but are not currently included in the hash. These fields can determine which model implementation is used (e.g., for generation vs. pooling) or whether a model adapter is applied, both of which are significant changes to the graph.

To prevent potential cache collisions when switching between runners or converters for the same base model, it would be safer to include them in the hash factors. What do you think about adding them here?

Suggested change:
factors.append(self.enable_prompt_embeds)
factors.append(self.runner_type)
factors.append(self.convert_type)

@zou3519 zou3519 added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 27, 2025
@zhxchen17 zhxchen17 force-pushed the zhxchen17/precompile/enable_prompt_embeds branch from f8dc86e to c2c379a Compare October 27, 2025 15:26
@DarkLight1337 DarkLight1337 merged commit 259504e into vllm-project:main Oct 28, 2025
45 checks passed
bhagyashrigai pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Oct 29, 2025
Signed-off-by: zhxchen17 <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Signed-off-by: Bhagyashri <[email protected]>
ilmarkov pushed a commit to neuralmagic/vllm that referenced this pull request Nov 7, 2025
ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
zhxchen17 added a commit to zhxchen17/vllm that referenced this pull request Nov 25, 2025
Summary:

This is a reland of vllm-project#27285, which regressed in vllm trunk recently.

The rationale is the same as before: `enable_prompt_embeds` changes the `input_ids` argument from a tensor to `None`, and with AOT compilation enabled the cached Dynamo bytecode must be guarded against such input type changes, so the flag is included in `compute_hash`.

In addition, `cpu_offload_gb` is included in the hash, since it affects model inputs and produces a different graph for different offloading configs.
zhxchen17 added a commit to zhxchen17/vllm that referenced this pull request Nov 25, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
