
Conversation

@zhxchen17 zhxchen17 commented Oct 21, 2025

Summary:

Fixes issue #27283.

Enabling `enable_prompt_embeds` causes the `input_ids` argument to be `None` instead of a tensor, which invalidates the compile cache at the vLLM level. Previously this wasn't an issue because Inductor has its own cache validation that serves as the last line of defence.

Now that AOT compilation is enabled, the Dynamo bytecode is also cached, so we need to guard it against input type changes (e.g. `Tensor` -> `None` here).
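To illustrate the failure mode, here is a minimal, hypothetical sketch of the kind of type guard the cached bytecode would need. This is not Dynamo's real guard machinery; `make_type_guard` and the stand-in inputs are invented for illustration only.

```python
# Hypothetical sketch: a cached compiled artifact must be guarded on the
# *type* of each input, because bytecode specialized for a Tensor input_ids
# is invalid when input_ids becomes None.
def make_type_guard(example_inputs):
    expected = [type(x) for x in example_inputs]

    def check(actual_inputs):
        # Cache hit is only safe if every input has the type seen at compile time.
        return [type(x) for x in actual_inputs] == expected

    return check

guard = make_type_guard([object()])  # object() stands in for a Tensor input_ids
assert guard([object()])             # same input type: reuse is safe
assert not guard([None])             # Tensor -> None: must recompile
```

Because vLLM discards Dynamo's guards, nothing performs this check today, which is why the cache key itself has to change instead.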

There are two ways to do this:

  1. Use Dynamo guards, so the change is guarded at the torch.compile level.
  2. Add `enable_prompt_embeds` to `compute_hash`, so the change is guarded at the vLLM level.

In the short term, option 2 seems to be the better approach, because vLLM already throws away all the guards from Dynamo, and re-enabling them would be a non-trivial change to the existing code.
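The effect of option 2 can be sketched as follows. This is a simplified stand-in for vLLM's factors-list `compute_hash` pattern, not the actual implementation; the function signature and field choices here are assumptions for illustration.

```python
import hashlib

# Minimal sketch of a factors-list compute_hash: every config field appended
# to `factors` becomes part of the cache key, so flipping enable_prompt_embeds
# yields a different hash and forces recompilation instead of reusing a stale
# cached artifact.
def compute_hash(rope_scaling, rope_theta, enable_prompt_embeds):
    factors = [rope_scaling, rope_theta, enable_prompt_embeds]
    return hashlib.sha256(repr(factors).encode()).hexdigest()

key_embeds_off = compute_hash(None, 10000.0, False)
key_embeds_on = compute_hash(None, 10000.0, True)
assert key_embeds_off != key_embeds_on  # distinct cache entries per flag value
```

Any config field that changes the traced graph or the input types must be folded into the key this way, since the key is the only invalidation mechanism left once Dynamo's guards are discarded.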

Test Plan:
(with torch 2.10.dev)
`pytest tests/basic_correctness/test_basic_correctness.py`



@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly addresses a potential cache invalidation issue by including enable_prompt_embeds in the compilation hash. This ensures that changes to this flag, which can alter the input types to the model, will properly trigger a re-compilation. I've added one suggestion to also include runner_type and convert_type in the hash, as they also seem to have a significant impact on the computation graph and could lead to similar caching problems if not included. Overall, this is a good fix.

factors.append(self.rope_scaling)
factors.append(self.rope_theta)
factors.append(self.video_pruning_rate)
factors.append(self.enable_prompt_embeds)
Severity: high

Good catch adding enable_prompt_embeds to the compilation hash.

While reviewing this, I noticed that runner_type and convert_type also seem to affect the computation graph but are not currently included in the hash. These fields can determine which model implementation is used (e.g., for generation vs. pooling) or whether a model adapter is applied, both of which are significant changes to the graph.

To prevent potential cache collisions when switching between runners or converters for the same base model, it would be safer to include them in the hash factors. What do you think about adding them here?

Suggested change:
factors.append(self.enable_prompt_embeds)
factors.append(self.runner_type)
factors.append(self.convert_type)

@zou3519 zou3519 added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 27, 2025
@zhxchen17 zhxchen17 force-pushed the zhxchen17/precompile/enable_prompt_embeds branch from f8dc86e to c2c379a Compare October 27, 2025 15:26
@DarkLight1337 DarkLight1337 merged commit 259504e into vllm-project:main Oct 28, 2025
45 checks passed
bhagyashrigai pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Oct 29, 2025
Signed-off-by: zhxchen17 <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Signed-off-by: Bhagyashri <[email protected]>
ilmarkov pushed a commit to neuralmagic/vllm that referenced this pull request Nov 7, 2025
ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
zhxchen17 added a commit to zhxchen17/vllm that referenced this pull request Nov 25, 2025
Summary:

This is a reland of vllm-project#27285, which regressed in vllm trunk recently.

The rationale is the same as before: `enable_prompt_embeds` changes the `input_ids` argument from a tensor to `None`, and with AOT compilation enabled the cached Dynamo bytecode must be guarded against such input type changes, so the flag is included in `compute_hash`.

In addition, `cpu_offload_gb` is included in the hash, since it affects model inputs and produces a different graph for different offloading configs.
zhxchen17 added a commit to zhxchen17/vllm that referenced this pull request Nov 25, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
