Notifications
[compile] Add enable_prompt_embeds to compile hash. #27285
Conversation
Code Review
This pull request correctly addresses a potential cache-invalidation issue by including enable_prompt_embeds in the compilation hash. This ensures that changes to this flag, which can alter the input types to the model, properly trigger recompilation. I've added one suggestion to also include runner_type and convert_type in the hash, as they also appear to have a significant impact on the computation graph and could lead to similar caching problems if omitted. Overall, this is a good fix.
    factors.append(self.rope_scaling)
    factors.append(self.rope_theta)
    factors.append(self.video_pruning_rate)
    factors.append(self.enable_prompt_embeds)
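The hunk above appends config fields to a factor list that feeds the compilation hash. As a minimal sketch (not vLLM's actual implementation — the class and field names here are illustrative), this is how folding a flag into the factor list forces a distinct cache key whenever the flag changes:

```python
# Hypothetical sketch of factor-based cache hashing: any change to a
# field that participates in the factor list changes the digest, so a
# compiled artifact cached under the old digest is never reused.
import hashlib


class ModelConfig:
    def __init__(self, rope_theta: float, enable_prompt_embeds: bool):
        self.rope_theta = rope_theta
        self.enable_prompt_embeds = enable_prompt_embeds

    def compute_hash(self) -> str:
        factors = []
        factors.append(self.rope_theta)
        factors.append(self.enable_prompt_embeds)  # the fix: include the flag
        return hashlib.sha256(str(factors).encode()).hexdigest()


a = ModelConfig(rope_theta=10000.0, enable_prompt_embeds=False)
b = ModelConfig(rope_theta=10000.0, enable_prompt_embeds=True)
print(a.compute_hash() != b.compute_hash())  # True: flag change -> new key
```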
Good catch adding enable_prompt_embeds to the compilation hash.
While reviewing this, I noticed that runner_type and convert_type also seem to affect the computation graph but are not currently included in the hash. These fields can determine which model implementation is used (e.g., for generation vs. pooling) or whether a model adapter is applied, both of which are significant changes to the graph.
To prevent potential cache collisions when switching between runners or converters for the same base model, it would be safer to include them in the hash factors. What do you think about adding them here?
Suggested change:

    factors.append(self.enable_prompt_embeds)
    factors.append(self.runner_type)
    factors.append(self.convert_type)
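To illustrate the collision the reviewer is worried about, here is a hedged sketch (factor values are invented for illustration): two configs that build different graphs hash identically when the distinguishing field is left out of the factor list.

```python
# Hypothetical illustration: if runner_type is not folded into the
# hash, a "generate" config and a "pooling" config for the same base
# model collide on the same cache key despite producing different graphs.
import hashlib


def compute_hash(factors: list) -> str:
    return hashlib.sha256(str(factors).encode()).hexdigest()


base = [10000.0, False]  # e.g. rope_theta, enable_prompt_embeds

# runner_type omitted from the factors -> both configs share one key:
print(compute_hash(base) == compute_hash(base))  # True: collision

# Including runner_type disambiguates the two graphs:
print(compute_hash(base + ["generate"]) == compute_hash(base + ["pooling"]))  # False
```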
Summary: Fixing issue vllm-project#27283. `enable_prompt_embeds` makes the `input_ids` argument `None` instead of a tensor, which invalidates the compile cache at the vLLM level. Previously this wasn't an issue because Inductor has its own caching validation that serves as the last line of defence. Now that AOT compilation is enabled, the dynamo bytecode is also cached, so it must be guarded against input type changes (e.g. Tensor -> None here).

There are two ways to do this:
1. Use dynamo guards, so this is guarded at the torch.compile level.
2. Add `enable_prompt_embeds` to `compute_hash`, so this is guarded at the vLLM level.

In the short term, option 2 seems to be the better approach because vLLM already throws away all the guards from dynamo, and enabling them would be a non-trivial change to the existing code.

Test Plan: (with torch 2.10.dev) `pytest tests/basic_correctness/test_basic_correctness.py`

Signed-off-by: zhxchen17 <[email protected]>
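The failure mode described above can be sketched without torch: a cache keyed only on a config hash that ignores the flag will hand back a stale compiled artifact when the input type flips from a tensor to `None`. All names below are illustrative, not vLLM's actual API:

```python
# Hypothetical sketch of the stale-cache bug: the cache key does not
# reflect whether input_ids is a tensor or None, so the artifact
# compiled for the tensor path is wrongly reused for the None path.
compiled_cache: dict[str, str] = {}


def compile_model(input_ids_is_none: bool) -> str:
    # Stand-in for expensive torch.compile / dynamo work.
    return "embeds-path" if input_ids_is_none else "ids-path"


def get_compiled(config_hash: str, input_ids) -> str:
    key = config_hash  # BUG if enable_prompt_embeds is not in the hash
    if key not in compiled_cache:
        compiled_cache[key] = compile_model(input_ids is None)
    return compiled_cache[key]


# Same hash for both configs -> stale artifact reused for None input:
print(get_compiled("h1", [1, 2, 3]))  # ids-path
print(get_compiled("h1", None))       # ids-path (stale!)
# With the flag folded into the hash, the keys differ and each path compiles:
print(get_compiled("h2", None))       # embeds-path
```

This is exactly what including `enable_prompt_embeds` in `compute_hash` fixes: the two input types can no longer share a cache key.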
Signed-off-by: zhxchen17 <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Signed-off-by: Bhagyashri <[email protected]>
Signed-off-by: zhxchen17 <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>
Summary: This is a reland of vllm-project#27285, since it regressed in the vLLM trunk recently. `enable_prompt_embeds` makes the `input_ids` argument `None` instead of a tensor, which invalidates the compile cache at the vLLM level. Previously this wasn't an issue because Inductor has its own caching validation that serves as the last line of defence. Now that AOT compilation is enabled, the dynamo bytecode is also cached, so it must be guarded against input type changes (e.g. Tensor -> None here).

There are two ways to do this:
1. Use dynamo guards, so this is guarded at the torch.compile level.
2. Add `enable_prompt_embeds` to `compute_hash`, so this is guarded at the vLLM level.

In the short term, option 2 seems to be the better approach because vLLM already throws away all the guards from dynamo, and enabling them would be a non-trivial change to the existing code. `cpu_offload_gb` also affects model inputs, since different offloading configs produce different graphs.

Signed-off-by: zhxchen17 <[email protected]>
Summary:
Fixing issue #27283

`enable_prompt_embeds` makes the `input_ids` argument `None` instead of a tensor, which invalidates the compile cache at the vLLM level. Previously this wasn't an issue because Inductor has its own caching validation that serves as the last line of defence. Now that AOT compilation is enabled, the dynamo bytecode is also cached, so it must be guarded against input type changes (e.g. Tensor -> None here).

There are two ways to do this:
1. Use dynamo guards, so this is guarded at the torch.compile level.
2. Add `enable_prompt_embeds` to `compute_hash`, so this is guarded at the vLLM level.

In the short term, option 2 seems to be the better approach because vLLM already throws away all the guards from dynamo, and enabling them would be a non-trivial change to the existing code.

Test Plan:
(with torch 2.10.dev)
`pytest tests/basic_correctness/test_basic_correctness.py`