
Conversation

Member

@hmellor hmellor commented Feb 19, 2025

Follow-up to #11967, which removes kv_cache and attn_metadata from all model definitions.

Summary of changes:

  • Remove kv_cache and attn_metadata from the Attention.forward() args
  • Use the forward context instead of the attn_metadata arg in MambaMixer and MambaMixer2 (see the sketch after this list)
  • Remove kv_caches, kv_cache and attn_metadata from the forward() args of all model modules
  • Remove kv_caches and attn_metadata from the new model docs
  • Leave the kv_caches arg (but avoid using it) in all child classes of ModelRunnerBase.execute_model() to avoid further complication
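
For reference, a minimal sketch of what the new access pattern looks like from a model layer's perspective (the layer name and body here are hypothetical, not code from this PR):

```python
import torch
from torch import nn

from vllm.forward_context import get_forward_context


class ToyDecoderLayer(nn.Module):
    """Illustrative layer only; real vLLM models follow the same pattern."""

    def forward(self, positions: torch.Tensor,
                hidden_states: torch.Tensor) -> torch.Tensor:
        # Previously the runner threaded `kv_caches` and `attn_metadata`
        # through every forward() call. After this PR the runner wraps the
        # model call in a forward context, so a layer that needs the metadata
        # reads it from there instead of taking it as an argument.
        attn_metadata = get_forward_context().attn_metadata
        # ... attention would be computed here; each Attention layer's cache
        # now lives on the layer itself (e.g. self.attn.kv_cache), not in a
        # `kv_caches` argument.
        return hidden_states
```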


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the documentation (Improvements or additions to documentation) label Feb 19, 2025
@hmellor hmellor added the ready (ONLY add when PR is ready to merge/full CI is needed) label Feb 19, 2025
@hmellor hmellor marked this pull request as ready for review February 19, 2025 13:35
@DarkLight1337
Member

Can you fix the failure in helm chart CI?

Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
@hmellor
Member Author

hmellor commented Feb 19, 2025

Can you fix the failure in helm chart CI?

This failure is likely due to the CPU runner not being updated correctly. If I can fix the CPU model runner in buildkite then this workflow should also pass.

@hmellor
Member Author

hmellor commented Feb 21, 2025

The above commit should stop errors coming from Attention.forward.

I'm not sure what to do about CustomModelForCausalLM.forward. I could add some code that inspects the function signature in all the model runners and passes None for kv_caches and attn_metadata if needed?
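
A rough sketch of that signature-inspection idea (purely illustrative; the helper name is made up, and this is not necessarily what the PR ends up doing):

```python
import inspect
from typing import Any


def legacy_kwargs_for(model: Any) -> dict[str, Any]:
    """Return {'kv_caches': None, 'attn_metadata': None} only for the names
    that the model's forward() still declares, so out-of-tree models that
    haven't migrated yet keep working."""
    params = inspect.signature(model.forward).parameters
    return {name: None
            for name in ("kv_caches", "attn_metadata")
            if name in params}


# Hypothetical use inside a model runner:
#   hidden_states = model(input_ids, positions, **legacy_kwargs_for(model))
```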

@youkaichao
Member

@youkaichao will this break out of tree models?

I think it's fine to ask out-of-tree models to update to new vLLM versions. We cannot let out-of-tree models slow down development. In fact, we could have removed this code several months ago, and I think we have waited long enough.

Member

@youkaichao youkaichao left a comment


agree on the direction, thanks for cleaning up the code!

This reverts commit 5d84b99.

Signed-off-by: Harry Mellor <[email protected]>
@hmellor
Member Author

hmellor commented Feb 22, 2025

Reverting the deprecation of kv_cache and attn_metadata from Attention.forward because it was only a half measure that does not fully solve the problem of breaking out-of-tree models.

This way, it is unambiguous that we do not expect these arguments in Model.forward or Attention.forward.

Collaborator

@heheda12345 heheda12345 left a comment


I think it won't be difficult for out-of-tree models to remove these kwargs. One thing that is already broken is models with KV cache sharing (multiple attention layers using the same KV cache). It exists in open-source models but not in the vLLM repo. We need to design the interface for it once we get the first PR for a KV-cache-sharing model.

I think we need to tell users how to debug KV-cache-related problems. These two variables are accessed frequently when debugging models.

) -> torch.Tensor:
# NOTE: please avoid accessing `kv_cache` and `attn_metadata` arguments
# directly, use `self.kv_cache` and
# `get_forward_context().attn_metadata` instead.
Collaborator


Not necessary in this PR, but can we update these notes to help people print kv_cache & attn_metadata when debugging KV-cache-related problems?
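
For anyone debugging after this change, a possible pattern along those lines (illustrative only; the helper is not part of vLLM, and it assumes it is called while a forward pass is in flight):

```python
from vllm.forward_context import get_forward_context


def dump_attn_state(attn_layer) -> None:
    """Print the current attention metadata and the per-layer KV cache.

    `attn_layer` is assumed to be a vllm.attention.Attention instance; its
    cache tensors live on `attn_layer.kv_cache` (one entry per virtual
    engine) now that the `kv_cache` argument is gone.
    """
    ctx = get_forward_context()  # only valid inside a model forward pass
    print("attn_metadata:", ctx.attn_metadata)
    print("kv_cache shapes:",
          [getattr(kv, "shape", None) for kv in attn_layer.kv_cache])
```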

@simon-mo simon-mo merged commit cdc1fa1 into vllm-project:main Feb 25, 2025
52 of 54 checks passed
@hmellor hmellor deleted the remove-unused-attn-args branch February 25, 2025 09:20
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Feb 25, 2025
### What this PR does / why we need it?
The arg list of `Attention.forward()` is changed by
vllm-project/vllm#13555.
The unused args `kv_caches` and `attn_metadata` are removed.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing test.

Signed-off-by: MengqingCao <[email protected]>
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
ttanzhiqiang pushed a commit to ttanzhiqiang/vllm-ascend that referenced this pull request Apr 27, 2025
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025

Labels

documentation (Improvements or additions to documentation), ready (ONLY add when PR is ready to merge/full CI is needed), speculative-decoding, v1


6 participants