
Update PatchedVLLMKVCache for deepseek performance #2165


Merged: merged 1 commit into r1-woq on Apr 5, 2025

Conversation

mengniwang95 (Contributor)

Type of Change

workaround

Description

Update PatchedVLLMKVCache for deepseek performance

xuechendi pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 4, 2025
Previously, when we used INC to convert the DeepSeek FP8 model, we needed this commit (intel/neural-compressor@7c0a3e2) to remove the extra converts in KVCache, but theoretically GC can remove them during graph optimization.
Furthermore, the change in that commit is not aligned with the design of the INC patched module, which keeps the returned tensor in BF16 because we cannot know the user's next operation.
So, I updated the modeling file so that GC can work for the patched KVCache pattern of the DeepSeek model.
Since the next release is very close and GC currently does not work as expected during the decode stage, this is still a workaround. We will root-cause and fix it at the source in the next release.

This PR should work together with intel/neural-compressor#2165.

Signed-off-by: Mengni Wang <[email protected]>
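
To illustrate the design described above, here is a minimal sketch of the patched-KVCache idea: store the cache in FP8, but always dequantize the returned tensor back to BF16 because the caller's next operation is unknown. This is not INC's actual PatchedVLLMKVCache code; the class name and the quant_to_fp8/dequant_to_bf16 callables are hypothetical placeholders.

```python
# Minimal sketch (assumed names, not INC's actual PatchedVLLMKVCache code).
import torch

class PatchedKVCacheSketch(torch.nn.Module):
    def __init__(self, quant_to_fp8, dequant_to_bf16):
        super().__init__()
        self.quant = quant_to_fp8        # hypothetical FP8 quantize op
        self.dequant = dequant_to_bf16   # hypothetical BF16 dequantize op

    def forward(self, key_bf16: torch.Tensor) -> torch.Tensor:
        fp8_cache = self.quant(key_bf16)   # cache is kept in FP8
        return self.dequant(fp8_cache)     # always hand back BF16 to the caller

# Usage sketch: stand in for quant/dequant with plain dtype casts
# (requires a PyTorch build that provides torch.float8_e4m3fn).
cache = PatchedKVCacheSketch(
    quant_to_fp8=lambda t: t.to(torch.float8_e4m3fn),
    dequant_to_bf16=lambda t: t.to(torch.bfloat16),
)
out = cache(torch.randn(2, 4, dtype=torch.bfloat16))
```

When the consumer of this output immediately converts it back to FP8, the dequant/quant pair is back-to-back and redundant; in principle GC can eliminate that pair during graph optimization, and the modeling-file change in this PR exposes the DeepSeek KVCache pattern in a form GC can optimize.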
@yiliu30 yiliu30 merged commit fcf3031 into r1-woq Apr 5, 2025
7 of 9 checks passed
@yiliu30 yiliu30 deleted the dev/mengni/kv branch April 5, 2025 07:55