[Bugfix][Qwen][DCA] fixes bug in dual-chunk-flash-attn backend for qwen 1m models. #21364
Conversation
…en 1m models. Signed-off-by: Tao He <[email protected]>
Code Review
This pull request addresses a bug in the dual-chunk-flash-attention backend for Qwen 1M models. The changes remove an erroneous `block_table` parameter from several function calls within `vllm/attention/backends/dual_chunk_flash_attn.py`. My review confirms that this parameter was being passed incorrectly to functions that expect contiguous tensors, and its removal is the correct fix. The changes are consistent and well-contained, resolving the bug introduced during a previous rebase. The code quality is good.
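For readers outside the codebase, a minimal, hypothetical sketch (not vLLM's actual kernel) of why the extra argument matters: attention entry points that accept a `block_table` typically reinterpret the key/value tensors as a paged cache, so supplying one alongside already-contiguous tensors makes the kernel gather from the wrong memory.

```python
import torch

def attend(q, k, v, block_table=None):
    # Toy stand-in for a varlen attention kernel. When a block_table is
    # given, k/v are reinterpreted as paged cache blocks and gathered.
    if block_table is not None:
        k = k[block_table].flatten(0, 1)
        v = v[block_table].flatten(0, 1)
    scores = (q @ k.transpose(-1, -2)) / k.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ v

q = torch.randn(1, 4, 8)
k = v = torch.randn(1, 4, 8)
out = attend(q, k, v)  # contiguous path: correct
# Passing a block_table here (as the pre-fix code did) would make the
# toy kernel index k/v as if they were cache blocks, i.e. wrong data:
# bad = attend(q, k, v, block_table=torch.zeros(2, dtype=torch.long))
```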
Please help verify whether using VLLM_ALLOW_LONG_MAX_MODEL_LEN in examples/offline_inference/qwen_1m.py may potentially cause issues leading to NaN; refer to #20904. I didn't even know what kind of machine could run it. NaN can cause the detokenizer to output !!!!!!!!!!!!!!
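As loose context for the "!!!!" symptom (an assumption of this note, not something stated in the PR): when logits contain NaN, greedy sampling degenerates, and in several BPE vocabularies the resulting token id decodes to "!". A hypothetical guard, not a vLLM API, might look like:

```python
import torch

def assert_finite_logits(logits: torch.Tensor) -> None:
    # Hypothetical diagnostic, not part of vLLM: fail fast instead of
    # letting NaN logits reach sampling and decode to junk like "!!!!".
    if not torch.isfinite(logits).all():
        raise RuntimeError(
            "non-finite logits detected; suspect positions beyond the "
            "model's trained context (see VLLM_ALLOW_LONG_MAX_MODEL_LEN)"
        )
```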
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI, which covers only a small and essential subset of tests. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
This issue should be orthogonal to this PR. I will take a look, but could we merge this PR first to make qwen-1m work?
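For reference, a sketch of the override under discussion (the environment variable is vLLM's; the model name and length here are illustrative assumptions, not taken from the PR):

```python
import os

# Set before constructing the engine; "1" bypasses the check that
# max_model_len fits the model's derived maximum -- the behavior
# being questioned in #20904.
os.environ["VLLM_ALLOW_LONG_MAX_MODEL_LEN"] = "1"

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-1M",  # illustrative 1M-context model
    max_model_len=1_048_576,
)
```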
LGTM, approving since @sighingnow maintains the dual-chunk attention thing and this change looks reasonable.
Thanks! The failed test does not look caused by this PR.
Essential Elements of an Effective PR Description Checklist
- (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose
Fixes a bug in the DCA (dual-chunk attention) backend. The error was introduced while rebasing a previous PR: #11844
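For context on what the backend implements, a loose toy sketch of the dual-chunk idea (not vLLM's algorithm): long sequences are split into chunks, and positions are remapped so that rotary embeddings never see a distance beyond the pretrained window.

```python
import torch

def dca_positions(seq_len: int, chunk_size: int):
    # Toy remapping in the spirit of dual-chunk attention: intra-chunk
    # attention uses local offsets, while cross-chunk attention sees a
    # clamped position so RoPE stays in-range. The real backend also
    # handles a successive-chunk path and the paged KV cache.
    pos = torch.arange(seq_len)
    intra = pos % chunk_size
    inter = pos.clamp(max=chunk_size - 1)
    return intra, inter

intra, inter = dca_positions(seq_len=10, chunk_size=4)
print(intra.tolist())  # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
print(inter.tolist())  # [0, 1, 2, 3, 3, 3, 3, 3, 3, 3]
```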