Refactor sliding window configuration to Transformers best practice #21927
Conversation
Signed-off-by: Harry Mellor <[email protected]>
Code Review
This pull request refactors the sliding window configuration to align with Transformers' best practices by using layer_types instead of custom attributes like sliding_window_pattern and interleaved_sliding_window. This is a positive change that simplifies the codebase. However, I've identified two critical issues. Firstly, there's an inconsistency in the string used to identify sliding attention layers ('sliding_attention' vs. 'sliding_window') between the main configuration file and the model implementations, which could lead to incorrect behavior. Secondly, the refactoring seems to have removed support for per-layer sliding window sizes when provided as a list, which could be a regression and cause runtime errors.
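To make the first concern concrete, here is a minimal illustration. The `layer_types` list below is made up, but the "full_attention"/"sliding_attention" values match the strings Transformers uses in `layer_types`:

```python
# Hypothetical layer_types for an interleaved model; not taken from any real checkpoint.
layer_types = ["sliding_attention", "full_attention"] * 3

# A check against the canonical string finds the sliding layers...
sliding_mask = [t == "sliding_attention" for t in layer_types]
# ...while a check against "sliding_window" never matches, silently treating
# every layer as full attention, which is the inconsistency flagged above.
wrong_mask = [t == "sliding_window" for t in layer_types]

assert any(sliding_mask) and not any(wrong_mask)
```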
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small and essential subset of CI tests runs automatically to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the `ready` label to the PR or enable auto-merge. 🚀
Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
(sorry everyone...)
Signed-off-by: Harry Mellor <[email protected]>
I have updated the description to detail the consequences this PR has on Gemma 3 models
Thanks for fixing!
Need to merge from main again
…llm-project#21927) Signed-off-by: Harry Mellor <[email protected]> Signed-off-by: Paul Pak <[email protected]>
…llm-project#21927) Signed-off-by: Harry Mellor <[email protected]> Signed-off-by: Diego-Castan <[email protected]>
…llm-project#21927) Signed-off-by: Harry Mellor <[email protected]>
…llm-project#21927) Signed-off-by: Harry Mellor <[email protected]>
…llm-project#21927) Signed-off-by: Harry Mellor <[email protected]> Signed-off-by: Xiao Yu <[email protected]>
…llm-project#21927) Signed-off-by: Harry Mellor <[email protected]>
This PR:

- Uses `layer_types` instead of `sliding_window_pattern`
- Removes `interleaved_sliding_window`, as it should not be necessary

Unlike the last time I tried to remove `sliding_window: list[int]` (#18494 (comment)), this PR correctly maps the Mistral-format sliding window configuration to the new HF-style sliding window configuration.
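For illustration, a minimal sketch of the kind of Mistral-to-HF mapping described above. The function name is hypothetical and the per-layer list semantics (a `None` entry meaning full attention) is an assumption, not the PR's actual code:

```python
# Illustrative sketch only: maps a Mistral-format sliding window setting
# (either a single int or a per-layer list) onto the HF-style pair of a
# single `sliding_window` size plus per-layer `layer_types`.
def mistral_to_hf_sliding_window(sliding_window, num_layers: int):
    if sliding_window is None:
        return None, None  # no sliding attention at all
    if isinstance(sliding_window, int):
        # Uniform window: every layer uses sliding attention.
        return sliding_window, ["sliding_attention"] * num_layers
    # Per-layer list: derive layer_types and keep the one window size used.
    layer_types = [
        "sliding_attention" if w is not None else "full_attention"
        for w in sliding_window
    ]
    sizes = {w for w in sliding_window if w is not None}
    assert len(sizes) == 1, "HF style expects a single window size"
    return sizes.pop(), layer_types

# Example: a 6-layer model that slides on every other layer with a 1024 window.
print(mistral_to_hf_sliding_window([None, 1024] * 3, 6))
```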
This PR should also fix the issues that everyone has been seeing with Gemma 3.

Prior to this PR, vLLM was unable to detect that some Gemma 3 models were interleaved sliding attention models. This is because vLLM relied on the presence of `sliding_window_pattern` in the `Gemma3Config`. In earlier versions of Transformers this field was present: it was either read directly from `config.json` or derived based on the known pattern in Gemma 3.

However, in newer versions of Transformers, `sliding_window_pattern` has been replaced with `layer_types`. This means that there is no longer a derived `sliding_window_pattern` in `Gemma3Config`; it will only appear if it is explicitly included in `config.json`. As you can see from this sample of Gemma 3 `config.json`s, it was not always included:

Since these configs contained `sliding_window` and vLLM was unable to recognise that they were interleaved, it treated Gemma 3 as an all-sliding-attention model. The lack of full attention layers can explain the decrease in task performance people saw.
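A rough sketch of the detection change, with hypothetical function names (not vLLM's actual code):

```python
def is_interleaved_old(hf_config) -> bool:
    # Old behaviour: infer interleaving from `sliding_window_pattern`.
    # If config.json (or an older Transformers release) does not supply it,
    # the model looks like an all-sliding-attention model.
    pattern = getattr(hf_config, "sliding_window_pattern", None)
    return pattern is not None and pattern > 1


def is_interleaved_new(hf_config) -> bool:
    # New behaviour: read `layer_types`, which current Transformers populates
    # for Gemma 3, so interleaving is detected even when config.json omits
    # the old pattern field.
    layer_types = getattr(hf_config, "layer_types", None) or []
    return ("sliding_attention" in layer_types
            and "full_attention" in layer_types)
```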
This PR updates vLLM to rely on the newer `layer_types` field, so it should fix the Gemma 3 models that were not working properly in vLLM. #20541 is a related PR that added `layer_types` support locally in vLLM's Gemma 3 modelling code, but this PR applies it globally to all of vLLM, so new models (and the Transformers backend) shouldn't have this issue.

Fixes huggingface/transformers#40017
Fixes #15752
Fixes #17689
Fixes #20341
Fixes #22270
Fixes #22475