Separate MLAAttention class from Attention #25103

therealnaveenkamal · 2025-09-17T22:06:09Z

Purpose

This PR implements the first step of #24620 by separating Multi-Head Latent Attention into its own dedicated AttentionLayerBase subclass.

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

github-actions · 2025-09-17T22:06:20Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

gemini-code-assist

Code Review

This pull request refactors the Multi-Head Latent Attention (MLA) logic out of the generic Attention class and into a new, dedicated MLAAttention class. This is a good step towards better code organization and separation of concerns. The changes in vllm/attention/layer.py and vllm/model_executor/layers/mla.py correctly remove the old MLA logic and adopt the new class. However, the new MLAAttention class in vllm/model_executor/layers/mla_attention.py has critical implementation issues. It fails to properly instantiate and call the attention backend, and it lacks the necessary integration with the KV cache and attention metadata management. These issues will prevent the MLA feature from functioning. I've left detailed comments on how to address these critical problems.

vllm/model_executor/layers/mla_attention.py

ProExpertProg

A few minor notes

vllm/model_executor/layers/mla.py

vllm/model_executor/layers/mla_attention.py

vllm/model_executor/layers/mla.py

therealnaveenkamal · 2025-09-20T00:04:08Z

@ProExpertProg I'm working on the unified_mla_attention ops. How would you like them structured? Any input would be helpful.

ProExpertProg · 2025-09-20T00:08:02Z

Yeah to start they can just mimic the unified_attention and unified_attention_with_output ops. Also please keep the existing MLAAttentionWrapper as is and make the new MLAAttention layer the same in scope as Attention (no rope, no o_proj, etc.)

therealnaveenkamal · 2025-09-20T02:14:13Z

Hi @ProExpertProg, thanks for the feedback.

I've added the unified_mla_attention and unified_mla_attention_with_output ops, which mimic the existing unified attention ops.

The MLAAttention layer has been created in mla.py. It is scoped similarly to the base Attention layer and does not handle projections or rotary embeddings.

The MultiHeadLatentAttentionWrapper uses the new MLAAttention layer to handle the core attention logic.

Let me know what you think. Thanks
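
The split of responsibilities described in this comment could be pictured roughly as follows. This is a minimal sketch in plain Python, not the code from the PR: ToyCoreAttention, ToyWrapper, and the callables passed in are hypothetical, and the arithmetic is a placeholder. It only shows the wrapper owning projections and rotary embedding while delegating the attention step to a narrower core layer.

```python
from typing import Callable, List

Vec = List[float]


class ToyCoreAttention:
    """Hypothetical core layer: attention only, no projections or rope."""

    def forward(self, q: Vec, kv: Vec) -> Vec:
        # Placeholder for the attention computation.
        return [a * b for a, b in zip(q, kv)]


class ToyWrapper:
    """Hypothetical wrapper: owns projections and rope, delegates attention."""

    def __init__(self, q_proj: Callable[[Vec], Vec],
                 kv_proj: Callable[[Vec], Vec],
                 rope: Callable[[Vec], Vec],
                 o_proj: Callable[[Vec], Vec],
                 core: ToyCoreAttention) -> None:
        self.q_proj = q_proj
        self.kv_proj = kv_proj
        self.rope = rope
        self.o_proj = o_proj
        self.core = core

    def forward(self, hidden: Vec) -> Vec:
        q = self.rope(self.q_proj(hidden))   # projection and rope stay here
        kv = self.kv_proj(hidden)
        attn_out = self.core.forward(q, kv)  # core layer does attention only
        return self.o_proj(attn_out)         # output projection stays here


def double(v: Vec) -> Vec:
    return [2.0 * x for x in v]


def identity(v: Vec) -> Vec:
    return list(v)


wrapper = ToyWrapper(q_proj=double, kv_proj=identity, rope=identity,
                     o_proj=double, core=ToyCoreAttention())
result = wrapper.forward([1.0, 2.0])
```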

vllm/attention/layer.py

vllm/model_executor/layers/mla.py

mergify · 2025-09-23T22:52:57Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @therealnaveenkamal.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

therealnaveenkamal · 2025-09-24T04:18:01Z

@ProExpertProg i've resolved all the comments. please let me know if i have to make any changes

ProExpertProg · 2025-09-24T04:26:09Z

Can you fix pre commit please

Signed-off-by: Naveenraj Kamalakannan <[email protected]>

ProExpertProg

Just one remaining nit

ProExpertProg · 2025-10-08T19:21:03Z

vllm/model_executor/model_loader/utils.py

+    # Initialize post-load attention weights for both Attention and MLA.
+    # NOTE: Happens after other modules so we can easily decompress weights.
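
The ordering that this comment describes can be sketched as a two-pass loop. The example below is an assumption-laden toy in plain Python, not the loader code from the PR: the classes, the process_weights_after_loading signature with a log argument, and toy_post_load are all hypothetical. It shows non-attention modules being processed first so that both attention layer types can rely on already-decompressed weights.

```python
from typing import List


class ToyLinear:
    """Hypothetical non-attention module whose weights need decompressing."""

    def __init__(self) -> None:
        self.decompressed = False

    def process_weights_after_loading(self, log: List[str]) -> None:
        self.decompressed = True
        log.append("linear")


class _ToyAttentionBase:
    """Hypothetical shared base for the two separate attention layer types."""

    label = "base"

    def __init__(self, proj: ToyLinear) -> None:
        self.proj = proj

    def process_weights_after_loading(self, log: List[str]) -> None:
        # Relies on the projection having been processed already.
        if not self.proj.decompressed:
            raise RuntimeError("projection weights not ready")
        log.append(self.label)


class ToyAttention(_ToyAttentionBase):
    label = "attention"


class ToyMLAAttention(_ToyAttentionBase):
    label = "mla"


def toy_post_load(modules: list) -> List[str]:
    log: List[str] = []
    attention_types = (ToyAttention, ToyMLAAttention)
    # First pass: everything that is not an attention layer.
    for module in modules:
        if not isinstance(module, attention_types):
            module.process_weights_after_loading(log)
    # Second pass: both attention layer types, after the other modules.
    for module in modules:
        if isinstance(module, attention_types):
            module.process_weights_after_loading(log)
    return log


proj = ToyLinear()
order = toy_post_load([ToyMLAAttention(proj), ToyAttention(proj), proj])
```

Even though the attention layers come first in the input list, they are processed last, which is the property the NOTE line calls out.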


mergify · 2025-10-08T19:22:01Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @therealnaveenkamal.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Luka Govedič <[email protected]>

Signed-off-by: Naveenraj Kamalakannan <[email protected]> Signed-off-by: Luka Govedič <[email protected]> Co-authored-by: Luka Govedič <[email protected]>

therealnaveenkamal requested a review from LucasWilkinson as a code owner September 17, 2025 22:06

therealnaveenkamal changed the title ~~Separate MLAAttention class from, Attention (needs Review)~~ Separate MLAAttention class from Attention (needs Review) Sep 17, 2025

gemini-code-assist bot reviewed Sep 17, 2025

vllm/model_executor/layers/mla_attention.py (outdated, resolved)

vllm/model_executor/layers/mla_attention.py (outdated, resolved)

ProExpertProg reviewed Sep 18, 2025

vllm/model_executor/layers/mla.py (resolved)

vllm/model_executor/layers/mla_attention.py (outdated, resolved)

MatthewBonanni reviewed Sep 18, 2025

vllm/model_executor/layers/mla_attention.py (outdated, resolved)

LucasWilkinson reviewed Sep 18, 2025

vllm/model_executor/layers/mla.py (outdated, resolved)

mergify bot added the deepseek label (Related to DeepSeek models) Sep 19, 2025

therealnaveenkamal requested review from WoosukKwon, hmellor, houseroad, mgoin, robertgshaw2-redhat, simon-mo, tlrmchlsmth, yewentao256 and youkaichao as code owners September 20, 2025 02:03

ProExpertProg reviewed Sep 22, 2025

vllm/attention/layer.py (outdated, resolved)

ProExpertProg reviewed Sep 22, 2025

vllm/model_executor/layers/mla.py (outdated, resolved)

ProExpertProg reviewed Sep 22, 2025

vllm/model_executor/layers/mla.py (outdated, resolved)

mergify bot added the needs-rebase label Sep 23, 2025

therealnaveenkamal force-pushed the mla_attn branch from a19c360 to cfe6f46 Compare September 24, 2025 04:15

mergify bot removed the needs-rebase label Sep 24, 2025

therealnaveenkamal changed the title ~~Separate MLAAttention class from Attention (needs Review)~~ Separate MLAAttention class from Attention Sep 24, 2025

ProExpertProg enabled auto-merge (squash) October 7, 2025 21:54

github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Oct 7, 2025

ProExpertProg approved these changes Oct 7, 2025


pre-commit fixes

8202371

Signed-off-by: Naveenraj Kamalakannan <[email protected]>

auto-merge was automatically disabled October 7, 2025 22:03
Head branch was pushed to by a user without write access

ProExpertProg enabled auto-merge (squash) October 7, 2025 22:24

ProExpertProg removed the ready label (ONLY add when PR is ready to merge/full CI is needed) Oct 7, 2025

ProExpertProg disabled auto-merge October 7, 2025 23:01

therealnaveenkamal added 3 commits October 8, 2025 03:21

fixed attentionlayerbase issue

e955784

Signed-off-by: Naveenraj Kamalakannan <[email protected]>

final fix

2422830

Signed-off-by: Naveenraj Kamalakannan <[email protected]>

Merge branch 'main' into mla_attn

dca6734

ProExpertProg added the ready label (ONLY add when PR is ready to merge/full CI is needed) Oct 8, 2025

ProExpertProg approved these changes Oct 8, 2025


mergify bot added the needs-rebase label Oct 8, 2025

Merge branch 'main' into mla_attn

2ddf547

Signed-off-by: Luka Govedič <[email protected]>

mergify bot removed the needs-rebase label Oct 8, 2025

Remove unnecessary blank line in layer.py

b52ac89

Signed-off-by: Luka Govedič <[email protected]>

ProExpertProg enabled auto-merge (squash) October 8, 2025 19:23

simon-mo disabled auto-merge October 9, 2025 00:11

simon-mo merged commit e614ab7 into vllm-project:main Oct 9, 2025
57 of 59 checks passed

This was referenced Oct 10, 2025

[Refactor][MLA]: Independently pass q_nope & q_rope #26567

Open

[Refactor][MLA]: Independently passing q_nope & q_rope #26568

Open

This was referenced Oct 14, 2025

[Bugfix] DeepSeek V3.2 MTP metadata & CUDA graph issues #26779

Open

[Bug]: DeepSeek V3.2 running with MTP will cause DeepseekV32IndexerMetadata parsing error #26711

Open
