Collaborator

@alexm-redhat alexm-redhat commented Sep 15, 2025

This PR removes two redundant clone() calls in the pre-attention Cutlass MLA Python code, which we found during profiling work. It is step 1 ("Remove unnecessary copies from Cutlass MLA") of the meta fusion issue (#24629). For DeepSeek-R1 on 8xB200 GPUs at batch size 32, it improves decode performance by 2.4%, from 19.15 ms TPOT to 18.7 ms.
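For readers who want to check the quoted speedup, here is a minimal sketch of the arithmetic in Python. The helper name `relative_improvement` is made up for illustration; only the two TPOT values come from the description above.

```python
def relative_improvement(before_ms: float, after_ms: float) -> float:
    """Fractional latency reduction when going from before_ms to after_ms."""
    return (before_ms - after_ms) / before_ms


# TPOT values quoted in the PR description (milliseconds per output token).
tpot_before_ms = 19.15
tpot_after_ms = 18.7

gain = relative_improvement(tpot_before_ms, tpot_after_ms)
print(f"TPOT reduction: {gain:.2%}")
```

With the rounded timings this comes out at roughly 2.35%, which is in line with the reported figure once rounding of the measurements is taken into account.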

Verified that correctness is preserved via manual check and also lm_eval on GSM8K.
Command used: lm_eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-R1-0528,tensor_parallel_size=8 --tasks gsm8k --num_fewshot 5 --batch_size auto

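| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9568 | ± 0.0056 |
| | | strict-match | 5 | exact_match | 0.9538 | ± 0.0058 |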

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request removes two redundant .clone() calls on the q_nope and q_pe tensors within the _sm100_forward_decode function. This is a good performance optimization as it avoids unnecessary data copies. The change is correct because the underlying sm100_cutlass_mla_decode custom operation can handle non-contiguous tensors by using their strides, as long as the innermost dimension is contiguous, which is the case for the tensors here. The optimization is safely scoped to the newer sm100 execution path, leaving the legacy implementation untouched. The change improves performance and is safe to merge.
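The stride argument in the review can be illustrated without a GPU. The sketch below uses plain Python to mimic row-major stride bookkeeping; the shapes (32 tokens, 16 heads, 512 + 64 features per head) and the assumption that the two tensors are views split along the last dimension of one parent are illustrative only, since this thread does not show how q_nope and q_pe are actually produced.

```python
from typing import Sequence, Tuple


def row_major_strides(shape: Sequence[int]) -> Tuple[int, ...]:
    """Element strides of a densely packed row-major tensor of this shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def is_contiguous(shape: Sequence[int], strides: Sequence[int]) -> bool:
    """True when the view is laid out exactly like a dense row-major tensor."""
    expected = 1
    for dim, stride in zip(reversed(shape), reversed(strides)):
        # Strides of size-1 dimensions never matter for the layout.
        if dim != 1 and stride != expected:
            return False
        expected *= dim
    return True


# Hypothetical sizes, chosen only for illustration.
B, H, D_NOPE, D_PE = 32, 16, 512, 64

q_shape = (B, H, D_NOPE + D_PE)
q_strides = row_major_strides(q_shape)

# Splitting along the last dimension yields views that keep the parent's strides.
q_nope_shape, q_nope_strides = (B, H, D_NOPE), q_strides
q_pe_shape, q_pe_strides = (B, H, D_PE), q_strides

for name, shape, strides in [
    ("q", q_shape, q_strides),
    ("q_nope", q_nope_shape, q_nope_strides),
    ("q_pe", q_pe_shape, q_pe_strides),
]:
    print(name, shape, strides,
          "contiguous" if is_contiguous(shape, strides) else "non-contiguous",
          "| innermost stride:", strides[-1])
```

Under these assumptions both views are non-contiguous as a whole, yet their innermost stride is 1. That is the property the review says the kernel relies on, and it is why a clone() that only repacks the data would be redundant for a stride-aware consumer.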

Member

@mgoin mgoin left a comment

Nice find

@mgoin mgoin added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 15, 2025
@mgoin mgoin enabled auto-merge (squash) September 15, 2025 17:07
Collaborator Author

alexm-redhat commented Sep 15, 2025

Also removed the contiguous() call in _sm100_cutlass_mla_decode(), which gives an additional 0.8%, for a total improvement of 2.4% (TPOT 18.7 ms vs. 19.15 ms).

@alexm-redhat alexm-redhat self-assigned this Sep 15, 2025
# Extract the subsets of the outputs
returned_lse = lse[:, :H].contiguous(
) if self.need_to_return_lse_for_decode else lse
out = out[:, :H]
Member

I understand putting this in a conditional, but why can we remove the contiguous for out if we can't for lse?

Collaborator Author

Most likely it can be removed for lse as well; I was just staying on the safe side, since I don't know how to test it.
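One way to reason about the question is to ask when contiguous() on such a slice copies anything at all. The sketch below is plain Python and rests on an assumption not confirmed in this thread: that lse is a dense row-major buffer of shape (batch, padded_heads) and that H is between 1 and padded_heads. The function name and the value 128 for the padded head count are hypothetical.

```python
def head_slice_is_contiguous(batch: int, padded_heads: int, h: int) -> bool:
    """Whether buf[:, :h] of a dense (batch, padded_heads) buffer is contiguous.

    The slice has shape (batch, h) and strides (padded_heads, 1), so its rows
    sit back to back in memory only if each row is used in full, or if there
    is just a single row.
    """
    return h == padded_heads or batch == 1


PADDED_HEADS = 128  # hypothetical padded head count

for batch, h in [(32, 128), (32, 16), (1, 16)]:
    copies = not head_slice_is_contiguous(batch, PADDED_HEADS, h)
    print(f"batch={batch:>2} H={h:>3} -> contiguous() would copy: {copies}")
```

When the slice is already contiguous, calling contiguous() returns the same tensor and costs nothing. When it is not, the call allocates and copies batch * H elements, and whether that copy can be dropped depends on whether the consumer of returned_lse honors strides, which this sketch does not settle.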

@mgoin mgoin merged commit aae725a into main Sep 15, 2025
46 checks passed
@mgoin mgoin deleted the cutlass_mla_no_clones branch September 15, 2025 20:21
QierLi pushed a commit to QierLi/vllm that referenced this pull request Oct 5, 2025
[Performance] Remove redundant clone() calls in cutlass_mla (vllm-project#24891)

xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Labels

ready ONLY add when PR is ready to merge/full CI is needed v1
