
Conversation

@mgoin mgoin commented Apr 15, 2025

Carrying on @aurickq's work from #14061. Thanks to @LucasWilkinson for helping debug the qo_indptr issues.

The original PR had some performance issues because it used BatchPrefillWithPagedKVCacheWrapper for all prefill and decode tokens. This PR separates prefill and decode tokens in V1 using the reorder_batch() functionality added for MLA: the requests in the input_batch are reshuffled so that all decode requests sit at the front and all prefill requests at the back. This makes it easy to split the input/output of the attention implementation into contiguous decode and prefill chunks, as sketched below.
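
As a rough illustration (a minimal sketch with assumed names and shapes, not the actual vLLM implementation), once the batch is reordered the flattened token dimension can be split into two contiguous views, one per path:

import torch

def split_decode_prefill(query: torch.Tensor, num_decode_tokens: int):
    # query: [num_tokens, num_heads, head_dim]; after reorder_batch() the
    # first num_decode_tokens rows are the decode tokens (one per request)
    # and the remaining rows are the prefill tokens.
    decode_query = query[:num_decode_tokens]
    prefill_query = query[num_decode_tokens:]
    # Each chunk is handed to its own FlashInfer wrapper (decode vs. prefill),
    # and the two outputs are concatenated back in the same order.
    return decode_query, prefill_query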

With this new implementation, FlashInfer 0.2.1.post2 comes close to matching the performance of FA3.

Evaluations

Evaluations on GSM8k:

export VLLM_ATTENTION_BACKEND=FLASHINFER
lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.2-1B-Instruct --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
vllm (pretrained=meta-llama/Llama-3.2-1B-Instruct,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3351|±  | 0.013|
|     |       |strict-match    |     5|exact_match|↑  |0.3351|±  | 0.013|

lm_eval --model vllm --model_args pretrained=Qwen/Qwen2.5-7B-Instruct --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
vllm (pretrained=Qwen/Qwen2.5-7B-Instruct,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8264|±  |0.0104|
|     |       |strict-match    |     5|exact_match|↑  |0.7885|±  |0.0112|

lm_eval --model vllm --model_args pretrained=RedHatAI/QwQ-32B-FP8-dynamic,tensor_parallel_size=2 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
vllm (pretrained=RedHatAI/QwQ-32B-FP8-dynamic,tensor_parallel_size=2,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4321|±  |0.0136|
|     |       |strict-match    |     5|exact_match|↑  |0.7369|±  |0.0121|

Benchmarks

Benchmarks run on H100:

Llama 8B at 1024/128 input/output tokens, showing the improvement over the old mixed (combined prefill+decode) implementation and comparing against FA3:

python benchmarks/benchmark_throughput.py --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 1000 --input-len 1024 --output-len 128

# V1 FA3
Throughput: 25.63 requests/s, 30776.11 total tokens/s, 3280.51 output tokens/s

# V1 Original Flashinfer (Old PR, combined prefill+decode)
Throughput: 15.51 requests/s, 18616.93 total tokens/s, 1985.48 output tokens/s

# V1 Flashinfer (This PR)
Throughput: 25.09 requests/s, 30112.70 total tokens/s, 3212.02 output tokens/s

Llama 8B at 1000/1000 input/output tokens against FA3:

python benchmarks/benchmark_throughput.py --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 1000 --input-len 1000 --output-len 1000

export VLLM_ATTENTION_BACKEND=FLASHINFER
Throughput: 5.93 requests/s, 12136.45 total tokens/s, 5931.55 output tokens/s

export VLLM_ATTENTION_BACKEND=FLASH_ATTN 
Throughput: 6.17 requests/s, 12632.79 total tokens/s, 6165.96 output tokens/s

QwQ 32B FP8-dynamic TP=2 at 1000/1000 input/output tokens against FA3:

python benchmarks/benchmark_throughput.py --model RedHatAI/QwQ-32B-FP8-dynamic --tensor-parallel-size=2 --num-prompts 1000 --input-len 1000 --output-len 1000

export VLLM_ATTENTION_BACKEND=FLASHINFER
Throughput: 4.25 requests/s, 8748.90 total tokens/s, 4247.52 output tokens/s

export VLLM_ATTENTION_BACKEND=FLASH_ATTN
Throughput: 4.25 requests/s, 8741.93 total tokens/s, 4248.12 output tokens/s
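
Each run above just switches the VLLM_ATTENTION_BACKEND environment variable before invoking benchmark_throughput.py. The same selection can be made from Python; a minimal sketch (the prompt and sampling settings are illustrative, not the exact benchmark configuration):

import os

# Equivalent to `export VLLM_ATTENTION_BACKEND=FLASHINFER`; use "FLASH_ATTN"
# instead to compare against FA3. Set this before the engine is created.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(max_tokens=128, ignore_eos=True)  # fixed-length output, as in the benchmark
outputs = llm.generate(["Summarize the benefits of paged attention."], params)
print(outputs[0].outputs[0].text)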

Signed-off-by: mgoin <[email protected]>
@mgoin mgoin commented Apr 16, 2025

Will need to rebase on #16673

Signed-off-by: mgoin <[email protected]>
@mgoin mgoin changed the title from "V1 FlashInfer Attention" to "[V1] V1 FlashInfer Attention" Apr 18, 2025
@mergify mergify bot commented Apr 18, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mgoin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 18, 2025
@mergify mergify bot removed the needs-rebase label Apr 18, 2025
@mgoin mgoin added the ready (ONLY add when PR is ready to merge/full CI is needed) label Apr 18, 2025
@mgoin mgoin requested a review from LucasWilkinson April 21, 2025 21:51
@LucasWilkinson LucasWilkinson left a comment

LGTM, thanks for doing this!

@LucasWilkinson LucasWilkinson enabled auto-merge (squash) April 21, 2025 22:08
@LucasWilkinson LucasWilkinson merged commit 986537f into vllm-project:main Apr 22, 2025
63 checks passed
@JaheimLee JaheimLee commented Apr 22, 2025

Can we use FlashInfer's fp8 KV cache with this PR? vLLM currently only does:

# fp8 attention is only marked supported when the FlashAttention backend will be used
will_use_fa = (
    current_platform.is_cuda()
    and not envs.is_set("VLLM_ATTENTION_BACKEND")
) or envs.VLLM_ATTENTION_BACKEND == "FLASH_ATTN_VLLM_V1"
supported = False
if fp8_attention and will_use_fa:
    from vllm.vllm_flash_attn.fa_utils import (
        flash_attn_supports_fp8)
    supported = flash_attn_supports_fp8()
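
For context, here is a hypothetical sketch of what the question is asking for, using vLLM's existing kv_cache_dtype option together with the FlashInfer backend (the model name is illustrative; whether this combination works is exactly the open question, see the reply below):

import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM

# kv_cache_dtype="fp8" requests an fp8 KV cache, but as the check above shows,
# the fp8-attention fast path is currently only taken for the FlashAttention backend.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", kv_cache_dtype="fp8")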

@mgoin mgoin deleted the flashinfer-v1 branch April 22, 2025 20:26
@mgoin mgoin commented Apr 22, 2025

@JaheimLee I tried enabling it in #17005, but flashinfer fails to compile.

frieda-huang pushed a commit to frieda-huang/vllm that referenced this pull request Apr 23, 2025
Signed-off-by: mgoin <[email protected]>
Co-authored-by: Aurick Qiao <[email protected]>
Signed-off-by: Frieda (Jingying) Huang <[email protected]>
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
adobrzyn pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 30, 2025
Signed-off-by: mgoin <[email protected]>
Co-authored-by: Aurick Qiao <[email protected]>
Signed-off-by: Agata Dobrzyniewicz <[email protected]>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: mgoin <[email protected]>
Co-authored-by: Aurick Qiao <[email protected]>
Signed-off-by: Mu Huai <[email protected]>