[Don't review][webgpu] Make graph capture work on LLM #25868
Conversation
Commits:
- Add seqlen_k to dynamically compute total_seq_length
- Add indirect buffer usage
- Fuse PrepareIndirectDispatch shader into CopyKVCache
- Code reuse
- Update the conditions
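To illustrate the fused PrepareIndirectDispatch step, here is a minimal WGSL sketch, embedded as a C++ string. The binding layout, tile size, and the `seqlen_k` convention (`total_sequence_length - 1`) are assumptions for illustration, not the actual ORT shader; in the PR this logic is fused into the CopyKVCache kernel rather than being its own dispatch.

```cpp
// Hypothetical sketch only: produce the indirect dispatch arguments on the GPU
// so the CPU never needs to read seqlen_k back.
const char* kWriteIndirectArgsWGSL = R"WGSL(
@group(0) @binding(0) var<storage, read> seqlen_k : array<i32>;
// Consumed by dispatchWorkgroupsIndirect: [x, y, z] workgroup counts.
@group(0) @binding(1) var<storage, read_write> indirect_args : array<u32, 3>;

const TILE_SIZE : u32 = 64u;

@compute @workgroup_size(1)
fn write_indirect_args() {
  let total_seq_len = u32(seqlen_k[0]) + 1u;  // convention assumed: seqlen_k = total - 1
  indirect_args[0] = (total_seq_len + TILE_SIZE - 1u) / TILE_SIZE;  // tiles over sequence
  indirect_args[1] = 1u;
  indirect_args[2] = 1u;
}
)WGSL";
```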
Force-pushed from d3e1ae0 to 1197a17.
Review comments on onnxruntime/core/providers/webgpu/math/binary_elementwise_ops.cc (outdated, resolved).
This reverts commit bc4b41e.
### Description
This PR unifies `present_sequence_length` in FlashAttention and removes the dependency on `total_sequence_length`. This is preparation to support graph capture (#25868).
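A minimal C++ sketch of the idea, with hypothetical names: size the dispatch by the statically known present-KV capacity taken from the tensor shape, and let the shader mask out positions beyond the runtime total, so no GPU value has to be read back to the CPU.

```cpp
#include <cstdint>

// Hypothetical struct; in ORT these values come from the present_key tensor
// shape, which is known at program creation time without a GPU readback.
struct AttentionShapes {
  int64_t present_sequence_length;  // capacity of the present KV cache
};

uint32_t WorkgroupsForSequence(const AttentionShapes& s, uint32_t tile) {
  // Upper-bound the dispatch by present_sequence_length; the shader masks out
  // positions beyond the GPU-resident runtime total sequence length.
  return static_cast<uint32_t>((s.present_sequence_length + tile - 1) / tile);
}
```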
### Description
This PR adds the dispatchWorkgroupsIndirect capability for the program. It's part of the work to enable graph capture in phi4 (#25868).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
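At the WebGPU API level this maps to `DispatchWorkgroupsIndirect`; below is a minimal sketch using Dawn's C++ bindings. The ORT program abstraction wraps something equivalent; the indirect buffer must be created with `wgpu::BufferUsage::Indirect`.

```cpp
#include <webgpu/webgpu_cpp.h>

// Sketch only: records a compute pass whose workgroup counts are read from
// indirect_args (3 x u32: x, y, z) on the GPU at execution time, so the
// recorded commands stay valid when the sequence length changes.
void DispatchIndirect(wgpu::CommandEncoder& encoder, wgpu::ComputePipeline pipeline,
                      wgpu::BindGroup bind_group, wgpu::Buffer indirect_args) {
  wgpu::ComputePassEncoder pass = encoder.BeginComputePass();
  pass.SetPipeline(pipeline);
  pass.SetBindGroup(0, bind_group);
  pass.DispatchWorkgroupsIndirect(indirect_args, /*indirectOffset=*/0);
  pass.End();
}
```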
This pull request introduces support for indirect dispatch in the WebGPU FlashAttention implementation, enabling more dynamic and efficient kernel launches based on runtime sequence lengths. The changes add new logic and parameters to propagate sequence-length information and indirect dispatch buffers through the attention pipeline, with conditional code paths to maintain compatibility with the existing direct dispatch approach. It's part of the work to enable graph capture in phi4 (#25868).
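The conditional path might look like the following self-contained sketch (hypothetical names, not the actual ORT code): direct dispatch when the total sequence length is known on the CPU, indirect dispatch otherwise.

```cpp
#include <cstdint>
#include <optional>

constexpr uint32_t kTileSize = 64;

// Illustrative only: chooses between the legacy direct dispatch (CPU-known
// total_sequence_length) and the indirect path used under graph capture.
struct DispatchPlan {
  bool indirect;                 // true: read counts from the indirect buffer
  uint32_t x = 1, y = 1, z = 1;  // used only when indirect == false
};

DispatchPlan PlanFlashAttentionDispatch(bool graph_capture_enabled,
                                        std::optional<uint32_t> cpu_total_seq_len,
                                        uint32_t num_heads) {
  if (graph_capture_enabled || !cpu_total_seq_len.has_value()) {
    return {/*indirect=*/true};
  }
  const uint32_t tiles = (*cpu_total_seq_len + kTileSize - 1) / kTileSize;
  return {/*indirect=*/false, tiles, num_heads, 1};
}
```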
This pull request extends the WebGPU execution provider to support int64 data type casting in the `Cast` operator, with conditional support based on whether graph capture is enabled. It refactors kernel registration to allow toggling int64 support and updates the shader code and kernel logic to handle int64 tensors efficiently. It's part of the work to enable graph capture in phi4 (#25868).
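Since WGSL has no 64-bit integer type, an int64 tensor has to be represented as pairs of 32-bit words on the GPU. The following is a sketch of the idea, not ORT's actual shader; the (lo, hi) layout and the assumption that values fit in 32 bits (typical for index-like data) are illustrative.

```cpp
// Hypothetical sketch: an int64 input bound as (low, high) u32 word pairs and
// narrowed to i32, assuming each value fits in 32 bits.
const char* kCastInt64ToInt32WGSL = R"WGSL(
@group(0) @binding(0) var<storage, read> input64 : array<vec2<u32>>;  // (lo, hi)
@group(0) @binding(1) var<storage, read_write> output32 : array<i32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  if (gid.x < arrayLength(&output32)) {
    output32[gid.x] = bitcast<i32>(input64[gid.x].x);  // keep the low 32 bits
  }
}
)WGSL";
```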
This pull request enables conditionally registering GQA with `total_sequence_length` on GPU or on CPU. It resolves the issue that a MemcpyToHost node is generated when graph capture is enabled (see #25868); this is the last functional piece needed to support graph capture in the WebGPU EP in ORT. The main changes ensure that when graph capture is enabled, sequence-length information is read from GPU buffers instead of CPU memory, and shader code generation adapts accordingly. This enables more efficient execution and compatibility with graph-captured models. In this PR we still read the total sequence length from the `seqlen_k` tensor rather than the `total_seqlen_tensor` tensor, to stay consistent with other parts of the code; a follow-up PR can refactor all call sites to use `total_seqlen_tensor` directly when graph capture is enabled.
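A sketch of what the shader-codegen branch could look like (hypothetical helper name; the `seqlen_k = total_sequence_length - 1` convention follows GQA):

```cpp
#include <string>

// Illustrative only: pick where the shader reads the total sequence length from.
std::string TotalSeqLenExpr(bool graph_capture_enabled) {
  if (graph_capture_enabled) {
    // GPU-resident path: seqlen_k holds total_sequence_length - 1 per batch.
    return "u32(seqlen_k[0]) + 1u";
  }
  // CPU path: the value was uploaded as a uniform before dispatch.
  return "uniforms.total_sequence_length";
}
```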
Closing this one since all the functionality to support graph capture has been merged separately.
Description
This PR includes all necessary changes to enable graph capture in LLM.
It mainly introduces support for indirect dispatch in the WebGPU FlashAttention implementation, enabling kernel launches sized by the runtime total sequence length, along with the other changes needed so that the whole model can run on the GPU.
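For context on why the whole model must stay on the GPU: replaying captured GPU commands cannot interleave CPU work, so capture is only valid when no node falls back to CPU (which is what the MemcpyToHost fix above addresses). A toy check in that spirit, not ORT's actual implementation:

```cpp
#include <string>
#include <vector>

// Sketch of the kind of validation graph capture implies: every node must be
// assigned to the GPU EP, otherwise replaying the captured command buffer
// would skip required CPU work such as a MemcpyToHost.
bool GraphCaptureAllowed(const std::vector<std::string>& node_ep_assignments,
                         const std::string& gpu_ep_name) {
  for (const auto& ep : node_ep_assignments) {
    if (ep != gpu_ep_name) return false;
  }
  return true;
}
```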
This pull request is intended to facilitate discussion and provide a comprehensive overview of the overall changes. Subsequently, it will be divided into smaller pull requests to make the review process more manageable.