Conversation

@qjia7 qjia7 commented Oct 22, 2025

This pull request enables conditional registration of GQA depending on whether total_sequence_length resides on the GPU. It resolves the issue that a MemcpyToHost node is generated when graph capture is enabled (refer to #25868). This is the last functional piece needed to support graph capture in the WebGPU EP in ORT.

The main changes ensure that when graph capture is enabled, sequence length information is read from GPU buffers instead of CPU memory, and shader code generation adapts accordingly. This enables more efficient execution and compatibility with graph-captured models.

In this PR, we still read the total sequence length from the seqlen_k tensor rather than the total_seqlen_tensor tensor, to stay consistent with other parts of the code. In a follow-up PR, we can refactor all call sites to use total_seqlen_tensor directly instead of seqlen_k when graph capture is enabled.
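The registration-time idea can be sketched in a few lines. This is a hedged, self-contained illustration with hypothetical minimal types; the real code goes through ONNX Runtime's KernelCreateInfo machinery and OrtMemType, with the flag flowing in via RegisterWebGpuContribKernels:

```cpp
#include <string>

enum class MemType { kGpuDefault, kCpuInput };

struct KernelInfo {
  std::string op_name;
  MemType total_seqlen_mem;  // memory type expected for the total_seqlen input
};

// Mirrors the PR's idea: pick the input memory type at registration time.
KernelInfo CreateGroupQueryAttentionKernelInfo(bool enable_graph_capture) {
  // With graph capture enabled, total_seqlen must stay on the GPU so that
  // no MemcpyToHost node is inserted into the graph; otherwise the kernel
  // reads the scalar on the CPU as before.
  return {"GroupQueryAttention",
          enable_graph_capture ? MemType::kGpuDefault : MemType::kCpuInput};
}
```

Registering the op this way means the graph partitioner never sees a CPU-resident input when capture is on, so no copy-back node is generated.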

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Oct 22, 2025
@guschmue guschmue requested a review from Copilot October 28, 2025 15:44

Copilot AI left a comment

Pull Request Overview

This PR enables conditional registration of the GroupQueryAttention (GQA) operator based on whether graph capture is enabled in the WebGPU execution provider. When graph capture is enabled, the operator reads total sequence length from GPU buffers instead of CPU memory, eliminating the need for a MemcpyToHost operation that was blocking graph capture support.

Key changes:

  • Modified GQA kernel registration to conditionally set InputMemoryType based on graph capture status
  • Updated flash attention shader templates and programs to support reading sequence length from GPU buffers
  • Added validation logic to handle total_seqlen tensor when it resides on GPU during graph capture
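The shader-side change above can be illustrated with a generator-style sketch. This is a C++ sketch with illustrative names, not the actual template: the real WGSL lives in flash_attention.wgsl.template and the branch in flash_attention.cc, and the +1 assumes seqlen_k stores sequence length minus one:

```cpp
#include <string>

// Emits the WGSL helper in one of two flavors, depending on whether the
// total sequence length is available only in a GPU buffer (graph capture)
// or was read on the CPU and passed in as a uniform.
std::string EmitGetTotalSequenceLength(bool use_seqlen_k) {
  if (use_seqlen_k) {
    // Graph capture path: read from the seqlen_k buffer at shader run time.
    // (Assumption: seqlen_k stores sequence length minus one, hence the +1.)
    return "fn get_total_sequence_length() -> u32 {\n"
           "  return u32(seqlen_k[0]) + 1u;\n"
           "}\n";
  }
  // Regular path: the value was baked into the uniform block on the CPU.
  return "fn get_total_sequence_length() -> u32 {\n"
         "  return uniforms.total_sequence_length;\n"
         "}\n";
}
```

Because the rest of the shader only calls get_total_sequence_length(), the two code paths stay identical everywhere else.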

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

File summary:

  • onnxruntime/core/providers/webgpu/webgpu_execution_provider.cc: Passes the enable_graph_capture flag to RegisterWebGpuContribKernels
  • onnxruntime/contrib_ops/webgpu/webgpu_contrib_kernels.h: Adds an enable_graph_capture parameter to the RegisterWebGpuContribKernels signature
  • onnxruntime/contrib_ops/webgpu/webgpu_contrib_kernels.cc: Replaces static GQA registration with conditional registration via CreateGroupQueryAttentionKernelInfo
  • onnxruntime/contrib_ops/webgpu/bert/group_query_attention.h: Declares the CreateGroupQueryAttentionKernelInfo function for conditional kernel creation
  • onnxruntime/contrib_ops/webgpu/bert/group_query_attention.cc: Implements conditional kernel registration and updates the ApplyFlashAttention signature to accept seqlen_k
  • onnxruntime/contrib_ops/webgpu/bert/flash_attention.wgsl.template: Adds a get_total_sequence_length() function that reads from either a GPU buffer or uniforms based on the use_seqlen_k flag
  • onnxruntime/contrib_ops/webgpu/bert/flash_attention.h: Adds a use_seqlen_k member to the CopyKVCacheProgram and FlashAttentionProgram classes
  • onnxruntime/contrib_ops/webgpu/bert/flash_attention.cc: Implements the use_seqlen_k logic in shader code generation and removes the past_sequence_length uniform
  • onnxruntime/contrib_ops/cpu/bert/group_query_attention_helper.h: Updates validation logic to skip CPU-specific checks when total_seqlen is on the GPU
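The validation change in group_query_attention_helper.h can be sketched as follows. Hypothetical minimal types stand in for ORT tensors; the point is only that host-side range checks are skipped when the data is not host-readable:

```cpp
#include <stdexcept>

enum class Device { kCpu, kGpu };

// Hypothetical stand-in for an ORT tensor: a device tag plus the scalar
// value when it is host-readable.
struct Tensor {
  Device device;
  int value;
};

void ValidateTotalSeqLen(const Tensor& total_seqlen, int max_allowed) {
  if (total_seqlen.device == Device::kGpu) {
    // Graph capture: the data lives on the GPU, so any check that needs
    // the host-side value must be skipped; it is enforced at run time.
    return;
  }
  if (total_seqlen.value <= 0 || total_seqlen.value > max_allowed) {
    throw std::invalid_argument("total_sequence_length out of range");
  }
}
```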


@qjia7 qjia7 merged commit f7fd3b5 into main Oct 29, 2025
94 of 96 checks passed
@qjia7 qjia7 deleted the dynamic_register_gqa branch October 29, 2025 02:28
naomiOvad pushed a commit to naomiOvad/onnxruntime that referenced this pull request Nov 2, 2025
