[RFC]: Address piecewise graph splitting and attention fusion incompatibility #23261

@ProExpertProg


UPDATE: we chose approach 2, implemented in #24281 and pytorch/pytorch#162207.

Motivation.

I wanted to get a quick opinion from people on possible solutions to the issue where attention+quant fusion is incompatible with splitting_ops, and hence with piecewise cudagraphs. I've thought about it before, but @nvpohanh brought it up in a DM this morning.

I am looking for feedback on both possible solutions and whether this is worth solving.

Problem:

There is an incompatibility between @fhl2000's FULL_AND_PIECEWISE mode and the splitting_ops setting. In vLLM, we currently split the FX graph before applying the graph fusion passes. However, to enable attn+quant fusion we clear splitting_ops so that the attention op stays in the FX graph, and that breaks FULL_AND_PIECEWISE because we can no longer capture piecewise CUDA graphs from the unsplit FX graph.
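
To make the incompatibility concrete, here is a toy sketch (not vLLM code; attn and quant are hypothetical stand-ins for the real custom ops) of how splitting at the attention op pushes the quant op into the next submodule, out of reach of a fusion pass that only ever sees one piece at a time:

```python
import torch
import torch.fx as fx
from torch.fx.passes.split_module import split_module


def attn(q, k, v):
    # Stand-in for the unified attention custom op that splitting_ops targets.
    return torch.softmax(q @ k.transpose(-1, -2), dim=-1) @ v


def quant(x):
    # Crude stand-in for a static fp8 quantization op.
    return x.clamp(-448.0, 448.0).to(torch.float8_e4m3fn)


# Keep attn/quant as single call_function nodes instead of tracing into them.
torch.fx.wrap("attn")
torch.fx.wrap("quant")


class ToyLayer(torch.nn.Module):
    def forward(self, q, k, v):
        return quant(attn(q, k, v))


model = ToyLayer()
gm = fx.symbolic_trace(model)


def split_callback(node: fx.Node) -> int:
    # Mimic splitting_ops: the attention node ends one piece, everything
    # after it (here: the quant op) lands in the next one.
    return 0 if node.name == "attn" else 1


pieces = split_module(gm, model, split_callback)
print(pieces.submod_0.graph)  # contains attn
print(pieces.submod_1.graph)  # contains quant: the attn+quant pattern no longer exists in one graph
```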

Proposed Change.

There are a few possible solutions:

  1. Set splitting_ops=[] and use cudagraph_mode=FULL (current approach)
  2. Split the graph in Inductor after custom passes
  3. Split the graph after AotDispatcher (and custom passes) but before Inductor
  4. While splitting the graph, make modifications to enable fusion

More details and the specific drawbacks of each approach are posted below.

1. Disable splitting and use cudagraph_mode=FULL

There are a few issues with this approach:

  • Mixed decode-prefill batches can be faster with piecewise cudagraphs.
  • Cascade attention requires piecewise cudagraphs.
  • Some attention backends don't always support cudagraph_mode=FULL. However, those backends (e.g. MLA) might not support fp8 fusion either, so this may not be a big issue.
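
For reference, this is roughly what the current workaround looks like from the user side; the field and flag names follow recent vLLM CompilationConfig/PassConfig and should be treated as assumptions that may differ across versions:

```python
from vllm import LLM

llm = LLM(
    model="<an fp8-quantized model>",  # placeholder
    compilation_config={
        # Keep attention in the compiled FX graph so the attn+quant fusion
        # pass can see the pattern...
        "splitting_ops": [],
        # ...which rules out piecewise capture, so fall back to full cudagraphs.
        "cudagraph_mode": "FULL",
        # Assumed flag name for the attention+quant fusion pass.
        "pass_config": {"enable_attn_fusion": True},
    },
)
```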

2. Split the graph in Inductor after custom passes

This approach was endorsed by @nvpohanh. I'm not sure whether Inductor can handle a graph that gets split mid-compilation (an unsplit graph going into post_grad_custom_post_pass and a split graph coming out).
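
For context, the hook in question is Inductor's post_grad_custom_post_pass, which is where vLLM's post-grad passes run today; approach 2 would have to split the graph at some point after this hook, inside Inductor (this is what #24281 and pytorch/pytorch#162207 eventually did). A rough sketch of where that hook sits; the pass body below is a hypothetical no-op, not vLLM's pass manager:

```python
import torch
import torch._inductor.config as inductor_config


def post_grad_pass(graph: torch.fx.Graph) -> None:
    # In vLLM, the post-grad pass manager (fusion passes included) runs here
    # on the whole, still-unsplit post-grad graph. Under approach 2, the graph
    # would additionally be split at attention ops after this hook returns,
    # and Inductor would have to keep compiling the now-split result.
    for node in graph.nodes:
        pass  # pattern matching / rewriting would go here


# Inductor calls this with the post-grad fx.Graph during compilation.
inductor_config.post_grad_custom_post_pass = post_grad_pass
```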

3. Split the graph after AotDispatcher (and custom passes) but before Inductor

This is my current favorite approach. Custom passes would receive normalized, functionalized IR and could operate on the whole graph. However, it could increase compile time, both because we would not be reusing piecewise graphs for the AOT stage (minor) and because the Inductor stage might rerun AOT for each piece (major, but avoidable).
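
A rough sketch of the ordering approach 3 implies, written as a hypothetical custom Dynamo backend (helper names and the attention check are assumptions, not vLLM code): AOTAutograd produces the functionalized ATen graph, custom passes run on the whole thing, and only then is the graph split into pieces.

```python
import torch
from torch._dynamo.backends.common import aot_autograd
from torch.fx.passes.split_module import split_module
from functorch.compile import make_boxed_func


def is_attention(node: torch.fx.Node) -> bool:
    # Placeholder test; in vLLM this would check for the unified attention
    # custom op(s) currently listed in splitting_ops.
    return node.op == "call_function" and "attention" in str(node.target)


def run_custom_passes(gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
    # Fusion passes (attn+quant, norm+quant, ...) run here and still see the
    # whole functionalized graph, so patterns spanning attention can match.
    return gm


def fw_compiler(gm: torch.fx.GraphModule, example_inputs):
    gm = run_custom_passes(gm)

    part = 0

    def split_callback(node: torch.fx.Node) -> int:
        nonlocal part
        if is_attention(node):
            part += 1          # attention gets its own piece...
            attn_part = part
            part += 1          # ...and the nodes after it start a new piece
            return attn_part
        return part

    pieces = split_module(gm, gm, split_callback, keep_original_order=True)
    # Each non-attention submodule of `pieces` would then be handed to
    # Inductor and captured piecewise; the sketch just runs them as-is.
    return make_boxed_func(pieces.forward)


# Inference-only backend: split after AOTAutograd, before (per-piece) Inductor.
backend = aot_autograd(fw_compiler=fw_compiler)
```

The compile-time concerns above are then about what happens downstream of this split: whether each piece reruns AOT inside Inductor, and how caching of the whole graph versus the individual pieces behaves.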

4. While splitting the graph, make modifications to enable fusion

This means either performing attn+quant fusion manually on the Dynamo graph, or moving ops around to make sure the quant ops end up in the same subgraph as attention. This is the dirtiest and least scalable approach, but also the quickest.
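
For the "do it by hand" flavour, the mechanics would look roughly like a torch.fx subgraph rewrite on the traced graph. All op names below (toy_attn, toy_quant, toy_attn_quant_fused) are hypothetical stand-ins; real patterns on the Dynamo graph would be messier since that IR is not normalized:

```python
import torch
from torch.fx import symbolic_trace, subgraph_rewriter


def toy_attn(q, k, v):
    return torch.softmax(q @ k.transpose(-1, -2), dim=-1) @ v


def toy_quant(x, scale):
    return (x / scale).clamp(-448.0, 448.0)


def toy_attn_quant_fused(q, k, v, scale):
    # Stand-in for an attention kernel that writes quantized output directly.
    return toy_quant(toy_attn(q, k, v), scale)


# Keep the ops as single nodes so they can be pattern-matched and replaced.
torch.fx.wrap("toy_attn")
torch.fx.wrap("toy_quant")
torch.fx.wrap("toy_attn_quant_fused")


class ToyLayer(torch.nn.Module):
    def forward(self, q, k, v, scale):
        return toy_quant(toy_attn(q, k, v), scale)


def pattern(q, k, v, scale):
    return toy_quant(toy_attn(q, k, v), scale)


def replacement(q, k, v, scale):
    return toy_attn_quant_fused(q, k, v, scale)


gm = symbolic_trace(ToyLayer())
subgraph_rewriter.replace_pattern(gm, pattern, replacement)
# The graph now contains a single fused node, so splitting at attention no
# longer separates it from the quant op it was fused with.
print(gm.graph)
```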


While approaches 2 and 3 would be best long-term, they might require more work to make sure caching of both the whole graph and the subgraphs works as intended. I am not 100% confident this wouldn't break something.

Feedback Period.

8/20-8/27

CC List.

@youkaichao @zou3519 @BoyuanFeng @nvpohanh @mgoin @ilmarkov

Any Other Things.

No response

