UPDATE: we chose approach 2, implemented in #24281 and pytorch/pytorch#162207.
Motivation.
I wanted to get quick opinions on possible solutions to the issue that attention+quant fusion is incompatible with `splitting_ops` and hence with piecewise cudagraphs. I've thought about it before, but @nvpohanh brought it up in a DM this morning.
I am looking for feedback both on the possible solutions and on whether this is worth solving.
Problem:
There is an incompatibility between @fhl2000's `FULL_AND_PIECEWISE` cudagraph mode and the `splitting_ops` setting. In vLLM, we currently split the FX graph before applying the graph fusion passes. To enable attn+quant fusion, we clear `splitting_ops` so that the attention op stays in the FX graph, but that breaks `FULL_AND_PIECEWISE` mode, because we can no longer capture piecewise cudagraphs from a full (unsplit) FX graph.
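To make the incompatibility concrete, here is a toy illustration of why splitting before the fusion passes hides the pattern (this is not vLLM's actual splitting code, and `toy_attn`/`toy_quant` are stand-ins for the real attention and quant custom ops): once the graph is split, the attention op and the quant op that consumes its output live in different submodules, so a per-submodule pattern matcher never sees both at once.

```python
import torch
from torch.fx import symbolic_trace, wrap
from torch.fx.passes.split_module import split_module

def toy_attn(q, k, v):  # stand-in for the attention custom op
    return torch.softmax(q @ k.transpose(-1, -2), dim=-1) @ v

def toy_quant(x):  # stand-in for the fp8 quant op
    return torch.clamp(x, -448.0, 448.0)

# Keep the stand-ins as opaque call_function nodes instead of tracing into them.
wrap("toy_attn")
wrap("toy_quant")

class Block(torch.nn.Module):
    def forward(self, q, k, v):
        return toy_quant(toy_attn(q, k, v))

gm = symbolic_trace(Block())

# Mimic splitting at the attention op: everything up to and including the
# attention node lands in partition 0, everything after it in partition 1.
state = {"seen_attn": False}
def split_callback(node):
    part = 1 if state["seen_attn"] else 0
    if node.op == "call_function" and node.target is toy_attn:
        state["seen_attn"] = True
    return part

split = split_module(gm, Block(), split_callback)
print(split.submod_0.graph)  # contains toy_attn
print(split.submod_1.graph)  # contains toy_quant -> the attn+quant pattern is gone
```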
Proposed Change.
There are a few possible solutions:
1. Set `splitting_ops=[]` and use `cudagraph_mode=FULL` (current approach)
2. Split the graph in Inductor after custom passes
3. Split the graph after AotDispatcher (and custom passes) but before Inductor
4. While splitting the graph, make modifications to enable fusion
More details and the specific drawbacks of each approach are posted below.
1. Disable splitting and use `cudagraph_mode=FULL`
There are a few issues with this approach:
- Mixed decode-prefill batches can be faster with piecewise cudagraphs.
- Cascade attention requires piecewise cudagraphs.
- Some attention backends don't always support `cudagraph_mode=FULL`, but those backends might not support fp8 fusion either (e.g. MLA), so this might not be a big issue.
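For reference, this is roughly what approach 1 looks like from the user side today. A minimal sketch, assuming a recent vLLM where `CompilationConfig` exposes `splitting_ops`, `cudagraph_mode`, and `pass_config.enable_attn_fusion`, and where `compilation_config` accepts a plain dict; exact field names and any required companion flags (e.g. enabling the custom quant ops) may differ by version, and the model name is just an example FP8 checkpoint.

```python
from vllm import LLM

llm = LLM(
    model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8",  # example FP8 checkpoint
    compilation_config={
        # Keep attention in the FX graph so the attn+quant fusion pass can
        # match the attention -> quant pattern...
        "splitting_ops": [],
        # ...but with no pieces left to capture, fall back to full cudagraphs
        # instead of FULL_AND_PIECEWISE.
        "cudagraph_mode": "FULL",
        "pass_config": {"enable_attn_fusion": True},
    },
)
```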
2. Split the graph in Inductor after custom passes
This approach was endorsed by @nvpohanh. I'm not sure whether Inductor can handle a split graph during compilation (an unsplit graph going into `post_grad_custom_post_pass` and a split graph coming out).
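For context on why the ordering is awkward: today the custom passes run inside Inductor via the `post_grad_custom_post_pass` hook, which receives the whole post-grad FX graph, so approach 2 would have Inductor split the graph only after that hook has run. A standalone sketch of the hook on a toy model (not vLLM's pass manager; I'm assuming the standard Inductor option name and callable signature):

```python
import torch

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) * 2.0

def post_grad_pass(graph: torch.fx.Graph) -> None:
    # A real pass (e.g. the attn+quant pattern matcher) would rewrite the graph
    # in place here; the point is that it sees one whole, unsplit graph. This
    # placeholder just walks the nodes.
    for node in graph.nodes:
        _ = node

compiled = torch.compile(
    TinyModel(),
    backend="inductor",
    options={"post_grad_custom_post_pass": post_grad_pass},
)
print(compiled(torch.randn(8)))
```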
3. Split the graph after AotDispatcher (and custom passes) but before Inductor
This is my current favorite approach. It would mean custom passes receive normalized and functionalized IR and can operate on the whole graph. However, this approach could increase compile time, both because we would not be reusing piecewise graphs for the AOT piece (minor) and because the Inductor piece might rerun AOT (major, but avoidable).
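A very rough sketch of the approach-3 ordering on a toy model, using `torch.export` as a stand-in for whatever AOTDispatcher produces and a trivial split just to show the shape of the pipeline; none of this is the real vLLM flow, and the piece-by-piece Inductor compilation is only indicated in a comment:

```python
import torch
from torch.fx.passes.split_module import split_module

class Toy(torch.nn.Module):
    def forward(self, x):
        y = torch.relu(x)
        return torch.sigmoid(y) * 2.0

# Stand-in for the AOT stage: a normalized, functionalized ATen-level graph.
ep = torch.export.export(Toy(), (torch.randn(4),))
gm = ep.module()

def whole_graph_pass(graph: torch.fx.Graph) -> None:
    # Custom passes (e.g. attn+quant fusion) would run here, before splitting,
    # so they still see the entire graph.
    pass

whole_graph_pass(gm.graph)

# Split only afterwards. The assignment here is arbitrary (first op in piece 0,
# the rest in piece 1); the real flow would split around attention ops.
state = {"first_seen": False}
def split_callback(node):
    part = 1 if state["first_seen"] else 0
    state["first_seen"] = True
    return part

pieces = split_module(gm, Toy(), split_callback)
# Each piece would then be handed to Inductor separately (and wrapped for
# piecewise cudagraph capture).
print([name for name, _ in pieces.named_children()])
```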
4. While splitting the graph, make modifications to enable fusion
This means either performing attn+quant fusion manually on the Dynamo graph or moving ops around to make sure the quant ops end up in the same subgraph as attention. This is the dirtiest and least scalable approach, but also the quickest.
While approaches 2 and 3 would be best long-term, they might require more work to make sure caching of both the whole graph and the subgraphs works as intended. I am not 100% confident this wouldn't break something.
Feedback Period.
8/20-8/27
CC List.
@youkaichao @zou3519 @BoyuanFeng @nvpohanh @mgoin @ilmarkov
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.