UPDATE: we chose approach 2, implemented in #24281 and pytorch/pytorch#162207.
Motivation.
I wanted to get quick opinions on possible solutions to the issue that attention+quant fusion is incompatible with `splitting_ops` and hence with piecewise cudagraphs. I've thought about it before, but @nvpohanh brought it up in a DM this morning.
I am looking for feedback both on the possible solutions and on whether this is worth solving.
Problem:
There is an incompatibility between @fhl2000's `FULL_AND_PIECEWISE` cudagraph mode and the `splitting_ops` setting. In vLLM, we currently split the FX graph before applying the graph fusion passes. To enable attn+quant fusion, we clear `splitting_ops` so that the attention op stays in the FX graph, but that breaks `FULL_AND_PIECEWISE` mode, because we can no longer capture piecewise cudagraphs from a full (unsplit) FX graph.
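To make the incompatibility concrete, here is a toy illustration of why splitting before the fusion passes hides the pattern (this is not vLLM's actual splitting code, and `toy_attn`/`toy_quant` are stand-ins for the real attention and quant custom ops): once the graph is split, the attention op and the quant op that consumes its output live in different submodules, so a per-submodule pattern matcher never sees both at once.

```python
import torch
from torch.fx import symbolic_trace, wrap
from torch.fx.passes.split_module import split_module

def toy_attn(q, k, v):  # stand-in for the attention custom op
    return torch.softmax(q @ k.transpose(-1, -2), dim=-1) @ v

def toy_quant(x):  # stand-in for the fp8 quant op
    return torch.clamp(x, -448.0, 448.0)

# Keep the stand-ins as opaque call_function nodes instead of tracing into them.
wrap("toy_attn")
wrap("toy_quant")

class Block(torch.nn.Module):
    def forward(self, q, k, v):
        return toy_quant(toy_attn(q, k, v))

gm = symbolic_trace(Block())

# Mimic splitting at the attention op: everything up to and including the
# attention node lands in partition 0, everything after it in partition 1.
state = {"seen_attn": False}
def split_callback(node):
    part = 1 if state["seen_attn"] else 0
    if node.op == "call_function" and node.target is toy_attn:
        state["seen_attn"] = True
    return part

split = split_module(gm, Block(), split_callback)
print(split.submod_0.graph)  # contains toy_attn
print(split.submod_1.graph)  # contains toy_quant -> the attn+quant pattern is gone
```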
Proposed Change.
There are a few possible solutions:
1. Set `splitting_ops=[]` and use `cudagraph_mode=FULL` (current approach)
2. Split the graph in Inductor after custom passes
3. Split the graph after AotDispatcher (and custom passes) but before Inductor
4. While splitting the graph, make modifications to enable fusion
More details and the specific drawbacks of each approach are posted below.
1. Disable splitting and use `cudagraph_mode=FULL`
There are a few issues with this approach:
- Mixed decode-prefill batches can be faster with piecewise cudagraphs.
- Cascade attention requires piecewise cudagraphs.
- Some attention backends don't always support `cudagraph_mode=FULL`, but those backends might not support fp8 fusion either (e.g. MLA), so this might not be a big issue.
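For reference, this is roughly what approach 1 looks like from the user side today. A minimal sketch, assuming a recent vLLM where `CompilationConfig` exposes `splitting_ops`, `cudagraph_mode`, and `pass_config.enable_attn_fusion`, and where `compilation_config` accepts a plain dict; exact field names and any required companion flags (e.g. enabling the custom quant ops) may differ by version, and the model name is just an example FP8 checkpoint.

```python
from vllm import LLM

llm = LLM(
    model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8",  # example FP8 checkpoint
    compilation_config={
        # Keep attention in the FX graph so the attn+quant fusion pass can
        # match the attention -> quant pattern...
        "splitting_ops": [],
        # ...but with no pieces left to capture, fall back to full cudagraphs
        # instead of FULL_AND_PIECEWISE.
        "cudagraph_mode": "FULL",
        "pass_config": {"enable_attn_fusion": True},
    },
)
```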
2. Split the graph in Inductor after custom passes
This approach was endorsed by @nvpohanh. I'm not sure whether Inductor can handle a split graph during compilation (an unsplit graph going into `post_grad_custom_post_pass` and a split graph coming out).
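For context on why the ordering is awkward: today the custom passes run inside Inductor via the `post_grad_custom_post_pass` hook, which receives the whole post-grad FX graph, so approach 2 would have Inductor split the graph only after that hook has run. A standalone sketch of the hook on a toy model (not vLLM's pass manager; I'm assuming the standard Inductor option name and callable signature):

```python
import torch

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) * 2.0

def post_grad_pass(graph: torch.fx.Graph) -> None:
    # A real pass (e.g. the attn+quant pattern matcher) would rewrite the graph
    # in place here; the point is that it sees one whole, unsplit graph. This
    # placeholder just walks the nodes.
    for node in graph.nodes:
        _ = node

compiled = torch.compile(
    TinyModel(),
    backend="inductor",
    options={"post_grad_custom_post_pass": post_grad_pass},
)
print(compiled(torch.randn(8)))
```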
3. Split the graph after AotDispatcher (and custom passes) but before Inductor
This is my current favorite approach. It would mean custom passes receive normalized and functionalized IR and can operate on the whole graph. However, this approach could increase compile time, both because we would not be reusing piecewise graphs for the AOT piece (minor) and because the Inductor piece might rerun AOT (major, but avoidable).
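A very rough sketch of the approach-3 ordering on a toy model, using `torch.export` as a stand-in for whatever AOTDispatcher produces and a trivial split just to show the shape of the pipeline; none of this is the real vLLM flow, and the piece-by-piece Inductor compilation is only indicated in a comment:

```python
import torch
from torch.fx.passes.split_module import split_module

class Toy(torch.nn.Module):
    def forward(self, x):
        y = torch.relu(x)
        return torch.sigmoid(y) * 2.0

# Stand-in for the AOT stage: a normalized, functionalized ATen-level graph.
ep = torch.export.export(Toy(), (torch.randn(4),))
gm = ep.module()

def whole_graph_pass(graph: torch.fx.Graph) -> None:
    # Custom passes (e.g. attn+quant fusion) would run here, before splitting,
    # so they still see the entire graph.
    pass

whole_graph_pass(gm.graph)

# Split only afterwards. The assignment here is arbitrary (first op in piece 0,
# the rest in piece 1); the real flow would split around attention ops.
state = {"first_seen": False}
def split_callback(node):
    part = 1 if state["first_seen"] else 0
    state["first_seen"] = True
    return part

pieces = split_module(gm, Toy(), split_callback)
# Each piece would then be handed to Inductor separately (and wrapped for
# piecewise cudagraph capture).
print([name for name, _ in pieces.named_children()])
```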
4. While splitting the graph, make modifications to enable fusion
This means either performing attn+quant fusion manually on the Dynamo graph or moving ops around to make sure the quant ops end up in the same subgraph as attention. This is the dirtiest and least scalable approach, but also the quickest.
While approaches 2 and 3 would be best long-term, they might require more work to make sure caching of both the whole graph and the subgraphs works as intended. I am not 100% confident this wouldn't break something.
Feedback Period.
8/20-8/27
CC List.
@youkaichao @zou3519 @BoyuanFeng @nvpohanh @mgoin @ilmarkov
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.