
[RFC]: To Inductor partition or to not Inductor partition (by default in v0.11.1) #27080

@ProExpertProg

Description


tl;dr: Inductor partition is a semi-experimental feature in torch 2.9 that improves performance but also increases compile time and carries some risk; we need to decide whether to enable it by default in 0.11.1.

Motivation.

Currently, on vLLM main, piecewise cudagraphs are achieved using Dynamo partitioning (splitting the fx.Graph before it enters Inductor). That makes piecewise cudagraphs incompatible with compilation passes that need to see the whole graph to work - like attention+quant fusion and sequence parallelism (and hence async tp). Apart from allreduce+rmsnorm(+quant) fusion, those are the passes that bring the most benefit.

The vLLM x torch.compile collaboration yielded a custom Inductor partitioning solution (RFC: #23261, PR: #24281). It improves performance on its own by reducing cudagraph replay overhead, and it makes piecewise cudagraphs compatible with fullgraph custom passes. It requires torch==2.9, with a couple of monkeypatches and workarounds in vLLM (listed below), but they are not too bad. It also significantly increases cold-start (first-time) compilation time. Finally, it is still somewhat experimental: we've resolved most of the known issues so far, but there could be more.

Monkeypatches & workarounds:

Current known issues:

Proposed Change.

The big question to resolve is whether to enable Inductor partition by default in 0.11.1. The upside is increased performance; the downsides are risk of breakage and increased compilation time. Inductor partitioning has gone through a decent amount of manual testing as well as vLLM CI (some issues are still being resolved). The performance benefit is between 2% and 10% across various models and QPS regimes. The cold-start compilation cost is roughly 2-5x higher depending on the model, although warm start is actually slightly faster because there are fewer artifacts to load. More information on performance is below.

If we choose to enable Inductor partition by default, there's also a second question of whether to enable attention+quant fusion by default (up to an additional 2-6%) and SP+AsyncTP by default (numbers TBD). Both of them require Inductor partition (otherwise piecewise compilation and cudagraphs have to be disabled). A downside is that we haven't benchmarked these extensively across different kinds of hardware. For example, attention+quant fusion causes a slowdown on llama-70B tp=4 on Blackwell with torch==2.9 (cause unknown). So we're likely not going to enable these by default.

Note that even if Inductor partitioning is enabled by default, users can always disable it. With the planned addition of optimization levels (RFC: #20283, PR: #26847, more below), doing so will be as easy as passing -O1, as sketched below. This gives an easy way to control the tradeoff between performance and startup time (a cost which will only grow over time), which is especially useful for development.
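
As a concrete illustration (a sketch only, reusing the `-O.<field>` override syntax from the benchmarking appendix below and the proposed `-O1` shorthand from #20283/#26847; exact flag spellings may change, and `<model>` is a placeholder), the opt-out could look like:

```bash
# Keep FULL_AND_PIECEWISE cudagraphs but fall back to Dynamo partitioning
# (same flags as the "dynamo partition" configuration in the appendix below).
vllm serve <model> \
  -O.use_inductor_graph_partition=False \
  -O.cudagraph_mode=FULL_AND_PIECEWISE

# Once optimization levels land, the same tradeoff would be a single flag:
vllm serve <model> -O1
```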

What we want to avoid is souring users by increasing startup time while not delivering as much speedup as initially planned (the original plan for this release was to have all fusion passes improved and on by default in -O2). However, vLLM is known for being at the forefront of performance and innovation, and we can always hotfix issues if necessary. And the earlier we turn it on, the earlier we can find any remaining issues, improve the feature, and deliver better performance to users.

We could also make -O1 (with FULL_AND_PIECEWISE cudagraphs) the default for now and let users opt into Inductor partition and compile passes with -O2. In the next release, we could change the default from -O1 to -O2. In fact, this would be my preferred approach if we don't want Inductor partition on by default.

Feedback Period.

Until we release 0.11.1: late Friday 10/17 or early in the weekend.

CC List.

@simon-mo @WoosukKwon @youkaichao @zou3519 @BoyuanFeng @angelayi @pavanimajety @tlrmchlsmth @robertgshaw2-redhat @mgoin @alexm-redhat @morrison-turnansky

Any Other Things.

Appendix: Optimization levels

  • All optimization levels are achievable manually via compilation flags
  • All flags can be overridden

End goal (a rough flag-level sketch follows the list):

  • -O0: no optimization/compilation, fast startup (basically --enforce-eager)
  • -O1: fast compilation: Dynamo partition, PIECEWISE cudagraphs, simple compile passes (SiluMul+quant fusion)
  • -O2: full optimization (default): Inductor partition, FULL_AND_PIECEWISE cudagraphs, all compile passes
  • -O3: max autotune (future): O2 with additional compile sizes and autotuning
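
As a rough, non-authoritative sketch (the exact mapping lives in #26847 and may differ), the levels could expand into flag sets like the following, reusing the `-O.<field>` overrides shown in the benchmarking appendix; `<model>` is a placeholder:

```bash
# -O0: no optimization/compilation, roughly --enforce-eager
vllm serve <model> --enforce-eager

# -O1: Dynamo partition + PIECEWISE cudagraphs
# (the SiluMul+quant fusion flag is omitted from this sketch)
vllm serve <model> \
  -O.use_inductor_graph_partition=False \
  -O.cudagraph_mode=PIECEWISE

# -O2: Inductor partition + FULL_AND_PIECEWISE cudagraphs + compile passes
vllm serve <model> \
  -O.use_inductor_graph_partition=True \
  -O.cudagraph_mode=FULL_AND_PIECEWISE \
  -O.pass_config.enable_noop=true \
  -O.pass_config.enable_attn_fusion=true \
  -O.pass_config.enable_async_tp=true
```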

Appendix: Benchmarking results

These numbers all use torch==2.9 + #24604. The H100 SP-AsyncTP numbers also use #26975.

On B200, the following settings were used (an example combined invocation follows the list):

  • common: --kv-cache-dtype=fp8 --no-enable-prefix-caching -O.pass_config.enable_noop=true
  • dynamo partition: -O.use_inductor_graph_partition=False -O.cudagraph_mode=FULL_AND_PIECEWISE
  • inductor partition: -O.use_inductor_graph_partition=True -O.cudagraph_mode=FULL_AND_PIECEWISE
  • no partition: -O.use_inductor_graph_partition=False -O.cudagraph_mode=FULL_DECODE_ONLY -O.splitting_ops=[]
  • fusion: -O.pass_config.enable_attn_fusion=true
  • 2.8: means torch==2.8 was used
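
Put together, a representative "inductor partition" run for the first model below would have looked roughly like this (a reconstruction from the flags above, not the exact benchmark script):

```bash
vllm serve redhatai/meta-llama-3.1-8B-Instruct-FP8 \
  --kv-cache-dtype=fp8 \
  --no-enable-prefix-caching \
  -O.pass_config.enable_noop=true \
  -O.use_inductor_graph_partition=True \
  -O.cudagraph_mode=FULL_AND_PIECEWISE
# For the "fusion" rows, additionally pass -O.pass_config.enable_attn_fusion=true
```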

1xB200, redhatai/meta-llama-3.1-8B-Instruct-FP8

(benchmark results chart)

4xB200, redhatai/meta-llama-3.1-70B-Instruct-FP8

(benchmark results chart)

4xB200, nvidia/Llama-4-Scout-17B-16E-Instruct-FP8

(benchmark results chart)

On H100, the following additional settings were used:

  • common: --gpu_memory_utilization=0.8
  • async_tp: -O.pass_config.enable_async_tp=true (implies -O.pass_config.enable_sequence_parallelism=true)

4xH100, redhatai/meta-llama-3.1-70B-Instruct-FP8

(benchmark results chart)

Startup Time

Taken from #24281. Anecdotally, I've seen larger models compile for up to 100s on cold start as well, but that was the most I've seen.

(startup time charts)

