
[RFC]: To Inductor partition or to not Inductor partition (by default in v0.11.1) #27080

@ProExpertProg

Description


tl;dr: Inductor partition is a semi-experimental feature in torch 2.9 that improves performance but also increases compile time and carries some risk; we need to decide whether to enable it by default in 0.11.1.

Motivation.

Currently, on vLLM main, piecewise cudagraphs are achieved using Dynamo partitioning (splitting the fx.Graph before it enters Inductor). That makes piecewise cudagraphs incompatible with compilation passes that need to see the whole graph to work - like attention+quant fusion and sequence parallelism (and hence async tp). Apart from allreduce+rmsnorm(+quant) fusion, those are the passes that bring the most benefit.

The vLLM x torch.compile collaboration yielded a custom Inductor partitioning solution (RFC: #23261, PR: #24281). It improves performance on its own by reducing cudagraph replay overhead, and it makes piecewise cudagraphs compatible with fullgraph custom passes. It requires torch==2.9, with a couple of monkeypatches and workarounds in vLLM (listed below), but they are not too bad. It also significantly increases cold-start (first-time) compilation time. Finally, it is still somewhat experimental: we've resolved most of the known issues so far, but there could be more.

Monkeypatches & workarounds:

Current known issues:

Proposed Change.

The big question to resolve is whether to enable Inductor partition by default in 0.11.1. The upside is increased performance; the downsides are risk of breakage and increased compilation time. Inductor partitioning has gone through a decent amount of manual testing as well as vLLM CI (some issues are still being resolved). The performance benefit is between 2% and 10% across various models and QPS regimes. The cold-start compilation cost is roughly 2-5x higher depending on the model, although warm start is actually slightly faster because there are fewer artifacts to load. More information on performance is below.

If we choose to enable Inductor partition by default, there's also a second question of whether to enable attention+quant fusion by default (up to an additional 2-6%) and SP+AsyncTP by default (numbers TBD). Both of them require Inductor partition (otherwise piecewise compilation and cudagraphs have to be disabled). A downside is that we haven't benchmarked these extensively across different kinds of hardware. For example, attention+quant fusion causes a slowdown on llama-70B tp=4 on Blackwell with torch==2.9 (cause unknown). So we're likely not going to enable these by default.

Note that even if Inductor partitioning is enabled by default, users can always disable it. With the planned addition of optimization levels (RFC: #20283, PR: #26847, more below), doing so will be as easy as passing -O1, as sketched below. This gives an easy way to control the tradeoff between performance and startup time (a cost which will only grow over time), which is especially useful for development.
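
As a concrete illustration (a sketch only, reusing the `-O.<field>` override syntax from the benchmarking appendix below and the proposed `-O1` shorthand from #20283/#26847; exact flag spellings may change, and `<model>` is a placeholder), the opt-out could look like:

```bash
# Keep FULL_AND_PIECEWISE cudagraphs but fall back to Dynamo partitioning
# (same flags as the "dynamo partition" configuration in the appendix below).
vllm serve <model> \
  -O.use_inductor_graph_partition=False \
  -O.cudagraph_mode=FULL_AND_PIECEWISE

# Once optimization levels land, the same tradeoff would be a single flag:
vllm serve <model> -O1
```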

What we want to avoid is souring users by increasing startup time while not delivering as much speedup as initially planned (the original plan for this release was to have all fusion passes improved and on by default in -O2). However, vLLM is known for being at the forefront of performance and innovation, and we can always hotfix issues if necessary. And the earlier we turn it on, the earlier we can find any remaining issues, improve the feature, and deliver better performance to users.

We could also make -O1 (with FULL_AND_PIECEWISE cudagraphs) the default for now and let users opt into Inductor partition and compile passes with -O2. In the next release, we could change the default from -O1 to -O2. In fact, this would be my preferred approach if we don't want Inductor partition on by default.

Feedback Period.

Until we release 0.11.1: late Friday 10/17 or early in the weekend.

CC List.

@simon-mo @WoosukKwon @youkaichao @zou3519 @BoyuanFeng @angelayi @pavanimajety @tlrmchlsmth @robertgshaw2-redhat @mgoin @alexm-redhat @morrison-turnansky

Any Other Things.

Appendix: Optimization levels

  • All optimization levels are achievable manually via compilation flags
  • All flags can be overridden

End goal (a rough flag-level sketch follows the list):

  • -O0: no optimization/compilation, fast startup (basically --enforce-eager)
  • -O1: fast compilation: Dynamo partition, PIECEWISE cudagraphs, simple compile passes (SiluMul+quant fusion)
  • -O2: full optimization (default): Inductor partition, FULL_AND_PIECEWISE cudagraphs, all compile passes
  • -O3: max autotune (future): O2 with additional compile sizes and autotuning
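
As a rough, non-authoritative sketch (the exact mapping lives in #26847 and may differ), the levels could expand into flag sets like the following, reusing the `-O.<field>` overrides shown in the benchmarking appendix; `<model>` is a placeholder:

```bash
# -O0: no optimization/compilation, roughly --enforce-eager
vllm serve <model> --enforce-eager

# -O1: Dynamo partition + PIECEWISE cudagraphs
# (the SiluMul+quant fusion flag is omitted from this sketch)
vllm serve <model> \
  -O.use_inductor_graph_partition=False \
  -O.cudagraph_mode=PIECEWISE

# -O2: Inductor partition + FULL_AND_PIECEWISE cudagraphs + compile passes
vllm serve <model> \
  -O.use_inductor_graph_partition=True \
  -O.cudagraph_mode=FULL_AND_PIECEWISE \
  -O.pass_config.enable_noop=true \
  -O.pass_config.enable_attn_fusion=true \
  -O.pass_config.enable_async_tp=true
```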

Appendix: Benchmarking results

These numbers all use torch==2.9 + #24604. The H100 SP-AsyncTP numbers also use #26975.

On B200, the following settings were used (an example combined invocation follows the list):

  • common: --kv-cache-dtype=fp8 --no-enable-prefix-caching -O.pass_config.enable_noop=true
  • dynamo partition: -O.use_inductor_graph_partition=False -O.cudagraph_mode=FULL_AND_PIECEWISE
  • inductor partition: -O.use_inductor_graph_partition=True -O.cudagraph_mode=FULL_AND_PIECEWISE
  • no partition: -O.use_inductor_graph_partition=False -O.cudagraph_mode=FULL_DECODE_ONLY -O.splitting_ops=[]
  • fusion: -O.pass_config.enable_attn_fusion=true
  • 2.8: means torch==2.8 was used
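
Put together, a representative "inductor partition" run for the first model below would have looked roughly like this (a reconstruction from the flags above, not the exact benchmark script):

```bash
vllm serve redhatai/meta-llama-3.1-8B-Instruct-FP8 \
  --kv-cache-dtype=fp8 \
  --no-enable-prefix-caching \
  -O.pass_config.enable_noop=true \
  -O.use_inductor_graph_partition=True \
  -O.cudagraph_mode=FULL_AND_PIECEWISE
# For the "fusion" rows, additionally pass -O.pass_config.enable_attn_fusion=true
```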

1xB200, redhatai/meta-llama-3.1-8B-Instruct-FP8

(benchmark results chart)

4xB200, redhatai/meta-llama-3.1-70B-Instruct-FP8

(benchmark results chart)

4xB200, nvidia/Llama-4-Scout-17B-16E-Instruct-FP8

(benchmark results chart)

On H100, the following additional settings were used:

  • common: --gpu_memory_utilization=0.8
  • async_tp: -O.pass_config.enable_async_tp=true (implies -O.pass_config.enable_sequence_parallelism=true)

4xH100, redhatai/meta-llama-3.1-70B-Instruct-FP8

(benchmark results chart)

Startup Time

Taken from #24281. Anecdotally, I've seen larger models compile for up to 100s on cold start as well, but that was the most I've seen.

(startup time charts)

