
Conversation

@cascade812 (Contributor) commented Mar 16, 2025

Support sequence parallelism with TP on models like Llama.

In this PR, I modified RowParallelLinear, ColumnParallelLinear, LogitsProcessor, and VocabParallelEmbedding to support SP.

Below are the remaining TODOs:

  • Support combination with PP
  • Support other layers


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@cascade812 changed the title from "support sequence parallel" to "[Feature] Support sequence parallelism" on Mar 16, 2025
@yaochengji (Collaborator)

Thanks for your contribution, @cascade812 !

Hi @robertgshaw2-redhat , @tlrmchlsmth , could you take a look at this?

Based on my micro-benchmarks, the collective-matmul optimization can greatly improve the performance of multi-chip inference on TPU. Enabling collective-matmul depends on vLLM supporting Megatron-style sequence parallelism; the TPU compiler can then automatically convert the ag-matmul and matmul-rs patterns into collective-matmul.

cc @bvrockwell @yarongmu-google

mergify bot commented Mar 18, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @cascade812.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

The mergify bot added the needs-rebase label on Mar 18, 2025
Signed-off-by: cascade812 <[email protected]>
@tlrmchlsmth (Member)

I’ll take a look! I’m excited about and in favor of sequence parallel support in general.

@yaochengji could you explain why this helps the gemm-rs and ag-gemm rewrite? I don’t really see that as sequence parallel outside of the rms_norm

@yaochengji (Collaborator) commented Mar 20, 2025

Thanks, @tlrmchlsmth .

It's because of how sequence parallelism interacts with TP. With TP only, the model looks like:
matmul -> allreduce -> rms_norm (or other ops) -> matmul

With SP enabled, it becomes:
matmul -> rs -> rms_norm (or other ops) -> ag -> matmul

matmul-rs and ag-matmul are clean patterns for optimizers to detect, including the TPU compiler.
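To make the rewrite concrete, here is a minimal sketch of the two dataflows using raw torch.distributed collectives. This is an illustration only, not vLLM's actual layer code: mlp_block_tp, mlp_block_sp, tp_group, and rms_norm are hypothetical names, and x/w1/w2 stand in for the sharded activations and weights that the row/column-parallel linears would see.

import torch
import torch.distributed as dist

def mlp_block_tp(x, w1, w2, rms_norm, tp_group):
    # TP only: the row-parallel matmul yields a partial sum on each rank,
    # which is combined with a full allreduce before the norm.
    y = x @ w1                                      # matmul (partial result per rank)
    dist.all_reduce(y, group=tp_group)              # -> allreduce
    y = rms_norm(y)                                 # -> rms_norm (or other ops)
    return y @ w2                                   # -> matmul

def mlp_block_sp(x, w1, w2, rms_norm, tp_group):
    # SP: the allreduce is decomposed into reduce-scatter + all-gather,
    # and the norm runs on each rank's shard of the tokens in between.
    tp = dist.get_world_size(group=tp_group)
    y = x @ w1                                                    # matmul
    y_shard = torch.empty(y.shape[0] // tp, *y.shape[1:],
                          dtype=y.dtype, device=y.device)
    dist.reduce_scatter_tensor(y_shard, y, group=tp_group)        # -> rs
    y_shard = rms_norm(y_shard)                                   # -> rms_norm on 1/tp of the tokens
    y_full = torch.empty_like(y)
    dist.all_gather_into_tensor(y_full, y_shard, group=tp_group)  # -> ag
    return y_full @ w2                                            # -> matmul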

@youkaichao (Member) left a comment

Our previous plan for sequence parallelism was to make it a compilation pass, without changing the linear/embedding layers; that thread somehow got lost.

cc @bnellnm

@yaochengji (Collaborator)

Thanks for your comment, @youkaichao !

If we don't change the linear/embedding layers, we have to match many variants of the matmul -> allreduce -> rms_norm (or other ops) -> matmul pattern, even when only rms_norm is considered. Different kinds of hardware have different rms_norm implementations, and sometimes a single backend has several rms_norm implementations (e.g. NVIDIA GPUs).

Changing the linear and embedding layers offers a potentially more general approach, and the number of layers requiring modification is relatively small.

@yaochengji (Collaborator)

@cascade812 I know that currently not all layers support SP. Could you print a readable message to vLLM users when a specific layer doesn't support sequence parallelism?

@NickLucche (Collaborator) left a comment

Mega job here, thanks! Can we get some benchmark numbers to go along with this PR?

@bnellnm (Contributor) commented Mar 20, 2025

Our previous plan for sequence parallelism was to make it a compilation pass, without changing the linear/embedding layers; that thread somehow got lost.

cc @bnellnm

There were too many problems with PyTorch, limitations of the kernels, and issues with piecewise graphs, so the project was put on hold.

@yaochengji (Collaborator)

Mega job here, thanks! Can we get some benchmark numbers to go along with this PR?

I'm working on enabling the collective-matmul optimization on TPU. My change to vLLM will be based on this PR; I'll share the benchmark numbers later.

@tlrmchlsmth (Member) left a comment

I left some inline comments. Generally, I also think this should be done as a pass in Torch Inductor or a similar compiler layer. I'm pretty sure these changes are making assumptions about the model definition that may not be valid.

Comment on lines +1270 to +1272
forward_context = try_get_forward_context()
if (forward_context is not None
        and forward_context.enable_sequence_parallel):
Member

The forward context isn't available outside of model initialization, so you'll have to do self.enable_sequence_parallel = forward_context.enable_sequence_parallel in __init__; otherwise you won't actually be using sequence parallel while running inference on the model (unless you're using CUDA graphs).

I think this is a pretty tricky footgun, so we should address this - (cc @youkaichao)

@cascade812 (Contributor, Author)

The forward context is available; it's set before calling self.model().

Comment on lines +1323 to +1326
enable_sequence_parallel = (
    self.vllm_config.parallel_config.enable_sequence_parallel
    and num_tokens %
    self.vllm_config.parallel_config.tensor_parallel_size == 0)
Member

Decisions like this should go in vllm/config.py

@cascade812 (Contributor, Author)

This logic is placed outside of vllm/config.py because torch.distributed.reduce_scatter_tensor only works when the reduce-scattered dimension (in this case, the token dimension) is divisible by the parallel size. Since num_tokens changes every iteration, it doesn't seem reasonable to put this in the config.
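For illustration, a minimal sketch of the per-iteration check described above (maybe_reduce_scatter and tp_group are hypothetical names, not the PR's actual code):

import torch
import torch.distributed as dist

def maybe_reduce_scatter(y: torch.Tensor, tp_group) -> torch.Tensor:
    # reduce_scatter_tensor requires the scattered dim (tokens, dim 0) to be
    # divisible by the group size, so fall back to a plain allreduce otherwise.
    tp_size = dist.get_world_size(group=tp_group)
    num_tokens = y.shape[0]
    if num_tokens % tp_size == 0:
        out = torch.empty(num_tokens // tp_size, *y.shape[1:],
                          dtype=y.dtype, device=y.device)
        dist.reduce_scatter_tensor(out, y, group=tp_group)
        return out
    dist.all_reduce(y, group=tp_group)
    return y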

Comment on lines 1056 to 1059
with set_forward_context(
        attn_metadata,
        self.vllm_config,
        enable_sequence_parallel=enable_sequence_parallel):
Member

OK, I see we're setting the forward_context here now. @youkaichao thoughts on this?

Member

looks quite intrusive to me 👀

@cascade812 (Contributor, Author)

@youkaichao Since we need num_tokens to make the decision before reduce-scatter (see the explanation above), do you have any suggestions for a better approach?

@cascade812 (Contributor, Author)

Thanks for all the comments and reviews! I'll address accordingly.

I know that currently not all layers support SP. Could you print a readable message to vLLM users when a specific layer doesn't support sequence parallelism?

@yaochengji Do you mean adding a message to all unsupported layers? That would involve many layers. Do you have any suggestions on how to implement this efficiently?

The mergify bot added the ci/build label on Mar 22, 2025
@yaochengji (Collaborator)

Do you mean adding a message to all unsupported layers? That would involve many layers. Do you have any suggestions on how to implement this efficiently?

Usually we'd have a base class where a "sequence parallelism not implemented" warning could be put, but I just took a look at the vLLM code and realized it's not easy to implement that way.

Never mind, I'm fine with users needing to be aware of which layers support sequence parallelism until it is fully supported.

@youkaichao (Member)

we have to match many variants of the matmul -> allreduce -> rms_norm (or other ops) -> matmul pattern, even when only rms_norm is considered.

I think you only need to match matmul -> allreduce and allreduce -> matmul? rms_norm just takes an input and produces an output with the same shape, so you don't need to change the op.

@yaochengji (Collaborator) commented Mar 22, 2025

I think you only need to match matmul -> allreduce and allreduce -> matmul?

There's no allreduce -> matmul in 1D TP.

rms_norm just takes an input and produces an output with the same shape, so you don't need to change the op.

We don't need to change the op, but we still need to detect it, because we decompose the allreduce into reduce-scatter and allgather and move the allgather after rms_norm but before the next matmul.
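The reason the allgather can legally move past rms_norm is that rms_norm normalizes each token independently, so gathering the token shards before or after the norm gives the same result. A small single-process sketch of that identity (plain PyTorch, with torch.chunk/torch.cat standing in for the real collectives):

import torch

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Per-token RMS normalization: each row is normalized independently.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

x = torch.randn(8, 16)        # 8 tokens, "sharded" 2 ways along the token dim
shards = x.chunk(2, dim=0)    # stand-in for the reduce-scattered shards

# rms_norm(all_gather(shards)) == all_gather(rms_norm(shard)) per shard
gathered_then_norm = rms_norm(torch.cat(shards, dim=0))
norm_then_gathered = torch.cat([rms_norm(s) for s in shards], dim=0)
assert torch.allclose(gathered_then_norm, norm_then_gathered)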

@robertgshaw2-redhat (Collaborator)

hey @cascade812 - is it okay if I push a few changes to your branch?

@cascade812 (Contributor, Author)

hey @cascade812 - is it okay if I push a few changes to your branch?

Absolutely! Your contributions are more than welcome. Thanks!

@cascade812 (Contributor, Author)

@robertgshaw2-redhat @tlrmchlsmth @youkaichao what are your thoughts on this? If you think a compilation pass is the better choice, I'm willing to help too!

@mgoin self-requested a review on March 27, 2025 at 16:29
@tlrmchlsmth (Member)

@robertgshaw2-redhat @tlrmchlsmth @youkaichao what are your thoughts on this? If you think a compilation pass is the better choice, I'm willing to help too!

I didn’t see this message! It would be awesome for you to work on that!

@cascade812 (Contributor, Author) commented Mar 31, 2025

@robertgshaw2-redhat @tlrmchlsmth @youkaichao what are your thoughts on this? If you think a compilation pass is the better choice, I'm willing to help too!

I didn’t see this message! It would be awesome for you to work on that!

Great! I'll work on it and keep you posted on the progress.

mergify bot commented Apr 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @cascade812.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

The mergify bot added the needs-rebase label on Apr 1, 2025
@tlrmchlsmth (Member)

@robertgshaw2-redhat @tlrmchlsmth @youkaichao what are your thoughts on this? If you think a compilation pass is the better choice, I'm willing to help too!

I didn’t see this message! It would be awesome for you to work on that!

Great! I'll work on it and keep you posted on the progress.

Hey @cascade812, I have some thoughts/pointers on this (sorry, meant to send these along earlier):

First, here is an earlier PR that attempted to do this via an inductor pass with pattern matching: #9886.
There are a couple of issues with that implementation: it has some deadlocking issues that aren't well understood, and some complexities when num_tokens % 4 != 0 (similar to things you need to deal with in this PR).

Another problem is that the pattern matching is very brittle and would need to be extended to support different models. @yaochengji raised this issue, and from my past experience I think this is a very valid concern.

In an effort to make it more flexible, some of us have discussed adding sentinel no-op operations (e.g. a begin_sp_region operation that does a clone at the end of RowParallelLinear's apply method, and an end_sp_region at the beginning of MergedColumnParallelLinear). Then we can find the regions fenced by these operations and do a rewrite there. This would let us be more robust and selective, and would neatly handle PP as well.
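For concreteness, here is a rough sketch of what such sentinel no-op ops could look like as PyTorch custom ops. This assumes PyTorch 2.4+'s torch.library.custom_op API; the op names come from the discussion above, but the registration and the pass described in the comments are hypothetical, not an agreed-upon design.

import torch

# Sentinel no-ops that survive tracing, so a compiler pass can locate the
# region between a RowParallelLinear output and the next ColumnParallelLinear.
@torch.library.custom_op("sp::begin_sp_region", mutates_args=())
def begin_sp_region(x: torch.Tensor) -> torch.Tensor:
    return x.clone()          # the clone keeps the op from being folded away

@begin_sp_region.register_fake
def _(x):
    return torch.empty_like(x)

@torch.library.custom_op("sp::end_sp_region", mutates_args=())
def end_sp_region(x: torch.Tensor) -> torch.Tensor:
    return x.clone()

@end_sp_region.register_fake
def _(x):
    return torch.empty_like(x)

# A pass over the traced FX graph can then find nodes whose target is
# torch.ops.sp.begin_sp_region / torch.ops.sp.end_sp_region and rewrite the
# fenced region (allreduce -> reduce-scatter ... all-gather) without brittle
# whole-pattern matching on rms_norm variants.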

BTW are you on the vllm slack? If not, please join as it's easier to discuss there!

@yaochengji (Collaborator)

@tlrmchlsmth

I noticed that the layer norm is a CustomOp class. Do you think we can make it a composite rms_norm op in the FX graph for all cases?

@CustomOp.register("rms_norm")

@tlrmchlsmth (Member)

@tlrmchlsmth

I noticed that the layer norm is a CustomOp class. Do you think we can make it a composite rms_norm op in the FX graph for all cases?

@CustomOp.register("rms_norm")

Not sure about this. Often we make things custom ops when torch.compile has trouble with them. @youkaichao do you know?

self.vllm_config.parallel_config.enable_sequence_parallel
and num_tokens is not None and num_tokens %
self.vllm_config.parallel_config.tensor_parallel_size == 0)


It may be beneficial to log a warning the first time this occurs, so that it's clear from the logs when sequence parallelism is not being enabled due to the sequence length even though the config argument for sequence parallelism is True.
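A possible warn-once sketch of that suggestion (illustrative only; check_sequence_parallel and the module-level flag are hypothetical, not vLLM's actual logging helpers):

import logging

logger = logging.getLogger(__name__)
_warned_sp_disabled = False   # module-level flag so the warning fires only once

def check_sequence_parallel(enable_sp: bool, num_tokens: int, tp_size: int) -> bool:
    # Return whether SP can be used for this batch; warn once if it can't.
    global _warned_sp_disabled
    usable = enable_sp and num_tokens % tp_size == 0
    if enable_sp and not usable and not _warned_sp_disabled:
        logger.warning(
            "Sequence parallelism is enabled in the config but disabled for this "
            "forward pass because num_tokens (%d) is not divisible by the tensor "
            "parallel size (%d).", num_tokens, tp_size)
        _warned_sp_disabled = True
    return usable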

@tlrmchlsmth (Member)

Closing in favor of #16155, which has been merged!
