
Conversation

@lsy323 (Collaborator) commented Mar 31, 2025

Make @support_torch_compile work for the XLA backend. With the custom dispatcher, the overhead of dynamo guard evaluation is eliminated.

For the TPU backend, each model has 2 FX graphs/dynamo bytecodes:

  1. During the profiling run - no KV cache
  2. After the profiling run - with KV cache

This breaks the assumption in the current @support_torch_compile implementation that each model has one FX graph/cached bytecode. Since the profiling graph won't be invoked after the profiling run, we clear the cached bytecode once profiling completes so that the assumption remains true.
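A minimal sketch of the bytecode-clearing idea, assuming the compiled backbone produced by @support_torch_compile lives at `model_runner.model.model` and keeps its cached bytecode in a `compiled_codes` list (both attribute names assumed here, not confirmed by this PR text):

```python
import torch


def reset_dynamo_cache(model_runner) -> None:
    """Sketch: drop dynamo state and the bytecode cached during profiling."""
    compiled_model = model_runner.model.model  # assumed attribute path
    # Clear dynamo's global caches and forget the bytecode compiled during the
    # profiling run (no KV cache), so the post-profiling compilation becomes
    # the single cached bytecode again.
    torch._dynamo.reset()
    compiled_model.compiled_codes.clear()  # attribute name assumed
```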

Other changes:

  • Remove ModelWrapperV1, which was used to wrap the model code and torch.compile the wrapped model. It's no longer needed since we reuse the @support_torch_compile decorator.
  • Since ModelWrapperV1 is removed, the sampler logic is moved to a separate function.

Credit to @WoosukKwon for the idea of clearing the bytecode cache, and to @youkaichao for all the help with torch dynamo-related questions!

cc @youkaichao @alexm-redhat @miladm @NickLucche @WoosukKwon @yaochengji @robertgshaw2-redhat

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify bot added the v1 and tpu (Related to Google TPUs) labels Mar 31, 2025
@lsy323 (Collaborator Author) commented Mar 31, 2025

Slightly improved throughput, 6.09 -> 6.14 req/s. The benchmarking commands are:

VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
 --disable-log-requests \
 --port 8004 \
 --gpu-memory-utilization 0.95 \
 --max-num-seqs 512 \
 --max-num-batched-tokens 512 \
 --tensor-parallel-size 1 \
 --max-model-len 2048

python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Llama-3.1-8B-Instruct  \
    --dataset-name random \
    --random-input-len 1800 \
    --random-output-len 128 \
    --random-prefix-len 0 \
    --port 8004

self._hidden_states_dtype = out.dtype
self.model(input_ids=input_ids,
           positions=position_ids,
           inputs_embeds=inputs_embeds)
Member

Please keep the _hidden_states_dtype assignment

Collaborator

yep that's still needed

@lsy323 (Collaborator Author) Mar 31, 2025

IMO updating _hidden_states_dtype is not needed, since _hidden_states_dtype is already initialized with the model dtype. Also, it seems we don't need _hidden_states_dtype at all, since it should be the same as the model dtype.

Collaborator

I think this is out of scope, but yes, in principle you'd only need self.dtype. The dtype issue you linked has proven that nothing crashes at runtime should the output of an op not match self.dtype, though.
Hence, if the same bug were to reappear, we would only notice the server recompiling and we'd have to debug it again the way I did, painfully.

Collaborator Author

Thanks! Updated

Comment on lines 791 to 787
xm.wait_device_ops()
xm.wait_device_ops()
Member

Is there a reason to pull the device sync inside the loop? IIRC we pulled it out since it made parallel compilation slightly quicker.
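A small sketch of the pattern this comment describes, with hypothetical helper and variable names (`_dummy_run`, `warmup_sizes`): issue all dummy runs first, then do a single device sync so the per-size graphs can compile in parallel.

```python
import torch_xla.core.xla_model as xm


def warm_up(model_runner, warmup_sizes) -> None:
    # Hypothetical names; only the placement of the sync matters here.
    for num_tokens in warmup_sizes:
        model_runner._dummy_run(num_tokens)  # queue compilation for each size
    # One device sync after the loop (not inside it), so the per-size graph
    # compilations can proceed in parallel.
    xm.wait_device_ops()
```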

Collaborator Author

Thanks! Updated.

Comment on lines -944 to -984
def get_multimodal_embeddings(self, *args, **kwargs):
    return self.model.get_multimodal_embeddings(*args, **kwargs)

def get_input_embeddings(self, *args, **kwargs):
    return self.model.get_input_embeddings(*args, **kwargs)
Member

Have you tested that multimodal inference still works and these are called correctly?

Collaborator Author

I only ran the TPU CI to test the change; multimodal is not tested. Can you provide a script to test multimodal? I can test it on this PR.

Collaborator

VLLM_USE_V1=1 vllm serve llava-hf/llava-1.5-7b-hf --max-model-len 4096 --max-num-seqs 8 --max-num-batched-tokens 512 --chat-template examples/template_llava.jinja

then python examples/online_serving/openai_chat_completion_client_for_multimodal.py

Collaborator

I am adding tests in another PR btw

Collaborator Author

sg I'll rebase to your PR after :)

Collaborator Author

Hi @NickLucche, thank you for providing the testing cmd for multimodal. I made this PR work for llava-hf/llava-1.5-7b-hf; the server cmd runs fine.

However, the client script python examples/online_serving/openai_chat_completion_client_for_multimodal.py fails at HEAD.

Collaborator

Main is working for me, let me try your PR

Collaborator

Still really slow but it works the same on this PR on my side, thanks!

Collaborator Author

@NickLucche Thank you so much for trying! I set up a new conda env on my end and tried as well. I also found it's really slow, was about to ask you lol.

Collaborator
@NickLucche left a comment

Thanks for the work!
Left a comment about MM, I think that's about the only thing to clarify on my side.


    return hidden_states

def reset_dynamo_cache(self):
    # TODO(lsy323): Support multimodal models, the backbone language model
Collaborator

do you mean

Suggested change
# TODO(lsy323): Support multimodal models, the backbone language model
compiled_model = self.model.language_model if self.is_multimodal_model else self.model.model

Collaborator Author

I haven't tested multimodal yet; I plan to do it in another PR. Right now I've only run the tests in the TPU CI.

Collaborator

> I haven't tested multimodal yet, plan to do it in another PR

I think it's best if we double-check here, otherwise we may inadvertently break MM.

Collaborator Author

I'll rebase onto the MM PR after it's merged to ensure this PR doesn't break MM.


class ModelWrapperV1(nn.Module):

    def __init__(self, model: nn.Module):
Collaborator

I didn't mind the wrapper too much; I think it was grouping a few related functions nicely. Still, if it benefits performance I am OK with that.

Collaborator Author

The wrapper was added in V0 for torch.compile. Now the @support_torch_compile decorator already wraps the model with torch.compile, so we don't need the wrapper anymore.
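To illustrate the idea (a conceptual sketch only, not vLLM's actual decorator): a class decorator can compile forward at construction time, which is why a separate wrapper module is no longer required.

```python
import torch
import torch.nn as nn


def compile_forward(cls):
    """Conceptual stand-in for @support_torch_compile (names invented):
    wrap forward with torch.compile when the module is constructed."""
    orig_init = cls.__init__

    def __init__(self, *args, **kwargs):
        orig_init(self, *args, **kwargs)
        # On TPU the backend would be "openxla"; the default backend is used
        # here so the sketch runs anywhere.
        self.forward = torch.compile(self.forward)

    cls.__init__ = __init__
    return cls


@compile_forward
class TinyModel(nn.Module):
    def __init__(self, dim: int = 8):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.proj(x))
```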

sample_hidden_states = \
    hidden_states[sampling_metadata.indices_do_sample]
logits = self.compute_logits(sample_hidden_states)
logits = self.model.compute_logits(sample_hidden_states, None)
Collaborator

Can we re-add the pruning comment here? Just in case it slips through in the future.

Collaborator Author

Sorry, may I ask what the 'pruning comment' is?

Collaborator

# SamplingMetadata here for pruning output in LogitsProcessor, disabled
or something along these lines to indicate why the 2nd argument is None. That's because passing SamplingMetadata could enable logits pruning, which is bad for XLA.
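A short sketch of where such a comment could sit; the surrounding lines are taken from the excerpt above, and only the comment is new:

```python
sample_hidden_states = \
    hidden_states[sampling_metadata.indices_do_sample]
# Pass None instead of SamplingMetadata: with metadata, LogitsProcessor may
# prune the logits, which introduces dynamic shapes and hurts XLA.
logits = self.model.compute_logits(sample_hidden_states, None)
```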

Collaborator Author

Thanks! Updated


def reset_dynamo_cache(self):
    # TODO(lsy323): Support multimodal models, the backbone language model
    # is stored in a different member.
    compiled_model = self.model.model
Collaborator

I wonder why we need to do model.model. Could you add a comment?

Collaborator

Is it possible self.model doesn't have an attr model? E.g. if it's not annotated by support_torch_compile?

Collaborator Author

Yes, model.model is the torch-compiled module.

Comment on lines 874 to 877
if self.is_multimodal_model:
    compiled_model = self.model.language_model.model
else:
    compiled_model = self.model.model
Member

AFAIK "language_model" is not a stable attribute to reference; it is based on the HF model definition. Maybe @ywang96 @DarkLight1337 would know a stable interface to access the language model backbone?

Collaborator Author

I think a potential solution is to add a query function get_backbone_lm for multimodal models, similar to the existing get_input_embeddings

Member

Yeah, we can add get_language_model to the SupportsMultiModal interface.
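A rough sketch of what such an interface addition could look like; the method name comes from the comment above, while the protocol body here is abbreviated and assumed rather than vLLM's actual definition:

```python
from typing import Protocol, runtime_checkable

import torch.nn as nn


@runtime_checkable
class SupportsMultiModal(Protocol):
    """Abbreviated sketch; only the proposed method is shown."""

    def get_language_model(self) -> nn.Module:
        """Return the backbone language model module, so callers don't have
        to reach into model-specific attributes like `language_model`."""
        ...
```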

Collaborator

Something like #16007?

Member

Please leave an assert here for now as we will address later @lsy323
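A small sketch of the requested guard, built on the excerpt above; the assert message is an assumption:

```python
if self.is_multimodal_model:
    # Guard against models whose backbone is not exposed as `language_model`;
    # to be replaced once a stable get_language_model() interface exists.
    assert hasattr(self.model, "language_model"), (
        "expected the multimodal model to expose a `language_model` backbone")
    compiled_model = self.model.language_model.model
else:
    compiled_model = self.model.model
```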

Collaborator
@yaochengji left a comment

LGTM, thanks!

Member
@mgoin left a comment

For now I think we should aim to land this just working with language_model since there are other blockers/developments there. Afterwards we can migrate to use the get_language_model interface proposed by Nicolo.


@mgoin added the ready (ONLY add when PR is ready to merge/full CI is needed) label Apr 8, 2025
@DarkLight1337 merged commit 87918e4 into vllm-project:main Apr 8, 2025
56 checks passed
@lsy323 deleted the lsiyuan/try-disable-dynamo-guard-3 branch April 8, 2025 18:05
@lsy323 restored the lsiyuan/try-disable-dynamo-guard-3 branch April 8, 2025 18:06
yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Apr 21, 2025
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025