
Conversation

Contributor

@jdefreitas02 jdefreitas02 commented Apr 22, 2025

This PR improves LoRA compilation time and memory usage by splitting the large graph created by setting LoRAs into smaller sub-graphs. It also stops recompilations caused by indexing multiple LoRAs.

It further optimises the work done in #15655
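
(Editor's sketch, not the PR's actual code: a minimal torch.compile illustration of the two ideas above, namely compiling the LoRA-setting work as several small functions so each produces its own small graph, and marking the index tensor's token dimension as dynamic so a different number of indexed tokens does not force a recompile. The real change targets the TPU/XLA path; all names and shapes here are hypothetical.)

import torch

# Hypothetical per-layer update: compiling it separately keeps each call a
# small graph instead of one large graph that covers every layer at once.
@torch.compile
def set_layer_lora(dst_a, dst_b, src_a, src_b):
    dst_a.copy_(src_a)
    dst_b.copy_(src_b)

@torch.compile
def gather_loras(loras, idxs):
    return loras[idxs]

loras = torch.randn(4, 8, 16)              # [max_loras, rank, hidden]
idxs = torch.zeros(32, dtype=torch.long)   # token -> LoRA slot mapping
# Mark the token dimension as dynamic so a different number of tokens
# (or active LoRAs) reuses the same compiled graph instead of recompiling.
torch._dynamo.mark_dynamic(idxs, 0)
selected = gather_loras(loras, idxs)       # [32, rank, hidden]

set_layer_lora(torch.zeros(8, 16), torch.zeros(16, 8),
               torch.randn(8, 16), torch.randn(16, 8))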


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which exercises a small, essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the ci/build, v1, and tpu (Related to Google TPUs) labels Apr 22, 2025

mergify bot commented Apr 22, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jdefreitas02.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 22, 2025
@jdefreitas02 jdefreitas02 changed the title Better tpu multilora compilation [Hardware][TPU][V1] Better tpu multilora compilation Apr 22, 2025
@@ -0,0 +1,98 @@
# SPDX-License-Identifier: Apache-2.0
Collaborator

@yaochengji yaochengji Apr 27, 2025

@bythew3i , could you help review the multi-lora kernels?

By way of introduction, @bythew3i is a Pallas and TPU expert and also the main author of the ragged paged attention kernel in vLLM.

Contributor

Thanks Chengji for the introduction. Also thanks @jdefreitas02 for the detailed comments in the Pallas kernel.

I wonder what the motivation is for writing a Pallas kernel to implement multi-LoRA? It seems to me that a normal PyTorch implementation can achieve better performance; not every case needs a kernel. Unless we want to manually fuse the LoRA kernel into the attention kernel, I do not see the Pallas kernel outperforming, given that it cannot be naturally fused with other ops by XLA on TPU.

(cc: @yarongmu-google )

Contributor

Hi @bythew3i, there were a few reasons for writing the kernel. In the original PR I started out with a PyTorch implementation similar to the one in the CPU backend, but it was extremely slow, which led me down this route.

I didn't look at the IR or HLO for it, but my guess is that the index_select operation causes a lot of data copies, which we're able to avoid in a kernel.

This kernel also has the LoRA laning feature, which allows us to pack multiple adapters into one TPU register, reducing the number of matrix multiplications we need to do by a large factor.
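
(Hedged, editor-added illustration of the laning idea at the shape level: several rank-r adapters are laid side by side so one wide matmul covers all of them instead of one narrow matmul per adapter. This is a paraphrase of the concept, not the Pallas kernel itself, and all shapes are made up.)

import torch

T, D, r, N = 16, 256, 8, 4            # tokens, hidden dim, LoRA rank, adapters
x = torch.randn(T, D)
loras = torch.randn(N, D, r)          # N separate shrink matrices

# Unpacked: one small [D, r] matmul per adapter, wasting most of the wide
# matrix unit when r is far smaller than the 128-lane register width.
per_adapter = [x @ loras[i] for i in range(N)]

# Packed ("laned"): concatenate the adapters along the last dim so a single
# [D, N*r] matmul computes all of them, then slice out each token's adapter.
packed = loras.permute(1, 0, 2).reshape(D, N * r)    # [D, N*r]
all_out = x @ packed                                  # [T, N*r]
idx = torch.randint(0, N, (T,))                       # adapter id per token
cols = idx[:, None] * r + torch.arange(r)             # that adapter's columns
out = torch.gather(all_out, 1, cols)                  # [T, r]
assert torch.allclose(out[0], per_adapter[idx[0]][0], atol=1e-4)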

Contributor

So @jdefreitas02 ran a benchmark comparing a PyTorch implementation against the kernels for Llama 3.1 8B, with 1 LoRA, 1024 input tokens, and 1024 output tokens. The results were:

PyTorch: 406 tok/s
Pallas: 1407 tok/s

Contributor

Can you please share the baseline implementation?

Contributor

Sure, the kernels are replaced with this function:

import torch

def ref_bgmv(inputs: torch.Tensor, loras: torch.Tensor, idxs: torch.Tensor):
    # Pick one LoRA matrix per token: shape [T, (1,) L, D]
    selected_loras = loras[idxs]
    if len(selected_loras.shape) == 4:
        selected_loras = selected_loras.squeeze(dim=1)

    T, L, D = selected_loras.shape
    # Batched matrix-vector product: [T, L, D] @ [T, D, 1] -> [T, L]
    return (selected_loras @ inputs.reshape((T, D, 1))).reshape((T, L))

@jdefreitas02 do you still have the code where these are integrated?

Contributor

@bythew3i bythew3i May 8, 2025

Thanks! Can you also please share all the inputs' shape and dtype that you used for benchmarking?

Contributor Author

Both the input and output sizes were 1024 and the dtype was fp16. Also we used Nvidia's GenAI perf image.
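
(Editor's sketch tying the pieces together: a hypothetical call to the ref_bgmv baseline above with 1 LoRA and 1024 tokens as described. The rank and hidden size are assumptions, and float32 is used so the snippet runs anywhere; the actual benchmark ran in fp16 on TPU.)

import torch

T = 1024                   # tokens, as in the benchmark
D = 4096                   # hidden size (Llama 3.1 8B), assumed here
L = 8                      # LoRA rank, purely illustrative
N = 1                      # a single LoRA adapter

inputs = torch.randn(T, D)
loras = torch.randn(N, 1, L, D)            # the extra dim is squeezed inside
idxs = torch.zeros(T, dtype=torch.long)    # every token maps to adapter 0

out = ref_bgmv(inputs, loras, idxs)        # shape [T, L]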

Contributor

Thanks for the info!

Now I see that the original TPU gather is too slow, and the mask solution is a nice optimization! I put the other comments in #15655 (comment), PTAL. Thanks!
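
(Editor's illustration of the contrast being discussed: a gather formulation that materialises per-token copies of the selected adapters versus a one-hot-mask formulation that turns the selection into matmuls, which XLA on TPU generally handles better. This is a reading of the "mask solution" idea, not the kernel's actual implementation; shapes are made up.)

import torch
import torch.nn.functional as F

T, N, L, D = 8, 4, 16, 64            # tokens, adapters, rank, hidden
loras = torch.randn(N, L, D)
inputs = torch.randn(T, D)
idxs = torch.randint(0, N, (T,))

# Gather formulation: loras[idxs] copies a [L, D] slice per token.
gathered = torch.einsum('tld,td->tl', loras[idxs], inputs)

# Mask formulation: a one-hot [T, N] mask expresses the same selection as
# matmuls, with no per-token copy of the adapter weights.
mask = F.one_hot(idxs, N).to(inputs.dtype)
masked = torch.einsum('tn,nld,td->tl', mask, loras, inputs)

assert torch.allclose(gathered, masked, atol=1e-4)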

Collaborator

@yaochengji yaochengji left a comment

Thanks for your contribution and the continuous improvements from the old PR!

I tested the test_lora.py LoRA test locally, but got the following error after pre-compilation finished. Do you know why?

RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: open(/dev/vfio/0): Device or resource busy: Device or resource busy; Couldn't open iommu group /dev/vfio/0

bias = bias.view(-1, bias.shape[-1])
bias = bias[indices]
bias = torch.where(indices[:, None] == -1, 0, bias)

Collaborator

nit: remove the empty line.

"cpu",
long_lora_context,
)
self._token_lora_indices[:base_indices.shape[0]] = base_indices.to(
Collaborator

Can we use self._token_lora_indices = base_indices.to(self.device) here? The underlying implementation is different from the GPU one: the current slice assignment creates intermediate buffers.

Contributor Author

Yes, I have now implemented this and it removed a couple of subgraphs in the process.
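
(Editor's sketch of the two patterns being compared; the variable names mirror the snippet above but the setup is hypothetical.)

import torch

device = "cpu"    # stand-in; the discussion is about the TPU/XLA device
base_indices = torch.arange(5, dtype=torch.int32)
token_lora_indices = torch.zeros(16, dtype=torch.int32, device=device)

# Before: copy into a slice of a pre-allocated buffer. Per the review
# comment above, on the TPU backend this goes through intermediate buffers.
token_lora_indices[:base_indices.shape[0]] = base_indices.to(device)

# After: rebind to the transferred tensor directly, avoiding the
# intermediate copy (and, per the author, a couple of subgraphs).
token_lora_indices = base_indices.to(device)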

self.lora_config.max_lora_rank = _get_padded_lora_rank(
self.lora_config.max_lora_rank, self.lora_config.max_loras)

if self.lora_config is not None:
Collaborator

Seems redundant.

self.lora_config, self.device)
replace_set_lora(model)
punica_wrapper = self.lora_manager._adapter_manager.punica_wrapper
if not self.enforce_eager:
Collaborator

We can call mark_compiled even when enforce_eager is set.

Contributor

Yep, I've changed this in the original PR #14238; I'm planning on merging it once it's accepted.

def get_input_embeddings(self, *args, **kwargs):
return self.model.get_input_embeddings(*args, **kwargs)

def add_lora(self, lora_request: LoRARequest) -> bool:
Contributor

Is this function still needed with the new set_lora function?

@Akshat-Tripathi
Contributor

I tested the test_lora.py LoRA test locally, but got the following error after pre-compilation finished. Do you know why?

RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: open(/dev/vfio/0): Device or resource busy: Device or resource busy; Couldn't open iommu group /dev/vfio/0

That looks like another process is using your TPU; killing it should help.

@Akshat-Tripathi
Contributor

@jdefreitas02 Do you have a list of the graphs we're compiling now?

@jdefreitas02
Contributor Author

@jdefreitas02 Do you have a list of the graphs we're compiling now?

| Stage | # XLA graphs |
| --- | --- |
| backbone | 14 |
| tpu_set_lora | 6 |
| select_hidden_states | 4 |
| sample_from_hidden | 10 |

@yaochengji
Collaborator

That looks like another process is using your TPU; killing it should help.

There's no other process. I guess something is wrong in the program.

Jorge de Freitas added 2 commits April 29, 2025 13:47
@Akshat-Tripathi
Contributor

There's no other process. I guess something is wrong in the program.

Hi @yaochengji, we're not able to reproduce your error on our end; maybe it's something in your environment?

@yaochengji
Collaborator

Hi @yaochengji, we're not able to reproduce your error on our end; maybe it's something in your environment?

But my other program runs fine. NVM, we can focus on the first multi-LoRA PR for now.
