Conversation

@Akshat-Tripathi (Contributor) commented Mar 27, 2025

Summary

This PR optimises the Multi-LoRA implementation from #14238 and should be merged after that PR.

This includes several kernel optimisations:

  • Block size tuning 2bb8868 d7338f8
  • Faster mask creation 2aacb34
  • Allowing for some blocks to be skipped 6ee0b57
  • Adding LoRA Laning eb804a0
  • Splitting the Pallas kernel into shrink/expand variants de6746a
  • Removing masking when only 1 LoRA adapter is used aad109b

And a few general ones:

  • Pre-transposing the LoRA adapters used in the expand op a82f3fe (see the sketch after this list)
  • Reducing recompilations 5638e7d
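
For readers less familiar with the shrink/expand terminology above, the sketch below illustrates the decomposition in plain JAX: the shrink op projects activations into the low-rank space and the expand op projects them back, with the B matrices stored pre-transposed so the expand matmul avoids a transpose on the hot path. This is only an illustrative sketch under assumed shapes and names, not the Pallas kernels from this PR.

```python
import jax.numpy as jnp

def lora_shrink(x, lora_a, token_lora_ids):
    # x: [num_tokens, hidden]; lora_a: [num_loras, hidden, rank]
    # Gather each token's A matrix and project down into the low-rank space.
    a_per_token = lora_a[token_lora_ids]              # [num_tokens, hidden, rank]
    return jnp.einsum("th,thr->tr", x, a_per_token)   # [num_tokens, rank]

def lora_expand(h, lora_b_t, token_lora_ids):
    # lora_b_t is stored pre-transposed as [num_loras, rank, hidden], so the
    # expand matmul needs no transpose at runtime.
    b_per_token = lora_b_t[token_lora_ids]            # [num_tokens, rank, hidden]
    return jnp.einsum("tr,trh->th", h, b_per_token)   # [num_tokens, hidden]
```

Splitting the two steps also lets each kernel choose its own block sizes and skip blocks independently, which is roughly what the block-size tuning and block-skipping items above target.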

Things left/RFC

  • There are still a few recompilations at the start of a run that I need to track down
  • LogitsProcessorWithLoRA introduces a long (~1.5 second) stall when it's enabled, but not much activity seems to happen on the CPU or TPU during this time. I've disabled this for now.
  • It seems LogitsProcessorWithLoRA is always created even if there's no LoRA adapter that needs it. Is there a reason for this?
  • I have microbenchmarks for the kernels, but I'm not sure where the right place to put them is.

@yaochengji (Collaborator) left a comment

LGTM, thanks for the contribution!

@yaochengji (Collaborator) left a comment

LGTM, thanks!

@NickLucche (Collaborator) left a comment

Nice work optimizing LoRA here! Just had some minor notes; please take a look when you find the time. Otherwise we can address them in a separate PR if need be.

@yaochengji yaochengji enabled auto-merge (squash) May 27, 2025 20:02

mergify bot commented May 27, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Akshat-Tripathi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 27, 2025
auto-merge was automatically disabled May 28, 2025 07:17

Head branch was pushed to by a user without write access

@mergify mergify bot removed the needs-rebase label May 28, 2025
@NickLucche (Collaborator) left a comment

LGTM, sorry for delaying the merge a bit! Let's get this landed today.

@Akshat-Tripathi (Contributor, Author) commented

LGTM, sorry for delaying the merge a bit! Let's get this landed today.

No worries! Yep, I'm hoping we can merge it in once these tests pass. Would you mind re-enabling auto-merge?

@yaochengji yaochengji enabled auto-merge (squash) May 28, 2025 17:28
@yaochengji yaochengji merged commit 643622b into vllm-project:main May 28, 2025
68 checks passed
amitm02 pushed a commit to amitm02/vllm that referenced this pull request Jun 1, 2025
…llm-project#15655)

@amanocha commented Jul 18, 2025

Benchmarking LoRA against baseline (no LoRA) throughput

We use NVIDIA's GenAI-Perf tool to force fixed-length inputs and outputs to produce "heatmap" plots as below. On TPU-v6e and H100 instances, we vary the inputs from 128 to 8k. On L4 instances, we vary the inputs from 128 to 2k.

We calculate the LoRA slowdown as ((LoRA throughput / baseline throughput) - 1) * 100%.

Llama3.1-8B

1x TPU-v6e

The LoRA slowdown varies from -8.4% to -23.9%.

Llama3 1-8B_1xTPU-v6e

1x GPU-L4

The LoRA slowdown varies from -17.3% to -32.8%.
Llama3 1-8B_1xGPU-L4_v2

1x GPU-H100

The LoRA slowdown varies from -10.0% to -51.8%.
Llama3 1-8B_1xGPU-H100

Llama3.1-70B

8x TPU-v6e

The LoRA slowdown varies from -20.7% to -46.3%.
Llama3 1-70B_8xTPU-v6e

8x GPU-L4

The LoRA slowdown varies from -13.8% (second best: -25.1%) to -49.7%.
Llama3 1-70B_8xGPU-L4

4x GPU-H100

Unable to launch VMs due to persistent unavailability across multiple zones and regions.
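
For reference, the slowdown metric quoted above is a simple relative-throughput calculation; a quick illustrative check (numbers are made up, not taken from the plots):

```python
def lora_slowdown(lora_throughput: float, baseline_throughput: float) -> float:
    """((LoRA throughput / baseline throughput) - 1) * 100%."""
    return (lora_throughput / baseline_throughput - 1) * 100

# e.g. 91.6 req/s with LoRA vs 100 req/s baseline:
print(lora_slowdown(91.6, 100.0))  # ≈ -8.4, i.e. an 8.4% throughput drop
```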


I have a few questions about the data published in this PR:

  1. What configurations are used in these experiments, e.g. number of adapters, rank, batch size?
  2. Are adapters loaded dynamically?
  3. Did you measure latency? How did you measure adapter loading time?

I also have a few questions about the "Hot Swapping" and "Compare Multi-LoRAs" tabs in this link: https://insights.krai.ai/benchmarking-multi-lora

  1. What is the difference between these two tabs? Are all of the LoRAs pre-allocated ahead of time in the "Compare Multi-LoRAs" results? In a "static" multi-LoRA fashion?
  2. Is the number of LoRAs the batch size? Do you assign 1 LoRA per batch?
  3. What does the "0x LoRAs" represent? Is this the performance of the base model? And if so, is the batch size 1 and you are comparing the base model with batch size=1 to multi-LoRA with possibly batch size > 1?
  4. I also see in the "Compare Multi-LoRAs" tab you measure performance regression for 1-9 LoRAs and 1-4 LoRAs for "Hot Swapping" - do you have data for 8 LoRAs for hot swapping? And did you have data for an input size of 4096 for "Compare multi-LoRAs"? (Trying to do a side-by-side comparison of the two configurations.)
  5. Is the data for the TPU? Did your colleagues happen to run the same experiments on the L4 and H100 GPUs?

@Akshat-Tripathi (Contributor, Author) commented


I have a few questions about the data published in this PR:

  1. What configurations are used in these experiments, e.g. number of adapters, rank, batch size?
  2. Are adapters loaded dynamically?
  3. Did you measure latency? How did you measure adapter loading time?

I also have a few questions about the "Hot Swapping" and "Compare Multi-LoRAs" tabs in this link: https://insights.krai.ai/benchmarking-multi-lora

  1. What is the difference between these two tabs? Are all of the LoRAs pre-allocated ahead of time in the "Compare Multi-LoRAs" results? In a "static" multi-LoRA fashion?
  2. Is the number of LoRAs the batch size? Do you assign 1 LoRA per batch?
  3. What does the "0x LoRAs" represent? Is this the performance of the base model? And if so, is the batch size 1 and you are comparing the base model with batch size=1 to multi-LoRA with possibly batch size > 1?
  4. I also see in the "Compare Multi-LoRAs" tab you measure performance regression for 1-9 LoRAs and 1-4 LoRAs for "Hot Swapping" - do you have data for 8 LoRAs for hot swapping? And did you have data for an input size of 4096 for "Compare multi-LoRAs"? (Trying to do a side-by-side comparison of the two configurations.)
  5. Is the data for the TPU? Did your colleagues happen to run the same experiments on the L4 and H100 GPUs?

Hi @amanocha, thanks for your interest.

  1. These experiments were all run with a single adapter of rank 16. We didn't batch by sequence; vLLM used 128 batched tokens.
  2. The adapters were loaded statically for this experiment.
  3. The GenAI-Perf tool measures latency and throughput, so we do have the data, albeit unprocessed. We didn't explicitly measure adapter loading times, no.

As for the questions about the website:

  1. For the "Compare Multi-LoRAs" page we allocated LoRAs statically; "Hot Swapping" here means that we swap adapters between the CPU and TPU.
  2. Yes and no. For "Compare Multi-LoRAs" we have a "batch" of LoRAs that can all be applied at once, so the LoRAs there all run together. When hot swapping we run each LoRA individually. In a real-world use case these can be mixed and matched, so it's possible to have a batch of 2 active LoRAs whilst serving 8.
  3. Yep, "0x LoRAs" means that no LoRAs are present, but since we're not batching by LoRA it doesn't affect the "Compare Multi-LoRAs" results. I think we normalised the token batches when hot swapping.
  4. Unfortunately, no, we didn't have time to collect that data; the same goes for the L4 and H100 GPUs.
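
For anyone wanting to reproduce a static multi-LoRA setup like the one described in answer 1 above, here is a minimal sketch using vLLM's offline API. The model name and adapter path are placeholders, and exact flags may differ on the TPU backend; treat it as an assumption-laden example, not the benchmark harness used here.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Placeholder model and adapter path; rank-16 adapter to match the experiments above.
llm = LLM(
    model="meta-llama/Llama-3.1-8B",
    enable_lora=True,
    max_loras=4,        # adapters resident on the accelerator at once
    max_lora_rank=16,
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

# Each request can name a different adapter; requests in the same step are batched together.
outputs = llm.generate(
    ["Summarise the PR in one sentence."],
    sampling_params,
    lora_request=LoRARequest("my_adapter", 1, "/path/to/lora_adapter"),
)
print(outputs[0].outputs[0].text)
```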


Labels

ci/build, ready (ONLY add when PR is ready to merge/full CI is needed), tpu (Related to Google TPUs), v1
