Conversation

@Akshat-Tripathi (Contributor) commented Mar 27, 2025

Summary

This PR optimises the Multi-LoRA implementation from #14238 and should be merged after that PR.

This includes several kernel optimisations:

  • Block size tuning 2bb8868 d7338f8
  • Faster mask creation 2aacb34
  • Allowing for some blocks to be skipped 6ee0b57
  • Adding LoRA Laning eb804a0
  • Splitting the Pallas kernel into shrink/expand variants de6746a
  • Removing masking when only 1 LoRA adapter is used aad109b

And a few general ones:

  • Pre-transposing the LoRA adapters used in the expand op a82f3fe (see the sketch after this list)
  • Reducing recompilations 5638e7d
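
For readers less familiar with the shrink/expand terminology above, the sketch below illustrates the decomposition in plain JAX: the shrink op projects activations into the low-rank space and the expand op projects them back, with the B matrices stored pre-transposed so the expand matmul avoids a transpose on the hot path. This is only an illustrative sketch under assumed shapes and names, not the Pallas kernels from this PR.

```python
import jax.numpy as jnp

def lora_shrink(x, lora_a, token_lora_ids):
    # x: [num_tokens, hidden]; lora_a: [num_loras, hidden, rank]
    # Gather each token's A matrix and project down into the low-rank space.
    a_per_token = lora_a[token_lora_ids]              # [num_tokens, hidden, rank]
    return jnp.einsum("th,thr->tr", x, a_per_token)   # [num_tokens, rank]

def lora_expand(h, lora_b_t, token_lora_ids):
    # lora_b_t is stored pre-transposed as [num_loras, rank, hidden], so the
    # expand matmul needs no transpose at runtime.
    b_per_token = lora_b_t[token_lora_ids]            # [num_tokens, rank, hidden]
    return jnp.einsum("tr,trh->th", h, b_per_token)   # [num_tokens, hidden]
```

Splitting the two steps also lets each kernel choose its own block sizes and skip blocks independently, which is roughly what the block-size tuning and block-skipping items above target.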

Things left/RFC

  • There are still a few recompilations at the start of a run that I need to track down
  • LogitsProcessorWithLoRA introduces a long (~1.5 second) stall when it's enabled, but not much activity seems to happen on the CPU or TPU during this time. I've disabled this for now.
  • It seems LogitsProcessorWithLoRA is always created even if there's no LoRA adapter that needs it. Is there a reason for this?
  • I have microbenchmarks for the kernels, but I'm not sure where the right place to put them is.

@yaochengji (Collaborator) left a comment

LGTM, thanks for the contribution!

@yaochengji (Collaborator) left a comment

LGTM, thanks!

@NickLucche (Collaborator) left a comment

Nice work optimizing LoRA here! Just had some minor notes; please take a look when you find the time. Otherwise we can address them in a separate PR if need be.

@yaochengji yaochengji enabled auto-merge (squash) May 27, 2025 20:02

mergify bot commented May 27, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Akshat-Tripathi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 27, 2025
auto-merge was automatically disabled May 28, 2025 07:17

Head branch was pushed to by a user without write access

@mergify mergify bot removed the needs-rebase label May 28, 2025
@NickLucche (Collaborator) left a comment

LGTM, sorry for delaying the merge a bit! Let's get this landed today.

@Akshat-Tripathi (Contributor, Author) commented

LGTM, sorry for delaying the merge a bit! Let's get this landed today.

No worries! Yep, I'm hoping we can merge it in once these tests pass. Would you mind re-enabling auto-merge?

@yaochengji yaochengji enabled auto-merge (squash) May 28, 2025 17:28
@yaochengji yaochengji merged commit 643622b into vllm-project:main May 28, 2025
68 checks passed
amitm02 pushed a commit to amitm02/vllm that referenced this pull request Jun 1, 2025
…llm-project#15655)

@amanocha commented Jul 18, 2025

Benchmarking LoRA against baseline (no LoRA) throughput

We use NVIDIA's GenAI-Perf tool to force fixed-length inputs and outputs to produce "heatmap" plots as below. On TPU-v6e and H100 instances, we vary the inputs from 128 to 8k. On L4 instances, we vary the inputs from 128 to 2k.

We calculate the LoRA slowdown as ((LoRA throughput / baseline throughput) - 1) * 100%.

Llama3.1-8B

1x TPU-v6e

The LoRA slowdown varies from -8.4% to -23.9%.

Llama3 1-8B_1xTPU-v6e

1x GPU-L4

The LoRA slowdown varies from -17.3% to -32.8%.
Llama3 1-8B_1xGPU-L4_v2

1x GPU-H100

The LoRA slowdown varies from -10.0% to -51.8%.
Llama3 1-8B_1xGPU-H100

Llama3.1-70B

8x TPU-v6e

The LoRA slowdown varies from -20.7% to -46.3%.
Llama3 1-70B_8xTPU-v6e

8x GPU-L4

The LoRA slowdown varies from -13.8% (second best: -25.1%) to -49.7%.
Llama3 1-70B_8xGPU-L4

4x GPU-H100

Unable to launch VMs due to persistent unavailability across multiple zones and regions.
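
For reference, the slowdown metric quoted above is a simple relative-throughput calculation; a quick illustrative check (numbers are made up, not taken from the plots):

```python
def lora_slowdown(lora_throughput: float, baseline_throughput: float) -> float:
    """((LoRA throughput / baseline throughput) - 1) * 100%."""
    return (lora_throughput / baseline_throughput - 1) * 100

# e.g. 91.6 req/s with LoRA vs 100 req/s baseline:
print(lora_slowdown(91.6, 100.0))  # ≈ -8.4, i.e. an 8.4% throughput drop
```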


I have a few questions about the data published in this PR:

  1. What configurations are used in these experiments, e.g. number of adapters, rank, batch size?
  2. Are adapters loaded dynamically?
  3. Did you measure latency? How did you measure adapter loading time?

I also have a few questions about the "Hot Swapping" and "Compare Multi-LoRAs" tabs in this link: https://insights.krai.ai/benchmarking-multi-lora

  1. What is the difference between these two tabs? Are all of the LoRAs pre-allocated ahead of time in the "Compare Multi-LoRAs" results? In a "static" multi-LoRA fashion?
  2. Is the number of LoRAs the batch size? Do you assign 1 LoRA per batch?
  3. What does the "0x LoRAs" represent? Is this the performance of the base model? And if so, is the batch size 1 and you are comparing the base model with batch size=1 to multi-LoRA with possibly batch size > 1?
  4. I also see in the "Compare Multi-LoRAs" tab you measure performance regression for 1-9 LoRAs and 1-4 LoRAs for "Hot Swapping" - do you have data for 8 LoRAs for hot swapping? And did you have data for an input size of 4096 for "Compare multi-LoRAs"? (Trying to do a side-by-side comparison of the two configurations.)
  5. Is the data for the TPU? Did your colleagues happen to run the same experiments on the L4 and H100 GPUs?

@Akshat-Tripathi (Contributor, Author) commented


I have a few questions about the data published in this PR:

  1. What configurations are used in these experiments, e.g. number of adapters, rank, batch size?
  2. Are adapters loaded dynamically?
  3. Did you measure latency? How did you measure adapter loading time?

I also have a few questions about the "Hot Swapping" and "Compare Multi-LoRAs" tabs in this link: https://insights.krai.ai/benchmarking-multi-lora

  1. What is the difference between these two tabs? Are all of the LoRAs pre-allocated ahead of time in the "Compare Multi-LoRAs" results? In a "static" multi-LoRA fashion?
  2. Is the number of LoRAs the batch size? Do you assign 1 LoRA per batch?
  3. What does the "0x LoRAs" represent? Is this the performance of the base model? And if so, is the batch size 1 and you are comparing the base model with batch size=1 to multi-LoRA with possibly batch size > 1?
  4. I also see in the "Compare Multi-LoRAs" tab you measure performance regression for 1-9 LoRAs and 1-4 LoRAs for "Hot Swapping" - do you have data for 8 LoRAs for hot swapping? And did you have data for an input size of 4096 for "Compare multi-LoRAs"? (Trying to do a side-by-side comparison of the two configurations.)
  5. Is the data for the TPU? Did your colleagues happen to run the same experiments on the L4 and H100 GPUs?

Hi @amanocha, thanks for your interest.

  1. These experiments were all run with a single adapter of rank 16. We didn't batch by sequence; vLLM used 128 batched tokens.
  2. The adapters were loaded statically for this experiment.
  3. The GenAI-Perf tool measures latency and throughput, so we do have the data, albeit unprocessed. We didn't explicitly measure adapter loading times, no.

As for the questions about the website:

  1. For the "Compare Multi-LoRAs" page we allocated LoRAs statically; "Hot Swapping" here means that we swap adapters between the CPU and TPU.
  2. Yes and no. For "Compare Multi-LoRAs" we have a "batch" of LoRAs that can all be applied at once, so the LoRAs there all run together. When hot swapping we run each LoRA individually. In a real-world use case these can be mixed and matched, so it's possible to have a batch of 2 active LoRAs whilst serving 8.
  3. Yep, "0x LoRAs" means that no LoRAs are present, but since we're not batching by LoRA it doesn't affect the "Compare Multi-LoRAs" results. I think we normalised the token batches when hot swapping.
  4. Unfortunately, no, we didn't have time to collect that data; the same goes for the L4 and H100 GPUs.
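
For anyone wanting to reproduce a static multi-LoRA setup like the one described in answer 1 above, here is a minimal sketch using vLLM's offline API. The model name and adapter path are placeholders, and exact flags may differ on the TPU backend; treat it as an assumption-laden example, not the benchmark harness used here.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Placeholder model and adapter path; rank-16 adapter to match the experiments above.
llm = LLM(
    model="meta-llama/Llama-3.1-8B",
    enable_lora=True,
    max_loras=4,        # adapters resident on the accelerator at once
    max_lora_rank=16,
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

# Each request can name a different adapter; requests in the same step are batched together.
outputs = llm.generate(
    ["Summarise the PR in one sentence."],
    sampling_params,
    lora_request=LoRARequest("my_adapter", 1, "/path/to/lora_adapter"),
)
print(outputs[0].outputs[0].text)
```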


Labels

ci/build, ready (ONLY add when PR is ready to merge/full CI is needed), tpu (Related to Google TPUs), v1
