Conversation

@jberkhahn
Contributor

@jberkhahn jberkhahn commented Mar 11, 2025

FIX #12174

Per the discussion in #12174, this PR implements a simple in-place solution that doesn't depend on any external infrastructure other than disk. It allows vLLM to be configured with the location of a directory in which LoRA adapters may be present. When vLLM receives a request for a LoRA adapter it doesn't recognize, it checks whether the adapter exists in that directory under a path matching the model name, i.e. VLLM_ADAPTER_CACHE/model_name. If found, it loads the adapter and continues as normal.

Note that this implementation only does this for LoRA adapters, but it would be easy to extend to other model types.
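For illustration, a minimal sketch of that lookup flow (the function and variable names here are hypothetical, not the identifiers used in the PR):

import os
from typing import Optional

def resolve_lora_adapter(model_name: str,
                         known_adapters: dict[str, str],
                         adapter_cache_dir: Optional[str]) -> Optional[str]:
    # Hypothetical helper: map a requested model name to a LoRA adapter path.
    # Already-registered adapter: nothing to do.
    if model_name in known_adapters:
        return known_adapters[model_name]

    # Unknown adapter: look for <adapter_cache_dir>/<model_name> on disk.
    if adapter_cache_dir is not None:
        candidate = os.path.join(adapter_cache_dir, model_name)
        if os.path.isdir(candidate):
            # Register it so later requests take the fast path above.
            known_adapters[model_name] = candidate
            return candidate

    # Not registered and not on disk: fall through to the usual error handling.
    return None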

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the frontend label Mar 11, 2025
@jberkhahn jberkhahn force-pushed the dynamic_lora2 branch 5 times, most recently from 2261dea to d1017c7 on March 11, 2025 21:36
@DarkLight1337 DarkLight1337 requested a review from jeejeelee March 12, 2025 02:29
Collaborator

Nice test 👍!

I think a negative test case would be good too. Can we throw some junk into the adapter cache, or maybe an adapter for a different base model, and ensure that we get a graceful 400 when trying to run a chat completion with it?
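(For reference, a hedged sketch of what such a negative test could look like with the async OpenAI client; the client and adapter_cache_dir fixtures are assumptions for illustration, not the actual fixtures in this PR.)

import openai
import pytest

@pytest.mark.asyncio
async def test_junk_adapter_returns_400(client: openai.AsyncOpenAI,
                                        adapter_cache_dir):
    # Hypothetical fixtures: `client` talks to a running vLLM server and
    # `adapter_cache_dir` is the cache directory that server was started with.
    junk_dir = adapter_cache_dir / "not-a-real-adapter"
    junk_dir.mkdir()
    (junk_dir / "adapter_config.json").write_text("this is not a LoRA adapter")

    # Requesting the junk "adapter" should fail gracefully with a 400
    # rather than crashing the server.
    with pytest.raises(openai.BadRequestError):
        await client.chat.completions.create(
            model="not-a-real-adapter",
            messages=[{"role": "user", "content": "Hello"}],
        )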

Contributor Author

I can add a negative test case for something that isn't a LoRA adapter, but how does a request know whether it has a matching base model? The request only has a model (which is actually a LoRA adapter) set; I don't see anything that relates the request to a base model, except that the LoRA adapter itself has its own base model.

Collaborator

If you try to load a LoRA adapter for a different model architecture than the base model, or for a differently sized model than the base model, then the engine should fail to load it and return an error response, which it looks like your code would already catch.

Collaborator
@joerunde joerunde Mar 12, 2025

On success, self.models.load_lora_adapter should cache a new LoRARequest for this adapter in self.models.lora_requests. Shouldn't we return that here, instead of None, None?

I guess if this logic block was moved above line 164 and the return was removed, then the existing block that checks for LoRA adapters would find and return it:

        for lora in self.models.lora_requests:
            if request.model == lora.lora_name:
                return lora, None
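Roughly, the suggested ordering would look something like this (the _load_adapter_from_cache_dir helper name is hypothetical):

        # Try the on-disk load first, without returning early; on success,
        # load_lora_adapter registers a new LoRARequest in
        # self.models.lora_requests ...
        if not any(request.model == lora.lora_name
                   for lora in self.models.lora_requests):
            await self._load_adapter_from_cache_dir(request.model)  # hypothetical helper

        # ... so the existing lookup then finds and returns it.
        for lora in self.models.lora_requests:
            if request.model == lora.lora_name:
                return lora, None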

vllm/envs.py Outdated
Collaborator

I have a slight preference for incorporating LORA in the name here to remove any ambiguity about the word ADAPTER.

Contributor Author

Hmm, I kept it generic in case we want to use this in the future for other cached stuff, backwards compatibility and all that. I can change it.

@joerunde
Collaborator

@smarterclayton My colleague @jberkhahn put this together to support the "shared file systems with write-once semantics and unique naming" use case for LoRA adapters that you brought up on the RFC, if you wanted to take a look.

@jberkhahn jberkhahn force-pushed the dynamic_lora2 branch 3 times, most recently from fee6e44 to f1ce50f on March 12, 2025 23:19
@comaniac
Collaborator

One high-level suggestion: is it possible to achieve the desired functionality without introducing another environment variable? We should avoid that where possible, and we actually plan to clean them up soon.

@jberkhahn
Contributor Author

One high-level suggestion: is it possible to achieve the desired functionality without introducing another environment variable? We should avoid that where possible, and we actually plan to clean them up soon.

Are you suggesting adding it as a command-line argument, or somehow making this implicitly configurable? The issue is that this feature depends on dynamic LoRA being enabled, which is configured via an env var, so an env var seemed the easiest way to make it dependent on whether that is set.

@joerunde
Collaborator

@jberkhahn This could be added as a CLI arg and still be conditioned on whether or not dynamic LoRA is enabled. One way to do that might be:

  1. Add the cli flag to the parser around here: https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py#L679
  2. Add that value to the LoRAConfig, see how that's created here: https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py#L679
  3. Add some validation in __post_init__: https://github.com/vllm-project/vllm/blob/main/vllm/config.py#L2296-L2316

e.g. something like

if self.lora_adapter_cache is not None and not envs.VLLM_ALLOW_RUNTIME_LORA_UPDATING:
    raise ValueError("setting --lora-adapter-cache requires setting VLLM_ALLOW_RUNTIME_LORA_UPDATING=1")
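For steps 1 and 2, a standalone sketch (the flag and field names mirror the example above and are assumptions, not a final API):

import argparse

# Step 1: add the flag to the engine arg parser (in vLLM this would go into
# the existing parser in vllm/engine/arg_utils.py).
parser = argparse.ArgumentParser()
parser.add_argument(
    "--lora-adapter-cache",
    type=str,
    default=None,
    help="Directory to search for LoRA adapters that are not already "
         "registered with the server (requires "
         "VLLM_ALLOW_RUNTIME_LORA_UPDATING=1).")

# Step 2: plumb the parsed value into LoRAConfig, e.g. as a
# `lora_adapter_cache: Optional[str] = None` field, which the
# __post_init__ validation shown above can then check.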

@jberkhahn jberkhahn force-pushed the dynamic_lora2 branch 8 times, most recently from 545aec4 to f76abc2 on March 19, 2025 18:10
Contributor
@tjohnson31415 tjohnson31415 left a comment

Looks pretty good to me. Just some nits and a couple of refactor suggestions.

Comment on lines 174 to 181
Contributor

These asserts are all on the base model's completion, and some are redundant with the assertions on the generated text.
IMO they could just be removed, or at least they should be assertions on the lora_completion.

Collaborator

Yeah, I think we just need to make sure that the request with the adapter returns different results than the base model, so the two assertions on the text are good. If the others are required here, then a quick little comment about why would be 💯.

Contributor

NIT: move the kwarg with a default value down with the others that have defaults. Same thing in the other serving_*.py files.

Contributor

NIT for consistency

Suggested change
- lora_cache_dir: Optional[str],
+ lora_cache_dir: Optional[str] = None,

(and move it down to the end of the list of required kwargs)

Contributor

I think this dynamic lora check should come after the check for prompt adapters so that we don't need to query disk for every request that uses a static prompt adapter.

Comment on lines 171 to 182
Contributor

NIT: This could be moved to its own function and used in _check_model as well to reduce code duplication.
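A sketch of what such a shared helper might look like (the name and signature are illustrative):

import os
from typing import Optional

def adapter_path_in_cache(lora_cache_dir: Optional[str],
                          model_name: str) -> Optional[str]:
    # Illustrative shared helper, usable both here and in _check_model:
    # return the on-disk path for an unregistered adapter under the cache
    # directory, or None if it isn't there.
    if lora_cache_dir is None:
        return None
    candidate = os.path.join(lora_cache_dir, model_name)
    return candidate if os.path.isdir(candidate) else None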

Contributor

The error returned if an adapter fails to load looks like:

{
  "object": "error",
  "message": "The size of tensor a (2560) must match the size of tensor b (4096) at non-singleton dimension 1",
  "type": "BadRequestError",
  "param": null,
  "code": 400
}

We could add a prefix to the message to provide a little more context:

 ValueError(f"Failed to load LoRA adapter: {response.message}")

@joerunde
Collaborator

Looking pretty good!

It would be great to also get the docs updated while we're at it. @jberkhahn, can you take a look at adding some information about this to docs/source/features/lora.md?

@jberkhahn
Contributor Author

Looking pretty good!

It would be great to also get the docs updated while we're at it. @jberkhahn, can you take a look at adding some information about this to docs/source/features/lora.md?

Sure. I'll go through this stuff and fix it up.

@mergify mergify bot added the documentation label Mar 25, 2025
@jberkhahn jberkhahn force-pushed the dynamic_lora2 branch 3 times, most recently from 0b2179d to d1e90dc on March 25, 2025 18:38
@mergify

mergify bot commented Apr 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jberkhahn.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Contributor

Could local_cache_dir be part of OpenAIServingModels?

Contributor Author

Yeah, this is probably a good idea, but it looks like I broke some tests when I just rebased on master. I'll get to this once I'm done squashing that.

Comment on lines 158 to 162
Collaborator

s/VLLM/vLLM

Contributor Author

Fixed this.

@varun-sundar-rabindranath
Contributor

Hi @jberkhahn and @joerunde. Thanks for the PR!

I have a few questions about the problem and the fix this PR is implementing. I am a bit unfamiliar with the problem itself, so please correct me if I am wrong.

  • As mentioned in the RFC, in a scenario where there are multiple vLLM instances, the problem is to address the deficiencies of load_lora_adapter and unload_lora_adapter, such as:
    - Ensuring the adapter is loaded across all replicas of the deployment
    - Guaranteeing that the adapter will be available on a new replica, or after a replica restart

IIUC, the fix looks for LoRA modules in a file system directory that all the replicas share.

Questions:

  1. Who writes to the cache? Is the cache pre-populated before engine initialization?
  2. What happens when we get a new LoRA request that is not in the cache? (It looks like we simply don't handle it, as before?)
  3. Have you considered making vLLM download adapters to the cache directly? Maybe that is the solution for cache population?

This is mostly for my edification in trying to understand the problem better. Thanks 🙌

@jberkhahn
Contributor Author

Questions:

1. Who writes to the cache? Is the cache pre-populated before engine initialization?

2. What happens when we get a new LoRA request that is not in the cache? (It looks like we simply don't handle it, as before?)

3. Have you considered making vLLM download adapters to the cache directly? Maybe that is the solution for cache population?

  1. Currently vLLM doesn't ever write to the cache, only read. The archetypal environment I was envisioning is a shared volume mounted into vLLM's running environment, such as a persistent volume mounted into a pod in Kube, but I wanted an implementation that didn't make too many specific assumptions about how this would be used.

  2. Correct, this doesn't change the current behavior when we get a request for an adapter we don't have anywhere.

  3. We could implement that if the community wants it, but for this initial PR I wanted to keep the added feature as lightweight as possible. I believe my implementation is fairly lightweight while still being usable in a non-trivial context. I also wanted to get feedback from the community before spending more time implementing features, as there were some pretty wide-ranging solutions discussed in the RFC.

@jberkhahn jberkhahn force-pushed the dynamic_lora2 branch 3 times, most recently from c8cbab8 to 06f59ab on April 2, 2025 20:31
@jberkhahn
Contributor Author

Putting a hold on this while I coordinate my work with #10546.

models = await client.models.list()
models = models.data
lora_models = models[1:]
assert len(lora_models) == 2
Contributor

Why is this == 2? Are we also counting the adapter added in a previous test?

Contributor Author

Yes, the server is shared between all the tests in the module.

@jberkhahn
Contributor Author

Closing this in favor of #10546; will resubmit this as a plugin.


Labels

documentation, frontend

Development

Successfully merging this pull request may close these issues.

[RFC]: Distribute LoRA adapters across deployment

6 participants