Conversation

@jberkhahn
Contributor

@jberkhahn jberkhahn commented Mar 11, 2025

FIX #12174

Per the discussion in #12174, this PR implements a simple in-place solution that doesn't depend on any external infrastructure other than disk. It allows vLLM to be configured with the location of a directory in which LoRA adapters may be present. When vLLM receives a request for a LoRA adapter it doesn't recognize, it checks whether the adapter exists in that directory under a path matching the model name, i.e. VLLM_ADAPTER_CACHE/model_name. If found, it loads the adapter and continues as normal.

Note that this implementation only does this for LoRA adapters, but it would be easy to extend to other model types.
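For illustration, a minimal sketch of that lookup flow (the function and variable names here are hypothetical, not the identifiers used in the PR):

import os
from typing import Optional

def resolve_lora_adapter(model_name: str,
                         known_adapters: dict[str, str],
                         adapter_cache_dir: Optional[str]) -> Optional[str]:
    # Hypothetical helper: map a requested model name to a LoRA adapter path.
    # Already-registered adapter: nothing to do.
    if model_name in known_adapters:
        return known_adapters[model_name]

    # Unknown adapter: look for <adapter_cache_dir>/<model_name> on disk.
    if adapter_cache_dir is not None:
        candidate = os.path.join(adapter_cache_dir, model_name)
        if os.path.isdir(candidate):
            # Register it so later requests take the fast path above.
            known_adapters[model_name] = candidate
            return candidate

    # Not registered and not on disk: fall through to the usual error handling.
    return None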

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the frontend label Mar 11, 2025
@jberkhahn jberkhahn force-pushed the dynamic_lora2 branch 5 times, most recently from 2261dea to d1017c7 on March 11, 2025 21:36
@DarkLight1337 DarkLight1337 requested a review from jeejeelee March 12, 2025 02:29
Collaborator

Nice test 👍!

I think a negative test case would be good too. Can we throw some junk into the adapter cache, or maybe an adapter for a different base model, and ensure that we get a graceful 400 when trying to run a chat completion with it?
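(For reference, a hedged sketch of what such a negative test could look like with the async OpenAI client; the client and adapter_cache_dir fixtures are assumptions for illustration, not the actual fixtures in this PR.)

import openai
import pytest

@pytest.mark.asyncio
async def test_junk_adapter_returns_400(client: openai.AsyncOpenAI,
                                        adapter_cache_dir):
    # Hypothetical fixtures: `client` talks to a running vLLM server and
    # `adapter_cache_dir` is the cache directory that server was started with.
    junk_dir = adapter_cache_dir / "not-a-real-adapter"
    junk_dir.mkdir()
    (junk_dir / "adapter_config.json").write_text("this is not a LoRA adapter")

    # Requesting the junk "adapter" should fail gracefully with a 400
    # rather than crashing the server.
    with pytest.raises(openai.BadRequestError):
        await client.chat.completions.create(
            model="not-a-real-adapter",
            messages=[{"role": "user", "content": "Hello"}],
        )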

Contributor Author

I can add a negative test case for something that isn't a LoRA adapter, but how does a request know whether it has a matching base model? The request only has a model (which is actually a LoRA adapter) set; I don't see anything that relates the request to a base model, except that the LoRA adapter itself has its own base model.

Collaborator

If you try to load a LoRA adapter for a different model architecture than the base model, or for a differently sized model than the base model, then the engine should fail to load it and return an error response, which it looks like your code would already catch.

Collaborator
@joerunde joerunde Mar 12, 2025

On success, self.models.load_lora_adapter should cache a new LoRARequest for this adapter in self.models.lora_requests. Shouldn't we return that here, instead of None, None?

I guess if this logic block was moved above line 164 and the return was removed, then the existing block that checks for LoRA adapters would find and return it:

        for lora in self.models.lora_requests:
            if request.model == lora.lora_name:
                return lora, None
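Roughly, the suggested ordering would look something like this (the _load_adapter_from_cache_dir helper name is hypothetical):

        # Try the on-disk load first, without returning early; on success,
        # load_lora_adapter registers a new LoRARequest in
        # self.models.lora_requests ...
        if not any(request.model == lora.lora_name
                   for lora in self.models.lora_requests):
            await self._load_adapter_from_cache_dir(request.model)  # hypothetical helper

        # ... so the existing lookup then finds and returns it.
        for lora in self.models.lora_requests:
            if request.model == lora.lora_name:
                return lora, None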

vllm/envs.py Outdated
Collaborator

I have a slight preference for incorporating LORA in the name here to remove any ambiguity about the word ADAPTER.

Contributor Author

Hmm, I kept it generic in case we want to use this in the future for other cached stuff, backwards compatibility and all that. I can change it.

@joerunde
Collaborator

@smarterclayton My colleague @jberkhahn put this together to support the "shared file systems with write-once semantics and unique naming" use case for LoRA adapters that you brought up on the RFC, if you wanted to take a look.

@jberkhahn jberkhahn force-pushed the dynamic_lora2 branch 3 times, most recently from fee6e44 to f1ce50f on March 12, 2025 23:19
@comaniac
Collaborator

One high-level suggestion: is it possible to achieve the desired functionality without introducing another environment variable? We should avoid that where possible, and we actually plan to clean them up soon.

@jberkhahn
Contributor Author

One high-level suggestion: is it possible to achieve the desired functionality without introducing another environment variable? We should avoid that where possible, and we actually plan to clean them up soon.

Are you suggesting adding it as a command-line argument, or somehow making this implicitly configurable? The issue is that this feature depends on dynamic LoRA being enabled, which is configured via an env var, so an env var seemed the easiest way to make it dependent on whether that is set.

@joerunde
Collaborator

@jberkhahn This could be added as a CLI arg and still be conditioned on whether or not dynamic LoRA is enabled. One way to do that might be:

  1. Add the cli flag to the parser around here: https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py#L679
  2. Add that value to the LoRAConfig, see how that's created here: https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py#L679
  3. Add some validation in __post_init__: https://github.com/vllm-project/vllm/blob/main/vllm/config.py#L2296-L2316

e.g. something like

if self.lora_adapter_cache is not None and not envs.VLLM_ALLOW_RUNTIME_LORA_UPDATING:
    raise ValueError("setting --lora-adapter-cache requires setting VLLM_ALLOW_RUNTIME_LORA_UPDATING=1")
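For steps 1 and 2, a standalone sketch (the flag and field names mirror the example above and are assumptions, not a final API):

import argparse

# Step 1: add the flag to the engine arg parser (in vLLM this would go into
# the existing parser in vllm/engine/arg_utils.py).
parser = argparse.ArgumentParser()
parser.add_argument(
    "--lora-adapter-cache",
    type=str,
    default=None,
    help="Directory to search for LoRA adapters that are not already "
         "registered with the server (requires "
         "VLLM_ALLOW_RUNTIME_LORA_UPDATING=1).")

# Step 2: plumb the parsed value into LoRAConfig, e.g. as a
# `lora_adapter_cache: Optional[str] = None` field, which the
# __post_init__ validation shown above can then check.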

@jberkhahn jberkhahn force-pushed the dynamic_lora2 branch 8 times, most recently from 545aec4 to f76abc2 on March 19, 2025 18:10
Contributor
@tjohnson31415 tjohnson31415 left a comment

Looks pretty good to me. Just some nits and a couple of refactor suggestions.

Comment on lines 174 to 181
Contributor

These asserts are all on the base model's completion, and some are redundant with the assertions on the generated text.
IMO they could just be removed, or at least they should be assertions on the lora_completion.

Collaborator

Yeah, I think we just need to make sure that the request with the adapter returns different results than the base model, so the two assertions on the text are good. If the others are required here, then a quick little comment about why would be 💯.

Contributor

NIT: move the kwarg with a default value down with the others that have defaults. Same thing in the other serving_*.py files.

Contributor

NIT for consistency

Suggested change
- lora_cache_dir: Optional[str],
+ lora_cache_dir: Optional[str] = None,

(and move it down to the end of the list of required kwargs)

Contributor

I think this dynamic lora check should come after the check for prompt adapters so that we don't need to query disk for every request that uses a static prompt adapter.

Comment on lines 171 to 182
Contributor

NIT: This could be moved to its own function and used in _check_model as well to reduce code duplication.
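A sketch of what such a shared helper might look like (the name and signature are illustrative):

import os
from typing import Optional

def adapter_path_in_cache(lora_cache_dir: Optional[str],
                          model_name: str) -> Optional[str]:
    # Illustrative shared helper, usable both here and in _check_model:
    # return the on-disk path for an unregistered adapter under the cache
    # directory, or None if it isn't there.
    if lora_cache_dir is None:
        return None
    candidate = os.path.join(lora_cache_dir, model_name)
    return candidate if os.path.isdir(candidate) else None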

Contributor

The error returned if an adapter fails to load looks like:

{
  "object": "error",
  "message": "The size of tensor a (2560) must match the size of tensor b (4096) at non-singleton dimension 1",
  "type": "BadRequestError",
  "param": null,
  "code": 400
}

We could add a prefix to the message to provide a little more context:

 ValueError(f"Failed to load LoRA adapter: {response.message}")

@joerunde
Collaborator

Looking pretty good!

It would be great to also get the docs updated while we're at it. @jberkhahn, can you take a look at adding some information about this to docs/source/features/lora.md?

@jberkhahn
Contributor Author

Looking pretty good!

It would be great to also get the docs updated while we're at it. @jberkhahn, can you take a look at adding some information about this to docs/source/features/lora.md?

Sure. I'll go through this stuff and fix it up.

@mergify mergify bot added the documentation label Mar 25, 2025
@jberkhahn jberkhahn force-pushed the dynamic_lora2 branch 3 times, most recently from 0b2179d to d1e90dc on March 25, 2025 18:38
@mergify

mergify bot commented Apr 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jberkhahn.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Contributor

Could local_cache_dir be part of OpenAIServingModels?

Contributor Author

Yeah, this is probably a good idea, but it looks like I broke some tests when I just rebased on master. I'll get to this once I'm done squashing that.

Comment on lines 158 to 162
Collaborator

s/VLLM/vLLM

Contributor Author

Fixed this.

@varun-sundar-rabindranath
Contributor

Hi @jberkhahn and @joerunde. Thanks for the PR!

I have a few questions about the problem and the fix this PR is implementing. I am a bit unfamiliar with the problem itself, so please correct me if I am wrong.

  • As mentioned in the RFC, in a scenario where there are multiple vLLM instances, the problem is to address the deficiencies of load_lora_adapter and unload_lora_adapter, such as:
    - Ensuring the adapter is loaded across all replicas of the deployment
    - Guaranteeing that the adapter will be available on a new replica, or after a replica restart

IIUC, the fix looks for LoRA modules in a file system directory that all the replicas share.

Questions:

  1. Who writes to the cache? Is the cache pre-populated before engine initialization?
  2. What happens when we get a new LoRA request that is not in the cache? (It looks like we simply don't handle it, as before?)
  3. Have you considered making vLLM download adapters to the cache directly? Maybe that is the solution for cache population?

This is mostly for my edification in trying to understand the problem better. Thanks 🙌

@jberkhahn
Contributor Author

Questions:

1. Who writes to the cache? Is the cache pre-populated before engine initialization?

2. What happens when we get a new LoRA request that is not in the cache? (It looks like we simply don't handle it, as before?)

3. Have you considered making vLLM download adapters to the cache directly? Maybe that is the solution for cache population?

  1. Currently vLLM doesn't ever write to the cache, only read. The archetypal environment I was envisioning is a shared volume mounted into vLLM's running environment, such as a persistent volume mounted into a pod in Kube, but I wanted an implementation that didn't make too many specific assumptions about how this would be used.

  2. Correct, this doesn't change the current behavior when we get a request for an adapter we don't have anywhere.

  3. We could implement that if the community wants it, but for this initial PR I wanted to keep the added feature as lightweight as possible. I believe my implementation is fairly lightweight while still being usable in a non-trivial context. I also wanted to get feedback from the community before spending more time implementing features, as there were some pretty wide-ranging solutions discussed in the RFC.

@jberkhahn jberkhahn force-pushed the dynamic_lora2 branch 3 times, most recently from c8cbab8 to 06f59ab on April 2, 2025 20:31
@jberkhahn
Contributor Author

Putting a hold on this while I coordinate my work with #10546.

models = await client.models.list()
models = models.data
lora_models = models[1:]
assert len(lora_models) == 2
Contributor

Why is this == 2? Are we also counting the adapter added in a previous test?

Contributor Author

Yes, the server is shared between all the tests in the module.

@jberkhahn
Contributor Author

Closing this in favor of #10546; will resubmit this as a plugin.


Labels

documentation, frontend

Development

Successfully merging this pull request may close these issues.

[RFC]: Distribute LoRA adapters across deployment

6 participants