Allow dynamic loading of LoRA adapters in a cache dir #14634
Conversation
Nice test 👍 !
I think a negative test case would be good too. Can we throw some junk in the adapter cache, or maybe an adapter for a different base model, and ensure that we get a graceful 400 when trying to run a chat completion with it?
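A rough illustration of such a negative test, assuming the test module's existing async client fixture and a junk directory placed in the adapter cache; the test and adapter names here are hypothetical:

# Hypothetical negative test: a cache-dir entry that is not a valid LoRA
# adapter should produce a graceful 400 rather than crashing the server.
import openai
import pytest


@pytest.mark.asyncio
async def test_junk_adapter_returns_400(client: openai.AsyncOpenAI):
    with pytest.raises(openai.BadRequestError):
        await client.chat.completions.create(
            model="junk-adapter",  # matches a non-adapter dir in the cache
            messages=[{"role": "user", "content": "Hello"}],
            max_tokens=5,
        )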
I can add a negative test case for something that isn't a LoRA adapter, but how does a request know whether it has a matching base model? The request only has a model (which is actually a LoRA adapter) set; I don't see anything that relates the request to a base model, except that the LoRA adapter itself has its own base model.
If you try to load a LoRA adapter for a different model architecture than the base model, or for a differently sized model than the base model, then the engine should fail to load it and return an error response, which it looks like your code would already catch.
On success, self.models.load_lora_adapter should cache a new LoRARequest for this adapter in self.models.lora_requests. Shouldn't we return that here, instead of None, None?
I guess if this logic block was moved above line 164 here and the return was removed, then the existing block that checks for lora adapters would find and return it:
for lora in self.models.lora_requests:
    if request.model == lora.lora_name:
        return lora, None
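An illustrative fragment of that reordering, using the names mentioned in this thread (load_lora_adapter, lora_requests); the cache-dir check is wrapped in a hypothetical helper and the surrounding method is simplified:

# Illustrative only: the cache-dir load happens before the existing lookup
# loop, so the LoRARequest cached by load_lora_adapter is found and returned
# by that loop instead of returning None, None.
async def _maybe_get_lora(self, request):
    known = {lora.lora_name for lora in self.models.lora_requests}
    if request.model not in known:
        # Hypothetical call: try to load <cache_dir>/<model name> from disk;
        # on success the adapter is appended to self.models.lora_requests.
        error = await self._try_load_from_cache_dir(request.model)
        if error is not None:
            return None, error

    for lora in self.models.lora_requests:
        if request.model == lora.lora_name:
            return lora, None
    return None, None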
vllm/envs.py (outdated)
I have a slight preference for incorporating LORA in the name here to remove any ambiguity about the word ADAPTER
Hmm, I kept it generic in case we wanted to use this in the future for other cached stuff, backwards compatibility and all that. I can change it.
@smarterclayton My colleague @jberkhahn put this together to support the "shared file systems with write-once semantics and unique naming" use case for LoRA adapters that you brought up on the RFC, if you wanted to take a look.
One high-level suggestion: is it possible to achieve the desired functionality without introducing another environment variable? We should avoid that where possible, and we actually plan to clean them up soon.
Are you suggesting adding it as a command-line argument, or somehow making this implicitly configurable? The issue is that this is dependent on dynamic LoRA being enabled, which is configured via an env var, so that seemed the easiest way to make it dependent on whether that is set or not.
@jberkhahn This could be added as a CLI arg, and still conditioned on whether or not dynamic LoRA is enabled. One way to do that might be something like the sketch below.
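One possible shape for that, assuming the existing VLLM_ALLOW_RUNTIME_LORA_UPDATING environment variable is what gates dynamic LoRA and that the flag is named --lora-cache-dir (both are assumptions, not taken from this thread):

# Hedged sketch: a CLI flag for the cache dir, only honored when dynamic
# LoRA loading is enabled via the existing environment variable.
import argparse

import vllm.envs as envs


def add_lora_cache_dir_arg(parser: argparse.ArgumentParser) -> None:
    parser.add_argument(
        "--lora-cache-dir",
        type=str,
        default=None,
        help="Directory searched for LoRA adapters by name; requires "
        "VLLM_ALLOW_RUNTIME_LORA_UPDATING=1.")


def validate_lora_cache_dir(args: argparse.Namespace) -> None:
    if args.lora_cache_dir and not envs.VLLM_ALLOW_RUNTIME_LORA_UPDATING:
        raise ValueError(
            "--lora-cache-dir requires VLLM_ALLOW_RUNTIME_LORA_UPDATING=1")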
tjohnson31415 left a comment:
Looks pretty good to me. Just some nits and a couple of refactor suggestions.
These asserts are all on the base model's completion and some are redundant with asserting on the generated text.
IMO, they could just be removed, or at least they should be assertions on the lora_completion.
Yeah, I think we just need to make sure that the request with the adapter returns different results than the base model, so the two assertions on the text are good. But if the others are required here, then a quick little comment about why would be 💯
NIT: move the kwarg with a default value down with the others that have defaults. Same thing in the other serving_*.py files.
NIT for consistency
- lora_cache_dir: Optional[str],
+ lora_cache_dir: Optional[str] = None,
(and move down to end of list of required kwargs)
I think this dynamic LoRA check should come after the check for prompt adapters, so that we don't need to query disk for every request that uses a static prompt adapter.
NIT: This could be moved to its own function and used in _check_model as well to reduce code duplication.
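For instance, a small helper along these lines (the name and signature are hypothetical) could be called from both _check_model and the adapter-resolution path:

import os
from typing import Optional


def find_adapter_in_cache_dir(cache_dir: Optional[str],
                              adapter_name: str) -> Optional[str]:
    """Return the on-disk path for adapter_name if it exists under cache_dir."""
    if not cache_dir:
        return None
    candidate = os.path.join(cache_dir, adapter_name)
    return candidate if os.path.isdir(candidate) else None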
The error returned if an adapter fails to load looks like:
{
    "object": "error",
    "message": "The size of tensor a (2560) must match the size of tensor b (4096) at non-singleton dimension 1",
    "type": "BadRequestError",
    "param": null,
    "code": 400
}
We could add a prefix to the message to provide a little more context:
ValueError(f"Failed to load LoRA adapter: {response.message}")
Looking pretty good! It would be great to also get the docs updated while we're at it. @jberkhahn, can you take a look at adding some information about this to docs/source/features/lora.md?
Sure. I'll go through this stuff and fix it up.
This pull request has merge conflicts that must be resolved before it can be merged.
Could local_cache_dir be part of OpenAIServingModels?
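Something like the following (a simplified stand-in, not the real class signature) would let the serving_*.py classes reach the cache dir through self.models instead of each taking their own kwarg:

from typing import Optional


class OpenAIServingModels:  # simplified stand-in, not the actual signature
    def __init__(self, lora_cache_dir: Optional[str] = None, **kwargs):
        # Keep the cache dir next to the other adapter bookkeeping
        # (lora_requests etc.) so callers use self.models.lora_cache_dir.
        self.lora_cache_dir = lora_cache_dir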
Yeah, this is probably a good idea, but it looks like I broke some tests when I just rebased on master. I'll get to this once I'm done squashing that.
docs/source/features/lora.md (outdated)
s/VLLM/vLLM
fixed this
Hi @jberkhahn and @joerunde. Thanks for the PR! I have a few questions on the problem and the fix the PR is implementing. I am a bit unfamiliar with the problem itself, so please correct me if I am wrong.
IIUC, the fix is looking for LoRA modules in some file system directory that all the replicas share. Questions:
This is mostly for my edification in trying to understand the problem better. Thanks 🙌
Putting a hold on this while I coordinate my work with #10546.
models = await client.models.list()
models = models.data
lora_models = models[1:]
assert len(lora_models) == 2
Why is this == 2? Are we also counting the adapter added in a previous test?
Yes, the server is shared between all the tests in the module.
Closing this in favor of #10546; will resubmit this as a plugin.
FIX #12174
Per the discussion in #12174, this PR implements a simple in-place solution that doesn't depend on external infrastructure other than disk. It allows vLLM to be configured with the location of a directory in which LoRA adapters might be present. When it receives a request for a LoRA adapter that isn't recognized, vLLM checks whether the adapter is in that directory under a location matching the name of the model, i.e. VLLM_ADAPTER_CACHE/model_name. If found, it loads the adapter and continues as normal. Note that this implementation only does this for LoRA adapters, but it would be easy to extend to other model types.
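As a usage-level illustration of that behavior (the adapter name, port, and cache layout are hypothetical): given a cache dir containing <cache_dir>/paraphrase-lora, a request for a model named paraphrase-lora would trigger a dynamic load on first use and then complete normally.

# Hypothetical client-side view of the cache-dir lookup described above.
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="paraphrase-lora",  # not pre-registered; resolved from the cache dir
    messages=[{"role": "user", "content": "Rewrite: the cat sat on the mat."}],
)
print(completion.choices[0].message.content)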