[Frontend][Core] Add Guidance backend for guided decoding #10217
Conversation
Signed-off-by: Loc Huynh <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Thanks @JC1DA for the great contribution!
A few other questions:
- Presumably the parallelization speedup is due to the fact that the PyTorch ops involved release the GIL? (see the sketch after this list)
- Were your outlines measurements also using the thread pool?
- It would be good to also try the latest outlines 0.1.x if possible, which is apparently much faster than < 0.1. We would want to upgrade to that too in any case.
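For reference, the thread-pool fan-out under discussion looks roughly like this. This is only a sketch, not the PR's code; `processors` (one callable per sequence, with a simplified signature) and the worker count are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

import torch

def apply_processors_parallel(processors, logits: torch.Tensor) -> torch.Tensor:
    # One worker masks one sequence's row of logits. This only speeds
    # things up if the per-row work (parser calls, PyTorch ops) releases
    # the GIL while it runs.
    with ThreadPoolExecutor(max_workers=8) as pool:
        rows = list(pool.map(lambda args: args[0](args[1]),
                             zip(processors, logits.unbind(0))))
    return torch.stack(rows)
```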
```diff
                 tokenizer) -> Optional[LogitsProcessor]:
-    # CFG grammar not supported by LMFE, so we use outlines instead
-    if guided_params.backend == 'outlines' or guided_params.grammar:
+    if guided_params.backend == 'outlines':
```
LMFE doesn't support grammar, so we should retain the existing behaviour of falling back to a different backend in this case (perhaps it could now be guidance rather than outlines).
Should we add a check at the beginning to use outlines by default?

```python
if guided_params.grammar and guided_params.backend not in [
        'outlines', 'guidance'
]:
    guided_params.backend = 'outlines'
```
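For context, a rough sketch of where such a default check could sit in the dispatch function; the function shape below is an assumption for illustration, not the PR's actual code:

```python
from typing import Optional

def get_local_guided_decoding_logits_processor(
        guided_params, tokenizer) -> Optional["LogitsProcessor"]:
    # LMFE has no grammar support, so route grammar requests to a
    # grammar-capable backend up front.
    if guided_params.grammar and guided_params.backend not in (
            'outlines', 'guidance'):
        guided_params.backend = 'outlines'

    if guided_params.backend == 'outlines':
        ...  # build the outlines-based processor
    elif guided_params.backend == 'guidance':
        ...  # build the llguidance-based processor
    return None
```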
```python
mask = torch.tensor(mask,
                    dtype=logits.dtype,
                    device=logits.device)
```
Can the allocated mask tensor be reused between calls?
@njhill I have updated the code to reuse the logits variable. As we aren't adding a thread pool in this PR anymore, it should work well with in-place ops.
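For illustration, an in-place version of the masking step might look like the sketch below; `allowed_tokens` and the function name are hypothetical, and the PR's actual code differs:

```python
import torch

def mask_logits_inplace(logits: torch.Tensor,
                        allowed_tokens: list[int]) -> torch.Tensor:
    # Mark the tokens the grammar currently allows, then push every
    # other logit to -inf so disallowed tokens can never be sampled.
    allowed = torch.zeros_like(logits, dtype=torch.bool)
    allowed[allowed_tokens] = True
    # masked_fill_ mutates `logits` directly, so no second
    # logits-sized tensor is allocated per decoding step.
    logits.masked_fill_(~allowed, float("-inf"))
    return logits
```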
requirements-common.txt

```diff
 tiktoken >= 0.6.0  # Required for DBRX tokenizer
 lm-format-enforcer == 0.10.6
 outlines >= 0.0.43, < 0.1
+guidance>=0.2rc
```
What are the requirements of guidance? Does it have compiled binaries for specific Python versions or CPU architectures?
Maybe this could be an optional dependency to start with, like we do for many quantization backends.
Hey @mgoin, guidance does have a fair number of dependencies, but we're mostly depending on the lower-level guidance layer here, llguidance. llguidance is compiled for Python 3.9+ on manylinux/macOS/Windows. My understanding is that vLLM only supports Linux on Python 3.9+ too, so I think we should be good there.
We can change this PR in the near future to just use llguidance (which has no other dependencies: https://github.com/microsoft/llguidance/blob/b5ca97b2562b720c1ff3f567bfa45956338a1864/pyproject.toml#L8). We just need to port one last function down from the Python guidance library into the Rust layer first :).
@mgoin we replaced guidance with llguidance, which has no extra dependencies. Hope it is good enough to merge :)
Thanks @njhill for your quick review. Really appreciate it.
That's one reason; another is that the parser (llguidance) used in guidance is implemented in Rust, and it automatically releases the GIL when called, so it is more efficient to run guidance in parallel.
Yes, the experiments were done using the thread pool.
I haven't tested outlines 0.1.x yet, just the current version in vLLM. However, I am not focusing too much on benchmarks for this PR. The focus is to make guidance available as another guided decoding backend to the vLLM community so people can choose what's best for them. :)
I also figured out that lm-format-enforcer is not thread-safe. It failed some tests when the number of threads is larger than 1.
Decided to roll back to the single-threaded version so as not to break lm-format-enforcer. The PR now comes with minimal changes that add llguidance as a new logits processor.
This pull request has merge conflicts that must be resolved before it can be merged.
We have just released a large JSON Schema benchmark and a paper. Of particular interest might be the isolated mask-generation benchmarks, comparing LLGuidance, Outlines, XGrammar, and llama.cpp grammars.
This pull request has merge conflicts that must be resolved before it can be merged.
This commit is based on the PR vllm-project#10217. I started to rebase it, but found it easier to just re-apply the changes on top of latest main.

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: Loc Huynh <[email protected]>
Co-authored-by: Loc Huynh <[email protected]>
Hello! I wanted to try this out, so I re-applied the changes on top of `main`. It's failing on one of the tests, though.
Hey @russellb! We've been tracking the discussion on #12388. Our plan is to re-do this PR once that gets merged. llguidance exposes a similar API to xgrammar, so it'll be quite a bit easier to just drop our code in at that point. Happy to get started on it whenever you recommend. Thanks for the pointer on the Slack; we'll join and chat there too :)
That sounds great. I want to get multiple backends going for the V1 engine after that PR merges. I also might have a use case for this in the V0 engine for an existing user, which is what brought me over to this PR. I figured I might be able to help get this updated and working so I can test and see if it works for them.
This commit is based on the PR vllm-project#10217. It is updated to be compatible with `main`.

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: Loc Huynh <[email protected]>
Co-authored-by: Michal Moskal <[email protected]>
Superseded by #14589.
This commit is based on the PR vllm-project#10217. It is updated to be compatible with `main`. It has also been significantly updated and simplified to take advantage of changes to llguidance since the original PR was written. Many thanks to the llguidance team for the help on this.

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: Loc Huynh <[email protected]>
Co-authored-by: Michal Moskal <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>
This pull request extends guided decoding capabilities: the `guidance` backend supports `regex`, `choice`, `json` and `grammar`.

Relevant: #5245

Usage
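As a hedged sketch (not the PR's own example), selecting the new backend through vLLM's OpenAI-compatible server should mirror how the existing backends are selected, via the `guided_decoding_backend` extra parameter; the model name below is a placeholder:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="my-model",  # placeholder: whatever model the server is serving
    messages=[{"role": "user",
               "content": "Classify the sentiment: 'great movie!'"}],
    extra_body={
        # Constrain output to one of two strings, using the new backend.
        "guided_choice": ["positive", "negative"],
        "guided_decoding_backend": "guidance",
    },
)
print(completion.choices[0].message.content)
```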