[V1] Prevent xgrammar from breaking TPU support #14575
Conversation
This does a couple of things:

1. Defer initializing the grammar bitmask until the first time it is needed, instead of at engine creation time. For environments where structured output is not used, this prevents xgrammar from ever being imported.
2. Cleanly reject structured output requests for TPU, since that is not expected to work right now.

Signed-off-by: Russell Bryant <[email protected]>
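The two changes described above can be sketched roughly as follows. This is an illustrative sketch, not the actual vLLM code: the class and method names are hypothetical, and the `xgrammar.allocate_token_bitmask` call is an assumption about that library's API.

```python
# Sketch (hypothetical names): defer the xgrammar import and bitmask
# allocation until the first structured output request, and reject
# such requests on TPU where they are not expected to work yet.


class StructuredOutputManager:
    def __init__(self, platform: str):
        self.platform = platform
        self._bitmask = None  # allocated lazily on first use

    def get_bitmask(self, batch_size: int, vocab_size: int):
        if self.platform == "tpu":
            # Cleanly reject instead of failing deep inside xgrammar.
            raise ValueError("Structured output is not supported on TPU")
        if self._bitmask is None:
            # Import here so environments that never use structured
            # output never pay the import cost of xgrammar.
            import xgrammar  # assumed API; placement is what matters

            self._bitmask = xgrammar.allocate_token_bitmask(
                batch_size, vocab_size)
        return self._bitmask
```

The key point is that merely constructing the manager touches neither xgrammar nor the bitmask; both costs are deferred to the first structured output request.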
You could simply allocate a torch tensor for the bitmask. llguidance uses the same format (as did AICI before it).
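For illustration, here is a minimal sketch of that shared bitmask layout: one bit per vocabulary token, packed into 32-bit words. The shape and packing are assumptions based on the xgrammar/llguidance format mentioned above, and plain Python lists stand in for the int32 torch tensor real code would use.

```python
import math


def allocate_bitmask(batch_size: int, vocab_size: int) -> list[list[int]]:
    # One 32-bit word per 32 vocab tokens, one row per batch entry.
    # (Real code: an int32 torch tensor of shape (batch, ceil(vocab/32)).)
    words = math.ceil(vocab_size / 32)
    return [[0] * words for _ in range(batch_size)]


def allow_token(bitmask: list[list[int]], row: int, token_id: int) -> None:
    # Set the bit for token_id in the given batch row.
    bitmask[row][token_id // 32] |= 1 << (token_id % 32)


def is_allowed(bitmask: list[list[int]], row: int, token_id: int) -> bool:
    return bool((bitmask[row][token_id // 32] >> (token_id % 32)) & 1)
```

Because the layout is just packed bits, allocating it up front requires nothing from xgrammar itself, which is the point of the comment.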
LGTM. We can provide TPU support soon.
True! Either way, it seems reasonable to not allocate it until we know it's actually needed. |
LGTM!
From the discussion on Slack, it seems like we all agree on installing triton for TPU for now?
Ok, then we should at least update the PR title, as it doesn't really solve the TPU issue.
I assume you were responding to the "seems like we all agree on installing triton for TPU" comment? That's not what was merged, though. Let me know if you still have trouble after this change. I know structured output won't work (yet), but it at least shouldn't get in the way of anything else now.
Yeah, I assumed that was the conclusion of some other discussion I had no context about, because after testing current main I am still experiencing the same issue :/ Let me share that. Were you able to get it working locally on your side?
Sorry, no, I didn't have a TPU environment to test on. I just made sure the feature was still working. I see that my change wasn't sufficient; a little more needs to be moved around. I'll submit a follow-up PR in a few minutes.
@NickLucche can you take a look at this and see if it helps? #14616
PR vllm-project#14575 delayed initialization of the grammar bitmask until it was needed to try to fix a problem encountered on TPU systems. Unfortunately, that change was not sufficient. We need to delay usage of ALL xgrammar APIs, not just the grammar initialization. This change implements that. More initialization is now deferred until the first time a structured output request is received. Signed-off-by: Russell Bryant <[email protected]>
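The "defer usage of ALL xgrammar APIs" approach described above can be sketched with a generic lazy-module wrapper. This is illustrative only: `LazyModule` is a hypothetical name, not the actual vLLM implementation, but it shows the pattern of postponing the import until the first attribute access.

```python
import importlib


class LazyModule:
    """Defer importing a module until an attribute is first accessed.

    This lets engine construction succeed on platforms where the
    module (e.g. xgrammar) cannot even be imported, as long as no
    structured output request ever touches it.
    """

    def __init__(self, name: str):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # Only called for attributes not found on the instance,
        # so _name/_module lookups above never recurse here.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)
```

Usage would look like `xgrammar = LazyModule("xgrammar")` at module scope: nothing is imported at engine creation, and the real import happens only when the first structured output request touches an attribute.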