feat: Support logprobs for vLLM models in OpenAI Frontend
#8538
base: main
Conversation
Changed the title from "logprobs for vLLM models in OpenAI API" to "logprobs for vLLM models in OpenAI Frontend"
Pull request overview
This PR adds support for logprobs (log probabilities) functionality in the OpenAI-compatible frontend for vLLM models. The feature allows users to request detailed probability information for generated tokens, which is useful for understanding model confidence and exploring alternative completions.
Key changes:
- Added logprobs support for both chat completions and standard completions endpoints
- Implemented conversion from vLLM's logprobs format to OpenAI's format
- Added comprehensive test coverage for logprobs functionality including validation and streaming
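For context, a chat completions request exercising the new fields might look like the sketch below; the base URL and model name are placeholders for a local deployment of the Triton OpenAI frontend, not values taken from this PR.

```python
from openai import OpenAI

# Placeholder base URL and model name for a local Triton OpenAI frontend deployment.
client = OpenAI(base_url="http://localhost:9000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What is machine learning?"}],
    logprobs=True,    # request per-token log probabilities
    top_logprobs=2,   # also return the 2 most likely alternatives per position
    max_tokens=16,
)

# Each entry carries the sampled token, its logprob, and the top alternatives.
for entry in response.choices[0].logprobs.content:
    alternatives = [(alt.token, alt.logprob) for alt in entry.top_logprobs]
    print(entry.token, entry.logprob, alternatives)
```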
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Summary per file:
| File | Description |
|---|---|
| python/openai/openai_frontend/engine/utils/triton.py | Added helper functions to parse and convert logprobs from vLLM responses to OpenAI format for both chat and completion endpoints |
| python/openai/openai_frontend/engine/triton_engine.py | Integrated logprobs support into request handling, validation, and response generation for both streaming and non-streaming modes |
| python/openai/tests/test_openai_client.py | Added async tests for logprobs functionality using the OpenAI client library, including validation tests |
| python/openai/tests/test_chat_completions.py | Added HTTP-level tests for chat completions with logprobs, including edge cases and validation |
| python/openai/tests/test_completions.py | Added HTTP-level tests for completions with logprobs, including edge cases and validation |
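To illustrate what the triton.py helpers are responsible for, the sketch below shows the general shape of a vLLM-to-OpenAI logprobs conversion for the chat format. The function name, structure, and field handling are illustrative assumptions, not the PR's actual code; it assumes vLLM's per-position dicts of `{token_id: Logprob}`, where each `Logprob` exposes `logprob` and `decoded_token` attributes.

```python
# Illustrative sketch only: not the exact helpers added in engine/utils/triton.py.
def vllm_logprobs_to_openai_chat(per_token_logprobs, sampled_token_ids):
    """per_token_logprobs: one {token_id: Logprob} dict per generated token (vLLM output);
    sampled_token_ids: the ids of the tokens that were actually generated."""
    content = []
    for candidates, token_id in zip(per_token_logprobs, sampled_token_ids):
        chosen = candidates[token_id]  # logprob info for the token actually generated
        ranked = sorted(candidates.values(), key=lambda lp: lp.logprob, reverse=True)
        content.append(
            {
                "token": chosen.decoded_token,
                "logprob": chosen.logprob,
                "bytes": list(chosen.decoded_token.encode("utf-8")),
                "top_logprobs": [
                    {
                        "token": lp.decoded_token,
                        "logprob": lp.logprob,
                        "bytes": list(lp.decoded_token.encode("utf-8")),
                    }
                    for lp in ranked
                ],
            }
        )
    return {"content": content}
```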
Review comment on the diff context around `chat_completion = await client.chat.completions.create(`:
Would it be better to send non-streaming and streaming requests and compare the output values to verify they match, similar to test_chat_streaming?
Added changes to compare the counts of streaming and non-streaming logprobs and tokens. Unable to compare logprobs values due to floating point discrepancies exceeding 1e-2 for some values.
For example: -0.11291083693504333 vs -0.12702862918376923, -0.002805228577926755 vs -0.0024760086089372635, and -0.1270265281200409 vs -0.1270267367362976.
You can use np.allclose. See examples in test_embeddings.py
Modified to use np.allclose. Thank you.
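A minimal sketch of the kind of comparison discussed in this thread: counts must match exactly, while logprob values are compared within a tolerance via np.allclose. The helper name and the tolerance value are illustrative assumptions, not the test's exact code.

```python
import numpy as np

# Hypothetical helper: streaming and non-streaming runs can differ by more than
# 1e-2 for some logprob values, so a looser absolute tolerance is used here.
def assert_logprobs_consistent(stream_entries, non_stream_entries, atol=1e-1):
    assert len(stream_entries) == len(non_stream_entries)
    stream_values = [entry.logprob for entry in stream_entries]
    non_stream_values = [entry.logprob for entry in non_stream_entries]
    assert np.allclose(stream_values, non_stream_values, atol=atol)
```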
What does the PR do?
This PR adds support for `logprobs` functionality in the OpenAI-compatible frontend for vLLM models. The feature allows users to request detailed probability information for generated tokens, which is useful for understanding model confidence and exploring alternative completions.

Key changes:
- Added logprobs support for both chat completions and standard completions endpoints
- Implemented conversion from vLLM's logprobs format to OpenAI's format
- Added comprehensive test coverage for logprobs functionality, including validation and streaming
Background:
https://platform.openai.com/docs/api-reference/completions/create
https://platform.openai.com/docs/api-reference/chat/create
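For reference, the two APIs linked above expose logprobs differently: chat completions take a boolean `logprobs` plus an integer `top_logprobs` (as in the earlier request sketch), while the legacy completions endpoint takes an integer `logprobs` and returns parallel lists. A minimal sketch of the legacy shape, again with a placeholder base URL and model name:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="EMPTY")  # placeholder deployment

completion = client.completions.create(
    model="llama-3.1-8b-instruct",  # placeholder model name
    prompt="The capital of France is",
    max_tokens=4,
    logprobs=2,  # legacy completions API: integer count of top alternatives per position
)

lp = completion.choices[0].logprobs
print(lp.tokens)          # generated tokens
print(lp.token_logprobs)  # logprob of each generated token
print(lp.top_logprobs)    # per-position {token: logprob} dicts of alternatives
print(lp.text_offset)     # character offset of each token in the output text
```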
Checklist
- `<commit_type>: <Title>`
- Commit Type: Check the conventional commit type box here and add the label to the github PR.
Related PRs:
Where should the reviewer start?
Test plan:
Caveats:
Background
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)