Refactor LLMRequest: Structured RequestData for Completions & Chat-Completions #1446
Conversation
Hi @vMaroon. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
vMaroon force-pushed from a4f61c0 to 60ffd6f.
/hold While we are releasing
@liu-cong thank you for reviewing - addressed your comments. I force pushed a squash with 3 of your commits due to the problem above - hope that is ok.
/lgtm
just a nit
/lgtm
/ok-to-test
Bumping since the release was cut @kfswain @nirrozenbaum @liu-cong |
pkg/epp/requestcontrol/director.go (Outdated)

    RequestId:   reqCtx.Request.Headers[requtil.RequestIdHeaderKey],
    TargetModel: reqCtx.TargetModelName,
    Prompt:      prompt,
    Data:        requestData,
nit: data is a generic name, is there a more specific name we can use here? should we use body instead since we have headers below?
Input?
isn't this the body but structured?
it's a portion of the body - practically, we can have the IGW provide only the needed fields from the body, which would mean that whenever a plugin requires something that is not yet unmarshalled, the body struct would need to be expanded.
In LLMRequest, it is the only part of the body being captured, so I think Body is appropriate here; I think the type LLMRequestData could be named Body as well. If we want to pass the raw body in the future, we can add a RawBody field later.
Done.
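For readers skimming the thread, a minimal sketch of the shape agreed on above; field and type names follow the discussion, but this is illustrative, not the exact merged definition:

```go
package types

// Stub member types; a fuller sketch appears under "Summary of Changes" below.
type CompletionsRequest struct{ Prompt string }
type ChatCompletionsRequest struct{ Messages []map[string]string }

// Body is the structured portion of the request body captured for
// scheduling plugins (the type formerly named LLMRequestData).
type Body struct {
	Completions     *CompletionsRequest
	ChatCompletions *ChatCompletionsRequest
}

// LLMRequest is the request abstraction handed to scheduling plugins.
type LLMRequest struct {
	RequestId   string // e.g. taken from the request-id header
	TargetModel string
	Body        *Body // formerly Data; a RawBody field could be added later if ever needed
}
```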
pkg/epp/util/request/body.go (Outdated)

    // ExtractRequestData extracts the LLMRequestData from the given request body map.
    func ExtractRequestData(body map[string]any) (*types.LLMRequestData, error) {
        // Convert map back to JSON bytes
        jsonBytes, err := json.Marshal(body)
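The excerpt above is truncated; a hedged sketch of how the rest of the round-trip plausibly proceeds (the package name, import path, error wrapping, and union-field names are assumptions, not the merged code):

```go
// Package name assumed; the director imports it under the alias requtil.
package request

import (
	"encoding/json"
	"fmt"

	// Assumed import path for this PR's scheduling types.
	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/types"
)

// ExtractRequestData re-encodes the already-parsed body map and decodes it
// into the structured union, dispatching on the presence of "messages".
// Note the extra marshal/unmarshal per request discussed below.
func ExtractRequestData(body map[string]any) (*types.LLMRequestData, error) {
	jsonBytes, err := json.Marshal(body)
	if err != nil {
		return nil, fmt.Errorf("marshaling request body: %w", err)
	}
	data := &types.LLMRequestData{}
	if _, isChat := body["messages"]; isChat {
		data.ChatCompletions = &types.ChatCompletionsRequest{}
		err = json.Unmarshal(jsonBytes, data.ChatCompletions)
	} else {
		data.Completions = &types.CompletionsRequest{}
		err = json.Unmarshal(jsonBytes, data.Completions)
	}
	if err != nil {
		return nil, fmt.Errorf("unmarshaling request body: %w", err)
	}
	return data, nil
}
```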
Is there a perf concern here that we are marshalling and unmarshalling again for every request? What are your thoughts on having the Body param in the Request type being structured to begin with so that we unmarshal once?
This PR initially began with extracting each field separately by checking the body map, but this change was suggested in this comment #1446 (comment) for better readability.

I think your proposal makes sense, but it could get complex due to the different APIs and the additional fields that different inference engines might add. I think this should be a follow-up discussion/work, outside of this PR.

FYI, in a benchmark of chat-completions exercising the entire e2e pipeline of this change, plus the llm-d precise-prefix-cache scorer with preprocessing (using an embedded Python interpreter), the total came to about 10ms of added E2E latency compared to llm-d-inference-scheduler v0.2.1 on the completions API.
my main concern is the overhead of the extra marshalling and unmarshalling for long contexts; is the 10ms added overhead related to that?
The 10ms overhead includes several other significant actions, such as templating fields with Python's jinja2 library through CGO, so I think the extra (un)marshalling overhead is negligible. Although, as you suggested in another comment and privately, it can be avoided by structuring the handlers.Request.Body field passed to the director; I think that should be done in a follow-up PR.

Would that be ok?
sounds good to me!
> What are your thoughts on having the Body param in the Request type being structured to begin with so that we unmarshal once?

I remember discussing this point; I thought this was the plan of record?
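For concreteness, a small sketch of that "unmarshal once" direction: decode the raw bytes a single time at the handler boundary and hand the typed struct to the director. All names here are hypothetical; this is the follow-up idea, not code from this PR:

```go
package handlers

import (
	"encoding/json"
	"fmt"
)

// Message and ChatCompletionsRequest are illustrative stand-ins for the
// structured types discussed in this thread.
type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type ChatCompletionsRequest struct {
	Messages []Message `json:"messages"`
}

// Request carries a typed body from the start, so the director no longer
// needs the map -> JSON -> struct round-trip in ExtractRequestData.
type Request struct {
	Headers map[string]string
	Body    *ChatCompletionsRequest
}

// decodeBody unmarshals the raw request bytes exactly once.
func decodeBody(raw []byte) (*ChatCompletionsRequest, error) {
	body := &ChatCompletionsRequest{}
	if err := json.Unmarshal(raw, body); err != nil {
		return nil, fmt.Errorf("decoding request body: %w", err)
	}
	return body, nil
}
```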
1. cleaner API declaration
2. data fields are preserved; after-read transformations are done in plugins
3. prefix-cache scorer does not need naive templating
- minor bugfixes and improvements

Signed-off-by: Maroon Ayoub <[email protected]>

Signed-off-by: Maroon Ayoub <[email protected]>

- rename LLMRequest.Data to LLMRequest.Body
- test refactoring after rebase

Signed-off-by: Maroon Ayoub <[email protected]>
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, vMaroon. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/unhold Thanks for your patience @vMaroon
Motivation

The types.LLMRequest struct is the core API for scheduling plugins. Until now, it only exposed a flat Prompt string.

- For /v1/completions, this was sufficient.
- For /v1/chat/completions, messages were flattened into a pseudo-prompt via naive templating.

This flattening discards useful fields (e.g., tools, chat_template, etc.) that plugins may need. For example, the llm-d precise-prefix-cache-scorer plugin recreates vLLM's tokenization (including jinja2 templating), but today's API does not provide all the necessary inputs.

To support richer scheduling logic and prepare for newer APIs (e.g., OpenAI responses), we need to preserve raw, structured request data instead of flattening it upfront.
Summary of Changes

- types.LLMRequest: replaced the flat Prompt string with the structured LLMRequestData, a disjoint union of CompletionsRequest and ChatCompletionsRequest - the latter includes fields from the HuggingFace transformers chat-templating API (see the sketch after this list).
- Prefix-cache scorer: no longer relies on naive templating of chat messages.
- requestcontrol.Director: extracts the structured request data when building the LLMRequest.
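A simplified sketch of that disjoint union (the type was renamed Body during review; the merged definition carries more chat-templating fields than shown, and the exact field set here is illustrative):

```go
package types

// LLMRequestData holds exactly one member, depending on the target endpoint.
type LLMRequestData struct {
	Completions     *CompletionsRequest     // /v1/completions
	ChatCompletions *ChatCompletionsRequest // /v1/chat/completions
}

type CompletionsRequest struct {
	Prompt string `json:"prompt"`
}

type ChatCompletionsRequest struct {
	Messages []Message `json:"messages"`
	// Chat-templating inputs preserved for plugins such as the llm-d
	// precise-prefix-cache-scorer.
	Tools        []map[string]any `json:"tools,omitempty"`
	ChatTemplate string           `json:"chat_template,omitempty"`
}

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}
```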
Testing

- Unit tests
- Prefix plugin tests
- Benchmarks
Related Issues
cc @nirrozenbaum @kfswain @liu-cong