Refactor LLMRequest: Structured RequestData for Completions & Chat-Completions #1446
Conversation
Hi @vMaroon. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
vMaroon force-pushed from a4f61c0 to 60ffd6f.
/hold While we are releasing
@liu-cong thank you for reviewing - addressed your comments. I force pushed a squash with 3 of your commits due to the problem above - hope that is ok.
/lgtm
just a nit
/lgtm
/ok-to-test
Bumping since the release was cut @kfswain @nirrozenbaum @liu-cong |
pkg/epp/requestcontrol/director.go (Outdated)

    RequestId:   reqCtx.Request.Headers[requtil.RequestIdHeaderKey],
    TargetModel: reqCtx.TargetModelName,
    Prompt:      prompt,
    Data:        requestData,
nit: data is a generic name, is there a more specific name we can use here? should we use body instead since we have headers below?
Input?
isn't this the body but structured?
it's a portion of the body - practically, we can have the IGW provide only the needed fields from the body, which would mean that whenever a plugin requires something that is not yet unmarshalled, the body struct would need to be expanded.
In LLMRequest, it is the only part of the body being captured, so I think Body is appropriate here; I think the type LLMRequestData could be named Body as well. If we want to pass the raw body in the future, we can add a RawBody field later.
Done.
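For readers skimming the thread, a minimal sketch of the shape agreed on above; field and type names follow the discussion, but this is illustrative, not the exact merged definition:

```go
package types

// Stub member types; a fuller sketch appears under "Summary of Changes" below.
type CompletionsRequest struct{ Prompt string }
type ChatCompletionsRequest struct{ Messages []map[string]string }

// Body is the structured portion of the request body captured for
// scheduling plugins (the type formerly named LLMRequestData).
type Body struct {
	Completions     *CompletionsRequest
	ChatCompletions *ChatCompletionsRequest
}

// LLMRequest is the request abstraction handed to scheduling plugins.
type LLMRequest struct {
	RequestId   string // e.g. taken from the request-id header
	TargetModel string
	Body        *Body // formerly Data; a RawBody field could be added later if ever needed
}
```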
pkg/epp/util/request/body.go (Outdated)

    // ExtractRequestData extracts the LLMRequestData from the given request body map.
    func ExtractRequestData(body map[string]any) (*types.LLMRequestData, error) {
        // Convert map back to JSON bytes
        jsonBytes, err := json.Marshal(body)
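The excerpt above is truncated; a hedged sketch of how the rest of the round-trip plausibly proceeds (the package name, import path, error wrapping, and union-field names are assumptions, not the merged code):

```go
// Package name assumed; the director imports it under the alias requtil.
package request

import (
	"encoding/json"
	"fmt"

	// Assumed import path for this PR's scheduling types.
	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/types"
)

// ExtractRequestData re-encodes the already-parsed body map and decodes it
// into the structured union, dispatching on the presence of "messages".
// Note the extra marshal/unmarshal per request discussed below.
func ExtractRequestData(body map[string]any) (*types.LLMRequestData, error) {
	jsonBytes, err := json.Marshal(body)
	if err != nil {
		return nil, fmt.Errorf("marshaling request body: %w", err)
	}
	data := &types.LLMRequestData{}
	if _, isChat := body["messages"]; isChat {
		data.ChatCompletions = &types.ChatCompletionsRequest{}
		err = json.Unmarshal(jsonBytes, data.ChatCompletions)
	} else {
		data.Completions = &types.CompletionsRequest{}
		err = json.Unmarshal(jsonBytes, data.Completions)
	}
	if err != nil {
		return nil, fmt.Errorf("unmarshaling request body: %w", err)
	}
	return data, nil
}
```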
Is there a perf concern here that we are marshalling and unmarshalling again for every request? What are your thoughts on having the Body param in the Request type being structured to begin with so that we unmarshal once?
This PR initially began with extracting each field separately by checking the body map, but this change was suggested in this comment #1446 (comment) for better readability.

I think your proposal makes sense, but it could get complex due to the different APIs and the additional fields that different inference engines might add. I think this should be a follow-up discussion/work, outside of this PR.

FYI, in a benchmark of chat-completions exercising the entire e2e pipeline of this change, plus the llm-d precise-prefix-cache scorer with preprocessing (using an embedded Python interpreter), the total came to about 10ms of added E2E latency compared to llm-d-inference-scheduler v0.2.1 on the completions API.
my main concern is the overhead of the extra marshalling and unmarshalling for long contexts; is the 10ms added overhead related to that?
The 10ms overhead includes several other significant actions, such as templating fields with Python's jinja2 library through CGO, so I think the extra (un)marshalling overhead is negligible. Although, as you suggested in another comment and privately, it can be avoided by structuring the handlers.Request.Body field passed to the director; I think that should be done in a follow-up PR.

Would that be ok?
sounds good to me!
> What are your thoughts on having the Body param in the Request type being structured to begin with so that we unmarshal once?

I remember discussing this point; I thought this was the plan of record?
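For concreteness, a small sketch of that "unmarshal once" direction: decode the raw bytes a single time at the handler boundary and hand the typed struct to the director. All names here are hypothetical; this is the follow-up idea, not code from this PR:

```go
package handlers

import (
	"encoding/json"
	"fmt"
)

// Message and ChatCompletionsRequest are illustrative stand-ins for the
// structured types discussed in this thread.
type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type ChatCompletionsRequest struct {
	Messages []Message `json:"messages"`
}

// Request carries a typed body from the start, so the director no longer
// needs the map -> JSON -> struct round-trip in ExtractRequestData.
type Request struct {
	Headers map[string]string
	Body    *ChatCompletionsRequest
}

// decodeBody unmarshals the raw request bytes exactly once.
func decodeBody(raw []byte) (*ChatCompletionsRequest, error) {
	body := &ChatCompletionsRequest{}
	if err := json.Unmarshal(raw, body); err != nil {
		return nil, fmt.Errorf("decoding request body: %w", err)
	}
	return body, nil
}
```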
1. cleaner API declaration
2. data fields are preserved; after-read transformations are done in plugins
3. prefix-cache scorer does not need naive templating
- minor bugfixes and improvements

Signed-off-by: Maroon Ayoub <[email protected]>

Signed-off-by: Maroon Ayoub <[email protected]>

- rename LLMRequest.Data to LLMRequest.Body
- test refactoring after rebase

Signed-off-by: Maroon Ayoub <[email protected]>
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, vMaroon. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/unhold Thanks for your patience @vMaroon
Motivation

The types.LLMRequest struct is the core API for scheduling plugins. Until now, it only exposed a flat Prompt string.

- For /v1/completions, this was sufficient.
- For /v1/chat/completions, messages were flattened into a pseudo-prompt via naive templating.

This flattening discards useful fields (e.g., tools, chat_template, etc.) that plugins may need. For example, the llm-d precise-prefix-cache-scorer plugin recreates vLLM's tokenization (including jinja2 templating), but today's API does not provide all the necessary inputs.

To support richer scheduling logic and prepare for newer APIs (e.g., OpenAI responses), we need to preserve raw, structured request data instead of flattening it upfront.
Summary of Changes

- types.LLMRequest: replaced the flat Prompt string with the structured LLMRequestData, a disjoint union of CompletionsRequest and ChatCompletionsRequest - the latter includes fields from the HuggingFace transformers chat-templating API (see the sketch after this list).
- Prefix-cache scorer: no longer relies on naive templating of chat messages.
- requestcontrol.Director: extracts the structured request data when building the LLMRequest.
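A simplified sketch of that disjoint union (the type was renamed Body during review; the merged definition carries more chat-templating fields than shown, and the exact field set here is illustrative):

```go
package types

// LLMRequestData holds exactly one member, depending on the target endpoint.
type LLMRequestData struct {
	Completions     *CompletionsRequest     // /v1/completions
	ChatCompletions *ChatCompletionsRequest // /v1/chat/completions
}

type CompletionsRequest struct {
	Prompt string `json:"prompt"`
}

type ChatCompletionsRequest struct {
	Messages []Message `json:"messages"`
	// Chat-templating inputs preserved for plugins such as the llm-d
	// precise-prefix-cache-scorer.
	Tools        []map[string]any `json:"tools,omitempty"`
	ChatTemplate string           `json:"chat_template,omitempty"`
}

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}
```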
Testing

- Unit tests
- Prefix plugin tests
- Benchmarks
Related Issues
cc @nirrozenbaum @kfswain @liu-cong