
Conversation

@vMaroon (Contributor) commented Aug 25, 2025

Motivation

The types.LLMRequest struct is the core API for scheduling plugins.
Until now, it only exposed a flat Prompt string.

  • For /v1/completions, this was sufficient.
  • For /v1/chat/completions, messages were flattened into a pseudo-prompt via naive templating.

This flattening discards useful fields (e.g., tools, chat_template, etc.) that plugins may need. For example, the llm-d precise-prefix-cache-scorer plugin recreates vLLM's tokenization (including jinja2 templating), but today's API does not provide all the necessary inputs.

To support richer scheduling logic and prepare for newer APIs (e.g., OpenAI responses), we need to preserve raw, structured request data instead of flattening it upfront.

Summary of Changes

  • types.LLMRequest

    • Replaced the flat Prompt string with a structured LLMRequestData, a disjoint union of CompletionsRequest and ChatCompletionsRequest; the latter includes fields from the HuggingFace transformers chat-templating API. A rough, illustrative sketch of the resulting types follows this list.
    • Request data is preserved as-is; local transformations are left to plugins.
  • Prefix-cache scorer

    • No longer depends on naive prompt reconstruction for chat-completions:
      • The naive templating was originally an attempt to preserve the rough 1:4 chars-to-tokens estimation ratio. In practice this fails: with real chat-templating, the added special keywords map to individual special tokens that are part of the model's vocabulary, so the character count does not track the token count. The naive templating therefore adds questionable value and can be dropped.
  • requestcontrol.Director

    • Now responsible for parsing and populating the request fields as-is.
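
For orientation, here is a rough Go sketch of the resulting shape. It is illustrative only: exact field and type names in the merged code may differ (the Data field was renamed to Body during review), and the Message type here is an assumption.

```go
// Illustrative sketch only; not the exact merged definitions.
package types

// LLMRequest is the scheduling-plugin view of an inference request.
type LLMRequest struct {
	RequestId   string
	TargetModel string
	Headers     map[string]string
	// Body holds the structured request data (renamed from Data during review).
	Body *LLMRequestData
}

// LLMRequestData is a disjoint union: exactly one variant is non-nil.
type LLMRequestData struct {
	Completions     *CompletionsRequest
	ChatCompletions *ChatCompletionsRequest
}

// CompletionsRequest captures a /v1/completions body.
type CompletionsRequest struct {
	Prompt string `json:"prompt"`
}

// ChatCompletionsRequest captures a /v1/chat/completions body, including
// fields consumed by the HuggingFace transformers chat-templating API.
type ChatCompletionsRequest struct {
	Messages     []Message        `json:"messages"`
	Tools        []map[string]any `json:"tools,omitempty"`
	Documents    []map[string]any `json:"documents,omitempty"`
	ChatTemplate string           `json:"chat_template,omitempty"`
}

// Message is a single chat turn (role + content).
type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}
```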

Testing

  • Unit tests

    • Added coverage for both completions and chat-completions request parsing.
    • Validated optional fields (tools, documents, chat_template, etc.).
  • Prefix plugin tests

    • Verified prefix scoring works with both completions and chat-completions.
    • Added growth tests to confirm correct prefix matching across extended conversations.
  • Benchmarks

    • Stress tests for long prompts and large chat histories.
    • Confirmed cache scoring scales correctly under load.

Related Issues

cc @nirrozenbaum @kfswain @liu-cong

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 25, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 25, 2025
@k8s-ci-robot

Hi @vMaroon. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 25, 2025

netlify bot commented Aug 25, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 2f8d627
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/68c32e104c6d19000895880b
😎 Deploy Preview https://deploy-preview-1446--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 25, 2025
@vMaroon vMaroon force-pushed the main branch 2 times, most recently from a4f61c0 to 60ffd6f Compare August 25, 2025 22:11
@k8s-ci-robot k8s-ci-robot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Aug 26, 2025
@kfswain (Collaborator) commented Aug 26, 2025

/hold

While we are releasing

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 26, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Aug 26, 2025
@vMaroon (Contributor, Author) commented Aug 26, 2025

@liu-cong thank you for reviewing; I addressed your comments. I force-pushed a squash with 3 of your commits due to the problem above; I hope that is ok.

@liu-cong (Contributor) left a comment


/lgtm

just a nit

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Aug 26, 2025
@liu-cong (Contributor)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 26, 2025
@ahg-g (Contributor) commented Aug 26, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 26, 2025
@vMaroon (Contributor, Author) commented Sep 10, 2025

Bumping since the release was cut @kfswain @nirrozenbaum @liu-cong

    RequestId:   reqCtx.Request.Headers[requtil.RequestIdHeaderKey],
    TargetModel: reqCtx.TargetModelName,
    Prompt:      prompt,
    Data:        requestData,
Contributor

nit: Data is a generic name; is there a more specific name we can use here? Should we use Body instead, since we have Headers below?

Contributor Author

Input?

Contributor

isn't this the body but structured?

Contributor Author (@vMaroon, Sep 10, 2025)

It's a portion of the body. Practically, we can have the IGW provide only the needed fields from the body, which means that whenever a plugin requires something that is not unmarshalled, the body struct would need to be expanded.

Contributor

In LLMRequest, it is the only part of the body being captured, so I think Body is appropriate here; I think the type LLMRequestData could be named Body as well; if we want to pass the raw body in the future, we can add a RawBody field later

Contributor Author

Done.

// ExtractRequestData extracts the LLMRequestData from the given request body map.
func ExtractRequestData(body map[string]any) (*types.LLMRequestData, error) {
    // Convert map back to JSON bytes
    jsonBytes, err := json.Marshal(body)
Contributor

Is there a perf concern here that we are marshalling and unmarshalling again for every request? What are your thoughts on having the Body param in the Request type being structured to begin with so that we unmarshal once?

Contributor Author

This PR initially began with extracting each field separately by checking the body map, but this change was suggested in this comment #1446 (comment) for better readability.

I think your proposal makes sense but it could get complex due to the different APIs and additional fields that the different inference-engines might add. I think this should be a follow-up discussion/work, outside of this PR.

FYI, in a benchmark, the entire e2e chat-completions pipeline of this change, plus the llm-d precise-prefix-cache scorer with preprocessing (using an embedded Python interpreter), added about 10ms of E2E latency compared to llm-d-inference-scheduler v0.2.1 on the completions API.
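
For context, the round-trip under discussion amounts to roughly the following. This is a minimal sketch reusing the illustrative union types from the PR description above; the variant-selection logic and the lowercase function name are assumptions, not the PR's exact code.

```go
// Sketch of the map -> JSON -> struct round-trip discussed in this thread.
// Placed in the same package as the illustrative types above only so the
// sketch is self-contained; not the PR's exact code.
package types

import (
	"encoding/json"
	"fmt"
)

func extractRequestData(body map[string]any) (*LLMRequestData, error) {
	// The body arrives as an already-parsed map, so it is re-marshalled to JSON...
	jsonBytes, err := json.Marshal(body)
	if err != nil {
		return nil, fmt.Errorf("failed to marshal request body: %w", err)
	}

	// ...and then unmarshalled a second time into the structured variant.
	if _, isChat := body["messages"]; isChat {
		chat := &ChatCompletionsRequest{}
		if err := json.Unmarshal(jsonBytes, chat); err != nil {
			return nil, fmt.Errorf("failed to parse chat-completions body: %w", err)
		}
		return &LLMRequestData{ChatCompletions: chat}, nil
	}

	completions := &CompletionsRequest{}
	if err := json.Unmarshal(jsonBytes, completions); err != nil {
		return nil, fmt.Errorf("failed to parse completions body: %w", err)
	}
	return &LLMRequestData{Completions: completions}, nil
}
```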

Contributor (@ahg-g, Sep 10, 2025)

My main concern is the overhead of the extra marshalling and unmarshalling for long contexts; is the 10ms added overhead related to that?

Contributor Author (@vMaroon, Sep 11, 2025)

The 10ms overhead includes several other significant actions, such as templating fields with Python's jinja2 library through CGO, so I think the marshalling overhead is negligible. That said, as you suggested in another comment and privately, it can be avoided by structuring the handlers.Request.Body field passed to the director; I think that should be done in a follow-up PR.

Would that be ok?
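
A rough sketch of that follow-up idea, for illustration only: decode the raw body once into the typed union before it reaches the Director, so the map round-trip disappears. The function name decodeBodyOnce and the probe logic are hypothetical and not part of this PR.

```go
// Hypothetical follow-up sketch, not part of this PR: unmarshal the raw body
// once, directly into the structured union, instead of going through a map.
// Reuses the illustrative types from the earlier sketch.
package types

import (
	"encoding/json"
	"fmt"
)

func decodeBodyOnce(raw []byte) (*LLMRequestData, error) {
	// Probe only the "messages" key to pick the variant without a full map decode.
	var probe struct {
		Messages json.RawMessage `json:"messages"`
	}
	if err := json.Unmarshal(raw, &probe); err != nil {
		return nil, fmt.Errorf("invalid request body: %w", err)
	}

	if probe.Messages != nil {
		chat := &ChatCompletionsRequest{}
		if err := json.Unmarshal(raw, chat); err != nil {
			return nil, err
		}
		return &LLMRequestData{ChatCompletions: chat}, nil
	}

	completions := &CompletionsRequest{}
	if err := json.Unmarshal(raw, completions); err != nil {
		return nil, err
	}
	return &LLMRequestData{Completions: completions}, nil
}
```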

Contributor

sounds good to me!

Collaborator

What are your thoughts on having the Body param in the Request type being structured to begin with so that we unmarshal once?

I remember discussing this point; I thought this was the plan of record?

1. cleaner API declaration
2. data fields are preserved, after-read transformations are done in plugins
3. prefix-cache scorer does not need naive templating
- minor bugfixes and improvements

Signed-off-by: Maroon Ayoub <[email protected]>
- rename LLMRequest.Data to LLMRequest.Body
- test refactoring after rebase

Signed-off-by: Maroon Ayoub <[email protected]>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 11, 2025
@ahg-g (Contributor) commented Sep 11, 2025

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 11, 2025
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, vMaroon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 11, 2025
@kfswain (Collaborator) commented Sep 11, 2025

/unhold

Thanks for your patience @vMaroon

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 11, 2025
@k8s-ci-robot k8s-ci-robot merged commit 4361b59 into kubernetes-sigs:main Sep 11, 2025
11 checks passed
Labels
approved · cncf-cla: yes · lgtm · ok-to-test · size/XL

Successfully merging this pull request may close these issues:
Enhanced OpenAI Chat-Completions API Support

5 participants