support extracting prompt from chat completions API #798
Conversation
Hi @delavet. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
}

func extractPromptForChatCompletions(body map[string]interface{}) (string, error) {
	messages, ok := body["messages"]
Hi, since prefix-aware routing is an attempt at estimating the locations of KVCache, this may be sufficient to some degree, but a chat-completions request is more complex. Two chat-completion requests can have the same messages but lead to entirely different KV blocks.
See this struct for example: https://github.com/sashabaranov/go-openai/blob/6181facea7e6e5525b6b8da42205d7cce822c249/chat.go#L95
And an example of how a chat-completions request is templated before tokenization in vLLM: https://github.com/vllm-project/vllm/blob/main/examples/tool_chat_template_llama3.2_json.jinja
Thank you for your valuable suggestions. I deliberately kept this function's logic relatively simple because, in my understanding, the EPP should be as model-agnostic as possible: for chat-completions requests sent to the same model, we only need to ensure that requests sharing the same message prefix produce prompts sharing the same prefix, which should suffice for prefix-cache-aware routing.
Based on this, I referenced the simple template from https://github.com/vllm-project/vllm/blob/main/examples/template_chatml.jinja to perform basic processing on the message list.
I do notice that the template you provided includes some more complex details, and I plan to implement some improvements to better handle these cases:
- In the OpenAI schema, content may not always be a string but can also be an array: additional handling can be added in the code to deal with this case.
- Requests may include multimodal content such as images or videos: since the current purpose of extracting the prompt is mainly prefix-cache-aware routing, I assume we can ignore the multimodal parts for now, especially since GIE does not currently claim to support multimodal models. These can be addressed later when multimodal support is introduced.
- The request body may contain different tools, and this might result in different system prompts. I plan to add logic to simulate this behavior, referencing the example provided here.
This should cover most of the common scenarios. If anyone has further comments or suggestions, please point them out.
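For illustration, here is a minimal, self-contained sketch of the kind of ChatML-style flattening described above, including the planned handling of array-valued content. The package, function, and variable names are illustrative and not necessarily what the PR implements:

```go
package requestutil

import (
	"errors"
	"fmt"
	"strings"
)

// flattenChatMessages renders an OpenAI-style "messages" list into a single
// ChatML-delimited prompt string. For prefix-cache-aware routing it is enough
// that equal message prefixes produce equal prompt prefixes.
func flattenChatMessages(messages []interface{}) (string, error) {
	var sb strings.Builder
	for _, m := range messages {
		msg, ok := m.(map[string]interface{})
		if !ok {
			return "", errors.New("message is not a JSON object")
		}
		role, _ := msg["role"].(string)
		sb.WriteString("<|im_start|>" + role + "\n")

		switch content := msg["content"].(type) {
		case string:
			sb.WriteString(content)
		case []interface{}:
			// Array-valued content: keep the text parts, skip multimodal parts
			// (images, audio) for now, as discussed above.
			for _, part := range content {
				p, ok := part.(map[string]interface{})
				if !ok {
					continue
				}
				if p["type"] == "text" {
					if text, ok := p["text"].(string); ok {
						sb.WriteString(text)
					}
				}
			}
		default:
			return "", fmt.Errorf("unsupported content type %T", content)
		}
		sb.WriteString("<|im_end|>\n")
	}
	return sb.String(), nil
}
```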
I think it's best to keep this PR small and focused since it resolves a known bug. We can iterate, if needed, to resolve any potential kv cache inefficiencies.
pkg/epp/requestcontrol/director.go
-	prompt, ok := requestBodyMap["prompt"].(string)
-	if !ok {
-		return reqCtx, errutil.Error{Code: errutil.BadRequest, Msg: "prompt not found in request"}
+	prompt, err := requestutil.ExtractPromptFromRequestBody(requestBodyMap)
While using the LLMRequest.Prompt field makes sense for both completion and chat-completion API requests for scorers that consume the flattened information, other scorers may want the distinction and accurate use of a chat-completion request's fields.
This may not be relevant to the codebase right now, but I'm just saying that such a distinction can be required, e.g., by a kvcache-aware scorer that needs to accurately rebuild the tokenization that matches vLLM's.
Since messages is currently only used for prefix-cache-aware routing, the current implementation should be sufficient. I believe more complex cases can be addressed in separate pull requests.
As the kind of scoring mechanism that relies on accurate tokenization has not yet been introduced, we don't yet have a clear picture of how to meet such requirements.
/ok-to-test
Overall the PR looks good.
I agree with @vMaroon's comments and there is room for improvement,
but this fix enables chat completions and is a good start.
We can do another iteration in a follow-up PR to make sure tools and tool_choices are covered.
I left just a few minor comments.
@delavet, thanks for your contribution!
Thanks for the suggestions :-) I have just pushed a commit to fix these small things. To keep this PR clean, I think I could open another issue to follow up on the improvements discussed here.
Sounds good. Let's document in the new issue the things Maroon pointed out (or just reference his review) to make sure we don't miss them. This PR is good enough to unblock chat-completions serving for now. /lgtm Thanks!
@delavet can you rebase?
}

func extractPromptForChatCompletions(body map[string]interface{}) (string, error) {
	messages, ok := body["messages"]
I think it's best to keep this PR small and focused since it resolves a known bug. We can iterate, if needed, to resolve any potential kv cache inefficiencies.
Agreed with @danehans's comments here. TY for catching this bug. Would love to help get this resolved ASAP; let us know what you need from us, thanks!
Force-pushed from 912d649 to 8815645.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: delavet, kfswain. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
support extracting prompt from chat completions API (#798)

* support extracting prompt from chat completions API
* typo fixes
* fix tests
* supply more tests and heading boilerplate

Signed-off-by: Hang Yin <[email protected]>
Resolves #790.
This PR introduces a utility function called `ExtractPromptFromRequestBody`. As its name suggests, it extracts the request prompt from the request body. For the `/chat/completions` API, it reads the `messages` field of the request and converts it into a prompt string. The conversion simply uses `<|im_start|>` and `<|im_end|>` from the OpenAI ChatML format to denote the start and end of each message, thereby delimiting the list of conversation messages within the prompt. `ExtractPromptFromRequestBody` is used by the EPP when processing the request body to extract the prompt.
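For illustration, a minimal usage sketch. The helper's name and two-value signature come from the diff above; the import path and the exact output formatting (e.g. trailing newlines) are assumptions and may differ from the actual implementation:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"

	// Assumed import path; the actual package location in the repo may differ.
	requestutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/request"
)

func main() {
	// A parsed /chat/completions request body, as the EPP would see it.
	raw := []byte(`{
		"model": "example-model",
		"messages": [
			{"role": "system", "content": "You are a helpful assistant."},
			{"role": "user", "content": "Hello!"}
		]
	}`)

	var body map[string]interface{}
	if err := json.Unmarshal(raw, &body); err != nil {
		log.Fatal(err)
	}

	prompt, err := requestutil.ExtractPromptFromRequestBody(body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(prompt)
	// Expected shape (exact whitespace may differ):
	// <|im_start|>system
	// You are a helpful assistant.<|im_end|>
	// <|im_start|>user
	// Hello!<|im_end|>
}
```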