Conversation

@delavet (Contributor) commented May 8, 2025

Resolves #790.

This PR introduces a utility function called ExtractPromptFromRequestBody, which extracts the request prompt from the request body. For the /chat/completions API, it reads the messages field of the request and converts it into a prompt string. The conversion simply uses the <|im_start|> and <|im_end|> markers from the OpenAI ChatML format to denote the start and end of each message, delimiting the list of conversation messages within the prompt.

ExtractPromptFromRequestBody is used by EPP when processing the request body to attempt to extract the prompt.
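
For reference, here is a minimal sketch of the ChatML-style flattening described above. The package and function names are illustrative, not the PR's actual code:

```go
package requestutil // hypothetical package name for this sketch

import (
	"errors"
	"strings"
)

// flattenMessagesToPrompt concatenates an OpenAI-style messages list into a
// single prompt string, wrapping each message in ChatML delimiters so that
// requests sharing a message prefix also share a prompt prefix.
func flattenMessagesToPrompt(messages []map[string]interface{}) (string, error) {
	var sb strings.Builder
	for _, msg := range messages {
		role, ok := msg["role"].(string)
		if !ok {
			return "", errors.New("message role must be a string")
		}
		content, ok := msg["content"].(string)
		if !ok {
			return "", errors.New("message content must be a string")
		}
		// <|im_start|> and <|im_end|> mark the start and end of each message.
		sb.WriteString("<|im_start|>" + role + "\n" + content + "<|im_end|>\n")
	}
	return sb.String(), nil
}
```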

netlify bot commented May 8, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: 8815645
🔍 Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/6822a9744e0870000871a887
😎 Deploy Preview: https://deploy-preview-798--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 8, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 8, 2025
@k8s-ci-robot (Contributor) commented:

Hi @delavet. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 8, 2025
Review comment on extractPromptForChatCompletions:

```go
func extractPromptForChatCompletions(body map[string]interface{}) (string, error) {
	messages, ok := body["messages"]
```

@vMaroon (Contributor) commented:

Hi, since prefix-aware routing is an attempt at estimating the locations of KV cache, this may be sufficient to some degree, but a chat-completions request is more complex. Two chat-completions requests can have the same messages but lead to entirely different KV blocks.

See this struct for example: https://github.com/sashabaranov/go-openai/blob/6181facea7e6e5525b6b8da42205d7cce822c249/chat.go#L95

And an example of how a chat-completions request is templated before tokenization in vLLM: https://github.com/vllm-project/vllm/blob/main/examples/tool_chat_template_llama3.2_json.jinja
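
As a concrete illustration of that point (these request bodies are invented for this example, not taken from the PR), the following two requests carry identical messages, but the second one's tools field changes the templated prompt and therefore the resulting KV blocks:

```json
{"model": "llama-3", "messages": [{"role": "user", "content": "What's the weather in Paris?"}]}
```

```json
{"model": "llama-3",
 "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
 "tools": [{"type": "function",
            "function": {"name": "get_weather",
                         "parameters": {"type": "object", "properties": {}}}}]}
```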

@delavet (Contributor, Author) replied May 9, 2025:

Thank you for your valuable suggestions. I indeed designed this function with relatively simple logic because, in my understanding, EPP should be as model-agnostic as possible: for chat-completions requests sent to the same model, we only need to ensure that requests with the same message prefix yield prompts with the same prefix, which should suffice for prefix-cache-aware routing.

Based on this, I referenced the simple template from https://github.com/vllm-project/vllm/blob/main/examples/template_chatml.jinja to perform basic processing on the message list.

I do notice that the template you provided includes some more complex details, and I plan to implement some improvements to better handle these cases:

  • In the OpenAI schema, content may not always be a string but can also be an array: additional handling can be added in the code to deal with this case (see the sketch after this list).
  • Requests may include multimodal content such as images or videos: since the current purpose of extracting the prompt is mainly prefix-cache-aware routing, I assume we can ignore the multimodal parts for now, especially since GIE does not currently claim to support multimodal models. These can be addressed later when multimodal support is introduced.
  • The request body may contain different tools, and this might result in different system prompts. I plan to add logic to simulate this behavior, referencing the example provided here.

This should cover most of the common scenarios. If anyone has further comments or suggestions, please point them out.
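
As a rough sketch of the first bullet's array handling (a hypothetical helper, not the PR's code; it reuses the errors and strings imports from the earlier sketch), content could be flattened like this:

```go
// extractContentText accepts either a plain string content or an OpenAI-style
// array of content parts, concatenating the text parts. Non-text parts
// (e.g. images) are skipped, matching the decision to ignore multimodal
// content for now.
func extractContentText(content interface{}) (string, error) {
	switch v := content.(type) {
	case string:
		return v, nil
	case []interface{}:
		var sb strings.Builder
		for _, part := range v {
			p, ok := part.(map[string]interface{})
			if !ok {
				return "", errors.New("content part is not an object")
			}
			if p["type"] == "text" {
				if text, ok := p["text"].(string); ok {
					sb.WriteString(text)
				}
			}
		}
		return sb.String(), nil
	default:
		return "", errors.New("unsupported content type")
	}
}
```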

@danehans (Contributor) replied:

I think it's best to keep this PR small and focused since it resolves a known bug. We can iterate, if needed, to resolve any potential kv cache inefficiencies.

Review comment on the EPP request-body handling change:

```diff
-	prompt, ok := requestBodyMap["prompt"].(string)
-	if !ok {
-		return reqCtx, errutil.Error{Code: errutil.BadRequest, Msg: "prompt not found in request"}
-	}
+	prompt, err := requestutil.ExtractPromptFromRequestBody(requestBodyMap)
```
@vMaroon (Contributor) commented:

While using the LLMRequest.Prompt field makes sense for scorers that consume the flattened information from both completion and chat-completion API requests, other scorers may want the distinction and accurate use of a chat-completion request's fields.

This may not be relevant to the codebase right now, but I'm just saying that such a distinction can be required, e.g., by a kvcache-aware scorer that needs to accurately rebuild the tokenization that matches vLLM's.
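
To make that distinction concrete, a request type could keep both views side by side; the struct below is a hypothetical sketch, not the codebase's actual LLMRequest:

```go
// ChatAwareRequest is an illustrative shape: prefix-cache scorers can consume
// the flattened Prompt, while a tokenization-accurate scorer could re-template
// from the original Messages and Tools fields to match vLLM's tokenization.
type ChatAwareRequest struct {
	Prompt   string                   // flattened, ChatML-style prompt
	Messages []map[string]interface{} // original messages, as parsed from JSON
	Tools    []map[string]interface{} // original tools, if present
}
```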

@delavet (Contributor, Author) replied:

Since messages is currently only used for prefix-cache-aware routing, the current implementation should be sufficient. I believe more complex cases can be addressed in separate pull requests.

As the kind of scoring mechanism that relies on accurate tokenization has not yet been explicitly introduced, we don't yet have a clear picture of how to meet such requirements.

@ahg-g (Contributor) commented May 8, 2025:

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 8, 2025
@nirrozenbaum (Contributor) left a comment:

Overall the PR looks good. I do agree with @vMaroon's comments and there is room for improvement, but this fix enables chat completions and is a good start. We can do another iteration to make sure tools and tool_choices are covered in a follow-up PR.

I left just a few minor comments.

@delavet Thanks for your contribution!

@delavet (Contributor, Author) commented May 12, 2025:

> Overall the PR looks good. I do agree with @vMaroon's comments and there is room for improvement, but this fix enables chat completions and is a good start. We can do another iteration to make sure tools and tool_choices are covered in a follow-up PR.
>
> I left just a few minor comments.
>
> @delavet Thanks for your contribution!

Thanks for the suggestions :-) I have just pushed a commit to fix these small things. To keep this PR clean, I think I could open another issue to follow up on the improvements discussed.

@delavet delavet requested a review from nirrozenbaum May 12, 2025 04:30
@nirrozenbaum (Contributor) commented:

> Thanks for the suggestions :-) I have just pushed a commit to fix these small things. To keep this PR clean, I think I could open another issue to follow up on the improvements discussed.

Sounds good. Let's document in the new issue the things Maroon pointed out (or just reference his review) to make sure we don't miss them.

This PR is good enough to unblock chat completions serving for now.

/lgtm

Thanks!

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 12, 2025
@nirrozenbaum (Contributor) commented:

@delavet can you rebase?


@kfswain (Collaborator) commented May 12, 2025:

Agreed with @danehans's comments here.

TY for catching this bug. Would love to help get this resolved ASAP; let us know what you need from us, thanks!

@delavet delavet force-pushed the extract-chat-completions-prompt branch from 912d649 to 8815645 Compare May 13, 2025 02:07
@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 13, 2025
@delavet delavet requested a review from danehans May 13, 2025 02:14
@kfswain (Collaborator) commented May 13, 2025:

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 13, 2025
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: delavet, kfswain

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 13, 2025
@k8s-ci-robot k8s-ci-robot merged commit 2b2b4a6 into kubernetes-sigs:main May 13, 2025
8 checks passed
nayihz pushed a commit to nayihz/gateway-api-inference-extension that referenced this pull request May 14, 2025

* support extracting prompt from chat completions API
* typo fixes
* fix tests
* supply more tests and heading boilerplate

Signed-off-by: Hang Yin <[email protected]>
kaushikmitr pushed a commit to kaushikmitr/llm-instance-gateway that referenced this pull request May 15, 2025 (same commits as above)
irar2 pushed a commit to irar2/gateway-api-inference-extension that referenced this pull request Jun 3, 2025 (same commits as above)
rlakhtakia pushed a commit to rlakhtakia/gateway-api-inference-extension that referenced this pull request Jun 11, 2025 (same commits as above)
Successfully merging this pull request may close these issues:

EPP cannot serve /chat/completions API (#790)