support extracting prompt from chat completions API #798
Conversation
Hi @delavet. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
}

func extractPromptForChatCompletions(body map[string]interface{}) (string, error) {
	messages, ok := body["messages"]
Hi, since prefix-aware routing is an attempt at estimating the locations of KVCache, this may be sufficient to some degree, but a chat-completions request is more complex. Two chat-completion requests can have the same messages but lead to entirely different KV blocks.
See this struct for example: https://github.com/sashabaranov/go-openai/blob/6181facea7e6e5525b6b8da42205d7cce822c249/chat.go#L95
And an example of how a chat-completions request is templated before tokenization in vLLM: https://github.com/vllm-project/vllm/blob/main/examples/tool_chat_template_llama3.2_json.jinja
Thank you for your valuable suggestions. I deliberately kept this function's logic relatively simple because, in my understanding, the EPP should be as model-agnostic as possible: for chat-completions requests sent to the same model, we only need to ensure that requests sharing the same message prefix produce prompts sharing the same prefix, which should suffice for prefix-cache-aware routing.
Based on this, I referenced the simple template from https://github.com/vllm-project/vllm/blob/main/examples/template_chatml.jinja to perform basic processing on the message list.
I do notice that the template you provided includes some more complex details, and I plan to implement some improvements to better handle these cases:
- In the OpenAI schema, content may not always be a string but can also be an array: additional handling can be added in the code to deal with this case.
- Requests may include multimodal content such as images or videos: since the current purpose of extracting the prompt is mainly prefix-cache-aware routing, I assume we can ignore the multimodal parts for now, especially since GIE does not currently claim to support multimodal models. These can be addressed later when multimodal support is introduced.
- The request body may contain different tools, and this might result in different system prompts. I plan to add logic to simulate this behavior, referencing the example provided here.
This should cover most of the common scenarios. If anyone has further comments or suggestions, please point them out.
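For illustration, here is a minimal, self-contained sketch of the kind of ChatML-style flattening described above, including the planned handling of array-valued content. The package, function, and variable names are illustrative and not necessarily what the PR implements:

```go
package requestutil

import (
	"errors"
	"fmt"
	"strings"
)

// flattenChatMessages renders an OpenAI-style "messages" list into a single
// ChatML-delimited prompt string. For prefix-cache-aware routing it is enough
// that equal message prefixes produce equal prompt prefixes.
func flattenChatMessages(messages []interface{}) (string, error) {
	var sb strings.Builder
	for _, m := range messages {
		msg, ok := m.(map[string]interface{})
		if !ok {
			return "", errors.New("message is not a JSON object")
		}
		role, _ := msg["role"].(string)
		sb.WriteString("<|im_start|>" + role + "\n")

		switch content := msg["content"].(type) {
		case string:
			sb.WriteString(content)
		case []interface{}:
			// Array-valued content: keep the text parts, skip multimodal parts
			// (images, audio) for now, as discussed above.
			for _, part := range content {
				p, ok := part.(map[string]interface{})
				if !ok {
					continue
				}
				if p["type"] == "text" {
					if text, ok := p["text"].(string); ok {
						sb.WriteString(text)
					}
				}
			}
		default:
			return "", fmt.Errorf("unsupported content type %T", content)
		}
		sb.WriteString("<|im_end|>\n")
	}
	return sb.String(), nil
}
```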
I think it's best to keep this PR small and focused since it resolves a known bug. We can iterate, if needed, to resolve any potential kv cache inefficiencies.
pkg/epp/requestcontrol/director.go
-	prompt, ok := requestBodyMap["prompt"].(string)
-	if !ok {
-		return reqCtx, errutil.Error{Code: errutil.BadRequest, Msg: "prompt not found in request"}
+	prompt, err := requestutil.ExtractPromptFromRequestBody(requestBodyMap)
While using the LLMRequest.Prompt field makes sense for both completion and chat-completion API requests for scorers that consume the flattened information, other scorers may want the distinction and accurate use of a chat-completion request's fields.
This may not be relevant to the codebase right now, but I'm just saying that such a distinction can be required, e.g., by a kvcache-aware scorer that needs to accurately rebuild the tokenization that matches vLLM's.
Since messages is currently only used for prefix-cache-aware routing, the current implementation should be sufficient. I believe more complex cases can be addressed in separate pull requests.
As the kind of scoring mechanism that relies on accurate tokenization has not yet been introduced, we don't yet have a clear picture of how to meet such requirements.
/ok-to-test
Overall the PR looks good.
I agree with @vMaroon's comments and there is room for improvement,
but this fix enables chat completions and is a good start.
We can do another iteration in a follow-up PR to make sure tools and tool_choices are covered.
I left just a few minor comments.
@delavet, thanks for your contribution!
Thanks for the suggestions :-) I have just pushed a commit to fix these small things. To keep this PR clean, I think I could open another issue to follow up on the improvements discussed here.
Sounds good. Let's document in the new issue the things Maroon pointed out (or just reference his review) to make sure we don't miss them. This PR is good enough to unblock chat-completions serving for now. /lgtm Thanks!
@delavet can you rebase?
}

func extractPromptForChatCompletions(body map[string]interface{}) (string, error) {
	messages, ok := body["messages"]
I think it's best to keep this PR small and focused since it resolves a known bug. We can iterate, if needed, to resolve any potential kv cache inefficiencies.
Agreed with @danehans's comments here. TY for catching this bug. Would love to help get this resolved ASAP; let us know what you need from us, thanks!
Force-pushed from 912d649 to 8815645.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: delavet, kfswain. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
support extracting prompt from chat completions API (#798)

* support extracting prompt from chat completions API
* typo fixes
* fix tests
* supply more tests and heading boilerplate

Signed-off-by: Hang Yin <[email protected]>
Resolves #790.
This PR introduces a utility function called `ExtractPromptFromRequestBody`. As its name suggests, it extracts the request prompt from the request body. For the `/chat/completions` API, it reads the `messages` field of the request and converts it into a prompt string. The conversion simply uses `<|im_start|>` and `<|im_end|>` from the OpenAI ChatML format to denote the start and end of each message, thereby delimiting the list of conversation messages within the prompt. `ExtractPromptFromRequestBody` is used by the EPP when processing the request body to extract the prompt.
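For illustration, a minimal usage sketch. The helper's name and two-value signature come from the diff above; the import path and the exact output formatting (e.g. trailing newlines) are assumptions and may differ from the actual implementation:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"

	// Assumed import path; the actual package location in the repo may differ.
	requestutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/request"
)

func main() {
	// A parsed /chat/completions request body, as the EPP would see it.
	raw := []byte(`{
		"model": "example-model",
		"messages": [
			{"role": "system", "content": "You are a helpful assistant."},
			{"role": "user", "content": "Hello!"}
		]
	}`)

	var body map[string]interface{}
	if err := json.Unmarshal(raw, &body); err != nil {
		log.Fatal(err)
	}

	prompt, err := requestutil.ExtractPromptFromRequestBody(body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(prompt)
	// Expected shape (exact whitespace may differ):
	// <|im_start|>system
	// You are a helpful assistant.<|im_end|>
	// <|im_start|>user
	// Hello!<|im_end|>
}
```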