Following up on #798, which added initial support for the OpenAI Chat-Completions API, I believe the following enhancements are sensible:
- In the mentioned PR, chat-completions requests are partially collapsed into the `schedulingtypes.LLMRequest::Prompt` field. While this is sensible for current use, the loss of the original fields such as `messages`, `tools` and `tool_choice` would affect scorers that require precise templating of the request, such as ones that utilize a global KV-cache index.
  - I think instead there should be a clear distinction between `prompt` from the completions API and the fields of a chat-completions request, while balancing efficiency as well. This should be postponed until such a scorer is implemented.
- Since prefix-aware routing is an attempt at estimating the locations of KV-cache blocks, it may be sufficient to some degree, but a chat-completions request is more complex: two chat-completions requests can have the same messages yet lead to entirely different KV blocks.
  - See this struct for example: https://github.com/sashabaranov/go-openai/blob/6181facea7e6e5525b6b8da42205d7cce822c249/chat.go#L95
  - And an example of how a chat-completions request is templated before tokenization in vLLM: https://github.com/vllm-project/vllm/blob/main/examples/tool_chat_template_llama3.2_json.jinja
Both issues can be resolved by an approach that utilizes the go-openai package linked above; rough sketches of both ideas follow.
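
For the first point, a minimal sketch assuming go-openai's request types are an acceptable dependency. Everything here other than `Prompt` is hypothetical and not the current `schedulingtypes.LLMRequest` definition:

```go
// A minimal sketch, not the current schedulingtypes.LLMRequest definition:
// keep the completions prompt and the full chat-completions request as
// distinct fields instead of collapsing the latter into Prompt.
package schedulingtypes

import openai "github.com/sashabaranov/go-openai"

type LLMRequest struct {
	// Prompt is set only for requests to the completions API.
	Prompt string

	// ChatCompletion preserves messages, tools and tool_choice for
	// chat-completions requests, so scorers that need precise
	// templating (e.g. a global KV-cache index) can re-render the
	// request exactly as the model server would.
	ChatCompletion *openai.ChatCompletionRequest
}

// IsChat reports whether the request targets the chat-completions API.
func (r *LLMRequest) IsChat() bool { return r.ChatCompletion != nil }
```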
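
For the second point, a toy illustration: two requests with identical `messages` but different `tools` must not share a KV-cache prefix, so whatever canonical form a scorer hashes has to cover more than `messages`. The `renderForScoring` helper below is hypothetical and only approximates real server-side chat templating:

```go
package main

import (
	"encoding/json"
	"fmt"

	openai "github.com/sashabaranov/go-openai"
)

// renderForScoring stands in for the server-side templating step by
// serializing every request field that influences the rendered prompt.
// A real implementation would apply the model's chat template instead.
func renderForScoring(req openai.ChatCompletionRequest) string {
	b, _ := json.Marshal(struct {
		Messages   []openai.ChatCompletionMessage `json:"messages"`
		Tools      []openai.Tool                  `json:"tools,omitempty"`
		ToolChoice any                            `json:"tool_choice,omitempty"`
	}{req.Messages, req.Tools, req.ToolChoice})
	return string(b)
}

func main() {
	msgs := []openai.ChatCompletionMessage{
		{Role: openai.ChatMessageRoleUser, Content: "What is the weather in Paris?"},
	}

	plain := openai.ChatCompletionRequest{Messages: msgs}
	withTool := openai.ChatCompletionRequest{
		Messages: msgs,
		Tools: []openai.Tool{{
			Type:     openai.ToolTypeFunction,
			Function: &openai.FunctionDefinition{Name: "get_weather"},
		}},
	}

	// Same messages, different rendered prefixes -> different KV blocks.
	fmt.Println(renderForScoring(plain) == renderForScoring(withTool)) // false
}
```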