Support chat template and echo for chat API #1756
Conversation
@simon-mo what's your opinion on just supporting chat templates through an environment variable? vLLM could document how people pass Jinja templates in through an env var, so that vLLM doesn't handle any parsing and users are responsible for it. I think this is powerful in that users have full control over the chat templates. By default, if none is provided, fall back to the Hugging Face behaviour? I think we can provide a default chat template if needed. Edit: This is probably good for serverless as well.
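For reference, the Hugging Face behaviour being discussed here is the tokenizer's built-in `chat_template`. Below is a minimal sketch of the env-var idea; the variable name `VLLM_CHAT_TEMPLATE` and the model name are illustrative assumptions, not something this PR defines:

```python
import os
from transformers import AutoTokenizer

# Illustrative model; any chat model that ships a chat_template in its
# tokenizer_config behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")

# Hypothetical override via an environment variable, as proposed above.
# When unset, we fall back to the template bundled with the tokenizer.
custom_template = os.environ.get("VLLM_CHAT_TEMPLATE")
if custom_template:
    tokenizer.chat_template = custom_template

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```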
Can we also make sure there is a way to disable chat templates, so that users can provide messages that are preformatted?
As said before, you use the regular completions endpoint for that.
Explain how we would provide the preformatted text using the chat endpoint's messages format, then?
I see where you are coming from - perhaps it should be mentioned in the PR that this only applies to certain endpoints.
I think chat templates should only apply to the chat/completions endpoint.
This pull request is a great contribution to the vllm project. I hope it gets merged soon!
Hi @Tostino, thank you so much for your PR! Your contribution has addressed key issues in model inference, enabling the implementation of function calls using models like OpenHermes. Here is an example I created that demonstrates the effective application of OpenHermes: OpenHermes Functions with VLLM. Your work has truly unlocked the potential of OpenHermes. Adding another practical example to your contribution would greatly help others understand the significant impact of this work.
Huggingface now supports chat templates in the tokenizer. To that end, we have set up a Huggingface space that encourages developers and the community to use the Huggingface tokenizer's chat_template. The UI also allows users to download the chat template as a jinja2 file, which I think would be beneficial for this PR feature as well.
@tjtanaa That is a neat tool, glad there is more work going into this in the same direction. Being able to download the template to a file easily will be beneficial for users. @dongxiaolong Glad you found it useful! Very cool example to see working.
Thank you for taking the time to try it out. (Attachment: airoboros_v2.jinja2)
@tjtanaa Where can we discuss that issue other than on this PR? My template is valid Jinja, and it does have newlines embedded in the single-line version of it. There was an extra step in this PR to correctly load the single-line Jinja and deal with newlines properly; I'm guessing you just need to do the same. Edit: I'll open a discussion on the HF repo.
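For anyone hitting the same thing: the extra step in this PR boils down to decoding escape sequences in the single-line form, roughly like this (a sketch mirroring the `codecs.decode(..., "unicode_escape")` call in the loader):

```python
import codecs

# A chat template passed on the command line as one physical line, where
# literal "\n" escape sequences stand in for real newlines.
single_line = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\\n"
    "{% endfor %}"
)

# Decoding with "unicode_escape" turns the "\n" escapes back into newlines,
# which is what the loader does when the argument is not a readable file.
decoded = codecs.decode(single_line, "unicode_escape")
print(decoded)
```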
@aarnphm / @simon-mo / @WoosukKwon (or anyone else appropriate) Is there anything that needs to be addressed prior to merging this PR at this point? There is nothing else I am aware of, as it's been pretty thoroughly tested.
help="The file path to the chat template, " | ||
"or the template in single-line form " | ||
"for the specified model") | ||
parser.add_argument("--response-role", |
I think this largely depends on the templates themselves, right? By default I don't think this is needed.
The response-role? Some models may not use user/assistant as the role names; template creators are free to choose whatever they like. I defaulted to the same behavior as the OpenAI API, but made it compatible with the flexibility provided by the chat_template feature.
Yeah, but we already have that in message['role'], right? (It usually alternates between user and assistant, so for example in the phi-1.5 case it would be BOB and SANDRA.) Not sure why we need this.
Yes, it usually alternates between user/assistant... but given the following request, how would we know the role to respond with?
curl http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer " \
-d '{
"model": "Tostino/Inkbot-13B-8k-0.2",
"stream": false,
"n": 1,
"messages": [
{"role": "meta-current_date", "content": "2023-10-20"},
{"role": "meta-task_name", "content": "general"},
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of the USA?"}
]
}'
Previously, it was hard-coded as assistant. Now it is configurable with that CLI argument. Ideally, this would be additional metadata about how to use the template, stored somewhere in the tokenizer... but HF didn't think that far ahead.
Users should be responsible for doing few-shot prompting, right?
No, that is totally unnecessary for a lot of models and just eats up context space. Not to mention that it locks us into two-role conversations. What if there is a user_context role that is used for any input from a file the user wants to interact with, and that is appended after their text input? That is a real use case I have been using this implementation for.
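For example (illustrative only; the role name is defined by the custom template, not by vLLM):

```python
# A conversation using a custom "user_context" role as described above; the
# chat template decides how this role is rendered into the prompt.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the attached report for me."},
    {"role": "user_context", "content": "<contents of the file the user uploaded>"},
]
```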
if request.add_generation_prompt:
    return response_role
else:
    return request.messages[-1]["role"]
Like here, should it just be request.messages[-1]['role']?
No, that doesn't work.
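To make the distinction concrete, here is a standalone sketch of the logic under discussion (names are illustrative, not the exact PR code):

```python
def choose_reply_role(messages, add_generation_prompt, response_role="assistant"):
    # New turn: the template appends a generation prompt, so the reply is
    # attributed to the configured --response-role (default "assistant").
    if add_generation_prompt:
        return response_role
    # No generation prompt: the model is finishing the last, partially written
    # message, so the reply must keep that message's role.
    return messages[-1]["role"]


# The last message's role alone is not enough: here it is "user", yet the
# reply to a normal request should still be "assistant".
print(choose_reply_role(
    [{"role": "user", "content": "What is the capital of the USA?"}],
    add_generation_prompt=True,
))  # -> assistant
```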
@Tostino Would it be possible to add some simple unit tests for this?
@Yard1 Sure, are there any existing tests for the server that I can add to?
I don't think there are any for the OpenAI server specifically.
@Yard1 So, to properly test this, it looks like I need to refactor a whole lot more code... (I could be mistaken... my day job is not Python, so I've never used any of the testing libraries before and am learning on the fly.) Are you sure you want me to do that? I am not doing any more work that will be thrown away... I've spent far too much time on this already. And it now looks like I'll have to spend more time rebasing, because there are conflicts again.
1. Addition of the `--chat-template` command-line argument to specify a chat template file or single-line template for the model.
2. Implementation of the `--response-role` command-line argument for defining the role name in chat responses when `add_generation_prompt` is set to true.
3. Introduction of the `--disable-endpoints` argument to allow disabling of specific server endpoints.
4. Update to the chat API request handling to correctly finish a partial response and to echo input portions of messages (`request.add_generation_prompt` and `request.echo`).
5. Addition of new chat templates in JSON and Jinja formats (`template_chatml.json`, `template_alpaca.jinja`, and `template_inkbot.jinja`) showing the multiple ways they can be specified.
6. More robust error handling, and fixes so the responses actually match the OpenAI API format.
7. Update to quickstart.rst to show the new features.
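For anyone who wants to try it, here is a minimal client sketch against a locally running server started from this branch (the model name reuses the example above; `add_generation_prompt` and `echo` are the new request fields):

```python
import requests

# Assumes the OpenAI-compatible server from this PR is running on localhost:8000.
payload = {
    "model": "Tostino/Inkbot-13B-8k-0.2",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of the USA?"},
    ],
    "add_generation_prompt": True,  # append the reply header for a fresh turn
    "echo": False,                  # do not echo the input messages back
}
response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
```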
…nd simplify the template loading code to remove support for json based templates.
Co-authored-by: Aaron Pham <[email protected]>
Thank you for adding the tests. I made a pass for small code-quality nits; the diff is attached in this comment. @Tostino please let me know when you have finished adding tests, so I can push directly to this branch without causing you more merge conflicts to resolve one by one. I think for future work, the design considerations still need to be fleshed out with actual use cases. I'm aware of the previous discussion on this; this PR should not be blocked by any of it. Just putting these here as notes:
Diff Attached
diff --git a/docs/source/getting_started/quickstart.rst b/docs/source/getting_started/quickstart.rst
index 1516a7b..bd3940f 100644
--- a/docs/source/getting_started/quickstart.rst
+++ b/docs/source/getting_started/quickstart.rst
@@ -124,11 +124,12 @@ Use model from www.modelscope.cn
$ --model="qwen/Qwen-7B-Chat" --revision="v1.1.8" --trust-remote-code
By default, the server uses a predefined chat template stored in the tokenizer. You can override this template by using the ``--chat-template`` argument:
+
.. code-block:: console
$ python -m vllm.entrypoints.openai.api_server \
- --model facebook/opt-125m \
- --chat-template ./examples/template_chatml.json
+ $ --model facebook/opt-125m \
+ $ --chat-template ./examples/template_chatml.json
This server can be queried in the same format as OpenAI API. For example, list the models:
@@ -137,7 +138,7 @@ This server can be queried in the same format as OpenAI API. For example, list t
$ curl http://localhost:8000/v1/models
Using OpenAI Completions API with vLLM
---------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Query the model with input prompts:
@@ -167,7 +168,7 @@ Since this server is compatible with OpenAI API, you can use it as a drop-in rep
For a more detailed client example, refer to `examples/openai_completion_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py>`_.
Using OpenAI Chat API with vLLM
--------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The vLLM server is designed to support the OpenAI Chat API, allowing you to engage in dynamic conversations with the model. The chat interface is a more interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.
diff --git a/vllm/entrypoints/openai/api_server.py b/vllm/entrypoints/openai/api_server.py
index b87fcd9..54ff3c8 100644
--- a/vllm/entrypoints/openai/api_server.py
+++ b/vllm/entrypoints/openai/api_server.py
@@ -47,15 +47,22 @@ def create_error_response(status_code: HTTPStatus,
status_code=status_code.value)
-def load_chat_template():
- try:
- with open(args.chat_template, "r") as f:
- content = f.read()
- return content
- except OSError:
- # If opening a file fails, set chat template to be args to
- # ensure we decode so our escape are interpreted correctly
- return codecs.decode(args.chat_template, "unicode_escape")
+def load_chat_template(args, tokenizer):
+ if args.chat_template is not None:
+ try:
+ with open(args.chat_template, "r") as f:
+ chat_template = f.read()
+ except OSError:
+ # If opening a file fails, set chat template to be args to
+ # ensure we decode so our escape are interpreted correctly
+ chat_template = codecs.decode(args.chat_template, "unicode_escape")
+
+ tokenizer.chat_template = chat_template
+ logger.info(f"Using supplied chat template:\n{tokenizer.chat_template}")
+ elif tokenizer.chat_template is not None:
+ logger.info(f"Using default chat template:\n{tokenizer.chat_template}")
+ else:
+ logger.warning("No chat template provided. Chat API will not work.")
@app.exception_handler(RequestValidationError)
@@ -73,16 +80,6 @@ async def check_model(request) -> Optional[JSONResponse]:
return ret
-async def get_gen_prompt(request) -> str:
- try:
- return tokenizer.apply_chat_template(
- conversation=request.messages,
- tokenize=False,
- add_generation_prompt=request.add_generation_prompt)
- except Exception as e:
- raise RuntimeError(f"Error generating prompt: {str(e)}") from e
-
-
async def check_length(
request: Union[ChatCompletionRequest, CompletionRequest],
prompt: Optional[str] = None,
@@ -174,8 +171,6 @@ async def create_chat_completion(request: ChatCompletionRequest,
- function_call (Users should implement this by themselves)
- logit_bias (to be supported by vLLM engine)
"""
- logger.info(f"Received chat completion request: {request}")
-
error_check_ret = await check_model(request)
if error_check_ret is not None:
return error_check_ret
@@ -186,9 +181,12 @@ async def create_chat_completion(request: ChatCompletionRequest,
"logit_bias is not currently supported")
try:
- prompt = await get_gen_prompt(request)
- except RuntimeError as e:
- logger.error(f"Error in generating prompt from request: {str(e)}")
+ prompt = tokenizer.apply_chat_template(
+ conversation=request.messages,
+ tokenize=False,
+ add_generation_prompt=request.add_generation_prompt)
+ except Exception as e:
+ logger.error(f"Error in applying chat template from request: {str(e)}")
return create_error_response(HTTPStatus.BAD_REQUEST, str(e))
token_ids, error_check_ret = await check_length(request, prompt=prompt)
@@ -198,7 +196,7 @@ async def create_chat_completion(request: ChatCompletionRequest,
model_name = request.model
request_id = f"cmpl-{random_uuid()}"
created_time = int(time.monotonic())
- obj_str = "chat.completion.chunk"
+ chunk_object_type = "chat.completion.chunk"
try:
spaces_between_special_tokens = request.spaces_between_special_tokens
sampling_params = SamplingParams(
@@ -236,7 +234,7 @@ async def create_chat_completion(request: ChatCompletionRequest,
choice_data = ChatCompletionResponseStreamChoice(
index=i, delta=DeltaMessage(role=role), finish_reason=None)
chunk = ChatCompletionStreamResponse(id=request_id,
- object=obj_str,
+ object=chunk_object_type,
created=created_time,
choices=[choice_data],
model=model_name)
@@ -258,7 +256,7 @@ async def create_chat_completion(request: ChatCompletionRequest,
delta=DeltaMessage(content=last_msg_content),
finish_reason=None)
chunk = ChatCompletionStreamResponse(id=request_id,
- object=obj_str,
+ object=chunk_object_type,
created=created_time,
choices=[choice_data],
model=model_name)
@@ -273,8 +271,12 @@ async def create_chat_completion(request: ChatCompletionRequest,
res: RequestOutput
for output in res.outputs:
i = output.index
- # Send token-by-token response for each request.n
- if output.finish_reason is None and not finish_reason_sent[i]:
+
+ if finish_reason_sent[i]:
+ continue
+
+ if output.finish_reason is None:
+ # Send token-by-token response for each request.n
delta_text = output.text[len(previous_texts[i]):]
previous_texts[i] = output.text
completion_tokens = len(output.token_ids)
@@ -284,15 +286,14 @@ async def create_chat_completion(request: ChatCompletionRequest,
delta=DeltaMessage(content=delta_text),
finish_reason=None)
chunk = ChatCompletionStreamResponse(id=request_id,
- object=obj_str,
+ object=chunk_object_type,
created=created_time,
choices=[choice_data],
model=model_name)
data = chunk.json(exclude_unset=True, ensure_ascii=False)
yield f"data: {data}\n\n"
- # Send the finish response for each request.n only once
- if output.finish_reason is not None and not finish_reason_sent[
- i]:
+ else:
+ # Send the finish response for each request.n only once
prompt_tokens = len(res.prompt_token_ids)
final_usage = UsageInfo(
prompt_tokens=prompt_tokens,
@@ -302,7 +303,7 @@ async def create_chat_completion(request: ChatCompletionRequest,
choice_data = ChatCompletionResponseStreamChoice(
index=i, delta=[], finish_reason=output.finish_reason)
chunk = ChatCompletionStreamResponse(id=request_id,
- object=obj_str,
+ object=chunk_object_type,
created=created_time,
choices=[choice_data],
model=model_name)
@@ -326,6 +327,7 @@ async def create_chat_completion(request: ChatCompletionRequest,
"Client disconnected")
final_res = res
assert final_res is not None
+
choices = []
role = get_role()
for output in final_res.outputs:
@@ -710,14 +712,8 @@ if __name__ == "__main__":
engine_model_config.tokenizer,
tokenizer_mode=engine_model_config.tokenizer_mode,
trust_remote_code=engine_model_config.trust_remote_code)
+ load_chat_template(args, tokenizer)
- chat_template = None
- if args.chat_template is not None:
- chat_template = load_chat_template()
- if chat_template is not None:
- tokenizer.chat_template = chat_template
- if tokenizer.chat_template is not None:
- logger.info(f"Using chat template:\n{tokenizer.chat_template}")
uvicorn.run(app,
host=args.host,
Fixed issues with chatml template not actually supporting the add_generation_prompt feature. This was just a copy/paste from a random model.
Thank you very much @simon-mo. I just added a handful of tests, and fixed the chatml template after tests identified some issues. Should be ready for you now.
Agree here that most use cases will have a single value... but think of a chat UI that has two buttons (or hotkeys): one that sends the current message, and another that has the model auto-complete the message the user is typing as they type. You would use different values of add_generation_prompt for each of those actions.
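Concretely, the two buttons would differ only in these request fields (a sketch; everything else is the same `chat/completions` payload as usual):

```python
# "Send": start a fresh assistant turn.
send_request = {
    "messages": [{"role": "user", "content": "Write a haiku about autumn."}],
    "add_generation_prompt": True,
}

# "Auto-complete": finish the message the user is still typing; echoing the
# partial text back lets the client splice the completion into the input box.
autocomplete_request = {
    "messages": [{"role": "user", "content": "Write a haiku about aut"}],
    "add_generation_prompt": False,
    "echo": True,
}
```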
Oh, one more piece of future work could be loading the chat template from an HTTP URL. Let's see whether that becomes a common request and decide whether to add it.
I was waiting for this PR for so long, but I found that making requests to the chat endpoint seems noticeably slower than the regular completions endpoint. Any idea why that could be?
@flexchar Not off the top of my head; I've not had any noticeable slowdown between the two endpoints. I'm getting ~300-500 tokens/sec generated throughput using 1x 3090 and a llama2 13b AWQ model with the chat endpoint. Do you have an example you can share to trigger it? Edit: I tried it myself... could not reproduce. They both generate in roughly the same amount of time on my machine.
Thank you for trying. I don't have an easily reproducible example, but the next time I work on that part I will certainly make one. I appreciate you testing it though :)
Why doesn't vllm/entrypoints/api_server.py have this parameter yet?
How do I use the chat template with "offline inference" (the local LLM class)? This PR only enables this for the REST API as far as I can see. @Tostino
Sorry, on mobile right now so going from memory. I believe that there wasn't a "local" equivalent of the chat/completions API when I implemented this. So it was implemented for the REST endpoint, because that's all it could work with unless I did a bunch of extra work to also add an equivalent local version of the chat completions API.
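In the meantime, a possible workaround sketch for offline use: render the template with the tokenizer yourself and pass the resulting prompt to `LLM.generate` (the model name is illustrative):

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model = "Tostino/Inkbot-13B-8k-0.2"  # illustrative; any chat model works
tokenizer = AutoTokenizer.from_pretrained(model)

# Render the chat template outside of vLLM, the same way the server would.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is the capital of the USA?"}],
    tokenize=False,
    add_generation_prompt=True,
)

llm = LLM(model=model)
outputs = llm.generate([prompt], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```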
Well, I see, thanks!
This pull request introduces the chat template feature to vLLM, utilizing the template stored in the tokenizer, enhancing its compatibility with the OpenAI Chat API.
https://huggingface.co/blog/chat-templates
This only affects the OpenAI API `chat/completion` endpoint; the regular `completion` endpoint does not utilize this feature. There has already been a ton of discussion under the previous PR (#1493), but I accidentally messed things up by replacing the branch, so we are trying this again...
- `--chat-template` command-line argument to specify a chat template file or single-line template for the model.
- `--response-role` command-line argument for defining the role name in chat responses when `add_generation_prompt` is set to true.
- New chat templates (`template_chatml.jinja`, `template_alpaca.jinja`, and `template_inkbot.jinja`) showing the multiple ways they can be specified.