@Tostino (Contributor) commented Oct 9, 2023

Here is a PR adding support for conversation templates defined as JSON files, which can be specified for a model when starting the API. Just create your template and pass its path with the --conversation-template my_template.json argument.

There was nowhere for me to add support for this in the vLLM API, so I only added it to the OpenAI API.

Added an example, and updated the quickstart section of the README (the only place that documented the other arguments).
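For illustration only, a minimal sketch of what such a template file might look like. The field names below (system_prompt, roles, separator) are hypothetical placeholders, not the schema this PR actually defines; see the PR's example file for the real format.

```python
# Hypothetical sketch: the field names here are illustrative only,
# not the schema defined by this PR's example template.
import json

template = {
    "system_prompt": "You are a helpful assistant.",          # assumed field
    "roles": {"user": "USER:", "assistant": "ASSISTANT:"},    # assumed field
    "separator": "\n",                                        # assumed field
}

# Write the template to disk so it can be passed to the server via
# --conversation-template my_template.json
with open("my_template.json", "w") as f:
    json.dump(template, f, indent=2)
```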

@Tostino (Contributor, Author) commented Oct 16, 2023

Going to cancel this PR and work on another that properly implements the HF chat templates within the tokenizer.
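For context, a short sketch of the Hugging Face chat-template mechanism the follow-up PR targets: the template ships with the tokenizer itself, so no external JSON file is needed. The model name below is only an example.

```python
# The chat template lives in the tokenizer; apply_chat_template renders
# a message list into the model's expected prompt format.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

messages = [
    {"role": "user", "content": "Hello, how are you?"},
]

# tokenize=False returns the rendered prompt string;
# add_generation_prompt=True appends the assistant turn prefix.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```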

@Tostino closed this Oct 16, 2023
amy-why-3459 pushed a commit to amy-why-3459/vllm that referenced this pull request Sep 15, 2025
### What this PR does / why we need it?

1. Add the MTP dummy_run, and adapt the main model's dummy_run when MTP is enabled
2. Adapt the main model's torchair graph mode when MTP is enabled
3. The MTP model's torchair graph mode will be supported in the future

### Does this PR introduce _any_ user-facing change?


### How was this patch tested?

This patch is tested by
`vllm-ascend/tests/long_term/spec_decode_v1/test_v1_mtp_correctness.py`


### Usage
Online example:
```shell
export VLLM_USE_V1=1
export VLLM_ENABLE_MC2=1
export VLLM_VERSION=0.9.1
export ASCEND_LAUNCH_BLOCKING=0

python -m vllm.entrypoints.openai.api_server \
 --model="/model_weight_path" \
 --trust-remote-code \
 --max-model-len 40000 \
 --tensor-parallel-size 4 \
 --data_parallel_size 4 \
 --enable_expert_parallel \
 --served-model-name deepseekr1 \
 --quantization ascend \
 --host 0.0.0.0 \
 --port 1234 \
 --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
 --additional-config '{"ascend_scheduler_config":{"enabled":true,"enable_chunked_prefill":false},"torchair_graph_config":{"enabled":true}}' \
 --gpu_memory_utilization 0.95
```
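Once the server is up, it exposes the standard OpenAI-compatible endpoints. A minimal smoke test, assuming the server is reachable on localhost:1234 with the served model name deepseekr1 from the command above:

```python
# Query the OpenAI-compatible /v1/completions endpoint of the server
# started above (host/port/model name taken from the launch command).
import requests

resp = requests.post(
    "http://localhost:1234/v1/completions",
    json={
        "model": "deepseekr1",
        "prompt": "The capital of France is",
        "max_tokens": 32,
    },
)
print(resp.json()["choices"][0]["text"])
```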

Offline example:
```python
from vllm import LLM

lm = LLM(
    model="/model_weight_path",
    tensor_parallel_size=16,
    max_num_seqs=128,
    gpu_memory_utilization=0.95,
    distributed_executor_backend="mp",
    enable_expert_parallel=True,
    speculative_config={
        "method": "deepseek_mtp",
        "num_speculative_tokens": 1,
    },
    trust_remote_code=True,
    enforce_eager=False,
    additional_config={
        "torchair_graph_config": {
            "enabled": True,
            "enable_multistream_shared_expert": False,
        },
        "ascend_scheduler_config": {
            "enabled": True,
        },
    },
)
```
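A short usage note to go with the snippet above: batched generation then goes through the standard vLLM generate call, for example:

```python
# Run a batch through the LLM constructed above; SamplingParams is
# vLLM's standard sampling configuration object.
from vllm import SamplingParams

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = lm.generate(["Explain speculative decoding in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```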

Signed-off-by: xuyexiong <[email protected]>