Description
Context
The "/v1/chat/completions" endpoint uses the apply_chat_template
method of the HF tokenizers. It seems to us that these templates take care of adding special tokens (cf. this line from Llama's default template). However, tokenization in vLLM also seems to add special token(s) if this is the tokenizer's default behavior - in particular, the Llama tokenizer adds a BOS token at the start of its tokenization.
There are therefore configurations in which the final tokenization will contain more special tokens than necessary.
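The duplication can be reproduced outside of vLLM with the transformers API alone. A minimal sketch, assuming the tokenizer of the model used in the repro below ships the default Llama-2 chat template (the comments state our expectation, not verified output):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-AWQ")
messages = [{"role": "user", "content": "Tell me a joke."}]

# The chat template already inserts the BOS token into the rendered string.
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)  # expected to start with "<s>[INST] ..."

# Encoding that string with the tokenizer's default behavior
# (add_special_tokens=True) prepends a second BOS token (id 1).
token_ids = tokenizer(prompt).input_ids
print(token_ids[:3])  # expected to start with [1, 1, ...]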
Repro
In a terminal, launch a vLLM server. For example:
python -m vllm.entrypoints.openai.api_server --model TheBloke/Llama-2-7B-Chat-AWQ
In another terminal, send a request to this server:
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "None"
openai_api_base = f"http://{FILL_ME}/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
    model="TheBloke/Llama-2-7B-Chat-AWQ",
    messages=[
        {"role": "user", "content": "Tell me a joke."},
        {
            "role": "assistant",
            "content": " Ah, a moment of levity you seek! Very well. Pray, allow me to regale you with this humorous anecdote:\n\nWhy don't historians play cricket?\n\nBecause they prefer to leave their past in the archives!\n\nAnd now, if you'll excuse me, I must return to my scholarly pursuits. Although, I must admit, it is rather refreshing to engage in such frivolous banter from time to time.",
        },
        {"role": "user", "content": "Another one."},
    ],
)
print("Chat response:", chat_response)
Output:
async_llm_engine.py:379] Received request cmpl-cca85113d5af4178b3c93fb2c2b72578:
prompt: "<s>[INST] Tell me a joke. [/INST] Ah, a moment of levity you seek! Very well. Pray, allow me to regale you with this humorous anecdote:\n\nWhy don't historians play cricket?\n\nBecause they prefer to leave their past in the archives!\n\nAnd now, if you'll excuse me, I must return to my scholarly pursuits. Although, I must admit, it is rather refreshing to engage in such frivolous banter from time to time. </s><s>[INST] Another one. [/INST]",
sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], ignore_eos=False, max_tokens=3959, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True),
prompt token ids: [1, 1, 518, 25580, 29962, 24948, 592, 263, 2958, 446, 29889, 518, 29914, 25580, 29962, 9070, 29892, 263, 3256, 310, 14453, 537, 366, 16508, 29991, 18064, 1532, 29889, 349, 764, 29892, 2758, 592, 304, 1072, 744, 366, 411, 445, 3165, 20657, 385, 687, 29881, 866, 29901, 13, 13, 11008, 1016, 29915, 29873, 3603, 5834, 1708, 2181, 8522, 29973, 13, 13, 29933, 5658, 896, 5821, 304, 5967, 1009, 4940, 297, 278, 3190, 3145, 29991, 13, 13, 2855, 1286, 29892, 565, 366, 29915, 645, 5566, 1509, 592, 29892, 306, 1818, 736, 304, 590, 21344, 368, 12359, 19544, 29889, 8512, 29892, 306, 1818, 20000, 29892, 372, 338, 3265, 2143, 690, 2790, 304, 3033, 482, 297, 1316, 285, 1150, 324, 681, 9892, 357, 515, 931, 304, 931, 29889, 29871, 2, 1, 518, 25580, 29962, 7280, 697, 29889, 518, 29914, 25580, 29962].
We can see that the prompt token ids start with two 1s (two BOS tokens) instead of one.
This issue also impacts the new mistralai/Mixtral-8x7B-Instruct-v0.1 model added in PR #2011.
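As far as we can tell, apply_chat_template avoids this duplication when it is asked to tokenize the rendered prompt itself, because it then encodes with add_special_tokens=False. A minimal sketch of the same idea (illustrative only, not a proposed patch):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-AWQ")
messages = [{"role": "user", "content": "Tell me a joke."}]

prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# Encoding the already-templated string without the tokenizer's automatic
# special tokens should yield a single BOS, coming only from the template.
token_ids = tokenizer(prompt, add_special_tokens=False).input_ids
print(token_ids[:3])  # expected: a single leading 1 (BOS)

# Equivalently, letting apply_chat_template tokenize directly should also
# avoid the duplicate BOS.
token_ids = tokenizer.apply_chat_template(messages, tokenize=True)
print(token_ids[:3])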