Description
Context
The "/v1/chat/completions" endpoint uses the apply_chat_template
method of the HF tokenizers. It seems to us that these templates take care of adding special tokens (cf. this line from Llama's default template). However, tokenization in vLLM also seems to add special token(s) if this is the tokenizer's default behavior - in particular, the Llama tokenizer adds a BOS token at the start of its tokenization.
There are therefore configurations in which the final tokenization will contain more special tokens than necessary.
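The duplication can be reproduced outside of vLLM with the transformers API alone. A minimal sketch, assuming the tokenizer of the model used in the repro below ships the default Llama-2 chat template (the comments state our expectation, not verified output):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-AWQ")
messages = [{"role": "user", "content": "Tell me a joke."}]

# The chat template already inserts the BOS token into the rendered string.
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)  # expected to start with "<s>[INST] ..."

# Encoding that string with the tokenizer's default behavior
# (add_special_tokens=True) prepends a second BOS token (id 1).
token_ids = tokenizer(prompt).input_ids
print(token_ids[:3])  # expected to start with [1, 1, ...]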
Repro
In a terminal, launch a vLLM server. For example:
python -m vllm.entrypoints.openai.api_server --model TheBloke/Llama-2-7B-Chat-AWQ
In another terminal, send a request to this server:
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "None"
openai_api_base = f"http://{FILL_ME}/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
    model="TheBloke/Llama-2-7B-Chat-AWQ",
    messages=[
        {"role": "user", "content": "Tell me a joke."},
        {
            "role": "assistant",
            "content": " Ah, a moment of levity you seek! Very well. Pray, allow me to regale you with this humorous anecdote:\n\nWhy don't historians play cricket?\n\nBecause they prefer to leave their past in the archives!\n\nAnd now, if you'll excuse me, I must return to my scholarly pursuits. Although, I must admit, it is rather refreshing to engage in such frivolous banter from time to time.",
        },
        {"role": "user", "content": "Another one."},
    ],
)
print("Chat response:", chat_response)
Output:
async_llm_engine.py:379] Received request cmpl-cca85113d5af4178b3c93fb2c2b72578:
prompt: "<s>[INST] Tell me a joke. [/INST] Ah, a moment of levity you seek! Very well. Pray, allow me to regale you with this humorous anecdote:\n\nWhy don't historians play cricket?\n\nBecause they prefer to leave their past in the archives!\n\nAnd now, if you'll excuse me, I must return to my scholarly pursuits. Although, I must admit, it is rather refreshing to engage in such frivolous banter from time to time. </s><s>[INST] Another one. [/INST]",
sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], ignore_eos=False, max_tokens=3959, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True),
prompt token ids: [1, 1, 518, 25580, 29962, 24948, 592, 263, 2958, 446, 29889, 518, 29914, 25580, 29962, 9070, 29892, 263, 3256, 310, 14453, 537, 366, 16508, 29991, 18064, 1532, 29889, 349, 764, 29892, 2758, 592, 304, 1072, 744, 366, 411, 445, 3165, 20657, 385, 687, 29881, 866, 29901, 13, 13, 11008, 1016, 29915, 29873, 3603, 5834, 1708, 2181, 8522, 29973, 13, 13, 29933, 5658, 896, 5821, 304, 5967, 1009, 4940, 297, 278, 3190, 3145, 29991, 13, 13, 2855, 1286, 29892, 565, 366, 29915, 645, 5566, 1509, 592, 29892, 306, 1818, 736, 304, 590, 21344, 368, 12359, 19544, 29889, 8512, 29892, 306, 1818, 20000, 29892, 372, 338, 3265, 2143, 690, 2790, 304, 3033, 482, 297, 1316, 285, 1150, 324, 681, 9892, 357, 515, 931, 304, 931, 29889, 29871, 2, 1, 518, 25580, 29962, 7280, 697, 29889, 518, 29914, 25580, 29962].
We can see that the prompt token ids start with two 1s (two BOS tokens) instead of one.
This issue also impacts the new mistralai/Mixtral-8x7B-Instruct-v0.1 model added in PR #2011.
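As far as we can tell, apply_chat_template avoids this duplication when it is asked to tokenize the rendered prompt itself, because it then encodes with add_special_tokens=False. A minimal sketch of the same idea (illustrative only, not a proposed patch):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-AWQ")
messages = [{"role": "user", "content": "Tell me a joke."}]

prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# Encoding the already-templated string without the tokenizer's automatic
# special tokens should yield a single BOS, coming only from the template.
token_ids = tokenizer(prompt, add_special_tokens=False).input_ids
print(token_ids[:3])  # expected: a single leading 1 (BOS)

# Equivalently, letting apply_chat_template tokenize directly should also
# avoid the duplicate BOS.
token_ids = tokenizer.apply_chat_template(messages, tokenize=True)
print(token_ids[:3])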