
"/v1/chat/completions" tokenization issue #2012

@SaulLu

Description

Context

The "/v1/chat/completions" endpoint uses the apply_chat_template method of the HF tokenizers. It seems to us that these templates take care of adding special tokens (cf. this line from Llama's default template). However, tokenization in vLLM also seems to add special token(s) if this is the tokenizer's default behavior - in particular, the Llama tokenizer adds a BOS token at the start of its tokenization.

There are therefore configurations in which the final tokenization will contain more special tokens than necessary.
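For illustration, here is a minimal sketch of where the duplication comes from, using the transformers tokenizer for the same checkpoint directly (outside vLLM, so the exact ids printed are an assumption on our side): the rendered chat template already starts with "<s>", and re-encoding that string with the tokenizer's default add_special_tokens=True prepends a second BOS.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-AWQ")
messages = [{"role": "user", "content": "Tell me a joke."}]

# The rendered template already contains the BOS special token "<s>".
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print(repr(prompt[:20]))  # expected to start with '<s>[INST] ...'

# Default encoding adds the tokenizer's own BOS on top of the template's,
# so the ids start with two 1s.
print(tokenizer.encode(prompt)[:3])

# Encoding with add_special_tokens=False keeps a single BOS.
print(tokenizer.encode(prompt, add_special_tokens=False)[:3])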

Repro

In a terminal, launch a vLLM server. For example:

python -m vllm.entrypoints.openai.api_server --model TheBloke/Llama-2-7B-Chat-AWQ

In another terminal, request this server:

from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "None"
openai_api_base = f"http://{FILL_ME}/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="TheBloke/Llama-2-7B-Chat-AWQ",
    messages=[
        {"role": "user", "content": "Tell me a joke."},
        {
            "role": "assistant",
            "content": " Ah, a moment of levity you seek! Very well. Pray, allow me to regale you with this humorous anecdote:\n\nWhy don't historians play cricket?\n\nBecause they prefer to leave their past in the archives!\n\nAnd now, if you'll excuse me, I must return to my scholarly pursuits. Although, I must admit, it is rather refreshing to engage in such frivolous banter from time to time.",
        },
        {"role": "user", "content": "Another one."},
    ],
)
print("Chat response:", chat_response)

Output:

async_llm_engine.py:379] 
Received request cmpl-cca85113d5af4178b3c93fb2c2b72578: 
prompt: "<s>[INST] Tell me a joke. [/INST] Ah, a moment of levity you seek! Very well. Pray, allow me to regale you with this humorous anecdote:\n\nWhy don't historians play cricket?\n\nBecause they prefer to leave their past in the archives!\n\nAnd now, if you'll excuse me, I must return to my scholarly pursuits. Although, I must admit, it is rather refreshing to engage in such frivolous banter from time to time. </s><s>[INST] Another one. [/INST]", 
sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], ignore_eos=False, max_tokens=3959, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), 
prompt token ids: [1, 1, 518, 25580, 29962, 24948, 592, 263, 2958, 446, 29889, 518, 29914, 25580, 29962, 9070, 29892, 263, 3256, 310, 14453, 537, 366, 16508, 29991, 18064, 1532, 29889, 349, 764, 29892, 2758, 592, 304, 1072, 744, 366, 411, 445, 3165, 20657, 385, 687, 29881, 866, 29901, 13, 13, 11008, 1016, 29915, 29873, 3603, 5834, 1708, 2181, 8522, 29973, 13, 13, 29933, 5658, 896, 5821, 304, 5967, 1009, 4940, 297, 278, 3190, 3145, 29991, 13, 13, 2855, 1286, 29892, 565, 366, 29915, 645, 5566, 1509, 592, 29892, 306, 1818, 736, 304, 590, 21344, 368, 12359, 19544, 29889, 8512, 29892, 306, 1818, 20000, 29892, 372, 338, 3265, 2143, 690, 2790, 304, 3033, 482, 297, 1316, 285, 1150, 324, 681, 9892, 357, 515, 931, 304, 931, 29889, 29871, 2, 1, 518, 25580, 29962, 7280, 697, 29889, 518, 29914, 25580, 29962].

We can see that the prompt token ids start with two BOS ids (1, 1) instead of a single one.
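One possible way to avoid the duplication (sketched below with the transformers tokenizer; this is only an assumption about a fix, not vLLM's actual code path) is to let apply_chat_template produce the token ids itself, or to disable add_special_tokens when re-encoding a template-rendered string:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-AWQ")
messages = [{"role": "user", "content": "Tell me a joke."}]

# apply_chat_template can tokenize the rendered string itself; in the
# transformers versions we checked it does so with add_special_tokens=False,
# so only the BOS emitted by the template remains.
ids = tokenizer.apply_chat_template(messages, tokenize=True)
print(ids[:3])  # expected to start with a single 1 (BOS)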

This issue also impacts the new mistralai/Mixtral-8x7B-Instruct-v0.1 model added in PR #2011.
