
[Bug]: Qwen/Qwen2.5-1.5B-Instruct generates out of vocabulary tokens #13175

@AlexPiche


Your current environment

>>> import vllm
INFO 02-12 20:27:04 __init__.py:190] Automatically detected platform cuda.
>>> vllm.__version__
'0.7.2'

🐛 Describe the bug

Hi,

It looks like Qwen models can generate token ids that are out of vocabulary. We can see this by feeding the generated tokens back to the model, which sometimes raises the following exception: Token id 151779 is out of vocabulary. Here is a minimal script to reproduce the error.

import vllm
from transformers import AutoTokenizer
import numpy as np

PROMPT = """
<|im_start|>system
Please reason step by step, and put your final answer within \\boxed{}.<|im_end|>
<|im_start|>user
The equation $a^7xy-a^6y-a^5x=a^4(b^4-1)$ is equivalent to the equation $(a^mx-a^n)(a^py-a^2)=a^4b^4$ for some integers $m$, $n$, and $p$.  Find $mnp$.<|im_end|>
<|im_start|>assistant
"""

if __name__ == '__main__':
    model_path = "Qwen/Qwen2.5-1.5B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    PROMPT_TOKEN_IDS = tokenizer.encode(PROMPT)

    sampling_params = vllm.SamplingParams(temperature=1.2, max_tokens=100)
    llm = vllm.LLM(model_path)

    # can we now generate tokens out of vocabulary?
    out_of_vocab = []
    out_of_vocab_tokens = []
    for i in range(100):
        out = llm.generate(prompt_token_ids=PROMPT_TOKEN_IDS, sampling_params=sampling_params)
        PROMPT_COMPLETION_TOKEN_IDS = PROMPT_TOKEN_IDS + list(out[0].outputs[0].token_ids)
        try:
            out2 = llm.generate(prompt_token_ids=PROMPT_COMPLETION_TOKEN_IDS, sampling_params=sampling_params)
            out_of_vocab.append(0)
        except Exception as e:
            print(e)
            # Extract token id from error message
            token_id = int(str(e).split("Token id ")[1].split(" ")[0])
            out_of_vocab_tokens.append(token_id)
            out_of_vocab.append(1)
    
    print(f"Proportion of out of vocabulary generations: {np.mean(out_of_vocab)}")
    print(out_of_vocab_tokens)
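Rather than parsing the exception message, the offending ids can be detected directly from the first generation, assuming len(tokenizer) gives the true vocabulary size. A minimal sketch (find_oov_ids is a hypothetical helper, not part of vLLM):

def find_oov_ids(token_ids, vocab_len):
    # Any id at or above the tokenizer length has no token string and
    # will trigger "Token id ... is out of vocabulary" when re-fed.
    return [t for t in token_ids if t >= vocab_len]

# Hypothetical usage with the script above:
#   oov = find_oov_ids(out[0].outputs[0].token_ids, len(tokenizer))
print(find_oov_ids([151925, 12, 34], 151665))  # [151925]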
        

Selected output

Token id 151779 is out of vocabulary
Token id 151734 is out of vocabulary
...
Proportion of out of vocabulary generations: 0.03
[151925, 151779, 151734]
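A plausible cause (an assumption, not confirmed in this report) is that the model's config vocab_size (151936 for Qwen2.5) is larger than the tokenizer's actual vocabulary, so the sampler can pick ids from padded rows of the output embedding, especially at temperature 1.2. Notably, every id in the output above falls in that padded gap. A quick sanity check, with the tokenizer length as an assumed value:

# Assumed numbers: config vocab_size for Qwen/Qwen2.5-1.5B-Instruct is
# 151936; the tokenizer length is smaller (roughly 151665 -- verify with
# len(AutoTokenizer.from_pretrained(model_path))).
CONFIG_VOCAB_SIZE = 151936
TOKENIZER_LEN = 151665

# Ids in [TOKENIZER_LEN, CONFIG_VOCAB_SIZE) exist as logits but have no
# corresponding token string, so they decode as out of vocabulary.
reported = [151925, 151779, 151734]
print(all(TOKENIZER_LEN <= t < CONFIG_VOCAB_SIZE for t in reported))  # True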

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Labels: bug (Something isn't working), unstale (Received activity after being labelled stale)
