
added_tokens with byte-map characters in ByteLevel cannot be decoded correctly #1392

Closed

Description

@DOGEwbx

I just found that added tokens containing characters that appear in the ByteLevel byte map cannot be decoded correctly.
Here is a script that reproduces the problem with version 0.14.1:

from tokenizers import Tokenizer, normalizers, decoders
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel, Sequence

tokenizer = Tokenizer(BPE())
tokenizer.normalizer = normalizers.Sequence([])
tokenizer.pre_tokenizer = Sequence(
    [
        ByteLevel(add_prefix_space=False, use_regex=False),
    ]
)
# "Ö" is itself a character in the ByteLevel byte map (it stands for the raw byte 0xd6)
tokenizer.add_tokens(["ilÖveyou"])
tokenizer.decoder = decoders.ByteLevel()

encode_result = tokenizer.encode("ilÖveyou")
print(encode_result.ids)
print(tokenizer.decode(encode_result.ids))

The output will be:

[0]
il�veyou
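
The "�" is not random: ByteLevel decoding maps each character of a token string back through GPT-2's byte-to-unicode table, and "Ö" is an entry in that table. Here is a minimal sketch of that table (a reimplementation for illustration, not the library's internals) showing how the added token collapses to an invalid UTF-8 byte:

# Minimal sketch of the GPT-2-style byte<->unicode map used by ByteLevel.
def bytes_to_unicode():
    # These byte ranges map to themselves...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # ...all remaining bytes are remapped to code points >= U+0100
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

unicode_to_bytes = {c: b for b, c in bytes_to_unicode().items()}
raw = bytes(unicode_to_bytes[c] for c in "ilÖveyou")
print(raw)                                    # b'il\xd6veyou'
print(raw.decode("utf-8", errors="replace"))  # il�veyou

The added token was stored verbatim and never byte-level encoded, so running it back through this table turns the literal "Ö" into the lone byte 0xd6, which is not valid UTF-8.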

I believe the problem comes from
https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/tokenizer/mod.rs#L832-L836
I don't think added tokens should be sent to the ByteLevel decoder, since they are extracted before pre-tokenization and therefore were never byte-level encoded in the first place. A workaround sketch follows below.
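
Until this is fixed upstream, one possible workaround is to split the decode: emit added tokens verbatim and only run the remaining ids through the decoder. A sketch, assuming tokenizers >= 0.14 where Tokenizer.get_added_tokens_decoder() returns an {id: AddedToken} map (the safe_decode helper is hypothetical, not a library API):

def safe_decode(tokenizer, ids):
    # Added tokens are stored verbatim, so emit their content untouched
    # and only send runs of "normal" ids through the ByteLevel decoder.
    added = tokenizer.get_added_tokens_decoder()  # {id: AddedToken}
    pieces, pending = [], []
    for i in ids:
        if i in added:
            if pending:
                pieces.append(tokenizer.decode(pending))
                pending = []
            pieces.append(added[i].content)
        else:
            pending.append(i)
    if pending:
        pieces.append(tokenizer.decode(pending))
    return "".join(pieces)

print(safe_decode(tokenizer, encode_result.ids))  # ilÖveyou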
