Description
I just found that if an added token contains characters that exist in the byte map of the ByteLevel pre-tokenizer, it cannot be decoded correctly.
Here is a script to reproduce the problem with version 0.14.1:
```python
from tokenizers import Tokenizer, decoders, normalizers
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel, Sequence

tokenizer = Tokenizer(BPE())
tokenizer.normalizer = normalizers.Sequence([])
tokenizer.pre_tokenizer = Sequence(
    [
        ByteLevel(add_prefix_space=False, use_regex=False),
    ]
)
tokenizer.add_tokens(["ilÖveyou"])
# Ö is the character that represents byte 0xd6 in the ByteLevel byte map
tokenizer.decoder = decoders.ByteLevel()

encode_result = tokenizer.encode("ilÖveyou")
print(encode_result.ids)
print(tokenizer.decode(encode_result.ids))
```
The output will be:

```
[0]
il�veyou
```
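
The replacement character appears because the ByteLevel decoder maps every character of every token back through its byte table, so the literal Ö in the added token is turned into the lone byte 0xd6, which is not valid UTF-8 on its own. A minimal sketch of the well-known GPT-2 byte-to-unicode table (reconstructed here rather than imported from tokenizers) shows the round trip:

```python
# Sketch of the GPT-2 byte<->unicode table that ByteLevel reverses on decode.
def bytes_to_unicode():
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

unicode_to_bytes = {c: b for b, c in bytes_to_unicode().items()}
print(hex(unicode_to_bytes["Ö"]))  # 0xd6
# A lone 0xd6 is an incomplete UTF-8 sequence, hence the replacement character:
print(bytes([unicode_to_bytes["Ö"]]).decode("utf-8", errors="replace"))  # �
```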
I believe the problem comes from
https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/tokenizer/mod.rs#L832-L836
I don't think added tokens should be sent to the ByteLevel decoder, since they are extracted before pre-tokenization and therefore never went through the byte-level mapping in the first place.
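
Until this is fixed, one possible workaround is to decode added-token ids as plain text and only run the ByteLevel decoder on the remaining tokens. This is just a sketch (`decode_mixed` is a hypothetical helper, not part of the library), assuming the `tokenizer` and `encode_result` from the repro script above:

```python
# Workaround sketch: bypass the ByteLevel decoder for added tokens.
added_ids = {tokenizer.token_to_id(t) for t in ["ilÖveyou"]}

def decode_mixed(ids):
    pieces = []
    for i in ids:
        token = tokenizer.id_to_token(i)
        if i in added_ids:
            pieces.append(token)  # added token: already plain text
        else:
            # regular token: reverse the byte-level mapping
            pieces.append(tokenizer.decoder.decode([token]))
    return "".join(pieces)

print(decode_mixed(encode_result.ids))  # prints: ilÖveyou
```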