Description
System Info
- `transformers` version: 4.26.1
- Platform: Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.31
- Python version: 3.9.16
- Huggingface_hub version: 0.12.1
- PyTorch version (GPU?): 1.12.1+cu113 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Hi, I recently noticed a subtle difference between the slow tokenizer and the fast tokenizer. Here is an example:
```python
from transformers import AutoTokenizer, T5Tokenizer

path = "t5-small"
text = "this is a ஐ apple"

fast_tokenizer = AutoTokenizer.from_pretrained(path)
num = fast_tokenizer.add_tokens(["ஐ"], special_tokens=True)
assert num == 1
ids = fast_tokenizer(text)["input_ids"]
fast_tokenizer.decode(ids, skip_special_tokens=True)  # 'this is a apple'

slow_tokenizer = T5Tokenizer.from_pretrained(path)
num = slow_tokenizer.add_tokens(["ஐ"], special_tokens=True)
assert num == 1
ids = slow_tokenizer(text)["input_ids"]
slow_tokenizer.decode(ids, skip_special_tokens=True)  # 'this is a ஐ apple'
```
Here is some more information about the issue. I'm not a native English speaker; I hope this is understandable.
- I know that in the first situation, the fast tokenizer uses 🤗 Tokenizers, which invokes `tokenizers.Tokenizer.add_special_tokens(tokens)`. The token `ஐ` is therefore added to the vocabulary, viewed as a "special token", and never processed by `tokenizer.model`.
- In the second situation, when decoding, the slow tokenizer treats the added token `ஐ` as a "normal token", so it is not skipped. By the way, I read the related source code: when `skip_special_tokens=True`, the slow tokenizer only skips ids in `self.all_special_ids`, but `ஐ` is not stored there; it is only stored in `self.added_tokens_encoder` (see the snippet below).
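To make the last point concrete, here is a minimal sketch (with the setup from the reproduction repeated so it runs on its own) that inspects where the added token ends up on the slow tokenizer; the commented values are what I observe on 4.26.1:

```python
from transformers import T5Tokenizer

slow_tokenizer = T5Tokenizer.from_pretrained("t5-small")
slow_tokenizer.add_tokens(["ஐ"], special_tokens=True)

token_id = slow_tokenizer.convert_tokens_to_ids("ஐ")
# Not in all_special_ids, so decode(..., skip_special_tokens=True) keeps it
print(token_id in slow_tokenizer.all_special_ids)   # False
# It is only recorded in added_tokens_encoder
print("ஐ" in slow_tokenizer.added_tokens_encoder)   # True
```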
I read some of the 🤗 official documents and struggled to figure out the meaning of the so-called "special token"; I realize it is a subtle concept. Here is my understanding: tokens can be divided into these categories:
- Normal tokens: these tokens can be split.
- Control tokens (the name is inspired by SentencePiece): `bos_token`, `eos_token`, ..., `additional_special_tokens`. The main purpose of these tokens is in the encode post-processing pipeline. When these tokens appear in the input text, the slow tokenizer in most cases also includes them in `self.unique_no_split_tokens`, so they will not be split; I don't know how the fast tokenizer handles this case.
- User-added tokens (see the sketch after this list):
  - If the token is already in the vocab, it can still be marked as a "special token", and it will never be split from now on (but it is not treated exactly the same as a control token in some subtle situations).
  - If the token is not in the vocab, it is added (a new token_id is allocated for it), and it will also never be split.

So, in both cases, user-added tokens will never be split.
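Here is a small sketch of how I read these categories on the slow T5 tokenizer. The attribute names are the ones mentioned above, and the commented outputs are what I expect on 4.26.1; please correct me if this is wrong:

```python
from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")

# Control tokens: declared on the tokenizer and used by the encode
# post-processing pipeline (T5 appends eos_token to every sequence).
print(tok.eos_token, tok.eos_token_id)              # '</s>' 1
print(tok.eos_token in tok.all_special_tokens)      # True
print(tok.eos_token in tok.unique_no_split_tokens)  # True in most cases -> never split

# User-added token that is not in the vocab: a new id is allocated and the
# token is protected from splitting, but it only lands in added_tokens_encoder.
tok.add_tokens(["ஐ"], special_tokens=True)          # returns 1
print("ஐ" in tok.unique_no_split_tokens)            # True -> never split
print(tok.added_tokens_encoder)                     # e.g. {'ஐ': 32100}
print("ஐ" in tok.all_special_tokens)                # False -> why decode does not skip it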
Please let me know if there are any misunderstandings.
Several weeks ago, I submitted issue #23001 about the `return_overflowing_tokens` behavior, which was considered a feature specific to the fast tokenizer, so it's a feature, not a bug. More generally, I would like to know whether the differences between the slow and fast tokenizers should be viewed as features or as bugs.
Expected behavior
The slow tokenizer should behave the same as the fast tokenizer.