Description
System Info
- `transformers` version: 4.26.1
- Platform: Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.31
- Python version: 3.9.16
- Huggingface_hub version: 0.12.1
- PyTorch version (GPU?): 1.12.1+cu113 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Hi, I recently noticed a subtle difference between the slow tokenizer and the fast tokenizer. Here is an example:
```python
from transformers import AutoTokenizer, T5Tokenizer

path = "t5-small"
text = "this is a ஐ apple"

fast_tokenizer = AutoTokenizer.from_pretrained(path)
num = fast_tokenizer.add_tokens(["ஐ"], special_tokens=True)
assert num == 1
ids = fast_tokenizer(text)["input_ids"]
fast_tokenizer.decode(ids, skip_special_tokens=True)  # 'this is a apple'

slow_tokenizer = T5Tokenizer.from_pretrained(path)
num = slow_tokenizer.add_tokens(["ஐ"], special_tokens=True)
assert num == 1
ids = slow_tokenizer(text)["input_ids"]
slow_tokenizer.decode(ids, skip_special_tokens=True)  # 'this is a ஐ apple'
```
Here is some more information about the issue. I'm not a native English speaker; I hope this is understandable.
- I know that in the first situation, the fast tokenizer uses 🤗 Tokenizers, which invokes `tokenizers.Tokenizer.add_special_tokens(tokens)`. The token `ஐ` is therefore added to the vocabulary, viewed as a "special token", and never processed by `tokenizer.model`.
- In the second situation, when decoding, the slow tokenizer treats the added token `ஐ` as a "normal token", so it is not skipped. By the way, I read the related source code: when `skip_special_tokens=True`, the slow tokenizer only skips ids in `self.all_special_ids`, but `ஐ` is not stored there; it is only stored in `self.added_tokens_encoder` (see the snippet below).
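To make the last point concrete, here is a minimal sketch (with the setup from the reproduction repeated so it runs on its own) that inspects where the added token ends up on the slow tokenizer; the commented values are what I observe on 4.26.1:

```python
from transformers import T5Tokenizer

slow_tokenizer = T5Tokenizer.from_pretrained("t5-small")
slow_tokenizer.add_tokens(["ஐ"], special_tokens=True)

token_id = slow_tokenizer.convert_tokens_to_ids("ஐ")
# Not in all_special_ids, so decode(..., skip_special_tokens=True) keeps it
print(token_id in slow_tokenizer.all_special_ids)   # False
# It is only recorded in added_tokens_encoder
print("ஐ" in slow_tokenizer.added_tokens_encoder)   # True
```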
I read some of the 🤗 official documents and struggled to figure out the meaning of the so-called "special token"; I realize it is a subtle concept. Here is my understanding: tokens can be divided into these categories:
- Normal tokens: these tokens can be split.
- Control tokens (the name is inspired by SentencePiece): `bos_token`, `eos_token`, ..., `additional_special_tokens`. The main purpose of these tokens is in the encode post-processing pipeline. When these tokens appear in the input text, the slow tokenizer in most cases also includes them in `self.unique_no_split_tokens`, so they will not be split; I don't know how the fast tokenizer handles this case.
- User-added tokens (see the sketch after this list):
  - If the token is already in the vocab, it can still be marked as a "special token", and it will never be split from now on (but it is not treated exactly the same as a control token in some subtle situations).
  - If the token is not in the vocab, it is added (a new token_id is allocated for it), and it will also never be split.

So, in both cases, user-added tokens will never be split.
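Here is a small sketch of how I read these categories on the slow T5 tokenizer. The attribute names are the ones mentioned above, and the commented outputs are what I expect on 4.26.1; please correct me if this is wrong:

```python
from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")

# Control tokens: declared on the tokenizer and used by the encode
# post-processing pipeline (T5 appends eos_token to every sequence).
print(tok.eos_token, tok.eos_token_id)              # '</s>' 1
print(tok.eos_token in tok.all_special_tokens)      # True
print(tok.eos_token in tok.unique_no_split_tokens)  # True in most cases -> never split

# User-added token that is not in the vocab: a new id is allocated and the
# token is protected from splitting, but it only lands in added_tokens_encoder.
tok.add_tokens(["ஐ"], special_tokens=True)          # returns 1
print("ஐ" in tok.unique_no_split_tokens)            # True -> never split
print(tok.added_tokens_encoder)                     # e.g. {'ஐ': 32100}
print("ஐ" in tok.all_special_tokens)                # False -> why decode does not skip it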
Please let me know if there are any misunderstandings.
Several weeks ago, I submitted issue #23001 about the `return_overflowing_tokens` behavior, which was considered a feature specific to the fast tokenizer, so it's a feature, not a bug. More generally, I would like to know whether the differences between the slow and fast tokenizers should be viewed as features or as bugs.
Expected behavior
The slow tokenizer should behave the same as the fast tokenizer.