
Error while loading GPT2 tokenizer with specifying "unk_token" #22414

Description

@lsy641

System Info

  • transformers version: 4.28.0.dev0
  • Platform: Linux-4.18.0-305.65.1.el8_4.x86_64-x86_64-with-glibc2.17
  • Python version: 3.8.16
  • Huggingface_hub version: 0.13.3
  • PyTorch version (GPU?): 1.11.0 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

For a certain reason, I need to modify the default unk_token of GPT2Tokenizer, which is currently "<|endoftext|>". When I tried to change it, I ran into the following problem.

from transformers import GPT2Tokenizer

control_tokens = {"sep_token": "<|sep|>", "pad_token": "<|pad|>", "cls_token": "<|cls|>", "mask_token": "<|mask|>", "unk_token": "<|unk|>"}

tokenizer = GPT2Tokenizer.from_pretrained("./tokenizer/", **control_tokens)
tokenizer.encode(["<|unk|>"])

where the directory ./tokenizer contains all the tokenizer files provided by gpt2-small: tokenizer.json, merges.txt, and vocab.json.

The error output:

Traceback (most recent call last):
File "./model/unit_test_customed_gpt2.py", line 451, in test_BuildMappingFileTestCase_bpe_mhp_gpt
self.tokenizer.build_mapping_file(self.mapped_tokenizer, "./tokenizer/customed-mhp-gpt-bpe/mapping_%s.json"%text, max_length=32, is_chinese_vocab=False)
File "/home/X/scratch/variable-text-segmentation/data_utils/sp_tokenizer.py", line 500, in build_mapping_file
mapping_ids= mapping_tokenizer.encode(mapped_text,add_special_tokens=False)
File "/home/lsiyang/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2302, in encode
encoded_inputs = self.encode_plus(
File "/home/X/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2710, in encode_plus
return self._encode_plus(
File "/home/X/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 650, in _encode_plus
return self.prepare_for_model(
File "/home/X/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3189, in prepare_for_model
encoded_inputs = self.pad(
File "/home/X/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2979, in pad
raise ValueError(
ValueError: type of None unknown: <class 'NoneType'>. Should be one of a python, numpy, pytorch or tensorflow object.

I think I know the reason.
When we specify a new token as unk_token via GPT2Tokenizer.from_pretrained(*, unk_token=XX), the tokenizer does not first add this new token to the vocabulary; it only updates self.tokenizer.unk_token = XX.
As a result, the tokenizer correctly reports its unk_token but cannot actually find the token id of that new unk_token in the vocab. The problem lies in tokenization_utils.py:

    def _add_tokens(self, new_tokens: Union[List[str], List[AddedToken]], special_tokens: bool = False) -> int:
        new_tokens = [str(tok) for tok in new_tokens]
        tokens_to_add = []
        for token in new_tokens:
            if not isinstance(token, str):
                raise TypeError(f"Token {token} is not a string but a {type(token)}.")
            if not special_tokens and hasattr(self, "do_lower_case") and self.do_lower_case:
                token = token.lower()
            if (
                token != self.unk_token  # PROBLEM: self.unk_token has already been updated to the new value, so the new unk_token can never be added here.
                and self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token)
                and token not in tokens_to_add
            ):
                tokens_to_add.append(token)
                if self.verbose:
                    logger.info(f"Adding {token} to the vocabulary")

Other special tokens, such as sep_token, can be specified via GPT2Tokenizer.from_pretrained(*, sep_token=XX); even if the token does not exist in the vocab, it is added as a new token.
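
For comparison, a minimal sketch (same ./tokenizer directory assumed) showing that sep_token specified this way does end up in the vocab:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("./tokenizer/", sep_token="<|sep|>")
# "<|sep|>" was not in the original vocab, but it gets added and resolves to a valid id...
print(tokenizer.convert_tokens_to_ids("<|sep|>"))
# ...while an arbitrary unknown string still falls back to the default unk_token ("<|endoftext|>") id.
print(tokenizer.convert_tokens_to_ids("<|unk|>"))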

The following also does not work:

from transformers import GPT2Tokenizer
control_tokens = {"sep_token": "<|sep|>", "pad_token": "<|pad|>", "cls_token": "<|cls|>", "mask_token": "<|mask|>"}
tokenizer = GPT2Tokenizer.from_pretrained("./tokenizer/", **control_tokens)
tokenizer.add_special_tokens({"unk_token": "<|unk|>"})
tokenizer.encode(["<|unk|>"])

I think we should also allow unk_token to be specified before it exists in the vocab, like the other special tokens.
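
As a temporary workaround (just a sketch based on my reading of _add_tokens above, not thoroughly tested), adding the new token to the vocab before reassigning unk_token seems to avoid the crash:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("./tokenizer/")
# Add the new token while unk_token is still the default "<|endoftext|>",
# so the `token != self.unk_token` check in _add_tokens still passes...
tokenizer.add_tokens(["<|unk|>"])
# ...and only then point unk_token at the freshly added token.
tokenizer.unk_token = "<|unk|>"
print(tokenizer.convert_tokens_to_ids("<|unk|>"))  # now a valid id
print(tokenizer.encode(["<|unk|>"]))               # no longer crashes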

Expected behavior

unk_token should be allowed to be specified before it exists in the vocab, and the new token should be added to the vocab just like the other special tokens are.
