
Error while loading GPT2 tokenizer with specifying "unk_token" #22414

Description

@lsy641

System Info

  • transformers version: 4.28.0.dev0
  • Platform: Linux-4.18.0-305.65.1.el8_4.x86_64-x86_64-with-glibc2.17
  • Python version: 3.8.16
  • Huggingface_hub version: 0.13.3
  • PyTorch version (GPU?): 1.11.0 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

For a certain reason, I need to modify the default unk_token of GPT2Tokenizer, which is currently "<|endoftext|>". When I tried to change it, I ran into the following problem.

from transformers import GPT2Tokenizer

control_tokens = {"sep_token": "<|sep|>", "pad_token": "<|pad|>", "cls_token": "<|cls|>", "mask_token": "<|mask|>", "unk_token": "<|unk|>"}

tokenizer = GPT2Tokenizer.from_pretrained("./tokenizer/", **control_tokens)
tokenizer.encode(["<|unk|>"])

where the directory ./tokenizer contains all the tokenizer files provided by gpt2-small: tokenizer.json, merges.txt, and vocab.json.

The error output:

Traceback (most recent call last):
File "./model/unit_test_customed_gpt2.py", line 451, in test_BuildMappingFileTestCase_bpe_mhp_gpt
self.tokenizer.build_mapping_file(self.mapped_tokenizer, "./tokenizer/customed-mhp-gpt-bpe/mapping_%s.json"%text, max_length=32, is_chinese_vocab=False)
File "/home/X/scratch/variable-text-segmentation/data_utils/sp_tokenizer.py", line 500, in build_mapping_file
mapping_ids= mapping_tokenizer.encode(mapped_text,add_special_tokens=False)
File "/home/lsiyang/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2302, in encode
encoded_inputs = self.encode_plus(
File "/home/X/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2710, in encode_plus
return self._encode_plus(
File "/home/X/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 650, in _encode_plus
return self.prepare_for_model(
File "/home/X/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3189, in prepare_for_model
encoded_inputs = self.pad(
File "/home/X/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2979, in pad
raise ValueError(
ValueError: type of None unknown: <class 'NoneType'>. Should be one of a python, numpy, pytorch or tensorflow object.

I think I know the reason.
When we specify a new token as unk_token via GPT2Tokenizer.from_pretrained(*, unk_token=XX), the tokenizer does not first add this new token to the vocabulary; it only updates self.tokenizer.unk_token = XX.
As a result, the tokenizer correctly reports its unk_token but cannot actually find the token id of that new unk_token in the vocab. The problem lies in tokenization_utils.py:

    def _add_tokens(self, new_tokens: Union[List[str], List[AddedToken]], special_tokens: bool = False) -> int:
        new_tokens = [str(tok) for tok in new_tokens]
        tokens_to_add = []
        for token in new_tokens:
            if not isinstance(token, str):
                raise TypeError(f"Token {token} is not a string but a {type(token)}.")
            if not special_tokens and hasattr(self, "do_lower_case") and self.do_lower_case:
                token = token.lower()
            if (
                token != self.unk_token  # PROBLEM: self.unk_token has already been updated to the new value, so the new unk_token can never be added here.
                and self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token)
                and token not in tokens_to_add
            ):
                tokens_to_add.append(token)
                if self.verbose:
                    logger.info(f"Adding {token} to the vocabulary")

Other special tokens, such as sep_token, can be specified via GPT2Tokenizer.from_pretrained(*, sep_token=XX); even if the token does not exist in the vocab, it is added as a new token.
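
For comparison, a minimal sketch (same ./tokenizer directory assumed) showing that sep_token specified this way does end up in the vocab:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("./tokenizer/", sep_token="<|sep|>")
# "<|sep|>" was not in the original vocab, but it gets added and resolves to a valid id...
print(tokenizer.convert_tokens_to_ids("<|sep|>"))
# ...while an arbitrary unknown string still falls back to the default unk_token ("<|endoftext|>") id.
print(tokenizer.convert_tokens_to_ids("<|unk|>"))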

The following also does not work:

from transformers import GPT2Tokenizer
control_tokens = {"sep_token": "<|sep|>", "pad_token": "<|pad|>", "cls_token": "<|cls|>", "mask_token": "<|mask|>"}
tokenizer = GPT2Tokenizer.from_pretrained("./tokenizer/", **control_tokens)
tokenizer.add_special_tokens({"unk_token": "<|unk|>"})
tokenizer.encode(["<|unk|>"])

I think we should also allow unk_token to be specified before it exists in the vocab, like the other special tokens.
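
As a temporary workaround (just a sketch based on my reading of _add_tokens above, not thoroughly tested), adding the new token to the vocab before reassigning unk_token seems to avoid the crash:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("./tokenizer/")
# Add the new token while unk_token is still the default "<|endoftext|>",
# so the `token != self.unk_token` check in _add_tokens still passes...
tokenizer.add_tokens(["<|unk|>"])
# ...and only then point unk_token at the freshly added token.
tokenizer.unk_token = "<|unk|>"
print(tokenizer.convert_tokens_to_ids("<|unk|>"))  # now a valid id
print(tokenizer.encode(["<|unk|>"]))               # no longer crashes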

Expected behavior

unk_token should be allowed to be specified before it exists in the vocab, and the new token should be added to the vocab just like the other special tokens are.
