System Info
- transformers version: 4.28.0.dev0
- Platform: Linux-4.18.0-305.65.1.el8_4.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.16
- Huggingface_hub version: 0.13.3
- PyTorch version (GPU?): 1.11.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
For a certain reason, I need to modify the default unk_token of GPT2Tokenizer, which is currently "<|endoftext|>". When I tried to change it, I ran into problems.
from transformers import GPT2Tokenizer

# Override the default special tokens, including unk_token
control_tokens = {"sep_token": "<|sep|>", "pad_token": "<|pad|>", "cls_token": "<|cls|>", "mask_token": "<|mask|>", "unk_token": "<|unk|>"}
tokenizer = GPT2Tokenizer.from_pretrained("./tokenizer/", **control_tokens)
tokenizer.encode(["<|unk|>"])
Here, the directory ./tokenizer contains all the tokenizer files provided by gpt2-small: tokenizer.json, merges.txt, and vocab.json.
Error output:
Traceback (most recent call last):
File "./model/unit_test_customed_gpt2.py", line 451, in test_BuildMappingFileTestCase_bpe_mhp_gpt
self.tokenizer.build_mapping_file(self.mapped_tokenizer, "./tokenizer/customed-mhp-gpt-bpe/mapping_%s.json"%text, max_length=32, is_chinese_vocab=False)
File "/home/X/scratch/variable-text-segmentation/data_utils/sp_tokenizer.py", line 500, in build_mapping_file
mapping_ids= mapping_tokenizer.encode(mapped_text,add_special_tokens=False)
File "/home/lsiyang/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2302, in encode
encoded_inputs = self.encode_plus(
File "/home/X/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2710, in encode_plus
return self._encode_plus(
File "/home/X/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 650, in _encode_plus
return self.prepare_for_model(
File "/home/X/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3189, in prepare_for_model
encoded_inputs = self.pad(
File "/home/X/scratch/miniconda3/envs/mix/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2979, in pad
raise ValueError(
ValueError: type of None unknown: <class 'NoneType'>. Should be one of a python, numpy, pytorch or tensorflow object.
I think I know the reason.
When we specify a new token as unk_token via GPT2Tokenizer.from_pretrained(*, unk_token=XX), it does not first add this new token to the vocabulary; it only updates the tokenizer's unk_token attribute. As a result, the tokenizer correctly reports its unk_token, but it cannot find the token id of that new unk_token in the vocab. The problem lies in tokenization_utils.py:
def _add_tokens(self, new_tokens: Union[List[str], List[AddedToken]], special_tokens: bool = False) -> int:
    new_tokens = [str(tok) for tok in new_tokens]

    tokens_to_add = []
    for token in new_tokens:
        if not isinstance(token, str):
            raise TypeError(f"Token {token} is not a string but a {type(token)}.")
        if not special_tokens and hasattr(self, "do_lower_case") and self.do_lower_case:
            token = token.lower()
        if (
            token != self.unk_token  # PROBLEM: self.unk_token has already been updated to the new value, so the new unk_token can never be added here.
            and self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token)
            and token not in tokens_to_add
        ):
            tokens_to_add.append(token)
            if self.verbose:
                logger.info(f"Adding {token} to the vocabulary")
For other special tokens, like sep_token, it is allowed to specify them via GPT2Tokenizer.from_pretrained(*, sep_token=XX): even if the token does not yet exist in the vocab, it is added to the vocab. For unk_token, even the following does not work:
from transformers import GPT2Tokenizer

control_tokens = {"sep_token": "<|sep|>", "pad_token": "<|pad|>", "cls_token": "<|cls|>", "mask_token": "<|mask|>"}
tokenizer = GPT2Tokenizer.from_pretrained("./tokenizer/", **control_tokens)
# Try to register the new unk_token afterwards instead of via from_pretrained
tokenizer.add_special_tokens({"unk_token": "<|unk|>"})
tokenizer.encode(["<|unk|>"])
I think we should also allow specifying an unk_token that does not yet exist in the vocabulary, just like the other special tokens.
Expected behavior
unk_token specification should be allowed before the token exists in the vocabulary, just like the other special tokens, i.e. the new unk_token should be added to the vocab automatically.
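In other words, something like the following (hypothetical desired behavior, mirroring how sep_token etc. already work) should succeed:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("./tokenizer/", unk_token="<|unk|>")
# Desired: the new unk_token is added to the vocab, so the lookup never yields None.
assert tokenizer.convert_tokens_to_ids("<|unk|>") is not None
print(tokenizer.encode(["<|unk|>"], add_special_tokens=False))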