
skip_special_tokens has different behavior between slow and fast tokenizer #23250

Closed

@BuxianChen

Description

System Info

  • transformers version: 4.26.1
  • Platform: Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.31
  • Python version: 3.9.16
  • Huggingface_hub version: 0.12.1
  • PyTorch version (GPU?): 1.12.1+cu113 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Hi, recently I found a subtle difference between the slow tokenizer and the fast tokenizer. Here is an example:

from transformers import AutoTokenizer, T5Tokenizer

path = "t5-small"
text = "this is a ஐ apple"

# Fast tokenizer: the added special token IS skipped when decoding.
fast_tokenizer = AutoTokenizer.from_pretrained(path)
num = fast_tokenizer.add_tokens(["ஐ"], special_tokens=True)
assert num == 1
ids = fast_tokenizer(text)["input_ids"]
fast_tokenizer.decode(ids, skip_special_tokens=True)  # 'this is a apple'

# Slow tokenizer: the added special token is NOT skipped when decoding.
slow_tokenizer = T5Tokenizer.from_pretrained(path)
num = slow_tokenizer.add_tokens(["ஐ"], special_tokens=True)
assert num == 1
ids = slow_tokenizer(text)["input_ids"]
slow_tokenizer.decode(ids, skip_special_tokens=True)  # 'this is a ஐ apple'
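
A possible workaround, sketched under the assumption that every entry in added_tokens_encoder should be skipped (true for this reproduction, where "ஐ" is the only added token):

# Hypothetical workaround: drop user-added token ids before decoding.
added_ids = set(slow_tokenizer.added_tokens_encoder.values())
filtered_ids = [i for i in ids if i not in added_ids]
slow_tokenizer.decode(filtered_ids, skip_special_tokens=True)  # 'this is a apple'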

Here is more information about the issue. I'm not a native English speaker; I hope this is understandable.

  • I know that in the first case, the fast tokenizer uses 🤗 Tokenizers, which invokes tokenizers.Tokenizer.add_special_tokens(tokens); the token is therefore added to the vocabulary, viewed as a "special token", and never processed by tokenizer.model.
  • In the second case, when decoding, the slow tokenizer treats the added token as a "normal token", so it is not skipped. By the way, I read the related source code: when skip_special_tokens=True, the slow tokenizer only skips ids in self.all_special_ids, but the added token is not stored there; it is stored in self.added_tokens_encoder (see the probe sketch after this list).
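
A minimal probe of this, assuming the reproduction script above has already run (attribute names as in transformers 4.26.1):

# Where does each tokenizer record the added token?
new_id = slow_tokenizer.convert_tokens_to_ids("ஐ")
print(new_id in slow_tokenizer.all_special_ids)    # False -> decode() will not skip it
print("ஐ" in slow_tokenizer.added_tokens_encoder)  # True -> it is stored here instead
# The fast tokenizer does not consult all_special_ids here: skipping is
# delegated to the Rust backend, which marked the token as special when
# tokenizers.Tokenizer.add_special_tokens was called.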

I read some 🤗 official documents and struggled to figure out the meaning of the so-called "special token"; I realize it is a subtle concept. Here is my understanding: tokens can be divided into these categories:

  • normal tokens: these tokens can be split.
  • control tokens (the name is inspired by SentencePiece): bos_token, eos_token, ..., additional_special_tokens. The major purpose of these tokens is the encoding post-processing pipeline. When these tokens appear in the input text, the slow tokenizer in most cases also includes them in self.unique_no_split_tokens, so they will not be split; I don't know how the fast tokenizer handles this case.
  • user-added tokens (see the sketch after this list):
    • If the token is already in the vocab, it can still be marked as a "special token", and from then on it will never be split (but it is not treated exactly the same as a control token in some subtle situations).
    • If the token is not in the vocab, it is added (a new token_id is allocated for it), and it will likewise never be split.
    So, in both cases, user-added tokens are never split.
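
A small sketch illustrating these categories with the slow T5 tokenizer (the token <my_tok> is hypothetical, and unique_no_split_tokens is the attribute name in transformers 4.26.1):

from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")

# Control token: eos_token is, in most cases, also a no-split token.
print(tok.eos_token in tok.unique_no_split_tokens)  # True

# User-added token: never split after being added.
tok.add_tokens(["<my_tok>"], special_tokens=True)
print(tok.tokenize("a <my_tok> b"))  # e.g. ['▁a', '<my_tok>', '▁b']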

Please let me know if I have misunderstood anything.

Several weeks ago, I submitted issue #23001 about the return_overflowing_tokens behavior, which was judged to be a fast-tokenizer-specific feature rather than a bug. More generally, I want to know whether differences between the slow and fast tokenizers should be viewed as features or as bugs.

Expected behavior

The slow tokenizer should behave the same as the fast tokenizer.
