[TokenizerSlow] replace_additional_special_tokens is not doing much #24276

Closed
Description

@ArthurZucker

Just flagging this because the add_special_tokens method has gotten pretty complicated: it now takes a kwarg, replace_additional_special_tokens, which is supposed to control whether the self._additional_special_tokens attribute gets replaced.
For any slow tokenizer, setting it does remove the previous tokens from that list, but the internal trie is never updated, so the flag has no visible effect at all:

>>> from transformers import XLMRobertaTokenizer
>>> tokenizer_a = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')
>>> tokenizer_a.add_special_tokens({"additional_special_tokens":["<//s>"]})
>>> tokenizer_a.additional_special_tokens
['<//s>']
>>> print(tokenizer_a.tokenize("This is a <//s>"))
['▁This', '▁is', '▁a', '<//s>']
>>> tokenizer_a.add_special_tokens({"additional_special_tokens":["<///s>"]}, replace_additional_special_tokens= True)
>>> print(tokenizer_a.tokenize("This is a <//s>"))
['▁This', '▁is', '▁a', '<//s>']
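
To make the mismatch explicit, here is a minimal sketch based on the behaviour shown above (same model and tokens): after the replacement the old token is no longer in additional_special_tokens, yet tokenize still splits it off as a single piece because the trie was never rebuilt.

>>> from transformers import XLMRobertaTokenizer
>>> tok = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')
>>> tok.add_special_tokens({"additional_special_tokens": ["<//s>"]})
>>> tok.add_special_tokens({"additional_special_tokens": ["<///s>"]}, replace_additional_special_tokens=True)
>>> "<//s>" in tok.additional_special_tokens  # the list itself was replaced
False
>>> "<//s>" in tok.tokenize("This is a <//s>")  # but the trie still matches the old token
True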

This will be addressed in #23909

Metadata
Labels

Core: Tokenization (Internals of the library; Tokenization.)
