Just flagging this, as the `add_special_tokens` method got pretty complicated: it gained a kwarg, `replace_additional_special_tokens`, which supposedly controls whether the `self._additional_special_tokens` attribute is replaced.
For any slow tokenizer, passing `replace_additional_special_tokens=True` removes the previous tokens from that list, but does not update the internal trie, so the removal has no effect on tokenization at all:
```python
>>> from transformers import XLMRobertaTokenizer
>>> tokenizer_a = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')
>>> tokenizer_a.add_special_tokens({"additional_special_tokens": ["<//s>"]})
>>> tokenizer_a.additional_special_tokens
['<//s>']
>>> print(tokenizer_a.tokenize("This is a <//s>"))
['▁This', '▁is', '▁a', '<//s>']
>>> tokenizer_a.add_special_tokens({"additional_special_tokens": ["<///s>"]}, replace_additional_special_tokens=True)
>>> print(tokenizer_a.tokenize("This is a <//s>"))
['▁This', '▁is', '▁a', '<//s>']
```
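The stale entry can be seen directly on the tokenizer's internal trie. A minimal check, assuming the slow tokenizer's `tokens_trie` attribute (which `tokenize` uses to split out special tokens):

```python
>>> tokenizer_a.additional_special_tokens  # the attribute was replaced...
['<///s>']
>>> tokenizer_a.tokens_trie.split("This is a <//s>")  # ...but the trie still matches the old token
['This is a ', '<//s>']
```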
This will be addressed in #23909
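In the meantime, a possible workaround (a sketch, not taken from this issue) is to reload the tokenizer so the trie is rebuilt from scratch with only the tokens you actually want:

```python
>>> tokenizer_b = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')
>>> tokenizer_b.add_special_tokens({"additional_special_tokens": ["<///s>"]})
>>> print(tokenizer_b.tokenize("This is a <///s>"))  # only the new token is kept intact
['▁This', '▁is', '▁a', '<///s>']
```

Here `"<//s>"` would be split into ordinary subword pieces, since the fresh trie never contained it.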