System Info
transformers==4.17.0
torch==1.10.0
python==3.7.3
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
# xlm-roberta-base directory obtained via: git clone https://huggingface.co/xlm-roberta-base
from transformers import XLMRobertaTokenizer

# Tokenizer loaded from the pretrained directory (picks up the tokenizer and special-token config files)
tokenizer_a = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base/')
# Tokenizer constructed directly from the raw SentencePiece model file
tokenizer_b = XLMRobertaTokenizer('xlm-roberta-base/sentencepiece.bpe.model')

t = 'texta<s>textb'
print(tokenizer_a.tokenize(t))
print(tokenizer_b.tokenize(t))
Expected behavior
# What I expect is that both tokenizers output:
['▁text', 'a', '<s>', '▁text', 'b']
['▁text', 'a', '<s>', '▁text', 'b']
# However, in reality, their outputs are as follows:
['▁text', 'a', '<s>', '▁text', 'b']
['▁text', 'a', '<', 's', '>', 'text', 'b']
Why do these two tokenizers produce different segmentation results for the special tokens?
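A hedged guess at a workaround (not a confirmed explanation): it looks like the special tokens are only registered as "no split" tokens on the from_pretrained path, so tokenizer_b passes '<s>' to SentencePiece as plain text and it gets split into '<', 's', '>'. If that is the cause, re-registering the special tokens on tokenizer_b should make the two outputs match; the snippet below is a sketch based on that assumption.

# Assumption: tokenizer_b never registered its special tokens as no-split tokens,
# so '<s>' reaches SentencePiece as ordinary text.
# add_tokens(..., special_tokens=True) re-registers them so tokenize() keeps them whole.
tokenizer_b.add_tokens(tokenizer_b.all_special_tokens, special_tokens=True)
print(tokenizer_b.tokenize('texta<s>textb'))
# If the assumption holds, this should now print: ['▁text', 'a', '<s>', '▁text', 'b']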