
Two tokenizer initialization methods result in inconsistent segmentation results for special tokens #23930

Closed
@KinvenW

Description

System Info

transformers==4.17.0
torch==1.10.0
python==3.7.3

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

# xlm-roberta-base directory obtained with: git clone https://huggingface.co/xlm-roberta-base
from transformers import XLMRobertaTokenizer

# Method A: load the full tokenizer directory with from_pretrained
tokenizer_a = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base/')
# Method B: construct the tokenizer directly from the SentencePiece model file
tokenizer_b = XLMRobertaTokenizer('xlm-roberta-base/sentencepiece.bpe.model')

t = 'texta<s>textb'
print(tokenizer_a.tokenize(t))
print(tokenizer_b.tokenize(t))

Expected behavior

# What I expect is that both tokenizers produce the same output:
['▁text', 'a', '<s>', '▁text', 'b']
['▁text', 'a', '<s>', '▁text', 'b']

# However, the actual outputs are:
['▁text', 'a', '<s>', '▁text', 'b']
['▁text', 'a', '<', 's', '>', 'text', 'b']

Why do these two tokenizers produce different segmentation results for special tokens?
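
For reference, here is a workaround sketch based on my own guess about the cause (not a confirmed explanation): it looks like from_pretrained registers the special tokens as "no split" tokens, while the direct constructor does not, so <s> is fed through SentencePiece instead of being kept intact. Explicitly re-registering the special tokens on tokenizer_b seems like it should restore the expected segmentation:

# Assumption: the directly-constructed tokenizer is missing the "no split"
# registration of its special tokens that from_pretrained performs.
from transformers import XLMRobertaTokenizer

tokenizer_b = XLMRobertaTokenizer('xlm-roberta-base/sentencepiece.bpe.model')

# Re-register the special tokens (<s>, </s>, <unk>, <pad>, <mask>) as special
# added tokens so that tokenize() splits around them rather than passing them
# to SentencePiece. sanitize_special_tokens() is available on
# PreTrainedTokenizerBase in this transformers version; add_tokens(
# tokenizer_b.all_special_tokens, special_tokens=True) should work as well.
tokenizer_b.sanitize_special_tokens()

print(tokenizer_b.tokenize('texta<s>textb'))
# hoped-for output: ['▁text', 'a', '<s>', '▁text', 'b']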
