Hello,
I want to add new words to my BPE tokenizer. As far as I understand, the symbol Ġ encodes a leading space, so it marks the start of a new word, and the majority of tokens in the vocabularies of pre-trained tokenizers start with Ġ. Assume I want to add the word Salah to my tokenizer. I tried adding both the Salah token and ĠSalah:
tokenizer.add_tokens(['Salah', 'ĠSalah']) # they get 50265 and 50266 values respectively.
However, when I tokenize a sentence in which Salah appears, the tokenizer never returns the second ID (neither with .tokenize nor with .encode). For instance:
tokenizer.tokenize('I love Salah and salad') returns ['I', 'Ġlove', 'Salah', 'Ġand', 'Ġsalad'].
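For context, here is a minimal sketch of where Ġ comes from, assuming the tokenizer is a GPT-2 style byte-level BPE (which the issue does not state explicitly). The function `bytes_to_unicode` follows the helper published with the GPT-2 encoder; `to_byte_level` is a hypothetical name introduced only for this illustration and is not part of the transformers API.

```python
def bytes_to_unicode():
    """Map every byte (0-255) to a printable unicode character.

    Printable bytes map to themselves; the remaining bytes (control
    characters, space, etc.) are shifted to code points starting at 256.
    """
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(2 ** 8):
        if b not in bs:
            bs.append(b)
            cs.append(2 ** 8 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


def to_byte_level(text):
    """Hypothetical helper: show how raw text looks after the byte-level mapping."""
    mapping = bytes_to_unicode()
    return "".join(mapping[b] for b in text.encode("utf-8"))


print(to_byte_level(" Salah"))  # ĠSalah (the leading space becomes Ġ)
print(to_byte_level("Salah"))   # Salah  (no leading space, no Ġ)
```

Under this assumption, Ġ only exists in the vocabulary's internal representation: the raw input string contains an ordinary space, never the character Ġ itself.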
The question is: should I include the symbol Ġ myself when adding new tokens, or does the tokenizer add it automatically? Or does it have to be specified in some other way?
Thanks in advance!