Usage of Ġ in BPE tokenizer #4786

@maschasap

Hello,
I want to add new words to my BPE tokenizer. As I understand it, the symbol Ġ marks the start of a new word (it stands for the preceding space), and most tokens in the vocabularies of pre-trained tokenizers start with Ġ. Suppose I want to add the word Salah to my tokenizer. I tried adding both the token Salah and the token ĠSalah:
tokenizer.add_tokens(['Salah', 'ĠSalah']) # they get the IDs 50265 and 50266, respectively.
However, when I tokenize a sentence containing Salah, the tokenizer never returns the second ID (with either .tokenize or .encode). For instance:
tokenizer.tokenize('I love Salah and salad') returns ['I', 'Ġlove', 'Salah', 'Ġand', 'Ġsalad'].
The question is: should I include the symbol Ġ when adding new tokens, or does the tokenizer add it itself? Or does it have to be specified manually in some way?
Thanks in advance!
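
For context on where Ġ comes from: in GPT-2 style byte-level BPE, every byte is mapped to a printable character before merges are applied, and the space byte ends up as Ġ. The sketch below reimplements that byte-to-character table in plain Python, with no dependency on the library. The helper name to_bpe_symbols is hypothetical and only for illustration, and the tokenizer's regex pre-splitting and BPE merges are left out.

```python
def bytes_to_unicode():
    """Build the GPT-2 style table mapping each byte (0-255) to a printable character."""
    # Bytes that are already printable keep their own character.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte (control characters, space, ...) is shifted to 256 + n.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


BYTE_TO_CHAR = bytes_to_unicode()


def to_bpe_symbols(text):
    """Hypothetical helper: show how raw text looks after the byte-level mapping."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(to_bpe_symbols(" Salah"))        # ĠSalah
print(to_bpe_symbols("Salah"))         # Salah
print(to_bpe_symbols("I love Salah"))  # IĠloveĠSalah
```

So Ġ is just the encoded form of a leading space, not a character that appears in the input text. The output in the question, where 'Salah' comes back without Ġ, suggests that added tokens are matched against the raw text rather than against this mapped form, but that is an inference from the output above and not something this sketch verifies.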
