Usage of Ġ in BPE tokenizer

Hello, 
I want to add new words to my BPE tokenizer. As I understand it, the symbol `Ġ` marks the start of a new word (it stands for the space in front of the token), and the majority of tokens in the vocabularies of pre-trained tokenizers start with `Ġ`. Suppose I want to add the word **Salah** to my tokenizer. I tried adding both the **Salah** token and **ĠSalah**:
`tokenizer.add_tokens(['Salah', 'ĠSalah'])` # they get the IDs 50265 and 50266, respectively.
However, when I tokenize a sentence in which **Salah** appears, the tokenizer never returns the second token or its ID (neither with `.tokenize` nor with `.encode`). For instance:
`tokenizer.tokenize('I love Salah and salad')` returns `['I', 'Ġlove', 'Salah', 'Ġand', 'Ġsalad']`.
The question is: should I include the symbol `Ġ` myself when adding new tokens, or does the tokenizer handle it automatically?
Thanks in advance!
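
For context, here is a minimal sketch of where `Ġ` comes from, assuming the tokenizer uses a GPT-2-style byte-level BPE (which the IDs above suggest, but which I have not confirmed for my exact model). The byte-to-character table below is re-implemented by hand for illustration, and `to_byte_level` is a hypothetical helper, not a library function. It shows that a leading space is rewritten as `Ġ` before the BPE merges run:

```python
def bytes_to_unicode():
    """Build a byte -> printable character table in the style of GPT-2's byte-level BPE."""
    # Bytes that are already printable keep their own character.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte (control characters, space, ...) is shifted to code point 256 + n.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


BYTE_TO_CHAR = bytes_to_unicode()


def to_byte_level(text):
    """Hypothetical helper: map each UTF-8 byte of `text` to its printable stand-in."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(to_byte_level(" Salah"))  # ĠSalah
print(to_byte_level("Salah"))   # Salah
```

Under that assumption, `Ġ` in a vocabulary entry is just the encoded form of a leading space, not a character that would ever occur in my input text.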

Usage of Ġ in BPE tokenizer #4786
