Fix NFD computation #7122
Conversation
Using your example, I added the following test:

```diff
@@ -252,6 +255,7 @@ tests = [
     "3333333",
     "33333333",
     "333333333",
+    "乡—서울에서",
     chktxt,
 ]
```

The unit test using the bert vocab still fails with this PR:

```sh
# regenerate test data
python3 convert-hf-to-gguf-update.py <hf_token>

# run tests
make -j tests && ./tests/test-tokenizer-0 ./models/ggml-vocab-bert-bge.gguf
```

Result:

Shouldn't it pass with the NFD fixes?
What exactly do I need to pass to the
You need a HuggingFace account: https://huggingface.co/settings/tokens

Pass the read token from the link above. If you don't have an HF account, I can push the test to your branch manually.
Yes, feel free to push.
I get the error:
Pull and run:

```sh
make -j tests && ./tests/test-tokenizer-0 ./models/ggml-vocab-bert-bge.gguf
```

This compares the resulting tokens when using
How can I know the
The reference HF tokenizer model is listed in the

For the BERT model "bert-bge", the link is: https://huggingface.co/BAAI/bge-small-en-v1.5/tree/main

You can clone that repository and write a Python script that imports
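A minimal sketch of such a comparison script, assuming the `transformers` package is installed (the helper names here are mine, not from the repository; the model id is the one linked above):

```python
# Sketch: compare llama.cpp tokenizer output against the HF reference.
# Assumes `pip install transformers` and network access for the model download.

def first_mismatch(expected, actual):
    """Index of the first differing token id, or -1 if the lists match."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return -1

def hf_reference_tokens(text, model_id="BAAI/bge-small-en-v1.5"):
    """Tokenize `text` with the reference HF tokenizer (downloads the model)."""
    from transformers import AutoTokenizer
    tok = AutoTokenizer.from_pretrained(model_id)
    return tok.encode(text, add_special_tokens=False)
```

You would then print `hf_reference_tokens("乡—서울에서")` and diff the ids against the ones printed by `test-tokenizer-0` with `first_mismatch`.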
Okay, thanks. I will investigate further.
I believe this test also fails on master without my changes, so maybe the issue does not come from the normalization? The results with and without the change are the same.
Yes, it's possible. I was trying to add a test for the NFD changes. Can you provide a text that does not work on
I am trying, but it seems quite hard: I find cases where the NFD output differs, but the end tokenization results in the same tokens.

This is expanded by NFD, but the extra codepoints are not reflected in the tokenization. I have tried multiple examples with no success. If you want, I can close this PR. Now I have a bit more clarity, and in most cases I would just ignore NFC and NFD for my problem and hope it works most of the time.
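For illustration, this is the kind of expansion I mean, sketched with Python's standard `unicodedata` module (the example string is my own, not the one from the test):

```python
import unicodedata

text = "é"  # one precomposed codepoint, U+00E9
nfd = unicodedata.normalize("NFD", text)

# NFD expands it into a base letter plus a combining accent, so the
# codepoint sequence differs even when the final tokenization may not.
assert len(text) == 1 and len(nfd) == 2
assert [f"U+{ord(c):04X}" for c in nfd] == ["U+0065", "U+0301"]
```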
If you think the
It seems to make sense in the strict sense of the algorithm, but I tried several quite complex cases and could not find one, because it seems that the second part is not in the vocabulary.
Maybe @iamlemec can provide more context and find a case where it could change?
Ok, so I've searched programmatically over the

Now, this business with the Korean (Hangul) letters is interesting as a separate issue. Looks like the HF tokenizer handles those programmatically, as recommended. Since there are
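(Side note on the Hangul case: precomposed syllables have no entries in the decomposition tables; UAX #15 instead defines an arithmetic decomposition. A rough sketch of that algorithm in Python, with constant names taken from the spec:)

```python
import unicodedata

# Constants from UAX #15, Hangul syllable decomposition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
T_COUNT = 28            # trailing-consonant slots (incl. "none")
N_COUNT = 21 * 28       # vowel * trailing combinations per leading consonant
S_COUNT = 19 * N_COUNT  # total precomposed syllables: 11172

def decompose_hangul(ch):
    """Arithmetically decompose one precomposed Hangul syllable into jamo."""
    s = ord(ch) - S_BASE
    if not 0 <= s < S_COUNT:
        return ch  # not a precomposed Hangul syllable
    jamo = [L_BASE + s // N_COUNT, V_BASE + (s % N_COUNT) // T_COUNT]
    if s % T_COUNT:  # trailing index 0 means "no trailing consonant"
        jamo.append(T_BASE + s % T_COUNT)
    return "".join(map(chr, jamo))

# Matches the library NFD for e.g. "서" (U+C11C):
assert decompose_hangul("서") == unicodedata.normalize("NFD", "서")
```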
Let's close this, thanks for the help. I got some nice insights into this new world of Unicode and normalization. Thanks @ggerganov, @iamlemec and @teleprint-me!
I am trying to change the NFD computation according to https://unicode.org/reports/tr15/#Description_Norm

The changes are:
- Take each `range` from the `nfd_map` and apply the `decomposition` recursively.
  (Edit) According to @iamlemec, the `nfd_map` is constructed in a way that the recursion is already applied.
- Reorder combining marks by `Canonical_Combining_Class`.

TODO: add a `unicode_canonical_class` map.
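As a cross-check of the two steps above, here is a hedged Python sketch (an illustration against the `unicodedata` database, not the C++ code from this PR) of NFD per UAX #15: recursive canonical decomposition followed by canonical reordering of combining marks. Note it does not handle the arithmetic Hangul decomposition:

```python
import unicodedata

def nfd(text):
    """Toy NFD: recursive canonical decomposition + canonical ordering."""
    out = []

    def decompose(ch):
        d = unicodedata.decomposition(ch)
        if not d or d.startswith("<"):  # none, or compatibility-only mapping
            out.append(ch)
            return
        for cp in d.split():
            decompose(chr(int(cp, 16)))  # mappings can nest, hence recursion

    for ch in text:
        decompose(ch)  # (Hangul syllables would need the arithmetic rule)

    # Canonical ordering: stable-sort each run of marks with nonzero
    # Canonical_Combining_Class -- this is what needs the ccc map.
    i = 0
    while i < len(out):
        if unicodedata.combining(out[i]) == 0:
            i += 1
            continue
        j = i
        while j < len(out) and unicodedata.combining(out[j]):
            j += 1
        out[i:j] = sorted(out[i:j], key=unicodedata.combining)
        i = j
    return "".join(out)
```

For example, `nfd("Ệ")` decomposes through two levels (Ệ → Ê + dot below → E + circumflex + dot below) and then reorders the dot below (ccc 220) before the circumflex (ccc 230), matching `unicodedata.normalize("NFD", "Ệ")`.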