Fix NFD computation #7122

JoanFM · 2024-05-07T12:14:50Z

I am trying to change the NFD computation according to https://unicode.org/reports/tr15/#Description_Norm

Thee changes are:

Get the range from the nfd_map and apply the decomposition recursively. (Edit) According to @iamlemec, the nfd_map is constructed in a way that the recursion is applied
Sort results according to the Canonical_Combining_Class.

TODO

Properly fill the unicode_canonical_class map
Do some tests and validate the implementation
Fix the implementation avoiding potential problem with inserting into a vector while iterating over it

github-actions · 2024-05-07T12:43:52Z

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 550 iterations 🚀

Expand details for performance related PR only

Concurrent users: 8, duration: 10m
HTTP request : avg=8546.39ms p(95)=20204.48ms fails=, finish reason: stop=489 truncated=61
Prompt processing (pp): avg=98.39tk/s p(95)=372.49tk/s
Token generation (tg): avg=35.22tk/s p(95)=50.21tk/s
ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=fix-nfd commit=c0aedfec8338e8136a69ec2ccee2528e479f2834

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 550 iterations"
    y-axis "llamacpp:prompt_tokens_seconds"
    x-axis "llamacpp:prompt_tokens_seconds" 1715157035 --> 1715157661
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 880.43, 880.43, 880.43, 880.43, 880.43, 779.47, 779.47, 779.47, 779.47, 779.47, 777.52, 777.52, 777.52, 777.52, 777.52, 812.19, 812.19, 812.19, 812.19, 812.19, 858.94, 858.94, 858.94, 858.94, 858.94, 858.48, 858.48, 858.48, 858.48, 858.48, 871.28, 871.28, 871.28, 871.28, 871.28, 881.74, 881.74, 881.74, 881.74, 881.74, 896.48, 896.48, 896.48, 896.48, 896.48, 893.42, 893.42, 893.42, 893.42, 893.42, 912.85, 912.85, 912.85, 912.85, 912.85, 903.58, 903.58, 903.58, 903.58, 903.58, 915.18, 915.18, 915.18, 915.18, 915.18, 890.19, 890.19, 890.19, 890.19, 890.19, 899.94, 899.94, 899.94, 899.94, 899.94, 897.18, 897.18, 897.18, 897.18, 897.18, 887.69, 887.69, 887.69, 887.69, 887.69, 890.95, 890.95, 890.95, 890.95, 890.95, 890.23, 890.23, 890.23, 890.23, 890.23, 892.89, 892.89, 892.89, 892.89, 892.89, 888.61, 888.61, 888.61, 888.61, 888.61, 864.43, 864.43, 864.43, 864.43, 864.43, 866.77, 866.77, 866.77, 866.77, 866.77, 868.21, 868.21, 868.21, 868.21, 868.21, 846.39, 846.39, 846.39, 846.39, 846.39, 843.03, 843.03, 843.03, 843.03, 843.03, 841.4, 841.4, 841.4, 841.4, 841.4, 842.27, 842.27, 842.27, 842.27, 842.27, 846.75, 846.75, 846.75, 846.75, 846.75, 844.64, 844.64, 844.64, 844.64, 844.64, 847.42, 847.42, 847.42, 847.42, 847.42, 852.17, 852.17, 852.17, 852.17, 852.17, 825.82, 825.82, 825.82, 825.82, 825.82, 826.56, 826.56, 826.56, 826.56, 826.56, 822.91, 822.91, 822.91, 822.91, 822.91, 820.89, 820.89, 820.89, 820.89, 820.89, 821.42, 821.42, 821.42, 821.42, 821.42, 824.32, 824.32, 824.32, 824.32, 824.32, 827.18, 827.18, 827.18, 827.18, 827.18, 835.54, 835.54, 835.54, 835.54, 835.54, 829.11, 829.11, 829.11, 829.11, 829.11, 831.34, 831.34, 831.34, 831.34, 831.34, 831.34, 831.34, 831.34, 831.34, 831.34, 829.32, 829.32, 829.32, 829.32, 829.32, 826.48, 826.48, 826.48, 826.48, 826.48, 831.67, 831.67, 831.67, 831.67, 831.67, 833.69, 833.69, 833.69, 833.69, 833.69, 833.91, 833.91, 833.91, 833.91, 833.91, 838.44, 838.44, 838.44, 838.44, 838.44, 838.77, 838.77, 838.77, 838.77, 838.77, 834.89, 834.89, 834.89, 834.89, 834.89, 837.98, 837.98, 837.98, 837.98, 837.98, 838.47, 838.47, 838.47, 838.47, 838.47, 842.73, 842.73, 842.73, 842.73, 842.73, 844.12, 844.12, 844.12, 844.12, 844.12, 843.8, 843.8, 843.8, 843.8, 843.8, 843.03, 843.03, 843.03, 843.03, 843.03, 844.51, 844.51, 844.51, 844.51, 844.51, 844.55, 844.55, 844.55, 844.55, 844.55, 847.44, 847.44, 847.44, 847.44, 847.44, 847.95, 847.95, 847.95, 847.95]

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 550 iterations"
    y-axis "llamacpp:predicted_tokens_seconds"
    x-axis "llamacpp:predicted_tokens_seconds" 1715157035 --> 1715157661
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 40.97, 40.97, 40.97, 40.97, 40.97, 26.64, 26.64, 26.64, 26.64, 26.64, 26.1, 26.1, 26.1, 26.1, 26.1, 27.78, 27.78, 27.78, 27.78, 27.78, 29.54, 29.54, 29.54, 29.54, 29.54, 31.01, 31.01, 31.01, 31.01, 31.01, 32.73, 32.73, 32.73, 32.73, 32.73, 33.13, 33.13, 33.13, 33.13, 33.13, 33.24, 33.24, 33.24, 33.24, 33.24, 33.95, 33.95, 33.95, 33.95, 33.95, 34.09, 34.09, 34.09, 34.09, 34.09, 33.88, 33.88, 33.88, 33.88, 33.88, 32.85, 32.85, 32.85, 32.85, 32.85, 31.8, 31.8, 31.8, 31.8, 31.8, 31.96, 31.96, 31.96, 31.96, 31.96, 32.13, 32.13, 32.13, 32.13, 32.13, 32.44, 32.44, 32.44, 32.44, 32.44, 31.93, 31.93, 31.93, 31.93, 31.93, 31.99, 31.99, 31.99, 31.99, 31.99, 32.13, 32.13, 32.13, 32.13, 32.13, 32.36, 32.36, 32.36, 32.36, 32.36, 32.4, 32.4, 32.4, 32.4, 32.4, 32.53, 32.53, 32.53, 32.53, 32.53, 32.74, 32.74, 32.74, 32.74, 32.74, 32.79, 32.79, 32.79, 32.79, 32.79, 32.14, 32.14, 32.14, 32.14, 32.14, 31.88, 31.88, 31.88, 31.88, 31.88, 32.12, 32.12, 32.12, 32.12, 32.12, 32.31, 32.31, 32.31, 32.31, 32.31, 32.36, 32.36, 32.36, 32.36, 32.36, 32.55, 32.55, 32.55, 32.55, 32.55, 32.51, 32.51, 32.51, 32.51, 32.51, 32.48, 32.48, 32.48, 32.48, 32.48, 32.35, 32.35, 32.35, 32.35, 32.35, 32.3, 32.3, 32.3, 32.3, 32.3, 32.02, 32.02, 32.02, 32.02, 32.02, 32.11, 32.11, 32.11, 32.11, 32.11, 32.17, 32.17, 32.17, 32.17, 32.17, 32.38, 32.38, 32.38, 32.38, 32.38, 32.47, 32.47, 32.47, 32.47, 32.47, 32.4, 32.4, 32.4, 32.4, 32.4, 32.15, 32.15, 32.15, 32.15, 32.15, 31.92, 31.92, 31.92, 31.92, 31.92, 30.22, 30.22, 30.22, 30.22, 30.22, 30.04, 30.04, 30.04, 30.04, 30.04, 30.09, 30.09, 30.09, 30.09, 30.09, 30.09, 30.09, 30.09, 30.09, 30.09, 30.15, 30.15, 30.15, 30.15, 30.15, 30.27, 30.27, 30.27, 30.27, 30.27, 30.3, 30.3, 30.3, 30.3, 30.3, 30.46, 30.46, 30.46, 30.46, 30.46, 30.41, 30.41, 30.41, 30.41, 30.41, 30.2, 30.2, 30.2, 30.2, 30.2, 30.11, 30.11, 30.11, 30.11, 30.11, 30.17, 30.17, 30.17, 30.17, 30.17, 30.3, 30.3, 30.3, 30.3, 30.3, 30.43, 30.43, 30.43, 30.43, 30.43, 30.47, 30.47, 30.47, 30.47, 30.47, 30.54, 30.54, 30.54, 30.54, 30.54, 30.56, 30.56, 30.56, 30.56, 30.56, 30.55, 30.55, 30.55, 30.55]

Details

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 550 iterations"
    y-axis "llamacpp:kv_cache_usage_ratio"
    x-axis "llamacpp:kv_cache_usage_ratio" 1715157035 --> 1715157661
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.23, 0.23, 0.23, 0.23, 0.23, 0.43, 0.43, 0.43, 0.43, 0.43, 0.28, 0.28, 0.28, 0.28, 0.28, 0.16, 0.16, 0.16, 0.16, 0.16, 0.19, 0.19, 0.19, 0.19, 0.19, 0.15, 0.15, 0.15, 0.15, 0.15, 0.13, 0.13, 0.13, 0.13, 0.13, 0.1, 0.1, 0.1, 0.1, 0.1, 0.21, 0.21, 0.21, 0.21, 0.21, 0.11, 0.11, 0.11, 0.11, 0.11, 0.16, 0.16, 0.16, 0.16, 0.16, 0.26, 0.26, 0.26, 0.26, 0.26, 0.31, 0.31, 0.31, 0.31, 0.31, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.16, 0.16, 0.16, 0.16, 0.16, 0.29, 0.29, 0.29, 0.29, 0.29, 0.11, 0.11, 0.11, 0.11, 0.11, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.1, 0.1, 0.1, 0.1, 0.1, 0.12, 0.12, 0.12, 0.12, 0.12, 0.11, 0.11, 0.11, 0.11, 0.11, 0.19, 0.19, 0.19, 0.19, 0.19, 0.29, 0.29, 0.29, 0.29, 0.29, 0.27, 0.27, 0.27, 0.27, 0.27, 0.09, 0.09, 0.09, 0.09, 0.09, 0.11, 0.11, 0.11, 0.11, 0.11, 0.1, 0.1, 0.1, 0.1, 0.1, 0.14, 0.14, 0.14, 0.14, 0.14, 0.19, 0.19, 0.19, 0.19, 0.19, 0.18, 0.18, 0.18, 0.18, 0.18, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.32, 0.32, 0.32, 0.32, 0.32, 0.16, 0.16, 0.16, 0.16, 0.16, 0.17, 0.17, 0.17, 0.17, 0.17, 0.09, 0.09, 0.09, 0.09, 0.09, 0.11, 0.11, 0.11, 0.11, 0.11, 0.29, 0.29, 0.29, 0.29, 0.29, 0.5, 0.5, 0.5, 0.5, 0.5, 0.57, 0.57, 0.57, 0.57, 0.57, 0.67, 0.67, 0.67, 0.67, 0.67, 0.59, 0.59, 0.59, 0.59, 0.59, 0.09, 0.09, 0.09, 0.09, 0.09, 0.17, 0.17, 0.17, 0.17, 0.17, 0.23, 0.23, 0.23, 0.23, 0.23, 0.1, 0.1, 0.1, 0.1, 0.1, 0.13, 0.13, 0.13, 0.13, 0.13, 0.14, 0.14, 0.14, 0.14, 0.14, 0.17, 0.17, 0.17, 0.17, 0.17, 0.27, 0.27, 0.27, 0.27, 0.27, 0.14, 0.14, 0.14, 0.14, 0.14, 0.19, 0.19, 0.19, 0.19, 0.19, 0.21, 0.21, 0.21, 0.21, 0.21, 0.1, 0.1, 0.1, 0.1, 0.1, 0.11, 0.11, 0.11, 0.11, 0.11, 0.21, 0.21, 0.21, 0.21, 0.21, 0.12, 0.12, 0.12, 0.12, 0.12, 0.18, 0.18, 0.18, 0.18, 0.18, 0.21, 0.21, 0.21, 0.21]

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 550 iterations"
    y-axis "llamacpp:requests_processing"
    x-axis "llamacpp:requests_processing" 1715157035 --> 1715157661
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 4.0, 4.0, 4.0, 4.0, 4.0, 1.0, 1.0, 1.0, 1.0, 1.0, 4.0, 4.0, 4.0, 4.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 5.0, 5.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 4.0, 4.0, 4.0, 4.0, 4.0, 2.0, 2.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0, 4.0, 6.0, 6.0, 6.0, 6.0, 6.0, 4.0, 4.0, 4.0, 4.0, 4.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 5.0, 5.0, 5.0, 5.0, 5.0, 7.0, 7.0, 7.0, 7.0, 7.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 3.0, 3.0, 3.0, 3.0, 3.0, 8.0, 8.0, 8.0, 8.0, 8.0, 5.0, 5.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 2.0, 2.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0, 4.0, 7.0, 7.0, 7.0, 7.0, 7.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 5.0, 5.0, 5.0, 5.0, 5.0, 2.0, 2.0, 2.0, 2.0, 2.0, 8.0, 8.0, 8.0, 8.0, 8.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 2.0, 2.0, 2.0, 2.0, 2.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 2.0, 2.0, 2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0]

examples/server/tests/features/embeddings.feature

ggerganov · 2024-05-08T10:03:43Z

Using your example, I added the following test to convert-hf-to-gguf-update.py:

@@ -252,6 +255,7 @@ tests = [
     "3333333",
     "33333333",
     "333333333",
+    "乡—서울에서",
     chktxt,
 ]

The unit test using the bert vocab still fails with this PR:

# regenerate test data
python3 convert-hf-to-gguf-update.py <hf_token>

# run tests
make -j tests && ./tests/test-tokenizer-0 ./models/ggml-vocab-bert-bge.gguf

Result:

src: '乡—서울에서'
res: '[UNK] —[UNK]'
tok: 100 1517 100 
main : failed test:    '乡—서울에서'
main : detokenized to: '[UNK] —[UNK]' instead of '[UNK] — 서울에서'
main : expected tokens:    100 '[UNK]',   1517 ' —',   1461 ' ᄉ',  30008 'ᅥ',  29999 'ᄋ',  30014 'ᅮ',  30022 'ᆯ',  29999 'ᄋ',  30009 'ᅦ',  29997 'ᄉ',  30008 'ᅥ', 
main : got tokens:         100 '[UNK]',   1517 ' —',    100 '[UNK]', 

Tests failed

Shouldn't it pass with the NFD fixes?

JoanFM · 2024-05-08T10:16:17Z

ggml-vocab-bert-bge.gguf

What exactly I need to pass to the convert-hf-to-gguf-update.py call?

ggerganov · 2024-05-08T10:17:51Z

You need a HuggingFace account: https://huggingface.co/settings/tokens

Pass the read token from the link above

If you don't have HF account, I can push the test to your branch manually?

JoanFM · 2024-05-08T10:18:15Z

You need a HuggingFace account: https://huggingface.co/settings/tokens

Pass the read token from the link above

If you don't have HF account, I can push the test to your branch manually?

u can feel free to push yes

JoanFM · 2024-05-08T10:20:40Z

You need a HuggingFace account: https://huggingface.co/settings/tokens

Pass the read token from the link above

If you don't have HF account, I can push the test to your branch manually?

I get the error:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/joan/workspace/ollama/llm/llama.cpp/convert-hf-to-gguf-update.py", line 139, in <module>
    tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
  File "/home/joan/workspace/ollama/llm/llama.cpp/gguf-research/venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 819, in from_pretrained
    config = AutoConfig.from_pretrained(
  File "/home/joan/workspace/ollama/llm/llama.cpp/gguf-research/venv/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 928, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/joan/workspace/ollama/llm/llama.cpp/gguf-research/venv/lib/python3.10/site-packages/transformers/configuration_utils.py", line 631, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/joan/workspace/ollama/llm/llama.cpp/gguf-research/venv/lib/python3.10/site-packages/transformers/configuration_utils.py", line 686, in _get_config_dict
    resolved_config_file = cached_file(
  File "/home/joan/workspace/ollama/llm/llama.cpp/gguf-research/venv/lib/python3.10/site-packages/transformers/utils/hub.py", line 369, in cached_file
    raise EnvironmentError(
OSError: models/tokenizers/llama-bpe does not appear to have a file named config.json. Checkout 'https://huggingface.co/models/tokenizers/llama-bpe/tree/None' for available files.

ggerganov · 2024-05-08T10:24:18Z

Pull and run:

make -j tests && ./tests/test-tokenizer-0 ./models/ggml-vocab-bert-bge.gguf

This compares the resulting tokens when using hf/tokenizers and llama.cpp.
The test that fails is with the string 乡—서울에서:

HF tokenizers produces the tokens: [100, 1517, 1461, 30008, 29999, 30014, 30022, 29999, 30009, 29997, 30008]
llama.cpp produces the tokens: [100, 1517, 100]

JoanFM · 2024-05-08T10:26:05Z

Pull and run:
make -j tests && ./tests/test-tokenizer-0 ./models/ggml-vocab-bert-bge.gguf
This compares the resulting tokens when using hf/tokenizers and llama.cpp. The test that fails is with the string 乡—서울에서:

HF tokenizers produces the tokens: [100, 1517, 1461, 30008, 29999, 30014, 30022, 29999, 30009, 29997, 30008]

llama.cpp produces the tokens: [100, 1517, 100]

How can I know the hf model or tokenizer that is using? Like this I can debug better

ggerganov · 2024-05-08T10:32:22Z

The reference HF tokenizer model is listed in the convert-hf-to-gguf-update.py script:

https://github.com/ggerganov/llama.cpp/blob/acdce3cdef6fc2f0b7b5623231fd7762c0884d1c/convert-hf-to-gguf-update.py#L56-L71

For the BERT model "bert-bge", the link is: https://huggingface.co/BAAI/bge-small-en-v1.5/tree/main

You can clone that repository and write a Python script that imports AutoTokenizer from the transformers package and loads the tokenizer model from that repository. This is the reference tokenization that we try to match in llama.cpp

The convert-hf-to-gguf-update.py script already does that and generates a set of unit tests

JoanFM · 2024-05-08T10:39:55Z

bge

okey, thanks I will investigate further thx

JoanFM · 2024-05-08T12:10:48Z

I believe this test also fails in master without my changes, so maybe the issue does not come from the normalization?

The result without the change and with the change give the same result

ggerganov · 2024-05-08T12:24:02Z

Yes, it's possible. I was trying to add a test for the NFD changes. Can you provide a text that does not work on master but works with this PR?

JoanFM · 2024-05-08T13:06:30Z

Yes, it's possible. I was trying to add a test for the NFD changes. Can you provide a text that does not work on master but works with this PR?

I am trying but it seems quite hard,

I find cases where the NFD output differs, but the end tokenization results in the same tokens.

The façade of the café was uniquely designed with naïve motifs and coöperate signage.

This is expanded ny NFD but the extra codepoints do not reflect in the tokenization. I have tried with mutliple examples and no success.

If you want I can close this PR. Now I have a little bit more clarity and most of the cases I would just ignore NFC and NFD for my problem and hope for it to work most of the time.

ggerganov · 2024-05-08T14:09:25Z

If you think the find -> equal_range makes sense, we can merge. But ideally it would be nice to have a test that validates the behaviour

JoanFM · 2024-05-08T14:12:32Z

If you think the find -> equal_range makes sense, we can merge. But ideally it would be nice to have a test that validates the behaviour

Seems to make sense in the strict sense of the algorithm, but I tried several cases quite complex and could not find the case, because it seems that the second part is not part of the vocabulary.

JoanFM · 2024-05-08T14:17:27Z

If you think the find -> equal_range makes sense, we can merge. But ideally it would be nice to have a test that validates the behaviour

Seems to make sense in the strict sense of the algorithm, but I tried several cases quite complex and could not find the case, because it seems that the second part is not part of the vocabulary.

Maybe @iamlemec can provide more context and finds a case where it could change?

iamlemec · 2024-05-08T16:39:04Z

Ok, so I've searched programmatically over the UnicodeData.txt data and it appears there are no codepoints that decompose into more than one letter. Ligatures like "æ" do not get decomposed here. And since the letter is always listed first, I'm not sure there exists such an example. To me, this points in the direction of keeping find as is and possibly turning unicode_map_nfd into a regular map.

Now, this business with the Korean (Hangul) letters is interesting as a separate issue. Looks like the HF tokenizer handles those programmatically, as recommended. Since there are 19 x 21 x 28 character combinations, you can just invert the codepoint like a flat index. I would be happy to address this in another PR if that works for folks.

JoanFM · 2024-05-08T19:00:31Z

Let's close this, thanks for the help.

I got some nice insights into this new world of Unicode and Normalization.

Thanks @ggerganov , @iamlemec and @teleprint-me !

fix: fix NFD computation

88e943f

JoanFM mentioned this pull request May 7, 2024

Tokenizers questions and ... proposals? #6980

Closed

JoanFM force-pushed the fix-nfd branch from fd05e40 to 9a7df74 Compare May 7, 2024 12:56

fix: add real values

d6edc62

JoanFM force-pushed the fix-nfd branch from 9a7df74 to d6edc62 Compare May 7, 2024 14:27

JoanFM added 2 commits May 7, 2024 16:56

fix: do not insert in the middle of iteration

eb5a0e1

test: add a new step test

d9e2903

JoanFM force-pushed the fix-nfd branch from 94f702e to d9e2903 Compare May 7, 2024 15:38

fix: fix infinite recursion

043f298

JoanFM force-pushed the fix-nfd branch from 9514e03 to 043f298 Compare May 7, 2024 16:15

JoanFM commented May 8, 2024

View reviewed changes

examples/server/tests/features/embeddings.feature Show resolved Hide resolved

feat: remove extra complexity in NFD

668e0d9

JoanFM force-pushed the fix-nfd branch from c0aedfe to 668e0d9 Compare May 8, 2024 08:17

JoanFM marked this pull request as ready for review May 8, 2024 08:23

tests : add NFD-related test

492f75c

JoanFM closed this May 8, 2024

Fix NFD computation #7122

Fix NFD computation #7122

Uh oh!

Conversation

JoanFM commented May 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented May 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

ggerganov commented May 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JoanFM commented May 8, 2024

Uh oh!

ggerganov commented May 8, 2024

Uh oh!

JoanFM commented May 8, 2024

Uh oh!

JoanFM commented May 8, 2024

Uh oh!

ggerganov commented May 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JoanFM commented May 8, 2024

Uh oh!

ggerganov commented May 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JoanFM commented May 8, 2024

Uh oh!

JoanFM commented May 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ggerganov commented May 8, 2024

Uh oh!

JoanFM commented May 8, 2024

Uh oh!

ggerganov commented May 8, 2024

Uh oh!

JoanFM commented May 8, 2024

Uh oh!

JoanFM commented May 8, 2024

Uh oh!

iamlemec commented May 8, 2024

Uh oh!

JoanFM commented May 8, 2024

Uh oh!

Uh oh!

JoanFM commented May 7, 2024 •

edited

Loading

github-actions bot commented May 7, 2024 •

edited

Loading

ggerganov commented May 8, 2024 •

edited

Loading

ggerganov commented May 8, 2024 •

edited

Loading

ggerganov commented May 8, 2024 •

edited

Loading

JoanFM commented May 8, 2024 •

edited

Loading