
Fix NFD computation #7122


Closed
wants to merge 7 commits into from

Conversation

JoanFM
Contributor

@JoanFM JoanFM commented May 7, 2024

I am trying to change the NFD computation according to https://unicode.org/reports/tr15/#Description_Norm

The changes are:

  • Get the range from the nfd_map and apply the decomposition recursively. (Edit) According to @iamlemec, the nfd_map is constructed in a way that the recursion is already applied. (A sketch of the approach follows the TODO list below.)
  • Sort the results according to the Canonical_Combining_Class.

TODO

  • Properly fill the unicode_canonical_class map
  • Do some tests and validate the implementation
  • Fix the implementation to avoid a potential problem with inserting into a vector while iterating over it
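
A minimal sketch of the intended approach (illustrative only; the table names nfd_map and unicode_canonical_class follow the description above, but their exact types and contents in unicode.cpp are assumptions):

#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical lookup tables generated from UnicodeData.txt:
// codepoint -> decomposed codepoint(s), and codepoint -> Canonical_Combining_Class.
static const std::multimap<uint32_t, uint32_t> nfd_map = { /* ... */ };
static const std::map<uint32_t, int> unicode_canonical_class = { /* ... */ };

static int get_ccc(uint32_t cp) {
    auto it = unicode_canonical_class.find(cp);
    return it == unicode_canonical_class.end() ? 0 : it->second; // 0 = starter
}

// Decompose one codepoint, recursing until no further mapping exists.
static void decompose(uint32_t cp, std::vector<uint32_t> & out) {
    auto range = nfd_map.equal_range(cp);
    if (range.first == range.second) {
        out.push_back(cp); // no decomposition, keep the codepoint as-is
        return;
    }
    for (auto it = range.first; it != range.second; ++it) {
        decompose(it->second, out);
    }
}

static std::vector<uint32_t> to_nfd(const std::vector<uint32_t> & cps) {
    std::vector<uint32_t> out;
    for (uint32_t cp : cps) {
        decompose(cp, out);
    }
    // Canonical ordering: stable-sort each run of non-starters (ccc > 0)
    // by Canonical_Combining_Class.
    size_t i = 0;
    while (i < out.size()) {
        if (get_ccc(out[i]) == 0) { ++i; continue; }
        size_t j = i;
        while (j < out.size() && get_ccc(out[j]) > 0) ++j;
        std::stable_sort(out.begin() + i, out.begin() + j,
                         [](uint32_t a, uint32_t b) { return get_ccc(a) < get_ccc(b); });
        i = j;
    }
    return out;
}

Building the result in a separate output vector also sidesteps the last TODO item about inserting into a vector while iterating over it.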

Contributor

github-actions bot commented May 7, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 550 iterations 🚀

Details (for performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8546.39ms p(95)=20204.48ms fails=, finish reason: stop=489 truncated=61
  • Prompt processing (pp): avg=98.39tk/s p(95)=372.49tk/s
  • Token generation (tg): avg=35.22tk/s p(95)=50.21tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=fix-nfd commit=c0aedfec8338e8136a69ec2ccee2528e479f2834

(Benchmark charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing.)

@JoanFM JoanFM marked this pull request as ready for review May 8, 2024 08:23
@ggerganov
Member

ggerganov commented May 8, 2024

Using your example, I added the following test to convert-hf-to-gguf-update.py:

@@ -252,6 +255,7 @@ tests = [
     "3333333",
     "33333333",
     "333333333",
+    "乡—서울에서",
     chktxt,
 ]

The unit test using the BERT vocab still fails with this PR:

# regenerate test data
python3 convert-hf-to-gguf-update.py <hf_token>

# run tests
make -j tests && ./tests/test-tokenizer-0 ./models/ggml-vocab-bert-bge.gguf

Result:

src: '乡—서울에서'
res: '[UNK] —[UNK]'
tok: 100 1517 100 
main : failed test:    '乡—서울에서'
main : detokenized to: '[UNK] —[UNK]' instead of '[UNK] — 서울에서'
main : expected tokens:    100 '[UNK]',   1517 ' —',   1461 ' ᄉ',  30008 'ᅥ',  29999 'ᄋ',  30014 'ᅮ',  30022 'ᆯ',  29999 'ᄋ',  30009 'ᅦ',  29997 'ᄉ',  30008 'ᅥ', 
main : got tokens:         100 '[UNK]',   1517 ' —',    100 '[UNK]', 

Tests failed

Shouldn't it pass with the NFD fixes?

@JoanFM
Contributor Author

JoanFM commented May 8, 2024

ggml-vocab-bert-bge.gguf

What exactly do I need to pass to the convert-hf-to-gguf-update.py call?

@ggerganov
Member

You need a HuggingFace account: https://huggingface.co/settings/tokens

Pass the read token from the link above

If you don't have an HF account, I can push the test to your branch manually.

@JoanFM
Contributor Author

JoanFM commented May 8, 2024

You need a HuggingFace account: https://huggingface.co/settings/tokens

Pass the read token from the link above

If you don't have an HF account, I can push the test to your branch manually.

You can feel free to push, yes.

@JoanFM
Contributor Author

JoanFM commented May 8, 2024

You need a HuggingFace account: https://huggingface.co/settings/tokens

Pass the read token from the link above

If you don't have an HF account, I can push the test to your branch manually.

I get the error:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/joan/workspace/ollama/llm/llama.cpp/convert-hf-to-gguf-update.py", line 139, in <module>
    tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
  File "/home/joan/workspace/ollama/llm/llama.cpp/gguf-research/venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 819, in from_pretrained
    config = AutoConfig.from_pretrained(
  File "/home/joan/workspace/ollama/llm/llama.cpp/gguf-research/venv/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 928, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/joan/workspace/ollama/llm/llama.cpp/gguf-research/venv/lib/python3.10/site-packages/transformers/configuration_utils.py", line 631, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/joan/workspace/ollama/llm/llama.cpp/gguf-research/venv/lib/python3.10/site-packages/transformers/configuration_utils.py", line 686, in _get_config_dict
    resolved_config_file = cached_file(
  File "/home/joan/workspace/ollama/llm/llama.cpp/gguf-research/venv/lib/python3.10/site-packages/transformers/utils/hub.py", line 369, in cached_file
    raise EnvironmentError(
OSError: models/tokenizers/llama-bpe does not appear to have a file named config.json. Checkout 'https://huggingface.co/models/tokenizers/llama-bpe/tree/None' for available files.

@ggerganov
Member

ggerganov commented May 8, 2024

Pull and run:

make -j tests && ./tests/test-tokenizer-0 ./models/ggml-vocab-bert-bge.gguf

This compares the resulting tokens when using hf/tokenizers and llama.cpp.
The test that fails is with the string 乡—서울에서:

  • HF tokenizers produces the tokens: [100, 1517, 1461, 30008, 29999, 30014, 30022, 29999, 30009, 29997, 30008]
  • llama.cpp produces the tokens: [100, 1517, 100]

@JoanFM
Contributor Author

JoanFM commented May 8, 2024

Pull and run:

make -j tests && ./tests/test-tokenizer-0 ./models/ggml-vocab-bert-bge.gguf

This compares the resulting tokens when using hf/tokenizers and llama.cpp. The test that fails is with the string 乡—서울에서:

  • HF tokenizers produces the tokens: [100, 1517, 1461, 30008, 29999, 30014, 30022, 29999, 30009, 29997, 30008]
  • llama.cpp produces the tokens: [100, 1517, 100]

How can I find out which HF model or tokenizer it is using? That way I can debug better.

@ggerganov
Member

ggerganov commented May 8, 2024

The reference HF tokenizer model is listed in the convert-hf-to-gguf-update.py script:

https://github.com/ggerganov/llama.cpp/blob/acdce3cdef6fc2f0b7b5623231fd7762c0884d1c/convert-hf-to-gguf-update.py#L56-L71

For the BERT model "bert-bge", the link is: https://huggingface.co/BAAI/bge-small-en-v1.5/tree/main

You can clone that repository and write a Python script that imports AutoTokenizer from the transformers package and loads the tokenizer model from that repository. This is the reference tokenization that we try to match in llama.cpp

The convert-hf-to-gguf-update.py script already does that and generates a set of unit tests

@JoanFM
Contributor Author

JoanFM commented May 8, 2024

bge

Okay, thanks, I will investigate further.

@JoanFM
Contributor Author

JoanFM commented May 8, 2024

I believe this test also fails in master without my changes, so maybe the issue does not come from the normalization?

The results without the change and with the change are the same.

@ggerganov
Member

Yes, it's possible. I was trying to add a test for the NFD changes. Can you provide a text that does not work on master but works with this PR?

@JoanFM
Contributor Author

JoanFM commented May 8, 2024

Yes, it's possible. I was trying to add a test for the NFD changes. Can you provide a text that does not work on master but works with this PR?

I am trying, but it seems quite hard.

I find cases where the NFD output differs, but the end tokenization results in the same tokens.

The façade of the café was uniquely designed with naïve motifs and coöperate signage.

This is expanded by NFD, but the extra codepoints are not reflected in the tokenization. I have tried multiple examples without success.

If you want, I can close this PR. Now I have a bit more clarity, and in most cases I would just ignore NFC and NFD for my problem and hope it works most of the time.

@ggerganov
Member

If you think the find -> equal_range makes sense, we can merge. But ideally it would be nice to have a test that validates the behaviour
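
For context, a rough illustration of the find -> equal_range difference being discussed, assuming the decomposition table is a multimap with one entry per output codepoint (the surrounding names are placeholders, not the actual llama.cpp code):

#include <cstdint>
#include <map>
#include <vector>

// One entry per output codepoint, all stored under the same key.
std::multimap<uint32_t, uint32_t> unicode_map_nfd;

std::vector<uint32_t> decompose_one(uint32_t cp) {
    std::vector<uint32_t> out;
    // find() would return only the first entry for cp, truncating a
    // multi-codepoint decomposition to its first element.
    // equal_range() returns the whole [first, last) run of entries for cp.
    auto range = unicode_map_nfd.equal_range(cp);
    if (range.first == range.second) {
        out.push_back(cp);
    } else {
        for (auto it = range.first; it != range.second; ++it) {
            out.push_back(it->second);
        }
    }
    return out;
}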

@JoanFM
Contributor Author

JoanFM commented May 8, 2024

If you think the find -> equal_range makes sense, we can merge. But ideally it would be nice to have a test that validates the behaviour

It seems to make sense in the strict sense of the algorithm, but I tried several quite complex cases and could not find such a case, because it seems that the second part of the decomposition is not part of the vocabulary.

@JoanFM
Contributor Author

JoanFM commented May 8, 2024

If you think the find -> equal_range makes sense, we can merge. But ideally it would be nice to have a test that validates the behaviour

It seems to make sense in the strict sense of the algorithm, but I tried several quite complex cases and could not find such a case, because it seems that the second part of the decomposition is not part of the vocabulary.

Maybe @iamlemec can provide more context and find a case where it could change?

@iamlemec
Collaborator

iamlemec commented May 8, 2024

Ok, so I've searched programmatically over the UnicodeData.txt data and it appears there are no codepoints that decompose into more than one letter. Ligatures like "æ" do not get decomposed here. And since the letter is always listed first, I'm not sure there exists such an example. To me, this points in the direction of keeping find as is and possibly turning unicode_map_nfd into a regular map.

Now, this business with the Korean (Hangul) letters is interesting as a separate issue. Looks like the HF tokenizer handles those programmatically, as recommended. Since there are 19 x 21 x 28 character combinations, you can just invert the codepoint like a flat index. I would be happy to address this in another PR if that works for folks.
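
For reference, the algorithmic Hangul decomposition that @iamlemec describes ("invert the codepoint like a flat index") could look roughly like this — a sketch using the standard Unicode constants, not an actual llama.cpp patch:

#include <cstdint>
#include <vector>

// Precomposed Hangul syllables U+AC00..U+D7A3 are built algorithmically from
// 19 leading consonants (L), 21 vowels (V) and 28 optional trailing consonants (T),
// so NFD can be computed by inverting the flat index instead of using a table.
static void decompose_hangul(uint32_t cp, std::vector<uint32_t> & out) {
    constexpr uint32_t SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
    constexpr uint32_t VCount = 21, TCount = 28, SCount = 19 * VCount * TCount; // 11172

    if (cp < SBase || cp >= SBase + SCount) {
        out.push_back(cp); // not a precomposed Hangul syllable
        return;
    }

    const uint32_t index = cp - SBase;
    out.push_back(LBase + index / (VCount * TCount));             // leading consonant
    out.push_back(VBase + (index % (VCount * TCount)) / TCount);  // vowel
    if (index % TCount != 0) {
        out.push_back(TBase + index % TCount);                    // trailing consonant
    }
}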

@JoanFM
Contributor Author

JoanFM commented May 8, 2024

Let's close this, thanks for the help.

I got some nice insights into this new world of Unicode and Normalization.

Thanks @ggerganov , @iamlemec and @teleprint-me !

@JoanFM JoanFM closed this May 8, 2024