Skip to content

Conversation

tristandruyen
Copy link
Contributor

@tristandruyen tristandruyen commented May 22, 2024

Reported in & Resolves #7432

Problem

Zephyr & Phi3 use similar tokens like <|assistant|> & <|user|>, but they use different eos tokens, as the zephyr template appears first in the if else chain, phi3 is misdetected as zephyr, which causes wrong <|endoftext|> instead of <|end|> tokens to appear in the prompt as reported in the referenced issue.

Root-Cause

Phi-3's template, seems to have changed sometime after my PR #6857 was merged.
Initially it didn't explicitly include <|user|> but spliced the role into there...

OLD:

{{ bos_token }}{% for message in messages %}{{'<|' + message['role'] + '|>' + ' ' + message['content'] + '<|end|> ' }}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|> ' }}{% else %}{{ eos_token }}{% endif %}

NEW:

{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|user|>' + ' ' + message['content'] + '<|end|>' + ' ' + '<|assistant|>' + ' '}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|end|>' + ' '}}{% endif %}{% endfor %}

This causes this mis-matching as zephyr.

Solution

It's pretty simple, stop zephyr from applying to every model with <|user|> in the template by adding an additional requirement. reordering the if else's

@github-actions github-actions bot added the testing Everything test related label May 22, 2024
@tristandruyen tristandruyen changed the title Fix phi3 template matching vs zephyr Fix phi3 chat template confusion with zephyr May 22, 2024
@tristandruyen tristandruyen marked this pull request as ready for review May 22, 2024 01:54
Copy link
Contributor

github-actions bot commented May 22, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 529 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8849.99ms p(95)=21246.52ms fails=, finish reason: stop=473 truncated=56
  • Prompt processing (pp): avg=102.35tk/s p(95)=462.85tk/s
  • Token generation (tg): avg=32.41tk/s p(95)=46.3tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=3574d6369018002662b6650c5dc0c29657e71d7a

prompt_tokens_seconds

More
---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 529 iterations"
    y-axis "llamacpp:prompt_tokens_seconds"
    x-axis "llamacpp:prompt_tokens_seconds" 1716486846 --> 1716487472
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 335.8, 335.8, 335.8, 335.8, 335.8, 591.22, 591.22, 591.22, 591.22, 591.22, 568.5, 568.5, 568.5, 568.5, 568.5, 597.32, 597.32, 597.32, 597.32, 597.32, 630.3, 630.3, 630.3, 630.3, 630.3, 683.09, 683.09, 683.09, 683.09, 683.09, 685.99, 685.99, 685.99, 685.99, 685.99, 694.02, 694.02, 694.02, 694.02, 694.02, 706.0, 706.0, 706.0, 706.0, 706.0, 721.41, 721.41, 721.41, 721.41, 721.41, 723.21, 723.21, 723.21, 723.21, 723.21, 742.83, 742.83, 742.83, 742.83, 742.83, 789.57, 789.57, 789.57, 789.57, 789.57, 806.68, 806.68, 806.68, 806.68, 806.68, 803.4, 803.4, 803.4, 803.4, 803.4, 809.68, 809.68, 809.68, 809.68, 809.68, 813.43, 813.43, 813.43, 813.43, 813.43, 815.65, 815.65, 815.65, 815.65, 815.65, 814.94, 814.94, 814.94, 814.94, 814.94, 810.14, 810.14, 810.14, 810.14, 810.14, 811.01, 811.01, 811.01, 811.01, 811.01, 811.6, 811.6, 811.6, 811.6, 811.6, 817.49, 817.49, 817.49, 817.49, 817.49, 820.01, 820.01, 820.01, 820.01, 820.01, 841.62, 841.62, 841.62, 841.62, 841.62, 836.24, 836.24, 836.24, 836.24, 836.24, 838.88, 838.88, 838.88, 838.88, 838.88, 847.86, 847.86, 847.86, 847.86, 847.86, 852.04, 852.04, 852.04, 852.04, 852.04, 852.41, 852.41, 852.41, 852.41, 852.41, 851.69, 851.69, 851.69, 851.69, 851.69, 857.43, 857.43, 857.43, 857.43, 857.43, 856.59, 856.59, 856.59, 856.59, 856.59, 855.02, 855.02, 855.02, 855.02, 855.02, 858.11, 858.11, 858.11, 858.11, 858.11, 859.74, 859.74, 859.74, 859.74, 859.74, 864.34, 864.34, 864.34, 864.34, 864.34, 863.81, 863.81, 863.81, 863.81, 863.81, 844.47, 844.47, 844.47, 844.47, 844.47, 841.06, 841.06, 841.06, 841.06, 841.06, 840.37, 840.37, 840.37, 840.37, 840.37, 843.06, 843.06, 843.06, 843.06, 843.06, 845.14, 845.14, 845.14, 845.14, 845.14, 845.78, 845.78, 845.78, 845.78, 845.78, 861.76, 861.76, 861.76, 861.76, 861.76, 860.59, 860.59, 860.59, 860.59, 860.59, 858.96, 858.96, 858.96, 858.96, 858.96, 857.73, 857.73, 857.73, 857.73, 857.73, 856.24, 856.24, 856.24, 856.24, 856.24, 862.25, 862.25, 862.25, 862.25, 862.25, 861.61, 861.61, 861.61, 861.61, 861.61, 864.28, 864.28, 864.28, 864.28, 864.28, 865.52, 865.52, 865.52, 865.52, 865.52, 868.9, 868.9, 868.9, 868.9, 868.9, 865.3, 865.3, 865.3, 865.3, 865.3, 865.81, 865.81, 865.81, 865.81, 865.81, 869.07, 869.07, 869.07, 869.07, 869.07, 870.63, 870.63, 870.63, 870.63, 870.63, 869.99, 869.99, 869.99, 869.99, 869.99, 870.98, 870.98, 870.98, 870.98, 870.98, 869.52, 869.52, 869.52, 869.52]
                    
Loading
predicted_tokens_seconds
More
---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 529 iterations"
    y-axis "llamacpp:predicted_tokens_seconds"
    x-axis "llamacpp:predicted_tokens_seconds" 1716486846 --> 1716487472
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 43.26, 43.26, 43.26, 43.26, 43.26, 45.49, 45.49, 45.49, 45.49, 45.49, 26.53, 26.53, 26.53, 26.53, 26.53, 29.95, 29.95, 29.95, 29.95, 29.95, 31.24, 31.24, 31.24, 31.24, 31.24, 31.57, 31.57, 31.57, 31.57, 31.57, 32.87, 32.87, 32.87, 32.87, 32.87, 34.26, 34.26, 34.26, 34.26, 34.26, 34.54, 34.54, 34.54, 34.54, 34.54, 34.54, 34.54, 34.54, 34.54, 34.54, 33.97, 33.97, 33.97, 33.97, 33.97, 33.95, 33.95, 33.95, 33.95, 33.95, 33.17, 33.17, 33.17, 33.17, 33.17, 32.59, 32.59, 32.59, 32.59, 32.59, 32.29, 32.29, 32.29, 32.29, 32.29, 31.99, 31.99, 31.99, 31.99, 31.99, 30.59, 30.59, 30.59, 30.59, 30.59, 29.96, 29.96, 29.96, 29.96, 29.96, 29.9, 29.9, 29.9, 29.9, 29.9, 29.98, 29.98, 29.98, 29.98, 29.98, 29.73, 29.73, 29.73, 29.73, 29.73, 29.74, 29.74, 29.74, 29.74, 29.74, 29.78, 29.78, 29.78, 29.78, 29.78, 29.97, 29.97, 29.97, 29.97, 29.97, 30.12, 30.12, 30.12, 30.12, 30.12, 29.99, 29.99, 29.99, 29.99, 29.99, 30.07, 30.07, 30.07, 30.07, 30.07, 30.42, 30.42, 30.42, 30.42, 30.42, 30.24, 30.24, 30.24, 30.24, 30.24, 30.34, 30.34, 30.34, 30.34, 30.34, 30.59, 30.59, 30.59, 30.59, 30.59, 30.73, 30.73, 30.73, 30.73, 30.73, 30.86, 30.86, 30.86, 30.86, 30.86, 30.96, 30.96, 30.96, 30.96, 30.96, 31.08, 31.08, 31.08, 31.08, 31.08, 31.13, 31.13, 31.13, 31.13, 31.13, 31.08, 31.08, 31.08, 31.08, 31.08, 30.77, 30.77, 30.77, 30.77, 30.77, 30.67, 30.67, 30.67, 30.67, 30.67, 30.52, 30.52, 30.52, 30.52, 30.52, 30.48, 30.48, 30.48, 30.48, 30.48, 30.53, 30.53, 30.53, 30.53, 30.53, 30.62, 30.62, 30.62, 30.62, 30.62, 30.79, 30.79, 30.79, 30.79, 30.79, 30.64, 30.64, 30.64, 30.64, 30.64, 30.48, 30.48, 30.48, 30.48, 30.48, 30.19, 30.19, 30.19, 30.19, 30.19, 29.57, 29.57, 29.57, 29.57, 29.57, 29.11, 29.11, 29.11, 29.11, 29.11, 29.05, 29.05, 29.05, 29.05, 29.05, 29.08, 29.08, 29.08, 29.08, 29.08, 29.16, 29.16, 29.16, 29.16, 29.16, 29.23, 29.23, 29.23, 29.23, 29.23, 29.27, 29.27, 29.27, 29.27, 29.27, 29.3, 29.3, 29.3, 29.3, 29.3, 29.3, 29.3, 29.3, 29.3, 29.3, 29.23, 29.23, 29.23, 29.23, 29.23, 29.3, 29.3, 29.3, 29.3, 29.3, 29.48, 29.48, 29.48, 29.48, 29.48, 29.5, 29.5, 29.5, 29.5, 29.5, 29.61, 29.61, 29.61, 29.61]
                    
Loading

Details

kv_cache_usage_ratio

More
---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 529 iterations"
    y-axis "llamacpp:kv_cache_usage_ratio"
    x-axis "llamacpp:kv_cache_usage_ratio" 1716486846 --> 1716487472
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.12, 0.12, 0.12, 0.12, 0.12, 0.37, 0.37, 0.37, 0.37, 0.37, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.11, 0.11, 0.11, 0.11, 0.11, 0.13, 0.13, 0.13, 0.13, 0.13, 0.21, 0.21, 0.21, 0.21, 0.21, 0.19, 0.19, 0.19, 0.19, 0.19, 0.22, 0.22, 0.22, 0.22, 0.22, 0.08, 0.08, 0.08, 0.08, 0.08, 0.28, 0.28, 0.28, 0.28, 0.28, 0.25, 0.25, 0.25, 0.25, 0.25, 0.35, 0.35, 0.35, 0.35, 0.35, 0.33, 0.33, 0.33, 0.33, 0.33, 0.26, 0.26, 0.26, 0.26, 0.26, 0.22, 0.22, 0.22, 0.22, 0.22, 0.17, 0.17, 0.17, 0.17, 0.17, 0.3, 0.3, 0.3, 0.3, 0.3, 0.24, 0.24, 0.24, 0.24, 0.24, 0.13, 0.13, 0.13, 0.13, 0.13, 0.14, 0.14, 0.14, 0.14, 0.14, 0.16, 0.16, 0.16, 0.16, 0.16, 0.34, 0.34, 0.34, 0.34, 0.34, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.2, 0.2, 0.2, 0.2, 0.2, 0.11, 0.11, 0.11, 0.11, 0.11, 0.14, 0.14, 0.14, 0.14, 0.14, 0.09, 0.09, 0.09, 0.09, 0.09, 0.17, 0.17, 0.17, 0.17, 0.17, 0.15, 0.15, 0.15, 0.15, 0.15, 0.19, 0.19, 0.19, 0.19, 0.19, 0.14, 0.14, 0.14, 0.14, 0.14, 0.31, 0.31, 0.31, 0.31, 0.31, 0.27, 0.27, 0.27, 0.27, 0.27, 0.28, 0.28, 0.28, 0.28, 0.28, 0.31, 0.31, 0.31, 0.31, 0.31, 0.23, 0.23, 0.23, 0.23, 0.23, 0.09, 0.09, 0.09, 0.09, 0.09, 0.16, 0.16, 0.16, 0.16, 0.16, 0.12, 0.12, 0.12, 0.12, 0.12, 0.09, 0.09, 0.09, 0.09, 0.09, 0.51, 0.51, 0.51, 0.51, 0.51, 0.67, 0.67, 0.67, 0.67, 0.67, 0.59, 0.59, 0.59, 0.59, 0.59, 0.33, 0.33, 0.33, 0.33, 0.33, 0.2, 0.2, 0.2, 0.2, 0.2, 0.18, 0.18, 0.18, 0.18, 0.18, 0.14, 0.14, 0.14, 0.14, 0.14, 0.16, 0.16, 0.16, 0.16, 0.16, 0.21, 0.21, 0.21, 0.21, 0.21, 0.15, 0.15, 0.15, 0.15, 0.15, 0.25, 0.25, 0.25, 0.25, 0.25, 0.21, 0.21, 0.21, 0.21, 0.21, 0.15, 0.15, 0.15, 0.15, 0.15, 0.2, 0.2, 0.2, 0.2, 0.2, 0.08, 0.08, 0.08, 0.08, 0.08, 0.11, 0.11, 0.11, 0.11, 0.11, 0.13, 0.13, 0.13, 0.13]
                    
Loading
requests_processing
More
---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 529 iterations"
    y-axis "llamacpp:requests_processing"
    x-axis "llamacpp:requests_processing" 1716486846 --> 1716487472
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 4.0, 4.0, 4.0, 4.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 5.0, 5.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 2.0, 2.0, 2.0, 2.0, 2.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 2.0, 2.0, 2.0, 2.0, 2.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, 4.0, 4.0, 4.0, 4.0, 4.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 2.0, 2.0, 2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0, 8.0, 8.0, 8.0, 8.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0, 8.0, 8.0, 8.0, 8.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0, 8.0, 8.0, 8.0, 8.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 4.0, 4.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, 2.0, 2.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0, 4.0, 1.0, 1.0, 1.0, 1.0, 1.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0]
                    
Loading

@mofosyne mofosyne added model Model specific Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix labels May 22, 2024
Copy link
Collaborator

@ngxson ngxson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall. It would be nice to have all 4 templates in the test.

Co-authored-by: Xuan Son Nguyen <[email protected]>
@ngxson
Copy link
Collaborator

ngxson commented May 23, 2024

OK I'll merge when CI passes

@tristandruyen
Copy link
Contributor Author

The test failures seem unrelated to the changes, should be good to merge.

@ngxson ngxson merged commit 007489e into ggml-org:master May 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
model Model specific Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix testing Everything test related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[server] phi-3 uses <|endoftext|> instead of <|end|> when applying chat template in /chat/completions
3 participants