
examples/server: "New UI" chat becomes slower with each subsequent message #7944

Closed
@khimaros

Description


What happened?

When using examples/server's "New UI", parts of the chat history seem to be re-evaluated (bypassing the KV cache?) on each new message from the user. This does not happen with llama-cli, or with examples/server in the old UI mode using the default settings/prompt.

This seems to be a common failure mode for third-party frontends to llama.cpp; perhaps there is an issue in the API layer that makes this problem difficult for frontends to solve? See #7185. A sketch of the expected client behavior follows below.
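For reference, here is a minimal sketch of how a frontend can ask the server to reuse the KV cache, assuming the /completion endpoint's cache_prompt parameter behaves as described in examples/server; the URL, prompt text, and n_predict value are placeholders:

```python
import json
import urllib.request

# Assumption: llama.cpp's examples/server is running locally on the default port.
SERVER_URL = "http://127.0.0.1:8080/completion"

def complete(prompt: str, cache_prompt: bool = True) -> dict:
    """Send a completion request; cache_prompt=True asks the server to keep the
    evaluated prompt in the KV cache so that a later request sharing the same
    prefix only needs to process the new suffix."""
    payload = {
        "prompt": prompt,
        "n_predict": 128,            # placeholder value
        "cache_prompt": cache_prompt,
    }
    req = urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Growing the prompt by appending turns should only re-evaluate the new tokens
# when the cached prefix is honored; the "timings" field in the response shows
# how many prompt tokens were actually processed.
history = "User: hello\nAssistant:"
result = complete(history)
print(result.get("timings", {}))
```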

Name and Version

version: 3151 (f8ec887)
built with cc (Debian 13.2.0-25) 13.2.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

INFO [           print_timings] prompt eval time     =     189.41 ms /     1 tokens (  189.41 ms per token,     5.28 tokens per second) | tid="140556433274816" timestamp=1718408696 id_slot=0 id_task=3534 t_prompt_processing=189.405 n_prompt_tokens_processed=1 t_token=189.405 n_tokens_second=5.2796916660067055

INFO [           print_timings] prompt eval time     =    2473.22 ms /    40 tokens (   61.83 ms per token,    16.17 tokens per second) | tid="140556433274816" timestamp=1718408717 id_slot=0 id_task=3564 t_prompt_processing=2473.219 n_prompt_tokens_processed=40 t_token=61.830475 n_tokens_second=16.173254370114414

INFO [           print_timings] prompt eval time     =    5231.45 ms /    83 tokens (   63.03 ms per token,    15.87 tokens per second) | tid="140556433274816" timestamp=1718408745 id_slot=0 id_task=3632 t_prompt_processing=5231.451 n_prompt_tokens_processed=83 t_token=63.02953012048193 n_tokens_second=15.865579167232953

INFO [           print_timings] prompt eval time     =    6692.69 ms /   105 tokens (   63.74 ms per token,    15.69 tokens per second) | tid="140556433274816" timestamp=1718408774 id_slot=0 id_task=3721 t_prompt_processing=6692.691 n_prompt_tokens_processed=105 t_token=63.739914285714285 n_tokens_second=15.688756585355577

INFO [           print_timings] prompt eval time     =    5536.72 ms /    90 tokens (   61.52 ms per token,    16.26 tokens per second) | tid="140556433274816" timestamp=1718408815 id_slot=0 id_task=3797 t_prompt_processing=5536.721 n_prompt_tokens_processed=90 t_token=61.519122222222215 n_tokens_second=16.255108393578077

INFO [           print_timings] prompt eval time     =    6353.86 ms /   106 tokens (   59.94 ms per token,    16.68 tokens per second) | tid="140556433274816" timestamp=1718408885 id_slot=0 id_task=3885 t_prompt_processing=6353.859 n_prompt_tokens_processed=106 t_token=59.942066037735856 n_tokens_second=16.68277498760989

INFO [           print_timings] prompt eval time     =    8704.61 ms /   134 tokens (   64.96 ms per token,    15.39 tokens per second) | tid="140556433274816" timestamp=1718408926 id_slot=0 id_task=4002 t_prompt_processing=8704.613 n_prompt_tokens_processed=134 t_token=64.95979850746268 n_tokens_second=15.3941364193905
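The growth is easy to quantify from these lines; here is a small sketch (assuming the print_timings log format pasted above) that pulls n_prompt_tokens_processed out of each line and prints the running total of re-evaluated prompt tokens, where the log file path is a placeholder:

```python
import re

# Assumption: log lines look like the print_timings output pasted above.
PATTERN = re.compile(r"n_prompt_tokens_processed=(\d+)")

def summarize(log_text: str) -> None:
    """Print prompt tokens processed per request and the cumulative total,
    to show whether earlier turns are being re-evaluated on each message."""
    total = 0
    for i, match in enumerate(PATTERN.finditer(log_text), start=1):
        n = int(match.group(1))
        total += n
        print(f"request {i}: {n:4d} prompt tokens processed (cumulative {total})")

with open("server.log") as f:  # path is a placeholder
    summarize(f.read())
```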


Labels

bug-unconfirmed, medium severity, stale
