
Server slowing down with each request (requests are identical) #4201

Closed
@ruped

Description

Pre-Prerequisite

Thanks to all the contributors for all the great work on llama.cpp!

Prerequisites

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behaviour

  • Identical requests to the server should take similar amounts of time
    • I made 4 identical requests to the server, sequentially, with a 1-second pause after each.

Current Behaviour

  • Each successive request to the server takes longer to process.
  • Note in particular that the prompt_eval time gets much slower.
  • Also: after the last request the log says "slot 3 released (1155 tokens in cache)" - is that expected?
    • Why should there be data in the cache after the slot has been released and there are no active requests?
print_timings: prompt eval time =    1110.28 ms /  1134 tokens (    0.98 ms per token,  1021.37 tokens per second)
print_timings:        eval time =     561.83 ms /    20 runs   (   28.09 ms per token,    35.60 tokens per second)
print_timings:       total time =    1672.10 ms
slot 0 released (1155 tokens in cache)

slot 1 is processing [task id: 1]
slot 1 : kv cache rm - [0, end)
print_timings: prompt eval time =    1483.84 ms /  1134 tokens (    1.31 ms per token,   764.23 tokens per second)
print_timings:        eval time =     591.59 ms /    20 runs   (   29.58 ms per token,    33.81 tokens per second)
print_timings:       total time =    2075.43 ms
slot 1 released (1155 tokens in cache)

slot 2 is processing [task id: 2]
slot 2 : kv cache rm - [0, end)
print_timings: prompt eval time =    1764.20 ms /  1134 tokens (    1.56 ms per token,   642.78 tokens per second)
print_timings:        eval time =     618.07 ms /    20 runs   (   30.90 ms per token,    32.36 tokens per second)
print_timings:       total time =    2382.28 ms
slot 2 released (1155 tokens in cache)

slot 3 is processing [task id: 3]
slot 3 : kv cache rm - [0, end)
print_timings: prompt eval time =    2229.91 ms /  1134 tokens (    1.97 ms per token,   508.54 tokens per second)
print_timings:        eval time =     642.50 ms /    20 runs   (   32.12 ms per token,    31.13 tokens per second)
print_timings:       total time =    2872.41 ms
slot 3 released (1155 tokens in cache)
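The slowdown in the four log excerpts above can be quantified directly from the reported numbers (a quick sanity check on the logged figures, not a live measurement):

```python
# Prompt-eval throughput implied by the four server log lines above.
# Times (ms) and token count are copied verbatim from the logs.
N_TOKENS = 1134
prompt_eval_ms = [1110.28, 1483.84, 1764.20, 2229.91]

rates = [N_TOKENS / (ms / 1000.0) for ms in prompt_eval_ms]  # tokens/s
for req, r in enumerate(rates, 1):
    print(f"request {req}: {r:.2f} tokens/s")

# Each request is noticeably slower than the previous one; request 4 runs
# at roughly half the prompt-eval throughput of request 1.
slowdown = rates[0] / rates[-1]
print(f"overall slowdown: {slowdown:.2f}x")
```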

Environment and Context

  • Physical (or virtual) hardware you are using: Physical hardware, Nvidia GPU

  • Operating System: Linux

Failure Information (for bugs)


Steps to Reproduce

  • Start the server:
    ./build/bin/server -m ./ggml-model-q4.bin -ngl 9999 --ctx-size 8000 --host 0.0.0.0 --port 7777 --cont-batching --parallel 4
  • Make requests to the server with curl, sequentially (one at a time).
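A reproduction loop along these lines could look like the following (a hypothetical sketch: it assumes the server started above is listening on localhost:7777 and uses the server's /completion endpoint; the prompt is a short placeholder, not the original ~1134-token prompt):

```shell
# Send 4 identical completion requests with a 1-second pause after each.
# DRY_RUN=1 only prints the commands; set DRY_RUN=0 to actually hit the server.
DRY_RUN=1
PAYLOAD='{"prompt": "Hello, world.", "n_predict": 20}'
SENT=0
for i in 1 2 3 4; do
  if [ "$DRY_RUN" = "1" ]; then
    echo "curl -s http://localhost:7777/completion -d '$PAYLOAD'"
  else
    curl -s http://localhost:7777/completion -d "$PAYLOAD"
  fi
  SENT=$((SENT+1))
  sleep 1   # 1-second pause after each request, as in the report
done
echo "issued $SENT identical requests"
```

With DRY_RUN=0, each request's timings should then appear in the server log, as in the excerpts above.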

Thanks!

Labels: bug (Something isn't working), stale