Description
Pre-Prerequisite
Thanks to all the contributors for all the great work on llama.cpp!
Prerequisites
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behaviour
- Identical requests to the server should take similar amounts of time
- I made 4 identical requests to the server, sequentially, with a 1 second pause after each.
Current Behaviour
- Each successive request to the server takes longer to process than the one before.
- Note particularly that the `prompt_eval` time gets much slower.
- Also: after the last request it says "slot 3 released (1155 tokens in cache)" - is that expected?
- Why should there be data in the cache after the slot has been released and there are no active requests?
print_timings: prompt eval time = 1110.28 ms / 1134 tokens ( 0.98 ms per token, 1021.37 tokens per second)
print_timings: eval time = 561.83 ms / 20 runs ( 28.09 ms per token, 35.60 tokens per second)
print_timings: total time = 1672.10 ms
slot 0 released (1155 tokens in cache)
slot 1 is processing [task id: 1]
slot 1 : kv cache rm - [0, end)
print_timings: prompt eval time = 1483.84 ms / 1134 tokens ( 1.31 ms per token, 764.23 tokens per second)
print_timings: eval time = 591.59 ms / 20 runs ( 29.58 ms per token, 33.81 tokens per second)
print_timings: total time = 2075.43 ms
slot 1 released (1155 tokens in cache)
slot 2 is processing [task id: 2]
slot 2 : kv cache rm - [0, end)
print_timings: prompt eval time = 1764.20 ms / 1134 tokens ( 1.56 ms per token, 642.78 tokens per second)
print_timings: eval time = 618.07 ms / 20 runs ( 30.90 ms per token, 32.36 tokens per second)
print_timings: total time = 2382.28 ms
slot 2 released (1155 tokens in cache)
slot 3 is processing [task id: 3]
slot 3 : kv cache rm - [0, end)
print_timings: prompt eval time = 2229.91 ms / 1134 tokens ( 1.97 ms per token, 508.54 tokens per second)
print_timings: eval time = 642.50 ms / 20 runs ( 32.12 ms per token, 31.13 tokens per second)
print_timings: total time = 2872.41 ms
slot 3 released (1155 tokens in cache)
Environment and Context
- Physical (or virtual) hardware you are using: Physical hardware, Nvidia GPU
- Operating System: Linux
Failure Information (for bugs)
Please help provide information about the failure / bug.
Steps to Reproduce
- Start the server: `./build/bin/server -m ./ggml-model-q4.bin -ngl 9999 --ctx-size 8000 --host 0.0.0.0 --port 7777 --cont-batching --parallel 4`
- Make requests to the server with curl sequentially (one at a time); a minimal loop is sketched below.
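For reference, a minimal reproduction sketch of the request loop, assuming the server started in step 1 is listening on port 7777 and posting to its `/completion` endpoint; the prompt string is a placeholder (any identical long prompt should do) and `n_predict` is set to 20 to match the 20-token evals in the log:

```bash
#!/usr/bin/env bash
# Send 4 identical completion requests, one at a time, with a 1 second pause after each.
# PROMPT is a placeholder; substitute the same long prompt used in the logs above.
PROMPT="<same long prompt every time>"

for i in 1 2 3 4; do
  curl -s http://localhost:7777/completion \
    -H "Content-Type: application/json" \
    -d "{\"prompt\": \"${PROMPT}\", \"n_predict\": 20}" \
    > /dev/null
  sleep 1
done
```

Each iteration lands on a new slot (0 through 3 in the log above), and the prompt eval time grows from roughly 1110 ms to 2230 ms even though the request body is identical.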
Thanks!