Description
Pre-Prerequisite
Thanks to all the contributors for all the great work on llama.cpp!
Prerequisites
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behaviour
- Identical requests to the server should take similar amounts of time
- I made 4 identical requests to the server, sequentially, with a 1 second pause after each.
Current Behaviour
- Each successive request to the server takes longer to process than the one before.
- Note particularly that the `prompt_eval` time gets much slower.
- Also: after the last request it says "slot 3 released (1155 tokens in cache)" - is that expected?
- Why should there be data in the cache after the slot has been released and there are no active requests?
print_timings: prompt eval time = 1110.28 ms / 1134 tokens ( 0.98 ms per token, 1021.37 tokens per second)
print_timings: eval time = 561.83 ms / 20 runs ( 28.09 ms per token, 35.60 tokens per second)
print_timings: total time = 1672.10 ms
slot 0 released (1155 tokens in cache)
slot 1 is processing [task id: 1]
slot 1 : kv cache rm - [0, end)
print_timings: prompt eval time = 1483.84 ms / 1134 tokens ( 1.31 ms per token, 764.23 tokens per second)
print_timings: eval time = 591.59 ms / 20 runs ( 29.58 ms per token, 33.81 tokens per second)
print_timings: total time = 2075.43 ms
slot 1 released (1155 tokens in cache)
slot 2 is processing [task id: 2]
slot 2 : kv cache rm - [0, end)
print_timings: prompt eval time = 1764.20 ms / 1134 tokens ( 1.56 ms per token, 642.78 tokens per second)
print_timings: eval time = 618.07 ms / 20 runs ( 30.90 ms per token, 32.36 tokens per second)
print_timings: total time = 2382.28 ms
slot 2 released (1155 tokens in cache)
slot 3 is processing [task id: 3]
slot 3 : kv cache rm - [0, end)
print_timings: prompt eval time = 2229.91 ms / 1134 tokens ( 1.97 ms per token, 508.54 tokens per second)
print_timings: eval time = 642.50 ms / 20 runs ( 32.12 ms per token, 31.13 tokens per second)
print_timings: total time = 2872.41 ms
slot 3 released (1155 tokens in cache)
Environment and Context
- Physical (or virtual) hardware you are using: Physical hardware, Nvidia GPU
- Operating System: Linux
Failure Information (for bugs)
Please help provide information about the failure / bug.
Steps to Reproduce
- Start the server: `./build/bin/server -m ./ggml-model-q4.bin -ngl 9999 --ctx-size 8000 --host 0.0.0.0 --port 7777 --cont-batching --parallel 4`
- Make requests to the server with curl sequentially (one at a time); a minimal loop is sketched below.
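For reference, a minimal reproduction sketch of the request loop, assuming the server started in step 1 is listening on port 7777 and posting to its `/completion` endpoint; the prompt string is a placeholder (any identical long prompt should do) and `n_predict` is set to 20 to match the 20-token evals in the log:

```bash
#!/usr/bin/env bash
# Send 4 identical completion requests, one at a time, with a 1 second pause after each.
# PROMPT is a placeholder; substitute the same long prompt used in the logs above.
PROMPT="<same long prompt every time>"

for i in 1 2 3 4; do
  curl -s http://localhost:7777/completion \
    -H "Content-Type: application/json" \
    -d "{\"prompt\": \"${PROMPT}\", \"n_predict\": 20}" \
    > /dev/null
  sleep 1
done
```

Each iteration lands on a new slot (0 through 3 in the log above), and the prompt eval time grows from roughly 1110 ms to 2230 ms even though the request body is identical.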
Thanks!