
Commit 525213d

phymbert and ggerganov authored
server: init functional tests (#5566)
* server: tests: init scenarios
  - health and slots endpoints
  - completion endpoint
  - OAI compatible chat completion requests w/ and without streaming
  - completion multi users scenario
  - multi users scenario on OAI compatible endpoint with streaming
  - multi users with total number of tokens to predict exceeds the KV Cache size
  - server wrong usage scenario, like in issue #3969 (Infinite loop of "context shift")
  - slots shifting
  - continuous batching
  - embeddings endpoint
  - multi users embedding endpoint (segmentation fault, #5655)
  - OpenAI-compatible embeddings API
  - tokenize endpoint
  - CORS and api key scenario

* server: CI GitHub workflow

---------

Co-authored-by: Georgi Gerganov <[email protected]>
1 parent fd43d66 commit 525213d

File tree

14 files changed: +1243 -18 lines

.github/ISSUE_TEMPLATE/bug.md

Lines changed: 2 additions & 0 deletions

@@ -7,3 +7,5 @@ assignees: ''
 ---
 
 Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.
+
+If the bug concerns the server, please try to reproduce it first using the [server test scenario framework](https://github.com/ggerganov/llama.cpp/tree/master/examples/server/tests).

.github/workflows/server.yml

Lines changed: 127 additions & 0 deletions (new file)

# Server build and tests
name: Server

on:
  workflow_dispatch: # allows manual triggering
  push:
    branches:
      - master
      - test/server-add-ci-test # FIXME remove
    paths: ['.github/workflows/**', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/**.*']
  pull_request:
    types: [opened, synchronize, reopened]
    paths: ['**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/**.*']

jobs:
  server:
    runs-on: ubuntu-latest

    strategy:
      matrix:
        build: [noavx, avx2, avx, avx512, cublas, clblast, openblas, kompute, vulkan]
        sanitizer: [ADDRESS, THREAD, UNDEFINED]
        build_type: [Debug, Release]
        include:
          - build: 'noavx'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_AVX=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF'
            image: ubuntu:latest
          - build: 'avx2'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON'
            image: ubuntu:latest
          - build: 'avx'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_AVX2=OFF'
            image: ubuntu:latest
          - build: 'avx512'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_AVX512=ON'
            image: ubuntu:latest
            experimental: true
          - build: 'cublas'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_CUBLAS=ON'
            image: nvidia/cuda:12.3.1-devel-ubuntu22.04
            arch_not_available: true # require nvidia docker engine
          - build: 'clblast'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_CLBLAST=ON'
            image: ubuntu:latest
            arch_not_available: true
          - build: 'openblas'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS'
            image: ubuntu:latest
          - build: 'kompute'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_KOMPUTE=ON -DKOMPUTE_OPT_DISABLE_VULKAN_VERSION_CHECK=ON'
            image: ubuntu:latest
            arch_not_available: true
          - build: 'vulkan'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_VULKAN=ON'
            image: ubuntu:latest
            arch_not_available: true

    container:
      image: ${{ matrix.image }}
      ports:
        - 8888
      options: --cpus 4

    steps:
      - name: Clone
        id: checkout
        uses: actions/checkout@v3

      - name: Dependencies
        id: depends
        run: |
          apt-get update
          apt-get -y install \
            build-essential \
            pkg-config \
            git \
            cmake \
            python3-pip \
            wget \
            psmisc

      - name: Download CLBlast
        id: get_clblast
        if: ${{ matrix.build == 'clblast' }}
        run: |
          apt install -y libclblast-dev

      - name: Download OpenBLAS
        id: get_openblas
        if: ${{ matrix.build == 'openblas' }}
        run: |
          apt-get -y install libopenblas-dev

      - name: Install Vulkan SDK
        id: get_vulkan
        if: ${{ matrix.build == 'kompute' || matrix.build == 'vulkan' }}
        run: |
          wget -qO- https://packages.lunarg.com/lunarg-signing-key-pub.asc | tee /etc/apt/trusted.gpg.d/lunarg.asc
          wget -qO /etc/apt/sources.list.d/lunarg-vulkan-jammy.list http://packages.lunarg.com/vulkan/lunarg-vulkan-jammy.list
          apt-get update
          apt-get -y install vulkan-sdk

      - name: Build
        id: cmake_build
        run: |
          mkdir build
          cd build
          cmake .. -DLLAMA_SANITIZE_${{ matrix.sanitizer }}=ON -DCMAKE_BUILD_TYPE=${{ matrix.build_type }} ${{ matrix.defines }}
          cmake --build . --config ${{ matrix.build_type }} -j $(nproc) --target server

      - name: Tests dependencies
        id: test_dependencies
        run: |
          pip install -r examples/server/tests/requirements.txt

      - name: Download models
        id: download_models
        run: |
          cd examples/server/tests
          ../../../scripts/hf.sh --repo ggml-org/models --file tinyllamas/stories260K.gguf

      - name: Tests
        id: server_integration_test
        continue-on-error: ${{ matrix.experimental || matrix.arch_not_available }}
        run: |
          cd examples/server/tests
          PORT=8888 ./tests.sh

examples/server/README.md

Lines changed: 6 additions & 0 deletions

@@ -98,6 +98,12 @@ curl --request POST \
     --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
 ```
 
+## Advanced testing
+
+We implemented a [server test framework](./tests/README.md) using human-readable scenarios.
+
+*Before submitting an issue, please try to reproduce it first with this framework.*
+
 ## Node JS Test
 
 You need to have [Node.js](https://nodejs.org/en) installed.

examples/server/server.cpp

Lines changed: 18 additions & 18 deletions

@@ -1410,11 +1410,6 @@ struct llama_server_context
             int n_processing_slots = 0;
 
             for (llama_client_slot &slot: slots) {
-                if (slot.available()) {
-                    n_idle_slots++;
-                } else {
-                    n_processing_slots++;
-                }
                 json slot_data = get_formated_generation(slot);
                 slot_data["id"] = slot.id;
                 slot_data["task_id"] = slot.task_id;
@@ -1429,6 +1424,11 @@ struct llama_server_context
                     {"stopped_limit", slot.stopped_limit},
                     {"stopping_word", slot.stopping_word},
                 };
+                if (slot_data["state"] == IDLE) {
+                    n_idle_slots++;
+                } else {
+                    n_processing_slots++;
+                }
                 slots_data.push_back(slot_data);
             }
             LOG_TEE("task %i - slots data: idle=%i processing=%i\n", task.id, n_idle_slots, n_processing_slots);
@@ -2748,19 +2748,6 @@ int main(int argc, char **argv)
         log_data["api_key"] = "api_key: " + std::to_string(sparams.api_keys.size()) + " keys loaded";
     }
 
-    LOG_INFO("HTTP server listening", log_data);
-    // run the HTTP server in a thread - see comment below
-    std::thread t([&]()
-            {
-                if (!svr.listen_after_bind())
-                {
-                    state.store(SERVER_STATE_ERROR);
-                    return 1;
-                }
-
-                return 0;
-            });
-
     // load the model
     if (!llama.load_model(params))
     {
@@ -3228,6 +3215,19 @@ int main(int argc, char **argv)
     }*/
     //);
 
+    LOG_INFO("HTTP server listening", log_data);
+    // run the HTTP server in a thread - see comment below
+    std::thread t([&]()
+            {
+                if (!svr.listen_after_bind())
+                {
+                    state.store(SERVER_STATE_ERROR);
+                    return 1;
+                }
+
+                return 0;
+            });
+
     llama.queue_tasks.on_new_task(std::bind(
         &llama_server_context::process_single_task, &llama, std::placeholders::_1));
     llama.queue_tasks.on_finish_multitask(std::bind(
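
Note on this server.cpp change: idle vs. processing slots are now counted from the state reported in `slot_data`, and the HTTP listener thread is started only after the model has been loaded and the handlers are registered. The test framework relies on being able to poll the server until it reports healthy. Below is a minimal, hedged sketch of such a readiness probe; it is not part of this diff, it uses `requests` for brevity (the real suite is aiohttp-based), and the host, port, timeout and the exact `/health` response fields (`status`, `slots_idle`, `slots_processing`) are assumptions for illustration.

```python
# Minimal sketch: poll /health until the server reports ready (illustrative only).
import time

import requests


def wait_for_health(host="localhost", port=8080, timeout_s=30):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            response = requests.get(f"http://{host}:{port}/health", timeout=2)
            if response.status_code == 200 and response.json().get("status") == "ok":
                # The body may also expose slot counters, e.g. slots_idle / slots_processing.
                return response.json()
        except requests.exceptions.RequestException:
            pass  # server not listening yet
        time.sleep(0.5)
    raise TimeoutError("server did not become healthy in time")
```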

examples/server/tests/README.md

Lines changed: 46 additions & 0 deletions (new file)

# Server tests

Python-based server test scenarios using [BDD](https://en.wikipedia.org/wiki/Behavior-driven_development) and [behave](https://behave.readthedocs.io/en/latest/):
* [issues.feature](./features/issues.feature) Pending issues scenarios
* [parallel.feature](./features/parallel.feature) Scenarios involving multiple slots and concurrent requests
* [security.feature](./features/security.feature) Security, CORS and API key
* [server.feature](./features/server.feature) Server base scenarios: completion, embedding, tokenization, etc.

Tests target GitHub workflow job runners with 4 vCPUs.

Requests are made with [aiohttp](https://docs.aiohttp.org/en/stable/client_reference.html), an [asyncio](https://docs.python.org/fr/3/library/asyncio.html)-based HTTP client.

Note: If inference on the host is faster than on the GitHub runners, the parallel scenarios may randomly fail. To mitigate this, increase the `n_predict` and `kv_size` values.

### Install dependencies

`pip install -r requirements.txt`

### Run tests

1. Build the server
   ```shell
   cd ../../..
   mkdir build
   cd build
   cmake ../
   cmake --build . --target server
   ```
2. Download the required model:
   `../../../scripts/hf.sh --repo ggml-org/models --file tinyllamas/stories260K.gguf`
3. Start the tests: `./tests.sh`

Some scenario step values can be overridden with environment variables:
- `PORT` -> `context.server_port`: the port the server listens on during a scenario, default `8080`
- `LLAMA_SERVER_BIN_PATH`: the server binary path, default `../../../build/bin/server`
- `DEBUG`: set to `ON` to enable verbose output for the steps and the server (`--verbose`)

### Run @bug, @wip or @wrong_usage annotated scenarios

A feature or scenario must be annotated with `@llama.cpp` to be included in the default scope.
- `@bug` links a scenario to a GitHub issue.
- `@wrong_usage` marks user-reported issues that are actually expected behavior.
- `@wip` focuses on a scenario that is a work in progress.

To run scenarios annotated with `@bug`, start:
`DEBUG=ON ./tests.sh --no-skipped --tags bug`

After changing logic in `steps.py`, ensure that the `@bug` and `@wrong_usage` scenarios are updated.
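
The README refers to step definitions in `steps.py`, which are not shown in this commit view. As a minimal sketch of how one Gherkin line maps onto behave code, assuming the standard behave decorators and the `context.server_fqdn` / `context.server_port` attributes used by the environment hooks below; the exact step text and attribute names are illustrative, not a copy of the real `steps.py`:

```python
# Hedged sketch of a behave step definition backing a line such as
# "Given a server listening on localhost:8080".
from behave import step


@step(u'a server listening on {server_fqdn}:{server_port}')
def step_server_config(context, server_fqdn, server_port):
    # Store the target address on the behave context so later steps
    # and the environment hooks can reach the server.
    context.server_fqdn = server_fqdn
    context.server_port = int(server_port)
```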
examples/server/tests/features/environment.py

Lines changed: 67 additions & 0 deletions (new file)

import os
import socket
import subprocess
import time
from contextlib import closing
from signal import SIGKILL


def before_scenario(context, scenario):
    print(f"\x1b[33;42mStarting new scenario: {scenario.name}!\x1b[0m")
    port = 8080
    if 'PORT' in os.environ:
        port = int(os.environ['PORT'])
    if is_server_listening("localhost", port):
        assert False, "Server already started"


def after_scenario(context, scenario):
    if scenario.status == "failed":
        if 'GITHUB_ACTIONS' in os.environ:
            print(f"\x1b[33;101mSCENARIO FAILED: {scenario.name} server logs:\x1b[0m\n\n")
            if os.path.isfile('llama.log'):
                with closing(open('llama.log', 'r')) as f:
                    for line in f:
                        print(line)
        if not is_server_listening(context.server_fqdn, context.server_port):
            print("\x1b[33;101mERROR: Server stopped listening\x1b[0m")

    if not pid_exists(context.server_process.pid):
        assert False, f"Server not running pid={context.server_process.pid} ..."

    print(f"stopping server pid={context.server_process.pid} ...")
    context.server_process.kill()
    # Wait a little for the socket to free up
    time.sleep(0.05)

    attempts = 0
    while is_server_listening(context.server_fqdn, context.server_port):
        print(f"stopping server pid={context.server_process.pid} ...")
        os.kill(context.server_process.pid, SIGKILL)
        time.sleep(0.1)
        attempts += 1
        if attempts > 5:
            print(f"Server dangling exits, killing all {context.server_path} ...")
            process = subprocess.run(['killall', '-9', context.server_path],
                                     stderr=subprocess.PIPE,
                                     universal_newlines=True)
            print(process)


def is_server_listening(server_fqdn, server_port):
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as sock:
        result = sock.connect_ex((server_fqdn, server_port))
        return result == 0


def pid_exists(pid):
    """Check whether pid exists in the current process table."""
    import errno
    if pid < 0:
        return False
    try:
        os.kill(pid, 0)
    except OSError as e:
        return e.errno == errno.EPERM
    else:
        return True
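
The teardown in `after_scenario()` assumes that a step has already populated `context.server_process` and `context.server_path`. The actual launch lives in `steps.py` (not shown here); the following is only a sketch of what that side could look like, assuming a hypothetical `start_server` helper, the `LLAMA_SERVER_BIN_PATH` default from the README above, and illustrative server flags:

```python
# Hedged sketch: launching the server under test so after_scenario() has a
# process handle to stop. Flag names and context attributes are illustrative.
import os
import subprocess


def start_server(context):
    server_path = os.environ.get('LLAMA_SERVER_BIN_PATH', '../../../build/bin/server')
    context.server_path = server_path
    context.server_process = subprocess.Popen(
        [server_path,
         '--host', context.server_fqdn,
         '--port', str(context.server_port),
         '--model', context.model_file],
        close_fds=True)
    print(f"server started pid={context.server_process.pid}")
```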
examples/server/tests/features/issues.feature

Lines changed: 36 additions & 0 deletions (new file)

# List of ongoing issues
@bug
Feature: Issues
  # Issue #5655
  Scenario: Multi users embeddings
    Given a server listening on localhost:8080
    And a model file stories260K.gguf
    And a model alias tinyllama-2
    And 42 as server seed
    And 64 KV cache size
    And 2 slots
    And continuous batching
    And embeddings extraction
    Then the server is starting
    Then the server is healthy

    Given a prompt:
      """
      Write a very long story about AI.
      """
    And a prompt:
      """
      Write another very long music lyrics.
      """
    And a prompt:
      """
      Write a very long poem.
      """
    And a prompt:
      """
      Write a very long joke.
      """
    Given concurrent embedding requests
    Then the server is busy
    Then the server is idle
    Then all embeddings are generated
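
The `concurrent embedding requests` step above is implemented in `steps.py` (not shown in this view). As a rough illustration of what it could do with the aiohttp/asyncio client the test README mentions, here is a minimal sketch; the `/embedding` route and the `content` / `embedding` field names follow the llama.cpp server API but should be treated as assumptions here:

```python
# Hedged sketch: fire one embedding request per collected prompt, all in flight at once.
import asyncio

import aiohttp


async def request_embedding(session, base_url, prompt):
    async with session.post(f"{base_url}/embedding", json={"content": prompt}) as resp:
        assert resp.status == 200
        return (await resp.json()).get("embedding")


async def concurrent_embeddings(base_url, prompts):
    async with aiohttp.ClientSession() as session:
        tasks = [request_embedding(session, base_url, p) for p in prompts]
        return await asyncio.gather(*tasks)


# Example usage:
#   embeddings = asyncio.run(concurrent_embeddings("http://localhost:8080", prompts))
```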
