
Commit 75cd4c7

ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response (#6495)
* ci: bench: support sse and fix prompt processing time; server: add tokens usage in stream mode
* ci: bench: README.md EOL
* ci: bench: remove total pp and tg as it is not accurate
* ci: bench: fix case when there is no token generated
* ci: bench: change to the 95 percentile for pp and tg as it is closer to what the server exports in metrics
* ci: bench: fix finish reason rate
1 parent a8bd14d commit 75cd4c7

File tree

5 files changed: +111 −37 lines changed


.github/workflows/bench.yml

Lines changed: 13 additions & 7 deletions
@@ -79,12 +79,18 @@ jobs:
             sleep 0.1
           done

-      - name: Install k6
+      - name: Set up Go
+        uses: actions/setup-go@v5
+        with:
+          go-version: '1.21'
+
+      - name: Install k6 and xk6-sse
         id: k6_installation
         run: |
           cd examples/server/bench
-          wget --quiet https://github.com/grafana/k6/releases/download/v0.49.0/k6-v0.49.0-linux-amd64.tar.gz
-          tar xzf k6*.tar.gz --strip-components=1
+          go install go.k6.io/xk6/cmd/xk6@latest
+          xk6 build master \
+            --with github.com/phymbert/xk6-sse

       - name: Build
         id: cmake_build

@@ -118,7 +124,7 @@ jobs:

           cd examples/server/bench
           source venv/bin/activate
-          BENCH_K6_BIN_PATH=./k6 python bench.py \
+          python bench.py \
              --runner-label ${{ env.RUNNER_LABEL }} \
              --name ${{ github.job }} \
              --branch ${{ github.head_ref || github.ref_name }} \

@@ -228,9 +234,9 @@ jobs:
            <summary>Expand details for performance related PR only</summary>

            - Concurrent users: ${{ env.N_USERS }}, duration: ${{ github.event.inputs.duration || env.DURATION }}
-           - HTTP request : avg=${{ env.HTTP_REQ_DURATION_AVG }}ms p(90)=${{ env.HTTP_REQ_DURATION_P_90_ }}ms fails=${{ env.HTTP_REQ_FAILED_PASSES }}, finish reason: stop=${{ env.LLAMACPP_COMPLETIONS_STOP_RATE_PASSES }} truncated=${{ env.LLAMACPP_COMPLETIONS_TRUNCATED_RATE_PASSES }}
-           - Prompt processing (pp): avg=${{ env.LLAMACPP_PROMPT_TOKENS_AVG }}tk/s p(90)=${{ env.LLAMACPP_PROMPT_TOKENS_P_90_ }}tk/s **total=${{ env.LLAMACPP_PROMPT_TOKENS_TOTAL_COUNTER_RATE }}tk/s**
-           - Token generation (tg): avg=${{ env.LLAMACPP_TOKENS_SECOND_AVG }}tk/s p(90)=${{ env.LLAMACPP_TOKENS_SECOND_P_90_ }}tk/s **total=${{ env.LLAMACPP_COMPLETION_TOKENS_TOTAL_COUNTER_RATE }}tk/s**
+           - HTTP request : avg=${{ env.HTTP_REQ_DURATION_AVG }}ms p(95)=${{ env.HTTP_REQ_DURATION_P_95_ }}ms fails=${{ env.HTTP_REQ_FAILED_PASSES }}, finish reason: stop=${{ env.LLAMACPP_COMPLETIONS_STOP_RATE_PASSES }} truncated=${{ env.LLAMACPP_COMPLETIONS_TRUNCATED_RATE_PASSES }}
+           - Prompt processing (pp): avg=${{ env.LLAMACPP_PROMPT_PROCESSING_SECOND_AVG }}tk/s p(95)=${{ env.LLAMACPP_PROMPT_PROCESSING_SECOND_P_95_ }}tk/s
+           - Token generation (tg): avg=${{ env.LLAMACPP_TOKENS_SECOND_AVG }}tk/s p(95)=${{ env.LLAMACPP_TOKENS_SECOND_P_95_ }}tk/s
            - ${{ env.BENCH_GRAPH_XLABEL }}

examples/server/bench/README.md

Lines changed: 37 additions & 5 deletions
@@ -2,13 +2,15 @@

 Benchmark is using [k6](https://k6.io/).

-##### Install k6
+##### Install k6 and sse extension

-Follow instruction from: https://k6.io/docs/get-started/installation/
+SSE is not supported by default in k6, you have to build k6 with the [xk6-sse](https://github.com/phymbert/xk6-sse) extension.

-Example for ubuntu:
+Example:
 ```shell
-snap install k6
+go install go.k6.io/xk6/cmd/xk6@latest
+xk6 build master \
+--with github.com/phymbert/xk6-sse
 ```

 #### Download a dataset

@@ -46,7 +48,7 @@ server --host localhost --port 8080 \

 For 500 chat completions request with 8 concurrent users during maximum 10 minutes, run:
 ```shell
-k6 run script.js --duration 10m --iterations 500 --vus 8
+./k6 run script.js --duration 10m --iterations 500 --vus 8
 ```

 The benchmark values can be overridden with:

@@ -86,3 +88,33 @@ K6 metrics might be compared against [server metrics](../README.md), with:
 ```shell
 curl http://localhost:8080/metrics
 ```
+
+### Using the CI python script
+The `bench.py` script does several steps:
+- start the server
+- define good variable for k6
+- run k6 script
+- extract metrics from prometheus
+
+It aims to be used in the CI, but you can run it manually:
+
+```shell
+LLAMA_SERVER_BIN_PATH=../../../cmake-build-release/bin/server python bench.py \
+  --runner-label local \
+  --name local \
+  --branch `git rev-parse --abbrev-ref HEAD` \
+  --commit `git rev-parse HEAD` \
+  --scenario script.js \
+  --duration 5m \
+  --hf-repo ggml-org/models \
+  --hf-file phi-2/ggml-model-q4_0.gguf \
+  --model-path-prefix models \
+  --parallel 4 \
+  -ngl 33 \
+  --batch-size 2048 \
+  --ubatch-size 256 \
+  --ctx-size 4096 \
+  --n-prompts 200 \
+  --max-prompt-tokens 256 \
+  --max-tokens 256
+```
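The last step of `bench.py` compares the k6 client-side trends against the server's own Prometheus export. The sketch below illustrates that extraction step only; it is not part of the commit. It assumes the server was started with metrics enabled and that the exposition contains series whose names end in `prompt_tokens_seconds`; check the output of `curl http://localhost:8080/metrics` for the exact names on your build.

```python
# Illustrative sketch: poll the server's Prometheus endpoint and average a metric,
# similar in spirit to how bench.py fills prometheus_metrics['prompt_tokens_seconds'].
# The metric suffix below is an assumption; inspect /metrics for the real names.
import time
import urllib.request
from statistics import mean


def scrape_metric(url: str, suffix: str) -> float | None:
    """Return the value of the first metric whose name ends with `suffix`, if any."""
    body = urllib.request.urlopen(url, timeout=5).read().decode()
    for line in body.splitlines():
        if line.startswith('#'):
            continue  # skip HELP/TYPE comment lines
        parts = line.split()
        if len(parts) == 2 and parts[0].endswith(suffix):
            return float(parts[1])
    return None


samples = []
for _ in range(10):
    value = scrape_metric('http://localhost:8080/metrics', 'prompt_tokens_seconds')
    if value is not None:
        samples.append(value)
    time.sleep(1)

if samples:
    print(f"prompt processing (server side): {round(mean(samples), 2)} tk/s")
```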

examples/server/bench/bench.py

Lines changed: 5 additions & 6 deletions
@@ -76,7 +76,6 @@ def main(args_in: list[str] | None = None) -> None:
                         data['metrics'][metric_name][metric_metric]=value
                         github_env.write(
                             f"{escape_metric_name(metric_name)}_{escape_metric_name(metric_metric)}={value}\n")
-        token_seconds = data['metrics']['llamacpp_tokens_second']['avg']
         iterations = data['root_group']['checks']['success completion']['passes']

     except Exception:

@@ -181,16 +180,16 @@ def main(args_in: list[str] | None = None) -> None:
     bench_results = {
         "i": iterations,
         "req": {
-            "p90": round(data['metrics']["http_req_duration"]["p(90)"], 2),
+            "p95": round(data['metrics']["http_req_duration"]["p(95)"], 2),
             "avg": round(data['metrics']["http_req_duration"]["avg"], 2),
         },
         "pp": {
-            "p90": round(data['metrics']["llamacpp_prompt_tokens"]["p(90)"], 2),
-            "avg": round(data['metrics']["llamacpp_prompt_tokens"]["avg"], 2),
+            "p95": round(data['metrics']["llamacpp_prompt_processing_second"]["p(95)"], 2),
+            "avg": round(data['metrics']["llamacpp_prompt_processing_second"]["avg"], 2),
             "0": round(mean(prometheus_metrics['prompt_tokens_seconds']), 2),
         },
         "tg": {
-            "p90": round(data['metrics']["llamacpp_tokens_second"]["p(90)"], 2),
+            "p95": round(data['metrics']["llamacpp_tokens_second"]["p(95)"], 2),
             "avg": round(data['metrics']["llamacpp_tokens_second"]["avg"], 2),
             "0": round(mean(prometheus_metrics['predicted_tokens_seconds']), 2),
         },

@@ -206,7 +205,7 @@ def main(args_in: list[str] | None = None) -> None:


 def start_benchmark(args):
-    k6_path = 'k6'
+    k6_path = './k6'
     if 'BENCH_K6_BIN_PATH' in os.environ:
         k6_path = os.environ['BENCH_K6_BIN_PATH']
     k6_args = [
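The workflow comment shown earlier reads variables such as `LLAMACPP_PROMPT_PROCESSING_SECOND_P_95_`; those names are produced by the `github_env.write(...)` call in the first hunk, which joins the k6 metric name and sub-metric and escapes them. The helper below is a sketch of that escaping, inferred from the resulting variable names rather than copied from the repository, so treat it as an assumption.

```python
# Sketch of the name-escaping step implied by the workflow env variables,
# e.g. "llamacpp_tokens_second" + "p(95)" -> "LLAMACPP_TOKENS_SECOND_P_95_".
# The real escape_metric_name() in bench.py may differ in detail.
import re


def escape_metric_name(metric_name: str) -> str:
    return re.sub(r'[^A-Z0-9]', '_', metric_name.upper())


print(f"{escape_metric_name('llamacpp_tokens_second')}_{escape_metric_name('p(95)')}=42.0")
# -> LLAMACPP_TOKENS_SECOND_P_95_=42.0
```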

examples/server/bench/script.js

Lines changed: 47 additions & 19 deletions
@@ -1,4 +1,4 @@
-import http from 'k6/http'
+import sse from 'k6/x/sse'
 import {check, sleep} from 'k6'
 import {SharedArray} from 'k6/data'
 import {Counter, Rate, Trend} from 'k6/metrics'

@@ -53,7 +53,9 @@ const data = new SharedArray('conversations', function () {

 const llamacpp_prompt_tokens = new Trend('llamacpp_prompt_tokens')
 const llamacpp_completion_tokens = new Trend('llamacpp_completion_tokens')
+
 const llamacpp_tokens_second = new Trend('llamacpp_tokens_second')
+const llamacpp_prompt_processing_second = new Trend('llamacpp_prompt_processing_second')

 const llamacpp_prompt_tokens_total_counter = new Counter('llamacpp_prompt_tokens_total_counter')
 const llamacpp_completion_tokens_total_counter = new Counter('llamacpp_completion_tokens_total_counter')

@@ -86,36 +88,62 @@ export default function () {
             }
         ],
         "model": model,
-        "stream": false,
+        "stream": true,
         "seed": 42,
         "max_tokens": max_tokens
     }

-    const body = JSON.stringify(payload)
+    const params = {method: 'POST', body: JSON.stringify(payload)};
+
+    const startTime = new Date()
+    let promptEvalEndTime = null
+    let prompt_tokens = 0
+    let completions_tokens = 0
+    let finish_reason = null
+    const res = sse.open(`${server_url}/chat/completions`, params, function (client) {
+        client.on('event', function (event) {
+            if (promptEvalEndTime == null) {
+                promptEvalEndTime = new Date()
+            }

-    let res = http.post(`${server_url}/chat/completions`, body, {
-        headers: {'Content-Type': 'application/json'},
-        timeout: '300s'
-    })
+            let chunk = JSON.parse(event.data)
+            let choice = chunk.choices[0]
+            if (choice.finish_reason) {
+                finish_reason = choice.finish_reason
+            }

-    check(res, {'success completion': (r) => r.status === 200})
+            if (chunk.usage) {
+                prompt_tokens = chunk.usage.prompt_tokens
+                llamacpp_prompt_tokens.add(prompt_tokens)
+                llamacpp_prompt_tokens_total_counter.add(prompt_tokens)
+
+                completions_tokens = chunk.usage.completion_tokens
+                llamacpp_completion_tokens.add(completions_tokens)
+                llamacpp_completion_tokens_total_counter.add(completions_tokens)
+            }
+        })

-    if (res.status === 200) {
-        const completions = res.json()
+        client.on('error', function (e) {
+            console.log('An unexpected error occurred: ', e.error());
+            throw e;
+        })
+    })

-        llamacpp_prompt_tokens.add(completions.usage.prompt_tokens)
-        llamacpp_prompt_tokens_total_counter.add(completions.usage.prompt_tokens)
+    check(res, {'success completion': (r) => r.status === 200})

-        llamacpp_completion_tokens.add(completions.usage.completion_tokens)
-        llamacpp_completion_tokens_total_counter.add(completions.usage.completion_tokens)
+    const endTime = new Date()

-        llamacpp_completions_truncated_rate.add(completions.choices[0].finish_reason === 'length')
-        llamacpp_completions_stop_rate.add(completions.choices[0].finish_reason === 'stop')
+    const promptEvalTime = promptEvalEndTime - startTime
+    if (promptEvalTime > 0) {
+        llamacpp_prompt_processing_second.add(prompt_tokens / (promptEvalEndTime - startTime) * 1.e3)
+    }

-        llamacpp_tokens_second.add(completions.usage.total_tokens / res.timings.duration * 1.e3)
-    } else {
-        console.error(`response: ${res.body} request=${payload}`)
+    const completion_time = endTime - promptEvalEndTime
+    if (completions_tokens > 0 && completion_time > 0) {
+        llamacpp_tokens_second.add(completions_tokens / completion_time * 1.e3)
     }
+    llamacpp_completions_truncated_rate.add(finish_reason === 'length')
+    llamacpp_completions_stop_rate.add(finish_reason === 'stop')

     sleep(0.3)
 }
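The measurement idea in the new script: the timestamp of the first SSE event marks the end of prompt processing, the last chunk carries `usage`, and prompt-processing and token-generation rates are derived from those two intervals. Below is an illustrative Python equivalent of that logic, not part of the commit; the endpoint path, port, model name and the `requests` dependency are assumptions to adapt for your setup.

```python
# Illustrative Python counterpart of the timing logic in script.js:
# first streamed chunk => prompt fully processed; final chunk => usage counters.
import json
import time

import requests

url = "http://localhost:8080/v1/chat/completions"  # assumed host/port/path
payload = {
    "model": "default",
    "messages": [{"role": "user", "content": "Say hello."}],
    "stream": True,
    "seed": 42,
    "max_tokens": 64,
}

start = time.perf_counter()
prompt_eval_end = None
usage = None

with requests.post(url, json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue  # skip keep-alives and comments
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        if prompt_eval_end is None:
            prompt_eval_end = time.perf_counter()  # first token received
        chunk = json.loads(data)
        usage = chunk.get("usage") or usage  # only the final chunk carries usage

end = time.perf_counter()
if usage and prompt_eval_end:
    pp = usage["prompt_tokens"] / (prompt_eval_end - start)
    tg = usage["completion_tokens"] / max(end - prompt_eval_end, 1e-9)
    print(f"pp={pp:.2f} tk/s  tg={tg:.2f} tk/s")
```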

examples/server/utils.hpp

Lines changed: 9 additions & 0 deletions
@@ -567,6 +567,15 @@ static std::vector<json> format_partial_response_oaicompat(json result, const st
             {"model", modelname},
             {"object", "chat.completion.chunk"}
         };
+        if (!finish_reason.empty()) {
+            int num_tokens_predicted = json_value(result, "tokens_predicted", 0);
+            int num_prompt_tokens    = json_value(result, "tokens_evaluated", 0);
+            ret.push_back({"usage", json {
+                {"completion_tokens", num_tokens_predicted},
+                {"prompt_tokens",     num_prompt_tokens},
+                {"total_tokens",      num_tokens_predicted + num_prompt_tokens}
+            }});
+        }

         return std::vector<json>({ret});
     }
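On the wire, this means the final `chat.completion.chunk` event (the one with a non-empty `finish_reason`) now also carries a `usage` object built from the server's `tokens_predicted` and `tokens_evaluated` counters. A sketch of what a streaming client might see in that last chunk; the counts and model value are made up, only the field names come from the hunk above.

```python
# Illustrative shape of the final streamed chunk after this change;
# values are invented, field names match the C++ hunk above.
final_chunk = {
    "object": "chat.completion.chunk",
    "model": "<modelname reported by the server>",
    "usage": {
        "completion_tokens": 128,  # tokens_predicted
        "prompt_tokens": 256,      # tokens_evaluated
        "total_tokens": 384,
    },
    # ...plus the usual choices/delta fields of a streamed chunk
}

assert final_chunk["usage"]["total_tokens"] == (
    final_chunk["usage"]["completion_tokens"] + final_chunk["usage"]["prompt_tokens"]
)
```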
