* server: monitoring - add `/metrics` Prometheus-compatible endpoint
* server: fix a concurrency issue: when two tasks are waiting for results, only one caller thread is notified
* server: metrics - move to a dedicated struct
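The concurrency fix in the second bullet is the classic "notify one waiter when several are waiting" bug. The sketch below illustrates the pattern in Python rather than the server's actual C++ code; `ResultQueue`, `post_result`, and `wait_result` are hypothetical names chosen for the example, not APIs from llama.cpp:

```python
import threading

class ResultQueue:
    """Toy analogue of a server task queue where several callers wait for results.

    Using cond.notify() here would wake only one waiting thread; if two tasks
    are both waiting, the second could hang until some later notification.
    notify_all() wakes every waiter so each one can re-check whether its own
    result has arrived.
    """

    def __init__(self):
        self.cond = threading.Condition()
        self.results = {}

    def post_result(self, task_id, value):
        with self.cond:
            self.results[task_id] = value
            self.cond.notify_all()  # wake *all* waiters, not just one

    def wait_result(self, task_id, timeout=5.0):
        with self.cond:
            # wait_for re-checks the predicate on every wakeup, so a waiter
            # woken for someone else's result simply goes back to sleep
            self.cond.wait_for(lambda: task_id in self.results, timeout=timeout)
            return self.results.get(task_id)
```

With `notify()` instead of `notify_all()`, two threads waiting on different task ids could end up with only one of them being woken per posted result.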
`examples/server/README.md` (13 additions, 0 deletions)
@@ -41,6 +41,7 @@ see https://github.com/ggerganov/llama.cpp/issues/1437

- `--grp-attn-w`: Set the group attention width to extend the context size through self-extend (default: 512); used together with the group attention factor `--grp-attn-n`
- `-n, --n-predict`: Set the maximum number of tokens to predict (default: -1)
- `--slots-endpoint-disable`: Disable the slots state monitoring endpoint. Slots state may contain user data, including prompts.
- `--chat-template JINJA_TEMPLATE`: Set a custom Jinja chat template. This parameter accepts a string, not a file name (default: template taken from the model's metadata). Only [some pre-defined templates](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) are supported.
## Build
@@ -457,6 +458,18 @@ Notice that each `probs` is an array of length `n_probs`.

- **GET** `/metrics`: [Prometheus](https://prometheus.io/)-compatible metrics exporter endpoint, available if the server is started with `--metrics`.

  Available metrics:
  - `llamacpp:prompt_tokens_total`: Number of prompt tokens processed.
  - `llamacpp:tokens_predicted_total`: Number of generation tokens processed.
  - `llamacpp:prompt_tokens_seconds`: Average prompt throughput in tokens/s.
  - `llamacpp:predicted_tokens_seconds`: Average generation throughput in tokens/s.
  - `llamacpp:kv_cache_usage_ratio`: KV-cache usage (1 means 100% usage).
  - `llamacpp:kv_cache_tokens`: Number of tokens in the KV-cache.
  - `llamacpp:requests_processing`: Number of requests currently processing.
  - `llamacpp:requests_deferred`: Number of requests deferred.
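As a consumer-side sketch, the snippet below fetches the endpoint and parses the plain Prometheus text exposition format into a dict. The base URL `http://localhost:8080` is an assumption (wherever the server was started with `--metrics`), and the tiny parser only handles unlabelled samples like the `llamacpp:*` metrics listed above:

```python
import re
import urllib.request

def parse_prometheus_text(text):
    """Parse simple Prometheus exposition-format lines into {name: value}.

    Handles only unlabelled samples of the form "name value", which is
    all the llamacpp:* metrics above use; HELP/TYPE comments are skipped.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = re.match(r"^([A-Za-z_:][A-Za-z0-9_:]*)\s+(-?[0-9.eE+]+)$", line)
        if m:
            metrics[m.group(1)] = float(m.group(2))
    return metrics

def fetch_metrics(base_url="http://localhost:8080"):
    # Assumes a llama.cpp server running locally with --metrics enabled;
    # the host and port are examples, not defaults mandated by the server.
    with urllib.request.urlopen(base_url + "/metrics") as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))
```

In a real deployment you would more likely point a Prometheus scrape job at `/metrics` than poll it by hand, but the parser is handy for quick checks.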