server : refactor middleware and /health endpoint (ggml-org#9056)
* server : refactor middleware and /health endpoint
* move "fail_on_no_slot" to /slots
* Update examples/server/server.cpp
Co-authored-by: Georgi Gerganov <[email protected]>
* fix server tests
* fix CI
* update server docs
---------
Co-authored-by: Georgi Gerganov <[email protected]>
examples/server/README.md
Lines changed: 26 additions & 9 deletions
@@ -368,15 +368,16 @@ node index.js

 ## API Endpoints

-### GET `/health`: Returns the current state of the server
+### GET `/health`: Returns health check result

-- 503 -> `{"status": "loading model"}` if the model is still being loaded.
-- 500 -> `{"status": "error"}` if the model failed to load.
-- 200 -> `{"status": "ok", "slots_idle": 1, "slots_processing": 2 }` if the model is successfully loaded and the server is ready for further requests mentioned below.
-- 200 -> `{"status": "no slot available", "slots_idle": 0, "slots_processing": 32}` if no slots are currently available.
-- 503 -> `{"status": "no slot available", "slots_idle": 0, "slots_processing": 32}` if the query parameter `fail_on_no_slot` is provided and no slots are currently available.
+**Response format**

-If the query parameter `include_slots` is passed, `slots` field will contain internal slots data except if `--slots-endpoint-disable` is set.
+- Explanation: the model is successfully loaded and the server is ready.

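With slot information removed from `/health`, a client can use the endpoint as a plain readiness probe. The following is a minimal sketch, not part of this commit: it assumes the server listens on `http://localhost:8080` and treats any response other than HTTP 200 as "not ready yet". The helper names `http_probe` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_probe(base_url: str = "http://localhost:8080") -> Optional[int]:
    """Return the HTTP status code of GET /health, or None if unreachable.

    The base URL is an assumption; adjust it to your server's host and port.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses (for example while the model loads) land here.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return None


def wait_until_ready(
    probe: Callable[[], Optional[int]],
    attempts: int = 30,
    delay: float = 1.0,
) -> bool:
    """Call `probe` until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if probe() == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A caller would typically run `wait_until_ready(http_probe)` before sending the first completion request. Passing the probe as a callable keeps the retry loop testable without a running server.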
 ### POST `/completion`: Given a `prompt`, it returns the predicted completion.

@@ -639,10 +640,16 @@ Given a ChatML-formatted json description in `messages`, it returns the predicte
 }'
 ```

-### GET `/slots`: Returns the current slots processing state. Can be disabled with `--slots-endpoint-disable`.
+### GET `/slots`: Returns the current slots processing state
+
+This endpoint can be disabled with `--no-slots`.
+
+If the query parameter `?fail_on_no_slot=1` is set, this endpoint responds with status code 503 when no slots are available.

 **Response format**

+Example:
+
 ```json
 [
 {
@@ -702,7 +709,13 @@ Given a ChatML-formatted json description in `messages`, it returns the predicte
 ]
 ```

-### GET `/metrics`: Prometheus compatible metrics exporter endpoint if `--metrics` is enabled:
+Possible values for `slot[i].state` are:
+- `0`: SLOT_STATE_IDLE
+- `1`: SLOT_STATE_PROCESSING
+
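The slot counts that `/health` used to report can be derived on the client side from the `/slots` response. The sketch below is illustrative only: it assumes the response body is a JSON array whose elements each carry an integer `state` field with the two values listed above. The `id` field in the usage note and the names `SlotState` and `summarize_slots` are hypothetical.

```python
import json
from enum import IntEnum
from typing import Dict


class SlotState(IntEnum):
    """Mirror of the documented slot states."""
    IDLE = 0
    PROCESSING = 1


def summarize_slots(body: str) -> Dict[str, int]:
    """Count idle and processing slots in a /slots JSON response body."""
    counts = {state.name.lower(): 0 for state in SlotState}
    for slot in json.loads(body):
        # SlotState(...) raises ValueError for a state value not listed above.
        counts[SlotState(slot["state"]).name.lower()] += 1
    return counts
```

For a body such as `[{"id": 0, "state": 0}, {"id": 1, "state": 1}]` this yields one idle and one processing slot. A client that only needs a yes/no answer can instead request `/slots?fail_on_no_slot=1` and check for status code 503.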
+### GET `/metrics`: Prometheus compatible metrics exporter
+
+This endpoint is only accessible if `--metrics` is set.

 Available metrics:
 - `llamacpp:prompt_tokens_total`: Number of prompt tokens processed.
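Because `/metrics` is a Prometheus compatible exporter, its body can be read with a few lines of code when a full Prometheus client is not available. This is a rough sketch under stated assumptions: it handles only unlabelled samples of the form `name value`, skips `# HELP` and `# TYPE` comment lines, and ignores any trailing timestamp. The function name `parse_prometheus_text` is hypothetical.

```python
from typing import Dict


def parse_prometheus_text(text: str) -> Dict[str, float]:
    """Parse unlabelled samples from Prometheus text exposition output."""
    metrics: Dict[str, float] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 2:
            continue  # Not a "name value" sample; ignore it.
        metrics[parts[0]] = float(parts[1])
    return metrics
```

Reading `llamacpp:prompt_tokens_total` then becomes a dictionary lookup on the parsed result. Samples with labels containing spaces would need a real exposition-format parser.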
@@ -767,6 +780,10 @@ Available metrics:

 ### GET `/lora-adapters`: Get list of all LoRA adapters

+This endpoint returns the loaded LoRA adapters. You can add adapters using `--lora` when starting the server, for example: `--lora my_adapter_1.gguf --lora my_adapter_2.gguf ...`
+
+By default, all adapters will be loaded with scale set to 1. To initialize all adapters with scale 0, add `--lora-init-without-apply`.
+
 If an adapter is disabled, the scale will be set to 0.
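When the server is launched from a script, the repeated `--lora` flags described above can be assembled programmatically. The helper below is a hypothetical sketch: `build_lora_args` is not part of the server, and it produces only the LoRA-related arguments, leaving the executable name, model path and every other option to the caller.

```python
from typing import List, Sequence


def build_lora_args(
    adapters: Sequence[str],
    init_without_apply: bool = False,
) -> List[str]:
    """Build the LoRA-related command-line arguments for the server."""
    args: List[str] = []
    for path in adapters:
        # One "--lora <path>" pair per adapter, in the order given.
        args += ["--lora", path]
    if init_without_apply:
        # Load the adapters with scale 0 instead of the default 1.
        args.append("--lora-init-without-apply")
    return args
```

The resulting list can be appended to the rest of the server command line before passing it to `subprocess.run`.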