
Commit ffa1667

ngxson and ggerganov authored and committed
server : refactor middleware and /health endpoint (ggml-org#9056)
* server : refactor middleware and /health endpoint
* move "fail_on_no_slot" to /slots
* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* fix server tests
* fix CI
* update server docs

---------

Co-authored-by: Georgi Gerganov <[email protected]>
1 parent de614ee commit ffa1667

3 files changed, +178 -218 lines changed

examples/server/README.md

Lines changed: 26 additions & 9 deletions
@@ -368,15 +368,16 @@ node index.js

 ## API Endpoints

-### GET `/health`: Returns the current state of the server
+### GET `/health`: Returns health check result

-- 503 -> `{"status": "loading model"}` if the model is still being loaded.
-- 500 -> `{"status": "error"}` if the model failed to load.
-- 200 -> `{"status": "ok", "slots_idle": 1, "slots_processing": 2 }` if the model is successfully loaded and the server is ready for further requests mentioned below.
-- 200 -> `{"status": "no slot available", "slots_idle": 0, "slots_processing": 32}` if no slots are currently available.
-- 503 -> `{"status": "no slot available", "slots_idle": 0, "slots_processing": 32}` if the query parameter `fail_on_no_slot` is provided and no slots are currently available.
+**Response format**

-If the query parameter `include_slots` is passed, `slots` field will contain internal slots data except if `--slots-endpoint-disable` is set.
+- HTTP status code 503
+  - Body: `{"error": {"code": 503, "message": "Loading model", "type": "unavailable_error"}}`
+  - Explanation: the model is still being loaded.
+- HTTP status code 200
+  - Body: `{"status": "ok" }`
+  - Explanation: the model is successfully loaded and the server is ready.

 ### POST `/completion`: Given a `prompt`, it returns the predicted completion.
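The new `/health` contract in the hunk above leaves only two documented outcomes, so a client can classify a response from the status code and body alone. Below is a minimal Python sketch of that check. The function name `classify_health` and the labels `"ready"`, `"loading"`, and `"unknown"` are illustrative assumptions, not part of the server. Anything the updated README does not document is reported as `"unknown"`.

```python
import json


def classify_health(status_code: int, body: str) -> str:
    """Classify a /health response as 'ready', 'loading', or 'unknown'.

    Only the two responses documented in the updated README are recognized.
    The returned labels are hypothetical names chosen for this sketch.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "unknown"
    if not isinstance(payload, dict):
        return "unknown"

    # 200 with {"status": "ok"}: model loaded, server ready.
    if status_code == 200 and payload.get("status") == "ok":
        return "ready"

    # 503 with an "unavailable_error" error object: model still loading.
    error = payload.get("error")
    if (
        status_code == 503
        and isinstance(error, dict)
        and error.get("type") == "unavailable_error"
    ):
        return "loading"

    return "unknown"
```

A caller that previously read `slots_idle` or `slots_processing` from `/health` would, under this change, have to query `/slots` instead, since those fields no longer appear in the documented body.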

@@ -639,10 +640,16 @@ Given a ChatML-formatted json description in `messages`, it returns the predicte
 }'
 ```

-### GET `/slots`: Returns the current slots processing state. Can be disabled with `--slots-endpoint-disable`.
+### GET `/slots`: Returns the current slots processing state
+
+This endpoint can be disabled with `--no-slots`
+
+If query param `?fail_on_no_slot=1` is set, this endpoint will respond with status code 503 if there are no available slots.

 **Response format**

+Example:
+
 ```json
 [
 {
@@ -702,7 +709,13 @@ Given a ChatML-formatted json description in `messages`, it returns the predicte
 ]
 ```

-### GET `/metrics`: Prometheus compatible metrics exporter endpoint if `--metrics` is enabled:
+Possible values for `slot[i].state` are:
+- `0`: SLOT_STATE_IDLE
+- `1`: SLOT_STATE_PROCESSING
+
+### GET `/metrics`: Prometheus compatible metrics exporter
+
+This endpoint is only accessible if `--metrics` is set.

 Available metrics:
 - `llamacpp:prompt_tokens_total`: Number of prompt tokens processed.
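The `/slots` changes above move `fail_on_no_slot` onto this endpoint and document the numeric `state` values. The Python sketch below shows one way a client might build the request URL and tally slot states from a parsed response. The helper names `slots_url` and `count_slot_states` and the base URL used in the usage note are assumptions made for illustration. The only response field relied on is `state`, as described in the diff.

```python
from urllib.parse import urlencode

# State values as documented in the updated README.
SLOT_STATE_IDLE = 0
SLOT_STATE_PROCESSING = 1


def slots_url(base_url: str, fail_on_no_slot: bool = False) -> str:
    """Build the /slots URL, optionally adding ?fail_on_no_slot=1."""
    url = base_url.rstrip("/") + "/slots"
    if fail_on_no_slot:
        url += "?" + urlencode({"fail_on_no_slot": 1})
    return url


def count_slot_states(slots: list) -> dict:
    """Count idle and processing slots in a parsed /slots response."""
    idle = sum(1 for s in slots if s.get("state") == SLOT_STATE_IDLE)
    processing = sum(1 for s in slots if s.get("state") == SLOT_STATE_PROCESSING)
    return {"idle": idle, "processing": processing}
```

For example, `slots_url("http://localhost:8080", fail_on_no_slot=True)` produces a URL that a load balancer probe could request, treating a 503 as "no slot free". The host and port here are placeholders.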
@@ -767,6 +780,10 @@ Available metrics:

 ### GET `/lora-adapters`: Get list of all LoRA adapters

+This endpoint returns the loaded LoRA adapters. You can add adapters using `--lora` when starting the server, for example: `--lora my_adapter_1.gguf --lora my_adapter_2.gguf ...`
+
+By default, all adapters will be loaded with scale set to 1. To initialize all adapter scales to 0, add `--lora-init-without-apply`
+
 If an adapter is disabled, the scale will be set to 0.

 **Response format**
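As a side note on the `--lora` flags added to the README in this hunk: the flag is repeated once per adapter, which is easy to get wrong when assembling a command line by hand. The Python sketch below builds that argument list. The helper name `lora_args` is hypothetical, and the sketch covers only the two flags mentioned in the diff. It does not launch the server or assume a binary name.

```python
def lora_args(adapter_paths: list, init_without_apply: bool = False) -> list:
    """Build server arguments for LoRA adapters.

    Emits one `--lora <path>` pair per adapter, and optionally
    `--lora-init-without-apply` so that all scales start at 0.
    """
    args = []
    for path in adapter_paths:
        args.extend(["--lora", path])
    if init_without_apply:
        args.append("--lora-init-without-apply")
    return args
```

The returned list can be appended to whatever command is used to start the server, for instance through `subprocess.run`.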
