examples/server/README.md (3 additions, 4 deletions)
@@ -17,7 +17,8 @@ The project is under active development, and we are [looking for feedback and contributors]
 
 **Command line options:**
 
-- `--threads N`, `-t N`: Set the number of threads to use during generation. Not used if model layers are offloaded to GPU. The server is using batching. This parameter is used only if one token is to be processed on CPU backend.
+- `-v`, `--verbose`: Enable verbose server output. When using the `/completion` endpoint, this includes the tokenized prompt, the full request and the full response.
+- `-t N`, `--threads N`: Set the number of threads to use during generation. Not used if model layers are offloaded to GPU. The server uses batching; this parameter is only used when a single token is to be processed on the CPU backend.
 - `-tb N, --threads-batch N`: Set the number of threads to use during batch and prompt processing. If not specified, the number of threads will be set to the number of threads used for generation. Not used if model layers are offloaded to GPU.
 - `--threads-http N`: Number of threads in the http server pool to process requests. Default: `max(std::thread::hardware_concurrency() - 1, --parallel N + 2)`
 - `-m FNAME`, `--model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.gguf`).
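As a hedged usage sketch (the binary name, model path, and thread counts below are illustrative assumptions, not taken from this diff), the threading and verbosity options from this hunk might be combined like so; if `--threads-http` is omitted, the pool size falls back to the `max(...)` default shown above:

```bash
# Illustrative invocation of the llama.cpp server; adjust paths and counts.
# -t 8             : 8 generation threads (only used for CPU-side token work)
# -tb 16           : 16 threads for batch/prompt processing
# --threads-http 4 : 4 workers in the HTTP request pool
# -v               : verbose logging of tokenized prompts, requests, responses
./server -m models/7B/ggml-model.gguf -t 8 -tb 16 --threads-http 4 -v
```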
@@ -36,9 +37,7 @@ The project is under active development, and we are [looking for feedback and contributors]
 - `--numa STRATEGY`: Attempt one of the below optimization strategies that may help on some NUMA systems
 - `--numa distribute`: Spread execution evenly over all nodes
 - `--numa isolate`: Only spawn threads on CPUs on the node that execution started on
-- `--numa numactl`: Use the CPU map provided by numactl. If run without this previously, it is recommended to drop the system
-page cache before using this. See https://github.com/ggerganov/llama.cpp/issues/1437
-
+- `--numa numactl`: Use the CPU map provided by numactl. If run without this previously, it is recommended to drop the system page cache before using this. See https://github.com/ggerganov/llama.cpp/issues/1437
 - `--numa`: Attempt optimizations that may help on some NUMA systems.
 - `--lora FNAME`: Apply a LoRA (Low-Rank Adaptation) adapter to the model (implies --no-mmap). This allows you to adapt the pretrained model to specific tasks or domains.
 - `--lora-base FNAME`: Optional model to use as a base for the layers modified by the LoRA adapter. This flag is used in conjunction with the `--lora` flag, and specifies the base model for the adaptation.
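For the NUMA and LoRA options in this hunk, a minimal sketch might look like the following; the node numbers, file paths, and the page-cache step are assumptions based on the linked issue, not part of this diff:

```bash
# Assumed workflow: drop the page cache first (needs root; see issue #1437),
# then start the server under the CPU map supplied by numactl.
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
numactl --cpunodebind=0 --membind=0 \
    ./server -m models/7B/ggml-model.gguf --numa numactl \
    --lora lora-adapter.bin --lora-base models/7B/ggml-model-f16.gguf
```

Since `--lora` implies `--no-mmap`, the model is fully loaded into memory rather than memory-mapped.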