Commit 1b78ed2

Only show -ngl option when relevant + other doc/arg handling updates (#1625)
1. Add a `LLAMA_SUPPORTS_GPU_OFFLOAD` define to `llama.h` (defined when compiled with CLBlast or cuBLAS).
2. Update the argument handling in the common example code to only show the `-ngl`, `--n-gpu-layers` option when GPU offload is possible.
3. Add an entry for the `-ngl`, `--n-gpu-layers` option to the `main` and `server` examples documentation.
4. Update the `main` and `server` examples documentation to use the new style dash-separator argument format.
5. Update the `server` example to use dash separators for its arguments and add `-ngl` to `--help` (only shown when compiled with appropriate support). It will still support `--memory_f32` and `--ctx_size` for compatibility.
6. Add a warning discouraging use of `--memory-f32` to the `--help` text and documentation of the `main` and `server` examples.

Rationale: #1593 (reply in thread)
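
The `llama.h` hunk itself is not reproduced in the excerpt below; as a rough sketch, the capability flag described in point 1 can be derived from the `GGML_USE_CUBLAS` / `GGML_USE_CLBLAST` macros that the cuBLAS and CLBlast builds already define (illustrative, not the verbatim hunk):

```cpp
// Sketch: advertise GPU layer offload at compile time so example code can
// decide whether to expose -ngl at all. Assumes the GGML_USE_CUBLAS /
// GGML_USE_CLBLAST build macros set by the respective backends.
#if defined(GGML_USE_CUBLAS) || defined(GGML_USE_CLBLAST)
#define LLAMA_SUPPORTS_GPU_OFFLOAD
#endif
```
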
1 parent 337aea1 commit 1b78ed2

5 files changed: +56 / -32 lines changed

examples/common.cpp

Lines changed: 9 additions & 1 deletion

@@ -289,7 +289,12 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
                 invalid_param = true;
                 break;
             }
+#ifdef LLAMA_SUPPORTS_GPU_OFFLOAD
             params.n_gpu_layers = std::stoi(argv[i]);
+#else
+            fprintf(stderr, "warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored\n");
+            fprintf(stderr, "warning: see main README.md for information on enabling GPU BLAS support\n");
+#endif
         } else if (arg == "--no-mmap") {
             params.use_mmap = false;
         } else if (arg == "--mtest") {
@@ -416,7 +421,8 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
     fprintf(stderr, "  -c N, --ctx-size N    size of the prompt context (default: %d)\n", params.n_ctx);
     fprintf(stderr, "  --ignore-eos          ignore end of stream token and continue generating (implies --logit-bias 2-inf)\n");
     fprintf(stderr, "  --no-penalize-nl      do not penalize newline token\n");
-    fprintf(stderr, "  --memory-f32          use f32 instead of f16 for memory key+value\n");
+    fprintf(stderr, "  --memory-f32          use f32 instead of f16 for memory key+value (default: disabled)\n");
+    fprintf(stderr, "                        not recommended: doubles context memory required and no measurable increase in quality\n");
     fprintf(stderr, "  --temp N              temperature (default: %.1f)\n", (double)params.temp);
     fprintf(stderr, "  -b N, --batch-size N  batch size for prompt processing (default: %d)\n", params.n_batch);
     fprintf(stderr, "  --perplexity          compute perplexity over the prompt\n");
@@ -427,8 +433,10 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
     if (llama_mmap_supported()) {
         fprintf(stderr, "  --no-mmap             do not memory-map model (slower load but may reduce pageouts if not using mlock)\n");
     }
+#ifdef LLAMA_SUPPORTS_GPU_OFFLOAD
     fprintf(stderr, "  -ngl N, --n-gpu-layers N\n");
     fprintf(stderr, "                        number of layers to store in VRAM\n");
+#endif
     fprintf(stderr, "  --mtest               compute maximum memory usage\n");
     fprintf(stderr, "  --verbose-prompt      print prompt before generation\n");
     fprintf(stderr, "  --lora FNAME          apply LoRA adapter (implies --no-mmap)\n");

examples/main/README.md

Lines changed: 27 additions & 26 deletions

@@ -69,8 +69,8 @@ In this section, we cover the most commonly used options for running the `main`
 - `-m FNAME, --model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.bin`).
 - `-i, --interactive`: Run the program in interactive mode, allowing you to provide input directly and receive real-time responses.
 - `-ins, --instruct`: Run the program in instruction mode, which is particularly useful when working with Alpaca models.
-- `-n N, --n_predict N`: Set the number of tokens to predict when generating text. Adjusting this value can influence the length of the generated text.
-- `-c N, --ctx_size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference.
+- `-n N, --n-predict N`: Set the number of tokens to predict when generating text. Adjusting this value can influence the length of the generated text.
+- `-c N, --ctx-size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference.

 ## Input Prompts

@@ -136,29 +136,29 @@ During text generation, LLaMA models have a limited context size, which means th

 ### Context Size

-The `--ctx_size` option allows you to set the size of the prompt context used by the LLaMA models during text generation. A larger context size helps the model to better comprehend and generate responses for longer input or conversations.
+The `--ctx-size` option allows you to set the size of the prompt context used by the LLaMA models during text generation. A larger context size helps the model to better comprehend and generate responses for longer input or conversations.

-- `-c N, --ctx_size N`: Set the size of the prompt context (default: 512). The LLaMA models were built with a context of 2048, which will yield the best results on longer input/inference. However, increasing the context size beyond 2048 may lead to unpredictable results.
+- `-c N, --ctx-size N`: Set the size of the prompt context (default: 512). The LLaMA models were built with a context of 2048, which will yield the best results on longer input/inference. However, increasing the context size beyond 2048 may lead to unpredictable results.

 ### Keep Prompt

 The `--keep` option allows users to retain the original prompt when the model runs out of context, ensuring a connection to the initial instruction or conversation topic is maintained.

 - `--keep N`: Specify the number of tokens from the initial prompt to retain when the model resets its internal context. By default, this value is set to 0 (meaning no tokens are kept). Use `-1` to retain all tokens from the initial prompt.

-By utilizing context management options like `--ctx_size` and `--keep`, you can maintain a more coherent and consistent interaction with the LLaMA models, ensuring that the generated text remains relevant to the original prompt or conversation.
+By utilizing context management options like `--ctx-size` and `--keep`, you can maintain a more coherent and consistent interaction with the LLaMA models, ensuring that the generated text remains relevant to the original prompt or conversation.

 ## Generation Flags

 The following options allow you to control the text generation process and fine-tune the diversity, creativity, and quality of the generated text according to your needs. By adjusting these options and experimenting with different combinations of values, you can find the best settings for your specific use case.

 ### Number of Tokens to Predict

-- `-n N, --n_predict N`: Set the number of tokens to predict when generating text (default: 128, -1 = infinity).
+- `-n N, --n-predict N`: Set the number of tokens to predict when generating text (default: 128, -1 = infinity).

-The `--n_predict` option controls the number of tokens the model generates in response to the input prompt. By adjusting this value, you can influence the length of the generated text. A higher value will result in longer text, while a lower value will produce shorter text. A value of -1 will cause text to be generated without limit.
+The `--n-predict` option controls the number of tokens the model generates in response to the input prompt. By adjusting this value, you can influence the length of the generated text. A higher value will result in longer text, while a lower value will produce shorter text. A value of -1 will cause text to be generated without limit.

-It is important to note that the generated text may be shorter than the specified number of tokens if an End-of-Sequence (EOS) token or a reverse prompt is encountered. In interactive mode text generation will pause and control will be returned to the user. In non-interactive mode, the program will end. In both cases, the text generation may stop before reaching the specified `n_predict` value. If you want the model to keep going without ever producing End-of-Sequence on its own, you can use the `--ignore-eos` parameter.
+It is important to note that the generated text may be shorter than the specified number of tokens if an End-of-Sequence (EOS) token or a reverse prompt is encountered. In interactive mode text generation will pause and control will be returned to the user. In non-interactive mode, the program will end. In both cases, the text generation may stop before reaching the specified `n-predict` value. If you want the model to keep going without ever producing End-of-Sequence on its own, you can use the `--ignore-eos` parameter.

 ### Temperature

@@ -170,33 +170,33 @@ Example usage: `--temp 0.5`

 ### Repeat Penalty

-- `--repeat_penalty N`: Control the repetition of token sequences in the generated text (default: 1.1).
-- `--repeat_last_n N`: Last n tokens to consider for penalizing repetition (default: 64, 0 = disabled, -1 = ctx_size).
+- `--repeat-penalty N`: Control the repetition of token sequences in the generated text (default: 1.1).
+- `--repeat-last-n N`: Last n tokens to consider for penalizing repetition (default: 64, 0 = disabled, -1 = ctx-size).
 - `--no-penalize-nl`: Disable penalization for newline tokens when applying the repeat penalty.

-The `repeat_penalty` option helps prevent the model from generating repetitive or monotonous text. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. The default value is 1.1.
+The `repeat-penalty` option helps prevent the model from generating repetitive or monotonous text. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. The default value is 1.1.

-The `repeat_last_n` option controls the number of tokens in the history to consider for penalizing repetition. A larger value will look further back in the generated text to prevent repetitions, while a smaller value will only consider recent tokens. A value of 0 disables the penalty, and a value of -1 sets the number of tokens considered equal to the context size (`ctx_size`).
+The `repeat-last-n` option controls the number of tokens in the history to consider for penalizing repetition. A larger value will look further back in the generated text to prevent repetitions, while a smaller value will only consider recent tokens. A value of 0 disables the penalty, and a value of -1 sets the number of tokens considered equal to the context size (`ctx-size`).

 Use the `--no-penalize-nl` option to disable newline penalization when applying the repeat penalty. This option is particularly useful for generating chat conversations, dialogues, code, poetry, or any text where newline tokens play a significant role in structure and formatting. Disabling newline penalization helps maintain the natural flow and intended formatting in these specific use cases.

-Example usage: `--repeat_penalty 1.15 --repeat_last_n 128 --no-penalize-nl`
+Example usage: `--repeat-penalty 1.15 --repeat-last-n 128 --no-penalize-nl`

 ### Top-K Sampling

-- `--top_k N`: Limit the next token selection to the K most probable tokens (default: 40).
+- `--top-k N`: Limit the next token selection to the K most probable tokens (default: 40).

-Top-k sampling is a text generation method that selects the next token only from the top k most likely tokens predicted by the model. It helps reduce the risk of generating low-probability or nonsensical tokens, but it may also limit the diversity of the output. A higher value for top_k (e.g., 100) will consider more tokens and lead to more diverse text, while a lower value (e.g., 10) will focus on the most probable tokens and generate more conservative text. The default value is 40.
+Top-k sampling is a text generation method that selects the next token only from the top k most likely tokens predicted by the model. It helps reduce the risk of generating low-probability or nonsensical tokens, but it may also limit the diversity of the output. A higher value for top-k (e.g., 100) will consider more tokens and lead to more diverse text, while a lower value (e.g., 10) will focus on the most probable tokens and generate more conservative text. The default value is 40.

-Example usage: `--top_k 30`
+Example usage: `--top-k 30`

 ### Top-P Sampling

-- `--top_p N`: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P (default: 0.9).
+- `--top-p N`: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P (default: 0.9).

-Top-p sampling, also known as nucleus sampling, is another text generation method that selects the next token from a subset of tokens that together have a cumulative probability of at least p. This method provides a balance between diversity and quality by considering both the probabilities of tokens and the number of tokens to sample from. A higher value for top_p (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. The default value is 0.9.
+Top-p sampling, also known as nucleus sampling, is another text generation method that selects the next token from a subset of tokens that together have a cumulative probability of at least p. This method provides a balance between diversity and quality by considering both the probabilities of tokens and the number of tokens to sample from. A higher value for top-p (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. The default value is 0.9.

-Example usage: `--top_p 0.95`
+Example usage: `--top-p 0.95`

 ### Tail Free Sampling (TFS)

@@ -217,16 +217,16 @@ Example usage: `--typical 0.9`
 ### Mirostat Sampling

 - `--mirostat N`: Enable Mirostat sampling, controlling perplexity during text generation (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0).
-- `--mirostat_lr N`: Set the Mirostat learning rate, parameter eta (default: 0.1).
-- `--mirostat_ent N`: Set the Mirostat target entropy, parameter tau (default: 5.0).
+- `--mirostat-lr N`: Set the Mirostat learning rate, parameter eta (default: 0.1).
+- `--mirostat-ent N`: Set the Mirostat target entropy, parameter tau (default: 5.0).

 Mirostat is an algorithm that actively maintains the quality of generated text within a desired range during text generation. It aims to strike a balance between coherence and diversity, avoiding low-quality output caused by excessive repetition (boredom traps) or incoherence (confusion traps).

-The `--mirostat_lr` option sets the Mirostat learning rate (eta). The learning rate influences how quickly the algorithm responds to feedback from the generated text. A lower learning rate will result in slower adjustments, while a higher learning rate will make the algorithm more responsive. The default value is `0.1`.
+The `--mirostat-lr` option sets the Mirostat learning rate (eta). The learning rate influences how quickly the algorithm responds to feedback from the generated text. A lower learning rate will result in slower adjustments, while a higher learning rate will make the algorithm more responsive. The default value is `0.1`.

-The `--mirostat_ent` option sets the Mirostat target entropy (tau), which represents the desired perplexity value for the generated text. Adjusting the target entropy allows you to control the balance between coherence and diversity in the generated text. A lower value will result in more focused and coherent text, while a higher value will lead to more diverse and potentially less coherent text. The default value is `5.0`.
+The `--mirostat-ent` option sets the Mirostat target entropy (tau), which represents the desired perplexity value for the generated text. Adjusting the target entropy allows you to control the balance between coherence and diversity in the generated text. A lower value will result in more focused and coherent text, while a higher value will lead to more diverse and potentially less coherent text. The default value is `5.0`.

-Example usage: `--mirostat 2 --mirostat_lr 0.05 --mirostat_ent 3.0`
+Example usage: `--mirostat 2 --mirostat-lr 0.05 --mirostat-ent 3.0`

 ### Logit Bias

@@ -264,11 +264,11 @@ These options help improve the performance and memory usage of the LLaMA models.

 ### Memory Float 32

-- `--memory_f32`: Use 32-bit floats instead of 16-bit floats for memory key+value, allowing higher quality inference at the cost of higher memory usage.
+- `--memory-f32`: Use 32-bit floats instead of 16-bit floats for memory key+value. This doubles the context memory requirement and cached prompt file size but does not appear to increase generation quality in a measurable way. Not recommended.

 ### Batch Size

-- `-b N, --batch_size N`: Set the batch size for prompt processing (default: 512). This large batch size benefits users who have BLAS installed and enabled it during the build. If you don't have BLAS enabled ("BLAS=0"), you can use a smaller number, such as 8, to see the prompt progress as it's evaluated in some situations.
+- `-b N, --batch-size N`: Set the batch size for prompt processing (default: 512). This large batch size benefits users who have BLAS installed and enabled it during the build. If you don't have BLAS enabled ("BLAS=0"), you can use a smaller number, such as 8, to see the prompt progress as it's evaluated in some situations.

 ### Prompt Caching

@@ -285,5 +285,6 @@ These options provide extra functionality and customization when running the LLa
 - `-h, --help`: Display a help message showing all available options and their default values. This is particularly useful for checking the latest options and default values, as they can change frequently, and the information in this document may become outdated.
 - `--verbose-prompt`: Print the prompt before generating text.
 - `--mtest`: Test the model's functionality by running a series of tests to ensure it's working properly.
+- `-ngl N, --n-gpu-layers N`: When compiled with appropriate support (currently CLBlast or cuBLAS), this option allows offloading some layers to the GPU for computation. Generally results in increased performance.
 - `--lora FNAME`: Apply a LoRA (Low-Rank Adaptation) adapter to the model (implies --no-mmap). This allows you to adapt the pretrained model to specific tasks or domains.
 - `--lora-base FNAME`: Optional model to use as a base for the layers modified by the LoRA adapter. This flag is used in conjunction with the `--lora` flag, and specifies the base model for the adaptation.

examples/server/README.md

Lines changed: 3 additions & 2 deletions

@@ -285,7 +285,8 @@ Test();
 ## Common Options

 - `-m FNAME, --model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.bin`).
-- `-c N, --ctx_size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference.
+- `-c N, --ctx-size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference.
+- `-ngl N, --n-gpu-layers N`: When compiled with appropriate support (currently CLBlast or cuBLAS), this option allows offloading some layers to the GPU for computation. Generally results in increased performance.
 - `--embedding`: Enable the embedding mode. **Completion function doesn't work in this mode**.
 - `--host`: Set the hostname or ip address to listen. Default `127.0.0.1`;
 - `--port`: Set the port to listen. Default: `8080`.
@@ -304,7 +305,7 @@ The RNG seed is used to initialize the random number generator that influences t

 ### Memory Float 32

-- `--memory_f32`: Use 32-bit floats instead of 16-bit floats for memory key+value, allowing higher quality inference at the cost of higher memory usage.
+- `--memory-f32`: Use 32-bit floats instead of 16-bit floats for memory key+value. This doubles the context memory requirement but does not appear to increase generation quality in a measurable way. Not recommended.

 ## Limitations:
