guide: llama-cli help reformatted, organized, fleshed out and examples added #15709
**rosmur** started this conversation in Show and tell (3 comments, 4 replies)
Replies:

> is kind of ... disconnected? How do I make use of the information in the red box?

> Thank you, llama.cpp changes faster than my brain can afford.

> Maybe highlight
---

As a non-SWE, I think the `--help` output could be made cleaner and clearer to drive adoption. Here is the help information reformatted, organized into clear sections, with parameter ranges and usage examples added.

# Llama CLI User Guide
A comprehensive guide to using the llama-cli command-line tool for text generation and chat conversations with Large Language Models.
## llama-cli Version

This guide is current as of version 6310 (c8d0d14).
## Quick Start
### Basic Commands
Usage:
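A minimal sketch of the general form and a first run (the model path and prompt are placeholders):

```bash
# General form
llama-cli [options]

# First run: local model, short prompt, bounded output length
llama-cli -m models/llama-2-7b.gguf -p "Hello, how are you?" -n 100
```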
### Essential Parameters
| Parameter | Description | Example / Default |
|---|---|---|
| `-m, --model` | Path to the GGUF model file | `-m models/llama-2-7b.gguf` |
| `-p, --prompt` | Prompt to start generation from | `-p "Hello, how are you?"` |
| `-n, --predict` | Number of tokens to generate | `-n 100` |
| `-sys, --system-prompt` | System prompt (for conversation mode) | `-sys "You are a helpful AI"` |
| `-c, --ctx-size` | Context window size in tokens | `4096` |

## Basic Info and Logging
| Parameter | Description | Default |
|---|---|---|
| `-h, --help` | Print usage and exit | |
| `--version` | Print version and build information | |
| `-v, --verbose` | Enable verbose logging | `false` |

## Model Download Options
| Parameter | Description | Example |
|---|---|---|
| `--hf-repo` | Hugging Face repository (optionally with a `:quant` tag) | `--hf-repo unsloth/phi-4-GGUF:q4_k_m` |
| `--hf-file` | Specific model file within the repository | `--hf-file model-q4_k_m.gguf` |
| `--hf-token` | Hugging Face access token | `--hf-token your_token_here` |
| `--offline` | Offline mode: use only locally cached files | `--offline` |

## Model Adapters
- `--lora` — apply a LoRA adapter
- `--lora-scaled` — apply a LoRA adapter with a user-defined scale
- `--control-vector` — apply a control vector

## Chat Configuration
| Parameter | Description | Default |
|---|---|---|
| `-cnv, --conversation` | Run in conversation (chat) mode | auto-enabled if a chat template is available |
| `-no-cnv, --no-conversation` | Force-disable conversation mode | `false` |
| `-i, --interactive` | Interactive mode | `false` |
| `-if, --interactive-first` | Wait for user input before generating | `false` |
| `-st, --single-turn` | Run a single chat turn and exit | `false` |
| `--jinja` | Use the Jinja chat template engine | |
| `--chat-template` | Use a named built-in chat template | |
| `--chat-template-file` | Load a chat template from a file | |

### Available Built-in Chat Templates
The built-in templates are listed here: https://github.com/ggml-org/llama.cpp/tree/master/models/templates
## Input/Output Control
| Parameter | Description |
|---|---|
| `--in-prefix` | String to prefix user input with |
| `--in-suffix` | String to append after user input |
| `--in-prefix-bos` | Prefix user input with a BOS token |
| `-r, --reverse-prompt` | Stop generation and return control when this string is produced |

## Text Generation Parameters
### Basic Generation Control
| Parameter | Description | Default / Range |
|---|---|---|
| `-n, --predict` | Number of tokens to generate | `-1` (infinite) |
| `--keep` | Tokens from the initial prompt to keep when the context overflows | `0` (range: `0` to context size) |
| `--ignore-eos` | Ignore the end-of-sequence token and keep generating | `false` |

### Context Management
| Parameter | Description | Default |
|---|---|---|
| `-c, --ctx-size` | Context window size in tokens | `4096` |
| `--no-context-shift` | Disable automatic context shifting when the window fills | `false` |
| `-b, --batch-size` | Logical maximum batch size | `2048` |
| `-ub, --ubatch-size` | Physical (micro-)batch size | `512` |

## Sampling and Creativity Control
### Temperature and Randomness
| Parameter | Description | Default | Range |
|---|---|---|---|
| `--temp` | Sampling temperature; higher is more random | `0.8` | `0.1`–`2.0` |
| `-s, --seed` | RNG seed | `-1` (random) | |
| `--dynatemp-range` | Dynamic temperature range | `0.0` | `0.0`–`1.0` |

### Token Selection Methods
| Parameter | Description | Default | Range |
|---|---|---|---|
| `--top-k` | Keep only the `k` most likely tokens | `40` | `1`–`100` |
| `--top-p` | Nucleus sampling: keep the smallest token set with cumulative probability `p` | `0.9` | `0.0`–`1.0` |
| `--min-p` | Drop tokens below this probability, relative to the top token | `0.1` | `0.0`–`1.0` |
| `--typical` | Locally typical sampling (`1.0` disables) | `1.0` | `0.0`–`1.0` |

### Repetition Control
| Parameter | Description | Default |
|---|---|---|
| `--repeat-penalty` | Penalty applied to repeated tokens (`1.0` disables) | `1.0` |
| `--repeat-last-n` | Number of recent tokens considered for the repeat penalty | `64` |
| `--presence-penalty` | Penalize tokens that have already appeared at all (`0.0` disables) | `0.0` |
| `--frequency-penalty` | Penalize tokens proportionally to how often they appeared (`0.0` disables) | `0.0` |

## Advanced Sampling
### DRY (Don't Repeat Yourself) Sampling
| Parameter | Description | Default |
|---|---|---|
| `--dry-multiplier` | DRY penalty strength; any non-zero value enables DRY | `0.0` |
| `--dry-base` | Exponential base for the DRY penalty | `1.75` |
| `--dry-allowed-length` | Longest repeated sequence tolerated without penalty | `2` |
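A sketch of turning DRY on (the model path is a placeholder, and `0.8` is an illustrative multiplier, not a documented default):

```bash
# Penalize verbatim repeats longer than 2 tokens with the default base
llama-cli -m model.gguf -n 200 -p "Write a product description." \
  --dry-multiplier 0.8 --dry-base 1.75 --dry-allowed-length 2
```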
### Mirostat Sampling

| Parameter | Description | Default |
|---|---|---|
| `--mirostat` | Mirostat mode: `0` = disabled, `1` = Mirostat, `2` = Mirostat 2.0 | `0` |
| `--mirostat-lr` | Learning rate (eta) | `0.1` |
| `--mirostat-ent` | Target entropy (tau) | `5.0` |
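A sketch of a Mirostat 2.0 run using the defaults from the table above (model path is a placeholder); when Mirostat is active it takes over from top-k/top-p style truncation:

```bash
# Mirostat 2.0 with target entropy 5.0 and learning rate 0.1
llama-cli -m model.gguf -n 300 -p "Tell me a story." \
  --mirostat 2 --mirostat-lr 0.1 --mirostat-ent 5.0
```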
## Performance and Hardware

### CPU Configuration
| Parameter | Description | Default |
|---|---|---|
| `-t, --threads` | Threads used for generation | `-1` (auto) |
| `-tb, --threads-batch` | Threads used for batch/prompt processing | same as `--threads` |
| `--cpu-mask` | CPU affinity mask (hex) | `""` |
| `--cpu-range` | CPU range for affinity | |

### GPU Configuration
| Parameter | Description | Default |
|---|---|---|
| `-ngl, --gpu-layers` | Number of layers to offload to the GPU | `0` |
| `-sm, --split-mode` | How to split the model across multiple GPUs | `layer` |
| `-mg, --main-gpu` | Index of the main GPU | `0` |
| `-ts, --tensor-split` | Per-GPU split proportions (e.g. `3,1`) | |

#### GPU Split Modes
- `none`: single GPU only
- `layer`: split by layers (recommended)
- `row`: split by tensor rows

### Memory Management
- `--mlock` — lock the model in RAM so it cannot be swapped out
- `--no-mmap` — do not memory-map the model file
- `--numa` — NUMA optimization strategy (see below)

#### NUMA Options
- `distribute`: spread execution across all nodes
- `isolate`: use the current node only
- `numactl`: use the CPU map provided by numactl

## Advanced Features
### Structured Generation
| Parameter | Description | Example |
|---|---|---|
| `--grammar` | Inline GBNF grammar to constrain output | `--grammar "root ::= [a-z]+"` |
| `--grammar-file` | Load a GBNF grammar from a file | `--grammar-file grammar.bnf` |
| `-j, --json-schema` | JSON schema to constrain output | `-j '{"type": "object"}'` |
| `--json-schema-file` | Load a JSON schema from a file | `--json-schema-file schema.json` |

### Reasoning and Thinking
| Parameter | Description | Values |
|---|---|---|
| `--reasoning-format` | How reasoning/thinking content is presented | `none`, `deepseek`, `auto` |
| `--reasoning-budget` | Token budget for reasoning | `-1` (unlimited), `0` (disabled) |
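A hedged sketch for a reasoning-tuned model (the model filename is a placeholder; flag values follow the table above):

```bash
# Surface the model's thinking in deepseek format, with no cap on reasoning tokens
llama-cli -m deepseek-r1-distill.gguf -cnv --jinja \
  --reasoning-format deepseek --reasoning-budget -1
```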
### Caching

- `--prompt-cache` — file used to cache the prompt state
- `--prompt-cache-all` — also cache generated output, not just the prompt
- `--prompt-cache-ro` — use the prompt cache without updating it

### Logging and Debugging
- `--log-file` — write logs to a file
- `--log-colors` — colored log output
- `--log-timestamps` — prefix log lines with timestamps
- `--log-verbosity` — verbosity threshold for log messages
- `--no-perf` — disable performance timings

## Environment Variables
Many parameters can be set via environment variables:
| Environment variable | Equivalent parameter |
|---|---|
| `LLAMA_ARG_MODEL` | `-m, --model` |
| `LLAMA_ARG_CTX_SIZE` | `-c, --ctx-size` |
| `LLAMA_ARG_THREADS` | `-t, --threads` |
| `LLAMA_ARG_N_PREDICT` | `-n, --predict` |
| `LLAMA_ARG_N_GPU_LAYERS` | `-ngl, --gpu-layers` |
| `HF_TOKEN` | `--hf-token` |
| `LLAMA_OFFLINE` | `--offline` |
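For example, the two invocations below are equivalent (paths are placeholders):

```bash
# Flags...
llama-cli -m models/llama-2-7b.gguf -c 8192 -p "Hello"

# ...or the corresponding environment variables
LLAMA_ARG_MODEL=models/llama-2-7b.gguf LLAMA_ARG_CTX_SIZE=8192 \
llama-cli -p "Hello"
```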
## More Examples

### 1. Chat
With temperature setting and conversation mode:
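For instance (the model path is a placeholder):

```bash
# Multi-turn chat with a system prompt and slightly conservative sampling
llama-cli -m models/llama-2-7b.gguf -cnv \
  -sys "You are a helpful AI" --temp 0.7
```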
**Hugging Face Integration**
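Using the repository from the download table above, llama-cli can fetch the model itself:

```bash
# Download (and cache) a quantized model from Hugging Face, then chat
llama-cli --hf-repo unsloth/phi-4-GGUF:q4_k_m -cnv
```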
**Chat Templates**
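A sketch of forcing a specific built-in template (the template name must be one of the built-ins linked above; model path is a placeholder):

```bash
# Override the template embedded in the GGUF with the built-in chatml template
llama-cli -m model.gguf -cnv --chat-template chatml
```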
### 2. Technical Q&A Assistant
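A plausible setup (system prompt and values are illustrative): low temperature for factual answers plus a mild repetition penalty:

```bash
# Precise, low-randomness assistant
llama-cli -m model.gguf -cnv \
  -sys "You are a precise technical assistant. Answer concisely." \
  --temp 0.2 --top-p 0.9 --repeat-penalty 1.1
```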
### 3. Creative Writing with High Randomness
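An illustrative configuration (values stay within the ranges in the sampling tables above):

```bash
# High temperature plus looser top-k/top-p for more surprising continuations
llama-cli -m model.gguf -n 500 \
  -p "Write the opening scene of a mystery set on a space station." \
  --temp 1.2 --top-p 0.95 --top-k 100
```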
### 4. Structured JSON Output
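A sketch using `-j` to constrain output to a simple schema (model path and schema are illustrative):

```bash
# Force syntactically valid JSON matching the schema
llama-cli -m model.gguf -n 200 \
  -p "Describe a laptop as JSON with name and price:" \
  -j '{"type":"object","properties":{"name":{"type":"string"},"price":{"type":"number"}},"required":["name","price"]}'
```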
### 5. Multi-GPU Setup
```bash
# Use multiple GPUs with layer splitting
llama-cli -m large-model.gguf -ngl 40 -sm layer -ts 3,1 --main-gpu 0
```

### 6. High-Performance CPU Setup
```bash
# Optimize for CPU performance
llama-cli -m model.gguf -t 8 --cpu-range 0-7 --mlock --numa distribute
```

### 7. Conversation with Custom Template
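A minimal sketch, assuming a Jinja template saved as `my-template.jinja` (both filenames are placeholders):

```bash
# Load a chat template from disk instead of the one embedded in the GGUF
llama-cli -m model.gguf -cnv --jinja \
  --chat-template-file my-template.jinja
```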
### 8. Constrained Generation with Grammar
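A sketch with an inline GBNF grammar (model path is a placeholder):

```bash
# Restrict output to lowercase words separated by single spaces
llama-cli -m model.gguf -n 50 -p "List some animals:" \
  --grammar 'root ::= [a-z]+ (" " [a-z]+)*'
```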
### 9. Batch Processing with Caching
```bash
# Process multiple prompts with caching
llama-cli -m model.gguf --prompt-cache prompts.cache --prompt-cache-all -f input-prompts.txt
```

### 10. Debug and Development
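A sketch combining the logging flags from above (model path and log filename are placeholders):

```bash
# Verbose run with timestamped logs written to a file; perf timings disabled
llama-cli -m model.gguf -p "test" -n 10 -v \
  --log-file llama.log --log-timestamps --no-perf
```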
## Tips for Beginners
- Start with just `-m`, `-p`, and `-n` for your first runs
- Increase `--ctx-size` for longer conversations or documents
- If you have a GPU, use `--gpu-layers` to speed up inference significantly

## Common Issues and Solutions
### Performance Issues
- Slow generation: increase `--gpu-layers` or `--threads`
- Out-of-memory errors: reduce `--ctx-size` or `--batch-size`
- CPU contention: adjust `--threads` or use `--cpu-range` to pin cores

### Generation Quality
- Repetitive output: increase `--repeat-penalty` or enable DRY sampling
- Output too random: lower `--temp` or adjust `--top-p`
- Output too predictable: raise `--temp` and `--top-p`

### Model Loading
- Use `--mlock` to keep the model in RAM

This guide covers the essential features of llama-cli. For the most up-to-date information, always refer to `llama-cli --help`.