What happened?
Release https://github.com/ggerganov/llama.cpp/releases/tag/b3683 (#9308) refactored the argument parser.
This broke the imatrix-specific arguments:
$ ./llama-imatrix -m $MODEL_PATH -f $RAW_TEXT_PATH -o imatrix.data
error: invalid argument: -o
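A quick check against the help text (a minimal sketch, assuming a POSIX shell and that llama-imatrix is in the current directory) suggests the imatrix-specific flags are no longer registered at all, even though the built-in example usage still mentions them:
$ # search the help text for the imatrix-specific flags; on b3683 the only
$ # matches come from the example-usage block at the end of -h (see the full
$ # dump below), not from the option list itself
$ ./llama-imatrix -h | grep -E -e '-o |--process-output|--no-ppl|--output-frequency|--save-frequency'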
Name and Version
$ ./llama-imatrix --version
version: 3683 (1b9ae518)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
$ ./llama-imatrix -h
----- common options -----
----- example-specific options -----
-h , --help, --usage print usage and exit
--version show version and build info
-v , --verbose print verbose information
--verbosity N set specific verbosity level (default: 1)
-s , --seed SEED RNG seed (default: -1, use random seed for < 0)
-t , --threads N number of threads to use during generation (default: -1)
(env: LLAMA_ARG_THREADS)
-tb , --threads-batch N number of threads to use during batch and prompt processing (default:
same as --threads)
-C , --cpu-mask M CPU affinity mask: arbitrarily long hex. Complements cpu-range
(default: "")
-Cr , --cpu-range lo-hi range of CPUs for affinity. Complements --cpu-mask
--cpu-strict <0|1> use strict CPU placement (default: 0)
--poll <0...100> use polling level to wait for work (0 - no polling, default: 50)
-Cb , --cpu-mask-batch M CPU affinity mask: arbitrarily long hex. Complements cpu-range-batch
(default: same as --cpu-mask)
-Crb , --cpu-range-batch lo-hi ranges of CPUs for affinity. Complements --cpu-mask-batch
--cpu-strict-batch <0|1> use strict CPU placement (default: same as --cpu-strict)
--poll-batch <0|1> use polling to wait for work (default: same as --poll)
-lcs , --lookup-cache-static FNAME path to static lookup cache to use for lookup decoding (not updated by
generation)
-lcd , --lookup-cache-dynamic FNAME path to dynamic lookup cache to use for lookup decoding (updated by
generation)
-c , --ctx-size N size of the prompt context (default: 512, 0 = loaded from model)
(env: LLAMA_ARG_CTX_SIZE)
-n , --predict, --n-predict N number of tokens to predict (default: -1, -1 = infinity, -2 = until
context filled)
(env: LLAMA_ARG_N_PREDICT)
-b , --batch-size N logical maximum batch size (default: 2048)
(env: LLAMA_ARG_BATCH)
-ub , --ubatch-size N physical maximum batch size (default: 512)
(env: LLAMA_ARG_UBATCH)
--keep N number of tokens to keep from the initial prompt (default: 0, -1 =
all)
--chunks N max number of chunks to process (default: -1, -1 = all)
-fa , --flash-attn enable Flash Attention (default: disabled)
(env: LLAMA_ARG_FLASH_ATTN)
-p , --prompt PROMPT prompt to start generation with
-f , --file FNAME a file containing the prompt (default: none)
--in-file FNAME an input file (repeat to specify multiple files)
-bf , --binary-file FNAME binary file containing the prompt (default: none)
-e , --escape process escapes sequences (\n, \r, \t, \', \", \\) (default: true)
--no-escape do not process escape sequences
--samplers SAMPLERS samplers that will be used for generation in the order, separated by
';'
(default: top_k;tfs_z;typ_p;top_p;min_p;temperature)
--sampling-seq SEQUENCE simplified sequence for samplers that will be used (default: kfypmt)
--ignore-eos ignore end of stream token and continue generating (implies
--logit-bias EOS-inf)
--penalize-nl penalize newline tokens (default: false)
--temp N temperature (default: 0.8)
--top-k N top-k sampling (default: 40, 0 = disabled)
--top-p N top-p sampling (default: 0.9, 1.0 = disabled)
--min-p N min-p sampling (default: 0.1, 0.0 = disabled)
--tfs N tail free sampling, parameter z (default: 1.0, 1.0 = disabled)
--typical N locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
--repeat-last-n N last n tokens to consider for penalize (default: 64, 0 = disabled, -1
= ctx_size)
--repeat-penalty N penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled)
--presence-penalty N repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
--frequency-penalty N repeat alpha frequency penalty (default: 0.0, 0.0 = disabled)
--dynatemp-range N dynamic temperature range (default: 0.0, 0.0 = disabled)
--dynatemp-exp N dynamic temperature exponent (default: 1.0)
--mirostat N use Mirostat sampling.
Top K, Nucleus, Tail Free and Locally Typical samplers are ignored if
used.
(default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
--mirostat-lr N Mirostat learning rate, parameter eta (default: 0.1)
--mirostat-ent N Mirostat target entropy, parameter tau (default: 5.0)
-l , --logit-bias TOKEN_ID(+/-)BIAS
modifies the likelihood of token appearing in the completion,
i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',
or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'
--grammar GRAMMAR BNF-like grammar to constrain generations (see samples in grammars/
dir) (default: '')
--grammar-file FNAME file to read grammar from
-j , --json-schema SCHEMA JSON schema to constrain generations (https://json-schema.org/), e.g.
`{}` for any JSON object
For schemas w/ external $refs, use --grammar +
example/json_schema_to_grammar.py instead
--rope-scaling {none,linear,yarn} RoPE frequency scaling method, defaults to linear unless specified by
the model
--rope-scale N RoPE context scaling factor, expands context by a factor of N
--rope-freq-base N RoPE base frequency, used by NTK-aware scaling (default: loaded from
model)
--rope-freq-scale N RoPE frequency scaling factor, expands context by a factor of 1/N
--yarn-orig-ctx N YaRN: original context size of model (default: 0 = model training
context size)
--yarn-ext-factor N YaRN: extrapolation mix factor (default: -1.0, 0.0 = full
interpolation)
--yarn-attn-factor N YaRN: scale sqrt(t) or attention magnitude (default: 1.0)
--yarn-beta-slow N YaRN: high correction dim or alpha (default: 1.0)
--yarn-beta-fast N YaRN: low correction dim or beta (default: 32.0)
-gan , --grp-attn-n N group-attention factor (default: 1)
-gaw , --grp-attn-w N group-attention width (default: 512.0)
-dkvc , --dump-kv-cache verbose print of the KV cache
-nkvo , --no-kv-offload disable KV offload
-ctk , --cache-type-k TYPE KV cache data type for K (default: f16)
-ctv , --cache-type-v TYPE KV cache data type for V (default: f16)
-dt , --defrag-thold N KV cache defragmentation threshold (default: -1.0, < 0 - disabled)
(env: LLAMA_ARG_DEFRAG_THOLD)
-np , --parallel N number of parallel sequences to decode (default: 1)
-ns , --sequences N number of sequences to decode (default: 1)
-cb , --cont-batching enable continuous batching (a.k.a dynamic batching) (default: enabled)
(env: LLAMA_ARG_CONT_BATCHING)
-nocb , --no-cont-batching disable continuous batching
(env: LLAMA_ARG_NO_CONT_BATCHING)
--mlock force system to keep model in RAM rather than swapping or compressing
--no-mmap do not memory-map model (slower load but may reduce pageouts if not
using mlock)
--numa TYPE attempt optimizations that help on some NUMA systems
- distribute: spread execution evenly over all nodes
- isolate: only spawn threads on CPUs on the node that execution
started on
- numactl: use the CPU map provided by numactl
if run without this previously, it is recommended to drop the system
page cache before using this
see https://github.com/ggerganov/llama.cpp/issues/1437
-ngl , --gpu-layers N number of layers to store in VRAM
(env: LLAMA_ARG_N_GPU_LAYERS)
-sm , --split-mode {none,layer,row}
how to split the model across multiple GPUs, one of:
- none: use one GPU only
- layer (default): split layers and KV across GPUs
- row: split rows across GPUs
-ts , --tensor-split N0,N1,N2,... fraction of the model to offload to each GPU, comma-separated list of
proportions, e.g. 3,1
-mg , --main-gpu INDEX the GPU to use for the model (with split-mode = none), or for
intermediate results and KV (with split-mode = row) (default: 0)
--check-tensors check model tensor data for invalid values (default: false)
--override-kv KEY=TYPE:VALUE advanced option to override model metadata by key. may be specified
multiple times.
types: int, float, bool, str. example: --override-kv
tokenizer.ggml.add_bos_token=bool:false
--lora FNAME path to LoRA adapter (can be repeated to use multiple adapters)
--lora-scaled FNAME SCALE path to LoRA adapter with user defined scaling (can be repeated to use
multiple adapters)
--control-vector FNAME add a control vector
note: this argument can be repeated to add multiple control vectors
--control-vector-scaled FNAME SCALE add a control vector with user defined scaling SCALE
note: this argument can be repeated to add multiple scaled control
vectors
--control-vector-layer-range START END
layer range to apply the control vector(s) to, start and end inclusive
-m , --model FNAME model path (default: `models/$filename` with filename from `--hf-file`
or `--model-url` if set, otherwise models/7B/ggml-model-f16.gguf)
(env: LLAMA_ARG_MODEL)
-mu , --model-url MODEL_URL model download url (default: unused)
(env: LLAMA_ARG_MODEL_URL)
-hfr , --hf-repo REPO Hugging Face model repository (default: unused)
(env: LLAMA_ARG_HF_REPO)
-hff , --hf-file FILE Hugging Face model file (default: unused)
(env: LLAMA_ARG_HF_FILE)
-hft , --hf-token TOKEN Hugging Face access token (default: value from HF_TOKEN environment
variable)
(env: HF_TOKEN)
-ld , --logdir LOGDIR path under which to save YAML logs (no logging if unset)
--log-test Log test
--log-disable Log disable
--log-enable Log enable
--log-new Log new
--log-append Log append
--log-file FNAME Log file
example usage:
./llama-imatrix \
-m model.gguf -f some-text.txt [-o imatrix.dat] [--process-output] [--verbosity 1] \
[--no-ppl] [--chunk 123] [--output-frequency 10] [--save-frequency 0] \
[--in-file imatrix-prev-0.dat --in-file imatrix-prev-1.dat ...]
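Note that the example usage above still advertises -o, --process-output, --no-ppl, --chunk, --output-frequency and --save-frequency, yet none of them appear in the option list, which points at the #9308 registration refactor dropping the imatrix-specific arguments. Until that is fixed, a possible stop-gap (a sketch only, assuming the preceding build tag b3682 is available and a CMake build is used) is to build the last tag before the refactor:
$ # hedged workaround: build the previous release tag, where -o still parses
$ git checkout b3682
$ cmake -B build && cmake --build build --target llama-imatrix
$ ./build/bin/llama-imatrix -m $MODEL_PATH -f $RAW_TEXT_PATH -o imatrix.data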