README.md: 8 additions & 7 deletions
@@ -391,13 +391,14 @@ Building the program with BLAS support may lead to some performance improvements
<!---
|LLAMA_CUDA_CUBLAS|Boolean|false|Use cuBLAS instead of custom CUDA kernels for prompt processing. Faster for all quantization formats except for q4_0 and q8_0, especially for k-quants. Increases VRAM usage (700 MiB for 7b, 970 MiB for 13b, 1430 MiB for 33b).|
|LLAMA_CUDA_FORCE_DMMV|Boolean|false|Force the use of dequantization + matrix vector multiplication kernels instead of using kernels that do matrix vector multiplication on quantized data. By default the decision is made based on compute capability (MMVQ for 6.1/Pascal/GTX 1000 or higher). Does not affect k-quants.|
|LLAMA_CUDA_DMMV_X|Positive integer >= 32|32|Number of values in x direction processed by the CUDA dequantization + matrix vector multiplication kernel per iteration. Increasing this value can improve performance on fast GPUs. Power of 2 heavily recommended. Does not affect k-quants.|
|LLAMA_CUDA_MMV_Y|Positive integer|1|Block size in y direction for the CUDA mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended.|
|LLAMA_CUDA_F16|Boolean|false|If enabled, use half-precision floating point arithmetic for the CUDA dequantization + mul mat vec kernels and for the q4_1 and q5_1 matrix matrix multiplication kernels. Can improve performance on relatively recent GPUs.|
|LLAMA_CUDA_KQUANTS_ITER|1 or 2|2|Number of values processed per iteration and per CUDA thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs.|
|LLAMA_CUDA_PEER_MAX_BATCH_SIZE|Positive integer|128|Maximum batch size for which to enable peer access between multiple GPUs. Peer access requires either Linux or NVLink. When using NVLink, enabling peer access for larger batch sizes is potentially beneficial.|
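
These options are set at build time. As a rough sketch of how a few of them might be tuned together (assuming the CMake-based CUDA/cuBLAS build described earlier in this README; `LLAMA_CUBLAS` and the chosen values here are illustrative, not part of the table above):

```sh
# Hypothetical example: configure a cuBLAS build and tune some of the CUDA kernel options.
# The option names match the table above; the specific values are for illustration only.
mkdir -p build && cd build
cmake .. -DLLAMA_CUBLAS=ON \
         -DLLAMA_CUDA_DMMV_X=64 \
         -DLLAMA_CUDA_F16=ON \
         -DLLAMA_CUDA_PEER_MAX_BATCH_SIZE=128
cmake --build . --config Release
```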