Commit d94c88d
committed by John
2 parents: c0176f2 + 748bb29

File tree: 1 file changed (+20 −30)

README.md

Lines changed: 20 additions & 30 deletions
@@ -4,8 +4,10 @@ ggllm.cpp is a llama.cpp modification to run Falcon (work in progress)
 - Support for Falcon 7B and 40B models (inference, quantization and perplexity tool)
 - Fully automated GPU offloading based on available and total VRAM
 - Higher efficiency in VRAM usage when using batched processing (more layers being offloaded)
+- 16-bit cuBLAS support (takes half the VRAM for those operations)
 - Improved loading screen and visualization
-- Current Falcon inference speed on consumer GPU: up to 51 tokens/sec for 7B-4bit and 17 tokens/sec for 40B-6bit
+- More command line parameter options (like disabling GPUs)
+- Current Falcon inference speed on consumer GPU: up to 51 tokens/sec for 7B-4bit and 17 tokens/sec for 40B-6bit, roughly 38/sec and 16/sec at 1000 tokens generated

 **What is missing/being worked on:**
 - Full GPU offloading of Falcon
@@ -29,7 +31,7 @@ https://huggingface.co/tiiuae/falcon-7b-instruct
 _The Falcon 7B model features tensor sizes which are not yet supported by K-type quantizers - use the traditional quantization for those_
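As an aside, a traditional-format quantization run might look like the sketch below; the `falcon_quantize` argument order (f16 input model, output file, target type) is an assumption carried over from llama.cpp's quantize tool, and the paths are illustrative:

```
falcon_quantize.exe Q:\models\falcon-7b\ggml-model-f16.bin Q:\models\falcon-7b\q4_1 q4_1 # traditional q4_1 instead of a K-type quantizer
```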

 **Status/Bugs:**
-- Cumulative token slowdown over increasing context (partly solved)
+- nothing major

 **How to compile ggllm.cpp:**
 1) Recommended with cmake: (change the CUBLAS flag to 0 to disable CUDA requirements and support)
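The cmake commands themselves sit outside this hunk; a minimal sketch of the invocation step 1 describes, where the `LLAMA_CUBLAS` flag name and the out-of-source layout are assumptions inherited from llama.cpp:

```
mkdir build && cd build
cmake .. -DLLAMA_CUBLAS=1 # set the CUBLAS flag to 0 to disable CUDA requirements and support
cmake --build . --config Release
```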
@@ -56,6 +58,7 @@ make falcon_main falcon_quantize falcon_perplexity
 ```

 **Windows and Demos**
+_Note: these tutorials predate the latest performance patches_
 Video tutorial for Windows compilation without WSL:
 https://www.youtube.com/watch?v=BALw669Qeyw
 Another demo of Falcon 40B at 5 bit quantization:
@@ -82,18 +85,17 @@ export PATH="/usr/local/cuda-12.1/bin:$PATH"

 **CUDA Optimizing inference speed**
 - Thread count will be optimal between 1 and 8. Start with `-t 2`
-- For huge prompts n_batch can speed up processing 10-20 times, but additional VRAM of 1500-4700 MB is required. That's `-b 512`
+- For huge prompts n_batch can speed up processing 10-20 times, but additional VRAM of 500-1700 MB is required. That's `-b 512`
 - Multi GPU systems can benefit from single GPU processing when the model is small enough. That's `--override-max-gpu 1`
 - Multi GPU systems with different GPUs benefit from custom tensor splitting to load one GPU heavier. To load the 2nd GPU stronger: `--tensor-split 1,3` `-mg 1`
-- Need to squeeze a model into VRAM but 1-2 layers don't fit? Try `--gpu-reserve-mb-main 1` to reduce reserved VRAM to 1 MB
+- Need to squeeze a model into VRAM but 1-2 layers don't fit? Try `--gpu-reserve-mb-main 1` to reduce reserved VRAM to 1 MB; negative values force VRAM swapping
 - Wish to reduce VRAM usage and offload fewer layers? Use `-ngl 10` to only load 10 layers
-- Want to dive into details? Use `--debug-timings <1,2,3>` to get detailed statistics on the performance of each operation
+- Want to dive into details? Use `--debug-timings <1,2,3>` to get detailed statistics on the performance of each operation, how and where it was performed, and its total impact (a combined invocation is sketched below)
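These flags compose into a single command line; a sketch combining several of the tips above, where the model path and the chosen values are illustrative rather than tuned recommendations:

```
falcon_main.exe -t 2 -b 512 -m Q:\models\falcon-40b\q4_k --override-max-gpu 1 --gpu-reserve-mb-main 1 --debug-timings 1 -p "I am" # batched prompt processing on one GPU with minimal reserved VRAM
```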


 **Inference speed**
 Only some tensors are GPU-supported currently, and only the mul_mat operation is supported
 Some of the below examples require two GPUs to run at the given speed; the settings were tailored for one environment and a different GPU/CPU/DDR setup might require adaptations
-Using -b 1 (default) can save from 1500 up to 4800 MB of VRAM (depending on quantization type and model)

 **Falcon 40B 6 bit K-type quantization:**
 ```
@@ -107,24 +109,14 @@ falcon_print_timings: total time = 1980.28 ms

 **Falcon 40B 4 bit K-type quantization:**
 ```
-falcon_main.exe -t 2 -m Q:\models\falcon-40b\q4_k -n 512 -n 128 --debug-timings 0 -b 1 --ignore-eos -p "I am" # -ts 2,1
+falcon_main.exe -t 2 -m Q:\models\falcon-40b\q4_k -n 512 -n 128 --debug-timings 0 -b 1 --ignore-eos -p "I am" # -ts 2,1 # --override-max-gpu 1 --gpu-reserve-mb-main -500
 ...
-falcon_print_timings: load time = 8076.62 ms
-falcon_print_timings: sample time = 29.38 ms / 128 runs ( 0.23 ms per token, 4357.15 tokens per second)
-falcon_print_timings: eval time = 9580.33 ms / 129 runs ( 74.27 ms per token, 13.47 tokens per second)
-falcon_print_timings: total time = 9631.29 ms
+falcon_print_timings: load time = 8749.56 ms
+falcon_print_timings: sample time = 29.47 ms / 128 runs ( 0.23 ms per token, 4342.81 tokens per second)
+falcon_print_timings: eval time = 7046.11 ms / 129 runs ( 54.62 ms per token, 18.31 tokens per second)
+falcon_print_timings: total time = 7095.81 ms
 ```

-**Falcon 40B 4 bit K-type quantization (single 4090 24GB GPU full offload, 0MB free VRAM):**
-```
-falcon_main.exe -t 2 -m Q:\models\falcon-40b\q4_k -n 512 -n 32 -b 1 --ignore-eos -p "I am"
-...
-falcon_print_timings: load time = 10927.17 ms
-falcon_print_timings: sample time = 7.40 ms / 32 runs ( 0.23 ms per token, 4323.16 tokens per second)
-falcon_print_timings: eval time = 1875.69 ms / 33 runs ( 56.84 ms per token, 17.59 tokens per second)
-falcon_print_timings: total time = 1901.62 ms
-# this benefits from latest CUDA drivers, the 24GB are very tight. Minor VRAM swapping must be going on.
-```

 **Falcon 7B 8 bit quantization:**
 ```
@@ -136,18 +128,16 @@ falcon_print_timings: eval time = 758.21 ms / 33 runs ( 22.98 ms
 falcon_print_timings: total time = 770.52 ms
 ```

-**Falcon 7B 4 bit quantization:**
+**Falcon 7B 4 bit quantization (large generation):**
 ```
-falcon_main.exe -t 2 -m Q:\models\falcon-7b\q4_1 -n 512 -n 32 --debug-timings 0 -b 1 --ignore-eos --override-max-gpu 1 -p "I am"
+falcon_main.exe -t 2 -m Q:\models\falcon-7b\q4_1 -n 512 --debug-timings 0 -b 1 --ignore-eos --override-max-gpu 1 -p "I am"
 ...
-falcon_print_timings: load time = 1665.93 ms
-falcon_print_timings: sample time = 7.65 ms / 32 runs ( 0.24 ms per token, 4184.65 tokens per second)
-falcon_print_timings: eval time = 645.45 ms / 33 runs ( 19.56 ms per token, 51.13 tokens per second)
-falcon_print_timings: total time = 661.19 ms
+falcon_print_timings: load time = 2442.76 ms
+falcon_print_timings: sample time = 118.56 ms / 512 runs ( 0.23 ms per token, 4318.34 tokens per second)
+falcon_print_timings: eval time = 16719.48 ms / 769 runs ( 21.74 ms per token, 45.99 tokens per second)
+falcon_print_timings: total time = 16930.51 ms
 ```

 CUDA sidenote:
-1) try to use 1 fewer thread than you have physical processor cores
-2) If it's too slow and GPU memory is at 100%, the automated tensor skip is not working properly; reduce --ngl until GPU memory does not saturate fully at first inference
-3) use "-b 1" if low on VRAM or when using short prompts
+1) try to use fewer threads than you have physical processor cores