Commit d94c88d
committed by John
2 parents: c0176f2 + 748bb29

File tree: 1 file changed (+20 −30)

README.md

Lines changed: 20 additions & 30 deletions
@@ -4,8 +4,10 @@ ggllm.cpp is a llama.cpp modification to run Falcon (work in progress)
 - Support for Falcon 7B and 40B models (inference, quantization and perplexity tool)
 - Fully automated GPU offloading based on available and total VRAM
 - Higher efficiency in VRAM usage when using batched processing (more layers being offloaded)
+- 16-bit cuBLAS support (takes half the VRAM for those operations)
 - Improved loading screen and visualization
-- Current Falcon inference speed on consumer GPU: up to 51 tokens/sec for 7B-4bit and 17 tokens/sec for 40B-6bit
+- More command line parameter options (like disabling GPUs)
+- Current Falcon inference speed on consumer GPU: up to 51 tokens/sec for 7B-4bit and 17 tokens/sec for 40B-6bit, roughly 38/sec and 16/sec at 1000 tokens generated

 **What is missing/being worked on:**
 - Full GPU offloading of Falcon
@@ -29,7 +31,7 @@ https://huggingface.co/tiiuae/falcon-7b-instruct
 _The Falcon 7B model features tensor sizes which are not yet supported by K-type quantizers - use the traditional quantization for those_
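As an aside, a traditional-format quantization run might look like the sketch below; the `falcon_quantize` argument order (f16 input model, output file, target type) is an assumption carried over from llama.cpp's quantize tool, and the paths are illustrative:

```
falcon_quantize.exe Q:\models\falcon-7b\ggml-model-f16.bin Q:\models\falcon-7b\q4_1 q4_1 # traditional q4_1 instead of a K-type quantizer
```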

 **Status/Bugs:**
-- Cumulative token slowdown over increasing context (partly solved)
+- nothing major

 **How to compile ggllm.cpp:**
 1) Recommended with cmake: (change the CUBLAS flag to 0 to disable CUDA requirements and support)
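The cmake commands themselves sit outside this hunk; a minimal sketch of the invocation step 1 describes, where the `LLAMA_CUBLAS` flag name and the out-of-source layout are assumptions inherited from llama.cpp:

```
mkdir build && cd build
cmake .. -DLLAMA_CUBLAS=1 # set the CUBLAS flag to 0 to disable CUDA requirements and support
cmake --build . --config Release
```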
@@ -56,6 +58,7 @@ make falcon_main falcon_quantize falcon_perplexity
 ```

 **Windows and Demos**
+_Note: these tutorials predate the latest performance patches_
 Video tutorial for Windows compilation without WSL:
 https://www.youtube.com/watch?v=BALw669Qeyw
 Another demo of Falcon 40B at 5 bit quantization:
@@ -82,18 +85,17 @@ export PATH="/usr/local/cuda-12.1/bin:$PATH"

 **CUDA Optimizing inference speed**
 - Thread count will be optimal between 1 and 8. Start with `-t 2`
-- For huge prompts n_batch can speed up processing 10-20 times, but additional VRAM of 1500-4700 MB is required. That's `-b 512`
+- For huge prompts n_batch can speed up processing 10-20 times, but additional VRAM of 500-1700 MB is required. That's `-b 512`
 - Multi GPU systems can benefit from single GPU processing when the model is small enough. That's `--override-max-gpu 1`
 - Multi GPU systems with different GPUs benefit from custom tensor splitting to load one GPU heavier. To load the 2nd GPU stronger: `--tensor-split 1,3` `-mg 1`
-- Need to squeeze a model into VRAM but 1-2 layers don't fit? Try `--gpu-reserve-mb-main 1` to reduce reserved VRAM to 1 MB
+- Need to squeeze a model into VRAM but 1-2 layers don't fit? Try `--gpu-reserve-mb-main 1` to reduce reserved VRAM to 1 MB; negative values force VRAM swapping
 - Wish to reduce VRAM usage and offload fewer layers? Use `-ngl 10` to only load 10 layers
-- Want to dive into details? Use `--debug-timings <1,2,3>` to get detailed statistics on the performance of each operation
+- Want to dive into details? Use `--debug-timings <1,2,3>` to get detailed statistics on the performance of each operation, how and where it was performed, and its total impact (a combined invocation is sketched below)
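These flags compose into a single command line; a sketch combining several of the tips above, where the model path and the chosen values are illustrative rather than tuned recommendations:

```
falcon_main.exe -t 2 -b 512 -m Q:\models\falcon-40b\q4_k --override-max-gpu 1 --gpu-reserve-mb-main 1 --debug-timings 1 -p "I am" # batched prompt processing on one GPU with minimal reserved VRAM
```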


 **Inference speed**
 Only some tensors are GPU-supported currently, and only the mul_mat operation is supported
 Some of the below examples require two GPUs to run at the given speed; the settings were tailored for one environment and a different GPU/CPU/DDR setup might require adaptations
-Using -b 1 (default) can save from 1500 up to 4800 MB of VRAM (depending on quantization type and model)

 **Falcon 40B 6 bit K-type quantization:**
 ```
@@ -107,24 +109,14 @@ falcon_print_timings: total time = 1980.28 ms

 **Falcon 40B 4 bit K-type quantization:**
 ```
-falcon_main.exe -t 2 -m Q:\models\falcon-40b\q4_k -n 512 -n 128 --debug-timings 0 -b 1 --ignore-eos -p "I am" # -ts 2,1
+falcon_main.exe -t 2 -m Q:\models\falcon-40b\q4_k -n 512 -n 128 --debug-timings 0 -b 1 --ignore-eos -p "I am" # -ts 2,1 # --override-max-gpu 1 --gpu-reserve-mb-main -500
 ...
-falcon_print_timings: load time = 8076.62 ms
-falcon_print_timings: sample time = 29.38 ms / 128 runs ( 0.23 ms per token, 4357.15 tokens per second)
-falcon_print_timings: eval time = 9580.33 ms / 129 runs ( 74.27 ms per token, 13.47 tokens per second)
-falcon_print_timings: total time = 9631.29 ms
+falcon_print_timings: load time = 8749.56 ms
+falcon_print_timings: sample time = 29.47 ms / 128 runs ( 0.23 ms per token, 4342.81 tokens per second)
+falcon_print_timings: eval time = 7046.11 ms / 129 runs ( 54.62 ms per token, 18.31 tokens per second)
+falcon_print_timings: total time = 7095.81 ms
 ```

-**Falcon 40B 4 bit K-type quantization (single 4090 24GB GPU full offload, 0MB free VRAM):**
-```
-falcon_main.exe -t 2 -m Q:\models\falcon-40b\q4_k -n 512 -n 32 -b 1 --ignore-eos -p "I am"
-...
-falcon_print_timings: load time = 10927.17 ms
-falcon_print_timings: sample time = 7.40 ms / 32 runs ( 0.23 ms per token, 4323.16 tokens per second)
-falcon_print_timings: eval time = 1875.69 ms / 33 runs ( 56.84 ms per token, 17.59 tokens per second)
-falcon_print_timings: total time = 1901.62 ms
-# this benefits from latest CUDA drivers, the 24GB are very tight. Minor VRAM swapping must be going on.
-```

 **Falcon 7B 8 bit quantization:**
 ```
@@ -136,18 +128,16 @@ falcon_print_timings: eval time = 758.21 ms / 33 runs ( 22.98 ms
 falcon_print_timings: total time = 770.52 ms
 ```

-**Falcon 7B 4 bit quantization:**
+**Falcon 7B 4 bit quantization (large generation):**
 ```
-falcon_main.exe -t 2 -m Q:\models\falcon-7b\q4_1 -n 512 -n 32 --debug-timings 0 -b 1 --ignore-eos --override-max-gpu 1 -p "I am"
+falcon_main.exe -t 2 -m Q:\models\falcon-7b\q4_1 -n 512 --debug-timings 0 -b 1 --ignore-eos --override-max-gpu 1 -p "I am"
 ...
-falcon_print_timings: load time = 1665.93 ms
-falcon_print_timings: sample time = 7.65 ms / 32 runs ( 0.24 ms per token, 4184.65 tokens per second)
-falcon_print_timings: eval time = 645.45 ms / 33 runs ( 19.56 ms per token, 51.13 tokens per second)
-falcon_print_timings: total time = 661.19 ms
+falcon_print_timings: load time = 2442.76 ms
+falcon_print_timings: sample time = 118.56 ms / 512 runs ( 0.23 ms per token, 4318.34 tokens per second)
+falcon_print_timings: eval time = 16719.48 ms / 769 runs ( 21.74 ms per token, 45.99 tokens per second)
+falcon_print_timings: total time = 16930.51 ms
 ```

 CUDA sidenote:
-1) try to use 1 fewer thread than you have physical processor cores
-2) If it's too slow and GPU memory is at 100%, the automated tensor skip is not working properly; reduce --ngl until GPU memory does not saturate fully at first inference
-3) use "-b 1" if low on VRAM or when using short prompts
+1) try to use fewer threads than you have physical processor cores