README.md: 20 additions & 30 deletions
@@ -4,8 +4,10 @@ ggllm.cpp is a llama.cpp modification to run Falcon (work in progress)
- Support for Falcon 7B and 40B models (inference, quantization and perplexity tool)
- Fully automated GPU offloading based on available and total VRAM
- Higher efficiency in VRAM usage when using batched processing (more layers being offloaded)
+ - 16-bit cuBLAS support (takes half the VRAM for those operations)
- Improved loading screen and visualization
- - Current Falcon inference speed on consumer GPU: up to 51 tokens/sec for 7B-4bit and 17 tokens/sec for 40B-6bit
+ - More command line parameter options (like disabling GPUs)
+ - Current Falcon inference speed on consumer GPU: up to 51 tokens/sec for 7B-4bit and 17 tokens/sec for 40B-6bit, roughly 38/sec and 16/sec at 1000 tokens generated

- Thread count will be optimal between 1 and 8. Start with `-t 2`
- - For huge prompts n_batch can speed up processing 10-20 times but additional VRAM of 1500-4700 MB is required. That's `-b 512`
+ - For huge prompts, n_batch can speed up processing 10-20 times, but additional VRAM of 500-1700 MB is required. That's `-b 512`
- Multi-GPU systems can benefit from single-GPU processing when the model is small enough. That's `--override-max-gpu 1`
- Multi-GPU systems with different GPUs benefit from custom tensor splitting to load one GPU more heavily. To load the 2nd GPU more heavily: `--tensor-split 1,3` `-mg 1`
- - Need to squeeze a model into VRAM but 1-2 layers don't fit? Try `--gpu-reserve-mb-main 1` to reduce reserved VRAM to 1 MB
+ - Need to squeeze a model into VRAM but 1-2 layers don't fit? Try `--gpu-reserve-mb-main 1` to reduce reserved VRAM to 1 MB; you can use negative numbers to force VRAM swapping
- Wish to reduce VRAM usage and offload fewer layers? Use `-ngl 10` to only load 10 layers
- - Want to dive into details? Use `--debug-timings <1,2,3>` to get detailed statistics on the performance of each operation
+ - Want to dive into details? Use `--debug-timings <1,2,3>` to get detailed statistics on the performance of each operation, how and where it was performed, and its total impact (see the combined example after this list)
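Putting several of these options together, a single run could look like the sketch below. The `falcon_main` binary name, the model path and the prompt are placeholders; the flags are the ones described in the list above.

```
# Sketch only: larger batch, custom tensor split favouring the 2nd GPU,
# minimal VRAM reserve. Model path and prompt are placeholders.
./falcon_main -m ./models/falcon-40b-q6_k.bin \
  -t 2 -b 512 \
  --tensor-split 1,3 -mg 1 \
  --gpu-reserve-mb-main 1 \
  -p "Write one sentence about the Falcon model."
```
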
**Inference speed**
Only some tensors are GPU-supported currently, and only the mul_mat operation is offloaded
Some of the examples below require two GPUs to run at the given speed; the settings were tailored for one environment, and a different GPU/CPU/DDR setup might require adaptations
- Using -b 1 (default) can save from 1500 up to 4800 MB of VRAM (depending on quantization type and model)
**Falcon 40B 6 bit K-type quantization:**
```
@@ -107,24 +109,14 @@ falcon_print_timings: total time = 1980.28 ms
falcon_print_timings: sample time = 7.65 ms / 32 runs ( 0.24 ms per token, 4184.65 tokens per second)
- falcon_print_timings: eval time = 645.45 ms / 33 runs ( 19.56 ms per token, 51.13 tokens per second)
- falcon_print_timings: total time = 661.19 ms
+ falcon_print_timings: load time = 2442.76 ms
+ falcon_print_timings: sample time = 118.56 ms / 512 runs ( 0.23 ms per token, 4318.34 tokens per second)
+ falcon_print_timings: eval time = 16719.48 ms / 769 runs ( 21.74 ms per token, 45.99 tokens per second)
+ falcon_print_timings: total time = 16930.51 ms
```
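For reference, the per-token figures in these logs follow directly from the totals: 16719.48 ms / 769 runs ≈ 21.74 ms per token, i.e. roughly 1000 / 21.74 ≈ 46 tokens per second.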
CUDA sidenote:
- 1) try to use 1 less threads than you have physical processor cores
- 2) If it's too slow and GPU memory is at 100% then the automated tensor skip is not working properly, reduce --ngl until gpu memory does not saturate fully at first inference
- 3) use "-b 1" if low on VRAM or when using short prompts
+ 1) try to use fewer threads than you have physical processor cores (see the sketch below)
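A minimal sketch combining these notes with the earlier tips (the model path, prompt and `-m`/`-p` flag names are assumptions following llama.cpp conventions): on an 8-core CPU this uses 7 threads, keeps the small default batch, and offloads fewer layers so the GPU does not saturate.

```
# Sketch only: one thread fewer than 8 physical cores, reduced layer offload,
# default batch size. Model path and prompt are placeholders.
./falcon_main -m ./models/falcon-7b-q4_k.bin \
  -t 7 -ngl 20 -b 1 \
  -p "Hello"
```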