A large patch was just integrated into llama.cpp (ggml-org#2001); another stunning job by @ikawrakow.
In the long run we need it: K-quants are better for 7B and offer more flexibility. But two obstacles need to be solved first:
- We need to modify that PR so the super-block size is no longer a compile-time switch; it has to support both 256- and 64-element super-blocks, either by splitting and duplicating the code paths or by using a global variable instead of the define. Otherwise we'd need distinctly compiled binaries for 7B and 40B (a template sketch follows this list).
- These are 32-bit dequantizers, while we use 16-bit for cuBLAS to save 50% VRAM. Porting them is not a huge change, but it doubles the kernels (again), and I'm a bit afraid of maintaining so many of them.
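A minimal sketch of one way to lift the compile-time switch: make the kernels templates over the super-block size and dispatch at runtime. The `block_toy` layout and function names here are hypothetical placeholders, not the actual layouts or entry points from the PR; the point is only the dispatch pattern.

```cuda
#include <cuda_fp16.h>
#include <stdint.h>

// Hypothetical toy format: one fp16 scale followed by QK signed
// 8-bit quants per super-block. The real k-quant layouts are more
// involved; this only illustrates templating over the block size.
template <int QK>
struct block_toy {
    half   d;       // per-super-block scale
    int8_t qs[QK];  // quantized values
};

template <int QK>  // instantiated for both 256 and 64
__global__ void dequantize_toy(const block_toy<QK> * x, float * y) {
    const int ib = blockIdx.x;  // one CUDA block per super-block
    const float d = __half2float(x[ib].d);
    for (int i = threadIdx.x; i < QK; i += blockDim.x) {
        y[ib*QK + i] = d * x[ib].qs[i];
    }
}

// Runtime dispatch: one binary serves models that need 256-element
// super-blocks (40B) and those that need 64-element ones (7B).
static void dequantize_toy_cuda(const void * vx, float * y, int nblocks,
                                int qk, cudaStream_t stream) {
    if (qk == 256) {
        dequantize_toy<256><<<nblocks, 64, 0, stream>>>((const block_toy<256> *) vx, y);
    } else {
        dequantize_toy<64><<<nblocks, 64, 0, stream>>>((const block_toy<64> *) vx, y);
    }
}
```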
Maybe instead of duplicating all kernels from 32-bit to 16-bit it would be possible to write a wrapper: let the kernels keep working in 32-bit and convert the result into half precision afterwards. Given the parallelization, that wouldn't require much extra VRAM.
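One way such a wrapper could look, assuming a simple trailing conversion pass after the existing fp32 dequantize kernels; `convert_f32_to_f16` is a hypothetical helper, and with a chunked launch the fp32 staging buffer only has to cover one chunk at a time:

```cuda
#include <cuda_fp16.h>

// Hypothetical epilogue: convert the fp32 output of the existing
// dequantize kernels into fp16 for cuBLAS, instead of duplicating
// every kernel in half precision.
__global__ void convert_f32_to_f16(const float * src, half * dst, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = __float2half(src[i]);
    }
}

static void convert_f32_to_f16_cuda(const float * src, half * dst,
                                    int n, cudaStream_t stream) {
    const int block = 256;
    const int grid  = (n + block - 1) / block;
    convert_f32_to_f16<<<grid, block, 0, stream>>>(src, dst, n);
}
```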
I'm a bit afraid of investing hours integrating such custom variants in case another big push comes from upstream.