A large patch was just integrated into llama.cpp (ggml-org#2001); another stunning job by @ikawrakow.
In the long run we need it: K-quants are better for 7B and offer more flexibility. But two obstacles need to be solved first:
- We need to modify that PR so the super-block size is no longer a compile-time switch; it has to support both 256- and 64-element super-blocks, either by splitting and duplicating the code paths or by using a global variable instead of the define. Otherwise we'd need distinctly compiled binaries for 7B and 40B (a template sketch follows this list).
- These are 32-bit dequantizers, while we use 16-bit for cuBLAS to save 50% VRAM. Porting them is not a huge change, but it doubles the kernels (again), and I'm a bit afraid of maintaining so many of them.
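A minimal sketch of one way to lift the compile-time switch: make the kernels templates over the super-block size and dispatch at runtime. The `block_toy` layout and function names here are hypothetical placeholders, not the actual layouts or entry points from the PR; the point is only the dispatch pattern.

```cuda
#include <cuda_fp16.h>
#include <stdint.h>

// Hypothetical toy format: one fp16 scale followed by QK signed
// 8-bit quants per super-block. The real k-quant layouts are more
// involved; this only illustrates templating over the block size.
template <int QK>
struct block_toy {
    half   d;       // per-super-block scale
    int8_t qs[QK];  // quantized values
};

template <int QK>  // instantiated for both 256 and 64
__global__ void dequantize_toy(const block_toy<QK> * x, float * y) {
    const int ib = blockIdx.x;  // one CUDA block per super-block
    const float d = __half2float(x[ib].d);
    for (int i = threadIdx.x; i < QK; i += blockDim.x) {
        y[ib*QK + i] = d * x[ib].qs[i];
    }
}

// Runtime dispatch: one binary serves models that need 256-element
// super-blocks (40B) and those that need 64-element ones (7B).
static void dequantize_toy_cuda(const void * vx, float * y, int nblocks,
                                int qk, cudaStream_t stream) {
    if (qk == 256) {
        dequantize_toy<256><<<nblocks, 64, 0, stream>>>((const block_toy<256> *) vx, y);
    } else {
        dequantize_toy<64><<<nblocks, 64, 0, stream>>>((const block_toy<64> *) vx, y);
    }
}
```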
Maybe instead of duplicating all kernels from 32-bit to 16-bit it would be possible to write a wrapper: let the kernels keep working in 32-bit and convert the result into half precision afterwards. Given the parallelization, that wouldn't require much extra VRAM.
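One way such a wrapper could look, assuming a simple trailing conversion pass after the existing fp32 dequantize kernels; `convert_f32_to_f16` is a hypothetical helper, and with a chunked launch the fp32 staging buffer only has to cover one chunk at a time:

```cuda
#include <cuda_fp16.h>

// Hypothetical epilogue: convert the fp32 output of the existing
// dequantize kernels into fp16 for cuBLAS, instead of duplicating
// every kernel in half precision.
__global__ void convert_f32_to_f16(const float * src, half * dst, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = __float2half(src[i]);
    }
}

static void convert_f32_to_f16_cuda(const float * src, half * dst,
                                    int n, cudaStream_t stream) {
    const int block = 256;
    const int grid  = (n + block - 1) / block;
    convert_f32_to_f16<<<grid, block, 0, stream>>>(src, dst, n);
}
```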
I'm a bit afraid of investing hours integrating such custom variants in case another big push comes from upstream.