
K Quant 64 support - quite a feat to integrate #34


Description

@cmp-nct

A large patch was just integrated into llama.cpp (ggml-org#2001), another stunning job by @ikawrakow.

In the long run we need it: K quants are better for 7B and offer more flexibility. But two obstacles need to be solved first:

  1. We need to modify that PR so the super-block size is no longer a compile-time switch; a single build needs to support both the 256 and 64 super-block sizes.
    Either by splitting and duplicating the code paths or by using a global variable instead of the define (see the first sketch after this list).
    Otherwise we'd need distinct compiled binaries for 7B and 40B.
  2. These are 32-bit dequantizers; we use 16-bit for cuBLAS to save 50% VRAM.
    It's not a huge change, but it doubles the kernels (again), and I'm a bit afraid of maintaining so many of them.
    Maybe instead of duplicating all kernels from 32 to 16 bit we could write a wrapper: let the kernels work in 32 bit and convert the result to half precision (see the second sketch after this list). Given the parallelization, that wouldn't require much VRAM.
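
For obstacle 1, a minimal sketch of the two shapes this could take. The `GGML_QKK_64` define is roughly what the upstream PR uses; the runtime names below (`g_qk_k`, `k_quants_set_super_block_size`) are hypothetical, not an existing API:

```c
// Upstream today (roughly): the super-block size is baked in at compile time,
// so a single binary can only serve one of the two sizes.
#ifdef GGML_QKK_64
#define QK_K 64
#else
#define QK_K 256
#endif

// The "global variable" alternative (hypothetical names, not an existing API):
// set once at model load, 64 for 7B and 256 for 40B.
static int g_qk_k = 256;

void k_quants_set_super_block_size(int qk) {
    g_qk_k = (qk == 64) ? 64 : 256;
}
```

The catch with a plain global is that QK_K also fixes the sizeof() of the block_q*_K structs, so both struct layouts and both kernel sets would still have to exist side by side and be dispatched through the variable; in practice that converges on the split-and-duplicate option with a runtime switch in front.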
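For obstacle 2, a minimal sketch of the wrapper idea, assuming we template the store type rather than hand-writing a second kernel set. `dequantize_one` is a stand-in for the existing 32-bit K-quant logic; none of these names come from llama.cpp:

```cuda
#include <cuda_fp16.h>

// Stand-in for an existing 32-bit dequantizer: here it just reads fp32 input
// so the sketch compiles; the real body would decode a q*_K super-block.
__device__ float dequantize_one(const void * vx, int i) {
    return ((const float *) vx)[i];
}

// One templated wrapper instead of a duplicated 16-bit kernel set: the math
// stays in fp32 in registers, and each value is narrowed to the destination
// type right before the store, so only the fp16 output ever occupies VRAM.
template <typename dst_t>
__global__ void dequantize_row(const void * vx, dst_t * y, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const float v = dequantize_one(vx, i); // existing 32-bit logic, unchanged
    y[i] = dst_t(v);                       // __half converts from float
}

// Usage: the same kernel body serves both precisions.
// dequantize_row<__half><<<n_blocks, 256>>>(vx, dst_f16, n); // fp16 for cuBLAS
// dequantize_row<float ><<<n_blocks, 256>>>(vx, dst_f32, n); // existing path
```

Because the conversion happens per thread in registers, the intermediate fp32 values never touch global memory, which is why the parallelization keeps the VRAM cost negligible.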

I'm a bit wary of investing hours into integrating such custom variants when another big push from upstream could land at any time.

Labels: enhancement (New feature or request)
