Thanks for the work on this repo, it's amazing!

The newly added quantized matmul kernels are great, but they are slower than the vec kernels at small batch sizes. In llama.cpp, the vec version is used whenever the batch size is <= 4:

- https://github.com/ggerganov/llama.cpp/pull/5351
- https://github.com/ggerganov/llama.cpp/pull/5370

Reading https://github.com/huggingface/candle/blob/main/candle-kernels/src/quantized.cu, it looks like the code was extracted from llama.cpp before these PRs were merged, so candle's version is missing this optimization.

For context, this issue was first noticed in mistral.rs: https://github.com/EricLBuehler/mistral.rs/issues/139
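
To make the request concrete, here is a minimal sketch of the kind of batch-size dispatch meant above. The function and constant names (`quantized_matmul`, `launch_mul_mat_vec_q`, `launch_mul_mat_q`, `MMVQ_MAX_BATCH_SIZE`) are illustrative stand-ins, not candle's actual API; the launch functions just print so the snippet runs on its own.

```rust
/// llama.cpp switches to the vec kernels when the batch dimension is at most 4
/// (value assumed here for illustration, see the PRs linked above).
const MMVQ_MAX_BATCH_SIZE: usize = 4;

fn launch_mul_mat_vec_q(n_rows: usize) {
    // In real code this would launch the mul_mat_vec_q* CUDA kernels.
    println!("mul_mat_vec_q path for {n_rows} row(s)");
}

fn launch_mul_mat_q(n_rows: usize) {
    // In real code this would launch the tiled mul_mat_q* CUDA kernels.
    println!("mul_mat_q path for {n_rows} row(s)");
}

/// Hypothetical dispatch: small batches use the vec kernels, larger ones the matmul kernels.
fn quantized_matmul(n_rows: usize) {
    if n_rows <= MMVQ_MAX_BATCH_SIZE {
        launch_mul_mat_vec_q(n_rows);
    } else {
        launch_mul_mat_q(n_rows);
    }
}

fn main() {
    quantized_matmul(1);  // single-token decode step -> vec kernels
    quantized_matmul(32); // prompt processing -> matmul kernels
}
```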