Thanks for the work on this repo, it's amazing!

The newly added quantized matmul kernels are great, but they are slower than the vec kernels at small batch sizes. In llama.cpp, the vec version is used whenever the batch size is <= 4:

- https://github.com/ggerganov/llama.cpp/pull/5351
- https://github.com/ggerganov/llama.cpp/pull/5370

Reading https://github.com/huggingface/candle/blob/main/candle-kernels/src/quantized.cu, it looks like the code was extracted from llama.cpp before these PRs were merged, so candle's version is missing this optimization.

For context, this issue was first noticed in mistral.rs: https://github.com/EricLBuehler/mistral.rs/issues/139
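
To make the request concrete, here is a minimal sketch of the kind of batch-size dispatch meant above. The function and constant names (`quantized_matmul`, `launch_mul_mat_vec_q`, `launch_mul_mat_q`, `MMVQ_MAX_BATCH_SIZE`) are illustrative stand-ins, not candle's actual API; the launch functions just print so the snippet runs on its own.

```rust
/// llama.cpp switches to the vec kernels when the batch dimension is at most 4
/// (value assumed here for illustration, see the PRs linked above).
const MMVQ_MAX_BATCH_SIZE: usize = 4;

fn launch_mul_mat_vec_q(n_rows: usize) {
    // In real code this would launch the mul_mat_vec_q* CUDA kernels.
    println!("mul_mat_vec_q path for {n_rows} row(s)");
}

fn launch_mul_mat_q(n_rows: usize) {
    // In real code this would launch the tiled mul_mat_q* CUDA kernels.
    println!("mul_mat_q path for {n_rows} row(s)");
}

/// Hypothetical dispatch: small batches use the vec kernels, larger ones the matmul kernels.
fn quantized_matmul(n_rows: usize) {
    if n_rows <= MMVQ_MAX_BATCH_SIZE {
        launch_mul_mat_vec_q(n_rows);
    } else {
        launch_mul_mat_q(n_rows);
    }
}

fn main() {
    quantized_matmul(1);  // single-token decode step -> vec kernels
    quantized_matmul(32); // prompt processing -> matmul kernels
}
```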