
llama : fix K-shift with quantized K and BLAS backend #13113


Merged: 1 commit into master on Apr 25, 2025

Conversation

slaren (Member) commented on Apr 25, 2025:

It is not necessary to set the backend explicitly: the destination of the ggml_cpy operation is a view of the K tensor, so the copy will be forced to run on the CPU backend, and the other ops will be expanded from it.

Fixes #13112
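
For context, this is roughly the graph the K-shift builds when the K cache is quantized: a `ggml_cast` to F32, a RoPE, and a `ggml_cpy` back into a view of K. The sketch below is illustrative, not the actual llama.cpp code; `build_k_shift` and the trailing RoPE hyperparameters are assumed names and values.

```c
// Minimal sketch against the public ggml API; build_k_shift and its
// parameters are illustrative placeholders, not the actual llama.cpp helper.
#include "ggml.h"

// k_l : one layer's K cache tensor (possibly quantized, CPU-resident)
// pos : I32 tensor holding the per-cell position deltas of the shift
static void build_k_shift(
        struct ggml_context * ctx,
        struct ggml_cgraph  * gf,
        struct ggml_tensor  * k_l,
        struct ggml_tensor  * pos,
        int n_embd_head_k, int n_head_kv, int n_ctx,
        int rope_type, float freq_base, float freq_scale) {
    // View the flat cache as [n_embd_head_k, n_head_kv, n_ctx] so RoPE can rotate it.
    struct ggml_tensor * k = ggml_view_3d(ctx, k_l,
            n_embd_head_k, n_head_kv, n_ctx,
            ggml_row_size(k_l->type, n_embd_head_k),
            ggml_row_size(k_l->type, n_embd_head_k*n_head_kv),
            0);

    struct ggml_tensor * cur;
    if (ggml_is_quantized(k->type)) {
        // Quantized K cannot be rotated in place: dequantize to F32, apply
        // RoPE, then quantize back by copying into the view of K. Because the
        // ggml_cpy destination is a view of K, the scheduler forces the copy
        // onto the backend that holds K (the CPU) and expands the neighboring
        // ops from it, so no explicit ggml_backend_sched_set_tensor_backend()
        // call is needed (the call this PR removes).
        struct ggml_tensor * tmp = ggml_cast(ctx, k, GGML_TYPE_F32);
        // trailing RoPE hyperparameters below are illustrative defaults
        tmp = ggml_rope_ext(ctx, tmp, pos, NULL, n_embd_head_k, rope_type,
                0, freq_base, freq_scale, 0.0f, 1.0f, 0.0f, 0.0f);
        cur = ggml_cpy(ctx, tmp, k);
    } else {
        // Non-quantized K is rotated in place.
        cur = ggml_rope_ext_inplace(ctx, k, pos, NULL, n_embd_head_k, rope_type,
                0, freq_base, freq_scale, 0.0f, 1.0f, 0.0f, 0.0f);
    }

    ggml_build_forward_expand(gf, cur);
}
```

Since the `ggml_cpy` writes into a view of the CPU-resident K tensor, the scheduler already pins it, and by expansion its producers, to the CPU backend; the BLAS backend, which per the linked issue rejects the CPY op, is never asked to run it.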

slaren merged commit 295354e into master on Apr 25, 2025 (48 checks passed).
slaren deleted the sl/fix-q-k-shift-backend branch on Apr 25, 2025 at 17:40.
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request on Apr 28, 2025.
slaren referenced this pull request on Apr 30, 2025.
ggerganov added a commit that referenced this pull request on Apr 30, 2025.
ggerganov added a commit that referenced this pull request on May 2, 2025.
ggerganov added a commit that referenced this pull request on May 2, 2025, with the following message:
* kv-cache : separate recurrent vs non-recurrent impl (wip)
* kv-cache : init -> constructor + add llama_memory_params
* kv-cache : fix callback reference
* context : llama_kv_cache -> llama_memory_i
* context : move memory creation logic to model
* llama : remove reference of memory during encode
* kv-cache : hide padding details in the implementation
* kv-cache : add ubatch_next()
* context : simplify sbatch logic
* kv-cache : hide defrag logic in the implementation
* context : hide kv cache details in implementation
* build : fix
* cont : another fix
* kv-cache : simplify interface (wip)
* kv-cache : use separate KV cell structs for unified/recurrent
* kv-cache : clean-up
* model : better llama_model::create_model() signature
* kv-cache : fix recurrent seq_rm()
* kv-cache : replace `struct callbacks` with `llama_model &`
* kv-cache : replace `struct graph_params` with `llama_context &`
* kv-cache : fix offload check
* context : avoid passing unique_ptr
* kv-cache : avoid using the backends from the llama_context (ref #13113)
* kv-cache : more consistent debug logs [no ci]
* kv-cache : do not pass the full llama_context for kv graphs
* kv-cache : remove comment
* kv-cache : ggml_rope_ext_inplace -> ggml_rope_ext
* kv-cache : fix recurrent multi-user case
* memory : remove comments [no ci]
Closes: Misc. bug: Unsupported op "CPY" / SIGABRT on Apple CPU (#13112)