
Conversation

@yael-works
Contributor

New Attention Mechanism: SparseK Attention (CPU Backend)

This PR introduces a new attention mechanism called SparseK Attention, implemented from scratch as a new operator within the GGML framework, currently with CPU backend support.


Overview

SparseK Attention is a selective, efficient attention mechanism inspired by Flash Attention. It introduces additional sparsity through the following mechanisms (sketched below):

  • Top-K filtering – keeps only the strongest attention weights.
  • Local windowing – limits attention to a configurable local context.
  • Global stride – adds periodic global connections between tokens.
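A minimal, illustrative sketch of the selection rule these three mechanisms imply (this is not the operator code itself; the parameter names k_top, win_local, and stride_global mirror the operator parameters listed below, and causal attention is assumed):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative sketch only: decide which key positions a query at position q_pos
// may attend to, combining a local window, periodic global tokens, and Top-K
// filtering of the raw attention scores. Causal attention is assumed.
static std::vector<int64_t> sparsek_allowed_keys(
        int64_t q_pos, int64_t n_kv,
        const std::vector<float> & scores,   // raw QK^T scores for this query row
        int64_t k_top, int64_t win_local, int64_t stride_global) {
    std::vector<int64_t> allowed;
    for (int64_t j = 0; j < n_kv && j <= q_pos; ++j) {
        const bool in_window = (q_pos - j) < win_local;                      // local windowing
        const bool is_global = stride_global > 0 && j % stride_global == 0;  // global stride
        if (in_window || is_global) {
            allowed.push_back(j);
        }
    }
    // Top-K filtering: keep only the k_top candidates with the strongest scores.
    if ((int64_t) allowed.size() > k_top) {
        std::partial_sort(allowed.begin(), allowed.begin() + k_top, allowed.end(),
            [&](int64_t a, int64_t b) { return scores[a] > scores[b]; });
        allowed.resize(k_top);
    }
    return allowed;
}
```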

Implementation Details

  • Added new operator: GGML_OP_SPARSEK_ATTN defined in ggml.h and ggml.c.
  • Implemented the construction function ggml_sparsek_attn(), which creates a computation node with the parameters k_top, win_local, and stride_global (see the declaration sketch after this list).
  • Added full CPU backend implementation in:
    • ggml-cpu/ops.h
    • ggml-cpu/ops.cpp
    • ggml-cpu.c
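For reference, a sketch of what the constructor's declaration could look like in GGML style; this is an assumption for illustration only, and the exact signature in the PR may differ:

```cpp
#include <cstdint>

struct ggml_context;
struct ggml_tensor;

// Hypothetical declaration (assumption for illustration; the PR's actual
// signature may differ). The constructor only records the sparsity parameters
// on a new GGML_OP_SPARSEK_ATTN graph node; the CPU backend does the work.
struct ggml_tensor * ggml_sparsek_attn(
        struct ggml_context * ctx,
        struct ggml_tensor  * q,
        struct ggml_tensor  * k,
        struct ggml_tensor  * v,
        int32_t               k_top,           // keep only the k_top strongest weights
        int32_t               win_local,       // local attention window size
        int32_t               stride_global);  // period of the global connections
```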

The CPU version includes the following steps (a simplified reference sketch follows this list):

  • Scaled dot-product computation QKᵀ / √d
  • Dynamic Top-K filtering
  • Softmax normalization
  • Multiplication with V
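As a simplified reference for these steps, the following single-head sketch computes softmax(QKᵀ/√d)·V on plain arrays. It is not the ggml-cpu implementation; the Top-K/window filtering from the sketch above would be applied to each score row before the softmax:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Simplified single-head reference: O = softmax(Q K^T / sqrt(d)) V.
// Q is [n_q x d], K and V are [n_kv x d], all row-major. In SparseK, masked
// score entries would be set to -INFINITY (or skipped) before the softmax.
static void attention_reference(const std::vector<float> & Q,
                                const std::vector<float> & K,
                                const std::vector<float> & V,
                                std::vector<float> & O,
                                int n_q, int n_kv, int d) {
    const float scale = 1.0f / std::sqrt((float) d);
    std::vector<float> s(n_kv);
    O.assign((size_t) n_q * d, 0.0f);

    for (int i = 0; i < n_q; ++i) {
        // scaled dot-product scores QK^T / sqrt(d) for query row i
        float smax = -INFINITY;
        for (int j = 0; j < n_kv; ++j) {
            float acc = 0.0f;
            for (int c = 0; c < d; ++c) {
                acc += Q[i*d + c] * K[j*d + c];
            }
            s[j] = acc * scale;
            smax = std::max(smax, s[j]);
        }
        // softmax normalization (numerically stable)
        float sum = 0.0f;
        for (int j = 0; j < n_kv; ++j) {
            s[j] = std::exp(s[j] - smax);
            sum += s[j];
        }
        // multiplication with V
        for (int j = 0; j < n_kv; ++j) {
            const float w = s[j] / sum;
            for (int c = 0; c < d; ++c) {
                O[i*d + c] += w * V[j*d + c];
            }
        }
    }
}
```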

Next Steps

Our next goal is to extend SparseK Attention to the SYCL (GPU) backend in order to:

  • Measure and compare performance between CPU and GPU implementations.
  • Optimize kernel execution for sparse attention patterns.
  • Validate correctness and scaling on Intel GPUs.

We are submitting this initial CPU implementation first to ensure review, integration, and baseline correctness before introducing GPU acceleration.


Co-Authors

Co-authored-by: Yael Shuker ([email protected])
Co-authored-by: Gitty Burstein ([email protected])

@GittyBurstein
Contributor

GittyBurstein commented Oct 28, 2025

Hi @CISC and @NeoZhangJianyu,

We’d appreciate it if you could review our PR implementing the new SPARSEK Attention operator.
We ran internal validation tests that we created ourselves, and all of them passed.

This contribution was developed jointly by both of us (@yael-works and @GittyBurstein ).
Please make sure the PR reflects both contributors — if needed, we can adjust the commit authors accordingly.

Thanks in advance for your time and feedback!

@CISC
Collaborator

CISC commented Oct 28, 2025

We are talking about this SparseK, right?

@yael-works
Contributor Author

yael-works commented Oct 28, 2025

yes! @CISC

github-actions bot added the labels testing (Everything test related) and ggml (changes relating to the ggml tensor library for machine learning) on Oct 28, 2025
@CISC
Collaborator

CISC commented Oct 30, 2025

You need to rebase to fix the Server CI failures; also, please fix the whitespace issues:
https://github.com/ggml-org/llama.cpp/actions/runs/18935125175/job/54060021809

@GittyBurstein
Contributor

Hi @CISC,
Just to clarify — the failing tests are unrelated to my changes.
This PR only introduces the new SPARSEK Attention operator within GGML and doesn’t modify any existing server or inference logic.

I’d really appreciate it if you could review the code itself so we can move forward with the merge —
all SPARSEK-related tests are passing successfully.

Thanks!

@CISC
Collaborator

CISC commented Oct 31, 2025

> Hi @CISC, Just to clarify — the failing tests are unrelated to my changes. This PR only introduces the new SPARSEK Attention operator within GGML and doesn’t modify any existing server or inference logic.

Yes, as mentioned, this will be resolved if you rebase, it's ok. :)

> I’d really appreciate it if you could review the code itself so we can move forward with the merge — all SPARSEK-related tests are passing successfully.

So, my main challenge is where/what/when will SparseK be used? I can't recall seeing any actual implementation being used in the wild. This also means we don't really have any reference to test it against...

@GittyBurstein
Contributor

GittyBurstein commented Oct 31, 2025

@CISC
The current PR focuses solely on adding the SparseK Attention operator at the GGML level (CPU backend).
At this stage, it isn’t directly integrated into the model’s runtime pipeline — it’s designed as a standalone operator for experimentation and future extensions.

Once this PR is merged, the operator can be connected to higher-level use cases such as:

  • selective attention mechanisms for long-context models,

  • experimental low-latency or memory-efficient inference,

  • or research benchmarking against variants like Flash Attention or block-sparse implementations.

Do you have any other ideas that could demonstrate or validate this even better?

Thank you!!

@CISC
Collaborator

CISC commented Oct 31, 2025

I think @ggerganov will have to weigh in on this.

@ggerganov
Member

Sparse attention implementations such as DSA and SparseK should leverage the existing FA implementations and mask filtering logic. No need to introduce new operators and duplicate all the existing work that already went into optimizing FA.

yael-works force-pushed the feature/sparsek-attn-sycl branch from 77f4088 to 22c063e on November 2, 2025 09:53
@yael-works
Contributor Author

Hi @ggerganov and @CISC,
The branch has been successfully rebased on the latest master.
All SparseK Attention tests are passing, and the PR is ready for final review and merge.
Thanks for the feedback and support!
— Yael & Gitty

yael-works force-pushed the feature/sparsek-attn-sycl branch from 16d7eee to 556ab36 on November 3, 2025 09:21
@yael-works
Contributor Author

Hi @ggerganov and @CISC,
Following @ggerganov’s feedback, we refactored SparseK to reuse the existing FlashAttention logic rather than maintaining a separate operator.
The new design integrates SparseK’s sparsity mechanism (Top-K + local + stride) within the FlashAttention extension path.
This keeps the optimization benefits of FlashAttention while allowing selective sparse attention behavior — all tested and validated on CPU backend.

Member

@ggerganov ggerganov left a comment

My idea was more along the following lines:

  • Sparse attention implementations should somehow compute a sparse KQ mask. Depending on the specifics (e.g. local windows, top-k product, deepseek lightning stuff, etc.) this can be done in different ways, but generally it should require some extra logic when constructing the compute graph
  • Then we pass the sparse KQ mask (i.e. a normal mask but with extra -INF values where we don't have to compute the attention) to ggml_flash_attn_ext and we delegate the filtering logic to the backend implementation. For example, the Metal backend will already skip large amount of the filtered values depending on the KQ mask contents (#16372). Similar or better logic can be added to the other backend implementations.

I think at most, the only change to the existing ggml_flash_attn_ext API would be to provide a "mask hint" that would inform the backend what kind of mask to expect (causal, sparse, etc.). And the rest of the changes should be at the compute graph level and at the backend implementation for filtering the -INF values. Let me know if this makes sense.
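To illustrate the suggested direction, a rough sketch that assumes the current ggml_flash_attn_ext signature (q, k, v, mask, scale, max_bias, logit_softcap); build_sparse_attn and sparse_mask are hypothetical placeholders for whatever graph-level logic produces the sparsity pattern:

```cpp
#include "ggml.h"

// Rough sketch (not existing llama.cpp code): the graph-level logic produces a
// KQ mask that already contains extra -INF entries for the sparse pattern, and
// the attention itself is delegated to the existing FlashAttention operator.
static struct ggml_tensor * build_sparse_attn(
        struct ggml_context * ctx,
        struct ggml_tensor  * q,            // query tensor
        struct ggml_tensor  * k,            // key tensor
        struct ggml_tensor  * v,            // value tensor
        struct ggml_tensor  * sparse_mask,  // F16 mask with -INF where attention is skipped
        float                 kq_scale) {   // typically 1/sqrt(head_dim)
    return ggml_flash_attn_ext(ctx, q, k, v, sparse_mask,
                               kq_scale,
                               /*max_bias      =*/ 0.0f,
                               /*logit_softcap =*/ 0.0f);
}
```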

@GittyBurstein
Contributor

@ggerganov
Before we start implementing, we want to make sure we understand correctly —
We’re not creating a separate operator for SparseK at all, but instead just adding a mask that integrates with ggml_flash_attn_ext, right?

And if that’s the case, where exactly should the mask implementation be added — inside the compute graph logic, or only for testing (e.g., in test-backend-ops)?
thanks!
Yael & Gitty

@ggerganov
Member

> We’re not creating a separate operator for SparseK at all, but instead just adding a mask that integrates with ggml_flash_attn_ext, right?

In llama.cpp, the mask is already being created and passed to ggml_flash_attn_ext. Currently, we populate the mask outside of the compute graph because it is static - i.e. depends only on the token positions in the sequences:

```cpp
void llama_kv_cache::set_input_kq_mask(ggml_tensor * dst, const llama_ubatch * ubatch, bool causal_attn) const {
    const uint32_t n_tokens = ubatch->n_tokens;

    GGML_ASSERT(ggml_backend_buffer_is_host(dst->buffer));
    float * data = (float *) dst->data;

    const int64_t n_kv     = dst->ne[0];
    const int64_t n_stream = dst->ne[3]; // num streams in the current ubatch

    GGML_ASSERT(n_tokens%n_stream == 0);

    // n_tps == n_tokens_per_stream
    const int64_t n_tps     = n_tokens/n_stream;
    const int64_t n_tps_pad = GGML_PAD(n_tps, GGML_KQ_MASK_PAD);

    std::fill(data, data + ggml_nelements(dst), -INFINITY);

    // Use only the previous KV cells of the correct sequence for each token of the ubatch.
    // It's assumed that if a token in the batch has multiple sequences, they are equivalent.
    // Example with a cache of 10 tokens, 2 tokens populated in cache and 3 tokens in batch:
    //   Causal mask:
    //     xxx-------
    //     xxxx------
    //     xxxxx-----
    //   Non-causal mask:
    //     xxxxx-----
    //     xxxxx-----
    //     xxxxx-----
    // To visualize the mask, see https://github.com/ggml-org/llama.cpp/pull/12615
    // TODO: optimize this section
    for (uint32_t h = 0; h < 1; ++h) {
        for (uint32_t s = 0; s < n_stream; ++s) {
            for (uint32_t ii = 0; ii < n_tps; ++ii) {
                const uint32_t i = s*n_tps + ii;

                const llama_seq_id seq_id = ubatch->seq_id[i][0];

                const auto & cells = v_cells[seq_to_stream[seq_id]];

                const llama_pos p1 = ubatch->pos[i];

                // for M-RoPE
                const bool is_2d = ubatch->is_pos_2d();
                const llama_pos p1_x = is_2d ? ubatch->pos[i + ubatch->n_tokens*2] : 0;
                const llama_pos p1_y = is_2d ? ubatch->pos[i + ubatch->n_tokens]   : 0;

                const uint64_t idst = n_kv*(h*n_stream*n_tps_pad + s*n_tps_pad + ii);

                for (uint32_t j = 0; j < n_kv; ++j) {
                    if (cells.is_empty(j)) {
                        continue;
                    }

                    // mask the token if not the same sequence
                    if (!cells.seq_has(j, seq_id)) {
                        continue;
                    }

                    const llama_pos p0 = cells.pos_get(j);

                    // mask future tokens
                    if (causal_attn && p0 > p1) {
                        continue;
                    }

                    // M-RoPE causal mask
                    if (causal_attn && is_2d && p0 == p1) {
                        const auto & p0_ext = cells.ext_get(j);
                        if (p0_ext.is_2d_gt(p1_x, p1_y)) {
                            continue;
                        }
                    }

                    // apply SWA if any
                    if (is_masked_swa(p0, p1)) {
                        continue;
                    }

                    data[idst + j] = hparams.use_alibi ? -std::abs(p0 - p1) : 0.0f;
                }
            }
        }
    }
}
```

I think that the sparse attention implementations should augment this static mask through some extra logic. This extra logic should be implemented for example in the llm_graph_context::build_attn methods. This specific logic could potentially require some new ggml operators, but in general it boils down to setting certain elements of the kq_mask tensor to -INF in some way.

From there, the FA implementations will deal with the provided mask in their own way (i.e. by skipping computations when possible).
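As one possible illustration of such filtering, a hypothetical helper that post-processes the host-side mask buffer with a window-plus-stride rule (the helper name and its parameters are assumptions, and Top-K filtering is deliberately left out because it depends on the KQ scores and therefore needs graph-level logic):

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical sketch: after the regular causal mask has been written into
// `data` (n_kv columns per query row, 0.0f = visible, -INFINITY = masked),
// additionally mask everything outside a local window and off the global stride.
static void sparsek_filter_kq_mask(float * data, int64_t n_rows, int64_t n_kv,
                                   const int64_t * q_pos,     // position of each query row
                                   const int64_t * kv_pos,    // position of each KV cell
                                   int64_t win_local, int64_t stride_global) {
    for (int64_t i = 0; i < n_rows; ++i) {
        for (int64_t j = 0; j < n_kv; ++j) {
            float & m = data[i*n_kv + j];
            if (m == -INFINITY) {
                continue; // already masked (wrong sequence, future token, SWA, ...)
            }
            const bool in_window = (q_pos[i] - kv_pos[j]) < win_local;
            const bool is_global = stride_global > 0 && kv_pos[j] % stride_global == 0;
            if (!in_window && !is_global) {
                m = -INFINITY;
            }
        }
    }
}
```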

> And if that’s the case, where exactly should the mask implementation be added — inside the compute graph logic, or only for testing (e.g., in test-backend-ops)?

For testing, you can already take a look at how we create KQ masks with blocks of -INF values here:

```cpp
// generate an F16 mask where certain blocks are randomly masked with -INF value
static void init_tensor_kq_mask(ggml_tensor * tensor, float min = -1.0f, float max = 1.0f) {
    GGML_ASSERT(tensor->type == GGML_TYPE_F16);

    GGML_TENSOR_LOCALS(int32_t, ne, tensor, ne);

    std::vector<float>       data_f32(ne0*ne1*ne2*ne3);
    std::vector<ggml_fp16_t> data_f16(ne0*ne1*ne2*ne3);

    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<float> dis(min, max);
    for (size_t i = 0; i < data_f32.size(); i++) {
        data_f32[i] = dis(gen);
    }

    // block size
    const int blck0 = 128;
    const int blck1 = 64;

    // number of INF blocks
    const int n_inf_blocks = 0.1*(ne0*ne1*ne2*ne3)/(blck0*blck1);

    for (int b = 0; b < n_inf_blocks; b++) {
        const int p3 = (rd() % ne3);
        const int p2 = (rd() % ne2);
        const int p1 = (rd() % ne1);
        const int p0 = (rd() % ne0);

        for (int i1 = 0; i1 < blck1 && p1 + i1 < ne1; i1++) {
            const int idx = p3*ne2*ne1*ne0 + p2*ne1*ne0 + (p1 + i1)*ne0 + p0;
            for (int i0 = 0; i0 < blck0 && p0 + i0 < ne0; i0++) {
                data_f32[idx + i0] = -INFINITY;
            }
        }
    }

    ggml_fp32_to_fp16_row(data_f32.data(), data_f16.data(), ne0*ne1*ne2*ne3);

    ggml_backend_tensor_set(tensor, data_f16.data(), 0, data_f16.size()*sizeof(ggml_fp16_t));
}
```

I imagine that we would need tests that create various sorts of sparse masks and simply run ggml_flash_attn_ext as we do now. And also additional tests as needed, depending on what new operators for constructing these sparse masks are introduced.
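For example, a test-only variant of init_tensor_kq_mask could fill the mask with a deterministic window-plus-stride pattern and then run the existing ggml_flash_attn_ext test cases against it. The sketch below is an assumption modeled on the helper above (same file context, F16 mask with n_kv along ne0 and query rows along ne1), not existing test code:

```cpp
// Sketch of a sparse-pattern variant of init_tensor_kq_mask (hypothetical):
// each query row i1 may see keys within `win_local` positions and every
// `stride_global`-th key; everything else is masked with -INF.
static void init_tensor_kq_mask_sparsek(ggml_tensor * tensor,
                                        int win_local = 64, int stride_global = 128) {
    GGML_ASSERT(tensor->type == GGML_TYPE_F16);

    GGML_TENSOR_LOCALS(int32_t, ne, tensor, ne);

    std::vector<float>       data_f32(ne0*ne1*ne2*ne3, -INFINITY);
    std::vector<ggml_fp16_t> data_f16(ne0*ne1*ne2*ne3);

    for (int i3 = 0; i3 < ne3; i3++) {
        for (int i2 = 0; i2 < ne2; i2++) {
            for (int i1 = 0; i1 < ne1; i1++) {       // query rows
                for (int i0 = 0; i0 < ne0; i0++) {   // KV positions
                    const bool in_window = std::abs(i1 - i0) < win_local;
                    const bool is_global = (i0 % stride_global) == 0;
                    if (in_window || is_global) {
                        data_f32[((i3*ne2 + i2)*ne1 + i1)*ne0 + i0] = 0.0f;
                    }
                }
            }
        }
    }

    ggml_fp32_to_fp16_row(data_f32.data(), data_f16.data(), ne0*ne1*ne2*ne3);
    ggml_backend_tensor_set(tensor, data_f16.data(), 0, data_f16.size()*sizeof(ggml_fp16_t));
}
```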

GittyBurstein force-pushed the feature/sparsek-attn-sycl branch from df59fa2 to 8db1307 on November 11, 2025 19:55
@GittyBurstein
Contributor

Hi @ggerganov!
We’ve now implemented dynamic mask construction directly within the graph, replacing the previous static approach.
This implementation builds the mask nodes at graph time, allowing flexible control through the SparseK parameters (e.g., LLAMA_SPARSEK_ENABLE, LLAMA_SPARSEK_TOPK, etc.).

We’d really appreciate it if you could take a look at the updated code —
we’re very eager to move forward to the next step.
Gitty & Yael

GittyBurstein and others added 4 commits on November 11, 2025 23:18
Co-authored-by: Sigbjørn Skjæret <[email protected]>
…n call, header cleanup)

Co-authored-by: Gitty Burstein <[email protected]>
Co-authored-by: Yael <[email protected]>
…n call, header cleanup)

Co-authored-by: Gitty Burstein <[email protected]>
Co-authored-by: Yael <[email protected]>
@yael-works
Contributor Author

Hi @ggerganov!
Yesterday @CISC did a code review for us, and we made all the updates according to your guidelines.
We’d be happy if you could also take a look so we can move forward with the merge 🙏

@CISC
Collaborator

CISC commented Nov 12, 2025

@yael-works See #16817 (comment)

@yael-works
Contributor Author

yael-works commented Nov 12, 2025

Hi @CISC 👋
I’ve just pushed the latest update — the fix was actually quite small, mainly aligning the mask reshape and tightening the top-k guard.
Everything should now be fully consistent with your feedback.
I’d really appreciate your guidance on how we can move the PR forward as soon as possible — I’m eager to start working on the GPU implementation, so it’s important to confirm that this version looks good to you.
Is there anything else you’d like me to adjust or clarify to help finalize this review?
Thanks so much for your time and support 🙏

@GittyBurstein
Contributor

GittyBurstein commented Nov 13, 2025

Hi @ggerganov @NeoZhangJianyu
We’d really appreciate your feedback on our addition — we worked on it with the goal of matching the guidance we received at the beginning.
This algorithm implementation is our final project, and we’re really eager to move forward and complete it, especially with our submission deadline coming up in the next few days.

Thank you so much for your time and support!
Yael & Gitty

@CISC
Collaborator

CISC commented Nov 13, 2025

> @CISC has already done a very thorough code review, and we carefully addressed all the comments to ensure the implementation meets all the requirements.

TBC, I have merely made sure you have "working" code and pass the EditorConfig CI; please do not consider my efforts here as a code review.

@GittyBurstein
Contributor

@CISC
You're right,
I'm editing the comment again....

@ggerganov
Member

> This algorithm implementation is our final project, and we’re really eager to move forward and complete it, especially with our submission deadline coming up in the next few days.

Adjust your expectations - this PR is far from a state where it can be merged. Certainly it's not going to be merged just to meet a submission deadline.

As it is, it has no practical value because no existing open model uses this type of sparse attention. As a PoC it is OK and you can play with these changes if this is interesting to you and your project.

A final version would at the very least have to:

  • have a real model to test with
  • avoid reading env vars and instead get the information from the model metadata
  • reduce the number of graph nodes in some way
  • devise a strategy for efficient -INF filtering in the FA kernels
  • evaluate performance
  • add tests

In short, there is a long way before getting this in master. Please reduce the amount of comments asking to merge if you want to get any further assistance on this.
