Implement SparseK Attention mechanism — new GGML operator with CPU backend (GPU planned next) #16817
base: master
Conversation
Hi @CISC and @NeoZhangJianyu, we'd appreciate it if you could review our PR implementing the new SparseK Attention operator. This contribution was developed jointly by both of us (@yael-works and @GittyBurstein). Thanks in advance for your time and feedback!
We are talking about this SparseK, right?
Yes! @CISC
You need to rebase to fix the Server CI failures; please also fix the whitespace:
Hi @CISC, I'd really appreciate it if you could review the code itself so we can move forward with the merge. Thanks!
Yes, as mentioned, that will be resolved if you rebase, it's ok. :)
So, my main challenge is where/what/when will SparseK be used? I can't recall seeing any actual implementation being used in the wild. This also means we don't really have any reference to test it against...
@CISC Once this PR is merged, the operator can be connected to higher-level use cases such as:
Thank you!!
I think @ggerganov will have to weigh in on this.
Sparse attention implementations such as DSA and SparseK should leverage the existing FA implementations and mask filtering logic. No need to introduce new operators and duplicate all the existing work that already went into optimizing FA.
Hi @ggerganov and @CISC,
Hi @ggerganov and @CISC,
ggerganov left a comment:
My idea was more along the following lines:
- Sparse attention implementations should somehow compute a sparse KQ mask. Depending on the specifics (e.g. local windows, top-k product, deepseek lightning stuff, etc.) this can be done in different ways, but generally it should require some extra logic when constructing the compute graph
- Then we pass the sparse KQ mask (i.e. a normal mask but with extra -INF values where we don't have to compute the attention) to ggml_flash_attn_ext and we delegate the filtering logic to the backend implementation. For example, the Metal backend will already skip a large amount of the filtered values depending on the KQ mask contents (#16372). Similar or better logic can be added to the other backend implementations.
I think, at most, the only change to the existing ggml_flash_attn_ext API would be to provide a "mask hint" that would inform the backend what kind of mask to expect (causal, sparse, etc.). And the rest of the changes should be at the compute graph level and at the backend implementation for filtering the -INF values. Let me know if this makes sense.
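To make the mask-based approach concrete, here is a rough host-side sketch (not code from this PR, just an illustration) that fills a sparse additive KQ mask; win_local and stride_global reuse the parameter names from this PR. Kept positions stay 0.0f, filtered positions become -INF, and a buffer like this is what would back the mask tensor handed to ggml_flash_attn_ext:

```cpp
// Sketch only: fill a row-major [n_tokens x n_kv] additive KQ mask where
// filtered positions are -INF and attended positions are 0.0f.
#include <cmath>
#include <vector>

std::vector<float> build_sparse_kq_mask(int n_tokens, int n_kv,
                                        int win_local, int stride_global) {
    std::vector<float> mask((size_t) n_tokens * n_kv, -INFINITY);
    for (int i = 0; i < n_tokens; ++i) {
        for (int j = 0; j < n_kv; ++j) {
            const bool causal = j <= i;                            // standard causal constraint
            const bool local  = causal && (i - j) < win_local;     // keys inside the local window
            const bool global = causal && j % stride_global == 0;  // strided "global" keys
            if (local || global) {
                mask[(size_t) i * n_kv + j] = 0.0f;                // this position stays visible
            }
        }
    }
    return mask;
}
```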
@ggerganov And if that's the case, where exactly should the mask implementation be added: inside the compute graph logic, or only for testing (e.g., in test-backend-ops)?
In llama.cpp, the mask is already being created and passed in llama.cpp/src/llama-kv-cache.cpp, lines 1223 to 1306 (commit afd3532).
I think that the sparse attention implementations should augment this static mask through some extra logic, implemented for example when constructing the compute graph. From there, the FA implementations will deal with the provided mask in their own way (i.e. by skipping computations when possible).
For testing, you can already take a look at how we create KQ masks with blocks of -INF values here: llama.cpp/tests/test-backend-ops.cpp, lines 134 to 176 (commit afd3532).
I imagine that we would need tests that create various sorts of sparse masks and simply run
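As a rough illustration of such a test (this is not the existing test-backend-ops code), randomized sparse masks could be generated along these lines and fed through the FA test cases:

```cpp
// Sketch only: build a random sparse additive mask for a test case.
// Each row keeps one guaranteed position plus a random subset of other keys.
#include <cmath>
#include <random>
#include <vector>

std::vector<float> make_random_sparse_mask(int n_tokens, int n_kv, float keep_prob,
                                           std::mt19937 & rng) {
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> mask((size_t) n_tokens * n_kv, -INFINITY);
    for (int i = 0; i < n_tokens; ++i) {
        mask[(size_t) i * n_kv + (i % n_kv)] = 0.0f;  // always keep at least one position per row
        for (int j = 0; j < n_kv; ++j) {
            if (dist(rng) < keep_prob) {
                mask[(size_t) i * n_kv + j] = 0.0f;   // randomly keep this position
            }
        }
    }
    return mask;
}
```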
Hi @ggerganov! We'd really appreciate it if you could take a look at the updated code.
Hi @ggerganov!
Hi @CISC 👋
Hi @ggerganov @NeoZhangJianyu, thank you so much for your time and support!
TBC I have merely made sure you have "working" code and pass
@CISC
Adjust your expectations - this PR is far from a state where it can be merged. Certainly it's not going to be merged just to meet a submission deadline. As it is, it has no practical value because no existing open model uses this type of sparse attention. As a PoC it is OK and you can play with these changes if this is interesting to you and your project. A final version would at the very least have to:
In short, there is a long way to go before getting this in.
New Attention Mechanism: SparseK Attention (CPU Backend)
This PR introduces a new attention mechanism called SparseK Attention, implemented from scratch as a new operator within the GGML framework, currently with CPU backend support.
Overview
SparseK Attention is a selective and efficient attention mechanism inspired by Flash Attention, but it introduces additional sparsity through top-k score selection (k_top), a local attention window (win_local), and strided global attention (stride_global).
Implementation Details
- New operator GGML_OP_SPARSEK_ATTN defined in ggml.h and ggml.c.
- API function ggml_sparsek_attn() that creates a computation node with parameters (k_top, win_local, stride_global).
- CPU implementation in ggml-cpu/ops.h, ggml-cpu/ops.cpp, and ggml-cpu.c.
The CPU version includes:
- Computation of scaled attention scores QKᵀ / √d
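For illustration, below is a simplified single-head reference of this computation (a sketch only, not the actual operator code in this PR); it applies the causal, local-window, and strided-global pattern, keeps the k_top largest surviving scores, and finishes with a softmax and a weighted sum over V:

```cpp
// Sketch only: single-head SparseK reference. Q is [n_tokens x d], K and V are
// [n_kv x d], out is [n_tokens x d], all row-major. Assumes n_tokens <= n_kv and
// k_top, win_local, stride_global >= 1.
#include <algorithm>
#include <cmath>
#include <functional>
#include <vector>

void sparsek_attn_ref(const float * Q, const float * K, const float * V, float * out,
                      int n_tokens, int n_kv, int d,
                      int k_top, int win_local, int stride_global) {
    std::vector<float> scores(n_kv), w(n_kv);
    for (int i = 0; i < n_tokens; ++i) {
        // 1) scaled dot-product scores QK^T / sqrt(d), with the sparsity pattern applied
        for (int j = 0; j < n_kv; ++j) {
            const bool keep = j <= i && ((i - j) < win_local || j % stride_global == 0);
            if (!keep) { scores[j] = -INFINITY; continue; }
            float dot = 0.0f;
            for (int c = 0; c < d; ++c) dot += Q[i*d + c] * K[j*d + c];
            scores[j] = dot / std::sqrt((float) d);
        }
        // 2) threshold at the k_top-th largest surviving score
        std::vector<float> sorted(scores);
        const int k = std::min(k_top, n_kv);
        std::nth_element(sorted.begin(), sorted.begin() + (k - 1), sorted.end(), std::greater<float>());
        const float thresh = sorted[k - 1];
        // 3) numerically stable softmax over the kept scores
        float max_s = -INFINITY, sum = 0.0f;
        for (int j = 0; j < n_kv; ++j)
            if (std::isfinite(scores[j]) && scores[j] >= thresh) max_s = std::max(max_s, scores[j]);
        for (int j = 0; j < n_kv; ++j) {
            w[j] = (std::isfinite(scores[j]) && scores[j] >= thresh) ? std::exp(scores[j] - max_s) : 0.0f;
            sum += w[j];
        }
        // 4) weighted sum over V
        for (int c = 0; c < d; ++c) {
            float acc = 0.0f;
            for (int j = 0; j < n_kv; ++j) acc += (w[j] / sum) * V[j*d + c];
            out[i*d + c] = acc;
        }
    }
}
```

The real operator works on GGML tensors and handles batching and multiple heads; this sketch only mirrors the per-row math.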
Next Steps
Our next goal is to extend SparseK Attention to the SYCL (GPU) backend in order to:
We are submitting this initial CPU implementation first to ensure review, integration, and baseline correctness before introducing GPU acceleration.
Co-Authors
Co-authored-by: Yael Shuker ([email protected])
Co-authored-by: Gitty Burstein ([email protected])