CUDA: fix bad asserts for partial offload #13337


Merged

Conversation

JohannesGaessler
Collaborator

Fixup to #13320.

The requirements for clearing the padding are that it must not be done for views, since that would risk clearing valid tensor data, and that the tensor's memory must be in one contiguous block. The latter avoids having to handle the edge case of potentially needing to clear memory inside the tensor.

The asserts in mmq.cu and mmvq.cu were too strict because ggml_is_contiguous disallows both views and permutations. I added a new function, ggml_is_contiguously_allocated, that only disallows views but allows permutations (to my knowledge we do not yet have a utility function for this). I changed the asserts to check that the memory is allocated contiguously and that the tensor in question is not a view of another tensor (to avoid potentially overwriting valid tensor data).
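For reference, a minimal sketch of what the new check and the relaxed asserts could look like (an illustrative reconstruction, not the literal diff; `src0` stands for the weight tensor checked in mmq.cu/mmvq.cu):

```cpp
// Sketch: a tensor counts as "contiguously allocated" if its data occupies a
// single contiguous block of memory, regardless of how the dimensions are
// permuted. (Illustrative; see ggml.c for the actual implementation.)
bool ggml_is_contiguously_allocated(const struct ggml_tensor * tensor) {
    return ggml_nbytes(tensor) ==
        ggml_nelements(tensor)*ggml_type_size(tensor->type)/ggml_blck_size(tensor->type);
}

// In mmq.cu / mmvq.cu the overly strict ggml_is_contiguous check then becomes
// roughly: permutations are allowed, views of other tensors are not.
GGML_ASSERT(ggml_is_contiguously_allocated(src0));
GGML_ASSERT(!src0->view_src);
```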

The assert in ggml_mul_mat_id was wrong because I forgot that for MUL_MAT_ID the number of tokens is stored in dimension 2 instead of dimension 1. But since the generic MoE code effectively creates views of src0, I think it is better to mark the slices as such and to remove that assert entirely.
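As an illustration of the view marking (the field names follow ggml_tensor, but the helper itself is hypothetical):

```cpp
// Hypothetical helper: mark a per-expert slice of src0 as a view so that the
// "no views" checks above skip it instead of tripping over aliased memory.
static void mark_as_view_of(struct ggml_tensor * slice, struct ggml_tensor * src0, size_t offs) {
    slice->view_src  = src0;                       // tensor that owns the memory
    slice->view_offs = offs;                       // byte offset of the slice within src0
    slice->data      = (char *) src0->data + offs; // slice aliases src0's buffer
}
```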

I also remembered that in the FlashAttention code the conversion of quantized KV data to FP16 implicitly assumes a contiguous block of memory, so I added asserts there using the new function.
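A rough sketch of those added checks (assuming the KV cache tensors are named K and V at that point; illustrative only):

```cpp
// Sketch: before converting the quantized KV cache to FP16 as one flat block,
// make sure each tensor's data really is a single contiguous allocation.
GGML_ASSERT(ggml_is_contiguously_allocated(K));
GGML_ASSERT(ggml_is_contiguously_allocated(V));
```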

github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on May 6, 2025
CISC linked an issue on May 6, 2025 that may be closed by this pull request
JohannesGaessler merged commit 2356fb1 into ggml-org:master on May 6, 2025
46 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 6, 2025
* origin/master: (27 commits)
llama : fix build_ffn without gate (ggml-org#13336)
CUDA: fix bad asserts for partial offload (ggml-org#13337)
convert : qwen2/3moe : set yarn metadata if present (ggml-org#13331)
CUDA: fix --split-mode row for MMQ (ggml-org#13323)
gguf-py : avoid requiring pyside6 for other scripts (ggml-org#13036)
CUDA: fix logic for clearing padding with -ngl 0 (ggml-org#13320)
sampling : Integrate Top-nσ into main sampling chain (and add it to the server) (ggml-org#13264)
server : Webui - change setText command from parent window to also send the message. (ggml-org#13309)
mtmd : rename llava directory to mtmd (ggml-org#13311)
clip : fix confused naming ffn_up and ffn_down (ggml-org#13290)
convert : bailingmoe : set yarn metadata if present (ggml-org#13312)
SYCL: Disable mul_mat kernels for noncontiguous tensor b (ggml-org#13308)
mtmd : add C public API (ggml-org#13184)
rpc : use backend registry, support dl backends (ggml-org#13304)
ggml : activate s390x simd for Q3_K (ggml-org#13301)
llava/mtmd : fixes to fully support dl backends (ggml-org#13303)
llama : build windows releases with dl backends (ggml-org#13220)
CUDA: fix race condition in MMQ stream-k fixup (ggml-org#13299)
CUDA: fix race condition in MMQ ids_dst (ggml-org#13294)
vulkan: Additional type support for unary, binary, and copy (ggml-org#13266)
...
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request May 9, 2025
Development

Successfully merging this pull request may close these issues.

Eval bug: llama-server crash after 2 messages following commit 9070365
2 participants