CUDA: fix bad asserts for partial offload #13337


Merged

Conversation

JohannesGaessler
Collaborator

Fixup to #13320.

The requirements for clearing the padding are that it must not be done for views, since that would risk clearing valid tensor data, and that the tensor's memory must be in one contiguous block. The latter avoids having to handle the edge case of potentially needing to clear memory inside the tensor.

The asserts in mmq.cu and mmvq.cu were too strict because ggml_is_contiguous disallows both views and permutations. I added a new function, ggml_is_contiguously_allocated, that only disallows views but allows permutations (to my knowledge we do not yet have a utility function for this). I changed the asserts to check that the memory is allocated contiguously and that the tensor in question is not a view of another tensor (to avoid potentially overwriting valid tensor data).
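For reference, a minimal sketch of what the new check and the relaxed asserts could look like (an illustrative reconstruction, not the literal diff; `src0` stands for the weight tensor checked in mmq.cu/mmvq.cu):

```cpp
// Sketch: a tensor counts as "contiguously allocated" if its data occupies a
// single contiguous block of memory, regardless of how the dimensions are
// permuted. (Illustrative; see ggml.c for the actual implementation.)
bool ggml_is_contiguously_allocated(const struct ggml_tensor * tensor) {
    return ggml_nbytes(tensor) ==
        ggml_nelements(tensor)*ggml_type_size(tensor->type)/ggml_blck_size(tensor->type);
}

// In mmq.cu / mmvq.cu the overly strict ggml_is_contiguous check then becomes
// roughly: permutations are allowed, views of other tensors are not.
GGML_ASSERT(ggml_is_contiguously_allocated(src0));
GGML_ASSERT(!src0->view_src);
```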

The assert in ggml_mul_mat_id was wrong because I forgot that for MUL_MAT_ID the number of tokens is stored in dimension 2 instead of dimension 1. But since the generic MoE code effectively creates views of src0, I think it is better to mark the slices as such and to remove that assert entirely.
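As an illustration of the view marking (the field names follow ggml_tensor, but the helper itself is hypothetical):

```cpp
// Hypothetical helper: mark a per-expert slice of src0 as a view so that the
// "no views" checks above skip it instead of tripping over aliased memory.
static void mark_as_view_of(struct ggml_tensor * slice, struct ggml_tensor * src0, size_t offs) {
    slice->view_src  = src0;                       // tensor that owns the memory
    slice->view_offs = offs;                       // byte offset of the slice within src0
    slice->data      = (char *) src0->data + offs; // slice aliases src0's buffer
}
```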

I also remembered that in the FlashAttention code the conversion of quantized KV data to FP16 implicitly assumes a contiguous block of memory, so I added asserts there using the new function.
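A rough sketch of those added checks (assuming the KV cache tensors are named K and V at that point; illustrative only):

```cpp
// Sketch: before converting the quantized KV cache to FP16 as one flat block,
// make sure each tensor's data really is a single contiguous allocation.
GGML_ASSERT(ggml_is_contiguously_allocated(K));
GGML_ASSERT(ggml_is_contiguously_allocated(V));
```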

github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on May 6, 2025
CISC linked an issue on May 6, 2025 that may be closed by this pull request
JohannesGaessler merged commit 2356fb1 into ggml-org:master on May 6, 2025
46 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 6, 2025
* origin/master: (27 commits)
llama : fix build_ffn without gate (ggml-org#13336)
CUDA: fix bad asserts for partial offload (ggml-org#13337)
convert : qwen2/3moe : set yarn metadata if present (ggml-org#13331)
CUDA: fix --split-mode row for MMQ (ggml-org#13323)
gguf-py : avoid requiring pyside6 for other scripts (ggml-org#13036)
CUDA: fix logic for clearing padding with -ngl 0 (ggml-org#13320)
sampling : Integrate Top-nσ into main sampling chain (and add it to the server) (ggml-org#13264)
server : Webui - change setText command from parent window to also send the message. (ggml-org#13309)
mtmd : rename llava directory to mtmd (ggml-org#13311)
clip : fix confused naming ffn_up and ffn_down (ggml-org#13290)
convert : bailingmoe : set yarn metadata if present (ggml-org#13312)
SYCL: Disable mul_mat kernels for noncontiguous tensor b (ggml-org#13308)
mtmd : add C public API (ggml-org#13184)
rpc : use backend registry, support dl backends (ggml-org#13304)
ggml : activate s390x simd for Q3_K (ggml-org#13301)
llava/mtmd : fixes to fully support dl backends (ggml-org#13303)
llama : build windows releases with dl backends (ggml-org#13220)
CUDA: fix race condition in MMQ stream-k fixup (ggml-org#13299)
CUDA: fix race condition in MMQ ids_dst (ggml-org#13294)
vulkan: Additional type support for unary, binary, and copy (ggml-org#13266)
...
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request May 9, 2025
Development

Successfully merging this pull request may close these issues.

Eval bug: llama-server crash after 2 messages following commit 9070365
2 participants