ggml-cuda : fix INT_MAX overflow in cpy kernels (#18140) #18340
Description
This PR addresses the crash reported in #18140, where loading a context of ~126k tokens with a large batch size causes an
`INT_MAX` overflow in the CUDA `cpy` kernels.

Motivation and Context
The crash is triggered by `ggml_nbytes(src0)` exceeding the signed 32-bit integer limit (~2.14 GB). The existing kernel logic in `ggml-cuda/cpy.cu` uses `int` for element counts (`ne`) and byte offsets. When processing large contexts (e.g., Qwen3-Next-80B with a 128k context), this results in integer overflow and assertion failures.
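For illustration, here is a minimal standalone sketch (not ggml code; the tensor size is a made-up value) of how a byte count above ~2.14 GB wraps to a negative number when narrowed to a 32-bit `int`:

```cuda
// Standalone illustration (not ggml code): a byte count just above INT_MAX
// becomes negative when stored in a 32-bit int, which is what breaks the
// offset arithmetic and trips the assertions downstream.
#include <cstdint>
#include <cstdio>
#include <climits>

int main() {
    const int64_t nbytes   = (int64_t) INT_MAX + 1024;  // hypothetical tensor size > 2.14 GB
    const int     narrowed = (int) nbytes;              // wraps to a negative value
    printf("64-bit: %lld bytes, 32-bit: %d bytes\n", (long long) nbytes, narrowed);
    return 0;
}
```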
Changes
- Changed the tensor dimension and stride parameters (`ne`, `ne00`, `nb00`, etc.) from `int` to `int64_t` in `cpy_scalar`, `cpy_f32_q`, and related templates; a sketch of this style of change follows the list.
- Added `(int64_t)` casts to thread-index calculations (e.g., `blockDim.x * blockIdx.x`) so that 64-bit arithmetic is used for memory addressing.
- Removed the `GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX)` checks to allow processing tensors larger than 2 GB.
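As a rough illustration of the indexing pattern (a minimal sketch only; `cpy_scalar_sketch` is a hypothetical name, and the real kernels in `ggml-cuda/cpy.cu` take many more parameters), the element count and the per-thread index are kept in 64-bit:

```cuda
// Minimal sketch of the 64-bit indexing pattern; not the actual ggml-cuda/cpy.cu code.
#include <cstdint>

template <typename T>
static __global__ void cpy_scalar_sketch(const char * cx, char * cdst, const int64_t ne) {
    // Cast the first operand so the whole product is computed in 64-bit;
    // blockDim.x * blockIdx.x on its own is 32-bit and can wrap past INT_MAX.
    const int64_t i = (int64_t) blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= ne) {
        return;
    }
    // Byte offsets derived from i stay 64-bit, so addressing remains valid
    // for tensors whose total size exceeds INT_MAX bytes.
    *(T *) (cdst + i*sizeof(T)) = *(const T *) (cx + i*sizeof(T));
}
```

The same pattern applies to the quantized variants (`cpy_f32_q` and related templates), where the byte strides are likewise promoted to `int64_t`.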
Testing Status
Important Note: I have implemented this fix based on the stack-trace analysis and the clear integer-overflow root cause.
No Local Verification: I was unable to verify this fix locally due to hardware limitations (GitHub Codespaces resource constraints prevented a full compilation, and I do not have access to the high-VRAM hardware required to reproduce the 126k-token crash).
Fixes #18140