
Conversation

@Muhammad-Kamran-Khan

Description

This PR addresses the crash reported in #18140, where loading a context of ~126k tokens with a large batch size causes an INT_MAX overflow in the CUDA cpy kernels.

Motivation and Context

The crash is triggered by ggml_nbytes(src0) exceeding the signed 32-bit integer limit (~2.14 GB). The existing kernel logic in ggml-cuda/cpy.cu uses int for element counts (ne) and byte offsets. When processing large contexts (e.g., Qwen3-Next-80B with a 128k context), this results in integer overflow and assertion failures.
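
For scale, a quick back-of-the-envelope check (the hidden size below is a hypothetical illustration, not a figure from the actual crash):

```cpp
// Why ggml_nbytes(src0) can exceed INT_MAX: INT_MAX is 2,147,483,647
// bytes (~2.14 GB), i.e. only ~536M f32 elements. A long-context copy
// on a large model crosses that easily.
#include <climits>
#include <cstdint>
#include <cstdio>

int main() {
    const int64_t n_tokens = 126 * 1024; // ~126k-token context, as in the report
    const int64_t n_embd   = 8192;       // hypothetical hidden size, for illustration
    const int64_t nbytes   = n_tokens * n_embd * (int64_t) sizeof(float);
    // prints: 4227858432 bytes -> overflows int: yes
    printf("%lld bytes -> overflows int: %s\n",
           (long long) nbytes, nbytes > INT_MAX ? "yes" : "no");
    return 0;
}
```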

Changes

  • Type Promotion: Promoted kernel parameters (ne, ne00, nb00, etc.) from int to int64_t in cpy_scalar, cpy_f32_q, and related templates.
  • Safe Arithmetic: Added explicit (int64_t) casts to thread-index calculations (e.g., blockDim.x * blockIdx.x) so that memory addressing is performed in 64-bit arithmetic (see the sketch below).
  • Assertion Removal: Removed the GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX) checks so that tensors larger than 2 GB can be processed.
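
A minimal sketch of the pattern these changes follow (illustrative only; the kernel names are made up for the sketch, and the real cpy_scalar template also carries strides and type conversion):

```cuda
#include <cstdint>

// Overflow-prone: blockDim.x * blockIdx.x is computed as a 32-bit
// (unsigned) product, and the int `ne` caps the copy at INT_MAX.
static __global__ void cpy_bytes_i32(const char * cx, char * cdst, const int ne) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= ne) {
        return;
    }
    cdst[i] = cx[i];
}

// Promoted: `ne` is int64_t, and the leading cast forces the whole
// index expression into 64-bit arithmetic before it can wrap.
static __global__ void cpy_bytes_i64(const char * cx, char * cdst, const int64_t ne) {
    const int64_t i = (int64_t) blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= ne) {
        return;
    }
    cdst[i] = cx[i];
}
```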

Testing Status

Important Note: I implemented this fix based on analysis of the stack trace and the clear integer-overflow root cause.
No Local Verification: I was unable to verify this fix locally due to hardware limitations (GitHub Codespaces resource constraints prevented a full compilation, and I do not have access to the high-VRAM hardware required to reproduce the 126k-token crash).

  • I am relying on the CI and maintainers with appropriate hardware to verify that the 64-bit promotion resolves the crash without regressions.

Fixes #18140

@github-actions bot added labels: Nvidia GPU (Issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning) on Dec 24, 2025
@Muhammad-Kamran-Khan (Author) commented Dec 24, 2025

Hi @JohannesGaessler, @lilblam
I have submitted a fix for #18140 addressing the INT_MAX overflow in the CUDA cpy kernels.
Since this is my first contribution, the CI workflows are awaiting approval. Could you please approve them so that the compilation can be verified?

Thanks!

Comment on lines 25 to 26
// determine indices i03/i13, i02/i12, i01/i11, i00/i10 as a function of index i of flattened tensor
// then combine those indices with the corresponding byte offsets to get the total offsets
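
(For context, the decomposition those comments describe looks roughly like the following; this is a simplified sketch using ggml's ne/nb naming, with a hypothetical helper name, not the actual kernel body.)

```cuda
#include <cstdint>

// Sketch: recover per-dimension indices from the flattened index i,
// then combine them with the byte strides to get the total offset.
// ne0x = element counts, nb0x = byte strides (ggml convention).
static __device__ int64_t offset_of(int64_t i,
                                    int64_t ne00, int64_t ne01, int64_t ne02,
                                    int64_t nb00, int64_t nb01, int64_t nb02,
                                    int64_t nb03) {
    const int64_t i03 = i / (ne00 * ne01 * ne02);
    const int64_t i02 = i / (ne00 * ne01) % ne02;
    const int64_t i01 = i / ne00 % ne01;
    const int64_t i00 = i % ne00;
    return i00 * nb00 + i01 * nb01 + i02 * nb02 + i03 * nb03;
}
```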
Collaborator commented:
You are expected to read the contribution guidelines, which clearly state:

Using AI to generate PRs is permitted. However, you must (1) explicitly disclose how AI was used and (2) conduct a thorough manual review before publishing the PR. Note that trivial tab autocompletions do not require disclosure.

This PR is not acceptable in terms of quality control.

@Muhammad-Kamran-Khan (Author) replied:

Apologies for the oversight. Per the guidelines, I disclose that AI was used to assist with these type changes. I have now manually reviewed the code, restored the original comments, and fixed the formatting. Ready for re-review.


Linked issue #18140: Eval bug: Qwen3-Next-80b crashes loading 126k tokens of context in CUDA (Vulkan is fine)