
Conversation

@Muhammad-Kamran-Khan

Description

This PR addresses the crash reported in #18140, where loading a context of ~126k tokens with a large batch size causes an INT_MAX overflow in the CUDA cpy kernels.

Motivation and Context

The crash is triggered by ggml_nbytes(src0) exceeding the signed 32-bit integer limit (~2.14 GB). The existing kernel logic in ggml-cuda/cpy.cu uses int for element counts (ne) and byte offsets. When processing large contexts (e.g., Qwen3-Next-80B with a 128k context), this results in integer overflow and assertion failures.
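
For scale, a quick back-of-the-envelope check (the hidden size below is a hypothetical illustration, not a figure from the actual crash):

```cpp
// Why ggml_nbytes(src0) can exceed INT_MAX: INT_MAX is 2,147,483,647
// bytes (~2.14 GB), i.e. only ~536M f32 elements. A long-context copy
// on a large model crosses that easily.
#include <climits>
#include <cstdint>
#include <cstdio>

int main() {
    const int64_t n_tokens = 126 * 1024; // ~126k-token context, as in the report
    const int64_t n_embd   = 8192;       // hypothetical hidden size, for illustration
    const int64_t nbytes   = n_tokens * n_embd * (int64_t) sizeof(float);
    // prints: 4227858432 bytes -> overflows int: yes
    printf("%lld bytes -> overflows int: %s\n",
           (long long) nbytes, nbytes > INT_MAX ? "yes" : "no");
    return 0;
}
```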

Changes

  • Type Promotion: Promoted kernel parameters (ne, ne00, nb00, etc.) from int to int64_t in cpy_scalar, cpy_f32_q, and related templates.
  • Safe Arithmetic: Added explicit (int64_t) casts to thread-index calculations (e.g., blockDim.x * blockIdx.x) so that memory addressing is performed in 64-bit arithmetic (see the sketch below).
  • Assertion Removal: Removed the GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX) checks so that tensors larger than 2 GB can be processed.
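
A minimal sketch of the pattern these changes follow (illustrative only; the kernel names are made up for the sketch, and the real cpy_scalar template also carries strides and type conversion):

```cuda
#include <cstdint>

// Overflow-prone: blockDim.x * blockIdx.x is computed as a 32-bit
// (unsigned) product, and the int `ne` caps the copy at INT_MAX.
static __global__ void cpy_bytes_i32(const char * cx, char * cdst, const int ne) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= ne) {
        return;
    }
    cdst[i] = cx[i];
}

// Promoted: `ne` is int64_t, and the leading cast forces the whole
// index expression into 64-bit arithmetic before it can wrap.
static __global__ void cpy_bytes_i64(const char * cx, char * cdst, const int64_t ne) {
    const int64_t i = (int64_t) blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= ne) {
        return;
    }
    cdst[i] = cx[i];
}
```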

Testing Status

Important Note: I implemented this fix based on analysis of the stack trace and the clear integer-overflow root cause.
No Local Verification: I was unable to verify this fix locally due to hardware limitations (GitHub Codespaces resource constraints prevented a full compilation, and I do not have access to the high-VRAM hardware required to reproduce the 126k-token crash).

  • I am relying on the CI and maintainers with appropriate hardware to verify that the 64-bit promotion resolves the crash without regressions.

Fixes #18140

@github-actions bot added labels: Nvidia GPU (Issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning) on Dec 24, 2025
@Muhammad-Kamran-Khan (Author) commented Dec 24, 2025

Hi @JohannesGaessler, @lilblam
I have submitted a fix for #18140 addressing the INT_MAX overflow in the CUDA cpy kernels.
Since this is my first contribution, the CI workflows are awaiting approval. Could you please approve them so that the compilation can be verified?

Thanks!

Comment on lines 25 to 26
// determine indices i03/i13, i02/i12, i01/i11, i00/i10 as a function of index i of flattened tensor
// then combine those indices with the corresponding byte offsets to get the total offsets
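
(For context, the decomposition those comments describe looks roughly like the following; this is a simplified sketch using ggml's ne/nb naming, with a hypothetical helper name, not the actual kernel body.)

```cuda
#include <cstdint>

// Sketch: recover per-dimension indices from the flattened index i,
// then combine them with the byte strides to get the total offset.
// ne0x = element counts, nb0x = byte strides (ggml convention).
static __device__ int64_t offset_of(int64_t i,
                                    int64_t ne00, int64_t ne01, int64_t ne02,
                                    int64_t nb00, int64_t nb01, int64_t nb02,
                                    int64_t nb03) {
    const int64_t i03 = i / (ne00 * ne01 * ne02);
    const int64_t i02 = i / (ne00 * ne01) % ne02;
    const int64_t i01 = i / ne00 % ne01;
    const int64_t i00 = i % ne00;
    return i00 * nb00 + i01 * nb01 + i02 * nb02 + i03 * nb03;
}
```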
Collaborator commented:
You are expected to read the contribution guidelines, which clearly state:

Using AI to generate PRs is permitted. However, you must (1) explicitly disclose how AI was used and (2) conduct a thorough manual review before publishing the PR. Note that trivial tab autocompletions do not require disclosure.

This PR is not acceptable in terms of quality control.

@Muhammad-Kamran-Khan (Author) replied:

Apologies for the oversight. Per the guidelines, I disclose that AI was used to assist with these type changes. I have now manually reviewed the code, restored the original comments, and fixed the formatting. Ready for re-review.


Linked issue #18140: Eval bug: Qwen3-Next-80b crashes loading 126k tokens of context in CUDA (Vulkan is fine)