CUDA: fix non-cont. inputs for batched mat mul #13155

Merged

Conversation

JohannesGaessler
Collaborator

See #13137.

I misdiagnosed the problem in the previous PR. The issue is in fact not numerical precision but rather that the memory offsets during the FP32->FP16 conversion of src1 were wrong. The code implicitly assumed that src1 is contiguous in memory, i.e. that there are no gaps when iterating over all of its elements. Usually something like this causes completely garbled outputs, but in this case the effect was relatively small.
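To illustrate the failure mode, here is a minimal sketch (not the actual kernel from this PR): a conversion kernel that addresses the source with a single flat index implicitly assumes there are no gaps between rows or planes. With a non-contiguous src1, the flat index drifts into padding and the kernel converts the wrong elements.

```cuda
#include <cstdint>
#include <cuda_fp16.h>

// Sketch of the buggy pattern: the same flat index i is used for the source
// and the destination, which is only valid if src is fully contiguous.
__global__ void convert_f32_to_f16_flat(const float * src, half * dst, const int64_t n) {
    const int64_t i = (int64_t) blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= n) {
        return;
    }
    dst[i] = __float2half(src[i]); // reads the wrong element once src has gaps
}
```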

I fixed the issue by extending the float conversion code to support non-contiguous inputs. So far we do not need this for non-float data, so I did not touch that code. The conversion code is a mess and I think we should refactor it long-term. I don't think that would be very difficult; maybe we can mark it as a good first issue for CUDA in particular?
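A sketch of the stride-aware variant (an illustration following ggml's `ne`/`nb` convention for element counts and byte strides, not the PR's exact code): the flat output index is decomposed into per-dimension indices, and the source address is computed from the tensor's byte strides, so gaps in src1's layout are stepped over correctly.

```cuda
#include <cstdint>
#include <cuda_fp16.h>

// Sketch of a non-contiguous-aware conversion: decompose the flat index into
// (i0, i1, i2, i3) and address the source via its byte strides nb0..nb3.
// The destination is assumed to be a freshly allocated contiguous FP16 buffer.
__global__ void convert_f32_to_f16_noncont(
        const char * src, half * dst,
        const int64_t ne0, const int64_t ne1, const int64_t ne2, const int64_t ne3,
        const int64_t nb0, const int64_t nb1, const int64_t nb2, const int64_t nb3) {
    const int64_t i = (int64_t) blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= ne0*ne1*ne2*ne3) {
        return;
    }

    const int64_t i0 =  i                  % ne0;
    const int64_t i1 = (i /  ne0)          % ne1;
    const int64_t i2 = (i / (ne0*ne1))     % ne2;
    const int64_t i3 =  i / (ne0*ne1*ne2);

    // Byte-stride addressing works for any layout, contiguous or not.
    const float * src_i = (const float *) (src + i0*nb0 + i1*nb1 + i2*nb2 + i3*nb3);
    dst[i] = __float2half(*src_i);
}
```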

This PR reverts the previous changes to the precision logic; I thought the batched matrix multiplication only supports FP16 precision, but I guess I misremembered.

@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Apr 28, 2025
@ggerganov ggerganov (Member) left a comment

Haven't tested this; approving just in case you would like to merge early.

Can we add a test to test-backend-ops that reproduces the original problem?

@github-actions github-actions bot added the testing (Everything test related) label on Apr 29, 2025
@JohannesGaessler
Collaborator Author

Thanks for reminding me. test-backend-ops already had test cases for non-contiguous inputs, but only src0 was made non-contiguous. So all that needed to be done was to apply the same treatment to src1 as well (see the sketch below). I would prefer to merge this quickly; I will be available with low latency over the next few days if issues arise.
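For reference, a hedged sketch of the test pattern (the helper name and dimensions are illustrative, not the exact test-backend-ops code): allocate rows twice as long as needed and take a view of the first half of each row, so the tensor has the intended shape but gaps between rows. This was already done for src0; the fix applies the same treatment to src1.

```cuda
#include "ggml.h"

// Illustrative sketch: build a batched mul_mat whose inputs are both
// non-contiguous views (dimensions chosen arbitrarily).
static ggml_tensor * build_noncont_mul_mat(ggml_context * ctx) {
    const int64_t m = 16, n = 16, k = 32, bs0 = 2, bs1 = 2;

    // Allocate rows of length 2*k, then view only the first k elements of
    // each row: the views have the intended shape but gaps between rows.
    ggml_tensor * a = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, 2*k, m, bs0, bs1);
    ggml_tensor * b = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 2*k, n, bs0, bs1);
    a = ggml_view_4d(ctx, a, k, m, bs0, bs1, a->nb[1], a->nb[2], a->nb[3], 0); // src0: already covered
    b = ggml_view_4d(ctx, b, k, n, bs0, bs1, b->nb[1], b->nb[2], b->nb[3], 0); // src1: the new coverage

    return ggml_mul_mat(ctx, a, b); // batched because bs0*bs1 > 1
}
```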

@JohannesGaessler
Collaborator Author

@0cc4m it seems that the Vulkan backend currently has the same issue that the CUDA backend had.

@JohannesGaessler
Collaborator Author

I'll merge this PR without the added test, then make a new PR for the test. That way the Vulkan CI won't fail for unrelated PRs until there is a fix.

@JohannesGaessler JohannesGaessler force-pushed the cuda-cublas-noncont-src1-2 branch from 92684f2 to b380e66 on April 29, 2025 at 12:51
@JohannesGaessler JohannesGaessler merged commit cdf7658 into ggml-org:master Apr 29, 2025
86 of 95 checks passed
@jeffbolznv
Collaborator

I'll take a look at the Vulkan failures.

@JohannesGaessler
Collaborator Author

The changed tests are in #13187; it may make sense for you to base your fix on that branch.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request on Apr 30, 2025
Labels
ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (Issues specific to Nvidia GPUs), testing (Everything test related)
3 participants