CUDA: fix non-cont. inputs for batched mat mul #13155
Merged
See #13137.
I misdiagnosed the problem in the previous PR. The issue is in fact not numerical precision but that the memory offsets used during the FP32->FP16 conversion of `src1` were wrong. The code implicitly assumed that the memory layout of `src1` is contiguous in the sense that there are no gaps when iterating over all elements. Usually something like this causes completely garbled outputs, but in this case the effect was relatively small.

I fixed the issue by extending the float conversion to support non-contiguous inputs. So far we do not need this for non-float data, so I did not touch that code. The conversion code is a mess and I think we should refactor it long-term. I don't think this would be very difficult; maybe we can mark it as a good first issue, for CUDA in particular?
This PR reverts the previous changes to the precision logic; I thought batched matrix multiplication only supported FP16 precision, but I guess I misremembered.