vulkan: Optimize contiguous copies #10254
Add a flops calculation for flash attention. Add one GGML_OP_CPY perf test.
Add a variant of the copy shader for when the tensors are contiguous. Avoid the complex addressing calculations, and do four elements per invocation to hide some other overhead. Apply similar changes to the scale shader, since scale is always contiguous. Add a "progress bar" for shader compiles.
I did some more benchmarks on an RTX 3090, a Tesla P40, and a Radeon Pro VII. Looks like a good improvement all around. Edit: Similar improvements to SCALE.
The Vulkan changes look good to me, and I tested them successfully on Nvidia and AMD. From my side this can be merged.
@ggerganov Are the test-backend-ops changes fine?
Commits:
* tests: Fix memory bandwidth calculation for perf tests
* vulkan: Optimize contiguous copies
Split out from #10206, but the solution I went with is a bit different.
Add a variant of the copy shader for when the tensors are contiguous. Avoid the complex addressing calculations, and do four elements per invocation to hide some other overhead. In #10206, the matrix multiply is much faster if the B matrix is fp16, so there are a lot of these contiguous copies to do that conversion.
Apply similar changes to the scale shader, since scale is always contiguous.
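For illustration, here is a minimal sketch of what a contiguous f32-to-f16 copy shader along these lines can look like. This is a hypothetical simplification, not the actual shader from this PR; the bindings, workgroup size, and the `ne` push constant are assumptions.

```glsl
#version 450
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;

layout(binding = 0) readonly  buffer A { float     data_a[]; };
layout(binding = 1) writeonly buffer D { float16_t data_d[]; };

layout(push_constant) uniform Params {
    uint ne; // total number of elements
} p;

void main() {
    // Contiguous case: a single flat index replaces the per-dimension
    // stride arithmetic that the generic copy shader needs.
    const uint base = gl_GlobalInvocationID.x * 4;
    for (uint i = 0; i < 4; i++) {
        const uint idx = base + i;
        if (idx < p.ne) {
            data_d[idx] = float16_t(data_a[idx]);
        }
    }
}
```

Processing four elements per invocation amortizes the per-invocation overhead, and the scale shader can follow the same structure with a multiply in the inner loop.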
The first commit fixes a bug in `test-backend-ops perf` where it computed the memory footprint of one iteration but then divided by the total time for all iterations.

Before/after on RTX 4070: in the after numbers, the larger copies are more or less framebuffer bandwidth-limited, and the smaller copies are hitting in L2.
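For concreteness, a sketch of the shape of that fix (illustrative names, not the actual test-backend-ops code): the per-iteration footprint has to be scaled by the iteration count before dividing by the total elapsed time.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative sketch of the perf-reporting fix, with hypothetical names.
// The bug: bytes moved in ONE iteration were divided by the time for ALL
// iterations, understating bandwidth by a factor of n_iters.
void report_perf(uint64_t bytes_per_iter, uint64_t flops_per_iter,
                 int n_iters, double total_time_s) {
    // Wrong:  bytes_per_iter / total_time_s
    // Right:  (bytes_per_iter * n_iters) / total_time_s
    const double gb_per_s = (double) bytes_per_iter * n_iters / total_time_s / 1e9;
    const double gflops   = (double) flops_per_iter * n_iters / total_time_s / 1e9;
    printf("%10.2f GB/s  %10.2f GFLOPS\n", gb_per_s, gflops);
}
```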