CUDA: more warps for mmvq on NVIDIA #5394
Conversation
@JohannesGaessler How could I test this on my RDNA3 GPU, such as the 7900XTX, to see if there is any improvement?
Setting …
RDNA3 behaves the same as RDNA2; there is a performance regression of about 10%.
With Radeon VII and Vega there is a notable increase in TG when compiling with master compared to this PR. Benchmarks: master with no mmv_y and with mmv_y=4 (build: 8504d2d (2097)) versus this PR with warp=1 and warp=4 (build: 7a0f63a (2094)).
I pushed a patch that enables the higher warp count for RDNA 1 and older. I don't actually have access to an RDNA 1 card to test this on though. But if this causes a performance regression I expect someone to quickly open an issue, at which point the threshold can just be adjusted.
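For illustration only, the kind of architecture threshold being described could look like the sketch below; the enum, the helper, and the exact grouping of architectures are hypothetical stand-ins, not the actual patch:

```cuda
// Hypothetical sketch: choose the number of warps per matrix row from the
// device architecture. Names and groupings are illustrative, not llama.cpp code.
enum device_arch {
    ARCH_NVIDIA,          // NVIDIA GPUs: 4 warps per row (this PR)
    ARCH_AMD_PRE_RDNA2,   // GCN/Vega/RDNA 1: higher warp count enabled by the follow-up patch
    ARCH_AMD_RDNA2_PLUS,  // RDNA 2/3: reported ~10% regression with more warps, keep 1
};

static int mmvq_nwarps(const device_arch arch) {
    switch (arch) {
        case ARCH_NVIDIA:         return 4;
        case ARCH_AMD_PRE_RDNA2:  return 4;
        case ARCH_AMD_RDNA2_PLUS: return 1;
    }
    return 1;
}
```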
Thank you, performance is good again.
build: f195490 (2095)
Force-pushed from a762d4f to 00ebc31 (compare).
Do you think it could be worth extending this for batch sizes 5-8 now? I noticed that it becomes slower after 4.
build: 00ebc31 (2094)
I didn't test it, but before I do that I would like to experiment a bit more with a better way to do the calculation.
After this (I think) I am up to 17.4 t/s on 70b. It feels like time to first token got longer, and I went from 299 t/s to 229 t/s PP. Still not back to peak performance, but it's getting better. I can confirm that setting MMV_Y no longer seems to have any impact. This is dual 3090s with NVLink. I have taken the P40s down for now.
This PR had no effect whatsoever on batch sizes > 4, so I don't think that is related.
I only use a batch size of 1.
The time to first token is about 90% determined by matrix multiplications with batch sizes >> 1, either cuBLAS GEMM or mul_mat_q. Neither of those operations was touched by this PR. It doesn't matter what batch size you use for generation.
You are right. I did more A/B testing and it has to be related to something else. My first generation seems to take a while. Layer splitting is now faster. Tensor cores still lag, not necessarily because of this PR but with the current codebase. I need to test with 3090/3090/2080 (22G) because the regression there was previously much larger than with just 2 GPUs. This is definitely an improvement versus last week though.
The biggest issue is that there is no support for the optimizations I added in #3110.
Writing a better matrix vector multiplication kernel has turned out to be more difficult than I anticipated, so I'm doing another optimization for the current implementation instead. On master only a single warp is currently used for each matrix row. But on NVIDIA GPUs using more warps seems to be beneficial in scenarios where the compute per memory bandwidth is relatively high (batch sizes > 1, k-quants, GPUs with low compute). So this PR sets the number of warps per row to 4 for NVIDIA GPUs. On AMD (RX 6800) this seemed to be detrimental, so there is no change there. The option LLAMA_CUDA_MMV_Y no longer affects performance, but in my testing changing this value was only beneficial for my RTX 3090, and the implementation in this PR performs better. I measured the performance change on my systems; if you account for tail effects and the initial latency when launching a kernel, the current implementation achieves ~90% of the maximum theoretical matrix vector multiplication performance for q8_0 on my RTX 3090.
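For illustration, here is a minimal CUDA sketch of the general technique: several warps cooperating on one matrix row, with a shuffle reduction inside each warp and a shared-memory combine across warps. It uses plain floats rather than the quantized blocks the real mmvq kernels operate on, and all names (`mul_mat_vec_warps`, `launch_mul_mat_vec`) are illustrative, not the actual llama.cpp code:

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32

// Sketch only: each thread block computes one row of y = A*x. The block holds
// `nwarps` warps; all threads stride over the columns of the row, each warp
// reduces its partial sum with shuffles, and the per-warp results are combined
// through shared memory.
template <int nwarps>
__global__ void mul_mat_vec_warps(const float * __restrict__ A,
                                  const float * __restrict__ x,
                                  float * __restrict__ y,
                                  const int ncols) {
    const int row     = blockIdx.x;
    const int tid     = threadIdx.x;       // 0 .. nwarps*WARP_SIZE - 1
    const int lane    = tid % WARP_SIZE;
    const int warp_id = tid / WARP_SIZE;

    float sum = 0.0f;
    for (int col = tid; col < ncols; col += nwarps*WARP_SIZE) {
        sum += A[(size_t) row*ncols + col] * x[col];
    }

    // Intra-warp reduction.
    for (int offset = WARP_SIZE/2; offset > 0; offset >>= 1) {
        sum += __shfl_down_sync(0xffffffffu, sum, offset);
    }

    // Combine the partial sums of the nwarps warps.
    __shared__ float warp_sums[nwarps];
    if (lane == 0) {
        warp_sums[warp_id] = sum;
    }
    __syncthreads();

    if (tid == 0) {
        float total = 0.0f;
        for (int w = 0; w < nwarps; ++w) {
            total += warp_sums[w];
        }
        y[row] = total;
    }
}

// Example launch: 4 warps per row (as on NVIDIA in this PR) vs. a single warp.
static void launch_mul_mat_vec(const float * A, const float * x, float * y,
                               const int nrows, const int ncols,
                               const bool more_warps) {
    if (more_warps) {
        mul_mat_vec_warps<4><<<nrows, 4*WARP_SIZE>>>(A, x, y, ncols);
    } else {
        mul_mat_vec_warps<1><<<nrows, 1*WARP_SIZE>>>(A, x, y, ncols);
    }
}
```

The trade-off this sketch is meant to show: more warps per row buy extra parallelism and latency hiding per row, which helps when there is more compute per byte loaded (k-quants, batch sizes > 1), at the cost of an extra reduction step and fewer independent rows in flight per SM, which is presumably why the optimal warp count differs between architectures.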