Random spikes of up to 30ms in ggml_cuda_op device synchronization when using a low -ngl count with dual GPU #19
This repo is diverging a bit from upstream. I'd be happy if any minor commits were taken, but from my experience with previous PRs that's very unlikely. I see a similar problem with the current llama.cpp branch, though an order of magnitude less severe than here when using dual GPU; single-GPU behavior is similar. With ggllm.cpp I have up to 30 milliseconds of delay during device synchronization (with 1 and with 2 GPUs).
llama.cpp with dual GPU:
llama.cpp with -ngl 5:
llama.cpp with 2nd GPU disabled (forcing g_device_count=1 in init_cublas())
llama.cpp with -ngl 1:
I used this to capture that specific time:
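(The snippet itself did not survive in this copy of the thread; the following is only a sketch of what such a measurement around the blocking sync could look like. The helper name and the 1 ms reporting threshold are made up here.)

```cpp
// Hypothetical reconstruction, not the poster's original snippet: time the
// blocking cudaDeviceSynchronize() at the end of ggml_cuda_op() with a
// wall-clock timer and report only the outliers.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

static void timed_device_sync(const char * op_name) {
    const auto t_start = std::chrono::high_resolution_clock::now();
    cudaDeviceSynchronize(); // blocks until all queued device work is done
    const auto t_end   = std::chrono::high_resolution_clock::now();

    const double ms = std::chrono::duration<double, std::milli>(t_end - t_start).count();
    if (ms > 1.0) { // only print the spikes
        fprintf(stderr, "device sync after %s took %.3f ms\n", op_name, ms);
    }
}
```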
I'm not sure how important this is; in most situations people will offload a lot of layers, and that performance hit appears to vanish at larger -ngl (or it gets distributed among the layers).
First of all, the way you're measuring CUDA performance is incorrect. CUDA is asynchronous by design: the ideal way to use it is to queue up as many kernels as possible and to only then call a blocking synchronization such as cudaDeviceSynchronize(). In any case, the current synchronization logic in llama.cpp is still very suboptimal, particularly for multi-GPU settings. The environment variable …
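To illustrate the point about asynchronous execution, here is a small standalone CUDA sketch (not code from llama.cpp or this repo): CUDA events measure the span of the queued kernels on the device, while a host-side timer around cudaDeviceSynchronize() measures however much queued work is still outstanding at that point, plus driver overhead.

```cpp
// Standalone illustration: queue several kernels asynchronously, then compare
// event-based GPU timing with the host-side wait inside the blocking sync.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy_kernel(float * data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 1000; ++k) { x = x * 1.000001f + 0.5f; }
        data[i] = x;
    }
}

int main() {
    const int n = 1 << 20;
    float * d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t ev_start, ev_stop;
    cudaEventCreate(&ev_start);
    cudaEventCreate(&ev_stop);

    // Queue the work asynchronously; nothing blocks the host here.
    cudaEventRecord(ev_start);
    for (int iter = 0; iter < 8; ++iter) {
        busy_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    }
    cudaEventRecord(ev_stop);

    // Synchronize once, at the point where the results are actually needed.
    const auto t0 = std::chrono::high_resolution_clock::now();
    cudaDeviceSynchronize();
    const auto t1 = std::chrono::high_resolution_clock::now();

    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, ev_start, ev_stop);
    const double host_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("GPU time (events): %.3f ms, host wait in sync: %.3f ms\n", gpu_ms, host_ms);

    cudaEventDestroy(ev_start);
    cudaEventDestroy(ev_stop);
    cudaFree(d_data);
    return 0;
}
```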
The measurement at that position is taken after the parallelized kernels were launched; it gives a couple of timings showing how long each (blocking) DeviceSync call took.
I am not sure if it is actually a bug; it might just be CUDA behavior. Maybe the GPU ramps its performance up and down, and something like that could be a factor. I will look into your commit and see how it affects performance.
In ggml_cuda_op() I have spikes of up to 30 ms, easily reproducible when using a very low -ngl count like 1, 2 or 3 on a large model like a 40B at q6_k.
This causes quite a significant slowdown of the calculations; it's two orders of magnitude higher than what the operation usually takes.
The CPU operations are significantly faster than the GPU operations in those cases.
The device the tensor is on is a 4090; a second 3090 is installed.
I used -ngl 1 to reproduce it with almost every token.
I tried -ts 1,0 without any change (all tensors are on device 0)
When everything works fine, the sync on result_wo takes 0.144 ms.
I debugged it down to the call to cudaDeviceSynchronize() at the end of the function.
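For orientation, a heavily simplified sketch of the shape of that code path (not the actual ggml_cuda_op() source): work is queued asynchronously per device, and the blocking synchronization at the end is where the spikes show up.

```cpp
// Heavily simplified sketch, not the real ggml_cuda_op(): kernels and copies
// are launched asynchronously on each device, then a blocking sync waits.
#include <cuda_runtime.h>

static void cuda_op_sketch(int device_count) {
    for (int id = 0; id < device_count; ++id) {
        cudaSetDevice(id);
        // ... launch the per-device kernels and async memcpys here ...
    }
    for (int id = 0; id < device_count; ++id) {
        cudaSetDevice(id);
        // The blocking call where the 30 ms spikes were observed.
        cudaDeviceSynchronize();
    }
}
```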
Will continue debugging this one tomorrow
Maybe @JohannesGaessler already has an idea what is going on?
It would also be helpful if anyone could confirm this.