
Random spikes of up to 30ms in ggml_cuda_op device synchronization when using a low -ngl count with dual GPU #19

Closed
cmp-nct opened this issue Jun 22, 2023 · 4 comments


@cmp-nct
Owner

cmp-nct commented Jun 22, 2023

In ggml_cuda_op() I see spikes of up to 30 ms, easily reproducible when using a very low -ngl count like 1, 2, or 3 on a large model (40B, q6_k).
This causes a quite significant slowdown of the calculations; the spike is two orders of magnitude higher than what the operation usually takes.
In those cases the CPU operations are significantly faster than the GPU operations.

The device the tensor is on is a 4090; a second 3090 is also installed.
I used -ngl 1 to reproduce it with almost every token.
I tried -ts 1,0 without any change (all tensors are on device 0).

When everything works fine, the sync on result_wo takes 0.144 ms.

I debugged it down to the call to cudaDeviceSynchronize() at the end of the function.
I will continue debugging this one tomorrow.

Maybe @JohannesGaessler already has an idea of what is going on?
It would also be helpful if anyone could confirm this.

To reproduce, just run a model like 40B q6_k (or similar) with **-ngl 1** and **--debug-timings 3**.
In my case it shows mat_mul spikes of 7-30 ms in almost every token generation.
-ts 1,0 had no influence (note: the tensor split is currently not working because it stops at device #1 memory_free; I was just fixing that).
@JohannesGaessler

--debug-timings seems to be an option that you added yourself, and I'm not going to help unless you contribute that option upstream or spoonfeed me what exactly it does.

@cmp-nct
Owner Author

cmp-nct commented Jun 23, 2023

This repo is diverging a bit from upstream. I'd be happy if any minor commits were taken, but from my experience with previous PRs that's very unlikely.
"--debug-timings 3" is equivalent to setting -DGGML_PERF=1. GGML_PERF is enabled by default in this repo and is controlled with that new flag: 1 reports only the first token, 2 reports the first and last tokens, and 3 reports all tokens.
It's not necessary to use it; just taking the single timing of the synchronization is enough (see below).
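
For readers unfamiliar with the flag, here is a minimal sketch of the reporting rule just described; the function and variable names are my own illustration, not the actual ggllm.cpp code:

```cpp
// Hedged sketch of the --debug-timings levels described above (names are assumptions).
bool should_report_timings(int debug_timings, int token_index, int n_tokens) {
    bool is_first = (token_index == 0);
    bool is_last  = (token_index == n_tokens - 1);
    switch (debug_timings) {
        case 0:  return false;               // reporting disabled
        case 1:  return is_first;            // first token only
        case 2:  return is_first || is_last; // first and last tokens
        default: return true;                // 3 (or higher): every token
    }
}
```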

I see a similar problem with the current llama.cpp branch, though an order of magnitude less severe than here on dual GPU; with a single GPU the behavior is similar.

With ggllm.cpp I see up to 30 ms of delay during device synchronization (with 1 and with 2 GPUs).
With llama.cpp I see up to 4 ms of sync delay with 2 GPUs and up to 27 ms with 1 GPU.
In both cases the entire combined operation typically takes 200-300 microseconds.

llama.cpp with dual GPU:
llama.cpp with -ngl 25:

```
cudaDeviceSynchronize 0 took 110 us
cudaDeviceSynchronize 1 took 1 us
cudaDeviceSynchronize 1 took 231 us
cudaDeviceSynchronize 1 took 229 us
cudaDeviceSynchronize 1 took 212 us
cudaDeviceSynchronize 0 took 54 us
cudaDeviceSynchronize 1 took 129 us
```

llama.cpp with -ngl 5:

```
cudaDeviceSynchronize 1 took 1014 us
cudaDeviceSynchronize 0 took 1449 us
cudaDeviceSynchronize 1 took 1 us
cudaDeviceSynchronize 0 took 63 us
cudaDeviceSynchronize 1 took 3148 us
cudaDeviceSynchronize 1 took 3136 us
cudaDeviceSynchronize 1 took 3283 us
```

llama.cpp with 2nd GPU disabled (forcing g_device_count=1 in init_cublas())
llama.cpp with -ngl 25:

```
cudaDeviceSynchronize 0 took 167 us
cudaDeviceSynchronize 0 took 166 us
cudaDeviceSynchronize 0 took 166 us
cudaDeviceSynchronize 0 took 1081 us
cudaDeviceSynchronize 0 took 166 us
cudaDeviceSynchronize 0 took 166 us
cudaDeviceSynchronize 0 took 167 us
cudaDeviceSynchronize 0 took 1070 us
```

llama.cpp with -ngl 1:

```
cudaDeviceSynchronize 0 took 2660 us
cudaDeviceSynchronize 0 took 3005 us
cudaDeviceSynchronize 0 took 2471 us
cudaDeviceSynchronize 0 took 2503 us
cudaDeviceSynchronize 0 took 29 us
cudaDeviceSynchronize 0 took 24571 us
```

I used this to capture that specific time:

```cpp
// wait until each device is finished, then free their buffers
.....
        // ggml_time_us() returns int64_t, so keep the timestamp in that type
        int64_t start = ggml_time_us();
        CUDA_CHECK(cudaDeviceSynchronize());
        printf("cudaDeviceSynchronize %d took %lld us\n", id, (long long) (ggml_time_us() - start));
```

I'm not sure how important this is; in most situations people will offload a lot of layers, and that performance hit appears to vanish at larger -ngl (or it gets distributed among the layers).
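
For anyone who wants to see the same pattern outside of ggml, here is a minimal self-contained sketch (the kernel and sizes are made up for illustration, nothing from the repo) that queues an asynchronous kernel and then times cudaDeviceSynchronize() with a host clock, just like the snippet above:

```cpp
// minimal_sync_timing.cu -- hypothetical standalone repro, not part of ggllm.cpp
// Build: nvcc -O2 minimal_sync_timing.cu -o minimal_sync_timing
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void busy_kernel(float * x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        for (int k = 0; k < 1000; ++k) v = v * 1.0001f + 0.0001f; // artificial work
        x[i] = v;
    }
}

int main() {
    const int n = 1 << 22;
    float * d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    for (int iter = 0; iter < 8; ++iter) {
        // the launch is asynchronous: it only queues work and returns immediately
        busy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);

        // so a host-side timer around cudaDeviceSynchronize() measures how long the
        // CPU waits for the queued GPU work to drain, not the cost of the launch
        auto t0 = std::chrono::steady_clock::now();
        cudaDeviceSynchronize();
        auto t1 = std::chrono::steady_clock::now();
        long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        printf("cudaDeviceSynchronize took %lld us\n", us);
    }

    cudaFree(d_x);
    return 0;
}
```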

@JohannesGaessler

First of all, the way you're measuring CUDA performance is incorrect. CUDA is by design asynchronous: the ideal way to use it is to queue up as many kernels as possible and then call cudaDeviceSynchronize once and wait until the computations are done. In ggml-org#1898 I removed a lot of unnecessary calls to cudaDeviceSynchronize, which improved performance but also means that some ggml tensors will require next to no CPU time because they are just queuing things, while others will require a lot of CPU time because they have to wait for the computations to finish. So CPU time "spiking" is not indicative of anything going wrong. The correct way to measure CUDA performance is to use e.g. NVIDIA Nsight Systems and profile the application, or to just look at the t/s print and check whether performance goes up or down.
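
As an aside, CUDA events are another standard way to time GPU work without a full profiler run; they measure elapsed time on the GPU itself, so queued work is attributed to the section that actually executes it rather than to whichever tensor happens to hit the synchronization point. A minimal sketch (not from llama.cpp or ggllm.cpp; the kernel is a placeholder):

```cpp
// events_timing.cu -- hypothetical sketch of event-based timing
// Build: nvcc -O2 events_timing.cu -o events_timing
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float * x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float * d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    cudaEvent_t ev_start, ev_stop;
    cudaEventCreate(&ev_start);
    cudaEventCreate(&ev_stop);

    cudaEventRecord(ev_start, 0);                   // marker in the stream, not on the CPU
    dummy_kernel<<<(n + 255) / 256, 256>>>(d_x, n); // the work to be measured
    cudaEventRecord(ev_stop, 0);

    cudaEventSynchronize(ev_stop);                  // wait only for the recorded work
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, ev_start, ev_stop);
    printf("kernel took %.3f ms on the GPU\n", ms);

    cudaEventDestroy(ev_start);
    cudaEventDestroy(ev_stop);
    cudaFree(d_x);
    return 0;
}
```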

In any case, the current synchronization logic in llama.cpp is still very suboptimal, particularly for multi-GPU settings. The environment variable CUDA_VISIBLE_DEVICES can be set to restrict use to only one GPU, which should reduce synchronization for llama.cpp at least. Falcon currently lacks the consecutive tensors with GPU or GPU_SPLIT backends that llama.cpp has, so in a system with multiple GPUs the changes that I did in ggml-org#1898 are probably increasing the number of calls to cudaDeviceSynchronize, which should result in worse performance. I plan to implement better logic at some point, which should be universally faster.
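
A quick way to check what the CUDA runtime actually sees under CUDA_VISIBLE_DEVICES is to enumerate devices: with CUDA_VISIBLE_DEVICES=0 only the first GPU is reported, so g_device_count would come out as 1 without patching init_cublas(). A minimal sketch (not from either repo):

```cpp
// device_count.cu -- hedged sketch showing the effect of CUDA_VISIBLE_DEVICES
// Build: nvcc -O2 device_count.cu -o device_count
// Run:   CUDA_VISIBLE_DEVICES=0 ./device_count
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count); // only GPUs exposed via CUDA_VISIBLE_DEVICES are enumerated
    for (int id = 0; id < count; ++id) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, id);
        printf("device %d: %s\n", id, prop.name);
    }
    printf("%d visible device(s)\n", count);
    return 0;
}
```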

@cmp-nct
Owner Author

cmp-nct commented Jun 24, 2023

The measurement at that position is taken after the parallelized kernels were launched; it gives a couple of timings showing how long each (blocking) DeviceSync call took.
The measured timings fit together: no matter which approach I use, they all show the same spikes (just summed up):

  1. The total token time increases compared to CPU; I saw up to 100 ms of additional per-token delay with "-ngl 1" compared to pure CPU performance (500 -> 600 ms per token). The GPU performance gain only starts to show at 10+ offloaded layers, where those crazy 30 ms spikes also start to vanish or get distributed.
  2. The summed-up GGML_PERF timing shows the same spikes.
  3. The small microsecond timings (as seen above), summed up, show the same delay that is seen in GGML_PERF and in the additional token delay.

I am not sure if it is actually a bug; it might just be CUDA behavior. Maybe the GPU ramps up and down in performance, and something like that is playing a role.
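
If clock ramping is the suspect (just a guess at this point, not something confirmed in this thread), the SM clock can be watched with NVML while tokens are being generated. A minimal sketch, assuming the NVML headers are installed and GPU index 0 is the one of interest:

```cpp
// clock_watch.cpp -- hedged sketch using NVML to watch SM clocks
// Build: g++ -O2 clock_watch.cpp -o clock_watch -lnvidia-ml
#include <cstdio>
#include <chrono>
#include <thread>
#include <nvml.h>

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);   // GPU index 0; adjust as needed

    for (int i = 0; i < 30; ++i) {         // sample roughly once per second
        unsigned int sm_mhz = 0;
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_mhz);
        printf("SM clock: %u MHz\n", sm_mhz);
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }

    nvmlShutdown();
    return 0;
}
```

If the clock drops noticeably between kernel launches at low -ngl, that would point to ramp-up latency rather than a synchronization bug.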

I will look into your commit and see how it affects performance.

cmp-nct closed this as completed on Jul 4, 2023