
Random spikes of up to 30ms in ggml_cuda_op device synchronization when using a low -ngl count with dual GPU #19

Closed
cmp-nct opened this issue Jun 22, 2023 · 4 comments


@cmp-nct
Owner

cmp-nct commented Jun 22, 2023

In ggml_cuda_op() I see spikes of up to 30 ms, easily reproducible when using a very low -ngl count like 1, 2, or 3 on a large model (40B, q6_k).
This causes a quite significant slowdown of the calculations; the spike is two orders of magnitude higher than what the operation usually takes.
In those cases the CPU operations are significantly faster than the GPU operations.

The device the tensor is on is a 4090; a second 3090 is also installed.
I used -ngl 1 to reproduce it with almost every token.
I tried -ts 1,0 without any change (all tensors are on device 0).

When everything works fine, the sync on result_wo takes 0.144 ms.

I debugged it down to the call to cudaDeviceSynchronize() at the end of the function.
I will continue debugging this one tomorrow.

Maybe @JohannesGaessler already has an idea of what is going on?
It would also be helpful if anyone could confirm this.

To reproduce, just run a model like 40B q6_k (or similar) with **-ngl 1** and **--debug-timings 3**.
In my case it shows mat_mul spikes of 7-30 ms in almost every token generation.
-ts 1,0 had no influence (note: the tensor split is currently not working because it stops at device #1 memory_free; I was just fixing that).
@JohannesGaessler

--debug-timings seems to be an option that you added yourself, and I'm not going to help unless you contribute that option upstream or spoonfeed me what exactly it does.

@cmp-nct
Owner Author

cmp-nct commented Jun 23, 2023

This repo is diverging a bit from upstream. I'd be happy if any minor commits were taken, but from my experience with previous PRs that's very unlikely.
"--debug-timings 3" is equivalent to setting -DGGML_PERF=1. GGML_PERF is enabled by default in this repo and is controlled with that new flag: 1 reports only the first token, 2 reports the first and last tokens, and 3 reports all tokens.
It's not necessary to use it; just taking the single timing of the synchronization is enough (see below).
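
For readers unfamiliar with the flag, here is a minimal sketch of the reporting rule just described; the function and variable names are my own illustration, not the actual ggllm.cpp code:

```cpp
// Hedged sketch of the --debug-timings levels described above (names are assumptions).
bool should_report_timings(int debug_timings, int token_index, int n_tokens) {
    bool is_first = (token_index == 0);
    bool is_last  = (token_index == n_tokens - 1);
    switch (debug_timings) {
        case 0:  return false;               // reporting disabled
        case 1:  return is_first;            // first token only
        case 2:  return is_first || is_last; // first and last tokens
        default: return true;                // 3 (or higher): every token
    }
}
```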

I see a similar problem with the current llama.cpp branch, though an order of magnitude less severe than here on dual GPU; with a single GPU the behavior is similar.

With ggllm.cpp I see up to 30 ms of delay during device synchronization (with 1 and with 2 GPUs).
With llama.cpp I see up to 4 ms of sync delay with 2 GPUs and up to 27 ms with 1 GPU.
In both cases the entire combined operation typically takes 200-300 microseconds.

llama.cpp with dual GPU:
llama.cpp with -ngl 25:

```
cudaDeviceSynchronize 0 took 110 us
cudaDeviceSynchronize 1 took 1 us
cudaDeviceSynchronize 1 took 231 us
cudaDeviceSynchronize 1 took 229 us
cudaDeviceSynchronize 1 took 212 us
cudaDeviceSynchronize 0 took 54 us
cudaDeviceSynchronize 1 took 129 us
```

llama.cpp with -ngl 5:

```
cudaDeviceSynchronize 1 took 1014 us
cudaDeviceSynchronize 0 took 1449 us
cudaDeviceSynchronize 1 took 1 us
cudaDeviceSynchronize 0 took 63 us
cudaDeviceSynchronize 1 took 3148 us
cudaDeviceSynchronize 1 took 3136 us
cudaDeviceSynchronize 1 took 3283 us
```

llama.cpp with 2nd GPU disabled (forcing g_device_count=1 in init_cublas())
llama.cpp with -ngl 25:

```
cudaDeviceSynchronize 0 took 167 us
cudaDeviceSynchronize 0 took 166 us
cudaDeviceSynchronize 0 took 166 us
cudaDeviceSynchronize 0 took 1081 us
cudaDeviceSynchronize 0 took 166 us
cudaDeviceSynchronize 0 took 166 us
cudaDeviceSynchronize 0 took 167 us
cudaDeviceSynchronize 0 took 1070 us
```

llama.cpp with -ngl 1:

```
cudaDeviceSynchronize 0 took 2660 us
cudaDeviceSynchronize 0 took 3005 us
cudaDeviceSynchronize 0 took 2471 us
cudaDeviceSynchronize 0 took 2503 us
cudaDeviceSynchronize 0 took 29 us
cudaDeviceSynchronize 0 took 24571 us
```

I used this to capture that specific time:

```cpp
// wait until each device is finished, then free their buffers
.....
        // ggml_time_us() returns int64_t, so keep the timestamp in that type
        int64_t start = ggml_time_us();
        CUDA_CHECK(cudaDeviceSynchronize());
        printf("cudaDeviceSynchronize %d took %lld us\n", id, (long long) (ggml_time_us() - start));
```

I'm not sure how important this is; in most situations people will offload a lot of layers, and that performance hit appears to vanish at larger -ngl (or it gets distributed among the layers).
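
For anyone who wants to see the same pattern outside of ggml, here is a minimal self-contained sketch (the kernel and sizes are made up for illustration, nothing from the repo) that queues an asynchronous kernel and then times cudaDeviceSynchronize() with a host clock, just like the snippet above:

```cpp
// minimal_sync_timing.cu -- hypothetical standalone repro, not part of ggllm.cpp
// Build: nvcc -O2 minimal_sync_timing.cu -o minimal_sync_timing
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void busy_kernel(float * x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        for (int k = 0; k < 1000; ++k) v = v * 1.0001f + 0.0001f; // artificial work
        x[i] = v;
    }
}

int main() {
    const int n = 1 << 22;
    float * d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    for (int iter = 0; iter < 8; ++iter) {
        // the launch is asynchronous: it only queues work and returns immediately
        busy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);

        // so a host-side timer around cudaDeviceSynchronize() measures how long the
        // CPU waits for the queued GPU work to drain, not the cost of the launch
        auto t0 = std::chrono::steady_clock::now();
        cudaDeviceSynchronize();
        auto t1 = std::chrono::steady_clock::now();
        long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        printf("cudaDeviceSynchronize took %lld us\n", us);
    }

    cudaFree(d_x);
    return 0;
}
```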

@JohannesGaessler

First of all, the way you're measuring CUDA performance is incorrect. CUDA is by design asynchronous: the ideal way to use it is to queue up as many kernels as possible and then call cudaDeviceSynchronize once and wait until the computations are done. In ggml-org#1898 I removed a lot of unnecessary calls to cudaDeviceSynchronize, which improved performance but also means that some ggml tensors will require next to no CPU time because they are just queuing things, while others will require a lot of CPU time because they have to wait for the computations to finish. So CPU time "spiking" is not indicative of anything going wrong. The correct way to measure CUDA performance is to use e.g. NVIDIA Nsight Systems and profile the application, or to just look at the t/s print and check whether performance goes up or down.
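
As an aside, CUDA events are another standard way to time GPU work without a full profiler run; they measure elapsed time on the GPU itself, so queued work is attributed to the section that actually executes it rather than to whichever tensor happens to hit the synchronization point. A minimal sketch (not from llama.cpp or ggllm.cpp; the kernel is a placeholder):

```cpp
// events_timing.cu -- hypothetical sketch of event-based timing
// Build: nvcc -O2 events_timing.cu -o events_timing
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float * x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float * d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    cudaEvent_t ev_start, ev_stop;
    cudaEventCreate(&ev_start);
    cudaEventCreate(&ev_stop);

    cudaEventRecord(ev_start, 0);                   // marker in the stream, not on the CPU
    dummy_kernel<<<(n + 255) / 256, 256>>>(d_x, n); // the work to be measured
    cudaEventRecord(ev_stop, 0);

    cudaEventSynchronize(ev_stop);                  // wait only for the recorded work
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, ev_start, ev_stop);
    printf("kernel took %.3f ms on the GPU\n", ms);

    cudaEventDestroy(ev_start);
    cudaEventDestroy(ev_stop);
    cudaFree(d_x);
    return 0;
}
```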

In any case, the current synchronization logic in llama.cpp is still very suboptimal, particularly for multi-GPU settings. The environment variable CUDA_VISIBLE_DEVICES can be set to restrict use to only one GPU, which should reduce synchronization for llama.cpp at least. Falcon currently lacks the consecutive tensors with GPU or GPU_SPLIT backends that llama.cpp has, so in a system with multiple GPUs the changes that I did in ggml-org#1898 are probably increasing the number of calls to cudaDeviceSynchronize, which should result in worse performance. I plan to implement better logic at some point, which should be universally faster.
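
A quick way to check what the CUDA runtime actually sees under CUDA_VISIBLE_DEVICES is to enumerate devices: with CUDA_VISIBLE_DEVICES=0 only the first GPU is reported, so g_device_count would come out as 1 without patching init_cublas(). A minimal sketch (not from either repo):

```cpp
// device_count.cu -- hedged sketch showing the effect of CUDA_VISIBLE_DEVICES
// Build: nvcc -O2 device_count.cu -o device_count
// Run:   CUDA_VISIBLE_DEVICES=0 ./device_count
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count); // only GPUs exposed via CUDA_VISIBLE_DEVICES are enumerated
    for (int id = 0; id < count; ++id) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, id);
        printf("device %d: %s\n", id, prop.name);
    }
    printf("%d visible device(s)\n", count);
    return 0;
}
```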

@cmp-nct
Owner Author

cmp-nct commented Jun 24, 2023

The measurement at that position is taken after the parallelized kernels were launched; it gives a couple of timings showing how long each (blocking) DeviceSync call took.
The measured timings fit together: no matter which approach I use, they all show the same spikes (just summed up):

  1. The total token time increases compared to CPU; I saw up to 100 ms of additional per-token delay with "-ngl 1" compared to pure CPU performance (500 -> 600 ms per token). The GPU performance gain only starts to show at 10+ offloaded layers, where those crazy 30 ms spikes also start to vanish or get distributed.
  2. The summed-up GGML_PERF timing shows the same spikes.
  3. The small microsecond timings (as seen above), summed up, show the same delay that is seen in GGML_PERF and in the additional token delay.

I am not sure if it is actually a bug; it might just be CUDA behavior. Maybe the GPU ramps up and down in performance, and something like that is playing a role.
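
If clock ramping is the suspect (just a guess at this point, not something confirmed in this thread), the SM clock can be watched with NVML while tokens are being generated. A minimal sketch, assuming the NVML headers are installed and GPU index 0 is the one of interest:

```cpp
// clock_watch.cpp -- hedged sketch using NVML to watch SM clocks
// Build: g++ -O2 clock_watch.cpp -o clock_watch -lnvidia-ml
#include <cstdio>
#include <chrono>
#include <thread>
#include <nvml.h>

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);   // GPU index 0; adjust as needed

    for (int i = 0; i < 30; ++i) {         // sample roughly once per second
        unsigned int sm_mhz = 0;
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_mhz);
        printf("SM clock: %u MHz\n", sm_mhz);
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }

    nvmlShutdown();
    return 0;
}
```

If the clock drops noticeably between kernel launches at low -ngl, that would point to ramp-up latency rather than a synchronization bug.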

I will look into your commit and see how it affects performance.

cmp-nct closed this as completed on Jul 4, 2023