
@leejet (Contributor) commented Aug 2, 2025

device: RTX 4090

before:

  |==================================================| 20/20 - 10.13it/s
[INFO ] stable-diffusion.cpp:1806 - sampling completed, taking 2.36s
[DEBUG] stable-diffusion.cpp:1182 - computing vae [mode: DECODE] graph completed, taking 0.23s

after:

  |==================================================| 20/20 - 12.19it/s
[INFO ] stable-diffusion.cpp:1806 - sampling completed, taking 2.05s
[DEBUG] stable-diffusion.cpp:1182 - computing vae [mode: DECODE] graph completed, taking 0.14s
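For context, im2col is the operation this PR speeds up: it unrolls every convolution input patch into a column of a matrix, so the convolution itself reduces to a single matrix multiply. Below is a minimal NumPy sketch of 2D im2col (an illustration only, not the ggml CUDA kernel; the s0/s1, p0/p1, d0/d1 names mirror the stride/padding/dilation parameters shown in the benchmark output below):

```python
import numpy as np

# Reference 2D im2col (illustration only, not the ggml CUDA kernel).
# x: input of shape (C, H, W); kh, kw: kernel height/width;
# s*/p*/d* = stride/padding/dilation (0 = width axis, 1 = height axis).
def im2col(x, kh, kw, s0=1, s1=1, p0=1, p1=1, d0=1, d1=1):
    c, h, w = x.shape
    oh = (h + 2 * p1 - d1 * (kh - 1) - 1) // s1 + 1
    ow = (w + 2 * p0 - d0 * (kw - 1) - 1) // s0 + 1
    xp = np.pad(x, ((0, 0), (p1, p1), (p0, p0)))
    cols = np.empty((c * kh * kw, oh * ow), dtype=x.dtype)
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                # Strided window covering all output positions for this tap.
                patch = xp[ci,
                           i * d1 : i * d1 + s1 * oh : s1,
                           j * d0 : j * d0 + s0 * ow : s0]
                cols[ci * kh * kw + i * kw + j] = patch.reshape(-1)
    return cols

# Demo: a 3x3 convolution with all-ones weights becomes one matvec.
x = np.arange(16, dtype=np.float32).reshape(1, 4, 4)
cols = im2col(x, 3, 3)                   # shape (9, 16)
y = np.ones(9, dtype=np.float32) @ cols  # 3x3 box sum at each pixel
```

Note that im2col does no arithmetic; every output element is a copied input element, so the kernel is purely memory-bound. That is why the benchmarks below report throughput in GB/s rather than FLOPS.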

@github-actions bot added labels Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) Aug 2, 2025
@leejet changed the title from "cuda: make im2col a little faster" to "cuda: make im2col faster" Aug 2, 2025
@Green-Sky (Collaborator) commented:
perf:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes

Backend 1/2: CUDA0
  Device description: NVIDIA GeForce RTX 2070
  Device memory: 7778 MB (6966 MB free)

master:

  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               6552 runs -   194.12 us/run -    10244 kB/run -   50.33 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1640 runs -   726.40 us/run -    40964 kB/run -   53.79 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      104 runs - 14680.27 us/run -   655364 kB/run -   42.66 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      656 runs -  2026.15 us/run -   102445 kB/run -   48.23 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      164 runs -  7501.76 us/run -   409645 kB/run -   52.14 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               2852 runs -   446.17 us/run -    23536 kB/run -   50.31 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                670 runs -  2025.14 us/run -   100208 kB/run -   47.20 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       40 runs - 34712.45 us/run -  1678448 kB/run -   46.20 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      286 runs -  4234.20 us/run -   235365 kB/run -   53.03 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       68 runs - 19390.66 us/run -  1002085 kB/run -   49.34 GB/s

pr:

  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              13104 runs -    99.74 us/run -    10244 kB/run -   97.95 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               3280 runs -   397.37 us/run -    40964 kB/run -   98.33 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      156 runs -  6447.76 us/run -   655364 kB/run -   97.12 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     1312 runs -   996.14 us/run -   102445 kB/run -   98.11 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      246 runs -  4183.23 us/run -   409645 kB/run -   93.50 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               4278 runs -   237.06 us/run -    23536 kB/run -   94.69 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1005 runs -  1012.47 us/run -   100208 kB/run -   94.40 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       60 runs - 17695.45 us/run -  1678448 kB/run -   90.63 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      429 runs -  2375.16 us/run -   235365 kB/run -   94.53 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      102 runs - 11160.16 us/run -  1002085 kB/run -   85.73 GB/s
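A quick sanity check on the GB/s column (my own arithmetic, not part of the PR): it is simply bytes moved per run divided by run time, with "GB" meaning GiB (2^30 bytes). For the first [32,32,256,1] / [3,3,256,1] shape:

```python
# Reproduce the GB/s column of test-backend-ops from the kB/run and
# us/run columns above (GB is interpreted as GiB, i.e. 2**30 bytes).
def gib_per_s(kb_per_run, us_per_run):
    return kb_per_run * 1024 / (us_per_run * 1e-6) / 2**30

master_bw = gib_per_s(10244, 194.12)  # first master row above
pr_bw = gib_per_s(10244, 99.74)       # same shape on this PR
```

So on this RTX 2070 the PR roughly doubles the effective im2col bandwidth, from ~50 GB/s to ~98 GB/s.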

@Green-Sky (Collaborator) commented:

sd runs:

sd2 turbo q8_0 512x512

master

without flash attention:

  |==================================================| 8/8 - 6.48it/s
[INFO ] stable-diffusion.cpp:1822 - sampling completed, taking 1.34s

with flash attention:

  |==================================================| 8/8 - 9.38it/s
[INFO ] stable-diffusion.cpp:1822 - sampling completed, taking 0.96s

vae:

[INFO ] stable-diffusion.cpp:1843 - latent 1 decoded, taking 0.76s

pr

without flash attention:

  |==================================================| 8/8 - 7.18it/s
[INFO ] stable-diffusion.cpp:1822 - sampling completed, taking 1.30s

with flash attention:

  |==================================================| 8/8 - 10.85it/s
[INFO ] stable-diffusion.cpp:1822 - sampling completed, taking 0.85s

vae:

[INFO ] stable-diffusion.cpp:1843 - latent 1 decoded, taking 0.47s

sd1 f16 512x768

master

without flash attention:

  |==================================================| 30/30 - 1.40it/s
[INFO ] stable-diffusion.cpp:1822 - sampling completed, taking 20.89s

with flash attention:

  |==================================================| 30/30 - 1.52it/s
[INFO ] stable-diffusion.cpp:1822 - sampling completed, taking 19.65s

vae:

[INFO ] stable-diffusion.cpp:1843 - latent 1 decoded, taking 1.20s

pr

without flash attention:

  |==================================================| 30/30 - 1.57it/s
[INFO ] stable-diffusion.cpp:1822 - sampling completed, taking 19.16s

with flash attention:

  |==================================================| 30/30 - 1.65it/s
[INFO ] stable-diffusion.cpp:1822 - sampling completed, taking 18.15s

vae:

[INFO ] stable-diffusion.cpp:1843 - latent 1 decoded, taking 0.74s
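Tallying the runs above (my own arithmetic, using the timings quoted in this comment): the VAE decode benefits the most, which is consistent with it being the convolution-heavy stage, while sampling sees a smaller but still real gain.

```python
# Speedup factors from the timings quoted above (master_s / pr_s).
def speedup(master_s, pr_s):
    return master_s / pr_s

vae_sd2 = speedup(0.76, 0.47)          # sd2 turbo 512x512 VAE decode
vae_sd1 = speedup(1.20, 0.74)          # sd1 512x768 VAE decode
sampling_fa_sd2 = speedup(0.96, 0.85)  # sd2 sampling w/ flash attention
```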

@leejet working hard to widen the performance gap between vulkan and cuda again (:

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Aug 2, 2025
@ggerganov ggerganov merged commit 3303c19 into ggml-org:master Aug 2, 2025
46 of 47 checks passed
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Aug 5, 2025
Thireus added a commit to Thireus/ik_llama.cpp that referenced this pull request Aug 11, 2025