
@leejet (Contributor) commented Aug 2, 2025

device: RTX 4090

before:

  |==================================================| 20/20 - 10.13it/s
[INFO ] stable-diffusion.cpp:1806 - sampling completed, taking 2.36s
[DEBUG] stable-diffusion.cpp:1182 - computing vae [mode: DECODE] graph completed, taking 0.23s

after:

  |==================================================| 20/20 - 12.19it/s
[INFO ] stable-diffusion.cpp:1806 - sampling completed, taking 2.05s
[DEBUG] stable-diffusion.cpp:1182 - computing vae [mode: DECODE] graph completed, taking 0.14s
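For context, im2col is the operation this PR speeds up: it unrolls every convolution input patch into a column of a matrix, so the convolution itself reduces to a single matrix multiply. Below is a minimal NumPy sketch of 2D im2col (an illustration only, not the ggml CUDA kernel; the s0/s1, p0/p1, d0/d1 names mirror the stride/padding/dilation parameters shown in the benchmark output below):

```python
import numpy as np

# Reference 2D im2col (illustration only, not the ggml CUDA kernel).
# x: input of shape (C, H, W); kh, kw: kernel height/width;
# s*/p*/d* = stride/padding/dilation (0 = width axis, 1 = height axis).
def im2col(x, kh, kw, s0=1, s1=1, p0=1, p1=1, d0=1, d1=1):
    c, h, w = x.shape
    oh = (h + 2 * p1 - d1 * (kh - 1) - 1) // s1 + 1
    ow = (w + 2 * p0 - d0 * (kw - 1) - 1) // s0 + 1
    xp = np.pad(x, ((0, 0), (p1, p1), (p0, p0)))
    cols = np.empty((c * kh * kw, oh * ow), dtype=x.dtype)
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                # Strided window covering all output positions for this tap.
                patch = xp[ci,
                           i * d1 : i * d1 + s1 * oh : s1,
                           j * d0 : j * d0 + s0 * ow : s0]
                cols[ci * kh * kw + i * kw + j] = patch.reshape(-1)
    return cols

# Demo: a 3x3 convolution with all-ones weights becomes one matvec.
x = np.arange(16, dtype=np.float32).reshape(1, 4, 4)
cols = im2col(x, 3, 3)                   # shape (9, 16)
y = np.ones(9, dtype=np.float32) @ cols  # 3x3 box sum at each pixel
```

Note that im2col does no arithmetic; every output element is a copied input element, so the kernel is purely memory-bound. That is why the benchmarks below report throughput in GB/s rather than FLOPS.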

@github-actions bot added labels Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) Aug 2, 2025
@leejet changed the title from "cuda: make im2col a little faster" to "cuda: make im2col faster" Aug 2, 2025
@Green-Sky (Collaborator) commented:
perf:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes

Backend 1/2: CUDA0
  Device description: NVIDIA GeForce RTX 2070
  Device memory: 7778 MB (6966 MB free)

master:

  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               6552 runs -   194.12 us/run -    10244 kB/run -   50.33 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1640 runs -   726.40 us/run -    40964 kB/run -   53.79 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      104 runs - 14680.27 us/run -   655364 kB/run -   42.66 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      656 runs -  2026.15 us/run -   102445 kB/run -   48.23 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      164 runs -  7501.76 us/run -   409645 kB/run -   52.14 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               2852 runs -   446.17 us/run -    23536 kB/run -   50.31 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                670 runs -  2025.14 us/run -   100208 kB/run -   47.20 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       40 runs - 34712.45 us/run -  1678448 kB/run -   46.20 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      286 runs -  4234.20 us/run -   235365 kB/run -   53.03 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       68 runs - 19390.66 us/run -  1002085 kB/run -   49.34 GB/s

pr:

  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              13104 runs -    99.74 us/run -    10244 kB/run -   97.95 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               3280 runs -   397.37 us/run -    40964 kB/run -   98.33 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      156 runs -  6447.76 us/run -   655364 kB/run -   97.12 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     1312 runs -   996.14 us/run -   102445 kB/run -   98.11 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      246 runs -  4183.23 us/run -   409645 kB/run -   93.50 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               4278 runs -   237.06 us/run -    23536 kB/run -   94.69 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1005 runs -  1012.47 us/run -   100208 kB/run -   94.40 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       60 runs - 17695.45 us/run -  1678448 kB/run -   90.63 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      429 runs -  2375.16 us/run -   235365 kB/run -   94.53 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      102 runs - 11160.16 us/run -  1002085 kB/run -   85.73 GB/s
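A quick sanity check on the GB/s column (my own arithmetic, not part of the PR): it is simply bytes moved per run divided by run time, with "GB" meaning GiB (2^30 bytes). For the first [32,32,256,1] / [3,3,256,1] shape:

```python
# Reproduce the GB/s column of test-backend-ops from the kB/run and
# us/run columns above (GB is interpreted as GiB, i.e. 2**30 bytes).
def gib_per_s(kb_per_run, us_per_run):
    return kb_per_run * 1024 / (us_per_run * 1e-6) / 2**30

master_bw = gib_per_s(10244, 194.12)  # first master row above
pr_bw = gib_per_s(10244, 99.74)       # same shape on this PR
```

So on this RTX 2070 the PR roughly doubles the effective im2col bandwidth, from ~50 GB/s to ~98 GB/s.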

@Green-Sky (Collaborator) commented:

sd runs:

sd2 turbo q8_0 512x512

master

without flash attention:

  |==================================================| 8/8 - 6.48it/s
[INFO ] stable-diffusion.cpp:1822 - sampling completed, taking 1.34s

with flash attention:

  |==================================================| 8/8 - 9.38it/s
[INFO ] stable-diffusion.cpp:1822 - sampling completed, taking 0.96s

vae:

[INFO ] stable-diffusion.cpp:1843 - latent 1 decoded, taking 0.76s

pr

without flash attention:

  |==================================================| 8/8 - 7.18it/s
[INFO ] stable-diffusion.cpp:1822 - sampling completed, taking 1.30s

with flash attention:

  |==================================================| 8/8 - 10.85it/s
[INFO ] stable-diffusion.cpp:1822 - sampling completed, taking 0.85s

vae:

[INFO ] stable-diffusion.cpp:1843 - latent 1 decoded, taking 0.47s

sd1 f16 512x768

master

without flash attention:

  |==================================================| 30/30 - 1.40it/s
[INFO ] stable-diffusion.cpp:1822 - sampling completed, taking 20.89s

with flash attention:

  |==================================================| 30/30 - 1.52it/s
[INFO ] stable-diffusion.cpp:1822 - sampling completed, taking 19.65s

vae:

[INFO ] stable-diffusion.cpp:1843 - latent 1 decoded, taking 1.20s

pr

without flash attention:

  |==================================================| 30/30 - 1.57it/s
[INFO ] stable-diffusion.cpp:1822 - sampling completed, taking 19.16s

with flash attention:

  |==================================================| 30/30 - 1.65it/s
[INFO ] stable-diffusion.cpp:1822 - sampling completed, taking 18.15s

vae:

[INFO ] stable-diffusion.cpp:1843 - latent 1 decoded, taking 0.74s
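Tallying the runs above (my own arithmetic, using the timings quoted in this comment): the VAE decode benefits the most, which is consistent with it being the convolution-heavy stage, while sampling sees a smaller but still real gain.

```python
# Speedup factors from the timings quoted above (master_s / pr_s).
def speedup(master_s, pr_s):
    return master_s / pr_s

vae_sd2 = speedup(0.76, 0.47)          # sd2 turbo 512x512 VAE decode
vae_sd1 = speedup(1.20, 0.74)          # sd1 512x768 VAE decode
sampling_fa_sd2 = speedup(0.96, 0.85)  # sd2 sampling w/ flash attention
```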

@leejet working hard to widen the performance gap between vulkan and cuda again (:

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Aug 2, 2025
@ggerganov ggerganov merged commit 3303c19 into ggml-org:master Aug 2, 2025
46 of 47 checks passed
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Aug 5, 2025
Thireus added a commit to Thireus/ik_llama.cpp that referenced this pull request Aug 11, 2025