
vulkan: Optimize contiguous copies #10254


Merged: 2 commits into ggml-org:master on Nov 13, 2024

Conversation

jeffbolznv (Collaborator) commented on Nov 11, 2024:

Split out from #10206, but the solution I went with here is a bit different.

Add a variant of the copy shader for when the tensors are contiguous. Avoid the complex addressing calculations, and do four elements per invocation to hide some other overhead. In #10206, the matrix multiply is much faster if the B matrix is fp16, so there are a lot of these contiguous copies to do that conversion.

Apply similar changes to the scale shader, since scale is always contiguous.

The first commit fixes a bug in the test-backend-ops perf measurement: it computed the memory footprint of a single iteration but then divided by the total time for all iterations.
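With the fix, the reported bandwidth is simply bytes-per-run divided by time-per-run; the bug underreported bandwidth by a factor equal to the run count. As a sanity check against the first CPY row below:

  bandwidth = 4608 kB/run / 12.86 us/run
            = 4,718,592 B / 12.86e-6 s
            ≈ 341.7 GB/s

which matches the printed figure (using 1 GB = 2^30 B).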

Before/after on RTX 4070. In the after numbers, the larger copies are more or less framebuffer bandwidth-limited, and the smaller copies are hitting in L2.

Before:
  CPY(type_src=f32,type_dst=f16,ne=[256,3072,1,1],permute=[0,0,0,0]):                  80102 runs -    12.86 us/run -     4608 kB/run -  341.72 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute=[0,0,0,0]):                  43692 runs -    22.91 us/run -     9216 kB/run -  383.71 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[4096,3072,1,1],permute=[0,0,0,0]):                  5016 runs -   206.73 us/run -    73728 kB/run -  340.61 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[16384,16384,1,1],permute=[0,0,0,0]):                 198 runs -  5364.71 us/run -  1572864 kB/run -  288.08 GB/s
  SCALE(type=f32,ne=[256,3072,1,1],scale=2.000000):                    81930 runs -    12.85 us/run -     6144 kB/run -  456.08 GB/s
  SCALE(type=f32,ne=[512,3072,1,1],scale=2.000000):                    43696 runs -    23.07 us/run -    12288 kB/run -  508.00 GB/s
  SCALE(type=f32,ne=[4096,3072,1,1],scale=2.000000):                    4446 runs -   236.93 us/run -    98304 kB/run -  396.26 GB/s
  SCALE(type=f32,ne=[16384,16384,1,1],scale=2.000000):                   204 runs -  4983.04 us/run -  2097152 kB/run -  413.17 GB/s

After:
  CPY(type_src=f32,type_dst=f16,ne=[256,3072,1,1],permute=[0,0,0,0]):                 233024 runs -     4.41 us/run -     4608 kB/run -  997.23 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute=[0,0,0,0]):                 160204 runs -     6.34 us/run -     9216 kB/run - 1387.29 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[4096,3072,1,1],permute=[0,0,0,0]):                  6384 runs -   163.27 us/run -    73728 kB/run -  431.28 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[16384,16384,1,1],permute=[0,0,0,0]):                 264 runs -  3908.14 us/run -  1572864 kB/run -  395.45 GB/s
  SCALE(type=f32,ne=[256,3072,1,1],scale=2.000000):                   185708 runs -     5.49 us/run -     6144 kB/run - 1067.64 GB/s
  SCALE(type=f32,ne=[512,3072,1,1],scale=2.000000):                   114702 runs -     8.73 us/run -    12288 kB/run - 1342.44 GB/s
  SCALE(type=f32,ne=[4096,3072,1,1],scale=2.000000):                    4788 runs -   220.35 us/run -    98304 kB/run -  426.08 GB/s
  SCALE(type=f32,ne=[16384,16384,1,1],scale=2.000000):                   204 runs -  5030.21 us/run -  2097152 kB/run -  409.29 GB/s

Add a flops calculation for flash attention.

Add one GGML_OP_CPY perf test.
jeffbolznv requested a review from 0cc4m on November 11, 2024.
github-actions bot added the testing, Vulkan, and ggml labels on November 11, 2024.
Commit: vulkan: Optimize contiguous copies

Add a variant of the copy shader for when the tensors are contiguous. Avoid
the complex addressing calculations, and do four elements per invocation
to hide some other overhead.

Apply similar changes to the scale shader, since scale is always contiguous.

Add a "progress bar" for shader compiles.
0cc4m (Collaborator) commented on Nov 12, 2024:

I did some more benchmarks.

RTX 3090:

Before:
  CPY(type_src=f32,type_dst=f16,ne=[256,3072,1,1],permute=[0,0,0,0]):                  94666 runs -    10.66 us/run -     4608 kB/run -  412.15 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute=[0,0,0,0]):                  47333 runs -    21.58 us/run -     9216 kB/run -  407.36 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[4096,3072,1,1],permute=[0,0,0,0]):                  6840 runs -   154.11 us/run -    73728 kB/run -  456.91 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[16384,16384,1,1],permute=[0,0,0,0]):                 330 runs -  3222.34 us/run -  1572864 kB/run -  479.61 GB/s

After:
  CPY(type_src=f32,type_dst=f16,ne=[256,3072,1,1],permute=[0,0,0,0]):                 276716 runs -     3.70 us/run -     4608 kB/run - 1188.59 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute=[0,0,0,0]):                  80102 runs -    12.60 us/run -     9216 kB/run -  697.57 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[4096,3072,1,1],permute=[0,0,0,0]):                 11400 runs -    91.06 us/run -    73728 kB/run -  773.32 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[16384,16384,1,1],permute=[0,0,0,0]):                 550 runs -  1851.64 us/run -  1572864 kB/run -  834.64 GB/s

Tesla P40:

Before:
  CPY(type_src=f32,type_dst=f16,ne=[256,3072,1,1],permute=[0,0,0,0]):                  29128 runs -    40.55 us/run -     4608 kB/run -  108.38 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute=[0,0,0,0]):                  14564 runs -    75.19 us/run -     9216 kB/run -  116.91 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[4096,3072,1,1],permute=[0,0,0,0]):                  1824 runs -   577.92 us/run -    73728 kB/run -  121.84 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[16384,16384,1,1],permute=[0,0,0,0]):                  88 runs - 12307.07 us/run -  1572864 kB/run -  125.57 GB/s

After:
  CPY(type_src=f32,type_dst=f16,ne=[256,3072,1,1],permute=[0,0,0,0]):                  58256 runs -    18.64 us/run -     4608 kB/run -  235.76 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute=[0,0,0,0]):                  29128 runs -    35.05 us/run -     9216 kB/run -  250.77 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[4096,3072,1,1],permute=[0,0,0,0]):                  4104 runs -   255.77 us/run -    73728 kB/run -  275.31 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[16384,16384,1,1],permute=[0,0,0,0]):                 198 runs -  5384.78 us/run -  1572864 kB/run -  287.00 GB/s

Radeon Pro VII:

Before:
  CPY(type_src=f32,type_dst=f16,ne=[256,3072,1,1],permute=[0,0,0,0]):                  36410 runs -    29.78 us/run -     4608 kB/run -  147.60 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute=[0,0,0,0]):                  18205 runs -    67.49 us/run -     9216 kB/run -  130.26 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[4096,3072,1,1],permute=[0,0,0,0]):                  2736 runs -   385.51 us/run -    73728 kB/run -  182.65 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[16384,16384,1,1],permute=[0,0,0,0]):                 132 runs -  7695.48 us/run -  1572864 kB/run -  200.83 GB/s

After:
  CPY(type_src=f32,type_dst=f16,ne=[256,3072,1,1],permute=[0,0,0,0]):                 131076 runs -     7.68 us/run -     4608 kB/run -  571.95 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute=[0,0,0,0]):                  76461 runs -    13.51 us/run -     9216 kB/run -  650.84 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[4096,3072,1,1],permute=[0,0,0,0]):                 10944 runs -    91.95 us/run -    73728 kB/run -  765.79 GB/s
  CPY(type_src=f32,type_dst=f16,ne=[16384,16384,1,1],permute=[0,0,0,0]):                 550 runs -  1885.12 us/run -  1572864 kB/run -  819.82 GB/s

Looks like a good improvement all around.

Edit: Similar improvements to SCALE.

0cc4m (Collaborator) left a review comment:

The Vulkan changes look good to me, and I tested them successfully on Nvidia and AMD. From my side this can be merged.

@ggerganov Are the test-backend-ops changes fine?

0cc4m merged commit 80dd7ff into ggml-org:master on Nov 13, 2024
53 checks passed
arthw pushed commits referencing this pull request to arthw/llama.cpp on Nov 15 and Nov 18, 2024 (same commit messages as above).