Vulkan for Android #5739

Closed
riverzhou opened this issue Feb 26, 2024 · 10 comments

riverzhou commented Feb 26, 2024

System:
Android 14 termux

Version:
latest

Log start
main: build = 2274 (47bb7b48)
main: built with clang version 17.0.6 for aarch64-unknown-linux-android24
main: seed  = 1708966403
ggml_vk_instance_init()
ggml_vulkan: Found 1 Vulkan devices:
ggml_vk_print_gpu_info(0)
Vulkan0: Adreno (TM) 730 | uma: 1 | fp16: 1 | warp size: 64
ggml_backend_vk_init(0)
ggml_vk_init(, 0)
ggml_vk_find_queue_family_index()
ggml_vk_find_queue_family_index()
ggml_vk_load_shaders()
ggml_vk_create_pipeline(matmul_f32_l, main, 3, 56, (128,128,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f32_m, main, 3, 56, (64,64,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f32_s, main, 3, 56, (32,32,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f32_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(matmul_f32_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(matmul_f32_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(matmul_f16_l, main, 3, 56, (128,128,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f16_m, main, 3, 56, (64,64,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f16_s, main, 3, 56, (32,32,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f16_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(matmul_f16_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(matmul_f16_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(matmul_f16_f32_l, main, 3, 56, (128,128,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f16_f32_m, main, 3, 56, (64,64,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f16_f32_s, main, 3, 56, (32,32,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f16_f32_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(matmul_f16_f32_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(matmul_f16_f32_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(mul_mat_vec_f16_f32, main, 3, 12, (1,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(mul_mat_vec_q4_0_f32, main, 3, 12, (1,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(mul_mat_vec_q4_1_f32, main, 3, 12, (1,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(mul_mat_vec_q5_0_f32, main, 3, 12, (1,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(mul_mat_vec_q5_1_f32, main, 3, 12, (1,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(mul_mat_vec_q8_0_f32, main, 3, 12, (1,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(mul_mat_vec_q2_K_f32, main, 3, 12, (1,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(mul_mat_vec_q3_K_f32, main, 3, 12, (1,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(mul_mat_vec_q4_K_f32, main, 3, 12, (1,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(mul_mat_vec_q5_K_f32, main, 3, 12, (1,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(mul_mat_vec_q6_K_f32, main, 3, 12, (1,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(f32_to_f16, main, 2, 16, (64,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(dequant_f16, main, 2, 16, (8192,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(dequant_q4_0, main, 2, 16, (8192,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(dequant_q4_1, main, 2, 16, (8192,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(dequant_q5_0, main, 2, 16, (8192,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(dequant_q5_1, main, 2, 16, (8192,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(dequant_q8_0, main, 2, 16, (8192,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(dequant_q2_K, main, 2, 16, (16384,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(dequant_q3_K, main, 2, 16, (16384,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(dequant_q4_K, main, 2, 16, (8192,1,1), specialization_constants, 1)
libc++abi: terminating due to uncaught exception of type vk::UnknownError: vk::Device::createComputePipeline: ErrorUnknown
Aborted

0cc4m (Collaborator) commented Feb 26, 2024

Probably #5186 ?

akingoverlook commented Feb 26, 2024

You are on a path to waste a lot of time. I would know, because I did.
Read this first - #5186 (comment)

This one is easy to work around, but the next one will be tough. And then, even if it is all resolved, it will be slower than a good CPU anyway.

akingoverlook commented Feb 26, 2024

> Probably #5186 ?

Do you think there is room left to minimize the memory transfers in this backend? I did see some TODO in the code suggesting something of the sort. The cost is probably too high on the mobile chipsets, with their (best case) 4x16 bit memory bus.

0cc4m (Collaborator) commented Feb 26, 2024

> Do you think there is room left to minimize the memory transfers in this backend? I did see some TODO in the code suggesting something of the sort. The cost is probably too high on the mobile chipsets, with their (best case) 4x16 bit memory bus.

No, that's done already. I think the issue is that the shaders are optimized for Nvidia/AMD, but these mobile GPUs work differently. The most obvious difference is the warp size of 16. Optimizing for that might help.

akingoverlook commented Feb 26, 2024

> No, that's done already. I think the issue is that the shaders are optimized for Nvidia/AMD, but these mobile GPUs work differently. The most obvious difference is the warp size of 16. Optimizing for that might help.

Yes, I noticed that CLBlast has kernel tuners (per op, sweeping some parameters), and the results they produce differ quite a bit per GPU, so it is built by generating custom headers based on the tuners' output. Not that it helps very much, because that backend is really just an external BLAS; it does not offload the whole graph.

But perhaps something like that can be implemented for the Vulkan shaders, if they have equivalent parameters to sweep?
Or, as a shortcut, perhaps the Vulkan backend can eat the generated headers (or JSON files) produced by the CLBlast tuners?

Btw, the warp size depends on the platform:
Vulkan0: Adreno (TM) 735 | uma: 1 | fp16: 1 | warp size: 64
ggml_vulkan: Using Intel(R) UHD Graphics | uma: 1 | fp16: 1 | warp size: 32
Vulkan0: Mali-G720-Immortalis MC12 | uma: 1 | fp16: 1 | warp size: 16
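
For reference, the "warp size" printed in those lines is just the device's Vulkan subgroup size. A minimal standalone sketch (not llama.cpp code) of querying it through the Vulkan-Hpp headers the backend already uses:

```cpp
// Standalone sketch: print the subgroup ("warp") size of every Vulkan device.
#include <vulkan/vulkan.hpp>
#include <cstdio>

int main() {
    vk::ApplicationInfo app_info("subgroup-query", 1, nullptr, 0, VK_API_VERSION_1_1);
    vk::Instance instance = vk::createInstance(vk::InstanceCreateInfo({}, &app_info));

    for (const vk::PhysicalDevice & dev : instance.enumeratePhysicalDevices()) {
        vk::PhysicalDeviceProperties2 props2;
        vk::PhysicalDeviceSubgroupProperties subgroup_props;
        props2.pNext = &subgroup_props;          // chain the subgroup properties query
        dev.getProperties2(&props2);

        printf("%s | subgroup (warp) size: %u\n",
               props2.properties.deviceName.data(),
               subgroup_props.subgroupSize);
    }

    instance.destroy();
    return 0;
}
```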

0cc4m (Collaborator) commented Feb 27, 2024

> But perhaps something like that can be implemented for the Vulkan shaders, if they have equivalent parameters to sweep? Or, as a shortcut, perhaps the Vulkan backend can eat the generated headers (or JSON files) produced by the CLBlast tuners?

Yeah, in the long term I'd like to write an auto tuner, especially for the matrix matrix and matrix vector multiplication shaders. But at the moment there are more important topics. If someone else wants to give it a try, I'll help as much as I can. We can't just take CLBlast tunings, since they are for different kernels.

akingoverlook commented Feb 27, 2024

> Yeah, in the long term I'd like to write an auto tuner, especially for the matrix matrix and matrix vector multiplication shaders. But at the moment there are more important topics. If someone else wants to give it a try, I'll help as much as I can. We can't just take CLBlast tunings, since they are for different kernels.

If you give me some list of parameters and possible ranges to sweep, I will at least try some brute force experimentation to see if it helps anything.

Also, by now it is pretty clear to me that the only way this backend works in any coherent manner on Adreno is when the Vulkan buffer is smaller than the max allocation size (1GB). I suspect that short of fixing the driver (which I could ask QC to do, but would not hold my breath) the only real solution would be to "shard" the model for GPU offload. Have you given any thought to such ideas? I know that has been done before in a different context with very similar constraints.

0cc4m (Collaborator) commented Feb 27, 2024

> If you give me some list of parameters and possible ranges to sweep, I will at least try some brute force experimentation to see if it helps anything.
>
> Also, by now it is pretty clear to me that the only way this backend works in any coherent manner on Adreno is when the Vulkan buffer is smaller than the max allocation size (1GB). I suspect that short of fixing the driver (which I could ask QC to do, but would not hold my breath) the only real solution would be to "shard" the model for GPU offload. Have you given any thought to such ideas? I know that has been done before in a different context with very similar constraints.

The first parameters to tune would be the specialization constants of the matrix matrix multiplication shader, but it's not straightforward which combinations of them are valid. They have a bunch of constraints that I haven't documented yet.
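
For anyone who wants to attempt that sweep: the knobs are ordinary Vulkan specialization constants supplied at pipeline creation. Below is a rough sketch of what varying them looks like; the constant IDs, the names, and the create_tuned_pipeline() helper are made up for illustration and are not the layout ggml_vk_create_pipeline() actually uses:

```cpp
// Illustrative sketch only: feed tile-size style specialization constants to a compute pipeline.
#include <vulkan/vulkan.hpp>
#include <array>
#include <cstdint>

vk::Pipeline create_tuned_pipeline(vk::Device device, vk::ShaderModule shader,
                                   vk::PipelineLayout layout,
                                   uint32_t tile_m, uint32_t tile_n, uint32_t warp_size) {
    // One entry per constant: {constantID, byte offset, size}. The IDs are hypothetical.
    std::array<vk::SpecializationMapEntry, 3> entries = {{
        {0, 0, sizeof(uint32_t)},   // e.g. tile size M
        {1, 4, sizeof(uint32_t)},   // e.g. tile size N
        {2, 8, sizeof(uint32_t)},   // e.g. subgroup/warp size
    }};
    std::array<uint32_t, 3> values = {tile_m, tile_n, warp_size};

    vk::SpecializationInfo spec_info(static_cast<uint32_t>(entries.size()), entries.data(),
                                     sizeof(values), values.data());
    vk::PipelineShaderStageCreateInfo stage({}, vk::ShaderStageFlagBits::eCompute,
                                            shader, "main", &spec_info);
    vk::ComputePipelineCreateInfo info({}, stage, layout);

    // An invalid combination can make this call fail, which is exactly where the
    // ErrorUnknown in the log above is thrown.
    return device.createComputePipeline(vk::PipelineCache{}, info).value;
}
```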

The model does get split into multiple buffers if more than maxAllocationSize or maxBufferSize of the Vulkan device is required. Even on Nvidia/AMD/Intel this is necessary, as they only allow a max of 2 or 4GB buffers.
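
Both limits can be read straight from the device, which makes it easy to check what a given driver actually advertises. A small sketch (again not backend code; the print_buffer_limits() helper is hypothetical, and maxBufferSize is only meaningful if the device exposes maintenance4 / Vulkan 1.3):

```cpp
// Sketch: print the two limits that decide how a model gets split across Vulkan buffers.
#include <vulkan/vulkan.hpp>
#include <cstdio>

void print_buffer_limits(vk::PhysicalDevice dev) {
    vk::PhysicalDeviceProperties2 props2;
    vk::PhysicalDeviceMaintenance3Properties maint3;   // maxMemoryAllocationSize (core in 1.1)
    vk::PhysicalDeviceMaintenance4Properties maint4;   // maxBufferSize (core in 1.3)
    props2.pNext = &maint3;
    maint3.pNext = &maint4;
    dev.getProperties2(&props2);

    printf("%s: maxMemoryAllocationSize = %llu MiB, maxBufferSize = %llu MiB\n",
           props2.properties.deviceName.data(),
           (unsigned long long)(maint3.maxMemoryAllocationSize / (1024 * 1024)),
           (unsigned long long)(maint4.maxBufferSize / (1024 * 1024)));
}
```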

akingoverlook commented Feb 27, 2024

> The first parameters to tune would be the specialization constants of the matrix matrix multiplication shader, but it's not straightforward which combinations of them are valid. They have a bunch of constraints that I haven't documented yet.
>
> The model does get split into multiple buffers if more than maxAllocationSize or maxBufferSize of the Vulkan device is required. Even on Nvidia/AMD/Intel this is necessary, as they only allow a max of 2 or 4GB buffers.

Ok, whenever you can document anything to sweep/try, let me know.

As far as the model split - I have a suspicion that it is not covering all the scenarios. I did see the code that handles it in ggml_backend_alloc_ctx_tensors_from_buft(), but nowhere else besides that. And it looks like the buffer for model tensors may get allocated by ggml_backend_cpu_buffer_from_ptr() in llama.cpp:4456, because it takes that "important for Apple path".

Admittedly, I don't know the code well enough to be sure I am not misinterpreting things, but it does take that path on Adreno, so it is not clear how the max allocation would be respected.

Again, consider the fact that this is UMA with a small allocation limit, unlike Apple. This isn't like any other platform, so it might take a path you didn't expect.

To check that hunch I tried to disable mmap, which would force it to take the ggml_backend_alloc_ctx_tensors_from_buft() path, but that does not help. It still reports a Vulkan buffer larger than 1GB, and still dies with DEVICE_LOST.

llm_load_tensors: offloading 18 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 19/19 layers to GPU
llm_load_tensors:    Vulkan0 buffer size =  1344.80 MiB

This issue was closed because it has been inactive for 14 days since being marked as stale.
