Vulkan for Android #5739

Closed
riverzhou opened this issue Feb 26, 2024 · 10 comments

riverzhou commented Feb 26, 2024

System:
Android 14 termux

Version:
latest

Log start
main: build = 2274 (47bb7b48)
main: built with clang version 17.0.6 for aarch64-unknown-linux-android24
main: seed  = 1708966403
ggml_vk_instance_init()
ggml_vulkan: Found 1 Vulkan devices:
ggml_vk_print_gpu_info(0)
Vulkan0: Adreno (TM) 730 | uma: 1 | fp16: 1 | warp size: 64
ggml_backend_vk_init(0)
ggml_vk_init(, 0)
ggml_vk_find_queue_family_index()
ggml_vk_find_queue_family_index()
ggml_vk_load_shaders()
ggml_vk_create_pipeline(matmul_f32_l, main, 3, 56, (128,128,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f32_m, main, 3, 56, (64,64,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f32_s, main, 3, 56, (32,32,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f32_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(matmul_f32_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(matmul_f32_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(matmul_f16_l, main, 3, 56, (128,128,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f16_m, main, 3, 56, (64,64,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f16_s, main, 3, 56, (32,32,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f16_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(matmul_f16_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(matmul_f16_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(matmul_f16_f32_l, main, 3, 56, (128,128,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f16_f32_m, main, 3, 56, (64,64,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f16_f32_s, main, 3, 56, (32,32,1), specialization_constants, 1)
ggml_vk_create_pipeline(matmul_f16_f32_aligned_l, main, 3, 56, (128,128,1), specialization_constants, 128)
ggml_vk_create_pipeline(matmul_f16_f32_aligned_m, main, 3, 56, (64,64,1), specialization_constants, 64)
ggml_vk_create_pipeline(matmul_f16_f32_aligned_s, main, 3, 56, (32,32,1), specialization_constants, 32)
ggml_vk_create_pipeline(mul_mat_vec_f16_f32, main, 3, 12, (1,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(mul_mat_vec_q4_0_f32, main, 3, 12, (1,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(mul_mat_vec_q4_1_f32, main, 3, 12, (1,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(mul_mat_vec_q5_0_f32, main, 3, 12, (1,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(mul_mat_vec_q5_1_f32, main, 3, 12, (1,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(mul_mat_vec_q8_0_f32, main, 3, 12, (1,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(mul_mat_vec_q2_K_f32, main, 3, 12, (1,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(mul_mat_vec_q3_K_f32, main, 3, 12, (1,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(mul_mat_vec_q4_K_f32, main, 3, 12, (1,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(mul_mat_vec_q5_K_f32, main, 3, 12, (1,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(mul_mat_vec_q6_K_f32, main, 3, 12, (1,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(f32_to_f16, main, 2, 16, (64,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(dequant_f16, main, 2, 16, (8192,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(dequant_q4_0, main, 2, 16, (8192,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(dequant_q4_1, main, 2, 16, (8192,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(dequant_q5_0, main, 2, 16, (8192,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(dequant_q5_1, main, 2, 16, (8192,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(dequant_q8_0, main, 2, 16, (8192,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(dequant_q2_K, main, 2, 16, (16384,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(dequant_q3_K, main, 2, 16, (16384,1,1), specialization_constants, 1)
ggml_vk_create_pipeline(dequant_q4_K, main, 2, 16, (8192,1,1), specialization_constants, 1)
libc++abi: terminating due to uncaught exception of type vk::UnknownError: vk::Device::createComputePipeline: ErrorUnknown
Aborted

0cc4m (Collaborator) commented Feb 26, 2024

Probably #5186 ?

akingoverlook commented Feb 26, 2024

You are on a path to waste a lot of time. I would know, because I did.
Read this first - #5186 (comment)

This one is easy to work around, but the next one will be tough. And then, even if it is all resolved, it will be slower than a good CPU anyway.

akingoverlook commented Feb 26, 2024

> Probably #5186 ?

Do you think there is room left to minimize the memory transfers in this backend? I did see some TODO in the code suggesting something of the sort. The cost is probably too high on the mobile chipsets, with their (best case) 4x16 bit memory bus.

0cc4m (Collaborator) commented Feb 26, 2024

> Do you think there is room left to minimize the memory transfers in this backend? I did see some TODO in the code suggesting something of the sort. The cost is probably too high on the mobile chipsets, with their (best case) 4x16 bit memory bus.

No, that's done already. I think the issue is that the shaders are optimized for Nvidia/AMD, but these mobile GPUs work differently. The most obvious difference is the warp size of 16. Optimizing for that might help.

akingoverlook commented Feb 26, 2024

> No, that's done already. I think the issue is that the shaders are optimized for Nvidia/AMD, but these mobile GPUs work differently. The most obvious difference is the warp size of 16. Optimizing for that might help.

Yes, I noticed that CLBlast has kernel tuners (per op, sweeping some parameters), and the results they produce differ quite a bit per GPU, so it is built by generating custom headers based on the tuners' output. Not that it helps very much, because that backend is really just an external BLAS; it does not offload the whole graph.

But perhaps something like that can be implemented for the Vulkan shaders, if they have equivalent parameters to sweep?
Or, as a shortcut, perhaps the Vulkan backend can eat the generated headers (or JSON files) produced by the CLBlast tuners?

Btw, the warp size depends on the platform:
Vulkan0: Adreno (TM) 735 | uma: 1 | fp16: 1 | warp size: 64
ggml_vulkan: Using Intel(R) UHD Graphics | uma: 1 | fp16: 1 | warp size: 32
Vulkan0: Mali-G720-Immortalis MC12 | uma: 1 | fp16: 1 | warp size: 16
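
For reference, the "warp size" printed in those lines is just the device's Vulkan subgroup size. A minimal standalone sketch (not llama.cpp code) of querying it through the Vulkan-Hpp headers the backend already uses:

```cpp
// Standalone sketch: print the subgroup ("warp") size of every Vulkan device.
#include <vulkan/vulkan.hpp>
#include <cstdio>

int main() {
    vk::ApplicationInfo app_info("subgroup-query", 1, nullptr, 0, VK_API_VERSION_1_1);
    vk::Instance instance = vk::createInstance(vk::InstanceCreateInfo({}, &app_info));

    for (const vk::PhysicalDevice & dev : instance.enumeratePhysicalDevices()) {
        vk::PhysicalDeviceProperties2 props2;
        vk::PhysicalDeviceSubgroupProperties subgroup_props;
        props2.pNext = &subgroup_props;          // chain the subgroup properties query
        dev.getProperties2(&props2);

        printf("%s | subgroup (warp) size: %u\n",
               props2.properties.deviceName.data(),
               subgroup_props.subgroupSize);
    }

    instance.destroy();
    return 0;
}
```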

0cc4m (Collaborator) commented Feb 27, 2024

> But perhaps something like that can be implemented for the Vulkan shaders, if they have equivalent parameters to sweep? Or, as a shortcut, perhaps the Vulkan backend can eat the generated headers (or JSON files) produced by the CLBlast tuners?

Yeah, in the long term I'd like to write an auto tuner, especially for the matrix matrix and matrix vector multiplication shaders. But at the moment there are more important topics. If someone else wants to give it a try, I'll help as much as I can. We can't just take CLBlast tunings, since they are for different kernels.

akingoverlook commented Feb 27, 2024

> Yeah, in the long term I'd like to write an auto tuner, especially for the matrix matrix and matrix vector multiplication shaders. But at the moment there are more important topics. If someone else wants to give it a try, I'll help as much as I can. We can't just take CLBlast tunings, since they are for different kernels.

If you give me some list of parameters and possible ranges to sweep, I will at least try some brute force experimentation to see if it helps anything.

Also, by now it is pretty clear to me that the only way this backend works in any coherent manner on Adreno is when the Vulkan buffer is smaller than the max allocation size (1GB). I suspect that short of fixing the driver (which I could ask QC to do, but would not hold my breath) the only real solution would be to "shard" the model for GPU offload. Have you given any thought to such ideas? I know that has been done before in a different context with very similar constraints.

0cc4m (Collaborator) commented Feb 27, 2024

> If you give me some list of parameters and possible ranges to sweep, I will at least try some brute force experimentation to see if it helps anything.
>
> Also, by now it is pretty clear to me that the only way this backend works in any coherent manner on Adreno is when the Vulkan buffer is smaller than the max allocation size (1GB). I suspect that short of fixing the driver (which I could ask QC to do, but would not hold my breath) the only real solution would be to "shard" the model for GPU offload. Have you given any thought to such ideas? I know that has been done before in a different context with very similar constraints.

The first parameters to tune would be the specialization constants of the matrix matrix multiplication shader, but it's not straightforward which combinations of them are valid. They have a bunch of constraints that I haven't documented yet.
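
For anyone who wants to attempt that sweep: the knobs are ordinary Vulkan specialization constants supplied at pipeline creation. Below is a rough sketch of what varying them looks like; the constant IDs, the names, and the create_tuned_pipeline() helper are made up for illustration and are not the layout ggml_vk_create_pipeline() actually uses:

```cpp
// Illustrative sketch only: feed tile-size style specialization constants to a compute pipeline.
#include <vulkan/vulkan.hpp>
#include <array>
#include <cstdint>

vk::Pipeline create_tuned_pipeline(vk::Device device, vk::ShaderModule shader,
                                   vk::PipelineLayout layout,
                                   uint32_t tile_m, uint32_t tile_n, uint32_t warp_size) {
    // One entry per constant: {constantID, byte offset, size}. The IDs are hypothetical.
    std::array<vk::SpecializationMapEntry, 3> entries = {{
        {0, 0, sizeof(uint32_t)},   // e.g. tile size M
        {1, 4, sizeof(uint32_t)},   // e.g. tile size N
        {2, 8, sizeof(uint32_t)},   // e.g. subgroup/warp size
    }};
    std::array<uint32_t, 3> values = {tile_m, tile_n, warp_size};

    vk::SpecializationInfo spec_info(static_cast<uint32_t>(entries.size()), entries.data(),
                                     sizeof(values), values.data());
    vk::PipelineShaderStageCreateInfo stage({}, vk::ShaderStageFlagBits::eCompute,
                                            shader, "main", &spec_info);
    vk::ComputePipelineCreateInfo info({}, stage, layout);

    // An invalid combination can make this call fail, which is exactly where the
    // ErrorUnknown in the log above is thrown.
    return device.createComputePipeline(vk::PipelineCache{}, info).value;
}
```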

The model does get split into multiple buffers if more than maxAllocationSize or maxBufferSize of the Vulkan device is required. Even on Nvidia/AMD/Intel this is necessary, as they only allow a max of 2 or 4GB buffers.
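
Both limits can be read straight from the device, which makes it easy to check what a given driver actually advertises. A small sketch (again not backend code; the print_buffer_limits() helper is hypothetical, and maxBufferSize is only meaningful if the device exposes maintenance4 / Vulkan 1.3):

```cpp
// Sketch: print the two limits that decide how a model gets split across Vulkan buffers.
#include <vulkan/vulkan.hpp>
#include <cstdio>

void print_buffer_limits(vk::PhysicalDevice dev) {
    vk::PhysicalDeviceProperties2 props2;
    vk::PhysicalDeviceMaintenance3Properties maint3;   // maxMemoryAllocationSize (core in 1.1)
    vk::PhysicalDeviceMaintenance4Properties maint4;   // maxBufferSize (core in 1.3)
    props2.pNext = &maint3;
    maint3.pNext = &maint4;
    dev.getProperties2(&props2);

    printf("%s: maxMemoryAllocationSize = %llu MiB, maxBufferSize = %llu MiB\n",
           props2.properties.deviceName.data(),
           (unsigned long long)(maint3.maxMemoryAllocationSize / (1024 * 1024)),
           (unsigned long long)(maint4.maxBufferSize / (1024 * 1024)));
}
```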

akingoverlook commented Feb 27, 2024

> The first parameters to tune would be the specialization constants of the matrix matrix multiplication shader, but it's not straightforward which combinations of them are valid. They have a bunch of constraints that I haven't documented yet.
>
> The model does get split into multiple buffers if more than maxAllocationSize or maxBufferSize of the Vulkan device is required. Even on Nvidia/AMD/Intel this is necessary, as they only allow a max of 2 or 4GB buffers.

Ok, whenever you can document anything to sweep/try, let me know.

As far as the model split - I have a suspicion that it is not covering all the scenarios. I did see the code that handles it in ggml_backend_alloc_ctx_tensors_from_buft(), but nowhere else besides that. And it looks like the buffer for model tensors may get allocated by ggml_backend_cpu_buffer_from_ptr() in llama.cpp:4456, because it takes that "important for Apple path".

Admittedly, I don't know the code well enough to be sure I am not misinterpreting things, but it does take that path on Adreno, so it is not clear how the max allocation would be respected.

Again, consider the fact that this is UMA with a small allocation limit, unlike Apple. This isn't like any other platform, so it might take a path you didn't expect.

To check that hunch I tried to disable mmap, which would force it to take the ggml_backend_alloc_ctx_tensors_from_buft() path, but that does not help. It still reports a Vulkan buffer larger than 1GB, and still dies with DEVICE_LOST.

llm_load_tensors: offloading 18 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 19/19 layers to GPU
llm_load_tensors:    Vulkan0 buffer size =  1344.80 MiB

This issue was closed because it has been inactive for 14 days since being marked as stale.
