Llama.cpp portable fails to initialise with context sizes above 22528 (22 × 1024). #13130


Open
HumerousGorgon opened this issue May 3, 2025 · 5 comments


@HumerousGorgon

Describe the bug
The portable nightly build of Llama.cpp fails to initialise when the context size is set above 22528. This is using -sm layer across 3 Arc GPUs; there is more than enough VRAM available.

How to reproduce
Steps to reproduce the error:

  1. Download a copy of Qwen3-30B-A3B-Q4_K_L.gguf.
  2. Download the latest nightly build of Llama.cpp from the releases section.
  3. Start the server with:
    ONEAPI_DEVICE_SELECTOR=level_zero:0,1,2 ZES_ENABLE_SYSMAN=1 SYCL_CACHE_PERSISTENT=1 SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ./llama-server -c 22528 -ngl 999 -m /home/llm/models/Qwen_Qwen3-30B-A3B-Q4_K_L.gguf --host 0.0.0.0 --port 8001 -sm layer --jinja
  4. If the context is set above 22528, the engine crashes with the following error:
    llama_kv_cache_init: kv_size = 23552, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
    llama_kv_cache_init: SYCL0 KV buffer size = 782.00 MiB
    llama_kv_cache_init: SYCL1 KV buffer size = 736.00 MiB
    llama_kv_cache_init: SYCL2 KV buffer size = 690.00 MiB
    llama_init_from_model: KV self size = 2208.00 MiB, K (f16): 1104.00 MiB, V (f16): 1104.00 MiB
    llama_init_from_model: SYCL_Host output buffer size = 0.58 MiB
    llama_init_from_model: pipeline parallelism enabled (n_copies=4)
    ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 4334944256 Bytes of memory on device
    ggml_gallocr_reserve_n: failed to allocate SYCL2 buffer of size 4334944256
    ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 6065967104 Bytes of memory on device
    ggml_gallocr_reserve_n: failed to allocate SYCL0 buffer of size 6065967104
    ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 5626382848 Bytes of memory on device
    ggml_gallocr_reserve_n: failed to allocate SYCL1 buffer of size 5626382848
    llama_init_from_model: failed to allocate compute buffers
    common_init_from_params: failed to create context with model '/home/llm/models/Qwen_Qwen3-30B-A3B-Q4_K_L.gguf'
    terminate called without an active exception
    ./llama-server: line 2: 142366 Aborted (core dumped) LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(cd "$(dirname "$0")";pwd) $(cd "$(dirname "$0")";pwd)/llama-server-bin "$@"

By comparison, with the context set at or below 22528, the following KV cache log is generated:
    llama_kv_cache_init: kv_size = 22528, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
    llama_kv_cache_init: SYCL0 KV buffer size = 748.00 MiB
    llama_kv_cache_init: SYCL1 KV buffer size = 704.00 MiB
    llama_kv_cache_init: SYCL2 KV buffer size = 660.00 MiB
    llama_init_from_model: KV self size = 2112.00 MiB, K (f16): 1056.00 MiB, V (f16): 1056.00 MiB
    llama_init_from_model: SYCL_Host output buffer size = 0.58 MiB
    llama_init_from_model: pipeline parallelism enabled (n_copies=4)
    llama_init_from_model: SYCL0 compute buffer size = 2016.06 MiB
    llama_init_from_model: SYCL1 compute buffer size = 2016.06 MiB
    llama_init_from_model: SYCL2 compute buffer size = 4070.12 MiB
    llama_init_from_model: SYCL_Host compute buffer size = 1440.19 MiB
    llama_init_from_model: graph nodes = 3270 (with bs=4096), 2646 (with bs=1)
    llama_init_from_model: graph splits = 4

As you can see, the compute buffer sizes grow only slightly between a 22528 and a 24576 context, and the available VRAM easily covers the difference, yet initialisation still fails.
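For reference, here is a back-of-the-envelope check of the reported KV cache sizes. This is a sketch, not taken from the llama.cpp source: the f16 cache (2 bytes per element) comes straight from the log, while the per-token KV width of 512 elements per layer is inferred from the logged totals rather than from the model card.

    # Reproduce the logged "KV self size" figures.
    # Assumption: KV bytes = 2 (K and V) * n_layer * n_ctx * 512 elements * 2 bytes (f16);
    # the 512-element width is inferred from the logs, not from the model card.
    n_layer=48; kv_width=512; bytes_per_elt=2
    for n_ctx in 22528 23552; do
      kv_bytes=$((2 * n_layer * n_ctx * kv_width * bytes_per_elt))
      echo "n_ctx=$n_ctx -> KV self size = $((kv_bytes / 1024 / 1024)) MiB"
    done
    # Prints 2112 MiB and 2208 MiB, matching the llama_kv_cache_init lines above.

Both figures match the logs, so the KV cache itself grows by under 100 MiB between the two settings; the allocations that actually fail in the crash log are the much larger compute buffers.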

@kirel

kirel commented May 5, 2025

If I'm reading ggml-org/llama.cpp#10026 correctly, you could try -sm row; perhaps the KV cache isn't split across devices by default or with -sm layer? The invocation would look like the sketch below.
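A sketch of that suggestion, simply swapping the split-mode flag in the original reproduction command (untested here, and hypothetical for this SYCL setup; see the reply below):

    # Original reproduction command with -sm layer swapped for -sm row.
    # Hypothetical for this setup; the SYCL backend may not support it (see below).
    ONEAPI_DEVICE_SELECTOR=level_zero:0,1,2 ZES_ENABLE_SYSMAN=1 \
    SYCL_CACHE_PERSISTENT=1 SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 \
    ./llama-server -c 24576 -ngl 999 \
      -m /home/llm/models/Qwen_Qwen3-30B-A3B-Q4_K_L.gguf \
      --host 0.0.0.0 --port 8001 -sm row --jinja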

@HumerousGorgon
Author

The SYCL backend doesn't support -sm row, unfortunately :(
Maybe in the future!

@qiuxin2012
Contributor

SYCL buffer size is limited to 4GB. The KV cache is too large in your case.
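The failing requests in the crash log line up with that limit: each one is just over 2^32 bytes, while the largest compute buffer in the working 22528-token run (SYCL2, 4070.12 MiB) sits just under the 4096 MiB ceiling. A quick check of the logged numbers:

    # Compare the failed buffer requests from the crash log against 4 GiB (2^32 bytes).
    limit=$((1 << 32))   # 4294967296
    for req in 4334944256 6065967104 5626382848; do
      echo "$req bytes exceeds 4 GiB by $(( (req - limit) / 1024 / 1024 )) MiB"
    done
    # Exceeds the limit by roughly 38, 1688 and 1269 MiB respectively.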

@HumerousGorgon
Author

Had a feeling you might say that; I remember this being a limitation.
Here’s a question though: on mainline llama.cpp I can assign buffer sizes larger than 4GB, and it knows how to split them up correctly. Alternatively, we could use KV cache quantisation to shrink the size. Is either of these a possibility?
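For the quantisation route, a sketch of what that could look like using mainline llama.cpp flags. This assumes -ctk/-ctv (--cache-type-k/--cache-type-v) behave the same on the SYCL build, which is not confirmed here; note also that on mainline a quantised V cache generally requires flash attention to be enabled.

    # Quantise the KV cache to q8_0, roughly halving the f16 footprint.
    # Assumptions: the mainline -ctk/-ctv flags work unchanged on this SYCL build,
    # and -fa (flash attention) is usable there, which mainline requires for a
    # quantised V cache.
    ./llama-server -c 24576 -ngl 999 \
      -m /home/llm/models/Qwen_Qwen3-30B-A3B-Q4_K_L.gguf \
      --host 0.0.0.0 --port 8001 -sm layer --jinja \
      -fa -ctk q8_0 -ctv q8_0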

@HumerousGorgon
Author

> SYCL buffer size is limited to 4GB. The KV cache is too large in your case.

I've read that it's possible to compile oneAPI applications with a patch that enables allocations above 4GB.
Is there a Docker image of the environment the devs use for building the llama.cpp releases, with the patches applied? That way I could try the oneAPI build args.
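For reference, a plain from-source SYCL build follows the recipe below (a sketch based on llama.cpp's SYCL build documentation; any patched oneAPI arguments for >4GB allocations would be layered on top and are not shown, since those exact flags are the open question here):

    # Minimal SYCL build from source, per llama.cpp's SYCL docs.
    # The >4GB-allocation compiler patch/flags are NOT included here.
    source /opt/intel/oneapi/setvars.sh
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    cmake -B build -DGGML_SYCL=ON \
      -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
    cmake --build build --config Release -j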
