Llama.cpp portable fails to initialise with context sizes above 22528 (22 × 1024). #13130


Open
HumerousGorgon opened this issue May 3, 2025 · 5 comments


@HumerousGorgon

Describe the bug
The portable nightly build of Llama.cpp fails to initialise when the context size is set above 22528. This is using -sm layer across 3 Arc GPUs; there is more than enough VRAM available.

How to reproduce
Steps to reproduce the error:

  1. Download a copy of Qwen3-30B-A3B-Q4_K_L.gguf.
  2. Download the latest nightly build of Llama.cpp from the releases section.
  3. Start the server with:
    ONEAPI_DEVICE_SELECTOR=level_zero:0,1,2 ZES_ENABLE_SYSMAN=1 SYCL_CACHE_PERSISTENT=1 SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ./llama-server -c 22528 -ngl 999 -m /home/llm/models/Qwen_Qwen3-30B-A3B-Q4_K_L.gguf --host 0.0.0.0 --port 8001 -sm layer --jinja
  4. If the context is set above 22528, the engine crashes with the following error:
    llama_kv_cache_init: kv_size = 23552, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
    llama_kv_cache_init: SYCL0 KV buffer size = 782.00 MiB
    llama_kv_cache_init: SYCL1 KV buffer size = 736.00 MiB
    llama_kv_cache_init: SYCL2 KV buffer size = 690.00 MiB
    llama_init_from_model: KV self size = 2208.00 MiB, K (f16): 1104.00 MiB, V (f16): 1104.00 MiB
    llama_init_from_model: SYCL_Host output buffer size = 0.58 MiB
    llama_init_from_model: pipeline parallelism enabled (n_copies=4)
    ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 4334944256 Bytes of memory on device
    ggml_gallocr_reserve_n: failed to allocate SYCL2 buffer of size 4334944256
    ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 6065967104 Bytes of memory on device
    ggml_gallocr_reserve_n: failed to allocate SYCL0 buffer of size 6065967104
    ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 5626382848 Bytes of memory on device
    ggml_gallocr_reserve_n: failed to allocate SYCL1 buffer of size 5626382848
    llama_init_from_model: failed to allocate compute buffers
    common_init_from_params: failed to create context with model '/home/llm/models/Qwen_Qwen3-30B-A3B-Q4_K_L.gguf'
    terminate called without an active exception
    ./llama-server: line 2: 142366 Aborted (core dumped) LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(cd "$(dirname "$0")";pwd) $(cd "$(dirname "$0")";pwd)/llama-server-bin "$@"

By comparison, with the context set at or below 22528, the following KV cache log is generated:
    llama_kv_cache_init: kv_size = 22528, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
    llama_kv_cache_init: SYCL0 KV buffer size = 748.00 MiB
    llama_kv_cache_init: SYCL1 KV buffer size = 704.00 MiB
    llama_kv_cache_init: SYCL2 KV buffer size = 660.00 MiB
    llama_init_from_model: KV self size = 2112.00 MiB, K (f16): 1056.00 MiB, V (f16): 1056.00 MiB
    llama_init_from_model: SYCL_Host output buffer size = 0.58 MiB
    llama_init_from_model: pipeline parallelism enabled (n_copies=4)
    llama_init_from_model: SYCL0 compute buffer size = 2016.06 MiB
    llama_init_from_model: SYCL1 compute buffer size = 2016.06 MiB
    llama_init_from_model: SYCL2 compute buffer size = 4070.12 MiB
    llama_init_from_model: SYCL_Host compute buffer size = 1440.19 MiB
    llama_init_from_model: graph nodes = 3270 (with bs=4096), 2646 (with bs=1)
    llama_init_from_model: graph splits = 4

As you can see, the compute buffer sizes grow only slightly between a 22528 and a 24576 context, and the available VRAM easily covers the difference, yet initialisation still fails.
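For reference, here is a back-of-the-envelope check of the reported KV cache sizes. This is a sketch, not taken from the llama.cpp source: the f16 cache (2 bytes per element) comes straight from the log, while the per-token KV width of 512 elements per layer is inferred from the logged totals rather than from the model card.

    # Reproduce the logged "KV self size" figures.
    # Assumption: KV bytes = 2 (K and V) * n_layer * n_ctx * 512 elements * 2 bytes (f16);
    # the 512-element width is inferred from the logs, not from the model card.
    n_layer=48; kv_width=512; bytes_per_elt=2
    for n_ctx in 22528 23552; do
      kv_bytes=$((2 * n_layer * n_ctx * kv_width * bytes_per_elt))
      echo "n_ctx=$n_ctx -> KV self size = $((kv_bytes / 1024 / 1024)) MiB"
    done
    # Prints 2112 MiB and 2208 MiB, matching the llama_kv_cache_init lines above.

Both figures match the logs, so the KV cache itself grows by under 100 MiB between the two settings; the allocations that actually fail in the crash log are the much larger compute buffers.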

@kirel

kirel commented May 5, 2025

If I'm reading ggml-org/llama.cpp#10026 correctly, you could try -sm row; perhaps the KV cache isn't split across devices by default or with -sm layer? The invocation would look like the sketch below.
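A sketch of that suggestion, simply swapping the split-mode flag in the original reproduction command (untested here, and hypothetical for this SYCL setup; see the reply below):

    # Original reproduction command with -sm layer swapped for -sm row.
    # Hypothetical for this setup; the SYCL backend may not support it (see below).
    ONEAPI_DEVICE_SELECTOR=level_zero:0,1,2 ZES_ENABLE_SYSMAN=1 \
    SYCL_CACHE_PERSISTENT=1 SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 \
    ./llama-server -c 24576 -ngl 999 \
      -m /home/llm/models/Qwen_Qwen3-30B-A3B-Q4_K_L.gguf \
      --host 0.0.0.0 --port 8001 -sm row --jinja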

@HumerousGorgon
Author

The SYCL backend doesn't support -sm row, unfortunately :(
Maybe in the future!

@qiuxin2012
Contributor

SYCL buffer size is limited to 4GB. The KV cache is too large in your case.
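The failing requests in the crash log line up with that limit: each one is just over 2^32 bytes, while the largest compute buffer in the working 22528-token run (SYCL2, 4070.12 MiB) sits just under the 4096 MiB ceiling. A quick check of the logged numbers:

    # Compare the failed buffer requests from the crash log against 4 GiB (2^32 bytes).
    limit=$((1 << 32))   # 4294967296
    for req in 4334944256 6065967104 5626382848; do
      echo "$req bytes exceeds 4 GiB by $(( (req - limit) / 1024 / 1024 )) MiB"
    done
    # Exceeds the limit by roughly 38, 1688 and 1269 MiB respectively.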

@HumerousGorgon
Author

Had a feeling you might say that; I remember this being a limitation.
Here’s a question though: on mainline llama.cpp I can assign buffer sizes larger than 4GB, and it knows how to split them up correctly. Alternatively, we could use KV cache quantisation to shrink the size. Is either of these a possibility?
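For the quantisation route, a sketch of what that could look like using mainline llama.cpp flags. This assumes -ctk/-ctv (--cache-type-k/--cache-type-v) behave the same on the SYCL build, which is not confirmed here; note also that on mainline a quantised V cache generally requires flash attention to be enabled.

    # Quantise the KV cache to q8_0, roughly halving the f16 footprint.
    # Assumptions: the mainline -ctk/-ctv flags work unchanged on this SYCL build,
    # and -fa (flash attention) is usable there, which mainline requires for a
    # quantised V cache.
    ./llama-server -c 24576 -ngl 999 \
      -m /home/llm/models/Qwen_Qwen3-30B-A3B-Q4_K_L.gguf \
      --host 0.0.0.0 --port 8001 -sm layer --jinja \
      -fa -ctk q8_0 -ctv q8_0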

@HumerousGorgon
Author

> SYCL buffer size is limited to 4GB. The KV cache is too large in your case.

I've read that it's possible to compile oneAPI applications with a patch that enables allocations above 4GB.
Is there a Docker image of the environment the devs use for building the llama.cpp releases, with the patches applied? That way I could try the oneAPI build args.
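For reference, a plain from-source SYCL build follows the recipe below (a sketch based on llama.cpp's SYCL build documentation; any patched oneAPI arguments for >4GB allocations would be layered on top and are not shown, since those exact flags are the open question here):

    # Minimal SYCL build from source, per llama.cpp's SYCL docs.
    # The >4GB-allocation compiler patch/flags are NOT included here.
    source /opt/intel/oneapi/setvars.sh
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    cmake -B build -DGGML_SYCL=ON \
      -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
    cmake --build build --config Release -j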
