Eval bug: Qwen3-Next-80b crashes loading 126k tokens of context in CUDA (Vulkan is fine) #18140

@lilblam

Description

Name and Version

C:\Chu\LLM\Llamacpp>llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from C:\Chu\LLM\Llamacpp\ggml-cuda.dll
load_backend: loaded RPC backend from C:\Chu\LLM\Llamacpp\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Chu\LLM\Llamacpp\ggml-cpu-icelake.dll
version: 7438 (ef83fb8)
built with Clang 19.1.5 for Windows x86_64

Operating systems

Windows

GGML backends

CUDA

Hardware

Nvidia RTX 5060 Ti

Models

https://huggingface.co/unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF/tree/main/UD-Q6_K_XL

Problem description & steps to reproduce

When I run the llama-server b7438 CUDA-13 build with Qwen3-Next-80B-Instruct, everything works great until I try to load more than about 131k tokens of context. It then crashes with an assertion error while processing the prompt. The Vulkan build doesn't give me any trouble (but is much slower than the CUDA build for prompt processing). My PC specs:

CPU - AMD Ryzen 9 9900X
GPU - Nvidia RTX 5060 Ti
RAM - 96 GB DDR5-6000 (running at 4800 MT/s)

My launch params:

title llama-server
set "LLAMA_SERVER_SLOTS_DEBUG=1"
llama-server ^
--model models/Qwen3-Next-80B-A3B-Instruct-UD-Q6_K_XL-00001-of-00002.gguf ^
--ctx-size 196608 ^
--n-predict 196608 ^
--gpu-layers 99 ^
--cpu-moe ^
--ubatch-size 4096 ^
--batch-size 4096 ^
--threads 10 ^
--temp 0.7 ^
--top-k 20 ^
--top-p 0.8 ^
--min-p 0.0 ^
--repeat-penalty 1.0 ^
--port 8013
pause

I don't think it matters what you load into the context, but to make it easy, try getting it to summarize a combination of these 2 articles, which together come to over 131k tokens (a minimal sketch for sending the request straight to the API follows the links):

https://en.wikipedia.org/wiki/Nvidia
https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units
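
For reference, here is a minimal sketch of how the same request could be sent directly to the server's OpenAI-compatible endpoint; the port and the /v1/chat/completions route are taken from the launch script and server log above, while the local file names for the saved article text are placeholders, not part of the original setup:

# Minimal repro sketch: post a >131k-token prompt to llama-server and wait for the reply.
# Assumes the two Wikipedia articles were saved locally as plain text; the file names
# nvidia.txt and nvidia_gpus.txt are placeholders.
import json
import urllib.request

with open("nvidia.txt", encoding="utf-8") as f1, open("nvidia_gpus.txt", encoding="utf-8") as f2:
    articles = f1.read() + "\n\n" + f2.read()

payload = {
    "messages": [
        {"role": "user", "content": "Summarize the following articles:\n\n" + articles}
    ],
    "max_tokens": 1024,
}

req = urllib.request.Request(
    "http://127.0.0.1:8013/v1/chat/completions",   # port 8013 as in the launch script
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])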

Let me know if you need anything else from me!

First Bad Commit

No response

Relevant log output

C:\Chu\LLM\Llamacpp>title llama-server

C:\Chu\LLM\Llamacpp>set "LLAMA_SERVER_SLOTS_DEBUG=1"

C:\Chu\LLM\Llamacpp>llama-server --model models/Qwen3-Next-80B-A3B-Instruct-UD-Q6_K_XL-00001-of-00002.gguf --ctx-size 196608 --n-predict 196608 --gpu-layers 99 --cpu-moe --ubatch-size 4096 --batch-size 4096 --threads 10 --temp 0.7 --top-k 20 --top-p 0.8 --min-p 0.0 --repeat-penalty 1.0 --port 8013
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from C:\Chu\LLM\Llamacpp\ggml-cuda.dll
load_backend: loaded RPC backend from C:\Chu\LLM\Llamacpp\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Chu\LLM\Llamacpp\ggml-cpu-icelake.dll
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 7438 (ef83fb860) with Clang 19.1.5 for Windows x86_64
system info: n_threads = 10, n_threads_batch = 10, total_threads = 24

system_info: n_threads = 10 (n_threads_batch = 10) / 24 | CUDA : ARCHS = 750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

init: using 23 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model 'models/Qwen3-Next-80B-A3B-Instruct-UD-Q6_K_XL-00001-of-00002.gguf'
common_init_result: fitting params to device memory, to report bugs during this step use -fit off (or --verbose if you can't)
llama_params_fit_impl: projected to use 11984 MiB of device memory vs. 16310 MiB of free device memory
llama_params_fit_impl: will leave 2871 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.25 seconds
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5060 Ti) (0000:01:00.0) - 15158 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 52 key-value pairs and 807 tensors from models/Qwen3-Next-80B-A3B-Instruct-UD-Q6_K_XL-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3next
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 20
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.800000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 0.700000
llama_model_loader: - kv   5:                               general.name str              = Qwen3-Next-80B-A3B-Instruct
llama_model_loader: - kv   6:                           general.finetune str              = Instruct
llama_model_loader: - kv   7:                           general.basename str              = Qwen3-Next-80B-A3B-Instruct
llama_model_loader: - kv   8:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   9:                         general.size_label str              = 80B-A3B
llama_model_loader: - kv  10:                            general.license str              = apache-2.0
llama_model_loader: - kv  11:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-Nex...
llama_model_loader: - kv  12:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  13:                   general.base_model.count u32              = 1
llama_model_loader: - kv  14:                  general.base_model.0.name str              = Qwen3 Next 80B A3B Instruct
llama_model_loader: - kv  15:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  16:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-Nex...
llama_model_loader: - kv  17:                               general.tags arr[str,2]       = ["unsloth", "text-generation"]
llama_model_loader: - kv  18:                      qwen3next.block_count u32              = 48
llama_model_loader: - kv  19:                   qwen3next.context_length u32              = 262144
llama_model_loader: - kv  20:                 qwen3next.embedding_length u32              = 2048
llama_model_loader: - kv  21:              qwen3next.feed_forward_length u32              = 5120
llama_model_loader: - kv  22:             qwen3next.attention.head_count u32              = 16
llama_model_loader: - kv  23:          qwen3next.attention.head_count_kv u32              = 2
llama_model_loader: - kv  24:                   qwen3next.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  25: qwen3next.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  26:                qwen3next.expert_used_count u32              = 10
llama_model_loader: - kv  27:             qwen3next.attention.key_length u32              = 256
llama_model_loader: - kv  28:           qwen3next.attention.value_length u32              = 256
llama_model_loader: - kv  29:                     qwen3next.expert_count u32              = 512
llama_model_loader: - kv  30:       qwen3next.expert_feed_forward_length u32              = 512
llama_model_loader: - kv  31: qwen3next.expert_shared_feed_forward_length u32              = 512
llama_model_loader: - kv  32:                  qwen3next.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  33:                   qwen3next.ssm.state_size u32              = 128
llama_model_loader: - kv  34:                  qwen3next.ssm.group_count u32              = 16
llama_model_loader: - kv  35:               qwen3next.ssm.time_step_rank u32              = 32
llama_model_loader: - kv  36:                   qwen3next.ssm.inner_size u32              = 4096
llama_model_loader: - kv  37:             qwen3next.rope.dimension_count u32              = 64
llama_model_loader: - kv  38:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  39:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  40:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  41:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  42:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  43:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  44:            tokenizer.ggml.padding_token_id u32              = 151654
llama_model_loader: - kv  45:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  46:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  47:               general.quantization_version u32              = 2
llama_model_loader: - kv  48:                          general.file_type u32              = 18
llama_model_loader: - kv  49:                                   split.no u16              = 0
llama_model_loader: - kv  50:                        split.tensors.count i32              = 807
llama_model_loader: - kv  51:                                split.count u16              = 2
llama_model_loader: - type  f32:  313 tensors
llama_model_loader: - type q8_0:  146 tensors
llama_model_loader: - type q6_K:  300 tensors
llama_model_loader: - type bf16:   48 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q6_K
print_info: file size   = 61.20 GiB (6.60 BPW)
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3next
print_info: vocab_only       = 0
print_info: no_alloc         = 0
print_info: n_ctx_train      = 262144
print_info: n_embd           = 2048
print_info: n_embd_inp       = 2048
print_info: n_layer          = 48
print_info: n_head           = 16
print_info: n_head_kv        = 2
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 5120
print_info: n_expert         = 512
print_info: n_expert_used    = 10
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 262144
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 4
print_info: ssm_d_inner      = 4096
print_info: ssm_d_state      = 128
print_info: ssm_dt_rank      = 32
print_info: ssm_n_group      = 16
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 80B.A3B
print_info: model params     = 79.67 B
print_info: general.name     = Qwen3-Next-80B-A3B-Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 11 ','
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151654 '<|vision_pad|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:   CPU_Mapped model buffer size = 47261.64 MiB
load_tensors:   CPU_Mapped model buffer size = 15086.64 MiB
load_tensors:        CUDA0 model buffer size =  1870.42 MiB
....................................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 196608
llama_context: n_ctx_seq     = 196608
llama_context: n_batch       = 4096
llama_context: n_ubatch      = 4096
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = true
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (196608) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     2.32 MiB
llama_kv_cache:      CUDA0 KV buffer size =  4608.00 MiB
llama_kv_cache: size = 4608.00 MiB (196608 cells,  12 layers,  4/1 seqs), K (f16): 2304.00 MiB, V (f16): 2304.00 MiB
llama_memory_recurrent:      CUDA0 RS buffer size =   301.50 MiB
llama_memory_recurrent: size =  301.50 MiB (     4 cells,  48 layers,  4 seqs), R (f32):   13.50 MiB, S (f32):  288.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:      CUDA0 compute buffer size =  5204.42 MiB
llama_context:  CUDA_Host compute buffer size =  3104.11 MiB
llama_context: graph nodes  = 35378 (with bs=4096), 6614 (with bs=1)
llama_context: graph splits = 147 (with bs=4096), 99 (with bs=1)
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 4
slot         init: id  0 | task -1 | new slot, n_ctx = 196608
slot         init: id  1 | task -1 | new slot, n_ctx = 196608
slot         init: id  2 | task -1 | new slot, n_ctx = 196608
slot         init: id  3 | task -1 | new slot, n_ctx = 196608
srv          init: slots debug = 1
srv          init: prompt cache is enabled, size limit: 8192 MiB
srv          init: use `--cache-ram 0` to disable the prompt cache
srv          init: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv          init: thinking = 0
init: chat template, chat_template: {%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0].role == 'system' %}
        {{- messages[0].content + '\n\n' }}
    {%- endif %}
    {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0].role == 'system' %}
        {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- for message in messages %}
    {%- if message.content is string %}
        {%- set content = message.content %}
    {%- else %}
        {%- set content = '' %}
    {%- endif %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
        {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {{- '<|im_start|>' + message.role + '\n' + content }}
        {%- if message.tool_calls %}
            {%- for tool_call in message.tool_calls %}
                {%- if (loop.first and content) or (not loop.first) %}
                    {{- '\n' }}
                {%- endif %}
                {%- if tool_call.function %}
                    {%- set tool_call = tool_call.function %}
                {%- endif %}
                {{- '<tool_call>\n{"name": "' }}
                {{- tool_call.name }}
                {{- '", "arguments": ' }}
                {%- if tool_call.arguments is string %}
                    {{- tool_call.arguments }}
                {%- else %}
                    {{- tool_call.arguments | tojson }}
                {%- endif %}
                {{- '}\n</tool_call>' }}
            {%- endfor %}
        {%- endif %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: model loaded
main: server is listening on http://127.0.0.1:8013
main: starting the main loop...
srv  update_slots: all slots are idle
srv  log_server_r: request: GET / 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /v1/models 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id  3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  3 | task 0 | processing task
slot update_slots: id  3 | task 0 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 9
slot update_slots: id  3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 9, batch.n_tokens = 9, progress = 1.000000
slot update_slots: id  3 | task 0 | prompt done, n_tokens = 9, batch.n_tokens = 9
slot print_timing: id  3 | task 0 |
prompt eval time =     330.90 ms /     9 tokens (   36.77 ms per token,    27.20 tokens per second)
       eval time =     574.69 ms /    12 tokens (   47.89 ms per token,    20.88 tokens per second)
      total time =     905.59 ms /    21 tokens
slot      release: id  3 | task 0 | stop processing: n_tokens = 20, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id  2 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  2 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  2 | task 13 | processing task
slot update_slots: id  2 | task 13 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 130368
slot update_slots: id  2 | task 13 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  2 | task 13 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 4096, progress = 0.031419
srv          stop: cancel task, id_task = 13
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot      release: id  2 | task 13 | stop processing: n_tokens = 4096, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id  1 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  1 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  1 | task 16 | processing task
slot update_slots: id  1 | task 16 | new prompt, n_ctx_slot = 196608, n_keep = 0, task.n_tokens = 143970
slot update_slots: id  1 | task 16 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 4096, progress = 0.028450
slot update_slots: id  1 | task 16 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 4096, progress = 0.056901
slot update_slots: id  1 | task 16 | n_tokens = 8192, memory_seq_rm [8192, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 12288, batch.n_tokens = 4096, progress = 0.085351
slot update_slots: id  1 | task 16 | n_tokens = 12288, memory_seq_rm [12288, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 16384, batch.n_tokens = 4096, progress = 0.113801
slot update_slots: id  1 | task 16 | n_tokens = 16384, memory_seq_rm [16384, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 20480, batch.n_tokens = 4096, progress = 0.142252
slot update_slots: id  1 | task 16 | n_tokens = 20480, memory_seq_rm [20480, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 24576, batch.n_tokens = 4096, progress = 0.170702
slot update_slots: id  1 | task 16 | n_tokens = 24576, memory_seq_rm [24576, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 28672, batch.n_tokens = 4096, progress = 0.199153
slot update_slots: id  1 | task 16 | n_tokens = 28672, memory_seq_rm [28672, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 32768, batch.n_tokens = 4096, progress = 0.227603
slot update_slots: id  1 | task 16 | n_tokens = 32768, memory_seq_rm [32768, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 36864, batch.n_tokens = 4096, progress = 0.256053
slot update_slots: id  1 | task 16 | n_tokens = 36864, memory_seq_rm [36864, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 40960, batch.n_tokens = 4096, progress = 0.284504
slot update_slots: id  1 | task 16 | n_tokens = 40960, memory_seq_rm [40960, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 45056, batch.n_tokens = 4096, progress = 0.312954
slot update_slots: id  1 | task 16 | n_tokens = 45056, memory_seq_rm [45056, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 49152, batch.n_tokens = 4096, progress = 0.341404
slot update_slots: id  1 | task 16 | n_tokens = 49152, memory_seq_rm [49152, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 53248, batch.n_tokens = 4096, progress = 0.369855
slot update_slots: id  1 | task 16 | n_tokens = 53248, memory_seq_rm [53248, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 57344, batch.n_tokens = 4096, progress = 0.398305
slot update_slots: id  1 | task 16 | n_tokens = 57344, memory_seq_rm [57344, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 61440, batch.n_tokens = 4096, progress = 0.426756
slot update_slots: id  1 | task 16 | n_tokens = 61440, memory_seq_rm [61440, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 65536, batch.n_tokens = 4096, progress = 0.455206
slot update_slots: id  1 | task 16 | n_tokens = 65536, memory_seq_rm [65536, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 69632, batch.n_tokens = 4096, progress = 0.483656
slot update_slots: id  1 | task 16 | n_tokens = 69632, memory_seq_rm [69632, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 73728, batch.n_tokens = 4096, progress = 0.512107
slot update_slots: id  1 | task 16 | n_tokens = 73728, memory_seq_rm [73728, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 77824, batch.n_tokens = 4096, progress = 0.540557
slot update_slots: id  1 | task 16 | n_tokens = 77824, memory_seq_rm [77824, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 81920, batch.n_tokens = 4096, progress = 0.569007
slot update_slots: id  1 | task 16 | n_tokens = 81920, memory_seq_rm [81920, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 86016, batch.n_tokens = 4096, progress = 0.597458
slot update_slots: id  1 | task 16 | n_tokens = 86016, memory_seq_rm [86016, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 90112, batch.n_tokens = 4096, progress = 0.625908
slot update_slots: id  1 | task 16 | n_tokens = 90112, memory_seq_rm [90112, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 94208, batch.n_tokens = 4096, progress = 0.654359
slot update_slots: id  1 | task 16 | n_tokens = 94208, memory_seq_rm [94208, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 98304, batch.n_tokens = 4096, progress = 0.682809
slot update_slots: id  1 | task 16 | n_tokens = 98304, memory_seq_rm [98304, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 102400, batch.n_tokens = 4096, progress = 0.711259
slot update_slots: id  1 | task 16 | n_tokens = 102400, memory_seq_rm [102400, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 106496, batch.n_tokens = 4096, progress = 0.739710
slot update_slots: id  1 | task 16 | n_tokens = 106496, memory_seq_rm [106496, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 110592, batch.n_tokens = 4096, progress = 0.768160
slot update_slots: id  1 | task 16 | n_tokens = 110592, memory_seq_rm [110592, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 114688, batch.n_tokens = 4096, progress = 0.796610
slot update_slots: id  1 | task 16 | n_tokens = 114688, memory_seq_rm [114688, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 118784, batch.n_tokens = 4096, progress = 0.825061
slot update_slots: id  1 | task 16 | n_tokens = 118784, memory_seq_rm [118784, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 122880, batch.n_tokens = 4096, progress = 0.853511
slot update_slots: id  1 | task 16 | n_tokens = 122880, memory_seq_rm [122880, end)
slot update_slots: id  1 | task 16 | prompt processing progress, n_tokens = 126976, batch.n_tokens = 4096, progress = 0.881962
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\cpy.cu:359: GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX) failed
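
For context, the failing check means a single source tensor handed to the CUDA copy kernel was larger than INT_MAX (2147483647 bytes, roughly 2 GiB). The log doesn't show which tensor tripped it, so the snippet below only illustrates the bound using numbers from this run; the example shape is an assumption, not a claim about the offending copy:

# Illustration of the limit enforced by GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX).
# The example shape (n_vocab = 151936 from the metadata, --ubatch-size 4096 from the
# launch script, 4 bytes per f32 element) is for illustration only.
INT_MAX = 2**31 - 1                      # 2147483647 bytes, about 2048 MiB

def exceeds_int_max(shape, bytes_per_element):
    """Return the byte count of a contiguous tensor and whether it would trip the assert."""
    nbytes = bytes_per_element
    for dim in shape:
        nbytes *= dim
    return nbytes, nbytes > INT_MAX

print(exceeds_int_max((151936, 4096), 4))   # -> (2489319424, True)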

Metadata

Labels

bug (Something isn't working), good first issue (Good for newcomers)
