Eval bug: DeepSeek-R1-UD-Q2_K_XL output broken #13305

Closed

joesixpaq opened this issue May 4, 2025 · 13 comments · Fixed by #13320

Comments

@joesixpaq

joesixpaq commented May 4, 2025

Name and Version

I experience gibberish output with DeepSeek-R1-UD-Q2_K_XL by unsloth (files verified with SHA256).

In my case, the gibberish output started with commit e1e8e09.

I eventually managed to isolate the last commit that still works: 6f67cf1

The most recent commit I tested, 9f2da58, is still broken.
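For reference, this kind of commit isolation can be automated with git bisect; a minimal sketch (the build flags and the manual test step are assumptions, not necessarily what was used here):

cd llama.cpp
git bisect start 9f2da58 6f67cf1   # <bad> <good>
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j
# test llama-server output at each step, then mark the commit:
# git bisect good   or   git bisect bad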

(screenshot of the gibberish output)

Operating systems

Linux

GGML backends

CUDA

Hardware

1x RTX 3090, Intel Xeon E5-2640 v3, 1TB RAM

Models

DeepSeek-R1-UD-Q2_K_XL by unsloth

Problem description & steps to reproduce

Nonsensical output, partly consisting of Chinese characters

First Bad Commit

e1e8e09

Relevant log output

#!/bin/bash

if [ -t 0 ]; then
    CPU0="--physcpubind=16,17,18,19,20,21,22,23 --membind=0"
    CPU1="--physcpubind=8,10,12,14,24,26,28,30 --membind=1"

    declare -a MODEL_ALIASES=(
        "DeepSeek R1 Q2_K_XL"
        "Qwen3-32B-UD-Q4_K_XL"
    )
    
    declare -a MODEL_PATHS=(
        "/mnt/AI/LLM/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf"
        "/mnt/AI/LLM/Qwen3-32B-UD-Q4_K_XL/Qwen3-32B-UD-Q4_K_XL.gguf"
    )

    # CPU Selection
    echo "Select CPU configuration:"
    select cpu_opt in "CPU0" "CPU1"; do
        case $cpu_opt in
            "CPU0") SELECTED_CPU="$CPU0"; break;;
            "CPU1") SELECTED_CPU="$CPU1"; break;;
            *) echo "Invalid option, please choose 1 or 2";;
        esac
    done
    
    # Model Selection
    echo "Select model:"
    select model_alias in "${MODEL_ALIASES[@]}"; do
        if [[ -n "$model_alias" ]] && (( REPLY >= 1 && REPLY <= ${#MODEL_ALIASES[@]} )); then
            MODEL_PATH="${MODEL_PATHS[$((REPLY-1))]}"
            break
        else
            echo "Invalid selection. Please enter a number between 1 and ${#MODEL_ALIASES[@]}."
        fi
    done
    
    read -p "Server port [8000]: " PORT
    PORT=${PORT:-8000}
    read -p "Context size [16384]: " CTX
    CTX=${CTX:-16384}
    read -p "GPU layers [0]: " NGL
    NGL=${NGL:-0}

    SERVER_PARAMS="
    --port $PORT
    --model $MODEL_PATH
    --n-gpu-layers $NGL
    --ctx-size $CTX
    --no-mmap
    --temp 0.6
    --cache-type-k q4_0
    --threads 8
    --predict 16384
    --host 0.0.0.0
    --batch-size 4096
    --device CUDA0
    "
    # Start server with line buffering
    
    # COMMIT="e1e8e0991ffd9e99a445c6812bb519d5bac9f4b5" # FIRST BROKEN
    # COMMIT="6f67cf1f480926391ad75ff746e0a021214bf70c" # LAST GOOD
    
    # COMMIT="8afbd968182909cf93fb15959fc867b6dd3adb53" # GOOD newest 04.05.2025 for Qwen3 but not DS-R1

    COMMIT="9f2da5871f4bbd205b8a3b952cdc76283218d595" 

    PATH2APP="$HOME/workspace/LLAMA_CPP/$COMMIT/llama.cpp/build/bin/llama-server"
    echo $PATH2APP
    numactl $SELECTED_CPU stdbuf -oL "$PATH2APP" $SERVER_PARAMS > server.log 2>&1 &
    
    SERVER_PID=$!

    # Monitor log for startup completion
    echo "Waiting for server to start..."
    (
        tail -f server.log | while IFS= read -r line; do
            # Print line to terminal
            echo "$line"
            # Check for magic string
            if [[ "$line" == *"starting the main loop"* ]]; then
                pkill -P $$ tail  # Kill tail process
                exit 0
            fi
        done
    ) & TAIL_PID=$!

    # Wait for monitoring process
    wait $TAIL_PID 2>/dev/null

    # Open browser if successful
    if [ $? -eq 0 ]; then
        echo "Server ready! Launching browser..."
        xdg-open "http://localhost:$PORT" 2>/dev/null
    else
        echo "ERROR: Server failed to start"
        kill $SERVER_PID 2>/dev/null
        exit 1
    fi

    # Cleanup
    wait $SERVER_PID
    rm server.log

else
    gnome-terminal -- bash -c "$0; exec bash"
fi

--

/home/xyz/workspace/LLAMA_CPP/9f2da5871f4bbd205b8a3b952cdc76283218d595/llama.cpp/build/bin/llama-server
Waiting for server to start...
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: Quadro M2000, compute capability 5.2, VMM: yes
build: 5276 (9f2da587) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 8, n_threads_batch = 8, total_threads = 32

system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CUDA : ARCHS = 500,610,700,750,800 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 8001, http threads: 31
main: loading model
srv    load_model: loading model '/mnt/AI/LLM/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 17426 MiB free
llama_model_loader: additional 4 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 48 key-value pairs and 1025 tensors from /mnt/AI/LLM/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 BF16
llama_model_loader: - kv   3:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   4:                         general.size_label str              = 256x20B
llama_model_loader: - kv   5:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   6:                      deepseek2.block_count u32              = 61
llama_model_loader: - kv   7:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   8:                 deepseek2.embedding_length u32              = 7168
llama_model_loader: - kv   9:              deepseek2.feed_forward_length u32              = 18432
llama_model_loader: - kv  10:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv  11:          deepseek2.attention.head_count_kv u32              = 128
llama_model_loader: - kv  12:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  13: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                deepseek2.expert_used_count u32              = 8
llama_model_loader: - kv  15:        deepseek2.leading_dense_block_count u32              = 3
llama_model_loader: - kv  16:                       deepseek2.vocab_size u32              = 129280
llama_model_loader: - kv  17:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  18:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  19:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  20:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  21:       deepseek2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  22:                     deepseek2.expert_count u32              = 256
llama_model_loader: - kv  23:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  24:             deepseek2.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  25:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  26:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  27:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  28:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  29:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  30: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  31: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  32:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  33:                         tokenizer.ggml.pre str              = deepseek-v3
llama_model_loader: - kv  34:                      tokenizer.ggml.tokens arr[str,129280]  = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv  35:                  tokenizer.ggml.token_type arr[i32,129280]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  36:                      tokenizer.ggml.merges arr[str,127741]  = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv  37:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  38:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  39:            tokenizer.ggml.padding_token_id u32              = 128815
llama_model_loader: - kv  40:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  41:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  42:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  43:               general.quantization_version u32              = 2
llama_model_loader: - kv  44:                          general.file_type u32              = 10
llama_model_loader: - kv  45:                                   split.no u16              = 0
llama_model_loader: - kv  46:                        split.tensors.count i32              = 1025
llama_model_loader: - kv  47:                                split.count u16              = 5
llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q2_K:  171 tensors
llama_model_loader: - type q3_K:    3 tensors
llama_model_loader: - type q4_K:  306 tensors
llama_model_loader: - type q6_K:  184 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q2_K - Medium
print_info: file size   = 211.03 GiB (2.70 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 819
load: token to piece cache size = 0.8223 MB
print_info: arch             = deepseek2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 163840
print_info: n_embd           = 7168
print_info: n_layer          = 61
print_info: n_head           = 128
print_info: n_head_kv        = 128
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 192
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 24576
print_info: n_embd_v_gqa     = 16384
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 18432
print_info: n_expert         = 256
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = yarn
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 0.025
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 671B
print_info: model params     = 671.03 B
print_info: general.name     = DeepSeek R1 BF16
print_info: n_layer_dense_lead   = 3
print_info: n_lora_q             = 1536
print_info: n_lora_kv            = 512
print_info: n_embd_head_k_mla    = 0
print_info: n_embd_head_v_mla    = 0
print_info: n_ff_exp             = 2048
print_info: n_expert_shared      = 1
print_info: expert_weights_scale = 2.5
print_info: expert_weights_norm  = 1
print_info: expert_gating_func   = sigmoid
print_info: rope_yarn_log_mul    = 0.1000
print_info: vocab type       = BPE
print_info: n_vocab          = 129280
print_info: n_merges         = 127741
print_info: BOS token        = 0 '<|begin▁of▁sentence|>'
print_info: EOS token        = 1 '<|end▁of▁sentence|>'
print_info: EOT token        = 1 '<|end▁of▁sentence|>'
print_info: PAD token        = 128815 '<|PAD▁TOKEN|>'
print_info: LF token         = 201 'Ċ'
print_info: FIM PRE token    = 128801 '<|fim▁begin|>'
print_info: FIM SUF token    = 128800 '<|fim▁hole|>'
print_info: FIM MID token    = 128802 '<|fim▁end|>'
print_info: EOG token        = 1 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/62 layers to GPU
load_tensors:    CUDA_Host model buffer size = 215601.95 MiB
load_tensors:          CPU model buffer size =   497.11 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch       = 4096
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 0.025
llama_context: n_ctx_per_seq (16384) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.49 MiB
llama_kv_cache_unified: kv_size = 16384, type_k = 'q4_0', type_v = 'f16', n_layer = 61, can_shift = 1, padding = 32
llama_kv_cache_unified:        CPU KV buffer size = 44408.00 MiB
llama_kv_cache_unified: KV self size  = 44408.00 MiB, K (q4_0): 13176.00 MiB, V (f16): 31232.00 MiB
llama_context:      CUDA0 compute buffer size =  5017.50 MiB
llama_context:  CUDA_Host compute buffer size =   112.01 MiB
llama_context: graph nodes  = 4842
llama_context: graph splits = 1148 (with bs=512), 1 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16384
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 16384
main: model loaded
main: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='', is_first_sp=true) %}{%- for message in messages %}{%- if message['role'] == 'system' %}{%- if ns.is_first_sp %}{% set ns.system_prompt = ns.system_prompt + message['content'] %}{% set ns.is_first_sp = false %}{%- else %}{% set ns.system_prompt = ns.system_prompt + '\n\n' + message['content'] %}{%- endif %}{%- endif %}{%- endfor %}{{ bos_token }}{{ ns.system_prompt }}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and 'tool_calls' in message %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls'] %}{%- if not ns.is_first %}{%- if message['content'] is none %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{%- else %}{{'<|Assistant|>' + message['content'] + '<|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{%- endif %}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{%- endif %}{%- endfor %}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- if message['role'] == 'assistant' and 'tool_calls' not in message %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}, example_format: 'You are a helpful assistant

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
main: server is listening on http://0.0.0.0:8001 - starting the main loop
@MB7979

MB7979 commented May 5, 2025

Might be related to the issues many of us doing partial offloading have been having since the MLA commit.

For me the gibberish began with the MLA commit, but I was able to somewhat work around it by enabling the Q8_0 K cache. I see you have the Q4_0 K cache enabled, so perhaps you are only just hitting it now, for whatever reason, with the newer changes.
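A minimal sketch of that workaround, reusing the flags from the script above (the model path is a placeholder):

# raise the K cache from q4_0 to q8_0
llama-server --model /path/to/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
    --cache-type-k q8_0 --ctx-size 16384 --n-gpu-layers 0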

@JohannesGaessler
Collaborator

Calculating perplexity on Wikitext using Deepseek V2 Lite q4_0 with max. GPU layers:

[1]3.9466,[2]5.4965,[3]5.1995,[4]5.6518,[5]6.1076,[6]6.8265,[7]7.0578,[8]7.4052,[9]7.8312,[10]8.1759,

With -ngl 0:

[1]nan,[2]nan,[3]nan,[4]nan,[5]nan,[6]nan,[7]nan,[8]nan,[9]nan,[10]nan,

With -ngl 23:

[1]nan,[2]nan,[3]nan,[4]nan,[5]nan,[6]nan,[7]nan,[8]nan,[9]nan,[10]nan,

With -ngl 24:

[1]3.9466,[2]5.4965,[3]5.1995,[4]5.6518,[5]6.1076,[6]6.8265,[7]nan,[8]nan,[9]nan,[10]nan,

With -ngl 25:

[1]3.9466,[2]5.4965,[3]5.1995,[4]5.6518,[5]6.1076,[6]6.8265,[7]7.0578,[8]7.4052,[9]7.8312,[10]8.1759,

Without a GPU at all:

[1]3.9699,[2]5.5112,[3]5.2249,[4]5.6893,[5]6.1456,[6]6.8579,[7]7.0851,[8]7.4328,[9]7.8667,[10]8.2129,

So it does look like there are issues with partial offloading.
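For anyone reproducing, the runs above are of the kind produced by the llama-perplexity tool; a hedged sketch (model and dataset paths are assumptions):

# vary -ngl between 0, 23, 24, 25 and full offload to reproduce the table above
./build/bin/llama-perplexity -m deepseek-v2-lite-q4_0.gguf -f wiki.test.raw -ngl 0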

@danielhanchen
Contributor

Oh yes, I was supposed to bring the discussion over from https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD/discussions/2 - thanks @joesixpaq for the extensive tests.

I think the MLA commit might still have some issues - the new MMQ commit @JohannesGaessler made should, I think, be fixable with some small fixes, but I'm unsure whether there are interactions between the two.

The only thing I can confirm for now is that full GPU offloading (i.e. all layers) seems to work OK - most people are getting gibberish output when offloading to the CPU.

@joesixpaq
Author

The problem disappeared when I specified --no-kv-offload on the first broken commit, e1e8e09.

In another try, I additionally offloaded 5 layers to the GPU (the maximum for the RTX 3090 here), and it worked as well.
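A minimal sketch of those two working configurations, assuming the same server flags as the script above (the model path is a placeholder):

# keep the KV cache on the CPU
llama-server --model /path/to/model.gguf --no-kv-offload --n-gpu-layers 0
# additionally offload 5 repeating layers (the most that fit on the RTX 3090 here)
llama-server --model /path/to/model.gguf --no-kv-offload --n-gpu-layers 5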

@joesixpaq
Author

It works with the latest commit 66645a5 and --no-kv-offload.

/home/ai/workspace/LLAMA_CPP/66645a5285d8c4c5f9a3b3f360d042baac2d820a/llama.cpp/build/bin/llama-server
Waiting for server to start...
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: Quadro M2000, compute capability 5.2, VMM: yes
build: 5281 (66645a5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 8, n_threads_batch = 8, total_threads = 32

system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CUDA : ARCHS = 500,610,700,750,800 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 8000, http threads: 31
main: loading model
srv load_model: loading model '/mnt/AI/LLM/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23873 MiB free
llama_model_loader: additional 4 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 48 key-value pairs and 1025 tensors from /mnt/AI/LLM/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek R1 BF16
llama_model_loader: - kv 3: general.quantized_by str = Unsloth
llama_model_loader: - kv 4: general.size_label str = 256x20B
llama_model_loader: - kv 5: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 6: deepseek2.block_count u32 = 61
llama_model_loader: - kv 7: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 8: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 9: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 10: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 11: deepseek2.attention.head_count_kv u32 = 128
llama_model_loader: - kv 12: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 13: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 15: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 16: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 17: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 18: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 19: deepseek2.attention.key_length u32 = 192
llama_model_loader: - kv 20: deepseek2.attention.value_length u32 = 128
llama_model_loader: - kv 21: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 22: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 23: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 24: deepseek2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 25: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 26: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 27: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 28: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 29: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 30: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 31: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 33: tokenizer.ggml.pre str = deepseek-v3
llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 128815
llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 41: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 42: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 43: general.quantization_version u32 = 2
llama_model_loader: - kv 44: general.file_type u32 = 10
llama_model_loader: - kv 45: split.no u16 = 0
llama_model_loader: - kv 46: split.tensors.count i32 = 1025
llama_model_loader: - kv 47: split.count u16 = 5
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q2_K: 171 tensors
llama_model_loader: - type q3_K: 3 tensors
llama_model_loader: - type q4_K: 306 tensors
llama_model_loader: - type q6_K: 184 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q2_K - Medium
print_info: file size = 211.03 GiB (2.70 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 819
load: token to piece cache size = 0.8223 MB
print_info: arch = deepseek2
print_info: vocab_only = 0
print_info: n_ctx_train = 163840
print_info: n_embd = 7168
print_info: n_layer = 61
print_info: n_head = 128
print_info: n_head_kv = 128
print_info: n_rot = 64
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 192
print_info: n_embd_head_v = 128
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 24576
print_info: n_embd_v_gqa = 16384
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 18432
print_info: n_expert = 256
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = yarn
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 0.025
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 671B
print_info: model params = 671.03 B
print_info: general.name = DeepSeek R1 BF16
print_info: n_layer_dense_lead = 3
print_info: n_lora_q = 1536
print_info: n_lora_kv = 512
print_info: n_embd_head_k_mla = 0
print_info: n_embd_head_v_mla = 0
print_info: n_ff_exp = 2048
print_info: n_expert_shared = 1
print_info: expert_weights_scale = 2.5
print_info: expert_weights_norm = 1
print_info: expert_gating_func = sigmoid
print_info: rope_yarn_log_mul = 0.1000
print_info: vocab type = BPE
print_info: n_vocab = 129280
print_info: n_merges = 127741
print_info: BOS token = 0 '<|begin▁of▁sentence|>'
print_info: EOS token = 1 '<|end▁of▁sentence|>'
print_info: EOT token = 1 '<|end▁of▁sentence|>'
print_info: PAD token = 128815 '<|PAD▁TOKEN|>'
print_info: LF token = 201 'Ċ'
print_info: FIM PRE token = 128801 '<|fim▁begin|>'
print_info: FIM SUF token = 128800 '<|fim▁hole|>'
print_info: FIM MID token = 128802 '<|fim▁end|>'
print_info: EOG token = 1 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/62 layers to GPU
load_tensors: CUDA_Host model buffer size = 215601.95 MiB
load_tensors: CPU model buffer size = 497.11 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch = 4096
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 10000.0
llama_context: freq_scale = 0.025
llama_context: n_ctx_per_seq (16384) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.49 MiB
llama_kv_cache_unified: kv_size = 16384, type_k = 'q4_0', type_v = 'f16', n_layer = 61, can_shift = 1, padding = 32
llama_kv_cache_unified: CPU KV buffer size = 44408.00 MiB
llama_kv_cache_unified: KV self size = 44408.00 MiB, K (q4_0): 13176.00 MiB, V (f16): 31232.00 MiB
llama_context: CUDA0 compute buffer size = 2783.50 MiB
llama_context: CUDA_Host compute buffer size = 4256.01 MiB
llama_context: graph nodes = 4842
llama_context: graph splits = 1087 (with bs=512), 1 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16384
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 16384
main: model loaded
main: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='', is_first_sp=true) %}{%- for message in messages %}{%- if message['role'] == 'system' %}{%- if ns.is_first_sp %}{% set ns.system_prompt = ns.system_prompt + message['content'] %}{% set ns.is_first_sp = false %}{%- else %}{% set ns.system_prompt = ns.system_prompt + '\n\n' + message['content'] %}{%- endif %}{%- endif %}{%- endfor %}{{ bos_token }}{{ ns.system_prompt }}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and 'tool_calls' in message %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls'] %}{%- if not ns.is_first %}{%- if message['content'] is none %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + 'json' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{%- else %}{{'<|Assistant|>' + message['content'] + '<|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + 'json' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{%- endif %}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + 'json' + '\n' + tool['function']['arguments'] + '\n' + '' + '<|tool▁call▁end|>'}}{%- endif %}{%- endfor %}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- if message['role'] == 'assistant' and 'tool_calls' not in message %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '' in content %}{% set content = content.split('')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}, example_format: 'You are a helpful assistant

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
main: server is listening on http://0.0.0.0:8000 - starting the main loop
Server ready! Launching browser...
Opening in existing browser session.

@JohannesGaessler
Collaborator

@danielhanchen since you're already here, can I recruit you to help with investigating #13287 (comment) ?

@danielhanchen
Contributor

I could try, but I doubt I could be helpful! I've only recently started coding on the outer layers of llama.cpp, but not any internals

@slaren
Member

slaren commented May 5, 2025

So it does look like there are issues with partial offloading.

It seems to work with a BF16 model. Maybe some padding that is not cleared correctly when copying the quantized tensors to VRAM?

@JohannesGaessler
Collaborator

Yes, I already figured out that the issue specifically occurs for the combination of MMQ and ne00 % 256 != 0.

@JohannesGaessler
Collaborator

I could try, but I doubt I could be helpful! I've only recently started coding on the outer layers of llama.cpp, but not any internals

What I meant is, since you are already running extensive tests with perplexity and KL divergence, could you check whether my changes have made the model predictions worse in a statistically significant way?

@JohannesGaessler
Collaborator

Should be fixed by #13320.

@MB7979

MB7979 commented May 5, 2025

This fixed the issues I’ve been having. Thank you very much.

@danielhanchen
Contributor

danielhanchen commented May 8, 2025

@JohannesGaessler I was just checking all your new changes - great work as usual!

imatrix.cpp sadly still gets errors with:

CUDA error: invalid configuration argument
  current device: 0, in function ggml_cuda_mul_mat_id at llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2062
  cudaGetLastError()

i.e. at:

get_rows_cuda(src1->data, src1->type, ids_to_sorted, src1_sorted.ptr, type_src1_sorted,
        ne10, nb11, nb12, nb13,
        ne_get_rows, 1, 1, sizeof(int32_t), ne_get_rows*sizeof(int32_t), ne_get_rows*sizeof(int32_t),
        ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted, stream);
    CUDA_CHECK(cudaGetLastError());

Using --ubatch-size 8192 causes the error to occur on Qwen 3 30B MoE. --ubatch-size 8191 works fine.

My suspicion is that CUDA (I think) requires these arguments to be < INT_MAX. Qwen has 128 experts and an input dim of 2048, so 8192 * 2048 * 128 = 2147483648 > 2147483647 (INT_MAX).

8191 * 2048 * 128 = 2147221504, which is less than INT_MAX.

I.e. one of the arguments:

ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted

is exceeding INT_MAX, thus causing CUDA to error out.
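A quick shell sanity check of that arithmetic (bash arithmetic is 64-bit, so the products don't wrap; the 128-expert and 2048-dim figures are the Qwen3 MoE numbers above):

echo $(( 8192 * 2048 * 128 ))   # 2147483648 > INT_MAX (2147483647)
echo $(( 8191 * 2048 * 128 ))   # 2147221504 < INT_MAX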

I'll make a new issue!
