Bug: Unexpected output from Granite 3.0 MoE 1b when all layers on NVIDIA GPU #9991

Closed
@gabe-l-hart

Description

What happened?

Overview

As we launched the Granite 3.0 models today, we found that one of them, the 1b-a400m MoE model, produces garbled output if all of its layers are placed on an NVIDIA GPU. If we keep the last two layers off the GPU, the model performs as expected.

I'm opening this ticket to track my own investigation of the issue, and to see whether it triggers any thoughts for others in the community.

Details

We originally noticed this via the ollama integration, but I've been able to reproduce it with llama-cli directly, so I'm trying to isolate the issue here. I've done the following investigation so far (a quick offload-sweep sketch follows the list):

  • I've tried F32, F16, and Q4_K_M variants with no change, so this doesn't appear to be related to dtype or quantization.
  • I've run all of these same experiments with the larger MoE version (3b-a800m) and see none of the same behavior there. It has the same architecture, but subtly different parameters:
    • hidden_size: 1024 (1b) vs 1536 (3b)
    • num_attention_heads: 16 (1b) vs 24 (3b)
    • num_hidden_layers: 24 (1b) vs 32 (3b)
    • num_local_experts: 32 (1b) vs 40 (3b)
    • router_aux_loss_coef: 0.0 (1b) vs 0.001 (3b)
    • vocab_size: 49155 (1b) vs 49152 (3b)
  • Isolating to a single GPU (--split-mode none) does not change the results
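
To bisect exactly where the degradation starts, a quick sweep over the number of offloaded layers can be run with greedy sampling so the runs are comparable. This is only a sketch: the model path matches the F32 conversion below, and I'm assuming --temp 0 (greedy) and logs-on-stderr behave the same on this build.

# Sweep -ngl to find the first offload count that produces garbled output
MODEL=granite-3.0-1b-a400m-instruct/granite-3.0-1B-a400M-instruct-F32.gguf
for ngl in 20 21 22 23 24 25; do
  echo "=== -ngl ${ngl} ==="
  # --temp 0 forces greedy sampling; 2>/dev/null hides the loader logs
  llama-cli -m "${MODEL}" -ngl "${ngl}" -p "hi" -n 10 --temp 0 2>/dev/null
done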

Experiments

  • When run with llama-cli -m <model> -ngl 23 -p "hi" -n 10, the output is sane:

    logs
    llama-cli -m granite-3.0-1b-a400m-instruct/granite-3.0-1B-a400M-instruct-F32.gguf -ngl 23 -p "hi" -n 10
    ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
    ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
    ggml_cuda_init: found 8 CUDA devices:
      Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 2: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 3: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 4: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 5: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 6: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 7: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
    build: 3953 (994cfb1) with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
    main: llama backend init
    main: load the model and apply lora adapter, if any
    llama_load_model_from_file: using device CUDA0 (NVIDIA A100-SXM4-80GB) - 80732 MiB free
    llama_load_model_from_file: using device CUDA1 (NVIDIA A100-SXM4-80GB) - 80732 MiB free
    llama_load_model_from_file: using device CUDA2 (NVIDIA A100-SXM4-80GB) - 80732 MiB free
    llama_load_model_from_file: using device CUDA3 (NVIDIA A100-SXM4-80GB) - 80732 MiB free
    llama_load_model_from_file: using device CUDA4 (NVIDIA A100-SXM4-80GB) - 67189 MiB free
    llama_load_model_from_file: using device CUDA5 (NVIDIA A100-SXM4-80GB) - 68997 MiB free
    llama_load_model_from_file: using device CUDA6 (NVIDIA A100-SXM4-80GB) - 29537 MiB free
    llama_load_model_from_file: using device CUDA7 (NVIDIA A100-SXM4-80GB) - 70903 MiB free
    llama_model_loader: loaded meta data with 38 key-value pairs and 242 tensors from granite-3.0-1b-a400m-instruct/granite-3.0-1B-a400M-instruct-F32.gguf (version GGUF V3 (latest))
    llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
    llama_model_loader: - kv 0: general.architecture str = granitemoe
    llama_model_loader: - kv 1: general.type str = model
    llama_model_loader: - kv 2: general.name str = Granite 3.0 1b A400M Instruct
    llama_model_loader: - kv 3: general.finetune str = instruct
    llama_model_loader: - kv 4: general.basename str = granite-3.0
    llama_model_loader: - kv 5: general.size_label str = 1B-a400M
    llama_model_loader: - kv 6: general.license str = apache-2.0
    llama_model_loader: - kv 7: general.tags arr[str,3] = ["language", "granite-3.0", "text-gen...
    llama_model_loader: - kv 8: granitemoe.block_count u32 = 24
    llama_model_loader: - kv 9: granitemoe.context_length u32 = 4096
    llama_model_loader: - kv 10: granitemoe.embedding_length u32 = 1024
    llama_model_loader: - kv 11: granitemoe.feed_forward_length u32 = 512
    llama_model_loader: - kv 12: granitemoe.attention.head_count u32 = 16
    llama_model_loader: - kv 13: granitemoe.attention.head_count_kv u32 = 8
    llama_model_loader: - kv 14: granitemoe.rope.freq_base f32 = 10000.000000
    llama_model_loader: - kv 15: granitemoe.attention.layer_norm_rms_epsilon f32 = 0.000001
    llama_model_loader: - kv 16: granitemoe.expert_count u32 = 32
    llama_model_loader: - kv 17: granitemoe.expert_used_count u32 = 8
    llama_model_loader: - kv 18: general.file_type u32 = 0
    llama_model_loader: - kv 19: granitemoe.vocab_size u32 = 49155
    llama_model_loader: - kv 20: granitemoe.rope.dimension_count u32 = 64
    llama_model_loader: - kv 21: tokenizer.ggml.add_space_prefix bool = false
    llama_model_loader: - kv 22: granitemoe.attention.scale f32 = 0.015625
    llama_model_loader: - kv 23: granitemoe.embedding_scale f32 = 12.000000
    llama_model_loader: - kv 24: granitemoe.residual_scale f32 = 0.220000
    llama_model_loader: - kv 25: granitemoe.logit_scale f32 = 6.000000
    llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
    llama_model_loader: - kv 27: tokenizer.ggml.pre str = refact
    llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,49155] = ["<|end_of_text|>", "<fim_prefix>", "...
    llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,49155] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
    llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,48891] = ["Ġ Ġ", "ĠĠ ĠĠ", "ĠĠĠĠ ĠĠ...
    llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 0
    llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 0
    llama_model_loader: - kv 33: tokenizer.ggml.unknown_token_id u32 = 0
    llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 0
    llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
    llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|start_of_r...
    llama_model_loader: - kv 37: general.quantization_version u32 = 2
    llama_model_loader: - type f32: 242 tensors
    llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
    llm_load_vocab: special tokens cache size = 22
    llm_load_vocab: token to piece cache size = 0.2826 MB
    llm_load_print_meta: format = GGUF V3 (latest)
    llm_load_print_meta: arch = granitemoe
    llm_load_print_meta: vocab type = BPE
    llm_load_print_meta: n_vocab = 49155
    llm_load_print_meta: n_merges = 48891
    llm_load_print_meta: vocab_only = 0
    llm_load_print_meta: n_ctx_train = 4096
    llm_load_print_meta: n_embd = 1024
    llm_load_print_meta: n_layer = 24
    llm_load_print_meta: n_head = 16
    llm_load_print_meta: n_head_kv = 8
    llm_load_print_meta: n_rot = 64
    llm_load_print_meta: n_swa = 0
    llm_load_print_meta: n_embd_head_k = 64
    llm_load_print_meta: n_embd_head_v = 64
    llm_load_print_meta: n_gqa = 2
    llm_load_print_meta: n_embd_k_gqa = 512
    llm_load_print_meta: n_embd_v_gqa = 512
    llm_load_print_meta: f_norm_eps = 0.0e+00
    llm_load_print_meta: f_norm_rms_eps = 1.0e-06
    llm_load_print_meta: f_clamp_kqv = 0.0e+00
    llm_load_print_meta: f_max_alibi_bias = 0.0e+00
    llm_load_print_meta: f_logit_scale = 6.0e+00
    llm_load_print_meta: n_ff = 512
    llm_load_print_meta: n_expert = 32
    llm_load_print_meta: n_expert_used = 8
    llm_load_print_meta: causal attn = 1
    llm_load_print_meta: pooling type = 0
    llm_load_print_meta: rope type = 0
    llm_load_print_meta: rope scaling = linear
    llm_load_print_meta: freq_base_train = 10000.0
    llm_load_print_meta: freq_scale_train = 1
    llm_load_print_meta: n_ctx_orig_yarn = 4096
    llm_load_print_meta: rope_finetuned = unknown
    llm_load_print_meta: ssm_d_conv = 0
    llm_load_print_meta: ssm_d_inner = 0
    llm_load_print_meta: ssm_d_state = 0
    llm_load_print_meta: ssm_dt_rank = 0
    llm_load_print_meta: ssm_dt_b_c_rms = 0
    llm_load_print_meta: model type = ?B
    llm_load_print_meta: model ftype = all F32
    llm_load_print_meta: model params = 1.33 B
    llm_load_print_meta: model size = 4.97 GiB (32.00 BPW)
    llm_load_print_meta: general.name = Granite 3.0 1b A400M Instruct
    llm_load_print_meta: BOS token = 0 '<|end_of_text|>'
    llm_load_print_meta: EOS token = 0 '<|end_of_text|>'
    llm_load_print_meta: UNK token = 0 '<|end_of_text|>'
    llm_load_print_meta: PAD token = 0 '<|end_of_text|>'
    llm_load_print_meta: LF token = 145 'Ä'
    llm_load_print_meta: EOG token = 0 '<|end_of_text|>'
    llm_load_print_meta: max token length = 512
    llm_load_print_meta: f_embedding_scale = 12.000000
    llm_load_print_meta: f_residual_scale = 0.220000
    llm_load_print_meta: f_attention_scale = 0.015625
    llm_load_tensors: ggml ctx size = 0.99 MiB
    llm_load_tensors: offloading 23 repeating layers to GPU
    llm_load_tensors: offloaded 23/25 layers to GPU
    llm_load_tensors: CPU buffer size = 5091.20 MiB
    llm_load_tensors: CUDA0 buffer size = 816.53 MiB
    llm_load_tensors: CUDA1 buffer size = 612.40 MiB
    llm_load_tensors: CUDA2 buffer size = 612.40 MiB
    llm_load_tensors: CUDA3 buffer size = 816.53 MiB
    llm_load_tensors: CUDA4 buffer size = 612.40 MiB
    llm_load_tensors: CUDA5 buffer size = 408.27 MiB
    llm_load_tensors: CUDA6 buffer size = 408.27 MiB
    llm_load_tensors: CUDA7 buffer size = 408.27 MiB
    ................................................................................
    llama_new_context_with_model: n_ctx = 4096
    llama_new_context_with_model: n_batch = 2048
    llama_new_context_with_model: n_ubatch = 512
    llama_new_context_with_model: flash_attn = 0
    llama_new_context_with_model: freq_base = 10000.0
    llama_new_context_with_model: freq_scale = 1
    llama_kv_cache_init: CUDA_Host KV buffer size = 8.00 MiB
    llama_kv_cache_init: CUDA0 KV buffer size = 32.00 MiB
    llama_kv_cache_init: CUDA1 KV buffer size = 24.00 MiB
    llama_kv_cache_init: CUDA2 KV buffer size = 24.00 MiB
    llama_kv_cache_init: CUDA3 KV buffer size = 32.00 MiB
    llama_kv_cache_init: CUDA4 KV buffer size = 24.00 MiB
    llama_kv_cache_init: CUDA5 KV buffer size = 16.00 MiB
    llama_kv_cache_init: CUDA6 KV buffer size = 16.00 MiB
    llama_kv_cache_init: CUDA7 KV buffer size = 16.00 MiB
    llama_new_context_with_model: KV self size = 192.00 MiB, K (f16): 96.00 MiB, V (f16): 96.00 MiB
    llama_new_context_with_model: CUDA_Host output buffer size = 0.19 MiB
    llama_new_context_with_model: CUDA0 compute buffer size = 290.02 MiB
    llama_new_context_with_model: CUDA1 compute buffer size = 144.00 MiB
    llama_new_context_with_model: CUDA2 compute buffer size = 144.00 MiB
    llama_new_context_with_model: CUDA3 compute buffer size = 144.00 MiB
    llama_new_context_with_model: CUDA4 compute buffer size = 144.00 MiB
    llama_new_context_with_model: CUDA5 compute buffer size = 144.00 MiB
    llama_new_context_with_model: CUDA6 compute buffer size = 144.00 MiB
    llama_new_context_with_model: CUDA7 compute buffer size = 144.00 MiB
    llama_new_context_with_model: CUDA_Host compute buffer size = 10.01 MiB
    llama_new_context_with_model: graph nodes = 1472
    llama_new_context_with_model: graph splits = 23
    common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
    main: llama threadpool init, n_threads = 40

    system_info: n_threads = 40 (n_threads_batch = 40) / 80 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

    sampler seed: 3838250562
    sampler params:
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
    sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
    generate: n_ctx = 4096, n_batch = 2048, n_predict = 10, n_keep = 0

    hi NAME: Hello! How can I assist you today

  • When run with llama-cli -m <model> -ngl 24 -p "hi" -n 10, the output is a small number of sane tokens, followed by a sequence of 4s

    logs
    llama-cli -m granite-3.0-1b-a400m-instruct/granite-3.0-1B-a400M-instruct-F32.gguf -ngl 24 -p "hi" -n 10
    ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
    ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
    ggml_cuda_init: found 8 CUDA devices:
      Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 2: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 3: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 4: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 5: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 6: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 7: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
    build: 3953 (994cfb1a) with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
    main: llama backend init
    main: load the model and apply lora adapter, if any
    llama_load_model_from_file: using device CUDA0 (NVIDIA A100-SXM4-80GB) - 80732 MiB free
    llama_load_model_from_file: using device CUDA1 (NVIDIA A100-SXM4-80GB) - 80732 MiB free
    llama_load_model_from_file: using device CUDA2 (NVIDIA A100-SXM4-80GB) - 80732 MiB free
    llama_load_model_from_file: using device CUDA3 (NVIDIA A100-SXM4-80GB) - 80732 MiB free
    llama_load_model_from_file: using device CUDA4 (NVIDIA A100-SXM4-80GB) - 67189 MiB free
    llama_load_model_from_file: using device CUDA5 (NVIDIA A100-SXM4-80GB) - 68997 MiB free
    llama_load_model_from_file: using device CUDA6 (NVIDIA A100-SXM4-80GB) - 29537 MiB free
    llama_load_model_from_file: using device CUDA7 (NVIDIA A100-SXM4-80GB) - 70903 MiB free
    llama_model_loader: loaded meta data with 38 key-value pairs and 242 tensors from granite-3.0-1b-a400m-instruct/granite-3.0-1B-a400M-instruct-F32.gguf (version GGUF V3 (latest))
    llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
    llama_model_loader: - kv   0:                       general.architecture str              = granitemoe
    llama_model_loader: - kv   1:                               general.type str              = model
    llama_model_loader: - kv   2:                               general.name str              = Granite 3.0 1b A400M Instruct
    llama_model_loader: - kv   3:                           general.finetune str              = instruct
    llama_model_loader: - kv   4:                           general.basename str              = granite-3.0
    llama_model_loader: - kv   5:                         general.size_label str              = 1B-a400M
    llama_model_loader: - kv   6:                            general.license str              = apache-2.0
    llama_model_loader: - kv   7:                               general.tags arr[str,3]       = ["language", "granite-3.0", "text-gen...
    llama_model_loader: - kv   8:                     granitemoe.block_count u32              = 24
    llama_model_loader: - kv   9:                  granitemoe.context_length u32              = 4096
    llama_model_loader: - kv  10:                granitemoe.embedding_length u32              = 1024
    llama_model_loader: - kv  11:             granitemoe.feed_forward_length u32              = 512
    llama_model_loader: - kv  12:            granitemoe.attention.head_count u32              = 16
    llama_model_loader: - kv  13:         granitemoe.attention.head_count_kv u32              = 8
    llama_model_loader: - kv  14:                  granitemoe.rope.freq_base f32              = 10000.000000
    llama_model_loader: - kv  15: granitemoe.attention.layer_norm_rms_epsilon f32              = 0.000001
    llama_model_loader: - kv  16:                    granitemoe.expert_count u32              = 32
    llama_model_loader: - kv  17:               granitemoe.expert_used_count u32              = 8
    llama_model_loader: - kv  18:                          general.file_type u32              = 0
    llama_model_loader: - kv  19:                      granitemoe.vocab_size u32              = 49155
    llama_model_loader: - kv  20:            granitemoe.rope.dimension_count u32              = 64
    llama_model_loader: - kv  21:            tokenizer.ggml.add_space_prefix bool             = false
    llama_model_loader: - kv  22:                 granitemoe.attention.scale f32              = 0.015625
    llama_model_loader: - kv  23:                 granitemoe.embedding_scale f32              = 12.000000
    llama_model_loader: - kv  24:                  granitemoe.residual_scale f32              = 0.220000
    llama_model_loader: - kv  25:                     granitemoe.logit_scale f32              = 6.000000
    llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
    llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = refact
    llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,49155]   = ["<|end_of_text|>", "<fim_prefix>", "...
    llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,49155]   = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
    llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,48891]   = ["Ġ Ġ", "ĠĠ ĠĠ", "ĠĠĠĠ ĠĠ...
    llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 0
    llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 0
    llama_model_loader: - kv  33:            tokenizer.ggml.unknown_token_id u32              = 0
    llama_model_loader: - kv  34:            tokenizer.ggml.padding_token_id u32              = 0
    llama_model_loader: - kv  35:               tokenizer.ggml.add_bos_token bool             = false
    llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|start_of_r...
    llama_model_loader: - kv  37:               general.quantization_version u32              = 2
    llama_model_loader: - type  f32:  242 tensors
    llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
    llm_load_vocab: special tokens cache size = 22
    llm_load_vocab: token to piece cache size = 0.2826 MB
    llm_load_print_meta: format           = GGUF V3 (latest)
    llm_load_print_meta: arch             = granitemoe
    llm_load_print_meta: vocab type       = BPE
    llm_load_print_meta: n_vocab          = 49155
    llm_load_print_meta: n_merges         = 48891
    llm_load_print_meta: vocab_only       = 0
    llm_load_print_meta: n_ctx_train      = 4096
    llm_load_print_meta: n_embd           = 1024
    llm_load_print_meta: n_layer          = 24
    llm_load_print_meta: n_head           = 16
    llm_load_print_meta: n_head_kv        = 8
    llm_load_print_meta: n_rot            = 64
    llm_load_print_meta: n_swa            = 0
    llm_load_print_meta: n_embd_head_k    = 64
    llm_load_print_meta: n_embd_head_v    = 64
    llm_load_print_meta: n_gqa            = 2
    llm_load_print_meta: n_embd_k_gqa     = 512
    llm_load_print_meta: n_embd_v_gqa     = 512
    llm_load_print_meta: f_norm_eps       = 0.0e+00
    llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
    llm_load_print_meta: f_clamp_kqv      = 0.0e+00
    llm_load_print_meta: f_max_alibi_bias = 0.0e+00
    llm_load_print_meta: f_logit_scale    = 6.0e+00
    llm_load_print_meta: n_ff             = 512
    llm_load_print_meta: n_expert         = 32
    llm_load_print_meta: n_expert_used    = 8
    llm_load_print_meta: causal attn      = 1
    llm_load_print_meta: pooling type     = 0
    llm_load_print_meta: rope type        = 0
    llm_load_print_meta: rope scaling     = linear
    llm_load_print_meta: freq_base_train  = 10000.0
    llm_load_print_meta: freq_scale_train = 1
    llm_load_print_meta: n_ctx_orig_yarn  = 4096
    llm_load_print_meta: rope_finetuned   = unknown
    llm_load_print_meta: ssm_d_conv       = 0
    llm_load_print_meta: ssm_d_inner      = 0
    llm_load_print_meta: ssm_d_state      = 0
    llm_load_print_meta: ssm_dt_rank      = 0
    llm_load_print_meta: ssm_dt_b_c_rms   = 0
    llm_load_print_meta: model type       = ?B
    llm_load_print_meta: model ftype      = all F32
    llm_load_print_meta: model params     = 1.33 B
    llm_load_print_meta: model size       = 4.97 GiB (32.00 BPW) 
    llm_load_print_meta: general.name     = Granite 3.0 1b A400M Instruct
    llm_load_print_meta: BOS token        = 0 '<|end_of_text|>'
    llm_load_print_meta: EOS token        = 0 '<|end_of_text|>'
    llm_load_print_meta: UNK token        = 0 '<|end_of_text|>'
    llm_load_print_meta: PAD token        = 0 '<|end_of_text|>'
    llm_load_print_meta: LF token         = 145 'Ä'
    llm_load_print_meta: EOG token        = 0 '<|end_of_text|>'
    llm_load_print_meta: max token length = 512
    llm_load_print_meta: f_embedding_scale = 12.000000
    llm_load_print_meta: f_residual_scale  = 0.220000
    llm_load_print_meta: f_attention_scale = 0.015625
    llm_load_tensors: ggml ctx size =    0.99 MiB
    llm_load_tensors: offloading 24 repeating layers to GPU
    llm_load_tensors: offloaded 24/25 layers to GPU
    llm_load_tensors:        CPU buffer size =  5091.20 MiB
    llm_load_tensors:      CUDA0 buffer size =   816.53 MiB
    llm_load_tensors:      CUDA1 buffer size =   612.40 MiB
    llm_load_tensors:      CUDA2 buffer size =   816.53 MiB
    llm_load_tensors:      CUDA3 buffer size =   612.40 MiB
    llm_load_tensors:      CUDA4 buffer size =   612.40 MiB
    llm_load_tensors:      CUDA5 buffer size =   612.40 MiB
    llm_load_tensors:      CUDA6 buffer size =   204.13 MiB
    llm_load_tensors:      CUDA7 buffer size =   612.40 MiB
    ................................................................................
    llama_new_context_with_model: n_ctx      = 4096
    llama_new_context_with_model: n_batch    = 2048
    llama_new_context_with_model: n_ubatch   = 512
    llama_new_context_with_model: flash_attn = 0
    llama_new_context_with_model: freq_base  = 10000.0
    llama_new_context_with_model: freq_scale = 1
    llama_kv_cache_init:      CUDA0 KV buffer size =    32.00 MiB
    llama_kv_cache_init:      CUDA1 KV buffer size =    24.00 MiB
    llama_kv_cache_init:      CUDA2 KV buffer size =    32.00 MiB
    llama_kv_cache_init:      CUDA3 KV buffer size =    24.00 MiB
    llama_kv_cache_init:      CUDA4 KV buffer size =    24.00 MiB
    llama_kv_cache_init:      CUDA5 KV buffer size =    24.00 MiB
    llama_kv_cache_init:      CUDA6 KV buffer size =     8.00 MiB
    llama_kv_cache_init:      CUDA7 KV buffer size =    24.00 MiB
    llama_new_context_with_model: KV self size  =  192.00 MiB, K (f16):   96.00 MiB, V (f16):   96.00 MiB
    llama_new_context_with_model:  CUDA_Host  output buffer size =     0.19 MiB
    llama_new_context_with_model:      CUDA0 compute buffer size =   290.02 MiB
    llama_new_context_with_model:      CUDA1 compute buffer size =   144.00 MiB
    llama_new_context_with_model:      CUDA2 compute buffer size =   144.00 MiB
    llama_new_context_with_model:      CUDA3 compute buffer size =   144.00 MiB
    llama_new_context_with_model:      CUDA4 compute buffer size =   144.00 MiB
    llama_new_context_with_model:      CUDA5 compute buffer size =   144.00 MiB
    llama_new_context_with_model:      CUDA6 compute buffer size =   144.00 MiB
    llama_new_context_with_model:      CUDA7 compute buffer size =   144.00 MiB
    llama_new_context_with_model:  CUDA_Host compute buffer size =    10.01 MiB
    llama_new_context_with_model: graph nodes  = 1472
    llama_new_context_with_model: graph splits = 11
    common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
    main: llama threadpool init, n_threads = 40
    
    system_info: n_threads = 40 (n_threads_batch = 40) / 80 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
    
    sampler seed: 2086755151
    sampler params: 
      repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
      top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
      mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
    sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
    generate: n_ctx = 4096, n_batch = 2048, n_predict = 10, n_keep = 0
    
    hi later,44444444
    
  • With all 25/25 layers offloaded (-ngl 25), the results are similar to the -ngl 24 case, but not identical

    logs
    llama-cli -m granite-3.0-1b-a400m-instruct/granite-3.0-1B-a400M-instruct-F32.gguf -ngl 25 -p "hi" -n 10
    ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
    ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
    ggml_cuda_init: found 8 CUDA devices:
      Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 2: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 3: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 4: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 5: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 6: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 7: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
    build: 3953 (994cfb1a) with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
    main: llama backend init
    main: load the model and apply lora adapter, if any
    llama_load_model_from_file: using device CUDA0 (NVIDIA A100-SXM4-80GB) - 80732 MiB free
    llama_load_model_from_file: using device CUDA1 (NVIDIA A100-SXM4-80GB) - 80732 MiB free
    llama_load_model_from_file: using device CUDA2 (NVIDIA A100-SXM4-80GB) - 80732 MiB free
    llama_load_model_from_file: using device CUDA3 (NVIDIA A100-SXM4-80GB) - 80732 MiB free
    llama_load_model_from_file: using device CUDA4 (NVIDIA A100-SXM4-80GB) - 67189 MiB free
    llama_load_model_from_file: using device CUDA5 (NVIDIA A100-SXM4-80GB) - 68997 MiB free
    llama_load_model_from_file: using device CUDA6 (NVIDIA A100-SXM4-80GB) - 29537 MiB free
    llama_load_model_from_file: using device CUDA7 (NVIDIA A100-SXM4-80GB) - 70903 MiB free
    llama_model_loader: loaded meta data with 38 key-value pairs and 242 tensors from granite-3.0-1b-a400m-instruct/granite-3.0-1B-a400M-instruct-F32.gguf (version GGUF V3 (latest))
    llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
    llama_model_loader: - kv   0:                       general.architecture str              = granitemoe
    llama_model_loader: - kv   1:                               general.type str              = model
    llama_model_loader: - kv   2:                               general.name str              = Granite 3.0 1b A400M Instruct
    llama_model_loader: - kv   3:                           general.finetune str              = instruct
    llama_model_loader: - kv   4:                           general.basename str              = granite-3.0
    llama_model_loader: - kv   5:                         general.size_label str              = 1B-a400M
    llama_model_loader: - kv   6:                            general.license str              = apache-2.0
    llama_model_loader: - kv   7:                               general.tags arr[str,3]       = ["language", "granite-3.0", "text-gen...
    llama_model_loader: - kv   8:                     granitemoe.block_count u32              = 24
    llama_model_loader: - kv   9:                  granitemoe.context_length u32              = 4096
    llama_model_loader: - kv  10:                granitemoe.embedding_length u32              = 1024
    llama_model_loader: - kv  11:             granitemoe.feed_forward_length u32              = 512
    llama_model_loader: - kv  12:            granitemoe.attention.head_count u32              = 16
    llama_model_loader: - kv  13:         granitemoe.attention.head_count_kv u32              = 8
    llama_model_loader: - kv  14:                  granitemoe.rope.freq_base f32              = 10000.000000
    llama_model_loader: - kv  15: granitemoe.attention.layer_norm_rms_epsilon f32              = 0.000001
    llama_model_loader: - kv  16:                    granitemoe.expert_count u32              = 32
    llama_model_loader: - kv  17:               granitemoe.expert_used_count u32              = 8
    llama_model_loader: - kv  18:                          general.file_type u32              = 0
    llama_model_loader: - kv  19:                      granitemoe.vocab_size u32              = 49155
    llama_model_loader: - kv  20:            granitemoe.rope.dimension_count u32              = 64
    llama_model_loader: - kv  21:            tokenizer.ggml.add_space_prefix bool             = false
    llama_model_loader: - kv  22:                 granitemoe.attention.scale f32              = 0.015625
    llama_model_loader: - kv  23:                 granitemoe.embedding_scale f32              = 12.000000
    llama_model_loader: - kv  24:                  granitemoe.residual_scale f32              = 0.220000
    llama_model_loader: - kv  25:                     granitemoe.logit_scale f32              = 6.000000
    llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
    llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = refact
    llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,49155]   = ["<|end_of_text|>", "<fim_prefix>", "...
    llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,49155]   = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
    llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,48891]   = ["Ġ Ġ", "ĠĠ ĠĠ", "ĠĠĠĠ ĠĠ...
    llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 0
    llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 0
    llama_model_loader: - kv  33:            tokenizer.ggml.unknown_token_id u32              = 0
    llama_model_loader: - kv  34:            tokenizer.ggml.padding_token_id u32              = 0
    llama_model_loader: - kv  35:               tokenizer.ggml.add_bos_token bool             = false
    llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|start_of_r...
    llama_model_loader: - kv  37:               general.quantization_version u32              = 2
    llama_model_loader: - type  f32:  242 tensors
    llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
    llm_load_vocab: special tokens cache size = 22
    llm_load_vocab: token to piece cache size = 0.2826 MB
    llm_load_print_meta: format           = GGUF V3 (latest)
    llm_load_print_meta: arch             = granitemoe
    llm_load_print_meta: vocab type       = BPE
    llm_load_print_meta: n_vocab          = 49155
    llm_load_print_meta: n_merges         = 48891
    llm_load_print_meta: vocab_only       = 0
    llm_load_print_meta: n_ctx_train      = 4096
    llm_load_print_meta: n_embd           = 1024
    llm_load_print_meta: n_layer          = 24
    llm_load_print_meta: n_head           = 16
    llm_load_print_meta: n_head_kv        = 8
    llm_load_print_meta: n_rot            = 64
    llm_load_print_meta: n_swa            = 0
    llm_load_print_meta: n_embd_head_k    = 64
    llm_load_print_meta: n_embd_head_v    = 64
    llm_load_print_meta: n_gqa            = 2
    llm_load_print_meta: n_embd_k_gqa     = 512
    llm_load_print_meta: n_embd_v_gqa     = 512
    llm_load_print_meta: f_norm_eps       = 0.0e+00
    llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
    llm_load_print_meta: f_clamp_kqv      = 0.0e+00
    llm_load_print_meta: f_max_alibi_bias = 0.0e+00
    llm_load_print_meta: f_logit_scale    = 6.0e+00
    llm_load_print_meta: n_ff             = 512
    llm_load_print_meta: n_expert         = 32
    llm_load_print_meta: n_expert_used    = 8
    llm_load_print_meta: causal attn      = 1
    llm_load_print_meta: pooling type     = 0
    llm_load_print_meta: rope type        = 0
    llm_load_print_meta: rope scaling     = linear
    llm_load_print_meta: freq_base_train  = 10000.0
    llm_load_print_meta: freq_scale_train = 1
    llm_load_print_meta: n_ctx_orig_yarn  = 4096
    llm_load_print_meta: rope_finetuned   = unknown
    llm_load_print_meta: ssm_d_conv       = 0
    llm_load_print_meta: ssm_d_inner      = 0
    llm_load_print_meta: ssm_d_state      = 0
    llm_load_print_meta: ssm_dt_rank      = 0
    llm_load_print_meta: ssm_dt_b_c_rms   = 0
    llm_load_print_meta: model type       = ?B
    llm_load_print_meta: model ftype      = all F32
    llm_load_print_meta: model params     = 1.33 B
    llm_load_print_meta: model size       = 4.97 GiB (32.00 BPW) 
    llm_load_print_meta: general.name     = Granite 3.0 1b A400M Instruct
    llm_load_print_meta: BOS token        = 0 '<|end_of_text|>'
    llm_load_print_meta: EOS token        = 0 '<|end_of_text|>'
    llm_load_print_meta: UNK token        = 0 '<|end_of_text|>'
    llm_load_print_meta: PAD token        = 0 '<|end_of_text|>'
    llm_load_print_meta: LF token         = 145 'Ä'
    llm_load_print_meta: EOG token        = 0 '<|end_of_text|>'
    llm_load_print_meta: max token length = 512
    llm_load_print_meta: f_embedding_scale = 12.000000
    llm_load_print_meta: f_residual_scale  = 0.220000
    llm_load_print_meta: f_attention_scale = 0.015625
    llm_load_tensors: ggml ctx size =    0.99 MiB
    llm_load_tensors: offloading 24 repeating layers to GPU
    llm_load_tensors: offloading non-repeating layers to GPU
    llm_load_tensors: offloaded 25/25 layers to GPU
    llm_load_tensors:        CPU buffer size =   192.01 MiB
    llm_load_tensors:      CUDA0 buffer size =   816.53 MiB
    llm_load_tensors:      CUDA1 buffer size =   816.53 MiB
    llm_load_tensors:      CUDA2 buffer size =   612.40 MiB
    llm_load_tensors:      CUDA3 buffer size =   816.53 MiB
    llm_load_tensors:      CUDA4 buffer size =   612.40 MiB
    llm_load_tensors:      CUDA5 buffer size =   612.40 MiB
    llm_load_tensors:      CUDA6 buffer size =   204.13 MiB
    llm_load_tensors:      CUDA7 buffer size =   600.28 MiB
    ................................................................................
    llama_new_context_with_model: n_ctx      = 4096
    llama_new_context_with_model: n_batch    = 2048
    llama_new_context_with_model: n_ubatch   = 512
    llama_new_context_with_model: flash_attn = 0
    llama_new_context_with_model: freq_base  = 10000.0
    llama_new_context_with_model: freq_scale = 1
    llama_kv_cache_init:      CUDA0 KV buffer size =    32.00 MiB
    llama_kv_cache_init:      CUDA1 KV buffer size =    32.00 MiB
    llama_kv_cache_init:      CUDA2 KV buffer size =    24.00 MiB
    llama_kv_cache_init:      CUDA3 KV buffer size =    32.00 MiB
    llama_kv_cache_init:      CUDA4 KV buffer size =    24.00 MiB
    llama_kv_cache_init:      CUDA5 KV buffer size =    24.00 MiB
    llama_kv_cache_init:      CUDA6 KV buffer size =     8.00 MiB
    llama_kv_cache_init:      CUDA7 KV buffer size =    16.00 MiB
    llama_new_context_with_model: KV self size  =  192.00 MiB, K (f16):   96.00 MiB, V (f16):   96.00 MiB
    llama_new_context_with_model:  CUDA_Host  output buffer size =     0.19 MiB
    llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
    llama_new_context_with_model:      CUDA0 compute buffer size =   176.01 MiB
    llama_new_context_with_model:      CUDA1 compute buffer size =   176.01 MiB
    llama_new_context_with_model:      CUDA2 compute buffer size =   176.01 MiB
    llama_new_context_with_model:      CUDA3 compute buffer size =   176.01 MiB
    llama_new_context_with_model:      CUDA4 compute buffer size =   176.01 MiB
    llama_new_context_with_model:      CUDA5 compute buffer size =   176.01 MiB
    llama_new_context_with_model:      CUDA6 compute buffer size =   174.01 MiB
    llama_new_context_with_model:      CUDA7 compute buffer size =   176.02 MiB
    llama_new_context_with_model:  CUDA_Host compute buffer size =    34.02 MiB
    llama_new_context_with_model: graph nodes  = 1472
    llama_new_context_with_model: graph splits = 9
    common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
    main: llama threadpool init, n_threads = 40
    
    system_info: n_threads = 40 (n_threads_batch = 40) / 80 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
    
    sampler seed: 1719435385
    sampler params: 
      repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
      top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
      mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
    sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
    generate: n_ctx = 4096, n_batch = 2048, n_predict = 10, n_keep = 0
    
    hi al-44444444
    
  • When I disable KV cache offload (--no-kv-offload), the output is sane, but the run also emits GGML backend error messages. (A sketch for dumping per-op activations to localize the divergence follows these logs.)

    logs
    llama-cli -m granite-3.0-1b-a400m-instruct/granite-3.0-1B-a400M-instruct-F32.gguf -ngl 25 -p "hi" --no-kv-offload -n 10
    ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
    ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
    ggml_cuda_init: found 8 CUDA devices:
      Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 2: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 3: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 4: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 5: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 6: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
      Device 7: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
    build: 3953 (994cfb1a) with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
    main: llama backend init
    main: load the model and apply lora adapter, if any
    llama_load_model_from_file: using device CUDA0 (NVIDIA A100-SXM4-80GB) - 80732 MiB free
    llama_load_model_from_file: using device CUDA1 (NVIDIA A100-SXM4-80GB) - 80732 MiB free
    llama_load_model_from_file: using device CUDA2 (NVIDIA A100-SXM4-80GB) - 80732 MiB free
    llama_load_model_from_file: using device CUDA3 (NVIDIA A100-SXM4-80GB) - 80732 MiB free
    llama_load_model_from_file: using device CUDA4 (NVIDIA A100-SXM4-80GB) - 67189 MiB free
    llama_load_model_from_file: using device CUDA5 (NVIDIA A100-SXM4-80GB) - 68997 MiB free
    llama_load_model_from_file: using device CUDA6 (NVIDIA A100-SXM4-80GB) - 29537 MiB free
    llama_load_model_from_file: using device CUDA7 (NVIDIA A100-SXM4-80GB) - 70903 MiB free
    llama_model_loader: loaded meta data with 38 key-value pairs and 242 tensors from granite-3.0-1b-a400m-instruct/granite-3.0-1B-a400M-instruct-F32.gguf (version GGUF V3 (latest))
    llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
    llama_model_loader: - kv   0:                       general.architecture str              = granitemoe
    llama_model_loader: - kv   1:                               general.type str              = model
    llama_model_loader: - kv   2:                               general.name str              = Granite 3.0 1b A400M Instruct
    llama_model_loader: - kv   3:                           general.finetune str              = instruct
    llama_model_loader: - kv   4:                           general.basename str              = granite-3.0
    llama_model_loader: - kv   5:                         general.size_label str              = 1B-a400M
    llama_model_loader: - kv   6:                            general.license str              = apache-2.0
    llama_model_loader: - kv   7:                               general.tags arr[str,3]       = ["language", "granite-3.0", "text-gen...
    llama_model_loader: - kv   8:                     granitemoe.block_count u32              = 24
    llama_model_loader: - kv   9:                  granitemoe.context_length u32              = 4096
    llama_model_loader: - kv  10:                granitemoe.embedding_length u32              = 1024
    llama_model_loader: - kv  11:             granitemoe.feed_forward_length u32              = 512
    llama_model_loader: - kv  12:            granitemoe.attention.head_count u32              = 16
    llama_model_loader: - kv  13:         granitemoe.attention.head_count_kv u32              = 8
    llama_model_loader: - kv  14:                  granitemoe.rope.freq_base f32              = 10000.000000
    llama_model_loader: - kv  15: granitemoe.attention.layer_norm_rms_epsilon f32              = 0.000001
    llama_model_loader: - kv  16:                    granitemoe.expert_count u32              = 32
    llama_model_loader: - kv  17:               granitemoe.expert_used_count u32              = 8
    llama_model_loader: - kv  18:                          general.file_type u32              = 0
    llama_model_loader: - kv  19:                      granitemoe.vocab_size u32              = 49155
    llama_model_loader: - kv  20:            granitemoe.rope.dimension_count u32              = 64
    llama_model_loader: - kv  21:            tokenizer.ggml.add_space_prefix bool             = false
    llama_model_loader: - kv  22:                 granitemoe.attention.scale f32              = 0.015625
    llama_model_loader: - kv  23:                 granitemoe.embedding_scale f32              = 12.000000
    llama_model_loader: - kv  24:                  granitemoe.residual_scale f32              = 0.220000
    llama_model_loader: - kv  25:                     granitemoe.logit_scale f32              = 6.000000
    llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
    llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = refact
    llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,49155]   = ["<|end_of_text|>", "<fim_prefix>", "...
    llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,49155]   = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
    llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,48891]   = ["Ġ Ġ", "ĠĠ ĠĠ", "ĠĠĠĠ ĠĠ...
    llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 0
    llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 0
    llama_model_loader: - kv  33:            tokenizer.ggml.unknown_token_id u32              = 0
    llama_model_loader: - kv  34:            tokenizer.ggml.padding_token_id u32              = 0
    llama_model_loader: - kv  35:               tokenizer.ggml.add_bos_token bool             = false
    llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|start_of_r...
    llama_model_loader: - kv  37:               general.quantization_version u32              = 2
    llama_model_loader: - type  f32:  242 tensors
    llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
    llm_load_vocab: special tokens cache size = 22
    llm_load_vocab: token to piece cache size = 0.2826 MB
    llm_load_print_meta: format           = GGUF V3 (latest)
    llm_load_print_meta: arch             = granitemoe
    llm_load_print_meta: vocab type       = BPE
    llm_load_print_meta: n_vocab          = 49155
    llm_load_print_meta: n_merges         = 48891
    llm_load_print_meta: vocab_only       = 0
    llm_load_print_meta: n_ctx_train      = 4096
    llm_load_print_meta: n_embd           = 1024
    llm_load_print_meta: n_layer          = 24
    llm_load_print_meta: n_head           = 16
    llm_load_print_meta: n_head_kv        = 8
    llm_load_print_meta: n_rot            = 64
    llm_load_print_meta: n_swa            = 0
    llm_load_print_meta: n_embd_head_k    = 64
    llm_load_print_meta: n_embd_head_v    = 64
    llm_load_print_meta: n_gqa            = 2
    llm_load_print_meta: n_embd_k_gqa     = 512
    llm_load_print_meta: n_embd_v_gqa     = 512
    llm_load_print_meta: f_norm_eps       = 0.0e+00
    llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
    llm_load_print_meta: f_clamp_kqv      = 0.0e+00
    llm_load_print_meta: f_max_alibi_bias = 0.0e+00
    llm_load_print_meta: f_logit_scale    = 6.0e+00
    llm_load_print_meta: n_ff             = 512
    llm_load_print_meta: n_expert         = 32
    llm_load_print_meta: n_expert_used    = 8
    llm_load_print_meta: causal attn      = 1
    llm_load_print_meta: pooling type     = 0
    llm_load_print_meta: rope type        = 0
    llm_load_print_meta: rope scaling     = linear
    llm_load_print_meta: freq_base_train  = 10000.0
    llm_load_print_meta: freq_scale_train = 1
    llm_load_print_meta: n_ctx_orig_yarn  = 4096
    llm_load_print_meta: rope_finetuned   = unknown
    llm_load_print_meta: ssm_d_conv       = 0
    llm_load_print_meta: ssm_d_inner      = 0
    llm_load_print_meta: ssm_d_state      = 0
    llm_load_print_meta: ssm_dt_rank      = 0
    llm_load_print_meta: ssm_dt_b_c_rms   = 0
    llm_load_print_meta: model type       = ?B
    llm_load_print_meta: model ftype      = all F32
    llm_load_print_meta: model params     = 1.33 B
    llm_load_print_meta: model size       = 4.97 GiB (32.00 BPW) 
    llm_load_print_meta: general.name     = Granite 3.0 1b A400M Instruct
    llm_load_print_meta: BOS token        = 0 '<|end_of_text|>'
    llm_load_print_meta: EOS token        = 0 '<|end_of_text|>'
    llm_load_print_meta: UNK token        = 0 '<|end_of_text|>'
    llm_load_print_meta: PAD token        = 0 '<|end_of_text|>'
    llm_load_print_meta: LF token         = 145 'Ä'
    llm_load_print_meta: EOG token        = 0 '<|end_of_text|>'
    llm_load_print_meta: max token length = 512
    llm_load_print_meta: f_embedding_scale = 12.000000
    llm_load_print_meta: f_residual_scale  = 0.220000
    llm_load_print_meta: f_attention_scale = 0.015625
    llm_load_tensors: ggml ctx size =    0.99 MiB
    llm_load_tensors: offloading 24 repeating layers to GPU
    llm_load_tensors: offloading non-repeating layers to GPU
    llm_load_tensors: offloaded 25/25 layers to GPU
    llm_load_tensors:        CPU buffer size =   192.01 MiB
    llm_load_tensors:      CUDA0 buffer size =   816.53 MiB
    llm_load_tensors:      CUDA1 buffer size =   816.53 MiB
    llm_load_tensors:      CUDA2 buffer size =   612.40 MiB
    llm_load_tensors:      CUDA3 buffer size =   816.53 MiB
    llm_load_tensors:      CUDA4 buffer size =   612.40 MiB
    llm_load_tensors:      CUDA5 buffer size =   612.40 MiB
    llm_load_tensors:      CUDA6 buffer size =   204.13 MiB
    llm_load_tensors:      CUDA7 buffer size =   600.28 MiB
    ................................................................................
    llama_new_context_with_model: n_ctx      = 4096
    llama_new_context_with_model: n_batch    = 2048
    llama_new_context_with_model: n_ubatch   = 512
    llama_new_context_with_model: flash_attn = 0
    llama_new_context_with_model: freq_base  = 10000.0
    llama_new_context_with_model: freq_scale = 1
    llama_kv_cache_init:  CUDA_Host KV buffer size =   192.00 MiB
    llama_new_context_with_model: KV self size  =  192.00 MiB, K (f16):   96.00 MiB, V (f16):   96.00 MiB
    llama_new_context_with_model:  CUDA_Host  output buffer size =     0.19 MiB
    llama_new_context_with_model:      CUDA0 compute buffer size =    30.00 MiB
    llama_new_context_with_model:      CUDA1 compute buffer size =    30.00 MiB
    llama_new_context_with_model:      CUDA2 compute buffer size =    30.00 MiB
    llama_new_context_with_model:      CUDA3 compute buffer size =    30.00 MiB
    llama_new_context_with_model:      CUDA4 compute buffer size =    30.00 MiB
    llama_new_context_with_model:      CUDA5 compute buffer size =    30.00 MiB
    llama_new_context_with_model:      CUDA6 compute buffer size =    30.00 MiB
    llama_new_context_with_model:      CUDA7 compute buffer size =   100.01 MiB
    llama_new_context_with_model:  CUDA_Host compute buffer size =   140.01 MiB
    llama_new_context_with_model: graph nodes  = 1472
    llama_new_context_with_model: graph splits = 57
    common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
    main: llama threadpool init, n_threads = 40
    
    system_info: n_threads = 40 (n_threads_batch = 40) / 80 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
    
    sampler seed: 1182385544
    sampler params: 
      repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
      top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
      mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
    sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
    generate: n_ctx = 4096, n_batch = 2048, n_predict = 10, n_keep = 0
    
    higgml_backend_cuda_graph_compute: CUDA graph update failed
    ggml_backend_cuda_graph_compute: CUDA graph update failed
    ggml_backend_cuda_graph_compute: CUDA graph update failed
    ggml_backend_cuda_graph_compute: CUDA graph update failed
    ggml_backend_cuda_graph_compute: CUDA graph update failed
    ggml_backend_cuda_graph_compute: CUDA graph update failed
    ggml_backend_cuda_graph_compute: CUDA graph update failed
    ggml_backend_cuda_graph_compute: CUDA graph update failed
    party, I am here to help you with your
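
Since -ngl 23 is fine but -ngl 24/25 collapse into repeated "4"s, the divergence presumably appears once the last repeating layer and/or the output tensors move to the GPU. One way to localize it is to dump per-op activations for a fully-offloaded run and a CPU-only run and look for the first node that differs or goes non-finite. The sketch below assumes the eval-callback example in this repo is built as llama-eval-callback and accepts the usual common flags:

MODEL=granite-3.0-1b-a400m-instruct/granite-3.0-1B-a400M-instruct-F32.gguf
# Fully offloaded (misbehaving) run vs CPU-only (reference) run;
# the eval-callback example prints each graph tensor as it is computed
llama-eval-callback -m "${MODEL}" -ngl 25 -p "hi" > gpu-run.log 2>&1
llama-eval-callback -m "${MODEL}" -ngl 0  -p "hi" > cpu-run.log 2>&1
# Find the first tensor whose values diverge or become nan/inf
diff gpu-run.log cpu-run.log | head -n 40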
    

Repro Setup

huggingface-cli download ibm-granite/granite-3.0-1b-a400m-instruct --local-dir granite-3.0-1b-a400m-instruct

# Convert to F32, F16, and Q4_K_M
convert_hf_to_gguf.py granite-3.0-1b-a400m-instruct/ --outtype f32
convert_hf_to_gguf.py granite-3.0-1b-a400m-instruct/
llama-quantize granite-3.0-1b-a400m-instruct/granite-3.0-1B-a400M-instruct-F16.gguf Q4_K_M
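
To re-check the dtype independence noted above, all three variants can be pushed through the failing configuration. A rough loop follows; the F32/F16 file names come from the converter output shown in these logs, while the Q4_K_M path is an assumption (adjust it to wherever llama-quantize actually wrote the file):

for f in \
  granite-3.0-1b-a400m-instruct/granite-3.0-1B-a400M-instruct-F32.gguf \
  granite-3.0-1b-a400m-instruct/granite-3.0-1B-a400M-instruct-F16.gguf \
  granite-3.0-1b-a400m-instruct/ggml-model-Q4_K_M.gguf; do  # assumed quantize output name
  echo "=== ${f} ==="
  llama-cli -m "${f}" -ngl 24 -p "hi" -n 10
done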

Name and Version

$ llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
  Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 2: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 3: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 4: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 5: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 6: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
  Device 7: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
version: 3953 (994cfb1a)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

  • Ubuntu 24.04
  • x86_64
  • CUDA 12.6

Relevant log output

(see above in details)

    Labels

    bug-unconfirmed, medium severity (used to report medium severity bugs in llama.cpp, e.g. malfunctioning features but still usable)
