
Eval bug: Garbage output from Llama-3.2-1B-Instruct-Q4_K_M using GGML_VULKAN on M1 Mac #11256

Closed
swolchok opened this issue Jan 15, 2025 · 2 comments

swolchok commented Jan 15, 2025

Name and Version

$  ./build/bin/llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Apple M1 Pro (MoltenVK) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none
version: 4489 (f11cfdfd)
built with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.2.0

Operating systems

Mac

GGML backends

Vulkan

Hardware

Apple M1 Pro, 32 GB RAM

Models

Meta Llama 3.2 Instruct 1B Q4_K_M

Problem description & steps to reproduce

In a fresh git clone:

$ cmake -B build -DGGML_VULKAN=ON -DGGML_METAL=OFF -DCMAKE_BUILD_TYPE=Release -G Ninja
$ cmake --build build --config Release -j 8
$ ./build/bin/llama-cli -m ~/llamas/Llama-3.2-1B-Instruct-Q4_K_M.gguf -p "The capital of France is " --device Vulkan0 -ngl 17 -no-cnv

Result: the prompt is echoed, but the generated text is obvious nonsense tokens (see the log output below).

If I omit --device Vulkan0 -ngl 17, I get reasonable output, but I see

load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/17 layers to GPU

in the logs, suggesting that the GPU is not used. Omitting -ngl 17 and keeping --device Vulkan0 has the same behavior as omitting both -ngl 17 and --device Vulkan0.
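For comparison, the run that produces reasonable (CPU-only) output is just the repro command above with the --device Vulkan0 -ngl 17 flags dropped:

$ ./build/bin/llama-cli -m ~/llamas/Llama-3.2-1B-Instruct-Q4_K_M.gguf -p "The capital of France is " -no-cnv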

First Bad Commit

EDIT: the bisect surprisingly finished after all; it seems to point to d79d8f3 (#10846).

45095a6 is bad
e9e661b is good

There are a lot of revs with broken builds in that range. I wrote a simple shell loop to auto-skip them, but it ends up skipping a lot of revs whose commit messages mention Vulkan changes, so I'm not counting on the bisection being very helpful.
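A rough sketch of that auto-skip loop (not the exact script; it reuses the cmake configuration from the repro above, skips any rev that fails to build, and leaves the good/bad judgment to a manual look at the generated text):

git bisect start 45095a6 e9e661b        # bad good
# Skip forward past revs that do not build, then test the first one that does.
until cmake -B build -DGGML_VULKAN=ON -DGGML_METAL=OFF -DCMAKE_BUILD_TYPE=Release -G Ninja &&
      cmake --build build --config Release -j 8; do
  git bisect skip
done
./build/bin/llama-cli -m ~/llamas/Llama-3.2-1B-Instruct-Q4_K_M.gguf -p "The capital of France is " --device Vulkan0 -ngl 17 -no-cnv
# Mark the rev by hand based on the output, then repeat the loop until bisect converges:
git bisect good    # or: git bisect bad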

Relevant log output

llama_model_loader: - type  f32:   34 tensors
llama_model_loader: - type q4_K:   96 tensors
llama_model_loader: - type q6_K:   17 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 762.81 MiB (5.18 BPW)
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 2048
print_info: n_layer          = 16
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 8192
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 1B
print_info: model params     = 1.24 B
print_info: general.name     = Llama 3.2 1B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 128 'Ä'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
ggml_vulkan: Compiling shaders.....................................Done!
load_tensors: offloading 16 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 17/17 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   205.49 MiB
load_tensors:      Vulkan0 model buffer size =   762.81 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 16, can_shift = 1
llama_kv_cache_init:    Vulkan0 KV buffer size =   128.00 MiB
llama_init_from_model: KV self size  =  128.00 MiB, K (f16):   64.00 MiB, V (f16):   64.00 MiB
llama_init_from_model: Vulkan_Host  output buffer size =     0.49 MiB
llama_init_from_model:    Vulkan0 compute buffer size =   280.00 MiB
llama_init_from_model: Vulkan_Host compute buffer size =    12.01 MiB
llama_init_from_model: graph nodes  = 518
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8

system_info: n_threads = 8 (n_threads_batch = 8) / 10 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 |

sampler seed: 1881698075
sampler params:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

The capital of France is   ansomightsightsightsightsightsightsightsightsightsightsunningunningunning draft fork fork Fork Fork Fenlockspspsightsunningunningunning fairly Fairy Fairy Fairy Fairy draftunning fork fork cer cer madness fairly fairly Fork Fairy Fairyfork Up Sent Sentunning fairly terms Sent Faith Fairy Fork fork Fork Bra Fairy fairlyunningunningunningunningunningunningights fairly Mad fork Forkunning draft fork Indian Indianightsightsightsunningunningunningunningunningunning sent Up Sentightsights Fork fork fairly Bra mise Upightsunningunning Faithunningunningunning Fairy sent fork sentunningunningightsightsightsunning Ambunningunningunningunning fairly fairly fairly fairly Indian madness204 up factunningunningunningunningunningunningunningunningunning Amb Forkambunning Fairy Fairy Fairy reached fairly Indian terms termsunningunningunning Fairy Fairy fork Bra Bra Bal forkunning Fork Amb204 draft Bor Fairy fairlyightsunningunning
jeffbolznv (Collaborator) commented

Seems like a duplicate of the remaining issue in #10984. I still suspect it's a driver or MoltenVK bug.


github-actions bot commented Mar 1, 2025

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Mar 1, 2025