Closed
Name and Version
version: 4879 (f08f4b3)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.6.0
Operating systems
Mac
GGML backends
Metal
Hardware
Apple M3 Pro 36GB
Models
https://huggingface.co/ggml-org/gemma-3-4b-it-GGUF
Problem description & steps to reproduce
When I run llama-server, Gemma 3 models output gibberish once they hit the context length and attempt a context shift.
To reproduce:
- Start the server with the build from https://github.com/ggml-org/llama.cpp/releases/tag/b4879
llama-server -m /Users/matt/.cache/lm-studio/models/ggml-org/gemma-3-4b-it-GGUF/gemma-3-4b-it-Q4_K_M.gguf -ngl 99 -c 512 -b 512 --temp 0 --seed 0 -n 1000
- Send request:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"model": "gemma-3",
"messages": [
{
"role": "user",
"content": "Tell me a 1000 word math proof"
}
],
"n_keep": 19
}'
The response then degenerates into gibberish, for example:
{"choices":[{"finish_reason":"length","index":0,"message":{"role":"assistant","content":"Okay, let's dive into a classic and relatively accessible mathematical proof: **The Proof that √2 is Irrational.** This proof, attributed to Carl Friedrich Gauss, is a beautiful example of a rigorous argument and a cornerstone of number theory.\n\n**Theorem:** The square root of 2 (√2) is an irrational number.\n\n**Proof:**\n\n**1. Assumption for Contradiction:**\n\nWe will proceed by contradiction. This means we will assume the opposite of what we want to prove and then show that this assumption leads to a logical inconsistency.\n\nAssume, for the sake of argument, that √2 is a rational number. If √2 is rational, then it can be expressed as a fraction p/q, where p and q are integers, and q ≠ 0. Furthermore, we can assume that the fraction p/q is in its *simplest form*, meaning that p and q have no common factors other than 1 (i.e., they are coprime). This is crucial. If they had a common factor, we could simply divide both numerator and denominator by that factor to get a simpler fraction.\n\nTherefore, we have:\n\n√2 = p/q, where p and q are integers, q ≠ 0, and gcd(p, q) = 1 (gcd stands for greatest common divisor).\n\n**2. Rearranging the Equation:**\n\nSquare both sides of the equation:\n\n2 = p²/q²\n\nMultiply both sides by q²:\n\n2q² = p²\n\n**3. Deduction about p²:**\n\nThis equation (2q² = p²) tells us that p² is an even number. Why? Because it’s equal to 2 times an integer (q²).\n\n**4. Deduction about p:**\n\nNow, we know that if p² is even, then p itself must also be even. 
Here's the reasoning:\n\n* If p were odd, it could be written as 2k+1, where k is an integer.\n* Then p² = (2k+1)² = 4k² + 4k + 1 = 2(2k² + 2k) + 1, which is an odd number.\n* But we know that p² = 2, (1 + 1 + 1, (k + 1, 1 + 1 is even 111 +1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"}}],"created":1741811471,"model":"gemma-3","system_fingerprint":"b4879-f08f4b31","object":"chat.completion","usage":{"completion_tokens":1000,"prompt_tokens":20,"total_tokens":1020},"id":"chatcmpl-YVQt2u4nXDvWZuvfK3uqsIcrNtPYn9pz","timings":{"prompt_n":20,"prompt_ms":148.071,"prompt_per_token_ms":7.40355,"prompt_per_second":135.0703378784502,"predicted_n":1000,"predicted_ms":23825.723,"predicted_per_token_ms":23.825723000000004,"predicted_per_second":41.971444056493056}}
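When scripting this reproduction, the degenerate tail shown above (one character repeated hundreds of times) can be flagged automatically. The following is a minimal sketch, not part of llama.cpp; the function names and the threshold of 50 are hypothetical choices made for illustration.

```python
def trailing_run_length(text: str) -> int:
    """Return the length of the run of identical characters at the end of text."""
    if not text:
        return 0
    last = text[-1]
    count = 0
    for ch in reversed(text):
        if ch != last:
            break
        count += 1
    return count


def looks_degenerate(text: str, threshold: int = 50) -> bool:
    """Heuristic: treat a long run of one repeated character as gibberish.

    The threshold is an arbitrary assumption, not a value taken from the server.
    """
    return trailing_run_length(text) >= threshold
```

Passing the "content" field of the chat completion to looks_degenerate gives a simple pass/fail signal that can be compared across models or builds.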
Note that I see this not only when the context length is set to 512, but also when it is set longer. I set it to 512 only to keep this reproduction short.
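For reference, the context-shift values that the server log reports for this run (n_keep = 20, n_left = 491, n_discard = 245, with -c 512 and "n_keep": 19 in the request) are consistent with the arithmetic sketched below. This is an illustration under stated assumptions, not the server's actual code: it assumes the shift fires once n_past reaches n_ctx - 1, that one extra token is kept for BOS when the tokenizer adds one, and that half of the remaining tokens are discarded. The function and parameter names are hypothetical.

```python
def context_shift_params(n_ctx: int, n_keep_requested: int, add_bos: bool = True):
    """Sketch of the context-shift arithmetic implied by the log values.

    Assumptions (not confirmed against the server source):
      - the shift triggers when n_past == n_ctx - 1
      - n_keep grows by one when a BOS token is added
      - half of the tokens after the kept prefix are discarded
    """
    n_past = n_ctx - 1
    n_keep = n_keep_requested + (1 if add_bos else 0)
    n_left = n_past - n_keep
    n_discard = n_left // 2
    return n_keep, n_left, n_discard


# Values from this reproduction: -c 512 and "n_keep": 19
print(context_shift_params(512, 19))
```

Under these assumptions, each shift drops roughly half of the generated tokens after the kept prefix, so a 1000-token generation at -c 512 goes through several shifts, which matches the three "slot context shift" lines in the log.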
This does not happen with other models, such as Qwen2.5 3B Instruct from https://huggingface.co/lmstudio-community/Qwen2.5-3B-Instruct-GGUF
- Server command
llama-server -m /Users/matt/.cache/lm-studio/models/lmstudio-community/Qwen2.5-3B-Instruct-GGUF/Qwen2.5-3B-Instruct-Q4_K_M.gguf -ngl 99 -c 512 -b 512 --temp 0 --seed 0 -n 1000
- Request and response
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"model": "qwen",
"messages": [
{
"role": "user",
"content": "Tell me a 1000 word math proof"
}
],
"n_keep": 19
}'
{"choices":[{"finish_reason":"length","index":0,"message":{"role":"assistant","content":"Certainly! Let's delve into a proof that is both intricate and fascinating: the proof of the Four Color Theorem. This theorem states that any map can be colored using at most four colors in such a way that no two adjacent regions (countries, states, etc.) share the same color. This proof is particularly notable because it was one of the first major theorems to be proven using a computer, and it took over a decade to complete.\n\n### The Four Color Theorem\n\n**Statement:** Any planar map can be colored using at most four colors such that no two adjacent regions share the same color.\n\n### Proof Overview\n\nThe proof of the Four Color Theorem was completed in 1976 by Kenneth Appel and Wolfgang Haken. Their proof was a monumental achievement, but it was also controversial because it relied heavily on computer assistance. The proof involved a massive computer search to verify that no counterexample to the theorem could exist. This approach was not widely accepted by the mathematical community at the time, as it was seen as a \"computer-assisted proof\" rather than a traditional mathematical proof.\n\n### The Proof Process\n\nThe proof of the Four Color Theorem can be broken down into several key steps:\n\n1. **Reduction to a Finite Set of Maps:**\n The first step was to show that it was sufficient to consider only a finite set of maps. This was done by considering all possible maps with a finite number of regions and showing that these maps could be reduced to a smaller set of maps.\n\n2. **Reduction to a Set of 1,936 Maps:**\n After reducing the problem to a finite set of maps, the next step was to show that it was sufficient to consider only 1,936 specific maps. This was a significant reduction from the original set of maps.\n\n3. **Computer-Assisted Verification:**\n The final step was to use a computer to verify that these 1,936 maps could all be colored with four colors. 
This verification was done by checking all possible colorings of these maps, which was a computationally intensive task.\n\n### The Computer-Assisted Verification\n\nThe computer-assisted verification involved checking all possible colorings of the 1,936 maps. This task was completed by Appel and Haken, who wrote a computer program to perform the necessary checks. The program was designed to verify that no two adjacent regions in any of the 1,936 maps could be colored with fewer than four colors.\n\n### The Proof of the Four-Color Theorem\n\nThe Four-Color Theorem states that any planar map can be colored with at most four colors in such a way that no two adjacent regions have the same color. The proof of this theorem, as provided by Appel and Haken, is a significant achievement in mathematics, but it has also been the subject of much debate and controversy.\n\n#### Key Points of the Proof:\n\n1. **Reduction to a Finite Set of Maps:**\n The proof starts by showing that it is sufficient to consider only a finite set of maps. This set is constructed by considering all possible maps with a finite number of regions and showing that these maps can be reduced to a smaller set of maps.\n\n2. **Reduction to a Set of 1,936 Maps:**\n After reducing the problem to a finite set of maps, the next step is to show that it is sufficient to consider only 1,936 specific maps. This was done by constructing a set of 1,936 maps that are representative of all possible maps with a finite number of regions.\n\n3. **Graph Theory Representation:**\n The proof uses graph theory to represent the problem. Each region in the map is represented by a vertex in a graph, and edges are drawn between vertices if the corresponding regions share a boundary. The problem of coloring the map is then translated into a problem of finding a proper coloring of the graph.\n\n4. 
**Graph Coloring:**\n The proof then uses a technique called \"reductions\" to show that any graph that can be reduced to one of the 1,936 specific graphs can be colored with at most four colors. This is done by showing that any graph that cannot be colored with four colors must have a subgraph that is also reducible to one of the 1,936 specific graphs.\n\n5. **Computer-Assisted Proof:**\n The proof relies on a computer to check a large number of cases. Specifically, the proof checks that no graph that can be reduced to one of the 1,936 specific graphs has a subgraph that cannot be colored with four colors. This computer-assisted part of the proof is what has made the Four-Color Theorem controversial, as some mathematicians have questioned the reliability of computer-assisted proofs.\n\n6. **Conclusion:**\n"}}],"created":1741810970,"model":"gemma-3-1b-it-Q4_K_M","system_fingerprint":"b4879-f08f4b31","object":"chat.completion","usage":{"completion_tokens":1000,"prompt_tokens":19,"total_tokens":1019},"id":"chatcmpl-eL46rPRpUa940s3gfvO8WvTscLwjCBN4","timings":{"prompt_n":19,"prompt_ms":139.281,"prompt_per_token_ms":7.330578947368421,"prompt_per_second":136.41487352905276,"predicted_n":1000,"predicted_ms":19492.072,"predicted_per_token_ms":19.492072,"predicted_per_second":51.30290920329045}}
First Bad Commit
No response
Relevant log output
## Broken Gemma 3
### Server command
llama-server -m /Users/matt/.cache/lm-studio/models/ggml-org/gemma-3-4b-it-GGUF/gemma-3-4b-it-Q4_K_M.gguf -ngl 99 -c 512 -b 512 --temp 0 --seed 0 -n 1000
build: 4879 (f08f4b31) with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.6.0
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12
system_info: n_threads = 6 (n_threads_batch = 6) / 12 | Metal : EMBED_LIBRARY = 1 | BF16 = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 |
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 11
main: loading model
srv load_model: loading model '/Users/matt/.cache/lm-studio/models/ggml-org/gemma-3-4b-it-GGUF/gemma-3-4b-it-Q4_K_M.gguf'
llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 444 tensors from /Users/matt/.cache/lm-studio/models/ggml-org/gemma-3-4b-it-GGUF/gemma-3-4b-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma 3 4b It
llama_model_loader: - kv 3: general.finetune str = it
llama_model_loader: - kv 4: general.basename str = gemma-3
llama_model_loader: - kv 5: general.size_label str = 4B
llama_model_loader: - kv 6: general.license str = gemma
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Gemma 3 4b Pt
llama_model_loader: - kv 9: general.base_model.0.organization str = Google
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/google/gemma-3...
llama_model_loader: - kv 11: general.tags arr[str,1] = ["image-text-to-text"]
llama_model_loader: - kv 12: gemma3.context_length u32 = 131072
llama_model_loader: - kv 13: gemma3.embedding_length u32 = 2560
llama_model_loader: - kv 14: gemma3.block_count u32 = 34
llama_model_loader: - kv 15: gemma3.feed_forward_length u32 = 10240
llama_model_loader: - kv 16: gemma3.attention.head_count u32 = 8
llama_model_loader: - kv 17: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 18: gemma3.attention.key_length u32 = 256
llama_model_loader: - kv 19: gemma3.attention.value_length u32 = 256
llama_model_loader: - kv 20: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 21: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 22: gemma3.attention.head_count_kv u32 = 4
llama_model_loader: - kv 23: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 24: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 25: tokenizer.ggml.model str = llama
llama_model_loader: - kv 26: tokenizer.ggml.pre str = default
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,262144] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 28: tokenizer.ggml.scores arr[f32,262144] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,262144] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 32: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 34: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 35: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 36: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 37: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 38: general.quantization_version u32 = 2
llama_model_loader: - kv 39: general.file_type u32 = 15
llama_model_loader: - type f32: 205 tensors
llama_model_loader: - type q4_K: 204 tensors
llama_model_loader: - type q6_K: 35 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 2.31 GiB (5.12 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 6414
load: token to piece cache size = 1.9446 MB
print_info: arch = gemma3
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2560
print_info: n_layer = 34
print_info: n_head = 8
print_info: n_head_kv = 4
print_info: n_rot = 256
print_info: n_swa = 1024
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 2
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 6.2e-02
print_info: n_ff = 10240
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 4B
print_info: model params = 3.88 B
print_info: general.name = Gemma 3 4b It
print_info: vocab type = SPM
print_info: n_vocab = 262144
print_info: n_merges = 0
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 1 '<eos>'
print_info: EOT token = 106 '<end_of_turn>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: LF token = 248 '<0x0A>'
print_info: EOG token = 1 '<eos>'
print_info: EOG token = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 34 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 35/35 layers to GPU
load_tensors: Metal_Mapped model buffer size = 2368.18 MiB
load_tensors: CPU_Mapped model buffer size = 525.00 MiB
.................................................................
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 512
llama_init_from_model: n_ctx_per_seq = 512
llama_init_from_model: n_batch = 512
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 0.125
llama_init_from_model: n_ctx_per_seq (512) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Pro
ggml_metal_init: picking default device: Apple M3 Pro
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name: Apple M3 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets = false
ggml_metal_init: has bfloat = true
ggml_metal_init: use bfloat = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 28991.03 MB
llama_kv_cache_init: kv_size = 512, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 34, can_shift = 1
llama_kv_cache_init: Metal KV buffer size = 68.00 MiB
llama_init_from_model: KV self size = 68.00 MiB, K (f16): 34.00 MiB, V (f16): 34.00 MiB
llama_init_from_model: CPU output buffer size = 1.00 MiB
llama_init_from_model: Metal compute buffer size = 517.00 MiB
llama_init_from_model: CPU compute buffer size = 7.01 MiB
llama_init_from_model: graph nodes = 1367
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 512
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 512
main: model loaded
main: chat template, chat_template: {{ bos_token }}
{%- if messages[0]['role'] == 'system' -%}
{%- if messages[0]['content'] is string -%}
{%- set first_user_prefix = messages[0]['content'] + '
' -%}
{%- else -%}
{%- set first_user_prefix = messages[0]['content'][0]['text'] + '
' -%}
{%- endif -%}
{%- set loop_messages = messages[1:] -%}
{%- else -%}
{%- set first_user_prefix = "" -%}
{%- set loop_messages = messages -%}
{%- endif -%}
{%- for message in loop_messages -%}
{%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
{{ raise_exception("Conversation roles must alternate user/assistant/user/assistant/...") }}
{%- endif -%}
{%- if (message['role'] == 'assistant') -%}
{%- set role = "model" -%}
{%- else -%}
{%- set role = message['role'] -%}
{%- endif -%}
{{ '<start_of_turn>' + role + '
' + (first_user_prefix if loop.first else "") }}
{%- if message['content'] is string -%}
{{ message['content'] | trim }}
{%- elif message['content'] is iterable -%}
{%- for item in message['content'] -%}
{%- if item['type'] == 'image' -%}
{{ '<start_of_image>' }}
{%- elif item['type'] == 'text' -%}
{{ item['text'] | trim }}
{%- endif -%}
{%- endfor -%}
{%- else -%}
{{ raise_exception("Invalid content type") }}
{%- endif -%}
{{ '<end_of_turn>
' }}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{'<start_of_turn>model
'}}
{%- endif -%}
, example_format: '<start_of_turn>user
You are a helpful assistant
Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model
'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 512, n_keep = 19, n_prompt_tokens = 20
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 20, n_tokens = 20, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 20, n_tokens = 20
slot update_slots: id 0 | task 0 | slot context shift, n_keep = 20, n_left = 491, n_discard = 245
slot update_slots: id 0 | task 0 | slot context shift, n_keep = 20, n_left = 491, n_discard = 245
slot update_slots: id 0 | task 0 | slot context shift, n_keep = 20, n_left = 491, n_discard = 245
slot release: id 0 | task 0 | stop processing: n_past = 284, truncated = 1
slot print_timing: id 0 | task 0 |
prompt eval time = 148.07 ms / 20 tokens ( 7.40 ms per token, 135.07 tokens per second)
eval time = 23825.72 ms / 1000 tokens ( 23.83 ms per token, 41.97 tokens per second)
total time = 23973.79 ms / 1020 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
### Request and response
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"model": "gemma-3",
"messages": [
{
"role": "user",
"content": "Tell me a 1000 word math proof"
}
],
"n_keep": 19
}'
{"choices":[{"finish_reason":"length","index":0,"message":{"role":"assistant","content":"Okay, let's dive into a classic and relatively accessible mathematical proof: **The Proof that √2 is Irrational.** This proof, attributed to Carl Friedrich Gauss, is a beautiful example of a rigorous argument and a cornerstone of number theory.\n\n**Theorem:** The square root of 2 (√2) is an irrational number.\n\n**Proof:**\n\n**1. Assumption for Contradiction:**\n\nWe will proceed by contradiction. This means we will assume the opposite of what we want to prove and then show that this assumption leads to a logical inconsistency.\n\nAssume, for the sake of argument, that √2 is a rational number. If √2 is rational, then it can be expressed as a fraction p/q, where p and q are integers, and q ≠ 0. Furthermore, we can assume that the fraction p/q is in its *simplest form*, meaning that p and q have no common factors other than 1 (i.e., they are coprime). This is crucial. If they had a common factor, we could simply divide both numerator and denominator by that factor to get a simpler fraction.\n\nTherefore, we have:\n\n√2 = p/q, where p and q are integers, q ≠ 0, and gcd(p, q) = 1 (gcd stands for greatest common divisor).\n\n**2. Rearranging the Equation:**\n\nSquare both sides of the equation:\n\n2 = p²/q²\n\nMultiply both sides by q²:\n\n2q² = p²\n\n**3. Deduction about p²:**\n\nThis equation (2q² = p²) tells us that p² is an even number. Why? Because it’s equal to 2 times an integer (q²).\n\n**4. Deduction about p:**\n\nNow, we know that if p² is even, then p itself must also be even. 
Here's the reasoning:\n\n* If p were odd, it could be written as 2k+1, where k is an integer.\n* Then p² = (2k+1)² = 4k² + 4k + 1 = 2(2k² + 2k) + 1, which is an odd number.\n* But we know that p² = 2, (1 + 1 + 1, (k + 1, 1 + 1 is even 111 +1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"}}],"created":1741811471,"model":"gemma-3","system_fingerprint":"b4879-f08f4b31","object":"chat.completion","usage":{"completion_tokens":1000,"prompt_tokens":20,"total_tokens":1020},"id":"chatcmpl-YVQt2u4nXDvWZuvfK3uqsIcrNtPYn9pz","timings":{"prompt_n":20,"prompt_ms":148.071,"prompt_per_token_ms":7.40355,"prompt_per_second":135.0703378784502,"predicted_n":1000,"predicted_ms":23825.723,"predicted_per_token_ms":23.825723000000004,"predicted_per_second":41.971444056493056}}
## Working Qwen2.5 3B
### Server
llama-server -m /Users/matt/.cache/lm-studio/models/lmstudio-community/Qwen2.5-3B-Instruct-GGUF/Qwen2.5-3B-Instruct-Q4_K_M.gguf -ngl 99 -c 512 -b 512 --temp 0 --seed 0 -n 1000
build: 4879 (f08f4b31) with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.6.0
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12
system_info: n_threads = 6 (n_threads_batch = 6) / 12 | Metal : EMBED_LIBRARY = 1 | BF16 = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 |
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 11
main: loading model
srv load_model: loading model '/Users/matt/.cache/lm-studio/models/lmstudio-community/Qwen2.5-3B-Instruct-GGUF/Qwen2.5-3B-Instruct-Q4_K_M.gguf'
llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 434 tensors from /Users/matt/.cache/lm-studio/models/lmstudio-community/Qwen2.5-3B-Instruct-GGUF/Qwen2.5-3B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen2.5 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen2.5
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.license str = other
llama_model_loader: - kv 7: general.license.name str = qwen-research
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen2.5-3...
llama_model_loader: - kv 9: general.base_model.count u32 = 1
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen2.5 3B
llama_model_loader: - kv 11: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-3B
llama_model_loader: - kv 13: general.tags arr[str,2] = ["chat", "text-generation"]
llama_model_loader: - kv 14: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 15: qwen2.block_count u32 = 36
llama_model_loader: - kv 16: qwen2.context_length u32 = 32768
llama_model_loader: - kv 17: qwen2.embedding_length u32 = 2048
llama_model_loader: - kv 18: qwen2.feed_forward_length u32 = 11008
llama_model_loader: - kv 19: qwen2.attention.head_count u32 = 16
llama_model_loader: - kv 20: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 21: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 22: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 23: general.file_type u32 = 15
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 34: general.quantization_version u32 = 2
llama_model_loader: - type f32: 181 tensors
llama_model_loader: - type q4_K: 216 tensors
llama_model_loader: - type q6_K: 37 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 1.79 GiB (4.99 BPW)
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 2048
print_info: n_layer = 36
print_info: n_head = 16
print_info: n_head_kv = 2
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 256
print_info: n_embd_v_gqa = 256
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 11008
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 3B
print_info: model params = 3.09 B
print_info: general.name = Qwen2.5 3B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: Metal_Mapped model buffer size = 1834.83 MiB
load_tensors: CPU_Mapped model buffer size = 243.43 MiB
...............................................................................
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 512
llama_init_from_model: n_ctx_per_seq = 512
llama_init_from_model: n_batch = 512
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (512) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Pro
ggml_metal_init: picking default device: Apple M3 Pro
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name: Apple M3 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets = false
ggml_metal_init: has bfloat = true
ggml_metal_init: use bfloat = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 28991.03 MB
llama_kv_cache_init: kv_size = 512, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 36, can_shift = 1
llama_kv_cache_init: Metal KV buffer size = 18.00 MiB
llama_init_from_model: KV self size = 18.00 MiB, K (f16): 9.00 MiB, V (f16): 9.00 MiB
llama_init_from_model: CPU output buffer size = 0.58 MiB
llama_init_from_model: Metal compute buffer size = 300.75 MiB
llama_init_from_model: CPU compute buffer size = 5.01 MiB
llama_init_from_model: graph nodes = 1266
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 512
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 512
main: model loaded
main: chat template, chat_template: {%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0]['role'] == 'system' %}
{{- messages[0]['content'] }}
{%- else %}
{{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
{%- endif %}
{{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
{%- if messages[0]['role'] == 'system' %}
{{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
{%- else %}
{{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- for message in messages %}
{%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
{{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
{%- elif message.role == "assistant" %}
{{- '<|im_start|>' + message.role }}
{%- if message.content %}
{{- '\n' + message.content }}
{%- endif %}
{%- for tool_call in message.tool_calls %}
{%- if tool_call.function is defined %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '\n<tool_call>\n{"name": "' }}
{{- tool_call.name }}
{{- '", "arguments": ' }}
{{- tool_call.arguments | tojson }}
{{- '}\n</tool_call>' }}
{%- endfor %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "tool" %}
{%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|im_start|>user' }}
{%- endif %}
{{- '\n<tool_response>\n' }}
{{- message.content }}
{{- '\n</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- endif %}
, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
### Curl request and response:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"model": "qwen",
"messages": [
{
"role": "user",
"content": "Tell me a 1000 word math proof"
}
],
"n_keep": 19
}'
{"choices":[{"finish_reason":"length","index":0,"message":{"role":"assistant","content":"Certainly! Let's delve into a proof that is both intricate and fascinating: the proof of the Four Color Theorem. This theorem states that any map can be colored using at most four colors in such a way that no two adjacent regions (countries, states, etc.) share the same color. This proof is particularly notable because it was one of the first major theorems to be proven using a computer, and it took over a decade to complete.\n\n### The Four Color Theorem\n\n**Statement:** Any planar map can be colored using at most four colors such that no two adjacent regions share the same color.\n\n### Proof Overview\n\nThe proof of the Four Color Theorem was completed in 1976 by Kenneth Appel and Wolfgang Haken. Their proof was a monumental achievement, but it was also controversial because it relied heavily on computer assistance. The proof involved a massive computer search to verify that no counterexample to the theorem could exist. This approach was not widely accepted by the mathematical community at the time, as it was seen as a \"computer-assisted proof\" rather than a traditional mathematical proof.\n\n### The Proof Process\n\nThe proof of the Four Color Theorem can be broken down into several key steps:\n\n1. **Reduction to a Finite Set of Maps:**\n The first step was to show that it was sufficient to consider only a finite set of maps. This was done by considering all possible maps with a finite number of regions and showing that these maps could be reduced to a smaller set of maps.\n\n2. **Reduction to a Set of 1,936 Maps:**\n After reducing the problem to a finite set of maps, the next step was to show that it was sufficient to consider only 1,936 specific maps. This was a significant reduction from the original set of maps.\n\n3. **Computer-Assisted Verification:**\n The final step was to use a computer to verify that these 1,936 maps could all be colored with four colors. 
This verification was done by checking all possible colorings of these maps, which was a computationally intensive task.\n\n### The Computer-Assisted Verification\n\nThe computer-assisted verification involved checking all possible colorings of the 1,936 maps. This task was completed by Appel and Haken, who wrote a computer program to perform the necessary checks. The program was designed to verify that no two adjacent regions in any of the 1,936 maps could be colored with fewer than four colors.\n\n### The Proof of the Four-Color Theorem\n\nThe Four-Color Theorem states that any planar map can be colored with at most four colors in such a way that no two adjacent regions have the same color. The proof of this theorem, as provided by Appel and Haken, is a significant achievement in mathematics, but it has also been the subject of much debate and controversy.\n\n#### Key Points of the Proof:\n\n1. **Reduction to a Finite Set of Maps:**\n The proof starts by showing that it is sufficient to consider only a finite set of maps. This set is constructed by considering all possible maps with a finite number of regions and showing that these maps can be reduced to a smaller set of maps.\n\n2. **Reduction to a Set of 1,936 Maps:**\n After reducing the problem to a finite set of maps, the next step is to show that it is sufficient to consider only 1,936 specific maps. This was done by constructing a set of 1,936 maps that are representative of all possible maps with a finite number of regions.\n\n3. **Graph Theory Representation:**\n The proof uses graph theory to represent the problem. Each region in the map is represented by a vertex in a graph, and edges are drawn between vertices if the corresponding regions share a boundary. The problem of coloring the map is then translated into a problem of finding a proper coloring of the graph.\n\n4. 
**Graph Coloring:**\n The proof then uses a technique called \"reductions\" to show that any graph that can be reduced to one of the 1,936 specific graphs can be colored with at most four colors. This is done by showing that any graph that cannot be colored with four colors must have a subgraph that is also reducible to one of the 1,936 specific graphs.\n\n5. **Computer-Assisted Proof:**\n The proof relies on a computer to check a large number of cases. Specifically, the proof checks that no graph that can be reduced to one of the 1,936 specific graphs has a subgraph that cannot be colored with four colors. This computer-assisted part of the proof is what has made the Four-Color Theorem controversial, as some mathematicians have questioned the reliability of computer-assisted proofs.\n\n6. **Conclusion:**\n"}}],"created":1741811715,"model":"qwen","system_fingerprint":"b4879-f08f4b31","object":"chat.completion","usage":{"completion_tokens":1000,"prompt_tokens":19,"total_tokens":1019},"id":"chatcmpl-sAl3IZMjrZjaVIejRpQGGlkYMi2n8Y9R","timings":{"prompt_n":19,"prompt_ms":117.785,"prompt_per_token_ms":6.199210526315789,"prompt_per_second":161.31086301311714,"predicted_n":1000,"predicted_ms":19508.532,"predicted_per_token_ms":19.508532,"predicted_per_second":51.25962322536622}}
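
As a quick sanity check that this Qwen run really did go past the context window, the `usage` block of the response can be compared against the `n_ctx = 512` reported in the server log. The sketch below is only illustrative: `context_shift_expected` is a hypothetical helper (not part of llama.cpp), and it assumes a single slot and that `total_tokens` counts prompt plus generated tokens.

```python
import json


def context_shift_expected(response_json: str, n_ctx: int) -> bool:
    """Return True if the reported token total exceeds the context size.

    Hypothetical helper: assumes one slot and that `total_tokens`
    is prompt tokens plus generated tokens.
    """
    usage = json.loads(response_json)["usage"]
    return usage["total_tokens"] > n_ctx


# Usage block copied from the response above; n_ctx taken from the server log.
sample = '{"usage": {"completion_tokens": 1000, "prompt_tokens": 19, "total_tokens": 1019}}'
print(context_shift_expected(sample, n_ctx=512))  # True
```

With 1019 total tokens against a 512-token context, the server most likely shifted the context at least once during this request, yet the Qwen output stays coherent.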