Bug: Gemma 2 incoherent output when using quantized k cache without Flash Attention #8853


Closed
Dampfinchen opened this issue Aug 4, 2024 · 3 comments
Labels
bug-unconfirmed, medium severity (used to report medium-severity bugs in llama.cpp, e.g. malfunctioning features that are still usable), stale

Comments


Dampfinchen commented Aug 4, 2024

What happened?

Output like "Mh giàu され rodas reliablyacheteurδε Są" appears when using a quantized K cache with CUDA and Gemma 2. Here's how to reproduce:

./llama-server -m "Gemma-2-9B-It-SPPO-Iter3-Q4_K_S.gguf" -t 6 -c 8192 -ngl 31 -ctk q4_0 --host 127.0.0.1 --port 8080

Then connect a frontend like SillyTavern to it. Strangely, this only happens with llama-server, not with llama-cli.

This leads to incoherent output. Note: I can't say whether this issue also occurs with full offloading, as I only have 6 GB of VRAM.
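
For reference, roughly equivalent commands for comparison (the prompt and -n value are just placeholders): the same settings through llama-cli, which stays coherent for me, and the same server command with flash attention enabled via -fa, which should sidestep the problem if it is really tied to running the quantized K cache without flash attention:

./llama-cli -m "Gemma-2-9B-It-SPPO-Iter3-Q4_K_S.gguf" -t 6 -c 8192 -ngl 31 -ctk q4_0 -p "Hello, how are you?" -n 64
./llama-server -m "Gemma-2-9B-It-SPPO-Iter3-Q4_K_S.gguf" -t 6 -c 8192 -ngl 31 -ctk q4_0 -fa --host 127.0.0.1 --port 8080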

Name and Version

./llama-cli --version
version: 3506 (76614f3)
built with MSVC 19.29.30154.0 for x64

What operating system are you seeing the problem on?

Windows

Relevant log output

No response

@Dampfinchen added the bug-unconfirmed and medium severity labels on Aug 4, 2024
@ggerganov (Member) commented:

I can't reproduce with RTX 2060:

GGML_CUDA=1 make -j && ./llama-server -m models/gemma-2-9b-it/ggml-model-q4_k_s.gguf -t 6 -c 8192 -ngl 31 -ctk q4_0 --host 127.0.0.1 --port 8080

I ccache found, compilation results will be cached. Disable with GGML_NO_CCACHE.
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include -DGGML_CUDA_USE_GRAPHS -std=c11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -fopenmp -Wdouble-promotion
I CXXFLAGS: -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -fopenmp -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include -DGGML_CUDA_USE_GRAPHS
I NVCCFLAGS: -std=c++11 -O3 -g -use_fast_math --forward-unknown-to-host-compiler -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128
I LDFLAGS: -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/local/cuda/lib64/stubs -L/usr/lib/wsl/lib
I CC: cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX: c++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I NVCC: Build cuda_12.5.r12.5/compiler.34177558_0

/usr/bin/ccache c++ -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -fopenmp -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include -DGGML_CUDA_USE_GRAPHS examples/deprecation-warning/deprecation-warning.o -o main -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/local/cuda/lib64/stubs -L/usr/lib/wsl/lib
/usr/bin/ccache c++ -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -fopenmp -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include -DGGML_CUDA_USE_GRAPHS examples/deprecation-warning/deprecation-warning.o -o server -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/local/cuda/lib64/stubs -L/usr/lib/wsl/lib
NOTICE: The 'server' binary is deprecated. Please use 'llama-server' instead.
NOTICE: The 'main' binary is deprecated. Please use 'llama-cli' instead.
INFO [ main] build info | tid="124731602448384" timestamp=1722770577 build=3509 commit="ecf6b7f2"
INFO [ main] system info | tid="124731602448384" timestamp=1722770577 n_threads=6 n_threads_batch=-1 total_threads=32 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: loaded meta data with 29 key-value pairs and 464 tensors from models/gemma-2-9b-it/ggml-model-q4_k_s.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma2
llama_model_loader: - kv 1: general.name str = gemma-2-9b-it
llama_model_loader: - kv 2: gemma2.context_length u32 = 8192
llama_model_loader: - kv 3: gemma2.embedding_length u32 = 3584
llama_model_loader: - kv 4: gemma2.block_count u32 = 42
llama_model_loader: - kv 5: gemma2.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: gemma2.attention.head_count u32 = 16
llama_model_loader: - kv 7: gemma2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 9: gemma2.attention.key_length u32 = 256
llama_model_loader: - kv 10: gemma2.attention.value_length u32 = 256
llama_model_loader: - kv 11: general.file_type u32 = 14
llama_model_loader: - kv 12: gemma2.attn_logit_softcapping f32 = 50.000000
llama_model_loader: - kv 13: gemma2.final_logit_softcapping f32 = 30.000000
llama_model_loader: - kv 14: gemma2.attention.sliding_window u32 = 4096
llama_model_loader: - kv 15: tokenizer.ggml.model str = llama
llama_model_loader: - kv 16: tokenizer.ggml.pre str = default
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 18: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 22: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 25: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 26: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv 27: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - type f32: 169 tensors
llama_model_loader: - type q4_K: 285 tensors
llama_model_loader: - type q5_K: 9 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens cache size = 108
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = gemma2
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 256000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 3584
llm_load_print_meta: n_layer = 42
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 256
llm_load_print_meta: n_swa = 4096
llm_load_print_meta: n_embd_head_k = 256
llm_load_print_meta: n_embd_head_v = 256
llm_load_print_meta: n_gqa = 2
llm_load_print_meta: n_embd_k_gqa = 2048
llm_load_print_meta: n_embd_v_gqa = 2048
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 9B
llm_load_print_meta: model ftype = Q4_K - Small
llm_load_print_meta: model params = 9.24 B
llm_load_print_meta: model size = 5.10 GiB (4.74 BPW)
llm_load_print_meta: general.name = gemma-2-9b-it
llm_load_print_meta: BOS token = 2 '<bos>'
llm_load_print_meta: EOS token = 1 '<eos>'
llm_load_print_meta: UNK token = 3 '<unk>'
llm_load_print_meta: PAD token = 0 '<pad>'
llm_load_print_meta: LF token = 227 '<0x0A>'
llm_load_print_meta: EOT token = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 93
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size = 0.41 MiB
llm_load_tensors: offloading 31 repeating layers to GPU
llm_load_tensors: offloaded 31/43 layers to GPU
llm_load_tensors: CPU buffer size = 5219.33 MiB
llm_load_tensors: CUDA0 buffer size = 3297.38 MiB
..............................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 451.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 1271.00 MiB
llama_new_context_with_model: KV self size = 1722.00 MiB, K (q4_0): 378.00 MiB, V (f16): 1344.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.95 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1224.77 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 40.01 MiB
llama_new_context_with_model: graph nodes = 1690
llama_new_context_with_model: graph splits = 147
INFO [ init] initializing slots | tid="124731602448384" timestamp=1722770579 n_slots=1
INFO [ init] new slot | tid="124731602448384" timestamp=1722770579 id_slot=0 n_ctx_slot=8192
INFO [ main] model loaded | tid="124731602448384" timestamp=1722770579
INFO [ main] chat template | tid="124731602448384" timestamp=1722770579 chat_example="<start_of_turn>user\nYou are a helpful assistant\n\nHello<end_of_turn>\n<start_of_turn>model\nHi there<end_of_turn>\n<start_of_turn>user\nHow are you?<end_of_turn>\n<start_of_turn>model\n" built_in=true
INFO [ main] HTTP server listening | tid="124731602448384" timestamp=1722770579 n_threads_http="31" port="8080" hostname="127.0.0.1"
INFO [ update_slots] all slots are idle | tid="124731602448384" timestamp=1722770579
INFO [ launch_slot_with_task] slot is processing task | tid="124731602448384" timestamp=1722770581 id_slot=0 id_task=0
INFO [ update_slots] kv cache rm [p0, end) | tid="124731602448384" timestamp=1722770581 id_slot=0 id_task=0 p0=0
INFO [ print_timings] prompt eval time = 126.05 ms / 5 tokens ( 25.21 ms per token, 39.67 tokens per second) | tid="124731602448384" timestamp=1722770582 id_slot=0 id_task=0 t_prompt_processing=126.053 n_prompt_tokens_processed=5 t_token=25.2106 n_tokens_second=39.66585483883763
INFO [ print_timings] generation eval time = 1317.15 ms / 16 runs ( 82.32 ms per token, 12.15 tokens per second) | tid="124731602448384" timestamp=1722770582 id_slot=0 id_task=0 t_token_generation=1317.154 n_decoded=16 t_token=82.322125 n_tokens_second=12.147402657548016
INFO [ print_timings] total time = 1443.21 ms | tid="124731602448384" timestamp=1722770582 id_slot=0 id_task=0 t_prompt_processing=126.053 t_token_generation=1317.154 t_total=1443.2069999999999
INFO [ update_slots] slot released | tid="124731602448384" timestamp=1722770582 id_slot=0 id_task=0 n_ctx=8192 n_past=20 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [ update_slots] all slots are idle | tid="124731602448384" timestamp=1722770582
INFO [ log_server_request] request | tid="124717945876480" timestamp=1722770582 remote_addr="127.0.0.1" remote_port=52382 status=200 method="POST" path="/completion" params={}
^CINFO [ update_slots] all slots are idle | tid="124731602448384" timestamp=1722770602

curl -s --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": "hello, how can", "n_predict": 16}' | jq
{
  "content": " I find a good mechanic in my area?\n\nIt's tough to find",
  "id_slot": 0,
  "stop": true,
  "model": "models/gemma-2-9b-it/ggml-model-q4_k_s.gguf",
  "tokens_predicted": 16,
  "tokens_evaluated": 5,
  "generation_settings": {
    "n_ctx": 8192,
    "n_predict": -1,
    "model": "models/gemma-2-9b-it/ggml-model-q4_k_s.gguf",
    "seed": 4294967295,
    "temperature": 0.800000011920929,
    "dynatemp_range": 0,
    "dynatemp_exponent": 1,
    "top_k": 40,
    "top_p": 0.949999988079071,
    "min_p": 0.05000000074505806,
    "tfs_z": 1,
    "typical_p": 1,
    "repeat_last_n": 64,
    "repeat_penalty": 1,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "penalty_prompt_tokens": [],
    "use_penalty_prompt_tokens": false,
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.10000000149011612,
    "penalize_nl": false,
    "stop": [],
    "n_keep": 0,
    "n_discard": 0,
    "ignore_eos": false,
    "stream": false,
    "logit_bias": [],
    "n_probs": 0,
    "min_keep": 0,
    "grammar": "",
    "samplers": [
      "top_k",
      "tfs_z",
      "typical_p",
      "top_p",
      "min_p",
      "temperature"
    ]
  },
  "prompt": "hello, how can",
  "truncated": false,
  "stopped_eos": false,
  "stopped_word": false,
  "stopped_limit": true,
  "stopping_word": "",
  "tokens_cached": 20,
  "timings": {
    "prompt_n": 5,
    "prompt_ms": 126.053,
    "prompt_per_token_ms": 25.2106,
    "prompt_per_second": 39.66585483883763,
    "predicted_n": 16,
    "predicted_ms": 1317.154,
    "predicted_per_token_ms": 82.322125,
    "predicted_per_second": 12.147402657548016
  }
}

@Dampfinchen (Author) commented:


I've just tested it, and yes, you are right: right off the bat, it works with the llama.cpp server.

However, as soon as you use the chat template, it stops working properly.

Here's how to reproduce:

When you open the llama.cpp server GUI, click on "New UI". Then choose ChatML without a system prompt and edit the ChatML tokens to match the correct Gemma 2 prompt template. Then chat with the model below.

Does it still work?
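
A way to double-check this without the web UI (just a sketch, assuming the server's OpenAI-compatible endpoint is available in this build; the message content is only an example) is to let llama-server apply its built-in Gemma 2 chat template via /v1/chat/completions:

curl -s --request POST --url http://localhost:8080/v1/chat/completions --header "Content-Type: application/json" --data '{"messages": [{"role": "user", "content": "Hello, how are you?"}], "max_tokens": 32}' | jq

If the plain /completion request stays coherent but the templated chat request does not, that would suggest the problem only shows up once the Gemma 2 chat template (or the frontend's prompt formatting) is involved, rather than in every use of the quantized K cache.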

@github-actions bot added the stale label on Sep 4, 2024

@github-actions (Contributor) commented:

This issue was closed because it has been inactive for 14 days since being marked as stale.
