
Incoherent output after merging https://github.com/ggerganov/llama.cpp/pull/2183 #2187

Closed
@LostRuins

Description

The commit in question seems to be 20d7740.

After this commit, the AI responses no longer seem to take the prompt into account.
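To double-check that this specific commit is the culprit (the two builds compared below are adjacent, so a bisect terminates immediately, but the same recipe works for a wider regression window), a local git bisect would look roughly like this. This is only a sketch, assuming a local CUDA build via cmake -DLLAMA_CUBLAS=ON and the same model and prompt as in the runs below:

git bisect start
git bisect bad 20d7740      # produces incoherent output
git bisect good 5bf2a27     # still follows the prompt
# rebuild, rerun the prompt below, mark the commit good or bad, and repeat
.\main.exe --model e:\LLaMA\models\airoboros-7b-gpt4.ggmlv3.q4_0.bin -ngl 32 -n 30 -p "Hi, my name is"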

Running the pre-built CUDA executables from GitHub Actions:

llama-master-20d7740-bin-win-cublas-cu11.7.1-x64

PS E:\LLaMA\llamacpp> .\main.exe --model e:\LLaMA\models\airoboros-7b-gpt4.ggmlv3.q4_0.bin -ngl 32 -n 30 -p "Hi, my name is"
main: build = 820 (20d7740)
main: seed  = 1689137712
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5
llama.cpp: loading model from e:\LLaMA\models\airoboros-7b-gpt4.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 1932.72 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 288 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloaded 32/35 layers to GPU
llama_model_load_internal: total VRAM used: 3763 MB
llama_new_context_with_model: kv self size  =  256.00 MB

system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0
| WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 30, n_keep = 0


 Hi, my name isCounterclockwise: 2016 in Review – Part One
Welcome to Counterclockwise where we take a look back at some
llama_print_timings:        load time =  2374.73 ms
llama_print_timings:      sample time =     7.17 ms /    30 runs   (    0.24 ms per token,  4181.77 tokens per second)
llama_print_timings: prompt eval time =   402.77 ms /     6 tokens (   67.13 ms per token,    14.90 tokens per second)
llama_print_timings:        eval time =  1391.52 ms /    29 runs   (   47.98 ms per token,    20.84 tokens per second)
llama_print_timings:       total time =  1807.27 ms

llama-master-5bf2a27-bin-win-cublas-cu11.7.1-x64

PS E:\LLaMA\llamacpp> .\main.exe --model e:\LLaMA\models\airoboros-7b-gpt4.ggmlv3.q4_0.bin -ngl 32 -n 30 -p "Hi, my name is"
main: build = 819 (5bf2a27)
main: seed  = 1689137643
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5
llama.cpp: loading model from e:\LLaMA\models\airoboros-7b-gpt4.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 1932.72 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 288 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloaded 32/35 layers to GPU
llama_model_load_internal: total VRAM used: 3763 MB
llama_new_context_with_model: kv self size  =  256.00 MB

system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0
| WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 30, n_keep = 0


 Hi, my name is John and I'm 31 years old.
I was diagnosed with chronic fatigue syndrome in 2015 but
llama_print_timings:        load time =  2316.55 ms
llama_print_timings:      sample time =     5.91 ms /    30 runs   (    0.20 ms per token,  5079.58 tokens per second)
llama_print_timings: prompt eval time =   376.72 ms /     6 tokens (   62.79 ms per token,    15.93 tokens per second)
llama_print_timings:        eval time =  1419.35 ms /    29 runs   (   48.94 ms per token,    20.43 tokens per second)
llama_print_timings:       total time =  1807.44 ms
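
Note that the two runs above used different seeds, so in principle some of the divergence could be sampling noise rather than the code change. A quick way to rule that out is to pin the seed and set the temperature to 0 on both builds, then compare the outputs directly. A rough sketch (the extraction folder names, output file names, and seed value are placeholders; -s and --temp are the usual main.exe sampling flags):

PS E:\LLaMA> .\llama-master-20d7740-bin-win-cublas-cu11.7.1-x64\main.exe --model e:\LLaMA\models\airoboros-7b-gpt4.ggmlv3.q4_0.bin -ngl 32 -n 30 -s 1234 --temp 0 -p "Hi, my name is" > out-20d7740.txt
PS E:\LLaMA> .\llama-master-5bf2a27-bin-win-cublas-cu11.7.1-x64\main.exe --model e:\LLaMA\models\airoboros-7b-gpt4.ggmlv3.q4_0.bin -ngl 32 -n 30 -s 1234 --temp 0 -p "Hi, my name is" > out-5bf2a27.txt

With greedy sampling and identical settings, any remaining difference in the continuations points at the commit itself (or its CUDA offload path) rather than at random sampling.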

Anyone else experiencing the same issue?
