Description
The commit in question seems to be 20d7740: after it, the generated responses no longer appear to take the prompt into account.
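Side note: the two runs below were started with different (time-based) random seeds. To rule out sampling randomness when comparing the builds, the seed can be pinned on both, e.g. (assuming the -s/--seed flag of main.exe; 42 is just an arbitrary value):

```
# same model and prompt, but with a fixed seed so both builds sample identically
.\main.exe --model e:\LLaMA\models\airoboros-7b-gpt4.ggmlv3.q4_0.bin -ngl 32 -n 30 -s 42 -p "Hi, my name is"
```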
Running the pre-built CUDA executables from GitHub Actions:
llama-master-20d7740-bin-win-cublas-cu11.7.1-x64
PS E:\LLaMA\llamacpp> .\main.exe --model e:\LLaMA\models\airoboros-7b-gpt4.ggmlv3.q4_0.bin -ngl 32 -n 30 -p "Hi, my name is"
main: build = 820 (20d7740)
main: seed = 1689137712
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5
llama.cpp: loading model from e:\LLaMA\models\airoboros-7b-gpt4.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 1932.72 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 288 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloaded 32/35 layers to GPU
llama_model_load_internal: total VRAM used: 3763 MB
llama_new_context_with_model: kv self size = 256.00 MB
system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 30, n_keep = 0
Hi, my name isCounterclockwise: 2016 in Review – Part One
Welcome to Counterclockwise where we take a look back at some
llama_print_timings: load time = 2374.73 ms
llama_print_timings: sample time = 7.17 ms / 30 runs ( 0.24 ms per token, 4181.77 tokens per second)
llama_print_timings: prompt eval time = 402.77 ms / 6 tokens ( 67.13 ms per token, 14.90 tokens per second)
llama_print_timings: eval time = 1391.52 ms / 29 runs ( 47.98 ms per token, 20.84 tokens per second)
llama_print_timings: total time = 1807.27 ms
llama-master-5bf2a27-bin-win-cublas-cu11.7.1-x64
PS E:\LLaMA\llamacpp> .\main.exe --model e:\LLaMA\models\airoboros-7b-gpt4.ggmlv3.q4_0.bin -ngl 32 -n 30 -p "Hi, my name is"
main: build = 819 (5bf2a27)
main: seed = 1689137643
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5
llama.cpp: loading model from e:\LLaMA\models\airoboros-7b-gpt4.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 1932.72 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 288 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloaded 32/35 layers to GPU
llama_model_load_internal: total VRAM used: 3763 MB
llama_new_context_with_model: kv self size = 256.00 MB
system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 30, n_keep = 0
Hi, my name is John and I'm 31 years old.
I was diagnosed with chronic fatigue syndrome in 2015 but
llama_print_timings: load time = 2316.55 ms
llama_print_timings: sample time = 5.91 ms / 30 runs ( 0.20 ms per token, 5079.58 tokens per second)
llama_print_timings: prompt eval time = 376.72 ms / 6 tokens ( 62.79 ms per token, 15.93 tokens per second)
llama_print_timings: eval time = 1419.35 ms / 29 runs ( 48.94 ms per token, 20.43 tokens per second)
llama_print_timings: total time = 1807.44 ms
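For reference, the same comparison can also be reproduced by building the two commits from source instead of using the prebuilt binaries; a rough sketch, assuming a CMake cuBLAS build (adjust generator/toolchain for your setup):

```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout 20d7740            # build 820 (regressed); use 5bf2a27 for build 819 (good)
cmake -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release
```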
Is anyone else experiencing the same issue?