# Prerequisites

Please answer the following questions for yourself before submitting an issue.

- [x] I am running the latest code. Development is very rapid, so there are no tagged versions as of now.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions and have a new bug or useful enhancement to share.
# Expected Behavior

llama.cpp (`server`) continues processing inputs; when the context fills up, a context shift frees room and generation carries on.
# Current Behavior

When chatting with the LLM through `server` (and `api_like_OAI.py`), it works for a while, but then, seemingly once `--ctx-size` is exceeded, it gets into an infinite loop of `context shift`s.

I have mostly seen:

`slot 0: context shift - n_keep = 4092, n_left = 2, n_discard = 1`

but am currently looking at:

`slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947`

It just keeps repeating this at near-full GPU usage without ever continuing; I have to restart `server`.
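For reference, the repeating values are consistent with shift arithmetic along the lines of `n_left = n_past - n_keep - 1` and `n_discard = n_left / 2`. The snippet below is only my reading of the behavior, not the actual server.cpp code; the function and variable names are mine:

```python
# Sketch of the context-shift arithmetic as I understand it (an assumption,
# not actual llama.cpp code): keep the first n_keep tokens of the KV cache
# and discard half of the tokens between n_keep and the current position.
def context_shift(n_past: int, n_keep: int) -> int:
    n_left = n_past - n_keep - 1
    n_discard = n_left // 2
    print(f"context shift - n_keep = {n_keep}, n_left = {n_left}, n_discard = {n_discard}")
    return n_past - n_discard  # cache position after dropping n_discard tokens

# Matches the second log line (ctx-size 6120, n_keep = 2224):
context_shift(6119, 2224)  # n_left = 3894, n_discard = 1947
# Matches the first (n_keep = 4092): each shift frees just one token:
context_shift(4095, 4092)  # n_left = 2, n_discard = 1
```

What stands out is that the logged values never change between iterations, as if the discard is computed each time but the cache position never actually advances.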
# Environment and Context

I've seen this happen both on the Windows host (`llama-b1492-bin-win-cublas-cu12.2.0-x64.zip`) and on WSL2 (tag `b1492`, built with `make LLAMA_CUBLAS=1`), with:

```
server -t 16 -m deepseek-coder-33b-instruct.Q4_K_S.gguf -c 6120 --timeout 30 -ngl 65
```

Note that this is one of the several-times-corrected GGUFs from TheBloke, the latest at the time of writing (there was a tokenizer issue earlier). `md5sum`:

```
19a1079a27fd5a6925a34076de8fbf74  deepseek-coder-33b-instruct.Q4_K_S.gguf
```
- Physical (or virtual) hardware you are using, e.g. for Linux:

From WSL2:

```
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          32
On-line CPU(s) list:             0-31
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           8
Model name:                      AMD Ryzen Threadripper 2950X 16-Core Processor
Stepping:                        2
CPU MHz:                         3493.482
BogoMIPS:                        6986.96
Virtualization:                  AMD-V
Hypervisor vendor:               Microsoft
Virtualization type:             full
L1d cache:                       512 KiB
L1i cache:                       1 MiB
L2 cache:                        8 MiB
L3 cache:                        32 MiB
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr virt_ssbd arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload
```

- Operating System, e.g. for Linux:

```
Linux Jorrit 5.10.43.3-microsoft-standard-WSL2 #1 SMP Wed Jun 16 23:47:55 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
```

- SDK version, e.g. for Linux:

```
Python 3.10.13
GNU Make 4.2.1
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
```
# Failure Information (for bugs)

Please help provide information about the failure / bug.
# Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

- Start the server:

  ```
  server -t 16 -m deepseek-coder-33b-instruct.Q4_K_S.gguf -c 6120 --timeout 30 -ngl 65
  ```

- Start the OpenAI-style API wrapper:

  ```
  python api_like_OAI.py --chat-prompt "You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n" --user-name "\n### Instruction:\n" --ai-name "\n### Response:\n" --system-name "\n"
  ```

- Talk to the API until the context size is exceeded (I use Aider's test benchmark, which is tricky to get working, but if interested: instructions). A hypothetical request sketch follows this list.

- Observe the infinite `context shift` loop.
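In case it helps, here is a hypothetical sketch of the third step. It assumes `api_like_OAI.py` is listening on its default port 8081 with an OpenAI-style `/v1/chat/completions` route (adjust the URL and payload to your setup); the questions are arbitrary filler to grow the history:

```python
# Hypothetical repro: keep appending turns to the chat history until the
# accumulated context exceeds --ctx-size (6120 tokens here). Port and route
# are assumptions based on my api_like_OAI.py setup.
import requests

URL = "http://127.0.0.1:8081/v1/chat/completions"

messages = []
for i in range(20):
    messages.append({"role": "user", "content": f"Question {i}: explain quicksort in detail."})
    resp = requests.post(URL, json={"messages": messages, "max_tokens": 512}, timeout=300)
    reply = resp.json()["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": reply})
    print(i, resp.status_code, len(reply))

# Once the history exceeds --ctx-size, server starts logging "context shift"
# lines; for me it then never returns and the request eventually times out.
```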
# Failure Logs

```
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
{"timestamp":1699281996,"level":"INFO","function":"main","line":2267,"message":"build info","build":1492,"commit":"2833a6f"}
{"timestamp":1699281996,"level":"INFO","function":"main","line":2274,"message":"system info","n_threads":16,"n_threads_batch":-1,"total_threads":32,"system_info":"AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | "}
llama_model_loader: loaded meta data with 22 key-value pairs and 561 tensors from s:\WizardCoder34B\deepseek-coder-33b-instruct.Q4_K_S.gguf (version GGUF V3 (latest))
( ... llama_model_loader ... )
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: llama.context_length u32
llama_model_loader: - kv 3: llama.embedding_length u32
llama_model_loader: - kv 4: llama.block_count u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama_model_loader: - kv 7: llama.attention.head_count u32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: llama.rope.freq_base f32
llama_model_loader: - kv 11: llama.rope.scale_linear f32
llama_model_loader: - kv 12: general.file_type u32
llama_model_loader: - kv 13: tokenizer.ggml.model str
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr
llama_model_loader: - kv 15: tokenizer.ggml.scores arr
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr
llama_model_loader: - kv 17: tokenizer.ggml.merges arr
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32
llama_model_loader: - kv 21: general.quantization_version u32
llama_model_loader: - type f32: 125 tensors
llama_model_loader: - type q4_K: 427 tensors
llama_model_loader: - type q5_K: 8 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: mismatch in special tokens definition ( 243/32256 vs 237/32256 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 32256
llm_load_print_meta: n_merges = 31757
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_embd = 7168
llm_load_print_meta: n_head = 56
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 62
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 19200
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 100000.0
llm_load_print_meta: freq_scale_train = 0.25
llm_load_print_meta: n_yarn_orig_ctx = 16384
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = mostly Q4_K - Small
llm_load_print_meta: model params = 33.34 B
llm_load_print_meta: model size = 17.59 GiB (4.53 BPW)
llm_load_print_meta: general.name = deepseek-ai_deepseek-coder-33b-instruct
llm_load_print_meta: BOS token = 32013 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 32021 '<|EOT|>'
llm_load_print_meta: PAD token = 32014 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token = 30 '?'
llm_load_tensors: ggml ctx size = 0.21 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 124.24 MB
llm_load_tensors: offloading 62 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 65/65 layers to GPU
llm_load_tensors: VRAM used: 17891.45 MB
...................................................................................................
llama_new_context_with_model: n_ctx = 6120
llama_new_context_with_model: freq_base = 100000.0
llama_new_context_with_model: freq_scale = 0.25
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 1482.19 MB
llama_new_context_with_model: kv self size = 1482.19 MB
llama_build_graph: non-view tensors processed: 1430/1430
llama_new_context_with_model: compute buffer total size = 729.96 MB
llama_new_context_with_model: VRAM scratch buffer: 723.33 MB
llama_new_context_with_model: total VRAM used: 20096.97 MB (model: 17891.45 MB, context: 2205.52 MB)
Available slots:
-> Slot 0 - max context: 6120
llama server listening at http://0.0.0.0:8080
( ... lots of API calls ... )
print_timings: prompt eval time = 514.27 ms / 521 tokens ( 0.99 ms per token, 1013.09 tokens per second)
print_timings: eval time = 9365.17 ms / 250 runs ( 37.46 ms per token, 26.69 tokens per second)
print_timings: total time = 9879.43 ms
slot 0 released (1119 tokens in cache)
{"timestamp":1699284174,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57682,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 204]
slot 0 : in cache: 347 tokens | to process: 934 tokens
slot 0 : kv cache rm - [347, end)
print_timings: prompt eval time = 845.49 ms / 934 tokens ( 0.91 ms per token, 1104.68 tokens per second)
print_timings: eval time = 13463.77 ms / 352 runs ( 38.25 ms per token, 26.14 tokens per second)
print_timings: total time = 14309.26 ms
slot 0 released (1634 tokens in cache)
{"timestamp":1699284188,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57686,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 205]
slot 0 : in cache: 336 tokens | to process: 1888 tokens
slot 0 : kv cache rm - [336, end)
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot unavailable
{"timestamp":1699284790,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57694,"status":404,"method":"POST","path":"/completion","params":{}}
slot unavailable
{"timestamp":1699284790,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57698,"status":404,"method":"POST","path":"/completion","params":{}}
slot unavailable
{"timestamp":1699284791,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57702,"status":404,"method":"POST","path":"/completion","params":{}}
slot unavailable
{"timestamp":1699284797,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57706,"status":404,"method":"POST","path":"/completion","params":{}}
slot unavailable
{"timestamp":1699284803,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57710,"status":404,"method":"POST","path":"/completion","params":{}}
slot unavailable
{"timestamp":1699284833,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57714,"status":404,"method":"POST","path":"/completion","params":{}}
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot unavailable
{"timestamp":1699284864,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57718,"status":404,"method":"POST","path":"/completion","params":{}}
slot unavailable
{"timestamp":1699284900,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57722,"status":404,"method":"POST","path":"/completion","params":{}}
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot unavailable
{"timestamp":1699284975,"level":"INFO","function":"log_server_request","line":2217,"message":"request","remote_addr":"172.22.146.2","remote_port":57726,"status":404,"method":"POST","path":"/completion","params":{}}
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
slot 0: context shift - n_keep = 2224, n_left = 3894, n_discard = 1947
```

(repeats forever)