
server: fix core dump when input prompt larger than prompt context #4022


Closed
wants to merge 1 commit

Conversation

@iohub (Contributor) commented Nov 10, 2023

Fix the core dump caused by:

  • the input prompt being larger than the prompt context (variable n_ctx)
  • and not hitting the slot.params.cache_prompt code branch.

Root cause location:

                    if (!slot.params.cache_prompt) // root cause: the prompt is too long and was not truncated
                    {
                        llama_sampling_reset(slot.ctx_sampling);

                        slot.n_past = 0;
                        slot.num_prompt_tokens_processed = slot.num_prompt_tokens;
                    }
                    else ...
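
This branch resets slot.n_past and then queues every prompt token, so nothing caps the token count at n_ctx. The PR's actual diff is not shown in this conversation; the following is only a minimal sketch of the kind of guard that prevents the overflow (illustrative, not the PR's change):

    #include <vector>
    #include "llama.h"

    // Sketch only: clamp an over-long prompt to the context window before its
    // tokens are queued with llama_batch_add. Bookkeeping such as slot.truncated
    // and slot.num_prompt_tokens would also need updating in the real server code.
    static void truncate_prompt_if_needed(std::vector<llama_token> & prompt_tokens, int n_ctx) {
        if ((int) prompt_tokens.size() >= n_ctx) {
            // keep only the most recent n_ctx - 1 tokens
            prompt_tokens.erase(prompt_tokens.begin(),
                                prompt_tokens.end() - (n_ctx - 1));
        }
    }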

My gdb debug log:

Core was generated by `bin/server -ngl 32 -m /home/do/ssd/modelhub/Wizard-GGUF/wizardcoder-python-34b-'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055aae060f588 in llama_batch_add (batch=..., id=1053, pos=101, seq_ids=..., logits=false) at /home/do/ssd/local/llama.cpp/common/common.cpp:937
937	        batch.seq_id[batch.n_tokens][i] = seq_ids[i];
[Current thread is 1 (Thread 0x7f25417db000 (LWP 30525))]

(gdb) bt

#0  0x000055aae060f588 in llama_batch_add (batch=..., id=1053, pos=101, seq_ids=std::vector of length 1, capacity 1 = {...}, logits=false)
    at /home/do/ssd/local/llama.cpp/common/common.cpp:937
#1  0x000055aae05202f2 in llama_server_context::update_slots (this=0x7ffc89d270f0) at /home/do/ssd/local/llama.cpp/examples/server/server.cpp:1635
#2  0x000055aae04f55c2 in main (argc=9, argv=0x7ffc89d27708) at /home/do/ssd/local/llama.cpp/examples/server/server.cpp:2571
(gdb) f 0
#0  0x000055aae060f588 in llama_batch_add (batch=..., id=1053, pos=101, seq_ids=std::vector of length 1, capacity 1 = {...}, logits=false)
    at /home/do/ssd/local/llama.cpp/common/common.cpp:937
937	        batch.seq_id[batch.n_tokens][i] = seq_ids[i];

(gdb) l 937
932	                               bool   logits) {
933	    batch.token   [batch.n_tokens] = id;
934	    batch.pos     [batch.n_tokens] = pos,
935	    batch.n_seq_id[batch.n_tokens] = seq_ids.size();
936	    for (size_t i = 0; i < seq_ids.size(); ++i) {
937	        batch.seq_id[batch.n_tokens][i] = seq_ids[i];
938	    }
939	    batch.logits  [batch.n_tokens] = logits;
940	
941	    batch.n_tokens++;

(gdb) p batch.n_tokens
$1 = 101

(gdb) f 1
#1  0x000055aae05202f2 in llama_server_context::update_slots (this=0x7ffc89d270f0) at /home/do/ssd/local/llama.cpp/examples/server/server.cpp:1635
1635	                       llama_batch_add(batch, prefix_tokens[slot.n_past], system_tokens.size() + slot.n_past, { slot.id }, false);

(gdb) p n_ctx
$2 = 100
(gdb) 
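The backtrace makes the mismatch visible: the faulting write happens at batch.n_tokens = 101 while n_ctx = 100, so the batch arrays are indexed past their capacity (assuming, as the crash arithmetic suggests, that the server allocates its batch from the context size). A small illustration of that sizing:

    #include "llama.h"

    // Illustration only (assumed sizing): a batch created for 100 tokens owns
    // arrays with valid indices 0..99, so the write at index 101 seen in the
    // backtrace lands outside the allocation and faults.
    static void batch_capacity_demo() {
        llama_batch batch = llama_batch_init(/*n_tokens*/ 100, /*embd*/ 0, /*n_seq_max*/ 1);

        // batch.seq_id[101][0] = 0;   // an out-of-bounds write like the one above

        llama_batch_free(batch);
    }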

@iohub iohub marked this pull request as draft November 10, 2023 15:15
@iohub iohub marked this pull request as ready for review November 10, 2023 15:17
@jhen0409 (Collaborator) commented:

It looks like PR #3996 already fixes this, can you try it?

@iohub (Contributor, Author) commented Nov 10, 2023

It looks like PR #3996 already fixes this, can you try it?

Thanks for the heads up, I'll try this fix!

@iohub (Contributor, Author) commented Nov 11, 2023

I have tried PR #3996, but I have a question about this fix proposal:

  • only half of the prompt context is used when the prompt is first submitted,

whereas this PR uses all of the prompt context space.

As the log shows, only 51 tokens are used for inference (see the worked example after the diff below).

server launch command: 
bin/server -ngl 32 -m ~/ssd/modelhub/Wizard-GGUF/wizardcoder-python-34b-v1.0.Q5_K_M.gguf -t 20 -c 100

Simplified log:

all slots are idle and system prompt is empty, clear the KV cache
slot 0 is processing [task id: 0]
prompt_tokens size:51
slot 0 : kv cache rm - [0, end)
slot 0: context shift - n_keep = 0, n_left = 98, n_discard = 49

print_timings: prompt eval time =    5533.65 ms /    51 tokens (  108.50 ms per token,     9.22 tokens per second)

The changes according to PR #3996:

 
+
+                    if (slot.params.n_keep < 0)
+                    {
+                        slot.params.n_keep = slot.num_prompt_tokens;
+                    }
+                    slot.params.n_keep = std::min(slot.n_ctx - 4, slot.params.n_keep);
+
+                    // if input prompt is too big, truncate it
+                    if (slot.num_prompt_tokens >= slot.n_ctx)
+                    {
+                        const int n_left = slot.n_ctx - slot.params.n_keep;
+                        const int n_block_size = n_left / 2;
+                        const int erased_blocks = (slot.num_prompt_tokens - slot.params.n_keep - n_block_size) / n_block_size;
+
+                        std::vector<llama_token> new_tokens(prompt_tokens.begin(), prompt_tokens.begin() + slot.params.n_keep);
+                        new_tokens.insert(new_tokens.end(), prompt_tokens.begin() + slot.params.n_keep + erased_blocks * n_block_size, prompt_tokens.end());
+
+                        LOG_VERBOSE("input truncated", {
+                            {"n_ctx",  slot.n_ctx},
+                            {"n_keep", slot.params.n_keep},
+                            {"n_left", n_left},
+                            {"new_tokens", tokens_to_str(ctx, new_tokens.cbegin(), new_tokens.cend())},
+                        });
+                        slot.truncated = true;
+                        prompt_tokens = new_tokens;
+
+                        slot.num_prompt_tokens = prompt_tokens.size();
+                        GGML_ASSERT(slot.num_prompt_tokens < slot.n_ctx);
+                    }
+
+                    printf("prompt_tokens size:%d\n", prompt_tokens.size());
+
                     if (!slot.params.cache_prompt)
                     {
                         llama_sampling_reset(slot.ctx_sampling);
@@ -1565,36 +1597,7 @@ struct llama_server_context
                         slot.num_prompt_tokens_processed = slot.num_prompt_tokens;
                     }
                     else
-                    {
-                        if (slot.params.n_keep < 0)
-                        {
-                            slot.params.n_keep = slot.num_prompt_tokens;

@jhen0409 (Collaborator) commented:

  • only half of the prompt context is used when the prompt is first submitted.

Truncating to half is the expected default, and the kept portion can be increased by setting n_keep.
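
Continuing the same numbers (still assuming a 101-token prompt and n_ctx = 100), raising n_keep keeps most of the prompt instead of half; a sketch with an assumed n_keep of 80:

    // Same PR #3996 formula, now with n_keep = 80 supplied by the client
    // (it is first clamped to n_ctx - 4 = 96, which leaves 80 unchanged).
    const int n_ctx             = 100;
    const int n_keep            = 80;
    const int num_prompt_tokens = 101;

    const int n_left        = n_ctx - n_keep;                               // 20
    const int n_block_size  = n_left / 2;                                   // 10
    const int erased_blocks = (num_prompt_tokens - n_keep - n_block_size)
                              / n_block_size;                               // 1

    const int kept = num_prompt_tokens - erased_blocks * n_block_size;      // 91 instead of 51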

@iohub (Contributor, Author) commented Nov 11, 2023

Thanks, I see. I'll try it.

@iohub iohub closed this Nov 13, 2023