
server: fix core dump when input prompt larger than prompt context #4022


Closed
wants to merge 1 commit

Conversation

@iohub (Contributor) commented Nov 10, 2023

Fix the core dump caused by:

  • the input prompt being larger than the prompt context (variable n_ctx)
  • and not hitting the slot.params.cache_prompt code branch.

Root cause location:

                    if (!slot.params.cache_prompt) // root cause: the prompt is too long and was not truncated
                    {
                        llama_sampling_reset(slot.ctx_sampling);

                        slot.n_past = 0;
                        slot.num_prompt_tokens_processed = slot.num_prompt_tokens;
                    }
                    else ...
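
This branch resets slot.n_past and then queues every prompt token, so nothing caps the token count at n_ctx. The PR's actual diff is not shown in this conversation; the following is only a minimal sketch of the kind of guard that prevents the overflow (illustrative, not the PR's change):

    #include <vector>
    #include "llama.h"

    // Sketch only: clamp an over-long prompt to the context window before its
    // tokens are queued with llama_batch_add. Bookkeeping such as slot.truncated
    // and slot.num_prompt_tokens would also need updating in the real server code.
    static void truncate_prompt_if_needed(std::vector<llama_token> & prompt_tokens, int n_ctx) {
        if ((int) prompt_tokens.size() >= n_ctx) {
            // keep only the most recent n_ctx - 1 tokens
            prompt_tokens.erase(prompt_tokens.begin(),
                                prompt_tokens.end() - (n_ctx - 1));
        }
    }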

My gdb debug log:

Core was generated by `bin/server -ngl 32 -m /home/do/ssd/modelhub/Wizard-GGUF/wizardcoder-python-34b-'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055aae060f588 in llama_batch_add (batch=..., id=1053, pos=101, seq_ids=..., logits=false) at /home/do/ssd/local/llama.cpp/common/common.cpp:937
937	        batch.seq_id[batch.n_tokens][i] = seq_ids[i];
[Current thread is 1 (Thread 0x7f25417db000 (LWP 30525))]

(gdb) bt

#0  0x000055aae060f588 in llama_batch_add (batch=..., id=1053, pos=101, seq_ids=std::vector of length 1, capacity 1 = {...}, logits=false)
    at /home/do/ssd/local/llama.cpp/common/common.cpp:937
#1  0x000055aae05202f2 in llama_server_context::update_slots (this=0x7ffc89d270f0) at /home/do/ssd/local/llama.cpp/examples/server/server.cpp:1635
#2  0x000055aae04f55c2 in main (argc=9, argv=0x7ffc89d27708) at /home/do/ssd/local/llama.cpp/examples/server/server.cpp:2571
(gdb) f 0
#0  0x000055aae060f588 in llama_batch_add (batch=..., id=1053, pos=101, seq_ids=std::vector of length 1, capacity 1 = {...}, logits=false)
    at /home/do/ssd/local/llama.cpp/common/common.cpp:937
937	        batch.seq_id[batch.n_tokens][i] = seq_ids[i];

(gdb) l 937
932	                               bool   logits) {
933	    batch.token   [batch.n_tokens] = id;
934	    batch.pos     [batch.n_tokens] = pos,
935	    batch.n_seq_id[batch.n_tokens] = seq_ids.size();
936	    for (size_t i = 0; i < seq_ids.size(); ++i) {
937	        batch.seq_id[batch.n_tokens][i] = seq_ids[i];
938	    }
939	    batch.logits  [batch.n_tokens] = logits;
940	
941	    batch.n_tokens++;

(gdb) p batch.n_tokens
$1 = 101

(gdb) f 1
#1  0x000055aae05202f2 in llama_server_context::update_slots (this=0x7ffc89d270f0) at /home/do/ssd/local/llama.cpp/examples/server/server.cpp:1635
1635	                       llama_batch_add(batch, prefix_tokens[slot.n_past], system_tokens.size() + slot.n_past, { slot.id }, false);

(gdb) p n_ctx
$2 = 100
(gdb) 
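The backtrace makes the mismatch visible: the faulting write happens at batch.n_tokens = 101 while n_ctx = 100, so the batch arrays are indexed past their capacity (assuming, as the crash arithmetic suggests, that the server allocates its batch from the context size). A small illustration of that sizing:

    #include "llama.h"

    // Illustration only (assumed sizing): a batch created for 100 tokens owns
    // arrays with valid indices 0..99, so the write at index 101 seen in the
    // backtrace lands outside the allocation and faults.
    static void batch_capacity_demo() {
        llama_batch batch = llama_batch_init(/*n_tokens*/ 100, /*embd*/ 0, /*n_seq_max*/ 1);

        // batch.seq_id[101][0] = 0;   // an out-of-bounds write like the one above

        llama_batch_free(batch);
    }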

@iohub iohub marked this pull request as draft November 10, 2023 15:15
@iohub iohub marked this pull request as ready for review November 10, 2023 15:17
@jhen0409 (Collaborator) commented:

It looks like PR #3996 already fixes this, can you try it?

@iohub (Contributor, Author) commented Nov 10, 2023

It looks like PR #3996 already fixes this, can you try it?

Thanks for the heads up, I'll try this fix!

@iohub (Contributor, Author) commented Nov 11, 2023

I have tried PR #3996, but I have a question about this fix proposal:

  • only half of the prompt context is used when the prompt is first submitted,

whereas this PR uses all of the prompt context space.

As the log shows, only 51 tokens are used for inference (see the worked example after the diff below).

server launch command: 
bin/server -ngl 32 -m ~/ssd/modelhub/Wizard-GGUF/wizardcoder-python-34b-v1.0.Q5_K_M.gguf -t 20 -c 100

Simplified log:

all slots are idle and system prompt is empty, clear the KV cache
slot 0 is processing [task id: 0]
prompt_tokens size:51
slot 0 : kv cache rm - [0, end)
slot 0: context shift - n_keep = 0, n_left = 98, n_discard = 49

print_timings: prompt eval time =    5533.65 ms /    51 tokens (  108.50 ms per token,     9.22 tokens per second)

The changes according to PR #3996:

 
+
+                    if (slot.params.n_keep < 0)
+                    {
+                        slot.params.n_keep = slot.num_prompt_tokens;
+                    }
+                    slot.params.n_keep = std::min(slot.n_ctx - 4, slot.params.n_keep);
+
+                    // if input prompt is too big, truncate it
+                    if (slot.num_prompt_tokens >= slot.n_ctx)
+                    {
+                        const int n_left = slot.n_ctx - slot.params.n_keep;
+                        const int n_block_size = n_left / 2;
+                        const int erased_blocks = (slot.num_prompt_tokens - slot.params.n_keep - n_block_size) / n_block_size;
+
+                        std::vector<llama_token> new_tokens(prompt_tokens.begin(), prompt_tokens.begin() + slot.params.n_keep);
+                        new_tokens.insert(new_tokens.end(), prompt_tokens.begin() + slot.params.n_keep + erased_blocks * n_block_size, prompt_tokens.end());
+
+                        LOG_VERBOSE("input truncated", {
+                            {"n_ctx",  slot.n_ctx},
+                            {"n_keep", slot.params.n_keep},
+                            {"n_left", n_left},
+                            {"new_tokens", tokens_to_str(ctx, new_tokens.cbegin(), new_tokens.cend())},
+                        });
+                        slot.truncated = true;
+                        prompt_tokens = new_tokens;
+
+                        slot.num_prompt_tokens = prompt_tokens.size();
+                        GGML_ASSERT(slot.num_prompt_tokens < slot.n_ctx);
+                    }
+
+                    printf("prompt_tokens size:%d\n", prompt_tokens.size());
+
                     if (!slot.params.cache_prompt)
                     {
                         llama_sampling_reset(slot.ctx_sampling);
@@ -1565,36 +1597,7 @@ struct llama_server_context
                         slot.num_prompt_tokens_processed = slot.num_prompt_tokens;
                     }
                     else
-                    {
-                        if (slot.params.n_keep < 0)
-                        {
-                            slot.params.n_keep = slot.num_prompt_tokens;

@jhen0409 (Collaborator) commented:

  • only half of the prompt context is used when the prompt is first submitted.

Truncating to half is the expected default, and the kept portion can be increased by setting n_keep.
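
Continuing the same numbers (still assuming a 101-token prompt and n_ctx = 100), raising n_keep keeps most of the prompt instead of half; a sketch with an assumed n_keep of 80:

    // Same PR #3996 formula, now with n_keep = 80 supplied by the client
    // (it is first clamped to n_ctx - 4 = 96, which leaves 80 unchanged).
    const int n_ctx             = 100;
    const int n_keep            = 80;
    const int num_prompt_tokens = 101;

    const int n_left        = n_ctx - n_keep;                               // 20
    const int n_block_size  = n_left / 2;                                   // 10
    const int erased_blocks = (num_prompt_tokens - n_keep - n_block_size)
                              / n_block_size;                               // 1

    const int kept = num_prompt_tokens - erased_blocks * n_block_size;      // 91 instead of 51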

@iohub (Contributor, Author) commented Nov 11, 2023

Thanks, I see. I'll try it.

@iohub iohub closed this Nov 13, 2023