[User] -n -2 generates nothing #2754

Closed
ghost opened this issue Aug 23, 2023 · 21 comments · Fixed by #2767

Comments

@ghost

ghost commented Aug 23, 2023

See this post: #2754 (comment)

@ghost ghost mentioned this issue Aug 23, 2023
@KerfuffleV2
Collaborator

KerfuffleV2 commented Aug 23, 2023

Your prompt is longer than the requested context size. I seem to recall running into weirdness involving that previously as well.

If it did respect the context size, it would exit partway through evaluating the prompt and before you got a chance to interact.

@ghost
Author

ghost commented Aug 23, 2023

Your prompt is longer than the requested context size.

Nope, the prompt is 13 tokens. Here's another example with -c 50:

USER: Hi. ASSISTANT: Hello! How can I help you today? 
USER: list 3 movie titles.                                                                                               
1) The Lion King                                        
2) The Incredibles                                   
3) Finding Nemo                                       
4) The Little Mermaid                             
5) Cars                                              
6) Monsters, Inc.                                   
7) Buzz Lightyear of Star Command                        
8) The Lion King                                          
9) Hakuna Matata from The Lion                                                                                        

Assistant used 77 tokens and I had to CTRL + C to end it.

@KerfuffleV2
Collaborator

Apologies, I read it wrong. Sorry for the confusion.

@KerfuffleV2
Collaborator

Ahh, I think I know what's going on. You have -n -1 (generate forever):

-n N, --n-predict N   number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)

and also keep is set to -1:

--keep N              number of tokens to keep from the initial prompt (default: 0, -1 = all)

It'll just try to roll over the context whenever it fills up and keep those prompt tokens. Since the context is so small, it's not surprising this causes fairly nonsensical results. (The context rollover thing is kind of hit or miss in general since it happens in an arbitrary place that may be in the middle of a word.)

Anyway, try setting -n to either -2 or a specific value <= the context size.
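To illustrate what those values mean, here's a rough sketch of the stopping rule as I understand it (the names are made up for the example; this is not the actual code in examples/main):

```cpp
// Toy illustration of how the three -n / --n-predict cases translate into a
// stopping rule. Function and variable names here are illustrative only.
#include <cstdio>

// n_predict: value of -n; n_generated: tokens generated so far;
// n_past: tokens currently in the context; n_ctx: value of -c
static bool should_stop(int n_predict, int n_generated, int n_past, int n_ctx) {
    if (n_predict == -2) {
        return n_past >= n_ctx;            // -2: stop once the context window is full
    }
    if (n_predict >= 0) {
        return n_generated >= n_predict;   // N: stop after N generated tokens
    }
    return false;                          // -1: never stop; rely on context rollover
}

int main() {
    std::printf("%d\n", should_stop(-2, 30, 49, 50));    // 0: one token still fits
    std::printf("%d\n", should_stop(-2, 31, 50, 50));    // 1: context full, stop
    std::printf("%d\n", should_stop(-1, 99999, 50, 50)); // 0: -1 keeps going forever
    return 0;
}
```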

@ghost
Author

ghost commented Aug 24, 2023

Ahh, I think I know what's going on. You have -n -1 (generate forever):

--n-predict defaults to -1. I don't pass it to main, so llama.cpp uses the default. It's the number of tokens llama.cpp generates before stopping, and that default unexpectedly overrides the context limit.

It'll just try to roll over the context whenever it fills up and keep those prompt tokens.

Yes, but it shouldn't roll over the context.

Anyway, try setting -n to either -2 or a specific value <= the context size.

Screenshot_20230824_090025

The image shows llama.cpp produced 20 tokens total (equal to -n 20), then hung and didn't respond.

I think this is easily reproducible. Here's another example where llama.cpp just hangs with -n -2:

generate: n_ctx = 50, n_batch = 7, n_predict = -2, n_keep = 13
== Running in interactive mode. ==
...                                                           
USER: Hi. ASSISTANT: Hello!

llama.cpp didn't even try to fill the context.

@KerfuffleV2
Collaborator

Yes, but it shouldn't roll over the context.

Well, it should if n_predict == -1. If you're setting n_predict to a different value and something is overriding it then that's a different issue.

The image shows llama.cpp produced 20 tokens total (equal to -n 20), then hung and didn't respond.

Certainly that behavior is wrong too.

It's different from your original problem where you thought the size just wasn't respected though.

@ghost
Author

ghost commented Aug 24, 2023

Thanks for your response, though I don't understand the purpose of the context size if llama.cpp ignores it by default. To clarify, llama.cpp defaults to -n -1.

All these other weird behaviors stem from llama.cpp neglecting to stop when it reaches the maximum context.

Edit: I'm so ready for a trace 😅

@KerfuffleV2
Collaborator

Thanks for your response, though I don't understand the purpose of the context size if llama.cpp ignores it by default.

You can set it to a non-default value and have a context limit if you want. The context size also affects how much memory (CPU or GPU) is used so it still has significant effects even aside from limiting how many tokens are generated.

All these other weird behaviors stem from llama.cpp neglecting to stop when it reaches the maximum context.

I'm afraid I don't agree here. If n_predict is -1 then this is specifically saying generate an infinite amount of tokens. You can't be surprised when it generates an infinite amount of tokens.

If you set n_predict to 50 or to -2 (which specifically means "generate until the context is full") and it doesn't stop at 50 tokens (or full context as the case may be) then that is absolutely a problem. If it gets stuck and just spins without producing output, that's absolutely a problem.

But generating tokens past the context limit when n_predict == -1 is expected behavior, and whether the default should be changed is an administrative question rather than a bug in the program.

@ghost
Author

ghost commented Aug 24, 2023

I'm afraid I don't agree here. If n_predict is -1 then this is specifically saying generate an infinite amount of tokens. You can't be surprised when it generates an infinite amount of tokens.

Alright, let's try to clarify here. Maybe this comes down to interpretation, but the README describes --n-predict as "Set the number of tokens to predict when generating text."

This value affects the number of tokens llama.cpp predicts during inference. It cannot predict and generate more tokens than initially defined by the context.

It refers to the length of the generation, not the context size. Essentially, I'm saying that llama.cpp should not generate or predict more tokens than set in -c.

But generating tokens past the context limit when n_predict == -1 is expected behavior

I disagree. That's not my understanding of the README.

If you set n_predict to 50 or to -2 (which specifically means "generate until the context is full") and it doesn't stop at 50 tokens (or full context as the case may be) then that is absolutely a problem. If it gets stuck and just spins without producing output, that's absolutely a problem.

I've also noticed this, but I didn't intend to discover it. I don't have the time to explore more than one bug at a time, I'm sure you understand.

@KerfuffleV2
Collaborator

It cannot predict and generate more tokens than initially defined by the context.

Oh, but it actually can!

There's special logic to handle running into the context limit. Assuming you didn't set n_predict to a value <= that limit or to -2, once it hits the context limit it will shuffle around the list of last tokens and keep generating, overwriting some of the existing state. If --keep was set, it'll reapply that number of prompt tokens first.

This is a poor man's infinite context. It was more useful before the discovery of the RoPE scaling stuff and LLaMA2 supporting 4096 context off the bat, but using this approach it was possible to get pretty coherent output past the 2048 context limit of LLaMA1 models.

See: https://github.com/ggerganov/llama.cpp/tree/master/examples/main#number-of-tokens-to-predict

A value of -1 will enable infinite text generation, even though we have a finite context window. When the context window is full, some of the earlier tokens (half of the tokens after --n-keep) will be discarded. The context must then be re-evaluated before generation can resume.
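Roughly, the rollover described there amounts to something like this (a simplified sketch of the idea with made-up names, not the exact code in examples/main/main.cpp):

```cpp
// Simplified sketch of the "infinite text generation" rollover: keep the first
// n_keep tokens, discard half of what follows, and re-evaluate the rest before
// generation resumes. Names here are illustrative, not the real ones.
#include <cstdio>
#include <vector>

static std::vector<int> roll_context(const std::vector<int> & tokens, int n_keep) {
    const int n_left    = (int) tokens.size() - n_keep; // tokens past the kept prompt
    const int n_discard = n_left / 2;                   // half of them get dropped

    std::vector<int> rolled(tokens.begin(), tokens.begin() + n_keep);
    rolled.insert(rolled.end(),
                  tokens.begin() + n_keep + n_discard,  // most recent half survives
                  tokens.end());
    return rolled;                                      // this gets re-evaluated
}

int main() {
    std::vector<int> ctx(100);                          // pretend n_ctx = 100 is full
    for (int i = 0; i < (int) ctx.size(); ++i) ctx[i] = i;

    const auto rolled = roll_context(ctx, /*n_keep=*/16);
    std::printf("tokens carried over: %zu\n", rolled.size()); // 16 kept + 42 recent = 58
    return 0;
}
```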

@ghost
Author

ghost commented Aug 24, 2023

Oh, but it actually can!

There's special logic to handle running into the context limit.

Oh dear... I stand corrected! I'll test various -n values today. Thank you.

@ghost ghost closed this as completed Aug 24, 2023
@ghost
Author

ghost commented Aug 24, 2023

Perhaps there's a reason for -n -2 generating nothing, but I dunno:

Screenshot_20230824_102934

./main -m ~/llama-2-chat.gguf --color -c 40 --keep -1 -n -2 -t 3 -b 7 -i -r "USER:" --in-prefix " " -p "USER: Hi."

Had to CTRL+C.

Before saying I need a higher context or that my prompt template is wrong: here's a perfect template with more context, same result:
Screenshot_20230824_112458

@ghost ghost changed the title [User] Context size isn't respected [User] -n -2 generates nothing Aug 24, 2023
@ghost ghost reopened this Aug 24, 2023
@KerfuffleV2
Collaborator

KerfuffleV2 commented Aug 24, 2023

Perhaps there's a reason for -n -2 generating nothing,

Not as far as I know. So what you thought was originally a bug was okay (though there's probably an argument for changing the default). It seems like you did find a genuine issue though. Glad you already reopened this yourself, I was going to suggest that.


I tried it and wasn't able to replicate:

main: static prompt based on n_keep: ' Once upon a time, in a dark forest, there lived a little fox'

sampling: repeat_last_n = 0, repeat_penalty = 1.040000, presence_penalty = 0.000000, frequency_penalty = 0.060000, top_k = 70, tfs_z = 0.950000, top_p = 2.000000, typical_p = 0.250000, temp = 1.200000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000, seqrep(last_n = -1, min_length = 3, start_offset = 0, presence_penalty = 1.0000, length_penalty = 4.0000, tolerance = 0.7500, mid_word_scale = 0.0000, tolerance_match_credit = 0.2500, tolerance_half_step_cost = 0.2500, flags = 76)
generate: n_ctx = 100, n_batch = 256, n_predict = -2, n_keep = 16


 Once upon a time, in a dark forest, there lived a little fox named Fie. Fie was a mischievous little fox who loved to play pranks on his forest friends. One day, Fie saw a beautiful butterfly and decided to catch it and play a trick on her.

"Give me your wings, butterfly," said Fie. "I'll make you fly around the forest."

But the butterfly was too good for Fie's trick

main: context full, stopping generation

Are you running on Metal? I've heard there were issues that could lead to llama.cpp getting stuck. Edit: #2678 is the one I was thinking of.

@ghost
Author

ghost commented Aug 24, 2023

So what you thought was originally a bug was okay (though there's probably an argument for changing the default). It seems like you did find a genuine issue though. Glad you already reopened this yourself, I was going to suggest that.

Admittedly, -n -1 confuses me, but the README explains the behavior, so you're correct. Probably -n -2 should be the default. Thanks for clarifying this with me.

Are you running on Metal? I've heard there were issues that could lead to llama.cpp getting stuck. Edit: #2678 is the one I was thinking of.

Nope, Linux localhost 4.14.190-23725627-abG975WVLS8IWD1 #2 SMP PREEMPT Mon Apr 10 18:16:39 KST 2023 aarch64 Android, CPU build.

I tried your example, and it got stuck again:

./main -m ~/llama-2-chat.gguf --color -c 200 --keep -1 -n -2 -t 3 -b 7 -i -p "Once upon a time, in a dark forest, there lived a little fox"

generate: n_ctx = 200, n_batch = 7, n_predict = -2, n_keep = 17
== Running in interactive mode. ==                          
...
                                                                                                        
Once upon a time, in a dark forest, there lived a little fox Ummmm test test                                                                                                                                                               

llama_print_timings:        load time =  3887.72 ms        
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =  4560.34 ms /    17 tokens (  268.26 ms per token,     3.73 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 42952.01 ms

@KerfuffleV2
Collaborator

Ah, okay, I can replicate the issue when specifying the prompt with -p (I was previously reading it from a file with -f) and using --interactive. This seems like a problem specific to interactive mode.

Not sure what's actually causing the issue. It doesn't seem like it's spinning and consuming CPU; it seems like it thinks it's supposed to be reading user input.

@ghost
Author

ghost commented Aug 24, 2023

It doesn't seem like it's spinning and consuming CPU; it seems like it thinks it's supposed to be reading user input.

Exactly! Whew, sometimes I underestimate what it takes to replicate an issue, so thanks for homing in on this with me.

I was reluctant to change -n -1 cuz' the result threw me off.

@KerfuffleV2
Collaborator

KerfuffleV2 commented Aug 24, 2023

It's getting set back to interacting here: https://github.com/ggerganov/llama.cpp/blob/ef955fbd230c571cc1cda0d19baaeec347523175/examples/main/main.cpp#L800-L804

It happens almost immediately even if I set -c 1000 -n -2 (so there's plenty of context space left). If I hit / to continue without inserting a newline it'll just go back to interacting.

Something is definitely wrong with that logic (or the state at that point).
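My guess is the check there is shaped roughly like this (a sketch of the shape of the logic with made-up names, not a copy of the real code), which would explain why -2 falls through and control goes straight back to the user:

```cpp
// Guess at the shape of the problematic check: only -1 is special-cased, so
// -2 is treated like an exhausted positive budget and generation drops back
// to "waiting for user input" before producing anything. Names are illustrative.
#include <cstdio>

struct Params { bool interactive; int n_predict; };

static bool drops_back_to_user(const Params & params, int n_remain) {
    // buggy shape: every value other than -1, including -2, counts as "budget used up"
    return params.interactive && n_remain <= 0 && params.n_predict != -1;

    // plausible fixed shape (what a fix like #2767 presumably amounts to): only
    // non-negative budgets should hand control back, so -2 keeps generating:
    //   return params.interactive && n_remain <= 0 && params.n_predict >= 0;
}

int main() {
    const Params p = { /*interactive=*/true, /*n_predict=*/-2 };
    // with -n -2 the remaining-token counter starts out negative, so the buggy
    // check fires immediately and the program just sits waiting for input
    std::printf("%d\n", drops_back_to_user(p, /*n_remain=*/-2)); // prints 1
    return 0;
}
```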

@KerfuffleV2
Collaborator

Can you do some testing with #2767? Seems to fix the issue.

-n with positive values seems pretty unintuitive to me in interactive mode. It seems like it means "generate X tokens and then return control to the user", every time. So you can set -n 2, it'll generate two tokens, you can enter / to return control and it'll generate another 2, etc. Weird, but I assume that's how it's supposed to work.

@ghost
Author

ghost commented Aug 24, 2023

Yup, working.

main: context full, stopping generation

First time I've seen that!

I'll test a bit more too if you want to hold off on merging.

@KerfuffleV2
Collaborator

I can't merge without someone else approving it anyway. :) More testing is always great though. I'm more scared of that change breaking something else than of issues with -2, but the more I think about it the more I'm convinced this is correct.

Allowing n_predict == -2 to stop at a full context is a pretty recent change. I think the logic there just didn't get updated (it was set to check just for -1).

@ghost
Author

ghost commented Aug 24, 2023

the more I think about it the more I'm convinced this is correct.

I agree. We identified there was a problem, so that's significant, and my testing works.

So you can set -n 2, it'll generate two tokens, you can enter / to return control and it'll generate another 2, etc. Weird, but I assume that's how it's supposed to work.

It's definitely unintuitive, but I agree that's how it's intended to work.

@ghost ghost closed this as completed Aug 24, 2023