[User] -n -2 generates nothing #2754
Nope, the prompt is 13 tokens. Here's another example where the assistant used 77 tokens and I had to Ctrl+C to end it.
Apologies, I read it wrong. Sorry for the confusion.
Ahh, I think I know what's going on. You have … and also keep is set to …. It'll just try to roll over the context whenever it fills up and keep those prompt tokens. Since the context is so small, it's not surprising this causes fairly nonsensical results. (The context rollover thing is kind of hit or miss in general, since it happens in an arbitrary place that may be in the middle of a word.) Anyway, try setting …
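For reference, here is a minimal sketch of the rollover idea described above. It is only an illustration, not the actual code in examples/main/main.cpp, and the values n_ctx = 64 and n_keep = 13 are arbitrary example numbers:

```cpp
#include <cstdio>
#include <vector>

// Illustration of context rollover: when the evaluated tokens would exceed
// the context window, keep the first n_keep tokens, drop half of the rest,
// and continue generating into the space that was freed.
int main() {
    const int n_ctx  = 64;    // deliberately tiny context for the example
    const int n_keep = 13;    // e.g. the prompt tokens that should survive

    std::vector<int> history; // token ids evaluated so far
    int n_past = 0;

    for (int tok = 0; tok < 200; ++tok) {   // pretend we generate 200 tokens
        if (n_past >= n_ctx) {
            const int n_left    = n_past - n_keep;
            const int n_discard = n_left / 2;   // half of the non-kept tokens go
            history.erase(history.begin() + n_keep,
                          history.begin() + n_keep + n_discard);
            n_past -= n_discard;
            std::printf("rollover: kept %d prompt + %d recent tokens\n",
                        n_keep, n_past - n_keep);
        }
        history.push_back(tok);
        ++n_past;
    }
    return 0;
}
```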
Yes, but it shouldn't roll over the context. The image shows llama.cpp produced 20 tokens total (equal to -n 20), then got hung up and didn't respond. I feel this is easily reproducible; here's another example where llama.cpp just hangs with …; it didn't even try to fill the context.
Well, it should if … Certainly that behavior is wrong too. It's different from your original problem, though, where you thought the size just wasn't respected.
Thanks for your response, though I don't understand the purpose of context if llama.cpp ignores it by default. To clarify, llama.cpp defaults to …. All these other weird behaviors stem from llama.cpp neglecting to stop when it reaches max context. Edit: I'm so ready for a trace 😅
You can set it to a non-default value and have a context limit if you want. The context size also affects how much memory (CPU or GPU) is used, so it still has significant effects even aside from limiting how many tokens are generated.
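To illustrate the memory point: the KV cache grows linearly with the context size. A back-of-the-envelope estimate, assuming an f16 cache and LLaMA-7B-like dimensions (the numbers below are assumptions for illustration, not measurements):

```cpp
#include <cstdio>

// Rough KV-cache estimate: two tensors (K and V) per layer, each holding
// n_ctx * n_embd elements. The model dimensions are the commonly cited
// LLaMA-7B values and are used here purely as an example.
int main() {
    const long n_layer        = 32;
    const long n_embd         = 4096;
    const long bytes_per_elem = 2;    // f16

    const long ctx_sizes[] = {512, 2048, 4096};
    for (long n_ctx : ctx_sizes) {
        const long bytes = 2 * n_layer * n_ctx * n_embd * bytes_per_elem;
        std::printf("n_ctx = %4ld -> ~%4ld MiB of KV cache\n",
                    n_ctx, bytes / (1024 * 1024));
    }
    return 0;
}
```

So even if the context size didn't limit generation length, the flag would still matter for memory use.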
I'm afraid I don't agree here. If … If you set … But generating tokens past the context limit when … is what I'm objecting to.
Alright, let's try to clarify here. Maybe this comes down to interpretation, but the README shows … This value affects the number of tokens llama.cpp predicts during inference. It cannot predict and generate more tokens than initially defined by the context. It refers to the length of the generation, not the context size. Essentially, I'm saying that llama.cpp should not generate or predict more tokens than set in …
I disagree. That's not my understanding of the README.
I've also noticed this, but I didn't intend to discover it. I don't have the time to explore more than one bug at a time, I'm sure you understand.
Oh, but it actually can! There's special logic to handle running into the context limit. Assuming you didn't set …, this is a poor man's infinite context. It was more useful before the discovery of the RoPE stuff and LLaMA 2 supporting 4096 context off the bat, but using this approach it was possible to get pretty coherent output past the 2048 context limit of LLaMA 1 models.

See: https://github.com/ggerganov/llama.cpp/tree/master/examples/main#number-of-tokens-to-predict

"A value of -1 will enable infinite text generation, even though we have a finite context window. When the context window is full, some of the earlier tokens (half of the tokens after --n-keep) will be discarded. The context must then be re-evaluated before generation can resume."
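To put concrete numbers on the quoted rule, here is a small worked example. The 13-token prompt and the 2048 context are just example values taken from earlier in the thread:

```cpp
#include <cstdio>

// Worked example of the README rule: when the window fills up,
// half of the tokens after --n-keep are discarded.
int main() {
    const int n_ctx     = 2048;            // example: LLaMA-1 sized context
    const int n_keep    = 13;              // example: keep the 13-token prompt
    const int n_left    = n_ctx - n_keep;  // tokens eligible for discarding
    const int n_discard = n_left / 2;      // half of those get dropped

    std::printf("kept prompt tokens  : %d\n", n_keep);
    std::printf("discarded           : %d\n", n_discard);
    std::printf("carried over        : %d\n", n_left - n_discard);
    std::printf("room for new tokens : %d\n", n_ctx - n_keep - (n_left - n_discard));
    return 0;
}
```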
Oh dear... I stand corrected! I'll test various values.
Not as far as I know. So what you originally thought was a bug was okay (though there's probably an argument for changing the default). It seems like you did find a genuine issue, though. Glad you already reopened this yourself; I was going to suggest that. I tried it and wasn't able to replicate it: …
Are you running on Metal? I've heard there were issues that could lead to llama.cpp getting stuck. Edit: #2678 is the one I was thinking of.
Admittedly, …
Nope, …
Ah, okay, I can replicate the issue when specifying the prompt with …. Not sure what's actually causing it. It doesn't seem like it's spinning consuming CPU; it seems like it thinks it's supposed to be reading user interaction.
Exactly! Whew, sometimes I underestimate what it takes to replicate an issue, so thanks for homing in on this with me. I was reluctant to change …
It's getting set back to interacting here: https://github.com/ggerganov/llama.cpp/blob/ef955fbd230c571cc1cda0d19baaeec347523175/examples/main/main.cpp#L800-L804

It happens almost immediately even if I set …. Something is definitely wrong with that logic (or the state at that point).
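For anyone following along, here is a paraphrased sketch of why a check in that area can misbehave with -n -2. This is not the exact code at the link; it just assumes the common pattern where n_remain counts down from n_predict and only -1 is special-cased:

```cpp
#include <cstdio>

// Paraphrased sketch (not the linked main.cpp lines): with -n -2, n_remain
// starts out negative, so "n_remain <= 0" is true immediately, and a guard
// that only excludes -1 lets -2 fall through into interactive mode, which
// looks like the program hanging while it waits for user input.
int main() {
    const bool interactive = true;  // assumption: the interactive path is active
    const int  n_predict   = -2;    // "stop when the context is full"
    int        n_remain    = n_predict;

    bool is_interacting = false;
    if (interactive && n_remain <= 0 && n_predict != -1) {
        n_remain       = n_predict;   // still negative
        is_interacting = true;        // drops back to waiting for input
    }
    std::printf("is_interacting = %s\n", is_interacting ? "true" : "false");
    return 0;
}
```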
Can you do some testing with #2767? Seems to fix the issue.
Yup, working.
First time I've seen that! I'll test a bit more too if you want to hold off on merging.
I can't merge without someone else approving it anyway. :) More testing is always great, though. I'm more scared of breaking something else than of issues with … Allowing …
I agree. We identified there was a problem, so that's significant, and in my testing it works.
It's definitely unintuitive, but I agree that's how it's intended to work.
See this post: #2754 (comment)