server : fix context shift #5195
Conversation
Not sure it's correct, will test more later. Edit: I forgot to mention the important thing: it does fix the previously observed hang!
style: not your fault, but it uses a mixture of `x--` and `x += 1`
...
while you are at it, please add missing newlines to the help here https://github.com/ggerganov/llama.cpp/blob/5f62e231db3ec60320607ac6450ce9bf4dd43208/examples/server/server.cpp#L1825-L1826
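To make the earlier style point concrete (an illustrative fragment, not the PR's actual diff), this is the kind of mixed idiom being flagged; the suggestion is simply to pick one form and use it throughout:

```cpp
#include <cstdio>

int main() {
    int n_left = 8;
    int n_past = 0;
    while (n_left > 0) {
        n_left--;    // one idiom for adjusting a counter...
        n_past += 1; // ...and another for the same kind of update
    }
    printf("n_past = %d\n", n_past);
    return 0;
}
```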
Decided to test the self-extension and it broke. Looks different, but still OK:
Then the next completion.
Force-pushed from 5f62e23 to d0e10bf
There were some major issues still - I think I fixed those, but I'm still not sure if self-extend works now. Please let me know if you give it another try.
Normal operation seems to be working now.
Yeah, I forgot to mention that the actual predictions returned were garbled, even without getting close to a full cache/context.
Self-extend does not work. Either my settings are way off (not 100% sure on those values), or it's just broken.
Update: running without self-extend now works properly; I had my bot spam for hours with >20k server tasks and it was still fine (besides the obvious self-deterioration when it keeps talking to itself).
@Green-Sky Are you talking about my code or the code of this PR?
This PR with caching enabled. But since master with caching is broken too... Edit: I mostly tagged you to have your eyes on this :)
`n_past_se` was needed because otherwise self-extend won't work: all tokens of the prompt are added to the batch at once, and `n_past` is the total token count.
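As a minimal sketch of that bookkeeping (hypothetical, simplified names - not the actual server.cpp code): with group attention enabled, the cache positions written into the batch come from a separate counter that the later self-extend shifts rewrite, while `n_past` keeps counting the total tokens processed.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical illustration of the two counters, not the real server code.
// With self-extend (ga_n > 1) batch positions come from n_past_se, which
// group-attention shifts later rewrite; n_past stays the plain total count.
struct batch_entry {
    int token;
    int pos; // position written into the KV cache
};

int main() {
    const int ga_n = 4; // --grp-attn-n
    const std::vector<int> prompt = {101, 102, 103, 104, 105};

    int n_past    = 0; // total tokens seen by the slot
    int n_past_se = 0; // cache-position counter used when ga_n > 1

    std::vector<batch_entry> batch;
    for (int tok : prompt) {
        const int pos = (ga_n > 1) ? n_past_se : n_past;
        batch.push_back({tok, pos});
        n_past    += 1;
        n_past_se += 1; // diverges from n_past once shifts are applied between batches
    }

    for (const auto & e : batch) {
        printf("token %d -> pos %d\n", e.token, e.pos);
    }
    printf("n_past = %d, n_past_se = %d\n", n_past, n_past_se);
    return 0;
}
```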
I've reverted my changes on the self-extend for now, as it is difficult to test. But I think after we merge this PR in order to fix the context shift, we have to revisit and simplify the self-extend implementation. It's confusing to keep 2 values for `n_past`.
@ggerganov I don't know if it is enough, but calling it something like `current_token_pos` or `token_pos_se` would maybe help?
That's one option, but I think there might be a way to do it with only `n_past`.
By the way, with both caching and self-extend it does the log spam again:
But at least each on its own seems to be working fine.
It is still broken, but it's way rarer now.
(from the 6 GB log file)
* server : fix context shift + simplify self-extend
* server : take system_tokens into account
* server : more n_past fixes
* server : rever n_past_se changes
Is there a new PR for this? I am getting the same spam messages for the KV cache. It seems to happen as soon as it reaches the context limit. Unless I am using it wrong - is this the correct format?
`/server.exe -ngl 99 --host localhost --port 5000 -c 8192 --grp-attn-n 4 --grp-attn-w 2048 -m D:/AI/LLM/Mixtral-8x7B-Instruct-v0.1.IQ3_XXS.gguf`
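For what it's worth, here is how I understand the parameter relationship (a hedged sketch; the divisibility rule and the context-scaling rule of thumb are my assumptions, not verified against the code, and the numbers are just the ones from the command above):

```cpp
#include <cstdio>

// Assumption: --grp-attn-w (ga_w) should be a multiple of --grp-attn-n
// (ga_n), and self-extend stretches the usable context by roughly ga_n.
static bool self_extend_params_ok(int ga_n, int ga_w) {
    return ga_n >= 1 && ga_w > 0 && ga_w % ga_n == 0;
}

int main() {
    const int ga_n  = 4;    // --grp-attn-n 4 (from the command above)
    const int ga_w  = 2048; // --grp-attn-w 2048
    const int n_ctx = 8192; // -c 8192

    if (!self_extend_params_ok(ga_n, ga_w)) {
        fprintf(stderr, "grp-attn-w should be a multiple of grp-attn-n\n");
        return 1;
    }
    printf("requesting n_ctx = %d with group size %d, window %d\n",
           n_ctx, ga_n, ga_w);
    return 0;
}
```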
I saw #5420, but did not test it yet. Or it is the one making it worse, *shrug*.
Possibly it broke group attention; I was only testing without it.
This probably fixes the context shift functionality.
Also, continuation of #5104 - not sure if the `n_past_se` added there is really necessary, so I removed it. Needs testing.
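As a rough illustration of what the context shift is supposed to do (a simplified, self-contained simulation under my own assumptions - not the PR's code, which works through llama.cpp's KV-cache sequence operations): once the slot's context is full, keep the first `n_keep` tokens, discard half of the rest, and slide the surviving positions down so generation can continue.

```cpp
#include <cstdio>
#include <vector>

// Toy simulation of a context shift; the cache is modeled as a vector of
// token positions rather than a real KV cache.
int main() {
    const int n_ctx  = 16; // tiny context for the demo
    const int n_keep = 4;  // tokens pinned at the start (e.g. the system prompt)

    // Fill the "cache" with positions 0..n_ctx-1.
    std::vector<int> cache(n_ctx);
    for (int i = 0; i < n_ctx; ++i) cache[i] = i;

    // Context is full: discard half of the non-kept tokens...
    const int n_left    = n_ctx - n_keep;
    const int n_discard = n_left / 2;
    cache.erase(cache.begin() + n_keep, cache.begin() + n_keep + n_discard);

    // ...and shift the remaining positions down by n_discard, which is what
    // the KV-cache shift achieves in the real implementation.
    for (size_t i = n_keep; i < cache.size(); ++i) cache[i] -= n_discard;

    for (int p : cache) printf("%d ", p);
    printf("\n"); // prints: 0 1 2 3 4 5 6 7 8 9
    return 0;
}
```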