Context length documentation confusion #5732
Provided that we have an input prompt: the case we're considering is that the prompt above cannot fit into the context length. What I understand from reading the server's code is:
About the context per slot: it's calculated by dividing the total context length among the slots (n_ctx / n_parallel). Edit: when I used the server for the first time, I (and maybe many other people) thought that each request would get the full context length.
This is helpful. What I gather, in short: it will keep the initial part of the input, and keep rotating the later part of the context to maintain the connection to the recent token predictions.
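A minimal sketch of that rotating behaviour, assuming the shift keeps the first n_keep tokens and drops a chunk of what follows so the most recent tokens slide down; the function name and the "discard half" heuristic here are illustrative, not the literal server code:

```cpp
// Illustrative context-shift sketch (not the actual llama.cpp server code):
// keep the first n_keep tokens, discard a chunk right after them, and let the
// most recent tokens slide down so generation stays connected to them.
#include <cstdint>
#include <vector>

using llama_token = int32_t;

void context_shift(std::vector<llama_token> & ctx_tokens, int n_keep) {
    const int n_left = (int) ctx_tokens.size() - n_keep;
    if (n_left <= 0) {
        return; // nothing beyond the kept prefix, so nothing to shift
    }
    const int n_discard = n_left / 2; // heuristic: reclaim half of the shiftable region

    // erase the discarded middle; the tail (most recent tokens) now sits
    // directly after the kept prefix
    ctx_tokens.erase(ctx_tokens.begin() + n_keep,
                     ctx_tokens.begin() + n_keep + n_discard);
}
```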
Ah! Yes, I had the same impression that this actually serves n_parallel requests without compromising the context.
Thanks, will give this a read. I still have some more questions, but will check the articles first; maybe they resolve my doubts or add new ones. Will update here once I have some more understanding of this.
I removed my message because I thought it was out of context, but I'm glad that you found it helpful :-)
The article was helpful! A few confirmations would help solidify my understanding:
Yes, this will guarantee that you can handle your worst-case scenario of 4x 8k requests at the same time.
This helped clarify my doubts. Thanks!
Question: Are the ending n_keep tokens kept in the context?
examples/server/README.md#result-json
Question: With infinite-length output generation, will this return true for intermediate truncation, i.e. when the server truncates some context tokens once it hits the context limit?
examples/main#number-of-tokens-to-predict
Question: Can we set an arbitrarily large value of n_predict?
Follow-up: To support this, does the server keep generating output, and once it reaches the context limit, truncate some tokens from the start (a sliding window over the input + output so far), repeating until a stop is triggered?
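Assuming that is roughly what happens, a sketch of such a loop might look like the following; the slot_state struct, its field names, and the "halve the shiftable region" step are illustrative rather than the actual server implementation, and the real sampling/stop logic is omitted:

```cpp
// Illustrative generation loop (not the actual llama.cpp server code): with
// n_predict < 0 there is no token budget and the loop only ends on a stop
// condition; whenever the slot's context fills up, the context is shifted
// and the response is flagged as truncated.
struct slot_state {
    int  n_ctx_slot = 0;     // context available to this slot
    int  n_past     = 0;     // tokens currently held in the context
    int  n_keep     = 0;     // prefix tokens that are always kept
    int  n_predict  = -1;    // -1 => unbounded ("infinite") generation
    bool truncated  = false; // set once any context truncation has happened
};

void generate(slot_state & slot) {
    int n_generated = 0;

    while (slot.n_predict < 0 || n_generated < slot.n_predict) {
        if (slot.n_past >= slot.n_ctx_slot) {
            // out of room: shift the context (as sketched earlier) and
            // remember that the output was produced with truncation
            slot.n_past    = slot.n_keep + (slot.n_past - slot.n_keep) / 2;
            slot.truncated = true;
        }

        // ... sample the next token and append it to the context (omitted) ...
        slot.n_past++;
        n_generated++;

        // ... break on end-of-generation token or a stop string (omitted) ...
        if (n_generated >= 8) {
            break; // arbitrary cut-off so this standalone sketch terminates
        }
    }
}
```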
Context: I'm using a context length of 16k (with a Deepseek model) and n_parallel=4 (4 requests served in parallel). As per the server logs, I noticed that this divides the context length among the 4 slots (4k each).
Question: Why is that? Is it due to a memory constraint?
Follow-up: If I really want to support a 16k context length for each request, does setting the context length to 16k * n_parallel suffice?
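A quick back-of-the-envelope check of that arithmetic, assuming the total context really is split evenly across the slots as the log output suggests:

```cpp
// Rough check of the per-slot split described above, assuming an even split
// of the total context (e.g. the value passed via -c/--ctx-size) across slots.
#include <cstdio>

int main() {
    const int n_ctx      = 16 * 1024; // total context length given to the server
    const int n_parallel = 4;         // number of parallel slots

    // with an even split, each slot only gets a quarter of the total
    std::printf("per-slot context: %d tokens\n", n_ctx / n_parallel); // 4096

    // to give each of the 4 slots a full 16k context, the total would need
    // to be 16k * n_parallel
    std::printf("total for 4x 16k slots: %d tokens\n", 16 * 1024 * n_parallel); // 65536
    return 0;
}
```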