Context length documentation confusion #5732

Closed

mprudra opened this issue Feb 26, 2024 · 6 comments

@mprudra

mprudra commented Feb 26, 2024

n_keep: Specify the number of tokens from the prompt to retain when the context size is exceeded and tokens need to be discarded. By default, this value is set to 0 (meaning no tokens are kept). Use -1 to retain all tokens from the prompt.

Question: Is it the last n_keep tokens of the prompt that are kept in the context?

truncated: Boolean indicating if the context size was exceeded during generation, i.e. the number of tokens provided in the prompt (tokens_evaluated) plus tokens generated (tokens predicted) exceeded the context size (n_ctx)

examples/server/README.md#result-json

Question: With infinite-length output generation, will this return true for intermediate truncation, i.e. when the server truncates some context tokens once it hits the context limit?

A value of -1 will enable infinite text generation, even though we have a finite context window. When the context window is full, some of the earlier tokens (half of the tokens after --n-keep) will be discarded. The context must then be re-evaluated before generation can resume. On large models and/or large context windows, this will result in significant pause in output.

examples/main#number-of-tokens-to-predict

Question: Can we set an arbitrarily large value of n_predict?
Follow-up: To support this, does the server keep generating output, and once it reaches the context limit it truncates some tokens from the start (a sliding window over i/p + o/p_so_far), repeating this until a stop is triggered?
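
A minimal sketch of the discard arithmetic described in the quoted README text, assuming the "keep n_keep, drop half of the rest" rule; the variable names below are illustrative, not llama.cpp's actual identifiers:

```cpp
#include <cstdio>

// Illustrative arithmetic for the context-shift rule quoted above: when the
// context fills up, the first n_keep tokens are retained and half of the
// remaining tokens are discarded. Variable names are hypothetical.
int main() {
    const int n_ctx  = 4096; // total context window
    const int n_keep = 256;  // tokens from the start of the prompt to keep

    const int n_left    = n_ctx - n_keep; // tokens eligible for discarding
    const int n_discard = n_left / 2;     // "half of the tokens after --n-keep"

    std::printf("kept: %d, discarded: %d, tokens remaining after shift: %d\n",
                n_keep, n_discard, n_ctx - n_discard);
    return 0;
}
```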

Context: I'm using a context length of 16k (with a Deepseek model) and n_parallel=4 (4 requests served in parallel). According to the server logs, this divides the context length among the 4 slots (4k each).
Question: Why is that? Is it due to a memory constraint?
Follow-up: If I really want to support a 16k context length for each request, does setting the context length to 16k * n_parallel suffice?

@mprudra mprudra changed the title Context lenght documentation confusion Context length documentation confusion Feb 26, 2024
@ngxson
Collaborator

ngxson commented Feb 26, 2024

Provided that we have an input prompt: <s>system: You are an assistant<s>user: Hello who are you

The case we're considering is that the prompt above cannot fit into the context length.

What I understand by reading the server's code is:

  • n_keep: for keeping the <s>system: You are an assistant part
  • If truncated, it keeps the n_keep part and only half of the remaining tokens. Assuming the half that survives is who are you, your prompt becomes: <s>system: You are an assistant who are you
  • When the assistant replies, the response fills up the rest of the context: <s>system: You are an assistant who are you<s>assistant: Hello I am
  • Because there is no space left, it "shifts" the context: <s>system: You are an assistant assistant: Hello I am
  • Then the rest of the response is generated: <s>system: You are an assistant assistant: Hello I am an assistant (see the sketch after this list)
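
A toy re-enactment of that shift, using words in place of real tokens; it only mirrors the behaviour described above and is not the server's actual code:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Toy re-enactment of the context shift described above, using words in
// place of real tokens. Mirrors the described behaviour, not llama.cpp code.
int main() {
    std::vector<std::string> ctx = {"<s>system:", "You", "are", "an", "assistant",
                                    "<s>user:", "Hello", "who", "are", "you"};
    const size_t n_keep = 5; // covers "<s>system: You are an assistant"

    // context is "full": drop half of the tokens that follow the n_keep prefix
    const size_t n_discard = (ctx.size() - n_keep) / 2;
    ctx.erase(ctx.begin() + n_keep, ctx.begin() + n_keep + n_discard);

    for (const auto & t : ctx) std::printf("%s ", t.c_str());
    std::printf("\n"); // -> <s>system: You are an assistant who are you
    return 0;
}
```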

About the context per slot: it's calculated as n_ctx_slot = n_ctx / params.n_parallel, so logically you can multiply n_ctx by 4 to get 16k per slot, as you said. The reason is that all slots share the same llama_context, so in fact they use the same kv_self. And the notion of n_parallel is in fact "how many parallel sequences can be batch-processed at the same time".
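
A small sketch of that per-slot arithmetic, assuming the n_ctx_slot = n_ctx / n_parallel split quoted above; the numbers are just the 16k / 4-slot example from this thread:

```cpp
#include <cstdio>

// Per-slot context arithmetic from the comment above:
// n_ctx_slot = n_ctx / n_parallel, so to give every slot 16k tokens the
// total context has to be multiplied by the number of slots.
int main() {
    const int n_parallel        = 4;
    const int n_ctx_slot_target = 16384;                          // 16k per request
    const int n_ctx_total       = n_ctx_slot_target * n_parallel; // 65536

    std::printf("total n_ctx needed: %d (%d per slot x %d slots)\n",
                n_ctx_total, n_ctx_total / n_parallel, n_parallel);
    return 0;
}
```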

Edit: when I used the server for the first time, I (and maybe many other people) thought that n_parallel would spawn multiple "threads" so that one request does not block another, but it turns out that's not the case. If the batch is busy processing, you have no choice but to wait for it to finish before starting another batch. That's by design, to get the best out of the GPU, since batching requests on the GPU can improve performance a lot.

@mprudra
Author

mprudra commented Feb 26, 2024

Provided that we have an input prompt:
...
The case we're considering is that the prompt above cannot fit into the context length.
What I understand by reading the server's code is:
...

This is helpful. What I gather, in short: it keeps the initial part of the input and keeps rotating the later part of the context to maintain the connection to the recent token predictions.

Edit: when I used the server for the first time, I (and maybe many other people) thought that n_parallel would spawn multiple "threads" so that one request does not block another, but it turns out that's not the case. If the batch is busy processing, you have no choice but to wait for it to finish before starting another batch. That's by design, to get the best out of the GPU, since batching requests on the GPU can improve performance a lot.

Ah! Yes, I had the same impression that this actually serves n_parallel requests without compromising the context.

This article may also make it more clear for you: https://www.anyscale.com/blog/continuous-batching-llm-inference

Thanks, will give this a read.

I still have some more questions, but I'll check the article first; maybe it resolves my doubts or adds new ones. I'll update here once I have a better understanding of this.

@ngxson
Collaborator

ngxson commented Feb 26, 2024

I removed my message because I thought it was out of context, but I'm glad you found it helpful :-)

@mprudra
Author

mprudra commented Feb 27, 2024

The article was helpful!

A few confirmations that would help confirm my understanding:

  1. So any LLM provider mainly scales in at least two ways (2 out of many more):
    a) Batching: these sequences are generated in parallel; the batch size is limited by GPU VRAM and the supported context length
    b) Adding more machines

  2. Let's say my GPU memory permits a context length of 32k (combined across all the slots). Given that in practice I don't expect i/p + o/p to cross 8k tokens, I can allow 4 parallel sequences to be batched at a time, so this should in fact let the server serve 4 requests at a time?
    And since llama.cpp does continuous batching, as soon as any generation ends it can start on a new request?

@ggerganov
Member

Let's say my GPU memory permits a context length of 32k (combined across all the slots). Given that in practice I don't expect i/p + o/p to cross 8k tokens, I can allow 4 parallel sequences to be batched at a time, so this should in fact let the server serve 4 requests at a time?
And since llama.cpp does continuous batching, as soon as any generation ends it can start on a new request?

Yes, this will guarantee that you can handle your worst-case scenario of 4x 8k requests at the same time
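
A short sketch of the worst-case sizing confirmed above (4 slots x 8k tokens each); the server flags in the comment (-c, -np, --cont-batching) are taken from the server README and should be double-checked against your build:

```cpp
#include <cstdio>

// Worst-case sizing from the exchange above: 4 slots, each expected to hold
// at most 8k tokens of prompt + output, so the shared context must be 32k.
// The flags in the comment below come from the server README (-c for context
// size, -np for parallel slots, --cont-batching for continuous batching);
// verify them against your build.
int main() {
    const int n_parallel    = 4;
    const int n_per_request = 8192;                       // expected i/p + o/p per request
    const int n_ctx_total   = n_parallel * n_per_request; // 32768

    // e.g. ./server -m model.gguf -c 32768 -np 4 --cont-batching
    std::printf("required total context: %d (%d slots x %d tokens)\n",
                n_ctx_total, n_parallel, n_per_request);
    return 0;
}
```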

@mprudra
Author

mprudra commented Mar 13, 2024

This helped clarify my doubts. Thanks!
Closing this.

@mprudra mprudra closed this as completed Mar 13, 2024