Description
n_keep: Specify the number of tokens from the prompt to retain when the context size is exceeded and tokens need to be discarded. By default, this value is set to 0 (meaning no tokens are kept). Use -1 to retain all tokens from the prompt.
Question: Are the last n_keep tokens of the prompt the ones kept in the context, or the first n_keep tokens?
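To make the question concrete, here is a minimal sketch of passing n_keep in a completion request. It assumes a llama.cpp server already listening on localhost:8080 and uses the /completion endpoint and field names from examples/server/README.md; the prompt and values are illustrative only.

```python
import json
import urllib.request

# Illustrative request: retain 32 prompt tokens if the context overflows
# (0 = keep none, -1 = keep the whole prompt).
payload = {
    "prompt": "System: be concise.\nUser: Summarize the plot of Hamlet.",
    "n_predict": 256,
    "n_keep": 32,
}

req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())

print(result["content"])
```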
truncated: Boolean indicating if the context size was exceeded during generation, i.e. the number of tokens provided in the prompt (tokens_evaluated) plus the tokens generated (tokens_predicted) exceeded the context size (n_ctx).
examples/server/README.md#result-json
Question: With infinite-length output generation, will this return true when intermediate truncation happens, i.e. when the server discards some context tokens after hitting the context limit?
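As a sketch of the condition described above, using the counter names from the result JSON (the values and the n_ctx of 4096 are purely illustrative):

```python
def was_truncated(result: dict, n_ctx: int) -> bool:
    """Mirror the documented condition: prompt tokens plus generated
    tokens exceeded the context size at some point."""
    return result["tokens_evaluated"] + result["tokens_predicted"] > n_ctx

# Illustrative result for a slot with a 4096-token context.
example_result = {"tokens_evaluated": 3900, "tokens_predicted": 300}
print(was_truncated(example_result, n_ctx=4096))  # True
```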
A value of -1 will enable infinite text generation, even though we have a finite context window. When the context window is full, some of the earlier tokens (half of the tokens after --n-keep) will be discarded. The context must then be re-evaluated before generation can resume. On large models and/or large context windows, this will result in a significant pause in output.
examples/main#number-of-tokens-to-predict
Question: Can we set an arbitrarily large value for n_predict?
Follow-up: To support this, does the server keep generating output and, once it hits the context limit, truncate some tokens from the start of the window (input + output so far, i.e. a sliding window over input + output so far), repeating this until a stop condition is triggered?
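A rough sketch of the context-shift arithmetic described in the quoted paragraph (half of the tokens after n_keep discarded when the context fills up). The exact behaviour depends on the llama.cpp version, so treat this as an approximation rather than the actual implementation:

```python
def shift_context(tokens: list, n_ctx: int, n_keep: int) -> list:
    """Approximate the shift: keep the first n_keep tokens, drop half of
    the remaining tokens, keep the newer half, then generation resumes."""
    if len(tokens) < n_ctx:
        return tokens                      # context not full yet
    n_discard = (len(tokens) - n_keep) // 2
    return tokens[:n_keep] + tokens[n_keep + n_discard:]

# Example: a full 4096-token context with the first 64 tokens kept.
full_ctx = list(range(4096))
shifted = shift_context(full_ctx, n_ctx=4096, n_keep=64)
print(len(shifted))  # 2080: 64 kept tokens + the newer half of the rest
```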
Context: I'm using a context length of 16k (with a DeepSeek model) and n_parallel=4 (4 requests served in parallel). According to the server logs, this divides the context length among the 4 slots (4k each).
Question: Why is that? Is it due to memory constraints?
Follow-up: If I really want to support a 16k context length for each request, does setting the context length to 16k * n_parallel suffice?
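For illustration, the arithmetic behind the observation and the follow-up question (the -c/--ctx-size and -np/--parallel flags are the usual llama.cpp server options; this is a sketch of the numbers, not a statement of how the server must behave):

```python
# Total context appears to be split evenly across parallel slots.
n_ctx_total = 16 * 1024        # -c 16384
n_parallel = 4                 # -np 4

n_ctx_per_slot = n_ctx_total // n_parallel
print(n_ctx_per_slot)          # 4096 -> the 4k per slot seen in the logs

# What the follow-up proposes: size the total so each slot gets 16k.
n_ctx_needed = 16 * 1024 * n_parallel
print(n_ctx_needed)            # 65536 -> e.g. -c 65536 -np 4
```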