Context length documentation confusion #5732

Closed

mprudra opened this issue Feb 26, 2024 · 6 comments

@mprudra

mprudra commented Feb 26, 2024

n_keep: Specify the number of tokens from the prompt to retain when the context size is exceeded and tokens need to be discarded. By default, this value is set to 0 (meaning no tokens are kept). Use -1 to retain all tokens from the prompt.

Question: Is it the last n_keep tokens of the prompt that are kept in the context?

truncated: Boolean indicating if the context size was exceeded during generation, i.e. the number of tokens provided in the prompt (tokens_evaluated) plus tokens generated (tokens predicted) exceeded the context size (n_ctx)

examples/server/README.md#result-json

Question: With infinite-length output generation, will this return true for intermediate truncation, i.e. when the server truncates some context tokens once it hits the context limit?

A value of -1 will enable infinite text generation, even though we have a finite context window. When the context window is full, some of the earlier tokens (half of the tokens after --n-keep) will be discarded. The context must then be re-evaluated before generation can resume. On large models and/or large context windows, this will result in significant pause in output.

examples/main#number-of-tokens-to-predict

Question: Can we set an arbitrarily large value of n_predict?
Follow-up: To support this, does the server keep generating output, and once it reaches the context limit it truncates some tokens from the start (a sliding window over i/p + o/p_so_far), repeating this until a stop is triggered?
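
A minimal sketch of the discard arithmetic described in the quoted README text, assuming the "keep n_keep, drop half of the rest" rule; the variable names below are illustrative, not llama.cpp's actual identifiers:

```cpp
#include <cstdio>

// Illustrative arithmetic for the context-shift rule quoted above: when the
// context fills up, the first n_keep tokens are retained and half of the
// remaining tokens are discarded. Variable names are hypothetical.
int main() {
    const int n_ctx  = 4096; // total context window
    const int n_keep = 256;  // tokens from the start of the prompt to keep

    const int n_left    = n_ctx - n_keep; // tokens eligible for discarding
    const int n_discard = n_left / 2;     // "half of the tokens after --n-keep"

    std::printf("kept: %d, discarded: %d, tokens remaining after shift: %d\n",
                n_keep, n_discard, n_ctx - n_discard);
    return 0;
}
```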

Context: I'm using a context length of 16k (with a Deepseek model) and n_parallel=4 (4 requests served in parallel). According to the server logs, this divides the context length among the 4 slots (4k each).
Question: Why is that? Is it due to a memory constraint?
Follow-up: If I really want to support a 16k context length for each request, does setting the context length to 16k * n_parallel suffice?

@mprudra mprudra changed the title Context lenght documentation confusion Context length documentation confusion Feb 26, 2024
@ngxson
Collaborator

ngxson commented Feb 26, 2024

Provided that we have an input prompt: <s>system: You are an assistant<s>user: Hello who are you

The case we're considering is that the prompt above cannot fit into the context length.

What I understand by reading the server's code is:

  • n_keep: for keeping the <s>system: You are an assistant part
  • If truncated, it keeps the n_keep part and only half of the remaining tokens. Assuming the half that survives is who are you, your prompt becomes: <s>system: You are an assistant who are you
  • When the assistant replies, the response fills up the rest of the context: <s>system: You are an assistant who are you<s>assistant: Hello I am
  • Because there is no space left, it "shifts" the context: <s>system: You are an assistant assistant: Hello I am
  • Then the rest of the response is generated: <s>system: You are an assistant assistant: Hello I am an assistant (see the sketch after this list)
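
A toy re-enactment of that shift, using words in place of real tokens; it only mirrors the behaviour described above and is not the server's actual code:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Toy re-enactment of the context shift described above, using words in
// place of real tokens. Mirrors the described behaviour, not llama.cpp code.
int main() {
    std::vector<std::string> ctx = {"<s>system:", "You", "are", "an", "assistant",
                                    "<s>user:", "Hello", "who", "are", "you"};
    const size_t n_keep = 5; // covers "<s>system: You are an assistant"

    // context is "full": drop half of the tokens that follow the n_keep prefix
    const size_t n_discard = (ctx.size() - n_keep) / 2;
    ctx.erase(ctx.begin() + n_keep, ctx.begin() + n_keep + n_discard);

    for (const auto & t : ctx) std::printf("%s ", t.c_str());
    std::printf("\n"); // -> <s>system: You are an assistant who are you
    return 0;
}
```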

About the context per slot: it's calculated as n_ctx_slot = n_ctx / params.n_parallel, so logically you can multiply n_ctx by 4 to get 16k per slot, as you said. The reason is that all slots share the same llama_context, so in fact they use the same kv_self. And the notion of n_parallel is in fact "how many parallel sequences can be batch-processed at the same time".
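
A small sketch of that per-slot arithmetic, assuming the n_ctx_slot = n_ctx / n_parallel split quoted above; the numbers are just the 16k / 4-slot example from this thread:

```cpp
#include <cstdio>

// Per-slot context arithmetic from the comment above:
// n_ctx_slot = n_ctx / n_parallel, so to give every slot 16k tokens the
// total context has to be multiplied by the number of slots.
int main() {
    const int n_parallel        = 4;
    const int n_ctx_slot_target = 16384;                          // 16k per request
    const int n_ctx_total       = n_ctx_slot_target * n_parallel; // 65536

    std::printf("total n_ctx needed: %d (%d per slot x %d slots)\n",
                n_ctx_total, n_ctx_total / n_parallel, n_parallel);
    return 0;
}
```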

Edit: when I used the server for the first time, I (and maybe many other people) thought that n_parallel would spawn multiple "threads" so that one request does not block another, but it turns out that's not the case. If the batch is busy processing, you have no choice but to wait for it to finish before starting another batch. That's by design, to get the best out of the GPU, since batching requests on the GPU can improve performance a lot.

@mprudra
Author

mprudra commented Feb 26, 2024

Provided that we have an input prompt:
...
The case we're considering is that the prompt above cannot fit into the context length.
What I understand by reading the server's code is:
...

This is helpful. What I gather, in short: it keeps the initial part of the input and keeps rotating the later part of the context to maintain the connection to the recent token predictions.

Edit: when I used the server for the first time, I (and maybe many other people) thought that n_parallel would spawn multiple "threads" so that one request does not block another, but it turns out that's not the case. If the batch is busy processing, you have no choice but to wait for it to finish before starting another batch. That's by design, to get the best out of the GPU, since batching requests on the GPU can improve performance a lot.

Ah! Yes, I had the same impression that this actually serves n_parallel requests without compromising the context.

This article may also make it more clear for you: https://www.anyscale.com/blog/continuous-batching-llm-inference

Thanks, will give this a read.

I still have some more questions, but I'll check the article first; maybe it resolves my doubts or adds new ones. I'll update here once I have a better understanding of this.

@ngxson
Collaborator

ngxson commented Feb 26, 2024

I removed my message because I thought it was out of context, but I'm glad you found it helpful :-)

@mprudra
Author

mprudra commented Feb 27, 2024

The article was helpful!

A few confirmations that would help confirm my understanding:

  1. So any LLM provider mainly scales in at least two ways (2 out of many more):
    a) Batching: these sequences are generated in parallel; the batch size is limited by GPU VRAM and the supported context length
    b) Adding more machines

  2. Let's say my GPU memory permits a context length of 32k (combined across all the slots). Given that in practice I don't expect i/p + o/p to cross 8k tokens, I can allow 4 parallel sequences to be batched at a time, so this should in fact let the server serve 4 requests at a time?
    And since llama.cpp does continuous batching, as soon as any generation ends it can start on a new request?

@ggerganov
Member

Let's say my GPU memory permits a context length of 32k (combined across all the slots). Given that in practice I don't expect i/p + o/p to cross 8k tokens, I can allow 4 parallel sequences to be batched at a time, so this should in fact let the server serve 4 requests at a time?
And since llama.cpp does continuous batching, as soon as any generation ends it can start on a new request?

Yes, this will guarantee that you can handle your worst-case scenario of 4x 8k requests at the same time
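
A short sketch of the worst-case sizing confirmed above (4 slots x 8k tokens each); the server flags in the comment (-c, -np, --cont-batching) are taken from the server README and should be double-checked against your build:

```cpp
#include <cstdio>

// Worst-case sizing from the exchange above: 4 slots, each expected to hold
// at most 8k tokens of prompt + output, so the shared context must be 32k.
// The flags in the comment below come from the server README (-c for context
// size, -np for parallel slots, --cont-batching for continuous batching);
// verify them against your build.
int main() {
    const int n_parallel    = 4;
    const int n_per_request = 8192;                       // expected i/p + o/p per request
    const int n_ctx_total   = n_parallel * n_per_request; // 32768

    // e.g. ./server -m model.gguf -c 32768 -np 4 --cont-batching
    std::printf("required total context: %d (%d slots x %d tokens)\n",
                n_ctx_total, n_parallel, n_per_request);
    return 0;
}
```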

@mprudra
Author

mprudra commented Mar 13, 2024

This helped clarify my doubts. Thanks!
Closing this.

@mprudra mprudra closed this as completed Mar 13, 2024