Hi,
Why is llama-server so limited in handling concurrent requests, even when using -cb and -np?
SGLang and vLLM servers are not nearly as limited as llama.cpp, even with hundreds of concurrent requests.
o3 response:
vLLM and SGLang were designed from the ground up for token-level scheduling and smart KV-cache management, so one copy of the model can keep a GPU busy while dozens of users stream tokens concurrently.
llama.cpp's server, by contrast, simply round-robins request-level contexts that each own a large, contiguous cache, so every extra user slices throughput and memory almost linearly.
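To make the "slices throughput and memory almost linearly" point concrete: llama-server splits the total context window (-c) evenly across the parallel slots (-np), so each additional slot shrinks the context every request can use. The sketch below is a simplified Python model of that static partitioning, contrasted with block-based allocation in the style of vLLM's paged KV cache; the function names and numbers are illustrative assumptions, not code from either project.

```python
# Simplified model of two KV-cache allocation strategies (illustrative only;
# names and numbers are hypothetical, not taken from llama.cpp or vLLM).

def static_slots(total_ctx: int, n_parallel: int) -> int:
    """llama-server-style: the total context is divided evenly across slots,
    so each concurrent request gets a fixed, smaller window."""
    return total_ctx // n_parallel

def paged_blocks(total_ctx: int, block_size: int = 16) -> int:
    """vLLM-style: the cache is a pool of small blocks handed out on demand,
    so a request only holds memory for tokens it has actually generated."""
    return total_ctx // block_size  # blocks available to be shared by any request

if __name__ == "__main__":
    total_ctx = 8192
    for n in (1, 4, 16, 64):
        print(f"-c {total_ctx} -np {n}: ~{static_slots(total_ctx, n)} tokens per slot")
    print(f"paged pool: {paged_blocks(total_ctx)} blocks of 16 tokens, allocated on demand")
```

Under these assumptions, -np 64 with -c 8192 leaves each slot only about 128 tokens of context, which is why usable context and throughput collapse as concurrency grows, whereas a paged allocator only charges each request for the blocks it actually occupies.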