Checklist
- 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- 2. Please use English, otherwise it will be closed.
Motivation
We have a problem where high volumes of small-prompt requests are normally processed smoothly, but they quickly pile up into a huge queue as soon as a few large-prompt requests are submitted. We tried the --enable-mixed-chunk parameter so that sglang schedules prefill and decode together, but sglang still prefills only a single sequence at a time, so short-prompt requests remain blocked behind long-prompt ones.
We also observed that vLLM can prefill multiple requests concurrently via its --max-num-partial-prefills parameter. However, sglang clearly outperforms vLLM on the DeepSeek series models in our tests, so we are reluctant to abandon sglang. We hope sglang can also support this feature.
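For reference, a rough sketch of the two launch configurations we compared; the model path and the flag value are placeholders, only the flags themselves come from the respective projects:

```bash
# sglang: mixed chunked prefill lets prefill and decode share a batch,
# but only one sequence is prefilled at a time
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --enable-mixed-chunk

# vLLM: several requests can be in the prefill phase concurrently
vllm serve deepseek-ai/DeepSeek-V3 --max-num-partial-prefills 2
```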