Closed as not planned
Description
Motivation.
Here is the roadmap for disaggregated prefill (and general-purpose kv cache transfer). Feel free to contribute 😁.
Proposed Change.
- XpYd support (X vLLM prefill instances and Y vLLM decode instances; tp and pp will very likely differ between prefill and decode instances)
- [Feature] Allow specifying region-of-interest / roi on `num_head` dimension and `layer` dimension (currently the `roi` tensor only contains tokens dimension) (mooncake team proposed new design)
- [Feature] XpYd support by building multiple connections between Xp and Yd (We now go for KVCache-store-based design. If you prefer direct P2P please raise concerns in vLLM #feat-prefill-disaggregation channel)
- [Feature] XpYd support by letting Xp connect to one KV cache server, and connect this server to Yd ([Feature][Disaggregated] Support XpYd disaggregated prefill with MooncakeStore #12957)
- Building connection
- [Usage] Keep distributed connection alive by periodically sending dummy requests.
- [Usage] Build connection by running `vllm connect` ([Frontend] Disaggregate prefill decode with zmq #11791)
- [Feature] Allow connecting prefiller and decoder between different nodes
- [Perf] Build connection by directly talking to the `Engine` instead of talking to the API server ([Frontend] Disaggregate prefill decode with zmq #11791)
- Compatibility
- [Feature] Compatible with chunked prefill
- [Feature] Compatible with prefix caching
- [Feature] Compatible with pipeline parallel ([Core] Make disaggregated prefill compatible with pipeline parallelism #12301)
- [Feature] Compatible with multi-modality
- Asynchronous KV cache transfer
- [Perf] KV cache prefetching
- [Perf] layer-by-layer pipelining (layerwise KV transfer in PD Disaggregation #12523)
- Better memory control (postponed to 2025 Q2)
- [Perf] Reuse the vLLM page table to avoid memory fragmentation
- [Perf] Reduce the number of tensor copies
- Adaptivity and fault tolerance
- [Perf] If not all KV caches in the batch are received, only perform prefilling on those tokens without KV cache ([Core] Prefill Only Tokens Without KV Cache in Batch Requests (Disagg Prefill) #12285)
- [Perf] Allow one prefill/decode vLLM worker to be repurposed as a decode/prefill vLLM worker ([Feature][Disaggregated] Support XpYd disaggregated prefill with MooncakeStore #12957)
- Third-party engine integration
- Mooncake ([Core] Support disaggregated prefill with Mooncake Transfer Engine #10884 @alogfans)
- InfiniteStore (Yet another Prefill-Decode separation in vllm #9079 @chenqianfzh) (no response from the developer)
- Valkey ([Core] Disaggregated prefilling supports valkey #8724 @zeroorhero @pizhenwei) (no response from the developer)
- LMCache ([Feature] Support KV cache offloading and disagg prefill with LMCache connector. #12953)
- Persistent prefix caching support
- [Feature] Allow fetching the KV cache for some prefix tokens, then prefilling on the remaining tokens
- [Feature] Allow fetching the KV cache for a contiguous span of tokens in the middle, then prefilling on the remaining tokens to blend that KV cache with the rest of the context
- Orchestration
- [Feature] A centralized orchestrator for a pool of prefill and decode workers
- [Feature] Dynamically add / remove workers
- [Feature] Let the orchestrator observe the workers using the observability APIs already exposed by vLLM
- [Feature] Initial routing support (send the decoding request to the most available decode instance first)
- Testing
- [Feature] Offline disaggregated prefill testing ([Misc] Add offline test for disaggregated prefill #12418)
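
To make the "only perform prefilling on those tokens without KV cache" item (under Adaptivity and fault tolerance) concrete, here is a minimal sketch of the per-batch decision. It is not the logic from #12285: `Request`, `kv_received_tokens`, and `plan_prefill` are hypothetical names, and the sketch assumes that whatever KV cache has arrived always covers a leading prefix of each prompt.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request record for illustration; not a vLLM class.
    request_id: str
    prompt_token_ids: list[int]
    kv_received_tokens: int  # number of leading tokens whose KV cache arrived


def plan_prefill(batch: list[Request]) -> dict[str, list[int]]:
    """Return, per request, the token ids that still need local prefill."""
    plan: dict[str, list[int]] = {}
    for req in batch:
        # Clamp so an out-of-range count can never index past the prompt.
        covered = max(0, min(req.kv_received_tokens, len(req.prompt_token_ids)))
        missing = req.prompt_token_ids[covered:]
        if missing:
            # Requests whose KV cache fully arrived are left out of the plan.
            plan[req.request_id] = missing
    return plan
```

Requests whose KV cache arrived in full drop out of the plan entirely, so the prefill pass only pays for the tokens that are actually missing.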
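
The first persistent prefix caching item (fetch the KV cache for some prefix tokens, then prefill the rest) can be sketched as a longest-cached-prefix lookup. This is an illustration only: `split_cached_prefix`, `cached_prefixes`, and `chunk_size` are hypothetical, the cache is modeled as a plain set of token tuples instead of a real KV cache store, and the sketch assumes the store holds an entry for every chunk-aligned prefix it covers.

```python
def split_cached_prefix(token_ids, cached_prefixes, chunk_size=16):
    """Split token_ids into (cached_prefix, remaining_tokens).

    cached_prefixes is a set of token-id tuples standing in for a KV cache
    store keyed by prefix (an assumption made for this sketch).
    """
    hit = 0
    # Walk chunk-aligned prefix lengths and stop at the first miss.
    for end in range(chunk_size, len(token_ids) + 1, chunk_size):
        if tuple(token_ids[:end]) in cached_prefixes:
            hit = end
        else:
            break
    return token_ids[:hit], token_ids[hit:]
```

The first element is the span whose KV cache would be fetched, and the second is what the engine would still need to prefill. Blending KV cache for a span in the middle of the context, as in the second item, needs more than this lookup and is not covered here.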
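
For the initial routing item under Orchestration, one possible reading of "most available decode instance" is the instance with the lowest load fraction. The sketch below assumes that reading; `DecodeInstance`, `running_requests`, `max_requests`, and `pick_decode_instance` are hypothetical names, and in practice the load numbers would have to come from whatever metrics the orchestrator observes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DecodeInstance:
    # Hypothetical view of a decode worker as seen by an orchestrator.
    name: str
    running_requests: int
    max_requests: int


def pick_decode_instance(instances: list[DecodeInstance]) -> Optional[DecodeInstance]:
    """Pick the decode instance with the lowest load fraction, if any has room."""
    # Instances at capacity (or with zero capacity) are never candidates.
    candidates = [i for i in instances if i.running_requests < i.max_requests]
    if not candidates:
        return None
    return min(candidates, key=lambda i: i.running_requests / i.max_requests)
```

Returning `None` when every instance is full leaves the queueing or rejection policy to the caller.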
Feedback Period.
No response
CC List.
@youkaichao @zeroorhero @comaniac @rkooo567 @WoosukKwon @liweiqing1997 @ShangmingCai @Leaf996 @coolkp @sjnaj @K-Mistele @ApostaC @YaoJiayi @njhill
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.