[RFC]: Disaggregated prefilling and KV cache transfer roadmap #10818

### Motivation.

Here is the roadmap for disaggregated prefill (and general-purpose KV cache transfer). Feel free to contribute :grin:.



### Proposed Change.


- XpYd support (X vLLM prefill instances and Y vLLM decode instances; the TP and PP sizes will very likely differ between prefill and decode instances)
  - [ ] ~~[Feature] Allow specifying region-of-interest / roi on `num_head` dimension and `layer` dimension (currently the `roi` tensor only contains tokens dimension)~~ (mooncake team proposed new design)
  - [ ] ~~[Feature] XpYd support by building multiple connections between Xp and Yd~~ (We are now going with a KVCache-store-based design. If you prefer direct P2P, please raise your concerns in the vLLM #feat-prefill-disaggregation channel)
  - [ ] [Feature] XpYd support by letting Xp connect to one KV cache server, and connect this server to Yd (#12957)
- Building connection 
  - [ ] [Usage] Keep distributed connection alive by periodically sending dummy requests.
  - [ ] [Usage] Build connection by running `vllm connect` (#11791)
  - [ ] [Feature] Allow connecting prefiller and decoder across different nodes
  - [ ] [Perf] Build connection by directly talking to the `Engine` instead of talking to the API server (#11791)
- Compatibility
  - [ ] [Feature] Compatible with chunked prefill
  - [ ] [Feature] Compatible with prefix caching
  - [ ] [Feature] Compatible with pipeline parallel (#12301)
  - [ ] [Feature] Compatible with multi-modality
- Asynchronous KV cache transfer
  - [ ] [Perf] KV cache prefetching
  - [ ] [Perf] layer-by-layer pipelining (#12523)
- ~~Better memory control~~ (postponed to 2025 Q2)
  - [ ] ~~[Perf] Reusing vLLM page table to avoid memory fragmentation~~
  - [ ] ~~[Perf] Reduce number of tensor copy~~
- Adaptivity and fault tolerance
  - [ ] [Perf] If not all KV caches in the batch are received, only perform prefilling on those tokens without KV cache (#12285)
  - [ ] [Perf] Allow one prefill/decode vLLM worker to be repurposed as a decode/prefill vLLM worker (#12957)
- [ ] Third-party engine integration
  - [x] Mooncake (#10884 @alogfans )
  - ~~[ ] InfiniteStore (#9079 @chenqianfzh )~~ (no response from the developer)
  - ~~[ ] Valkey (#8724 @zeroorhero @pizhenwei )~~ (no response from the developer)
  - [x] LMCache (#12953)
- Persistent prefix caching support
  - [ ] [Feature] Allow fetching the KV cache for some prefix tokens and then prefilling on the remaining tokens
  - [ ] [Feature] Allow fetching the KV cache of some contiguous tokens in the middle and then prefilling on the remaining tokens to blend the KV cache with the remaining context
- Orchestration
  - [ ] [Feature] A centralized orchestrator for a pool of prefill and decode workers
  - [ ] [Feature] Dynamically add / remove workers
  - [ ] [Feature] Let the orchestrator observe the workers using the observability APIs already exposed by vLLM
  - [ ] [Feature] Initial routing support (send the decoding request to the most available decode instance first)
- Testing
  -  [x] [Feature] Offline disaggregated prefill testing (#12418)
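
To make the "initial routing support" item under Orchestration more concrete, here is a minimal sketch of what "send the decoding request to the most available decode instance first" could look like. Everything in it is an assumption for illustration: `DecodeWorker`, its `running_requests` / `max_requests` fields, and `pick_decode_worker` are hypothetical names, not existing vLLM APIs, and "most available" is interpreted here simply as "most free request slots". A real orchestrator would likely derive availability from the metrics vLLM already exposes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DecodeWorker:
    """Hypothetical orchestrator-side view of one decode instance."""
    name: str
    running_requests: int  # requests currently being decoded (assumed metric)
    max_requests: int      # assumed per-worker capacity limit

    @property
    def free_slots(self) -> int:
        return self.max_requests - self.running_requests


def pick_decode_worker(workers: List[DecodeWorker]) -> Optional[DecodeWorker]:
    """Return the worker with the most free slots, or None if all are full."""
    candidates = [w for w in workers if w.free_slots > 0]
    if not candidates:
        return None  # caller may queue the request or retry later
    # Ties are broken by list order, because max() returns the first maximum.
    return max(candidates, key=lambda w: w.free_slots)
```

With three workers at 3/4, 1/4, and 4/4 load, this picks the second one, since it has the most free slots; if every worker is full, it returns `None` and leaves the queueing policy to the caller.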

### Feedback Period.

_No response_

### CC List.

@youkaichao @zeroorhero @comaniac @rkooo567 @WoosukKwon @liweiqing1997 @ShangmingCai @Leaf996 @coolkp @sjnaj  @K-Mistele @ApostaC @YaoJiayi @njhill 

### Any Other Things.

_No response_

### Before submitting a new issue...

- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
