
Conversation

@KuntaiDu (Collaborator) commented Jul 6, 2024

This is a follow-up PR for #5557 .

Goal: implement disaggregated prefilling by launching 2 vLLM instances (one for prefilling, one for decoding) and forwarding the KV cache from the prefilling instance to the decoding instance. (A minimal client-side sketch of this flow follows the roadmap below.)

A rough roadmap:

  • Benchmark an idealized version of disaggregated prefilling (idealized in the sense that the KV cache transfer completes instantaneously)
  • Implement API calls in vLLM to import/export the KV cache
  • Implement an agent that can transfer the KV cache between the prefilling and decoding instances
  • Implement an end-to-end prototype
  • Benchmark and improve the performance
  • Beta test (ongoing)
  • Support different tp/pp between prefill and decode instance
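To make the intended flow concrete, here is a minimal client-side sketch, assuming two OpenAI-compatible vLLM servers; the ports, model name, and the "ask the prefill instance for one token, then ask the decode instance for the full completion" pattern are illustrative placeholders, not this PR's fixed API.

```python
import requests

PREFILL_URL = "http://localhost:8100/v1/completions"  # prefill instance (assumed port)
DECODE_URL = "http://localhost:8200/v1/completions"   # decode instance (assumed port)

def disagg_generate(prompt: str, max_tokens: int = 128) -> str:
    # 1) The prefill instance runs the prompt (one token is enough to force the
    #    full prefill) and produces the KV cache to be forwarded.
    requests.post(PREFILL_URL, json={"model": "placeholder-model",
                                     "prompt": prompt, "max_tokens": 1})
    # 2) The decode instance reuses the transferred KV cache and generates the rest.
    resp = requests.post(DECODE_URL, json={"model": "placeholder-model",
                                           "prompt": prompt, "max_tokens": max_tokens})
    return resp.json()["choices"][0]["text"]
```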

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE


PR Checklist

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Model] for adding a new model or improving an existing model. Model name should appear in the title.
  • [Frontend] For changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
  • [Kernel] for changes affecting CUDA kernels or other compute kernels.
  • [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
  • [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • We adhere to Google Python style guide and Google C++ style guide.
  • Pass all linter checks. Please use format.sh to format your code.
  • The code needs to be well-documented so that future contributors can easily understand it.
  • Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
  • Please add documentation to docs/source/ if the PR modifies user-facing behavior of vLLM. This helps vLLM users understand and utilize the new features or changes.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag the PR with rfc-required and may not review it.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient, and to make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

  • After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
  • After the PR is assigned, the reviewer will provide a status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
  • After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
  • Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!

@KuntaiDu marked this pull request as a draft on July 6, 2024 07:44
@KuntaiDu (Collaborator, Author) commented Jul 7, 2024

An example where disaggregated prefill can do much better than chunked prefill (benchmark figure attached):

  • model: llama70B fp8
  • device: 8xH100
  • workload: QPS 4, input tokens 2048, output tokens 11
  • 3 approaches:
    • chunked prefill — tp8
    • chunked prefill — 2x tp4
    • disaggregated prefill — 1x tp4 prefill, 1x tp4 decode

@KuntaiDu (Collaborator, Author) commented Jul 9, 2024

Summary of the measurement insights:

  • For long contexts (here, #input tokens >= 1k with #output tokens = 128), we should use disaggregated prefilling instead of chunked prefill.
  • The maximum overhead of disaggregated prefilling (caused by the KV cache transfer) is 40 ms; an ideal implementation should have less than 10 ms of overhead.

@KuntaiDu (Collaborator, Author) commented:

A back-to-back comparison between chunked prefill and disaggregated prefill:

  • Input length: 2048
  • Output length: 150
  • Dataset: sonnet
  • Num of prompts: 400
  • QPS: 2,4,6,8
  • Methods:
    • chunked prefill: 2 vLLM instances, tp4, with chunked prefill enabled; the 2 instances share the workload in a round-robin manner
    • disagg prefill: 2 vLLM instances, tp4, one for prefill and one for decode
      • My current implementation can return the first token before the implementation overheads (KV transfer, and waiting until the decode instance is ready to receive the generated KV cache) occur, by fetching the first token from the prefill instance. For benchmarking's sake, however, I count these overheads toward TTFT by fetching the first token from the decode instance (a measurement sketch follows the figures below).
  • Results:
    • Lower median TTFT and median ITL when QPS <= 6 (at QPS = 8 the decode instance is backlogged)
    • Worse p99 ITL: sometimes the KV cache transfer fails (this happens rarely and I am not yet sure why), forcing the decode instance to redo the prefill itself, which worsens ITL

(benchmark figures attached)
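For reference, a minimal sketch of how TTFT/ITL can be measured in this style: stream from the instance that returns tokens to the user (here the decode instance) and timestamp each streamed chunk. The URL, port, and model name are placeholders, not values from this PR.

```python
import time
import requests

def measure_latency(url="http://localhost:8200/v1/completions",
                    model="placeholder-model", prompt="...", max_tokens=150):
    """Returns (TTFT, list of inter-token latencies) for one streamed request."""
    t0 = time.perf_counter()
    ttft, itls, prev = None, [], None
    payload = {"model": model, "prompt": prompt,
               "max_tokens": max_tokens, "stream": True}
    with requests.post(url, json=payload, stream=True) as r:
        for line in r.iter_lines():
            # OpenAI-compatible servers stream SSE lines: b"data: {...}" / b"data: [DONE]"
            if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
                continue
            now = time.perf_counter()
            if ttft is None:
                ttft = now - t0          # time to first token
            elif prev is not None:
                itls.append(now - prev)  # inter-token latency
            prev = now
    return ttft, itls
```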

@wjj19950828 commented:

> A back-to-back comparison between chunked prefill and disaggregated prefill: […]

@KuntaiDu Why do the TTFT numbers reach 5000-10000 ms at some concurrency levels? Are those requests pending?

Also, with a 70B model at tp4 on H100s, I would not expect the achievable QPS to be this low.

@KuntaiDu (Collaborator, Author) commented:

> @KuntaiDu Why do the TTFT numbers reach 5000-10000 ms at some concurrency levels? Are those requests pending? Also, with a 70B model at tp4 on H100s, I would not expect the achievable QPS to be this low.

Yes, the requests are pending, and that's why the TTFT is high. As for the QPS, let me double-check.

@LesLieZC0324 commented:

This is nice work! However, I ran into some problems in actual use.
In this PR, the chunked-prefill tp4 setup uses round_robin_proxy.sh to forward requests arriving at port 8000 to ports 8100 and 8200 in turn. However, when I used it, once the socat process is started and begins listening on port 8000, it keeps handling all subsequent connections itself, without calling the get_next_port function again to select another port.
Looking forward to your reply!
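For illustration, here is a minimal asyncio round-robin TCP proxy that picks a new backend per incoming connection. This is a hypothetical sketch of the behavior round_robin_proxy.sh intends, not the PR's script; the ports follow the ones mentioned above.

```python
import asyncio
import itertools

BACKEND_PORTS = itertools.cycle([8100, 8200])  # the two vLLM instances

async def pipe(reader, writer):
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()

async def handle_client(client_reader, client_writer):
    # A new backend is chosen for every incoming connection, unlike a single
    # long-lived socat process that keeps forwarding to one fixed target.
    port = next(BACKEND_PORTS)
    backend_reader, backend_writer = await asyncio.open_connection("127.0.0.1", port)
    await asyncio.gather(pipe(client_reader, backend_writer),
                         pipe(backend_reader, client_writer))

async def main():
    server = await asyncio.start_server(handle_client, "0.0.0.0", 8000)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```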

@Yang-x-Zhao commented:

A friendly reminder: the chunked prefill feature takes a --max-num-batched-tokens parameter. In the original chunked prefill post, the author found that using 2048 instead of the default 512 gave better results on A100.

For disaggregated prefill, it might be interesting to set this parameter differently on the prefill and decode instances. Since the prefill stage is compute-bound and the decode stage is memory-bound, my intuition is to use a small max-num-batched-tokens for prefill and a large one for decode.

For instance, when I benchmarked Llama-3-8B with prefill on tp2 A100 and decode on tp2 A100, I set the prefill instance to --max-num-batched-tokens 4096 and the decode instance to --max-num-batched-tokens 32768. With these settings, I achieved a slightly better result.
(benchmark figure attached)
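For readers who want to reproduce the asymmetric setting, a sketch expressed as vLLM engine arguments; the benchmark above launched two separate API-server processes with the equivalent CLI flags, the model id is assumed, and the disaggregation-specific flags from this PR are omitted.

```python
from vllm.engine.arg_utils import EngineArgs

# Prefill is compute-bound: keep the per-step token budget small.
prefill_args = EngineArgs(model="meta-llama/Meta-Llama-3-8B",  # assumed model id
                          tensor_parallel_size=2,
                          max_num_batched_tokens=4096)

# Decode is memory-bound: allow a much larger per-step token budget.
decode_args = EngineArgs(model="meta-llama/Meta-Llama-3-8B",
                         tensor_parallel_size=2,
                         max_num_batched_tokens=32768)
```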

@wjj19950828 commented:

@KuntaiDu @MazarineGlacier Have you tested disaggregated prefill against the normal version (without chunked prefill)? I tested P2D2 vs tp4 and found no benefit. Is this normal?

@Yang-x-Zhao commented:

> I tested P2D2 vs tp4 and found no benefit. Is this normal?

In your case, disaggregated prefill will have worse TTFT. This is because vLLM prioritizes prefill by default, and tp4 has (a bit less than) twice the prefill compute capability of P2D2.

I am not certain about TPOT/ITL, though; that depends on the actual batched tokens.

So it is indeed possible to see no benefit.

@wjj19950828 commented:

> In your case, disaggregated prefill will have worse TTFT … it is indeed possible to see no benefit.

In our scenario, disaggregated prefill is much worse than tp4 on TTFT/TPOT/ITL. I don't know where the problem is, so I wonder in which scenarios disaggregated prefill is beneficial.

@Yang-x-Zhao commented:

> In our scenario, disaggregated prefill is much worse than tp4 on TTFT/TPOT/ITL. I don't know where the problem is, so I wonder in which scenarios disaggregated prefill is beneficial.

This paper might answer your question: https://arxiv.org/html/2401.11181v1. In it, disaggregated prefill fails to help when the workload is too heavy on both prefill and decode.

@ChuanhongLi commented:

> In our scenario, disaggregated prefill is much worse than tp4 on TTFT/TPOT/ITL. I don't know where the problem is, so I wonder in which scenarios disaggregated prefill is beneficial.

Me too, but I ran it on 4090s. Maybe the overhead is too high without NVLink.

@wjj19950828 commented:

> Me too, but I ran it on 4090s. Maybe the overhead is too high without NVLink.

Yes, I am looking into the KV cache transmission overhead here. I saw the author say it is about 30 ms, which is definitely unacceptable.

@LesLieZC0324 commented Aug 23, 2024

> Yes, I am looking into the KV cache transmission overhead here. I saw the author say it is about 30 ms, which is definitely unacceptable.

In our scenario, without NVLink there is a barrier in the KV cache transmission, which increases both TTFT and ITL (which includes TTFT).

@wjj19950828 commented Aug 25, 2024

@KuntaiDu In fact, I don't think the tolist operation is necessary for the hash calculation, as in:
input_tokens_tuple = tuple(model_input.input_tokens.tolist())
This makes the D2H copy time-consuming, especially when input_ids is very long. Do you have any other recommended methods for calculating the hash? Thanks~
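One possible direction, sketched here as a hypothetical alternative (not the hash used in this PR): compute an order-sensitive checksum on the device so that only a single scalar crosses the PCIe bus, instead of materializing the whole token list with .tolist(). A real implementation would want a stronger hash; the small modulus here is only for illustration.

```python
import torch

_MOD = 1_000_003  # small prime keeps every intermediate comfortably inside int64

def on_device_fingerprint(input_tokens: torch.Tensor) -> int:
    """Order-sensitive checksum of a token-id tensor, computed on its device."""
    t = input_tokens.reshape(-1).to(torch.int64) % _MOD
    # Position-dependent weights make the checksum sensitive to token order.
    w = (torch.arange(1, t.numel() + 1, device=t.device, dtype=torch.int64)
         * 2654435761) % _MOD
    return int((t * w).sum() % _MOD)  # only one scalar is copied device-to-host
```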

@KuntaiDu (Collaborator, Author) commented:

> Me too, but I ran it on 4090s. Maybe the overhead is too high without NVLink.

NVLink or InfiniBand is a must for disaggregated prefilling to beat chunked prefill; the time budget for the KV cache transfer is less than 50 ms.
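A back-of-envelope estimate of why the link matters, assuming a Llama-2-70B-like architecture (80 layers, 8 KV heads with GQA, head_dim 128) and an FP16 KV cache; the bandwidth numbers are rough per-link figures, not measurements from this PR.

```python
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes  # 2 accounts for K and V

prompt_bytes = 2048 * kv_bytes_per_token()  # ~640 MiB for a 2048-token prompt

for link, gb_per_s in [("NVLink (~300 GB/s)", 300),
                       ("PCIe 4.0 x16 (~32 GB/s)", 32),
                       ("100 GbE (~12.5 GB/s)", 12.5)]:
    ms = prompt_bytes / (gb_per_s * 1e9) * 1e3
    print(f"{link}: ~{ms:.1f} ms")  # ~2 ms, ~21 ms, ~54 ms respectively
```

Under these assumptions, the transfer alone consumes most (or all) of the 50 ms budget without NVLink-class bandwidth, which matches the observations above.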

@KuntaiDu (Collaborator, Author) commented:

> Yes, I am looking into the KV cache transmission overhead here. I saw the author say it is about 30 ms, which is definitely unacceptable.

There are several performance optimization opportunities I have not explored yet. In the current implementation, the first token is sampled twice and the model input is constructed twice. These overheads can be removed with engineering work and will be optimized once the implementation is stable.

I am now working on an upcoming vLLM performance post and will circle back to this right after that.

@KuntaiDu (Collaborator, Author) commented:

> … once the socat process is started and begins listening on port 8000, it keeps handling all subsequent connections without calling the get_next_port function again to select another port …

Oh, let me double-check and fix it.

@gursimar commented:

> … I did it on 4090. Maybe the overhead is too high without NVLink.

I'm interested in building upon this implementation for faster KV cache transfer.

  1. What is the current method of KV cache transfer if you are not using NVLink?
  2. Why can't we simply use torch.distributed.isend and torch.distributed.irecv to do the transfer over NCCL?
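Regarding point 2, torch.distributed point-to-point ops do work over NCCL. The sketch below assumes the prefill and decode workers already share a two-rank process group and that the receiver knows the tensor shape/dtype in advance (one of the practical complications); it is an illustrative sketch, not the transfer mechanism of this PR.

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl", rank=..., world_size=2) was called on
# both sides, e.g. rank 0 = prefill worker, rank 1 = decode worker.

def send_kv(kv: torch.Tensor, dst: int) -> None:
    req = dist.isend(kv, dst=dst)   # non-blocking NCCL send
    req.wait()                      # or overlap other work before waiting

def recv_kv(shape, dtype, src: int, device: str = "cuda") -> torch.Tensor:
    buf = torch.empty(shape, dtype=dtype, device=device)  # receiver must pre-know shape/dtype
    req = dist.irecv(buf, src=src)
    req.wait()
    return buf
```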

@Luis-xu commented Sep 6, 2024

@KuntaiDu I am very interested in this work. I found that currently only KV cache transmission with the flash-attn backend is supported. Are there any plans to support xformers and flashinfer?

@wenqf11 commented Sep 9, 2024

Thanks for your work. I have a question: can I start, say, 6 prefill instances and 2 decode instances on 8 GPUs, and how?

@WhatGhost commented Sep 10, 2024

@KuntaiDu Hi, I just want to know the relationship between your work and PR #2809. It seems you both want to implement something like "disaggregated prefilling" to separate the prefill and generation phases.

I wonder what the difference is.

Looking forward to your reply! Thanks!

@junna2016 commented:

I ran into a problem when concatenating the K and V tensors together for each layer before send and recv: Llama-2-7B's outputs become random and incorrect. Why does concatenating K and V lead to this?

@KuntaiDu (Collaborator, Author) commented:

Closing this PR now (I did a large-scale refactor, and it now lives in #8498).

@KuntaiDu (Collaborator, Author) commented:

> Thanks for your work. I have a question: can I start, say, 6 prefill instances and 2 decode instances on 8 GPUs, and how?

It is not implemented directly in the new PR, but yes, it is on the roadmap and will be implemented soon.

@KuntaiDu (Collaborator, Author) commented:

> @KuntaiDu Hi, I just want to know the relationship between your work and PR #2809 … I wonder what the difference is.

I am not sure about that thread. I skimmed their code; my implementation is more lightweight, and its overhead, though it will definitely be larger, is tolerable.

@KuntaiDu (Collaborator, Author) commented:

Deprecating this PR in favor of #8498 .

@KuntaiDu closed this on Sep 16, 2024