[Core] Disaggregated prefilling supports valkey #8724

zeroorhero · 2024-09-23T02:59:44Z

support valkey database

PR Checklist

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

[Bugfix] for bug fixes.
[CI/Build] for build or continuous integration improvements.
[Doc] for documentation fixes and improvements.
[Model] for adding a new model or improving an existing model. Model name should appear in the title.
[Frontend] for changes to the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
[Kernel] for changes affecting CUDA kernels or other compute kernels.
[Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
[Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
[Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

We adhere to Google Python style guide and Google C++ style guide.
Pass all linter checks. Please use format.sh to format your code.
The code needs to be well-documented to ensure future contributors can easily understand the code.
Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
Please add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.

Adding or changing kernels

Each custom kernel needs a schema and one or more implementations to be registered with PyTorch.

Make sure custom ops are registered following PyTorch guidelines: Custom C++ and CUDA Operators and The Custom Operators Manual
Custom operations that return Tensors require meta-functions. Meta-functions should be implemented and registered in python so that dynamic dims can be handled automatically. See above documents for a description of meta-functions.
Use torch.library.opcheck() to test the function registration and meta-function for any registered ops. See tests/kernels for examples.
When changing the C++ signature of an existing op, the schema must be updated to reflect the changes.
If a new custom type is needed, see the following document: Custom Class Support in PT2.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not review the PR.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
After the PR is assigned, the reviewer will provide a status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!

github-actions · 2024-09-23T02:59:57Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

Add external database valkey operations.

Add valkey in prefill and decode nodes to transfer kv cache. Signed-off-by: Changqi Lu <[email protected]>

KuntaiDu

Changes requested. Thank you for contributing code and making valkey work! It will be much more scalable in datacenter scenarios.

KuntaiDu · 2024-09-23T23:30:15Z

vllm/distributed/kv_transfer/kv_pipe/torch_distributed_pipe.py

+class ValkeyPipe(TorchDistributedPipe):
+    """
+    A pipe that uses the valkey protocol to transfer tensors between ranks.
+    """
+
+    def __init__(self):
+        self.transport_thread: Optional[ThreadPoolExecutor] = None
+        self.buffer_size = 0
+        self.buffer_size_lock = threading.Lock()
+        self.device = "cpu"
+        self.none_tensor = torch.tensor([NONE_INT], device=self.device)
+
+        self.rcv_metadata_buffer = torch.zeros(self.METADATA_LENGTH,
+                                               dtype=self.METADATA_DTYPE,
+                                               device=self.device)
+
+    def _send_metadata(self, d_metadata_buffer: torch.Tensor, tensor_key:str = ""):


Would be nice if you can move this pipe to a separate file.

KuntaiDu · 2024-09-23T23:38:43Z

vllm/distributed/kv_transfer/kv_pipe/base.py


    @abstractmethod
-    def send_tensor(self, tensor: Optional[torch.Tensor]) -> None:
+    def send_tensor(self, tensor: Optional[torch.Tensor], tensor_key: str = "") -> None:


Adding tensor_key is definitely needed for DBs. Would be great if you can make it Optional[str], so that people are forced to generate this metadata when the correctness of their implementation depends on a correct tensor_key.
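
A minimal sketch of one way to read this suggestion, not the PR's actual code. The names KVPipeSketch and DictKeyedPipe are hypothetical, `Any` stands in for torch.Tensor so the example runs without PyTorch, and a plain dict stands in for valkey. The point it illustrates: with a `None` default, a key-addressed pipe can tell "no key was generated" apart from a real key, which an empty-string default would hide.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class KVPipeSketch(ABC):
    """Hypothetical stand-in for the pipe base class in kv_pipe/base.py.

    `Any` replaces torch.Tensor so the sketch runs without PyTorch.
    """

    @abstractmethod
    def send_tensor(self,
                    tensor: Optional[Any],
                    tensor_key: Optional[str] = None) -> None:
        """Send a tensor; key-addressed backends need a tensor_key."""


class DictKeyedPipe(KVPipeSketch):
    """Toy database-style pipe (a dict stands in for valkey)."""

    def __init__(self) -> None:
        self.store: Dict[str, Optional[Any]] = {}

    def send_tensor(self,
                    tensor: Optional[Any],
                    tensor_key: Optional[str] = None) -> None:
        # None means the caller never generated a key; fail loudly
        # instead of silently writing under a shared default key.
        if tensor_key is None:
            raise ValueError("tensor_key is required for key-addressed pipes")
        self.store[tensor_key] = tensor
```

Stream-style pipes that do not address data by key could simply ignore the argument, so existing callers that omit it would keep working.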

KuntaiDu · 2024-09-23T23:41:16Z

vllm/distributed/kv_transfer/vllm_adapter.py

+        elif self.kv_transfer_driver.startswith("valkey"):
+            url = self.kv_transfer_driver.split("://")[1]
+            ip, port = parse_url(url)
+            # TODO add PING command
+            self.sender = KVDatabaseTransfer(ip, int(port), self.local_rank, ValkeyPipe())
+            self.recver = KVDatabaseTransfer(ip, int(port), self.local_rank, ValkeyPipe())
+        else:
+            raise ValueError("Invalid kv_transfer_driver")


lol we definitely need a factory class to build the lookup buffer in the future, but let us keep it as is for now.
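
For illustration only, a factory of the kind mentioned here could look roughly like the sketch below. Everything in it is an assumption: register_driver, build_lookup_buffer and the tuple returned by the valkey builder are hypothetical placeholders, not vLLM APIs. It only shows how the if/elif chain on kv_transfer_driver could become a registry keyed by URL scheme, using the standard library's urlsplit.

```python
from typing import Callable, Dict
from urllib.parse import urlsplit

# Hypothetical registry: URL scheme -> builder taking (host, port).
_BUILDERS: Dict[str, Callable[[str, int], object]] = {}


def register_driver(scheme: str):
    """Decorator that registers a builder for one driver scheme."""
    def decorator(builder: Callable[[str, int], object]):
        _BUILDERS[scheme] = builder
        return builder
    return decorator


@register_driver("valkey")
def _build_valkey(host: str, port: int) -> object:
    # Placeholder: a real builder would construct the sender/receiver
    # objects; a tuple keeps this sketch self-contained.
    return ("valkey", host, port)


def build_lookup_buffer(kv_transfer_driver: str) -> object:
    """Parse e.g. 'valkey://127.0.0.1:6379' and dispatch on the scheme."""
    parts = urlsplit(kv_transfer_driver)
    builder = _BUILDERS.get(parts.scheme)
    if builder is None or parts.hostname is None or parts.port is None:
        raise ValueError(f"Invalid kv_transfer_driver: {kv_transfer_driver!r}")
    return builder(parts.hostname, parts.port)
```

Adding another backend would then mean registering one more builder rather than extending the branch in vllm_adapter.py.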

KuntaiDu · 2024-09-23T23:51:31Z

vllm/distributed/kv_transfer/kv_lookup_buffer/kv_database_transfer.py

+    def drop_select(self, input_tokens: torch.Tensor,
+                    roi: torch.Tensor) -> List[Optional[torch.Tensor]]:
+
+        if not self.init_valkey:
+            ops.valkey_init(self.ip, self.port, True)
+            self.init_valkey = True
+
+        tensor_key = self._encode_tensors(input_tokens, roi) + "/" + str(self.local_rank)
+        key_key = tensor_key + "/key"
+        val_key = tensor_key + "/value"
+        hid_key = tensor_key + "/hidden"
+
+        key = self.data_pipe.recv_tensor(key_key)
+        val = self.data_pipe.recv_tensor(val_key)
+        hid = self.data_pipe.recv_tensor(hid_key)
+        res = [input_tokens, roi, key, val, hid]
+
+        return [tensor.to(self.recv_device) for tensor in res]


Would be great if you can make sure the valkey entries at both the prefill instance and the decode instance are properly removed after drop_select, to avoid OOM. (That's why we call it drop select: we want to guarantee that the item selected from the lookup buffer is dropped after the drop_select call.)
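
The drop-after-select guarantee can be sketched as below. This is a toy model under stated assumptions, not the PR's implementation: DictLookupBuffer is a hypothetical name, a dict stands in for the valkey database, and plain strings stand in for tensors. The "/key", "/value" and "/hidden" suffixes follow the key layout in the diff above.

```python
from typing import Any, Dict, List, Optional

# Suffixes taken from the key layout in the reviewed diff.
SUFFIXES = ("/key", "/value", "/hidden")


class DictLookupBuffer:
    """Toy lookup buffer; a dict stands in for the valkey database."""

    def __init__(self) -> None:
        self._db: Dict[str, Any] = {}

    def insert(self, tensor_key: str, key: Any, value: Any,
               hidden: Any) -> None:
        for suffix, item in zip(SUFFIXES, (key, value, hidden)):
            self._db[tensor_key + suffix] = item

    def drop_select(self, tensor_key: str) -> List[Optional[Any]]:
        # dict.pop returns the entry and deletes it in one step, so a
        # selected item can never linger in the store afterwards.
        return [self._db.pop(tensor_key + suffix, None)
                for suffix in SUFFIXES]

    def __len__(self) -> int:
        return len(self._db)
```

Against a real valkey server, a read-and-delete command such as GETDEL might play the role of dict.pop here; whether the PR's valkey ops expose such a command is not stated in this thread.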

zeroorhero · 2024-09-24T01:55:51Z

Changes requested. Thank you for contributing code and making valkey work! It will be much more scalable in datacenter scenarios.

Thanks!

kuangdao · 2024-10-25T09:12:39Z

m

cherhh · 2025-01-09T13:26:16Z

Could you please provide some benchmark tests?

KuntaiDu added 30 commits July 16, 2024 23:50

add a new distributed group for disaggregated prefill NCCL communication

de434d9

only inflate the world size inside parallel_state.py

f157f6b

add more log information

de82c3c

specify vllm port

69ce0e0

avoid switching to unused ports in disaggregated prefilling

e3dc2e9

Merge branch 'main' into kuntai-disagg

f164aa7

adjust parallel state to include _DISAGG distributed group

18fe19c

offset global rank for decoding instances

94cadb8

adjust naming: use prefill and decode instead of prefilling and decoding

ded5d92

adjust the example: let the decode process in foreground for debugging

709ae05

adjust logger format

2ab44d4

test if the P2P cache stucks when no disaggregated prefilling

2213881

let decode instance sleep, to avoid generating P2P cache simultaneously

544f5cb

continue disaggregated prefill debugging

04d319a

offset world group for decoding instance

2e0f02c

a syntax fix

fd5f115

bug fix

8d90e6a

specify the source of get_open_port

a9474a7

document why specifying the source of get_open_port

701b087

add VLLM_TRACE_FUNCTION to track the call stack

fa5d71f

fix customadapter bug

e2faede

add parallel state logs for debugging

76b6c5e

add sleep when initializing parallel state

cb6d6a5

only log when rank%4==0

fe8fb47

only log when rank%4==0

cc89bfb

bug fix

531bdf3

also only log when rank=4 in custom all reduce

1804656

add debuging statement around broadcast

81c8640

debug init_world_group

5ba142c

put the log inside a text file

cc939cf

KuntaiDu added 15 commits September 19, 2024 00:57

bug fix

9874b42

fix typo: Distributerd -> Distributed

5950ad5

remove the debug flag in example -- user don't need it

c116684

fix typo

44e8875

fixing benchmark_serving.py

181928f

fix the example

c17d18d

update build partial prefill input

0b00876

bug fix for LMCache -- adjust vLLM's rebuild input, and merge the logic to reduce random exit branch

94a5086

make format checker happy

8099fb3

make ruff and yapf happy, also fix test bug

603864e

remove empty file

1d7a1c9

fix bug when world_size == -1

10ad09c

adjust comments

38e3a57

make yapf and ruff happy

e2bd481

relaunch CI

4979337

zeroorhero mentioned this pull request Sep 23, 2024

[Core] Implementing disaggregated prefilling, and caching KV cache in CPU/disk/database. #8498

Closed

Changqi Lu added 2 commits September 23, 2024 11:47

[Kernel] add valkey ops

1f30c60

Add external database valkey operations.

[Core] disaggregated prefilling support valkey transfer

f09b929

Add valkey in prefill and decode nodes to transfer kv cache. Signed-off-by: Changqi Lu <[email protected]>

zeroorhero force-pushed the add-valkey branch from c661b7f to f09b929 Compare September 23, 2024 06:14

KuntaiDu requested changes Sep 23, 2024

zeroorhero mentioned this pull request Oct 21, 2024

[WIP] Disaggregated prefilling support X prefill + Y decode #9537

Closed

KuntaiDu mentioned this pull request Dec 2, 2024

[RFC]: Disaggregated prefilling and KV cache transfer roadmap #10818

Closed

28 tasks

hmellor closed this Mar 10, 2025

mergify bot added the documentation and ci/build labels Mar 10, 2025

natoscott mentioned this pull request Oct 1, 2025

Valkey and RDMA support llm-d/llm-d-kv-cache-manager#134

Open
