Collaborator

@jianzs jianzs commented Apr 27, 2025

This PR implements the connector functionality for NPU based on LLMDataDist, building upon the connector API merged in vLLM v1 (vllm-project/vllm#15960). We've successfully tested the following scenarios in offline environments:

  • Single-machine: Verified 2P2D testing with dense models (Llama) and MoE models (DeepSeek v2 Lite)
  • Two-machine: Completed 1P1D testing with DeepSeek R1 W8A8

Key implementation aspects include:

Cross-machine PD: LLMDataDist requires the NPU device IP to establish a connection. Our approach places a global rank table (JSON) on each machine containing:

  • Unique server IDs
  • IP addresses and device IDs for each card

The server ID is specified in the connector extra config at startup and is used to retrieve the instance's information from the table.
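To make the lookup concrete, here is a minimal sketch of what such a table and a server ID lookup might look like. The table format is not documented yet, so every key name (`server_list`, `server_id`, `devices`, `device_id`, `device_ip`) and both helper functions are assumptions for illustration only, and the IP addresses come from the reserved documentation range.

```python
import json

# Hypothetical rank table: every key name below is an illustrative assumption,
# and the IPs come from the reserved documentation range 192.0.2.0/24.
RANK_TABLE_JSON = """
{
  "server_list": [
    {
      "server_id": "server-0",
      "devices": [
        {"device_id": 0, "device_ip": "192.0.2.10"},
        {"device_id": 1, "device_ip": "192.0.2.11"}
      ]
    },
    {
      "server_id": "server-1",
      "devices": [
        {"device_id": 0, "device_ip": "192.0.2.20"},
        {"device_id": 1, "device_ip": "192.0.2.21"}
      ]
    }
  ]
}
"""


def load_server(table_json: str, server_id: str) -> dict:
    """Return the table entry whose server_id matches, or raise KeyError."""
    table = json.loads(table_json)
    for server in table["server_list"]:
        if server["server_id"] == server_id:
            return server
    raise KeyError(f"server_id {server_id!r} not found in rank table")


def device_ips(server: dict) -> dict:
    """Map each device_id on a server to its device IP."""
    return {d["device_id"]: d["device_ip"] for d in server["devices"]}
```

An instance started with the server ID `server-1` would call `device_ips(load_server(RANK_TABLE_JSON, "server-1"))` to obtain the device IPs it needs for connection setup.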

nPmD: Given that the community's nPmD design, particularly the router component API, is still evolving, we've implemented a solution using a meta server component (to be provided separately) that:

  • Records prefill completion details (device and dp rank information)
  • Responds to decode node queries with prefill node locations
  • Enables decode nodes to retrieve data from appropriate prefill nodes
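The three responsibilities above can be sketched as a small in-memory registry. The real meta server is to be provided separately, so the class, method, and field names here (`MetaServer`, `record_prefill_done`, `query_prefill_location`, `request_id`) are hypothetical, and a real deployment would expose this over the network rather than as in-process calls.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PrefillLocation:
    """Where a request's prefill ran (field names are assumptions)."""
    server_id: str
    device_id: int
    dp_rank: int


class MetaServer:
    """Hypothetical in-memory stand-in for the meta server component."""

    def __init__(self) -> None:
        self._locations: Dict[str, PrefillLocation] = {}

    def record_prefill_done(self, request_id: str, location: PrefillLocation) -> None:
        # Called by a prefill node once it has finished a request.
        self._locations[request_id] = location

    def query_prefill_location(self, request_id: str) -> Optional[PrefillLocation]:
        # Called by a decode node; returns None if no prefill has been recorded.
        return self._locations.get(request_id)
```

A decode node would use the returned `PrefillLocation` to decide which prefill node to pull data from.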

We propose initially merging the 1P1D implementation, where the global rank table contains information for two nodes, allowing direct prefill node identification. The nPmD implementation can be refined and merged following community discussion.
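With only two entries in the table, the decode side can identify the prefill node by elimination. A minimal sketch, assuming each entry is a dict with a hypothetical `server_id` key (the actual table format is still to be documented):

```python
from typing import Dict, List


def find_prefill_peer(servers: List[Dict], own_server_id: str) -> Dict:
    """In a 1P1D table, return the one entry that is not our own."""
    if len(servers) != 2:
        raise ValueError("1P1D expects exactly two servers in the rank table")
    others = [s for s in servers if s["server_id"] != own_server_id]
    if len(others) != 1:
        raise ValueError(f"server_id {own_server_id!r} must appear exactly once")
    return others[0]
```

For example, `find_prefill_peer([{"server_id": "p"}, {"server_id": "d"}], "d")` returns the entry for `p`.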

Todo:

  • Implement 1P1D (one prefill, one decode) configuration support
  • Add sample script for automatic global rank table generation
  • Document global rank table format specifications
  • Provide user guide for PD (Prefill-Decode) functionality

Ensure correct input for npu_reshape_and_cache function

The 'slot_indices' parameter of npu_reshape_and_cache must be:
- A torch.int32 tensor
- Located on the NPU device

Signed-off-by: Jade Zheng <[email protected]>
@jianzs jianzs closed this Apr 27, 2025
@jianzs jianzs deleted the zhengsj/datadist-conn-v1 branch April 27, 2025 13:25