[Optimization] default compile rdma, reduce cudagraph buffer size in mm, fix some config bug #5121
base: develop
Conversation
Thanks for your contribution!
Pull Request Overview
This PR makes three main optimizations: enables RDMA compilation by default, reduces CUDA graph buffer sizes in vision-language models, and fixes configuration validation logic.
- Enables the RDMA extension to compile by default by removing conditional checks on `ENABLE_FD_RDMA`
- Optimizes memory usage by using `max_capture_size` instead of `max_model_len` for CUDA graph buffers
- Fixes RDMA port validation to account for both data parallel and tensor parallel sizes
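The buffer-size optimization can be sketched as follows. This is a minimal illustration with hypothetical names (the real models allocate Paddle tensors, not NumPy arrays): CUDA graphs only replay batch shapes up to `max_capture_size`, so sizing the persistent buffer by the full `max_model_len` over-allocates GPU memory.

```python
import numpy as np

def alloc_hidden_states_buffer(max_capture_size: int, max_model_len: int,
                               hidden_size: int, use_cudagraph: bool) -> np.ndarray:
    """Allocate the persistent hidden_states buffer (hypothetical sketch).

    Previously the buffer was sized [max_model_len, hidden_size]. Since CUDA
    graphs only replay shapes up to max_capture_size, the smaller bound is
    sufficient and shrinks the persistent GPU memory footprint.
    """
    rows = max_capture_size if use_cudagraph else max_model_len
    return np.zeros((rows, hidden_size), dtype=np.float32)
```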
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| setup.py | Removes conditional compilation for RDMA extension, making it compile by default |
| qwen2_5_vl/qwen2_5_vl.py | Reduces buffer size from max_model_len to max_capture_size and optimizes buffer usage with conditional copy |
| paddleocr_vl/paddleocr_vl.py | Reduces buffer size and simplifies forward logic by removing redundant if-else branches |
| ernie_vl_rm.py | Reduces buffer size and adds conditional buffer copy for CUDA graph |
| ernie4_5_vl/ernie4_5_vl_moe.py | Reduces buffer size and adds conditional buffer copy for CUDA graph |
| engine/args_utils.py | Fixes RDMA port validation to multiply both DP and TP sizes |
| config.py | Disables sequence parallel MoE when using CUDA graph for mixed/decode roles |
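The port-validation fix can be illustrated with a hedged sketch (function and parameter names are hypothetical, not FastDeploy's actual API): every tensor-parallel rank of every data-parallel replica needs its own RDMA port, so the check must compare against `dp_size * tp_size`, not `tp_size` alone.

```python
def validate_rdma_ports(rdma_comm_ports, data_parallel_size, tensor_parallel_size):
    """Check that one RDMA port is supplied per worker rank (sketch).

    The total number of ranks is data_parallel_size * tensor_parallel_size;
    validating against tp_size alone under-counts ports when dp_size > 1.
    """
    required = data_parallel_size * tensor_parallel_size
    if rdma_comm_ports is not None and len(rdma_comm_ports) != required:
        raise ValueError(
            f"Got {len(rdma_comm_ports)} RDMA ports, "
            f"but dp_size * tp_size = {required}"
        )
```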
Motivation
PD disaggregation depends on the rdma_comm package (https://github.com/PaddlePaddle/FastDeploy/tree/develop/fastdeploy/cache_manager/transfer_factory/kvcache_transfer), but it is not compiled by default and the resulting error is not obvious.
In hybrid parallelism, combining CUDA graph with TSP can hang.
The hidden_states buffer allocated in the multimodal CUDA graph network is too large and wastes GPU memory.
Modifications
Compile the rdma_comm package by default.
Fix the TSP hang: when CUDA graph is enabled, do not capture shapes whose capture size < tp_size.
Compute the buffer shape properly and update the allocation.
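The TSP-hang fix can be sketched roughly as follows (hypothetical names; the real logic lives in FastDeploy's graph-optimization configuration): when sequence-parallel MoE scatters a batch across tensor-parallel ranks, a captured shape with fewer rows than tp_size cannot be partitioned evenly and may hang, so such sizes are dropped from the capture list.

```python
def filter_capture_sizes(capture_sizes, tensor_parallel_size, use_cudagraph):
    """Drop CUDA graph capture sizes smaller than the TP degree (sketch).

    With sequence-parallel MoE, each captured batch is split across tp ranks;
    a shape smaller than tp_size cannot be divided and can hang at capture or
    replay time, so it is excluded from the sizes to capture.
    """
    if not use_cudagraph:
        return list(capture_sizes)
    return [s for s in capture_sizes if s >= tensor_parallel_size]
```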
Usage or Command
None.
Accuracy Tests
None.
Checklist
- Add at least one tag from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- For a `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.