[Optimization] default compile rdma, reduce cudagraph buffer size in mm, fix some config bug #5121
base: develop
Conversation
Thanks for your contribution!
Pull Request Overview
This PR makes three main optimizations: enables RDMA compilation by default, reduces CUDA graph buffer sizes in vision-language models, and fixes configuration validation logic.
- Enables the RDMA extension to compile by default by removing conditional checks on `ENABLE_FD_RDMA`
- Optimizes memory usage by using `max_capture_size` instead of `max_model_len` for CUDA graph buffers
- Fixes RDMA port validation to account for both data parallel and tensor parallel sizes
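The buffer-size optimization can be sketched as follows. This is a minimal illustration with hypothetical names (the real models allocate Paddle tensors, not NumPy arrays): CUDA graphs only replay batch shapes up to `max_capture_size`, so sizing the persistent buffer by the full `max_model_len` over-allocates GPU memory.

```python
import numpy as np

def alloc_hidden_states_buffer(max_capture_size: int, max_model_len: int,
                               hidden_size: int, use_cudagraph: bool) -> np.ndarray:
    """Allocate the persistent hidden_states buffer (hypothetical sketch).

    Previously the buffer was sized [max_model_len, hidden_size]. Since CUDA
    graphs only replay shapes up to max_capture_size, the smaller bound is
    sufficient and shrinks the persistent GPU memory footprint.
    """
    rows = max_capture_size if use_cudagraph else max_model_len
    return np.zeros((rows, hidden_size), dtype=np.float32)
```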
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| setup.py | Removes conditional compilation for RDMA extension, making it compile by default |
| qwen2_5_vl/qwen2_5_vl.py | Reduces buffer size from max_model_len to max_capture_size and optimizes buffer usage with conditional copy |
| paddleocr_vl/paddleocr_vl.py | Reduces buffer size and simplifies forward logic by removing redundant if-else branches |
| ernie_vl_rm.py | Reduces buffer size and adds conditional buffer copy for CUDA graph |
| ernie4_5_vl/ernie4_5_vl_moe.py | Reduces buffer size and adds conditional buffer copy for CUDA graph |
| engine/args_utils.py | Fixes RDMA port validation to multiply both DP and TP sizes |
| config.py | Disables sequence parallel MoE when using CUDA graph for mixed/decode roles |
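The port-validation fix can be illustrated with a hedged sketch (function and parameter names are hypothetical, not FastDeploy's actual API): every tensor-parallel rank of every data-parallel replica needs its own RDMA port, so the check must compare against `dp_size * tp_size`, not `tp_size` alone.

```python
def validate_rdma_ports(rdma_comm_ports, data_parallel_size, tensor_parallel_size):
    """Check that one RDMA port is supplied per worker rank (sketch).

    The total number of ranks is data_parallel_size * tensor_parallel_size;
    validating against tp_size alone under-counts ports when dp_size > 1.
    """
    required = data_parallel_size * tensor_parallel_size
    if rdma_comm_ports is not None and len(rdma_comm_ports) != required:
        raise ValueError(
            f"Got {len(rdma_comm_ports)} RDMA ports, "
            f"but dp_size * tp_size = {required}"
        )
```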
Motivation
PD disaggregation depends on the rdma_comm package (https://github.com/PaddlePaddle/FastDeploy/tree/develop/fastdeploy/cache_manager/transfer_factory/kvcache_transfer), but it is not compiled by default and the resulting error is not obvious.
In hybrid parallelism, combining CUDA graph with TSP can hang.
The hidden_states buffer allocated in the multimodal CUDA graph network is too large and wastes GPU memory.
Modifications
Compile the rdma_comm package by default.
Fix the TSP hang: when CUDA graph is enabled, do not capture shapes whose capture size < tp_size.
Compute the buffer shape properly and update the allocation.
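The TSP-hang fix can be sketched roughly as follows (hypothetical names; the real logic lives in FastDeploy's graph-optimization configuration): when sequence-parallel MoE scatters a batch across tensor-parallel ranks, a captured shape with fewer rows than tp_size cannot be partitioned evenly and may hang, so such sizes are dropped from the capture list.

```python
def filter_capture_sizes(capture_sizes, tensor_parallel_size, use_cudagraph):
    """Drop CUDA graph capture sizes smaller than the TP degree (sketch).

    With sequence-parallel MoE, each captured batch is split across tp ranks;
    a shape smaller than tp_size cannot be divided and can hang at capture or
    replay time, so it is excluded from the sizes to capture.
    """
    if not use_cudagraph:
        return list(capture_sizes)
    return [s for s in capture_sizes if s >= tensor_parallel_size]
```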
Usage or Command
None.
Accuracy Tests
None.
Checklist
- Add at least one tag from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- For a `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.