
Conversation

@yuanlehome (Collaborator) commented Nov 19, 2025

Motivation

Modifications

  • Compile the rdma_comm package by default

    • XPU/GPU only
  • Fix the tsp hang: when CUDA graph is enabled, do not capture shapes whose capture size is smaller than tp_size

  • Compute the buffer shape properly and update the buffers accordingly
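The capture-size fix above can be sketched as a simple filter. This is a minimal illustration, not FastDeploy's actual API: the function name and parameters are hypothetical. The idea is that under tensor-sequence parallelism a batch cannot be split across more ranks than it has tokens, so shapes smaller than tp_size are dropped from the CUDA graph capture list.

```python
# Hypothetical sketch (names are illustrative, not the project's real API):
# when CUDA graph capture is enabled, skip candidate batch shapes that are
# smaller than the tensor-parallel size, since sequence parallelism cannot
# shard a batch across more ranks than it has tokens.
def filter_capture_sizes(capture_sizes, tp_size, use_cudagraph=True):
    if not use_cudagraph:
        return list(capture_sizes)
    return [s for s in capture_sizes if s >= tp_size]

print(filter_capture_sizes([1, 2, 4, 8, 16], tp_size=4))
```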

Usage or Command

None.

Accuracy Tests

None.

Checklist

  • Add at least one tag to the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. If no unit tests are added, please explain why in this PR.
  • Provide accuracy results.
  • If this PR targets a release branch, make sure it has first been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings November 19, 2025 05:07

paddle-bot bot commented Nov 19, 2025

Thanks for your contribution!

Copilot finished reviewing on behalf of yuanlehome November 19, 2025 05:08
Copilot AI (Contributor) left a comment


Pull Request Overview

This PR makes three main optimizations: enables RDMA compilation by default, reduces CUDA graph buffer sizes in vision-language models, and fixes configuration validation logic.

  • Enables RDMA extension to compile by default by removing conditional checks on ENABLE_FD_RDMA
  • Optimizes memory usage by using max_capture_size instead of max_model_len for CUDA graph buffers
  • Fixes RDMA port validation to account for both data parallel and tensor parallel sizes
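The port-validation fix in the last bullet can be sketched as follows. This is a hedged illustration with hypothetical names, not the code in engine/args_utils.py: the point is that each worker rank (one per data-parallel rank times tensor-parallel rank) needs its own RDMA port, so the check must multiply both sizes rather than use the DP size alone.

```python
# Hypothetical sketch of the corrected check: one RDMA port is needed per
# worker rank, i.e. dp_size * tp_size ports in total, not dp_size alone.
def validate_rdma_ports(ports, dp_size, tp_size):
    required = dp_size * tp_size
    if len(ports) < required:
        raise ValueError(
            f"need {required} RDMA ports "
            f"(dp {dp_size} x tp {tp_size}), got {len(ports)}"
        )

validate_rdma_ports(list(range(8)), dp_size=2, tp_size=4)  # passes
```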

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| setup.py | Removes the conditional compilation guard for the RDMA extension, making it compile by default |
| qwen2_5_vl/qwen2_5_vl.py | Reduces buffer size from max_model_len to max_capture_size and optimizes buffer usage with a conditional copy |
| paddleocr_vl/paddleocr_vl.py | Reduces buffer size and simplifies forward logic by removing redundant if-else branches |
| ernie_vl_rm.py | Reduces buffer size and adds a conditional buffer copy for CUDA graph |
| ernie4_5_vl/ernie4_5_vl_moe.py | Reduces buffer size and adds a conditional buffer copy for CUDA graph |
| engine/args_utils.py | Fixes RDMA port validation to multiply both DP and TP sizes |
| config.py | Disables sequence parallel MoE when using CUDA graph for mixed/decode roles |
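The buffer changes described for the VL models above follow a common pattern, sketched below with NumPy standing in for Paddle tensors. All names here are illustrative assumptions, not the models' real code: the persistent CUDA-graph input buffer is sized to max_capture_size (the largest captured shape) instead of max_model_len, and data is copied into it only when a graph replay will actually read from it.

```python
import numpy as np  # stand-in for paddle tensors in this sketch

# Size the persistent buffer to the largest CUDA-graph capture size,
# not the full model context length (max_model_len), saving memory.
def make_buffer(max_capture_size, hidden_size):
    return np.zeros((max_capture_size, hidden_size), dtype=np.float32)

def forward(x, buffer, use_cudagraph):
    n = x.shape[0]
    if use_cudagraph and n <= buffer.shape[0]:
        # Conditional copy: only stage the input into the captured
        # buffer when a graph replay will consume it.
        buffer[:n] = x
        return buffer[:n]
    # Eager path (or shape too large to capture): use the input directly.
    return x
```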

@yuanlehome yuanlehome marked this pull request as draft November 19, 2025 06:05
@yuanlehome yuanlehome marked this pull request as ready for review November 19, 2025 07:39
rainyfly
rainyfly previously approved these changes Nov 19, 2025
