@Chao1Han Chao1Han commented Jun 4, 2025

Support high-priority streams for XCCL. The test case is added in #2049.
This PR needs to be merged first, followed by the upstream op registration in pytorch/pytorch#163049; only then can the test case pass.

@Copilot Copilot AI review requested due to automatic review settings June 4, 2025 02:57
@Copilot Copilot AI left a comment


Pull Request Overview

This pull request adds support for high priority streams in the XCCL process group. Key changes include adding a new Options struct with high priority and group name parameters, introducing a new groupRanks() accessor, and updating constructor and logging logic to reflect high priority stream usage.
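As a rough illustration of the shape described in this overview, here is a minimal standalone sketch. It is not the PR's code: `OptionsSketch`, `ProcessGroupSketch`, the field names, and the behavior of `groupRanks()` (returning ranks 0..size-1) are all assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Hypothetical options bundle; the field names are assumptions, not the PR's code.
struct OptionsSketch {
  bool is_high_priority_stream = false;  // request a high-priority stream
  std::string group_name;                // human-readable group identifier
};

// Hypothetical stand-in for a process group that stores its options.
class ProcessGroupSketch {
 public:
  ProcessGroupSketch(int rank, int size, OptionsSketch options = {})
      : rank_(rank), size_(size), options_(std::move(options)) {
    assert(size > 0 && rank >= 0 && rank < size);
  }

  bool usesHighPriorityStream() const {
    return options_.is_high_priority_stream;
  }

  // Assumed behavior: the group contains ranks 0..size-1.
  std::vector<int> groupRanks() const {
    std::vector<int> ranks(static_cast<std::size_t>(size_));
    std::iota(ranks.begin(), ranks.end(), 0);
    return ranks;
  }

  // A log line of the kind a constructor might emit, including the priority flag.
  std::string describe() const {
    return "group=" + options_.group_name + " rank=" + std::to_string(rank_) +
           " size=" + std::to_string(size_) + " high_priority=" +
           (usesHighPriorityStream() ? "1" : "0");
  }

 private:
  int rank_;
  int size_;
  OptionsSketch options_;
};
```

Keeping the priority flag inside an options object, rather than as an extra constructor argument, lets callers that do not care about stream priority keep using the defaults.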

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/xccl/ProcessGroupXCCL.hpp | Added a high priority stream option and new Options struct for configuration. |
| src/xccl/ProcessGroupXCCL.cpp | Updated constructor initialization, logging, and introduced groupRanks(). |

Comments suppressed due to low confidence (1)

src/xccl/ProcessGroupXCCL.hpp:25

  • The constant TORCH_XCCL_HIGH_PRIORITY is defined as a non-const vector. Consider renaming and declaring it as a const container (or using constexpr) to clearly indicate its immutability.
static std::vector<std::string> TORCH_XCCL_HIGH_PRIORITY = {
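One way the suggestion above could look is sketched below. The initializer of the real variable is not shown in this thread, so the names `kHighPriorityEnvNames` and `EXAMPLE_XCCL_HIGH_PRIORITY`, and the idea that the vector holds environment variable names, are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical constant: declared const so callers cannot mutate it.
// The contents are an assumption; the real initializer is not shown here.
static const std::vector<std::string> kHighPriorityEnvNames = {
    "EXAMPLE_XCCL_HIGH_PRIORITY"};

// Returns true if any listed environment variable is set to a
// non-empty value other than "0".
bool anyEnvEnabled(const std::vector<std::string>& names) {
  for (const auto& name : names) {
    assert(!name.empty());
    const char* value = std::getenv(name.c_str());
    if (value != nullptr) {
      const std::string text(value);
      if (!text.empty() && text != "0") {
        return true;
      }
    }
  }
  return false;
}
```

With `const`, an accidental `kHighPriorityEnvNames.push_back(...)` becomes a compile error rather than a silent change to shared state.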

@pytorchxpubot

@sys_pytorchxpubot triage result for run 15864761625
Triage bot UT analysis result, for reference only. Note that each unique error message is reported only once:
  1. third_party.torch-xpu-ops.test.xpu.test_modules_xpu.TestModuleXPU test_cpu_gpu_parity_nn_CrossEntropyLoss_xpu_float64 failed with the error message:
 AssertionError: Scalars are not close!

Triage bot response:

{
  "similar_issue_id": 645,
  "similar_issue_state": "closed",
  "issue_owner": "daisyden",
  "issue_description": "UT got failed with FP64 emulation feature. The reporter is mengfei25, and the assignee is daisyden. The issue is closed.",
  "root_causes": [
    "Issues related to tensor operations and reductions leading to precision mismatches.",
    "Potential differences in computation between CPU and XPU implementations.",
    "Possible issues with the CrossEntropyLoss implementation on XPU."
  ],
  "suggested_solutions": [
    "Investigate the precision handling in CrossEntropyLoss on XPU.",
    "Check for any implementation differences causing scalar mismatches.",
    "Consider allowing a small tolerance in scalar comparisons for CPU-GPU parity tests.",
    "Review and update test cases to handle potential precision discrepancies."
  ]
}
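For context on the "small tolerance" suggestion, a closeness check of the usual mixed absolute/relative form can be sketched as follows. This is a generic illustration, not the test suite's implementation: `isClose` is a hypothetical helper and the tolerance values are left to the caller.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: true when |actual - expected| <= atol + rtol * |expected|.
// NaN never compares close; infinities compare close only when equal.
bool isClose(double actual, double expected, double rtol, double atol) {
  assert(rtol >= 0.0 && atol >= 0.0);
  if (std::isnan(actual) || std::isnan(expected)) {
    return false;
  }
  if (actual == expected) {
    return true;  // also covers matching infinities
  }
  if (std::isinf(actual) || std::isinf(expected)) {
    return false;
  }
  return std::fabs(actual - expected) <= atol + rtol * std::fabs(expected);
}
```

A "Scalars are not close!" failure means a comparison of this kind returned false for the CPU and XPU results under the tolerances the test used.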

@chuanqi129 chuanqi129 added this pull request to the merge queue Sep 17, 2025
Merged via the queue into main with commit 74b11bf Sep 17, 2025
25 checks passed
@chuanqi129 chuanqi129 deleted the xccl/high_stream branch September 17, 2025 00:57
mengfei25 added a commit that referenced this pull request Sep 17, 2025
Support high-priority streams for XCCL. The test case is added in #2049.
This PR needs to be merged first, followed by the upstream op registration in pytorch/pytorch#163049; only then can the test case pass.

---------

Co-authored-by: mengfei25 <[email protected]>