Add distributed CI job (4xH100) and example unit tests #1106

yf225 · 2025-11-08T06:58:27Z

Part of #477.

cc. @joydddd

.github/workflows/test.yml

oulgen

failing tests

.github/workflows/test.yml

oulgen · 2025-11-08T07:50:29Z

.github/matrix.json

+      "image": "nvidia/cuda:13.0.1-devel-ubuntu24.04",
+      "runtime-version": "cu130",
+      "container-options": "--gpus all",
+      "pytorch-version": "pytorch-nightly",


Should we just use pinned PyTorch for this? That would avoid building Triton altogether.

Yeah, I think the PyTorch symm-mem library is still being actively improved, and using PyTorch nightly will ensure that we can iterate on symm-mem and Helion in lockstep to get the latest features.
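Since the job depends on a nightly build rather than a pinned release, a CI step could check which kind of build was actually installed. The following is a minimal sketch, not something taken from this PR. It assumes nightly wheels carry a `.devYYYYMMDD` segment in their version string (for example `2.10.0.dev20251108+cu130`), and the helper name `is_nightly` is hypothetical.

```python
import re

# Assumption: nightly builds embed a ".devYYYYMMDD" date segment in the version.
_NIGHTLY_PATTERN = re.compile(r"\.dev\d{8}")


def is_nightly(version: str) -> bool:
    """Return True if the version string looks like a nightly build."""
    return _NIGHTLY_PATTERN.search(version) is not None
```

In a real job the argument would be `torch.__version__`; the string is passed in here so the helper can be exercised without PyTorch installed.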

yf225 · 2025-11-08T21:20:01Z

Will wait for #1107 to land first to have a clean CI.

yf225 · 2025-11-09T00:37:43Z

Somehow the tests are being skipped in CI, still debugging it.
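One way to keep silently skipped tests from passing unnoticed is to fail loudly when the distributed job sees too few GPUs, while still allowing a skip elsewhere. The sketch below illustrates that idea only and is not the code from this PR. The environment variable `DIST_CI_JOB` and the helper `check_distributed_gpus` are hypothetical names, and the default of 4 GPUs mirrors the 4xH100 runner in the title.

```python
import os


def check_distributed_gpus(visible_gpus, required=4, env=None):
    """Return True if distributed tests should run, False if they should be skipped.

    Raises RuntimeError when the (hypothetical) distributed CI flag is set but
    fewer than `required` GPUs are visible, so the job fails instead of
    silently skipping every test.
    """
    env = os.environ if env is None else env
    # Hypothetical flag name; the real workflow may signal this differently.
    in_dist_job = env.get("DIST_CI_JOB", "0") == "1"
    if visible_gpus >= required:
        return True
    if in_dist_job:
        raise RuntimeError(
            f"distributed CI job expects {required} GPUs, found {visible_gpus}"
        )
    # Outside the distributed job, too few GPUs just means "skip".
    return False
```

In a test module, `visible_gpus` would typically come from `torch.cuda.device_count()`, and a `False` result would translate into a test skip.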

yf225 requested review from jansel and oulgen November 8, 2025 06:58

meta-cla bot added the CLA Signed label Nov 8, 2025

yf225 changed the title ~~Add distributed CI job and distributed example unit tests~~ Add distributed CI job and example unit tests Nov 8, 2025

oulgen reviewed Nov 8, 2025

.github/workflows/test.yml (outdated, resolved)

.github/workflows/test.yml (outdated, resolved)

.github/workflows/test.yml (outdated, resolved)

oulgen requested changes Nov 8, 2025

yf225 force-pushed the dist_ci_job_and_example_tests branch 2 times, most recently from 426c29d to 683ef76 Compare November 8, 2025 07:37

oulgen reviewed Nov 8, 2025

.github/workflows/test.yml (outdated, resolved)

oulgen approved these changes Nov 8, 2025

oulgen reviewed Nov 8, 2025

yf225 changed the title ~~Add distributed CI job and example unit tests~~ Add distributed CI job (4xH100) and example unit tests Nov 8, 2025

jansel approved these changes Nov 8, 2025

yf225 force-pushed the dist_ci_job_and_example_tests branch 2 times, most recently from 03dbe34 to 6427d2f Compare November 8, 2025 23:23

yf225 force-pushed the dist_ci_job_and_example_tests branch 2 times, most recently from adaf8b7 to 63066b9 Compare November 9, 2025 05:59

yf225 marked this pull request as draft November 9, 2025 07:33

yf225 changed the title ~~Add distributed CI job (4xH100) and example unit tests~~ [WIP] Add distributed CI job (4xH100) and example unit tests Nov 9, 2025

yf225 force-pushed the dist_ci_job_and_example_tests branch 9 times, most recently from f0072c4 to f24febc Compare November 9, 2025 19:51

yf225 force-pushed the dist_ci_job_and_example_tests branch 2 times, most recently from 2cd19b4 to 7b9a4ec Compare November 9, 2025 21:30

yf225 changed the title ~~[WIP] Add distributed CI job (4xH100) and example unit tests~~ Add distributed CI job (4xH100) and example unit tests Nov 9, 2025

yf225 marked this pull request as ready for review November 9, 2025 22:07

yf225 added 12 commits November 9, 2025 14:27

Add distributed CI job

ddb842d

Add distributed unit tests for existing examples

a9f000e

fix test.yml

66b6073

remove custom timeout

ccae05f

clean up test.yml

3bbeeb0

clean up test.yml

310645c

debug skipped tests

a175c94

try cuda 12.8

a72b524

pip install ninja

5bbbc62

Revert "debug skipped tests"

7c90e91

This reverts commit 63066b9.

add assert for 4 gpu when running distributed job

cce01ad

remove skip_if_lt_x_gpu

26032dd

yf225 force-pushed the dist_ci_job_and_example_tests branch from 0942d02 to 26032dd Compare November 9, 2025 22:27

yf225 added 2 commits November 9, 2025 14:28

Revert "remove skip_if_lt_x_gpu"

9473fe4

This reverts commit 26032dd.

add skip check for dist job

7062b3d

yf225 merged commit 8a23df1 into main Nov 9, 2025
16 checks passed
