[CI] Refract GPU CIs #487
Conversation
Warning: Rate limit exceeded

@zhiyuan1i has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 23 minutes and 41 seconds before requesting another review.

⌛ How to resolve this issue? After the wait time has elapsed, a review can be triggered using the `@coderabbitai review` command. We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work? CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information.

📒 Files selected for processing (1)
""" WalkthroughThe pull request refactors all GPU-specific GitHub Actions CI workflows by replacing detailed, inline job steps with calls to a new reusable workflow. This reusable workflow, defined in Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Workflow File
participant Reusable CI Workflow
participant Self-hosted Runner
participant Conda/Python Env
participant PyTorch/GPU
Workflow File->>Reusable CI Workflow: Invoke with parameters (runner, gpu_type, etc.)
Reusable CI Workflow->>Self-hosted Runner: Start test-ops job
Self-hosted Runner->>Conda/Python Env: Setup environment
Self-hosted Runner->>PyTorch/GPU: (Optional) Check GPU availability
Self-hosted Runner->>Self-hosted Runner: Detect changes, find dependent tests
Self-hosted Runner->>Conda/Python Env: Install dependencies
Self-hosted Runner->>Self-hosted Runner: Run pytest for ops (standard/varlen)
Self-hosted Runner->>Self-hosted Runner: Verify Python package import
Reusable CI Workflow->>Self-hosted Runner: Start test-models job (after test-ops)
Self-hosted Runner->>Self-hosted Runner: Run pytest for models (standard/varlen)
```
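To make the call pattern concrete, below is a minimal sketch of what a per-GPU caller workflow could look like after this refactor. It is an illustration only: the input names `runner`, `gpu_type`, `pytorch_version`, and `pytorch_cuda_version` are taken from the walkthrough and review comments, but the exact inputs, defaults, and triggers in the PR may differ.

```yaml
# Hypothetical caller workflow, e.g. .github/workflows/nvidia-h100.yml after the refactor.
name: Test H100 (PyTorch 2.7)

on:
  pull_request:
  workflow_dispatch:

jobs:
  test-h100:
    # Delegate all environment setup and test steps to the shared workflow.
    uses: ./.github/workflows/reusable-ci-tests.yml
    with:
      runner: 'nvidia-h100'            # self-hosted runner label (assumed value)
      gpu_type: 'nvidia'               # assumed discriminator between CUDA and XPU setup
      pytorch_version: '2.7.0'
      pytorch_cuda_version: 'cu126'    # should propagate into the callee env (see review below)
    secrets: inherit
```

Keeping each GPU file to a thin `uses:` call like this is what lets a single fix in `reusable-ci-tests.yml` apply to every runner at once.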
Possibly related PRs
Suggested reviewers
Poem
Actionable comments posted: 3
♻️ Duplicate comments (3)
.github/workflows/reusable-ci-tests.yml (1)
205-207: Mirror the fix in the models job
`test-models` declares the same hard-coded value; update it exactly as in the ops job to avoid divergence.

```diff
- PYTORCH_CUDA_VERSION: 'cu128'
+ PYTORCH_CUDA_VERSION: '${{ inputs.pytorch_cuda_version }}'
```
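For context, the suggested diff implies the callee declares `pytorch_cuda_version` as a `workflow_call` input and reads it in both jobs instead of hard-coding `cu128`. A rough sketch of that callee side (names and defaults are assumptions, not the PR's exact contents):

```yaml
# Hypothetical excerpt of .github/workflows/reusable-ci-tests.yml
on:
  workflow_call:
    inputs:
      pytorch_cuda_version:
        description: 'CUDA wheel tag used to resolve PyTorch (e.g. cu126, cu128)'
        type: string
        default: 'cu128'

jobs:
  test-models:
    runs-on: ${{ inputs.runner }}      # assumed runner input, as in the caller sketch above
    env:
      # Thread the caller-supplied value through instead of a hard-coded tag,
      # so a caller passing 'cu126' actually takes effect.
      PYTORCH_CUDA_VERSION: ${{ inputs.pytorch_cuda_version }}
```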
.github/workflows/nvidia-a100.yml (1)

18-27: Same unreleased PyTorch 2.7.0 issue as above – see the A770 comment for details/fix.

.github/workflows/nvidia-4090.yml (1)
24-27: Unreleased PyTorch 2.7.0 again – will fail just like the other two files.
Apply the same version fix or parameterise the version in the reusable workflow.
🧹 Nitpick comments (3)
.github/workflows/reusable-ci-tests.yml (1)
57-70: Improve robustness of the Conda discovery script

Minor but valuable hardening:
- Use `set -euo pipefail` to fail fast on undefined variables.
- Quote `$CANDIDATE_PATH` and `$FOUND_PATH` to survive paths with spaces.
- Return early when a match is found instead of continuing the loop.
No functional change, but makes future debugging easier.
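A sketch of what the hardened discovery step could look like, assuming the script walks a list of candidate Conda locations and exports the first match; the candidate paths and the `CONDA_HOME` variable name are placeholders, not the PR's actual values:

```yaml
      - name: Locate Conda installation
        shell: bash
        run: |
          # Fail fast on errors, unset variables, and broken pipelines.
          set -euo pipefail
          FOUND_PATH=""
          for CANDIDATE_PATH in "$HOME/miniconda3" "/opt/conda" "/usr/local/miniconda3"; do
            # Quote the path so locations with spaces still work.
            if [ -x "$CANDIDATE_PATH/bin/conda" ]; then
              FOUND_PATH="$CANDIDATE_PATH"
              break   # stop at the first match instead of scanning the rest
            fi
          done
          if [ -z "$FOUND_PATH" ]; then
            echo "No Conda installation found on this runner" >&2
            exit 1
          fi
          echo "CONDA_HOME=$FOUND_PATH" >> "$GITHUB_ENV"
```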
.github/workflows/intel-a770.yml (1)
18-27: Consider matrix-testing instead of hard-coding a single job.
A simple matrix `{ {2.1.0, 2.2.0}, {release, nightly} }` would reuse the same reusable workflow while improving coverage and reducing copy-paste between GPU files.
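A rough sketch of that matrix approach, again calling the reusable workflow; the channel input name (`pytorch_channel` here) and the runner/gpu values are illustrative assumptions:

```yaml
jobs:
  test-a770:
    strategy:
      fail-fast: false
      matrix:
        pytorch_version: ['2.1.0', '2.2.0']
        channel: ['release', 'nightly']
    uses: ./.github/workflows/reusable-ci-tests.yml
    with:
      runner: 'intel-a770'                          # assumed self-hosted runner label
      gpu_type: 'intel'                             # assumed value
      pytorch_version: ${{ matrix.pytorch_version }}
      pytorch_channel: ${{ matrix.channel }}        # hypothetical input for release vs. nightly
    secrets: inherit
```

Four matrix entries would then replace four near-identical single jobs, and adding a new PyTorch version becomes a one-line matrix change.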
.github/workflows/nvidia-a100.yml (1)

18-27: Nit: job-id naming drift (`test-a100-pytorch-2-7`) will need an update once the version changes.
Using a neutral id such as `test-a100` avoids future churn.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- .github/workflows/intel-a770.yml (1 hunks)
- .github/workflows/nvidia-4090.yml (1 hunks)
- .github/workflows/nvidia-a100.yml (1 hunks)
- .github/workflows/nvidia-h100.yml (1 hunks)
- .github/workflows/reusable-ci-tests.yml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: Test H100 (PyTorch 2.7) / test-models
- GitHub Check: Test H100 (PyTorch Nightly) / test-ops
🔇 Additional comments (2)
.github/workflows/nvidia-h100.yml (1)
49-51: Mismatching CUDA version between caller & callee

The job passes `pytorch_cuda_version: 'cu126'`, but the reusable workflow currently overwrites the env to `cu128` (see previous comment).
After fixing the reusable workflow, double-check that `'cu126'` really propagates, otherwise the 2.6 wheel resolution will fail.

.github/workflows/intel-a770.yml (1)
18-27: PyTorch 2.7.0 doesn’t yet exist – job will fail at environment-setup time.
Current latest upstream release is 2.2.x. The conda/pip install step inside `reusable-ci-tests.yml` will error out, aborting every A770 run.

```diff
- pytorch_version: '2.7.0'
+ # keep this in sync with the highest *released* version
+ pytorch_version: '2.2.0'
```

Run once to prove the version absence:
```bash
#!/bin/bash
curl -s https://pypi.org/pypi/torch/json \
  | jq -r '.releases | keys[]' \
  | grep -q '^2\.7\.0$' && echo "found" || echo "NOT FOUND"
```
Summary by CodeRabbit
Refactor
Chores