[Nova] Add GHA Linux CPU Unittests for Torchvision #6759
Conversation
@osalpekar I understand this PR is still WIP. Shall we mark as draft until you are ready? Feel free to ping us when you are done to help you review. :)
@datumbox - Sure thing, marking this as a draft :)
e7a1862 to ab0e710
On a linux.4xlarge instance the job completes, but a handful of tests fail because they cannot allocate memory to fork a process: https://github.com/pytorch/vision/actions/runs/3244683126/jobs/5321174303.
.github/workflows/test-linux-cpu.yml
Outdated
if: ${{ (github.event_name == 'pull_request' && startsWith(github.base_ref, 'release')) || startsWith(github.ref, 'refs/heads/release') }}
run: |
  echo "CHANNEL=test" >> "$GITHUB_ENV"
- name: Install TorchVision
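For readers less familiar with GitHub Actions expressions, the `if:` condition above can be sketched in plain POSIX shell. This is only an illustration, not the workflow's actual code: `select_channel` is a hypothetical helper, and the `nightly` fallback is an assumption, since the snippet above only shows the `test` case.

```shell
# Hypothetical helper mirroring the step's `if:` condition.
# Arguments: event name, base ref (PR target branch), full git ref.
select_channel() {
  event_name="$1"
  base_ref="$2"
  ref="$3"

  # Push to a release branch: startsWith(github.ref, 'refs/heads/release')
  case "$ref" in
    refs/heads/release*) echo "test"; return 0 ;;
  esac

  # Pull request targeting a release branch:
  # github.event_name == 'pull_request' && startsWith(github.base_ref, 'release')
  if [ "$event_name" = "pull_request" ]; then
    case "$base_ref" in
      release*) echo "test"; return 0 ;;
    esac
  fi

  # Assumed default; the diff does not show what is used outside release branches.
  echo "nightly"
}
```

Inside a workflow step, the result could be exported with something like `echo "CHANNEL=$(select_channel "$GITHUB_EVENT_NAME" "$GITHUB_BASE_REF" "$GITHUB_REF")" >> "$GITHUB_ENV"`, using the default environment variables that GitHub Actions provides.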
Nit: you might want to try out https://github.com/pytorch/test-infra/tree/main/.github/actions/setup-miniconda here; I think it would hide all the complex logic, such as ENV_NAME.
Seeing such errors when using the setup-miniconda action: https://github.com/pytorch/vision/actions/runs/3252067256/jobs/5337858164. Perhaps this is due to conda-build being installed in the same conda-env? Reverting to the local conda env for now.
lol, from what I see in https://circleci.com/docs/configuration-reference/, CircleCI 2xlarge+ is the largest Docker tier that CircleCI offers, and it's not the same as our self-hosted AWS linux.2xlarge. CircleCI 2xlarge+ has 20 vCPUs and 40GB of memory, which is somewhere between AWS c5.4xlarge and c5.12xlarge (https://aws.amazon.com/ec2/instance-types/c5). This explains why 4xlarge still fails, given that it has only "32GB" of memory.
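The sizing argument above boils down to a simple comparison, sketched here in POSIX shell. The 40GB baseline and the 32GB figure come from the comment itself; the 96GB figure for c5.12xlarge is taken from the AWS c5 page and is not stated in the thread. `check_instance` is a hypothetical helper, and memory is only one of the factors that could matter in practice.

```shell
# Baseline: memory of the CircleCI 2xlarge+ Docker tier, as quoted above.
BASELINE_GB=40

# Hypothetical helper: report whether an instance meets the baseline.
# Arguments: instance label, instance memory in GB.
check_instance() {
  if [ "$2" -ge "$BASELINE_GB" ]; then
    echo "$1: ${2}GB >= ${BASELINE_GB}GB, ok"
  else
    echo "$1: ${2}GB < ${BASELINE_GB}GB, too small"
  fi
}

check_instance c5.4xlarge 32    # figure quoted in the comment above
check_instance c5.12xlarge 96   # figure from the AWS c5 page (not in the thread)
```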
901d159 to 5133175
Thanks @huydhn! 12xlarge does the trick: https://github.com/pytorch/vision/actions/runs/3251225589/jobs/5335938671. Next, I will clean up, use the conda-setup job, and use the build matrix to cover all the configs we want.
21c4223 to 4003322
The workflow looks good to me 💯 Let's get a stamp from vision folks.
LGTM, thanks!
@huydhn @osalpekar I was trying to set up in a similar way as here
Add my thoughts on #6665 on why
2351d3d to 2167e0b
Hey @osalpekar! You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py
Summary:
* [Nova][WIP] Add Linux CPU Unittests for Torchvision
* use conda-builder image since conda installation is needed
* install torch dep with conda instead
* use circleCI command to run tests
* larger instance to avoid OOM issues
* proper syntax for self-hosted runners
* 4xlarge instance
* 8xlarge
* 12xlarge
* use setup-miniconda job
* add back PATH change to help setup py detect conda
* run conda shell script
* install other deps up front
* git config and undo path change
* revert to local conda install
* conda-builder image
* support for whole python version matrix
* clean up the conda env once we are done with the job

Reviewed By: YosuaMichael
Differential Revision: D40588169
fbshipit-source-id: 515b12daa84d1707f6b700782fade13f8532ff05
Adding a GitHub Action to run Linux CPU Unittests.
For context, it seems that standard/2xlarge instances cause this job to OOM. 4xlarge runs most of the tests, but a few still fail due to OOM. 8xlarge instances seem to have much longer queueing times. CircleCI simply requests 2xlarge+ in its resource config: https://github.com/pytorch/vision/blob/main/.circleci/config.yml#L739