Skip to content

[Nova] Add GHA Linux CPU Unittests for Torchvision #6759

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Oct 17, 2022

Conversation

osalpekar
Copy link
Member

@osalpekar osalpekar commented Oct 12, 2022

Adding a GitHub Action to run Linux CPU Unittests.

For context, it seems like standard/2xlarge instances cause this job to OOM. 4xlarge runs most of the tests, but a few failure due to OOM as well. 8xlarge instances seem to have much higher queueing times. CircleCI simply requests 2xlarge+ in its resource config: https://github.com/pytorch/vision/blob/main/.circleci/config.yml#L739

@osalpekar osalpekar changed the title [Nova][WIP] Add Linux CPU Unittests for Torchvision [Nova][WIP] Add GHA Linux CPU Unittests for Torchvision Oct 12, 2022
@datumbox
Copy link
Contributor

@osalpekar I understand this PR is still WIP. Shall we mark as draft until you are ready? Feel free to ping us when you are done to help you review. :)

@osalpekar
Copy link
Member Author

@osalpekar I understand this PR is still WIP. Shall we mark as draft until you are ready? Feel free to ping us when you are done to help you review. :)

@datumbox - Sure thing, marking this as a draft :)

@osalpekar osalpekar marked this pull request as draft October 13, 2022 15:29
@osalpekar osalpekar force-pushed the linux_cpu_unittests branch from e7a1862 to ab0e710 Compare October 13, 2022 15:47
@osalpekar
Copy link
Member Author

Linux.4xlarge instance sees the job complete but a handful of tests fail to allocate new memory for forking a process: https://github.com/pytorch/vision/actions/runs/3244683126/jobs/5321174303.

if: ${{ (github.event_name == 'pull_request' && startsWith(github.base_ref, 'release')) || startsWith(github.ref, 'refs/heads/release') }}
run: |
echo "CHANNEL=test" >> "$GITHUB_ENV"
- name: Install TorchVision
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: you might want to try out https://github.com/pytorch/test-infra/tree/main/.github/actions/setup-miniconda here, it would hide all the complex logic there like ENV_NAME I think

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing such errors when using the setup-miniconda action: https://github.com/pytorch/vision/actions/runs/3252067256/jobs/5337858164. Perhaps this is due to conda-build being installed in the same conda-env? Reverting to the local conda env for now.

@huydhn
Copy link
Contributor

huydhn commented Oct 14, 2022

lol, from what I see https://circleci.com/docs/configuration-reference/, Circle CI 2xlarge+ is the largest tier that Circle CI has for Docker and it's not the same as our self-hosted AWS linux.2xlarge.

Circle CI 2xlarge+ has 20 vCPU and 40GB of memory, which is somewhere in between AWS c5.4xlarge and c5.12xlarge https://aws.amazon.com/ec2/instance-types/c5. This explains why 4xlarge still fails given that it has only "32GB" of memory

@osalpekar osalpekar force-pushed the linux_cpu_unittests branch from 901d159 to 5133175 Compare October 14, 2022 15:48
@osalpekar
Copy link
Member Author

lol, from what I see https://circleci.com/docs/configuration-reference/, Circle CI 2xlarge+ is the largest tier that Circle CI has for Docker and it's not the same as our self-hosted AWS linux.2xlarge.

Circle CI 2xlarge+ has 20 vCPU and 40GB of memory, which is somewhere in between AWS c5.4xlarge and c5.12xlarge https://aws.amazon.com/ec2/instance-types/c5. This explains why 4xlarge still fails given that it has only "32GB" of memory

Thanks @huydhn! 12xlarge does the trick: https://github.com/pytorch/vision/actions/runs/3251225589/jobs/5335938671. Will cleanup, use the conda-setup job, and use the build matrix to cover all the configs we want next.

@osalpekar osalpekar force-pushed the linux_cpu_unittests branch from 21c4223 to 4003322 Compare October 14, 2022 21:08
@osalpekar osalpekar marked this pull request as ready for review October 14, 2022 21:08
@osalpekar osalpekar changed the title [Nova][WIP] Add GHA Linux CPU Unittests for Torchvision [Nova] Add GHA Linux CPU Unittests for Torchvision Oct 14, 2022
Copy link
Contributor

@huydhn huydhn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The workflow looks good to me 💯 Let's get a stamp from vision folks.

@osalpekar osalpekar requested a review from datumbox October 14, 2022 21:34
Copy link
Contributor

@datumbox datumbox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@vfdev-5
Copy link
Collaborator

vfdev-5 commented Oct 17, 2022

@huydhn @osalpekar I was trying to setup in a similar way as here pytorch/conda-builder:cuda116 container (#6665) for tests with cuda. But somehow, nvidia-smi which is in the container is not seen in the CI step. Can you please take a look and help enabling cuda tests. Thanks

@huydhn
Copy link
Contributor

huydhn commented Oct 17, 2022

@huydhn @osalpekar I was trying to setup in a similar way as here pytorch/conda-builder:cuda116 container (#6665) for tests with cuda. But somehow, nvidia-smi which is in the container is not seen in the CI step. Can you please take a look and help enabling cuda tests. Thanks

Add my thoughts on #6665 on why nvidia-smi doesn't show up there

@osalpekar osalpekar force-pushed the linux_cpu_unittests branch from 2351d3d to 2167e0b Compare October 17, 2022 17:57
@osalpekar osalpekar merged commit 0610b13 into pytorch:main Oct 17, 2022
@github-actions
Copy link

Hey @osalpekar!

You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py

facebook-github-bot pushed a commit that referenced this pull request Oct 21, 2022
Summary:
* [Nova][WIP] Add Linux CPU Unittests for Torchvision

* use conda-builder image since conda installation is needed

* install torch dep with conda instead

* use circleCI command to run tests

* larger instance to avoid OOM issues

* proper syntax for self-hosted runners

* 4xlarge instance

* 8xlarge

* 12xlarge

* use setup-miniconda job

* add back PATH change to help setup py detect conda

* run conda shell script

* install other deps up front

* git config and undo path change

* revert to local conda install

* conda-builder image

* support for whole python version matrix

* clean up the conda env once we are done with the job

Reviewed By: YosuaMichael

Differential Revision: D40588169

fbshipit-source-id: 515b12daa84d1707f6b700782fade13f8532ff05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants