
formalize the ROCm image pre-build process #1414


Closed
jeffdaily wants to merge 3 commits

Conversation

jeffdaily
Contributor

ROCm must build magma and MIOpen from source. This takes considerable time and also exceeds the allowed disk space on the GitHub Actions runner.

Building ROCm images is now a two-stage process. For example, first run:

GPU_ARCH_VERSION=5.5 ./manywheel/build_docker_rocm_prebuild.sh

This builds and pushes the image:

rocm/dev-centos-7:5.5-magma-miopen-staging

Then the GHA workflow can proceed as usual.
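
For context, a minimal sketch of how the two stages could be wired together as GitHub Actions jobs. Only the prebuild command above comes from this PR; the workflow/job names, the ubuntu-latest runner, and the second-stage invocation (GPU_ARCH_TYPE=rocm ./manywheel/build_docker.sh) are illustrative assumptions, not the actual pytorch/builder workflow.

```yaml
# Hedged sketch only: job names, runner, and the second-stage build script
# are assumptions for illustration, not the real workflow.
name: rocm-docker-two-stage

on:
  workflow_dispatch:

jobs:
  prebuild-staging-image:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      # Stage 1: compile magma/miopen from source and push the staging image,
      # e.g. rocm/dev-centos-7:5.5-magma-miopen-staging
      - run: GPU_ARCH_VERSION=5.5 ./manywheel/build_docker_rocm_prebuild.sh

  build-manywheel-image:
    needs: prebuild-staging-image
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      # Stage 2: the usual docker build, now starting from the prebuilt
      # staging image instead of rebuilding magma/miopen every time.
      # (Script name and GPU_ARCH_TYPE variable are assumed.)
      - run: GPU_ARCH_TYPE=rocm GPU_ARCH_VERSION=5.5 ./manywheel/build_docker.sh
```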

jithunnair-amd requested review from malfet and atalman on June 5, 2023 at 22:37
@malfet
Contributor

malfet left a comment

Please add CI to do that. Also, there is no harm in picking a bigger runner for the job (as Docker images are rebuilt very infrequently).
Also, just curious: why have the MAGMA build mechanisms for ROCm and CUDA diverged?

@jithunnair-amd
Contributor

jithunnair-amd commented Jun 6, 2023

> > @malfet Do you mean "add CI to build the rocm/dev-centos-7:5.5-magma-miopen-staging image"? That docker image wouldn't be in a PyTorch docker repository.
>
> @jithunnair-amd and that's a problem, isn't it? Because if somebody needs to fix/update magma, they'll have to ask you guys to update the basic container.

I will say that we actually considered using one of our ROCm GHA runners for the builder CI jobs, but picking a bigger CPU-only GHA runner seems like a fine idea too. @jeffdaily, should we try the bigger GHA runner to see if we can avoid this two-stage docker build process?

> > As for Magma, the CUDA flow installs magma from a conda package, while the ROCm flow builds magma from source.
>
> Yes, I know the mechanics, but the question is: why? I.e. magma for CUDA is built by the following workflow, so why can't a ROCm build of magma be added to it?

Actually, we have been working on this recently, trying to take a look at the CUDA magma build flow to see if we can apply it to ROCm magma as well. It's a WIP.

@malfet
Contributor

malfet commented Jun 6, 2023

@jithunnair-amd I'm not saying that ROCm magma builds need to follow CUDA magma builds or vice versa, but it would be nice if the workflows could be kept as close to each other as possible.

@jeffdaily
Contributor Author

> I will say that we actually considered using one of our ROCm GHA runners for the builder CI jobs, but picking a bigger CPU-only GHA runner seems like a fine idea too. @jeffdaily, should we try the bigger GHA runner to see if we can avoid this two-stage docker build process?

YES! That would fix our issues, too, without all this pre-build trickery.

@jithunnair-amd
Contributor

jithunnair-amd commented Jun 6, 2023

@malfet @seemethere What are the runner instance names to use for larger disk space/CPU? This page doesn't indicate specific labels: https://docs.github.com/en/actions/using-github-hosted-runners/using-larger-runners#machine-specs-for-larger-runners

@malfet
Contributor

malfet commented Jun 6, 2023

@jithunnair-amd we have self-hosted runners running on AWS which are defined in https://github.com/pytorch/test-infra/blob/main/.github/scale-config.yml
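
For illustration, selecting one of those self-hosted runners is just a matter of the job's runs-on label. A minimal sketch, assuming the linux.12xlarge label from that scale-config.yml (verify the label and its disk/CPU size against the current file) and a hypothetical single-stage build script:

```yaml
# Hedged sketch: the runner label and the build script are assumptions; the
# label must match an entry in pytorch/test-infra's .github/scale-config.yml.
name: build-rocm-docker

on:
  workflow_dispatch:

jobs:
  build-docker-rocm:
    runs-on: linux.12xlarge  # larger disk/CPU than the default GitHub-hosted runner
    steps:
      - uses: actions/checkout@v3
      # Single-stage build, feasible once the runner has enough disk space
      # (script name assumed for illustration).
      - run: GPU_ARCH_TYPE=rocm GPU_ARCH_VERSION=5.5 ./manywheel/build_docker.sh
```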

jithunnair-amd marked this pull request as draft on June 7, 2023 at 02:13
@jeffdaily
Contributor Author

Closing in favor of #1418.

jeffdaily closed this on Jun 7, 2023