
formalize the ROCm image pre-build process #1414


Closed
jeffdaily wants to merge 3 commits

Conversation

jeffdaily
Contributor

ROCm must build magma and MIOpen from source. This takes considerable time and also exceeds the allowed disk space on the GitHub Actions runner.

Building ROCm images is now a two-stage process. For example, first run:

GPU_ARCH_VERSION=5.5 ./manywheel/build_docker_rocm_prebuild.sh

This builds and pushes the image:

rocm/dev-centos-7:5.5-magma-miopen-staging

Then the GHA workflow can proceed as usual.
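
For context, a minimal sketch of how the two stages could be wired together as GitHub Actions jobs. Only the prebuild command above comes from this PR; the workflow/job names, the ubuntu-latest runner, and the second-stage invocation (GPU_ARCH_TYPE=rocm ./manywheel/build_docker.sh) are illustrative assumptions, not the actual pytorch/builder workflow.

```yaml
# Hedged sketch only: job names, runner, and the second-stage build script
# are assumptions for illustration, not the real workflow.
name: rocm-docker-two-stage

on:
  workflow_dispatch:

jobs:
  prebuild-staging-image:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      # Stage 1: compile magma/miopen from source and push the staging image,
      # e.g. rocm/dev-centos-7:5.5-magma-miopen-staging
      - run: GPU_ARCH_VERSION=5.5 ./manywheel/build_docker_rocm_prebuild.sh

  build-manywheel-image:
    needs: prebuild-staging-image
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      # Stage 2: the usual docker build, now starting from the prebuilt
      # staging image instead of rebuilding magma/miopen every time.
      # (Script name and GPU_ARCH_TYPE variable are assumed.)
      - run: GPU_ARCH_TYPE=rocm GPU_ARCH_VERSION=5.5 ./manywheel/build_docker.sh
```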

jithunnair-amd requested review from malfet and atalman on June 5, 2023 at 22:37
@malfet
Contributor

malfet left a comment

Please add CI to do that. Also, there is no harm in picking a bigger runner for the job (as Docker images are rebuilt very infrequently).
Also, just curious: why have the MAGMA build mechanisms for ROCm and CUDA diverged?

@jithunnair-amd
Contributor

jithunnair-amd commented Jun 6, 2023

> > @malfet Do you mean "add CI to build the rocm/dev-centos-7:5.5-magma-miopen-staging image"? That docker image wouldn't be in a PyTorch docker repository.
>
> @jithunnair-amd and that's a problem, isn't it? Because if somebody needs to fix/update magma, they'll have to ask you guys to update the basic container.

I will say that we actually considered using one of our ROCm GHA runners for the builder CI jobs, but picking a bigger CPU-only GHA runner seems like a fine idea too. @jeffdaily, should we try the bigger GHA runner to see if we can avoid this two-stage docker build process?

> > As for Magma, the CUDA flow installs magma from a conda package, while the ROCm flow builds magma from source.
>
> Yes, I know the mechanics, but the question is: why? I.e. magma for CUDA is built by the following workflow, so why can't a ROCm build of magma be added to it?

Actually, we have been working on this recently, trying to take a look at the CUDA magma build flow to see if we can apply it to ROCm magma as well. It's a WIP.

@malfet
Contributor

malfet commented Jun 6, 2023

@jithunnair-amd I'm not saying that ROCm magma builds need to follow CUDA magma builds or vice versa, but it would be nice if the workflows could be kept as close to each other as possible.

@jeffdaily
Contributor Author

> I will say that we actually considered using one of our ROCm GHA runners for the builder CI jobs, but picking a bigger CPU-only GHA runner seems like a fine idea too. @jeffdaily, should we try the bigger GHA runner to see if we can avoid this two-stage docker build process?

YES! That would fix our issues, too, without all this pre-build trickery.

@jithunnair-amd
Contributor

jithunnair-amd commented Jun 6, 2023

@malfet @seemethere What are the runner instance names to use for larger disk space/CPU? This page doesn't indicate specific labels: https://docs.github.com/en/actions/using-github-hosted-runners/using-larger-runners#machine-specs-for-larger-runners

@malfet
Contributor

malfet commented Jun 6, 2023

@jithunnair-amd we have self-hosted runners running on AWS which are defined in https://github.com/pytorch/test-infra/blob/main/.github/scale-config.yml
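
For illustration, selecting one of those self-hosted runners is just a matter of the job's runs-on label. A minimal sketch, assuming the linux.12xlarge label from that scale-config.yml (verify the label and its disk/CPU size against the current file) and a hypothetical single-stage build script:

```yaml
# Hedged sketch: the runner label and the build script are assumptions; the
# label must match an entry in pytorch/test-infra's .github/scale-config.yml.
name: build-rocm-docker

on:
  workflow_dispatch:

jobs:
  build-docker-rocm:
    runs-on: linux.12xlarge  # larger disk/CPU than the default GitHub-hosted runner
    steps:
      - uses: actions/checkout@v3
      # Single-stage build, feasible once the runner has enough disk space
      # (script name assumed for illustration).
      - run: GPU_ARCH_TYPE=rocm GPU_ARCH_VERSION=5.5 ./manywheel/build_docker.sh
```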

jithunnair-amd marked this pull request as draft on June 7, 2023 at 02:13
@jeffdaily
Contributor Author

Closing in favor of #1418.

jeffdaily closed this on Jun 7, 2023