Commit 9b057a5

Update base for Update on "Incorporate coalesce analysis in codegen"
This PR uses the coalescing information when generating a tiling. Under the previous heuristic, each dependency generated a tiling; we then summed the score for each generated tiling, preferring any 2D tiling over the default. The new heuristic scores each tiling by the amount of global memory it coalesces. This gives a potentially better tiling (especially for more complicated 3D patterns) and also provides information we can use when generating block sizes. In the Triton heuristics for 3D tiled reductions, we take the same total block size that the 2D reduction would use, then distribute it according to whichever block coalesces the most memory.

The motivating kernel is in #149982, which is a 32-element reduction. A smaller version of it is [here](https://gist.github.com/eellison/0fa9396f5479eb4dba09756e3bf6ff2a). We need to run this kernel once in the forward pass per linear layer on a contiguous tensor, and once in the backward pass on a transposed tensor. While the contiguous kernel has coalesced accesses and is performant on main, the transposed version accesses uncoalesced memory on main and is ~2.8x slower. See this [full log](https://gist.github.com/eellison/fa644bfd9d0ae11dadb62e17a5d48a83) from the above repro. With this PR, it is only ~1.15x slower. See the [updated log](https://gist.github.com/eellison/0b2b653309494d28cf7b48929a022075).

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov

[ghstack-poisoned]
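
The scoring idea described above can be illustrated with a minimal Python sketch. This is not Inductor's implementation: the `Access` record, the function names, the stride-1 test for "coalesced", and the `cap` parameter are all assumptions made for illustration. It scores each candidate dimension by the bytes it would coalesce, then splits a fixed total block size in favor of the best-scoring dimension.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """Hypothetical record of one load/store: element stride per loop dim, bytes moved."""
    strides: tuple[int, ...]
    nbytes: int


def coalesced_bytes(accesses, inner_dim):
    """Sum the bytes of accesses that are stride-1 along the candidate dimension."""
    return sum(a.nbytes for a in accesses if a.strides[inner_dim] == 1)


def score_dims(accesses, dim_names):
    """Map each dimension name to the bytes it would coalesce if tiled innermost."""
    return {name: coalesced_bytes(accesses, i) for i, name in enumerate(dim_names)}


def distribute_block(total_block, scores, cap=64):
    """Assign block sizes in descending score order, keeping their product fixed.

    Assumes total_block and cap are powers of two (an assumption of this sketch).
    """
    remaining = total_block
    sizes = {}
    ordered = sorted(scores, key=scores.get, reverse=True)
    for i, dim in enumerate(ordered):
        is_last = i == len(ordered) - 1
        size = remaining if is_last else min(cap, remaining)
        sizes[dim] = size
        remaining //= size
    return sizes


# A transposed input load (stride-1 along "y") and a contiguous output store
# (stride-1 along "x"); the sizes are made up for the example.
accesses = [
    Access(strides=(1, 1024), nbytes=4 * 1024 * 1024),
    Access(strides=(1024, 1), nbytes=1 * 1024 * 1024),
]
scores = score_dims(accesses, ["y", "x"])
blocks = distribute_block(1024, scores)
print(scores, blocks)
```

In this toy setup the transposed load dominates the traffic, so "y" receives the larger share of the 1024-element block while the product of the block sizes stays the same as in the untiled case.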
2 parents 81727ff + e8f8baf commit 9b057a5

2,008 files changed, +73227 -33019 lines changed

.ci/caffe2/README.md

Lines changed: 0 additions & 2 deletions
@@ -10,5 +10,3 @@ example: `py2-cuda9.0-cudnn7-ubuntu16.04`. The Docker images that are
 built on Jenkins and are used in triggered builds already have this
 environment variable set in their manifest. Also see
 `./docker/jenkins/*/Dockerfile` and search for `BUILD_ENVIRONMENT`.
-
-Our Jenkins installation is located at https://ci.pytorch.org/jenkins/.

.ci/caffe2/test.sh

Lines changed: 0 additions & 4 deletions
@@ -13,10 +13,6 @@ if [[ "${BUILD_ENVIRONMENT}" == *-android* ]]; then
   echo 'Skipping tests'
   exit 0
 fi
-if [[ "${BUILD_ENVIRONMENT}" == *-rocm* ]]; then
-  # temporary to locate some kernel issues on the CI nodes
-  export HSAKMT_DEBUG_LEVEL=4
-fi
 # These additional packages are needed for circleci ROCm builds.
 if [[ $BUILD_ENVIRONMENT == *rocm* ]]; then
   # Need networkx 2.0 because bellmand_ford was moved in 2.1 . Scikit-image by

.ci/docker/README.md

Lines changed: 1 addition & 1 deletion
@@ -34,5 +34,5 @@ See `build.sh` for valid build environments (it's the giant switch).
 ./build.sh pytorch-linux-bionic-py3.8-gcc9 -t myimage:latest
 
 # Set flags (see build.sh) and build image
-sudo bash -c 'PROTOBUF=1 ./build.sh pytorch-linux-bionic-py3.8-gcc9 -t myimage:latest
+sudo bash -c 'TRITON=1 ./build.sh pytorch-linux-bionic-py3.8-gcc9 -t myimage:latest
 ```

.ci/docker/almalinux/Dockerfile

Lines changed: 14 additions & 15 deletions
@@ -1,5 +1,6 @@
 ARG CUDA_VERSION=12.4
 ARG BASE_TARGET=cuda${CUDA_VERSION}
+ARG ROCM_IMAGE=rocm/dev-almalinux-8:6.3-complete
 FROM amd64/almalinux:8 as base
 
 ENV LC_ALL en_US.UTF-8
@@ -8,10 +9,6 @@ ENV LANGUAGE en_US.UTF-8
 
 ARG DEVTOOLSET_VERSION=11
 
-ENV LC_ALL en_US.UTF-8
-ENV LANG en_US.UTF-8
-ENV LANGUAGE en_US.UTF-8
-
 RUN yum -y update
 RUN yum -y install epel-release
 RUN yum install -y sudo wget curl perl util-linux xz bzip2 git patch which perl zlib-devel openssl-devel yum-utils autoconf automake make gcc-toolset-${DEVTOOLSET_VERSION}-toolchain
@@ -41,11 +38,12 @@ RUN bash ./install_conda.sh && rm install_conda.sh
 
 # Install CUDA
 FROM base as cuda
-ARG CUDA_VERSION=12.4
+ARG CUDA_VERSION=12.6
 RUN rm -rf /usr/local/cuda-*
 ADD ./common/install_cuda.sh install_cuda.sh
 COPY ./common/install_nccl.sh install_nccl.sh
 COPY ./ci_commit_pins/nccl-cu* /ci_commit_pins/
+COPY ./common/install_cusparselt.sh install_cusparselt.sh
 ENV CUDA_HOME=/usr/local/cuda-${CUDA_VERSION}
 # Preserve CUDA_VERSION for the builds
 ENV CUDA_VERSION=${CUDA_VERSION}
@@ -56,28 +54,29 @@ FROM cuda as cuda11.8
 RUN bash ./install_cuda.sh 11.8
 ENV DESIRED_CUDA=11.8
 
-FROM cuda as cuda12.1
-RUN bash ./install_cuda.sh 12.1
-ENV DESIRED_CUDA=12.1
-
-FROM cuda as cuda12.4
-RUN bash ./install_cuda.sh 12.4
-ENV DESIRED_CUDA=12.4
-
 FROM cuda as cuda12.6
 RUN bash ./install_cuda.sh 12.6
 ENV DESIRED_CUDA=12.6
 
+FROM cuda as cuda12.8
+RUN bash ./install_cuda.sh 12.8
+ENV DESIRED_CUDA=12.8
+
+FROM ${ROCM_IMAGE} as rocm
+ENV PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
+ADD ./common/install_mkl.sh install_mkl.sh
+RUN bash ./install_mkl.sh && rm install_mkl.sh
+ENV MKLROOT /opt/intel
+
 # Install MNIST test data
 FROM base as mnist
 ADD ./common/install_mnist.sh install_mnist.sh
 RUN bash ./install_mnist.sh
 
 FROM base as all_cuda
 COPY --from=cuda11.8 /usr/local/cuda-11.8 /usr/local/cuda-11.8
-COPY --from=cuda12.1 /usr/local/cuda-12.1 /usr/local/cuda-12.1
-COPY --from=cuda12.4 /usr/local/cuda-12.4 /usr/local/cuda-12.4
 COPY --from=cuda12.6 /usr/local/cuda-12.6 /usr/local/cuda-12.6
+COPY --from=cuda12.4 /usr/local/cuda-12.8 /usr/local/cuda-12.8
 
 # Final step
 FROM ${BASE_TARGET} as final

.ci/docker/almalinux/build.sh

Lines changed: 11 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -15,9 +15,16 @@ fi
1515
DOCKER_TAG_PREFIX=$(echo "${image}" | awk -F':' '{print $2}')
1616

1717
CUDA_VERSION=""
18+
ROCM_VERSION=""
19+
EXTRA_BUILD_ARGS=""
1820
if [[ "${DOCKER_TAG_PREFIX}" == cuda* ]]; then
1921
# extract cuda version from image name and tag. e.g. manylinux2_28-builder:cuda12.8 returns 12.8
2022
CUDA_VERSION=$(echo "${DOCKER_TAG_PREFIX}" | awk -F'cuda' '{print $2}')
23+
EXTRA_BUILD_ARGS="--build-arg CUDA_VERSION=${CUDA_VERSION}"
24+
elif [[ "${DOCKER_TAG_PREFIX}" == rocm* ]]; then
25+
# extract rocm version from image name and tag. e.g. manylinux2_28-builder:rocm6.2.4 returns 6.2.4
26+
ROCM_VERSION=$(echo "${DOCKER_TAG_PREFIX}" | awk -F'rocm' '{print $2}')
27+
EXTRA_BUILD_ARGS="--build-arg ROCM_IMAGE=rocm/dev-almalinux-8:${ROCM_VERSION}-complete"
2128
fi
2229

2330
case ${DOCKER_TAG_PREFIX} in
@@ -27,6 +34,9 @@ case ${DOCKER_TAG_PREFIX} in
2734
cuda*)
2835
BASE_TARGET=cuda${CUDA_VERSION}
2936
;;
37+
rocm*)
38+
BASE_TARGET=rocm
39+
;;
3040
*)
3141
echo "ERROR: Unknown docker tag ${DOCKER_TAG_PREFIX}"
3242
exit 1
@@ -47,8 +57,8 @@ docker build \
4757
--target final \
4858
--progress plain \
4959
--build-arg "BASE_TARGET=${BASE_TARGET}" \
50-
--build-arg "CUDA_VERSION=${CUDA_VERSION}" \
5160
--build-arg "DEVTOOLSET_VERSION=11" \
61+
${EXTRA_BUILD_ARGS} \
5262
-t ${tmp_tag} \
5363
$@ \
5464
-f "${TOPDIR}/.ci/docker/almalinux/Dockerfile" \
