Conversation

**yuantailing** (Member) commented Feb 20, 2025

Previous behaviour: extensions were built in series, even though multiple files within the same extension were compiled in parallel.

This pull request adds support for parallel building of multiple extensions. Benchmark results show:

| CPU | Build Parameters | Build Time |
| --- | --- | --- |
| AMD EPYC 24-Core (48 threads) | Original (no optimizations) | 45m29.243s |
| AMD EPYC 24-Core (48 threads) | `--parallel 4` | 12m55.243s |
| AMD EPYC 24-Core (48 threads) | `--parallel 16` | 6m47.962s |
| AMD EPYC 24-Core (48 threads) | `NVCC_APPEND_FLAGS="--threads 8"` | 19m23.878s |
| AMD EPYC 24-Core (48 threads) | `--parallel 4`, `NVCC_APPEND_FLAGS="--threads 8"` | 7m33.151s |
| AMD EPYC 24-Core (48 threads) | `--parallel 16`, `NVCC_APPEND_FLAGS="--threads 8"` | 5m58.479s |
| Intel Xeon 112-Core (224 threads) | `NVCC_APPEND_FLAGS="--threads 8"` | 14m9.081s |
| Intel Xeon 112-Core (224 threads) | `--parallel 16`, `NVCC_APPEND_FLAGS="--threads 8"` | 2m24.733s |
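The idea behind `--parallel N` (compiling several extensions at once under a bounded job count) can be sketched with `xargs -P`; the extension names and the `echo` stand-in are illustrative only, not apex's actual build code:

```shell
# Run up to 4 "builds" concurrently; each echo stands in for one
# extension's compile step.
printf '%s\n' fused_adam fmha group_norm transducer |
  xargs -P 4 -I{} sh -c 'echo "building {}"'
```

With `-P 1` the jobs run one at a time, matching the previous in-series behaviour.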

Memory usage is shown below. The "mem used" values were obtained with the `free` command; background memory usage is included.


| Build Parameters | Peak mem used |
| --- | --- |
| Original | 24.11 GiB |
| `--parallel 16` | 58.25 GiB |
| `NVCC_APPEND_FLAGS="--threads 8"` | 91.39 GiB |
| `--parallel 16`, `NVCC_APPEND_FLAGS="--threads 8"` | 150.96 GiB |

Image: nvcr.io/nvidia/pytorch:25.01-py3
(or other images with the same CUDA version and TORCH_CUDA_ARCH_LISTS)

cmdline:

```bash
time NVCC_APPEND_FLAGS="--threads 8" pip wheel -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext --distributed_adam --distributed_lamb --cuda_ext --permutation_search --bnp --xentropy --focal_loss --group_norm --index_mul_2d --deprecated_fused_adam --deprecated_fused_lamb --fast_layer_norm --fmha --fast_multihead_attn --transducer --cudnn_gbn --peer_memory --nccl_p2p --fast_bottleneck --fused_conv_bias_relu --nccl_allocator --gpu_direct_storage --parallel 16" ./
```

**alpha0422** (Contributor) commented

@crcrpar Could you help review this PR? This reduces APEX build time a lot.

README.md (outdated)

```bash
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext --cuda_ext --parallel 4" ./
```
**crcrpar** (Collaborator) commented

Like the `--threads` option, this would increase CPU memory usage, so could you separately add an example command that combines `--threads` and `--parallel`?

**yuantailing** (Member, Author) commented

Updated README.md.

README.md Outdated

To reduce the build time of APEX, parallel building can be enhanced via

```bash
export NVCC_APPEND_FLAGS="--threads 4"
```
**alpha0422** (Contributor) commented

I'd suggest not exporting this env var; it affects nvcc globally.

**yuantailing** (Member, Author) commented

Moved it to a temporary environment scope.
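The per-command ("temporary") scoping discussed above can be demonstrated with a throwaway command; here `sh -c ... echo` stands in for the real pip/nvcc invocation:

```shell
# Per-command environment scope: NVCC_APPEND_FLAGS is visible only to this
# single invocation (the echo stands in for the actual pip install).
NVCC_APPEND_FLAGS="--threads 4" sh -c 'echo "flags=$NVCC_APPEND_FLAGS"'
# Unlike `export NVCC_APPEND_FLAGS=...`, nothing leaks into the session:
echo "after=${NVCC_APPEND_FLAGS:-unset}"
```

This way other nvcc invocations in the same shell are unaffected.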

**crcrpar** (Collaborator) left a comment

Thank you for implementing this nice option.

@crcrpar crcrpar merged commit c9e6f05 into NVIDIA:master Feb 25, 2025
oraluben added a commit to oraluben/SageAttention that referenced this pull request Jul 7, 2025
XiaomingXu1995 added a commit to thu-ml/SageAttention that referenced this pull request Jul 13, 2025
forrestl111 pushed a commit to forrestl111/SageAttention that referenced this pull request Jul 23, 2025
