Define USE_C10D_XCCL and USE_XCCL in pytorch #13

Chao1Han · 2025-02-21T03:05:37Z


Summary: Added some logging and captured the indexing. See below image. {F1977773416}

This is why the saved module path is called `/tmp/jimwan/minimizer_a_acc.pt`. Now the updated module paths are `/tmp/jimwan/minimizer_addmm_default_103_acc.pt`.

Test Plan:
```
MTIAC_USE_DIST_REF_KERNELS=all buck2 run @//mode/opt mtia/accuracy/minimizer:mtia_minimizer_runner -- --mode sequential --compare_fn allclose --pt_save_dir /tmp/debug3 --atol 1e-4 --rtol 1e-4 --all_outputs --start_idx native_layer_norm_default_80 --end_idx getitem_272 2>&1 | tee ~/test.log
```
{F1977773610}

Reviewed By: qcyuan

Differential Revision: D74369107

Pull Request resolved: pytorch#153130
Approved by: https://github.com/Skylion007

Related to: pytorch#152426 Pull Request resolved: pytorch#152825 Approved by: https://github.com/malfet

…s and outputs (pytorch#152878) Flex Attention may have symints in subgraph inputs and outputs. Existing code implicitly captures these symints but does not explicitly store it in TritonTemplateBuffer. This leads to error when analyzing symints used in Flex Attention as a TritonTemplateBuffer. This PR fixes the issue. Pull Request resolved: pytorch#152878 Approved by: https://github.com/drisspg

…caled_grouped_mm (pytorch#152968) Summary: The autotuner is using zero-filled tensors to autotune _scaled_grouped_mm and that's not appropriate for the offsets tensor, since it essentially corresponds to "no input" and thus yields invalid perf results. We can't really use the actual input tensors, since we might be compiling this op in the context of an entire graph. So instead, I decided to create a synthetic offsets tensor assuming that each group is (roughly) the same size. I don't have data but I'd guess this approach is OK for MoE since we're generally hoping to load-balance the experts; I'm not sure how well it applies to other scenarios that might be more heavy-tailed. Test Plan: ``` pytest test_matmul_cuda.py -k test_scaled_grouped_gemm_ ``` Pull Request resolved: pytorch#152968 Approved by: https://github.com/ngimel
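The "roughly equal groups" idea described above can be sketched in plain Python. This is only an illustration: the helper name `synthetic_group_offsets` is hypothetical, and the convention that each offset is the cumulative end row of its group is an assumption, not something the commit message spells out.

```python
from typing import List


def synthetic_group_offsets(total_rows: int, num_groups: int) -> List[int]:
    """Return cumulative end offsets for groups of (roughly) equal size.

    Hypothetical helper; the real autotuner builds a tensor, not a list.
    """
    if num_groups <= 0:
        raise ValueError("num_groups must be positive")
    base, remainder = divmod(total_rows, num_groups)
    offsets = []
    end = 0
    for group in range(num_groups):
        # The first `remainder` groups take one extra row so sizes sum to total_rows.
        end += base + (1 if group < remainder else 0)
        offsets.append(end)
    return offsets


balanced_offsets = synthetic_group_offsets(10, 4)
```

Unlike a zero-filled offsets tensor, every group here is non-empty whenever `total_rows >= num_groups`, which is the property the autotuning inputs need.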

@malfet

Summary: As discussed with @malfet , we're porting nativert code to torch/nativert/. Following up some concerns over the new directory, I'm trying to setup the tooling on OSS so various things (like linters) can run on torch/nativert/ properly. Test Plan: CI Differential Revision: D74407808 Pull Request resolved: pytorch#153164 Approved by: https://github.com/dolpm, https://github.com/Skylion007

@ezyang

Follow up to @ezyang's PR pytorch#153020, but makes better use of PEP 621 to reduce redundant fields and pass metadata through to uv, setuptools, poetry and other tooling.

* Enables modern tooling like `uv sync` and better support for tools like poetry.
* Also allows us to set project-wide settings that are respected by linters and IDEs (in this example we are able to centralize the minimum supported Python version).
* Currently most of the values are dynamically fetched from setuptools; eventually we can migrate all the statically defined values to pyproject.toml and they will be autopopulated in the setuptools arguments.
* This controls what additional metadata shows up on PyPI. Special URL names are listed here for rendering on PyPI: https://packaging.python.org/en/latest/specifications/well-known-project-urls/#well-known-labels

This also clearly shows us which fields will need to be migrated to pyproject.toml over time from setup.py per pytorch#152276. Static fields should be fairly easy to migrate; the dynamically built ones like requirements are a bit more challenging.

Without this, `uv sync` complains:
```
error: No `project` table found in: `pytorch/pyproject.toml`
```
Pull Request resolved: pytorch#153055
Approved by: https://github.com/ezyang

Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#152146 Approved by: https://github.com/laithsakka

…rch#153112) A typical `bmm` kernel in Helion needs to pass in symint shapes to `torch.baddbmm`. Currently `self.expand((dim1, dim2, dim3))` in baddbmm runs unconditionally and it doesn't work with symint shapes (it raises the following error):
```
Traceback (most recent call last):
  File "/home/willfeng/local/helion_yf225/helion/_compiler/type_propagation.py", line 699, in propagate_call
    CheckForIndexCalls.retry_call(self.value, proxy_args, proxy_kwargs),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/helion_yf225/helion/_compiler/tile_index_proxy.py", line 104, in retry_call
    return fn(*proxy_args, **proxy_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1338, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1986, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1450, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 2645, in _dispatch_impl
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_ops.py", line 806, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_prims_common/wrappers.py", line 309, in _fn
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_meta_registrations.py", line 2172, in meta_baddbmm
    self = self.expand((dim1, dim2, dim3))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: /home/willfeng/local/pytorch/build/aten/src/ATen/RegisterCompositeExplicitAutograd_0.cpp:5025: SymIntArrayRef expected to contain only concrete integers
```
This PR changes it so that we don't run `expand()` when not necessary, which makes the Helion use case (i.e. no broadcasting) work.

Pull Request resolved: pytorch#153112
Approved by: https://github.com/jansel
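The "skip `expand()` when it is not needed" idea can be illustrated with plain tuples. This is a sketch under assumptions: `maybe_expand_shape` is a hypothetical stand-in, it works on concrete integers rather than symbolic sizes, and the exact condition the PR uses to decide that no expansion is necessary is not shown in the description above.

```python
from typing import Sequence, Tuple


def maybe_expand_shape(shape: Sequence[int], target: Sequence[int]) -> Tuple[int, ...]:
    """Return the broadcast shape, taking a fast path when nothing changes."""
    shape, target = tuple(shape), tuple(target)
    if shape == target:
        # Fast path: no broadcasting required, so no expand-style call is made.
        return shape
    if len(shape) > len(target):
        raise ValueError(f"cannot expand {shape} to {target}")
    # Left-pad with 1s, then apply the usual broadcasting rule per dimension.
    padded = (1,) * (len(target) - len(shape)) + shape
    for have, want in zip(padded, target):
        if have != 1 and have != want:
            raise ValueError(f"cannot expand {shape} to {target}")
    return target


same_shape = maybe_expand_shape((2, 3, 4), (2, 3, 4))
broadcast_shape = maybe_expand_shape((1, 4), (2, 3, 4))
```

In the no-broadcasting case the function returns before any per-dimension work happens, which mirrors why the Helion use case no longer reaches the failing call.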

1. Do multiple captures 2. Perform multiple collectives in one capture 3. Multiple replays (existing) Pull Request resolved: pytorch#150040 Approved by: https://github.com/fduwjj

Copied description by @hj-wei from ROCm#1809 > Hi all, I manually generating nvcc to bypass NVIDIA component checks(Megatron-LM), see https://github.com/NVIDIA/Megatron-LM/blob/2da43ef4c1b9e76f03b7567360cf7390e877f1b6/megatron/legacy/fused_kernels/__init__.py#L57 > but it can lead to incorrect CUDA_HOME configurations. This can cause initialization anomalies in downstream libraries like DeepSpeed Pull Request resolved: pytorch#152236 Approved by: https://github.com/jeffdaily

<img width="838" alt="Screenshot 2025-05-08 at 10 51 05 AM" src="https://github.com/user-attachments/assets/4cf43a16-3801-424b-a74f-ede1d41ff052" /> Pull Request resolved: pytorch#153079 Approved by: https://github.com/mori360

We have been seeing a lot of `Starting event listener thread for rank` messages recently in test print-outs. This moves them to `logger.debug`. Pull Request resolved: pytorch#153116 Approved by: https://github.com/fduwjj
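A minimal sketch of why moving a message to `logger.debug` quiets the output, using only the standard `logging` module. The logger name `event_listener_demo` and the INFO threshold are assumptions chosen for the illustration, not the configuration the tests actually use.

```python
import io
import logging

# Capture log output in memory so the effect is easy to inspect.
stream = io.StringIO()
logger = logging.getLogger("event_listener_demo")  # hypothetical logger name
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.INFO)
logger.propagate = False

# Emitted at DEBUG level: filtered out while the logger threshold is INFO.
logger.debug("Starting event listener thread for rank %s", 0)
# Emitted at INFO level: still shows up.
logger.info("visible message")

captured = stream.getvalue()
```

Anyone who still wants the thread start-up message can lower the logger level to `logging.DEBUG`.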

…os (pytorch#151727) Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#151727 Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/eqy

…orch.any() (pytorch#152658) Fixes pytorch#152176 Pull Request resolved: pytorch#152658 Approved by: https://github.com/malfet

…53055)" This reverts commit 5976419. Reverted pytorch#153055 on behalf of https://github.com/malfet due to And failures seems related to this change, but I don't know how, see for example https://hud.pytorch.org/hud/pytorch/pytorch/7cb5c751c3f76129c95aec479647679fbba4e7ba/1?per_page=50&name_filter=macos&mergeEphemeralLF=true ([comment](pytorch#153055 (comment)))

The edited comment should have the info. The code change looks large, but it's copied from the install_cache script that our docker images use: https://github.com/pytorch/pytorch/blob/6a8006472e431f872ca40c7aad250b61105de583/.ci/docker/common/install_cache.sh#L42

Sccache stopped working on xla at some point near Dec 17, 2023. I am not sure what commit caused it. I think it was having trouble writing to the cache.

Either way, there is an sccache already installed on the docker image, so we should use that instead of a binary from S3 whose source and build commit we are probably no longer sure of.

The one in the docker image is installed here https://github.com/pytorch/xla/blob/69d438ee65cc250c974ca80edd80462ffbb2e163/.github/upstream/Dockerfile#L61 and is also very old, so I have pytorch/xla#9102 to update it.

sccache is still not writing properly and I will investigate, but the xla build is currently broken after the above xla PR, and this should fix it.

Pull Request resolved: pytorch#153002
Approved by: https://github.com/malfet

…ing (pytorch#153159) Pull Request resolved: pytorch#153159 Approved by: https://github.com/laithsakka ghstack dependencies: pytorch#152883, pytorch#153105

So far only run: - inductor/test_fp8.py - test_matmul_cuda.py - inductor/test_max_autotune.py Pull Request resolved: pytorch#153170 Approved by: https://github.com/drisspg

as title Pull Request resolved: pytorch#152211 Approved by: https://github.com/svekars, https://github.com/eellison, https://github.com/desertfire

@EikanWang

Fixes: intel/torch-xpu-ops#1503

The `sycl/ext/oneapi/bfloat16.hpp` header file is a DPC++ compiler internal header. It is not documented for usage (see the extension specification linked below) and is not guaranteed to exist. Instead, the documented usage of the extension is to include `sycl/sycl.hpp`, which in turn includes the `bfloat16.hpp` header (an implementation detail).

We ran into issues by explicitly including the `bfloat16.hpp` SYCL header within a user-facing production environment when the `intel-sycl-rt` wheel is installed (it is a dependency of the `torch` wheel package built and publicly available for xpu). The compiler includes this file from `intel-sycl-rt`, and because the header relies on `#pragma once` its content is included as well, giving redefinitions of the symbols in this file (the previous inclusion comes from `sycl/sycl.hpp`):
```
In file included from /workspace/lib/python3.12/site-packages/torch/include/c10/util/BFloat16.h:23:
/opt/intel/oneapi/compiler/2025.0/bin/compiler/../../include/sycl/ext/oneapi/bfloat16.hpp:60:23: error: redefinition of 'BF16VecToFloatVec'
   60 | template <int N> void BF16VecToFloatVec(const bfloat16 src[N], float dst[N]) {
      |                       ^
/workspace/include/sycl/ext/oneapi/bfloat16.hpp:60:23: note: previous definition is here
   60 | template <int N> void BF16VecToFloatVec(const bfloat16 src[N], float dst[N]) {
      |
```
While the SYCL header files themselves can be improved (`#pragma once` dropped), we still must correct the usage of the SYCL `bfloat16.hpp` header in pytorch, i.e. drop it. This fortunately helps to address the reported issue of redefinitions, though a follow-up on the compiler side is still required.

Also, using `SYCL_EXT_ONEAPI_BFLOAT16_MATH_FUNCTIONS` to guard the inclusion of `sycl/sycl.hpp` does not make sense since it is defined in this very header. Thus, we should use `SYCL_LANGUAGE_VERSION` instead, which is defined at the compiler level.
See: https://github.com/intel/llvm/blob/f958dce28053dff145cd725ff57bc4ce94cb94d7/sycl/doc/extensions/experimental/sycl_ext_oneapi_bfloat16_math_functions.asciidoc CC: @EikanWang, @guangyey, @gujinghui Pull Request resolved: pytorch#152562 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD

…`true` case (pytorch#153083) Also add some missing `@onlyCUDA` / support check decorators in `test_matmul_cuda.py` Should help resolve pytorch#151890 Pull Request resolved: pytorch#153083 Approved by: https://github.com/janeyx99

This reverts commit f87a0fe. Reverted pytorch#153170 on behalf of https://github.com/clee2000 due to workflow doesnt have right concurrency group? ([comment](pytorch#153170 (comment)))

Pull Request resolved: pytorch#153196 Approved by: https://github.com/BoyuanFeng

Pull Request resolved: pytorch#150907 Approved by: https://github.com/henrylhtsang, https://github.com/eellison ghstack dependencies: pytorch#153196

Pull Request resolved: pytorch#153177 Approved by: https://github.com/henrylhtsang ghstack dependencies: pytorch#153196, pytorch#150907

@malfet

This commit moves the query for xpu arch flags to runtime when building SYCL extensions, which allows adjusting `TORCH_XPU_ARCH_LIST` at the Python script level. That's handy, for example, in a CI test that tries a few variants of the list. CC: @malfet, @jingxu10, @EikanWang, @guangyey Pull Request resolved: pytorch#152192 Approved by: https://github.com/guangyey, https://github.com/gujinghui, https://github.com/albanD
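The difference between querying at import time and at call time can be sketched with a small helper. Everything here is illustrative: `xpu_arch_list` is a hypothetical function, the comma-separated format is an assumption, and the `arch_a`/`arch_b` values are placeholders rather than real architecture names.

```python
import os


def xpu_arch_list() -> list:
    """Read TORCH_XPU_ARCH_LIST when called, not when the module is imported."""
    raw = os.environ.get("TORCH_XPU_ARCH_LIST", "")
    # Assumption: entries are comma-separated; empty entries are dropped.
    return [arch.strip() for arch in raw.split(",") if arch.strip()]


# A script (for example a CI test) can now change the list between builds.
os.environ["TORCH_XPU_ARCH_LIST"] = "arch_a"
first_variant = xpu_arch_list()

os.environ["TORCH_XPU_ARCH_LIST"] = "arch_a,arch_b"
second_variant = xpu_arch_list()
```

Had the value been captured once in a module-level constant, the second assignment would have had no effect.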

…it NCHW convolutions (pytorch#153101)" This reverts commit ced90d2. Reverted pytorch#153101 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages on main, tentative revert: https://github.com/pytorch/pytorch/actions/runs/15024667248/job/42224521705 ([comment](pytorch#153101 (comment)))

…hapes from (pytorch#153297) tests This config option is not set anywhere, and does nothing, so this should cause no changes to tests. Pull Request resolved: pytorch#153297 Approved by: https://github.com/Skylion007

Fixes pyro-ppl/pyro#3419, which is actually a `torch` bug that can be replicated by the code below:
```
from torch import rand
from torch.distributions import MixtureSameFamily, Categorical, Binomial

max_count = 20
probs = rand(10, 5)
binom_probs = rand(10, 5)

d = MixtureSameFamily(Categorical(probs=probs), Binomial(max_count, binom_probs))
d.log_prob(d.sample())
```
which results in:
```
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    d.log_prob(d.sample())
  File "pytorch\torch\distributions\mixture_same_family.py", line 168, in log_prob
    self._validate_sample(x)
  File "pytorch\torch\distributions\distribution.py", line 315, in _validate_sample
    valid = support.check(value)
            ^^^^^^^^^^^^^^^^^^^^
  File "pytorch\torch\distributions\constraints.py", line 307, in check
    (value % 1 == 0) & (self.lower_bound <= value) & (value <= self.upper_bound)
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The size of tensor a (10) must match the size of tensor b (5) at non-singleton dimension 1
```
### Fix explanation (only for cases when the component distribution contains parameters with batch dimensions)
- The failure is due to sample validation taking place before padding in `MixtureSameFamily.log_prob`, and hence the fix is to pad before doing sample validation.
- The fix itself does not alter the calculations at all. It only affects the sample validation process.
- The failure does not occur with the component distribution set to the `Normal` distribution, as its validation is not defined elementwise (the validation itself is elementwise).
- I've split the `test_mixture_same_family_log_prob` test into two tests based on the `Normal` and `Binomial` distributions.
- Initially, the `Binomial` version of the test did not fail, but this was due to the component distribution having equal batch dimensions of (5, 5), so I changed it to (10, 5).

### Updated fix explanation (for all cases)
- The previous fix caused a bug in sample shape validation (which is done correctly) due to the padding taking place before the sample validation.
- The updated fix corrects the support to reflect the fact that the support of `MixtureSameFamily` is equal to the support of its component distribution with the first event dimension removed.
- This issue was already anticipated in the [code](https://github.com/pytorch/pytorch/blob/331423e5c24170b218e743b3392acbad4480340d/torch/distributions/mixture_same_family.py#L127).

Pull Request resolved: pytorch#151317
Approved by: https://github.com/albanD, https://github.com/fritzo

@ankurneog

MOTIVATION

This PR includes a minor change to check for the TEST_HPU flag as well before falling back to CPU. Without this flag, some tests were falling back to CPU, causing them to fail. Please refer to this RFC as well: pytorch/rfcs#66

CHANGES

- Add the TEST_HPU flag to some of the conditions checking the environment.
- Use the DEVICE_COUNT variable instead of the torch.accelerator.device_count() API, since the latter is not supported on out-of-tree devices like Intel Gaudi.

@ankurneog, @EikanWang, @cyyever, @guangyey

Pull Request resolved: pytorch#153461
Approved by: https://github.com/EikanWang, https://github.com/cyyever, https://github.com/albanD

) Async compile workers don't respect inductor configs generally that get changed in the middle of execution because they warm up early. StaticCudaLauncher is especially susceptible to this because it affects triton compilation without being part of the inductor meta. So we'll pass it in via extra configs on each worker run. Pull Request resolved: pytorch#153382 Approved by: https://github.com/masnesral, https://github.com/jansel

Summary: Optionally log when setGraphExecutorOptimize is called, so we can get insight into the GraphExecutor behavior. Differential Revision: D74692508 Pull Request resolved: pytorch#153549 Approved by: https://github.com/PaulZhang12, https://github.com/SamGinzburg

Addresses pytorch#151097. This includes the following changes:
- Add XPU support package 2025.1 build and test in CI for both Linux and Windows
- Keep XPU support package 2025.0 build in CI to ensure nothing breaks until the PyTorch 2.8 release
- Upgrade XPU support package from 2025.0 to 2025.1 in CD for both Linux and Windows
- Enable XCCL in the Linux CD wheel and oneMKL integration in both Linux and Windows
- Update XPU runtime PyPI packages of CD wheels
- Remove the deprecated support package version docker image build

Pull Request resolved: pytorch#151899
Approved by: https://github.com/EikanWang, https://github.com/atalman

@drisspg

…ward, test priority bump for `sm90`, `sm100` (pytorch#149282)" This reverts commit 9386701. Reverted pytorch#149282 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, see [D74729259](https://www.internalfb.com/diff/D74729259). @drisspg may you help out the author have their PR merged? ([comment](pytorch#149282 (comment)))

Adds ruff YTT checks to help future proof version checks and follow best practices here. Also makes it easier for static linters like mypy to detect python version branching. Pull Request resolved: pytorch#153547 Approved by: https://github.com/albanD
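As an illustration of the kind of pattern these checks are meant to catch, the sketch below contrasts string slicing of a version string with a tuple comparison on `sys.version_info`. The sample version string is made up for the example.

```python
import sys

# Fragile: slicing a version string breaks once the minor version has two digits.
sample_version = "3.10.4 (main, hypothetical build)"  # made-up sample string
sliced = sample_version[:3]  # yields "3.1", which looks like Python 3.1

# Robust: compare the structured version tuple instead.
supports_py39_features = sys.version_info >= (3, 9)

# Tuple comparison orders versions numerically, so 3.10 sorts after 3.9.
ordering_is_numeric = (3, 10) >= (3, 9)
```

The tuple form is also what static type checkers understand when they narrow code by Python version.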

…3426) Pull Request resolved: pytorch#153426 Approved by: https://github.com/jansel

Fix typo from pytorch#153386 Pull Request resolved: pytorch#153561 Approved by: https://github.com/albanD

…#152821) This PR adds checks on `gpu_memory_bandwidth` and `gpu_flops` in `get_estimated_runtime`. This will prevent division by zero and other potential incorrect values: https://github.com/pytorch/pytorch/blob/9210a98b9203c5ff42f39241304a8e38435110b8/torch/_inductor/scheduler.py#L864-L865 https://github.com/pytorch/pytorch/blob/9210a98b9203c5ff42f39241304a8e38435110b8/torch/_inductor/scheduler.py#L874 Pull Request resolved: pytorch#152821 Approved by: https://github.com/eellison, https://github.com/jansel
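A hedged sketch of the kind of guard described above. The function name, the roofline-style `max(compute, memory)` formula, the units, and the choice to return `0.0` for unusable hardware numbers are all assumptions made for illustration; the scheduler code linked above is the authoritative version.

```python
def estimated_runtime_ms(
    flops: float,
    bytes_accessed: float,
    gpu_flops: float,
    gpu_memory_bandwidth: float,
) -> float:
    """Estimate runtime in milliseconds, guarding against bad hardware numbers.

    Assumed units: gpu_flops in FLOP/s, gpu_memory_bandwidth in bytes/s.
    """
    # Guard: a zero or negative value would divide by zero or give nonsense.
    if gpu_flops <= 0 or gpu_memory_bandwidth <= 0:
        return 0.0
    compute_seconds = flops / gpu_flops
    memory_seconds = bytes_accessed / gpu_memory_bandwidth
    # Take the slower of the two bounds and convert seconds to milliseconds.
    return max(compute_seconds, memory_seconds) * 1e3


guarded = estimated_runtime_ms(1e9, 1e6, 0.0, 1e9)
normal = estimated_runtime_ms(1e12, 1e9, 1e12, 1e9)
```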

…ation (pytorch#153417) Shameful admission: I have encountered this error 1-2 times, but don't have a repro. torch/_inductor/select_algorithm.py", line 2022, in wait_on_futures elapsed_times[future], ~~~~~~~~~~~~~^^^^^^^^ torch._inductor.exc.InductorError: KeyError: <Future at 0x7fc4e394fb90 state=finished returned tuple> Pull Request resolved: pytorch#153417 Approved by: https://github.com/Skylion007, https://github.com/ColinPeppler

Untyped variables become ClassVar in dataclasses; this type alias should just be a type alias, with no need for it to be a classvar. Pull Request resolved: pytorch#153540 Approved by: https://github.com/albanD, https://github.com/aorenste
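The behaviour referred to above can be seen with the standard `dataclasses` module: a name assigned in the class body without an annotation is not a dataclass field, it is an ordinary class-level attribute. The class and alias names below are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Dict

# Defining the alias at module level keeps it a plain type alias.
CountMap = Dict[str, int]


@dataclass
class WithClassLevelAlias:
    # No annotation: this is a class attribute, not a dataclass field.
    InnerAlias = Dict[str, int]
    count: int = 0


@dataclass
class WithModuleLevelAlias:
    counts: CountMap
    count: int = 0


class_level_fields = [f.name for f in fields(WithClassLevelAlias)]
module_level_fields = [f.name for f in fields(WithModuleLevelAlias)]
```

Only annotated names show up in `fields(...)`, so moving the alias out of the class body changes nothing about the generated `__init__`.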

…os (pytorch#151727) Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#151727 Approved by: https://github.com/seemethere Co-authored-by: Eli Uriegas <[email protected]>

…0+ (pytorch#150536) Changing the bool to int to express split_k_mode. Before 0.7.0 there were only two `cusparseLtSplitKMode_t` enum values, ONE_KERNEL and TWO_KERNELS, so a boolean was enough, but since 0.7.0 there are more.

For Blackwell, there has to be a minor change to the parameter split_k_one_kernel (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/sparse/cuda/cuSPARSELtOps.cpp#L103), since new values were introduced to the enum [cusparseLtSplitKMode_t](https://docs.nvidia.com/cuda/cusparselt/types.html#cusparseltsplitkmode-t) and a bool type is not enough for it (it has to be replaced with an integer).

Error we see without the change:
```
RuntimeError: CUDA error: invalid value when calling `cusparseLtMatmulAlgSetAttribute( &handle, &alg_sel, CUSPARSELT_MATMUL_SPLIT_K_MODE, &splitKMode, sizeof(splitKMode))`

To execute this test, run the following from the base repo dir:
    python test/test_sparse_semi_structured.py TestSparseSemiStructuredCUSPARSELTCUDA.test_csrc_cslt_sparse_mm_search_cuda_int8
```
Pull Request resolved: pytorch#150536
Approved by: https://github.com/jcaip, https://github.com/atalman

Before it's possible to enable newer CMake. Pull Request resolved: pytorch#153380 Approved by: https://github.com/albanD

…ler (pytorch#149942) Fixes pytorch#119168 ## Test Result ![image](https://github.com/user-attachments/assets/edb8124c-f103-475a-b903-20fbc71fdea6) Pull Request resolved: pytorch#149942 Approved by: https://github.com/janeyx99 Co-authored-by: Jane (Yuan) Xu <[email protected]>

…orch#153405) Pull Request resolved: pytorch#153405 Approved by: https://github.com/mlazos

…ertion (pytorch#152353) Fixes pytorch#151930

This PR updates the `assert_size_stride` and `assert_alignment` functions in [guards.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/dynamo/guards.cpp) to accept an optional `op_name` argument and include it in the error messages. The corresponding type stubs in [guards.pyi](https://github.com/pytorch/pytorch/blob/main/torch/_C/_dynamo/guards.pyi) are updated to match the new function argument.

The code in [inductor/ir.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py) now extracts the operator name from the FX graph and passes it into the `codegen_size_asserts` and `codegen_alignment_asserts` functions, so that generated assertions in Triton code include the op name for better debugging.

Added unit tests inside [test_torchinductor.py](https://github.com/pytorch/pytorch/blob/main/test/inductor/test_torchinductor.py).
- Verified that both successful and failing assertion cases include the operator name.
- Verified that generated Triton code contains the op name inside the asserts.

Pull Request resolved: pytorch#152353
Approved by: https://github.com/jansel
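A pure-Python sketch of the idea of threading an optional operator name into an assertion message. `check_size_stride` is a hypothetical stand-in written for this illustration; it is not the signature or message format of the real `assert_size_stride` in guards.cpp.

```python
from typing import Optional, Sequence


def check_size_stride(
    actual_size: Sequence[int],
    actual_stride: Sequence[int],
    expected_size: Sequence[int],
    expected_stride: Sequence[int],
    op_name: Optional[str] = None,
) -> None:
    """Raise AssertionError on mismatch, naming the op when one is given."""
    if tuple(actual_size) == tuple(expected_size) and tuple(actual_stride) == tuple(
        expected_stride
    ):
        return
    message = (
        f"expected size {tuple(expected_size)} / stride {tuple(expected_stride)}, "
        f"got size {tuple(actual_size)} / stride {tuple(actual_stride)}"
    )
    if op_name is not None:
        # The optional op name tells the reader which operator produced the tensor.
        message += f" (op: {op_name})"
    raise AssertionError(message)


try:
    check_size_stride((2, 3), (3, 1), (2, 3), (1, 2), op_name="torch.ops.aten.addmm.default")
    failure_message = ""
except AssertionError as exc:
    failure_message = str(exc)
```

Keeping `op_name` optional means existing callers that pass only sizes and strides continue to work unchanged.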

Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#153550 Approved by: https://github.com/suo Co-authored-by: Natalia Gimelshein <[email protected]>

@davidberard98

…o avoid perf degradation (pytorch#153357) Fixes pytorch#147336

## Context

NCU analysis of the fp8 flex attention perf issue in pytorch#147336 showed an unexpected increase in shared memory access bank conflicts when loading the V tensor from HBM to SRAM.

After this was brought to the attention of Triton developer @davidberard98, he identified the memory layout of the tensor in HBM as the cause of non-pipelined loads into SRAM, and hence of the slowdown.

To summarize:

In flex attention, when performing the FP8 GEMM `softmax_scores @ V`, the right operand V must be in column-major memory layout. However, the `tl.load` of V blocks from HBM to SRAM cannot be pipelined if the V tensor isn't already column-major in HBM, leading to substantial performance degradation.

This is because Triton does not perform async copies with the `cp.async` PTX instruction if the number of contiguous bytes is less than 4 (see [here](https://github.com/triton-lang/triton/blob/81f93f2c8ec7d20a1f8184def767edeaebeb6812/lib/Dialect/TritonGPU/Transforms/Pipeliner/PipeliningUtility.cpp#L403)).

That is, when loading 4 bytes of contiguous data from a tensor stored in row-major layout in HBM, we have to perform 4 separate non-contiguous writes to SRAM to place those bytes in their new location in the column-major layout in SRAM. Thus the load is not a candidate for pipelining with `cp.async`; it just moves data to registers and then performs a series of single-byte stores.

## Fix summary
- To fix this, we should enforce memory layouts for Q, K, V in FlexAttention when fp8 is being used, to ensure they each exist in HBM in the memory layout needed to facilitate pipelined loads into SRAM ahead of the FP8 GEMMs.

## Benchmarks

Rerunning the repro, we see fp8 runtime is reduced from 120% of bf16 to 76% of bf16 runtime.

Before fix:
```
(flex) [[email protected] ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 19:07:33,402 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 19:07:35,885 - flex_bench - INFO - bf16: 424.87228804347734 us
2025-05-11 19:07:35,893 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 19:07:37,319 - flex_bench - INFO - fp8e4m3: 515.714000000001 us
```
After fix:
```
(flex) [[email protected] ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 17:34:38,223 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 17:34:41,157 - flex_bench - INFO - bf16: 423.4662032967036 us
2025-05-11 17:34:41,167 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 17:34:42,917 - flex_bench - INFO - fp8e4m3: 326.3694803493453 us
```
Pull Request resolved: pytorch#153357
Approved by: https://github.com/ngimel, https://github.com/davidberard98

…ath` (pytorch#150726) This PR allows `FileManager` to accept `pathlib.Path` as arguments while keeping the original `str` path support. This allows us to simplify code by replacing:

1. `os.path.join(..., ...)` with the `/` operator (`Path.__truediv__`).
   https://github.com/pytorch/pytorch/blob/95a5958db490608cacca75b89d9a1d2e955b60e8/torchgen/utils.py#L155
   https://github.com/pytorch/pytorch/blob/95a5958db490608cacca75b89d9a1d2e955b60e8/torchgen/utils.py#L176
2. `os.path.basename(...)` with `Path(...).name`.
   https://github.com/pytorch/pytorch/blob/95a5958db490608cacca75b89d9a1d2e955b60e8/torchgen/utils.py#L161
3. Manual file extension splitting with `Path(...).with_stem(new_stem)`.
   https://github.com/pytorch/pytorch/blob/95a5958db490608cacca75b89d9a1d2e955b60e8/torchgen/utils.py#L241-L256

------

Pull Request resolved: pytorch#150726
Approved by: https://github.com/aorenste
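The three simplifications listed above map onto standard `pathlib` operations. The sketch below uses `PurePosixPath` so the results do not depend on the host OS; the directory and file names are made up for the example, and `with_stem` requires Python 3.9 or newer.

```python
import os.path
from pathlib import PurePosixPath

install_dir = PurePosixPath("build/aten")  # hypothetical output directory

# 1. Joining: the `/` operator replaces os.path.join.
joined = install_dir / "Functions.h"
joined_legacy = os.path.join("build/aten", "Functions.h")

# 2. Base name: the `.name` attribute replaces os.path.basename.
base_name = joined.name

# 3. Renaming while keeping the extension: with_stem replaces manual splitting.
renamed = joined.with_stem("FunctionsEverything")
```

Accepting both `str` and `Path` at the boundary, and converting once with `Path(...)`, lets callers migrate gradually.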

Which inherits from `RuntimeError` and contains `error_code`, which in case of CUDA should contain error returned by `cudaGetLastError`

`torch::detail::_new_accelerator_error_object(c10::AcceleratorError&)` follows the pattern of CPython's [`PyErr_SetString`](https://github.com/python/cpython/blob/cb8a72b301f47e76d93a7fe5b259e9a5758792e1/Python/errors.c#L282), namely
- Convert cstr into Python string with `PyUnicode_FromString`
- Create new exception object using `PyObject_CallOneArg` just like it's done in [`_PyErr_CreateException`](https://github.com/python/cpython/blob/cb8a72b301f47e76d93a7fe5b259e9a5758792e1/Python/errors.c#L32)
- Set `error_code` property using `PyObject_SetAttrString`
- decref all temporary references

Test that it works and captures CPP backtrace (in addition to CI) by running
```python
import os
os.environ['TORCH_SHOW_CPP_STACKTRACES'] = '1'

import torch

x = torch.rand(10, device="cuda")
y = torch.arange(20, device="cuda")
try:
    x[y] = 2
    print(x)
except torch.AcceleratorError as e:
    print("Exception was raised", e.args[0])
    print("Captured error code is ", e.error_code)
```

which produces following output
```
Exception was raised CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /home/ubuntu/pytorch/c10/cuda/CUDAException.cpp:41 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) [clone .cold] from CUDAException.cpp:0
#7 void at::native::gpu_kernel_impl<at::native::AbsFunctor<float> >(at::TensorIteratorBase&, at::native::AbsFunctor<float> const&) [clone .isra.0] from tmpxft_000191fc_00000000-6_AbsKernel.cudafe1.cpp:0
#8 at::native::abs_kernel_cuda(at::TensorIteratorBase&) from ??:0
#9 at::Tensor& at::native::unary_op_impl_with_complex_to_float_out<at::native::abs_stub_DECLARE_DISPATCH_type>(at::Tensor&, at::Tensor const&, at::native::abs_stub_DECLARE_DISPATCH_type&, bool) [clone .constprop.0] from UnaryOps.cpp:0
#10 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA_out_abs_out(at::Tensor const&, at::Tensor&) from RegisterCUDA_0.cpp:0
#11 at::_ops::abs_out::call(at::Tensor const&, at::Tensor&) from ??:0
#12 at::native::abs(at::Tensor const&) from ??:0
#13 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__abs>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeExplicitAutograd_0.cpp:0
#14 at::_ops::abs::redispatch(c10::DispatchKeySet, at::Tensor const&) from ??:0
#15 torch::autograd::VariableType::(anonymous namespace)::abs(c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#16 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::abs>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#17 at::_ops::abs::call(at::Tensor const&) from ??:0
#18 at::native::isfinite(at::Tensor const&) from ??:0
#19 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__isfinite>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeImplicitAutograd_0.cpp:0
#20 at::_ops::isfinite::call(at::Tensor const&) from ??:0
#21 torch::autograd::THPVariable_isfinite(_object*, _object*, _object*) from python_torch_functions_2.cpp:0
#22 PyObject_CallFunctionObjArgs from ??:0
#23 _PyObject_MakeTpCall from ??:0
#24 _PyEval_EvalFrameDefault from ??:0
#25 _PyObject_FastCallDictTstate from ??:0
#26 _PyStack_AsDict from ??:0
#27 _PyObject_MakeTpCall from ??:0
#28 _PyEval_EvalFrameDefault from ??:0
#29 _PyFunction_Vectorcall from ??:0
#30 _PyEval_EvalFrameDefault from ??:0
#31 _PyFunction_Vectorcall from ??:0
#32 _PyEval_EvalFrameDefault from ??:0
#33 _PyFunction_Vectorcall from ??:0
#34 _PyEval_EvalFrameDefault from ??:0
#35 PyFrame_GetCode from ??:0
#36 PyNumber_Xor from ??:0
#37 PyObject_Str from ??:0
#38 PyFile_WriteObject from ??:0
#39 _PyWideStringList_AsList from ??:0
#40 _PyDict_NewPresized from ??:0
#41 _PyEval_EvalFrameDefault from ??:0
#42 PyEval_EvalCode from ??:0
#43 PyEval_EvalCode from ??:0
#44 PyUnicode_Tailmatch from ??:0
#45 PyInit__collections from ??:0
#46 PyUnicode_Tailmatch from ??:0
#47 _PyRun_SimpleFileObject from ??:0
#48 _PyRun_AnyFileObject from ??:0
#49 Py_RunMain from ??:0
#50 Py_BytesMain from ??:0
#51 __libc_init_first from ??:0
#52 __libc_start_main from ??:0
#53 _start from ??:0

Captured error code is 710
```
Pull Request resolved: pytorch#152023
Approved by: https://github.com/eqy, https://github.com/mradmila, https://github.com/ngimel
ghstack dependencies: pytorch#154436

…torch#156600) Don't call `sum()` on a tensor that is default constructed. Previously we could call `sum()` on a tensor that was default-constructed. That would lead to an error like this: ``` Traceback (most recent call last): File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 58, in testPartExecutor yield File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 634, in run self._callTestMethod(testMethod) File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 589, in _callTestMethod if method() is not None: ^^^^^^^^ File "/home/ahmads/personal/pytorch/torch/testing/_internal/common_utils.py", line 3191, in wrapper method(*args, **kwargs) File "/home/ahmads/personal/pytorch/test/test_nn.py", line 7235, in test_layer_norm_backwards_eps ln_out_cuda.backward(grad_output_cuda) File "/home/ahmads/personal/pytorch/torch/_tensor.py", line 647, in backward torch.autograd.backward( File "/home/ahmads/personal/pytorch/torch/autograd/__init__.py", line 354, in backward _engine_run_backward( File "/home/ahmads/personal/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: tensor does not have a device Exception raised from device_default at /home/ahmads/personal/pytorch/c10/core/TensorImpl.h:1265 (most recent call first): C++ CapturedTraceback: #4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> 
>) from ??:0 #6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0 #7 at::TensorBase::options() const from :0 #8 at::meta::resize_reduction(at::impl::MetaBase&, at::Tensor const&, c10::OptionalArrayRef<long>, bool, c10::ScalarType, bool) from :0 #9 at::meta::structured_sum_dim_IntList::meta(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0 #10 at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0 #11 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>), &at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType> > >, at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0 #12 at::_ops::sum_dim_IntList::call(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0 #13 void at::native::(anonymous namespace)::LaunchGammaBetaBackwardCUDAKernel<float, float>(float const*, float const*, float const*, float const*, long, long, at::Tensor*, at::Tensor*, CUstream_st*) from ??:0 #14 void at::native::(anonymous namespace)::LayerNormBackwardKernelImplInternal<float>(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0 #15 
at::native::(anonymous namespace)::LayerNormBackwardKernelImpl(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0 #16 at::native::layer_norm_backward_cuda(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from ??:0 #17 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__native_layer_norm_backward(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from RegisterCUDA_0.cpp:0 ``` Now we only call `sum(0)` on tensors that are defined and properly guard the `sum(0)` and assignment. Pull Request resolved: pytorch#156600 Approved by: https://github.com/eqy, https://github.com/ngimel
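The guard described above can be illustrated with a minimal sketch in plain Python. It is not the CUDA code from the PR: nested lists stand in for tensors, `None` stands in for a default-constructed (undefined) tensor, and the names `column_sum` and `finalize_grad` are hypothetical.

```python
from typing import List, Optional

Matrix = List[List[float]]


def column_sum(partials: Matrix) -> List[float]:
    """Reduce over dim 0, a plain-Python analogue of calling sum(0)."""
    return [sum(col) for col in zip(*partials)]


def finalize_grad(partials: Optional[Matrix]) -> Optional[List[float]]:
    # None models a default-constructed tensor: there is nothing to
    # reduce, so skip both the reduction and the assignment.
    if partials is None:
        return None
    return column_sum(partials)


# One gradient has partial results to reduce, the other was never allocated.
dgamma = finalize_grad([[1.0, 2.0], [3.0, 4.0]])
dbeta = finalize_grad(None)
```

The point of the guard is that the reduction is only attempted when there is something to reduce; the undefined case passes through untouched instead of raising.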

For a tensor with a non-zero storage offset, the offset must be multiplied by the element size. Add a regression test that creates a Tensor in an array of 6 elements with offset 3, which before the fix crashed with ``` C++ exception with description "setStorage: sizes [3, 3], strides [0, 1], storage offset 3, and itemsize 4 requiring a storage size of 24 are out of bounds for storage of size 15 Exception raised from checkInBoundsForStorage at /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/Resize.h:123 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 56 (0x104a9cd44 in libc10.dylib) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) + 120 (0x104a9a05c in libc10.dylib) frame #2: void at::native::checkInBoundsForStorage<long long>(c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long, caffe2::TypeMeta const&, c10::Storage const&) + 656 (0x111dbd314 in libtorch_cpu.dylib) frame #3: void at::native::setStrided<long long>(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long) + 152 (0x111dcd22c in libtorch_cpu.dylib) frame #4: at::native::as_strided_tensorimpl(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) + 312 (0x111dccf98 in libtorch_cpu.dylib) frame #5: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CPU__as_strided(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, 
c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>>>, at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 104 (0x1129a1e94 in libtorch_cpu.dylib) frame #6: at::_ops::as_strided::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 476 (0x112200ad0 in libtorch_cpu.dylib) frame #7: at::Tensor::as_strided(c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) const + 236 (0x1115db098 in libtorch_cpu.dylib) frame #8: at::native::expand(at::Tensor const&, c10::ArrayRef<long long>, bool) + 348 (0x111dcc0d4 in libtorch_cpu.dylib) frame #9: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::ADInplaceOrView::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 116 (0x1157ac410 in libtorch_cpu.dylib) frame #10: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::autograd::VariableType::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, 
c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 992 (0x114e8b010 in libtorch_cpu.dylib) frame #11: at::_ops::expand::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 316 (0x112743c90 in libtorch_cpu.dylib) frame #12: at::expand_size(at::Tensor const&, c10::ArrayRef<long long>) + 164 (0x1047d82b4 in basic) frame #13: BasicTest_TestForBlobResizeCPU_Test::TestBody() + 284 (0x1047d8048 in basic) ``` Pull Request resolved: pytorch#158690 Approved by: https://github.com/angelayi
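The two byte counts in the error message can be reproduced with a short worked calculation. This is a sketch under stated assumptions, not the patched code: `required_storage_nbytes` is a hypothetical helper that mirrors the usual strided-layout bound (highest reachable element index plus one, times the item size), and the "buggy" and "fixed" expressions are inferred from the numbers 15 and 24 in the message.

```python
from typing import Sequence


def required_storage_nbytes(
    sizes: Sequence[int],
    strides: Sequence[int],
    itemsize: int,
    storage_offset: int = 0,
) -> int:
    """Bytes a strided view needs: (highest reachable index + 1) * itemsize."""
    if any(n == 0 for n in sizes):
        return 0  # an empty view touches no storage
    last_index = storage_offset + sum(
        (n - 1) * stride for n, stride in zip(sizes, strides)
    )
    return (last_index + 1) * itemsize


# Values taken from the error message above.
needed = required_storage_nbytes([3, 3], [0, 1], itemsize=4, storage_offset=3)

# Assumption: the tensor holds 3 float elements starting at offset 3.
numel, offset, itemsize = 3, 3, 4
buggy_nbytes = numel * itemsize + offset    # offset counted in elements, not bytes
fixed_nbytes = (numel + offset) * itemsize  # offset scaled by the element size
```

Under these assumptions `needed` and `fixed_nbytes` are both 24 and `buggy_nbytes` is 15, which matches the "requiring a storage size of 24 ... storage of size 15" wording of the crash.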

Chao1Han force-pushed the xccl-cmake branch from 093de39 to 2a91dd8 Compare February 21, 2025 07:38

pytorchmergebot force-pushed the xccl-cmake branch from 2a91dd8 to dacc989 Compare March 5, 2025 06:36

pytorchmergebot force-pushed the xccl-cmake branch from dacc989 to 0fac2c1 Compare April 10, 2025 02:36

pytorchmergebot force-pushed the xccl-cmake branch from 0fac2c1 to b6340bb Compare May 7, 2025 01:56

jimone1 and others added 26 commits May 8, 2025 19:59

Use gcc13 in Manylinux 2.28 images (pytorch#152825)

6ae7730

Related to: pytorch#152426 Pull Request resolved: pytorch#152825 Approved by: https://github.com/malfet

[dynamic shapes] guard_or_false for infer_size (pytorch#152146)

4166373

Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#152146 Approved by: https://github.com/laithsakka

[c10d] Test multiple CUDA Graph captures (pytorch#150040)

4064062

1. Do multiple captures
2. Perform multiple collectives in one capture
3. Multiple replays (existing)

Pull Request resolved: pytorch#150040 Approved by: https://github.com/fduwjj

[c10d] Reduce test verbosity (pytorch#153116)

5dd746b

We have been seeing a lot of `Starting event listener thread for rank` messages in test print-outs recently. Move them to `logger.debug`. Pull Request resolved: pytorch#153116 Approved by: https://github.com/fduwjj
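The effect of demoting a message to `logger.debug` can be shown with the standard `logging` module. This is only an illustrative sketch: the logger name `c10d_test_sketch` and the function `start_listener` are hypothetical, and the actual test harness may configure logging differently.

```python
import io
import logging

# Hypothetical logger; the INFO threshold models a typical test run.
logger = logging.getLogger("c10d_test_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output out of the root logger

stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))


def start_listener(rank: int) -> None:
    # Logged at DEBUG, so it is dropped unless the level is lowered.
    logger.debug("Starting event listener thread for rank %d", rank)


for rank in range(4):
    start_listener(rank)
logger.info("test finished")

output = stream.getvalue()
```

With the level at INFO, `output` contains only the final message; setting the level to `logging.DEBUG` would bring the per-rank lines back when they are needed for debugging.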

[ROCm] Maxpool forward NHWC Perf Improvement targeting Resnet scenarios (pytorch#151727)

5683965

Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#151727 Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/eqy

Fix the basic description of torch.min(), torch.max(), torch.all(), torch.any() (pytorch#152658)

7cb5c75

Fixes pytorch#152176 Pull Request resolved: pytorch#152658 Approved by: https://github.com/malfet

[dynamo][pr_time_benchmark] Add dynamo benchmark to stress test inlining (pytorch#153159)

ab829ec

Pull Request resolved: pytorch#153159 Approved by: https://github.com/laithsakka ghstack dependencies: pytorch#152883, pytorch#153105

[CI] Add opt-in h100 tests (pytorch#153170)

f87a0fe

So far only run:
- inductor/test_fp8.py
- test_matmul_cuda.py
- inductor/test_max_autotune.py

Pull Request resolved: pytorch#153170 Approved by: https://github.com/drisspg

Mini tutorial for provenance tracking (pytorch#152211)

faff387

as title Pull Request resolved: pytorch#152211 Approved by: https://github.com/svekars, https://github.com/eellison, https://github.com/desertfire

Revert "[CI] Add opt-in h100 tests (pytorch#153170)"

3419630

This reverts commit f87a0fe. Reverted pytorch#153170 on behalf of https://github.com/clee2000 due to workflow doesn't have right concurrency group? ([comment](pytorch#153170 (comment)))

[Cutlass] Fix tests (pytorch#153196)

c54aa0d

Pull Request resolved: pytorch#153196 Approved by: https://github.com/BoyuanFeng

[Cutlass] Changes to gemm template for EVT (pytorch#150907)

a3154ca

Pull Request resolved: pytorch#150907 Approved by: https://github.com/henrylhtsang, https://github.com/eellison ghstack dependencies: pytorch#153196

[Cutlass] Implement memory planning for EVT (pytorch#153177)

9fa0734

Pull Request resolved: pytorch#153177 Approved by: https://github.com/henrylhtsang ghstack dependencies: pytorch#153196, pytorch#150907

pytorchmergebot and others added 23 commits May 14, 2025 18:52

[dynamo][compile-time] Compute logging related flags once (pytorch#153426)

03d0186

Pull Request resolved: pytorch#153426 Approved by: https://github.com/jansel

Fix typo (pytorch#153561)

f887bff

Fix typo from pytorch#153386 Pull Request resolved: pytorch#153561 Approved by: https://github.com/albanD

[ROCm] Maxpool forward NHWC Perf Improvement targeting Resnet scenarios (pytorch#151727)

72fee13

Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#151727 Approved by: https://github.com/seemethere Co-authored-by: Eli Uriegas <[email protected]>

Update CMake to 3.27 in Windows CI (pytorch#153380)

781ba0a

Before it's possible to enable newer CMake. Pull Request resolved: pytorch#153380 Approved by: https://github.com/albanD

[cutlass backend] Add back descriptive names for epilogue fusion (pytorch#153405)

f5e0806

Pull Request resolved: pytorch#153405 Approved by: https://github.com/mlazos

don't run triton mm for k<32 (pytorch#153550)

eaf2dee

Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#153550 Approved by: https://github.com/suo Co-authored-by: Natalia Gimelshein <[email protected]>

pytorchmergebot force-pushed the xccl-cmake branch from b6340bb to ff07e01 Compare May 15, 2025 03:00

Define USE_C10D_XCCL and USE_XCCL in pytorch

373f4df

pytorchmergebot force-pushed the xccl-cmake branch from ff07e01 to 373f4df Compare May 15, 2025 03:02

Chao1Han closed this Jun 24, 2025

110 participants
