Merge OpenAI Triton commit 9a49104
#3316
Merged
Conversation
The `strip` method treats its argument as a set of characters and removes matching characters from both ends of the string; it does not remove a matching substring. For example, stripping `" (inc)"` from `"num_samples (inc)"` yields `um_samples`, because the leading character `n` is also part of the character set. Instead of using `strip`, the `replace` method can be used. However, in this case, we can simply output the original metric name (i.e., "num_samples (inc)"), since it’s more useful to know whether the metric is inclusive or exclusive.
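A minimal standalone sketch of the pitfall and the alternatives (plain Python, not the actual profiler code):

```python
metric = "num_samples (inc)"

# strip() treats its argument as a *set* of characters and trims them
# from both ends, so the leading 'n' of "num_samples" is removed too.
print(metric.strip(" (inc)"))        # "um_samples"

# replace() removes the exact substring instead.
print(metric.replace(" (inc)", ""))  # "num_samples"

# The change described here simply keeps the original name, which also
# preserves the inclusive/exclusive marker.
print(metric)                        # "num_samples (inc)"
```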
This PR allows a call to a JITFunction to pass another JITFunction as an argument. For example:

```python
@triton.jit
def fn_a(x):
    ...

@triton.jit
def fn_b(x, fn):
    ...

@triton.jit
def fn_c(x):
    return fn_b(x, fn_a)  # fn_a (a JITFunction) is passed as an argument to fn_b (another JITFunction)
```

Prior to #5220, this worked. After #5220, the user needs to annotate the JITFunctions with @triton.constexpr manually (until this PR). Use case: Inductor has some generic helper functions for implementing scans (e.g. exclusive_scan_decoupled_lookback) which take a `combine_fn` to implement the combination function (similar to tl.reduce). These helper functions have stopped working after #5220. https://github.com/pytorch/pytorch/blob/01a4d86b31365cfb484dc17885c9a7ee09c235ab/torch/_inductor/runtime/triton_helpers.py#L321
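For reference, the post-#5220 workaround presumably looks something like the following, with the callback parameter annotated as a constexpr (a hedged sketch; function names and bodies are illustrative, not from the PR):

```python
import triton
import triton.language as tl

@triton.jit
def fn_a(x):
    return x + 1

@triton.jit
def fn_b(x, fn: tl.constexpr):  # explicit constexpr annotation required after #5220
    return fn(x)

@triton.jit
def fn_c(x):
    return fn_b(x, fn_a)
```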
Disabling lineinfo in ptxas can be somewhat tricky. If the input PTX file contains lineinfo, the generated CUBIN will include lineinfo by default, regardless of whether the `-lineinfo` option is used. To disable lineinfo in the generated CUBINs, the `-suppress-debug-info` option must be used in conjunction with `-lineinfo`. A [test.txt](https://github.com/user-attachments/files/18550903/test.txt) file is attached; please rename it to `test.ptx`.

```
ptxas -lineinfo --gpu-name=sm_86 -o ./test.cubin ./test.ptx
nvdisasm -g ./test.cubin  # you will still see the lineinfo
```

```
ptxas -suppress-debug-info -lineinfo --gpu-name=sm_86 -o ./test.cubin ./test.ptx
nvdisasm -g ./test.cubin  # lineinfo is gone
```
This way, users can directly open the file in IDEs like VS Code and jump to the corresponding line by holding the `ctrl` key and clicking the line in the terminal. Also, this PR emits an error instead of using workarounds for CUPTI compatibility problems, and adds more instructions for using PC sampling to the tutorial.
…er (#5717) Fixes #5691 and unblocks #5684
The `filter_traceback` call was commented out during the tuple PR. This just restores it and adds a check in the relevant tests.
Taking over triton-lang/triton#4914 due to inactivity. As discussed there, when there are multiple "contiguity of 1" entries in the `contiguity` array, doing an argsort on it means that the resulting `order` becomes ascending for those elements. In the unit test, `order = [2, 1, 0]` becomes `[0, 1, 2]`, which is odd. This convention seems arbitrary, so it is better to pick the row-major ordering by default in such cases, to be consistent with the rest of the code. The current convention is "correct", but we get an additional `convert_layout`. Moreover, this order is inherited by the SMEM allocated during SWP, which could be problematic for other ops. For example, in my case I was getting the order `[4, 0, 1, 2, 3]` in SMEM for 5D blocked scales because only the innermost axis had a contiguity of 4 while the rest were 1. @ThomasRaoux @pawelszczerbuk @Jokeren @rawnhenry

---------

Co-authored-by: Rawn Henry <[email protected]>
Co-authored-by: Masahiro Masuda <[email protected]>
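A quick numpy illustration of the tie-breaking behavior described above (standalone sketch, not the actual compiler code):

```python
import numpy as np

# 5D case like the one described: only the innermost axis has contiguity > 1.
contiguity = np.array([1, 1, 1, 1, 4])

# A plain descending argsort breaks the ties among the "contiguity of 1"
# axes in ascending index order, producing [4, 0, 1, 2, 3].
order = list(np.argsort(-contiguity, kind="stable"))
print(order)  # [4, 0, 1, 2, 3]

# Picking the row-major ordering for the tied axes instead gives
# [4, 3, 2, 1, 0], consistent with the default convention elsewhere.
```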
The custom fma codegen for Ampere has been upstreamed to NVPTX, so we no longer need custom conversion code. As a bonus, we now codegen vectorized bf16 ops for free.
For the most part, I expect this will just be a minor cleanup. It is vaguely possible that this might enable some better codegen.
…ds (#5739) It will be more consistent since the CUPTI header and library can use the same version. For roctracer, we don't package the library with triton, so we need to fall back to the system library.
This doesn't break functional backward compatibility, as the new semantics are a subset of what was allowed before, but it would break performance backward compatibility. This makes it less error prone.
Follow-up to #5733, which somehow passed CI and auto-merged even with this bug. TMA should be using `bulk.commit.group` rather than `commit.group`.
Reverts triton-lang/triton#5707. This causes some functional changes that I need to investigate.
…r backends" (#5749) Reverts triton-lang/triton#5739. This causes problems on setups where the wheel is used on a remote machine. @Jokeren is working on a different fix.
The pingpong pass only expects to handle local load ops as A/B. Avoid applying the transform when a different op is detected. Enabling it is the next step.
When reducing a 1D tensor, the order of elements doesn't matter. This allows us to use a more relaxed version of reshape.
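For context, a minimal Triton kernel of the kind this affects, a full reduction over a 1D tensor (illustrative sketch only; the kernel and names are not from the PR):

```python
import triton
import triton.language as tl

@triton.jit
def sum_kernel(x_ptr, out_ptr, BLOCK: tl.constexpr):
    # A full reduction of a 1D tensor: the result does not depend on the
    # order in which elements are combined (up to floating-point rounding),
    # which is what allows the compiler to reshape the tensor in a relaxed,
    # order-agnostic way internally.
    offs = tl.arange(0, BLOCK)
    x = tl.load(x_ptr + offs)
    tl.store(out_ptr, tl.sum(x, axis=0))
```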
Pulls in llvm/llvm-project#123200, which is useful, and also handles triton-lang/triton#5664. Integration changes were required due to llvm/llvm-project#123026, llvm/llvm-project#123321, and llvm/llvm-project#123326. Also closes triton-lang/triton#5685.
…lder Signed-off-by: Whitney Tsang <[email protected]>
pbchekin approved these changes Jan 30, 2025
This reverts commit 0ba4205.
This PR changes the Triton base from ac61cb0 to 9a49104 (Jan 29).
Pass rate: 98.19%
Please do not squash and merge this PR.