AMDGPU: Wrong code for fcanonicalize #82937

Closed
hvdijk opened this issue Feb 25, 2024 · 12 comments · Fixed by #83054

hvdijk (Contributor) commented Feb 25, 2024

Please consider this minimal LLVM IR:

define half @f(half %x) {
  %canonicalized = call half @llvm.canonicalize.f16(half %x)
  ret half %canonicalized
}

Run with llc -mtriple=amdgcn and we get:

f:                                      ; @f
        s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
        s_setpc_b64 s[30:31]

The canonicalize operation has been entirely optimised away.

The reason is that during ISel we get:

  t0: ch,glue = EntryToken
          t2: f32,ch = CopyFromReg # D:1 t0, Register:f32 %0
        t4: f16 = fp_round # D:1 t2, TargetConstant:i64<1>
      t5: f16 = fcanonicalize # D:1 t4
    t6: f32 = fp_extend # D:1 t5
  t8: ch,glue = CopyToReg # D:1 t0, Register:f32 $vgpr0, t6
  t9: ch = RET_GLUE # D:1 t8, Register:f32 $vgpr0, t8:1

Here, fcanonicalize is optimised away because SITargetLowering::isCanonicalized determines that fp_round is guaranteed to return an already-canonicalised result, so no work is needed. But that then leaves us with fp_extend (fp_round x, /*strict=*/1), which is optimised to a no-op.

This prevents another optimisation (#80520) from going in, because that optimisation would make this problem show up in more cases than it currently does, and sadly I struggle to find a good way of ensuring we get correct code for this case without also making codegen for other tests worse.

@llvm/pr-subscribers-backend-amdgpu

hvdijk (Contributor, Author) commented Feb 25, 2024

(Incidentally: would AMDGPU's current handling of this operation make sense to move to target-independent stuff? No other target currently implements it and the same logic should work for all, I think?)


jayfoad (Contributor) commented Feb 26, 2024

fp_extend (fp_round x, /*strict=*/1) which is optimised to a no-op

That sounds like the bug?

arsenm (Contributor) commented Feb 26, 2024

Is this only a problem with the implicit ABI-promoted half-to-float case? Do you see the same issue if you manually write out the casts on a modern subtarget?

(Incidentally: would AMDGPU's current handling of this operation make sense to move to target-independent stuff? No other target currently implements it and the same logic should work for all, I think?)

Partially yes, and partially no. There are some broken-ish edge cases (e.g. we assume values produced by operations that could later be selected to non-canonicalizing instructions are already canonicalized).

hvdijk (Contributor, Author) commented Feb 26, 2024

fp_extend (fp_round x, /*strict=*/1) which is optimised to a no-op

That sounds like the bug?

It's not. This is what the trunc parameter of fp_round (which I mistakenly called strict) is intended for. Quoting https://llvm.org/doxygen/namespacellvm_1_1ISD.html:

X = FP_ROUND(Y, TRUNC) - Rounding 'Y' from a larger floating point type down to the precision of the destination VT.

TRUNC is a flag, which is always an integer that is zero or one. If TRUNC is 0, this is a normal rounding, if it is 1, this FP_ROUND is known to not change the value of Y.

The TRUNC = 1 case is used in cases where we know that the value will not be modified by the node, because Y is not using any of the extra precision of source type. This allows certain transformations like FP_EXTEND(FP_ROUND(X,1)) -> X which are not safe for FP_EXTEND(FP_ROUND(X,0)) because the extra bits aren't removed.

This transformation is not safe for signalling NaNs, but in general, LLVM optimisations are not intended to be signalling-NaN-safe and may leave an SNaN as an SNaN. fcanonicalize is special in that regard: it must quiet them.
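
As a concrete illustration (a minimal example, not from the original report), canonicalize must quiet a signalling NaN, so it can never be folded away for possibly-sNaN inputs:

define half @quiet_snan() {
  ; 0xH7D00 is an f16 signalling NaN (quiet bit clear, payload non-zero);
  ; canonicalize must return the corresponding quiet NaN (e.g. 0xH7E00).
  %q = call half @llvm.canonicalize.f16(half 0xH7D00)
  ret half %q
}

declare half @llvm.canonicalize.f16(half)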

Is this only a problem with the implicit ABI-promoted half-to-float case? Do you see the same issue if you manually write out the casts on a modern subtarget?

It's not limited to half:

define float @f(float %x) {
  %ext = fpext float %x to double
  %canonicalized = call double @llvm.canonicalize.f64(double %ext)
  %trunc = fptrunc double %ext to float
  ret float %trunc
}

shows the same issue, and in this case, it persists even if I try different -mcpu values.

Partially yes, and partially no. There are some broken-ish edge cases (e.g. we assume things that could later be selected to non-canonicalizing operations are canonicalized).

Ah, but for generic target-independent codegen that is fine: it would only lead to redundant canonicalisation, right? That is, it would result in correct but suboptimal codegen. That should not be a problem, provided targets have enough hooks to make sure that operations that already canonicalise on their end aren't recanonicalised.

arsenm (Contributor) commented Feb 26, 2024

Ah, but for generic target-independent codegen that is fine: it would only lead to redundant canonicalisation, right? That is, it would result in correct but suboptimal codegen. That should not be a problem, provided targets have enough hooks to make sure that operations that already canonicalise on their end aren't recanonicalised.

The main issue I've been meaning to fix is that we specially treat the generic minnum/maxnum nodes based on knowledge of the underlying instruction's canonicalization behavior (which changed in gfx9+, IIRC).
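
A minimal sketch of the kind of case meant here (a hypothetical example, not from the thread): whether the canonicalize below is redundant depends on whether the subtarget's min/max instructions already canonicalize their result, which differs between generations:

define float @g(float %x, float %y) {
  ; On subtargets whose min/max instructions canonicalize their result, the
  ; canonicalize call is redundant; on others it must be kept.
  %m = call float @llvm.minnum.f32(float %x, float %y)
  %c = call float @llvm.canonicalize.f32(float %m)
  ret float %c
}

declare float @llvm.minnum.f32(float, float)
declare float @llvm.canonicalize.f32(float)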

hvdijk added a commit to hvdijk/llvm-project that referenced this issue Feb 26, 2024
We were relying on roundings to implicitly canonicalize, which is
generally safe, except with roundings that may be optimized away.

Fixes llvm#82937.
hvdijk (Contributor, Author) commented Feb 26, 2024

I had a thought on how to possibly fix it. It seems to work, so I created a PR for it, but I'm not 100% certain it covers all the cases, so I would welcome a very close look.

hvdijk added a commit to hvdijk/llvm-project that referenced this issue Feb 27, 2024
We were relying on roundings to implicitly canonicalize, which is
generally safe, except with roundings that may be optimized away.

Fixes llvm#82937.
hvdijk added a commit to hvdijk/llvm-project that referenced this issue Feb 27, 2024
We were relying on roundings to implicitly canonicalize, which is
generally safe, except with roundings that may be optimized away.

Fixes llvm#82937.
arsenm (Contributor) commented Feb 27, 2024

It's not limited to half:

I looked at this example, and this one actually breaks in the IR, which I think is a more serious issue: InstSimplify folds it away to just ret %x.

I also think addressing this from the trunc fold doesn't fundamentally address the issue; I think we shouldn't be trying to fold away canonicalizes during combining. We probably should only do this during selection, and even then there may still be risks based on selection patterns. It would be safest to do this during a post-selection machine pass.

hvdijk (Contributor, Author) commented Feb 27, 2024

It's not limited to half:

I looked at this example, and this one actually breaks in the IR, which I think is a more serious issue: InstSimplify folds it away to just ret %x.

Sorry, that's my fault, it was a bad test. It should have been %trunc = fptrunc double %canonicalized to float. I did test with a fixed version but left the bad one in my comment.
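
Spelled out, the corrected test is:

define float @f(float %x) {
  %ext = fpext float %x to double
  %canonicalized = call double @llvm.canonicalize.f64(double %ext)
  %trunc = fptrunc double %canonicalized to float
  ret float %trunc
}

declare double @llvm.canonicalize.f64(double)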

I also think addressing this from the trunc fold doesn't fundamentally address the issue;

I haven't gone over all the other optimisations that run at this time to check if others have the same issue; it's certainly possible there are more.

I think we shouldn't be trying to fold away canonicalizes during combining.

That would also address it. It does mean it cannot be done in a manner that can be shared across architectures, but the current method is also not shared across architectures.

Is this something that you expect will be done in the short term? If not, would a stopgap measure be okay seeing how it unblocks other optimisations?

arsenm (Contributor) commented Feb 27, 2024

Is this something that you expect will be done in the short term? If not, would a stopgap measure be okay seeing how it unblocks other optimisations?

I don't foresee myself having time to work on this any time soon. It will interfere with some combines, but a more holistic solution would be to only fold a "probably droppable" canonicalize to a freeze. Do you see many regressions if you try to do that?
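
At the IR level, that suggestion amounts to something like the following before/after (a rough sketch; the actual fold operates on SelectionDAG nodes and only applies when the operand is already known to be canonical):

; Before: a canonicalize whose operand is already canonical, since fptrunc
; produces a canonicalised result.
define half @before(float %x) {
  %t = fptrunc float %x to half
  %c = call half @llvm.canonicalize.f16(half %t)
  ret half %c
}

; After: fold to a freeze instead of deleting the node outright, so later
; combines (such as fp_extend (fp_round x, 1) -> x) cannot look through it
; and silently drop the canonicalisation point.
define half @after(float %x) {
  %t = fptrunc float %x to half
  %c = freeze half %t
  ret half %c
}

declare half @llvm.canonicalize.f16(half)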

hvdijk (Contributor, Author) commented Feb 27, 2024

That's what I did already. I don't see many regressions. Still more than I would like, so I may try to improve it further, but it might be acceptable as is already.

hvdijk added a commit to hvdijk/llvm-project that referenced this issue Feb 27, 2024
We were relying on roundings to implicitly canonicalize, which is
generally safe, except with roundings that may be optimized away.

Fixes llvm#82937.
hvdijk (Contributor, Author) commented Feb 27, 2024

Surprisingly, a fairly specific but simple extra fold of (fp_to_fp16 (freeze (fp16_to_fp (fp_to_fp16 op)))) -> (fp_to_fp16 (freeze op)), and likewise for bf16, reduces the regressions (as seen in LLVM's testsuite) to zero. I have updated the PR with it; the only codegen changes left in existing tests are bugfixes for cases where we previously dropped an fcanonicalize that we shouldn't have.
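
An IR-level analogue of that pattern, with hypothetical function names (the actual fold matches the fp_to_fp16/fp16_to_fp SelectionDAG nodes):

define half @before_fold(float %op) {
  %t = fptrunc float %op to half   ; fp_to_fp16
  %e = fpext half %t to float      ; fp16_to_fp
  %f = freeze float %e
  %r = fptrunc float %f to half    ; outer fp_to_fp16
  ret half %r
}

; ...which the extra fold rewrites to the equivalent of:
define half @after_fold(float %op) {
  %f = freeze float %op
  %r = fptrunc float %f to half
  ret half %r
}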

hvdijk added a commit to hvdijk/llvm-project that referenced this issue Mar 5, 2024
We were relying on roundings to implicitly canonicalize, which is
generally safe, except that roundings may be optimized away, and more
generally, other operations may be optimized away as well.

To solve this, as suggested by @arsenm, keep fcanonicalize nodes around
for longer. Some tests revealed cases where we no longer figured out
that previous operations already produced canonicalized results but we
could easily change that; this commit includes those changes. Other
tests revealed cases where we no longer figure out that previous
operations already produced canonicalized results but larger changes are
needed to detect that; this commit disables those tests or updates the
expected results.

Fixes llvm#82937.
hvdijk added a commit to hvdijk/llvm-project that referenced this issue Mar 6, 2024
We were relying on roundings to implicitly canonicalize, which is
generally safe, except that roundings may be optimized away, and more
generally, other operations may be optimized away as well.

To solve this, as suggested by @arsenm, keep fcanonicalize nodes around
for longer. Some tests revealed cases where we no longer figured out
that previous operations already produced canonicalized results but we
could easily change that; this commit includes those changes. Other
tests revealed cases where we no longer figure out that previous
operations already produced canonicalized results but larger changes are
needed to detect that; this commit disables those tests or updates the
expected results.

Fixes llvm#82937.
hvdijk added a commit that referenced this issue Mar 13, 2024
We were relying on roundings to implicitly canonicalize, which is
generally safe, except with roundings that may be optimized away.

Fixes #82937.