AMDGPU: Wrong code for fcanonicalize #82937

Closed
hvdijk opened this issue Feb 25, 2024 · 12 comments · Fixed by #83054

hvdijk (Contributor) commented Feb 25, 2024

Please consider this minimal LLVM IR:

define half @f(half %x) {
  %canonicalized = call half @llvm.canonicalize.f16(half %x)
  ret half %canonicalized
}

Run with llc -mtriple=amdgcn and we get:

f:                                      ; @f
        s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
        s_setpc_b64 s[30:31]

The canonicalize operation has been entirely optimised away.

The reason is that during ISel we get:

  t0: ch,glue = EntryToken
          t2: f32,ch = CopyFromReg # D:1 t0, Register:f32 %0
        t4: f16 = fp_round # D:1 t2, TargetConstant:i64<1>
      t5: f16 = fcanonicalize # D:1 t4
    t6: f32 = fp_extend # D:1 t5
  t8: ch,glue = CopyToReg # D:1 t0, Register:f32 $vgpr0, t6
  t9: ch = RET_GLUE # D:1 t8, Register:f32 $vgpr0, t8:1

Here, fcanonicalize is optimised away because SITargetLowering::isCanonicalized determines that fp_round is guaranteed to return an already-canonicalised result, so no work is needed. But that then leaves us with fp_extend (fp_round x, /*strict=*/1), which is optimised to a no-op.

This prevents another optimisation (#80520) from going in, because that optimisation would make this problem show up in more cases than it currently does, and sadly I struggle to find a good way of ensuring we get correct code for this case without also making codegen for other tests worse.

@llvm/pr-subscribers-backend-amdgpu

hvdijk (Contributor, Author) commented Feb 25, 2024

(Incidentally: would AMDGPU's current handling of this operation make sense to move to target-independent stuff? No other target currently implements it and the same logic should work for all, I think?)


jayfoad (Contributor) commented Feb 26, 2024

fp_extend (fp_round x, /*strict=*/1) which is optimised to a no-op

That sounds like the bug?

arsenm (Contributor) commented Feb 26, 2024

Is this only a problem with the implicit ABI-promoted half-to-float case? Do you see the same issue if you manually write out the casts on a modern subtarget?

(Incidentally: would AMDGPU's current handling of this operation make sense to move to target-independent stuff? No other target currently implements it and the same logic should work for all, I think?)

Partially yes, and partially no. There are some broken-ish edge cases (e.g. we assume values produced by operations that could later be selected to non-canonicalizing instructions are already canonicalized).

hvdijk (Contributor, Author) commented Feb 26, 2024

fp_extend (fp_round x, /*strict=*/1) which is optimised to a no-op

That sounds like the bug?

It's not. This is what the trunc parameter of fp_round (which I mistakenly called strict) is intended for. Quoting https://llvm.org/doxygen/namespacellvm_1_1ISD.html:

X = FP_ROUND(Y, TRUNC) - Rounding 'Y' from a larger floating point type down to the precision of the destination VT.

TRUNC is a flag, which is always an integer that is zero or one. If TRUNC is 0, this is a normal rounding, if it is 1, this FP_ROUND is known to not change the value of Y.

The TRUNC = 1 case is used in cases where we know that the value will not be modified by the node, because Y is not using any of the extra precision of source type. This allows certain transformations like FP_EXTEND(FP_ROUND(X,1)) -> X which are not safe for FP_EXTEND(FP_ROUND(X,0)) because the extra bits aren't removed.

This transformation is not safe for signalling NaNs, but in general, LLVM optimisations are not intended to be signalling-NaN-safe and may leave an SNaN as an SNaN. fcanonicalize is special in that regard: it must quiet them.
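
As a concrete illustration (a minimal example, not from the original report), canonicalize must quiet a signalling NaN, so it can never be folded away for possibly-sNaN inputs:

define half @quiet_snan() {
  ; 0xH7D00 is an f16 signalling NaN (quiet bit clear, payload non-zero);
  ; canonicalize must return the corresponding quiet NaN (e.g. 0xH7E00).
  %q = call half @llvm.canonicalize.f16(half 0xH7D00)
  ret half %q
}

declare half @llvm.canonicalize.f16(half)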

Is this only a problem with the implicit ABI-promoted half-to-float case? Do you see the same issue if you manually write out the casts on a modern subtarget?

It's not limited to half:

define float @f(float %x) {
  %ext = fpext float %x to double
  %canonicalized = call double @llvm.canonicalize.f64(double %ext)
  %trunc = fptrunc double %ext to float
  ret float %trunc
}

shows the same issue, and in this case, it persists even if I try different -mcpu values.

Partially yes, and partially no. There are some broken-ish edge cases (e.g. we assume things that could later be selected to non-canonicalizing operations are canonicalized).

Ah, but for generic target-independent codegen that is fine: it would only lead to redundant canonicalisation, right? That is, it would result in correct but suboptimal codegen. That should not be a problem, provided targets have enough hooks to make sure that operations that already canonicalise on their end aren't recanonicalised.

arsenm (Contributor) commented Feb 26, 2024

Ah, but for generic target-independent codegen that is fine: it would only lead to redundant canonicalisation, right? That is, it would result in correct but suboptimal codegen. That should not be a problem, provided targets have enough hooks to make sure that operations that already canonicalise on their end aren't recanonicalised.

The main issue I've been meaning to fix is that we specially treat the generic minnum/maxnum nodes based on knowledge of the underlying instruction's canonicalization behavior (which changed in gfx9+, IIRC).
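
A minimal sketch of the kind of case meant here (a hypothetical example, not from the thread): whether the canonicalize below is redundant depends on whether the subtarget's min/max instructions already canonicalize their result, which differs between generations:

define float @g(float %x, float %y) {
  ; On subtargets whose min/max instructions canonicalize their result, the
  ; canonicalize call is redundant; on others it must be kept.
  %m = call float @llvm.minnum.f32(float %x, float %y)
  %c = call float @llvm.canonicalize.f32(float %m)
  ret float %c
}

declare float @llvm.minnum.f32(float, float)
declare float @llvm.canonicalize.f32(float)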

hvdijk added a commit to hvdijk/llvm-project that referenced this issue Feb 26, 2024
We were relying on roundings to implicitly canonicalize, which is
generally safe, except with roundings that may be optimized away.

Fixes llvm#82937.
hvdijk (Contributor, Author) commented Feb 26, 2024

I had a thought on how to possibly fix it. It seems to work, so I created a PR for it, but I'm not 100% certain it covers all the cases, so I would welcome a very close look.

hvdijk added a commit to hvdijk/llvm-project that referenced this issue Feb 27, 2024
We were relying on roundings to implicitly canonicalize, which is
generally safe, except with roundings that may be optimized away.

Fixes llvm#82937.
hvdijk added a commit to hvdijk/llvm-project that referenced this issue Feb 27, 2024
We were relying on roundings to implicitly canonicalize, which is
generally safe, except with roundings that may be optimized away.

Fixes llvm#82937.
arsenm (Contributor) commented Feb 27, 2024

It's not limited to half:

I looked at this example, and this one actually breaks in the IR, which I think is a more serious issue: InstSimplify folds it away to just ret %x.

I also think addressing this from the trunc fold doesn't fundamentally address the issue; I think we shouldn't be trying to fold away canonicalizes during combining. We probably should only do this during selection, and even then there may still be risks based on selection patterns. It would be safest to do this during a post-selection machine pass.

hvdijk (Contributor, Author) commented Feb 27, 2024

It's not limited to half:

I looked at this example, and this one actually breaks in the IR, which I think is a more serious issue: InstSimplify folds it away to just ret %x.

Sorry, that's my fault, it was a bad test. It should have been %trunc = fptrunc double %canonicalized to float. I did test with a fixed version but left the bad one in my comment.
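
Spelled out, the corrected test is:

define float @f(float %x) {
  %ext = fpext float %x to double
  %canonicalized = call double @llvm.canonicalize.f64(double %ext)
  %trunc = fptrunc double %canonicalized to float
  ret float %trunc
}

declare double @llvm.canonicalize.f64(double)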

I also think addressing this from the trunc fold doesn't fundamentally address the issue;

I haven't gone over all the other optimisations that run at this time to check if others have the same issue; it's certainly possible there are more.

I think we shouldn't be trying to fold away canonicalizes during combining.

That would also address it. It does mean it cannot be done in a manner that can be shared across architectures, but the current method is also not shared across architectures.

Is this something that you expect will be done in the short term? If not, would a stopgap measure be okay seeing how it unblocks other optimisations?

arsenm (Contributor) commented Feb 27, 2024

Is this something that you expect will be done in the short term? If not, would a stopgap measure be okay seeing how it unblocks other optimisations?

I don't foresee myself having time to work on this any time soon. It will interfere with some combines, but a more holistic solution would be to only fold a "probably droppable" canonicalize to a freeze. Do you see many regressions if you try to do that?
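
At the IR level, that suggestion amounts to something like the following before/after (a rough sketch; the actual fold operates on SelectionDAG nodes and only applies when the operand is already known to be canonical):

; Before: a canonicalize whose operand is already canonical, since fptrunc
; produces a canonicalised result.
define half @before(float %x) {
  %t = fptrunc float %x to half
  %c = call half @llvm.canonicalize.f16(half %t)
  ret half %c
}

; After: fold to a freeze instead of deleting the node outright, so later
; combines (such as fp_extend (fp_round x, 1) -> x) cannot look through it
; and silently drop the canonicalisation point.
define half @after(float %x) {
  %t = fptrunc float %x to half
  %c = freeze half %t
  ret half %c
}

declare half @llvm.canonicalize.f16(half)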

hvdijk (Contributor, Author) commented Feb 27, 2024

That's what I did already. I don't see many regressions. Still more than I would like, so I may try to improve it further, but it might be acceptable as is already.

hvdijk added a commit to hvdijk/llvm-project that referenced this issue Feb 27, 2024
We were relying on roundings to implicitly canonicalize, which is
generally safe, except with roundings that may be optimized away.

Fixes llvm#82937.
hvdijk (Contributor, Author) commented Feb 27, 2024

Surprisingly, a fairly specific but simple extra fold of (fp_to_fp16 (freeze (fp16_to_fp (fp_to_fp16 op)))) -> (fp_to_fp16 (freeze op)), and likewise for bf16, reduces the regressions (as seen in LLVM's testsuite) to zero. I have updated the PR with it; the only codegen changes left in existing tests are bugfixes for cases where we previously dropped an fcanonicalize that we shouldn't have.
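
An IR-level analogue of that pattern, with hypothetical function names (the actual fold matches the fp_to_fp16/fp16_to_fp SelectionDAG nodes):

define half @before_fold(float %op) {
  %t = fptrunc float %op to half   ; fp_to_fp16
  %e = fpext half %t to float      ; fp16_to_fp
  %f = freeze float %e
  %r = fptrunc float %f to half    ; outer fp_to_fp16
  ret half %r
}

; ...which the extra fold rewrites to the equivalent of:
define half @after_fold(float %op) {
  %f = freeze float %op
  %r = fptrunc float %f to half
  ret half %r
}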

hvdijk added a commit to hvdijk/llvm-project that referenced this issue Mar 5, 2024
We were relying on roundings to implicitly canonicalize, which is
generally safe, except that roundings may be optimized away, and more
generally, other operations may be optimized away as well.

To solve this, as suggested by @arsenm, keep fcanonicalize nodes around
for longer. Some tests revealed cases where we no longer figured out
that previous operations already produced canonicalized results but we
could easily change that; this commit includes those changes. Other
tests revealed cases where we no longer figure out that previous
operations already produced canonicalized results but larger changes are
needed to detect that; this commit disables those tests or updates the
expected results.

Fixes llvm#82937.
hvdijk added a commit to hvdijk/llvm-project that referenced this issue Mar 6, 2024
We were relying on roundings to implicitly canonicalize, which is
generally safe, except that roundings may be optimized away, and more
generally, other operations may be optimized away as well.

To solve this, as suggested by @arsenm, keep fcanonicalize nodes around
for longer. Some tests revealed cases where we no longer figured out
that previous operations already produced canonicalized results but we
could easily change that; this commit includes those changes. Other
tests revealed cases where we no longer figure out that previous
operations already produced canonicalized results but larger changes are
needed to detect that; this commit disables those tests or updates the
expected results.

Fixes llvm#82937.
hvdijk added a commit that referenced this issue Mar 13, 2024
We were relying on roundings to implicitly canonicalize, which is
generally safe, except with roundings that may be optimized away.

Fixes #82937.