AMDGPU: Wrong code for fcanonicalize #82937
(Incidentally: would AMDGPU's current handling of this operation make sense to move to target-independent code? No other target currently implements it and the same logic should work for all, I think?)
@llvm/issue-subscribers-backend-amdgpu Author: Harald van Dijk (hvdijk)
Please consider this minimal LLVM IR:
```llvm
define half @f(half %x) {
%canonicalized = call half @llvm.canonicalize.f16(half %x)
ret half %canonicalized
}
```
Run with `llc -mtriple=amdgcn` and we get:
```asm
f: ; @f
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
s_setpc_b64 s[30:31]
```
The `canonicalize` operation has been entirely optimised away.
The reason for this is what we get during ISel: the `fcanonicalize` is optimised away because `SITargetLowering::isCanonicalized` determines that `fp_round` is guaranteed to return an already-canonicalised result, so no work is needed, but that then leaves us with `fp_extend (fp_round x, /*strict=*/1)`, which is optimised to a no-op.

This prevents another optimisation from going in (#80520), which makes this problem show up in more cases than it currently does, and sadly I struggle to find a good way of ensuring we get correct code for this case without also making codegen for other tests worse.
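The bad interaction can be modelled as two rewrites on a toy expression tree (a hypothetical sketch; the node names follow the SelectionDAG opcodes, but this is not the actual LLVM implementation):

```python
# Toy model of the two folds whose interaction loses the
# canonicalization (a sketch, not the real SelectionDAG code).
# Nodes are tuples: (opcode, operand).

def fold_canonicalize(node):
    # Fold 1: fcanonicalize(x) -> x when x is known to already be
    # canonical, e.g. because fp_round canonicalizes on this target.
    op, arg = node
    if op == "fcanonicalize" and arg[0] == "fp_round":
        return arg
    return node

def fold_ext_round(node):
    # Fold 2: fp_extend(fp_round(x)) -> x, treating the narrowing and
    # widening as a value-preserving round trip.
    op, arg = node
    if op == "fp_extend" and arg[0] == "fp_round":
        return arg[1]
    return node

x = ("x", None)
dag = ("fp_extend", ("fcanonicalize", ("fp_round", x)))

step1 = ("fp_extend", fold_canonicalize(dag[1]))  # fp_extend(fp_round(x))
step2 = fold_ext_round(step1)                     # just x

# Each fold is justified in isolation, but applied in sequence they
# erase the canonicalization entirely:
print(step2)  # ('x', None)
```

The point of the sketch: neither fold is wrong on its own; fold 1 relies on `fp_round` still being there, and fold 2 then removes it.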
That sounds like the bug?
Is this only a problem with the implicit ABI-promoted half-to-float case? Do you see the same issue if you manually write out the casts on a modern subtarget?
Partially yes, and partially no. There are some broken-ish edge cases (e.g. we assume things that could later be selected to non-canonicalizing operations are canonicalized).
It's not. This is what the
This transformation is not safe when it comes to signalling NaNs, but generally, LLVM optimisations are not intended to be signalling-NaN-safe and may leave SNaN as SNaN.
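Concretely, the signalling-NaN hazard is about the quiet bit: a canonicalizing operation turns an sNaN into a qNaN by setting the top mantissa bit, and the folded-away code skips exactly this step. A minimal bit-level sketch for IEEE-754 binary32 (the helper names are my own, not LLVM's):

```python
EXP_MASK = 0x7F800000   # binary32 exponent field
MANT_MASK = 0x007FFFFF  # binary32 mantissa field
QUIET_BIT = 0x00400000  # top mantissa bit distinguishes qNaN from sNaN

def is_nan(bits):
    # NaN: all-ones exponent with a nonzero mantissa.
    return (bits & EXP_MASK) == EXP_MASK and (bits & MANT_MASK) != 0

def is_snan(bits):
    # Signalling NaN: a NaN whose quiet bit is clear.
    return is_nan(bits) and (bits & QUIET_BIT) == 0

def canonicalize_bits(bits):
    # What fcanonicalize does for NaN inputs: sNaN comes out quiet.
    return bits | QUIET_BIT if is_snan(bits) else bits

snan = 0x7FA00000
assert is_snan(snan)
assert not is_snan(canonicalize_bits(snan))  # quieted
assert is_nan(canonicalize_bits(snan))       # still a NaN
```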
It's not limited to that case:
```llvm
define float @f(float %x) {
  %ext = fpext float %x to double
  %canonicalized = call double @llvm.canonicalize.f64(double %ext)
  %trunc = fptrunc double %ext to float
  ret float %trunc
}
```
shows the same issue, and in this case, it persists even if I try different
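Why the backend is tempted to fold the pair: widening f32 to f64 is exact, and truncating straight back recovers the original value, so value-wise the round trip is a no-op; what gets lost is the side effect of quieting signalling NaNs. A quick sketch of the value equivalence (assuming IEEE round-to-nearest; helper name is my own):

```python
import struct

def to_f32(x):
    # Round a Python float (f64) to the nearest f32 value.
    return struct.unpack('<f', struct.pack('<f', x))[0]

# fpext f32 -> f64 is exact, and fptrunc back to f32 lands on the
# original value, so value-wise the pair looks like a no-op to the
# optimizer.
for v in [0.0, 1.5, 3.141592653589793, 1e-40, 6.1e4]:
    x32 = to_f32(v)      # a value exactly representable in f32
    ext = float(x32)     # fpext: exact widening
    trunc = to_f32(ext)  # fptrunc: rounds, but recovers x32 exactly
    assert trunc == x32

# The catch: a real hardware f64 -> f32 conversion also quiets a
# signalling NaN, and folding the pair away loses that side effect.
```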
Ah, but for generic target-independent codegen that is fine: it would only lead to redundant canonicalisation, right? That is, it would result in correct but suboptimal codegen. That should not be a problem, provided targets have enough hooks to make sure that operations that already canonicalise on their end aren't recanonicalised.
The main issue I've been meaning to fix is that we specially treat the generic minnum/maxnum nodes based on knowledge of the underlying instruction's canonicalization behavior (which changed in gfx9+, IIRC).
We were relying on roundings to implicitly canonicalize, which is generally safe, except with roundings that may be optimized away. Fixes llvm#82937.
I had a thought on how to possibly fix it. It seems to work so I created a PR for it, but I'm not 100% certain it covers all the cases; I would welcome a very close look.
I looked at this example, and this one actually breaks in the IR, which I think is a more serious issue: InstSimplify folds it away (the `%canonicalized` value is unused in this test).

I also think addressing this from the trunc fold doesn't fundamentally address the issue; I think we shouldn't be trying to fold away canonicalizes during combining. We probably should only do this during selection, and then there may still be risks based on selection patterns. It would be safest to do this during a post-selection machine pass.
Sorry, that's my fault, it was a bad test. It should have been `%trunc = fptrunc double %canonicalized to float`.
I haven't gone over all the other optimisations that run at this time to check if others have the same issue, it's certainly possible there are more.
That would also address it. It does mean it cannot be done in a manner that can be shared across architectures, but the current method is also not shared across architectures. Is this something that you expect will be done in the short term? If not, would a stopgap measure be okay, seeing how it unblocks other optimisations?
I don't foresee myself having time to work on this any time soon. It will interfere with some combines, but a more holistic solution would be to only fold a "probably droppable" canonicalize to a freeze. Do you see many regressions if you try to do that?
That's what I did already. I don't see many regressions. Still more than I would like, so I may try to improve it further, but it might be acceptable as is already.
Surprisingly, a fairly specific but simple extra fold of `(fp_to_fp16 (freeze (fp16_to_fp (fp_to_fp16 op)))) -> (fp_to_fp16 (freeze op))`, and likewise for bf16, reduces the regressions (as seen in LLVM's test suite) to zero. I have updated the PR with it; the only codegen changes left in existing tests are bugfixes where we previously dropped an fcanonicalize where we shouldn't.
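Value-wise, that fold is justified because rounding to f16 is idempotent through a widen/narrow round trip: once `op` has been rounded to f16, widening it and rounding it again cannot change the value, and the freeze is preserved on the remaining operand. A small sketch checking the value equivalence (helper names are my own; Python's `struct` format `'e'` models IEEE binary16):

```python
import struct

def fp_to_fp16(x):
    # Round to half precision; the result is a Python float that is
    # exactly representable in f16.
    return struct.unpack('<e', struct.pack('<e', x))[0]

def fp16_to_fp(x):
    # Widening half -> float/double is exact.
    return float(x)

# The fold rewrites fp_to_fp16(fp16_to_fp(fp_to_fp16(op))) to
# fp_to_fp16(op): the inner rounding already landed on an f16 value,
# so the outer round trip is a no-op.
for op in [0.1, 1.0 / 3.0, 65504.0, 5.96e-8]:
    lhs = fp_to_fp16(fp16_to_fp(fp_to_fp16(op)))
    rhs = fp_to_fp16(op)
    assert lhs == rhs
```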
We were relying on roundings to implicitly canonicalize, which is generally safe, except that roundings may be optimized away, and more generally, other operations may be optimized away as well. To solve this, as suggested by @arsenm, keep fcanonicalize nodes around for longer. Some tests revealed cases where we no longer figured out that previous operations already produced canonicalized results but we could easily change that; this commit includes those changes. Other tests revealed cases where we no longer figure out that previous operations already produced canonicalized results but larger changes are needed to detect that; this commit disables those tests or updates the expected results. Fixes llvm#82937.