Question: What is the correct interpretation of LaneBitmask? #109797
@llvm/pr-subscribers-llvm-regalloc @llvm/pr-subscribers-backend-aarch64

Author: Sander de Smalen (sdesmalen-arm)

Changes

While enabling sub-reg liveness for AArch64, we ran into a register allocator issue, which leads us to ask the question: what is the correct interpretation of the bits in LaneBitmask?

(1) The documentation of LaneBitmask suggests that if A == B then A overlaps with B, but suggests nothing that guarantees that A fully overlaps with B.
(2) The code pointed out in this PR assumes that if A == B, then A fully overlaps with B.

AArch64 defines 16-bit H0 as being the larger register of 8-bit B0; it does not define the top bits of H0 as an artificial register (e.g. B0_HI).

In the reproducer the register allocator has to insert a spill. Because 'bsub' (8-bit), 'hsub' (16-bit), 'ssub' (32-bit), etc. all have the same bitmask, it picks a register class for the COPY instruction using the first subreg index that matches. This results in a copy of a 32-bit value, rather than the (desired) full 128-bit value.

If H0 were defined as B0 and B0_HI (and similarly for all other registers), then the following condition would evaluate to true only for the largest subreg index:

  // Early exit if we found a perfect match.
  if (SubRegMask == LaneMask) {
    BestIdx = Idx;
    break;
  }

There are several ways to fix this issue;
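To make the ambiguity concrete, here is a minimal standalone C++ sketch; the lane-mask values and variable names in it are hypothetical (they are not TableGen output) and only illustrate that, when bsub, hsub and ssub carry identical masks, the "perfect match" equality test cannot tell them apart:

  // Hypothetical lane-mask values (not generated by TableGen), just to show
  // why an equality test cannot distinguish an 8-bit sub-register from a
  // 32-bit one when no artificial B0_HI register is defined.
  #include <cassert>
  #include <cstdint>

  int main() {
    const uint64_t BsubMask = 0x1; // lanes of B0 within Q0 (hypothetical)
    const uint64_t HsubMask = 0x1; // H0 adds no lanes of its own: no B0_HI
    const uint64_t SsubMask = 0x1; // likewise for S0
    const uint64_t LaneMask = 0x1; // lanes that actually need to be preserved
    // All three sub-register indices look like a "perfect match", so the
    // first (narrowest) one wins, even if the spilled value is wider.
    assert(BsubMask == LaneMask && HsubMask == LaneMask && SsubMask == LaneMask);
    return 0;
  }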
Can someone with more knowledge of the register allocator provide guidance? Thank you!

Full diff: https://github.com/llvm/llvm-project/pull/109797.diff

3 Files Affected:
diff --git a/llvm/lib/CodeGen/TargetRegisterInfo.cpp b/llvm/lib/CodeGen/TargetRegisterInfo.cpp
index ac9a3d6f0d1a60..e62cdf7765c4f1 100644
--- a/llvm/lib/CodeGen/TargetRegisterInfo.cpp
+++ b/llvm/lib/CodeGen/TargetRegisterInfo.cpp
@@ -521,26 +521,62 @@ bool TargetRegisterInfo::getCoveringSubRegIndexes(
SmallVector<unsigned, 8> PossibleIndexes;
unsigned BestIdx = 0;
unsigned BestCover = 0;
-
+ unsigned BestSize = 0;
for (unsigned Idx = 1, E = getNumSubRegIndices(); Idx < E; ++Idx) {
// Is this index even compatible with the given class?
if (getSubClassWithSubReg(RC, Idx) != RC)
continue;
LaneBitmask SubRegMask = getSubRegIndexLaneMask(Idx);
+
+ // The code below (now disabled) is not correct when the lane mask does not
+ // cover the full sub-register. For example, when 16-bit H0 has only one
+ // subreg B0 of 8-bits, and the high lanes are not defined in TableGen (e.g.
+ // by defining B0 and some artificial B0_HI register), then this function
+ // may return either `bsub` or `hsub`, whereas in this case we'd want it to
+ // return the *largest* (in bitwidth) sub-register index. It is better to
+ // keep looking for the biggest one, but this is rather expensive.
+ //
+ // Before thinking of how to solve this, I first want to ask the
+ // question: What is the meaning of 'LaneBitmask'?
+ //
+ // According to LaneBitmask.h:
+ //
+ // "The individual bits in a lane mask can't be assigned
+ // any specific meaning. They can be used to check if two
+ // sub-register indices overlap."
+ //
+ // However, in the code here the assumption is made that if
+ // 'SubRegMask == LaneMask', then the subregister must overlap *fully*.
+ // This is not the case for the way AArch64 registers are defined
+ // at the moment.
+ //
+ // Which interpretation is correct?
+ //
+ // A) If the meaning is 'if the bits are equal, the sub-registers overlap but
+ // not necessarily fully', then we should fix the code in this function (in
+ // a better way than just disabling it).
+ //
+ // B) If the meaning is 'if the bits are equal, the sub-registers overlap
+ // fully', then we can define the high bits with an artificial register.
+
+#if 0
// Early exit if we found a perfect match.
if (SubRegMask == LaneMask) {
BestIdx = Idx;
break;
}
+#endif
// The index must not cover any lanes outside \p LaneMask.
if ((SubRegMask & ~LaneMask).any())
continue;
unsigned PopCount = SubRegMask.getNumLanes();
+ unsigned Size = getSubRegIdxSize(Idx);
PossibleIndexes.push_back(Idx);
- if (PopCount > BestCover) {
+ if ((!BestCover || PopCount >= BestCover) && Size >= BestSize) {
BestCover = PopCount;
+ BestSize = Size;
BestIdx = Idx;
}
}
@@ -559,6 +595,8 @@ bool TargetRegisterInfo::getCoveringSubRegIndexes(
int BestCover = std::numeric_limits<int>::min();
for (unsigned Idx : PossibleIndexes) {
LaneBitmask SubRegMask = getSubRegIndexLaneMask(Idx);
+ // FIXME: similar issue as above.
+
// Early exit if we found a perfect match.
if (SubRegMask == LanesLeft) {
BestIdx = Idx;
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.h b/llvm/lib/Target/AArch64/AArch64Subtarget.h
index accfb49c6fbe3a..338afb0cf1962a 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.h
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.h
@@ -152,6 +152,7 @@ class AArch64Subtarget final : public AArch64GenSubtargetInfo {
const Triple &getTargetTriple() const { return TargetTriple; }
bool enableMachineScheduler() const override { return true; }
bool enablePostRAScheduler() const override { return usePostRAScheduler(); }
+ bool enableSubRegLiveness() const override { return true; }
bool enableMachinePipeliner() const override;
bool useDFAforSMS() const override { return false; }
diff --git a/llvm/test/CodeGen/AArch64/regalloc-subregidx-reproducer.ll b/llvm/test/CodeGen/AArch64/regalloc-subregidx-reproducer.ll
new file mode 100644
index 00000000000000..c7e96a68779562
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/regalloc-subregidx-reproducer.ll
@@ -0,0 +1,75 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s | FileCheck %s
+
+target triple = "aarch64"
+
+define void @reproducer(ptr %ptr, ptr %ptr2, <8 x i32> %vec.arg) {
+; CHECK-LABEL: reproducer:
+; CHECK: // %bb.0: // %entry
+; CHECK-NEXT: sub sp, sp, #112
+; CHECK-NEXT: stp x29, x30, [sp, #80] // 16-byte Folded Spill
+; CHECK-NEXT: stp x20, x19, [sp, #96] // 16-byte Folded Spill
+; CHECK-NEXT: .cfi_def_cfa_offset 112
+; CHECK-NEXT: .cfi_offset w19, -8
+; CHECK-NEXT: .cfi_offset w20, -16
+; CHECK-NEXT: .cfi_offset w30, -24
+; CHECK-NEXT: .cfi_offset w29, -32
+; CHECK-NEXT: mov x8, xzr
+; CHECK-NEXT: stp q0, q1, [sp, #16] // 32-byte Folded Spill
+; CHECK-NEXT: mov x19, x1
+; CHECK-NEXT: ld2 { v0.4s, v1.4s }, [x8]
+; CHECK-NEXT: add x8, sp, #48
+; CHECK-NEXT: mov x20, x0
+; CHECK-NEXT: st1 { v0.2d, v1.2d }, [x8] // 32-byte Folded Spill
+; CHECK-NEXT: bl bar
+; CHECK-NEXT: ldp q1, q0, [sp, #16] // 32-byte Folded Reload
+; CHECK-NEXT: mov w8, #1 // =0x1
+; CHECK-NEXT: uzp2 v0.4s, v1.4s, v0.4s
+; CHECK-NEXT: dup v1.2d, x8
+; CHECK-NEXT: add x8, sp, #48
+; CHECK-NEXT: ld1 { v5.2d, v6.2d }, [x8] // 32-byte Folded Reload
+; CHECK-NEXT: ushll2 v2.2d, v6.4s, #1
+; CHECK-NEXT: ushll v3.2d, v6.2s, #1
+; CHECK-NEXT: ushll v4.2d, v5.2s, #0
+; CHECK-NEXT: mov v6.16b, v5.16b
+; CHECK-NEXT: ushll2 v5.2d, v0.4s, #1
+; CHECK-NEXT: ushll v0.2d, v0.2s, #1
+; CHECK-NEXT: orr v3.16b, v3.16b, v4.16b
+; CHECK-NEXT: ushll2 v6.2d, v6.4s, #0
+; CHECK-NEXT: orr v0.16b, v0.16b, v1.16b
+; CHECK-NEXT: orr v2.16b, v2.16b, v6.16b
+; CHECK-NEXT: stp q0, q3, [sp, #32] // 32-byte Folded Spill
+; CHECK-NEXT: orr v0.16b, v5.16b, v1.16b
+; CHECK-NEXT: stp q0, q2, [sp] // 32-byte Folded Spill
+; CHECK-NEXT: bl bar
+; CHECK-NEXT: ldr q1, [sp, #16] // 16-byte Folded Reload
+; CHECK-NEXT: ldr q0, [sp, #48] // 16-byte Folded Reload
+; CHECK-NEXT: ldp x29, x30, [sp, #80] // 16-byte Folded Reload
+; CHECK-NEXT: stp q0, q1, [x20]
+; CHECK-NEXT: ldr q1, [sp] // 16-byte Folded Reload
+; CHECK-NEXT: ldr q0, [sp, #32] // 16-byte Folded Reload
+; CHECK-NEXT: stp q0, q1, [x19]
+; CHECK-NEXT: ldp x20, x19, [sp, #96] // 16-byte Folded Reload
+; CHECK-NEXT: add sp, sp, #112
+; CHECK-NEXT: ret
+entry:
+ %wide.vec = load <8 x i32>, ptr null, align 4
+ call void @bar()
+ %strided.vec = shufflevector <8 x i32> %wide.vec, <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
+ %strided.vec2 = shufflevector <8 x i32> %wide.vec, <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
+ %strided.vec3 = shufflevector <8 x i32> %vec.arg, <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
+ %1 = zext <4 x i32> %strided.vec2 to <4 x i64>
+ %2 = zext <4 x i32> %strided.vec3 to <4 x i64>
+ %3 = shl <4 x i64> %1, <i64 1, i64 1, i64 1, i64 1>
+ %4 = shl <4 x i64> %2, <i64 1, i64 1, i64 1, i64 1>
+ %5 = zext <4 x i32> %strided.vec to <4 x i64>
+ %6 = or <4 x i64> %3, %5
+ %7 = or <4 x i64> %4, <i64 1, i64 1, i64 1, i64 1>
+ call void @bar()
+ store <4 x i64> %6, ptr %ptr, align 8
+ store <4 x i64> %7, ptr %ptr2, align 8
+ ret void
+}
+
+declare void @bar()
+
You can test this locally with the following command:

  git-clang-format --diff 3c83102f0615c7d66f6df698ca472ddbf0e9483d 0f1ed85616670fe321918f199a9e9baedc50d578 --extensions cpp,h -- llvm/lib/CodeGen/TargetRegisterInfo.cpp llvm/lib/Target/AArch64/AArch64Subtarget.h

View the diff from clang-format here:

diff --git a/llvm/lib/CodeGen/TargetRegisterInfo.cpp b/llvm/lib/CodeGen/TargetRegisterInfo.cpp
index e62cdf7765..f4a2feac7c 100644
--- a/llvm/lib/CodeGen/TargetRegisterInfo.cpp
+++ b/llvm/lib/CodeGen/TargetRegisterInfo.cpp
@@ -552,9 +552,10 @@ bool TargetRegisterInfo::getCoveringSubRegIndexes(
//
// Which interpretation is correct?
//
- // A) If the meaning is 'if the bits are equal, the sub-registers overlap but
- // not necessarily fully', then we should fix the code in this function (in
- // a better way than just disabling it).
+ // A) If the meaning is 'if the bits are equal, the sub-registers overlap
+ // but
+ // not necessarily fully', then we should fix the code in this function
+ // (in a better way than just disabling it).
//
// B) If the meaning is 'if the bits are equal, the sub-registers overlap
// fully', then we can define the high bits with an artificial register.
As it is today, I believe the intent is statement 1. However, I think we would be in a better place if we did require 2. We do have a number of issues whenever there are pieces of registers that aren't fully covered by a defined register, and it is harder to reason about. We probably need to require fully covering artificial register definitions to fully move the infrastructure to using register units (e.g. replacing live-in tracking with regunits instead of registers, and using regunit masks instead of regmasks).
Agreed.
This was discussed in some detail on #95926 (also see #96146). Also see #79831 (comment).
Thanks for the quick feedback!
Does the approach of using register units (instead of registers) have any impact on the existence or future use of the LaneBitmask?
I have some local patches that define registers/subreg-indices for the top bits, but then ran into an issue that the LaneBitmask ran out of bits.
No. In an ideal world you could directly map bits in the lane mask to register units.

How big are these registers? AMDGPU is sitting right at the 64-bit limit now.
Is this for the regular AArch64 integer registers? My understanding is:
So I would expect you would need subregisters for:
I.e., roughly log2(bitwidth) of them.
That's right; however, it is the register tuples that make things take up space, as the bits are replicated for each subvector in the tuple. It ends up with something like this:

and similar for other tuples, where each sub-register must be separately addressable and cannot reuse the bits (otherwise qqsub_0:bsub and qqsub_1:bsub would alias, for example). Then there are also bits to represent the indices in GPR registers, and a set of bits needed to express the (sub)tiles in a matrix register. With all of these together in the same bitmask, we run out of the 64 bits.
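As a toy illustration of the aliasing constraint (the mask values and names below are made up; only the disjointness matters), the lanes belonging to the first and second element of a QQ tuple cannot share bits:

  // Toy check with made-up lane masks: each element of a register tuple needs
  // its own copy of the sub-register lanes, which is what makes wide tuples
  // so expensive in lane-mask bits.
  #include <cassert>
  #include <cstdint>

  int main() {
    const uint64_t QQsub0_Bsub = 0x0000000000000001; // hypothetical
    const uint64_t QQsub1_Bsub = 0x0000000000010000; // hypothetical, disjoint
    assert((QQsub0_Bsub & QQsub1_Bsub) == 0 &&
           "qqsub_0:bsub and qqsub_1:bsub must not alias");
    return 0;
  }

The actual AArch64 masks are generated by TableGen; the point here is only that the bits cannot be reused across tuple elements.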
This follows on from the conversation on llvm#109797.

FWIW, I've considered several alternatives to this patch:

(1) Using APInt as the storage type rather than 'uint64_t Mask[2]'. This makes the code a bit simpler to read, but APInt by default only allocates space for a 64-bit value and otherwise dynamically allocates a larger buffer to represent the larger value. Because we know the value is always 128-bit, this extra dynamic allocation is undesirable. We also rarely need the full power of APInt since most of the tests are bitwise operations, so I made the choice to represent it as a uint64_t array instead, only moving to/from APInt when this is necessary.

Because it is inconvenient that increasing the BitWidth applies to all targets, I tried to see if there is a way to make the bitwidth dynamic, by:

(2) Making the BitWidth dynamic per target by passing it to constructors.
(3) A modification of (2) that describes 'none', 'all' and 'lane' with an enum which doesn't require a BitWidth, until doing arithmetic with an explicit bit mask (which does have a BitWidth).

Unfortunately, neither of these approaches seems feasible. For (2) that is because it would require passing the TargetRegisterInfo/MCRegisterInfo to many places where this info is not available but where a LaneBitmask value needs to be instantiated. Approach (3) leads to other issues, such as: what is the meaning of 'operator==' when one value is a mask and the other is the 'all' enum? If we let 'operator==' discard the bitwidth such that a 64-bit all-true bitmask == LaneBitmask::all() (using the 'all' enum), then we could end up in a situation where X == LaneBitmask::all() && Y == LaneBitmask::all() but `X != Y`. I considered replacing the equality operators by methods that take a RegisterInfo pointer, but the LaneBitmask struct is used in STL containers which require a plain 'operator==' or 'operator<'. We could work around that by providing custom lambdas (that call the method with the TargetInfo pointer), but this just gets increasingly hacky. Perhaps just using more bits isn't actually that bad in practice.
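To picture the storage scheme that message describes, here is a rough standalone sketch (my own approximation, using a made-up name 'WideLaneBitmask'; it is not the actual patch): a fixed 128-bit mask kept as uint64_t Mask[2], with the cheap bitwise operations that cover most lane-mask uses.

  // Sketch only: a fixed-width 128-bit lane mask stored as two 64-bit words,
  // so no dynamic allocation is needed (unlike a >64-bit APInt).
  #include <cstdint>

  struct WideLaneBitmask {
    uint64_t Mask[2]; // Mask[0] holds lanes 0-63, Mask[1] holds lanes 64-127

    bool any() const { return (Mask[0] | Mask[1]) != 0; }
    bool none() const { return !any(); }
    unsigned getNumLanes() const {
      return __builtin_popcountll(Mask[0]) + __builtin_popcountll(Mask[1]);
    }
  };

  inline WideLaneBitmask operator&(WideLaneBitmask A, WideLaneBitmask B) {
    return {{A.Mask[0] & B.Mask[0], A.Mask[1] & B.Mask[1]}};
  }
  inline WideLaneBitmask operator|(WideLaneBitmask A, WideLaneBitmask B) {
    return {{A.Mask[0] | B.Mask[0], A.Mask[1] | B.Mask[1]}};
  }
  inline bool operator==(WideLaneBitmask A, WideLaneBitmask B) {
    return A.Mask[0] == B.Mask[0] && A.Mask[1] == B.Mask[1];
  }

Widening the mask this way keeps the comparisons and and/or operations cheap and allocation-free, at the cost of doubling the storage for every target, which is the trade-off the message weighs against a dynamic per-target bitwidth.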
    //
    // A) If the meaning is 'if the bits are equal, the sub-registers overlap but
    // not necessarily fully', then we should fix the code in this function (in
    // a better way than just disabling it).
Like @arsenm said this is the correct interpretation.
Thanks for confirming! Now that this is cleared up, I'll close this PR.
This is a step towards enabling subreg liveness tracking for AArch64, which requires that registers are fully covered by their subregisters, as covered here llvm#109797. There are several changes in this patch:

* AArch64RegisterInfo.td and tests: Define the high bits like B0_HI, H0_HI, S0_HI, D0_HI, Q0_HI. Because the bits must be defined by some register class, this added a register class, which meant that we had to update 'magic numbers' in several tests. The use of ComposedSubRegIndex helped 'compress' the number of bits required for the lanemask. The correctness of the masks is tested by explicit unit tests.
* LoadStoreOptimizer: previously 'HasDisjunctSubRegs' was only true for register tuples, but with this change to describe the high bits, a register like 'D0' will also have 'HasDisjunctSubRegs' set to true (because it's fully covered by S0 and S0_HI). The fix here is to explicitly test if the register class is one of the known D/Q/Z tuples.
* TableGen: The handling of the isArtificial flag was entirely broken. Skipping out too early from some of the loops led to an incorrect internal representation of the (sub)register(index) hierarchy, and thus resulted in incorrect TableGen info.