[TTI][RISCV] Unconditionally break critical edges to sink ADDI #108889

Merged
merged 8 commits into from
Nov 26, 2024

Conversation

preames
Collaborator

@preames preames commented Sep 16, 2024

This looks like a rather weird change, so let me explain why this isn't as unreasonable as it looks. Let's start with the problem it's solving.

define signext i32 @overlap_live_ranges(ptr %arg, i32 signext %arg1) {
bb:
  %i = icmp eq i32 %arg1, 1
  br i1 %i, label %bb2, label %bb5

bb2:                                              ; preds = %bb
  %i3 = getelementptr inbounds nuw i8, ptr %arg, i64 4
  %i4 = load i32, ptr %i3, align 4
  br label %bb5

bb5:                                              ; preds = %bb2, %bb
  %i6 = phi i32 [ %i4, %bb2 ], [ 13, %bb ]
  ret i32 %i6
}

Right now, we codegen this as:

	li	a3, 1
	li	a2, 13
	bne	a1, a3, .LBB0_2
	lw	a2, 4(a0)
.LBB0_2:
	mv	a0, a2
	ret

In this example, we have two values which must be assigned to a0 per the ABI (%arg and the return value). SelectionDAG ensures that all values used in a successor phi are defined before exiting the predecessor block. This creates an ADDI to materialize the immediate in the entry block.

Currently, this ADDI is not sunk into the tail block because we'd have to split a critical edge to do so. Note that if our immediate were anything large enough to require two instructions, we would split this critical edge.

Looking at other targets, we notice that they don't seem to have this problem: they perform the sinking and tail duplication that we don't. Why? Well, it turns out that for AArch64 this is entirely an accident of the existence of the gpr32all register class. The immediate is materialized into the gpr32 class and then copied into the gpr32all register class. The existence of that copy puts us right back into the two-instruction case noted above.

This change essentially just bypasses that emergent aspect of the AArch64 behavior and implements the same "always sink immediates" behavior for RISCV as well.

; RV64-NEXT: .LBB3_2:
; RV64-NEXT: li a0, 6
Collaborator Author

These select lowering changes are arguably real regressions. We are losing the opportunity to allocate the untaken result into a register and then conditionally overwrite it. The tail duplication probably wouldn't happen in real code, so we'd end up emitting an extra unconditional jump to bypass the critical edge we split before the join.

Specifically for these selects in tests, most can be done via arithmetic expansions instead. I started to implement that, but ended up pulling on a few too many intertwined changes. I do plan to go back to this, but I don't consider the select impact blocking here.

@llvmbot
Member

llvmbot commented Sep 16, 2024

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-risc-v

Author: Philip Reames (preames)

Changes



Patch is 23.40 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/108889.diff

10 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/TargetInstrInfo.h (+6)
  • (modified) llvm/lib/CodeGen/MachineSink.cpp (+3-1)
  • (modified) llvm/lib/Target/RISCV/RISCVInstrInfo.h (+4)
  • (modified) llvm/test/CodeGen/RISCV/aext-to-sext.ll (+8-6)
  • (modified) llvm/test/CodeGen/RISCV/machine-sink-load-immediate.ll (+6-6)
  • (modified) llvm/test/CodeGen/RISCV/rv64m-w-insts-legalization.ll (+4-6)
  • (modified) llvm/test/CodeGen/RISCV/select-const.ll (+6-6)
  • (modified) llvm/test/CodeGen/RISCV/select.ll (+24-24)
  • (modified) llvm/test/CodeGen/RISCV/sextw-removal.ll (+48-44)
  • (modified) llvm/test/CodeGen/RISCV/typepromotion-overflow.ll (+69-57)
diff --git a/llvm/include/llvm/CodeGen/TargetInstrInfo.h b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
index a3bfc63f2a4790..aef2c1748a099d 100644
--- a/llvm/include/llvm/CodeGen/TargetInstrInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
@@ -159,6 +159,12 @@ class TargetInstrInfo : public MCInstrInfo {
     return true;
   }
 
+  /// For a "cheap" instruction which doesn't enable additional sinking,
+  /// should MachineSink break a critical edge to sink it anyways?
+  virtual bool shouldBreakCriticalEdgeToSink(MachineInstr &MI) const {
+    return false;
+  }
+
 protected:
   /// For instructions with opcodes for which the M_REMATERIALIZABLE flag is
   /// set, this hook lets the target specify whether the instruction is actually
diff --git a/llvm/lib/CodeGen/MachineSink.cpp b/llvm/lib/CodeGen/MachineSink.cpp
index 609f9af9767f5d..d34ee4f1040ffc 100644
--- a/llvm/lib/CodeGen/MachineSink.cpp
+++ b/llvm/lib/CodeGen/MachineSink.cpp
@@ -952,7 +952,9 @@ bool MachineSinking::isWorthBreakingCriticalEdge(
     }
   }
 
-  return false;
+  // Let the target decide if it's worth breaking this
+  // critical edge for a "cheap" instruction.
+  return TII->shouldBreakCriticalEdgeToSink(MI);
 }
 
 bool MachineSinking::isLegalToBreakCriticalEdge(MachineInstr &MI,
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfo.h b/llvm/lib/Target/RISCV/RISCVInstrInfo.h
index 457db9b9860d00..4fb7e68e23d422 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfo.h
@@ -78,6 +78,10 @@ class RISCVInstrInfo : public RISCVGenInstrInfo {
 
   bool isReallyTriviallyReMaterializable(const MachineInstr &MI) const override;
 
+  bool shouldBreakCriticalEdgeToSink(MachineInstr &MI) const override {
+    return MI.getOpcode() == RISCV::ADDI;
+  }
+
   void copyPhysRegVector(MachineBasicBlock &MBB,
                          MachineBasicBlock::iterator MBBI, const DebugLoc &DL,
                          MCRegister DstReg, MCRegister SrcReg, bool KillSrc,
diff --git a/llvm/test/CodeGen/RISCV/aext-to-sext.ll b/llvm/test/CodeGen/RISCV/aext-to-sext.ll
index 888ea666d71316..f3f71a923bdc29 100644
--- a/llvm/test/CodeGen/RISCV/aext-to-sext.ll
+++ b/llvm/test/CodeGen/RISCV/aext-to-sext.ll
@@ -78,12 +78,14 @@ bar:
 define i64 @sext_phi_constants(i32 signext %c) {
 ; RV64I-LABEL: sext_phi_constants:
 ; RV64I:       # %bb.0:
-; RV64I-NEXT:    li a1, -1
-; RV64I-NEXT:    bnez a0, .LBB2_2
-; RV64I-NEXT:  # %bb.1: # %iffalse
-; RV64I-NEXT:    li a1, -2
-; RV64I-NEXT:  .LBB2_2: # %merge
-; RV64I-NEXT:    slli a0, a1, 32
+; RV64I-NEXT:    beqz a0, .LBB2_2
+; RV64I-NEXT:  # %bb.1:
+; RV64I-NEXT:    li a0, -1
+; RV64I-NEXT:    j .LBB2_3
+; RV64I-NEXT:  .LBB2_2: # %iffalse
+; RV64I-NEXT:    li a0, -2
+; RV64I-NEXT:  .LBB2_3: # %merge
+; RV64I-NEXT:    slli a0, a0, 32
 ; RV64I-NEXT:    srli a0, a0, 32
 ; RV64I-NEXT:    ret
   %a = icmp ne i32 %c, 0
diff --git a/llvm/test/CodeGen/RISCV/machine-sink-load-immediate.ll b/llvm/test/CodeGen/RISCV/machine-sink-load-immediate.ll
index 775ea8e820afe3..f16bc0507cb480 100644
--- a/llvm/test/CodeGen/RISCV/machine-sink-load-immediate.ll
+++ b/llvm/test/CodeGen/RISCV/machine-sink-load-immediate.ll
@@ -184,13 +184,13 @@ declare i32 @toupper()
 define signext i32 @overlap_live_ranges(ptr %arg, i32 signext %arg1) {
 ; CHECK-LABEL: overlap_live_ranges:
 ; CHECK:       # %bb.0: # %bb
-; CHECK-NEXT:    li a3, 1
-; CHECK-NEXT:    li a2, 13
-; CHECK-NEXT:    bne a1, a3, .LBB1_2
+; CHECK-NEXT:    li a2, 1
+; CHECK-NEXT:    bne a1, a2, .LBB1_2
 ; CHECK-NEXT:  # %bb.1: # %bb2
-; CHECK-NEXT:    lw a2, 4(a0)
-; CHECK-NEXT:  .LBB1_2: # %bb5
-; CHECK-NEXT:    mv a0, a2
+; CHECK-NEXT:    lw a0, 4(a0)
+; CHECK-NEXT:    ret
+; CHECK-NEXT:  .LBB1_2:
+; CHECK-NEXT:    li a0, 13
 ; CHECK-NEXT:    ret
 bb:
   %i = icmp eq i32 %arg1, 1
diff --git a/llvm/test/CodeGen/RISCV/rv64m-w-insts-legalization.ll b/llvm/test/CodeGen/RISCV/rv64m-w-insts-legalization.ll
index f69909e76d4c10..a2c572e07ff7d0 100644
--- a/llvm/test/CodeGen/RISCV/rv64m-w-insts-legalization.ll
+++ b/llvm/test/CodeGen/RISCV/rv64m-w-insts-legalization.ll
@@ -5,15 +5,13 @@ define signext i32 @mulw(i32 signext %s, i32 signext %n, i32 signext %k) nounwin
 ; CHECK-LABEL: mulw:
 ; CHECK:       # %bb.0: # %entry
 ; CHECK-NEXT:    li a2, 1
-; CHECK-NEXT:    bge a0, a1, .LBB0_3
-; CHECK-NEXT:  # %bb.1: # %for.body.preheader
-; CHECK-NEXT:    li a2, 1
-; CHECK-NEXT:  .LBB0_2: # %for.body
+; CHECK-NEXT:    bge a0, a1, .LBB0_2
+; CHECK-NEXT:  .LBB0_1: # %for.body
 ; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    mulw a2, a0, a2
 ; CHECK-NEXT:    addiw a0, a0, 1
-; CHECK-NEXT:    blt a0, a1, .LBB0_2
-; CHECK-NEXT:  .LBB0_3: # %for.cond.cleanup
+; CHECK-NEXT:    blt a0, a1, .LBB0_1
+; CHECK-NEXT:  .LBB0_2: # %for.cond.cleanup
 ; CHECK-NEXT:    mv a0, a2
 ; CHECK-NEXT:    ret
 entry:
diff --git a/llvm/test/CodeGen/RISCV/select-const.ll b/llvm/test/CodeGen/RISCV/select-const.ll
index 792df6236ddc0e..b3e32b1f5c9c3d 100644
--- a/llvm/test/CodeGen/RISCV/select-const.ll
+++ b/llvm/test/CodeGen/RISCV/select-const.ll
@@ -61,22 +61,22 @@ define signext i32 @select_const_int_pow2_zero(i1 zeroext %a) nounwind {
 define signext i32 @select_const_int_harder(i1 zeroext %a) nounwind {
 ; RV32-LABEL: select_const_int_harder:
 ; RV32:       # %bb.0:
-; RV32-NEXT:    mv a1, a0
-; RV32-NEXT:    li a0, 6
-; RV32-NEXT:    bnez a1, .LBB3_2
+; RV32-NEXT:    bnez a0, .LBB3_2
 ; RV32-NEXT:  # %bb.1:
 ; RV32-NEXT:    li a0, 38
+; RV32-NEXT:    ret
 ; RV32-NEXT:  .LBB3_2:
+; RV32-NEXT:    li a0, 6
 ; RV32-NEXT:    ret
 ;
 ; RV64-LABEL: select_const_int_harder:
 ; RV64:       # %bb.0:
-; RV64-NEXT:    mv a1, a0
-; RV64-NEXT:    li a0, 6
-; RV64-NEXT:    bnez a1, .LBB3_2
+; RV64-NEXT:    bnez a0, .LBB3_2
 ; RV64-NEXT:  # %bb.1:
 ; RV64-NEXT:    li a0, 38
+; RV64-NEXT:    ret
 ; RV64-NEXT:  .LBB3_2:
+; RV64-NEXT:    li a0, 6
 ; RV64-NEXT:    ret
   %1 = select i1 %a, i32 6, i32 38
   ret i32 %1
diff --git a/llvm/test/CodeGen/RISCV/select.ll b/llvm/test/CodeGen/RISCV/select.ll
index 8aa50cc0f39c1a..e3766c024197ef 100644
--- a/llvm/test/CodeGen/RISCV/select.ll
+++ b/llvm/test/CodeGen/RISCV/select.ll
@@ -1585,22 +1585,22 @@ define i32 @select_cst_not5(i32 signext %a, i32 signext %b) {
 define i32 @select_cst_unknown(i32 signext %a, i32 signext %b) {
 ; RV32IM-LABEL: select_cst_unknown:
 ; RV32IM:       # %bb.0:
-; RV32IM-NEXT:    mv a2, a0
-; RV32IM-NEXT:    li a0, 5
-; RV32IM-NEXT:    blt a2, a1, .LBB42_2
+; RV32IM-NEXT:    blt a0, a1, .LBB42_2
 ; RV32IM-NEXT:  # %bb.1:
 ; RV32IM-NEXT:    li a0, -7
+; RV32IM-NEXT:    ret
 ; RV32IM-NEXT:  .LBB42_2:
+; RV32IM-NEXT:    li a0, 5
 ; RV32IM-NEXT:    ret
 ;
 ; RV64IM-LABEL: select_cst_unknown:
 ; RV64IM:       # %bb.0:
-; RV64IM-NEXT:    mv a2, a0
-; RV64IM-NEXT:    li a0, 5
-; RV64IM-NEXT:    blt a2, a1, .LBB42_2
+; RV64IM-NEXT:    blt a0, a1, .LBB42_2
 ; RV64IM-NEXT:  # %bb.1:
 ; RV64IM-NEXT:    li a0, -7
+; RV64IM-NEXT:    ret
 ; RV64IM-NEXT:  .LBB42_2:
+; RV64IM-NEXT:    li a0, 5
 ; RV64IM-NEXT:    ret
 ;
 ; RV64IMXVTCONDOPS-LABEL: select_cst_unknown:
@@ -1626,22 +1626,22 @@ define i32 @select_cst_unknown(i32 signext %a, i32 signext %b) {
 define i32 @select_cst1(i1 zeroext %cond) {
 ; RV32IM-LABEL: select_cst1:
 ; RV32IM:       # %bb.0:
-; RV32IM-NEXT:    mv a1, a0
-; RV32IM-NEXT:    li a0, 10
-; RV32IM-NEXT:    bnez a1, .LBB43_2
+; RV32IM-NEXT:    bnez a0, .LBB43_2
 ; RV32IM-NEXT:  # %bb.1:
 ; RV32IM-NEXT:    li a0, 20
+; RV32IM-NEXT:    ret
 ; RV32IM-NEXT:  .LBB43_2:
+; RV32IM-NEXT:    li a0, 10
 ; RV32IM-NEXT:    ret
 ;
 ; RV64IM-LABEL: select_cst1:
 ; RV64IM:       # %bb.0:
-; RV64IM-NEXT:    mv a1, a0
-; RV64IM-NEXT:    li a0, 10
-; RV64IM-NEXT:    bnez a1, .LBB43_2
+; RV64IM-NEXT:    bnez a0, .LBB43_2
 ; RV64IM-NEXT:  # %bb.1:
 ; RV64IM-NEXT:    li a0, 20
+; RV64IM-NEXT:    ret
 ; RV64IM-NEXT:  .LBB43_2:
+; RV64IM-NEXT:    li a0, 10
 ; RV64IM-NEXT:    ret
 ;
 ; RV64IMXVTCONDOPS-LABEL: select_cst1:
@@ -1664,24 +1664,24 @@ define i32 @select_cst1(i1 zeroext %cond) {
 define i32 @select_cst2(i1 zeroext %cond) {
 ; RV32IM-LABEL: select_cst2:
 ; RV32IM:       # %bb.0:
-; RV32IM-NEXT:    mv a1, a0
-; RV32IM-NEXT:    li a0, 10
-; RV32IM-NEXT:    bnez a1, .LBB44_2
+; RV32IM-NEXT:    bnez a0, .LBB44_2
 ; RV32IM-NEXT:  # %bb.1:
 ; RV32IM-NEXT:    lui a0, 5
 ; RV32IM-NEXT:    addi a0, a0, -480
+; RV32IM-NEXT:    ret
 ; RV32IM-NEXT:  .LBB44_2:
+; RV32IM-NEXT:    li a0, 10
 ; RV32IM-NEXT:    ret
 ;
 ; RV64IM-LABEL: select_cst2:
 ; RV64IM:       # %bb.0:
-; RV64IM-NEXT:    mv a1, a0
-; RV64IM-NEXT:    li a0, 10
-; RV64IM-NEXT:    bnez a1, .LBB44_2
+; RV64IM-NEXT:    bnez a0, .LBB44_2
 ; RV64IM-NEXT:  # %bb.1:
 ; RV64IM-NEXT:    lui a0, 5
 ; RV64IM-NEXT:    addiw a0, a0, -480
+; RV64IM-NEXT:    ret
 ; RV64IM-NEXT:  .LBB44_2:
+; RV64IM-NEXT:    li a0, 10
 ; RV64IM-NEXT:    ret
 ;
 ; RV64IMXVTCONDOPS-LABEL: select_cst2:
@@ -1782,24 +1782,24 @@ define i32 @select_cst4(i1 zeroext %cond) {
 define i32 @select_cst5(i1 zeroext %cond) {
 ; RV32IM-LABEL: select_cst5:
 ; RV32IM:       # %bb.0:
-; RV32IM-NEXT:    mv a1, a0
-; RV32IM-NEXT:    li a0, 2047
-; RV32IM-NEXT:    bnez a1, .LBB47_2
+; RV32IM-NEXT:    bnez a0, .LBB47_2
 ; RV32IM-NEXT:  # %bb.1:
 ; RV32IM-NEXT:    lui a0, 1
 ; RV32IM-NEXT:    addi a0, a0, -2047
+; RV32IM-NEXT:    ret
 ; RV32IM-NEXT:  .LBB47_2:
+; RV32IM-NEXT:    li a0, 2047
 ; RV32IM-NEXT:    ret
 ;
 ; RV64IM-LABEL: select_cst5:
 ; RV64IM:       # %bb.0:
-; RV64IM-NEXT:    mv a1, a0
-; RV64IM-NEXT:    li a0, 2047
-; RV64IM-NEXT:    bnez a1, .LBB47_2
+; RV64IM-NEXT:    bnez a0, .LBB47_2
 ; RV64IM-NEXT:  # %bb.1:
 ; RV64IM-NEXT:    lui a0, 1
 ; RV64IM-NEXT:    addiw a0, a0, -2047
+; RV64IM-NEXT:    ret
 ; RV64IM-NEXT:  .LBB47_2:
+; RV64IM-NEXT:    li a0, 2047
 ; RV64IM-NEXT:    ret
 ;
 ; RV64IMXVTCONDOPS-LABEL: select_cst5:
diff --git a/llvm/test/CodeGen/RISCV/sextw-removal.ll b/llvm/test/CodeGen/RISCV/sextw-removal.ll
index 8cf78551d28f98..58c3dd18875571 100644
--- a/llvm/test/CodeGen/RISCV/sextw-removal.ll
+++ b/llvm/test/CodeGen/RISCV/sextw-removal.ll
@@ -1032,82 +1032,86 @@ bb7:                                              ; preds = %bb2
 define signext i32 @bug(i32 signext %x) {
 ; CHECK-LABEL: bug:
 ; CHECK:       # %bb.0: # %entry
-; CHECK-NEXT:    beqz a0, .LBB18_4
+; CHECK-NEXT:    beqz a0, .LBB18_5
 ; CHECK-NEXT:  # %bb.1: # %if.end
-; CHECK-NEXT:    srliw a2, a0, 16
-; CHECK-NEXT:    seqz a1, a2
-; CHECK-NEXT:    slli a1, a1, 4
-; CHECK-NEXT:    sllw a1, a0, a1
-; CHECK-NEXT:    li a0, 16
-; CHECK-NEXT:    beqz a2, .LBB18_3
+; CHECK-NEXT:    srliw a1, a0, 16
+; CHECK-NEXT:    seqz a2, a1
+; CHECK-NEXT:    slli a2, a2, 4
+; CHECK-NEXT:    sllw a0, a0, a2
+; CHECK-NEXT:    beqz a1, .LBB18_3
 ; CHECK-NEXT:  # %bb.2: # %if.end
-; CHECK-NEXT:    li a0, 32
-; CHECK-NEXT:  .LBB18_3: # %if.end
-; CHECK-NEXT:    srliw a2, a1, 24
+; CHECK-NEXT:    li a1, 32
+; CHECK-NEXT:    j .LBB18_4
+; CHECK-NEXT:  .LBB18_3:
+; CHECK-NEXT:    li a1, 16
+; CHECK-NEXT:  .LBB18_4: # %if.end
+; CHECK-NEXT:    srliw a2, a0, 24
 ; CHECK-NEXT:    seqz a2, a2
 ; CHECK-NEXT:    slli a3, a2, 3
-; CHECK-NEXT:    sllw a1, a1, a3
+; CHECK-NEXT:    sllw a0, a0, a3
 ; CHECK-NEXT:    negw a2, a2
 ; CHECK-NEXT:    andi a2, a2, -8
-; CHECK-NEXT:    add a0, a0, a2
-; CHECK-NEXT:    srliw a2, a1, 28
+; CHECK-NEXT:    add a1, a1, a2
+; CHECK-NEXT:    srliw a2, a0, 28
 ; CHECK-NEXT:    seqz a2, a2
 ; CHECK-NEXT:    slli a3, a2, 2
-; CHECK-NEXT:    sllw a1, a1, a3
+; CHECK-NEXT:    sllw a0, a0, a3
 ; CHECK-NEXT:    negw a2, a2
 ; CHECK-NEXT:    andi a2, a2, -4
-; CHECK-NEXT:    add a0, a0, a2
-; CHECK-NEXT:    srliw a2, a1, 30
+; CHECK-NEXT:    add a1, a1, a2
+; CHECK-NEXT:    srliw a2, a0, 30
 ; CHECK-NEXT:    seqz a2, a2
 ; CHECK-NEXT:    slli a3, a2, 1
-; CHECK-NEXT:    sllw a1, a1, a3
+; CHECK-NEXT:    sllw a0, a0, a3
 ; CHECK-NEXT:    negw a2, a2
 ; CHECK-NEXT:    andi a2, a2, -2
-; CHECK-NEXT:    add a0, a0, a2
-; CHECK-NEXT:    not a1, a1
-; CHECK-NEXT:    srli a1, a1, 31
-; CHECK-NEXT:    addw a0, a0, a1
-; CHECK-NEXT:  .LBB18_4: # %cleanup
+; CHECK-NEXT:    add a1, a1, a2
+; CHECK-NEXT:    not a0, a0
+; CHECK-NEXT:    srli a0, a0, 31
+; CHECK-NEXT:    addw a0, a1, a0
+; CHECK-NEXT:  .LBB18_5: # %cleanup
 ; CHECK-NEXT:    ret
 ;
 ; NOREMOVAL-LABEL: bug:
 ; NOREMOVAL:       # %bb.0: # %entry
-; NOREMOVAL-NEXT:    beqz a0, .LBB18_4
+; NOREMOVAL-NEXT:    beqz a0, .LBB18_5
 ; NOREMOVAL-NEXT:  # %bb.1: # %if.end
-; NOREMOVAL-NEXT:    srliw a2, a0, 16
-; NOREMOVAL-NEXT:    seqz a1, a2
-; NOREMOVAL-NEXT:    slli a1, a1, 4
-; NOREMOVAL-NEXT:    sllw a1, a0, a1
-; NOREMOVAL-NEXT:    li a0, 16
-; NOREMOVAL-NEXT:    beqz a2, .LBB18_3
+; NOREMOVAL-NEXT:    srliw a1, a0, 16
+; NOREMOVAL-NEXT:    seqz a2, a1
+; NOREMOVAL-NEXT:    slli a2, a2, 4
+; NOREMOVAL-NEXT:    sllw a0, a0, a2
+; NOREMOVAL-NEXT:    beqz a1, .LBB18_3
 ; NOREMOVAL-NEXT:  # %bb.2: # %if.end
-; NOREMOVAL-NEXT:    li a0, 32
-; NOREMOVAL-NEXT:  .LBB18_3: # %if.end
-; NOREMOVAL-NEXT:    srliw a2, a1, 24
+; NOREMOVAL-NEXT:    li a1, 32
+; NOREMOVAL-NEXT:    j .LBB18_4
+; NOREMOVAL-NEXT:  .LBB18_3:
+; NOREMOVAL-NEXT:    li a1, 16
+; NOREMOVAL-NEXT:  .LBB18_4: # %if.end
+; NOREMOVAL-NEXT:    srliw a2, a0, 24
 ; NOREMOVAL-NEXT:    seqz a2, a2
 ; NOREMOVAL-NEXT:    slli a3, a2, 3
-; NOREMOVAL-NEXT:    sllw a1, a1, a3
+; NOREMOVAL-NEXT:    sllw a0, a0, a3
 ; NOREMOVAL-NEXT:    negw a2, a2
 ; NOREMOVAL-NEXT:    andi a2, a2, -8
-; NOREMOVAL-NEXT:    add a0, a0, a2
-; NOREMOVAL-NEXT:    srliw a2, a1, 28
+; NOREMOVAL-NEXT:    add a1, a1, a2
+; NOREMOVAL-NEXT:    srliw a2, a0, 28
 ; NOREMOVAL-NEXT:    seqz a2, a2
 ; NOREMOVAL-NEXT:    slli a3, a2, 2
-; NOREMOVAL-NEXT:    sllw a1, a1, a3
+; NOREMOVAL-NEXT:    sllw a0, a0, a3
 ; NOREMOVAL-NEXT:    negw a2, a2
 ; NOREMOVAL-NEXT:    andi a2, a2, -4
-; NOREMOVAL-NEXT:    add a0, a0, a2
-; NOREMOVAL-NEXT:    srliw a2, a1, 30
+; NOREMOVAL-NEXT:    add a1, a1, a2
+; NOREMOVAL-NEXT:    srliw a2, a0, 30
 ; NOREMOVAL-NEXT:    seqz a2, a2
 ; NOREMOVAL-NEXT:    slli a3, a2, 1
-; NOREMOVAL-NEXT:    sllw a1, a1, a3
+; NOREMOVAL-NEXT:    sllw a0, a0, a3
 ; NOREMOVAL-NEXT:    negw a2, a2
 ; NOREMOVAL-NEXT:    andi a2, a2, -2
-; NOREMOVAL-NEXT:    add a0, a0, a2
-; NOREMOVAL-NEXT:    not a1, a1
-; NOREMOVAL-NEXT:    srli a1, a1, 31
-; NOREMOVAL-NEXT:    addw a0, a0, a1
-; NOREMOVAL-NEXT:  .LBB18_4: # %cleanup
+; NOREMOVAL-NEXT:    add a1, a1, a2
+; NOREMOVAL-NEXT:    not a0, a0
+; NOREMOVAL-NEXT:    srli a0, a0, 31
+; NOREMOVAL-NEXT:    addw a0, a1, a0
+; NOREMOVAL-NEXT:  .LBB18_5: # %cleanup
 ; NOREMOVAL-NEXT:    ret
 entry:
   %tobool.not = icmp eq i32 %x, 0
diff --git a/llvm/test/CodeGen/RISCV/typepromotion-overflow.ll b/llvm/test/CodeGen/RISCV/typepromotion-overflow.ll
index ec7e0ecce80caa..ae1aabed498059 100644
--- a/llvm/test/CodeGen/RISCV/typepromotion-overflow.ll
+++ b/llvm/test/CodeGen/RISCV/typepromotion-overflow.ll
@@ -7,13 +7,14 @@ define zeroext i16 @overflow_add(i16 zeroext %a, i16 zeroext %b) {
 ; CHECK-NEXT:    add a0, a1, a0
 ; CHECK-NEXT:    ori a0, a0, 1
 ; CHECK-NEXT:    slli a0, a0, 48
-; CHECK-NEXT:    srli a1, a0, 48
-; CHECK-NEXT:    li a2, 1024
-; CHECK-NEXT:    li a0, 2
-; CHECK-NEXT:    bltu a2, a1, .LBB0_2
+; CHECK-NEXT:    srli a0, a0, 48
+; CHECK-NEXT:    li a1, 1024
+; CHECK-NEXT:    bltu a1, a0, .LBB0_2
 ; CHECK-NEXT:  # %bb.1:
 ; CHECK-NEXT:    li a0, 5
+; CHECK-NEXT:    ret
 ; CHECK-NEXT:  .LBB0_2:
+; CHECK-NEXT:    li a0, 2
 ; CHECK-NEXT:    ret
   %add = add i16 %b, %a
   %or = or i16 %add, 1
@@ -28,13 +29,14 @@ define zeroext i16 @overflow_sub(i16 zeroext %a, i16 zeroext %b) {
 ; CHECK-NEXT:    subw a0, a0, a1
 ; CHECK-NEXT:    ori a0, a0, 1
 ; CHECK-NEXT:    slli a0, a0, 48
-; CHECK-NEXT:    srli a1, a0, 48
-; CHECK-NEXT:    li a2, 1024
-; CHECK-NEXT:    li a0, 2
-; CHECK-NEXT:    bltu a2, a1, .LBB1_2
+; CHECK-NEXT:    srli a0, a0, 48
+; CHECK-NEXT:    li a1, 1024
+; CHECK-NEXT:    bltu a1, a0, .LBB1_2
 ; CHECK-NEXT:  # %bb.1:
 ; CHECK-NEXT:    li a0, 5
+; CHECK-NEXT:    ret
 ; CHECK-NEXT:  .LBB1_2:
+; CHECK-NEXT:    li a0, 2
 ; CHECK-NEXT:    ret
   %add = sub i16 %a, %b
   %or = or i16 %add, 1
@@ -49,13 +51,14 @@ define zeroext i16 @overflow_mul(i16 zeroext %a, i16 zeroext %b) {
 ; CHECK-NEXT:    mul a0, a1, a0
 ; CHECK-NEXT:    ori a0, a0, 1
 ; CHECK-NEXT:    slli a0, a0, 48
-; CHECK-NEXT:    srli a1, a0, 48
-; CHECK-NEXT:    li a2, 1024
-; CHECK-NEXT:    li a0, 2
-; CHECK-NEXT:    bltu a2, a1, .LBB2_2
+; CHECK-NEXT:    srli a0, a0, 48
+; CHECK-NEXT:    li a1, 1024
+; CHECK-NEXT:    bltu a1, a0, .LBB2_2
 ; CHECK-NEXT:  # %bb.1:
 ; CHECK-NEXT:    li a0, 5
+; CHECK-NEXT:    ret
 ; CHECK-NEXT:  .LBB2_2:
+; CHECK-NEXT:    li a0, 2
 ; CHECK-NEXT:    ret
   %add = mul i16 %b, %a
   %or = or i16 %add, 1
@@ -70,13 +73,14 @@ define zeroext i16 @overflow_shl(i16 zeroext %a, i16 zeroext %b) {
 ; CHECK-NEXT:    sll a0, a0, a1
 ; CHECK-NEXT:    ori a0, a0, 1
 ; CHECK-NEXT:    slli a0, a0, 48
-; CHECK-NEXT:    srli a1, a0, 48
-; CHECK-NEXT:    li a2, 1024
-; CHECK-NEXT:    li a0, 2
-; CHECK-NEXT:    bltu a2, a1, .LBB3_2
+; CHECK-NEXT:    srli a0, a0, 48
+; CHECK-NEXT:    li a1, 1024
+; CHECK-NEXT:    bltu a1, a0, .LBB3_2
 ; CHECK-NEXT:  # %bb.1:
 ; CHECK-NEXT:    li a0, 5
+; CHECK-NEXT:    ret
 ; CHECK-NEXT:  .LBB3_2:
+; CHECK-NEXT:    li a0, 2
 ; CHECK-NEXT:    ret
   %add = shl i16 %a, %b
   %or = or i16 %add, 1
@@ -89,12 +93,13 @@ define i32 @overflow_add_no_consts(i8 zeroext %a, i8 zeroext %b, i8 zeroext %lim
 ; CHECK-LABEL: overflow_add_no_consts:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    add a0, a1, a0
-; CHECK-NEXT:    andi a1, a0, 255
-; CHECK-NEXT:    li a0, 8
-; CHECK-NEXT:    bltu a2, a1, .LBB4_2
+; CHECK-NEXT:    andi a0, a0, 255
+; CHECK-NEXT:    bltu a2, a0, .LBB4_2
 ; CHECK-NEXT:  # %bb.1:
 ; CHECK-NEXT:    li a0, 16
+; CHECK-NEXT:    ret
 ; CHECK-NEXT:  .LBB4_2:
+; CHECK-NEXT:    li a0, 8
 ; CHECK-NEXT:    ret
   %add = add i8 %b, %a
   %cmp = icmp ugt i8 %add, %limit
@@ -106,13 +111,14 @@ define i32 @overflow_add_const_limit(i8 zeroext %a, i8 zeroext %b) {
 ; CHECK-LABEL: overflow_add_const_limit:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    add a0, a1, a0
-; CHECK-NEXT:    andi a1, a0, 255
-; CHECK-NEXT:    li a2, 128
-; CHECK-NEXT:    li a0, 8
-; CHECK-NEXT:    bltu a2, a1, .LBB5_2
+; CHECK-NEXT:    andi a0, a0, 255
+; CHECK-NEXT:    li a1, 128
+; CHECK-NEXT:    bltu a1, a0, .LBB5_2
 ; CHECK-NEXT:  # %bb.1:
 ; CHECK-NEXT:    li a0, 16
+; CHECK-NEXT:    ret
 ; CHECK-NEXT:  .LBB5_2:
+; CHECK-NEXT:    li a0, 8
 ; CHECK-NEXT:    ret
   %add = add i8 %b, %a
   %cmp = icmp ugt i8 %add, -128
@@ -124,13 +130,14 @@ define i32 @overflow_add_positive_const_limit(i8 zeroext %a) {
 ; CHECK-LABEL: overflow_add_positive_const_limit:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    slli a0, a0, 56
-; CHECK-NEXT:    srai a1, a0, 56
-; CHECK-NEXT:    li a2, -1
-; CHECK-NEXT:    li a0, 8
-; CHECK-NEXT:    blt a1, a2, .LBB6_2
+; CHECK-NEXT:    srai a0, a0, 56
+; CHECK-NEXT:    li a1, -1
+; CHECK-NEXT:    blt a0, a1, .LBB6_2
 ; CHECK-NEXT:  # %bb.1:
 ; CHECK-NEXT:    li a0, 16
+; CHECK-NEXT:    ret
 ; CHECK-NEXT:  .LBB6_2:
+; CHECK-NEXT:    li a0, 8
 ; CHECK-NEXT:    ret
   %cmp = icmp slt i8 %a, -1
   %res = select i1 %cmp, i32 8, i32 16
@@ -140,13 +147,13 @@ define i32 @overflow_add_positive_const_limit(i8 zeroext %a) {
 define i32 @unsafe_add_underflow(i8 zeroext %a) {
 ; CHECK-LABEL: unsafe_add_underflow:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    mv a1, a0
-; CHECK-NEXT:    li a2, 1
-; CHECK-NEXT:    li a0, 8
-; CHECK-NEXT:    beq a1, a2, .LBB7_2
+; CHECK-NEXT:    li a1, 1
+; CHECK-NEXT:    beq a0, a1, .LBB7_2
 ; CHECK-NEXT:  # %bb.1:
 ; CHECK-NEXT:    li a0, 16
+; CHECK-NEXT:    ret
 ; CHECK-NEXT:  .LBB7_2:
+; CHECK-NEXT:    li a0, 8
 ; CHECK-NEXT:    ret
   %cmp = icmp eq i8 %a, 1
   %res = select i1 %cmp, i32 8, i32 16
@@ -156,12 +163,12 @@ define i32 @unsafe_add_underflow(i8 zeroext %a) {
 define i32 @safe_add_underflow(i8 zeroext %a) {
 ; CHECK-LABEL: safe_add_underflow:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    mv a1, a0
-; CHECK-NEXT:    li a0, 8
-; CHECK-NEXT:    beqz a1, .LBB8_2
+; CHECK-NEXT:    beqz a0, .LBB8_2
 ; CHECK-NEXT:  # %bb.1:
 ; CHECK-NEXT:    li a0, 16
+; CHECK-NEXT:    ret
 ; CHECK-NEXT:  .LBB8_2:
+; CHECK-NEXT:    li a0, 8
 ; CHECK-NEXT:    ret
   %cmp = icmp eq i8 %a, 0
   %res = select i1 %cmp, i32 8, i32 16
@@ -171,13 +178,14 @@ define i32 @safe_add_underflow(i8 zeroext %a) {
 define i32 @safe_add_underflow_neg(i8 zeroext %a) {
 ; CHECK-LABEL: safe_add_underflow_neg:
 ; CHECK:       # %bb.0:
-; CHECK-N...
[truncated]

@preames
Collaborator Author

preames commented Sep 16, 2024

Looks like I screwed up this PR - I'd not updated all the tests. Working on fixing that, will update once done.


github-actions bot commented Sep 17, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

define i32 @f_medium_ledge_pos(i32 %in0) minsize {
%cmp = icmp CMPCOND i32 %in0, 32
%cmp = icmp CMPCOND i32 %in0, 33
Collaborator Author

The test delta in this file needs a bit of explanation. Most of the diff is me trying to understand the testing to find the actual problem. The only actual change is this line.

Essentially, the original test is confused about what it's testing. "32" can be handled via the c.addi offset scheme. However, here's the assembly we originally got:

	addi	a1, a0, -32
	li	a0, -99
	bnez	a1, .LBB0_2
# %bb.1:
	li	a0, 42
.LBB0_2:
	ret

Note that the c.addi form can only be matched when source and destination register are the same.

Once this change sinks the unrelated LI out of the way, the register allocator can do that, and we end up with a compressed addi.

From the naming of the test and the comment, this doesn't appear to have been the intended result, and thus I adjusted the constant to 33. If reviewers agree that this is the right test update, I'm going to land a rework of this test file separately.

Collaborator

I agree this is the right test update.

Collaborator Author

I landed these changes on main, and eventually figured out we could autogen this test without losing the spirit of the compressed checks. The resulting diff (there is a much bigger one than even I'd realized) should be a lot more obvious now. They do look to be improvements.

@@ -91,14 +87,14 @@ define i32 @f_small_edge_neg(i32 %in0) minsize {
; constant is medium and not fit in 6 bit (compress imm),
; but fit in 12 bit (imm)
; RV32IFDC-LABEL: <f_medium_ledge_pos>:
; RV32IFDC: addi [[MAYZEROREG:.*]], [[REG:.*]], -0x20
; RV32IFDC: RESBROPT [[MAYZEROREG]], [[PLACE:.*]]
; RV32IFDC: addi [[MAYZEROREG:.*]], [[REG:.*]], -0x21
Collaborator

Is REG here zero like in the RV32IFD case?

Collaborator

Nevermind. It's not.

preames added a commit that referenced this pull request Sep 17, 2024
Per the comment, this test is intending to test the first constant which
can't be encoded via a c.addi.  However, -32 *can* be encoded in a
c.addi, and all that's preventing it from doing so is the register
allocator's choice to use a different destination register on the
add than its source.  (Which the compressed form doesn't support.)

The current LLC codegen for this test looks like:

	addi	a1, a0, -32
	li	a0, -99
	bnez	a1, .LBB0_2
	li	a0, 42
.LBB0_2:
	ret

After #108889, we sink the LI, and
the register allocator picks the same source and dest register for the addi
resulting in the c.addi form being emitted.  So, to avoid a confusing diff
let's fix the test to check what was originally intended.
tmsri pushed a commit to tmsri/llvm-project that referenced this pull request Sep 19, 2024
@asb
Contributor

asb commented Sep 20, 2024

Looking at the diff for this with the GCC torture suite, I think there's one case where this change makes codegen slightly worse that would be worth you having a quick look at - 20020413-1.c. Previously, each conditional BB loaded an immediate and then branched to the common exit block. After this change, the BBs conditionally branch to the BB that loads the needed immediate, which then branches to the common exit block.

I don't think this is a blocker, as overall the change is positive.

@preames
Collaborator Author

preames commented Nov 25, 2024

Looking at the diff for this with the GCC torture suite, I think there's one case where this change makes codegen slightly worse that would be worth you having a quick look at - 20020413-1.c. Previously, each conditional BB loaded an immediate and then branched to the common exit block. After this change, the BBs conditionally branch to the BB that loads the needed immediate, which then branches to the common exit block.

I don't think this is a blocker, as overall the change is positive.

I took a look at this, and it's basically the patch working as intended. One interesting quirk is that it turned out to be specific to having a series of branches. With branch dispatch you see a code change with this patch. With switch dispatch, we already sink into the successor blocks (by construction). So, if anything, it's another reason to land this patch - consistency!

Can I get an LGTM here?

Collaborator

@topperc left a comment

LGTM

@preames preames merged commit 6657d4b into llvm:main Nov 26, 2024
9 checks passed
@preames preames deleted the pr-riscv-machine-sink-addi branch November 26, 2024 02:59
@llvm-ci
Collaborator

llvm-ci commented Nov 26, 2024

LLVM Buildbot has detected a new failure on builder openmp-offload-amdgpu-runtime running on omp-vega20-0 while building llvm at step 7 "Add check check-offload".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/30/builds/11013

Here is the relevant piece of the build log for reference:
Step 7 (Add check check-offload) failure: test (failure)
******************** TEST 'libomptarget :: amdgcn-amd-amdhsa :: sanitizer/kernel_crash_async.c' FAILED ********************
Exit Code: 1

Command Output (stdout):
--
# RUN: at line 2
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/clang -fopenmp    -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src  -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib  -fopenmp-targets=amdgcn-amd-amdhsa -O3 /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/sanitizer/kernel_crash_async.c -o /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/sanitizer/Output/kernel_crash_async.c.tmp /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a
# executed command: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/clang -fopenmp -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib -fopenmp-targets=amdgcn-amd-amdhsa -O3 /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/sanitizer/kernel_crash_async.c -o /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/sanitizer/Output/kernel_crash_async.c.tmp /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a
# note: command had no output on stdout or stderr
# RUN: at line 3
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/not --crash env -u LLVM_DISABLE_SYMBOLIZATION OFFLOAD_TRACK_NUM_KERNEL_LAUNCH_TRACES=1 /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/sanitizer/Output/kernel_crash_async.c.tmp 2>&1 | /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/sanitizer/kernel_crash_async.c --check-prefixes=TRACE
# executed command: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/not --crash env -u LLVM_DISABLE_SYMBOLIZATION OFFLOAD_TRACK_NUM_KERNEL_LAUNCH_TRACES=1 /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/sanitizer/Output/kernel_crash_async.c.tmp
# note: command had no output on stdout or stderr
# executed command: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/sanitizer/kernel_crash_async.c --check-prefixes=TRACE
# note: command had no output on stdout or stderr
# RUN: at line 4
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/not --crash /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/sanitizer/Output/kernel_crash_async.c.tmp 2>&1 | /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/sanitizer/kernel_crash_async.c --check-prefixes=CHECK
# executed command: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/not --crash /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/sanitizer/Output/kernel_crash_async.c.tmp
# note: command had no output on stdout or stderr
# executed command: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/sanitizer/kernel_crash_async.c --check-prefixes=CHECK
# note: command had no output on stdout or stderr
# RUN: at line 5
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/clang -fopenmp    -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src  -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib  -fopenmp-targets=amdgcn-amd-amdhsa -O3 /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/sanitizer/kernel_crash_async.c -o /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/sanitizer/Output/kernel_crash_async.c.tmp /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a -g
# executed command: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/clang -fopenmp -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib -fopenmp-targets=amdgcn-amd-amdhsa -O3 /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/sanitizer/kernel_crash_async.c -o /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/sanitizer/Output/kernel_crash_async.c.tmp /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a -g
# note: command had no output on stdout or stderr
# RUN: at line 6
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/not --crash env -u LLVM_DISABLE_SYMBOLIZATION OFFLOAD_TRACK_NUM_KERNEL_LAUNCH_TRACES=1 /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/sanitizer/Output/kernel_crash_async.c.tmp 2>&1 | /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/sanitizer/kernel_crash_async.c --check-prefixes=TRACE
# executed command: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/not --crash env -u LLVM_DISABLE_SYMBOLIZATION OFFLOAD_TRACK_NUM_KERNEL_LAUNCH_TRACES=1 /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/sanitizer/Output/kernel_crash_async.c.tmp
# note: command had no output on stdout or stderr
# executed command: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/sanitizer/kernel_crash_async.c --check-prefixes=TRACE
# note: command had no output on stdout or stderr
# RUN: at line 7
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/not --crash /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/sanitizer/Output/kernel_crash_async.c.tmp 2>&1 | /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/sanitizer/kernel_crash_async.c --check-prefixes=CHECK
# executed command: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/not --crash /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/sanitizer/Output/kernel_crash_async.c.tmp
# note: command had no output on stdout or stderr
# executed command: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/sanitizer/kernel_crash_async.c --check-prefixes=CHECK
# .---command stderr------------
# | /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/sanitizer/kernel_crash_async.c:39:11: error: CHECK: expected string not found in input
# | // CHECK: Kernel {{[0-9]}}: {{.*}} (__omp_offloading_{{.*}}_main_l29)
# |           ^
# | <stdin>:1:1: note: scanning from here
# | Display only launched kernel:
# | ^
# | <stdin>:2:23: note: possible intended match here
# | Kernel 'omp target in main @ 29 (__omp_offloading_802_b38838e_main_l29)'
# |                       ^
# | 
# | Input file: <stdin>
# | Check file: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/sanitizer/kernel_crash_async.c
# | 
...

mga-sc added a commit to mga-sc/llvm-project that referenced this pull request Nov 26, 2024
This test started failing after commit
[RISCV] Implement tail call optimization in machine outliner (PR llvm#115297).

Changes to the same file were merged today earlier
[TTI][RISCV] Unconditionally break critical edges to sink ADDI (PR llvm#108889).
wangpc-pp pushed a commit that referenced this pull request Nov 26, 2024
#117700)

This MR fixes the failing test `CodeGen/RISCV/compress-opt-select.ll`.

It failed due to the previously merged commit `[TTI][RISCV]
Unconditionally break critical edges to sink ADDI (PR #108889)`.

So, regenerated `compress-opt-select` test.
@michaelmaitland
Copy link
Contributor

michaelmaitland commented Mar 10, 2025

@topperc and I are finding that this change is causing a performance regression in dhrystone and perlbench_r.

It leads to splitting the edge and sinking which creates a new basic block and adds an extra branch that takes additional cycles.

Would it be possible for us to use better heuristics on when to split the edges?

I reduced the Perl_sv_cmp_flags function from 500.perlbench_r to highlight what is going on:

define fastcc i32 @Perl_sv_cmp_flags(ptr %sv1, ptr %sv2, i1 %cmp1, i32 %spec.select) nounwind {
entry:
  %tobool.not = icmp eq ptr %sv1, null
  br i1 %tobool.not, label %if.end, label %if.else

if.else:                                          ; preds = %entry
  br i1 %cmp1, label %cond.true, label %if.end

cond.true:                                        ; preds = %if.else
  br label %if.end

if.end:                                           ; preds = %cond.true, %if.else, %entry
  %pv1.0 = phi ptr [ %sv2, %cond.true ], [ null, %entry ], [ null, %if.else ]
  %tobool2.not = icmp eq ptr %sv2, null
  br i1 %tobool2.not, label %if.end106.thread, label %if.end18

if.end18:                                         ; preds = %if.end
  %call121 = call i32 null(ptr %pv1.0, ptr null, i64 0)
  br i1 %cmp1, label %if.else126, label %cleanup

if.end106.thread:                                 ; preds = %if.end
  %tobool107.not48 = icmp eq i64 0, 0
  br i1 %tobool107.not48, label %if.then108, label %cleanup

if.then108:                                       ; preds = %if.end106.thread
  br label %cleanup

if.else126:                                       ; preds = %if.end18
  br label %cleanup

cleanup:                                          ; preds = %if.else126, %if.then108, %if.end106.thread, %if.end18
  %retval.0 = phi i32 [ 1, %if.end106.thread ], [ 0, %if.then108 ], [ %spec.select, %if.else126 ], [ 0, %if.end18 ]
  ret i32 %retval.0
}
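The structural issue can be seen by computing the critical edges of the reduced CFG above - an edge A→B where A has multiple successors and B has multiple predecessors, which is exactly what the patch now splits in order to sink. A small sketch (a hypothetical helper, not LLVM's MachineSink code):

```python
def critical_edges(succs):
    # An edge (a, b) is critical when a has several successors and
    # b has several predecessors; sinking into b requires splitting the edge.
    preds = {}
    for a, ss in succs.items():
        for b in ss:
            preds.setdefault(b, set()).add(a)
    return sorted((a, b) for a, ss in succs.items() if len(ss) > 1
                  for b in ss if len(preds.get(b, ())) > 1)

# CFG of the reduced Perl_sv_cmp_flags above.
cfg = {
    "entry": ["if.end", "if.else"],
    "if.else": ["cond.true", "if.end"],
    "cond.true": ["if.end"],
    "if.end": ["if.end106.thread", "if.end18"],
    "if.end18": ["if.else126", "cleanup"],
    "if.end106.thread": ["if.then108", "cleanup"],
    "if.then108": ["cleanup"],
    "if.else126": ["cleanup"],
    "cleanup": [],
}
for edge in critical_edges(cfg):
    print(edge)
```

On this CFG the helper reports four critical edges, among them the two into %cleanup (from %if.end18 and %if.end106.thread) that carry the phi constants being sunk.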

With this reduced example, we used to generate:

Perl_sv_cmp_flags:                      # @Perl_sv_cmp_flags
# %bb.0:                                # %entry
        addi    sp, sp, -32
        sd      ra, 24(sp)                      # 8-byte Folded Spill
        sd      s0, 16(sp)                      # 8-byte Folded Spill
        sd      s1, 8(sp)                       # 8-byte Folded Spill
        sd      s2, 0(sp)                       # 8-byte Folded Spill
        mv      s2, a3
        andi    s0, a2, 1
        beqz    a0, .LBB0_3
# %bb.1:                                # %if.else
        beqz    s0, .LBB0_6
# %bb.2:                                # %cond.true
        mv      a0, a1
.LBB0_3:                                # %if.end
        beqz    a1, .LBB0_7
.LBB0_4:                                # %if.end18
        li      s1, 0
        li      a1, 0
        li      a2, 0
        jalr    s1
        beqz    s0, .LBB0_9
# %bb.5:                                # %if.else126
        mv      s1, s2
        j       .LBB0_9
.LBB0_6:
        li      a0, 0
        bnez    a1, .LBB0_4
.LBB0_7:                                # %if.end106.thread
        li      s1, 1
        bnez    zero, .LBB0_9
# %bb.8:                                # %if.then108
        li      s1, 0
.LBB0_9:                                # %cleanup
        mv      a0, s1
        ld      ra, 24(sp)                      # 8-byte Folded Reload
        ld      s0, 16(sp)                      # 8-byte Folded Reload
        ld      s1, 8(sp)                       # 8-byte Folded Reload
        ld      s2, 0(sp)                       # 8-byte Folded Reload
        addi    sp, sp, 32
        ret

Now we generate:

Perl_sv_cmp_flags:                      # @Perl_sv_cmp_flags
# %bb.0:                                # %entry
        addi    sp, sp, -32
        sd      ra, 24(sp)                      # 8-byte Folded Spill
        sd      s0, 16(sp)                      # 8-byte Folded Spill
        sd      s1, 8(sp)                       # 8-byte Folded Spill
        sd      s2, 0(sp)                       # 8-byte Folded Spill
        mv      s2, a3
        andi    s0, a2, 1
        beqz    a0, .LBB0_3
# %bb.1:                                # %if.else
        beqz    s0, .LBB0_6
# %bb.2:                                # %cond.true
        mv      a0, a1
.LBB0_3:                                # %if.end
        beqz    a1, .LBB0_7
.LBB0_4:                                # %if.end18
        li      s1, 0
        li      a1, 0
        li      a2, 0
        jalr    s1
        beqz    s0, .LBB0_10
# %bb.5:                                # %if.else126
        mv      s1, s2
        j       .LBB0_10
.LBB0_6:
        li      a0, 0
        bnez    a1, .LBB0_4
.LBB0_7:                                # %if.end106.thread
        beqz    zero, .LBB0_9
# %bb.8:
        li      s1, 1
        j       .LBB0_10
.LBB0_9:                                # %if.then108
        li      s1, 0
.LBB0_10:                               # %cleanup
        mv      a0, s1
        ld      ra, 24(sp)                      # 8-byte Folded Reload
        ld      s0, 16(sp)                      # 8-byte Folded Reload
        ld      s1, 8(sp)                       # 8-byte Folded Reload
        ld      s2, 0(sp)                       # 8-byte Folded Reload
        addi    sp, sp, 32
        ret

In this reduced case, we end up with an extra basic block (and consequently an extra jump) as a result of this PR.

Please note that I found this in the Perl_sv_cmp_flags function of perlbench, but when this patch is reverted, it helps the dynamic instruction count in S_regmatch much more than it does in Perl_sv_cmp_flags.

@preames
Copy link
Collaborator Author

preames commented Mar 10, 2025

@michaelmaitland - I will take a look at this tomorrow, and get back to you.

@preames
Copy link
Collaborator Author

preames commented Mar 11, 2025

@michaelmaitland Can I ask you to file a bug for the regression you noted? I think this may have been reduced a bit past usefulness. Looking at the after assembly, the "right" answer for this case is for this:

        beqz    zero, .LBB0_9
# %bb.8:
        li      s1, 1
        j       .LBB0_10
.LBB0_9:                                # %if.then108
        li      s1, 0
.LBB0_10:                               # %cleanup

To become a snez. However, when I opened up the precodegen IR for the routine you mentioned, I don't see this pattern in the IR. I'm not claiming that my particular snapshot from some time ago is evidence against there being a regression here. I'm just saying I need to know how to reproduce the original example, not just the reduced case which may or may not be representative. I suspect that e.g. LTO vs non-LTO might be very important here.
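For reference, the branch-and-li diamond above materializes a boolean (cond != 0) into a register, which RISC-V can do branchlessly with a single snez (set if not equal to zero). A plain-Python sketch of the equivalence (hypothetical, just to illustrate the intended fold):

```python
def via_branch(cond: int) -> int:
    # Models the .LBB0_8 / .LBB0_9 diamond: branch, then li 1 or li 0.
    if cond == 0:
        return 0
    return 1

def via_snez(cond: int) -> int:
    # Models "snez rd, rs": rd = 1 if rs != 0 else 0, with no branch.
    return int(cond != 0)

print(all(via_branch(c) == via_snez(c) for c in (-5, 0, 1, 123)))  # → True
```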

For the record, I do not plan to revert this without significant investigation first. It's been four months since this landed.
