[LV] Support generating masks for switch terminators. #99808
Conversation
@llvm/pr-subscribers-clang @llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes: Update createEdgeMask to create masks where the terminator in Src is a switch. We need to handle 2 separate cases:
1. Dst is not the default destination: Dst is reached if any of the cases with destination == Dst are taken; join the conditions for those cases using a logical OR.
2. Dst is the default destination: Dst is reached if none of the cases with destination != Dst are taken; join the conditions for those cases using a logical OR and negate the result.
Fixes #48188. Patch is 84.17 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/99808.diff 7 Files Affected:
diff --git a/clang/test/Frontend/optimization-remark-analysis.c b/clang/test/Frontend/optimization-remark-analysis.c
index e43984942a6ef..9d8917265a320 100644
--- a/clang/test/Frontend/optimization-remark-analysis.c
+++ b/clang/test/Frontend/optimization-remark-analysis.c
@@ -1,7 +1,7 @@
// RUN: %clang -O1 -fvectorize -target x86_64-unknown-unknown -emit-llvm -Rpass-analysis -S %s -o - 2>&1 | FileCheck %s --check-prefix=RPASS
// RUN: %clang -O1 -fvectorize -target x86_64-unknown-unknown -emit-llvm -S %s -o - 2>&1 | FileCheck %s
-// RPASS: {{.*}}:12:5: remark: loop not vectorized: loop contains a switch statement
+// RPASS-NOT: {{.*}}:12:5: remark: loop not vectorized
// CHECK-NOT: remark: loop not vectorized: loop contains a switch statement
double foo(int N, int *Array) {
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
index f54eebb2874ab..7f84455150093 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
@@ -1348,11 +1348,11 @@ bool LoopVectorizationLegality::canVectorizeWithIfConvert() {
// Collect the blocks that need predication.
for (BasicBlock *BB : TheLoop->blocks()) {
// We don't support switch statements inside loops.
- if (!isa<BranchInst>(BB->getTerminator())) {
- reportVectorizationFailure("Loop contains a switch statement",
- "loop contains a switch statement",
- "LoopContainsSwitch", ORE, TheLoop,
- BB->getTerminator());
+ if (!isa<BranchInst, SwitchInst>(BB->getTerminator())) {
+ reportVectorizationFailure("Loop contains an unsupported termaintor",
+ "loop contains an unsupported terminator",
+ "LoopContainsUnsupportedTerminator", ORE,
+ TheLoop, BB->getTerminator());
return false;
}
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 6d28b8fabe42e..2530762e3e424 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7763,6 +7763,41 @@ VPValue *VPRecipeBuilder::createEdgeMask(BasicBlock *Src, BasicBlock *Dst) {
VPValue *SrcMask = getBlockInMask(Src);
+ if (auto *SI = dyn_cast<SwitchInst>(Src->getTerminator())) {
+ // Create mask where the terminator in Src is a switch. We need to handle 2
+ // separate cases:
+ // 1. Dst is not the default desintation. Dst is reached if any of the cases
+ // with destination == Dst are taken. Join the conditions for each case
+ // where destination == Dst using a logical OR.
+ // 2. Dst is the default destination. Dst is reached if none of the cases
+ // with destination != Dst are taken. Join the conditions for each case
+ // where the destination is != Dst using a logical OR and negate it.
+ VPValue *Mask = nullptr;
+ VPValue *Cond = getVPValueOrAddLiveIn(SI->getCondition(), Plan);
+ bool IsDefault = SI->getDefaultDest() == Dst;
+ for (auto &C : SI->cases()) {
+ if (IsDefault) {
+ if (C.getCaseSuccessor() == Dst)
+ continue;
+ } else if (C.getCaseSuccessor() != Dst)
+ continue;
+
+ VPValue *Eq = EdgeMaskCache.lookup({Src, C.getCaseSuccessor()});
+ if (!Eq) {
+ VPValue *V = getVPValueOrAddLiveIn(C.getCaseValue(), Plan);
+ Eq = Builder.createICmp(CmpInst::ICMP_EQ, Cond, V);
+ }
+ if (Mask)
+ Mask = Builder.createOr(Mask, Eq);
+ else
+ Mask = Eq;
+ }
+ if (IsDefault)
+ Mask = Builder.createNot(Mask);
+ assert(Mask && "mask must be created");
+ return EdgeMaskCache[Edge] = Mask;
+ }
+
// The terminator has to be a branch inst!
BranchInst *BI = dyn_cast<BranchInst>(Src->getTerminator());
assert(BI && "Unexpected terminator found");
diff --git a/llvm/test/Transforms/LoopVectorize/X86/predicate-switch.ll b/llvm/test/Transforms/LoopVectorize/X86/predicate-switch.ll
index b8ce3c40920a3..ff73a149c8e39 100644
--- a/llvm/test/Transforms/LoopVectorize/X86/predicate-switch.ll
+++ b/llvm/test/Transforms/LoopVectorize/X86/predicate-switch.ll
@@ -6,9 +6,43 @@ define void @switch_default_to_latch_common_dest(ptr %start, ptr %end) {
; IC1-LABEL: define void @switch_default_to_latch_common_dest(
; IC1-SAME: ptr [[START:%.*]], ptr [[END:%.*]]) #[[ATTR0:[0-9]+]] {
; IC1-NEXT: [[ENTRY:.*]]:
+; IC1-NEXT: [[START2:%.*]] = ptrtoint ptr [[START]] to i64
+; IC1-NEXT: [[END1:%.*]] = ptrtoint ptr [[END]] to i64
+; IC1-NEXT: [[TMP0:%.*]] = add i64 [[END1]], -8
+; IC1-NEXT: [[TMP1:%.*]] = sub i64 [[TMP0]], [[START2]]
+; IC1-NEXT: [[TMP2:%.*]] = lshr i64 [[TMP1]], 3
+; IC1-NEXT: [[TMP3:%.*]] = add nuw nsw i64 [[TMP2]], 1
+; IC1-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP3]], 4
+; IC1-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; IC1: [[VECTOR_PH]]:
+; IC1-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP3]], 4
+; IC1-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP3]], [[N_MOD_VF]]
+; IC1-NEXT: [[TMP4:%.*]] = mul i64 [[N_VEC]], 8
+; IC1-NEXT: [[IND_END:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP4]]
+; IC1-NEXT: br label %[[VECTOR_BODY:.*]]
+; IC1: [[VECTOR_BODY]]:
+; IC1-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; IC1-NEXT: [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 8
+; IC1-NEXT: [[TMP5:%.*]] = add i64 [[OFFSET_IDX]], 0
+; IC1-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP5]]
+; IC1-NEXT: [[TMP6:%.*]] = getelementptr i64, ptr [[NEXT_GEP]], i32 0
+; IC1-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[TMP6]], align 1
+; IC1-NEXT: [[TMP7:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 -12, i64 -12, i64 -12, i64 -12>
+; IC1-NEXT: [[TMP8:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 13, i64 13, i64 13, i64 13>
+; IC1-NEXT: [[TMP9:%.*]] = or <4 x i1> [[TMP7]], [[TMP8]]
+; IC1-NEXT: [[TMP10:%.*]] = or <4 x i1> [[TMP9]], [[TMP9]]
+; IC1-NEXT: call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 42, i64 42, i64 42, i64 42>, ptr [[TMP6]], i32 1, <4 x i1> [[TMP10]])
+; IC1-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
+; IC1-NEXT: [[TMP11:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; IC1-NEXT: br i1 [[TMP11]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; IC1: [[MIDDLE_BLOCK]]:
+; IC1-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP3]], [[N_VEC]]
+; IC1-NEXT: br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; IC1: [[SCALAR_PH]]:
+; IC1-NEXT: [[BC_RESUME_VAL:%.*]] = phi ptr [ [[IND_END]], %[[MIDDLE_BLOCK]] ], [ [[START]], %[[ENTRY]] ]
; IC1-NEXT: br label %[[LOOP_HEADER:.*]]
; IC1: [[LOOP_HEADER]]:
-; IC1-NEXT: [[PTR_IV:%.*]] = phi ptr [ [[START]], %[[ENTRY]] ], [ [[PTR_IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ]
+; IC1-NEXT: [[PTR_IV:%.*]] = phi ptr [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[PTR_IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ]
; IC1-NEXT: [[L:%.*]] = load i64, ptr [[PTR_IV]], align 1
; IC1-NEXT: switch i64 [[L]], label %[[LOOP_LATCH]] [
; IC1-NEXT: i64 -12, label %[[IF_THEN:.*]]
@@ -20,16 +54,59 @@ define void @switch_default_to_latch_common_dest(ptr %start, ptr %end) {
; IC1: [[LOOP_LATCH]]:
; IC1-NEXT: [[PTR_IV_NEXT]] = getelementptr inbounds i64, ptr [[PTR_IV]], i64 1
; IC1-NEXT: [[EC:%.*]] = icmp eq ptr [[PTR_IV_NEXT]], [[END]]
-; IC1-NEXT: br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP_HEADER]]
+; IC1-NEXT: br i1 [[EC]], label %[[EXIT]], label %[[LOOP_HEADER]], !llvm.loop [[LOOP3:![0-9]+]]
; IC1: [[EXIT]]:
; IC1-NEXT: ret void
;
; IC2-LABEL: define void @switch_default_to_latch_common_dest(
; IC2-SAME: ptr [[START:%.*]], ptr [[END:%.*]]) #[[ATTR0:[0-9]+]] {
; IC2-NEXT: [[ENTRY:.*]]:
+; IC2-NEXT: [[START2:%.*]] = ptrtoint ptr [[START]] to i64
+; IC2-NEXT: [[END1:%.*]] = ptrtoint ptr [[END]] to i64
+; IC2-NEXT: [[TMP0:%.*]] = add i64 [[END1]], -8
+; IC2-NEXT: [[TMP1:%.*]] = sub i64 [[TMP0]], [[START2]]
+; IC2-NEXT: [[TMP2:%.*]] = lshr i64 [[TMP1]], 3
+; IC2-NEXT: [[TMP3:%.*]] = add nuw nsw i64 [[TMP2]], 1
+; IC2-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP3]], 8
+; IC2-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; IC2: [[VECTOR_PH]]:
+; IC2-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP3]], 8
+; IC2-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP3]], [[N_MOD_VF]]
+; IC2-NEXT: [[TMP4:%.*]] = mul i64 [[N_VEC]], 8
+; IC2-NEXT: [[IND_END:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP4]]
+; IC2-NEXT: br label %[[VECTOR_BODY:.*]]
+; IC2: [[VECTOR_BODY]]:
+; IC2-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; IC2-NEXT: [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 8
+; IC2-NEXT: [[TMP5:%.*]] = add i64 [[OFFSET_IDX]], 0
+; IC2-NEXT: [[TMP6:%.*]] = add i64 [[OFFSET_IDX]], 32
+; IC2-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP5]]
+; IC2-NEXT: [[NEXT_GEP3:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP6]]
+; IC2-NEXT: [[TMP7:%.*]] = getelementptr i64, ptr [[NEXT_GEP]], i32 0
+; IC2-NEXT: [[TMP8:%.*]] = getelementptr i64, ptr [[NEXT_GEP]], i32 4
+; IC2-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[TMP7]], align 1
+; IC2-NEXT: [[WIDE_LOAD4:%.*]] = load <4 x i64>, ptr [[TMP8]], align 1
+; IC2-NEXT: [[TMP9:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 -12, i64 -12, i64 -12, i64 -12>
+; IC2-NEXT: [[TMP10:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD4]], <i64 -12, i64 -12, i64 -12, i64 -12>
+; IC2-NEXT: [[TMP11:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 13, i64 13, i64 13, i64 13>
+; IC2-NEXT: [[TMP12:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD4]], <i64 13, i64 13, i64 13, i64 13>
+; IC2-NEXT: [[TMP13:%.*]] = or <4 x i1> [[TMP9]], [[TMP11]]
+; IC2-NEXT: [[TMP14:%.*]] = or <4 x i1> [[TMP10]], [[TMP12]]
+; IC2-NEXT: [[TMP15:%.*]] = or <4 x i1> [[TMP13]], [[TMP13]]
+; IC2-NEXT: [[TMP16:%.*]] = or <4 x i1> [[TMP14]], [[TMP14]]
+; IC2-NEXT: call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 42, i64 42, i64 42, i64 42>, ptr [[TMP7]], i32 1, <4 x i1> [[TMP15]])
+; IC2-NEXT: call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 42, i64 42, i64 42, i64 42>, ptr [[TMP8]], i32 1, <4 x i1> [[TMP16]])
+; IC2-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
+; IC2-NEXT: [[TMP17:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; IC2-NEXT: br i1 [[TMP17]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; IC2: [[MIDDLE_BLOCK]]:
+; IC2-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP3]], [[N_VEC]]
+; IC2-NEXT: br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; IC2: [[SCALAR_PH]]:
+; IC2-NEXT: [[BC_RESUME_VAL:%.*]] = phi ptr [ [[IND_END]], %[[MIDDLE_BLOCK]] ], [ [[START]], %[[ENTRY]] ]
; IC2-NEXT: br label %[[LOOP_HEADER:.*]]
; IC2: [[LOOP_HEADER]]:
-; IC2-NEXT: [[PTR_IV:%.*]] = phi ptr [ [[START]], %[[ENTRY]] ], [ [[PTR_IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ]
+; IC2-NEXT: [[PTR_IV:%.*]] = phi ptr [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[PTR_IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ]
; IC2-NEXT: [[L:%.*]] = load i64, ptr [[PTR_IV]], align 1
; IC2-NEXT: switch i64 [[L]], label %[[LOOP_LATCH]] [
; IC2-NEXT: i64 -12, label %[[IF_THEN:.*]]
@@ -41,7 +118,7 @@ define void @switch_default_to_latch_common_dest(ptr %start, ptr %end) {
; IC2: [[LOOP_LATCH]]:
; IC2-NEXT: [[PTR_IV_NEXT]] = getelementptr inbounds i64, ptr [[PTR_IV]], i64 1
; IC2-NEXT: [[EC:%.*]] = icmp eq ptr [[PTR_IV_NEXT]], [[END]]
-; IC2-NEXT: br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP_HEADER]]
+; IC2-NEXT: br i1 [[EC]], label %[[EXIT]], label %[[LOOP_HEADER]], !llvm.loop [[LOOP3:![0-9]+]]
; IC2: [[EXIT]]:
; IC2-NEXT: ret void
;
@@ -73,9 +150,48 @@ define void @switch_all_dests_distinct(ptr %start, ptr %end) {
; IC1-LABEL: define void @switch_all_dests_distinct(
; IC1-SAME: ptr [[START:%.*]], ptr [[END:%.*]]) #[[ATTR0]] {
; IC1-NEXT: [[ENTRY:.*]]:
+; IC1-NEXT: [[START2:%.*]] = ptrtoint ptr [[START]] to i64
+; IC1-NEXT: [[END1:%.*]] = ptrtoint ptr [[END]] to i64
+; IC1-NEXT: [[TMP0:%.*]] = add i64 [[END1]], -8
+; IC1-NEXT: [[TMP1:%.*]] = sub i64 [[TMP0]], [[START2]]
+; IC1-NEXT: [[TMP2:%.*]] = lshr i64 [[TMP1]], 3
+; IC1-NEXT: [[TMP3:%.*]] = add nuw nsw i64 [[TMP2]], 1
+; IC1-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP3]], 4
+; IC1-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; IC1: [[VECTOR_PH]]:
+; IC1-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP3]], 4
+; IC1-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP3]], [[N_MOD_VF]]
+; IC1-NEXT: [[TMP4:%.*]] = mul i64 [[N_VEC]], 8
+; IC1-NEXT: [[IND_END:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP4]]
+; IC1-NEXT: br label %[[VECTOR_BODY:.*]]
+; IC1: [[VECTOR_BODY]]:
+; IC1-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; IC1-NEXT: [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 8
+; IC1-NEXT: [[TMP5:%.*]] = add i64 [[OFFSET_IDX]], 0
+; IC1-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP5]]
+; IC1-NEXT: [[TMP6:%.*]] = getelementptr i64, ptr [[NEXT_GEP]], i32 0
+; IC1-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[TMP6]], align 1
+; IC1-NEXT: [[TMP7:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], zeroinitializer
+; IC1-NEXT: call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 1, i64 1, i64 1, i64 1>, ptr [[TMP6]], i32 1, <4 x i1> [[TMP7]])
+; IC1-NEXT: [[TMP8:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 13, i64 13, i64 13, i64 13>
+; IC1-NEXT: call void @llvm.masked.store.v4i64.p0(<4 x i64> zeroinitializer, ptr [[TMP6]], i32 1, <4 x i1> [[TMP8]])
+; IC1-NEXT: [[TMP9:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 -12, i64 -12, i64 -12, i64 -12>
+; IC1-NEXT: call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 42, i64 42, i64 42, i64 42>, ptr [[TMP6]], i32 1, <4 x i1> [[TMP9]])
+; IC1-NEXT: [[TMP10:%.*]] = or <4 x i1> [[TMP9]], [[TMP8]]
+; IC1-NEXT: [[TMP11:%.*]] = or <4 x i1> [[TMP10]], [[TMP7]]
+; IC1-NEXT: [[TMP12:%.*]] = xor <4 x i1> [[TMP11]], <i1 true, i1 true, i1 true, i1 true>
+; IC1-NEXT: call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 2, i64 2, i64 2, i64 2>, ptr [[TMP6]], i32 1, <4 x i1> [[TMP12]])
+; IC1-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
+; IC1-NEXT: [[TMP13:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; IC1-NEXT: br i1 [[TMP13]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; IC1: [[MIDDLE_BLOCK]]:
+; IC1-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP3]], [[N_VEC]]
+; IC1-NEXT: br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; IC1: [[SCALAR_PH]]:
+; IC1-NEXT: [[BC_RESUME_VAL:%.*]] = phi ptr [ [[IND_END]], %[[MIDDLE_BLOCK]] ], [ [[START]], %[[ENTRY]] ]
; IC1-NEXT: br label %[[LOOP_HEADER:.*]]
; IC1: [[LOOP_HEADER]]:
-; IC1-NEXT: [[PTR_IV:%.*]] = phi ptr [ [[START]], %[[ENTRY]] ], [ [[PTR_IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ]
+; IC1-NEXT: [[PTR_IV:%.*]] = phi ptr [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[PTR_IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ]
; IC1-NEXT: [[L:%.*]] = load i64, ptr [[PTR_IV]], align 1
; IC1-NEXT: switch i64 [[L]], label %[[DEFAULT:.*]] [
; IC1-NEXT: i64 -12, label %[[IF_THEN_1:.*]]
@@ -97,16 +213,69 @@ define void @switch_all_dests_distinct(ptr %start, ptr %end) {
; IC1: [[LOOP_LATCH]]:
; IC1-NEXT: [[PTR_IV_NEXT]] = getelementptr inbounds i64, ptr [[PTR_IV]], i64 1
; IC1-NEXT: [[EC:%.*]] = icmp eq ptr [[PTR_IV_NEXT]], [[END]]
-; IC1-NEXT: br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP_HEADER]]
+; IC1-NEXT: br i1 [[EC]], label %[[EXIT]], label %[[LOOP_HEADER]], !llvm.loop [[LOOP5:![0-9]+]]
; IC1: [[EXIT]]:
; IC1-NEXT: ret void
;
; IC2-LABEL: define void @switch_all_dests_distinct(
; IC2-SAME: ptr [[START:%.*]], ptr [[END:%.*]]) #[[ATTR0]] {
; IC2-NEXT: [[ENTRY:.*]]:
+; IC2-NEXT: [[START2:%.*]] = ptrtoint ptr [[START]] to i64
+; IC2-NEXT: [[END1:%.*]] = ptrtoint ptr [[END]] to i64
+; IC2-NEXT: [[TMP0:%.*]] = add i64 [[END1]], -8
+; IC2-NEXT: [[TMP1:%.*]] = sub i64 [[TMP0]], [[START2]]
+; IC2-NEXT: [[TMP2:%.*]] = lshr i64 [[TMP1]], 3
+; IC2-NEXT: [[TMP3:%.*]] = add nuw nsw i64 [[TMP2]], 1
+; IC2-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP3]], 8
+; IC2-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; IC2: [[VECTOR_PH]]:
+; IC2-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP3]], 8
+; IC2-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP3]], [[N_MOD_VF]]
+; IC2-NEXT: [[TMP4:%.*]] = mul i64 [[N_VEC]], 8
+; IC2-NEXT: [[IND_END:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP4]]
+; IC2-NEXT: br label %[[VECTOR_BODY:.*]]
+; IC2: [[VECTOR_BODY]]:
+; IC2-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; IC2-NEXT: [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 8
+; IC2-NEXT: [[TMP5:%.*]] = add i64 [[OFFSET_IDX]], 0
+; IC2-NEXT: [[TMP6:%.*]] = add i64 [[OFFSET_IDX]], 32
+; IC2-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP5]]
+; IC2-NEXT: [[NEXT_GEP3:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP6]]
+; IC2-NEXT: [[TMP7:%.*]] = getelementptr i64, ptr [[NEXT_GEP]], i32 0
+; IC2-NEXT: [[TMP8:%.*]] = getelementptr i64, ptr [[NEXT_GEP]], i32 4
+; IC2-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[TMP7]], align 1
+; IC2-NEXT: [[WIDE_LOAD4:%.*]] = load <4 x i64>, ptr [[TMP8]], align 1
+; IC2-NEXT: [[TMP9:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], zeroinitializer
+; IC2-NEXT: [[TMP10:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD4]], zeroinitializer
+; IC2-NEXT: call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 1, i64 1, i64 1, i64 1>, ptr [[TMP7]], i32 1, <4 x i1> [[TMP9]])
+; IC2-NEXT: call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 1, i64 1, i64 1, i64 1>, ptr [[TMP8]], i32 1, <4 x i1> [[TMP10]])
+; IC2-NEXT: [[TMP11:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 13, i64 13, i64 13, i64 13>
+; IC2-NEXT: [[TMP12:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD4]], <i64 13, i64 13, i64 13, i64 13>
+; IC2-NEXT: call void @llvm.masked.store.v4i64.p0(<4 x i64> zeroinitializer, ptr [[TMP7]], i32 1, <4 x i1> [[TMP11]])
+; IC2-NEXT: call void @llvm.masked.store.v4i64.p0(<4 x i64> zeroinitializer, ptr [[TMP8]], i32 1, <4 x i1> [[TMP12]])
+; IC2-NEXT: [[TMP13:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 -12, i64 -12, i64 -12, i64 -12>
+; IC2-NEXT: [[TMP14:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD4]], <i64 -12, i64 -12, i64 -12, i64 -12>
+; IC2-NEXT: call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 42, i64 42, i64 42, i64 42>, ptr [[TMP7]], i32 1, <4 x i1> [[TMP13]])
+; IC2-NEXT: call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 42, i64 42, i64 42, i64 42>, ptr [[TMP8]], i32 1, <4 x i1> [[TMP14]])
+; IC2-NEXT: [[TMP15:%.*]] = or <4 x i1> [[TMP13]], [[TMP11]]
+; IC2-NEXT: [[TMP16:%.*]] = or <4 x i1> [[TMP14]], [[TMP12]]
+; IC2-NEXT: [[TMP17:%.*]] = or <4 x i1> [[TMP15]], [[TMP9]]
+; IC2-NEXT: [[TMP18:%.*]] = or <4 x i1> [[TMP16]], [[TMP10]]
+; IC2-NEXT: [[TMP19:%.*]] = xor <4 x i1> [[TMP17]], <i1 true, i1 true, i1 true, i1 true>
+; IC2-NEXT: [[TMP20:%.*]] = xor <4 x i1> [[TMP18]], <i1 true, i1 true, i1 true, i1 true>
+; IC2-NEXT: call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 2, i64 2, i64 2, i64 2>, ptr [[TMP7]], i32 1, <4 x i1> [[TMP19]])
+; IC2-NEXT: call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 2, i64 2, i64 2, i64 2>, ptr [[TMP8]], i32 1, <4 x i1> [[TMP20]])
+; IC2-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
+; IC2-NEXT: [[TMP21:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; IC2-NEXT: br i1 [[TMP21]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; IC2: [[MIDDLE_BLOCK]]:
+; IC2-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP3]], [[N_VEC]]
+; IC2-NEXT: br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; IC2: [[SCALAR_PH]]:
+; IC2-NEXT: [[BC_RES...
[truncated]
Great to see that switch statements can be vectorized by only teaching createEdgeMask() how to mask them! This conceptually replaces a switch statement with a series of non-cascading conditional branches - one per unique successor, along with if-converting them. In the long run each of these steps may be modelled separately in VPlan.
Also great to see the tests pre-committed! Will review them next.
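The "one non-cascading conditional branch per unique successor, then if-convert" view above can be sketched in scalar form. The function names are hypothetical; the -12/13/42 values mirror the test below, where two cases share one destination.

```cpp
#include <cassert>
#include <cstdint>

// Original control flow: a switch whose -12 and 13 cases share a destination.
int64_t processSwitch(int64_t L) {
  switch (L) {
  case -12:
    return 42;
  case 13:
    return 42; // shares a destination with the -12 case
  default:
    return L;
  }
}

// Conceptual replacement: one independent condition per unique successor,
// with each successor's effect predicated (if-converted) on its mask.
int64_t processIfConverted(int64_t L) {
  bool MThen = (L == -12) || (L == 13); // mask for the shared destination
  int64_t R = L;                        // default keeps the incoming value
  if (MThen)
    R = 42; // predicated update, one per unique successor
  return R;
}
```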
@@ -1,7 +1,7 @@
// RUN: %clang -O1 -fvectorize -target x86_64-unknown-unknown -emit-llvm -Rpass-analysis -S %s -o - 2>&1 | FileCheck %s --check-prefix=RPASS
// RUN: %clang -O1 -fvectorize -target x86_64-unknown-unknown -emit-llvm -S %s -o - 2>&1 | FileCheck %s

-// RPASS: {{.*}}:12:5: remark: loop not vectorized: loop contains a switch statement
+// RPASS-NOT: {{.*}}:12:5: remark: loop not vectorized
// CHECK-NOT: remark: loop not vectorized: loop contains a switch statement
This is already CHECK-NOT'd before the patch?
I think that's to check that the message doesn't get emitted when -Rpass-analysis
isn't passed. The function in the test likely needs to be replaced to preserve checking the plumbing for Rpass-analysis.
Test adjusted to use a non-vectorizable call: 4b3bc46
Changes here are gone
@@ -1348,11 +1348,11 @@ bool LoopVectorizationLegality::canVectorizeWithIfConvert() {
  // Collect the blocks that need predication.
  for (BasicBlock *BB : TheLoop->blocks()) {
    // We don't support switch statements inside loops.
-// We don't support switch statements inside loops.
+// We support only branches and switch statements as terminators inside the loop.
Updated, thanks!
if (IsDefault) {
  if (C.getCaseSuccessor() == Dst)
    continue;
} else if (C.getCaseSuccessor() != Dst)
-if (IsDefault) {
-  if (C.getCaseSuccessor() == Dst)
-    continue;
-} else if (C.getCaseSuccessor() != Dst)
+bool IsCaseSuccessor = C.getCaseSuccessor() == Dst;
+if ((!IsDefault && !IsCaseSuccessor) || (IsDefault && IsCaseSuccessor))
+  continue;

(can fold into if (IsDefault == IsCaseSuccessor) if that seems better)
Folded, thanks!
VPValue *Mask = nullptr;
VPValue *Cond = getVPValueOrAddLiveIn(SI->getCondition(), Plan);
bool IsDefault = SI->getDefaultDest() == Dst;
for (auto &C : SI->cases()) {
Note that the number of cases N may be large. All N cases are traversed here per successor, potentially in O(N^2). This potential compile-time concern could be addressed by building a reverse mapping from destinations to conditions, or creating all edge masks from Src to all its successors together.
There's also a potential cost/profitability concern: as N increases the performance advantage of vectorizing diminishes, as it applies if-conversion with a mask per successor, while the original scalar version may use table lookups or cascading conditional branches. Worth bounding the number of cases to something "reasonable", and/or checking that CM considers the associated cost "accurately"?
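The suggested reverse mapping could look like the sketch below. Names and types are hypothetical stand-ins (destinations are plain ints, and case values stand in for the compares that would later be created per destination); the point is the single O(N) pass over the cases.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// One pass over the N cases builds destination -> matching case values,
// so the masks for all successors can be created in O(N) total instead
// of rescanning all N cases once per successor.
std::map<int, std::vector<int64_t>>
groupCasesByDest(const std::vector<std::pair<int64_t, int>> &Cases,
                 int DefaultDst) {
  std::map<int, std::vector<int64_t>> Dest2Vals;
  for (const auto &[Val, Dest] : Cases)
    if (Dest != DefaultDst) // cases that reach the default are redundant
      Dest2Vals[Dest].push_back(Val);
  return Dest2Vals;
}
```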
Rewritten to compute all masks once, thanks
Also added cost estimate
//    where destination == Dst using a logical OR.
// 2. Dst is the default destination. Dst is reached if none of the cases
//    with destination != Dst are taken. Join the conditions for each case
//    where the destination is != Dst using a logical OR and negate it.
Cases whose destinations are the same as default are redundant and can be ignored/eliminated - they will get there anyhow. Perhaps slightly clearer to filter them as such during the traversal.
They are filtered out during the loop below I think.
VPValue *Cond = getVPValueOrAddLiveIn(SI->getCondition(), Plan);
bool IsDefault = SI->getDefaultDest() == Dst;
This is fine. Worth considering the alternative of computing the default case by recursively computing (and recording, or retrieving) the masks of all other edges from Src to its successors, excluding default. Note that there may be no such other edges, in which case the desired edge mask is that of Src (worth a test).
When computing the non-default case, all relevant edges need to be computed - the cache was checked at the beginning.
-VPValue *Cond = getVPValueOrAddLiveIn(SI->getCondition(), Plan);
-bool IsDefault = SI->getDefaultDest() == Dst;
+if (SI->getDefaultDest() == Dst) {
+  VPValue *AllOtherEdgesMask = nullptr;
+  SmallPtrSet<BasicBlock *, 4> UniqueSuccessors(succ_begin(Src), succ_end(Src));
+  for (auto *Successor : UniqueSuccessors) {
+    if (Successor == Dst)
+      continue;
+    VPValue *OtherEdgeMask = createEdgeMask(Src, Successor);
+    if (AllOtherEdgesMask)
+      AllOtherEdgesMask = Builder.createLogicalOr(AllOtherEdgesMask, OtherEdgeMask);
+    else
+      AllOtherEdgesMask = OtherEdgeMask;
+  }
+  if (!AllOtherEdgesMask) {
+    assert(UniqueSuccessors.size() == 1 && "Src expected to have only default successor");
+    return EdgeMaskCache[Edge] = SrcMask;
+  }
+  return EdgeMaskCache[Edge] = Builder.createNot(AllOtherEdgesMask);
+}
+VPValue *EdgeMask = nullptr;
+VPValue *Cond = getVPValueOrAddLiveIn(SI->getCondition(), Plan);
+for (auto &C : SI->cases()) {
+  if (C.getCaseSuccessor() != Dst)
+    continue;
+  // Edge for case C is not in EdgeMaskCache, create it.
+  VPValue *V = getVPValueOrAddLiveIn(C.getCaseValue(), Plan);
+  VPValue *CMask = Builder.createICmp(CmpInst::ICMP_EQ, Cond, V);
+  if (EdgeMask)
+    EdgeMask = Builder.createLogicalOr(EdgeMask, CMask);
+  else
+    EdgeMask = CMask;
+}
+assert(EdgeMask && "mask must be created");
+return EdgeMaskCache[Edge] = EdgeMask;
}
Rewritten to compute all masks once, thanks
    Eq = Builder.createICmp(CmpInst::ICMP_EQ, Cond, V);
  }
  if (Mask)
    Mask = Builder.createOr(Mask, Eq);
-Mask = Builder.createOr(Mask, Eq);
+Mask = Builder.createLogicalOr(Mask, Eq);
I think a logical OR shouldn't be needed; the only case where the created compares can be poison is when the switch condition is poison, and in that case all compares will be poison, so there is no difference between a bitwise and a logical OR.
Worth a comment?
The commit message should also be updated to read OR rather than logical OR, to be consistent.
  else
    Mask = Eq;
}
if (IsDefault)
Mask may be null in this case, avoid Not'ing it if so.
👍🏻
@@ -7763,6 +7763,41 @@ VPValue *VPRecipeBuilder::createEdgeMask(BasicBlock *Src, BasicBlock *Dst) {

  VPValue *SrcMask = getBlockInMask(Src);

  if (auto *SI = dyn_cast<SwitchInst>(Src->getTerminator())) {
Worth asserting that SI stays in the same loop iteration, rather than breaking or continuing to its header? E.g., that !OrigLoop->isLoopExiting(Src).
Added assert here and check to LoopVectorizationLegality, thanks!
  } else if (C.getCaseSuccessor() != Dst)
    continue;

  VPValue *Eq = EdgeMaskCache.lookup({Src, C.getCaseSuccessor()});
This may succeed only in the IsDefault case, and may lead to OR'ing multiple instances of the same mask - if the same (non-default) successor is the destination of multiple cases. Should be easily folded later though.
Yes that is currently happening for some test cases, but should be cleaned up by VPlan-to-VPlan recipe simplification (follow-up needed)
Add tests for VPlan printing?
if (auto *SI = dyn_cast<SwitchInst>(Src->getTerminator())) {
  // Create mask where the terminator in Src is a switch. We need to handle 2
  // separate cases:
  // 1. Dst is not the default desintation. Dst is reached if any of the cases
desintation -> destination
Fixed, thanks!
; IC1-NEXT:    [[TMP7:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 -12, i64 -12, i64 -12, i64 -12>
; IC1-NEXT:    [[TMP8:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 13, i64 13, i64 13, i64 13>
; IC1-NEXT:    [[TMP9:%.*]] = or <4 x i1> [[TMP7]], [[TMP8]]
; IC1-NEXT:    [[TMP10:%.*]] = or <4 x i1> [[TMP9]], [[TMP9]]
redundant
Good catch!
This stems from having parallel {pred, succ} edges in the CFG for multiple switch cases that share a common succ destination, coupled with caching a single edge mask per {pred, succ} pair of VPBB's.
The in mask of a VPBB is created by ORing the edge masks from its predecessors - suffice to visit each predecessor once.
Yes, could be done by de-duplicating predecessors (or VPlan based simplification)
Case should also apply to conditional branches whose two destinations are the same - a case optimized by createEdgeMask() - but not by its user createBlockInMask().
Extra tests for #99808, including cost model tests.
// collect compares for all cases once.
VPValue *Cond = getVPValueOrAddLiveIn(SI->getCondition(), Plan);
BasicBlock *DefaultDst = SI->getDefaultDest();
MapVector<BasicBlock *, SmallVector<VPValue *>> Map;
-  MapVector<BasicBlock *, SmallVector<VPValue *>> Map;
+  MapVector<BasicBlock *, SmallVector<VPValue *>> Destination2Compares;
or some shorter yet meaningful name.
Updated to `Dst2Compares`, thanks!
// 1. Dst is not the default destination. Dst is reached if any of the cases
// with destination == Dst are taken. Join the conditions for each case
// where destination == Dst using a logical OR.
for (const auto &[Dst, Conds] : Map) {
`Dst` is also a parameter to the method. Better outline this part, which computes masks for all edges of a switch, to a separate `createSwitchEdgeMasks()` method?
outlined, thanks!
// We need to handle 2 separate cases:
// 1. Dst is not the default destination. Dst is reached if any of the cases
// with destination == Dst are taken. Join the conditions for each case
// where destination == Dst using a logical OR.
-  // where destination == Dst using a logical OR.
+  // whose destination == Dst using an OR.
updated, thanks!
// 2. Dst is the default destination. Dst is reached if none of the cases
// with destination != Dst are taken. Join the conditions for each case
// where the destination is != Dst using a logical OR and negate it.
-  // where the destination is != Dst using a logical OR and negate it.
+  // where the destination is != Dst using an OR and negate it.
Thanks, updated
BasicBlock *DefaultDst = SI->getDefaultDest();
MapVector<BasicBlock *, SmallVector<VPValue *>> Map;
for (auto &C : SI->cases()) {
  auto I = Map.insert({C.getCaseSuccessor(), {}});
-  auto I = Map.insert({C.getCaseSuccessor(), {}});
+  // Cases whose destination is the same as default are redundant and can be ignored - they will get there anyhow.
+  if (C.getCaseSuccessor() == DefaultDst)
+    continue;
+  auto I = Map.insert({C.getCaseSuccessor(), {}});
This one should also be updated
Mask = Builder.createOr(Mask, V);
if (SrcMask)
  Mask = Builder.createLogicalAnd(SrcMask, Mask);
EdgeMaskCache[{Src, Dst}] = Mask;
-  EdgeMaskCache[{Src, Dst}] = Mask;
+  EdgeMaskCache[{Src, Dst}] = Mask;
+  DefaultMask = !DefaultMask ? Mask : Builder.createOr(DefaultMask, Mask);
can also collect DefaultMask in this loop, to be finalized and stored after it.
Updated, thanks!
; IC1-NEXT: [[TMP7:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 -12, i64 -12, i64 -12, i64 -12>
; IC1-NEXT: [[TMP8:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 13, i64 13, i64 13, i64 13>
; IC1-NEXT: [[TMP9:%.*]] = or <4 x i1> [[TMP7]], [[TMP8]]
; IC1-NEXT: [[TMP10:%.*]] = or <4 x i1> [[TMP9]], [[TMP9]]
Case should also apply to conditional branches whose two destinations are the same - a case optimized by createEdgeMask() - but not by its user createBlockInMask().
Eq = Builder.createICmp(CmpInst::ICMP_EQ, Cond, V);
}
if (Mask)
  Mask = Builder.createOr(Mask, Eq);
Worth a comment?
The commit message should also be updated to read OR rather than logical OR, to be consistent.
Support for vectorizing switch statements will be added in #99808. Update the loop to use a call that cannot be vectorized to preserve testing surfacing analysis remarks via the frontend.
This looks good to me, thanks for accommodating! Left behind several comments.
@@ -6456,6 +6456,17 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I,
// a predicated block since it will become a fall-through, although we
// may decide in the future to call TTI for all branches.
}
case Instruction::Switch: {
  if (VF.isScalar())
    return TTI.getCFInstrCost(Instruction::Switch, CostKind);
The above scalar cost seems right, wonder about the vector cost below - the cost associated with predicating conditional branches is collected when visiting each phi, rather than the branch itself. May be good to calibrate with some tests, can leave behind a TODO to be done separately.
The vector cost matches the cost of the generated masks, which are costed explicitly for the version with branches due to the compares being explicit instructions. Currently it seems more like the scalar cost may be estimated by getCFInstrCost, but that probably would need to be fixed in TTI.
@@ -7839,6 +7850,60 @@ VPRecipeBuilder::mapToVPValues(User::op_range Operands) {
  return map_range(Operands, Fn);
}

void VPRecipeBuilder::createSwitchEdgeMasks(SwitchInst *SI) {
  BasicBlock *Src = SI->getParent();
-  BasicBlock *Src = SI->getParent();
+  BasicBlock *Src = SI->getParent();
+  assert(!EdgeMaskCache.contains(Src) && "Edge masks already created");
?
Added but moved down to iterating over all cases, as we need to look up (Src, Dst) pairs, thanks!
/// Create masks for all cases with destination different than the default
/// destination, and a mask for the default destination.
-  /// Create masks for all cases with destination different than the default
-  /// destination, and a mask for the default destination.
+  /// Create an edge mask for every destination of cases and/or default.
Done, thanks!
; COST-NEXT: [[TMP7:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 -12, i64 -12, i64 -12, i64 -12>
; COST-NEXT: [[TMP8:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 13, i64 13, i64 13, i64 13>
; COST-NEXT: [[TMP9:%.*]] = or <4 x i1> [[TMP7]], [[TMP8]]
; COST-NEXT: [[TMP10:%.*]] = or <4 x i1> [[TMP9]], [[TMP9]]
It may be good to have alongside a version doing the same with a conditional branch, i.e., "if (%1 == -12) || (%1 == 13) *ptr_iv = 42". (But avoid getting them folded into a switch as in the last test at the bottom.) Could be useful for calibrating the associated costs, in addition to comparing the generated codes.
Added a variation, decision is the same.
@@ -104,9 +181,62 @@ define void @switch_all_dests_distinct(ptr %start, ptr %end) {
; FORCED-LABEL: define void @switch_all_dests_distinct(
Sanity check: is it indeed unprofitable to vectorize this case having four distinct destinations, say by VF=4 and UF=1, as decided by COST?
I checked and the reason for not vectorizing is that `getCFInstrCost` considers `Switch` as free on X86.
@@ -162,21 +343,21 @@ define void @switch_to_header(ptr %start) {
; IC1-NEXT: [[ENTRY:.*]]:
; IC1-NEXT: br label %[[LOOP_HEADER:.*]]
; IC1: [[LOOP_HEADER]]:
; IC1-NEXT: [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP_HEADER_BACKEDGE:.*]] ]
; IC1-NEXT: [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[IF_THEN1:.*]] ]
If/how are these changes in block labels (and those below) related to this patch?
Re-generated check lines separately, changes here should be gone now 5286656
; IC1-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
; IC1-NEXT: [[TMP1:%.*]] = getelementptr inbounds i64, ptr [[START]], i64 [[TMP0]]
; IC1-NEXT: [[TMP2:%.*]] = getelementptr inbounds i64, ptr [[TMP1]], i32 0
; IC1-NEXT: store <2 x i64> <i64 42, i64 42>, ptr [[TMP2]], align 1
Nice!
; IC2-NEXT: [[TMP2:%.*]] = getelementptr inbounds i64, ptr [[START]], i64 [[TMP0]]
; IC2-NEXT: [[TMP3:%.*]] = getelementptr inbounds i64, ptr [[START]], i64 [[TMP1]]
; IC2-NEXT: [[TMP4:%.*]] = getelementptr inbounds i64, ptr [[TMP2]], i32 0
; IC2-NEXT: [[TMP5:%.*]] = getelementptr inbounds i64, ptr [[TMP2]], i32 2
Is TMP3 (and TMP1) dead?
Yes, this is due to the fact that VPVectorPointerRecipe is responsible for generating pointers for all parts, using only the first part of the pointer operand, but needs marking as only-uses-first-part. Will do separately, thanks!
@@ -11,21 +11,71 @@ define dso_local void @test(ptr %start, ptr %end) #0 {
; CHECK-LABEL: @test(
Time to replace the above FIXME with a comment that under -O2 adjacent branches get fused into a switch statement before vectorization.
Dropped, thanks!
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <8 x i32>, ptr [[NEXT_GEP]], align 4
; CHECK-NEXT: [[WIDE_LOAD8:%.*]] = load <8 x i32>, ptr [[TMP5]], align 4
; CHECK-NEXT: [[WIDE_LOAD9:%.*]] = load <8 x i32>, ptr [[TMP6]], align 4
; CHECK-NEXT: [[WIDE_LOAD10:%.*]] = load <8 x i32>, ptr [[TMP7]], align 4
Sanity check: is the decision to vectorize by VF=8, UF=4 taken by COST according to cost estimates involving a two-case switch, profitable? I.e., comparable to the decision of a conditional branch (w/o being folded into a switch).
Yep, the decisions agree
Regenerate check lines for test to avoid unrelated changes in #99808.
Update createEdgeMask to create masks where the terminator in Src is a switch. We need to handle 2 separate cases: 1. Dst is not the default destination. Dst is reached if any of the cases with destination == Dst are taken. Join the conditions for each case where destination == Dst using a logical OR. 2. Dst is the default destination. Dst is reached if none of the cases with destination != Dst are taken. Join the conditions for each case where the destination is != Dst using a logical OR and negate it. Fixes llvm#48188.
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/113/builds/2099 Here is the relevant piece of the build log for the reference:
The buildbot failure above was due to not considering |
VPVectorPointerRecipe only uses the first part of the pointer operand, so mark it accordingly. Follow-up suggested as part of #99808.
Update createEdgeMask to create masks where the terminator in Src is a switch. We need to handle 2 separate cases:
Fixes #48188.