Skip to content

[LV] Support generating masks for switch terminators. #99808

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Aug 11, 2024
Merged

Conversation

fhahn
Copy link
Contributor

@fhahn fhahn commented Jul 21, 2024

Update createEdgeMask to created masks where the terminator in Src is a switch. We need to handle 2 separate cases:

  1. Dst is not the default desintation. Dst is reached if any of the cases with destination == Dst are taken. Join the conditions for each case where destination == Dst using a logical OR.
  2. Dst is the default destination. Dst is reached if none of the cases with destination != Dst are taken. Join the conditions for each case where the destination is != Dst using a logical OR and negate it.

Fixes #48188.

@fhahn fhahn requested review from rengolin, ayalz and aniragil July 21, 2024 11:17
@llvmbot llvmbot added clang Clang issues not falling into any other category vectorizers llvm:transforms labels Jul 21, 2024
@llvmbot
Copy link
Member

llvmbot commented Jul 21, 2024

@llvm/pr-subscribers-clang

@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes

Update createEdgeMask to created masks where the terminator in Src is a switch. We need to handle 2 separate cases:

  1. Dst is not the default desintation. Dst is reached if any of the cases with destination == Dst are taken. Join the conditions for each case where destination == Dst using a logical OR.
  2. Dst is the default destination. Dst is reached if none of the cases with destination != Dst are taken. Join the conditions for each case where the destination is != Dst using a logical OR and negate it.

Fixes #48188.


Patch is 84.17 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/99808.diff

7 Files Affected:

  • (modified) clang/test/Frontend/optimization-remark-analysis.c (+1-1)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp (+5-5)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+35)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/predicate-switch.ll (+623-24)
  • (modified) llvm/test/Transforms/LoopVectorize/no_switch.ll (+5-7)
  • (modified) llvm/test/Transforms/LoopVectorize/predicate-switch.ll (+196-4)
  • (modified) llvm/test/Transforms/PhaseOrdering/X86/pr48844-br-to-switch-vectorization.ll (+55-5)
diff --git a/clang/test/Frontend/optimization-remark-analysis.c b/clang/test/Frontend/optimization-remark-analysis.c
index e43984942a6ef..9d8917265a320 100644
--- a/clang/test/Frontend/optimization-remark-analysis.c
+++ b/clang/test/Frontend/optimization-remark-analysis.c
@@ -1,7 +1,7 @@
 // RUN: %clang -O1 -fvectorize -target x86_64-unknown-unknown -emit-llvm -Rpass-analysis -S %s -o - 2>&1 | FileCheck %s --check-prefix=RPASS
 // RUN: %clang -O1 -fvectorize -target x86_64-unknown-unknown -emit-llvm -S %s -o - 2>&1 | FileCheck %s
 
-// RPASS: {{.*}}:12:5: remark: loop not vectorized: loop contains a switch statement
+// RPASS-NOT: {{.*}}:12:5: remark: loop not vectorized
 // CHECK-NOT: remark: loop not vectorized: loop contains a switch statement
 
 double foo(int N, int *Array) {
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
index f54eebb2874ab..7f84455150093 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
@@ -1348,11 +1348,11 @@ bool LoopVectorizationLegality::canVectorizeWithIfConvert() {
   // Collect the blocks that need predication.
   for (BasicBlock *BB : TheLoop->blocks()) {
     // We don't support switch statements inside loops.
-    if (!isa<BranchInst>(BB->getTerminator())) {
-      reportVectorizationFailure("Loop contains a switch statement",
-                                 "loop contains a switch statement",
-                                 "LoopContainsSwitch", ORE, TheLoop,
-                                 BB->getTerminator());
+    if (!isa<BranchInst, SwitchInst>(BB->getTerminator())) {
+      reportVectorizationFailure("Loop contains an unsupported termaintor",
+                                 "loop contains an unsupported terminator",
+                                 "LoopContainsUnsupportedTerminator", ORE,
+                                 TheLoop, BB->getTerminator());
       return false;
     }
 
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 6d28b8fabe42e..2530762e3e424 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7763,6 +7763,41 @@ VPValue *VPRecipeBuilder::createEdgeMask(BasicBlock *Src, BasicBlock *Dst) {
 
   VPValue *SrcMask = getBlockInMask(Src);
 
+  if (auto *SI = dyn_cast<SwitchInst>(Src->getTerminator())) {
+    // Create mask where the terminator in Src is a switch. We need to handle 2
+    // separate cases:
+    // 1. Dst is not the default desintation. Dst is reached if any of the cases
+    // with destination == Dst are taken. Join the conditions for each case
+    // where destination == Dst using a logical OR.
+    // 2. Dst is the default destination. Dst is reached if none of the cases
+    // with destination != Dst are taken. Join the conditions for each case
+    // where the destination is != Dst using a logical OR and negate it.
+    VPValue *Mask = nullptr;
+    VPValue *Cond = getVPValueOrAddLiveIn(SI->getCondition(), Plan);
+    bool IsDefault = SI->getDefaultDest() == Dst;
+    for (auto &C : SI->cases()) {
+      if (IsDefault) {
+        if (C.getCaseSuccessor() == Dst)
+          continue;
+      } else if (C.getCaseSuccessor() != Dst)
+        continue;
+
+      VPValue *Eq = EdgeMaskCache.lookup({Src, C.getCaseSuccessor()});
+      if (!Eq) {
+        VPValue *V = getVPValueOrAddLiveIn(C.getCaseValue(), Plan);
+        Eq = Builder.createICmp(CmpInst::ICMP_EQ, Cond, V);
+      }
+      if (Mask)
+        Mask = Builder.createOr(Mask, Eq);
+      else
+        Mask = Eq;
+    }
+    if (IsDefault)
+      Mask = Builder.createNot(Mask);
+    assert(Mask && "mask must be created");
+    return EdgeMaskCache[Edge] = Mask;
+  }
+
   // The terminator has to be a branch inst!
   BranchInst *BI = dyn_cast<BranchInst>(Src->getTerminator());
   assert(BI && "Unexpected terminator found");
diff --git a/llvm/test/Transforms/LoopVectorize/X86/predicate-switch.ll b/llvm/test/Transforms/LoopVectorize/X86/predicate-switch.ll
index b8ce3c40920a3..ff73a149c8e39 100644
--- a/llvm/test/Transforms/LoopVectorize/X86/predicate-switch.ll
+++ b/llvm/test/Transforms/LoopVectorize/X86/predicate-switch.ll
@@ -6,9 +6,43 @@ define void @switch_default_to_latch_common_dest(ptr %start, ptr %end) {
 ; IC1-LABEL: define void @switch_default_to_latch_common_dest(
 ; IC1-SAME: ptr [[START:%.*]], ptr [[END:%.*]]) #[[ATTR0:[0-9]+]] {
 ; IC1-NEXT:  [[ENTRY:.*]]:
+; IC1-NEXT:    [[START2:%.*]] = ptrtoint ptr [[START]] to i64
+; IC1-NEXT:    [[END1:%.*]] = ptrtoint ptr [[END]] to i64
+; IC1-NEXT:    [[TMP0:%.*]] = add i64 [[END1]], -8
+; IC1-NEXT:    [[TMP1:%.*]] = sub i64 [[TMP0]], [[START2]]
+; IC1-NEXT:    [[TMP2:%.*]] = lshr i64 [[TMP1]], 3
+; IC1-NEXT:    [[TMP3:%.*]] = add nuw nsw i64 [[TMP2]], 1
+; IC1-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP3]], 4
+; IC1-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; IC1:       [[VECTOR_PH]]:
+; IC1-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP3]], 4
+; IC1-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP3]], [[N_MOD_VF]]
+; IC1-NEXT:    [[TMP4:%.*]] = mul i64 [[N_VEC]], 8
+; IC1-NEXT:    [[IND_END:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP4]]
+; IC1-NEXT:    br label %[[VECTOR_BODY:.*]]
+; IC1:       [[VECTOR_BODY]]:
+; IC1-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; IC1-NEXT:    [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 8
+; IC1-NEXT:    [[TMP5:%.*]] = add i64 [[OFFSET_IDX]], 0
+; IC1-NEXT:    [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP5]]
+; IC1-NEXT:    [[TMP6:%.*]] = getelementptr i64, ptr [[NEXT_GEP]], i32 0
+; IC1-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[TMP6]], align 1
+; IC1-NEXT:    [[TMP7:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 -12, i64 -12, i64 -12, i64 -12>
+; IC1-NEXT:    [[TMP8:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 13, i64 13, i64 13, i64 13>
+; IC1-NEXT:    [[TMP9:%.*]] = or <4 x i1> [[TMP7]], [[TMP8]]
+; IC1-NEXT:    [[TMP10:%.*]] = or <4 x i1> [[TMP9]], [[TMP9]]
+; IC1-NEXT:    call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 42, i64 42, i64 42, i64 42>, ptr [[TMP6]], i32 1, <4 x i1> [[TMP10]])
+; IC1-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
+; IC1-NEXT:    [[TMP11:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; IC1-NEXT:    br i1 [[TMP11]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; IC1:       [[MIDDLE_BLOCK]]:
+; IC1-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[TMP3]], [[N_VEC]]
+; IC1-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; IC1:       [[SCALAR_PH]]:
+; IC1-NEXT:    [[BC_RESUME_VAL:%.*]] = phi ptr [ [[IND_END]], %[[MIDDLE_BLOCK]] ], [ [[START]], %[[ENTRY]] ]
 ; IC1-NEXT:    br label %[[LOOP_HEADER:.*]]
 ; IC1:       [[LOOP_HEADER]]:
-; IC1-NEXT:    [[PTR_IV:%.*]] = phi ptr [ [[START]], %[[ENTRY]] ], [ [[PTR_IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ]
+; IC1-NEXT:    [[PTR_IV:%.*]] = phi ptr [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[PTR_IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ]
 ; IC1-NEXT:    [[L:%.*]] = load i64, ptr [[PTR_IV]], align 1
 ; IC1-NEXT:    switch i64 [[L]], label %[[LOOP_LATCH]] [
 ; IC1-NEXT:      i64 -12, label %[[IF_THEN:.*]]
@@ -20,16 +54,59 @@ define void @switch_default_to_latch_common_dest(ptr %start, ptr %end) {
 ; IC1:       [[LOOP_LATCH]]:
 ; IC1-NEXT:    [[PTR_IV_NEXT]] = getelementptr inbounds i64, ptr [[PTR_IV]], i64 1
 ; IC1-NEXT:    [[EC:%.*]] = icmp eq ptr [[PTR_IV_NEXT]], [[END]]
-; IC1-NEXT:    br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP_HEADER]]
+; IC1-NEXT:    br i1 [[EC]], label %[[EXIT]], label %[[LOOP_HEADER]], !llvm.loop [[LOOP3:![0-9]+]]
 ; IC1:       [[EXIT]]:
 ; IC1-NEXT:    ret void
 ;
 ; IC2-LABEL: define void @switch_default_to_latch_common_dest(
 ; IC2-SAME: ptr [[START:%.*]], ptr [[END:%.*]]) #[[ATTR0:[0-9]+]] {
 ; IC2-NEXT:  [[ENTRY:.*]]:
+; IC2-NEXT:    [[START2:%.*]] = ptrtoint ptr [[START]] to i64
+; IC2-NEXT:    [[END1:%.*]] = ptrtoint ptr [[END]] to i64
+; IC2-NEXT:    [[TMP0:%.*]] = add i64 [[END1]], -8
+; IC2-NEXT:    [[TMP1:%.*]] = sub i64 [[TMP0]], [[START2]]
+; IC2-NEXT:    [[TMP2:%.*]] = lshr i64 [[TMP1]], 3
+; IC2-NEXT:    [[TMP3:%.*]] = add nuw nsw i64 [[TMP2]], 1
+; IC2-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP3]], 8
+; IC2-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; IC2:       [[VECTOR_PH]]:
+; IC2-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP3]], 8
+; IC2-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP3]], [[N_MOD_VF]]
+; IC2-NEXT:    [[TMP4:%.*]] = mul i64 [[N_VEC]], 8
+; IC2-NEXT:    [[IND_END:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP4]]
+; IC2-NEXT:    br label %[[VECTOR_BODY:.*]]
+; IC2:       [[VECTOR_BODY]]:
+; IC2-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; IC2-NEXT:    [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 8
+; IC2-NEXT:    [[TMP5:%.*]] = add i64 [[OFFSET_IDX]], 0
+; IC2-NEXT:    [[TMP6:%.*]] = add i64 [[OFFSET_IDX]], 32
+; IC2-NEXT:    [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP5]]
+; IC2-NEXT:    [[NEXT_GEP3:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP6]]
+; IC2-NEXT:    [[TMP7:%.*]] = getelementptr i64, ptr [[NEXT_GEP]], i32 0
+; IC2-NEXT:    [[TMP8:%.*]] = getelementptr i64, ptr [[NEXT_GEP]], i32 4
+; IC2-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[TMP7]], align 1
+; IC2-NEXT:    [[WIDE_LOAD4:%.*]] = load <4 x i64>, ptr [[TMP8]], align 1
+; IC2-NEXT:    [[TMP9:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 -12, i64 -12, i64 -12, i64 -12>
+; IC2-NEXT:    [[TMP10:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD4]], <i64 -12, i64 -12, i64 -12, i64 -12>
+; IC2-NEXT:    [[TMP11:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 13, i64 13, i64 13, i64 13>
+; IC2-NEXT:    [[TMP12:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD4]], <i64 13, i64 13, i64 13, i64 13>
+; IC2-NEXT:    [[TMP13:%.*]] = or <4 x i1> [[TMP9]], [[TMP11]]
+; IC2-NEXT:    [[TMP14:%.*]] = or <4 x i1> [[TMP10]], [[TMP12]]
+; IC2-NEXT:    [[TMP15:%.*]] = or <4 x i1> [[TMP13]], [[TMP13]]
+; IC2-NEXT:    [[TMP16:%.*]] = or <4 x i1> [[TMP14]], [[TMP14]]
+; IC2-NEXT:    call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 42, i64 42, i64 42, i64 42>, ptr [[TMP7]], i32 1, <4 x i1> [[TMP15]])
+; IC2-NEXT:    call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 42, i64 42, i64 42, i64 42>, ptr [[TMP8]], i32 1, <4 x i1> [[TMP16]])
+; IC2-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
+; IC2-NEXT:    [[TMP17:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; IC2-NEXT:    br i1 [[TMP17]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; IC2:       [[MIDDLE_BLOCK]]:
+; IC2-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[TMP3]], [[N_VEC]]
+; IC2-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; IC2:       [[SCALAR_PH]]:
+; IC2-NEXT:    [[BC_RESUME_VAL:%.*]] = phi ptr [ [[IND_END]], %[[MIDDLE_BLOCK]] ], [ [[START]], %[[ENTRY]] ]
 ; IC2-NEXT:    br label %[[LOOP_HEADER:.*]]
 ; IC2:       [[LOOP_HEADER]]:
-; IC2-NEXT:    [[PTR_IV:%.*]] = phi ptr [ [[START]], %[[ENTRY]] ], [ [[PTR_IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ]
+; IC2-NEXT:    [[PTR_IV:%.*]] = phi ptr [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[PTR_IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ]
 ; IC2-NEXT:    [[L:%.*]] = load i64, ptr [[PTR_IV]], align 1
 ; IC2-NEXT:    switch i64 [[L]], label %[[LOOP_LATCH]] [
 ; IC2-NEXT:      i64 -12, label %[[IF_THEN:.*]]
@@ -41,7 +118,7 @@ define void @switch_default_to_latch_common_dest(ptr %start, ptr %end) {
 ; IC2:       [[LOOP_LATCH]]:
 ; IC2-NEXT:    [[PTR_IV_NEXT]] = getelementptr inbounds i64, ptr [[PTR_IV]], i64 1
 ; IC2-NEXT:    [[EC:%.*]] = icmp eq ptr [[PTR_IV_NEXT]], [[END]]
-; IC2-NEXT:    br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP_HEADER]]
+; IC2-NEXT:    br i1 [[EC]], label %[[EXIT]], label %[[LOOP_HEADER]], !llvm.loop [[LOOP3:![0-9]+]]
 ; IC2:       [[EXIT]]:
 ; IC2-NEXT:    ret void
 ;
@@ -73,9 +150,48 @@ define void @switch_all_dests_distinct(ptr %start, ptr %end) {
 ; IC1-LABEL: define void @switch_all_dests_distinct(
 ; IC1-SAME: ptr [[START:%.*]], ptr [[END:%.*]]) #[[ATTR0]] {
 ; IC1-NEXT:  [[ENTRY:.*]]:
+; IC1-NEXT:    [[START2:%.*]] = ptrtoint ptr [[START]] to i64
+; IC1-NEXT:    [[END1:%.*]] = ptrtoint ptr [[END]] to i64
+; IC1-NEXT:    [[TMP0:%.*]] = add i64 [[END1]], -8
+; IC1-NEXT:    [[TMP1:%.*]] = sub i64 [[TMP0]], [[START2]]
+; IC1-NEXT:    [[TMP2:%.*]] = lshr i64 [[TMP1]], 3
+; IC1-NEXT:    [[TMP3:%.*]] = add nuw nsw i64 [[TMP2]], 1
+; IC1-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP3]], 4
+; IC1-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; IC1:       [[VECTOR_PH]]:
+; IC1-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP3]], 4
+; IC1-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP3]], [[N_MOD_VF]]
+; IC1-NEXT:    [[TMP4:%.*]] = mul i64 [[N_VEC]], 8
+; IC1-NEXT:    [[IND_END:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP4]]
+; IC1-NEXT:    br label %[[VECTOR_BODY:.*]]
+; IC1:       [[VECTOR_BODY]]:
+; IC1-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; IC1-NEXT:    [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 8
+; IC1-NEXT:    [[TMP5:%.*]] = add i64 [[OFFSET_IDX]], 0
+; IC1-NEXT:    [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP5]]
+; IC1-NEXT:    [[TMP6:%.*]] = getelementptr i64, ptr [[NEXT_GEP]], i32 0
+; IC1-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[TMP6]], align 1
+; IC1-NEXT:    [[TMP7:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], zeroinitializer
+; IC1-NEXT:    call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 1, i64 1, i64 1, i64 1>, ptr [[TMP6]], i32 1, <4 x i1> [[TMP7]])
+; IC1-NEXT:    [[TMP8:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 13, i64 13, i64 13, i64 13>
+; IC1-NEXT:    call void @llvm.masked.store.v4i64.p0(<4 x i64> zeroinitializer, ptr [[TMP6]], i32 1, <4 x i1> [[TMP8]])
+; IC1-NEXT:    [[TMP9:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 -12, i64 -12, i64 -12, i64 -12>
+; IC1-NEXT:    call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 42, i64 42, i64 42, i64 42>, ptr [[TMP6]], i32 1, <4 x i1> [[TMP9]])
+; IC1-NEXT:    [[TMP10:%.*]] = or <4 x i1> [[TMP9]], [[TMP8]]
+; IC1-NEXT:    [[TMP11:%.*]] = or <4 x i1> [[TMP10]], [[TMP7]]
+; IC1-NEXT:    [[TMP12:%.*]] = xor <4 x i1> [[TMP11]], <i1 true, i1 true, i1 true, i1 true>
+; IC1-NEXT:    call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 2, i64 2, i64 2, i64 2>, ptr [[TMP6]], i32 1, <4 x i1> [[TMP12]])
+; IC1-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
+; IC1-NEXT:    [[TMP13:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; IC1-NEXT:    br i1 [[TMP13]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; IC1:       [[MIDDLE_BLOCK]]:
+; IC1-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[TMP3]], [[N_VEC]]
+; IC1-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; IC1:       [[SCALAR_PH]]:
+; IC1-NEXT:    [[BC_RESUME_VAL:%.*]] = phi ptr [ [[IND_END]], %[[MIDDLE_BLOCK]] ], [ [[START]], %[[ENTRY]] ]
 ; IC1-NEXT:    br label %[[LOOP_HEADER:.*]]
 ; IC1:       [[LOOP_HEADER]]:
-; IC1-NEXT:    [[PTR_IV:%.*]] = phi ptr [ [[START]], %[[ENTRY]] ], [ [[PTR_IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ]
+; IC1-NEXT:    [[PTR_IV:%.*]] = phi ptr [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[PTR_IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ]
 ; IC1-NEXT:    [[L:%.*]] = load i64, ptr [[PTR_IV]], align 1
 ; IC1-NEXT:    switch i64 [[L]], label %[[DEFAULT:.*]] [
 ; IC1-NEXT:      i64 -12, label %[[IF_THEN_1:.*]]
@@ -97,16 +213,69 @@ define void @switch_all_dests_distinct(ptr %start, ptr %end) {
 ; IC1:       [[LOOP_LATCH]]:
 ; IC1-NEXT:    [[PTR_IV_NEXT]] = getelementptr inbounds i64, ptr [[PTR_IV]], i64 1
 ; IC1-NEXT:    [[EC:%.*]] = icmp eq ptr [[PTR_IV_NEXT]], [[END]]
-; IC1-NEXT:    br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP_HEADER]]
+; IC1-NEXT:    br i1 [[EC]], label %[[EXIT]], label %[[LOOP_HEADER]], !llvm.loop [[LOOP5:![0-9]+]]
 ; IC1:       [[EXIT]]:
 ; IC1-NEXT:    ret void
 ;
 ; IC2-LABEL: define void @switch_all_dests_distinct(
 ; IC2-SAME: ptr [[START:%.*]], ptr [[END:%.*]]) #[[ATTR0]] {
 ; IC2-NEXT:  [[ENTRY:.*]]:
+; IC2-NEXT:    [[START2:%.*]] = ptrtoint ptr [[START]] to i64
+; IC2-NEXT:    [[END1:%.*]] = ptrtoint ptr [[END]] to i64
+; IC2-NEXT:    [[TMP0:%.*]] = add i64 [[END1]], -8
+; IC2-NEXT:    [[TMP1:%.*]] = sub i64 [[TMP0]], [[START2]]
+; IC2-NEXT:    [[TMP2:%.*]] = lshr i64 [[TMP1]], 3
+; IC2-NEXT:    [[TMP3:%.*]] = add nuw nsw i64 [[TMP2]], 1
+; IC2-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP3]], 8
+; IC2-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; IC2:       [[VECTOR_PH]]:
+; IC2-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP3]], 8
+; IC2-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP3]], [[N_MOD_VF]]
+; IC2-NEXT:    [[TMP4:%.*]] = mul i64 [[N_VEC]], 8
+; IC2-NEXT:    [[IND_END:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP4]]
+; IC2-NEXT:    br label %[[VECTOR_BODY:.*]]
+; IC2:       [[VECTOR_BODY]]:
+; IC2-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; IC2-NEXT:    [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 8
+; IC2-NEXT:    [[TMP5:%.*]] = add i64 [[OFFSET_IDX]], 0
+; IC2-NEXT:    [[TMP6:%.*]] = add i64 [[OFFSET_IDX]], 32
+; IC2-NEXT:    [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP5]]
+; IC2-NEXT:    [[NEXT_GEP3:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP6]]
+; IC2-NEXT:    [[TMP7:%.*]] = getelementptr i64, ptr [[NEXT_GEP]], i32 0
+; IC2-NEXT:    [[TMP8:%.*]] = getelementptr i64, ptr [[NEXT_GEP]], i32 4
+; IC2-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i64>, ptr [[TMP7]], align 1
+; IC2-NEXT:    [[WIDE_LOAD4:%.*]] = load <4 x i64>, ptr [[TMP8]], align 1
+; IC2-NEXT:    [[TMP9:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], zeroinitializer
+; IC2-NEXT:    [[TMP10:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD4]], zeroinitializer
+; IC2-NEXT:    call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 1, i64 1, i64 1, i64 1>, ptr [[TMP7]], i32 1, <4 x i1> [[TMP9]])
+; IC2-NEXT:    call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 1, i64 1, i64 1, i64 1>, ptr [[TMP8]], i32 1, <4 x i1> [[TMP10]])
+; IC2-NEXT:    [[TMP11:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 13, i64 13, i64 13, i64 13>
+; IC2-NEXT:    [[TMP12:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD4]], <i64 13, i64 13, i64 13, i64 13>
+; IC2-NEXT:    call void @llvm.masked.store.v4i64.p0(<4 x i64> zeroinitializer, ptr [[TMP7]], i32 1, <4 x i1> [[TMP11]])
+; IC2-NEXT:    call void @llvm.masked.store.v4i64.p0(<4 x i64> zeroinitializer, ptr [[TMP8]], i32 1, <4 x i1> [[TMP12]])
+; IC2-NEXT:    [[TMP13:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 -12, i64 -12, i64 -12, i64 -12>
+; IC2-NEXT:    [[TMP14:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD4]], <i64 -12, i64 -12, i64 -12, i64 -12>
+; IC2-NEXT:    call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 42, i64 42, i64 42, i64 42>, ptr [[TMP7]], i32 1, <4 x i1> [[TMP13]])
+; IC2-NEXT:    call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 42, i64 42, i64 42, i64 42>, ptr [[TMP8]], i32 1, <4 x i1> [[TMP14]])
+; IC2-NEXT:    [[TMP15:%.*]] = or <4 x i1> [[TMP13]], [[TMP11]]
+; IC2-NEXT:    [[TMP16:%.*]] = or <4 x i1> [[TMP14]], [[TMP12]]
+; IC2-NEXT:    [[TMP17:%.*]] = or <4 x i1> [[TMP15]], [[TMP9]]
+; IC2-NEXT:    [[TMP18:%.*]] = or <4 x i1> [[TMP16]], [[TMP10]]
+; IC2-NEXT:    [[TMP19:%.*]] = xor <4 x i1> [[TMP17]], <i1 true, i1 true, i1 true, i1 true>
+; IC2-NEXT:    [[TMP20:%.*]] = xor <4 x i1> [[TMP18]], <i1 true, i1 true, i1 true, i1 true>
+; IC2-NEXT:    call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 2, i64 2, i64 2, i64 2>, ptr [[TMP7]], i32 1, <4 x i1> [[TMP19]])
+; IC2-NEXT:    call void @llvm.masked.store.v4i64.p0(<4 x i64> <i64 2, i64 2, i64 2, i64 2>, ptr [[TMP8]], i32 1, <4 x i1> [[TMP20]])
+; IC2-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
+; IC2-NEXT:    [[TMP21:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; IC2-NEXT:    br i1 [[TMP21]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
+; IC2:       [[MIDDLE_BLOCK]]:
+; IC2-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[TMP3]], [[N_VEC]]
+; IC2-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; IC2:       [[SCALAR_PH]]:
+; IC2-NEXT:    [[BC_RES...
[truncated]

Copy link
Collaborator

@ayalz ayalz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great to see that switch statements can be vectorized by only teaching createEdgeMask() how to mask them! This conceptually replaces a switch statement with a series of non-cascading conditional branches - one per unique successor, along with if-converting them. In the long run each of these steps may be modelled separately in VPlan.
Also great to see the tests pre-committed! Have yet to review them next.

@@ -1,7 +1,7 @@
// RUN: %clang -O1 -fvectorize -target x86_64-unknown-unknown -emit-llvm -Rpass-analysis -S %s -o - 2>&1 | FileCheck %s --check-prefix=RPASS
// RUN: %clang -O1 -fvectorize -target x86_64-unknown-unknown -emit-llvm -S %s -o - 2>&1 | FileCheck %s

// RPASS: {{.*}}:12:5: remark: loop not vectorized: loop contains a switch statement
// RPASS-NOT: {{.*}}:12:5: remark: loop not vectorized
// CHECK-NOT: remark: loop not vectorized: loop contains a switch statement
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already CHECK-NOT'd before the patch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's to check that the message doesn't get emitted when -Rpass-analysis isn't passed. The function in the test likely needs to be replaced to preserve checking the plumbing for Rpass-analysis.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test adjusted to use a non-vectorizable call: 4b3bc46

Changes here are gone

@@ -1348,11 +1348,11 @@ bool LoopVectorizationLegality::canVectorizeWithIfConvert() {
// Collect the blocks that need predication.
for (BasicBlock *BB : TheLoop->blocks()) {
// We don't support switch statements inside loops.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// We don't support switch statements inside loops.
// We support only branches and switch statements as terminators inside the loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks!

Comment on lines 7779 to 7782
if (IsDefault) {
if (C.getCaseSuccessor() == Dst)
continue;
} else if (C.getCaseSuccessor() != Dst)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (IsDefault) {
if (C.getCaseSuccessor() == Dst)
continue;
} else if (C.getCaseSuccessor() != Dst)
bool IsCaseSuccessor = C.getCaseSuccessor() == Dst;
if ((!IsDefault && !IsCaseSuccessor) || (IsDefault && IsCaseSuccessor))
continue;

(can fold into if (IsDefault == IsCaseSuccessor) if seems better)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Folded, thanks!

VPValue *Mask = nullptr;
VPValue *Cond = getVPValueOrAddLiveIn(SI->getCondition(), Plan);
bool IsDefault = SI->getDefaultDest() == Dst;
for (auto &C : SI->cases()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the number of cases N may be large. All N cases are traversed here per successor, potentially in O(N^2). This potential compile-time concern could be addressed by building a reverse mapping from destinations to conditions, or creating all edge masks from Src to all its successors together.
There's also a potential cost/profitability concern: as N increases the performance advantage of vectorizing diminishes, as it applies if-conversion with a mask per successor, while the original scalar version may use table lookups or cascading conditional branches. Worth bounding the number of cases to something "reasonable", and/or checking that CM considers the associated cost "accurately"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewritten to compute all masks once, thanks

Also added cost estimate

// where destination == Dst using a logical OR.
// 2. Dst is the default destination. Dst is reached if none of the cases
// with destination != Dst are taken. Join the conditions for each case
// where the destination is != Dst using a logical OR and negate it.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cases whose destinations are the same as default are redundant and can be ignored/eliminated - they will get there anyhow. Perhaps slightly clearer to filter them as such during the traversal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are filtered out during the loop below I think.

Comment on lines 7776 to 7777
VPValue *Cond = getVPValueOrAddLiveIn(SI->getCondition(), Plan);
bool IsDefault = SI->getDefaultDest() == Dst;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine. Worth considering the alternative of computing the default case by recursively computing (and recording, or retrieving) the masks of all other edges from Src to its successors, excluding default. Note that there may be no such other edges, in which case the desired edge mask is that of Src (worth a test).
When computing the non-default case, all relevant edges need to be computed - the cache was checked at the beginning.

Suggested change
VPValue *Cond = getVPValueOrAddLiveIn(SI->getCondition(), Plan);
bool IsDefault = SI->getDefaultDest() == Dst;
if (SI->getDefaultDest() == Dst) {
VPValue *AllOtherEdgesMask = nullptr;
SmallPtrSet<BasicBlock *, 4> UniqueSuccessors(succ_begin(Src), succ_end(Src));
for (auto *Successor : UniqueSuccessors) {
if (Succssor == Dst)
continue;
VPValue *OtherEdgeMask = createEdgeMask(Src, Successor);
if (AllOtherEdgesMask)
AllOtherEdgesMask = Builder.createLogicalOr(AllOtherEdgesMask, OtherEdgeMask);
else
AllOtherEdgesMask = OtherEdgeMask;
}
if (!AllOtherEdgesMask) {
assert(UniqueSuccessors.size() == 1 && "Src expected to have only default successor");
return EdgeMaskCache[Edge] = SrcMask;
}
return EdgeMaskCache[Edge] = Builder.createNot(AllOtherEdgesMask);
}
VPValue *EdgeMask = nullptr;
VPValue *Cond = getVPValueOrAddLiveIn(SI->getCondition(), Plan);
for (auto &C : SI->cases()) {
if (C.getCaseSuccessor() != Dst)
continue;
// Edge for case C is not in EdgeMaskCache, create it.
VPValue *V = getVPValueOrAddLiveIn(C.getCaseValue(), Plan);
VPValue *CMask = Builder.createICmp(CmpInst::ICMP_EQ, Cond, V);
if (EdgeMask)
EdgeMask = Builder.createLogicalOr(EdgeMask, CMask);
else
EdgeMask = CMask;
}
assert(EdgeMask && "mask must be created");
return EdgeMaskCache[Edge] = EdgeMask;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewritten to compute all masks once, thanks

Eq = Builder.createICmp(CmpInst::ICMP_EQ, Cond, V);
}
if (Mask)
Mask = Builder.createOr(Mask, Eq);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Mask = Builder.createOr(Mask, Eq);
Mask = Builder.createLogicalOr(Mask, Eq);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a logical OR shouldn't be needed; the only case where the created compares can be poison is when the switch condition is poison; in that case all conditions will be poison and there's no difference between bitwise or logical OR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth a comment?
The commit message should also be updated to read OR rather than logical OR, to be consistent.

else
Mask = Eq;
}
if (IsDefault)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mask may be null in this case, avoid Not'ing it if so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

@@ -7763,6 +7763,41 @@ VPValue *VPRecipeBuilder::createEdgeMask(BasicBlock *Src, BasicBlock *Dst) {

VPValue *SrcMask = getBlockInMask(Src);

if (auto *SI = dyn_cast<SwitchInst>(Src->getTerminator())) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth asserting that SI stays in the same loop iteration, rather than breaking or continuing to its header? E.g., that !OrigLoop->isLoopExiting(Src).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added assert here and check to LoopVectorizationLegality, thanks!

} else if (C.getCaseSuccessor() != Dst)
continue;

VPValue *Eq = EdgeMaskCache.lookup({Src, C.getCaseSuccessor()});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may succeed only in the IsDefault case, and may lead to OR'ing multiple instances of the same mask - if the same (non-default) successor is the destination of multiple cases. Should be easily folded later though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that is currently happening for some test cases, but should be cleaned up by VPlan-to-VPlan recipe simplification (follow-up needed)

Copy link
Contributor

@nikolaypanchenko nikolaypanchenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add tests for vplan printing ?

if (auto *SI = dyn_cast<SwitchInst>(Src->getTerminator())) {
// Create mask where the terminator in Src is a switch. We need to handle 2
// separate cases:
// 1. Dst is not the default desintation. Dst is reached if any of the cases
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

desintation -> destination

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks!

; IC1-NEXT: [[TMP7:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 -12, i64 -12, i64 -12, i64 -12>
; IC1-NEXT: [[TMP8:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 13, i64 13, i64 13, i64 13>
; IC1-NEXT: [[TMP9:%.*]] = or <4 x i1> [[TMP7]], [[TMP8]]
; IC1-NEXT: [[TMP10:%.*]] = or <4 x i1> [[TMP9]], [[TMP9]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

This stems from having parallel {pred, succ} edges in the CFG for multiple switch cases that share a common succ destination, coupled with caching a single edge mask per {pred, succ} pair of VPBB's.

The in mask of a VPBB is created by ORing the edge masks from its predecessors - suffice to visit each predecessor once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, could be done by de-duplicating predecessors (or VPlan based simplification)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Case should also apply to conditional branches whose two destinations are the same - a case optimized by createEdgeMask() - but not by its user createBlockInMask().

fhahn added a commit that referenced this pull request Aug 1, 2024
Extra tests for
#99808, including cost model
tests.
// collect compares for all cases once.
VPValue *Cond = getVPValueOrAddLiveIn(SI->getCondition(), Plan);
BasicBlock *DefaultDst = SI->getDefaultDest();
MapVector<BasicBlock *, SmallVector<VPValue *>> Map;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
MapVector<BasicBlock *, SmallVector<VPValue *>> Map;
MapVector<BasicBlock *, SmallVector<VPValue *>> Destination2Compares;

or some shorter yet meaningful name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to Dst2Compares, thanks!

// 1. Dst is not the default destination. Dst is reached if any of the cases
// with destination == Dst are taken. Join the conditions for each case
// where destination == Dst using a logical OR.
for (const auto &[Dst, Conds] : Map) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dst is also a parameter to the method. Better outline this part which computes masks for all edges of a switch to a separate createSwitchEdgeMasks() method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outlined, thanks!

// We need to handle 2 separate cases:
// 1. Dst is not the default destination. Dst is reached if any of the cases
// with destination == Dst are taken. Join the conditions for each case
// where destination == Dst using a logical OR.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// where destination == Dst using a logical OR.
// whose destination == Dst using an OR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, thanks!


// 2. Dst is the default destination. Dst is reached if none of the cases
// with destination != Dst are taken. Join the conditions for each case
// where the destination is != Dst using a logical OR and negate it.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// where the destination is != Dst using a logical OR and negate it.
// where the destination is != Dst using an OR and negate it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated

BasicBlock *DefaultDst = SI->getDefaultDest();
MapVector<BasicBlock *, SmallVector<VPValue *>> Map;
for (auto &C : SI->cases()) {
auto I = Map.insert({C.getCaseSuccessor(), {}});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto I = Map.insert({C.getCaseSuccessor(), {}});
// Cases whose destination is the same as default are redundant and can be ignored - they will get there anyhow.
if (C.getCaseSuccessor() == DefaultDst)
continue;
auto I = Map.insert({C.getCaseSuccessor(), {}});

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one should also be updated

Mask = Builder.createOr(Mask, V);
if (SrcMask)
Mask = Builder.createLogicalAnd(SrcMask, Mask);
EdgeMaskCache[{Src, Dst}] = Mask;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
EdgeMaskCache[{Src, Dst}] = Mask;
EdgeMaskCache[{Src, Dst}] = Mask;
DefaultMask = !DefaultMask ? Mask : Builder.createOr(DefaultMask, Mask);

can also collect DefaultMask in this loop, to be finalized and stored after it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks!

; IC1-NEXT: [[TMP7:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 -12, i64 -12, i64 -12, i64 -12>
; IC1-NEXT: [[TMP8:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 13, i64 13, i64 13, i64 13>
; IC1-NEXT: [[TMP9:%.*]] = or <4 x i1> [[TMP7]], [[TMP8]]
; IC1-NEXT: [[TMP10:%.*]] = or <4 x i1> [[TMP9]], [[TMP9]]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Case should also apply to conditional branches whose two destinations are the same - a case optimized by createEdgeMask() - but not by its user createBlockInMask().

Eq = Builder.createICmp(CmpInst::ICMP_EQ, Cond, V);
}
if (Mask)
Mask = Builder.createOr(Mask, Eq);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth a comment?
The commit message should also be updated to read OR rather than logical OR, to be consistent.

fhahn added a commit that referenced this pull request Aug 5, 2024
Support for vectorizing switch statements will be added in
#99808.

Update the loop to use a call that cannot be vectorized to preserve
testing surfacing analysis remarks via the frontend.
Copy link
Collaborator

@ayalz ayalz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, thanks for accommodating! Left behind several comments.

@@ -6456,6 +6456,17 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I,
// a predicated block since it will become a fall-through, although we
// may decide in the future to call TTI for all branches.
}
case Instruction::Switch: {
if (VF.isScalar())
return TTI.getCFInstrCost(Instruction::Switch, CostKind);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above scalar cost seems right, wonder about the vector cost below - the cost associated with predicating conditional branches is collected when visiting each phi, rather than the branch itself. May be good to calibrate with some tests, can leave behind a TODO to be done separately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The vector code matches the cost of the generated masks, which h are costed explicitly for the version with branches due to the compares being explicit instructions. Currently it seems more like the scalar cost may be estimated by getCFInstrCost, but that probably would need to be fixed in TTI.

@@ -7839,6 +7850,60 @@ VPRecipeBuilder::mapToVPValues(User::op_range Operands) {
return map_range(Operands, Fn);
}

void VPRecipeBuilder::createSwitchEdgeMasks(SwitchInst *SI) {
BasicBlock *Src = SI->getParent();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
BasicBlock *Src = SI->getParent();
BasicBlock *Src = SI->getParent();
assert(!EdgeMaskCache.contains(Src) && "Edge masks already created");

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added but moved down to iterating over all cases, as we need to look up (Src, Dst) pairs, thanks!

Comment on lines 137 to 138
/// Create masks for all cases with destination different than the default
/// destination, and a mask for the default destination.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Create masks for all cases with destination different than the default
/// destination, and a mask for the default destination.
/// Create an edge mask for every destination of cases and/or default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

Comment on lines +30 to +33
; COST-NEXT: [[TMP7:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 -12, i64 -12, i64 -12, i64 -12>
; COST-NEXT: [[TMP8:%.*]] = icmp eq <4 x i64> [[WIDE_LOAD]], <i64 13, i64 13, i64 13, i64 13>
; COST-NEXT: [[TMP9:%.*]] = or <4 x i1> [[TMP7]], [[TMP8]]
; COST-NEXT: [[TMP10:%.*]] = or <4 x i1> [[TMP9]], [[TMP9]]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be good to have alongside a version doing the same with a conditional branch, i.e., "if (%1 == -12) || (%1 == 13) *ptr_iv = 42". (But avoid getting them folded into a switch as in the last test at the bottom.) Could be useful for calibrating the associated costs, in addition to comparing the generated codes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a variation, decision is the same.

@@ -104,9 +181,62 @@ define void @switch_all_dests_distinct(ptr %start, ptr %end) {
; FORCED-LABEL: define void @switch_all_dests_distinct(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sanity check: is it indeed unprofitable to vectorize this case having four distinct destinations, say by VF=4 and UF=1, as decided by COST?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked and the reason for not vectorizing is that getCFInstrCost considers Switch as free on X86.

@@ -162,21 +343,21 @@ define void @switch_to_header(ptr %start) {
; IC1-NEXT: [[ENTRY:.*]]:
; IC1-NEXT: br label %[[LOOP_HEADER:.*]]
; IC1: [[LOOP_HEADER]]:
; IC1-NEXT: [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP_HEADER_BACKEDGE:.*]] ]
; IC1-NEXT: [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[IF_THEN1:.*]] ]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If/how are these changes in block labels (and those below) related to this patch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-generated check lines separately, changes here should be gone now 5286656

; IC1-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
; IC1-NEXT: [[TMP1:%.*]] = getelementptr inbounds i64, ptr [[START]], i64 [[TMP0]]
; IC1-NEXT: [[TMP2:%.*]] = getelementptr inbounds i64, ptr [[TMP1]], i32 0
; IC1-NEXT: store <2 x i64> <i64 42, i64 42>, ptr [[TMP2]], align 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

; IC2-NEXT: [[TMP2:%.*]] = getelementptr inbounds i64, ptr [[START]], i64 [[TMP0]]
; IC2-NEXT: [[TMP3:%.*]] = getelementptr inbounds i64, ptr [[START]], i64 [[TMP1]]
; IC2-NEXT: [[TMP4:%.*]] = getelementptr inbounds i64, ptr [[TMP2]], i32 0
; IC2-NEXT: [[TMP5:%.*]] = getelementptr inbounds i64, ptr [[TMP2]], i32 2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is TMP3 (and TMP1) dead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is due to the fact that VPVectorPointerRecipe is responsible to generate pointers for all parts, using only the first part of the pointe operand, but needs marking as only-uses-first-part. Will do separately, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partly fixed in 5a42a67, but this case would also require adjusting VPReplicateRecipe::execute to only generate a single part if only the first part is used.

Might be better to move forward with #95842, which I just rebased, will address comments soon!

@@ -11,21 +11,71 @@ define dso_local void @test(ptr %start, ptr %end) #0 {
; CHECK-LABEL: @test(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Time to replace above FIXME by a comment that under -O2 adjacent branches get fused into a switch statement before vectorization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped, thanks!

; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <8 x i32>, ptr [[NEXT_GEP]], align 4
; CHECK-NEXT: [[WIDE_LOAD8:%.*]] = load <8 x i32>, ptr [[TMP5]], align 4
; CHECK-NEXT: [[WIDE_LOAD9:%.*]] = load <8 x i32>, ptr [[TMP6]], align 4
; CHECK-NEXT: [[WIDE_LOAD10:%.*]] = load <8 x i32>, ptr [[TMP7]], align 4
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sanity check: is the decision to vectorize by VF=8, UF=4 taken by COST according to cost estimates involving a two-case switch, profitable? I.e., comparable to the decision of a conditional branch (w/o being folded into a switch).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep the devisions agree

fhahn added a commit that referenced this pull request Aug 11, 2024
Regenerate check lines for test to avoid unrelated changes in
#99808.
fhahn added 4 commits August 11, 2024 15:09
Update createEdgeMask to created masks where the terminator in Src is a
switch. We need to handle 2 separate cases:

1. Dst is not the default desintation. Dst is reached if any of the cases
   with destination == Dst are taken. Join the conditions for each case
  where destination == Dst using a logical OR.
2. Dst is the default destination. Dst is reached if none of the cases
   with destination != Dst are taken. Join the conditions for each case
   where the destination is != Dst using a logical OR and negate it.

Fixes llvm#48188.
@fhahn fhahn merged commit f0df4fb into llvm:main Aug 11, 2024
6 of 8 checks passed
@llvm-ci
Copy link
Collaborator

llvm-ci commented Aug 11, 2024

LLVM Buildbot has detected a new failure on builder bolt-x86_64-ubuntu-clang running on bolt-worker while building llvm at step 6 "test-build-clang-bolt-stage2-clang-bolt".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/113/builds/2099

Here is the relevant piece of the build log for the reference:

Step 6 (test-build-clang-bolt-stage2-clang-bolt) failure: test (failure)
...
726.610 [9/6/2999] Linking CXX static library lib/libclangStaticAnalyzerCore.a
726.783 [8/6/3000] Linking CXX static library lib/libclangStaticAnalyzerCheckers.a
727.676 [8/5/3001] Building CXX object tools/clang/tools/driver/CMakeFiles/clang.dir/cc1as_main.cpp.o
728.716 [8/4/3002] Building CXX object tools/clang/tools/driver/CMakeFiles/clang.dir/driver.cpp.o
729.181 [8/3/3003] Building CXX object tools/clang/lib/StaticAnalyzer/Frontend/CMakeFiles/obj.clangStaticAnalyzerFrontend.dir/AnalysisConsumer.cpp.o
729.204 [7/3/3004] Linking CXX static library lib/libclangStaticAnalyzerFrontend.a
729.215 [6/3/3005] Linking CXX static library lib/libclangFrontendTool.a
732.028 [6/2/3006] Building CXX object tools/clang/tools/driver/CMakeFiles/clang.dir/cc1_main.cpp.o
771.446 [5/2/3007] Linking CXX executable bin/llvm-bolt
821.543 [5/1/3008] Linking CXX executable bin/clang-20
FAILED: bin/clang-20 
: && /home/worker/bolt-worker2/bolt-x86_64-ubuntu-clang/build/./bin/clang++ -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -fprofile-instr-use="/home/worker/bolt-worker2/bolt-x86_64-ubuntu-clang/build/tools/clang/stage2-instrumented-bins/tools/clang/utils/perf-training/clang.profdata" -flto=thin -fno-common -Woverloaded-virtual -Wno-nested-anon-types -O3 -DNDEBUG -Wl,--emit-relocs,-znow -fuse-ld=lld -Wl,--color-diagnostics -fprofile-instr-use="/home/worker/bolt-worker2/bolt-x86_64-ubuntu-clang/build/tools/clang/stage2-instrumented-bins/tools/clang/utils/perf-training/clang.profdata" -flto=thin -Wl,--thinlto-cache-dir=/home/worker/bolt-worker2/bolt-x86_64-ubuntu-clang/build/tools/clang/stage2-instrumented-bins/tools/clang/stage2-bins/lto.cache   -Wl,--export-dynamic tools/clang/tools/driver/CMakeFiles/clang.dir/driver.cpp.o tools/clang/tools/driver/CMakeFiles/clang.dir/cc1_main.cpp.o tools/clang/tools/driver/CMakeFiles/clang.dir/cc1as_main.cpp.o tools/clang/tools/driver/CMakeFiles/clang.dir/cc1gen_reproducer_main.cpp.o tools/clang/tools/driver/CMakeFiles/clang.dir/clang-driver.cpp.o -o bin/clang-20  -Wl,-rpath,"\$ORIGIN/../lib:"  lib/libLLVMX86CodeGen.a  lib/libLLVMX86AsmParser.a  lib/libLLVMX86Desc.a  lib/libLLVMX86Disassembler.a  lib/libLLVMX86Info.a  lib/libLLVMAnalysis.a  lib/libLLVMCodeGen.a  lib/libLLVMCore.a  lib/libLLVMipo.a  lib/libLLVMAggressiveInstCombine.a  lib/libLLVMInstCombine.a  lib/libLLVMInstrumentation.a  lib/libLLVMMC.a  lib/libLLVMMCParser.a  lib/libLLVMObjCARCOpts.a  lib/libLLVMOption.a  lib/libLLVMScalarOpts.a  lib/libLLVMSupport.a  lib/libLLVMTargetParser.a  lib/libLLVMTransformUtils.a  lib/libLLVMVectorize.a  lib/libclangBasic.a  lib/libclangCodeGen.a  lib/libclangDriver.a  lib/libclangFrontend.a  lib/libclangFrontendTool.a  lib/libclangSerialization.a  lib/libLLVMAsmPrinter.a  lib/libLLVMGlobalISel.a  lib/libLLVMSelectionDAG.a  lib/libLLVMMCDisassembler.a  lib/libclangCodeGen.a  lib/libLLVMCoverage.a  lib/libLLVMFrontendDriver.a  lib/libLLVMLTO.a  lib/libLLVMExtensions.a  lib/libLLVMPasses.a  lib/libLLVMCodeGen.a  lib/libLLVMCodeGenTypes.a  lib/libLLVMObjCARCOpts.a  lib/libLLVMCFGuard.a  lib/libLLVMIRPrinter.a  lib/libLLVMTarget.a  lib/libLLVMCoroutines.a  lib/libLLVMipo.a  lib/libLLVMInstrumentation.a  lib/libLLVMVectorize.a  lib/libLLVMBitWriter.a  lib/libLLVMLinker.a  lib/libLLVMHipStdPar.a  lib/libclangExtractAPI.a  lib/libclangInstallAPI.a  lib/libLLVMTextAPIBinaryReader.a  lib/libclangRewriteFrontend.a  lib/libclangARCMigrate.a  lib/libclangStaticAnalyzerFrontend.a  lib/libclangStaticAnalyzerCheckers.a  lib/libclangStaticAnalyzerCore.a  lib/libclangCrossTU.a  lib/libclangIndex.a  lib/libclangFrontend.a  lib/libclangDriver.a  lib/libLLVMWindowsDriver.a  lib/libLLVMOption.a  lib/libclangParse.a  lib/libclangSerialization.a  lib/libclangSema.a  lib/libclangAnalysis.a  lib/libclangASTMatchers.a  lib/libLLVMFrontendHLSL.a  lib/libclangAPINotes.a  lib/libclangEdit.a  lib/libclangSupport.a  lib/libclangAST.a  lib/libclangFormat.a  lib/libclangToolingInclusions.a  lib/libclangToolingCore.a  lib/libclangRewrite.a  lib/libclangLex.a  lib/libclangBasic.a  lib/libLLVMFrontendOpenMP.a  lib/libLLVMScalarOpts.a  lib/libLLVMAggressiveInstCombine.a  lib/libLLVMInstCombine.a  lib/libLLVMFrontendOffloading.a  lib/libLLVMTransformUtils.a  lib/libLLVMAnalysis.a  lib/libLLVMProfileData.a  lib/libLLVMSymbolize.a  lib/libLLVMDebugInfoDWARF.a  lib/libLLVMDebugInfoPDB.a  lib/libLLVMDebugInfoMSF.a  lib/libLLVMDebugInfoBTF.a  lib/libLLVMObject.a  lib/libLLVMMCParser.a  lib/libLLVMMC.a  lib/libLLVMDebugInfoCodeView.a  lib/libLLVMIRReader.a  lib/libLLVMBitReader.a  lib/libLLVMAsmParser.a  lib/libLLVMTextAPI.a  lib/libLLVMCore.a  lib/libLLVMRemarks.a  lib/libLLVMBitstreamReader.a  lib/libLLVMBinaryFormat.a  lib/libLLVMTargetParser.a  lib/libLLVMSupport.a  lib/libLLVMDemangle.a  -lrt  -ldl  -lm  /usr/lib/x86_64-linux-gnu/libz.so && :
instruction should have been considered by earlier checks
UNREACHABLE executed at /home/worker/bolt-worker2/bolt-x86_64-ubuntu-clang/llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:3361!
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.	Running pass "function<eager-inv>(float2int,lower-constant-intrinsics,chr,loop(loop-rotate<header-duplication;no-prepare-for-lto>,loop-deletion),loop-distribute,inject-tli-mappings,loop-vectorize<no-interleave-forced-only;no-vectorize-forced-only;>,infer-alignment,loop-load-elim,instcombine<max-iterations=1;no-use-loop-info;no-verify-fixpoint>,simplifycfg<bonus-inst-threshold=1;forward-switch-cond;switch-range-to-icmp;switch-to-lookup;no-keep-loops;hoist-common-insts;sink-common-insts;speculate-blocks;simplify-cond-branch;no-speculate-unpredictables>,slp-vectorizer,vector-combine,instcombine<max-iterations=1;no-use-loop-info;no-verify-fixpoint>,loop-unroll<O3>,transform-warning,sroa<preserve-cfg>,infer-alignment,instcombine<max-iterations=1;no-use-loop-info;no-verify-fixpoint>,loop-mssa(licm<allowspeculation>),alignment-from-assumptions,loop-sink,instsimplify,div-rem-pairs,tailcallelim,simplifycfg<bonus-inst-threshold=1;no-forward-switch-cond;switch-range-to-icmp;no-switch-to-lookup;keep-loops;no-hoist-common-insts;no-sink-common-insts;speculate-blocks;simplify-cond-branch;speculate-unpredictables>)" on module "lib/libLLVMCoroutines.a(CoroFrame.cpp.o at 402394)"
1.	Running pass "loop-vectorize<no-interleave-forced-only;no-vectorize-forced-only;>" on function "_ZL13solveTypeNamePN4llvm4TypeE"
LLVM ERROR: Failed to rename temporary file /home/worker/bolt-worker2/bolt-x86_64-ubuntu-clang/build/tools/clang/stage2-instrumented-bins/tools/clang/stage2-bins/lto.cache/Thin-34f422.tmp.o to /home/worker/bolt-worker2/bolt-x86_64-ubuntu-clang/build/tools/clang/stage2-instrumented-bins/tools/clang/stage2-bins/lto.cache/llvmcache-568F5DB9F016D4F2157F15CC13205EE586C27228: No such file or directory

clang++: error: unable to execute command: Aborted (core dumped)
clang++: error: linker command failed due to signal (use -v to see invocation)
ninja: build stopped: subcommand failed.
FAILED: tools/clang/stage2-stamps/stage2-clang-bolt /home/worker/bolt-worker2/bolt-x86_64-ubuntu-clang/build/tools/clang/stage2-instrumented-bins/tools/clang/stage2-stamps/stage2-clang-bolt 
cd /home/worker/bolt-worker2/bolt-x86_64-ubuntu-clang/build/tools/clang/stage2-instrumented-bins/tools/clang && /usr/bin/cmake --build /home/worker/bolt-worker2/bolt-x86_64-ubuntu-clang/build/tools/clang/stage2-instrumented-bins/tools/clang/stage2-bins/ --target clang-bolt
ninja: build stopped: subcommand failed.
FAILED: tools/clang/stage2-instrumented-stamps/stage2-instrumented-stage2-clang-bolt /home/worker/bolt-worker2/bolt-x86_64-ubuntu-clang/build/tools/clang/stage2-instrumented-stamps/stage2-instrumented-stage2-clang-bolt 
cd /home/worker/bolt-worker2/bolt-x86_64-ubuntu-clang/build/tools/clang && /usr/bin/cmake --build /home/worker/bolt-worker2/bolt-x86_64-ubuntu-clang/build/tools/clang/stage2-instrumented-bins/ --target stage2-clang-bolt
ninja: build stopped: subcommand failed.

@fhahn fhahn deleted the lv-switch branch August 11, 2024 19:58
@fhahn
Copy link
Contributor Author

fhahn commented Aug 11, 2024

The buildbot failure above was due to not considering SwitchInst in ::isPredicatedInst. Should be fixed in 60680f7

fhahn added a commit that referenced this pull request Aug 12, 2024
VPVectorPointerRecipe only uses the first part of the pointer operand,
so mark it accordingly.

Follow-up suggested as part of
#99808.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang Clang issues not falling into any other category llvm:transforms vectorizers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Missed loop replace_if vectorization due to the presence of switch
5 participants