Commit 86bb7df

[CostModel][X86] getScalarizationOverhead - handle vXi1 extracts with MOVMSK (pre-AVX512)

We can quickly extract multiple elements of a bool vector using MOVMSK ops - since we don't know what generated the vXi1, I've been optimistic and assumed we can use PMOVMSKB to extract the maximum number of bools with a single op.

The MOVMSK pattern isn't great for extract+insert round trips as vXi1 type legalization can interfere with this a lot - so this relies on us remaining good at using getScalarizationOverhead properly (and tagging both Insert and Extract modes) for those round trip cases.

The AVX512 KMOV codegen for bool extraction is a bit of a mess so for now I've not included that - the per-element cost is a lot more accurate for current codegen.
1 parent fd7efe3 commit 86bb7df

5 files changed, +755 -559 lines changed
llvm/lib/Target/X86/X86TargetTransformInfo.cpp

Lines changed: 14 additions & 3 deletions
@@ -3833,10 +3833,21 @@ InstructionCost X86TTIImpl::getScalarizationOverhead(VectorType *Ty,
     }
   }
 
-  // TODO: Use default extraction for now, but we should investigate extending this
-  // to handle repeated subvector extraction.
-  if (Extract)
+  if (Extract) {
+    // vXi1 can be efficiently extracted with MOVMSK.
+    // TODO: AVX512 predicate mask handling.
+    // NOTE: This doesn't work well for roundtrip scalarization.
+    if (!Insert && Ty->getScalarSizeInBits() == 1 && !ST->hasAVX512()) {
+      unsigned NumElts = cast<FixedVectorType>(Ty)->getNumElements();
+      unsigned MaxElts = ST->hasAVX2() ? 32 : 16;
+      unsigned MOVMSKCost = (NumElts + MaxElts - 1) / MaxElts;
+      return MOVMSKCost;
+    }
+
+    // TODO: Use default extraction for now, but we should investigate extending
+    // this to handle repeated subvector extraction.
     Cost += BaseT::getScalarizationOverhead(Ty, DemandedElts, false, Extract);
+  }
 
   return Cost;
 }
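
The new early-return path is a rounded-up division: the number of MOVMSK ops needed to cover every bool in the vector. The following is a minimal standalone sketch of that arithmetic, not the LLVM code itself. The names `maxEltsPerMovmsk` and `movmskExtractCost` are hypothetical, and the `HasAVX2` flag stands in for `ST->hasAVX2()`.

```cpp
// Hypothetical helpers for illustration only; these are not LLVM APIs.

// Optimistic per-op capacity taken from the patch: 16 bools per op without
// AVX2, 32 bools per op with AVX2.
constexpr unsigned maxEltsPerMovmsk(bool HasAVX2) {
  return HasAVX2 ? 32u : 16u;
}

// Ceiling division: how many MOVMSK ops are needed to cover NumElts bools.
constexpr unsigned movmskExtractCost(unsigned NumElts, bool HasAVX2) {
  return (NumElts + maxEltsPerMovmsk(HasAVX2) - 1) / maxEltsPerMovmsk(HasAVX2);
}

// Compile-time checks of a few vXi1 widths.
static_assert(movmskExtractCost(16, false) == 1, "v16i1 fits in one op");
static_assert(movmskExtractCost(32, false) == 2, "v32i1 needs two ops without AVX2");
static_assert(movmskExtractCost(32, true) == 1, "v32i1 fits in one op with AVX2");
static_assert(movmskExtractCost(64, true) == 2, "v64i1 needs two ops with AVX2");
```

Note that the patch only takes this path when `Extract` is set, `Insert` is not, the scalar size is 1 bit, and the subtarget lacks AVX512; every other case still falls through to the default per-element extraction cost.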

llvm/test/Analysis/CostModel/X86/intrinsic-cost-kinds.ll

Lines changed: 2 additions & 2 deletions
@@ -281,7 +281,7 @@ define void @fshl(i32 %a, i32 %b, i32 %c, <16 x i32> %va, <16 x i32> %vb, <16 x
 
 define void @maskedgather(<16 x float*> %va, <16 x i1> %vb, <16 x float> %vc) {
 ; THRU-LABEL: 'maskedgather'
-; THRU-NEXT: Cost Model: Found an estimated cost of 92 for instruction: %v = call <16 x float> @llvm.masked.gather.v16f32.v16p0f32(<16 x float*> %va, i32 1, <16 x i1> %vb, <16 x float> %vc)
+; THRU-NEXT: Cost Model: Found an estimated cost of 77 for instruction: %v = call <16 x float> @llvm.masked.gather.v16f32.v16p0f32(<16 x float*> %va, i32 1, <16 x i1> %vb, <16 x float> %vc)
 ; THRU-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
 ; LATE-LABEL: 'maskedgather'
@@ -302,7 +302,7 @@ define void @maskedgather(<16 x float*> %va, <16 x i1> %vb, <16 x float> %vc) {
 
 define void @maskedscatter(<16 x float> %va, <16 x float*> %vb, <16 x i1> %vc) {
 ; THRU-LABEL: 'maskedscatter'
-; THRU-NEXT: Cost Model: Found an estimated cost of 92 for instruction: call void @llvm.masked.scatter.v16f32.v16p0f32(<16 x float> %va, <16 x float*> %vb, i32 1, <16 x i1> %vc)
+; THRU-NEXT: Cost Model: Found an estimated cost of 77 for instruction: call void @llvm.masked.scatter.v16f32.v16p0f32(<16 x float> %va, <16 x float*> %vb, i32 1, <16 x i1> %vc)
 ; THRU-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
 ; LATE-LABEL: 'maskedscatter'
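
Both throughput costs drop by 15 (92 to 77). One plausible reading, offered as an assumption rather than something the diff states, is that the `<16 x i1>` mask was previously charged one unit per extracted bool and is now charged a single MOVMSK. The sketch below only checks that this reading matches the numbers; `expectedSaving` and the unit per-element cost are hypothetical.

```cpp
// Hypothetical helper for illustration; not part of LLVM.
// Saving from replacing per-element mask extraction with MOVMSK ops.
constexpr unsigned expectedSaving(unsigned NumElts, unsigned MaxElts,
                                  unsigned PerEltCost) {
  return NumElts * PerEltCost - (NumElts + MaxElts - 1) / MaxElts;
}

// ASSUMPTION: each i1 extract previously cost 1, and MaxElts is 16 (no AVX2).
// With MaxElts = 32 the MOVMSK cost for 16 elements would also be 1.
static_assert(expectedSaving(16, 16, 1) == 15, "16 extracts become 1 MOVMSK");

// Only 92 and 77 come from the test expectations in the diff.
static_assert(92 - expectedSaving(16, 16, 1) == 77, "matches the updated cost");
```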