
Commit 299d710

[RISCV] Lower fixed vectors extract_vector_elt through stack at high LMUL
This is the extract side of D159332. The goal is to avoid non-linear costing on patterns where an entire vector is split back into scalars. This is an idiomatic pattern for SLP.

Each vslide operation is linear in LMUL on common hardware. (For instance, the sifive-x280 cost model models slides this way.) If we do VL unique extracts, each with a cost linear in LMUL, the overall cost is O(LMUL^2) * VLEN/ETYPE. To avoid the degenerate case, fall back to the stack if we're beyond LMUL2.

There's a subtlety here. For this to work, we're *relying* on an optimization in LegalizeDAG which tries to reuse the stack slot from a previous extract. In practice, this appears to trigger for patterns within a block, but if we ended up with an explode idiom split across multiple blocks, we'd still be in quadratic territory. I don't think that variant is fixable within SDAG.

It's tempting to think we can do better than going through the stack, but, well, I haven't found it yet if it exists.

Here are the results for sifive-x280 on all the variants I wrote (all 16 x i64 with V):

output/sifive-x280/linear_decomp_with_slidedown.mca:Total Cycles: 20703
output/sifive-x280/linear_decomp_with_vrgather.mca:Total Cycles: 23903
output/sifive-x280/naive_linear_with_slidedown.mca:Total Cycles: 21604
output/sifive-x280/naive_linear_with_vrgather.mca:Total Cycles: 22804
output/sifive-x280/recursive_decomp_with_slidedown.mca:Total Cycles: 15204
output/sifive-x280/recursive_decomp_with_vrgather.mca:Total Cycles: 18404
output/sifive-x280/stack_by_vreg.mca:Total Cycles: 12104
output/sifive-x280/stack_element_by_element.mca:Total Cycles: 4304

I am deliberately excluding scalable vectors. It functionally works, but frankly, the code quality for an idiomatic explode loop is so terrible either way that it felt better to leave that for future work.

Differential Revision: https://reviews.llvm.org/D159375
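As a rough illustration of the cost argument in the message above, here is a back-of-the-envelope comparison. It is a sketch only: the VLEN, the per-slide cost, and the cost "unit" are assumptions chosen to mirror the 16 x i64 case, not values from any particular scheduling model.

```cpp
// Hypothetical numbers for a 16 x i64 vector on a VLEN=128 machine, so the
// type occupies LMUL=8 registers. The "unit" is an abstract cost in which a
// vslidedown over an LMUL-m value costs about m and a scalar load/store
// costs 1.
#include <cstdio>

int main() {
  const int VLEN = 128, SEW = 64, LMUL = 8;
  const int VL = VLEN * LMUL / SEW; // 16 elements to extract

  // One vslidedown + vmv.x.s per element: each slide pays for LMUL,
  // i.e. O(LMUL^2) * VLEN/ETYPE overall.
  int slide_cost = VL * LMUL; // 16 * 8 = 128 units

  // Through the stack: one LMUL-wide store (reused across all extracts,
  // see the LegalizeDAG note above) plus VL scalar loads.
  int stack_cost = LMUL + VL; // 8 + 16 = 24 units

  std::printf("slides=%d units, stack=%d units\n", slide_cost, stack_cost);
  return 0;
}
```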
1 parent: 070c257 · commit: 299d710

File tree

4 files changed: +472 additions, −280 deletions


llvm/lib/Target/RISCV/RISCVISelLowering.cpp

Lines changed: 16 additions & 0 deletions
Lines changed: 16 additions & 0 deletions

@@ -7588,6 +7588,22 @@ SDValue RISCVTargetLowering::lowerEXTRACT_VECTOR_ELT(SDValue Op,
     }
   }
 
+  // If after narrowing, the required slide is still greater than LMUL2,
+  // fallback to generic expansion and go through the stack. This is done
+  // for a subtle reason: extracting *all* elements out of a vector is
+  // widely expected to be linear in vector size, but because vslidedown
+  // is linear in LMUL, performing N extracts using vslidedown becomes
+  // O(n^2) / (VLEN/ETYPE) work. On the surface, going through the stack
+  // seems to have the same problem (the store is linear in LMUL), but the
+  // generic expansion *memoizes* the store, and thus for many extracts of
+  // the same vector we end up with one store and a bunch of loads.
+  // TODO: We don't have the same code for insert_vector_elt because we
+  // have BUILD_VECTOR and handle the degenerate case there. Should we
+  // consider adding an inverse BUILD_VECTOR node?
+  MVT LMUL2VT = getLMUL1VT(ContainerVT).getDoubleNumVectorElementsVT();
+  if (ContainerVT.bitsGT(LMUL2VT) && VecVT.isFixedLengthVector())
+    return SDValue();
+
   // If the index is 0, the vector is already in the right position.
   if (!isNullConstant(Idx)) {
     // Use a VL of 1 to avoid processing more elements than we need.

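The early `return SDValue()` above only pays off because the generic expansion reuses the stack slot across extracts of the same vector, as the comment notes. The following is a conceptual sketch of that "one store, many loads" behavior under assumed names; `SpillCache` and `extractThroughStack` are hypothetical helpers for illustration, not the actual LegalizeDAG code or any LLVM API.

```cpp
// Illustrative only: the first extract of a vector spills the whole value
// once; every later extract of the *same* value reuses the cached spill and
// degenerates to a single scalar load.
#include <cstdint>
#include <map>
#include <vector>

static std::map<const void *, std::vector<int64_t>> SpillCache;

static int64_t extractThroughStack(const std::vector<int64_t> &Vec,
                                   unsigned Idx) {
  auto [It, Inserted] = SpillCache.try_emplace(&Vec);
  if (Inserted)
    It->second = Vec;        // first extract: one wide store, linear in LMUL
  return It->second.at(Idx); // every extract: a single scalar load
}

int main() {
  std::vector<int64_t> V{10, 20, 30, 40};
  // Both extracts hit the same cached spill; only the first one copies V.
  return static_cast<int>(extractThroughStack(V, 1) +
                          extractThroughStack(V, 3)) == 60 ? 0 : 1;
}
```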
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-extract.ll

Lines changed: 175 additions & 29 deletions
@@ -244,32 +244,89 @@ define i64 @extractelt_v3i64(ptr %x) nounwind {
 
 ; A LMUL8 type
 define i32 @extractelt_v32i32(ptr %x) nounwind {
-; CHECK-LABEL: extractelt_v32i32:
-; CHECK: # %bb.0:
-; CHECK-NEXT: li a1, 32
-; CHECK-NEXT: vsetvli zero, a1, e32, m8, ta, ma
-; CHECK-NEXT: vle32.v v8, (a0)
-; CHECK-NEXT: vsetivli zero, 1, e32, m8, ta, ma
-; CHECK-NEXT: vslidedown.vi v8, v8, 31
-; CHECK-NEXT: vmv.x.s a0, v8
-; CHECK-NEXT: ret
+; RV32-LABEL: extractelt_v32i32:
+; RV32: # %bb.0:
+; RV32-NEXT: addi sp, sp, -256
+; RV32-NEXT: sw ra, 252(sp) # 4-byte Folded Spill
+; RV32-NEXT: sw s0, 248(sp) # 4-byte Folded Spill
+; RV32-NEXT: addi s0, sp, 256
+; RV32-NEXT: andi sp, sp, -128
+; RV32-NEXT: li a1, 32
+; RV32-NEXT: vsetvli zero, a1, e32, m8, ta, ma
+; RV32-NEXT: vle32.v v8, (a0)
+; RV32-NEXT: mv a0, sp
+; RV32-NEXT: vse32.v v8, (a0)
+; RV32-NEXT: lw a0, 124(sp)
+; RV32-NEXT: addi sp, s0, -256
+; RV32-NEXT: lw ra, 252(sp) # 4-byte Folded Reload
+; RV32-NEXT: lw s0, 248(sp) # 4-byte Folded Reload
+; RV32-NEXT: addi sp, sp, 256
+; RV32-NEXT: ret
+;
+; RV64-LABEL: extractelt_v32i32:
+; RV64: # %bb.0:
+; RV64-NEXT: addi sp, sp, -256
+; RV64-NEXT: sd ra, 248(sp) # 8-byte Folded Spill
+; RV64-NEXT: sd s0, 240(sp) # 8-byte Folded Spill
+; RV64-NEXT: addi s0, sp, 256
+; RV64-NEXT: andi sp, sp, -128
+; RV64-NEXT: li a1, 32
+; RV64-NEXT: vsetvli zero, a1, e32, m8, ta, ma
+; RV64-NEXT: vle32.v v8, (a0)
+; RV64-NEXT: mv a0, sp
+; RV64-NEXT: vse32.v v8, (a0)
+; RV64-NEXT: lw a0, 124(sp)
+; RV64-NEXT: addi sp, s0, -256
+; RV64-NEXT: ld ra, 248(sp) # 8-byte Folded Reload
+; RV64-NEXT: ld s0, 240(sp) # 8-byte Folded Reload
+; RV64-NEXT: addi sp, sp, 256
+; RV64-NEXT: ret
 %a = load <32 x i32>, ptr %x
 %b = extractelement <32 x i32> %a, i32 31
 ret i32 %b
 }
 
 ; Exercise type legalization for type beyond LMUL8
 define i32 @extractelt_v64i32(ptr %x) nounwind {
-; CHECK-LABEL: extractelt_v64i32:
-; CHECK: # %bb.0:
-; CHECK-NEXT: addi a0, a0, 128
-; CHECK-NEXT: li a1, 32
-; CHECK-NEXT: vsetvli zero, a1, e32, m8, ta, ma
-; CHECK-NEXT: vle32.v v8, (a0)
-; CHECK-NEXT: vsetivli zero, 1, e32, m8, ta, ma
-; CHECK-NEXT: vslidedown.vi v8, v8, 31
-; CHECK-NEXT: vmv.x.s a0, v8
-; CHECK-NEXT: ret
+; RV32-LABEL: extractelt_v64i32:
+; RV32: # %bb.0:
+; RV32-NEXT: addi sp, sp, -256
+; RV32-NEXT: sw ra, 252(sp) # 4-byte Folded Spill
+; RV32-NEXT: sw s0, 248(sp) # 4-byte Folded Spill
+; RV32-NEXT: addi s0, sp, 256
+; RV32-NEXT: andi sp, sp, -128
+; RV32-NEXT: addi a0, a0, 128
+; RV32-NEXT: li a1, 32
+; RV32-NEXT: vsetvli zero, a1, e32, m8, ta, ma
+; RV32-NEXT: vle32.v v8, (a0)
+; RV32-NEXT: mv a0, sp
+; RV32-NEXT: vse32.v v8, (a0)
+; RV32-NEXT: lw a0, 124(sp)
+; RV32-NEXT: addi sp, s0, -256
+; RV32-NEXT: lw ra, 252(sp) # 4-byte Folded Reload
+; RV32-NEXT: lw s0, 248(sp) # 4-byte Folded Reload
+; RV32-NEXT: addi sp, sp, 256
+; RV32-NEXT: ret
+;
+; RV64-LABEL: extractelt_v64i32:
+; RV64: # %bb.0:
+; RV64-NEXT: addi sp, sp, -256
+; RV64-NEXT: sd ra, 248(sp) # 8-byte Folded Spill
+; RV64-NEXT: sd s0, 240(sp) # 8-byte Folded Spill
+; RV64-NEXT: addi s0, sp, 256
+; RV64-NEXT: andi sp, sp, -128
+; RV64-NEXT: addi a0, a0, 128
+; RV64-NEXT: li a1, 32
+; RV64-NEXT: vsetvli zero, a1, e32, m8, ta, ma
+; RV64-NEXT: vle32.v v8, (a0)
+; RV64-NEXT: mv a0, sp
+; RV64-NEXT: vse32.v v8, (a0)
+; RV64-NEXT: lw a0, 124(sp)
+; RV64-NEXT: addi sp, s0, -256
+; RV64-NEXT: ld ra, 248(sp) # 8-byte Folded Reload
+; RV64-NEXT: ld s0, 240(sp) # 8-byte Folded Reload
+; RV64-NEXT: addi sp, sp, 256
+; RV64-NEXT: ret
 %a = load <64 x i32>, ptr %x
 %b = extractelement <64 x i32> %a, i32 63
 ret i32 %b
@@ -548,16 +605,105 @@ define i64 @extractelt_v3i64_idx(ptr %x, i32 zeroext %idx) nounwind {
 }
 
 define i32 @extractelt_v32i32_idx(ptr %x, i32 zeroext %idx) nounwind {
-; CHECK-LABEL: extractelt_v32i32_idx:
-; CHECK: # %bb.0:
-; CHECK-NEXT: li a2, 32
-; CHECK-NEXT: vsetvli zero, a2, e32, m8, ta, ma
-; CHECK-NEXT: vle32.v v8, (a0)
-; CHECK-NEXT: vadd.vv v8, v8, v8
-; CHECK-NEXT: vsetivli zero, 1, e32, m8, ta, ma
-; CHECK-NEXT: vslidedown.vx v8, v8, a1
-; CHECK-NEXT: vmv.x.s a0, v8
-; CHECK-NEXT: ret
+; RV32NOM-LABEL: extractelt_v32i32_idx:
+; RV32NOM: # %bb.0:
+; RV32NOM-NEXT: addi sp, sp, -256
+; RV32NOM-NEXT: sw ra, 252(sp) # 4-byte Folded Spill
+; RV32NOM-NEXT: sw s0, 248(sp) # 4-byte Folded Spill
+; RV32NOM-NEXT: sw s2, 244(sp) # 4-byte Folded Spill
+; RV32NOM-NEXT: addi s0, sp, 256
+; RV32NOM-NEXT: andi sp, sp, -128
+; RV32NOM-NEXT: mv s2, a0
+; RV32NOM-NEXT: andi a0, a1, 31
+; RV32NOM-NEXT: li a1, 4
+; RV32NOM-NEXT: call __mulsi3@plt
+; RV32NOM-NEXT: li a1, 32
+; RV32NOM-NEXT: vsetvli zero, a1, e32, m8, ta, ma
+; RV32NOM-NEXT: vle32.v v8, (s2)
+; RV32NOM-NEXT: mv a1, sp
+; RV32NOM-NEXT: add a0, a1, a0
+; RV32NOM-NEXT: vadd.vv v8, v8, v8
+; RV32NOM-NEXT: vse32.v v8, (a1)
+; RV32NOM-NEXT: lw a0, 0(a0)
+; RV32NOM-NEXT: addi sp, s0, -256
+; RV32NOM-NEXT: lw ra, 252(sp) # 4-byte Folded Reload
+; RV32NOM-NEXT: lw s0, 248(sp) # 4-byte Folded Reload
+; RV32NOM-NEXT: lw s2, 244(sp) # 4-byte Folded Reload
+; RV32NOM-NEXT: addi sp, sp, 256
+; RV32NOM-NEXT: ret
+;
+; RV32M-LABEL: extractelt_v32i32_idx:
+; RV32M: # %bb.0:
+; RV32M-NEXT: addi sp, sp, -256
+; RV32M-NEXT: sw ra, 252(sp) # 4-byte Folded Spill
+; RV32M-NEXT: sw s0, 248(sp) # 4-byte Folded Spill
+; RV32M-NEXT: addi s0, sp, 256
+; RV32M-NEXT: andi sp, sp, -128
+; RV32M-NEXT: andi a1, a1, 31
+; RV32M-NEXT: li a2, 32
+; RV32M-NEXT: vsetvli zero, a2, e32, m8, ta, ma
+; RV32M-NEXT: vle32.v v8, (a0)
+; RV32M-NEXT: slli a1, a1, 2
+; RV32M-NEXT: mv a0, sp
+; RV32M-NEXT: or a1, a0, a1
+; RV32M-NEXT: vadd.vv v8, v8, v8
+; RV32M-NEXT: vse32.v v8, (a0)
+; RV32M-NEXT: lw a0, 0(a1)
+; RV32M-NEXT: addi sp, s0, -256
+; RV32M-NEXT: lw ra, 252(sp) # 4-byte Folded Reload
+; RV32M-NEXT: lw s0, 248(sp) # 4-byte Folded Reload
+; RV32M-NEXT: addi sp, sp, 256
+; RV32M-NEXT: ret
+;
+; RV64NOM-LABEL: extractelt_v32i32_idx:
+; RV64NOM: # %bb.0:
+; RV64NOM-NEXT: addi sp, sp, -256
+; RV64NOM-NEXT: sd ra, 248(sp) # 8-byte Folded Spill
+; RV64NOM-NEXT: sd s0, 240(sp) # 8-byte Folded Spill
+; RV64NOM-NEXT: sd s2, 232(sp) # 8-byte Folded Spill
+; RV64NOM-NEXT: addi s0, sp, 256
+; RV64NOM-NEXT: andi sp, sp, -128
+; RV64NOM-NEXT: mv s2, a0
+; RV64NOM-NEXT: andi a0, a1, 31
+; RV64NOM-NEXT: li a1, 4
+; RV64NOM-NEXT: call __muldi3@plt
+; RV64NOM-NEXT: li a1, 32
+; RV64NOM-NEXT: vsetvli zero, a1, e32, m8, ta, ma
+; RV64NOM-NEXT: vle32.v v8, (s2)
+; RV64NOM-NEXT: mv a1, sp
+; RV64NOM-NEXT: add a0, a1, a0
+; RV64NOM-NEXT: vadd.vv v8, v8, v8
+; RV64NOM-NEXT: vse32.v v8, (a1)
+; RV64NOM-NEXT: lw a0, 0(a0)
+; RV64NOM-NEXT: addi sp, s0, -256
+; RV64NOM-NEXT: ld ra, 248(sp) # 8-byte Folded Reload
+; RV64NOM-NEXT: ld s0, 240(sp) # 8-byte Folded Reload
+; RV64NOM-NEXT: ld s2, 232(sp) # 8-byte Folded Reload
+; RV64NOM-NEXT: addi sp, sp, 256
+; RV64NOM-NEXT: ret
+;
+; RV64M-LABEL: extractelt_v32i32_idx:
+; RV64M: # %bb.0:
+; RV64M-NEXT: addi sp, sp, -256
+; RV64M-NEXT: sd ra, 248(sp) # 8-byte Folded Spill
+; RV64M-NEXT: sd s0, 240(sp) # 8-byte Folded Spill
+; RV64M-NEXT: addi s0, sp, 256
+; RV64M-NEXT: andi sp, sp, -128
+; RV64M-NEXT: andi a1, a1, 31
+; RV64M-NEXT: li a2, 32
+; RV64M-NEXT: vsetvli zero, a2, e32, m8, ta, ma
+; RV64M-NEXT: vle32.v v8, (a0)
+; RV64M-NEXT: slli a1, a1, 2
+; RV64M-NEXT: mv a0, sp
+; RV64M-NEXT: or a1, a0, a1
+; RV64M-NEXT: vadd.vv v8, v8, v8
+; RV64M-NEXT: vse32.v v8, (a0)
+; RV64M-NEXT: lw a0, 0(a1)
+; RV64M-NEXT: addi sp, s0, -256
+; RV64M-NEXT: ld ra, 248(sp) # 8-byte Folded Reload
+; RV64M-NEXT: ld s0, 240(sp) # 8-byte Folded Reload
+; RV64M-NEXT: addi sp, sp, 256
+; RV64M-NEXT: ret
 %a = load <32 x i32>, ptr %x
 %b = add <32 x i32> %a, %a
 %c = extractelement <32 x i32> %b, i32 %idx
