Commit 28ab833

AArch64: extend cost model to cost outer loop vect where the inner loop is invariant [PR121290]

Consider the example:

    void
    f (int *restrict x, int *restrict y, int *restrict z, int n)
    {
      for (int i = 0; i < 4; ++i)
        {
          int res = 0;
          for (int j = 0; j < 100; ++j)
            res += y[j] * z[i];
          x[i] = res;
        }
    }

we currently vectorize as

    f:
            movi    v30.4s, 0
            ldr     q31, [x2]
            add     x2, x1, 400
    .L2:
            ld1r    {v29.4s}, [x1], 4
            mla     v30.4s, v29.4s, v31.4s
            cmp     x2, x1
            bne     .L2
            str     q30, [x0]
            ret

which is not useful because by doing outer-loop vectorization we're performing less work per iteration than we would had we done inner-loop vectorization and simply unrolled the inner loop.

This patch teaches the cost model that if all your leafs are invariant, then adjust the loop cost by * VF, since every vector iteration has at least one lane really just doing 1 scalar.

There are a couple of ways we could have solved this, one is to increase the unroll factor to process more iterations of the inner loop. This removes the need for the broadcast, however we don't support unrolling the inner loop within the outer loop. We only support unrolling by increasing the VF, which would affect the outer loop as well as the inner loop.

We also don't directly support costing inner-loop vs outer-loop vectorization, and as such we're left trying to predict/steer the cost model ahead of time to what we think should be profitable. This patch attempts to do so using a heuristic which penalizes the outer-loop vectorization.

We now cost the loop as

    note:  Cost model analysis:
      Vector inside of loop cost: 2000
      Vector prologue cost: 4
      Vector epilogue cost: 0
      Scalar iteration cost: 300
      Scalar outside cost: 0
      Vector outside cost: 4
      prologue iterations: 0
      epilogue iterations: 0
    missed:  cost model: the vector iteration cost = 2000 divided by the scalar iteration cost = 300 is greater or equal to the vectorization factor = 4.
    missed:  not vectorized: vectorization not profitable.
    missed:  not vectorized: vector version will never be profitable.
    missed:  Loop costings may not be worthwhile.

And subsequently generate:

    .L5:
            add     w4, w4, w7
            ld1w    z24.s, p6/z, [x0, #1, mul vl]
            ld1w    z23.s, p6/z, [x0, #2, mul vl]
            ld1w    z22.s, p6/z, [x0, #3, mul vl]
            ld1w    z29.s, p6/z, [x0]
            mla     z26.s, p6/m, z24.s, z30.s
            add     x0, x0, x8
            mla     z27.s, p6/m, z23.s, z30.s
            mla     z28.s, p6/m, z22.s, z30.s
            mla     z25.s, p6/m, z29.s, z30.s
            cmp     w4, w6
            bls     .L5

and avoids the load and replicate if it knows it has enough vector pipes to do so.

gcc/ChangeLog:

    PR target/121290
    * config/aarch64/aarch64.cc
    (class aarch64_vector_costs ): Add m_loop_fully_scalar_dup.
    (aarch64_vector_costs::add_stmt_cost): Detect invariant inner loops.
    (adjust_body_cost): Adjust final costing if m_loop_fully_scalar_dup.

gcc/testsuite/ChangeLog:

    PR target/121290
    * gcc.target/aarch64/pr121290.c: New test.

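
The arithmetic behind the dump in the commit message can be sketched in a few lines of C. This is only an illustration of the comparison that the `missed:` line reports and of the `* VF` penalty; it is not GCC's implementation. The function names are hypothetical, and the unpenalized body cost of 500 used in the usage note is an assumed value chosen so that scaling by VF = 4 gives the reported 2000.

```c
#include <stdbool.h>

/* Hypothetical helper: scale the vector body cost by the estimated VF when
   every leaf in the inner loop is a duplicated (splat) value.  */
unsigned
penalized_body_cost (unsigned body_cost, unsigned estimated_vf,
                     bool fully_scalar_dup)
{
  return fully_scalar_dup ? body_cost * estimated_vf : body_cost;
}

/* Hypothetical helper: the dump rejects vectorization when
   vector_iter_cost / scalar_iter_cost >= vf.  Comparing the products
   instead avoids integer-division rounding.  */
bool
vector_iteration_profitable (unsigned vector_iter_cost,
                             unsigned scalar_iter_cost, unsigned vf)
{
  return (unsigned long long) scalar_iter_cost * vf > vector_iter_cost;
}
```

With the figures from the dump, `vector_iteration_profitable (2000, 300, 4)` is false because 300 * 4 = 1200 does not exceed 2000. Under the assumed unpenalized cost of 500 the same check would pass, which is consistent with the outer loop having been vectorized before the patch.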
1 parent f864fc3 commit 28ab833

2 files changed: +61 -2 lines changed
gcc/config/aarch64/aarch64.cc

Lines changed: 43 additions & 2 deletions
@@ -17057,6 +17057,14 @@ class aarch64_vector_costs : public vector_costs
      or vector loop. There is one entry for each tuning option of
      interest. */
   auto_vec<aarch64_vec_op_count, 2> m_ops;
+
+  /* When doing inner-loop vectorization the constraints on the data-refs in the
+     outer-loop could limit the inner loop references. i.e. the outerloop can
+     force the inner-loop to do a load and splat which will result in the loop
+     being entirely scalar as all lanes work on a duplicate. Currently we don't
+     support unrolling of the inner loop independently from the outerloop during
+     outer-loop vectorization which tends to lead to pipeline bubbles. */
+  bool m_loop_fully_scalar_dup = false;
 };
 
 aarch64_vector_costs::aarch64_vector_costs (vec_info *vinfo,
@@ -18079,6 +18087,28 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
       analyze_loop_vinfo (loop_vinfo);
 
       m_analyzed_vinfo = true;
+      if (in_inner_loop_p)
+        m_loop_fully_scalar_dup = true;
+    }
+
+  /* Detect whether the loop is working on fully duplicated lanes. This would
+     only be possible with inner loop vectorization since otherwise we wouldn't
+     try to vectorize. */
+  if (in_inner_loop_p
+      && node
+      && m_loop_fully_scalar_dup
+      && SLP_TREE_LANES (node) == 1
+      && !SLP_TREE_CHILDREN (node).exists ())
+    {
+      /* Check if load is a duplicate. */
+      if (gimple_vuse (stmt_info->stmt)
+          && SLP_TREE_MEMORY_ACCESS_TYPE (node) == VMAT_INVARIANT)
+        ;
+      else if (SLP_TREE_DEF_TYPE (node) == vect_constant_def
+               || SLP_TREE_DEF_TYPE (node) == vect_external_def)
+        ;
+      else
+        m_loop_fully_scalar_dup = false;
     }
 
   /* Apply the heuristic described above m_stp_sequence_cost. */
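
The detection in the hunk above works like a sticky flag: it starts out true for an inner loop and is cleared by the first single-lane leaf that is neither an invariant load nor a constant or external definition. A standalone C model of that idea, assuming a simplified leaf classification, might look like the following. The `leaf_kind` enum and `loop_fully_scalar_dup` function are hypothetical stand-ins and are not GCC types or APIs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of a single-lane SLP leaf.  */
enum leaf_kind
{
  LEAF_INVARIANT_LOAD, /* models a load whose access type is VMAT_INVARIANT */
  LEAF_CONSTANT,       /* models vect_constant_def */
  LEAF_EXTERNAL,       /* models vect_external_def */
  LEAF_OTHER           /* anything that differs from lane to lane */
};

/* Return true when every leaf is a duplicated value, i.e. when no leaf
   clears the flag.  An empty leaf list leaves the flag set, mirroring the
   way the patch initializes it to true for inner loops.  */
bool
loop_fully_scalar_dup (const enum leaf_kind *leaves, size_t n)
{
  bool fully_dup = true;
  for (size_t i = 0; i < n; ++i)
    if (leaves[i] == LEAF_OTHER)
      fully_dup = false;
  return fully_dup;
}
```

The model leaves out the conditions that gate the check in the patch (`in_inner_loop_p`, a non-null `node`, one lane, no children); it only captures how one varying leaf is enough to disable the penalty.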
@@ -18445,8 +18475,19 @@ adjust_body_cost (loop_vec_info loop_vinfo,
   if (m_vec_flags & VEC_ANY_SVE)
     threshold = CEIL (threshold, aarch64_estimated_sve_vq ());
 
-  if (m_num_vector_iterations >= 1
-      && m_num_vector_iterations < threshold)
+  /* Increase the cost of the vector code if it looks like the vector code has
+     limited throughput due to outer-loop vectorization. */
+  if (m_loop_fully_scalar_dup)
+    {
+      body_cost *= estimated_vf;
+      if (dump_enabled_p ())
+        dump_printf_loc (MSG_NOTE, vect_location,
+                         "Increasing body cost to %d because vector code has"
+                         " low throughput of per iteration due to splats\n",
+                         body_cost);
+    }
+  else if (m_num_vector_iterations >= 1
+           && m_num_vector_iterations < threshold)
     {
       if (dump_enabled_p ())
         dump_printf_loc (MSG_NOTE, vect_location,
gcc/testsuite/gcc.target/aarch64/pr121290.c

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -mcpu=neoverse-v2 -fdump-tree-vect-all -std=c99" } */
+
+void
+f (int *restrict x, int *restrict y, int *restrict z, int n)
+{
+  for (int i = 0; i < 4; ++i)
+    {
+      int res = 0;
+      for (int j = 0; j < 100; ++j)
+        res += y[j] * z[i];
+      x[i] = res;
+    }
+}
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump-not "OUTER LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump "low throughput of per iteration due to splats" "vect" } } */
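
The test above only checks the vectorizer dump. To see what inner-loop vectorization computes, the loop nest can be modelled in plain C with four explicit accumulator lanes, each handling a different `j`. This is a sketch for illustration and is not part of the commit: `f_ref` and `f_inner_lanes` are hypothetical names, the unused `n` parameter is dropped, and the lane count of 4 is an assumption that happens to divide the trip count of 100.

```c
/* Scalar reference with the same loop nest as the test's f.  */
void
f_ref (int *restrict x, const int *restrict y, const int *restrict z)
{
  for (int i = 0; i < 4; ++i)
    {
      int res = 0;
      for (int j = 0; j < 100; ++j)
        res += y[j] * z[i];
      x[i] = res;
    }
}

/* Model of inner-loop vectorization: each lane accumulates a different
   y[j], z[i] is the value shared by all lanes, and the lanes are summed
   once the inner loop finishes.  */
void
f_inner_lanes (int *restrict x, const int *restrict y, const int *restrict z)
{
  for (int i = 0; i < 4; ++i)
    {
      int acc[4] = { 0, 0, 0, 0 };
      for (int j = 0; j < 100; j += 4)
        for (int lane = 0; lane < 4; ++lane)
          acc[lane] += y[j + lane] * z[i];
      x[i] = acc[0] + acc[1] + acc[2] + acc[3];
    }
}
```

In this form every lane does distinct work on each inner iteration, whereas the outer-loop form shown in the commit message broadcasts one `y[j]` to all lanes with `ld1r`.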
