Description
RyuJIT has several loop optimization phases that have known correctness and performance issues and can be significantly improved. RyuJIT also lacks some loop optimizations that have been shown to benefit various use cases. For .NET 6, the proposed work is to fix and improve the existing phases, and to collect information and develop a plan for adding the missing ones.
Existing Optimizations
Below is a list of the existing loop-related RyuJIT phases and a short description of the improvement opportunities.
Loop Recognition
RyuJIT currently has lexical-based loop recognition and only recognizes natural loops. We should consider replacing it with a standard Tarjan SCC algorithm that classifies all loops. Then we can extend some loop optimizations to also work on non-natural loops.
Even if we continue to use the current algorithm, we should verify that it catches the maximal set of natural loops; it is believed that it misses some natural loops.
- Certain loops do not get recorded in optLoopTable #43713 Describes two cases where loops are missed due to various issues.
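As a rough illustration of the SCC-based alternative mentioned above, here is a minimal C++ sketch of Tarjan's algorithm over a toy flow graph. It is not RyuJIT code: `FlowGraph` and `FindCyclicRegions` are hypothetical names, blocks are plain integers, and the sketch only reports the outermost cyclic regions. A real implementation would also need to recurse into each region to discover nested loops.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Hypothetical flow graph: succs[b] lists the successor block numbers of block b.
using FlowGraph = std::vector<std::vector<int>>;

// Returns every strongly connected component that contains a cycle,
// each as a sorted list of block numbers (Tarjan's SCC algorithm).
std::vector<std::vector<int>> FindCyclicRegions(const FlowGraph& succs)
{
    const int n = static_cast<int>(succs.size());
    std::vector<int> index(n, -1), low(n, 0), stack;
    std::vector<bool> onStack(n, false);
    std::vector<std::vector<int>> regions;
    int next = 0;

    std::function<void(int)> visit = [&](int b) {
        index[b] = low[b] = next++;
        stack.push_back(b);
        onStack[b] = true;
        for (int s : succs[b])
        {
            if (index[s] < 0)
            {
                visit(s);
                low[b] = std::min(low[b], low[s]);
            }
            else if (onStack[s])
            {
                low[b] = std::min(low[b], index[s]);
            }
        }
        if (low[b] != index[b])
            return; // b is not the root of a component

        // Pop the component rooted at b off the stack.
        std::vector<int> scc;
        int top;
        do
        {
            top = stack.back();
            stack.pop_back();
            onStack[top] = false;
            scc.push_back(top);
        } while (top != b);

        // A single block forms a loop only if it branches to itself.
        bool selfLoop = std::find(succs[b].begin(), succs[b].end(), b) != succs[b].end();
        if (scc.size() > 1 || selfLoop)
        {
            std::sort(scc.begin(), scc.end());
            regions.push_back(scc);
        }
    };

    for (int b = 0; b < n; b++)
        if (index[b] < 0)
            visit(b);
    return regions;
}
```

For the graph 0 -> {1, 2}, 1 -> {2}, 2 -> {1, 3}, the sketch reports the region {1, 2}. That cycle can be entered at either block 1 or block 2, so it has no single dominating header and is not a natural loop.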
Loop Inversion
"While" loops are transformed to "do-while" loops to save one branch per loop iteration. Some issues have been identified with the heuristics for this optimization.
- JIT: heuristics in optInvertWhileLoop may be overly conservative #6569 JIT: heuristics in fgOptWhileLoop may be overly conservative
- Iterating with ForEach over ImmutableArray is slower than over Array #780 The issue contains a link to a prototype tweaking the heuristics and some code size numbers and analysis
- Improve loop inversion #52347 Improve cases where the loop condition block (the entry block) has multiple non-loop predecessors.
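The transformation can be pictured at source level with the following C++ sketch. The function names are hypothetical, and a C++ compiler may well perform this rewrite on its own; the point is only to show the two control-flow shapes.

```cpp
#include <cstddef>
#include <vector>

// Original shape: the condition is tested at the top, and every iteration
// ends with an unconditional jump back to that test.
int SumWhile(const std::vector<int>& a)
{
    int sum = 0;
    std::size_t i = 0;
    while (i < a.size())
    {
        sum += a[i];
        i++;
    }
    return sum;
}

// Inverted shape: the condition is duplicated as a one-time guard in front of
// a do-while loop, so each iteration needs only the conditional branch at the bottom.
int SumInverted(const std::vector<int>& a)
{
    int sum = 0;
    std::size_t i = 0;
    if (i < a.size())
    {
        do
        {
            sum += a[i];
            i++;
        } while (i < a.size());
    }
    return sum;
}
```

The cost is a duplicated condition, which is why the decision is driven by heuristics such as the size of the condition block.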
Loop Cloning
This optimization creates two copies of a loop, one with bounds checks and one without, and selects one of them at run time based on a condition. Several issues have been identified with this optimization. One recurring theme is unnecessary cloning, where we first clone a loop and then eliminate the range checks from both copies.
- RyuJIT's loop cloning optimization has questionable CQ #4929
- JIT: examples where loop cloning is not useful #8558
- Poor loop optimization in BilinearInterpol benchmark #31831
- LoopCloneContext::EvaluateConditions need to evaluate for const init, limit condition. #10314
- [Mostly done] loop cloning and pgo #48850. Remaining: use PGO data to influence the cost/benefit analysis of whether to clone a loop.
- If compReturnBB is unreachable we should remove it #48740 (comment) Poor tracking of return blocks impacts loop cloning
- [Stretch goal] Support loop cloning with struct arrays #48897
- Consider hoisting of class init checks for loop cloning and inversion #49102
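The shape of a cloned loop can be sketched in C++ as follows. This is an illustration under stated assumptions, not what the JIT emits: `SumCloned` is a hypothetical name, `std::vector::at` stands in for a bounds-checked array access, and plain `operator[]` stands in for an access whose check has been removed.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

int SumCloned(const std::vector<int>& a, std::size_t n)
{
    int sum = 0;
    if (n <= a.size())
    {
        // Fast clone: the guard proves every index in [0, n) is in range,
        // so the per-iteration bounds check is dropped.
        for (std::size_t i = 0; i < n; i++)
            sum += a[i];
    }
    else
    {
        // Slow clone: keeps the checks so an out-of-range access still
        // throws at exactly the same iteration as in the original loop.
        for (std::size_t i = 0; i < n; i++)
            sum += a.at(i);
    }
    return sum;
}
```

If a later phase can prove the accesses are in range without the guard, both clones end up check-free and the duplication was wasted, which is the recurring problem described above.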
Loop Unrolling
The existing phase only does full unrolls, and only for SIMD loops: the current heuristic requires the loop bound test to be against a SIMD element count. As a result, the optimization currently has very limited impact, even though loop unrolling is generally a high-impact optimization when driven by the right heuristics.
- Loop unrolling support in RyuJIT #4248
- JIT optimization: loop unrolling #8107
- Loop Unrolling is not Enabled in Release Build #41063
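A full unroll replaces a loop with a known, small trip count by straight-line code. The C++ sketch below assumes a trip count of 4 as a stand-in for a SIMD element count; the function names are hypothetical.

```cpp
// Rolled form: four iterations, each with an increment, a compare and a branch.
float Sum4(const float* v)
{
    float sum = 0.0f;
    for (int i = 0; i < 4; i++)
        sum += v[i];
    return sum;
}

// Fully unrolled form: the loop control disappears entirely. The additions
// are kept in the same order, so the floating-point result is unchanged.
float Sum4Unrolled(const float* v)
{
    float sum = 0.0f;
    sum += v[0];
    sum += v[1];
    sum += v[2];
    sum += v[3];
    return sum;
}
```

Partial unrolling, which the existing phase does not do, would instead keep the loop but process several elements per iteration.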
Loop Invariant Code Hoisting
This phase attempts to hoist code that produces the same value on every iteration of the loop into the loop pre-header. There is at least one known correctness issue, and likely more:
- JIT: Loop hoisting re-ordering exceptions #6639
And multiple issues about limitations of the algorithm:
- JIT: limitations in hoisting (loop invariant code motion) #35735
- JIT: Loop hoisting inhibited by phase-ordering issue #6554
- RyuJIT: Loop hoist invariant struct field accesses #7265
- RyuJIT: missed opportunity for LICM #6666
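The basic transformation can be sketched in C++ as below, with hypothetical function names. The hoisted expression is pure integer arithmetic, which cannot throw; an expression that can throw is harder, because moving it ahead of other side effects in the loop can change which exception is observed first.

```cpp
#include <cstddef>
#include <vector>

// Before: (scale * scale + bias) is recomputed on every iteration even though
// neither operand changes inside the loop.
void FillBefore(std::vector<int>& out, int scale, int bias)
{
    for (std::size_t i = 0; i < out.size(); i++)
        out[i] = static_cast<int>(i) + (scale * scale + bias);
}

// After: the invariant expression is evaluated once, in the position of the
// loop pre-header, and the loop body only reads the cached value.
void FillAfter(std::vector<int>& out, int scale, int bias)
{
    const int invariant = scale * scale + bias;
    for (std::size_t i = 0; i < out.size(); i++)
        out[i] = static_cast<int>(i) + invariant;
}
```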
Loop Optimization Hygiene
Loop optimizations need to work well with the rest of the compiler, including features such as PGO, and must maintain the IR invariants that other phases rely on.
- [Mostly done] Loop opts should not be recomputing pred lists from scratch #49030. Remaining phases to fix: optFindNaturalLoops, optUnrollLoops, fgInsertGCPolls.
Missing Optimizations
Several major optimizations are missing even though we have evidence of their effectiveness (at least on microbenchmarks).
Induction Variable Widening
Induction variable widening eliminates unnecessary widening conversions from int32-sized induction variables to the int64-sized registers used in addressing modes. On AMD64, this removes unnecessary movsxd instructions before array element accesses.
- RyuJIT: Index Variable Widening optimization for array accesses #7312
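At source level the idea looks roughly like the following C++ sketch (hypothetical function names). The explicit cast in the first function marks where a sign extension would be needed on each iteration.

```cpp
#include <cstdint>

// Before: the 32-bit induction variable is sign-extended to 64 bits every
// time it is used as an index.
std::int64_t SumNarrow(const std::int32_t* a, std::int32_t n)
{
    std::int64_t sum = 0;
    for (std::int32_t i = 0; i < n; i++)
        sum += a[static_cast<std::int64_t>(i)]; // widening conversion per iteration
    return sum;
}

// After: the induction variable itself is 64 bits wide and the limit is
// widened once outside the loop. This is valid here because i only takes
// values in [0, n), so no 32-bit wraparound can be observed.
std::int64_t SumWidened(const std::int32_t* a, std::int32_t n)
{
    std::int64_t sum = 0;
    const std::int64_t limit = n;
    for (std::int64_t i = 0; i < limit; i++)
        sum += a[i];
    return sum;
}
```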
Strength Reduction
Strength reduction replaces expensive operations with equivalent but less expensive operations.
- Strength reduction for add operations performed power of 2 times #34938
- ARM64: loop array indexing inefficiencies #34810
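A classic loop instance is replacing a multiplication by the induction variable with a running addition. The C++ sketch below uses hypothetical names and assumes the values stay well within the range of `int`.

```cpp
#include <vector>

// Before: one multiplication per iteration to compute the element offset.
std::vector<int> OffsetsMul(int count, int stride)
{
    std::vector<int> offsets;
    for (int i = 0; i < count; i++)
        offsets.push_back(i * stride);
    return offsets;
}

// After: the product is maintained incrementally, so each iteration only
// needs an addition.
std::vector<int> OffsetsAdd(int count, int stride)
{
    std::vector<int> offsets;
    int offset = 0;
    for (int i = 0; i < count; i++)
    {
        offsets.push_back(offset);
        offset += stride;
    }
    return offsets;
}
```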
Loop Unswitching
Loop unswitching moves a conditional from inside a loop to outside of it by duplicating the loop's body and placing a version of the loop inside each of the if and else clauses of the conditional. It has elements of both Loop Cloning and Loop Invariant Code Motion.
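A small C++ sketch of the transformation, with hypothetical function names, where the loop-invariant condition is the `negate` flag:

```cpp
#include <cstddef>
#include <vector>

// Before: the loop-invariant flag is tested on every iteration.
void TransformInLoop(std::vector<int>& v, bool negate)
{
    for (std::size_t i = 0; i < v.size(); i++)
    {
        if (negate)
            v[i] = -v[i];
        else
            v[i] = 2 * v[i];
    }
}

// After: the test runs once, and each branch gets its own specialized copy
// of the loop with a branch-free body.
void TransformUnswitched(std::vector<int>& v, bool negate)
{
    if (negate)
    {
        for (std::size_t i = 0; i < v.size(); i++)
            v[i] = -v[i];
    }
    else
    {
        for (std::size_t i = 0; i < v.size(); i++)
            v[i] = 2 * v[i];
    }
}
```

As with loop cloning, the price is code size, so the decision would need a cost/benefit heuristic.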
Loop Interchange
Loop interchange swaps an inner and outer loop to provide follow-on optimization opportunities.
- JIT: loop interchange optimization #4358
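One common motivation is memory access order. In the C++ sketch below (hypothetical names, a small fixed-size row-major matrix), swapping the loops turns a strided traversal into a sequential one. The swap is legal here because integer addition can be reordered freely; a loop nest with dependences between iterations would need a legality check first.

```cpp
#include <array>

constexpr int Rows = 3;
constexpr int Cols = 4;
using Matrix = std::array<std::array<int, Cols>, Rows>;

// Before: the inner loop walks down a column, so consecutive accesses are
// Cols elements apart in memory.
int SumColumnFirst(const Matrix& m)
{
    int sum = 0;
    for (int c = 0; c < Cols; c++)
        for (int r = 0; r < Rows; r++)
            sum += m[r][c];
    return sum;
}

// After interchange: the inner loop walks along a row, so consecutive
// accesses touch adjacent elements.
int SumRowFirst(const Matrix& m)
{
    int sum = 0;
    for (int r = 0; r < Rows; r++)
        for (int c = 0; c < Cols; c++)
            sum += m[r][c];
    return sum;
}
```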
Benefits
It's easy to show the benefit of improved loop optimizations on microbenchmarks. For example, the team analyzed JIT microbenchmarks (benchstones, SciMark, etc.) several years ago. That analysis contains estimates of the performance improvement from several of these optimizations (each in the low single-digit percent range). Real code is also likely to have hot loops that will benefit from improved loop optimizations.
The benchmarks and other metrics we will measure to show the benefits are TBD.
Proposed work
- Do analysis of hot loops in important workloads (ASP.NET, etc.)
- Use the findings along with the existing microbenchmark analysis to prioritize loop optimizations work
- Fix the known issues in the existing loop optimizations starting with the more impactful ones as determined by the previous two items.
- Determine if current loop recognition and loop structure representation needs to be revamped to be more general and allow for more powerful optimizations.
- Recommend starting with Loop Cloning and Loop Invariant Code Hoisting as there are well-understood weaknesses and improvement opportunities in those phases.
- Evaluate the use of SSA in loop optimizations. Perhaps a better representation of heap locations in SSA will make it more useful for loop optimizations.
- Create a plan for adding missing optimizations
category:planning
theme:loop-opt
skill-level:expert
cost:large