Skip to content

Conversation

@amanasifkhalid
Copy link
Contributor

Follow-up to #109346. cc @jakobbotsch, feel free to merge this back into your branch if you'd prefer to keep the old PR.

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 19, 2025
@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@amanasifkhalid
Copy link
Contributor Author

I can't repro the SPMI failures locally, and the assert m_compGenTreeID == m_compiler->compGenTreeID looks like a symptom of a stale collection. I'll wait for the new collection to be ready before kicking off another run.

This test is failing with an AV from an array access with an underflowing index

static int TestOutOfBoundLowerDecreasing()
{
    int[] arr = new int[10];
    int i = 9;
    int j = 15;
    int sum = 0;

    // our range check optimizer is very naive, so if you don't have
    // i < 10, then it thinks `i` can overflow and doesn't bother 
    // calling `Widen` at all.
    //
    while (j >= 0 && i < 10)
    {
        --j;
        --i;
        sum += arr[i];  // range check will use 9 as lower bound.

        Console.WriteLine("i = " + i + ", j = " + j);
    }
    return sum;
}

Before inversion, we have the following layout:


---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   [IL range]   [jump]                                    
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    [000..012)-> BB05(0.5),BB06(0.5)     ( cond ) // j >= 0 preliminary check
BB02 [0001]  1       BB04                  1    [012..046)-> BB07(0.5),BB04(0.5)     ( cond ) // loop body, and j >= 0 loop check
BB06 [0007]  1       BB01                  0.50 [046..???)-> BB04(1)                 (always) // preheader
BB04 [0003]  2       BB02,BB06             1    [046..04B)-> BB02(0.5),BB07(0.5)     ( cond ) // i < 10 loop check
BB07 [0008]  2       BB02,BB04             1    [04B..???)-> BB05(1)                 (always) // exit
BB05 [0004]  2       BB01,BB07             1    [04B..04D)                           (return)                     
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

We invert this loop by cloning the check i < 10 into the preheader, and get the following layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   [IL range]   [jump]                            
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    [000..012)-> BB05(0.5),BB06(0.5)     ( cond ) // j >= 0 preliminary check
BB02 [0001]  2       BB04,BB08             1    [012..046)-> BB07(0.5),BB04(0.5)     ( cond ) // loop body, and j >= 0 loop check
BB06 [0007]  1       BB01                  0.50 [046..???)-> BB09(0.5),BB08(0.5)     ( cond ) // i < 10 preliminary check
BB08 [0009]  1       BB06                  0.50 [???..???)-> BB02(1)                 (always) // new preheader
BB04 [0003]  1       BB02                  1    [046..04B)-> BB02(0.5),BB07(0.5)     ( cond ) // i < 10 loop check
BB07 [0008]  2       BB02,BB04             1    [???..???)-> BB09(1)                 (always) // exit
BB09 [0010]  2       BB06,BB07             1    [04B..???)-> BB05(1)                 (always) // non-enter
BB05 [0004]  2       BB01,BB09             1    [04B..04D)                           (return)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

RBO determines that the preliminary checks can be removed since we know i and j's initial values:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   [IL range]   [jump]                            
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    [000..012)-> BB03(1)                 (always) // local var setup
BB02 [0001]  2       BB05,BB04             4    [012..046)-> BB06(0.5),BB05(0.5)     ( cond ) // loop body, and j >= 0 check
BB03 [0007]  1       BB01                  0.25 [046..???)-> BB04(1)                 (always) // nothing
BB04 [0009]  1       BB03                  0.25 [???..???)-> BB02(1)                 (always) // nothing
BB05 [0003]  1       BB02                  4    [046..04B)-> BB02(0.5),BB06(0.5)     ( cond ) // i < 10 check
BB06 [0008]  2       BB02,BB05             0.50 [???..???)-> BB07(1)                 (always) // exit
BB07 [0010]  1       BB06                  0.50 [04B..???)-> BB08(1)                 (always) // exit
BB08 [0004]  1       BB07                  1    [04B..04D)                           (return)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Assertion prop gets rid of the i < 10 check:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    [000..012)-> BB03(1)                 (always)                     i hascall newarr
BB02 [0001]  2       BB05,BB04             4    [012..046)-> BB06(0.5),BB05(0.5)     ( cond )                     i loophead hascall gcsafe idxlen bwd bwd-target
BB03 [0007]  1       BB01                  0.25 [046..???)-> BB04(1)                 (always)                     internal
BB04 [0009]  1       BB03                  0.25 [???..???)-> BB02(1)                 (always)                     internal idxlen
BB05 [0003]  1       BB02                  4    [046..04B)-> BB02(1)                 (always)                     i bwd bwd-src
BB06 [0008]  1       BB02                  0.50 [???..???)-> BB07(1)                 (always)                     internal
BB07 [0010]  1       BB06                  0.50 [04B..???)-> BB08(1)                 (always)                     internal
BB08 [0004]  1       BB07                  1    [04B..04D)                           (return)                     i
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

After this, range check elimination seems to think i will be decremented only once, and erroneously eliminates the bounds check:

Range value <8, 8>
[RangeCheck::Widen] BB02, 
[000032]
<8, 8> BetweenBounds <0, [000109]>
$c8 upper bound is:  {IntCns 10}
Array size is: 10
[RangeCheck::OptimizeRangeCheck] Between bounds
Before optRemoveRangeCheck:
N014 ( 12, 16) [000073] ---XG+-----                         *  COMMA     int    $383
N003 (  6,  9) [000065] ---X-+-----                         +--*  BOUNDS_CHECK_Rng void   $285
N001 (  1,  1) [000032] -----+-----                         |  +--*  LCL_VAR   int    V01 loc1         u:4 $381
N002 (  1,  1) [000109] -----------                         |  \--*  CNS_INT   int    10 $c8
N013 (  6,  7) [000074] n---G+-----                         \--*  IND       int    $382
N012 (  3,  5) [000072] -----+-----                            \--*  ARR_ADDR  byref int[] $81
N011 (  3,  5) [000071] -----+-N---                               \--*  ADD       long   $304
N009 (  4,  5) [000070] -----+-N---                                  +--*  ADD       long   $303
N004 (  1,  1) [000062] -----+-----                                  |  +--*  LCL_VAR   long   V00 loc0         u:2 $2c0
N008 (  3,  4) [000068] -----+-N---                                  |  \--*  LSH       long   $302
N006 (  2,  3) [000066] -----+---U-                                  |     +--*  CAST      long <- uint $301
N005 (  1,  1) [000063] -----+-----                                  |     |  \--*  LCL_VAR   int    V01 loc1         u:4 $381
N007 (  1,  1) [000067] -----+-N---                                  |     \--*  CNS_INT   long   2 $1c2
N010 (  1,  1) [000069] -----+-----                                  \--*  CNS_INT   long   16 $1c3
After optRemoveRangeCheck for [000073]:
N017 ( 14, 18) [000036] DA--G+-----                         *  STORE_LCL_VAR int    V03 loc3         d:4 $285
N016 ( 14, 18) [000035] ----G+-----                         \--*  ADD       int    $385
N014 ( 12, 16) [000073] ----G+-N---                            +--*  COMMA     int    $383
N003 (  6,  9) [000065] -----+-----                            |  +--*  NOP       void  
N013 (  6,  7) [000074] n---G+-----                            |  \--*  IND       int    $382
N012 (  3,  5) [000072] -----+-----                            |     \--*  ARR_ADDR  byref int[] $81
N011 (  3,  5) [000071] -----+-N---                            |        \--*  ADD       long   $304
N009 (  4,  5) [000070] -----+-N---                            |           +--*  ADD       long   $303
N004 (  1,  1) [000062] -----+-----                            |           |  +--*  LCL_VAR   long   V00 loc0         u:2 $2c0
N008 (  3,  4) [000068] -----+-N---                            |           |  \--*  LSH       long   $302
N006 (  2,  3) [000066] -----+---U-                            |           |     +--*  CAST      long <- uint $301
N005 (  1,  1) [000063] -----+-----                            |           |     |  \--*  LCL_VAR   int    V01 loc1         u:4 $381
N007 (  1,  1) [000067] -----+-N---                            |           |     \--*  CNS_INT   long   2 $1c2
N010 (  1,  1) [000069] -----+-----                            |           \--*  CNS_INT   long   16 $1c3
N015 (  1,  1) [000030] -----+-----                            \--*  LCL_VAR   int    V03 loc3         u:3 (last use) $340

cc @EgorBo, seems like this exposed a range check bug?

@EgorBo
Copy link
Member

EgorBo commented Mar 20, 2025

and the assert m_compGenTreeID == m_compiler->compGenTreeID looks like a symptom of a stale collection.

Usually, this assert means that someone created a node in a phase that returned PhaseStatus::MODIFIED_NOTHING

@EgorBo
Copy link
Member

EgorBo commented Mar 20, 2025

seems like this exposed a range check bug?

can you provide the full jit dump?

@amanasifkhalid
Copy link
Contributor Author

Usually, this assert means that someone created a node in a phase that returned PhaseStatus::MODIFIED_NOTHING

It hit during flow opts, I suspect it's one of the paths where fgOptimizeBranch bails pretty late. I'll try to repro it again.

can you provide the full jit dump?

Sure:
dump.txt

@amanasifkhalid
Copy link
Contributor Author

amanasifkhalid commented Mar 20, 2025

I suspect it's one of the paths where fgOptimizeBranch bails pretty late

If fgUpdateFlowGraph and fgExpandRarelyRunBlocks don't modify anything, and fgOptimizeBranch starts cloning statements but bails because the condition it's going to clone isn't the right shape, optOptimizeFlow will return PhaseStatus::MODIFIED_NOTHING even though we cloned some nodes. I'll include a fix in the next push.

@amanasifkhalid
Copy link
Contributor Author

Diffs are still large, and coreclr_tests looks like the most impacted collection on x64/x86. The largest size regressions are instances where we previously were able to eliminate a loop body entirely:

48 (800.00% of base) : 487485.dasm - SwitchTest:TestEntryPoint():int (FullOpts)
106 (706.67% of base) : 37136.dasm - System.Runtime.Intrinsics.Vector64:Create(double):System.Runtime.Intrinsics.Vector64`1[double] (Instrumented Tier1)
106 (706.67% of base) : 48139.dasm - System.Runtime.Intrinsics.Vector64:Create(double):System.Runtime.Intrinsics.Vector64`1[double] (Tier1)
106 (706.67% of base) : 397331.dasm - System.Runtime.Intrinsics.Vector64:Create[double](double):System.Runtime.Intrinsics.Vector64`1[double] (FullOpts)
106 (706.67% of base) : 59042.dasm - System.Runtime.Intrinsics.Vector64:Create[double](double):System.Runtime.Intrinsics.Vector64`1[double] (Tier1)
94 (671.43% of base) : 34968.dasm - System.Runtime.Intrinsics.Vector64:Create(long):System.Runtime.Intrinsics.Vector64`1[long] (Instrumented Tier1)
94 (671.43% of base) : 34984.dasm - System.Runtime.Intrinsics.Vector64:Create(long):System.Runtime.Intrinsics.Vector64`1[long] (Tier1)
94 (671.43% of base) : 58838.dasm - System.Runtime.Intrinsics.Vector64:Create(ulong):System.Runtime.Intrinsics.Vector64`1[ulong] (Instrumented Tier1)
94 (671.43% of base) : 64669.dasm - System.Runtime.Intrinsics.Vector64:Create(ulong):System.Runtime.Intrinsics.Vector64`1[ulong] (Tier1)
94 (671.43% of base) : 399702.dasm - System.Runtime.Intrinsics.Vector64:Create[long](long):System.Runtime.Intrinsics.Vector64`1[long] (FullOpts)
94 (671.43% of base) : 59197.dasm - System.Runtime.Intrinsics.Vector64:Create[long](long):System.Runtime.Intrinsics.Vector64`1[long] (Tier1)
94 (671.43% of base) : 59117.dasm - System.Runtime.Intrinsics.Vector64:Create[ulong](ulong):System.Runtime.Intrinsics.Vector64`1[ulong] (Instrumented Tier1)
94 (671.43% of base) : 59335.dasm - System.Runtime.Intrinsics.Vector64:Create[ulong](ulong):System.Runtime.Intrinsics.Vector64`1[ulong] (Tier1)
272 (632.56% of base) : 69493.dasm - System.Runtime.Intrinsics.Vector64:Narrow(System.Runtime.Intrinsics.Vector64`1[double],System.Runtime.Intrinsics.Vector64`1[double]):System.Runtime.Intrinsics.Vector64`1[float] (Instrumented Tier1)
188 (626.67% of base) : 61714.dasm - System.Runtime.Intrinsics.Vector64`1[long]:op_Subtraction(System.Runtime.Intrinsics.Vector64`1[long],System.Runtime.Intrinsics.Vector64`1[long]):System.Runtime.Intrinsics.Vector64`1[long] (Tier1)
194 (606.25% of base) : 59486.dasm - System.Runtime.Intrinsics.Vector64`1[long]:op_Division(System.Runtime.Intrinsics.Vector64`1[long],System.Runtime.Intrinsics.Vector64`1[long]):System.Runtime.Intrinsics.Vector64`1[long] (Tier1)
91 (535.29% of base) : 35056.dasm - System.Runtime.Intrinsics.Vector64:Create(int):System.Runtime.Intrinsics.Vector64`1[int] (Instrumented Tier1)
91 (535.29% of base) : 48140.dasm - System.Runtime.Intrinsics.Vector64:Create(int):System.Runtime.Intrinsics.Vector64`1[int] (Tier1)
91 (535.29% of base) : 64667.dasm - System.Runtime.Intrinsics.Vector64:Create(uint):System.Runtime.Intrinsics.Vector64`1[uint] (Tier1)
91 (535.29% of base) : 59053.dasm - System.Runtime.Intrinsics.Vector64:Create[int](int):System.Runtime.Intrinsics.Vector64`1[int] (Tier1)

In these instances, loop unrolling isn't able to look before the hoisted loop checks to find the init value of the IV, so we leave the loops intact. In the baseline JIT, we produce straight-line code that's eventually collapsed into simple return logic. There are a few instances of similarly large improvements:

Top method improvements (percentages):
         -82 (-93.18% of base) : 537836.dasm - Test.App:TestEntryPoint() (FullOpts)
        -990 (-52.86% of base) : 488857.dasm - VectorTest:TestEntryPoint():int (FullOpts)
         -98 (-51.31% of base) : 529931.dasm - SetIPTest.SetIP:_Initialize() (FullOpts)
         -85 (-44.50% of base) : 488604.dasm - VectorMathTests.Program:TestEntryPoint():int (FullOpts)
        -363 (-43.32% of base) : 489054.dasm - VectorTest:VectorTReturnTest():int (FullOpts)
        -148 (-39.15% of base) : 241978.dasm - LoopsWithEH:Sum_TFLTFiTF(int[],int):int (FullOpts)
         -27 (-29.67% of base) : 241957.dasm - LoopsWithEH:Sum_TFTFLxTF(int[],int):int (FullOpts)
         -32 (-27.12% of base) : 533624.dasm - Runtime_71599:Test() (FullOpts)
         -35 (-25.93% of base) : 241945.dasm - LoopsWithEH:Sum_TCLxTF(int[],int):int (FullOpts)
         -37 (-25.52% of base) : 496428.dasm - System.Collections.Immutable.ImmutableArray`1+Builder+<GetEnumerator>d__66[System.__Canon]:MoveNext():ubyte:this (FullOpts)
        -328 (-24.70% of base) : 489012.dasm - VectorTest+VectorMulTest`1[long]:VectorMul(long,long,long,long,long):int (FullOpts)
        -328 (-24.70% of base) : 489024.dasm - VectorTest+VectorMulTest`1[ulong]:VectorMul(ulong,ulong,ulong,ulong,ulong):int (FullOpts)
         -31 (-24.60% of base) : 496436.dasm - System.Collections.Immutable.ImmutableArray`1+Builder+<GetEnumerator>d__66[System.Reflection.Metadata.TypeDefinitionHandle]:MoveNext():ubyte:this (FullOpts)
        -362 (-23.72% of base) : 471659.dasm - GitHub_22815.Program:Test() (FullOpts)
        -265 (-23.23% of base) : 382850.dasm - System.Threading.Tasks.Task:WaitAllCore(System.ReadOnlySpan`1[System.Threading.Tasks.Task],int,System.Threading.CancellationToken):ubyte (Tier1)
        -265 (-23.23% of base) : 3014.dasm - System.Threading.Tasks.Task:WaitAllCore(System.ReadOnlySpan`1[System.Threading.Tasks.Task],int,System.Threading.CancellationToken):ubyte (Tier1)
         -50 (-22.03% of base) : 488963.dasm - VectorTest:TestEntryPoint():int (FullOpts)
         -28 (-20.59% of base) : 308020.dasm - System.Reflection.Metadata.PathUtilities:IndexOfFileName(System.String):int (Tier1)
         -28 (-20.59% of base) : 296418.dasm - System.Reflection.Metadata.PathUtilities:IndexOfFileName(System.String):int (Tier1)
        -117 (-20.49% of base) : 523300.dasm - JitTest_lcs2_lcs_cs.LCS:buildLCS(int[][][][],ushort[],int[]):System.String (FullOpts)

In cases like Test.App:TestEntryPoint, lexical loop inversion creates a new loop that isn't unrolled away, whereas graph-based inversion doesn't run at all since we rely on the loop package to find optimization opportunities. In the latter case, we produce straight-line code that's eventually compacted down to some register juggling.

With loop unrolling disabled, the diffs are considerably less severe:

[22:40:02] Total bytes of base: 403368188
[22:40:02] Total bytes of diff: 403539205
[22:40:02] Total bytes of delta: 171017 (0.04% of base)
[22:40:02]
[22:40:02] Total PerfScore of base: 3.510062052156936e+118
[22:40:02] Total PerfScore of diff: 3.510390402213465e+118
[22:40:02] Total PerfScore of delta: 3.2835005652882044e+114 (0.01% of base)
[22:40:02]
[22:40:02] Relative PerfScore Geomean: 1.7922%
[22:40:02] Relative PerfScore Geomean (Diffs): 63.6871%
[22:40:02]
[22:40:02] Generated asm is located under C:\wk\runtime\artifacts\spmi\asm.coreclr_tests.run.windows.x64.checked.6\base C:\wk\runtime\artifacts\spmi\asm.coreclr_tests.run.windows.x64.checked.6\diff
[22:40:02] Textual differences found in generated asm.
[22:40:02] jit-analyze not found on PATH. Generate a diff analysis report by building jit-analyze from https://github.com/dotnet/jitutils and running:
[22:40:02]     jit-analyze -r --base C:\wk\runtime\artifacts\spmi\asm.coreclr_tests.run.windows.x64.checked.6\base --diff C:\wk\runtime\artifacts\spmi\asm.coreclr_tests.run.windows.x64.checked.6\diff
[22:40:02] 20,983 contexts with diffs (4,466 size improvements, 6,620 size regressions, 9,897 same size)
[22:40:02]                            (3,065 PerfScore improvements, 16,119 PerfScore regressions, 1,799 same PerfScore)
[22:40:02]   -42,617/+213,634 bytes
[22:40:02]   -13.25%/+82.50% PerfScore

@EgorBo
Copy link
Member

EgorBo commented Mar 25, 2025

@amanasifkhalid can you please re-base the PR to see if the range check issue still reproes with recent changes

@amanasifkhalid
Copy link
Contributor Author

@EgorBo this failure repro'd in the latest run

@EgorBo
Copy link
Member

EgorBo commented Mar 26, 2025

@amanasifkhalid can you point me to a pipeline where the test failed? I presume it's not SPMI

@amanasifkhalid
Copy link
Contributor Author

@EgorBo sure, it's failing in the Regression_2 test suite, which runs in the CoreCLR Pri0 test pipeline. Here's the win-x64 failure (console log).

@EgorBo
Copy link
Member

EgorBo commented Mar 26, 2025

@EgorBo sure, it's failing in the Regression_2 test suite, which runs in the CoreCLR Pri0 test pipeline. Here's the win-x64 failure (console log).

thanks, I can repro it locally

@EgorBo
Copy link
Member

EgorBo commented Mar 26, 2025

Ok, looks like RangeCheck gets confused by a range where lower bound is greater than the upper one <10, 9>

@EgorBo
Copy link
Member

EgorBo commented Mar 26, 2025

So I filed a PR to fix this case #113935, but I'd recommend to disable the range check phase in your PR, because I suspect you have other issues so I don't expect the PR to become green once you disable the range check phase. Or you can integrate my PR (let's wait till its green)

@dotnet-policy-service
Copy link
Contributor

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants