[utils] Remove useless compare.py output #274

Open

tomershafir wants to merge 1 commit into main from utils-compare-remove-useless-output

Conversation

@tomershafir (Contributor) commented Aug 5, 2025

The last part of the output, produced by `print(d.describe())`, aggregates numbers from different programs and doesn't make statistical sense, making it pure noise.

Next, I plan to support quantile merging and stddev for the mean.
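
For context, a minimal sketch of the block in question, assuming compare.py holds the parsed results in a pandas DataFrame `d` with one row per program and one column per metric; the names and values below are purely illustrative:

```python
import pandas as pd

# One row per program, one column per metric (illustrative numbers only).
d = pd.DataFrame(
    {
        "exec_time": [0.0005, 0.0072, 321317.72],
        "compile_time": [0.0, 0.10, 24.24],
        "size": [17288, 50744, 2704328],
    },
    index=["micro/a.test", "micro/b.test", "app/c.test"],
)

# describe() aggregates down each column, i.e. across unrelated programs with
# wildly different scales, so "mean" and "std" here are dominated by outliers
# and carry no statistical meaning for the comparison.
print(d.describe())
```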

@tomershafir force-pushed the utils-compare-remove-useless-output branch from 57ad7a4 to 2bb04cd on August 5, 2025 12:29
@jcohen-apple

Could you make this controllable with a flag? My only concern is that if there are downstream projects / CI jobs that somehow rely on parsing those metrics, they could break if they just disappear. Either make them make sense or have a flag to silence the printout IMO.
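
A minimal sketch of what such an opt-out could look like, assuming compare.py builds its command line with argparse; the flag name `--no-aggregate-summary` is hypothetical:

```python
import argparse

parser = argparse.ArgumentParser(description="compare benchmark result files")
# Hypothetical flag name; the point is only to gate the existing printout.
parser.add_argument(
    "--no-aggregate-summary",
    action="store_true",
    help="suppress the describe()-style aggregate block at the end of the output",
)
args = parser.parse_args()

# ...at the point where the summary is currently printed unconditionally:
# if not args.no_aggregate_summary:
#     print(d.describe())
```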

@tomershafir (Contributor, Author)

Making these metrics make sense is not feasible without breaking the format; it would require inserting another dimension for the workload. Also, such a flag would be ugly. Shouldn't we assume people don't use it, given that it's garbage?

@guy-david (Contributor)

I'm not too familiar with what's actually being removed; can you share the output? If it can be considered a debug print or worse, then omitting it sounds fine to me, but otherwise we could add a flag like --minimal-names that controls this aspect of the output.

@tomershafir (Contributor, Author) commented Aug 11, 2025

It's the last part of the output (non-debug), for example:

           exec_time                             compile_time                                   size                           
l/r              lhs            rhs         diff          lhs          rhs        diff           lhs           rhs         diff
count  4310.000000    4310.000000    4310.000000  3463.000000  3463.000000  483.000000  3.463000e+03  3.463000e+03  3463.000000
mean   457.936342     458.528907     0.001676     0.103256     0.103186     0.000674    7.371460e+04  7.371937e+04 -0.000023   
std    10880.264844   10887.120051   0.069010     0.862819     0.862152     0.016053    9.972944e+04  9.976209e+04  0.002889   
min    0.000500       0.000500      -0.302439     0.000000     0.000000    -0.057416    1.728800e+04  1.728800e+04 -0.160348   
25%    0.000500       0.000500      -0.000044     0.000000     0.000000    -0.007525    3.407200e+04  3.407200e+04  0.000000   
50%    0.007200       0.007200       0.000000     0.000000     0.000000     0.000000    5.074400e+04  5.074400e+04  0.000000   
75%    0.051861       0.052006       0.000092     0.000000     0.000000     0.008551    1.048240e+05  1.048240e+05  0.000000   
max    321317.722681  321886.186324  1.904244     24.240900    24.227100    0.060109    2.704328e+06  2.704328e+06  0.042861 

The problem I have with a flag is that this output doesn't make sense at all. If there is a single benchmark, there is only a single value that doesn't require statistical aggregation. If there are multiple benchmarks, this output means nothing.

@tomershafir (Contributor, Author)

@llvm/pr-subscribers-testing-tools maybe you can provide insights?

@MatzeB (Contributor) commented Aug 11, 2025

If there are multiple benchmarks, this output means nothing.

This seems a bit of a strong statement... while you are probably right that the "mean" value does not make statistical sense, the "count", "min", "max", and quantile aggregates seem sensible to me (especially when using the default mode of compare.py, where only a couple of rows at the beginning and end of the data are shown), and you would have to debate whether they are worthwhile enough to show (you can probably convince me to hide them by default).

It's a bit unfortunate that the describe() function in pandas comes as-is with nearly no way to modify it, so that if you want different aggregates you are forced to implement similar functionality from scratch yourself...

That said, I would be happy to see some actual development happen on compare.py and more appropriate aggregates (harmonic mean?)... Would it make sense to wait for the improved aggregates/statistics before landing this? (LGTM when replaced with better aggregates.)
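
For illustration, a minimal sketch of a hand-picked set of aggregates as a stand-in for describe(), assuming the same one-row-per-program DataFrame layout as in the PR description; the data, column names, and choice of aggregates are made up, and scipy is only pulled in for the harmonic/geometric means mentioned above:

```python
import pandas as pd
from scipy import stats

# Illustrative per-program data; in compare.py the frame would instead be
# built from the parsed lit results.
d = pd.DataFrame(
    {"exec_time": [0.12, 0.50, 2.30], "size": [17288.0, 50744.0, 2704328.0]},
    index=["a.test", "b.test", "c.test"],
)

# Build a describe()-like block, but only from aggregates that are robust to
# mixing programs (counts and order statistics) plus the alternative means.
summary = pd.DataFrame(
    {
        "count": d.count(),
        "min": d.min(),
        "25%": d.quantile(0.25),
        "median": d.median(),
        "75%": d.quantile(0.75),
        "max": d.max(),
        "hmean": d.apply(stats.hmean),  # harmonic mean, needs strictly positive values
        "gmean": d.apply(stats.gmean),  # geometric mean
    }
).T

print(summary)
```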

@MatzeB (Contributor) commented Aug 11, 2025

FWIW: I wouldn't worry about CI too much... I meant for this script to be used by humans first! I'm not immediately aware of any CI depending on it, and if there are, I'd be happy to argue that they had better read the lit JSON files directly (or that we add a second script that does a more low-level conversion from lit JSON to something easy to post-process, like CSV/TSV files).
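
For what it's worth, a minimal sketch of that kind of lit-JSON-to-CSV conversion, assuming the usual lit output layout (a top-level "tests" list whose entries carry a "name" and a "metrics" dictionary); the file names are hypothetical:

```python
import csv
import json

# Load a lit result file (hypothetical name).
with open("results.json") as f:
    data = json.load(f)

# Flatten each test into one row: its name plus whatever metrics it reports.
rows = [
    {"name": test["name"], **test.get("metrics", {})}
    for test in data.get("tests", [])
]

# Collect the union of metric names so every column appears in the header.
fields = ["name"] + sorted({k for r in rows for k in r if k != "name"})

with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
```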

@tomershafir (Contributor, Author)

Yeah, I guess I expressed that too strongly. Sounds good to me; I'll try to improve the tool first and re-evaluate this patch after that.

@lukel97 (Contributor) commented Aug 21, 2025

The problem I have with a flag is that this output doesn't make sense at all. If there is a single benchmark, there is only a single value that doesn't require statistical aggregation. If there are multiple benchmarks, this output means nothing.

Agreed, they don't make sense for exec_time/compile_time etc., but they're useful if you're ever comparing LLVM statistics via the -m flag. For example, I've been using it when hacking on the loop vectorizer to see the distribution of how vectorized programs are, e.g.:

      loop-vectorize.LoopsVectorized                        
l/r                              lhs          rhs       diff
count  32.000000                      32.000000    32.000000
mean   358.062500                     345.125000  -0.040936 
std    896.992464                     848.984735   0.100790 
min    2.000000                       2.000000    -0.413793 
25%    32.500000                      31.250000   -0.030303 
50%    60.000000                      57.500000   -0.003145 
75%    281.500000                     280.500000   0.000000 
max    4875.000000                    4581.000000  0.010101 
