[utils] Remove useless compare.py output #274

Open

tomershafir wants to merge 1 commit into main from utils-compare-remove-useless-output

Conversation

@tomershafir (Contributor) commented Aug 5, 2025

The last part of the output, produced by `print(d.describe())`, aggregates numbers from different programs and doesn't make statistical sense, making it pure noise.

Next, I plan to support quantile merging and stddev for the mean.
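
For context, a minimal sketch of the block in question, assuming compare.py holds the parsed results in a pandas DataFrame `d` with one row per program and one column per metric; the names and values below are purely illustrative:

```python
import pandas as pd

# One row per program, one column per metric (illustrative numbers only).
d = pd.DataFrame(
    {
        "exec_time": [0.0005, 0.0072, 321317.72],
        "compile_time": [0.0, 0.10, 24.24],
        "size": [17288, 50744, 2704328],
    },
    index=["micro/a.test", "micro/b.test", "app/c.test"],
)

# describe() aggregates down each column, i.e. across unrelated programs with
# wildly different scales, so "mean" and "std" here are dominated by outliers
# and carry no statistical meaning for the comparison.
print(d.describe())
```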

@tomershafir force-pushed the utils-compare-remove-useless-output branch from 57ad7a4 to 2bb04cd on August 5, 2025 12:29
@jcohen-apple

Could you make this controllable with a flag? My only concern is that if there are downstream projects / CI jobs that somehow rely on parsing those metrics, they could break if they just disappear. Either make them make sense or have a flag to silence the printout IMO.
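
A minimal sketch of what such an opt-out could look like, assuming compare.py builds its command line with argparse; the flag name `--no-aggregate-summary` is hypothetical:

```python
import argparse

parser = argparse.ArgumentParser(description="compare benchmark result files")
# Hypothetical flag name; the point is only to gate the existing printout.
parser.add_argument(
    "--no-aggregate-summary",
    action="store_true",
    help="suppress the describe()-style aggregate block at the end of the output",
)
args = parser.parse_args()

# ...at the point where the summary is currently printed unconditionally:
# if not args.no_aggregate_summary:
#     print(d.describe())
```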

@tomershafir (Contributor, Author)

Making these metrics make sense is not feasible without breaking the format; it would require inserting another dimension for the workload. Also, such a flag would be ugly. Shouldn't we assume people don't use it, given that it's garbage?

@guy-david (Contributor)

I'm not too familiar with what's actually being removed; can you share the output? If it can be considered a debug print or worse, then omitting it sounds fine to me, but otherwise we could add a flag like --minimal-names that controls this aspect of the output.

@tomershafir (Contributor, Author) commented Aug 11, 2025

It's the last part of the output (non-debug), for example:

           exec_time                             compile_time                                   size                           
l/r              lhs            rhs         diff          lhs          rhs        diff           lhs           rhs         diff
count  4310.000000    4310.000000    4310.000000  3463.000000  3463.000000  483.000000  3.463000e+03  3.463000e+03  3463.000000
mean   457.936342     458.528907     0.001676     0.103256     0.103186     0.000674    7.371460e+04  7.371937e+04 -0.000023   
std    10880.264844   10887.120051   0.069010     0.862819     0.862152     0.016053    9.972944e+04  9.976209e+04  0.002889   
min    0.000500       0.000500      -0.302439     0.000000     0.000000    -0.057416    1.728800e+04  1.728800e+04 -0.160348   
25%    0.000500       0.000500      -0.000044     0.000000     0.000000    -0.007525    3.407200e+04  3.407200e+04  0.000000   
50%    0.007200       0.007200       0.000000     0.000000     0.000000     0.000000    5.074400e+04  5.074400e+04  0.000000   
75%    0.051861       0.052006       0.000092     0.000000     0.000000     0.008551    1.048240e+05  1.048240e+05  0.000000   
max    321317.722681  321886.186324  1.904244     24.240900    24.227100    0.060109    2.704328e+06  2.704328e+06  0.042861 

The problem I have with a flag is that this output doesn't make sense at all. If there is a single benchmark, there is only a single value that doesn't require statistical aggregation. If there are multiple benchmarks, this output means nothing.

@tomershafir (Contributor, Author)

@llvm/pr-subscribers-testing-tools maybe you can provide insights?

@MatzeB (Contributor) commented Aug 11, 2025

If there are multiple benchmarks, this output means nothing.

This seems a bit of a strong statement... while you are probably right that the "mean" value does not make statistical sense, the "count", "min", "max", and quantile aggregates seem sensible to me (especially when using the default mode of compare.py, where only a couple of rows at the beginning and end of the data are shown), and you would have to debate whether they are worthwhile enough to show (you can probably convince me to hide them by default).

It's a bit unfortunate that the describe() function in pandas comes as-is with nearly no way to modify it, so that if you want different aggregates you are forced to implement similar functionality from scratch yourself...

That said, I would be happy to see some actual development happen on compare.py and more appropriate aggregates (harmonic mean?)... Would it make sense to wait for the improved aggregates/statistics before landing this? (LGTM when replaced with better aggregates.)
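
For illustration, a minimal sketch of a hand-picked set of aggregates as a stand-in for describe(), assuming the same one-row-per-program DataFrame layout as in the PR description; the data, column names, and choice of aggregates are made up, and scipy is only pulled in for the harmonic/geometric means mentioned above:

```python
import pandas as pd
from scipy import stats

# Illustrative per-program data; in compare.py the frame would instead be
# built from the parsed lit results.
d = pd.DataFrame(
    {"exec_time": [0.12, 0.50, 2.30], "size": [17288.0, 50744.0, 2704328.0]},
    index=["a.test", "b.test", "c.test"],
)

# Build a describe()-like block, but only from aggregates that are robust to
# mixing programs (counts and order statistics) plus the alternative means.
summary = pd.DataFrame(
    {
        "count": d.count(),
        "min": d.min(),
        "25%": d.quantile(0.25),
        "median": d.median(),
        "75%": d.quantile(0.75),
        "max": d.max(),
        "hmean": d.apply(stats.hmean),  # harmonic mean, needs strictly positive values
        "gmean": d.apply(stats.gmean),  # geometric mean
    }
).T

print(summary)
```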

@MatzeB (Contributor) commented Aug 11, 2025

FWIW: I wouldn't worry about CI too much... I meant for this script to be used by humans first! I'm not immediately aware of any CI depending on it, and if there are, I'd be happy to argue that they had better read the lit JSON files directly (or that we add a second script that does a more low-level conversion from lit JSON to something easy to post-process, like CSV/TSV files).
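
For what it's worth, a minimal sketch of that kind of lit-JSON-to-CSV conversion, assuming the usual lit output layout (a top-level "tests" list whose entries carry a "name" and a "metrics" dictionary); the file names are hypothetical:

```python
import csv
import json

# Load a lit result file (hypothetical name).
with open("results.json") as f:
    data = json.load(f)

# Flatten each test into one row: its name plus whatever metrics it reports.
rows = [
    {"name": test["name"], **test.get("metrics", {})}
    for test in data.get("tests", [])
]

# Collect the union of metric names so every column appears in the header.
fields = ["name"] + sorted({k for r in rows for k in r if k != "name"})

with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
```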

@tomershafir (Contributor, Author)

Yeah, I guess I expressed that too strongly. Sounds good to me; I'll try to improve the tool first and re-evaluate this patch after that.

@lukel97 (Contributor) commented Aug 21, 2025

The problem I have with a flag is that this output doesn't make sense at all. If there is a single benchmark, there is only a single value that doesn't require statistical aggregation. If there are multiple benchmarks, this output means nothing.

Agreed, they don't make sense for exec_time/compile_time etc., but they're useful if you're ever comparing LLVM statistics via the -m flag. For example, I've been using it when hacking on the loop vectorizer to see the distribution of how vectorized programs are, e.g.:

      loop-vectorize.LoopsVectorized                        
l/r                              lhs          rhs       diff
count  32.000000                      32.000000    32.000000
mean   358.062500                     345.125000  -0.040936 
std    896.992464                     848.984735   0.100790 
min    2.000000                       2.000000    -0.413793 
25%    32.500000                      31.250000   -0.030303 
50%    60.000000                      57.500000   -0.003145 
75%    281.500000                     280.500000   0.000000 
max    4875.000000                    4581.000000  0.010101 
