Skip to content

Change the 'significance' filter on the comparison page to a 'relevance' filter #1277

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 5, 2022

Conversation

rylev
Copy link
Member

@rylev rylev commented Apr 5, 2022

Since we really care about helping the user focus on "relevant" changes (and not just significance which is only a part of relevance), we should provide a filter for relevance instead of a filter for significance. Since relevance also includes magnitude, we don't need to have a filter for small results either.

Additionally, the filters usually follow a pattern of showing more data when activated. This change also makes it so that the filter is for "showing non-relevant data" which is off by default.

As a next step, I'd love to have a detail view for each test case comparison which shows the details of how we arrive at the conclusion that a result is relevant perhaps by showing things like historical data and where this change fits in the historical trend.

@rylev rylev requested a review from Kobzol April 5, 2022 09:32
@Kobzol
Copy link
Contributor

Kobzol commented Apr 5, 2022

Nice change, I was quite confused by the previous two checkboxes. And I also agree with the second part, it would be nice to (optionally) show the user the full data that was used to compute the result desription instead of saying "this result is quite significant, although only partly relevant, but it is definitely dodgy with a medium magnitude".

@rylev rylev force-pushed the relevance-filter branch from 5d0d640 to a13d12c Compare April 5, 2022 11:33
@rylev
Copy link
Member Author

rylev commented Apr 5, 2022

The CI failure is due to the fact that hyper no longer builds. This has already been reported here.

@rylev rylev merged commit 04fccd8 into rust-lang:master Apr 5, 2022
@rylev rylev deleted the relevance-filter branch April 5, 2022 12:02
@nnethercote
Copy link
Contributor

This is a nice simplification. One after-the-fact nitpick: in some places the code has a notion of high, medium, and low relevance. But here relevance is just a binary yes/no thing. Is this difference meaningful? Can it be removed? E.g. is "non-relevant" actually equal to "low relevance"?

@rylev
Copy link
Member Author

rylev commented Apr 6, 2022

@nnethercote I think there's still confusion caused by the fact that we talk about "relevancy" in two places:

  • The relevancy of a single test result comparison (i.e., whether the two tests of serde-1.0.139 doc full for the two artifacts being compared produces a relevant result). This is a bool - the test result comparison is either relevant or not. It's calculated here
  • The relevancy of an entire artifact comparison (which is composed of many individual test result comparisons discussed in the first bullet point). This is an enum of high, medium, and low relevancy. It is calculated here.

How we ended up here is long story. We are now much better at calculating the relevancy of individual result comparisons now. The reason we have a relevancy for the entire artifact comparison is because historically we've struggled to answer the question "should the user pay attention to the results of an artifact comparison?".

We could adopt the strategy of saying as long as an artifact comparison has 1 or more test result comparisons that are deemed relevant, the entire artifact comparison is relevant. However, in practice this means there are a lot artifact comparisons with only a few relatively small test result comparisons.

For example, this artifact comparison has 2 relevant results but they're rather small and they're for secondary benchmarks. Should we signal to the user that they don't need to look at this comparison or should we always encourage users to look at comparisons even if they're this small?

@Kobzol
Copy link
Contributor

Kobzol commented Apr 6, 2022

This is not strictly relevant to the current discussion, but maybe we could alter the relevancy calculation logic for the whole artifact comparison to take benchmark category into account? For example in the case that you have showed, we could make comparison relevancy smaller, since only secondary benchmarks are relevant on their own account.

Then we could design some rule like "the whole comparison is relevant if there is at least a single primary benchmark relevant result", which might be slightly better than just taking all benchmark results into account.

@nnethercote
Copy link
Contributor

Some principles that seem good to me:

  • Don't use the same word in subtly different ways.
  • If in doubt, prefer simplicity.

We could satisfy the first principle by more clearly distinguishing the two kinds of relevance. E.g. perhaps "test relevance" vs "artifact relevance", or even a different word from "relevance" for one of the concepts.

We could satisfy both principles by going with your suggestion "as long as an artifact comparison has 1 or more test result comparisons that are deemed relevant, the entire artifact comparison is relevant".

Of these two I would lean towards the latter. I would hope that people would look at the full results more often than not -- it's a single click to get there, and the summary is never going to paint the full picture. So pushing people more in that direction seems fine to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants