tighten test_normal_scalar bound #5366

Merged

Conversation

austereantelope (Contributor)

Hi,

The test test_normal_scalar in test_sampling.py has an assertion bound (assert pval > 0.001) that is too loose, which means a potential bug in the code could still pass the original test.
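For context, the check in question has roughly this shape (a simplified sketch, not the exact test body; the real test builds a PyMC model, samples, and compares posterior predictive draws against the expected distribution):

```python
# Simplified sketch of the assertion under discussion (assumed shape).
from scipy import stats

def check_samples_look_normal(samples):
    # One-sample KS test of the draws against the standard normal CDF.
    _, pval = stats.kstest(samples, stats.norm(loc=0, scale=1).cdf)
    # The bound this PR targets: originally 0.001, proposed 0.01.
    assert pval > 0.001
```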

To quantify this, I conducted some experiments where I generated multiple mutations of the source code under test and ran each mutant and the original code 100 times to build a distribution of their outputs. Each mutant is generated by applying simple mutation operators (e.g. > can become <) to source code covered by the test. I used a KS test to find mutants that produced a different distribution from the original, and used those mutants as a proxy for bugs that could be introduced. In the graph below I show the distribution of the original code alongside the mutants with a different distribution.
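To illustrate the comparison step, here is a minimal sketch of how a mutant can be flagged (the helper name and arguments are placeholders; original_pvals and mutant_pvals stand for the 100 outputs collected per variant):

```python
from scipy import stats

def mutant_distribution_differs(original_pvals, mutant_pvals, alpha=0.05):
    # Two-sample KS test: do the mutant's outputs come from the same
    # distribution as the original's outputs?
    _, p = stats.ks_2samp(original_pvals, mutant_pvals)
    return p < alpha  # if they differ, keep the mutant as a proxy for a bug
```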

Here we see that the bound of 0.001 is too loose, since the original distribution (in orange) sits well above 0.001. Furthermore, the graph shows that many mutants (proxies for bugs) are above the bound as well, which is undesirable since the test should aim to catch potential bugs in the code. I quantify the bug-detection ability of this assertion by varying the bound in the trade-off graph below.

In this graph, I plot the mutant catch rate (the ratio of mutant outputs that fail the test) against the original pass rate (the ratio of original outputs that pass the test). The original bound of 0.001 (red dotted line) has a catch rate of 0.14.
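A minimal sketch of how such a trade-off curve can be computed (the arrays here are random placeholders; in the actual experiment the values come from the repeated runs described above):

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder arrays standing in for the p-values collected per variant.
original_pvals = rng.uniform(0.01, 1.0, size=100)
mutant_pvals = rng.uniform(0.0, 0.2, size=100)

for bound in (0.001, 0.005, 0.01, 0.05):
    pass_rate = np.mean(original_pvals > bound)   # original runs that pass
    catch_rate = np.mean(mutant_pvals <= bound)   # mutant runs that fail
    print(f"bound={bound}: pass_rate={pass_rate:.2f}, catch_rate={catch_rate:.2f}")
```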

To improve this test, I propose tightening the bound to 0.01 (the blue dotted line). The new bound has a catch rate of 0.16 (+0.02 compared to the original) while still having a >99% pass rate (the test is not flaky: I ran the updated test 500 times and observed a >99% pass rate). I think this is a good balance between improving the bug-detection ability of the test and keeping its flakiness low.
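The proposed change itself is a one-line tightening of the bound (assuming the assertion reads as quoted above):

```python
# In test_normal_scalar, the bound moves from 0.001 to 0.01:
assert pval > 0.01  # previously: assert pval > 0.001
```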

Do you guys think this makes sense? Please let me know if this looks good or if you have any other suggestions or questions.

My Environment:

python=3.7.11
numpy=1.21.4

My pymc experiment SHA:
19e67f371d4641fd2f9ad7c8a7887361bdaa53d6


codecov bot commented Jan 17, 2022

Codecov Report

Merging #5366 (5365395) into main (d52655d) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #5366   +/-   ##
=======================================
  Coverage   80.36%   80.36%           
=======================================
  Files          89       89           
  Lines       14808    14808           
=======================================
  Hits        11901    11901           
  Misses       2907     2907           

ricardoV94 (Member) commented Jan 20, 2022

@austereantelope Would you be interested in doing this kind of analysis for this test here (after #5369 is merged):

def test_marginal_likelihood(self):

Even without "mutations" it would be good to know how tight/loose our bound is, because we had a couple of issues with that test in recent times.

austereantelope (Author)

@ricardoV94 I will take a look and analyse the test you mentioned.

michaelosthege (Member) left a review comment:

For this particular test I don't see much benefit, but as @ricardoV94 mentioned above there are other tests where such analyses would be great.

michaelosthege merged commit 4eecf14 into pymc-devs:main on Jan 23, 2022
ricardoV94 (Member) commented Jan 23, 2022

(quoting the Jan 20 comment above about test_marginal_likelihood)

@austereantelope, actually this applies to most of the SMC tests in that file. For us, more important than checking whether we can make the bound more strict is checking whether we can work with fewer samples, as those tests are quite slow. So it is the same type of analysis but in the "other" direction, if that makes sense. What do you think?
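A sketch of that "other direction" analysis, assuming a hypothetical wrapper run_smc_test(draws) that runs one SMC test with the given sample count and reports whether its assertions pass:

```python
def pass_rate_vs_draws(run_smc_test, draw_counts, n_repeats=100):
    """For each candidate sample count, estimate how often the test passes."""
    rates = {}
    for draws in draw_counts:
        passes = sum(run_smc_test(draws=draws) for _ in range(n_repeats))
        rates[draws] = passes / n_repeats
    return rates

# e.g. pass_rate_vs_draws(run_smc_test, draw_counts=[500, 1000, 2000]),
# then pick the smallest draw count whose pass rate stays above ~0.99.
```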
