-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Add ICDF for the Kumaraswamy distribution #6642
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #6642 +/- ##
==========================================
+ Coverage 91.99% 92.01% +0.02%
==========================================
Files 95 95
Lines 16205 16246 +41
==========================================
+ Hits 14907 14948 +41
Misses 1298 1298
|
check_icdf( | ||
pm.Kumaraswamy, | ||
{"a": Rplus, "b": Rplus}, | ||
lambda q, a, b: (1 - (1 - q) ** (1 / b)) ** (1 / a), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't scipy have a ppf function? The point of this test is more to compare with an external reference to be sure we implemented it correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you mention it doesn't. In that case an alternative would be to check that for some values exp(logcdf(dist, icdf(dist, value))) == value
Or something along those lines, basically testing that the inverse property holds.
You could add that as a util test function in testing.py
.
We do something similar for the consistency between discrete logp and logcdf helper.
It's a bit more work but it will be very helpful for testing other distributions whose ppf is missing in scipy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, SciPy doesn't have a ppf function for the Kumaraswamy distribution.
Sure, I will look into adding the util test function for testing when scipy doesn't have an implementation. I will get back on this if I need help, or if not, with the next commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ricardoV94 I added a util function for testing the inverse of the icdf matches exp logcdf
;
However I found it is failing tests for small values of distribution parameter
Example test case where this is failing:
The icdf expression is:
For the above values for
For such a small value of
This fails the self consistency test (between icdf and cdf) for small values of
Do you have any idea to make this numerically stable? Working with the logarithms of numbers doesn't make sense as there is addition happening in the ICDF expression which also causes the numerical stability issue.
The other alternative is to simply skip the tests for small values of
Edit: I am now thinking on the lines of using scipy.special.powm1
or scipy.special.log1p
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are you testing? We don't want to test equality directly, but using something like np.testing.assert_almost_equal
or something that allows for some wiggle room.
It may make sense to compare que log of the icdf value if that's more stable.
But yes, we don't need to test values very close to 0 or 1. It would be good to check if the distributions that already have icdf pass as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am already using np.testing.assert_almost_equal
. This still doesn't work because the probability returned by the cdf of the value returned by the icdf, is
It may make sense to compare que log of the icdf value if that's more stable.
log of the icdf does make sense, but we have to implement a function like logicdf everywhere. log(icdf) will not work as icdf already returns the less precise value, so it has to be implemented directly as logicdf.
But yes, we don't need to test values very close to 0 or 1.
Okay, in that case I will change the domain of testing to exclude small values of
It would be good to check if the distributions that already have icdf pass as well.
The other continuous distributions (uniform and normal) with icdf are passing the icdf self consistency test, even with small values of the distribution parameters.
For the discrete case, the test passed for the discrete uniform distribution, but failed for the geometric distribution for small values of the distribution parameter (
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to implement log-icdf, that's just the log of icdf which we already have.
My proposal was to compare logcdf(dist, icdf(dist, value)) = log(value)
, if that makes sense. Instead of exponentiating the logcdf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My proposal was to compare logcdf(dist, icdf(dist, value)) = log(value)
No that doesn't work, I already tried it before I narrowed down to the root cause.
The numerical error starts at the output of icdf itself for very small parameters for distributions involving power terms with this parameter. This the log outside does not help.
I will proceed now by excluding very small values of the distribution parameters for the self consistency tests for icdf, if that sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well actually I see that for some of the domains, domains excluding small parameters is already defined.
For example, using Rplusbig
instead of Rplus
for the a and b distribution parameters for the Kumaraswamy distribution already passed the test.
I could simply use these domains for the tests. Does this sound good?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh while this inverse icdf -cdf self test works for continuous distributions, it doesn't work in the same way for discrete distributions.
This is because there is a band of probability for which the icdf has the same value and this this need not match the probability from the cdf.
I will now implement another util test function for discrete distributions which checks that value = icdf(dist, exp(logcdf(dist, value)))
for a sample of integer values in the domain of the discrete distribution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shaping up really well!
Some suggestions below
pymc/distributions/continuous.py
Outdated
@@ -1287,6 +1287,16 @@ def logcdf(value, a, b): | |||
msg="a > 0, b > 0", | |||
) | |||
|
|||
def icdf(value, a, b): | |||
res = (1 - (1 - value) ** pt.reciprocal(b)) ** pt.reciprocal(a) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this, would it help to compute this icdf on the log scale (to get rid of these exponents) and only exponentiate the final result?
res = (1 - (1 - value) ** pt.reciprocal(b)) ** pt.reciprocal(a) | |
res = exp(log(1 - (1 - value) * pt.reciprocal(b) * pt.reciprocal(a)) |
Or something along those lines (might have made a mistake)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pymc/testing.py
Outdated
@@ -212,6 +212,7 @@ def RandomPdMatrix(n): | |||
Rplusbig = Domain([0, 0.5, 0.9, 0.99, 1, 1.5, 2, 20, np.inf]) | |||
Rminusbig = Domain([-np.inf, -2, -1.5, -1, -0.99, -0.9, -0.5, -0.01, 0]) | |||
Unit = Domain([0, 0.001, 0.1, 0.5, 0.75, 0.99, 1]) | |||
Unitbig = Domain([0.1, 0.5, 0.75, 0.99, 1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The edges aren't included, do to test 0.1
you need to add something before
Unitbig = Domain([0.1, 0.5, 0.75, 0.99, 1]) | |
Unitbig = Domain([0, 0.1, 0.5, 0.75, 0.99, 1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unitbig isn't used now, so I am discarding this.
It was added to test the self-consistency of ICDF-logcdf for the uniform distribution, but we are not using this as they have scipy ppf functions available.
@@ -284,6 +289,10 @@ def test_normal(self): | |||
{"mu": R, "sigma": Rplus}, | |||
lambda q, mu, sigma: st.norm.ppf(q, mu, sigma), | |||
) | |||
check_selfconsistency_icdf( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to test this for scipy distributions that have ppf, I meant just for you to confirm it's working well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
I've removed these tests where scipy ppf functions are available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I forgot about this one.
Is #6669 a blocker for this PR?
pymc/testing.py
Outdated
dist_logcdf_fn = compile_pymc(list(inputvars(dist_logcdf)), dist_logcdf) | ||
|
||
domains = paramdomains.copy() | ||
domains["value"] = Domain(np.linspace(0.1, 1, 100)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably enough?
domains["value"] = Domain(np.linspace(0.1, 1, 100)) | |
domains["value"] = Domain(np.linspace(0, 1, 10)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
It is a blocker for discrete ICDF self consistency checks, whose code I haven't pushed yet. It is not a blocker for code I have pushed so far in this PR, which includes continuous ICDF self consistency checks. |
@gokuld How do you prefer to proceed? Get this done or wait and also get the discrete cases sorted out? I would suggest we wait, but up to you :) |
Let us wait, no problem. |
bf4fb36
to
22d2ad6
Compare
@ricardoV94 I've added self consistency tests for discrete ICDF, now that the blocker #6671 is closed. Though SciPy PPFs are available for the discrete uniform and discrete geometric distributions and used with I have also fixed the bug and pushed the commit here. |
22d2ad6
to
f754a9f
Compare
|
Try setting |
It is passing locally even with The failure upstream is happening in |
Does this branch already include the fixes from the other PR? |
Forgot to sync my forked repo. :/ I have synced it now. Do you think this was the issue? I can see commit history on this branch now, on Github with the fixes from the other PR. |
You might want to rebase from main and then force push so only the new commits show up (if that's the problem). Make sure to checkout a backup branch first in case something goes wrong. |
244bf1e
to
59e0c27
Compare
computed_value = dist_icdf_fn( | ||
**point, value=np.exp(dist_logcdf_fn(**point, value=value)) | ||
) | ||
assert ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.testing.assert_almost_equal
will give you a nice printing when it fails out of the box
@ricardoV94 This passed the tests in Some thoughts:
|
This reverts commit b0c93b77e3d39c81b22325b644cf94c2beeb9e9c. Avoid mypy test failure. This is not needed, we have identified the root cause now.
What is the disadvantage of the current approach Does this PR include changes that were already committed elsewhere? |
One issue is that after computing the CDF, if the result has the tiniest numerical error, it can create a huge error in the ICDF (exactly a 1 integer size error). This is because ICDF is integer valued but is very sensitive at the places in the CDF graph where there is a vertical jump. (The CDF function has a staircase like appearance with multiple vertical jumps.) This means, And while truncating we need to decide the number of decimal places which is another magic number / parameter, makes it less elegant. This kind of error, was the root cause I identified with the issue where it passed tests locally on my machine while failed on the github actions in the cloud. The difference in machine hardware could change the numerical value of the CDF ever so slightly and cause a huge integer change in the ICDF that is applied on top of it. I verified and solved this root cause, by adding truncation after CDF and before ICDF as discussed above (
No these changes are made only in this PR. |
Then let's do it the other way around. For your other question, its' fine to check with |
An issue with this other way round |
What is this PR about?
Adding the ICDF function for the Kumaraswamy distribution.
Issue: #6612
Creation of tests for the added ICDF function.
Reference: https://en.wikipedia.org/wiki/Kumaraswamy_distribution
SciPy does not have an implementation for the Kumaraswamy distribution so this ICDF was coded explicitly.
Checklist
Major / Breaking Changes
New features
Bugfixes
Documentation
Maintenance
📚 Documentation preview 📚: https://pymc--6642.org.readthedocs.build/en/6642/