-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Censored distributions #1833
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
We do not have built-in support for censoring. When I have to deal with this (for survival analysis), I typically just use a vector of indicators to indicate censoring. |
There is a section about this in the stan manual (section 11.3, Cencored data), and this should work pretty much the same in pymc3. For the second method you might have to code the ccdf functions on your own (a lot of special functions are in theano), as far as I know pymc does not provide those out of the box. You can use a |
I'll give it a go and create an example. |
The first technique in the manual was straightforward to test. And seems to work as expected, despite being very heavy computationally. I'll give the second technique a go, too. I've also found a paper Likelihood Estimation for Censored When comparing with a Am I doing something obviously incorrect here? Did I forget to normalize something? #!/usr/bin/env python
import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm
# Produce normally distributed samples
np.random.seed(123)
size = 10000
sigma = 5.
mu = 13.
samples = np.random.normal(mu, sigma, size)
expected_mu = np.mean(samples)
expected_sigma = np.std(samples)
# Set censoring/truncation limits.
high = 10.
low = 0.
# Truncate samples
truncated = samples[(samples > low) & (samples < high)]
# Keep tabs on left/right censoring for censored model
n_right_censored = len(samples[samples >= high])
n_left_censored = len(samples[samples <= low])
n_observed = len(samples) - n_right_censored - n_left_censored
# Omniscient model
with pm.Model():
mu = pm.Normal('mu', mu=0., sd=(high-low)/2.)
sigma = pm.HalfNormal('sigma', sd=(high-low)/2.)
observed = pm.Normal('observed', mu=mu, sd=sigma, observed=samples)
trace = pm.sample(10000, njobs=6)
print('Omniscient model')
pm.summary(trace)
pm.traceplot(trace)
plt.show()
# Truncated model
with pm.Model():
mu = pm.Normal('mu', mu=0., sd=(high-low)/2.)
sigma = pm.HalfNormal('sigma', sd=(high-low)/2.)
observed = pm.Bound(
pm.Normal, lower=low, upper=high
)('observed', mu=mu, sd=sigma, observed=truncated)
trace = pm.sample(10000, njobs=6)
print('Truncated model')
pm.summary(trace)
pm.traceplot(trace)
plt.show()
# Censored model
with pm.Model():
mu = pm.Normal('mu', mu=0., sd=(high-low)/2.)
sigma = pm.HalfNormal('sigma', sd=(high-low)/2.)
right_censored = pm.Bound(pm.Normal, lower=high)(
'right_censored', mu=mu, sd=sigma, shape=n_right_censored
)
left_censored = pm.Bound(pm.Normal, upper=low)(
'left_censored', mu=mu, sd=sigma, shape=n_left_censored
)
observed = pm.Normal(
'observed',
mu=mu,
sd=sigma,
observed=truncated,
shape=n_observed
)
trace = pm.sample(300)
print('Censored model')
pm.summary(trace)
pm.traceplot(trace)
plt.show() |
|
This seems to work: # Truncated model
with pm.Model() as model:
mu = pm.Flat('mu')
sigma = pm.HalfCauchy('sigma', beta=2.5)
pm.Normal('observed', mu=mu, sd=sigma, observed=truncated)
trunc = tt.log(tt.erf((high - mu) / sigma / np.sqrt(2)) - tt.erf((low - mu) / sigma / np.sqrt(2)))
pm.Potential('normal_truncation', - len(truncated) * trunc)
trace = pm.sample(2000, tune=1000, njobs=4, init=None) |
Thank you for that. Here's the second technique in the manual for censored data. Unsurprisingly this is much better computationally. It would have been much quicker had the standard functions been around already (the numerical stability is not great for lccdf using the naive implementation). If anything providing the functions needed to fix import theano.tensor as tt
# Unimputed censored model
def normal_lcdf(mu, sigma, x):
z = (x - mu) / tt.sqrt(2. * sigma ** 2.)
return tt.switch(
tt.lt(z, -1.0),
tt.log(tt.erfcx(-z/tt.sqrt(2.))/2.) - tt.sqr(tt.abs_(z))/2,
tt.log1p(-tt.erfc(z/tt.sqrt(2.))/2.)
)
def normal_lccdf(mu, sigma, x):
z = (x - mu) / tt.sqrt(2. * sigma ** 2.)
return tt.switch(
tt.gt(z, 1.0),
tt.log(tt.erfcx(z/tt.sqrt(2.))/2) - tt.sqr(tt.abs_(z))/2.,
tt.log1p(-tt.erfc(-z / tt.sqrt(2.))/2.)
)
def censored_right_likelihood(mu, sigma, n_right_censored, upper_bound):
return n_right_censored * normal_lccdf(mu, sigma, upper_bound)
def censored_left_likelihood(mu, sigma, n_left_censored, lower_bound):
return n_left_censored * normal_lcdf(mu, sigma, lower_bound)
with pm.Model():
mu = pm.Normal('mu', mu=0., sd=(high-low)/2.)
sigma = pm.HalfNormal('sigma', sd=(high-low)/2.)
observed = pm.Normal(
'observed',
mu=mu,
sd=sigma,
observed=truncated,
)
right_censored = pm.Potential(
'right_censored',
censored_right_likelihood(mu, sigma, n_right_censored, high)
)
left_censored = pm.Potential(
'left_censored',
censored_left_likelihood(mu, sigma, n_left_censored, low)
)
trace = pm.sample(50000)
print('Censored model')
pm.summary(trace)
pm.traceplot(trace)
plt.show() |
Should I make a pull request for an example on the different methods of dealing with censored data? |
I guess the pull request is a question for @fonnesbeck or @twiecki, but I doubt they'd say no. But it might be better to wait for #1843. |
Thank you again. I've updated the code above to use |
Yes, I think such a PR would be great. It's a bit unfortunate we have to add CDFs too though. |
I'll make a PR with these examples for PyMC3 as it stands, with the log CDF and log CCDF hard coded in the example as above. Where would you put these functions otherwise? As methods of each distribution? |
Right, exactly.
…On Fri, Mar 3, 2017 at 2:47 PM, domenzain ***@***.***> wrote:
I'll make a PR with these examples for PyMC3 as it stands, with the log
CDF and log CCDF hard coded in the example as above.
Where would you put these functions otherwise? As methods of each
distribution?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1833 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AApJmBabUoSvr4uF65itI-DqTyd-GLogks5riBnxgaJpZM4MNTZm>
.
|
It might be handy to have log-CDFs for our distributions. Definitely not a bad thing. As far as an interface goes, it might be nice to have it mirror
|
Implement numerically stable convenience functions for the Normal distribution. This will make the work in pymc-devs#1833 pymc-devs#1843 easier.
Is there a way to express left, right or interval censored distributions?
Here's a small example illustrating the differences with numpy:
Using
Bound
andNormal
I can describe something like thetruncated
distribution.But many processes behave like
censored
. How can the latter be expressed in PyMC3?The text was updated successfully, but these errors were encountered: