Censored distributions #1833


Closed
domenzain opened this issue Feb 27, 2017 · 14 comments


@domenzain (Contributor)

Is there a way to express left-, right-, or interval-censored distributions?

Here's a small example illustrating the difference between censoring and truncation with numpy:

#!/usr/bin/env python
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(123)

# Create normally distributed samples.
size = 10000
sigma = 1.
mu = 1.
samples = np.random.normal(mu, sigma, size)

# Define bounds for the distribution
lower=0.
upper=1.

# Keep information about samples being outside of the range.
censored = samples.copy()
censored[censored > upper] = upper
censored[censored < lower] = lower

# Lose all information outside the range that can be sampled. 
truncated = samples[(samples >= lower) & (samples <= upper)]

plt.hist([samples, censored, truncated], bins=40)
plt.show()
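To make the distinction concrete, here is a quick self-contained check (a sketch of my own, with variable names mirroring the snippet above): censoring keeps every sample but clamps out-of-range values to the bounds, while truncation discards those samples entirely.

```python
import numpy as np

np.random.seed(123)
samples = np.random.normal(1., 1., 10000)
lower, upper = 0., 1.

# Censoring: same sample count, out-of-range values clamped to the bounds.
censored = np.clip(samples, lower, upper)

# Truncation: out-of-range samples are dropped altogether.
truncated = samples[(samples >= lower) & (samples <= upper)]

print(len(samples), len(censored), len(truncated))
```

The censored array still has 10000 entries with spikes of mass at 0 and 1; the truncated array is strictly smaller.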

Using Bound and Normal I can describe something like the truncated distribution. But many processes produce censored data. How can censoring be expressed in PyMC3?

@fonnesbeck (Member)

We do not have built-in support for censoring. When I have to deal with this (for survival analysis), I typically just use a vector of censoring indicators.
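A minimal sketch of that indicator approach outside of PyMC3 (the function name and the scipy formulation are mine, not from this thread): exactly observed values contribute the Normal log-pdf, while right-censored values contribute the log survival function P(X > t) at the censoring time.

```python
import numpy as np
from scipy import stats

def censored_normal_loglik(mu, sigma, values, event):
    """Log-likelihood for Normal data with right-censoring indicators.

    event == 1: value observed exactly      -> log-pdf term
    event == 0: value right-censored there  -> log survival term, log P(X > t)
    """
    event = np.asarray(event, dtype=bool)
    values = np.asarray(values, dtype=float)
    ll = np.where(
        event,
        stats.norm.logpdf(values, mu, sigma),
        stats.norm.logsf(values, mu, sigma),
    )
    return ll.sum()
```

In a PyMC3 model the same split can be expressed by giving the exactly observed values a Normal likelihood and adding the censored terms through a Potential.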

@aseyboldt (Member)

There is a section about this in the Stan manual (section 11.3, "Censored data"), and it should work pretty much the same in PyMC3. For the second method you might have to code the CCDF functions on your own (a lot of special functions are available in theano); as far as I know PyMC3 does not provide those out of the box. You can use a Potential to do the same as Stan's target += something.

@domenzain (Contributor, Author)

I'll give it a go and create an example.

@domenzain (Contributor, Author)

The first technique in the manual was straightforward to test, and it seems to work as expected, despite being computationally very heavy. I'll give the second technique a go, too.

I've also found a paper, "Likelihood Estimation for Censored Random Vectors" (Schnedler, 2005, PDF), that describes a technique to automatically generate the likelihood for a broad category of censoring problems. Perhaps this could be implemented eventually?

When comparing a Normal model on uncensored data against a Bounded Normal on truncated data, the latter behaves differently from what I'd expect: the estimates of both mu and sigma are off by a factor of 2.

Am I doing something obviously incorrect here? Did I forget to normalize something?

#!/usr/bin/env python
import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm

# Produce normally distributed samples
np.random.seed(123)
size = 10000
sigma = 5.
mu = 13.
samples = np.random.normal(mu, sigma, size)
expected_mu = np.mean(samples)
expected_sigma = np.std(samples)

# Set censoring/truncation limits.
high = 10.
low = 0.

# Truncate samples
truncated = samples[(samples > low) & (samples < high)]

# Keep tabs on left/right censoring for censored model
n_right_censored = len(samples[samples >= high])
n_left_censored = len(samples[samples <= low])
n_observed = len(samples) - n_right_censored - n_left_censored

# Omniscient model
with pm.Model():
    mu = pm.Normal('mu', mu=0., sd=(high-low)/2.)
    sigma = pm.HalfNormal('sigma', sd=(high-low)/2.)
    observed = pm.Normal('observed', mu=mu, sd=sigma, observed=samples)
    trace = pm.sample(10000, njobs=6)
    print('Omniscient model')
    pm.summary(trace)
    pm.traceplot(trace)
    plt.show()

# Truncated model
with pm.Model():
    mu = pm.Normal('mu', mu=0., sd=(high-low)/2.)
    sigma = pm.HalfNormal('sigma', sd=(high-low)/2.)
    observed = pm.Bound(
        pm.Normal, lower=low, upper=high
    )('observed', mu=mu, sd=sigma, observed=truncated)
    trace = pm.sample(10000, njobs=6)
    print('Truncated model')
    pm.summary(trace)
    pm.traceplot(trace)
    plt.show()

# Censored model
with pm.Model():
    mu = pm.Normal('mu', mu=0., sd=(high-low)/2.)
    sigma = pm.HalfNormal('sigma', sd=(high-low)/2.)
    right_censored = pm.Bound(pm.Normal, lower=high)(
        'right_censored', mu=mu, sd=sigma, shape=n_right_censored
    )
    left_censored = pm.Bound(pm.Normal, upper=low)(
        'left_censored', mu=mu, sd=sigma, shape=n_left_censored
    )
    observed = pm.Normal(
        'observed',
        mu=mu,
        sd=sigma,
        observed=truncated,
        shape=n_observed
    )
    trace = pm.sample(300)
    print('Censored model')
    pm.summary(trace)
    pm.traceplot(trace)
    plt.show()

@aseyboldt (Member)

aseyboldt commented Mar 1, 2017

Bound does not do what you think it does. It does not correct the logp with the CDF or CCDF of the distribution; it only takes care of the re-parametrization. I'd consider this a bug in PyMC3, but others might disagree. At the moment you can work around it by manually adding a potential for the CDF (I think something like pm.Potential('normal_truncation', -tt.log(tt.erf((high - mu) / sigma) - tt.erf((low - mu) / sigma))), but I haven't tested it).

@aseyboldt (Member)

This seems to work:

# Truncated model
with pm.Model() as model:
    mu = pm.Flat('mu')
    sigma = pm.HalfCauchy('sigma', beta=2.5)
    pm.Normal('observed', mu=mu, sd=sigma, observed=truncated)
    trunc = tt.log(tt.erf((high - mu) / sigma / np.sqrt(2)) - tt.erf((low - mu) / sigma / np.sqrt(2)))
    pm.Potential('normal_truncation', - len(truncated) * trunc)

    trace = pm.sample(2000, tune=1000, njobs=4, init=None)
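The effect of that truncation term can be sanity-checked without PyMC3 at all (a sketch of my own using scipy; the function names are not from this thread): maximizing the Normal log-likelihood minus len(truncated) times the log of the probability mass inside the window recovers mu and sigma, while the naive sample mean of the truncated data is badly biased toward the window.

```python
import numpy as np
from scipy import optimize, stats

np.random.seed(123)
samples = np.random.normal(13., 5., 10000)
low, high = 0., 10.
truncated = samples[(samples > low) & (samples < high)]

def neg_loglik(params):
    # Parametrize sigma on the log scale to keep it positive.
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    # Probability mass remaining inside the truncation window.
    p_inside = stats.norm.cdf(high, mu, sigma) - stats.norm.cdf(low, mu, sigma)
    if p_inside <= 0:  # numerical underflow for extreme parameter values
        return np.inf
    return -(stats.norm.logpdf(truncated, mu, sigma).sum()
             - len(truncated) * np.log(p_inside))

res = optimize.minimize(neg_loglik, x0=[5., np.log(2.)], method='Nelder-Mead')
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)  # close to the true mu=13, sigma=5
```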

@domenzain (Contributor, Author)

domenzain commented Mar 1, 2017

Thank you for that.

Here's the second technique in the manual for censored data. Unsurprisingly, this is much cheaper computationally.

It would have been much quicker had the standard functions already been available (numerical stability is not great for the lccdf with a naive implementation). If anything, providing the functions needed to fix Bound in #1843 would be a huge convenience.

import theano.tensor as tt

# Unimputed censored model
def normal_lcdf(mu, sigma, x):
    # Numerically stable log CDF of the Normal distribution.
    z = (x - mu) / sigma
    return tt.switch(
        tt.lt(z, -1.0),
        tt.log(tt.erfcx(-z / tt.sqrt(2.)) / 2.) - tt.sqr(z) / 2.,
        tt.log1p(-tt.erfc(z / tt.sqrt(2.)) / 2.)
    )


def normal_lccdf(mu, sigma, x):
    # Numerically stable log complementary CDF (log survival function).
    z = (x - mu) / sigma
    return tt.switch(
        tt.gt(z, 1.0),
        tt.log(tt.erfcx(z / tt.sqrt(2.)) / 2.) - tt.sqr(z) / 2.,
        tt.log1p(-tt.erfc(-z / tt.sqrt(2.)) / 2.)
    )


def censored_right_likelihood(mu, sigma, n_right_censored, upper_bound):
    return n_right_censored * normal_lccdf(mu, sigma, upper_bound)


def censored_left_likelihood(mu, sigma, n_left_censored, lower_bound):
    return n_left_censored * normal_lcdf(mu, sigma, lower_bound)

with pm.Model():
    mu = pm.Normal('mu', mu=0., sd=(high-low)/2.)
    sigma = pm.HalfNormal('sigma', sd=(high-low)/2.)
    observed = pm.Normal(
        'observed',
        mu=mu,
        sd=sigma,
        observed=truncated,
    )
    right_censored = pm.Potential(
        'right_censored',
        censored_right_likelihood(mu, sigma, n_right_censored, high)
        )
    left_censored = pm.Potential(
        'left_censored',
        censored_left_likelihood(mu, sigma, n_left_censored, low)
    )
    trace = pm.sample(50000)
    print('Censored model')
    pm.summary(trace)
    pm.traceplot(trace)
    plt.show()
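As an aside (this check is mine, not from the thread), the stable branches can be verified against scipy with a direct numpy version of the same erfcx/log1p trick, with z standardized as (x - mu) / sigma: the erfcx tail branch stays finite and matches scipy.stats.norm.logcdf deep in the tail, where the naive log(cdf) underflows to -inf.

```python
import numpy as np
from scipy import special, stats

def normal_lcdf_np(mu, sigma, x):
    # numpy version of the numerically stable Normal log-CDF
    z = (x - mu) / sigma
    with np.errstate(divide='ignore'):  # the unselected where-branch may hit log(0)
        return np.where(
            z < -1.0,
            np.log(special.erfcx(-z / np.sqrt(2.)) / 2.) - z ** 2 / 2.,
            np.log1p(-special.erfc(z / np.sqrt(2.)) / 2.),
        )

x = np.array([-40., -10., -1., 0., 2.])
stable = normal_lcdf_np(0., 1., x)
reference = stats.norm.logcdf(x)  # scipy's own stable log-CDF

with np.errstate(divide='ignore'):
    naive = np.log(stats.norm.cdf(x))  # underflows to -inf at x = -40
```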

@domenzain (Contributor, Author)

Should I make a pull request for an example on the different methods of dealing with censored data?

@aseyboldt (Member)

I guess the pull request is a question for @fonnesbeck or @twiecki, but I doubt they'd say no. It might be better to wait for #1843, though.
Good idea about the numerical stability of normal_cdf; however, the if will not do what you want. tt.lt and tt.gt return variables that represent the comparison in a computational graph, not booleans, so Python will always take the first branch, since the return value of tt.gt does not evaluate to False or None. You can use tt.switch to do what you want.

@domenzain (Contributor, Author)

Thank you again. I've updated the code above to use tt.switch.

@twiecki (Member)

twiecki commented Mar 3, 2017

Yes, I think such a PR would be great. It's a bit unfortunate we have to add CDFs too, though.

@domenzain (Contributor, Author)

I'll make a PR with these examples for PyMC3 as it stands, with the log CDF and log CCDF hard coded in the example as above.

Where would you put these functions otherwise? As methods of each distribution?

@twiecki (Member)

twiecki commented Mar 3, 2017 via email

@fonnesbeck (Member)

It might be handy to have log-CDFs for our distributions. Definitely not a bad thing.

As far as an interface goes, it might be nice to have it mirror Bounded, such as:

censored_gamma = pm.Censored(pm.Gamma, right=censoring_times)

domenzain added a commit to domenzain/pymc3 that referenced this issue Mar 4, 2017:
Implement numerically stable convenience functions for the Normal distribution. This will make the work in pymc-devs#1833 and pymc-devs#1843 easier.
twiecki closed this as completed Dec 22, 2018