Description
Describe the issue:
Automatic imputation fails silently in PyMC if a user passes partially observed data held in pm.ConstantData() or pm.MutableData() to the observed parameter of any distribution. In simple models, the user won't be able to sample (as the log-likelihood will evaluate to nan), but I have also been able to run more complex (GP) models that sampled, likely producing wrong results (see https://discourse.pymc.io/t/issue-imputing-data-for-gaussian-process-model/11626/3 for details).
Based on my initial review of the source code, the culprit seems to be the Model.make_obs_var() method, where the check for whether the passed data is masked is performed with mask = getattr(data, "mask", None), which always returns None for tensors.
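To make the failure mode concrete, here is a minimal NumPy-only sketch of that lookup. TensorLike is a hypothetical stand-in for a tensor variable that wraps its data (it is not a PyMC or PyTensor class), and find_mask is an illustrative name for the getattr pattern quoted above.

```python
import numpy as np

# Partially observed data: the first two entries are missing.
x = np.array([np.nan, np.nan, 0.5, -1.2])
masked_x = np.ma.masked_where(np.isnan(x), x)


class TensorLike:
    """Hypothetical stand-in for a tensor variable wrapping its data."""

    def __init__(self, value):
        self.value = value


def find_mask(data):
    # Same lookup pattern as the check described in this issue.
    return getattr(data, "mask", None)


# A raw masked array exposes its mask, so imputation can be triggered.
mask_from_array = find_mask(masked_x)

# The wrapper has no "mask" attribute, so the data looks fully observed.
mask_from_wrapper = find_mask(TensorLike(masked_x))
```

Because the lookup falls back to None without raising, nothing signals that the missing entries were overlooked.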
In the case of pm.ConstantData(), the fix appears to be quite simple: retrieve the mask with mask = getattr(data.value, "mask", None) instead. In the case of pm.MutableData(), however, the issue seems to be that pytensor.shared() does not maintain masked values. That is very problematic on its own if masked values are represented by actual numbers and not np.nan. I'll file an issue under the pytensor project about this, too.
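The two cases can be sketched with NumPy alone. ConstantLike is a hypothetical stand-in for a constant that keeps the masked array in .value, and np.asarray() is used here only as an assumed analogue of storage that drops the mask; neither is the actual PyMC or PyTensor code path.

```python
import numpy as np

# Missing entry encoded as a sentinel number rather than np.nan.
masked = np.ma.masked_equal(np.array([1.0, -999.0, 3.0]), -999.0)


class ConstantLike:
    """Hypothetical stand-in for a constant that keeps its data in .value."""

    def __init__(self, value):
        self.value = value


# Case 1: the masked array survives in .value, so the mask is recoverable.
const = ConstantLike(masked)
mask_via_value = getattr(const.value, "mask", None)

# Case 2: conversion to a plain ndarray drops the mask entirely.
plain = np.asarray(masked)
mask_after_conversion = getattr(plain, "mask", None)
# plain[1] is now an ordinary-looking -999.0 that would enter the likelihood.

# Filling with np.nan before conversion keeps missingness detectable.
recoverable = masked.filled(np.nan)
rebuilt = np.ma.masked_invalid(recoverable)
```

With np.nan as the fill value, a dropped mask at least surfaces as a nan log-likelihood; with a numeric sentinel, the model would treat the placeholder as a real observation.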
I'd be happy to contribute a PR with the pm.ConstantData() fix, plus possibly a NotImplementedError for pm.MutableData() if this indeed cannot be solved in other ways. I'm new to the PyMC code base and may be missing the big picture!
Reproducible code example:
import pymc as pm
import numpy as np

# Simulate a predictor and an outcome that depends on it
real_X = np.random.default_rng().normal(size=1000)
Y = np.random.default_rng().normal(loc=3 * real_X, scale=0.1)

# Mark the first 10 predictor values as missing
X = real_X.copy()
X[0:10] = np.nan
masked_X = np.ma.masked_where(np.isnan(X), X)

with pm.Model() as m:
    β = pm.Normal("β", 0, 1)
    σ = pm.Exponential("σ", 1)

    # This works
    # X = pm.Normal("X", 0, 1, observed = masked_X, dims='test')

    # This even fails to sample
    X = pm.Normal("X", 0, 1, observed = pm.ConstantData("masked_X", masked_X))

    pm.Normal("Y", pm.math.dot(X, β), σ, observed=Y)
    pm.sample()
Error message:
No response
PyMC version information:
pymc 5.1.2
pytensor 2.10.1
Context for the issue:
The fact that it fails silently on some models is particularly concerning: it means some users may be using PyMC and getting wrong inference results.