PyMC3 - Differences in ways observations are passed to model -> difference in results?

I'm trying to understand whether there is any meaningful difference between the ways of passing data into a model - either aggregated or as single trials (note this question only makes sense for certain distributions, e.g. the Binomial).
I am predicting p for a yes/no trial, using a simple model with a Binomial distribution.
What is the difference in the computation/results of the following models (if any)?
I chose the two extremes - either passing in one trial at a time (reducing to a Bernoulli) or passing in the sum of the entire series of trials - to exemplify my meaning, though I am also interested in the differences in between these extremes.
import numpy as np
import scipy.stats
import pymc3 as pm

# set up constants
p_true = 0.1
N = 3000
observed = scipy.stats.bernoulli.rvs(p_true, size=N)
Model 1: combining all observations into a single data point
with pm.Model() as binomial_model1:
    p = pm.Uniform('p', lower=0, upper=1)
    observations = pm.Binomial('observations', N, p, observed=np.sum(observed))
    trace1 = pm.sample(40000)
Model 2: using each observation individually
with pm.Model() as binomial_model2:
    p = pm.Uniform('p', lower=0, upper=1)
    observations = pm.Binomial('observations', 1, p, observed=observed)
    trace2 = pm.sample(40000)
There isn't any noticeable difference in the trace or posteriors in this case. I attempted to dig into the PyMC3 source code to see how the observations were being processed, but I couldn't find the right part.
Possible expected answers:
pymc3 aggregates the observations under the hood for Binomial anyway, so there is no difference
the resultant posterior surface (which is explored in the sampling process) is identical in each case -> there is no meaningful/statistical difference between the two models
there are differences in the resultant statistics because of this and that...

This is an interesting example! Your second suggestion is correct: you can actually work out the posterior analytically, and it will be distributed according to
Beta(1 + sum(observed), 1 + N - sum(observed))
in either case.
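For instance, here is a quick check (a sketch only, assuming trace1 and trace2 from the models above have already been sampled) that overlays the analytic Beta posterior on the sampled traces:
import matplotlib.pyplot as plt
import scipy.stats

# A Uniform(0, 1) prior is Beta(1, 1), so the posterior is
# Beta(1 + successes, 1 + failures) however the data are passed in.
k = observed.sum()
posterior = scipy.stats.beta(1 + k, 1 + N - k)
grid = np.linspace(0.05, 0.15, 500)

plt.hist(trace1['p'], bins=50, density=True, alpha=0.5, label='model 1')
plt.hist(trace2['p'], bins=50, density=True, alpha=0.5, label='model 2')
plt.plot(grid, posterior.pdf(grid), 'k-', label='analytic Beta posterior')
plt.legend()
plt.show()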
The difference in modelling approach would show up if you used, for example, pm.sample_ppc, in that the first would be distributed according to Binomial(N, p) and the second would be N draws of Binomial(1, p).
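A minimal sketch of that posterior predictive difference (assuming the older pm.sample_ppc API; newer PyMC3 releases rename it pm.sample_posterior_predictive):
with binomial_model1:
    ppc1 = pm.sample_ppc(trace1, samples=500)
with binomial_model2:
    ppc2 = pm.sample_ppc(trace2, samples=500)

# Model 1 draws one Binomial(N, p) count per posterior sample;
# model 2 draws N Binomial(1, p) outcomes per posterior sample.
print(ppc1['observations'].shape)  # (500,) or (500, 1)
print(ppc2['observations'].shape)  # (500, N)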


building PyMC3 model incorporating different measurements

I am trying to incorporate different types and replicates of measurements into one model in PyMC3.
Consider the following model: P(t) = P0*exp(-k*B*t), where P(t), P0, and B are concentrations and k is a rate. We measure P(t) at different times and B once, all through counting of particles. k is the parameter of interest we are trying to infer.
My question has two parts:
(1) How to incorporate measurements on P(t) and B into one model?
(2) How to use a variable number of replicate experiments to inform on the value of k?
I think I can answer part (1), but am unsure about whether it is right or done in the right flavour. I failed to generalise the code to include a variable number of replicates.
For one experiment (one replicate):
ts = np.asarray([time0, time1, ...])
counts = np.asarray([countforB, countforPattime0, countforPattime1, ...])

basic_model = pm.Model()
with basic_model:
    k = pm.Uniform('k', 0, 20)
    B = pm.Uniform('B', 0, 1000)
    P = pm.Uniform('P', 0, 1000)
    exprate = pm.Deterministic('exprate', k*B)
    modelmu = pm.math.concatenate([B*np.asarray([1.0]), P*pm.math.exp(-exprate*ts)])
    Y_obs = pm.Poisson('Y_obs', mu=modelmu, observed=counts)
I tried to include different replicates along the lines of the above, but to no avail:
...
k = pm.Uniform('k', 0, 20)                   # same within replicates
B = pm.Uniform('B', 0, 1000, shape=numrepl)  # can vary between expts.
P = pm.Uniform('P', 0, 1000, shape=numrepl)  # can vary between expts.
exprate = ???
modelmu = ???
Multiple Observables
PyMC3 supports multiple observables, that is, you can add multiple RandomVariable objects to the graph with the observed argument set.
Single Trial
In your first case, this would lend some clarity to the model:
counts = [countforPattime0, countforPattime1, ...]

with pm.Model() as single_trial:
    # priors
    k = pm.Uniform('k', 0, 20)
    B = pm.Uniform('B', 0, 1000)
    P = pm.Uniform('P', 0, 1000)

    # transformed RVs
    rate = pm.Deterministic('exprate', k*B)
    mu = P*pm.math.exp(-rate*ts)

    # observations
    B_obs = pm.Poisson('B_obs', mu=B, observed=countforB)
    Y_obs = pm.Poisson('Y_obs', mu=mu, observed=counts)
Multiple Trials
With this additional flexibility, hopefully it makes the transition to multiple trials more obvious. It should go something like:
B_cts = np.array(...)   # shape (N, 1)
Y_cts = np.array(...)   # shape (N, M)
ts = np.array(...)      # shape (1, M)

with pm.Model() as multi_trial:
    # priors
    k = pm.Uniform('k', 0, 20)
    B = pm.Uniform('B', 0, 1000, shape=B_cts.shape)
    P = pm.Uniform('P', 0, 1000, shape=B_cts.shape)

    # transformed RVs
    rate = pm.Deterministic('exprate', k*B)
    mu = P*pm.math.exp(-rate*ts)

    # observations
    B_obs = pm.Poisson('B_obs', mu=B, observed=B_cts)
    Y_obs = pm.Poisson('Y_obs', mu=mu, observed=Y_cts)
There might be some extra syntax stuff to get the matrices multiplying correctly, but this at least includes the correct shapes.
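As a quick shape check outside of PyMC3 (toy numbers, purely illustrative), NumPy broadcasting produces the (N, M) matrix of expected counts:
import numpy as np

N, M = 3, 4                             # 3 replicates, 4 time points (toy values)
B = np.full((N, 1), 500.0)              # one B per replicate, as a column vector
P = np.full((N, 1), 800.0)              # one P per replicate
ts = np.linspace(0.0, 1.0, M)[None, :]  # row vector of times, shape (1, M)
k = 0.005

mu = P * np.exp(-k * B * ts)            # (N, 1) * (1, M) broadcasts to (N, M)
print(mu.shape)                         # (3, 4)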
Priors
Once you get that setup working, it would be in your interest to reconsider the priors. I suspect you have more information about the typical values for those than is currently included, especially since this seems like a chemical or physical model.
For instance, right now the model says,
We believe the true value of B remains fixed for the duration of a trial, but across trials is a completely arbitrary value between 0 and 1000, and measuring it repeatedly within a trial would be Poisson distributed.
Typically, one should avoid truncations unless they are excluding meaningless values. Hence, a lower bound of 0 is fine, but the upper bounds are arbitrary. I'd recommend having a look at the Stan Wiki on choosing priors.
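For example, a weakly informative alternative might look like the sketch below; the hyperparameters here are invented and should come from your domain knowledge:
with pm.Model() as informative_priors:
    # rate is positive, with most prior mass at small values (hypothetical scale)
    k = pm.HalfNormal('k', sd=5)
    # concentrations are positive and centred on a plausible order of magnitude
    B = pm.Gamma('B', mu=300, sd=150)
    P = pm.Gamma('P', mu=300, sd=150)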

Is there a way to get the probability of a prediction using XGBoostRegressor?

I have built an XGBoostRegressor model using around 200 categorical features to predict a continuous time variable.
But I would want to get both the actual prediction and the probability of that prediction as output. Is there any way to get this from the XGBoostRegressor model?
So I want both the predicted value and P(Y|X) as output. Any idea how to do this?
There is no probability in regression. In regression the only output you get is a predicted value (that is why it is called regression), so for any regressor the probability of a prediction is not available; it only exists in classification.
As mentioned before, there is no probability associated with regression.
However, you could probably add a confidence interval on that regression, to see whether or not your regression can be trusted.
One thing to note, though, is that the variance might not be the same across the data.
Let's assume that you study a time-based phenomenon. Specifically, you have the temperature (y) after (x) time (in seconds, for instance) inside an oven. At x = 0s it is at 20°C, and you start heating it, and want to know the evolution in order to predict the temperature after x seconds. The variance could be the same after 20 seconds and after 5 minutes, or completely different. This is called heteroscedasticity.
If you want to use a confidence interval, you probably want to make sure that you have taken care of heteroscedasticity, so that your interval is valid for all the data.
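A toy illustration of that point (numbers invented): the spread of the noise around the trend grows with time, so a single interval width would not fit both ends of the data.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 300, 1000)                 # seconds of heating
noise = rng.normal(scale=0.02 * x + 1.0)      # noise grows with time
y = 20 + 0.5 * x + noise                      # temperature readings
resid = y - (20 + 0.5 * x)                    # residuals around the trend
print(resid[:100].std(), resid[-100:].std())  # small early, large late -> heteroscedastic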
You can probably try to get the distribution of your known outputs, compare the prediction against that curve, and check the p-value. But that would only give you a measure of how realistic it is to get that output, without taking the input into consideration. If you know your inputs/outputs are in a specific interval, this could work.
EDIT
This is how I would do it. Obviously the outputs are your real outputs.
import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate
from scipy.interpolate import interp1d

N = 1000  # the number of samples
mean = 0
std = 1
outputs = np.random.normal(loc=mean, scale=std, size=N)

# We want a normalised histogram (since this is a PDF, it must integrate to 1)
nbins = N / 10
n = int(N / nbins)
p, x = np.histogram(outputs, bins=n, density=True)
plt.hist(outputs, bins=n, density=True)
x = x[:-1] + (x[1] - x[0])/2  # converting bin edges to centers

# Now we want to interpolate:
# f = CubicSpline(x=x, y=p, bc_type='not-a-knot')
f = interp1d(x=x, y=p, kind='quadratic', fill_value='extrapolate')

x = np.linspace(-2.9*std, 2.9*std, 10000)
plt.plot(x, f(x))
plt.show()

# To check: the area under the interpolated PDF should be close to 1
area = integrate.quad(f, x[0], x[-1])
print(area)
Now, the interpolation method is not great for outliers: if a predicted value is extremely far (more than about 3 standard deviations) from your distribution, it won't work. Other than that, you can now use the PDF to get meaningful results.
It is not perfect, but it is the best I came up with in the time I had. I'm sure there are better ways to do it. If your data follow a normal law, it becomes trivial.
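For that normal case, a minimal sketch (the prediction value here is made up) just fits a Gaussian to the known outputs and evaluates the prediction against it:
import numpy as np
from scipy import stats

outputs = np.random.normal(loc=0, scale=1, size=1000)  # your known outputs
prediction = 1.7                                       # a model prediction to assess

mu, sigma = stats.norm.fit(outputs)
density = stats.norm.pdf(prediction, loc=mu, scale=sigma)  # how plausible the value is
tail_p = 2 * stats.norm.sf(abs(prediction - mu) / sigma)   # two-sided tail probability

print(density, tail_p)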
I suggest you look into NGBoost (essentially a wrapper around XGBoost which ultimately provides a probabilistic model).
Here you can find slides on how NGBoost works, as well as the seminal NGBoost paper.
The basic idea is to assume a specific distribution for $P(Y|X=x)$ (by default the Gaussian distribution) and fit an XGBoost model to estimate the best parameters of the distribution (for the Gaussian, $\mu$ and $\sigma$). The model will split the variables' space into different regions with different distributions, i.e. the same family (e.g. Gaussian) but different parameters.
After training the model, you're provided with the method pred_dist, which returns the estimated distribution $P(Y|X=x)$ for a given set of values $x$.
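A minimal usage sketch, assuming the ngboost package and its default Gaussian distribution; the feature and target arrays below are made up:
import numpy as np
from ngboost import NGBRegressor

# toy data standing in for your (already encoded) categorical features
X = np.random.rand(500, 10)
y = 3 * X[:, 0] + np.random.normal(scale=0.5, size=500)

ngb = NGBRegressor().fit(X, y)

point_pred = ngb.predict(X[:5])   # point estimates, like a plain regressor
dist = ngb.pred_dist(X[:5])       # per-row predictive distribution P(Y|X=x)
print(point_pred)
print(dist.params)                # e.g. {'loc': ..., 'scale': ...} for the Gaussian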

pymc: Inferring parameters based on functions of observables

I have observations of several optical emission lines, and I have a model that predicts several (flux) ratios of those lines, based on two parameters, q and z, which I want to infer.
I have created @pymc.deterministic objects that take values of q and z (each of which has an uninformative prior over some physically interesting region) and turn them into a "predicted" ratio. There are about 7 ratios, and they have the form:
@pymc.deterministic(observed=True, value=NII_SII)
def NII_SII_th(q=q, z=z):
    return NII_SII_g(np.array([q, z]))
I can also define the ratios derived from observations, such as
@pymc.deterministic
def NII_SII(NII_6584=NII_6584, SII_6717=SII_6717,
            rcf_NII_6584=rcf_NII_6584, rcf_SII_6717=rcf_SII_6717):
    return np.log10(
        (rcf_NII_6584*NII_6584) /
        (rcf_SII_6717*SII_6717))
where, for instance, NII_6584 is the observed flux of one of the lines and rcf_NII_6584 is the flux correction for that same line. These corrections are themselves determined by the line wavelengths (known with infinite precision), and by a parameter EBV, which can be calculated from the observed flux ratio of two lines that are supposed to have a fixed ratio r:
@pymc.deterministic
def EBV(Ha=Ha, Hb=Hb, r=r, R_V=R_V, Ha_l=Ha_l, Hb_l=Hb_l):
    kHb = gas_meas.calzetti_k(lams=np.array([Ha_l]), Rv=R_V)
    kHa = gas_meas.calzetti_k(lams=np.array([Hb_l]), Rv=R_V)
    return 2.5 / (kHb - kHa) * np.log10((Ha/Hb) / r)
I also have a prior on the value of R_V.
The measurements themselves are expressed as Normal distributions, such as
NII_6584 = pymc.Normal(
    'NII_6584', mu=f_row['[NII]6584'],
    tau=1./e_row['[NII]6584']**2.,
    observed=True, value=f_row['[NII]6584'])
I would like to get estimates of R_V, EBV, q, and z. However, when I make a pymc Model from all these, I am told that Deterministic objects cannot have observed values:
TypeError: __init__() got an unexpected keyword argument 'value'
First, am I misunderstanding the nature of Deterministic objects? If so, how else do I infer based on values that are not directly observed?
Second, am I constructing the observations correctly? It seems odd that I'd have to specify the observed flux as both the mean and the value argument, but it's not clear to me what else to do, other than also model the flux means and variances, which seems unnecessarily complicated.
Any advice would be appreciated!
I don't think you're constructing your observations correctly. This is not a minimal working example, but maybe we can clear up some confusion.
First off, I don't think the @deterministic decorator takes an argument value=<something>. It's not clear which of your deterministic statements is the actual model, but try to translate your code into the following template:
# Define your randomly-distributed variables (I'm assuming they're normal)
q = pymc.Normal(name, mu=mu, tau=tau)
z = pymc.Normal(name2, mu=mu2, tau=tau2)

# Define how you think they generate your data
@pymc.deterministic
def NII_SII_th(q=q, z=z):
    return NII_SII_g(np.array([q, z]))  # this fcn is defined somewhere else

# Your data array
f_row['[NII]6584'] = [...]

# Now link your model and your data
obs = pymc.Normal(modelname, mu=NII_SII_th,
                  observed=True, value=f_row['[NII]6584'])

PyMC: Estimating population parameters where each observation is the sum of two Weibull-distributed variables

I have a list of n observations, each of which is the sum of two Weibull-distributed variables:
x[i] = t1[i] + t2[i]
t1[i] ~ Weibull(shape1, scale1)
t2[i] ~ Weibull(shape2, scale2)
My goal is:
1) Estimate the shape and scale parameters for both Weibull distributions (shape1, scale1, shape2, scale2),
2) For each observation x[i], estimate t1[i] (and t2[i] follows from this).
(Aside: Each observation x[i] is the age of cancer diagnosis, and t1[i] and t2[i] are two different time periods in the development of the tumor. The actual model involves mutation data as well, but before I try that out, I want to make sure that I can use PyMC for this simpler problem.)
I am using PyMC2 to make these estimates, and it looks like the run converges, but to incorrect results. I do not know whether there is a problem with my PyMC model syntax, with the MCMC settings, or both. I tried adapting this advice on using Potentials to model latent variables. First I define x[i] and t1[i] for each observation:
for i in xrange(n):
    x[i] = pm.Index('x_%i' % i, x=data, index=i)  # data is a list of observations
    t1[i] = pm.Weibull('t1_%i' % i, alpha=shape1, beta=scale1)
    # Ensure that the initial guess for t1 is not more than the observed sum:
    if t1[i].value >= x[i].value:
        t1[i].value = 0.95 * x[i].value
Then I define a Deterministic for t2[i] = x[i] - t1[i]:
for i in xrange(n):
    def subtractfunc(t1=t1, x=x, ii=i):
        return x[ii] - t1[ii]
    t2[i] = pm.Lambda('t2_%i' % i, subtractfunc)
And last I define the Potential for t2[i]:
t2dist = np.empty(n, dtype=object)
for i in xrange(n):
    def weibfunc(t2=t2, shape2=shape2, scale2=scale2, ii=i):
        return pm.weibull_like(t2[ii], alpha=shape2, beta=scale2)
    t2dist[i] = pm.Potential(logp=weibfunc,
                             name='t2dist_%i' % i,
                             parents={'shape2': shape2, 'scale2': scale2, 't2': t2},
                             doc='weibull potential for t2',
                             verbose=0,
                             cache_depth=2)
You can see my full code here. I test by simulating 60 independent observations, with shape1 = 1, scale1 = 30, shape2 = 6.5, scale2 = 10, and I run 1e5 iterations of AdaptiveMetropolis. The results converge to a mean of shape1=1.94, scale1=37.9, shape2=0.55, scale2=36.1, and the 95% HPDs do not include the true values. This resulting distribution is not even in the right ballpark, as this histogram shows. (Blue shows the simulated data x[i] that I used, while the red shows the completely different inferred distribution from a representative iteration in the MCMC run.)
Running again with a different random seed, I get shape1=4.65, scale1=23.3, shape2=0.83, scale2=21.3. This distribution is somewhat closer to the truth. Is there some way to change the MCMC settings to consistently get decent results for this sort of problem? Any advice about using PyMC more effectively is much appreciated.
Update -- tried an "assisted" MCMC run:
I also tried assisting the MCMC run by initializing population-level parameters with values close to the truth. The results are somewhat better, but I now find a systematic bias. The histogram below shows the true distribution of observations (blue) against the fitted distribution (red). The right tail fits nicely, but the fit fails to capture the sharp peak at the left side. This bias occurs consistently, for population sizes n = 60 and 100. I am not sure if this is more of a PyMC question or a general MCMC algorithm issue.

Pseudoexperiments in PyMC

Is it possible to perform "pseudoexperiments" using PyMC?
By pseudoexperiments, I mean generating random "observations" by sampling from the prior, and then, given each pseudoexperiment, drawing samples from the posterior. Afterwards, one would compare the trace for each parameter to the sample (obtained from the prior) used in sampling from the posterior.
A more concrete example: Suppose that I want to know the rate of process X. I count how many occurrences there are in a certain period of time. However, I know that process Y also sometimes occurs and will contaminate my count. The rate of process Y is known with some uncertainty. So, I build a model, include my observations, and sample from the posterior:
import pymc

class mymodel:
    rate_x = pymc.Uniform('rate_x', lower=0, upper=100)
    rate_y = pymc.Normal('rate_y', mu=150, tau=1./(15**2))
    total_rate = pymc.LinearCombination('total_rate', [1, 1], [rate_x, rate_y])
    data = pymc.Poisson('data', mu=total_rate, value=193, observed=True)

Mod = pymc.Model(mymodel)
MCMC = pymc.MCMC(Mod)
MCMC.sample(100000, burn=5000, thin=5)
print MCMC.stats()['rate_x']['quantiles']
However, before I do my experiment (or before I "unblind" my analysis and look at my data), I would like to know how sensitive I expect to be -- what will be the uncertainty on my measurement of rate_x?
To answer this, I could sample from the prior
Mod.draw_from_prior()
but this only samples rate_x, rate_y, and calculates total_rate. But once the values of those are set by draw_from_prior(), I can draw a pseudoexperiment:
Mod.data.random()
This just returns a number, so I have to set the value of Mod.data to a random sample. Because Mod.data has the observed flag set, I have to also "force" it:
Mod.data.set_value(Mod.data.random(), force=True)
Now I can sample from the posterior again
MCMC.sample(100000, burn=500, thin=5)
print MCMC.stats()['rate_x']['quantiles']
All this works, so I suppose the simple answer to my question is "yes". But it feels very hacky. Is there a better or more natural way to accomplish this?
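For what it's worth, here is a sketch that simply wraps the steps described above into a loop over pseudoexperiments (still PyMC2, using only the calls already shown; it assumes the quantiles dictionary is keyed by 2.5 and 97.5):
import numpy as np

n_pseudo = 20
widths = []
for _ in xrange(n_pseudo):
    # draw rate_x and rate_y (and hence total_rate) from the prior
    Mod.draw_from_prior()
    # generate a pseudo-observation and force it onto the observed node
    Mod.data.set_value(Mod.data.random(), force=True)
    # re-fit and record the posterior interval width for rate_x
    MCMC.sample(100000, burn=5000, thin=5)
    q = MCMC.stats()['rate_x']['quantiles']
    widths.append(q[97.5] - q[2.5])

print np.mean(widths)  # expected sensitivity before unblinding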
