pymc determine sum of random variables - python

I have two independent, normally distributed random variables a and b. In PyMC, it's something like:
from pymc import Normal

def model():
    a = Normal('a', tau=0.01)
    b = Normal('b', tau=0.1)
I'd like to know the distribution of a+b, treating it as a normal distribution, that is:
from pymc import Normal, Uniform

def model():
    a = Normal('a', tau=0.01)
    b = Normal('b', tau=0.1)
    tau_c = Uniform("tau_c", lower=0.0, upper=1.0)
    c = Normal("a+b", tau=tau_c, observed=True, value=a+b)
Then I'd like to estimate tau_c, but this doesn't work in PyMC because a and b are stochastic (it would be possible if they were arrays, but I don't have observations of a or b, I just know their distributions).
One way I think I could do it is to generate random values from the distributions of a and b and then do this:
def model(a, b):
    tau_c = Uniform("tau_c", lower=0.0, upper=1.0)
    c = Normal("a+b", tau=tau_c, observed=True, value=a+b)
But I think there's a better way of doing this with pymc.
Thanks!

If I understood your question and code correctly, you should be doing something simpler. If you want to estimate the parameters of the distribution of the sum of a and b, use only the first block in the following example. If you also want to estimate the parameters of a independently of those of b, then also use the other two blocks:
import pymc3 as pm

with pm.Model() as model:
    # block 1: parameters of the distribution of a+b
    mu = pm.Normal('mu', mu=0, sd=10)
    sd = pm.HalfNormal('sd', 10)
    alpha = pm.Normal('alpha', mu=0, sd=10)
    ab = pm.SkewNormal('ab', mu=mu, sd=sd, alpha=alpha, observed=a+b)

    # block 2: parameters of a on its own
    mu_a = pm.Normal('mu_a', mu=0, sd=10)
    sd_a = pm.HalfNormal('sd_a', 10)
    alpha_a = pm.Normal('alpha_a', mu=0, sd=10)
    a = pm.SkewNormal('a', mu=mu_a, sd=sd_a, alpha=alpha_a, observed=a)

    # block 3: parameters of b on its own
    mu_b = pm.Normal('mu_b', mu=0, sd=10)
    sd_b = pm.HalfNormal('sd_b', 10)
    alpha_b = pm.Normal('alpha_b', mu=0, sd=10)
    b = pm.SkewNormal('b', mu=mu_b, sd=sd_b, alpha=alpha_b, observed=b)

    trace = pm.sample(1000)
Be sure to use the latest version of PyMC3, since previous versions did not include the SkewNormal distribution.
Update:
Given that you changed your question:
If a and b are independent random variables and both are normally distributed then their sum is going to be normally distributed.
a ~ N(mu_a, sd_a²)
b ~ N(mu_b, sd_b²)
a+b ~ N(mu_a+mu_b, sd_a²+sd_b²)
That is, you sum their means and you sum their variances (not their standard deviations). You don't need PyMC3 for this.
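For example, plugging in the precisions from the question (tau is the inverse variance) and assuming zero means, the parameters of a+b follow directly:

import numpy as np

var_a, var_b = 1 / 0.01, 1 / 0.1   # variance = 1 / tau
mu_a = mu_b = 0.0                  # assumed means
mu_ab = mu_a + mu_b                # means add
sd_ab = np.sqrt(var_a + var_b)     # variances add, then take the square root
print(mu_ab, sd_ab)                # 0.0, ~10.49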
If you still want to use PyMC3 (maybe your distributions are not Gaussian and you do not know how to compute their sum analytically), you can generate synthetic data from your a and b distributions and then use PyMC3 to estimate the parameters, something along the lines of:
with pm.Model() as model:
    mu = pm.Normal('mu', mu=0, sd=10)
    sd = pm.HalfNormal('sd', 10)
    ab = pm.Normal('ab', mu=mu, sd=sd, observed=a+b)
    trace = pm.sample(1000)
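To make the synthetic-data step concrete, the observed a+b array could be built with NumPy before entering the model block; a rough sketch, assuming zero means, the precisions from the question, and an arbitrary sample size:

import numpy as np

np.random.seed(123)
n = 1000
a = np.random.normal(0.0, 1 / np.sqrt(0.01), size=n)  # tau = 0.01 -> sd = 10
b = np.random.normal(0.0, 1 / np.sqrt(0.1), size=n)   # tau = 0.1  -> sd ~ 3.16
# a + b is then what gets passed as `observed` in the model above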

Related

PyMC3 fitting quadratic line to data

How do you choose the quadratic model parameters? This is what I've got so far:
with pm.Model() as quad_model:
    # I'd like to add sigmas (and mu if that's reasonable) but I'm not sure how to decide how big to go
    a = pm.Normal('a')
    b = pm.Normal('b')
    c = pm.Normal('c')
    sigma = pm.HalfNormal('scatter', sd=sigm_y)
    # I'm not sure I'm using pm.MutableData correctly; I'm following an example, but my data is position, not dates/time
    t = pm.MutableData('t', bp_rp, dims='obs_time')
    mu = a*t**2 + b*t + c
    obs = pm.Normal('mg', mu=mu, sd=sigma, observed=mg, dims='obs_time')
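One possible way to set this up (a sketch only, not from the original thread: the data below is made up, sigma_y stands in for the question's sigm_y, and the prior scales are guesses based on the spread of that fake data) registers the 'obs_time' dimension as a model coordinate and uses sigma= as in PyMC v4+:

import numpy as np
import pymc as pm  # PyMC v4+, where `sigma=` replaces `sd=`

# made-up stand-ins for the question's bp_rp (colour) and mg (magnitude) arrays
bp_rp = np.linspace(0.5, 3.0, 100)
mg = 2.0*bp_rp**2 - 1.0*bp_rp + 3.0 + np.random.normal(0, 0.3, bp_rp.size)
sigma_y = mg.std()  # a data-driven scale for the scatter prior

coords = {"obs_time": np.arange(bp_rp.size)}
with pm.Model(coords=coords) as quad_model:
    # weakly informative priors, scaled generously rather than left at the default N(0, 1)
    a = pm.Normal("a", mu=0, sigma=10)
    b = pm.Normal("b", mu=0, sigma=10)
    c = pm.Normal("c", mu=0, sigma=10)
    sigma = pm.HalfNormal("scatter", sigma=sigma_y)

    # MutableData ties the predictor to the named "obs_time" dimension
    t = pm.MutableData("t", bp_rp, dims="obs_time")
    mu = a*t**2 + b*t + c
    obs = pm.Normal("mg", mu=mu, sigma=sigma, observed=mg, dims="obs_time")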

PyMC3 sample function

I am using a CAR model as in https://docs.pymc.io/notebooks/PyMC3_tips_and_heuristic.html. I am using the same problem and model described in the link, O_i ~ Poisson(exp(β0 + β1·aff + ϕ_i + log(E_i))), in which I want to obtain the values of β0 and β1.
Here, model2 defines the parameters:
with pm.Model() as model2:
    # Vague prior on intercept
    beta0 = pm.Normal('beta0', mu=0.0, tau=1.0e-5)
    # Vague prior on covariate effect
    beta1 = pm.Normal('beta1', mu=0.0, tau=1.0e-5)
    # Random effects (hierarchical) prior
    tau_h = pm.Gamma('tau_h', alpha=3.2761, beta=1.81)
    # Spatial clustering prior
    tau_c = pm.Gamma('tau_c', alpha=1.0, beta=1.0)
    # Regional random effects
    theta = pm.Normal('theta', mu=0.0, tau=tau_h, shape=N)
    mu_phi = CAR2('mu_phi', w=wmat2, a=amat2, tau=tau_c, shape=N)
    # Zero-centre phi
    phi = pm.Deterministic('phi', mu_phi - tt.mean(mu_phi))
    # Mean model
    mu = pm.Deterministic('mu', tt.exp(logE + beta0 + beta1*quil + theta + phi))
    # Likelihood
    Yi = pm.Poisson('Yi', mu=mu, observed=O)
    # Marginal SD of heterogeneity effects
    sd_h = pm.Deterministic('sd_h', tt.std(theta))
    # Marginal SD of clustering (spatial) effects
    sd_c = pm.Deterministic('sd_c', tt.std(phi))
    # Proportion of spatial variance
    alpha = pm.Deterministic('alpha', sd_c/(sd_h + sd_c))

    trace2 = pm.sample(1000, tune=500, cores=4,
                       init='advi',
                       nuts_kwargs={"target_accept": 0.9,
                                    "max_treedepth": 15})
My question is: when using the sample function with 4000 iterations (1000 per core), I obtain a vector of 4000 values for each parameter in the model. What is the best value for a parameter: the last value (the result of the final iteration) or the average value, given that they are produced by a Markov chain?
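As a hedged side note (not part of the original thread): the usual practice is to summarize all retained draws, for example by their posterior mean and credible interval, rather than to keep only the final iteration. A minimal sketch using the trace2 from above:

import numpy as np
import pymc3 as pm

# posterior means, standard deviations and intervals for every parameter
print(pm.summary(trace2))

# equivalently, average over all retained draws for a single parameter
print(np.mean(trace2['beta0']), np.mean(trace2['beta1']))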

Illogical parameters returned by scipy.curve_fit

I'm modelling a ball falling through fluid in Python and fitting the model function to a set of data points using the damping coefficients (a and b) and the density of the fluid, but the fitted value for the fluid density is coming back negative and I have no idea what is wrong in the code. My code is below:
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint
from scipy.optimize import curve_fit

#%% Parameters and constants
m = 0.1           # mass of object in kg
g = 9.81          # acceleration due to gravity in m/s**2
rho = 700         # density of object in kg/m**3
v0 = 0            # velocity at t=0
y0 = 0            # position at t=0
V = m / rho       # volume in cubic meters
r = ((3/4)*(V/np.pi))**(1/3)  # radius of the sphere
asample = 0.0001  # sample value for a
bsample = 0.0001  # sample value for b

#%% Defining integrating function
## function => y'' = g*(1-(rhof/rho)) - ((a/m)y' + (b/m)y'**2)
## y' = v
## v' = g*(1-rhof/rho) - ((a/m)v + (b/m)v**2)
def sinkingball(Y, time, a, b, rhof):
    return [Y[1], (1/m)*(V*g*(rho-rhof) - a*Y[1] - b*(Y[1]**2))]

def balldepth(time, a, b, rhof):
    solutions = odeint(sinkingball, [y0, v0], time, args=(a, b, rhof))
    return solutions[:, 0]

time = np.linspace(0, 15, 151)
# imported some experimental values and named the array data
a, b, rhof = curve_fit(balldepth, time, data, p0=(asample, bsample, 100))[0]
print(a, b, rhof)
Providing the output you actually get would be helpful, and the comment about time not being used by sinkingball() is worth following.
You might find lmfit (https://lmfit.github.io/lmfit-py) useful. This provides a higher-level interface to curve-fitting that allows, among other things, placing bounds on parameters so that they can remain physically sensible. I think your problem would translate from curve_fit to lmfit as:
from lmfit import Model

def balldepth(time, a, b, rhof):
    solutions = odeint(sinkingball, [y0, v0], time, args=(a, b, rhof))
    return solutions[:, 0]

# create a model based on the model function "balldepth"
ballmodel = Model(balldepth)

# create parameters, which will be named using the names of the
# function arguments, and provide initial values
params = ballmodel.make_params(a=0.001, b=0.001, rhof=100)

# you wanted rhof to **not** vary in the fit:
params['rhof'].vary = False

# set min/max values on `a` and `b`:
params['a'].min = 0
params['b'].min = 0

# run the fit
result = ballmodel.fit(data, params, time=time)

# print out full report of results
print(result.fit_report())

# get / print out best-fit parameters:
for parname, param in result.params.items():
    print("%s = %f +/- %f" % (parname, param.value, param.stderr))

is there an equivalent of R's nls in statsmodels?

Does statsmodels support nonlinear regression to an arbitrary equation? (I know that there are some forms that are already built in, e.g. for logistic regression, but I am after something more flexible)
In the solution https://stats.stackexchange.com/a/44249 to a question about non-linear regression,
the code is in R and uses the function nls. There it has the equation's parameters defined with start = list(a1=0, ...). These are of course just some initial guesses and not the final fitted values. But what is different here compared to lm is that the parameters don't need to be from the columns of the input data.
I've been able to use statsmodels.formula.api.ols as an equivalent for R's lm, but when I try to use it with an equation that has parameters (rather than weights for the inputs / combinations of inputs), statsmodels complains about the parameters not being defined. It does not seem to have an argument equivalent to start=, so it isn't obvious how to introduce them.
Is there a different class or function in statsmodels that accepts definition of these initial parameter values?
EDIT:
My current attempt, and also a workaround following the suggestion to use lmfit:
from statsmodels.formula.api import ols
import numpy as np
import pandas as pd

def eqn_poly(x, a, b):
    ''' simple polynomial '''
    return a*x**2.0 + b*x

def eqn_nl(x, a, b):
    ''' fractional equation '''
    return 1.0 / ((a+x)*b)

x = np.arange(0, 3, 0.1)
y1 = eqn_poly(x, 0.1, 0.5)
y2 = eqn_nl(x, 0.1, 0.5)

sigma = 0.05
y1_noise = y1 + sigma * np.random.randn(*y1.shape)
y2_noise = y2 + sigma * np.random.randn(*y2.shape)

df1 = pd.DataFrame(np.vstack([x, y1_noise]).T, columns=['x', 'y'])
df2 = pd.DataFrame(np.vstack([x, y2_noise]).T, columns=['x', 'y'])

res1 = ols("y ~ 1 + x + I(x ** 2.0)", df1).fit()
print(res1.summary())

res3 = ols("y ~ 1 + x + I(x ** 2.0)", df2).fit()

#res2 = ols("y ~ eqn_nl(x, a, b)", df2).fit()
# ^^^ this fails if a, b are not initialised ^^^
# so initialise a, b
a, b = 1.0, 1.0
res2 = ols("y ~ eqn_nl(x, a, b)", df2).fit()
print(res2.summary())
# ===> and now the fitting is bad: it has an intercept of -4.79, and a weight
# on the equation of 15.7.

Giving lmfit the model function, it is able to find the parameters.

import lmfit
mod = lmfit.Model(eqn_nl)
lm_result = mod.fit(y2_noise, x=x, a=1.0, b=1.0)
print(lm_result.fit_report())
# ===> this one works fine, a=0.101, b=0.4977
But trying to put y1, x into ols doesn't seem to work ("PatsyError: model is missing required outcome variables"). I didn't really follow that suggestion.
Consider scipy.optimize.curve_fit as the desired R nls-like function.
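A minimal sketch of that suggestion, reusing eqn_nl, x and y2_noise from the code above (the starting guesses are arbitrary):

import numpy as np
from scipy.optimize import curve_fit

# curve_fit plays the role of R's nls: model function, data, and starting values
popt, pcov = curve_fit(eqn_nl, x, y2_noise, p0=[1.0, 1.0])
a_fit, b_fit = popt
perr = np.sqrt(np.diag(pcov))   # rough one-sigma uncertainties
print(a_fit, b_fit, perr)       # should land near the true 0.1 and 0.5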

Getting the statistics of deterministic variables in PyMC

Say I have a random collection of (X,Y) points:
import pymc as pm
import numpy as np
import matplotlib.pyplot as plt
import scipy
x = np.array(range(0,50))
y = np.random.uniform(low=0.0, high=40.0, size=200)
y = map((lambda a: a[0] + a[1]), zip(x,y))
plt.scatter(x,y)
and that I fit a simple linear regression:
std = 20.
tau=1/(std**2)
alpha = pm.Normal('alpha', mu=0, tau=tau)
beta = pm.Normal('beta', mu=0, tau=tau)
sigma = pm.Uniform('sigma', lower=0, upper=20)
y_est = alpha + beta * x
likelihood = pm.Normal('y', mu=y_est, tau=1/(sigma**2), observed=True, value=y)
model = pm.Model([likelihood, alpha, beta, sigma, y_est])
mcmc = pm.MCMC(model)
mcmc.sample(40000, 15000)
How can I get the distribution or the statistics of y_est[0], y_est[1], y_est[2], ...? (Note that these variables correspond to the y values estimated for each input x value.)
In PyMC 2, if you are interested in the trace of a deterministic, you should wrap the deterministic in a Lambda object (or decorate a function with @deterministic). In your case, this would be:
y_est = pm.Lambda('y_est', lambda a=alpha, b=beta: a + b * x)
You should then be able to call the summary method or plot the node, just like a Stochastic.
BTW, you do not need to instantiate a Model object, as MCMC already does that for you. All you need is:
mcmc = pm.MCMC([likelihood, alpha, beta, sigma, y_est])
or even more concisely:
mcmc = pm.MCMC(vars())
Following @Chris's advice, the following works:
x = pm.Uniform('x', lower=xmin, upper=xmax)
alpha = pm.Normal('alpha', mu=0, tau=tau)
beta = pm.Normal('beta', mu=0, tau=tau)
sigma = pm.Uniform('sigma', lower=0, upper=20)
# The deterministic:
y_gen = pm.Lambda('y_gen', lambda a=alpha, x=x, b=beta: a + b * x)
And then draw samples from it as follows:
mcmc = pm.MCMC([x, y_gen])
mcmc.sample(n_total_samples, n_burn_in)
x_trace = mcmc.trace('x')[:]
y_trace = mcmc.trace('y_gen')[:]
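From there, per-node statistics are just NumPy operations on the extracted arrays; a small sketch, assuming the traces pulled out above:

import numpy as np

y_samples = np.asarray(y_trace)                       # one row per retained MCMC draw
print(y_samples.mean(axis=0))                         # posterior mean of y_gen
print(np.percentile(y_samples, [2.5, 97.5], axis=0))  # 95% credible interval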
