Getting started with PYMC for linear regression

Getting started with PYMC for linear regression - python

Thought I'd start off following this example:
http://www.databozo.com/2014/01/17/Exploring_PyMC3.html
But when I follow the example precisely using pymc 2.3 I get an exit and told that the API has changed
UserWarning: The MCMC() syntax is deprecated. Please pass in nodes explicitly via M = MCMC(input).
'The MCMC() syntax is deprecated. Please pass in nodes explicitly via M = MCMC(input).') but I don't have a good idea how to change the example to provide exactly what to the model function and how to do with the 'with' clause?
The code in question is:
%pylab inline
import scipy
import numpy as np
x = np.array(range(0,50))
y = np.random.uniform(low=0.0, high=40.0, size=200)
y = map((lambda a: a[0] + a[1]), zip(x,y))
import matplotlib.pyplot as plt
plt.scatter(x,y)
Above sample data generator works fine
import pymc as pm
import numpy as np
trace = None
with pm.Model() as model: <<<<<<indicated as the error line
alpha = pm.Normal('alpha', mu=0, sd=20)
beta = pm.Normal('beta', mu=0, sd=20)
sigma = pm.Uniform('sigma', lower=0, upper=20)
y_est = alpha + beta * x
likelihood = pm.Normal('y', mu=y_est, sd=sigma, observed=y)
start = pm.find_MAP()
step = pm.NUTS(state=start)
trace = pm.sample(2000, step, start=start, progressbar=False)
pm.traceplot(trace);

Package author #fonnesbeck informed me that I needed the development version 3 from Github and not the pypi version 2.3. Installed that, via github and I'm good to go. Open source is great!

Related

arma_generate_sample cannot return the correct value for modeling, so I would like to add sm.tsa.ARIMA

All the other online information was off the mark, such as lying or having a different version.
I would appreciate it if you could answer how to solve it.
I just want to model an AR process such as ''yt = 0.33yt-1 + et.''
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.arima_process import arma_generate_sample
def make_arma(nobs=250):
# arma(1,1) => xt = ρxt-1 + et - θet-1
arparams = np.array([0.444, 0.333])
maparams = np.array([0])
arparams = np.r_[1, -arparams]
maparam = np.r_[1, maparams]
np.random.seed(2014)
y = arma_generate_sample(arparams, maparams, nobs)
### If you do not enter the code below this, it seems that all are 0.
model = sm.tsa.ARIMA(y, (1,0), trend='n').fit(disp=0)
data = model.params
return data
data = make_arma(nobs=250)
data[-30:]
TypeError: __new__() got an unexpected keyword argument 'trend'
statsmodel==0.12.2
https://github.com/statsmodels/statsmodels/releases
This is a bugfix release from the 0.12.x branch. Users are encouraged to upgrade.
Notable changes include a fix for a bug that could lead to incorrect results in predictions using the new ARIMA model (when d> 0 and trend ='t'), and a bug in the autocorrelation LM test. It will be.
ARIMA() error TypeError: __new__() got an unexpected keyword argument 'start'

The second argument to ARIMA is not the order, it is exog. So you need to set the order which is easiest if you use a keyword. Also, order is a 3-element tuple of the form (p,d,q), so you need (1,0,0). Finally, disp is not part of fit. Finally, you need set maparams to have a 1 in the first position (typo in the original post where maparam was being updated, not maparams.
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.arima_process import arma_generate_sample
def make_arma(nobs=250):
# Simulate a basic AR model
# arma(1,1) => xt = ρxt-1 + et - θet-1
arparams = np.array([0.444, 0.333])
maparams = np.array([0])
arparams = np.r_[1, -arparams]
maparams = np.r_[1, maparams]
np.random.seed(2014)
y = arma_generate_sample(arparams, maparams, nobs)
### If you do not enter the code below this
### it seems that all are 0.
model = sm.tsa.ARIMA(y, order=(1,0,0), trend='n').fit()
data = model.params
return data
data = make_arma(nobs=250)
data[-30:]

Ridge Regression: Scikit-learn vs. direct calculation does not match for alpha > 0

In Ridge Regression, we are solving Ax=b with L2 Regularization. The direct calculation is given by:
x = (ATA + alpha * I)-1ATb
I have looked at the scikit-learn code and they do implement the same calculation. But, I can't seem to get the same results for alpha > 0
The minimal code to reproduce this.
import numpy as np
A = np.asmatrix(np.c_[np.ones((10,1)),np.random.rand(10,3)])
b = np.asmatrix(np.random.rand(10,1))
I = np.identity(A.shape[1])
alpha = 1
x = np.linalg.inv(A.T*A + alpha * I)*A.T*b
print(x.T)
>>> [[ 0.37371021 0.19558433 0.06065241 0.17030177]]
from sklearn.linear_model import Ridge
model = Ridge(alpha = alpha).fit(A[:,1:],b)
print(np.c_[model.intercept_, model.coef_])
>>> [[ 0.61241566 0.02727579 -0.06363385 0.05303027]]
Any suggestions on what I can do to resolve this discrepancy?

This modification seems to yield the same result for the direct version and the numpy version:
import numpy as np
A = np.asmatrix(np.random.rand(10,3))
b = np.asmatrix(np.random.rand(10,1))
I = np.identity(A.shape[1])
alpha = 1
x = np.linalg.inv(A.T*A + alpha * I)*A.T*b
print (x.T)
from sklearn.linear_model import Ridge
model = Ridge(alpha = alpha, tol=0.1, fit_intercept=False).fit(A ,b)
print model.coef_
print model.intercept_
It seems the main reason for the difference is the class Ridge has the parameter fit_intercept=True (by inheritance from class _BaseRidge) (source)
This is applying a data centering procedure before passing the matrices to the _solve_cholesky function.
Here's the line in ridge.py that does it
X, y, X_mean, y_mean, X_std = self._center_data(
X, y, self.fit_intercept, self.normalize, self.copy_X,
sample_weight=sample_weight)
Also, it seems you were trying to implicitly account for the intercept by adding the column of 1's. As you see, this is not necessary if you specify fit_intercept=False
Appendix: Does the Ridge class actually implement the direct formula?
It depends on the choice of the solverparameter.
Effectively, if you do not specify the solverparameter in Ridge, it takes by default solver='auto' (which internally resorts to solver='cholesky'). This should be equivalent to the direct computation.
Rigorously, _solve_cholesky uses numpy.linalg.solve instead of numpy.inv. But it can be easily checked that
np.linalg.solve(A.T*A + alpha * I, A.T*b)
yields the same as
np.linalg.inv(A.T*A + alpha * I)*A.T*b

Mathieu Characteristics Cross When Plotted

I need to plot the mathieu characteristic parameters for various q. The plot should show 'flute' shapes going from wide on the left, to very narrow on the right. The code below does this, but it also introduces a handful of inter-band jumps (obvious from the plotted figure). How can I fix this?
Thank you!
AM
import numpy as np
import scipy as sp
import scipy.special as spfun
from matplotlib import pyplot as plt
uplim =120#E_rec
Npts =1000
Nstates =8
q = np.linspace(0, uplim/4.0, Npts)
EA = np.zeros([Npts,Nstates])
EB = np.zeros([Npts,Nstates])
U = 4*q
print np.shape(EA) #plt.fill_between(U, EA[:,i], EB[:,i]) #plt.plot(U,Ea,U,Eb)
for i in range(Nstates):
a = spfun.mathieu_a(i,q)
b = spfun.mathieu_b(i+1,q)
EA[:,i] = a + 2*q
EB[:,i] = b + 2*q
plt.fill_between(U, EA[:,i], EB[:,i]) #plt.plot(U,Ea,U,Eb)
print np.shape(EA) #plt.fill_between(U, EA[:,i], EB[:,i]) #plt.plot(U,Ea,U,Eb)
plt.show()
EDIT As DSM and pv have pointed out, this is a scipy bug. The glitches get worse as you go out further. What I ended up doing was exporting tables of values that I wanted from Mathematica, and importing them into python and interpolating. Not great, but works.

I tried computing this with the latest release of the NAG Library for Python which included a new Mathieu function routine.
I pushed a little harder -- more states and a higher value of uplim.
%matplotlib inline
import numpy as np
import scipy as sp
import scipy.special as spfun
from naginterfaces.library import specfun
from matplotlib import pyplot as plt
uplim =150#E_rec
Npts = 4000
Nstates = 10
q = np.linspace(0, uplim/4.0, Npts)
EA = np.zeros([Npts,Nstates])
EB = np.zeros([Npts,Nstates])
U = 4*q
plt.figure(figsize=(15,8))
plt.subplot(1,2,1)
plt.title('Using SciPy')
for i in range(Nstates):
a = spfun.mathieu_a(i,q)
b = spfun.mathieu_b(i+1,q)
EA[:,i] = a + 2*q
EB[:,i] = b + 2*q
plt.fill_between(U, EA[:,i], EB[:,i]) #plt.plot(U,Ea,U,Eb)
plt.subplot(1,2,2)
plt.title('Using NAG')
for i in range(Nstates):
a = [specfun.mathieu_ang_periodic_real(ordval=i, q=qi, parity=0, mode=3)[2] for qi in q]
b = [specfun.mathieu_ang_periodic_real(ordval=i+1, q=qi, parity=1, mode=3)[2] for qi in q]
EA[:,i] = a + 2*q
EB[:,i] = b + 2*q
plt.fill_between(U, EA[:,i], EB[:,i])
plt.show()
This uses Mark 27 of the NAG Library and version 1.2.1 of ScipPy

Supplying test values in pymc 3

I am exploring the use of bounded distributions in pymc. I am trying to bound a Gamma prior distribution between two values. The model specification seems to fail due to the absence of test values. How may I pass a testval argument such that I am able to specify these sorts of models?
For completeness I have included the error, as well as a minimal example below. Thank you!
AttributeError: <pymc.quickclass.Gamma object at 0x110a62890> has no default value to use, checked for: ['median', 'mean', 'mode'] pass testval argument or provide one of these.
import pymc as pm
import numpy as np
ndims = 2
nobs = 20
zdata = np.random.normal(loc=0, scale=0.75, size=(ndims, nobs))
BoundedGamma = pm.Bound(pm.Gamma, 0.5, 2)
with pm.Model() as model:
xbound = BoundedGamma('xbound', alpha=1, beta=2)
z = pm.Normal('z', mu=0, tau=xbound, shape=(ndims, 1), observed=zdata)
edit: for reference purposes, here is a simple working model utilizing a bounded gamma prior distribution:
import pymc as pm
import numpy as np
ndims = 2
nobs = 20
zdata = np.random.normal(loc=0, scale=0.75, size=(ndims, nobs))
BoundedGamma = pm.Bound(pm.Gamma, 0.5, 2)
with pm.Model() as model:
xbound = BoundedGamma('xbound', alpha=1, beta=2, testval=2)
z = pm.Normal('z', mu=0, tau=xbound, shape=(ndims, 1), observed=zdata)
with model:
start = pm.find_MAP()
with model:
step = pm.NUTS()
with model:
trace = pm.sample(3000, step, start)
pm.traceplot(trace);

Use that line:
xbound = BoundedGamma('xbound', alpha=1, beta=2, testval=1)

Fitting lognormal distribution using Scipy vs Matlab

I am trying to fit a lognormal distribution using Scipy. I've already done it using Matlab before but because of the need to extend the application beyond statistical analysis, I am in the process of trying to reproduce the fitted values in Scipy.
Below is the Matlab code I used to fit my data:
% Read input data (one value per line)
x = [];
fid = fopen(file_path, 'r'); % reading is default action for fopen
disp('Reading network degree data...');
if fid == -1
disp('[ERROR] Unable to open data file.')
else
while ~feof(fid)
[x] = [x fscanf(fid, '%f', [1])];
end
c = fclose(fid);
if c == 0
disp('File closed successfully.');
else
disp('[ERROR] There was a problem with closing the file.');
end
end
[f,xx] = ecdf(x);
y = 1-f;
parmhat = lognfit(x); % MLE estimate
mu = parmhat(1);
sigma = parmhat(2);
And here's the fitted plot:
Now here's my Python code with the aim of achieving the same:
import math
from scipy import stats
from statsmodels.distributions.empirical_distribution import ECDF
# The same input is read as a list in Python
ecdf_func = ECDF(degrees)
x = ecdf_func.x
ccdf = 1-ecdf_func.y
# Fit data
shape, loc, scale = stats.lognorm.fit(degrees, floc=0)
# Parameters
sigma = shape # standard deviation
mu = math.log(scale) # meanlog of the distribution
fit_ccdf = stats.lognorm.sf(x, [sigma], floc=1, scale=scale)
Here's the fit using the Python code.
As you can see, both sets of code are capable of producing good fits, at least visually speaking.
Problem is that there is a huge difference in the estimated parameters mu and sigma.
From Matlab: mu = 1.62 sigma = 1.29.
From Python: mu = 2.78 sigma = 1.74.
Why is there such a difference?
Note: I have double checked that both sets of data fitted are exactly the same. Same number of points, same distribution.
Your help is much appreciated! Thanks in advance.
Other info:
import scipy
import numpy
import statsmodels
scipy.__version__
'0.9.0'
numpy.__version__
'1.6.1'
statsmodels.__version__
'0.5.0.dev-1bbd4ca'
Version of Matlab is R2011b.
Edition:
As demonstrated in the answer below, the fault lies with Scipy 0.9. I am able to reproduce the mu and sigma results from Matlab using Scipy 11.0.
An easy way to update your Scipy is:
pip install --upgrade Scipy
If you don't have pip (you should!):
sudo apt-get install pip

There is a bug in the fit method in scipy 0.9.0 that has been fixed in later versions of scipy.
The output of the script below should be:
Explicit formula: mu = 4.99203450, sig = 0.81691086
Fit log(x) to norm: mu = 4.99203450, sig = 0.81691086
Fit x to lognorm: mu = 4.99203468, sig = 0.81691081
but with scipy 0.9.0, it is
Explicit formula: mu = 4.99203450, sig = 0.81691086
Fit log(x) to norm: mu = 4.99203450, sig = 0.81691086
Fit x to lognorm: mu = 4.23197270, sig = 1.11581240
The following test script shows three ways to get the same results:
import numpy as np
from scipy import stats
def lognfit(x, ddof=0):
x = np.asarray(x)
logx = np.log(x)
mu = logx.mean()
sig = logx.std(ddof=ddof)
return mu, sig
# A simple data set for easy reproducibility
x = np.array([50., 50, 100, 200, 200, 300, 500])
# Explicit formula
my_mu, my_sig = lognfit(x)
# Fit a normal distribution to log(x)
norm_mu, norm_sig = stats.norm.fit(np.log(x))
# Fit the lognormal distribution
lognorm_sig, _, lognorm_expmu = stats.lognorm.fit(x, floc=0)
print "Explicit formula: mu = %10.8f, sig = %10.8f" % (my_mu, my_sig)
print "Fit log(x) to norm: mu = %10.8f, sig = %10.8f" % (norm_mu, norm_sig)
print "Fit x to lognorm: mu = %10.8f, sig = %10.8f" % (np.log(lognorm_expmu), lognorm_sig)
With the option ddof=1 in the std. dev. calculation to use the unbiased variance estimation:
In [104]: x
Out[104]: array([ 50., 50., 100., 200., 200., 300., 500.])
In [105]: lognfit(x, ddof=1)
Out[105]: (4.9920345004312647, 0.88236457185021866)
There is a note in matlab's lognfit documentation that says when censoring is not used, lognfit computes sigma using the square root of the unbiased estimator of the variance. This corresponds to using ddof=1 in the above code.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Getting started with PYMC for linear regression - python

Package author #fonnesbeck informed me that I needed the development version 3 from Github and not the pypi version 2.3. Installed that, via github and I'm good to go. Open source is great!

Related

arma_generate_sample cannot return the correct value for modeling, so I would like to add sm.tsa.ARIMA

Ridge Regression: Scikit-learn vs. direct calculation does not match for alpha > 0

Mathieu Characteristics Cross When Plotted

Supplying test values in pymc 3

Fitting lognormal distribution using Scipy vs Matlab

Categories

Resources