Overview
I am trying to implement autoregressive moving average (ARMA) parameter optimization using maximum likelihood estimation (MLE) via the Kalman Filter. I know that I can fit ARMA models using the statsmodels package in Python, but I want to write my own implementation of the ARMA likelihood and subsequent optimization as a prototype for a future C/C++ implementation. Also, when I look through the statsmodels documentation, I find that the statsmodels Kalman Filter Log Likelihood implements a slightly different expression than I have found in the literature.
Algorithms
In order to calculate the ARMA log likelihood, I am following the 1980 paper by Pearlman:
Pearlman, J. G. "An algorithm for the exact likelihood of a high-order autoregressive-moving average process." Biometrika 67.1 (1980): 232-233. Available from JSTOR.
In order to calculate the initial P matrix, I am following an algorithm in
Gardner, G., Andrew C. Harvey, and Garry DA Phillips. "Algorithm AS 154: An algorithm for exact maximum likelihood estimation of autoregressive-moving average models by means of Kalman filtering." Journal of the Royal Statistical Society. Series C (Applied Statistics) 29.3 (1980): 311-322. Available from JSTOR.
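As a minimal sketch of the initial-P step (this is the brute-force linear solve, not the AS 154 recursion itself), the stationary state covariance can be obtained by vectorizing P = T P T' + R R'. The function name stationary_covariance and the example coefficients are assumptions chosen for illustration; the state-space layout matches the one used in the code below.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov


def stationary_covariance(T, R):
    """Solve P = T P T' + R R' by vectorizing with a Kronecker product."""
    T = np.asarray(T, dtype=float)
    R = np.asarray(R, dtype=float).reshape(-1)
    dim = T.shape[0]
    # Column-major vec: vec(T P T') = (T kron T) vec(P)
    S = np.identity(dim * dim) - np.kron(T, T)
    V = np.outer(R, R).ravel(order="F")
    return np.linalg.solve(S, V).reshape(dim, dim, order="F")


# Hypothetical ARMA(2, 1): AR coefficients in the first column,
# ones on the superdiagonal, MA vector g = [1, theta_1].
phi = np.array([0.5, -0.2])
theta = np.array([0.4])
F = np.array([[phi[0], 1.0], [phi[1], 0.0]])
g = np.array([1.0, theta[0]])

P = stationary_covariance(F, g)
# Cross-check against SciPy's discrete Lyapunov solver (X = A X A' + Q)
P_ref = solve_discrete_lyapunov(F, np.outer(g, g))
```

For stationary AR coefficients the top-left element P[0, 0] should come out positive, which gives a cheap sanity check before running the likelihood recursion.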
For the initial parameter values, I am currently using the internal method that statsmodels ARMA models use to compute the initial guess for ARMA parameters. In the future I plan to move to my own implementation, but I am using _fit_start_params_hr while I debug my MLE.
For optimizing the MLE, I am simply using the L-BFGS solver in Scipy.
Code
import numpy as np
import statsmodels.api as sm
import statsmodels.tsa.arima_model
import scipy.optimize

class ARMA(object):
    def __init__(self, endo, nar, nma):
        np.random.seed(0)
        # endogenous variables
        self.endo = endo
        # Number of AR terms
        self.nar = nar
        # Number of MA terms
        self.nma = nma
        # "Dimension" of the ARMA fit
        self.dim = max(nar, nma+1)
        # Current ARMA parameters
        self.params = np.zeros(self.nar+self.nma, dtype='float')

    def __g(self, ma_params):
        '''
        Build MA parameter vector
        '''
        g = np.zeros(self.dim, dtype='float')
        g[0] = 1.0
        if self.nma > 0:
            g[1:self.nma+1] = ma_params
        return g

    def __F(self, ar_params):
        '''
        Build AR parameter matrix
        '''
        F = np.zeros((self.dim, self.dim), dtype='float')
        F[:self.nar, 0] = ar_params
        for i in xrange(1, self.dim):
            F[i-1, i] = 1.0
        return F

    def __initial_P(self, R, T):
        '''
        Solve for initial P matrix
        Solves P = TPT' + RR'
        '''
        v = np.zeros(self.dim*self.dim, dtype='float')
        for i in xrange(self.dim):
            for j in xrange(self.dim):
                v[i+j*self.dim] = R[i]*R[j]
        R = np.array([R])
        S = np.identity(self.dim**2, dtype='float')-np.kron(T, T)
        V = np.outer(R, R).ravel('F')
        Pmat = np.linalg.solve(S,V).reshape(self.dim, self.dim, order='F')
        return Pmat

    def __likelihood(self, params):
        '''
        Compute log likelihood for a parameter vector
        Implements the Pearlman 1980 algorithm
        '''
        # these checks are pilfered from statsmodels
        if self.nar > 0 and not np.all(np.abs(np.roots(np.r_[1, -params[:self.nar]]))<1):
            print 'AR coefficients are not stationary'
        if self.nma > 0 and not np.all(np.abs(np.roots(np.r_[1, -params[-self.nma:]]))<1):
            print 'MA coefficients are not stationary'
        ar_params = params[:self.nar]
        ma_params = params[-self.nma:]
        g = self.__g(ma_params)
        F = self.__F(ar_params)
        w = self.endo
        P = self.__initial_P(g, F)
        n = len(w)
        z = np.zeros(self.dim, dtype='float')
        R = np.zeros(n, dtype='float')
        a = np.zeros(n, dtype='float')
        K = np.dot(F, P[:, 0])
        L = K.copy()
        R[0] = P[0, 0]
        for i in xrange(1, n):
            a[i-1] = w[i-1] - z[0]
            z = np.dot(F, z) + K*(a[i-1]/R[i-1])
            Kupdate = -(L[0]/R[i-1])*np.dot(F, L)
            Rupdate = -L[0]*L[0]/R[i-1]
            P -= np.outer(L, L)/R[i-1]
            L = np.dot(F, L) - (L[0]/R[i-1])*K
            K += Kupdate
            R[i] = R[i-1] + Rupdate
            if np.abs(R[i] - 1.0) < 1e-9:
                R[i:] = 1.0
                break
        for j in xrange(i, n):
            a[j] = w[j] - z[0]
            z = np.dot(F, z) + K*(a[i-1]/R[i-1])
        likelihood = 0.0
        for i in xrange(n):
            likelihood += np.log(R[i])
        likelihood *= -0.5
        ssum = 0.0
        for i in xrange(n):
            ssum += a[i]*a[i]/R[i]
        likelihood += -0.5*n*np.log(ssum)
        return likelihood

    def fit(self):
        '''
        Fit the ARMA model by minimizing the log-likelihood
        Uses scipy.optimize.minimize
        '''
        sm_arma = statsmodels.tsa.arima_model.ARMA(endog=self.endo, order=(self.nar, self.nma, 0))
        params = statsmodels.tsa.arima_model.ARMA._fit_start_params_hr(sm_arma, order=(self.nar, self.nma, 0))
        opt = scipy.optimize.minimize(fun=self.__likelihood, x0=params, method='L-BFGS-B')
        print opt

# Test the code on statsmodels sunspots data
nar = 2
nma = 1
endo = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY'].tolist()
arma = ARMA(endo=endo, nar=nar, nma=nma)
arma.fit()
Issues
I find that the above example does not converge. In the third call of ARMA._likelihood, the code throws the following warning:
RuntimeWarning: invalid value encountered in log
likelihood += np.log(R[i])
which happens because ARMA._initial_P solves for a matrix where P[0][0] < 0.0. At this point, the current estimates of the AR parameters become non-stationary. All subsequent iterations then warn that the AR and MA parameters are non-stationary.
Questions
Is this implementation correct? I have checked that the initial P matrix satisfies the equation it is supposed to satisfy. For the likelihood calculation, I see several behaviors that I expect from the Pearlman paper:
R tends to one. For a pure AR process with p AR parameters, it achieves this limit in p steps. Basically, the break statement in the _likelihood function comes into effect after p iterations of the Pearlman algorithm steps.
L tends to the zero vector.
K tends to F.g. I check this by looking at abs(K - F.g) while calculating the likelihood.
After the warning about the negative value in the logarithm, the above limits are no longer obeyed.
I have also tried implementing a transformation of the ARMA parameters to prevent overflow/underflow, as recommended in
Jones, Richard H. "Maximum likelihood fitting of ARMA models to time series with missing observations." Technometrics 22.3 (1980): 389-395. Available from JSTOR.
This transformation seemed to have no effect on the errors I observed.
If the implementation is correct, then how do I handle the negative R values? The issue seems to arise when scipy.optimize returns a parameter vector that corresponds to a P matrix for which the top diagonal element is negative. Is the optimization routine supposed to be bounded to prevent negative R values? I have also tried using complex logarithms for negative values as well as changing all numpy dtype parameters to 'complex'. For example:
def complex_log(val):
    '''
    Complex logarithm for negative values
    Returns log(val) + I*pi
    '''
    if val < 0.0:
        return complex(np.log(np.abs(val)), np.pi)
    return np.log(val)
However, scipy.optimize cannot handle complex-valued functions, so this supposed fix has not worked so far. Any recommendations for preventing or handling these behaviors?
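One possible way to keep the optimizer away from such parameter vectors, sketched here under assumptions rather than as a confirmed fix, is to wrap the objective so that non-stationary or non-invertible parameters return a large finite value instead of reaching the logarithm. The names roots_inside_unit_circle and guarded, the penalty value, and the toy objective are all hypothetical. Note that scipy.optimize.minimize minimizes, so the wrapped function is assumed to be the negative log-likelihood.

```python
import numpy as np
from scipy.optimize import minimize


def roots_inside_unit_circle(coeffs):
    """Same root check as in the likelihood above: all roots of
    z^p - c_1 z^(p-1) - ... - c_p must lie strictly inside the unit circle."""
    coeffs = np.asarray(coeffs, dtype=float)
    if coeffs.size == 0:
        return True
    return bool(np.all(np.abs(np.roots(np.r_[1.0, -coeffs])) < 1.0))


def guarded(neg_loglike, nar, nma, penalty=1e10):
    """Wrap a negative log-likelihood so invalid parameters get a large finite value."""
    def wrapped(params):
        params = np.asarray(params, dtype=float)
        ar, ma = params[:nar], params[nar:nar + nma]
        if not (roots_inside_unit_circle(ar) and roots_inside_unit_circle(ma)):
            return penalty
        value = neg_loglike(params)
        return value if np.isfinite(value) else penalty
    return wrapped


# Toy stand-in for the real objective (assumption, for demonstration only)
toy = guarded(lambda p: float(np.sum((p - 0.3) ** 2)), nar=1, nma=1)
result = minimize(toy, x0=np.array([0.0, 0.0]), method="L-BFGS-B")
```

A flat penalty gives the finite-difference gradient nothing to work with outside the valid region, so a smooth reparameterization such as the one in Jones (1980) may still be the better long-term choice.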
Thanks for reading this far. Any help is much appreciated.
I am currently trying to write some python code to solve an arbitrary system of first order ODEs, using a general explicit Runge-Kutta method defined by the values alpha, gamma (both vectors of dimension m) and beta (lower triangular matrix of dimension m x m) of the Butcher table which are passed in by the user. My code appears to work for single ODEs, having tested it on a few different examples, but I'm struggling to generalise my code to vector valued ODEs (i.e. systems).
In particular, I try to solve a Van der Pol oscillator ODE (reduced to a first order system) using Heun's method defined by the Butcher Tableau values given in my code, but I receive the errors
"RuntimeWarning: overflow encountered in double_scalars f = lambda t,u: np.array(... etc)" and
"RuntimeWarning: invalid value encountered in add kvec[i] = f(t+alpha[i]*h,y+h*sum)"
followed by my solution vector that is clearly blowing up. Note that the commented out code below is one of the examples of single ODEs that I tried and is solved correctly. Could anyone please help? Here is my code:
import numpy as np
def rk(t,y,h,f,alpha,beta,gamma):
    '''Runge-Kutta iteration'''
    return y + h*phi(t,y,h,f,alpha,beta,gamma)

def phi(t,y,h,f,alpha,beta,gamma):
    '''Phi function for the Runge-Kutta iteration'''
    m = len(alpha)
    count = np.zeros(len(f(t,y)))
    kvec = k(t,y,h,f,alpha,beta,gamma)
    for i in range(1,m+1):
        count = count + gamma[i-1]*kvec[i-1]
    return count

def k(t,y,h,f,alpha,beta,gamma):
    '''returning a vector containing each step k_{i} in the m step Runge-Kutta method'''
    m = len(alpha)
    kvec = np.zeros((m,len(f(t,y))))
    kvec[0] = f(t,y)
    for i in range(1,m):
        sum = np.zeros(len(f(t,y)))
        for l in range(1,i+1):
            sum = sum + beta[i][l-1]*kvec[l-1]
        kvec[i] = f(t+alpha[i]*h,y+h*sum)
    return kvec

def timeLoop(y0,N,f,alpha,beta,gamma,h,rk):
    '''function that loops through time using the RK method'''
    t = np.zeros([N+1])
    y = np.zeros([N+1,len(y0)])
    y[0] = y0
    t[0] = 0
    for i in range(1,N+1):
        y[i] = rk(t[i-1],y[i-1], h, f,alpha,beta,gamma)
        t[i] = t[i-1]+h
    return t,y
#################################################################
'''f = lambda t,y: (c-y)**2
Y = lambda t: np.array([(1+t*c*(c-1))/(1+t*(c-1))])
h0 = 1
c = 1.5
T = 10
alpha = np.array([0,1])
gamma = np.array([0.5,0.5])
beta = np.array([[0,0],[1,0]])
eff_rk = compute(h0,Y(0),T,f,alpha,beta,gamma,rk, Y,11)'''
#constants
mu = 100
T = 1000
h = 0.01
N = int(T/h)
#initial conditions
y0 = 0.02
d0 = 0
init = np.array([y0,d0])
#Butcher Tableau for Heun's method
alpha = np.array([0,1])
gamma = np.array([0.5,0.5])
beta = np.array([[0,0],[1,0]])
#rhs of the ode system
f = lambda t,u: np.array([u[1],mu*(1-u[0]**2)*u[1]-u[0]])
#solving the system
time, sol = timeLoop(init,N,f,alpha,beta,gamma,h,rk)
print(sol)
Your step size is not small enough. The Van der Pol oscillator with mu=100 is a fast-slow system with very sharp turns where the modes switch, so it is rather stiff. Explicit methods therefore need small step sizes; the smallest sensible step size is 1e-5 to 1e-6. You already get a solution on the limit cycle for h=0.001, with resulting velocities up to 150.
You can reduce some of that stiffness by using a different velocity/impulse variable. In the equation
x'' - mu*(1-x^2)*x' + x = 0
you can combine the first two terms into a derivative,
mu*v = x' - mu*(1-x^2/3)*x
so that
x' = mu*(v+(1-x^2/3)*x)
v' = -x/mu
The second equation is now uniformly slow close to the limit cycle, while the first has long relatively straight jumps when v leaves the cubic v=x^3/3-x.
This integrates nicely with the original h=0.01, keeping the solution inside the box [-3,3]x[-2,2], even if it shows some strange oscillations that are not present for smaller step sizes and the exact solution.
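A minimal sketch of the transformed system integrated with Heun's method follows. The function names lienard_rhs and heun are illustrative, and the short run with h=0.001 is an assumption chosen to keep the demonstration quick and well inside the stability region; it is not the h=0.01 run described above.

```python
import numpy as np


def lienard_rhs(t, u, mu):
    """Transformed Van der Pol system: x' = mu*(v + (1 - x^2/3)*x), v' = -x/mu."""
    x, v = u
    return np.array([mu * (v + (1.0 - x * x / 3.0) * x), -x / mu])


def heun(f, u0, h, n_steps, mu):
    """Heun's method (explicit trapezoidal rule) with a fixed step h."""
    u = np.empty((n_steps + 1, len(u0)))
    u[0] = u0
    t = 0.0
    for i in range(n_steps):
        k1 = f(t, u[i], mu)
        k2 = f(t + h, u[i] + h * k1, mu)
        u[i + 1] = u[i] + 0.5 * h * (k1 + k2)
        t += h
    return u


mu = 100.0
x0, dx0 = 0.02, 0.0
# Initial v follows from mu*v = x' - mu*(1 - x^2/3)*x
v0 = dx0 / mu - (1.0 - x0 ** 2 / 3.0) * x0
sol = heun(lienard_rhs, np.array([x0, v0]), h=0.001, n_steps=20000, mu=mu)
```

Changing the h argument to 0.01 and increasing n_steps reproduces the setting discussed above; the original velocity can be recovered afterwards as x' = mu*(v + (1 - x^2/3)*x).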
My objective is to perform an Inverse Laplace Transform on some decay data (NMR T2 decay via CPMG). For that, we were provided with the CONTIN algorithm. This algorithm was adapted to Matlab by Iari-Gabriel Marino, and it works very well. I want to adapt this code into Python. The core of the problem is with scipy.optimize.fmin, which is not minimizing the mean square deviation (MSD) in any way similar to Matlab's fminsearch. The latter results in a good minimization, while the former doesn't.
I have gone through line by line of my adapted code in Python, and the original Matlab. I checked every matrix and every output. I used this to identify that the critical point is in fmin. I also tried scipy.optimize.minimize and other minimization algorithms, but none gave even remotely satisfactory results.
I have made two MWE, for Python and Matlab, to make it reproducible to all. The example data were obtained from the documentation of the matlab function. Apologies if this is long code, but I don't really know how to shorten it without sacrificing readability and clarity. I tried to have the lines match as closely as possible. I am using Python 3.7.3, scipy v1.3.0, numpy 1.16.2, Matlab R2018b, on Windows 8.1. It's a relatively recent Anaconda install (<2 months).
My code:
import numpy as np
from scipy.optimize import fmin
import matplotlib.pyplot as plt
def msd(g, y, A, alpha, R, w, constraints):
    """ msd: mean square deviation. This is the function to be minimized by fmin"""
    if 'zero_at_extremes' in constraints:
        g[0] = 0
        g[-1] = 0
    if 'g>0' in constraints:
        g = np.abs(g)
    r = np.diff(g, axis=0, n=2)
    yfit = A @ g
    # Sum of weighted square residuals
    VAR = np.sum(w * (y - yfit) ** 2)
    # Regularizor
    REG = alpha ** 2 * np.sum((r - R @ g) ** 2)
    # output to be minimized
    return VAR + REG
# Objective: match this distribution
g0 = np.array([0, 0, 10.1625, 25.1974, 21.8711, 1.6377, 7.3895, 8.736, 1.4256, 0, 0]).reshape((-1, 1))
s0 = np.logspace(-3, 6, len(g0)).reshape((-1, 1))
t = np.linspace(0.01, 500, 100).reshape((-1, 1))
sM, tM = np.meshgrid(s0, t)
A = np.exp(-tM / sM)
np.random.seed(1)
# Creates data from the initial distribution with some random noise.
data = (A @ g0) + 0.07 * np.random.rand(t.size).reshape((-1, 1))
# Parameters and function start
alpha = 1E-2 # regularization parameter
s = np.logspace(-3, 6, 20).reshape((-1, 1)) # x of the ILT
g0 = np.ones(s.size).reshape((-1, 1)) # guess of y of ILT
y = data # noisy data
options = {'maxiter':1e8, 'maxfun':1e8} # for the fmin function
constraints=['g>0', 'zero_at_extremes'] # constraints for the MSD function
R=np.zeros((len(g0) - 2, len(g0)), order='F') # Regularizor
w=np.ones(y.reshape(-1, 1).size).reshape((-1, 1)) # Weights
sM, tM = np.meshgrid(s, t, indexing='xy')
A = np.exp(-tM/sM)
g0 = g0 * y.sum() / (A @ g0).sum() # Makes a "better guess" for the distribution, according to algorithm
print('msd of input data:\n', msd(g0, y, A, alpha, R, w, constraints))
for i in range(5): # Just for testing. If this is extremely high, ~1000, it's still bad.
    g = fmin(func=msd,
             x0 = g0,
             args=(y, A, alpha, R, w, constraints),
             **options,
             disp=True)[:, np.newaxis]
    msdfit = msd(g, y, A, alpha, R, w, constraints)
    if 'zero_at_extremes' in constraints:
        g[0] = 0
        g[-1] = 0
    if 'g>0' in constraints:
        g = np.abs(g)
    g0 = g
print('New guess', g)
print('Final msd of g', msdfit)
# Visualize the fit
plt.plot(s, g, label='Initial approximation')
plt.plot(np.logspace(-3, 6, 11), np.array([0, 0, 10.1625, 25.1974, 21.8711, 1.6377, 7.3895, 8.736, 1.4256, 0, 0]), label='Distribution to match')
plt.xscale('log')
plt.legend()
plt.show()
Matlab:
% Objective: match this distribution
g0 = [0 0 10.1625 25.1974 21.8711 1.6377 7.3895 8.736 1.4256 0 0]';
s0 = logspace(-3,6,length(g0))';
t = linspace(0.01,500,100)';
[sM,tM] = meshgrid(s0,t);
A = exp(-tM./sM);
rng(1);
% Creates data from the initial distribution with some random noise.
data = A*g0 + 0.07*rand(size(t));
% Parameters and function start
alpha = 1e-2; % regularization parameter
s = logspace(-3,6,20)'; % x of the ILT
g0 = ones(size(s)); % initial guess of y of ILT
y = data; % noisy data
options = optimset('MaxFunEvals',1e8,'MaxIter',1e8); % constraints for fminsearch
constraints = {'g>0','zero_at_the_extremes'}; % constraints for MSD
R = zeros(length(g0)-2,length(g0));
w = ones(size(y(:)));
[sM,tM] = meshgrid(s,t);
A = exp(-tM./sM);
g0 = g0*sum(y)/sum(A*g0); % Makes a "better guess" for the distribution
disp('msd of input data:')
disp(msd(g0, y, A, alpha, R, w, constraints))
for k = 1:5
[g,msdfit] = fminsearch(@msd,g0,options,y,A,alpha,R,w,constraints);
if ismember('zero_at_the_extremes',constraints)
g(1) = 0;
g(end) = 0;
end
if ismember('g>0',constraints)
g = abs(g);
end
g0 = g;
end
disp('New guess')
disp(g)
disp('Final msd of g')
disp(msdfit)
% Visualize the fit
semilogx(s, g)
hold on
semilogx(logspace(-3,6,11), [0 0 10.1625 25.1974 21.8711 1.6377 7.3895 8.736 1.4256 0 0])
legend('First approximation', 'Distribution to match')
hold off
function out = msd(g,y,A,alpha,R,w,constraints)
% msd: The mean square deviation; this is the function
% that has to be minimized by fminsearch
% Constraints and any 'a priori' knowledge
if ismember('zero_at_the_extremes',constraints)
g(1) = 0;
g(end) = 0;
end
if ismember('g>0',constraints)
g = abs(g); % must be g(i)>=0 for each i
end
r = diff(diff(g(1:end))); % second derivative of g
yfit = A*g;
% Sum of weighted square residuals
VAR = sum(w.*(y-yfit).^2);
% Regularizor
REG = alpha^2 * sum((r-R*g).^2);
% Output to be minimized
out = VAR+REG;
end
[Figure: result of the optimization in Python]
[Figure: result of the optimization in Matlab]
I have checked the output of MSD of g0 before starting, and both give the value of 2651. After minimization, Python goes up, to 4547, and Matlab goes down to 0.1381.
I think the problem is one of the following. It may be in my implementation: either I am using fmin wrong, or I got some other passage wrong, but I can't figure out what. The fact that the MSD increases when a minimization function should have decreased it is damning. Alternatively, according to the documentation, the scipy implementation differs from Matlab's: Matlab uses the Nelder-Mead method described in Lagarias, while scipy uses the original Nelder-Mead. Maybe that has a significant effect? Or perhaps my initial guess is too bad for scipy's algorithm?
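One difference between the two MWEs that may be worth ruling out (an assumption, not a confirmed diagnosis): Matlab passes g by value and keeps it a column, whereas fmin hands the objective a 1-D array. The Python msd above writes into that array in place, and a 1-D yfit broadcasts against the column y into a square matrix. A minimal sketch of an objective that avoids both, under the hypothetical name msd_safe:

```python
import numpy as np


def msd_safe(g, y, A, alpha, R, w, constraints):
    """Mean square deviation that neither mutates g nor relies on broadcasting."""
    # Work on a private column-vector copy of whatever the optimizer passes in
    g = np.array(g, dtype=float).reshape(-1, 1)
    if 'zero_at_extremes' in constraints:
        g[0] = 0
        g[-1] = 0
    if 'g>0' in constraints:
        g = np.abs(g)
    r = np.diff(g, n=2, axis=0)      # second difference of g
    yfit = A @ g                     # column, same shape as y
    var = np.sum(w * (np.reshape(y, (-1, 1)) - yfit) ** 2)
    reg = alpha ** 2 * np.sum((r - R @ g) ** 2)
    return float(var + reg)
```

It takes the same arguments as msd, so it can be passed to fmin unchanged for a comparison run.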
So, quite a long time since I posted this, but I wanted to share what I ended up learning and doing.
The Inverse Laplace Transform for CPMG data is a bit of a misnomer; it is more properly called just inversion. The general problem is solving a Fredholm integral equation of the first kind. One way of doing this is the Tikhonov regularization method. It turns out you can describe this problem quite easily using numpy and solve it with a scipy function, so I don't have to "reinvent" the wheel.
I used the solution shown in this post, and the names here reflect that solution.
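import numpy as np
from scipy.optimize import nnls

def tikhonov_regularized_inversion(
    kernel: np.ndarray, alpha: float, data: np.ndarray
) -> np.ndarray:
    # Stack the data as a column; it is padded with zeros below
    data = data.reshape(-1, 1)
    # Regularization block: alpha along the diagonal, same shape as the kernel
    I = alpha * np.eye(*kernel.shape)
    C = np.concatenate([kernel, I], axis=0)
    d = np.concatenate([data, np.zeros_like(data)])
    # Non-negative least squares on the stacked system
    x, _ = nnls(C, d.flatten())
    return x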
Here, kernel is a matrix containing all the possible exponential decay curves, and my solution judges the contribution of each decay curve in the data I received. First, I stack my data as a column, then pad it with zeros, creating the vector d. I then stack my kernel on top of a diagonal matrix containing the regularization parameter alpha along the diagonal, of the same size as the kernel. Last, I call the convenient nnls, a non negative least square solver in scipy.optimize. This is because there's no reason to have a negative contribution, only no contribution.
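A small usage sketch on synthetic data follows. The time axis, the T2 grid, the single-peak distribution and the value of alpha are all assumptions made for the demonstration, and the function is repeated under the illustrative name tikhonov_nnls so the block runs on its own.

```python
import numpy as np
from scipy.optimize import nnls


def tikhonov_nnls(kernel, alpha, data):
    """Same stacking as described above, written out step by step."""
    data = np.asarray(data, dtype=float).reshape(-1, 1)
    reg = alpha * np.eye(*kernel.shape)              # alpha on the main diagonal
    C = np.concatenate([kernel, reg], axis=0)        # kernel on top of the regularizer
    d = np.concatenate([data, np.zeros_like(data)])  # data padded with zeros
    x, _ = nnls(C, d.ravel())
    return x


# Synthetic decay (all values are assumptions for the demo)
t = np.linspace(0.01, 5.0, 50)
T2 = np.logspace(-2, 1, 20)
kernel = np.exp(-t[:, None] / T2[None, :])           # one decay curve per column
true_g = np.zeros(T2.size)
true_g[12] = 1.0
signal = kernel @ true_g
g_fit = tikhonov_nnls(kernel, alpha=0.1, data=signal)
```

Larger values of alpha shrink the recovered distribution more strongly, so in practice it is usually scanned over a range rather than fixed up front.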
This solved my problem, it's quick and convenient.
I'm implementing the PC algorithm in Python. This algorithm constructs the graphical model of an n-variate Gaussian distribution. This graphical model is basically the skeleton of a directed acyclic graph, which means that if a structure like:
(x1)---(x2)---(x3)
is in the graph, then x1 is independent of x3 given x2. More generally, if A is the adjacency matrix of the graph and A(i,j)=A(j,i) = 0 (there is a missing edge between i and j), then i and j are conditionally independent given all the variables that appear in any path from i to j. For statistical and machine learning purposes, it is possible to "learn" the underlying graphical model.
If we have enough observations of a jointly gaussian n-variate random variable we could use the PC algorithm that works as follows:
given n as the number of variables observed, initialize the graph as G=K(n)
for each pair i,j of nodes:
    if exists an edge e from i to j:
        look for the neighbours of i
        if j is in neighbours of i then remove j from the set of neighbours
        call the set of neighbours k
        TEST if i and j are independent given the set k, if TRUE:
            remove the edge e from i to j
This algorithm also computes the separating sets of the graph, which are used by another algorithm that constructs the DAG starting from the skeleton and the separating sets returned by the PC algorithm. This is what I've done so far:
def _core_pc_algorithm(a,sigma_inverse):
    l = 0
    N = len(sigma_inverse[0])
    n = range(N)
    sep_set = [ [set() for i in n] for j in n]
    act_g = complete(N)
    z = lambda m,i,j : -m[i][j]/((m[i][i]*m[j][j])**0.5)
    while l<N:
        for (i,j) in itertools.permutations(n,2):
            adjacents_of_i = adj(i,act_g)
            if j not in adjacents_of_i:
                continue
            else:
                adjacents_of_i.remove(j)
            if len(adjacents_of_i) >=l:
                for k in itertools.combinations(adjacents_of_i,l):
                    if N-len(k)-3 < 0:
                        return (act_g,sep_set)
                    if test(sigma_inverse,z,i,j,l,a,k):
                        act_g[i][j] = 0
                        act_g[j][i] = 0
                        sep_set[i][j] |= set(k)
                        sep_set[j][i] |= set(k)
        l = l + 1
    return (act_g,sep_set)
a is the tuning parameter alpha with which I will test for conditional independence, and sigma_inverse is the inverse of the covariance matrix of the sampled observations. Moreover, my test is:
def test(sigma_inverse,z,i,j,l,a,k):
    def erfinv(x): #used to approximate the inverse of a gaussian cumulative density function
        sgn = 1
        a = 0.147
        PI = numpy.pi
        if x<0:
            sgn = -1
        temp = 2/(PI*a) + numpy.log(1-x**2)/2
        add_1 = temp**2
        add_2 = numpy.log(1-x**2)/a
        add_3 = temp
        rt1 = (add_1-add_2)**0.5
        rtarg = rt1 - add_3
        return sgn*(rtarg**0.5)

    def indep_test_ijK(K): #compute partial correlation of i and j given ONE conditioning variable K
        part_corr_coeff_ij = z(sigma_inverse,i,j) #this gives the partial correlation coefficient of i and j
        part_corr_coeff_iK = z(sigma_inverse,i,K) #this gives the partial correlation coefficient of i and k
        part_corr_coeff_jK = z(sigma_inverse,j,K) #this gives the partial correlation coefficient of j and k
        part_corr_coeff_ijK = (part_corr_coeff_ij - part_corr_coeff_iK*part_corr_coeff_jK)/((((1-part_corr_coeff_iK**2))**0.5) * (((1-part_corr_coeff_jK**2))**0.5)) #this gives the partial correlation coefficient of i and j given K
        return part_corr_coeff_ijK == 0 #i independent from j given K if partial_correlation(i,k)|K == 0 (under jointly gaussian assumption) [could check if abs is < alpha?]

    def indep_test():
        n = len(sigma_inverse[0])
        phi = lambda p : (2**0.5)*erfinv(2*p-1)
        root = (n-len(k)-3)**0.5
        return root*abs(z(sigma_inverse,i,j)) <= phi(1-a/2)

    if l == 0:
        return z(sigma_inverse,i,j) == 0 #i independent from j <=> partial_correlation(i,j) == 0 (under jointly gaussian assumption) [could check if abs is < alpha?]
    elif l == 1:
        return indep_test_ijK(k[0])
    elif l == 2:
        return indep_test_ijK(k[0]) and indep_test_ijK(k[1]) #ASSUMING THAT IJ ARE INDEPENDENT GIVEN Y,Z <=> IJ INDEPENDENT GIVEN Y AND IJ INDEPENDENT GIVEN Z
    else: #i have to use the independent test with the z-fisher function
        return indep_test()
Where z is a lambda that receives a matrix (the inverse of the covariance matrix), an integer i, an integer j and it computes the partial correlation of i and j given all the rest of variables with the following rule (which I read in my teacher's slides):
corr(i,j)|REST = -var^-1(i,j)/sqrt(var^-1(i,i)*var^-1(j,j))
The main core of this application is the indep_test() function:
def indep_test():
    n = len(sigma_inverse[0])
    phi = lambda p : (2**0.5)*erfinv(2*p-1)
    root = (n-len(k)-3)**0.5
    return root*abs(z(sigma_inverse,i,j)) <= phi(1-a/2)
This function implements a statistical test which uses the fisher's z-transform of estimated partial correlations. I am using this algorithm in two ways:
Generate data from a linear regression model and compare the learned DAG with the expected one
Read a dataset and learn the underlying DAG
In both cases I do not always get correct results: either I know the DAG underlying a certain dataset, or I know the generative model, and it does not coincide with the one my algorithm learns. I know perfectly well that this is a non-trivial task and that I may have misunderstood a theoretical concept, or made errors in parts of the code I have omitted here. But first I'd like to know (from someone more experienced than me) whether the test I wrote is right, and whether there are library functions that perform this kind of test. I tried searching but couldn't find any suitable function.
I'll get to the point. The most critical issue in the above code concerns the following line:
sqrt(n-len(k)-3)*abs(z(sigma_inverse[i][j])) <= phi(1-alpha/2)
I was misunderstanding the meaning of n: it is not the size of the precision matrix but the total number of multivariate observations (in my case, 10000 instead of 5). Another wrong assumption is that z(sigma_inverse[i][j]) has to provide the partial correlation of i and j given all the rest. That's not correct: z is the Fisher transform applied to a proper subset of the precision matrix, which estimates the partial correlation of i and j given K. The correct test is the following:
if len(K) == 0: #CM is the correlation matrix, we have no variables conditioning (K has 0 length)
    r = CM[i, j] #r is the partial correlation of i and j
elif len(K) == 1: #we have one variable conditioning, not very different from the previous version except for the fact that i have not to compute the correlations matrix since i start from it, and pandas provide such a feature on a DataFrame
    r = (CM[i, j] - CM[i, K] * CM[j, K]) / math.sqrt((1 - math.pow(CM[j, K], 2)) * (1 - math.pow(CM[i, K], 2))) #r is the partial correlation of i and j given K
else: #more than one conditioning variable
    CM_SUBSET = CM[np.ix_([i]+[j]+K, [i]+[j]+K)] #subset of the correlation matrix i'm looking for
    PM_SUBSET = np.linalg.pinv(CM_SUBSET) #constructing the precision matrix of the given subset
    r = -1 * PM_SUBSET[0, 1] / math.sqrt(abs(PM_SUBSET[0, 0] * PM_SUBSET[1, 1]))
r = min(0.999999, max(-0.999999,r))
res = math.sqrt(n - len(K) - 3) * 0.5 * math.log1p((2*r)/(1-r)) #estimating partial correlation with fisher's transformation
return 2 * (1 - norm.cdf(abs(res))) #obtaining p-value
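For reference, here is a self-contained sketch of the same test as a function. The name partial_corr_pvalue and its signature are illustrative, and it uses the precision-matrix route for every non-empty K, which is a simplification of the three branches above. The example uses the population correlation matrix of a chain x1 -> x2 -> x3 with unit-variance noise, so the expected outcome is known in advance.

```python
import math
import numpy as np
from scipy.stats import norm


def partial_corr_pvalue(CM, n, i, j, K):
    """p-value for H0: partial correlation of i and j given the list K is zero.

    CM is a correlation matrix, n the number of observations (not variables).
    """
    K = list(K)
    if len(K) == 0:
        r = CM[i, j]
    else:
        idx = [i, j] + K
        PM = np.linalg.pinv(CM[np.ix_(idx, idx)])   # precision matrix of the subset
        r = -PM[0, 1] / math.sqrt(abs(PM[0, 0] * PM[1, 1]))
    r = min(0.999999, max(-0.999999, r))
    z = 0.5 * math.log((1.0 + r) / (1.0 - r))        # Fisher z-transform
    stat = math.sqrt(n - len(K) - 3) * abs(z)
    return 2.0 * (1.0 - norm.cdf(stat))


# Population covariance of x1 = e1, x2 = x1 + e2, x3 = x2 + e3
cov = np.array([[1.0, 1.0, 1.0],
                [1.0, 2.0, 2.0],
                [1.0, 2.0, 3.0]])
s = np.sqrt(np.diag(cov))
CM = cov / np.outer(s, s)
p_marginal = partial_corr_pvalue(CM, n=1000, i=0, j=2, K=[])
p_given_x2 = partial_corr_pvalue(CM, n=1000, i=0, j=2, K=[1])
```

Here x1 and x3 are marginally dependent, so p_marginal is essentially zero, while conditioning on x2 removes the dependence and p_given_x2 is close to one. An edge would be removed when the p-value exceeds the chosen alpha.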
I hope someone could find this helpful
I'm trying to wrap my head around linear prediction and figured I'd code up a basic example in Python to test my understanding. The idea behind linear predictive coding is to estimate future samples of a signal based on linear combinations of past samples.
I'm using the lpc module in scikits.talkbox so I don't have to write any of the algorithm myself. Here's my code:
import math
import numpy as np
from scikits.talkbox.linpred.levinson_lpc import levinson, acorr_lpc, lpc
x = np.linspace(0,11,12)
order = 5
"""
a = solution of the inversion
e = prediction error
k = reflection coefficients
"""
(a,e,k) = lpc(x,order,axis=-1)
recon = []
for i in range(order,len(x)):
    sum = 0
    for j in range(order):
        sum += -k[j]*x[i-j-1]
    sum += math.sqrt(e)
    recon.append(sum)
print(recon)
print(x[order:len(x)])
which gives an output of
[5.618790615323507, 6.316875690307965, 7.0149607652924235,
7.713045840276882, 8.411130915261339, 9.109215990245799, 9.807301065230257,
10.505386140214716]
[ 4. 5. 6. 7. 8. 9. 10. 11.]
My concern is that I'm implementing this incorrectly somehow because I figured that if my input array is a linear signal, it should have no issue predicting future values based on past values. However, it does seem to have a particularly high error, especially for the first few values. Would anyone be able to tell me if I'm implementing this correctly or point me to a few examples where this is done in Python? Any help is greatly appreciated, thanks!
The linear prediction algorithm extends the original sequence with an infinite number of zeros in both directions. So, unless your input signal is constantly zero, the extended sequence is not linear and you should expect a nonzero error.
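To see this effect, a minimal sketch of the autocorrelation method (which implicitly zero-extends the signal) can solve the Yule-Walker equations directly with scipy.linalg.solve_toeplitz. The names lpc_autocorr and predict are illustrative and are not part of scikits.talkbox. Note that the prediction uses the filter coefficients a, not the reflection coefficients k.

```python
import numpy as np
from scipy.linalg import solve_toeplitz


def lpc_autocorr(y, m):
    """Order-m predictor A = [1, a_1, ..., a_m] from the autocorrelation method."""
    y = np.asarray(y, dtype=float)
    # Autocorrelation lags 0..m of the zero-extended signal
    R = np.array([y[: len(y) - i].dot(y[i:]) for i in range(m + 1)])
    # Yule-Walker equations: Toeplitz(R[0..m-1]) a = -R[1..m]
    a = solve_toeplitz(R[:m], -R[1 : m + 1])
    return np.concatenate(([1.0], a))


def predict(y, A):
    """One-step predictions y_hat[i] = -sum_k A[k] * y[i-k] for i >= m."""
    y = np.asarray(y, dtype=float)
    m = len(A) - 1
    return np.array([-A[1:].dot(y[i - m : i][::-1]) for i in range(m, len(y))])


x = np.linspace(0, 11, 12)
A = lpc_autocorr(x, 5)
recon = predict(x, A)
```

Comparing recon with x[5:] should still show a residual for the ramp, consistent with the explanation above: the method always returns a filter with its roots inside the unit circle, while an exact ramp predictor would need a double root at z = 1.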
Here is my Python implementation:
import numpy as np

def lpc(y, m):
    "Return m linear predictive coefficients for sequence y using Levinson-Durbin prediction algorithm"
    #step 1: compute autocorrelation coefficients R_0, ..., R_m
    R = [y.dot(y)]
    if R[0] == 0:
        return [1] + [0] * (m-2) + [-1]
    else:
        for i in range(1, m + 1):
            r = y[i:].dot(y[:-i])
            R.append(r)
    R = np.array(R)
    #step 2: Levinson-Durbin recursion on the coefficients A and the error E
    A = np.array([1, -R[1] / R[0]])
    E = R[0] + R[1] * A[1]
    for k in range(1, m):
        if (E == 0):
            E = 10e-17
        alpha = - A[:k+1].dot(R[k+1:0:-1]) / E
        A = np.hstack([A,0])
        A = A + alpha * A[::-1]
        E *= (1 - alpha**2)
    return A
I am very new to scipy and doing data analysis in python. I am trying to solve the following regularized optimization problem and unfortunately I haven't been able to make too much sense from the scipy documentation. I am looking to solve the following constrained optimization problem using scipy.optimize
Here is the function I am looking to minimize:
In the function A is an m X n matrix , the first term in the minimization is the residual sum of squares, the second term is the matrix frobenius (L2 norm) of a sparse n X n matrix W, and the third one is an L1 norm of the same matrix W.
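The formula itself did not survive in the post. Reconstructed from the description above and from the objective coded in the first answer below (so the factors of 1/2 and the placement of beta and lambda are assumptions taken from that code), it reads:

```latex
\min_{W} \; \frac{1}{2}\,\lVert A - AW \rVert_F^2
\;+\; \frac{\beta}{2}\,\lVert W \rVert_F^2
\;+\; \lambda\,\lVert W \rVert_1
\quad \text{subject to} \quad W \ge 0, \;\; \operatorname{diag}(W) = 0
```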
I would like to know how to minimize this function subject to the constraints that:
wj >= 0
wj,j = 0
I would like to use coordinate descent (or any other method that scipy.optimize provides) to solve the above problem. I would like some direction on how to achieve this, as I have no idea how to take the Frobenius norm, how to tune the parameters beta and lambda, or whether scipy.optimize will tune and return the parameters for me. Any help regarding these questions would be much appreciated.
Thanks in advance!
How large is m and n?
Here is a basic example for how to use fmin:
from scipy import optimize
import numpy as np
m = 5
n = 3
a = np.random.rand(m, n)
idx = np.arange(n)
def func(w, beta, lam):
    w = w.reshape(n, n)
    w2 = np.abs(w)
    w2[idx, idx] = 0
    return 0.5*((a - np.dot(a, w2))**2).sum() + lam*w2.sum() + 0.5*beta*(w2**2).sum()
w = optimize.fmin(func, np.random.rand(n*n), args=(0.1, 0.2))
w = w.reshape(n, n)
w[idx, idx] = 0
w = np.abs(w)
print w
If you want to use coordinate descent, you can implement it by theano.
http://deeplearning.net/software/theano/
Your problem seems tailor-made for cvxopt - http://cvxopt.org/
and in particular
http://cvxopt.org/userguide/solvers.html#problems-with-nonlinear-objectives
using fmin would likely be slower, since it does not take advantage of gradient / Hessian information.
The code in HYRY's answer also has the drawback that as far as fmin is concerned the diagonal W is a variable and fmin would try to move the W-diagonal values around until it realizes that they don't do anything (since the objective function resets them to zero). Here is the implementation in cvxopt of HYRY's code that explicitly enforces the zero-constraints and uses gradient info, WARNING: I couldn't derive the Hessian for your objective... and you might double-check the gradient as well:
'''CVXOPT version:'''
from numpy import *
from cvxopt import matrix, mul
''' warning: CVXOPT uses column-major order (Fortran) '''
m = 5
n = 3
n_active = (n)*(n-1)
A = matrix(random.rand(m*n),(m,n))
ids = arange(n)
beta = 0.1;
lam = 0.2;
W = matrix(zeros(n*n), (n,n));
def cvx_objective_func(w=None, z=None):
    if w is None:
        num_nonlinear_constraints = 0;
        w_0 = matrix(1, (n_active,1), 'd');
        return num_nonlinear_constraints, w_0
    #main call:
    'calculate objective:'
    'form W matrix, warning _w is column-major order (Fortran)'
    '''column-major order!'''
    _w = matrix(w, (n, n-1))
    for k in xrange(n):
        W[k, 0:k] = _w[k, 0:k]
        W[k, k+1:n] = _w[k, k:n-1]
    squared_error = A - A*W
    objective_value = .5 * sum( mul(squared_error,squared_error)) +\
                      .5* beta*sum(mul(W,W)) +\
                      lam * sum(abs(W));
    'not sure if i calculated this right...'
    _Df = -A.T*(squared_error) + beta*W + lam;
    '''column-major order!'''
    Df = matrix(0., (1, n*(n-1)))
    for jdx in arange(n):
        for idx in list(arange(0,jdx)) + list(arange(jdx+1,n)):
            idx = int(idx);
            jdx = int(jdx)
            Df[0, jdx*(n-1) + idx] = _Df[idx, jdx]
    if z is None:
        return objective_value, Df
    '''Also form hessian of objective+non-linear constraints
    (but there are no nonlinear constraints) :
    This is the trickiest part...
    WARNING: H is for sure coded wrong'''
    H = matrix(1., (n_active, n_active))
    return objective_value, Df, H
m, w_0 = cvx_objective_func()
print cvx_objective_func(w_0)
G = -matrix(diag(ones(n_active),), (n_active,n_active))
h = matrix(0., (n_active,1), 'd')
from cvxopt import solvers
print solvers.cp(cvx_objective_func, G=G, h=h)
having said that, the tricks to eliminate the equality/inequality constraints in HYRY's code are quite cute
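The same two ideas (explicit constraints and gradient information) can also be sketched with scipy.optimize alone. This is a minimal sketch under assumptions: the name fit_w is hypothetical, only the off-diagonal entries are treated as variables so the zero diagonal holds by construction, and because W >= 0 is enforced through bounds the L1 term reduces to a plain sum, which makes the objective smooth on the feasible set.

```python
import numpy as np
from scipy.optimize import minimize


def fit_w(A, beta, lam):
    """Minimize 0.5*||A - A W||^2 + 0.5*beta*||W||^2 + lam*sum(W), W >= 0, diag(W) = 0."""
    n = A.shape[1]
    off = ~np.eye(n, dtype=bool)          # mask of the free (off-diagonal) entries

    def unpack(v):
        W = np.zeros((n, n))
        W[off] = v
        return W

    def fun(v):
        W = unpack(v)
        resid = A - A @ W
        value = 0.5 * np.sum(resid ** 2) + 0.5 * beta * np.sum(W ** 2) + lam * np.sum(W)
        grad = -A.T @ resid + beta * W + lam   # gradient with respect to every entry of W
        return value, grad[off]                # keep only the free entries

    v0 = np.zeros(n * (n - 1))
    res = minimize(fun, v0, jac=True, method="L-BFGS-B",
                   bounds=[(0.0, None)] * v0.size)
    return unpack(res.x), res


rng = np.random.default_rng(0)
A = rng.random((5, 3))
W, res = fit_w(A, beta=0.1, lam=0.2)
```

The parameters beta and lam are not tuned by the optimizer; they stay fixed inputs and would typically be chosen by cross-validation.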