As the question suggests, I would like to compute the gradient with respect to a matrix row. In code:
import numpy.random as rng
import theano.tensor as T
from theano import function
t_x = T.matrix('X')
t_w = T.matrix('W')
t_y = T.dot(t_x, t_w.T)
t_g = T.grad(t_y[0,0], t_x[0]) # my wish, but DisconnectedInputError
t_g = T.grad(t_y[0,0], t_x) # no problems, but a lot of unnecessary zeros
f = function([t_x, t_w], [t_y, t_g])
y,g = f(rng.randn(2,5), rng.randn(7,5))
As the comments indicate, the code works without any problems when I compute the gradient with respect to the entire matrix. In that case the gradient is computed correctly, but the result has non-zero entries only in row 0 (because the other rows of x obviously do not appear in the expressions for the first row of y).
I have found this question, which suggests storing all rows of the matrix in separate variables and building the graph from these variables. In my setting, though, I have no idea how many rows X might have.
Does anybody have an idea how to get the gradient with respect to a single row of a matrix, or how I could omit the extra zeros in the output? Suggestions for stacking an arbitrary number of vectors would work as well, I guess.
I realised that it is possible to get rid of the zeros when computing derivatives with respect to the entries in row i:
t_g = T.grad(t_y[i,0], t_x)[i]
and for computing the Jacobian, I found out that
t_g = T.jacobian(t_y[i], t_x)[:,i]
does the trick. However it seems to have a rather heavy impact on computation speed.
It would also be possible to approach this problem mathematically. The Jacobian of the matrix multiplication t_y w.r.t. t_x is simply the transpose of t_w.T, which is t_w in this case (the transpose of the transpose is the original matrix). Thus, the computation would be as simple as
t_g = t_w
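A quick numerical check of this identity against the graph-based Jacobian from above (a minimal sketch reusing the question's setup; the row index i is assumed to be a plain Python int):
import numpy as np
import numpy.random as rng
import theano.tensor as T
from theano import function

t_x = T.matrix('X')
t_w = T.matrix('W')
t_y = T.dot(t_x, t_w.T)

i = 0  # row of interest
# Jacobian of row i of y with respect to row i of x, built from the graph
f = function([t_x, t_w], T.jacobian(t_y[i], t_x)[:, i])

x, w = rng.randn(2, 5), rng.randn(7, 5)
print(np.allclose(f(x, w), w))  # True: the Jacobian is just W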
My problem is this: I have a GMM model with K multivariate Gaussians, and N samples.
I want to create an N*K numpy matrix whose [i, k] cell holds the pdf of the k'th Gaussian evaluated at the i'th sample, i.e. Q[i, k] = N(X[i] | mu_t[k], cov_t[k]).
In short, I'm interested in this pdf matrix Q.
This is what I have now (I'm working in Python):
Q = np.array([scipy.stats.multivariate_normal(mu_t[k], cov_t[k]).pdf(X) for k in range(self.K)]).T
X in the code is a matrix whose rows are my samples.
It works fine on a small toy dataset of low dimension, but the dataset I'm actually working with is 10,000 28*28 pictures, and on it this line runs extremely slowly...
I want to find a solution that doesn't involve loops, only vector/matrix operations (i.e. vectorization). As far as I understand, scipy's multivariate_normal cannot take the parameters of more than one Gaussian, although its pdf function can evaluate multiple samples at once.
Does anyone have an idea?
I am afraid that the main speed killer in your problem is the inversion and determinant calculation for the cov_t matrices. If you somehow manage to precalculate these, you can unroll the calculation: use broadcasting to form all the differences x_i - mu_k at once and then evaluate the full formula of the normal density with np.einsum.
Try
S = X[:, None, :] - mu_t[None, :, :]  # shape (N, K, d): all differences x_i - mu_k
cov_t_inv = ??   # inverse covariance matrices, shape (K, d, d)
cov_t_det = ??   # determinants of the covariance matrices, shape (K,)
d = X.shape[1]
Q = np.exp(-0.5*np.einsum('ikr,krs,iks->ik', S, cov_t_inv, S)) / np.sqrt((2*np.pi)**d * cov_t_det)
Here you insert the precalculated arrays cov_t_inv (the inverses of the covariance matrices) and cov_t_det (the determinants of the covariance matrices).
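For the precalculation itself, np.linalg.inv and np.linalg.det both broadcast over a stack of matrices, so one possible way to fill in the ?? placeholders (assuming cov_t is stored as, or convertible to, an array of shape (K, d, d)) is:
import numpy as np

cov_t = np.asarray(cov_t)          # the K covariance matrices, shape (K, d, d)
cov_t_inv = np.linalg.inv(cov_t)   # batched inverses, shape (K, d, d)
cov_t_det = np.linalg.det(cov_t)   # batched determinants, shape (K,)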
Given the following input numpy 2-D array A (available through the file hill_mat.npy), it would be great if I could compute only a subset of its eigenvalues using an iterative solver like scipy.sparse.linalg.eigs.
First of all, a little bit of context. This matrix A results from a quadratic eigenvalue problem of size N which has been linearized into an equivalent eigenvalue problem of double size 2*N. A has the following structure (blue color being zeroes):
plt.imshow(np.where(A > 1e-15,1.,0), interpolation='None')
and the following features:
A shape = (748, 748)
A dtype = float64
A sparsity ratio = 77.64841716949297 %
The true dimensions of A are much bigger than this small reproducible example. I expect the real sparsity ratio and shape to be close to 95% and (5508, 5508) for this case.
The resulting eigenvalues of A are complex (they come in complex-conjugate pairs), and I am most interested in the ones with the smallest imaginary part in modulus.
Problem: when using direct solver:
w_dense = np.linalg.eigvals(A)
idx = np.argsort(abs(w_dense.imag))
w_dense = w_dense[idx]
calculation times rapidly become prohibitive. I am thus looking to use a sparse algorithm:
from scipy.sparse import csc_matrix, linalg as sla
w_sparse = sla.eigs(A, k=100, sigma=0+0j, which='SI', return_eigenvectors=False)
but it seems that ARPACK doesn't find any eigenvalues this way. According to the scipy/ARPACK tutorial, when looking for small eigenvalues (e.g. which='SI'), one should use the so-called shift-invert mode by specifying the sigma kwarg, so that the algorithm knows where it can expect to find these eigenvalues. Nonetheless, none of my attempts yielded any results...
Could someone more experienced with this function give me a hand in order to make this work?
Here is the complete code snippet:
import numpy as np
from matplotlib import pyplot as plt
from scipy.sparse import csc_matrix, linalg as sla
A = np.load('hill_mat.npy')
print('A shape =', A.shape)
print('A dtype =', A.dtype)
print('A sparsity ratio =', (np.prod(A.shape) - np.count_nonzero(A)) / np.prod(A.shape) * 100, '%')
# quick look at the structure of A
plt.imshow(np.where(A > 1e-15,1.,0), interpolation='None')
# direct
w_dense = np.linalg.eigvals(A)
idx = np.argsort(abs(w_dense.imag))
w_dense = w_dense[idx]
# sparse
w_sparse = sla.eigs(csc_matrix(A), k=100, sigma=0+0j, which='SI', return_eigenvectors=False)
Problem finally solved. I guess I should have read the documentation more carefully; still, the following is quite counter-intuitive and could, in my opinion, be better emphasized:
... ARPACK contains a mode that allows a quick determination of non-external eigenvalues: shift-invert mode. As mentioned above, this mode involves transforming the eigenvalue problem to an equivalent problem with different eigenvalues. In this case, we hope to find eigenvalues near zero, so we'll choose sigma = 0. The transformed eigenvalues will then satisfy ν = 1/(λ − σ) = 1/λ, so our small eigenvalues λ become large eigenvalues ν.
This way, when looking for small eigenvalues, in order to help ARPACK do the work, one should activate shift-invert mode by specifying an appropriate sigma value while also reversing the desired subset specified in the which keyword argument (here, 'LM' instead of 'SI').
Thus, it is simply a matter of executing:
w_sparse = sla.eigs(csc_matrix(A), k=100, sigma=0+0j, which='LM', return_eigenvectors=False, maxiter=2000)
idx = np.argsort(abs(w_sparse.imag))
w_sparse = w_sparse[idx]
I can only hope this mistake helps someone else :)
I'm wondering if anyone knows how to implement a rolling/moving window PCA that reuses the calculated PCA when adding and removing measurements.
The idea is that I have a large set of measurements over a very long time, and I would like a moving window (say, 200 days) starting at the beginning of my dataset; at each step I include the next day's measurement and throw out the oldest one, so my window is always 200 days long. However, I would not like to simply recalculate the PCA each time.
Is it possible to make an algorithm that is more efficient than simply calculating the PCA for each window independently? Thanks in advance!
A complete answer depends on a lot of factors. I'll cover what I think are the most important such factors, and hopefully that'll be enough information to point you in the right direction.
First, directly answering your question, yes it is possible to make an algorithm that is more efficient than simply calculating the PCA for each window independently.
Improving the Naive PCA Algorithm (low-dimensional inputs)
As a first pass at the problem, let's assume that you're doing a naive PCA calculation with no normalization (i.e., you're leaving the data alone, computing the covariance matrix, and finding that matrix's eigenvalues/eigenvectors).
When faced with an input matrix X whose PCA we want to compute, the naive algorithm first computes the covariance matrix W = X.T @ X. Once we've computed that for some window of 200 elements, we can cheaply add or remove elements from consideration in the original data set by adding or removing their contribution to the covariance.
"""
W: shape (p, p)
row: shape (1, p)
"""
def add_row(W, row):
return W + (row.T # row)
def remove_row(W, row):
return W - (row.T # row)
Your description of a sliding window is equivalent to removing a row and adding a new one, so we can quickly compute a new covariance matrix using O(p^2) computations rather than the O(n p^2) a typical matrix multiply would take (with n==200 for this problem).
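As a quick usage sketch of these two helpers on made-up data (shapes and window size are illustrative only):
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 10))      # 500 daily measurements, p = 10 features

# scatter matrix W = X.T @ X of the first 200-day window
window = data[:200]
W = window.T @ window

# slide the window by one day: drop day 0, add day 200
W = remove_row(W, data[0:1])
W = add_row(W, data[200:201])

# matches a full recomputation on the new window
print(np.allclose(W, data[1:201].T @ data[1:201]))   # True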
The covariance matrix isn't the final answer though, and we still need to find the principal components. If you aren't hand-rolling the eigensolver yourself there isn't a lot to be done -- you'll pay the cost for new eigenvalues and eigenvectors every time.
However, if you are writing your own eigensolver, most such methods accept a starting input and iterate until some termination condition is met (usually a maximum number of iterations or a sufficiently small error, whichever comes first). Swapping out a single data point isn't likely to drastically alter the principal components, so for typical data one might expect that re-using the existing eigenvalues/eigenvectors as inputs to the eigensolver would let you terminate in far fewer iterations than starting from randomized inputs, affording an additional speedup.
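To make that concrete, here is a toy sketch of the warm-start idea using plain power iteration for just the leading component (a full eigensolver would be warm-started the same way; data and window size are made up):
import numpy as np

def power_iteration(W, v0, tol=1e-10, max_iter=10000):
    # leading eigenpair of a symmetric PSD matrix W, starting from v0
    v = v0 / np.linalg.norm(v0)
    lam = v @ W @ v
    for it in range(1, max_iter + 1):
        v = W @ v
        v /= np.linalg.norm(v)
        lam_new = v @ W @ v
        if abs(lam_new - lam) < tol:
            return lam_new, v, it
        lam = lam_new
    return lam, v, max_iter

rng = np.random.default_rng(0)
data = rng.normal(size=(201, 10))
W_old = data[:200].T @ data[:200]
W_new = W_old - data[0:1].T @ data[0:1] + data[200:201].T @ data[200:201]

_, v_old, _ = power_iteration(W_old, rng.normal(size=10))
_, _, n_cold = power_iteration(W_new, rng.normal(size=10))
_, _, n_warm = power_iteration(W_new, v_old)
print(n_cold, n_warm)   # the warm start typically needs far fewer iterations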
Improving Covariance-Free Algorithms (high-dimensional inputs)
Usually (maybe always?), covariance-free PCA algorithms have some kind of iterated solver (much like an eigensolver), but they have computational shortcuts that allow finding eigenvalues/eigenvectors without explicitly materializing the covariance matrix.
Any individual such method might have additional tricks that allow you to save some information from one window to the next, but in general one would expect that you can reduce the total number of iterations simply by re-using the existing principal components instead of using random inputs to start the solver (much like in the eigensolver case above).
Window Normalization w/ Naive Algorithm
Supposing you're normalizing each window to have a mean of 0 in each column (common in PCA), you'll have some additional work when modifying the covariance matrix.
First I'll assume you already have a rolling mechanism for keeping track of any differences that need to be applied from one window to the next. If not, consider something like the following:
"""
We're lazy and don't want to handle a change in sample
size, so only work with row swaps -- good enough for
a sliding window.
old_row: shape (1, p)
new_row: shape (1, p)
"""
def replaced_row_mean_adjustment(old_row, new_row):
return (new_row - old_row)/200. # whatever your window size is
The effect on the covariance matrix isn't too bad to compute, but I'll put some code here anyway.
"""
W: shape (p, p)
center: shape (1, p)
exactly equal to the mean diff vector we referenced above
X: shape (200, p)
exactly equal to the window you're examining after any
previous mean centering has been applied, but before the
current centering has happened. Note that we only use
its row and column sums, so you could get away with
a rolling computation for those instead, but that's
a little more code, and I want to leave at least some
of the problem for you to work on
"""
def update_covariance(W, center, X):
result = W
result -= center.T # np.sum(X, axis=0).reshape(1, -1)
result -= np.sum(X, axis=1).reshape(-1, 1) # center
result += 200 * center.T # center # or whatever your window size is
return result
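A small self-check of that update against recomputing the centered scatter matrix from scratch (made-up data; 200 is again the window size):
import numpy as np

rng = np.random.default_rng(0)
old_window = rng.normal(size=(200, 6))
new_window = np.vstack([old_window[1:], rng.normal(size=(1, 6))])

old_mean = old_window.mean(axis=0, keepdims=True)
center = replaced_row_mean_adjustment(old_window[0:1], new_window[-1:])  # = new mean - old mean

X = new_window - old_mean   # the new window, still centered with the *old* means
W = X.T @ X

W_updated = update_covariance(W, center, X)
new_centered = new_window - new_window.mean(axis=0)
print(np.allclose(W_updated, new_centered.T @ new_centered))   # True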
Rescaling to have a standard deviation of 1 is also common in PCA. That's pretty easy to accommodate as well.
"""
Updates the covariance matrix assuming you're modifing a window
of data X with shape (200, p) by multiplying each column by
its corresponding element in v. A rolling algorithm to compute
v isn't covered here, but it shouldn't be hard to figure out.
W: shape (p, p)
v: shape (1, p)
"""
def update_covariance(W, v):
return W * (v.T # v) # Note that this is element-wise multiplication of W
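And a matching quick check for the rescaling version (again with made-up shapes):
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
v = rng.normal(size=(1, 5))

W = X.T @ X
print(np.allclose(update_covariance(W, v), (X * v).T @ (X * v)))   # True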
Window Normalization w/ Covariance-free Algorithm
The tricks that you have available here will vary quite a bit depending on the algorithm that you're using, but the general strategy I'd try first is to use a rolling algorithm to keep track of the mean and standard deviation for each column for the current window and to modify the iterative solver to take that into account (i.e., given a window X you want to iterate on the rescaled window Y -- substitute Y=a*X+b into the iterative algorithm of your choice and simplify symbolically to hopefully yield a version with a small additional constant cost).
As before you'll want to re-use any principal components you find instead of using a random initialization vector for each window.
So basically I am trying to do finite differencing on a 2D array without too many for loops. I would like to have the Hessian matrix and the gradient of the array, so I need both the first-order and second-order derivatives.
This can be achieved by evaluating the central-difference formula on the array, e.g. for the derivative along the first axis: df/dx[i, j] ≈ (arr[i+1, j] - arr[i-1, j]) / (2*dx).
To deal with the boundaries we only compute it for the interior points, so code for this derivative might look something like the following:
import numpy as np

dx = 1.0  # grid spacing (assumed)
arr = np.random.rand(16).reshape(4, 4)
result = np.zeros_like(arr)
w, h = arr.shape
for i in range(1, w-1):
    for j in range(1, h-1):
        result[i, j] = (arr[i+1, j] - arr[i-1, j]) / (2*dx)
This gives the correct answer but can be very slow compared to numpy operations, so I thought to myself: this is basically just a convolution along the first axis with a kernel that looks like this
kernel = np.array([[1], [0], [-1]])  # central-difference kernel along axis 0
So we execute the following code
from scipy.signal import convolve

result = np.pad((convolve(arr, kernel, mode='same',
                          method='direct') / (2*dx))[1:-1, 1:-1], 1)
Since we are only dealing with the interior points, we cut the boundary off and pad with zeros afterwards, to mimic what happens in the naive case above.
This works! But for some arrays, the mean squared error between the naive case and the convolution case skyrockets, so the numerical error seems to grow a lot in some cases.
I would like the speed gained by convolution with the stability of the naive case. Any help?
We can simply slice and operate. Hence, after output initialization, do -
result[1:-1,1:-1] = (arr[2:,1:-1] - arr[:-2,1:-1])/(2*dx)
Convolution IMHO would be overkill when working with NumPy arrays, as slicing is virtually free in terms of memory and performance. If things get compute-heavy, one can look into numexpr to leverage multiple cores.
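The same slicing idea extends to the second-order terms; a minimal sketch of the Hessian components with central differences (dx, dy and the array here are just placeholders):
import numpy as np

dx = dy = 1.0                 # assumed grid spacings
arr = np.random.rand(6, 6)

d2x = np.zeros_like(arr)      # d2f/dx2
d2y = np.zeros_like(arr)      # d2f/dy2
dxdy = np.zeros_like(arr)     # mixed term d2f/dxdy

d2x[1:-1, 1:-1] = (arr[2:, 1:-1] - 2*arr[1:-1, 1:-1] + arr[:-2, 1:-1]) / dx**2
d2y[1:-1, 1:-1] = (arr[1:-1, 2:] - 2*arr[1:-1, 1:-1] + arr[1:-1, :-2]) / dy**2
dxdy[1:-1, 1:-1] = (arr[2:, 2:] - arr[2:, :-2] - arr[:-2, 2:] + arr[:-2, :-2]) / (4*dx*dy)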
In my attempt to perform a Cholesky decomposition of a variance-covariance matrix for a 2D array with periodic boundary conditions, I always get, under certain parameter combinations, LinAlgError: Matrix is not positive definite - Cholesky decomposition cannot be computed. I'm not sure whether it's a numpy.linalg issue or an implementation issue, as the script is straightforward:
import numpy as np

sigma = 3.
U = 4

def FromListToGrid(l_):
    i = np.floor(l_ / U)
    j = l_ - i * U
    return np.array((i, j))

Ulist = range(U**2)
Cov = []
for l in Ulist:
    di = np.array([np.abs(FromListToGrid(l)[0] - FromListToGrid(i)[0]) for i, x in enumerate(Ulist)])
    di = np.minimum(di, U - di)
    dj = np.array([np.abs(FromListToGrid(l)[1] - FromListToGrid(i)[1]) for i, x in enumerate(Ulist)])
    dj = np.minimum(dj, U - dj)
    d = np.sqrt(di**2 + dj**2)
    Cov.append(np.exp(-d / sigma))
Cov = np.vstack(Cov)
W = np.linalg.cholesky(Cov)
Attempts to remove potential singularities also failed to resolve the problem. Any help is much appreciated.
Digging a bit deeper into the problem, I tried printing the eigenvalues of the Cov matrix:
print(np.linalg.eigvalsh(Cov))
And the answer turns out to be this
[-0.0801339 -0.0801339 0.12653595 0.12653595 0.12653595 0.12653595 0.14847999 0.36269785 0.36269785 0.36269785 0.36269785 1.09439988 1.09439988 1.09439988 1.09439988 9.6772531 ]
Aha! Notice the first two negative eigenvalues? Now, a matrix is positive definite if and only if all its eigenvalues are positive. So the problem with the matrix is not that it's close to 'zero', but that it's 'negative'. To extend @duffymo's analogy, this is the linear-algebra equivalent of trying to take the square root of a negative number.
Now, let's try to perform the same operation, but this time with scipy.
scipy.linalg.cholesky(Cov, lower=True)
And it fails with a slightly more informative message:
numpy.linalg.linalg.LinAlgError: 12-th leading minor not positive definite
That tells us a bit more (though I couldn't really understand why it's complaining about the 12-th leading minor).
Bottom line: the matrix is not so much close to 'zero' as it is 'negative'.
The problem is the data you're feeding to it. The matrix is singular, according to the solver. That means a zero or near-zero diagonal element, so inversion is impossible.
It'd be easier to diagnose if you could provide a small version of the matrix.
Zero diagonals aren't the only way to create a singularity. If two rows are proportional to each other then you don't need both in the solution; they're redundant. It's more complex than just looking for zeroes on the diagonal.
If your matrix is correct, you have a non-empty null space. You'll need to change algorithms to something like SVD.
See my comment below.
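To illustrate the SVD suggestion with a hedged sketch (whether it applies depends on the actual spectrum of your matrix): the singular values give the numerical rank, and the right singular vectors beyond that rank span the (near-)null space, i.e. the redundant directions:
import numpy as np

# Cov is the matrix that Cholesky rejected
_, s, Vt = np.linalg.svd(Cov)

# the same tolerance rule of thumb that np.linalg.matrix_rank uses
tol = s.max() * max(Cov.shape) * np.finfo(float).eps
rank = int(np.sum(s > tol))
print("numerical rank:", rank, "out of", Cov.shape[0])

# rows of Vt past the numerical rank span the (near-)null space
null_space = Vt[rank:]
print("null-space dimension:", null_space.shape[0])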