I am working on a search algorithm and I am struggling to understand how to actually use the results of a singular value decomposition (u, w, vt = svd(a)) of a term-document matrix.
For example, say I have an M x N matrix as follows, where each column represents a document vector (the term counts for that document):
a = [[ 0, 0, 1 ],
     [ 0, 1, 2 ],
     [ 1, 1, 1 ],
     [ 0, 2, 3 ]]
Now, I could run a tf-idf function on this matrix to generate a score for each term/document value, but for the sake of clarity, I will ignore that.
SVD Results
Upon running SVD on this matrix, I end up with the following diagonal vector for 'w'
import svd
u, w, vt = svd.svd(a)
print(w)
# [4.545183973611469, 1.0343228430392626, 0.5210363733873331]
I understand more or less what this represents (thanks to a lot of reading, in particular this article: https://simonpaarlberg.com/post/latent-semantic-analyses/), but I can't figure out how to relate this resulting 'approximation' back to the original documents. What do these weights represent? How can I use this result in my code to find documents related to a term query?
Basically... How do I use this?
The rank-r SVD reduces a rank-R MxN matrix A into r orthogonal rank-1 MxN matrices (u_n * s_n * v_n'). If you use these singular values and vectors to reconstruct the original matrix, you will obtain the best rank-r approximation of A.
Instead of storing the full matrix A, you just store the u_n, s_n, and v_n. (A is MxN, but U is Mxr, S can be stored in one dimension as r elements, and V' is rxN).
To approximate A * x, you simply compute (U * (S * (V' * x))) [Mxr x rxr x rxN x Nx1]. You can speed this up further by storing (U * S) instead of U and S separately.
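As a minimal sketch of this with numpy (rather than the svd module used in the question), here is the truncated decomposition of the example matrix and the cheap way to approximate a @ x without rebuilding the full matrix; the choice k = 2 is only for illustration:

import numpy as np

a = np.array([[0, 0, 1],
              [0, 1, 2],
              [1, 1, 1],
              [0, 2, 3]], dtype=float)

U, s, Vt = np.linalg.svd(a, full_matrices=False)   # U: 4x3, s: (3,), Vt: 3x3

k = 2                                    # keep the two largest singular values
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

a_approx = Uk @ np.diag(sk) @ Vtk        # best rank-k approximation of a

x = np.array([1.0, 0.0, 1.0])            # some Nx1 vector
ax_approx = Uk @ (sk * (Vtk @ x))        # same as a_approx @ x, but cheaper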
So what do the singular values represent? In a way, they represent the energy of each rank-1 matrix. The higher a singular value is, the more its associated rank-1 matrix contributes to the original matrix, and the worse your reconstruction will be if that matrix is dropped when you truncate.
Note that this procedure is closely related to Principal Component Analysis, which is performed on covariance matrices and is commonly used in machine learning to reduce the dimensionality of measured N-dimensional variables.
Additionally, it should be noted that the SVD is useful for many other applications in signal processing.
More information is on the Wikipedia article.
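To tie this back to the question about finding documents related to a term query: a common LSA pattern (not spelled out in the answer above, so treat the details as an assumption) is to fold the query vector into the reduced space using the truncated factors and then rank documents by cosine similarity:

import numpy as np

a = np.array([[0, 0, 1],
              [0, 1, 2],
              [1, 1, 1],
              [0, 2, 3]], dtype=float)
U, s, Vt = np.linalg.svd(a, full_matrices=False)
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

q = np.array([0.0, 1.0, 0.0, 1.0])   # hypothetical query vector: one weight per term
q_hat = (Uk.T @ q) / sk              # query coordinates in the k-dimensional space
docs = Vtk.T                         # one row per document in the same space

sims = (docs @ q_hat) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat))
ranking = np.argsort(-sims)          # document indices, most similar first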
Related
I have a matrix of 63695 row vectors of dim 384.
I would like to compute the cosine similarity model for this matrix.
I was thinking of vectorizing it.
How would one go about doing that?
If you look in scikit-learn's source code you will see that X and Y are first normalized and then X_norm @ Y_norm.T (a dot product) is returned, or, if as in your case no Y exists, X_norm @ X_norm.T.
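A minimal sketch of that same normalize-then-multiply idea (with a smaller stand-in array, since with the real 63695 x 384 data the result is the ~16 GB matrix discussed below):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 384)).astype(np.float32)    # stand-in for the real vectors

X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize each row
sims = X_norm @ X_norm.T                               # cosine similarity of every pair of rows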
Normalizing and transposing are negligible for the runtime, but the multiplication of the (63695 x 384) matrix with its transpose produces 63695 * 63695 result elements, and each element is the dot product of two 384-element vectors, so roughly 63695 * 63695 * 384 ≈ 1.56e12 multiplications (and about the same number of additions).
And as you already mentioned it requires 4 (Bytes for float32) * 63695 * 63695 = ~16.2 GB of memory to handle that result matrix.
Do you really need that enormous matrix? What type of data are you handling and what are you trying to do? If we are talking about, for example, vector representations of text data, then you should look at removing duplicates, processing the data in chunks, or reducing the dimensionality before analysing similarity. If you are looking to rank these cosine similarities and find the k most similar rows, you would be much better off using algorithms built for finding similar data points instead of doing it all by hand yourself.
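If the goal really is just the k most similar rows for each row, a chunked top-k search is one way to avoid ever materializing the full matrix; this is only a sketch under that assumption, and a library such as scikit-learn's NearestNeighbors or FAISS would usually be the better choice:

import numpy as np

def topk_cosine(X, k=10, chunk=2048):
    # For each row of X, return the indices of its k most cosine-similar
    # other rows, without building the full n x n similarity matrix.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    n = X.shape[0]
    out = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        sims = X[start:stop] @ X.T                       # only a (chunk, n) block
        rows = np.arange(stop - start)
        sims[rows, np.arange(start, stop)] = -np.inf     # exclude self-matches
        part = np.argpartition(-sims, k, axis=1)[:, :k]  # unordered top-k per row
        order = np.argsort(-sims[rows[:, None], part], axis=1)
        out[start:stop] = part[rows[:, None], order]     # sorted, most similar first
    return out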
The scipy.linalg.eigh function can take two matrices as arguments: first the matrix a, of which we will find eigenvalues and eigenvectors, but also the matrix b, which is optional and chosen as the identity matrix in case it is left blank.
In what scenario would someone want to use this b matrix?
Some more context: I am trying to use xdawn covariances from the pyRiemann package. This uses the scipy.linalg.eigh function with a covariance matrix a and a baseline covariance matrix b. You can find the implementation here. This yields an error, as the b matrix in my case is not positive definite and thus not usable in the scipy.linalg.eigh function. Removing this matrix and just using the identity matrix instead solves the problem and yields relatively nice results... The problem is that I do not really understand what I changed, and maybe I am doing something I should not be doing.
This is the code from the pyRiemann package I am using (modified to avoid using functions defined in other parts of the package):
# X are samples (EEG data), y are labels
# shape of X is (1000, 64, 2459)
# shape of y is (1000,)
import numpy as np
import sklearn.covariance
from scipy.linalg import eigh

Ne, Ns, Nt = X.shape
tmp = X.transpose((1, 2, 0))
b = np.matrix(sklearn.covariance.empirical_covariance(tmp.reshape(Ne, Ns * Nt).T))
for c in self.classes_:
    # Prototyped response for each class
    P = np.mean(X[y == c, :, :], axis=0)
    # Covariance matrix of the prototyped response & signal
    a = np.matrix(sklearn.covariance.empirical_covariance(P.T))
    # Spatial filters
    evals, evecs = eigh(a, b)
    # and I am now using the following, disregarding the b matrix:
    # evals, evecs = eigh(a)
If A and B are both symmetric matrices, that does not necessarily imply that inv(B)*A is a symmetric matrix. So if I had to solve a generalised eigenvalue problem Ax = lambda Bx, I would use eig(A, B) rather than eig(inv(B)*A), so that the symmetry isn't lost.
One practical application is finding the natural frequencies of a dynamic mechanical system from differential equations of the form M (d²x/dt²) = -Kx, where M is a positive definite matrix known as the mass matrix, K is the stiffness matrix, x is the displacement vector, and d²x/dt² is its second derivative, the acceleration. To find the natural frequencies, substitute x = x0 sin(ωt), where ω is a natural frequency. The equation then reduces to Kx = ω²Mx. One could use eig(inv(M)*K), but that may break the symmetry of the resulting matrix, so I would use eig(K, M) instead.
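A small scipy sketch of exactly that generalised problem, with made-up numbers for a two-degree-of-freedom system, just to show the call:

import numpy as np
from scipy.linalg import eigh

M = np.array([[2.0, 0.0],
              [0.0, 1.0]])       # mass matrix (symmetric positive definite)
K = np.array([[ 6.0, -2.0],
              [-2.0,  4.0]])     # stiffness matrix (symmetric)

w2, modes = eigh(K, M)           # solves K x = w^2 M x
natural_frequencies = np.sqrt(w2)   # the eigenvalues are the squared natural frequencies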
In the generalised problem (A - lambda B) x = 0, the eigenvectors x are no longer expressed in the same basis as for the plain covariance matrix; B acts as the metric. If B is not positive definite, there are vectors whose orientation B can flip (v'Bv is not positive for every v), which is why eigh rejects it.
I hope it was helpful.
I have a 2D matrix of means of size n x m, where n is the number of samples and m is the dimension of the data.
I also have n covariance matrices of size m x m, i.e. sigma is my covariance array of shape n x m x m.
I wish to draw n samples from the distributions above, such that x_i ~ N(mean[i], sigma[i]).
Is there any way to do that in numpy or any other standard lib without a for loop?
The only option I thought of was using np.random.multivariate_normal() by flattening the means matrix into one vector and flattening the 3D sigma into a 2D block-diagonal matrix (and of course reshaping afterwards). But that means we would be sampling with a sigma of shape (n*m) x (n*m), which can easily be ridiculously huge, and just computing and allocating that matrix (if it is even possible) can take longer than running a for loop.
In my specific task, Sigma currently happens to be the same matrix for all samples, meaning I could express it as a single m x m matrix shared by all n points, but I am interested in a general solution.
Appreciate your help.
Difficult to tell without testable code, but this should be close:
import numpy as np

A = np.linalg.cholesky(sigma)                 # shape (n, m, m), same as sigma
Z = np.random.normal(size=(n, m))             # shape (n, m)
X = np.einsum('ijk, ik -> ij', A, Z) + mean   # shape (n, m)
What's going on:
We're manually sampling multivariate normal distributions according to the standard Cholesky decomposition method outlined here. A is built such that A[i] @ A[i].T = sigma[i]. Then X (the multivariate normal) is formed as the product of each A[i] with a standard normal N(0, 1) vector Z[i], plus the mean.
The extra sample dimension is kept throughout the calculation in the first axis (index 0, the 'i' in the einsum), while the last axis ('k') is contracted, which forms the dot product.
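A small self-contained usage sketch of the snippet above, with made-up n, m, means, and per-sample covariances, just to confirm the shapes line up:

import numpy as np

n, m = 5, 3
rng = np.random.default_rng(0)
mean = rng.normal(size=(n, m))

B = rng.normal(size=(n, m, m))
sigma = B @ B.transpose(0, 2, 1) + 1e-3 * np.eye(m)   # n random SPD covariance matrices

A = np.linalg.cholesky(sigma)                 # (n, m, m); cholesky works on stacked matrices
Z = rng.normal(size=(n, m))                   # standard normal draws
X = np.einsum('ijk, ik -> ij', A, Z) + mean   # one draw per (mean[i], sigma[i]) pair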
I'm trying to find the contents of matrix A given that A * A.T = X
I know that A (and therefore A.T) is a 5x5 matrix and I know the contents of X:
[
[-5608529.85751958,-1078099.28424021,782266.19092291,-5553202.27739048,-8346599.92810626],
[-1078099.28424021, -10655907.3511596 , -217503.83572109,-4964009.33281077,-7389416.05437836],
[782266.19092291,-217503.83572109,-1630229.70928628,-6085405.40152081,-9213840.50324483],
[-5553202.27739048,-4964009.33281078,-6085405.40152081,-6529161.83967772,8491769.6736334],
[-8346599.92810626,-7389416.05437838,-9213840.50324484,8491769.67363339, -11725726.66921404]
]
How can I compute A efficiently in python?
For Reference: Wikipedia: Gramian Matrix
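No answer is shown here, so the following is only a sketch of the standard approach, assuming X really is symmetric positive semidefinite (note that a true Gram matrix cannot have negative diagonal entries, so no real A reproduces the matrix above exactly); A is also never unique, since A @ Q works for any orthogonal Q:

import numpy as np

def gram_factor(X, tol=1e-8):
    # Return one real A with A @ A.T ~= X, for symmetric positive semidefinite X.
    X = np.asarray(X, dtype=float)
    vals, vecs = np.linalg.eigh(X)            # eigendecomposition of a symmetric matrix
    if np.any(vals < -tol * np.abs(vals).max()):
        raise ValueError("X is not positive semidefinite, so no real A exists.")
    return vecs * np.sqrt(np.clip(vals, 0.0, None))   # scale each eigenvector column

# A = gram_factor(X); then np.allclose(A @ A.T, X) should hold for a valid X.
# np.linalg.cholesky(X) is an alternative when X is strictly positive definite.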
Based on this post, I can get the covariance between two vectors using np.cov((x, y), rowvar=0). I have an M x N matrix and an M x 1 vector. I want to find the covariance between each column of the matrix and the given vector. I know that I can write this with a for loop; I was wondering if I can somehow use np.cov() to get the result directly.
As Warren Weckesser said, numpy.cov(X, Y) is a poor fit for the job because it will simply join the arrays into one M by (N+1) array and find the huge (N+1) by (N+1) covariance matrix. But we always have the definition of covariance, and it is easy to use:
import numpy as np

A = np.sqrt(np.arange(12).reshape(3, 4))   # some 3 by 4 array
b = np.array([[2], [4], [5]])              # some 3 by 1 vector
cov = np.dot(b.T - b.mean(), A - A.mean(axis=0)) / (b.shape[0] - 1)
This returns the covariances of each column of A with b.
array([[ 2.21895142, 1.53934466, 1.3379221 , 1.20866607]])
The formula I used is for sample covariance (which is what numpy.cov computes, too), hence the division by (b.shape[0] - 1). If you divide by b.shape[0] you get the unadjusted population covariance.
For comparison, the same computation using np.cov:
import numpy as np
A = np.sqrt(np.arange(12).reshape(3, 4))
b = np.array([[2], [4], [5]])
np.cov(A, b, rowvar=False)[-1, :-1]
Same output, but it takes about twice as long (and for large matrices, the difference will be much larger). The slicing at the end is needed because np.cov computes a 5 by 5 matrix, in which only the first 4 entries of the last row are what you wanted; the rest is the covariance of A with itself, or of b with itself.
Correlation coefficient
The correlation coefficient is obtained by dividing by the square roots of the variances. Watch out for the -1 adjustment mentioned earlier: numpy.var does not apply it by default; to get it you need the ddof=1 parameter.
corr = cov / np.sqrt(np.var(b, ddof=1) * np.var(A, axis=0, ddof=1))
Check that the output is the same as that of the less efficient version:
np.corrcoef(A, b, rowvar=False)[-1, :-1]