Calculating eigenvalues of very large sparse matrices in Python

I have a very large sparse matrix which represents the transition matrix of a Markov chain, i.e. the sum of each row of the matrix equals one, and I'm interested in finding the largest eigenvalue that is strictly smaller than one, together with its corresponding eigenvector. I know that the eigenvalues are bounded in the interval [-1, 1] and that they are all real (non-complex).
I am trying to calculate the values using SciPy's scipy.sparse.linalg.eigs function; however, one of the function's parameters is the number of eigenvalues/vectors to estimate, and every time I increase that number, the number of eigenvalues that are exactly one grows as well.
Needless to say, I am using the which parameter with the value 'LR' in order to get the k largest eigenvalues, with k being the number of values to estimate.
Does anyone have an idea how to solve this problem (finding the first eigenvalue smaller than one and its corresponding vector)?

I agree with @pv. If your matrix S were symmetric, you could view I - S as the Laplacian matrix of a graph; the number of connected components of that graph is the number of zero eigenvalues of I - S (i.e., the dimension of the eigenspace associated with eigenvalue 1 of S). For a start, you could check the number of connected components of the graph whose similarity matrix is I - S*S', e.g. with scipy.sparse.csgraph.connected_components.
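A rough sketch of how the two ideas might fit together, assuming (as the answer does) that S is sparse, row-stochastic, and close enough to symmetric for the connected-components count to estimate the multiplicity of eigenvalue 1; the function name and tolerance are illustrative, not the poster's code:

import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.sparse.linalg import eigs

def largest_eigenpair_below_one(S, tol=1e-8):
    """Estimate the multiplicity of eigenvalue 1 from the graph of S,
    then request one eigenvalue more than that from eigs and keep the
    largest eigenvalue strictly below 1."""
    m, _ = connected_components(S, directed=False)
    k = min(m + 1, S.shape[0] - 2)          # eigs requires k < N - 1
    vals, vecs = eigs(S, k=k, which='LR')
    vals = vals.real                        # eigenvalues are real for this problem
    below = vals < 1.0 - tol
    if not below.any():
        raise RuntimeError("no eigenvalue below 1 found; try a larger k")
    i = np.argmax(np.where(below, vals, -np.inf))
    return vals[i], vecs[:, i].real

If S is far from symmetric, the component count is only a heuristic, and k may still need to be increased by hand.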

Related

Power method to find eigenvectors of largest eigenvalues

How can I implement the power method in Python to find the eigenvectors corresponding to the two eigenvalues of largest magnitude, ensuring that the second vector stays orthogonal to the first one? For the simple case, the given matrix will be small and symmetric.
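No answer is excerpted here; one standard approach (a sketch under the question's assumption of a small symmetric matrix) is power iteration plus a Gram-Schmidt re-orthogonalization against the first eigenvector:

import numpy as np

def power_iteration(A, orth_to=None, n_iter=5000, tol=1e-12, seed=0):
    """Power method; if orth_to is given, the iterate is re-orthogonalized
    against it every step, so it converges to the next eigenvector
    (valid for symmetric A)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = A @ v
        if orth_to is not None:
            w -= (w @ orth_to) * orth_to     # Gram-Schmidt step
        w /= np.linalg.norm(w)
        # allow for a sign flip when the dominant eigenvalue is negative
        if min(np.linalg.norm(w - v), np.linalg.norm(w + v)) < tol:
            v = w
            break
        v = w
    return v @ A @ v, v                      # Rayleigh quotient, eigenvector

# lam1, v1 = power_iteration(A)
# lam2, v2 = power_iteration(A, orth_to=v1)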

Calculate largest and smallest eigenvalues

I need to calculate the condition number of a dense matrix A many times (with some changes to A each time), where the condition number is defined to be the largest eigenvalue of A divided by the smallest eigenvalue.
A is on the order of 1000x5000, and performing
np.linalg.svd(A, compute_uv=False)
takes approximately 0.6 seconds.
On the other hand,
scipy.sparse.linalg.svds(A, 1, which='SM')
and
scipy.sparse.linalg.svds(A, 1)
do not converge, due to the density of A (I think).
Is there any way to compute only the largest and smallest eigenvalues of a dense matrix? Or a way to manipulate A such that I could use scipy.sparse.linalg.svds without changing the eigenvalues?
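One possible workaround (not from the thread): since A is roughly 1000 x 5000, the Gram matrix A @ A.T is only 1000 x 1000 and symmetric, and its eigenvalues are the squares of A's singular values (which are what the SVD-based condition number actually uses). A hedged sketch, with the caveat that forming the Gram matrix squares the condition number and can lose accuracy when A is very ill-conditioned:

import numpy as np

def cond_via_gram(A):
    """Condition number sigma_max / sigma_min from the extreme eigenvalues
    of the (much smaller) Gram matrix A @ A.T."""
    w = np.linalg.eigvalsh(A @ A.T)     # ascending order, all real
    return np.sqrt(w[-1] / w[0])

Because the Gram matrix is only 1000 x 1000, its full symmetric eigendecomposition is usually much cheaper than the SVD of the 1000 x 5000 matrix itself.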

How do I organize an uneven matrix of many calculations in pandas / numpy

I am calculating a model that requires a large number of calculations in a big matrix, representing interactions between households (numbering N, roughly 10E4) and firms (numbering M roughly 10E4). In particular, I want to perform the following steps:
1. X2 is an N x M matrix representing the pairwise distance between each household and each firm. The step is to multiply every entry by a parameter gamma.
2. delta is a vector of length M. The step is to broadcast-multiply delta into the rows of the matrix from step 1.
3. Exponentiate the matrix from step 2.
4. Calculate the row sums of the matrix from step 3.
5. Broadcast-divide the matrix from step 3 by the row-sum vector from step 4, across its rows.
6. w is a vector of length N. The step is to broadcast-multiply w into the columns of the matrix from step 5.
7. The final step is to take the column sums of the matrix from step 6.
These steps have to be performed 1000s of times in the context of matching the model simulation to data. At the moment, I have an implementation using a big NxM numpy array and using matrix algebra operations to perform the steps as described above.
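For reference, a direct NumPy translation of the seven steps might look like the sketch below; the names X2, delta, w, and gamma follow the description, and the function name is made up:

import numpy as np

def model_step_dense(X2, delta, w, gamma):
    A = np.exp(gamma * X2 * delta[np.newaxis, :])   # steps 1-3
    A /= A.sum(axis=1, keepdims=True)               # steps 4-5: row-normalize
    A *= w[:, np.newaxis]                           # step 6: scale row i by w[i]
    return A.sum(axis=0)                            # step 7: column sums, length M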
I would like to be able to reduce the number of calculations by eliminating all the "cells" where the distance is greater than some critical value r.
How can I organize my data to do this, while performing all the operations I need to do (exponentiation, row/column sums, broadcasting across rows and columns)?
The solution I have in mind is something like storing the distance matrix in "long form", with each row representing a household / firm pair, rather than the N x M matrix, deleting all the invalid rows to get an array whose length is something less than NM, and then performing all the calculations in this format. In this solution I am wondering if I can use pandas dataframes to make the "broadcasts" and "row sums" work properly (and quickly). How can I make that work?
(Or alternately, if there is a better way I should be exploring, I'd love to know!)
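One way to make the long-form idea concrete without pandas is to keep three parallel 1-D arrays, one entry per retained household/firm pair, and use bincount for the row and column sums. This is a sketch: the index arrays and names are assumptions, and dropping pairs with distance greater than r only approximates the full sum when gamma is negative, so that the dropped exponential terms are effectively zero.

import numpy as np

def model_step_longform(hh_idx, firm_idx, dist, delta, w, gamma, N, M):
    """hh_idx, firm_idx, dist are equal-length 1-D arrays, one entry per
    kept household/firm pair (distance < r)."""
    vals = np.exp(gamma * dist * delta[firm_idx])               # steps 1-3
    row_sums = np.bincount(hh_idx, weights=vals, minlength=N)   # step 4
    vals /= row_sums[hh_idx]                                    # step 5
    vals *= w[hh_idx]                                           # step 6
    return np.bincount(firm_idx, weights=vals, minlength=M)     # step 7

The same bookkeeping works with a pandas DataFrame with columns for household, firm, and distance, using groupby(...).transform('sum') for the row sums and groupby(...).sum() for the column sums, though plain bincount on integer index arrays is often faster.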

Find eigenvectors with specific eigenvalue of sparse matrix in python

I have a large sparse matrix and I want to find its eigenvectors with a specific eigenvalue. The documentation for scipy.sparse.linalg.eigs says of the argument k:
"k is the number of eigenvalues and eigenvectors desired. k must be smaller than N-1. It is not possible to compute all eigenvectors of a matrix".
The problem is that I don't know how many eigenvectors corresponding to the eigenvalue I want. What should I do in this case?
I'd suggest using Singular Value Decomposition (SVD) instead. SciPy provides scipy.sparse.linalg.svds, which can handle sparse matrices. You can find the eigenvalues (in this case, singular values) and eigenvectors as follows:
U, Sigma, VT = svds(X, k=n_components, tol=tol)
where X can be a sparse CSR matrix, and U and VT are the left and right singular vectors corresponding to the singular values in Sigma. Here, you can control the number of components. I'd say start with a small n_components first and then increase it. You can sort Sigma and look at the distribution of singular values: there will typically be a few large values that drop off quickly. You can then set a threshold on the singular values to decide how many vectors to keep.
If you want to use scikit-learn, there is a class sklearn.decomposition.TruncatedSVD that lets you do what I explained.
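A short sketch of that workflow, assuming X is the sparse matrix from above; the starting k and the relative threshold are arbitrary choices:

import numpy as np
from scipy.sparse.linalg import svds

U, Sigma, VT = svds(X, k=20)                 # X: sparse CSR matrix; modest starting k
order = np.argsort(Sigma)[::-1]              # sort singular values descending
U, Sigma, VT = U[:, order], Sigma[order], VT[order, :]
keep = Sigma > 0.1 * Sigma[0]                # hypothetical relative threshold
U, Sigma, VT = U[:, keep], Sigma[keep], VT[keep, :]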

Efficient calculation of euclidean distance

I have an MxN array, where M is the number of observations and N is the dimensionality of each vector. From this array of vectors, I need to calculate the mean and minimum Euclidean distance between the vectors.
In my mind, this requires me to calculate M choose 2 = M(M-1)/2 pairwise distances, which is O(M^2) work. My M is ~10,000 and my N is ~1,000, and this computation takes ~45 seconds.
Is there a more efficient way to compute the mean and min distances? Perhaps a probabilistic method? I don't need it to be exact, just close.
You didn't describe where your vectors come from, nor what use you will put the mean and minimum to. Here are some observations about the general case. Limited ranges, error tolerance, and discrete values may admit a more efficient approach.
The mean distance between M points sounds quadratic, O(M^2). But M / N is 10, fairly small, and N is huge, so the data probably resembles a hairy sphere in 1e3-space. Computing the centroid of the M points, and then computing the M distances to the centroid, might turn out to be useful in your problem domain; hard to tell.
The minimum distance among M points is more interesting. Choose a small number of pairs at random, say 100, compute their distances, and take half the minimum as an estimate of the global minimum distance. (Validate by comparing to the next few smallest distances, if desired.) Now use a spatial UB-tree to model each point as a positive integer. This involves finding N minima over the M x N values, adding constants so the minimum becomes zero, scaling so the estimated global minimum distance corresponds to at least 1.0, and then truncating to integer.
With these transformed vectors in hand, we're ready to turn them into a UB-tree representation that we can sort, and then do nearest neighbor spatial queries on the sorted values. For each point compute an integer. Shift the low-order bit of each dimension's value into the result, then iterate. Continue iterating over all dimensions until non-zero bits have all been consumed and appear in the result, and proceed to the next point. Numerically sort the integer result values, yielding a data structure similar to a PostGIS index.
Now you have a discretized representation that supports reasonably efficient queries for nearest neighbours (though admittedly N=1e3 is inconveniently large). After finding two or more coarse-grained nearby neighbours, you can query the original vector representation to obtain high-resolution distances between them, for finer discrimination. If your data distribution turns out to have a large fraction of points that discretize to being off by a single bit from their nearest neighbour, e.g. locations of oxygen atoms where each has a buddy, then increase the global minimum distance estimate so the low-order bits offer adequate discrimination.
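A rough, pure-Python sketch of the bit-interleaving ("UB-tree key") step described above; with N around 1e3 dimensions the keys become very long integers, so this is illustrative rather than efficient:

def zorder_key(point, bits=8):
    """Interleave bit b of every dimension (low-order bits first) into one
    integer, so spatially nearby points tend to receive nearby keys."""
    key, out = 0, 0
    for b in range(bits):
        for x in point:                  # point: non-negative integer coordinates
            key |= ((int(x) >> b) & 1) << out
            out += 1
    return key

# Sorting points by zorder_key gives the sorted structure used for the
# coarse nearest-neighbour queries described above.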
A similar discretization approach would be appropriately scaling e.g. 2-dimensional inputs and marking an initially empty grid, then scanning immediate neighborhoods. This relies on global min being within a "small" neighborhood, due to appropriate scaling. In your case you would be marking an N-dimensional grid.
You may be able to speed things up with some sort of Space Partitioning.
For the minimum distance calculation, you would only need to consider pairs of points in the same or neighbouring partitions. For an approximate mean, you might be able to come up with some sort of weighted average based on the distances between partitions and the number of points within them.
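A concrete way to realize the partitioning idea (not from this answer) is scipy.spatial.cKDTree for the exact minimum plus random sampling for the approximate mean; note that k-d trees lose much of their advantage in ~1000 dimensions, so this works best after dimensionality reduction:

import numpy as np
from scipy.spatial import cKDTree

def min_and_mean_distance(X, n_samples=100_000, seed=0):
    """X: (M, N) array of observations."""
    tree = cKDTree(X)
    d, _ = tree.query(X, k=2)                    # k=2: nearest point other than itself
    min_dist = d[:, 1].min()
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X), n_samples)       # random pairs for the mean estimate
    j = rng.integers(0, len(X), n_samples)
    mask = i != j
    mean_dist = np.linalg.norm(X[i[mask]] - X[j[mask]], axis=1).mean()
    return min_dist, mean_dist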
I had the same issue before, and it worked for me once I normalized the values. So try to normalize the data before calculating the distance.
