How can I implement a power method in python that can find eigenvectors corresponding to the two eigenvalues of the largest magnitude by assuring that second vector remains orthogonal to the first one? For simple case, the given matrix will be small and symmetric.
Related
In Python I've calculated the eigenvectors and eigenvalues of my data matrix X through eig(). I'm looking to find the top 2 principal components of the data (U = [u1 u2]). I know the top 2 components are the 2 eigenvectors corresponding to the 2 largest eigenvalues, but I'm not sure how to calculate that information with the data at hand (eigenvalues, eigenvectors, and X).
Eigenvectors and eigenvalues calculated:
Eigenvectors = [[-0.68065502 -0.72805308 -0.08153196]
[-0.71680551 0.68482721 -0.13115467]
[-0.15132287 0.03082853 0.98800354]]
Eigenvalues = [2.84217094e-14 2.15257831e+02 8.95193455e+02]
Given the Eigenvalues you got Eigenvalues = [2.84217094e-14 2.15257831e+02 8.95193455e+02]
Your two largest Eigenvalues are 8.95193455e+02 and 2.15257831e+02,
The sum of your eigenvalues is 1110.0 which corresponds to the 100% of the information.
So your largest eigenvalue 8.95193455e+02 has 80.6% of the information. The second eigenvalue 2.15257831e+02 has the remaining 19.4% of the information, and the last eigenvalue 2.84217094e-14 is too small, thus it can be considered noise.
For the eigenvectors that match these eigenvalues, each column of your Eigenvectors matrix is associated to one eigenvalue, and they are in the same order.
For example, your first eigenvalue 8.95193455e+02 is associated to the eigenvector
[[-0.08153196]
[-0.13115467]
[ 0.98800354]]
I need to calculate the condition number of a dense matrix A many times(with some changes to A in each time), where the condition number define to be the largest eigenvalue of A divided by the smallest eigenvalue.
A is in an order of about 1000x5000 and performing
np.linalg.svd(A, compute_uv=False)
take approximately 0.6 seconds
on the other hand
scipy.sparse.linalg.svds(A, 1, which='SM')
and
scipy.sparse.linalg.svds(A, 1)
does not converge due to the density of A (I think)
there is any way to compute only the largest and smallest eigenvalues of a dense matrix? or a way to manipulate A in such a way in which I could use scipy.sparse.linalg.svds without changing the eigenvalues?
I have a large sparse matrix and I want to find its eigenvectors with specific eigenvalue. In scipy.sparse.linalg.eigs, it says the required argument k:
"k is the number of eigenvalues and eigenvectors desired. k must be smaller than N-1. It is not possible to compute all eigenvectors of a matrix".
The problem is that I don't know how many eigenvectors corresponding to the eigenvalue I want. What should I do in this case?
I'd suggest using Singular Value Decomposition (SVD) instead. There is a function from scipy where you can use SVD from scipy.sparse.linalg import svds and it can handle sparse matrix. You can find eigenvalues (in this case will be singular value) and eigenvectors by the following:
U, Sigma, VT = svds(X, k=n_components, tol=tol)
where X can be sparse CSR matrix, U and VT is set of left eigenvectors and right eigenvectors corresponded to singular values in Sigma. Here, you can control number of components. I'd say start with small n_components first and then increase it. You can rank your Sigma and see the distribution of singular value you have. There will be some large number and drop quickly. You can make threshold on how many eigenvectors you want to keep from singular values.
If you want to use scikit-learn, there is a class sklearn.decomposition.TruncatedSVD that let you do what I explained.
I have a MxN array, where M is the number of observations and N is the dimensionality of each vector. From this array of vectors, I need to calculate the mean and minimum euclidean distance between the vectors.
In my mind, this requires me to calculate MC2 distances, which is an O(nmin(k, n-k)) algorithm. My M is ~10,000 and my N is ~1,000, and this computation takes ~45 seconds.
Is there a more efficient way to compute the mean and min distances? Perhaps a probabilistic method? I don't need it to be exact, just close.
You didn't describe where your vectors come from, nor what use you will put mean and median to. Here are some observations about the general case. Limited ranges, error tolerance, and discrete values may admit of a more efficient approach.
The mean distance between M points sounds quadratic, O(M^2). But M / N is 10, fairly small, and N is huge, so the data probably resembles a hairy sphere in 1e3-space. Computing centroid of M points, and then computing M distances to centroid, might turn out to be useful in your problem domain, hard to tell.
The minimum distance among M points is more interesting. Choose a small number of pairs at random, say 100, compute their distance, and take half the minimum as an estimate of the global minimum distance. (Validate by comparing to the next few smallest distances, if desired.) Now use spatial UB-tree to model each point as a positive integer. This involves finding N minima for M x N values, adding constants so min becomes zero, scaling so estimated global min distance corresponds to at least 1.0, and then truncating to integer.
With these transformed vectors in hand, we're ready to turn them into a UB-tree representation that we can sort, and then do nearest neighbor spatial queries on the sorted values. For each point compute an integer. Shift the low-order bit of each dimension's value into the result, then iterate. Continue iterating over all dimensions until non-zero bits have all been consumed and appear in the result, and proceed to the next point. Numerically sort the integer result values, yielding a data structure similar to a PostGIS index.
Now you have a discretized representation that supports reasonably efficient queries for nearest neighbors (though admittedly N=1e3 is inconveniently large). After finding two or more coarse-grained nearby neighbors, you can query the original vector representation to obtain high-resolution distances between them, for finer discrimination. If your data distribution turns out to have a large fraction of points that discretize to being off by single bit from nearest neighbor, e.g. location of oxygen atoms where each has a buddy, then increase the global min distance estimate so the low order bits offer adequate discrimination.
A similar discretization approach would be appropriately scaling e.g. 2-dimensional inputs and marking an initially empty grid, then scanning immediate neighborhoods. This relies on global min being within a "small" neighborhood, due to appropriate scaling. In your case you would be marking an N-dimensional grid.
You may be able to speed things up with some sort of Space Partitioning.
For the minimum distance calculation, you would only need to consider pairs of points in the same or neigbouring partitions. For an approximate mean, you might be able to come up with some sort of weighted average based on the distances between partitions and the number of points within them.
I had the same issue before, and it worked for me once I normalized the values. So try to normalize the data before calculating the distance.
I have a very large sparse matrix which represents a transition martix in a Markov Chain, i.e. the sum of each row of the matrix equals one and I'm interested in finding the first eigenvalue and its corresponding vector which is smaller than one. I know that the eigenvalues are bounded in the section [-1, 1] and they are all real (non-complex).
I am trying to calculate the values using python's scipy.sparse.eigs function, however, one of the parameters of the functions is the number of eigenvalues/vectors to estimate and every time I've increased the number of parameters to estimate, the numbers of eigenvalues which are exactly one grew as well.
Needless to say, I am using the which parameter with the value 'LR' in order to get the k largest eigenvalues, with k being the number of values to estimate.
Does anyone have an idea how to solve this problem (finding the first eigenvalue smaller than one and its corresponding vector)?
I agree with #pv. If your matrix S was symmetric, you could see it as a laplacian matrix of the matrix I - S. The number of connected components of I - S is the number of zero-eigenvalues of this matrix (i.e, the dimension of the space associated to eigenvalue 1 of S). You could check the number of connected components of the graph whose similarity matrix is I - S*S' for a start, e.g. with scipy.sparse.csgraph.connected_components.