I have a large M x N 2D matrix of positive coefficients (8bit unsigned) with M,N~10^3.
I want to optimize the parameters (M*N~10^6) such that my error-function (where I put the matrix in) is minimized. Boundary conditions: Neighboring parameters vary slowly/smoothly.
I have already tried to downscale the matrix in order to reduce the number of parameters and then flatten the matrix to feed it into Scipy's Minimize function. This causes memory errors quickly.
Is there a smart approch to optimize this large coefficient matrix in time ranges less than infinity without applying a low-parametric model?
How can I implement a power method in python that can find eigenvectors corresponding to the two eigenvalues of the largest magnitude by assuring that second vector remains orthogonal to the first one? For simple case, the given matrix will be small and symmetric.
I'm trying to use networkx to calculate the eigenvector centrality of my graph:
import networkx as nx
import pandas as pd
import numpy as np
a = nx.eigenvector_centrality(my_graph)
But I get the error:
NetworkXError: eigenvector_centrality():
power iteration failed to converge in %d iterations."%(i+1))
What is the problem with my graph?
TL/DR: try nx.eigenvector_centrality_numpy.
Here's what's going on: nx.eigenvector_centrality relies on power iteration. The actions it takes are equivalent to repeatedly multiplying a vector by the same matrix (and then normalizing the result). This usually converges to the largest eigenvector. However, it fails when there are multiple eigenvalues with the same (largest) magnitude.
Your graph is a star graph. There are multiple "largest" eigenvalues for a star graph. In the case of a star with just two "peripheral nodes" you can easily check that sqrt(2) and -sqrt(2) are both eigenvalues. More generally sqrt(N) and -sqrt(N) are both eigenvalues, and the other eigenvalues have smaller magnitude. I believe that for any bipartite network, this will happen and the standard algorithm will fail.
The mathematical reason is that after n rounds of iteration, the solution looks like the sum of c_i lambda_i^n v_i/K_n where c_i is a constant that depends on the initial guess, lambda_i is the i-th eigenvalue, v_i is its eigenvector and K is a normalization factor (applied to all terms in the sum). When there is a dominant eigenvalue, lambda_i^n/K_n goes to a nonzero constant for the dominant eigenvalue and 0 for the others.
However in your case, you have two equally large eigenvalues, one is positive (lambda_1) and the other is negative (lambda_2=-lambda_1). The contribution of the smaller eigenvalues still goes to zero. But you're left with (c_1 lambda_1^n v_1 + c_2 lambda_2^n v_2)/K_n. Using lambda_2=-lambda_1 you are left with lambda_1^n(c_1 v_1+(-1)^n c_2v_2)/K_n. Then K_n-> lambda_1^n and this "converges" to c_1 v_1 + (-1)^n c_2 v_2. However, each time you iterate, you go from adding some multiple of v_2 to subtracting that multiple, so it doesn't really converge.
So the simple eigenvalue_centrality that networkx uses won't work. You can instead use nx.eigenvector_centrality_numpy so that numpy is used. That will get you v_1.
Note: With a quick look at the documentation, I'm not 100% positive that the numpy algorithm is guaranteed to be the largest (positive) eigenvalue. It uses a numpy algorithm to find an eigenvector, but I don't see in the documentation of that a guarantee that it is the dominant eigenvector. Most algorithms for finding a single eigenvector will result in the dominant eigenvector, so you're probably alright.
We can add a check to it:
as long as nx.eigenvector_centrality_numpy returns all positive values, the Perron-Frobenius theorem guarantees that this corresponds to the largest eigenvalue.
If some are zero, it gets a bit more tricky to be sure,
and if some are negative than it is not the dominant eigenvector.
I have a MxN array, where M is the number of observations and N is the dimensionality of each vector. From this array of vectors, I need to calculate the mean and minimum euclidean distance between the vectors.
In my mind, this requires me to calculate MC2 distances, which is an O(nmin(k, n-k)) algorithm. My M is ~10,000 and my N is ~1,000, and this computation takes ~45 seconds.
Is there a more efficient way to compute the mean and min distances? Perhaps a probabilistic method? I don't need it to be exact, just close.
You didn't describe where your vectors come from, nor what use you will put mean and median to. Here are some observations about the general case. Limited ranges, error tolerance, and discrete values may admit of a more efficient approach.
The mean distance between M points sounds quadratic, O(M^2). But M / N is 10, fairly small, and N is huge, so the data probably resembles a hairy sphere in 1e3-space. Computing centroid of M points, and then computing M distances to centroid, might turn out to be useful in your problem domain, hard to tell.
The minimum distance among M points is more interesting. Choose a small number of pairs at random, say 100, compute their distance, and take half the minimum as an estimate of the global minimum distance. (Validate by comparing to the next few smallest distances, if desired.) Now use spatial UB-tree to model each point as a positive integer. This involves finding N minima for M x N values, adding constants so min becomes zero, scaling so estimated global min distance corresponds to at least 1.0, and then truncating to integer.
With these transformed vectors in hand, we're ready to turn them into a UB-tree representation that we can sort, and then do nearest neighbor spatial queries on the sorted values. For each point compute an integer. Shift the low-order bit of each dimension's value into the result, then iterate. Continue iterating over all dimensions until non-zero bits have all been consumed and appear in the result, and proceed to the next point. Numerically sort the integer result values, yielding a data structure similar to a PostGIS index.
Now you have a discretized representation that supports reasonably efficient queries for nearest neighbors (though admittedly N=1e3 is inconveniently large). After finding two or more coarse-grained nearby neighbors, you can query the original vector representation to obtain high-resolution distances between them, for finer discrimination. If your data distribution turns out to have a large fraction of points that discretize to being off by single bit from nearest neighbor, e.g. location of oxygen atoms where each has a buddy, then increase the global min distance estimate so the low order bits offer adequate discrimination.
A similar discretization approach would be appropriately scaling e.g. 2-dimensional inputs and marking an initially empty grid, then scanning immediate neighborhoods. This relies on global min being within a "small" neighborhood, due to appropriate scaling. In your case you would be marking an N-dimensional grid.
You may be able to speed things up with some sort of Space Partitioning.
For the minimum distance calculation, you would only need to consider pairs of points in the same or neigbouring partitions. For an approximate mean, you might be able to come up with some sort of weighted average based on the distances between partitions and the number of points within them.
I had the same issue before, and it worked for me once I normalized the values. So try to normalize the data before calculating the distance.
I have a very large sparse matrix which represents a transition martix in a Markov Chain, i.e. the sum of each row of the matrix equals one and I'm interested in finding the first eigenvalue and its corresponding vector which is smaller than one. I know that the eigenvalues are bounded in the section [-1, 1] and they are all real (non-complex).
I am trying to calculate the values using python's scipy.sparse.eigs function, however, one of the parameters of the functions is the number of eigenvalues/vectors to estimate and every time I've increased the number of parameters to estimate, the numbers of eigenvalues which are exactly one grew as well.
Needless to say, I am using the which parameter with the value 'LR' in order to get the k largest eigenvalues, with k being the number of values to estimate.
Does anyone have an idea how to solve this problem (finding the first eigenvalue smaller than one and its corresponding vector)?
I agree with #pv. If your matrix S was symmetric, you could see it as a laplacian matrix of the matrix I - S. The number of connected components of I - S is the number of zero-eigenvalues of this matrix (i.e, the dimension of the space associated to eigenvalue 1 of S). You could check the number of connected components of the graph whose similarity matrix is I - S*S' for a start, e.g. with scipy.sparse.csgraph.connected_components.