Using networkx to calculate eigenvector centrality


I'm trying to use networkx to calculate the eigenvector centrality of my graph:
import networkx as nx
import pandas as pd
import numpy as np
a = nx.eigenvector_centrality(my_graph)
But I get the error:
NetworkXError: eigenvector_centrality():
power iteration failed to converge in %d iterations."%(i+1))
What is the problem with my graph?

TL;DR: try nx.eigenvector_centrality_numpy.
Here's what's going on: nx.eigenvector_centrality relies on power iteration. The actions it takes are equivalent to repeatedly multiplying a vector by the same matrix (and then normalizing the result). This usually converges to the dominant eigenvector, i.e. the eigenvector belonging to the largest eigenvalue. However, it fails when there are multiple eigenvalues with the same (largest) magnitude.
Your graph is a star graph. There are multiple "largest" eigenvalues for a star graph. In the case of a star with just two "peripheral nodes" you can easily check that sqrt(2) and -sqrt(2) are both eigenvalues. More generally sqrt(N) and -sqrt(N) are both eigenvalues, and the other eigenvalues have smaller magnitude. I believe that for any bipartite network, this will happen and the standard algorithm will fail.
The mathematical reason is that after n rounds of iteration, the solution looks like the sum of c_i lambda_i^n v_i/K_n where c_i is a constant that depends on the initial guess, lambda_i is the i-th eigenvalue, v_i is its eigenvector and K_n is a normalization factor (applied to all terms in the sum). When there is a dominant eigenvalue, lambda_i^n/K_n goes to a nonzero constant for the dominant eigenvalue and 0 for the others.
However in your case, you have two equally large eigenvalues, one is positive (lambda_1) and the other is negative (lambda_2=-lambda_1). The contribution of the smaller eigenvalues still goes to zero. But you're left with (c_1 lambda_1^n v_1 + c_2 lambda_2^n v_2)/K_n. Using lambda_2=-lambda_1 you are left with lambda_1^n(c_1 v_1+(-1)^n c_2v_2)/K_n. Then K_n-> lambda_1^n and this "converges" to c_1 v_1 + (-1)^n c_2 v_2. However, each time you iterate, you go from adding some multiple of v_2 to subtracting that multiple, so it doesn't really converge.
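To see the oscillation concretely, here is a minimal numpy sketch of plain power iteration on a small star graph. The helper name power_iteration, the starting vector and the shift parameter are all made up for this illustration; the shift (iterating with A + I, which keeps the eigenvectors but breaks the tie between lambda_1 and -lambda_1) is an assumption of the sketch, not something taken from the question.

```python
import numpy as np

def power_iteration(A, steps, shift=0.0):
    """Plain power iteration on A + shift*I; returns every iterate."""
    n = A.shape[0]
    B = A + shift * np.eye(n)
    # Arbitrary start vector with a component along both v_1 and v_2.
    x = np.arange(1, n + 1, dtype=float)
    x /= np.linalg.norm(x)
    history = []
    for _ in range(steps):
        x = B @ x
        x /= np.linalg.norm(x)  # this is the K_n normalization
        history.append(x.copy())
    return history

# Star graph: node 0 is the centre, nodes 1..3 are the peripheral nodes.
A = np.zeros((4, 4))
A[0, 1:] = 1.0
A[1:, 0] = 1.0

plain = power_iteration(A, 200)
shifted = power_iteration(A, 200, shift=1.0)

# Distance between the last two iterates: stays large without the shift
# (the (-1)^n term keeps flipping), shrinks to ~0 with it.
gap_plain = np.linalg.norm(plain[-1] - plain[-2])
gap_shifted = np.linalg.norm(shifted[-1] - shifted[-2])
print(gap_plain, gap_shifted)
```

Without the shift the iterate keeps jumping between two different vectors no matter how many steps you take; with the shift the eigenvalues become 1 + sqrt(3) and 1 - sqrt(3), so one of them dominates and the iteration settles on v_1.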
So the simple eigenvector_centrality that networkx uses won't work. You can instead use nx.eigenvector_centrality_numpy so that numpy is used. That will get you v_1.
Note: With a quick look at the documentation, I'm not 100% positive that the numpy algorithm is guaranteed to return the eigenvector of the largest (positive) eigenvalue. It uses a numpy algorithm to find an eigenvector, but I don't see in the documentation of that a guarantee that it is the dominant eigenvector. Most algorithms for finding a single eigenvector will result in the dominant eigenvector, so you're probably alright.
We can add a check to it:
as long as nx.eigenvector_centrality_numpy returns all positive values, the Perron-Frobenius theorem guarantees that this corresponds to the largest eigenvalue.
If some are zero, it gets a bit more tricky to be sure,
and if some are negative then it is not the dominant eigenvector.
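The sign check can be sketched roughly like this. The function name classify_eigenvector and the tolerance are hypothetical, the graph is assumed to be connected, and an all-negative vector is treated as a sign-flipped all-positive one. For the demo the eigenvectors come from np.linalg.eigh on a star-graph adjacency matrix; with networkx you would pass in the values of the dict returned by nx.eigenvector_centrality_numpy instead.

```python
import numpy as np

def classify_eigenvector(values, tol=1e-12):
    """Apply the Perron-Frobenius sign check to an eigenvector's entries."""
    v = np.asarray(list(values), dtype=float)
    # All entries strictly one sign: dominant eigenvector (up to a sign flip).
    if np.all(v > tol) or np.all(v < -tol):
        return "dominant"
    # Entries of both signs: cannot be the dominant eigenvector.
    if np.any(v > tol) and np.any(v < -tol):
        return "not dominant"
    # Some entries are (numerically) zero: the check is inconclusive.
    return "unclear"

# Star graph with centre 0 and three peripheral nodes.
A = np.zeros((4, 4))
A[0, 1:] = 1.0
A[1:, 0] = 1.0

# eigh returns eigenvalues in ascending order, eigenvectors as columns.
vals, vecs = np.linalg.eigh(A)
top = classify_eigenvector(vecs[:, -1])     # eigenvalue +sqrt(3)
bottom = classify_eigenvector(vecs[:, 0])   # eigenvalue -sqrt(3)
print(top, bottom)
```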

Related

Approximating a function with a step function with pairwise total error constraints in python

I need to approximate a function y(x) with a step function of height h where each "high" segment has a length l_i=n_i*l_0 and every "low" segment has a length of d_j=n_j*d_0, where n_i and n_j must be integers. The function is strictly positive, monotonically (but not strictly) decreasing and continuous.
My function has been derived in sympy and is available as symbolic equation but it's acceptable to convert to numpy/scipy if beneficial.
My first approach was to solve the segments pairwise.
The end application requires the total difference, i.e. the integral between the approximation and target function, to be minimized pairwise.
Another practical constraint is for the segments to be as short as possible, with the constraint of n being an integer.
I would also need to take over any residual of the integral sum into the next calculation because the total approximation should also minimize the accumulated error.
The approach I thought about taking would involve doing a segment wise integral from x_0 to x_1 and from x_1 to x_2, find for which x_1, x_2 the sum of these integrals changes sign (or is minimized) and then find the lowest common denominator of n_i and n_j.
integral = smp.integrate(y-h,(x,x_0,x_1)) + smp.integrate(y,(x,x_1,x_2))
One approach would be to switch over to scipy.optimize.minimize at this point, however, I have read it has problems with integer values? On the other hand, I don't know how I could find a relationship for x_1(x_2) for which the integral would be close to 0 in sympy either as I just started using sympy yesterday. Any help would be hugely appreciated!
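One way to sidestep the integer problem in scipy.optimize.minimize is a plain brute-force search over small integer pairs. The sketch below is only an illustration under several assumptions: y is available as a numpy-callable function (e.g. via sympy's lambdify), the integrals are done with a simple trapezoid rule, and the names best_pair, max_n and tol are made up. It picks the (n_i, n_j) whose signed error, including the residual carried over from the previous pair, is closest to zero, and prefers the shortest pair on ties.

```python
import numpy as np

def integrate(f, a, b, samples=201):
    """Trapezoid rule for f on [a, b]; f must accept a numpy array."""
    xs = np.linspace(a, b, samples)
    ys = f(xs)
    return float(np.sum((ys[1:] + ys[:-1]) * np.diff(xs)) / 2.0)

def best_pair(y, h, x0, l0, d0, residual=0.0, max_n=20, tol=1e-9):
    """Search integer (n_i, n_j) minimizing |residual + pair error|.

    Returns (n_i, n_j, new_residual); new_residual is the signed error
    to carry into the next pair.
    """
    best = None
    for n_i in range(1, max_n + 1):
        x1 = x0 + n_i * l0
        high = integrate(lambda x: y(x) - h, x0, x1)  # "high" segment
        for n_j in range(1, max_n + 1):
            x2 = x1 + n_j * d0
            err = residual + high + integrate(y, x1, x2)  # plus "low" segment
            length = x2 - x0
            better = best is None or abs(err) < abs(best[2]) - tol
            tie = best is not None and abs(abs(err) - abs(best[2])) <= tol
            if better or (tie and length < best[3]):
                best = (n_i, n_j, err, length)
    return best[:3]

# Example call with a made-up decreasing target function.
n_i, n_j, carry = best_pair(lambda x: 0.8 * np.exp(-0.05 * x),
                            h=1.0, x0=0.0, l0=0.5, d0=0.5)
print(n_i, n_j, carry)
```

To cover the whole range you would call best_pair in a loop, advancing x0 by n_i*l0 + n_j*d0 and passing the returned residual into the next call.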

Getting l1 normalized eigenvectors from python instead of l2?

Consider this matrix:
[.6, .7]
[.4, .3]
This is a Markov chain matrix; the columns each sum to 1. This can represent a population distribution, transition rates, etc.
To get the population at equilibrium, take the eigenvalues and eigenvectors...
From wolfram alpha, the eigenvalues and their corresponding eigenvectors are:
l1 = 1, v1 = [7/4, 1]
l2 = -1/10, v2 = [-1,1]
For the population at equilibrium, take the eigenvector that corresponds to the eigenvalue of 1, and scale it so the total = 1.
Vector = [7/4, 1]
Total = 11/4
So multiply the vector by 4/11...
4/11 * [7/4, 1] = [7/11, 4/11]
Therefore at equilibrium the first state has 7/11 of the population and the other state has 4/11.
If you take the desired eigenvector, [7/4, 1] and l2 normalize it (so all squared values sum up to 1), you get roughly [.868, .496].
That's all fine. But when you get the eigenvectors from python...
mat = np.array([[.6, .7], [.4, .3]])
vals, vecs = np.linalg.eig(mat)
vecs = vecs.T #(because you want left eigenvectors)
One of the eigenvectors it spits out is the [.868, .496] one, for l2 normed ones. Now, you can pretty easily scale it again so the sum of the values is 1 (instead of the sum of THE SQUARE of each value being 1)... just do the vector * 1/sum(vector). But is there a way to skip this step? Why add the computational expense to my script, having to sum up the vector each time I do this? Can you get numpy, scipy, etc to spit out the l1 normalized vector instead of the l2 normalized vector? Also, is that the correct usage of the terms l1 and l2...?
Note: I have seen previous questions asking how to get the markov steady states in this manner. My question is different, I am asking how to get numpy to spit out a vector normalized in the way I want, and I am explaining my reasoning by including the markov part.

I think you're assuming that np.linalg.eig computes eigenvectors and eigenvalues like you would by hand. It doesn't. Under the hood, it uses a highly optimized (and famous) FORTRAN library called LAPACK. This library uses numerical techniques that are sort of out of scope, but long story short it doesn't compute the eigenvalues for a 2x2 like you would by hand. I believe it uses the QR algorithm most of the time, and sometimes QZ, or even others. It's not all that simple: I think it even chooses different algorithms based on the matrix structure/size sometimes (I'm not a LAPACK expert, so don't quote me here). What I do know is that LAPACK has been vetted over about 40 years and it is pretty darned fast, and with great speed comes great complexity.
Wolfram Alpha, on the other hand, is using Mathematica on the backend, which is a symbolic solver (i.e. not floating point arithmetic). That's why you get the "same" result as if you'd do it by hand.
Long story short, getting an L1-normalized eigenvector directly out of np.linalg.eig just isn't possible: if you look at the QR algorithm, each iteration will have the L2 normalized vector (that converges to an eigenvector). You'll have trouble getting it from most numerical libraries for the simple reason that a lot of them depend on LAPACK or use similar algorithms (for instance MATLAB outputs unit vectors as well).
At the end of the day, it doesn't really matter if the vector is normalized or not. It really just has to be in the right direction. If you need to scale it for a proportion, then do that. It'll be vectorized (i.e. fast) by numpy since it's a simple multiply.
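For the matrix in the question, the rescaling step looks roughly like this. Note that np.linalg.eig returns right eigenvectors as the columns of its second output, and for a column-stochastic matrix the steady state is the right eigenvector for eigenvalue 1. The variable names k, v and steady are just for this sketch.

```python
import numpy as np

mat = np.array([[0.6, 0.7],
                [0.4, 0.3]])
vals, vecs = np.linalg.eig(mat)

# Pick the column whose eigenvalue is closest to 1.
k = np.argmin(np.abs(vals - 1.0))
v = vecs[:, k].real

# Rescale so the entries sum to 1 (L1 normalization for a vector whose
# entries share one sign); this also removes any overall sign flip.
steady = v / v.sum()
print(steady)
```

The division is a single vectorized operation, so its cost is negligible next to the eigendecomposition itself.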
HTH.

Efficient calculation of euclidean distance

I have a MxN array, where M is the number of observations and N is the dimensionality of each vector. From this array of vectors, I need to calculate the mean and minimum euclidean distance between the vectors.
In my mind, this requires me to calculate M-choose-2 (MC2) distances, which is an O(n^min(k, n-k)) algorithm, i.e. O(M^2) here. My M is ~10,000 and my N is ~1,000, and this computation takes ~45 seconds.
Is there a more efficient way to compute the mean and min distances? Perhaps a probabilistic method? I don't need it to be exact, just close.

You didn't describe where your vectors come from, nor what use you will put the mean and minimum to. Here are some observations about the general case. Limited ranges, error tolerance, and discrete values may admit of a more efficient approach.
The mean distance between M points sounds quadratic, O(M^2). But M / N is 10, fairly small, and N is huge, so the data probably resembles a hairy sphere in 1e3-space. Computing centroid of M points, and then computing M distances to centroid, might turn out to be useful in your problem domain, hard to tell.
The minimum distance among M points is more interesting. Choose a small number of pairs at random, say 100, compute their distance, and take half the minimum as an estimate of the global minimum distance. (Validate by comparing to the next few smallest distances, if desired.) Now use spatial UB-tree to model each point as a positive integer. This involves finding N minima for M x N values, adding constants so min becomes zero, scaling so estimated global min distance corresponds to at least 1.0, and then truncating to integer.
With these transformed vectors in hand, we're ready to turn them into a UB-tree representation that we can sort, and then do nearest neighbor spatial queries on the sorted values. For each point compute an integer. Shift the low-order bit of each dimension's value into the result, then iterate. Continue iterating over all dimensions until non-zero bits have all been consumed and appear in the result, and proceed to the next point. Numerically sort the integer result values, yielding a data structure similar to a PostGIS index.
Now you have a discretized representation that supports reasonably efficient queries for nearest neighbors (though admittedly N=1e3 is inconveniently large). After finding two or more coarse-grained nearby neighbors, you can query the original vector representation to obtain high-resolution distances between them, for finer discrimination. If your data distribution turns out to have a large fraction of points that discretize to being off by single bit from nearest neighbor, e.g. location of oxygen atoms where each has a buddy, then increase the global min distance estimate so the low order bits offer adequate discrimination.
A similar discretization approach would be appropriately scaling e.g. 2-dimensional inputs and marking an initially empty grid, then scanning immediate neighborhoods. This relies on global min being within a "small" neighborhood, due to appropriate scaling. In your case you would be marking an N-dimensional grid.
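The "choose a small number of pairs at random" step is easy to sketch on its own, and it already gives an approximate mean. This is a much simpler technique than the UB-tree described above: it estimates the mean distance well, but for the minimum it only gives an upper bound. The function name sample_pair_distances and the test data are made up for the illustration.

```python
import numpy as np

def sample_pair_distances(X, n_pairs, seed=0):
    """Euclidean distances for n_pairs random pairs of distinct rows of X."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    i = rng.integers(0, m, size=n_pairs)
    # Adding an offset in 1..m-1 (mod m) guarantees j != i.
    j = (i + rng.integers(1, m, size=n_pairs)) % m
    return np.linalg.norm(X[i] - X[j], axis=1)

# Made-up data standing in for the real M x N array.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))

d = sample_pair_distances(X, 20000)
mean_estimate = d.mean()    # approximates the mean over all pairs
min_upper_bound = d.min()   # the true minimum is at most this value
print(mean_estimate, min_upper_bound)
```

For M ~ 10,000 and N ~ 1,000 the fancy-indexed arrays X[i] and X[j] get large, so you would probably want to draw the pairs in a few smaller batches.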

You may be able to speed things up with some sort of Space Partitioning.
For the minimum distance calculation, you would only need to consider pairs of points in the same or neighbouring partitions. For an approximate mean, you might be able to come up with some sort of weighted average based on the distances between partitions and the number of points within them.

I had the same issue before, and it worked for me once I normalized the values. So try to normalize the data before calculating the distance.

Calculating eigen values of very large sparse matrices in python

I have a very large sparse matrix which represents a transition matrix in a Markov Chain, i.e. the sum of each row of the matrix equals one and I'm interested in finding the first eigenvalue and its corresponding vector which is smaller than one. I know that the eigenvalues are bounded in the interval [-1, 1] and they are all real (non-complex).
I am trying to calculate the values using python's scipy.sparse.linalg.eigs function, however, one of the parameters of the function is the number of eigenvalues/vectors to estimate and every time I've increased the number of parameters to estimate, the number of eigenvalues which are exactly one grew as well.
Needless to say, I am using the which parameter with the value 'LR' in order to get the k largest eigenvalues, with k being the number of values to estimate.
Does anyone have an idea how to solve this problem (finding the first eigenvalue smaller than one and its corresponding vector)?

I agree with @pv. If your matrix S was symmetric, you could see I - S as a Laplacian matrix. The number of connected components is then the number of zero eigenvalues of I - S (i.e., the dimension of the space associated with eigenvalue 1 of S). You could check the number of connected components of the graph whose similarity matrix is I - S*S' for a start, e.g. with scipy.sparse.csgraph.connected_components.
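As a small illustration of that check, here is a sketch on a made-up 4x4 transition matrix with two disconnected blocks. As a simplification it runs connected_components directly on the sparsity pattern of the matrix (weak connectivity) rather than on I - S*S', and it uses dense np.linalg.eigvals only because the example is tiny.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Made-up row-stochastic matrix: states {0, 1} never reach {2, 3}.
P = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.1],
    [0.0, 0.0, 0.2, 0.8],
])

# Count the weakly connected components of the transition graph.
n_components, labels = connected_components(
    csr_matrix(P), directed=True, connection="weak"
)

# Count how many eigenvalues are (numerically) equal to 1.
eigenvalues = np.linalg.eigvals(P)
n_unit = int(np.sum(np.isclose(eigenvalues, 1.0)))
print(n_components, n_unit)
```

In this example each component contributes its own eigenvalue 1, which is the effect described in the question: asking eigs for more eigenvalues just returns more copies of 1. Once you have the labels you can restrict the matrix to a single component and look for its second-largest eigenvalue there.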

Normalizing constant of mixture of dirichlet distribution goes unbounded

I need to calculate PDFs of a mixture of Dirichlet distributions in python. But for each mixture component there is the normalizing constant, which is the inverse beta function, which has the gamma function of the sum of the hyper-parameters as the numerator. So even for a sum of hyper-parameters of size '60' it goes unbounded. Please suggest a workaround for this problem. What happens when I ignore the normalizing constant?
First, it's not the calculation of the NC itself that is the problem. For a single dirichlet I have no problem. But what I have here is a mixture of products of dirichlets, so each mixture component is a product of many dirichlets, each with its own NC. So the product of these goes unbounded. Regarding my objective, I have a joint distribution p(s,T,O), where 's' is discrete, and 'T' and 'O' are the dirichlet variables, i.e. a set of vectors of parameters which sum to '1'. Now since 's' is discrete and finite I have |S| sets of mixtures of products of dirichlet components, one for each 's'. Now my objective here is to find p(s|T,O). So I directly substitute a particular (T,O) and calculate the value of each p('s'|T,O). For this I need to calculate the NCs. If there is only one mixture component then I can ignore the norm constant, calculate, and renormalise at the end, but since I have several mixture components, each component will have a different scaling and so I can't renormalise. This is my conundrum.

Some ideas. (1) To calculate the normalizing factor exactly, maybe you can rewrite the gamma function via gamma(a_i + 1) = a_i gamma(a_i) (a_i need not be an integer, let the base case be a_i < 1) and then you'll have sum(a_i, i, 1, n) terms in the numerator and denominator and you can reorder them so that you divide the largest term by the largest term and multiply those single ratios together instead of computing an enormous numerator and an enormous denominator and dividing those. (2) If you don't need to be exact, maybe you can apply Stirling's approximation. (3) Maybe you don't need the pdf at all. For some purposes, you just need a function which is proportional to the pdf. I believe Markov chain Monte Carlo is like that. So, what is the larger goal you are trying to achieve here?
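A different route from the three ideas above, offered only as a sketch: keep everything in log space. math.lgamma returns the log of the gamma function directly, so the normalizing constant never has to be formed, and the per-component log values can be turned into p(s|T,O) with the usual log-sum-exp trick. The function names log_dirichlet_pdf and posterior_weights are made up here.

```python
import math

def log_dirichlet_pdf(x, alpha):
    """Log density of a Dirichlet(alpha) at x, computed without overflow."""
    # log of Gamma(sum alpha) / prod Gamma(alpha_i)
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))

def posterior_weights(log_terms):
    """Normalize exp(log_terms) to sum to 1 using the log-sum-exp trick."""
    m = max(log_terms)  # subtracting the max keeps exp() in range
    w = [math.exp(t - m) for t in log_terms]
    total = sum(w)
    return [wi / total for wi in w]

# A product of Dirichlets becomes a sum of log densities, so each mixture
# component s contributes one number: log p(s) + sum of log_dirichlet_pdf.
x = [0.2, 0.3, 0.5]
log_component_a = log_dirichlet_pdf(x, [200.0, 300.0, 500.0])
log_component_b = log_dirichlet_pdf(x, [300.0, 300.0, 400.0])
weights = posterior_weights([log_component_a, log_component_b])
print(weights)
```

Because each component keeps its own normalizing constant inside its log value, the different scalings are accounted for and the final renormalization across components is valid.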
