How to calculate a personalized PageRank over millions of nodes? - python

I have a sparse graph containing about a million nodes and 10 million edges. I want to calculate a personalized PageRank for each node, where by personalized PageRank at node n I mean:
# x_0 is a column vector of all zeros, except a 1 in the position corresponding to node n
# adjacency_matrix is a matrix with a 1 in position (i, j) if there is an edge from node i to node j
x_1 = 0.5 * x_0 + 0.5 * adjacency_matrix * x_0
x_2 = 0.5 * x_0 + 0.5 * adjacency_matrix * x_1
x_3 = 0.5 * x_0 + 0.5 * adjacency_matrix * x_2
# x_3 now holds the personalized PageRank scores
# I'm basically approximating the personalized PageRank by running this for only 3 iterations
I tried coding this up using NumPy, but it was taking too long to run (about 1 second to calculate the personalized PageRank for each node).
I also tried changing x_0 to be a matrix (by combining the column vectors of several different nodes), but this didn't help much either, and actually made the computation take much longer (possibly because the matrix gets dense fairly quickly and no longer fits in RAM? I'm not sure).
Is there another suggested way to calculate this, preferably in Python? I also thought about going the non-matrix approach to PageRank calculation, by doing a kind of simulated random walk for three iterations (i.e., I start each node with a score of 1, then propagate this score to its neighbors, etc.), but I'm not sure if this would be any faster. Would it be, and if so, why?

I would have thought a "PageRank" algorithm is best viewed as operating on a directed graph (http://en.wikipedia.org/wiki/Directed_graph), possibly with appropriate weighting.
I like the networkx library at http://networkx.lanl.org
You'll find it also has a PageRank example under its algorithms, which you may be able to adapt.
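For what it's worth, networkx's pagerank function accepts a personalization argument, so a per-node run could look roughly like the toy sketch below. Note that it iterates to convergence rather than stopping after three iterations, and calling it once per node over a million nodes will still be slow; this is only meant to show the API.
import networkx as nx

# toy graph; in practice G would be built from your million-node edge list
G = nx.DiGraph([(0, 1), (1, 2), (2, 0), (2, 3)])
n = 0  # the node to personalize on
personalization = {v: (1.0 if v == n else 0.0) for v in G}
scores = nx.pagerank(G, alpha=0.5, personalization=personalization)
print(scores)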

In your case, the simulated random-walk iterative approach should work fine, provided your data is stored in the right way. When you have very few edges compared to the number of nodes (as in your case), I don't think the matrix approach is a good choice: the matrix is very sparse, yet in practice that approach means checking the existence of an edge from i to j for every pair (i, j). (By the way, I'm not sure how much running time those multiplications by zero really take.)
If your data is stored so that each node object has a list of the destinations of its outgoing links, the random-walk simulation will be rather quick. Ignoring the damping factor, this is what you will actually be doing in each iteration of the simulation:
for node in nodes:
    for destination in node.destinations:
        destination.pageRank += node.pageRank / len(node.destinations)
The time complexity of each iteration is then O(n*k), where n is the number of nodes (1 million in your case) and k is the average number of outgoing edges per node (about 10). This sounds good, if I'm not missing anything here.
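For reference, here is a minimal runnable sketch of that loop with the 0.5 teleport term from the question added back in. out_links is an assumed dict mapping each node to the list of its destinations, and scores are split across destinations (divided by out-degree) as in the loop above, rather than using the raw adjacency multiply from the question.
def personalized_pagerank(out_links, source, iterations=3, teleport=0.5):
    # scores starts as the indicator vector x_0 for the source node
    scores = {source: 1.0}
    for _ in range(iterations):
        new_scores = {source: teleport}  # the 0.5 * x_0 term
        for node, score in scores.items():
            destinations = out_links.get(node, [])
            if not destinations:
                continue
            share = (1.0 - teleport) * score / len(destinations)
            for dest in destinations:
                new_scores[dest] = new_scores.get(dest, 0.0) + share
        scores = new_scores
    return scores

# toy usage
out_links = {0: [1, 2], 1: [2], 2: [0]}
print(personalized_pagerank(out_links, source=0))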

Related

Finding the nearest neighbours for a subset of samples

I have a dataset of about 3 million samples (each with just 3 features). I'm using scikit-learn's sklearn.neighbors module - specifically radius_neighbors_graph - to find which samples fall within a small radius of a specific sample.
This works fine, but unsurprisingly it's really, really slow to compute this graph.
It's also very wasteful, because I only ever need to know the neighbors for a small subset of my samples (~ 100,000 of them) - and I know this subset in advance.
So... is there any way of being more efficient by calculating the neighbours within a given radius for just this subset of samples? It seems like it should be simple, but I can't think of an easy way of doing it.
First of all, the task of creating a radius-neighborhood graph involves computing the N by N distance matrix associated with your dataset. Distance matrices have nice properties that let you save some time, but the complexity still lies somewhere in O(N^2), where N is the number of data points in your data set X.
In your case only a small number n < N of points are of interest as the center of a neighborhood, while the majority of points matter only as potential neighbors. This results in an n by N distance matrix, where row i contains the distances from data point i to every other data point j, 1 <= i <= n, 1 <= j <= N. But this "distance matrix" has none of the desirable properties of a normal distance matrix (it is not even square) that you could use to speed up building an epsilon-neighborhood graph.
Therefore I don't think you will find a predefined function for your case. If you want to build your own, the steps would be as follows: let X be your data set and i be a data point of interest.
Create the distance matrix D: use scipy.spatial.distance_matrix with x the small subset of your data set and y the whole data set.
Create a list, neighbors = []
Loop over the i'th row of the distance matrix. If D(i,j) < epsilon, save j in neighbors; it is the index of a data point in the epsilon-neighborhood of i.
Return neighbors
Of course the computation of the distance matrix should happen once at the beginning (maybe in __init__() if you wrap everything up in a class), and the function/method that returns all epsilon-neighbors of a data point should then only depend on the index of the data point in question.
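A rough sketch of those steps, assuming X is the full array of samples, subset the points of interest, and eps the radius. For a 100,000 by 3,000,000 distance matrix you would likely need to process subset in chunks to keep memory under control.
import numpy as np
from scipy.spatial import distance_matrix

def subset_radius_neighbors(subset, X, eps):
    # n x N matrix: row i holds the distances from subset[i] to every point in X
    D = distance_matrix(subset, X)
    # for each point of interest, the indices of all points within eps
    return [np.flatnonzero(row < eps) for row in D]

# toy usage
X = np.random.rand(1000, 3)
subset = X[:10]
neighbors = subset_radius_neighbors(subset, X, eps=0.1)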
Hope this helps!

Generating an SIS epidemilogical model using Python networkx

I have been told the networkx library is the standard Python library for graph-theoretical applications, but I have found using it quite frustrating so far.
What I want to do is this:
Generating an SIS epidemiological network, assigning initial contact rates and recovery rates and then following the progress of the disease.
More precisely, imagine a network of n individuals and an adjacency matrix A. Values of A are in [0,1] range and are contact rates. This means that the (i,j) entry shows the probability that disease is transferred from node i to node j. Initially, each node is assigned a random label, which can be either 1 (for Infective individuals) or 0 (for Susceptible, which are the ones which have not caught the disease yet).
At each time-step, if the node has a 0 label, then with a probability equal to the maximum value of weights for incoming edges to the node, it can turn into a 1. If the node has a 1 label then with a probability specified by its recovery rate, it can turn into a 0. Recovery rate is a value assigned to each node at the beginning of the simulation, and is in [0,1] range.
And while the network evolves in each time step, I want to display the network with each node label coloured differently.
If somebody knows of any other Python library that can do such a thing more efficiently than networkx, I'd be grateful if you let me know.
Something like this is now possible with EoN.
You appear to want a discrete SIS epidemic with weighted edges.
At present this is the one common case I seem to have left out: here's the bug report I created a while ago. The pandemic has sapped my time to work on this.
https://github.com/springer-math/Mathematics-of-Epidemics-on-Networks/issues/40
What it can do right now is discrete time SIS where each edge is equally weighted. It can also do continuous time SIS or SIR as well as discrete time SIR where the edges may or may not be weighted.
A basic SIS simulation is:
import networkx as nx
import EoN
import matplotlib.pyplot as plt
G = nx.fast_gnp_random_graph(1000, 0.002)
t, S, I = EoN.basic_discrete_SIS(G, 0.6, tmax=20)  # returns times and S/I counts per step
plt.plot(t, S)
plt.show()
Do you use networkx for calculation or for visualization?
There is no need to use it for the calculation, since your model is simple and it is easier to compute with matrix (vector) operations. That is well suited to numpy.
The main part of a step is calculating the probability of switching from 0 to 1. Let N be the vector that stores 0 or 1 for each node depending on its state. Then the probability that node n switches from 0 to 1 is numpy.amax(A[n,:] * N).
If you need visualization, then there are probably better libraries than networkx.
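A minimal sketch of one such vectorized time step; A, state and recovery are illustrative names for the contact-rate matrix, the 0/1 label vector and the per-node recovery rates.
import numpy as np

def sis_step(A, state, recovery, rng):
    # for each node, the largest weight among edges to currently infective nodes
    p_infect = np.max(A * state, axis=1)
    draws = rng.random(state.shape)
    new_infections = (state == 0) & (draws < p_infect)
    recoveries = (state == 1) & (draws < recovery)
    new_state = state.copy()
    new_state[new_infections] = 1
    new_state[recoveries] = 0
    return new_state

# toy usage
rng = np.random.default_rng(0)
A = rng.random((5, 5))
state = np.array([1, 0, 0, 0, 0])
recovery = rng.random(5)
state = sis_step(A, state, recovery, rng)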

High frequency noise at solving differential equation

I'm trying to simulate a simple diffusion based on Fick's 2nd law.
from pylab import *
import numpy as np

gridpoints = 128

def profile(x):
    range = 2.
    straggle = .1576
    dose = 1
    return dose/(sqrt(2*pi)*straggle)*exp(-(x-range)**2/2/straggle**2)

x = linspace(0, 4, gridpoints)
nx = profile(x)

dx = x[1] - x[0]  # use np.diff(x) if x is not uniform
dxdx = dx**2

figure(figsize=(12, 8))
plot(x, nx)

timestep = 0.5
steps = 21
diffusion_coefficient = 0.002

for i in range(steps):
    coefficients = [-1.785714e-3, 2.539683e-2, -0.2e0, 1.6e0,
                    -2.847222e0,
                    1.6e0, -0.2e0, 2.539683e-2, -1.785714e-3]
    # second order derivative via a 9-point finite-difference stencil
    ccf = (np.convolve(nx, coefficients) / dxdx)[4:-4]
    nx = timestep * diffusion_coefficient * ccf + nx
    plot(x, nx)
For the first few time steps everything looks fine, but then I start to get high-frequency noise, due to build-up of numerical errors which are amplified through the second derivative. Since it seems to be hard to increase the float precision, I'm hoping there is something else I can do to suppress this. I have already increased the number of points that are being used to construct the 2nd derivative.
I don't have the time to study your solution in detail, but it seems that you are solving the partial differential equation with a forward Euler scheme. This is pretty easy to implement, as you show, but it becomes numerically unstable if your timestep is too large. Your only options are to reduce the timestep or to coarsen the spatial grid (increase dx).
The easiest way to explain this is for the 1-D case: assume your concentration is a function of spatial coordinate x and timestep i. If you do all the math (write down your equations, substitute the partial derivatives with finite differences, should be pretty easy), you will probably get something like this:
C(x, i+1) = [1 - 2 * k] * C(x, i) + k * [C(x - 1, i) + C(x + 1, i)]
so the concentration of a point on the next step depends on its previous value and the ones of its two neighbors. It is not too hard to see that when k = 0.5, every point gets replaced by the average of its two neighbors, so a concentration profile of [...,0,1,0,1,0,...] will become [...,1,0,1,0,1,...] on the next step. If k > 0.5, such a profile will blow up exponentially. You calculate your second order derivative with a longer convolution (I effectively use [1,-2,1]), but I guess that does not change anything for the instability problem.
I don't know about normal diffusion, but based on experience with thermal diffusion, I would guess that k scales with dt * diffusion_coeff / dx^2. You thus have to choose your timestep small enough that your simulation does not become unstable. To keep the simulation stable but still as fast as possible, choose your parameters so that k is a bit smaller than 0.5. Something similar can be derived for the 2-D and 3-D cases. The easiest way to achieve this is to increase dx, since your total calculation time will scale with 1/dx^3 for a 1-D problem, 1/dx^4 for 2-D problems, and even 1/dx^5 for 3-D problems.
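As a quick sanity check with the numbers from the question (a heuristic only, since the 9-point stencil has a slightly different stability bound than the simple [1, -2, 1] one):
# dx from linspace(0, 4, 128), timestep = 0.5, diffusion_coefficient = 0.002
dx = 4.0 / (128 - 1)
k = 0.5 * 0.002 / dx**2
print(k)  # roughly 1.0, i.e. above the 0.5 rule-of-thumb limit, so noise is expected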
There are better methods to solve diffusion equations; I believe Crank-Nicolson is at least standard for solving heat equations (which are also diffusion problems). The 'problem' is that it is an implicit method, which means you have to solve a set of equations to calculate your 'concentration' at the next timestep, which is a bit of a pain to implement. But this method is guaranteed to be numerically stable, even for big timesteps.
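For completeness, here is a bare-bones sketch of what a Crank-Nicolson step for u_t = D*u_xx could look like with scipy's sparse solver. The end rows implicitly assume zero concentration just outside the grid, so adapt the boundary handling to your actual problem.
import numpy as np
from scipy.sparse import diags, identity
from scipy.sparse.linalg import splu

def crank_nicolson_stepper(n, dx, dt, D):
    r = D * dt / dx**2
    # standard [1, -2, 1] second-derivative stencil as a sparse matrix
    L = diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n))
    lhs = splu((identity(n) - 0.5 * r * L).tocsc())
    rhs = identity(n) + 0.5 * r * L

    def step(u):
        return lhs.solve(rhs @ u)

    return step

# usage on the grid from the question
step = crank_nicolson_stepper(128, 4.0 / 127, 0.5, 0.002)
u = np.exp(-np.linspace(0, 4, 128))
for _ in range(21):
    u = step(u)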

Fitting curve: why small numbers are better?

I spent some time these days on a problem. I have a set of data:
y = f(t), where y is a very small concentration (around 10^-7) and t is in seconds. t varies from 0 to around 12000.
The measurements follow an established model:
y = Vs * t - ((Vs - Vi) * (1 - np.exp(-k * t)) / k)
And I need to find Vs, Vi, and k. So I used curve_fit, which returns the best fitting parameters, and I plotted the curve.
And then I used a similar model:
y = (Vs * t/3600 - ((Vs - Vi) * (1 - np.exp(-k * t/3600)) / k)) * 10**7
By doing that, t is a number of hours, and y is a number between 0 and about 10. The parameters returned are of course different. But when I plot each curve, here is what I get:
http://i.imgur.com/XLa4LtL.png
The green fit is the first model, the blue one the "normalized" model, and the red dots are the experimental values.
The fitted curves are different. I didn't expect that, and I don't understand why. Are the calculations more accurate if the numbers are "reasonable"?
The docstring for optimize.curve_fit says,
p0 : None, scalar, or M-length sequence
    Initial guess for the parameters. If None, then the initial
    values will all be 1 (if the number of parameters for the function
    can be determined using introspection, otherwise a ValueError
    is raised).
Thus, to begin with, the initial guess for the parameters is by default 1.
Moreover, curve fitting algorithms have to sample the function for various values of the parameters. The "various values" are initially chosen with an initial step size on the order of 1. The algorithm will work better if your data varies somewhat smoothly with changes in the parameter values that are on the order of 1.
If the function varies wildly with parameter changes on the order of 1, then the algorithm may tend to miss the optimum parameter values.
Note that even if the algorithm uses an adaptive step size when it tweaks the parameter values, if the initial tweak is so far off the mark as to produce a big residual, and if tweaking in some other direction happens to produce a smaller residual, then the algorithm may wander off in the wrong direction and miss the local minimum. It may find some other (undesired) local minimum, or simply fail to converge. So using an algorithm with an adaptive step size won't necessarily save you.
The moral of the story is that scaling your data can improve the algorithm's chances of finding the desired minimum.
Numerical algorithms in general all tend to work better when applied to data whose magnitude is on the order of 1. This bias enters into the algorithm in numerous ways. For instance, optimize.curve_fit relies on optimize.leastsq, and the call signature for optimize.leastsq is:
def leastsq(func, x0, args=(), Dfun=None, full_output=0,
            col_deriv=0, ftol=1.49012e-8, xtol=1.49012e-8,
            gtol=0.0, maxfev=0, epsfcn=None, factor=100, diag=None):
Thus, by default, the tolerances ftol and xtol are on the order of 1e-8. If finding the optimum parameter values requires much smaller tolerances, then these hard-coded defaults will cause optimize.curve_fit to miss the optimal parameter values.
To make this more concrete, suppose you were trying to minimize f(x) = 1e-100*x**2. The factor of 1e-100 squashes the y-values so much that a wide range of x-values (the parameter values mentioned above) will fit within the tolerance of 1e-8. So, with un-ideal scaling, leastsq will not do a good job of finding the minimum.
Another reason to use floats on the order of 1 is because there are many more (IEEE754) floats in the interval [-1,1] than there are far away from 1. For example,
import struct

def floats_between(x, y):
    """
    http://stackoverflow.com/a/3587987/190597 (jsbueno)
    """
    a = struct.pack("<dd", x, y)
    b = struct.unpack("<qq", a)
    return b[1] - b[0]
In [26]: floats_between(0,1) / float(floats_between(1e6,1e7))
Out[26]: 311.4397707054894
This shows there are over 300 times as many floats representing numbers between 0 and 1 as there are in the interval [1e6, 1e7].
Thus, all else being equal, you'll typically get a more accurate answer if working with small numbers than very large numbers.
I would imagine it has more to do with the initial parameter estimates you are passing to curve_fit. If you are not passing any, I believe they all default to 1. Normalizing your data makes those initial estimates closer to the truth. If you don't want to use normalized data, just pass reasonable initial estimates yourself.
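For example, something along these lines, where t and y stand for your measured arrays and the p0 entries are placeholders you would replace with order-of-magnitude guesses for your own data:
import numpy as np
from scipy.optimize import curve_fit

def model(t, Vs, Vi, k):
    return Vs * t - (Vs - Vi) * (1 - np.exp(-k * t)) / k

# t and y are the measured arrays; the guesses below are placeholders only
Vs_guess, Vi_guess, k_guess = 1e-11, 1e-11, 1e-3
popt, pcov = curve_fit(model, t, y, p0=[Vs_guess, Vi_guess, k_guess])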
Others have already mentioned that you probably need a good starting guess for your fit. In cases like this, I usually try to find some quick and dirty tricks to get at least a ballpark estimate of the parameters. In your case, for large t the exponential decays quickly to zero, so you have
y == Vs * t - (Vs - Vi) / k
Doing a first-order linear fit like
[slope1, offset1] = polyfit(t[t > 2000], y[t > 2000], 1)
you will get slope1 == Vs and offset1 == (Vi - Vs) / k.
Subtracting this straight line from all the points you have, you get the exponential
residual == y - slope1 * t - offset1 == (Vs - Vi) * exp(-t * k) / k
Taking the log of both sides, you get
log(residual) == log((Vs - Vi) / k) - t * k
So doing a second fit
[slope2, offset2] = polyfit(t, log(y - slope1 * t - offset1), 1)
will give you slope2 == -k and offset2 == log((Vs - Vi) / k), which, since you already know Vs and k, can be solved for Vi. You might have to limit the second fit to small values of t, otherwise you might be taking the log of negative numbers. Collect all the parameters you obtained with these fits and use them as the starting point for your curve_fit.
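Put together, this quick-and-dirty estimation could look roughly like the following sketch, where t and y are the measured arrays and the t > 2000 and residual > 0 cutoffs are the kind of thing you may need to tune:
import numpy as np

late = t > 2000
slope1, offset1 = np.polyfit(t[late], y[late], 1)    # slope1 ~ Vs, offset1 ~ (Vi - Vs) / k

residual = y - slope1 * t - offset1                   # ~ (Vs - Vi) * exp(-k * t) / k
ok = residual > 0                                     # avoid taking the log of non-positive values
slope2, offset2 = np.polyfit(t[ok], np.log(residual[ok]), 1)

k0 = -slope2
Vs0 = slope1
Vi0 = Vs0 - k0 * np.exp(offset2)                      # offset2 ~ log((Vs - Vi) / k)
p0 = [Vs0, Vi0, k0]                                   # starting point for curve_fit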
Finally, you might want to look into doing some sort of weighted fit. The information about the exponential part of your curve is contained in just the first few points, so maybe you should give those a higher weight. Doing this in a statistically correct way is not trivial.

Calculate Hitting Time between 2 nodes using NetworkX

I would like to know if I can use NetworkX to implement hitting time. Basically, I want to calculate the hitting time between any 2 nodes in a graph. My graph is unweighted and undirected. If I understand hitting time correctly, it is very similar to the idea of PageRank.
Any idea how can I implement hitting time using the PageRank method provided by NetworkX?
May I know if there's any good starting point to work with?
I've checked MapReduce, Python and NetworkX, but I'm not quite sure how it works.
You don't need networkX to solve the problem; numpy can do it if you understand the math behind it. An undirected, unweighted graph can always be represented by a 0/1 adjacency matrix, and the n-th power of this matrix counts the number of walks of length n from node i to node j. Instead, we can work with a Markov matrix, the row-normalized form of the adjacency matrix, whose powers describe a random walk over the graph. If the graph is small, you can take powers of the matrix and look at the (start, end) entry you are interested in. Make the final state an absorbing one, so that once the walk hits that node it can't escape. The (start, end) entry of the n-th power then gives the probability that the walk has hit the target within n steps, and the hitting-time distribution can be computed from this function (since steps are discrete, you know exactly when each bit of probability arrives).
Below is an example with a simple graph defined by its edge list; at the end, I plot this hit-probability function against the number of steps.
from numpy import *

hit_idx = (0, 4)

# Define a graph by its edge list
edges = [[0, 1], [1, 2], [2, 3], [2, 4]]

# Create the adjacency matrix
A = zeros((5, 5))
rows, cols = zip(*edges)
A[rows, cols] = 1

# Undirected condition
A += A.T

# Make the final state an absorbing condition
A[hit_idx[1], :] = 0
A[hit_idx[1], hit_idx[1]] = 1

# Make a proper Markov matrix by row normalizing
A = (A.T / A.sum(axis=1)).T

B = A.copy()
Z = []
for n in range(100):
    Z.append(B[hit_idx])
    B = dot(B, A)

from pylab import *
plot(Z)
xlabel("steps")
ylabel("hit probability")
show()
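If you want a single expected hitting time out of Z, one option (assuming the start and end nodes differ and 100 steps is enough for the hit probability to essentially converge) is to use E[T] = sum over n >= 0 of P(T > n), where Z[i] is the probability of having hit the target within i + 1 steps:
import numpy as np

Z = np.asarray(Z)
expected_hitting_time = 1.0 + np.sum(1.0 - Z)
print(expected_hitting_time)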
