Calculate Hitting Time between 2 nodes using NetworkX - python

I would like to know if I can use NetworkX to implement hitting time. Basically, I want to calculate the hitting time between any two nodes in a graph. My graph is unweighted and undirected. If I understand hitting time correctly, it is very similar to the idea of PageRank.
Any idea how I can implement hitting time using the PageRank method provided by NetworkX?
Is there a good starting point to work with?
I've checked: MapReduce, Python and NetworkX, but I'm not quite sure how it works.

You don't need networkx to solve this problem; numpy can do it if you understand the math behind it. An undirected, unweighted graph can always be represented by a 0/1 adjacency matrix, and the (i,j) entry of its nth power counts the number of walks of length n from node i to node j. Instead of the raw adjacency matrix, we can work with the Markov (transition) matrix, which is its row-normalized form; powers of this matrix describe a random walk over the graph. If the graph is small, you can take successive powers of the matrix and look at the (start, end) entry you are interested in. Make the final state absorbing, so that once the walk hits it, it cannot escape. The (start, end) entry of the nth power is then the probability that the walk has hit the target within n steps. The hitting time can be computed from this function, since for discrete steps you know exactly when the walk first arrives.
Below is an example with a simple graph defined by its edge list (a path 0-1-2, with nodes 3 and 4 both attached to node 2). At the end, I plot the hitting-time function.
import numpy as np
import matplotlib.pyplot as plt

hit_idx = (0, 4)   # (start node, target node)

# Define a graph by its edge list
edges = np.array([[0, 1], [1, 2], [2, 3], [2, 4]])

# Create the adjacency matrix
A = np.zeros((5, 5))
A[edges[:, 0], edges[:, 1]] = 1

# Undirected condition: symmetrize
A += A.T

# Make the final state an absorbing one
A[hit_idx[1], :] = 0
A[hit_idx[1], hit_idx[1]] = 1

# Make a proper Markov matrix by row normalizing
A = (A.T / A.sum(axis=1)).T

B = A.copy()
Z = []
for n in range(100):
    Z.append(B[hit_idx])   # probability of having hit the target within n+1 steps
    B = B.dot(A)

plt.plot(Z)
plt.xlabel("steps")
plt.ylabel("hit probability")
plt.show()
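If you want a single expected hitting time rather than the whole curve, you can recover it from Z, since the increments of Z are the first-passage probabilities. A small sketch, assuming the walk is absorbed with probability close to 1 within the simulated 100 steps:
Z = np.array(Z)
first_hit = np.diff(np.concatenate(([0.0], Z)))    # P(first arrival exactly at step n+1)
steps = np.arange(1, len(Z) + 1)
expected_hitting_time = (steps * first_hit).sum()  # approximate mean hitting time 0 -> 4
print(expected_hitting_time)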

Related

Getting nodes without edges (when N is larger than 60)

First I generated an NxN matrix of zeros and ones using NumPy. After that, I made a copy of that matrix and replaced its ones with the weights of the edges (the matrix is symmetric, connected and undirected, and its diagonal is zero, like the original matrix). I used BFS to check whether it is connected and found it connected every time. Then I used SciPy to find the MST (Minimum Spanning Tree). After that, I drew the MST using NetworkX.
For generating the NxN matrix of zeros and ones:
import numpy as np

shape = 75   # N, the number of nodes (75 in the failing example below)
base = np.zeros((shape, shape))
for _ in range(100):
    a = np.random.randint(shape)
    b = np.random.randint(shape)
    if a != b:
        base[a, b] = 1
        base[b, a] = 1
For generating the NxN matrix with the edge weights:
# Work on a copy of the 0/1 matrix and fetch the locations of the 1s
Weightofedges = base.copy()
ones = np.argwhere(Weightofedges == 1)
ones = ones[ones[:, 0] < ones[:, 1], :]
# Assign random weights symmetrically
for a, b in ones:
    Weightofedges[a, b] = Weightofedges[b, a] = np.random.randint(100)
Find the MST using SciPy
from scipy.sparse.csgraph import minimum_spanning_tree
X = minimum_spanning_tree(Weightofedges)
print("The Output Of The MST By Kruskal Algorithm:")
print(" Edges: Weights:")
print(X)
print("-----------------------")
my_matrix3 = X.toarray().astype(int)
The problem: when I input a matrix with a large number of nodes, I get some nodes that are not connected by any edge.
e.g.
Number of nodes equals 75
Number of edges equals 65
In an MST, the number of edges must be N-1, where N is the number of nodes.
This is the graph for N = 75 (as shown in the attached figure, there are nodes without edges).
You have created a weighted version of the Erdős–Rényi model, to be exact the ER variant G(n,M) with n nodes and M edges. Currently you have fixed M=100, and you observe for n>60 that your graph becomes disconnected. This is typical, and (at least for the second ER variant G(n,p), with n nodes and edge probability p) you can even calculate the threshold at which you almost surely get a single/large connected component. But even without the math, you can intuitively see that it becomes difficult to connect 75 nodes with only 100 random edges.
I recommend that you check out the networkx package if you want to do more with graphs in Python. For example, it implements the G(n,p) variant as erdos_renyi_graph.
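If it helps to see the threshold in action, here is a quick empirical check with networkx (a sketch; the log(n)/n threshold applies to the G(n,p) variant):
import math
import networkx as nx

n = 75
p_threshold = math.log(n) / n   # sharp connectivity threshold for G(n, p)

for p in (0.5 * p_threshold, p_threshold, 2 * p_threshold):
    G = nx.erdos_renyi_graph(n, p, seed=0)
    print(p, G.number_of_edges(), nx.is_connected(G))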

Generating an SIS epidemiological model using Python networkx

I have been told that the networkx library is the standard Python library for graph-theoretical applications, but I have found using it quite frustrating so far.
What I want to do is this:
Generating an SIS epidemiological network, assigning initial contact rates and recovery rates and then following the progress of the disease.
More precisely, imagine a network of n individuals and an adjacency matrix A. The values of A are in the [0,1] range and are contact rates: the (i,j) entry is the probability that the disease is transferred from node i to node j. Initially, each node is assigned a random label, which can be either 1 (for Infective individuals) or 0 (for Susceptible individuals, the ones that have not caught the disease yet).
At each time step, if a node has label 0, then with a probability equal to the maximum weight among its incoming edges it can turn into a 1. If the node has label 1, then with a probability given by its recovery rate it can turn into a 0. The recovery rate is a value in the [0,1] range assigned to each node at the beginning of the simulation.
And while the network evolves in each time step, I want to display the network with each node label coloured differently.
If somebody knows of any other Python library that can do such a thing more efficiently than networkx, I'd be grateful if you let me know.
Something like this is now possible with EoN.
You appear to want a discrete SIS epidemic with weighted edges.
At present this is the one common case I seem to have left out: here's the bug report I created a while ago. The pandemic has sapped my time to work on this.
https://github.com/springer-math/Mathematics-of-Epidemics-on-Networks/issues/40
What it can do right now is discrete time SIS where each edge is equally weighted. It can also do continuous time SIS or SIR as well as discrete time SIR where the edges may or may not be weighted.
A basic SIS simulation is:
import networkx as nx
import EoN
import matplotlib.pyplot as plt
G = nx.fast_gnp_random_graph(1000,0.002)
t, S, I = EoN.basic_discrete_SIS(G, 0.6, tmax = 20)
plt.plot(t,S)
plt.show()
Do you use networkx for calculation or visualization?
There is no need to use it for the calculation, since your model is simple and it is easier to compute it with matrix (vector) operations. That is a good fit for numpy.
The main part of a step is calculating the probability of switching from 0 to 1. Let N be a vector that stores 0 or 1 for each node depending on its state. Then the probability that node n switches from 0 to 1 is numpy.amax(A[n,:] * N).
If you need visualization, then there are probably better libraries than networkx.
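A minimal numpy sketch of one such update step, following the rule described in the question (the matrix A, the state vector and the recovery rates here are randomly generated placeholders):
import numpy as np

rng = np.random.default_rng(0)
n = 50
A = rng.random((n, n)) * (rng.random((n, n)) < 0.05)   # random contact-rate matrix
A = np.triu(A, 1)
A = A + A.T                                            # symmetric, zero diagonal
state = rng.integers(0, 2, size=n)                     # 1 = infective, 0 = susceptible
recovery = rng.random(n)                               # per-node recovery rates

def step(state):
    p_infect = (A * state).max(axis=1)        # max incoming weight from infective neighbours
    u = rng.random(n)
    new_state = state.copy()
    new_state[(state == 0) & (u < p_infect)] = 1   # susceptible -> infective
    new_state[(state == 1) & (u < recovery)] = 0   # infective -> susceptible again (recovery)
    return new_state

for _ in range(20):
    state = step(state)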

Problems in performing K means clustering

I am trying to cluster the following data from a CSV file with K means clustering.
Sample1,Sample2,45
Sample1,Sample3,69
Sample1,Sample4,12
Sample2,Sample2,46
Sample2,Sample1,78
It is basically a graph where the samples are nodes and the numbers are the edge weights.
I read the file as following:
import csv

fileopening = open('data.csv', 'r')
reading = csv.reader(fileopening, delimiter=',')
L = list(reading)
I used this code: https://gist.github.com/betzerra/8744068
Here clusters are built based on the following:
num_points, dim, k, cutoff, lower, upper = 10, 2, 3, 0.5, 0, 200
points = map(lambda i: makeRandomPoint(dim, lower, upper), range(num_points))
clusters = kmeans(points, k, cutoff)
for i, c in enumerate(clusters):
    for p in c.points:
        print " Cluster: ", i, "\t Point :", p
I replaced points with my list L, but I got lots of errors: AttributeError, 'int' object has no attribute 'n', etc.
I need to perform k-means clustering based on the third column (the edge weights) of my CSV file. The tutorial creates random points, but I am not sure how to feed this CSV data into its k-means function. How can I perform k-means (k=2) on my data, and how do I pass the CSV file data as input to this k-means function?
In short "you can't".
Long answer:
K-means is defined for euclidean spaces only and it requires a valid points positions, while you only have distances between them, probably not in a strict mathematical sense but rather some kind of "similarity". K-means is not designed to work with similarity matrices.
What you can do?
You can use some other method to embeed your points in euclidean space in such a way, that they closely reasamble your distances, one of such tools is Multidimensional scaling (MDS): http://en.wikipedia.org/wiki/Multidimensional_scaling
Once point 1 is done you can run k-means
Alternatively you can also construct a kernel (valid in a Mercer's sense) by performing some kernel learning techniques to reasamble your data and then run kernel k-means on the resulting Gram matrix.
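For the MDS route, the two steps could look roughly like this with scikit-learn (a sketch; D is a made-up symmetric dissimilarity matrix standing in for whatever you assemble from your CSV):
import numpy as np
from sklearn.manifold import MDS
from sklearn.cluster import KMeans

# D is a toy symmetric dissimilarity matrix; in practice you would build it
# from your CSV (deciding how to reconcile asymmetric entries like 45 vs 78)
D = np.array([[ 0., 45., 69., 12.],
              [45.,  0., 50., 78.],
              [69., 50.,  0., 30.],
              [12., 78., 30.,  0.]])

coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(D)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coords)
print(labels)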
As lejlot said, distances between points alone are not enough to run k-means in the classic sense. It's easy to understand if you know how k-means works. On a high level, k-means does the following:
1) Randomly assign points to clusters.
(Technically, there are more sophisticated ways of initial partitioning, but that's not essential right now.)
2) Compute the centroids of the clusters.
(This is where you need the actual coordinates of the points.)
3) Reassign each point to the cluster with the closest centroid.
4) Repeat steps 2)-3) until a stop condition is met.
So, as you can see, k-means in its classic interpretation will not work here, because it is unclear how to compute centroids. However, I have several suggestions for what you could do.
Suggestion 1.
Embed your points in N-dimensional space, where N is the number of points, so that the coordinates of each point are the distances to all the other points.
For example the data you showed:
Sample1,Sample2,45
Sample1,Sample3,69
Sample1,Sample4,12
Sample2,Sample2,46
Sample2,Sample1,78
becomes:
Sample1: (0,45,69,12,...)
Sample2: (78,46,0,0,...)
Then you can legitimately use Euclidean distance. Note that the actual distances between points will not be preserved, but this could be a simple and reasonable approximation that preserves the relative distances between the points. Another disadvantage is that if you have a lot of points, the memory (and running time) requirements will be on the order of N^2.
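A rough sketch of that embedding, assuming the triplets have been read from the CSV into a list of (row sample, column sample, value) tuples:
import numpy as np
from sklearn.cluster import KMeans

rows = [("Sample1", "Sample2", 45), ("Sample1", "Sample3", 69),
        ("Sample1", "Sample4", 12), ("Sample2", "Sample2", 46),
        ("Sample2", "Sample1", 78)]

names = sorted({r[0] for r in rows} | {r[1] for r in rows})
index = {name: i for i, name in enumerate(names)}

# Each point's coordinates are its distances to every sample (0 where unknown)
X = np.zeros((len(names), len(names)))
for a, b, d in rows:
    X[index[a], index[b]] = d

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(names, labels)))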
Suggestion 2.
Instead of k-means, try k-medoids. For this one, you do not need the actual coordinates of the points, because instead of a centroid you compute a medoid. The medoid of a cluster is the point from that cluster which has the smallest average distance to all other points in the cluster. You could look for implementations online, or it's actually pretty easy to implement yourself. The running time will be proportional to N^2 as well.
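If you want to try that without an extra dependency, here is a tiny k-medoids sketch operating directly on a precomputed dissimilarity matrix D (a plain Voronoi-iteration variant; scikit-learn-extra also ships a KMedoids estimator if you prefer a packaged one):
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    """Very small k-medoids on a precomputed dissimilarity matrix D."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)       # assign each point to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members):
                # pick the member with the smallest total distance to its cluster
                costs = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return labels, medoids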
Final remark.
Why do you want to use k-means at all? It seems like you have a weighted directed graph, and there are clustering algorithms designed specifically for graphs. This is beyond the scope of your question, but maybe it is something worth considering?

High frequency noise at solving differential equation

I'm trying to simulate a simple diffusion based on Fick's 2nd law.
from pylab import *
import numpy as np

gridpoints = 128

def profile(x):
    range = 2.
    straggle = .1576
    dose = 1
    return dose/(sqrt(2*pi)*straggle)*exp(-(x-range)**2/2/straggle**2)

x = linspace(0, 4, gridpoints)
nx = profile(x)
dx = x[1] - x[0]   # use np.diff(x) if x is not uniform
dxdx = dx**2

figure(figsize=(12, 8))
plot(x, nx)

timestep = 0.5
steps = 21
diffusion_coefficient = 0.002

# 9-point finite-difference stencil for the second derivative
coefficients = [-1.785714e-3, 2.539683e-2, -0.2e0, 1.6e0,
                -2.847222e0,
                1.6e0, -0.2e0, 2.539683e-2, -1.785714e-3]

for i in range(steps):
    ccf = (np.convolve(nx, coefficients) / dxdx)[4:-4]   # second order derivative
    nx = timestep*diffusion_coefficient*ccf + nx
    plot(x, nx)
For the first few time steps everything looks fine, but then I start to get high-frequency noise, due to build-up of numerical errors which are amplified through the second derivative. Since it seems to be hard to increase the float precision, I'm hoping there is something else I can do to suppress this? I already increased the number of points that are used to construct the 2nd derivative.
I don't have the time to study your solution in detail, but it seems that you are solving the partial differential equation with a forward Euler scheme. This is pretty easy to implement, as you show, but it becomes numerically unstable if your timestep is too large. Your only options are to reduce the timestep or to coarsen the spatial grid (increase dx).
The easiest way to explain this is for the 1-D case: assume your concentration is a function of spatial coordinate x and timestep i. If you do all the math (write down your equations, substitute the partial derivatives with finite differences, should be pretty easy), you will probably get something like this:
C(x, i+1) = [1 - 2 * k] * C(x, i) + k * [C(x - 1, i) + C(x + 1, i)]
so the concentration at a point on the next step depends on its previous value and the values of its two neighbours. It is not too hard to see that when k = 0.5, every point gets replaced by the average of its two neighbours, so a concentration profile of [...,0,1,0,1,0,...] will become [...,1,0,1,0,1,...] on the next step. If k > 0.5, such a profile will blow up exponentially. You calculate your second-order derivative with a longer convolution stencil, whereas my analysis effectively uses [1,-2,1], but I guess that does not change anything for the instability problem.
I don't know about this kind of diffusion, but based on experience with thermal diffusion, I would guess that k scales as dt * diffusion_coeff / dx^2. You thus have to choose your timestep small enough that your simulation does not become unstable. To make the simulation stable but still as fast as possible, choose your parameters so that k is a bit smaller than 0.5. Something similar can be derived for the 2-D and 3-D cases. The easiest way to achieve this is to increase dx, since your total calculation time scales with 1/dx^3 for a 1-D problem, 1/dx^4 for 2-D problems, and even 1/dx^5 for 3-D problems.
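Plugging in the numbers from the question supports this (a quick back-of-the-envelope check; the exact bound for the 9-point stencil differs somewhat from the simple [1,-2,1] analysis, but the order of magnitude is the same):
gridpoints, timestep, diffusion_coefficient = 128, 0.5, 0.002
dx = 4.0 / (gridpoints - 1)
k = timestep * diffusion_coefficient / dx**2
print(k)   # about 1.0, i.e. above the ~0.5 stability limit, so the scheme blows up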
There are better methods to solve diffusion equations; I believe Crank-Nicolson is more or less the standard for solving heat equations (which are also diffusion problems). The 'problem' is that it is an implicit method, which means you have to solve a system of equations to get the 'concentration' at the next timestep, which is a bit more of a pain to implement. But the method is guaranteed to be numerically stable, even for big timesteps.
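For reference, a minimal Crank-Nicolson sketch for the 1-D problem with scipy.sparse (a sketch only: the parameters are borrowed from the question, and the truncated stencil simply assumes zero concentration outside the domain):
import numpy as np
from scipy.sparse import diags, identity
from scipy.sparse.linalg import splu

npts, dt, D = 128, 0.5, 0.002
x = np.linspace(0, 4, npts)
dx = x[1] - x[0]
u = np.exp(-(x - 2.0)**2 / (2 * 0.1576**2))        # initial profile (unnormalized)

r = D * dt / dx**2
L = diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(npts, npts), format="csc")
A = identity(npts, format="csc") - 0.5 * r * L     # implicit half of the update
B = identity(npts, format="csc") + 0.5 * r * L     # explicit half of the update
solve = splu(A).solve                              # factor once, reuse each step

for _ in range(21):
    u = solve(B @ u)                               # one unconditionally stable step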

How to calculate a personalized PageRank over millions of nodes?

I have a sparse graph containing about a million nodes and 10 million edges. I want to calculate a personalized PageRank for each node, where by personalized PageRank at node n I mean:
# x_0 is a column vector of all zeros, except a 1 in the position corresponding to node n
# adjacency_matrix is a matrix with a 1 in position (i, j) if there is an edge from node i to node j
x_1 = 0.5 * x_0 + 0.5 * adjacency_matrix * x_0
x_2 = 0.5 * x_0 + 0.5 * adjacency_matrix * x_1
x_3 = 0.5 * x_0 + 0.5 * adjacency_matrix * x_2
# x_3 now holds the personalized PageRank scores
# I'm basically approximating the personalized PageRank by running this for only 3 iterations
I tried coding this up using NumPy, but it was taking too long to run. (about 1 second to calculate the personalized PageRank for each node)
I also tried changing x_0 to be a matrix (by combining the column vectors of several different nodes), but this also didn't help much, and actually made the computation take much longer (possibly because the matrix gets dense fairly quickly and no longer fits in RAM? I'm not sure).
Is there another suggested way to calculate this, preferably in Python? I also thought about going the non-matrix approach to PageRank calculation, by doing a kind of simulated random walk for three iterations (i.e., I start each node with a score of 1, then propagate this score to its neighbors, etc.), but I'm not sure if this would be any faster. Would it be, and if so, why?
I would have thought a "PageRank" algorithm would be best viewed as a Directed Graph http://en.wikipedia.org/wiki/Directed_graph (possibly with appropriate weighting).
I like the networkx library at http://networkx.lanl.org
You'll find it also has a "PageRank" example under algorithms which you may be able to adapt.
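As a pointer, networkx's pagerank has a personalization argument that corresponds to the restart vector in your formulation, so something like this sketch may be close to what you want (alpha=0.5 mirrors the 0.5 weighting in your recurrence; the random graph is just a stand-in for your data):
import networkx as nx

G = nx.fast_gnp_random_graph(1000, 0.01, directed=True, seed=0)
n = 0   # the node to personalize on
restart = {node: (1 if node == n else 0) for node in G}
scores = nx.pagerank(G, alpha=0.5, personalization=restart)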
In your case, the simulated random walk iterative approach should work fine, provided your data is stored in the right way. When you have very few edges compared to the number of nodes (as in your case), I don't think the matrix approach is a good choice, since the matrix is very sparse and yet, in practice, this approach means you are checking for the existence of an edge from i to j for every pair (i, j). (By the way, I'm not sure how much running time those multiplications by zero really take.)
If your data is stored so that each node object has a list of the destinations of its outgoing links, the random walk simulation approach will be rather quick. Ignoring the damping factor, this is what you would actually be doing in each iteration of your random walk simulation:
for node in nodes:
    for destination in node.destinations:
        destination.pageRank += node.pageRank / len(node.destinations)
The time complexity of each iteration is then O(n*k), where in your case n = 1M nodes and k = 10 is the average number of outgoing edges per node. This sounds good, if I'm not missing anything here.
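Made concrete with plain adjacency lists, the three-iteration personalized propagation could look like this (a sketch; adjacency is assumed to be a dict mapping each node to its list of out-neighbours, and the 0.5 restart weight is taken from the question):
from collections import defaultdict

def personalized_scores(adjacency, start, iterations=3, restart=0.5):
    x = defaultdict(float)
    x[start] = 1.0
    for _ in range(iterations):
        nxt = defaultdict(float)
        nxt[start] += restart                       # restart mass at the start node
        for node, score in x.items():
            dests = adjacency.get(node, [])
            if dests:
                share = (1 - restart) * score / len(dests)
                for d in dests:
                    nxt[d] += share                 # propagate along outgoing links
        x = nxt
    return x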
