Generating an SIS epidemiological model using Python networkx

I have been told that the networkx library in Python is the standard library for graph-theoretical applications, but I have found it quite frustrating to use so far.
What I want to do is this:
Generate an SIS epidemiological network, assign initial contact rates and recovery rates, and then follow the progress of the disease.
More precisely, imagine a network of n individuals and an adjacency matrix A. The values of A are contact rates in the [0,1] range. This means that the (i,j) entry is the probability that the disease is transferred from node i to node j. Initially, each node is assigned a random label, which can be either 1 (for Infective individuals) or 0 (for Susceptible individuals, who have not caught the disease yet).
At each time step, if a node has a 0 label, then it can turn into a 1 with a probability equal to the maximum weight of the incoming edges to the node. If a node has a 1 label, then it can turn into a 0 with a probability given by its recovery rate. The recovery rate is a value in the [0,1] range assigned to each node at the beginning of the simulation.
And while the network evolves at each time step, I want to display the network with each node coloured according to its label.
If somebody knows of any other Python library that can do this more efficiently than networkx, I would be grateful if you let me know.

Something like this is now possible with EoN.
You appear to want a discrete SIS epidemic with weighted edges.
At present this is the one common case I seem to have left out: here's the bug report I created a while ago. The pandemic has sapped my time to work on this.
https://github.com/springer-math/Mathematics-of-Epidemics-on-Networks/issues/40
What it can do right now is discrete time SIS where each edge is equally weighted. It can also do continuous time SIS or SIR as well as discrete time SIR where the edges may or may not be weighted.
A basic SIS simulation is:
import networkx as nx
import EoN
import matplotlib.pyplot as plt
G = nx.fast_gnp_random_graph(1000,0.002)
t, S, I = EoN.basic_discrete_SIS(G, 0.6, tmax = 20)
plt.plot(t,S)

Do you use networkx for calculation or visualization?
There is no need to use it for calculation, since your model is simple and it is easier to calculate it with matrix (vector) operations. That is a good fit for numpy.
The main part of a step is calculating the probability of switching from 0 to 1. Let N be the vector that stores 0 or 1 for each node, depending on its state. Then the probability that node n switches from 0 to 1 is numpy.amax(A[:,n] * N). Column n is used because the (i,j) entry is the transfer probability from node i to node j, so the incoming edges of node n sit in column n.
If you need visualization, then there are probably better libraries than networkx.
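The vectorised step described above can be sketched in numpy as follows. This is a minimal sketch, not a reference implementation: the function name sis_step, the random test network and the choice to let only currently infected neighbours contribute to the infection probability are assumptions made for illustration. It follows the question's convention that A[i, j] is the probability of transfer from node i to node j.

```python
import numpy as np

def sis_step(A, state, recovery, rng):
    """Advance a discrete-time SIS model by one step (hypothetical helper).

    A[i, j]  : probability that infected node i infects node j
    state    : 0/1 vector, 1 = infective, 0 = susceptible
    recovery : per-node recovery probability
    """
    # For every node j, the strongest incoming edge from an infected node i.
    p_infect = (A * state[:, None]).max(axis=0)
    draws = rng.random(state.size)
    new_state = state.copy()
    new_state[(state == 0) & (draws < p_infect)] = 1   # infection
    new_state[(state == 1) & (draws < recovery)] = 0   # recovery
    return new_state

rng = np.random.default_rng(0)
n = 50
A = rng.random((n, n)) * 0.1          # assumed random contact rates
np.fill_diagonal(A, 0.0)
state = rng.integers(0, 2, size=n)    # random initial labels
recovery = rng.random(n)

history = [state]
for _ in range(20):
    history.append(sis_step(A, history[-1], recovery, rng))
infected_counts = [int(s.sum()) for s in history]
```

Each entry of history is a 0/1 vector that can be passed as a colour list to whichever plotting library is used for the display.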

Related

How to remove noise using MeanShift Clustering Technique?

I'm using mean shift clustering to remove unwanted noise from my input data.
The data can be found here. Here is what I have tried so far:
import numpy as np
from sklearn.cluster import MeanShift
data = np.loadtxt('model.txt', unpack = True)
## data size is [3X500]
ms = MeanShift()
ms.fit(data)
After trying several different bandwidth values I still get only 1 cluster, but the outliers and noise shown in the picture are supposed to end up in a different cluster.
When I decrease the bandwidth a little more, I end up with this, which is again not what I was looking for.
Can anyone help me with this?

You can remove outliers before using mean shift.
Statistical removal
For example, fix a number of neighbors to analyze for each point (e.g. 50) and a standard deviation multiplier (e.g. 1). Every point whose mean distance to its neighbors is more than 1 standard deviation above the global mean of those distances is marked as an outlier and removed. This technique is used in libpcl, in the class pcl::StatisticalOutlierRemoval, and a tutorial can be found here.
Deterministic removal (radius based)
A simpler technique consists of specifying a radius R and a minimum number of neighbors N. All points that have fewer than N neighbors within a radius of R are marked as outliers and removed. This technique is also used in libpcl, in the class pcl::RadiusOutlierRemoval, and a tutorial can be found here.
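Both filters can be approximated in plain numpy for a data set of a few hundred points. The sketch below is an illustration only and is not the libpcl implementation: the function names are hypothetical, points are assumed to be stored one per row (so the 3x500 array from the question would need to be transposed), and the brute-force distance matrix needs memory on the order of the number of points squared.

```python
import numpy as np

def pairwise_distances(points):
    """Full Euclidean distance matrix; fine for small data, O(n^2) memory."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def statistical_outlier_mask(points, k=50, std_mult=1.0):
    """Return True for points to keep (statistical removal)."""
    d = pairwise_distances(points)
    # Sort each row; column 0 is the zero distance of a point to itself.
    knn_mean = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)
    threshold = knn_mean.mean() + std_mult * knn_mean.std()
    return knn_mean <= threshold

def radius_outlier_mask(points, radius, min_neighbors):
    """Return True for points to keep (radius-based removal)."""
    d = pairwise_distances(points)
    # Subtract 1 so that a point does not count itself as a neighbour.
    counts = (d <= radius).sum(axis=1) - 1
    return counts >= min_neighbors

rng = np.random.default_rng(0)
cloud = rng.normal(0.0, 0.1, size=(200, 3))            # assumed dense cluster
points = np.vstack([cloud, [[10.0, 10.0, 10.0]]])      # plus one far outlier

keep_stat = statistical_outlier_mask(points, k=10, std_mult=1.0)
keep_radius = radius_outlier_mask(points, radius=0.5, min_neighbors=3)
filtered = points[keep_stat]
```

The filtered array can then be passed to MeanShift in place of the raw data.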

Mean-shift is not meant to remove low-density areas.
It tries to move all data to the most dense areas.
If there is one single most dense point, then everything should move there, and you get only one cluster.
Try a different method. Maybe remove the outliers first.

Set this parameter to False: cluster_all : bool, default=True
If true, then all points are clustered, even those orphans that are not within any kernel. Orphans are assigned to the nearest kernel. If false, then orphans are given cluster label -1.

Why isn't optimal_count giving the right result?

I'm trying to understand python-igraph and specifically the community_walktrap function. I created the following example:
import numpy as np
import igraph
mat = np.zeros((200,200)) + 50
mat[20:30,20:30] = 2
mat[80:90,80:90] = 2
g = igraph.Graph.Weighted_Adjacency(mat.tolist(),
                                    mode=igraph.ADJ_DIRECTED)
wl = g.community_walktrap(weights=g.es['weight'])
I would have assumed the optimal count of communities to be 3, but running
print(wl.optimal_count)
gives me 1. If I force the dendrogram to be cut at 3 with wl.as_clustering(3), I get a membership list that's correct. What am I doing wrong with optimal_count?

Why do you think the optimal cluster count should be 3? It seems to me that all the nodes have fairly strong connections to each other (they have a weight of 50), except two small groups where the connections are weaker. Note that clustering methods in igraph expect the weights to denote similarities, not distances. Also note that most clustering algorithms in igraph are not well-defined for directed networks (some of them even simply reject directed networks).
For what it's worth, wl.optimal_count simply calculates the so-called modularity measure (see the modularity() method of the Graph class) and then picks the cluster count where the modularity is highest. The modularity with only one cluster is zero (this is how the measure works by definition). The modularity with three clusters is around -0.0083, so igraph is right to pick one cluster only instead of three:
>>> wl.as_clustering(3).modularity
-0.00829996846600007
>>> wl.as_clustering(1).modularity
0.0
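The two modularity values can be checked by hand with a few lines of numpy. The sketch below applies the usual weighted, directed modularity formula to the matrix from the question. It is an illustration under that assumption and is not igraph's own code; the function name modularity here is a local helper.

```python
import numpy as np

def modularity(weights, membership):
    """Q = (1/W) * sum_ij (A_ij - s_out_i * s_in_j / W) * [c_i == c_j]."""
    weights = np.asarray(weights, dtype=float)
    membership = np.asarray(membership)
    total = weights.sum()                      # W, total edge weight
    out_strength = weights.sum(axis=1)
    in_strength = weights.sum(axis=0)
    same = membership[:, None] == membership[None, :]
    expected = np.outer(out_strength, in_strength) / total
    return ((weights - expected) * same).sum() / total

# Same matrix as in the question.
mat = np.zeros((200, 200)) + 50
mat[20:30, 20:30] = 2
mat[80:90, 80:90] = 2

one_cluster = np.zeros(200, dtype=int)
three_clusters = np.zeros(200, dtype=int)
three_clusters[20:30] = 1
three_clusters[80:90] = 2

q1 = modularity(mat, one_cluster)
q3 = modularity(mat, three_clusters)
```

With a single cluster the observed and expected weights cancel exactly, so q1 is zero up to rounding. The two weakly connected blocks have less internal weight than expected, which pushes q3 slightly below zero.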

Problems in performing K means clustering

I am trying to cluster the following data from a CSV file with K means clustering.
Sample1,Sample2,45
Sample1,Sample3,69
Sample1,Sample4,12
Sample2,Sample2,46
Sample2,Sample1,78
It is basically a graph where Samples are nodes and the numbers are the edges (weights).
I read the file as follows:
import csv
fileopening = open('data.csv', 'r')
reading = csv.reader(fileopening, delimiter=',')
L = list(reading)
I used this code: https://gist.github.com/betzerra/8744068
Here clusters are built based on the following:
num_points, dim, k, cutoff, lower, upper = 10, 2, 3, 0.5, 0, 200
points = [makeRandomPoint(dim, lower, upper) for i in range(num_points)]
clusters = kmeans(points, k, cutoff)
for i, c in enumerate(clusters):
    for p in c.points:
        print(" Cluster: ", i, "\t Point :", p)
I replaced points with the list L, but I got lots of errors: AttributeError, 'int' object has no attribute 'n', etc.
I need to perform K means clustering based on the third column (the edge weights) of my CSV file. The tutorial creates its points randomly, and I am not sure how to use my CSV data as input to this k means function instead. How can I perform k means (k=2) on my data? How can I pass the CSV file data as input to this k means function?

In short: "you can't".
Long answer:
K-means is defined for Euclidean spaces only and it requires valid point positions, while you only have distances between the points, probably not distances in a strict mathematical sense but rather some kind of "similarity". K-means is not designed to work with similarity matrices.
What can you do?
1. You can use some other method to embed your points in Euclidean space in such a way that they closely resemble your distances. One such tool is multidimensional scaling (MDS): http://en.wikipedia.org/wiki/Multidimensional_scaling
2. Once point 1 is done you can run k-means.
3. Alternatively, you can construct a kernel (valid in Mercer's sense) by applying some kernel learning technique so that it resembles your data, and then run kernel k-means on the resulting Gram matrix.
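The embedding step can be illustrated with classical (Torgerson) MDS, which needs only numpy. This is a sketch of one MDS variant under the assumption that the input is a symmetric matrix of distances. The question's data is not symmetric (Sample1 to Sample2 is 45, Sample2 to Sample1 is 78), so it would have to be symmetrized first, for example by averaging. The function name classical_mds is hypothetical.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Embed points in Euclidean space from a symmetric distance matrix."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-centre the squared distances to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    # Keep the largest eigenvalues; clip small negatives caused by noise.
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Toy check: distances between the corners of a 3 x 4 rectangle.
corners = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
dist = np.linalg.norm(corners[:, None, :] - corners[None, :, :], axis=2)
coords = classical_mds(dist, n_components=2)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
```

The rows of coords are ordinary point coordinates, so they can be handed to any k-means implementation.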

As lejlot said, only distances between points are not enough to run k-means in the classic sense. It's easy to understand if you understand the nature of k-means. On a high level, k-means works as follows:
1) Randomly assign points to cluster.
(Technically, there are more sophisticated ways of initial partitioning,
but that's not essential right now).
2) Compute centroids of the cluster.
(This is where you need the actual coordinates of the points.)
3) Reassign each point to a cluster with the closest centroid.
4) Repeat steps 2)-3) until stop condition is met.
So, as you can see, in the classic interpretation, k-means will not work, because it is unclear how to compute centroids. However, I have several suggestions of what you could do.
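The four steps above can be sketched in numpy as follows. This is a minimal illustration, not the code from the linked gist. It starts from k randomly chosen data points as centroids instead of a random partition, which is one of the common initialisation variants mentioned in step 1.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd iteration on an (n_points, n_dims) coordinate array."""
    rng = np.random.default_rng(seed)
    # Step 1 (variant): pick k distinct data points as initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to the closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute centroids; this is where coordinates are needed.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans(points, k=2)
```

The call to mean in step 2 is the operation that has no counterpart when only pairwise distances are available.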
Suggestion 1.
Embed your points in N-dimensional space, where N is the number of points, so that the coordinates of each point are the distances to all the other points.
For example the data you showed:
Sample1,Sample2,45
Sample1,Sample3,69
Sample1,Sample4,12
Sample2,Sample2,46
Sample2,Sample1,78
becomes:
Sample1: (0,45,69,12,...)
Sample2: (78,46,0,0,...)
Then you can legitimately use Euclidean distance. Note that the actual distances between points will not be preserved, but this could be a simple and reasonable approximation that preserves the relative distances between the points. Another disadvantage is that if you have a lot of points, then your memory (and running time) requirements will be of order N^2.
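Building those coordinate vectors from the CSV rows could look like the sketch below. The rows are typed in by hand here to keep the example self-contained; in practice they would come from csv.reader with the third column converted to a number. Missing pairs are filled with 0, as in the example above, which is an assumption rather than a recommendation.

```python
import numpy as np

rows = [
    ("Sample1", "Sample2", 45),
    ("Sample1", "Sample3", 69),
    ("Sample1", "Sample4", 12),
    ("Sample2", "Sample2", 46),
    ("Sample2", "Sample1", 78),
]

# Give every sample a fixed column position.
names = sorted({name for src, dst, _ in rows for name in (src, dst)})
index = {name: i for i, name in enumerate(names)}

# Row i holds the distances from sample i to every other sample.
features = np.zeros((len(names), len(names)))
for src, dst, weight in rows:
    features[index[src], index[dst]] = weight
```

Each row of features is then one point in N-dimensional space and can be passed to k-means directly.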
Suggestion 2.
Instead of k-means, try k-medoids. For this one, you do not need the actual coordinates of the points, because instead of centroids you compute medoids. The medoid of a cluster is the point from this cluster which has the smallest average distance to all other points in the cluster. You could look for implementations online, or implement it yourself, which is actually pretty easy. The running time will be proportional to N^2 as well.
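A simple alternating version of k-medoids fits in a few lines. The sketch below is one basic variant (assign points to the nearest medoid, then re-pick each medoid inside its cluster) and is not a tuned implementation such as PAM. It assumes a full, symmetric distance matrix, and the function name kmedoids is hypothetical.

```python
import numpy as np

def kmedoids(dist, k, n_iter=100, seed=0):
    """Cluster using only a pairwise distance matrix."""
    dist = np.asarray(dist, dtype=float)
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(dist), size=k, replace=False)
    for _ in range(n_iter):
        # Assign every point to its nearest medoid.
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                # Member with the smallest total distance to its cluster.
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = dist[:, medoids].argmin(axis=1)
    return labels, medoids

# Toy distance matrix: two groups of three, close inside, far apart.
dist = np.full((6, 6), 10.0)
dist[:3, :3] = 1.0
dist[3:, 3:] = 1.0
np.fill_diagonal(dist, 0.0)
labels, medoids = kmedoids(dist, k=2)
```

No coordinates appear anywhere in the function, which is why it works on data like the CSV in the question once the distances are made symmetric.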
Final remark.
Why do you want to use k-means at all? It seems like you have a weighted directed graph. There are clustering algorithms specially intended for graphs. This is beyond the scope of your question, but maybe it is something worth considering?

How to calculate a personalized PageRank over millions of nodes?

I have a sparse graph containing about a million nodes and 10 million edges. I want to calculate a personalized PageRank for each node, where by personalized PageRank at node n I mean:
# x_0 is a column vector of all zeros, except a 1 in the position corresponding to node n
# adjacency_matrix is a matrix with a 1 in position (i, j) if there is an edge from node i to node j
x_1 = 0.5 * x_0 + 0.5 * adjacency_matrix * x_0
x_2 = 0.5 * x_0 + 0.5 * adjacency_matrix * x_1
x_3 = 0.5 * x_0 + 0.5 * adjacency_matrix * x_2
# x_3 now holds the personalized PageRank scores
# i'm basically approximating the personalized PageRank by running this for only 3 iterations
I tried coding this up using NumPy, but it was taking too long to run. (about 1 second to calculate the personalized PageRank for each node)
I also tried changing x_0 to be a matrix (by combining the column vectors of several different nodes), but this didn't help much either, and actually made the computation take much longer. (Possibly because the matrix gets dense fairly quickly, and so it no longer fits in RAM? I'm not sure.)
Is there another suggested way to calculate this, preferably in Python? I also thought about going the non-matrix approach to PageRank calculation, by doing a kind of simulated random walk for three iterations (i.e., I start each node with a score of 1, then propagate this score to its neighbors, etc.), but I'm not sure if this would be any faster. Would it be, and if so, why?

I would have thought a "PageRank" algorithm would be best viewed as a Directed Graph http://en.wikipedia.org/wiki/Directed_graph (possibly with appropriate weighting).
I like the networkx library at http://networkx.lanl.org
You'll find it also has a "PageRank" example under algorithms which you may be able to adapt.

In your case, using the simulated random walk iterative approach should work fine, if your data is stored in the right way. When you have very few edges compared to the number of nodes (as in your case), I don't think the matrix approach is a good choice, since the matrix is very sparse and yet, in practice, this approach means that you are checking the existence of an edge from i to j for every i and j. (By the way, I'm not sure how much running time those multiplications by zero really take.)
If you have your data stored in a way that for each node object, you have a list of the destinations of its outgoing links, the random walk simulation approach will be rather quick. Ignoring the damping factor, this is what you will be actually doing in each iteration of your random walk simulation:
for node in nodes:
    for destination in node.destinations:
        # each node splits its current score evenly among its outgoing links
        destination.pageRank += node.pageRank / len(node.destinations)
The time complexity of each iteration is then O(n*k) where in your case n=1m and k=10. This sounds good, if I'm not missing anything here.
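A runnable version of this propagation, combined with the 0.5 restart weight from the question, might look like the sketch below. Several things here are assumptions for illustration: the graph is a plain dict of adjacency lists, scores are split by out-degree as in the loop above (the recurrence in the question multiplies by the raw adjacency matrix instead), new scores are accumulated separately so that one pass only reads the previous iteration, and the mass of nodes without outgoing links is simply dropped.

```python
from collections import defaultdict

def personalized_pagerank(out_links, source, restart=0.5, n_iter=3):
    """Approximate personalized PageRank for one source node.

    out_links maps each node to the list of nodes it links to.
    Only nodes reached so far are stored, so the cost of a pass
    depends on the explored neighbourhood, not on the whole graph.
    """
    scores = {source: 1.0}
    for _ in range(n_iter):
        pushed = defaultdict(float)
        for node, score in scores.items():
            destinations = out_links.get(node, [])
            if not destinations:
                continue  # dangling node: its mass is dropped in this sketch
            share = score / len(destinations)
            for destination in destinations:
                pushed[destination] += share
        new_scores = defaultdict(float)
        new_scores[source] += restart
        for node, value in pushed.items():
            new_scores[node] += (1.0 - restart) * value
        scores = dict(new_scores)
    return scores

out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = personalized_pagerank(out_links, "a")
```

After three iterations the scores dict only contains nodes within three hops of the source, which is what keeps the per-node cost small on a sparse graph.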

Calculate Hitting Time between 2 nodes using NetworkX

I would like to know if I can use NetworkX to implement hitting time. Basically, I want to calculate the hitting time between any 2 nodes in a graph. My graph is unweighted and undirected. If I understand hitting time correctly, it is very similar to the idea of PageRank.
Any idea how I can implement hitting time using the PageRank method provided by NetworkX?
Is there a good starting point to work from?
I've checked: MapReduce, Python and NetworkX
but not quite sure how it works.

You don't need networkx to solve this problem; numpy can do it if you understand the math behind it. An undirected, unweighted graph can always be represented by a 0/1 adjacency matrix. The (i,j) entry of the nth power of this matrix counts the walks of length n from i to j. We can work instead with a Markov matrix, which is the row-normalized form of the adjacency matrix. Powers of this matrix represent a random walk over the graph. If the graph is small, you can take powers of the matrix and look at the entry (start, end) that you are interested in. Make the final state an absorbing one, so that once the walk hits that node it can't escape. At each power n, that entry is then the probability that the walk has reached the end node from the start node within n steps. The hitting time can be computed from this function (as you know the exact hit time for discrete steps).
Below is an example with a simple graph defined by the edge list. At the end, I plot this hitting time function. As a reference point, this is the graph used:
import numpy as np
import matplotlib.pyplot as plt
hit_idx = (0, 4)
# Define a graph by edge list
edges = [[0, 1], [1, 2], [2, 3], [2, 4]]
# Create adj. matrix
A = np.zeros((5, 5))
A[tuple(zip(*edges))] = 1
# Undirected condition
A += A.T
# Make the final state an absorbing condition
A[hit_idx[1], :] = 0
A[hit_idx[1], hit_idx[1]] = 1
# Make a proper Markov matrix by row normalizing
A = (A.T / A.sum(axis=1)).T
B = A.copy()
# Z[n] is the probability of having been absorbed at the end node within n + 1 steps
Z = []
for n in range(100):
    Z.append(B[hit_idx])
    B = np.dot(B, A)
plt.plot(Z)
plt.xlabel("steps")
plt.ylabel("hit probability")
plt.show()
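If only the expected hitting time is needed, it can also be obtained without taking matrix powers. The sketch below uses the standard absorbing Markov chain identity: with Q the transition matrix restricted to the non-target nodes, the expected hitting times h solve (I - Q) h = 1. It reuses the same edge list and target node as the example above and is a different route from the plotted curve, not a rewrite of it.

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (2, 4)]
target = 4
n = 5

# Undirected adjacency matrix and its row-normalized transition matrix.
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
P = A / A.sum(axis=1, keepdims=True)

# Drop the target row and column, then solve (I - Q) h = 1.
others = [i for i in range(n) if i != target]
Q = P[np.ix_(others, others)]
h = np.linalg.solve(np.eye(len(others)) - Q, np.ones(len(others)))
hitting_time = dict(zip(others, h))
```

hitting_time maps each start node to the expected number of steps a random walk needs to first reach the target node.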
