Why isn't optimal_count giving the right result?

I'm trying to understand python-igraph and specifically the community_walktrap function. I created the following example:
import numpy as np
import igraph
mat = np.zeros((200,200)) + 50
mat[20:30,20:30] = 2
mat[80:90,80:90] = 2
g = igraph.Graph.Weighted_Adjacency(mat.tolist(),
                                    mode=igraph.ADJ_DIRECTED)
wl = g.community_walktrap(weights=g.es['weight'])
I would have assumed the optimal count of communities to be 3, but running
print(wl.optimal_count)
gives me 1. If I force the dendrogram to be cut at 3 with wl.as_clustering(3), I get a correct membership list. What am I doing wrong with optimal_count?

Why do you think the optimal cluster count should be 3? It seems to me that all the nodes have fairly strong connections to each other (they have a weight of 50), except two small groups where the connections are weaker. Note that clustering methods in igraph expect the weights to denote similarities, not distances. Also note that most clustering algorithms in igraph are not well-defined for directed networks (some of them even simply reject directed networks).
For what it's worth, wl.optimal_count simply calculates the so-called modularity measure (see the modularity() method of the Graph class) and then picks the cluster count where the modularity is highest. The modularity with only one cluster is zero (this is how the measure works by definition). The modularity with three clusters is around -0.0083, so igraph is right to pick one cluster only instead of three:
>>> wl.as_clustering(3).modularity
-0.00829996846600007
>>> wl.as_clustering(1).modularity
0.0
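If your intent was for 2 to mean "close" and 50 to mean "far apart", then your weights are effectively distances and should be converted to similarities before clustering. A minimal sketch of that idea, using a simple reciprocal conversion and an undirected graph (whether walktrap then reports three communities still depends on the modularity of the result):
import numpy as np
import igraph

mat = np.zeros((200, 200)) + 50
mat[20:30, 20:30] = 2
mat[80:90, 80:90] = 2
sim = 1.0 / mat                      # turn distance-like weights into similarities
g = igraph.Graph.Weighted_Adjacency(sim.tolist(), mode=igraph.ADJ_UNDIRECTED)
wl = g.community_walktrap(weights=g.es['weight'])
print(wl.optimal_count)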


Group geometry points according to spatial proximity

I have the following points in 3D space:
I need to group the points, according to D_max and d_max:
D_max = max dimension of each group
d_max = max distance of points inside each group
Like this:
The shape of the group in the above image looks like a box, but the shape can be anything which would be the output of the grouping algorithm.
I'm using Python and visualize the results with Blender. I'm considering using scipy.spatial.KDTree and calling its query API; however, I'm not sure if that's the right tool for the job at hand. I'm worried that there might be a better tool which I'm not aware of, so I'm curious to know if there is any other tool/library/algorithm which can help me.
As @CoMartel pointed out, there are the DBSCAN and HDBSCAN clustering modules, which look like a good fit for this type of problem. However, as pointed out by @Paul, they lack an option for the maximum size of a cluster, which corresponds to my D_max parameter. I'm not sure how to add a maximum cluster size feature to DBSCAN and HDBSCAN clustering.
Thanks to @Anony-Mousse I watched Agglomerative Clustering: how it works and Hierarchical Clustering 3: single-link vs. complete-link, and I'm studying Comparing Python Clustering Algorithms; it's becoming clearer to me how these algorithms work.
As requested, my comment as an answer:
You could use DBSCAN (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) or HDBSCAN.
Both of these algorithms let you group points according to d_max (the maximum distance between 2 points of the same group), but they don't take a maximum cluster size. The only way to limit the maximum size of a cluster is to reduce the eps parameter, which controls the maximum distance between 2 points of the same cluster.
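A minimal sketch of the DBSCAN route with scikit-learn; the points and the d_max value below are made up for illustration:
import numpy as np
from sklearn.cluster import DBSCAN

points = np.random.rand(200, 3) * 10                 # stand-in for the real 3D points
d_max = 1.0                                          # assumed value for d_max
labels = DBSCAN(eps=d_max, min_samples=2).fit_predict(points)
# labels[i] is the group of point i; -1 marks points that fell into no group
Note the caveat in the next answer, though: eps only bounds the distance between neighbouring points, not the overall diameter of the cluster they end up in.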
Use hierarchical agglomerative clustering.
If you use complete linkage you can control the maximum diameter of the clusters. The complete link is the maximum distance.
DBSCAN's epsilon parameter is not a maximum distance because multiple steps are joined transitively. Clusters can become much larger than epsilon!
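A minimal sketch of the complete-linkage approach with SciPy, assuming made-up points and a made-up D_max; cutting the complete-linkage dendrogram at D_max keeps every cluster's diameter at or below D_max:
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

points = np.random.rand(200, 3) * 10              # stand-in for the real 3D points
D_max = 2.0                                       # assumed maximum cluster diameter
Z = linkage(pdist(points), method='complete')     # complete link = largest pairwise distance
labels = fcluster(Z, t=D_max, criterion='distance')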
An extension of the DBSCAN algorithm that also limits the maximum extent of each group:
You can apply DBSCAN recursively (pseudocode; find_big_clusters is a placeholder for a helper that picks out clusters exceeding max_size):
def DBSCAN_with_max_size(myData, eps=E, max_size=S):
    clusters = DBSCAN(myData, eps=E)
    big_clusters = find_big_clusters(clusters, max_size)
    for big_cluster in big_clusters:
        # re-cluster with an eps lower than E (e.g. E/2)
        DBSCAN_with_max_size(big_cluster, eps=E/2, max_size=max_size)

Generating an SIS epidemiological model using Python networkx

I have been told the networkx library in Python is the standard library to use for graph-theoretical applications, but I have found using it quite frustrating so far.
What I want to do is this:
Generating an SIS epidemiological network, assigning initial contact rates and recovery rates and then following the progress of the disease.
More precisely, imagine a network of n individuals and an adjacency matrix A. Values of A are in [0,1] range and are contact rates. This means that the (i,j) entry shows the probability that disease is transferred from node i to node j. Initially, each node is assigned a random label, which can be either 1 (for Infective individuals) or 0 (for Susceptible, which are the ones which have not caught the disease yet).
At each time-step, if the node has a 0 label, then with a probability equal to the maximum value of weights for incoming edges to the node, it can turn into a 1. If the node has a 1 label then with a probability specified by its recovery rate, it can turn into a 0. Recovery rate is a value assigned to each node at the beginning of the simulation, and is in [0,1] range.
And while the network evolves in each time step, I want to display the network with each node label coloured differently.
If somebody knows of any other library in Python that can do such a thing more efficiently than networkx, I'd be grateful if you let me know.
Something like this is now possible with EoN.
You appear to want a discrete SIS epidemic with weighted edges.
At present this is the one common case I seem to have left out: here's the bug report I created a while ago. The pandemic has sapped my time to work on this.
https://github.com/springer-math/Mathematics-of-Epidemics-on-Networks/issues/40
What it can do right now is discrete time SIS where each edge is equally weighted. It can also do continuous time SIS or SIR as well as discrete time SIR where the edges may or may not be weighted.
A basic SIS simulation is:
import networkx as nx
import EoN
import matplotlib.pyplot as plt

G = nx.fast_gnp_random_graph(1000, 0.002)            # random test network
t, S, I = EoN.basic_discrete_SIS(G, 0.6, tmax=20)    # discrete-time SIS, transmission probability 0.6, 20 steps
plt.plot(t, S)                                       # number of susceptibles over time
plt.show()
Do you use networkx for calculation or visualization?
There is no need to use it for calculation, since your model is simple and it is easier to compute with matrix (vector) operations. That is what numpy is suited for.
The main part of a step is calculating the probability of switching from 0 to 1. Let N be a vector that stores 0 or 1 for each node, depending on its state. Then the probability that node n switches from 0 to 1 is numpy.amax(A[n,:] * N).
If you need visualization, then there are probably better libraries than networkx.
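A minimal vectorised sketch of that simulation loop; the network size, the contact-rate matrix A and the recovery rates below are made up for illustration:
import numpy as np

rng = np.random.default_rng(0)
n = 100                                                # number of individuals (illustrative)
A = rng.random((n, n)) * (rng.random((n, n)) < 0.05)   # sparse contact-rate matrix with entries in [0, 1]
recovery = rng.random(n)                               # per-node recovery rates in [0, 1]
state = rng.integers(0, 2, size=n)                     # 1 = infective, 0 = susceptible

for step in range(50):
    p_infect = (A * state).max(axis=1)                 # per node n: max of A[n, :] * state, as in the answer
    infect = (state == 0) & (rng.random(n) < p_infect)
    recover = (state == 1) & (rng.random(n) < recovery)
    state = np.where(infect, 1, np.where(recover, 0, state))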

Understanding output from kmeans clustering in python

I have two distance matrices, each 232*232 where the column and row labels are identical. So this would be an abridged version of the two where A, B, C and D are the names of the points between which the distances are measured:
   A  B  C  D ...        A  B  C  D ...
A  0  1  5  3         A  0  5  3  9
B  4  0  4  1         B  2  0  7  8
C  2  6  0  3         C  2  6  0  1
D  2  7  1  0         D  5  2  5  0
...                   ...
The two matrices therefore represent the distances between pairs of points in two different networks. I want to identify clusters of pairs that are close together in one network and far apart in the other. I attempted to do this by first adjusting the distances in each matrix by dividing every distance by the largest distance in the matrix. I then subtracted one matrix from the other and applied a clustering algorithm to the resultant matrix. The algorithm I was advised to use for this was the k means algorithm. The hope was that I could identify clusters of positive numbers that would correspond to pairs that were very close in matrix one and far apart in matrix two and vice versa for clusters of negative numbers.
Firstly, I've read quite a bit about how to implement k-means in Python, and I'm aware that there are multiple different modules that can be used. I've tried all three of these:
1.
import sklearn.cluster
import numpy as np
data = np.load('difference_matrix_file.npy') #loads difference matrix from file
a = np.array([x[0:] for x in data])
clust_centers = 3
model = sklearn.cluster.k_means(a, clust_centers)
print(model)
2.
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
difference_matrix = np.load('difference_matrix_file.npy') #loads difference matrix from file
data = pd.DataFrame(difference_matrix)
model = KMeans(n_clusters=3)
print(model.fit(data))
3.
import numpy as np
from scipy.cluster.vq import vq, kmeans, whiten
np.set_printoptions(threshold=np.inf)  # print full arrays without truncation
difference_matrix = np.load('difference_matrix_file.npy') #loads difference matrix from file
whitened = whiten(difference_matrix)
centroids = kmeans(whitened, 3)
print(centroids)
What I'm struggling with is how to interpret the output from these scripts. (I might add at this point that I'm neither a mathematician nor a computer scientist, if the reader hadn't already guessed.) I was expecting the output of the algorithm to be lists of coordinates of clustered pairs, one for each cluster (so three in this case), that I could then trace back to my two original matrices to identify the names of the pairs of interest.
However, what I get is an array containing a list of numbers (one for each cluster), but I don't really understand what these numbers are. They don't obviously correspond to what I had in my input matrix, other than the fact that there are 232 items in each list, which is the same number of rows and columns as in the input matrix. And the last item in the array is another single number which I presume must be the centroid of the clusters, but there isn't one for each cluster, just one for the whole array.
I've been trying to figure this out for quite a while now but I'm struggling to get anywhere. Whenever I search for interpreting the output of kmeans I just get explanations of how to plot my clusters on a graph, which isn't what I want to do. Please can someone explain to me what I'm seeing in my output and how I can get from this to the coordinates of the items in each cluster?
You have two issues here, and the recommendation of k-means probably was not very good...
K-means expects a coordinate data matrix, not a distance matrix.
In order to compute a centroid, it needs the original coordinates. If you don't have coordinates like this, you probably should not be using k-means.
If you compute the difference of two distance matrices, small values correspond to points that have a similar distance in both. These could still be very far away from each other! So if you use this matrix as a new "distance" matrix, you will get meaningless results. Consider points A and B, which have the maximum distance in both original graphs. After your procedure, they will have a difference of 0, and will thus be considered identical now.
So you haven't understood the input of k-means; no wonder you do not understand the output.
I'd rather treat the difference matrix as a similarity matrix (try absolute values, positives only, negatives only). Then use hierarchical clustering. But you will need an implementation that accepts a similarity matrix; the usual implementations for a distance matrix will not work.
Disclaimer: below, I tried to answer your question about how to interpret what the functions return and how to get the points in a cluster from that. I agree with @Anony-Mousse that if you have a distance / similarity matrix (as opposed to a feature matrix), you will want to use different techniques, such as spectral clustering.
Sorry for being blunt, I also hate the "RTFM"-type answers, but the functions you used are well documented at:
sklearn.cluster
scipy.cluster.vq
In short,
sklearn.cluster.k_means() returns a tuple with three fields:
an array with the centroids (that should be 3x232 for you)
the label assignment for each point (i.e. a 232-long array with values 0-2)
and the "inertia", a measure of how good the clustering is; there are several such measures, so you might be better off not paying too much attention to this;
scipy.cluster.vq.kmeans2() returns a tuple with two fields:
the cluster centroids (as above)
the label assignment (as above)
kmeans() returns a "distortion" value instead of the label assignment, so I would definitely use kmeans2().
As for how to get to the coordinates of the points in each cluster, you could:
for cc in range(clust_centers):
    print('Points for cluster {}:\n{}'.format(cc, data[model[1] == cc]))
where model is the tuple returned by either sklearn.cluster.k_means or scipy.cluster.vq.kmeans2, and data is a points x coordinates array, difference_matrix in your case.
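Putting the pieces together, a short sketch of the kmeans2 route, assuming the same difference_matrix_file.npy as in the question:
import numpy as np
from scipy.cluster.vq import kmeans2, whiten

difference_matrix = np.load('difference_matrix_file.npy')
centroids, labels = kmeans2(whiten(difference_matrix), 3)   # one label per row of the matrix
for cc in range(3):
    print('Rows in cluster {}: {}'.format(cc, np.where(labels == cc)[0]))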

How to remove noise using MeanShift Clustering Technique?

I'm using mean-shift clustering to remove unwanted noise from my input data.
Data can be found here. Here is what I have tried so far:
import numpy as np
from sklearn.cluster import MeanShift
data = np.loadtxt('model.txt', unpack = True)
## data size is [3X500]
ms = MeanShift()
ms.fit(data)
After trying some different bandwidth values, I am getting only 1 cluster, but the outliers and noise like in the picture are supposed to be in a different cluster.
When I decrease the bandwidth a little more, I end up with this, which is again not what I was looking for.
Can anyone help me with this?
You can remove outliers before using mean shift.
Statistical removal
For example, fix a number of neighbors to analyze for each point (e.g. 50) and a standard-deviation multiplier (e.g. 1). All points whose distance is more than 1 standard deviation above the mean distance to the query point will be marked as outliers and removed. This technique is used in libpcl, in the class pcl::StatisticalOutlierRemoval, and a tutorial can be found here.
Deterministic removal (radius based)
A simpler technique consists in specifying a radius R and a minimum number of neighbors N. All points that have fewer than N neighbours within a radius of R will be marked as outliers and removed. This technique is also used in libpcl, in the class pcl::RadiusOutlierRemoval, and a tutorial can be found here.
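A quick sketch of the radius-based variant in Python with a KD-tree; the values of R and N, and the orientation of the data loaded from model.txt, are assumptions:
import numpy as np
from scipy.spatial import cKDTree

points = np.loadtxt('model.txt')                      # assumed shape (n_points, 3)
R, N = 0.5, 10                                        # assumed radius and minimum neighbour count
tree = cKDTree(points)
counts = np.array([len(tree.query_ball_point(p, R)) - 1 for p in points])  # neighbours, excluding the point itself
inliers = points[counts >= N]                         # keep only well-supported points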
Mean-shift is not meant to remove low-density areas.
It tries to move all data to the most dense areas.
If there is one single most dense point, then everything should move there, and you get only one cluster.
Try a different method. Maybe remove the outliers first.
Set the cluster_all parameter to False:
cluster_all : bool, default=True
If true, then all points are clustered, even those orphans that are not within any kernel. Orphans are assigned to the nearest kernel. If false, then orphans are given cluster label -1.
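A brief sketch of that option; the bandwidth value and the transpose of the loaded data are assumptions:
import numpy as np
from sklearn.cluster import MeanShift

data = np.loadtxt('model.txt', unpack=True).T         # assumed (n_points, 3) after transposing
ms = MeanShift(bandwidth=0.75, cluster_all=False)     # assumed bandwidth; orphans get label -1
labels = ms.fit_predict(data)
denoised = data[labels != -1]                         # drop the orphan (noise) points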

Problems in performing K means clustering

I am trying to cluster the following data from a CSV file with K means clustering.
Sample1,Sample2,45
Sample1,Sample3,69
Sample1,Sample4,12
Sample2,Sample2,46
Sample2,Sample1,78
It is basically a graph where Samples are nodes and the numbers are the edge weights.
I read the file as follows:
import csv

fileopening = open('data.csv', 'r')
reading = csv.reader(fileopening, delimiter=',')
L = list(reading)
I used this code: https://gist.github.com/betzerra/8744068
Here clusters are built based on the following:
num_points, dim, k, cutoff, lower, upper = 10, 2, 3, 0.5, 0, 200
points = map(lambda i: makeRandomPoint(dim, lower, upper), range(num_points))
clusters = kmeans(points, k, cutoff)
for i, c in enumerate(clusters):
    for p in c.points:
        print(" Cluster: ", i, "\t Point :", p)
I replaced points with the list L, but I got lots of errors: AttributeError, 'int' object has no attribute 'n', etc.
I need to perform k-means clustering based on the third column (edge weights) of my CSV file. This tutorial creates points randomly, but I am not sure how to use this CSV data as input to the k-means function. How do I perform k-means (k=2) on my data? How can I send the CSV file data as input to this k-means function?
In short "you can't".
Long answer:
K-means is defined for euclidean spaces only and it requires a valid points positions, while you only have distances between them, probably not in a strict mathematical sense but rather some kind of "similarity". K-means is not designed to work with similarity matrices.
What you can do?
You can use some other method to embeed your points in euclidean space in such a way, that they closely reasamble your distances, one of such tools is Multidimensional scaling (MDS): http://en.wikipedia.org/wiki/Multidimensional_scaling
Once point 1 is done you can run k-means
Alternatively you can also construct a kernel (valid in a Mercer's sense) by performing some kernel learning techniques to reasamble your data and then run kernel k-means on the resulting Gram matrix.
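A compact sketch of the MDS-then-k-means route with scikit-learn; the file name and the assumption that you already have a full symmetric distance matrix (rather than the raw edge list) are hypothetical:
import numpy as np
from sklearn.manifold import MDS
from sklearn.cluster import KMeans

D = np.loadtxt('distance_matrix.txt')      # hypothetical symmetric matrix of pairwise distances
coords = MDS(n_components=2, dissimilarity='precomputed', random_state=0).fit_transform(D)
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(coords)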
As lejlot said, distances between points alone are not enough to run k-means in the classic sense. This is easy to understand if you understand the nature of k-means. On a high level, k-means works as follows:
1) Randomly assign points to cluster.
(Technically, there are more sophisticated ways of initial partitioning,
but that's not essential right now).
2) Compute centroids of the cluster.
(This is where you need the actual coordinates of the points.)
3) Reassign each point to a cluster with the closest centroid.
4) Repeat steps 2)-3) until stop condition is met.
So, as you can see, in the classic interpretation, k-means will not work, because it is unclear how to compute centroids. However, I have several suggestions of what you could do.
Suggestion 1.
Embed your points in N-dimensional space, where N is the number of points, so that the coordinates of each point are the distances to all the other points.
For example the data you showed:
Sample1,Sample2,45
Sample1,Sample3,69
Sample1,Sample4,12
Sample2,Sample2,46
Sample2,Sample1,78
becomes:
Sample1: (0,45,69,12,...)
Sample2: (78,46,0,0,...)
Then you can legitimately use Euclidean distance. Note that the actual distances between points will not be preserved, but this could be a simple and reasonable approximation that preserves the relative distances between the points. Another disadvantage is that if you have a lot of points, then your memory (and running time) requirements will be of the order of N^2.
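A small sketch of building that N-dimensional embedding from the edge list, using the data.csv layout from the question (the variable names are made up):
import csv
import numpy as np

with open('data.csv') as f:
    rows = list(csv.reader(f))                       # rows like ['Sample1', 'Sample2', '45']
names = sorted({r[0] for r in rows} | {r[1] for r in rows})
idx = {name: i for i, name in enumerate(names)}
features = np.zeros((len(names), len(names)))        # one coordinate per sample
for src, dst, weight in rows:
    features[idx[src], idx[dst]] = float(weight)
# Each row of `features` is now a coordinate vector as described in Suggestion 1.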
Suggestion 2.
Instead of k-means, try k-medoids. For this one, you do not need the actual coordinates of the points, because instead of a centroid you compute a medoid. The medoid of a cluster is a point from that cluster which has the smallest average distance to all other points in the cluster. You could look for implementations online, or it's actually pretty easy to implement yourself. The running time will be proportional to N^2 as well.
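For illustration, here is a very small k-medoids sketch that works directly on a precomputed distance matrix D; it assumes k non-empty clusters at every iteration and is not an optimized or robust implementation:
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)              # assign each point to its nearest medoid
        new_medoids = []
        for c in range(k):
            members = np.where(labels == c)[0]
            within = D[np.ix_(members, members)].sum(axis=1)   # total distance to the rest of the cluster
            new_medoids.append(members[np.argmin(within)])
        new_medoids = np.array(new_medoids)
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return labels, medoids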
Final remark.
Why do you want to use k-means at all? It seems like you have a weighted directed graph, and there are clustering algorithms specifically intended for graphs. That is beyond the scope of your question, but maybe it is something worth considering?
