Graph Clustering in Python, Hadoop, or other

Does anyone know of a Python package that can select a number of clusters in a very large undirected graph (100,000 nodes and a lot of edges) so as to minimize the within-cluster sum of squared distances or something similar? I am taking a look at MCL right now: http://micans.org/mcl/

It looks like either spectral clustering with Mahout or this MCL algorithm is going to work.
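MCL itself is a command-line tool, so one route is to dump the graph in MCL's label ("abc") input format, one tab-separated edge per line, and run the binary on the file. A minimal sketch; the file name, the unit weights, and the random stand-in graph are assumptions, not part of the question:

import networkx as nx

G = nx.gnm_random_graph(100000, 500000)  # stand-in for the real graph
with open("graph.abc", "w") as f:
    for u, v in G.edges():
        f.write("%d\t%d\t1.0\n" % (u, v))
# then at the shell, e.g.:  mcl graph.abc --abc -o clusters.out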

Related

Is there a way to cluster networks generated from networkx?

I'm trying to simulate an environment with a minimum of 300 nodes and random edges connecting the nodes. The network is generated using networkx in Python. I want to divide this network into n clusters so that I can run algorithms (like travelling salesman or tabu search) in each cluster, but I can't find a good resource on clustering/grouping.
I can successfully generate the graphs, and I have previously worked on k-means clustering, but bridging the two has been difficult.
The data type generated by networkx is a MultiDiGraph. How do I convert this to a data type I can run a grouping/clustering algorithm on (like a matrix, if that's possible)?
Or am I approaching this the wrong way?
Any help would be really appreciated.
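One way to bridge the two is to collapse the MultiDiGraph to a plain undirected adjacency matrix and hand that to a clustering routine that accepts a precomputed affinity. A hedged sketch using spectral clustering rather than k-means directly; the random stand-in graph and n_clusters=4 are placeholder assumptions:

import networkx as nx
from sklearn.cluster import SpectralClustering

G = nx.MultiDiGraph(nx.gnm_random_graph(300, 900, directed=True))  # stand-in
A = nx.to_numpy_array(nx.Graph(G))  # multi-edges and directions collapsed
labels = SpectralClustering(n_clusters=4, affinity="precomputed").fit_predict(A)
clusters = {c: [n for n, lab in enumerate(labels) if lab == c] for c in set(labels)}
print(clusters)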

Does PySpark ML KMeans have a way to get the explained variance?

As I was reading through the ML package for PySpark here, it seems KMeansModel doesn't have a way to compute the explained variance in order to draw an elbow curve and establish the optimal number of clusters.
However, in this example the user seems to have a computeCost() function. Where did that function come from? I'm not having any success with it in my program.
I am using Spark 1.6. Thanks in advance!
I was stuck on the same issue regarding the computeCost method in PySpark.
Instead of using computeCost, you can use the Mahalanobis distance or the WSSSE after applying k-means.
To compute the distance you have to write the code yourself, and from the results you can draw a graph to see the knee point for the optimum number of clusters.
Have a look at Anomaly Detection Using PySpark; this use case helped me.
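If I remember right, the DataFrame-based KMeansModel only gained computeCost() in later Spark releases, which would explain why it is missing on 1.6. A sketch of computing the WSSSE by hand with the RDD-based MLlib API instead; the toy data and parameter values are assumptions:

import numpy as np
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="elbow")
data = sc.parallelize([np.array([1.0, 1.0]), np.array([1.2, 0.8]),
                       np.array([9.0, 9.0]), np.array([9.5, 8.7])])

def point_cost(point, model):
    # squared Euclidean distance from a point to its assigned center
    center = model.clusterCenters[model.predict(point)]
    return float(np.sum((point - center) ** 2))

# train for several k and sum the per-point costs to draw the elbow curve
for k in range(1, 4):
    model = KMeans.train(data, k, maxIterations=20)
    print(k, data.map(lambda p: point_cost(p, model)).sum())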

Dividing a graph into 2 disconnected subgraphs

I'm trying to find an algorithm that determines whether it is possible to divide a graph into two disconnected subgraphs; my input is the adjacency matrix of the graph.
My language is Python, but that isn't so important; the algorithm is what matters.
Thanks for your help :)
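One observation makes this simpler: a graph can be split into two disconnected subgraphs exactly when it has more than one connected component. A minimal sketch that checks this with a BFS straight over the adjacency matrix (names are illustrative):

from collections import deque

def can_split(adj):
    # BFS from node 0; unreached nodes mean at least two components
    n = len(adj)
    if n == 0:
        return False
    seen, queue = {0}, deque([0])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if adj[u][v] and v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) < n

adj = [[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1],
       [0, 0, 1, 0]]
print(can_split(adj))  # True: {0, 1} and {2, 3} are disconnected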

How to cut a hierarchical clustering tree at a given maximum within-cluster distance?

I am working with one-dimensional gene position data that looks like
[705118, 705118, 832132, 860402, 865710, 867206, 925364, 925364, 925364]
(around 2000 items in one array), and I want to divide the array into clusters with a maximum within-cluster distance less than or equal to 2000.
So I used the
chrd=scipy.spatial.distance.pdist(chrn,metric='euclidean')
to get the distance matrix and then
scipy.cluster.hierarchy.linkage(chrd,method='average',metric='euclidean')
to get the linkage matrix.
But there is no function in scipy.cluster.hierarchy.fcluster that can cut the hierarchy tree based on the maximum within-cluster distance.
Does anyone have any idea how to handle this?
I tried to rewrite a hierarchical algorithm that includes such a threshold, but it seems really hard to do >.<
Thanks in advance
Use complete linkage if you need a maximum distance between objects, not average linkage.
Then cut the tree at the desired height.
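A minimal sketch of that with SciPy, reusing the positions from the question; with method='complete', cutting at height t guarantees no pair inside a cluster is more than t apart:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# positions from the question, reshaped to 2-D as pdist expects
positions = np.array([[705118], [705118], [832132], [860402],
                      [865710], [867206], [925364], [925364], [925364]])
Z = linkage(pdist(positions, metric='euclidean'), method='complete')
# criterion='distance' cuts the tree at the given height, so every
# within-cluster pairwise distance is <= 2000 under complete linkage
labels = fcluster(Z, t=2000, criterion='distance')
print(labels)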
Alternatively, you can also implement leader clustering. It's crude, but there the maximum distance is at most twice the cluster radius.
If using scikit-learn is a possibility, you could check out http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
It has a parameter (eps) that limits the distance between two samples.
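A hedged sketch of that option on the same data; note that eps bounds the distance between neighbouring samples, so a chain of points can still produce a cluster wider than 2000 overall:

import numpy as np
from sklearn.cluster import DBSCAN

positions = np.array([[705118], [705118], [832132], [860402],
                      [865710], [867206], [925364], [925364], [925364]])
# min_samples=1 keeps every point; -1 in the output would mark noise
labels = DBSCAN(eps=2000, min_samples=1).fit_predict(positions)
print(labels)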

"Agglomerative" clustering of a graph based on node weight in network X?

I have a very large connected graph (millions of nodes). Each edge has a weight -- identifying the proximity of the connected nodes. I want to find "clusters" in the graph (sets of nodes that are very close together). For instance, if the nodes were every city in the US and the edges were distance between the cities -- the clusters might be {Dallas, Houston, Fort Worth} and {New York, Bridgeport, Jersey City, Trenton}.
The clusters don't have to be the same size, and not all nodes have to be in a cluster. Instead, clusters need to have some minimum average weight W, which is equal to (sum of weights in cluster) / (number of edges in cluster).
I am most comfortable in Python, and NetworkX seems to be the standard tool for this (related: What is the most efficient graph data structure in Python?).
It seems like this would not be too hard to program, although not particularly efficiently. Is there a name for the algorithm I am describing? Is there an implementation in NetworkX already?
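For concreteness, the W criterion can be computed with NetworkX like this (a minimal sketch; the toy weights are invented for illustration):

import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([("Dallas", "Houston", 0.9),
                           ("Dallas", "Fort Worth", 0.95),
                           ("Houston", "Fort Worth", 0.85),
                           ("Dallas", "New York", 0.1)])

def average_weight(G, nodes):
    # W = (sum of weights in cluster) / (number of edges in cluster)
    sub = G.subgraph(nodes)
    m = sub.number_of_edges()
    return sum(d["weight"] for _, _, d in sub.edges(data=True)) / m if m else 0.0

print(average_weight(G, ["Dallas", "Houston", "Fort Worth"]))  # 0.9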
I know some graph partitioning algorithms whose goal is to make all parts approximately the same size with as small an edge cut as possible, but as you described it, you do not need such an algorithm. Anyway, I think your problem is NP-complete, like many other graph partitioning problems.
There may be algorithms that work well specifically for your problem (I think there are, but I do not know them), but I think you can still find good, acceptable solutions by slightly modifying algorithms originally designed to find a minimum edge cut with equal-size components.
For example, see this. I think you can use multilevel k-way partitioning with some changes.
For example, in the coarsening phase you can use light edge matching.
Consider a situation where, in the coarsening phase, you've matched A and B into one group and C and D into another group. The weight of the edge between these two groups is the sum of the edge weights of their members to each other, e.g. W = Wac + Wad + Wbc + Wbd, where W is the merged edge weight, Wac is the edge weight between A and C, and so on. I also think that taking the average of Wac, Wad, Wbc and Wbd instead of their sum is worth a try.
In my experience this algorithm is very fast, but I am not sure you will be able to find a pre-coded Python library that you can modify.
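A small sketch of the edge-weight update described above, with plain dicts and illustrative names; nothing here comes from an existing partitioning library:

def merged_edge_weight(weights, group_p, group_q, use_average=False):
    # weights maps frozenset({u, v}) -> edge weight between u and v
    cross = [weights[frozenset({u, v})] for u in group_p for v in group_q
             if frozenset({u, v}) in weights]
    if not cross:
        return 0.0
    return sum(cross) / len(cross) if use_average else sum(cross)

w = {frozenset({"A", "C"}): 1.0, frozenset({"A", "D"}): 2.0,
     frozenset({"B", "C"}): 3.0, frozenset({"B", "D"}): 4.0}
print(merged_edge_weight(w, ["A", "B"], ["C", "D"]))        # 10.0, the sum
print(merged_edge_weight(w, ["A", "B"], ["C", "D"], True))  # 2.5, the average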
