I am trying to cluster the following data from a CSV file with K means clustering.
Sample1,Sample2,45
Sample1,Sample3,69
Sample1,Sample4,12
Sample2,Sample2,46
Sample2,Sample1,78
It is basically a graph where Samples are nodes and the numbers are the edges (weights).
I read the file as following:
fileopening = fopen('data.csv', 'rU')
reading = csv.reader(fileopening, delimiter=',')
L = list(reading)
I used this code: https://gist.github.com/betzerra/8744068
Here clusters are built based on the following:
num_points, dim, k, cutoff, lower, upper = 10, 2, 3, 0.5, 0, 200
points = map( lambda i: makeRandomPoint(dim, lower, upper), range(num_points) )
clusters = kmeans(points, k, cutoff)
for i,c in enumerate(clusters):
for p in c.points:
print " Cluster: ",i,"\t Point :", p
I replaced points with list L. But I got lots of errors: AttributeError, 'int' object has no attribute 'n', etc.
I need to perform K means clustering based on the third number column (edges) of my CSV file. This tutorial uses randomly creating points. But I am not sure, how to use this CSV data as an input to this k means function. How to perform k means (k=2) for my data? How can I send the CSV file data as input to this k means function?
In short "you can't".
Long answer:
K-means is defined for euclidean spaces only and it requires a valid points positions, while you only have distances between them, probably not in a strict mathematical sense but rather some kind of "similarity". K-means is not designed to work with similarity matrices.
What you can do?
You can use some other method to embeed your points in euclidean space in such a way, that they closely reasamble your distances, one of such tools is Multidimensional scaling (MDS): http://en.wikipedia.org/wiki/Multidimensional_scaling
Once point 1 is done you can run k-means
Alternatively you can also construct a kernel (valid in a Mercer's sense) by performing some kernel learning techniques to reasamble your data and then run kernel k-means on the resulting Gram matrix.
As lejlot said, only distances between points are not enough to run k-means in the classic sense. It's easy to understand if you understand the nature of k-means. On a high level, k-means works as follows:
1) Randomly assign points to cluster.
(Technically, there are more sophisticated ways of initial partitioning,
but that's not essential right now).
2) Compute centroids of the cluster.
(This is where you need the actual coordinates of the points.)
3) Reassign each point to a cluster with the closest centroid.
4) Repeat steps 2)-3) until stop condition is met.
So, as you can see, in the classic interpretation, k-means will not work, because it is unclear how to compute centroids. However, I have several suggestions of what you could do.
Suggestion 1.
Embed your points in N-dimensional space, where N is the number of points, so that the coordinates of each point are the distances to all the other points.
For example the data you showed:
Sample1,Sample2,45
Sample1,Sample3,69
Sample1,Sample4,12
Sample2,Sample2,46
Sample2,Sample1,78
becomes:
Sample1: (0,45,69,12,...)
Sample2: (78,46,0,0,...)
Then you can legitimately use Euclidean distance. Note, that the actual distances between points will not be preserved, but this could be a simple and reasonable approximation to preserve relative distances between the points. Another disadvantage is that if you have a lot of points, than your memory (and running time) requirements will be order of N^2.
Suggestion 2.
Instead of k-means, try k-medoids. For this one, you do not need the actual coordinates of the points, because instead of centroid, you need to compute medoids. Medoid of a cluster is a points from this cluster, whish has the smallest average distance to all other points in this cluster. You could look for the implementations online. Or it's actually pretty easy to implement. The running time will be proportional to N^2 as well.
Final remark.
Why do you wan to use k-means at all? Seems like you have a weighted directed graph. There are clustering algorithms specially intended for graphs. This is beyond the scope of your question, but maybe this is something that could be worth considering?
Related
I performed K-means clustering with a variety of k values and got the inertia of each k value (inertial being the sum of the standard deviation of all clusters, to my knowledge)
ks = range(1,30)
inertias = []
for k in ks:
km = KMeans(n_clusters=k).fit(trialsX)
inertias.append(km.inertia_)
plt.plot(ks,inertias)
Based on my reading, the optimal k value lies at the 'elbow' of this plot, but the calculation of the elbow has proven elusive. How can you programatically use this data to calculate k?
I'll post this, because it's the best I have come up with thus far:
It seems like using some threshold scaled to the range of the first derivative allong the curve might do a good job. This can be done by fitting a spline:
y_spl = UnivariateSpline(ks,inertias,s=0,k=4)
x_range = np.linspace(ks[0],ks[-1],1000)
y_spl_1d = y_spl.derivative(n=1)
plt.plot(x_range,y_spl_1d(x_range))
then, you can probably define k by, say 90% up this curve. I would imagine this is a pretty consistent way to do it, but there may be a better option.
EDIT: 2 years later,just use np.diff to generate this plot without fitting a spline, then find the point where the slope equals -1. See the comments for more info.
I've implemented a k-means algorithm and performance is highly dependent on how centroids were initialized. I'm finding random uniform initialization to give a good k-means about 5% of the time, whereas with k-means++, it's closer to 50%. Why is the yield for good k-means so low? I should disclaim I've only used a handful of data sets and my good/bad rates are indicative of only those, not broadly.
Here's an example using k-means++ where the end result was not great. The Dunn Index of this clustering is 0.16.
And an example where it worked perfectly with a Dunn Index of 0.67.
I was maybe under the naive impression k-means++ produced a good k-means every time. Is there perhaps something wrong with my code?
def initialize_centroids(points, k):
"""
Parameters:
points : a list of Points.
k : how many centroids to place.
Returns:
A list of centroids.
"""
clusters = []
clusters.append(choice(points)) # first centroid is random point
for _ in range(k - 1): # for other centroids
distances = []
for p in points:
d = inf
for c in clusters: # find the minimal distance between p and c
d = min(d, distance(p, c))
distances.append(d)
# find maximum distance index from minimal distances
clusters.append(points[distances.index(max(distances))])
return clusters
This is adapted from the algorithm as found on Wikipedia:
Choose one center uniformly at random from among the data points.
For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)2.
Repeat Steps 2 and 3 until k centers have been chosen.
Now that the initial centers have been chosen, proceed using standard k-means clustering.
The difference is the centroids are chosen such that it is the furthest distance, not a probability to choose between furthest distances.
My intention is to compare the Dunn Index over different values of k, and empirically the Dunn Index being higher means better clustering. I can't collect (good) data if half of the time it doesn't work, so my results are skewed due to the faultiness of k-means++ or my implementation thereof.
What other initialization strategies can be employed to get a more consistent result?
I have the following points in 3D space:
I need to group the points, according to D_max and d_max:
D_max = max dimension of each group
d_max = max distance of points inside each group
Like this:
The shape of the group in the above image looks like a box, but the shape can be anything which would be the output of the grouping algorithm.
I'm using Python and visualize the results with Blender. I'm considering using the scipy.spatial.KDTree and calling its query API, however, I'm not sure if that's the right tool for the job at hand. I'm worried that there might be a better tool which I'm not aware of. I'm curious to know if there is any other tool/library/algorithm which can help me.
As #CoMartel pointed out, there is DBSCAN and also HDBSCAN clustering modules which look like a good fit for this type of problems. However, as pointed out by #Paul they lack the option for max size of the cluster which correlates to my D_max parameter. I'm not sure how to add a max cluster size feature to DBSCAN and HDBSCAN clustering.
Thanks to #Anony-Mousse I watched Agglomerative Clustering: how it works and Hierarchical Clustering 3: single-link vs. complete-link and I'm studying Comparing Python Clustering Algorithms, I feel like it's getting more clear how these algorithms work.
As requested, my comment as an answer :
You could use DBSCAN(http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) or HDBSCAN.
Both these algorithm allow to group each point according to d_max (maximum distance between 2 points of the same dataset), but they don't take the maximum cluster size. The only way to limit the maximum size of a cluster is by reducing the epsparameter, which control the max distance between 2 points of the same cluster.
Use hierarchical agglomerative clustering.
If you use complete linkage you can control the maximum diameter of the clusters. The complete link is the maximum distance.
DBSCAN's epsilon parameter is not a maximum distance because multiple steps are joined transitively. Clusters can become much larger than epsilon!
DBSCAN clustering algorithm with the maximum distance of points inside each group extension
You can use the DBSCAN algorithm recursively.
def DBSCAN_with_max_size(myData, eps = E, max_size = S):
clusters = DBSCAN(myData, eps = E)
Big_Clusters = find_big_clusters(clusters)
for big_cluster in Big_Clusters:
DBSCAN_with_max_size(big_cluster ,eps = E/2 ,max_size = S) //eps is something lower than E (e.g. E/2)
I have a dataset of about 3 million samples (each with just 3 features). I'm using scikit's sklearn.neighbors module - specifically radius_neighbor_graph - to find which samples fall within a small radius of a specific sample.
This works fine, but unsurprisingly it's really, really slow to compute this graph.
It's also very wasteful, because I only ever need to know the neighbors for a small subset of my samples (~ 100,000 of them) - and I know this subset in advance.
So... is there any way of being more efficient by calculating the neighbours within a given radius for just this subset of samples? It seems like it should be simple, but I can't think of an easy way of doing it.
First of all, the task of creating a radius-neighborhood-graph involves reading the N by N distance-matrix associated to your dataset. Since distance matrices have nice properties you can save some time, but still complexity lies somewhere in O(N^2). Here N is the number of data points in your data set X.
So one could say, that only a small number of n < N points are of interest as the center of a neighborhood, but the majority of points are just interesting as neighbors. This would result in an n by N distance matrix, where row i contains the distances of data point i to each other data point j, 1 <= i <= n, 1 <= j <= N. But this "distance matrix" has none of the desirable properties of a normal distance matrix (it is not even a square matrix), that you could use to speed up the process of creating an epsilon-neighborhood-graph.
Therefore I don't think that you find a predefined function for your case. If you want to build one your own, the steps should be as follows: Let X be your data set and i be the data point of interest.
Create the distance matrix D associated to your data set, use scipy.spatial.distance_matrix and take as x the small subset of your data set and as y the whole data set.
Create a list, neighbors = []
Loop over the i'th row of the distance matrix. If D(i,j) < epsilon, then save j in neighbors. It is the index of a data point in the epsilon neighborhood of i.
Return neighbors
Of course the computation of the distance matrix should happen once at the beginning (maybe in init() if you wrap everything up in a class), and the function/method that returns all epsilon neighbors of a data point should only depend on the index of the data point in question.
Hope this helps!
I'm using meanshift clustering to remove unwanted noise from my input data..
Data can be found here. Here what I have tried so far..
import numpy as np
from sklearn.cluster import MeanShift
data = np.loadtxt('model.txt', unpack = True)
## data size is [3X500]
ms = MeanShift()
ms.fit(data)
after trying some different bandwidth value I am getting only 1 cluster.. but the outliers and noise like in the picture suppose to be in different cluster.
when decreasing the bandwidth a little more then I ended up with this ... which is again not what I was looking for.
Can anyone help me with this?
You can remove outliers before using mean shift.
Statistical removal
For example, fix a number of neighbors to analyze for each point (e.g. 50), and the standard deviation multiplier (e.g. 1). All points who have a distance larger than 1 standard deviation of the mean distance to the query point will be marked as outliers and removed. This technique is used in libpcl, in the class pcl::StatisticalOutlierRemoval, and a tutorial can be found here.
Deterministic removal (radius based)
A simpler technique consists in specifying a radius R and a minimum number of neighbors N. All points who have less than N neighbours withing a radius of R will be marked as outliers and removed. Also this technique is used in libpcl, in the class pcl::RadiusOutlierRemoval, and a tutorial can be found here.
Mean-shift is not meant to remove low-density areas.
It tries to move all data to the most dense areas.
If there is one single most dense point, then everything should move there, and you get only one cluster.
Try a different method. Maybe remove the outliers first.
set his parameter to false cluster_allbool, default=True
If true, then all points are clustered, even those orphans that are not within any kernel. Orphans are assigned to the nearest kernel. If false, then orphans are given cluster label -1.