I have a collection of Points objects, containing latitude and longitude (along with a few other irrelevant properties). I want to form clusters i.e. collections of points that are close together, relative to other points.
Alternatively, I would like an algorithm which, if given a list of clusters containing close-by points and a new point, determines which cluster the new point belongs to (and adds it to a new cluster if it doesn't belong to an existing cluster).
I looked at hierarchical clustering algorithms, but those run too slowly. The k-means algorithm requires you to know the number of clusters beforehand, which is not very helpful here.
Thanks!
Try density-based clustering methods.
DBSCAN is one of the most popular of those.
I am assuming you are using python.
Check out these:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html
When you cluster GPS lat/lon, you may want a different distance calculation than DBSCAN's default Euclidean metric. Use its metric parameter to supply your own distance function or a distance matrix. For geographic distance calculations, check out the Haversine formula.
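As an illustration, here is a minimal sketch (assuming scikit-learn and made-up coordinates) that uses DBSCAN with its built-in haversine metric; that metric expects [lat, lon] in radians and measures distances on the unit sphere, so eps has to be scaled by the Earth's radius:

import numpy as np
from sklearn.cluster import DBSCAN

# Example points as (latitude, longitude) in degrees; replace with your own data.
coords_deg = np.array([
    [40.7128, -74.0060],
    [40.7130, -74.0055],
    [34.0522, -118.2437],
])

# Points within ~500 m of a neighbour end up in the same cluster; -1 marks noise.
earth_radius_km = 6371.0
eps_km = 0.5
db = DBSCAN(eps=eps_km / earth_radius_km, min_samples=2,
            metric='haversine', algorithm='ball_tree')
labels = db.fit_predict(np.radians(coords_deg))
print(labels)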
I have a set of lat/long coordinates spread out across a city (about 1000). I'd like to create clusters with this data following some strict rules:
1) No cluster can have more than X data points in it (possibly 8, but this can change)
2) No cluster can contain two data points with more than X km between them (possibly 1 km, but this can change too)
3) There can be clusters with one single point
4) No specific number of clusters needs to be created
I've tried doing this using AgglomerativeClustering from sklearn, using the following code:
from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=None, affinity='euclidean', linkage='complete', distance_threshold=0.01)
cluster.fit_predict(arr)
The issue here is that I'm not fulfilling items 1, 2 or 3 above, only item 4.
I'd like to have a clustering algorithm where I'm able to set those parameters and have it run the most efficient clustering possible (i.e. the least number of clusters that respects all of items 1 through 4).
Is there any way this could be done with sklearn or any other imported clustering algo or would one have to build this manually?
Thanks!
Write your own.
A simple approach would be to use agglomerative clustering (the real one, e.g. from scipy; the sklearn version is too limited) to get the full merge history for complete linkage. Then process the merges bottom-up, applying each one only if it satisfies your two requirements: with complete linkage the merge height is the maximum pairwise distance, and if the cluster would become too large, you stop merging.
Beware, however, that the result will be quite unbalanced. My guess is that you want as few clusters as possible that cover your data within the maximum radius and occupancy. Then your problem is likely closer to set cover. Finding the optimum result for such problems is usually NP-hard, so you'll have to accept an approximation. I'd go with a greedy strategy and then iterative refinement by local search.
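A rough sketch of that merge-replay idea, assuming the point coordinates are already in a planar projection so that Euclidean distances are meaningful (for raw GPS data, pass a haversine-based condensed distance matrix to linkage instead). constrained_complete_linkage is a hypothetical helper, not a library function:

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

def constrained_complete_linkage(points, max_diameter, max_size):
    # Full complete-linkage merge history; Z rows are (id_a, id_b, height, size),
    # and with complete linkage the height is the diameter of the merged cluster.
    n = len(points)
    Z = linkage(pdist(points), method='complete')
    members = {i: [i] for i in range(n)}      # every point starts as its own cluster
    for step, (a, b, height, size) in enumerate(Z):
        a, b, new_id = int(a), int(b), n + step
        # Apply the merge only if both children still exist untouched and the
        # result stays within the diameter and size limits; otherwise skip it
        # (and, implicitly, every later merge that builds on it).
        if a in members and b in members and height <= max_diameter and size <= max_size:
            members[new_id] = members.pop(a) + members.pop(b)
    labels = np.empty(n, dtype=int)
    for label, cluster in enumerate(members.values()):
        labels[cluster] = label
    return labels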
I am trying to separate a data set that has 2 clusters that do not overlap in any way, plus a single data point that is away from these two clusters.
When I use kmeans() to get the 2 clusters, it splits one of the "valid" clusters in half and considers the single data point as a separate cluster.
Is there a way to specify a minimum number of points for this? I am using MATLAB.
There are several solutions:
Easy: try with 3 clusters;
Easy: remove the single data point (you can detect it as an outlier with any outlier detection technique);
To be tried: use a k-medoids approach instead of k-means. This sometimes helps with getting rid of outliers.
More complicated, but surely works: perform spectral clustering. This helps you get over the main issue of k-means, which is the brutal use of the Euclidean distance.
More explanations of the inadequate behaviour of k-means can be found on the Cross Validated site (see here, for instance).
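The asker is on MATLAB, but since the rest of this thread uses Python, here is a scikit-learn sketch (on made-up data) combining the two easy options: ask for 3 clusters so the stray point gets its own cluster, then discard clusters below a minimum size. The same steps translate directly to MATLAB's kmeans.

import numpy as np
from sklearn.cluster import KMeans

# Toy data: two tight, well-separated blobs plus one stray point.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], 0.1, size=(50, 2)),
    rng.normal([5, 5], 0.1, size=(50, 2)),
    [[50.0, 50.0]],                      # the lone outlier
])

# With 3 clusters the outlier takes one cluster instead of splitting a real one.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Discard clusters with fewer than 2 members; -1 marks discarded points.
counts = np.bincount(labels)
labels = np.where(counts[labels] < 2, -1, labels)
print(np.bincount(labels[labels >= 0]))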
OBJECTIVE
Aggregate store locations GPS information (longitude, latitude)
Aggregate size of population in surrounding store area (e.g. 1,000,000 residents)
Use K-means to determine optimal distribution centers, given store GPS data and local population (i.e. distribution centers are located closer to urban stores vs. rural stores due to higher demand).
ISSUES
I've been researching how to add weighted variables to a k-means algorithm, but I'm unsure about the actual process of weighting the variables. For example, if I have [lat, long, and population (in thousands)] (e.g. "New York" = [40.713, 74.005, 8406]), wouldn't this construct the centroid in 3-dimensional space? If so, wouldn't the distances be improperly skewed and misrepresent the best location for a warehouse distribution center?
Additional research points to UPGMA, the "Unweighted Pair Group Method", where the size of the cluster is taken into account. However, I haven't fully reviewed this method or its intricacies.
REFERENCES
Reference 1: http://cs.au.dk/~simina/weighted.pdf (page 5)
It can also be shown that a few other algorithms similar to k-means, namely k-median and k-medoids, are also weight-separable. The details appear in the appendix. Observe that all of these popular objective functions are highly responsive to weight.
Reference 2: https://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf (page 39: "Ability to Handle Different cluster sizes")
1) You only want to do k-means in the (longitude, latitude) space. If you add population as a 3rd dimension, you will bias your centroids towards the midpoint between large population centres, which are often far apart.
2) The simplest hack to incorporate a weighting in k-means is to repeat a point (longitude, latitude) according to its population weight (a short sketch follows after this list).
3) k-means is probably not the best clustering algorithm for the job, as travel times do not scale linearly with distance. Also, you are basically guaranteed to never have a distribution centre bang in the middle of a large population centre, which is probably not what you want. I would use DBSCAN, for which scikit-learn has a nice implementation:
http://scikit-learn.org/stable/modules/clustering.html
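Here is a minimal sketch of points 1 and 2, with made-up store data (the populations are illustrative): cluster on (lat, lon) only, and weight by population either via the repetition hack or, in recent scikit-learn versions, via the sample_weight argument of KMeans.fit:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stores: (latitude, longitude) and population in thousands.
stores = np.array([
    [40.713, -74.005],    # New York
    [41.878, -87.630],    # Chicago
    [44.977, -93.265],    # Minneapolis
])
population = np.array([8406, 2716, 425])
k = 2

# Point 2's hack: repeat each store proportionally to its population.
repeats = np.maximum(1, np.round(population / population.min()).astype(int))
weighted = np.repeat(stores, repeats, axis=0)
centres = KMeans(n_clusters=k, n_init=10, random_state=0).fit(weighted).cluster_centers_

# Cheaper alternative: pass the weights directly.
centres_sw = KMeans(n_clusters=k, n_init=10, random_state=0).fit(
    stores, sample_weight=population).cluster_centers_
print(centres, centres_sw, sep='\n')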
I am working with one-dimensional gene position data, which looks like
[705118, 705118, 832132, 860402, 865710, 867206, 925364, 925364,925364]
(around 2000 items in one array), and I want to divide the array into clusters with a maximum within-cluster distance of at most 2000.
So I used the
chrd=scipy.spatial.distance.pdist(chrn,metric='euclidean')
to get the distance matrix and then
scipy.cluster.hierarchy.linkage(chrd,method='average',metric='euclidean')
to get the linkage matrix.
But scipy.cluster.hierarchy.fcluster doesn't seem to offer a way to cut the hierarchy tree based on the maximum within-cluster distance.
Does anyone have any idea about how to handle this?
I tried to write a hierarchical clustering algorithm that includes such a threshold, but it seems really hard to do >.<
Thanks in advance
Use complete linkage, not average linkage, if you need to bound the maximum distance between objects.
Then cut the tree at the desired height.
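Concretely, with scipy something like this should do (using the positions from the question): with complete linkage the merge height equals the cluster diameter, so cutting with criterion='distance' at t=2000 guarantees no cluster contains two points more than 2000 apart.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

positions = np.array([705118, 705118, 832132, 860402, 865710,
                      867206, 925364, 925364, 925364], dtype=float)

# Complete linkage: merge heights are maximum pairwise distances.
Z = linkage(pdist(positions.reshape(-1, 1)), method='complete')

# Cut the tree so the maximum within-cluster distance never exceeds 2000.
labels = fcluster(Z, t=2000, criterion='distance')
print(labels)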
Alternatively, you can also implement Leader clustering. It's stupid, but there the maximum distance is at most twice the cluster radius.
If using scikit-learn is a possibility, you could check out http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
It has a parameter, eps, that limits the distance between two samples.
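For example, a quick sketch on the positions from the question. Note that eps bounds the gap to a point's neighbours, not the overall cluster diameter, so a chain of points each less than 2000 apart can still form one cluster whose total spread exceeds 2000:

import numpy as np
from sklearn.cluster import DBSCAN

positions = np.array([705118, 705118, 832132, 860402, 865710,
                      867206, 925364, 925364, 925364], dtype=float).reshape(-1, 1)

# eps limits the distance to neighbouring samples; min_samples=1 means
# every point ends up in some cluster (no noise label).
labels = DBSCAN(eps=2000, min_samples=1).fit_predict(positions)
print(labels)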
I'm curious if it is possible to specify your own distance function between two points for scipy clustering. I have datapoints with 3 values: GPS-lat, GPS-lon, and posix-time. I want to cluster these points using some algorithm: either agglomerative clustering, meanshift, or something else.
The problem is that the distance between GPS points needs to be calculated with the Haversine formula. And then that distance needs to be weighted appropriately so it is comparable with a distance in seconds for clustering purposes.
Looking at the documentation for scipy I don't see anything that jumps out as a way to specify a custom distance between two points.
Is there another way I should be going about this? I'm curious what the Pythonic thing to do is.
You asked for sklearn, but I don't have a good answer for you there. Basically, you could build a distance matrix the way you like, and many algorithms will process the distance matrix. The problem is that this needs O(n^2) memory.
For my attempts at clustering geodata, I have instead used ELKI (which is Java, not Python). First of all, it includes geodetic distance functions; but it also includes index acceleration for many algorithms and for this distance function.
I have not used an additional attribute such as time. As you already noticed, you need to weight them appropriately, as 1 meter does not equal 1 second. The weights will be very much use-case dependent and heuristic.
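As a sketch of the distance-matrix route in Python (the walking-speed weight and the cut threshold are made-up, purely illustrative choices): scipy's pdist accepts a callable, so you can combine a haversine distance with a weighted time difference and feed the resulting matrix to linkage, or to DBSCAN with metric='precomputed'.

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Rows are (lat_deg, lon_deg, posix_time); toy values.
data = np.array([
    [40.7128, -74.0060, 1700000000],
    [40.7130, -74.0055, 1700000060],
    [40.7500, -73.9900, 1700005000],
])

SECONDS_PER_METRE = 1.0 / 1.4   # assumed walking speed; the weighting is use-case dependent

def combined_distance(p, q):
    # Haversine distance in metres, converted to "seconds", plus the time gap.
    lat1, lon1, t1 = p
    lat2, lon2, t2 = q
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    metres = 2 * 6371000.0 * np.arcsin(np.sqrt(a))
    return metres * SECONDS_PER_METRE + abs(t2 - t1)

# O(n^2) condensed distance matrix; any algorithm that accepts precomputed
# distances can use it (scipy linkage here, or sklearn's
# DBSCAN(metric='precomputed') on scipy's squareform(D)).
D = pdist(data, metric=combined_distance)
labels = fcluster(linkage(D, method='average'), t=300, criterion='distance')
print(labels)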
Why I'm suggesting ELKI is because they have a nice Tutorial on implementing custom distance functions that then can be used in most algorithms. They can't be used in every algorithm - some don't use distance at all, or are constrained to e.g. Minkowski metrics only. But a lot of algorithms can use arbitrary (even non-metric) distance functions.
There also is a follow-up tutorial on index accelerated distance functions. For my geodata, indexes were tremendously useful, speeding things up by a factor of over 100x and thus enabling me to process 10 times more data.