OBJECTIVE
Aggregate store locations' GPS information (longitude, latitude)
Aggregate the population size of each store's surrounding area (e.g. 1,000,000 residents)
Use K-means to determine optimal distribution centers, given store GPS data and local population (i.e. distribution centers are located closer to urban stores than to rural stores due to higher demand).
ISSUES
I've been researching how to add weighted variables to a k-means algorithm, but I'm unsure of the actual process of weighting the variables. For example, if I have [lat, long, population (in thousands)] (e.g. "New York" = [40.713, 74.005, 8406]), wouldn't this construct the centroid in 3-dimensional space? If so, wouldn't the distances be improperly skewed and misrepresent the best location for a warehouse distribution center?
Additional research alludes to UPGMA, the "Unweighted Pair Group Method", in which the size of the cluster is taken into account. However, I haven't fully reviewed this method or its intricacies.
REFERENCES
Reference 1: http://cs.au.dk/~simina/weighted.pdf (page 5)
It can also be shown that a few other algorithms similar to k-means, namely k-median and k-medoids, are also weight-separable. The details appear in the appendix. Observe that all of these popular objective functions are highly responsive to weight.
Reference 2: https://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf (page 39: "Ability to Handle Different Cluster Sizes")
1) You only want to do k-means in the (longitude, latitude) space. If you add population as a 3rd dimension, you will bias your centroids towards the midpoint between large population centres, which are often far apart.
2) The simplest hack to incorporate a weighting in k-means is to repeat a point (longitude, latitude) according to its population weight (a small sketch of this appears after these notes).
3) k-means is probably not the best clustering algorithm for the job, as travel times do not scale linearly with distance. Also, you are basically guaranteed to never have a distribution centre bang in the middle of a large population centre, which is probably not what you want. I would use DBSCAN, for which scikit-learn has a nice implementation:
http://scikit-learn.org/stable/modules/clustering.html
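To make point 2 concrete, here is a minimal sketch. The store coordinates and populations below are invented for illustration, as are the names coords and population. Repeating rows is the literal hack; scikit-learn's KMeans.fit also accepts a sample_weight argument, which achieves the same effect without inflating the data set.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: one row of [latitude, longitude] per store, plus each
# store's surrounding population in thousands (all numbers invented).
coords = np.array([
    [40.71, -74.00],
    [40.65, -73.95],
    [42.89, -78.88],
    [43.05, -76.15],
])
population = np.array([8400, 2600, 260, 140])

# Option 1: the "repeat points" hack -- duplicate each (lat, lon) row
# proportionally to its population weight, then run plain k-means.
repeated = np.repeat(coords, population, axis=0)
km_repeat = KMeans(n_clusters=2, n_init=10, random_state=0).fit(repeated)

# Option 2: equivalent and cheaper -- pass the weights directly.
km_weighted = KMeans(n_clusters=2, n_init=10, random_state=0).fit(
    coords, sample_weight=population
)

print(km_weighted.cluster_centers_)  # candidate distribution-centre locations

Both options pull the centroids towards heavily populated stores; the weighted fit is preferable once the populations get large, since it avoids materialising thousands of duplicate rows.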
Related
I have a set of lat/long coordinates spread out across a city (about 1000). I'd like to create clusters with this data following some strict rules:
No cluster can have more than X data points in it (possibly 8, but this can change)
No cluster can contain two data points with more than X km between them (possibly 1 km, but this can change too)
There can be clusters with one single point
No specific number of clusters need to be created
I've tried doing this using AgglomerativeClustering from sklearn, using the following code:
from sklearn.cluster import AgglomerativeClustering

# arr is an (n, 2) array of lat/lon coordinates
cluster = AgglomerativeClustering(n_clusters=None, affinity='euclidean', linkage='complete', distance_threshold=0.01)
cluster.fit_predict(arr)
The issue here is that I'm not fulfilling items 1, 2 or 3 above, only item 4.
I'd like a clustering algorithm where I can set those parameters and have it run the most efficient clustering possible (i.e. the smallest number of clusters that respects all of items 1, 2, 3 and 4).
Is there any way this could be done with sklearn or any other imported clustering algo or would one have to build this manually?
Thanks!
Write your own.
A simple approach would be to use agglomerative clustering (the real one, e.g., from scipy; the sklearn version is too limited) to get the full merge history for complete linkage. Then replay the merges bottom-up, accepting a merge only while it satisfies your two requirements: with complete linkage the merge height is the maximum pairwise distance, and once a cluster would become too large you stop merging it.
Beware that the result will, however, be quite unbalanced. My guess is that you want as few clusters as possible to cover your data within the maximum radius and occupancy. Then your problem is likely closer to set cover. Finding the optimum result for such problems is usually NP-hard, so you'll have to accept an approximation. I'd go with a greedy strategy and then iterative refinement by local search.
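Here is a rough sketch of the merge-replay approach described above. The function name constrained_clusters and its parameters are hypothetical, the points are random planar coordinates, and plain Euclidean distance is used for brevity; for real lat/lon data in km you would supply a condensed haversine distance matrix instead. As noted, this greedy replay does not minimise the number of clusters.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

def constrained_clusters(points, max_size=8, max_dist=1.0):
    """Replay a complete-linkage merge history, accepting a merge only while
    the cluster diameter stays under max_dist and the size under max_size."""
    n = len(points)
    dists = pdist(points)                  # condensed pairwise distances
    Z = linkage(dists, method='complete')  # full merge history

    members = {i: [i] for i in range(n)}   # cluster id -> list of point indices
    next_id = n                            # scipy labels the i-th merge as n + i
    for left, right, height, _ in Z:
        left, right = int(left), int(right)
        if left in members and right in members:
            merged = members[left] + members[right]
            # With complete linkage, 'height' is the maximum pairwise distance
            # inside the merged cluster, i.e. its diameter.
            if height <= max_dist and len(merged) <= max_size:
                members[next_id] = merged
                del members[left], members[right]
        next_id += 1

    labels = np.empty(n, dtype=int)
    for label, idx in enumerate(members.values()):
        labels[idx] = label
    return labels

pts = np.random.RandomState(0).rand(40, 2)
print(constrained_clusters(pts, max_size=8, max_dist=0.2))

Rejected merges simply leave their two children as final clusters, which is what "stop merging" means here.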
I have pickup and drop locations in the form of latitude and longitude. I'm clustering the locations based on their pickup locations using hierarchical clustering.
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

Zd = linkage(squareform(pickDistance), method="ward", metric="haversine")
cld = fcluster(Zd, 30, criterion='distance')
Here, 'pickDistance' is the square proximity (distance) matrix computed from all the pickup lat-lons; squareform converts it to the condensed form that linkage expects.
Using a distance matrix for each cluster formed, and taking the pickup and drop locations, the OR-Tools routing solver gives me the routes for multiple vehicles, for each cluster.
When I increase the cluster_distance, the solver keeps executing; in the end I cancel the execution and reset the cluster_distance and max_distance until I get the routes.
I want to understand a few things here:
How do I set an optimal cluster_distance, and what, in your opinion, is the best clustering method for geo-locations?
How does the max_distance parameter in the routing solver work? Is max_distance per vehicle, or shared across all the vehicles it uses?
Is there any way to make the cluster_distance and max_distance parameters of the routing solver dynamic, so that they work for any number of locations in a cluster?
Kindly Help.
K-means should be right in this case. Since k-means groups based solely on the Euclidean distance between objects, you will get back clusters of locations that are close to each other.
To find the optimal number of clusters you can try making an 'elbow' plot of the within-group sum of squared distances. This may be helpful!
(http://nbviewer.ipython.org/github/nborwankar/LearnDataScience/blob/master/notebooks/D3.%20K-Means%20Clustering%20Analysis.ipynb)
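As a small sketch of the elbow idea (the coords below are randomly generated stand-ins for pickup points, not real data): compute the within-cluster sum of squares, which scikit-learn's KMeans exposes as inertia_, for a range of k and look for the bend where adding clusters stops helping much.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Stand-in pickup locations: 200 random [lat, lon] points in a small box.
coords = np.random.RandomState(0).uniform([12.9, 77.5], [13.1, 77.7], size=(200, 2))

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords).inertia_
            for k in ks]

plt.plot(list(ks), inertias, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('within-cluster sum of squares')
plt.show()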
I am watching the MIT OpenCourseWare 6.0002 clustering video and I do not understand some of the code from that class.
What is this .Cluster?
for e in initialCentroids:
    clusters.append(cluster.Cluster([e]))
What is .distance?
for e in examples:
    smallestDistance = e.distance(clusters[0].getCentroid())
What is .dissimilarity?
minDissimilarity = cluster.dissimilarity(best)
From the code I can understand what they are doing, but I would like more detail about it. A related document would be highly appreciated!
These terms mainly describe data and the relationships between data points. Let's start with Cluster.
A cluster is a set of observed data points that share similar characteristics in some sense. Clustering is mainly a method of unsupervised learning. To picture it easily: a map is a set of clusters grouping people by nationality, but, as in ML, some people end up scattered across other countries, which is normal to some degree.
If we take distance as the distance between clusters, the term refers to how far cluster1's centroid is from cluster2's centroid. The term can also apply to a given point, by measuring the distance from that point to every cluster's centroid; the point then belongs to the cluster with the minimal distance.
Dissimilarity describes much the same quantity as distance: it tells how far the data points are from the original centroid, so when distance is high, dissimilarity is also high (in my opinion; I'm not certain about this one).
Hope it helps.
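For orientation, here is a rough, simplified sketch of the kind of Example/Cluster classes and dissimilarity helper such a cluster module might define. The names follow the snippets in the question, but the implementation details below are a guess for illustration, not the actual 6.0002 course code.

import numpy as np

class Example:
    """One data point: a feature vector plus a name."""
    def __init__(self, name, features):
        self.name = name
        self.features = np.asarray(features, dtype=float)

    def distance(self, other):
        # Euclidean distance to another Example (e.g. a cluster's centroid).
        return np.linalg.norm(self.features - other.features)

class Cluster:
    """A group of Examples with a centroid at the mean of their features."""
    def __init__(self, examples):
        self.examples = examples
        self.centroid = self.computeCentroid()

    def computeCentroid(self):
        mean = np.mean([e.features for e in self.examples], axis=0)
        return Example('centroid', mean)

    def getCentroid(self):
        return self.centroid

    def variability(self):
        # Sum of squared distances of members to the centroid.
        return sum(e.distance(self.centroid) ** 2 for e in self.examples)

def dissimilarity(clusters):
    # Total spread of a clustering: the sum of each cluster's variability.
    return sum(c.variability() for c in clusters)

So cluster.Cluster([e]) builds a one-point cluster around a centroid, e.distance(...) measures how far an example is from a centroid, and cluster.dissimilarity(best) scores a whole clustering by how tightly each cluster hugs its centroid.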
I have a collection of Points objects, containing latitude and longitude (along with a few other irrelevant properties). I want to form clusters i.e. collections of points that are close together, relative to other points.
Alternatively, I would like an algorithm which, if given a list of clusters containing close-by points and a new point, determines which cluster the new point belongs to (and adds it to a new cluster if it doesn't belong to an existing cluster).
I looked at hierarchical clustering algorithms but those run too slow. The k-means algorithm requires you to know the number of clusters beforehand, which is not really very helpful.
Thanks!
Try density based clustering methods.
DBSCAN is one of the most popular of those.
I am assuming you are using python.
Check out this:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html
When you cluster based on GPS lat/lon, you may want to use a different distance calculation than DBSCAN's default. Use its metric parameter to supply your own distance function or a precomputed distance matrix. For geographic distance calculations, check out the Haversine formula.
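One way to do that is with scikit-learn's built-in 'haversine' metric, sketched below; the sample coordinates are made up. Note that the haversine metric expects [lat, lon] in radians, so the radius eps is an angle and a distance in km has to be divided by the Earth's radius.

import numpy as np
from sklearn.cluster import DBSCAN

# Made-up [latitude, longitude] points: two dense areas plus one lone point.
points = np.array([
    [52.5200, 13.4050], [52.5205, 13.4049], [52.5198, 13.4061],
    [48.8566,  2.3522], [48.8570,  2.3530],
    [40.7128, -74.0060],
])

eps_km = 1.0
db = DBSCAN(eps=eps_km / 6371.0, min_samples=2, metric='haversine')
labels = db.fit_predict(np.radians(points))
print(labels)   # -1 marks noise points that belong to no cluster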
I use the k-means algorithm to cluster a set of documents.
(parameters: number of clusters = 8, number of runs with different centroid seeds = 10)
The number of documents is 5,800.
Surprisingly the result for the clustering is
90% of documents belong to cluster - 7 (final cluster)
9% of documents belong to cluster - 0 (first cluster)
and the remaining 6 clusters have only a single sample each. What might be the reason for this?
K-means clustering attempts to minimize the sum of distances between each point and the centroid of the cluster that point belongs to. Therefore, if 90% of your points are close together, the sum of distances between those points and a single centroid is fairly small, so the k-means algorithm puts a centroid there. Single points are put in their own clusters because they are really far from the other points, and merging them with other points would not be optimal.
K-means is highly sensitive to noise!
Noise, which is farther away from the data, becomes even more influential when you square its deviations. This makes k-means really sensitive to it.
Produce a data set with 50 points distributed N(0; 0.1), 50 points distributed N(1; 0.1), and 1 point at 100. Run k-means with k=2, and you are bound to get that one point as its own cluster and the two real clusters merged (a small sketch reproducing this appears at the end of this answer).
It's just how k-means is supposed to work: find a least-squares quantization of the data; it does not care whether there are "clumps" in your data set or not.
Now it may often be beneficial (with respect to the least-squares objective) to make one-element clusters if there are outliers (here, you apparently have at least 6 such outliers). In such cases, you may need to increase k by the number of one-element clusters you get, or use outlier detection methods, or a clustering algorithm such as DBSCAN that is tolerant of noise.
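A minimal sketch reproducing the experiment above (exact assignments can vary with initialization, but the outlier almost always grabs its own centroid, leaving the two real clusters merged):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Two real clusters around 0 and 1, plus a single far-away outlier at 100.
data = np.concatenate([
    rng.normal(0.0, 0.1, 50),
    rng.normal(1.0, 0.1, 50),
    [100.0],
]).reshape(-1, 1)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
print(np.bincount(labels))   # cluster sizes, typically something like [100, 1]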
K-means is indeed sensitive to noise BUT investigate your data!
Have you pre-processed your "real data" before applying the distance measure to it?
Are you sure your distance metric represents proximity as you expected?
There are a lot of possible "bugs" that may cause this scenario; it's not necessarily k-means's fault.