What are Cluster, dissimilarity and distance in Python?

I am watching MIT OpenCourseWare 6.0002 clustering video and I do not understand some code from that class.
What is this .Cluster ?
for e in initialCentroids:
    clusters.append(cluster.Cluster([e]))
What is .distance?
for e in examples:
    smallestDistance = e.distance(clusters[0].getCentroid())
What is .dissimilarity?
minDissimilarity = cluster.dissimilarity(best)
From the code I can understand what they are doing, but I would like more detail about it. Related documentation would be highly appreciated!

These are terms that mainly describe data points and the relationships between them. Let's start with Cluster.
A cluster is a set of observed data points that share similar characteristics in some sense. Clustering is mainly an unsupervised learning method. To picture it easily: a map is a set of clusters grouping people by nationality, but, as in ML, some people end up scattered into other countries, which is normal up to a point.
If we take distance as the distance between clusters, the term refers to how far cluster1's centroid is from cluster2's centroid. The term can also refer to a given point: measure the distance from the point to every cluster's centroid, and the point is owned by the cluster with the minimal distance.
In addition, dissimilarity describes much the same quantity as distance: it tells how unlike the data points are from their centroid, meaning that when the distance is high, the dissimilarity is also high (in my opinion; I'm not sure about this one).
Hope it helps.
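For context, cluster here is presumably a helper module (cluster.py) shipped with the 6.0002 lecture code, so cluster.Cluster is a class defined in that file, e.distance is a method on the example objects, and cluster.dissimilarity is a module-level function. Below is a simplified sketch of what such a module might look like; the real course file is richer (it uses a Minkowski-distance helper and carries labels), so treat this only as an illustration of the structure:

import math

class Example(object):
    """One data point: a name plus a feature vector."""
    def __init__(self, name, features):
        self.name = name
        self.features = features
    def getFeatures(self):
        return self.features
    def distance(self, other):
        # Euclidean distance between the two feature vectors
        return math.sqrt(sum((a - b) ** 2
                             for a, b in zip(self.features, other.getFeatures())))

class Cluster(object):
    """A group of Examples together with the centroid computed from them."""
    def __init__(self, examples):
        self.examples = examples
        self.centroid = self.computeCentroid()
    def computeCentroid(self):
        dim = len(self.examples[0].getFeatures())
        means = [sum(e.getFeatures()[i] for e in self.examples) / len(self.examples)
                 for i in range(dim)]
        return Example('centroid', means)
    def getCentroid(self):
        return self.centroid
    def variability(self):
        # Sum of squared distances from each member to the centroid
        return sum(e.distance(self.centroid) ** 2 for e in self.examples)

def dissimilarity(clusters):
    # Total variability over a list of clusters; k-means tries to keep this low
    return sum(c.variability() for c in clusters)

With that structure, cluster.Cluster([e]) wraps a single candidate centroid in its own cluster, e.distance(clusters[0].getCentroid()) measures how far an example is from a cluster's centroid, and cluster.dissimilarity(best) scores a complete clustering so the driver code can keep the best result over several trials.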

Related

How to compute Quantization Error for clustering?

I would like to measure the quality of clustering using Quantization Error but can't find any clear info regarding how to compute this metric.
The few documents/ articles I've found are:
"Estimating the number of clusters in a numerical data set via quantization error modeling" (Unfortunately there's no free access to this paper)
This question posted back in 2011 on Cross-Validated about the different types of distance measures (the question is very specific and doesn't give much about the calculation)
This gist repo where a quantization_error function (at the very end of the code) is implemented in Python
Regarding the third link (which is the best piece of info I've found so far) I don't know how to interpret the calculation (see snippet below):
(the # annotations are mine. question marks indicate steps that are unclear to me)
def quantization_error(self):
    """
    This method calculates the quantization error of the given clustering
    :return: the quantization error
    """
    total_distance = 0.0
    s = Similarity(self.e)  # Class containing different types of distance measures
    # For each point, compute squared fractional distance between point and centroid ?
    for i in range(len(self.solution.patterns)):
        total_distance += math.pow(s.fractional_distance(self.solution.patterns[i], self.solution.centroids[self.solution.solution[i]]), 2.0)
    return total_distance / len(self.solution.patterns)  # Divide total_distance by the total number of points ?
QUESTION: Is this calculation of the quantization error correct ? If no, what are the steps to compute it ?
Any help would be much appreciated.
At the risk of restating things you already know, I'll cover the basics.
REVIEW
Quantization is any time we simplify a data set by moving each of the many data points to a convenient (nearest, by some metric) quantum point. These quantum points are a much smaller set. For instance, given a set of floats, rounding each one to the nearest integer is a type of quantization.
Clustering is a well-known, often-used type of quantization, one in which we use the data points themselves to determine the quantum points.
Quantization error is a metric of the error introduced by moving each point from its original position to its associated quantum point. In clustering, we often measure this error as the root-mean-square error of each point (moved to the centroid of its cluster).
YOUR SOLUTION
... is correct, in a very common sense: you've computed the sum-squared error of the data set, and taken the mean of that. This is a perfectly valid metric.
The method I see more often is to take the square root of that final mean, cluster by cluster, and use the sum of those roots as the error function for the entire data set.
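As an illustration (not the gist's exact code, which uses a "fractional" distance; plain Euclidean distance is assumed here, and points, centroids and labels are hypothetical NumPy arrays), the two variants might look like this:

import numpy as np

def quantization_error(points, centroids, labels):
    # Mean of the squared point-to-assigned-centroid distances
    # (the quantity computed in the snippet above).
    sq_dist = np.sum((points - centroids[labels]) ** 2, axis=1)
    return sq_dist.mean()

def summed_cluster_rms(points, centroids, labels):
    # Per-cluster variant: root-mean-square error within each cluster,
    # summed over all clusters.
    total = 0.0
    for k in range(len(centroids)):
        members = points[labels == k]
        if len(members) == 0:
            continue
        sq_dist = np.sum((members - centroids[k]) ** 2, axis=1)
        total += np.sqrt(sq_dist.mean())
    return total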
THE CITED PAPER
One common question in k-means clustering (or any clustering, for that matter), is "what is the optimum number of clusters for this data set?" The paper uses another level of quantization to look for a balance.
Given a set of N data points, we want to find the optimal number 'm' of clusters, which will satisfy some rationalization for "optimum clustering". Once we find m, we can proceed with our usual clustering algorithm to find the optimal clustering.
We can't simply minimize the error at all costs: using N clusters gives us an error of 0.
Is that enough explanation for your needs?

Weighted K-means with GPS Data

OBJECTIVE
Aggregate store location GPS information (longitude, latitude).
Aggregate the size of the population in the surrounding store area (e.g. 1,000,000 residents).
Use K-means to determine optimal distribution centers, given store GPS data and local population (i.e. distribution centers are located closer to urban stores vs. rural stores due to higher demand).
ISSUES
I've been researching how to add weighted variables to a k-means algorithm, but am unsure about the actual process of weighting the variables. For example, if I have [lat, long, and population (in thousands)] (e.g. "New York" = [40.713, 74.005, 8406]), wouldn't this construct the centroid in 3-dimensional space? If so, wouldn't the distances be improperly skewed and misrepresent the best location for a warehouse distribution center?
Additional research alludes to UPGMA, "Unweighted Pair Group Method" where the size of the cluster is taken into account. However, I haven't fully reviewed this method and the intricacies associated with this method.
REFERENCES
Reference 1: http://cs.au.dk/~simina/weighted.pdf (page 5)
It can also be shown that a few other algorithms similar to k-means, namely k-median and k-medoids, are also weight-separable. The details appear in the appendix. Observe that all of these popular objective functions are highly responsive to weight.
Reference 2: https://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf (page 39: "Ability to Handle Different cluster sizes")
1) You only want to do k-means in the (longitude, latitude) space. If you add population as a 3rd dimension, you will bias your centroids towards the midpoint between large population centres, which are often far apart.
2) The simplest hack to incorporate a weighting in k-means is to repeat a point (longitude, latitude) according to its population weight (see the sketch after this answer).
3) k-means is probably not the best clustering algorithm for the job, as travel times do not scale linearly with distance. Also, you are basically guaranteed to never have a distribution centre bang in the middle of a large population centre, which is probably not what you want. I would use DBSCAN, for which scikit-learn has a nice implementation:
http://scikit-learn.org/stable/modules/clustering.html
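A minimal sketch of the weighting hack from point 2, with made-up coordinates and population figures; note that scikit-learn's KMeans also accepts a per-sample weight directly (since version 0.20), which avoids duplicating rows:

import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: one row of [latitude, longitude] per store,
# plus the surrounding population in thousands.
coords = np.array([[40.713, -74.005],   # e.g. New York
                   [42.360, -71.058],   # e.g. Boston
                   [43.161, -77.610]])  # e.g. Rochester
population = np.array([8406, 655, 210])

# Option A: repeat each store proportionally to its population weight.
repeated = np.repeat(coords, population, axis=0)
centers_a = KMeans(n_clusters=2, n_init=10).fit(repeated).cluster_centers_

# Option B: pass the weights directly (equivalent, without blowing up the data set).
centers_b = KMeans(n_clusters=2, n_init=10).fit(coords, sample_weight=population).cluster_centers_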

What can be the reasons for 90% of samples belong to one cluster when there is 8 clusters?

I use the k-means algorithm to cluster a set of documents.
(The parameters are: number of clusters = 8, number of runs with different centroids = 10.)
The number of documents is 5800.
Surprisingly the result for the clustering is
90% of the documents belong to cluster 7 (the final cluster),
9% of the documents belong to cluster 0 (the first cluster),
and the remaining 6 clusters each contain only a single sample. What might be the reason for this?
K-means clustering attempts to minimize the sum of distances between each point and the centroid of the cluster it belongs to. Therefore, if 90% of your points are close together, the sum of distances between those points and a centroid placed among them is fairly small, so the k-means solver puts a centroid there. Single points are put in their own clusters because they are really far from the other points, and merging them into a cluster with other points would not be optimal.
K-means is highly sensitive to noise!
Noise, which lies farther away from the rest of the data, becomes even more influential when you square its deviations. This makes k-means really sensitive to it.
Produce a data set with 50 points distributed N(0; 0.1), 50 points distributed N(1; 0.1), and 1 point at 100. Run k-means with k=2, and you are bound to get a cluster containing only that one point, with the two real clusters merged.
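A quick way to reproduce this experiment (the use of scikit-learn here is my choice; the values mirror the description above):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 0.1, 50),    # first real cluster
                       rng.normal(1, 0.1, 50),    # second real cluster
                       [100.0]]).reshape(-1, 1)   # a single outlier

labels = KMeans(n_clusters=2, n_init=10).fit_predict(data)
# The outlier typically gets a cluster of its own, while the two
# real groups around 0 and 1 end up merged:
print(np.bincount(labels))   # e.g. [100, 1] (label order may vary)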
It's just how k-means is supposed to work: find a least-squares quantization of the data; it does not care whether your data set has "clumps" or not.
Now it may often be beneficial (with respect to the least-squares objective) to make one-element clusters if there are outliers (here, you apparently have at least 6 such outliers). In such cases, you may need to increase k by the number of such one-element clusters you get, or use outlier detection methods, or a clustering algorithm such as DBSCAN, which is tolerant of noise.
K-means is indeed sensitive to noise BUT investigate your data!
Have you pre-processed your real data before applying the distance measure to it?
Are you sure your distance metric represents proximity as you expected?
There are a lot of possible "bugs" that may cause this scenario; it is not necessarily k-means' fault.

Clustering GPS points with a custom distance function in scipy

I'm curious if it is possible to specify your own distance function between two points for scipy clustering. I have datapoints with 3 values: GPS-lat, GPS-lon, and posix-time. I want to cluster these points using some algorithm: either agglomerative clustering, meanshift, or something else.
The problem is that the distance between GPS points needs to be calculated with the Haversine formula, and that distance then needs to be weighted appropriately so it is comparable with a distance in seconds for clustering purposes.
Looking at the documentation for scipy I don't see anything that jumps out as a way to specify a custom distance between two points.
Is there another way I should be going about this? I'm curious what the Pythonic thing to do is.
You asked for sklearn, but I don't have a good answer for you there. Basically, you could build a distance matrix the way you like, and many algorithms will process the distance matrix. The problem is that this needs O(n^2) memory.
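For what it's worth, SciPy does let you pass a callable metric to scipy.spatial.distance.pdist, and the resulting condensed distance matrix can be fed to scipy.cluster.hierarchy for agglomerative clustering. The sketch below (with a purely heuristic metres-to-seconds weight and made-up points) shows the idea, but it still has the O(n^2) memory cost mentioned above:

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

SECONDS_PER_METRE = 0.1   # heuristic weighting between space and time (assumption)

def haversine_metres(lat1, lon1, lat2, lon2):
    # Great-circle distance in metres; inputs in degrees.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * np.arcsin(np.sqrt(a))

def spacetime_distance(p, q):
    # p, q = [lat, lon, posix_time]
    return SECONDS_PER_METRE * haversine_metres(p[0], p[1], q[0], q[1]) + abs(p[2] - q[2])

points = np.array([[52.520, 13.405,     0.0],    # made-up [lat, lon, time] rows
                   [52.521, 13.406,    30.0],
                   [48.856,  2.352, 86400.0]])

condensed = pdist(points, metric=spacetime_distance)           # O(n^2) pairwise distances
labels = fcluster(linkage(condensed, method='average'), t=2, criterion='maxclust')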
For my attempts at clustering geodata, I have instead used ELKI (which is Java, not Python). First of all, it includes geodetic distance functions; but it also includes index acceleration for many algorithms and for this distance function.
I have not used an additional attribute such as time. As you already noticed, you need to weight the attributes appropriately, as 1 meter does not equal 1 second. Weights will be very much use-case dependent, and heuristic.
Why I'm suggesting ELKI is because they have a nice Tutorial on implementing custom distance functions that then can be used in most algorithms. They can't be used in every algorithm - some don't use distance at all, or are constrained to e.g. Minkowski metrics only. But a lot of algorithms can use arbitrary (even non-metric) distance functions.
There is also a follow-up tutorial on index-accelerated distance functions. For my geodata, indexes were tremendously useful, speeding things up by a factor of over 100x, and thus enabling me to process 10 times more data.

DBSCAN with potentially imprecise lat/long coordinates

I've been running scikit-learn's DBSCAN implementation to cluster a set of geotagged photos by lat/long. For the most part, it works pretty well, but I came across a few instances that were puzzling. For instance, there were two sets of photos for which the user-entered text field specified that the photo was taken at Central Park, but the lat/longs for those photos were not clustered together. The photos themselves confirmed that both sets of observations were from Central Park, but the lat/longs were in fact further apart than epsilon.
After a little investigation, I discovered that the reason for this was because the lat/long geotags (which were generated from the phone's GPS) are pretty imprecise. When I looked at the location accuracy of each photo, I discovered that they ranged widely (I've seen a margin of error of up to 600 meters) and that when you take the location accuracy into account, these two sets of photos are within a nearby distance in terms of lat/long.
Is there any way to account for margin of error in lat/long when you're doing DBSCAN?
(Note: I'm not sure if this question is as articulate as it should be, so if there's anything I can do to make it more clear, please let me know.)
Note that DBSCAN doesn't actually need the distances.
Look up Generalized DBSCAN: all it really uses is a "is a neighbor of" relationship.
If you really need to incorporate uncertainty, look up the various DBSCAN variations and extensions that handle imprecise data explicitly. However, you may get pretty much the same results just by choosing a threshold for epsilon that is somewhat reasonable. There is room for choosing a larger epsilon than the one you deem adequate: if you want to use epsilon = 1 km, and you assume your data is imprecise on the range of 100 m, then use 1100 m as epsilon instead.
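A minimal sketch of that "inflate epsilon" idea with scikit-learn's DBSCAN; the 1 km / 100 m figures simply mirror the example above, and the coordinates are made up. The haversine metric expects radians, so eps is expressed as metres divided by the Earth's radius:

import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6371000.0
intended_eps_m = 1000.0   # the epsilon you actually want
gps_error_m = 100.0       # assumed worst-case GPS imprecision

coords_deg = np.array([[40.7829, -73.9654],   # made-up points around Central Park
                       [40.7851, -73.9683],
                       [40.7306, -73.9866]])

eps_rad = (intended_eps_m + gps_error_m) / EARTH_RADIUS_M
labels = DBSCAN(eps=eps_rad, min_samples=2, metric='haversine',
                algorithm='ball_tree').fit_predict(np.radians(coords_deg))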
