I would like to measure the quality of clustering using Quantization Error but can't find any clear info regarding how to compute this metric.
The few documents/articles I've found are:
"Estimating the number of clusters in a numerical data set via quantization error modeling" (Unfortunately there's no free access to this paper)
This question posted back in 2011 on Cross-Validated about the different types of distance measures (the question is very specific and doesn't give much about the calculation)
This gist repo where a quantization_error function (at the very end of the code) is implemented in Python
Regarding the third link (which is the best piece of info I've found so far) I don't know how to interpret the calculation (see snippet below):
(The # annotations are mine; question marks indicate steps that are unclear to me.)
def quantization_error(self):
    """
    This method calculates the quantization error of the given clustering
    :return: the quantization error
    """
    total_distance = 0.0
    s = Similarity(self.e)  # Class containing different types of distance measures
    # For each point, compute squared fractional distance between point and centroid ?
    for i in range(len(self.solution.patterns)):
        total_distance += math.pow(s.fractional_distance(self.solution.patterns[i], self.solution.centroids[self.solution.solution[i]]), 2.0)
    return total_distance / len(self.solution.patterns)  # Divide total_distance by the total number of points ?
QUESTION: Is this calculation of the quantization error correct? If not, what are the steps to compute it?
Any help would be much appreciated.
At the risk of restating things you already know, I'll cover the basics.
REVIEW
Quantization is any process in which we simplify a data set by moving each of the many data points to a convenient (nearest, by some metric) quantum point. These quantum points are a much smaller set. For instance, given a set of floats, rounding each one to the nearest integer is a type of quantization.
Clustering is a well-known, often-used type of quantization, one in which we use the data points themselves to determine the quantum points.
Quantization error is a metric of the error introduced by moving each point from its original position to its associated quantum point. In clustering, we often measure this error as the root-mean-square error of each point (moved to the centroid of its cluster).
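As a toy illustration of that rounding example and this error metric (a quick NumPy sketch with made-up numbers):

import numpy as np

data = np.array([0.2, 1.7, 2.4, 3.9])                  # original points (made-up values)
quantized = np.round(data)                              # quantum points: the nearest integers
rms_error = np.sqrt(np.mean((data - quantized) ** 2))   # root-mean-square quantization error
print(rms_error)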
YOUR SOLUTION
... is correct, in a very common sense: you've computed the sum-squared error of the data set, and taken the mean of that. This is a perfectly valid metric.
The method I see more often is to take the square root of that final mean, cluster by cluster, and use the sum of those roots as the error function for the entire data set.
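To make the two variants concrete, here is a rough sketch of my own (it uses plain Euclidean distance rather than the fractional distance in your snippet; points, centroids and labels are placeholder NumPy arrays):

import numpy as np

def mean_squared_quantization_error(points, centroids, labels):
    # What your snippet computes: the mean of the squared point-to-centroid distances
    diffs = points - centroids[labels]
    return np.mean(np.sum(diffs ** 2, axis=1))

def summed_rms_quantization_error(points, centroids, labels):
    # The variant described above: per-cluster root-mean-square error, summed over clusters
    total = 0.0
    for k in range(len(centroids)):
        members = points[labels == k]
        if len(members) == 0:
            continue
        squared = np.sum((members - centroids[k]) ** 2, axis=1)
        total += np.sqrt(np.mean(squared))
    return total

# Example usage with made-up data
points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
centroids = np.array([[0.5, 0.0], [10.0, 10.0]])
labels = np.array([0, 0, 1])
print(mean_squared_quantization_error(points, centroids, labels))
print(summed_rms_quantization_error(points, centroids, labels))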
THE CITED PAPER
One common question in k-means clustering (or any clustering, for that matter) is "what is the optimum number of clusters for this data set?" The paper uses another level of quantization to look for a balance.
Given a set of N data points, we want to find the optimal number 'm' of clusters, which will satisfy some rationalization for "optimum clustering". Once we find m, we can proceed with our usual clustering algorithm to find the optimal clustering.
We can't simply minimize the error at all costs: using N clusters gives us an error of 0.
Is that enough explanation for your needs?
I have experimental values of 16 intensities corresponding to 16 distances. I want to find the relation between these points as an approximate equation, so that I can tell the distance corresponding to a given intensity value without plotting the graph.
Is there any Python program for this?
I can share the values, if required.
Based on the values you have given us, I highly doubt that fitting a graph rule to this will work at all. The reason is this:
If you aren't concerned with minute changes (in the decimals), then you can essentially take 5.9 as a fair estimate. If you are concerned with those changes, then the data shows seemingly erratic behaviour, and I highly doubt you will get an r^2 value sufficient for any practical use.
If you had significantly more points you might be able to fit a graph rule to this, or even apply a machine learning model to it (the data is simple enough that a basic feed-forward neural network would work; search for TensorFlow), but with just those points a guess of 5.9 is as good as any.
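If you do want to try a simple fit anyway, a rough sketch with scipy.optimize.curve_fit could look like this (the intensity/distance arrays and the exponential form are placeholders, not your actual data or a recommended model):

import numpy as np
from scipy.optimize import curve_fit

# Placeholder data: 16 intensity/distance pairs (not the real measurements)
intensity = np.linspace(1.0, 16.0, 16)
distance = 3.0 * np.exp(-0.3 * intensity) + 5.9

def model(x, a, b, c):
    # A guessed functional form; with real data you would try several
    return a * np.exp(-b * x) + c

params, _ = curve_fit(model, intensity, distance, p0=(1.0, 0.1, 5.0))
predicted = model(intensity, *params)
r_squared = 1 - np.sum((distance - predicted) ** 2) / np.sum((distance - np.mean(distance)) ** 2)
print(params, r_squared)

If r^2 comes out low on your real values, that supports the point above that a guess of 5.9 is as good as any.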
I am watching the MIT OpenCourseWare 6.0002 clustering video and I do not understand some of the code from that class.
What is this .Cluster?
for e in initialCentroids:
    clusters.append(cluster.Cluster([e]))
What is .distance?
for e in examples:
    smallestDistance = e.distance(clusters[0].getCentroid())
What is .dissimilarity?
minDissimilarity = cluster.dissimilarity(best)
From the code I can understand what they are doing, but I would like more detail about it. A related document would be highly appreciated!
These are terms that mainly describe data and the relationships between data points. Let's start with Cluster.
A cluster is a set of observed data points that may share similar characteristics in some sense. Clustering is mainly a method of unsupervised learning. As an easy mental image: a world map is a set of clusters grouping people by nationality, but, as in ML, some people are scattered across other countries, which is normal to some degree.
If we take distance as the distance between clusters, the term refers to how far cluster1's centroid is from cluster2's centroid. The term may also refer to a given point: by measuring the distance from the point to every cluster's centroid, the point is assigned to the cluster with the minimal distance.
In addition, dissimilarity describes much the same quantity as distance: it tells how dissimilar the data points are from their centroid. That means that when the distance is high, the dissimilarity is also high (in my opinion; I'm not sure about this one).
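For a rough idea of what might sit behind those calls, here is a minimal sketch of my own (it assumes Euclidean distance and plain NumPy arrays; the real 6.0002 code differs in its details):

import numpy as np

class Example:
    # One data point holding a feature vector
    def __init__(self, features):
        self.features = np.asarray(features, dtype=float)

    def distance(self, other):
        # Euclidean distance to another Example (an assumption; the course
        # code may use a different distance helper)
        return float(np.linalg.norm(self.features - other.features))

class Cluster:
    # A group of Examples together with their centroid
    def __init__(self, examples):
        self.examples = examples
        self.centroid = self.computeCentroid()

    def computeCentroid(self):
        mean_features = np.mean([e.features for e in self.examples], axis=0)
        return Example(mean_features)

    def getCentroid(self):
        return self.centroid

def dissimilarity(clusters):
    # Sum, over all clusters, of each member's distance to its own centroid;
    # a lower value means the members sit closer to their centroids
    total = 0.0
    for c in clusters:
        for e in c.examples:
            total += e.distance(c.getCentroid())
    return total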
Hope it helps.
Is it possible to do clustering without providing any input apart from the data? The clustering method/algorithm should decide from the data how many logical groups the data can be divided into, and it shouldn't even require me to input a threshold Euclidean distance on which the clusters are built; this too needs to be learned from the data.
Could you please suggest the closest solution to my problem?
Why not code your algorithm to create a list of clusterings ranging in size from 1 to n clusters (where n could be defined in a config file, so that you avoid hard-coding and can change it in one place)?
Once that is done, compute the clusterings of size 1 to n and choose the value that gives you the smallest mean squared error.
This would require some additional work by your machine to determine the optimal number of logical groups the data can be divided into (bounded between 1 and n).
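A rough sketch of that loop, using sklearn's KMeans as one possible algorithm (X, kmax and the error definition are placeholders):

import numpy as np
from sklearn.cluster import KMeans

def errors_for_k_range(X, kmax):
    # Fit one clustering per candidate k and record its mean squared error
    # (inertia is the within-cluster sum of squared distances)
    errors = {}
    for k in range(1, kmax + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        errors[k] = km.inertia_ / len(X)
    return errors

# Example usage with made-up 2-D data
X = np.random.rand(200, 2)
print(errors_for_k_range(X, kmax=10))
# Note: this error keeps shrinking as k grows, so in practice one usually looks
# for the k where the improvement levels off rather than the literal minimum.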
Clustering is an explorative technique.
This means it must always be able to produce different results, as desired by the user. Having many parameters is a feature. It means the method can be adapted easily to very different data, and to user preferences.
There will never be a generally useful parameter-free technique. At best, some parameters will have default values or heuristics (such as Euclidean distance, standardizing the input prior to clustering, or the gap statistic for choosing k) that may give a reasonable first try in 80% of cases. But after that first try, you'll need to understand the data and try other parameters to learn more about your data.
Methods that claim to be "parameter free" usually just have some hidden parameters set so that they work on the few toy examples they were demonstrated on.
Recently I've been trying to figure out how to calculate the entropy of a random variable X using sp.stats.entropy() from the stats package of SciPy, with this random variable X being the returns I obtain from the stock of a specific company ("Company 1") from 1997 to 2012 (this is for a financial data/machine learning assignment). However, the arguments involve inputting the probability values pk, and so far I'm even struggling with computing the actual empirical probabilities, seeing as I only have the observations of the random variable. I've tried different ways of normalising the data in order to obtain an array of probabilities, but my data contains negative values too, which means that when I try to do
asset1 / np.sum(asset1)
where asset1 is the row array of the returns of the stock of "Company 1", I manage to obtain a new array which adds up to 1, but obviously with some negative values, and as we all know, negative probabilities do not exist. Therefore, is there any way of computing the empirical probabilities of my observations (ideally with the option of choosing specific bins, or for a range of values) in Python?
Furthermore, I've spent countless hours looking for a Python package solely dedicated to the calculation of random-variable entropies, joint entropies, mutual information, etc., as an alternative to SciPy's entropy option (simply to compare), but most seem to be outdated (I currently have Python 3.5). Does anyone know of a good package that is compatible with my current version of Python? I know R seems to have a very compact one.
Any kind of help would be highly appreciated. Thank you very much in advance!
EDIT: stock returns are considered to be RANDOM VARIABLES, as opposed to the stock prices which are processes. Therefore, the entropy can definitely be applied in this context.
For continuous distributions, you are better off using the Kozachenko-Leonenko k-nearest-neighbour estimator for entropy (K & L 1987) and the corresponding Kraskov, ..., Grassberger (2004) estimator for mutual information. These circumvent the intermediate step of calculating the probability density function and estimate the entropy directly from the distances of the data points to their k-nearest neighbours.
The basic idea of the Kozachenko-Leonenko estimator is to look at (some function of) the average distance between neighbouring data points. The intuition is that if that distance is large, the dispersion in your data is large and hence the entropy is large. In practice, instead of taking the nearest-neighbour distance, one tends to take the k-nearest-neighbour distance, which makes the estimate more robust.
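For reference, here is a minimal sketch of that idea (my own simplification using SciPy's cKDTree and the Euclidean-ball convention; constants differ slightly between papers, so treat it as illustrative rather than as the code in the repository below):

import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(samples, k=3):
    # Kozachenko-Leonenko entropy estimate (in nats) for an (n, d) sample array.
    # One common form of the estimator; duplicate points would need special handling.
    x = np.asarray(samples, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n, d = x.shape
    # distance from each point to its k-th nearest neighbour (column 0 is the point itself)
    r, _ = cKDTree(x).query(x, k=k + 1)
    r_k = r[:, -1]
    # log-volume of the d-dimensional Euclidean unit ball
    log_unit_ball = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return digamma(n) - digamma(k) + log_unit_ball + d * np.mean(np.log(r_k))

# For standard normal samples the true entropy is 0.5 * log(2 * pi * e), roughly 1.42
print(kl_entropy(np.random.randn(5000), k=3))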
I have implementations for both on my github:
https://github.com/paulbrodersen/entropy_estimators
The code has only been tested using python 2.7, but I would be surprised if it doesn't run on 3.x.
I'm curious if it is possible to specify your own distance function between two points for scipy clustering. I have datapoints with 3 values: GPS-lat, GPS-lon, and posix-time. I want to cluster these points using some algorithm: either agglomerative clustering, meanshift, or something else.
The problem is that the distance between GPS points needs to be calculated with the Haversine formula, and that distance then needs to be weighted appropriately so it is comparable with a distance in seconds for clustering purposes.
Looking at the documentation for scipy I don't see anything that jumps out as a way to specify a custom distance between two points.
Is there another way I should be going about this? I'm curious what the Pythonic thing to do is.
You asked for sklearn, but I don't have a good answer for you there. Basically, you could build a distance matrix the way you like, and many algorithms will process the distance matrix. The problem is that this needs O(n^2) memory.
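If you do go the distance-matrix route in Python, a rough sketch could look like this (the meters-per-second weight and the sample coordinates are made up, and SciPy's hierarchical clustering stands in for whichever algorithm you pick):

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

SECONDS_TO_METERS = 1.0  # heuristic weight: how many meters one second should count as

def haversine(p, q, radius=6371000.0):
    # Great-circle distance in meters between two (lat, lon) pairs given in degrees
    lat1, lon1, lat2, lon2 = map(np.radians, (p[0], p[1], q[0], q[1]))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius * np.arcsin(np.sqrt(a))

def combined_distance(p, q):
    # p and q are rows of [lat, lon, posix_time]; the weighting is use-case dependent
    return haversine(p, q) + abs(p[2] - q[2]) * SECONDS_TO_METERS

# X is an (n, 3) array of [lat, lon, posix_time]; pdist accepts a callable metric
X = np.array([[52.52, 13.40, 0.0],
              [52.53, 13.41, 60.0],
              [48.85, 2.35, 3600.0]])
condensed = pdist(X, metric=combined_distance)
labels = fcluster(linkage(condensed, method='average'), t=2, criterion='maxclust')
print(labels)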
For my attempts at clustering geodata, I have instead used ELKI (which is Java, not Python). First of all, it includes geodetic distance functions; but it also includes index acceleration for many algorithms and for this distance function.
I have not used an additional attribute such as time. As you already noticed, you need to weight them appropriately, as 1 meter does not equal 1 second. Weights will be very much use-case dependent, and heuristic.
The reason I'm suggesting ELKI is that it has a nice tutorial on implementing custom distance functions that can then be used in most algorithms. They can't be used in every algorithm - some don't use distances at all, or are constrained to e.g. Minkowski metrics only - but a lot of algorithms can use arbitrary (even non-metric) distance functions.
There is also a follow-up tutorial on index-accelerated distance functions. For my geodata, indexes were tremendously useful, speeding things up by a factor of over 100x, and thus enabling me to process 10 times more data.