I'm curious if it is possible to specify your own distance function between two points for scipy clustering. I have datapoints with 3 values: GPS-lat, GPS-lon, and posix-time. I want to cluster these points using some algorithm: either agglomerative clustering, meanshift, or something else.
The problem is that the distance between GPS points needs to be calculated with the Haversine formula. And then that distance needs to be weighted appropriately so that it is comparable with a distance in seconds for clustering purposes.
Looking at the documentation for scipy I don't see anything that jumps out as a way to specify a custom distance between two points.
Is there another way I should be going about this? I'm curious what the Pythonic thing to do is.
You asked for sklearn, but I don't have a good answer for you there. Basically, you could build a distance matrix the way you like, and many algorithms will process the distance matrix. The problem is that this needs O(n^2) memory.
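For illustration, here is a minimal sketch of that distance-matrix route using scipy's hierarchical clustering. The haversine helper, the seconds-per-meter weight, and the sample data are mine and purely illustrative; coordinates are assumed to be in degrees.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

SECONDS_PER_METER = 0.5  # made-up weight: how many "seconds" one meter is worth; tune per use case

def haversine_m(p, q):
    # great-circle distance in meters between (lat, lon) pairs given in degrees
    lat1, lon1, lat2, lon2 = np.radians([p[0], p[1], q[0], q[1]])
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * np.arcsin(np.sqrt(a))

def spacetime_dist(p, q):
    # p, q = (lat, lon, posix_time): put meters and seconds on one scale
    return haversine_m(p, q) * SECONDS_PER_METER + abs(p[2] - q[2])

X = np.array([[48.10, 11.50, 1000.0],
              [48.10, 11.51, 1100.0],
              [40.70, -74.00, 9000.0]])
D = pdist(X, metric=spacetime_dist)                              # O(n^2) pairwise distances
labels = fcluster(linkage(D, method='average'), t=2, criterion='maxclust')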
For my attempts at clustering geodata, I have instead used ELKI (which is Java, not Python). First of all, it includes geodetic distance functions; but it also includes index acceleration for many algorithms and for this distance function.
I have not used an additional attribute such as time. As you already noticed, you need to weight them appropriately, as 1 meter does not equal 1 second. Weights will be very much use-case dependent, and heuristic.
The reason I'm suggesting ELKI is that it has a nice tutorial on implementing custom distance functions that can then be used in most algorithms. They can't be used in every algorithm - some don't use distances at all, or are constrained to e.g. Minkowski metrics only. But a lot of algorithms can use arbitrary (even non-metric) distance functions.
There also is a follow-up tutorial on index-accelerated distance functions. For my geodata, indexes were tremendously useful, speeding things up by a factor of over 100x, and thus enabling me to process 10 times more data.
Related
I have N three-dimensional vectors (x, y, z). I want a simple yet effective approach for clustering these vectors (I do not know a priori the number of clusters, nor can I guess a valid number). I am not familiar with classical machine learning, so any advice would be helpful.
The general Sklearn clustering page does a decent job of providing useful background on clustering methods and provides a nice overview of what the differences are between methods. Importantly for your case the table in section 2.3.1 lists the parameters of each method.
The differences between methods tend to be based on how the knowledge you have of the dataset matches the assumptions of each model. Some expect you to know the number of clusters (such as K-Means) while others will attempt to determine the number of clusters based on other input parameters (like DBSCAN).
While focusing on methods which attempt to find the number of clusters seems like it might be preferable, it is also possible to use a method which expects the number of clusters and simply test many different reasonable numbers of clusters to determine which one is optimal. One such example with K-Means is this.
The easiest clustering algorithms to start with are K-Means (if your three features are numerical) and K-Medoids (which allows any type of feature).
These algorithms are quite easy to understand. In a few words, by calculating some distance measure between the observations of the dataset, they try to assign each observation to the cluster closest (in distance) to it. The main issue with these algorithms is that you have to specify how many clusters (K) you want, but there are techniques such as the Elbow method or the Silhouette that allow you to determine numerically which value of K would be a reasonable number of clusters.
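For instance, here is a small sketch of a silhouette-based sweep over K; the data and the candidate range are placeholders, so treat it as an illustration rather than a recipe.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 3)  # placeholder for your N x 3 array of vectors

scores = {}
for k in range(2, 11):                        # candidate numbers of clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # higher is better

best_k = max(scores, key=scores.get)          # K with the highest silhouette score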
Starting with latitude/longitude data (in radians), I’m trying to efficiently find the nearest n neighbors, ideally with geodesic (WGS-84) distance.
Right now I’m using sklearn’s BallTree with haversine distance (KD-Trees only take Minkowski metrics), which is nice and fast (3-4 seconds to find the nearest 5 neighbors for 1200 locations in 7500 possible matches), but not as accurate as I need. Code:
# 'haversine' expects [lat, lon] in radians and returns distances in radians of arc
# (multiply by the Earth's radius to get physical units)
tree = BallTree(possible_matches[['x', 'y']], leaf_size=2, metric='haversine')
distances, indices = tree.query(locations[['x', 'y']], k=5)
When I substitute in a custom function for metric (metric=lambda u, v: geopy.distance.geodesic(u, v).miles) it takes an "unreasonably" long time (4 minutes in the same case as above). It’s documented that custom functions can take a long time, but that doesn't help me solve my problem.
I looked at using a KD-Tree with ECEF coordinates and Euclidean distance, but I’m not sure if that’s actually any more accurate.
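For reference, here is roughly what I mean by that (a sketch only: spherical-Earth approximation, with the 'x'/'y' columns being lat/lon in radians as above):
import numpy as np
from scipy.spatial import cKDTree

def to_ecef(lat_rad, lon_rad, r=6371000.0):
    # spherical Earth: lat/lon in radians -> (x, y, z) in meters
    return np.column_stack([r * np.cos(lat_rad) * np.cos(lon_rad),
                            r * np.cos(lat_rad) * np.sin(lon_rad),
                            r * np.sin(lat_rad)])

tree = cKDTree(to_ecef(possible_matches['x'].values, possible_matches['y'].values))
chord_dist, indices = tree.query(to_ecef(locations['x'].values, locations['y'].values), k=5)
# chord_dist is the straight-line distance through the Earth; on a sphere it ranks
# neighbors the same way as great-circle distance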
How can I keep the speed of my current method, but improve my distance accuracy?
The main reason why your metric is slow is that it is written in Python, while the other metrics in sklearn are written in Cython/C++/C.
So, as discussed here for Random Forests (and here), you would have to implement your metric in Cython, fork your own version of BallTree, and include your custom metric there.
I've got a clustering problem that I believe requires an intuitive distance function. Each instance has an x, y coordinate but also has a set of attributes that describe it (varying in number per instance). Ideally it would be possible to pass it Python objects (instances of a class) and compare them arbitrarily based on their content.
I want to represent the distance as a weighted sum of the euclidean distance between the x, y values and something like a jaccard index to measure the set overlap of the other attributes. Something like:
dist = (euclidean(x1, y1, x2, y2) * 0.6) + ((1 - jaccard(attrs1, attrs2)) * 0.4)
Most of the clustering algorithms and implementations I've found convert instance features into numbers. For example with dbscan in sklearn, to do my distance function I would need to convert the numbers back into the original representation somehow.
It would be great if it were possible to do clustering using a distance function that can compare instances in any arbitrary way. For example imagine a euclidean distance function that would evaluate objects as closer if they matched on another non-spatial feature.
import math

def dist(ins1, ins2):
    # plain Euclidean distance on the spatial coordinates
    euc = math.hypot(ins1.x - ins2.x, ins1.y - ins2.y)
    # treat instances as closer if they share a non-spatial feature
    if ins1.feature1 == ins2.feature1:
        euc = euc * 0.9
    return euc
Is there a method that would suit this? It would also be nice if the number of clusters didn't have to be set upfront (but this is not critical for me).
Actually, almost all the clustering algorithms (except for k-means, which needs numbers to compute the mean, obviously) can be used with arbitrary distance functions.
In sklearn, most algorithms accept metric="precomputed" and a distance matrix instead of the original input data. Please check the documentation more carefully. For example DBSCAN:
If metric is “precomputed”, X is assumed to be a distance matrix and must be square.
What you lose is the ability to accelerate some algorithms by indexing. Computing a distance matrix is O(n^2), so your algorithm cannot be faster than that. In sklearn, you would need to modify the sklearn Cython code to add a new distance function (using a pyfunc will yield very bad performance, unfortunately). Java tools such as ELKI can be extended with little overhead because the Just-in-time compiler of Java optimizes this well. If your distance is metric then many indexes can be used for acceleration of e.g. DBSCAN.
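To make the precomputed route concrete, here is a minimal sketch using the weighted Euclidean + Jaccard distance from the question. The 0.6/0.4 weights are the questioner's; the Instance class, sample data, eps, and min_samples are illustrative.
from dataclasses import dataclass
import numpy as np
from sklearn.cluster import DBSCAN

@dataclass
class Instance:
    x: float
    y: float
    attrs: frozenset

def pair_dist(a, b):
    euc = np.hypot(a.x - b.x, a.y - b.y)
    union = a.attrs | b.attrs
    jac = len(a.attrs & b.attrs) / len(union) if union else 1.0
    return 0.6 * euc + 0.4 * (1.0 - jac)

instances = [Instance(0.0, 0.0, frozenset({'a', 'b'})),
             Instance(0.1, 0.1, frozenset({'a', 'b', 'c'})),
             Instance(5.0, 5.0, frozenset({'x'}))]

# build the full O(n^2) distance matrix from arbitrary Python objects
n = len(instances)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = pair_dist(instances[i], instances[j])

labels = DBSCAN(eps=0.5, min_samples=1, metric='precomputed').fit_predict(D)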
Essentially, I applied a DBSCAN algorithm (sklearn) with a Euclidean distance on a subset of my original data. I found my clusters and all is fine, except for the fact that I want to keep only values that are far enough from the points on which I did not run my analysis. I have a new distance to test such new stuff with, and I wanted to understand how to do it WITHOUT numerous nested loops.
In a picture:
My found clusters are in blue, whereas the red points are the ones I don't want to be near. The crosses are the points belonging to the clusters that are carved out because they are within the new distance I specified.
Now, while I could do something of the sort:
for i in red_points:
    for j in blu_points:
        if dist(i, j) < given_dist:
            original_dataframe.remove(j)
I refuse to believe there isn't a vectorized method. Also, I can't afford to do the above, simply because I'll have huge tables to operate on and I'd like to keep my CPU from evaporating away.
Any and all suggestions welcome.
Of course you can vectorize this, but it will then still be O(n*m). Better neighbor search algorithms, e.g. the k-d tree and the ball tree, are not vectorized.
Both are available in sklearn, and used by the DBSCAN module. Please see the sklearn.neighbors package.
If you need exact answers, the fastest implementation should be sklearn's pairwise distance calculator:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html
If you can accept an approximate answer, you can do better with the kd tree's query_radius(): http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html
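A sketch of that route; the variable names follow the question, and the arrays and given_dist here are placeholders:
import numpy as np
from sklearn.neighbors import KDTree

red_points = np.array([[0.0, 0.0], [10.0, 10.0]])            # points to stay away from
blu_points = np.array([[0.1, 0.2], [5.0, 5.0], [9.9, 9.8]])  # clustered points
given_dist = 1.0

tree = KDTree(red_points)
# for every blue point, indices of red points within given_dist
neighbors = tree.query_radius(blu_points, r=given_dist)
far_enough = np.array([len(idx) == 0 for idx in neighbors])
blu_points_kept = blu_points[far_enough]   # here only [5.0, 5.0] survives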
Software for vector quantization usually works only on numerical data. One example of this is Python's scipy.cluster.vq.vq (here), which performs vector quantization. The numerical data requirement also shows up for most clustering software.
Many have pointed out that you can always convert a categorical variable to a set of binary numeric variables. But this becomes awkward when working with big data where an individual categorical variable may have hundreds or thousands of categories.
The obvious alternative is to change the distance function. With mixed data types, the distance from an observation to a "center" or "codebook entry" could be expressed as a two-part sum involving (a) the usual Euclidean calculation for the numeric variables and (b) the sum of inequality indicators for categorical variables, as proposed here on page 125.
Is there any open-source software implementation of vector quantization with such a generalized distance function?
For machine learning and clustering algorithms, you may also find scikit-learn useful. To achieve what you want, you can have a look at their implementation of DBSCAN.
In their documentation, you can find:
sklearn.cluster.dbscan(X, eps=0.5, min_samples=5, metric='minkowski', algorithm='auto', leaf_size=30, p=2, random_state=None)
Here X can be either your already computed distance matrix (in which case you pass metric='precomputed') or the standard samples x features matrix, while metric= can be a string (the identifier of one of the already implemented distance functions) or a callable Python function that will compute distances in a pairwise fashion.
If you can't find the metric you want, you can always program it as a python function:
def mydist(a, b):
    return abs(a - b).sum()  # the metric you want comes here; it must return a single number
And call dbscan with metric=mydist. Alternatively, you can calculate your distance matrix beforehand and pass it to the clustering algorithm.
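Tying this back to the two-part distance described in the question, here is a hedged sketch of such a callable metric; the column split, the categorical weight, eps, min_samples, and the sample data are all assumptions.
import numpy as np
from sklearn.cluster import DBSCAN

NUM_COLS = slice(0, 2)     # hypothetical layout: first two columns are numeric
CAT_COLS = slice(2, None)  # remaining columns are integer-coded categories
CAT_WEIGHT = 1.0           # cost of one categorical mismatch, to be tuned

def mixed_dist(a, b):
    num_part = np.sqrt(np.sum((a[NUM_COLS] - b[NUM_COLS]) ** 2))   # Euclidean on numerics
    cat_part = np.sum(a[CAT_COLS] != b[CAT_COLS])                  # count of category mismatches
    return num_part + CAT_WEIGHT * cat_part

X = np.array([[0.0, 0.0, 1, 2],
              [0.2, 0.1, 1, 2],
              [4.0, 4.0, 3, 0]], dtype=float)

labels = DBSCAN(eps=1.0, min_samples=1, metric=mixed_dist).fit_predict(X)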
There are some other clustering algorithms in the same library, have a look at them here.
You cannot "quantize" categorical data.
Recall definitions of quantization (Wiktionary):
To limit the number of possible values of a quantity, or states of a system, by applying the rules of quantum mechanics
To approximate a continuously varying signal by one whose amplitude can only have a set of discrete values
In other words, quantization means converting a continuous variable into a discrete variable. Vector quantization does the same, for multiple variables at the same time.
However, categorical variables already are discrete.
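For contrast, here is a tiny illustration of what vector quantization in the above sense does with continuous data, using scipy.cluster.vq.vq (the values are made up):
import numpy as np
from scipy.cluster.vq import vq

obs = np.array([[0.1, 0.2], [0.9, 1.1], [5.0, 5.2]])        # continuous observations
code_book = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])  # discrete codebook entries
codes, dists = vq(obs, code_book)                            # codes == array([0, 1, 2])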
What you seem to be looking for is a prototype-based clustering algorithm for categorical data (maybe STING and COOLCAT? I don't know if they will produce prototypes); but this isn't "vector quantization" anymore.
I believe that very often, frequent itemset mining is actually the best approach to find prototypes/archetypes of categorical data.
As for clustering algorithms that allow other distance functions - there are plenty. ELKI has a lot of such algorithms, and also a tutorial on implementing a custom distance. But this is Java, not Python. I'm pretty sure at least some of the clustering algorithms in scipy allow custom distances, too.
Now, Python's scipy.cluster.vq.vq is really simple code. You do not need a library for that at all. The main job of this function is wrapping a C implementation which runs much faster than Python code... if you look at the py_vq version (which is used when the C version cannot be used), it is really simple code... essentially, for every object obs[i] it calls this function:
code[i] = argmin(np.sum((obs[i] - code_book) ** 2, 1))
Now you obviously can't use Euclidean distance with a categorical codebook; but translating this line to whatever similarity you want is not hard.
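For concreteness, here is one way that line could be translated for a categorical codebook, using a simple matching (mismatch-count) dissimilarity; this is a sketch, not part of scipy:
import numpy as np

def vq_categorical(obs, code_book):
    # assign every observation to the codebook entry with the fewest attribute mismatches
    codes = np.empty(len(obs), dtype=int)
    for i, row in enumerate(obs):
        codes[i] = np.argmin(np.sum(row != code_book, axis=1))
    return codes

obs = np.array([[0, 1, 1], [2, 2, 0]])         # integer-coded categorical observations
code_book = np.array([[0, 1, 0], [2, 2, 1]])   # integer-coded categorical prototypes
print(vq_categorical(obs, code_book))           # -> [0 1]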
The harder part usually is constructing the codebook, not using it.