Clustering algorithms with custom distance function in Python - python

I've got a clustering problem that I believe requires an intuitive distance function. Each instance has an x, y coordinate but also has a set of attributes that describe it (varying in number per instance). Ideally it would be possible to pass it pythonobjects (instances of a class) and compare them arbitrarily based on their content.
I want to represent the distance as a weighted sum of the euclidean distance between the x, y values and something like a jaccard index to measure the set overlap of the other attributes. Something like:
dist = (euclidean(x1, y1, x2, y2) * 0.6) + (1-jaccard(attrs1, attrs2) * 0.4)
Most of the clustering algorithms and implementations I've found convert instance features into numbers. For example with dbscan in sklearn, to do my distance function I would need to convert the numbers back into the original representation somehow.
It would be great if it were possible to do clustering using a distance function that can compare instances in any arbitrary way. For example imagine a euclidean distance function that would evaluate objects as closer if they matched on another non-spatial feature.
def dist(ins1, ins2):
euc = euclidean(ins1.x, ins1.y, ins2.x, ins2.y)
if ins1.feature1 == ins2.feature1:
euc = euc * 0.9
return euc
Is there a method that would suit this? It would also be nice if the number of clusters didn't have to be set upfront (but this is not critical for me).

Actually, almost all the clustering algorithms (except for k-means, which needs numbers to compute the mean, obviously) can be used with arbitrary distance functions.
In sklearn, most algorithms accept metric="precomputed" and a distance matrix instead of the original input data. Please check the documentation more carefully. For example DBSCAN:
If metric is “precomputed”, X is assumed to be a distance matrix and must be square.
What you lose is the ability to accelerate some algorithms by indexing. Computing a distance matrix is O(n^2), so your algorithm cannot be faster than that. In sklearn, you would need to modify the sklearn Cython code to add a new distance function (using a pyfunc will yield very bad performance, unfortunately). Java tools such as ELKI can be extended with little overhead because the Just-in-time compiler of Java optimizes this well. If your distance is metric then many indexes can be used for acceleration of e.g. DBSCAN.

Related

Problems with computing the entropy of random variables via SciPy (stats)

Recently I've been trying to figure out how to calculate the entropy of a random variable X using
sp.stats.entropy()
from the stats package of SciPy, with this random variable X being the returns I obtain from the stock of a specific company ("Company 1") from 1997 to 2012 (this is for a financial data/machine learning assignment). However, the arguments involve inputting the probability values
pk
and so far I'm even struggling with computing the actual empirical probabilities, seeing as I only have the observations of the random variable. I've tried different ways of normalising the data in order to obtain an array of probabilities, but my data contains negative values too, which means that when I try and do
asset1/np.sum(asset1)
where asset1 is the row array of the returns of the stock of "Company 1", I manage to obtain a new array which adds up to 1, but obviously with some negative values, and as we all know, negative probabilities do not exist. Therefore, is there any way of computing the empirical probabilities of my observations occurring again (ideally with the option of choosing specific bins, or for a range of values) on Python?
Furthermore, I've been trying to look for a Python package for countless hours which is solely dedicated to the calculation of random variable entropies, joint entropies, mutual information etc. as an alternative to SciPy's entropy option (simply to compare) but most seem to be outdated (I currently have Python 3.5), hence does anyone know of any good package which is compatible with my current version of Python? I know R seems to have a very compact one.
Any kind of help would be highly appreciated. Thank you very much in advance!
EDIT: stock returns are considered to be RANDOM VARIABLES, as opposed to the stock prices which are processes. Therefore, the entropy can definitely be applied in this context.
For continuous distributions, you are better off using the Kozachenko-Leonenko k-nearest neighbour estimator for entropy (K & L 1987) and the corresponding Kraskov, ..., Grassberger (2004) estimator for mutual information. These circumvent the intermediate step of calculating the probability density function, and estimate the entropy directly from the distances of data point to their k-nearest neighbour.
The basic idea of the Kozachenko-Leonenko estimator is to look at (some function of) the average distance between neighbouring data points. The intuition is that if that distance is large, the dispersion in your data is large and hence the entropy is large. In practice, instead of taking the nearest neighbour distance, one tends to take the k-nearest neighbour distance, which tends to make the estimate more robust.
I have implementations for both on my github:
https://github.com/paulbrodersen/entropy_estimators
The code has only been tested using python 2.7, but I would be surprised if it doesn't run on 3.x.

Python: String clustering with scikit-learn's dbscan, using Levenshtein distance as metric:

I have been trying to cluster multiple datasets of URLs (around 1 million each), to find the original and the typos of each URL. I decided to use the levenshtein distance as a similarity metric, along with dbscan as the clustering algorithm as k-means algorithms won't work because I do not know the number of clusters.
I am facing some problems using Scikit-learn's implementation of dbscan.
This snippet below works on small datasets in the format I an using, but since it is precomputing the entire distance matrix, that takes O(n^2) space and time and is way too much for my large datasets. I have run this for many hours but it just ends up taking all the memory of my PC.
lev_similarity = -1*np.array([[distance.levenshtein(w1[0],w2[0]) for w1 in words] for w2 in words])
dbscan = sklearn.cluster.DBSCAN(eps = 7, min_samples = 1)
dbscan.fit(lev_similarity)
So I figured I needed some way to compute the similarity on the fly and hence tried this method.
dbscan = sklearn.cluster.DBSCAN(eps = 7, min_samples = 1, metric = distance.levenshtein)
dbscan.fit(words)
But this method ends up giving me an error:
ValueError: could not convert string to float: URL
Which I realize means that its trying to convert the inputs to the similarity function to floats. But I don't want it to do that.
As far as I understand, it just needs a function that can take two arguments and return a float value that it can then compare to eps, which the levenshtein distance should do.
I am stuck at this point, as I do not know the implementation details of sklearn's dbscan to find why it is trying to convert it to float, and neither do I have any better idea on how to avoid the O(n^2) matrix computation.
Please let me know if there is any better or faster way to cluster these many strings that I may have overlooked.
From the scikit-learn faq you can do this by making a custom metric:
from leven import levenshtein
import numpy as np
from sklearn.cluster import dbscan
data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]
def lev_metric(x, y):
i, j = int(x[0]), int(y[0]) # extract indices
return levenshtein(data[i], data[j])
X = np.arange(len(data)).reshape(-1, 1)
dbscan(X, metric=lev_metric, eps=5, min_samples=2)
Try ELKI instead of sklearn.
It is the only tool I know that allows index accelerated DBSCAN with any metric.
It includes Levenshtein distance. You need to add an index to your database with -db.index. I always use the cover tree index (you need to choose the same distance for the index and for the algorithm, of course!)
You could use "pyfunc" distances and ball trees in sklearn, but performance was really bad because of the interpreter. Also, DBSCAN in sklearn is much more memory intensive.

Vector quantization for categorical data

Software for vector quantization usually works only on numerical data. One example of this is Python's scipy.cluster.vq.vq (here), which performs vector quantization. The numerical data requirement also shows up for most clustering software.
Many have pointed out that you can always convert a categorical variable to a set of binary numeric variables. But this becomes awkward when working with big data where an individual categorical variable may have hundreds or thousands of categories.
The obvious alternative is to change the distance function. With mixed data types, the distance from an observation to a "center" or "codebook entry" could be expressed as a two-part sum involving (a) the usual Euclidean calculation for the numeric variables and (b) the sum of inequality indicators for categorical variables, as proposed here on page 125.
Is there any open-source software implementation of vector quantization with such a generalized distance function?
For machine learning and clustering algorithms you can also find useful scikit-learn. To achieve what you want, you can have a look to their implementation of DBSCAN.
In their documentation, you can find:
sklearn.cluster.dbscan(X, eps=0.5, min_samples=5, metric='minkowski', algorithm='auto', leaf_size=30, p=2, random_state=None)
Here X can be either your already computed distance matrix (and passing metric='precomputed') or the standard samples x features matrix, while metric= can be a string (with the identifier of one of the already implemented distance functions) or a callable python function that will compute distances in a pair-wise fashion.
If you can't find the metric you want, you can always program it as a python function:
def mydist(a, b):
return a - b # the metric you want comes here
And call dbscan with metric=mydist. Alternatively, you can calculate your distance matrix previously, and pass it to the clustering algorith.
There are some other clustering algorithms in the same library, have a look at them here.
You cannot "quantize" categorial data.
Recall definitions of quantization (Wiktionary):
To limit the number of possible values of a quantity, or states of a system, by applying the rules of quantum mechanics
To approximate a continuously varying signal by one whose amplitude can only have a set of discrete values
In other words, quantization means converting a continuous variable into a discrete variable. Vector quantization does the same, for multiple variables at the same time.
However, categorial variables already are discrete.
What you seem to be looking for is a prototype-based clustering algorithm for categorial data (maybe STING and COOLCAT? I don't know if they will produce prototypes); but this isn't "vector quantization" anymore.
I believe that very often, frequent itemset mining is actually the best approach to find prototypes/archetypes of categorial data.
As for clustering algorithms that allow other distance functions - there are plenty. ELKI has a lot of such algorithms, and also a tutorial on implementing a custom distance. But this is Java, not Python. I'm pretty sure at least some of the clustering algorithms in scipy to allow custom distances, too.
Now pythons scipy.cluster.vq.vq is really simple code. You do not need a library for that at all. The main job of this function is wrapping a C implementation which runs much faster than python code... if you look at the py_vq version (which is used when the C version cannot be used), is is really simple code... essentially, for every object obs[i] it calls this function:
code[i] = argmin(np.sum((obs[i] - code_book) ** 2, 1))
Now you obviously can't use Euclidean distance with a categorial codebook; but translating this line to whatever similarity you want is not hard.
The harder part usually is constructing the codebook, not using it.

Clustering GPS points with a custom distance function in scipy

I'm curious if it is possible to specify your own distance function between two points for scipy clustering. I have datapoints with 3 values: GPS-lat, GPS-lon, and posix-time. I want to cluster these points using some algorithm: either agglomerative clustering, meanshift, or something else.
The problem is distance between GPS points needs to be calculated with the Haversine formula. And then that distance needs to be weighted appropriately so it is comparable with a distance in seconds for clustering purposes.
Looking at the documentation for scipy I don't see anything that jumps out as a way to specify a custom distance between two points.
Is there another way I should be going about this? I'm curious what the Pythonic thing to do is.
You asked for sklearn, but I don't have a good answer for you there. Basically, you could build a distance matrix the way you like, and many algorithms will process the distance matrix. The problem is that this needs O(n^2) memory.
For my attempts at clustering geodata, I have instead used ELKI (which is Java, not Python). First of all, it includes geodetic distance functions; but it also includes index acceleration for many algorithms and for this distance function.
I have not used an additional attribute such as time. As you already noticed you need to weight them appropriately, as 1 meter does not equal not 1 second. Weights will be very much use case dependant, and heuristic.
Why I'm suggesting ELKI is because they have a nice Tutorial on implementing custom distance functions that then can be used in most algorithms. They can't be used in every algorithm - some don't use distance at all, or are constrained to e.g. Minkowski metrics only. But a lot of algorithms can use arbitrary (even non-metric) distance functions.
There also is a follow-up tutorial on index accelerated distance functions. For my geodata, indexes were tremendously useful, speeding up by a factor of over 100x, and thus enabling be to process 10 times more data.

alternative similarity measure in DBSCAN?

I test my image set on DBSCAN algorithm in scikit-learn python module . There are alternatives for similarity computing:
# Compute similarities
D = distance.squareform(distance.pdist(X))
S = 1 - (D / np.max(D))
A weighted measure or something like that i could try, examples?
There exists a generalization of DBSCAN, known as "Generalized DBSCAN".
Actually for DBSCAN you do not even need a distance. Which is why it actually does not make sense to compute a similarity matrix in the first place.
All you need is a predicate "getNeighbors", that computes objects you consider as neighbors.
See: in DBSCAN, the distance is not really used, except to test whether an object is a neighbor or not. So all you need is this boolean decision.
You can try the following approach: initialize the matrix with all 1s.
For any two objecs that you consider similar for your application (we can't help you a lot on that, without knowing your application and data), fill the corresponding cells with 0.
Then run DBSCAN with epsilon = 0.5, and obviously DBSCAN will consider all the 0s as neighbors.
You can use whatever similarity matrix you like. It just need to based on a valid distance (symmetric, positive semi-definite).
I believe DBSCAN estimator wants distances and not similarities. But again when it comes to strings it will require a similarity matrix, which can even be a line of code for matching the equality between two strings. Therefore it depends on you how you use the similarity matrix and distinguish between a neighbor and non neighbor objects.

Categories

Resources