Clustering method for three-dimensional vectors - python

I have N three-dimensional vectors
(x,y,z)
I want a simple yet effective approach for clustering these vectors (I do not know a priori the number of clusters, nor can I guess a valid number). I am not familiar with classical machine learning so any advice would be helpful.

The general sklearn clustering page does a decent job of providing useful background on clustering methods and gives a nice overview of the differences between them. Importantly for your case, the table in section 2.3.1 lists the parameters of each method.
The differences between methods tend to come down to how well the knowledge you have of your dataset matches the assumptions of each model. Some expect you to know the number of clusters (such as K-Means), while others will attempt to determine the number of clusters based on other input parameters (like DBSCAN).
While focusing on methods which attempt to find the number of clusters themselves might seem preferable, it is also possible to use a method which expects the number of clusters and simply test many different reasonable cluster counts to determine which one is optimal. One such example with K-Means is this.
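As a rough sketch of that idea (the random data and the range of candidate k values below are placeholders for your own N x 3 array), you can fit K-Means for several cluster counts and compare the inertia, which is the basis of the elbow heuristic:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 3)  # placeholder for your N x 3 array of vectors

inertias = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_  # within-cluster sum of squares

for k, inertia in sorted(inertias.items()):
    print(k, inertia)  # look for the "elbow" where the drop levels off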

The easiest clustering algorithms to start with are K-Means (if your three features are numerical) and K-Medoids (which allows any type of feature).
These algorithms are quite easy to understand. In a few words, by calculating some distance measure between each observation in the dataset and the cluster centers, they try to assign every observation to the cluster closest (in distance) to it. The main issue with these algorithms is that you have to specify how many clusters (K) you want, but there are techniques such as the Elbow method or the Silhouette score that allow you to determine numerically which value of K would be a reasonable number of clusters.
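For instance, a minimal sketch of the Silhouette approach (the random data and the candidate range of K are placeholders for your own setup):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 3)  # placeholder for your data

best_k, best_score = None, -1.0
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)  # in [-1, 1]; higher means better-separated clusters
    if score > best_score:
        best_k, best_score = k, score

print("suggested K:", best_k)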

Related

Is there any supervised clustering algorithm or a way to apply prior knowledge to your clustering?

In my case I have a dataset of letters and symbols detected in an image. The detected items are represented by their coordinates, type (letter, number, etc.), value and orientation, not by the actual bounding box of the image. My goal is to use this dataset to group them into different "words", or contextual groups in general.
So far I have achieved OK-ish results by applying classic unsupervised clustering with the DBSCAN algorithm, but this is still far too limited to the geometric distance between samples, so the resulting groups do not resemble the "words" I am aiming for. I am therefore searching for a way to influence the results of the clustering algorithm using the knowledge I have about the "word-like" nature of the clusters I need.
One approach I considered was to create a dataset of true and false clusters and train an SVM (or any classifier) to detect whether a proposed cluster is correct or not. Still, I have no solid proof that I can train a model well enough to discriminate between good and bad clusters, and I find it difficult to represent the clusters efficiently and consistently based on the features of their members. Moreover, since my "testing data" would be a huge number of all possible combinations of the letters and symbols I have, the whole approach seems too complicated to attempt without any indication that it will work in the end.
To conclude, my question is whether anyone has prior experience with this kind of task (in my mind it sounds like a rather simple task, but apparently it is not). Do you know of any supervised clustering algorithm, and if so, what is the proper way to represent clusters of data so that you can efficiently train a model with them?
Any idea, suggestion or even hint towards where I can research this will be much appreciated.
There are papers on supervised clustering. A nice, clear one is Eick et al., which is available for free. Unfortunately, I do not think any off-the-shelf libraries in python support this. There is also this in the specific realm of text, but it is a much more domain-specific approach compared to Eick.
But there is a very simple solution that is effectively a type of supervised clustering. Decision Trees essentially chop feature space into regions of high-purity, or at least attempt to. So you can do this as a quick type of supervised clustering:
Create a Decision Tree using the labelled data.
Think of each leaf as a "cluster."
In sklearn, you can retrieve the leaves of a Decision Tree by using the apply() method.
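A minimal sketch of those steps (the toy data, labels and tree settings below are just placeholders):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(100, 5)          # placeholder features
y = (X[:, 0] > 0.5).astype(int)     # placeholder labels

tree = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0).fit(X, y)
leaf_ids = tree.apply(X)            # index of the leaf each sample ends up in
print(np.unique(leaf_ids))          # each distinct leaf id can be treated as a "cluster"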
Another standard approach would be to use a dendrogram (hierarchical clustering), and then merge branches only if they agree with your positive examples and do not violate any of your negative examples.

clustering in python without number of clusters or threshold

Is it possible to do clustering without providing any input apart from the data? The clustering method/algorithm should decide from the data how many logical groups it can be divided into; it should not even require me to input the threshold Euclidean distance on which the clusters are built, as this also needs to be learned from the data.
Could you please suggest the closest solution to my problem?
Why not code your algorithm to create a list of clusterings ranging in size from 1 to n (where n could be defined in a config file, so that you avoid hard-coding and only have to fix it once)?
Once that is done, compute the clusterings of size 1 to n and choose the value that gives you the smallest mean squared error.
This would require some additional work by your machine to determine the optimal number of logical groups the data can be divided into (bounded between 1 and n).
Clustering is an explorative technique.
This means it must always be able to produce different results, as desired by the user. Having many parameters is a feature. It means the method can be adapted easily to very different data, and to user preferences.
There will never be a generally useful parameter-free technique. At best, some parameters will have default values or heuristics (such as Euclidean distance, standardizing the input prior to clustering, or the gap statistic for choosing k) that may give a reasonable first try in 80% of cases. But after that first try, you'll need to understand the data and try other parameters to learn more about it.
Methods that claim to be "parameter free" usually just have some hidden parameters set so that they work on the few toy examples they were demonstrated on.

Machine learning: find the closest results to a queried vector

I have thousands of vectors of about 20 features each.
Given one query vector, and a set of potential matches, I would like to be able to select the best N matches.
I have spent a couple of days trying out regression (using SVM), training my model with a data set I have created myself: each vector is the concatenation of the query vector and a result vector, and I give a score (subjectively evaluated) between 0 and 1, 0 for perfect match, 1 for worst match.
I haven't had great results, and I believe one reason could be that it is very hard to subjectively assign these scores. What would be easier on the other hand is to subjectively rank results (score being an unknown function):
score(query, resultA) > score(query, resultB) > score(query, resultC)
So I believe this is more a problem of Learning to rank and I have found various links for Python:
http://fa.bianp.net/blog/2012/learning-to-rank-with-scikit-learn-the-pairwise-transform/
https://gist.github.com/agramfort/2071994
...
but I haven't been able to understand how it really works. I am quite confused by all the terminology (pairwise ranking and so on), and since I know nothing about machine learning I feel a bit lost, so I don't understand how to apply this to my problem.
Could someone please help me clarify things, point me to the exact category of problem I am trying to solve, and even better how I could implement this in Python (scikit-learn) ?
It seems to me that what you are trying to do is to simply compute the distances between the query and the rest of your data, then return the closest N vectors to your query. This is a search problem.
There is no ranking model to learn: you simply measure the distance between your query and the "thousands of vectors", then sort the distances and take the smallest N values. These correspond to the N vectors most similar to your query.
For increased efficiency at making comparisons, you can use KD-Trees or other efficient search structures: http://scikit-learn.org/stable/modules/neighbors.html#kd-tree
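For instance, a minimal sketch with sklearn's NearestNeighbors (the random data, the query and N=10 below are placeholders for your own vectors):

import numpy as np
from sklearn.neighbors import NearestNeighbors

data = np.random.rand(5000, 20)   # placeholder for your thousands of 20-feature vectors
query = np.random.rand(1, 20)     # placeholder query vector

nn = NearestNeighbors(n_neighbors=10, algorithm='kd_tree', metric='euclidean').fit(data)
distances, indices = nn.kneighbors(query)   # the 10 vectors closest to the query
print(indices[0], distances[0])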
Then, take a look at the Wikipedia page on Lp space. Before picking an appropriate metric, you need to think about the data and its representation:
What kind of data are you working with? Where does it come from and what does it represent? Is the feature space comprised of only real numbers or does it contain binary values, categorical values or all of them? Wiki for homogeneous vs heterogeneous data.
For a real-valued feature space, the Euclidean distance (L2) is usually the metric of choice, and with 20 features you should be fine. Start with that one. Otherwise you might have to think about the city-block distance (L1) or other metrics such as Pearson's correlation, cosine distance, etc.
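If you want to get a feel for how the choice of metric changes the result, a small sketch with scipy's cdist (the arrays are placeholders):

import numpy as np
from scipy.spatial.distance import cdist

data = np.random.rand(1000, 20)   # placeholder data
query = np.random.rand(1, 20)     # placeholder query

for metric in ('euclidean', 'cityblock', 'cosine'):
    d = cdist(query, data, metric=metric).ravel()
    print(metric, "closest index:", d.argmin())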
You might have to do some engineering on the data before you can do anything else.
Are the features on the same scale? e.g. x1 = [0,1], x2 = [0, 100]
If not, then try scaling your features. This is usually a matter of trial and error since some features might be noisy in which case scaling might not help.
To explain this, think about a data set with two features: height and weight. If height is in centimetres (on the order of 10^2) and weight is in kilograms (on the order of 10^1), the raw numbers sit on very different scales, so you should rescale them (e.g. convert the centimetres to metres, or standardize both features) so that they carry comparable weight. This is generally a good idea for feature spaces with a wide range of values, meaning you have a large sample of values for both features. Ideally you would like all your features to be roughly normally distributed, with only a bit of noise; see the central limit theorem.
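A minimal sketch of feature scaling with sklearn (the height/weight numbers are made up):

import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up height (cm) and weight (kg) columns on very different numeric scales
X = np.array([[170.0, 70.0],
              [180.0, 90.0],
              [160.0, 55.0]])

X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
print(X_scaled)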
Are all of the features relevant?
If you are working with real valued data, you can use Principal Component Analysis (PCA) to rank the features and keep only the relevant ones.
Otherwise, you can try feature selection http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
Reducing the dimension of the space increases performance, although it is not critical in your case.
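As a rough sketch of the PCA suggestion above (the random data and the 95% variance threshold are assumptions to adapt to your case):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1000, 20)    # placeholder for your real-valued feature matrix

pca = PCA(n_components=0.95)    # keep enough components to explain ~95% of the variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_)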
If your data consists of continuous, categorical and binary values, then aim to scale or standardize the data. Use your knowledge about the data to come up with an appropriate representation. This is the bulk of the work and is more or less a black art. Trial and error.
As a side note, metric-based methods such as kNN and k-means simply store data. Learning begins where memory ends.

Vector quantization for categorical data

Software for vector quantization usually works only on numerical data. One example of this is Python's scipy.cluster.vq.vq (here), which performs vector quantization. The numerical data requirement also shows up for most clustering software.
Many have pointed out that you can always convert a categorical variable to a set of binary numeric variables. But this becomes awkward when working with big data where an individual categorical variable may have hundreds or thousands of categories.
The obvious alternative is to change the distance function. With mixed data types, the distance from an observation to a "center" or "codebook entry" could be expressed as a two-part sum involving (a) the usual Euclidean calculation for the numeric variables and (b) the sum of inequality indicators for categorical variables, as proposed here on page 125.
Is there any open-source software implementation of vector quantization with such a generalized distance function?
For machine learning and clustering algorithms you may also find scikit-learn useful. To achieve what you want, you can have a look at their implementation of DBSCAN.
In their documentation, you can find:
sklearn.cluster.dbscan(X, eps=0.5, min_samples=5, metric='minkowski', algorithm='auto', leaf_size=30, p=2, random_state=None)
Here X can be either your already-computed distance matrix (passing metric='precomputed') or the standard samples-by-features matrix, while metric= can be a string (the identifier of one of the already implemented distance functions) or a callable Python function that will compute distances in a pairwise fashion.
If you can't find the metric you want, you can always program it as a python function:
import numpy as np
def mydist(a, b):
    # must return a single non-negative number; e.g. the Euclidean distance
    return np.sqrt(np.sum((a - b) ** 2))
And call dbscan with metric=mydist. Alternatively, you can compute your distance matrix beforehand and pass it to the clustering algorithm.
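For the mixed numeric/categorical case from the question, here is a hedged sketch of the two-part distance, assuming (this split is an assumption to adapt to your data) that the first NUM_COLS columns are numeric and the remaining columns are integer-coded categories, used with the DBSCAN estimator class:

import numpy as np
from sklearn.cluster import DBSCAN

NUM_COLS = 2  # assumption: first 2 columns numeric, the rest integer-coded categories

def mixed_dist(a, b):
    num = np.sqrt(np.sum((a[:NUM_COLS] - b[:NUM_COLS]) ** 2))  # Euclidean part
    cat = np.sum(a[NUM_COLS:] != b[NUM_COLS:])                 # mismatch count for categories
    return num + cat

X = np.array([[0.1, 0.2, 1.0, 3.0],
              [0.1, 0.3, 1.0, 3.0],
              [5.0, 5.0, 2.0, 0.0]])   # toy data

labels = DBSCAN(eps=1.0, min_samples=1, metric=mixed_dist).fit_predict(X)
print(labels)

Note that the relative weighting of the two parts (here simply 1:1) is itself a modelling choice you would need to tune.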
There are some other clustering algorithms in the same library, have a look at them here.
You cannot "quantize" categorical data.
Recall definitions of quantization (Wiktionary):
To limit the number of possible values of a quantity, or states of a system, by applying the rules of quantum mechanics
To approximate a continuously varying signal by one whose amplitude can only have a set of discrete values
In other words, quantization means converting a continuous variable into a discrete variable. Vector quantization does the same, for multiple variables at the same time.
However, categorical variables are already discrete.
What you seem to be looking for is a prototype-based clustering algorithm for categorical data (maybe STING and COOLCAT? I don't know if they will produce prototypes); but this isn't "vector quantization" anymore.
I believe that very often, frequent itemset mining is actually the best approach to find prototypes/archetypes of categorical data.
As for clustering algorithms that allow other distance functions: there are plenty. ELKI has a lot of such algorithms, and also a tutorial on implementing a custom distance. But this is Java, not Python. I'm pretty sure at least some of the clustering algorithms in scipy allow custom distances, too.
Now Python's scipy.cluster.vq.vq is really simple code. You do not need a library for that at all. The main job of this function is wrapping a C implementation which runs much faster than Python code. If you look at the py_vq version (which is used when the C version cannot be used), it is really simple code: essentially, for every object obs[i] it calls this function:
code[i] = argmin(np.sum((obs[i] - code_book) ** 2, 1))
Now you obviously can't use Euclidean distance with a categorical codebook; but translating this line to whatever similarity you want is not hard (see the sketch below).
The harder part usually is constructing the codebook, not using it.
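As an illustration, a small sketch of that translation: the squared-Euclidean assignment line is replaced by a count of mismatched categorical attributes (the toy codebook and observations are made up):

import numpy as np

code_book = np.array([["a", "x"], ["b", "y"], ["b", "x"]])  # made-up categorical codebook
obs = np.array([["b", "x"], ["a", "y"]])                    # made-up observations

code = np.empty(len(obs), dtype=int)
for i in range(len(obs)):
    # nearest codebook entry by number of mismatched categorical attributes
    code[i] = np.argmin(np.sum(obs[i] != code_book, axis=1))

print(code)  # index of the closest codebook entry for each observation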

Threshold in Hierarchical clustering

I am new to clustering and doing a small project on clustering tweets. I used TF-IDF and then hierarchical clustering. I am confused about setting the threshold value for hierarchical clustering. What should its value be, and how do I decide it?
I used the Python scikit module for the implementation.
While there are several methods that exist to help terminate hierarchical clustering (or clustering in general) there is no best general way to do this. This stems from the fact that there is no "correct" clustering of arbitrary data. Rather, "correctness" is very domain and application specific.
So while you can try out different methods (e.g., elbow or others) they will in turn have their own parameters that you will have to "tune" to obtain a clustering that you deem "correct". This video might help you out a bit (though it mainly deals with k-means, the concepts extend to other clustering approaches) - https://www.youtube.com/watch?v=3JPGv0XC6AE
I assume you are talking about choosing the number of clusters to extract from your hierarchical clustering algorithm. There are several ways of doing this, and there is a nice Wikipedia article about it with some theory: http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
For practical examples take a look at this question: Tutorial for scipy.cluster.hierarchy
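For a concrete starting point, a minimal sketch with scipy's hierarchy module (the random data, the 'ward' linkage and the threshold t=2.0 are placeholders you would tune, e.g. by inspecting the dendrogram):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(50, 10)   # placeholder for your (dense) TF-IDF vectors

Z = linkage(X, method='ward')                        # build the cluster hierarchy
labels = fcluster(Z, t=2.0, criterion='distance')    # cut the tree at distance threshold t
print(len(set(labels)), "clusters")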
