I am trying to separate a data set that has two clusters that do not overlap in any way, plus a single data point that lies away from both clusters.
When I use kmeans() to get the 2 clusters, it splits one of the "valid" clusters in half and treats the single data point as a separate cluster.
Is there a way to specify a minimum number of points per cluster? I am using MATLAB.
There are several solutions:
Easy: try with 3 clusters (see the sketch after this list);
Easy: remove the single data point, which you can detect as an outlier with any outlier detection technique;
To be tried: use a k-medoids approach instead of k-means. This sometimes helps to get rid of outliers.
More complicated but sure to work: perform spectral clustering. This helps you get past the main issue of k-means, which is its blunt use of the Euclidean distance.
More explanation of the inadequate behaviour of k-means can be found on the Cross Validated site (see here for instance).
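A minimal sketch of the first two options, written in Python/scikit-learn for concreteness (MATLAB's kmeans() supports the same workflow); X stands in for the (n, d) data matrix:

import numpy as np
from sklearn.cluster import KMeans

# option 1: ask for 3 clusters so the lone point can take one for itself
km3 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# option 2: treat the smallest cluster as the outlier, drop it,
# then re-run k-means with k=2 on the cleaned data
counts = np.bincount(km3.labels_)
X_clean = X[km3.labels_ != counts.argmin()]
km2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_clean)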
I have N three-dimensional vectors (x, y, z). I want a simple yet effective approach for clustering them (I do not know the number of clusters a priori, nor can I guess a valid number). I am not familiar with classical machine learning, so any advice would be helpful.
The general Sklearn clustering page does a decent job of providing useful background on clustering methods and gives a nice overview of the differences between them. Importantly for your case, the table in section 2.3.1 lists the parameters of each method.
The differences between methods tend to come down to how your knowledge of the dataset matches the assumptions of each model. Some expect you to know the number of clusters (such as K-Means), while others attempt to determine the number of clusters from other input parameters (like DBSCAN; see the sketch after this answer).
While focusing on methods that attempt to find the number of clusters may seem preferable, it is also possible to use a method that expects the number of clusters and simply test many different reasonable cluster counts to determine which is optimal. One such example with K-Means is this.
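Since DBSCAN came up above as a method that infers the number of clusters from its other parameters, here is a hedged sketch; the variable vectors stands in for the (N, 3) array from the question, and the eps/min_samples values are placeholders you would tune:

from sklearn.cluster import DBSCAN

# fit DBSCAN; it derives the number of clusters from density, not from a fixed k
db = DBSCAN(eps=0.5, min_samples=5).fit(vectors)
labels = db.labels_  # label -1 marks points DBSCAN considers noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)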
The easiest algorithms for clustering are K-Means (if your three features are numerical) and K-Medoids (which allows any type of feature).
These algorithms are quite easy to understand. In a few words, by calculating some distance measure between each observation in the dataset, they try to assign each observation to the cluster closest (in distance) to it. The main issue with these algorithms is that you have to specify how many clusters (K) you want, but there are techniques, such as the Elbow method or the Silhouette, that let you determine numerically which value of K would be a reasonable number of clusters. A sketch of the Silhouette route follows below.
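As a rough illustration of the Silhouette approach just mentioned (assuming the data sits in an array called vectors; the K range of 2 to 10 is an arbitrary choice):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

best_k, best_score = None, -1.0
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
    score = silhouette_score(vectors, labels)  # higher means better-separated clusters
    if score > best_score:
        best_k, best_score = k, score
print(best_k, best_score)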
I have a set of lat/long coordinates spread out across a city (about 1,000 points). I'd like to create clusters from this data following some strict rules:
1. No cluster can have more than X data points in it (possibly 8, but this can change)
2. No cluster can contain two data points that are more than X km apart (possibly 1 km, but this can change too)
3. There can be clusters with one single point
4. No specific number of clusters needs to be created
I've tried doing this with AgglomerativeClustering from sklearn, using the following code:

from sklearn.cluster import AgglomerativeClustering

# arr holds the (lat, long) pairs; note that Euclidean distance on raw
# degrees is only a rough proxy for km (0.01 deg of latitude is ~1.1 km)
cluster = AgglomerativeClustering(n_clusters=None, affinity='euclidean',
                                  linkage='complete', distance_threshold=0.01)
cluster.fit_predict(arr)
The issue here is that I'm only fulfilling item 4 above, not items 1, 2, or 3.
I'd like a clustering algorithm where I can set those parameters and have it produce the most efficient clustering possible (i.e., the least number of clusters that respects all of items 1 through 4).
Is there any way this could be done with sklearn or any other importable clustering algorithm, or would one have to build this manually?
Thanks!
Write your own.
A simple approach would be to use agglomerative clustering (the real one, e.g., from scipy; the sklearn version is too limited) to get the full merge history under complete linkage. Then process the merges bottom-up, accepting each merge only while it satisfies your two requirements: the complete-linkage distance is the maximum pairwise distance, so it checks your diameter rule directly, and if a cluster would become too large you simply stop merging it.
Beware that the result will, however, be quite unbalanced. My guess is that you want as few clusters as possible to cover your data given the maximum radius and occupancy. Then your problem is likely closer to set cover. Finding the optimum on such problems is usually NP-hard, so you'll have to accept an approximation. I'd go with a greedy strategy followed by iterative refinement via local search.
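A minimal sketch of that bottom-up processing with scipy, assuming the points have already been projected so that Euclidean distances approximate kilometres (raw lat/long would need a haversine or local projection first); all names here are hypothetical:

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

def constrained_labels(points, max_dist, max_size):
    n = len(points)
    # full complete-linkage merge history
    Z = linkage(pdist(points), method='complete')
    # members maps each still-active cluster id to its point indices;
    # ids 0..n-1 are the singletons, merge row i creates id n + i
    members = {i: [i] for i in range(n)}
    for i, (a, b, dist, new_size) in enumerate(Z):
        a, b = int(a), int(b)
        # accept a merge only if both children are still active and it
        # respects the diameter and size limits; rejecting a merge also
        # blocks every later merge that would build on it
        if a in members and b in members and dist <= max_dist and new_size <= max_size:
            members[n + i] = members.pop(a) + members.pop(b)
    labels = np.empty(n, dtype=int)
    for label, pts in enumerate(members.values()):
        labels[pts] = label
    return labels

Complete linkage makes the distance check exact here: the recorded merge distance is the maximum pairwise distance inside the merged cluster, i.e., its diameter.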
I'm currently struggling to wrap my head around how multi-linear regression could be done to find separate sets of linear models in a single data set. I can perform regression on a single data set with a single regressor and coefficient without a problem, but what if there is a known number of lines in a single data space?
My first approach was to use hierarchical clustering to identify the points first, but it doesn't seem to capture individual cluster variance in Euclidean space as expected. My second trial was KMeans, which still relies on Euclidean distance, so it creates clusters with radii. My last thought process led me to K-Medians, but at this point I was wondering what other people might think about this problem.
If this is the right direction, I know I would have to project the points into a better space (i.e., an axis that captures more, or most, of the variance) before applying these methods.
I would appreciate any comments or input in any shape or form.
Thank you,
3-line summary:
Linear regression on a dataset containing multiple lines
Clustering first, then multiple single linear regressions? (a sketch of this idea follows below)
Or have you come across a module for such a thing?
I would truly appreciate any insights.
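For what it's worth, a minimal sketch of the "cluster first, then fit one line per cluster" idea the summary describes, assuming a known number of lines k and hypothetical arrays X (shape (n, 1)) and y (shape (n,)):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

def cluster_then_fit(X, y, k):
    # cluster in the joint (x, y) space so points on the same line tend to
    # group together; a rough heuristic that crossing lines will confuse
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(
        np.column_stack([X, y]))
    models = [LinearRegression().fit(X[labels == c], y[labels == c])
              for c in range(k)]
    return labels, models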
Is it possible to do clustering without providing any input apart from the data? The clustering method/algorithm should decide from the data how many logical groups it can be divided into; ideally it shouldn't even require me to input the threshold Euclidean distance on which the clusters are built; that, too, should be learned from the data.
Could you please suggest the closest solution to my problem?
Why not code your algorithm to try cluster counts ranging from 1 to n (where n could be defined in a config file, so you avoid hard-coding and only have to fix it once)?
Once that is done, compute the clusterings for sizes 1 to n and choose the value that gives you the smallest mean squared error.
This requires some additional work from your machine to determine the optimal number of logical groups the data can be divided into (bounded between 1 and n).
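A hedged sketch of that sweep (data is a hypothetical array; note that the within-cluster squared error, KMeans.inertia_, decreases monotonically with k, so in practice you look for the elbow rather than the raw minimum):

from sklearn.cluster import KMeans

n = 10  # upper bound on the cluster count, as read from the config file
errors = {}
for k in range(1, n + 1):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    errors[k] = km.inertia_  # within-cluster sum of squared distances
for k, e in errors.items():
    print(k, e)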
Clustering is an explorative technique.
This means it must always be able to produce different results, as desired by the user. Having many parameters is a feature: it means the method can be adapted easily to very different data and to user preferences.
There will never be a generally useful parameter-free technique. At best, some parameters will have default values or heuristics (such as Euclidean distance, standardizing the input prior to clustering, or the gap statistic for choosing k) that may give a reasonable first try in 80% of cases. But after that first try, you'll need to understand the data and try other parameters to learn more about it.
Methods that claim to be "parameter-free" usually just have some hidden parameters set so that they work on the few toy examples they were demonstrated on.
Essentially, I applied a DBSCAN algorithm (sklearn) with a Euclidean distance on a subset of my original data. I found my clusters and all is fine, except that I want to keep only the values that are far enough from the points I did not run my analysis on. I have a new distance to test this with, and I wanted to understand how to do it WITHOUT numerous nested loops.
In a picture: my found clusters are in blue, whereas the red points are the ones I don't want to be near. The crosses are the cluster points that get carved out because they fall within the new distance I specified.
Now, while I could do something of the sort:
# naive O(n*m) scan over every red/blue pair
for i in red_points:
    for j in blue_points:
        if dist(i, j) < given_dist:
            original_dataframe.remove(j)
I refuse to believe there isn't a vectorized method. Also, I can't afford to do the above, simply because I'll have huge tables to operate on and I'd like to keep my CPU from evaporating.
Any and all suggestions welcome.
Of course you can vectorize this, but it will still be O(n*m). Better neighbor-search algorithms, such as the k-d tree and the ball tree, are not vectorized.
Both are available in sklearn, and they are used by the DBSCAN module. Please see the sklearn.neighbors package.
If you need exact answers, the fastest implementation should be sklearn's pairwise distance calculator:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html
If you can accept an approximate answer, you can do better with the k-d tree's query_radius(): http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html
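A hedged sketch of the tree route, reusing the question's names (blue_points and red_points as (n, 2) and (m, 2) coordinate arrays, given_dist as the cut-off):

import numpy as np
from sklearn.neighbors import KDTree

tree = KDTree(blue_points)
# for each red point, the indices of blue points within given_dist
hits = tree.query_radius(red_points, r=given_dist)
too_close = np.unique(np.concatenate(hits)).astype(int)
mask = np.ones(len(blue_points), dtype=bool)
mask[too_close] = False  # drop every blue point that sits too close to a red one
kept_blue = blue_points[mask]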