Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Edit: This question was written with little knowledge of clustering techniques and, in hindsight, does not meet the standards of Stack Overflow. SO won't let me delete it because others have invested time and energy in it (a valid point), and if I delete it anyway I may not be able to ask questions for a while. So I am updating the question to make it something others can learn from. It still doesn't strictly comply with SO guidelines (I would flag it as too broad myself), but in its current state it is of no value, so adding a little value is worth the downvotes.
Updated Conversation topic
The issue was selecting the optimal number of clusters for a clustering algorithm that grouped various shapes obtained from contour detection on an image; any deviation in cluster properties was then to be marked as noise or an anomaly. The main difficulty at the time was that every dataset was different: the shapes obtained differed, and the number of shapes varied from dataset to dataset. The proper solution is DBSCAN (Density-Based Spatial Clustering of Applications with Noise), an implementation of which can be found in scikit-learn and which I was unaware of at the time. That works, and the product is now in testing. I just wanted to come back and correct this old mistake.
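As a rough illustration of why DBSCAN fits this situation, here is a minimal sketch using scikit-learn. The feature values, `eps` and `min_samples` are made up for the example and are not taken from the actual product; real shape descriptors would need their own tuning. Note that no number of clusters is specified, and that points that belong to no dense region get the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D shape descriptors: two dense groups and one stray point.
group_a = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
group_b = group_a + 5.0
stray = np.array([[20.0, 20.0]])
features = np.vstack([group_a, group_b, stray])

# eps and min_samples are illustrative values and need tuning per dataset.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(features)

# The number of clusters is discovered, not specified up front.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

# DBSCAN marks noise points with the label -1.
anomalies = features[labels == -1]
```

Here `n_clusters` comes out of the data, and `anomalies` holds the points that would be flagged as noise.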
Old Question
Old Title: Dynamic selection of k in kmeans clustering
I have to generate a k-means clustering model in which the number of classes is not known in advance. Is there a way to automatically determine the value of k based on the Euclidean distance within the clusters?
How I want it to work: start with a value of k, perform clustering, check whether the result satisfies a threshold criterion, and increase or decrease k accordingly. The problem is framework independent, so if you have an idea or an implementation in a language other than Python, please share that as well.
I found this while researching the problem https://www.researchgate.net/publication/267752474_Dynamic_Clustering_of_Data_with_Modified_K-Means_Algorithm.
I couldn't find an implementation of it.
I am looking for similar ideas so that I can select the best one and implement it myself, or for an implementation that can be ported to my code.
Edit
The Ideas I am considering right now are:
The elbow method
X-means clustering
You can use the elbow method. It runs the clustering for various values of k (the number of clusters) and, for each one, calculates the distance of every point from its cluster center. Beyond a certain k there won't be any major improvement; that is the value you can take for k.
For further reading, refer to this link.
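A minimal sketch of the elbow method with scikit-learn might look like this. The helper names `sse_curve` and `elbow_k` are hypothetical, and picking the k with the largest second difference of the SSE curve is just one simple stand-in for eyeballing the bend; it assumes evenly spaced k values. The data is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

def sse_curve(X, k_values, random_state=0):
    """Within-cluster sum of squared distances (KMeans inertia) for each k."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X).inertia_
        for k in k_values
    ]

def elbow_k(k_values, sse):
    """Pick the k where the SSE curve bends most (largest second difference)."""
    bend = np.diff(sse, n=2)  # bend[i] belongs to k_values[i + 1]
    return k_values[int(np.argmax(bend)) + 1]

# Synthetic data: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((50, 2)) for c in centers])

k_values = list(range(1, 7))
sse = sse_curve(X, k_values)
best_k = elbow_k(k_values, sse)
```

On real data the bend is often much less pronounced than on blobs like these, so it is worth plotting `sse` against `k_values` before trusting an automatic pick.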
Iterate over the values of k and check the cluster validity using the silhouette score.
You can iterate over any range of k values and score each one. Either check the silhouette score for each k, or calculate the difference between the SSE values of consecutive k values. The k at which that difference is highest, looking only past the first 0.4 * (number of k values), is taken as the elbow point.
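The silhouette variant can be sketched the same way. This is an illustrative helper (`best_k_by_silhouette` is a made-up name) run on synthetic data, not a drop-in solution; the silhouette score is only defined for at least two clusters, so the scan starts at k=2.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_values, random_state=0):
    """Return (best k, {k: silhouette score}) over the given k values."""
    scores = {}
    for k in k_values:  # each k must satisfy 2 <= k <= n_samples - 1
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores

# Synthetic data: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((50, 2)) for c in centers])

best_k, scores = best_k_by_silhouette(X, range(2, 7))
```

Unlike the elbow approach, this gives a single number per k that can be compared directly, with higher being better.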
Related
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 1 year ago.
I ran the DBSCAN clustering algorithm on a historical trajectory dataset, which returned a set of clusters. Now, for each incoming new trajectory, I need to find the nearest cluster.
Suppose I have 10 clusters (c1 to c10) and a trajectory 't'. I want to find the cluster nearest to 't'. I saw people use kNN for this purpose, but as I am not fully clear on it, I am asking this question. What is the best/most efficient way to do this? I am using Python.
Clustering techniques such as DBSCAN generally work slightly differently from other machine learning models. Once you fit DBSCAN to your historical trajectory dataset, you cannot ask it for the label of a new trajectory the way you would with a traditional classifier or regressor. This leaves you a few options:
A) append your new trajectory to your historicals, run clustering again, see what label is assigned (this is very computationally expensive, probably a bad idea)
B) perform clustering on only historicals, use those generated labels to train a classifier, and perform inference on your new trajectory (this has high overhead, but with sufficient data can work pretty well)
C) use some measure of distance between your new trajectory and the mean of each cluster of historical trajectories (this is probably the easiest and fastest, but relies on your knowledge to implement it in a way that provides meaningful results)
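Option C can be sketched in a few lines of NumPy. This assumes each trajectory has already been turned into a fixed-length feature vector (how to do that is the part that depends on your domain knowledge) and uses plain Euclidean distance; the helper names are hypothetical. DBSCAN noise points (label -1) are left out of the means.

```python
import numpy as np

def cluster_means(features, labels):
    """Mean feature vector per cluster, skipping DBSCAN noise (label -1)."""
    return {
        int(c): features[labels == c].mean(axis=0)
        for c in np.unique(labels)
        if c != -1
    }

def nearest_cluster(means, new_vector):
    """Label of the cluster whose mean is closest in Euclidean distance."""
    return min(means, key=lambda c: np.linalg.norm(means[c] - new_vector))

# Toy feature vectors and the labels a clustering step might have produced.
features = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0], [50.0, 50.0]])
labels = np.array([0, 0, 1, 1, -1])

means = cluster_means(features, labels)
assigned = nearest_cluster(means, np.array([9.0, 9.0]))
```

A cluster mean is a poor summary for elongated or curved DBSCAN clusters, so distance to the nearest member of each cluster may be a better measure in those cases.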
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 2 years ago.
I have around 500,000 32-dimensional vectors (normalized with mean=0, std=1) that I currently store in a KD tree to efficiently find the nearest neighbor. However, I now also want to be able to exclude some of the vectors from the database dynamically for some queries (the condition changes often, so rebuilding the tree is not an option). Specifically, I want to restrict some of the 32 dimensions to a certain range, depending on conditions that change at runtime.
What I currently do, is instead of looking for k=1 nearest neighbors, I look for k=50 (or more) nearest neighbors and then iterate from the closest to the farthest until I find one that matches the condition. Unfortunately that is not a very elegant solution as it requires the query to return k=50 matches even if k=1 would already have returned the one I am looking for. Also, if k=50 was too small, I need to do another query with k=500 or so and that hurts performance.
So two solutions come to my mind:
Find a KD tree implementation that returns an iterator instead of a fixed-size result with k entries. The iterator would start at the nearest neighbor and then move towards neighbors with greater distance. Due to the design of a KDTree this should be very efficient. Then the tree only needs to be searched until a valid result is found and no fixed k needs to be specified. I was not able to find a Python implementation for that so far.
Use a different data structure or database (MySQL for example) that is designed to do queries based on conditions. Is there any database system (I am also open for NoSQL) that supports efficient nearest neighbor search using dynamic conditions? Maybe a database that allows to use a KD tree as the index?
If nothing is available yet, I'll probably try to implement a KD tree that does what I want myself.
EDIT: The language I am currently using is Python for prototyping; later I will move to C# (Unity).
The idea is similar to how to speed up boolean keyword search with positive terms:
Candidate selection: reduce the size of the search space as much as possible.
Scoring: compare each candidate with the query vector, keeping only the best candidate so far, until all candidates have been scored. That step can be done in parallel. Basically, a brute-force algorithm over a reduced space.
Unlike the full-text search question above, you have vectors of floats or doubles with constraints on one or more dimensions. That is a geometric problem, most often encountered in Geographic Information Systems (GIS), except that in your case there are 32 dimensions instead of two, three or even four.
One way to do the candidate selection is to index all the vectors using a space-filling curve. The constraints describe a region inside the 32-dimensional space, and you want to know which vectors are in that region, because the nearest neighbor you are looking for is necessarily inside it and cannot be outside. You cannot reduce the search space further without more constraints.
Space-filling curves like Morton code or XZ-ordering can easily be implemented on top of an ordered key-value store.
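The two steps above can be sketched with NumPy as follows. In this sketch the candidate selection is a plain linear scan over per-dimension bounds, standing in for whatever index would do that job in a real system; `constrained_nearest` and its parameters are hypothetical names.

```python
import numpy as np

def constrained_nearest(vectors, query, lower, upper):
    """Index of the vector nearest to `query` among those inside [lower, upper].

    Use -np.inf / np.inf in `lower` / `upper` for unconstrained dimensions.
    Returns None when no vector satisfies the constraints.
    """
    # Candidate selection: keep only vectors inside the constrained region.
    inside = np.all((vectors >= lower) & (vectors <= upper), axis=1)
    candidates = np.flatnonzero(inside)
    if candidates.size == 0:
        return None
    # Scoring: brute-force distance over the reduced set.
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return int(candidates[np.argmin(dists)])

vectors = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
query = np.array([0.1, 0.1])
lower = np.array([0.5, -np.inf])   # first dimension must be >= 0.5
upper = np.array([np.inf, np.inf])

best = constrained_nearest(vectors, query, lower, upper)
```

Without the constraint the nearest vector would be index 0; with it, index 0 is excluded before any distance is computed, so no fixed k has to be guessed.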
The best explanation of how the algorithm works is:
Z-Order Indexing for Multifaceted Queries in Amazon DynamoDB: Part 1
There is an implementation of XZ-ordering in Scala as part of GeoMesa.
There are various ordered key-value store (OKVS) implementations; to experiment, I recommend lsm-db.
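To give a feel for what a Morton code is, here is a small pure-Python sketch of the bit interleaving. The `quantize` helper and its range are assumptions made for the example (float coordinates have to be mapped to non-negative integers first); this is not the GeoMesa or DynamoDB implementation, and it leaves out the range-query decomposition that makes the index useful.

```python
def quantize(value, lo, hi, bits):
    """Map a float in [lo, hi] to an integer in [0, 2**bits - 1] (clipping outliers)."""
    clipped = min(max(value, lo), hi)
    return int((clipped - lo) / (hi - lo) * ((1 << bits) - 1))

def morton_encode(coords, bits):
    """Interleave the low `bits` bits of each non-negative integer coordinate."""
    ndim = len(coords)
    code = 0
    for bit in range(bits):
        for dim, value in enumerate(coords):
            # Bit `bit` of dimension `dim` lands at position bit * ndim + dim.
            code |= ((value >> bit) & 1) << (bit * ndim + dim)
    return code

# Two dimensions, 3 bits each: x = 3 (011), y = 5 (101).
code = morton_encode((3, 5), bits=3)

# A normalized float vector, quantized over an assumed range of [-4, 4].
key = morton_encode([quantize(v, -4.0, 4.0, 4) for v in (0.0, -4.0, 4.0)], bits=4)
```

Points that are close in space tend to get nearby codes, which is what lets an ordered key-value store answer a region query with a small number of key-range scans.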
Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 4 years ago.
Is there any easy way to return the furthermost outlier after sklearn kmeans clustering?
Essentially I want to make a list of the biggest outliers for a load of clusters. Unfortunately I need to use sklearn.cluster.KMeans due to the assignment.
K-means is not well suited for "outlier" detection.
K-means has a tendency to put an outlier into a one-element cluster of its own. Such an outlier then has the smallest possible distance to its cluster center and will not be detected.
K-means is not robust when there are outliers in your data. You may actually want to remove outliers before using k-means.
Use something like kNN, LOF or LoOP instead.
Sascha basically gives it away in the comments, but if X denotes your data, np is numpy, and model is the fitted instance of KMeans, you can sort the rows of X by the distance to their cluster centers (closest first, so the biggest outliers come last) through
X[np.argsort(np.linalg.norm(X - model.cluster_centers_[model.labels_], axis=1))]
Alternatively, since you know that each point is assigned to the cluster whose center minimizes Euclidean distance to the point, you can fit and sort in one step through
X[np.argsort(np.min(KMeans(n_clusters=2).fit_transform(X), axis=1))]
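Since the question asks for the biggest outlier of each cluster, the one-liners above can be extended into a small runnable sketch. The helper name `furthest_point_per_cluster` and the toy data are made up for illustration; only `cluster_centers_` and `labels_` come from scikit-learn's KMeans.

```python
import numpy as np
from sklearn.cluster import KMeans

def furthest_point_per_cluster(X, model):
    """Map each cluster label to the index of its member furthest from the center."""
    dists = np.linalg.norm(X - model.cluster_centers_[model.labels_], axis=1)
    result = {}
    for label in np.unique(model.labels_):
        members = np.flatnonzero(model.labels_ == label)
        result[int(label)] = int(members[np.argmax(dists[members])])
    return result

# Two tight groups, each with one point that sits further out.
X = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [3.0, 3.0],
    [20.0, 20.0], [20.0, 21.0], [21.0, 20.0], [21.0, 21.0], [24.0, 24.0],
])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
outliers = furthest_point_per_cluster(X, model)
```

`outliers` maps each cluster label to a row index of `X`; which label each group receives depends on the initialization, so only the indices are meaningful here.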
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
I have a series of line data (each line is 2-3 connected points).
What is the best machine learning algorithm for grouping lines by the similarity of their locations? (image below)
Preferably something from a Python library such as scikit-learn.
Edit:
I have tried DBSCAN, but the problem I faced was that when two lines intersect each other, DBSCAN sometimes puts them into one group even though they run in completely different directions.
Here is a solution I found so far:
GeoPath Clustering Algorithm
The idea here is to cluster geo paths that travel very similarly to each other into groups.
Steps:
1- Cluster lines based on slope
2- Within each cluster from step 1, find the centroids of the lines and use the k-means algorithm to cluster them into smaller groups
3- Within each group from step 2, calculate the length of each line and group lines that fall within a defined length threshold
The result will be small groups of lines that have a similar slope, are close to each other and have a similar travel distance.
Here are screen shots of visualization:
Yellow lines are all the lines; red lines are clusters of paths that travel together.
I'll throw in an answer since I think the current one is incomplete...and I also think the comment of "simple heuristic" is premature. I think that if you cluster on points, you'll get a different result from what your diagram depicts, as the clusters will be near the end-points and you won't get your nice ellipses.
So, if your data really does behave similarly to how you display it, I would take a stab at turning each set of 2/3 points into a longer list of points that basically trace out the lines. (You will need to experiment with how dense to make them.)
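Turning each short polyline into a denser trace of points could be done along these lines. `densify` and `points_per_segment` are hypothetical names, and linear interpolation with a fixed number of points per segment is just one simple choice; spacing by segment length would be another.

```python
import numpy as np

def densify(polyline, points_per_segment=10):
    """Trace a polyline with evenly spaced points along each segment."""
    polyline = np.asarray(polyline, dtype=float)
    pieces = []
    for start, end in zip(polyline[:-1], polyline[1:]):
        # endpoint=False avoids duplicating the vertex shared with the next segment.
        t = np.linspace(0.0, 1.0, points_per_segment, endpoint=False)[:, None]
        pieces.append(start + t * (end - start))
    pieces.append(polyline[-1:])  # close with the final vertex
    return np.vstack(pieces)

# A 3-point line traced with 2 points per segment.
dense = densify([[0, 0], [2, 0], [2, 2]], points_per_segment=2)
```

The resulting points from all lines can then be stacked into one array, keeping track of which line each point came from.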
Then run HDBSCAN on the result to get your clusters (see this video: https://www.youtube.com/watch?v=AgPQ76RIi6A ). I believe "pip install hdbscan" installs it.
Now, when testing a new sample, first decompose it into many (N) points and get a cluster label for each of them from your fitted HDBSCAN model. I reckon that if you take a majority-voting approach over your N points, you'll get the best overall cluster to which the "line" belongs.
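The majority vote itself is small enough to sketch in plain Python. It takes the per-point labels however they were obtained and ignores noise; `majority_cluster` is a made-up helper name, and treating -1 as the noise label follows the DBSCAN/HDBSCAN convention.

```python
from collections import Counter

def majority_cluster(point_labels, noise_label=-1):
    """Most common cluster label among a line's points, ignoring noise."""
    votes = Counter(label for label in point_labels if label != noise_label)
    if not votes:
        return noise_label  # every point was noise: the line belongs to no cluster
    return votes.most_common(1)[0][0]

# Labels assigned to the N points traced along one new line.
line_cluster = majority_cluster([2, 2, -1, 3, 2])
```

It may also be worth looking at how lopsided the vote is: a line whose points split almost evenly between two clusters is probably crossing both rather than belonging to either.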
So, while I sort of agree with the "simple heuristic" comment, it's not so simple if you want the whole thing automated. And once you watch the video you may be convinced that HDBSCAN, because of its density-based algorithm, will suit this problem (if you decide to create many points from each sample).
I'll wrap up by saying that I'm sure there are line-intersection models that have done this before...and that there do exist heuristics and rules that can do the job. Likely, they're computationally more economical too. My answer is just something organic using sklearn as you requested...and I haven't even tested it! It's just how I would proceed if I were in your shoes.
edit
I poked around and there are a couple of line similarity measures you could try: the Fréchet and Hausdorff distances.
Frechet: http://arxiv.org/pdf/1307.6628.pdf
Hausdorff: see "distance matrix of curves in python" for a Python example.
If you generate all pairwise similarities and then group the lines according to similarity and/or into N bins, you can call those bins your "clusters" (not kmeans clusters though!). For each new line, generate all its similarities and see which bin it belongs to. I take back my earlier remark that such measures are likely to be computationally cheaper...you're lucky your lines only have 2 or 3 points!
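A pairwise distance matrix of this kind can be sketched in NumPy. Note the simplification: this computes the Hausdorff distance between the vertex sets of two lines, which only approximates the distance between the continuous segments, and the helper names are hypothetical.

```python
import numpy as np

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets (vertices only)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # d[i, j] is the Euclidean distance between a[i] and b[j].
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def pairwise_hausdorff(lines):
    """Symmetric matrix of Hausdorff distances between all pairs of lines."""
    n = len(lines)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = hausdorff(lines[i], lines[j])
    return dist

lines = [
    [[0, 0], [1, 0]],
    [[0, 1], [1, 1]],
    [[0, 10], [1, 10], [2, 10]],
]
dist = pairwise_hausdorff(lines)
```

A matrix like this could then be binned by hand as described above, or passed to a clustering routine that accepts precomputed distances, such as scikit-learn's DBSCAN with metric="precomputed".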
The problem you're trying to solve is called clustering. For an overview of clustering algorithms in sklearn, see http://scikit-learn.org/stable/modules/clustering.html#clustering.
Edit 2: KMeans was what sprang to mind when I first saw your post, but based on feedback from the comments it looks like it's not a good fit. You may want to try sklearn's DBSCAN instead.
A potential transformation or extra feature you could add would be to fit a straight line to each set of points, and then use the (slope, intercept) pair. You may also want to use the centroid of each line.
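The (slope, intercept) and centroid features could be computed like this. It is only a sketch: `line_features` is a made-up name, the centroid is taken as the mean of the vertices, and a least-squares fit of y on x breaks down for vertical or near-vertical lines, where an angle-based feature would be safer.

```python
import numpy as np

def line_features(points):
    """Return [slope, intercept, centroid_x, centroid_y] for one line."""
    points = np.asarray(points, dtype=float)
    # Least-squares straight line through the 2-3 points (y as a function of x).
    slope, intercept = np.polyfit(points[:, 0], points[:, 1], deg=1)
    centroid_x, centroid_y = points.mean(axis=0)
    return np.array([slope, intercept, centroid_x, centroid_y])

features = line_features([[0, 1], [1, 3], [2, 5]])
```

Since these features live on very different scales, they would normally be standardized before being handed to a distance-based clusterer.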
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
I have a set of data whose scatter plot looks something like this:
I've marked the correct answer with a red area; it's almost in the center, between the two branches. (The scatter plot is 'V'-shaped.)
I need an algorithm for finding this area and collecting all the scatter data contained in it (because there are other datasets like this one).
Both x,y data have been uploaded here:
Data
Based on your question so far, it is difficult to know how to evaluate what is correct (i.e. why is this region correct? Is it based on the values/coordinates of the points, or on the point density in the region? Is it based on the position with respect to the larger structure, i.e. the centre of the branches? etc.).
That being said, there are a lot of machine learning algorithms available, e.g. in scikit-learn for Python. Using a supervised learning algorithm, you could train a model on some labelled data, and it could then (try to) find the correct answer for other data.
More of an answer is difficult to provide before you rephrase your question.
If all your data looks like this, one option might be to do a PCA (i.e. dimensionality reduction) on the data to separate the branches into two clusters. You would then get some data points which cannot clearly be identified as belonging to only one branch, and those are the ones you could select (see scikit-learn's PCA docs). Note that while it should be reasonably accurate, you would never get a perfect circle using this.
If you only need it for this one dataset, for which you already know the "radius" and centre, you could define the centre (x0, y0) of your circle (or ellipse) and its semi-major and semi-minor axes a and b, and then test each point against the canonical form ((x-x0)/a)^2 + ((y-y0)/b)^2 <= 1.
It might be simpler to use a square, though.
So it would look something like this (assuming 1-d numpy.ndarrays):
import numpy as np

# xarr, yarr and data are 1-d numpy arrays of equal length;
# xmin, xmax, ymin and ymax bound the square of interest.

# selecting points in a square
condition = (xarr > xmin) & (xarr < xmax) & (yarr > ymin) & (yarr < ymax)
# depending on what you want: coordinates, or values at those coordinates
xsq = xarr[condition]
ysq = yarr[condition]
squaredata = data[condition]

# for an ellipse centred at (x0, y0) with semi-axes a and b:
# x0, y0, a and b can be preset if you only need this one dataset.
def in_ellipse(x, y, x0, y0, a, b):
    # works element-wise on numpy arrays, so no np.vectorize is needed
    return ((x - x0) / a) ** 2 + ((y - y0) / b) ** 2 <= 1.0

mask = in_ellipse(xarr, yarr, 1.6, -1125, 0.1, 10)
ellipsedata = data[mask]
x_ellipse = xarr[mask]
y_ellipse = yarr[mask]
The values for x0, y0, a and b were just estimated by looking at the picture.