Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 1 year ago.
I ran the DBSCAN clustering algorithm on a historical trajectory dataset, which returns a set of clusters. Now, for each incoming new trajectory, I need to find the nearest cluster.
Suppose I have 10 clusters (c1 to c10) and a trajectory 't'. I want to find the cluster nearest to the trajectory 't'. I have seen people use kNN for this purpose, but as I am not fully clear on it, I am asking this question. What is the best/most efficient way to do this? I am using Python.
Clustering techniques such as DBSCAN generally work differently from other machine learning models. Once you fit a model to your historical trajectory dataset, you cannot call predict on a new trajectory the way a traditional classifier or regressor would. This gives you a few options, either:
A) append your new trajectory to your historicals, run clustering again, see what label is assigned (this is very computationally expensive, probably a bad idea)
B) perform clustering on only historicals, use those generated labels to train a classifier, and perform inference on your new trajectory (this has high overhead, but with sufficient data can work pretty well)
C) use some measure of distance between your new trajectory and the mean of each cluster of historical trajectories (this is probably the easiest and fastest, but relies on your knowledge to implement it in a way that provides meaningful results)
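Option C can be sketched roughly as follows, assuming each trajectory has already been resampled to a fixed number of points and flattened into a numeric feature vector; the names `historical`, `labels`, and `new_traj` are hypothetical placeholders, not part of any library API:

```python
# Sketch of option C: assign a new trajectory to the cluster whose mean
# (over the historical trajectories in that cluster) is closest to it.
import numpy as np

def nearest_cluster(historical, labels, new_traj):
    """Return the label of the cluster whose mean is closest to new_traj.

    historical : (n_samples, n_features) array of flattened trajectories
    labels     : (n_samples,) array of DBSCAN cluster labels
    new_traj   : (n_features,) array for the incoming trajectory
    """
    best_label, best_dist = None, np.inf
    for label in set(labels):
        if label == -1:          # skip DBSCAN's noise label
            continue
        centroid = historical[labels == label].mean(axis=0)
        dist = np.linalg.norm(new_traj - centroid)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```

Whether the cluster mean is a meaningful representative depends on how you encode the trajectories; for raw GPS traces of varying lengths, a trajectory distance such as DTW against every member of each cluster may be more appropriate than a centroid.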
I have a project I'm working on and am running into an issue. Essentially I have these points scattered across an x/y plot. I have one test point, where I get the target data (y) for the classification (a number from 1 to 6). I have lots of points where I have depth-indexed data with some features. The issue with these points is that I don't get a lot of data per point (maybe 100 observations).
I'm using the point closest to the test point to fit the model, then trying to generalize that to the other points that are farther apart. It's not giving me great results.
I understand there's not a lot of data to fit to so I'm trying to improve the model by adding a set of 'k' points close to the test point.
These points all share the same columns, so I've tried stacking them vertically, but then my indexes don't match the predictor variable y.
I've tried to concat them at the end, using a suffix denoting the specific point id, but then I get an error about the number of input features (for one point) when I try predicting again with the model using the combined features.
Essentially what I'm trying to do is the following :
model.fit([X_1,X_2,X_3,X_4],y)
model.predict(X_5)
Where :
All features are numeric (floats)
X_1.columns = X_i.columns
Each X matrix is about 100 points long with a continuous index [0:100].
I only have one test point (with 100 observations) for each group of points, so it's imperative I use as much data close to the test point as possible.
Is there another model or technique I can use for this? I've done a bit more research into NN models (I'm not familiar with them, so would prefer to avoid them) and found that Keras can take multiple inputs for fitting via its functional API, but can I predict with only one input after it has been fitted to multiple?
Could you give more information about the features / classes, and the model you're using? It would make things easier to understand.
However, I can give two pointers based on what you've said so far.
To have a better measurement of how well your model is generalizing, you should have more than one test point. See https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets
Sounds like you're using a k-Nearest Neighbors approach. If you aren't already, using the sklearn implementation will save a lot of time, and you can easily experiment with different hyperparameters: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
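A minimal sketch of the sklearn k-nearest-neighbors classifier mentioned above; the feature matrix and labels here are synthetic stand-ins for the asker's depth-indexed data, not real values:

```python
# Minimal KNeighborsClassifier sketch on synthetic stand-in data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))        # ~100 observations, 3 features
y_train = (X_train[:, 0] > 0).astype(int)  # toy binary target in place of classes 1-6

knn = KNeighborsClassifier(n_neighbors=5)  # n_neighbors is the main hyperparameter
knn.fit(X_train, y_train)
pred = knn.predict(rng.normal(size=(10, 3)))
```

Note that `fit` expects a single 2-D feature matrix, so several per-point matrices with identical columns would be stacked vertically (with the corresponding y values repeated per row) rather than passed as a list.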
Other techniques: I like to start off with XGBoost or Random Forest, as those methods require little tuning and are reasonably robust. However, there is no magic bullet cure for modeling on a small dataset. The best thing to do would be to collect more data, or if that's impossible, you need to drill down and really understand your data (identify outliers, plot histograms / KDE, etc.).
Edit: This question was written with little knowledge of clustering techniques and, in hindsight, does not meet the standards of Stack Overflow. SO won't let me delete it, saying others have invested time and energy in it (a valid point), and if I proceed to delete it I may not be able to ask questions for a while. So I am updating this question to make it relevant in a way that others can learn from. It still doesn't strictly comply with SO guidelines (I would flag it myself as too broad), but in its current state it is of no value, so adding a little value to it is worth the downvotes.
Updated question
The issue was to select the optimal number of clusters in a clustering algorithm that would group various shapes obtained from contour detection on an image; a deviation in cluster properties was then to be marked as noise or an anomaly. The main point that raised the question at the time was that all datasets were different: the shapes obtained in them differed, and the number of shapes also varied from dataset to dataset. The proper solution is to use DBSCAN (Density-Based Spatial Clustering of Applications with Noise), an implementation of which can be found in scikit-learn; I was unaware of it at the time. That works, and the product is now in testing. I just wanted to come back to this and correct this old mistake.
Old Question
Old title: Dynamic selection of k in k-means clustering
I have to generate a k-means clustering model in which the number of classes is not known in advance. Is there a way to automatically determine the value of k based on the Euclidean distances within the clusters?
How I want it to work: start with a value of k, perform clustering, see if it satisfies a threshold criterion, and increase or decrease k accordingly. The problem is framework-independent, and if you have an idea or an implementation in a language other than Python, please share that as well.
I found this while researching the problem https://www.researchgate.net/publication/267752474_Dynamic_Clustering_of_Data_with_Modified_K-Means_Algorithm.
I couldn't find an implementation of it.
I am looking for similar ideas to select the best and implement it myself, or an implementation that can be ported to my code.
Edit
The Ideas I am considering right now are:
The elbow method
X-means clustering
You can use the elbow method. What this method basically does is run the clustering with various values of k (number of clusters) and calculate the distance of each point from its cluster center. After a certain number of clusters there won't be any major improvement; that value you can take for k (the number of clusters).
You can refer to this link for further reading.
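The elbow method described above can be sketched like this on synthetic data; the cluster locations and the range of k values tried are arbitrary choices for illustration:

```python
# Sketch of the elbow method: run KMeans for a range of k values and
# record the inertia (sum of squared distances of points to their centers).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs, so the true k is 3.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5, 10)])

inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
# Inertia drops sharply until k reaches the true number of clusters,
# then flattens out; the bend ("elbow") in a plot of inertias vs. k
# is the value to pick.
```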
You can iterate over the values of k and check cluster validity using the silhouette score.
You can iterate over k values in any range. Either check the silhouette score for each k value, or calculate the difference between SSE values for successive k values; the point where the drop in SSE levels off is the elbow point.
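The silhouette-based selection can be sketched as follows; the two-blob dataset is a synthetic example, and the range of k values tried is arbitrary:

```python
# Sketch of silhouette-based k selection: the k with the highest average
# silhouette score is a reasonable choice.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Two well-separated blobs, so the best k should come out as 2.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 6)])

scores = {}
for k in range(2, 6):   # silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```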
So I am fairly new to machine learning and I am trying to create a Python script to analyse an energy dataset of a computer.
The script should, in the end, determine the different states of the computer (like idle, standby, working, etc.) and how much energy those states use on average.
I was wondering if this task could be done by some clustering method like k-means or DBSCAN.
I tinkered a bit with some clustering methods in scikit-learn, but the results so far were not as good as I expected.
I researched clustering methods a lot but could never find a scenario similar to mine.
So my question is: is it even worth the trouble, and if yes, which clustering method (or machine learning algorithm in general) would be best suited for the task? Or are there better ways to do it?
The energy dataset is just a single column table with one cell being one energy value per second of a few days.
You will not be able to apply supervised learning for this dataset as you do not have labels for your dataset (there is no known state given an energy value). This means that models like SVM, decision trees, etc. are not feasible given your dataset.
What you have is a timeseries with a single variable output. As I understand it, your goal is to determine whether or not there are different energy states, and what the average value is for those state(s).
I think it would be incredibly helpful to plot the timeseries using something like matplotlib or seaborn. After plotting the data, you can have a better feel for whether your hypothesis is reasonable and how you might further want to approach the problem. You may be able to solve your problem by just plotting the timeseries and observing that there are, say, four distinct energy states (e.g. idle, standby, working, etc.), avoiding any complex statistical techniques, machine learning, etc.
To answer your question: you can in principle use k-means on one-dimensional data. However, this is usually not recommended, as these techniques are designed for multidimensional data.
I would recommend that you look into Jenks natural breaks optimization or kernel density optimization. Similar questions to yours can be found here and here, and should help you get started.
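If you do try k-means on the one-dimensional series, the main practical wrinkle is that sklearn expects a 2-D array, so the single column must be reshaped. A sketch on synthetic data; the three power levels (and the idea that they stand for idle/standby/working) are assumptions for illustration:

```python
# Sketch: k-means on a 1-D energy series by reshaping to (n_samples, 1).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic readings around three hypothetical power levels (watts).
energy = np.concatenate([rng.normal(loc=mu, scale=2.0, size=200)
                         for mu in (20, 60, 120)])

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(energy.reshape(-1, 1))   # sklearn needs a 2-D array
state_means = sorted(km.cluster_centers_.ravel())  # average power per state
```

Here the number of states (3) is chosen by hand; in practice you would pick it by inspecting a histogram of the readings, as suggested above.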
Don't ignore time.
First of all, if your signal is noisy, temporal smoothing will likely help.
Secondly, you'll want to perform some feature extraction first, for example by using segmentation to cut your time series into separate states. You can then try to cluster these states, but I am not convinced that clustering is applicable here at all. You probably want a histogram or a density plot instead. It's one-dimensional data: you can visualize it and choose thresholds manually, instead of hoping that some automated technique will work (because it may not).
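The temporal smoothing suggested above can be as simple as a moving average over the one-reading-per-second series; the window length here (60 s) is an arbitrary choice for illustration:

```python
# Sketch of temporal smoothing: a centred moving average over a noisy
# 1-D energy series, applied before any state detection or thresholding.
import numpy as np

def moving_average(signal, window=60):
    """Smooth a 1-D series with a 'window'-sample moving average."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="same")

rng = np.random.default_rng(0)
noisy = 50 + rng.normal(scale=5.0, size=3600)   # one hour of noisy readings
smooth = moving_average(noisy)
# The smoothed series has far less variance, so state boundaries and
# histogram peaks become much easier to see.
```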
Is there any easy way to return the furthest outlier after sklearn KMeans clustering?
Essentially I want to make a list of the biggest outliers for a load of clusters. Unfortunately I need to use sklearn.cluster.KMeans due to the assignment.
K-means is not well suited for "outlier" detection.
k-means has a tendency to make outliers a one-element cluster. Then the outliers have the smallest possible distance and will not be detected.
K-means is not robust enough when there are outliers in your data. You may actually want to remove outliers prior to using k-means.
Rather, use something like kNN, LOF, or LoOP instead.
Sascha basically gives it away in the comments, but if X denotes your data and model the fitted KMeans instance, you can sort the points of X by their distance to their assigned centers through
X[np.argsort(np.linalg.norm(X - model.cluster_centers_[model.labels_], axis=1))]
Alternatively, since you know that each point is assigned to the cluster whose center minimizes Euclidean distance to the point, you can fit and sort in one step through
X[np.argsort(np.min(KMeans(n_clusters=2).fit_transform(X), axis=1))]
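Expanded into a small self-contained sketch (the data here is synthetic, with one point planted far from both blobs so it shows up as the top outlier):

```python
# The one-liners above, expanded: fit KMeans, compute each point's distance
# to its assigned center, and take the points with the largest distances.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0, size=(50, 2)),
               rng.normal(loc=8, size=(50, 2)),
               [[30.0, 30.0]]])                 # one obvious planted outlier

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Distance of every point to the center of the cluster it was assigned to.
dists = np.linalg.norm(X - model.cluster_centers_[model.labels_], axis=1)
top_outliers = X[np.argsort(dists)[::-1][:5]]   # 5 points farthest from their center
```

As the other answer notes, this only finds points far from their assigned center; a true outlier that captures its own one-element cluster would have distance zero and go undetected.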
I'm testing some things in image retrieval and was thinking about how to sort bad pictures out of a dataset. For example, there are mostly pictures of houses, but in between there is a picture of people and some of cars. In the end I want to get only the houses.
At the moment my approach looks like this:
computing descriptors (SIFT) of all pictures
clustering all descriptors with k-means
creating histograms of the pictures by computing the Euclidean distance between the cluster centers and the descriptors of a picture
clustering the histograms again.
At this moment I have a first sorting (which isn't really good). Now my idea is to take all pictures which are clustered to a center with len(center) > 1 and cluster them again and again, so that the pictures which end up alone at a center get sorted out. Maybe it's enough to fit the result again to the same k-means without clustering again?
The result isn't satisfying, so maybe someone has a good idea.
For the clustering etc. I'm using the k-means of scikit-learn.
K-means is not very robust to noise; and your "bad pictures" probably can be considered as such. Furthermore, k-means doesn't work too well for sparse data; as the means will not be sparse.
You may want to try other, more modern, clustering algorithms that can handle this situation much better.
I don't have the solution to your problem but here is a sanity check to perform prior to the final clustering, to check that the kind of features you extracted is suitable for your problem:
extract the histogram features for all the pictures in your dataset
compute the pairwise distances of all the pictures in your dataset using the histogram features (you can use sklearn.metrics.pairwise_distances)
np.argsort the ravelled distance matrix to find the indices of the 20 closest pairs of distinct pictures according to your features (you have to filter out the zero-valued diagonal elements of the distance matrix), and do the same to extract the 20 most distant pairs of pictures based on your histogram features.
Visualize (for instance with plt.imshow) the pictures of top closest pairs and check that they are all pairs that you would expect to be very similar.
Visualize the pictures of the most distant pairs and check that they are all very dissimilar.
If one of those two checks fails, it means that the histogram of bag-of-SIFT-words features is not suitable for your task. Maybe you need to extract other kinds of features (e.g. HoG features), or reorganize the way you extract the clusters of SIFT descriptors, maybe using a pyramidal pooling structure to capture information about the global layout of the pictures at various scales.
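The closest/farthest-pairs part of the sanity check above can be sketched like this; the random matrix stands in for your real histogram features, and the shapes are arbitrary:

```python
# Sketch of the sanity check: pairwise distances on hypothetical histogram
# features, then the closest and farthest pairs of distinct pictures.
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
hists = rng.random((20, 128))                  # 20 pictures, 128-bin histograms

D = pairwise_distances(hists)                  # (20, 20) Euclidean distance matrix
D[np.diag_indices_from(D)] = np.nan            # mask the zero-valued diagonal

order = np.argsort(D, axis=None)               # ravelled indices; NaNs sort last
closest_pair = np.unravel_index(order[0], D.shape)
farthest_pair = np.unravel_index(np.nanargmax(D), D.shape)
# Take the first 20 entries of `order` (each pair appears twice, as (i, j)
# and (j, i)) and plt.imshow the corresponding pictures to run the check.
```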