So I am fairly new to machine learning and I am trying to create a Python script to analyse an energy dataset of a computer.
The script should ultimately determine the different states of the computer (like idle, standby, working, etc.) and how much energy those states use on average.
I was wondering whether this task could be done by some clustering method like k-means or DBSCAN.
I tinkered a bit with some clustering methods in scikit-learn, but the results so far were not as good as I expected.
I have researched clustering methods a lot, but I could never find a scenario similar to mine.
So my question is whether it's even worth the trouble and, if yes, which clustering method (or machine learning algorithm in general) would be best suited for that task? Or are there better ways to do it?
The energy dataset is just a single-column table, with one cell being one energy value per second over a few days.
You will not be able to apply supervised learning for this dataset as you do not have labels for your dataset (there is no known state given an energy value). This means that models like SVM, decision trees, etc. are not feasible given your dataset.
What you have is a timeseries with a single variable output. As I understand it, your goal is to determine whether or not there are different energy states, and what the average value is for those state(s).
I think it would be incredibly helpful to plot the timeseries using something like matplotlib or seaborn. After plotting the data, you can have a better feel for whether your hypothesis is reasonable and how you might further want to approach the problem. You may be able to solve your problem by just plotting the timeseries and observing that there are, say, four distinct energy states (e.g. idle, standby, working, etc.), avoiding any complex statistical techniques, machine learning, etc.
To answer your question, you can in principle use k-means on one-dimensional data. However, this is generally not recommended, as these techniques are usually used on multidimensional data.
I would recommend that you look into Jenks natural breaks optimization or kernel density estimation. Similar questions to yours can be found here and here, and should help you get started.
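To make the one-dimensional k-means idea concrete, here is a minimal sketch; the file name "energy.csv", the column name "energy" and the choice of four states are assumptions, not part of your data:

    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans

    # Load the single-column energy readings (file/column names are assumed)
    energy = pd.read_csv("energy.csv")["energy"].to_numpy()

    # scikit-learn expects 2-D input, so reshape the single column to (n_samples, 1)
    X = energy.reshape(-1, 1)
    kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

    # Cluster centres are the average energy of each putative state
    for label, center in enumerate(kmeans.cluster_centers_.ravel()):
        share = np.mean(kmeans.labels_ == label)
        print(f"state {label}: mean energy {center:.2f}, {share:.1%} of samples")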
Don't ignore time.
First of all, if your signal is noisy, temporal smoothing will likely help.
Secondly, you'll want to perform some feature extraction first, for example by using segmentation to cut your time series into separate states. You can then try to cluster these states, but I am not convinced that clustering is applicable here at all. You probably want a histogram or a density plot instead. It's one-dimensional data: you can visualize it and choose thresholds manually instead of hoping that some automated technique will work (because it may not...).
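As a rough illustration of that advice (smoothing, then a histogram/density plot and manual thresholds), something along these lines could work; the file name, column name, window size and threshold values are all placeholders:

    import pandas as pd
    import matplotlib.pyplot as plt

    s = pd.read_csv("energy.csv")["energy"]          # placeholder file/column name

    # Temporal smoothing: 60-second rolling median to suppress noise
    smoothed = s.rolling(window=60, center=True).median().dropna()

    # One-dimensional data is easy to visualize; modes in the density suggest states
    smoothed.plot(kind="hist", bins=200, density=True)
    smoothed.plot(kind="kde")
    plt.xlabel("power")
    plt.show()

    # Thresholds chosen manually from the plot (the numbers here are placeholders)
    states = pd.cut(smoothed, bins=[0, 20, 60, 120, float("inf")],
                    labels=["idle", "standby", "working", "peak"])
    print(states.value_counts())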
I ran the clustering algorithm DBSCAN on a historical trajectory dataset, which returned a set of clusters. Now, for each incoming new trajectory, I need to find the nearest cluster.
Suppose I have 10 clusters (c1 to c10) and a trajectory 't'. I want to find the nearest cluster to the trajectory 't'. I have seen people use kNN for this purpose, but as I am not fully clear on it, I am asking this question. What is the best/most efficient way to do this? I am using Python.
Clustering techniques such as DBSCAN generally work a bit differently from other machine learning models: once you fit a model to your historical trajectory dataset, you cannot call predict on a new trajectory the way a traditional classifier or regressor would. This gives you a few options (a sketch of B and C follows the list):
A) append your new trajectory to your historicals, run clustering again, see what label is assigned (this is very computationally expensive, probably a bad idea)
B) perform clustering on only historicals, use those generated labels to train a classifier, and perform inference on your new trajectory (this has high overhead, but with sufficient data can work pretty well)
C) use some measure of distance between your new trajectory and the mean of each cluster of historical trajectories (this is probably the easiest and fastest, but relies on your knowledge to implement it in a way that provides meaningful results)
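A minimal sketch of options B and C, assuming each trajectory has already been reduced to a fixed-length feature vector (the synthetic data below just stands in for those vectors):

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_blobs
    from sklearn.neighbors import KNeighborsClassifier

    # Stand-in for historical trajectories, each reduced to a feature vector
    X_hist, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=0)
    labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X_hist)

    x_new = X_hist[:1] + 0.1                 # pretend this is the new trajectory 't'

    # Option B: train a classifier on the DBSCAN labels (ignoring noise, label -1)
    mask = labels != -1
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_hist[mask], labels[mask])
    print("kNN-assigned cluster:", knn.predict(x_new)[0])

    # Option C: distance from the new trajectory to each cluster's mean feature vector
    clusters = np.unique(labels[mask])
    centroids = np.vstack([X_hist[labels == c].mean(axis=0) for c in clusters])
    nearest = clusters[np.argmin(np.linalg.norm(centroids - x_new, axis=1))]
    print("nearest-centroid cluster:", nearest)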
I have a project I'm working on and am running into an issue. Essentially, I have these points scattered across an x/y plot. I have one test point, where I get the target data (y) for the classification (a number from 1 to 6). I have lots of points where I have depth-indexed data with some features. The issue with these points is that I don't get a lot of data per point (maybe 100 observations).
I'm using the point closest to the test point to fit the model, then trying to generalize that to the other points that are farther apart. It's not giving me great results.
I understand there's not a lot of data to fit to, so I'm trying to improve the model by adding a set of 'k' points close to the test point.
These points all share the same columns, so I've tried to stack them vertically, but then my indexes don't match the target variable y.
I've tried to concat them at the end using a suffix denoting the specific point id, but then I get an error about the number of input features (for one point) when I try predicting again with the model using the combined features.
Essentially what I'm trying to do is the following :
model.fit([X_1,X_2,X_3,X_4],y)
model.predict(X_5)
Where:
All features are numeric (floats)
X_1.columns == X_i.columns for every i
Each X matrix is about 100 rows long with a continuous index [0:100].
I only have one test point (with 100 observations) for each group of points, so it's imperative I use as much data close to the test point as possible.
Is there another model or technique I can use for this? I've done a bit more research into NN models (which I'm not familiar with, so I would prefer to avoid them) and found that Keras can take multiple inputs to fit using its functional API, but can I predict with only one input after it has been fitted to multiple?
Keras Sequential model with multiple inputs
Could you give more information about the features / classes, and the model you're using? It would make things easier to understand.
However, I can give two pointers based on what you've said so far.
To have a better measurement of how well your model is generalizing, you should have more than one test point. See https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets
Sounds like you're using a k-Nearest Neighbors approach. If you aren't already, using the sklearn implementation will save a lot of time, and you can easily experiment with different hyperparameters: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
Other techniques: I like to start off with XGBoost or Random Forest, as those methods require little tuning and are reasonably robust. However, there is no magic bullet cure for modeling on a small dataset. The best thing to do would be to collect more data, or if that's impossible, you need to drill down and really understand your data (identify outliers, plot histograms / KDE, etc.).
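On the stacking issue in the question, one way that often works is to stack the neighbouring points row-wise, so the feature count never changes, and repeat the labels per block. This is only a sketch; the data below are hypothetical stand-ins for X_1..X_5 and y, and repeating y once per block is an assumption about how the labels line up with the observations:

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)

    def make_block():
        # one 'point': ~100 observations with the same feature columns
        return pd.DataFrame(rng.normal(size=(100, 5)), columns=[f"f{i}" for i in range(5)])

    X_1, X_2, X_3, X_4, X_5 = (make_block() for _ in range(5))
    y = rng.integers(1, 7, size=100)                  # classes 1-6 for the test point

    blocks = [X_1, X_2, X_3, X_4]
    X_train = pd.concat(blocks, ignore_index=True)    # stack rows, reset the clashing index
    y_train = np.tile(y, len(blocks))                 # one label per stacked row

    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
    pred = model.predict(X_5)                         # a single point still predicts fine

Swapping RandomForestClassifier for the KNeighborsClassifier mentioned above is a one-line change.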
Edit: This question was written with little knowledge of clustering techniques and, in hindsight, does not meet the standards of Stack Overflow. However, SO won't let me delete it, saying others have invested time and energy in it (a valid point), and if I proceed to delete it I may not be able to ask questions for a while. So I am updating this question to make it relevant in a way that others can learn from it. It still doesn't strictly comply with SO guidelines, as I myself would flag it as too broad, but in its current state it is of no value, so adding a little value to it is worth the downvotes.
Updated topic
The issue was to select the optimal number of clusters in a clustering algorithm that grouped various shapes, which were the output of contour detection on an image; a deviation in cluster properties was then to be marked as noise or an anomaly. The main point that raised the question at the time was that every dataset was different, the shapes obtained in them were different, and the number of shapes also varied from dataset to dataset. The proper way to do this is DBSCAN (Density-Based Spatial Clustering of Applications with Noise), an implementation of which can be found in scikit-learn; I was unaware of it at the time. It works, the product is now in testing, and I just wanted to come back and correct this old mistake.
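For anyone landing here later, a minimal sketch of that DBSCAN route, assuming each detected contour has already been reduced to a feature vector (the features and parameters below are purely illustrative):

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import StandardScaler

    # made-up feature vectors for detected contours, e.g. [area, aspect ratio]
    features = np.array([[120.0, 1.10], [118.0, 1.00], [450.0, 3.20],
                         [455.0, 3.10], [119.0, 1.05], [900.0, 0.40]])

    X = StandardScaler().fit_transform(features)
    labels = DBSCAN(eps=0.8, min_samples=2).fit_predict(X)

    # DBSCAN chooses the number of clusters itself and marks outliers as -1 ("noise")
    print(labels)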
Old Question
Old title: Dynamic selection of k in k-means clustering
I have to generate a k-means clustering model in which the number of classes is not known in advance. Is there a way to automatically determine the value of k based on the Euclidean distances within the clusters?
How I want it to work: start with a value of k, perform clustering, see if it satisfies a threshold criterion, and increase or decrease k accordingly. The problem is framework-independent, and if you have an idea or implementation in a language other than Python, please share that as well.
I found this while researching the problem https://www.researchgate.net/publication/267752474_Dynamic_Clustering_of_Data_with_Modified_K-Means_Algorithm.
I couldn't find an implementation of it.
I am looking for similar ideas to select the best and implement it myself, or an implementation that can be ported to my code.
Edit
The ideas I am considering right now are:
The elbow method
X-means clustering
You can use the elbow method. What this method basically does is run the clustering with various values of k (number of clusters) and calculate the distance of each point from its cluster centre. After a certain number of clusters there won't be any major improvement any more, and that value is the one you can take for k (the number of clusters).
You can refer to this link for further reading.
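A minimal elbow-method sketch along those lines, using synthetic data purely for illustration:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=4, random_state=0)   # synthetic data

    ks = range(1, 11)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

    plt.plot(list(ks), inertias, marker="o")
    plt.xlabel("k")
    plt.ylabel("within-cluster SSE (inertia)")
    plt.show()   # the 'elbow' of this curve suggests the value of k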
You can iterate over the values of k and check the cluster validity using the silhouette score.
You can iterate through k values over any range. Either check the silhouette score for each k, or calculate the difference between the SSE values of consecutive k; the point where the difference is highest, after roughly 0.4 × the number of k values tried, will be the elbow point.
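A small silhouette-based sketch of the same idea, again on synthetic data:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=500, centers=4, random_state=0)   # synthetic data

    # silhouette is undefined for k=1, so start at 2
    scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
              for k in range(2, 11)}

    best_k = max(scores, key=scores.get)
    print(best_k, scores[best_k])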
I have a series of line data (2-3 connected points per line).
What is the best machine learning algorithm that I can use to be able to classify lines to their location similarities? (image below)
Preferably python libraries such as SciKit-Learn.
Edit:
I have tried DBSCAN, but the problem I faced was that if two lines intersect each other, DBSCAN sometimes assigns them to one group even though they point in completely different directions.
Here is a solution I found so far:
GeoPath Clustering Algorithm
The idea here is to cluster geo paths that travel very similar to each other into groups.
Steps:
1- Cluster lines based on slope
2- Within each cluster from step 1, find the centroid of each line and, using the k-means algorithm, cluster them into smaller groups
3- Within each group from step 2, calculate the length of each line and group lines within a defined length threshold
The result will be small groups of lines that have a similar slope, are close to each other, and have a similar travel distance.
Screenshot of the visualization: yellow lines are all the lines; red lines are a cluster of paths that travel together.
I'll throw in an answer since I think the current one is incomplete... and I also think the comment of "simple heuristic" is premature. I think that if you cluster on the raw points, you'll get a different result than what your diagram depicts, as the clusters will be near the end-points and you wouldn't get your nice ellipses.
So, if your data really does behave similarly to how you display it, I would take a stab at turning each set of 2-3 points into a longer list of points that basically traces out the lines (you will need to experiment with how dense to make them).
Then run HDBSCAN on the result to get your clusters; see this video ( https://www.youtube.com/watch?v=AgPQ76RIi6A ). I believe "pip install hdbscan" installs it.
Now, when testing a new sample, first decompose it into many (N) points and fit them with your hdbscan model. I reckon that if you take a majority-voting approach over your N points, you'll get the best overall cluster to which the "line" belongs.
So, while I sort of agree with the "simple heuristic" comment, it's not so simple if you want the whole thing automated. And once you watch the video you may be convinced that HDBSCAN, because of its density-based algorithm, will suit this problem (if you decide to create many points from each sample).
I'll wrap up by saying that I'm sure there are line-intersection models that have done this before... and that there do exist heuristics and rules that can do the job. Likely, they're computationally more economical too. My answer is just something organic using sklearn as you requested... and I haven't even tested it! It's just how I would proceed if I were in your shoes.
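A rough, untested sketch of that approach, assuming made-up line endpoints and that the hdbscan package is installed; the densification step and the parameters are placeholders:

    import numpy as np
    import hdbscan   # pip install hdbscan

    def densify(line, n=50):
        # interpolate n points along each segment of a polyline given as (x, y) vertices
        pts = []
        for a, b in zip(line[:-1], line[1:]):
            t = np.linspace(0.0, 1.0, n)[:, None]
            pts.append(a + t * (b - a))
        return np.vstack(pts)

    # made-up example lines (2-3 vertices each)
    lines = [np.array([[0, 0], [1, 1]]), np.array([[0.1, 0], [1.1, 1]]),
             np.array([[5, 0], [5, 2], [5, 4]])]
    points = np.vstack([densify(l) for l in lines])

    clusterer = hdbscan.HDBSCAN(min_cluster_size=20, prediction_data=True).fit(points)

    # classify a new line by majority vote over its densified points
    new_line = densify(np.array([[0.05, 0], [1.05, 1]]))
    labels, _ = hdbscan.approximate_predict(clusterer, new_line)
    labels = labels[labels != -1]
    print(np.bincount(labels).argmax() if labels.size else -1)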
Edit:
I poked around and there are a couple of line-similarity measures you could possibly try: the Fréchet and Hausdorff distance measures.
Frechet: http://arxiv.org/pdf/1307.6628.pdf
Hausdorff: distance matrix of curves in python for a python example.
If you generate all pair-wise similarities and then group them according to similarity and/or into N bins, you can then call those bins your "clusters" (not kmeans clusters though!). For each new line, generate all similarities and see which bin it belongs to. I revise my original comment of possibly being computationally less intensive...you're lucky your lines only have 2 or 3 points!
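A rough sketch of that pairwise-similarity route using the symmetric Hausdorff distance (scipy's directed_hausdorff) and hierarchical grouping; the lines and the threshold are placeholders:

    import numpy as np
    from scipy.spatial.distance import directed_hausdorff, squareform
    from scipy.cluster.hierarchy import linkage, fcluster

    # made-up lines, each an array of (x, y) vertices
    lines = [np.array([[0, 0], [1, 1]]), np.array([[0.1, 0.1], [1.1, 1.1]]),
             np.array([[5, 0], [6, 0], [7, 0.2]])]

    n = len(lines)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # symmetric Hausdorff distance = max of the two directed distances
            d = max(directed_hausdorff(lines[i], lines[j])[0],
                    directed_hausdorff(lines[j], lines[i])[0])
            D[i, j] = D[j, i] = d

    # average-linkage hierarchical grouping on the precomputed distance matrix
    Z = linkage(squareform(D), method="average")
    groups = fcluster(Z, t=1.0, criterion="distance")   # threshold t is a placeholder
    print(groups)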
The problem you're trying to solve is called clustering. For an overview of clustering algorithms in sklearn, see http://scikit-learn.org/stable/modules/clustering.html#clustering.
Edit 2: KMeans was what sprung to mind when I first saw your post, but based on feedback from the comments it looks like it's not a good fit. You may want to try sklearn's DBSCAN instead.
A potential transformation or extra feature you could add would be to fit a straight line to each set of points, and then use the (slope, intercept) pair. You may also want to use the centroid of each line.
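A minimal sketch of that (slope, intercept, centroid) feature idea, with made-up input lines and parameters:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import StandardScaler

    # made-up lines, each an array of (x, y) vertices
    lines = [np.array([[0, 0], [1, 1]]), np.array([[0.1, 0], [1.1, 1.1]]),
             np.array([[0, 5], [2, 5], [4, 5.2]])]

    def line_features(pts):
        slope, intercept = np.polyfit(pts[:, 0], pts[:, 1], 1)   # least-squares straight line
        cx, cy = pts.mean(axis=0)                                # centroid of the points
        return [slope, intercept, cx, cy]

    X = StandardScaler().fit_transform([line_features(l) for l in lines])
    labels = DBSCAN(eps=1.0, min_samples=1).fit_predict(X)
    print(labels)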
I am testing out a few clustering algorithms on a dataset of text documents (with word frequencies as features). Running some of the methods of scikit-learn's clustering module one after the other, below is how long they take on ~50,000 files with 26 features per file. There are big differences in how long each takes to converge, and they get more extreme the more data I put in; some of them (e.g. MeanShift) just stop working after the dataset grows to a certain size.
(Times given below are totals from the start of the script, i.e. KMeans took 0.004 minutes, Meanshift (2.56 - 0.004) minutes, etc. )
shape of input: (4957, 26)
KMeans: 0.00491824944814
MeanShift: 2.56759268443
AffinityPropagation: 4.04678163528
SpectralClustering: 4.1573699673
DBSCAN: 4.16347868443
Gaussian: 4.16394021908
AgglomerativeClustering: 5.52318491936
Birch: 5.52657626867
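For reference, here is a minimal sketch of how such cumulative timings could be collected; this is an assumption about the benchmarking script, not the original code, and uses random stand-in data:

    import time
    import numpy as np
    from sklearn.cluster import (KMeans, MeanShift, DBSCAN,
                                 AgglomerativeClustering, Birch)

    X = np.random.rand(2000, 26)   # random stand-in for the document-term matrix

    start = time.time()
    for name, algo in [("KMeans", KMeans(n_clusters=8, n_init=10)),
                       ("MeanShift", MeanShift()),
                       ("DBSCAN", DBSCAN()),
                       ("AgglomerativeClustering", AgglomerativeClustering()),
                       ("Birch", Birch())]:
        algo.fit(X)
        print(f"{name}: {(time.time() - start) / 60:.4f} minutes since start")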
I know that some clustering algorithms are inherently more computationally intensive (e.g. the chapter here outlines that k-means' demand is linear in the number of data points, while hierarchical models are O(m^2 log m)).
So I was wondering:
How can I determine how many data points each of these algorithms can handle, and are the number of input files and the number of input features equally relevant in this equation?
How much does the computational intensity depend on the clustering settings, e.g. the distance metric in k-means or the eps in DBSCAN?
Does clustering success influence computation time? Some algorithms such as DBSCAN finish very quickly, maybe because they don't find any clustering in the data; MeanShift does not find clusters either and still takes forever (I'm using the default settings here). Might that change drastically once they discover structure in the data?
How much is raw computing power a limiting factor for these kinds of algorithms? Will I be able to cluster ~300,000 files with ~30 features each on a regular desktop computer, or does it make sense to use a computer cluster for these kinds of things?
Any help is greatly appreciated! The tests were run on a Mac mini, 2.6 GHz, 8 GB. The data input is a NumPy array.
This is too broad a question.
In fact, most of these questions are unanswered.
For example k-means is not simply linear O(n), but because the number of iterations needed until convergence tends to grow with data set size, it's more expensive than that (if run until convergence).
Hierarchical clustering can be anywhere from O(n log n) to O(n^3) mostly depending on the way it is implemented and on the linkage. If I recall correctly, the sklearn implementation is the O(n^3) algorithm.
Some algorithms have parameters to stop early - before they are actually finished! For k-means, you should use tol=0 if you want to really run the algorithm to completion. Otherwise, it stops early if the relative improvement is less than this factor, which can be much too early. MiniBatchKMeans never converges: because it only looks at random parts of the data each time, it would just go on forever unless you choose a fixed number of iterations.
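A small sketch of that early-stopping point: with the default tol, k-means may stop before the assignment has fully converged, while tol=0 forces it to run until it actually finishes (or hits max_iter). Values below are illustrative:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=10000, centers=10, random_state=0)

    default_km = KMeans(n_clusters=10, n_init=1, random_state=0).fit(X)   # tol=1e-4 by default
    strict_km = KMeans(n_clusters=10, n_init=1, tol=0, max_iter=1000,
                       random_state=0).fit(X)

    print("iterations with default tol:", default_km.n_iter_)
    print("iterations with tol=0:      ", strict_km.n_iter_)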
Never try to draw conclusions from small data sets. You need to go to your limits, i.e. what is the largest data set you can still process within, say, 1, 2, 4 and 12 hours with each algorithm?
To get meaningful results, your runtimes should be hours, except if the algorithms simply run out of memory before that - then you might be interested in predicting how far you could scale until you run out of memory: assuming you had 1 TB of RAM, how large would the data be that you could still process?
The problem is that you can't simply use the same parameters for data sets of different sizes. If you do not choose the parameters well (e.g. DBSCAN puts everything into noise, or everything into one cluster), then you cannot draw conclusions from that either.
And then, there might simply be an implementation error. DBSCAN in sklearn has become a lot lot lot faster recently. It's still the same algorithm. So most results done 2 years ago were simply wrong, because the implementation of DBSCAN in sklearn was bad... now it is much better, but is it optimal? Probably not. And similar problems might be in any of these algorithms!
Thus, doing a good benchmark of clustering is really difficult. In fact, I have not seen a good benchmark in a looong time.