I am testing out a few clustering algorithms on a dataset of text documents (with word frequencies as features). Running some of the methods of scikit-learn's clustering module one after the other, below is how long they take on ~5,000 files with 26 features per file. There are big differences in how long each takes to converge, and these get more extreme the more data I put in; some of them (e.g. MeanShift) just stop working once the dataset grows beyond a certain size.
(Times given below are cumulative totals from the start of the script, i.e. KMeans took 0.004 minutes, MeanShift (2.56 - 0.004) minutes, etc.)
shape of input: (4957, 26)
KMeans: 0.00491824944814
MeanShift: 2.56759268443
AffinityPropagation: 4.04678163528
SpectralClustering: 4.1573699673
DBSCAN: 4.16347868443
Gaussian: 4.16394021908
AgglomerativeClustering: 5.52318491936
Birch: 5.52657626867
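For reference, here is a minimal sketch of the kind of timing loop that could produce numbers like these (not my actual script; the random stand-in data and the fixed cluster counts are placeholders):

    import time
    import numpy as np
    from sklearn import cluster, mixture

    X = np.random.rand(4957, 26)  # stand-in for the real word-frequency matrix

    algorithms = [
        ("KMeans", cluster.KMeans(n_clusters=8)),
        ("MeanShift", cluster.MeanShift()),
        ("AffinityPropagation", cluster.AffinityPropagation()),
        ("SpectralClustering", cluster.SpectralClustering(n_clusters=8)),
        ("DBSCAN", cluster.DBSCAN()),
        ("Gaussian", mixture.GaussianMixture(n_components=8)),
        ("AgglomerativeClustering", cluster.AgglomerativeClustering(n_clusters=8)),
        ("Birch", cluster.Birch(n_clusters=8)),
    ]

    start = time.time()
    for name, algorithm in algorithms:
        algorithm.fit(X)
        print(name + ":", (time.time() - start) / 60.0)  # cumulative minutes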
I know that some clustering algorithms are inherently more computationally intensive (e.g. the chapter here outlines that k-means' demand is linear in the number of data points, while hierarchical models are O(m^2 log m)).
So I was wondering:

1. How can I determine how many data points each of these algorithms can handle, and are the number of input files and the number of input features equally relevant in this equation?

2. How much does the computational intensity depend on the clustering settings -- e.g. the distance metric in k-means or the eps in DBSCAN?

3. Does clustering success influence computation time? Some algorithms such as DBSCAN finish very quickly, maybe because they don't find any clustering in the data; MeanShift does not find clusters either and still takes forever (I'm using the default settings here). Might that change drastically once they discover structure in the data?

4. How much is raw computing power a limiting factor for these kinds of algorithms? Will I be able to cluster ~300,000 files with ~30 features each on a regular desktop computer, or does it make sense to use a computing cluster for this kind of thing?
Any help is greatly appreciated! The tests were run on a Mac mini, 2.6 GHz, 8 GB. The data input is a numpy array.
This is too broad a question.
In fact, most of these questions are unanswered.
For example, k-means is not simply linear, O(n): because the number of iterations needed until convergence tends to grow with data set size, it's more expensive than that (if run until convergence).
Hierarchical clustering can be anywhere from O(n log n) to O(n^3) mostly depending on the way it is implemented and on the linkage. If I recall correctly, the sklearn implementation is the O(n^3) algorithm.
Some algorithms have parameters that make them stop early - before they are actually finished! For k-means, you should use tol=0 if you want to really run the algorithm to completion. Otherwise, it stops early once the relative improvement is less than this factor - which can be much too early. MiniBatchKMeans never converges at all: because it only looks at a random part of the data each time, it would just go on forever unless you choose a fixed number of iterations.
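As an illustration (a sketch only; the stand-in data, the value of k and the iteration caps are arbitrary):

    import numpy as np
    from sklearn.cluster import KMeans, MiniBatchKMeans

    X = np.random.rand(10000, 26)  # stand-in data
    k = 8

    # tol=0 disables the early-stopping tolerance, so k-means runs until the
    # centroids actually stop changing (still bounded by max_iter)
    km = KMeans(n_clusters=k, tol=0.0, max_iter=1000).fit(X)
    print(km.n_iter_)  # iterations actually needed - tends to grow with data size

    # MiniBatchKMeans never converges in this sense; it only stops because of
    # its iteration budget
    mbk = MiniBatchKMeans(n_clusters=k, max_iter=100, batch_size=1024).fit(X)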
Never try to draw conclusions from small data sets. You need to go to your limits, i.e. what is the largest data set you can still process within, say, 1, 2, 4, and 12 hours with each algorithm?
To get meaningful results, your runtimes should be hours, except if the algorithms simply run out of memory before that - then you might be interested in predicting how far you could scale until you run out of memory: assuming you had 1 TB of RAM, how large a data set could you still process?
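A sketch of the kind of scaling experiment I mean (the one-hour budget, the doubling schedule and the synthetic make_blobs data are placeholders):

    import time
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    budget_seconds = 3600  # placeholder time budget
    n = 10000
    while True:
        X, _ = make_blobs(n_samples=n, n_features=26, centers=10, random_state=0)
        t0 = time.time()
        KMeans(n_clusters=10, tol=0.0).fit(X)
        elapsed = time.time() - t0
        print(n, elapsed)
        if elapsed > budget_seconds:
            break
        n *= 2  # keep doubling until the time budget is exceeded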
The problem is, you can't simply use the same parameters for data sets of different sizes. If you do not choose the parameters well (e.g. DBSCAN puts everything into noise, or everything into one cluster), then you cannot draw conclusions from that either.
And then, there might simply be an implementation error. DBSCAN in sklearn has become a lot faster recently. It's still the same algorithm, so most benchmark results from two years ago were simply wrong, because the implementation of DBSCAN in sklearn was bad. Now it is much better - but is it optimal? Probably not. And similar problems might exist in any of these algorithms!
Thus, doing a good benchmark of clustering is really difficult. In fact, I have not seen a good benchmark in a looong time.
I currently have a dataset of textual embeddings (768 dimensions). The current number of records is ~1 million. I am looking to detect related embeddings through a community detection algorithm. For small data sets, I have been able to use this one:
https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/clustering/fast_clustering.py
It works great, but it doesn't really scale once the data set grows larger than memory.
The key here is that I am able to specify a threshold for community matches. I have been able to find clustering algorithms that scale to larger than memory, but I always have to specify a fixed number of clusters ahead of time. I need the system to detect the number of clusters for me.
I'm certain there is a class of algorithms - and hopefully a python library - that can handle this situation, but I have been unable to locate it. Does anyone know of an algorithm or a solution I could use?
That seems small enough that you could just rent a bigger computer.
Nevertheless, to answer the question: typically the play is to cluster the data into a few chunks (overlapping or not) that fit in memory, and then apply a higher-quality in-memory clustering algorithm to each chunk. One typical strategy for cosine similarity is to cluster by SimHashes, but there's a whole literature out there; if you already have a scalable clustering algorithm you like, you can use that.
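To make the SimHash idea concrete, here is a rough sketch (the 16-bit signature length, the stand-in data and the simple bucket handling are illustrative choices, not a tuned recipe): random hyperplanes turn each embedding into a short bit signature, and vectors sharing a signature land in the same chunk, which you can then feed to your in-memory community detection.

    import numpy as np
    from collections import defaultdict

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(100000, 768)).astype(np.float32)  # stand-in

    n_bits = 16  # more bits -> more, smaller buckets
    planes = rng.normal(size=(768, n_bits)).astype(np.float32)

    # the sign of the projection onto each random hyperplane gives one bit;
    # vectors with high cosine similarity tend to agree on most signs
    bits = (embeddings @ planes) > 0
    signatures = np.packbits(bits, axis=1)

    buckets = defaultdict(list)
    for idx, sig in enumerate(signatures):
        buckets[sig.tobytes()].append(idx)

    # each bucket is now a memory-sized chunk to run the in-memory clustering
    # (e.g. the fast_clustering.py script linked above) on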
I am trying to cluster a data set with about 1,100,000 observations, each with three values.
The code is pretty simple in R:
df11.dist <- dist(df11cl), where df11cl is a data frame with three columns and 1,100,000 rows, and all the values in this data frame are standardized.
The error I get is:
Error: cannot allocate vector of size 4439.0 Gb
Recommendations on similar problems include increasing RAM or chunking the data. I already have 64 GB of RAM and my virtual memory is 171 GB, so I don't think increasing RAM is a feasible solution. Also, as far as I know, chunked data in hierarchical cluster analysis yields different results. So it seems using a sample of the data is out of the question.
I have also found this solution, but the answers actually alter the question. They technically advise k-means. K-means could work if one knows the number of clusters beforehand, but I do not know the number of clusters. That said, I ran k-means using different numbers of clusters, but now I don't know how to justify selecting one over another. Is there any test that can help?
Can you recommend anything in either R or python?
For trivial reasons, the function dist needs quadratic memory.
So if you have 1 million (10^6) points, a quadratic matrix needs 10^12 entries. With double precision, you need 8 bytes for each entry. With symmetry, you only need to store half of the entries - but that is still 4*10^12 bytes, i.e. 4 terabytes just to store this matrix. Even if you stored this on SSD or upgraded your system to 4 TB of RAM, computing all these distances would take an insane amount of time.
And 1 million is still pretty small, isn't it?
Using dist on big data is impossible. End of story.
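The arithmetic, as a quick back-of-the-envelope check (in Python, just for consistency with the rest of this thread):

    n = 1100000
    entries = n * (n - 1) // 2      # symmetric matrix: store only one triangle
    bytes_needed = entries * 8      # double precision
    print(bytes_needed / 2**30)     # ~4500 GiB, the same order as the R error above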
For larger data sets, you'll need to:
- use methods such as k-means that do not use pairwise distances,
- use methods such as DBSCAN that do not need a distance matrix, and where in some cases an index can reduce the effort to O(n log n), or
- subsample your data to make it smaller.
In particular, that last option is a good idea if you don't have a working solution yet: there is no use in struggling with the scalability of a method that does not work.
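A minimal sketch of those options in scikit-learn (the stand-in data, parameter values and subsample size are placeholders, not recommendations):

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans, DBSCAN

    X = np.random.rand(100000, 3)  # stand-in for the standardized 3-column data

    # 1) k-means style methods: no pairwise distances, linear memory
    labels_km = MiniBatchKMeans(n_clusters=10, batch_size=10000).fit_predict(X)

    # 2) DBSCAN: no distance matrix; with an index (the default tree-based
    #    neighbor search) low-dimensional data is handled in roughly O(n log n)
    labels_db = DBSCAN(eps=0.05, min_samples=20).fit_predict(X)

    # 3) subsample first, to get a working pipeline before worrying about scale
    idx = np.random.choice(len(X), size=20000, replace=False)
    labels_sub = DBSCAN(eps=0.05, min_samples=20).fit_predict(X[idx])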
So I am fairly new to machine learning, and I am trying to create a Python script to analyse an energy dataset of a computer.
The script should in the end determine the different states of the computer (like idle, standby, working, etc...) and how much energy those states are using on average.
And I was wondering if this task could be done by some clustering method like k-means or DBSCAN.
I tinkered a bit with some clustering methods in scikit-learn, but the results so far were not as good as I expected.
I researched a lot about clustering methods but I could never find a scenario similar to mine.
So my question is whether it's even worth the trouble and, if yes, which clustering method (or machine learning algorithm in general) would be best suited for that task? Or are there better ways to do it?
The energy dataset is just a single-column table, with one cell being one energy value per second, recorded over a few days.
You will not be able to apply supervised learning to this dataset, as you do not have labels (there is no known state for a given energy value). This means that models like SVMs, decision trees, etc. are not feasible here.
What you have is a time series with a single output variable. As I understand it, your goal is to determine whether or not there are different energy states, and what the average value is for those state(s).
I think it would be incredibly helpful to plot the timeseries using something like matplotlib or seaborn. After plotting the data, you can have a better feel for whether your hypothesis is reasonable and how you might further want to approach the problem. You may be able to solve your problem by just plotting the timeseries and observing that there are, say, four distinct energy states (e.g. idle, standby, working, etc.), avoiding any complex statistical techniques, machine learning, etc.
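Something along these lines would be a reasonable first look (a sketch only; the file name and loading step are placeholders for however you read in your single-column data):

    import numpy as np
    import matplotlib.pyplot as plt

    energy = np.loadtxt("energy.csv")  # placeholder: one value per second

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    ax1.plot(energy)                   # the raw time series
    ax1.set_xlabel("seconds")
    ax1.set_ylabel("energy")
    ax2.hist(energy, bins=100)         # distinct states should show up as modes
    ax2.set_xlabel("energy")
    plt.show()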
To answer your question, you can in principle use k-means for one dimensional data. However, this is probably not recommended as these techniques are usually used on multidimensional data.
I would recommend that you look into Jenks natural breaks optimization or kernel density estimation. Similar questions to yours can be found here and here, and should help you get started.
Don't ignore time.
First of all, if your signal is noisy, temporal smoothing will likely help.
Secondly, you'll want to perform some feature extraction first, for example by using segmentation to cut your time series into separate states. You can then try to cluster these states, but I am not convinced that clustering is applicable here at all. You probably will want to use a histogram or a density plot instead. It's one-dimensional data - you can visualize this and choose thresholds manually, instead of hoping that some automated technique may work (because it may not...)
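As a sketch of those two steps (the window length and the thresholds are made-up placeholders you would read off your own plot):

    import numpy as np
    import pandas as pd

    energy = pd.Series(np.loadtxt("energy.csv"))  # placeholder input, 1 Hz values

    # temporal smoothing: one-minute rolling mean to suppress noise
    smoothed = energy.rolling(window=60, center=True).mean()

    # manual thresholds picked from the histogram / density plot
    states = pd.cut(smoothed, bins=[-np.inf, 20, 50, np.inf],
                    labels=["idle", "standby", "working"])

    # average energy per state
    print(smoothed.groupby(states).mean())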
I am attempting to use MiniBatchKMeans to stream NLP data in and cluster it, but have no way of determining how many clusters I need. What I would like to do is periodically take the silhouette score and if it drops below a certain threshold, increase the number of centroids. But as far as I can tell, n_clusters is set when you initialize the clusterer and can't be changed without restarting. Am I wrong here? Is there another way to approach this problem that would avoid this issue?
It is not a good idea to do this during optimization, because it changes the optimization procedure substantially. It will essentially reset the whole optimization. There are strategies such as bisecting k-means that try to learn the value of k during clustering, but they are a bit more tricky than increasing k by one - they decide upon one particular cluster to split, and try to choose good initial centroids for this cluster to keep things somewhat stable.
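If you want to try that route, recent scikit-learn versions (1.1 and later) ship a BisectingKMeans estimator; a minimal sketch with stand-in data:

    import numpy as np
    from sklearn.cluster import BisectingKMeans

    X = np.random.rand(50000, 100)  # stand-in for the vectorized NLP data

    # repeatedly splits one chosen cluster until n_clusters is reached
    labels = BisectingKMeans(n_clusters=20, random_state=0).fit_predict(X)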
Furthermore, increasing k will not necessarily improve Silhouette. It will trivially improve SSQ, so you cannot use SSQ as a heuristic for choosing k, either.
Last but not least, computing the Silhouette is O(n^2). It is too expensive to run often. If you have a large enough amount of data to require MiniBatchKMeans (which really is only for massive data), then you clearly cannot afford to compute the Silhouette at all.
t-SNE can supposedly scale to millions of observations (see here), but I'm curious how that can be true, at least in the Sklearn implementation.
I'm trying it on a dataset with ~100k items, each with ~190 features. Now, I'm aware that I can do a first pass of dimensionality reduction with, e.g. PCA, but the problem seems more fundamental.
t-SNE computes and stores the full, dense similarity matrix calculated for the input observations (I've confirmed this by looking at the source). In my case, this is a 10-billion-element dense matrix, which by itself requires 80+ GB of memory. Extrapolate this to just one million observations, and you're looking at 8 terabytes of RAM just to store the distance matrix (let alone computation time...)
So, how can we possibly scale t-SNE to millions of datapoints in the sklearn implementation? Am I missing something? The sklearn docs at least imply that it's possible:
By default the gradient calculation algorithm uses Barnes-Hut approximation running in O(NlogN) time. method=’exact’ will run on the slower, but exact, algorithm in O(N^2) time. The exact algorithm should be used when nearest-neighbor errors need to be better than 3%. However, the exact method cannot scale to millions of examples.
That's my emphasis, but I would certainly read that as implying the Barnes-Hut method can scale to millions of examples. However, I'll reiterate that the code requires calculating the full distance matrix well before we even get to any of the actual t-SNE transformations (with or without Barnes-Hut).
So am I missing something? Is it possible to scale this up to millions of datapoints?
Barnes-Hut does NOT require you to compute and store the full, dense similarity matrix calculated for the input observations.
Also, take a look at the references mentioned in the documentation. In particular, this one. Quoting that page:
The technique can be implemented via Barnes-Hut approximations, allowing it to be applied on large real-world datasets. We applied it on data sets with up to 30 million examples.
That page also links to this talk about how the approximation works: Visualizing Data Using t-SNE.
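For what it's worth, a sketch of the usual route in scikit-learn (assuming a reasonably recent release, where the Barnes-Hut code path builds a k-nearest-neighbour graph rather than the dense matrix; the PCA pre-pass and the stand-in data are placeholders):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    X = np.random.rand(100000, 190)              # stand-in for the real features
    X50 = PCA(n_components=50).fit_transform(X)  # cheap first-pass reduction

    emb = TSNE(n_components=2, method="barnes_hut",
               perplexity=30).fit_transform(X50)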
I recommend using another algorithm called UMAP. It is proven to perform at least as well as t-SNE, and in most cases it performs better. Most importantly, it scales significantly better. Its approach to the problem is similar, so it generates similar results, but UMAP is a lot faster (look at the last graph here: https://umap-learn.readthedocs.io/en/latest/benchmarking.html). You can look at the original paper and the following link for details.
https://www.nature.com/articles/nbt.4314.pdf
https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668#:~:text=tSNE%20is%20Dead.&text=Despite%20tSNE%20made%20a%20dramatic,be%20fixed%20sooner%20or%20later.
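A minimal usage sketch, assuming the umap-learn package is installed (the parameters below are just the library defaults spelled out; X is stand-in data):

    import numpy as np
    import umap

    X = np.random.rand(100000, 190)  # stand-in data

    embedding = umap.UMAP(n_neighbors=15, min_dist=0.1,
                          n_components=2).fit_transform(X)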
OpenVisuMap (on GitHub) implements t-SNE without resorting to approximation. It uses the GPU to calculate the distance matrix on the fly. It still has O(N^2) computational complexity, but only O(N) memory complexity.