SVD using Scikit-Learn and Gensim with 6 million features - python

I am trying to classify paragraphs based on their sentiment. I have training data of 600 thousand documents. When I convert them to a TF-IDF vector space with words as the analyzer and an n-gram range of 1-2, there are almost 6 million features, so I have to do singular value decomposition (SVD) to reduce the number of features.
I have tried gensim's and sklearn's SVD features. Both work fine for reduction down to 100 features, but as soon as I try 200 features they throw a memory error.
Also, I have not used the entire corpus (600 thousand documents) as training data; I have taken only 50,000 documents. So essentially my training matrix is:
50,000 * 6 million, and I want to reduce it to 50,000 * (100 to 500).
Is there any other way I can implement this in Python, or do I have to use Spark MLlib's SVD (written only for Java and Scala)? If yes, how much faster will it be?
System specification: 32 GB RAM with a 4-core processor on Ubuntu 14.04

I don't really see why using Spark MLlib's SVD would improve performance or avoid memory errors. You simply exceed the size of your RAM. You have a few options to deal with that:
Reduce the dictionary size of your TF-IDF (for example by playing with scikit-learn's max_df and min_df parameters).
Use a hashing vectorizer instead of TF-IDF (see the sketch after this list).
Get more RAM (but at some point TF-IDF + SVD is not scalable).
Also, you should show your code sample; you might be doing something wrong in your Python code.
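As a rough sketch of that hashing + truncated SVD route (the corpus here is synthetic, and the n_features and n_components values are placeholders, not tuned recommendations):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.decomposition import TruncatedSVD

# Synthetic placeholder corpus; in practice this would be the 50,000 training documents.
documents = ["sample paragraph number %d with some repeated words" % i for i in range(500)]

# Hashing fixes the feature space at 2**18 columns instead of letting the
# 1-2 gram vocabulary grow to millions of features, and needs no fitting pass.
vectorizer = HashingVectorizer(ngram_range=(1, 2), n_features=2**18)
X = vectorizer.transform(documents)       # sparse matrix, n_docs x 2**18

# TruncatedSVD works directly on sparse input, unlike PCA.
svd = TruncatedSVD(n_components=100, random_state=42)
X_reduced = svd.fit_transform(X)          # dense matrix, n_docs x 100
print(X_reduced.shape)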

Related

Unexpected behavior with LinearDiscriminantAnalysis of scikit-learn

I am using LinearDiscriminantAnalysis of scikit-learn to perform class prediction on a dataset. The problem seems to be that the performance of LDA is not consistent. In general, when I increase the number of features that I use for training, I expect the performance to increase as well. The same goes for the number of samples used for training: the more samples, the better the performance of LDA.
However, in my case there seems to be a sweet spot where LDA performs poorly depending on the number of samples and features used for training. More precisely, LDA performs poorly when the number of features equals the number of samples. I don't think it has to do with my dataset. I am not sure exactly what the issue is, but I have extensive example code that can recreate these results.
Here is an image of the LDA performance results that I am talking about.
The dataset I use has shape 400 x 75 x 400 (trials x time x features). Here the trials represent the different samples. Each time, I shuffle the trial indices of the dataset, pick some trials for the train set and, similarly, others for the test set. Finally, I average across time (the second axis) and feed the resulting matrix of shape (trials x features) into the LDA to compute the score on the test set. The test set always contains 50 trials.
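For clarity, here is a minimal sketch of that procedure with random placeholder data and labels (the actual data and code are in the notebook linked below):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
data = rng.standard_normal((400, 75, 400))   # placeholder for the real trials x time x features array
labels = rng.integers(0, 2, size=400)        # placeholder class labels, one per trial

n_test = 50
order = rng.permutation(len(data))           # shuffle the trial indices
train_idx, test_idx = order[n_test:], order[:n_test]

# Average over the time axis so each trial becomes a single feature vector.
X_train = data[train_idx].mean(axis=1)
X_test = data[test_idx].mean(axis=1)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, labels[train_idx])
print(lda.score(X_test, labels[test_idx]))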
A detailed Jupyter notebook with comments and the data I use can be found here: https://github.com/esigalas/LDA-test. In my environment I use:
sklearn: 1.1.1,
numpy: 1.22.4.
I am not sure whether there is an issue with LDA itself (which would be worth opening an issue about on GitHub) or something wrong with how I handle the dataset, but this behavior of LDA looks wrong.
Any comment/help is welcome. Thanks in advance!

Fixed-RAM DBSCAN or another clustering algorithm without a predefined number of clusters?

I want to cluster 3.5M 300-dimensional word2vec vectors from my custom gensim model to determine whether I can use such clusters to find topic-related words. This is not the same as model.most_similar_..., as I hope to include quite distant, but still related, words.
The overall size of the model (after normalization of the vectors, i.e. model.init_sims(replace=True)) in memory is 4 GB:
import sys
import numpy as np

# model is the loaded gensim Word2Vec model
words = sorted(model.wv.vocab.keys())
vectors = np.array([model.wv[w] for w in words])
sys.getsizeof(vectors)  # -> 4456416112
I tried both scikit-learn's DBSCAN and some other implementations from GitHub, but they seem to consume more and more RAM during processing and crash with std::bad_alloc after some time. I have 32 GB of RAM and 130 GB of swap.
The metric is Euclidean; I convert my cosine threshold cos=0.48 to eps=sqrt(2-2*0.48), so all the optimizations should apply.
The problem is that I don't know the number of clusters and want to determine them by setting a threshold for closely related words (let it be cos < 0.48, or d_l2 < sqrt(2-2*0.48)). DBSCAN seems to work on small subsets, but I can't get the computation through on the full data.
Is there any algorithm or workaround in Python which can help with that?
EDIT: The distance matrix, at sizeof(float) = 4 bytes, would take 3.5M * 3.5M * 4 bytes ≈ 44.5 TB, so it is impossible to precompute it.
EDIT 2: I am currently trying ELKI, but I cannot get it to cluster the data properly even on a toy subset.
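For reference, here is a minimal sketch of the threshold conversion on a random stand-in for a small subsample (the real vectors come from the gensim model above):

import numpy as np
from sklearn.cluster import DBSCAN

# Random stand-in for a 10,000-word subsample of the 300-dimensional vectors.
vectors = np.random.rand(10000, 300)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)   # unit-normalize, like init_sims(replace=True)

# For unit-length vectors, squared Euclidean distance = 2 - 2 * cosine similarity,
# so the cosine threshold cos = 0.48 maps to this Euclidean eps.
eps = np.sqrt(2 - 2 * 0.48)

labels = DBSCAN(eps=eps, min_samples=5, metric="euclidean", n_jobs=-1).fit_predict(vectors)
print(len(set(labels)) - (1 if -1 in labels else 0), "clusters found")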

SciKit One-class SVM classifier training time increases exponentially with size of training data

I am using the Python SciKit OneClass SVM classifier to detect outliers in lines of text. The text is converted to numerical features first using bag of words and TF-IDF.
When I train (fit) the classifier running on my computer, the time seems to increase exponentially with the number of items in the training set:
Number of items in the training data and training time taken:
10K: 1 sec, 15K: 2 sec, 20K: 8 sec, 25K: 12 sec, 30K: 16 sec, 45K: 44 sec.
Is there anything I can do to reduce the training time, and to stop it becoming too long when the training data size increases to a couple of hundred thousand items?
Well, scikit-learn's SVM is a high-level implementation, so there is only so much you can do. In terms of speed, from their website: "SVMs do not directly provide probability estimates, these are calculated using an expensive five-fold cross-validation."
You can increase the kernel cache size parameter based on your available RAM, but this does not help much.
You can try changing your kernel, though your model might then be incorrect.
Here is some advice from http://scikit-learn.org/stable/modules/svm.html#tips-on-practical-use: scale your data.
Otherwise, don't use scikit-learn and implement it yourself, for example with neural nets.
Hope I'm not too late. OCSVM, and SVMs in general, are resource-hungry, and the relationship between training-set size and training time is quadratic (the numbers you show follow this). If you can, see whether Isolation Forest or Local Outlier Factor work for you. If you're considering applying this to a larger dataset, I would suggest building a custom anomaly detection model that closely mirrors these off-the-shelf solutions; that way you should be able to work either in parallel or with threads.
For anyone coming here from Google, sklearn has implemented SGDOneClassSVM, which "has a linear complexity in the number of training samples". It should be faster for large datasets.
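A minimal sketch of that route on TF-IDF features (the corpus is synthetic, and the Nystroem kernel-approximation step and all parameter values are illustrative, not required):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDOneClassSVM
from sklearn.pipeline import make_pipeline

# Synthetic placeholder corpus; in practice, the lines of text to screen.
texts = ["log line number %d with ordinary content" % i for i in range(1000)]

# TF-IDF features, an optional Nystroem step so the linear SGD one-class SVM
# can approximate an RBF kernel, then the detector itself.
model = make_pipeline(
    TfidfVectorizer(),
    Nystroem(gamma=0.1, n_components=100, random_state=0),
    SGDOneClassSVM(nu=0.05, random_state=0),
)
model.fit(texts)
flags = model.predict(texts)   # +1 for inliers, -1 for flagged outliers
print((flags == -1).sum(), "outliers flagged")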

Python: facing memory issues in document clustering using sklearn

I am using sklearn's TfidfVectorizer for document clustering. I have 20 million texts for which I want to compute clusters. But calculating the TF-IDF matrix takes too much time and the system gets stuck.
Is there any technique to deal with this problem? Is there an alternative method for this in any Python module?
Well, a corpus of 20 million texts is very large, and without meticulous and comprehensive preprocessing, or good computing instances (i.e. a lot of memory and good CPUs), the TF-IDF calculation may take a lot of time.
What you can do:
Limit your text corpus to a few hundred thousand samples (say 200,000 texts). Having many more texts might not introduce much more variance than a smaller (but reasonable) dataset.
Try to preprocess your texts as much as you can. A basic approach would be: tokenize your texts, remove stop words, apply stemming, and use n-grams carefully.
Once you have done all these steps, see how much you have reduced the size of your vocabulary. It should be much smaller than the original one.
If your dataset is not too big, these steps might help you compute the TF-IDF much faster (a minimal sketch of the vocabulary-limiting options follows below).
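A minimal sketch of those vocabulary-limiting options (the parameter values are illustrative, and the synthetic texts stand in for a subsample of your corpus):

from sklearn.feature_extraction.text import TfidfVectorizer

# Synthetic stand-in for a subsample of a few hundred thousand real texts.
texts = ["document %d talks mostly about topic%d today" % (i, i % 20) for i in range(10000)]

# Cap the vocabulary and drop terms that are too rare or too common, so the
# TF-IDF matrix (and the memory it needs) stays bounded.
vectorizer = TfidfVectorizer(
    stop_words="english",
    max_features=50000,   # hard cap on the vocabulary size
    min_df=5,             # ignore terms seen in fewer than 5 documents
    max_df=0.5,           # ignore terms seen in more than half of them
)
X = vectorizer.fit_transform(texts)
print(X.shape, "vocabulary size:", len(vectorizer.vocabulary_))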
Start small.
First cluster only 100,000 documents. Only once that works (because it probably won't) should you think about scaling up.
If you don't succeed in clustering the subset (and text clusters are usually pretty bad), then you won't fare well on the large set.

Using scikit-learn, how do I learn an SVM over a small data set?

With scikit-learn, I have built a support vector machine for a basic handwritten digit detection problem.
My total data set consists of 235 observations, each with 1025 features. I know that one of the advantages of using a support vector machine is in situations like this, where there is a modest number of observations with a large number of features.
After my SVM is created, I look at my confusion matrix (below)...
Confusion Matrix:
[[ 6  0]
 [ 0 30]]
...and realize that holding out 15% of my data for testing (i.e., 36 observations) is not enough.
My problem is this: How can I work around this small data issue, using cross validation?
This is exactly what cross-validation (and its generalizations, such as the .632 bootstrap estimator, Err^0.632) is for. A hold-out set is reasonable only with huge quantities of data.
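A minimal sketch, with random placeholder data standing in for the 235 x 1025 feature matrix and labels:

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((235, 1025))   # placeholder for the real 235 x 1025 feature matrix
y = rng.integers(0, 2, size=235)       # placeholder binary labels

# Every observation is used for testing exactly once across the 5 folds,
# instead of relying on a single 36-observation hold-out set.
clf = SVC(kernel="linear", C=1.0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print("mean accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))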
