Is scikit-learn suitable for big data tasks?

I'm working on a TREC task involving machine learning techniques, where the dataset consists of more than 5 terabytes of web documents, from which I plan to extract bag-of-words vectors. scikit-learn has a nice set of functionalities that seem to fit my needs, but I don't know whether it is going to scale well to handle big data. For example, is HashingVectorizer able to handle 5 terabytes of documents, and is it feasible to parallelize it? Moreover, what are some alternatives out there for large-scale machine learning tasks?

HashingVectorizer will work if you iteratively chunk your data into batches of, for instance, 10k or 100k documents that fit in memory.
You can then pass each batch of transformed documents to a linear classifier that supports the partial_fit method (e.g. SGDClassifier or PassiveAggressiveClassifier) and then iterate on new batches.
You can start scoring the model on a held-out validation set (e.g. 10k documents) as you go, to monitor the accuracy of the partially trained model without waiting until it has seen all the samples.
You can also do this in parallel on several machines on partitions of the data and then average the resulting coef_ and intercept_ attributes to get a final linear model for the whole dataset.
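As a minimal sketch of that out-of-core loop (iter_batches and the two-class labels below are hypothetical placeholders for your own corpus reader):

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    # Hypothetical helper: stream (texts, labels) chunks of ~10k documents
    # from disk; replace with your own corpus reader.
    def iter_batches():
        yield ["first document ...", "second document ..."], [0, 1]

    vectorizer = HashingVectorizer(n_features=2**20)  # stateless: no fit step
    clf = SGDClassifier()

    all_classes = [0, 1]  # partial_fit needs the full class list up front
    for texts, labels in iter_batches():
        X = vectorizer.transform(texts)
        clf.partial_fit(X, labels, classes=all_classes)

    # clf.coef_ and clf.intercept_ can then be averaged with models trained
    # on other partitions to combine work across machines.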
I discuss this in this talk I gave in March 2013 at PyData: http://vimeo.com/63269736
There is also sample code in this tutorial on parallelizing scikit-learn with IPython.parallel, taken from: https://github.com/ogrisel/parallel_ml_tutorial

Related

Is there an option like generator in keras with scikit to process large records of data?

I have a training dataset of shape (90000, 50) and I am trying to fit it to a Gaussian process regression model. This fails with a memory error. I understand the computation involved, but is there a way to pass data in batches using scikit? I am using the scikit implementation of the GPR algorithm.
Keras has generators because you can create checkpoints and resume from where you left off in neural networks. However, not all trainable algorithms have this property. Take a look at incremental learning in the scikit-learn API docs.
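If it helps, here is a quick way to list which estimators in your scikit-learn installation actually expose partial_fit (a sketch; the exact list depends on your version):

    from sklearn.utils import all_estimators

    # Estimators that support incremental (out-of-core) learning.
    incremental = [name for name, cls in all_estimators()
                   if hasattr(cls, "partial_fit")]
    print(incremental)  # GaussianProcessRegressor will not be among them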
The Gaussian process implementation (regression/classification) from scikit isn't capable of handling big datasets. It can run on only up to about 15000 rows of data. So I decided to use a different algorithm instead, as this seems to be a problem with the algorithm itself.

how to do cross validation and hyper parameter tuning for huge dataset?

I have a CSV file of 10+ GB. I used the chunksize parameter available in pandas.read_csv() to read and pre-process the data, and for training the model I want to use one of the online learning algorithms.
Normally cross-validation and hyper-parameter tuning are done on the entire training dataset, and the model is trained using the best hyper-parameters, but in the case of huge data, if I do the same on a chunk of the training data, how do I choose the hyper-parameters?
I believe you are looking for online learning algorithms like the ones mentioned on this link, Scaling Strategies for large datasets. You should use algorithms that implement the partial_fit method so you can load these large datasets in chunks. You can also look at the following links to see which one helps you the best, since you haven't specified the exact problem or the algorithm that you are working on:
Numpy save partial results in RAM
Scaling Computationally - Sklearn
Using Large Datasets in Sklearn
Comparison of various Online Solvers - Sklearn
EDIT: If you want to solve a class imbalance problem, you can try the imbalanced-learn library in Python.
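As a rough sketch of how chunk-wise tuning can look (the file name, label column, and alpha grid below are all made up): hold one chunk out for validation, stream the remaining chunks through several partial_fit candidates, and keep the best scorer:

    import pandas as pd
    from sklearn.linear_model import SGDClassifier

    # One candidate model per hyper-parameter value to compare.
    candidates = {a: SGDClassifier(alpha=a) for a in (1e-5, 1e-4, 1e-3)}

    reader = pd.read_csv("train.csv", chunksize=100_000)  # hypothetical file
    held_out = next(reader)  # first chunk reserved for validation
    X_val, y_val = held_out.drop(columns="label"), held_out["label"]
    classes = sorted(y_val.unique())  # assumes all classes appear here

    for chunk in reader:
        X, y = chunk.drop(columns="label"), chunk["label"]
        for clf in candidates.values():
            clf.partial_fit(X, y, classes=classes)

    scores = {a: clf.score(X_val, y_val) for a, clf in candidates.items()}
    print(max(scores, key=scores.get))  # best alpha on the held-out chunk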

SciKit One-class SVM classifier training time increases exponentially with size of training data

I am using the Python SciKit OneClass SVM classifier to detect outliers in lines of text. The text is converted to numerical features first using bag of words and TF-IDF.
When I train (fit) the classifier running on my computer, the time seems to increase exponentially with the number of items in the training set:
Number of items in training data and training time taken:
10K: 1 sec, 15K: 2 sec, 20K: 8 sec, 25k: 12 sec, 30K: 16 sec, 45K: 44 sec.
Is there anything I can do to reduce the training time and keep it from becoming prohibitively long when the training data grows to a couple of hundred thousand items?
Well, scikit-learn's SVM is a high-level implementation, so there is only so much you can do. In terms of speed, from their website: "SVMs do not directly provide probability estimates, these are calculated using an expensive five-fold cross-validation."
You can increase the kernel cache size (the cache_size parameter) based on your available RAM, but the increase does not help much.
You can try changing the kernel, though the resulting model may fit your data less well.
Here is some advice from http://scikit-learn.org/stable/modules/svm.html#tips-on-practical-use: Scale your data.
Otherwise, don't use scikit and implement it yourself using neural nets.
Hope I'm not too late. OCSVM, like SVM, is resource-hungry, and the relationship between data size and training time is quadratic (the numbers you show follow this trend). If you can, see if Isolation Forest or Local Outlier Factor work for you, but if you're considering applying it to a lengthier dataset, I would suggest creating a manual anomaly-detection model that closely resembles the context of these off-the-shelf solutions. By doing so you should be able to work either in parallel or with threads.
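A minimal sketch of the Isolation Forest alternative mentioned above, on random stand-in features (it trains in roughly O(n log n) on subsamples rather than growing quadratically):

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.RandomState(0)
    X = rng.normal(size=(50_000, 100))  # stand-in for your TF-IDF features

    iso = IsolationForest(n_estimators=100, random_state=0)
    iso.fit(X)
    labels = iso.predict(X)  # +1 = inlier, -1 = outlier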
For anyone coming here from Google, sklearn has implemented SGDOneClassSVM, which "has a linear complexity in the number of training samples". It should be faster for large datasets.
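Here is a small sketch of SGDOneClassSVM, optionally combined with a Nystroem kernel approximation so the model stays linear in the number of samples while still capturing non-linear structure (the gamma and nu values are placeholders, not tuned):

    import numpy as np
    from sklearn.kernel_approximation import Nystroem
    from sklearn.linear_model import SGDOneClassSVM
    from sklearn.pipeline import make_pipeline

    rng = np.random.RandomState(0)
    X = rng.normal(size=(50_000, 100))

    # Nystroem approximates an RBF kernel with an explicit feature map,
    # so the downstream one-class SVM can stay linear.
    clf = make_pipeline(
        Nystroem(gamma=0.1, n_components=300, random_state=0),
        SGDOneClassSVM(nu=0.05, random_state=0),
    )
    clf.fit(X)             # linear in n_samples
    pred = clf.predict(X)  # +1 = inlier, -1 = outlier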

sklearn linear regression for large data

Does sklearn.LinearRegression support online/incremental learning?
I have 100 groups of data, and I am trying to fit a model on all of them together. For each group there are over 10000 instances and ~10 features, so constructing one huge matrix (10^6 by 10) leads to a memory error with sklearn. It would be nice if I could update the regressor each time with a batch of samples from a new group.
I found this post relevant, but the accepted solution works for online learning with a single new data point (one instance at a time) rather than batches of samples.
Take a look at linear_model.SGDRegressor; it learns a linear model using stochastic gradient descent.
In general, sklearn has many models that implement partial_fit, and they are all pretty useful on medium to large datasets that don't fit in RAM.
Not all algorithms can learn incrementally, that is, without seeing all of the instances at once. That said, all estimators implementing the partial_fit API are candidates for mini-batch learning, also known as "online learning".
Here is an article that goes over scaling strategies for incremental learning. For your purposes, have a look at the sklearn.linear_model.SGDRegressor class. It is truly online so the memory and convergence rate are not affected by the batch size.
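A minimal sketch of that group-by-group loop, with synthetic data standing in for your 100 groups:

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    rng = np.random.RandomState(0)
    true_w = rng.normal(size=10)
    reg = SGDRegressor()

    # Hypothetical loader: one (10000, 10) group at a time, so the full
    # 10^6-row matrix is never held in memory.
    def iter_groups(n_groups=100):
        for _ in range(n_groups):
            X = rng.normal(size=(10_000, 10))
            yield X, X @ true_w

    for X, y in iter_groups():
        reg.partial_fit(X, y)  # update the model with each new group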

Improving SVC prediction performance on single samples

I have large-ish SVC models (~50Mb cPickles) for text classification and I am trying out various ways to use them in a production environment. Classifying batches of documents works very well (about 1k documents per minute using both predict and predict_proba).
However, prediction on a single document is another story, as explained in a comment to this question:
Are you doing predictions in batches? The SVC.predict method, unfortunately, incurs a lot of overhead because it has to reconstruct a LibSVM data structure similar to the one that the training algorithm produced, shallow-copy in the support vectors, and convert the test samples to a LibSVM format that may be different from the NumPy/SciPy formats. Therefore, prediction on a single sample is bound to be slow. – larsmans
I am already serving the SVC models as Flask web-applications, so a part of the overhead is gone (unpickling) but the prediction times for single docs are still on the high side (0.25s).
I have looked at the code in the predict methods but cannot figure out if there is a way to "pre-warm" them, reconstructing the LibSVM data structure in advance at server startup... any ideas?
    def predict(self, X):
        """Perform classification on samples in X.

        For an one-class model, +1 or -1 is returned.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]

        Returns
        -------
        y_pred : array, shape = [n_samples]
            Class labels for samples in X.
        """
        y = super(BaseSVC, self).predict(X)
        return self.classes_.take(y.astype(np.int))
I can see three possible solutions.
Custom server
It is not a matter of "warming" anything up. Simply put, libSVM is a C library, and you need to pack/unpack data into the correct format. This process is more efficient on whole matrices than on each row separately. The only way to overcome this would be to write a more efficient wrapper between your production environment and libSVM (you could write a libSVM-based server that uses some kind of shared memory with your service). Unfortunately, this is too custom a problem to be solved by existing implementations.
Batches
A naive approach like buffering the queries is an option (if it is a "high-performance" system with thousands of queries, you can simply store them in N-element batches and send them to libSVM in such packs).
Own classification
Lastly, classification using an SVM is a really simple task. You don't need libSVM to perform classification; only training is a complex problem. Once you have all the support vectors (SV_i), the kernel (K), the Lagrange multipliers (alpha_i), and the intercept term (b), you classify using:
cl(x) = sgn( SUM_i y_i alpha_i K(SV_i, x) + b)
You can code this operation directly in your app, without the need to actually pack/unpack/send anything to libSVM. This can speed things up by an order of magnitude. Obviously, a probability is more complex to retrieve, as it requires Platt scaling, but it is still possible.
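For a binary SVC with an RBF kernel, the formula above maps directly onto the fitted attributes (dual_coef_ already stores y_i * alpha_i); a sketch, assuming gamma is the value the model was trained with:

    import numpy as np

    def rbf_decision(clf, x, gamma):
        # K(SV_i, x) for an RBF kernel
        k = np.exp(-gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
        # SUM_i y_i alpha_i K(SV_i, x) + b
        return float(clf.dual_coef_[0] @ k + clf.intercept_[0])

    # predicted label: clf.classes_[1] if the decision value is positive,
    # else clf.classes_[0]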
You can't construct the LibSVM data structure in advance. When a request to classify a document arrives, you get the text of the document, make a vector out of it, and only then convert it to LibSVM format so you can get a decision.
LinearSVC should be considerably faster than an SVC with a linear kernel, as it uses liblinear. You could try using a different classifier if that does not decrease performance too much.
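For reference, a minimal sketch of swapping LinearSVC in behind the same kind of text pipeline (toy data, not a benchmark):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Prediction with liblinear reduces to one sparse dot product per
    # document, so single-document latency stays low.
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(["cheap pills now", "meeting at noon"], [1, 0])
    print(model.predict(["new incoming document"]))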
