GridSearchCV with big dataset

I am trying to build a classifier with GridSearchCV on a huge dataset (2M records * 500 features and growing, expecting at least 15M records in total). However, I find that GridSearchCV.fit doesn't accept a generator for X and y. The problem is that I don't have enough memory for the task. The classifier I use is SGDClassifier (which supports partial_fit).
Before this I would use a much smaller subset of the dataset for the GridSearchCV, and then retrain the best classifier with the whole dataset. Is this the right way to use GridSearchCV?
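The approach described in the question (search on a subset, then retrain on everything) could be sketched roughly as below. This is only a minimal sketch: the synthetic data, the iter_chunks helper and the parameter grid are assumptions made for illustration, and in practice the chunks would be read from disk rather than sliced from an in-memory array.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real data (assumption: the real data would be
# streamed from disk chunk by chunk instead of living in one array).
X, y = make_classification(n_samples=20000, n_features=50, random_state=0)
classes = np.unique(y)


def iter_chunks(X, y, chunk_size=2000):
    """Hypothetical chunk reader; replace with a reader over files or a database."""
    for start in range(0, X.shape[0], chunk_size):
        yield X[start:start + chunk_size], y[start:start + chunk_size]


# 1) Hyperparameter search on a random subset that fits in memory.
rng = np.random.RandomState(0)
subset = rng.choice(X.shape[0], size=4000, replace=False)
search = GridSearchCV(
    SGDClassifier(random_state=0),
    param_grid={"alpha": [1e-5, 1e-4, 1e-3], "penalty": ["l2", "l1"]},
    cv=3,
)
search.fit(X[subset], y[subset])

# 2) Retrain a fresh classifier with the best parameters, one chunk at a time.
final_clf = SGDClassifier(random_state=0, **search.best_params_)
for X_chunk, y_chunk in iter_chunks(X, y):
    # partial_fit needs the full list of classes, since a chunk may miss some.
    final_clf.partial_fit(X_chunk, y_chunk, classes=classes)
```

Parameters tuned on a small subset are not guaranteed to stay optimal as the training set grows, so it may be worth repeating the search on a larger subset from time to time.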

Related

Setting exact number of iterations for Logistic regression in python

I'm creating a model to perform Logistic regression on a dataset using Python. This is my code:
from sklearn import linear_model
my_classifier2=linear_model.LogisticRegression(solver='lbfgs',max_iter=10000)
Now, according to Sklearn doc page, max_iter is maximum number of iterations taken for the solvers to converge. How do I specifically state that I need 'N' number of iterations ?
Any kind of help would be really appreciated.

I'm not sure, but do you want to know the optimal number of iterations for your model? If so, you are better off using GridSearchCV, which can tune hyperparameters such as max_iter.
Briefly:
1. Split your data into two groups, train and test, with train_test_split or KFold, both of which can be imported from sklearn.
2. Set your parameter grid, for instance para=[{'max_iter':[1,10,100,1000]}]
3. Create the search object, for example clf=GridSearchCV(LogisticRegression(), param_grid=para, cv=5, scoring='accuracy')
4. Fit it on the training data like this: clf.fit(x_train, y_train)
You can also search for the best number of iterations with RandomizedSearchCV or BayesianOptimization.
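Put together, the steps above might look like the following. It is a sketch on synthetic data (the dataset and the grid values are assumptions), and it uses accuracy as the score because this is a classification problem.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the real dataset (assumption).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Candidate values for max_iter; the estimator must be an instance.
para = [{"max_iter": [1, 10, 100, 1000]}]
clf = GridSearchCV(
    LogisticRegression(solver="lbfgs"), param_grid=para, cv=5, scoring="accuracy"
)

# Small max_iter values are expected to stop before convergence,
# so the resulting warnings are silenced here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    clf.fit(x_train, y_train)

best_max_iter = clf.best_params_["max_iter"]
test_accuracy = clf.score(x_test, y_test)
print(best_max_iter, test_accuracy)
```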

Regarding a GridSearchCV over the max_iter parameter: fitted LogisticRegression models have an attribute n_iter_, so you can find out how many iterations were actually needed for a given sample size and feature set:
n_iter_: ndarray of shape (n_classes,) or (1, )
Actual number of iterations for all classes. If binary or multinomial, it
returns only 1 element. For liblinear solver, only the maximum number of
iteration across all classes is given.
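For instance, a generous max_iter can be set once and n_iter_ read back after fitting. A small sketch on synthetic data (the dataset is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (assumption, for illustration only).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# max_iter is only an upper bound; the solver stops earlier once it converges.
model = LogisticRegression(solver="lbfgs", max_iter=10000).fit(X, y)

# n_iter_ holds the iterations actually used (one element for a binary problem).
iterations_used = int(model.n_iter_.max())
print(iterations_used)
```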
Scanning max_iter in very small steps, like 1 by 1, is a waste of resources that would be better spent on more important LogisticRegression parameters, such as the solver itself, its regularization penalty and the inverse regularization strength C, all of which affect how quickly the model converges within a given max_iter.
Setting a very high max_iter can also be a waste of resources if you haven't first done some minimal feature preprocessing: at least feature scaling, and maybe imputation, outlier clipping and dimensionality reduction (e.g. PCA).
Things can get worse: a tuned max_iter could be fine for a given sample size but not for a bigger one, for instance if you are building a cross-validated learning curve, which, by the way, is imperative for optimal machine learning.
It gets even worse if you increase the sample size in a pipeline that generates feature vectors such as n-grams (NLP): more rows will generate more (sparse) features for the LogisticRegression classification.
I think it's important to observe whether the different solvers converge for a given sample size, set of generated features and max_iter.
Methods that help the solver converge faster, so that max_iter may not need to be increased, are:
Feature scaling
Dimensionality Reduction (e.g. PCA) of scaled features
There's a nice sklearn example demonstrating the importance of feature scaling
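One way to check the effect of scaling on convergence is to compare n_iter_ with and without a StandardScaler in front of the model. The sketch below uses synthetic data whose columns are deliberately stretched to very different ranges (an assumption made for illustration); the exact iteration counts depend on the data and the scikit-learn version.

```python
import warnings

import numpy as np
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Stretch the columns to very different ranges to mimic unscaled features.
X_unscaled = X * np.logspace(0, 4, X.shape[1])

raw = LogisticRegression(solver="lbfgs", max_iter=5000)
scaled = make_pipeline(
    StandardScaler(), LogisticRegression(solver="lbfgs", max_iter=5000)
)

# The unscaled fit may hit max_iter, so convergence warnings are silenced.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    raw.fit(X_unscaled, y)
    scaled.fit(X_unscaled, y)

iters_raw = int(raw.n_iter_.max())
iters_scaled = int(scaled.named_steps["logisticregression"].n_iter_.max())
print(f"unscaled: {iters_raw} iterations, scaled: {iters_scaled} iterations")
```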

Does GridSearchCV perform cross-validation?

I'm currently working on a problem which compares three different machine learning algorithms performance on the same data-set. I divided the data-set into 70/30 training/testing sets and then performed grid search for the best parameters of each algorithm using GridSearchCV and X_train, y_train.
First question: am I supposed to perform grid search on the training set, or is it supposed to be on the whole data-set?
Second question: I know that GridSearchCV uses K-fold in its implementation. Does that mean I performed cross-validation if I used the same X_train, y_train for all three algorithms I compare in the GridSearchCV?
Any answer would be appreciated, thank you.

All estimators in scikit-learn whose names end with CV perform cross-validation.
But you need to keep a separate test set for measuring the performance.
So you need to split your whole data into train and test. Forget about this test data for a while.
Then pass only the train data to the grid search. GridSearchCV will split this train data further into train and validation folds to tune the hyper-parameters passed to it, and will finally fit the model on the whole train data with the best parameters found.
Now you need to test this model on the test data you set aside in the beginning. This will give you the near real-world performance of the model.
If you pass the whole data to GridSearchCV, test data leaks into the parameter tuning, and the final model may not perform as well on new, unseen data.
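The workflow above, applied to several algorithms on the same split, could be sketched like this. The three estimators and their grids are placeholders chosen for illustration (the question does not say which algorithms are compared), and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# 70/30 split; the test part is not touched until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Placeholder models and grids (assumptions, not taken from the question).
candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 11]}),
    "forest": (
        RandomForestClassifier(n_estimators=50, random_state=0),
        {"max_depth": [3, None]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # 5-fold cross-validation happens inside the training set only.
    search = GridSearchCV(estimator, grid, cv=5)
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "cv_score": search.best_score_,
        # score() uses the model refitted on all of X_train.
        "test_score": search.score(X_test, y_test),
    }

for name, result in results.items():
    print(name, result)
```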
You can look at my other answers which describe the GridSearch in more detail:
Model help using Scikit-learn when using GridSearch
scikit-learn GridSearchCV with multiple repetitions

Yes, GridSearchCV performs cross-validation. If I understand the concept correctly - you want to keep part of your data set unseen for the model in order to test it.
So you train your models against train data set and test them on a testing data set.
Here I was doing almost the same - you might want to check it...

Is RandomForestRegressor predict() fundamentally slow?

I can only make 2-3 predictions per second with this model which is super slow.
When using LinearRegression model I can easily achieve 40x speedup.
I'm using scikit-learn python package with a very simple dataset containing 3 columns (day, hour and result) so basically 2 features.
day and hour are categorical variables.
Naturally there are 7 day and 24 hour categories.
The training sample is relatively small (about 5000 samples).
It takes just a few seconds to train it.
But when I go on to predict something, it's very slow.
So my question is: is this a fundamental characteristic of RandomForestRegressor, or can I actually do something about it?
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100,
                              max_features='auto',
                              oob_score=True,
                              n_jobs=-1,
                              random_state=42,
                              min_samples_leaf=2)

Here are some steps to optimize a RandomForest with sklearn:
1. Do batch predictions by passing multiple datapoints to predict(). This reduces Python overhead.
2. Reduce the depth of the trees. Use something like min_samples_leaf or min_samples_split to avoid having lots of small decision nodes. Both accept a fraction: to require 5% of the training set, use 0.05.
3. Reduce the number of trees. With somewhat pruned trees, RF can often perform OK with as few as n_estimators=10.
4. Use an optimized RF inference implementation like emtrees. This is the last thing to try, and it also depends on the prior steps to perform well.
The performance of the optimized model must be validated, using cross-validation or similar. Steps 2 and 3 are related, so one can do a grid search to find the combination that best preserves model performance.
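The first three steps can be tried out with something like the sketch below. The data is synthetic and only mimics the day/hour layout from the question (an assumption), and the timings it prints will vary from machine to machine.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic day/hour data loosely mimicking the question (assumption).
rng = np.random.RandomState(42)
X = np.column_stack([rng.randint(0, 7, 5000), rng.randint(0, 24, 5000)])
y = X[:, 0] * 2.0 + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=5000)

# Fewer trees and larger leaves (5% of the training set per leaf).
model = RandomForestRegressor(
    n_estimators=10, min_samples_leaf=0.05, random_state=42
)
model.fit(X, y)

X_new = X[:200]

# One predict() call per row: pays the Python overhead 200 times.
start = time.perf_counter()
one_by_one = np.array([model.predict(row.reshape(1, -1))[0] for row in X_new])
loop_seconds = time.perf_counter() - start

# A single predict() call for the whole batch.
start = time.perf_counter()
batched = model.predict(X_new)
batch_seconds = time.perf_counter() - start

print(f"row by row: {loop_seconds:.4f}s, batched: {batch_seconds:.4f}s")
```

Both variants return the same predictions; only the call overhead differs.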

sklearn: Naive Bayes classifier gives low accuracy

I have a dataset which includes 200000 labelled training examples.
For each training example I have 10 features, including both continuous and discrete.
I'm trying to use sklearn package of python in order to train the model and make predictions but I have some troubles (and some questions too).
First let me write the code which I have written so far:
from sklearn.naive_bayes import GaussianNB
# data contains the 200 000 examples
# targets contain the corresponding labels for each training example
gnb = GaussianNB()
gnb.fit(data, targets)
predicted = gnb.predict(data)
The problem is that I get really low accuracy (too many misclassified labels) - around 20%.
However I am not quite sure whether there is a problem with the data (e.g. more data is needed or something else) or with the code.
Is this the proper way to implement a Naive Bayes classifier given a dataset with both discrete and continuous features?
Furthermore, in Machine Learning we know that the dataset should be split into training and validation/testing sets. Is this automatically performed by sklearn or should I fit the model using the training dataset and then call predict using the validation set?
Any thoughts or suggestions will be much appreciated.

The problem is that I get really low accuracy (too many misclassified labels) - around 20%. However I am not quite sure whether there is a problem with the data (e.g. more data is needed or something else) or with the code.
This is not a large error for Naive Bayes. It is an extremely simple classifier and you should not expect it to be strong, so more data probably won't help. Your Gaussian estimators are probably already very good; the naive assumptions are simply the problem. Use a stronger model. You can start with Random Forest, since it is very easy to use even for non-experts in the field.
Is this the proper way to implement a Naive Bayes classifier given a dataset with both discrete and continuous features?
No, it is not. You should use different distributions for the discrete features; however, scikit-learn does not support that, so you would have to do this manually. As said before, change your model.
Furthermore, in Machine Learning we know that the dataset should be split into training and validation/testing sets. Is this automatically performed by sklearn or should I fit the model using the training dataset and then call predict using the validation set?
Nothing is done automatically in this respect; you need to do this on your own (scikit-learn has lots of tools for that, see the cross-validation packages).
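For what it's worth, here is one possible way to do the manual combination together with an explicit train/test split. It is a sketch, not a recommendation: it assumes a scikit-learn version that ships CategoricalNB, the data is synthetic, and the two models are combined by adding their log-posteriors and subtracting the log-prior once (which follows from the naive independence assumption when both models share the same class prior).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB, GaussianNB

# Synthetic data: two continuous and two discrete features (assumption).
rng = np.random.RandomState(0)
n = 2000
y = rng.randint(0, 3, size=n)
X_cont = rng.normal(loc=y[:, None], scale=1.0, size=(n, 2))
X_disc = (y[:, None] + rng.randint(0, 2, size=(n, 2))) % 4

# Hold out 30% of the rows; nothing is split automatically.
idx_train, idx_test = train_test_split(
    np.arange(n), test_size=0.3, random_state=0, stratify=y
)

# One model per feature type, both fitted on the training rows only.
gnb = GaussianNB().fit(X_cont[idx_train], y[idx_train])
cnb = CategoricalNB().fit(X_disc[idx_train], y[idx_train])

# P(y | x_cont, x_disc) is proportional to
# P(y | x_cont) * P(y | x_disc) / P(y) under the naive assumption.
log_prior = np.log(gnb.class_prior_)
joint = (
    gnb.predict_log_proba(X_cont[idx_test])
    + cnb.predict_log_proba(X_disc[idx_test])
    - log_prior
)
predicted = gnb.classes_[np.argmax(joint, axis=1)]
accuracy = np.mean(predicted == y[idx_test])
print(accuracy)
```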

Multi-label classification for large dataset

I am solving a multilabel classification problem. I have about 6 million rows to process, each of which is a large chunk of text. They are tagged with multiple tags in a separate column.
Any advice on which scikit-learn tools can help me scale up my code? I am using One-vs-Rest with an SVM inside it, but they don't scale beyond 90-100k rows.
classifier = Pipeline([
    ('vectorizer', CountVectorizer(min_df=1)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

SVMs scale well as the number of columns increases, but poorly with the number of rows, as they are essentially learning which rows constitute the support vectors. I have seen this as a common complaint about SVMs, but most people don't understand why, as they typically scale well for most reasonably sized datasets.
1. You will want one-vs-rest, as you are using. One-vs-one will not scale well for this (n(n-1)/2 classifiers, vs n).
2. I would set a minimum df for the terms you consider to at least 5, maybe higher, which will drastically reduce the size of each row (the number of columns). You will find that a lot of words occur only once or twice, and they add no value to your classification, as at that frequency an algorithm cannot possibly generalize. Stemming may help there.
3. Also remove stop words (the, a, an, prepositions, etc.; lists are easy to find online). That will further cut down the number of columns.
4. Once you have reduced your column size as described, I would try to eliminate some rows. If there are documents that are very noisy, or very short after steps 1-3, or maybe very long, I would look to eliminate them. Look at the standard deviation and mean of the document length, and plot the length of the docs (in terms of word count) against the frequency at that length to decide.
5. If the dataset is still too large, I would suggest a decision tree or naive Bayes; both are available in sklearn. DTs scale very well. I would set a depth threshold to limit the depth of the tree, as otherwise it will try to grow a humongous tree to memorize the dataset. NB, on the other hand, is very fast to train and handles large numbers of columns quite well. If the DT works well, you can try RF with a small number of trees, and leverage the IPython parallelization to multi-thread.
6. Alternatively, segment your data into smaller datasets, train a classifier on each, persist each one to disk, and then build an ensemble classifier from those classifiers.
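To get a feel for how much the min_df and stop-word suggestions shrink the vocabulary, the two settings can be compared directly. This is a toy sketch: the corpus is made up, and the built-in English stop-word list is used as a stand-in for whatever list fits the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: a few frequent documents plus some rare words (assumption).
base_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog and a cat played in the garden",
    "the bird sang in the garden",
]
docs = base_docs * 5 + [
    "zyzzyva appears once",
    "quokka appears twice",
    "quokka again",
]

# Baseline: keep every term, as in the question (min_df=1).
full = CountVectorizer(min_df=1).fit(docs)

# Pruned: drop terms seen in fewer than 5 documents and English stop words.
pruned = CountVectorizer(min_df=5, stop_words="english").fit(docs)

print(len(full.vocabulary_), "terms before pruning")
print(len(pruned.vocabulary_), "terms after pruning")
print(sorted(pruned.vocabulary_))
```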

HashingVectorizer will work if you iteratively chunk your data into batches of 10k or 100k documents that fit in memory for instance.
You can then pass the batch of transformed documents to a linear classifier that supports the partial_fit method (e.g. SGDClassifier or PassiveAggressiveClassifier) and then iterate on new batches.
You can start scoring the model on a held-out validation set (e.g. 10k documents) as you go to monitor the accuracy of the partially trained model without waiting for having seen all the samples.
You can also do this in parallel on several machines, each working on a partition of the data, and then average the resulting coef_ and intercept_ attributes to get a final linear model for the whole dataset.
I discuss this in this talk I gave in March 2013 at PyData: http://vimeo.com/63269736
There is also sample code in this tutorial on parallelizing scikit-learn with IPython.parallel, taken from: https://github.com/ogrisel/parallel_ml_tutorial
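The batch-wise idea could be sketched as follows for the multi-label case. This is an illustration rather than code from the talk or tutorial: the tiny corpus, the iter_batches helper and the batch size are made up, and the multi-label part is handled by keeping one binary SGDClassifier per tag (a hand-rolled one-vs-rest).

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

labels = ["python", "sql"]

# Made-up (text, tags) pairs, repeated to simulate a larger stream (assumption).
corpus = [
    ("how to sort a list in python", {"python"}),
    ("select rows with a join in sql", {"sql"}),
    ("python script to run a sql query", {"python", "sql"}),
    ("optimize a slow sql join", {"sql"}),
    ("python list comprehension examples", {"python"}),
    ("insert rows from python into a sql table", {"python", "sql"}),
] * 50


def iter_batches(items, batch_size):
    """Hypothetical batch reader; replace with one that reads from disk."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# HashingVectorizer is stateless, so every batch maps to the same columns.
vectorizer = HashingVectorizer(n_features=2**18)
models = {label: SGDClassifier(random_state=0) for label in labels}

# Held-out documents used to monitor progress while training.
held_out, training = corpus[:30], corpus[30:]
X_val = vectorizer.transform([text for text, _ in held_out])
y_val = {
    label: np.array([int(label in tags) for _, tags in held_out])
    for label in labels
}

history = []
for batch in iter_batches(training, batch_size=60):
    X_batch = vectorizer.transform([text for text, _ in batch])
    for label, model in models.items():
        y_batch = np.array([int(label in tags) for _, tags in batch])
        model.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))
    # Mean held-out accuracy across tags after this batch.
    history.append(
        float(np.mean([models[l].score(X_val, y_val[l]) for l in labels]))
    )

print(history)
```

On real data the held-out documents should of course be distinct from the training stream; here they repeat only because the toy corpus is tiny.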
