Relatively new to model building with sklearn. I know cross validation can be parallelized via the n_jobs parameter, but if I'm not using CV, how can I utilize my available cores to speed up model fitting?
There are alternatives like XGboost or LigtGMB that are distributed (i.e., can run parallel). These are well documented and popular boosting algorithms.
At the moment afaik there is no parameter like "n_jobs" for the GradientBoosting method, but you can try to use the HistGradientBoostingRegressor which uses OpenMP for parallelization (will use as many threads as possible).
You can read more here:
https://scikit-learn.org/stable/modules/ensemble.html
https://scikit-learn.org/stable/modules/computing.html#parallelism
Related
In the context of model selection for a classification problem, while running cross validation, is it ok to specify n_jobs=-1 both in model specification and cross validation function in order to take full advantage of the power of the machine?
For example, comparing sklearn RandomForestClassifier and xgboost XGBClassifier:
RF_model = RandomForestClassifier( ..., n_jobs=-1)
XGB_model = XGBClassifier( ..., n_jobs=-1)
RF_cv = cross_validate(RF_model, ..., n_jobs=-1)
XGB_cv = cross_validate(XGB_model, ..., n_jobs=-1)
is it ok to specify the parameters in both? Or should I specify it only once? And in which of them, model or cross validation statement?
I used for the example models from two different libraries (sklearn and xgboost) because maybe there is a difference in how it works, also cross_validate function is from sklearn.
Specifying n_jobs twice does have an effect, though whether it has a positive or negative effect is complicated.
When you specify n_jobs twice, you get two levels of parallelism. Imagine you have N cores. The cross-validation function creates N copies of your model. Each model creates N threads to run fitting and predictions. You then have N*N threads.
This can blow up pretty spectacularly. I once worked on a program which needed to apply ARIMA to tens of thousands of time-series. Since each ARIMA is independent, I parallelized it and ran one ARIMA on each core of a 12-core CPU. I ran this, and it performed very poorly. I opened up htop, and was surprised to find 144 threads running. It turned out that this library, pmdarima, internally parallelized ARIMA operations. (It doesn't parallelize them well, but it does try.) I got a massive speedup just by turning off this inner layer of parallelism. Having two levels of parallelism is not necessarily better than having one.
In your specific case, I benchmarked a random forest with cross validation, and I benchmarked four configurations:
No parallelism
Parallelize across different CV folds, but no model parallelism
Parallelize within the model, but not on CV folds
Do both
(Error bars represent 95% confidence interval. All tests used RandomForestClassifier. Test was performed using cv=5, 100K samples, and 100 trees. Test system had 4 cores with SMT disabled. Scores are mean duration of 7 runs.)
This graph shows that no parallelism is the slowest, CV parallelism is third fastest, and model parallelism and combined parallelism are tied for first place.
However, this is closely tied to what classifiers I'm using - a benchmark for pmdarima, for example, would find that cross-val parallelism is faster than model parallelism or combined parallelism. If you don't know which one is faster, then test it.
I often use GridSearchCV for hyperparameter tuning. For example, for tuning regularization parameter C in Logistic Regression. Whenever an estimator I am using has its own n_jobs parameter I am confused where to set it, in estimator or in GridSearchCV, or in both? Same thing applies to cross_validate.
This is a very interesting question. I don't have a definitive answer, but some elements that are worth mentioning to understand the issue, and don't fir in a comment.
Let's start with why you should or should not use multiprocessing :
Multiprocessing is useful for independent tasks. This is the case in a GridSearch, where all your different variations of your models are independent.
Multiprocessing is not useful / make things slower when :
Task are too small : creating a new process takes time, and if your task is really small, this overhead with slow the execution of the whole code
Too many processes are spawned : your computer have a limited number of cores. If you have more processes than cores, a load balancing mechanism will force the computer to regularly switch the processes that are running. These switches take some time, resulting in a slower execution.
The first take-out is that you should not use n_jobs in both GridSearch and the model you're optimizing, because you will spawn a lot of processes and end up slowing the execution.
Now, a lot of sklearn models and functions are based on Numpy/SciPy which in turn, are usually implemented in C/Fortran, and thus already use multiprocessing. That means that these should not be used with n_jobs>1 set in the GridSearch.
If you assume your model is not already parallelized, you can choose to set n_jobsat the model level or at the GridSearch level. A few models are able to be fully parallelized (RandomForest for instance), but most may have at least some part that is sequential (Boosting for instance). In the other end, GridSearch has no sequential component by design, so it would make sense to set n_jobs in GridSearch rather than in the model.
That being said, it depend on the implementation of the model, and you can't have a definitive answer without testing for yourself for your case. For example, if you pipeline consume a lot of memory for some reason, setting n_jobs in the GridSearch may cause memory issues.
As a complement, here is a very interesting note on parallelism in sklearn
I have been using pytorch a lot and got used to their dataloaders and transforms, in particular when it comes to data augmentation, as they're very user-friendly and easy to understand.
However, I need to run some ML models from sklearn.
Is there a way to use pytorch's dataloaders for sklearn ?
Yes, you can. You can do this with sklearn's partial_fit method. Read HERE.
6.1.3. Incremental learning
Finally, for 3. we have a number of options inside scikit-learn. Although all algorithms cannot learn
incrementally (i.e. without seeing all the instances at once), all
estimators implementing the partial_fit API are candidates. Actually,
the ability to learn incrementally from a mini-batch of instances
(sometimes called “online learning”) is key to out-of-core learning as
it guarantees that at any given time there will be only a small amount
of instances in the main memory. Choosing a good size for the
mini-batch that balances relevancy and memory footprint could involve
some tuning [1].
Not all algorithms can do this, however.
Then, you can use pytorch's dataloader to preprocess the data and feed it in batches to partial_fit.
I came across the skorch library recently and this could help you.
"The goal of skorch is to make it possible to use PyTorch with sklearn. "
From the skorch docs:
class skorch.dataset.Dataset(X, y=None, length=None)
General dataset wrapper that can be used in conjunction with PyTorch DataLoader.
I guess you could use the Dataset class for wrapping your PyTorch DataLoader and use sklearn models. If you would like to use other PyTorch features like PyTorch Tensors you could also do that.
As per the documentation of RandomizedSearchCV and GridSearchCV modules of sklearn, they support only the fit method for the classifier which is passed to them and doesn't support the partial_fit method of the classifiers which can be used for training on an incremental basis. Currently, I am trying to use SGDClassifier which can be trained on incremental data using the partial_fit method and also find the best set of hyper-parameters for the same. I was just wondering why doesn't RandomizedSearchCV or GridSearchCV support partial_fit. I don't see any technical reasons as to why this cannot be done (please correct me if I am wrong here). Any leads will be really appreciated.
Yeah, technically you can write a GridSerachCV for partial_fit as well, but when you think about
what is that you are searching for?
what is that your are optimizing for?
it becomes quite different from what we do with the .fit() approach. Here is my list of reason for not having partial_fit in GridsearchCV/RandomSearchCV.
what is that you are searching for?
When we optimize for the hyper parameters of a model for one batch of data, it could be sub-optimal for the final model (which is trained on complete data using multiple partial_fits). Now the problem becomes finding the best schedule for the hyper parameters i.e. what is the optimal value of the hyper parameter at each batch/time step. One example of this is the decaying learning rate in neural networks, where we train the model using multiple partial_fits and the hyper parameter - learning rate value would not be a single value but a series values that needs to be used for each time step/batch.
Also you need to loop through the entire dataset multiple times (multiple epochs) to know the best scheduling of the hyper parameters. This needs a basic API change for GridSearchCV.
what is that your are optimizing for?
There is a need to change the evaluation metric of the model now. The metric could be achieving best performance at the end of all partial_fits or reaching the sweet-spot quickly (in fewer batches) for usual metric (precision, recall, f1-score, etc.), some combination of one and two. Hence, this also needs a API change for computing the single value for summarizing the performance of a model, which was trained using multiple partial_fits.
I think this can be solved in a different way. I have encontered the problem that only partial_fit works (data is too big to do full batch learning via fit), so I think scikit-learn should have partial_fit support somewhere.
Instead of having partial_fit in GridSearchCV, you can write a simple wrapper (something like a pytorch DataLoader) which turns a partial_fit model into fit model, and do batch split and shuffle inside the wrapper's fit. Then you can make GridSearchCV work, with extra parameter to be fine-tuned provided by the wrapper (batch_size and is_shuffle)
Do you know if models from scikit-learn use automatically multithreading or just sequential instructions?
Thanks
No. All scikit-learn estimators will by default work on a single thread only.
But then again, it all depends on the algorithm and the problem. If the algorithm is such that which want sequential data, we cannot do anything. If the dataset is multi-class or multi-label and algorithm works on a one-vs-rest basis, then yes it can use multi-threading.
Look for a param n_jobs in the utilities or algorithm you want to use, and set it to -1 for using the multi-threading.
For eg.
LogisticRegression if working in a binary problem will only train a single model, which will require data sequentially, so here using n_jobs have no effect. But it handles multi-class problems as OvR, so it will have to train those many estimators using the same data. In this case you can use the n_jobs=-1.
DecisionTreeClassifier is inherently multi-class enabled and dont need to train multiple models. So we dont have that param there.
Ensemble methods like RandomForestClassifier will train multiple estimators (irrespective of problem type) which individually work on some part of data, so here again we can make use of n_jobs.
Cross-validation utilities like cross_val_score or GridSearchCV will again work on some part of data or some individual parameters, which is independent of other folds, so here also we can use multi-threading capabilities.