In the context of model selection for a classification problem, while running cross validation, is it ok to specify n_jobs=-1 both in model specification and cross validation function in order to take full advantage of the power of the machine?
For example, comparing sklearn RandomForestClassifier and xgboost XGBClassifier:
RF_model = RandomForestClassifier( ..., n_jobs=-1)
XGB_model = XGBClassifier( ..., n_jobs=-1)
RF_cv = cross_validate(RF_model, ..., n_jobs=-1)
XGB_cv = cross_validate(XGB_model, ..., n_jobs=-1)
Is it OK to specify the parameter in both? Or should I specify it only once? And if only once, in which of them, the model or the cross-validation call?
I used models from two different libraries (sklearn and xgboost) for the example because there may be a difference in how parallelism works between them; the cross_validate function itself is from sklearn.
Specifying n_jobs twice does have an effect, though whether it has a positive or negative effect is complicated.
When you specify n_jobs twice, you get two levels of parallelism. Imagine you have N cores. The cross-validation function creates N copies of your model. Each model creates N threads to run fitting and predictions. You then have N*N threads.
This can blow up pretty spectacularly. I once worked on a program which needed to apply ARIMA to tens of thousands of time-series. Since each ARIMA is independent, I parallelized it and ran one ARIMA on each core of a 12-core CPU. I ran this, and it performed very poorly. I opened up htop, and was surprised to find 144 threads running. It turned out that this library, pmdarima, internally parallelized ARIMA operations. (It doesn't parallelize them well, but it does try.) I got a massive speedup just by turning off this inner layer of parallelism. Having two levels of parallelism is not necessarily better than having one.
In your specific case, I benchmarked a random forest with cross-validation in four configurations:
No parallelism
Parallelize across different CV folds, but no model parallelism
Parallelize within the model, but not on CV folds
Do both
(Error bars represent 95% confidence interval. All tests used RandomForestClassifier. Test was performed using cv=5, 100K samples, and 100 trees. Test system had 4 cores with SMT disabled. Scores are mean duration of 7 runs.)
This graph shows that no parallelism is the slowest, CV parallelism is third fastest, and model parallelism and combined parallelism are tied for first place.
However, this is closely tied to what classifiers I'm using - a benchmark for pmdarima, for example, would find that cross-val parallelism is faster than model parallelism or combined parallelism. If you don't know which one is faster, then test it.
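To make "test it" concrete, here is a minimal sketch of how the four configurations above could be timed on your own hardware; the synthetic dataset and single-run timing are simplifying assumptions (repeat each configuration several times for reliable numbers):
# Rough timing of the four n_jobs configurations; n_jobs=None means sequential.
from timeit import default_timer as timer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

configs = {
    "no parallelism":       {"model": None, "cv": None},
    "CV parallelism":       {"model": None, "cv": -1},
    "model parallelism":    {"model": -1,   "cv": None},
    "combined parallelism": {"model": -1,   "cv": -1},
}

for name, jobs in configs.items():
    model = RandomForestClassifier(n_estimators=100, n_jobs=jobs["model"])
    start = timer()
    cross_validate(model, X, y, cv=5, n_jobs=jobs["cv"])
    print(f"{name}: {timer() - start:.1f}s")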
Related
I tried to read the docs for RepeatedStratifiedKFold and StratifiedKFold, but couldn't tell the difference between the two methods except that RepeatedStratifiedKFold repeats StratifiedKFold n times with different randomization in each repetition.
My question is: Do these two methods return the same results? Which one should I use to split an imbalanced dataset when doing GridSearchCV and what is the rationale for choosing that method?
Both StratifiedKFold and RepeatedStratifiedKFold can be very effective when used on classification problems with a severe class imbalance. They both stratify the sampling by the class label; that is, they split the dataset in such a way that preserves approximately the same class distribution (i.e., the same percentage of samples of each class) in each subset/fold as in the original dataset. However, a single run of StratifiedKFold might result in a noisy estimate of the model's performance, as different splits of the data might result in very different results. That is where RepeatedStratifiedKFold comes into play.
RepeatedStratifiedKFold improves the estimate of a machine learning model's performance by simply repeating the cross-validation procedure multiple times (according to the n_repeats value) and reporting the mean result across all folds from all runs. This mean is expected to be a more accurate estimate of the model's performance (see this article).
Thus, to answer your question: no, these two methods would not provide the same results. With RepeatedStratifiedKFold, each repetition splits the dataset into a different set of stratified k folds, and hence the performance results differ.
RepeatedStratifiedKFold has the benefit of a better performance estimate at the cost of fitting and evaluating many more models. If, for example, 5 repeats (i.e., n_repeats=5) of 10-fold cross-validation were used to estimate the model's performance, then 50 different models would need to be fitted (trained) and evaluated, which might be computationally expensive depending on the dataset's size, the type of machine learning algorithm, the device specifications, etc. However, the repeated cross-validation process can be spread over multiple cores or machines, which can dramatically speed it up. For instance, setting n_jobs=-1 will use all the cores available on your system (have a look here).
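As a minimal sketch of how this fits into GridSearchCV on an imbalanced problem (the synthetic dataset, parameter grid, and F1 scoring below are illustrative assumptions, not part of the original answer):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# An imbalanced toy problem: roughly 5% positives.
X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1_000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="f1",   # prefer an imbalance-aware metric over plain accuracy
    cv=cv,          # 10 folds x 5 repeats = 50 fits per candidate
    n_jobs=-1,      # run those fits in parallel on all cores
)
search.fit(X, y)
print(search.best_params_, search.best_score_)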
When it comes to evaluation, make sure to use appropriate metrics, as described in this answer.
I often use GridSearchCV for hyperparameter tuning. For example, for tuning regularization parameter C in Logistic Regression. Whenever an estimator I am using has its own n_jobs parameter I am confused where to set it, in estimator or in GridSearchCV, or in both? Same thing applies to cross_validate.
This is a very interesting question. I don't have a definitive answer, but here are some elements worth mentioning to understand the issue that don't fit in a comment.
Let's start with why you should or should not use multiprocessing:
Multiprocessing is useful for independent tasks. This is the case in a GridSearch, where all the different variations of your model are independent.
Multiprocessing is not useful, or even makes things slower, when:
Tasks are too small: creating a new process takes time, and if each task is really small, this overhead slows down the execution of the whole code.
Too many processes are spawned: your computer has a limited number of cores. If you have more processes than cores, a load-balancing mechanism will force the computer to regularly switch between the running processes. These switches take some time, resulting in slower execution.
The first takeaway is that you should not set n_jobs in both the GridSearch and the model you're optimizing, because you will spawn a lot of processes and end up slowing down the execution.
Now, a lot of sklearn models and functions are based on NumPy/SciPy, which in turn are usually implemented in C/Fortran and may already be parallelized at that level. That means that these should not be used with n_jobs > 1 set in the GridSearch.
If you assume your model is not already parallelized, you can choose to set n_jobs at the model level or at the GridSearch level. A few models can be fully parallelized (RandomForest, for instance), but most have at least some sequential part (boosting, for instance). On the other hand, GridSearch has no sequential component by design, so it makes sense to set n_jobs in the GridSearch rather than in the model.
That being said, it depends on the implementation of the model, and you can't have a definitive answer without testing your own case. For example, if your pipeline consumes a lot of memory for some reason, setting n_jobs in the GridSearch may cause memory issues.
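A minimal sketch of that default choice (parallelize the GridSearch, leave the estimator sequential); the dataset and grid below are placeholders:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=10_000, random_state=0)

# Leave the estimator sequential (n_jobs=None) ...
estimator = RandomForestClassifier(n_jobs=None, random_state=0)

# ... and let GridSearchCV spread the independent (candidate, fold) fits over all cores.
search = GridSearchCV(
    estimator,
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)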
As a complement, here is a very interesting note on parallelism in sklearn
How are we supposed to use the dictionary output from lightgbm.cv to improve our predictions?
Here's an example - we train our cv model using the code below:
cv_mod = lgb.cv(params,
                d_train,
                num_boost_round=500,
                nfold=10,
                early_stopping_rounds=25,
                stratified=True)
How can we use the parameters found from the best iteration of the above code to predict an output? In this case, cv_mod has no "predict" method like lightgbm.train, and the dictionary output from lightgbm.cv throws an error when used in lightgbm.train.predict(..., pred_parameters = cv_mod).
Am I missing an important transformation step?
In general, the purpose of CV is NOT to do hyperparameter optimisation. The purpose is to evaluate the performance of the model-building procedure.
A basic train/test split is conceptually identical to a 1-fold CV (with a custom size of the split, in contrast to the 1/K test size in k-fold CV). The advantage of doing more splits (i.e. k > 1 CV) is to get more information about the estimate of the generalisation error: you get both the error estimate and its statistical uncertainty. There is an excellent discussion on CrossValidated (start with the links added to the question, which cover the same question, formulated in a different way). It covers nested cross-validation and is absolutely not straightforward. But if you wrap your head around the concept in general, it will help you in various non-trivial situations. The idea to take away is: the purpose of CV is to evaluate the performance of the model-building procedure.
Keeping that idea in mind, how does one approach hyperparameter estimation in general (not only in LightGBM)?
You want to train a model with a set of parameters on some data and evaluate each variation of the model on an independent (validation) set. Then you intend to choose the best parameters by choosing the variant that gives the best evaluation metric of your choice.
This can be done with a simple train/test split. But the evaluated performance, and thus the choice of the optimal model parameters, might just be a fluctuation of a particular split.
Thus, you can evaluate each of those models in a more statistically robust way by averaging the evaluation over several train/test splits, i.e. k-fold CV.
Then you can go a step further and keep an additional hold-out set, separated before the hyperparameter optimisation started, so you can evaluate the chosen best model on it to measure the final generalisation error. You can go even further and, instead of a single test sample, use an outer CV loop, which brings us to nested cross-validation.
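A minimal sketch of that nested structure in sklearn terms (the data and grid are placeholders): an inner CV chooses the hyperparameters, an outer CV estimates the generalisation error of the whole procedure.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=2_000, random_state=0)

inner_search = GridSearchCV(LogisticRegression(max_iter=1_000),
                            param_grid={"C": [0.01, 0.1, 1, 10]},
                            cv=5)                          # inner loop: tuning
outer_scores = cross_val_score(inner_search, X, y, cv=5)   # outer loop: evaluation
print(outer_scores.mean(), outer_scores.std())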
Technically, lightgbm.cv() only allows you to evaluate performance on a k-fold split with fixed model parameters. For hyperparameter tuning you will need to run it in a loop, providing different parameters and recording the averaged performance, and choose the best parameter set after the loop is complete. This interface is different from sklearn, which provides you with complete functionality to do hyperparameter optimisation in a CV loop. Personally, I would recommend using the sklearn API of lightgbm. It is just a wrapper around the native lightgbm.train() functionality, so it is not slower. But it allows you to use the full stack of the sklearn toolkit, which makes your life MUCH easier.
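For example, a minimal sketch of that sklearn-API route (the grid, data, and settings below are illustrative): tune LGBMClassifier with GridSearchCV, and the refitted best_estimator_ then has an ordinary predict() method.
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=10_000, random_state=0)

search = GridSearchCV(
    LGBMClassifier(n_estimators=500),
    param_grid={"num_leaves": [31, 63], "learning_rate": [0.05, 0.1]},
    cv=10,
    n_jobs=-1,
)
search.fit(X, y)

best_model = search.best_estimator_   # refit on the full training data by default
predictions = best_model.predict(X)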
If you're happy with your CV results, you just use those parameters to call the 'lightgbm.train' method. Like #pho said, CV is usually just for param tuning. You don't use the actual CV object for predictions.
You should use CV for parameter optimization.
If your model performs well on all folds use these parameters to train on the whole training set.
Then evaluate that model on the external test set.
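A minimal sketch of that final step with the native API (best_params and best_num_rounds stand in for whatever your CV selected; the data split is illustrative):
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

best_params = {"objective": "binary", "num_leaves": 31, "learning_rate": 0.05}
best_num_rounds = 200   # e.g. the best iteration reported by lgb.cv

# Train once on the whole training set with the chosen parameters ...
final_model = lgb.train(best_params,
                        lgb.Dataset(X_train, label=y_train),
                        num_boost_round=best_num_rounds)

# ... then evaluate on the external test set.
test_preds = (final_model.predict(X_test) > 0.5).astype(int)
print("test accuracy:", accuracy_score(y_test, test_preds))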
Relatively new to model building with sklearn. I know cross validation can be parallelized via the n_jobs parameter, but if I'm not using CV, how can I utilize my available cores to speed up model fitting?
There are alternatives like XGBoost or LightGBM that are distributed (i.e., can run in parallel). These are well-documented and popular boosting algorithms.
At the moment, AFAIK, there is no "n_jobs"-like parameter for the GradientBoosting method, but you can try HistGradientBoostingRegressor, which uses OpenMP for parallelization (it will use as many threads as possible).
You can read more here:
https://scikit-learn.org/stable/modules/ensemble.html
https://scikit-learn.org/stable/modules/computing.html#parallelism
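A minimal sketch with synthetic data (assumed here only for illustration); note that on older scikit-learn versions HistGradientBoostingRegressor was still experimental and needed an extra enabling import.
# On scikit-learn < 1.0 you would also need, before the ensemble import:
# from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = make_regression(n_samples=100_000, n_features=20, random_state=0)

model = HistGradientBoostingRegressor(max_iter=200)  # uses all available threads via OpenMP
model.fit(X, y)
print(model.score(X, y))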
I can only make 2-3 predictions per second with this model which is super slow.
When using LinearRegression model I can easily achieve 40x speedup.
I'm using scikit-learn python package with a very simple dataset containing 3 columns (day, hour and result) so basically 2 features.
day and hour are categorical variables.
Naturally there are 7 day and 24 hour categories.
The training sample is relatively small (circa 5000 samples).
It takes just a few seconds to train it.
But when I go on to predict something, it's very slow.
So my question is: is this a fundamental characteristic of RandomForestRegressor, or can I actually do something about it?
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100,
                              max_features='auto',
                              oob_score=True,
                              n_jobs=-1,
                              random_state=42,
                              min_samples_leaf=2)
Here are some steps to optimize a RandomForest with sklearn
1. Do batch predictions by passing multiple datapoints to predict(). This reduces the per-call Python overhead.
2. Reduce the depth of the trees. Use something like min_samples_leaf or min_samples_split to avoid lots of small decision nodes. To require 5% of the training set per leaf, use 0.05.
3. Reduce the number of trees. With somewhat pruned trees, RF can often perform OK with as few as n_estimators=10.
4. Use an optimized RF inference implementation like emtrees. This is the last thing to try, and it also depends on the prior steps performing well.
The performance of the optimized model must be validated, using cross-validation or similar. Steps 2 and 3 are related, so one can do a grid search to find the combination that best preserves model performance.
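A minimal sketch of steps 1-3 (the leaf fraction and tree count below are starting points to tune, not recommendations, and the synthetic data is only a placeholder):
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=5_000, n_features=2, n_informative=2, random_state=42)

model = RandomForestRegressor(
    n_estimators=10,        # step 3: fewer trees
    min_samples_leaf=0.05,  # step 2: each leaf must cover at least 5% of the training set
    n_jobs=-1,
    random_state=42,
)
model.fit(X, y)

# Step 1: predict many rows in one call instead of one row at a time.
predictions = model.predict(X[:1000])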