I would like to add an oversampling procedure, like SMOTE, to scikit-learn's Pipeline. But transformers only support the fit and transform methods, and they do not provide a way to increase the number of samples and targets.
One possible way to do this is to break the pipeline into two separate pipelines connected by SMOTE sampling.
Are there any better solutions?
Our current Pipeline does not support changing the number of samples between steps, as the Transformer.transform method does not return the y argument that would also need to be resampled. This is a known limitation of the current design. It might be fixed in a future version, but we have not started to work on that yet.
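For reference, a rough sketch of the two-pipeline workaround mentioned in the question, assuming SMOTE comes from the imbalanced-learn package (the step names and X_train/y_train are illustrative placeholders):

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# first pipeline: the transformations applied before resampling
pre = Pipeline([("scale", StandardScaler())])
X_pre = pre.fit_transform(X_train)

# resample between the two pipelines
X_res, y_res = SMOTE().fit_resample(X_pre, y_train)

# second pipeline: the remaining steps, trained on the resampled data
post = Pipeline([("clf", RandomForestClassifier())])
post.fit(X_res, y_res)

At prediction time the two pipelines are simply applied in sequence and the resampling step is skipped, since SMOTE is meant to be applied to training data only.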
Related
I'm a little bit confused about how GridSearchCV works with a train/test split.
As far as I know, when creating models for the dataset I'm using, a paper used roc-auc.
I'm trying to replicate what this paper did, at least as well as I can. From reading a few other posts here, I've gathered that running GridSearchCV on the entire dataset is prone to overfitting, so we should split the data into a training partition and a testing partition. Then, we should run GridSearchCV on the training partition with whatever model and parameter grid, fit it, and then get a score using the test part of the dataset we set aside.
Now, where I'm confused is with GridSearchCV. As far as I understand, it gives us scores for each of the folds that the data is split into when doing the search for parameters, and using best_score_ we can pull the best of these scores. I don't understand what the scores represent, and why you can pass in a scoring parameter to begin with, since the job of GridSearchCV is to always find the best possible parameters anyway. (Perhaps I'm making a poor assumption here, but I'm assuming that there is an objective best set of parameters, regardless of scoring method.) What I figured was that I would find the best parameters with GridSearchCV, then use said parameters to create and fit a model, and finally use that model on the partition I saved for testing and score it using the roc-auc scoring method.
So in the end, does it matter (if at all) what scoring method I pass into GridSearchCV, since it will always look for the best set of parameters anyway, which I will then use to compute my final score on the testing partition?
This document may help.
There you can see that the scoring parameter allows you to use various metrics, such as roc_auc. See here for all of scikit-learn's metrics.
Optimizing over different metrics results in different optimal parameters. Just think about optimizing precision versus recall: optimizing precision leads to fewer false positives, while optimizing recall leads to fewer false negatives.
Also, in GridSearchCV, the CV stands for cross-validated. Train/test splitting happens inside this function; it's taken care of. You only have to provide the splitter as an argument to GridSearchCV, for example cv=StratifiedKFold(n_splits=5, shuffle=True).
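A minimal sketch of the workflow described above (X and y are placeholder arrays; the classifier and parameter grid are just examples):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5]},
    scoring="roc_auc",  # the metric used to rank parameter combinations
    cv=StratifiedKFold(n_splits=5, shuffle=True),
)
search.fit(X_train, y_train)  # best_params_ / best_score_ come from CV on the training set

# final evaluation on the held-out test partition, with the same metric
y_score = search.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_score))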
I have been using pytorch a lot and got used to their dataloaders and transforms, in particular when it comes to data augmentation, as they're very user-friendly and easy to understand.
However, I need to run some ML models from sklearn.
Is there a way to use PyTorch's dataloaders for sklearn?
Yes, you can do this, using sklearn's partial_fit method. Read HERE.
6.1.3. Incremental learning
Finally, for 3. we have a number of options inside scikit-learn. Although not all algorithms can learn incrementally (i.e. without seeing all the instances at once), all estimators implementing the partial_fit API are candidates. Actually, the ability to learn incrementally from a mini-batch of instances (sometimes called "online learning") is key to out-of-core learning as it guarantees that at any given time there will be only a small amount of instances in the main memory. Choosing a good size for the mini-batch that balances relevancy and memory footprint could involve some tuning [1].
Not all algorithms can do this, however.
Then, you can use PyTorch's DataLoader to preprocess the data and feed it in batches to partial_fit.
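A small sketch of that loop, assuming the data is available as numpy arrays X and y (placeholder names):

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset
from sklearn.linear_model import SGDClassifier

dataset = TensorDataset(torch.as_tensor(X, dtype=torch.float32), torch.as_tensor(y))
loader = DataLoader(dataset, batch_size=256, shuffle=True)

clf = SGDClassifier()
classes = np.unique(y)  # partial_fit needs the full set of classes up front
for Xb, yb in loader:
    clf.partial_fit(Xb.numpy(), yb.numpy(), classes=classes)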
I came across the skorch library recently and this could help you.
"The goal of skorch is to make it possible to use PyTorch with sklearn. "
From the skorch docs:
class skorch.dataset.Dataset(X, y=None, length=None)
General dataset wrapper that can be used in conjunction with PyTorch DataLoader.
I guess you could use this Dataset class to wrap your data, use it with a PyTorch DataLoader, and feed the resulting batches to sklearn models. If you would like to use other PyTorch features like PyTorch tensors, you could also do that.
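A short sketch of that idea (X and y are placeholder arrays; whether this fits your augmentation pipeline depends on the transforms you need):

from skorch.dataset import Dataset
from torch.utils.data import DataLoader

ds = Dataset(X, y)  # skorch's general dataset wrapper
loader = DataLoader(ds, batch_size=256, shuffle=True)
# each batch from `loader` can then be converted to numpy and passed to an
# sklearn estimator, e.g. via partial_fit as in the previous answer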
As per the documentation of the RandomizedSearchCV and GridSearchCV modules of sklearn, they only support the fit method of the classifier passed to them and don't support the partial_fit method of classifiers, which can be used for training on an incremental basis. Currently, I am trying to use SGDClassifier, which can be trained on incremental data using the partial_fit method, and also to find the best set of hyperparameters for it. I was just wondering why RandomizedSearchCV and GridSearchCV don't support partial_fit. I don't see any technical reasons why this cannot be done (please correct me if I am wrong here). Any leads will be really appreciated.
Yeah, technically you could write a GridSearchCV for partial_fit as well, but when you think about
what it is that you are searching for, and
what it is that you are optimizing for,
it becomes quite different from what we do with the .fit() approach. Here is my list of reasons for not having partial_fit in GridSearchCV/RandomizedSearchCV.
What is it that you are searching for?
When we optimize the hyperparameters of a model for one batch of data, they could be sub-optimal for the final model (which is trained on the complete data using multiple partial_fit calls). Now the problem becomes finding the best schedule for the hyperparameters, i.e. the optimal value of each hyperparameter at each batch/time step. One example of this is the decaying learning rate in neural networks, where we train the model using multiple partial_fit calls and the hyperparameter (the learning rate) is not a single value but a series of values to be used at each time step/batch.
Also, you need to loop through the entire dataset multiple times (multiple epochs) to know the best scheduling of the hyperparameters. This needs a basic API change for GridSearchCV.
What is it that you are optimizing for?
There is a need to change the evaluation metric of the model now. The metric could be achieving the best performance at the end of all partial_fit calls, or reaching the sweet spot quickly (in fewer batches) on the usual metrics (precision, recall, f1-score, etc.), or some combination of the two. Hence, this also needs an API change for computing the single value that summarizes the performance of a model trained using multiple partial_fit calls.
I think this can be solved in a different way. I have encountered the problem that only partial_fit works (the data is too big to do full-batch learning via fit), so I think scikit-learn should have partial_fit support somewhere.
Instead of having partial_fit in GridSearchCV, you can write a simple wrapper (something like a PyTorch DataLoader) which turns a partial_fit model into a fit model, and do the batch splitting and shuffling inside the wrapper's fit. Then you can make GridSearchCV work, with extra parameters to be tuned provided by the wrapper (batch_size and is_shuffle), as sketched below.
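A rough sketch of such a wrapper; the class name MiniBatchWrapper and its batch_size/is_shuffle/n_epochs parameters are illustrative assumptions, not an existing scikit-learn API:

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone

class MiniBatchWrapper(BaseEstimator, ClassifierMixin):
    def __init__(self, estimator, batch_size=256, is_shuffle=True, n_epochs=1):
        self.estimator = estimator
        self.batch_size = batch_size
        self.is_shuffle = is_shuffle
        self.n_epochs = n_epochs

    def fit(self, X, y):
        # turn a sequence of partial_fit calls into a single fit call
        self.estimator_ = clone(self.estimator)
        classes = np.unique(y)
        indices = np.arange(len(X))
        for _ in range(self.n_epochs):
            if self.is_shuffle:
                np.random.shuffle(indices)
            for start in range(0, len(X), self.batch_size):
                batch = indices[start:start + self.batch_size]
                self.estimator_.partial_fit(X[batch], y[batch], classes=classes)
        return self

    def predict(self, X):
        return self.estimator_.predict(X)

It can then be passed to GridSearchCV like any other estimator, and the wrapper's own parameters (e.g. batch_size) can go in the grid alongside the inner estimator's parameters (e.g. estimator__alpha for an SGDClassifier).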
Do you know if models from scikit-learn automatically use multithreading, or just sequential instructions?
Thanks
No. All scikit-learn estimators will by default work on a single thread only.
But then again, it all depends on the algorithm and the problem. If the algorithm is one that requires the data sequentially, we cannot do anything. If the dataset is multi-class or multi-label and the algorithm works on a one-vs-rest basis, then yes, it can use multi-threading.
Look for a parameter n_jobs in the utility or algorithm you want to use, and set it to -1 to use multi-threading.
For example:
LogisticRegression, when working on a binary problem, will only train a single model, which requires the data sequentially, so here n_jobs has no effect. But it handles multi-class problems as OvR, so it has to train that many estimators on the same data; in this case you can use n_jobs=-1.
DecisionTreeClassifier is inherently multi-class capable and does not need to train multiple models, so we don't have that parameter there.
Ensemble methods like RandomForestClassifier will train multiple estimators (irrespective of the problem type) which individually work on some part of the data, so here again we can make use of n_jobs.
Cross-validation utilities like cross_val_score or GridSearchCV will again work on separate parts of the data or on individual parameter combinations, each independent of the other folds, so here too we can use the multi-threading capabilities.
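A few hedged examples of where n_jobs applies (the models and grids are just illustrations):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier(n_estimators=500, n_jobs=-1)  # trees are trained in parallel

# for LogisticRegression, n_jobs only helps when several one-vs-rest models are trained
logreg = LogisticRegression(n_jobs=-1)

# each fold / parameter combination is independent, so the search itself can be parallelized
grid = GridSearchCV(rf, param_grid={"max_depth": [None, 5, 10]}, n_jobs=-1)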
I wonder if we can set up an "optional" step in sklearn.pipeline. For example, for a classification problem, I may want to try an ExtraTreesClassifier with AND without a PCA transformation ahead of it. In practice, it might be a pipeline with an extra parameter specifying the toggle of the PCA step, so that I can optimize over it via GridSearch, etc. I don't see such an implementation in the sklearn source, but is there any work-around?
Furthermore, since the possible parameter values of a following step in pipeline might depend on the parameters in a previous step (e.g., valid values of ExtraTreesClassifier.max_features depend on PCA.n_components), is it possible to specify such a conditional dependency in sklearn.pipeline and sklearn.grid_search?
Thank you!
From the docs:
Individual steps may also be replaced as parameters, and non-final
steps may be ignored by setting them to None:
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])
params = dict(reduce_dim=[None, PCA(5), PCA(10)],
              clf=[SVC(), LogisticRegression()],
              clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=params)
Pipeline steps cannot currently be made optional in a grid search, but as a quick workaround you could wrap the PCA class into your own OptionalPCA component with a boolean parameter to turn off PCA when requested (sketched below). You might want to have a look at hyperopt to set up more complex search spaces. I think it has good sklearn integration to support this kind of pattern by default, but I cannot find the doc anymore. Maybe have a look at this talk.
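A rough sketch of that OptionalPCA idea; the class name and its enabled parameter are illustrative, not an existing sklearn component:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA

class OptionalPCA(BaseEstimator, TransformerMixin):
    def __init__(self, n_components=None, enabled=True):
        self.n_components = n_components
        self.enabled = enabled

    def fit(self, X, y=None):
        if self.enabled:
            self.pca_ = PCA(n_components=self.n_components).fit(X)
        return self

    def transform(self, X):
        return self.pca_.transform(X) if self.enabled else X

In a grid search, a parameter grid entry like 'reduce_dim__enabled': [True, False] then toggles the PCA step on and off.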
For the dependent parameters problem, GridSearchCV supports trees of parameters to handle this case as demonstrated in the documentation.
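Concretely, param_grid can be a list of dicts, each searched independently, so dependent values can be grouped together (a sketch; the component names and values are just examples):

from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([("reduce_dim", PCA()), ("clf", ExtraTreesClassifier())])
param_grid = [
    {"reduce_dim__n_components": [5],  "clf__max_features": [2, 5]},
    {"reduce_dim__n_components": [10], "clf__max_features": [2, 5, 10]},
]
search = GridSearchCV(pipe, param_grid=param_grid)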