I have an H2O AutoML-generated GBM model in Python. I wonder if it can be converted into a standard sklearn model so that I can fit it into my ecosystem of other sklearn models.
I can see the model properties as below when I print the model.
If direct conversion from H2O to sklearn is not feasible, is there a way to use the above properties to recreate the GBM in sklearn? The terminology looks slightly different from the standard sklearn GBM parameters.
Thanks in advance.
It will be tricky, since the two packages are quite different: scikit-learn is based on Python/Cython/C, while H2O runs on Java. The underlying algorithm implementations may also differ. However, you can try matching/translating your hyperparameters between the two, since they are broadly similar.
Additionally, it would be a good idea to keep your ecosystem library-agnostic, so that you can interchange models from different frameworks.
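To make the hyperparameter translation concrete, here is a minimal sketch, assuming your H2O printout exposes common GBM parameters like ntrees, max_depth, learn_rate, min_rows and sample_rate (the values below are placeholders; substitute your own):

    from sklearn.ensemble import GradientBoostingClassifier

    # Hypothetical values read off the H2O model printout -- replace with yours.
    h2o_params = {
        "ntrees": 50,
        "max_depth": 5,
        "learn_rate": 0.1,
        "min_rows": 10,      # minimum observations per leaf
        "sample_rate": 0.8,  # row sampling rate per tree
    }

    # Rough H2O -> sklearn translation of the hyperparameters.
    sk_gbm = GradientBoostingClassifier(
        n_estimators=h2o_params["ntrees"],        # ntrees      -> n_estimators
        max_depth=h2o_params["max_depth"],        # max_depth   -> max_depth
        learning_rate=h2o_params["learn_rate"],   # learn_rate  -> learning_rate
        min_samples_leaf=h2o_params["min_rows"],  # min_rows    -> min_samples_leaf
        subsample=h2o_params["sample_rate"],      # sample_rate -> subsample
    )

Note that this only transfers hyperparameters, not the fitted trees: you still have to call sk_gbm.fit on the original training data, and the two implementations will generally not produce identical models.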
When dealing with class imbalance, penalizing the majority class is a common practice I have come across while building machine learning models, so I often use class weights after re-sampling. LightGBM is an efficient decision-tree-based framework that is believed to handle class imbalance well, so I am using a LightGBM model for my binary classification problem. The dataset has a high class imbalance, with a ratio of 34:1.
I initially used the LightGBM classifier with the 'class weights' parameter. However, the LightGBM documentation says to use this parameter for multi-class problems only; for binary classification, it suggests using the 'is_unbalance' or 'scale_pos_weight' parameters. But with class weights I see better results, and it is also easier to tune the weights and track the model's performance than with the other two parameters.
Since the documentation recommends against using it for binary classification, are there any repercussions of using the parameter? I am getting good results with it on my test and validation data, but I wonder if it will behave differently on real-world data.
Documentation recommends alternative parameters:
Use this parameter only for multi-class classification task; for binary classification task you may use is_unbalance or scale_pos_weight parameters.
https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html
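For a binary problem, the two parameterizations end up doing much the same thing: both re-weight the positive class in the training loss. A minimal sketch (the weight value 34 matches your imbalance ratio but is illustrative, not tuned):

    from lightgbm import LGBMClassifier

    # What the docs recommend for binary classification.
    clf_spw = LGBMClassifier(scale_pos_weight=34)

    # Class weights; for two classes this reduces to the same kind of
    # per-class re-weighting of the loss.
    clf_cw = LGBMClassifier(class_weight={0: 1, 1: 34})

Either way, validate with an imbalance-robust metric (average precision, ROC AUC) on a held-out set rather than accuracy; that will tell you more about behavior on future data than which parameter you used.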
I have been using PyTorch a lot and got used to its DataLoaders and transforms, in particular when it comes to data augmentation, as they're very user-friendly and easy to understand.
However, I need to run some ML models from sklearn.
Is there a way to use PyTorch's DataLoaders with sklearn?
Yes, you can. You can do this with sklearn's partial_fit method; the relevant section of the sklearn docs is quoted below.
6.1.3. Incremental learning
Finally, for 3. we have a number of options inside scikit-learn. Although all algorithms cannot learn incrementally (i.e. without seeing all the instances at once), all estimators implementing the partial_fit API are candidates. Actually, the ability to learn incrementally from a mini-batch of instances (sometimes called "online learning") is key to out-of-core learning as it guarantees that at any given time there will be only a small amount of instances in the main memory. Choosing a good size for the mini-batch that balances relevancy and memory footprint could involve some tuning [1].
Not all algorithms can do this, however.
Then, you can use PyTorch's DataLoader to preprocess the data and feed it in mini-batches to partial_fit.
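A minimal sketch of that loop, using SGDClassifier as the incremental estimator (any estimator with partial_fit would do) and toy tensors in place of your real dataset:

    import numpy as np
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from sklearn.linear_model import SGDClassifier

    # Toy data standing in for your real Dataset (with transforms etc.).
    X = torch.randn(1000, 20)
    y = torch.randint(0, 2, (1000,))
    loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

    clf = SGDClassifier()
    classes = np.array([0, 1])  # must be supplied on the first call

    for xb, yb in loader:
        # DataLoader yields torch tensors; sklearn expects numpy arrays.
        clf.partial_fit(xb.numpy(), yb.numpy(), classes=classes)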
I came across the skorch library recently and this could help you.
"The goal of skorch is to make it possible to use PyTorch with sklearn. "
From the skorch docs:
class skorch.dataset.Dataset(X, y=None, length=None)
General dataset wrapper that can be used in conjunction with PyTorch DataLoader.
I guess you could use this Dataset class to wrap your data so it can be consumed by a PyTorch DataLoader, and then use sklearn models on top. If you would like to use other PyTorch features, like PyTorch tensors, you could do that as well.
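A minimal sketch of that wrapper in use, assuming numpy inputs (the batches come out as torch tensors thanks to the default collate function):

    import numpy as np
    from torch.utils.data import DataLoader
    from skorch.dataset import Dataset

    X = np.random.rand(100, 10).astype("float32")
    y = np.random.randint(0, 2, size=100)

    # The wrapper quoted above: makes (X, y) consumable by a DataLoader.
    ds = Dataset(X, y)
    loader = DataLoader(ds, batch_size=16)

    for xb, yb in loader:
        print(xb.shape, yb.shape)  # e.g. torch.Size([16, 10]) torch.Size([16])
        break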
I'm trying to ensemble three different models (FastText, SVM, NaiveBayes).
I thought of using Python to do this. I'm sure that we can ensemble NaiveBayes and SVM models, but can we ensemble FastText using Python?
Can anyone please advise on this?
The approach you apply for combining multiple models is not tied to the language (Python/Java/R) you use to implement it.
Maybe what you are looking for is Ensemble learning.
One of the most popular approaches to building an ensemble of different models is stacking, which involves training a new model to combine the predictions of the individual models you have already trained. See this tutorial, which uses Python; a minimal sketch also follows the list below.
In your use case, since you're dealing with 3 models, you should keep in mind that:
The models have different mechanics for their predict() methods:
FastText uses an internal file (a serialized model with a .bin extension, for example) containing all the embeddings and wordNgrams, and you can pass raw text to it directly;
for SVM and NaiveBayes you're obliged to pre-process the data with CountVectorizer or TfidfVectorizer plus a LabelEncoder, get the prediction, and pass it back through the LabelEncoder to deliver the final label.
At the end you will need to deal with the models' differently scaled probabilities (if you're predicting with k > 1), and you will probably need to take care of reconciling them.
If you're going to serialize this for production, you'll need to pickle the SVM and NB models and use the .bin file for the FastText model, and of course the vectorizers for the former need to be instantiated too. This can be a bit painful for response time if you need to predict in near real time.
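To tie those points together, here is a minimal stacking sketch. The fastText part is stubbed out (fasttext_proba is a hypothetical placeholder for parsing your loaded model's predict output into a positive-class probability), and for brevity the meta-model is fit on training predictions; a proper version would use out-of-fold predictions:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    svm = make_pipeline(TfidfVectorizer(), SVC(probability=True))
    nb = make_pipeline(TfidfVectorizer(), MultinomialNB())

    def fasttext_proba(texts):
        # Placeholder: with the real bindings you would parse the labels and
        # scores returned by the fastText model's predict into P(positive).
        return np.full(len(texts), 0.5)

    def stack_features(texts):
        # One positive-class probability column per base model.
        return np.column_stack([
            svm.predict_proba(texts)[:, 1],
            nb.predict_proba(texts)[:, 1],
            fasttext_proba(texts),
        ])

    texts = ["good product", "terrible service", "love it", "awful support",
             "works great", "broke in a day", "very happy", "never again",
             "excellent value", "waste of money", "would buy again", "total junk"]
    labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

    svm.fit(texts, labels)
    nb.fit(texts, labels)
    meta = LogisticRegression().fit(stack_features(texts), labels)

    print(meta.predict(stack_features(["really great stuff"])))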
I would like to know if it is possible to train an SVM classifier using scikit-learn in Python (I love this module and its documentation) and import that trained model into C++ for making predictions.
Here is how far I got:
I have written a Python script which uses scikit-learn to create a reasonable SVM classifier
I can also store that model in pickle format
Now, I have had a look at libSVM for C++, but I do not see how it could import such a model. Either the documentation is not that good, or I missed something here.
However, I also thought that instead of storing the whole model, I could just store the parameters of the SVM classifier and load only those parameters (I think the needed ones are: support vectors, C, degree) for a linear SVM classifier. Unfortunately, I cannot find any libSVM documentation on how to do that.
A last option, which I would prefer to avoid, would be to go with OpenCV, in which I could train an SVM classifier, store it, and load it back, all in C++. But this would introduce yet another (and especially large) library dependency for my program. If there is a good way to avoid that, I would love to do so.
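On the Python side, extracting the parameters of a linear SVM looks straightforward, since at prediction time only the weight vector w and intercept b are needed; here is a minimal sketch of what I had in mind (the file format is just something I made up):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    clf = LinearSVC().fit(X, y)

    w = clf.coef_.ravel()   # weight vector
    b = clf.intercept_[0]   # bias term

    # Made-up plain-text format: one line of weights, one line of bias.
    with open("svm_params.txt", "w") as f:
        f.write(" ".join(str(v) for v in w) + "\n")
        f.write(str(b) + "\n")

    # Sanity check: the binary decision rule is just sign(w . x + b).
    assert ((X @ w + b > 0).astype(int) == clf.predict(X)).all()

So in the worst case I could hand-roll the sign(w . x + b) rule in C++ from those two lines, but I was hoping libSVM had a supported way to load parameters.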
As always I thank you in advance!
Best,
Tukk
While running in production, is it possible to update a trained model with new data without re-fitting it? I see you can use the warm_start parameter to add trees to the model; however, I am looking for a way to update the existing trees with the incoming data.
As far as I can tell, this is not possible with sklearn (as they seem to implement the classical Breiman algorithm). However, you might have a look at Mondrian Forests (https://papers.nips.cc/paper/5234-mondrian-forests-efficient-online-random-forests.pdf, python implementation: https://github.com/balajiln/mondrianforest).
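To make the warm_start distinction from the question concrete: in sklearn it appends new trees fitted on the new data, while the existing trees are left untouched. A small sketch:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X1, y1 = make_classification(n_samples=500, random_state=0)
    X2, y2 = make_classification(n_samples=500, random_state=1)

    rf = RandomForestClassifier(n_estimators=100, warm_start=True)
    rf.fit(X1, y1)               # 100 trees grown on the first batch

    rf.n_estimators += 50        # enlarge the ensemble...
    rf.fit(X2, y2)               # ...50 *new* trees on the new batch

    print(len(rf.estimators_))   # 150 -- the original 100 were not updated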