While running in production, is it possible to update a trained model with new data without re-fitting the model? I see you can use the warm_start parameter to enable adding trees to the model; however, I am looking for a way to update the existing trees with the incoming data.
As far as I can tell, this is not possible with sklearn (as they seem to implement the classical Breiman algorithm). However, you might have a look at Mondrian Forests (https://papers.nips.cc/paper/5234-mondrian-forests-efficient-online-random-forests.pdf, python implementation: https://github.com/balajiln/mondrianforest).
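For completeness, here is a minimal sketch (with synthetic data) of what warm_start does and does not do: incrementing n_estimators and refitting appends trees trained on the new batch, but the previously fitted trees are never revisited.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_old, y_old = make_classification(n_samples=500, random_state=0)
X_new, y_new = make_classification(n_samples=500, random_state=1)

clf = RandomForestClassifier(n_estimators=100, warm_start=True, random_state=0)
clf.fit(X_old, y_old)

# Grow 50 extra trees on the new batch; the first 100 trees stay as they were.
clf.n_estimators += 50
clf.fit(X_new, y_new)

print(len(clf.estimators_))  # 150
```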
I am trying to improve my classification model. Using statsmodels with LogisticRegression, I noticed that some features which did not pass the t-test and have little influence in that model are very important when I switch models: for example, looking at the feature_importances_ of a RandomForestClassifier, the most important feature there had almost no influence on the LogisticRegression.
With this in mind, I thought of using LogisticRegression without this feature and calling predict_proba to get the probabilities, then creating another model with RandomForest, this time using all features plus the LogisticRegression probabilities. Or I could take the probabilities from many models and use them as features for another model. Does any of this make sense? I don't know whether I am introducing any bias by doing this, and if so, why.
I found out that what I was doing was stacking, but instead of using another model's predicted class as a feature, I was using the predicted probability of class 1 (predict_proba).
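For reference, a minimal sketch of this kind of stacking with predict_proba, on synthetic data. Out-of-fold predictions are used for the training set, which addresses the bias concern raised above (fitting the meta-features on the same rows the base model was trained on leaks the labels):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LogisticRegression(max_iter=1000)

# Out-of-fold P(y=1) for the training set avoids optimistic bias.
train_proba = cross_val_predict(lr, X_train, y_train, cv=5,
                                method="predict_proba")[:, 1]

lr.fit(X_train, y_train)
test_proba = lr.predict_proba(X_test)[:, 1]

# Append the probability column to the original features.
X_train_stacked = np.column_stack([X_train, train_proba])
X_test_stacked = np.column_stack([X_test, test_proba])

rf = RandomForestClassifier(random_state=0)
rf.fit(X_train_stacked, y_train)
print(rf.score(X_test_stacked, y_test))
```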
When dealing with a class imbalance problem, penalizing the majority class is a common practice that I have come across while building machine learning models, so I often use class weights after re-sampling. LightGBM is an efficient decision-tree-based framework that is believed to handle class imbalance well, so I am using a LightGBM model for my binary classification problem. The dataset is highly imbalanced, with a class ratio of 34:1.
I initially used the LightGBM classifier with the class_weight parameter. However, the documentation of LGBMClassifier says to use this parameter only for multi-class problems; for binary classification it suggests the is_unbalance or scale_pos_weight parameters. But with class weights I see better results, and the weights are also easier to tune, and the model's performance easier to track, than with the other two parameters.
But since the documentation recommends against using it for binary classification, are there any repercussions of using this parameter? I am getting good results with it on my test and validation data, but I wonder whether it will behave differently on real-world data.
The documentation recommends alternative parameters:
Use this parameter only for multi-class classification task; for binary classification task you may use is_unbalance or scale_pos_weight parameters.
https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html
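For comparison, a minimal sketch of the three options (the weight of 34 below is just the stated imbalance ratio, not a tuned value; is_unbalance and scale_pos_weight should not be combined):

```python
from lightgbm import LGBMClassifier

# Option 1: what the question describes -- sklearn-style class weights.
clf_cw = LGBMClassifier(class_weight={0: 1, 1: 34})

# Option 2: let LightGBM rebalance the classes automatically.
clf_unbal = LGBMClassifier(is_unbalance=True)

# Option 3: an explicit weight on the positive class (tunable, like class_weight).
clf_spw = LGBMClassifier(scale_pos_weight=34)
```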
I'm trying to ensemble three different models (FastText, SVM, NaiveBayes).
I thought of using Python to do this. I'm sure that we can ensemble the NaiveBayes and SVM models, but can we also ensemble fastText using Python?
Can anyone please advise?
The approach you apply for combining multiple models is independent of the language (Python, Java, R) you use to implement it.
Maybe what you are looking for is Ensemble learning.
One of the most popular approaches to building an ensemble of different models is stacking, which involves training a new model to combine the predictions of the individual models you have already trained. See this tutorial, which uses Python.
Since you're dealing with 3 models in your use case, you should keep in mind that:
The models have different mechanics for their predict() method:
FastText uses an internal file (a serialized model with a .bin extension, for example) containing all the embeddings and wordNGrams, and you can pass raw text to it directly;
For SVM and NaiveBayes you are obliged to pre-process the data with CountVectorizer or TfidfVectorizer and a LabelEncoder, get the prediction, pass it back through the LabelEncoder, and deliver the result.
At the end you will need to deal with the different probability outputs (if you're predicting with k > 1), and you will probably need to handle this explicitly.
If you're going to serialize this for production, you'll need to pickle the SVM and NB models and use the .bin file for the FastText model, and of course the vectorizers fitted for the former models need to be persisted and instantiated too. This can be a bit of a pain for your response time if you need to predict in near real time. A rough sketch of this plumbing is given below.
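Here is that sketch, under stated assumptions: the training data and the model.bin path are placeholders, and the fastText model is assumed to have been trained with __label__0/__label__1 labels. Each model's class probabilities are stacked as features for a simple meta-classifier (out-of-fold predictions would be better in practice to avoid leakage):

```python
import numpy as np
import fasttext
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Placeholder training data and model path -- replace with your own.
texts_train = ["good product", "bad service", "great value", "awful quality",
               "really nice", "very poor", "works well", "broke instantly"]
y_train = [1, 0, 1, 0, 1, 0, 1, 0]
labels = [0, 1]

# FastText predicts directly from raw text via its serialized .bin model.
ft = fasttext.load_model("model.bin")

# SVM and NaiveBayes need the text vectorized first.
vec = TfidfVectorizer()
X_train = vec.fit_transform(texts_train)
svm = SVC(probability=True).fit(X_train, y_train)
nb = MultinomialNB().fit(X_train, y_train)

def stacked_features(texts):
    """Concatenate each model's class probabilities into one feature row."""
    X = vec.transform(texts)
    ft_probs = []
    for t in texts:
        # predict(k=...) returns labels and probabilities in score order,
        # so re-order them into a fixed class order before stacking.
        pred_labels, pred_probs = ft.predict(t, k=len(labels))
        lookup = dict(zip(pred_labels, pred_probs))
        ft_probs.append([lookup.get(f"__label__{l}", 0.0) for l in labels])
    return np.column_stack([svm.predict_proba(X), nb.predict_proba(X), ft_probs])

meta = LogisticRegression().fit(stacked_features(texts_train), y_train)
print(meta.predict(stacked_features(["a new document to classify"])))
```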
I would like to know whether there is a possibility to train an SVM classifier using scikit in Python (I love this module and its documentation) and import that trained model into C++ for making predictions.
Here is how far I got:
I have written a Python script which uses scikit to create a reasonable SVM classifier
I can also store that model in pickle format
Now, I had a look at libSVM for C++, but I do not see how it could import such a model. I think the documentation is not that good, or I have missed something here.
However, I also thought that instead of storing the whole model, I could store just the parameters of the SVM classifier and load only those (I think the needed ones for a linear SVM classifier are the support vectors, C, and degree). Unfortunately, I cannot find any documentation in libSVM on how to do that.
A last option, which I would prefer to avoid, would be to go with OpenCV, in which I could train an SVM classifier, store it, and load it back, all in C++. But this would introduce even more library dependencies (and especially large ones) for my program. If there is a good way to avoid that, I would love to do so.
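A minimal sketch of the parameter-export idea for the linear case: a linear SVM is fully determined by its weight vector (coef_) and intercept (intercept_), since decision_function(x) = coef · x + intercept, so those can be written to a plain text file and parsed from C++ without libSVM at all. The file name below is just an illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, random_state=0)
clf = LinearSVC().fit(X, y)

# One weight row (binary case) followed by the intercept;
# trivially parsed by any C++ program.
np.savetxt("svm_params.txt", np.append(clf.coef_.ravel(), clf.intercept_))
```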
As always I thank you in advance!
Best,
Tukk
I am doing machine learning using scikit-learn as recommended in this question. To my surprise, it does not appear to provide access to the actual models it trains. For example, if I create an SVM, linear classifier or even a decision tree, it doesn't seem to provide a way for me to see the parameters selected for the actual trained model.
Seeing the actual model is useful if the model is being created partly to get a clearer picture of what features it is using (e.g., decision trees). Seeing the model is also a significant issue if one wants to use Python to train the model and some other code to actually implement it.
Am I missing something, or is there some way to get at this in scikit-learn? If not, what is a good free machine learning workbench, not necessarily in Python, in which models are transparently available?
The fitted model parameters are stored directly as attributes on the model instance. There is a specific naming convention for these fitted parameters: they all end with a trailing underscore, as opposed to user-provided constructor parameters (a.k.a. hyperparameters), which don't.
The type of the fitted attributes is algorithm-dependent. For instance, for a kernel Support Vector Machine you will have the arrays of support vectors, dual coefficients, and intercepts, while for random forests and extremely randomized trees you will have a collection of binary trees (internally represented in memory as contiguous numpy arrays for performance reasons: a structure-of-arrays representation).
See the Attributes section of the docstring of each model for more details, for instance for SVC:
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
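A quick illustration of the convention for a kernel SVM:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)
svc = SVC(kernel="rbf").fit(X, y)

print(svc.support_vectors_.shape)  # the fitted support vectors
print(svc.dual_coef_.shape)        # dual coefficients of the support vectors
print(svc.intercept_)              # intercept term(s)
```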
For tree-based models there is also a helper function to generate a Graphviz export of the learned trees:
http://scikit-learn.org/stable/modules/tree.html#classification
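For example, using sklearn.tree.export_graphviz (the resulting .dot file can then be rendered with the graphviz dot tool):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
tree = DecisionTreeClassifier().fit(iris.data, iris.target)

# Writes a Graphviz description of the learned tree to tree.dot.
export_graphviz(tree, out_file="tree.dot", feature_names=iris.feature_names)
```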
To find the importance of features in forest models, you should also have a look at the compute_importances parameter; see the following examples:
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#example-ensemble-plot-forest-importances-py
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances_faces.html#example-ensemble-plot-forest-importances-faces-py
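Note that in current scikit-learn versions the importances are exposed directly on the fitted forest as the feature_importances_ attribute:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
forest = RandomForestClassifier(random_state=0).fit(X, y)
print(forest.feature_importances_)  # one importance score per feature
```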