I'm trying to ensemble three different models (FastText, SVM, Naive Bayes).
I thought of using Python to do this. I'm sure we can ensemble the Naive Bayes and SVM models, but can we also ensemble FastText using Python?
Can anyone advise on this?
The approach for combining multiple models is independent of whether you implement it in Python, Java, or R.
What you are probably looking for is ensemble learning.
One of the most popular approaches for combining different models is stacking, which involves learning a new model that combines the predictions of the individual models you have already trained. See this tutorial, which uses Python.
In your use case, since you're dealing with 3 models, you should keep in mind that:
The models have different mechanics for the predict() method:
FastText uses an internal file (a serialized model with a .bin extension, for example) containing all embeddings and wordNGrams, and you can pass raw text directly;
For SVM and Naive Bayes you are obligated to pre-process the data (CountVectorizer or TfidfVectorizer, plus a LabelEncoder), get the prediction, then pass it back through the LabelEncoder to recover the original label.
At the end you will need to deal with the different probability outputs (if you're predicting with k > 1), and you will probably need to align them before combining.
If you're going to serialize this for production you'll need to pickle the SVM and NB models, use the .bin file for the FastText model, and of course the fitted vectorizers for the former need to be loaded too. This can be a bit of a pain for response time if you need to predict in near real time.
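To make that concrete, here is a minimal stacking sketch under those constraints. It assumes the fasttext Python library, a FastText model already trained and saved as "model.bin" (with string labels like __label__pos), and text/label arrays (train_texts, train_labels, val_texts, val_labels) that are not shown:

```python
import fasttext
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Base models: the sklearn ones need a vectorizer, FastText takes raw text.
svm_pipeline = make_pipeline(TfidfVectorizer(), SVC(probability=True))
nb_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
svm_pipeline.fit(train_texts, train_labels)
nb_pipeline.fit(train_texts, train_labels)
ft_model = fasttext.load_model("model.bin")  # already-trained FastText model

def ft_proba(texts, label_order):
    """FastText class probabilities aligned to a fixed label order."""
    rows = []
    for text in texts:
        labels, probs = ft_model.predict(text, k=len(label_order))
        scores = dict(zip(labels, probs))
        rows.append([scores.get("__label__" + l, 0.0) for l in label_order])
    return np.array(rows)

# sklearn orders predict_proba columns by sorted classes_, so mirror that.
label_order = sorted(set(train_labels))

# Level-2 features: concatenated class probabilities from the three models.
X_meta = np.hstack([
    ft_proba(val_texts, label_order),
    svm_pipeline.predict_proba(val_texts),
    nb_pipeline.predict_proba(val_texts),
])
meta_model = LogisticRegression().fit(X_meta, val_labels)  # stacking model
```

The meta-features should come from data the base models were not trained on (a held-out split or out-of-fold predictions), otherwise the level-2 model learns from over-confident probabilities.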
Related
I am trying to improve my classification model. Using statsmodels for LogisticRegression, I noticed that some features that didn't pass the t-test and don't have much influence in this model become very important when I change models: for example, looking at the feature_importances_ of a RandomForestClassifier, the most important feature there had almost no influence on the LogisticRegression.
With this in mind, I thought of fitting the LogisticRegression without that feature and using predict_proba to get the probabilities, then building another model with RandomForest using all features plus the logistic regression probabilities. Or I could take the probabilities from many models and use them as features of another model. Does any of this make sense? I don't know whether I am introducing any bias by doing this, and why.
I found that what I was doing was stacking, but instead of using another model's response as a feature, I was using the probability of being 1 (predict_proba).
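A minimal sketch of that idea (X, y, and reduced_cols, the pruned feature subset, are placeholders); using out-of-fold probabilities is the standard way to address the bias/leakage worry raised above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

# X, y and reduced_cols (the pruned feature subset) are placeholders here.
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

logit = LogisticRegression(max_iter=1000)
# Out-of-fold P(y=1) on the training set, so the forest never sees
# probabilities produced by a model that was fit on the same rows.
p_train = cross_val_predict(logit, X_train[:, reduced_cols], y_train,
                            cv=5, method="predict_proba")[:, [1]]
logit.fit(X_train[:, reduced_cols], y_train)
p_val = logit.predict_proba(X_val[:, reduced_cols])[:, [1]]

# Random forest on all original features plus the probability feature.
rf = RandomForestClassifier(random_state=0)
rf.fit(np.hstack([X_train, p_train]), y_train)
print(rf.score(np.hstack([X_val, p_val]), y_val))
```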
I am working on classifying texts and images of scientific articles. From the texts I use the title and abstract. So far I have achieved good results using an SVM on the texts, and not-so-good results using a CNN on the images. I also tried a multimodal classification, which did not improve the results.
What I would like to do now is use the SVM and CNN predictions for classification, something like a voting ensemble. However, the VotingClassifier from sklearn does not accept mixed inputs. Would you have any idea of how I could implement this, or some guidelines?
Thank you!
One simple thing you can do is take the outputs from both of your models and use them as inputs to a third, linear regression model. This effectively "mixes" your two learners into a small ensemble. Of course this is a very simple strategy, but it might give you a slight boost over using each model separately.
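A minimal sketch of that mixing step, assuming you already have aligned (n_samples, n_classes) probability arrays from both models on a held-out set (svm_proba, cnn_proba, and y_heldout are placeholders; logistic regression is used as the classification analogue of the linear model mentioned above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# svm_proba / cnn_proba: per-class probabilities from the text SVM and the
# image CNN on the same held-out articles (placeholder arrays).
X_meta = np.hstack([svm_proba, cnn_proba])

# Fit the mixer on one half of the held-out data, evaluate on the other.
X_fit, X_eval, y_fit, y_eval = train_test_split(X_meta, y_heldout,
                                                random_state=0)
mixer = LogisticRegression().fit(X_fit, y_fit)
print(mixer.score(X_eval, y_eval))
```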
Do you know whether models from scikit-learn automatically use multithreading, or only sequential instructions?
Thanks
No. All scikit-learn estimators will by default work on a single thread only.
But then again, it all depends on the algorithm and the problem. If the algorithm inherently requires processing the data sequentially, we cannot do anything about it. If the dataset is multi-class or multi-label and the algorithm works on a one-vs-rest basis, then yes, it can use multi-threading.
Look for an n_jobs parameter in the utility or algorithm you want to use, and set it to -1 to use all available cores for multi-threading.
For example:
LogisticRegression on a binary problem only trains a single model, which consumes the data sequentially, so n_jobs has no effect here. But it handles multi-class problems as OvR, so it has to train one estimator per class on the same data; in that case you can use n_jobs=-1.
DecisionTreeClassifier is inherently multi-class capable and doesn't need to train multiple models, so it doesn't have that parameter.
Ensemble methods like RandomForestClassifier train multiple estimators (irrespective of problem type) which individually work on parts of the data, so here again we can make use of n_jobs.
Cross-validation utilities like cross_val_score or GridSearchCV work on individual folds or parameter combinations that are independent of one another, so multi-threading can be used here as well.
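A few concrete spots where the parameter shows up (a runnable sketch on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5)

# OvR logistic regression: one model per class, trainable in parallel.
LogisticRegression(multi_class="ovr", n_jobs=-1, max_iter=1000).fit(X, y)

# Random forest: each tree is independent, so all cores can be used.
RandomForestClassifier(n_estimators=200, n_jobs=-1).fit(X, y)

# Grid search: every parameter/fold combination runs in parallel.
GridSearchCV(RandomForestClassifier(), {"max_depth": [3, 5, None]},
             n_jobs=-1, cv=5).fit(X, y)
```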
I would like to know if it is possible to train an SVM classifier using scikit-learn in Python (I love this module and its documentation) and import that trained model into C++ for making predictions.
Here is how far I got:
I have written a Python script which uses scikit-learn to create a reasonable SVM classifier
I can also store that model in pickle format
Now, I had a look at libSVM for C++, but I do not see how it could import such a model. I think the documentation is not that good, or I missed something here.
However, I also thought that, instead of storing the whole model, I could just store the parameters of the SVM classifier and load only those (I think the needed ones for a linear SVM classifier are: support vectors, C, degree). Unfortunately, I cannot find any documentation for libSVM on how to do that.
A last option, which I would prefer to avoid, would be to go with OpenCV, in which I could train an SVM classifier, store it, and load it back, all in C++. But this would introduce yet another library dependency (and a large one) for my program. If there is a good way to avoid that, I would love to do so. To make the second option concrete, I sketched below what I imagine the Python side would look like.
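A minimal sketch of that parameter-export idea, assuming a linear kernel (so the weight vector and intercept fully describe the decision function; the toy data and file names are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Toy stand-in for the real training data.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LinearSVC().fit(X, y)

# For a linear kernel the decision function is just w.x + b, so the
# weight vector and intercept are all a C++ program needs to predict.
np.savetxt("svm_weights.txt", clf.coef_)        # shape (1, n_features), binary case
np.savetxt("svm_intercept.txt", clf.intercept_)

# Sanity check: reproduce sklearn's predictions by hand, as C++ would,
# with no dependency on libSVM or OpenCV.
w, b = clf.coef_[0], clf.intercept_[0]
assert np.array_equal(clf.predict(X), (X @ w + b > 0).astype(int))
```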
As always I thank you in advance!
Best,
Tukk
While running in production, is it possible to update a trained model with new data without re-fitting the model? I see you can use the warm_start parameter to enable adding trees to the model; however, I am looking for a way to update the existing trees with the incoming data.
As far as I can tell, this is not possible with sklearn (as they seem to implement the classical Breiman algorithm). However, you might have a look at Mondrian Forests (https://papers.nips.cc/paper/5234-mondrian-forests-efficient-online-random-forests.pdf, python implementation: https://github.com/balajiln/mondrianforest).
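For completeness, the warm_start route the question mentions only appends new trees fit on the incoming batch; it does not touch the existing ones. A minimal sketch (the two batches are synthetic placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder batches standing in for the original and the incoming data.
X_old, y_old = make_classification(n_samples=500, random_state=0)
X_new, y_new = make_classification(n_samples=100, random_state=1)

rf = RandomForestClassifier(n_estimators=100, warm_start=True, random_state=0)
rf.fit(X_old, y_old)    # initial forest of 100 trees

rf.n_estimators += 50   # grow the ensemble by 50 trees...
rf.fit(X_new, y_new)    # ...fit only on the new batch; the original
                        # 100 trees are left untouched
```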