I am trying to compare the validation set performance of an ensemble classifier with the individual predictors that make up the ensemble.
I've been following the code for Exercise 8 from this notebook to build a hard VotingClassifier with a LinearSVC, RandomForestClassifier, ExtraTreesClassifier, and MLPClassifier for version 1 of the MNIST Digits dataset using sklearn's fetch_openml API.
I trained the ensemble and evaluated it by calling its score function with validation data, and got a score of 0.97. So I'm certain the ensemble and, by extension, the individual predictors have been trained/fit.
But when I try using a list comprehension to call score on the individual fitted estimators_ in this ensemble, like so
[estimator.score(X_val, y_val) for estimator in voting_clf.estimators_]
I always get a result of 0.0 for each predictor, even if I evaluate on the training data.
I've confirmed that the sub-estimators in estimators_ have been fit by calling their predict method, as described in this StackOverflow post.
I have also trained the same estimators individually and evaluated them in the same way. That works: the scores are similar to the ones in the tutorial notebook.
Am I referencing the wrong list of sub-estimators in the ensemble object?
You can try adding
mnist.target = mnist.target.astype(np.uint8)
after loading the MNIST dataset.
It works for me.
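For context, here is a minimal sketch of the fix (the split sizes are illustrative; the likely cause is that fetch_openml returns the labels as strings, while the fitted sub-estimators in estimators_ predict label-encoded integer classes, so every comparison inside score fails):

import numpy as np
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1)
# Cast the string labels to integers before splitting and fitting the ensemble
mnist.target = mnist.target.astype(np.uint8)

X_train, y_train = mnist.data[:50000], mnist.target[:50000]
X_val, y_val = mnist.data[50000:60000], mnist.target[50000:60000]

# After refitting, the individual estimators score sensibly:
# [estimator.score(X_val, y_val) for estimator in voting_clf.estimators_]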
Actually, my question is more like "why does this code work properly?".
I was working through a problem from a textbook. Specifically, the problem was to build a Pipeline with a data preparation phase (remove NA values, perform feature scaling, etc.) followed by a prediction phase, in which a predictor is trained on the transformed dataset and returns its predictions.
Here, we used a Support Vector Regressor (sklearn.svm.SVR).
I tried some code of mine, but it didn't work. So I looked up the actual solution provided by the author of the textbook:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# data_prep is the preprocessing transformer built earlier (NA removal, feature scaling, ...)
prepare_select_and_predict_pipeline = Pipeline([
    ('preparation', data_prep),
    ('svm_reg', SVR(kernel='rbf', C=30000, gamma='scale'))
])
prepare_select_and_predict_pipeline.fit(x_train, y_train)
some_data = x_train.iloc[:4]
print("Predictions for a subset of Training Set:",
      prepare_select_and_predict_pipeline.predict(some_data))
I tried this code, and it does work as expected.
How can this work properly? My main objections are:
We have only called fit on the pipeline, but where are we actually transforming the data? We are not calling a transform() function anywhere.
Also, how can we use the predict() function with this pipeline? SVR may be part of the pipeline, but so are the other transformers, and they don't have a predict() function.
Thanks in advance for your answers!
When you call fit on the Pipeline, scikit-learn performs, under the hood, fit_transform on the preprocessing steps and fit on the last step (the classifier or regressor). When you call predict on the Pipeline, scikit-learn performs transform on the preprocessing steps and predict on the last step.
Now, the model is not just the last step but the whole chain of steps that takes in data and outputs results: the Pipeline is the model. If you wrap a Pipeline (preprocessing plus a final regressor or classifier) in GridSearchCV, then the GridSearchCV object is the model.
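Roughly speaking, the two Pipeline calls expand to something like the following sketch (assuming the single preprocessing step data_prep from the question):

from sklearn.svm import SVR

# pipeline.fit(x_train, y_train) is roughly:
X_prepared = data_prep.fit_transform(x_train)          # fit_transform on the preprocessing step
svm_reg = SVR(kernel='rbf', C=30000, gamma='scale')
svm_reg.fit(X_prepared, y_train)                       # fit on the last step

# pipeline.predict(some_data) is roughly:
predictions = svm_reg.predict(data_prep.transform(some_data))   # transform, then predict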
See Pipeline Documentation
I wanted to use KNN on features which were textmined while using another type of regression for the rest of my features. Is it possible to somehow combine both regression models to predict a single label? Should I split my datasets into two different ones?
I am currently using pandas and sklearn.
You can absolutely do that using Ensemble models.
Ensemble models combine decisions from various models in order to improve the overall performance. For regression problems I would suggest the following ensemble models/techniques:
Averaging
A fairly simple ensemble technique in which you take the average of the predictions from all of your models and use it as the final prediction.
Weighted Averaging
This is similar to simple averaging, but the models are now assigned different weights, defining the importance/contribution of each model to the final prediction.
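For example, a minimal sketch of simple and weighted averaging for your case (the column splits X_train_text/X_train_other and the second regressor are illustrative assumptions, not part of your setup):

from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

knn = KNeighborsRegressor().fit(X_train_text, y_train)    # KNN on the text-mined features
lin = LinearRegression().fit(X_train_other, y_train)      # another regressor on the remaining features

pred_knn = knn.predict(X_test_text)
pred_lin = lin.predict(X_test_other)

simple_average = (pred_knn + pred_lin) / 2                 # plain averaging
weighted_average = 0.3 * pred_knn + 0.7 * pred_lin         # weights reflect each model's importance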
Bagging meta-estimator
An ensembling technique that can be used for both classification (BaggingClassifier) and regression (BaggingRegressor). The bagging meta-estimator takes the following steps to reach the final prediction:
Randomly create subsets out of the original dataset
A base estimator is fitted on each of the subsets created in step 1.
Predictions are combined to get the final predicted label
Below is a very simple example that makes use of BaggingRegressor of sklearn:
from sklearn import tree
from sklearn.ensemble import BaggingRegressor

# Bag decision-tree regressors, fit on the training data and score on held-out data
ensemble_model = BaggingRegressor(tree.DecisionTreeRegressor(random_state=1))
ensemble_model.fit(X_train, Y_train)
ensemble_model.score(X_test, Y_test)
I am trying to run 5-fold cross-validation on WEKA using a FilteredClassifier with SMOTE.
To my knowledge, I should apply SMOTE in each of the CV folds to obtain my CV error.
Does anyone have documentation or background on how WEKA performs CV in a FilteredClassifier using
Evaluation().crossvalidate_model(INPUTS)
I am using python with the weka-wrapper.
Thank you!
Weka treats the FilteredClassifier meta-classifier just like any other classifier (since they both implement the weka.classifiers.Classifier interface).
If you're performing 5-fold CV, then the data gets split into 5 pairs of train/test folds and each time the classifier gets trained with the training fold and then evaluated on the test fold. The weka.classifiers.Evaluation class records the statistics obtained from the test data of each of the folds.
In your case (for each train/test fold), the FilteredClassifier uses the training data to initialize the SMOTE filter, filters that training data, and then builds the base classifier on the filtered data.
So the answer is yes, your SMOTE filter gets initialized and applied in each of the CV folds.
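For reference, a minimal python-weka-wrapper3 sketch of this setup (assuming the SMOTE Weka package is installed and a J48 base classifier; the file name and classifier are illustrative):

import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.core.classes import Random
from weka.classifiers import Classifier, FilteredClassifier, Evaluation
from weka.filters import Filter

jvm.start(packages=True)

loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_file("train.arff")
data.class_is_last()

fc = FilteredClassifier()
fc.filter = Filter(classname="weka.filters.supervised.instance.SMOTE")
fc.classifier = Classifier(classname="weka.classifiers.trees.J48")

# SMOTE is re-initialized on the training fold of each of the 5 folds
evaluation = Evaluation(data)
evaluation.crossvalidate_model(fc, data, 5, Random(1))
print(evaluation.summary())

jvm.stop()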
The official place for Weka questions is the Weka mailing list.
I have a sample of approximately 10,000 tweets that I want to classify into the categories "relevant" and "not relevant". I am using Python's scikit-learn for this model. I manually coded 1,000 tweets as "relevant" or "not relevant". Then, I ran a SVM model using 80% of the manually coded data as training data and the rest as test data. I obtained good results (prediction accuracy ~0.90), but to avoid overfitting I decided to use cross-validation on all 1,000 manually coded tweets.
Below is my code after already obtaining the tf-idf matrix for the tweets in my sample. "target" is an array listing whether the tweet was marked as "relevant" or "not relevant".
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
clf = SGDClassifier()
scores = cross_val_score(clf, X_tfidf, target, cv=10)
predicted = cross_val_predict(clf, X_tfidf, target, cv=10)
With this code, I was able to get predictions of what classes the 1,000 tweets belonged to, and I could compare that against my manual coding.
I'm stuck on what to do next in order to use my model to classify the other ~9,000 tweets that I did not manually code. I was thinking of using cross_val_predict again, but I'm not sure what to put in the third argument, since the class is exactly what I'm trying to predict.
Thanks for all your help in advance!
cross_val_predict is not a method for obtaining predictions from a final model. Cross-validation is a technique for model selection and evaluation, not for training the model you will actually use; cross_val_predict is a very specific function that, for each sample, returns the prediction of the model that was trained on the folds not containing that sample. For actual model building you are supposed to use fit to train your model and predict to get predictions for new data. No cross-validation is involved there; as said before, cross-validation is for model selection (choosing your classifier, hyperparameters, etc.), not for training the final model.
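For example, a minimal sketch (assuming X_tfidf and target are your 1,000 manually coded tweets, and X_tfidf_unlabeled is the tf-idf matrix of the remaining ~9,000 tweets; that last name is illustrative):

from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
clf.fit(X_tfidf, target)                        # train on all manually coded tweets
new_labels = clf.predict(X_tfidf_unlabeled)     # classify the ~9,000 uncoded tweets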
I'm using scikit-learn in my Python program in order to perform some machine-learning operations. The problem is that my dataset has severe imbalance issues.
Is anyone familiar with a solution for imbalance in scikit-learn or in Python in general? In Java there's the SMOTE mechanism. Is there something parallel in Python?
There is a newer package here:
https://github.com/scikit-learn-contrib/imbalanced-learn
It contains many algorithms, including SMOTE, in the following categories:
Under-sampling the majority class(es).
Over-sampling the minority class.
Combining over- and under-sampling.
Create ensemble balanced sets.
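For instance, a minimal SMOTE sketch on a toy dataset (assuming a recent imbalanced-learn version, where the method is fit_resample; older versions used fit_sample):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy imbalanced dataset: roughly a 90% / 10% class split
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print('Before:', Counter(y))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print('After: ', Counter(y_res))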
In scikit-learn there are some imbalance-correction techniques, which vary depending on which learning algorithm you are using.
Some of them, like SVM or logistic regression, have a class_weight parameter. If you instantiate an SVC with this parameter set to 'balanced', it will weight each class example proportionally to the inverse of its frequency.
Unfortunately, there isn't a preprocessing tool in scikit-learn itself for this purpose.
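A minimal sketch of the class_weight approach (X_train and y_train stand for your imbalanced training data):

from sklearn.svm import SVC

# 'balanced' weights each class inversely proportional to its frequency in y_train
clf = SVC(class_weight='balanced')
clf.fit(X_train, y_train)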
I found another library here which implements undersampling as well as several oversampling techniques, including multiple SMOTE implementations and one that uses SVM:
A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
Since others have listed links to the very popular imbalanced-learn library, I'll give an overview of how to use it properly, along with some links.
https://imbalanced-learn.org/en/stable/generated/imblearn.under_sampling.RandomUnderSampler.html
https://imbalanced-learn.org/en/stable/generated/imblearn.over_sampling.RandomOverSampler.html
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html
https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/over-sampling/plot_comparison_over_sampling.html#sphx-glr-auto-examples-over-sampling-plot-comparison-over-sampling-py
https://imbalanced-learn.org/en/stable/combine.html
Some common over-sampling and under-sampling techniques in imbalanced-learn are imblearn.over_sampling.RandomOverSampler, imblearn.under_sampling.RandomUnderSampler, and imblearn.over_sampling.SMOTE. These classes have a useful parameter that lets the user change the sampling ratio (called ratio in older versions and sampling_strategy in newer ones).
For example, in SMOTE you can change the ratio by passing a dictionary mapping each class to its target number of samples; each value must be at least the number of samples that class already has, since SMOTE is an over-sampling technique and only adds samples. The reason I have found SMOTE to be a better fit for model performance is probably that with RandomOverSampler you are duplicating rows, which means the model can start to memorize the data rather than generalize to new data. SMOTE instead uses the K-Nearest-Neighbors algorithm to create "similar" data points near the existing minority-class ones.
It is not good practice to blindly use SMOTE with its default ratio (an even class balance), because the model may overfit one or more of the minority classes (even though SMOTE uses nearest neighbors to make "similar" observations). In the same way that you tune the hyperparameters of an ML model, you should tune the hyperparameters of the SMOTE step, such as the ratio and/or the number of neighbors. Below is a working example of how to properly use SMOTE.
NOTE: It is vital that you do not use SMOTE on the full data set. You MUST apply SMOTE to the training set only (after you split). Then validate on your val/test sets and see whether your SMOTE model outperformed your other model(s). If you do not do this, there will be data leakage and your model will essentially be cheating.
from collections import Counter
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, f1_score
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier
import warnings
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

# Per-class target counts; recent imbalanced-learn versions call this argument
# sampling_strategy (older versions used ratio)
sm = SMOTE(random_state=0, n_jobs=8,
           sampling_strategy={'class1': 100, 'class2': 100, 'class3': 80,
                              'class4': 60, 'class5': 90})

### Train test split
X_train, X_val, y_train, y_val = train_test_split(X, y)

### Scale the data before applying SMOTE
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)

### Resample X_train_scaled only (fit_resample in recent versions, fit_sample in older ones)
X_train_resampled, y_train_resampled = sm.fit_resample(X_train_scaled, y_train)
print('Original dataset shape:', Counter(y_train))
print('Resampled dataset shape:', Counter(y_train_resampled))

### Train a model on the resampled training data, validating on the untouched val set
xgbc_smote = XGBClassifier(n_jobs=8).fit(X_train_resampled, y_train_resampled,
                                         eval_set=[(X_val_scaled, y_val)],
                                         early_stopping_rounds=10)

### Evaluate the model
print('\ntrain\n')
print(accuracy_score(y_train, xgbc_smote.predict(np.array(X_train_scaled))))
print(f1_score(y_train, xgbc_smote.predict(np.array(X_train_scaled)), average='macro'))
print('\nval\n')
print(accuracy_score(y_val, xgbc_smote.predict(np.array(X_val_scaled))))
print(f1_score(y_val, xgbc_smote.predict(np.array(X_val_scaled)), average='macro'))
SMOTE is not built into scikit-learn, but there are implementations available online nevertheless.
Edit: The discussion with a SMOTE implementation on GMane that I originally linked to appears to be no longer available. The code is preserved here.
The newer answer below, by #nos, is also quite good.