Cross-Validation and SMOTE in WEKA - python

I am trying to run 5-fold cross-validation on WEKA using a FilteredClassifier with SMOTE.
To my knowledge, I should apply SMOTE in each of the CV folds to obtain my CV error.
Does anyone have documentation or background on how WEKA performs CV in a FilteredClassifier when using
Evaluation().crossvalidate_model(INPUTS)?
I am using Python with the weka-wrapper.
Thank you!

Weka treats the FilteredClassifier meta-classifier just like any other classifier, since it implements the same weka.classifiers.Classifier interface.
If you're performing 5-fold CV, the data gets split into 5 pairs of train/test folds. Each time, the classifier is trained on the training fold and then evaluated on the test fold. The weka.classifiers.Evaluation class records the statistics obtained from the test data of each fold.
In your case, for each train/test pair, the FilteredClassifier initializes the SMOTE filter on the training fold, filters that training fold, and then builds the base classifier on the filtered data.
So the answer is yes, your SMOTE filter gets initialized and applied in each of the CV folds.
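The order of operations described above can be illustrated outside Weka. The sketch below is not Weka code and does not use the weka-wrapper: it assumes scikit-learn and NumPy, a toy dataset from make_classification, and a hypothetical oversample_minority helper in which plain duplication stands in for SMOTE. It only shows the same pattern, where the filter sees the training fold and the test fold is evaluated untouched.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def oversample_minority(X, y, rng):
    """Hypothetical stand-in for SMOTE: duplicate minority rows until classes are even."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]
    for cls, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(y == cls)
            keep.append(rng.choice(idx, size=target - count, replace=True))
    keep = np.concatenate(keep)
    return X[keep], y[keep]


# Toy imbalanced data (assumption: roughly a 90% / 10% class split).
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
rng = np.random.default_rng(0)

fold_scores = []
test_sizes = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # The "filter" is fitted and applied on the training fold only ...
    X_res, y_res = oversample_minority(X[train_idx], y[train_idx], rng)
    clf = DecisionTreeClassifier(random_state=0).fit(X_res, y_res)
    # ... and the untouched test fold is used for evaluation.
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))
    test_sizes.append(len(test_idx))

print("per-fold accuracy:", np.round(fold_scores, 3))
print("mean accuracy:", np.mean(fold_scores))
```

Because the resampling happens inside the loop, no synthetic or duplicated row can end up in a test fold.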
The official place for Weka questions is the Weka mailing list.

Related

Scikit-Learn Voting Classifier Predictor Scores Always 0

I am trying to compare the validation set performance of an ensemble classifier with the individual predictors that make up the ensemble.
I've been following the code for Exercise 8 from this notebook to build a hard VotingClassifier with a LinearSVC, RandomForestClassifier, ExtraTreesClassifier, and MLPClassifier for version 1 of the MNIST Digits dataset using sklearn's fetch_openml API.
I trained the ensemble and evaluated it by calling its score function with validation data, and got a score of 0.97. So I'm certain the ensemble and, by extension, the individual predictors have been trained/fit.
But when I try using list comprehension to call score on the individual fitted estimators_ in this ensemble, like so
[estimator.score(X_val, y_val) for estimator in voting_clf.estimators_]
I always get a result of 0.0 for each predictor, even if I evaluate on the training data.
I've confirmed the sub-estimators in estimators_ have been fit using the predict method as described in this StackOverflow post.
I have also trained the same estimators individually and evaluated them with the same method. This seems to work as scores are similar to the ones in the tutorial notebook.
Am I referencing the wrong list of sub-estimators in the ensemble object?
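Try adding
mnist.target = mnist.target.astype(np.uint8)
after loading the MNIST dataset. fetch_openml returns the MNIST labels as strings, while VotingClassifier label-encodes the targets internally, so the fitted sub-estimators in estimators_ predict integer class indices that never match the string labels. Once the labels are integers 0 to 9, they coincide with the encoded indices and the scores line up.
It works for me.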
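To see why the label type matters, here is a minimal sketch under stated assumptions: it uses the iris dataset with string labels instead of MNIST, and two small estimators instead of the four from the question. It relies on the le_ attribute (the LabelEncoder that a fitted VotingClassifier stores) to score the sub-estimators against encoded labels.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data
y_str = iris.target_names[iris.target]  # string labels, standing in for the MNIST targets

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",
).fit(X, y_str)

# The fitted sub-estimators were trained on label-encoded targets,
# so they predict integer class indices rather than the original strings.
sub_pred = voting_clf.estimators_[1].predict(X)
print(sub_pred.dtype, sub_pred[:3])

# Encode the labels with the ensemble's own encoder before scoring.
y_enc = voting_clf.le_.transform(y_str)
sub_scores = [est.score(X, y_enc) for est in voting_clf.estimators_]
print(sub_scores)
```

Scoring against y_enc is an alternative to casting the targets up front; either way the sub-estimators and the labels end up in the same encoding.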

Does GridSearchCV perform cross-validation?

I'm currently working on a problem which compares three different machine learning algorithms' performance on the same data set. I divided the data set into 70/30 training/testing sets and then performed grid search for the best parameters of each algorithm using GridSearchCV and X_train, y_train.
First question: am I supposed to perform grid search on the training set, or is it supposed to be on the whole data set?
Second question: I know that GridSearchCV uses K-fold in its implementation. Does that mean I performed cross-validation if I used the same X_train, y_train for all three algorithms I compare in GridSearchCV?
Any answer would be appreciated, thank you.
All estimators in scikit-learn whose names end with CV perform cross-validation.
But you need to keep a separate test set for measuring the performance.
So you need to split your whole data into train and test sets. Forget about the test data for a while.
Then pass only the training data to the grid search. GridSearchCV will split this training data further into train and validation folds to tune the hyper-parameters passed to it, and finally fit the model on the whole training data with the best parameters found.
Now test this model on the test data you set aside at the beginning. This gives a close estimate of the model's real-world performance.
If you pass the whole data set to GridSearchCV, test data leaks into parameter tuning, and the final model may not perform as well on new, unseen data.
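The workflow above can be sketched as follows. This is only an illustration under assumptions: the breast cancer dataset, the SVC pipeline, and the parameter grid are placeholders chosen for the example, not the asker's setup.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# 1. Hold out a test set and leave it alone during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# 2. Grid search runs 5-fold cross-validation inside the training set only.
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)  # refit=True (the default) retrains on all of X_train

print("best params:", search.best_params_)
print("mean CV accuracy:", search.best_score_)

# 3. Only now touch the held-out test set.
test_accuracy = search.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```

Putting the scaler inside the pipeline means it is also refitted on each training fold, so the validation folds do not influence the preprocessing.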
You can look at my other answers which describe the GridSearch in more detail:
Model help using Scikit-learn when using GridSearch
scikit-learn GridSearchCV with multiple repetitions
Yes, GridSearchCV performs cross-validation. If I understand the concept correctly - you want to keep part of your data set unseen for the model in order to test it.
So you train your models against train data set and test them on a testing data set.
Here I was doing almost the same - you might want to check it...

Training a decision tree with K-Fold - Is this the correct approach?

I've used two approaches with the same sklearn decision tree, one approach using a validation set and the other using K-Fold. However, I'm not sure if I'm actually achieving anything by using KFold. Technically the cross-validation does show a 5% rise in accuracy, but I'm not sure if that's just the peculiarity of this particular data skewing the result.
For my implementation of KFold I first split the training set into segments using:
f = KFold(n_splits=8)
f.get_n_splits(data)
And then got data-frames from it by using
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
In a loop, as witnessed in many online tutorials on how to do it. However, here comes the tricky part. The tutorial I saw had a .train() function which I do not think this decision tree classifier does. Instead, I just do this:
clf = tree.DecisionTreeClassifier()  # a separate name, so the sklearn tree module is not overwritten inside the loop
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
The accuracy scores achieved are:
Accuracy score: 0.79496591505
Accuracy score: 0.806502359727
Accuracy score: 0.800734137389
... and so on
But I am not sure if I'm actually making my classifier any better by doing this, as the scores go up and down. Isn't this just comparing 9 independent results together? Is the purpose of K-fold not to train the classifier to be better?
I've read similar questions and found that K-fold is meant to provide a way to compare between "independent instances" but I wanted to make sure that was the case, not that my code was flawed in some way.
Is the purpose of K-fold not to train the classifier to be better?
K-fold does not make the classifier itself better; its purpose is to check whether the classifier overfits the training data. On each fold you hold out a separate test set that the classifier has not seen and measure the accuracy on it. You then average the fold scores to see how well your classifier is performing.
Isn't this just comparing 9 independent results together?
Yes, you compare the different scores to see how well your classifier is performing.
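As a minimal sketch of that idea (assuming the iris dataset as a stand-in for the asker's data), cross_val_score runs the same fit-and-score loop and returns one score per fold, which you then summarize:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# shuffle=True because the iris rows are sorted by class.
cv = KFold(n_splits=8, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# Eight models are trained and discarded; only their scores are kept.
print("fold scores:", np.round(scores, 3))
print("mean: %.3f, std: %.3f" % (scores.mean(), scores.std()))
```

The mean is the performance estimate and the standard deviation indicates how much the individual fold scores fluctuate, which is the "up and down" seen in the question.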
In general, cross-validation helps you detect overfitting. To do so, you split the data into multiple parts and evaluate the loss, accuracy, or other metrics (e.g. F1 score) on each held-out part. A good introduction can be found on the official site [1].
In addition I would recommend using StratifiedKFold [2] instead of KFold.
skf = StratifiedKFold(n_splits=8)
skf.get_n_splits(X, y)
This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
So each fold has roughly the same class proportions as the full data set.
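The difference is easiest to see on sorted, imbalanced labels. The sketch below uses made-up toy labels (an assumption for illustration) and a hypothetical minority_share helper that reports the fraction of the minority class in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels (assumption): 80 samples of class 0 followed by 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself


def minority_share(splitter):
    """Fraction of class 1 in each test fold."""
    return [float(y[test].mean()) for _, test in splitter.split(X, y)]


plain = minority_share(KFold(n_splits=4))
stratified = minority_share(StratifiedKFold(n_splits=4))
print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

In this toy setup plain KFold puts every class 1 sample into the last test fold, while StratifiedKFold keeps the 20% share in each fold.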

How to run scikit's cross validation with several classifiers on the same folds

I'm currently working on a research study comparing classifier performance. To evaluate performance, I'm computing the accuracy, the area under the curve, and the squared error for each classifier on all the datasets I have. I also need to tune parameters for some of the classifiers in order to select the best parameters in terms of accuracy, so a validation set is required (I chose 20% of the dataset).
I was told that, in order to make this comparison even more meaningful, the cross-validation should be performed on the same sets for each classifier.
So basically, is there a way to use the cross_val_score method so that it always runs on the same folds for all the classifiers, or should I write some code from scratch that can do this job?
Thank you in advance.
cross_val_score accepts a cv parameter which represents the cross validation object you want to use. You probably want StratifiedKFold, which accepts a shuffle parameter, which specifies if you want to shuffle the data prior to running cross validation on it.
cv can also be an int, in which case a StratifiedKFold or KFold object will be created automatically with K = cv.
As you can tell from the documentation, shuffle is False by default, so by default it will already be performed on the same folds for all of your classifiers.
You can test it by running it twice on the same classifier to make sure (you should get the exact same results).
You can specify it yourself like this:
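from sklearn.model_selection import StratifiedKFold, cross_val_score
# Current scikit-learn API: n_splits instead of n_folds, and y is no longer passed to the constructor.
# With shuffle=True, fix random_state so every classifier sees the same folds.
your_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0) # or shuffle=False
cross_val_score(your_estimator, your_X, y=your_y, cv=your_cv)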
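A minimal sketch of sharing one splitter across several classifiers, assuming the iris dataset and two arbitrary estimators as placeholders. It also checks that a splitter with an integer random_state yields identical folds on repeated calls.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One splitter with a fixed seed, shared by every classifier.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# With an integer random_state, split() yields the same folds on every call.
first = [test.tolist() for _, test in cv.split(X, y)]
second = [test.tolist() for _, test in cv.split(X, y)]
print("identical folds:", first == second)

classifiers = {
    "tree": DecisionTreeClassifier(random_state=0),
    "logreg": LogisticRegression(max_iter=1000),
}
results = {name: cross_val_score(clf, X, y, cv=cv) for name, clf in classifiers.items()}
for name, scores in results.items():
    print(name, round(scores.mean(), 3))
```

Since each classifier is scored on the same ten test folds, the per-fold scores can be compared pairwise.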

Imbalance in scikit-learn

I'm using scikit-learn in my Python program in order to perform some machine-learning operations. The problem is that my data-set has severe imbalance issues.
Is anyone familiar with a solution for imbalance in scikit-learn or in Python in general? In Java there's the SMOTE mechanism. Is there something equivalent in Python?
There is a new one here
https://github.com/scikit-learn-contrib/imbalanced-learn
It contains many algorithms in the following categories, including SMOTE
Under-sampling the majority class(es).
Over-sampling the minority class.
Combining over- and under-sampling.
Create ensemble balanced sets.
Scikit-learn offers some imbalance-correction techniques, which vary with the learning algorithm you are using.
Some algorithms, like SVM or logistic regression, have a class_weight parameter. If you instantiate an SVC with this parameter set to 'balanced', it weights each class inversely proportional to its frequency.
Unfortunately, there isn't a preprocessing tool for this purpose.
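To make the 'balanced' weighting concrete, here is a small sketch on made-up labels (the 90/10 split and the random features are assumptions for illustration). It uses compute_class_weight from sklearn.utils.class_weight to show the weights, and the class_weight_ attribute of a fitted SVC to confirm the classifier applies the same values.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (assumption): 90 majority samples and 10 minority samples.
y = np.array([0] * 90 + [1] * 10)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) + y[:, None]  # shift the minority class slightly

# 'balanced' uses n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))

clf = SVC(class_weight="balanced").fit(X, y)
print(clf.class_weight_)
```

With these counts the minority class weight is 100 / (2 * 10) = 5.0 and the majority class weight is 100 / (2 * 90), so an error on a minority sample costs nine times as much.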
I found one other library here which implements undersampling and also multiple oversampling techniques including multiple SMOTE implementations and another which uses SVM:
A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
Since others have listed links to the very popular imbalanced-learn library I'll give an overview about how to properly use it along with some links.
https://imbalanced-learn.org/en/stable/generated/imblearn.under_sampling.RandomUnderSampler.html
https://imbalanced-learn.org/en/stable/generated/imblearn.over_sampling.RandomOverSampler.html
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html
https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/over-sampling/plot_comparison_over_sampling.html#sphx-glr-auto-examples-over-sampling-plot-comparison-over-sampling-py
https://imbalanced-learn.org/en/stable/combine.html
Some common over-sampling and under-sampling techniques in imbalanced-learn are imblearn.over_sampling.RandomOverSampler, imblearn.under_sampling.RandomUnderSampler, and imblearn.over_sampling.SMOTE. These classes share a parameter that lets the user change the sampling ratio (named ratio in older releases and sampling_strategy in current ones).
For example, in SMOTE, to change the ratio you pass a dictionary mapping each class to its target sample count, and each value must be greater than or equal to that class's current count (since SMOTE is an over-sampling technique). I have found SMOTE to be a better fit for model performance, probably because RandomOverSampler duplicates rows, which means the model can start to memorize the data rather than generalize to new data. SMOTE instead uses k-nearest neighbors to create synthetic points: it interpolates between a minority sample and one of its nearest minority-class neighbors.
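The interpolation idea can be sketched in a few lines of NumPy. This is an illustration only, not the imbalanced-learn implementation: the smote_sketch function, its parameters, and the random toy data are all hypothetical.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours.

    Illustrative only: not the imbalanced-learn implementation.
    """
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances among the minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)  # sample to start from
    pick = rng.integers(0, neighbours.shape[1], size=n_new)
    neigh = neighbours[base, pick]  # one of its k nearest neighbours
    gap = rng.random((n_new, 1))  # position along the connecting segment
    return X_min[base] + gap * (X_min[neigh] - X_min[base])


rng = np.random.default_rng(1)
X_minority = rng.normal(size=(20, 2))  # toy minority-class samples (assumption)
X_synthetic = smote_sketch(X_minority, n_new=30)
print(X_synthetic.shape)
```

Every synthetic row lies on a segment between two real minority samples, which is why the new points are similar to the originals without being exact copies.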
It is not good practice to blindly use SMOTE with the ratio left at its default (even class balance), because the model may overfit one or more of the minority classes (even though SMOTE is using nearest neighbors to make "similar" observations). In the same way that you tune the hyperparameters of an ML model, you should tune the hyperparameters of the SMOTE algorithm, such as the ratio and/or the number of neighbors (k_neighbors). Below is a working example of how to properly use SMOTE.
NOTE: It is vital that you do not use SMOTE on the full data set. You MUST use SMOTE on the training set only (after you split). Then validate on your val/test sets and see if your SMOTE model outperformed your other model(s). If you do not do this, there will be data leakage and your model is essentially cheating.
from collections import Counter
import warnings
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
### X (features) and y (labels) are assumed to be loaded already
### Target count per class; the keys must match the labels in y
### (the parameter is sampling_strategy in current imbalanced-learn, ratio in old releases)
sm = SMOTE(random_state=0, sampling_strategy={'class1':100, 'class2':100, 'class3':80, 'class4':60, 'class5':90})
### Train test split
X_train, X_val, y_train, y_val = train_test_split(X, y)
### Scale the data before applying SMOTE
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)
### Resample X_train_scaled (fit_resample replaces the older fit_sample)
X_train_resampled, y_train_resampled = sm.fit_resample(X_train_scaled, y_train)
print('Original dataset shape:', Counter(y_train))
print('Resampled dataset shape:', Counter(y_train_resampled))
### Train a model on the resampled training data
### (early_stopping_rounds is a constructor argument in xgboost >= 1.6, older versions take it in fit;
### recent xgboost versions also expect labels encoded as 0..n_classes-1)
xgbc_smote = XGBClassifier(n_jobs=8, early_stopping_rounds=10).fit(X_train_resampled, y_train_resampled,
                                                                   eval_set=[(X_val_scaled, y_val)])
### Evaluate the model (metrics take y_true first; average='macro' because there are more than two classes)
print('\ntrain\n')
print(accuracy_score(y_train, xgbc_smote.predict(np.array(X_train_scaled))))
print(f1_score(y_train, xgbc_smote.predict(np.array(X_train_scaled)), average='macro'))
print('\nval\n')
print(accuracy_score(y_val, xgbc_smote.predict(np.array(X_val_scaled))))
print(f1_score(y_val, xgbc_smote.predict(np.array(X_val_scaled)), average='macro'))
SMOTE is not a builtin in scikit-learn, but there are implementations available online nevertheless.
Edit: The discussion with a SMOTE implementation on GMane that I originally linked to appears to be no longer available. The code is preserved here.
The newer answer below, by @nos, is also quite good.
