I have a flight delay dataset and I am trying to split it into training and test sets before resampling. On-time cases are about 80% of the data and delayed cases about 20%.
Normally in machine learning the train/test ratio is around 8:2, but this data is very imbalanced. In an extreme case, most of the training data could be on-time cases and most of the test data delayed cases, and accuracy would be poor.
So my question is: how can I properly split an imbalanced dataset into training and test sets?
Just playing with the ratio of train and test data will probably not get you correct predictions and results.
If you are working with an imbalanced dataset, you should try re-sampling techniques to get better results. With an imbalanced dataset the classifier tends to simply "predict" the most common class without performing any real analysis of the features.
Also use a different metric for performance measurement, such as the F1 score, when the dataset is imbalanced.
Please go through the links below; they will give you more clarity.
What is the correct procedure to split the Data sets for classification problem?
Cleveland heart disease dataset - can’t describe the class
Start from 50/50 and then change the split to 60/40, 70/30, 80/20, 90/10. Record all the results and draw your conclusion from them. In one of my own projects on flight delay prediction, I used a 60/40 split and got 86.8% accuracy using an MLP neural network.
There are two approaches that you can take.
A simple one: no preprocessing of the dataset, just careful sampling so that both classes are represented in the same proportion in the test and train subsets. You can do this by splitting by class first and then randomly sampling from both sets.
from sklearn.model_selection import train_test_split
import numpy as np

# TODO: adapt to your own data; this assumes dataX/dataY are NumPy arrays and the labels are 0/1
XclassA, YclassA = dataX[dataY == 0], dataY[dataY == 0]
XclassB, YclassB = dataX[dataY == 1], dataY[dataY == 1]

XclassA_train, XclassA_test, YclassA_train, YclassA_test = train_test_split(XclassA, YclassA, test_size=0.2, random_state=42)
XclassB_train, XclassB_test, YclassB_train, YclassB_test = train_test_split(XclassB, YclassB, test_size=0.2, random_state=42)

# concatenate the per-class splits back into single training arrays
Xclass_train = np.concatenate([XclassA_train, XclassB_train])
Yclass_train = np.concatenate([YclassA_train, YclassB_train])
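For a plain two-way split, the same per-class proportions can also be obtained in a single call via the stratify argument of train_test_split. A minimal sketch, assuming dataX and dataY are the full feature and label arrays:

from sklearn.model_selection import train_test_split

# stratify=dataY keeps the class ratio identical in the train and test subsets
X_train, X_test, y_train, y_test = train_test_split(
    dataX, dataY, test_size=0.2, stratify=dataY, random_state=42)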
A more involved, and arguably better, approach: first try to balance your dataset. For that you can use one of many techniques (under-sampling, over-sampling, SMOTE, ADASYN, Tomek links, etc.). I recommend you review the methods of the imbalanced-learn package. Once the data is balanced, you can use an ordinary train/test split with the usual methods, without any additional intermediate steps.
The second approach is better not only from the perspective of splitting the data but also for training speed and even for the ability to train a model at all (which is not guaranteed to work for heavily imbalanced datasets).
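A minimal sketch of this second approach, assuming X and y hold the features and labels and the imbalanced-learn package is installed:

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# generate synthetic minority-class samples until both classes are balanced
X_balanced, y_balanced = SMOTE(random_state=42).fit_resample(X, y)

X_train, X_test, y_train, y_test = train_test_split(
    X_balanced, y_balanced, test_size=0.2, random_state=42)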
Related
I am training on three classes with one dominant majority class of about 80% and the other two even. I am able to train a model using undersampling / oversampling techniques to get validation accuracy of 67% which would already be quite good for my purposes. The issue is that this performance is only present on the balanced validation data, once I test on out of sample with imbalanced data it seems to have picked up a bias towards even class predictions. I have also tried using weighted loss functions but also no joy on out of sample. Is there a good way to ensure the validation performance translates over? I have tried using auroc to validate the model successfully but again the strong performance is only present in the balanced validation data.
Methods of resampling I have tried: SMOTE oversampling and random undersampling.
If I understood correctly, maybe you are looking for better performance measurement and better classification results on imbalanced datasets.
Measuring performance with accuracy alone is usually high and misleading on imbalanced datasets, and the minority class can be ignored entirely. Use the F1 score and precision/recall instead.
For my project work on imbalanced datasets, I have used SMOTE sampling methods along with K-fold cross-validation.
Cross-validation helps ensure that the model picks up the real patterns in the data and does not fit too much noise.
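A minimal sketch of that combination (the classifier, scoring metric and X/y names are assumptions, not from the original post): putting SMOTE inside an imbalanced-learn Pipeline means the resampling is re-fitted on the training folds only, so no synthetic samples leak into the validation folds.

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),                 # applied to the training folds only
    ("clf", RandomForestClassifier(random_state=42)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, scoring="f1_macro", cv=cv)
print(scores.mean())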
References :
What is the correct procedure to split the Data sets for classification problem?
I'm currently working on a problem which compares the performance of three different machine learning algorithms on the same data-set. I divided the data-set into 70/30 training/testing sets and then performed a grid search for the best parameters of each algorithm using GridSearchCV with X_train, y_train.
First question: am I supposed to perform the grid search on the training set, or on the whole data-set?
Second question: I know that GridSearchCV uses K-fold in its implementation. Does that mean I performed cross-validation if I used the same X_train, y_train for all three algorithms I compare in GridSearchCV?
Any answer would be appreciated, thank you.
All estimators in scikit-learn whose name ends with CV perform cross-validation.
But you need to keep a separate test set for measuring performance.
So you need to split your whole data into train and test sets. Forget about the test data for a while.
Then pass only the train data to the grid search. GridSearchCV will split this train data further into train and validation folds to tune the hyper-parameters passed to it, and finally fit the model on the whole train data with the best parameters found.
Now test this model on the test data you kept aside at the beginning. This will give you a near-real-world estimate of the model's performance.
If you pass the whole data into GridSearchCV, test data leaks into the parameter tuning, and the final model may not perform as well on newer unseen data.
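A minimal sketch of that workflow (the estimator, the parameter grid and the X/y names are placeholders, not from the question):

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# hold out a test set first and do not touch it during tuning
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)           # internal CV splits come from the training data only

print(grid.best_params_)
print(grid.score(X_test, y_test))    # final estimate on the untouched test set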
You can look at my other answers which describe the GridSearch in more detail:
Model help using Scikit-learn when using GridSearch
scikit-learn GridSearchCV with multiple repetitions
Yes, GridSearchCV performs cross-validation. If I understand the concept correctly - you want to keep part of your data set unseen by the model in order to test it.
So you train your models against train data set and test them on a testing data set.
Here I was doing almost the same thing - you might want to check it...
from sklearn.model_selection import train_test_split
I have a question about the train_test_split function from sklearn. First, why do we split the data??? And where do we get the testing data from? Do we just chop the data in half and use some of it to train and some of it to test?? That doesn't make sense, since the data is already filled. If it is filled, then what are we predicting now?? I need help!
First, why do we split the data???
We split the data to isolate a portion of it for validation purposes. We use the non-isolated portion to fit the algorithm and then test the algorithm against the isolated portion.
where do we get the testing data from.
The testing data is actually part of your original dataset.
Do we just chop the data in half
We usually chop at the 20-40% mark.
If it is filled, then what are we predicting now??
You are actually not trying to predict the result directly. You are training the algorithm to fit the training set and use the testing set to see how accurate the algorithm is.
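A minimal sketch of the idea (the feature matrix X, the labels y and the choice of classifier here are just placeholders):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# keep 20% of the labelled data aside purely for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)   # learn from the training portion
print(model.score(X_test, y_test))                   # check accuracy on the unseen portion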
Your original dataset should be split up into training and testing data. For example, 80% of the data could be used for training and 20% could be used for testing. The data is split so that there is data for the model to be evaluated on to see how well the model performs on unseen data.
Training: This data is used to build your model, e.g. finding the optimal coefficients in a Linear Regression model, or using the CART algorithm to create a Decision Tree.
Testing: This data is used to see how the model performs on unseen data, as it would in a real world situation. This data should be left completely unseen until you would like to test your model to evaluate performance.
Extra notes on validation data:
To tune your model (for example, finding the best max_depth value for a decision tree) the training data should itself be split into training and validation portions. K-fold cross-validation can be used here: the model is trained on the training folds and evaluated on the validation fold, and this is repeated across the folds.
The results (e.g. MSE, F1 etc.) on each fold can then be used to tune the hyperparameters. Using cross-validation to tune the hyperparameters ensures that the model is not overfitting to the test data.
Once the model is tuned, it can then be applied to the test data.
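A minimal sketch of tuning max_depth this way (X_train, y_train, X_test, y_test are assumed to come from an earlier split):

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

best_depth, best_score = None, -1.0
for depth in [2, 4, 6, 8, 10]:
    # 5-fold cross-validation on the training data only
    scores = cross_val_score(DecisionTreeClassifier(max_depth=depth, random_state=0),
                             X_train, y_train, cv=5)
    if scores.mean() > best_score:
        best_depth, best_score = depth, scores.mean()

# retrain on all the training data with the chosen depth, then evaluate once on the test data
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print(best_depth, final_model.score(X_test, y_test))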
I have a dataset which includes 200000 labelled training examples.
For each training example I have 10 features, including both continuous and discrete.
I'm trying to use the sklearn package in Python to train the model and make predictions, but I am having some trouble (and some questions too).
First let me write the code which I have written so far:
from sklearn.naive_bayes import GaussianNB
# data contains the 200 000 examples
# targets contain the corresponding labels for each training example
gnb = GaussianNB()
gnb.fit(data, targets)
predicted = gnb.predict(data)
The problem is that I get really low accuracy (too many misclassified labels) - around 20%.
However I am not quite sure whether there is a problem with the data (e.g. more data is needed or something else) or with the code.
Is this the proper way to implement a Naive Bayes classifier given a dataset with both discrete and continuous features?
Furthermore, in Machine Learning we know that the dataset should be split into training and validation/testing sets. Is this automatically performed by sklearn or should I fit the model using the training dataset and then call predict using the validation set?
Any thoughts or suggestions will be much appreciated.
The problem is that I get really low accuracy (too many misclassified labels) - around 20%. However I am not quite sure whether there is a problem with the data (e.g. more data is needed or something else) or with the code.
This is not a big error for Naive Bayes; it is an extremely simple classifier and you should not expect it to be strong, and more data probably won't help. Your Gaussian estimators are probably already very good; the naive independence assumptions are simply the problem. Use a stronger model. You can start with Random Forest, since it is very easy to use even for non-experts in the field.
Is this the proper way to implement a Naive Bayes classifier given a dataset with both discrete and continuous features?
No, it is not. You should use different distributions for the discrete features; however, scikit-learn does not support mixing distributions within a single Naive Bayes estimator, so you would have to do this manually. As said before - change your model.
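If you do want to try the manual route, here is a hedged sketch (cont_cols and disc_cols are hypothetical column selections, the discrete columns must be encoded as non-negative integers, and CategoricalNB is only available in newer scikit-learn versions): fit one Naive Bayes model per feature type and combine their log-posteriors, removing the class prior that would otherwise be counted twice.

import numpy as np
from sklearn.naive_bayes import GaussianNB, CategoricalNB

X_cont, X_disc = data[:, cont_cols], data[:, disc_cols]   # hypothetical column split

gnb = GaussianNB().fit(X_cont, targets)
cnb = CategoricalNB().fit(X_disc, targets)

# Naive Bayes factorises over features, so the two log-posteriors can be added;
# subtract one copy of the shared class log-prior so it is not counted twice.
log_post = (gnb.predict_log_proba(X_cont)
            + cnb.predict_log_proba(X_disc)
            - np.log(gnb.class_prior_))
predicted = gnb.classes_[np.argmax(log_post, axis=1)]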
Furthermore, in Machine Learning we know that the dataset should be split into training and validation/testing sets. Is this automatically performed by sklearn or should I fit the model using the training dataset and then call predict using the validation set?
Nothing is done automatically in this respect; you need to do the split yourself (scikit-learn has lots of tools for that - see the cross-validation and model selection utilities).
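A minimal sketch of doing that split yourself, reusing the data and targets names from the question (the 80/20 ratio is just an example):

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_train, X_test, y_train, y_test = train_test_split(data, targets, test_size=0.2, random_state=42)

gnb = GaussianNB().fit(X_train, y_train)
print(gnb.score(X_test, y_test))   # accuracy measured on held-out data, not on the training set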
train_index, test_index = next(iter(ShuffleSplit(821, train_size=0.2, test_size=0.80, random_state=42)))
print train_index, len(train_index)
print test_index, len(test_index)
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(features, labels, train_size=0.33, random_state=42)
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test, labels_test)
print pred, len(pred)
A few questions from this code:
Why do I need the cross_validation.train_test_split line in order to fit and predict with my classifier? (I am not doing any preprocessing on my data except for stopword removal I have already done)
Do the test and train indexes correspond to the classified & predicted labels? My goal is to get all my labels, in their original order, after fitting and predicting them. My features and labels used for training and testing are from a pandas dataframe (two columns), and I need the predicted labels, in order, so that I can feed them back into the pandas dataframe.
Is there a way to predict the labels for the whole set, and not just the test set?
tl;dr
Because your decision tree classifier has to be trained before it can predict anything. It's not a magic algorithm. It has to be shown examples of what to do before it can work out what to do on other things.
cross_validation.train_test_split() facilitates this by splitting your data into a test and a training dataset in such a way that you can analyse how well the classifier performed later on. Without this, you have no way of assessing how well your decision tree classifier actually performed.
You can create your own testing and training data without train_test_split() (and I suspect that was what you were trying to do with ShuffleSplit()), but you will need at least some training data.
test_index and train_index have nothing to do with your data. Full stop. They come from a randomly generated process that is completely unrelated to what train_test_split() does.
The purpose of ShuffleSplit() is to give you the indices so you can partition your data into training and test yourself. train_test_split() will instead choose its own indices and partition based on those. You should use one or the other, and use it sensibly.
Yes. You can always just call
pred = clf.predict(features) or pred = clf.predict(features_test + features_train)
The Full Story
You need cross-validation if you want to do this right. The whole purpose of cross-validation is to avoid overfitting.
Basically, if you run your model on both the training and the testing data, then your model is going to perform really well on the training set (because, well, that's what you trained it on) and that's going to skew your overall metrics of how well your model will perform on real data.
It's a lot like asking a student to perform in an exam and then in real life: if you want to know whether your student learned from the process of preparing for an exam, you don't give him another exam, you ask him to demonstrate his skills in the real world dealing with unknown and complex data.
If you want to know if your model will be useful, then you want to cross-validate. Wikipedia puts it best:
In a prediction problem, a model is usually given a dataset of known data on which training is run (training dataset), and a dataset of unknown data (or first seen data) against which the model is tested (testing dataset).
The goal of cross validation is to define a dataset to "test" the model in the training phase (i.e., the validation dataset), in order to limit problems like overfitting, give an insight on how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem), etc.
cross_validation.train_test_split doesn't do anything except split the dataset into training and testing data for you.
But perhaps you don't care about metrics, and that's fine. The question then becomes: is it possible to run a decision tree classifier without a training dataset?
The answer is no. Decision tree classifiers are supervised algorithms: they need to be trained on data before they can generalise their model to new results. If you don't give them any data to train on, it will be unable to do anything with any data you feed it in predict.
Finally, while it is perfectly possible to get the labels for the whole set (see tl;dr), it is a really bad idea if you actually care about whether or not you're getting sensible results.
You already have the labels for the testing and training data. You don't need another column that includes predictions on the training data, because they'll either come out to be identical or close enough to identical.
I can't think of a single meaningful reason to get back predicted results for your training data short of trying to optimise how it's performing on your training data. If that's what you are trying to do, then do that. What you are doing right now is definitely not that, and I encourage you to think strongly about what your reasons are for blindly inserting numbers into your table without due cause to believe they actually mean something.
There are ways to improve this: get back an accuracy metric, for example, or try k-fold cross-validation to model accuracy, or look at log-loss or AUC or any one of a number of metrics to gauge whether or not your model is performing well.
Using both ShuffleSplit and train_test_split is redundant. You do not even appear to be using the indices returned by ShuffleSplit.
An example of how to split via row indices (which is what ShuffleSplit returns) is below. X and y are np.array: X is number of instances by number of features, and y contains the label of each row.
import numpy as np
from sklearn.model_selection import train_test_split

train_inds, test_inds = train_test_split(np.arange(len(y)), test_size=0.33, random_state=42)
X_train, y_train = X[train_inds], y[train_inds]
X_test, y_test = X[test_inds], y[test_inds]
You should not test on your training data! But if you want to see what happens just do
pred = clf.predict(features_train)
Also you do not need to pass the labels to predict. You should be using
from sklearn import metrics
score = metrics.accuracy_score(y_test, pred)