There is already a description of how to do a stratified train/test split in scikit-learn via train_test_split (Stratified Train/Test-split in scikit-learn) and a description of how to do a random train/validation/test split via np.split (How to split data into 3 sets (train, validation and test)?). But what about doing a stratified train/validation/test split?
The closest approximation that comes to mind for doing a stratified (on class label) train/validation/test split is as follows, but I suspect there's a better way that can perhaps achieve this in one function call or more accurately:
Let's say we want a 60/20/20 train/validation/test split. My current approach is to first do a stratified 60/40 split, then do a stratified 50/50 split on that remaining 40% so as to ultimately get a stratified 60/20/20 split.
from sklearn.model_selection import train_test_split
SEED = 2000
# 60/40 stratified split, then split the 40% in half: 60/20/20 overall
x_train, x_validation_and_test, y_train, y_validation_and_test = train_test_split(
    x, y, test_size=.4, stratify=y, random_state=SEED)
x_validation, x_test, y_validation, y_test = train_test_split(
    x_validation_and_test, y_validation_and_test, test_size=.5,
    stratify=y_validation_and_test, random_state=SEED)
Please let me know whether my approach is correct and/or whether you have a better approach.
Thank you
The solution is to just use StratifiedShuffleSplit twice, like below:
from sklearn.model_selection import StratifiedShuffleSplit

# first split: 60% train, 40% held out for validation + test
split = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=42)
for train_index, test_valid_index in split.split(df, df.target):
    train_set = df.iloc[train_index]
    test_valid_set = df.iloc[test_valid_index]

# second split: divide the held-out 40% evenly into validation and test
split2 = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
for test_index, valid_index in split2.split(test_valid_set, test_valid_set.target):
    test_set = test_valid_set.iloc[test_index]
    valid_set = test_valid_set.iloc[valid_index]
Yes, this is exactly how I would do it: run train_test_split() twice. Think of the first call as splitting off your training set; the remainder can then be divided into different folds or holdouts down the line.
In fact, if you end up testing your model with a scikit-learn estimator that includes built-in cross-validation, you may not even have to run train_test_split() again explicitly. The same goes if you use the (very handy!) model_selection.cross_val_score function.
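As a minimal sketch of that workflow (the LogisticRegression is just a placeholder estimator; x and y are the arrays from the question):
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# split off a stratified test set once
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=42)

# cross-validation on the training set then plays the role of the validation split
clf = LogisticRegression()
print(cross_val_score(clf, x_train, y_train, cv=5).mean())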
Related
I have a machine learning model and a dataset with 15 features about breast cancer. I want to predict a person's status (alive or dead). 85% of the cases are alive and only 15% dead, so I want to use over-sampling to deal with this imbalance and combine it with stratified k-fold.
I wrote this code; it seems to work well, but I don't know if I put the steps in the right order:
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import RandomOverSampler

skf = StratifiedKFold(n_splits=10, random_state=None)
skf.get_n_splits(x, y)
ros = RandomOverSampler(sampling_strategy="not majority")
x_res, y_res = ros.fit_resample(x, y)
for train_index, test_index in skf.split(x_res, y_res):
    x_train, x_test = x_res.iloc[train_index], x_res.iloc[test_index]
    y_train, y_test = y_res.iloc[train_index], y_res.iloc[test_index]
Is it correct in this way? Or should I apply oversampling before stratified k fold?
Careful: resampling before splitting causes data leakage, where duplicates of training samples end up in the test folds (see the common pitfalls section of the imblearn docs).
Put the steps in a pipeline, then pass to cross_validate with StratifiedKFold:
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import RandomOverSampler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, StratifiedKFold

# build an imbalanced-learn pipeline: oversample, then fit the classifier
model = make_pipeline(
    RandomOverSampler(sampling_strategy="not majority"),
    LogisticRegression(),
)
print(cross_validate(model, X, y, cv=StratifiedKFold())["test_score"].mean())
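Because the sampler sits inside the pipeline, it is re-fitted on the training folds of each split only; the test folds are never resampled, so the leakage described above cannot occur.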
"Is it correct in this way? Or should I apply oversampling before stratified k fold?"
Note that this is exactly what your code does: you apply oversampling (ros.fit_resample(x, y)) before the k-fold split (skf.split(x_res, y_res)).
You should apply oversampling after the k-fold split. If you oversample before the split, there's a chance that some data points will be present in both train and test within the same split (this is called data leakage), which leads to an overly optimistic evaluation: the model is tested on points it has already seen.
A corrected version of your code would look like this:
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import RandomOverSampler

skf = StratifiedKFold(n_splits=10, random_state=None)
ros = RandomOverSampler(sampling_strategy="not majority")
for train_index, test_index in skf.split(x, y):
    x_train_unsampled, x_test = x.iloc[train_index], x.iloc[test_index]
    y_train_unsampled, y_test = y.iloc[train_index], y.iloc[test_index]
    # oversample the training fold only; the test fold stays untouched
    x_train, y_train = ros.fit_resample(x_train_unsampled, y_train_unsampled)
However, I encourage you to use a pipeline and cross_validate instead of writing all the boilerplate yourself, as Alexander suggested in his answer. This will save you time and effort and minimize the risk of introducing bugs.
A few other notes:
get_n_splits() does nothing except return the number of splits you specified on the previous line; it does not touch the data at all. You can simply remove it from your code.
Notice that I oversample only the training pool. You almost always want to leave the test pool untouched, so the evaluation reflects the true class distribution.
I have a dataset of drugs, their associated chemical features, and whether they are "responsive" or "unresponsive". I need to ensure that once I split the dataset into test and train, both have the same proportion of responsive:unresponsive cases. I know how to randomly split the data into 80% training and 20% test. I am not sure how to do the stratified sampling necessary here. Is this what I'm meant to use: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html?
The train_test_split function already has a parameter that lets you keep the proportions of y. The parameter is stratify, defined in the documentation as: "If not None, data is split in a stratified fashion, using this as the class labels".
An example of the code would be:
from sklearn.model_selection import train_test_split

# test_size=0.2 gives the 80/20 split; stratify=y preserves the class ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
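To convince yourself that the proportions are preserved, you can compare the class frequencies before and after the split. A quick check, assuming y is a pandas Series:
# the three proportions should match up to rounding
print(y.value_counts(normalize=True))
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))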
When I want to evaluate my model with cross-validation, should I perform cross-validation on the original data (not split into train and test) or on the train/test data?
I know that training data is used for fitting the model and test data for evaluating it. If I use cross-validation, should I still split the data into train and test sets, or not?
features = df.iloc[:,4:-1]
results = df.iloc[:,-1]
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
clf = LogisticRegression()
model = clf.fit(x_train, y_train)
accuracy_test = cross_val_score(clf, x_test, y_test, cv = 5)
Or should I do like this:
features = df.iloc[:,4:-1]
results = df.iloc[:,-1]
clf = LogisticRegression()
model = clf.fit(features, results)
accuracy_test = cross_val_score(clf, features, results, cv=5)
Or maybe something different?
Both your approaches are wrong.
In the first one, you apply cross-validation to the test set, which is meaningless.
In the second one, you first fit the model with your whole data and then perform cross-validation, which is again meaningless. Moreover, the approach is redundant: your fitted clf is not used by the cross_val_score method, which does its own fitting.
Since you are not doing any hyperparameter tuning (i.e. you seem to be interested only in performance assessment), there are two ways:
Either with a separate test set
Or with cross validation
First way (test set):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
clf = LogisticRegression()
model = clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
accuracy_test = accuracy_score(y_test, y_pred)
Second way (cross validation):
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle

clf = LogisticRegression()
# shuffle the data first:
features_s, results_s = shuffle(features, results)
accuracy_cv = cross_val_score(clf, features_s, results_s, cv=5, scoring='accuracy')
# fit the model afterwards on the whole data, if satisfied with the performance:
model = clf.fit(features, results)
I will try to summarize the "best practice" here:
1) If you want to train your model, fine-tune parameters, and do a final evaluation, I recommend splitting your data into training|val|test.
You fit your model on the training part and check different parameter combinations on the val part. Finally, when you're sure which classifier/parameters obtain the best result on the val part, you evaluate on the test part to get the final result.
Once you evaluate on the test part, you shouldn't change the parameters any more.
2) On the other hand, some people follow another way: they split their data into training and test, fine-tune their model using cross-validation on the training part, and at the end evaluate it on the test part.
If your data is quite large, I recommend the first way; if your data is small, the second.
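As a minimal sketch of the second way (the LogisticRegression and the C grid are placeholders chosen purely for illustration):
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# tune hyperparameters with 5-fold CV on the training part only
grid = GridSearchCV(LogisticRegression(), param_grid={"C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

# one final evaluation on the untouched test part
print(grid.score(X_test, y_test))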
If I plan to use cross validation (KFold), should I still split the dataset into training and test data and perform my training (including cross valid) only on the training set? Or will CV do everything for me? E.g.
Option 1
X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = GridSearchCV(... cv=5)
clf.fit(X_train, y_train)
Option 2
clf = GridSearchCV(... cv=5)
clf.fit(X, y)
CV is good, but it's better to also keep a train/test split so you get an independent score estimate on untouched data.
If your CV score and your test score are about the same, you can drop the train/test split phase and run CV on the whole data to squeeze out a slightly better model. But don't do that before you are sure the split score and the CV score are consistent.
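One way to check that consistency, fleshing out Option 1 (the estimator and parameter grid here are placeholders):
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GridSearchCV(LogisticRegression(), param_grid={"C": [0.1, 1, 10]}, cv=5)
clf.fit(X_train, y_train)

# if the cross-validated score and the held-out test score roughly agree,
# the CV estimate can be trusted on its own
print(clf.best_score_, clf.score(X_test, y_test))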
I am trying to learn to use scikit-learn for some basic statistical learning tasks. I thought I had successfully created a LinearRegression model fit to my data:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y, test_size=0.2, random_state=0)
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
Which yields:
0.797144744766
Then I wanted to do multiple similar 4:1 splits via automatic cross-validation:
model = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(model, X, y, cv=5)
print(scores)
And I get output like this:
[ 0.04614495 -0.26160081 -3.11299397 -0.7326256 -1.04164369]
How can the cross-validation scores be so different from the score of the single random split? They are both supposed to be using r2 scoring, and the results are the same if I pass the scoring='r2' parameter to cross_val_score.
I've tried a number of different options for the random_state parameter to cross_validation.train_test_split, and they all give similar scores in the 0.7 to 0.9 range.
I am using sklearn version 0.16.1
It turns out that my data was ordered in blocks of different classes, and by default cross_validation.cross_val_score picks consecutive splits rather than random (shuffled) splits. I was able to solve this by specifying that the cross-validation should use shuffled splits:
model = linear_model.LinearRegression()
shuffle = cross_validation.KFold(len(X), n_folds=5, shuffle=True, random_state=0)
scores = cross_validation.cross_val_score(model, X, y, cv=shuffle)
print(scores)
Which gives:
[ 0.79714474 0.86636341 0.79665689 0.8036737 0.6874571 ]
This is in line with what I would expect.
train_test_split seems to generate random splits of the dataset, while cross_val_score uses consecutive folds, i.e. "When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by default" (http://scikit-learn.org/stable/modules/cross_validation.html).
Depending on the nature of your data set (e.g. data highly correlated over the length of one segment), consecutive folds can give vastly different fits than random samples drawn from the whole data set, as the demonstration below shows.
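Here is a small made-up demonstration of that effect, using the modern model_selection API. It uses a DecisionTreeRegressor rather than LinearRegression because a tree cannot extrapolate outside the X range it was trained on, which makes the failure of consecutive folds obvious:
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# synthetic data sorted into blocks: y is a step function of X
rng = np.random.RandomState(0)
X_demo = np.sort(rng.rand(300, 1) * 3, axis=0)
y_demo = np.floor(X_demo.ravel()) * 5 + rng.randn(300) * 0.1

model = DecisionTreeRegressor(random_state=0)
# consecutive folds: each test fold covers an X range the model never saw
print(cross_val_score(model, X_demo, y_demo, cv=KFold(n_splits=3)))
# shuffled folds cover the whole range and score close to 1.0
print(cross_val_score(model, X_demo, y_demo, cv=KFold(n_splits=3, shuffle=True, random_state=0)))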
Folks, thanks for this thread.
The code in the answer above (Schneider) is outdated.
As of scikit-learn==0.19.1, this will work as expected.
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()  # the model from the question
kf = KFold(n_splits=3, shuffle=True, random_state=0)
cv_scores = cross_val_score(regressor, X, y, cv=kf)
Best,
M.