Stratified Train/Test-split in scikit-learn

Stratified Train/Test-split in scikit-learn - python

I need to split my data into a training set (75%) and test set (25%). I currently do that with the code below:
X, Xt, userInfo, userInfo_train = sklearn.cross_validation.train_test_split(X, userInfo)
However, I'd like to stratify my training dataset. How do I do that? I've been looking into the StratifiedKFold method, but doesn't let me specifiy the 75%/25% split and only stratify the training dataset.

[update for 0.17]
See the docs of sklearn.model_selection.train_test_split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
stratify=y,
test_size=0.25)
[/update for 0.17]
There is a pull request here.
But you can simply do train, test = next(iter(StratifiedKFold(...)))
and use the train and test indices if you want.

TL;DR : Use StratifiedShuffleSplit with test_size=0.25
Scikit-learn provides two modules for Stratified Splitting:
StratifiedKFold : This module is useful as a direct k-fold cross-validation operator: as in it will set up n_folds training/testing sets such that classes are equally balanced in both.
Heres some code(directly from above documentation)
>>> skf = cross_validation.StratifiedKFold(y, n_folds=2) #2-fold cross validation
>>> len(skf)
2
>>> for train_index, test_index in skf:
... print("TRAIN:", train_index, "TEST:", test_index)
... X_train, X_test = X[train_index], X[test_index]
... y_train, y_test = y[train_index], y[test_index]
... #fit and predict with X_train/test. Use accuracy metrics to check validation performance
StratifiedShuffleSplit : This module creates a single training/testing set having equally balanced(stratified) classes. Essentially this is what you want with the n_iter=1. You can mention the test-size here same as in train_test_split
Code:
>>> sss = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=0)
>>> len(sss)
1
>>> for train_index, test_index in sss:
... print("TRAIN:", train_index, "TEST:", test_index)
... X_train, X_test = X[train_index], X[test_index]
... y_train, y_test = y[train_index], y[test_index]
>>> # fit and predict with your classifier using the above X/y train/test

You can simply do it with train_test_split() method available in Scikit learn:
from sklearn.model_selection import train_test_split
train, test = train_test_split(X, test_size=0.25, stratify=X['YOUR_COLUMN_LABEL'])
I have also prepared a short GitHub Gist which shows how stratify option works:
https://gist.github.com/SHi-ON/63839f3a3647051a180cb03af0f7d0d9

Here's an example for continuous/regression data (until this issue on GitHub is resolved).
min = np.amin(y)
max = np.amax(y)
# 5 bins may be too few for larger datasets.
bins = np.linspace(start=min, stop=max, num=5)
y_binned = np.digitize(y, bins, right=True)
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
stratify=y_binned
)
Where start is min and stop is max of your continuous target.
If you don't set right=True then it will more or less make your max value a separate bin and your split will always fail because too few samples will be in that extra bin.

In addition to the accepted answer by #Andreas Mueller, just want to add that as #tangy mentioned above:
StratifiedShuffleSplit most closely resembles train_test_split(stratify = y)
with added features of:
stratify by default
by specifying n_splits, it repeatedly splits the data

StratifiedShuffleSplit is done after we choose the column that should be evenly represented in all the small dataset we are about to generate.
'The folds are made by preserving the percentage of samples for each class.'
Suppose we've got a dataset 'data' with a column 'season' and we want the get an even representation of 'season' then it looks like that:
from sklearn.model_selection import StratifiedShuffleSplit
sss=StratifiedShuffleSplit(n_splits=1,test_size=0.25,random_state=0)
for train_index, test_index in sss.split(data, data["season"]):
sss_train = data.iloc[train_index]
sss_test = data.iloc[test_index]

As such, it is desirable to split the dataset into train and test sets in a way that preserves the same proportions of examples in each class as observed in the original dataset.
This is called a stratified train-test split.
We can achieve this by setting the “stratify” argument to the y component of the original dataset. This will be used by the train_test_split() function to ensure that both the train and test sets have the proportion of examples in each class that is present in the provided “y” array.

#train_size is 1 - tst_size - vld_size
tst_size=0.15
vld_size=0.15
X_train_test, X_valid, y_train_test, y_valid = train_test_split(df.drop(y, axis=1), df.y, test_size = vld_size, random_state=13903)
X_train_test_V=pd.DataFrame(X_train_test)
X_valid=pd.DataFrame(X_valid)
X_train, X_test, y_train, y_test = train_test_split(X_train_test, y_train_test, test_size=tst_size, random_state=13903)

Updating #tangy answer from above to the current version of scikit-learn: 0.23.2 (StratifiedShuffleSplit documentation).
from sklearn.model_selection import StratifiedShuffleSplit
n_splits = 1 # We only want a single split in this case
sss = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.25, random_state=0)
for train_index, test_index in sss.split(X, y):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]

Related

Repeated holdout method

How can I make "Repeated" holdout method, I made holdout method and get accuracy but need to repeat holdout method for 30 times
There is my code for holdout method
[IN]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y.values.ravel(), random_state=100)
model = LogisticRegression()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print("Accuracy: %.2f%%" % (result*100.0))
[OUT]
Accuracy: 49.62%
I see many codes for repeated method but only for K fold cross, nothing for holdout method

So to use a repeated holdout you could use the ShuffleSplit method from sklearn. A minimum working example (following the name conventions that you used) might be as follows:
from sklearn.modelselection import ShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
# Create some artificial data to train on, can be replace by your own data
X, Y = make_classification()
rs = ShuffleSplit(n_splits=30, test_size=0.25, random_state=100)
model = LogisticRegression()
for train_index, test_index in rs.split(X):
X_train, Y_train = X[train_index], Y[train_index]
X_test, Y_test = X[test_index], Y[test_index]
model.fit(X_train,Y_train)
result = model.score(X_test, Y_test)
print("Accuracy: %.2f%%" % (result*100.0))
n_splits determines how many time you would like to repeat the holdout. test_size deterimines the fraction of samples that is sampled as a test set. In this case 75% is sampled as train set, whereas 25% is sampled to your test set. For reproducible results you can set the random_state (any number suffices, as long as you use the same number consistently).

Machine Learning Using Scikit-Learn & SVM

Load popular digits dataset from sklearn.datasets module and assign it to variable digits.
Split digits.data into two sets names X_train and X_test. Also, split digits.target into two sets Y_train and Y_test.
Hint: Use train_test_split() method from sklearn.model_selection; set random_state to 30; and perform stratified sampling.
Build an SVM classifier from X_train set and Y_train labels, with default parameters. Name the model as svm_clf.
Evaluate the model accuracy on the testing data set and print its score.
I used the following code:
import sklearn.datasets as datasets
import sklearn.model_selection as ms
from sklearn.model_selection import train_test_split
digits = datasets.load_digits();
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=30)
print(X_train.shape)
print(X_test.shape)
from sklearn.svm import SVC
svm_clf = SVC().fit(X_train, y_train)
print(svm_clf.score(X_test,y_test))
I got the below output.
(1347,64)
(450,64)
0.4088888888888889
But I am not able to pass the test. Can someone help with what is wrong?

You are missing the stratified sampling requirement; modify your split to include it:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=30, stratify=y)
Check the documentation.

How to stratify the training and testing data in Scikit-Learn?

I am trying to implement Classification algorithm for Iris Dataset (Downloaded from Kaggle). In the Species column the classes (Iris-setosa, Iris-versicolor , Iris-virginica) are in sorted order. How can I stratify the train and test data using Scikit-Learn?

If you want to shuffle and split your data with 0.3 test ratio, you can use
sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)
where X is your data, y is corresponding labels, test_size is the percentage of the data that should be held over for testing, shuffle=True shuffles the data before splitting
In order to make sure that the data is equally splitted according to a column, you can give it to the stratify parameter.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
shuffle=True,
stratify = X['YOUR_COLUMN_LABEL'])

To make sure that the three classes are represented equally in your train and test, you can use the stratify parameter of the train_test_split function.
from sklearn.model_selection import train_test_split
X_train, y_train, X_test, y_test = train_test_split(X, y, stratify = y)
This will make sure that the ratio of all the classes is maintained equally.

use sklearn.model_selection.train_test_split and play around with Shuffle parameter.
shuffle: boolean, optional (default=True)
Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.

Get row indexes of H2O dataframe when split into train and test

I am new to H2O. So far for the train-test split I have used the StratifiedKFold() of sklearn.
skf = StratifiedKFold(n_splits=n, random_state=None, shuffle=False)
for train_index, test_index in skf.split(X, y):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
I need the indexes for some further processing later.
In H2O I can't figure out how to get the indexes while doing cross validation. From what I have gathered via videos and blogs, this is how we do CV in H2O:
gbm_model = H2OGradientBoostingEstimator(model_id = 'gbm_model',nfolds=5)
How do I get the train and test indexes of each fold?
Also, how do I get the indexes while doing a simple split?
data_split = data.split_frame(ratios=[0.8],seed = 1234)
train_df = data_split[0]
test_df = data_split[1]
How do I get the indexes that went into train and test?

you could use stratified_kfold_column(n_folds=3, seed=-1) or stratified_split(test_frac=0.2, seed=-1) which create a column with the splits you can use to subset to split on later.
see more about these in the docs

how can I split data in 3 or more parts with sklearn

I want to split data into train,test and validation datasets which are stratification, but sklearn only provides cross_validation.train_test_split which only can divide into 2 pieces.
What should i do if i want do this

If you want to use a Stratified Train/Test split, you can use StratifiedKFold in Sklearn
Suppose X is your features and y are your labels, based on the example here :
from sklearn.model_selection import StratifiedKFold
cv_stf = StratifiedKFold(n_splits=3)
for train_index, test_index in skf.split(X, y):
print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
Update : To split data into say 3 different percentages use numpy.split() can be done like this :
X_train, X_test, X_validate = np.split(X, [int(.7*len(X)), int(.8*len(X))])
y_train, y_test, y_validate = np.split(y, [int(.7*len(y)), int(.8*len(y))])

You can also use train_test_split more than once to achieve this. The second time, run it on the training output from the first call to train_test_split.
from sklearn.model_selection import train_test_split
def train_test_validate_stratified_split(features, targets, test_size=0.2, validate_size=0.1):
# Get test sets
features_train, features_test, targets_train, targets_test = train_test_split(
features,
targets,
stratify=targets,
test_size=test_size
)
# Run train_test_split again to get train and validate sets
post_split_validate_size = validate_size / (1 - test_size)
features_train, features_validate, targets_train, targets_validate = train_test_split(
features_train,
targets_train,
stratify=targets_train,
test_size=post_split_validate_size
)
return features_train, features_test, features_validate, targets_train, targets_test, targets_validate

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Stratified Train/Test-split in scikit-learn - python

In addition to the accepted answer by #Andreas Mueller, just want to add that as #tangy mentioned above: StratifiedShuffleSplit most closely resembles train_test_split(stratify = y) with added features of: stratify by default by specifying n_splits, it repeatedly splits the data

Related

Repeated holdout method

Machine Learning Using Scikit-Learn & SVM

How to stratify the training and testing data in Scikit-Learn?

Get row indexes of H2O dataframe when split into train and test

how can I split data in 3 or more parts with sklearn

Categories

Resources