Best practice for train, validation and test set - python

I want to assign a sample class to each instance in a dataframe - 'train', 'validation' and 'test'. If I use sklearn train_test_split(), twice, I can get the indices for a train, validation and test set like this:
X = df.drop(['target'], axis=1)
y=df[['target']]
X_train, X_test, y_train, y_test, indices_train, indices_test=train_test_split(X, y, df.index,
test_size=0.2,
random_state=10,
stratify=y,
shuffle=True)
df_ = df.loc[indices_train]  # indices_train holds index labels, so use .loc
X_ = df_.drop(['target'], axis=1)
y_=df_[['target']]
X_train, X_val, y_train, y_val, indices_train, indices_val=train_test_split(X_, y_, df_.index,
test_size=0.15,
random_state=10,
stratify=y_,
shuffle=True)
df['sample']=['train' if i in indices_train else 'test' if i in indices_test else 'val' for i in df.index]
What is the best practice for getting a train, validation and test set? Are there any problems with my approach above, and can it be phrased better?

A faster solution for a large dataset would be to use numpy; see:
How to split data into 3 sets (train, validation and test)?
The simpler way is your solution, but you could just feed the X_train and y_train obtained in the first step into the train/validation split. Storing the indices and removing rows from the df feels unnecessary.
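A minimal sketch of the numpy route, for labelling each row directly. The toy DataFrame, the 70/15/15 proportions and the seed are assumptions made up for illustration, and the assignment is purely random rather than stratified:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataframe (hypothetical columns).
df = pd.DataFrame({"feature": np.arange(100), "target": np.tile([0, 1], 50)})

rng = np.random.default_rng(10)            # fixed seed for reproducibility
n = len(df)
n_train, n_val = round(0.70 * n), round(0.15 * n)
n_test = n - n_train - n_val               # whatever is left goes to test

# Build the labels in order, then shuffle them so the assignment is random.
labels = np.array(["train"] * n_train + ["val"] * n_val + ["test"] * n_test)
df["sample"] = rng.permutation(labels)

print(df["sample"].value_counts())
```

This avoids the per-row membership test in the list comprehension, which gets slow on large frames.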

So, I made a dummy dataset of 100 points.
I separated the features from the target and did the first split:
X = df.drop('target', axis=1)
y = df['target']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
Note that my test size is 0.3, which means 70 data points go to training and the remaining 30 are held out, to be shared between validation and test.
X_train.shape # Output (70, 3)
X_test.shape # Output (30, 3)
Now you need to split again for validation, so you can do it like this:
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5)
Notice how I name the groups, and that test_size is now 0.5, which means I take the 30 held-out points and split them evenly between validation and test. So the shapes of the validation and test sets will be:
X_val.shape # Output (15, 3)
X_test.shape # Output (15, 3)
At the end you have 70 points for training, 15 for testing and 15 for validation.
Now, think of validation as a "double check" of your training. There are a lot of confusing concepts around it, but its purpose is simply to make sure your training worked.
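The same two-step split can be sketched with the fractions stated relative to the full dataset. This is only a sketch: the toy arrays, the seed and the use of stratify are my assumptions, not part of the answer above. The key detail is that the second test_size is relative to the held-out set, so it has to be rescaled:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 3 features, two balanced classes.
X = np.arange(300).reshape(100, 3)
y = np.repeat([0, 1], 50)

val_frac, test_frac = 0.15, 0.15           # fractions of the full dataset

# Step 1: hold out validation + test together (30% of everything).
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=val_frac + test_frac, stratify=y, random_state=0)

# Step 2: split the hold-out set; test_size is relative to the hold-out set.
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=test_frac / (val_frac + test_frac),
    stratify=y_hold, random_state=0)

print(X_train.shape, X_val.shape, X_test.shape)
```

Setting random_state makes the split reproducible between runs, which the original snippet does not guarantee.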

Related

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=1)

What is the purpose of this line:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=1)
For neural networks you have input features (X) and output labels (Y). It's very important to split your data into a training dataset and testing dataset.
To make this easy sklearn has a function called
train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None).
Here's the documentation for sklearn.model_selection.train_test_split
Going through the function we can see that:
1.) X is your input features array
2.) Y is your output label array
3.) test_size = 0.25 states that you want your testing data to be 25% of your overall data. Therefore your training data will be 75% of your overall data.
4.) random_state = 1 controls the shuffling applied to the data before the split, so the same split is reproduced on every run.
5.) Your question is why do you have 4 outputs (X_train, X_test, y_train, y_test). It is because X will be split into X_train (75%) and X_test (25%) and then Y will be split into y_train (75%) and y_test (25%). It's all put onto one line.
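To see the points above in action, here is a small sketch on made-up data (the arrays are hypothetical; only train_test_split itself comes from the question). It shows the 75/25 sizes, that rows and labels stay paired, and that a fixed random_state reproduces the split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 8 samples, 2 features; label i belongs to row i.
X = np.arange(16).reshape(8, 2)
y = np.arange(8)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

print(X_train.shape, X_test.shape)         # 6 training rows, 2 test rows

# Rows and labels stay paired: row i is [2*i, 2*i + 1] and keeps label i.
print(np.array_equal(X_test[:, 0] // 2, y_test))

# The same random_state reproduces the same split.
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.25, random_state=1)
print(np.array_equal(X_test, X_test2))
```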

I want to compare my prediction value with original train data

I am trying to learn the decision tree regressor and I have written the code below.
X_train, X_test, y_train, y_test = train_test_split(
x, y, test_size = 0.3, random_state = 100)
model = DecisionTreeRegressor(random_state=1)
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
I want to create a dataframe which includes X_test, y_test and y_pred.
Is there any method or function for that?
Append the below code at the end of your prediction code:
final_df = X_test.copy()
final_df["Y_original"] = y_test
final_df["Y_predicted"] = y_pred
Here we create a new dataframe named final_df and put all the values you require into it. I would not suggest adding the values directly to X_test, as it might be needed again for prediction.
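For completeness, an end-to-end sketch on invented data (the toy features, the target formula and the extra error column are assumptions for illustration). It works because y_test is a Series that aligns on X_test's index, while y_pred is a plain array in the same row order as X_test:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy regression data.
rng = np.random.default_rng(0)
x = pd.DataFrame({"a": rng.random(50), "b": rng.random(50)})
y = pd.Series(2 * x["a"] + x["b"], name="target")

X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=100)

model = DecisionTreeRegressor(random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Copy X_test so the original stays untouched for later predictions.
final_df = X_test.copy()
final_df["Y_original"] = y_test            # aligned on the index
final_df["Y_predicted"] = y_pred           # assigned by position
final_df["error"] = final_df["Y_predicted"] - final_df["Y_original"]
print(final_df.head())
```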

How to stratify the training and testing data in Scikit-Learn?

I am trying to implement Classification algorithm for Iris Dataset (Downloaded from Kaggle). In the Species column the classes (Iris-setosa, Iris-versicolor , Iris-virginica) are in sorted order. How can I stratify the train and test data using Scikit-Learn?
If you want to shuffle and split your data with 0.3 test ratio, you can use
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)
where X is your data, y is the corresponding labels, test_size is the fraction of the data that should be held out for testing, and shuffle=True shuffles the data before splitting.
To make sure that the data is split proportionally according to a column, you can pass that column to the stratify parameter.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
shuffle=True,
stratify = X['YOUR_COLUMN_LABEL'])
To make sure that the three classes are represented equally in your train and test, you can use the stratify parameter of the train_test_split function.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
This will make sure that the ratio of the classes is the same in the train and test sets.
Use sklearn.model_selection.train_test_split and play around with the shuffle parameter.
shuffle: boolean, optional (default=True)
Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.
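A quick way to check that stratification does what you expect on class-sorted data. The arrays below are a hypothetical stand-in for the Kaggle Iris file (150 rows, 50 per species, sorted by class); the seed is arbitrary:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the Iris data: 150 rows sorted by class.
X = np.arange(150).reshape(150, 1)
y = np.repeat(["Iris-setosa", "Iris-versicolor", "Iris-virginica"], 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Count how many samples of each class landed in each split.
train_classes, train_counts = np.unique(y_train, return_counts=True)
test_classes, test_counts = np.unique(y_test, return_counts=True)
print(dict(zip(train_classes, train_counts)))
print(dict(zip(test_classes, test_counts)))
```

Every species should appear in both splits in the same proportion, even though the input was sorted.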

How to create a subset of samples from the original MNIST data, while keeping all 10 classes

Suppose X, Y = load_mnist(), where X and Y are the tensors that contain the whole of MNIST. Now I want a smaller proportion of the data to make my code run faster, but I need to keep all 10 classes, and in a balanced manner. Is there an easy way to do this?
scikit-learn's train_test_split is meant to split the data into train and test sets, but you can also use it to create a "balanced" subset of your dataset using the stratify argument. You can just specify the train/test size proportion you desire and thereby obtain a smaller, stratified sample of your data. In your case:
from sklearn.model_selection import train_test_split
X_1, X_2, Y_1, Y_2 = train_test_split(X, Y, stratify=Y, test_size=0.5)
If you want to do this with more control, you could use numpy.random.choice to generate indices of the size of the subset and sample the original arrays, as in the following piece of code. Note that this samples uniformly, so the class balance is only preserved approximately:
# input data, assume that you've 10K samples
In [77]: total_samples = 10000
In [78]: X, Y = np.random.random_sample((total_samples, 784)), np.random.randint(0, 10, total_samples)
# out of these 10K, we want to pick only 500 samples as a subset
In [79]: subset_size = 500
# generate uniformly distributed indices, of size `subset_size`, without repeats
In [80]: subset_idx = np.random.choice(total_samples, subset_size, replace=False)
# simply index into the original arrays to obtain the subsets
In [81]: X_subset, Y_subset = X[subset_idx], Y[subset_idx]
In [82]: X_subset.shape, Y_subset.shape
Out[82]: ((500, 784), (500,))
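If you need exactly the same number of samples from every class, one possible sketch is to draw the indices class by class. The random arrays below are a hypothetical stand-in for MNIST (fewer rows, to keep it light), and per_class is a made-up parameter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for MNIST: 2,000 samples, 784 features, 10 classes.
X = rng.random((2_000, 784))
Y = rng.integers(0, 10, size=2_000)

per_class = 50                             # samples to keep from each class

# For every class, pick `per_class` distinct row indices of that class.
subset_idx = np.concatenate([
    rng.choice(np.flatnonzero(Y == c), size=per_class, replace=False)
    for c in np.unique(Y)
])
rng.shuffle(subset_idx)                    # avoid a class-sorted subset

X_subset, Y_subset = X[subset_idx], Y[subset_idx]
print(X_subset.shape, np.bincount(Y_subset))
```

rng.choice raises an error if a class has fewer than per_class samples, so the sketch fails loudly instead of silently producing an unbalanced subset.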
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)
Passing the labels to stratify ensures the class proportions are preserved in both sets.
If you want several stratified random splits (a stratified alternative to K-Fold), then
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
See the sklearn documentation for details.

Stratified Train/Test-split in scikit-learn

I need to split my data into a training set (75%) and test set (25%). I currently do that with the code below:
X, Xt, userInfo, userInfo_train = sklearn.cross_validation.train_test_split(X, userInfo)
However, I'd like to stratify my training dataset. How do I do that? I've been looking into the StratifiedKFold method, but it doesn't let me specify the 75%/25% split and only stratify the training dataset.
[update for 0.17]
See the docs of sklearn.model_selection.train_test_split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
stratify=y,
test_size=0.25)
[/update for 0.17]
There is a pull request here.
But you can simply do train, test = next(iter(StratifiedKFold(...)))
and use the train and test indices if you want.
TL;DR : Use StratifiedShuffleSplit with test_size=0.25
Scikit-learn provides two modules for Stratified Splitting:
StratifiedKFold : This module is useful as a direct k-fold cross-validation operator: as in it will set up n_folds training/testing sets such that classes are equally balanced in both.
Here's some code (directly from the above documentation; it uses the old cross_validation module, which was removed in scikit-learn 0.20):
>>> skf = cross_validation.StratifiedKFold(y, n_folds=2)  # 2-fold cross-validation
>>> len(skf)
2
>>> for train_index, test_index in skf:
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
...     # fit and predict with X_train/test. Use accuracy metrics to check validation performance
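With the current sklearn.model_selection API, the StratifiedKFold loop would look roughly like this. The tiny X and y arrays are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 4 samples, two classes.
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])

skf = StratifiedKFold(n_splits=2)          # 2-fold cross-validation
print(skf.get_n_splits(X, y))              # number of folds

for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # fit on X_train/y_train and evaluate on X_test/y_test here
```

The main differences are that the labels are passed to split() rather than to the constructor, and n_folds became n_splits.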
StratifiedShuffleSplit : This module creates a single training/testing set with equally balanced (stratified) classes. Essentially this is what you want, with n_iter=1. You can specify the test size here the same way as in train_test_split.
Code:
>>> sss = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=0)
>>> len(sss)
1
>>> for train_index, test_index in sss:
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
>>> # fit and predict with your classifier using the above X/y train/test
You can simply do it with train_test_split() method available in Scikit learn:
from sklearn.model_selection import train_test_split
train, test = train_test_split(X, test_size=0.25, stratify=X['YOUR_COLUMN_LABEL'])
I have also prepared a short GitHub Gist which shows how stratify option works:
https://gist.github.com/SHi-ON/63839f3a3647051a180cb03af0f7d0d9
Here's an example for continuous/regression data (until this issue on GitHub is resolved).
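import numpy as np
from sklearn.model_selection import train_test_split

y_min = np.amin(y)  # avoid shadowing the built-ins min and max
y_max = np.amax(y)
# 5 edges (4 bins) may be too few for larger datasets.
# Drop the lowest edge so the minimum value does not land in a bin of its own.
bins = np.linspace(start=y_min, stop=y_max, num=5)[1:]
y_binned = np.digitize(y, bins, right=True)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    stratify=y_binned
)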
Where start is min and stop is max of your continuous target.
If you don't set right=True then it will more or less make your max value a separate bin and your split will always fail because too few samples will be in that extra bin.
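An alternative sketch, not from the answer above, is to bin the target by quantiles with pandas.qcut instead of equal-width bins. Quantile bins hold roughly the same number of samples each, which sidesteps the sparse-bin problem for skewed targets. The data, the number of bins and the seed are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical skewed continuous target.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = rng.exponential(scale=2.0, size=200)

# Quantile bins hold roughly equal numbers of samples, so no bin ends up
# nearly empty even when the target distribution is skewed.
y_binned = pd.qcut(y, q=5, labels=False)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y_binned, random_state=0)

print(np.bincount(y_binned))               # samples per quantile bin
```

The binned labels are only used to guide the split; the model is still trained on the original continuous y.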
In addition to the accepted answer by @Andreas Mueller, I just want to add that, as @tangy mentioned above:
StratifiedShuffleSplit most closely resembles train_test_split(stratify = y)
with added features of:
stratify by default
by specifying n_splits, it repeatedly splits the data
StratifiedShuffleSplit is applied after we choose the column that should be evenly represented in all the smaller datasets we are about to generate.
'The folds are made by preserving the percentage of samples for each class.'
Suppose we've got a dataset 'data' with a column 'season' and we want to get an even representation of 'season'; then it looks like this:
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
for train_index, test_index in sss.split(data, data["season"]):
    sss_train = data.iloc[train_index]
    sss_test = data.iloc[test_index]
As such, it is desirable to split the dataset into train and test sets in a way that preserves the same proportions of examples in each class as observed in the original dataset.
This is called a stratified train-test split.
We can achieve this by setting the “stratify” argument to the y component of the original dataset. This will be used by the train_test_split() function to ensure that both the train and test sets have the proportion of examples in each class that is present in the provided “y” array.
# train_size is 1 - tst_size - vld_size
tst_size = 0.15
vld_size = 0.15
X_train_test, X_valid, y_train_test, y_valid = train_test_split(df.drop('y', axis=1), df.y, test_size=vld_size, random_state=13903)
# Rescale so the test set is tst_size of the full dataset, not of the remainder.
X_train, X_test, y_train, y_test = train_test_split(X_train_test, y_train_test, test_size=tst_size / (1 - vld_size), random_state=13903)
Updating @tangy's answer from above to the current version of scikit-learn: 0.23.2 (StratifiedShuffleSplit documentation).
from sklearn.model_selection import StratifiedShuffleSplit
n_splits = 1  # We only want a single split in this case
sss = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.25, random_state=0)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
