How to split training and test sets? - python

Where should we use
X_train,X_test,y_train,y_test= train_test_split(data, test_size=0.3, random_state=42)
and where should we use
train, test= train_test_split(data, test_size=0.3, random_state=0).
The former one return this:
value error: not enough values to unpack (expected 4, got 2)

The first form you use if you want to split instances with features (X) and labels (y). The second form you use if you only want to split features (X).
X_train, X_test, y_train, y_test= train_test_split(data, y, test_size=0.3, random_state=42)
The reason why it didn' t work for you was because you didn't prodide the label data in your train_test_split() function. The above should work well. Just replace y with your label/target data.

if you have 1 data list, it split to 2,
|---data_train
data ----train_test_split()--|
|---data_test
if you have 2 data list, it split EACH of the data list to 2, that is 4 in total.
|---data_train_x
|---data_train_y
data_x, data_y ----train_test_split()--|
|---data_test_x
|---data_test_y
The same as n data list.

train_test_split method accepts as many arrays as argument as you need.
But, since you need four returned values you have to pass 2 arrays as argument.
X_train, X_test, y_train, y_test= train_test_split(data, y_data, test_size=0.3, random_state=42)
If you need to pass many arrays you can use extended iterable unpacking operator.
train_test_split(*arrays, test_size = test_size, random_state = 0)

Related

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=1)

What is purpose of this line :
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=1)
For neural networks you have input features (X) and output labels (Y). It's very important to split your data into a training dataset and testing dataset.
To make this easy sklearn has a function called
train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None).
Here's the documentation for sklearn.model_selection.train_test_split
Going through the function we can see that:
1.) X is your input features array
2.) Y is your output label array
3.) test_size = 0.25 states that you want your testing data to be 25% of your overall data. Therefore your training data will be 75% of your overall data.
4.) random_state = 1 Controls the shuffling applied to the data before applying the split.
5.) Your question is why do you have 4 outputs (X_train, X_test, y_train, y_test). It is because X will be split into X_train (75%) and X_test (25%) and then Y will be split into y_train (75%) and y_test (25%). It's all put onto one line.

Best practice for train, validation and test set

I want to assign a sample class to each instance in a dataframe - 'train', 'validation' and 'test'. If I use sklearn train_test_split(), twice, I can get the indices for a train, validation and test set like this:
X = df.drop(['target'], axis=1)
y=df[['target']]
X_train, X_test, y_train, y_test, indices_train, indices_test=train_test_split(X, y, df.index,
test_size=0.2,
random_state=10,
stratify=y,
shuffle=True)
df_=df.iloc[indices_train]
X_ = df_.drop(['target'], axis=1)
y_=df_[['target']]
X_train, X_val, y_train, y_val, indices_train, indices_val=train_test_split(X_, y_, df_.index,
test_size=0.15,
random_state=10,
stratify=y_,
shuffle=True)
df['sample']=['train' if i in indices_train else 'test' if i in indices_test else 'val' for i in df.index]
What is best practice to get a train, validation and test set? Is there any problems with my approach above and can it be frased better?
a faster and optimal solution if dataset is large would be using numpy.
How to split data into 3 sets (train, validation and test)?
or the simpler way is your solution, but maybe just feed the x_train, y_train you obtained in the 1 step, for the train validation split? like the indices being stored and rows just removed from the df feels unnecessary.
So, I did a dummy dataset of 100 points.
I separate the data and I did the first split:
X = df.drop('target', axis=1)
y = df['target']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
If you have a look, my test size is 0.3 which means 70 data points will go for traininf and 30 for test and validation as well.
X_train.shape # Output (70, 3)
X_test.shape # Output (30, 3)
Now you need to split again for validation, so you can do it like this:
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5)
Notice how I name the groups and the test_size is now 0.5. Which means I take the 30 points for test and I splitted for validation as well. So the shape of validation and testing, will be:
X_val.shape # Output (15, 3)
X_test.shape # Output (15, 3)
At the end you have 70 points for training, 15 for testing and 15 for validation.
Now, consider validation as "double check" of your training. There are a lot of messy concepts related with that. It's just be sure of your training.

How to stratify the training and testing data in Scikit-Learn?

I am trying to implement Classification algorithm for Iris Dataset (Downloaded from Kaggle). In the Species column the classes (Iris-setosa, Iris-versicolor , Iris-virginica) are in sorted order. How can I stratify the train and test data using Scikit-Learn?
If you want to shuffle and split your data with 0.3 test ratio, you can use
sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)
where X is your data, y is corresponding labels, test_size is the percentage of the data that should be held over for testing, shuffle=True shuffles the data before splitting
In order to make sure that the data is equally splitted according to a column, you can give it to the stratify parameter.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
shuffle=True,
stratify = X['YOUR_COLUMN_LABEL'])
To make sure that the three classes are represented equally in your train and test, you can use the stratify parameter of the train_test_split function.
from sklearn.model_selection import train_test_split
X_train, y_train, X_test, y_test = train_test_split(X, y, stratify = y)
This will make sure that the ratio of all the classes is maintained equally.
use sklearn.model_selection.train_test_split and play around with Shuffle parameter.
shuffle: boolean, optional (default=True)
Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.

train_test_split( ) method of scikit learn

I am trying to create a machine learning model using DecisionTreeClassifier. To train & test my data I imported train_test_split method from scikit learn. But I can not understand one of its arguments called random_state.
What is the significance of assigning numeric values to random_state of model_selection.train_test_split function and how may I know which numeric value to assign random_state for my decision tree?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)
As the docs mention, random_state is for the initialization of the random number generator used in train_test_split (similarly for other methods, as well). As there are many different ways to actually split a dataset, this is to ensure that you can use the method several times with the same dataset (e.g. in a series of experiments) and always get the same result (i.e. the exact same train and test sets here), i.e for reproducibility reasons. Its exact value is not important and is not something you have to worry about.
Using the example in the docs, setting random_state=42 ensures that you get the exact same result shown there (the code below is actually run in my machine, and not copy-pasted from the docs):
import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
X_train
# array([[4, 5],
# [0, 1],
# [6, 7]])
y_train
# [2, 0, 3]
X_test
# array([[2, 3],
# [8, 9]])
y_test
# [1, 4]
You should experiment yourself with different values for random_state (or without specifying it at all) in the above snippet to get the feeling.
Providing a value to random state will be helpful in reproducing the same values in the split when you re-run the program.
If you don't provide any value to the random state, we will get different set of values for test and train after each run. In such a case, if you encounter any error, then it will not be helpful in debugging.
Example:
Setup:
from sklearn.model_selection import train_test_split
import pandas as pd
data = pd.read_csv("diabetes.csv")
X=data.iloc[0:,0:8]
X.head()
y=data.iloc[0:,-1]
y.head()
Loop with random_state:
for _ in range(2):
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.33, random_state=42)
print(X_train.head())
print(X_test.head())
Note the data is the same for both iterations
Loop without random_state:
for _ in range(2):
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.33)
print(X_train.head())
print(X_test.head())
Note the data is not the same for both iterations
If you run the code, and see the output, you will see when random_state is the same, it will provide the same train / test set, but when random_state is not provided, the set of values in test / train is different each time.
If you don't specify random_state every time you execute your code you will get a different (random) split. Instead if you give a random_state value the split will always be the same. It is often used for experiments reproducibility.
For example:
X = [[1,5],[2,6],[3,2],[4,7], [5,5], [6,2], [7,1],[8,6]]
y = [1,2,3,4,5,6,7,8]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
X_train_rs, X_test_rs, y_train_rs, y_test_rs = train_test_split(X, y, test_size=0.33, random_state=324)
print("WITH RANDOM STATE: ")
print("X_train: {}\ny_train: {}\nX_test: {}\ny_test: {}".format(X_train_rs, X_test_rs, y_train_rs, y_test_rs))
print("WITHOUT RANDOM STATE: ")
print("X_train: {}\ny_train: {}\nX_test: {}\ny_test: {}".format(X_train, X_test, y_train, y_test))
If you run this code different times you can see that the splits without random state change at every run.
As explained in the sklearn documentation, random_state can be an integer if you want specify the random number generator seed (the most frequent case), or directly an instance of RandomState class.
random_state argument is just to seed random order. if you give different random_state it will split dataset in different order. if you provide same random_state every time then split will be same. dataset will split in same order.
If you want your dataset to split in same order every time then provide same random_state.

How can I do K fold cross-validation for splitting the train and test set?

I have a set of documents and a set of labels.
Right now, I am using train_test_split to split my dataset in a 90:10 ratio. However, I wish to use Kfold cross-validation.
train=[]
with open("/Users/rte/Documents/Documents.txt") as f:
for line in f:
train.append(line.strip().split())
labels=[]
with open("/Users/rte/Documents/Labels.txt") as t:
for line in t:
labels.append(line.strip().split())
X_train, X_test, Y_train, Y_test= train_test_split(train, labels, test_size=0.1, random_state=42)
When I try the method provided in the documentation of scikit learn: I receive an error that says:
kf=KFold(len(train), n_folds=3)
for train_index, test_index in kf:
X_train, X_test = train[train_index],train[test_index]
y_train, y_test = labels[train_index],labels[test_index]
error
X_train, X_test = train[train_index],train[test_index]
TypeError: only integer arrays with one element can be converted to an index
How can I perform a 10 fold cross-validation on my documents and labels?
There are two ways to solve this error:
First way:
Cast your data to a numpy array:
import numpy as np
[...]
train = np.array(train)
labels = np.array(labels)
then it should work with your current code.
Second way:
Use list comprehension to index the train & label list with the train_index & test_index list
for train_index, test_index in kf:
X_train, X_test = [train[i] for i in train_index],[train[j] for j in test_index]
y_train, y_test = [labels[i] for i in train_index],[labels[j] for j in test_index]
(For this solution also see related question index list with another list)

Categories

Resources