train_test_split() method of scikit-learn - Python

I am trying to create a machine learning model using DecisionTreeClassifier. To train and test my data I imported the train_test_split method from scikit-learn, but I cannot understand one of its arguments, random_state.
What is the significance of assigning a numeric value to random_state of the model_selection.train_test_split function, and how do I know which numeric value to assign to random_state for my decision tree?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)

As the docs mention, random_state is for the initialization of the random number generator used in train_test_split (and similarly for other methods). Since there are many different ways to actually split a dataset, this ensures that you can use the method several times with the same dataset (e.g. in a series of experiments) and always get the same result (i.e. the exact same train and test sets here), in other words for reproducibility. Its exact value is not important and is not something you have to worry about.
Using the example in the docs, setting random_state=42 ensures that you get the exact same result shown there (the code below was actually run on my machine, not copy-pasted from the docs):
import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
X_train
# array([[4, 5],
#        [0, 1],
#        [6, 7]])
y_train
# [2, 0, 3]
X_test
# array([[2, 3],
#        [8, 9]])
y_test
# [1, 4]
You should experiment with different values for random_state (or without specifying it at all) in the above snippet to get a feel for it.

Providing a value to random_state helps reproduce the same split when you re-run the program.
If you don't provide any value for random_state, you will get a different set of values for the train and test sets after each run. In that case, if you encounter an error, it will not be easy to debug.
Example:
Setup:
from sklearn.model_selection import train_test_split
import pandas as pd
data = pd.read_csv("diabetes.csv")
X = data.iloc[:, 0:8]
X.head()
y = data.iloc[:, -1]
y.head()
Loop with random_state:
for _ in range(2):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    print(X_train.head())
    print(X_test.head())
Note that the data is the same for both iterations.
Loop without random_state:
for _ in range(2):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
    print(X_train.head())
    print(X_test.head())
Note that the data is not the same for both iterations.
If you run the code and inspect the output, you will see that when random_state is the same, the train/test split is the same; when random_state is not provided, the set of values in the train/test split is different each time.

If you don't specify random_state, then every time you execute your code you will get a different (random) split. If instead you give random_state a value, the split will always be the same. It is often used for reproducibility of experiments.
For example:
X = [[1,5],[2,6],[3,2],[4,7], [5,5], [6,2], [7,1],[8,6]]
y = [1,2,3,4,5,6,7,8]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
X_train_rs, X_test_rs, y_train_rs, y_test_rs = train_test_split(X, y, test_size=0.33, random_state=324)
print("WITH RANDOM STATE: ")
print("X_train: {}\ny_train: {}\nX_test: {}\ny_test: {}".format(X_train_rs, X_test_rs, y_train_rs, y_test_rs))
print("WITHOUT RANDOM STATE: ")
print("X_train: {}\ny_train: {}\nX_test: {}\ny_test: {}".format(X_train, X_test, y_train, y_test))
If you run this code multiple times you will see that the split without random_state changes at every run.
As explained in the sklearn documentation, random_state can be an integer if you want to specify the random number generator seed (the most frequent case), or directly an instance of the RandomState class.
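For example, a minimal sketch of passing a RandomState instance instead of an integer seed (the toy data is made up for illustration):
import numpy as np
from sklearn.model_selection import train_test_split
X = [[1, 5], [2, 6], [3, 2], [4, 7], [5, 5], [6, 2], [7, 1], [8, 6]]
y = [1, 2, 3, 4, 5, 6, 7, 8]
rng = np.random.RandomState(324)  # an explicit generator instance
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=rng)
Note that, unlike an integer seed, a RandomState instance advances its internal state with each call, so calling train_test_split twice with the same instance gives two different splits.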

The random_state argument is just a seed for the random ordering. If you give a different random_state, it will split the dataset in a different order. If you provide the same random_state every time, then the split will be the same: the dataset will be split in the same order.
If you want your dataset to be split in the same order every time, provide the same random_state.


How to stratify the training and testing data in Scikit-Learn?

I am trying to implement a classification algorithm for the Iris dataset (downloaded from Kaggle). In the Species column the classes (Iris-setosa, Iris-versicolor, Iris-virginica) are in sorted order. How can I stratify the train and test data using Scikit-Learn?
If you want to shuffle and split your data with a 0.3 test ratio, you can use
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)
where X is your data, y is the corresponding labels, test_size is the fraction of the data that should be held out for testing, and shuffle=True shuffles the data before splitting.
In order to make sure that the data is split proportionally according to a column, you can pass that column to the stratify parameter.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    shuffle=True,
                                                    stratify=X['YOUR_COLUMN_LABEL'])
To make sure that the three classes are represented proportionally in your train and test sets, you can use the stratify parameter of the train_test_split function.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
This will make sure that the ratio of all the classes is maintained.
Use sklearn.model_selection.train_test_split and play around with the shuffle parameter.
shuffle: boolean, optional (default=True)
Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.
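For instance, a minimal sketch (with made-up toy data) showing that with shuffle=False the split is deterministic, simply taking the tail of the data as the test set:
from sklearn.model_selection import train_test_split
X = list(range(10))
y = list(range(10))
# No shuffling: the test set is the last 30% of the data,
# and random_state has no effect here.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)
print(X_test)  # [7, 8, 9]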

How to split training and test sets?

Where should we use
X_train, X_test, y_train, y_test = train_test_split(data, test_size=0.3, random_state=42)
and where should we use
train, test = train_test_split(data, test_size=0.3, random_state=0)
The former returns this error:
ValueError: not enough values to unpack (expected 4, got 2)
You use the first form if you want to split both features (X) and labels (y). You use the second form if you only want to split a single array.
X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.3, random_state=42)
The reason it didn't work for you is that you didn't provide the label data in your train_test_split() call. The above should work well; just replace y with your label/target data.
If you pass 1 data array, it is split into 2:

                              |--- data_train
data ---- train_test_split() -|
                              |--- data_test

If you pass 2 data arrays, EACH of them is split into 2, that is 4 in total:

                                        |--- data_train_x
                                        |--- data_train_y
data_x, data_y ---- train_test_split() -|
                                        |--- data_test_x
                                        |--- data_test_y

The same holds for n data arrays.
The train_test_split method accepts as many arrays as arguments as you need.
But since you need four returned values, you have to pass 2 arrays as arguments.
X_train, X_test, y_train, y_test = train_test_split(data, y_data, test_size=0.3, random_state=42)
If you need to pass many arrays, you can use the iterable unpacking operator:
train_test_split(*arrays, test_size=test_size, random_state=0)
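A minimal sketch with three made-up lists, returning 2 * 3 = 6 values:
from sklearn.model_selection import train_test_split
a = list(range(10))
b = list(range(10, 20))
c = list(range(20, 30))
# All arrays must have the same length; the same indices go to
# train/test in every array, and results come back in input order.
a_train, a_test, b_train, b_test, c_train, c_test = train_test_split(a, b, c, test_size=0.3, random_state=42)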

scikit-learn random state in splitting dataset

Can anyone tell me why we set random_state to zero when splitting the train and test set?
X_train, X_test, y_train, y_test = \
train_test_split(X, y, test_size=0.30, random_state=0)
I have seen situations like this where random_state is set to 1:
X_train, X_test, y_train, y_test = \
train_test_split(X, y, test_size=0.30, random_state=1)
What is the consequence of this random_state in cross-validation as well?
It doesn't matter if random_state is 0 or 1 or any other integer. What matters is that it is set to the same value if you want to validate your processing over multiple runs of the code. By the way, I have seen random_state=42 used in many official scikit-learn examples as well as elsewhere.
random_state, as the name suggests, is used for initializing the internal random number generator, which decides the splitting of data into train and test indices in your case. The documentation states that:
If random_state is None or np.random, then a randomly-initialized RandomState object is returned.
If random_state is an integer, then it is used to seed a new RandomState object.
If random_state is a RandomState object, then it is passed through.
This is to check and validate the data when running the code multiple times. Setting random_state to a fixed value guarantees that the same sequence of random numbers is generated each time you run the code, and unless there is some other randomness present in the process, the results produced will be the same as always. This helps in verifying the output.
When random_state is set to an integer, train_test_split will return the same results for each execution.
When random_state is set to None, train_test_split will return different results for each execution.
See the example below:
from sklearn.model_selection import train_test_split
X_data = list(range(10))
y_data = list(range(10))
for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3, random_state=0)  # zero or any other integer
    print(y_test)
print("*" * 30)
for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3, random_state=None)
    print(y_test)
Output:
[2, 8, 4]
[2, 8, 4]
[2, 8, 4]
[2, 8, 4]
[2, 8, 4]
[4, 7, 6]
[4, 3, 7]
[8, 1, 4]
[9, 5, 8]
[6, 4, 5]
If you don't mention random_state in the code, then whenever you execute the code a new random value is used, and the train and test datasets will have different values each time.
However, if you use a particular value for random_state (random_state=1 or any other value), the result will be the same every time, i.e. the same values in the train and test datasets.
random_state selects a random split, but with a twist: the order of the data will be the same for a particular value of random_state. Note that it is not a boolean; it accepts any integer from 0 upwards, and each value corresponds to a fixed, repeatable split. For example, the split you get with random_state=0 remains the same: if you run with random_state=5 and then come back to random_state=0, you will get the same split again, and the same holds for every integer.
random_state=None, however, splits randomly each time.
If you don't specify random_state in your code, then every time you run your code a new random value is used, and the train and test datasets will have different values each time.
However, if a fixed value is assigned, like random_state = 0 or 1 or 42, then no matter how many times you execute your code the result will be the same, i.e. the same values in the train and test datasets.
random_state is None by default, which means that every time you run your program you will get different output, because the split between train and test varies.
random_state = any int value means that every time you run your program you will get the same output, because the split between train and test does not vary.
random_state is an integer value that selects one particular combination of train and test. When you set test_size to 1/4, a set of possible train/test combinations is generated, and each combination corresponds to one state.
Suppose you have a dataset---> [1,2,3,4]
Train   | Test | State
[1,2,3] | [4]  |   0
[1,3,4] | [2]  |   1
[4,2,3] | [1]  |   2
[2,4,1] | [3]  |   3
We need it because during parameter tuning of a model the same state is considered again and again, so that the split itself does not interfere with the accuracy comparison.
In the case of random forests there is a similar story, but in a different way, with respect to the random selection of variables.
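A minimal sketch of this idea (the iris data and the two depth values are made up for illustration): with a fixed random_state both models see the exact same split, so any accuracy difference comes from the hyperparameter, not from the split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X, y = load_iris(return_X_y=True)
# The same fixed split is reused for every candidate model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
for depth in (2, 5):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    print(depth, clf.score(X_test, y_test))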
We use the random_state parameter for reproducibility of the initial shuffling of the training dataset.
Across multiple executions of our model, random_state makes sure that the data values are the same for the training and testing datasets: it fixes the order of the data for train_test_split.
Let's say our dataset has one feature and 10 data points, X = [0,1,2,3,4,5,6,7,8,9],
and let's say 0.3 (30% test set) is specified as the test data fraction; then we are going to have 10C3 = 120 different combinations of test data. (Refer to the picture in this link for a tabular explanation: https://i.stack.imgur.com/FZm4a.png)
Based on the random_state specified, the system picks one of these states and assigns the train and test data.
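A quick sanity check of that count (math.comb requires Python 3.8+):
from math import comb
print(comb(10, 3))  # 120 possible 3-element test sets out of 10 points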

Understanding scikit's decision tree - inconsistent learning

I have been using the package tsfresh to find relevant features for time series. It outputs approximately 300 "relevant" features that pass a p-value threshold for predictability. When I train a classifier using scikit's DecisionTreeClassifier() I get some odd results: each time I execute the learning of the tree, it returns a tree with only two levels, and every time the features it uses are different. I am befuddled. The tree does a nice job every time, but am I not seeing all the levels?
Using this code:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_filtered, y, test_size=.2)
cl = DecisionTreeClassifier()
cl.fit(X_train, y_train)
tree.export_graphviz(cl, out_file='tree.dot', feature_names=X.columns)
where len(X.columns) is over 300, returns a decision tree of two levels every time.
The output of this line is random:
X_train, X_test, y_train, y_test = train_test_split(X_filtered, y, test_size=.2)
That is, every time you split the data into train and test sets, you get different sets. You can use the random_state parameter to obtain a predictable split:
X_train, X_test, y_train, y_test = train_test_split(X_filtered, y, test_size=.2, random_state=4)
Doing so should always give you the same split, and hence the same features for the tree.
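Note that DecisionTreeClassifier has a random_state parameter of its own (ties between equally good splits can be broken randomly), so for a fully reproducible tree it is safest to fix both seeds. A minimal sketch, reusing the variables from the question:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Fix the split *and* the tree's internal randomness.
X_train, X_test, y_train, y_test = train_test_split(X_filtered, y, test_size=.2, random_state=4)
cl = DecisionTreeClassifier(random_state=4)
cl.fit(X_train, y_train)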

Equivalent of R's createDataPartition in Python

I am trying to reproduce the behavior of R's createDataPartition function in Python. I have a dataset for machine learning with a boolean target variable. I would like to split my dataset into a training set (60%) and a testing set (40%).
If I do it totally at random, my target variable won't be properly distributed between the two sets.
I achieve it in R using:
inTrain <- createDataPartition(y=data$repeater, p=0.6, list=F)
training <- data[inTrain,]
testing <- data[-inTrain,]
How can I do the same in Python?
PS: I am using scikit-learn as my machine learning library, and Python pandas.
In scikit-learn, you get the tool train_test_split
from sklearn.model_selection import train_test_split
from sklearn import datasets
# Use Age and Weight to predict a value for the food someone chooses
X_train, X_test, y_train, y_test = train_test_split(table[['Age', 'Weight']],
                                                    table['Food Choice'],
                                                    test_size=0.25)
# Another example using the sklearn pre-loaded datasets:
iris = datasets.load_iris()
X_iris, y_iris = iris.data, iris.target
X, y = X_iris[:, :2], y_iris
X_train, X_test, y_train, y_test = train_test_split(X, y)
This breaks the data into:
inputs for training
inputs for the evaluation data
outputs for the training data
outputs for the evaluation data
respectively. You can also add the keyword argument test_size=0.25 to vary the fraction of the data used for training and testing.
To split a single dataset, you can use a call like this to get 40% test data:
>>> data = np.arange(700).reshape((100, 7))
>>> training, testing = train_test_split(data, test_size=0.4)
>>> print(len(data))
100
>>> print(len(training))
60
>>> print(len(testing))
40
The correct answer is sklearn.model_selection.StratifiedShuffleSplit
Stratified ShuffleSplit cross-validator
Provides train/test indices to split data into train/test sets.
This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.
Note: like the ShuffleSplit strategy, stratified random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.
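A minimal usage sketch on the iris data (assuming a single stratified split with a 40% test set):
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedShuffleSplit
X, y = load_iris(return_X_y=True)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
# The class proportions of y are preserved in y_train and y_test.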
The answer provided is not correct. Apparently there is no single function in Python that can do stratified sampling, as opposed to random sampling, the way createDataPartition in R does.
As mentioned in the comments, the selected answer does not preserve the class distribution of the data. The scikit-learn docs point out that if this is required, then StratifiedShuffleSplit should be used. The same can be done with the train_test_split method by passing your target array to the stratify option.
>>> import numpy as np
>>> from sklearn import datasets
>>> from sklearn.model_selection import train_test_split
>>> X, y = datasets.load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, stratify=y, random_state=42)
>>> # show counts of each type after split
>>> print(np.unique(y, return_counts=True))
(array([0, 1, 2]), array([50, 50, 50], dtype=int64))
>>> print(np.unique(y_test, return_counts=True))
(array([0, 1, 2]), array([16, 17, 17], dtype=int64))
>>> print(np.unique(y_train, return_counts=True))
(array([0, 1, 2]), array([34, 33, 33], dtype=int64))
