Can anyone tell me why we set random_state to zero when splitting the train and test set?
X_train, X_test, y_train, y_test = \
train_test_split(X, y, test_size=0.30, random_state=0)
I have also seen situations where random_state is set to 1:
X_train, X_test, y_train, y_test = \
train_test_split(X, y, test_size=0.30, random_state=1)
What are the consequences of this random_state in cross-validation as well?
It doesn't matter whether random_state is 0 or 1 or any other integer. What matters is that it is set to the same value if you want to validate your processing over multiple runs of the code. Incidentally, random_state=42 is used in many official scikit-learn examples, as well as elsewhere.
random_state, as the name suggests, is used to initialize the internal random number generator, which in your case decides the splitting of data into train and test indices. The documentation states:
If random_state is None or np.random, then a randomly-initialized RandomState object is returned.
If random_state is an integer, then it is used to seed a new RandomState object.
If random_state is a RandomState object, then it is passed through.
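To make those three forms concrete, here is a minimal sketch (assuming a recent scikit-learn; the exact elements drawn depend on your version, but the equality shown should hold):

import numpy as np
from sklearn.model_selection import train_test_split

X = list(range(10))

# An int seeds a fresh RandomState internally -> reproducible split
train1, test1 = train_test_split(X, random_state=0)

# A RandomState instance is used directly as the generator
rng = np.random.RandomState(0)
train2, test2 = train_test_split(X, random_state=rng)

# None (the default) -> a different split on every call
train3, test3 = train_test_split(X, random_state=None)

print(test1 == test2)  # True: seeding with 0 and passing RandomState(0) match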
This is so you can check and validate the data when running the code multiple times. Setting random_state to a fixed value guarantees that the same sequence of random numbers is generated each time you run the code, and, unless there is some other randomness in the process, the results produced will always be the same. This helps in verifying the output.
When random_state is set to an integer, train_test_split returns the same result on each execution.
When random_state is set to None, train_test_split returns a different result on each execution.
See the example below:
from sklearn.model_selection import train_test_split

X_data = list(range(10))
y_data = list(range(10))

for i in range(5):
    # zero or any other integer gives a reproducible split
    X_train, X_test, y_train, y_test = train_test_split(
        X_data, y_data, test_size=0.3, random_state=0)
    print(y_test)

print("*" * 30)

for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X_data, y_data, test_size=0.3, random_state=None)
    print(y_test)
Output:
[2, 8, 4]
[2, 8, 4]
[2, 8, 4]
[2, 8, 4]
[2, 8, 4]
[4, 7, 6]
[4, 3, 7]
[8, 1, 4]
[9, 5, 8]
[6, 4, 5]
If you don't mention random_state in the code, then whenever you execute the code a new random value is generated, and the train and test datasets will have different values each time.
However, if you use a fixed value for random_state (random_state=1 or any other value), the result will be the same every time, i.e., the same values in the train and test datasets.
random_state selects the data randomly, but with a twist: the order of the data will be the same for a particular value of random_state. Note that it does not accept a boolean; any integer from 0 upwards, passed as random_state, fixes a permanent ordering for that value. For example, the order you get with random_state=0 stays the same: if you run with random_state=5 and then come back to random_state=0, you get the same order again, and the same holds for every integer. However, random_state=None splits randomly each time.
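A quick way to convince yourself of this (a small sketch; any recent scikit-learn should behave this way):

from sklearn.model_selection import train_test_split

X = list(range(10))

_, test_a = train_test_split(X, test_size=0.3, random_state=0)
_, _ = train_test_split(X, test_size=0.3, random_state=5)       # some other seed in between
_, test_b = train_test_split(X, test_size=0.3, random_state=0)  # back to 0

print(test_a == test_b)  # True: seed 0 always reproduces the same split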
If you don't specify random_state in your code, then every time you run (execute) it a new random value is generated, and the train and test datasets will have different values each time.
However, if a fixed value is assigned, like random_state = 0, 1, or 42, then no matter how many times you execute your code the result will be the same, i.e., the same values in the train and test datasets.
random_state is None by default, which means that every time you run your program you will get different output, because the split between train and test varies.
random_state = any integer value means that every time you run your program you will get the same output, because the split between train and test does not vary.
random_state is an integer value that selects one particular random combination of train and test. When you set test_size to 1/4, there is a whole set of possible train/test combinations, and each combination can be thought of as corresponding to one state.
Suppose you have a dataset ---> [1,2,3,4]

Train   | Test | State
[1,2,3] | [4]  |   0
[1,3,4] | [2]  |   1
[4,2,3] | [1]  |   2
[2,4,1] | [3]  |   3
We need it because, while tuning the parameters of a model, the same state will be considered again and again, so that there is no interference with the accuracy.
Random forest has a similar story, but the randomness there additionally involves the variables (features) it samples (see the sketch below).
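As a small sketch of that (using the iris data purely as a stand-in; the estimator-level random_state controls the bootstrap samples and feature subsets, independently of any split seed):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Two forests with the same seed grow identical trees ...
rf1 = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
rf2 = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print((rf1.predict(X) == rf2.predict(X)).all())  # True

# ... while an unseeded forest may differ from run to run
rf3 = RandomForestClassifier(n_estimators=10).fit(X, y)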
We use the random_state parameter so that the shuffling of the training data is reproducible.
Across multiple executions of our model, random_state makes sure that the data values are the same for the training and testing datasets: it fixes the order of the data for train_test_split.
Let's say our dataset has one feature and 10 data points, X = [0,1,2,3,4,5,6,7,8,9], and that 0.3 (30%) is specified as the test-set fraction. Then there are 10C3 = 120 different possible test sets. [Refer to the picture in this link for a tabular explanation]: https://i.stack.imgur.com/FZm4a.png
Based on the random state specified, the system picks one of these states and assigns the train and test data.
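You can verify that count with the standard library (math.comb needs Python 3.8+):

from math import comb

n_samples, n_test = 10, 3          # 10 data points, 30% test set
print(comb(n_samples, n_test))     # 120 possible test sets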
I am trying to create a machine learning model using DecisionTreeClassifier. To train and test my data I imported the train_test_split method from scikit-learn, but I cannot understand one of its arguments, random_state.
What is the significance of assigning numeric values to the random_state of model_selection.train_test_split, and how do I know which numeric value to assign for my decision tree?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)
As the docs mention, random_state is for the initialization of the random number generator used in train_test_split (and similarly for other methods). Since there are many different ways to actually split a dataset, this ensures that you can use the method several times with the same dataset (e.g., in a series of experiments) and always get the same result (i.e., the exact same train and test sets here), in other words for reproducibility. Its exact value is not important, and it is not something you have to worry about.
Using the example in the docs, setting random_state=42 ensures that you get the exact same result shown there (the code below was actually run on my machine, not copy-pasted from the docs):
import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
X_train
# array([[4, 5],
# [0, 1],
# [6, 7]])
y_train
# [2, 0, 3]
X_test
# array([[2, 3],
# [8, 9]])
y_test
# [1, 4]
You should experiment yourself with different values for random_state (or without specifying it at all) in the above snippet to get the feeling.
Providing a value for random_state helps you reproduce the same split when you re-run the program.
If you don't provide any value for random_state, you will get a different set of values for test and train on each run. In such a case, if you encounter an error, it will not be easy to debug.
Example:
Setup:
from sklearn.model_selection import train_test_split
import pandas as pd

data = pd.read_csv("diabetes.csv")
X = data.iloc[:, 0:8]
X.head()
y = data.iloc[:, -1]
y.head()
Loop with random_state:
for _ in range(2):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42)
    print(X_train.head())
    print(X_test.head())
Note the data is the same for both iterations
Loop without random_state:
for _ in range(2):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
    print(X_train.head())
    print(X_test.head())
Note the data is not the same for both iterations
If you run the code and look at the output, you will see that when random_state is the same it provides the same train/test sets, but when random_state is not provided, the values in the train/test sets are different each time.
If you don't specify random_state every time you execute your code you will get a different (random) split. Instead if you give a random_state value the split will always be the same. It is often used for experiments reproducibility.
For example:
X = [[1,5],[2,6],[3,2],[4,7], [5,5], [6,2], [7,1],[8,6]]
y = [1,2,3,4,5,6,7,8]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
X_train_rs, X_test_rs, y_train_rs, y_test_rs = train_test_split(X, y, test_size=0.33, random_state=324)
print("WITH RANDOM STATE: ")
print("X_train: {}\ny_train: {}\nX_test: {}\ny_test: {}".format(X_train_rs, X_test_rs, y_train_rs, y_test_rs))
print("WITHOUT RANDOM STATE: ")
print("X_train: {}\ny_train: {}\nX_test: {}\ny_test: {}".format(X_train, X_test, y_train, y_test))
If you run this code several times, you can see that the splits without a random state change on every run.
As explained in the sklearn documentation, random_state can be an integer if you want to specify the random number generator seed (the most frequent case), or directly an instance of the RandomState class.
The random_state argument just seeds the random ordering. If you give a different random_state, the dataset is split in a different order; if you provide the same random_state every time, the split is the same, because the dataset is shuffled in the same order.
If you want your dataset to split in the same order every time, provide the same random_state.
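For example (a minimal sketch; with so few points two different seeds could in principle collide, but in practice they almost never do):

from sklearn.model_selection import train_test_split

y = list(range(8))

train_a, _ = train_test_split(y, random_state=1)
train_b, _ = train_test_split(y, random_state=2)
train_c, _ = train_test_split(y, random_state=1)

print(train_a == train_c)  # True: same seed, same order
print(train_a == train_b)  # almost certainly False: different seed, different order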
I am creating a LightGBM model for prediction using Python. Initially, I split the data using sklearn.model_selection.train_test_split, which resulted in a lower mean absolute error (MAE). Later, I did the split another way, by dividing the dataframe into two different dataframes, df_train and df_test. With this approach, the MAE is significantly higher than with the earlier approach. Is the use of sklearn.model_selection.train_test_split mandatory for LightGBM, or can the data be split in any way? If it is not mandatory, the results should be somewhat similar; in my case, they are very different.
Looking for suggestions/help.
To always keep the same outcome with sklearn.model_selection.train_test_split you have to keep the random_state:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.33, random_state=42)
based on the documentation:
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
otherwise you cannot produce the same result.
If you have the feeling that the split does not fit your dataframe, you should use cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html . That way you avoid over- and underfitting tied to one specific train/test split.
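For instance, a minimal cross-validation sketch (with a plain linear model as a stand-in; you could substitute LightGBM's scikit-learn wrapper, e.g. lightgbm.LGBMRegressor, as the estimator):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 5-fold CV gives an MAE estimate that does not hinge on one lucky split
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_absolute_error")
print(-scores.mean())  # average MAE across the folds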
Can someone explain to me what random_state means in the example below?
import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
Why is it hard coded to 42?
Isn't that obvious? 42 is the Answer to the Ultimate Question of Life, the Universe, and Everything.
On a serious note, random_state simply sets a seed to the random generator, so that your train-test splits are always deterministic. If you don't set a seed, it is different each time.
Relevant documentation:
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
Random state ensures that the splits that you generate are reproducible. Scikit-learn uses random permutations to generate the splits. The random state that you provide is used as a seed to the random number generator. This ensures that the random numbers are generated in the same order.
When random_state is not defined in the code, the training data changes on every run, and the accuracy may change with it.
When random_state = some constant integer, the training data is the same across runs, which makes debugging easier.
The random state is simply the lot number of the set generated randomly in any operation. We can specify this lot number whenever we want the same set again.
I want to implement a machine learning algorithm in scikit-learn, but I don't understand what the parameter random_state does. Why should I use it?
I also could not understand what a pseudo-random number is.
train_test_split splits arrays or matrices into random train and test subsets. That means that every time you run it without specifying random_state, you will get a different result; this is expected behavior. For example:
Run 1:
>>> a, b = np.arange(10).reshape((5, 2)), range(5)
>>> train_test_split(a, b)
[array([[6, 7],
[8, 9],
[4, 5]]),
array([[2, 3],
[0, 1]]), [3, 4, 2], [1, 0]]
Run 2
>>> train_test_split(a, b)
[array([[8, 9],
[4, 5],
[0, 1]]),
array([[6, 7],
[2, 3]]), [4, 2, 0], [3, 1]]
It changes. On the other hand, if you use random_state=some_number, then you can guarantee that the output of Run 1 will be equal to the output of Run 2, i.e., your split will always be the same.
It doesn't matter what the actual random_state number is, 42, 0, 21, ... The important thing is that every time you use 42, you will always get the same output you got the first time you made the split.
This is useful if you want reproducible results, for example in the documentation, so that everybody can consistently see the same numbers when they run the examples.
In practice I would say you should set random_state to some fixed number while you test things, but then remove it in production if you really need a random (and not a fixed) split.
Regarding your second question, a pseudo-random number generator is a number generator that produces numbers that look almost truly random. Why they are not truly random is out of the scope of this question and probably won't matter in your case; you can take a look here for more details.
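To see what "pseudo-random" means in practice: a seeded generator replays exactly the same sequence every time (a small NumPy sketch):

import numpy as np

rng1 = np.random.RandomState(42)
rng2 = np.random.RandomState(42)

# Same seed -> the two generators produce identical "random" sequences
print(rng1.rand(3))
print(rng2.rand(3))  # same three numbers as above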
The question of what "random state" is and why it is used has been answered nicely above. I will try to answer a different one: "Why do we so often choose 42 as the random state when training a machine learning model, and not 12, 32, or 5? Is there a scientific explanation?"
Many students and practitioners use 42 as the random state simply because it is used by many instructors in online courses. They often set the random state or numpy seed to 42, and learners follow the same practice without giving it much thought.
To be specific, 42 has nothing to do with AI or ML. It is a completely generic number; in machine learning it doesn't matter what the actual random number is, and as the scikit-learn API docs mention, any integer is sufficient for the task at hand.
42 is a reference to The Hitchhiker's Guide to the Galaxy, where it is the Answer to the Ultimate Question of Life, the Universe, and Everything; it is meant as a joke and has no other significance.
References:
Wikipedia: The Hitchhiker's Guide to the Galaxy
Stack Exchange: Why is the number 42 preferred when indicating something random?
Quora: Why is the number 42 preferred when indicating something random?
YouTube: a short video explaining the use of random_state in train_test_split
If you don't mention random_state in the code, then whenever you execute your code a new random value is generated, and the train and test datasets will have different values each time.
However, if you use a fixed value for random_state (random_state = 1 or any other value), the result will be the same every time, i.e., the same values in the train and test datasets.
Refer to the code below:
import pandas as pd
from sklearn.model_selection import train_test_split

test_series = pd.Series(range(100))
size30split = train_test_split(test_series, random_state=1, test_size=0.3)
size25split = train_test_split(test_series, random_state=1, test_size=0.25)

# compare values (note: `in` on a raw Series would check the index, not the values)
common = [element for element in size25split[0].tolist() if element in size30split[0].tolist()]
print(len(common))
No matter how many times you run the code, the output will be 70.
70
Try to remove the random_state and run the code.
import pandas as pd
from sklearn.model_selection import train_test_split

test_series = pd.Series(range(100))
size30split = train_test_split(test_series, test_size=0.3)
size25split = train_test_split(test_series, test_size=0.25)

common = [element for element in size25split[0].tolist() if element in size30split[0].tolist()]
print(len(common))
Now the output will be different each time you execute the code.
random_state number splits the test and training datasets with a random manner. In addition to what is explained here, it is important to remember that random_state value can have significant effect on the quality of your model (by quality I essentially mean accuracy to predict). For instance, If you take a certain dataset and train a regression model with it, without specifying the random_state value, there is the potential that everytime, you will get a different accuracy result for your trained model on the test data.
So it is important to find the best random_state value to provide you with the most accurate model. And then, that number will be used to reproduce your model in another occasion such as another research experiment.
To do so, it is possible to split and train the model in a for-loop by assigning random numbers to random_state parameter:
for j in range(1000):
X_train, X_test, y_train, y_test = train_test_split(X, y , random_state =j, test_size=0.35)
lr = LarsCV().fit(X_train, y_train)
tr_score.append(lr.score(X_train, y_train))
ts_score.append(lr.score(X_test, y_test))
J = ts_score.index(np.max(ts_score))
X_train, X_test, y_train, y_test = train_test_split(X, y , random_state =J, test_size=0.35)
M = LarsCV().fit(X_train, y_train)
y_pred = M.predict(X_test)`
If no random_state is provided, the system uses one generated internally, so when you run the program multiple times you may see different train/test data points and the behavior becomes unpredictable. If you then hit an issue with your model, you cannot recreate it, because you do not know the random number that was generated when you ran the program.
If you look at the tree classifiers, DT or RF, they try to build a tree using an optimal plan. Although this plan is often the same, there can be instances where the tree is different, and so are the predictions. When you try to debug your model you may not be able to recreate the same instance for which a tree was built. To avoid all this hassle, we use a random_state when building a DecisionTreeClassifier or RandomForestClassifier.
PS: You can go a bit deeper into how the tree is built in DecisionTree to understand this better.
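A minimal sketch of that reproducibility point (the iris data is just a stand-in; random_state on the estimator fixes the feature permutation used to break ties between equally good splits):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)        # reproducible split

clf = DecisionTreeClassifier(random_state=0)    # reproducible tree
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))                # same score on every run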
random_state is basically used so that your problem can be reproduced the same way every time it is run. If you do not use a random_state in train_test_split, every time you make the split you might get a different set of train and test data points, which will not help you in debugging if you hit an issue.
From the docs:
If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
sklearn.model_selection.train_test_split(*arrays, **options)
Split arrays or matrices into random train and test subsets
Parameters: ...
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
source: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
"Regarding the random state, it is used in many randomized algorithms in sklearn to determine the random seed passed to the pseudo-random number generator. Therefore, it does not govern any aspect of the algorithm's behavior. As a consequence, random state values which performed well in the validation set do not correspond to those which would perform well in a new, unseen test set. Indeed, depending on the algorithm, you might see completely different results by just changing the ordering of training samples."
source: https://stats.stackexchange.com/questions/263999/is-random-state-a-parameter-to-tune
I am trying to reproduce the behavior of R's createDataPartition function in Python. I have a dataset for machine learning with a boolean target variable, and I would like to split it into a training set (60%) and a testing set (40%).
If I do it totally at random, my target variable won't be properly distributed between the two sets.
I achieve it in R using:
inTrain <- createDataPartition(y=data$repeater, p=0.6, list=F)
training <- data[inTrain,]
testing <- data[-inTrain,]
How can I do the same in Python?
P.S.: I am using scikit-learn as my machine learning library, along with pandas.
In scikit-learn, you get the tool train_test_split:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in very old versions
from sklearn import datasets

# Use Age and Weight to predict a value for the food someone chooses
# (assumes a DataFrame named `table` with these columns)
X_train, X_test, y_train, y_test = train_test_split(table[['Age', 'Weight']],
                                                    table['Food Choice'],
                                                    test_size=0.25)
# Another example using the sklearn pre-loaded datasets:
iris = datasets.load_iris()
X_iris, y_iris = iris.data, iris.target
X, y = X_iris[:, :2], y_iris
X_train, X_test, y_train, y_test = train_test_split(X, y)
This breaks the data into
inputs for training
inputs for the evaluation data
output for the training data
output for the evaluation data
respectively. You can also add a keyword argument: test_size=0.25 to vary the percentage of the data used for training and testing
To split a single dataset, you can use a call like this to get 40% test data:
>>> data = np.arange(700).reshape((100, 7))
>>> training, testing = train_test_split(data, test_size=0.4)
>>> print(len(data))
100
>>> print(len(training))
60
>>> print(len(testing))
40
The correct answer is sklearn.model_selection.StratifiedShuffleSplit
Stratified ShuffleSplit cross-validator
Provides train/test indices to split data into train/test sets.
This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.
Note: like the ShuffleSplit strategy, stratified random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.
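A minimal sketch of using it directly (with the iris data as a stand-in):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedShuffleSplit

X, y = load_iris(return_X_y=True)

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, test_idx = next(sss.split(X, y))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
print(np.bincount(y_test))  # class counts stay proportional, e.g. [20 20 20]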
The answer provided above is not correct for this question: apparently there is no single Python function that does stratified sampling, as opposed to plain random sampling, the way createDataPartition in R does.
As mentioned in the comments, the selected answer does not preserve the class distribution of the data. The scikit-learn docs point out that if that is required, then StratifiedShuffleSplit should be used. This can also be done with the train_test_split method by passing your target array to the stratify option.
>>> import numpy as np
>>> from sklearn import datasets
>>> from sklearn.model_selection import train_test_split
>>> X, y = datasets.load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, stratify=y, random_state=42)
>>> # show counts of each type after split
>>> print(np.unique(y, return_counts=True))
(array([0, 1, 2]), array([50, 50, 50], dtype=int64))
>>> print(np.unique(y_test, return_counts=True))
(array([0, 1, 2]), array([16, 17, 17], dtype=int64))
>>> print(np.unique(y_train, return_counts=True))
(array([0, 1, 2]), array([34, 33, 33], dtype=int64))