I want to implement a machine learning algorithm in scikit-learn, but I don't understand what the random_state parameter does. Why should I use it?
I also don't understand what a pseudo-random number is.
train_test_split splits arrays or matrices into random train and test subsets. That means that every time you run it without specifying random_state, you will get a different result; this is expected behavior. For example:
Run 1:
>>> a, b = np.arange(10).reshape((5, 2)), range(5)
>>> train_test_split(a, b)
[array([[6, 7],
        [8, 9],
        [4, 5]]),
 array([[2, 3],
        [0, 1]]), [3, 4, 2], [1, 0]]
Run 2:
>>> train_test_split(a, b)
[array([[8, 9],
        [4, 5],
        [0, 1]]),
 array([[6, 7],
        [2, 3]]), [4, 2, 0], [3, 1]]
It changes. On the other hand, if you use random_state=some_number, then you can guarantee that the output of Run 1 will be equal to the output of Run 2, i.e. your split will always be the same.
It doesn't matter what the actual random_state number is: 42, 0, 21, ... The important thing is that every time you use 42, you will get the same output the first time you make the split.
This is useful if you want reproducible results, for example in the documentation, so that everybody can consistently see the same numbers when they run the examples.
In practice I would say, you should set the random_state to some fixed number while you test stuff, but then remove it in production if you really need a random (and not a fixed) split.
Regarding your second question, a pseudo-random number generator is a deterministic algorithm that produces a sequence of numbers that looks random but is completely determined by its starting value (the seed). Why they are not truly random is outside the scope of this question and probably won't matter in your case; you can take a look here for more details.
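To make the idea of a seed a bit more concrete, here is a minimal sketch of a toy pseudo-random generator (a linear congruential generator). It is only an illustration: the class name ToyLCG and the helper draw are made up for this example, and this is not the generator that NumPy or scikit-learn actually use.

```python
class ToyLCG:
    """Toy linear congruential generator: state = (a * state + c) mod m."""

    def __init__(self, seed):
        self.state = seed

    def next_float(self):
        # Textbook LCG constants, used here purely for illustration.
        self.state = (1664525 * self.state + 1013904223) % 2**32
        return self.state / 2**32


def draw(seed, n=5):
    """Return the first n numbers produced from the given seed."""
    gen = ToyLCG(seed)
    return [gen.next_float() for _ in range(n)]


same_a = draw(42)
same_b = draw(42)
other = draw(43)
print(same_a == same_b)  # same seed, same sequence
print(same_a == other)   # different seed, different sequence
```

The numbers look random, but nothing random happens after the seed is chosen, which is why fixing random_state makes a split repeatable.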
If you don't specify random_state in your code, then every time you run (execute) your code a new random value is generated and the train and test datasets will have different values each time.
However, if a fixed value is assigned, like random_state = 42, then no matter how many times you execute your code the result will be the same, i.e. the same values in the train and test datasets.
The questions of what "random state" is and why it is used have been answered nicely above. I will try to answer the question "Why do we so often choose 42 as the random state when training a machine learning model? Why don't we choose 12 or 32 or 5?"
Is there a scientific explanation?
Many students and practitioners use this number (42) as the random state because it is used by many instructors in online courses. They often set the random state or NumPy seed to 42, and learners follow the same practice without giving it much thought.
To be specific, 42 has nothing to do with AI or ML. It is just a generic number. In machine learning, it doesn't matter what the actual number is; as mentioned in the scikit-learn API docs, any integer is sufficient for the task at hand.
42 is a reference to the book The Hitchhiker's Guide to the Galaxy, where it is the answer to life, the universe and everything, and it is meant as a joke. It has no other significance.
If you don't set random_state in the code, then whenever you execute your code a new random value is generated and the train and test datasets will have different values each time.
However, if you use a particular value for random_state (random_state = 1 or any other value), the result will be the same every time, i.e. the same values in the train and test datasets.
Refer to the code below:
import pandas as pd
from sklearn.model_selection import train_test_split
test_series = pd.Series(range(100))
size30split = train_test_split(test_series,random_state = 1,test_size = .3)
size25split = train_test_split(test_series,random_state = 1,test_size = .25)
common = [element for element in size25split[0] if element in size30split[0]]
print(len(common))
No matter how many times you run the code, the output will be 70.
70
Try to remove the random_state and run the code.
import pandas as pd
from sklearn.model_selection import train_test_split
test_series = pd.Series(range(100))
size30split = train_test_split(test_series,test_size = .3)
size25split = train_test_split(test_series,test_size = .25)
common = [element for element in size25split[0] if element in size30split[0]]
print(len(common))
Now the output will be different each time you execute the code.
random_state controls how the data are randomly split into training and test sets. In addition to what is explained here, it is important to remember that the random_state value can have a significant effect on the measured quality of your model (by quality I essentially mean prediction accuracy). For instance, if you take a certain dataset and train a regression model on it without specifying random_state, you may get a different accuracy on the test data every time you train.
So it can be useful to record the random_state value that gives the most accurate model, so that the same number can be used to reproduce the model on another occasion, such as another research experiment. Keep in mind that a seed selected by its test score is tuned to that particular split, so the score it reports is an optimistic estimate.
To do so, you can split and train the model in a for-loop, assigning a different number to the random_state parameter on each iteration:
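import numpy as np
from sklearn.linear_model import LarsCV
from sklearn.model_selection import train_test_split

# X, y: feature matrix and target, assumed to be defined already
tr_score, ts_score = [], []
for j in range(1000):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=j, test_size=0.35)
    lr = LarsCV().fit(X_train, y_train)
    tr_score.append(lr.score(X_train, y_train))
    ts_score.append(lr.score(X_test, y_test))

# Seed whose split gave the highest test score
J = int(np.argmax(ts_score))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=J, test_size=0.35)
M = LarsCV().fit(X_train, y_train)
y_pred = M.predict(X_test)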
If no random_state is provided, the system will use a random state that is generated internally. So when you run the program multiple times you might see different train/test data points, and the behavior will be unpredictable. If you have an issue with your model, you will not be able to recreate it, as you do not know the random number that was generated when you ran the program.
Consider the tree classifiers, either DT or RF: they try to build a tree using an optimal plan. Though most of the time this plan might be the same, there could be instances where the tree is different, and so are the predictions. When you try to debug your model, you may not be able to recreate the same instance for which a tree was built. So, to avoid all this hassle, we use a random_state while building a DecisionTreeClassifier or RandomForestClassifier.
PS: You can go a bit more in depth on how the tree is built in DecisionTree to understand this better.
random_state is basically used to reproduce your problem the same way every time it is run. If you do not use a random_state in train_test_split, every time you make the split you might get a different set of train and test data points, which will not help you in debugging if you run into an issue.
From Doc:
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
sklearn.model_selection.train_test_split(*arrays, **options)[source]
Split arrays or matrices into random train and test subsets
Parameters: ...
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
source: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
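The difference between passing an int and passing a RandomState instance is easy to miss, so here is a small sketch of it. The array X is made-up data; the point is only that an int re-seeds a fresh generator on each call, while an instance keeps advancing as it is used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape((100, 2))  # made-up data for the illustration

# An int seeds a fresh generator on every call, so repeated calls agree.
a_train, a_test = train_test_split(X, random_state=0)
b_train, b_test = train_test_split(X, random_state=0)
int_calls_agree = np.array_equal(a_test, b_test)

# A RandomState instance is consumed as it is used, so a second call with
# the same instance continues the sequence and produces another split.
rng = np.random.RandomState(0)
c_train, c_test = train_test_split(X, random_state=rng)
d_train, d_test = train_test_split(X, random_state=rng)
instance_calls_agree = np.array_equal(c_test, d_test)

print(int_calls_agree, instance_calls_agree)
```

So if you need the very same split twice in one script, passing the int is the simpler choice.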
'''Regarding the random state, it is used in many randomized algorithms in sklearn to determine the random seed passed to the pseudo-random number generator. Therefore, it does not govern any aspect of the algorithm's behavior. As a consequence, random state values which performed well in the validation set do not correspond to those which would perform well in a new, unseen test set. Indeed, depending on the algorithm, you might see completely different results by just changing the ordering of training samples.'''
source: https://stats.stackexchange.com/questions/263999/is-random-state-a-parameter-to-tune
I believe the weights should change only slightly with a different random state.
What could be the reason for getting different weights on every run with random_state = None?
Following are the weight values for a few runs (the data contain 3 features):
1) 4.67100318, 1.26129186, 17.26554955
2) 3.39793468, 2.10265234, 18.42484435
3) -2.08082186, 1.25948975, 10.37120852
4) 3.71122156, 0.93510126, 16.63007864
Because of these fluctuations, I am not sure which random_state I should use, and this is creating trouble while performing feature selection.
Please note that I am using the data after performing standardisation.
I am using the very simple code below to train my model, as my data contain only 200 rows with 3 features:
from sklearn.linear_model import SGDClassifier
SGDClf = SGDClassifier(loss='log',random_state=1)
SGDClf.fit(X,Y)
With random_state = None, machine learning models can produce different results on the same dataset from run to run. Models that involve randomness draw a sequence of pseudo-random numbers, started from a seed, in steps such as shuffling the data or generating test, validation and training datasets from a given dataset.
Setting a model's seed to a fixed value (e.g. random_state = 1) will ensure that the results (here, the weights) are reproducible.
SGDClassifier() shuffles the entered data:
The passed (random state) value will have an effect on the reproducibility of the
results returned by the function (fit, split, or any other function
like k_means). - random state doc
Hope it is helpful
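A quick way to check this yourself is sketched below. The data are synthetic stand-ins for the 200-row, 3-feature dataset (the values and the helper name fit_weights are assumptions for the example), and the default loss is used rather than the one in the question.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for a 200-row, 3-feature dataset (hypothetical values).
data_rng = np.random.RandomState(0)
X = data_rng.normal(size=(200, 3))
Y = (X[:, 0] + 0.5 * X[:, 2] + data_rng.normal(scale=0.5, size=200) > 0).astype(int)


def fit_weights(seed):
    # seed=None lets the internal shuffling differ from run to run.
    clf = SGDClassifier(random_state=seed)
    clf.fit(X, Y)
    return clf.coef_.ravel()


w_first = fit_weights(1)
w_second = fit_weights(1)
print(np.allclose(w_first, w_second))  # identical weights for an identical seed
print(fit_weights(None))               # may differ from one run to the next
```

With so few rows, the order in which SGD visits the samples matters, which is why the unseeded weights move around.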
I am working with large-scale, imbalanced datasets where I need to pick a stratified training set. However, even if the dataset is strongly imbalanced, I still need to ensure that every label class is included at least once in the training set. sklearn's train_test_split or StratifiedShuffleSplit will not "guarantee" this inclusion.
Here is an example:
import numpy as np
from sklearn.model_selection import train_test_split
X = np.arange(100).reshape((50, 2))
y = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,4,4]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=4, random_state=42, stratify=y)
print(X_train, y_train)
The result is
[[80 81]
[48 49]
[18 19]
[30 31]] [2, 2, 1, 1]
So the label classes 3 and 4 are not included in this training split. Given the absolute train_size=4, these two classes are not large enough to be included. For a strictly stratified split, this is correct.
However, for the smaller classes, I need to at least make sure that the algorithm "has seen the label class". Therefore, I need some kind of softening of the stratification principle, with some kind of proportional inclusion of the smaller classes.
I have written quite a bit of code to achieve this, which removes the smaller classes first and then handles them separately with a proportional split. However, removing them also influences train_test_split, due to the changes in class amounts/total size.
Is there any simple function/algorithm to achieve this behavior?
Have you checked sklearn.model_selection.StratifiedKFold? Try setting n_splits to be less than or equal to the number of members in the least populated class. If you have, then I can only recommend using the under-/over-sampling methods from imbalanced-learn.
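If none of those fit, one possible hand-rolled workaround is sketched below. It is not a scikit-learn function: the name split_with_every_class is made up, and the strategy (reserve one random sample per class, then fill the rest of the training set at random) is just one simple way to soften stratification. The fill step is plain random, not proportional. The class sizes in the usage lines are loosely modeled on the example in the question.

```python
import numpy as np


def split_with_every_class(y, train_size, seed=None):
    """Return (train_idx, test_idx) with each class of y in train at least once."""
    rng = np.random.RandomState(seed)
    y = np.asarray(y)
    classes = np.unique(y)
    if train_size < len(classes):
        raise ValueError("train_size must be at least the number of classes")
    # Step 1: reserve one randomly chosen sample per class.
    reserved = np.array([rng.choice(np.flatnonzero(y == c)) for c in classes])
    # Step 2: fill the remaining slots at random from everything else.
    rest = np.setdiff1d(np.arange(len(y)), reserved)
    extra = rng.choice(rest, size=train_size - len(reserved), replace=False)
    train_idx = np.concatenate([reserved, extra])
    test_idx = np.setdiff1d(np.arange(len(y)), train_idx)
    return train_idx, test_idx


y = [1] * 23 + [2] * 20 + [3] * 5 + [4] * 2
train_idx, test_idx = split_with_every_class(y, train_size=6, seed=42)
print(sorted(np.asarray(y)[train_idx].tolist()))
```

The returned index arrays can be used to slice both X and y.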
Can someone explain to me what random_state means in the example below?
import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
Why is it hard coded to 42?
Isn't that obvious? 42 is the Answer to the Ultimate Question of Life, the Universe, and Everything.
On a serious note, random_state simply sets a seed to the random generator, so that your train-test splits are always deterministic. If you don't set a seed, it is different each time.
Relevant documentation:
random_state : int, RandomState instance or None, optional
(default=None)
If int, random_state is the seed used by the random
number generator; If RandomState instance, random_state is the random
number generator; If None, the random number generator is the
RandomState instance used by np.random.
If you don't specify random_state in the code, then every time you run (execute) your code a new random value is generated and the train and test datasets will have different values each time.
However, if a fixed value is assigned, like random_state = 0 or 1 or 42 or any other integer, then no matter how many times you execute your code the result will be the same, i.e. the same values in the train and test datasets.
Random state ensures that the splits that you generate are reproducible. Scikit-learn uses random permutations to generate the splits. The random state that you provide is used as a seed to the random number generator. This ensures that the random numbers are generated in the same order.
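The "seeded permutation" idea can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's actual implementation, and toy_split is a made-up name.

```python
import numpy as np


def toy_split(n_samples, n_test, seed):
    """Simplified illustration: shuffle the indices, then cut them in two."""
    rng = np.random.RandomState(seed)         # the seed fixes the generator
    permutation = rng.permutation(n_samples)  # so the shuffle is fixed too
    return permutation[n_test:], permutation[:n_test]


train_a, test_a = toy_split(10, 3, seed=0)
train_b, test_b = toy_split(10, 3, seed=0)
train_c, test_c = toy_split(10, 3, seed=None)
print(test_a, test_b)  # always equal to each other
print(test_c)          # varies between runs
```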
When random_state is not defined in the code, the training data will change on every run, and the accuracy might change with it.
When random_state is set to a constant integer, the training data will be the same on every run, which makes debugging easier.
The random state is simply the lot number of the set generated randomly in any operation. We can specify this lot number whenever we want the same set again.
Can anyone tell me why we set the random state to zero when splitting the train and test sets?
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.30, random_state=0)
I have seen situations like this where the random state is set to 1!
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.30, random_state=1)
What is the consequence of this random state in cross-validation as well?
It doesn't matter whether random_state is 0 or 1 or any other integer. What matters is that it is set to the same value each time if you want to validate your processing over multiple runs of the code. By the way, I have seen random_state=42 used in many official scikit-learn examples as well as elsewhere.
random_state as the name suggests, is used for initializing the internal random number generator, which will decide the splitting of data into train and test indices in your case. In the documentation, it is stated that:
If random_state is None or np.random, then a randomly-initialized RandomState object is returned.
If random_state is an integer, then it is used to seed a new RandomState object.
If random_state is a RandomState object, then it is passed through.
This is to check and validate the data when running the code multiple times. Setting random_state to a fixed value guarantees that the same sequence of random numbers is generated each time you run the code. Unless there is some other randomness present in the process, the results produced will be the same every time. This helps in verifying the output.
When random_state is set to an integer, train_test_split will return the same results on each execution.
When random_state is set to None, train_test_split will return different results on each execution.
See the example below:
from sklearn.model_selection import train_test_split

X_data = range(10)
y_data = range(10)

for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3, random_state=0)  # zero or any other integer
    print(y_test)

print("*" * 30)

for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3, random_state=None)
    print(y_test)
Output:
[2, 8, 4]
[2, 8, 4]
[2, 8, 4]
[2, 8, 4]
[2, 8, 4]
[4, 7, 6]
[4, 3, 7]
[8, 1, 4]
[9, 5, 8]
[6, 4, 5]
random_state gives a random split of the data, but with a twist: the order of the data will be the same for a particular value of random_state. Note that it is not a boolean; it accepts any integer from 0 upwards, and whichever integer you pass fixes the order for that value. For example, the order you get with random_state=0 stays the same: if you then run with random_state=5 and come back to random_state=0, you will get the same order again. The same holds for every integer.
However, random_state=None splits randomly each time.
If you still have doubts, watch this
If you don't specify random_state in your code, then every time you run (execute) your code a new random value is generated and the train and test datasets will have different values each time.
However, if a fixed value is assigned, like random_state = 0 or 1 or 42, then no matter how many times you execute your code the result will be the same, i.e. the same values in the train and test datasets.
random_state is None by default, which means that every time you run your program you will get a different output, because the split between train and test varies.
random_state = any int value means that every time you run your program you will get the same output, because the split between train and test does not vary.
The random_state is an integer value that selects one particular random combination of train and test. When you set test_size to 1/4, there is a set of possible train/test combinations, and each combination can be thought of as one state.
Suppose you have a dataset ---> [1,2,3,4]
Train   | Test | State (illustrative)
[1,2,3] | [4]  | 0
[1,3,4] | [2]  | 1
[4,2,3] | [1]  | 2
[2,4,1] | [3]  | 3
We need it because during parameter tuning of the model the same state will be used again and again, so that the split does not interfere with the accuracy.
In the case of random forests there is a similar story, but in a different way, with respect to the variables.
We use the random_state parameter for reproducibility of the shuffling of the training dataset, e.g. the initial shuffle or the shuffle after each epoch.
Across multiple executions of our model, the random state makes sure that the same data values end up in the training and testing datasets. It fixes the order of the data for train_test_split.
Let's say our dataset has one feature and 10 data points: X=[0,1,2,3,4,5,6,7,8,9],
and let's say 0.3 (30% is the test set) is specified as the test fraction; then there are 10C3=120 different combinations of test data. [Refer to the picture in the link for a tabular explanation]: https://i.stack.imgur.com/FZm4a.png
Based on the random_state specified, the system picks one of these combinations and assigns the train and test data accordingly.
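The count of 120 can be checked directly by enumerating every possible 3-point test set. This sketch only verifies the combinatorics; it says nothing about which combination a given seed ends up selecting.

```python
from itertools import combinations
from math import comb

X = list(range(10))
n_test = 3  # 30% of 10 data points

# Every possible choice of 3 test points out of 10.
all_test_sets = list(combinations(X, n_test))
print(len(all_test_sets), comb(10, 3))  # both counts are 120
```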
In the documentation of the scikit-learn random forest classifier, it is stated that
The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).
What I don't understand is that if the sample size is always the same as the input sample size, then how can we talk about a random selection? There is no selection here because we use all the (and naturally the same) samples at each training.
Am I missing something here?
I believe this part of docs answers your question
In random forests (see RandomForestClassifier and
RandomForestRegressor classes), each tree in the ensemble is built
from a sample drawn with replacement (i.e., a bootstrap sample) from
the training set. In addition, when splitting a node during the
construction of the tree, the split that is chosen is no longer the
best split among all features. Instead, the split that is picked is
the best split among a random subset of the features. As a result of
this randomness, the bias of the forest usually slightly increases
(with respect to the bias of a single non-random tree) but, due to
averaging, its variance also decreases, usually more than compensating
for the increase in bias, hence yielding an overall better model.
The key to understanding is in "sample drawn with replacement". This means that each instance can be drawn more than once. This in turn means that some instances in the train set are present several times and some are not present at all (out-of-bag). Those are different for different trees.
Certainly not all samples are selected for each tree. By default each sample has a 1-((N-1)/N)^N ≈ 0.63 chance of being drawn at least once for one particular tree, where N is the sample size of the training set. The number of times it is drawn is approximately Poisson-distributed with mean 1: about 0.37 for exactly once, 0.18 for twice, 0.06 for three times, and so on.
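The inclusion probability can be checked numerically, along with the chance of a given sample being drawn exactly k times in one bootstrap. This is a small sketch using the binomial formula; N = 1000 is an arbitrary, hypothetical training-set size, and prob_drawn_k_times is a made-up helper name.

```python
from math import comb

N = 1000  # hypothetical training-set size


def prob_drawn_k_times(k, n=N):
    # Each of the n draws picks a given sample with probability 1/n.
    return comb(n, k) * (1 / n) ** k * (1 - 1 / n) ** (n - k)


p_at_least_once = 1 - prob_drawn_k_times(0)
print(round(p_at_least_once, 3))
for k in range(4):
    print(k, round(prob_drawn_k_times(k), 3))
```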
Each bootstrap sample is, on average, different enough from the other bootstraps that the decision trees are adequately different, and so the average prediction of the trees is robust to the variance of each tree model. If the sample size could be increased to 5 times the training set size, every observation would probably be present 3-7 times in each tree, and the overall ensemble prediction performance would suffer.
The answer from #communitywiki misses the question: "What I dont understand is that if the sample size is always the same as the input sample size than how can we talk about a random selection". It has to do with the nature of bootstrapping itself. Bootstrapping involves repeating the same values several times while keeping the same sample size as the original data. Example (courtesy of the wiki page on Bootstrapping/Approach):
Original sample: [1,2,3,4,5]
Bootstrap 1: [1,2,4,4,1]
Bootstrap 2: [1,1,3,3,5]
and so on.
This is how random selection can occur while the sample size remains the same.
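You can generate bootstrap samples like these yourself. A minimal sketch with NumPy follows; the seed is arbitrary, and the particular samples you see will not be the ones listed above.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed so the draws repeat between runs
original = np.array([1, 2, 3, 4, 5])

for i in range(1, 4):
    # Same size as the original, but drawn with replacement.
    sample = rng.choice(original, size=len(original), replace=True)
    out_of_bag = np.setdiff1d(original, sample)
    print(f"Bootstrap {i}: {sample.tolist()}  out-of-bag: {out_of_bag.tolist()}")
```

Each line has five values, but usually with repeats, and the values that were never drawn are the out-of-bag ones for that sample.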
Although I am pretty new to Python, I had a similar problem.
I tried to fit a RandomForestClassifier to my data. I split the data into train and test:
train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.2, random_state=0)
The lengths of the DFs were the same, but after I predicted with the model:
rfc_pred = rf_mod.predict(test_x)
The results had a different length.
To solve this I set the bootstrap option to False:
param_grid = {
    'bootstrap': [False],
    'max_depth': [110, 150, 200],
    'max_features': [3, 5],
    'min_samples_leaf': [1, 3],
    'min_samples_split': [6, 8],
    'n_estimators': [100, 200]
}
And ran the process all over again. It worked fine and I could calculate my confusion matrix. But I would like to understand how to use bootstrap and still get predictions of the same length.
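For what it's worth, predict returns one label per row of whatever you pass to it, and bootstrap only affects how rows are sampled while the trees are being fitted. Here is a small sketch to check this, using synthetic stand-in data (the values and hyperparameters are assumptions for the example, not the setup from the post above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical); replace with your own X and Y.
data_rng = np.random.RandomState(0)
X = data_rng.normal(size=(200, 5))
Y = (X[:, 0] + X[:, 1] > 0).astype(int)

train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.2, random_state=0)

# bootstrap=True only changes how rows are sampled while each tree is fitted.
rf_mod = RandomForestClassifier(n_estimators=50, bootstrap=True, random_state=0)
rf_mod.fit(train_x, train_y)

rfc_pred = rf_mod.predict(test_x)
print(len(rfc_pred), len(test_y))  # one prediction per test row
```

If the lengths differ in your own code, it may be worth checking that the predictions and the labels you compare them with come from the same split.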