How to reproduce a RandomForestClassifier when random_state uses RandomState

How to reproduce a RandomForestClassifier when random_state uses RandomState - python

Context:
I was reading the Common Pitfalls documentation of scikit-learn. I was surprised with the Controlling randomness section, as in my case I've always use random_state with and integer, to read that the estimator should use an instance of np.random.RandomState in the initialization of the classifier, but not on the cv split; after reading Robustness of cross-validation results section I thought: 'Ok I get it', BUT, there is a warning in the cloning section that says:
from sklearn import clone
from sklearn.ensemble import RandomForestClassifier
import numpy as np
rng = np.random.RandomState(0)
a = RandomForestClassifier(random_state=rng)
b = clone(a)
Since a RandomState instance was passed to a, a and b are not clones in the strict sense, but rather clones in the statistical sense: a and b will still be different models, even when calling fit(X, y) on the same data. Moreover, a and b will influence each-other since they share the same internal RNG: calling a.fit will consume b’s RNG, and calling b.fit will consume a’s RNG, since they are the same. This bit is true for any estimators that share a random_state parameter; it is not specific to clones.
If an integer were passed, a and b would be exact clones and they would not influence each other
Warning Even though clone is rarely used in user code, it is called pervasively throughout scikit-learn codebase: in particular, most meta-estimators that accept non-fitted estimators call clone internally (GridSearchCV, StackingClassifier, CalibratedClassifierCV, etc.).
Question
If I am on the developing phase of a project and trying to identify which model(s) works best with lets say a GridSearchCV how can I get the random_state value of my model so I can use it in production?
At first I thoght lets use in the grid random_state, but the Robustness of cross-validation results section writes:
Passing instances leads to more robust CV results, and makes the comparison between various algorithms fairer. It also helps limiting the temptation to treat the estimator’s RNG as a hyper-parameter that can be tuned.

The random_state parameter is really intended for deterministic reproducibility & is especially useful when developing more complex pipelines.
GridSearchCV is used to find the best settings for your learning procedure. I emphasize procedure, because the cross-validation aspect is to get statistical results based on an estimate rather than an actual concrete model. Your RandomForest classifier & cross-validation like many techniques in machine learning rely on entropy/randomness to fairly approximate things. random_state should NOT be treated as a hyper-parameter else you are metric climbing on noise.
Once you know the optimal settings that statistically yield a good model beyond random chance, you want to re-apply the procedure to derive your production model. Note that the metric performance of this model will be bounded within (but not identical to) what grid-search estimated. Specifying random_state for the production model is bogus.
Here is a valid way to do things with/out specifying the random seeds:
# define your procedure as you wish. random_state is optional
# good for testing, irrelevant for predicting.
my_pipeline = RandomForestClassifier(random_state=42, ..)
# search parameter space & then refit another model with your
# procedure on the discovered parameters.
optimal = GridSearchCV(my_pipeline, params, refit=True, ...)
optimal.fit(train_X, train_y)
# get the new model trained with the best found parameters,
# rather than best performing model of the cross-validation!!
prod_model = optimal.best_estimator_

The warning stems from using the random number generator object rather than the random seed integer.
When you use the generator, subsequent calls to it, will produce different seeds in the random sequence. This means that since the clones share the same generator sequence the order of operations, rate at which each gets invoked etc is implicitly coupled & thus produce different but statistically equivalent objects. Using random integer seeds is safe.

Related

Differences between RepeatedStratifiedKFold and StratifiedKFold in sklearn

I tried to read the docs for RepeatedStratifiedKFold and StratifiedKFold, but couldn't tell the difference between the two methods except that RepeatedStratifiedKFold repeats StratifiedKFold n times with different randomization in each repetition.
My question is: Do these two methods return the same results? Which one should I use to split an imbalanced dataset when doing GridSearchCV and what is the rationale for choosing that method?

Both StratifiedKFold and RepeatedStratifiedKFold can be very effective when used on classification problems with a severe class imbalance. They both stratify the sampling by the class label; that is, they split the dataset in such a way that preserves approximately the same class distribution (i.e., the same percentage of samples of each class) in each subset/fold as in the original dataset. However, a single run of StratifiedKFold might result in a noisy estimate of the model's performance, as different splits of the data might result in very different results. That is where RepeatedStratifiedKFold comes into play.
RepeatedStratifiedKFold allows improving the estimated performance of a machine learning model, by simply repeating the cross-validation procedure multiple times (according to the n_repeats value), and reporting the mean result across all folds from all runs. This mean result is expected to be a more accurate estimate of the model's performance (see this article).
Thus—to answer your question—no, these two methods would not provide the same results. Using RepeatedStratifiedKFold means that each time running the procedure would result in a different split of the dataset into stratified k-folds, and hence, the performance results would be different.
RepeatedStratifiedKFold has the benefit of improving the estimated model's performance at the cost of fitting and evaluating many more models. If, for example, 5 repeats (i.e., n_repeats=5) of 10-fold cross-validation were used for estimating the model's performance, it means that 50 different models would need to be fitted (trained) and evaluated—which might be computationally expensive, depending on the dataset's size, type of machine learning algorithm, device specifications, etc. However, RepeatedStratifiedKFold process could be executed on different cores or different machines, which could dramatically speed up the process. For instance, setting n_jobs=-1 would use all the cores available on your system (have a look here).
When it comes to evaluation, make sure to use appropriate metrics, as described in this answer.

Confusion around the SKLearn GridSearchCV scoring parameter and using train test split

I'm a little bit confused about how GridSearchCV works with Train Test Split.
As far as I know, when creating models for the dataset I'm using, a paper used roc-auc.
I'm trying to replicate what this paper did, at least as well as I can. From reading a few other posts here, I've gathered that running GridSearchCV on the entire dataset is prone to overfitting, so we should split the data into a training partition and a testing partition. Then, we should run the training partition with GridSearchCV with whatever model and parameters, and then fit it, and then get a score using the test part of the dataset we set aside.
Now where I'm confused is with GridSearchCV, as far as I understand, it gives us scores for each of the folds that the data is split into when doing the search for parameters and using best_score_ we can pull the best of these scores. I don't understand what the scores represent and why you can pass in a scoring parameter to begin with, since the job of GridSearchCV is to always find the best possible parameters anyways? (Perhaps I'm making a poor assumption here but I'm assuming that there is an objective best set of parameters, regardless of scoring method). What I figured was that I would find the best parameters with GridSearchCV and then use the said parameters to create fit a model, and finally use that model and the partition I saved for testing and test it using the roc-auc scoring method.
So in the end, does it matter (if at all) what scoring methods I'm passing into GridSearchCV, as it will always look to give the best set of parameters anyways, which I will use to compute my final score with the testing partition?

This document may help.
Here you see that the scoring parameter allows you to have various metrics, such as roc_auc. See here all Scikit's metrics.
Optimizing over different metrics result in different optimal parameters. Just think about optimizing precision versus recall. Optimizing precision leads to less false positives while optimizing recall leads to less false negatives.
Also, in GridSearchCV, the CV stands for cross validated. Train/test splitting happens inside this function, it's taken care of. You only have to provide the splitter as an argument to GridSearchCV, for example cv=StratifiedKFold(n_splits=5, shuffle=True).

Cross-validation in LightGBM

How are we supposed to use the dictionary output from lightgbm.cv to improve our predictions?
Here's an example - we train our cv model using the code below:
cv_mod = lgb.cv(params,
d_train,
500,
nfold = 10,
early_stopping_rounds = 25,
stratified = True)
How can we use the parameters found from the best iteration of the above code to predict an output? In this case, cv_mod has no "predict" method like lightgbm.train, and the dictionary output from lightgbm.cvthrows an error when used in lightgbm.train.predict(..., pred_parameters = cv_mod).
Am I missing an important transformation step?

In general, the purpose of CV is NOT to do hyperparameter optimisation. The purpose is to evaluate performance of model-building procedure.
A basic train/test split is conceptually identical to a 1-fold CV (with a custom size of the split in contrast to the 1/K train size in the k-fold CV). The advantage of doing more splits (i.e. k>1 CV) is to get more information about the estimate of generalisation error. There is more info in a sense of getting the error + stat uncertainty. There is an excellent discussion on CrossValidated (start with the links added to the question, which cover the same question, but formulated in a different way). It covers nested cross validation and is absolutely not straightforward. But if you will wrap your head around the concept in general, this will help you in various non-trivial situations. The idea that you have to take away is: The purpose of CV is to evaluate performance of model-building procedure.
Keeping that idea in mind, how does one approach hyperparameter estimation in general (not only in LightGBM)?
You want to train a model with a set of parameters on some data and evaluate each variation of the model on an independent (validation) set. Then you intend to choose the best parameters by choosing the variant that gives the best evaluation metric of your choice.
This can be done with a simple train/test split. But evaluated performance, and thus the choice of the optimal model parameters, might be just a fluctuation on a particular split.
Thus, you can evaluate each of those models more statistically robust averaging evaluation over several train/test splits, i.e k-fold CV.
Then you can make a step further and say that you had an additional hold-out set, that was separated before hyperparameter optimisation was started. This way you can evaluate the chosen best model on that set to measure the final generalisation error. However, you can make even step further and instead of having a single test sample you can have an outer CV loop, which brings us to nested cross validation.
Technically, lightbgm.cv() allows you only to evaluate performance on a k-fold split with fixed model parameters. For hyper-parameter tuning you will need to run it in a loop providing different parameters and recoding averaged performance to choose the best parameter set. after the loop is complete. This interface is different from sklearn, which provides you with complete functionality to do hyperparameter optimisation in a CV loop. Personally, I would recommend to use the sklearn-API of lightgbm. It is just a wrapper around the native lightgbm.train() functionality, thus it is not slower. But it allows you to use the full stack of sklearn toolkit, thich makes your life MUCH easier.

If you're happy with your CV results, you just use those parameters to call the 'lightgbm.train' method. Like #pho said, CV is usually just for param tuning. You don't use the actual CV object for predictions.

You should use CV for parameter optimization.
If your model performs well on all folds use these parameters to train on the whole training set.
Then evaluate that model on the external test set.

Problems with the random-state parameter on data splitting with sklearn

When I look for the random -state parameter in sklearn's documentation, this is what I find:
random_state : int or RandomState
Pseudo-random number generator state used for random sampling.
I don't understand very well what it is.
The accuracy for different classifiers changes notably depending on the number I write on the random-state parameter. Why is that? Which number should I set?
It is my first time on a Machine Learning project.

Setting the random_state parameter ensures that your data are split in exactly the same manner each time you run your code. This practice is important when you want to compare the accuracy of different models (e.g. different algorithms or additional features, or both): if you keep shuffling the deck in different ways while testing new approaches, how are you to know whether the increase or decrease in accuracy is due to the changes you've made to your model, versus being due to using slightly different train and test datasets?
As far as choosing the number for your random_state parameter: that's up to you. Some experiment with different values of the parameter and see for which random_state value the model performs best. It really depends on your application: is this a production-scale machine-learning model you're developing, or is it a model for a data science challenge? In the former case, it shouldn't matter much. In the latter case, I have known people who tune their model completely and then begin experimenting with different random_state parameters to bump up their accuracies. I don't necessarily agree with that practice, because it seems like another form of overfitting (see more here. I usually choose 100 because that number is funny to me -- there's really no logic behind it. Some people choose 42, others 1, etc.
See a more detailed example here.

Classification tree in sklearn giving inconsistent answers

I am using a classification tree from sklearn and when I have the the model train twice using the same data, and predict with the same test data, I am getting different results. I tried reproducing on a smaller iris data set and it worked as predicted. Here is some code
from sklearn import tree
from sklearn.datasets import iris
clf = tree.DecisionTreeClassifier()
clf.fit(iris.data, iris.target)
r1 = clf.predict_proba(iris.data)
clf.fit(iris.data, iris.target)
r2 = clf.predict_proba(iris.data)
r1 and r2 are the same for this small example, but when I run on my own much larger data set I get differing results. Is there a reason why this would occur?
EDIT After looking into some documentation I see that DecisionTreeClassifier has an input random_state which controls the starting point. By setting this value to a constant I get rid of the problem I was previously having. However now I'm concerned that my model is not as optimal as it could be. What is the recommended method for doing this? Try some randomly? Or are all results expected to be about the same?

The DecisionTreeClassifier works by repeatedly splitting the training data, based on the value of some feature. The Scikit-learn implementation lets you choose between a few splitting algorithms by providing a value to the splitter keyword argument.
"best" randomly chooses a feature and finds the 'best' possible split for it, according to some criterion (which you can also choose; see the methods signature and the criterion argument). It looks like the code does this N_feature times, so it's actually quite like a bootstrap.
"random" chooses the feature to consider at random, as above. However, it also then tests randomly-generated thresholds on that feature (random, subject to the constraint that it's between its minimum and maximum values). This may help avoid 'quantization' errors on the tree where the threshold is strongly influenced by the exact values in the training data.
Both of these randomization methods can improve the trees' performance. There are some relevant experimental results in Lui, Ting, and Fan's (2005) KDD paper.
If you absolutely must have an identical tree every time, then I'd re-use the same random_state. Otherwise, I'd expect the trees to end up more or less equivalent every time and, in the absence of a ton of held-out data, I'm not sure how you'd decide which random tree is best.
See also: Source code for the splitter

The answer provided by Matt Krause does not answer the question entirely correctly.
The reason for the observed behaviour in scikit-learn's DecisionTreeClassifier is explained in this issue on GitHub.
When using the default settings, all features are considered at each split. This is governed by the max_features parameter, which specifies how many features should be considered at each split. At each node, the classifier randomly samples max_features without replacement (!).
Thus, when using max_features=n_features, all features are considered at each split. However, the implementation will still sample them at random from the list of features (even though this means all features will be sampled, in this case). Thus, the order in which the features are considered is pseudo-random. If two possible splits are tied, the first one encountered will be used as the best split.
This is exactly the reason why your decision tree is yielding different results each time you call it: the order of features considered is randomized at each node, and when two possible splits are then tied, the split to use will depend on which one was considered first.
As has been said before, the seed used for the randomization can be specified using the random_state parameter.

The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data and max_features=n_features, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.
Source: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier#Notes

I don't know anything about sklearn but...
I guess DecisionTreeClassifier has some internal state, create by fit, which only gets updated/extended.
You should create a new one?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.