Problems with the random-state parameter on data splitting with sklearn - python

When I look for the random_state parameter in sklearn's documentation, this is what I find:
random_state : int or RandomState
Pseudo-random number generator state used for random sampling.
I don't really understand what it is.
The accuracy of different classifiers changes noticeably depending on the number I pass to the random_state parameter. Why is that? Which number should I set?
This is my first time working on a machine learning project.

Setting the random_state parameter ensures that your data are split in exactly the same manner each time you run your code. This practice is important when you want to compare the accuracy of different models (e.g. different algorithms or additional features, or both): if you keep shuffling the deck in different ways while testing new approaches, how are you to know whether the increase or decrease in accuracy is due to the changes you've made to your model, versus being due to using slightly different train and test datasets?
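For example, here is a minimal sketch (toy data and variable names are just for illustration, not from the original question): fixing random_state makes repeated splits identical, while leaving it unset gives a different shuffle, and possibly a different accuracy, on every run.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
# Same seed -> identical splits on every run
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.3, random_state=100)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.3, random_state=100)
print((X_te1 == X_te2).all())  # True: the test sets contain exactly the same rows
# No seed -> a different shuffle each run, so model accuracy can drift between runs
X_tr3, X_te3, y_tr3, y_te3 = train_test_split(X, y, test_size=0.3)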
As far as choosing the number for your random_state parameter: that's up to you. Some people experiment with different values and see for which random_state value the model performs best. It really depends on your application: is this a production-scale machine-learning model you're developing, or is it a model for a data science challenge? In the former case, it shouldn't matter much. In the latter case, I have known people who tune their model completely and then experiment with different random_state values to bump up their accuracies. I don't necessarily agree with that practice, because it seems like another form of overfitting (see more here). I usually choose 100 because that number is funny to me -- there's really no logic behind it. Some people choose 42, others 1, etc.
See a more detailed example here.

Related

How to reproduce a RandomForestClassifier when random_state uses RandomState

Context:
I was reading scikit-learn's Common Pitfalls documentation. In the Controlling randomness section I was surprised to read that the estimator should use an instance of np.random.RandomState in the initialization of the classifier, but not for the CV split, since I've always used random_state with an integer. After reading the Robustness of cross-validation results section I thought 'OK, I get it', BUT there is a warning in the cloning section that says:
from sklearn import clone
from sklearn.ensemble import RandomForestClassifier
import numpy as np
rng = np.random.RandomState(0)
a = RandomForestClassifier(random_state=rng)
b = clone(a)
Since a RandomState instance was passed to a, a and b are not clones in the strict sense, but rather clones in the statistical sense: a and b will still be different models, even when calling fit(X, y) on the same data. Moreover, a and b will influence each-other since they share the same internal RNG: calling a.fit will consume b’s RNG, and calling b.fit will consume a’s RNG, since they are the same. This bit is true for any estimators that share a random_state parameter; it is not specific to clones.
If an integer were passed, a and b would be exact clones and they would not influence each other
Warning Even though clone is rarely used in user code, it is called pervasively throughout scikit-learn codebase: in particular, most meta-estimators that accept non-fitted estimators call clone internally (GridSearchCV, StackingClassifier, CalibratedClassifierCV, etc.).
Question
If I am in the development phase of a project and trying to identify which model(s) work best with, let's say, a GridSearchCV, how can I get the random_state value of my model so I can use it in production?
At first I thought: let's put random_state in the grid, but the Robustness of cross-validation results section says:
Passing instances leads to more robust CV results, and makes the comparison between various algorithms fairer. It also helps limiting the temptation to treat the estimator’s RNG as a hyper-parameter that can be tuned.
The random_state parameter is really intended for deterministic reproducibility & is especially useful when developing more complex pipelines.
GridSearchCV is used to find the best settings for your learning procedure. I emphasize procedure, because the cross-validation aspect is there to get statistical results based on an estimate rather than an actual concrete model. Your RandomForest classifier and cross-validation, like many techniques in machine learning, rely on entropy/randomness to fairly approximate things. random_state should NOT be treated as a hyper-parameter, else you are metric-climbing on noise.
Once you know the optimal settings that statistically yield a good model beyond random chance, you want to re-apply the procedure to derive your production model. Note that the metric performance of this model will be bounded within (but not identical to) what grid-search estimated. Specifying random_state for the production model is bogus.
Here is a valid way to do things with/out specifying the random seeds:
# define your procedure as you wish. random_state is optional
# good for testing, irrelevant for predicting.
my_pipeline = RandomForestClassifier(random_state=42, ..)
# search parameter space & then refit another model with your
# procedure on the discovered parameters.
optimal = GridSearchCV(my_pipeline, params, refit=True, ...)
optimal.fit(train_X, train_y)
# get the new model trained with the best found parameters,
# rather than best performing model of the cross-validation!!
prod_model = optimal.best_estimator_
The warning stems from using the random number generator object rather than the random seed integer.
When you use the generator, subsequent calls to it produce different values in its random sequence. This means that, since the clones share the same generator, the order of operations, the rate at which each gets invoked, etc. are implicitly coupled, and thus the clones end up as different but statistically equivalent objects. Using integer seeds is safe.
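As a rough sketch of that coupling (toy data, purely illustrative): two estimators sharing one RandomState instance consume the same stream and end up as different models, while integer seeds give each estimator its own reproducible stream.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
X, y = load_iris(return_X_y=True)
# Shared generator: the first fit consumes randomness the second one would have used.
rng = np.random.RandomState(0)
a = RandomForestClassifier(n_estimators=10, random_state=rng).fit(X, y)
b = RandomForestClassifier(n_estimators=10, random_state=rng).fit(X, y)
print((a.feature_importances_ == b.feature_importances_).all())  # typically False
# Integer seed: each estimator derives its own independent, reproducible stream.
c = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
d = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print((c.feature_importances_ == d.feature_importances_).all())  # True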

Random forest regression - force it to use more than 20% of possible variables

I'm building a model with about 30 features. I know the best fit would be if RF (regression) used many of them (more than 20%) in similar proportions (meaning model.feature_importances_ would be high for many of them). I performed some optimization using different n_estimators, max_depth, max_leaf_nodes and so on. I didn't find an RF parameter that forces it to use many features. Of course I can limit their usage from the top, but how from the bottom?
Look into this:
https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html
The limiting from the bottom doesn't make a lot of sense.
When building the forest, you can use the max_features parameter, though this limits from the top rather than the bottom.
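If what you actually want is to see which features the model relies on, here is a minimal sketch using scikit-learn's own sklearn.inspection.permutation_importance (available in recent scikit-learn versions; the data and variable names are illustrative, not from the question):
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
X, y = make_regression(n_samples=500, n_features=30, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
# Shuffle each feature on held-out data and measure how much the score drops.
result = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=0)
print(result.importances_mean)  # one score per feature; near zero means the model ignores it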

Optimize Random Forest regressor due to computational limits

Model fitting using the Random Forest regressor takes up all the RAM, which causes the online hosted notebook environment (Google Colab or Kaggle kernel) to crash. Could you guys help me out with optimizing the model?
I already tried tuning hyperparameters, like reducing the number of estimators, but it doesn't work. df.info() shows 4446965 records for the train data, which takes up ~1 GB of memory.
I can't post the whole notebook code here as it would be too long, but could you please check this link for your reference. I've provided some information below related to the dataframe for training.
clf = RandomForestRegressor(n_estimators=100,min_samples_leaf=2,min_samples_split=3, max_features=0.5 ,n_jobs=-1)
clf.fit(train_X, train_y)
pred = clf.predict(val_X)
train_x.info() shows 3557572 records taking up almost 542 MB of memory
I'm still getting started with ML and any help would be appreciated. Thank you!
Random Forest by nature puts a massive load on the CPU and RAM, and that's one of its well-known drawbacks! So there is nothing unusual in your question.
Furthermore, and more specifically, there are different factors that contribute to this issue, to name a few:
The Number of Attributes (features) in Dataset.
The Number of Trees (n_estimators).
The Maximum Depth of the Tree (max_depth).
The Minimum Number of Samples required to be at a Leaf Node (min_samples_leaf).
Moreover, Scikit-learn's documentation clearly states this issue, and I am quoting here:
The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.
What to Do?
There's not much that you can do, especially since Scikit-learn does not provide an option to manage the storage issue on the fly (as far as I am aware).
Rather, you need to change the values of the above-mentioned parameters, for example:
Try to keep the most important features only if the number of features is already high (see Feature Selection in Scikit-learn and Feature importances with forests of trees).
Try to reduce the number of estimators.
max_depth is None by default which means the nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_leaf is 1 by default: A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
So try to change the parameters while understanding their effects on performance; the reference you need is this. A rough sketch of such settings follows below.
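Here is a rough sketch of constrained settings along those lines (the specific values are illustrative starting points, not tuned recommendations; the synthetic data is only a stand-in for the question's train_X/train_y so the snippet runs):
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
# Synthetic stand-in for the question's train_X / train_y, just so the sketch runs.
train_X, train_y = make_regression(n_samples=10000, n_features=20, random_state=0)
# Illustrative values only: fewer and shallower trees with larger leaves mean far
# fewer nodes to keep in memory, at the cost of some accuracy.
clf = RandomForestRegressor(
    n_estimators=40,       # fewer trees than the default 100
    max_depth=15,          # cap the depth instead of growing trees fully
    min_samples_leaf=20,   # larger leaves -> far fewer nodes on millions of rows
    max_features=0.5,
    n_jobs=2,              # bound parallelism to limit peak memory
)
clf.fit(train_X, train_y)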
The last option you have is to create your own customized Random Forest from scratch, offloading the metadata to hard disk, etc., or doing any other optimization. It's awkward, but just to mention such an option: here is an example of a basic implementation!
Side-Note:
Practically, I have found on my Core i7 laptop that setting the parameter n_jobs to -1 overwhelms the machine; I always find it more efficient to keep the default setting n_jobs=None, although theoretically speaking it should be the opposite!

Determine WHY Features Are Important in Decision Tree Models

Often-times stakeholders don't want a black-box model that's good at predicting; they want insights about features to have a better understanding about their business, and so they can explain it to others.
When we inspect the feature importance of an xgboost or sklearn gradient boosting model, we can determine the feature importance... but we don't understand WHY the features are important, do we?
Is there a way to explain not only what features are important but also WHY they're important?
I was told to use shap, but running even some of the boilerplate examples throws errors, so I'm looking for alternatives (or even just a procedural way to inspect trees and glean insights I can take away, other than a plot_importance() plot).
In the example below, how does one go about explaining WHY feature f19 is the most important (while also realizing that decision trees are random without a random_state or seed)?
from xgboost import XGBClassifier, plot_importance
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
X,y = make_classification(random_state=68)
xgb = XGBClassifier()
xgb.fit(X, y)
plot_importance(xgb)
plt.show()
Update:
What I'm looking for is a programmatic procedural proof that the features chosen by the model above contribute either positively or negatively to the predictive power. I want to see code (not theory) of how you would go about inspecting the actual model and determining each feature's positive or negative contribution. Currently, I maintain that it's not possible so somebody please prove me wrong. I'd love to be wrong!
I also understand that decision trees are non-parametric and have no coefficients. Still, is there a way to see if a feature contributes positively (one unit of this feature increases y) or negatively (one unit of this feature decreases y)?
Update2:
Despite a thumbs down on this question, and several "close" votes, it seems this question isn't so crazy after all. Partial dependence plots might be the answer.
Partial Dependence Plots (PDP) were introduced by Friedman (2001) with the purpose of interpreting complex Machine Learning algorithms. Interpreting a linear regression model is not as complicated as interpreting Support Vector Machine, Random Forest or Gradient Boosting Machine models; this is where Partial Dependence Plots come into use. For some statistical explanation you can refer here and More Advanced. Some of the algorithms have methods for finding variable importance, but they do not express whether a variable is positively or negatively affecting the model.
tldr; http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html
I'd like to clear up some of the wording to make sure we're on the same page.
Predictive power: what features significantly contribute to the prediction
Feature dependence: are the features positively or negatively
correlated, i.e., does a change in the feature X cause the prediction y to increase/decrease
1. Predictive power
Your feature importance shows you what retains the most information, and are the most significant features. Power could imply what causes the biggest change - you would have to check by plugging in dummy values to see their overall impact, much like you would have to do with linear regression coefficients.
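As a crude sketch of that "plug in dummy values" check (using scikit-learn's gradient boosting and the same toy data as the question, purely for illustration):
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
X, y = make_classification(random_state=68)
model = GradientBoostingClassifier(random_state=68).fit(X, y)
# Overwrite one feature (index 19, as in the question) with its extremes and
# compare the average predicted probability of the positive class.
X_low, X_high = X.copy(), X.copy()
X_low[:, 19] = X[:, 19].min()
X_high[:, 19] = X[:, 19].max()
shift = model.predict_proba(X_high)[:, 1].mean() - model.predict_proba(X_low)[:, 1].mean()
print(shift)  # positive -> pushing f19 up raises predictions; negative -> it lowers them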
2. Correlation/Dependence
As pointed out by @Tiago1984, it depends heavily on the underlying algorithm. XGBoost/GBM additively build a committee of stumps (shallow decision trees, usually with only one split).
In a regression problem, the trees are typically using a criterion related to the MSE. I won't go into the full details, but you can read more here: https://medium.com/towards-data-science/boosting-algorithm-gbm-97737c63daa3.
You'll see that at each step it calculates a vector for the "direction" of the weak learner, so you in principle know the direction of the influence from it (but keep in mind it may appear many times in one tree, in multiple steps of the additive model).
But, to cut to the chase; you could just fix all your features apart from f19 and make a prediction for a range of f19 values and see how it is related to the response value.
Take a look at partial dependency plots: http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html
There's also a chapter on it in Elements of Statistical Learning, Chapter 10.13.2.
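A minimal sketch of that idea with scikit-learn's partial dependence tooling (this uses sklearn's GradientBoostingClassifier rather than xgboost, and assumes a reasonably recent scikit-learn with sklearn.inspection.PartialDependenceDisplay):
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt
X, y = make_classification(random_state=68)
gbm = GradientBoostingClassifier(random_state=68).fit(X, y)
# Average predicted response as feature 19 sweeps its range, everything else held at
# the observed values: an upward curve means the feature pushes predictions up.
PartialDependenceDisplay.from_estimator(gbm, X, features=[19])
plt.show()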
The "importance" of a feature depends on the algorithm you are using to build the trees. In C4.5 trees, for example, a maximum-entropy criterion is often used. This means that the feature set is the one that allows classification with the fewer decision steps.
When we inspect the feature importance of an xgboost or sklearn gradient boosting model, we can determine the feature importance... but we don't understand WHY the features are important, do we?
Yes, we do. Feature importance is not some magical object; it is a well-defined mathematical criterion - its exact definition depends on the particular model (and/or some additional choices), but it is always an object which tells "why". The "why" is usually the most basic thing possible, and boils down to "because it has the strongest predictive power". For example, for a random forest, feature importance is a measure of how probable it is for this feature to be used on a decision path when a randomly selected training data point is pushed through the tree. So it gives "why" in a proper, mathematical sense.

Classification tree in sklearn giving inconsistent answers

I am using a classification tree from sklearn, and when I train the model twice using the same data and predict with the same test data, I get different results. I tried reproducing this on the smaller iris dataset and it worked as predicted. Here is some code
from sklearn import tree
from sklearn.datasets import load_iris
iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf.fit(iris.data, iris.target)
r1 = clf.predict_proba(iris.data)
clf.fit(iris.data, iris.target)
r2 = clf.predict_proba(iris.data)
r1 and r2 are the same for this small example, but when I run on my own much larger data set I get differing results. Is there a reason why this would occur?
EDIT After looking into some documentation I see that DecisionTreeClassifier has an input random_state which controls the starting point. By setting this value to a constant I get rid of the problem I was previously having. However now I'm concerned that my model is not as optimal as it could be. What is the recommended method for doing this? Try some randomly? Or are all results expected to be about the same?
The DecisionTreeClassifier works by repeatedly splitting the training data, based on the value of some feature. The Scikit-learn implementation lets you choose between a few splitting algorithms by providing a value to the splitter keyword argument.
"best" randomly chooses a feature and finds the 'best' possible split for it, according to some criterion (which you can also choose; see the methods signature and the criterion argument). It looks like the code does this N_feature times, so it's actually quite like a bootstrap.
"random" chooses the feature to consider at random, as above. However, it also then tests randomly-generated thresholds on that feature (random, subject to the constraint that it's between its minimum and maximum values). This may help avoid 'quantization' errors on the tree where the threshold is strongly influenced by the exact values in the training data.
Both of these randomization methods can improve the trees' performance. There are some relevant experimental results in Liu, Ting, and Fan's (2005) KDD paper.
If you absolutely must have an identical tree every time, then I'd re-use the same random_state. Otherwise, I'd expect the trees to end up more or less equivalent every time and, in the absence of a ton of held-out data, I'm not sure how you'd decide which random tree is best.
See also: Source code for the splitter
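A small sketch of the two splitter modes on toy data (illustrative only): both involve randomness, fixing random_state makes either one reproducible, and they generally grow different trees.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
X, y = load_iris(return_X_y=True)
best_tree = DecisionTreeClassifier(splitter="best", random_state=0).fit(X, y)
random_tree = DecisionTreeClassifier(splitter="random", random_state=0).fit(X, y)
# Randomly drawn thresholds usually need more splits to reach pure leaves.
print(best_tree.get_depth(), random_tree.get_depth())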
The answer provided by Matt Krause does not answer the question entirely correctly.
The reason for the observed behaviour in scikit-learn's DecisionTreeClassifier is explained in this issue on GitHub.
When using the default settings, all features are considered at each split. This is governed by the max_features parameter, which specifies how many features should be considered at each split. At each node, the classifier randomly samples max_features without replacement (!).
Thus, when using max_features=n_features, all features are considered at each split. However, the implementation will still sample them at random from the list of features (even though this means all features will be sampled, in this case). Thus, the order in which the features are considered is pseudo-random. If two possible splits are tied, the first one encountered will be used as the best split.
This is exactly the reason why your decision tree is yielding different results each time you call it: the order of features considered is randomized at each node, and when two possible splits are then tied, the split to use will depend on which one was considered first.
As has been said before, the seed used for the randomization can be specified using the random_state parameter.
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data and max_features=n_features, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.
Source: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier#Notes
I don't know anything about sklearn but...
I guess DecisionTreeClassifier has some internal state, created by fit, which only gets updated/extended.
You should create a new one?
