Classification tree in sklearn giving inconsistent answers - python

I am using a classification tree from sklearn, and when I train the model twice on the same data and predict with the same test data, I get different results. I tried reproducing this on the smaller iris data set and there it worked as expected. Here is some code:
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier()

# Fit and predict twice on the same data
clf.fit(iris.data, iris.target)
r1 = clf.predict_proba(iris.data)
clf.fit(iris.data, iris.target)
r2 = clf.predict_proba(iris.data)
r1 and r2 are the same for this small example, but when I run on my own much larger data set I get differing results. Is there a reason why this would occur?
EDIT: After looking into the documentation, I see that DecisionTreeClassifier has an input random_state which controls the starting point. By setting this value to a constant, I get rid of the problem I was previously having. However, now I'm concerned that my model is not as optimal as it could be. What is the recommended method for doing this? Try some seeds randomly? Or are all results expected to be about the same?

The DecisionTreeClassifier works by repeatedly splitting the training data, based on the value of some feature. The Scikit-learn implementation lets you choose between a few splitting algorithms by providing a value to the splitter keyword argument.
"best" randomly chooses a feature and finds the 'best' possible split for it, according to some criterion (which you can also choose; see the methods signature and the criterion argument). It looks like the code does this N_feature times, so it's actually quite like a bootstrap.
"random" chooses the feature to consider at random, as above. However, it also then tests randomly-generated thresholds on that feature (random, subject to the constraint that it's between its minimum and maximum values). This may help avoid 'quantization' errors on the tree where the threshold is strongly influenced by the exact values in the training data.
Both of these randomization methods can improve the trees' performance. There are some relevant experimental results in Liu, Ting, and Fan's (2005) KDD paper.
If you absolutely must have an identical tree every time, then I'd re-use the same random_state. Otherwise, I'd expect the trees to end up more or less equivalent every time and, in the absence of a ton of held-out data, I'm not sure how you'd decide which random tree is best.
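For example, here is a minimal sketch (reusing the iris data from the question; the seed value 0 is arbitrary) of fixing random_state and switching between the two splitters:

from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()

# With random_state fixed, repeated fits produce identical trees
clf_best = tree.DecisionTreeClassifier(splitter="best", random_state=0)
clf_best.fit(iris.data, iris.target)

# splitter="random" also draws random thresholds for each candidate feature
clf_random = tree.DecisionTreeClassifier(splitter="random", random_state=0)
clf_random.fit(iris.data, iris.target)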
See also: Source code for the splitter

The answer provided by Matt Krause does not answer the question entirely correctly.
The reason for the observed behaviour in scikit-learn's DecisionTreeClassifier is explained in this issue on GitHub.
When using the default settings, all features are considered at each split. This is governed by the max_features parameter, which specifies how many features should be considered at each split. At each node, the classifier randomly samples max_features without replacement (!).
Thus, when using max_features=n_features, all features are considered at each split. However, the implementation will still sample them at random from the list of features (even though this means all features will be sampled, in this case). Thus, the order in which the features are considered is pseudo-random. If two possible splits are tied, the first one encountered will be used as the best split.
This is exactly the reason why your decision tree is yielding different results each time you call it: the order of features considered is randomized at each node, and when two possible splits are then tied, the split to use will depend on which one was considered first.
As has been said before, the seed used for the randomization can be specified using the random_state parameter.
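To illustrate the behaviour described above, here is a small sketch on synthetic data (the iris set is usually too clean to show the tie-breaking effect; the seed values are arbitrary):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Without a fixed seed, ties between equally good splits may be broken
# differently on each fit, so the two trees can differ
clf_a = DecisionTreeClassifier().fit(X, y)
clf_b = DecisionTreeClassifier().fit(X, y)
print(np.array_equal(clf_a.predict_proba(X), clf_b.predict_proba(X)))  # may be False

# With random_state fixed, the feature ordering (and hence the tie-breaking)
# is reproducible, so both fits yield identical trees
clf_c = DecisionTreeClassifier(random_state=42).fit(X, y)
clf_d = DecisionTreeClassifier(random_state=42).fit(X, y)
print(np.array_equal(clf_c.predict_proba(X), clf_d.predict_proba(X)))  # True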

The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data and max_features=n_features, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.
Source: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier#Notes

I don't know anything about sklearn but...
I guess DecisionTreeClassifier has some internal state, created by fit, which only gets updated/extended.
You should create a new one?

Related

How to show each of the bootstrapped / sampled datasets in random forest sklearn?

From what I understand about the random forest algorithm, it randomly samples the original dataset to build a new sampled/bootstrapped dataset, and each sampled dataset is then turned into a decision tree.
In scikit-learn, you can visualize each individual tree in the random forest. But my question is: how do I show the sampled/bootstrapped dataset behind each of those trees?
I want to see the features and the rows of data used to build each individual tree.
I am not aware of a way to see the bootstrapped rows (samples) of your data, as this is a random process. Nor do I think it is of much importance, as long as the algorithm trained well.
Nevertheless, for the features: in your visualization of the tree you can already see which features were used by the splits that were made. But you can also access them directly via the attribute feature_names_in_ (see here) for each DecisionTree in the forest, e.g.
print(rf.estimators_[0].feature_names_in_)
If your feature names in the data have not been defined as described in the documentation, a workaround is to use feature_importances_ (see here) as a proxy instead: features that were left unused have an importance of 0. You can contrast this with the max_features_ value you set for training to see whether you caught them all.
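To make the proxy idea concrete, here is a minimal sketch on synthetic data (variable names and parameter values are just for illustration):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# For each tree, features with non-zero importance were used in at least one split
for i, est in enumerate(rf.estimators_):
    used = np.flatnonzero(est.feature_importances_ > 0)
    print(f"tree {i}: used feature indices {used}")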

Confusion around the SKLearn GridSearchCV scoring parameter and using train test split

I'm a little bit confused about how GridSearchCV works with Train Test Split.
As far as I know, when creating models for the dataset I'm using, a paper used roc-auc.
I'm trying to replicate what this paper did, at least as well as I can. From reading a few other posts here, I've gathered that running GridSearchCV on the entire dataset is prone to overfitting, so we should split the data into a training partition and a testing partition. Then, we should run the training partition with GridSearchCV with whatever model and parameters, and then fit it, and then get a score using the test part of the dataset we set aside.
Now, where I'm confused is with GridSearchCV: as far as I understand, it gives us scores for each of the folds that the data is split into during the parameter search, and using best_score_ we can pull the best of these scores. I don't understand what the scores represent, and why you can pass in a scoring parameter to begin with, since the job of GridSearchCV is to find the best possible parameters anyways. (Perhaps I'm making a poor assumption here, but I'm assuming that there is an objective best set of parameters, regardless of scoring method.) What I figured was that I would find the best parameters with GridSearchCV, then use those parameters to create and fit a model, and finally evaluate that model on the partition I saved for testing, using the roc-auc scoring method.
So in the end, does it matter (if at all) what scoring methods I'm passing into GridSearchCV, as it will always look to give the best set of parameters anyways, which I will use to compute my final score with the testing partition?
This document may help.
Here you see that the scoring parameter allows you to have various metrics, such as roc_auc. See here all Scikit's metrics.
Optimizing over different metrics results in different optimal parameters. Just think about optimizing precision versus recall: optimizing precision leads to fewer false positives, while optimizing recall leads to fewer false negatives.
Also, in GridSearchCV, the CV stands for cross-validated. The train/test splitting happens inside this function; it's taken care of. You only have to provide the splitter as an argument to GridSearchCV, for example cv=StratifiedKFold(n_splits=5, shuffle=True).
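For example, here is a minimal sketch of the workflow described in the question (synthetic data and a random forest purely for illustration): tune on the training partition with roc_auc as the scoring metric, then report the final ROC AUC on the held-out test partition:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# The grid search picks the parameters that maximise the chosen metric
# (here roc_auc) across the cross-validation folds of the training partition
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)

# Final evaluation on the held-out test partition
test_auc = roc_auc_score(y_test, grid.best_estimator_.predict_proba(X_test)[:, 1])
print(test_auc)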

How does `max_features` limit the number of features in an sklearn ensemble model?

I still do not entirely understand max_features in sklearn classifiers. The documentation leaves a little bit of room open for interpretation. For the purposes of this question, suppose I am using a tree-based classifier, such as a decision tree, random forest, gradient boosting, etc.
If for example, I were to set max_features=10, does that mean that each estimator will randomly take 10 features from my dataset to build the entire tree, or does it mean that each time a node is being split, each estimator randomly samples 10 features and picks the one that reduces entropy the most?
It was my understanding that an individual estimator would limit itself to 10 variables for the entire process. However, it seems that the maximum number of features resets at each node. That is, for any given node, the estimator randomly selects 10 features, chooses the best one, splits the node and repeats the process for all subsequent nodes.
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(max_features=10, n_estimators=150)
However, it seems that the maximum number of features resets at each node. That is, for any given node, the estimator randomly selects 10 features, chooses the best one, splits the node and repeats the process for all subsequent nodes.
Your understanding here is correct. The way both gradient boosting and random forests work is that at each split for each tree, they will randomly select max_features (in the literature, this parameter is called mtry) to evaluate. This is one of the mechanisms by which these models introduce randomness across trees: not every feature is evaluated at every split.
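As a quick illustration of the per-split behaviour, the sketch below (synthetic data, arbitrary parameter values) fits a single tree with max_features=2 and then counts how many distinct features the whole tree ends up using; because a fresh subset of features is drawn at every node, the total is typically larger than 2:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# max_features=2: only 2 randomly drawn features are evaluated at each split
clf = DecisionTreeClassifier(max_features=2, random_state=0).fit(X, y)

# Features actually used somewhere in the fitted tree (leaves are marked with -2)
used = np.unique(clf.tree_.feature[clf.tree_.feature >= 0])
print(len(used))  # usually more than 2, since the subset is redrawn at every node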
Complementary to Tgsmith61591's answer, if you dive deeper into the code you can find an additional comment which adds some useful information on how the max_features hyperparameter works in the model:
Notes
-----
The features are always randomly permuted at each split. Therefore,
the best found split may vary, even with the same training data and
max_features=n_features, if the improvement of the criterion is
identical for several splits enumerated during the search of the best
split. To obtain a deterministic behaviour during fitting,
random_state has to be fixed.
Furthermore, the key concept is that random forests need to increase the randomness in order to decrease the variance of the model (although this may lead to higher bias), as warned in this final note:
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.

Problems with the random_state parameter on data splitting with sklearn

When I look for the random_state parameter in sklearn's documentation, this is what I find:
random_state : int or RandomState
Pseudo-random number generator state used for random sampling.
I don't understand very well what it is.
The accuracy of different classifiers changes notably depending on the number I pass to the random_state parameter. Why is that? Which number should I set?
It is my first time on a Machine Learning project.
Setting the random_state parameter ensures that your data are split in exactly the same manner each time you run your code. This practice is important when you want to compare the accuracy of different models (e.g. different algorithms or additional features, or both): if you keep shuffling the deck in different ways while testing new approaches, how are you to know whether the increase or decrease in accuracy is due to the changes you've made to your model, versus being due to using slightly different train and test datasets?
As far as choosing the number for your random_state parameter: that's up to you. Some people experiment with different values of the parameter to see for which random_state value the model performs best. It really depends on your application: is this a production-scale machine-learning model you're developing, or is it a model for a data science challenge? In the former case, it shouldn't matter much. In the latter case, I have known people who tune their model completely and then begin experimenting with different random_state parameters to bump up their accuracies. I don't necessarily agree with that practice, because it seems like another form of overfitting (see more here). I usually choose 100 because that number is funny to me -- there's really no logic behind it. Some people choose 42, others 1, etc.
See a more detailed example here.
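As an illustration of the fixed-split idea, here is a minimal sketch (synthetic data, arbitrary seed): both classifiers are evaluated on exactly the same train/test partition, so any difference in accuracy comes from the models rather than from the split:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)

# Fixing random_state means this split is identical on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)

for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))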

What is "The sum of true positives and false positives are equal to zero for some labels." mean?

I'm using scikit-learn to perform cross validation with StratifiedKFold to compute the f1 score, but it warns that the sum of true positives and false positives is equal to zero for some labels. I thought using StratifiedKFold should prevent this? Why am I getting this problem?
Also, is there a way to get the confusion matrix from the cross_val_score function?
Your classifier is probably classifying all data points as negative, so there are no positives. You can check that is the case by looking at the confusion matrix (docs and example here). It's hard to tell what is happening without information about your data and choice of classifier, but common causes include:
bug in your code. Check your training data contains negative data points, and that these data points contain non-zero features.
inappropriate classifier parameters. If using Naive Bayes, check your class biases. If using SVM, try using grid search over parameter values.
The sklearn classification_report function may come in handy (docs).
Re your second question: stratification ensures that each fold contains roughly the same proportion of data points from all classes. This does not mean your classifier will perform sensibly.
Update:
In a classification task (and especially when class imbalance is present) you are trading off precision for recall. Depending on your application, you can set your classifier so it does well most of the time (i.e. high accuracy) or so that it can detect the few points that you care about (i.e. high recall of the smaller classes). For example, if the task is to forward support emails to the right department, you want high accuracy. It is somewhat acceptable to misclassify the kind of email you get once a year, because you only upset one person. If your task is to detect posts by sexual predators on a children's forum, you definitely do not want to miss any of them, even if the price is that a few posts will get incorrectly flagged. Bottom line: you should optimise for your application.
Are you micro or macro averaging recall? In the former case, more weight will be given to the frequent classes (which is similar to optimising for accuracy), and in the latter all classes will have the same weight.
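Regarding the second question: cross_val_score itself only returns scores, but a common workaround is to collect out-of-fold predictions with cross_val_predict and build the confusion matrix (and micro/macro-averaged scores) from those. A minimal sketch on synthetic, imbalanced data (the classifier and parameter values are placeholders):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           weights=[0.8, 0.15, 0.05], random_state=0)

# Out-of-fold predictions from stratified cross-validation
y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                           cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print(confusion_matrix(y, y_pred))
print(classification_report(y, y_pred))

# Micro averaging favours the frequent classes; macro weights all classes equally
print(f1_score(y, y_pred, average="micro"), f1_score(y, y_pred, average="macro"))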
