How can I use early stopping with XGBRFRegressor? - python

I've tried fitting a random forest like so:
from xgboost import XGBRFRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
X, y = make_regression(random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)
forest = XGBRFRegressor(num_parallel_tree=10, num_boost_round=1000, verbose=3)
forest.fit(
    X_train,
    y_train,
    eval_set=[(X_test, y_test)],
    early_stopping_rounds=10,
    verbose=True,
)
However, early stopping never seems to kick in, and as far as I can tell the model fits the full 10,000 trees requested. The evaluation metric is only printed once, rather than after every boosting round as I would have expected.
What's the right way to set up this type of model (working within the scikit-learn API) so that early stopping takes effect as I would expect?
I have requested clarification from the developers here:
https://discuss.xgboost.ai/t/how-is-xgbrfregressor-intended-to-work-with-early-stopping/2391

The docs say:
[XGBRFRegressor has] default values and meaning of some of the parameters adjusted accordingly. In particular:
n_estimators specifies the size of the forest to be trained; it is converted to num_parallel_tree, instead of the number of boosting rounds
learning_rate is set to 1 by default
colsample_bynode and subsample are set to 0.8 by default
booster is always gbtree
And you can see that in action in the code: num_parallel_tree gets overridden with the input n_estimators, and num_boost_round gets overridden to 1.
It's probably worth reading the paragraphs preceding the documentation link in order to understand how xgboost treats random forests.
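If the goal is to combine a forest with early stopping, one workaround (a sketch, not an officially documented recipe) is to use plain XGBRegressor and set num_parallel_tree yourself, so that each boosting round grows a small bag of trees and early stopping can act across rounds. Note that depending on your xgboost version, early_stopping_rounds may need to be passed to the constructor rather than to fit():

# Sketch: a "boosted random forest" with early stopping via XGBRegressor.
# num_parallel_tree sets how many trees are grown per boosting round;
# in recent xgboost versions early_stopping_rounds belongs in the constructor
# instead of fit().
from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

model = XGBRegressor(
    n_estimators=1000,       # maximum number of boosting rounds
    num_parallel_tree=10,    # trees per round (the "forest" part)
    learning_rate=0.1,
    subsample=0.8,
    colsample_bynode=0.8,
)
model.fit(
    X_train,
    y_train,
    eval_set=[(X_test, y_test)],
    early_stopping_rounds=10,
    verbose=True,
)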

Related

How come you can get a permutation feature importance greater than 1?

Take this simple code:
import lightgbm as lgb
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# X, y are the features and target of my dataset
X_train, X_test, y_train, y_test = train_test_split(X, y)
lgbr = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.15)
model_lgb = lgbr.fit(X_train, y_train)
r = permutation_importance(model_lgb, X_test, y_test, n_repeats=30, random_state=0)
for i in r.importances_mean.argsort()[::-1]:
    print(f"{i} {r.importances_mean[i]:.3f} +/- {r.importances_std[i]:.3f}")
When I run this on my dataset the top value is about 1.20.
But I thought that the permutation_importance mean for a feature was the amount that the score was changed on average by permuting the feature column so this can't be more than 1 can it?
What am I missing?
(I get the same issue if I replace lightgbm by xgboost so I don't think it is a feature of the particular regression method.)
But I thought that the permutation_importance mean for a feature was the amount that the score was changed on average by permuting the feature column[...]
Correct.
so this can't be more than 1 can it?
That depends on whether the score can "worsen" by more than 1. The default for the scoring parameter of permutation_importance is None, which uses the model's score function. For LGBMRegressor (and most regressors), that's the R2 score, which has a maximum of 1 but can take arbitrarily large negative values, so indeed the score can worsen by an arbitrarily large amount.
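A tiny sketch with r2_score makes the point: the score is capped at 1 from above but unbounded below, so the drop caused by permuting an informative feature can easily exceed 1.

from sklearn.metrics import r2_score

y_true = [0, 1, 2]
print(r2_score(y_true, [0, 1, 2]))     # 1.0: perfect predictions
print(r2_score(y_true, [10, 10, 10]))  # -121.5: far worse than predicting the mean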

What are possible causes of and solutions to n_jobs=-1 ineffectiveness in RandomizedSearchCV and KNeighborsClassifier in scikit learn?

I am exploring the hyperparameters of a KNN model to fit the MNIST dataset using RandomizedSearchCV in sklearn.
To do this exploration I kept only 10,000 samples, otherwise it would take forever.
My code is something similar to this:
from sklearn.datasets import fetch_openml
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

mnist = fetch_openml("mnist_784")
X, y = mnist['data'], mnist['target']
X_train_pre, X_test, y_train_pre, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train_pre[shuffle_index], y_train_pre[shuffle_index]
del y_train_pre, X_train_pre
X_short, y_short = X_train[:10000], y_train[:10000]

param_grid2 = {"weights": ["uniform", "distance"],
               "n_neighbors": [3, 4, 5, 6, 7, 8]}  # 5, 10, 15, 20, 25, 30, 40, 50
knn_clf_rs2 = KNeighborsClassifier(n_jobs=-1)
random_search2 = RandomizedSearchCV(knn_clf_rs2, param_grid2, cv=5,
                                    scoring="neg_mean_squared_error",
                                    verbose=5, n_jobs=-1)
random_search2.fit(X_short, y_short)
To speed things up, I set n_jobs=-1 both in the classifier constructor and in the random_search2 constructor, although for the moment I only need the one for the search; the one set in the classifier constructor is unused. However, looking around I saw that, according to many people, n_jobs=-1 is ineffective here and just one processor is actually used. Even worse, it supposedly slows things down due to the increased resource usage of multiprocessing. So it would actually be better to do (in my case too):
random_search2 = RandomizedSearchCV(knn_clf_rs2, param_grid2, cv=5,
                                    scoring="neg_mean_squared_error",
                                    verbose=5)
First of all, is n_jobs=-1 not working because of the classifier type (KNN) or because of RandomizedSearchCV or because of the combination of the two?
Furthermore, is there any solution to this, such as a way to force multiprocessing?
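One simple sanity check before concluding that n_jobs=-1 is ineffective is to time the same search with and without it. A minimal sketch on synthetic stand-in data (make_classification here replaces the MNIST subset):

import time

from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=5000, n_features=50, random_state=0)
param_grid = {"weights": ["uniform", "distance"], "n_neighbors": [3, 4, 5, 6, 7, 8]}

for search_jobs in (1, -1):
    search = RandomizedSearchCV(
        KNeighborsClassifier(),  # leave the estimator's n_jobs at its default
        param_grid,
        n_iter=5,
        cv=3,
        n_jobs=search_jobs,
        random_state=0,
    )
    start = time.perf_counter()
    search.fit(X_demo, y_demo)
    print(f"search n_jobs={search_jobs}: {time.perf_counter() - start:.1f}s")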

Random forest with unbalanced class (positive is minority class), low precision and weird score distributions

I have a very unbalanced dataset (5000 positive, 300000 negative). I am using sklearn RandomForestClassifier to try and predict the probability of the positive class. I have data for multiple years and one of the features I've engineered is the class in the previous year, so I am withholding the last year of the dataset to test on in addition to my test set from within the years I'm training on.
Here is what I've tried (and the result):
Upsampling with SMOTE and SMOTEENN (weird score distributions, see first pic; predicted probabilities for the positive and negative classes are both the same, i.e., the model predicts a very low probability for most of the positive class)
Downsampling to a balanced dataset (recall is ~0.80 for the test set, but 0.07 for the out-of-year test set, due to the sheer number of negatives in the unbalanced out-of-year test set, see second pic)
Leaving it unbalanced (weird score distribution again; precision goes up to ~0.60 and recall falls to 0.05 and 0.10 for the test and out-of-year test sets)
XGBoost (slightly better recall on the out-of-year test set, 0.11)
What should I try next? I'd like to optimize for F1, as both false positives and false negatives are equally bad in my case. I would also like to incorporate k-fold cross-validation, and I have read that I should do this before upsampling: a) should I do this / is it likely to help, and b) how can I incorporate it into a pipeline similar to this:
from imblearn.pipeline import make_pipeline, Pipeline
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
smote_enn = SMOTEENN(smote = sm)
kf = StratifiedKFold(n_splits=5)
pipeline = make_pipeline(??)
pipeline.fit(X_train, ytrain)
ypred = pipeline.predict(Xtest)
ypredooy = pipeline.predict(Xtestooy)
Upsampling with SMOTE and SMOTEENN: I am far from being an expert with these, but by upsampling your dataset you might amplify existing noise, which induces overfitting. This could explain why your algorithm cannot classify correctly, giving the results in the first graph.
I found a little bit more info here and maybe how to improve your results:
https://sci2s.ugr.es/sites/default/files/ficherosPublicaciones/1773_ver14_ASOC_SMOTE_FRPS.pdf
When you downsample, you seem to encounter the same overfitting problem, as I understand it (at least for the target result of the previous year). It is hard to deduce the reason behind it without a look at the data, though.
Your overfitting problem might also come from the number of features you use, which could add unnecessary noise. You could try reducing the number of features and gradually increasing it (using an RFE model). More info here:
https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/
As for the models, you mention Random Forest and XGBoost, but you did not mention having tried simpler models. You could try simpler models and focus on your data engineering.
If you have not tried it yet, maybe you could:
Downsample your data
Normalize all your data with a StandardScaler
Test "brute force" tuning of simple models such as Naive Bayes and Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Define steps of the pipeline
steps = [('scaler', StandardScaler()),
         ('log_reg', LogisticRegression(solver='liblinear'))]  # liblinear supports both l1 and l2
pipeline = Pipeline(steps)

# Specify the hyperparameters (note the step prefix, since the estimator is a pipeline)
parameters = {'log_reg__C': [1, 10, 100],
              'log_reg__penalty': ['l1', 'l2']}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                     random_state=42)

# Instantiate a GridSearchCV object: cv
cv = GridSearchCV(pipeline, param_grid=parameters)

# Fit to the training set
cv.fit(X_train, y_train)
Anyway, for your example the pipeline could look like this (I built it with Logistic Regression, but you can swap in another ML algorithm and change the parameter grid accordingly):
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

param_grid = {'C': [1, 10, 100]}
clf = LogisticRegression(solver='lbfgs', multi_class='auto')
sme = SMOTEENN(smote=SMOTE(k_neighbors=2), random_state=42)
grid = GridSearchCV(estimator=clf, param_grid=param_grid, scoring="f1")
pipeline = Pipeline([('scale', StandardScaler()),
                     ('SMOTEENN', sme),
                     ('grid', grid)])
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)  # shuffle so random_state has an effect
score = cross_val_score(pipeline, X, y, cv=cv)
I hope this may help you.
(edit: I added scoring = "f1" in the GridSearchCV)
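For completeness, one plausible way to fill in the make_pipeline(??) placeholder from the question, so that SMOTEENN is applied only to the training folds during cross-validation (a sketch on synthetic stand-in data, not the asker's real dataset):

from imblearn.combine import SMOTEENN
from imblearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the asker's unbalanced data (~5% positives).
X_demo, y_demo = make_classification(n_samples=5000, weights=[0.95], random_state=1)

clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
smote_enn = SMOTEENN(random_state=1)
kf = StratifiedKFold(n_splits=5)

# The imblearn pipeline resamples each training fold only; test folds stay untouched.
pipeline = make_pipeline(smote_enn, clf_rf)
print(cross_val_score(pipeline, X_demo, y_demo, cv=kf, scoring='f1'))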

scikit-learn - train_test_split and ShuffleSplit yielding very different results

I'm trying to run a simple RandomForestClassifier() on a large-ish dataset. I typically first do the cross-validation using train_test_split, and then start using cross_val_score.
In this case though, I get very different results from these two approaches, and I can't figure out why. My understanding is that these two snippets should do exactly the same thing:
cfc = RandomForestClassifier(n_estimators=50)
scores = cross_val_score(cfc, X, y,
                         cv=ShuffleSplit(len(X), 1, 0.25),
                         scoring='roc_auc')
print(scores)
>>> [ 0.88482262]
and this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
cfc = RandomForestClassifier(n_estimators=50)
cfc.fit(X_train, y_train)
roc_auc_score(y_test, cfc.predict(X_test))
>>> 0.57733474562203269
And yet the scores are widely different. (These scores are representative; I observed the same behavior across many, many runs.)
Any ideas why this could be? I am tempted to trust the cross_val_score result, but I want to be sure that I am not messing up somewhere.
Update:
I noticed that when I reverse the order of the arguments to roc_auc_score, I get a similar result:
roc_auc_score(cfc.predict(X_test), y_test)
But the documentation explicitly states that the first argument should be the true values, and the second the predicted scores.
I'm not sure what's the issue but here are two things you could try:
ROC AUC needs prediction probabilities for proper evaluation, not hard scores (i.e. 0 or 1). Therefore change the cross_val_score to work with probabilities. You can check the first answer on this link for more details.
Compare this with roc_auc_score(y_test, cfc.predict_proba(X_test)[:,1])
As xysmas said, try setting a random_state (on the CV splitter, train_test_split, and the classifier) so that both approaches are reproducible.
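To illustrate the first two suggestions, a minimal sketch on synthetic data (modern scikit-learn imports, not the asker's exact setup): AUC computed from hard 0/1 predictions is typically lower than AUC computed from predicted probabilities, and the 'roc_auc' scorer used by cross_val_score works with a continuous score (predict_proba or decision_function) internally.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
print(roc_auc_score(y_test, clf.predict(X_test)))              # AUC from hard 0/1 labels
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))  # AUC from probabilities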

Unexpected cross-validation scores with scikit-learn LinearRegression

I am trying to learn to use scikit-learn for some basic statistical learning tasks. I thought I had successfully created a LinearRegression model fit to my data:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y, test_size=0.2, random_state=0)
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
print model.score(X_test, y_test)
Which yields:
0.797144744766
Then I wanted to do multiple similar 4:1 splits via automatic cross-validation:
model = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(model, X, y, cv=5)
print scores
And I get output like this:
[ 0.04614495 -0.26160081 -3.11299397 -0.7326256 -1.04164369]
How can the cross-validation scores be so different from the score of the single random split? They are both supposed to be using r2 scoring, and the results are the same if I pass the scoring='r2' parameter to cross_val_score.
I've tried a number of different options for the random_state parameter to cross_validation.train_test_split, and they all give similar scores in the 0.7 to 0.9 range.
I am using sklearn version 0.16.1
It turns out that my data was ordered in blocks of different classes, and by default cross_validation.cross_val_score picks consecutive splits rather than random (shuffled) splits. I was able to solve this by specifying that the cross-validation should use shuffled splits:
model = linear_model.LinearRegression()
shuffle = cross_validation.KFold(len(X), n_folds=5, shuffle=True, random_state=0)
scores = cross_validation.cross_val_score(model, X, y, cv=shuffle)
print scores
Which gives:
[ 0.79714474 0.86636341 0.79665689 0.8036737 0.6874571 ]
This is in line with what I would expect.
train_test_split seems to generate random splits of the dataset, while cross_val_score uses consecutive sets, i.e.
"When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by default"
http://scikit-learn.org/stable/modules/cross_validation.html
Depending on the nature of your data set, e.g. data highly correlated over the length of one segment, consecutive sets will give vastly different fits than e.g. random samples from the whole data set.
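To see what "consecutive sets" means in practice, here is a tiny sketch with the current KFold API (shuffle defaults to False):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(-1, 1)
for train_idx, test_idx in KFold(n_splits=3).split(X):
    print("test fold:", test_idx)  # prints three consecutive blocks of indices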
Folks, thanks for this thread.
The code in the answer above (Schneider) is outdated.
As of scikit-learn==0.19.1, this will work as expected.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

regressor = LinearRegression()  # or any other estimator
kf = KFold(n_splits=3, shuffle=True, random_state=0)
cv_scores = cross_val_score(regressor, X, y, cv=kf)
Best,
M.
