I am using xgboost for the first time and trying the two different interfaces. First I get the data:
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
y = raw_df.values[1::2, 2]
dmatrix = xgb.DMatrix(data=X, label=y)
Now the scikit-learn interface:
xgbr = xgb.XGBRegressor(objective='reg:squarederror', seed=20)
print(cross_val_score(xgbr, X, y, cv=5))
This outputs:
[0.73438184 0.84902986 0.82579692 0.52374618 0.29743001]
Now the xgboost native interface:
dmatrix = xgb.DMatrix(data=X, label=y)
params={'objective':'reg:squarederror'}
cv_results = xgb.cv(dtrain=dmatrix, params=params, nfold=5, metrics={'rmse'}, seed=20)
print('RMSE: %.2f' % cv_results['test-rmse-mean'].min())
This gives 3.50.
Why are the outputs so different? What am I doing wrong?
First of all, you didn't specify the metric in cross_val_score, so you are not calculating RMSE but rather the estimator's default score, which for sklearn regressors is R². You need to specify the metric for comparable results (note that sklearn returns the RMSE negated, since its convention is that higher scores are better):
cross_val_score(xgbr, X, y, cv=5, scoring='neg_root_mean_squared_error')
Second, you need to match sklearn's CV procedure exactly. For that, you can pass the folds argument to XGBoost's cv method:
from sklearn.model_selection import KFold
cv_results = xgb.cv(dtrain=dmatrix, params=params, metrics={'rmse'}, folds=KFold(n_splits=5))
Finally, you need to ensure that XGBoost's cv procedure actually converges. For some reason it only does 10 boosting rounds by default, which is too low to converge on your dataset. This is controlled by the num_boost_round argument; I found that 100 rounds work just fine on this dataset:
cv_results = xgb.cv(dtrain=dmatrix, params=params, metrics={'rmse'}, folds=KFold(n_splits=5), num_boost_round=100)
Now you will get matching results.
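For reference, a minimal end-to-end sketch of the matched comparison (assuming the X, y, dmatrix, xgbr, and params defined above, an xgboost version whose cv accepts the folds argument, and an XGBRegressor that defaults to 100 trees, as recent versions do, so it lines up with the 100 boosting rounds):
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=5)
# sklearn interface: scores come back negated, so flip the sign
sk_rmse = -cross_val_score(xgbr, X, y, cv=kf, scoring='neg_root_mean_squared_error')
# native interface: same folds, enough boosting rounds to converge
cv_results = xgb.cv(dtrain=dmatrix, params=params, folds=kf, metrics={'rmse'}, num_boost_round=100)
print('sklearn RMSE: %.2f' % sk_rmse.mean())
print('native RMSE: %.2f' % cv_results['test-rmse-mean'].min())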
On a side note, it's interesting how you say it's your first time using XGBoost, but you actually have a question on XGBoost dating back to 2017.
Related
I noticed that there are two possible implementations of XGBoost in Python, as discussed here and here.
When I tried running the same dataset through the two possible implementations I noticed that the results were different.
Code
import xgboost
from xgboost.sklearn import XGBRegressor
import pandas as pd
import numpy as np
from sklearn import datasets
boston_data = datasets.load_boston()
df = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)
df['target'] = pd.Series(boston_data.target)
Y = df["target"]
X = df.drop('target', axis=1)
#### Code using Native Impl for XGBoost
dtrain = xgboost.DMatrix(X, label=Y, missing=0.0)
params = {'max_depth': 3, 'learning_rate': .05, 'min_child_weight' : 4, 'subsample' : 0.8}
evallist = [(dtrain, 'eval'), (dtrain, 'train')]
model = xgboost.train(dtrain=dtrain, params=params, num_boost_round=200)
predictions = model.predict(dtrain)
#### Code using Sklearn Wrapper for XGBoost
model = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.05, min_child_weight=4, subsample=0.8)
#model = model.fit(X, Y, eval_set = [(X, Y), (X, Y)], eval_metric = 'rmse', verbose=True)
model = model.fit(X, Y)
predictions2 = model.predict(X)
print(np.absolute(predictions-predictions2).sum())
Absolute difference sum using sklearn boston dataset
62.687134
When I ran the same for other datasets, like the sklearn diabetes dataset, I observed that the difference was much smaller.
Absolute difference sum using sklearn diabetes dataset
0.0011711121
Make sure the random seeds are the same. For both approaches, set the same seed:
param['seed'] = 123
EDIT: then there are a couple of other things to check.
First, is n_estimators also 200? Are you also imputing missing values with 0 in the second implementation? Are the other default values the same? (For this one I think yes, because it's a wrapper, but check the other two things.)
I hadn't set the "missing" parameter for the sklearn implementation. Once that was set, the values matched.
Also, as Noah pointed out, the sklearn wrapper has a few different default values which need to be matched in order to reproduce the results exactly.
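For illustration, a sketch of what the matched calls might look like (the seed value of 123 mirrors the earlier answer, and newer xgboost versions spell it random_state; everything else comes from the question's code):
# native side: add a fixed seed to the params
params = {'max_depth': 3, 'learning_rate': .05, 'min_child_weight': 4,
          'subsample': 0.8, 'seed': 123}
model_native = xgboost.train(dtrain=dtrain, params=params, num_boost_round=200)

# sklearn side: match the missing value and the seed explicitly
model_sk = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=.05,
                        min_child_weight=4, subsample=0.8,
                        missing=0.0, seed=123)
model_sk.fit(X, Y)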
When I run this leave-one-out cross-validation, it does nothing, not even an error message. I can't figure out what I am missing. I'm using the CSV from Kaggle: https://www.kaggle.com/dileep070/heart-disease-prediction-using-logistic-regression/downloads/heart-disease-prediction-using-logistic-regression.zip/1
import csv
from sklearn.model_selection import LeaveOneOut
from sklearn import svm
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd
from pandas import read_csv
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
#replace missing values with mean
dataset = read_csv("//Users/crystalfortress/Desktop/CompGenetics/Final_Project_Comp/framingham.csv")
dataset.fillna(dataset.mean(), inplace=True)
print(dataset.isnull().sum())
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 15].values
model = svm.SVC(kernel='linear', C=10, gamma=0.1)
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy')
print('Accuracy after cross validation:', scores.mean())
predictions = cross_val_predict(model, X, y, cv=loo)
accuracy = metrics.r2_score(y, predictions)
print('Prediction accuracy:', accuracy)
x = metrics.classification_report(y, predictions)
print(x)
cf = metrics.confusion_matrix(y, predictions)
print(cf)
I tried running this on my machine, and it looks like your code works fine (though I did comment out a lot of unnecessary imports). Leave-one-out will take a really long time to train, since it makes n training sets out of n data points, so you will have to wait for all of those models to train before you get your results. Changing cv= to a number (the default is 3, I believe) will train more quickly and will most likely give less variance in the model. Also, adding n_jobs=-1 to your cross_val_score call will let Python use all of your processors. E.g.: scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy', n_jobs=-1)
You can also set the verbose parameter in cross_val_score to watch the progress (though, heads up, it isn't fast). I believe the highest meaningful value is 3 (the online docs don't mention it), but using a higher value is fine. So the final cross_val_score call would look like:
scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy', n_jobs=-1, verbose=10)
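If leave-one-out remains too slow, here is a sketch of a cheaper alternative (the fold count and random_state are arbitrary choices of mine, not from the question):
from sklearn.model_selection import KFold

# 10-fold CV trains 10 models instead of one per data point
kf = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy', n_jobs=-1)
print('Mean accuracy:', scores.mean())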
Using cross-validation (CV) with sklearn is quite easy and straightforward. But the default implementation when setting cv=5 in a linear CV model, like ElasticNetCV or LassoCV, is KFold CV. For various reasons I'd like to use StratifiedKFold instead. From the documentation, it seems like any CV method can be passed with cv=.
Passing cv=KFold(5) works as expected, but cv=StratifiedKFold(5) raises the Error:
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
I know that I can use cross_val_score after fitting, but I'd like to pass StratifiedKFold as CV directly to the linear model.
My minimum working example is:
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold, StratifiedKFold
import numpy as np
x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.arange(100) + np.random.rand(100)
# KFold default implementation:
model_default = ElasticNetCV(cv=5)
model_default.fit(x, y) # works fine
# KFold given as cv explicitly:
model_kfexp = ElasticNetCV(cv=KFold(5))
model_kfexp.fit(x, y) # also works fine
# StratifiedKFold given as cv explicitly:
model_skf = ElasticNetCV(cv=StratifiedKFold(5))
model_skf.fit(x, y) # THIS RAISES THE ERROR
Any idea how I can set StratifiedKFold as CV directly?
The root of your problem is this line:
y = np.arange(100) + np.random.rand(100)
StratifiedKFold stratifies by class label, so it cannot sample from a continuous distribution; hence your error. Try changing this line and your code will execute happily:
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold, StratifiedKFold
import numpy as np
x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.random.choice([0,1], size=100)
# KFold default implementation:
model_default = ElasticNetCV(cv=5)
model_default.fit(x, y) # works fine
# KFold given as cv explicitly:
model_kfexp = ElasticNetCV(cv=KFold(5))
model_kfexp.fit(x, y) # also works fine
# StratifiedKFold given as cv explicitly:
model_skf = ElasticNetCV(cv=StratifiedKFold(5))
model_skf.fit(x, y) # no ERROR
NOTE
If your target is continuous, use KFold. If your target is categorical, you may use either KFold or StratifiedKFold, whichever suits your needs.
NOTE 2
If you insist on emulating stratified sampling on continuous data, you may wish to apply pandas.cut to your data, then do stratified sampling on the binned values, and finally pass the resulting (train_id, test_id) generator to the cv param:
import pandas as pd

x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.arange(100) + np.random.rand(100)
y_cat = pd.cut(y, 10, labels=range(10))
skf_gen = StratifiedKFold(5).split(x, y_cat)
model_skf = ElasticNetCV(cv=skf_gen)
model_skf.fit(x, y) # no ERROR
I'm trying to make a classifier on a data set. I first used XGBoost:
import xgboost as xgb
import pandas as pd
import numpy as np
train = pd.read_csv("train_users_processed_onehot.csv")
labels = train["Buy"].map({"Y":1, "N":0})
features = train.drop("Buy", axis=1)
data_dmat = xgb.DMatrix(data=features, label=labels)
params={"max_depth":5, "min_child_weight":2, "eta": 0.1, "subsamples":0.9, "colsample_bytree":0.8, "objective" : "binary:logistic", "eval_metric": "logloss"}
rounds = 180
result = xgb.cv(params=params, dtrain=data_dmat, num_boost_round=rounds, early_stopping_rounds=50, as_pandas=True, seed=23333)
print(result)
And the result is:
test-logloss-mean test-logloss-std train-logloss-mean
0 0.683539 0.000141 0.683407
179 0.622302 0.001504 0.606452
We can see it is around 0.622.
But when I switch to sklearn using exactly the same parameters (I think), the result is quite different. Below is my code:
from sklearn.model_selection import cross_val_score
from xgboost.sklearn import XGBClassifier
import pandas as pd
train_dataframe = pd.read_csv("train_users_processed_onehot.csv")
train_labels = train_dataframe["Buy"].map({"Y":1, "N":0})
train_features = train_dataframe.drop("Buy", axis=1)
estimator = XGBClassifier(learning_rate=0.1, n_estimators=190, max_depth=5, min_child_weight=2, objective="binary:logistic", subsample=0.9, colsample_bytree=0.8, seed=23333)
print(cross_val_score(estimator, X=train_features, y=train_labels, scoring="neg_log_loss"))
and the result is: [-4.11429976 -2.08675843 -3.27346662]. Even after negating the sign, it is still far from 0.622.
I set a breakpoint inside cross_val_score and saw that the classifier was making crazy predictions, predicting every tuple in the test set to be negative with about 0.99 probability.
I'm wondering where I have gone wrong. Could someone help me?
This question is a bit old, but I ran into the problem today and figured out why the results given by xgboost.cv and sklearn.model_selection.cross_val_score are quite different.
By default, cross_val_score uses KFold or StratifiedKFold, whose shuffle argument is False, so the folds are not pulled randomly from the data.
So if you do this, then you should get the same results:
from sklearn.model_selection import StratifiedKFold

cross_val_score(estimator, X=train_features, y=train_labels, scoring="neg_log_loss",
                cv=StratifiedKFold(shuffle=True, random_state=23333))
Keep the random_state in StratifiedKFold and the seed in xgboost.cv the same to get exactly reproducible results.
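Putting both pieces together, a sketch of the matched comparison (assumes the estimator, params, train_features/train_labels, and data_dmat from the question are all in scope, that the estimator's n_estimators matches num_boost_round, and that your xgboost version lets cv take an sklearn splitter via folds):
from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=23333)
# sklearn side: negate to turn neg_log_loss back into logloss
sk_logloss = -cross_val_score(estimator, X=train_features, y=train_labels,
                              scoring="neg_log_loss", cv=skf)
# native side: reuse the very same folds and align the round count
cv_results = xgb.cv(params=params, dtrain=data_dmat, num_boost_round=190,
                    folds=skf, seed=23333)
print(sk_logloss.mean(), cv_results["test-logloss-mean"].iloc[-1])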
I want to score different classifiers with different parameters.
For a speedup on LogisticRegression I use LogisticRegressionCV (which is at least 2x faster) and plan to use GridSearchCV for the others.
The problem is that, while it gives me the same C parameter, the AUC ROC scores differ.
I tried fixing many parameters: scorer, random_state, solver, max_iter, tol...
Please look at the example (the real data doesn't matter):
Test data and common part:
from sklearn import datasets
boston = datasets.load_boston()
X = boston.data
y = boston.target
y[y <= y.mean()] = 0; y[y > 0] = 1
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import LogisticRegressionCV
fold = KFold(len(y), n_folds=5, shuffle=True, random_state=777)
GridSearchCV
grid = {
'C': np.power(10.0, np.arange(-10, 10))
, 'solver': ['newton-cg']
}
clf = LogisticRegression(penalty='l2', random_state=777, max_iter=10000, tol=10)
gs = GridSearchCV(clf, grid, scoring='roc_auc', cv=fold)
gs.fit(X, y)
print ('gs.best_score_:', gs.best_score_)
gs.best_score_: 0.939162082194
LogisticRegressionCV
searchCV = LogisticRegressionCV(
Cs=list(np.power(10.0, np.arange(-10, 10)))
,penalty='l2'
,scoring='roc_auc'
,cv=fold
,random_state=777
,max_iter=10000
,fit_intercept=True
,solver='newton-cg'
,tol=10
)
searchCV.fit(X, y)
print ('Max auc_roc:', searchCV.scores_[1].max())
Max auc_roc: 0.970588235294
The newton-cg solver is used just to pin down a fixed value; others were tried too.
What have I forgotten?
P.S. In both cases I also got the warning "/usr/lib64/python3.4/site-packages/sklearn/utils/optimize.py:193: UserWarning: Line Search failed
warnings.warn('Line Search failed')", which I don't understand either. I'd be happy if someone could also explain what it means, but I hope it is not relevant to my main question.
EDIT
Following @joeln's comment, I added the max_iter=10000 and tol=10 parameters too. They do not change the result in any digit, but the warning disappeared.
Here is a copy of the answer by Tom on the scikit-learn issue tracker:
LogisticRegressionCV.scores_ gives the score for all the folds.
GridSearchCV.best_score_ gives the best mean score over all the folds.
To get the same result, you need to change your code:
print('Max auc_roc:', searchCV.scores_[1].max()) # is wrong
print('Max auc_roc:', searchCV.scores_[1].mean(axis=0).max()) # is correct
By also using the default tol=1e-4 instead of your tol=10, I get:
('gs.best_score_:', 0.939162082193857)
('Max auc_roc:', 0.93915947999923843)
The (small) remaining difference might come from warm starting in LogisticRegressionCV (which is actually what makes it faster than GridSearchCV).
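To make the shape logic concrete, a small sketch (the shape comment reflects the sklearn docs for LogisticRegressionCV.scores_, given the 5 folds and 20 Cs set up above):
# scores_ maps each class label to an array of shape (n_folds, n_Cs);
# here scores_[1] has shape (5, 20)
fold_by_c = searchCV.scores_[1]
per_c_mean = fold_by_c.mean(axis=0)   # average over folds, one score per C
print(per_c_mean.max())               # comparable to gs.best_score_
print(fold_by_c.max())                # best single (fold, C) cell, not comparable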