R-squared and RMSE using sklearn cross-validation - Python

I have a problem calculating the R^2 and RMSE values using sklearn. Below you can see my code block. A similar block already works without cross-validation (CV): it calculates both metrics for the training and the validation set and saves them in a DataFrame.
Now my goal is to calculate R^2 and RMSE with CV for each model under consideration. My code somehow skips this calculation per model. And of course there should be more than one value per model, from which the mean is then taken.
I hope someone can help me. Thanks in advance.
Regards
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import cross_val_score, RepeatedKFold

for modelname, model in models.items():
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_val_pred = model.predict(X_val)

    # Metrics on the single train/validation split (no CV)
    r2_train = r2_score(y_train, y_train_pred)
    r2_val = r2_score(y_val, y_val_pred)
    rmse_train = mean_squared_error(y_train, y_train_pred, squared=False)
    rmse_val = mean_squared_error(y_val, y_val_pred, squared=False)

    # Cross-validated scores (default scoring is R^2); note the result is
    # currently never written into model_results5
    kfoldscores = cross_val_score(model, X_train, y_train,
                                  cv=RepeatedKFold(n_splits=5, n_repeats=25, random_state=0))

    model_results5.loc[modelname, ["R2_train", "R2_val", "RMSE_train", "RMSE_val"]] = [r2_train, r2_val, rmse_train, rmse_val]
    save_models5[modelname] = model
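One way to get cross-validated R^2 and RMSE for every model is cross_validate with multiple scoring strings. The following is a minimal sketch, assuming the same models dict, X_train/y_train, and model_results5 DataFrame as above; the column names R2_CV and RMSE_CV are made up for illustration.

from sklearn.model_selection import cross_validate, RepeatedKFold

cv = RepeatedKFold(n_splits=5, n_repeats=25, random_state=0)

for modelname, model in models.items():
    # One CV run per model with both metrics; sklearn reports RMSE as a
    # negative number ("higher is better"), so flip the sign before averaging.
    scores = cross_validate(model, X_train, y_train, cv=cv,
                            scoring=["r2", "neg_root_mean_squared_error"])
    r2_cv = scores["test_r2"].mean()
    rmse_cv = -scores["test_neg_root_mean_squared_error"].mean()
    model_results5.loc[modelname, ["R2_CV", "RMSE_CV"]] = [r2_cv, rmse_cv]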

Related

How do I calculate the MSE during walk-forward optimization?

I am trying to predict the variable "ec" using features with a time lag of 1 period with different models. To see which model (I am comparing OLS, Ridge, Lasso and ARIMAX) fits the data best, I use a walk-forward approach (expanding window) and want to calculate the mean squared error for each model. (I am providing the code for my OLS model as an example.) Although my code seems to work, I am not sure whether my MSE calculation is correct: as shown below, I save the MSE of each loop (each combination of training and test set) in a list (ols_mse_list) and then calculate the "overall" MSE as the average of that list. Is that the correct way? I am slightly confused, as I couldn't find proper instructions on how to calculate the MSE during the optimization process.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Separate the predictors and label
X_bss = data_bss[data_bss.columns[~data_bss.columns.isin(
    ["gdp_lag1", "pop_lag1", "exp_lag1", "sp_500_lag1", "temp_lag1"])]]
y = data_bss["ec"]

tscv = expanding_window(initial=350, horizon=12, period=1)
for train_index, test_index in tscv.split(X_bss):
    print("Train:", train_index)
    print("Test :", test_index)

ols = LinearRegression()
ols_mse_list = []
ols_mean_mse = []

# Loop through the splits. Run a linear regression for each split.
for i, (train_index, test_index) in enumerate(tscv.split(data_bss)):
    X_train = data_bss[["gdp_lag1", "pop_lag1", "exp_lag1", "sp_500_lag1", "temp_lag1"]].iloc[train_index]
    y_train = data_bss[["ec"]].iloc[train_index]
    X_test = data_bss[["gdp_lag1", "pop_lag1", "exp_lag1", "sp_500_lag1", "temp_lag1"]].iloc[test_index]
    y_test = data_bss[["ec"]].iloc[test_index]

    ols.fit(X_train, y_train)
    ols_mse = mean_squared_error(y_test, ols.predict(X_test))
    ols_mse_list.append(ols_mse)

# Average the per-split MSEs after the loop
ols_mean_mse.append(np.mean(ols_mse_list))
print("OLS MSE:", ols_mean_mse)

Metric for K-fold Cross Validation for Regression models

I wanted to do cross-validation on a regression (non-classification) model and ended up getting mean scores of about 0.90. However, I don't know what metric the method uses to produce these scores. I know how the splitting in k-fold cross-validation works; I just don't know which formula the scikit-learn library uses to score the predictions. (I know how it works for classification models, though.) Can someone tell me the metric/formula used by sklearn.model_selection.cross_val_score?
Thanks in advance.
from sklearn.model_selection import cross_val_score

def metrics_of_accuracy(classifier, X_train, y_train):
    accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)
    accuracies.mean()
    accuracies.std()
    return accuracies
By default, sklearn uses accuracy for classification and R^2 for regression when you call a model's .score method (the same holds for cross_val_score). So the metric in this case is R^2, whose formula is

r2 = 1 - SSE(y_hat) / SSE(y_mean)

where
SSE(y_hat) is the sum of squared errors of the predictions, and
SSE(y_mean) is the sum of squared errors when every prediction is simply the mean of the actual values.
You can also compute the same metric directly with sklearn.metrics.r2_score(y_true, y_pred). This score is also called the coefficient of determination, or R-squared.
The formula is shown in the image linked below:
https://i.stack.imgur.com/USaWH.png
For more on this:
https://en.wikipedia.org/wiki/Coefficient_of_determination
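To see that R^2 really is the default for regressors, one can reproduce cross_val_score by hand on the same folds. A small sketch with synthetic data (not from the question):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Default scoring for a regressor: estimator.score, i.e. R^2
default_scores = cross_val_score(LinearRegression(), X, y, cv=kf)

# Computing R^2 by hand on the same folds gives identical numbers
manual_scores = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))

print(np.allclose(default_scores, manual_scores))  # True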

What is the difference between these two ways of specifying training/testing data for sklearn GPR

This is somewhat of a follow-up to my previous question about evaluating my scikit-learn Gaussian process regressor. I am very new to GPRs and I think I may be making a methodological mistake in how I use training vs. testing data.
Essentially, I'm wondering what the difference is between splitting the input into training and test data like this:
X = np.atleast_2d(some_data).T
Y = np.atleast_2d(other_data).T

X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    test_size=0.33,
                                                    random_state=0)

kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(
    alpha=1e-10,
    copy_X_train=True,
    kernel=kernel,
    n_restarts_optimizer=10,
    normalize_y=False,
    random_state=None)
gp.fit(X_train, y_train)
score = gp.score(X_test, y_test)
print(score)

x_pred = np.atleast_2d(np.linspace(0, 10, 1000)).T
y_pred, sigma = gp.predict(x_pred, return_std=True)
vs. using the full data set for training, like this:
X = np.atleast_2d(some_data).T
Y = np.atleast_2d(other_data).T

kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)
gp = gaussian_process.GaussianProcessRegressor(
    alpha=1e-10,
    copy_X_train=True,
    kernel=kernel,
    n_restarts_optimizer=10,
    normalize_y=False,
    random_state=None)
gp.fit(X, Y)
score = gp.score(X, Y)
print(score)

x_pred = np.atleast_2d(np.linspace(0, 10, 1000)).T
y_pred, sigma = gp.predict(x_pred, return_std=True)
Is one of these options going to result in incorrect predictions?
You split the training data off from the test data to evaluate your model, because otherwise you have no idea whether you are overfitting. For example, put some data in Excel and plot it with a smooth line: technically that spline is a perfect model of the data, but it is useless for predicting new values.
In your example, the predictions are made over a uniform grid so you can visualize what your model thinks the underlying function looks like, but that tells you nothing about how well the model generalizes. You can sometimes get very high accuracy (> 95%) on the training data and worse-than-chance results on the test data, which means the model is overfitting.
In addition to predicting over a uniform grid to visualize the model, you should also predict on the test set and compare accuracy metrics for both the test and the training data.
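A minimal sketch of that comparison, reusing X_train/X_test, y_train/y_test and the gp fitted on the training split from the first snippet:

from sklearn.metrics import mean_squared_error, r2_score

# Evaluate on data the model has seen (train) and on held-out data (test).
train_pred = gp.predict(X_train)
test_pred = gp.predict(X_test)

print("R^2  train:", r2_score(y_train, train_pred), " test:", r2_score(y_test, test_pred))
print("RMSE train:", mean_squared_error(y_train, train_pred, squared=False),
      " test:", mean_squared_error(y_test, test_pred, squared=False))

# A large gap between the train and test metrics is the overfitting the answer describes.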

Scoring in GridSearchCV for XGBoost

I'm currently trying to analyze data for the first time using XGBoost. I want to find the best parameters using GridSearchCV, and since I want to minimize the root mean squared error, I used "rmse" as eval_metric. However, the scoring parameter in grid search does not have such a metric. I found on this site that "neg_mean_squared_error" does the same, but it gives me different results than the RMSE: when I take the square root of the absolute value of the "neg_mean_squared_error", I get a value of around 8.9, while a different function gives me an RMSE of about 4.4.
I don't know what is going wrong, or how to get these two functions to agree / give the same values.
Because of this problem, I get wrong values in "best_params_", which give me a higher RMSE than some of the values I started the tuning with.
Can anyone explain how to score on the RMSE in the grid search, or why my code gives different values?
Thanks in advance.
def modelfit(alg, trainx, trainy, useTrainCV=True, cv_folds=10, early_stopping_rounds=50):
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(trainx, label=trainy)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'],
                          nfold=cv_folds, metrics='rmse', early_stopping_rounds=early_stopping_rounds)
        alg.set_params(n_estimators=cvresult.shape[0])

    # Fit the algorithm on the data
    alg.fit(trainx, trainy, eval_metric='rmse')

    # Predict training set:
    dtrain_predictions = alg.predict(trainx)
    # dtrain_predprob = alg.predict_proba(trainy)[:, 1]
    print(dtrain_predictions)
    print(np.sqrt(mean_squared_error(trainy, dtrain_predictions)))

    # Print model report:
    print("\nModel Report")
    print("RMSE : %.4g" % np.sqrt(metrics.mean_squared_error(trainy, dtrain_predictions)))


param_test2 = {
    'max_depth': [6, 7, 8],
    'min_child_weight': [2, 3, 4]
}
grid2 = GridSearchCV(estimator=xgb.XGBRegressor(learning_rate=0.1, n_estimators=2000, max_depth=5,
                                                min_child_weight=2, gamma=0, subsample=0.8,
                                                colsample_bytree=0.8, objective='reg:linear',
                                                nthread=4, scale_pos_weight=1, random_state=4),
                     param_grid=param_test2, scoring='neg_mean_squared_error',
                     n_jobs=4, iid=False, cv=10, verbose=20)
grid2.fit(X_train, y_train)

# Mean cross-validated score of the best_estimator
print(grid2.best_params_, np.sqrt(np.abs(grid2.best_score_))), print(np.sqrt(np.abs(grid2.score(X_train, y_train))))
modelfit(grid2.best_estimator_, X_train, y_train)
print(np.sqrt(np.abs(grid2.score(X_train, y_train))))
In GridSearchCV the scoring parameter is transformed so that higher values are always better than lower values. In your example, neg_mean_squared_error is just a negated version of the MSE (not the RMSE). You should not interpret neg_mean_squared_error as an RMSE; rather, in your cross-validation you should compare values of neg_mean_squared_error, where a higher value is better than a lower one.
This behavior is mentioned in the scoring parameter section of the model evaluation documentation:
Scikit-Learn Scoring Parameter Documentation
It's because XGBRegressor.score returns the coefficient of determination (R^2) of the prediction, not the RMSE.
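If you want the grid search to report RMSE directly, recent sklearn versions (0.22+) ship a 'neg_root_mean_squared_error' scorer, and an equivalent scorer can be built with make_scorer. A minimal sketch, using a reduced XGBRegressor rather than the full parameter set from the question, with X_train/y_train as in the question:

import xgboost as xgb
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import GridSearchCV

# Equivalent hand-built scorer: negated RMSE, so "higher is better" for GridSearchCV.
rmse_scorer = make_scorer(mean_squared_error, greater_is_better=False, squared=False)

grid = GridSearchCV(estimator=xgb.XGBRegressor(objective='reg:squarederror', random_state=4),
                    param_grid={'max_depth': [6, 7, 8], 'min_child_weight': [2, 3, 4]},
                    scoring='neg_root_mean_squared_error',  # or scoring=rmse_scorer
                    cv=10, n_jobs=4)
grid.fit(X_train, y_train)

# best_score_ is the (negative) mean RMSE over the held-out CV folds,
# so -best_score_ is directly comparable with other RMSE values.
print(grid.best_params_, -grid.best_score_)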

Relationship between sklearn .fit() and .score()

While working with a linear regression model I split the data into a training set and test set. I then calculated R^2, RMSE, and MAE using the following:
lm.fit(X_train, y_train)
R2 = lm.score(X,y)
y_pred = lm.predict(X_test)
RMSE = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
MAE = metrics.mean_absolute_error(y_test, y_pred)
I thought I was calculating R^2 for the entire data set (rather than comparing the training data with the original data). However, I learned that you must fit the model before you score it, so I'm not sure whether I'm scoring the original data (as passed in for R2) or the data I used to fit the model (X_train and y_train). When I run:
lm.fit(X_train, y_train)
lm.score(X_train, y_train)
I get a different result from scoring X and y. So my question is: are the inputs to .score evaluated against the model that was fitted? In other words, does lm.fit(X, y); lm.score(X, y) give the R^2 for the original data, and lm.fit(X_train, y_train); lm.score(X, y) give the R^2 for the original data as predicted by the model created in .fit? Or is something else entirely happening?
fit() only fits (i.e. trains) the model on the data you pass it; fitting the data is synonymous with training on that data.
score() is more like testing: it predicts on the data you pass and compares those predictions with the true values.
So you should use different datasets for training the model and for testing its accuracy.
For example:
from sklearn import neighbors
from sklearn.model_selection import train_test_split  # the old cross_validation module has been removed

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # accuracy for a classifier, R^2 for a regressor
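To answer the question directly: score() always evaluates the already-fitted model on whatever data you pass it, so the result depends on both the data used in fit() and the data used in score(). A small sketch with synthetic data (not from the question) illustrating the difference:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

lm = LinearRegression().fit(X_train, y_train)   # model parameters come from the training data only
print("R^2 on training data:", lm.score(X_train, y_train))
print("R^2 on test data:    ", lm.score(X_test, y_test))
print("R^2 on all data:     ", lm.score(X, y))  # same fitted model, different evaluation set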
