I am sorry if this is a long post, but I have some questions related to the confusion matrix metric and cross-validation that I really need help with.
This picture from the Sklearn CV link shows that the whole dataset should be split into train and test sets. Then the train set is split again into folds: we train our model on k-1 folds and validate on the remaining one (repeating this k times). Finally, we test our model with the test set held out at the beginning.
In my problem, I have a dataset for an unbalanced binary classification problem with 42372 samples. 3615 belong to class 1; the rest are class 0.
Since my dataset is unbalanced, I was using StratifiedShuffleSplit with 5 folds, and got this:
As a result, using an MLPClassifier, I got the following confusion matrix:
As you can see from that matrix, half my dataset is being used for testing (19361 + 19 + 1782 + 28 = 21190).
After this, I changed the CV strategy and tried StratifiedKFold:
And, as the confusion matrix, I got this:
As you can see from this second confusion matrix, my whole dataset is being used for testing (38644 + 113 + 3329 + 286 = 42372).
So, here are my questions:
1 - Do I need to split my whole dataset into train/test (e.g., using train_test_split) and then feed the CV iterators (KFold, StratifiedKFold, StratifiedShuffleSplit, etc.) only the train part? Or should I feed my whole dataset into the iterators and let them do the job of splitting it into train/test and then splitting that train part again into train and validation?
2 - About the CV strategies I tried: why is StratifiedShuffleSplit using half the data, and why does StratifiedKFold use all the data? Is either of these CV setups wrong? Are both wrong, or are both correct? What am I missing here?
EDIT: I found the original code to generate the confusion matrix here. I have just modified it a little bit to fit my needs, and here it goes:
import itertools
import time as time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
# from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

n_splits = 5  # Num of folds
stratshufkfold = StratifiedShuffleSplit(n_splits=n_splits, random_state=0)
# stratshufkfold = StratifiedKFold(n_splits=n_splits)

def generate_confusion_matrix(cnf_matrix, classes, normalize=False, title='Matriz de Confusão'):
    if normalize:
        cnf_matrix = cnf_matrix.astype('float') / cnf_matrix.sum(axis=1)[:, np.newaxis]
        print("Matriz de confusão normalizada")
    else:
        print('Matriz de confusão, sem normalização')

    plt.imshow(cnf_matrix, interpolation='nearest', cmap=plt.get_cmap('Blues'))
    plt.title(title)
    plt.colorbar()

    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cnf_matrix.max() / 2.
    for i, j in itertools.product(range(cnf_matrix.shape[0]), range(cnf_matrix.shape[1])):
        plt.text(j, i, format(cnf_matrix[i, j], fmt), horizontalalignment="center",
                 color="white" if cnf_matrix[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('Real')
    plt.xlabel('Predito')
    return cnf_matrix

def plot_confusion_matrix(predicted_labels_list, y_test_list):
    cnf_matrix = confusion_matrix(y_test_list, predicted_labels_list)
    np.set_printoptions(precision=2)

    # Plot non-normalized confusion matrix
    plt.figure()
    generate_confusion_matrix(cnf_matrix, classes=class_names, title='Matriz de confusão, sem normalização')
    plt.show()

    # Plot normalized confusion matrix
    plt.figure()
    generate_confusion_matrix(cnf_matrix, classes=class_names, normalize=True, title='Matriz de confusão normalizada')
    plt.show()

def evaluate_model_MLP(x, y):
    predicted_targets = np.array([])
    actual_targets = np.array([])
    global t_inicial_MLP
    global t_final_MLP
    t_inicial_MLP = time.time()
    for train_ix, test_ix in stratshufkfold.split(x, y):
        train_x, train_y, test_x, test_y = x[train_ix], y[train_ix], x[test_ix], y[test_ix]
        # Fit
        classifier = MLPClassifier(activation='relu', batch_size=56, solver='sgd').fit(train_x, train_y)
        predicted_labels = classifier.predict(test_x)
        predicted_targets = np.append(predicted_targets, predicted_labels)
        actual_targets = np.append(actual_targets, test_y)
    t_final_MLP = time.time()
    return predicted_targets, actual_targets

predicted_target_MLP, actual_target_MLP = evaluate_model_MLP(x, y)
plot_confusion_matrix(predicted_target_MLP, actual_target_MLP)
acuracia_MLP = accuracy_score(actual_target_MLP, predicted_target_MLP)
As specified in the comments, for the first question the first option is the way to go: split the whole dataset via train_test_split and then call the .split() method of the chosen cross-validator object on the training set only.
For the second point, the issue is hidden behind some default parameters of StratifiedKFold and StratifiedShuffleSplit and in the slightly different meaning of the parameter n_splits.
For StratifiedKFold, the parameter n_splits identifies the number of folds, as per the documentation. Therefore, setting n_splits=5 means that, in each of the 5 iterations, the model is trained on 4 folds (80% of the training set) and tested on the remaining fold (20% of the training set); every sample ends up in the test fold exactly once, which is why all of your data shows up in the aggregated confusion matrix.
For StratifiedShuffleSplit, the parameter n_splits instead specifies the number of reshuffling and splitting iterations. It is the parameter train_size (together with test_size) that defines how big the splits will be (relative to the size of the data passed in). In particular, according to the docs, if neither is specified the default is test_size=0.1 (10% of the data in each test split), with train_size set to the complement (90%). With 5 iterations you therefore aggregate 5 × ~4238 ≈ 21190 test predictions, i.e. roughly half your dataset.
Therefore, specifying test_size within the StratifiedShuffleSplit constructor, e.g., should solve your problem:
stratshufkfold = StratifiedShuffleSplit(n_splits=n_splits, random_state=0, test_size=0.2)
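Putting both points together, here is a minimal sketch of that workflow (assuming x and y are the full feature matrix and label vector from your script; the 0.2 sizes are just examples):
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, StratifiedKFold

# 1) Hold out a final test set first
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.2, stratify=y, random_state=0)

# 2) Cross-validate on the training part only
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
# cv = StratifiedKFold(n_splits=5)  # alternative: each training sample lands in exactly one validation fold

for train_ix, val_ix in cv.split(x_tr, y_tr):
    # fit on x_tr[train_ix], y_tr[train_ix]; validate on x_tr[val_ix], y_tr[val_ix]
    pass

# 3) Only at the very end: retrain on all of (x_tr, y_tr) and evaluate once on (x_te, y_te)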
Related
TL;DR: Probably the same problem as this one, but how can we do it using sklearn? I'm okay if only the mean over the CVs I ran for each lambda or alpha is shown in the plots.
Hi all, if I understand correctly, we need to cross-validate on the training set to select the alpha (as in sklearn) for ridge regression. In particular, I want to perform a 5-fold CV repeated 5 times (so 25 CVs) on the training set.
What I want to do is, for each alpha in alphas:
from numpy import logspace as logs
alphas = logs(-3, 3, 71) # from 10^{-3}, 10^{-2.9}, ..., to 10^3
get the MSEs on the 25 (different?) validation sets, and the MSE on the test set after I finish all the CVs for each training set, then take the average of the 25 MSEs for plotting or reporting.
The issue is that I'm not sure how to do so. Is this the correct code to retrieve the 25 MSEs from the validation sets, which we usually can't observe?
from sklearn.linear_model import RidgeCV, Ridge
from sklearn.model_selection import cross_val_score as CVS
from sklearn.model_selection import RepeatedKFold as RKF

# 5-fold CV, repeated 5 times
cvs = RKF(n_splits=5, n_repeats=5, random_state=42)

# each alpha is passed in as al
# the whole data set is generated with a different RNG each time
# if you like, you may take any existing data set to explain whether I did it wrong
# for each whole data set, the training set is split using the same random state
CVS(Ridge(alpha=al, random_state=42), X_train, Y_train, scoring="neg_mean_squared_error", cv=cvs)
If not, should I use cross_validate or even RidgeCV to get the MSEs I want? Thanks in advance.
Most likely you need to use GridSearchCV. Here is an example using the diabetes dataset and the same 71 values of alpha:
import numpy as np
from numpy import logspace as logs
from sklearn import datasets
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RepeatedKFold

alphas = logs(-3, 3, 71)

diabetes = datasets.load_diabetes()
X = diabetes.data[:300]
y = diabetes.target[:300]
X_val = diabetes.data[300:]
y_val = diabetes.target[300:]
We define the repeated cross validation, and the alphas to fit over:
cvs = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)
parameters = {'alpha':alphas}
clf = GridSearchCV(Ridge(), parameters, cv=cvs)
clf.fit(X, y)
The means of the scores are stored under clf.cv_results_['mean_test_score'], and you also have the individual per-split results in the same dictionary. To plot, you can simply do:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.bar(np.arange(len(alphas)), height=clf.cv_results_['mean_test_score'],
       yerr=clf.cv_results_['std_test_score'], alpha=0.5,
       error_kw=dict(ecolor='gray', lw=1, capsize=5, capthick=2))
ax.set_xticks(np.arange(len(alphas)))
ax.set_xticklabels(np.round(alphas, 3))
This shows the mean and the standard deviation of the score across folds for each value of alpha.
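If you also need the 25 individual per-split scores for each alpha (not just their mean), they are stored in cv_results_ under the split<i>_test_score keys; a sketch is below (to get MSEs instead of the default R², pass scoring='neg_mean_squared_error' to GridSearchCV):
import numpy as np

n_splits_total = 5 * 5  # n_splits * n_repeats of the RepeatedKFold above
per_split_scores = np.vstack([clf.cv_results_['split%d_test_score' % i]
                              for i in range(n_splits_total)])
# shape (25, len(alphas)); its column-wise mean matches clf.cv_results_['mean_test_score']
print(per_split_scores.mean(axis=0))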
You can see this post on how to get the scores for a pre-defined validation set.
I was training a model that uses 8 features to predict the probability of a room being sold:
Region: The region the room belongs to (an integer, taking values between 1 and 10)
Date: The date of stay (an integer between 1 and 365; here we consider only one-day requests)
Weekday: Day of the week (an integer between 1 and 7)
Apartment: Whether the room is a whole apartment (1) or just a room (0)
#beds: The number of beds in the room (an integer between 1 and 4)
Review: Average review of the seller (a continuous variable between 1 and 5)
Pic Quality: Quality of the picture of the room (a continuous variable between 0 and 1)
Price: The historic posted price of the room (a continuous variable)
Accept: Whether this post gets accepted (someone took it, 1) or not (0) in the end
The Accept column is the "y". Hence, this is a binary classification problem.
We plotted the data, and since some of it was skewed we applied a power transform.
We tried a neural network, ExtraTrees, XGBoost, gradient boosting, and random forest. They all gave about 0.77 AUC. However, when we tried them on the test set, the AUC dropped to 0.55 with a precision of 27%.
I am not sure what went wrong, but my thinking was that the reason may be the mixing of discrete and continuous data, especially since some of the features are either 0 or 1.
Can anyone help?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
df_train = pd.read_csv('case2_training.csv')
X, y = df_train.iloc[:, 1:-1], df_train.iloc[:, -1]
y = y.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer()
transform_list = ['Pic Quality', 'Review', 'Price']
X_train[transform_list] = pt.fit_transform(X_train[transform_list])
X_test[transform_list] = pt.transform(X_test[transform_list])
for i in transform_list:
    df = X_train[i]
    ax = df.plot.hist()
    ax.set_title(i)
    plt.show()
# Normalization
sc = MinMaxScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
X_train = X_train.astype(np.float32)
X_test = X_test.astype(np.float32)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=123, n_estimators=50)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC metric
train_accuracy = roc_auc_score(y_test, yhat[:,-1])
print("AUC",train_accuracy)
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=123, n_estimators=50)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC metric
train_accuracy = roc_auc_score(y_test, yhat[:,-1])
print("AUC",train_accuracy)
from torch import nn
from skorch import NeuralNetBinaryClassifier
import torch
model = nn.Sequential(
    nn.Linear(8, 64),
    nn.BatchNorm1d(64),
    nn.GELU(),
    nn.Linear(64, 32),
    nn.BatchNorm1d(32),
    nn.GELU(),
    nn.Linear(32, 16),
    nn.BatchNorm1d(16),
    nn.GELU(),
    nn.Linear(16, 1),
    # nn.Sigmoid()
)

net = NeuralNetBinaryClassifier(
    model,
    max_epochs=100,
    lr=0.1,
    optimizer=torch.optim.Adam,
    # Shuffle training data on each epoch
    iterator_train__shuffle=True,
)
net.fit(X_train, y_train)
from xgboost.sklearn import XGBClassifier
clf = XGBClassifier(silent=0,
                    learning_rate=0.01,
                    min_child_weight=1,
                    max_depth=6,
                    objective='binary:logistic',
                    n_estimators=500,
                    seed=1000)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC metric
train_accuracy = roc_auc_score(y_test, yhat[:,-1])
print("AUC",train_accuracy)
Here is an attachment of a screenshot of the data.
Sample data
This is the fundamental first step of Data Analytics. You need to do two things here:
Data understanding - do the data fields in their current format make sense (data types, value ranges, etc.)?
Data preparation - what should you do to update these data fields before passing them to the model? Also, which inputs do you think will be useful for your model and which will provide little benefit? Are there outliers you need to consider/handle?
A good book if you're starting in the field of data analytics is Fundamentals of Machine Learning for Predictive Data Analytics (I have no affiliation with this book).
Looking at your dataset, there are a couple of things you could try to see how they influence your prediction results:
Unless the region order is actually ranked in importance/value, I would change this to a one-hot encoded feature; you can do this in sklearn (see the sketch after this list). Otherwise you run the risk of your model thinking that regions with a higher number (say 10) are more important than regions with a lower value (say 1).
You could attempt to normalise certain features if they are much larger than some of your other data fields: Why Data Normalization is necessary for Machine Learning models
Consider looking at the Kaggle competition House Prices: Advanced Regression Techniques. It's doing a similar thing to what you're attempting to do, and it might have some pointers for how you should approach the problem in the Notebooks and Discussion tabs.
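For the one-hot encoding suggestion above, here is a minimal sketch with sklearn (assuming X_train/X_test are DataFrames that still contain a Region column, as in the screenshot):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

encoder = ColumnTransformer(
    [('region_ohe', OneHotEncoder(handle_unknown='ignore'), ['Region'])],
    remainder='passthrough')

X_train_enc = encoder.fit_transform(X_train)  # fit the encoder on training data only
X_test_enc = encoder.transform(X_test)        # reuse it unchanged on the test data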
Without deeply exploring all the data you are using, it is hard to say for certain what is causing the drop in accuracy (or AUC) when moving from your training set to the testing set. It is unlikely to be caused by the mixed discrete/continuous data.
The drop just suggests that your models are over-fitting to your training data (and therefore not transferring well). This could be caused by too many learned parameters given the amount of data you have, which is more often a problem with neural networks than with some of the other methods you mentioned. Or the problem could be with the way the data was split into training/testing: if the two distributions differ significantly (perhaps in a way that is not obvious), you wouldn't expect the testing performance to be as good. If it were me, I'd look carefully at how the data was split into training/testing (assuming you have a reasonably large set of data). You may try repeating your experiments with a number of random training/testing splits (search k-fold cross-validation if you're not familiar with it).
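For example, a quick way to see how sensitive the AUC is to the particular split (a sketch, assuming X and y are the features and labels from the question, before the train/test split):
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
aucs = cross_val_score(RandomForestClassifier(random_state=123, n_estimators=50),
                       X, y, cv=cv, scoring='roc_auc')
print(aucs, aucs.mean(), aucs.std())  # a large spread suggests the split, not the model, drives the gap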
Your model is overfitting. Try building a simple model first and using lower parameter values. For tree-based models, scaling does not have any impact on the results.
I am separating the features into X and y, and then I preprocess my train and test data after splitting it with k-fold cross-validation. After that, I fit the training data to my Random Forest Regressor model and calculate the confidence score. Why do I preprocess after splitting? Because people tell me that it's more correct to do it that way, and I'm keeping that principle for the sake of my model's performance.
This is my first time using KFold cross-validation, because my model overfits and I thought I could fix it with cross-validation. I'm still confused about how to use it: I have read the documentation and some articles, but I do not really grasp how to apply it to my model. I tried anyway, and my model still overfits. Whether I use a train-test split or cross-validation, my model score is still 0.999. I do not know what my mistake is, since I'm very new to this method, but I think maybe I did it wrong, so it does not fix the overfitting. Please tell me what's wrong with my code and how to fix it.
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
import scipy.stats as ss
avo_sales = pd.read_csv('avocados.csv')
avo_sales.rename(columns={'4046': 'small PLU sold',
                          '4225': 'large PLU sold',
                          '4770': 'xlarge PLU sold'},
                 inplace=True)
avo_sales.columns = avo_sales.columns.str.replace(' ', '')

x = np.array(avo_sales.drop(['TotalBags', 'Unnamed:0', 'year', 'region', 'Date'], axis=1))
y = np.array(avo_sales.TotalBags)

# X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

kf = KFold(n_splits=10)

for train_index, test_index in kf.split(x):
    X_train, X_test, y_train, y_test = x[train_index], x[test_index], y[train_index], y[test_index]

    impC = SimpleImputer(strategy='most_frequent')
    X_train[:, 8] = impC.fit_transform(X_train[:, 8].reshape(-1, 1)).ravel()
    X_test[:, 8] = impC.transform(X_test[:, 8].reshape(-1, 1)).ravel()

    imp = SimpleImputer(strategy='median')
    X_train[:, 1:8] = imp.fit_transform(X_train[:, 1:8])
    X_test[:, 1:8] = imp.transform(X_test[:, 1:8])

    le = LabelEncoder()
    X_train[:, 8] = le.fit_transform(X_train[:, 8])
    X_test[:, 8] = le.transform(X_test[:, 8])

    rfr = RandomForestRegressor()
    rfr.fit(X_train, y_train)
    confidence = rfr.score(X_test, y_test)
    print(confidence)
The reason you're overfitting is that a non-regularized tree-based model will adjust to the data until all training samples are fitted almost perfectly. See for example this image:
As you can see, this does not generalize well. If you don't specify arguments that regularize the trees, the model will fit the test data poorly because it will basically just learn the noise in the training data. There are many ways to regularize trees in sklearn; you can find them here. For instance:
max_features
min_samples_leaf
max_depth
With proper regularization, you can get a model that generalizes well to the test data. Look at a regularized model for instance:
To regularize your model, instantiate the RandomForestRegressor() module like this:
rfr = RandomForestRegressor(max_features=0.5, min_samples_leaf=4, max_depth=6)
These argument values are arbitrary; it's up to you to find the ones that fit your data best. You can use domain-specific knowledge to choose these values, or a hyperparameter tuning search like GridSearchCV or RandomizedSearchCV.
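For example, a minimal sketch of such a search (the grid values below are arbitrary placeholders, not recommendations):
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = {'max_features': [0.3, 0.5, 0.8],
              'min_samples_leaf': [1, 4, 10],
              'max_depth': [4, 6, None]}

search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)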
Other than that, imputing the mode and median might bring a lot of noise into your data. I would advise against it unless you have no other choice.
While @NicolasGervais' answer gets to the bottom of why your specific model is overfitting, I think there is a conceptual misunderstanding with regard to cross-validation in the original question; you seem to think that:
Cross-validation is a method that improves the performance of a machine learning model.
But this is not the case.
Cross validation is a method that is used to estimate the performance of a given model on unseen data. By itself, it cannot improve the accuracy.
In other words, the respective scores can tell you if your model is overfitting the training data, but simply applying cross-validation does not make your model better.
Example:
Let's look at a dataset with 10 points, and fit a line through it:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

X = np.random.randint(0, 10, 10)
Y = np.random.randint(0, 10, 10)

fig = plt.figure(figsize=(1, 10))

def line(x, slope, intercept):
    return slope * x + intercept

for i in range(5):
    # note that this is not technically 5-fold cross-validation
    # because I allow the same datapoint to go into the test set
    # several times. For illustrative purposes it is fine imho.
    test_indices = np.random.choice(np.arange(10), 2)
    train_indices = list(set(range(10)) - set(test_indices))

    # get train and test sets
    X_train, Y_train = X[train_indices], Y[train_indices]
    X_test, Y_test = X[test_indices], Y[test_indices]
    # training set has one feature and multiple entries
    # so, reshape(-1,1)
    X_train, Y_train, X_test, Y_test = X_train.reshape(-1, 1), Y_train.reshape(-1, 1), X_test.reshape(-1, 1), Y_test.reshape(-1, 1)

    # fit and evaluate linear regression
    reg = LinearRegression().fit(X_train, Y_train)
    score_train = reg.score(X_train, Y_train)
    score_test = reg.score(X_test, Y_test)

    # extract coefficients from model:
    slope, intercept = reg.coef_[0], reg.intercept_[0]

    print(score_test)

    # show train and test sets
    plt.subplot(5, 1, i + 1)
    plt.scatter(X_train, Y_train, c='k')
    plt.scatter(X_test, Y_test, c='r')
    # draw regression line
    plt.plot(np.arange(10), line(np.arange(10), slope, intercept))

    plt.ylim(0, 10)
    plt.xlim(0, 10)
    plt.title('train: {:.2f} test: {:.2f}'.format(score_train, score_test))
You can see that the scores on training and test set are vastly different. You can also see that the estimated parameters vary a lot with the change of train and test set.
That does not make your linear model any better at all.
But now you know exactly how bad it is :)
I am playing around with scikit-learn a bit and wanted to reproduce the cross-validation scores for one specific hyper-parameter combination from a grid search I had carried out.
For the grid search I used the GridSearchCV class, and to reproduce the result for one specific hyper-parameter combination I used the cross_validate function with the exact same splits and classifier settings.
My problem is that I do not get the expected score results, which to my understanding should be exactly the same, since the same computations are carried out to obtain the scores in both methods.
I made sure to exclude any randomness sources from my script by fixing the used splits on the training data.
In the following code snippet, an example of the stated problem is given.
import numpy as np
from sklearn.model_selection import cross_validate, StratifiedKFold, GridSearchCV
from sklearn.svm import NuSVC
np.random.seed(2018)
# generate random training features
X = np.random.random((100, 10))
# class labels
y = np.random.randint(2, size=100)
clf = NuSVC(nu=0.4, gamma='auto')
# Compute score for one parameter combination
grid = GridSearchCV(clf,
                    cv=StratifiedKFold(n_splits=10, random_state=2018),
                    param_grid={'nu': [0.4]},
                    scoring=['f1_macro'],
                    refit=False)
grid.fit(X, y)
print(grid.cv_results_['mean_test_f1_macro'][0])
# Recompute score for exact same input
result = cross_validate(clf,
                        X,
                        y,
                        cv=StratifiedKFold(n_splits=10, random_state=2018),
                        scoring=['f1_macro'])
print(result['test_f1_macro'].mean())
Executing the given snippet results in the output:
0.38414468864468865
0.3848840048840049
I would have expected these scores to be exactly the same, as they are computed on the same split, using the same training data with the same classifier.
This is because mean_test_f1_macro is not a simple average over the folds; it is a weighted average, with the weights being the sizes of the test folds. To learn more about the actual implementation, refer to this answer.
Now, to replicate the GridSearchCV result, try this!
print('grid search cv result',grid.cv_results_['mean_test_f1_macro'][0])
# grid search cv result 0.38414468864468865
print('simple mean: ', result['test_f1_macro'].mean())
# simple mean: 0.3848840048840049
weights= [len(test) for (_, test) in StratifiedKFold(n_splits=10, random_state=2018).split(X,y)]
print('weighted mean: {}'.format(np.average(result['test_f1_macro'], axis=0, weights=weights)))
# weighted mean: 0.38414468864468865
I am learning to do classification on the Cover Type data, which has 7 classes. I train my model with GradientBoostingClassifier from scikit-learn. When I try to plot my loss function, it goes like this:
Does this kind of plot show that my model suffers from high variance? If yes, what should I do? And I don't know why, in the middle of iterations 200 to 500, the plot is shaped like a rectangle.
(EDIT)
To edit this post: I'm not sure what's wrong with my code, because I just used the regular code to fit the training data. I'm using a Jupyter notebook, so I'm just going to provide the code.
Y = train["Cover_Type"]
X = train.drop({"Cover_Type"}, axis=1)

# split into training data and validation data
from sklearn.model_selection import train_test_split
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.3, random_state=42)

from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingClassifier

params = {'n_estimators': 1000, 'learning_rate': 0.3, 'max_features': 'sqrt'}
dtree = GradientBoostingClassifier(**params)
dtree.fit(X_train, Y_train)

# check the F1-score
from sklearn.metrics import f1_score
Y_pred = dtree.predict(X_val)  # predict the validation data using the model above
print(Y_pred)
score = f1_score(Y_val, Y_pred, average="micro")
print("Gradient Boosting Tree F1-score: " + str(score))  # I got 0.86 F1-score
import numpy as np
import matplotlib.pyplot as plt

# Plot training deviance
# compute test set deviance
val_score = np.zeros((params['n_estimators'],), dtype=np.float64)

for i, Y_pred in enumerate(dtree.staged_predict(X_val)):
    val_score[i] = dtree.loss_(Y_val, Y_pred.reshape(-1, 1))

plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title('Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, dtree.train_score_, 'b-',
         label='Training Set Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, val_score, 'r-',
         label='Validation Set Deviance')
plt.legend(loc='upper right')
plt.xlabel('Boosting Iterations')
plt.ylabel('Deviance')
There are several issues, which I will explain one by one; I have also added corrected code for your example.
The staged_predict(X) method should NOT be used
Since staged_predict(X) outputs predicted classes instead of predicted probabilities, it is not correct to use it here.
One can (where the context allows it) use the staged_decision_function(X) method and pass the computed decisions at each stage to the model.loss_ attribute. But in this example it does not work (the loss based on staged decisions increases, while the actual loss decreases).
You should use staged_predict_proba(X) with the cross-entropy loss
You should use staged_predict_proba(X).
You also need to define a function that calculates the cross-entropy loss at each stage.
I have provided the code below. Note that I set the verbosity to 2, so you can see that the sklearn training loss at each stage is the same as our loss (as a sanity check that our approach works correctly).
Why you see big jumps
I think the reason is that GBC becomes very confident and predicts a label of 1 (as an example) with probability one while it is not correct (for example, the true label is 2). This creates big jumps (as the cross-entropy goes to infinity). In such a scenario you should change your GBC parameters.
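For instance, a sketch of parameter changes that typically keep the model from becoming over-confident (the values are illustrative only, not tuned for this dataset):
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(n_estimators=10,
                                 learning_rate=0.05,   # smaller steps than 0.3
                                 max_depth=2,          # shallower trees
                                 min_samples_leaf=20,  # smoother leaf estimates
                                 subsample=0.8,        # stochastic gradient boosting
                                 verbose=2)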
The code and the Plot are given below
The code is:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

def _cross_entropy_like_loss(model, input_data, targets, num_estimators):
    loss = np.zeros((num_estimators, 1))
    for index, predict in enumerate(model.staged_predict_proba(input_data)):
        loss[index, :] = -np.sum(np.log([predict[sample_num, class_num - 1]
                                         for sample_num, class_num in enumerate(targets)]))
        print(f'ce loss {index}:{loss[index, :]}')
    return loss

covtype = fetch_covtype()
X = covtype.data
Y = covtype.target
n_estimators = 10

X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.3, random_state=42)
clf = GradientBoostingClassifier(n_estimators=n_estimators, learning_rate=0.3, verbose=2)
clf.fit(X_train, Y_train)

tr_loss_ce = _cross_entropy_like_loss(clf, X_train, Y_train, n_estimators)
test_loss_ce = _cross_entropy_like_loss(clf, X_val, Y_val, n_estimators)

plt.figure()
plt.plot(np.arange(n_estimators) + 1, tr_loss_ce, '-r', label='training_loss_ce')
plt.plot(np.arange(n_estimators) + 1, test_loss_ce, '-b', label='val_loss_ce')
plt.ylabel('Error')
plt.xlabel('num_components')
plt.legend(loc='upper right')
The output of console is like below, from which you can easily verify the approach is correct.
Iter Train Loss Remaining Time
1 482434.6631 1.04m
2 398501.7223 55.56s
3 351391.6893 48.51s
4 322290.3230 41.60s
5 301887.1735 34.65s
6 287438.7801 27.72s
7 276109.2008 20.82s
8 268089.2418 13.84s
9 261372.6689 6.93s
10 256096.1205 0.00s
ce loss 0:[ 482434.6630936]
ce loss 1:[ 398501.72228276]
ce loss 2:[ 351391.68933547]
ce loss 3:[ 322290.32300604]
ce loss 4:[ 301887.17346783]
ce loss 5:[ 287438.7801033]
ce loss 6:[ 276109.20077844]
ce loss 7:[ 268089.2418214]
ce loss 8:[ 261372.66892149]
ce loss 9:[ 256096.1205235]
The plot is here.
It seems like you have several issues. It's hard to say for sure, because you don't provide any code.
Does my model suffer from high variance?
First, your model is overfitting from the start. You can tell this is the case since your validation loss is increasing although your training loss is decreasing. What's interesting is that your validation loss increases from the very start, which suggests that your model is not working. So to answer your question: yes, it suffers from high variance.
What should I do?
Are you sure there is a trend in your data? The fact that the validation loss increases from the very start hints that either this model does not apply to your data at all, your data does not have a trend, or you have issues with your code. Perhaps try other models, and make sure your code is correct. Again, it's hard to say without a minimal example.
The strange rectangle
This looks strange. Either there is an issue with your data in the validation set (since this effect does not show up on the training curve), or you just have an issue with your code. If you provide a sample, we could probably help you more.