Why does cross validation show high accuracy when the model is overfitting? - python

I'm using a random forest for a binary classification problem. The training set contains 70k rows of class "0" and only 3k of class "1". In addition, the predictions on X_test should contain roughly the same number of "0"s and "1"s.
clf = RandomForestClassifier(random_state=1, n_estimators=350, min_samples_split=6, min_samples_leaf=2)
scores = cross_validation.cross_val_score(clf, x_train, y_train, cv=cv)
print("Accuracy (random forest): {}+/-{}".format(scores.mean(), scores.std()))
Accuracy (random forest): 0.960755941369+/-1.40500919606e-06
clf.fit(x_train, y_train)
prediction_final = clf.predict(X_test)  # this returns 76k zeroes and only 15 ones
# x_test is a 10% holdout split off from the training data
preds_test = clf.predict(x_test)
print("precision_score", precision_score(y_test, preds_test))
print("recall_score", recall_score(y_test, preds_test))
precision_score 0.0
recall_score 0.0
confusion_matrix:
[[7279    1]
 [ 322    0]]
As far as I can see, there is an overfitting problem, but why doesn't cross validation detect it? Even the standard deviation is very low. So how can I fix this?
P.S. I've tried taking 3k rows with "0" and 3k with "1" as the training set; the model is much better, but this is not a solution.

(Overall) Accuracy is a nearly useless measure for unbalanced data sets like yours, since it computes the percentage of correct predictions. In your case, imagine a classifier that would learn nothing, but just always predict "0". Since you have 70k zeroes and only 3k ones, that classifier would reach an accuracy score of 70/73 = 95.9%.
Inspecting the confusion matrix is often helpful for exposing such a "classifier".
Thus, you should definitely use another measure to quantify classification quality. Average Accuracy would be an option, since it computes the average accuracy over all classes. In the case of binary classification it is also called Balanced Accuracy and amounts to computing (TP/P + TN/N)/2, so that the classifier imagined above, which always predicts "0", would only score (100% + 0%) / 2 = 50%. However, that measure does not seem to be implemented in scikit-learn. Though you could implement such a scoring function yourself, it will probably be easier and faster to use one of the other predefined scorers.
For example, you could compute the F1 Score instead of Accuracy by passing scoring = 'f1' to cross_validation.cross_val_score. The F1 Score takes both precision and recall into account.
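For illustration, here is a minimal sketch of both options, reusing clf, x_train, y_train and cv from the question (in current scikit-learn versions cross_val_score lives in sklearn.model_selection rather than the old cross_validation module):
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, recall_score

# F1 instead of plain accuracy
f1_scores = cross_val_score(clf, x_train, y_train, cv=cv, scoring='f1')
print("F1: {} +/- {}".format(f1_scores.mean(), f1_scores.std()))

# hand-rolled balanced accuracy: macro-averaged recall equals (TP/P + TN/N)/2 for binary labels
balanced_acc = make_scorer(recall_score, average='macro')
bal_scores = cross_val_score(clf, x_train, y_train, cv=cv, scoring=balanced_acc)
print("Balanced Accuracy: {} +/- {}".format(bal_scores.mean(), bal_scores.std()))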

Related

Why does LogisticRegressionCV's .score() differ from cross_val_score?

I was using LogisticRegressionCV's .score() method to yield an accuracy score for my model.
I also used cross_val_score to yield an accuracy score with the same cv split (skf), expecting the same score to show up.
But alas, they were different and I'm confused.
I first did a StratifiedKFold:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=708)
I then instantiated a LogisticRegressionCV() with skf as the argument for the cv parameter, fitted it, and scored it on the training set.
logreg = LogisticRegressionCV(cv=skf, solver='liblinear')
logreg.fit(X_train_sc, y_train)
logreg.score(X_train_sc, y_train)
This gave me a score of 0.849507735583685, which was accuracy by default. Since this is LogisticRegressionCV, this score is actually the mean accuracy score right?
Then I used cross_val_score:
cross_val_score(logreg, X_train_sc, y_train, cv=skf).mean()
This gave me a mean accuracy score of 0.8227814439082044.
I'm kind of confused as to why the scores differ, since I thought I was basically doing the same thing.
[.score] is actually the mean accuracy score right?
No. The score method here is the accuracy score of the final classifier (which was retrained on the entire training set, using the optimal value of the regularization strength). By evaluating it on the training set again, you're getting an optimistically-biased estimate of future performance.
To recover the cross-validation scores, you can use the attribute scores_. Even with the same folds, these may be slightly different from cross_val_score due to randomness in the solver, if it doesn't converge completely.
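As an illustration, a minimal sketch of reading those cross-validation scores back out of the fitted logreg above (scores_ is a dict keyed by class, holding a (n_folds, n_Cs) grid of accuracies; for a binary problem it should have a single entry):
import numpy as np

fold_scores = list(logreg.scores_.values())[0]   # (n_folds, n_Cs) accuracy grid
mean_per_C = fold_scores.mean(axis=0)            # mean CV accuracy for each candidate C
best_idx = int(np.argmax(mean_per_C))
print("selected C:", logreg.Cs_[best_idx])
print("mean CV accuracy at that C:", mean_per_C[best_idx])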
To add to the answer: you can change this behaviour simply by passing refit=False to LogisticRegressionCV(), e.g.
logreg = LogisticRegressionCV(cv=skf, solver='liblinear', refit=False)
The rest of the code can stay the same.

Does random_state in train_test_split affect the actual performance of a model?

I get why a model's score is different for each random_state, but I didn't expect the difference between the highest and the lowest score (over random_state 0-100) to be 0.37, which is a lot. I also tried ten-fold cross-validation; the difference is still quite big.
So does this actually matter, or is it something I should ignore?
The Data-set link
(Download -> Data Folder -> student.zip -> student-mat.csv)
Full Code :
import pandas as pd

acc_dic = {}
grade_df_main = pd.read_csv(r'F:\Python\Jupyter Notebook\ML Projects\data\student-math-grade.csv', sep=";")
grade_df = grade_df_main[["G1", "G2", "G3", "studytime", "failures", "absences"]]
X = grade_df.drop("G3", axis="columns")
Y = grade_df["G3"].copy()

def cross_val_scores(scores):
    print("Cross validation result :-")
    #print("Scores: {}".format(scores))
    print("Mean: {}".format(scores.mean()))
    print("Standard deviation: {}".format(scores.std()))

def start(rand_state):
    print("Index {}".format(rand_state))

    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=.1, random_state=rand_state)

    from sklearn.linear_model import LinearRegression
    lin_reg_obj = LinearRegression()
    lin_reg_obj.fit(x_train, y_train)
    accuracy = lin_reg_obj.score(x_test, y_test)
    print("Accuracy: {}".format(accuracy))
    acc_dic[rand_state] = accuracy

    from sklearn.model_selection import cross_val_score
    scores = cross_val_score(lin_reg_obj, x_test, y_test, scoring="neg_mean_squared_error", cv=10)
    cross_val_scores(scores)
    print()

for i in range(0, 101):
    start(i)

print("Overview : \n")
result_val = list(acc_dic.values())
min_index = result_val.index(min(result_val))
max_index = result_val.index(max(result_val))
print("Minimum Accuracy : ")
start(min_index)
print("Maximum Accuracy : ")
start(max_index)
Result :
Only included the highest and the lowest results
Minimum Accuracy :
Index 54
Accuracy: 0.5635271419142645
Cross validation result :-
Mean: -8.969894370977539
Standard deviation: 5.614516642510817
Maximum Accuracy :
Index 97
Accuracy: 0.9426035720345269
Cross validation result :-
Mean: -0.7063598117158191
Standard deviation: 0.3149445166291036
TL;DR
It is not the split on the dataset you used to train and evaluate your model that decides how well your final model will actually perform once it is deployed. The split and evaluation technique is more about getting a valid estimation of how well the model might perform in real life. And as you can see, the choice of the splitting and evaluation technique can have a great influence on this estimation. The results on your dataset highly suggest preferring k-fold cross-validation over a simple train/test split.
Longer version
I believe you have already figured out that the split you do on the dataset to separate it into train and test sets has nothing to do with the performance of your final model, which is likely to be trained on the whole dataset and then be deployed.
The purpose of testing is to get a feeling of the predictive performance on unseen data. In a best-case scenario, you would ideally have two completely different data sets from different cohorts/sources to train and test your model (external validation). This is the best approach to evaluate how your model would perform once it is deployed. However, since you often do not have such a second source of data, you do an internal validation where you get samples for training and testing from the same cohort/source.
Usually, given that the dataset is big enough, randomness will make sure that the train and test splits are a good representation of your original dataset and that the performance metrics you get are a fair estimate of the model's predictive performance in real life.
However, as you can see with your own dataset, there are cases where the split heavily influences the result. It is exactly in such cases that you are better off evaluating your performance with a cross-validation technique such as k-fold cross-validation and computing the mean across the different splits.
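For concreteness, a minimal sketch (reusing X and Y from the question) that contrasts a single split with 10-fold cross-validation on the full data; LinearRegression.score returns R^2, so the CV below uses scoring="r2" to keep the numbers comparable:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

model = LinearRegression()

# single split: the estimate swings a lot with random_state
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.1, random_state=54)
print("single-split R^2:", model.fit(x_train, y_train).score(x_test, y_test))

# 10-fold CV on the full data: one mean estimate with a spread
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, Y, cv=cv, scoring="r2")
print("10-fold CV R^2: {:.3f} +/- {:.3f}".format(scores.mean(), scores.std()))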

How to detect overfitting with Cross Validation: What should be the difference threshold?

After building the classification model, I evaluated it by means of accuracy, precision and recall. To check for overfitting I used k-fold cross validation. I am aware that if my model scores vary greatly from my cross validation scores then my model is overfitting. However, I am stuck on how to define the threshold: how much difference in the scores actually indicates that the model is overfitting? For example, here are 3 splits (3-fold CV, shuffle=True, random_state=42) and their respective scores for a logistic regression model:
Split Number 1
Accuracy= 0.9454545454545454
Precision= 0.94375
Recall= 1.0
Split Number 2
Accuracy= 0.9757575757575757
Precision= 0.9753086419753086
Recall= 1.0
Split Number 3
Accuracy= 0.9695121951219512
Precision= 0.9691358024691358
Recall= 1.0
Direct training of the Logistic Regression model without CV:
Accuracy= 0.9530201342281879
Precision= 0.952054794520548
Recall= 1.0
So how do I decide by what magnitude my scores need to vary in order to infer overfitting?
I would assume that you are using cross-validation, which splits your data into training and test folds. Right now you probably have something like this implemented:
from sklearn import datasets, svm
from sklearn.model_selection import cross_validate
from sklearn.metrics import recall_score

iris = datasets.load_iris()
scoring = ['precision_macro', 'recall_macro']
clf = svm.SVC(kernel='linear', C=1, random_state=0)
scores = cross_validate(clf, iris.data, iris.target, scoring=scoring, cv=5)
So right now you are calculating only the test score, which in all 3 cases is very good.
The first option is:
return_train_score is set to False by default to save computation time. To evaluate the scores on the training set as well, it needs to be set to True.
There you can also see the training scores of your folds. If you see 1.0 accuracy on the training folds (while the test scores are lower), that is a sign of overfitting.
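For example, a minimal sketch (reusing clf, iris and scoring from the snippet above) that reports the training scores next to the test scores:
scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
                        cv=5, return_train_score=True)
print("train recall:", scores['train_recall_macro'])
print("test recall: ", scores['test_recall_macro'])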
The other option is to run more splits. If every test score remains high across many splits, you can be more confident that the algorithm is not overfitting.
Did you add a baseline? I would assume it is binary classification, and I have the feeling the dataset is highly imbalanced, so 0.96 accuracy may not be that good in general, because a dummy classifier that always predicts one class would already reach about 0.95 accuracy.
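To make the baseline idea concrete, a minimal sketch (with hypothetical X and y standing in for your features and labels):
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# always predicts the majority class; its accuracy is the floor a real model has to beat
baseline = DummyClassifier(strategy="most_frequent")
print(cross_val_score(baseline, X, y, cv=3, scoring="accuracy").mean())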

Random Forest and Imbalance

I'm working on a dataset of around 20000 rows.
The aim is to predict whether a person has been hired by a company or not given some features like gender, experience, application date, test score, job skill, etc. The dataset is imbalanced: the classes are either '1' or '0' (hired / not hired) with ratio 1:10.
I chose to train a Random Forest Classifier to work on this problem.
I split the dataset 70%-30% randomly into a training set and a test set.
After careful reading of the different options to tackle the imbalance problem (e.g. Dealing with the class imbalance in binary classification, Unbalanced classification using RandomForestClassifier in sklearn) I got stuck on getting a good score on my test set.
I tried several things:
I trained three different random forests on the whole X_train, on undersampled training X_und and on oversampled X_sm respectively. X_und was generated by simply cutting down at random the rows of X_train labelled by 0 to get 50-50, 66-33 or 75-25 ratios of 0s and 1s; X_sm was generated by SMOTE.
Using scikit-learn's GridSearchCV I tweaked the three models to get the best parameters:
param_grid = {'min_samples_leaf': [3, 5, 7, 10, 15],
              'max_features': [0.5, 'sqrt', 'log2'],
              'max_depth': [10, 15, 20],
              'class_weight': [{0: 1, 1: 1}, {0: 1, 1: 2}, {0: 1, 1: 5}, 'balanced'],
              'criterion': ['entropy', 'gini']}
sss = StratifiedShuffleSplit(n_splits=5)
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=sss, verbose=1, n_jobs=-1, scoring='roc_auc')
grid.fit(X_train, y_train)
The best score was obtained with
rfc = RandomForestClassifier(n_estimators=150, criterion='gini', min_samples_leaf=3,
                             max_features=0.5, n_jobs=-1, oob_score=True, class_weight={0: 1, 1: 5})
trained on the whole X_train and giving this classification report on the test set:
precision recall f1-score support
0 0.9397 0.9759 0.9575 5189
1 0.7329 0.5135 0.6039 668
micro avg 0.9232 0.9232 0.9232 5857
macro avg 0.8363 0.7447 0.7807 5857
weighted avg 0.9161 0.9232 0.9171 5857
With the sampling methods I got similar results, but no better ones. Precision went down with the undersampling and I got almost the same result with the oversampling.
For undersampling:
precision recall f1-score support
0 0.9532 0.9310 0.9420 5189
1 0.5463 0.6452 0.5916 668
For SMOTE:
precision recall f1-score support
0 0.9351 0.9794 0.9567 5189
1 0.7464 0.4716 0.5780 668
I played with the class_weight parameter to give more weight to the 1s and also with sample_weight in the fitting process.
I tried to figure out which score to take into account other than accuracy. Running GridSearchCV to tune the forests, I used different scores, focusing especially on f1 and roc_auc, hoping to reduce the false negatives. I got great scores with the SMOTE oversampling, but this model did not generalize well on the test set. I wasn't able to work out how to change the splitting criterion or the scoring for the random forest in order to lower the number of false negatives and increase the recall for the 1s. I saw that cohen_kappa_score is also useful for imbalanced datasets, but it cannot be used directly as a scoring option in sklearn's cross-validation methods like GridSearchCV (a possible workaround is sketched after this list).
I selected only the most important features, but this did not change the result; on the contrary, it got worse. I noticed that the feature importances obtained from training a RF after SMOTE were completely different from those obtained on the normal sample.
I don't know exactly what to do with the oob_score other than treating it as a free validation score obtained while training the forests. With oversampling I get the highest oob_score = 0.9535, but this is natural since the training set is balanced in that case; the problem is still that it does not generalize well to the test set.
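Regarding the cohen_kappa_score remark above, a minimal sketch of the usual workaround, wrapping the metric with make_scorer so it can be handed to GridSearchCV (reusing param_grid, sss, X_train and y_train from the snippet above; this is only an illustration, not part of the original question):
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import GridSearchCV

# turn the plain metric function into a scorer that GridSearchCV can optimize
kappa_scorer = make_scorer(cohen_kappa_score)
grid_kappa = GridSearchCV(RandomForestClassifier(), param_grid, cv=sss,
                          n_jobs=-1, scoring=kappa_scorer)
grid_kappa.fit(X_train, y_train)
print(grid_kappa.best_params_, grid_kappa.best_score_)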
Right now I have run out of ideas, so I would like to know if I'm missing something or doing something wrong. Or should I just try another model instead of Random Forest?

How can a metric computed with cross_val_score differ from the same metric computed starting from cross_val_predict?

How can a metric computed with cross_val_score differ from the same metric computed from the predictions of cross_val_predict (which are then passed to a metric function)?
Here is an example:
from sklearn import cross_validation
from sklearn import datasets
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
iris = datasets.load_iris()
gnb_clf = GaussianNB()
# compute mean accuracy with cross_val_predict
predicted = cross_validation.cross_val_predict(gnb_clf, iris.data, iris.target, cv=5)
accuracy_cvp = metrics.accuracy_score(iris.target, predicted)
# compute mean accuracy with cross_val_score
score_cvs = cross_validation.cross_val_score(gnb_clf, iris.data, iris.target, cv=5)
accuracy_cvs = score_cvs.mean()
print('Accuracy cvp: %0.8f\nAccuracy cvs: %0.8f' % (accuracy_cvp, accuracy_cvs))
In this case, we obtain the same result:
Accuracy cvp: 0.95333333
Accuracy cvs: 0.95333333
Nevertheless, this does not always seem to be the case, as the official documentation states (regarding a result computed using cross_val_predict):
Note that the result of this computation may be slightly different
from those obtained using cross_val_score as the elements are grouped
in different ways.
Imagine the following labels and splitting:
[010|101|10]
So you have 8 data points, 4 per class, and you split them into 3 folds, leading to two folds with 3 elements and one with 2. Now let us assume that during cross validation you get the following predictions:
[010|100|00]
Thus, your per-fold scores are [100%, 67%, 50%], and cross_val_score (as an average) is around 72%. Now what about accuracy over the pooled predictions? You clearly have 6/8 right, thus 75%. As you can see the scores are different, even though they both rely on cross validation. Here, the difference arises because the splits are not exactly the same size, so the last "50%" pulls the overall average down disproportionately, because it is an average over just 2 samples (while the others are based on 3).
There might be other similar phenomena; in general it boils down to the way averaging is computed. Thus, cross_val_score is an average of per-fold averages, which does not have to equal the accuracy computed over the pooled cross validation predictions.
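A tiny numeric check of that example:
import numpy as np

fold_scores = [3/3, 2/3, 1/2]          # per-fold accuracies from the example above
print(np.mean(fold_scores))            # ~0.722, the mean that cross_val_score reports
print((3 + 2 + 1) / (3 + 3 + 2))       # 0.75, pooled accuracy as computed from cross_val_predict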
In addition to lejlot's answer, another way that you might get slightly different results between cross_val_score and cross_val_predict is when the target classes are not distributed in a way that allows them to be evenly split between folds.
According to the documentation for cross_val_predict, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used by default. This may lead to a situation where even though the total number of instances in the dataset is divisible by the number of folds, you end up with folds of slightly different sizes, because the splitter is splitting based on the presence of the target. This can then lead to the issue where an average of averages is slightly different to an overall average.
For example, if you have 100 data points, and 33 of these are the target class, then KFold with n_splits=5 would split this into 5 folds of 20 observations, but StratifiedKFold would not necessarily give you equally-sized folds.
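A minimal sketch of that scenario with made-up data (the exact stratified fold sizes are illustrative):
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.zeros((100, 1))
y = np.array([1] * 33 + [0] * 67)   # 33 of the 100 points belong to the target class

print([len(test) for _, test in KFold(n_splits=5).split(X, y)])
# five folds of 20
print([len(test) for _, test in StratifiedKFold(n_splits=5).split(X, y)])
# e.g. [21, 21, 20, 19, 19] -- each class is split separately, so fold sizes can differ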
