Scaling of validation data in supervised ML algorithm - python

I have written a classification algorithm in Python that conforms to Scikit-Learn's API. Given labeled data X, y, I would like to train my algorithm on this data in the following way:
X, y are split into X_aux, y_aux and X_test, y_test.
X_aux, y_aux are split into X_train, y_train and X_val, y_val.
Then, using Scikit-Learn, I define a Pipeline that chains a StandardScaler (for feature normalization) with my model. Finally, the pipeline is trained and evaluated as follows:
pipe = Pipeline([('scaler', StandardScaler()), ('clf', Model())])
pipe.fit(X_train, y_train, clf__validation_data=(X_val, y_val))
pred_proba = pipe.predict_proba(X_test)[:, 1]  # probability of the positive class
score = roc_auc_score(y_test, pred_proba)
The fit method of Model accepts a validation_data parameter to monitor progress during training and possibly avoid overfitting. To this end, at each iteration, the fit method prints the model loss on the training data (X_train, y_train) (training loss) and on the validation data (X_val, y_val) (validation loss). In addition to the validation loss, I would also like the fit method to return the ROC AUC score on the validation data. My question is the following:
Should X_val be normalized with the scaler of the pipeline before it is used to compute the validation ROC AUC score during training? Also, in this code, only X_train is normalized by the scaler. Should I do X_aux = scaler.fit_transform(X_aux) instead and then split into train/validation?
I apologize in advance, as my question is very naive. I confess I got confused.
I think that X_val should be normalized. The way I see it is that the few lines of code above are equivalent to:
scaler = StandardScaler()
clf = Model()
X_train = scaler.fit_transform(X_train)
clf.fit(X_train, y_train, validation_data = (X_val, y_val))
# During `fit`, at each iteration, we would have:
# train_loss = loss(X_train, y_train)
# validation_loss = loss(X_val, y_val)
# pred_proba_val = clf.predict_proba(X_val)[:, 1] (*)
# roc_auc_val = roc_auc_score(y_val, pred_proba_val)
X_test = scaler.transform(X_test)
pred_proba = clf.predict_proba(X_test)[:, 1]  # (**)
score = roc_auc_score(y_test, pred_proba)
On line (*) the predict_proba method is called on unnormalized data, whereas on line (**) it is called on normalized data. This is why I believe that X_val should be normalized. Still, I am not sure whether my reasoning is correct.
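If X_val is indeed meant to be scaled consistently with X_train, one option is to transform it with the scaler that was fitted on the training data only (fitting on X_aux before the split would leak validation statistics into training). Below is a minimal sketch under the assumptions of the question: Model is the custom estimator and its fit method accepts a validation_data tuple.
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

# Fit the scaler on the training data only, then reuse the same
# fitted scaler for the validation and test sets.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

clf = Model()  # custom estimator from the question
clf.fit(X_train_s, y_train, validation_data=(X_val_s, y_val))

# Inside fit, the per-iteration validation ROC AUC is then computed
# on data scaled exactly like the training data, e.g.:
# roc_auc_val = roc_auc_score(y_val, clf.predict_proba(X_val_s)[:, 1])

pred_proba = clf.predict_proba(X_test_s)[:, 1]
score = roc_auc_score(y_test, pred_proba)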

Related

Why does my cross-validation consistently perform better than train-test split?

I have the code below (using sklearn) that first uses the training set for cross-validation and, as a final check, uses the test set. However, the cross-validation consistently performs better, as shown below. Am I over-fitting on the training data? If so, which hyperparameter(s) would be best to tune to avoid this?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
#Cross validation
rfc = RandomForestClassifier()
cv = RepeatedKFold(n_splits=10, n_repeats=5)
scoring = {'accuracy', 'precision', 'recall', 'f1', 'roc_auc' }
scores = cross_validate(rfc, X_train, y_train, scoring=scoring, cv=cv)
print(mean(scores['test_accuracy']),
      mean(scores['test_precision']),
      mean(scores['test_recall']),
      mean(scores['test_f1']),
      mean(scores['test_roc_auc']))
Which gives me:
0.8536558341101569 0.8641939667622551 0.8392201023654705 0.8514895113569482 0.9264002192260914
Re-train the model now with the entire training+validation set, and test it with the never-seen-before test set:
RFC = RandomForestClassifier()
RFC.fit(X_train, y_train)
y_pred = RFC.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
y_pred_proba = RFC.predict_proba(X_test)[::,1]
auc = roc_auc_score(y_test, y_pred_proba)
print(accuracy,
      precision,
      recall,
      f1,
      auc)
Now it gives me the numbers below, which are clearly worse:
0.7809788654060067 0.5113236034222446 0.5044687189672294 0.5078730317420644 0.7589037004728368
I am able to reproduce your scenario with the Pima Indians Diabetes dataset.
The difference you see in the prediction metrics is not consistent, and in some runs you may even notice the opposite, because it depends on the selection of X_test during the split - some cases will be easier to predict and give better metrics, and vice versa. While cross-validation runs predictions on the whole set in rotation and averages this effect out, a single X_test set suffers from the effects of the random split.
In order to have better visibility into what is happening here, I have modified your experiment and split it into two steps:
1. Cross-validation step:
I use the whole X and y sets and run the rest of the code as-is:
rfc = RandomForestClassifier()
cv = RepeatedKFold(n_splits=10, n_repeats=5)
# cv = KFold(n_splits=10)
scoring = {'accuracy', 'precision', 'recall', 'f1', 'roc_auc'}
scores = cross_validate(rfc, X, y, scoring=scoring, cv=cv)
print(mean(scores['test_accuracy']),
      mean(scores['test_precision']),
      mean(scores['test_recall']),
      mean(scores['test_f1']),
      mean(scores['test_roc_auc']))
Output:
0.768257006151743 0.6943032069967433 0.593436328663432 0.6357667086829574 0.8221242747913622
2. Classic train-test step:
Next I run the plain train-test step, but I do it 50 times with different train_test splits and average the metrics (similar to the cross-validation step).
accuracies = []
precisions = []
recalls = []
f1s = []
aucs = []
for i in range(50):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    RFC = RandomForestClassifier()
    RFC.fit(X_train, y_train)
    y_pred = RFC.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    y_pred_proba = RFC.predict_proba(X_test)[::, 1]
    auc = roc_auc_score(y_test, y_pred_proba)
    accuracies.append(accuracy)
    precisions.append(precision)
    recalls.append(recall)
    f1s.append(f1)
    aucs.append(auc)
print(mean(accuracies),
      mean(precisions),
      mean(recalls),
      mean(f1s),
      mean(aucs))
Output:
0.7606926406926405 0.7001931059992001 0.5778712922956755 0.6306501622080503 0.8207846633339568
As expected, the prediction metrics are similar. However, cross-validation runs much faster and uses each data point of the whole data set for testing (in rotation) a given number of times.
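To make that split-to-split variability visible, you can also report the spread of the 50 train-test runs rather than only their mean. A small sketch, reusing the metric lists collected in the loop above:
from statistics import mean, stdev

# The mean gives the typical score; the standard deviation shows how far
# a single random train/test split can stray from it.
for name, values in [('accuracy', accuracies), ('precision', precisions),
                     ('recall', recalls), ('f1', f1s), ('roc_auc', aucs)]:
    print(f"{name}: {mean(values):.3f} +/- {stdev(values):.3f}")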

Python: I want to perform 5 fold cross validation for logistic regression and report scores. Do I use LogisticRegressionCV() or cross_val_score()?

cross_val_score gives different results than LogisticRegressionCV, and I can't figure out why.
Here is my code:
seed = 42
test_size = .33
X_train, X_test, Y_train, Y_test = train_test_split(scale(X),Y, test_size=test_size, random_state=seed)
#Below is my model that I use throughout the program.
model = LogisticRegressionCV(random_state=42)
print('Logistic Regression results:')
#For cross_val_score below, I just call LogisticRegression (and not LogRegCV) with the same parameters.
scores = cross_val_score(LogisticRegression(random_state=42), X_train, Y_train, scoring='accuracy', cv=5)
print(np.amax(scores)*100)
print("%.2f%% average accuracy with a standard deviation of %0.2f" % (scores.mean() * 100, scores.std() * 100))
model.fit(X_train, Y_train)
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(Y_test, predictions)
coef=np.round(model.coef_,2)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
The output is this.
Logistic Regression results:
79.90483019359885
79.69% average accuracy with a standard deviation of 0.14
Accuracy: 79.81%
Why is the maximum accuracy from cross_val_score higher than the accuracy achieved by LogisticRegressionCV?
Also, I recognize that cross_val_score does not return a model, which is why I want to use LogisticRegressionCV, but I am struggling to understand why it is not performing as well. Likewise, I am not sure how to get the standard deviations of the predictors from LogisticRegressionCV.
There are a few points to take into consideration:
Cross validation is generally used whenever you need to simulate a validation set (for instance, when the training set is not big enough to be divided into training, validation and test sets) and it only uses training data. In your case you're computing the accuracy of the model on test data, making it impossible to compare the results exactly.
According to the docs:
Cross-validation estimators are named EstimatorCV and tend to be roughly equivalent to GridSearchCV(Estimator(), ...). The advantage of using a cross-validation estimator over the canonical estimator class along with grid search is that they can take advantage of warm-starting by reusing precomputed results in the previous steps of the cross-validation process. This generally leads to speed improvements.
If you look at this snippet, you'll see that this is indeed what happens:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
data = load_breast_cancer()
X, y = data['data'], data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
estimator = LogisticRegression(random_state=42, solver='liblinear')
grid = {
    'C': np.power(10.0, np.arange(-10, 10)),
}
gs = GridSearchCV(estimator, param_grid=grid, scoring='accuracy', cv=5)
gs.fit(X_train, y_train)
print(gs.best_score_) # 0.953846153846154
lrcv = LogisticRegressionCV(Cs=list(np.power(10.0, np.arange(-10, 10))),
                            cv=5, scoring='accuracy', solver='liblinear', random_state=42)
lrcv.fit(X_train, y_train)
print(lrcv.scores_[1].mean(axis=0).max()) # 0.953846153846154
I would suggest having a look here, too, to get the details of lrcv.scores_[1].mean(axis=0).max().
Finally, to get the same results with cross_val_score, you should write:
score = cross_val_score(gs.best_estimator_, X_train, y_train, cv=5, scoring='accuracy')
score.mean() # 0.953846153846154
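If you also want to see which regularization strength each approach selected, both objects expose it through documented scikit-learn attributes. A short continuation of the snippet above:
# GridSearchCV stores the winning hyperparameters explicitly.
print(gs.best_params_)   # e.g. {'C': 1.0}

# LogisticRegressionCV stores the selected C per class in C_.
print(lrcv.C_)

# Final sanity check of both refitted models on the held-out test set.
print(gs.best_estimator_.score(X_test, y_test))
print(lrcv.score(X_test, y_test))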

Should Cross Validation Score be performed on original or split data?

When I want to evaluate my model with cross validation, should I perform cross validation on the original data (not split into train and test) or on the train/test data?
I know that training data is used for fitting the model and test data for evaluating it. If I use cross validation, should I still split the data into train and test, or not?
features = df.iloc[:,4:-1]
results = df.iloc[:,-1]
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
clf = LogisticRegression()
model = clf.fit(x_train, y_train)
accuracy_test = cross_val_score(clf, x_test, y_test, cv = 5)
Or should I do like this:
features = df.iloc[:,4:-1]
results = df.iloc[:,-1]
clf = LogisticRegression()
model = clf.fit(features, results)
accuracy_test = cross_val_score(clf, features, results, cv=5)
Or maybe something different?
Both your approaches are wrong.
In the first one, you apply cross validation to the test set, which is meaningless.
In the second one, you first fit the model on your whole data and then perform cross validation, which is again meaningless. Moreover, the approach is redundant (your fitted clf is not used by the cross_val_score method, which does its own fitting).
Since you are not doing any hyperparameter tuning (i.e. you seem to be interested only in performance assessment), there are two ways:
Either with a separate test set
Or with cross validation
First way (test set):
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
clf = LogisticRegression()
model = clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
accuracy_test = accuracy_score(y_test, y_pred)
Second way (cross validation):
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.utils import shuffle
clf = LogisticRegression()
# shuffle data first:
features_s, results_s = shuffle(features, results)
accuracy_cv = cross_val_score(clf, features_s, results_s, cv = 5, scoring='accuracy')
# fit the model afterwards with the whole data, if satisfied with the performance:
model = clf.fit(features, results)
I will try to summarize the "best practice" here:
1) If you want to train your model, fine-tune its parameters, and do a final evaluation, I recommend splitting your data into training|val|test.
You fit your model on the training part, then check different parameter combinations on the val part. Finally, when you're sure which classifier/parameters obtain the best result on the val part, you evaluate on the test part to get the final result.
Once you evaluate on the test part, you shouldn't change the parameters any more.
2) On the other hand, some people follow another way: they split their data into training and test, fine-tune their model using cross-validation on the training part, and at the end evaluate it on the test part (see the sketch below).
If your data is quite large, I recommend the first way, but if your data is small, the second.
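For completeness, here is a minimal sketch of the second way (cross-validated tuning on the training part, a single final evaluation on the held-out test part); the estimator and parameter grid are only placeholders:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

x_train, x_test, y_train, y_test = train_test_split(features, results,
                                                     test_size=0.3, random_state=0)

# Tune on the training part only, with 5-fold cross-validation.
param_grid = {'C': [0.01, 0.1, 1, 10]}  # placeholder grid
gs = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                  scoring='accuracy', cv=5)
gs.fit(x_train, y_train)

# Evaluate the refitted best model once on the untouched test part.
print(gs.best_params_, gs.score(x_test, y_test))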

How to run SVC classifier after running 10-fold cross validation in sklearn?

I'm relatively new to machine learning and would like some help with the following:
I ran a Support Vector Machine Classifier (SVC) on my data with 10-fold cross validation and calculated the accuracy score (which was around 89%). I'm using Python and scikit-learn to perform the task. Here's a code snippet:
def get_scores(features, target, classifier):
    X_train, X_test, y_train, y_test = train_test_split(features, target,
                                                         test_size=0.3)
    scores = cross_val_score(
        classifier,
        X_train,
        y_train,
        cv=10,
        scoring='accuracy',
        n_jobs=-1)
    return scores
get_scores(features_from_df,target_from_df,svm.SVC())
Now, how can I use my classifier (after running the 10-fold CV) to test it on X_test and compare the predicted results to y_test? As you may have noticed, I only used X_train and y_train in the cross validation process.
I noticed that sklearn has cross_val_predict:
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html - should I replace my cross_val_score with cross_val_predict? Just FYI: my target data column is binarized (it has values of 0s and 1s).
If my approach is wrong, please advise me on the best way to proceed.
Thanks!
You only need your X and y; do not split them into train and test.
Then you can pass your classifier (in your case the SVM) to the cross_val_score function to get the accuracy for each fold.
In just 3 lines of code:
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=10)
print(scores)
from sklearn.metrics import classification_report
classifier = svm.SVC()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred))
You're almost there:
# Build your classifier
classifier = svm.SVC()
# Train it on the entire training data set
classifier.fit(X_train, y_train)
# Get predictions on the test set
y_pred = classifier.predict(X_test)
At this point, you can use any metric from the sklearn.metrics module to determine how well you did. For example:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
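Regarding cross_val_predict, which the question mentions: it does not replace the final test-set evaluation, but it returns out-of-fold predictions for every training sample, which is convenient for a cross-validated classification report or confusion matrix. A short sketch, assuming X_train and y_train from your split are available as in the snippets above:
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict

clf = svm.SVC()

# Each training sample is predicted by a model that did not see it during
# fitting (it sat in the held-out fold when its prediction was made).
y_train_oof = cross_val_predict(clf, X_train, y_train, cv=10, n_jobs=-1)
print(classification_report(y_train, y_train_oof))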

How to see the validation error after each epoch in keras

I am using keras to train a model for regression. My code looks like:
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasRegressor(build_fn=baseline_model, epochs=100, batch_size=32, verbose=2)))
pipeline = Pipeline(estimators)
X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    train_size=0.75, test_size=0.25)
pipeline.fit(X_train, y_train)
The problem is that it is dramatically overfitting. How can I see the validation error after each epoch?
You can pass parameters to the KerasRegressor fit method:
validation_split: float (0. < x < 1). Fraction of the data to use as held-out validation data.
validation_data: tuple (x_val, y_val) or tuple (x_val, y_val, val_sample_weights) to be used as held-out validation data. Will override validation_split.
via the Pipeline fit method:
**fit_params : dict of string -> object
Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
Example:
pipeline.fit(X_train, y_train, mlp__validation_split=0.3)
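If you have already carved out an explicit validation set yourself, the quoted validation_data parameter can be routed through the pipeline in the same prefixed way. Note that data passed via fit parameters bypasses the pipeline's StandardScaler step, so you would scale it yourself with a scaler fitted on the training data only. A sketch under that assumption (X_val and y_val come from your own split):
from sklearn.preprocessing import StandardScaler

# The pipeline only scales X_train internally; validation data passed as a
# fit parameter is forwarded untouched, so scale it manually first.
val_scaler = StandardScaler().fit(X_train)
X_val_scaled = val_scaler.transform(X_val)

pipeline.fit(X_train, y_train, mlp__validation_data=(X_val_scaled, y_val))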
