Why is cross-validation applied before training a model? - python

So, I am struggling to understand why, as a common practice, a cross-validation step is applied to a model that has not been trained yet. An example of what I mean can be found here. A piece of the code is pasted below:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# create dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# prepare the cross-validation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# create model
model = LogisticRegression()
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
Questions:
What would be the purpose of the cross-validation at that point?
Does any training procedure take place in any part of that code?
How does RepeatedKFold contribute to tackling an imbalanced dataset (let's assume that this is the case)?
Thanks in advance!

cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
According to the documentation, cross_val_score fits the model using the given cross-validation technique.
In the code above, "model" contains the model that will be fit, and "cv" contains the cross-validation strategy that cross_val_score will use to build the training and validation splits and evaluate the model.
In other words, those lines are just definitions; the actual training and evaluation happen inside the cross_val_score function.
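As a quick sanity check (a minimal sketch reusing the variables from the snippet above; check_is_fitted and NotFittedError are standard sklearn utilities), you can verify that the model object you passed in was never fitted in place, because cross_val_score only trains clones of it:
from sklearn.utils.validation import check_is_fitted
from sklearn.exceptions import NotFittedError
try:
    check_is_fitted(model)  # the original `model` was never fitted in place
except NotFittedError:
    print('model passed to cross_val_score is still unfitted')
# to get a usable model for prediction, fit it explicitly after evaluation
model.fit(X, y)
check_is_fitted(model)  # now passes silently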
How does RepeatedKFold contribute to tackling an imbalanced dataset (let's assume that this is the case)?
KFold CV generally doesn't tackle an imbalanced dataset by itself; it just ensures that the result will not be biased by the particular choice of training/validation splits.
Repeated k-fold cross-validation provides a way to improve the estimated performance of a machine learning model. This involves simply repeating the cross-validation procedure multiple times and reporting the mean result across all folds from all runs. This mean result is expected to be a more accurate estimate of the true unknown underlying mean performance of the model on the dataset, as calculated using the standard error.
If you want to tackle an imbalanced dataset, you have to use a better metric than plain accuracy, such as 'balanced_accuracy' or 'roc_auc', and make sure both the training and validation splits contain both positive and negative cases (e.g. with stratified splitting).
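For example, a minimal sketch (the class weights and parameters are illustrative) that combines stratified, repeated splitting with an imbalance-aware metric, in the same style as the snippet above:
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
# an artificially imbalanced dataset, roughly 10% positives
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=1)
model = LogisticRegression(max_iter=1000)
# stratification keeps the class ratio roughly constant in every fold
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='balanced_accuracy', cv=cv, n_jobs=-1)
print('Balanced accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))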

Related

Sklearn - Can we use cross validation and batch training in same model?

Is it possible to train a sklearn random forest model using k-fold cross-validation while feeding the input set batch by batch? I ask because I have a memory issue: I cannot fit all the training data at once. Besides, I want to use cross-validation (not a train/test split). Is there any example of how to do that?
Update: I found this website where they present a library to do that. In it, there is an example as follows:
from dask_ml.wrappers import Incremental
from dask_ml.datasets import make_classification
import sklearn.linear_model
X, y = make_classification(chunks=25)
est = sklearn.linear_model.SGDClassifier()
clf = Incremental(est, scoring='accuracy')
clf.fit(X, y, classes=[0, 1])
However, I cannot understand where I would give my own X and y data to the model; I do not see where they come from. How can I fit my own training data to the model?
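(For what it's worth, a minimal sketch of one way to do this, assuming dask and dask_ml are installed; the array names and chunk size are purely illustrative. The X and y in the example above come from dask_ml's own make_classification, but you can wrap your own arrays as chunked dask arrays and Incremental will feed them to the estimator batch by batch via partial_fit:)
import numpy as np
import dask.array as da
from dask_ml.wrappers import Incremental
from sklearn.linear_model import SGDClassifier
# stand-ins for your own training data
X_np = np.random.rand(10000, 20)
y_np = np.random.randint(0, 2, size=10000)
# chunk the arrays so they are processed 1000 rows at a time
X = da.from_array(X_np, chunks=1000)
y = da.from_array(y_np, chunks=1000)
clf = Incremental(SGDClassifier(), scoring='accuracy')
clf.fit(X, y, classes=[0, 1])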

High recall, low precision with EasyEnsembleClassifier

I have a dataset which has 450,000 data points, 12 features and a label (0 or 1). I am using the imblearn library of Python because my dataset is imbalanced (ratio 1:50, class 1 is the minority). I am using EasyEnsembleClassifier as the classifier. My problem is that I get high recall but very low precision, as you can see from the image below (90% recall, 8% precision, 14% f1 score).
Here is my code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from dask_ml.preprocessing import RobustScaler
from imblearn.ensemble import EasyEnsembleClassifier
from sklearn.metrics import classification_report, f1_score, accuracy_score, precision_score, confusion_matrix
from sklearn import metrics
df = pd.read_csv(...)
X = df[['features...']]
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
clf = EasyEnsembleClassifier(n_estimators=50, n_jobs=-1, sampling_strategy = 1.0)
clf.fit(X_train, y_train)
X_test = scaler.transform(X_test)
y_pred = clf.predict(X_test)
------code for report------
.............
Output:
[classification report image: ~90% recall, 8% precision, 14% f1-score for class 1]
I tried different scalers, namely MinMaxScaler and StandardScaler. I tried changing the test/train split ratio and different parameters of EasyEnsembleClassifier. I also tried BalancedRandomForestClassifier from the same library, but the results are the same. Changing the number of estimators in the classifier parameters also doesn't change the result.
What is the reason for these results? What can I do to improve precision without damaging recall? It looks like I am doing something wrong in my code or I am missing an important concept.
Edit:
I still couldn't figure out the true reason for my problem, but since no one answered my question, here are some ideas about what could be behind this weird model, in case someone else encounters a similar problem:
Most probably my dataset is poorly labeled. It is possible that model cannot distinguish classes because they are very alike. I will try to generate some synthetic data to train my model again.
I did not test this, but some features may be harming the model. I need to inspect the features visually to find out if there is correlation between them and remove some, but I highly doubt this is the problem, because boosting classifiers should handle it automatically by weighting each feature.
Also 12 features in my case may not be enough. I may need more. Although it is not easy for my dataset to generate more features I will think about it.
Finally maybe undersampling is not suited for my dataset. I will give a shot to oversampling techniques or SMOTE if I feel desperate enough.
You could try other ensemble methods for class-imbalance learning. SMOTEBoost is one such method that combines boosting with a data-sampling method; essentially, it injects the SMOTE technique at each boosting iteration.
This article could be of interest to you.
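SMOTEBoost itself is not shipped with imbalanced-learn, but a related pattern, SMOTE oversampling inside an imblearn Pipeline wrapped around a boosting classifier, can be sketched as below (a rough, illustrative sketch; the choice of GradientBoostingClassifier and of the scoring metric are assumptions, and it reuses X_train/y_train from your code):
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
# SMOTE is applied only to the training part of each fold, never to the validation part
pipe = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('clf', GradientBoostingClassifier())
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_train, y_train, scoring='f1', cv=cv, n_jobs=-1)
print(scores.mean(), scores.std())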

Hyperparameter tuning and classification algorithm comparison

I have a doubt about classification algorithm comparison.
I am doing a project on hyperparameter tuning and classification model comparison for a dataset.
The goal is to find the best-fitted model with the best hyperparameters for my dataset.
For example: I have 2 classification models (SVM and Random Forest); my dataset has 1000 rows and 10 columns (9 columns are features and the last column is the label).
First of all, I split the dataset into 2 portions (80-20) for training (800 rows) and testing (200 rows) respectively. After that, I use grid search with CV = 10 to tune the hyperparameters on the training set for these 2 models (SVM and Random Forest). When the hyperparameters are identified for each model, I use them to compute the accuracy_score on the training and testing sets again in order to find out which model is the best one for my data (conditions: accuracy_score on the training set < accuracy_score on the testing set (not overfitting), and whichever model has the higher accuracy_score on the testing set is the best model).
However, SVM shows an accuracy_score of 100 on the training set and 83.56 on the testing set, which means SVM with the tuned hyperparameters is overfitting. On the other hand, Random Forest shows an accuracy_score of 72.36 on the training set and 81.23 on the testing set. It is clear that the accuracy_score on the testing set is higher for SVM than for Random Forest, but SVM is overfitting.
I have some question as below:
_ Is my method correct when I compare accuracy_score on the training and testing sets as above, instead of using cross-validation? (If I should use cross-validation, how do I do it?)
_ It is clear that the SVM above is overfitting, but its accuracy_score on the testing set is higher than that of Random Forest; could I conclude that SVM is the best model in this case?
Thank you!
It's good that you've done quite an analysis on your part to investigate the best model. However, I would suggest you elaborate on your investigation a bit. As you're searching for the best model for your data, "Accuracy" alone is not a good evaluation metric. You should also evaluate your models on "Precision Score", "Recall Score", "ROC", "Sensitivity", "Specificity", etc. Find out whether your data is imbalanced (if it is, there are techniques to work around that). After evaluating all those metrics you may come up with a decision.
For the training/testing part, you're quite on the right track, with only one issue (which is quite severe): every time you test your model on the test set and then adjust it, you're injecting a sort of bias. So I would say make 3 partitions of your data, and use cross-validation (sklearn has got what you need for this) on your training set; after cross-validation, you may use another partition, the validation set, for testing the generalization power of your model (performance on unseen data), and you may change some parameters after that. Only once you've come to a conclusion and tuned everything you needed to, use your test set. No matter what the results are (on the test set), don't change the model after that, as those scores represent the true capability of your model.
You can create 3 partitions of your data in the following way, for example:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_blobs
# Dummy dataset for example purpose
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, cluster_std=6.0)
# first partition i.e. "train-set" and "test-set"
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9, random_state=123)
# second partition, we're splitting the "train-set" into 2 sets, thus creating a new partition of "train-set" and "validation-set"
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size=0.9, random_state=123)
print(X_train.shape, X_test.shape, X_val.shape) # output : ((810, 2), (100, 2), (90, 2))
I would suggest splitting your data into three sets, rather than two:
Training
Validation
Testing
Training is used to train the model, as you have been doing. The validation set is used to evaluate the performance of a model trained with a given set of hyperparameters. The optimal set of hyperparameters is then used to generate predictions on the test set, which wasn't part of either training or hyperparameter selection. You can then compare performance on the test set between your classifiers.
The large decrease in performance on your SVM model on your validation dataset does suggest overfitting, though it is common for a classifier to perform better on the training dataset than an evaluation or test dataset.
For your second question: yes, your SVM would be overfitting, although in most machine-learning cases the training set's accuracy does not really matter. It is much more important to look at the testing set's accuracy. It is not unusual to have a higher training accuracy than testing accuracy, so I suggest not worrying too much about overfitting and looking mainly at the difference in testing accuracy. With the information provided, yes, you could say that the SVM is the best model in your case.
For your first question, you are already doing a type of cross-validation and it is an acceptable way to evaluate the model.
This might be a useful article for you to read
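If you do want to compare the two tuned models with cross-validation on the training set rather than a single train/test evaluation, a minimal sketch could look like this (the hyperparameter values and the dummy data are illustrative placeholders for the ones found by grid search and for your own 800-row training set):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
# dummy stand-in for your 800-row, 9-feature training set
X_train, y_train = make_classification(n_samples=800, n_features=9, n_informative=6, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
# plug in the hyperparameters that GridSearchCV found for each model
models = {'SVM': SVC(kernel='rbf', C=1.0),
          'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0)}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
    print('%s: %.3f (+/- %.3f)' % (name, scores.mean(), scores.std()))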

Nested cross-validation: How does cross_validate handle GridSearchCV as its input estimator?

The following code combines cross_validate with GridSearchCV to perform a nested cross-validation for an SVC on the iris dataset.
(Modified example of the following documentation page:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html#sphx-glr-auto-examples-model-selection-plot-nested-cross-validation-iris-py.)
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_validate, KFold
import numpy as np
np.set_printoptions(precision=2)
# Load the dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
# Set up possible values of parameters to optimize over
p_grid = {"C": [1, 10],
"gamma": [.01, .1]}
# We will use a Support Vector Classifier with "rbf" kernel
svm = SVC(kernel="rbf")
# Choose techniques for the inner and outer loop of nested cross-validation
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)
# Perform nested cross-validation
clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv, iid=False)
clf.fit(X_iris, y_iris)
best_estimator = clf.best_estimator_
cv_dic = cross_validate(clf, X_iris, y_iris, cv=outer_cv, scoring=['accuracy'], return_estimator=False, return_train_score=True)
mean_val_score = cv_dic['test_accuracy'].mean()
print('nested_train_scores: ', cv_dic['train_accuracy'])
print('nested_val_scores: ', cv_dic['test_accuracy'])
print('mean score: {0:.2f}'.format(mean_val_score))
cross_validate splits the data set in each fold into a training and a test set. In each fold, the input estimator is then trained based on the training set associated with the fold. The inputted estimator here is clf, a parameterized GridSearchCV estimator, i.e. an estimator that cross-validates itself again.
I have three questions about the whole thing:
If clf is used as the estimator for cross_validate, does it (in the course of the GridSearchCV cross validation) split the above mentioned training set into a subtraining set and a validation set in order to determine the best hyper parameter combination?
Out of all models tested via GridSearchCV, does cross_validate validate only the model stored in the best_estimator_ attribute?
Does cross_validate train a model at all (if so, why?) or is the model stored in best_estimator_ validated directly via the test set?
To make it clearer how the questions are meant, here is an illustration of how I imagine the double cross validation at the moment.
If clf is used as the estimator for cross_validate, does it split the above mentioned training set into a subtraining set and a validation set in order to determine the best hyper parameter combination?
Yes, as you can see here at line 230, the training set is again split into a sub-training set and a validation set (specifically at line 240).
Update: Yes, when you pass the GridSearchCV classifier into cross_validate, it will again split the training set into a test and a train set. Here is a link describing this in more detail. Your diagram and assumption are correct.
Out of all models tested via GridSearchCV, does cross_validate train & validate only the model stored in the variable best_estimator?
Yes, as you can see from the answers here and here, GridSearchCV returns the best_estimator_ in your case (since the refit parameter is True by default). However, this best estimator will have to be trained again.
Does cross_validate train a model at all (if so, why?) or is the model stored in best_estimator_ validated directly via the test set?
As per your third and final question: yes, cross_validate trains an estimator and returns it if return_estimator is set to True. See this line. This makes sense, since how else is it supposed to return the scores without training an estimator in the first place?
Update
The reason the model is trained again is that the default use case for cross_validate does not assume that you pass in the best classifier with the optimum parameters. In this case specifically, you are sending in a classifier from GridSearchCV, but if you send in any untrained classifier it is supposed to be trained. What I mean to say is that, yes, in your case it shouldn't need to train it again since you are already doing cross-validation using GridSearchCV and using the best estimator. However, there is no way for cross_validate to know this; hence, it assumes that you are sending in an un-optimized or rather untrained estimator, so it has to train it again and return the scores for it.
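To make the mechanics concrete, here is a rough, simplified sketch of what cross_validate does when it is handed the GridSearchCV estimator from the code above (reusing outer_cv, clf, X_iris and y_iris; scoring details, parallelism and bookkeeping are omitted):
from sklearn.base import clone
outer_scores = []
for train_idx, test_idx in outer_cv.split(X_iris):
    X_tr, X_te = X_iris[train_idx], X_iris[test_idx]
    y_tr, y_te = y_iris[train_idx], y_iris[test_idx]
    search = clone(clf)        # a fresh, unfitted GridSearchCV for this outer fold
    search.fit(X_tr, y_tr)     # inner CV: hyperparameters chosen on the outer training fold only
    outer_scores.append(search.score(X_te, y_te))  # refit best model scored on the held-out outer fold
print(outer_scores)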

Using sklearn cross_val_score and kfolds to fit and help predict model

I'm trying to understand using kfolds cross validation from the sklearn python module.
I understand the basic flow:
instantiate a model e.g. model = LogisticRegression()
fitting the model e.g. model.fit(xtrain, ytrain)
predicting e.g. model.predict(xtest)
use e.g. cross val score to test the fitted model accuracy.
Where I'm confused is using sklearn's KFold with cross_val_score. As I understand it, the cross_val_score function fits the model and predicts on each of the k folds, giving you an accuracy score for each fold.
e.g. using code like this:
kf = KFold(n_splits=5, shuffle=True, random_state=8)
lr = linear_model.LogisticRegression()
accuracies = cross_val_score(lr, X_train,y_train, scoring='accuracy', cv = kf)
So if I have a dataset with training and testing data, and I use the cross_val_score function with k folds to determine the accuracy of the algorithm on my training data for each fold, is the model now fitted and ready for prediction on the testing data?
That is, in the case above, can I go ahead and use lr.predict?
No, the model is not fitted. Looking at the source code of cross_val_score:
scores = parallel(delayed(_fit_and_score)(clone(estimator), X, y, scorer,
                                          train, test, verbose, None,
                                          fit_params)
                  for train, test in cv_iter)
As you can see, cross_val_score clones the estimator before fitting each fold's training data to it. cross_val_score gives you an array of scores which you can analyse to see how the estimator performs on the different folds of the data and whether it overfits. You can learn more about it here.
You need to fit the estimator on the whole training data once you are satisfied with the results of cross_val_score, before you can use it to predict on the test data.
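Putting it together, a minimal sketch of the full flow (the variable names mirror the snippet above; the dataset and split are illustrative):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
# illustrative data standing in for your own training/testing split
X, y = make_classification(n_samples=1000, n_features=10, random_state=8)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=8)
kf = KFold(n_splits=5, shuffle=True, random_state=8)
lr = LogisticRegression(max_iter=1000)
# 1) estimate performance on the training data via cross-validation (fits clones only)
accuracies = cross_val_score(lr, X_train, y_train, scoring='accuracy', cv=kf)
print(accuracies.mean())
# 2) fit the estimator itself on the whole training set
lr.fit(X_train, y_train)
# 3) only now is lr ready to predict on the held-out test data
y_pred = lr.predict(X_test)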
