I've always read that cross-validation can be used to create multiple splits and then average the resulting models to avoid overfitting, but I have not been able to find an example that does this.
Looking at scikit-learn, I only see examples of cross_val_score, which gives you the CV score across multiple splits so you can evaluate whether there is overfitting, such as here:
from sklearn.model_selection import cross_val_score
all_accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=5)
How can one use CV to actually average the models across the splits?
Related
So, I am struggling to understand why, as a common practice, a cross-validation step is applied to a model that has not been trained yet. An example of what I mean can be found here. A piece of the code is pasted below:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# create dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# prepare the cross-validation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# create model
model = LogisticRegression()
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
Questions:
What would be the purpose of the cross-validation at that point?
Does some training procedure take place in any part of that code?
How does RepeatedKFold contribute to tackling an unbalanced dataset (let's assume that this is the case)?
Thanks in advance!
cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
According to the documentation, cross_val_score fits the model using the given cross-validation technique, so no separate training step is needed beforehand.
In the code above, "model" contains the model that will be fit, and "cv" contains the cross-validation method that cross_val_score will use to build the training and validation sets and evaluate the model.
In other words, those lines are just definitions; the actual training and cross-validation happen inside the cross_val_score function.
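To make that concrete, here is a minimal sketch (not from the original answer) using scikit-learn's cross_validate with return_estimator=True, which also hands back the models that were fitted inside the procedure:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_validate
# same toy dataset and CV setup as in the question above
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# nothing has been trained yet: model is only a configured estimator
model = LogisticRegression()
# the training happens here, once per split, on clones of model
results = cross_validate(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, return_estimator=True)
print(results['test_score'].mean())  # mean accuracy over all splits
print(len(results['estimator']))     # 30 fitted copies (10 folds x 3 repeats)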
How does RepeatedKFold contribute to tackling an unbalanced dataset (let's assume that this is the case)?
K-fold CV generally doesn't tackle an unbalanced dataset; it just ensures that the result will not be biased by a particular choice of the training/validation split.
Repeated k-fold cross-validation provides a way to improve the estimated performance of a machine learning model. This involves simply repeating the cross-validation procedure multiple times and reporting the mean result across all folds from all runs. This mean result is expected to be a more accurate estimate of the true unknown underlying mean performance of the model on the dataset, as calculated using the standard error.
If you want to tackle an unbalanced dataset, you should use a better metric than accuracy, like 'balanced_accuracy' or 'roc_auc', and make sure that both the training and CV sets contain both positive and negative cases (e.g. by using stratified splits).
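As a minimal sketch of that suggestion (assuming an imbalanced binary problem; StratifiedKFold and the 'balanced_accuracy'/'roc_auc' scorers are standard scikit-learn components):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
# toy imbalanced data: roughly 90% negatives, 10% positives
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=1)
# stratified folds preserve the 90/10 class ratio, so every fold contains both classes
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
model = LogisticRegression(max_iter=1000)
for metric in ('accuracy', 'balanced_accuracy', 'roc_auc'):
    scores = cross_val_score(model, X, y, scoring=metric, cv=cv)
    print(metric, scores.mean())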
I have a question about comparing classification algorithms.
I am doing a project on hyperparameter tuning and classification model comparison for a dataset.
The goal is to find the best-fitting model with the best hyperparameters for my dataset.
For example: I have 2 classification models (SVM and Random Forest); my dataset has 1000 rows and 10 columns (9 columns are features and the last column is the label).
First of all, I split the dataset into 2 portions (80-20) for training (800 rows) and testing (200 rows), respectively. After that, I used Grid Search with CV = 10 to tune the hyperparameters of these 2 models (SVM and Random Forest) on the training set. Once the hyperparameters were identified for each model, I used them to compute the accuracy_score on the training and testing sets again in order to find out which model is the best one for my data (conditions: accuracy_score on the training set < accuracy_score on the testing set (not overfitting), and whichever model has the higher accuracy_score on the testing set is the best model).
However, SVM shows an accuracy_score of 100 on the training set and 83.56 on the testing set, which means SVM with tuned hyperparameters is overfitting. On the other hand, Random Forest shows an accuracy_score of 72.36 on the training set and 81.23 on the testing set. It is clear that the testing accuracy_score of SVM is higher than that of Random Forest, but SVM is overfitting.
I have some questions as below:
_ Is my method correct when I compare the accuracy_score on the training and testing sets as above instead of using cross-validation? (If I should use cross-validation, how do I do it?)
_ It is clear that the SVM above is overfitting, but its testing accuracy_score is higher than that of Random Forest; could I conclude that SVM is the best model in this case?
Thank you!
It's good that you've done quite an analysis on your part to investigate the best model. However, I would suggest you elaborate on your investigation a bit. As you're searching for the best model for your data, "Accuracy" alone is not a good evaluation metric for your models. You should also evaluate your models on "Precision Score", "Recall Score", "ROC AUC", "Sensitivity", "Specificity" etc. Find out whether your data is imbalanced (if it is, there are techniques to work around that). After evaluating all those metrics you can come to a decision.
For the training-testing part, you're quite on the right track, with only one issue (which is quite severe): every time you test your model on the test set and then keep tuning it, you're injecting a sort of bias. So I would say make 3 partitions of your data, and use cross-validation (sklearn has what you need for this) on your "training set". After cross-validation, you may use another partition, the "validation set", to test the generalization power of your model (performance on unseen data), and you may change some parameters after that. Only after you've come to a conclusion and tuned everything you needed to should you use your "test set". No matter what the results are on the test set, don't change the model after that, as those scores represent the true capability of your model.
You can create 3 partitions of your data in the following way, for example:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_blobs
# Dummy dataset for example purpose
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, cluster_std=6.0)
# first partition i.e. "train-set" and "test-set"
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9, random_state=123)
# second partition, we're splitting the "train-set" into 2 sets, thus creating a new partition of "train-set" and "validation-set"
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size=0.9, random_state=123)
print(X_train.shape, X_test.shape, X_val.shape) # output : ((810, 2), (100, 2), (90, 2))
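Building on those partitions, here is a hedged sketch of the workflow described above: grid search with CV on the training set, comparison of the candidates on the validation set with more metrics than accuracy, and a single final evaluation on the test set. The parameter grids and the final model choice are purely illustrative, not recommendations:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, accuracy_score
# hypothetical, minimal parameter grids, purely for illustration
candidates = {
    'SVM': (SVC(), {'C': [0.1, 1, 10], 'kernel': ['rbf', 'linear']}),
    'RandomForest': (RandomForestClassifier(random_state=123), {'n_estimators': [100, 300], 'max_depth': [None, 5]}),
}
best_models = {}
for name, (estimator, grid) in candidates.items():
    # 10-fold CV for hyperparameter tuning, on the training set only
    search = GridSearchCV(estimator, grid, cv=10, scoring='accuracy', n_jobs=-1)
    search.fit(X_train, y_train)
    best_models[name] = search.best_estimator_
    # compare candidates on the validation set, using more than accuracy
    print(name, search.best_params_)
    print(classification_report(y_val, best_models[name].predict(X_val)))
# only after settling on a model (hypothetically Random Forest here) touch the test set once
chosen = best_models['RandomForest']
print('test accuracy:', accuracy_score(y_test, chosen.predict(X_test)))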
I would suggest splitting your data into three sets, rather than two:
Training
Validation
Testing
Training is used to train the model, as you have been doing. The validation set is used to evaluate the performance of a model trained with a given set of hyperparameters. The optimal set of hyperparameters is then used to generate predictions on the test set, which wasn't part of either training or hyperparameter selection. You can then compare performance on the test set between your classifiers.
The large decrease in performance on your SVM model on your validation dataset does suggest overfitting, though it is common for a classifier to perform better on the training dataset than an evaluation or test dataset.
For your second question, yes, your SVM would be overfitting, although in most machine-learning cases the training set's accuracy does not really matter; it is much more important to look at the testing set's accuracy. It is not unusual to have a higher training accuracy than testing accuracy, so I suggest not focusing on the overfitting and looking only at the testing accuracy. With the information provided, yes, you could say that the SVM is the best model in your case.
For your first question, you are already doing a type of cross-validation and it is an acceptable way to evaluate the model.
This might be a useful article for you to read
I am using cross_val_score to compute the mean score for a regressor. Here's a small snippet.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
cross_val_score(LinearRegression(), X, y_reg, cv = 5)
Using this I get an array of scores. I would like to know how the scores on the validation set (as returned in the array above) differ from those on the training set, to understand whether my model is over-fitting or under-fitting.
Is there a way of doing this with the cross_val_score object?
You can use cross_validate instead of cross_val_score
According to the documentation:
The cross_validate function differs from cross_val_score in two ways -
It allows specifying multiple metrics for evaluation.
It returns a dict containing training scores, fit-times and score-times in addition to the test score.
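For instance, a minimal sketch (with dummy data standing in for X and y_reg) that asks cross_validate for the training scores as well, via return_train_score=True, which is off by default in current scikit-learn versions:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
# dummy regression data standing in for the question's X and y_reg
X, y_reg = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
results = cross_validate(LinearRegression(), X, y_reg, cv=5, return_train_score=True)
# a large gap between train and test scores per fold hints at over-fitting;
# low scores on both hint at under-fitting
print('train R^2:', results['train_score'])
print('test  R^2:', results['test_score'])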
Why would you want that? cross_val_score(cv=5) does that for you, as it splits your training data into 5 folds and checks the accuracy score on each of the 5 held-out subsets. This already gives you a way to detect whether your model is over-fitting.
Anyway, if you are eager to verify accuracy on your validation data, then you have to fit your LinearRegression first on X and y_reg.
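If you do go that manual route, a small sketch (reusing the dummy X and y_reg from the sketch above and a hypothetical hold-out split) could look like this:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# hold out a validation portion, fit on the rest
X_train, X_val, y_train, y_val = train_test_split(X, y_reg, test_size=0.2, random_state=0)
reg = LinearRegression().fit(X_train, y_train)
print('train R^2:', reg.score(X_train, y_train))
print('validation R^2:', reg.score(X_val, y_val))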
I have a sample of approximately 10,000 tweets that I want to classify into the categories "relevant" and "not relevant". I am using Python's scikit-learn for this model. I manually coded 1,000 tweets as "relevant" or "not relevant". Then, I ran a SVM model using 80% of the manually coded data as training data and the rest as test data. I obtained good results (prediction accuracy ~0.90), but to avoid overfitting I decided to use cross-validation on all 1,000 manually coded tweets.
Below is my code after already obtaining the tf-idf matrix for the tweets in my sample. "target" is an array listing whether the tweet was marked as "relevant" or "not relevant".
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
clf = SGDClassifier()
scores = cross_val_score(clf, X_tfidf, target, cv=10)
predicted = cross_val_predict(clf, X_tfidf, target, cv=10)
With this code, I was able to get predictions of what classes the 1,000 tweets belonged to, and I could compare that against my manual coding.
I'm stuck on what to do next in order to use my model to classify the other ~9,000 tweets that I did not manually code. I was thinking of using cross_val_predict again, but I'm not sure what to pass as the third argument, since the class labels are what I'm trying to predict.
Thanks for all your help in advance!
cross_val_predict is not a method for actually obtaining predictions from a final model. Cross-validation is a technique for model selection/evaluation, not for training the model you will use. cross_val_predict is a very specific function (it gives you the predictions of the many models trained during the cross-validation procedure). For actual model building you are supposed to use fit to train your model on all of your labeled data and predict to get predictions for new data. No cross-validation is involved here; as said before, cross-validation is for model selection (choosing your classifier, hyperparameters, etc.), not for training the actual model.
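A minimal, self-contained sketch of that split between evaluation and actual prediction (the tiny tweet lists below are hypothetical stand-ins for the 1,000 coded and ~9,000 uncoded tweets, and the tf-idf step is included so the unlabeled tweets go through the same fitted vectorizer):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
# hypothetical stand-ins for the question's data
labeled_tweets = ['big match tonight', 'lunch was great', 'our team won the game', 'just woke up', 'what a goal by the striker', 'coffee time again']
target = ['relevant', 'not relevant', 'relevant', 'not relevant', 'relevant', 'not relevant']
unlabeled_tweets = ['incredible game last night', 'need more coffee']
# fit the tf-idf vectorizer on the labeled tweets only
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(labeled_tweets)
clf = SGDClassifier()
# cross-validation: model evaluation/selection only (cv kept tiny for this toy example)
print('CV accuracy:', cross_val_score(clf, X_tfidf, target, cv=3).mean())
# actual model: fit once on ALL labeled data, then predict the uncoded tweets,
# transformed with the SAME fitted vectorizer
clf.fit(X_tfidf, target)
predicted_labels = clf.predict(vectorizer.transform(unlabeled_tweets))
print(predicted_labels)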
I am using the Python sklearn library for classifying data. Below is the code I have implemented. I just want to ask: is this a correct way of classifying? I mean, can the following code potentially remove all the biases? And is it 10-fold cross-validation?
from sklearn import cross_validation
from sklearn.neighbors import KNeighborsClassifier

cv = cross_validation.ShuffleSplit(n_samples, n_iter=3, test_size=0.1, random_state=0)
knn = KNeighborsClassifier(algorithm='auto', leaf_size=1, metric='minkowski',
                           n_neighbors=2, p=2, weights='uniform')
knn_score = cross_validation.cross_val_score(knn, x_data_arr, target_arr, cv=cv)
print "Accuracy =", knn_score.mean()
Thanks!!
If you want a 10-fold cross-validation using ShuffleSplit, you should set n_iter=10. However, as is shown in the documentation: contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.
The other option, if you want a 10-fold cross-validation with non-overlapping folds, is to use the KFold function with n_folds=10.
The parameter cv in cross_validation.cross_val_score() controls the number of folds when you pass an integer. Its default value is 3, which means it runs 3-fold cross-validation by default. If you want to run a 10-fold cross-validation, set cv=10. (If you pass a CV splitter object instead, like the ShuffleSplit above, that object determines the splits.) I hope this helps.
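As a short sketch against the current sklearn.model_selection API (the old cross_validation module has since been removed, and KFold's n_folds argument is now n_splits; the iris data here is just a stand-in for x_data_arr and target_arr):
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
# toy data standing in for x_data_arr / target_arr
x_data_arr, target_arr = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=2, leaf_size=1)
# option 1: pass an explicit 10-fold splitter
cv = KFold(n_splits=10, shuffle=True, random_state=0)
print(cross_val_score(knn, x_data_arr, target_arr, cv=cv).mean())
# option 2: pass an integer; for a classifier, cv=10 means stratified 10-fold CV
print(cross_val_score(knn, x_data_arr, target_arr, cv=10).mean())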