I have a question about comparing classification algorithms.
I am doing a project on hyperparameter tuning and classification model comparison for a dataset.
The goal is to find the best-fitting model, with the best hyperparameters, for my dataset.
For example: I have 2 classification models (SVM and Random Forest), and my dataset has 1000 rows and 10 columns (9 feature columns, and the last column is the label).
First, I split the dataset into 2 portions (80-20) for training (800 rows) and testing (200 rows) respectively. Then I use Grid Search with CV = 10 to tune the hyperparameters of both models (SVM and Random Forest) on the training set. Once the hyperparameters are identified for each model, I use them to compute accuracy_score on the training and testing sets again, in order to find out which model is best for my data. My conditions are: accuracy_score on the training set < accuracy_score on the testing set (not overfitting), and the model with the higher accuracy_score on the testing set is the best model.
However, SVM shows a training accuracy_score of 100 and a testing accuracy_score of 83.56, which means the tuned SVM is overfitting. On the other hand, Random Forest shows a training accuracy_score of 72.36 and a testing accuracy_score of 81.23. The testing accuracy_score of SVM is clearly higher than that of Random Forest, but SVM is overfitting.
I have some questions:
- Is my method correct when I compare accuracy_score on the training and testing sets as above instead of using cross-validation? (If I should use cross-validation, how do I do it?)
- SVM above is clearly overfitting, but its testing accuracy_score is higher than that of Random Forest. Could I conclude that SVM is the best model in this case?
Thank you!
It's good that you've already done a fair amount of analysis to find the best model. However, I would suggest extending your investigation a bit. When you're searching for the best model for your data, "Accuracy" alone is not a good evaluation metric. You should also evaluate your models on "Precision", "Recall" (also called "Sensitivity"), "Specificity", "ROC AUC", etc. Find out whether your data is imbalanced (if it is, there are techniques for working around that). After evaluating all those metrics, you can come to a decision.
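As a rough sketch of what looking beyond accuracy can involve: the synthetic, mildly imbalanced dataset and the random forest below are assumptions chosen for illustration, not the asker's data or tuned model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 9 features, ~20% positives
X, y = make_classification(n_samples=1000, n_features=9, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# Per-class precision, recall and F1 in one table
print(classification_report(y_test, y_pred, digits=3))
# ROC AUC needs scores or probabilities, not hard predictions
auc = roc_auc_score(y_test, y_score)
print("ROC AUC:", round(auc, 3))
```

On imbalanced data the per-class recall in this report is often far more revealing than the single accuracy figure.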
For the training-testing part, you're mostly on the right track, with one issue (which is quite severe): when you pick a model based on its test-set score, you inject a bias into that score. So I would make 3 partitions of your data. Use cross-validation (sklearn has what you need for this) on your "training set". After cross-validation, use another partition, the "validation set", to check the generalization power of your model (its performance on unseen data); you may change some parameters after that. Only once you have reached a conclusion and tuned everything you needed to should you use your "test set". Whatever the results on the test set are, don't change the model after that, because those scores represent the true capability of your model.
You can create 3 partitions of your data like this, for example:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_blobs
# Dummy dataset for example purpose
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, cluster_std=6.0)
# first partition i.e. "train-set" and "test-set"
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9, random_state=123)
# second partition, we're splitting the "train-set" into 2 sets, thus creating a new partition of "train-set" and "validation-set"
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size=0.9, random_state=123)
print(X_train.shape, X_test.shape, X_val.shape)  # output: (810, 2) (100, 2) (90, 2)
I would suggest splitting your data into three sets, rather than two:
Training
Validation
Testing
Training is used to train the model, as you have been doing. The validation set is used to evaluate the performance of a model trained with a given set of hyperparameters. The optimal set of hyperparameters is then used to generate predictions on the test set, which wasn't part of either training or hyper parameter selection. You can then compare performance on the test set between your classifiers.
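A minimal sketch of that three-set workflow, assuming a synthetic dataset, an SVM, and a small hypothetical grid over `C` (none of these come from the question):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption)
X, y = make_classification(n_samples=1000, n_features=9, random_state=0)

# 80/10/10 split: carve off 20%, then halve it into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Pick the hyperparameter using the validation set only
best_C, best_val_acc = None, -1.0
for C in [0.01, 0.1, 1, 10, 100]:  # hypothetical grid
    candidate = SVC(C=C).fit(X_train, y_train)
    val_acc = accuracy_score(y_val, candidate.predict(X_val))
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

# Touch the test set exactly once, with the chosen hyperparameter
final_model = SVC(C=best_C).fit(X_train, y_train)
test_acc = accuracy_score(y_test, final_model.predict(X_test))
print("best C:", best_C, "validation acc:", best_val_acc, "test acc:", test_acc)
```

Repeating the same loop for each classifier and comparing the resulting test scores gives the comparison described above.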
The large drop in your SVM's performance from the training set to the test set does suggest overfitting, though it is common for a classifier to perform better on the training dataset than on an evaluation or test dataset.
For your second question: yes, your SVM is overfitting, although in most machine-learning cases the training set's accuracy does not really matter. It is much more important to look at the testing set's accuracy. It is not unusual to have a higher training accuracy than testing accuracy, so I suggest not focusing on overfitting and instead comparing the testing accuracy. With the information provided, yes, you could say that the SVM is the best model in your case.
For your first question: you are already doing a type of cross-validation, and it is an acceptable way to evaluate the model.
This might be a useful article for you to read
I'm trying to do sentiment analysis on text documents, but I got lost in the steps.
So my goal is to:
Train SVM, KNN and Naive Bayes algorithms
Use gridsearch to find best parameters
Evaluate models accuracy and find the best one
Use those parameters and get optimal result
In almost every guide I find, the train_test_split method is used. But I've read that the holdout validation method isn't very accurate. That's when you split the data into train and test sets, for example 80:20, and hold out the 20% for testing. So instead I wanted to use K-fold cross-validation. But the question is how I could use it, and do I still need to split my data into train and test sets?
So far what I've tried is:
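from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
sentences = svietimas_data['text']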
y = svietimas_data['sentiment']
sentences_train, sentences_test, y_train, y_test = train_test_split(sentences, y, test_size=0.1, random_state=1)
sentences_train, sentences_validate, y_train, y_validate = train_test_split(sentences_train, y_train, test_size=0.1111, random_state=1)
classifier = KNeighborsClassifier()
weights = ['uniform', 'distance']
metric = ['euclidean', 'manhattan', 'minkowski']
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range, weights = weights, metric = metric )
vectorizer = TfidfVectorizer(lowercase=False, max_df=100)
vectorizer.fit(sentences_train)
X_train = vectorizer.transform(sentences_train)
X_validate = vectorizer.transform(sentences_validate)
X_test = vectorizer.transform(sentences_test)
grid_search = GridSearchCV(classifier, param_grid, cv=10,scoring='accuracy', return_train_score=False)
grid_search.fit(X_train, y_train)
print(grid_search.best_score_)
print(grid_search.best_params_)
I split the data into train, validate and test sets (80:10:10). I use my train data for the grid search to find the best parameters, and after that I put those parameters into my classifier and use it with the validate and test sets to get the final results, like this:
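# apply the tuned hyperparameters before refitting
classifier.set_params(**grid_search.best_params_)
classifier.fit(X_train, y_train)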
y_pred_validate = classifier.predict(X_validate)
print(classification_report(y_validate, y_pred_validate))
y_pred_test = classifier.predict(X_test)
print(classification_report(y_test, y_pred_test))
But since this method isn't very accurate, could I instead use my whole data set in the grid search and leave it at that? Or, after getting the best parameters with the 80% data set, should I put those parameters into the classifier and use K-fold cross-validation with the full data set? Because by using grid search or K-fold with the train (80%) data I waste 20% of the data, and as far as I know, if I used 100% of the data, K-fold would split it into, for example, k=5 sets, and the data wouldn't count as seen or overfitted?
Or what my exact steps should be to correctly achieve that goal?
You're doing parameter tuning, which is equivalent to training: this is why you must keep a fresh test set to evaluate the final model (otherwise performance could be overestimated).
However since you're using CV in the first level of training, you need only one more test set. So the typical process would be like this:
1. Split the data into a training set and a test set.
2. Apply CV to the training set for all combinations of parameters (grid search), then pick the best parameters.
3. Re-train the final model on the full training set with the best parameters.
4. Evaluate the model on the test set.
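The four steps above can be sketched as follows. The synthetic dataset and the small KNN grid are assumptions for illustration; `GridSearchCV` with its default `refit=True` covers steps 2 and 3 in a single `fit` call.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the vectorized text data (assumption)
X, y = make_classification(n_samples=500, n_features=10, random_state=1)

# Step 1: hold out a test set that the tuning never sees
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Step 2: cross-validated grid search on the training set only
param_grid = {"n_neighbors": [3, 5, 11], "weights": ["uniform", "distance"]}  # hypothetical grid
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")

# Step 3: refit=True (the default) retrains the best configuration on all of X_train
grid_search.fit(X_train, y_train)

# Step 4: a single evaluation on the untouched test set
test_accuracy = grid_search.score(X_test, y_test)
print(grid_search.best_params_, grid_search.best_score_, test_accuracy)
```

Here `best_score_` is the mean cross-validated accuracy on the training folds, while `test_accuracy` is the figure to report for the final model.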
But since this method isn't very accurate could i instead use my whole data set on gridsearch and thats it?
If you don't evaluate on a fresh test set after parameter tuning, you might be overfitting (the best parameters may be best only by chance) without knowing it, and the performance estimate would be biased.
or after getting best parameters with 80% data set I should put those parameters into classifier and use K-folds cross validation with full data set?
It is possible to use CV for the last stage of evaluation as well, but it's not so simple: you would have to use nested CV. It's usually not worth it, and it takes a lot more time because you have to repeat the parameter tuning stage for each training run inside the top-level CV.
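For reference, a minimal sketch of nested CV in scikit-learn, assuming a synthetic dataset and a hypothetical one-parameter grid: the tuned estimator (a `GridSearchCV` object) is itself passed to `cross_val_score`, so the whole grid search is repeated inside every outer fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=1)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)  # used for tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)  # used for evaluation

# The inner loop: a grid search that tunes n_neighbors (hypothetical grid)
tuned_knn = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 11]}, cv=inner_cv)

# The outer loop: each outer training fold runs its own full grid search
nested_scores = cross_val_score(tuned_knn, X, y, cv=outer_cv, scoring="accuracy")
print(nested_scores.mean(), nested_scores.std())
```

With 5 outer folds, 3 inner folds and 3 candidates this already fits 5 x (3 x 3 + 1) models, which is where the extra cost comes from.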
Because using gridsearch or k-folds with train (80%) data i waste 20% of the data
Actually you don't waste the data. The test set is needed only for the purpose of reliable evaluation, but once this is done you could perfectly re-train your model on the full data.
Also, it is a bad sign when 20% of the data matters a lot for performance: it means the model probably doesn't have a large enough training set, and even the full data might not be enough.
So, I am struggling to understand why, as a common practice, a cross-validation step is applied to a model that has not been trained yet. An example of what I mean can be found here. A piece of the code is pasted below:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# create dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# prepare the cross-validation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# create model
model = LogisticRegression()
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
Questions:
What would be the purpose of the cross-validation at that point?
Does some training procedure take place on any part of that code?
How does RepeatedKFold contribute to tackling an unbalanced dataset (let's assume that this is the case)?
Thanks in advance!
cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
According to the documentation, "cross_val_score" fits the model itself, using the given cross-validation technique.
In the code above, "model" contains the model that will be fit, and "cv" describes the cross-validation method that "cross_val_score" will use to build the training and CV sets and evaluate the model.
In other words, those are just definitions; the actual training and CV happen inside the "cross_val_score" function.
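To make that concrete, here is a small sketch (on a smaller synthetic dataset than the one in the question, which is an assumption) that compares `cross_val_score` with a hand-written loop doing the same thing. It also checks that the `model` object passed in is left unfitted, because scikit-learn fits a clone of it on each fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=20, random_state=1)
model = LogisticRegression(max_iter=1000)
cv = KFold(n_splits=5, shuffle=True, random_state=1)

# Training happens in here: one fit per fold, on a clone of `model`
scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)

# Hand-written equivalent: fit a fresh model on each training fold,
# then score it on the held-out fold
manual_scores = []
for train_idx, test_idx in cv.split(X):
    fold_model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

print(scores)
print(manual_scores)
# The original object was only used as a template and has no fitted coefficients
print(hasattr(model, "coef_"))
```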
How does RepeatedKFold contribute to tackling an unbalanced dataset (let's assume that this is the case)?
KFold CV generally doesn't tackle an unbalanced dataset; it just ensures that the result is not biased by the choice of the training/CV datasets.
Repeated k-fold cross-validation provides a way to improve the estimated performance of a machine learning model. This involves simply repeating the cross-validation procedure multiple times and reporting the mean result across all folds from all runs. This mean result is expected to be a more accurate estimate of the true unknown underlying mean performance of the model on the dataset, as calculated using the standard error.
If you want to tackle an unbalanced dataset, you have to use a better metric than accuracy, like 'balanced_accuracy' or 'roc_auc', and make sure both the training and CV datasets contain both positive and negative cases.
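One way to do both at once, sketched under the assumption of a synthetic 90/10 dataset: `RepeatedStratifiedKFold` keeps the class ratio in every fold, and the `scoring` argument switches the metric.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic imbalanced data (assumption): roughly 90% negatives, 10% positives
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=1)

# Stratified variant of RepeatedKFold: every fold keeps roughly the 90/10 ratio
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
model = LogisticRegression(max_iter=1000)

acc = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
bal_acc = cross_val_score(model, X, y, scoring="balanced_accuracy", cv=cv)

print("accuracy:          %.3f (%.3f)" % (acc.mean(), acc.std()))
print("balanced accuracy: %.3f (%.3f)" % (bal_acc.mean(), bal_acc.std()))
```

On data like this, plain accuracy tends to look better than balanced accuracy, because predicting the majority class is already right most of the time.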
I'd like to manually analyse the errors that my ML model (whichever it is) makes, comparing its predictions with the labels. From my understanding, this should be done on instances of the validation set, not the training set.
I trained my model through GridSearchCV and extracted the best_estimator_, i.e. the model that performed best during cross-validation and was then retrained on the entire training set.
Therefore, my question is: how can I get prediction on a validation set to compare with the labels (without touching the test set), if my best model is re-trained on the whole training set?
One solution would be to split the training set further before performing the GridSearchCV, but I guess there must be a better solution, for example to get the predictions on the validation sets during the cross validation. Is there a way to get these prediction for the best estimator?
Thank you!
You can compute a validation curve with the model that you obtained from GridSearchCV. Read the documentation here. You will just need to define arrays for the hyperparameters that you want to inspect and a scoring function. Here is an example:
train_scores, valid_scores = validation_curve(model, X_train, y_train, param_name="alpha", param_range=np.logspace(-7, 3, 3), cv=5, scoring="accuracy")
I understood my conceptual error; I'll post it here since it may help other ML beginners like me!
The solution that should work is to use cross_val_predict, splitting the folds in the same way as in GridSearchCV. cross_val_predict re-trains the model on each fold and does not use the previously trained model! So the result is the same as getting the predictions on the validation sets during GridSearchCV.
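A sketch of that idea, assuming a synthetic dataset, a logistic regression, and a hypothetical grid over `C`. The key point is passing the same splitter object (with a fixed `random_state`) to both `GridSearchCV` and `cross_val_predict`, so the folds match.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV,
    StratifiedKFold,
    cross_val_predict,
    train_test_split,
)

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# One splitter with a fixed seed, reused below so the folds are identical
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

grid_search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}, cv=cv)
grid_search.fit(X_train, y_train)

# best_estimator_ is cloned and refit on each training fold, so every row of
# X_train is predicted by a model that never saw it during fitting
val_pred = cross_val_predict(grid_search.best_estimator_, X_train, y_train, cv=cv)

# Indices of the training rows whose out-of-fold prediction is wrong
errors = (val_pred != y_train).nonzero()[0]
print(len(errors), "misclassified rows to inspect; the test set is untouched")
```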
I have some classifiers. I want to get the metrics for the classifier with classification_report
I used cross_val_predict to get the predictions, and then passed them to classification_report.
I also use the output of cross_val_predict to plot the confusion matrix.
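from sklearn.model_selection import KFold, cross_val_predict
from sklearn.metrics import classification_report, confusion_matrix
labels = get_labels()  # ground truth
result = cross_val_predict(classifier, features, labels, cv=KFold(n_splits=10, shuffle=True, random_state=seed))
report = classification_report(labels, result, digits=3, target_names=['no', 'yes'], output_dict=True)
# y_true goes first, then y_pred; the class order must be passed by keyword
cm = confusion_matrix(labels, result, labels=[no, yes])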
In the cross_val_predict documentation:
Passing these predictions into an evaluation metric may not be a valid way to measure generalization performance. Results can differ from cross_validate and cross_val_score unless all tests sets have equal size and the metric decomposes over samples.
So, is it the wrong way to do that? How should I do that ?
I would say that your process should look something like this:
1. Train/test split
2. Model selection using a (cross-)validation set
3. Retrain your model using the whole train set
4. Evaluate on your test split from step 1
If you do not have a lot of data, training with KFold should give you more reliable results than a single train/test split but as a rule of thumb, consider that you should evaluate on a dataset/split that has not been used before, even if it was only used for model selection or early stopping.
Back to your question: cross_val_predict splits the input array into K folds, trains one model per fold on the remaining data, uses each model to predict its held-out fold, and then combines the K sets of predictions. I think you can use that to get a general idea of the cross-validation results (e.g. if you want to plot them or calculate additional metrics), but definitely not to evaluate your model.
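If the goal is per-fold metrics rather than pooled predictions, one option is `cross_validate` with several scorers. This is a sketch on synthetic binary data with a decision tree standing in for the asker's classifier (both are assumptions); each metric is computed separately on every test fold and then averaged.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-ins for `features` and `labels` (assumption)
features, labels = make_classification(n_samples=500, n_features=10, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

metrics = ["accuracy", "precision", "recall", "f1"]
results = cross_validate(
    DecisionTreeClassifier(random_state=0), features, labels, cv=cv, scoring=metrics
)

# One score per fold for every metric, reported as mean and standard deviation
for name in metrics:
    fold_scores = results[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.3f} (+/- {fold_scores.std():.3f})")
```

The spread across folds is information that a single report built from pooled predictions does not show.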
I've used two approaches with the same sklearn decision tree, one using a validation set and the other using K-Fold. However, I'm not sure whether I'm actually achieving anything by using KFold. Technically the cross-validation does show a 5% rise in accuracy, but I'm not sure if that's just a peculiarity of this particular data skewing the result.
For my implementation of KFold I first split the training set into segments using:
f = KFold(n_splits=8)
f.get_n_splits(data)
And then got data-frames from it by using
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
This happens in a loop, as seen in many online tutorials on how to do it. However, here comes the tricky part. The tutorial I saw used a .train() function, which I do not think this decision tree classifier has. Instead, I just do this:
clf = tree.DecisionTreeClassifier()  # separate name so the sklearn `tree` module is not overwritten
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
The accuracy scores achieved are:
Accuracy score: 0.79496591505
Accuracy score: 0.806502359727
Accuracy score: 0.800734137389
... and so on
But I am not sure if I'm actually making my classifier any better by doing this, as the scores go up and down. Isn't this just comparing 9 independent results together? Is the purpose of K-fold not to train the classifier to be better?
I've read similar questions and found that K-fold is meant to provide a way to compare between "independent instances" but I wanted to make sure that was the case, not that my code was flawed in some way.
Is the purpose of K-fold not to train the classifier to be better?
K-fold does not train the classifier to be better. Its purpose is to check that the classifier is not overfitting the training data: on each fold you hold out a separate test set that the classifier has not seen and measure the accuracy on it. You then average the fold scores to see how well your classifier performs.
Isn't this just comparing 9 independent results together?
Yes, you compare (and average) the different scores to see how well your classifier performs.
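A minimal sketch of that loop, assuming synthetic data in NumPy arrays instead of the asker's data frames. Each fold trains a fresh tree, so the fold scores are independent estimates of the same quantity, and the model that is finally kept is trained separately.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=800, n_features=10, random_state=0)
kf = KFold(n_splits=8, shuffle=True, random_state=0)

fold_scores = []
for train_index, test_index in kf.split(X):
    clf = DecisionTreeClassifier(random_state=0)  # fresh, untrained tree per fold
    clf.fit(X[train_index], y[train_index])
    fold_scores.append(clf.score(X[test_index], y[test_index]))

# The number to report is the average (and spread), not any single fold
print("mean accuracy:", np.mean(fold_scores), "std:", np.std(fold_scores))

# The folds only estimate performance; the model you keep is trained once on all the data
final_clf = DecisionTreeClassifier(random_state=0).fit(X, y)
```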
In general, cross-validation helps you detect overfitting. To do that, you split the data into multiple parts and evaluate the loss, accuracy or other metrics (e.g. F1 score) on each held-out part. A good introduction can be found on the official site [1].
In addition I would recommend using StratifiedKFold [2] instead of KFold.
skf = StratifiedKFold(n_splits=8)
skf.get_n_splits(X, y)
This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
So each fold has roughly the same class proportions as the full dataset.
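A small sketch of the difference, using a toy label vector sorted by class (an assumption chosen to make the effect obvious): it prints the share of positive labels in each test fold for `StratifiedKFold` and for plain, unshuffled `KFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels sorted by class: 80 negatives followed by 20 positives
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # the features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5)
stratified_rates = [y[test].mean() for _, test in skf.split(X, y)]

kf = KFold(n_splits=5)
plain_rates = [y[test].mean() for _, test in kf.split(X)]

print("StratifiedKFold positive share per fold:", stratified_rates)
print("KFold positive share per fold:          ", plain_rates)
```

With these labels every stratified fold contains 20% positives, whereas unshuffled KFold puts all the positives into the last fold, so four of its five test folds contain no positive example at all.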