I am using the Python sklearn library for classification of data. Following is the code I have implemented. I just want to ask: is this the correct way of classifying? I mean, can the following code potentially remove all the biases? And is it 10-fold (k-fold) cross-validation?
from sklearn import cross_validation
from sklearn.neighbors import KNeighborsClassifier

cv = cross_validation.ShuffleSplit(n_samples, n_iter=3, test_size=0.1, random_state=0)
knn = KNeighborsClassifier(algorithm='auto', leaf_size=1, metric='minkowski',
                           n_neighbors=2, p=2, weights='uniform')
knn_score = cross_validation.cross_val_score(knn, x_data_arr, target_arr, cv=cv)
print "Accuracy =", knn_score.mean()
Thanks!!
If you want 10-fold cross-validation using ShuffleSplit, you should set n_iter=10. However, as the documentation notes: contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.
The other option, if you want 10-fold cross-validation with folds that are guaranteed to be different, is to use KFold with n_folds=10.
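For illustration, here is a minimal sketch of that second option, written against the current sklearn.model_selection API (in older releases the same classes live in sklearn.cross_validation, where KFold also needs the number of samples and an n_folds argument instead of n_splits); knn, x_data_arr and target_arr are the objects from the question:

from sklearn.model_selection import KFold, cross_val_score

# 10 distinct folds, shuffled once before splitting
cv10 = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(knn, x_data_arr, target_arr, cv=cv10)
print("Accuracy = %0.3f (+/- %0.3f)" % (scores.mean(), scores.std()))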
The cv parameter of cross_validation.cross_val_score() specifies the cross-validation splitting strategy. When you pass an integer, it is the number of folds; it defaults to 3, so 3-fold cross-validation is run by default. If you want 10-fold cross-validation, set cv=10. I hope this helps.
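For example, with the question's objects (a sketch using the same older API as the question):

# cv=10 is shorthand for 10 folds (StratifiedKFold for classifiers, plain KFold otherwise)
knn_score = cross_validation.cross_val_score(knn, x_data_arr, target_arr, cv=10)
print "Accuracy =", knn_score.mean()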
Related
I've always read that cross-validation can be used to create multiple splits and then average the models to avoid overfitting, but I have not been able to find an example that actually does that.
Looking at scikit-learn, I only see examples of cross_val_score that give you the CV score across multiple splits so you can evaluate whether there is overfitting, such as here
from sklearn.model_selection import cross_val_score
all_accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=5)
How can one use CV to actually average the models across the splits?
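Concretely, is something like the following sketch what is meant by averaging? (The helper names are my own, and I'm assuming X and y are numpy arrays and estimator is any scikit-learn classifier.)

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def fit_fold_models(estimator, X, y, n_splits=5):
    # fit one clone of the estimator per training fold and keep all of them
    models = []
    for train_idx, _ in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        m = clone(estimator)
        m.fit(X[train_idx], y[train_idx])
        models.append(m)
    return models

def average_predict_proba(models, X_new):
    # average the class probabilities predicted by the fold models
    return np.mean([m.predict_proba(X_new) for m in models], axis=0)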
It seems basic, but I can't see the difference and the advantages or disadvantages between the following 2 ways:
first way:
from sklearn.model_selection import KFold

kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf.fit(X_train, y_train)
    clf.score(X_test, y_test)
second way:
cross_val_score(clf, X, y, cv=2)
It seems that the two ways do the same thing, and the second one is shorter (one line).
What am I missing?
What are the differences, and the advantages or disadvantages of each way?
Arguably, the best way to see such differences is to experiment, although here the situation is rather easy to discern:
clf.score is called inside the loop; hence, after the loop finishes, you are left with only the score on the last validation fold, and everything computed on the previous k-1 folds has been discarded.
cross_val_score, on the other hand, returns the scores from all k folds. It is generally preferable, but it lacks a shuffle option (and shuffling is almost always advisable), so you either need to shuffle the data manually first, as shown here, or use it with cv=KFold(n_splits=k, shuffle=True).
A disadvantage of the for loop + KFold approach is that it runs serially, while the CV procedure in cross_val_score can be parallelized across multiple cores with the n_jobs argument.
A limitation of cross_val_score is that it cannot be used with multiple metrics, but in that case you can use cross_validate, as shown in this thread - there is no need to resort to for + KFold.
The use of KFold in a for loop gives additional flexibility for cases where neither cross_val_score nor cross_validate is adequate, for example using the scikit-learn wrapper for Keras while still getting all the metrics returned by native Keras during training, as shown here; or if you want to permanently store the different folds in separate variables/files, as shown here.
In short:
if you just want the scores for a single metric, stick to cross_val_score (shuffle first and parallelize).
if you want multiple metrics, use cross_validate (again, shuffle first and parallelize).
if you need a greater degree of control over, or monitoring of, the whole CV process, revert to using KFold in a for loop accordingly.
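To make the first two recommendations concrete, here is a minimal sketch (assuming some classifier clf and arrays X, y; the metric names are just examples):

from sklearn.model_selection import KFold, cross_val_score, cross_validate

cv = KFold(n_splits=5, shuffle=True, random_state=0)  # shuffling handled by the CV object

# single metric, folds evaluated in parallel
scores = cross_val_score(clf, X, y, cv=cv, n_jobs=-1)

# several metrics in one pass
results = cross_validate(clf, X, y, cv=cv, n_jobs=-1,
                         scoring=['accuracy', 'f1_macro'])
print(scores.mean(), results['test_accuracy'].mean(), results['test_f1_macro'].mean())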
Can I use cross_validate in sklearn with cv=10 instead of using KFold with n_splits=10? Do they work the same?
I believe that KFold will simply carve your training data into 10 splits.
cross_validate, however, will also carve the data into 10 splits (via the cv=10 parameter), but it will additionally perform the cross-validation itself. In other words, it will fit and score your model 10 times, so you can report on the model's performance, which KFold alone does not do.
Put differently, KFold is just one small step within cross-validation.
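A hedged sketch of the difference, assuming a classifier clf and data X, y:

from sklearn.model_selection import KFold, cross_validate

# KFold only produces the index splits; nothing is fitted here
kf = KFold(n_splits=10)
for train_index, test_index in kf.split(X):
    pass  # you would have to fit and score clf yourself inside this loop

# cross_validate builds 10 splits AND fits/scores the model on each of them
results = cross_validate(clf, X, y, cv=10)
print(results['test_score'].mean())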
I've used two approaches with the same sklearn decision tree, one approach using a validation set and the other using K-Fold. I'm however not sure if I'm actually achieving anything by using K-Fold. Technically the cross-validation does show a 5% rise in accuracy, but I'm not sure if that's just a peculiarity of this particular data skewing the result.
For my implementation of KFold I first split the training set into segments using:
f = KFold(n_splits=8)
f.get_n_splits(data)
And then got data-frames from it by using
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
In a loop, as shown in many online tutorials on how to do it. However, here comes the tricky part: the tutorial I saw used a .train() function, which I don't think this decision tree classifier has. Instead, I just do this:
from sklearn import tree

clf = tree.DecisionTreeClassifier()  # keep the classifier in its own variable so the tree module is not shadowed
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
The accuracy scores achieved are:
Accuracy score: 0.79496591505
Accuracy score: 0.806502359727
Accuracy score: 0.800734137389
... and so on
But I am not sure if I'm actually making my classifier any better by doing this, as the scores go up and down. Isn't this just comparing 9 independent results together? Is the purpose of K-fold not to train the classifier to be better?
I've read similar questions and found that K-fold is meant to provide a way to compare between "independent instances" but I wanted to make sure that was the case, not that my code was flawed in some way.
Is the purpose of K-fold not to train the classifier to be better?
The purpose of K-fold is to guard against the classifier overfitting the training data. On each fold you keep a separate test set which the classifier has not seen, and you verify the accuracy on it. You then average the per-fold scores to see how well your classifier is performing.
Isn't this just comparing 9 independent results together?
Yes, you compare the different scores to see how well your classifier is performing.
In general, using cross-validation helps guard against overfitting. For that you split the data into multiple parts and evaluate the loss, accuracy or other metrics (e.g. the F1 score). A good introduction can be found on the official site [1].
In addition, I would recommend using StratifiedKFold [2] instead of KFold.
skf = StratifiedKFold(n_splits=8)
skf.get_n_splits(X, y)
This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
That way each fold preserves the class balance of your labels.
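Plugged into the loop from the question, that might look roughly like this (a sketch assuming pandas objects X and y and the decision tree from above):

import numpy as np
from sklearn import tree
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=8)
scores = []
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf = tree.DecisionTreeClassifier()
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

print("Mean accuracy:", np.mean(scores))  # average over the folds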
I'm currently working on a research study comparing classifier performances. To evaluate those performances, I'm computing the accuracy, the area under the curve and the squared error for each classifier on all the datasets I have. Besides, I need to tune the parameters of some of the classifiers in order to select the best parameters in terms of accuracy, so a validation set is required (I chose 20% of the dataset).
I was told that, in order to make this comparison even more meaningful, the cross validation should be performed on the same sets for each classifier.
So basically, is there a way to use the cross_val_score method so that it always runs on the same folds for all the classifiers, or should I write some code from scratch that can do this job?
Thank you in advance.
cross_val_score accepts a cv parameter which represents the cross validation object you want to use. You probably want StratifiedKFold, which accepts a shuffle parameter, which specifies if you want to shuffle the data prior to running cross validation on it.
cv can also be an int, in which case a StratifiedKFold or KFold object will be created automatically with K = cv.
As you can tell from the documentation, shuffle is False by default, so by default it will already be performed on the same folds for all of your classifiers.
You can test it by running it twice on the same classifier to make sure (you should get the exact same results).
You can specify it yourself like this:
your_cv = StratifiedKFold(your_y, n_folds=10, shuffle=True) # or shuffle=False
cross_val_score(your_estimator, your_X, y=your_y, cv=your_cv)
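Note that the snippet above uses the older scikit-learn API, in which StratifiedKFold took the labels and an n_folds argument. Under the current sklearn.model_selection API the equivalent would look roughly like this (same placeholder names):

from sklearn.model_selection import StratifiedKFold, cross_val_score

# fix random_state so every classifier is evaluated on exactly the same shuffled folds
your_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # or shuffle=False
scores = cross_val_score(your_estimator, your_X, y=your_y, cv=your_cv)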