Reading the docs for k-fold cross-validation (http://scikit-learn.org/stable/modules/cross_validation.html), I'm trying to understand the training procedure for each of the folds.
Is this correct:
In generating the cross_val_score, each fold contains a new training and test set, and these training and test sets are used by the passed-in classifier clf in the code below to evaluate each fold's performance?
This would imply that changing the number of folds can affect accuracy, since it changes how much training data is available for each run?
From the docs, cross_val_score is generated using:
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score
iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
scores
array([ 0.96..., 1. ..., 0.96..., 0.96..., 1. ])
I don't think the statement "each fold contains a new training and test set" is correct.
By default, cross_val_score uses KFold cross-validation (StratifiedKFold when the estimator is a classifier). Either way, it works by splitting the data set into K roughly equal folds. Say we have 3 folds (fold1, fold2, fold3); then the algorithm works as follows:
Use fold1 and fold2 as the training set for the SVM and test performance on fold3.
Use fold1 and fold3 as the training set for the SVM and test performance on fold2.
Use fold2 and fold3 as the training set for the SVM and test performance on fold1.
So each fold is used for both training and testing.
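A minimal sketch of that rotation, written out explicitly with KFold (the shuffling and random_state are my own assumptions for a tidy illustration):
from sklearn import datasets, svm
from sklearn.model_selection import KFold

iris = datasets.load_iris()
X, y = iris.data, iris.target

kf = KFold(n_splits=3, shuffle=True, random_state=0)
for run, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    clf = svm.SVC(kernel='linear', C=1)
    clf.fit(X[train_idx], y[train_idx])          # train on 2 of the 3 folds
    score = clf.score(X[test_idx], y[test_idx])  # test on the held-out fold
    print("Run %d: %d train, %d test, accuracy %.2f" % (run, len(train_idx), len(test_idx), score))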
Now to the second part of your question. If you increase the size of each fold (i.e. use fewer folds), you do reduce the number of training samples for each of the runs (runs 1, 2, and 3 above), but the total number of training samples is unchanged.
Generally, selecting the right number of folds is both art and science. For some heuristics on how to choose your number of folds, I would suggest this answer. The bottom line is that accuracy can be slightly affected by your choice of the number of folds. For large data sets, you are relatively safe with a large number of folds; for smaller data sets, you should run the exercise multiple times with new random splits.
Suppose I iterate with the following code until I acquire an accuracy that I'm satisfied with:
import numpy as np
from sklearn.model_selection import train_test_split

x, y = ...  # read in some data set
c = 3000    # iterate over some arbitrary range
for i in range(c):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=i)
    model = ...  # initialize some classifier of choice
    model.fit(x_train, y_train)
    p = model.predict(x_test)
    p = np.round(p).reshape(-1)
    test_accuracy = np.mean(p == y_test) * 100
For a particular data set and range, say I build a classifier such that the training accuracy is 97% and the test accuracy is 96%. Can I truly claim the model is 96% accurate? For the same range and data set, I can also build a classifier where the training accuracy is 99% but the test accuracy is as low as 70%.
Since I have selected random_state based on the test set accuracy, is the test set really a validation set here? I can't quite articulate why, but I think claiming that the first model is 96% accurate would not be true. What should I do instead in order to make a correct claim about the model's accuracy?
Is it bad practice to iterate over many random training & test set splits until a high accuracy is achieved?
Yes, this is bad practice. You should be evaluating on data that your model has never been trained on, and this wouldn't really be the case if you train many times to find the best train/test split.
You can put aside a test set before you train the model. Then you can create as many train/validation splits as you want and train the model multiple times. You would evaluate on the test set, on which the model was never trained.
You can also look into nested cross-validation.
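A minimal sketch of the hold-out-first approach described above (the dataset, classifier, and split sizes here are my own assumptions for illustration):
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Put aside a test set ONCE, before any model selection happens.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Use cross-validation on the remaining data to compare/tune models.
model = LogisticRegression(max_iter=5000)
cv_scores = cross_val_score(model, X_trainval, y_trainval, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# Only at the very end, evaluate once on the untouched test set.
model.fit(X_trainval, y_trainval)
print("Held-out test accuracy: %.3f" % model.score(X_test, y_test))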
Kinda. There is cross-validation, which is similar to what you have described: the train/test split is randomised and the model is trained each time, except the final value quoted is the average test accuracy, not simply the best one. This sort of thing is done in tricky situations, e.g. with very small datasets.
In terms of the bigger picture, the test data should be representative of the training data and vice versa. Sure you can cheese it that way, but if the atypical 'weird' cases are hidden in your training set and the test set is just full of easy cases (e.g. only the digit 0 for MNIST) then you aren't really achieving anything. You're only cheating yourself.
I get why a model's score is different for each random_state, but I didn't expect the difference between the highest and the lowest score (over random_state 0-100) to be 0.37, which is a lot. I also tried ten-fold cross-validation, and the difference is still quite big.
So does this actually matter, or is it something I should ignore?
The Data-set link
(Download -> Data Folder -> student.zip -> student-mat.csv)
Full code:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

acc_dic = {}
grade_df_main = pd.read_csv(r'F:\Python\Jupyter Notebook\ML Projects\data\student-math-grade.csv', sep=";")
grade_df = grade_df_main[["G1", "G2", "G3", "studytime", "failures", "absences"]]
X = grade_df.drop("G3", axis="columns")
Y = grade_df["G3"].copy()

def cross_val_scores(scores):
    print("Cross validation result :-")
    #print("Scores: {}".format(scores))
    print("Mean: {}".format(scores.mean()))
    print("Standard deviation: {}".format(scores.std()))

def start(rand_state):
    print("Index {}".format(rand_state))
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=.1, random_state=rand_state)
    lin_reg_obj = LinearRegression()
    lin_reg_obj.fit(x_train, y_train)
    accuracy = lin_reg_obj.score(x_test, y_test)
    print("Accuracy: {}".format(accuracy))
    acc_dic[rand_state] = accuracy
    # 10-fold CV on the (small) test split, scored with negative MSE
    scores = cross_val_score(lin_reg_obj, x_test, y_test, scoring="neg_mean_squared_error", cv=10)
    cross_val_scores(scores)
    print()

for i in range(0, 101):
    start(i)

print("Overview : \n")
result_val = list(acc_dic.values())
min_index = result_val.index(min(result_val))
max_index = result_val.index(max(result_val))
print("Minimum Accuracy : ")
start(min_index)
print("Maximum Accuracy : ")
start(max_index)
Result:
(Only the highest and the lowest results are included.)
Minimum Accuracy :
Index 54
Accuracy: 0.5635271419142645
Cross validation result :-
Mean: -8.969894370977539
Standard deviation: 5.614516642510817
Maximum Accuracy :
Index 97
Accuracy: 0.9426035720345269
Cross validation result :-
Mean: -0.7063598117158191
Standard deviation: 0.3149445166291036
TL;DR
It is not the split on the dataset you used to train and evaluate your model that decides how well your final model will actually perform once it is deployed. The split and evaluation technique is more about getting a valid estimation of how well the model might perform in real life. And as you can see, the choice of the splitting and evaluation technique can have a great influence on this estimation. The results on your dataset highly suggest preferring k-fold cross-validation over a simple train/test split.
Longer version
I believe you have already figured out that the split you do on the dataset to separate it into train and test sets has nothing to do with the performance of your final model, which is likely to be trained on the whole dataset and then be deployed.
The purpose of testing is to get a feeling of the predictive performance on unseen data. In a best-case scenario, you would ideally have two completely different data sets from different cohorts/sources to train and test your model (external validation). This is the best approach to evaluate how your model would perform once it is deployed. However, since you often do not have such a second source of data, you do an internal validation where you get samples for training and testing from the same cohort/source.
Usually, given that this dataset is big enough, randomness will make sure that the splits for the train and test sets are a good representation of your original dataset and the performance metrics you get are a fair estimation of the model's predictive performance in real life.
However, as you see on your own dataset, there are cases where the split does heavily influence the result. It is exactly in such cases that you are better off evaluating your performance with a cross-validation technique such as k-fold cross-validation and computing the mean across the different splits.
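For example, a minimal sketch of doing exactly that on the dataset from the question (reusing the columns from the question's code and assuming student-mat.csv is in the working directory; the R^2 scoring choice is my own assumption):
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

grade_df = pd.read_csv("student-mat.csv", sep=";")[["G1", "G2", "G3", "studytime", "failures", "absences"]]
X = grade_df.drop("G3", axis="columns")
Y = grade_df["G3"]

# 10-fold CV on the full dataset: every row is used for testing exactly once.
scores = cross_val_score(LinearRegression(), X, Y, cv=10, scoring="r2")
print("R^2 per fold:", scores.round(3))
print("Mean: %.3f, std: %.3f" % (scores.mean(), scores.std()))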
Say I have a learning curve like the one in the sklearn learning-curve example for an SVM, and I'm also doing 5-fold cross-validation, which, as far as I understand, means splitting the training data into 5 pieces, training on four of them and testing on the last one.
So my question is: since for each data point in the learning curve the size of the training set is different (because we want to see how the model performs with an increasing amount of data), how does the cross-validation work in that case? Does it still split the whole training set into 5 equal pieces? Or does it split the current point's training set into five smaller pieces and then compute the test score? Is it possible to get a confusion matrix for each data point (i.e. true positives, true negatives, etc.)? I don't see a way to do that yet based on the sklearn learning_curve code.
Does the number of cross-validation folds relate to how many training-set sizes we specify in train_sizes = np.linspace(0.1, 1.0, 5)?
train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
    estimator, X, y, cv=cv, n_jobs=n_jobs, scoring=scoring,
    train_sizes=train_sizes, return_times=True)
Thank you!
No, it does not re-split each reduced training set into 5 folds. The 5-fold split of the whole training data is done once; then, for a particular combination of training folds (for example, folds 1, 2, 3 and 4 as training), it picks only the first n data points (the x-ticks) from those 4 training folds as the training set. The held-out test fold is used as-is as the testing data.
If you look at the code here it would become clearer for you.
for train, test in cv_iter:
    for n_train_samples in train_sizes_abs:
        train_test_proportions.append((train[:n_train_samples], test))
n_train_samples would be something like [200,400,...1400] for the plot that you had mentioned.
Does the number of cross-validation folds relate to how many training-set sizes we specify in train_sizes = np.linspace(0.1, 1.0, 5)?
No, we can't assign a number of folds to a particular entry of train_sizes. Each point on the curve is just a subset of the data points taken from all of the training folds.
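As a rough illustration of that subsetting logic (a simplified sketch of what learning_curve does internally, with made-up sizes, not the library code itself):
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(-1, 1)    # 100 dummy samples
train_sizes_abs = [20, 40, 60, 80]   # absolute training sizes (the x-ticks)

cv = KFold(n_splits=5)
for train, test in cv.split(X):           # the 5-fold split of the whole set, done once
    for n_train_samples in train_sizes_abs:
        subset = train[:n_train_samples]  # only the first n samples of the training folds
        # a model would be fitted on X[subset] and scored on X[test] here
        print(len(subset), "training samples,", len(test), "test samples")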
I am looking to use k-fold cross-validation on a random forest regressor in Python. I understand that k refers to the number of folds in the dataset, but how can I adjust the test-set size? Say I wanted to split the data in ten different ways, but in each fold I wanted the data split 50/50, how would I do this? Here is what I currently have:
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestRegressor as rfr

# <wasn't sure how to include the data as it's a big file>
x, y = ...

# BUILD RANDOM FOREST MODEL
rfmodel = rfr(n_estimators=100, random_state=0)

# make cross-validated predictions
cv_preds = cross_val_predict(rfmodel, x, y, cv=10)
cv_preds = np.around(cv_preds, 2)
I'm aware that k-fold is not necessarily needed for RF; however, for the purposes of this project that is what I need to do.
EDIT: I'll try to re-articulate, as I probably haven't described my problem well enough. Say I have 100 observations and k=5. Instead of dividing the observations into five equal-sized folds, training on k-1 folds and testing on the remaining fold, what I am looking to do is to randomly allocate the 100 observations into a 50/50 train/test split, run the model, then re-allocate the 100 observations into another 50/50 split and run the model again. I would do this 5 times.
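A minimal sketch of repeated 50/50 splits of this kind, assuming ShuffleSplit is acceptable here (it draws n_splits independent random train/test splits rather than disjoint folds):
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Five independent random 50/50 train/test splits instead of five disjoint folds.
splitter = ShuffleSplit(n_splits=5, test_size=0.5, random_state=0)

rfmodel = RandomForestRegressor(n_estimators=100, random_state=0)
scores = cross_val_score(rfmodel, x, y, cv=splitter)  # x, y as loaded above
print(scores)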
How can a metric computed with cross_val_score differ from the same metric computed from cross_val_predict (i.e. obtaining the predictions and then passing them to a metric function)?
Here is an example:
from sklearn import datasets
from sklearn import metrics
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
gnb_clf = GaussianNB()

# compute accuracy over the pooled cross-validated predictions
predicted = cross_val_predict(gnb_clf, iris.data, iris.target, cv=5)
accuracy_cvp = metrics.accuracy_score(iris.target, predicted)

# compute mean accuracy across folds with cross_val_score
score_cvs = cross_val_score(gnb_clf, iris.data, iris.target, cv=5)
accuracy_cvs = score_cvs.mean()

print('Accuracy cvp: %0.8f\nAccuracy cvs: %0.8f' % (accuracy_cvp, accuracy_cvs))
In this case, we obtain the same result:
Accuracy cvp: 0.95333333
Accuracy cvs: 0.95333333
Nevertheless, this does not always seem to be the case, as the official documentation states (regarding a result computed using cross_val_predict):
Note that the result of this computation may be slightly different
from those obtained using cross_val_score as the elements are grouped
in different ways.
Imagine the following labels and split into folds:
[010|101|10]
So you have 8 data points, 4 per class, and you split them into 3 folds, leading to two folds with 3 elements and one with 2. Now let us assume that during cross-validation you get the following predictions:
[010|100|00]
Thus, your per-fold scores are [100%, 67%, 50%], and cross_val_score (as an average) is around 72%. Now what about the accuracy over the pooled predictions? You clearly have 6/8 right, thus 75%. As you can see, the scores are different, even though they both rely on cross-validation. Here, the difference arises because the splits are not exactly the same size: the last "50%" pulls the averaged score down more than its 2 samples would warrant, because it counts as much as the folds based on 3 samples.
There might be other similar phenomena; in general, it boils down to how the averaging is computed. cross_val_score is an average of per-fold averages, which does not have to equal the accuracy over the pooled cross-validation predictions.
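A tiny sketch that just reproduces the arithmetic of the toy example above:
import numpy as np

y_true = np.array([0, 1, 0, 1, 0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 0, 0, 0])
folds = [slice(0, 3), slice(3, 6), slice(6, 8)]

per_fold = [np.mean(y_true[f] == y_pred[f]) for f in folds]
print("Per-fold accuracies:", np.round(per_fold, 3))                              # [1.0, 0.667, 0.5]
print("Mean of fold scores (cross_val_score-style): %.3f" % np.mean(per_fold))    # ~0.722
print("Accuracy over pooled predictions (cross_val_predict-style): %.3f" % np.mean(y_true == y_pred))  # 0.75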
In addition to lejlot's answer, another way that you might get slightly different results between cross_val_score and cross_val_predict is when the target classes are not distributed in a way that allows them to be evenly split between folds.
According to the documentation for cross_val_predict, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used by default. This may lead to a situation where, even though the total number of instances in the dataset is divisible by the number of folds, you end up with folds of slightly different sizes, because the splitter stratifies on the target classes. This can then lead to the issue where an average of averages is slightly different from an overall average.
For example, if you have 100 data points, and 33 of these are the target class, then KFold with n_splits=5 would split this into 5 folds of 20 observations, but StratifiedKFold would not necessarily give you equally-sized folds.
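A quick way to check this on your own sklearn version (a sketch using the 33/67 class balance from the example above; whether the fold sizes actually end up unequal depends on the sklearn version and the class counts):
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.zeros((100, 1))
y = np.array([1] * 33 + [0] * 67)   # 33 samples of the target class, 67 of the other

print("KFold test-fold sizes:",
      [len(test) for _, test in KFold(n_splits=5).split(X)])

print("StratifiedKFold (size, positives) per test fold:",
      [(len(test), int(y[test].sum())) for _, test in StratifiedKFold(n_splits=5).split(X, y)])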