Say I have a learning curve generated with sklearn's learning_curve for an SVM. I'm also doing 5-fold cross-validation, which, as far as I understand, means splitting your training data into 5 pieces, training on four of them and testing on the last one.
So my question is: since for each point on the learning curve the size of the training set is different (because we want to see how the model performs with an increasing amount of data), how does the cross-validation work in that case? Does it still split the whole training set into 5 equal pieces? Or does it split the current point's training set into five smaller pieces and then compute the test score? Is it possible to get a confusion matrix for each data point (i.e. true positives, true negatives, etc.)? I don't see a way to do that yet based on the sklearn learning_curve code.
Also, does the number of cross-validation folds relate to the number of pieces the training set is split into by train_sizes = np.linspace(0.1, 1.0, 5)?
train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
    estimator, X, y,
    cv=cv, scoring=scoring, n_jobs=n_jobs,
    train_sizes=train_sizes, return_times=True)
Thank you!
No, it does not split that point's training data into 5 folds again. Instead, for a particular combination of training folds (for example, folds 1, 2, 3 and 4 as training), it picks only the first n_train_samples data points (the x-ticks of the plot) from those 4 training folds as the training set. The test fold is used as-is for the testing data.
If you look at the code here, it becomes clearer:
for train, test in cv_iter:
for n_train_samples in train_sizes_abs:
train_test_proportions.append((train[:n_train_samples], test))
n_train_samples would be something like [200, 400, ..., 1400] for the plot you mentioned.
Does the number of cross-validation folds relate to the number of pieces the training set is split into by train_sizes = np.linspace(0.1, 1.0, 5)?
No, the number of folds is independent of train_sizes. Each train size is just the number of data points sampled from all the training folds.
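To make the relationship concrete, here is a minimal sketch, assuming an SVC on the iris data (neither is from the original post): the returned score arrays have one row per point on the curve and one column per CV fold.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

train_sizes_abs, train_scores, test_scores = learning_curve(
    SVC(kernel="linear"), X, y,
    cv=5,                                  # 5-fold CV, independent of train_sizes
    train_sizes=np.linspace(0.1, 1.0, 5),  # 5 points on the learning curve
    shuffle=True, random_state=0)          # shuffle so small subsets contain all classes

print(train_sizes_abs)     # absolute training-set size at each of the 5 points
print(train_scores.shape)  # (5 train sizes, 5 CV folds)
print(test_scores.shape)   # (5 train sizes, 5 CV folds)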
Suppose I iterate with the following code until I acquire an accuracy that I'm satisfied with:
import numpy as np
from sklearn.model_selection import train_test_split

x, y = # ... read in some data set ...
c = 3000  # iterate over some arbitrary range
for i in range(c):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=i)
    model = # ... initialize some classifier of choice ...
    model.fit(x_train, y_train)
    p = model.predict(x_test)
    p = np.round(p).reshape(-1)
    test_accuracy = np.mean(p == y_test) * 100
For a particular data set and range, say I build a classifier such that the training accuracy is 97% and the test accuracy is 96%. Can I truly claim the model is 96% accurate? For the same range and data set, I can also build a classifier with 99% training accuracy but test accuracy as low as 70%.
Since I have selected random_state based on the test set accuracy, is the test set really a validation set here? I don't know why, but I think to claim the first model is 96% accurate would not be true. What should I do instead in order to make a correct claim about the model's accuracy?
Is it bad practice to iterate over many random training & test set splits until a high accuracy is achieved?
Yes, this is bad practice. You should be evaluating on data that your model has never been trained on, and this wouldn't really be the case if you train many times to find the best train/test split.
You can put aside a test set before you train the model. Then you can create as many train/validation splits as you want and train the model multiple times. You would evaluate on the test set, on which the model was never trained.
You can also look into nested cross-validation.
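A minimal sketch of that idea, on a toy dataset with a placeholder classifier (neither is from the original post): hold the test set out once, search over random train/validation splits, and only score the test set at the very end.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

x, y = make_classification(n_samples=500, random_state=0)  # stand-in for your data

# 1) Put aside a test set once; never train or select on it.
x_dev, x_test, y_dev, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

best_acc, best_model = -np.inf, None
for i in range(100):
    # 2) Random train/validation splits of the remaining data only.
    x_tr, x_val, y_tr, y_val = train_test_split(x_dev, y_dev, test_size=0.25, random_state=i)
    model = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    val_acc = model.score(x_val, y_val)
    if val_acc > best_acc:
        best_acc, best_model = val_acc, model

# 3) Report accuracy on the untouched test set.
print(best_model.score(x_test, y_test))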
Kinda. There is cross-validation, which is similar to what you have described: the train/test split is randomised and the model is trained each time, except the final value quoted is the average test accuracy, not simply the best. This sort of thing is done in tricky situations, e.g. with very small datasets.
In terms of the bigger picture, the test data should be representative of the training data and vice versa. Sure, you can cheese it that way, but if the atypical 'weird' cases are hidden in your training set and the test set is full of easy cases (e.g. only the digit 0 for MNIST), then you aren't really achieving anything. You're only cheating yourself.
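For completeness, the randomised-split-and-average procedure described above maps onto sklearn's ShuffleSplit; a small sketch on a toy dataset (the classifier is just an example):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# 10 random 80/20 splits; quote the mean (and spread), not the best split.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean(), scores.std())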
I am looking to use k-fold cross-validation on a random forest regressor in Python. I understand that k refers to the number of folds in the dataset, but how can I adjust the test-set size? Say I wanted to split the data in ten different ways, but in each fold I wanted the data split 50/50, how would I do this? Here is what I currently have:
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestRegressor as rfr

# <wasn't sure how to include the data as it's a big file>

# BUILD RANDOM FOREST MODEL
rfmodel = rfr(n_estimators=100, random_state=0)

# make cross-validated predictions
cv_preds = cross_val_predict(rfmodel, x, y, cv=10)
cv_preds = np.around(cv_preds, 2)
I'm aware that k-fold is not strictly necessary for a random forest; however, for the purposes of this project that is what I need to do.
EDIT: I'll try to re-articulate, as I probably haven't described my problem well enough. Say I have 100 observations and k=5. Instead of dividing the observations into five equal-sized folds, training on k-1 and testing on the remaining fold, what I am looking to do is randomly allocate the 100 observations into a 50/50 train-test split, run the model, then re-allocate the 100 observations into another 50/50 split and run the model again. I would then do this 5 times.
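If it helps, the procedure described in the edit (repeated random 50/50 splits rather than disjoint folds) corresponds to sklearn's ShuffleSplit; a hedged sketch, reusing the regressor from the snippet above:

from sklearn.ensemble import RandomForestRegressor as rfr
from sklearn.model_selection import ShuffleSplit, cross_val_score

rfmodel = rfr(n_estimators=100, random_state=0)

# 5 independent random 50/50 train/test splits (not disjoint folds).
cv = ShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
scores = cross_val_score(rfmodel, x, y, cv=cv)  # x, y as in the snippet above
print(scores)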
According to the scikit-learn documentation:
sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None,
    scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None,
    pre_dispatch='2*n_jobs')
X and y:

X : array-like. The data to fit. Can be for example a list, or an array.
y : array-like, optional, default: None. The target variable to try to predict in the case of supervised learning.
I am wondering whether [X, y] should be X_train and y_train or the whole dataset. In some notebooks on Kaggle people use the whole dataset, and in others X_train and y_train.
To my knowledge, cross-validation just evaluates the model and shows whether or not you overfit/underfit your data (it does not actually train the model). Then, in my view, the more data you have the better the performance will be, so I would use the whole dataset.
What do you think?
Model performance depends on the way the data is split, and sometimes the model does not have the ability to generalize.
That's why we need cross-validation.
Cross-validation is a vital step in evaluating a model. It maximizes the amount of data that is used to train the model, as over the course of the procedure the model is not only trained but also tested on all of the available data.
I am wondering whether [X, y] should be X_train and y_train or the whole dataset.
[X, y] should be the whole dataset, because internally cross-validation splits the data into training data and test data.
Suppose you use cross-validation with 5 folds (cv = 5).
We begin by splitting the dataset into five groups, or folds. Then we hold out the first fold as a test set, fit our model on the remaining four folds, predict on the test set and compute the metric of interest.
Next, we hold out the second fold as our test set, fit on the remaining data, predict on the test set and compute the metric of interest.
By default, scikit-learn's cross_val_score() function uses the R^2 score as the metric of choice for regression.
The R^2 score is also called the coefficient of determination.
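A minimal sketch of that workflow, with a ridge regressor on a toy dataset purely as an example:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Pass the whole dataset; cross_val_score does the train/test splitting internally.
X, y = load_diabetes(return_X_y=True)
scores = cross_val_score(Ridge(), X, y, cv=5)  # default scoring for a regressor is R^2
print(scores)         # one R^2 value per fold
print(scores.mean())  # the usual summary number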
Reading the docs for k-fold cross-validation (http://scikit-learn.org/stable/modules/cross_validation.html), I'm attempting to understand the training procedure for each of the folds.
Is this correct:
In generating the cross_val_score, each fold contains a new training and test set, and these training and test sets are utilized by the passed-in classifier clf in the code below for evaluating each fold's performance?
This implies that increasing the size of a fold can affect accuracy depending on the size of the training set, as increasing the number of folds reduces the training data available for each fold?
From the doc, cross_val_score is generated using:
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
scores
array([ 0.96..., 1. ..., 0.96..., 0.96..., 1. ])
I don't think the statement "each fold contains a new training and test set" is correct.
By default, cross_val_score uses KFold cross-validation. This works by splitting the data set into K equal folds. Say we have 3 folds (fold1, fold2, fold3), then the algorithm works as follows:
Use fold1 and fold2 as your training set in svm and test performance on fold3.
Use fold1 and fold3 as your training set in svm and test performance on fold2.
Use fold2 and fold3 as your training set in svm and test performance on fold1.
So each fold is used for both training and testing.
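A quick way to see this rotation, using a toy array of 9 samples and 3 folds (not the iris example above):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(9).reshape(9, 1)  # 9 toy samples

# Each sample lands in exactly one test fold and in two training sets.
for i, (train_idx, test_idx) in enumerate(KFold(n_splits=3).split(X), start=1):
    print("run %d: train on %s, test on %s" % (i, train_idx, test_idx))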
Now to the second part of your question. If you increase the number of rows of data in each fold (i.e. use fewer, larger folds), you do reduce the number of training samples for each of the runs (above, that would be runs 1, 2, and 3), but the total number of samples is unchanged.
Generally, selecting the right number of folds is both an art and a science. For some heuristics on how to choose your number of folds, I would suggest this answer. The bottom line is that accuracy can be slightly affected by your choice of the number of folds. For large data sets, you are relatively safe with a large number of folds; for smaller data sets, you should run the exercise multiple times with new random splits.
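For the "run the exercise multiple times with new random splits" suggestion, sklearn's RepeatedKFold does exactly that; a small sketch (the estimator and data are just examples):

from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.svm import SVC

iris = load_iris()

# 5-fold CV repeated 10 times with different random splits -> 50 scores.
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(SVC(kernel='linear', C=1), iris.data, iris.target, cv=cv)
print(scores.mean(), scores.std())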
How can a metric computed with cross_val_score differ from the same metric computed from the predictions returned by cross_val_predict (which are then passed to a metric function)?
Here is an example:
from sklearn import datasets, metrics
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
gnb_clf = GaussianNB()

# accuracy over the pooled predictions from cross_val_predict
predicted = cross_val_predict(gnb_clf, iris.data, iris.target, cv=5)
accuracy_cvp = metrics.accuracy_score(iris.target, predicted)

# mean of the per-fold accuracies from cross_val_score
score_cvs = cross_val_score(gnb_clf, iris.data, iris.target, cv=5)
accuracy_cvs = score_cvs.mean()

print('Accuracy cvp: %0.8f\nAccuracy cvs: %0.8f' % (accuracy_cvp, accuracy_cvs))
In this case, we obtain the same result:
Accuracy cvp: 0.95333333
Accuracy cvs: 0.95333333
Nevertheless, this does not always seem to be the case; the official documentation says (regarding a result computed using cross_val_predict):

Note that the result of this computation may be slightly different from those obtained using cross_val_score as the elements are grouped in different ways.
Imagine the following labels and splits:
[010|101|10]
So you have 8 data points, 4 per class, and you split them into 3 folds, leading to 2 folds with 3 elements and one with 2. Now let us assume that during cross-validation you get the following predictions:
[010|100|00]
Thus, your per-fold scores are [100%, 67%, 50%], and the cross val score (as an average) is around 72%. Now what about the accuracy over the pooled predictions? You clearly have 6/8 right, thus 75%. As you can see, the scores are different, even though they both rely on cross-validation. Here, the difference arises because the splits are not exactly the same size, so this last "50%" drags the averaged score down disproportionately, because it is an average over just 2 samples (while the rest are based on 3).
There might be other similar phenomena; in general, it boils down to how the averaging is computed. cross val score is an average of per-fold averages, which does not have to equal the accuracy computed over the pooled cross-validation predictions.
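A tiny sketch reproducing the arithmetic above with plain numpy, using the same toy labels and predictions:

import numpy as np

y_true = np.array([0, 1, 0, 1, 0, 1, 1, 0])  # [010|101|10]
y_pred = np.array([0, 1, 0, 1, 0, 0, 0, 0])  # [010|100|00]
folds = [slice(0, 3), slice(3, 6), slice(6, 8)]

# Mean of per-fold accuracies (what cross_val_score averages):
fold_accs = [np.mean(y_true[f] == y_pred[f]) for f in folds]
print(fold_accs, np.mean(fold_accs))  # [1.0, 0.667, 0.5] -> ~0.722

# Accuracy over the pooled predictions (what accuracy_score on cross_val_predict gives):
print(np.mean(y_true == y_pred))      # 6/8 = 0.75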
In addition to lejlot's answer, another way that you might get slightly different results between cross_val_score and cross_val_predict is when the target classes are not distributed in a way that allows them to be evenly split between folds.
According to the documentation for cross_val_predict, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used by default. This may lead to a situation where even though the total number of instances in the dataset is divisible by the number of folds, you end up with folds of slightly different sizes, because the splitter is splitting based on the presence of the target. This can then lead to the issue where an average of averages is slightly different to an overall average.
For example, if you have 100 data points, and 33 of these are the target class, then KFold with n_splits=5 would split this into 5 folds of 20 observations, but StratifiedKFold would not necessarily give you equally-sized folds.
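A quick way to check this on your own data; the exact fold sizes you see will depend on the class counts and on the scikit-learn version:

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 100 points, 33 of which belong to the positive class.
X = np.zeros((100, 1))
y = np.array([1] * 33 + [0] * 67)

for splitter in (KFold(n_splits=5), StratifiedKFold(n_splits=5)):
    sizes = [len(test) for _, test in splitter.split(X, y)]
    print(type(splitter).__name__, sizes)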