I'm confused about the reason to use cross_val_score.
From what I understand, cross_val_score tells me whether my model is overfitting or underfitting; moreover, it does not train my model. I have only one feature, a tf-idf sparse matrix, and I don't know what to do if the model is under- or overfitting.
Q1: Did I use it in the wrong order? I've seen both 'cross->fit' and 'fit->cross' examples.
Q2: What do the scores in '#print1' tell me? Does it mean I have to train my model k times (with the same training set), where k is the fold that gives the best score?
My code now:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

model1 = GaussianNB(priors=None)
score = cross_val_score(model1, X_train.toarray(), y_train, cv=3, scoring='accuracy')
# print1
print(score.mean())
model1.fit(X_train.toarray(), y_train)
predictions1 = model1.predict(X_test.toarray())  # held-out data
# print2
print(classification_report(y_test, predictions1))  # y_true first, then y_pred
Here is some information about cross-validation.
The order (cross-validate, then fit) seems fine to me.
First you evaluate the performance of your model on known data. Taking the mean of all the CV scores is informative, but it may be better to keep the raw per-fold scores as well, to see whether your model fails on some folds.
If your model works, you can then fit it on your training set and predict on your test set.
Training the same model k times on the same training set won't change anything.
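For instance, a minimal sketch of that workflow, reusing the X_train, X_test, y_train and y_test from the question:
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Raw per-fold accuracies are more informative than the mean alone:
# a fold scoring much lower than the others is worth investigating.
scores = cross_val_score(GaussianNB(), X_train.toarray(), y_train, cv=3, scoring='accuracy')
print(scores)
print(scores.mean())

# If the fold scores look reasonable, fit once on the full training set
# and evaluate on the held-out test set.
model = GaussianNB().fit(X_train.toarray(), y_train)
print(classification_report(y_test, model.predict(X_test.toarray())))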
Related
I'd like to manually analyse the errors that my ML model (whichever it is) makes, comparing its predictions with the labels. From my understanding, this should be done on instances of the validation set, not the training set.
I trained my model through GridSearchCV, extracting the best_estimator_, i.e. the one that performed best during cross-validation and was then retrained on the entire dataset.
Therefore, my question is: how can I get predictions on a validation set to compare with the labels (without touching the test set), if my best model is re-trained on the whole training set?
One solution would be to split the training set further before running GridSearchCV, but I guess there must be a better solution, for example getting the predictions on the validation sets during cross-validation. Is there a way to get these predictions for the best estimator?
Thank you!
You can compute a validation curve with the model that you obtained from GridSearchCV. Read the documentation here. You will just need to define arrays for the hyperparameters that you want to inspect and a scoring function. Here is an example:
train_scores, valid_scores = validation_curve(model, X_train, y_train, param_name="alpha", param_range=np.logspace(-7, 3, 3), cv=5, scoring="accuracy")
I understood my conceptual error; I'll post it here since it may help other ML beginners like me!
The solution is to use cross_val_predict, splitting the folds in the same way as in GridSearchCV. In fact, cross_val_predict re-trains the model on each fold and does not reuse the previously trained model, so the result is the same as getting the predictions on the validation sets during GridSearchCV.
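A sketch of that idea (the pipeline, param_grid, X_train and y_train names below stand in for whatever was passed to GridSearchCV):
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict

# Use an explicit splitter so GridSearchCV and cross_val_predict share the same folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(pipeline, param_grid, cv=cv, scoring='accuracy')
search.fit(X_train, y_train)

# Out-of-fold predictions for the best hyperparameters: every sample is
# predicted by a clone of the model that did not see it during training.
oof_pred = cross_val_predict(search.best_estimator_, X_train, y_train, cv=cv)
# Compare oof_pred with y_train for manual error analysis.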
My question has to do with GridSearchCV, RidgeCV, and StackingClassifier/Regressor.
Stacking Classifier/Regressor - AFAIK, it first trains each base estimator on the whole training set. Then it uses a cross-validation scheme, using the predictions of each base estimator as the new features to train the final estimator. From the documentation: "To generalize and avoid over-fitting, the final_estimator is trained on out-samples using sklearn.model_selection.cross_val_predict internally."
My question is, what exactly does this mean? Does it break the training data into k folds, and then, for each fold, train the final estimator on the training part of the fold and test it on the held-out part, and then take the final estimator weights from the fold with the best score? Or what?
I think I can group GridSearchCV and RidgeCV into the same question, as they are quite similar (albeit RidgeCV uses an efficient leave-one-out CV by default).
- To find the best hyperparameters, do they run a CV over all the folds for each hyperparameter setting, find the hyperparameters with the best average score, AND THEN, AFTER finding the best hyperparameters, train the model with the best hyperparameters on the WHOLE training set? Or am I looking at it wrong?
If anyone could shed some light on this, that would be great. Thanks!
You're exactly right. The process looks like this:
1. Select the first set of hyperparameters
2. Partition the data into k folds
3. Run the model on each fold
4. Obtain the average score (loss, R², or whatever criterion was specified)
5. Repeat steps 2-4 for all other sets of hyperparameters
6. Choose the set of hyperparameters with the best score
7. Retrain the model on the entire training set (as opposed to a single fold) using the best hyperparameters; a rough sketch of this loop is below
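Roughly, the loop below spells that procedure out by hand (a sketch, not the actual GridSearchCV source; SVC, param_grid, X and y are placeholders):
import numpy as np
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1]}

best_score, best_params = -np.inf, None
for params in ParameterGrid(param_grid):
    # steps 2-4: average CV score for this hyperparameter set
    score = cross_val_score(SVC(**params), X, y, cv=5, scoring='accuracy').mean()
    if score > best_score:
        best_score, best_params = score, params

# step 7: refit on the whole training set with the winning hyperparameters
final_model = SVC(**best_params).fit(X, y)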
I've used two approaches with the same sklearn decision tree, one using a validation set and the other using K-Fold. However, I'm not sure if I'm actually achieving anything by using K-Fold. Technically the cross-validation does show a 5% rise in accuracy, but I'm not sure if that's just a peculiarity of this particular data skewing the result.
For my implementation of KFold I first split the training set into segments using:
f = KFold(n_splits=8)
f.get_n_splits(data)
And then got data-frames from it by using
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
in a loop, as shown in many online tutorials. However, here comes the tricky part: the tutorial I saw had a .train() function, which I do not think this decision tree classifier has. Instead, I just do this:
clf = tree.DecisionTreeClassifier()  # renamed to avoid shadowing the tree module
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
The accuracy scores achieved are:
Accuracy score: 0.79496591505
Accuracy score: 0.806502359727
Accuracy score: 0.800734137389
... and so on
But I am not sure if I'm actually making my classifier any better by doing this, as the scores go up and down. Isn't this just comparing 9 independent results together? Is the purpose of K-fold not to train the classifier to be better?
I've read similar questions and found that K-fold is meant to provide a way to compare between "independent instances", but I wanted to make sure that was the case, and not that my code was flawed in some way.
Is the purpose of K-fold not to train the classifier to be better?
The purpose of K-fold is to check that the classifier is not overfitting the training data, rather than to train a better classifier by itself. On each fold you keep a separate test set which the classifier has not seen, and you verify the accuracy on it. You then average the fold scores to see how well your classifier is performing.
Isn't this just comparing 9 independent results together?
Yes, you compare the different scores to see how well your classifier is performing.
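In code, that averaging might look like this (a sketch; X and y stand for your feature matrix and labels):
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=8, scoring='accuracy')
print(scores)         # one accuracy per held-out fold
print(scores.mean())  # averaged estimate of how well the classifier generalises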
In general, cross-validation helps you detect overfitting. You split the data into multiple parts and evaluate the loss, accuracy, or other metrics (e.g. the F1 score) on the held-out parts. A good introduction can be found on the official site [1].
In addition, I would recommend using StratifiedKFold [2] instead of KFold.
skf = StratifiedKFold(n_splits=8)
skf.get_n_splits(X, y)
This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
So each fold has roughly the same class proportions as the whole dataset, as in the sketch below.
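A sketch of a full stratified CV loop, assuming X and y are NumPy arrays (use .iloc for pandas objects):
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

skf = StratifiedKFold(n_splits=8)
fold_scores = []
for train_index, test_index in skf.split(X, y):
    clf = DecisionTreeClassifier()
    clf.fit(X[train_index], y[train_index])
    fold_scores.append(accuracy_score(y[test_index], clf.predict(X[test_index])))

print(fold_scores)
print(sum(fold_scores) / len(fold_scores))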
I am getting different score values from different estimators from scikit.
SVR(kernel='rbf', C=1e5, gamma=0.1) 0.97368549023058548
Linear regression 0.80539997869990632
DecisionTreeRegressor(max_depth = 5) 0.83165426563946387
Since all regression estimators should use the R² score, I think they are comparable, i.e. the closer the score is to 1, the better the model is trained. However, each model implements the score function separately, so I am not sure. Please clarify.
If the same data goes through a similar pipeline into each model, then the metrics are comparable, and you can confidently choose the SVR model.
By the way, it could be really interesting for you to re-implement this R² metric yourself; it is a nice way to learn the underlying mechanics (see the sketch below).
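For example, a sketch of R² written out by hand and checked against scikit-learn (y_test and predictions are placeholders for your own arrays):
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

print(r_squared(y_test, predictions))   # should match...
print(r2_score(y_test, predictions))    # ...scikit-learn's own implementation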
train_index, test_index = next(iter(ShuffleSplit(821, train_size=0.2, test_size=0.80, random_state=42)))
print train_index, len(train_index)
print test_index, len(test_index)
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(features, labels, train_size=0.33, random_state=42)
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test, labels_test)
print pred, len(pred)
A few questions from this code:
Why do I need the cross_validation.train_test_split line in order to fit and predict with my classifier? (I am not doing any preprocessing on my data except for stopword removal, which I have already done.)
Do the test and train indexes correspond to the classified & predicted labels? My goal is to get all my labels, in their original order, after fitting and predicting them. My features and labels used for training and testing are from a pandas dataframe (two columns), and I need the predicted labels, in order, so that I can feed them back into the pandas dataframe.
Is there a way to predict the labels for the whole set, and not just the test set?
tl;dr
Because your decision tree classifier has to be trained before it can predict anything. It's not a magic algorithm. It has to be shown examples of what to do before it can work out what to do on other things.
cross_validation.train_test_split() facilitates this by splitting your data into a test and a training dataset in such a way that you can analyse how well the model performed later on. Without this, you have no way of assessing how well your decision tree classifier actually performed.
You can create your own testing and training data without train_test_split() (and I suspect that is what you were trying to do with ShuffleSplit()), but you will need at least some training data.
test_index and train_index have nothing to do with your data. Full stop. They come from a randomly generated process that is completely unrelated to what train_test_split() does.
The purpose of ShuffleSplit() is to give you the indices to partition your data into training and test sets yourself. train_test_split() will instead choose its own indices and partition based on those indices. You should use one or the other, and use it sensibly.
Yes. You can always just call
pred = clf.predict(features)  # i.e. predict on the full feature matrix
(or stack the train and test features explicitly, e.g. with np.concatenate; note that features_test + features_train does not concatenate NumPy arrays, it adds them element-wise).
The Full Story
You need cross_validation if you want to do this right. The whole purpose of cross-validation is to catch overfitting.
Basically, if you evaluate your model on the data it was trained on, it is going to perform really well on that training set (because, well, that's what it was trained on), and that is going to skew your overall picture of how well your model will perform on real data.
It's a lot like asking a student to perform in an exam and then in real life: if you want to know whether your student learned from the process of preparing for an exam, you don't give him another exam, you ask him to demonstrate his skills in the real world dealing with unknown and complex data.
If you want to know if your model will be useful, then you want to cross-validate. Wikipedia puts it best:
In a prediction problem, a model is usually given a dataset of known data on which training is run (training dataset), and a dataset of unknown data (or first seen data) against which the model is tested (testing dataset).
The goal of cross validation is to define a dataset to "test" the model in the training phase (i.e., the validation dataset), in order to limit problems like overfitting, give an insight on how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem), etc.
cross_validation.train_test_split doesn't do anything except split the dataset into training and testing data for you.
But perhaps you don't care about metrics, and that's fine. The question then becomes: is it possible to run a decision tree classifier without a training dataset?
The answer is no. Decision tree classifiers are supervised algorithms: they need to be trained on data before they can generalise their model to new results. If you don't give the classifier any data to train on, it will be unable to do anything with the data you feed into predict.
Finally, while it is perfectly possible to get the labels for the whole set (see the tl;dr), it is a really bad idea if you actually care about whether or not you're getting sensible results.
You already have the labels for the testing and training data. You don't need another column with predictions on the training data, because they will either come out identical to the labels you already have or close enough to identical.
I can't think of a single meaningful reason to get back predicted results for your training data short of trying to optimise how it's performing on your training data. If that's what you are trying to do, then do that. What you are doing right now is definitely not that, and I encourage you to think strongly about what your reasons are for blindly inserting numbers into your table without due cause to believe they actually mean something.
There are ways to improve this: get back an accuracy metric, for example, or try k-fold cross-validation to estimate accuracy, or look at log-loss or AUC or any one of a number of metrics to gauge whether or not your model is performing well.
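For instance, a quick sketch with k-fold cross-validation (features and labels are the arrays from the question; the AUC and log-loss scorers assume binary labels and predict_proba):
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
print(cross_val_score(clf, features, labels, cv=5, scoring='accuracy').mean())
print(cross_val_score(clf, features, labels, cv=5, scoring='roc_auc').mean())       # binary labels only
print(cross_val_score(clf, features, labels, cv=5, scoring='neg_log_loss').mean())  # higher (closer to 0) is better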
Using both ShuffleSplit and train_test_split is redundant. You do not even appear to be using the indices returned by ShuffleSplit.
An example of how to use the indices returned by such a splitter is below. X and y are np.arrays: X is (number of instances) x (number of features), and y contains the label of each row.
from sklearn.model_selection import train_test_split

train_inds, test_inds = train_test_split(range(len(y)), test_size=0.33, random_state=42)
X_train, y_train = X[train_inds], y[train_inds]
X_test, y_test = X[test_inds], y[test_inds]
You should not test on your training data! But if you want to see what happens, just do
pred = clf.predict(features_train)
Also, you should not pass the labels to predict; it takes only the features. To evaluate the predictions, you should be using
from sklearn import metrics
score = metrics.accuracy_score(labels_test, pred)