Should I first train_test_split and then use cross validation? - python

If I plan to use cross validation (KFold), should I still split the dataset into training and test data and perform my training (including cross valid) only on the training set? Or will CV do everything for me? E.g.
Option 1
X_train, X_test, y_train, y_test = train_test_split(X,y)
clf = GridSearchCV(... cv=5)
clf.fit(X_train, y_train)
Option 2
clf = GridSearchCV(... cv=5)
clf.fit(X, y)

CV is good, but it is better to also keep a train/test split so that you get an independent score estimate on untouched data.
If your CV score and your test score are about the same, you can drop the train/test split and run CV on the whole dataset to achieve a slightly better model score. But don't do that until you are sure your split and CV scores are consistent.
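A minimal sketch of Option 1 follows, to make the comparison between the CV score and the test score concrete. The synthetic dataset, the LogisticRegression estimator, and the parameter grid are all assumptions chosen for illustration, since the question elides them.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data stands in for the asker's X and y (an assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out a test set that the grid search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical parameter grid; the original question does not give one.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
clf = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
clf.fit(X_train, y_train)

cv_score = clf.best_score_              # mean CV accuracy of the best parameters
test_score = clf.score(X_test, y_test)  # accuracy on the untouched test set
print(f"CV: {cv_score:.3f}  test: {test_score:.3f}")
```

If the two numbers are close, the CV estimate is probably trustworthy; a large gap suggests the CV score is optimistic.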

Related

Evaluate Polynomial regression using cross_val_score

I am trying to use cross_val_score to evaluate my regression model (with PolynomialFeatures(degree=2)). From different blog posts I gathered that I should use cross_val_score with the original X, y values, not with X_train and y_train.
r_squareds = cross_val_score(pipe, X, y, cv=10)
r_squareds
>>> array([ 0.74285583, 0.78710331, -1.67690578, 0.68890253, 0.63120873,
0.74753825, 0.13937611, 0.18794756, -0.12916661, 0.29576638])
which indicates my model doesn't perform really well with the mean r2 of only 0.241. Is this supposed to be a correct interpretation?
However, I came across a Kaggle code working on the same data and the guy performed cross_val_score on X_train and y_train. I gave this a try and the average r2 was better.
r_squareds = cross_val_score(pipe, X_train, y_train, cv=10)
r_squareds.mean()
>>> 0.673
Is this supposed to be a problem?
Here is the code for my model:
X = df[['CHAS', 'RM', 'LSTAT']]
y = df['MEDV']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
pipe = Pipeline(
steps=[('poly_feature', PolynomialFeatures(degree=2)),
('model', LinearRegression())]
)
## fit the model
pipe.fit(X_train, y_train)
Your first interpretation is correct. The first cross_val_score call trains 10 models, each using 90% of your data for training and the remaining 10% as a validation set. We can see from these results that the variance of the estimator's R^2 is quite high. In some folds the model performs even worse than a horizontal line at the mean of y, which is what a negative R^2 means.
From this result we can safely say that the model is not performing well on this dataset.
It is possible that the result obtained by running cross_val_score on the train set only is higher, but this score is most likely not representative of your model's performance, as the dataset might be too small to capture all its variance. (Each model in the second cross_val_score is trained on only 54% of your dataset: 90% of 60% of the original data.)
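To judge the fold-to-fold variance discussed above, it helps to report the spread next to the mean. The sketch below uses synthetic regression data as a stand-in for the asker's dataframe, and passes a shuffled KFold. Both are assumptions, not part of the original code. Shuffling may matter when the rows are stored in some meaningful order, because unshuffled folds can then differ systematically from each other.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for X = df[['CHAS', 'RM', 'LSTAT']], y = df['MEDV'].
X, y = make_regression(n_samples=300, n_features=3, noise=10.0, random_state=0)

pipe = Pipeline(steps=[
    ("poly_feature", PolynomialFeatures(degree=2)),
    ("model", LinearRegression()),
])

# Shuffled folds, so each fold is a random sample of the rows.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
r_squareds = cross_val_score(pipe, X, y, cv=cv, scoring="r2")

print(f"mean R^2 = {r_squareds.mean():.3f}, std = {r_squareds.std():.3f}")
```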

Should Cross Validation Score be performed on original or split data?

When I want to evaluate my model with cross validation, should I perform cross validation on the original data (not split into train and test) or on the train / test data?
I know that training data is used for fitting the model, and testing for evaluating. If I use cross validation, should I still split the data into train and test, or not?
features = df.iloc[:,4:-1]
results = df.iloc[:,-1]
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
clf = LogisticRegression()
model = clf.fit(x_train, y_train)
accuracy_test = cross_val_score(clf, x_test, y_test, cv = 5)
Or should I do like this:
features = df.iloc[:,4:-1]
results = df.iloc[:,-1]
clf = LogisticRegression()
model = clf.fit(features, results)
accuracy_test = cross_val_score(clf, features, results, cv = 5)
Or maybe something different?
Both your approaches are wrong.
In the first one, you apply cross validation to the test set, which is meaningless.
In the second one, you first fit the model with your whole data, and then you perform cross validation, which is again meaningless. Moreover, the approach is redundant (your fitted clf is not used by cross_val_score, which does its own fitting).
Since you are not doing any hyperparameter tuning (i.e. you seem to be interested only in performance assessment), there are two ways:
Either with a separate test set
Or with cross validation
First way (test set):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
clf = LogisticRegression()
model = clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
accuracy_test = accuracy_score(y_test, y_pred)
Second way (cross validation):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
clf = LogisticRegression()
# shuffle data first:
features_s, results_s = shuffle(features, results)
accuracy_cv = cross_val_score(clf, features_s, results_s, cv = 5, scoring='accuracy')
# fit the model afterwards with the whole data, if satisfied with the performance:
model = clf.fit(features, results)
I will try to summarize the "best practice" here:
1) If you want to train your model, fine-tune parameters, and do a final evaluation, I recommend splitting your data into training|val|test.
You fit your model on the training part, and then you check different parameter combinations on the val part. Finally, when you're sure which classifier/parameters give the best result on the val part, you evaluate on the test part to get the final result.
Once you evaluate on the test part, you shouldn't change the parameters any more.
2) On the other hand, some people go another way: they split their data into training and test, fine-tune the model using cross-validation on the training part, and at the end evaluate it on the test part.
If your data is quite large, I recommend the first way; if your data is small, the second.
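The first way (training|val|test) can be sketched as below. The synthetic data, the 60/20/20 proportions, the LogisticRegression estimator, and the candidate values of C are all assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data as a placeholder for a real dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# 60% train, then split the remaining 40% evenly into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

# Pick the parameter value that scores best on the validation part.
candidate_Cs = [0.01, 0.1, 1.0, 10.0]  # hypothetical grid
best_C, best_val_acc = None, -1.0
for C in candidate_Cs:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

# Evaluate once on the test part; no more tuning after this point.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)
print(f"best C = {best_C}, val acc = {best_val_acc:.3f}, test acc = {test_acc:.3f}")
```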

How to shuffle data each time when using cross_val_score?

When training a Ridge Classifier, I'm able to perform 10 fold cross validation like so:
clf = linear_model.RidgeClassifier()
n_folds = 10
scores = cross_val_score(clf, X_train, y_train, cv=n_folds)
scores
array([0.83236107, 0.83937346, 0.84490172, 0.82985258, 0.84336609,
0.83753071, 0.83753071, 0.84213759, 0.84121622, 0.84398034])
If I want to perform 10 fold cross validation again, and I use:
scores = cross_val_score(clf, X_train, y_train, cv=n_folds)
I end up with the same results.
Thus, it seems the data is being split the same way both times.
Is there a way to randomly partition the data into n_folds every time I perform cross validation?
What you will want to do is create your own StratifiedKFold instances and pass them to the cv argument of cross_val_score. This way you can supply different random seeds for splitting the data. Note that shuffle must be True for random_state to have any effect.
from sklearn import linear_model
from sklearn.model_selection import StratifiedKFold, cross_val_score
clf = linear_model.RidgeClassifier()
for n in range(5):
    # a different seed gives a different partition into folds
    strat_k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=n)
    scores = cross_val_score(clf, X_train, y_train, cv=strat_k_fold)
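As an alternative to the explicit loop, scikit-learn's RepeatedStratifiedKFold repeats the stratified split with a different randomization on each repetition. This is a swapped-in technique, not what the answer above uses, and the synthetic data and the choice of 3 repeats are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for the asker's X_train, y_train.
X_train, y_train = make_classification(
    n_samples=400, n_features=10, random_state=0
)

clf = RidgeClassifier()

# 10 folds, repeated 3 times with different shuffles: 30 scores in total.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
scores = cross_val_score(clf, X_train, y_train, cv=cv)

# Scores arrive repeat by repeat, so each row below is one full 10-fold run.
per_repeat_means = scores.reshape(3, 10).mean(axis=1)
print(per_repeat_means)
```

Leaving random_state unset should give a different partition on every call, at the cost of reproducibility.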

How to run SVC classifier after running 10-fold cross validation in sklearn?

I'm relatively new to machine learning and would like some help in the following:
I ran a Support Vector Machine Classifier (SVC) on my data with 10-fold cross validation and calculated the accuracy score (which was around 89%). I'm using Python and scikit-learn to perform the task. Here's a code snippet:
def get_scores(features, target, classifier):
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.3)
    scores = cross_val_score(
        classifier,
        X_train,
        y_train,
        cv=10,
        scoring='accuracy',
        n_jobs=-1)
    return scores

get_scores(features_from_df, target_from_df, svm.SVC())
Now, how can I use my classifier (after running the 10-folds cv) to test it on X_test and compare the predicted results to y_test? As you may have noticed, I only used X_train and y_train in the cross validation process.
I noticed that sklearn has cross_val_predict:
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html Should I replace my cross_val_score with cross_val_predict? Just FYI: my target data column is binarized (values of 0 and 1).
If my approach is wrong, please advise me on the best way to proceed.
Thanks!
You only need to separate your data into X and y. Do not split it into train and test.
Then you can pass your classifier (in your case, the SVM) to the cross_val_score function to get the accuracy for each fold.
In just 3 lines of code:
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=10)
print(scores)
from sklearn.metrics import classification_report
classifier = svm.SVC()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred))
You're almost there:
# Build your classifier
classifier = svm.SVC()
# Train it on the entire training data set
classifier.fit(X_train, y_train)
# Get predictions on the test set
y_pred = classifier.predict(X_test)
At this point, you can use any metric from the sklearn.metrics module to determine how well you did. For example:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
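Putting the pieces together, here is a sketch of the whole flow, including what cross_val_predict returns. The synthetic data is an assumption standing in for features_from_df and target_from_df. Neither cross_val_score nor cross_val_predict leaves the classifier you pass in fitted, since both work on clones, so the explicit fit call is still needed before predicting on X_test.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import (
    cross_val_predict,
    cross_val_score,
    train_test_split,
)
from sklearn.svm import SVC

# Synthetic binary data as a placeholder for the asker's dataframe columns.
features, target = make_classification(
    n_samples=400, n_features=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=0
)

classifier = SVC()

# One accuracy value per fold, computed on the training data only.
cv_scores = cross_val_score(classifier, X_train, y_train, cv=10, scoring="accuracy")

# One out-of-fold prediction per training sample (not predictions for X_test).
oof_pred = cross_val_predict(classifier, X_train, y_train, cv=10)

# Fit on the full training set, then evaluate once on the held-out test set.
classifier.fit(X_train, y_train)
test_acc = accuracy_score(y_test, classifier.predict(X_test))
print(f"CV mean: {cv_scores.mean():.3f}  test: {test_acc:.3f}")
```

So cross_val_predict is not a replacement for cross_val_score here; it answers a different question (per-sample predictions on the training data).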

Stratified Train/Validation/Test-split in scikit-learn

There is already a description here of how to do a stratified train/test split in scikit-learn via train_test_split (Stratified Train/Test-split in scikit-learn) and a description of how to do a random train/validation/test split via np.split (How to split data into 3 sets (train, validation and test)?). But what about doing a stratified train/validation/test split?
The closest approximation that comes to mind for doing a stratified (on class label) train/validation/test split is as follows, but I suspect there's a better way that can perhaps achieve this in one function call or in a more accurate way:
Let's say we want to do a 60/20/20 train/validation/test split. My current approach is to first do a 60/40 stratified split, then do a 50/50 stratified split on that 40%, so as to ultimately get a 60/20/20 stratified split.
from sklearn.model_selection import train_test_split
SEED = 2000
x_train, x_validation_and_test, y_train, y_validation_and_test = train_test_split(x, y, test_size=.4, random_state=SEED)
x_validation, x_test, y_validation, y_test = train_test_split(x_validation_and_test, y_validation_and_test, test_size=.5, random_state=SEED)
Please get back if my approach is correct and/or if you have a better approach.
Thank you
The solution is to just use StratifiedShuffleSplit twice, like below:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=42)
for train_index, test_valid_index in split.split(df, df.target):
    train_set = df.iloc[train_index]
    test_valid_set = df.iloc[test_valid_index]
split2 = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
for test_index, valid_index in split2.split(test_valid_set, test_valid_set.target):
    test_set = test_valid_set.iloc[test_index]
    valid_set = test_valid_set.iloc[valid_index]
Yes, this is exactly how I would do it - running train_test_split() twice. Think of the first as splitting off your training set, and then that training set may get divided into different folds or holdouts down the line.
In fact, if you end up testing your model using a scikit model that includes built-in cross-validation, you may not even have to explicitly run train_test_split() again. Same if you use the (very handy!) model_selection.cross_val_score function.
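A sketch of the two-call train_test_split approach with stratification follows. Note that train_test_split only stratifies when the stratify argument is passed; without it, both calls produce plain random splits. The imbalanced synthetic data is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data (roughly 80/20) as a placeholder for x and y.
x, y = make_classification(
    n_samples=1000, n_features=5, weights=[0.8, 0.2], random_state=0
)

SEED = 2000

# First split: 60% train, 40% held back, stratified on the class label.
x_train, x_rest, y_train, y_rest = train_test_split(
    x, y, test_size=0.4, stratify=y, random_state=SEED
)

# Second split: halve the 40% into validation and test, stratified again.
x_validation, x_test, y_validation, y_test = train_test_split(
    x_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=SEED
)

# The positive-class fraction should be nearly identical in all three parts.
for name, labels in [("train", y_train), ("validation", y_validation), ("test", y_test)]:
    print(name, len(labels), round(float(np.mean(labels)), 3))
```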
