I am using Python 2.7 and scikit-learn.
I am wondering whether it is wise to use a Pipeline when doing cross-validation.
Consider the following example:
#Pipeline
pipe_rf = Pipeline([('pca', PCA(n_components=80)),
('clf',RandomForestClassifier(n_estimators=100))])
pipe_rf.fit(X_train_s,y_train_s)
pred = pipe_rf.predict(X_test)
#CrossValidation
from sklearn import cross_validation
scores = cross_validation.cross_val_score(pipe_rf,
X_train,
y_train,
cv=10,
scoring='f1')
print 'Train score is: %.5f' % scores.mean()
Done like this, the CV module will apply the PCA step 10 times, which is very time-consuming and unnecessary. And I didn't include any other steps.
I have the following piece of code:
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
from sklearn.pipeline import Pipeline
...
x_train, x_test, y_train, y_test= model_selection.train_test_split(dataframe[features_],dataframe[labels], test_size=0.30,random_state=42, shuffle=True)
classifier = RandomForestClassifier(n_estimators=11)
pipe = Pipeline([('feats', feature), ('clf', classifier)])
pipe.fit(x_train, y_train)
predicts = pipe.predict(x_test)
Instead of a train/test split, I want to use k-fold cross-validation to train my model. However, I do not know how to do this with the pipeline structure. I came across this: https://scikit-learn.org/stable/modules/compose.html but I could not adapt it to my code.
I want to use from sklearn.model_selection import StratifiedKFold if possible. I can use it without the pipeline structure, but I cannot use it with the pipeline.
Update:
I tried this, but it raises an error.
x_train = dataframe[features_]
y_train = dataframe[labels]
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
classifier = RandomForestClassifier(n_estimators=11)
#pipe = Pipeline([('feats', feature), ('clf', classifier)])
#pipe.fit(x_train, y_train)
#predicts = pipe.predict(x_test)
predicts = cross_val_predict(classifier, x_train , y_train , cv=skf)
Pipeline is used to assemble several steps such as preprocessing, transformations, and modeling. StratifiedKFold is used to split your dataset to assess the performance of your model. It is not meant to be used as a part of the Pipeline as you do not want to perform it on new data.
Therefore it is normal to perform it outside the pipeline's structure.
While I agree with the answer above, the code below might be what you are looking for:
x_train = dataframe[features_]
y_train = dataframe[labels]
pipe = Pipeline([('feats', feature), ('clf', RandomForestClassifier(n_estimators=11))])
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
# cross_val_predict takes no scoring or error_score argument; score the returned predictions separately
predicts = cross_val_predict(pipe, x_train, y_train, cv=skf, n_jobs=-1)
Since you are using cross-validation, you do not need to call train_test_split() before the Pipeline. In each fold, the training portion of the split is passed directly to the pipeline for feature extraction and model fitting, and the fitted pipeline is then applied to the held-out portion of that same split.
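To make this concrete, here is a minimal self-contained sketch on synthetic data. The data from make_classification and the PCA step standing in for the 'feats' transformer are assumptions for illustration only, since the original feature object is not shown.

```python
# Sketch only: synthetic data, and PCA as a stand-in for the unknown 'feats' step.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=20, random_state=42)

pipe = Pipeline([
    ('feats', PCA(n_components=5)),  # hypothetical feature step
    ('clf', RandomForestClassifier(n_estimators=11, random_state=42)),
])
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Each sample is predicted by a pipeline that was fitted on the other folds only.
predicts = cross_val_predict(pipe, X, y, cv=skf)

# Score the out-of-fold predictions separately.
print(accuracy_score(y, predicts))
```

Because the whole pipeline is passed in, the feature step is refitted inside every fold, so nothing learned from a held-out fold leaks into training.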
I'm looking for the easiest way to teach my students how to perform 10-fold CV for standard classifiers in sklearn such as logistic regression, kNN, decision tree, AdaBoost, SVM, etc.
I was hoping there was a method that created the folds for them, instead of having to loop like below:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
X=df1.drop(['Unnamed: 0','ID','target'],axis=1).values
y=df1.target.values
for train_index, test_index in sss.split(X,y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = LogisticRegressionCV()
    clf.fit(X_train, y_train)
    train_predictions = clf.predict(X_test)
    acc = accuracy_score(y_test, train_predictions)
    print(acc)
Seems like there should be an easier way.
I think your question is whether a method for 10-fold cross-validation already exists. It does, and the sklearn documentation explains cross-validation and how to use it:
Cross-validation: evaluating estimator performance
Besides that, you can also make use of the sklearn modules for cross validation
Various splitting techniques with modules
Model validation with cross validation
To include a code example, which should work with your code, import the required library
from sklearn.model_selection import cross_val_score
and add this line instead of your loop:
print(cross_val_score(clf, X, y, cv=10))
By the way, your n_splits is set to 1, so your code evaluates a single 90/10 split rather than 10 folds.
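For a classroom setting, the same one-liner can be applied to several classifiers in turn. The following is a minimal sketch on synthetic data; the dataset and the particular choice of classifiers are assumptions for illustration, not part of the original question.

```python
# Sketch only: synthetic data in place of the original df1.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

classifiers = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'kNN': KNeighborsClassifier(),
    'decision tree': DecisionTreeClassifier(random_state=0),
}

results = {}
for name, clf in classifiers.items():
    # With an integer cv and a classifier, sklearn builds stratified folds itself.
    scores = cross_val_score(clf, X, y, cv=10)
    results[name] = scores
    print('%s: %.3f +/- %.3f' % (name, scores.mean(), scores.std()))
```

No explicit fold loop is needed: cross_val_score creates the 10 folds, fits a fresh copy of the classifier on each training part, and returns one score per fold.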
When I want to evaluate my model with cross-validation, should I perform cross-validation on the original data (data that's not split into train and test) or on the train / test data?
I know that training data is used for fitting the model, and testing for evaluating. If I use cross validation, should I still split the data into train and test, or not?
features = df.iloc[:,4:-1]
results = df.iloc[:,-1]
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
clf = LogisticRegression()
model = clf.fit(x_train, y_train)
accuracy_test = cross_val_score(clf, x_test, y_test, cv = 5)
Or should I do like this:
features = df.iloc[:,4:-1]
results = df.iloc[:,-1]
clf = LogisticRegression()
model = clf.fit(features, results)
accuracy_test = cross_val_score(clf, features, results, cv = 5)
Or maybe something different?
Both your approaches are wrong.
In the first one, you apply cross validation to the test set, which is meaningless.
In the second one, you first fit the model with your whole data, and then you perform cross validation, which is again meaningless. Moreover, the approach is redundant (your fitted clf is not used by the cross_val_score method, which does its own fitting)
Since you are not doing any hyperparameter tuning (i.e. you seem to be interested only in performance assessment), there are two ways:
Either with a separate test set
Or with cross validation
First way (test set):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
clf = LogisticRegression()
model = clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
accuracy_test = accuracy_score(y_test, y_pred)
Second way (cross validation):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
clf = LogisticRegression()
# shuffle data first:
features_s, results_s = shuffle(features, results)
accuracy_cv = cross_val_score(clf, features_s, results_s, cv = 5, scoring='accuracy')
# fit the model afterwards with the whole data, if satisfied with the performance:
model = clf.fit(features, results)
I will try to summarize the "best practice" here:
1) If you want to train your model, fine-tune parameters, and do a final evaluation, I recommend splitting your data into training|val|test.
You fit your model using the training part, and then you check different parameter combinations on the val part. Finally, when you're sure which classifier/parameter combination obtains the best result on the val part, you evaluate on the test part to get the final result.
Once you evaluate on the test part, you shouldn't change the parameters any more.
2) On the other hand, some people follow another way: they split their data into training and test, fine-tune their model using cross-validation on the training part, and at the end evaluate it on the test part.
If your data is quite large, I recommend the first way; if your data is small, the second.
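As an illustration of the second way, here is a minimal sketch that tunes one parameter with cross-validation on the training part and evaluates once on the test part. The synthetic data, the choice of LogisticRegression, and the grid of C values are all assumptions made for the example.

```python
# Sketch only: synthetic data and an arbitrary parameter grid.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Fine-tune with 5-fold cross-validation on the training part only.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.01, 0.1, 1, 10]},
    cv=5)
search.fit(x_train, y_train)

# Evaluate once on the untouched test part; do not tune further after this.
test_accuracy = search.score(x_test, y_test)
print(search.best_params_, test_accuracy)
```

By default GridSearchCV refits the best parameter combination on the whole training part, so search.score uses that refitted model.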
I'm relatively new to machine learning and would like some help in the following:
I ran a Support Vector Machine Classifier (SVC) on my data with 10-fold cross validation and calculated the accuracy score (which was around 89%). I'm using Python and scikit-learn to perform the task. Here's a code snippet:
def get_scores(features,target,classifier):
    X_train, X_test, y_train, y_test =train_test_split(features, target ,
                                                       test_size=0.3)
    scores = cross_val_score(
        classifier,
        X_train,
        y_train,
        cv=10,
        scoring='accuracy',
        n_jobs=-1)
    return(scores)
get_scores(features_from_df,target_from_df,svm.SVC())
Now, how can I use my classifier (after running the 10-folds cv) to test it on X_test and compare the predicted results to y_test? As you may have noticed, I only used X_train and y_train in the cross validation process.
I noticed that sklearn has cross_val_predict:
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html Should I replace my cross_val_score with cross_val_predict? Just FYI: my target data column is binarized (it has values of 0s and 1s).
If my approach is wrong, please advise me on the best way to proceed.
Thanks!
You only need to separate your data into features X and target y. Do not split it into train and test sets.
Then you can pass your classifier (in your case an SVM) to the cross_val_score function to get the accuracy for each fold.
In just 3 lines of code:
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=10)
print(scores)
from sklearn import svm
from sklearn.metrics import classification_report
classifier = svm.SVC()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred))
You're almost there:
# Build your classifier
classifier = svm.SVC()
# Train it on the entire training data set
classifier.fit(X_train, y_train)
# Get predictions on the test set
y_pred = classifier.predict(X_test)
At this point, you can use any metric from the sklearn.metrics module to determine how well you did. For example:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
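Putting the pieces together, a minimal end-to-end sketch might look like this. The synthetic data from make_classification is an assumption standing in for features_from_df and target_from_df.

```python
# Sketch only: synthetic data in place of the original dataframe columns.
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

classifier = svm.SVC()

# 10-fold CV on the training part gives a performance estimate.
cv_scores = cross_val_score(classifier, X_train, y_train, cv=10, scoring='accuracy')

# cross_val_score fits clones, so the classifier itself must still be fitted.
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

print(cv_scores.mean(), test_accuracy)
```

The explicit fit is needed because cross_val_score works on copies of the estimator and leaves the object you passed in unfitted.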
I have the following code using linear_model.Lasso:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,y,test_size=0.2)
clf = linear_model.Lasso()
clf.fit(X_train,y_train)
accuracy = clf.score(X_test,y_test)
print(accuracy)
I want to perform k-fold cross-validation (10 folds, to be specific). What would be the right code to do that?
Here is the code I use to perform cross-validation on a linear regression model and also to get the details:
import numpy as np
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X_train, y_train, scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
As this book explains on page 108, this is why we use -scores:
Scikit-Learn cross-validation features expect a utility function
(greater is better) rather than a cost function (lower is better), so
the scoring function is actually the opposite of the MSE (i.e., a
negative value), which is why the preceding code computes -scores
before calculating the square root.
and to visualize the result use this simple function:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
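As a self-contained illustration of the sign convention, here is a minimal sketch on synthetic regression data. The dataset from make_regression is an assumption for the example.

```python
# Sketch only: synthetic regression data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

clf = Lasso()
# Each returned value is minus the MSE of one fold, so all are <= 0.
scores = cross_val_score(clf, X, y, scoring="neg_mean_squared_error", cv=10)

# Flip the sign before taking the square root to get per-fold RMSE.
rmse_scores = np.sqrt(-scores)

print("Scores:", rmse_scores)
print("Mean:", rmse_scores.mean())
print("Standard deviation:", rmse_scores.std())
```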
You can run 10-fold using the model_selection module:
# for 0.18 version or newer, use:
from sklearn.model_selection import cross_val_score
# for pre-0.18 versions of scikit-learn, use this import instead:
# from sklearn.cross_validation import cross_val_score
from sklearn import linear_model
X = ...  # Some features
y = ...  # Some target values
clf = linear_model.Lasso()
scores = cross_val_score(clf, X, y, cv=10)
This code will return 10 different scores. You can easily get the mean:
scores.mean()
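Note that for a regressor such as Lasso, the default score is R² (the estimator's own score method), not classification accuracy. The sketch below checks this on synthetic data by repeating the folds by hand; the dataset is an assumption for illustration.

```python
# Sketch only: synthetic regression data.
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

clf = Lasso()
scores = cross_val_score(clf, X, y, cv=10)

# Reproduce the same folds manually: for a regressor, cv=10 means unshuffled KFold.
manual = []
for train_idx, test_idx in KFold(n_splits=10).split(X):
    model = clone(clf).fit(X[train_idx], y[train_idx])
    manual.append(r2_score(y[test_idx], model.predict(X[test_idx])))

print(np.allclose(scores, manual))
print(scores.mean())
```

To report an error measure instead, pass a scoring argument such as "neg_mean_squared_error" to cross_val_score.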