Simple CV for classic classification models - python

I'm looking for the easiest way to teach my students how to perform 10-fold cross-validation (10CV) for standard classifiers in sklearn such as LogisticRegression, KNN, decision tree, AdaBoost, SVM, etc.
I was hoping there was a method that creates the folds for them instead of having to loop like below:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
X=df1.drop(['Unnamed: 0','ID','target'],axis=1).values
y=df1.target.values
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = LogisticRegressionCV()
    clf.fit(X_train, y_train)
    train_predictions = clf.predict(X_test)
    acc = accuracy_score(y_test, train_predictions)
    print(acc)
Seems like there should be an easier way.

I think your question is whether there is an existing method for 10-fold cross-validation. To answer it: the sklearn documentation explains cross-validation and how to use it:
Cross-validation: evaluating estimator performance
Besides that, you can also make use of the sklearn modules for cross validation
Various splitting techniques with modules
Model validation with cross validation
To give a code example that should work with your code, import the required function
from sklearn.model_selection import cross_val_score
and add this line instead of your loop:
print(cross_val_score(clf, X, y, cv=10))
Also, note that n_splits is set to 1 in your code, so you are performing a single stratified shuffle split rather than 10-fold cross-validation.
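As a minimal sketch of how this scales to the other classifiers you mention (the classifier settings and names below are my own, not from the original post), you can pass an explicit StratifiedKFold to get 10 stratified folds:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
# 10 stratified folds; shuffle so the fold assignment is randomised but reproducible
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
classifiers = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "tree": DecisionTreeClassifier(),
    "adaboost": AdaBoostClassifier(),
    "svm": SVC(),
}
for name, model in classifiers.items():
    # cross_val_score clones, fits and scores the model on each fold
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(name, scores.mean())
This assumes X and y are the arrays you already built from df1.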

Related

How to know from which interval of the input the features used in sktime's TimeSeriesForestClassifier are calculated

I used the sktime library's TimeSeriesForestClassifier class to perform multivariate time series classification.
The code is as follows:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sktime.classification.compose import ColumnEnsembleClassifier
from sktime.classification.interval_based import TimeSeriesForestClassifier
from sktime.datasets import load_basic_motions
from sktime.transformations.panel.compose import ColumnConcatenator
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
steps = [
    ("concatenate", ColumnConcatenator()),
    ("classify", TimeSeriesForestClassifier(n_estimators=100)),
]
clf = Pipeline(steps)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
I would like to check the value of feature_importances_, which is not the same length as the input but is an array whose length equals the number of features.
clf.steps[1][1].feature_importances_
I would like to know which part of the input each importance corresponds to. Is there any way to get information about which section of the input the TimeSeriesForestClassifier is calculating features from?
You can get the intervals (start and end index) for each tree of the ensemble from:
clf.steps[1][1].intervals_
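For example, a minimal sketch of inspecting them (the exact layout of intervals_ can differ between sktime versions; I'm assuming a per-tree collection of (start, end) index pairs):
tsf = clf.steps[1][1]  # the fitted TimeSeriesForestClassifier
# print the time-index ranges used by the first few trees
for tree_idx, tree_intervals in enumerate(tsf.intervals_[:3]):
    for start, end in tree_intervals:
        print(f"tree {tree_idx}: features computed over [{start}, {end})")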
sktime now also has an implementation of the newer Canonical Interval Forest.
When we first implemented the Time Series Forest algorithm, we ended up with two versions. The one that you're using is the recommended one, but the older version provides its own functionality for the feature importance graph (see below).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sktime.classification.compose import ColumnEnsembleClassifier
from sktime.classification.compose import ComposableTimeSeriesForestClassifier
from sktime.datasets import load_basic_motions
from sktime.transformations.panel.compose import ColumnConcatenator
X, y = load_basic_motions(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
steps = [
    ("concatenate", ColumnConcatenator()),
    ("classify", ComposableTimeSeriesForestClassifier(n_estimators=100)),
]
clf = Pipeline(steps)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
clf.steps[-1][-1].feature_importances_.rename(columns={"_slope": "slope"}).plot(xlabel="time", ylabel="feature importance")
Be aware of some subtle issues in the calculation and interpretation of the feature importances. The relevant issues are here:
https://github.com/alan-turing-institute/sktime/issues/214
https://github.com/alan-turing-institute/sktime/issues/669

Why the voting classifier has less accuracy than one of the individual predictors that made it

I have a simple question concerning the voting classifier. As I understood it, the voting classifier should have higher accuracy than the individual predictors that built it (the wisdom of the crowd). Here is the code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
# import dataset
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
# split the dataset into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)
rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)
log_clf = LogisticRegression(solver='liblinear', random_state=42)
svm_clf = SVC(gamma='auto', random_state=42)
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')
voting_clf = voting_clf.fit(X_train, y_train)
predictors_list= [log_clf, rnd_clf, svm_clf, voting_clf]
for clf in predictors_list:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_pred, y_test)
    print(clf.__class__.__name__, accuracy)
The accuracy I get is as follows:
LogisticRegression 0.776
RandomForestClassifier 0.88
SVC 0.864
VotingClassifier 0.864
As you can see, for this run the Random Forest predictor has slightly better accuracy than the VotingClassifier!
Any explanation for this?
Many thanks in advance,
Fethi
Let's take a look at the voting parameter you passed: 'hard'.
The documentation says:
If ‘hard’, uses predicted class labels for majority rule voting. Else if ‘soft’, predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.
So maybe the predictions of LogisticRegression and your SVC (SVM) are the same and wrong for some cases; that makes the majority vote wrong for those cases.
You can use voting='soft' or assign a weight to each model's prediction. This way you make the ensemble a little more immune to the wrong predictions of weaker models and rely more on your best models.
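For instance, here is a minimal sketch of the soft-voting variant with your estimators (note that SVC needs probability=True so it can provide class probabilities; the weights are purely illustrative):
soft_svm_clf = SVC(gamma='auto', probability=True, random_state=42)
soft_voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', soft_svm_clf)],
    voting='soft',
    weights=[1, 2, 2])  # illustrative: weight the stronger models more heavily
soft_voting_clf.fit(X_train, y_train)
print(accuracy_score(y_test, soft_voting_clf.predict(X_test)))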

How to perform SMOTE with cross validation in sklearn in python

I have a highly imbalanced dataset and would like to perform SMOTE to balance the dataset and perform cross-validation to measure the accuracy. However, most of the existing tutorials make use of only a single training and testing iteration to perform SMOTE.
Therefore, I would like to know the correct procedure to perform SMOTE using cross-validation.
My current code is as follows. However, as mentioned above, it only uses a single iteration.
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_sample(X_train, y_train.ravel())
clf_rf = RandomForestClassifier(n_estimators=25, random_state=12)
clf_rf.fit(X_train_res, y_train_res)
I am happy to provide more details if needed.
You need to perform SMOTE within each fold. Accordingly, you need to avoid train_test_split in favour of KFold:
from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score
kf = KFold(n_splits=5)
for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
    X_train = X[train_index]
    y_train = y[train_index]  # Based on your code, you might need a ravel call here, but I would look into how you're generating your y
    X_test = X[test_index]
    y_test = y[test_index]  # See comment on ravel and y_train
    sm = SMOTE()
    X_train_oversampled, y_train_oversampled = sm.fit_sample(X_train, y_train)
    model = ...  # Choose a model here
    model.fit(X_train_oversampled, y_train_oversampled)
    y_pred = model.predict(X_test)
    print(f'For fold {fold}:')
    print(f'Accuracy: {model.score(X_test, y_test)}')
    print(f'f-score: {f1_score(y_test, y_pred)}')
You can also, for example, append the per-fold scores to a list defined outside the loop.
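A small sketch of that idea (re-using kf, SMOTE and f1_score from the snippet above, with a RandomForestClassifier standing in as the model of your choice; newer imblearn versions use fit_resample instead of fit_sample):
import numpy as np
from sklearn.ensemble import RandomForestClassifier
fold_scores = []  # defined outside the loop
for train_index, test_index in kf.split(X):
    # oversample only the training part of the fold
    X_res, y_res = SMOTE().fit_sample(X[train_index], y[train_index])
    model = RandomForestClassifier().fit(X_res, y_res)
    fold_scores.append(f1_score(y[test_index], model.predict(X[test_index])))
print(f'Mean f-score over {kf.get_n_splits()} folds: {np.mean(fold_scores)}')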
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE
cv = StratifiedKFold(n_splits=5)
for train_idx, test_idx in cv.split(X, y):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    X_train, y_train = SMOTE().fit_sample(X_train, y_train)
    ....
I think you can also solve this with a pipeline from the imbalanced-learn library.
I saw this solution in a blog called Machine Learning Mastery https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
The idea is to use a pipeline from imblearn to do the cross-validation. Please let me know if that works. The example below uses a decision tree, but the logic is the same.
#decision tree evaluated on imbalanced dataset with SMOTE oversampling
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# define pipeline
steps = [('over', SMOTE()), ('model', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
score = mean(scores)
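Because the imblearn Pipeline applies SMOTE only to the training portion of each fold inside cross_val_score, the oversampled samples never leak into the test folds. To report the result you could then, for example, print the averaged score:
print('Mean ROC AUC: %.3f' % score)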

different score between cross_val_score and train_test_split

I met a problem when I was using Python 3.6 with scikit-learn 0.18.
I used a random forest to do a regression and the regression was pretty good, but when I tried to compute the cross-validation I found that the scores from cross_val_score and train_test_split were really different. The score from train_test_split is 0.9 but the mean of the scores from cross_val_score is about 0.3.
Could you tell me the reason, or is there anything wrong in my code?
The code is:
import numpy as np
import cv2
import itertools
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.cross_validation import cross_val_score,cross_val_predict,ShuffleSplit,KFold
from sklearn.model_selection import train_test_split
train= np.loadtxt('.txt')
traindata=train[0:,38:]
traintarget=train[0:,j]
rf=RandomForestRegressor(n_estimators=20)
rf.fit(traindata,traintarget)
X=traindata
Y=traintarget
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
print (rf.score(X_test, Y_test))
score3 = cross_val_score(rf, X, Y, scoring= 'r2', cv=ShuffleSplit(n=len(X),test_size=0.3,train_size=0.6))
score4 = cross_val_score(rf, X, Y, scoring= 'neg_mean_absolute_error', cv=ShuffleSplit(n=len(X),test_size=0.3,train_size=0.6))
score5 = cross_val_score(rf, X, Y, scoring= 'neg_mean_squared_error', cv=ShuffleSplit(n=len(X),test_size=0.3,train_size=0.6))
print (score3)
print (score4)
print (score5)
When you're using train_test_split, you're splitting the dataset into train and test randomly, and ShuffleSplit is not too different. Note, however, that in your code rf is fit on the full dataset before train_test_split, so rf.score(X_test, Y_test) is evaluated on samples the model has already seen, which inflates that 0.9 score. The low cross_val_score results may also come from a highly uneven target distribution (and, as an aside, leave-one-out cross-validation tends to give noisy results in my experience). Instead, use 5-fold cross-validation.
You can also use stratified cross-validation if the class distribution is uneven; it preserves the percentage of samples in each class. Take a look at http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html for an example.
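A minimal sketch of that suggestion (using the current sklearn.model_selection module rather than the deprecated sklearn.cross_validation; the regressor settings mirror your code, and X, Y are your arrays):
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=20)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
# each fold gets a fresh clone of rf, fit only on that fold's training part
scores = cross_val_score(rf, X, Y, scoring='r2', cv=cv)
print(scores, scores.mean())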

Save sklearn classifier fit by cross_val

I've got a classifier I'm fitting using cross_val and getting good results. Essentially all I'm doing is:
clf = RandomForestClassifier(class_weight="balanced")
scores = cross_val_score(clf, data, target, cv=8)
predict_RF = cross_val_predict(clf, data, target, cv=8)
from sklearn.externals import joblib
joblib.dump(clf, 'churnModel.pkl')
Essentially what I want to do is take the model that's getting fit by cross_val and export it with joblib. However, when I try to load it in a separate project I get:
sklearn.exceptions.NotFittedError: Estimator not fitted, call `fit` before exploiting the model.
So I'm guessing cross_val is not actually saving the fit to my clf? How do I persist the model fit that cross_val is generating?
juanpa.arrivillaga is right. I am afraid you have to do it manually, but scikit-learn makes it quite easy. cross_val_score creates trained models that are not returned to you. Below you will have the trained models in a list (i.e. clf_models):
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from copy import deepcopy
kf = StratifiedKFold(n_splits=8)
clf = RandomForestClassifier(class_weight="balanced")
clf_models = []
# keep in mind your X and y should be indexed same here
kf.get_n_splits(X_data)
for train_index, test_index in kf.split(X_data, y_data):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X_data[train_index], X_data[test_index]
    y_train, y_test = y_data[train_index], y_data[test_index]
    tmp_clf = deepcopy(clf)
    tmp_clf.fit(X_train, y_train)
    print("Got a score of {}".format(tmp_clf.score(X_test, y_test)))
    clf_models.append(tmp_clf)
Edit, following juanpa.arrivillaga's advice: StratifiedKFold is a better choice; it is used here just for demonstration purposes.
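To connect this back to the original question, you can then persist whichever of these fitted models you want with joblib (the file names below are just illustrative); a minimal sketch:
from sklearn.externals import joblib  # on newer scikit-learn versions, simply: import joblib
# persist every fold's model, or pick the best-scoring one and dump only that
for i, model in enumerate(clf_models):
    joblib.dump(model, 'churnModel_fold{}.pkl'.format(i))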
