predict_proba for a cross-validated model - python

I would like to get predicted probabilities from a logistic regression model under cross-validation. I know you can get the cross-validation scores, but is it possible to return the values from predict_proba instead of the scores?
# imports (pre-0.18 API; sklearn.cross_validation was later replaced by sklearn.model_selection)
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import (StratifiedKFold, cross_val_score,
                                      train_test_split)
from sklearn import datasets
# setup data
iris = datasets.load_iris()
X = iris.data
y = iris.target
# setup model
cv = StratifiedKFold(y, 10)
logreg = LogisticRegression()
# cross-validation scores
scores = cross_val_score(logreg, X, y, cv=cv)
# predict probabilities
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)
logreg.fit(Xtrain, ytrain)
proba = logreg.predict_proba(Xtest)

This is now implemented as part of scikit-learn version 0.18. You can pass a 'method' string parameter to the cross_val_predict function; see its documentation for details.
proba = cross_val_predict(logreg, X, y, cv=cv, method='predict_proba')
Also note that this is part of the new sklearn.model_selection package so you will need this import:
from sklearn.model_selection import cross_val_predict
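For reference, here is a minimal end-to-end sketch of the approach above, assuming scikit-learn 0.18 or later. The iris data mirrors the question; the max_iter setting is an illustrative choice and not part of the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10)
# max_iter is raised only to avoid convergence warnings (an assumption for this data)
logreg = LogisticRegression(max_iter=1000)

# Each row holds the class probabilities predicted by the model that was
# trained on the folds that do NOT contain that sample.
proba = cross_val_predict(logreg, X, y, cv=cv, method="predict_proba")
print(proba.shape)
```

The result has one row per sample and one column per class, so it can be compared directly with y.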

An easy workaround for this is to create a wrapper class, which in your case would be
class proba_logreg(LogisticRegression):
    def predict(self, X):
        return LogisticRegression.predict_proba(self, X)
and then pass an instance of it as the classifier object to cross_val_predict:
# cross-validation probabilities
probas = cross_val_predict(proba_logreg(), X, y, cv=cv)

There is a function cross_val_predict that gives you the predicted values, but there is no such function for "predict_proba" yet. Maybe we could make that an option.

This is easy to implement:
import numpy as np
from sklearn.model_selection import KFold
def my_cross_val_predict(
        m, X, y, cv=KFold(),
        predict=lambda m, x: m.predict_proba(x),
        combine=np.vstack):
    preds = []
    for train, test in cv.split(X):
        # fit on the training fold, predict on the held-out fold
        m.fit(X[train, :], y[train])
        pred = predict(m, X[test, :])
        preds.append(pred)
    return combine(preds)
This one returns predict_proba.
If you need both predict and predict_proba, just change the predict and combine arguments:
def stack(arrs):
    # 1-D arrays (labels) are concatenated, 2-D arrays (probabilities) are stacked row-wise
    if arrs[0].ndim == 1:
        return np.hstack(arrs)
    return np.vstack(arrs)
def my_cross_val_predict(
        m, X, y, cv=KFold(),
        predict=lambda m, x: [m.predict(x),
                              m.predict_proba(x)],
        combine=lambda preds: list(map(stack, zip(*preds)))):
    preds = []
    for train, test in cv.split(X):
        m.fit(X[train, :], y[train])
        pred = predict(m, X[test, :])
        preds.append(pred)
    return combine(preds)


How to use t-SNE inside the pipeline

How could I use t-SNE inside my pipeline?
Without a pipeline, I have managed to run t-SNE and then a classification algorithm on its output.
Do I need to write a custom method that can be called in the pipeline that returns a dataframe, or how does it work?
# How I used t-SNE
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(dfListingsFeature_classification)
ts = TSNE()
X_tsne = ts.fit_transform(X_std)
feature_list = []
for i in range(1, X_tsne.shape[1] + 1):
    feature_list.append("TSNE" + str(i))
df_new = pd.DataFrame(X_tsne, columns=feature_list)
df_new['label'] = y
X = df_new.drop(columns=['label'])
y = df_new['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
rfc = RandomForestClassifier()
# Train the random forest classifier
rfc.fit(X_train, y_train)
# Predict the response for the test dataset
y_pred = rfc.predict(X_test)
How I want to use it:
# How could I use TSNE() inside the pipeline?
steps = [('standardscaler', StandardScaler()),
         ('tsne', TSNE()),
         ('rfc', RandomForestClassifier())]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)
parameters = {'rfc__max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
              'rfc__criterion': ['gini', 'entropy']}
grid = GridSearchCV(pipeline, param_grid=parameters, cv=5)
grid.fit(X_train, y_train)
print("score = %3.2f" % (grid.score(X_test, y_test)))
print('Training set score: ' + str(grid.score(X_train, y_train)))
print('Test set score: ' + str(grid.score(X_test, y_test)))
y_pred = grid.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall:", metrics.recall_score(y_test, y_pred))
[OUT] TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'TSNE()' (type <class 'sklearn.manifold._t_sne.TSNE'>) doesn't
Should I build a custom transformer? If so, what should it look like?
class TestTSNE(BaseEstimator, TransformerMixin):
    def __init__(self):
        # don't know
        pass
    def fit(self, X, y=None):
        X_std = StandardScaler().fit_transform(dfListingsFeature_classification)
        ts = TSNE()
        X_tsne = ts.fit_transform(X_std)
        return self
    def transform(self, X, y=None):
        feature_list = []
        for i in range(1, self.X_tsne.shape[1] + 1):
            feature_list.append("TSNE" + str(i))
        df_new = pd.DataFrame(X_tsne, columns=feature_list)
        df_new['label'] = y
        X = df_new.drop(columns=['label'])
        y = df_new['label']
        return X, y
steps = [('standardscaler', StandardScaler()),
         ('testTSNE', TestTSNE()),
         ('rfc', RandomForestClassifier())]
pipeline = Pipeline(steps)
I think you have misunderstood how Pipeline is used. From the help page:
Pipeline of transforms with a final estimator.
Sequentially apply a list of transforms and a final estimator.
Intermediate steps of the pipeline must be 'transforms', that is, they
must implement fit and transform methods. The final estimator only
needs to implement fit.
So this means if your pipeline is:
steps = [('standardscaler', StandardScaler()),
         ('tsne', TSNE()),
         ('rfc', RandomForestClassifier())]
You are going to apply StandardScaler to your features first, then transform the result with TSNE, before passing it to the classifier. I don't think it makes much sense to train on the t-SNE output.
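The TypeError in the question follows from the requirement quoted above: TSNE offers fit and fit_transform but no transform method, so Pipeline rejects it as an intermediate step. A quick check along these lines (a sketch; the variable names are only illustrative) makes the difference visible:

```python
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Pipeline requires every intermediate step to expose both fit and transform.
for est in (StandardScaler(), TSNE()):
    name = type(est).__name__
    print(name, "has transform:", hasattr(est, "transform"))

scaler_ok = hasattr(StandardScaler(), "transform")
tsne_ok = hasattr(TSNE(), "transform")
```

Because t-SNE learns an embedding only for the points it was fitted on, it has no way to map new samples, which is why it cannot be applied to a held-out test set inside a pipeline.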
If you really want to keep using Pipeline, you will need to store the t-SNE result as an attribute and return the input features unchanged, so that the classifier can still be trained on them.
Something like:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import TSNE
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
class TestTSNE(BaseEstimator, TransformerMixin):
    def __init__(self, n_components, random_state=None, method='exact'):
        self.n_components = n_components
        self.method = method
        self.random_state = random_state
    def fit(self, X, y=None):
        ts = TSNE(n_components=self.n_components,
                  method=self.method, random_state=self.random_state)
        # keep the embedding of the training data for later inspection
        self.X_tsne = ts.fit_transform(X)
        return self
    def transform(self, X, y=None):
        # pass the features through unchanged
        return X
steps = [('standardscaler', StandardScaler()),
         ('testTSNE', TestTSNE(2)),
         ('rfc', RandomForestClassifier())]
pipeline = Pipeline(steps)
X, y = make_classification()
pipeline.fit(X, y)
You can retrieve your tsne like this:
import pandas as pd
pd.DataFrame(pipeline.named_steps['testTSNE'].X_tsne)
            0          1
0  -38.756626  -4.693253
1   46.516308  53.633842
2   49.107910  16.482645
3   18.306377   9.432504
4   33.551056 -27.441383
..        ...        ...
95 -31.337574 -16.913471
96 -57.918224 -39.959976
97  55.282658  37.582535
98  66.425125  19.717241
99 -50.692646  11.545088

Sklearn: Is there a way to define a specific score type to pipeline?

I can do this:
kfold = model_selection.KFold(n_splits=number_splits,shuffle=True, random_state=random_state)
scalar = StandardScaler()
pipeline = Pipeline([('transformer', scalar), ('estimator', model)])
results = model_selection.cross_validate(pipeline, X, y, cv=kfold, scoring=score_list,return_train_score=True)
where score_list can be something like ['accuracy','balanced_accuracy','precision','recall','f1'].
I also can do this:
kfold = model_selection.KFold(n_splits=number_splits,shuffle=True, random_state=random_state)
scalar = StandardScaler()
pipeline = Pipeline([('transformer', scalar), ('estimator', model)])
for i, (train, test) in enumerate(kfold.split(X, y)):
    pipeline.fit(self.X[train], self.y[train])
    pipeline.score(self.X[test], self.y[test])
However, I am not able to change the score type for pipeline in the last line. How can I do that?
The score method always returns accuracy for classifiers and the R² score for regressors. There is no parameter to change that; it comes from ClassifierMixin and RegressorMixin.
Instead, when we need another metric, we have to import it from sklearn.metrics and apply it to the predictions, like the following.
from sklearn.metrics import balanced_accuracy_score
y_pred = pipeline.predict(self.X[test])
balanced_accuracy_score(self.y[test], y_pred)
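Put together, a manual loop with a custom metric might look like the sketch below. The breast cancer dataset and LogisticRegression are stand-ins chosen for illustration; the question's own X, y and model would take their place. The get_scorer lookup at the end is an alternative route that accepts the same metric names as the scoring argument of cross_validate.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, get_scorer
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipeline = Pipeline([("transformer", StandardScaler()),
                     ("estimator", LogisticRegression(max_iter=1000))])
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train, test in kfold.split(X, y):
    pipeline.fit(X[train], y[train])
    y_pred = pipeline.predict(X[test])
    # any function from sklearn.metrics can replace pipeline.score here
    scores.append(balanced_accuracy_score(y[test], y_pred))

# Alternative: look up a scorer by name and call it on the fitted pipeline
scorer = get_scorer("balanced_accuracy")
last_fold_score = scorer(pipeline, X[test], y[test])
print(scores, last_fold_score)
```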

Got ValueError when calling cross_val_score

I am working on a machine learning project and want to evaluate the accuracy of multiple algorithms. I am using this CSV and loading only the Date, Time and CO columns (I manually renamed them in the CSV). After preparing my training data, I try to run the evaluations, but I get:
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'unknown' instead.
The shapes for the vectors used for evaluations (X_train and Y_train) are:
(9357, 2)
The class:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
class Models:
    test_size: float
    random_state: int
    def __init__(self, test_size: float = 0.20, random_state: int = 1) -> None:
        self.test_size = test_size
        self.random_state = random_state
    @staticmethod
    def init_models() -> list:
        return [
            ('LR', LogisticRegression(solver='liblinear', multi_class='ovr')),
            ('LDA', LinearDiscriminantAnalysis()),
            ('KNN', KNeighborsClassifier()),
            ('CART', DecisionTreeClassifier()),
            ('NB', GaussianNB()),
            ('SVM', SVC(gamma='auto'))
        ]
    def train(self, x: list, y: list):
        x_train, x_validation, y_train, y_validation = train_test_split(x, y, test_size=self.test_size,
                                                                        random_state=self.random_state)
        return x_train, x_validation, y_train, y_validation
    def evaluate(self, x_train: list, y_train: list, splits: int = 10, random_state: int = 1):
        results = []
        names = []
        models = self.init_models()
        for name, model in models:
            # random_state only has an effect when shuffle=True
            kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=random_state)
            cv_results = cross_val_score(estimator=model, X=x_train, y=y_train, cv=kfold, scoring='accuracy')
            results.append(cv_results)
            names.append(name)
            print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
And I am calling my class as:
models_helper = Models()
array = dataset.values
X = array[:, 1:3]
Y = array[:, 2]
prepared = models_helper.train(X, Y)
classification = models_helper.evaluate(prepared[0], prepared[2])
I avoided this problem by first calculating predicted values with cross_val_predict and then passing them, together with the true labels, to metrics.accuracy_score.
from sklearn import metrics, model_selection
# Function that runs the requested algorithm and returns the accuracy metrics.
# Passing the sklearn model as an argument along with cv values and training data.
def fit_ml_algo(algo, X_train, y_train, cv):
    # One Pass
    model = algo.fit(X_train, y_train)
    acc = round(model.score(X_train, y_train) * 100, 2)
    # Cross Validation
    train_pred = model_selection.cross_val_predict(algo,
                                                   X_train,
                                                   y_train,
                                                   cv=cv,
                                                   n_jobs=-1)
    # Cross-validation accuracy metric
    acc_cv = round(metrics.accuracy_score(y_train, train_pred) * 100, 2)
    return train_pred, acc, acc_cv
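It can also help to look at how scikit-learn classifies the target, since StratifiedKFold raises this error whenever type_of_target(y) is neither 'binary' nor 'multiclass'. The sketch below is a guess at the cause, not something confirmed by the question: it assumes Y was sliced from an object-dtype array, which is what dataset.values yields when the DataFrame mixes strings and numbers. The arrays are made-up stand-ins.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Hypothetical stand-in for a numeric column taken from a mixed-type DataFrame:
# the values are numbers, but the array dtype is object.
y_object = np.array([1, 2, 3, 1, 2], dtype=object)
kind_object = type_of_target(y_object)

# Casting to a proper integer dtype gives a target that stratification accepts.
y_int = y_object.astype(int)
kind_int = type_of_target(y_int)

# Real-valued measurements are a regression target, not a classification one.
y_cont = np.array([0.7, 1.3, 2.9])
kind_cont = type_of_target(y_cont)

print(kind_object, kind_int, kind_cont)
```

If the target turns out to be continuous, the classifiers and StratifiedKFold in the question are not applicable as they stand; the values would need to be binned into classes, or a regressor with KFold used instead.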

Multi-class, multi-label, ordinal classification with sklearn

I was wondering how to run a multi-class, multi-label, ordinal classification with sklearn. I want to predict a ranking of target groups, ranging from the one that is most prevalent at a certain location (1) to the one that is least prevalent (7).
I don't seem to be able to get it right. Could you please help me out?
# Random Forest Classification
# Import
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.metrics import make_scorer, accuracy_score, confusion_matrix, f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
# Import dataset
dataset = pd.read_excel('alle_probs_edit.v2.xlsx')
X = dataset.iloc[:,4:-1].values
Y = dataset.iloc[:,-1].values
# Split in Train and Test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42 )
# Scale the features (bring all variables onto the same scale); whether this is necessary depends on the chosen method
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# Create classifier
classifier = RandomForestClassifier(criterion = 'entropy')
# Choose some parameter combinations to try
parameters = {'bootstrap': [True, False],
              'max_depth': [50],
              'max_features': ['auto', 'sqrt'],
              'min_samples_leaf': [1, 2, 3, 4],
              'min_samples_split': [9, 10, 11, 12, 13],
              'n_estimators': [500, 1000, 1500]}
# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(accuracy_score)
# Run the grid search
grid_obj = GridSearchCV(classifier, parameters, scoring=acc_scorer, cv = 3, n_jobs = -1)
grid_obj = grid_obj.fit(X_train, Y_train)
# Set the classifier to the best combination of parameters
classifier = grid_obj.best_estimator_
# Fit the best algorithm to the data
classifier.fit(X_train, Y_train)
#Prediction the Test data
Y_pred = classifier.predict(X_test)
#Confusion Matrix
cm = pd.DataFrame(confusion_matrix(Y_test, Y_pred))
accuracy1 = accuracy_score(Y_test, Y_pred)
print("Accuracy1: %.2f%%" % (accuracy1 * 100.0))
# k-Fold Cross Validation
accuracy1 = cross_val_score(estimator = classifier, X = X_train, y = Y_train, cv = 10)
kfold = accuracy1.mean()
This may not be the precise answer you're looking for, but this article outlines a technique as follows:
We can take advantage of the ordered class value by transforming a k-class ordinal regression problem to a k-1 binary classification problem, we convert an ordinal attribute A* with ordinal value V1, V2, V3, … Vk into k-1 binary attributes, one for each of the original attribute’s first k − 1 values. The ith binary attribute represents the test A* > Vi
Essentially, aggregate multiple binary classifiers (predict target > 1, target > 2, target > 3, target > 4) to be able to predict whether a target is 1, 2, 3, 4 or 5. The author creates an OrdinalClassifier class that stores multiple binary classifiers in a Python dictionary.
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score
class OrdinalClassifier():
    def __init__(self, clf):
        self.clf = clf
        self.clfs = {}
    def fit(self, X, y):
        self.unique_class = np.sort(np.unique(y))
        if self.unique_class.shape[0] > 2:
            for i in range(self.unique_class.shape[0]-1):
                # for each k - 1 ordinal value we fit a binary classification problem
                binary_y = (y > self.unique_class[i]).astype(np.uint8)
                clf = clone(self.clf)
                clf.fit(X, binary_y)
                self.clfs[i] = clf
    def predict_proba(self, X):
        clfs_predict = {k: self.clfs[k].predict_proba(X) for k in self.clfs}
        predicted = []
        for i, y in enumerate(self.unique_class):
            if i == 0:
                # V1 = 1 - Pr(y > V1)
                predicted.append(1 - clfs_predict[i][:,1])
            elif i in clfs_predict:
                # Vi = Pr(y > Vi-1) - Pr(y > Vi)
                predicted.append(clfs_predict[i-1][:,1] - clfs_predict[i][:,1])
            else:
                # Vk = Pr(y > Vk-1)
                predicted.append(clfs_predict[i-1][:,1])
        return np.vstack(predicted).T
    def predict(self, X):
        # returns the index of the most probable class, not the class label
        return np.argmax(self.predict_proba(X), axis=1)
    def score(self, X, y, sample_weight=None):
        _, indexed_y = np.unique(y, return_inverse=True)
        return accuracy_score(indexed_y, self.predict(X), sample_weight=sample_weight)
The technique originates in A Simple Approach to Ordinal Classification
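To make the probability arithmetic concrete, here is a small worked example for k = 3 ordered classes. The numbers are made up for illustration; in practice they would come from the k - 1 fitted binary classifiers.

```python
import numpy as np

# Hypothetical outputs of the k - 1 = 2 binary classifiers for two samples:
# column 0 is Pr(y > V1), column 1 is Pr(y > V2).
p_greater = np.array([[0.8, 0.3],
                      [0.4, 0.1]])

p_first = 1.0 - p_greater[:, 0]               # Pr(V1) = 1 - Pr(y > V1)
p_middle = p_greater[:, 0] - p_greater[:, 1]  # Pr(V2) = Pr(y > V1) - Pr(y > V2)
p_last = p_greater[:, 1]                      # Pr(V3) = Pr(y > V2)

# One row per sample, one column per ordinal class
class_proba = np.column_stack([p_first, p_middle, p_last])
predicted_index = class_proba.argmax(axis=1)
print(class_proba)
print(predicted_index)
```

Each row sums to 1 by construction. Note that the middle terms can become negative if the binary classifiers are inconsistent, that is, if Pr(y > V2) comes out larger than Pr(y > V1); the technique itself does not prevent this.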
Here is an example using KNN that should be tuneable in an sklearn pipeline or grid search.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.base import clone, BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_is_fitted, check_array
from sklearn.utils.multiclass import check_classification_targets
class KNeighborsOrdinalClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, n_neighbors=5, *, weights='uniform',
                 algorithm='auto', leaf_size=30, p=2,
                 metric='minkowski', metric_params=None, n_jobs=None):
        self.n_neighbors = n_neighbors
        self.weights = weights
        self.algorithm = algorithm
        self.leaf_size = leaf_size
        self.p = p
        self.metric = metric
        self.metric_params = metric_params
        self.n_jobs = n_jobs
    def fit(self, X, y):
        X, y = check_X_y(X, y)
        self.clf_ = KNeighborsClassifier(**self.get_params())
        self.clfs_ = {}
        self.classes_ = np.sort(np.unique(y))
        if self.classes_.shape[0] > 2:
            for i in range(self.classes_.shape[0]-1):
                # for each k - 1 ordinal value we fit a binary classification problem
                binary_y = (y > self.classes_[i]).astype(np.uint8)
                clf = clone(self.clf_)
                clf.fit(X, binary_y)
                self.clfs_[i] = clf
        return self
    def predict_proba(self, X):
        X = check_array(X)
        check_is_fitted(self, ['classes_', 'clf_', 'clfs_'])
        clfs_predict = {k: self.clfs_[k].predict_proba(X) for k in self.clfs_}
        predicted = []
        # index by position i (the key used in fit), not by the class value
        for i, y in enumerate(self.classes_):
            if i == 0:
                # V1 = 1 - Pr(y > V1)
                predicted.append(1 - clfs_predict[i][:,1])
            elif i in clfs_predict:
                # Vi = Pr(y > Vi-1) - Pr(y > Vi)
                predicted.append(clfs_predict[i-1][:,1] - clfs_predict[i][:,1])
            else:
                # Vk = Pr(y > Vk-1)
                predicted.append(clfs_predict[i-1][:,1])
        return np.vstack(predicted).T
    def predict(self, X):
        X = check_array(X)
        check_is_fitted(self, ['classes_', 'clf_', 'clfs_'])
        # map the most probable column back to its class label
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]
Building on the work of David Diaz, the white paper, and Kartik above, along with others linked on Medium and attributed in the readme, I'm working on an OrdinalClassifier that is built on the sklearn framework and works well with sklearn pipelines, scoring, and cross-validation.
The OC performs very well compared with standard non-ordinal multiclass classification and gives greater control over optimizing precision/recall on the positive class (i.e. "high" in, for example, the low < medium < high classes of diabetes disease progression). It supports any sklearn classifier that implements predict_proba. Cross-validation scores are shown in the repo.
OrdinalClassifier based on sklearn
At this time, I would not call it multi-label.

How to correctly perform cross validation in scikit-learn?

I am trying to do a cross validation on a k-nn classifier and I am confused about which of the following two methods below conducts cross validation correctly.
training_scores = defaultdict(list)
validation_f1_scores = defaultdict(list)
validation_precision_scores = defaultdict(list)
validation_recall_scores = defaultdict(list)
validation_scores = defaultdict(list)
def model_1(seed, X, Y):
    scoring = ['accuracy', 'f1_macro', 'precision_macro', 'recall_macro']
    model = KNeighborsClassifier(n_neighbors=13)
    kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed)
    scores = model_selection.cross_validate(model, X, Y, cv=kfold, scoring=scoring, return_train_score=True)
    # rest of print statements
It seems that the for loop in the second model is redundant.
def model_2(seed, X, Y):
    scoring = ['accuracy', 'f1_macro', 'precision_macro', 'recall_macro']
    model = KNeighborsClassifier(n_neighbors=13)
    kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed)
    for train, test in kfold.split(X, Y):
        scores = model_selection.cross_validate(model, X[train], Y[train], cv=kfold, scoring=scoring, return_train_score=True)
        # rest of print statements
I am using StratifiedKFold, and I am not sure whether I need the for loop as in the model_2 function, or whether cross_validate already performs the split, since we pass cv=kfold as an argument.
I am not calling the fit method; is this OK? Does cross_validate call it automatically, or do I need to call fit before calling cross_validate?
Finally, how can I create confusion matrix? Do I need to create it for each fold, if yes, how can the final/average confusion matrix be calculated?
The documentation is arguably your best friend in such questions; from the simple example there it should be apparent that you should use neither a for loop nor a call to fit. Adapting the example to use KFold as you do:
from sklearn.model_selection import KFold, cross_validate
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; this example needs an older version
from sklearn.tree import DecisionTreeRegressor
X, y = load_boston(return_X_y=True)
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True)
model = DecisionTreeRegressor()
scoring=('r2', 'neg_mean_squared_error')
cv_results = cross_validate(model, X, y, cv=kf, scoring=scoring, return_train_score=False)
{'fit_time': array([0.00901461, 0.00563478, 0.00539804, 0.00529385, 0.00638533]),
 'score_time': array([0.00132656, 0.00214362, 0.00134897, 0.00134444, 0.00176597]),
 'test_neg_mean_squared_error': array([-11.15872549, -30.1549505 , -25.51841584, -16.39346535, ...]),
 'test_r2': array([0.7765484 , 0.68106786, 0.73327311, 0.83008371, 0.79572363])}
how can I create confusion matrix? Do I need to create it for each fold
No one can tell you if you need to create a confusion matrix for each fold - it is your choice. If you choose to do so, it may be better to skip cross_validate and do the procedure "manually" - see my answer in How to display confusion matrix and report (recall, precision, fmeasure) for each cross validation fold.
if yes, how can the final/average confusion matrix be calculated?
There is no "final/average" confusion matrix; if you want to calculate anything further than the k ones (one for each k-fold) as described in the linked answer, you need to have available a separate validation set...
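For the per-fold route mentioned above, a manual loop could look roughly like this. It is only a sketch: the iris data stands in for the question's X and Y, and the fold count is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=13)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_matrices = []
for train, test in kfold.split(X, y):
    model.fit(X[train], y[train])
    y_pred = model.predict(X[test])
    # labels= keeps every matrix 3x3 even if a class is missing from a fold
    fold_matrices.append(confusion_matrix(y[test], y_pred, labels=[0, 1, 2]))

for i, cm in enumerate(fold_matrices):
    print(f"fold {i}:\n{cm}")
```

Here the explicit loop and the call to fit are needed because cross_validate is not used; the two approaches should not be nested.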
model_1 is correct.
cross_validate(estimator, X, y=None, groups=None, scoring=None, cv='warn', n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', return_train_score='warn', return_estimator=False, error_score='raise-deprecating')
estimator: an object implementing 'fit'. It is fitted on the training folds.
cv: a cross-validation generator that is used to generate the train and test splits.
If you follow the example in the sklearn docs:
cv_results = cross_validate(lasso, X, y, cv=3, return_train_score=False)
cv_results['test_score']
array([0.33150734, 0.08022311, 0.03531764])
You can see that the lasso model is fitted three times, once per fold on the training split, and validated three times on the corresponding test split. The reported values are the scores on the held-out data.
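One way to confirm that cross_validate does the fitting itself is to ask for the fitted estimators back with return_estimator=True, which appears in the signature above. The sketch below follows the docs example with the diabetes data; the variable names are illustrative.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_validate

X, y = load_diabetes(return_X_y=True)
X, y = X[:150], y[:150]
lasso = Lasso()

# No explicit loop and no call to fit: cross_validate clones and fits per fold
cv_results = cross_validate(lasso, X, y, cv=3, return_estimator=True)

fitted = cv_results["estimator"]
print(len(fitted))              # number of fitted clones, one per fold
print(hasattr(lasso, "coef_"))  # whether the object passed in was fitted
print(cv_results["test_score"])
```

The estimator passed in is cloned for every fold, so it remains unfitted afterwards; call fit on the full training data separately if a final model is needed.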
Cross validation of Keras models
Keras provides a wrapper that makes Keras models compatible with the sklearn cross_validate function. You have to wrap the Keras model using KerasClassifier:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import KFold, cross_validate
from keras.models import Sequential
from keras.layers import Dense
import numpy as np
def get_model():
    model = Sequential()
    model.add(Dense(2, input_dim=2, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
model = KerasClassifier(build_fn=get_model, epochs=10, batch_size=8, verbose=0)
kf = KFold(n_splits=3, shuffle=True)
X = np.random.rand(10, 2)
# binary labels, as required by the sigmoid output and binary_crossentropy loss
y = np.random.randint(0, 2, size=(10, 1))
cv_results = cross_validate(model, X, y, cv=kf, return_train_score=False)
print(cv_results)

