I am trying to estimate the confusion matrix of a classifier using 10-fold cross-validation with sklearn.
To compute the confusion matrix I am using sklearn.metrics.confusion_matrix. I know that I can evaluate a model with CV using sklearn.model_selection.cross_val_score and sklearn.metrics.make_scorer, like:
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import cross_val_score
cm = cross_val_score(clf, X, y, scoring=make_scorer(confusion_matrix))
where clf is my classifier and X, y are the feature matrix and class vector. However, this raises an error, since confusion_matrix does not return a single float but a matrix.
I've tried doing something like:
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
def cv_confusion_matrix(clf, X, y, folds=10):
    skf = StratifiedKFold(n_splits=folds)
    cms = []
    for train, test in skf.split(X, y):
        clf.fit(X[train], y[train])
        cm = confusion_matrix(y[test], clf.predict(X[test]), labels=clf.classes_)
        cms.append(cm)
    # average the per-fold confusion matrices across folds
    return np.mean(np.array(cms), axis=0)
This works, but I'm missing the parallelism that sklearn provides in cross_val_score through its n_jobs parameter.
Is there any way to do this and to take the advantage of the parallelism?
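For now, my workaround is to parallelize over the folds myself with joblib, the same backend sklearn uses for n_jobs. This is only a rough sketch (the helper names _fold_cm and cv_confusion_matrix_parallel are my own), though I'd still prefer a built-in way:
import numpy as np
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold

def _fold_cm(clf, X, y, train, test):
    # fit a fresh clone on the training fold, score on the test fold
    model = clone(clf).fit(X[train], y[train])
    return confusion_matrix(y[test], model.predict(X[test]), labels=model.classes_)

def cv_confusion_matrix_parallel(clf, X, y, folds=10, n_jobs=-1):
    skf = StratifiedKFold(n_splits=folds)
    # one fit/predict per fold, run in parallel like cross_val_score does
    cms = Parallel(n_jobs=n_jobs)(
        delayed(_fold_cm)(clf, X, y, train, test)
        for train, test in skf.split(X, y)
    )
    return np.mean(np.array(cms), axis=0)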
Related
I saw this post that trained an LGBM model, but I would like to know how to adapt it for Lasso. I know the predictions may not necessarily be between 0 and 1, but I would like to try this model. I have tried the following, but it doesn't work:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_features=10, random_state=0, n_classes=2, n_samples=1000, n_informative=8)

class Lasso(Lasso):
    def predict(self, X, threshold=0.5):
        result = super(Lasso, self).predict_proba(X)
        predictions = [1 if p > threshold else 0 for p in result[:, 0]]
        return predictions

clf = Lasso(alpha=0.05)
clf.fit(X, y)
precision = cross_val_score(Lasso(), X, y, cv=5, scoring='precision')
I get
UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan
The model class you chose (Lasso()) is meant for regression problems: it minimizes a penalized squared loss, which is not appropriate in your case. Instead, you can use LogisticRegression() with an L1 penalty to optimize a logistic loss with a Lasso-style penalty. The regularization strength is controlled via the C parameter (see the docs).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
X, y = make_classification(
    n_features=10, random_state=0, n_classes=2, n_samples=1000, n_informative=8
)

class LassoLR(LogisticRegression):
    def predict(self, X, threshold=0.5):
        result = super(LassoLR, self).predict_proba(X)
        # column 1 holds P(y=1); thresholding column 0 would invert the labels
        predictions = [1 if p > threshold else 0 for p in result[:, 1]]
        return predictions

clf = LassoLR(penalty='l1', solver='liblinear', C=1.)
clf.fit(X, y)
precision = cross_val_score(clf, X, y, cv=5, scoring='precision')
print(precision)
I am trying to manually predict a logistic regression model using the coefficient and intercept outputs from a scikit-learn model. However, I can't match up my probability predictions with the predict_proba method from the classifier.
I have tried:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from scipy.special import expit
import numpy as np

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0).fit(X, y)

# use sklearn's predict_proba function
sk_probas = clf.predict_proba(X[:1, :])
# and attempting manually (using scipy's inverse logit)
manual_probas = expit(np.dot(X[:1], clf.coef_.T) + clf.intercept_)
# with a completely manual inverse logit
full_manual_probas = 1/(1 + np.exp(-(np.dot(X[:1], clf.coef_.T) + clf.intercept_)))
outputs:
>>> sk_probas
array([[9.81815067e-01, 1.81849190e-02, 1.44120963e-08]])
>>> manual_probas
array([[9.99352591e-01, 9.66205386e-01, 2.26583306e-05]])
>>> full_manual_probas
array([[9.99352591e-01, 9.66205386e-01, 2.26583306e-05]])
I do seem to get the classes to match (using np.argmax), but the probabilities are different. What am I missing?
I've looked at this and this but haven't managed to figure it out yet.
The documentation states that
For a multi_class problem, if multi_class is set to be “multinomial” the softmax function is used to find the predicted probability of each class
That is, in order to get the same values as sklearn you have to normalize using softmax, like this:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import numpy as np
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, max_iter=1000).fit(X, y)
decision = np.dot(X[:1], clf.coef_.T)+clf.intercept_
print(clf.predict_proba(X[:1]))
print(np.exp(decision) / np.exp(decision).sum())
To use sigmoids instead you can do it like this:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import numpy as np
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, max_iter=1000, multi_class='ovr').fit(X, y) # Notice the extra argument
full_manual_probas = 1/(1+np.exp(-(np.dot(X[:1], clf.coef_.T)+clf.intercept_)))
print(clf.predict_proba(X[:1]))
print(full_manual_probas / full_manual_probas.sum())
I am trying to use the GaussianProcessRegressor in scikit-learn with some graph kernels computed by the grakel software. Below is my code for a 5-fold cross-validation on 100 graphs. For the sake of testing convenience, I have commented out all graph-related lines and use random kernel matrices and y values instead.
from sklearn.model_selection import KFold
from sklearn.utils import check_random_state
from sklearn.gaussian_process import GaussianProcessRegressor as GPR
from sklearn.metrics import mean_squared_error
#from grakel.kernels import WeisfeilerLehman
import numpy as np
def Kfold_CV_GPR(Gs, y, n_iter=4, n_splits=5, random_state=None):
    random_state = check_random_state(random_state)
    kf = KFold(n_splits=n_splits, random_state=random_state, shuffle=True)
    errors = []
    for train_idxs, test_idxs in kf.split(y):
        # gk = WeisfeilerLehman(n_iter=n_iter, normalize=True)
        # K_train = gk.fit_transform(Gs[train_idxs])
        # K_test = gk.transform(Gs[test_idxs])
        K_train = np.random.randn(80, 80)
        K_test = np.random.randn(20, 80)
        gpr = GPR(kernel='precomputed')
        gpr.fit(K_train, y[train_idxs])
        y_pred = gpr.predict(K_test)
        rmse = mean_squared_error(y[test_idxs], y_pred, squared=False)
        errors.append(rmse)
    return -np.mean(errors)
score = Kfold_CV_GPR(Gs=None, y=np.random.randn(100, ), n_iter=4, n_splits=5)
print(score)
However, I am getting the following error
TypeError: Cannot clone object ''precomputed'' (type <class 'str'>): it does not seem to be a scikit-learn
estimator as it does not implement a 'get_params' method.
When I change sklearn.gaussian_process.GaussianProcessRegressor to sklearn.svm.SVR (support vector regression), my code doesn't throw any error, but it runs forever for some reason. I also tested classifiers like sklearn.svm.SVC, and my code works fine.
Does anyone know how to use a precomputed kernel with scikit-learn's GaussianProcessRegressor?
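One direction I am exploring: unlike SVR/SVC, GaussianProcessRegressor has no kernel='precomputed' option (the kernel parameter expects a kernel object, which is why the string gets rejected by clone). A possible workaround is to wrap the Gram matrix in a custom Kernel subclass that looks entries up by sample index. Below is only a sketch (PrecomputedKernel is my own made-up class, not a scikit-learn API, and the toy Gram matrix stands in for the graph kernel):
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Kernel

class PrecomputedKernel(Kernel):
    """Serve entries of a fixed Gram matrix; the 'features' are row indices into K."""
    def __init__(self, K):
        self.K = K

    def __call__(self, X, Y=None, eval_gradient=False):
        X = np.asarray(X, dtype=int).ravel()
        Y = X if Y is None else np.asarray(Y, dtype=int).ravel()
        K = self.K[np.ix_(X, Y)]
        if eval_gradient:
            # no hyperparameters, so the gradient has zero components
            return K, np.empty((K.shape[0], K.shape[1], 0))
        return K

    def diag(self, X):
        X = np.asarray(X, dtype=int).ravel()
        return self.K[X, X]

    def is_stationary(self):
        return False

# toy positive semi-definite Gram matrix over 100 "graphs"
rng = np.random.RandomState(0)
A = rng.randn(100, 20)
K_full = A @ A.T
y = rng.randn(100)

train_idx, test_idx = np.arange(80), np.arange(80, 100)
gpr = GaussianProcessRegressor(kernel=PrecomputedKernel(K_full),
                               optimizer=None,  # no hyperparameters to tune
                               alpha=1e-6)      # jitter for the Cholesky
gpr.fit(train_idx.reshape(-1, 1), y[train_idx])
y_pred = gpr.predict(test_idx.reshape(-1, 1))
This passes integer indices as the "features", so the full Gram matrix has to be computed up front and each CV fold just selects its sub-blocks.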
Let's take the following data:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
data = load_breast_cancer()
X = data.data
y = data.target
I want to create a model using only the first principal component and calculate the AUC for it.
My work so far
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

pca = PCA(n_components=1)
principalComponents = pca.fit_transform(X_scaled)
principalDf = pd.DataFrame(data=principalComponents, columns=['principal component 1'])

clf = LogisticRegression()
clf = clf.fit(principalDf, y)
pred = clf.predict_proba(principalDf)
But when I try to use
fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
the following error occurs:
y should be a 1d array, got an array of shape (569, 2) instead.
I tried to reshape my data
fpr, tpr, thresholds = metrics.roc_curve(y.reshape(1,-1), pred, pos_label=2)
But it didn't solve the issue (it outputs):
multilabel-indicator format is not supported
Do you have any idea how can I perform AUC on this first principal component?
You may wish to try:
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

scaler = StandardScaler()
pca = PCA(n_components=1)  # only the first principal component, per the question
clf = LogisticRegression()

ppl = Pipeline([("scaler", scaler), ("pca", pca), ("clf", clf)])
ppl.fit(X_train, y_train)
preds = ppl.predict(X_test)

fpr, tpr, thresholds = metrics.roc_curve(y_test, preds, pos_label=1)
metrics.plot_roc_curve(ppl, X_test, y_test)
The problem is that predict_proba returns a column for each class. Generally with binary classification, your classes are 0 and 1, so you want the probability of the second class; it's quite common to slice as follows (replacing the last line in your code block):
pred = clf.predict_proba(principalDf)[:, 1]
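With that slice, roc_curve accepts the scores directly. Note that pos_label should be 1 here, since the classes in this dataset are 0 and 1. For example, the AUC the question asks for could then be computed as:
fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=1)
print(metrics.auc(fpr, tpr))
# equivalently, in one call
print(metrics.roc_auc_score(y, pred))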
I have a multiclass classification problem; the code below classifies the data at the multiclass level.
from sklearn import datasets
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Binarize the output
y_bin = label_binarize(y, classes=[0, 1, 2])
n_classes = y_bin.shape[1]
clf = OneVsRestClassifier(QDA())
y_score = cross_val_predict(clf, X, y, cv=10, method='predict_proba')
How can I calculate the performance measures listed below for this classifier using the above code?
accuracy
specificity
sensitivity
precision
MCC
F1
recall
Thank you.
You should be able to do it with:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
In this case, from what I understand, y_test is what you defined as y, and y_pred should come from out-of-fold predictions so that it matches your cross-validation setup:
y_pred = cross_val_predict(clf, X, y, cv=10)
You can find further information regarding the metrics here and some explanations on Wikipedia. Just remember that for multiclass problems, the concepts of true negative and true positive should be understood differently.
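classification_report covers precision, recall (sensitivity) and F1, but not specificity or MCC. Those can be derived from the confusion matrix; here is a minimal sketch using the out-of-fold y_pred from above (variable names are illustrative):
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, matthews_corrcoef

print(accuracy_score(y, y_pred))     # accuracy
print(matthews_corrcoef(y, y_pred))  # MCC, defined for multiclass as well

# per-class specificity = TN / (TN + FP), from the multiclass confusion matrix
cm = confusion_matrix(y, y_pred)
tp = np.diag(cm)
fp = cm.sum(axis=0) - tp
fn = cm.sum(axis=1) - tp
tn = cm.sum() - (tp + fp + fn)
print(tn / (tn + fp))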