I have a multiclass classification problem; the code below classifies the data at the multiclass level.
from sklearn import datasets
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Binarize the output
y_bin = label_binarize(y, classes=[0, 1, 2])
n_classes = y_bin.shape[1]
clf = OneVsRestClassifier(QDA())
y_score = cross_val_predict(clf, X, y, cv=10, method='predict_proba')
How can I calculate the performance measures listed below for this classifier using the above code?
accuracy
specificity
sensitivity
precision
MCC
F1
recall
You should be able to do it with:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
In this case, from what I understand, y_test is what you defined as y, and y_pred can easily be calculated via:
y_pred = clf.predict(X)
You can find further information regarding metrics here and some explanations on Wikipedia; just remember that for multilabel/multiclass cases, the concepts of true negative and true positive should be understood differently.
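That said, classification_report covers precision, recall and F1 but not specificity or MCC. Here is a minimal sketch of one way to get all the measures you listed from the cross-validated probabilities in your question (assuming the y_score produced by your cross_val_predict call; macro averaging is just one of several possible choices, and sensitivity is the same thing as recall):

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, confusion_matrix)

# Hard class predictions from the cross-validated probabilities
y_pred = np.argmax(y_score, axis=1)

print("accuracy :", accuracy_score(y, y_pred))
print("precision:", precision_score(y, y_pred, average='macro'))
print("recall   :", recall_score(y, y_pred, average='macro'))  # = sensitivity
print("f1       :", f1_score(y, y_pred, average='macro'))
print("mcc      :", matthews_corrcoef(y, y_pred))

# Specificity (true-negative rate) has no direct sklearn function for the
# multiclass case; one way is to derive it per class from the confusion
# matrix in a one-vs-rest fashion and average:
cm = confusion_matrix(y, y_pred)
specificities = []
for i in range(cm.shape[0]):
    tn = cm.sum() - cm[i, :].sum() - cm[:, i].sum() + cm[i, i]
    fp = cm[:, i].sum() - cm[i, i]
    specificities.append(tn / (tn + fp))
print("specificity (macro):", np.mean(specificities))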
I saw this post that trained an LGBM model, but I would like to know how to adapt it for Lasso. I know the prediction may not necessarily be between 0 and 1, but I would like to try this model. I have tried this, but it doesn't work:
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_features=10, random_state=0, n_classes=2, n_samples=1000, n_informative=8)

class Lasso(Lasso):
    def predict(self, X, threshold=0.5):
        result = super(Lasso, self).predict_proba(X)
        predictions = [1 if p > threshold else 0 for p in result[:, 0]]
        return predictions

clf = Lasso(alpha=0.05)
clf.fit(X, y)
precision = cross_val_score(Lasso(), X, y, cv=5, scoring='precision')
I get
UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan
The specific model class you chose (Lasso()) is actually used for regression problems, as it minimizes a penalized square loss, which is not appropriate in your case. Instead, you can use LogisticRegression() with an L1 penalty to optimize a logistic function with a Lasso penalty. To control the regularization strength, the C= parameter is used (see docs); note that C is the inverse of the regularization strength, so smaller values mean stronger regularization.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(
    n_features=10, random_state=0, n_classes=2, n_samples=1000, n_informative=8
)

class LassoLR(LogisticRegression):
    def predict(self, X, threshold=0.5):
        result = super(LassoLR, self).predict_proba(X)
        predictions = [1 if p > threshold else 0 for p in result[:, 0]]
        return predictions

clf = LassoLR(penalty='l1', solver='liblinear', C=1.)
clf.fit(X, y)
precision = cross_val_score(LassoLR(), X, y, cv=5, scoring='precision')
print(precision)
# array([0.04166667, 0.08163265, 0.1010101 , 0.125     , 0.05940594])
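One detail worth double-checking in the snippet above: predict_proba returns one column per class, so result[:, 0] is the probability of class 0, not of the positive class. If the intent is to threshold the positive-class probability, a variant like the following (a sketch, not verified against the output above) would be the usual choice, and it may explain the very low precision values:

class LassoLR(LogisticRegression):
    def predict(self, X, threshold=0.5):
        # column 1 of predict_proba holds the probability of the positive class (label 1)
        proba = super(LassoLR, self).predict_proba(X)
        return (proba[:, 1] > threshold).astype(int)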
I have a simple question concerning the voting classifier. As I understand it, the voting classifier should achieve higher accuracy than the individual predictors it is built from (the wisdom of the crowd). Here is the code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# import dataset
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

# split the dataset into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)
log_clf = LogisticRegression(solver='liblinear', random_state=42)
svm_clf = SVC(gamma='auto', random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')
voting_clf = voting_clf.fit(X_train, y_train)

predictors_list = [log_clf, rnd_clf, svm_clf, voting_clf]

for clf in predictors_list:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(clf.__class__.__name__, accuracy)
What I get as accuracy is as follows:
LogisticRegression 0.776
RandomForestClassifier 0.88
SVC 0.864
VotingClassifier 0.864
As you can see, for this run the Random Forest predictor has a slightly better accuracy than the VotingClassifier!
Any explanation for this?
Let's take a look at the voting parameter you passed, 'hard'. The documentation says:
If ‘hard’, uses predicted class labels for majority rule voting. Else if ‘soft’, predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.
So maybe the predictions of LogisticRegression and your SVC (SVM) are the same and wrong for some cases; this makes your majority vote wrong for those cases.
You can use voting='soft', or assign a weight as a prior to each model's prediction; this way you make the ensemble a little more immune to the wrong predictions of weaker models and rely more on your best models, as in the sketch below.
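A minimal sketch of that soft-voting variant, reusing the estimators from the question; note that SVC needs probability=True so it can supply class probabilities, and the weights chosen here are purely illustrative:

svm_clf_soft = SVC(gamma='auto', probability=True, random_state=42)

soft_voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf_soft)],
    voting='soft',
    weights=[1, 2, 2])  # illustrative: trust the stronger models a bit more

soft_voting_clf.fit(X_train, y_train)
print(accuracy_score(y_test, soft_voting_clf.predict(X_test)))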
I am using sklearn to compute the average precision and roc_auc of a classifier and yellowbrick to plot the roc_auc and precision-recall curves. The problem is that the packages give different scores for both metrics, and I do not know which one is correct.
The code used:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ROCAUC
from yellowbrick.classifier import PrecisionRecallCurve
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score
seed = 42
# provide the data
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
n_informative=2, random_state=seed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf_lr = LogisticRegression(random_state=seed)
clf_lr.fit(X_train, y_train)
y_pred = clf_lr.predict(X_test)
roc_auc = roc_auc_score(y_test, y_pred)
avg_precision = average_precision_score(y_test, y_pred)
print(f"ROC_AUC: {roc_auc}")
print(f"Average_precision: {avg_precision}")
print('='*20)
# visualizations
viz3 = ROCAUC(LogisticRegression(random_state=seed))
viz3.fit(X_train, y_train)
viz3.score(X_test, y_test)
viz3.show()
viz4 = PrecisionRecallCurve(LogisticRegression(random_state=seed))
viz4.fit(X_train, y_train)
viz4.score(X_test, y_test)
viz4.show()
The metrics give different values depending on the package: the print statements show the values computed by scikit-learn, whereas the plots are annotated with the values computed by yellowbrick.
Since you use the predict method of scikit-learn, your predictions y_pred are hard class memberships, and not probabilities:
np.unique(y_pred)
# array([0, 1])
But for ROC and Precision-Recall calculations, this should not be the case; the predictions you pass to these methods should be probabilities, and not hard classes. From the average_precision_score docs:
y_score : array, shape = [n_samples] or [n_samples, n_classes]
Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by "decision_function" on some classifiers).
where non-thresholded means exactly not hard classes. Similar is the case for the roc_auc_score (docs).
Correcting this with the following code, makes the scikit-learn results identical to the ones returned by Yellowbrick:
y_pred = clf_lr.predict_proba(X_test) # get probabilities
y_prob = np.array([x[1] for x in y_pred]) # keep the prob for the positive class 1
roc_auc = roc_auc_score(y_test, y_prob)
avg_precision = average_precision_score(y_test, y_prob)
print(f"ROC_AUC: {roc_auc}")
print(f"Average_precision: {avg_precision}")
Results:
ROC_AUC: 0.9545954595459546
Average_precision: 0.9541994473779806
As Yellowbrick handles all these computational details internally (and transparently), it does not suffer from the mistake in the manual scikit-learn procedure made here.
Notice that, in the binary case (as here), you can (and should) make your plots less cluttered with the binary=True argument:
viz3 = ROCAUC(LogisticRegression(random_state=seed), binary=True) # similarly for the PrecisionRecall curve
and that, contrary to what one might expect intuitively, for the binary case at least, the score method of ROCAUC will not return the AUC but the accuracy, as specified in the docs:
viz3.score(X_test, y_test)
# 0.88
# verify this is the accuracy:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, clf_lr.predict(X_test))
# 0.88
I am trying to predict Boston housing prices. When I choose polynomial regression of degree 1 or 2, the R2 score is OK, but degree 3 drastically decreases the R2 score.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
from sklearn.datasets import load_boston
boston_dataset = load_boston()
dataset = pd.DataFrame(boston_dataset.data, columns = boston_dataset.feature_names)
dataset['MEDV'] = boston_dataset.target
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values.reshape(-1,1)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 2) # <-- Tuning to 3
X_poly = poly_reg.fit_transform(X_train)
poly_reg.fit(X_poly, y_train)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y_train)
y_pred = lin_reg_2.predict(poly_reg.fit_transform(X_test))
from sklearn.metrics import r2_score
print('Prediction Score is: ', r2_score(y_test, y_pred))
Output (degree=2):
Prediction Score is: 0.6903318065831567
Output (degree=3):
Prediction Score is: -12898.308114085281
This is called overfitting the model. What you are doing is fitting the model perfectly on the training set, which leads to high variance. When your hypothesis fits the training set that well, it will then fail on the test set. You can check the R2 score on your training set using r2_score(y_train, lin_reg_2.predict(X_poly)); it will be high. You need to balance the trade-off between bias and variance.
You can try other regression models like Lasso and Ridge and play with their alpha value if you are looking for a higher r2_score; a sketch follows below. For a better understanding, picture how the fitted hypothesis line gets more and more distorted as the degree of the polynomial increases.
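A minimal sketch of that idea, reusing X_train/X_test from the question; the degree and alpha values are just starting points to tune, and scaling the polynomial features before the regularized fit is an assumption that usually helps:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

# Ridge (or Lasso) shrinks the many degree-3 coefficients and tames the variance
model = make_pipeline(PolynomialFeatures(degree=3),
                      StandardScaler(),
                      Ridge(alpha=10.0))
model.fit(X_train, y_train)

print('Train R2:', r2_score(y_train, model.predict(X_train)))
print('Test  R2:', r2_score(y_test, model.predict(X_test)))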
I am trying to estimate the confusion matrix of a classifier using 10-fold cross-validation with sklearn.
To compute the confusion matrix I am using sklearn.metrics.confusion_matrix. I know that I can evaluate a model with cv using sklearn.model_selection.cross_val_score and sklearn.metrics.make_scorer like:
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import cross_val_score
cm = cross_val_score(clf, X, y, scoring=make_scorer(confusion_matrix))
Where clf is my classifier and X, y the feature and class vectors. However, this will raise an error since confusion_matrix does not return a float number but a matrix.
I've tried doing something like:
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold

def cv_confusion_matrix(clf, X, y, folds=10):
    skf = StratifiedKFold(n_splits=folds)
    cv_iter = skf.split(X, y)
    cms = []
    for train, test in cv_iter:
        clf.fit(X[train], y[train])
        cm = confusion_matrix(y[test], clf.predict(X[test]), labels=clf.classes_)
        cms.append(cm)
    # average the per-fold confusion matrices (stacked along axis 0)
    return np.mean(np.array(cms), axis=0)
This will work, but I am missing the parallelism that sklearn provides with cross_val_score and its n_jobs parameter.
Is there any way to do this and to take the advantage of the parallelism?
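One possible approach (a sketch, assuming the same clf, X and y as above): let cross_val_predict parallelize the folds via n_jobs and build a single pooled confusion matrix from the out-of-fold predictions. Note that this yields one aggregated matrix rather than a per-fold mean:

import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

skf = StratifiedKFold(n_splits=10)
# the folds are fitted in parallel across all available cores
y_oof = cross_val_predict(clf, X, y, cv=skf, n_jobs=-1)
cm = confusion_matrix(y, y_oof, labels=np.unique(y))
print(cm)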