AttributeError: 'NumpyArrayIterator' object has no attribute 'classes' - python

I get this error:
AttributeError: 'NumpyArrayIterator' object has no attribute 'classes'
I am trying to make a confusion matrix to evaluate the Neural Net I have trained. I am using ImageDatagenerator and datagen.flow functions for before the fit_generator function for training.
For predictions I use the predict_generator function on the test set. All is working fine so far. Issue arrises in the following:
test_generator.reset()
pred = model.predict_generator(test_generator, steps=len(test_generator), verbose=2)
from sklearn.metrics import classification_report, confusion_matrix, cohen_kappa_score
y_pred = np.argmax(pred, axis=1)
print('Confusion Matrix')
print(pd.DataFrame(confusion_matrix(test_generator.classes, y_pred)))
I should be seeing a confusion matrix but instead I see an error. I ran the same code with sample data before I ran on the actual dataset and that did show me the results.

First you need to extract labels from generator and then put them in confusion_matrix function. To extract labels use x_gen,y_gen = test_generator.next(), just pay attention that labels are one hot encoded. Example:
test_generator.reset()
pred = model.predict_generator(test_generator, steps=len(test_generator), verbose=2)
from sklearn.metrics import classification_report, confusion_matrix, cohen_kappa_score
y_pred = np.argmax(pred, axis=1)
x_gen,y_gen = test_generator.next()
y_gen = np.argmax(y_gen, axis=1)
print('Confusion Matrix')
print(pd.DataFrame(confusion_matrix(y_gen, y_pred)))

Related

Why is there an error while printing the accuracy score?

For testing the accuracy of my trained model I am using the accuracy_score function but it is not working.
from sklearn.metrics import accuracy_score
y_test = pd.read_csv('Test.csv')
labels = y_test["ClassId"].values
imgs = y_test["Path"].values
data=[]
for img in imgs:
image = Image.open(img)
image = image.resize((30,30))
data.append(np.array(image))
X_test=np.array(data)
pred = model.predict(X_test)
classes_x=np.argmax(X_test,axis=1)
#Accuracy with the test data
from sklearn.metrics import accuracy_score
print(accuracy_score(labels, pred))
ERROR:
This is what it shows
It seems like the problem has something to do with the format you use to represent the output of the model.
I will assume that you are using One hot coding, so you do:
pred = model.predict(X_test)
classes_x=np.argmax(X_test,axis=1)
On np.argmax should go axis=-1:
predictions = np.argmax(model.predict(X_test), axis=-1)
At the end, on accuracy function you're sending pred, no classes_x.
print(accuracy_score(labels, pred))

How to solve Feature Shape Mismatch error in XGBoost Feature Selection?

I'm trying to use the following code to get feature importance for feature selection in XGBOOST. But I keep getting an error saying that there's a "Feature shape mismatch, expected: 91, got 78".
"select_X_train" has 78 features (columns) as does "select_X_test" so I think the issue is that the thresholds array must still contain the 91 original variables? But, I can't figure out a way to fix it. Can you please advise?
Here is the code:
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import classification_report
# load data
dataset = df_train
# split data into X and y
X_train = df[df.columns.difference(['Date','IsDeceased','IsTotal','Deceased','Sick','Injured','Displaced','Homeless','MissingPeople','Other','Total'])]
y_train = df['IsDeceased'].values
X_test = df_test[df_test.columns.difference(['Date','IsDeceased','IsTotal','Deceased','Sick','Injured','Displaced','Homeless','MissingPeople','Other','Total'])]
y_test = df_test['IsDeceased'].values
# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
# select features using threshold
selection = SelectFromModel(model, threshold=thresh, prefit=True)
select_X_train = selection.transform(X_train)
# train model
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
print(thresh)
# eval model
select_X_test = selection.transform(X_test)
y_pred = model.predict(select_X_test)
report = classification_report(y_test,y_pred)
print("Thresh= {} , n= {}\n {}" .format(thresh,select_X_train.shape[1], report))
cm = confusion_matrix(y_test, y_pred)
print(cm)```

How to output feature names with XGBOOST feature selection

My model uses feature importance for feature selection with XGBOOST. But, at the end, it outputs all the confusion matrices/results and how many features the model includes. That now works successfully, but I also need to have the feature names that were used in each model outputted as well.
I get a warning that says "X has feature names, but SelectFromModel was fitted without feature names", so I know something needs to be added to have them be in the model before I can output them, but I'm not sure how to handle either of those steps. I found several old questions about this, but I wasn't able to successfully implement any of them to my particular code. I'd really appreciate any ideas you have. Thank you!
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import classification_report
# load data
dataset = df_train
# split data into X and y
X_train = df[df.columns.difference(['IsDeceased','IsTotal','Deceased','Sick','Injured','Displaced','Homeless','MissingPeople','Other','Total'])]
y_train = df['IsDeceased'].values
X_test = df_test[df_test.columns.difference(['IsDeceased','IsTotal','Deceased','Sick','Injured','Displaced','Homeless','MissingPeople','Other','Total'])]
y_test = df_test['IsDeceased'].values
# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
# select features using threshold
selection = SelectFromModel(model, threshold=thresh, prefit=True)
select_X_train = selection.transform(X_train)
# train model
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
print(thresh)
# eval model
select_X_test = selection.transform(X_test)
y_pred = selection_model.predict(select_X_test)
report = classification_report(y_test,y_pred)
print("Thresh= {} , n= {}\n {}" .format(thresh,select_X_train.shape[1], report))
cm = confusion_matrix(y_test, y_pred)
print(cm)

Scikit-learn and Yellowbrick giving different scores

I am using sklearn to compute the average precision and roc_auc of a classifier and yellowbrick to plot the roc_auc and precision-recall curves. The problem is that the packages give different scores in both metrics and I do not know which one is the correct.
The code used:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ROCAUC
from yellowbrick.classifier import PrecisionRecallCurve
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score
seed = 42
# provides de data
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
n_informative=2, random_state=seed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf_lr = LogisticRegression(random_state=seed)
clf_lr.fit(X_train, y_train)
y_pred = clf_lr.predict(X_test)
roc_auc = roc_auc_score(y_test, y_pred)
avg_precision = average_precision_score(y_test, y_pred)
print(f"ROC_AUC: {roc_auc}")
print(f"Average_precision: {avg_precision}")
print('='*20)
# visualizations
viz3 = ROCAUC(LogisticRegression(random_state=seed))
viz3.fit(X_train, y_train)
viz3.score(X_test, y_test)
viz3.show()
viz4 = PrecisionRecallCurve(LogisticRegression(random_state=seed))
viz4.fit(X_train, y_train)
viz4.score(X_test, y_test)
viz4.show()
The code produces the following output:
As it can be seen above, the metrics give different values depending the package. In the print statement are the values computed by scikit-learn whereas in the plots appear annotated the values computed by yellowbrick.
Since you use the predict method of scikit-learn, your predictions y_pred are hard class memberships, and not probabilities:
np.unique(y_pred)
# array([0, 1])
But for ROC and Precision-Recall calculations, this should not be the case; the predictions you pass to these methods should be probabilities, and not hard classes. From the average_precision_score docs:
y_score: array, shape = [n_samples] or [n_samples, n_classes]
Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as
returned by “decision_function” on some classifiers).
where non-thresholded means exactly not hard classes. Similar is the case for the roc_auc_score (docs).
Correcting this with the following code, makes the scikit-learn results identical to the ones returned by Yellowbrick:
y_pred = clf_lr.predict_proba(X_test) # get probabilities
y_prob = np.array([x[1] for x in y_pred]) # keep the prob for the positive class 1
roc_auc = roc_auc_score(y_test, y_prob)
avg_precision = average_precision_score(y_test, y_prob)
print(f"ROC_AUC: {roc_auc}")
print(f"Average_precision: {avg_precision}")
Results:
ROC_AUC: 0.9545954595459546
Average_precision: 0.9541994473779806
As Yellowbrick handles all these computational details internally (and transparently), it does not suffer from the mistake in the manual scikit-learn procedure made here.
Notice that, in the binary case (as here), you can (and should) make your plots less cluttered with the binary=True argument:
viz3 = ROCAUC(LogisticRegression(random_state=seed), binary=True) # similarly for the PrecisionRecall curve
and that, contrary to what one migh expect intuitively, for the binary case at least, the score method of ROCAUC will not return the AUC, but the accuracy, as specified in the docs:
viz3.score(X_test, y_test)
# 0.88
# verify this is the accuracy:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, clf_lr.predict(X_test))
# 0.88

How to calculate slearn.roc_auc_score using python?

I want to calculate and print roc_auc_score to evaluate the performance of my model (random forest), I am doing NLP, so my data in y_test and y_pred are basically list of words, I vectorise them with the function pipe_vect.transform but when I launch the training I have the following error:
ValueError: Only one class present in y_true. ROC AUC score is not defined in that case."
I think the error appear because y_test and y_pred havn't the same dimension.
My code :
x_test_vect = pipe_vect.transform(x_test)
y_pred = model.predict_proba(x_test_vect)
auc_score = roc_auc_score(y_test, y_pred)
print('Performance model :', auc_score)

Categories

Resources