I have a project that uses the kNN algorithm on a CSV file and displays selected metrics, but when I try to print some of the metrics it throws a few errors.
When trying to use sensitivity, F1 score, and precision:
Sensitivity - print(metrics.recall_score(y_test, y_pred_class))
F1 score - print(metrics.f1_score(y_test, y_pred_class))
Precision - print(metrics.precision_score(y_test, y_pred_class))
PyCharm throws the following error:
ValueError: Target is multiclass but average='binary'. Please choose another average setting
The error when trying to print the ROC curve is a little different:
ValueError: multiclass format is not supported
DATASET
LINK TO DATASET: https://www.dropbox.com/s/yt3n1eqxlsb816n/Testfile%20-%20kNN.csv?dl=0
Program
import matplotlib
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
from matplotlib.dviread import Text
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# Tools for testing
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
def main():
    dataset = pd.read_csv('filetestKNN.csv')

    X = dataset.drop(columns=['Label'])
    y = dataset['Label'].values

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.34)

    Classifier = KNeighborsClassifier(n_neighbors=2, p=2, metric='euclidean')
    Classifier.fit(X_train, y_train)

    y_pred_class = Classifier.predict(X_test)
    y_pred_prob = Classifier.predict_proba(X_test)[:, 1]

    accuracy = Classifier.score(X_test, y_test)
    confusion = metrics.confusion_matrix(y_test, y_pred_class)

    print()
    print("Accuracy")
    print(metrics.accuracy_score(y_test, y_pred_class))
    print()
    print("Classification Error")
    print(1 - metrics.accuracy_score(y_test, y_pred_class))
    print()
    print("Confusion matrix")
    print(metrics.confusion_matrix(y_test, y_pred_class))

    # error
    print(metrics.recall_score(y_test, y_pred_class))
    # error
    print(metrics.roc_curve(y_test, y_pred_class))
    # error
    print(metrics.f1_score(y_test, y_pred_class))
    # error
    print(metrics.precision_score(y_test, y_pred_class))

main()
I just wanted to show the algorithm metrics on the screen.
You need to set the average keyword argument when calling these sklearn.metrics functions. For an example, look at the documentation of f1_score. Here is the part describing the average keyword argument:
average : string, [None, 'binary' (default), 'micro', 'macro', 'samples', 'weighted']
    This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:
    'binary':
        Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.
    'micro':
        Calculate metrics globally by counting the total true positives, false negatives and false positives.
    'macro':
        Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
    'weighted':
        Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters 'macro' to account for label imbalance; it can result in an F-score that is not between precision and recall.
    'samples':
        Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score).
Here we can see that this describes how results are aggregated over the different labels on your multiclass task. I'm not sure which one you'd like to use, but micro seems nice. Here's how your call to f1_score would look with this choice:
print(metrics.f1_score(y_test, y_pred_class, average='micro'))
You can adjust the other metrics similarly. Hope this helps.
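For completeness, here is a sketch (my own adaptation, reusing the variable names from the question) of how the failing calls could look with an average setting, plus a one-vs-rest workaround for the roc_curve error, since roc_curve itself only supports binary targets:

from sklearn.preprocessing import label_binarize

# averaged versions of the metrics that failed with the default average='binary'
print(metrics.recall_score(y_test, y_pred_class, average='micro'))
print(metrics.f1_score(y_test, y_pred_class, average='micro'))
print(metrics.precision_score(y_test, y_pred_class, average='micro'))

# roc_curve only supports binary targets; for a multiclass problem one common
# workaround is to binarize the labels and compute one curve per class
classes = Classifier.classes_
y_test_bin = label_binarize(y_test, classes=classes)
y_prob = Classifier.predict_proba(X_test)
for i, cls in enumerate(classes):
    fpr, tpr, thresholds = metrics.roc_curve(y_test_bin[:, i], y_prob[:, i])
    print(cls, fpr, tpr)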
Related
My model uses feature importance for feature selection with XGBoost. At the end, it outputs all the confusion matrices/results and how many features each model includes. That now works successfully, but I also need the feature names that were used in each model to be output as well.
I get a warning that says "X has feature names, but SelectFromModel was fitted without feature names", so I know something needs to be added for the names to be in the model before I can output them, but I'm not sure how to handle either of those steps. I found several old questions about this, but I wasn't able to successfully apply any of them to my particular code. I'd really appreciate any ideas you have. Thank you!
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix  # used for the confusion matrix printed below
# load data
dataset = df_train
# split data into X and y
X_train = df[df.columns.difference(['IsDeceased','IsTotal','Deceased','Sick','Injured','Displaced','Homeless','MissingPeople','Other','Total'])]
y_train = df['IsDeceased'].values
X_test = df_test[df_test.columns.difference(['IsDeceased','IsTotal','Deceased','Sick','Injured','Displaced','Homeless','MissingPeople','Other','Total'])]
y_test = df_test['IsDeceased'].values
# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
y_pred = model.predict(X_test)             # prediction step so `accuracy` below is defined
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    print(thresh)
    # eval model
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    report = classification_report(y_test, y_pred)
    print("Thresh= {} , n= {}\n {}".format(thresh, select_X_train.shape[1], report))
    cm = confusion_matrix(y_test, y_pred)
    print(cm)
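One way to recover the feature names kept at each threshold (a sketch, assuming X_train is a pandas DataFrame as in the snippet above) is SelectFromModel's get_support method, which returns a boolean mask over the input columns:

# inside the loop, right after creating `selection`:
mask = selection.get_support()            # boolean mask over the input columns
selected_names = X_train.columns[mask]    # column names the model kept
print("Thresh= {} , features: {}".format(thresh, list(selected_names)))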
I am using sklearn to compute the average precision and roc_auc of a classifier, and yellowbrick to plot the roc_auc and precision-recall curves. The problem is that the packages give different scores for both metrics, and I do not know which one is correct.
The code used:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ROCAUC
from yellowbrick.classifier import PrecisionRecallCurve
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score
seed = 42
# generate the data
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           n_informative=2, random_state=seed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf_lr = LogisticRegression(random_state=seed)
clf_lr.fit(X_train, y_train)
y_pred = clf_lr.predict(X_test)
roc_auc = roc_auc_score(y_test, y_pred)
avg_precision = average_precision_score(y_test, y_pred)
print(f"ROC_AUC: {roc_auc}")
print(f"Average_precision: {avg_precision}")
print('='*20)
# visualizations
viz3 = ROCAUC(LogisticRegression(random_state=seed))
viz3.fit(X_train, y_train)
viz3.score(X_test, y_test)
viz3.show()
viz4 = PrecisionRecallCurve(LogisticRegression(random_state=seed))
viz4.fit(X_train, y_train)
viz4.score(X_test, y_test)
viz4.show()
The code produces the following output (ROC and precision-recall plots, plus the printed scores).
As can be seen, the metrics give different values depending on the package: the print statements show the values computed by scikit-learn, whereas the plots are annotated with the values computed by yellowbrick.
Since you use the predict method of scikit-learn, your predictions y_pred are hard class memberships, and not probabilities:
np.unique(y_pred)
# array([0, 1])
But for ROC and Precision-Recall calculations, this should not be the case; the predictions you pass to these methods should be probabilities, and not hard classes. From the average_precision_score docs:
y_score : array, shape = [n_samples] or [n_samples, n_classes]
    Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by "decision_function" on some classifiers).
where non-thresholded means exactly not hard classes. Similar is the case for the roc_auc_score (docs).
Correcting this with the following code, makes the scikit-learn results identical to the ones returned by Yellowbrick:
y_pred = clf_lr.predict_proba(X_test) # get probabilities
y_prob = np.array([x[1] for x in y_pred]) # keep the prob for the positive class 1
roc_auc = roc_auc_score(y_test, y_prob)
avg_precision = average_precision_score(y_test, y_prob)
print(f"ROC_AUC: {roc_auc}")
print(f"Average_precision: {avg_precision}")
Results:
ROC_AUC: 0.9545954595459546
Average_precision: 0.9541994473779806
As Yellowbrick handles all these computational details internally (and transparently), it does not suffer from the mistake in the manual scikit-learn procedure made here.
Notice that, in the binary case (as here), you can (and should) make your plots less cluttered with the binary=True argument:
viz3 = ROCAUC(LogisticRegression(random_state=seed), binary=True) # similarly for the PrecisionRecall curve
and that, contrary to what one might expect intuitively, for the binary case at least, the score method of ROCAUC will not return the AUC, but the accuracy, as specified in the docs:
viz3.score(X_test, y_test)
# 0.88
# verify this is the accuracy:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, clf_lr.predict(X_test))
# 0.88
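As a further aside, for estimators that expose decision_function (such as LogisticRegression here), you can pass those scores instead of probabilities; both metrics only need a ranking of the samples, so the results are the same:

# decision_function scores are a valid y_score as well: ROC AUC and
# average precision only require a ranking, not calibrated probabilities
y_score = clf_lr.decision_function(X_test)
print(f"ROC_AUC: {roc_auc_score(y_test, y_score)}")
print(f"Average_precision: {average_precision_score(y_test, y_score)}")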
This is my dataset, and Median_Price is my target variable.
The RMSE values before and after GridSearchCV parameter tuning are included in the code below. How can I decrease the RMSE for my dataset?
The dataset can be downloaded from Google Drive here, and I have also added a picture of the dataset for reference.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as seabornInstance
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction import DictVectorizer
from io import StringIO
from sklearn import metrics
%matplotlib inline
dataset = pd.read_csv('E:/MMU/FYP/Property Recommendation System/Final Dataset/median/Top5_median.csv')
dataset['Median_Price'] = dataset['Median_Price'].str.replace(',', '').astype(int)
dataset['population'] = dataset['population'].apply(np.int64)
dataset['Median_Price'] = dataset['Median_Price'].apply(np.int64)
dataset['Type1'] = pd.to_numeric(dataset['Type1'], errors='coerce')
dataset['Type2'] = pd.to_numeric(dataset['Type2'], errors='coerce')
dataset = dataset.replace(np.nan, 0, regex=True)
X = dataset[['Type1','Type2','Filed Transactions', 'population', 'Jr Secure Technology']]
y = dataset['Median_Price']
from sklearn.model_selection import cross_val_score

# function to get cross-validation scores
def get_cv_scores(model):
    scores = cross_val_score(model,
                             X_train,
                             y_train,
                             cv=5,
                             scoring='neg_mean_squared_error')
    print('CV Mean: ', np.mean(scores))
    print('STD: ', np.std(scores))
    print('\n')
# train/test split (split values assumed; this step is omitted in the snippet above)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

# get cross val scores
get_cv_scores(regressor)
from sklearn.linear_model import Ridge

# Train model with default alpha=1
ridge = Ridge(alpha=1).fit(X_train, y_train)

# get cross val scores
get_cv_scores(ridge)

# find optimal alpha with grid search
alpha = [9, 10, 11, 12, 13, 14, 15, 100, 1000]
param_grid = dict(alpha=alpha)
grid = GridSearchCV(estimator=ridge, param_grid=param_grid, scoring='neg_mean_squared_error', verbose=1, n_jobs=-1)
grid_result = grid.fit(X_train, y_train)
print('Best Score: ', grid_result.best_score_)
print('Best Params: ', grid_result.best_params_)

### Before GridSearch RMSE: 487656.3828
### After GridSearch RMSE: 453873.438
y_pred = regressor.predict(X_test)  # predictions on the test set, needed for the metrics below

coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
coeff_df

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Dataset CSV download link
Well, there does seem to be a decent decrease in the RMSE value after using GridSearchCV.
You can try feature selection, feature engineering, scaling your data, transformations, and other algorithms; these might help you decrease the RMSE value to some extent.
Also, the RMSE value depends entirely on the context and scale of your data. Your target values seem to be spread far apart, which is why you are getting such a high RMSE value. The techniques mentioned above can decrease the RMSE only to a limited extent.
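To illustrate the scaling suggestion, here is a minimal sketch (assuming the X_train and y_train from the question) that puts a StandardScaler and Ridge into a Pipeline, so the grid search tunes alpha on properly scaled data without leaking test information:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# the scaler is re-fit on the training folds inside each CV split
pipe = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge())])
param_grid = {'ridge__alpha': [0.1, 1, 10, 100, 1000]}
grid = GridSearchCV(pipe, param_grid, scoring='neg_mean_squared_error', cv=5, n_jobs=-1)
grid_result = grid.fit(X_train, y_train)
print('Best Score: ', grid_result.best_score_)
print('Best Params: ', grid_result.best_params_)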
As a beginner in scikit-learn, and trying to classify the iris dataset, I'm having problems with adjusting the scoring metric from scoring='accuracy' to others like precision, recall, f1 etc., in the cross-validation step. Below is the full code sample (enough to start at # Test options and evaluation metric).
# Load libraries
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection # for command model_selection.cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)
# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
# Test options and evaluation metric
seed = 7
scoring = 'accuracy'
#Below, we build and evaluate 6 different models
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn: we calculate the cv scores, their mean, and their std for each model
results = []
names = []
for name, model in models:
    # below, we do k-fold cross-validation
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
Now, apart from scoring='accuracy', I'd like to evaluate other performance metrics for this multiclass classification problem. But when I use scoring='precision', it raises:
ValueError: Target is multiclass but average='binary'. Please choose another average setting.
My questions are:
1) I guess the above is happening because 'precision' and 'recall' are defined in scikit-learn only for binary classification - is that correct? If yes, then which command(s) should replace scoring='accuracy' in the code above?
2) If I want to compute the confusion matrix, precision and recall for each fold while performing the k-fold cross validation, what commands should I type?
3) For the sake of experimentation, I tried scoring='balanced_accuracy', only to find:
ValueError: 'balanced_accuracy' is not a valid scoring value.
Why is this happening, when the model evaluation documentation (https://scikit-learn.org/stable/modules/model_evaluation.html) clearly says balanced_accuracy is a scoring method? I'm quite confused here, so actual code showing how to evaluate other performance metrics would be appreciated! Thanks in advance!!
1) I guess the above is happening because 'precision' and 'recall' are defined in scikit-learn only for binary classification - is that correct?
No. Precision & recall are certainly valid for multi-class problems, too - see the docs for precision & recall.
If yes, then, which command(s) should replace scoring='accuracy' in the code above?
The problem arises because, as you can see from the documentation links I have provided above, the default setting for these metrics is for binary classification (average='binary'). In your case of multi-class classification, you need to specify which exact "version" of the particular metric you are interested in (there are more than one); have a look at the relevant page of the scikit-learn documentation, but some valid options for your scoring parameter could be:
'precision_macro'
'precision_micro'
'precision_weighted'
'recall_macro'
'recall_micro'
'recall_weighted'
The documentation link above contains even an example of using 'recall_macro' with the iris data - be sure to check it.
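For instance, here is a minimal sketch based on the question's own loop (with shuffle=True added so the random_state is actually used):

scoring = 'precision_macro'  # or any of the options listed above
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
cv_results = model_selection.cross_val_score(LogisticRegression(), X_train, Y_train, cv=kfold, scoring=scoring)
print("precision_macro: %f (%f)" % (cv_results.mean(), cv_results.std()))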
2) If I want to compute the confusion matrix, precision and recall for each fold while performing the k-fold cross validation, what commands should I type?
This is not exactly trivial, but you can see a way in my answer for Cross-validation metrics in scikit-learn for each data split
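The gist of that approach, as a sketch using the variables from the question, is to iterate over the folds yourself instead of calling cross_val_score:

from sklearn.metrics import precision_score, recall_score

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
model = LogisticRegression()
for train_idx, test_idx in kfold.split(X_train):
    # fit on this fold's training part, evaluate on its held-out part
    model.fit(X_train[train_idx], Y_train[train_idx])
    fold_pred = model.predict(X_train[test_idx])
    print(confusion_matrix(Y_train[test_idx], fold_pred))
    print(precision_score(Y_train[test_idx], fold_pred, average='macro'))
    print(recall_score(Y_train[test_idx], fold_pred, average='macro'))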
3) For the sake of experimentation, I tried scoring='balanced_accuracy', only to find:
ValueError: 'balanced_accuracy' is not a valid scoring value.
This is because you are probably using an older version of scikit-learn. balanced_accuracy became available only in v0.20 - you can verify that it is not available in v0.18. Upgrade your scikit-learn to v0.20 and you should be fine.
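You can check which version you have installed with:

import sklearn
print(sklearn.__version__)  # 'balanced_accuracy' requires >= 0.20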
How can I find the overall accuracy of the outputs that we get by running a decision tree algorithm? I am able to get the top five class labels for the active user input, and I am getting the accuracy for the X_train and y_train datasets using accuracy_score(). Suppose I am getting the top five recommendations: I wish to get the accuracy for each class label and, with the help of these, the overall accuracy of the output. Please suggest some ideas.
My python script is here:
Here, event is the list of class labels.
DTC = DecisionTreeClassifier()
DTC.fit(X_train_one_hot, y_train)
print("output from DTC:")
res = DTC.predict_proba(X_test_one_hot)
new = list(chain.from_iterable(res))
# Here I got the index values of the top five probabilities
index = sorted(range(len(new)), key=lambda i: new[i], reverse=True)[:5]
for i in index:
    print(event[i])
Here is the sample code which I tried in order to get the accuracy for the predicted class labels. Here, index holds the indices of the top five class-label probabilities and event is the list of class labels.
for i in index:
    DTC.fit(X_train_one_hot, y_train)
    y_pred = event[i]
    AC = accuracy_score((event, y_pred) * 100)
    print(AC)
Since you have a multi-class classification problem, you can calculate accuracy of the classifier by using the confusion_matrix function in Python.
To get overall accuracy, sum the values in the diagonal and divide the sum by the total number of samples.
Consider the following simple multi-class classification example using the IRIS dataset:
import itertools
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
class_names = iris.target_names
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Run classifier, using a model that is too regularized (C too low) to see
# the impact on the results
classifier = svm.SVC(kernel='linear', C=0.01)
y_pred = classifier.fit(X_train, y_train).predict(X_test)
Now to calculate overall accuracy, use confusion matrix:
conf_mat = confusion_matrix(y_pred, y_test)
acc = np.sum(conf_mat.diagonal()) / np.sum(conf_mat)
print('Overall accuracy: {} %'.format(acc*100))
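As a sanity check, this should match sklearn's accuracy_score, since the diagonal of the confusion matrix counts exactly the correct predictions:

from sklearn.metrics import accuracy_score

# identical to summing the confusion matrix diagonal and dividing by the total
print('accuracy_score: {} %'.format(accuracy_score(y_test, y_pred) * 100))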