My code uses the random forest algorithm to predict the value of the gold column from the remaining columns.
My input file has the following form:
gold,MethodType,CallersT,CallersN,CallersU,CallersCallersT,CallersCallersN,CallersCallersU,CalleesT,CalleesN,CalleesU,CalleesCalleesT,CalleesCalleesN,CalleesCalleesU,classGold,VariableTraceValue
T,Inner,1,0,0,1,1,0,1,1,1,1,0,1,Trace,U
T,Inner,1,0,0,1,0,0,1,0,1,1,0,0,Trace,T
T,Inner,1,0,1,0,0,1,0,0,1,0,0,1,Trace,T
N,Inner,0,0,1,0,0,1,0,1,1,0,0,1,Trace,T
N,Inner,0,1,0,0,1,0,0,1,1,0,1,1,NoTrace,N
N,Leaf,0,0,0,0,0,0,0,0,0,0,0,0,NoTrace,U
N,Leaf,0,0,0,0,0,0,0,0,0,0,0,0,NoTrace,N
N,Inner,0,1,1,0,0,1,0,0,1,0,0,1,NoTrace,N
N,Inner,0,1,1,0,0,1,0,0,1,0,0,1,NoTrace,N
N,Leaf,0,0,1,0,0,0,0,0,0,0,0,0,NoTrace,U
N,Leaf,0,0,1,0,0,1,0,0,0,0,0,0,NoTrace,U
N,Leaf,0,1,1,0,1,1,0,0,0,0,0,0,NoTrace,N
N,Leaf,0,1,0,0,0,0,0,0,0,0,0,0,NoTrace,N
N,Leaf,0,0,0,0,0,0,0,0,0,0,0,0,NoTrace,N
N,Root,0,0,0,0,0,0,0,0,1,0,0,0,NoTrace,N
N,Leaf,0,0,0,0,0,0,0,0,0,0,0,0,NoTrace,U
N,Root,0,0,0,0,0,0,0,1,1,0,0,1,NoTrace,N
N,Inner,0,0,1,0,0,0,0,0,1,0,0,1,NoTrace,N
N,Inner,0,1,0,0,1,0,0,1,1,0,1,1,NoTrace,N
N,Root,0,0,0,0,0,0,0,1,1,0,0,1,NoTrace,U
I am trying to predict the value of the gold column, which can take two values: T or N. I am printing the classification report of the random forest algorithm as shown below:
I want to store the T precision, N precision, T recall, and N recall in four separate variables. I am trying to use output_dict to do so, but I receive the error message: classification_report() got an unexpected keyword argument 'output_dict'.
Here is my source code:
import os
import sys

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def main():
    path = 'NtoT'
    files = os.listdir(path)
    #for f in files:
        #print(f)
    filename = os.path.join(path, 'NtoT-2.5-9.txt')
    print(filename)
    dataset = pd.read_csv(filename, sep=',', index_col=False)
    # Encode the categorical columns as integer codes
    # (categories are sorted, so Inner, Leaf, Root become 0, 1, 2)
    dataset['MethodType'] = dataset['MethodType'].astype('category').cat.codes
    dataset['classGold'] = dataset['classGold'].astype('category').cat.codes
    dataset['VariableTraceValue'] = dataset['VariableTraceValue'].astype('category').cat.codes
    pd.set_option('display.max_columns', None)
    row_count, column_count = dataset.shape
    # The first column (gold) is the target; all remaining columns are features
    X = dataset.iloc[:, 1:column_count].values
    y = dataset.iloc[:, 0].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
    ################################################################################
    classifier = RandomForestClassifier(n_estimators=400, random_state=0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    report = classification_report(y_test, y_pred, output_dict=True)
    print('confusion matrix\n', confusion_matrix(y_test, y_pred))
    print('classification report\n', report)
    print('accuracy score', accuracy_score(y_test, y_pred))


if __name__ == "__main__":
    main()
How can I fix this and store the T precision, N precision, T recall, N recall into 4 separate variables?
The problem is most likely an outdated version of sklearn: the output_dict parameter of classification_report was added in scikit-learn 0.20. Here you can find an explanation of how to check your version:
How to get output of sklearn.metrics.classification_report as a dict?
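Once you are on scikit-learn 0.20 or later, the four values can be read straight out of the returned dictionary. Below is a minimal sketch; the y_test and y_pred lists are made-up stand-ins for the arrays in your script, and the variable names are only suggestions. If upgrading is not an option, precision_recall_fscore_support returns the same numbers as arrays and does not need output_dict.

```python
import sklearn
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Check the installed version first; output_dict needs 0.20 or later
print(sklearn.__version__)

# Hypothetical labels standing in for y_test / y_pred from the question
y_test = ["T", "N", "N", "T", "N", "N"]
y_pred = ["T", "N", "T", "N", "N", "N"]

# With output_dict=True the report is a nested dict keyed by class label
report = classification_report(y_test, y_pred, output_dict=True)
t_precision = report["T"]["precision"]
n_precision = report["N"]["precision"]
t_recall = report["T"]["recall"]
n_recall = report["N"]["recall"]
print(t_precision, n_precision, t_recall, n_recall)

# Alternative that does not rely on output_dict:
# arrays are ordered like the labels argument, here T first and N second
precision, recall, _, _ = precision_recall_fscore_support(
    y_test, y_pred, labels=["T", "N"]
)
t_precision_alt, n_precision_alt = precision
t_recall_alt, n_recall_alt = recall
```

Passing labels explicitly in the second variant fixes the order of the returned arrays, so you do not have to rely on the sorted order of the class names.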
Consider the following data:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
data = load_breast_cancer()
X = data.data
y = data.target
I want to build a model using only the first principal component and calculate the AUC for it.
My work so far:
from sklearn.linear_model import LogisticRegression

# Standardize the features before PCA
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
# Keep only the first principal component of the scaled data
pca = PCA(n_components=1)
principalComponents = pca.fit_transform(X_scaled)
principalDf = pd.DataFrame(data=principalComponents,
                           columns=['principal component 1'])
clf = LogisticRegression()
clf = clf.fit(principalDf, y)
pred = clf.predict_proba(principalDf)
But when I try to use
fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
the following error occurs:
y should be a 1d array, got an array of shape (569, 2) instead.
I tried to reshape my data:
fpr, tpr, thresholds = metrics.roc_curve(y.reshape(1,-1), pred, pos_label=2)
But that didn't solve the issue; it outputs:
multilabel-indicator format is not supported
Do you have any idea how I can compute the AUC for this first principal component?
You may wish to try:
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
# Keep only the first principal component, as asked in the question
pca = PCA(n_components=1)
clf = LogisticRegression()
ppl = Pipeline([("scaler", scaler), ("pca", pca), ("clf", clf)])
ppl.fit(X_train, y_train)
# Score with the probability of the positive class (label 1), not hard predictions
preds = ppl.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(y_test, preds, pos_label=1)
# plot_roc_curve was removed in scikit-learn 1.2; from 1.0 on use RocCurveDisplay
metrics.RocCurveDisplay.from_estimator(ppl, X_test, y_test)
The problem is that predict_proba returns one column per class. In binary classification the classes are usually 0 and 1, and you want the probability of the second (positive) class, so it is common to slice as follows (replacing the last line in your code block):
pred = clf.predict_proba(principalDf)[:, 1]
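Putting it together for the first principal component, here is a minimal sketch. It assumes the 0/1 labels that load_breast_cancer returns, so pos_label is 1 rather than 2, and like the question it evaluates on the same data the model was fitted on, so the AUC is optimistic.

```python
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Standardize, then project onto the first principal component only
X_scaled = StandardScaler().fit_transform(X)
pc1 = PCA(n_components=1).fit_transform(X_scaled)

clf = LogisticRegression().fit(pc1, y)

# predict_proba has one column per class: column 0 is P(y=0), column 1 is P(y=1)
proba = clf.predict_proba(pc1)
pred = proba[:, 1]  # 1d scores for the positive class

# roc_curve expects 1d labels and 1d scores; the labels here are 0 and 1
fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=1)
auc = metrics.auc(fpr, tpr)
print("AUC on the first principal component:", auc)
```

If you only need the number and not the curve, metrics.roc_auc_score(y, pred) gives the same AUC in one call.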
I am using an SVM, and my program gets stuck in model.fit(X_test, y_test), the call that fits the SVM model. How can I fix that? Here is my code:
# Train an SVM classifier and evaluate its predictions
import collections
from csv import reader
from math import sqrt, exp, pi
from IPython.display import Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns; sns.set()
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from io import StringIO
from sklearn.feature_selection import SelectKBest, chi2
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier, export_graphviz
# Function to split the dataset
def splitdataset(balance_data, column_count):
    # Separating the target variable
    X = balance_data.values[:, 1:column_count]
    Y = balance_data.values[:, 0]
    # Splitting the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)
    return X, Y, X_train, X_test, y_train, y_test
def importdata():
    balance_data = pd.read_csv('dataExtended.txt', sep=',')
    row_count, column_count = balance_data.shape
    # Printing the dataset shape
    print("Dataset Length: ", len(balance_data))
    print("Dataset Shape: ", balance_data.shape)
    print("Number of columns ", column_count)
    # Printing the dataset observations
    print("Dataset: ", balance_data.head())
    # Encode the categorical columns as integer codes
    balance_data['gold'] = balance_data['gold'].astype('category').cat.codes
    balance_data['Program'] = balance_data['Program'].astype('category').cat.codes
    return balance_data, column_count
# Driver code
def main():
    print("hey")
    # Building Phase
    data, column_count = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data, column_count)
    # Create an SVM classifier
    model = svm.SVC(kernel='linear', probability=True)  # Linear Kernel
    print('before fitting')
    model.fit(X_test, y_test)
    print('fitting over')
    predicted = model.predict(X_test)
    print('prediction over')
    print(metrics.classification_report(y_test, predicted))
    print('classification over')
    print(metrics.confusion_matrix(y_test, predicted))
    probs = model.predict_proba(X_test)
    probs_list = list(probs)
    y_pred = [None] * len(y_test)
    y_pred_list = list(y_pred)
    y_test_list = list(y_test)
    i = 0
    threshold = 0.7
    # Keep only the samples where exactly one class reaches the threshold
    while i < len(probs_list):
        #print('probs ',probs_list[i][0])
        if probs_list[i][0] >= threshold and probs_list[i][1] < threshold:
            y_pred_list[i] = 0
            i = i + 1
        elif probs_list[i][0] < threshold and probs_list[i][1] >= threshold:
            y_pred_list[i] = 1
            i = i + 1
        else:
            # Neither class is confident enough: drop the sample (i is not advanced)
            #print(y_pred[i])
            #print('i==> ',i, ' probs length ', len(probs_list), ' ', len(y_pred_list), ' ', len(y_test_list))
            y_pred_list.pop(i)
            y_test_list.pop(i)
            probs_list.pop(i)
    #print(y_pred_list)
    print('confusion matrix\n', confusion_matrix(y_test_list, y_pred_list))
    print('classification report\n', classification_report(y_test_list, y_pred_list))
    print('accuracy score', accuracy_score(y_test_list, y_pred_list))
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test_list, y_pred_list))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test_list, y_pred_list))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test_list, y_pred_list)))


if __name__ == "__main__":
    main()
This is most likely because the probability parameter is set to True when you initialize the model. As you can read in the docs:
probability: bool, default=False
Whether to enable probability estimates. This must be enabled prior to calling fit, will slow down that method as it internally uses 5-fold cross-validation, and predict_proba may be inconsistent with predict.
This issue has been discussed on StackOverflow here and here.
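If you do not strictly need calibrated probabilities, one option is to leave probability at its default and threshold the output of decision_function instead. The sketch below is an illustration under assumptions, not a drop-in fix: it uses synthetic data from make_classification in place of dataExtended.txt, and it adds a StandardScaler step because SVMs are sensitive to feature scale and unscaled inputs can slow a linear-kernel fit considerably. It also fits on the training split; the code in the question fits on X_test, which makes the evaluation on X_test misleading.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real dataset (an assumption for this sketch)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

# probability is left at its default (False), so fit skips the internal
# 5-fold cross-validation used for probability estimates
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_train, y_train)

# Signed distance to the separating hyperplane: one value per sample in
# the binary case, usable as a confidence score for thresholding
scores = model.decision_function(X_test)
predicted = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print("accuracy:", accuracy)
```

The scores are not probabilities, so a cutoff such as 0.7 does not carry over directly; you would need to pick a margin threshold that suits your data.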