Having trouble applying KNN - python

I have a problem trying to use KNN. When I fit the model on the training and test splits, I get this error:
ValueError: Found input variables with inconsistent numbers of samples: [4482, 2015]
The dataframe has already been cleaned and has no issues.
I'm going to put the whole code sequence here; it fails at the end:
X = wines_class.drop(['color'], axis=1)
y = wines_class['color']
from sklearn.model_selection import train_test_split
X_treino, y_treino, X_teste, y_teste = train_test_split(X, y, test_size=0.31, random_state=0)
print(X.shape,y.shape)
(6497, 12) (6497,)
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
def class_pontos(clf, y_predito):
    acc_treino = clf.score(X_treino, y_treino)*100
    acc_teste = clf.score(X_teste, y_teste)*100
    roc = roc_auc_score(y_test, y_predito)*100
    vn, fp, fn, vp = confusion_matrix(y_test, y_predito).ravel()
    cm = confusion_matrix(y_teste, y_predito)
    Correto = vp + vn
    Incorreto = fp + fn
    return acc_treino, acc_teste, roc, Correto, Incorreto, cm
#KNN
from sklearn.neighbors import KNeighborsClassifier
class_knn = KNeighborsClassifier()
class_knn.fit(X_treino, y_treino)
y_pred_knn = class_knn.predict(X_teste)
print(class_pontos(class_knn, y_pred_knn))
I'm sharing the CSV here in this drive
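Not part of the original post, but the error is consistent with the unpacking order on the `train_test_split` line: `train_test_split` returns `(X_train, X_test, y_train, y_test)` in that order, so `X_treino, y_treino, X_teste, y_teste = ...` puts the 2015-row test feature matrix into `y_treino`, and `fit(X_treino, y_treino)` then sees 4482 vs 2015 samples. A minimal sketch on synthetic data with the same shapes (the random data stands in for the wines dataframe):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the wines dataset: 6497 rows, 12 features
rng = np.random.default_rng(0)
X = rng.normal(size=(6497, 12))
y = rng.integers(0, 2, size=6497)

# Correct unpacking order: X_train, X_test, y_train, y_test
X_treino, X_teste, y_treino, y_teste = train_test_split(
    X, y, test_size=0.31, random_state=0)

clf = KNeighborsClassifier()
clf.fit(X_treino, y_treino)  # shapes now agree: 4482 samples each
print(X_treino.shape, y_treino.shape)
```

With the corrected order, `clf.score(X_teste, y_teste)` also works as expected.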

Related

TypeError: classification_report() got an unexpected keyword argument 'output_dict'

I have some code that is applying the random forest algorithm in order to predict the value of the gold column based on the remaining columns.
My input file has the following form:
gold,MethodType,CallersT,CallersN,CallersU,CallersCallersT,CallersCallersN,CallersCallersU,CalleesT,CalleesN,CalleesU,CalleesCalleesT,CalleesCalleesN,CalleesCalleesU,classGold,VariableTraceValue
T,Inner,1,0,0,1,1,0,1,1,1,1,0,1,Trace,U
T,Inner,1,0,0,1,0,0,1,0,1,1,0,0,Trace,T
T,Inner,1,0,1,0,0,1,0,0,1,0,0,1,Trace,T
N,Inner,0,0,1,0,0,1,0,1,1,0,0,1,Trace,T
N,Inner,0,1,0,0,1,0,0,1,1,0,1,1,NoTrace,N
N,Leaf,0,0,0,0,0,0,0,0,0,0,0,0,NoTrace,U
N,Leaf,0,0,0,0,0,0,0,0,0,0,0,0,NoTrace,N
N,Inner,0,1,1,0,0,1,0,0,1,0,0,1,NoTrace,N
N,Inner,0,1,1,0,0,1,0,0,1,0,0,1,NoTrace,N
N,Leaf,0,0,1,0,0,0,0,0,0,0,0,0,NoTrace,U
N,Leaf,0,0,1,0,0,1,0,0,0,0,0,0,NoTrace,U
N,Leaf,0,1,1,0,1,1,0,0,0,0,0,0,NoTrace,N
N,Leaf,0,1,0,0,0,0,0,0,0,0,0,0,NoTrace,N
N,Leaf,0,0,0,0,0,0,0,0,0,0,0,0,NoTrace,N
N,Root,0,0,0,0,0,0,0,0,1,0,0,0,NoTrace,N
N,Leaf,0,0,0,0,0,0,0,0,0,0,0,0,NoTrace,U
N,Root,0,0,0,0,0,0,0,1,1,0,0,1,NoTrace,N
N,Inner,0,0,1,0,0,0,0,0,1,0,0,1,NoTrace,N
N,Inner,0,1,0,0,1,0,0,1,1,0,1,1,NoTrace,N
N,Root,0,0,0,0,0,0,0,1,1,0,0,1,NoTrace,U
I am trying to predict the value of the gold column, which can take 2 values: T or N. I am printing the classification report of the random forest algorithm.
I want to store the T precision, N precision, T recall, and N recall into 4 separate variables. I am trying to use output_dict to do so but I receive the error message classification_report() got an unexpected keyword argument 'output_dict'.
Here is my source code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import sys
import os
import matplotlib.pyplot as plt
def main():
    X_train={}
    X_test={}
    y_train={}
    y_test={}
    path = 'NtoT'
    files = os.listdir(path)
    #for f in files:
    #    print(f)
    filename=path+'\\'+'NtoT-2.5-9.txt'
    print(filename)
    dataset = pd.read_csv( filename, sep= ',', index_col=False)
    #convert Inner, Root, Leaf into 0, 1, 2
    dataset['MethodType'] = dataset['MethodType'].astype('category').cat.codes
    dataset['classGold'] = dataset['classGold'].astype('category').cat.codes
    dataset['VariableTraceValue'] = dataset['VariableTraceValue'].astype('category').cat.codes
    pd.set_option('display.max_columns', None)
    row_count, column_count = dataset.shape
    X = dataset.iloc[:, 1:column_count].values
    y = dataset.iloc[:, 0].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
    ################################################################################
    classifier = RandomForestClassifier(n_estimators=400, random_state=0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    report = classification_report(y_test,y_pred,output_dict=True)
    print('confusion matrix\n',confusion_matrix(y_test,y_pred))
    print('classification report\n', report)
    print('accuracy score', accuracy_score(y_test,y_pred))

if __name__=="__main__":
    main()
How can I fix this and store the T precision, N precision, T recall, N recall into 4 separate variables?
The problem is that you may have an outdated version of sklearn; `output_dict` was added in scikit-learn 0.20. Here you can find an explanation of how to check your version:
How to get output of sklearn.metrics.classification_report as a dict?
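Once the version is recent enough, the report dict can be unpacked directly into the four variables the question asks for. A small self-contained sketch (the toy `y_test` / `y_pred` labels below are made up, standing in for the real T/N predictions):

```python
import sklearn
print(sklearn.__version__)  # output_dict requires scikit-learn >= 0.20

from sklearn.metrics import classification_report

# Tiny stand-in for the real y_test / y_pred
y_test = ['T', 'N', 'T', 'N', 'N']
y_pred = ['T', 'N', 'N', 'N', 'N']

report = classification_report(y_test, y_pred, output_dict=True)

# Store T precision, N precision, T recall, N recall in 4 variables
t_precision = report['T']['precision']
n_precision = report['N']['precision']
t_recall = report['T']['recall']
n_recall = report['N']['recall']
print(t_precision, n_precision, t_recall, n_recall)
```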

Calculating AUC for LogisticRegression model

Let's take this data:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
data = load_breast_cancer()
X = data.data
y = data.target
I want to create model using only first principal component and calculate AUC for it.
My work so far
scaler = StandardScaler()
scaler.fit(X_train)
X_scaled = scaler.transform(X)
pca = PCA(n_components=1)
principalComponents = pca.fit_transform(X)
principalDf = pd.DataFrame(data = principalComponents
, columns = ['principal component 1'])
clf = LogisticRegression()
clf = clf.fit(principalDf, y)
pred = clf.predict_proba(principalDf)
But when I try to use
fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
the following error occurs:
y should be a 1d array, got an array of shape (569, 2) instead.
I tried to reshape my data
fpr, tpr, thresholds = metrics.roc_curve(y.reshape(1,-1), pred, pos_label=2)
But it didn't solve the issue (it outputs):
multilabel-indicator format is not supported
Do you have any idea how can I perform AUC on this first principal component?
You may wish to try:
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
X,y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X,y)
scaler = StandardScaler()
pca = PCA(2)
clf = LogisticRegression()
ppl = Pipeline([("scaler",scaler),("pca",pca),("clf",clf)])
ppl.fit(X_train, y_train)
preds = ppl.predict(X_test)
fpr, tpr, thresholds = metrics.roc_curve(y_test, preds, pos_label=1)
metrics.plot_roc_curve(ppl, X_test, y_test)
The problem is that predict_proba returns a column for each class. Generally with binary classification, your classes are 0 and 1, so you want the probability of the second class, so it's quite common to slice as follows (replacing the last line in your code block):
pred = clf.predict_proba(principalDf)[:, 1]
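Putting the two suggestions together, a minimal end-to-end sketch: scale before PCA, slice `predict_proba` down to the positive-class column, and use `pos_label=1` since the breast cancer labels are 0/1 (not 2 as in the question):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)

# Scale, then project onto the first principal component
X_scaled = StandardScaler().fit_transform(X)
pc1 = PCA(n_components=1).fit_transform(X_scaled)

clf = LogisticRegression().fit(pc1, y)

# predict_proba returns one column per class; take the positive class
pred = clf.predict_proba(pc1)[:, 1]

# pred is now 1-D, so roc_curve accepts it
fpr, tpr, thresholds = roc_curve(y, pred, pos_label=1)
auc = roc_auc_score(y, pred)
print(auc)
```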

Unable to get feature names from sklearn model, since the inputs are numpy arrays. How can I structure my code so that I can pull feature names?

I am working on a Random Forest classification model using stratified k-fold cross validation. I want to plot the feature importance of each fold. My input data is in the form of numpy arrays, however I am unable to put the feature names in my code below. How can I structure this code so that I can pull feature names, so I can plot the built-in feature importance?
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, KFold, cross_validate, cross_val_score, StratifiedKFold, RandomizedSearchCV
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix, f1_score, mean_squared_error
import matplotlib.pyplot as plt
y_downsample = downsampled[['dependent_variable']].values
X_downsample = downsampled[['Feature1'
,'Feature2'
,'Feature3'
,'Feature4'
,'Feature5'
,'Feature6'
,'Feature7'
,'Feature8'
,'Feature9'
,'Feature10']].values
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
f1_results = []
accuracy_results = []
precision_results = []
recall_results = []
feature_imp = []
for train_index, test_index in skf.split(X_downsample,y_downsample):
    X_train, X_test = X_downsample[train_index], X_downsample[test_index]
    y_train, y_test = y_downsample[train_index], y_downsample[test_index]
    model = RandomForestClassifier(n_estimators = 100, random_state = 24)
    model.fit(X_train, y_train.ravel())
    y_pred = model.predict(X_test)
    f1_results.append(metrics.f1_score(y_test, y_pred))
    accuracy_results.append(metrics.accuracy_score(y_test, y_pred))
    precision_results.append(metrics.precision_score(y_test, y_pred))
    recall_results.append(metrics.recall_score(y_test, y_pred))

    # plot
    importances = pd.DataFrame({'FEATURE':pd.DataFrame(X_downsample.columns),'IMPORTANCE':np.round(model.feature_importances_,3)})
    importances = importances.sort_values('IMPORTANCE',ascending=False).set_index('FEATURE')
    importances.plot.bar()
    plt.show()
print("Accuracy: ", np.mean(accuracy_results))
print("Precision: ", np.mean(precision_results))
print("Recall: ", np.mean(recall_results))
print("F1-score: ", np.mean(f1_results))
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
     21
     22 # plot
---> 23 importances = pd.DataFrame({'FEATURE':pd.DataFrame(X_downsample.columns),'IMPORTANCE':np.round(model.feature_importances_,3)})
     24 importances = importances.sort_values('IMPORTANCE',ascending=False).set_index('FEATURE')
     25

AttributeError: 'numpy.ndarray' object has no attribute 'columns'
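One common fix, sketched below on synthetic data (the `make_classification` setup and the `Feature1`..`Feature10` names are stand-ins): `X_downsample` was built with `.values`, which drops the column labels and leaves a plain numpy array, so keep the names in a list and pair that list with `feature_importances_` instead of asking the array for `.columns`:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Save the column names BEFORE converting the DataFrame to .values
feature_names = [f'Feature{i}' for i in range(1, 11)]

# Synthetic stand-in for the downsampled data
X_downsample, y_downsample = make_classification(
    n_samples=200, n_features=10, random_state=24)

model = RandomForestClassifier(n_estimators=100, random_state=24)
model.fit(X_downsample, y_downsample)

# Pair the saved names with the importances; numpy arrays have no .columns
importances = pd.DataFrame({
    'FEATURE': feature_names,
    'IMPORTANCE': np.round(model.feature_importances_, 3),
})
importances = importances.sort_values('IMPORTANCE', ascending=False).set_index('FEATURE')
print(importances)
```

The same `importances.plot.bar()` call from the question then works per fold, since `importances` is a real DataFrame indexed by feature name.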

SVM stuck in fitting the model

I am using SVM and I have a problem: the program gets stuck in model.fit(X_test, y_test), which fits the SVM model. How can I fix that? Here is my code:
# Make Predictions with Naive Bayes On The Iris Dataset
import collections
from csv import reader
from math import sqrt, exp, pi
from IPython.display import Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns; sns.set()
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.externals.six import StringIO
from sklearn.feature_selection import SelectKBest, chi2
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier, export_graphviz
# Function to split the dataset
def splitdataset(balance_data, column_count):
    # Separating the target variable
    X = balance_data.values[:, 1:column_count]
    Y = balance_data.values[:, 0]
    # Splitting the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size = 0.3, random_state = 100)
    return X, Y, X_train, X_test, y_train, y_test

def importdata():
    balance_data = pd.read_csv( 'dataExtended.txt', sep= ',')
    row_count, column_count = balance_data.shape
    # Printing the dataset shape
    print ("Dataset Length: ", len(balance_data))
    print ("Dataset Shape: ", balance_data.shape)
    print("Number of columns ", column_count)
    # Printing the dataset observations
    print ("Dataset: ",balance_data.head())
    balance_data['gold'] = balance_data['gold'].astype('category').cat.codes
    balance_data['Program'] = balance_data['Program'].astype('category').cat.codes
    return balance_data, column_count

# Driver code
def main():
    print("hey")
    # Building Phase
    data,column_count = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data, column_count)
    #Create a svm Classifier
    model = svm.SVC(kernel='linear',probability=True) # Linear Kernel
    print('before fitting')
    model.fit(X_test, y_test)
    print('fitting over')
    predicted = model.predict(X_test)
    print('prediction over')
    print(metrics.classification_report(y_test, predicted))
    print('classification over')
    print(metrics.confusion_matrix(y_test, predicted))
    probs = model.predict_proba(X_test)
    probs_list = list(probs)
    y_pred=[None]*len(y_test)
    y_pred_list = list(y_pred)
    y_test_list = list(y_test)
    i=0
    threshold=0.7
    while i<len(probs_list):
        #print('probs ',probs_list[i][0])
        if (probs_list[i][0]>=threshold) & (probs_list[i][1]<threshold):
            y_pred_list[i]=0
            i=i+1
        elif (probs_list[i][0]<threshold) & (probs_list[i][1]>=threshold):
            y_pred_list[i]=1
            i=i+1
        else:
            #print(y_pred[i])
            #print('i==> ',i, ' probs length ', len(probs_list), ' ', len(y_pred_list), ' ', len(y_test_list))
            y_pred_list.pop(i)
            y_test_list.pop(i)
            probs_list.pop(i)
    #print(y_pred_list)
    print('confusion matrix\n',confusion_matrix(y_test_list,y_pred_list))
    print('classification report\n', classification_report(y_test_list,y_pred_list))
    print('accuracy score', accuracy_score(y_test_list, y_pred_list))
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test_list, y_pred_list))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test_list, y_pred_list))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test_list, y_pred_list)))

if __name__=="__main__":
    main()
This is most likely due to the parameter probability set to True when you initialize the model. As you can read in the docs:
probability: bool, default=False
Whether to enable probability
estimates. This must be enabled prior to calling fit, will slow down
that method as it internally uses 5-fold cross-validation, and
predict_proba may be inconsistent with predict.
This issue has been discussed on StackOverflow here and here.
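A sketch of the workaround on synthetic data (the `make_classification` setup is made up): leave `probability` at its default of `False`, so `fit()` runs a single optimization, and use `decision_function` when a continuous score is needed. The docs quoted above note that `probability=True` runs internal 5-fold cross-validation inside `fit` and that `predict_proba` may even disagree with `predict`:

```python
from sklearn import svm
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# probability=False (the default): fit() is a single optimization pass
fast_model = svm.SVC(kernel='linear')
fast_model.fit(X, y)

# decision_function gives a continuous score without the extra
# cross-validation cost of probability=True
scores = fast_model.decision_function(X)
preds = (scores > 0).astype(int)  # positive score -> class 1
print(preds[:10])
```

Thresholding the decision scores plays the same role as the 0.7 probability threshold in the question's while-loop, without the slow fit.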

How to remove outlier

I'm working on a regression problem. I have 10 independent variables and I'm using SVR. Despite doing feature selection and tuning the SVR parameters using grid search, I get a huge MAPE of 15%. So I'm trying to remove outliers, but after removing them I cannot split the data. My question is: do outliers affect the accuracy of regression?
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import Normalizer
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
import pandas as pd
from sklearn import preprocessing
features=pd.read_csv('selectedData.csv')
target = features['SYSLoad']
features= features.drop('SYSLoad', axis = 1)
from scipy import stats
import numpy as np
z = np.abs(stats.zscore(features))
print(z)
threshold = 3
print(np.where(z > 3))
features2 = features[(z < 3).all(axis=1)]
from sklearn.model_selection import train_test_split
train_input, test_input, train_target, test_target = train_test_split(features2, target, test_size = 0.25, random_state = 42)
While executing the preceding code I get this error:
"samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [33352, 35064]
You get the error because, while your target variable is of equal length with features (presumably 35064) due to:
target = features['SYSLoad']
your features2 variable is of lesser length (presumably 33352), i.e. it is a subset of features, due to:
features2 = features[(z < 3).all(axis=1)]
and your train_test_split justifiably complains that the lengths of your features & labels are not equal.
So, you should also subset your target accordingly, and use this target2 in your train_test_split:
target2 = target[(z < 3).all(axis=1)]
train_input, test_input, train_target, test_target = train_test_split(features2, target2, test_size = 0.25, random_state = 42)
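To illustrate that the masked features and target stay aligned, a self-contained sketch (the 100-row frame, the column names, and the injected outlier are made up):

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset
rng = np.random.default_rng(42)
features = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])
features.iloc[0, 0] = 50.0  # inject an outlier so a row gets dropped
target = pd.Series(rng.normal(size=100), name='SYSLoad')

z = np.abs(stats.zscore(features))
mask = (z < 3).all(axis=1)

# Apply the SAME mask to both, so the lengths stay equal
features2 = features[mask]
target2 = target[mask]
print(len(features2), len(target2))

train_input, test_input, train_target, test_target = train_test_split(
    features2, target2, test_size=0.25, random_state=42)
```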
