Incorrect labels in confusion matrix - python

I have tried to create a confusion matrix on a knn-classifier in python, but the labeled classes are wrong.
The classes attribute of the dataset is 2 (for benign) and 4 (for malignant), but when I plot the confusion matrix, all labels are 2. The code I use is:
Data source: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
KNN classifier on Breast Cancer Wisconsin (Diagnostic) Data Set from UCI:
data = pd.read_csv('/breast-cancer-wisconsin.data')
data.replace('?', 0, inplace=True)
data.drop('id', 1, inplace = True)
X = np.array(data.drop(' class ', 1))
Y = np.array(data[' class '])
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2)
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, Y_train)
accuracy = clf.score(X_test, Y_test)
Plot confusion matrix
from sklearn.metrics import plot_confusion_matrix
disp = plot_confusion_matrix(clf, X_test, Y_test,
display_labels=Y,
cmap=plt.cm.Blues,)
Confusion matrix

The problem is that you're specifying the display_labels argument with Y, where it should just be the target names used for plotting. Now it's just using the two first values that appear in Y, which happen to be 2, 2. Note too that, as mentioned in the docs, the displayed labels will be the same as specified in labels if it is provided, so you just need:
from sklearn.metrics import plot_confusion_matrix
fig, ax = plt.subplots(figsize=(8,8))
disp = plot_confusion_matrix(clf, X_test, Y_test,
labels=np.unique(y),
cmap=plt.cm.Blues,ax=ax)

Related

How I can fix the error in: "Fit the PCA transformer on the training data and transform the data line" and "model = svm.SVC(kernel='linear')"

I have create this Support Vector machine model but I get errors in "Fit the PCA transformer on the training data and transform the data line" and "model = svm.SVC(kernel='linear')"
The first error is:
NameError: name 'x_train' is not defined
The second error is:
NameError: name 'svm' is not defined
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
import seaborn as sns
import seaborn as sns # for data visualization
train_df = pd.read_csv('diabetes_data_upload.csv')
train_df.head()
# checking total of rows and columns
train_df.shape
# Transforming the Gender into 0 and 1
train_df["Gender"] = train_df["Gender"].map({"Male": 0, "Female": 1}).astype(int)
#Rounding the Age
train_df["Age"] = train_df.Age.round()
# Separating the data to predict the missing ages
X_train = train_df[train_df.Age.notnull()][['Age','Gender','weakness','Obesity', 'class']]
X_test = train_df[train_df.Age.isnull()][['Age','Gender','weakness','Obesity', 'class']]
y = train_df.Age.dropna()
# Just confirming if there is no more ages missing
train_df.Age.isnull().sum()
# Taking only the features that is important for now
X = train_df[['Gender', 'Age', 'weakness']]
# Taking the labels
Y = train_df['class']
# Spliting into 80% for training set and 20% for testing set so we can see our accuracy
X_train, x_test, Y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
# Declaring the SVC with no tunning
print(X_train.shape)
print(Y_train.shape)
from sklearn.decomposition import PCA
# Create a PCA transformer with 3 components
pca = PCA(n_components=3)
# Fit the PCA transformer on the training data and transform the data
x_train_pca = pca.fit_transform(x_train)
# Transform the test data using the PCA transformer fitted on the training data
x_test_pca = pca.transform(x_test)
# Fit the classifier on the transformed training data
classifier.fit(x_train_pca, y_train)
# Predict the labels for the transformed test data
predictions = classifier.predict(x_test_pca)
# Calculate the accuracy of the model
accuracy = classifier.score(x_test_pca, y_test)
print("Accuracy:", accuracy)
#make it binary classification problem
X = X[np.logical_or(Y==0,Y==1)]
Y = Y[np.logical_or(Y==0,Y==1)]
model = svm.SVC(kernel='linear')
clf = model.fit(X, Y)
# The equation of the separating plane is given by all x so that np.dot(svc.coef_[0], x) + b = 0.
# Solve for w3 (z)
z = lambda x,y: (-clf.intercept_[0]-clf.coef_[0][0]*x -clf.coef_[0][1]*y) / clf.coef_[0][2]
tmp = np.linspace(-5,5,30)
x,y = np.meshgrid(tmp,tmp)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot3D(X[Y==0,0], X[Y==0,1], X[Y==0,2],'ob')
ax.plot3D(X[Y==1,0], X[Y==1,1], X[Y==1,2],'sr')
ax.plot_surface(x, y, z(x,y))
ax.view_init(30, 60)
plt.show()
The two errors you mention can be solved by the following -
1. NameError: name 'x_train' is not defined
This is because you are using x_train instead of the X_train variable you have defined right above. Remember, variables names are case sensitive.
x_train_pca = pca.fit_transform(x_train) # your code
x_train_pca = pca.fit_transform(X_train) # change it to this
2. NameError: name 'svm' is not defined
This is because you are already importing the SVC class using from svm import SVC. But while trying to instantiating the class the model you are using svm.SVC.
model = svm.SVC(kernel='linear') # your code
model = SVC(kernel='linear') # change it to this

Can't predict values using Linear Regression

there!
I'm studying the IBM Data Science course by Coursera and I'm trying to create some snippets to practice. I've created the following code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
# Import and format the dataframes
ibov = pd.read_csv('https://raw.githubusercontent.com/thiagobodruk/datasets/master/ibov.csv')
ifix = pd.read_csv('https://raw.githubusercontent.com/thiagobodruk/datasets/master/ifix.csv')
ibov['DATA'] = pd.to_datetime(ibov['DATA'], format='%d/%m/%Y')
ifix['DATA'] = pd.to_datetime(ifix['DATA'], format='%d/%m/%Y')
ifix = ifix.sort_values(by='DATA', ascending=False)
ibov = ibov.sort_values(by='DATA', ascending=False)
ibov = ibov[['DATA','FECHAMENTO']]
ibov.rename(columns={'FECHAMENTO':'IBOV'}, inplace=True)
ifix = ifix[['DATA','FECHAMENTO']]
ifix.rename(columns={'FECHAMENTO':'IFIX'}, inplace=True)
# Merge datasets
df_idx = ibov.merge( ifix, how='left', on='DATA')
df_idx.set_index('DATA', inplace=True)
df_idx.head()
# Split training and testing samples
x_train, x_test, y_train, y_test = train_test_split(df_idx['IBOV'], df_idx['IFIX'], test_size=0.2)
# Convert the samples to Numpy arrays
regr = linear_model.LinearRegression()
x_train = np.array([x_train])
y_train = np.array([y_train])
x_test = np.array([x_test])
y_test = np.array([y_test])
# Plot the result
regr.fit(x_train, y_train)
y_pred = regr.predict(y_train)
plt.scatter(x_train, y_train)
plt.plot(x_test, y_pred, color='blue', linewidth=3) # This line produces no result
I experienced some issues with the output values returned by the train_test_split() method. So I converted them to Numpy arrays, then my code worked. I can plot my scatter plot normally, but I can't plot my prediction line.
Running this code on my IBM Data Cloud Notebook produces the following warning:
/opt/conda/envs/Python36/lib/python3.6/site-packages/matplotlib/axes/_base.py:380: MatplotlibDeprecationWarning:
cycling among columns of inputs with non-matching shapes is deprecated.
cbook.warn_deprecated("2.2", "cycling among columns of inputs "
I searched on Google and here on StackOverflow, but I can't figure what is wrong.
I'll appreciate some assistance. Thanks in advance!
There are several issues in your code, like y_pred = regr.predict(y_train) and the way you draw a line.
The following code snippet should set you in the right direction:
# Split training and testing samples
x_train, x_test, y_train, y_test = train_test_split(df_idx['IBOV'], df_idx['IFIX'], test_size=0.2)
# Convert the samples to Numpy arrays
regr = linear_model.LinearRegression()
x_train = x_train.values
y_train = y_train.values
x_test = x_test.values
y_test = y_test.values
# Plot the result
plt.scatter(x_train, y_train)
regr.fit(x_train.reshape(-1,1), y_train)
idx = np.argsort(x_train)
y_pred = regr.predict(x_train[idx].reshape(-1,1))
plt.plot(x_train[idx], y_pred, color='blue', linewidth=3);
To do the same for the test subset with already fitted model:
# Plot the result
plt.scatter(x_test, y_test)
idx = np.argsort(x_test)
y_pred = regr.predict(x_test[idx].reshape(-1,1))
plt.plot(x_test[idx], y_pred, color='blue', linewidth=3);
Feel free to ask questions if you have any.

Suppress scientific notation in sklearn.metrics.plot_confusion_matrix

I was trying to plot a confusion matrix nicely, so I followed scikit-learn's newer version 0.22's in built plot confusion matrix function. However, one value of my confusion matrix value is 153, but it appears as 1.5e+02 in the confusion matrix plot:
Following the scikit-learn's documentation, I spotted this parameter called values_format, but I do not know how to manipulate this parameter so that it can suppress the scientific notation. My code is as follows.
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import plot_confusion_matrix
# import some data to play with
X = pd.read_csv("datasets/X.csv")
y = pd.read_csv("datasets/y.csv")
class_names = ['Not Fraud (positive)', 'Fraud (negative)']
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Run classifier, using a model that is too regularized (C too low) to see
# the impact on the results
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
titles_options = [("Confusion matrix, without normalization", None),
("Normalized confusion matrix", 'true')]
for title, normalize in titles_options:
disp = plot_confusion_matrix(logreg, X_test, y_test,
display_labels=class_names,
cmap=plt.cm.Greens,
normalize=normalize, values_format = '{:.5f}'.format)
disp.ax_.set_title(title)
print(title)
print(disp.confusion_matrix)
plt.show()
Just remove ".format" and the {} brackets from your call parameter declaration:
disp = plot_confusion_matrix(logreg, X_test, y_test,
display_labels=class_names,
cmap=plt.cm.Greens,
normalize=normalize, values_format = '.5f')
In addition, you can use '.5g' to avoid decimal 0's
Taken from source
In case anyone using seabornĀ“s heatmap to plot the confusion matrix, and none of the answers above worked. You should turn off scientific notation in confusion matrix seaborn with fmt='g', like so:
sns.heatmap(conf_matrix,annot=True, fmt='g')
Simply pass values_format=''
Example:
plot_confusion_matrix(clf, X_test, Y_test, values_format = '')

Scaled data with SVM causing strange Confusion Matrix

I have a dataset with X.shape (104481, 34) and y.shape (104481,), and I want to train an SVM model on it.
The steps I do are (1) Split data, (2) Scale data, and (3) Train SVM:
(1) Split data:
Function:
from sklearn.model_selection import train_test_split
def split_data(X,y):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=12, stratify=y)
return X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = split_data_set.split_data(X,y)
The 4 classes are the following. The data set is quite imbalanced, but that is an issue for later.
y_train.value_counts()
out:
Status_9_Substatus_8 33500
Other 33500
Status_62_Substatus_7 2746
Status_62_Substatus_30 256
Name: Status, dtype: int64
y_test.value_counts()
out:
Status_9_Substatus_8 16500
Other 16500
Status_62_Substatus_7 1352
Status_62_Substatus_30 127
Name: Status, dtype: int64
(2) Scale data:
from sklearn.preprocessing import MinMaxScaler
from sklearn import preprocessing
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.shape)
print(y_train.shape)
(3) Train and predict with SVM:
svm_method.get_svm_model(X_train_scaled, X_test_scaled, y_train, y_test)
Calling this method:
def get_svm_model(X_train, X_test, y_train, y_test):
print('Loading...')
print('Training...')
svm, y_train_pred, y_test_pred = train_svm_model(X_train,y_train, X_test)
print('Training Complete')
print('Plotting Confusion Matrix...')
performance_measure.plot_confusion_matrix(y_test,y_test_pred, normalize=True)
print('Plotting Performance Measure...')
performance_measure.get_performance_measures(y_test, y_test_pred)
return svm
Which calls this method:
def train_svm_model(X_train,y_train, X_test):
#
svm = SVC(kernel='poly', gamma='auto', random_state=12)
# Fitting the model
svm.fit(X_train, y_train)
# Predicting values
y_train_pred = svm.predict(X_train)
y_test_pred = svm.predict(X_test)
return svm, y_train_pred, y_test_pred
The resulting '''Output''' is this screenshot.
What is strange is that there are samples from all four classes present (since I used the stratify parameter when calling train_test_split), however, it looks like some of the classes disappear?
The SVM and confusion matrix functions worked well with a toy data set:
from sklearn.datasets import load_wine
data = load_wine()
X = pd.DataFrame(data.data, columns = data.feature_names)
y = pd.DataFrame(data.target)
y = np.array(y)
y = np.ravel(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
svm, y_train_pred, y_test_pred = train_svm_model(X_train, y_train, X_test)
get_svm_model(X_train, X_test, y_train, y_test)
Any idea what is going on here?
Thanks in advance.
The CM code:
def plot_confusion_matrix(y_true, y_pred,
normalize=False,
title=None,
cmap=plt.cm.Blues):
"""
This function prints and plots the confusion matrix.
Normalization can be applied by setting `normalize=True`.
"""
if not title:
if normalize:
title = 'Normalized confusion matrix'
else:
title = 'Confusion matrix, without normalization'
# Compute confusion matrix
cm = confusion_matrix(y_true, y_pred)
# Only use the labels that appear in the data
#classes = classes[unique_labels(y_true, y_pred)]
classes = unique_labels(y_pred)
if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print("Normalized confusion matrix")
else:
print('Confusion matrix, without normalization')
print(cm)
fig, ax = plt.subplots()
im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
ax.figure.colorbar(im, ax=ax)
# We want to show all ticks...
ax.set(xticks=np.arange(cm.shape[1]),
yticks=np.arange(cm.shape[0]),
# ... and label them with the respective list entries
xticklabels=classes, yticklabels=classes,
title=title,
ylabel='True label',
xlabel='Predicted label')
# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
rotation_mode="anchor")
# Loop over data dimensions and create text annotations.
fmt = '.2f' if normalize else 'd'
thresh = cm.max() / 2.
for i in range(cm.shape[0]):
for j in range(cm.shape[1]):
ax.text(j, i, format(cm[i, j], fmt),
ha="center", va="center",
color="white" if cm[i, j] > thresh else "black")
fig.tight_layout()
plt.show()
return ax
Your confusion matrix is not zero:
If we look at this on the x-axis you have the predicted label and on the y-axis the true labels, lets have a look at the third row from the top:
0.94: 0.94 of the true label: Status_62_Substatus_7 are predicted as class other, which is wrong
0.00 of the same true label are predicted also wrong
0.00 of the same true label are predicted wrong (this should be the correct predict value (higher is better)
0.06 are predicted again wrong
Is your problem is so imbalanced you just have 0 predictions for two labels.

Train test dataset regression results

My problem is that I am applying a simple linear regression on my data: when I split the data to train and test data I don't find significant model when bad p-value and r squared and adjusted r squared results while there is good results in train data.
Here's the code for more explanations :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
data = pd.read_excel ("C:\\Users\\AchourAh\\Desktop\\PL14_IPC_03_09_2018_SP_Level.xlsx",'Sheet1') #Import Excel file
data1 = data.fillna(0) #Replace null values of the whole dataset with 0
print(data1)
X = data1.iloc[0:len(data1),5].values.reshape(-1, 1) #Extract the column of the COPCOR SP we are going to check its impact
Y = data1.iloc[0:len(data1),6].values.reshape(-1, 1) #Extract the column of the PAUS SP
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size =0.3, random_state = 0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
plt.scatter(X_train, Y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('SP00114585')
plt.xlabel('COP COR Quantity')
plt.ylabel('PAUS Quantity')
plt.show()
plt.scatter(X_test, Y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('SP00114585')
plt.xlabel('COP COR Quantity')
plt.ylabel('PAUS Quantity')
plt.show()
X2 = sm.add_constant(X_train)
est = sm.OLS(Y_train, X2)
est2 = est.fit()
print(est2.summary())
X3 = sm.add_constant(X_test)
est3 = sm.OLS(Y_test, X3)
est4 = est3.fit()
print(est4.summary())
At the end, when trying to display statistical results, I always find good results in train data but not in test data. Probably something wrong in my code.
To notice I am a beginner with python
Try running this model a few times, without specifying random_state in train_test_split or changing the test_size parameter.
I.e.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size =0.2)
As of now, every time you run the model, you do the same split of data, so it is possible that you overfit the model just because of the split.

Categories

Resources