How to calculate specificity for multiclass problems using Scikit-learn [duplicate]

This question already has answers here:
calculate precision and recall in a confusion matrix
(6 answers)
Closed 2 years ago.
I'm using Python and have some confusion matrices. I'd like to calculate precision, recall, and F-measure from confusion matrices for multiclass classification. My result logs don't contain y_true and y_pred; they only contain the confusion matrices.
Could you tell me how to get these scores from a confusion matrix in multiclass classification?

Let's consider the case of MNIST data classification (10 classes), where for a test set of 10,000 samples we get the following confusion matrix cm (Numpy array):
array([[ 963, 0, 0, 1, 0, 2, 11, 1, 2, 0],
[ 0, 1119, 3, 2, 1, 0, 4, 1, 4, 1],
[ 12, 3, 972, 9, 6, 0, 6, 9, 13, 2],
[ 0, 0, 8, 975, 0, 2, 2, 10, 10, 3],
[ 0, 2, 3, 0, 953, 0, 11, 2, 3, 8],
[ 8, 1, 0, 21, 2, 818, 17, 2, 15, 8],
[ 9, 3, 1, 1, 4, 2, 938, 0, 0, 0],
[ 2, 7, 19, 2, 2, 0, 0, 975, 2, 19],
[ 8, 5, 4, 8, 6, 4, 14, 11, 906, 8],
[ 11, 7, 1, 12, 16, 1, 1, 6, 5, 949]])
In order to get the precision & recall (per class), we need to compute the TP, FP, and FN per class. We don't strictly need TN for these, but we will compute it too, as it will help with a sanity check.
The True Positives are simply the diagonal elements:
# numpy should have already been imported as np
TP = np.diag(cm)
TP
# array([ 963, 1119, 972, 975, 953, 818, 938, 975, 906, 949])
The False Positives are the sum of the respective column, minus the diagonal element (i.e. the TP element):
FP = np.sum(cm, axis=0) - TP
FP
# array([50, 28, 39, 56, 37, 11, 66, 42, 54, 49])
Similarly, the False Negatives are the sum of the respective row, minus the diagonal (i.e. TP) element:
FN = np.sum(cm, axis=1) - TP
FN
# array([17, 16, 60, 35, 29, 74, 20, 53, 68, 60])
Now, the True Negatives are a little trickier; let's first think what exactly a True Negative means, with respect to, say class 0: it means all the samples that have been correctly identified as not being 0. So, essentially what we should do is remove the corresponding row & column from the confusion matrix, and then sum up all the remaining elements:
num_classes = 10
TN = []
for i in range(num_classes):
    temp = np.delete(cm, i, 0)    # delete ith row
    temp = np.delete(temp, i, 1)  # delete ith column
    TN.append(sum(sum(temp)))
TN
# [8970, 8837, 8929, 8934, 8981, 9097, 8976, 8930, 8972, 8942]
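Equivalently, since for each class every sample is counted exactly once among TP, FP, FN, and TN, the loop can be replaced by a vectorized one-liner (an alternative not shown in the original answer):
# for each class, everything that is not TP, FP, or FN is TN
TN = cm.sum() - (TP + FP + FN)
# array([8970, 8837, 8929, 8934, 8981, 9097, 8976, 8930, 8972, 8942])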
Let's do a sanity check: for each class, the sum of TP, FP, FN, and TN must equal the size of our test set (here 10,000). Let's confirm that this is indeed the case:
l = 10000
for i in range(num_classes):
    print(TP[i] + FP[i] + FN[i] + TN[i] == l)
The result is
True
True
True
True
True
True
True
True
True
True
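The same check can also be done in a single vectorized line (an equivalent shortcut, not part of the original answer):
# every sample is counted exactly once for each class
assert np.all(TP + FP + FN + np.array(TN) == 10000)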
Having calculated these quantities, it is now straightforward to get the precision & recall per class:
precision = TP/(TP+FP)
recall = TP/(TP+FN)
which for this example are
precision
# array([ 0.95064166, 0.97558849, 0.96142433, 0.9456838 , 0.96262626,
# 0.986731 , 0.93426295, 0.95870206, 0.94375 , 0.9509018])
recall
# array([ 0.98265306, 0.98590308, 0.94186047, 0.96534653, 0.97046843,
# 0.91704036, 0.97912317, 0.94844358, 0.9301848 , 0.94053518])
Similarly, we can compute related quantities, like specificity (recall that sensitivity is the same thing as recall):
specificity = TN/(TN+FP)
Results for our example:
specificity
# array([0.99445676, 0.99684151, 0.9956512 , 0.99377086, 0.99589709,
# 0.99879227, 0.99270073, 0.99531877, 0.99401728, 0.99455011])
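Since the question also asks for the F-measure, and we already have the per-class precision and recall, one more line gives it (this step is implied but not shown above):
# per-class F1 score: the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)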
You should now be able to compute these quantities virtually for any size of your confusion matrix.

If you have a confusion matrix in the form of:
cmat = [[ 5,  7],
        [25, 37]]
the following simple function can be written:
def myscores(smat):
    tp = smat[0][0]
    fp = smat[0][1]
    fn = smat[1][0]
    tn = smat[1][1]
    return tp/(tp+fp), tp/(tp+fn)
Testing:
print("precision and recall:", myscores(cmat))
Output:
precision and recall: (0.4166666666666667, 0.16666666666666666)
The above function can also be extended to produce other scores; the formulae for these are listed at https://en.wikipedia.org/wiki/Confusion_matrix
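For instance, keeping the same cell layout as myscores above (that layout is just this answer's convention, not a universal one), a sketch of a few of those scores might look like this (morescores is a made-up name for illustration):
def morescores(smat):
    # same cell layout as myscores above
    tp, fp = smat[0][0], smat[0][1]
    fn, tn = smat[1][0], smat[1][1]
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, specificity, accuracy, f1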

There is a package called 'disarray'.
So, if I have four classes:
import numpy as np
a = np.random.randint(0,4,[100])
b = np.random.randint(0,4,[100])
I can use disarray to calculate 13 metrics:
import disarray
import pandas as pd
from sklearn.metrics import confusion_matrix

# Instantiate the confusion matrix DataFrame with index and columns
cm = confusion_matrix(a, b)
df = pd.DataFrame(cm, index=['a', 'b', 'c', 'd'], columns=['a', 'b', 'c', 'd'])
df.da.export_metrics()
which gives a DataFrame of the per-class metrics (precision, recall, F1, and so on).

Related

Calculate sum of all directly surrounding elements to some element in matrix

I need to calculate the sum of all the elements directly surrounding a given element in a matrix,
[ [1, 2, 3],
[4, 5, 6],
[7, 8, 9] ]
so that sum_neighbours(matrix[0][0]) == 11 and sum_neighbours(matrix[1][1]) == 40.
The problem is just that I'm a beginner and I don't know how to make sum_neighbours work out how many neighbours a given element has.
I figured that I could write an if-elif-else statement and hard-code the number of neighbours each position in the matrix has, but surely there must be a more efficient way to do this?
Otherwise it would only work for matrices that are 3 x 3.
A nice approach is to use numpy and a convolution:
import numpy as np
from scipy.signal import convolve2d
a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
convolve2d(a, [[1, 1, 1],   # top row of neighbours
               [1, 0, 1],   # centre row (the element itself is excluded)
               [1, 1, 1]],  # bottom row of neighbours
           mode='same')
output:
array([[11, 19, 13],
[23, 40, 27],
[17, 31, 19]])
Alternatively:
convolve2d(a, np.ones((3,3)), mode='same')-a
# this sums the neighbours + the center
# so we need to subtract the initial array
Example on a larger array, this time excluding one corner neighbour; this is just to show you how easy it is to perform similar operations when using convolutions. (Note that convolve2d flips the kernel, so the zero placed in the kernel's top-left entry actually excludes the bottom-right neighbour of each element, as the output below confirms.)
a = np.arange(5*6).reshape((5,6))
# array([[ 0, 1, 2, 3, 4, 5],
# [ 6, 7, 8, 9, 10, 11],
# [12, 13, 14, 15, 16, 17],
# [18, 19, 20, 21, 22, 23],
# [24, 25, 26, 27, 28, 29]])
convolve2d(a, [[0,1,1],[1,0,1],[1,1,1]], mode='same')
array([[ 7, 15, 19, 23, 27, 25],
[ 20, 42, 49, 56, 63, 52],
[ 44, 84, 91, 98, 105, 82],
[ 68, 126, 133, 140, 147, 112],
[ 62, 107, 112, 117, 122, 73]])
If you would like to achieve this without any imports (assuming you have already checked that you have a well-formed list of lists/matrix, i.e. all the rows have the same length):
# You pass the matrix and the (i, j) coordinates of the element of interest.
# select() extracts the sub-"matrix" around (i, j), flooring indices to 0 and
# capping them at the number of elements in the list, to handle elements on
# the edge of the matrix.
def select(m, i, j):
    def s(x, y): return x[max(0, y-1):min(len(x), y+1) + 1]
    return [s(x, j) for x in s(m, i)]

def sum_around(m, i, j, excluded=True):
    # Sum the elements within each row slice and compute the grand total,
    # then subtract the element at (i, j) if excluded=True (the default
    # behaviour, and what you want here).
    return sum(sum(x) for x in select(m, i, j)) - (m[i][j] if excluded else 0)

m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(sum_around(m, 0, 0))  # prints 11
print(sum_around(m, 1, 1))  # prints 40
I guess you can add an extra row and column of zeros around the boundary. Then you can easily add the neighbouring elements without any boundary conditions; see the sketch below.
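A minimal sketch of that padding idea with numpy (sum_neighbours_padded is a made-up name for illustration):
import numpy as np

def sum_neighbours_padded(a, i, j):
    # zero-pad by one cell so every original element has 8 in-bounds neighbours
    p = np.pad(a, 1)
    # element (i, j) of the original array sits at (i+1, j+1) in the padded one,
    # so the 3x3 window around it is p[i:i+3, j:j+3]
    return p[i:i+3, j:j+3].sum() - a[i, j]

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(sum_neighbours_padded(a, 0, 0))  # 11
print(sum_neighbours_padded(a, 1, 1))  # 40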

Finding all the variables that give the highest Adjusted R squared value

I have a dataframe which stores different variables. I'm using OLS linear regression and using all of the variables to predict the 'price' column.
import pandas as pd
import statsmodels.api as sm
data = {'accommodates': [2, 2, 3, 2, 2, 6, 8, 4, 3, 2],
        'bedrooms': [1, 2, 1, 1, 3, 4, 2, 2, 2, 3],
        'instant_bookable': [1, 0, 1, 1, 1, 1, 0, 0, 0, 1],
        'availability_365': [123, 3, 33, 14, 15, 16, 3, 41, 61, 74],
        'minimum_nights': [3, 12, 1, 4, 6, 7, 2, 3, 6, 10],
        'beds': [2, 2, 3, 4, 1, 5, 6, 2, 3, 2],
        'price': [59, 234, 15, 162, 56, 42, 28, 52, 22, 31]}
df = pd.DataFrame(data, columns=['accommodates', 'bedrooms', 'instant_bookable', 'availability_365',
                                 'minimum_nights', 'beds', 'price'])
I have a for loop which calculates the Adjusted R squared value for each variable:
fit_d = {}
for columns in [x for x in df.columns if x != 'price']:
    Y = df['price']
    X = df[columns]
    X = sm.add_constant(X)
    model = sm.OLS(Y, X, missing='drop').fit()
    fit_d[columns] = model.rsquared
fit_d
How can I modify my code in order to find the combination of variables that gives the largest adjusted R squared value? Ideally the function would first find the single variable with the largest adj. R squared, then, keeping that variable, iterate over the remaining ones to find the best pair, then the best triple, and so on, until the value cannot be increased further. I'd like the output to be something like
Best variables: {'accommodates', 'availability', 'bedrooms'}
Here is a "brute force way" to do all possible combinations (from itertools) of different length to find the variables with higher R value. The idea is to do 2 loops, one for the number of variables to try, and one for all the combinations with the number of variables.
from itertools import combinations
# all possible columns for X
cols = [x for x in df.columns if x != 'price']
# define Y once, the same across all loops
Y = df['price']
# result dictionary
fit_d = {}
# loop over every combination length
for i in range(1, len(cols)+1):
    # loop over all combinations of length i
    for comb in combinations(cols, i):
        # define X from the combination
        X = df[list(comb)]
        X = sm.add_constant(X)
        # perform the OLS fit
        model = sm.OLS(Y, X, missing='drop').fit()
        # save the R² in a dictionary (use model.rsquared_adj here
        # for the adjusted version)
        fit_d[comb] = model.rsquared
# extract the key for the max R value
key_max = max(fit_d, key=fit_d.get)
print(f'Best variables {key_max} for a R-value of {round(fit_d[key_max], 5)}')
# Best variables ('accommodates', 'bedrooms', 'instant_bookable', 'availability_365', 'minimum_nights', 'beds') for a R-value of 0.78506
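The question actually describes greedy forward selection rather than an exhaustive search. Here is a minimal sketch of that approach, assuming the df from the question and statsmodels as above; forward_select is a hypothetical helper, and rsquared_adj is used so the search can stop once adding a variable no longer helps:
import statsmodels.api as sm

def forward_select(df, target='price'):
    remaining = [c for c in df.columns if c != target]
    selected = []
    best_adj_r2 = float('-inf')
    Y = df[target]
    improved = True
    while improved and remaining:
        improved = False
        # score every candidate added to the current selection
        scores = {}
        for cand in remaining:
            X = sm.add_constant(df[selected + [cand]])
            scores[cand] = sm.OLS(Y, X, missing='drop').fit().rsquared_adj
        best_cand = max(scores, key=scores.get)
        # keep the candidate only if it improves the adjusted R²
        if scores[best_cand] > best_adj_r2:
            best_adj_r2 = scores[best_cand]
            selected.append(best_cand)
            remaining.remove(best_cand)
            improved = True
    return selected, best_adj_r2

best_vars, best_score = forward_select(df)
print('Best variables:', best_vars, 'adjusted R squared:', round(best_score, 5))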

DOATools.py - Using my own signal source (NOT generated)

I'm using doatools.py library (https://github.com/morriswmz/doatools.py)
Now, my code looks like:
import numpy as np
from scipy import constants as const
import math
import doatools.model as model
import doatools.estimation as estimation
def calculate_wavelength(frequency):
    return const.speed_of_light / frequency
# Uniform circular array
# X
# |
# X---------X
# |
# X
NUMBER_OF_ELEMENTS = 4 # elements are shown as "X"
RADIUS = 0.47 / 2
FREQ_MHZ = 315
freq = FREQ_MHZ * const.mega
wavelength = calculate_wavelength(freq)
antenna_array = model.UniformCircularArray(NUMBER_OF_ELEMENTS, RADIUS)
# Create a MUSIC-based estimator.
grid = estimation.FarField1DSearchGrid()
estimator = estimation.MUSIC(antenna_array, wavelength, grid)
R = np.array([[1.5, 2, 3, 4], [4, 5, 6, 5], [45, 5, 5, 6], [5, 1, 0, 5]])
_, estimates = estimator.estimate(R, 1, return_spectrum=False, refine_estimates=True)
print('Estimates: {0}'.format(estimates.locations))
I can generate a signal with this library, but how do I use my own? For example, a signal from an ADC, like this:
-> Switching to antenna 0 : [0, 4, 7, 10]
-> Switching to antenna 1 : [5, 6, 11, 83]
-> Switching to antenna 2 : [0, 23, 2, 34]
-> Switching to antenna 3 : [23, 105, 98, 200]
I think your question is how you should feed in the real data from the antennas, right?
Presumably your data is ordered in time: in the case of "antenna 0 : [0, 4, 7, 10]", 0 is the first sample in time, then 4 and 7 in order, and 10 is the last.
If yes, you could leave them as a simple matrix, exactly as you typed above (one row per antenna, one column per time snapshot):
r = np.array([[ 0,   4,  7,  10],
              [ 5,   6, 11,  83],
              [ 0,  23,  2,  34],
              [23, 105, 98, 200]])
# r[0, 0] = 0, r[0, 1] = 4, r[0, 2] = 7, r[0, 3] = 10
# r[1, 0] = 5, r[1, 1] = 6, ... etc.
# r[2, 0] = 0, ... etc.
R is then the product of r and its Hermitian (conjugate) transpose:
R = r @ r.conj().T
(With a plain NumPy ndarray use .conj().T; the .H shorthand only exists on np.matrix.)
This is the covariance matrix that you pass as the first argument to estimator.estimate().
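Putting it together, a hedged sketch (note that sample covariance estimates are conventionally normalized by the number of snapshots, which the answer above omits; in a real DOA setup you would also use complex baseband samples rather than these toy integers):
import numpy as np

# rows = antennas, columns = time snapshots (the ADC readings above)
r = np.array([[ 0,   4,  7,  10],
              [ 5,   6, 11,  83],
              [ 0,  23,  2,  34],
              [23, 105, 98, 200]], dtype=float)

# sample covariance matrix, normalized by the number of snapshots
R = r @ r.conj().T / r.shape[1]

# then, as in the question's code:
# _, estimates = estimator.estimate(R, 1, return_spectrum=False, refine_estimates=True)
# print('Estimates: {0}'.format(estimates.locations))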

Python - how to add and subtract elements in array

If I have an array, say np.array([4,8,-2,9,6,0,3,-6]), and I would like each element to be added to the running total of the previous ones, how do I do that?
And every time the number 0 shows up, the addition of elements 'restarts'.
An example with the above array: I should get the following output when I run the function:
stock = np.array([4,12,10,19,25,0,3,-3]) is the right output if the above array is passed in as transactions.
def cumulativeStock(transactions):
    # insert your code here
    return stock
I can't think of a way to solve this problem. Any help would be very appreciated.
I believe you mean something like this?
z = np.array([4, 8, -2, 9, 6, 0, 3, -6])
n = z == 0
# [False False False False False  True False False]
res = np.split(z, np.where(n)[0])
# [array([ 4,  8, -2,  9,  6]), array([ 0,  3, -6])]
res_total = [np.cumsum(x) for x in res]
# [array([ 4, 12, 10, 19, 25]), array([ 0,  3, -3])]
np.concatenate(res_total)
# [ 4 12 10 19 25  0  3 -3]
Another vectorized solution:
import numpy as np
stock = np.array([4, 8, -2, 9, 6, 0, 3, -6])
breaks = stock == 0
tmp = np.cumsum(stock)
# at each zero, inject the value that cancels the running total so far,
# so that the cumulative sum restarts from 0 there
brval = np.diff(np.concatenate(([0], -tmp[breaks])))
stock[breaks] = brval
np.cumsum(stock)
# array([ 4, 12, 10, 19, 25,  0,  3, -3])
import numpy as np
stock = np.array([4, 12, 10, 19, 25, 0, 3, -3, 4, 12, 10, 0, 19, 25, 0, 3, -3])

def cumsum_stock(stock):
    ## Detect all zeros first
    zero_p = np.where(stock == 0)[0]
    ## Create an empty array to append the results to
    final_stock = np.empty(shape=[0, len(zero_p)])
    for i in range(len(zero_p)):
        ## First zero detected
        if i == 0:
            stock_first_part = np.cumsum(stock[:zero_p[0]])
            stock_after_zero_part = np.cumsum(stock[zero_p[0]:zero_p[i+1]])
            final_stock = np.append(final_stock, stock_first_part)
            final_stock = np.append(final_stock, stock_after_zero_part)
        ## Last zero detected
        elif i == (len(zero_p) - 1):
            stock_last_part = np.cumsum(stock[zero_p[i]:])
            final_stock = np.append(final_stock, stock_last_part, axis=0)
        ## Intermediate zero detected
        else:
            intermediate_stock = np.cumsum(stock[zero_p[i]:zero_p[i+1]])
            final_stock = np.append(final_stock, intermediate_stock, axis=0)
    return final_stock
final_stock = cumsum_stock(stock).astype(int)
#Output
final_stock
Out[]: array([ 4, 16, 26, ..., 0, 3, 0])
final_stock.tolist()
Out[]: [4, 16, 26, 45, 70, 0, 3, 0, 4, 16, 26, 0, 19, 44, 0, 3, 0]
def cumulativeStock(transactions):
    def accum(x):
        acc = 0
        for i in x:
            if i == 0:
                acc = 0
            acc += i
            yield acc
    stock = np.array(list(accum(transactions)))
    return stock
For your input np.array([4,8,-2,9,6,0,3,-6]) it returns
array([ 4, 12, 10, 19, 25,  0,  3, -3])
I assume you mean you want to separate the list at every zero?
from itertools import groupby
import numpy

def cumulativeStock(transactions):
    # split the list on the item 0 (the zero groups themselves are dropped)
    all_lists = [list(group) for k, group in groupby(transactions, lambda x: x == 0) if not k]
    # cumulatively sum the items in each group
    stock = []
    for sep_list in all_lists:
        for item in numpy.cumsum(sep_list):
            stock.append(item)
    return stock

print(cumulativeStock([4, 8, -2, 9, 6, 0, 3, -6]))
Which will return (note that, unlike the desired output, the zeros themselves are dropped):
[4, 12, 10, 19, 25, 3, -3]
