Python scikit learn multi-class multi-label performance metrics?

Python scikit learn multi-class multi-label performance metrics? - python

I ran Random Forest classifier for my multi-class multi-label output variable. I got below output.
My y_test values
Degree Nature
762721 1 7
548912 0 6
727126 1 12
14880 1 12
189505 1 12
657486 1 12
461004 1 0
31548 0 6
296674 1 7
121330 0 17
predicted output :
[[ 1. 7.]
[ 0. 6.]
[ 1. 12.]
[ 1. 12.]
[ 1. 12.]
[ 1. 12.]
[ 1. 0.]
[ 0. 6.]
[ 1. 7.]
[ 0. 17.]]
Now I want to check the performance of my classifier. I found that for multiclass multilabel "Hamming loss or jaccard_similarity_score" is the good metrics. I tried to calculate it but I was getting value error.
Error:
ValueError: multiclass-multioutput is not supported
Below line I tried:
print hamming_loss(y_test, RF_predicted)
print jaccard_similarity_score(y_test, RF_predicted)
Thanks,

To calculate the unsupported hamming loss for multiclass / multilabel, you could:
import numpy as np
y_true = np.array([[1, 1], [2, 3]])
y_pred = np.array([[0, 1], [1, 2]])
np.sum(np.not_equal(y_true, y_pred))/float(y_true.size)
0.75
You can also get the confusion_matrix for each of the two labels like so:
from sklearn.metrics import confusion_matrix, precision_score
np.random.seed(42)
y_true = np.vstack((np.random.randint(0, 2, 10), np.random.randint(2, 5, 10))).T
[[0 4]
[1 4]
[0 4]
[0 4]
[0 2]
[1 4]
[0 3]
[0 2]
[0 3]
[1 3]]
y_pred = np.vstack((np.random.randint(0, 2, 10), np.random.randint(2, 5, 10))).T
[[1 2]
[1 2]
[1 4]
[1 4]
[0 4]
[0 3]
[1 4]
[1 3]
[1 3]
[0 4]]
confusion_matrix(y_true[:, 0], y_pred[:, 0])
[[1 6]
[2 1]]
confusion_matrix(y_true[:, 1], y_pred[:, 1])
[[0 1 1]
[0 1 2]
[2 1 2]]
You could also calculate the precision_score like so (or the recall_score in a similiar way):
precision_score(y_true[:, 0], y_pred[:, 0])
0.142857142857

Related

Compute Hausdorff distance for 3D numpy arrays

Imagine we are under a segmentation problem that has 5 classes (0, 1, 2, 3, 4). Considering that we have the following 3D mask volumes (A.K.A. 3D numpy arrays):
# Ground truth mask
y_true = np.array([[[2, 1, 4], [0, 1, 1], [2, 1, 0]],
[[2, 2, 2], [0, 1, 0], [0, 1, 1]],
[[2, 4, 4], [2, 1, 4], [2, 1, 1]]])
# Predicted mask
y_pred = np.array([[[2, 0, 4], [0, 2, 1], [2, 0, 0]],
[[2, 4, 0], [0, 1, 2], [0, 4, 1]],
[[2, 0, 4], [1, 1, 4], [2, 2, 1]]])
How can I compute the Hausdorff distance between them? I've looked into Monai's implementation however couldn't figure out the meaning of the compute_hausdorff_distance output.
I implemented a one-hot encoder, since Monai requires the inputs to be one-hot encoded.
def one_hot_encode(array):
return np.eye(5)[array].astype(dtype=int)
Now we have that:
# Ground truth mask
y_true = [[[[0 0 1 0 0]
[0 1 0 0 0]
[0 0 0 0 1]]
...
[[1 0 0 0 0]
[0 1 0 0 0]
[0 1 0 0 0]]]
# Predicted mask
y_pred = [[[[0 0 1 0 0]
[1 0 0 0 0]
[0 0 0 0 1]]
...
[[0 0 1 0 0]
[0 0 1 0 0]
[0 1 0 0 0]]]]
The output of Monai's implementation is:
>>> compute_hausdorff_distance(one_hot_encode(y_pred), one_hot_encode(y_true), include_background=True)
>>> [[1. 1. 1. ]
[2. 1.41421356 3. ]
[2.23606798 1. 1. ]]
Looking at it I can understand it is computing the euclidean distance. It looks like it is looking at labels as positions, but should't the output be of shape 3x3x3just like the masks?
Also, Scipy implementation only works for 2D masks/arrays. Would it be right to compute the Hausdorff distance slice-wise, i.e., slice by slice, and afterwards average all the slice Hausdorff distances obtained? Or does this approach violates the Hausdorff distance principle for 3D data?

Batch of multidimensional arrays multiplied by a batch of scalers (without loops)

I have a batch, two data, of multidimensional arrays (3,3,2) as following:
batch= np.asarray([
[
[[1,2,3],
[3,1,1,],
[4,9,0,]],
[[2,2,2],
[5,6,7],
[3,3,3]]
],
[
[[2,2,2],
[5,6,7],
[3,3,3]],
[[1,2,3],
[3,1,1],
[4,9,0]]
]
])
correspondingly I have batch, two data, of (1,1,2) as follows
scalers = np.asarray([
[
[[1]],
[[2]]
],
[
[[0]],
[[3]]
]
])
each dimension in the batch should be multiplied by its corresponding scaler in the scalers array. For example:
# the first dimension
[[1,2,3],
1 * [3,1,1,],
[4,9,0,]]
# the second dimension
2 * [[2,2,2],
[5,6,7],
[3,3,3]]
.
.
.
# the last dimension
3* [[1,2,3],
[3,1,1],
[4,9,0]]
So , the expected output should be like the following:
[
[
[[1 2 3],
[3 1 1 ],
[4 9 0 ]],
[[4 4 4],
[10 12 14],
[6 6 6]]
],
[
[[0 0 0],
[0 0 0],
[0 0 0]],
[[3 6 9],
[9 3 3],
[12 27 0]]
]
]
I was trying to do the following to avoid any loops
batch * scalers
but it seems it is not correct, I wonder how to do the behavior above

Numpy - is there a way to specify broadcasting dimension?

Broadcasting is only possible (as far as I know) with matrices matching shape from the end (shape [4,3,2] is broadcastable with shapes [2], [3,2], [4,3,2]). But why?
Consider the following example:
np.zeros([4,3,2])
[[[0 0]
[0 0]
[0 0]]
[[0 0]
[0 0]
[0 0]]
[[0 0]
[0 0]
[0 0]]
[[0 0]
[0 0]
[0 0]]]
Why broadcasting with [1,2,3], or [1,2,3,4] isn't possible?
Adding with [1,2,3] (shape: [3], target shape: [4,3,2]) expected result:
[[[1 1]
[2 2]
[3 3]]
[[1 1]
[2 2]
[3 3]]
[[1 1]
[2 2]
[3 3]]
[[1 1]
[2 2]
[3 3]]]
Adding with [1,2,3,4] (shape: [4], target shape: [4,3,2]) expected result:
[[[1 1]
[1 1]
[1 1]]
[[2 2]
[2 2]
[2 2]]
[[3 3]
[3 3]
[3 3]]
[[4 4]
[4 4]
[4 4]]]
Or, if there would be concerns about multi dimensional broadcasting this way, adding with:
[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]
(shape: [4,3], target shape: [4,3,2]) expected result:
[[[ 1 1]
[ 2 2]
[ 3 3]]
[[ 4 4]
[ 5 5]
[ 6 6]]
[[ 7 7]
[ 8 8]
[ 9 9]]
[[10 10]
[11 11]
[12 12]]]
So basically what I'm saying is that I can't see a reason why it couldn't find the matching shape, and do the operations respectively. If there's multiple dimensions matching in the target matrix, just select the last one automatically, or have the option to specify which dimension we want to perform the operation.
Any ideas/suggestions?

The broadcasting rules are simple and unambiguous.
add leading size 1 dimension as needed to match total number of dimensions
adjust all size 1 dimensions as needed to match
With (4,3,2)
(2,) => (1,1,2) => (4,3,2)
(3,2) => (1,3,2) => (4,3,2)
(3,) => (1,1,3) => (4,3,3) ERROR
(4,) => (1,1,4)
(4,3) => (1,4,3)
With reshape or np.newaxis we can add explicit new dimensions in the right place:
(3,1) => (1,3,1) => (4,3,2)
(4,1,1) => (4,3,2)
(4,3,1) => (4,3,2)
Why doesn't it do the last stuff automatically? Potential ambiguity. Without those rules, especially the 'add only leading', it would be possible to add the extra dimension in several different places.
e.g.
(2,3,3) + (3,) => is that (1,1,3) or (1,3,1)?
(2,3,3,3) + (3,3)

Constructing a confusion matrix from data without sklearn

I am trying to construct a confusion matrix without using the sklearn library. I am having trouble correctly forming the confusion matrix. Here's my code:
def comp_confmat():
currentDataClass = [1,3,3,2,5,5,3,2,1,4,3,2,1,1,2]
predictedClass = [1,2,3,4,2,3,3,2,1,2,3,1,5,1,1]
cm = []
classes = int(max(currentDataClass) - min(currentDataClass)) + 1 #find number of classes
for c1 in range(1,classes+1):#for every true class
counts = []
for c2 in range(1,classes+1):#for every predicted class
count = 0
for p in range(len(currentDataClass)):
if currentDataClass[p] == predictedClass[p]:
count += 1
counts.append(count)
cm.append(counts)
print(np.reshape(cm,(classes,classes)))
However this returns:
[[7 7 7 7 7]
[7 7 7 7 7]
[7 7 7 7 7]
[7 7 7 7 7]
[7 7 7 7 7]]
But I don't understand why each iteration results in 7 when I am reseting the count each time and it's looping through different values?
This is what I should be getting (using the sklearn's confusion_matrix function):
[[3 0 0 0 1]
[2 1 0 1 0]
[0 1 3 0 0]
[0 1 0 0 0]
[0 1 1 0 0]]

You can derive the confusion matrix by counting the number of instances in each combination of actual and predicted classes as follows:
import numpy as np
def comp_confmat(actual, predicted):
# extract the different classes
classes = np.unique(actual)
# initialize the confusion matrix
confmat = np.zeros((len(classes), len(classes)))
# loop across the different combinations of actual / predicted classes
for i in range(len(classes)):
for j in range(len(classes)):
# count the number of instances in each combination of actual / predicted classes
confmat[i, j] = np.sum((actual == classes[i]) & (predicted == classes[j]))
return confmat
# sample data
actual = [1, 3, 3, 2, 5, 5, 3, 2, 1, 4, 3, 2, 1, 1, 2]
predicted = [1, 2, 3, 4, 2, 3, 3, 2, 1, 2, 3, 1, 5, 1, 1]
# confusion matrix
print(comp_confmat(actual, predicted))
# [[3. 0. 0. 0. 1.]
# [2. 1. 0. 1. 0.]
# [0. 1. 3. 0. 0.]
# [0. 1. 0. 0. 0.]
# [0. 1. 1. 0. 0.]]

In your innermost loop, there should be a case distinction: Currently this loop counts agreement, but you only want that if actually c1 == c2.
Here's another way, using nested list comprehensions:
currentDataClass = [1,3,3,2,5,5,3,2,1,4,3,2,1,1,2]
predictedClass = [1,2,3,4,2,3,3,2,1,2,3,1,5,1,1]
classes = int(max(currentDataClass) - min(currentDataClass)) + 1 #find number of classes
counts = [[sum([(currentDataClass[i] == true_class) and (predictedClass[i] == pred_class)
for i in range(len(currentDataClass))])
for pred_class in range(1, classes + 1)]
for true_class in range(1, classes + 1)]
counts
[[3, 0, 0, 0, 1],
[2, 1, 0, 1, 0],
[0, 1, 3, 0, 0],
[0, 1, 0, 0, 0],
[0, 1, 1, 0, 0]]

Here is my solution using numpy and pandas:
import numpy as np
import pandas as pd
true_classes = [1, 3, 3, 2, 5, 5, 3, 2, 1, 4, 3, 2, 1, 1, 2]
predicted_classes = [1, 2, 3, 4, 2, 3, 3, 2, 1, 2, 3, 1, 5, 1, 1]
classes = set(true_classes)
number_of_classes = len(classes)
conf_matrix = pd.DataFrame(
np.zeros((number_of_classes, number_of_classes),dtype=int),
index=classes,
columns=classes)
for true_label, prediction in zip(true_classes ,predicted_classes):
# Each pair of (true_label, prediction) is a position in the confusion matrix (row, column)
# Basically here we are counting how many times we have each pair.
# The counting will be placed at the matrix index (true_label/row, prediction/column)
conf_matrix.loc[true_label, prediction] += 1
print(conf_matrix.values)
[[3 0 0 0 1]
[2 1 0 1 0]
[0 1 3 0 0]
[0 1 0 0 0]
[0 1 1 0 0]]

how to duplicate each row of a matrix N times Numpy

I have a matrix with these dimensions (150,2) and I want to duplicate each row N times. I show what I mean with an example.
Input:
a = [[2, 3], [5, 6], [7, 9]]
suppose N= 3, I want this output:
[[2 3]
[2 3]
[2 3]
[5 6]
[5 6]
[5 6]
[7 9]
[7 9]
[7 9]]
Thank you.

Use np.repeat with parameter axis=0 as:
a = np.array([[2, 3],[5, 6],[7, 9]])
print(a)
[[2 3]
[5 6]
[7 9]]
r_a = np.repeat(a, repeats=3, axis=0)
print(r_a)
[[2 3]
[2 3]
[2 3]
[5 6]
[5 6]
[5 6]
[7 9]
[7 9]
[7 9]]

To create an empty multidimensional array in NumPy (e.g. a 2D array m*n to store your matrix), in case you don't know m how many rows you will append and don't care about the computational cost Stephen Simmons mentioned (namely re-building the array at each append), you can squeeze to 0 the dimension to which you want to append to: X = np.empty(shape=[0, n]).
This way you can use for example (here m = 5 which we assume we didn't know when creating the empty matrix, and n = 2):
import numpy as np
n = 2
X = np.empty(shape=[0, n])
for i in range(5):
for j in range(2):
X = np.append(X, [[i, j]], axis=0)
print X
which will give you:
[[ 0. 0.]
[ 0. 1.]
[ 1. 0.]
[ 1. 1.]
[ 2. 0.]
[ 2. 1.]
[ 3. 0.]
[ 3. 1.]
[ 4. 0.]
[ 4. 1.]]

If your input is a vector, use atleast_2d first.
a = np.atleast_2d([2, 3]).repeat(repeats=3, axis=0)
print(a)
# [[2 3]
# [2 3]
# [2 3]]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python scikit learn multi-class multi-label performance metrics? - python

Related

Compute Hausdorff distance for 3D numpy arrays

Batch of multidimensional arrays multiplied by a batch of scalers (without loops)

Numpy - is there a way to specify broadcasting dimension?

Constructing a confusion matrix from data without sklearn

how to duplicate each row of a matrix N times Numpy

Categories

Resources