Constructing a confusion matrix from data without sklearn - python

I am trying to construct a confusion matrix without using the sklearn library. I am having trouble correctly forming the confusion matrix. Here's my code:
import numpy as np

def comp_confmat():
    currentDataClass = [1,3,3,2,5,5,3,2,1,4,3,2,1,1,2]
    predictedClass = [1,2,3,4,2,3,3,2,1,2,3,1,5,1,1]
    cm = []
    classes = int(max(currentDataClass) - min(currentDataClass)) + 1 #find number of classes
    for c1 in range(1,classes+1):#for every true class
        counts = []
        for c2 in range(1,classes+1):#for every predicted class
            count = 0
            for p in range(len(currentDataClass)):
                if currentDataClass[p] == predictedClass[p]:
                    count += 1
            counts.append(count)
        cm.append(counts)
    print(np.reshape(cm,(classes,classes)))
However this returns:
[[7 7 7 7 7]
[7 7 7 7 7]
[7 7 7 7 7]
[7 7 7 7 7]
[7 7 7 7 7]]
But I don't understand why each iteration results in 7 when I am resetting the count each time and it's looping through different values.
This is what I should be getting (using sklearn's confusion_matrix function):
[[3 0 0 0 1]
[2 1 0 1 0]
[0 1 3 0 0]
[0 1 0 0 0]
[0 1 1 0 0]]

You can derive the confusion matrix by counting the number of instances in each combination of actual and predicted classes as follows:
import numpy as np
def comp_confmat(actual, predicted):
    # extract the different classes
    classes = np.unique(actual)
    # initialize the confusion matrix
    confmat = np.zeros((len(classes), len(classes)))
    # loop across the different combinations of actual / predicted classes
    for i in range(len(classes)):
        for j in range(len(classes)):
            # count the number of instances in each combination of actual / predicted classes
            confmat[i, j] = np.sum((actual == classes[i]) & (predicted == classes[j]))
    return confmat
# sample data
actual = [1, 3, 3, 2, 5, 5, 3, 2, 1, 4, 3, 2, 1, 1, 2]
predicted = [1, 2, 3, 4, 2, 3, 3, 2, 1, 2, 3, 1, 5, 1, 1]
# confusion matrix
print(comp_confmat(actual, predicted))
# [[3. 0. 0. 0. 1.]
# [2. 1. 0. 1. 0.]
# [0. 1. 3. 0. 0.]
# [0. 1. 0. 0. 0.]
# [0. 1. 1. 0. 0.]]

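Your innermost loop is missing a case distinction: it only checks whether the true and predicted labels agree and never uses c1 or c2, so every cell ends up holding the total number of agreements (7). Each cell should instead count the samples whose true class is c1 and whose predicted class is c2.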
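As a rough sketch of that fix, keeping the question's loop structure: the only real change is the condition in the innermost loop. Passing the lists in as arguments and returning a list of lists are my own choices here, and the code assumes integer labels covering a contiguous range, as in the question.

```python
def comp_confmat(actual, predicted):
    # Assumption: integer labels forming a contiguous range, as in the question.
    lo, hi = min(actual), max(actual)
    cm = []
    for c1 in range(lo, hi + 1):        # every true class (row)
        counts = []
        for c2 in range(lo, hi + 1):    # every predicted class (column)
            count = 0
            for t, p in zip(actual, predicted):
                # count only samples with true label c1 AND predicted label c2
                if t == c1 and p == c2:
                    count += 1
            counts.append(count)
        cm.append(counts)
    return cm

currentDataClass = [1, 3, 3, 2, 5, 5, 3, 2, 1, 4, 3, 2, 1, 1, 2]
predictedClass = [1, 2, 3, 4, 2, 3, 3, 2, 1, 2, 3, 1, 5, 1, 1]
cm = comp_confmat(currentDataClass, predictedClass)
for row in cm:
    print(row)
```

The diagonal of this matrix sums to 7, which is the number the original loop wrote into every cell.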
Here's another way, using nested list comprehensions:
currentDataClass = [1,3,3,2,5,5,3,2,1,4,3,2,1,1,2]
predictedClass = [1,2,3,4,2,3,3,2,1,2,3,1,5,1,1]
classes = int(max(currentDataClass) - min(currentDataClass)) + 1 #find number of classes
counts = [[sum([(currentDataClass[i] == true_class) and (predictedClass[i] == pred_class)
                for i in range(len(currentDataClass))])
           for pred_class in range(1, classes + 1)]
          for true_class in range(1, classes + 1)]
counts
[[3, 0, 0, 0, 1],
[2, 1, 0, 1, 0],
[0, 1, 3, 0, 0],
[0, 1, 0, 0, 0],
[0, 1, 1, 0, 0]]

Here is my solution using numpy and pandas:
import numpy as np
import pandas as pd
true_classes = [1, 3, 3, 2, 5, 5, 3, 2, 1, 4, 3, 2, 1, 1, 2]
predicted_classes = [1, 2, 3, 4, 2, 3, 3, 2, 1, 2, 3, 1, 5, 1, 1]
classes = sorted(set(true_classes))  # sorted list: recent pandas versions reject a set as index/columns
number_of_classes = len(classes)
conf_matrix = pd.DataFrame(
    np.zeros((number_of_classes, number_of_classes), dtype=int),
    index=classes,
    columns=classes)
for true_label, prediction in zip(true_classes, predicted_classes):
    # Each pair of (true_label, prediction) is a position in the confusion matrix (row, column)
    # Basically here we are counting how many times we have each pair.
    # The counting will be placed at the matrix index (true_label/row, prediction/column)
    conf_matrix.loc[true_label, prediction] += 1
print(conf_matrix.values)
[[3 0 0 0 1]
[2 1 0 1 0]
[0 1 3 0 0]
[0 1 0 0 0]
[0 1 1 0 0]]
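The same pair-counting idea can also be written without pandas and without an explicit Python loop. The following is only a sketch: the function name confusion_matrix_np is made up for illustration, and it assumes 1D label sequences of equal length. Labels are mapped to row/column positions with np.unique, so they do not have to be contiguous integers.

```python
import numpy as np

def confusion_matrix_np(actual, predicted):
    # Hypothetical helper name; assumes two 1D label sequences of equal length.
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)
    # Sorted labels seen in either array; `inverse` gives each label's position.
    labels, inverse = np.unique(np.concatenate([actual, predicted]),
                                return_inverse=True)
    true_idx = inverse[:actual.size]
    pred_idx = inverse[actual.size:]
    cm = np.zeros((labels.size, labels.size), dtype=int)
    # Unbuffered add, so repeated (row, column) pairs are all counted.
    np.add.at(cm, (true_idx, pred_idx), 1)
    return labels, cm

true_classes = [1, 3, 3, 2, 5, 5, 3, 2, 1, 4, 3, 2, 1, 1, 2]
predicted_classes = [1, 2, 3, 4, 2, 3, 3, 2, 1, 2, 3, 1, 5, 1, 1]
labels, cm = confusion_matrix_np(true_classes, predicted_classes)
print(labels)
print(cm)
```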

Related

Compute Hausdorff distance for 3D numpy arrays

Suppose we have a segmentation problem with 5 classes (0, 1, 2, 3, 4) and the following 3D mask volumes (i.e. 3D numpy arrays):
# Ground truth mask
y_true = np.array([[[2, 1, 4], [0, 1, 1], [2, 1, 0]],
                   [[2, 2, 2], [0, 1, 0], [0, 1, 1]],
                   [[2, 4, 4], [2, 1, 4], [2, 1, 1]]])
# Predicted mask
y_pred = np.array([[[2, 0, 4], [0, 2, 1], [2, 0, 0]],
                   [[2, 4, 0], [0, 1, 2], [0, 4, 1]],
                   [[2, 0, 4], [1, 1, 4], [2, 2, 1]]])
How can I compute the Hausdorff distance between them? I've looked into Monai's implementation, but I couldn't figure out the meaning of the compute_hausdorff_distance output.
I implemented a one-hot encoder, since Monai requires the inputs to be one-hot encoded.
def one_hot_encode(array):
    return np.eye(5)[array].astype(dtype=int)
Now we have that:
# Ground truth mask
y_true = [[[[0 0 1 0 0]
[0 1 0 0 0]
[0 0 0 0 1]]
...
[[1 0 0 0 0]
[0 1 0 0 0]
[0 1 0 0 0]]]
# Predicted mask
y_pred = [[[[0 0 1 0 0]
[1 0 0 0 0]
[0 0 0 0 1]]
...
[[0 0 1 0 0]
[0 0 1 0 0]
[0 1 0 0 0]]]]
The output of Monai's implementation is:
>>> compute_hausdorff_distance(one_hot_encode(y_pred), one_hot_encode(y_true), include_background=True)
>>> [[1. 1. 1. ]
[2. 1.41421356 3. ]
[2.23606798 1. 1. ]]
Looking at it, I can see that it is computing the Euclidean distance. It looks like it is treating labels as positions, but shouldn't the output be of shape 3x3x3, just like the masks?
Also, SciPy's implementation only works for 2D masks/arrays. Would it be right to compute the Hausdorff distance slice-wise, i.e., slice by slice, and afterwards average all the slice Hausdorff distances obtained? Or does this approach violate the Hausdorff distance principle for 3D data?
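One possible way to get a genuinely 3D value with SciPy, offered as a sketch rather than a reference implementation: scipy.spatial.distance.directed_hausdorff takes two 2D arrays of point coordinates with shape (number of points, number of dimensions), so the 3D voxel coordinates of one class can be passed directly. The Hausdorff distance is defined on point sets in any dimension, and an average of slice-wise values is in general a different quantity. The helper name hausdorff_for_label is made up, distances are in voxel units (no spacing), and all voxels of the class are used rather than only surface voxels, so numbers may differ from tools that extract surfaces first.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff_for_label(y_true, y_pred, label):
    # Hypothetical helper: symmetric Hausdorff distance for a single class.
    # np.argwhere returns voxel coordinates with shape (num_voxels, 3).
    pts_true = np.argwhere(y_true == label)
    pts_pred = np.argwhere(y_pred == label)
    if len(pts_true) == 0 or len(pts_pred) == 0:
        return np.nan  # undefined when the class is missing from one mask
    # The symmetric distance is the larger of the two directed distances.
    forward = directed_hausdorff(pts_true, pts_pred)[0]
    backward = directed_hausdorff(pts_pred, pts_true)[0]
    return max(forward, backward)

y_true = np.array([[[2, 1, 4], [0, 1, 1], [2, 1, 0]],
                   [[2, 2, 2], [0, 1, 0], [0, 1, 1]],
                   [[2, 4, 4], [2, 1, 4], [2, 1, 1]]])
y_pred = np.array([[[2, 0, 4], [0, 2, 1], [2, 0, 0]],
                   [[2, 4, 0], [0, 1, 2], [0, 4, 1]],
                   [[2, 0, 4], [1, 1, 4], [2, 2, 1]]])

# One distance per class, each computed over the whole 3D volume.
for label in range(5):
    print(label, hausdorff_for_label(y_true, y_pred, label))
```

This yields one scalar per class, not an array shaped like the masks, since the Hausdorff distance summarizes the mismatch between two whole point sets.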

Numpy broadcast array to smaller array with exact position for every row

Consider example matrix array:
[[0 1 2 1 0]
[1 1 2 1 0]
[0 1 0 0 0]
[1 2 1 0 0]
[1 2 2 3 2]]
What I need to do:
find the maximum in every row
select a small neighborhood around the maximum in every row (3 values in this case)
paste each neighborhood into a new, narrower array
For the example above, the result is:
[[ 1. 2. 1.]
[ 1. 2. 1.]
[ 0. 1. 0.]
[ 1. 2. 1.]
[ 2. 3. 2.]]
My current working code:
import numpy as np
A = np.array([
    [0, 1, 2, 1, 0],
    [1, 1, 2, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [1, 2, 2, 3, 2],
])
b = A.argmax(axis=1)
C = np.zeros((len(A), 3))
for idx, loc, row in zip(range(len(A)), b, A):
    print(idx, loc, row)
    C[idx] = row[loc-1:loc+2]
print(C)
My question:
How to get rid of the for loop and replace it with some cheaper numpy operation?
Note:
This algorithm is for straightening broken "lines" in video stream frames with thousands of rows.
Approach #1
We can have a vectorized solution based on setting up sliding windows and then indexing into those with b-offset indices to get the desired output. We can leverage scikit-image's view_as_windows, which is based on np.lib.stride_tricks.as_strided, to get the sliding windows.
The implementation would be -
from skimage.util.shape import view_as_windows
L = 3 # window length
w = view_as_windows(A,(1,L))[...,0,:]
Cout = w[np.arange(len(b)),b-L//2]
Being a view-based method, this has the advantage of being memory-efficient and hence good on performance too.
Approach #2
Alternatively, a one-liner by creating all those indices with outer-addition would be -
A[np.arange(len(b))[:,None],b[:,None] + np.arange(-(L//2),L//2+1)]
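This works by making an array with all the desired indices. Using that array directly as A[:, index] results in a 3D array of shape (5, 5, 3), because the slice keeps every row of A for every row of index, hence the subsequent indexing, which picks entry [i, i] for each row i. Probably not optimal, but definitely another way of doing it!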
import numpy as np
A = np.array([
    [0, 1, 2, 1, 0],
    [1, 1, 2, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [1, 2, 2, 3, 2],
])
b = A.argmax(axis = 1).reshape(-1, 1)
index = b + np.arange(-1,2,1).reshape(1, -1)
A[:,index][np.arange(b.size),np.arange(b.size)]
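A variant of the same index-array idea that avoids the intermediate 3D array is np.take_along_axis. This is only a sketch: clipping the center so the window stays inside the row is my own assumption about how maxima at the row edges should be handled, since the question's data never hits that case.

```python
import numpy as np

A = np.array([
    [0, 1, 2, 1, 0],
    [1, 1, 2, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [1, 2, 2, 3, 2],
])
L = 3          # window length
half = L // 2

# Assumption: if a maximum sits at a row edge, shift the window inward
# instead of letting the indices run out of bounds or wrap around.
center = np.clip(A.argmax(axis=1), half, A.shape[1] - half - 1)

# One row of column indices per row of A, shape (rows, L).
idx = center[:, None] + np.arange(-half, half + 1)

# Pick A[i, idx[i, j]] for every i, j in one call.
C = np.take_along_axis(A, idx, axis=1)
print(C)
```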

n*n matrix of 0s and 1s in python [duplicate]

How can I create an ‘n*n’ checkerboard matrix whose values alternate between 0 and 1, using the tile function?
For example:
when n has a value of 2, Output should be:
[[0 1]
[1 0]]
I am able to create a matrix of 0s and 1s, but the rows do not alternate. Below is what I tried:
import numpy as np
n = 4
arr = ([0,1])
print(np.tile(arr,(n,n//2)))
output I got:
[[0 1 0 1]
[0 1 0 1]
[0 1 0 1]
[0 1 0 1]]
output I want:
[[0 1 0 1]
[1 0 1 0]
[0 1 0 1]
[1 0 1 0]]
A simple way using numpy could be to define a vector of 0s and 1s of size n and take advantage of broadcasting to create an n x n checkerboard:
def checkerboard(n):
    a = np.resize([0,1], n)
    return np.abs(a-np.array([a]).T)
Sample use -
checkerboard(2)
array([[0, 1],
[1, 0]])
checkerboard(4)
array([[0, 1, 0, 1],
[1, 0, 1, 0],
[0, 1, 0, 1],
[1, 0, 1, 0]])
Details -
The above works by initially creating a length n 1D vector of 0s and 1s using np.resize:
import numpy as np
n = 3
np.resize([0,1], n)
array([0, 1, 0])
And then subtracting its transpose (2D), which results in a broadcast array of shape (n,n) with negative and positive 1s (shown here for n = 4):
a-np.array([a]).T
array([[ 0, 1, 0, 1],
[-1, 0, -1, 0],
[ 0, 1, 0, 1],
[-1, 0, -1, 0]])
We just need to take the absolute value of it and we have a checkerboard matrix.
You could use numpy fancy indexing, no need to use np.tile:
import numpy as np
def tiling(n):
    result = np.zeros((n, n))
    result[::2, 1::2] = 1
    result[1::2, ::2] = 1
    return result
print(tiling(2))
print()
print(tiling(4))
Output
[[0. 1.]
[1. 0.]]
[[0. 1. 0. 1.]
[1. 0. 1. 0.]
[0. 1. 0. 1.]
[1. 0. 1. 0.]]
Here is a one-line numpy solution. That said, I think Daniel's response is much more readable and probably more efficient.
If n is odd then np.arange(n*n).reshape(n,n)%2 gives the correct result. However, if n is even, then all the rows and columns will be the same (like your result). We can fix this by subtracting one from every other row.
tile = (np.arange(n*n).reshape(n,n)-np.arange(n).reshape(n,1)*(n%2+1))%2
Equivalently,
tile = (np.arange(n*n).reshape(n,n,order='F')-np.arange(n)*(n+1))%2
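Since the question asks specifically for the tile function, here is a sketch of one way to do it with np.tile: tile a 2 x 2 checkerboard block instead of a single row, then crop so that odd n works as well. The function name checkerboard_tile is made up for illustration.

```python
import numpy as np

def checkerboard_tile(n):
    # The smallest checkerboard; repeating it keeps the alternation
    # going across both rows and columns.
    base = np.array([[0, 1],
                     [1, 0]])
    reps = (n + 1) // 2          # enough 2x2 blocks to cover n
    # Tile, then crop to n x n (the crop only matters when n is odd).
    return np.tile(base, (reps, reps))[:n, :n]

print(checkerboard_tile(2))
print(checkerboard_tile(4))
```

The attempt in the question repeats the single row [0, 1], so every row starts with 0. Tiling a block whose second row is already flipped is what makes consecutive rows alternate.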

Faster way to "distribute" values of an ndarray into other ndarrays based on assignments?

Generally, I'm trying to split a distance matrix into K folds. Specifically, for the 3 x 3 case, my distance matrix might look like this:
full = np.array([
[0, 0, 3],
[1, 0, 1],
[2, 1, 0]
])
I also have a list of randomly generated assignments, the length of which is equal to the sum over all elements in the distance matrix. For the K = 3 case, it might look like this:
assignments = np.array([0, 1, 0, 2, 1, 1, 0, 0])
I want to create K = 3 new 3 x 3 matrices of zeros, in which the values of the distance matrix are "distributed" according to the assignments list. Code is more precise than words, so here's my current attempt:
def assign(full, assignments):
    folds = [np.zeros(full.shape) for _ in range(np.max(assignments) + 1)]
    rows, cols = full.shape
    a = 0
    for r in range(rows):
        for c in range(cols):
            for i in range(full[r, c]):
                folds[assignments[a]][r, c] += 1
                a += 1
    return folds
This works (slowly), and in this example,
folds = assign(full, assignments)
for f in folds:
    print(f)
returns
[[ 0. 0. 2.]
[ 0. 0. 0.]
[ 1. 1. 0.]]
[[ 0. 0. 1.]
[ 0. 0. 1.]
[ 1. 0. 0.]]
[[ 0. 0. 0.]
[ 1. 0. 0.]
[ 0. 0. 0.]]
as desired. However, my current attempt is very slow, especially for the N x N case for N large. How can I improve the speed of this function? Is there some numpy magic that I should be using here?
One idea I had was converting to a sparse matrix and looping over nonzero entries. This would only help a bit, however.
You can use np.add.at to perform an unbuffered in-place operation:
import numpy as np
full = np.array([
    [0, 0, 3],
    [1, 0, 1],
    [2, 1, 0]
])
assignments = np.array([0, 1, 0, 2, 1, 1, 0, 0])
res = np.zeros((np.max(assignments) + 1,) + full.shape, dtype=int)
r, c = np.nonzero(full)
n = full[r, c]
r = np.repeat(r, n)
c = np.repeat(c, n)
np.add.at(res, (assignments, r, c), 1)
print(res)
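To illustrate why the unbuffered np.add.at matters here, a minimal sketch with made-up indices (not taken from the question): with ordinary fancy-index assignment, an index that occurs several times is only incremented once, whereas np.add.at applies every occurrence.

```python
import numpy as np

idx = np.array([0, 0, 1])   # index 0 appears twice

# Buffered: the right-hand side is computed once per position and written
# back, so the duplicate of index 0 collapses into a single increment.
buffered = np.zeros(3, dtype=int)
buffered[idx] += 1

# Unbuffered: each occurrence of an index contributes its own increment.
unbuffered = np.zeros(3, dtype=int)
np.add.at(unbuffered, idx, 1)

print(buffered)    # [1 1 0]
print(unbuffered)  # [2 1 0]
```

In the fold-assignment problem the same (fold, row, column) triple can occur many times, which is exactly the duplicate-index situation.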
You just need to figure out what item in the flattened output would get incremented each time, then aggregate them with bincount:
def assign(full, assignments):
    assert len(assignments) == np.sum(full)
    rows, cols = full.shape
    n = np.max(assignments) + 1
    full_flat = full.reshape(-1)
    full_flat_non_zero = full_flat != 0
    full_flat_indices = np.repeat(np.where(full_flat_non_zero)[0],
                                  full_flat[full_flat_non_zero])
    folds_flat_indices = full_flat_indices + assignments*rows*cols
    return np.bincount(folds_flat_indices,
                       minlength=n*rows*cols).reshape(n, rows, cols)
>>> assign(full, assignments)
array([[[0, 0, 2],
[0, 0, 0],
[1, 1, 0]],
[[0, 0, 1],
[0, 0, 1],
[1, 0, 0]],
[[0, 0, 0],
[1, 0, 0],
[0, 0, 0]]])
You may want to print out each of those intermediate arrays for your example, to see what exactly is going on.

Sum over rows in scipy.sparse.csr_matrix

I have a big csr_matrix and I want to sum groups of rows to obtain a new csr_matrix with the same number of columns but a reduced number of rows. (Context: The matrix is a document-term matrix obtained from sklearn CountVectorizer and I want to be able to quickly combine documents according to codes associated with these documents)
For a minimal example, this is my matrix:
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse import vstack
row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))
print(A.toarray())
[[1 0 0 0 0]
[0 0 3 0 0]
[0 5 0 0 0]
[4 0 0 0 0]
[0 0 2 0 0]]
Now let's say I want a new matrix B in which rows (1, 4) and (2, 3, 5) are combined by summing them, which would look something like this:
[[5 0 0 0 0]
[0 5 5 0 0]]
And should be again in sparse format (because the real data I'm working with is large). I tried to sum over slices of the matrix and then stack it:
idx1 = [1, 4]
idx2 = [2, 3, 5]
A_sub1 = A[idx1, :].sum(axis=1)
A_sub2 = A[idx2, :].sum(axis=1)
B = vstack((A_sub1, A_sub2))
But this gives me the summed up values just for the non-zero columns in the slice, so I can't combine it with the other slices because the number of columns in the summed slices are different.
I feel like there must be an easy way to do this. But I couldn't find any discussion of this online or in the documentation. What am I missing?
Thank you for your help
Note that you can do this by carefully constructing another matrix. Here's how it would work for a dense matrix:
>>> S = np.array([[1, 0, 0, 1, 0,], [0, 1, 1, 0, 1]])
>>> np.dot(S, A.toarray())
array([[5, 0, 0, 0, 0],
[0, 5, 5, 0, 0]])
>>>
The sparse version is only a little more complicated. The information about which rows should be summed together is encoded in row:
col = range(5)
row = [0, 1, 1, 0, 1]
dat = [1, 1, 1, 1, 1]
S = csr_matrix((dat, (row, col)), shape=(2, 5))
result = S * A  # for csr_matrix objects, * is the matrix product
# check that the result is another sparse matrix
print(type(result))
# check that the values are the ones we want
print(result.toarray())
Output:
<class 'scipy.sparse.csr.csr_matrix'>
[[5 0 0 0 0]
[0 5 5 0 0]]
You can handle more rows in your output by including higher values in row and extending the shape of S accordingly.
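As a sketch of how that generalizes, S can be built directly from an array that gives the output row (group) for each input row. The function name sum_rows_by_group and the groups argument are illustrative names rather than part of SciPy, and the group labels are assumed to be integers 0..k-1.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sum_rows_by_group(A, groups):
    # Hypothetical helper. groups[i] is the output row that input row i
    # is added to; labels are assumed to be integers 0..k-1.
    groups = np.asarray(groups)
    n_rows = A.shape[0]
    n_groups = groups.max() + 1
    # S has a single 1 per column: S[groups[i], i] = 1.
    S = csr_matrix((np.ones(n_rows, dtype=A.dtype),
                    (groups, np.arange(n_rows))),
                   shape=(n_groups, n_rows))
    return S @ A  # sparse matrix product, result stays sparse

row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))

# Rows 1 and 4 (zero-based 0 and 3) go to group 0, the rest to group 1.
B = sum_rows_by_group(A, [0, 1, 1, 0, 1])
print(B.toarray())
```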
The indexing should be zero-based:
idx1 = [0, 3] # rows 1 and 4
idx2 = [1, 2, 4] # rows 2,3 and 5
Then you need to keep A_sub1 and A_sub2 in sparse format and use axis=0:
A_sub1 = csr_matrix(A[idx1, :].sum(axis=0))
A_sub2 = csr_matrix(A[idx2, :].sum(axis=0))
B = vstack((A_sub1, A_sub2))
B.toarray()
array([[5, 0, 0, 0, 0],
[0, 5, 5, 0, 0]])
Note, I think the A[idx, :].sum(axis=0) operations convert the sparse matrices to dense ones, so @Mr_E's answer is probably better.
Alternatively, it works when you use axis=0 and np.vstack (as opposed to scipy.sparse.vstack):
A_sub1 = A[idx1, :].sum(axis=0)
A_sub2 = A[idx2, :].sum(axis=0)
np.vstack((A_sub1, A_sub2))
Giving:
matrix([[5, 0, 0, 0, 0],
[0, 5, 5, 0, 0]])
