How to label_binarize Multiclass in a specific class order - python

I have a list of ground truth labels:
yTrue = ['class2','classC','class3','class3','classA','classB','class2']
and a list of the possible classes (distinct, in custom order):
orderedClasses = ['classA','class2','classB','class3','classC']
I want to code the list in One-Vs-The-Rest for all possible classes.
Desired output:
[[0,1,0,0,0],[0,0,0,0,1],[0,0,0,1,0],[0,0,0,1,0],[1,0,0,0,0],[0,0,1,0,0],[0,1,0,0,0]]
I tried to use sklearn.preprocessing.label_binarize, but it doesn't maintain my custom order for the classes:
[[0,0,1,0,0],[0,0,0,0,1],[1,0,0,0,0],[1,0,0,0,0],[0,0,0,1,0],[0,1,0,0,0],[0,0,1,0,0]]
I'm looking for a Pythonic and efficient way to get the desired output.

Simply pass orderedClasses as the classes parameter (recent scikit-learn versions accept it only as a keyword argument):
In [15]: label_binarize(yTrue, classes=orderedClasses)
Out[15]:
array([[0, 1, 0, 0, 0],
[0, 0, 0, 0, 1],
[0, 0, 0, 1, 0],
[0, 0, 0, 1, 0],
[1, 0, 0, 0, 0],
[0, 0, 1, 0, 0],
[0, 1, 0, 0, 0]])
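If scikit-learn is not at hand, the same custom-order encoding can be sketched with plain NumPy broadcasting. This is only an illustration, not how label_binarize works internally; the variable names labels, classes and onehot are made up for the example.

```python
import numpy as np

yTrue = ['class2', 'classC', 'class3', 'class3', 'classA', 'classB', 'class2']
orderedClasses = ['classA', 'class2', 'classB', 'class3', 'classC']

labels = np.asarray(yTrue)            # shape (7,)
classes = np.asarray(orderedClasses)  # shape (5,)

# Compare every label against every class; column j corresponds to
# orderedClasses[j], so the custom order is preserved.
onehot = (labels[:, None] == classes[None, :]).astype(int)
print(onehot)
```

A label that does not appear in orderedClasses would simply produce an all-zero row here.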

Related

Generate random vector using different max across dimension in PyTorch

I have multiple one-hot encoded vectors per sample across batch - shape (3, 5, 10), where 3 is batch size, 5 is number of one-hot vectors and 10 is number of classes.
I want to randomly pick one one-hot encoded vector per sample, so that during training the picked vectors differ every epoch. But there are also some padding vectors that I had to add to be able to work with the data loader, and I don't want to pick them. Samples do not always have exactly 3 actual vectors; I just generated the example like that.
Example input (there are padding vectors at the end):
tensor([[[0, 1, 0, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 0, 1, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]],
[[1, 0, 0, 0, 0],
[0, 1, 0, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]],
[[1, 0, 0, 0, 0],
[0, 0, 0, 0, 1],
[0, 0, 0, 1, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]]])
Expected output:
tensor([[0, 0, 0, 1, 0],
[1, 0, 0, 0, 0],
[0, 0, 0, 0, 1]])
Can I achieve it using only PyTorch? I care about not moving data between devices.
I tried using random_indices = torch.randint(low=0, high=3, size=(3,)) and then applying it to my tensor, but I wasn't able to pass multiple high values.

remove empty dimension of numpy array

I have a numpy array of shape (X,Y,Z). I want to check each slice along the Z dimension and delete the all-zero slices really fast.
Detailed explanation:
I would like to check array[:,:,0] if any entry is non-zero.
If yes, ignore and check array[:,:,1].
Else if No, delete dimension array[:,:,0]
I'm also not 100% sure what you're after, but I think you want
np.squeeze(array, axis=2)
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.squeeze.html
I'm not certain what you want but this hopefully points in the right direction.
Edit 1st Jan:
Inspired by @J.Warren's use of np.squeeze, I think np.compress may be more appropriate.
This does the compression in one line:
np.compress((a!=0).sum(axis=(0,1)), a, axis=2)
To explain the first parameter in np.compress
(a!=0).sum(axis=(0, 1)) # sum across both the 0th and 1st axes.
Out[37]: array([1, 1, 0, 0, 2]) # Keep the slices where the array !=0
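As a minimal sketch of the same idea, the mask can also be built with any() instead of a sum, which yields a boolean array directly. The array below is hand-built to match the example in this answer; keep, trimmed and same are illustrative names.

```python
import numpy as np

# Mostly-zero array with nonzero entries only in z-slices 0, 1 and 4.
a = np.zeros((3, 4, 5), dtype=int)
a[0, 1, 4] = 1
a[1, 0, 1] = 1
a[2, 1, 4] = 1
a[2, 3, 0] = 1

# True for every z-slice that contains at least one nonzero entry.
keep = a.any(axis=(0, 1))

trimmed = a[..., keep]               # boolean indexing on the last axis
same = np.compress(keep, a, axis=2)  # equivalent result via np.compress

print(keep)
print(trimmed.shape)
```

Both forms return a new array; the original a is left unchanged.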
My first answer which may no longer be relevant
import numpy as np
a=np.random.randint(2, size=(3,4,5))*np.random.randint(2, size=(3,4,5))*np.random.randint(2, size=(3,4,5))
# Make a an array of mainly zeroes.
a
Out[31]:
array([[[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]],
[[0, 1, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]],
[[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1],
[0, 0, 0, 0, 0],
[1, 0, 0, 0, 0]]])
res=np.zeros(a.shape[2], dtype=bool)  # builtin bool; the np.bool alias is unavailable in some NumPy releases
for ix in range(a.shape[2]):
    res[ix] = (a[...,ix]!=0).any()
res
Out[34]: array([ True, True, False, False, True], dtype=bool)
# res is a boolean array of which slices of 'a' contain nonzero data
a[...,res]
# use this array to index a
# The output contains the nonzero slices
Out[35]:
array([[[0, 0, 0],
[0, 0, 1],
[0, 0, 0],
[0, 0, 0]],
[[0, 1, 0],
[0, 0, 0],
[0, 0, 0],
[0, 0, 0]],
[[0, 0, 0],
[0, 0, 1],
[0, 0, 0],
[1, 0, 0]]])

scipy.ndimage.label: include error margin

After reading an interesting topic on scipy.ndimage.label (Variable area threshold for identifying objects - python), I'd like to include an 'error margin' in the labelling.
In the above linked discussion:
How can the blue dot on top be included, too (let's say it is wrongly disconnected from the orange, biggest, object)?
I found the structure argument, which should be able to include that dot by changing the array from np.ones((3,3,3)) to something larger (I'd like it to be 3D). However, passing a larger array as structure does not seem to work, unfortunately. It either raises a dimension error (RuntimeError: structure and input must have equal rank) or it does not change anything.
Thanks!
this is the code:
labels, nshapes = ndimage.label(a, structure=np.ones((3,3,3)))
in which a is a 3D array.
Here's a possible approach that uses scipy.ndimage.binary_dilation. It is easier to see what is going on in a 2D example, but I'll show how to generalize to 3D at the end.
In [103]: a
Out[103]:
array([[0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 1, 0, 0],
[1, 1, 0, 0, 0, 1, 1],
[0, 0, 0, 0, 0, 1, 1],
[1, 1, 1, 0, 0, 0, 0]])
In [104]: from scipy.ndimage import label, binary_dilation
Extend each "shape" by one pixel down and to the right:
In [105]: b = binary_dilation(a, structure=np.array([[0, 0, 0], [0, 1, 1], [0, 1, 1]])).astype(int)
In [106]: b
Out[106]:
array([[0, 0, 0, 1, 1, 0, 0],
[0, 0, 0, 1, 1, 0, 0],
[1, 1, 1, 0, 1, 1, 0],
[1, 1, 1, 0, 1, 1, 1],
[1, 1, 1, 0, 0, 1, 1],
[1, 1, 1, 1, 0, 1, 1]])
Apply label to the padded array:
In [107]: labels, numlabels = label(b)
In [108]: numlabels
Out[108]: 2
In [109]: labels
Out[109]:
array([[0, 0, 0, 1, 1, 0, 0],
[0, 0, 0, 1, 1, 0, 0],
[2, 2, 2, 0, 1, 1, 0],
[2, 2, 2, 0, 1, 1, 1],
[2, 2, 2, 0, 0, 1, 1],
[2, 2, 2, 2, 0, 1, 1]], dtype=int32)
By multiplying a by labels, we get the desired array of labels of a:
In [110]: alab = labels*a
In [111]: alab
Out[111]:
array([[0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[2, 2, 0, 0, 1, 0, 0],
[2, 2, 0, 0, 0, 1, 1],
[0, 0, 0, 0, 0, 1, 1],
[2, 2, 2, 0, 0, 0, 0]])
(This assumes that the values in a are 0 or 1. If they are not, you can use alab = labels * (a > 0).)
For a 3D input, you have to change the structure argument to binary_dilation:
struct = np.zeros((3, 3, 3), dtype=int)
struct[1:, 1:, 1:] = 1
b = binary_dilation(a, structure=struct).astype(int)
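To see the 3D variant end to end, here is a small sketch on a hand-made volume. The voxel positions and the names n_plain, n_merged and alab are assumptions chosen for illustration; the dilate, label, mask sequence is the one described above.

```python
import numpy as np
from scipy.ndimage import label, binary_dilation

# Three isolated voxels: two separated by a one-voxel gap, one far away.
a = np.zeros((4, 4, 6), dtype=int)
a[0, 0, 0] = 1
a[0, 0, 2] = 1
a[3, 3, 5] = 1

# Without any margin, every voxel is its own object.
_, n_plain = label(a)

# Grow each object by one voxel towards increasing indices on all axes.
struct = np.zeros((3, 3, 3), dtype=int)
struct[1:, 1:, 1:] = 1
b = binary_dilation(a, structure=struct).astype(int)

# Label the grown volume, then mask back to the original voxels.
labels, n_merged = label(b)
alab = labels * a

print(n_plain, n_merged)
```

Because this structure only grows objects in one direction per axis, the margin is asymmetric; a symmetric structure such as np.ones((3, 3, 3)) would bridge gaps on either side.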

Initializing matrix in Python using "[[0]*x]*y" creates linked rows?

Initializing a matrix as so seems to link the rows so that when one row changes, they all change:
>>> grid = [[0]*5]*5
>>> grid
[[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]]
>>> grid[2][2] = 1
>>> grid
[[0, 0, 1, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 1, 0, 0]]
How can I avoid this?
grid = [[0]*5 for i in range(5)]
Note: [0]*5 repeats the same int five times, which is harmless because ints are immutable. [[0]*5]*5 repeats a reference to the same inner list five times, so every row is the same list object in memory. The list comprehension builds a new inner list on each iteration.
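The difference can be checked directly with identity tests. This is a small sketch; the names linked and independent are just for illustration.

```python
rows, cols = 5, 5

# Outer multiplication repeats a reference to ONE inner list.
linked = [[0] * cols] * rows

# The comprehension evaluates [0] * cols anew for each row.
independent = [[0] * cols for _ in range(rows)]

linked[2][2] = 1
independent[2][2] = 1

print(linked[0] is linked[1])            # same object
print(independent[0] is independent[1])  # distinct objects
print(linked)
print(independent)
```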

Selecting specific column in each row from array

I am trying to select specific column elements for each row of a numpy array. For example, in the following example:
In [1]: a = np.random.random((3,2))
Out[1]:
array([[ 0.75670668, 0.1283942 ],
[ 0.51326555, 0.59378083],
[ 0.03219789, 0.53612603]])
I would like to select the first element of the first row, the second element of the second row, and the first element of the third row. So I tried to do the following:
In [2]: b = np.array([0,1,0])
In [3]: a[:,b]
But this produces the following output:
Out[3]:
array([[ 0.75670668, 0.1283942 , 0.75670668],
[ 0.51326555, 0.59378083, 0.51326555],
[ 0.03219789, 0.53612603, 0.03219789]])
which clearly is not what I am looking for. Is there an easy way to do what I would like to do without using loops?
You can use:
a[np.arange(3), (0,1,0)]
in your example above.
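A slightly more general sketch that does not hard-code the number of rows is shown here. The values are rounded stand-ins for the random array in the question, and np.take_along_axis is added only as an alternative spelling that the original answer does not mention.

```python
import numpy as np

a = np.array([[0.75, 0.12],
              [0.51, 0.59],
              [0.03, 0.53]])
b = np.array([0, 1, 0])  # column to pick in each row

# Pair row index i with column index b[i].
picked = a[np.arange(a.shape[0]), b]

# Same result: take_along_axis needs b with the same number of dims as a.
picked_alt = np.take_along_axis(a, b[:, None], axis=1)[:, 0]

print(picked)
```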
OK, just to clarify here, let's do a simple example:
A = np.diag(np.arange(0, 10, 1))
gives
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 2, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 3, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 4, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 5, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 6, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 7, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 8, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 9]])
then
A[0][0:4]
gives
array([0, 0, 0, 0])
that is first row, elements 0 to 3. But
A[0:4][1]
doesn't give the first 4 rows, the 2nd element in each. Instead we get
array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0])
i.e. the entire 2nd row of the first four rows (for this diagonal matrix it happens to look the same as the 2nd column).
A[0:4,1]
gives
array([0, 1, 0, 0])
I'm sure there is a very good reason for this and which makes perfect sense to programmers
but for those of us uninitiated in that great religion it can be quite confusing.
This isn't an answer so much as an attempt to document this a bit. For the answer above, we would have:
>>> import numpy as np
>>> A = np.array(range(6))
>>> A
array([0, 1, 2, 3, 4, 5])
>>> A.shape = (3,2)
>>> A
array([[0, 1],
[2, 3],
[4, 5]])
>>> A[(0,1,2),(0,1,0)]
array([0, 3, 4])
Specifying a list (or tuple) of individual row and column coordinates allows fancy indexing of the array. The first example in the comment looks similar at first, but the indices are slices. They don't extend over the whole range, and the shape of the array that is returned is different:
>>> A[0:2,0:2]
array([[0, 1],
[2, 3]])
For the second example in the comment
>>> A[[0,1],[0,1]]
array([0, 3])
So it seems that slices are different, but except for that, regardless of how indices are constructed, you can specify a tuple or list of (x-values, y-values), and recover those specific elements from the array.
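The cases discussed in this thread can be put side by side in one short sketch. np.ix_ is not used in the answers above; it is included here as an assumption about what a reader may want when a rectangular block, rather than paired coordinates, is the goal.

```python
import numpy as np

A = np.arange(6).reshape(3, 2)   # [[0, 1], [2, 3], [4, 5]]

block = A[0:2, 0:2]              # slices: a 2x2 sub-array
pairs = A[[0, 1], [0, 1]]        # index lists: elements (0,0) and (1,1)
grid = A[np.ix_([0, 1], [0, 1])] # index lists combined as an outer product

chained = A[0:2][1]              # first slice rows, then take row 1 of that
column = A[0:2, 1]               # rows 0..1, column 1 in a single step

print(block, pairs, grid, chained, column, sep="\n")
```

The chained form applies the second index to the result of the first, which is why it returns a row rather than a column.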
