I have a numpy 2D array with values that range from 0 to 59.
For those who are familiar with DL, and specifically image segmentation: I create the array (call it L) from a .png image, and the value of each pixel L[x,y] is the class that pixel belongs to (out of the 60 classes).
I want to create a 1-hot tensor - Lhot, in which (Lhot[x,y,z] == 1) only if (L[x,y] == z), and 0 otherwise.
I want to create it with some kind of broadcasting/indexing (1-2 lines), without loops.
It should be functionally equivalent to this piece of code (Dtype corresponds to L's dtype):
Lhot = np.zeros((L.shape[0], L.shape[1], 60), dtype=Dtype)
for i in range(L.shape[0]):
    for j in range(L.shape[1]):
        Lhot[i,j,L[i,j]] = 1
Does anyone have an idea?
Thanks!
A much faster and cleaner way using pure numpy:
Lhot = np.eye(60, dtype=Dtype)[L]  # indexing the identity matrix gives shape (L.shape[0], L.shape[1], 60) directly; no transpose needed
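As a quick sanity check, indexing a small identity matrix with a toy label map (made-up shape and class count) reproduces the loop version's layout:

L_small = np.array([[0, 2], [1, 0]])              # toy 2x2 label map with 3 classes
Lhot_small = np.eye(3, dtype=np.uint8)[L_small]   # shape (2, 2, 3)
assert Lhot_small[0, 1, 2] == 1 and Lhot_small[0, 1].sum() == 1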
The problem you'll run into with multidimensional one-hots is that they get really big and really sparse, and there's no good way to handle sparse arrays with more than 2 dimensions in numpy/scipy (or sklearn or many other ML packages either, I think). Do you really need an n-d one-hot?
Since typical one-hot encoding is defined for 1D vectors, all you have to do is flatten your matrix, use the one-hot encoder from scikit-learn (or any other library with one-hot encoding), and reshape back.
from sklearn.preprocessing import OneHotEncoder
n, m = L.shape
k = 60
# note: n_values was removed in newer scikit-learn versions; there, use OneHotEncoder(categories=[np.arange(k)]) instead
Lhot = np.array(OneHotEncoder(n_values=k).fit_transform(L.reshape(-1,1)).todense()).reshape(n, m, k)
Of course, you can do it by hand too:
n, m = L.shape
k = 60
Lhot = np.zeros((n*m, k)) # flat array of zeros
Lhot[np.arange(n*m), L.flatten()] = 1 # one-hot encoding for 1D
Lhot = Lhot.reshape(n, m, k) # reshaping back to 3D tensor
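As a quick sanity check (small random L, assuming numpy is imported as np), this matches the loop version from the question:

L_small = np.random.randint(0, 60, size=(4, 5))
ref = np.zeros((4, 5, 60))
for i in range(4):
    for j in range(5):
        ref[i, j, L_small[i, j]] = 1
flat = np.zeros((4 * 5, 60))
flat[np.arange(4 * 5), L_small.flatten()] = 1
assert np.array_equal(flat.reshape(4, 5, 60), ref)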
I recently started to use numpy memmap to link an array in my project, since I have a 3-dimensional tensor with a total of 133 billion values for a graph of the dataset I am using as an example.
I am trying to calculate the heat kernel signature of a graph with 5748 nodes (the 21st graph of the DD dataset). My code to calculate the projectors (where I use memmap) is:
Path('D:/hks_temp').mkdir(parents=True, exist_ok=True)
for l, ll in enumerate(L):
    pl = np.zeros((n, n))
    for k in ll:
        pl += np.outer(evecs[:, k], evecs[:, k])
    fp = np.memmap('D:/hks_temp/{}_hks.npy'.format(l), dtype='float32', mode='w+', shape=(n, n))
    fp[:] = pl[:]
    fp.flush()
Inside each of the X_hks.npy files there is an n by n ndarray (in this example, 5748 by 5748).
Then I want all these computed arrays to form the 3-dimensional tensor, so I "link" them (I don't know if that's the right term) in this way:
P = np.array([None] * len(L)) # len(L) = 4043
for l in range(len(L)):
    P[l] = np.memmap('D:/hks_temp/{}_hks.npy'.format(l), dtype='float32', mode='r', shape=(n, n))
P is used later only inside a loop, to compute H = np.einsum('ijk,i->jk', P, np.exp(-unique_eval * t)).
However, that raises an error: ValueError: einstein sum subscripts string contains too many subscripts for operand 0. Since the method is correct for smaller graphs that don't require memmap, my thought was that P isn't well structured for numpy and I must rearrange the data, maybe with a reshape. So I tried P.reshape(len(L), n, n), but it doesn't work, giving ValueError: cannot reshape array of size 4043 into shape (4043,5748,5748). How can I make it work?
I already found this question but it doesn't fit this case. I think I can't store everything inside one big object, since the memmap files already add up to 497 GB (126 MB each). If I can do it, please tell me.
If it is impossible, I will reduce the use case; however, I am quite interested in making it work for all the possibilities.
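One thing that might be worth trying (a sketch only, not tested at this scale, and the combined file would still be roughly 497 GB on disk): write all the slices into a single memmap of shape (len(L), n, n), so that einsum sees one proper 3D array instead of an object array of separate memmaps. The file name hks_all.npy is a placeholder.

P = np.memmap('D:/hks_temp/hks_all.npy', dtype='float32', mode='w+', shape=(len(L), n, n))
for l, ll in enumerate(L):
    pl = np.zeros((n, n), dtype='float32')
    for k in ll:
        pl += np.outer(evecs[:, k], evecs[:, k])
    P[l] = pl            # write this projector as slice l of the big memmap
P.flush()
# later, reopen read-only and use it directly:
P = np.memmap('D:/hks_temp/hks_all.npy', dtype='float32', mode='r', shape=(len(L), n, n))
H = np.einsum('ijk,i->jk', P, np.exp(-unique_eval * t))

Whether einsum streams through such a large memmap efficiently is a separate question, but it should at least remove the error caused by the object array.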
I'm quite new to programming in general, and I have not been able to figure this problem out so far.
I've got a two-dimensional numpy array mask; let's say mask.shape is (3800, 3500). It is filled with 0s and 1s representing the pixels of a 2D image, where a 1 represents a visible pixel and a 0 represents background.
I've got a second two-dimensional array data where data.shape is (909, x) and x is exactly the number of 1s in the first array. I now want to replace each 1 in the first array with a vector of length 909 from the second array, resulting in a final 3D array of shape (3800, 3500, 909): basically a 2D x-by-y image where selected pixels have a spectrum of 909 values in the z direction.
I tried
mask_vector = mask.flatten()
ones = np.ones((909,1))
mask_909 = mask_vector.dot(ones) #results in a 13300000 by 909 2d array
count = 0
for i in mask_vector:
    if i == 1:
        mask_909[i,:] = data[:,count]
        count += 1
result = mask_909.reshape((3800,3500,909))
This results in a viable 3D array giving a 2D picture when doing plt.imshow(result.mean(axis=2))
But the values are still only 1s and 0s, not the wanted spectral data in the z direction.
I also tried using np.where but broadcasting fails as the two 2D arrays have clearly different shapes.
Has anybody got a solution? I am sure that there must be an easy way...
Basically, you simply need to use np.where to locate the 1s in your mask array. Then initialize your result array to zero and replace the third dimension with your data using the outputs of np.where:
import numpy as np
m, n, k = 380, 350, 91
mask = np.round(np.random.rand(m, n))
x = np.sum(mask == 1)
data = np.random.rand(k, x)
result = np.zeros((m, n, k))
row, col = np.where(mask == 1)
result[row,col] = data.transpose()
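A quick check that the assignment lines up as intended (on the small example above): each masked pixel should now hold its corresponding length-k spectrum.

assert np.allclose(result[row[0], col[0]], data[:, 0])
assert np.allclose(result[row[-1], col[-1]], data[:, -1])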
Suppose I have a NumPy array with shape (50, 10000, 10000) containing 1000 distinct "clusters". For example, there would be a small volume somewhere with just 1s, another small volume with 2s, etc. I would like to iterate through each cluster to create a mask, like so:
for i in np.unique(arr)[1:]:
    mask = arr == i
    # do other stuff with mask
Creating each mask takes about 15 seconds, and iterating through 1000 clusters would take more than 4 hours. Is there a possible way to speed up the code or is this the best there is since there is no avoiding iterating through each element of the array?
EDIT: the dtype of the array is uint16
I'm assuming arr is sparse:
you say the clusters are small, and 1000 clusters isn't going to tile an array that big
you iterate over np.unique(arr)[1:], so I assume the first unique value is 0
In this case I would recommend leveraging a scipy.sparse.csr_matrix
from scipy.sparse import csr_matrix
sp_arr = csr_matrix(arr.reshape(1,-1))
This turns your big dense array into a one-row compressed sparse row array. Since sparse arrays don't like more than 2 dimensions, this tricks it into using ravelled indices. Now sp_arr has data (the cluster labels), indices (the ravelled indices), and indptr (which is trivial here since we only have one row). So,
for i in np.unique(sp_arr.data): # as a bonus this `unique` call should be faster too
    x, y, z = np.unravel_index(sp_arr.indices[sp_arr.data == i], arr.shape)
This should, much more efficiently, give coordinates equivalent to those from
for i in np.unique(arr)[1:]:
    x, y, z = np.nonzero(arr == i)
where x, y, z are the indices of the True values in mask. From there you can either reconstruct mask or work off the indices (recommended).
You could also do this purely with numpy, and still have a boolean mask at the end, but it is a bit less memory efficient:
all_mask = arr != 0 # points assigned to any cluster
data = arr[all_mask] # all cluster labels
for i in np.unique(data):
    mask = all_mask.copy()
    mask[mask] = data == i  # now mask is the same as before
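On a tiny toy array (sizes made up purely for illustration, reusing the csr_matrix import from above), the sparse approach can be checked against the dense one:

arr_small = np.zeros((2, 4, 4), dtype=np.uint16)
arr_small[0, 1:3, 1:3] = 1    # one small cluster
arr_small[1, 0:2, 2:4] = 2    # another small cluster
sp_small = csr_matrix(arr_small.reshape(1, -1))
for i in np.unique(sp_small.data):
    x, y, z = np.unravel_index(sp_small.indices[sp_small.data == i], arr_small.shape)
    assert np.array_equal(np.stack([x, y, z]), np.stack(np.nonzero(arr_small == i)))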
I am trying to vectorize an operation using numpy. I use it in a Python script that I have profiled, and I found this operation to be the bottleneck, so it needs to be optimized since I will run it many times.
The operation is on a data set with two parts. First, a large set (of size n) of 1D vectors of different lengths (with maximum length Lmax), whose elements are integers from 1 to maxvalue. The set of vectors is arranged in a 2D array, data, of size (num_samples, Lmax), with the trailing elements of each row zeroed. The second part is a set of scalar floats, one associated with each vector, which I have computed and which depend on the vector's length and the integer value at each position. The set of scalars is made into a 1D array, Y, of size num_samples.
The desired operation is to form the average of Y over the n samples, as a function of (value,position along length,length).
This entire operation can be vectorized in MATLAB using the accumarray function, by building three 2D arrays of the same size as data whose elements are the corresponding value, position, and length indices of the desired final array:
sz_Y = num_samples;
sz_len = Lmax
sz_pos = Lmax
sz_val = maxvalue
ind_len = repmat( 1:sz_len ,1 ,sz_samples);
ind_pos = repmat( 1:sz_pos ,sz_samples,1 );
ind_val = data
ind_Y = repmat((1:sz_Y)',1 ,Lmax );
copiedY=Y(ind_Y);
mask = data>0;
finalarr=accumarray({ind_val(mask),ind_pos(mask),ind_len(mask)},copiedY(mask), [sz_val sz_pos sz_len])/sz_val;
I was hoping to emulate this implementation with np.bincount. However, np.bincount differs from accumarray in two relevant ways:
both arguments must be of the same 1D size, and
there is no option to choose the shape of the output array.
In the above usage of accumarray, the list of indices, {ind_val(mask),ind_pos(mask),ind_len(mask)}, is a 1x3 cell array of index vectors used as subscripts, while in np.bincount the indices must be a single 1D array as far as I understand. I expect np.ravel may be useful but am not sure how to use it here to do what I want. I am coming to Python from MATLAB and some things do not translate directly, e.g. the colon operator, which ravels in the opposite order to ravel. So my question is: how might I use np.bincount, or any other numpy method, to achieve an efficient Python implementation of this operation?
EDIT: To avoid wasting time: for these multi-dimensional index problems with complicated index manipulation, is the recommended route just to use Cython to implement the loops explicitly?
EDIT2: An alternative Python implementation I just came up with.
Here is a heavy-RAM solution:
First precalculate:
Using index units for length (i.e., length 1 = 0), make a 4D bool array of size (num_samples, Lmax+1, Lmax+1, maxvalue+1), holding where the conditions are satisfied for each value in Y.
ALLcond = np.zeros((num_samples, Lmax+1, Lmax+1, maxvalue+1), dtype='bool')
for l in range(Lmax+1):
    for i in range(Lmax+1):
        for v in range(maxvalue+1):
            ALLcond[:, l, i, v] = (data[:, i] == v) & (Lvec == l)
Where Lvec=[len(row) for row in data]. Then get the indices for these using np.where and initialize a 4D float array into which you will assign the values of Y:
[ind_Y, ind_len, ind_pos, ind_val] = np.where(ALLcond)
Yval = np.zeros(np.shape(ALLcond), dtype='float')
Now in the loop in which I have to perform the operation, I compute it with the two lines:
Yval[ind_Y, ind_len, ind_pos, ind_val] = Y[ind_Y]
Y_avg = Yval.sum(axis=0) / num_samples
This gives a factor of 4 or so speed up over the direct loop implementation. I was expecting more. Perhaps, this is a more tangible implementation for Python heads to digest. Any faster suggestions are welcome :)
One way is to convert the 3 "indices" to a linear index and then apply bincount. Numpy's ravel_multi_index is essentially the same as MATLAB's sub2ind. So the ported code could be something like:
shape = (Lmax+1, Lmax+1, maxvalue+1)
posvec = np.arange(1, Lmax+1)
ind_len = np.tile(Lvec[:,None], [1, Lmax])
ind_pos = np.tile(posvec, [n, 1])
ind_val = data
Y_copied = np.tile(Y[:,None], [1, Lmax])
mask = posvec <= Lvec[:,None] # fill-value independent
lin_idx = np.ravel_multi_index((ind_len[mask], ind_pos[mask], ind_val[mask]), shape)
Y_avg = np.bincount(lin_idx, weights=Y_copied[mask], minlength=np.prod(shape)) / n
Y_avg.shape = shape
This is assuming data has shape (n, Lmax), Lvec is a Numpy array, etc. You may need to adapt the code a little to get rid of off-by-one errors.
One could argue that the tile operations are not very efficient and not very "numpythonic". Something with broadcast_arrays could be nice, but I think I prefer this way:
shape = (Lmax+1, Lmax+1, maxvalue+1)
posvec = np.arange(1, Lmax+1)
mask = posvec <= Lvec[:,None] # fill-value independent; defined before it is used below
len_idx = np.repeat(Lvec, Lvec)
pos_idx = np.broadcast_to(posvec, data.shape)[mask]
val_idx = data[mask]
Y_copied = np.repeat(Y, Lvec)
lin_idx = np.ravel_multi_index((len_idx, pos_idx, val_idx), shape)
Y_avg = np.bincount(lin_idx, weights=Y_copied, minlength=np.prod(shape)) / n
Y_avg.shape = shape
Note broadcast_to was added in Numpy 1.10.0.
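A small toy setup (made-up sizes and random data, assuming numpy is imported as np) can help check that the pieces fit together; Lvec plays the role of the per-sample lengths:

n, Lmax, maxvalue = 4, 3, 5
Lvec = np.array([3, 2, 3, 1])               # true length of each of the n samples
data = np.zeros((n, Lmax), dtype=int)
for r, ln in enumerate(Lvec):
    data[r, :ln] = np.random.randint(1, maxvalue + 1, size=ln)   # values 1..maxvalue, trailing zeros
Y = np.random.rand(n)

shape = (Lmax + 1, Lmax + 1, maxvalue + 1)
posvec = np.arange(1, Lmax + 1)
mask = posvec <= Lvec[:, None]
len_idx = np.repeat(Lvec, Lvec)
pos_idx = np.broadcast_to(posvec, data.shape)[mask]
val_idx = data[mask]
Y_copied = np.repeat(Y, Lvec)
lin_idx = np.ravel_multi_index((len_idx, pos_idx, val_idx), shape)
Y_avg = np.bincount(lin_idx, weights=Y_copied, minlength=np.prod(shape)) / n
Y_avg.shape = shape    # indexed as Y_avg[length, position, value]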
In NumPy, is there an easy way to broadcast two arrays of dimensions e.g. (x,y) and (x,y,z)? NumPy broadcasting typically matches dimensions from the last dimension, so usual broadcasting will not work (it would require the first array to have dimension (y,z)).
Background: I'm working with images, some of which are RGB (shape (h,w,3)) and some of which are grayscale (shape (h,w)). I generate alpha masks of shape (h,w), and I want to apply the mask to the image via mask * im. This doesn't work because of the above-mentioned problem, so I end up having to do e.g.
mask = mask.reshape(mask.shape + (1,) * (len(im.shape) - len(mask.shape)))
which is ugly. Other parts of the code do operations with vectors and matrices, which also run into the same issue: it fails trying to execute m + v where m has shape (x,y) and v has shape (x,). It's possible to use e.g. atleast_3d, but then I have to remember how many dimensions I actually wanted.
How about using transpose:
(a.T + c.T).T
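For example, with illustrative shapes matching the question's (x,y) and (x,y,z) case:

a = np.random.rand(4, 5)       # shape (x, y)
c = np.random.rand(4, 5, 3)    # shape (x, y, z)
out = (a.T + c.T).T            # a.T is (5, 4), c.T is (3, 5, 4); they broadcast from the trailing dims
print(out.shape)               # (4, 5, 3)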
numpy functions often have blocks of code that check dimensions, reshape arrays into compatible shapes, all before getting down to the core business of adding or multiplying. They may reshape the output to match the inputs. So there is nothing wrong with rolling your own that do similar manipulations.
Don't dismiss offhand the idea of rotating the variable third dimension to the start of the dimensions. Doing so takes advantage of the fact that numpy automatically adds dimensions at the start.
For element by element multiplication, einsum is quite powerful.
np.einsum('ij...,ij...->ij...',im,mask)
will handle cases where im and mask are any mix of 2 or 3 dimensions (assuming the first 2 are always compatible). Unfortunately this does not generalize to addition or other operations.
A while back I simulated einsum with a pure Python version. For that I used np.lib.stride_tricks.as_strided and np.nditer. Look into those functions if you want more power in mixing and matching dimensions.
As another angle: if you encounter this pattern frequently, it may be useful to create a utility function to enforce right-broadcasting:
def right_broadcasting(arr, target):
    return arr.reshape(arr.shape + (1,) * (target.ndim - arr.ndim))
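Usage for the masking case from the question might then look like this (names and shapes are illustrative):

im = np.random.rand(6, 7, 3)                  # RGB image
mask = np.random.rand(6, 7)                   # alpha mask
masked = right_broadcasting(mask, im) * im    # mask is reshaped to (6, 7, 1) and broadcasts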
Although if there are only two types of input (already having 3 dims or having only 2), I'd say the single if statement is preferable.
Indexing with np.newaxis creates a new axis in that place, i.e.
xyz = np.random.rand(4, 5, 6)  # some 3D array (illustrative shape)
xy = np.random.rand(4, 5)      # some 2D array sharing the leading dimensions
xyz_sum = xyz + xy[:,:,np.newaxis]
or
xyz_sum = xyz + xy[:,:,None]
Indexing in this way creates an axis of length 1 at that location; when broadcasting, numpy effectively gives it stride 0.
Why not just decorate-process-undecorate:
def flipflop(func):
    def wrapper(a, mask):
        if len(a.shape) == 3:
            mask = mask[..., None]
        b = func(a, mask)
        return np.squeeze(b)
    return wrapper

@flipflop
def f(x, mask):
    return x * mask
Then
>>> N = 12
>>> gs = np.random.random((N, N))
>>> rgb = np.random.random((N, N, 3))
>>>
>>> mask = np.ones((N, N))
>>>
>>> f(gs, mask).shape
(12, 12)
>>> f(rgb, mask).shape
(12, 12, 3)
Easy, you just add a singleton dimension at the end of the smaller array. For example, if xyz_array has shape (x,y,z) and xy_array has shape (x,y), you can do
xyz_array + np.expand_dims(xy_array, xy_array.ndim)
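For instance, with illustrative shapes:

xy_array = np.random.rand(4, 5)           # e.g. an alpha mask
xyz_array = np.random.rand(4, 5, 3)       # e.g. an RGB image
out = xyz_array + np.expand_dims(xy_array, xy_array.ndim)   # xy_array becomes (4, 5, 1)
print(out.shape)                          # (4, 5, 3)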