Numpy using multidimensional array to index a 1D array - python

I do not understand the following code, specifically its last line:
import numpy as np
from numpy.linalg import norm

# rel_coords is the asker's (N, d) array of coordinates
max = np.max(rel_coords, axis=0)  # per-axis maxima (note: shadows the builtin max)
min = np.min(rel_coords, axis=0)  # per-axis minima (shadows the builtin min)
bins = [np.arange(low, high) for low, high in zip(min, max)]
new_coord = np.array(np.meshgrid(*bins)).T                  # all grid points in the bounding box
coord_norms = norm(new_coord, axis=-1).round().astype(int)  # integer distance of each point from the origin
bin_count = np.bincount(coord_norms.flatten())              # how many grid points share each distance
new_count = bin_count[coord_norms]                          # look up that count for every grid point
Can someone explain how I can index a 1-D array (bin_count) using a 2-D array (coord_norms)?
I do understand numpy broadcasting and advanced indexing, but I would like to understand what's going on behind the scenes in this case. Does bin_count first get broadcast to the same shape as coord_norms?
How does Python assign the values in new_count?
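For anyone who wants to experiment, here is a minimal sketch with made-up values. Indexing a 1-D array with a 2-D integer array does not broadcast bin_count; each element of the index array is simply replaced by the value it points to, so the result has the shape of the index array:
import numpy as np

bin_count = np.array([10, 20, 30, 40])   # 1-D lookup table
coord_norms = np.array([[0, 1],
                        [2, 3]])         # 2-D integer index array

new_count = bin_count[coord_norms]
print(new_count)
# [[10 20]
#  [30 40]]   <- same shape as coord_norms; bin_count is not broadcast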

Related

bootstrap numpy 2D array

I am trying to sample with replacement, by rows, from a base 2D numpy array of shape (4,2), say 10 times. The final output should be a 3D numpy array.
I have tried the code below and it works, but is there a way to do it without the for loop?
import numpy as np

base = np.array([[20, 30], [50, 60], [70, 80], [10, 30]])
print(np.shape(base))
nsample = 10
tmp = np.zeros((np.shape(base)[0], np.shape(base)[1], nsample))
for i in range(nsample):
    id_pick = np.random.choice(np.shape(base)[0], size=np.shape(base)[0])
    print(id_pick)
    boot1 = base[id_pick, :]
    tmp[:, :, i] = boot1
print(tmp)
Here's one vectorized approach -
m,n = base.shape
idx = np.random.randint(0,m,(m,nsample))
out = base[idx].swapaxes(1,2)
The basic idea is that we generate all the required indices in one go with np.random.randint as idx, which is an array of shape (m, nsample). We use this array to index into the input array along the first axis, so it selects random rows of base. To get the final output with shape (m, n, nsample), we swap the last two axes.
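As a quick shape check, here is a sketch using the base and nsample from the question:
import numpy as np

base = np.array([[20, 30], [50, 60], [70, 80], [10, 30]])
nsample = 10

m, n = base.shape
idx = np.random.randint(0, m, (m, nsample))   # (m, nsample) random row indices
out = base[idx].swapaxes(1, 2)                # (m, nsample, n) -> (m, n, nsample)
print(out.shape)  # (4, 2, 10)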
You can use the stack function from numpy. Your code would then look like:
base = np.array([[20, 30], [50, 60], [70, 80], [10, 30]])
print(np.shape(base))
nsample = 10
tmp = []
for i in range(nsample):
    id_pick = np.random.choice(np.shape(base)[0], size=np.shape(base)[0])
    print(id_pick)
    boot1 = base[id_pick, :]
    tmp.append(boot1)
tmp = np.stack(tmp, axis=-1)
print(tmp)
Based on @Divakar's answer, if you already know the shape of this 2D array, you can treat it as a flat (8,) 1D array while bootstrapping, and then reshape it:
m, n = base.shape
flatbase = np.reshape(base, (m * n,))
# numReps is the number of bootstrap replicates (nsample in the question)
idxs = np.random.choice(range(m * n), (numReps, m * n))
bootflats = flatbase[idxs]
boots = np.reshape(bootflats, (numReps, m, n))

How can I create a masked array with columns filtered out by column sum in numpy?

I'm using python and numpy. I have an array with a number of columns, which I want to mask (not remove, in order to preserve indices) depending on whether the column sum is below a threshold. Here is what I have:
x_frequencies = np.sum(X, axis=0)
cutoff = np.percentile(x_frequencies, q=99)
mask = np.sum(X, axis=0) < cutoff
print(X.shape)
print(mask.shape)
print(mask[0].shape)
X_filtered = X[:,mask] # Error here
and the output for this is
(22987, 29308)
(1, 29308)
(1, 29308)
# Stacktrace
IndexError: invalid index shape
So I have two questions: firstly, how can I do what I'm intending to do; and secondly, how can I get a 1d array out of mask (i.e. one with shape (29308,))? I've tried reshape and flatten, and neither of them changes the shape.
Edit: X is a scipy.sparse.csr.csr_matrix
SOLVED: Had to convert mask from a matrix into an array:
mask = np.array(np.sum(X, axis=0) < cutoff).flatten()
Thank you to Divakar for asking about the type of X; I'm new to scipy/numpy and didn't know the difference between the matrix and array types.
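For reference, here is a minimal sketch (with made-up shapes) of why this happens: reductions over a scipy.sparse matrix return an np.matrix, which is always 2-D, so reshape and flatten cannot make it 1-D; converting to a plain ndarray first does.
import numpy as np
from scipy import sparse

X = sparse.random(5, 8, density=0.5, format='csr')
col_sums = X.sum(axis=0)                 # np.matrix of shape (1, 8), not 1-D
mask = col_sums < col_sums.mean()        # still an np.matrix, shape (1, 8)
print(type(mask), mask.shape)

mask = np.asarray(mask).flatten()        # plain 1-D boolean array, shape (8,)
X_filtered = X[:, mask]                  # column selection now works
print(X_filtered.shape)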

Delete 2d subarray from 3d array in numpy

In numpy I have a 3D array from which I would like to remove some of the 2D subarrays. Think about it like this:
import numpy as np

arr = np.reshape(range(27), (3, 3, 3))
rm_idx = [[0, 1, 2], [0, 0, 2]]  # renamed: 'del' is a Python keyword
flatSeam = np.ravel_multi_index(rm_idx, arr.shape)  # fails: 2 index rows vs. a 3-D shape
arr = np.delete(arr, flatSeam)
So at the end I would like to have an array of shape (3,2,3), without the elements at (0,0), (1,0) and (2,2) of the original array. My problem is that I cannot use ravel_multi_index for this, because my indices are 2D while the array shape is 3D, so the wrong indices are calculated (the code above also does not execute, because the indices array and the shape have to be the same size).
Do you have any ideas how I can achieve this?
Here's an approach using advanced-indexing -
# arr: Input array, rm_idx : 2-row list/array of indices to be removed
m,n,p = arr.shape
mask = np.asarray(rm_idx[1])[:,None] != np.arange(n)
out = arr[np.arange(m)[:,None],np.where(mask)[1].reshape(m,-1)]
Alternatively, with boolean-indexing -
out = arr.reshape(-1,p)[mask.ravel()].reshape(m,-1,p)
Here's a less memory-intensive approach, as we avoid creating the 2D mask -
vmask = ~np.in1d(np.arange(m*n),rm_idx[1] + n*np.arange(m))
out = arr.reshape(-1,p)[vmask].reshape(m,-1,p)
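A quick check of the boolean-indexing version on the example from the question (this assumes, as above, that rm_idx holds the (sub-array, row) pairs to remove):
import numpy as np

arr = np.arange(27).reshape(3, 3, 3)
rm_idx = [[0, 1, 2], [0, 0, 2]]   # drop row 0 of sub-array 0, row 0 of 1, row 2 of 2
m, n, p = arr.shape

mask = np.asarray(rm_idx[1])[:, None] != np.arange(n)
out = arr.reshape(-1, p)[mask.ravel()].reshape(m, -1, p)
print(out.shape)  # (3, 2, 3)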

np.bincount for 1 line, vectorized multidimensional averaging

I am trying to vectorize an operation using numpy. I use it in a python script that I have profiled, and this operation turned out to be the bottleneck, so it needs to be optimized, since I will run it many times.
The operation works on a data set with two parts. First, a large set of n 1D vectors of different lengths (with maximum length Lmax), whose elements are integers from 1 to maxvalue. The set of vectors is arranged in a 2D array, data, of size (num_samples, Lmax), with the trailing elements in each row zeroed. The second part is a set of scalar floats, one associated with each vector, that I have computed and which depend on the vector's length and the integer value at each position. The set of scalars is made into a 1D array, Y, of size num_samples.
The desired operation is to form the average of Y over the n samples, as a function of (value,position along length,length).
This entire operation can be vectorized in MATLAB with the accumarray function, using three 2D arrays of the same size as data, whose elements are the corresponding value, position, and length indices of the desired final array:
sz_Y   = num_samples;
sz_len = Lmax;
sz_pos = Lmax;
sz_val = maxvalue;
ind_len = repmat( 1:sz_len, 1, sz_samples );
ind_pos = repmat( 1:sz_pos, sz_samples, 1 );
ind_val = data;
ind_Y   = repmat( (1:sz_Y)', 1, Lmax );
copiedY = Y(ind_Y);
mask = data > 0;
finalarr = accumarray( {ind_val(mask), ind_pos(mask), ind_len(mask)}, copiedY(mask), [sz_val sz_pos sz_len] ) / sz_val;
I was hoping to emulate this implementation with np.bincount. However, np.bincount differs from accumarray in two relevant ways:
both arguments must be of same 1D size, and
there is no option to choose the shape of the output array.
In the above usage of accumarray, the list of indices, {ind_val(mask),ind_pos(mask),ind_len(mask)}, is a cell array of index vectors used as index tuples, while np.bincount takes a single 1D array of scalar indices, as far as I understand. I expect np.ravel may be useful, but I am not sure how to use it here to do what I want. I am coming to python from matlab, and some things do not translate directly; e.g. MATLAB's colon operator flattens in column-major order, the opposite of numpy's default row-major ravel. So my question is how I might use np.bincount, or any other numpy method, to achieve an efficient python implementation of this operation.
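As an aside on the flattening order: numpy can reproduce MATLAB's column-major behaviour directly via the order argument, for example:
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
print(a.ravel())             # [1 2 3 4] -- row-major (C order), numpy's default
print(a.ravel(order='F'))    # [1 3 2 4] -- column-major, like MATLAB's a(:)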
EDIT: To avoid wasting anyone's time: for these multi-dimensional index problems with complicated index manipulation, is the recommended route to just use cython to implement the loops explicitly?
EDIT2: Alternative Python implementation I just came up with.
Here is a heavy-RAM solution:
First, precalculate:
Using index units for length (i.e., length 1 = 0), make a 4D bool array, ALLcond, of size (num_samples, Lmax+1, Lmax+1, maxvalue+1), holding where the conditions are satisfied for each value in Y.
ALLcond = np.zeros((num_samples, Lmax+1, Lmax+1, maxvalue+1), dtype='bool')
for l in range(Lmax+1):
    for i in range(Lmax):          # data has Lmax columns, so positions run 0..Lmax-1
        for v in range(maxvalue+1):
            ALLcond[:, l, i, v] = (data[:, i] == v) & (Lvec == l)
Where Lvec = [len(row) for row in data] holds the length of each vector. Then get the indices of the satisfied conditions using np.where, and initialize a 4D float array into which you will assign the values of Y:
[ind_Y, ind_len, ind_pos, ind_val] = np.where(ALLcond)
Yval = np.zeros(np.shape(ALLcond), dtype='float')
Now in the loop in which I have to perform the operation, I compute it with the two lines:
Yval[ind_Y, ind_len, ind_pos, ind_val] = Y[ind_Y]
Y_avg = Yval.sum(axis=0) / num_samples
This gives a factor of 4 or so speed-up over the direct loop implementation. I was expecting more. Perhaps this is a more tangible implementation for Python heads to digest. Any faster suggestions are welcome :)
One way is to convert the 3 "indices" to a linear index and then apply bincount. Numpy's ravel_multi_index is essentially the same as MATLAB's sub2ind. So the ported code could be something like:
shape = (Lmax+1, Lmax+1, maxvalue+1)
posvec = np.arange(1, Lmax+1)
ind_len = np.tile(Lvec[:,None], [1, Lmax])
ind_pos = np.tile(posvec, [n, 1])
ind_val = data
Y_copied = np.tile(Y[:,None], [1, Lmax])
mask = posvec <= Lvec[:,None] # fill-value independent
lin_idx = np.ravel_multi_index((ind_len[mask], ind_pos[mask], ind_val[mask]), shape)
Y_avg = np.bincount(lin_idx, weights=Y_copied[mask], minlength=np.prod(shape)) / n
Y_avg.shape = shape
This is assuming data has shape (n, Lmax), Lvec is a NumPy array, etc. You may need to adapt the code a little to get rid of off-by-one errors.
One could argue that the tile operations are not very efficient and not very "numpythonic". Something with broadcast_arrays could be nice, but I think I prefer this way:
shape = (Lmax+1, Lmax+1, maxvalue+1)
posvec = np.arange(1, Lmax+1)
mask = posvec <= Lvec[:,None]  # fill-value independent; must come before it is used below
len_idx = np.repeat(Lvec, Lvec)
pos_idx = np.broadcast_to(posvec, data.shape)[mask]
val_idx = data[mask]
Y_copied = np.repeat(Y, Lvec)
lin_idx = np.ravel_multi_index((len_idx, pos_idx, val_idx), shape)
Y_avg = np.bincount(lin_idx, weights=Y_copied, minlength=np.prod(shape)) / n
Y_avg.shape = shape
Note broadcast_to was added in Numpy 1.10.0.
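To make the snippet above easy to try, here is a self-contained smoke test; the shapes and random inputs are made up for illustration and are not part of the original answer:
import numpy as np

n, Lmax, maxvalue = 100, 5, 4
rng = np.random.default_rng(0)

Lvec = rng.integers(1, Lmax + 1, size=n)        # true length of each vector
data = np.zeros((n, Lmax), dtype=int)           # zero-padded integer vectors
for i, L in enumerate(Lvec):
    data[i, :L] = rng.integers(1, maxvalue + 1, size=L)
Y = rng.random(n)                               # one scalar per vector

shape = (Lmax + 1, Lmax + 1, maxvalue + 1)
posvec = np.arange(1, Lmax + 1)
mask = posvec <= Lvec[:, None]
len_idx = np.repeat(Lvec, Lvec)
pos_idx = np.broadcast_to(posvec, data.shape)[mask]
val_idx = data[mask]
Y_copied = np.repeat(Y, Lvec)
lin_idx = np.ravel_multi_index((len_idx, pos_idx, val_idx), shape)
Y_avg = np.bincount(lin_idx, weights=Y_copied, minlength=np.prod(shape)) / n
Y_avg.shape = shape
print(Y_avg.shape)  # (6, 6, 5)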

Vectorizing a numpy array call of varying indices

I have a 2D numpy array and a list of lists of indices, for which I wish to compute the sums of the corresponding 1D vectors (rows) from the numpy array. This can easily be done with a for loop or a list comprehension, but I wonder whether it is possible to vectorize it. With similar code I gain about 40x speedups from vectorization.
Here's sample code:
import numpy as np
indices = [[1,2],[1,3],[2,0,3],[1]]
array_2d = np.array([[0.5, 1.5],[1.5,2.5],[2.5,3.5],[3.5,4.5]])
soln = [np.sum(array_2d[x], axis=-1) for x in indices]
(edit): Note that the indices are not (x, y) coordinates for array_2d; instead, indices[0] = [1,2] selects the rows of array_2d at positions 1 and 2. The number of elements in each list of indices can vary.
This is what I would hope to be able to do:
vectorized_soln = np.sum(array_2d[indices[:]], axis=-1)
Does anybody know if there are any ways of achieving this?
First of all, I think you have a typo in the third element of indices...
The easy way to do that is to build a sub-array using two arrays of indices:
i = np.array([1,1,2])
j = np.array([2,3,?])
sub_arr2d = array_2d[i,j]
and finally, you can take the sum of sub_arr2d...
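If the lists really are ragged row-index lists, as the question's edit clarifies, one way to get a single vectorized indexing call is to flatten the lists once and split the result afterwards. This is a sketch of that idea, not the answerer's code:
import numpy as np

indices = [[1, 2], [1, 3], [2, 0, 3], [1]]
array_2d = np.array([[0.5, 1.5], [1.5, 2.5], [2.5, 3.5], [3.5, 4.5]])

flat = np.concatenate(indices)            # all row indices in one 1D array
sums = array_2d[flat].sum(axis=-1)        # one advanced-indexing call
splits = np.cumsum([len(x) for x in indices])[:-1]
vectorized_soln = np.split(sums, splits)  # matches the list comprehension's output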
