Creating Mask that Applies to Vectors in 3D Array - python

I could not find a previous post that specifically addresses how to create masks that work on vectors in a 3D array; the questions and answers I found only cover applying masks to individual elements of a 3D array or to vectors in a 2D array. So, as the title states, that is exactly what I want to do here: remove all zero vectors from a 3D (x, y, z) array. The only method I can think of is two for loops that run over x and (y, :), as shown in the code below. However, this does not work either, because of the error message I receive when I try to run it:
'list' object cannot be safely interpreted as an integer
Moreover, even if I do somehow get this method to work, I know that a double for loop will make the masking very time consuming, because eventually I want to apply this to arrays with sizes in the millions. So this leads to my main question: what would be the fastest method to accomplish this?
Code:
import numpy as np
data = np.array([[[0,0,0],[1,2,3],[4,5,6],[0,0,0]],
                 [[7,8,9],[0,0,0],[0,0,0],[10,11,12]]], dtype=float)
datanonzero = np.empty([[],[]], dtype=float)   # np.empty with a list of lists raises the 'list' object error
for maskclear1 in range(0,2):
    for maskclear2 in range(0,4):
        datanonzero[maskclear1,maskclear2,:] = data[~np.all(data[maskclear1,maskclear2,0:3] == 0, axis=0)]

import numpy as np
data = np.array([[[0,0,0],[1,2,3],[4,5,6],[0,0,0]],[[7,8,9],[0,0,0],[0,0,0],[10,11,12]]],dtype=float)
flatten_data = data.reshape(-1, 3)
datanonzero = [ data[~np.all(vec == 0, axis=0)] for vec in flatten_data ]
datanonzero = np.reshape(datanonzero, (2,-1))
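A loop-free sketch (my own addition, not from the thread): flatten the first two axes into one, build a single boolean mask over whole rows, and index once. Note that removing a different number of vectors per 2D slice gives a ragged result, so reshaping back to 3D only works when every slice loses the same count, as it does in this example.
import numpy as np
data = np.array([[[0,0,0],[1,2,3],[4,5,6],[0,0,0]],
                 [[7,8,9],[0,0,0],[0,0,0],[10,11,12]]], dtype=float)
vecs = data.reshape(-1, 3)                     # one row per vector
nonzero = vecs[~np.all(vecs == 0, axis=1)]     # shape (4, 3): only the nonzero vectors
datanonzero = nonzero.reshape(data.shape[0], -1, 3)   # back to 3D: shape (2, 2, 3)
Building one mask over all vectors at once avoids the nested Python loops entirely, which is what matters once the array sizes reach the millions.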

Related

How to add several vectors to numpy structured array and call matrix later from fieldname?

Hey guys, I need help.
I want to use TensorFlow's data import, where data is loaded by calling the features/labels vectors from a structured NumPy array.
https://www.tensorflow.org/programmers_guide/datasets#consuming_numpy_arrays
I want to create such a structured array by consecutively adding the two vectors (feature_vec and label_vec) to a NumPy structured array.
import numpy as np
# example vectors
feature_vec= np.arange(10)
label_vec = np.arange(10)
# structured array which should get the vectors
struc_array = np.array([feature_vec,label_vec],dtype=([('features',np.float32), ('labels',np.float32)]))
# How can I add now new vectors to struc_array?
struc_array.append(---)
Later, when this array is loaded from file, I want to call either the feature vectors or the label vectors (each a matrix now) by using the field name:
with np.load("/var/data/training_data.npy") as data:
features = data["features"] # matrix containing feature vectors as rows
labels = data["labels"] #matrix containing labels vectors as rows
Everything I tried to code was complete crap... I never got a correct output.
Thanks for your help!
Don't create a NumPy array and then append to it. That doesn't really make sense, as NumPy arrays have a fixed size and require a full copy to append a single row or column. Instead, create a list, append to it, then construct the array at the end:
vecs = [feature_vec,label_vec]
dtype = [('features',np.float32), ('labels',np.float32)]
# append as many times as you want:
vecs.append(other_vec)
dtype.append(('other', np.float32))
struc_array = np.array(vecs, dtype=dtype)
Of course, you probably need to…
Unfortunately, this doesn't solve the problem.
I want to get just the labels or the features from the structured array by using:
labels = struc_array['labels']
features = struc_array['features']
But when I use the structured array like you did, both labels and features contain all of the appended vectors:
import numpy as np
feature_vec= np.arange(10)
label_vec = np.arange(0,5,0.5)
vecs = [feature_vec,label_vec]
dtype = [('features',np.float32), ('labels',np.float32)]
other_vec = np.arange(6,11,0.5)
vecs.append(other_vec)
dtype.append(('other', np.float32))
struc_array = np.array(vecs, dtype=dtype)
# This contains all vectors.. not just the labels vector
labels = struc_array['labels']
# This also contains all vectors.. not just the feature vector
features = struc_array['features']
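For what it's worth, here is a minimal sketch of the usual pattern (my own, assuming every feature vector and every label vector has a fixed length): collect the vectors in plain lists, then build one structured array whose fields are subarrays, so that each field name yields a matrix with one vector per row.
import numpy as np
feature_vec = np.arange(10, dtype=np.float32)
label_vec = np.arange(10, dtype=np.float32)
# Collect samples in plain lists (cheap to append to)...
features, labels = [], []
features.append(feature_vec)
labels.append(label_vec)
# ...then build one structured array whose fields are subarrays:
dtype = [('features', np.float32, feature_vec.shape),
         ('labels', np.float32, label_vec.shape)]
struc_array = np.zeros(len(features), dtype=dtype)
struc_array['features'] = features   # matrix with one feature vector per row
struc_array['labels'] = labels       # matrix with one label vector per row
np.save("training_data.npy", struc_array)
data = np.load("training_data.npy")
print(data['features'].shape)   # (1, 10): rows are feature vectors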

np.bincount for 1 line, vectorized multidimensional averaging

I am trying to vectorize an operation with NumPy. I use it in a Python script that I have profiled; this operation is the bottleneck, so it needs to be optimized, since I will run it many times.
The operation is on a data set of two parts. First, a large set (n) of 1D vectors of different lengths (with maximum length Lmax) whose elements are integers from 1 to maxvalue. The set of vectors is arranged in a 2D array, data, of size (num_samples, Lmax), with the trailing elements in each row zeroed. The second part is a set of scalar floats, one associated with each vector, which I have computed and which depend on the vector's length and the integer value at each position. The set of scalars is made into a 1D array, Y, of size num_samples.
The desired operation is to form the average of Y over the n samples, as a function of (value,position along length,length).
This entire operation can be vectorized in MATLAB using the accumarray function, with three 2D arrays of the same size as data whose elements are the corresponding value, position, and length indices of the desired final array:
sz_Y = num_samples;
sz_len = Lmax
sz_pos = Lmax
sz_val = maxvalue
ind_len = repmat( 1:sz_len ,1 ,sz_samples);
ind_pos = repmat( 1:sz_pos ,sz_samples,1 );
ind_val = data
ind_Y = repmat((1:sz_Y)',1 ,Lmax );
copiedY=Y(ind_Y);
mask = data>0;
finalarr=accumarray({ind_val(mask),ind_pos(mask),ind_len(mask)},copiedY(mask), [sz_val sz_pos sz_len])/sz_val;
I was hoping to emulate this implementation with np.bincount. However, np.bincount differs from accumarray in two relevant ways:
both arguments must be of same 1D size, and
there is no option to choose the shape of the output array.
In the above usage of accumarray, the list of indices, {ind_val(mask),ind_pos(mask),ind_len(mask)}, is a 1D cell array of 1x3 arrays used as index tuples, while np.bincount takes 1D scalar indices as far as I understand. I expect np.ravel may be useful, but I am not sure how to use it here to do what I want. I am coming to Python from MATLAB and some things do not translate directly, e.g. the colon operator, which ravels in the opposite order to ravel. So my question is how I might use np.bincount, or any other NumPy method, to achieve an efficient Python implementation of this operation.
EDIT: To avoid wasting time: for these multi-dimensional index problems with complicated index manipulation, is the recommended route to just use Cython to implement the loops explicitly?
EDIT2: An alternative Python implementation I just came up with.
Here is a heavy-RAM solution:
First precalculate:
Using index units for length (i.e., length 1 = index 0), make a 4D bool array of size (num_samples, Lmax+1, Lmax+1, maxvalue+1) holding where the conditions are satisfied for each value in Y.
ALLcond = np.zeros((num_samples, Lmax+1, Lmax+1, maxvalue+1), dtype='bool')
for l in range(Lmax+1):
    for i in range(Lmax+1):
        for v in range(maxvalue+1):
            ALLcond[:,l,i,v] = (data[:,i]==v) & (Lvec==l)
Where Lvec=[len(row) for row in data]. Then get the indices for these using np.where and initialize a 4D float array into which you will assign the values of Y:
[ind_Y, ind_len, ind_pos, ind_val] = np.where(ALLcond)
Yval=np.zeros(np.shape(ALLcond),dtype='float')
Now in the loop in which I have to perform the operation, I compute it with the two lines:
Yval[ind_Y,ind_len,ind_pos,ind_val] = Y[ind_Y]
Y_avg = Yval.sum(axis=0)/num_samples
This gives a factor of 4 or so speed-up over the direct loop implementation. I was expecting more. Perhaps this is a more tangible implementation for Python heads to digest. Any faster suggestions are welcome :)
One way is to convert the 3 "indices" to a linear index and then apply bincount. Numpy's ravel_multi_index is essentially the same as MATLAB's sub2ind. So the ported code could be something like:
shape = (Lmax+1, Lmax+1, maxvalue+1)
posvec = np.arange(1, Lmax+1)
ind_len = np.tile(Lvec[:,None], [1, Lmax])
ind_pos = np.tile(posvec, [n, 1])
ind_val = data
Y_copied = np.tile(Y[:,None], [1, Lmax])
mask = posvec <= Lvec[:,None] # fill-value independent
lin_idx = np.ravel_multi_index((ind_len[mask], ind_pos[mask], ind_val[mask]), shape)
Y_avg = np.bincount(lin_idx, weights=Y_copied[mask], minlength=np.prod(shape)) / n
Y_avg.shape = shape
This is assuming data has shape (n, Lmax), Lvec is Numpy array, etc. You may need to adapt the code a little to get rid of off-by-one errors.
One could argue that the tile operations are not very efficient and not very "numpythonic". Something with broadcast_arrays could be nice, but I think I prefer this way:
shape = (Lmax+1, Lmax+1, maxvalue+1)
posvec = np.arange(1, Lmax+1)
mask = posvec <= Lvec[:,None] # fill-value independent; needed before the masked selections below
len_idx = np.repeat(Lvec, Lvec)
pos_idx = np.broadcast_to(posvec, data.shape)[mask]
val_idx = data[mask]
Y_copied = np.repeat(Y, Lvec)
lin_idx = np.ravel_multi_index((len_idx, pos_idx, val_idx), shape)
Y_avg = np.bincount(lin_idx, weights=Y_copied, minlength=np.prod(shape)) / n
Y_avg.shape = shape
Note broadcast_to was added in Numpy 1.10.0.
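To make the pattern concrete, here is a toy, self-contained check (sizes invented for illustration) that the ravel_multi_index/bincount combination accumulates exactly like an explicit loop, i.e. like accumarray:
import numpy as np
shape = (3, 4)
rows = np.array([0, 2, 0, 1])
cols = np.array([1, 3, 1, 0])
weights = np.array([10.0, 20.0, 30.0, 40.0])
# Accumulate via a linear index and bincount:
lin_idx = np.ravel_multi_index((rows, cols), shape)
acc = np.bincount(lin_idx, weights=weights, minlength=np.prod(shape))
acc.shape = shape
# Same result with an explicit loop:
check = np.zeros(shape)
for r, c, w in zip(rows, cols, weights):
    check[r, c] += w
assert np.array_equal(acc, check)   # e.g. acc[0, 1] == 40.0 (two hits accumulated)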

Best practice for fancy indexing a numpy array along multiple axes

I'm trying to optimize an algorithm to reduce memory usage, and I've identified this particular operation as a pain point.
I have a symmetric matrix, an index array along the rows, and another index array along the columns (which is just all values that I wasn't selecting in the row index). I feel like I should just be able to pass in both indexes at the same time, but I find myself being forced to select along one axis and then the other, which is causing some memory issues because I don't actually need the copy of the array that's returned, just statistics I'm calculating from it. Here's what I am trying to do:
from scipy.spatial.distance import pdist, squareform
from sklearn import datasets
import numpy as np
iris = datasets.load_iris().data
dx = pdist(iris)
mat = squareform(dx)
outliers = [41,62,106,108,109,134,135]
inliers = np.setdiff1d( range(iris.shape[0]), outliers)
# What I want to be able to do:
scores = mat[inliers, outliers].min(axis=0)
Here's what I'm actually doing to make this work:
# What I'm being forced to do:
s1 = mat[:,outliers]
scores = s1[inliers,:].min(axis=0)
Because I'm using fancy indexing, s1 is a new array instead of a view. I only need this array for one operation, so if I could eliminate returning a copy here, or at least make the new array smaller (i.e. by respecting the second fancy-index selection while I'm doing the first one, instead of two separate fancy-index operations), that would be preferable.
"Broadcasting" applies to indexing too. You could convert inliers into a column vector (e.g. inliers.reshape(-1,1) or inliers[:, np.newaxis], so it has shape (m,1)) and use it to index mat along the first axis:
s1 = mat[inliers.reshape(-1,1), outliers]
scores = s1.min(axis=0)
There's a better way in terms of readability:
result = mat[np.ix_(inliers, outliers)].min(0)
https://docs.scipy.org/doc/numpy/reference/generated/numpy.ix_.html#numpy.ix_
Try:
outliers = np.array(outliers) # just to be sure they are arrays
result = mat[inliers[:, np.newaxis], outliers[np.newaxis, :]].min(0)
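For illustration, a toy check (with a random symmetric matrix standing in for mat) that the broadcasting form and np.ix_ select the same submatrix:
import numpy as np
a = np.random.random((8, 8))
mat = a + a.T                                   # symmetric, like the distance matrix
outliers = np.array([1, 4, 6])
inliers = np.setdiff1d(np.arange(8), outliers)  # [0, 2, 3, 5, 7]
s1 = mat[inliers[:, np.newaxis], outliers]      # (5, 3): the index arrays broadcast
s2 = mat[np.ix_(inliers, outliers)]             # same submatrix, arguably clearer
assert np.array_equal(s1, s2)
scores = s1.min(axis=0)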

scipy -- how to insert an array of zeros into another array with different dimensions

If I have an array:
myzeros = scipy.zeros((c*pos, c*pos)), with c = 0.1 and pos = 100,
and an array:
grid=scipy.ones((pos,pos))
How can I insert the zeros into the grid at random positions? The problem is with the dimensions.
I know that in 1d you can do:
myzeros = sc.zeros(c*pos)  # array full of zeros
grid = sc.ones(pos)        # grid full of available positions (ones)
dist = sc.random.permutation(pos)[:c*pos]  # distribute c*pos zeros at random positions
grid[dist] = myzeros
I tried something similar for the 2D case, but it doesn't work. I also tried myzeros = sc.zeros(c*pos), but it still does not work.
There are several ways, but the easiest seems to be to first convert the 2D grid into a 1D grid and proceed as in the 1D case, then convert back to 2D:
import scipy as sc
c = 0.1
pos = 100
myzeros = sc.zeros((int(c*pos), int(c*pos)))  # sizes and slice bounds must be integers
myzeros1D = myzeros.ravel()
grid = sc.ones((pos, pos))
grid1D = grid.ravel()
dist = sc.random.permutation(pos*pos)[:int(c*pos)*int(c*pos)]
grid1D[dist] = myzeros1D
myzeros = myzeros1D.reshape((int(c*pos), int(c*pos)))
grid = grid1D.reshape((pos, pos))
EDIT: to answer your comment: if you only want a part of the myzeros to go into the grid array, you have to make the dist array smaller. Example:
dist = sc.random.permutation(pos*pos)[:int(c*pos)]
grid1D[dist] = myzeros1D[:int(c*pos)]
And I hope you are aware that this last line can be written as
grid1D[dist] = 0
if you really only want to set those elements to a single value instead of using the elements from another array.

Convert a list of 2D numpy arrays to one 3D numpy array?

I have a list of several hundred 10x10 arrays that I want to stack together into a single Nx10x10 array. At first I tried a simple
newarray = np.array(mylist)
But that returned with "ValueError: setting an array element with a sequence."
Then I found the online documentation for dstack(), which looked perfect: "...This is a simple way to stack 2D arrays (images) into a single 3D array for processing." Which is exactly what I'm trying to do. However,
newarray = np.dstack(mylist)
tells me "ValueError: array dimensions must agree except for d_0", which is odd because all my arrays are 10x10. I thought maybe the problem was that dstack() expects a tuple instead of a list, but
newarray = np.dstack(tuple(mylist))
produced the same result.
At this point I've spent about two hours searching here and elsewhere to find out what I'm doing wrong and/or how to go about this correctly. I've even tried converting my list of arrays into a list of lists of lists and then back into a 3D array, but that didn't work either (I ended up with lists of lists of arrays, followed by the "setting array element as sequence" error again).
Any help would be appreciated.
newarray = np.dstack(mylist)
should work. For example:
import numpy as np
# Here is a list of five 10x10 arrays:
x = [np.random.random((10,10)) for _ in range(5)]
y = np.dstack(x)
print(y.shape)
# (10, 10, 5)
# To get the shape to be Nx10x10, you could use rollaxis:
y = np.rollaxis(y,-1)
print(y.shape)
# (5, 10, 10)
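A side note, assuming NumPy 1.10 or later is available: np.stack joins the arrays along a new leading axis directly, so the rollaxis step is not needed:
import numpy as np
x = [np.random.random((10,10)) for _ in range(5)]
y = np.stack(x)   # equivalent to np.stack(x, axis=0)
print(y.shape)
# (5, 10, 10)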
np.dstack returns a new array. Thus, using np.dstack requires as much additional memory as the input arrays. If you are tight on memory, an alternative to np.dstack which requires less memory is to allocate space for the final array first, and then pour the input arrays into it one at a time.
For example, if you had 58 arrays of shape (159459, 2380), then you could use
y = np.empty((159459, 2380, 58))
for i in range(58):
    # instantiate the input arrays one at a time
    x = np.random.random((159459, 2380))
    # copy x into y
    y[..., i] = x
