So I am a little new to using matrices in Python, and I am looking for the best way to perform the following operation.
Say I have a vector of an arbitrary length, like this:
data = np.array(range(255))
And I want to fit this data inside a matrix with a shape like so:
concept = np.zeros((3, 9, 6))
Now, obviously this will not fit, and results in an error:
ValueError: cannot reshape array of size 255 into shape (3,9,6)
What would be the best way to go about fitting as much of the data vector inside the first matrix with the shape (3, 9, 6) while making sure any "overflow" is stored in a second (or third, fourth, etc.) matrix?
Does this make sense?
Basically, I want to be able to take a vector of any size and produce an arbitrary amount of matrices that have the data shaped according to the 3, 9, 6 dimensions.
Thank you for your help.
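import numpy as np

def each_matrix(a, dims):
    size = dims.prod()
    # pad with size-1 zeros so the final, partially filled block is completed
    padded = np.concatenate([ a, np.zeros(size-1) ])
    # integer division: range() rejects the float that / returns in Python 3
    for i in range(len(padded) // size):
        yield padded[i*size : (i+1)*size].reshape(dims)

for matrix in each_matrix(np.array(range(255)),
                          dims=np.array([ 3, 9, 6 ])):
    print(str(matrix) + '\n\n-------\n')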
The unused tail of the last matrix is filled with zeros.
Here is a rough solution to your problem.
def split_padded(a, n):
    # zeros needed to reach the next multiple of n (0 if len(a) is already a multiple)
    padding = -len(a) % n
    numOfsplit = (len(a) + padding) // n
    print(padding, numOfsplit)
    return np.split(np.concatenate((a, np.zeros(padding))), numOfsplit)

data = np.array(range(255))
splitnum = 3*9*6
splitdata = split_padded(data, splitnum)
for mat in splitdata:
    print(mat.reshape(3, 9, 6))
It is very rough and only works for 1D input arrays.
First we compute in padding the number of zeros needed to pad the data, then in numOfsplit the number of matrices we can get out of the input, and the last line does the splitting.
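Both answers pad and then cut the vector into blocks. As a minimal sketch of the same idea in one step, np.pad followed by a reshape with a leading -1 lets NumPy work out the number of matrices. The helper name to_blocks and its fill parameter are made up for illustration, and a 1D input is assumed.

```python
import numpy as np

def to_blocks(a, dims, fill=0):
    """Pad a 1D array with `fill` and stack it into blocks of shape `dims`."""
    a = np.asarray(a)
    size = int(np.prod(dims))
    pad = -len(a) % size          # 0 when len(a) is already a multiple of size
    padded = np.pad(a, (0, pad), constant_values=fill)
    # -1 lets NumPy infer how many blocks are needed
    return padded.reshape(-1, *dims)

blocks = to_blocks(np.arange(255), (3, 9, 6))
print(blocks.shape)               # one leading axis indexing the matrices
```

The result is a single 4D array, so blocks[0] is the first matrix and blocks[-1] is the zero-padded one.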
I have a 3d numpy array that looks like this
A = np.random.randint(0, 10, (23, 23, 39)) # H, W, D
I wish to randomly sample along its depth to get a 2D array with H and W only.
Note … this doesn't work
B = A[np.random.randint(0, 39, (23,23))]
I think this is what you're looking for:
B = np.array([x[np.random.randint(A.shape[2])] for y in A for x in y]).reshape(A.shape[:-1])
A little explanation: we use a list comprehension to iterate, two-dimensionally, over every sub-array (y iterates over dimension 0, x iterates over dimension 1, so each x is a 1D array along dimension 2).
From each of these arrays, we then pick one element at a random index.
The result is a large one-dimensional array containing one element from each sub-array. Finally, we reshape it to the shape of A minus the last dimension (in our case, 23 x 23).
Hope it's what you're looking for!
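If the Python-level loop becomes slow, the same sampling can be sketched without it using np.take_along_axis, which picks one index per (H, W) position along the depth axis. This is an alternative to the list comprehension, not the answer's own method. The seeded generator and the names rng and idx are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 10, size=(23, 23, 39))           # H, W, D

# one random depth index for every (h, w) position
idx = rng.integers(0, A.shape[2], size=A.shape[:2])

# take_along_axis needs idx to have the same number of dimensions as A
B = np.take_along_axis(A, idx[..., None], axis=2)[..., 0]
print(B.shape)
```

The attempt in the question, A[idx], fails because a single index array is applied to the first axis only, not to the depth axis.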
I'm extracting some features from some data generated with an accelerometer and I have the following arrays:
X_mfccs_processed (list with 40 values)
Y_mfccs_processed (list with 40 values)
Z_mfccs_processed (list with 40 values)
X_mean (1 value)
Y_mean (1 value)
Z_mean (1 value)
At the moment I'm able to create a 3D array [shape=(1,40,3)] and insert my MFCC arrays into it:
self.extracted_features = np.ndarray(shape=(1, len(self.X_mfccs_processed), 3))
self.extracted_features[:,:,0] = self.X_mfccs_processed
self.extracted_features[:,:,1] = self.Y_mfccs_processed
self.extracted_features[:,:,2] = self.Z_mfccs_processed
My question is: how can I create a 4D array [shape=(1,40,1,3)] in which to also store my mean values?
To make the array, instead of assigning values to a preallocated array a better way is:
self.extracted_features = np.array([X_mfccs_processed,Y_mfccs_processed,Z_mfccs_processed]).T[None,...]
or equivalently:
self.extracted_features = np.array([X_mfccs_processed,Y_mfccs_processed,Z_mfccs_processed]).T.reshape(1,-1,3)
However, you cannot add another dimension of size 1 and insert the mean values in it. The size of a dimension is the number of elements along that dimension. An easy way to think about it is that a matrix of shape (1,N) is a 1xN matrix; it does not mean you can put the mean in the first dimension and a list of size N in the second dimension. You need to come up with another way to store your means. I would suggest a separate array like this, which has shape (1,3) when the means are scalars:
self.extracted_features_mean = np.array([X_mean,Y_mean,Z_mean]).T[None,...]
Then use similar indexing to access the means. An alternative would be to use dictionaries. Depending on your application, you can pick whichever is easier and/or faster.
Usually np.reshape(self.extracted_features, (1,40,1,3)) works well.
The shape would have to be different to store the mean values as well. There isn't enough space.
(1,40,1,6) or (1,40,2,3)
seem reasonable shapes.
for (1,40,1,6)
self.extracted_features = np.ndarray(shape=(1, len(self.X_mfccs_processed), 1, 6))
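# [:,:,:,0] selects shape (1,40,1), which a length-40 list cannot broadcast into,
# so the two singleton axes are indexed explicitly
self.extracted_features[0,:,0,0] = self.X_mfccs_processed
self.extracted_features[0,:,0,1] = self.Y_mfccs_processed
self.extracted_features[0,:,0,2] = self.Z_mfccs_processed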
self.extracted_features[:,:,:,3] = self.X_mean
self.extracted_features[:,:,:,4] = self.Y_mean
self.extracted_features[:,:,:,5] = self.Z_mean
for (1,40,2,3)
self.extracted_features = np.ndarray(shape=(1, len(self.X_mfccs_processed), 2, 3))
self.extracted_features[:,:,0,0] = self.X_mfccs_processed
self.extracted_features[:,:,0,1] = self.Y_mfccs_processed
self.extracted_features[:,:,0,2] = self.Z_mfccs_processed
self.extracted_features[:,:,1,0] = self.X_mean
self.extracted_features[:,:,1,1] = self.Y_mean
self.extracted_features[:,:,1,2] = self.Z_mean
I should mention that this broadcasts the mean values, meaning that it duplicates them (40 times). This is bad for storage, but if you are doing some type of machine learning or numerics it might be a good tradeoff. Alternatively you could use a (1,41,1,3) shape.
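As a rough sketch of the (1,40,2,3) layout without preallocating, the three MFCC lists can be stacked and the means broadcast next to them. The random arrays and the means values below are hypothetical stand-ins for X_mfccs_processed, X_mean and so on.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical stand-ins for the three 40-value MFCC lists
x_mfccs, y_mfccs, z_mfccs = rng.normal(size=(3, 40))
# hypothetical stand-ins for X_mean, Y_mean, Z_mean
means = np.array([0.1, 0.2, 0.3])

mfccs = np.stack([x_mfccs, y_mfccs, z_mfccs], axis=-1)     # shape (40, 3)
# repeat the three means along the 40 positions (a read-only view, copied by stack)
means_tiled = np.broadcast_to(means, mfccs.shape)          # shape (40, 3)

features = np.stack([mfccs, means_tiled], axis=1)[None]    # shape (1, 40, 2, 3)
print(features.shape)
```

Here features[0, :, 0, :] holds the MFCCs and features[0, :, 1, :] holds the duplicated means.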
I have used numpy meshgrids for a long time, and typically find no issues when trying to pass that meshgrid through a function. In my experience it has always been the case that I can define my coordinate space as
x,y,z = numpy.meshgrid(numpy.linspace(-10,10,10),
numpy.linspace(-10,10,10),
numpy.linspace(-10,10,10))
and then can easily compute something like
u,v,w = numpy.sin(x*y)+numpy.cos(z).
My issue has arisen from the need to do a cross product in that calculation. I am defining a field using the meshgrid, and trying to pass the entire meshgrid through the function:
field_equation = lambda x,y,z: sum([parameter*np.cross([wire_x[i],wire_y[i],wire_z[i]],[x,y,z]) for i in range(len(wire))])
Depending on how I try to solve the problem, I get a whole host of problems. The code works fine when passing individual points (x,y,z) through one at a time, but cannot calculate for the entire field. How do I get around this?
np.cross only accepts vectors of size 3, or nd-arrays whose last dimension has size 3, so we first need to stack the grids with np.stack([x,y,z], axis=-1) to create a 10*10*10*3 nd-array.
The result is also a 10*10*10*3 array. To unpack it into three components later, the component axis has to come first (3*10*10*10), so I rearrange the axes of the result at the end.
In the code below, I have also taken the liberty of shortening the wire-related code a little, assuming wire_x, wire_y, wire_z are just the 3 components of wire.
import numpy as np
# test data
x,y,z = np.meshgrid(np.linspace(-10,10,10),
np.linspace(-10,10,10),
np.linspace(-10,10,10))
wire = [[1,2,3,4], [5,6,7,8], [3,4,5,6]]
parameter = 1
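# np.moveaxis brings the component axis to the front while keeping the grid axes in order
# (swapaxes(0,-1) would also exchange the first and last grid axes)
field_equation = lambda x,y,z: np.moveaxis(sum([parameter*np.cross(w, np.stack([x,y,z], axis=-1)) for w in zip(*wire)]), -1, 0)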
a,b,c = field_equation(x,y,z)
print(a.shape, b.shape, c.shape)
#(10, 10, 10) (10, 10, 10) (10, 10, 10)
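A quick way to gain confidence in a vectorized field like this is to compare one grid point against the point-by-point formula from the question. The sketch below assumes a small grid with three different axis lengths (so that any axis mix-up would show up) and indexing="ij"; both choices, and the test wire values, are made up for the check.

```python
import numpy as np

# small grid with distinct axis lengths so axis mix-ups become visible
x, y, z = np.meshgrid(np.linspace(-10, 10, 4),
                      np.linspace(-10, 10, 5),
                      np.linspace(-10, 10, 6), indexing="ij")
wire = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [3, 4, 5, 6]], dtype=float)
parameter = 1.0

points = np.stack([x, y, z], axis=-1)                  # shape (4, 5, 6, 3)
# wire.T has one row per wire point; broadcast each row against the whole grid
segments = wire.T[:, None, None, None, :]              # shape (4, 1, 1, 1, 3)
field = (parameter * np.cross(segments, points)).sum(axis=0)   # shape (4, 5, 6, 3)
u, v, w = np.moveaxis(field, -1, 0)                    # each has the grid's shape

# point-by-point reference at one grid location
p = np.array([x[1, 2, 3], y[1, 2, 3], z[1, 2, 3]])
reference = sum(parameter * np.cross(wire[:, i], p) for i in range(wire.shape[1]))
print(np.allclose([u[1, 2, 3], v[1, 2, 3], w[1, 2, 3]], reference))
```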
I am trying to vectorize an operation using numpy. Profiling my Python script showed this operation to be the bottleneck, and since I will run it many times it needs to be optimized.
The operation is on a data set of two parts. First, a large set (n) of 1D vectors of different lengths (with maximum length, Lmax) whose elements are integers from 1 to maxvalue. The set of vectors is arranged in a 2D array, data, of size (num_samples,Lmax) with trailing elements in each row zeroed. The second part is a set of scalar floats, one associated with each vector, that I have computed and which depend on its length and the integer value at each position. The set of scalars is made into a 1D array, Y, of size num_samples.
The desired operation is to form the average of Y over the n samples, as a function of (value,position along length,length).
This entire operation can be vectorized in matlab with use of the accumarray function: by using 3 2D arrays of the same size as data, whose elements are the corresponding value, position, and length indices of the desired final array:
sz_Y = num_samples;
sz_len = Lmax
sz_pos = Lmax
sz_val = maxvalue
ind_len = repmat( 1:sz_len ,1 ,sz_samples);
ind_pos = repmat( 1:sz_pos ,sz_samples,1 );
ind_val = data
ind_Y = repmat((1:sz_Y)',1 ,Lmax );
copiedY=Y(ind_Y);
mask = data>0;
finalarr=accumarray({ind_val(mask),ind_pos(mask),ind_len(mask)},copiedY(mask), [sz_val sz_pos sz_len])/sz_val;
I was hoping to emulate this implementation with np.bincount. However, np.bincount differs from accumarray in two relevant ways:
both arguments must be of same 1D size, and
there is no option to choose the shape of the output array.
In the above usage of accumarray, the list of indices, {ind_val(mask),ind_pos(mask),ind_len(mask)}, is a 1D cell array of 1x3 arrays used as index tuples, while in np.bincount the indices must be 1D scalars as far as I understand. I expect np.ravel may be useful but am not sure how to use it here to do what I want. I am coming to Python from MATLAB and some things do not translate directly, e.g. the colon operator, which ravels in the opposite order to ravel. So my question is how I might use np.bincount or any other numpy method to achieve an efficient Python implementation of this operation.
EDIT: To avoid wasting time: for these multi-dimensional index problems with complicated index manipulation, is the recommended route to just use Cython to implement the loops explicitly?
EDIT2: Alternative Python implementation I just came up with.
Here is a heavy ram solution:
First precalculate:
Using index units for length (i.e., length 1 = 0), make a 4D bool array of size (num_samples,Lmax+1,Lmax+1,maxvalue+1), holding where the conditions are satisfied for each value in Y.
ALLcond=np.zeros((num_samples,Lmax+1,Lmax+1,maxvalue+1),dtype='bool')
for l in range(Lmax+1):
    for i in range(Lmax+1):
        for v in range(maxvalue+1):
            ALLcond[:,l,i,v]=(data[:,i]==v) & (Lvec==l)
Where Lvec=[len(row) for row in data]. Then get the indices for these using np.where and initialize a 4D float array into which you will assign the values of Y:
[ind_Y,ind_len,ind_pos,ind_val]=np.where(ALLcond)
Yval=np.zeros(np.shape(ALLcond),dtype='float')
Now in the loop in which I have to perform the operation, I compute it with the two lines:
Yval[ind_Y,ind_len,ind_pos,ind_val]=Y[ind_Y]
Y_avg=sum(Yval)/num_samples
This gives a factor of 4 or so speed up over the direct loop implementation. I was expecting more. Perhaps, this is a more tangible implementation for Python heads to digest. Any faster suggestions are welcome :)
One way is to convert the 3 "indices" to a linear index and then apply bincount. Numpy's ravel_multi_index is essentially the same as MATLAB's sub2ind. So the ported code could be something like:
shape = (Lmax+1, Lmax+1, maxvalue+1)
posvec = np.arange(1, Lmax+1)
ind_len = np.tile(Lvec[:,None], [1, Lmax])
ind_pos = np.tile(posvec, [n, 1])
ind_val = data
Y_copied = np.tile(Y[:,None], [1, Lmax])
mask = posvec <= Lvec[:,None] # fill-value independent
lin_idx = np.ravel_multi_index((ind_len[mask], ind_pos[mask], ind_val[mask]), shape)
Y_avg = np.bincount(lin_idx, weights=Y_copied[mask], minlength=np.prod(shape)) / n
Y_avg.shape = shape
This assumes data has shape (n, Lmax), Lvec is a NumPy array, etc. You may need to adapt the code a little to get rid of off-by-one errors.
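The accumarray-style pattern behind this port is easier to see on a tiny 2D case: flatten the index tuples with ravel_multi_index, sum the weights with bincount, then reshape. The arrays rows, cols and vals are made-up data; np.add.at is shown only as a cross-check that gives the same sums.

```python
import numpy as np

rows = np.array([0, 1, 0, 2])
cols = np.array([1, 0, 1, 2])
vals = np.array([10.0, 20.0, 30.0, 40.0])
shape = (3, 3)

# (row, col) pairs -> positions in the flattened (C-order) output
flat = np.ravel_multi_index((rows, cols), shape)
acc = np.bincount(flat, weights=vals, minlength=shape[0] * shape[1]).reshape(shape)

# cross-check: unbuffered in-place addition accumulates repeated indices too
acc_check = np.zeros(shape)
np.add.at(acc_check, (rows, cols), vals)

print(acc)
```

The two entries that share index (0, 1) are summed into one cell, which is the behaviour accumarray provides in MATLAB.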
One could argue that the tile operations are not very efficient and not very "numpythonic". Something with broadcast_arrays could be nice, but I think I prefer this way:
shape = (Lmax+1, Lmax+1, maxvalue+1)
posvec = np.arange(1, Lmax+1)
mask = posvec <= Lvec[:,None] # fill-value independent; must be defined before it is used below
len_idx = np.repeat(Lvec, Lvec)
pos_idx = np.broadcast_to(posvec, data.shape)[mask]
val_idx = data[mask]
Y_copied = np.repeat(Y, Lvec)
lin_idx = np.ravel_multi_index((len_idx, pos_idx, val_idx), shape)
Y_avg = np.bincount(lin_idx, weights=Y_copied, minlength=np.prod(shape)) / n
Y_avg.shape = shape
Note broadcast_to was added in Numpy 1.10.0.
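To check the indexing conventions (including the off-by-one question), the bincount version can be run on a small random data set and compared with an explicit loop. Everything below is a sketch on made-up data: the sizes, the seed, and the convention that positions are stored 1-based in the output are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, Lmax, maxvalue = 6, 4, 3                      # hypothetical small sizes
Lvec = rng.integers(1, Lmax + 1, size=n)         # length of each sample
posvec = np.arange(1, Lmax + 1)                  # 1-based positions
mask = posvec <= Lvec[:, None]                   # True where a row holds real data
data = np.where(mask, rng.integers(1, maxvalue + 1, size=(n, Lmax)), 0)
Y = rng.random(n)

shape = (Lmax + 1, Lmax + 1, maxvalue + 1)       # (length, position, value)
lin_idx = np.ravel_multi_index(
    (np.repeat(Lvec, Lvec),                      # length index, once per valid entry
     np.broadcast_to(posvec, data.shape)[mask],  # position index
     data[mask]),                                # value index
    shape)
Y_avg = np.bincount(lin_idx, weights=np.repeat(Y, Lvec),
                    minlength=int(np.prod(shape))).reshape(shape) / n

# explicit reference loop over samples and positions
ref = np.zeros(shape)
for s in range(n):
    for p in range(Lvec[s]):
        ref[Lvec[s], p + 1, data[s, p]] += Y[s] / n

print(np.allclose(Y_avg, ref))
```

Index 0 along each axis stays empty here because lengths, positions and values all start at 1; shifting them by one would shrink the output if memory matters.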
I would like to use a generic filter to calculate the mean of values within a given window (or kernel), for values that fulfill a couple of conditions. I expected the following code to produce a mean filter of the first array in a 3-layer window, using the other two arrays to mask values from the mean calculation.
from scipy import ndimage
import numpy as np
#some test data
tstArr = np.random.rand(3,7,7)
tstArr = tstArr*10
tstArr = np.int_(tstArr)
tstArr[1] = tstArr[1]*100
tstArr[2] = tstArr[2] *1000
#mean function
def testFun(tstData,processLayer,nLayers,kernelSize):
    funData= tstData.reshape((nLayers,kernelSize,kernelSize))
    meanLayer = funData[processLayer]
    maskedData = meanLayer[(funData[1]>1)&(funData[2]<9000)]
    returnMean = np.mean(maskedData)
    return returnMean
#number of layers in the array
nLayers = np.shape(tstArr)[0]
#window size
kernelSize = 5
#create a sampling window of 5x5 elements from each array
footprnt = np.ones((nLayers,kernelSize,kernelSize),dtype = int)
# calculate the mean of the first layer in the array (other two are for masking)
processLayer = 0
tstOut = ndimage.generic_filter(tstArr, testFun, footprint=footprnt, extra_arguments = (processLayer,nLayers,kernelSize))
I thought this would yield a 7x7 array of masked mean values from the first layer in the input array. The output is a 3x7x7 array, and I don't understand what the values represent. I'm not sure how to produce the "masked" mean-filtered array, or how to interpret the output as given.
Your code does produce a mean filter of the first array in a 3-layer window, using the other two arrays to mask values from the mean calculation. You will find the result in tstOut[1].
What is going on? When you call ndimage.generic_filter with tstArr of shape (3, 7, 7) and footprint=np.ones((3, 5, 5)), then for all i from 0 to 2, for all j from 0 to 6 and for all k from 0 to 6, testFun is called with the subarray of tstArr centered at (i, j, k) and of shape (3, 5, 5) (the array is reflected at the boundary to supply missing values).
In the end:
tstOut[0] is the mean filter of tstArr[0] with tstArr[0] and tstArr[1] as masks
tstOut[1] is the mean filter of tstArr[0] with tstArr[1] and tstArr[2] as masks
tstOut[2] is the mean filter of tstArr[1] with tstArr[2] and tstArr[2] as masks
Again, the wanted result is in tstOut[1].
I hope this will help you.
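The claim that the wanted result sits in the middle slice can be checked at an interior pixel, where no boundary reflection is involved. The sketch below uses a float test array (an assumption, so that the means are not truncated to integers), a made-up helper masked_mean, and a NaN fallback for windows in which nothing passes the mask.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
arr = rng.integers(0, 10, size=(3, 7, 7)).astype(float)
arr[1] *= 100
arr[2] *= 1000
n_layers, k = arr.shape[0], 5

def masked_mean(window, n_layers, k):
    # generic_filter passes the window as a flat array in footprint order
    w = window.reshape(n_layers, k, k)
    selected = w[0][(w[1] > 1) & (w[2] < 9000)]
    return selected.mean() if selected.size else np.nan

out = ndimage.generic_filter(arr, masked_mean,
                             footprint=np.ones((n_layers, k, k)),
                             extra_arguments=(n_layers, k))
result = out[1]          # window layers are (0, 1, 2) only for the middle slice

# direct computation for the interior pixel (3, 3): rows 1..5, columns 1..5
sl = (slice(1, 6), slice(1, 6))
keep = (arr[1][sl] > 1) & (arr[2][sl] < 9000)
expected = arr[0][sl][keep].mean() if keep.any() else np.nan
print(np.isclose(result[3, 3], expected, equal_nan=True))
```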