Numpy - Fastest way to create an integer index matrix - python

Problem:
I have an array that represents products, let's say 3 for example:
prod = [1,2,3]
Then I have a correlation matrix for those products (just a number that represents something between two products; let's call it c_ij for simplicity), in this case a 3x3 matrix:
corr = [[c_11,c_12,c_13],
        [c_21,c_22,c_23],
        [c_31,c_32,c_33]]
The problem is that I need to shuffle the prod array, and then I need to shuffle the corr matrix in a way that corr[i,j] still represents the correlation between prod[i] and prod[j].
My solution:
I know I can use an integer array as an index to shuffle multiple arrays in the same way, like this:
order = [0,1,2]
new_order = np.random.permutation(order) # [2,0,1] for example
shuf_prod = prod[new_order]
Looking around the web, I found that to make this work on a matrix I need to transform the new_order array into an index matrix like:
new_order = [2,0,1]
new_index = [[[2,2,2],[0,0,0],[1,1,1]],
             [[2,0,1],[2,0,1],[2,0,1]]]
new_corr = corr[tuple(new_index)]
# this outputs what I want, which is:
# [[c_33,c_31,c_32],
#  [c_13,c_11,c_12],
#  [c_23,c_21,c_22]]
Question:
The entire shuffling solution looks clunky and inefficient. This is a performance-critical application, so is there a faster way to do this? (I don't really care about simplicity of code, just performance.)
If this is a good way of doing it, how can I create the new_index matrix from the new_order array?
EDIT: Michael Szczesny solved the problem:
new_corr = corr[new_order].T[new_order].T
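For reference, the same reordering can also be written with np.ix_; this is an equivalent alternative, not part of the accepted fix (the stand-in corr values below are made up):
import numpy as np
corr = np.arange(9).reshape(3, 3)       # stand-in for the real correlation matrix
new_order = np.array([2, 0, 1])
a = corr[new_order].T[new_order].T      # the accepted solution
b = corr[np.ix_(new_order, new_order)]  # b[i, j] == corr[new_order[i], new_order[j]]
print(np.array_equal(a, b))             # True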

You can use the indices directly as subscripts to the matrix, as long as you provide the right shape for the second axis:
import numpy as np
mat = np.array([[3, 4, 5],
                [4, 8, 9],
                [5, 9, 7]])
order = np.array([2, 0, 1])
mat[order, order[:, None]]
array([[7, 5, 9],
       [5, 3, 4],
       [9, 4, 8]])
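Applied back to the original question, the same broadcasting trick can shuffle prod and corr consistently. A small sketch with made-up stand-in values; note that putting order[:, None] in the row position gives shuf_corr[i, j] == corr[order[i], order[j]], which coincides with mat[order, order[:, None]] above whenever the matrix is symmetric:
import numpy as np
prod = np.array([1, 2, 3])
corr = np.array([[11, 12, 13],
                 [21, 22, 23],
                 [31, 32, 33]])          # stand-ins for the c_ij values
order = np.random.permutation(len(prod))
shuf_prod = prod[order]
shuf_corr = corr[order[:, None], order]  # shuf_corr[i, j] == corr[order[i], order[j]]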

Related

Is there an efficient way to combine 4 small arrays into one big array using an interleaved 'mosaic' pattern?

I have 4 square arrays of the same shape
array1 = 1*np.ones((10,10))
array2 = 2*np.ones((10,10))
array3 = 3*np.ones((10,10))
array4 = 4*np.ones((10,10))
I want to recombine them into one big array in an interleaved mosaic pattern as such:
result = np.asarray([[1,2,1,2,...,1,2],
                     [3,4,3,4,...,3,4],
                     [1,2,1,2,...,1,2],
                     ...
                     [3,4,3,4,...,3,4]])
Where result is twice as big in both dimensions as the original individual images.
Is there an efficient way to do this?
To illustrate my question, I used arrays containing constant values but in reality, these 4 arrays would be different images.
Two common approaches for interlacing data in numpy are:
A) Assign each source to a slice of a blank result array, corresponding to where the data should go:
result = np.zeros((20, 20)) # allocate space
result[::2, ::2] = array1 # put those values in the appropriate spots
result[::2, 1::2] = array2
result[1::2, ::2] = array3
result[1::2, 1::2] = array4
B) use stacking to stick the data together in a single array, and then reshape to flatten the data in a way that leaves it interlaced. This typically requires a bit of trial and error, but after playing around with the REPL a bit I came up with:
result = np.hstack((np.dstack((array1, array2)), np.dstack((array3, array4)))).reshape(20, 20)
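For reference, a quick sanity check with the constant toy arrays from the question, confirming that both approaches build the same mosaic:
import numpy as np
array1 = 1 * np.ones((10, 10))
array2 = 2 * np.ones((10, 10))
array3 = 3 * np.ones((10, 10))
array4 = 4 * np.ones((10, 10))
# Approach A: slice assignment
result_a = np.zeros((20, 20))
result_a[::2, ::2] = array1
result_a[::2, 1::2] = array2
result_a[1::2, ::2] = array3
result_a[1::2, 1::2] = array4
# Approach B: stack, then reshape
result_b = np.hstack((np.dstack((array1, array2)),
                      np.dstack((array3, array4)))).reshape(20, 20)
print(np.array_equal(result_a, result_b))  # True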

How to use numpy to calculate mean and standard deviation of an irregular shaped array

I have a numpy array that has many samples in it of varying length
Samples = np.array([[1001, 1002, 1003],
                    ... ,
                    [1001, 1002]])
I want to (elementwise) subtract the mean of the array then divide by the standard deviation of the array. Something like:
newSamples = (Samples-np.mean(Samples))/np.std(Samples)
Except that doesn't work for irregularly shaped arrays;
np.mean(Samples) causes
unsupported operand type(s) for /: 'list' and 'int'
which I assume is because it expects a fixed size for each axis and can't handle samples of a different size. What is an approach to solve this using numpy?
example input:
Sample = np.array([[1, 2, 3],
                   [1, 2]])
After subtracting the mean and then dividing by the standard deviation:
Sample = array([[-1.06904497, 0.26726124, 1.60356745],
                [-1.06904497, 0.26726124]])
Don't make ragged arrays. Just don't. Numpy can't do much with them, and any code you might make for them will always be unreliable and slow because numpy doesn't work that way. It turns them into object dtypes:
Sample
array([[1, 2, 3], [1, 2]], dtype=object)
Which almost no numpy functions work on. In this case those objects are list objects, which makes your code even more confusing, as you either have to switch between list and ndarray methods, or stick to list-safe numpy methods. This is a recipe for disaster, as anyone noodling around with the code later (even yourself if you forget) will be dancing in a minefield.
There are two things you can do with your data to make things work better:
The first method is to index and flatten:
i = np.cumsum(np.array([len(x) for x in Sample]))
flat_sample = np.hstack(Sample)
This preserves the index of the end of each sample in i, while keeping the samples in a single flat 1D array.
The other method is to pad one dimension with np.nan and use nan-safe functions
m = np.array([len(x) for x in Sample]).max()
nan_sample = np.array([x + [np.nan] * (m - len(x)) for x in Sample])
So to do your calculations, you can use flat_sample and do similar to above:
new_flat_sample = (flat_sample - np.mean(flat_sample)) / np.std(flat_sample)
and use i to recreate your original array (or rather a list of arrays, which I recommend; see np.split):
new_list_sample = np.split(new_flat_sample, i[:-1])
[array([-1.06904497,  0.26726124,  1.60356745]),
 array([-1.06904497,  0.26726124])]
Or use nan_sample, but you will need to replace np.mean and np.std with np.nanmean and np.nanstd
new_nan_sample = (nan_sample - np.nanmean(nan_sample)) / np.nanstd(nan_sample)
array([[-1.06904497,  0.26726124,  1.60356745],
       [-1.06904497,  0.26726124,         nan]])
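As an alternative to the nan-aware functions (a sketch, not part of the answer above), numpy's masked arrays can ignore the padding automatically:
import numpy as np
nan_sample = np.array([[1., 2., 3.],
                       [1., 2., np.nan]])
masked = np.ma.masked_invalid(nan_sample)               # mask the NaN padding
standardized = (masked - masked.mean()) / masked.std()  # mean/std ignore masked entries
print(standardized)
# masked entries print as --; the other values match the example output above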
@MichaelHackman (following the comment remark):
That's weird, because when I compute the overall std and mean and then apply them, I obtain a different result (see code below).
import numpy as np
Samples = np.array([[1, 2, 3],
                    [1, 2]])
c = np.hstack(Samples)  # gives [1, 2, 3, 1, 2]
mean, std = np.mean(c), np.std(c)
newSamples = np.asarray([(np.array(xi) - mean) / std for xi in Samples])
print(newSamples)
# [array([-1.06904497, 0.26726124, 1.60356745]), array([-1.06904497, 0.26726124])]
edit: Added np.asarray() and moved the mean/std computation outside the loop, following Imanol Luengo's excellent comments (thanks!).

random but consistent permutation of two numpy arrays

Currently I'm facing a problem regarding the permutation of 2 numpy arrays of different row sizes. I know how to utilize the np.random.shuffle function, but I cannot seem to find a solution to my specific problem; the examples from the numpy documentation only refer to nd arrays with the same row sizes, e.g. x.shape=[10][784], y.shape=[10][784].
I want to permute/randomly shuffle the column values in a consistent order for both arrays with these shapes: x.shape=[60000][784], y.shape=[10000][784].
e.g.
x[59000] = [0,1,2,3,4,5,6,7,8,9]
y[9999] = [0,1,2,3,4,5,6,7,8,9]
After the permutation, both of them should be shuffled in the same consistent way, e.g.
x[59000] = [3,0,1,6,7,2,9,8,4,5]
y[9999] = [3,0,1,6,7,2,9,8,4,5]
The shuffle order needs to be consistent over the two arrays, which have different row sizes. I seem to get ValueError: Found input variables with inconsistent numbers of samples: [60000, 10000]. Any ideas on how to fix this issue? I really appreciate any help!
Stick the arrays together and permute the combined array:
merged = numpy.concatenate([x, y])
numpy.random.shuffle(merged.T)  # shuffles the columns of merged in place
x, y = numpy.split(merged, [x.shape[0]])
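A self-contained toy version of this approach (the small arrays here are stand-ins for the (60000, 784) and (10000, 784) arrays in the question):
import numpy as np
x = np.arange(12).reshape(3, 4)        # stand-in for the (60000, 784) array
y = np.arange(8).reshape(2, 4) + 100   # stand-in for the (10000, 784) array
merged = np.concatenate([x, y])        # works because both have the same number of columns
np.random.shuffle(merged.T)            # shuffles the columns of merged in place
x, y = np.split(merged, [x.shape[0]])  # corresponding columns are still aligned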
Also check these older threads:
Better way to shuffle two numpy arrays in unison
Or compute a permutation ahead of time and apply it to the columns of both arrays (note that np.random.shuffle shuffles in place and returns None, so use np.random.permutation instead):
your_permutation = np.random.permutation([0, 1, 2, 3, 4, 5])
x = x[:, your_permutation]
y = y[:, your_permutation]

np.bincount for 1 line, vectorized multidimensional averaging

I am trying to vectorize an operation using numpy. I use it in a python script that I have profiled, and found this operation to be the bottleneck, so it needs to be optimized since I will run it many times.
The operation is on a data set of two parts. First, a large set (n) of 1D vectors of different lengths (with maximum length Lmax) whose elements are integers from 1 to maxvalue. The set of vectors is arranged in a 2D array, data, of size (num_samples, Lmax), with trailing elements in each row zeroed. The second part is a set of scalar floats, one associated with each vector, that I have computed and which depend on its length and the integer value at each position. The set of scalars is made into a 1D array, Y, of size num_samples.
The desired operation is to form the average of Y over the n samples, as a function of (value,position along length,length).
This entire operation can be vectorized in MATLAB with the accumarray function, by using 3 2D arrays of the same size as data, whose elements are the corresponding value, position, and length indices of the desired final array:
sz_Y   = num_samples;
sz_len = Lmax;
sz_pos = Lmax;
sz_val = maxvalue;
ind_len = repmat( 1:sz_len, 1, sz_samples );
ind_pos = repmat( 1:sz_pos, sz_samples, 1 );
ind_val = data;
ind_Y   = repmat( (1:sz_Y)', 1, Lmax );
copiedY = Y(ind_Y);
mask = data > 0;
finalarr = accumarray({ind_val(mask), ind_pos(mask), ind_len(mask)}, copiedY(mask), [sz_val sz_pos sz_len]) / sz_val;
I was hoping to emulate this implementation with np.bincount. However, np.bincount differs from accumarray in two relevant ways:
both arguments must be of same 1D size, and
there is no option to choose the shape of the output array.
In the above usage of accumarray, the list of indices, {ind_val(mask),ind_pos(mask),ind_len(mask)}, is a 1D cell array of 1x3 arrays used as index tuples, while np.bincount, as far as I understand, only accepts 1D scalar indices. I expect np.ravel may be useful but am not sure how to use it here to do what I want. I am coming to python from matlab and some things do not translate directly, e.g. the colon operator, which ravels in the opposite order to ravel. So my question is: how might I use np.bincount or any other numpy method to achieve an efficient python implementation of this operation?
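For orientation, here is a minimal made-up sketch (not the asker's data) of the accumarray pattern that the answers below use: np.ravel_multi_index plays the role of MATLAB's sub2ind, and np.bincount with weights= and minlength= then accumulates into a flat array that can be reshaped to any desired output shape:
import numpy as np
i = np.array([0, 1, 1, 2])                 # first index of each entry
j = np.array([2, 0, 0, 1])                 # second index of each entry
w = np.array([1.0, 2.0, 3.0, 4.0])         # weights to accumulate
shape = (3, 3)
lin = np.ravel_multi_index((i, j), shape)  # like sub2ind
acc = np.bincount(lin, weights=w, minlength=np.prod(shape)).reshape(shape)
# acc[1, 0] == 5.0  (2.0 + 3.0 accumulated into the same cell)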
EDIT: To avoid wasting time: for these multi-D index problems with complicated index manipulation, is the recommended route to just use cython to implement the loops explicitly?
EDIT2: Alternative Python implementation I just came up with.
Here is a heavy-RAM solution:
First precalculate:
Using index units for length (i.e., length 1 = index 0), make a 4D bool array of size (num_samples, Lmax+1, Lmax+1, maxvalue+1), holding where the conditions are satisfied for each value in Y:
ALLcond = np.zeros((num_samples, Lmax+1, Lmax+1, maxvalue+1), dtype='bool')
for l in range(Lmax+1):
    for i in range(Lmax+1):
        for v in range(maxvalue+1):
            ALLcond[:, l, i, v] = (data[:, i] == v) & (Lvec == l)
Where Lvec=[len(row) for row in data]. Then get the indices for these using np.where and initialize a 4D float array into which you will assign the values of Y:
[ind_Y, ind_len, ind_pos, ind_val] = np.where(ALLcond)
Yval = np.zeros(np.shape(ALLcond), dtype='float')
Now in the loop in which I have to perform the operation, I compute it with the two lines:
Yval[ind_Y, ind_len, ind_pos, ind_val] = Y[ind_Y]
Y_avg = Yval.sum(axis=0) / num_samples
This gives a factor of 4 or so speed-up over the direct loop implementation. I was expecting more. Perhaps this is a more tangible implementation for Python heads to digest. Any faster suggestions are welcome :)
One way is to convert the 3 "indices" to a linear index and then apply bincount. Numpy's ravel_multi_index is essentially the same as MATLAB's sub2ind. So the ported code could be something like:
shape = (Lmax+1, Lmax+1, maxvalue+1)
posvec = np.arange(1, Lmax+1)
ind_len = np.tile(Lvec[:,None], [1, Lmax])
ind_pos = np.tile(posvec, [n, 1])
ind_val = data
Y_copied = np.tile(Y[:,None], [1, Lmax])
mask = posvec <= Lvec[:,None] # fill-value independent
lin_idx = np.ravel_multi_index((ind_len[mask], ind_pos[mask], ind_val[mask]), shape)
Y_avg = np.bincount(lin_idx, weights=Y_copied[mask], minlength=np.prod(shape)) / n
Y_avg.shape = shape
This is assuming data has shape (n, Lmax), Lvec is a NumPy array, etc. You may need to adapt the code a little to get rid of off-by-one errors.
One could argue that the tile operations are not very efficient and not very "numpythonic". Something with broadcast_arrays could be nice, but I think I prefer this way:
shape = (Lmax+1, Lmax+1, maxvalue+1)
posvec = np.arange(1, Lmax+1)
mask = posvec <= Lvec[:,None]  # fill-value independent; must be built before it is used below
len_idx = np.repeat(Lvec, Lvec)
pos_idx = np.broadcast_to(posvec, data.shape)[mask]
val_idx = data[mask]
Y_copied = np.repeat(Y, Lvec)
lin_idx = np.ravel_multi_index((len_idx, pos_idx, val_idx), shape)
Y_avg = np.bincount(lin_idx, weights=Y_copied, minlength=np.prod(shape)) / n
Y_avg.shape = shape
Note broadcast_to was added in Numpy 1.10.0.
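To see the second variant run end to end, here is a small made-up setup (the toy data, sizes, and rng seed are assumptions; the index-building lines follow the answer above):
import numpy as np

rng = np.random.default_rng(0)
n, Lmax, maxvalue = 5, 4, 3
Lvec = rng.integers(1, Lmax + 1, size=n)              # true length of each sample
data = np.zeros((n, Lmax), dtype=int)                 # values 1..maxvalue, zero padded
for k, L in enumerate(Lvec):
    data[k, :L] = rng.integers(1, maxvalue + 1, size=L)
Y = rng.random(n)

shape = (Lmax + 1, Lmax + 1, maxvalue + 1)
posvec = np.arange(1, Lmax + 1)
mask = posvec <= Lvec[:, None]                        # True for real (non-padded) entries
len_idx = np.repeat(Lvec, Lvec)
pos_idx = np.broadcast_to(posvec, data.shape)[mask]
val_idx = data[mask]
Y_copied = np.repeat(Y, Lvec)

lin_idx = np.ravel_multi_index((len_idx, pos_idx, val_idx), shape)
Y_avg = np.bincount(lin_idx, weights=Y_copied, minlength=np.prod(shape)) / n
Y_avg.shape = shape                                   # averages indexed by (length, position, value)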

Vectorizing a numpy array call of varying indices

I have a 2D numpy array and a list of lists of indices for which I wish to compute the sum of the corresponding 1D vectors from the numpy array. This can be easily done through a for loop or via list comprehension, but I wonder if it's possible to vectorize it. With similar code I gain about 40x speedups from the vectorization.
Here's sample code:
import numpy as np
indices = [[1,2],[1,3],[2,0,3],[1]]
array_2d = np.array([[0.5, 1.5],[1.5,2.5],[2.5,3.5],[3.5,4.5]])
soln = [np.sum(array_2d[x], axis=-1) for x in indices]
(edit): Note that the indices are not (x, y) coordinates for array_2d; instead, indices[0] = [1,2] represents the first and second vectors (rows) in array_2d. The number of elements in each list in indices can be variable.
This is what I would hope to be able to do:
vectorized_soln = np.sum(array_2d[indices[:]], axis=-1)
Does anybody know if there are any ways of achieving this?
First of all, I think you have a typo in the third element of indices...
The easy way to do that is to build a sub-array with two arrays of indices:
i = np.array([1,1,2])
j = np.array([2,3,?])
sub_arr2d = array_2d[i,j]
and finally, you can take the sum of sub_arr2d...
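Since np.sum(array_2d[x], axis=-1) only sums along the last axis, the per-row sums can be computed once and then gathered for every index list; this is one way to avoid the Python-level work per list (a sketch, not part of the answer above):
import numpy as np
indices = [[1, 2], [1, 3], [2, 0, 3], [1]]
array_2d = np.array([[0.5, 1.5], [1.5, 2.5], [2.5, 3.5], [3.5, 4.5]])
row_sums = array_2d.sum(axis=-1)                  # one vectorized pass over the data
flat = row_sums[np.concatenate(indices)]          # gather all requested rows at once
soln = np.split(flat, np.cumsum([len(x) for x in indices])[:-1])
# soln[2] -> array([6., 2., 8.])  == np.sum(array_2d[[2, 0, 3]], axis=-1)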
