what does numpy.apply_along_axis perform exactly? - python

I have come across the numpy.apply_along_axis function in some code. And I don't understand the documentation about it.
This is an example of the documentation:
>>> def new_func(a):
... """Divide elements of a by 2."""
... return a * 0.5
>>> b = np.array([[1,2,3], [4,5,6], [7,8,9]])
>>> np.apply_along_axis(new_func, 0, b)
array([[ 0.5, 1. , 1.5],
[ 2. , 2.5, 3. ],
[ 3.5, 4. , 4.5]])
As far I as thought I understood the documentation, I would have expected:
array([[ 0.5, 1. , 1.5],
[ 4 , 5 , 6 ],
[ 7 , 8 , 9 ]])
i.e. having applied the function along the axis [1,2,3] which is axis 0 in [[1,2,3], [4,5,6], [7,8,9]]
Obviously I am wrong. Could you correct me ?

apply_along_axis applies the supplied function along 1D slices of the input array, with the slices taken along the axis you specify. So in your example, new_func is applied over each slice of the array along the first axis. It becomes clearer if you use a vector valued function, rather than a scalar, like this:
In [20]: b = np.array([[1,2,3], [4,5,6], [7,8,9]])
In [21]: np.apply_along_axis(np.diff,0,b)
Out[21]:
array([[3, 3, 3],
[3, 3, 3]])
In [22]: np.apply_along_axis(np.diff,1,b)
Out[22]:
array([[1, 1],
[1, 1],
[1, 1]])
Here, numpy.diff (i.e. the arithmetic difference of adjacent array elements) is applied along each slice of either the first or second axis (dimension) of the input array.

The function is performed on 1-d arrays along axis=0. You can specify another axis using the "axis" argument. A usage of this paradigm is:
np.apply_along_axis(np.cumsum, 0, b)
The function was performed on each subarray along dimension 0. So, it is meant for 1-D functions and returns a 1D array for each 1-D input.
Another example is :
np.apply_along_axis(np.sum, 0, b)
Provides a scalar output for a 1-D array.
Of course you could just set the axis parameter in cumsum or sum to do the above, but the point here is that it can be used for any 1-D function you write.

Related

How to sum every 2 consecutive vectors using numpy

How to sum every 2 consecutive vectors using numpy. Or the mean of every 2 consecutive vectors.
The list of lists (that can have even or uneven number of vectors.)
example:
[[2,2], [1,2], [1,1], [2,2]] --> [[3,4], [3,3]]
Maybe something like this but using numpy and something that actually works on array of vectors and not an array of integers. Or maybe some sort of array comprehension if the that exists.
def pairwiseSum(lst, n):
sum = 0;
for i in range(len(lst)-1):
# adding the alternate numbers
sum = lst[i] + lst[i + 1]
def mean_consecutive_vectors(lst, step):
idx_list = list(range(step, len(lst), step))
new_lst = np.split(lst, idx_list)
return np.mean(new_lst, axis=1)
Same could be done with np.sum() instead of np.mean().
You can reshape your array into pairs, which will allow you to use np.sum() or np.mean() directly by providing the correct axis:
import numpy as np
a = np.array([[2,2], [1,2], [1,1], [2,2]])
np.sum(a.reshape(-1, 2, 2), axis=1)
# array([[3, 4],
# [3, 3]])
Edit to address comment:
To get a the means of each adjacent pair, you can add slices of the original array and broadcast division by 2:
> a = np.array([[2,2], [1,2], [1,1], [2,2], [11, 10], [20, 30]])
> (a[:-1] + a[1:])/2
array([[ 1.5, 2. ],
[ 1. , 1.5],
[ 1.5, 1.5],
[ 6.5, 6. ],
[15.5, 20. ]])

How to use numpy.argsort() as indices in more than 2 dimensions?

I know something similar to this question has been asked many times over already, but all answers given to similar questions only seem to work for arrays with 2 dimensions.
My understanding of np.argsort() is that np.sort(array) == array[np.argsort(array)] should be True.
I have found out that this is indeed correct if np.ndim(array) == 2, but it gives different results if np.ndim(array) > 2.
Example:
>>> array = np.array([[[ 0.81774634, 0.62078744],
[ 0.43912609, 0.29718462]],
[[ 0.1266578 , 0.82282054],
[ 0.98180375, 0.79134389]]])
>>> np.sort(array)
array([[[ 0.62078744, 0.81774634],
[ 0.29718462, 0.43912609]],
[[ 0.1266578 , 0.82282054],
[ 0.79134389, 0.98180375]]])
>>> array.argsort()
array([[[1, 0],
[1, 0]],
[[0, 1],
[1, 0]]])
>>> array[array.argsort()]
array([[[[[ 0.1266578 , 0.82282054],
[ 0.98180375, 0.79134389]],
[[ 0.81774634, 0.62078744],
[ 0.43912609, 0.29718462]]],
[[[ 0.1266578 , 0.82282054],
[ 0.98180375, 0.79134389]],
[[ 0.81774634, 0.62078744],
[ 0.43912609, 0.29718462]]]],
[[[[ 0.81774634, 0.62078744],
[ 0.43912609, 0.29718462]],
[[ 0.1266578 , 0.82282054],
[ 0.98180375, 0.79134389]]],
[[[ 0.1266578 , 0.82282054],
[ 0.98180375, 0.79134389]],
[[ 0.81774634, 0.62078744],
[ 0.43912609, 0.29718462]]]]])
So, can anybody explain to me how exactly np.argsort() can be used as the indices to obtain the sorted array?
The only way I can come up with is:
args = np.argsort(array)
array_sort = np.zeros_like(array)
for i in range(array.shape[0]):
for j in range(array.shape[1]):
array_sort[i, j] = array[i, j, args[i, j]]
which is extremely tedious and cannot be generalized for any given number of dimensions.
Here is a general method:
import numpy as np
array = np.array([[[ 0.81774634, 0.62078744],
[ 0.43912609, 0.29718462]],
[[ 0.1266578 , 0.82282054],
[ 0.98180375, 0.79134389]]])
a = 1 # or 0 or 2
order = array.argsort(axis=a)
idx = np.ogrid[tuple(map(slice, array.shape))]
# if you don't need full ND generality: in 3D this can be written
# much more readable as
# m, n, k = array.shape
# idx = np.ogrid[:m, :n, :k]
idx[a] = order
print(np.all(array[idx] == np.sort(array, axis=a)))
Output:
True
Explanation: We must specify for each element of the output array the complete index of the corresponding element of the input array. Thus each index into the input array has the same shape as the output array or must be broadcastable to that shape.
The indices for the axes along which we do not sort/argsort stay in place. We therefore need to pass a broadcastable range(array.shape[i]) for each of those. The easiest way is to use ogrid to create such a range for all dimensions (If we used this directly, the array would come back unchanged.) and then replace the index correspondingg to the sort axis with the output of argsort.
UPDATE March 2019:
Numpy is becoming more strict in enforcing multi-axis indices being passed as tuples. Currently, array[idx] will trigger a deprecation warning. To be future proof use array[tuple(idx)] instead. (Thanks #Nathan)
Or use numpy's new (version 1.15.0) convenience function take_along_axis:
np.take_along_axis(array, order, a)
#Hameer's answer works, though it might use some simplification and explanation.
sort and argsort are working on the last axis. argsort returns a 3d array, same shape as the original. The values are the indices on that last axis.
In [17]: np.argsort(arr, axis=2)
Out[17]:
array([[[1, 0],
[1, 0]],
[[0, 1],
[1, 0]]], dtype=int32)
In [18]: _.shape
Out[18]: (2, 2, 2)
In [19]: idx=np.argsort(arr, axis=2)
To use this we need to construct indices for the other dimensions that broadcast to the same (2,2,2) shape. ix_ is a handy tool for this.
Just using idx as one of the ix_ inputs doesn't work:
In [20]: np.ix_(range(2),range(2),idx)
....
ValueError: Cross index must be 1 dimensional
Instead I use the last range, and then ignore it. #Hameer instead constructs the 2d ix_, and then expands them.
In [21]: I,J,K=np.ix_(range(2),range(2),range(2))
In [22]: arr[I,J,idx]
Out[22]:
array([[[ 0.62078744, 0.81774634],
[ 0.29718462, 0.43912609]],
[[ 0.1266578 , 0.82282054],
[ 0.79134389, 0.98180375]]])
So the indices for the other dimensions work with the (2,2,2) idx array:
In [24]: I.shape
Out[24]: (2, 1, 1)
In [25]: J.shape
Out[25]: (1, 2, 1)
That's the basics for constructing the other indices when you are given multidimensional index for one dimension.
#Paul constructs the same indices with ogrid:
In [26]: np.ogrid[slice(2),slice(2),slice(2)] # np.ogrid[:2,:2,:2]
Out[26]:
[array([[[0]],
[[1]]]), array([[[0],
[1]]]), array([[[0, 1]]])]
In [27]: _[0].shape
Out[27]: (2, 1, 1)
ogrid as a class works with slices, while ix_ requires a list/array/range.
argsort for a multidimensional ndarray (from 2015) works with a 2d array, but the same logic applies (find a range index(s) that broadcasts with the argsort).
Here's a vectorized implementation. It should be N-dimensional and quite a bit faster than what you're doing.
import numpy as np
def sort1(array, args):
array_sort = np.zeros_like(array)
for i in range(array.shape[0]):
for j in range(array.shape[1]):
array_sort[i, j] = array[i, j, args[i, j]]
return array_sort
def sort2(array, args):
shape = array.shape
idx = np.ix_(*tuple(np.arange(l) for l in shape[:-1]))
idx = tuple(ar[..., None] for ar in idx)
array_sorted = array[idx + (args,)]
return array_sorted
if __name__ == '__main__':
array = np.random.rand(5, 6, 7)
idx = np.argsort(array)
result1 = sort1(array, idx)
result2 = sort2(array, idx)
print(np.array_equal(result1, result2))

Get indices of top N values in 2D numpy ndarray or numpy matrix

I have an array of N-dimensional vectors.
data = np.array([[5, 6, 1], [2, 0, 8], [4, 9, 3]])
In [1]: data
Out[1]:
array([[5, 6, 1],
[2, 0, 8],
[4, 9, 3]])
I'm using sklearn's pairwise_distances function to compute a matrix of distance values. Note that this matrix is symmetric about the diagonal.
dists = pairwise_distances(data)
In [2]: dists
Out[2]:
array([[ 0. , 9.69535971, 3.74165739],
[ 9.69535971, 0. , 10.48808848],
[ 3.74165739, 10.48808848, 0. ]])
I need the indices corresponding to the top N values in this matrix dists, because these indices will correspond the pairwise indices in data that represent vectors with the greatest distances between them.
I have tried doing np.argmax(np.max(distances, axis=1)) to get the index of the max value in each row, and np.argmax(np.max(distances, axis=0)) to get the index of the max value in each column, but note that:
In [3]: np.argmax(np.max(dists, axis=1))
Out[3]: 1
In [4]: np.argmax(np.max(dists, axis=0))
Out[4]: 1
and:
In [5]: dists[1, 1]
Out[5]: 0.0
Because the matrix is symmetric about the diagonal, and because argmax returns the first index it finds with the max value, I end up with the cell in the diagonal in the row and column matching where the max values are stored, instead of the row and column of the top values themselves.
At this point I'm sure I could write some more code to find the values I'm looking for, but surely there is an easier way to do what I'm trying to do. So I have two questions that are more or less equivalent:
How can I find the indices corresponding to the top N values in a matrix, or , how can I find the vectors with the top N pairwise distances from an array of vectors?
I'd ravel, argsort, and then unravel. I'm not claiming this is the best way, only that it's the first way that occurred to me, and I'll probably delete it in shame after someone posts something more obvious. :-)
That said (choosing the top 2 values, arbitrarily):
In [73]: dists = sklearn.metrics.pairwise_distances(data)
In [74]: dists[np.tril_indices_from(dists, -1)] = 0
In [75]: dists
Out[75]:
array([[ 0. , 9.69535971, 3.74165739],
[ 0. , 0. , 10.48808848],
[ 0. , 0. , 0. ]])
In [76]: ii = np.unravel_index(np.argsort(dists.ravel())[-2:], dists.shape)
In [77]: ii
Out[77]: (array([0, 1]), array([1, 2]))
In [78]: dists[ii]
Out[78]: array([ 9.69535971, 10.48808848])
As a slight improvement over the otherwise very good answer by DSM, instead of using np.argsort(), it is more efficient to use np.argpartition() if the order of the N greatest is of no consequence.
Partitioning an array arr with index i rearranges the elements such that the element at index i is the ith greatest, while those on the left are greater and on the right are lesser. The partitions on the left and right are not necessarily sorted. This has the advantage that it runs in linear time.

How to calculate the mean of a stack of arrays?

my stack is something like this
array([[[1, 2, 3],
[4, 5, 6],
[7, 8, 9]],
[[2, 2, 2],
[2, 2, 2],
[2, 2, 2]]])
I want this result:
array([[ 1.5, 2. , 2.5],
[ 3. , 3.5, 4. ],
[ 4.5, 5. , 5.5]])
I updated my question I think it's more clearer now.
Well, first, you don't have a stack of 2D arrays, you have three separate variables.
Fortunately, most functions in NumPy take an array_like argument. And the tuple (a, b, c) is "array-like" enough—it'll be converted into the 3D array that you should have had in the first place.
Anyway, the obvious function to take the mean is np.mean. As the docs say:
The average is taken over the flattened array by default, otherwise over the specified axis.
So just specify the axis you want—the newly-created axis 0.
np.mean((a,b,c), axis=0)
In your updated question, you now have a single 2x3x3 array, a, instead of three 2x2 arrays, a, b, and c, and you want the mean across the first axis (the one with dimension 2). This is the same thing, but slightly easier:
np.mean(a, axis=0)
Or course the mean of 4, 7, and 3 is 4.666666666666667, not 4. In your updated question, that seems to be what you want; in your original question… I'm not sure if you wanted to truncate or round, or if you wanted the median or something else rather than the mean, or anything else, but those are all easy (add dtype=int64 to the call, call .round() on the result, call median instead of mean, etc.).
>>> a = np.array([[1,2],[3,4]])
>>> b = np.array([[1,5],[6,7]])
>>> c = np.array([[1,8],[8,3]])
>>> np.mean((a,b,c), axis=0)
array([[ 1. , 5. ],
[ 5.66666667, 4.66666667]])
As per your output it seems you are looking for median rather than mean.
>>> np.median((a,b,c), axis=0)
array([[ 1., 5.],
[ 6., 4.]])

Find pair of index values that minimizes the Euclidian distance between two meshgrids and a column vector

I want to find the two find the pair of values and their index number in a meshgrid that a closets to another pair of values. Suppose I have two vectors a= np.array([0.01,0.5,0.9]) and b = np.array([0,3,6,10]) and two meshgrids X,Y = np.meshgrid(a,b). For illustration, they look as follows:
X= array([[ 0.1, 0.5, 0.9],
[ 0.1, 0.5, 0.9],
[ 0.1, 0.5, 0.9],
[ 0.1, 0.5, 0.9]])
Y =array([[ 0, 0, 0],
[ 3, 3, 3],
[ 6, 6, 6],
[10, 10, 10]])
Now, I have another array called c of dimension (2 x N). For illustration suppose c contains the following entries:
c = array([[ 0.07268017, 0.08816632, 0.11084398, 0.13352165, 0.1490078 ],
[ 0.00091219, 0.00091219, 0.00091219, 0.00091219, 0.00091219]])
Denote a column vector of c by x. For each vector x I want to find
To complicate matters a bit, I am in fact not only looking for the index with the smallest distance (i,j) but also the second smallest distance (i',j').
All my approaches so far turned out to be extremely complicated and involved a lot of side routes. Does someone have an idea for how to tackle the problem efficiently?
If X, Y always come from meshgrid(), your minimization is separable in X and Y. Just find the closest elements of X to c[0,] and the closest elements of Y to c[1,] ---
you don't need to calculate the 2-dimensional metric.
If either a or b have uniform steps, you can save yourself even more time if you scale the corresponding values of c onto the indexes. In your example, all(a == 0.1+0.4*arange(3)), so you can find the x values by inverting: x = (c[0,] - 0.1)/0.4. If you have an invertible (possibly non-linear) function that maps integers onto b, you can similarly find y values directly by applying the inverse function to c[1,].
This is more a comment than an answer but i like to [... lots of stuff mercifully deleted, that you can still see using the revision history ...]
Complete Revision
As a followup of my own comment, please look at the following
Setup
In [25]: from numpy import *
In [26]: from scipy.spatial import KDTree
In [27]: X= array([[ 0.1, 0.5, 0.9],
[ 0.1, 0.5, 0.9],
[ 0.1, 0.5, 0.9],
[ 0.1, 0.5, 0.9]])
In [28]: Y =array([[ 0, 0, 0],
[ 3, 3, 3],
[ 6, 6, 6],
[10, 10, 10]])
In [29]: c = array([[ 0.07268017, 0.08816632, 0.11084398, 0.13352165, 0.1490078 ],
[ 0.00091219, 0.00091219, 0.00091219, 0.00091219, 0.00091219]])
Solution
Two lines of code, please notice that you have to pass the transpose of your c array.
In [30]: tree = KDTree(zip(X.ravel(), Y.ravel()))
In [31]: tree.query(c.T,k=2)
Out[31]:
(array([[ 0.02733505, 0.4273208 ],
[ 0.01186879, 0.41183469],
[ 0.01088228, 0.38915709],
[ 0.03353406, 0.36647949],
[ 0.04901629, 0.35099339]]), array([[0, 1],
[0, 1],
[0, 1],
[0, 1],
[0, 1]]))
Comment
To interpret the result, the excellent scipy docs inform you that tree.query() gives you back two arrays, containing respectively for each point in c
a scalar or an array of length k>=2 giving you the distances
from the point to the closest point on grid, the second closest, etc,
a scalar or an array of length k>=2 giving you the indices
pointing to the grid point(s) closest (next close etc).
To access the grid point, KDTree maintains a copy of the grid data, e.g
In [32]: tree.data[[0,1]]
Out[32]:
array([[ 0.1, 0. ],
[ 0.5, 0. ]])
where [0,1] is the first element of the second output array.
Should you need the indices of the closest(s) point in the mesh matrices, it simply a matter of using divmod.

Categories

Resources