Get index of numpy-array elements by comparing element-positions between arrays - python

Context
I have the following example-arrays in numpy:
import numpy as np
# All arrays in this example have the shape (15,)
# Note: All values > 0 are unique!
a = np.array([8,5,4,-1,-1, 7,-1,-1,12,11,-1,-1,14,-1,-1])
reference = np.array([0,1,2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14])
lookup = np.array([3,6,0,-2,-2,24,-2,-2,24,48,-2,-2,84,-2,-2])
My goal is to find the elements inside the reference in a, then get the index in a and use it to extract the corresponding elements in lookup.
Finding out the matching elements and their indices works with np.flatnonzero( np.isin() ).
I can also look up the corresponding values:
# Example how to find the index
np.flatnonzero( np.isin( reference, a) )
# -> array([ 4, 5, 7, 8, 11, 12, 14])
# Example how to find corresponding values:
lookup[ np.flatnonzero( np.isin( a, reference) ) ]
# -> array([ 3, 6, 0, 24, 24, 48, 84], dtype=int64)
Problem
I want to fill an array z with the values I looked up, following the reference.
This means that, e.g., the element of z at index 8 holds the lookup value that belongs to reference[8] (= 8). This value would be 3 (reference[8] -> a[0] because a == 8 there -> lookup[0] -> 3).
z = np.zeros(reference.size)
z[np.flatnonzero(np.isin(reference, a))] = ?  # <- NumPy array of correctly ordered lookup values
The expected outcome for z would be:
z = [ 0 0 0 0 0 6 0 24 3 0 0 48 24 0 84]
I cannot get my head around this; I have to avoid for-loops for performance reasons and need a pure NumPy solution (ideally without UDFs).
How can I fill z with the lookup values at the correct positions?
Note: As stated in the code above, all values a > 0 are unique. Thus, there is no need to take care about the duplicated values for a < 0.

You say that you 'have to avoid for-loops due to performance reasons', so I assume that your real-world data structure a is going to be large (thousands or millions of elements?). Depending on the NumPy version and the input sizes, np.isin(reference, a) may effectively scan a for every element of reference, in which case your runtime will be O(len(reference) * len(a)).
I would strongly suggest using a dict for a, allowing lookup in O(1) per element of reference, and loop in python using for. For sufficiently large a this will outperform the 'fast' linear search performed by np.isin.

The most natural way I can think of is to just treat a and lookup as a dictionary:
In [82]: d = dict(zip(a, lookup))
In [83]: np.array([d.get(i, 0) for i in reference])
Out[83]: array([ 0, 0, 0, 0, 0, 6, 0, 24, 3, 0, 0, 48, 24, 0, 84])
This does have a bit of memory overhead but nothing crazy if reference isn't too large.

I actually had an enlightenment.
# Initialize the result
# All non-indexed entries shall be 0
z = np.zeros(reference.size, dtype=np.int64)
Now evaluate which elements in a are relevant:
mask = np.flatnonzero(np.isin(a, reference))
# Short note: if we know that every positive element of a is a number
# that must be in the reference, we can shorten this to
# a simple boolean mask. This will be significantly faster to process.
mask = (a > 0)
Now the following trick: all values a > 0 are unique. Additionally, each value corresponds to a position in reference (e.g. 8 in a corresponds to index 8 in reference). Thus, we can use the values themselves as indices:
z[ a[mask] ] = lookup[mask]
This results in the desired outcome:
z = [ 0 0 0 0 0 6 0 24 3 0 0 48 24 0 84]
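For completeness, here is a self-contained sketch of the same idea that does not rely on the values of a being equal to their positions in reference. It assumes (my assumption, not stated in the question) that reference is sorted and has no duplicates, so that np.searchsorted can translate values into positions. With reference = 0..14 it reduces to the trick above.

```python
import numpy as np

a = np.array([8, 5, 4, -1, -1, 7, -1, -1, 12, 11, -1, -1, 14, -1, -1])
reference = np.arange(15)
lookup = np.array([3, 6, 0, -2, -2, 24, -2, -2, 24, 48, -2, -2, 84, -2, -2])

# Boolean mask of the entries of a that actually occur in reference.
mask = np.isin(a, reference)

# Assumption: reference is sorted and unique, so searchsorted returns the
# position in reference of every matching value of a.
pos = np.searchsorted(reference, a[mask])

# All non-matched positions stay 0.
z = np.zeros(reference.size, dtype=lookup.dtype)
z[pos] = lookup[mask]
print(z)
```

If reference is not sorted, one option would be to sort it first and map the positions back through the sort order.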

Related

Efficient ways to aggregate and replicate values in a numpy matrix

In my work I often need to aggregate and expand matrices of various quantities, and I am looking for the most efficient ways to do these actions. E.g. I'll have an NxN matrix that I want to aggregate from NxN into PxP where P < N. This is done using a correspondence between the larger dimensions and the smaller dimensions. Usually, P will be around 100 or so.
For example, I'll have a hypothetical 4x4 matrix like this (though in practice, my matrices will be much larger, around 1000x1000)
m=np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]])
>>> m
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16]])
and a correspondence like this (schematically):
0 -> 0
1 -> 1
2 -> 0
3 -> 1
that I usually store in a dictionary. This means that indices 0 and 2 (for rows and columns) both get allocated to new index 0 and indices 1 and 3 (for rows and columns) both get allocated to new index 1. The matrix could be anything at all, but the correspondence is always many-to-one when I want to compress.
If the input matrix is A and the output matrix is B, then cell B[0, 0] would be the sum of A[0, 0] + A[0, 2] + A[2, 0] + A[2, 2] because new index 0 is made up of original indices 0 and 2.
The aggregation process here would lead to:
array([[ 1+3+9+11, 2+4+10+12 ],
[ 5+7+13+15, 6+8+14+16 ]])
= array([[ 24, 28 ],
[ 40, 44 ]])
I can do this by making an empty matrix of the right size and looping over all 4x4=16 cells of the initial matrix and accumulating in nested loops, but this seems to be inefficient and the vectorised nature of numpy is always emphasised by people. I have also done it by using np.ix_ to make sets of indices and use m[row_indices, col_indices].sum(), but I am wondering what the most efficient numpy-like way to do it is.
Conversely, what is the sensible and efficient way to expand a matrix using the correspondence the other way? For example with the same correspondence but in reverse I would go from:
array([[ 1, 2 ],
[ 3, 4 ]])
to
array([[ 1, 2, 1, 2 ],
[ 3, 4, 3, 4 ],
[ 1, 2, 1, 2 ],
[ 3, 4, 3, 4 ]])
where the values simply get replicated into the new cells.
In my attempts so far for the aggregation, I have used approaches with pandas methods with groupby on index and columns and then extracting the final matrix with, e.g. df.values. However, I don't know the equivalent way to expand a matrix, without using a lot of things like unstack and join and so on. And I see people often say that using pandas is not time-efficient.
Edit 1: I was asked in a comment about exactly how the aggregation should be done. This is how it would be done if I were using nested loops and a dictionary lookup between the original dimensions and the new dimensions:
>>> m=np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
>>> mnew=np.zeros((2,2))
>>> big2small={0:0, 1:1, 2:0, 3:1}
>>> for i in range(4):
...     inew = big2small[i]
...     for j in range(4):
...         jnew = big2small[j]
...         mnew[inew, jnew] += m[i, j]
...
>>> mnew
array([[24., 28.],
[40., 44.]])
Edit 2: Another comment asked for the aggregation example towards the start to be made more explicit, so I have done so.
Assuming your indices don't have a regular structure, I would try sparse matrices.
import scipy.sparse as ss
import numpy as np
# your current array of indices
g=np.array([[0,0],[1,1],[2,0],[3,1]])
# a sparse matrix of (data=ones, (row_ind=g[:,0], col_ind=g[:,1]))
# it is one for every pair (g[i,0], g[i,1]), zero elsewhere
u=ss.csr_matrix((np.ones(len(g)), (g[:,0], g[:,1])))
Aggregate
m=np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
u.T @ m @ u
Expand
m2 = np.array([[1,2],[3,4]])
u @ m2 @ u.T
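As a rough NumPy-only sketch of the same idea (no SciPy), the indicator matrix can also be built densely; for matrices around 1000x1000 this may be acceptable, but the sparse version will use less memory. The name big2small is taken from the question; everything else here is illustrative. The expansion is done with plain fancy indexing via np.ix_ instead of two matrix products.

```python
import numpy as np

m = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]])

# big2small[i] is the new index of original index i (mapping from the question).
big2small = np.array([0, 1, 0, 1])
n_small = big2small.max() + 1

# Dense 0/1 indicator matrix: u[i, k] == 1 exactly when i maps to k.
u = np.zeros((big2small.size, n_small), dtype=m.dtype)
u[np.arange(big2small.size), big2small] = 1

# Aggregate: sum all cells whose row and column map to the same new cell.
aggregated = u.T @ m @ u

# Expand: replicate each cell of the small matrix into its original cells.
m2 = np.array([[1, 2], [3, 4]])
expanded = m2[np.ix_(big2small, big2small)]

print(aggregated)
print(expanded)
```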

numpy - column-wise and row-wise sums of a given 2d matrix

I have this numpy matrix (ndarray).
array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20],
[21, 22, 23, 24, 25]])
I want to calculate the column-wise and row-wise sums.
I know this is done by calling respectively
np.sum(mat, axis=0) ### column-wise sums
np.sum(mat, axis=1) ### row-wise sums
but I cannot understand these two calls.
Why is axis 0 giving me the sums column-by-column?!
Shouldn't it be the other way around?
I thought the rows are axis 0, and the columns are axis 1.
What I am seeing as a behavior here looks counter-intuitive
(but I am sure it's OK, I guess I am just missing something important).
I am just looking for some intuitive explanation here.
Thanks in advance.
Intuition around arrays and axes
I want to offer 3 types of intuitions here.
Graphical (How to imagine them visually)
Physical (How they are physically stored)
Logical (How to work with them logically)
Graphical intuition
Consider a NumPy array as an n-dimensional object that contains elements along each of its directions.
Axes in this representation are the direction of the tensor. So, a 2D matrix has only 2 axes, while a 4D tensor has 4 axes.
Sum in a given axis can be essentially considered as a reduction in that direction. Imagine a 3D tensor being squashed in such a way that it becomes flat (a 2D tensor). The axis tells us which direction to squash or reduce it in.
Physical intuition
Numpy stores its ndarrays as contiguous blocks of memory. Each element is stored in a sequential manner every n bytes after the previous.
[Figures not reproduced: a 3D array, and the same array as it is stored in memory (images referenced from another SO post).]
When retrieving an element (or a block of elements), NumPy calculates how many strides (bytes) it needs to traverse to get the next element in that direction/axis. So, for the above example, for axis=2 it has to traverse 8 bytes (depending on the datatype) but for axis=1 it has to traverse 8*4 bytes, and axis=0 it needs 8*8 bytes.
An axis in this representation is basically the series of next elements reached by a given stride. Consider the following array -
print(X)
print(X.strides)
[[0 2 1 4 0 0 0]
[5 0 0 0 0 0 0]
[8 0 0 0 0 0 0]
[0 0 0 0 0 0 0]
[0 0 1 0 0 0 0]
[0 0 0 1 0 0 0]]
#Strides (bytes) required to traverse in each axis.
(56, 8)
In the above array, every element after 56 bytes from any element is the next element in axis=0 and every element after 8 bytes from any element is in axis=1. (except from the last element)
Sum or reduction in this regard means taking a sum of every element in that strided series. So, sum over axis=0 means that I need to sum [0,5,8,0,0,0], [2,0,0,0,0,0], ... and sum over axis=1 means just summing [0 2 1 4 0 0 0], [5 0 0 0 0 0 0], ...
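A small sketch to check the stride picture on the array shown above. The dtype is fixed to int64 here (my assumption, so that every element takes 8 bytes regardless of platform).

```python
import numpy as np

X = np.array([[0, 2, 1, 4, 0, 0, 0],
              [5, 0, 0, 0, 0, 0, 0],
              [8, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0, 0],
              [0, 0, 0, 1, 0, 0, 0]], dtype=np.int64)

# 7 elements of 8 bytes per row -> 56 bytes to the next row, 8 to the next column.
print(X.strides)

# Summing over axis=0 adds up each "every 56 bytes" series (one per column).
col_sums = X.sum(axis=0)
# Summing over axis=1 adds up each "every 8 bytes" series (one per row).
row_sums = X.sum(axis=1)

print(col_sums)
print(row_sums)
```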
Logical intuition
This interpretation has to do with element groupings. NumPy stores its ndarrays as groups of groups of groups ... of elements. Elements are grouped together to form the last axis (axis=-1). Another grouping over those groups creates the axis before it (axis=-2). The final outermost group is axis=0.
For example, an array of shape (3, 2, 5) consists of 3 groups of 2 groups of 5 elements.
Similarly, the shape of a NumPy array is also determined by the same.
array_1d = [1, 2, 3]      # shape (3,)
array_2d = [[1, 2, 3]]    # shape (1, 3)
array_3d = [[[1, 2, 3]]]  # shape (1, 1, 3)
...
Axes in this representation are the group in which elements are stored. The outermost group is axis=0 and the innermost group is axis=-1.
Sum or reduction in this regard means that I am reducing elements across that specific group or axis. So, sum over axis=-1 means I sum over the innermost groups. Consider a (6, 5, 8) dimensional tensor. When I say I want a sum over some axis, I want to reduce the elements lying in that grouping / direction to a single value that is equal to their sum.
So,
np.sum(arr, axis=-1) will reduce the inner most groups (of length 8) into a single value and return (6,5,1) or (6,5).
np.sum(arr, axis=-2) will reduce the elements that lie in the 1st axis (or -2nd axis) direction and reduce those to a single value returning (6,1,8) or (6,8)
np.sum(arr, axis=0) will similarly reduce the tensor to (1,5,8) or (5,8).
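The shapes listed above can be checked directly. This is just a sketch on an array of ones; the two shapes per axis correspond to calling np.sum with and without keepdims=True.

```python
import numpy as np

arr = np.ones((6, 5, 8))

shapes = {}
for axis in (-1, -2, 0):
    reduced = arr.sum(axis=axis)              # the summed axis disappears
    kept = arr.sum(axis=axis, keepdims=True)  # the summed axis stays with length 1
    shapes[axis] = (reduced.shape, kept.shape)
    print(axis, reduced.shape, kept.shape)
```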
Hope these 3 intuitions are beneficial to anyone trying to understand how axes and NumPy tensors work in general and how to build an intuitive understanding to work better with them.
Let's start with a one dimensional example:
a, b, c, d, e = 0, 1, 2, 3, 4
arr = np.array([a, b, c, d, e])
If you do,
arr.sum(0)
Output
10
That is the sum of the elements of the array
a + b + c + d + e
Now before moving on to a 2 dimensional example, let's clarify that in numpy the sum of two 1 dimensional arrays is done element wise, for example:
a = np.array([1, 2, 3, 4, 5])
b = np.array([6, 7, 8, 9, 10])
print(a + b)
Output
[ 7 9 11 13 15]
Now if we change our initial variables to arrays, instead of scalars, to create a two dimensional array and do the sum
a = np.array([1, 2, 3, 4, 5])
b = np.array([6, 7, 8, 9, 10])
c = np.array([11, 12, 13, 14, 15])
d = np.array([16, 17, 18, 19, 20])
e = np.array([21, 22, 23, 24, 25])
arr = np.array([a, b, c, d, e])
print(arr.sum(0))
Output
[55 60 65 70 75]
The output is the same as for the 1 dimensional example, i.e. the sum of the elements of the array:
a + b + c + d + e
Just that now the elements of the arrays are 1 dimensional arrays and the sum of those elements is applied. Now before explaining the results, for axis = 1, let's consider an alternative notation to the notation across axis = 0, basically:
np.array([arr[0, :], arr[1, :], arr[2, :], arr[3, :], arr[4, :]]).sum(0) # [55 60 65 70 75]
That is we took full slices in all other indices that were not the first dimension. If we swap to:
res = np.array([arr[:, 0], arr[:, 1], arr[:, 2], arr[:, 3], arr[:, 4]]).sum(0)
print(res)
Output
[ 15 40 65 90 115]
We get the result of the sum along axis=1. So to sum it up, you are always summing elements of the array. The axis indicates how these elements are constructed.
Intuitively, 'axis 0' goes from top to bottom and 'axis 1' goes from left to right. Therefore, when you sum along 'axis 0' you get the column sum, and along 'axis 1' you get the row sum.
As you go along 'axis 0', the row number increases. As you go along 'axis 1' the column number increases.
Think of a 1-dimension array:
mat = np.array([ 1, 2, 3, 4, 5])
Its items are called by mat[0], mat[1], etc
If you do:
np.sum(mat, axis=0)
it will return 15
In the background, it sums all items with mat[0], mat[1], mat[2], mat[3], mat[4]
meaning the first index (axis=0)
Now consider a 2-D array:
mat = np.array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20],
[21, 22, 23, 24, 25]])
When you ask for
np.sum(mat, axis=0)
it will again sum all items based on the first index (axis=0) keeping all the rest same. This means that
mat[0][1], mat[1][1], mat[2][1], mat[3][1], mat[4][1]
will give one sum
mat[0][2], mat[1][2], mat[2][2], mat[3][2], mat[4][2]
will give another one, etc
If you consider a 3-D array, the logic will be the same. Every sum will be calculated on the same axis (index) keeping all the rest same. Sums on axis=0 will be produced by:
mat[0][1][1],mat[1][1][1],mat[2][1][1],mat[3][1][1],mat[4][1][1]
etc
Sums on axis=2 will be produced by:
mat[2][3][0], mat[2][3][1], mat[2][3][2], mat[2][3][3], mat[2][3][4]
etc
I hope you understand the logic. To keep things simple in your mind, consider axis = position of the index in a chain of indices, e.g. axis=3 on a 7-dimensional array will be:
mat[0][0][0][this is our axis][0][0][0]
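To make the "axis = position of the index" rule concrete, here is a small sketch on a made-up 3-D array: summing over axis=1 should be the same as looping over the middle index while keeping the other two fixed.

```python
import numpy as np

arr = np.arange(24).reshape(2, 3, 4)

# Vectorised: reduce over the index in position 1.
axis1 = arr.sum(axis=1)  # shape (2, 4)

# Manual: for every fixed (i, k), add up arr[i][j][k] over all j.
manual = np.zeros((2, 4), dtype=arr.dtype)
for i in range(2):
    for k in range(4):
        manual[i, k] = sum(arr[i][j][k] for j in range(3))

print(np.array_equal(axis1, manual))
```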

Sorting by another matrix works in one case but fails for another

I need to sort matrices according to the descending order of the values in another matrix.
E.g. in a first step I would have the following matrix A:
1 0 1 0 1
0 1 0 1 0
0 1 0 1 1
1 0 1 0 0
Then for the procedure I am following I need to take the rows of the matrix as binary numbers and sort them in descending order of their binary value.
I am doing this the following way:
for i in range(0, num_rows):
    for j in range(0, num_cols):
        row_val[i] = row_val[i] + A[i][j] * (2 ** (num_cols - 1 - j))
This gets me a 4x1 vector row_val with the following values:
21
10
11
20
Now I am sorting the rows of the matrix according to row_val by
A = [x for _,x in sorted(zip(row_val,A),reverse=True)]
This works perfectly fine; I get the matrix A:
1 0 1 0 1
1 0 1 0 0
0 1 0 1 1
0 1 0 1 0
However, now I need to apply the same procedure to the columns. So I calculate the col_val vector with the binary values of the columns:
12
3
12
3
10
To sort the matrix A according to the vector col_val I thought I could just transpose matrix A and then do the same as before:
At = np.transpose(A)
At = [y for _,y in sorted(zip(col_val,At),reverse=True)]
Unfortunately this fails with the error message
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I am suspecting that this might be because there are several entries with the same value in vector col_val, however in an example shown in another question the sorting seems to work for a case with several equal entries.
Your suspicion is correct, you can't sort multidimensional numpy arrays using the Python builtin sorted because the comparison of two rows, say, will yield a row of truth values instead of a single one
A[0] < A[1]
# array([False, True, False, True, False])
so sorted can't tell which should go before the other.
In your first example this is masked by lexicographic ordering of tuples: Because tuples are compared left to right and because row_val has unique entries the comparison never looks at the second elements.
But in your second example because some col_val entries are equal, the comparison will look at At for a tie breaker which is where the exception occurs.
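If you want to stay with the builtin sorted, one possible workaround (a sketch, not the only way) is to pass a key so that only the col_val entries are ever compared and the array rows are never used as a tie breaker. Ties then keep their original relative order, because Python's sort is stable even with reverse=True.

```python
import numpy as np

# The row-sorted matrix A from the question.
A = np.array([[1, 0, 1, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 1],
              [0, 1, 0, 1, 0]])
At = A.T

# Binary value of every column (top row is the most significant bit).
weights = 2 ** np.arange(A.shape[0] - 1, -1, -1)
col_val = weights @ A

# key= restricts the comparison to the first tuple element, so the
# ambiguous array-vs-array comparison never happens.
pairs = sorted(zip(col_val, At), key=lambda pair: pair[0], reverse=True)
A_sorted = np.array([col for _, col in pairs]).T
print(A_sorted)
```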
Here is a working method which uses numpy methods:
A[np.argsort(np.packbits(A, axis=1).ravel())[::-1]]
# array([[1, 0, 1, 0, 1],
# [1, 0, 1, 0, 0],
# [0, 1, 0, 1, 1],
# [0, 1, 0, 1, 0]])
A[:, np.argsort(np.packbits(A, axis=0).ravel())[::-1]]
# array([[1, 1, 1, 0, 0],
# [0, 0, 0, 1, 1],
# [1, 0, 0, 1, 1],
# [0, 1, 1, 0, 0]])
Explanation:
np.packbits, as the name suggests, packs binary vectors into a bit field; it is almost equivalent to your hand-written code. There is one small difference: packbits operates on chunks of 8 and pads with zeros on the right, so for example [1, 1] will go to 192, not 3.
np.argsort does an indirect sort: it doesn't actually move the elements of its operand A but just returns the sequence of indices I that would sort it, so that A[I] == np.sort(A). This is useful when we want to sort something based on the order of something else, as in this case.
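To illustrate the padding difference, here is a sketch that computes the row values once with a vectorised form of the hand-written loop (a dot product with powers of two) and once with np.packbits. The numbers differ, but the resulting order is the same. Note that packbits produces one byte per 8 bits, so the simple ravel() form only covers rows of up to 8 columns.

```python
import numpy as np

# The original matrix A from the question.
A = np.array([[1, 0, 1, 0, 1],
              [0, 1, 0, 1, 0],
              [0, 1, 0, 1, 1],
              [1, 0, 1, 0, 0]])

# Vectorised form of the double loop: leftmost column is the highest bit.
num_cols = A.shape[1]
weights = 2 ** np.arange(num_cols - 1, -1, -1)
row_val = A @ weights

# packbits pads each row with zeros on the right up to 8 bits.
packed = np.packbits(A.astype(np.uint8), axis=1).ravel()

order = np.argsort(row_val)[::-1]
A_sorted = A[order]
print(row_val, packed, order)
```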

Rearrange numpy vector according to mapping rule

I have a vector with 0s and 1s. I want to have a new vector with rearranged values, and I have another vector with a mapping rule:
Example:
input: 1,0,0,1
rule: 0,3,2,1
after mapping:1,1,0,0
The mapping vector determines for each index at which index in the new vector the value can be found.
How do I do that?
Let's say a is the original array and b is the mapping rule. Since the mapping rule says "at which index in the new vector the value can be found", you need to compute a[c] where c is the inverse of the permutation b. The computation of inverse permutation is addressed in detail elsewhere, so I'll pick one of solutions from there:
c = np.zeros(b.size, b.dtype)
c[b] = np.arange(b.size)
new_array = a[c]
Example: if a is [7, 8, 9] and b is [1, 2, 0], this returns [9, 7, 8]. Let's check:
7 went to position 1,
8 went to position 2,
9 went to position 0
The result is correct.
If you did a[b] as suggested by Bort, the result would be [8, 9, 7], which is different. Indeed, in this version the entries of b say where the numbers came from in the original array:
8 came from position 1
9 came from position 2
7 came from position 0
To muddle the matter, the example you gave is a permutation that is equal to its inverse, so the distinction is lost.
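A short sketch that puts both readings side by side on the [7, 8, 9] example. It also shows two other ways to get the same result under the assumption that b is a valid permutation: np.argsort(b) gives the inverse permutation, and a direct scatter new[b] = a avoids computing the inverse at all.

```python
import numpy as np

a = np.array([7, 8, 9])
b = np.array([1, 2, 0])

# Inverse permutation, as in the answer above.
c = np.empty_like(b)
c[b] = np.arange(b.size)

went_to = a[c]      # b[i] says where a[i] goes
came_from = a[b]    # b[i] says where the i-th new value comes from

# Equivalent alternatives for the "went to" reading.
via_argsort = a[np.argsort(b)]
scattered = np.empty_like(a)
scattered[b] = a

print(went_to, came_from)
```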

Sum identical elements of an array based on their indices

What would be an elegant solution for summing all 2's from an array based on their indices?
I have this array x = [2 2 2 3 2 2 2 2 3 3 2 3 2 2 3 3 2]
Then I found their positions with
y = where(isclose(x,2))
and got another array like this: y = (array([ 0, 1, 2, 4, 5, 6, 7, 10, 12, 13, 16]),)
So how can I use numpy to calculate the sum of the elements in x based on the indices in y?
You can simply use indexing to get the corresponding items and then use np.sum:
>>> np.sum(x[np.where(x==2)[0]])
22
Also note that you don't need isclose within np.where; you can just use x == 2. And, as said in a comment, this is not the proper way of doing this task if this is the only problem.
You don't need to use np.where for this - an array of booleans, like the one returned by np.isclose or the various comparison operators, works as an index to another array (provided the sizes match). This means you get all of the 2's with:
>>> x[np.isclose(x, 2)]
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
And sum them directly:
>>> x[np.isclose(x, 2)].sum()
22
If x contains only non-negative ints, you could sum the occurrences of each value with
total = np.bincount(x, weights=x)
# array([ 0., 0., 22., 18.])
The value of total[2] is 22 since there are 11 two's in x.
The value of total[3] is 18 since there are 6 three's in x.
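Here is a self-contained sketch that runs the approaches from this thread on the x from the question, so the numbers can be checked. The variable names are mine.

```python
import numpy as np

x = np.array([2, 2, 2, 3, 2, 2, 2, 2, 3, 3, 2, 3, 2, 2, 3, 3, 2])

# Boolean mask indexing: no np.where needed.
mask = (x == 2)
sum_of_twos = x[mask].sum()
count_of_twos = np.count_nonzero(mask)

# bincount with weights sums every distinct non-negative int in one pass;
# totals[v] is the sum of all entries equal to v (as floats).
totals = np.bincount(x, weights=x)

print(sum_of_twos, count_of_twos, totals)
```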
