Vectorize grouped summation

Vectorize grouped summation - python

I have an algorithm with plain Python loops which I would like to optimize
for speed.
Starting from an array indicating bin indices, I want summed up values for those bins.
More detailed: I start from an index array pointing to a vector of values and values for the same index should be summed up. The plain and slow Python version is like:
import numpy
ix = numpy.array([0 , 1 , 1 , 4 ])
values = numpy.array([10, 20, 30, 40])
# this models bin assignment:
# 10 belongs to bin 0
# 20 and 30 belong to bin 1
# 40 belongs to bin 4
summed = numpy.zeros_like(values)
for i in ix:
summed[i] += values[ix[i]]
print summed
[10, 50, 0, 0, 40]
This is quite slow and I ask if anybody can give me a hint how to
vectorize this.

You can use numpy.bincount():
>>> numpy.bincount(ix, values)
array([ 10., 50., 0., 0., 40.])

Bit of a hack but this gives the same result
indicator = np.arange(5)[None, ...] == ix[..., None]
summed = np.sum(values[..., None] * indicator, axis=0)

Related

Efficient ways to aggregate and replicate values in a numpy matrix

In my work I often need to aggregate and expand matrices of various quantities, and I am looking for the most efficient ways to do these actions. E.g. I'll have an NxN matrix that I want to aggregate from NxN into PxP where P < N. This is done using a correspondence between the larger dimensions and the smaller dimensions. Usually, P will be around 100 or so.
For example, I'll have a hypothetical 4x4 matrix like this (though in practice, my matrices will be much larger, around 1000x1000)
m=np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]])
>>> m
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16]])
and a correspondence like this (schematically):
0 -> 0
1 -> 1
2 -> 0
3 -> 1
that I usually store in a dictionary. This means that indices 0 and 2 (for rows and columns) both get allocated to new index 0 and indices 1 and 3 (for rows and columns) both get allocated to new index 1. The matrix could be anything at all, but the correspondence is always many-to-one when I want to compress.
If the input matrix is A and the output matrix is B, then cell B[0, 0] would be the sum of A[0, 0] + A[0, 2] + A[2, 0] + A[2, 2] because new index 0 is made up of original indices 0 and 2.
The aggregation process here would lead to:
array([[ 1+3+9+11, 2+4+10+12 ],
[ 5+7+13+15, 6+8+14+16 ]])
= array([[ 24, 28 ],
[ 40, 44 ]])
I can do this by making an empty matrix of the right size and looping over all 4x4=16 cells of the initial matrix and accumulating in nested loops, but this seems to be inefficient and the vectorised nature of numpy is always emphasised by people. I have also done it by using np.ix_ to make sets of indices and use m[row_indices, col_indices].sum(), but I am wondering what the most efficient numpy-like way to do it is.
Conversely, what is the sensible and efficient way to expand a matrix using the correspondence the other way? For example with the same correspondence but in reverse I would go from:
array([[ 1, 2 ],
[ 3, 4 ]])
to
array([[ 1, 2, 1, 2 ],
[ 3, 4, 3, 4 ],
[ 1, 2, 1, 2 ],
[ 3, 4, 3, 4 ]])
where the values simply get replicated into the new cells.
In my attempts so far for the aggregation, I have used approaches with pandas methods with groupby on index and columns and then extracting the final matrix with, e.g. df.values. However, I don't know the equivalent way to expand a matrix, without using a lot of things like unstack and join and so on. And I see people often say that using pandas is not time-efficient.
Edit 1: I was asked in a comment about exactly how the aggregation should be done. This is how it would be done if I were using nested loops and a dictionary lookup between the original dimensions and the new dimensions:
>>> m=np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
>>> mnew=np.zeros((2,2))
>>> big2small={0:0, 1:1, 2:0, 3:1}
>>> for i in range(4):
... inew = big2small[i]
... for j in range(4):
... jnew = big2small[j]
... mnew[inew, jnew] += m[i, j]
...
>>> mnew
array([[24., 28.],
[40., 44.]])
Edit 2: Another comment asked for the aggregation example towards the start to be made more explicit, so I have done so.

Assuming you don't your indices don't have a regular structure I would do it try sparse matrices.
import scipy.sparse as ss
import numpy as np
# your current array of indices
g=np.array([[0,0],[1,1],[2,0],[3,1]])
# a sparse matrix of (data=ones, (row_ind=g[:,0], col_ind=g[:,1]))
# it is one for every pair (g[i,0], g[i,1]), zero elsewhere
u=ss.csr_matrix((np.ones(len(g)), (g[:,0], g[:,1])))
Aggregate
m=np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
u.T # m # u
Expand
m2 = np.array([[1,2],[3,4]])
u # m2 # u.T

Get index of numpy-array elements by comparing element-positions between arrays

Context
I have the following example-arrays in numpy:
import numpy as np
# All arrays in this example have the shape (15,)
# Note: All values > 0 are unqiue!
a = np.array([8,5,4,-1,-1, 7,-1,-1,12,11,-1,-1,14,-1,-1])
reference = np.array([0,1,2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14])
lookup = np.array([3,6,0,-2,-2,24,-2,-2,24,48,-2,-2,84,-2,-2])
My goal is to find the elements inside the reference in a, then get the index in a and use it to extract the corresponding elements in lookup.
Finding out the matching elements and their indices works with np.flatnonzero( np.isin() ).
I can also lookup the correspodning values:
# Example how to find the index
np.flatnonzero( np.isin( reference, a) )
# -> array([ 4, 5, 7, 8, 11, 12, 14])
# Example how to find corresponding values:
lookup[ np.flatnonzero( np.isin( a, reference) ) ]
# -> array([ 3, 6, 0, 24, 24, 48, 84], dtype=int64)
Problem
I want to fill an array z with the values I looked up, following the reference.
This means, that e.g. the 8th element of z corresponds to the 8th element in the lookup-value for the 8th element in reference (= 8). This value would be 3 (reference[8] -> a[0] because a==8 here -> lookup[0] -> 3).
z = np.zeros(reference.size)
z[np.flatnonzero(np.isin(reference, a))] = ? -> numpy-array of correctly ordered lookup_values
The expected outcome for z would be:
z = [ 0 0 0 0 0 6 0 24 3 0 0 48 24 0 84]
I cannot get my head around this; I have to avoid for-loops due to performance reasons and would need a pure numpy-solution (best without udfs).
How can I fill z according with the lookup-values at the correct position?
Note: As stated in the code above, all values a > 0 are unique. Thus, there is no need to take care about the duplicated values for a < 0.

You say that you 'have to avoid for-loops due to performance reasons', so I assume that your real-world datastructure a is going to be large (thousands or millions of elements?). Since np.isin(reference, a) performs a linear search in a for every element of reference, your runtime will be O(len(reference) * len(a)).
I would strongly suggest using a dict for a, allowing lookup in O(1) per element of reference, and loop in python using for. For sufficiently large a this will outperform the 'fast' linear search performed by np.isin.

The most natural way I can think of is to just treat a and lookup as a dictionary:
In [82]: d = dict(zip(a, lookup))
In [83]: np.array([d.get(i, 0) for i in reference])
Out[83]: array([ 0, 0, 0, 0, 0, 6, 0, 24, 3, 0, 0, 48, 24, 0, 84])
This does have a bit of memory overhead but nothing crazy if reference isn't too large.

I actually had an enlightenment.
# Initialize the result
# All non-indexed entries shall be 0
z = np.zeros(reference.size, dtype=np.int64)
Now evaluate which elements in a are relevant:
mask = np.flatnonzero(np.isin(a, reference))
# Short note: If we know that any positive element of a is a number
# Which has to be in the reference, we can also shorten this to
# a simple boolean mask. This will be significantly faster to process.
mask = (a > 0)
Now the following trick: All values a > 0 are unique. Additionally, their value corresponds to the position in reference (e.g. 8 in a shall correspond to the 8th position in reference. Thus, we can use the values as index themselves:
z[ a[mask] ] = lookup[mask]
This results in the desired outcome:
z = [ 0 0 0 0 0 6 0 24 3 0 0 48 24 0 84]

Python: find consecutive values in 3D numpy array without using groupby?

Say that you have the following 3D numpy array:
matrices=
numpy.array([[[1, 0, 0], #Level 0
[1, 1, 1],
[0, 1, 1]],
[[0, 1, 0], #Level 1
[1, 1, 0],
[0, 0, 0]],
[[0, 0, 1], #Level 2
[0, 1, 1],
[1, 0, 1]]])
And that you want to compute the number of times you get consecutive values of 1 for each cell. Let's say you want to count the number of occurrences of 2 and 3 consecutive values of 1 for each cell. The result should be something like this:
two_cons=([[0,0,0],
[1,1,0],
[0,0,0]])
three_cons=([[0,0,0],
[0,1,0],
[0,0,0]])
meaning that two cells have had at least 2 consecutive values of 1, and only one has had 3 consecutive values.
I know this could be done by using groupby, extracting the "vertical" series of values for each cell, and counting how many times you get n consecutive ones:
import numpy
two_cons=numpy.zeros((3,3))
for i in range(0,matrices.shape[0]): #Iterate through each "level"
for j in range(0,matrices.shape[1]):
vertical=matrices[:,i,j] #Extract the series of 0-1 for each cell of the matrix
#Determine the occurrence of 2 consecutive values
cons=numpy.concatenate([numpy.cumsum(c) if c[0] == 1 else c for c in numpy.split(vertical, 1 + numpy.where(numpy.diff(vertical))[0])])
two_cons[i][j]=numpy.count_nonzero(cons==2)
In this example, you get that:
two_cons=
array([[ 0., 0., 0.],
[ 1., 1., 0.],
[ 0., 0., 0.]])
My question: how can I do this if I cannot access vertical? In my real case, the 3D numpy array is too large for me to extract vertical series across many levels, so I have to loop through each level at once, and kind of keep memory of what happened at the previous n levels. What do you suggest to do?

I haven't checked the code, but something like this should work... the idea is to scan the matrix along the third dimension and have 2 helper matrices, one keeping track of the length of the actual sequence of ones, and one keeping track of the best sequence encountered so far.
bests = np.zeros(matrices.shape[:-1])
counter = np.zeros(matrices.shape[:-1])
for depth in range(matrices.shape[0]):
this_level = matrices[depth, :, :]
counter = counter * this_level + this_level
bests = (np.stack([bests, counter], axis=0)).max(axis=0)
two_con = bests > 1
three_con = bests > 2

Numpy Array Conditional Operation Mask?

Suppose you have an array:
a =
[ 0,1,0]
[-1,2,1]
[3,-4,2]
And lets say you add 20 to everything
b =
[ 20, 21, 20]
[ 19, 22, 21]
[ 23, 16, 22]
Now lets say I want to add the resulting b to the original array a but only in cases where a < 0 i.e at the index [0,1] and [1,2] where a = -1, -4 respectively getting the value 0 otherwise. Ultimately leading to a matrix as such:
c =
[ 0, 0, 0]
[ 18, 0, 0]
[ 0, 12, 0]
18 = 19 (from b) + -1 (from a)
12 = 16 (from b) + -4 (from a)
And assume that I want to be able to extend this to any operation (not just add 20), so that you can't just filter all values < 20 from matrix c. So I want to use matrix a as a mask toward matrix c, zeroing the i, j where a[i,j] < 0.
I'm having a tough time finding a concise example of how to do this in numpy with python. I was hoping you may be able to direct me to the correct implementation of such a method.
What I am struggling to get is this into a mask and only performing operations on the retained values, finally resulting in c.
Thanks for the help in advance.

Probably something like:
(a + b)*(a<0)
should work unless you have very strong requirements concerning the number of intermediate arrays.

You can do this through a combination of boolean indexing and broadcasting. Working example below,
import numpy as np
a = np.array([[ 0,1,0],[-1,2,1],[3,-4,2]])
b = a+20
c = np.zeros(a.shape)
c[a<0] = b[a<0] + a[a<0]
which gives c as
array([[ 0., 0., 0.],
[ 18., 0., 0.],
[ 0., 12., 0.]])
The only important line in the code snippet above is the last one. Because the entries of a, b, and c are all aligned, we can say we want only the corresponding indices of c where a<0 to be assigned to the sum of the entries in b and a where a<0.

Here is another way to get the same result:
c = np.where(a < 0, a + b, 0)
Although this is slightly more verbose than Thomas Baruchel's solution, I found the method signature similar to the ternary operation (a < 0 ? a + b : 0), which makes it easier for me to understand what it is doing right away. Also, this is still a one-liner which makes it elegant enough in my opinion.
reference: numpy.where

Perhaps not the cleanest solution, but how about this?:
def my_mask(a, b, threshold=0):
c = numpy.zeros(a.shape)
idx = np.where(a < threshold)
for ii in idx:
c[ii[1], ii[0]] = a[ii[1], ii[0]] + b[ii[1], ii[0]]
return c

The solution using numpy.zeros_like function:
import numpy as np
# the initial array
a = [[ 0,1,0],
[-1,2,1],
[3,-4,2]]
a = np.array(a)
b = a + 20 # after adding 20 to each element
c = np.zeros_like(a) # resulting matrix (filled with zeros by default)
neg = a < 0 # indeces of negative values
c[neg] = b[neg] + a[neg] # overriding the needed elements
print(c)
The output:
[[ 0 0 0]
[18 0 0]
[ 0 12 0]]

Numpy: mean calculation results in nan values

I have an array of values x:
x=numpy.array([[-0.11361818 -0.113618185 -0.98787775 -0.09719566],
[-0.11361818 -0.04173076 -0.98787775 -0.09719566],
[-0.11361818 -0.04173076 -0.98787775 -0.09719566],
[-0.62610493 -0.71682393 -0.24673653 -0.18242028],
[-0.62584854 -0.71613061 -0.24904998 -0.18287883],
[-0.62538661 -0.71551038 -0.25160676 -0.18338629]])
and an array of corresponding classes labels y:
y=numpy.array([1, 1, 2, 3, 4, 4])
The first class label 1 in y belongs to the first row in array x, the second class label 1 in y belongs to the second row in array x and so on.
Now I want to calculate the mean values for each class 1-4. For example, row 1 and 2 in x both belong to class 1, so I calculate the mean of row 1 and 2.
I have the following code:
means = numpy.array([x[y == i].mean(axis=0) for i in xrange(4)])
When I do this I end up with this result:
array([[ nan],
[-1.27636606],
[-1.24042235],
[-1.77208567]])
If I take xrange(6), I have this result:
array([[ nan],
[-1.27636606],
[-1.24042235],
[-1.77208567],
[-1.774899 ],
[ nan]])
Why is this the case and how do I get rid off the nans and end up with my 4 mean values only?
I have the code from here, where they took the number of classes as argument for xrange(), and I don't quite see what I did differently.
Thanks in advance for your help!

xrange(4) results in the values [0, 1, 2, 3]. Your first value in means is nan because you don't have a y value equal to zero.
Instead, do:
In [49]: means = numpy.array([x[y == i].mean(axis=0) for i in xrange(1, 5)])
In [50]: means
Out[50]:
array([[-1.27636606],
[-1.24042235],
[-1.77208567],
[-1.774899 ]])

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Vectorize grouped summation - python

You can use numpy.bincount(): >>> numpy.bincount(ix, values) array([ 10., 50., 0., 0., 40.])

Bit of a hack but this gives the same result indicator = np.arange(5)[None, ...] == ix[..., None] summed = np.sum(values[..., None] * indicator, axis=0)

Related

Efficient ways to aggregate and replicate values in a numpy matrix

Get index of numpy-array elements by comparing element-positions between arrays

Python: find consecutive values in 3D numpy array without using groupby?

Numpy Array Conditional Operation Mask?

Numpy: mean calculation results in nan values

Categories

Resources