Product of array elements by group in numpy (Python)

I'm trying to build a function that returns the products of subsets of array elements. Basically I want to build a prod_by_group function that does this:
values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 1, 1, 2, 3, 3])
Vprods = prod_by_group(values, groups)
And the resulting Vprods should be:
Vprods
array([6, 4, 30])
There's a great answer here for sums of elements by group that I think this should be similar to:
https://stackoverflow.com/a/4387453/1085691
I tried taking the log first, then sum_by_group, then exp, but ran into numerical issues.
There are some other similar answers here for min and max of elements by group:
https://stackoverflow.com/a/8623168/1085691
Edit: Thanks for the quick answers! I'm trying them out. I should add that I want it to be as fast as possible (that's the reason I'm trying to get it in numpy in some vectorized way, like the examples I gave).
Edit: I evaluated all the answers given so far, and the best one is given by @seberg below. Here's the full function that I ended up using:
def prod_by_group(values, groups):
    order = np.argsort(groups)
    groups = groups[order]
    values = values[order]
    group_changes = np.concatenate(([0], np.where(groups[:-1] != groups[1:])[0] + 1))
    return np.multiply.reduceat(values, group_changes)
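A quick sanity check, with hypothetical unsorted groups to exercise the argsort step:

values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([3, 1, 1, 2, 1, 3])
prod_by_group(values, groups)
# -> array([30,  4,  6])   (group 1: 2*3*5, group 2: 4, group 3: 1*6)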

If your groups are already sorted (if not, you would have to sort values and groups together first with np.argsort to do this efficiently), you can use the reduceat functionality of ufuncs:
# you could do the group_changes somewhat faster if you care a lot
group_changes = np.concatenate(([0], np.where(groups[:-1] != groups[1:])[0] + 1))
Vprods = np.multiply.reduceat(values, group_changes)
Or use mgilson's answer if you have few groups. But if you have many groups, this is much more efficient, since it avoids boolean indexing over every element of the original array for every group, and reduceat avoids slicing in a Python loop.
Of course pandas does these operations conveniently.
Edit: Sorry, I had prod in there; the ufunc is multiply. You can use this method for any binary ufunc, which means it works for basically all numpy functions that can operate element-wise on two input arrays (i.e. multiply normally multiplies two arrays elementwise, add adds them, maximum/minimum, etc.).
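For instance, a minimal sketch reusing the sorted values and group_changes from above to get per-group sums and maxima instead of products:

Vsums = np.add.reduceat(values, group_changes)      # array([ 6,  4, 11])
Vmaxs = np.maximum.reduceat(values, group_changes)  # array([3, 4, 6])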

First set up a boolean mask for the groups, so that each group is expanded into its own row of another dimension:
mask = (groups == np.unique(groups).reshape(-1, 1))
mask
array([[ True, True, True, False, False, False],
[False, False, False, True, False, False],
[False, False, False, False, True, True]], dtype=bool)
now we multiply with values
mask*values
array([[1, 2, 3, 0, 0, 0],
[0, 0, 0, 4, 0, 0],
[0, 0, 0, 0, 5, 6]])
now you can already do prod along axis 1, except for those zeros, which are easy to fix:
np.prod(np.where(mask*values, mask*values, 1), axis=1)
array([ 6, 4, 30])

As suggested in the comments, you can also use the Pandas module. Using the groupby() function, this task becomes a one-liner:
import numpy as np
import pandas as pd
values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 1, 1, 2, 3, 3])
df = pd.DataFrame({'values': values, 'groups': groups})
So df then looks as follows:
   groups  values
0       1       1
1       1       2
2       1       3
3       2       4
4       3       5
5       3       6
Now you can groupby() the groups column and apply numpy's prod() function to each of the groups like this:
df.groupby('groups')['values'].apply(np.prod)
which gives you the desired output:
1 6
2 4
3 30
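As a side note, pandas also ships a built-in prod() aggregation on groupby objects, so (assuming the same df) the apply(np.prod) call could be shortened to:

df.groupby('groups')['values'].prod()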

Well, I doubt this is a great answer, but it's the best I can come up with:
np.array([np.prod(values[np.flatnonzero(groups == x)]) for x in np.unique(groups)])

It's not a numpy solution, but it's fairly readable (I find that sometimes numpy solutions aren't!):
from functools import reduce  # reduce is not a builtin in Python 3
from operator import itemgetter, mul
from itertools import groupby

grouped = groupby(zip(groups, values), itemgetter(0))
products = [reduce(mul, map(itemgetter(1), vals), 1) for key, vals in grouped]
print(products)
# [6, 4, 30]

Related

Differences in an array based on groups defined by another array

I have two arrays of the same size. One, call it A, contains a series of repeated numbers; the other, B, contains random numbers.
import numpy as np
A = np.array([1,1,1,2,2,2,0,0,0,3,3])
B = np.array([1,2,3,6,5,4,7,8,9,10,11])
I need to find the differences in B between the two extremes defined by the groups in A. More specifically, I need an output C such as
C = [2, -2, 2, 1]
where each term is the difference 3 - 1, 4 - 6, 9 - 7, and 11 - 10, i.e., the difference between the extremes in B identified by the groups of repeated numbers in A.
I tried to play around with itertools.groupby to isolate the groups in the first array, but it is not clear to me how to exploit the indexing to operate the differences in the second.
Edit: C is now sorted the same way as in the question
C = []
_, idx = np.unique(A, return_index=True)
for i in A[np.sort(idx)]:
    bs = B[A==i]
    C.append(bs[-1] - bs[0])
print(C)  # [2, -2, 2, 1]
np.unique with return_index=True returns, for each unique value in A, the index of its first appearance.
i in A[np.sort(idx)] iterates over the unique values in the order of the indexes.
B[A==i] extracts the values from B at the indexes where A equals i.
This is easily achieved using pandas' groupby:
A = np.array([1,1,1,2,2,2,0,0,0,3,3])
B = np.array([1,2,3,6,5,4,7,8,9,10,11])
import pandas as pd
pd.Series(B).groupby(A, sort=False).agg(lambda g: g.iloc[-1]-g.iloc[0]).to_numpy()
output: array([ 2, -2, 2, 1])
using itertools.groupby:
from itertools import groupby
[(x:=list(g))[-1][1]-x[0][1] for k, g in groupby(zip(A,B), lambda x: x[0])]
output: [2, -2, 2, 1]
NB: the two solutions will behave differently if the same group value appears in several non-consecutive runs, as the sketch below shows.
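For example, with a hypothetical A2 in which the group label 1 reappears after another group (reusing pd and groupby imported above):

A2 = np.array([1, 1, 2, 2, 1, 1])
B2 = np.array([1, 2, 3, 4, 5, 6])
# pandas merges all rows with the same label, consecutive or not:
pd.Series(B2).groupby(A2, sort=False).agg(lambda g: g.iloc[-1] - g.iloc[0]).to_numpy()
# -> array([5, 1])   (group 1: 6-1, group 2: 4-3)
# itertools.groupby only groups consecutive runs:
[(x := list(g))[-1][1] - x[0][1] for k, g in groupby(zip(A2, B2), lambda x: x[0])]
# -> [1, 1, 1]       (runs [1,1], [2,2], [1,1])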

Python Fancy Indexing Assignments: cannot assign 3 input values to the 6 output values where the mask is true

I am trying to make a zeroed array with the same shape as a source array, then modify every value in the second array that corresponds to a specific value in the first array.
This would be simple enough if I was just replacing one value. Here is a toy example:
import numpy as np
arr1 = np.array([[1,2,3],[3,4,5],[1,2,3]])
arr2 = np.array([[0,0,0],[0,0,0],[0,0,0]])
arr2[arr1==1] = -1
This will work as expected and arr2 would be:
[[-1,0,0],
[ 0,0,0],
[-1,0,0]]
But I would like to replace an entire row. Something like this replacing the last line of the sample code above:
arr2[arr1==[3,4,5]] = [-1,-1,-1]
When I do this, it also works as expected and arr2 would be:
[[ 0, 0, 0],
[-1,-1,-1],
[ 0, 0, 0]]
But when I tried to replace the last line of sample code with something like:
arr2[arr1==[1,2,3]] = [-1,-1,-1]
I expected to get something like the last output, but with the 0th and 2nd rows being changed. But instead I got the following error.
ValueError: NumPy boolean array indexing assignment cannot assign 3 input values to the 6
output values where the mask is true
I assume this is because, unlike the other example, it was going to have to replace more than one row. Though this seems odd to me, since it worked fine replacing more than one value in the simple single value example.
I'm just wondering if anyone can explain this behavior to me, because it is a little confusing; I am not that experienced with the inner workings of numpy operations. I'd also welcome any recommendations for doing what I am trying to accomplish in an efficient manner.
In my real-world implementation, I am working with a very large three-dimensional array (an image with 3 color channels), and I want to make a new array that stores a specific value in these three color channels wherever the source image has a specific three-color value at the corresponding pixel (remaining [0,0,0] if it doesn't match our pixel_rgb_of_interest). I could go through in linear time and just check every single pixel, but this could get slow with many images, and I was wondering if there is a better way.
Thank you!
This would be a good application for numpy.where
>>> import numpy as np
>>> arr1 = np.array([[1,2,3],[3,4,5],[1,2,3]])
>>> arr2 = np.array([[0,0,0],[0,0,0],[0,0,0]])
>>> np.where(arr1 == [1,2,3], [-1,-1,-1], arr1)
array([[-1, -1, -1],
[ 3, 4, 5],
[-1, -1, -1]])
This basically works as "wherever the condition is true, use the x argument, and use the y argument the rest of the time".
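If, as in the question, the non-matching entries should instead stay zero as in arr2, a small variation on the same arrays would pass arr2 as the fallback argument:

>>> np.where(arr1 == [1,2,3], -1, arr2)
array([[-1, -1, -1],
       [ 0,  0,  0],
       [-1, -1, -1]])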
Let's add an "index" array:
In [56]: arr1 = np.array([[1,2,3],[3,4,5],[1,2,3]])
...: arr2 = np.array([[0,0,0],[0,0,0],[0,0,0]])
...: arr3 = np.arange(9).reshape(3,3)
The test against 1 value:
In [57]: arr1==1
Out[57]:
array([[ True, False, False],
[False, False, False],
[ True, False, False]])
that has 2 true values:
In [58]: arr3[arr1==1]
Out[58]: array([0, 6])
We could assign one value as you do, or 2.
Test with a list, which is converted to array first:
In [59]: arr1==[3,4,5]
Out[59]:
array([[False, False, False],
[ True, True, True],
[False, False, False]])
That has 3 True:
In [60]: arr3[arr1==[3,4,5]]
Out[60]: array([3, 4, 5])
so it works to assign a list of 3 values as you do. Or a scalar.
In [61]: arr1==[1,2,3]
Out[61]:
array([[ True, True, True],
[False, False, False],
[ True, True, True]])
Here the test has 6 True.
In [62]: arr3[arr1==[1,2,3]]
Out[62]: array([0, 1, 2, 6, 7, 8])
So we can assign 6 values or a scalar. But you tried to assign 3 values.
Or we could apply all to find the rows that match [1,2,3]:
In [63]: np.all(arr1==[1,2,3], axis=1)
Out[63]: array([ True, False, True])
In [64]: arr3[np.all(arr1==[1,2,3], axis=1)]
Out[64]:
array([[0, 1, 2],
[6, 7, 8]])
To this we could assign a (2,3) array, a scalar, a (3,) array, or a (2,1) (as per broadcasting rules):
In [65]: arr2[np.all(arr1==[1,2,3], axis=1)]=np.array([100,200])[:,None]
In [66]: arr2
Out[66]:
array([[100, 100, 100],
[ 0, 0, 0],
[200, 200, 200]])
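For the image use case described in the question, a minimal sketch (assuming an (H, W, 3) image and a hypothetical target color) would reduce the comparison over the channel axis with np.all, then assign through the resulting 2-D mask:

import numpy as np

img = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)  # hypothetical image
pixel_rgb_of_interest = np.array([255, 0, 0], dtype=np.uint8)
out = np.zeros_like(img)
match = np.all(img == pixel_rgb_of_interest, axis=-1)  # (H, W) boolean mask
out[match] = [1, 2, 3]  # the specific value to store in matching pixels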

Selective deletion by value in numpy array

EDITED: Refined problem statement
I am still figuring out the fancy options which are offered by the numpy library. Following topic came on my desk:
Purpose:
In a multi-dimensional array I select one column. This slicing works fine. But after that, values stored in another list need to be filtered out of the column values.
Current status:
array1 = np.asarray([[0,1,2],[1,0,3],[2,3,0]])
print(array1)
array1woZero = np.nonzero(array1)
print(array1woZero)
toBeRemoved = []
toBeRemoved.append(1)
print(toBeRemoved)
column = array1[:,1]
result = np.delete(column,toBeRemoved)
The above-mentioned code does not bring the expected result. In fact, the np.delete() command just removes the value at index 1, but I would need the value 1 itself to be filtered out instead. What I also do not understand is the shape change when applying nonzero to array1: while array1 is (3, 3), array1woZero turns out to be a tuple of 2 arrays with 6 values each.
(Variable explorer view of array1woZero, reconstructed: a tuple of two int64 arrays of shape (6,), namely array([0, 0, 1, 1, 2, 2]) and array([1, 2, 0, 2, 0, 1]).)
My feeling is that I would require something like slicing with an exclusion operator. Do you have any hints for me to solve that? Is it necessary to use different data structures?
In [18]: arr = np.asarray([[0,1,2],[1,0,3],[2,3,0]])
In [19]: arr
Out[19]:
array([[0, 1, 2],
[1, 0, 3],
[2, 3, 0]])
nonzero gives the indices of all non-zero elements of its argument (arr):
In [20]: idx = np.nonzero(arr)
In [21]: idx
Out[21]: (array([0, 0, 1, 1, 2, 2]), array([1, 2, 0, 2, 0, 1]))
This is a tuple of arrays, one per dimension. That output can be confusing, but it is easily used to return all of those non-zero elements:
In [22]: arr[idx]
Out[22]: array([1, 2, 1, 3, 2, 3])
Indexing like this, with a pair of arrays, produces a 1d array. In your example there is just one 0 per row, but in general that's not the case.
This is the same indexing - with 2 lists of the same length:
In [24]: arr[[0,0,1,1,2,2], [1,2,0,2,0,1]]
Out[24]: array([1, 2, 1, 3, 2, 3])
idx[0] just selects one array of that tuple, the row indices. That probably isn't what you want. And I doubt you want to apply np.delete to that tuple.
It's hard to tell from the description, and code, what you want. Maybe that's because you don't understand what nonzero is producing.
We can also select the nonzero elements with boolean masking:
In [25]: arr>0
Out[25]:
array([[False, True, True],
[ True, False, True],
[ True, True, False]])
In [26]: arr[ arr>0 ]
Out[26]: array([1, 2, 1, 3, 2, 3])
The hint with the boolean masking was very good and helped me to develop my own solution. The symbolic names in the following code snippets are different, but the idea should become clear anyway.
At the beginning, I have my overall searchSpace.
searchSpace = relativeDistances[currentNode,:]
Assume that its shape is (5,). My filter is defined on the indexes, i.e. the range 0..4. Then I define another numpy array, filter, of the same shape, filled with 1, and set the values to be filtered out to 0.
filter = np.full(shape=nodeCount, fill_value=1, dtype=np.int32)
filter[0] = 0
filter[3] = 0
searchSpace = searchSpace * filter
minValue = searchSpace[searchSpace > 0].min()
neighborNode = np.where(searchSpace==minValue)
The filter array provides me the flexibility to adjust the filter later on as part of a loop. Using the element-wise multiplication with 0 and subsequent boolean masking, I can create my reduced searchSpace for minimum search. Compared to a separate array or list, I still have the original shape, which is required to get the correct index in the where-statement.
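For the original goal of the question (filtering the values listed in toBeRemoved out of a column, rather than deleting by index), a minimal sketch using np.isin on the arrays from the question would be:

column = array1[:, 1]
result = column[~np.isin(column, toBeRemoved)]  # keep only values not in toBeRemoved
# column is [1, 0, 3]; with toBeRemoved = [1], result is array([0, 3])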

Expand a numpy array based on a value in another array

I have the following numpy array a = np.array([1,1,2,1,3]) that should be transformed into the following array b = np.array([1,1,1,1,1,1,1,1]).
What should happen is that all the non-1 values in the a array are expanded in the b array to that many ones. Put more simply, the 2 should become 2 ones, and the 3 should become 3 ones.
Frankly, I couldn't find a numpy function that does this, but I'm sure one exists. Any advice would be very welcome! Thank you!
We can simply do -
np.ones(a.sum(),dtype=int)
This will accommodate all numbers, 1s and non-1s alike, because of the summing, and hence gives us the desired output.
In [71]: np.ones(len(a),int).repeat(a)
Out[71]: array([1, 1, 1, 1, 1, 1, 1, 1])
For this small example it is faster than np.ones(a.sum(),int), but it doesn't scale quite as well. But overall both are fast.
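As a side note, the same repeat idiom generalizes if the expanded values should be the elements themselves rather than ones; a quick sketch:

np.repeat(a, a)  # each element repeated by its own value
# -> array([1, 1, 2, 2, 1, 3, 3, 3])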
Here's one possible way, based on which number you want repeated (note the sum over the masked values, so that each non-1 value contributes that many ones):
In [12]: a = np.array([1,1,2,1,3])
In [13]: mask = a != 1
In [14]: np.concatenate((a[~mask], np.repeat(1, np.sum(a[mask]))))
Out[14]: array([1, 1, 1, 1, 1, 1, 1, 1])

Python: turn single array of sorted, repeat values into an array of arrays?

I have a sorted array with some repeated values. How can this array be turned into an array of arrays with the subarrays grouped by value (see below)? In actuality, my_first_array has ~8 million entries, so the solution would preferably be as time efficient as possible.
my_first_array = [1,1,1,3,5,5,9,9,9,9,9,10,23,23]
wanted_array = [ [1,1,1], [3], [5,5], [9,9,9,9,9], [10], [23,23] ]
itertools.groupby makes this trivial:
import itertools
wanted_array = [list(grp) for _, grp in itertools.groupby(my_first_array)]
With no key function, it just yields groups consisting of runs of identical values, so you list-ify each one in a list comprehension; easy-peasy. You can think of it as basically a within-Python API for doing the work of the GNU toolkit program, uniq, and related operations.
In CPython (the reference interpreter), groupby is implemented in C, and it operates lazily and linearly; the data must already appear in runs matching the key function, so sorting might make it too expensive, but for already sorted data like you have, there is nothing that will be more efficient.
Note: If the inputs might be value-identical but different objects, it may make sense for memory reasons to change list(grp) for _, grp to [k] * len(list(grp)) for k, grp. The former retains the original (possibly value- but not identity-duplicate) objects in the final result; the latter replicates the first object from each group instead, reducing the final cost per group to N references to a single object, instead of N references to between 1 and N objects.
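For reference, that memory-saving variant looks like:

wanted_array = [[k] * len(list(grp)) for k, grp in itertools.groupby(my_first_array)]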
I am assuming that the input is a NumPy array and you are looking for a list of arrays as output. You can split the input array at the indices where the value shifts (i.e. at the boundaries between groups of repeats) with np.split. To find such indices, there are two ways: using np.unique with its optional argument return_index set to True, or a combination of np.where and np.diff. Thus, we have the two approaches listed next.
With np.unique -
import numpy as np
_,idx = np.unique(my_first_array, return_index=True)
out = np.split(my_first_array, idx)[1:]
With np.where and np.diff -
idx = np.where(np.diff(my_first_array)!=0)[0] + 1
out = np.split(my_first_array, idx)
Sample run -
In [28]: my_first_array
Out[28]: array([ 1, 1, 1, 3, 5, 5, 9, 9, 9, 9, 9, 10, 23, 23])
In [29]: _,idx = np.unique(my_first_array, return_index=True)
...: out = np.split(my_first_array, idx)[1:]
...:
In [30]: out
Out[30]:
[array([1, 1, 1]),
array([3]),
array([5, 5]),
array([9, 9, 9, 9, 9]),
array([10]),
array([23, 23])]
In [31]: idx = np.where(np.diff(my_first_array)!=0)[0] + 1
...: out = np.split(my_first_array, idx)
...:
In [32]: out
Out[32]:
[array([1, 1, 1]),
array([3]),
array([5, 5]),
array([9, 9, 9, 9, 9]),
array([10]),
array([23, 23])]
Here is a solution, although it might not be very efficient:
my_first_array = [1,1,1,3,5,5,9,9,9,9,9,10,23,23]
wanted_array = [ [1,1,1], [3], [5,5], [9,9,9,9,9], [10], [23,23] ]
new_array = [ [my_first_array[0]] ]
count = 0
for i in range(1, len(my_first_array)):
    a = my_first_array[i]
    if a == my_first_array[i - 1]:
        new_array[count].append(a)
    else:
        count += 1
        new_array.append([])
        new_array[count].append(a)

new_array == wanted_array
This is O(n):
a = [1,1,1,3,5,5,9,9,9,9,9,10,23,23,24]
res = []
s = 0
e = 0
length = len(a)
while s < length:
    b = []
    while e < length and a[s] == a[e]:
        b.append(a[s])
        e += 1
    res.append(b)
    s = e
print(res)
