Element-wise broadcasting for comparing two NumPy arrays? - python

Let's say I have an array like this:
import numpy as np
base_array = np.array([-13, -9, -11, -3, -3, -4, 2, 2,
                       2, 5, 7, 7, 8, 7, 12, 11])
Suppose I want to know: "how many elements in base_array are greater than 4?" This can be done simply by exploiting broadcasting:
np.sum(4 < base_array)
For which the answer is 7. Now, suppose instead of comparing to a single value, I want to do this over an array. In other words, for each value c in the comparison_array, find out how many elements of base_array are greater than c. If I do this the naive way, it obviously fails because it doesn't know how to broadcast it properly:
comparison_array = np.arange(-13, 13)
comparison_result = np.sum(comparison_array < base_array)
Output:
Traceback (most recent call last):
  File "<pyshell#87>", line 1, in <module>
    np.sum(comparison_array < base_array)
ValueError: operands could not be broadcast together with shapes (26,) (16,)
If I could somehow have each element of comparison_array get broadcast to base_array's shape, that would solve this. But I don't know how to do such an "element-wise broadcasting".
Now, I do know how to implement this for both cases using list comprehension:
first = sum([4 < i for i in base_array])
second = [sum([c < i for i in base_array])
          for c in comparison_array]
print(first)
print(second)
Output:
7
[15, 15, 14, 14, 13, 13, 13, 13, 13, 12, 10, 10, 10, 10, 10, 7, 7, 7, 6, 6, 3, 2, 2, 2, 1, 0]
But as we all know, this will be orders of magnitude slower than a correctly-vectorized numpy implementation on larger arrays. So, how should I do this in numpy so that it's fast? Ideally this solution should extend to any kind of operation where broadcasting works, not just greater-than or less-than in this example.

You can simply add a dimension to the comparison array, so that the comparison is "stretched" across all values along the new dimension.
>>> np.sum(comparison_array[:, None] < base_array)
228
This is the fundamental principle with broadcasting, and works for all kinds of operations.
If you need the sum done along an axis, you just specify the axis along which you want to sum after the comparison.
>>> np.sum(comparison_array[:, None] < base_array, axis=1)
array([15, 15, 14, 14, 13, 13, 13, 13, 13, 12, 10, 10, 10, 10, 10,  7,  7,
        7,  6,  6,  3,  2,  2,  2,  1,  0])
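The same trick is not specific to less-than; as a minimal sketch, the identical pattern with == counts how many times each comparison value occurs in base_array:
>>> np.sum(comparison_array[:, None] == base_array, axis=1)
array([1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 3, 0, 0, 1, 0, 3, 1,
       0, 0, 1, 1])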

You will want to turn one of the arrays into a column for broadcasting to work correctly. When you broadcast two arrays together, the dimensions are lined up and any unit dimensions are effectively stretched to the non-unit size that they match. So two arrays of shape (16, 1) (the original array) and (1, 26) (the comparison array) broadcast to (16, 26).
Don't forget to sum across the dimension of size 16 (axis 0), so that you get one count per comparison value:
(base_array[:, None] > comparison_array).sum(axis=0)
None in an index is equivalent to np.newaxis: it's one of many ways to insert a new unit dimension at the specified position. The reason that you don't need to write comparison_array[None, :] is that broadcasting lines up the trailing dimensions and fills in the missing leading ones with ones automatically.
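A quick shape check makes the alignment concrete (an illustrative sketch):
>>> base_array[:, None].shape
(16, 1)
>>> (base_array[:, None] > comparison_array).shape
(16, 26)
>>> np.broadcast(base_array[:, None], comparison_array).shape
(16, 26)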

Here's one with np.searchsorted, with a focus on memory efficiency and hence performance -
def get_comparative_sum(base_array, comparison_array):
    n = len(base_array)
    base_array_sorted = np.sort(base_array)
    # with side='right', idx[i] is the number of elements <= comparison_array[i],
    # so n - idx is the count of elements strictly greater than each value
    idx = np.searchsorted(base_array_sorted, comparison_array, 'right')
    return n - idx
Timings -
In [40]: np.random.seed(0)
    ...: base_array = np.random.randint(-1000,1000,(10000))
    ...: comparison_array = np.random.randint(-1000,1000,(20000))
# @miradulo's soln
In [41]: %timeit np.sum(comparison_array[:, None] < base_array, axis=1)
1 loop, best of 3: 386 ms per loop
In [42]: %timeit get_comparative_sum(base_array, comparison_array)
100 loops, best of 3: 2.36 ms per loop
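As a sanity check (a small sketch, reusing base_array and comparison_array as defined above), the brute-force broadcast and the searchsorted version agree, since both count the elements of base_array strictly greater than each comparison value:
brute = np.sum(comparison_array[:, None] < base_array, axis=1)
fast = get_comparative_sum(base_array, comparison_array)
assert np.array_equal(brute, fast)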

Related

Is there a way to add different arrays where some elements from one array are used repeatedly?

I'm new to python and I'm trying to find the best way to transform my array.
I have two arrays, A and B. I want to add them together such that every value of array A is added to two consecutive values of array B:
A = np.array([2, 4, 6, 8, 10])
B = np.array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10])
Combining the two would give me array C as
C = np.array([12, 12, 14, 14, 16, 16, 18, 18, 20, 20])
I thought maybe a for loop might achieve this, but I'm not sure how to specify that each value of array A is applied twice before continuing. Any help would be appreciated, thank you so much!
It's sort of hacky and not quite for a beginner:
(A[:, None] + B.reshape((-1, 2))).reshape(-1)
A[:, None] treats A as a 5x1 array. B.reshape((-1, 2)) treats B as a 5x2 array, where each row holds the pair of B values that share one A value. NumPy knows how to add 5x1 arrays to 5x2 arrays via broadcasting. The final reshape turns the 5x2 array back into a 10-element array.
-1 in a reshape says "make this dimension as large as necessary to use all the data." That way, I don't have to bake 5x2 into the code, and this will work for any n-element and 2n-element arrays.
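The same trick generalizes beyond pairs; a small sketch (with a hypothetical repeat factor k) that adds each value of A to k consecutive values of B:
import numpy as np

A = np.array([2, 4, 6, 8, 10])
B = np.full(10, 10)
k = len(B) // len(A)  # how many consecutive B values share one A value; 2 here

C = (A[:, None] + B.reshape((-1, k))).reshape(-1)
print(C)  # [12 12 14 14 16 16 18 18 20 20]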
You could reshape, add and reshape back:
import numpy as np
A = np.array([2, 4, 6, 8, 10])
B = np.array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10])
res = (B.reshape((-1, 2)) + A[:, None]).reshape(-1)
print(res)
Output
[12 12 14 14 16 16 18 18 20 20]
The expression:
B.reshape((-1, 2))
creates the following array:
[[10 10]
 [10 10]
 [10 10]
 [10 10]
 [10 10]]
basically you tell numpy to fit the array into an N-by-2 shape, where N is determined from the original size of B (this is what the -1 does). The other part:
A[:, None]
creates:
[[ 2]
 [ 4]
 [ 6]
 [ 8]
 [10]]
Then using broadcasting you can add them together. Finally reshape back.
You could use slicing to index every other item of B, then augmented addition for an in-place transformation. I don't know the details of how numpy will handle the slicing, but I think this will use the least memory.
B[::2] += A
B[1::2] += A
Another way to expand and reshape is
B += np.array([A, A]).flatten("F")
It looks to me that this will use 4x the size of A, but I think all of the reshaping methods will eat memory.
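To see what the "F" (Fortran, i.e. column-major) order does here, a quick sketch:
import numpy as np

A = np.array([2, 4, 6, 8, 10])
stacked = np.array([A, A])     # shape (2, 5): two copies of A as rows
print(stacked.flatten("F"))    # [ 2  2  4  4  6  6  8  8 10 10]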

Minimum distance for each value in array respect to other

I have two numpy arrays of integers A and B. The values in arrays A and B correspond to time-points at which events A and B occurred. I would like to transform A to contain the time since the most recent event B occurred.
I know I need to subtract from each element of A its nearest smaller element in B, but I am unsure of how to do so. Any help would be greatly appreciated.
>>> import numpy as np
>>> A = np.array([11, 12, 13, 17, 20, 22, 33, 34])
>>> B = np.array([5, 10, 15, 20, 25, 30])
Desired Result:
cond_a = relative_timestamp(to_transform=A, reference=B)
cond_a
>>> array([1, 2, 3, 2, 0, 2, 3, 4])
You can use np.searchsorted to find the indices where the elements of A would be inserted in B to maintain order. In other words, for each element in A you are finding its closest preceding element in B:
idx = np.searchsorted(B, A, side='right')
result = A - B[idx-1]  # subtract one to index the most recent smaller element
According to the docs searchsorted uses binary search, so it will scale fine for large inputs.
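Put together with the arrays from the question, a minimal runnable sketch:
import numpy as np

A = np.array([11, 12, 13, 17, 20, 22, 33, 34])
B = np.array([5, 10, 15, 20, 25, 30])

# note: assumes every value in A has at least one B value <= it (true here);
# otherwise idx could be 0 and B[idx-1] would wrap around
idx = np.searchsorted(B, A, side='right')
print(A - B[idx - 1])  # [1 2 3 2 0 2 3 4]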
Here's an approach based on computing the pairwise differences. Note that it has O(n**2) complexity, so for larger arrays @brenlla's answer will perform much better.
The idea here is to use np.subtract.outer and then find the minimum difference along axis 1 over a masked array, where only values in B smaller than a are considered:
dif = np.abs(np.subtract.outer(A,B))
np.ma.array(dif, mask = A[:,None] < B).min(1).data
# array([1, 2, 3, 2, 0, 2, 3, 4])
As I am not sure it is really faster to calculate all pairwise differences instead of using a Python loop over each array entry (worst case O(len(A) + len(B))), here is the solution with a loop:
A = np.array([11, 12, 13, 17, 20, 22, 33, 34])
B = np.array([5, 10, 15, 20, 25, 30])

def calculate_next_distance(to_transform, reference):
    max_reference = len(reference) - 1
    current_reference = 0
    transformed_values = np.zeros_like(to_transform)
    for i, value in enumerate(to_transform):
        # advance to the most recent reference event not after this value
        while current_reference < max_reference and reference[current_reference+1] <= value:
            current_reference += 1
        transformed_values[i] = value - reference[current_reference]
    return transformed_values

calculate_next_distance(A, B)
# array([1, 2, 3, 2, 0, 2, 3, 4])

Numpy vectorization: comparing array against multiple values [duplicate]


getting multiple array after performing subtraction operation within array elements

import numpy as np
m = []
k = []
a = np.array([[1,2,3,4,5,6],[50,51,52,40,20,30],[60,71,82,90,45,35]])
for i in range(len(a)):
    m.append(a[i, -1:])
    for j in range(len(a[i])-1):
        n = abs(m[i] - a[i,j])
        k.append(n)
    k.append(m[i])
print(k)
Expected Output in k:
[5,4,3,2,1,6],[20,21,22,10,10,30],[25,36,47,55,10,35]
which is also a numpy array.
But the output that I am getting is
[array([5]), array([4]), array([3]), array([2]), array([1]), array([6]), array([20]), array([21]), array([22]), array([10]), array([10]), array([30]), array([25]), array([36]), array([47]), array([55]), array([10]), array([35])]
How can I solve this situation?
You want to subtract the last column of each sub-array from its remaining elements. Why don't you use a vectorized approach? You can do all the subtractions at once by subtracting the last column from the rest of the items, and then column_stack the result together with the unchanged last column. Also note that you need to change the dimension of the last column in order for it to be subtractable from the 2D array; for that we can use broadcasting.
In [71]: np.column_stack((abs(a[:, :-1] - a[:, None, -1]), a[:,-1]))
Out[71]:
array([[ 5,  4,  3,  2,  1,  6],
       [20, 21, 22, 10, 10, 30],
       [25, 36, 47, 55, 10, 35]])
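To make the intermediate shapes explicit, a small sketch using the same a:
import numpy as np

a = np.array([[ 1,  2,  3,  4,  5,  6],
              [50, 51, 52, 40, 20, 30],
              [60, 71, 82, 90, 45, 35]])

last = a[:, None, -1]            # shape (3, 1): the last column, kept 2D
diffs = abs(a[:, :-1] - last)    # (3, 5) - (3, 1) broadcasts to (3, 5)
result = np.column_stack((diffs, a[:, -1]))
print(result.shape)              # (3, 6)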

Map numpy's `in1d` over 2D array

I have two 2D numpy arrays,
import numpy as np
a = np.array([[ 1, 15, 16, 200, 10],
              [-1, 10, 17,  11, -1],
              [-1, -1, 20,  -1, -1]])
g = np.array([[ 1, 12, 15, 100, 11],
              [ 2, 13, 16, 200, 12],
              [ 3, 14, 17, 300, 13],
              [ 4, 17, 18, 400, 14],
              [ 5, 20, 19, 500, 16]])
What I want to do is, for each column of g, to check if it contains any element from the corresponding column of a. For the first column, I want to check if any of the values [1,2,3,4,5] appears in [1,-1,-1] and return True. For the second, I want to return False because no element in [12,13,14,17,20] appears in [15,10,-1]. At the moment, I do this using Python's list comprehension. Running
result = [np.any(np.in1d(g[:,i], a[:, i])) for i in range(5)]
calculates the correct result, but is getting slow when a has a lot of columns. Is there a more "pure numpy" way of doing this same thing? I feel like there should be an axis keyword one could add to the numpy.in1d function, but there isn't any...
I'd use broadcasting tricks, but this depends very much on the size of your arrays and the amount of RAM available to you:
M = g.reshape(g.shape+(1,)) - a.T.reshape((1,a.shape[1],a.shape[0]))
np.any(np.any(M == 0, axis=0), axis=1)
# returns:
# array([ True, False, True, True, False], dtype=bool)
It's easier to explain with a piece of paper and a pen (and smaller test arrays) (see below), but basically you're making copies of each column in g (one copy for each row in a) and subtracting single elements taken from the corresponding column in a from these copies. Similar to the original algorithm, just vectorized.
Caveat: if any of the arrays g or a is 1D, you'll need to force it to become 2D, such that its shape is at least (1,n).
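For that caveat, np.atleast_2d is one way to force the shape (a sketch; note it prepends the new axis, so you may still need a transpose depending on which operand it is):
import numpy as np

row = np.array([1, -1, -1])        # 1D: shape (3,)
print(np.atleast_2d(row).shape)    # (1, 3)
print(np.atleast_2d(row).T.shape)  # (3, 1)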
Speed gains:
- based only on your arrays: a factor of ~20 (python for loop: 301 us per loop; vectorized: 15.4 us per loop)
- larger arrays: a factor of ~80
In [2]: a = np.random.random_integers(-2, 3, size=(4, 50))
In [3]: b = np.random.random_integers(-20, 30, size=(35, 50))
In [4]: %timeit np.any(np.any(b.reshape(b.shape+(1,)) - a.T.reshape((1,a.shape[1],a.shape[0])) == 0, axis=0), axis=1)
10000 loops, best of 3: 39.5 us per loop
In [5]: %timeit [np.any(np.in1d(b[:,i], a[:, i])) for i in range(a.shape[1])]
100 loops, best of 3: 3.13 ms per loop
(The original answer included a diagram here illustrating the broadcasting.)
Instead of processing the input by column, you can process it by rows. For example you find out if any element of the first row of a is present in the columns of g, so that you can stop processing the columns where the element is found.
import numpy as np

idx = np.arange(a.shape[1])
result = np.zeros(idx.size, dtype=bool)
for j in range(a.shape[0]):
    # delete this print in production
    print("%d line, I look only at columns %s" % (j + 1, idx))
    line_pruned = np.take(a[j], idx)
    g_pruned = np.take(g, idx, axis=1)
    # positions of the hits inside the pruned arrays, mapped back to
    # the original column indices
    positive_idx = idx[np.where((g_pruned - line_pruned) == 0)[1]]
    # delete this print in production
    print("positive hit on the %s -th columns" % positive_idx)
    np.put(result, positive_idx, True)
    idx = np.setdiff1d(idx, positive_idx)
    if not idx.size:
        break
To understand how it works, we can consider a different input:
a = np.array([[ 0, 15, 16, 200, 10],
              [-1, 10, 17,  11, -1],
              [ 1, -1, 20,  -1, -1]])
g = np.array([[ 1, 12, 15, 100, 11],
              [ 2, 13, 16, 200, 12],
              [ 3, 14, 17, 300, 13],
              [ 4, 17, 18, 400, 14],
              [ 5, 20, 19, 500, 16]])
The output of the script is:
1 line, I look only at columns [0 1 2 3 4]
positive hit on the [2 3] -th columns
2 line, I look only at columns [0 1 4]
positive hit on the [] -th columns
3 line, I look only at columns [0 1 4]
positive hit on the [0] -th columns
Basically you can see how in the 2nd and 3rd rounds of the loop you're not processing the 3rd and 4th columns (indices 2 and 3).
The performance of this solution really depends on many factors, but it will be faster if it is likely that you hit many True values, and the problem has many rows. This of course depends also on the input, not just on the shape.
