Classify elements of a numpy array using a second array as reference - python

Let's say I have an array with a finite amount of unique values. Say
data = array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
And I also have a reference array with all the unique values found in data, without repetitions and in a particular order. Say
reference = array([20, 10, 30])
And I want to create an array with the same shape as data, containing as values the indices in the reference array where each element of the data array is found.
In other words, having data and reference, I want to create an array indexes such that the following holds.
data = reference[indexes]
A suboptimal approach to compute indexes would be using a for loop, like this
indexes = np.zeros_like(data, dtype=int)
for i in range(data.size):
    indexes[i] = np.where(data[i] == reference)[0]
but I'd be surprised if there isn't a numpythonic (and thus faster!) way to do this... Any ideas?
Thanks!

We have data and reference as -
In [375]: data
Out[375]: array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
In [376]: reference
Out[376]: array([20, 10, 30])
For a moment, let us consider a sorted version of reference -
In [373]: np.sort(reference)
Out[373]: array([10, 20, 30])
Now, we can use np.searchsorted to find out the position of each data element in this sorted version, like so -
In [378]: np.searchsorted(np.sort(reference), data, side='left')
Out[378]: array([2, 1, 2, 0, 1, 0, 1, 0, 2, 1, 1, 2, 2, 0, 2], dtype=int64)
If we run the original code, the expected output turns out to be -
In [379]: indexes
Out[379]: array([2, 0, 2, 1, 0, 1, 0, 1, 2, 0, 0, 2, 2, 1, 2])
As can be seen, the searchsorted output is almost right, except that the 0's in it must become 1's and the 1's must become 0's. That's because we searched against the sorted version of reference, not the original one. To map positions in the sorted array back to positions in the original reference, we bring in the indices used for sorting it, i.e. np.argsort(reference). That's basically it for a vectorized, loop-free and dict-free approach! The final implementation would look something like this -
# Get sorting indices for reference
sort_idx = np.argsort(reference)
# Sort reference and get searchsorted indices for data in reference
pos = np.searchsorted(reference[sort_idx], data, side='left')
# Map the sorted positions back to positions in the original reference
out = sort_idx[pos]
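As a quick sanity check, here is a minimal runnable version of the three steps above on the sample arrays from the question:
import numpy as np

data = np.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
reference = np.array([20, 10, 30])

sort_idx = np.argsort(reference)                               # array([1, 0, 2])
pos = np.searchsorted(reference[sort_idx], data, side='left')  # positions in the sorted reference
out = sort_idx[pos]                                            # positions in the original reference

print(out)                                    # [2 0 2 1 0 1 0 1 2 0 0 2 2 1 2]
print(np.array_equal(reference[out], data))   # True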
Runtime tests -
In [396]: data = np.random.randint(0,30000,150000)
     ...: reference = np.unique(data)
     ...: reference = reference[np.random.permutation(reference.size)]
     ...:
     ...: def org_approach(data, reference):
     ...:     indexes = np.zeros_like(data, dtype=int)
     ...:     for i in range(data.size):
     ...:         indexes[i] = np.where(data[i] == reference)[0]
     ...:     return indexes
     ...:
     ...: def vect_approach(data, reference):
     ...:     sort_idx = np.argsort(reference)
     ...:     pos = np.searchsorted(reference[sort_idx], data, side='left')
     ...:     return sort_idx[pos]
     ...:
In [397]: %timeit org_approach(data,reference)
1 loops, best of 3: 9.86 s per loop
In [398]: %timeit vect_approach(data,reference)
10 loops, best of 3: 32.4 ms per loop
Verify results -
In [399]: np.array_equal(org_approach(data,reference),vect_approach(data,reference))
Out[399]: True

You have to loop through the data once to map the data values onto indexes. The quickest way to do that is to look up the value indexes in a dictionary. So you need to create a dictionary from values to indexes first.
Here's a complete example:
import numpy
data = numpy.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
reference = numpy.array([20, 10, 30])
reference_index = dict((value, index) for index, value in enumerate(reference))
indexes = [reference_index[value] for value in data]
assert numpy.all(data == reference[indexes])
This will be faster than the numpy.where approach because numpy.where does a linear O(n) search over reference for every element of data, while the dictionary approach uses a hash table to find each index in O(1) time.
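Note that indexes here is a plain Python list; it works fine for the fancy indexing above, but if you want to keep everything as arrays downstream you can convert it:
indexes = numpy.asarray(indexes)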

import numpy as np
data = np.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
reference = {20:0, 10:1, 30:2}
indexes = np.zeros_like(data, dtype=int)
for i in xrange(data.size):
    indexes[i] = reference[data[i]]
A dictionary lookup is significantly faster. The use of xrange also helped marginally.
Using timeit:
Original: 4.01297836938
This version: 1.30972428591

Related

Minimum distance for each value in array with respect to other

I have two numpy arrays of integers, A and B. The values in arrays A and B correspond to time-points at which events A and B occurred. I would like to transform A to contain the time since the most recent event B occurred.
I know I need to subtract from each element of A the nearest smaller element of B, but I am unsure how to do so. Any help would be greatly appreciated.
>>> import numpy as np
>>> A = np.array([11, 12, 13, 17, 20, 22, 33, 34])
>>> B = np.array([5, 10, 15, 20, 25, 30])
Desired Result:
>>> cond_a = relative_timestamp(to_transform=A, reference=B)
>>> cond_a
array([1, 2, 3, 2, 0, 2, 3, 4])
You can use np.searchsorted to find the indices where the elements of A should be inserted into B to maintain order. In other words, for each element in A you are finding the closest element in B that does not exceed it:
idx = np.searchsorted(B, A, side='right')
result = A - B[idx-1]  # subtract one to index the most recent B event at or before each A
According to the docs searchsorted uses binary search, so it will scale fine for large inputs.
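Putting the two lines together with the arrays from the question gives a small self-contained check:
import numpy as np

A = np.array([11, 12, 13, 17, 20, 22, 33, 34])
B = np.array([5, 10, 15, 20, 25, 30])

idx = np.searchsorted(B, A, side='right')  # insertion points; B[idx-1] is the latest event at or before each A
result = A - B[idx - 1]                    # assumes every value in A is >= B[0], otherwise idx-1 wraps around
print(result)                              # [1 2 3 2 0 2 3 4]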
Here's an approach based on computing the pairwise differences. Note that it has O(n**2) complexity, so for larger arrays @brenlla's answer will perform much better.
The idea is to use np.subtract.outer and then find the minimum difference along axis 1 over a masked array, where only the values of B that do not exceed the corresponding value of A are considered:
dif = np.abs(np.subtract.outer(A,B))
np.ma.array(dif, mask = A[:,None] < B).min(1).data
# array([1, 2, 3, 2, 0, 2, 3, 4])
As I am not sure whether it is really faster to calculate all pairwise differences instead of a Python loop over each array entry (worst case O(len(A)+len(B))), here is the solution with a loop:
A = np.array([11, 12, 13, 17, 20, 22, 33, 34])
B = np.array([5, 10, 15, 20, 25, 30])
def calculate_next_distance(to_transform, reference):
    max_reference = len(reference) - 1
    current_reference = 0
    transformed_values = np.zeros_like(to_transform)
    for i, value in enumerate(to_transform):
        while current_reference < max_reference and reference[current_reference+1] <= value:
            current_reference += 1
        transformed_values[i] = value - reference[current_reference]
    return transformed_values
calculate_next_distance(A,B)
# array([1, 2, 3, 2, 0, 2, 3, 4])

Numpy vectorization: comparing array against multiple values [duplicate]

Let's say I have an array like this:
import numpy as np
base_array = np.array([-13, -9, -11, -3, -3, -4, 2, 2,
                       2, 5, 7, 7, 8, 7, 12, 11])
Suppose I want to know: "how many elements in base_array are greater than 4?" This can be done simply by exploiting broadcasting:
np.sum(4 < base_array)
For which the answer is 7. Now, suppose instead of comparing to a single value, I want to do this over an array. In other words, for each value c in the comparison_array, find out how many elements of base_array are greater than c. If I do this the naive way, it obviously fails because it doesn't know how to broadcast it properly:
comparison_array = np.arange(-13, 13)
comparison_result = np.sum(comparison_array < base_array)
Output:
Traceback (most recent call last):
File "<pyshell#87>", line 1, in <module>
np.sum(comparison_array < base_array)
ValueError: operands could not be broadcast together with shapes (26,) (16,)
If I could somehow have each element of comparison_array get broadcast to base_array's shape, that would solve this. But I don't know how to do such an "element-wise broadcasting".
Now, I do know how to implement this for both cases using a list comprehension:
first = sum([4 < i for i in base_array])
second = [sum([c < i for i in base_array])
          for c in comparison_array]
print(first)
print(second)
Output:
7
[15, 15, 14, 14, 13, 13, 13, 13, 13, 12, 10, 10, 10, 10, 10, 7, 7, 7, 6, 6, 3, 2, 2, 2, 1, 0]
But as we all know, this will be orders of magnitude slower than a correctly-vectorized numpy implementation on larger arrays. So, how should I do this in numpy so that it's fast? Ideally this solution should extend to any kind of operation where broadcasting works, not just greater-than or less-than in this example.
You can simply add a dimension to the comparison array, so that the comparison is "stretched" across all values along the new dimension.
>>> np.sum(comparison_array[:, None] < base_array)
228
This is the fundamental principle with broadcasting, and works for all kinds of operations.
If you need the sum done along an axis, you just specify the axis along which you want to sum after the comparison.
>>> np.sum(comparison_array[:, None] < base_array, axis=1)
array([15, 15, 14, 14, 13, 13, 13, 13, 13, 12, 10, 10, 10, 10, 10, 7, 7,
7, 6, 6, 3, 2, 2, 2, 1, 0])
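The same pattern carries over to any element-wise operation. For instance, a sketch that counts, for each value in comparison_array, how many elements of base_array are exactly equal to it:
equal_counts = np.sum(comparison_array[:, None] == base_array, axis=1)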
You will want to add an axis to one of the arrays for broadcasting to work correctly. When you broadcast two arrays together, the dimensions are lined up and any unit dimensions are effectively expanded to the non-unit size they are matched against. So two arrays of shape (16, 1) (the original array) and (1, 26) (the comparison array) would broadcast to (16, 26).
Don't forget to sum across the dimension of size 16:
(base_array[:, None] > comparison_array).sum(axis=1)
None in a slice is equivalent to np.newaxis: it's one of many ways to insert a new unit dimension at the specified index. The reason you don't need to write comparison_array[None, :] is that broadcasting lines up the trailing dimensions and automatically prepends dimensions of size one on the left.
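A small sketch of the shapes involved, using the arrays from the question:
print(base_array[:, None].shape)                        # (16, 1)
print(comparison_array.shape)                           # (26,), treated as (1, 26) when broadcasting
print((base_array[:, None] > comparison_array).shape)   # (16, 26)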
Here's one with np.searchsorted, with a focus on memory efficiency and hence performance -
def get_comparative_sum(base_array, comparison_array):
    n = len(base_array)
    base_array_sorted = np.sort(base_array)
    idx = np.searchsorted(base_array_sorted, comparison_array, 'right')
    idx[idx==n] = n-1
    return n - idx - (base_array_sorted[idx] == comparison_array)
Timings -
In [40]: np.random.seed(0)
...: base_array = np.random.randint(-1000,1000,(10000))
...: comparison_array = np.random.randint(-1000,1000,(20000))
# @miradulo's soln
In [41]: %timeit np.sum(comparison_array[:, None] < base_array, axis=1)
1 loop, best of 3: 386 ms per loop
In [42]: %timeit get_comparative_sum(base_array, comparison_array)
100 loops, best of 3: 2.36 ms per loop
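If you want to double-check that the two approaches agree on this data, a quick sanity check (not part of the original timings) would be:
brute = np.sum(comparison_array[:, None] < base_array, axis=1)
fast = get_comparative_sum(base_array, comparison_array)
print(np.array_equal(brute, fast))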

Getting multiple arrays after performing a subtraction operation within array elements

import numpy as np
m = []
k = []
a = np.array([[1,2,3,4,5,6],[50,51,52,40,20,30],[60,71,82,90,45,35]])
for i in range(len(a)):
    m.append(a[i, -1:])
    for j in range(len(a[i])-1):
        n = abs(m[i] - a[i,j])
        k.append(n)
    k.append(m[i])
print(k)
Expected Output in k:
[5,4,3,2,1,6],[20,21,22,10,10,30],[25,36,47,55,10,35]
which is also a numpy array.
But the output that I am getting is
[array([5]), array([4]), array([3]), array([2]), array([1]), array([6]), array([20]), array([21]), array([22]), array([10]), array([10]), array([30]), array([25]), array([36]), array([47]), array([55]), array([10]), array([35])]
How can I fix this?
You want to subtract the last column of each row from the rest of that row. Why not use a vectorized approach? You can do all the subtractions at once by subtracting the last column from the remaining columns, then use column_stack to attach the unchanged last column on the right. Note that you need to keep the last column two-dimensional so that it broadcasts against the 2D array.
In [71]: np.column_stack((abs(a[:, :-1] - a[:, None, -1]), a[:,-1]))
Out[71]:
array([[ 5,  4,  3,  2,  1,  6],
       [20, 21, 22, 10, 10, 30],
       [25, 36, 47, 55, 10, 35]])
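The same idea as a small standalone script, for copy-pasting (a[:, -1:] is an equivalent way of keeping the last column two-dimensional):
import numpy as np

a = np.array([[ 1,  2,  3,  4,  5,  6],
              [50, 51, 52, 40, 20, 30],
              [60, 71, 82, 90, 45, 35]])

last = a[:, -1:]                                   # last column, kept 2D so it broadcasts
k = np.column_stack((np.abs(a[:, :-1] - last), a[:, -1]))
print(k)
# [[ 5  4  3  2  1  6]
#  [20 21 22 10 10 30]
#  [25 36 47 55 10 35]]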

Taking an average of an array according to another array of indices

Say I have an array that looks like this:
a = np.array([0, 20, 40, 30, 60, 35, 15, 18, 2])
and I have an array of indices that I want to average between:
averaging_indices = np.array([2, 4, 7, 8])
What I want to do is to average the elements of array a according to the averaging_indices array. Just to make that clear, I want to take the averages:
np.mean(a[0:2]), np.mean(a[2:4]), np.mean(a[4:7]), np.mean(a[7:8]), np.mean(a[8:])
and I want to return an array that then has the correct dimensions, in this case
result = [10, 35, 36.66, 18, 2]
Can anyone think of a neat way to do this? The only way I can imagine is by looping, which is very anti-numpy.
Here's a vectorized approach with np.bincount -
# Create "shifts array" and then IDs array for use with np.bincount later on
shifts_array = np.zeros(a.size,dtype=int)
shifts_array[averaging_indices] = 1
IDs = shifts_array.cumsum()
# Use np.bincount to get the summations for each tag and also tag counts.
# Thus, get tagged averages as final output.
out = np.bincount(IDs,a)/np.bincount(IDs)
Sample input, output -
In [60]: a
Out[60]: array([ 0, 20, 40, 30, 60, 35, 15, 18, 2])
In [61]: averaging_indices
Out[61]: array([2, 4, 7, 8])
In [62]: out
Out[62]: array([ 10. , 35. , 36.66666667, 18. , 2. ])
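Wrapped up as a small reusable function (the name is arbitrary), applied to the sample data:
import numpy as np

def grouped_means(a, averaging_indices):
    # Mark the start of each new group, turn the marks into group IDs via cumsum,
    # then let np.bincount produce per-group sums and counts.
    shifts_array = np.zeros(a.size, dtype=int)
    shifts_array[averaging_indices] = 1
    IDs = shifts_array.cumsum()
    return np.bincount(IDs, a) / np.bincount(IDs)

a = np.array([0, 20, 40, 30, 60, 35, 15, 18, 2])
averaging_indices = np.array([2, 4, 7, 8])
print(grouped_means(a, averaging_indices))   # 10, 35, 36.666..., 18, 2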

Map numpy's `in1d` over 2D array

I have two 2D numpy arrays,
import numpy as np
a = np.array([[  1, 15, 16, 200, 10],
              [ -1, 10, 17,  11, -1],
              [ -1, -1, 20,  -1, -1]])
g = np.array([[  1, 12, 15, 100, 11],
              [  2, 13, 16, 200, 12],
              [  3, 14, 17, 300, 13],
              [  4, 17, 18, 400, 14],
              [  5, 20, 19, 500, 16]])
What I want to do is, for each column of g, to check if it contains any element from the corresponding column of a. For the first column, I want to check if any of the values [1,2,3,4,5] appears in [1,-1,-1] and return True. For the second, I want to return False because no element in [12,13,14,17,20] appears in [15,10,-1]. At the moment, I do this using Python's list comprehension. Running
result = [np.any(np.in1d(g[:,i], a[:, i])) for i in range(5)]
calculates the correct result, but is getting slow when a has a lot of columns. Is there a more "pure numpy" way of doing this same thing? I feel like there should be an axis keyword one could add to the numpy.in1d function, but there isn't any...
I'd use broadcasting tricks, but this depends very much on the size of your arrays and the amount of RAM available to you:
M = g.reshape(g.shape+(1,)) - a.T.reshape((1,a.shape[1],a.shape[0]))
np.any(np.any(M == 0, axis=0), axis=1)
# returns:
# array([ True, False, True, True, False], dtype=bool)
It's easier to explain with a piece of paper and a pen (and smaller test arrays), but basically you're making copies of each column of g (one copy for each row of a) and subtracting from these copies the single elements taken from the corresponding column of a. Similar to the original algorithm, just vectorized.
Caveat: if any of the arrays g or a is 1D, you'll need to force it to become 2D, such that its shape is at least (1,n).
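For the arrays in the question, you can inspect the shape of the intermediate array and reproduce the result like this (an illustrative sketch):
M = g.reshape(g.shape + (1,)) - a.T.reshape((1, a.shape[1], a.shape[0]))
print(M.shape)                                  # (5, 5, 3): rows of g x columns x rows of a
hits = (M == 0)                                 # True where g[i, j] equals a[k, j]
print(np.any(np.any(hits, axis=0), axis=1))     # [ True False  True  True False]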
Speed gains:
based on your arrays alone: a factor of ~20 (python for loop: 301 us per loop; vectorized: 15.4 us per loop)
larger arrays: a factor of ~80
In [2]: a = np.random.random_integers(-2, 3, size=(4, 50))
In [3]: b = np.random.random_integers(-20, 30, size=(35, 50))
In [4]: %timeit np.any(np.any(b.reshape(b.shape+(1,)) - a.T.reshape((1,a.shape[1],a.shape[0])) == 0, axis=0), axis=1)
10000 loops, best of 3: 39.5 us per loop
In [5]: %timeit [np.any(np.in1d(b[:,i], a[:, i])) for i in range(a.shape[1])]
100 loops, best of 3: 3.13 ms per loop
Instead of processing the input by column, you can process it by rows. For example, you find out whether any element of the first row of a is present in the corresponding columns of g, so that you can stop processing the columns where a match is found.
import numpy as np

idx = np.arange(a.shape[1])
result = np.empty((idx.size,), dtype=bool)
result.fill(False)
for j in range(a.shape[0]):
    # delete this print in production
    print("%d line, I look only at columns " % (j + 1), idx)
    line_pruned = np.take(a[j], idx)
    g_pruned = np.take(g, idx, axis=1)
    # map positions within the pruned set back to original column indices
    positive_idx = idx[np.where((g_pruned - line_pruned) == 0)[1]]
    # delete this print in production
    print("positive hit on the ", positive_idx, " -th columns")
    np.put(result, positive_idx, True)
    idx = np.setdiff1d(idx, positive_idx)
    if not idx.size:
        break
To understand how it works, we can consider a different input:
a = np.array([[  0, 15, 16, 200, 10],
              [ -1, 10, 17,  11, -1],
              [  1, -1, 20,  -1, -1]])
g = np.array([[  1, 12, 15, 100, 11],
              [  2, 13, 16, 200, 12],
              [  3, 14, 17, 300, 13],
              [  4, 17, 18, 400, 14],
              [  5, 20, 19, 500, 16]])
The output of the script is:
1 line, I look only at columns [0 1 2 3 4]
positive hit on the [2 3] -th columns
2 line, I look only at columns [0 1 4]
positive hit on the [] -th columns
3 line, I look only at columns [0 1 4]
positive hit on the [0] -th columns
Basically you can see how, in the 2nd and 3rd rounds of the loop, you're no longer processing columns 2 and 3, which were already matched in the first round.
The performance of this solution really depends on many factors, but it will be faster if it is likely that you hit many True values, and the problem has many rows. This of course depends also on the input, not just on the shape.
