I have an array called a and another array b. The array a is the main array where I store float data, and b is an array containing some indexes into a.
Example:
a = [1.3, 1.7, 18.4, 56.2, 82.2, 18.1, 81.9, 56.9, -274.45]
b = [0, 1, 2, 3, 4, 5, 6, 7]
In this example b contains indexes of a from 0 to 7.
What I'm trying to do in Python is to remove "duplicates": I want to remove all indexes from b whose values in a have a similar value earlier in the array. For example, notice that there is the pair 1.3 and 1.7, and also 18.4 and 18.1, etc. I want to find all these values and write -1 in every place in b that refers to such a value.
Output should be the following:
b = [0, -1, 2, 3, 4, -1, -1, -1]
I think it is obvious what I am trying to achieve. Here index 1 is replaced with -1 because in a it represents 1.7, which has the "pair" 1.3. Also, the last 3 indexes represent 18.1, 81.9 and 56.9, which also have their "pairs" earlier, so they are replaced with -1.
Of course, I have a parameter x which controls how "similar" values have to be. Here x = 2, which means that any two values that differ by less than 2 are considered similar.
What have I tried? I tried to use 2 nested for loops and a lot of unnecessary variables and my algorithm eats memory and performance. Is there an elegant np-ish way to achieve it?
Approach #1 : Here's a vectorized approach using broadcasting; it is a bit memory intensive -
x = 2 # threshold that decides similarity
a_b = a[b]
mask = np.triu(np.abs(a_b[:,None]-a_b)<x,1).any(0)
b[mask[:len(b)]] = -1
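To see what that mask line computes, here is a tiny sketch with assumed values (v is illustrative, not from the question); the upper triangle with k=1 counts each pair once, so only the later element of a similar pair gets flagged -
v = np.array([1.3, 1.7, 5.0])
d = np.abs(v[:,None] - v) < x    # pairwise boolean "similar" matrix
np.triu(d, 1).any(0)             # array([False,  True, False]) -> 1.7 has an earlier "pair"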
Sample run -
In [95]: a = np.array([1.3, 1.7, 18.4, 56.2, 82.2, 18.1, 81.9, 56.9, -274.45])
...: b = np.array([0, 1, 2, 3, 4, 5, 6, 7])
...:
# After code run ...
In [97]: b
Out[97]: array([ 0, -1, 2, 3, 4, -1, -1, -1])
Approach #2 : Less memory intensive approach
import pandas as pd
def set_mask(a, b, thresh):
    a_b = a[b]
    N = len(a_b)
    sidx = a_b.argsort()
    sorted_a_b = a_b[sidx]
    mask0 = sorted_a_b[1:] - sorted_a_b[:-1] < thresh
    id_arr = np.zeros(N, dtype=int)
    id_arr[np.flatnonzero(~mask0)+1] = 1
    ids = id_arr.cumsum()
    d = np.column_stack((ids, sidx))
    df0 = pd.DataFrame(d, columns=('ids','sidx'))
    pp = df0['sidx'].groupby([ids]).min()
    maskc = np.ones(N, dtype=bool)
    maskc[pp.values] = 0
    return maskc
Use this mask in place of the mask from the last step of the previous approach.
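A minimal usage sketch, assuming the same a, b and threshold as in the question (this mirrors the last step of Approach #1):
a = np.array([1.3, 1.7, 18.4, 56.2, 82.2, 18.1, 81.9, 56.9, -274.45])
b = np.array([0, 1, 2, 3, 4, 5, 6, 7])
b[set_mask(a, b, thresh=2)] = -1
# b -> array([ 0, -1,  2,  3,  4, -1, -1, -1])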
Related question:
Say:
p = np.array([4, 0, 8, 2, 7])
I want to find the index of the max value, excluding a few indexes, say:
excptIndx = [2, 3]
Answer: 4, as 7 will be the max.
If excptIndx = [1, 3], the answer is 2, as 8 will be the max.
In numpy, you can mask all values at excptIndx and run argmax to obtain index of max element:
import numpy as np
p = np.array([4, 0, 8, 2, 7])
excptIndx = [2, 3]
m = np.zeros(p.size, dtype=bool)
m[excptIndx] = True
a = np.ma.array(p, mask=m)
print(np.argmax(a))
# 4
The setup:
In [153]: p = np.array([4,0,8,2,7])
In [154]: exceptions = [2,3]
Original indexes in p:
In [155]: idx = np.arange(p.shape[0])
Delete exceptions from both:
In [156]: np.delete(p,exceptions)
Out[156]: array([4, 0, 7])
In [157]: np.delete(idx,exceptions)
Out[157]: array([0, 1, 4])
Find the argmax in the deleted array:
In [158]: np.argmax(np.delete(p,exceptions))
Out[158]: 2
Use that to find the max value (could just as well use np.max(_156)):
In [159]: _156[_158]
Out[159]: 7
Use the same index to find the index in the original p:
In [160]: _157[_158]
Out[160]: 4
In [161]: p[_160] # another way to get the max value
Out[161]: 7
For this small example, the pure Python alternatives might well be faster. They often are in small cases. We need test cases with a 1000 or more values to really see the advantages of numpy.
Another method
Set the exceptions to a small enough value, and take the argmax:
In [162]: p1 = p.copy(); p1[exceptions] = -1000
In [163]: np.argmax(p1)
Out[163]: 4
Here a small enough value is easy to pick; more generally it may require some thought.
Or taking advantage of the np.nan... functions:
In [164]: p1 = p.astype(float); p1[exceptions]=np.nan
In [165]: np.nanargmax(p1)
Out[165]: 4
A solution is
mask = np.isin(np.arange(len(p)), excptIndx, invert=True)  # True at the indices we keep
subset_idx = np.argmax(p[mask])
parent_idx = np.arange(len(p))[mask][subset_idx]
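A quick trace with the sample p and excptIndx above:
print(mask)        # [ True  True False False  True]
print(p[mask])     # [4 0 7]
print(subset_idx)  # 2
print(parent_idx)  # 4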
See http://seanlaw.github.io/2015/09/10/numpy-argmin-with-a-condition/
p = np.array([4,0,8,2,7]) # given
exceptions = [2,3] # given
idx = list( range(0,len(p)) ) # simple array of index
a1 = np.delete(idx, exceptions) # remove exceptions from idx (i.e., index)
a2 = np.argmax(np.delete(p, exceptions)) # get index of the max value after removing exceptions from actual p array
a1[a2] # as a1 and a2 are in sync, this will give the original index (as asked) of the max value
I have a sorted array with some repeated values. How can this array be turned into an array of arrays with the subarrays grouped by value (see below)? In actuality, my_first_array has ~8 million entries, so the solution would preferably be as time efficient as possible.
my_first_array = [1,1,1,3,5,5,9,9,9,9,9,10,23,23]
wanted_array = [ [1,1,1], [3], [5,5], [9,9,9,9,9], [10], [23,23] ]
itertools.groupby makes this trivial:
import itertools
wanted_array = [list(grp) for _, grp in itertools.groupby(my_first_array)]
With no key function, it just yields groups consisting of runs of identical values, so you list-ify each one in a list comprehension; easy-peasy. You can think of it as basically a within-Python API for doing the work of the GNU toolkit program, uniq, and related operations.
In CPython (the reference interpreter), groupby is implemented in C, and it operates lazily and linearly; the data must already appear in runs matching the key function, so unsorted data would first need a sort (which might make it too expensive), but for already sorted data like you have, nothing will be more efficient.
Note: If the inputs might be value identical, but different objects, it may make sense for memory reasons to change list(grp) for _, grp to [k] * len(list(grp)) for k, grp. The former would retain the original (possibly value but not identity duplicate) objects in the final result, the latter would replicate the first object from each group instead, reducing the final cost per group to the cost of N references to a single object, instead of N references to between 1 and N objects.
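For concreteness, a minimal sketch of that memory-saving variant (it assumes the import and my_first_array from above):
wanted_array = [[k] * len(list(grp)) for k, grp in itertools.groupby(my_first_array)]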
I am assuming that the input is a NumPy array and you are looking for a list of arrays as output. You can split the input array with np.split at the indices where the groups of repeated values have their boundaries. To find such indices, there are two ways: using np.unique with its optional argument return_index set to True, or using a combination of np.where and np.diff. Thus, we have the two approaches listed next.
With np.unique -
import numpy as np
_,idx = np.unique(my_first_array, return_index=True)
out = np.split(my_first_array, idx)[1:]
With np.where and np.diff -
idx = np.where(np.diff(my_first_array)!=0)[0] + 1
out = np.split(my_first_array, idx)
Sample run -
In [28]: my_first_array
Out[28]: array([ 1, 1, 1, 3, 5, 5, 9, 9, 9, 9, 9, 10, 23, 23])
In [29]: _,idx = np.unique(my_first_array, return_index=True)
...: out = np.split(my_first_array, idx)[1:]
...:
In [30]: out
Out[30]:
[array([1, 1, 1]),
array([3]),
array([5, 5]),
array([9, 9, 9, 9, 9]),
array([10]),
array([23, 23])]
In [31]: idx = np.where(np.diff(my_first_array)!=0)[0] + 1
...: out = np.split(my_first_array, idx)
...:
In [32]: out
Out[32]:
[array([1, 1, 1]),
array([3]),
array([5, 5]),
array([9, 9, 9, 9, 9]),
array([10]),
array([23, 23])]
Here is a solution, although it might not be very efficient:
my_first_array = [1,1,1,3,5,5,9,9,9,9,9,10,23,23]
wanted_array = [ [1,1,1], [3], [5,5], [9,9,9,9,9], [10], [23,23] ]

new_array = [ [my_first_array[0]] ]
count = 0
for i in range(1, len(my_first_array)):
    a = my_first_array[i]
    if a == my_first_array[i - 1]:
        new_array[count].append(a)
    else:
        count += 1
        new_array.append([])
        new_array[count].append(a)

new_array == wanted_array
This is O(n):
a = [1,1,1,3,5,5,9,9,9,9,9,10,23,23,24]
res = []
s = 0
e = 0
length = len(a)
while s < length:
    b = []
    while e < length and a[s] == a[e]:
        b.append(a[s])
        e += 1
    res.append(b)
    s = e
print(res)
Given 2 numpy arrays of unequal size: A (a presorted dataset) and B (a list of query values). I want to find the closest "lower" neighbor in array A to each element of array B. Example code below:
import numpy as np
A = np.array([0.456, 2.0, 2.948, 3.0, 7.0, 12.132]) #pre-sorted dataset
B = np.array([1.1, 1.9, 2.1, 5.0, 7.0]) #query values, not necessarily sorted
print(A.searchsorted(B))
# RESULT: [1 1 2 4 4]
# DESIRED: [0 0 1 3 4]
In this example, B[0]'s closest neighbors are A[0] and A[1]. It is closest to A[1], which is why searchsorted returns index 1 as a match, but what I want is the lower neighbor at index 0. The same goes for B[1:4], and B[4] should be matched with A[4] because both values are identical.
I could do something clunky like this:
desired = []
for b in B:
    id = -1
    for a in A:
        if a > b:
            if id == -1:
                desired.append(0)
            else:
                desired.append(id)
            break
        id += 1
print(desired)
# RESULT: [0, 0, 1, 3, 4]
But there's got to be a prettier, more concise way to write this with numpy. I'd like to keep my solution in numpy because I'm dealing with large data sets, but I'm open to other options.
You can introduce the optional argument side and set it to 'right', as mentioned in the docs. Then, subtract 1 from the resulting indices for the desired output, like so -
A.searchsorted(B,side='right')-1
Sample run -
In [63]: A
Out[63]: array([ 0.456, 2. , 2.948, 3. , 7. , 12.132])
In [64]: B
Out[64]: array([ 1.1, 1.9, 2.1, 5. , 7. ])
In [65]: A.searchsorted(B,side='right')-1
Out[65]: array([0, 0, 1, 3, 4])
In [66]: A.searchsorted(A,side='right')-1 # With itself
Out[66]: array([0, 1, 2, 3, 4, 5])
Here's one way to do this. np.argmax returns the index of the first True it encounters, so as long as A is sorted this gives the position of the first element greater than b, and subtracting 1 yields the lower neighbor.
[np.argmax(A>b)-1 for b in B]
Edit: I got the inequality wrong initially, it works now.
I'm trying to get the index of the last negative value of an array per column (in order to slice it after).
A simple working example on a 1D vector is:
import numpy as np
A = np.arange(10) - 5
A[2] = 2
print(A)    # [-5 -4 2 -2 -1 0 1 2 3 4]
idx = np.max(np.where(A <= 0)[0])
print(idx)  # 5
A[:idx] = 0
print(A)    # [0 0 0 0 0 0 1 2 3 4]
Now I want to do the same thing on each column of a 2D array:
A = np.arange(10) - 5
A[2] = 2
A2 = np.tile(A, 3).reshape((3, 10)) - np.array([0, 2, -1]).reshape((3, 1))
print(A2)
# [[-5 -4 2 -2 -1 0 1 2 3 4]
# [-7 -6 0 -4 -3 -2 -1 0 1 2]
# [-4 -3 3 -1 0 1 2 3 4 5]]
And I would like to obtain :
print(A2)
# [[0 0 0 0 0 0 1 2 3 4]
# [0 0 0 0 0 0 0 0 1 2]
# [0 0 0 0 0 1 2 3 4 5]]
but I can't manage to figure out how to translate the max/where statement to this 2D array...
You already have good answers, but I wanted to propose a potentially quicker variation using the function np.maximum.accumulate. Since your method for a 1D array uses max/where, you may also find this approach quite intuitive. (Edit: quicker Cython implementation added below).
The overall approach is very similar to the others; the mask is created with:
np.maximum.accumulate((A2 < 0)[:, ::-1], axis=1)[:, ::-1]
This line of code does the following:
(A2 < 0) creates a Boolean array, indicating whether a value is negative or not. The index [:, ::-1] flips this left-to-right.
np.maximum.accumulate is used to return the cumulative maximum along each row (i.e. axis=1). For example [False, True, False] would become [False, True, True].
The final indexing operation [:, ::-1] flips this new Boolean array left-to-right.
Then all that's left to do is to use the Boolean array as a mask to set the True values to zero.
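To make those three steps concrete, here is a tiny run on one assumed row (row is illustrative, not from the question; numpy is assumed imported as np):
row = np.array([[3, -1, 5, 2]])
row < 0                                                      # [[False  True False False]]
np.maximum.accumulate((row < 0)[:, ::-1], axis=1)            # [[False False  True  True]] (still reversed)
np.maximum.accumulate((row < 0)[:, ::-1], axis=1)[:, ::-1]   # [[ True  True False False]]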
Borrowing the timing methodology and two functions from @Divakar's answer, here are the benchmarks for my proposed method:
# method using np.maximum.accumulate
def accumulate_based(A2):
    A2[np.maximum.accumulate((A2 < 0)[:, ::-1], axis=1)[:, ::-1]] = 0
    return A2
# large sample array
A2 = np.random.randint(-4, 10, size=(100000, 100))
A2c = A2.copy()
A2c2 = A2.copy()
The timings are:
In [47]: %timeit broadcasting_based(A2)
10 loops, best of 3: 61.7 ms per loop
In [48]: %timeit cumsum_based(A2c)
10 loops, best of 3: 127 ms per loop
In [49]: %timeit accumulate_based(A2c2) # quickest
10 loops, best of 3: 43.2 ms per loop
So using np.maximum.accumulate can be as much as 30% faster than the next fastest solution for arrays of this size and shape.
As @tom10 points out, each NumPy operation processes arrays in their entirety, which can be inefficient when multiple operations are needed to get a result. An iterative approach that works through the array just once may fare better.
Below is a naive function written in Cython which can be more than twice as fast as the pure NumPy approaches.
This function may be able to be sped up further using memory views.
cimport cython
import numpy as np
cimport numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def cython_based(np.ndarray[long, ndim=2, mode="c"] array):
    cdef int rows, cols, i, j, seen_neg
    rows = array.shape[0]
    cols = array.shape[1]
    for i in range(rows):
        seen_neg = 0
        for j in range(cols-1, -1, -1):
            if seen_neg or array[i, j] < 0:
                seen_neg = 1
                array[i, j] = 0
    return array
This function works backwards through each row and starts setting values to zero once it has seen a negative value.
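If you want to try it outside a notebook, one possible way to compile it (the file name set_zeros.pyx is an assumption, not part of the original answer) is a minimal setup.py, built with python setup.py build_ext --inplace:
# setup.py - minimal sketch for compiling the Cython function above,
# assuming it has been saved as set_zeros.pyx
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("set_zeros.pyx"),
    include_dirs=[np.get_include()],
)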
Testing it works:
A2 = np.random.randint(-4, 10, size=(100000, 100))
A2c = A2.copy()
np.array_equal(accumulate_based(A2), cython_based(A2c))
# True
Comparing the performance of the function:
In [52]: %timeit accumulate_based(A2)
10 loops, best of 3: 49.8 ms per loop
In [53]: %timeit cython_based(A2c)
100 loops, best of 3: 18.6 ms per loop
Assuming that you are looking to set, for each row, all elements up to the last negative element to zero (as per the expected output listed in the question for a sample case), two approaches could be suggested here.
Approach #1
This one is based on np.cumsum to generate a mask of elements to be set to zeros as listed next -
# Get boolean mask with TRUEs for each row starting at the first element and
# ending at the last negative element
mask = (np.cumsum(A2[:,::-1]<0,1)>0)[:,::-1]
# Use mask to set all such TRUEs to zeros as per the expected output in OP
A2[mask] = 0
Sample run -
In [280]: A2 = np.random.randint(-4,10,(6,7)) # Random input 2D array
In [281]: A2
Out[281]:
array([[-2, 9, 8, -3, 2, 0, 5],
[-1, 9, 5, 1, -3, -3, -2],
[ 3, -3, 3, 5, 5, 2, 9],
[ 4, 6, -1, 6, 1, 2, 2],
[ 4, 4, 6, -3, 7, -3, -3],
[ 0, 2, -2, -3, 9, 4, 3]])
In [282]: A2[(np.cumsum(A2[:,::-1]<0,1)>0)[:,::-1]] = 0 # Use mask to set zeros
In [283]: A2
Out[283]:
array([[0, 0, 0, 0, 2, 0, 5],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 3, 5, 5, 2, 9],
[0, 0, 0, 6, 1, 2, 2],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 9, 4, 3]])
Approach #2
This one starts with the idea of finding the indices of the last negative element per row from @tom10's answer, and develops into a mask-finding method using broadcasting to get us the desired output, similar to Approach #1.
# Find last negative index for each row
last_idx = A2.shape[1] - 1 - np.argmax(A2[:,::-1]<0, axis=1)
# Find the invalid indices (rows with no negative indices)
invalid_idx = A2[np.arange(A2.shape[0]),last_idx]>=0
# Set the indices for invalid ones to "-1"
last_idx[invalid_idx] = -1
# Boolean mask with each row starting with TRUE as the first element
# and ending at the last negative element
mask = np.arange(A2.shape[1]) < (last_idx[:,None] + 1)
# Set masked elements to zeros, for the desired output
A2[mask] = 0
Runtime tests -
Function definitions:
def broadcasting_based(A2):
    last_idx = A2.shape[1] - 1 - np.argmax(A2[:,::-1]<0, axis=1)
    last_idx[A2[np.arange(A2.shape[0]),last_idx]>=0] = -1
    A2[np.arange(A2.shape[1]) < (last_idx[:,None] + 1)] = 0
    return A2

def cumsum_based(A2):
    A2[(np.cumsum(A2[:,::-1]<0,1)>0)[:,::-1]] = 0
    return A2
Runtimes:
In [379]: A2 = np.random.randint(-4,10,(100000,100))
...: A2c = A2.copy()
...:
In [380]: %timeit broadcasting_based(A2)
10 loops, best of 3: 106 ms per loop
In [381]: %timeit cumsum_based(A2c)
1 loops, best of 3: 167 ms per loop
Verify results -
In [384]: A2 = np.random.randint(-4,10,(100000,100))
...: A2c = A2.copy()
...:
In [385]: np.array_equal(broadcasting_based(A2),cumsum_based(A2c))
Out[385]: True
Finding the first is usually easier and faster than finding the last, so here I reverse the array and then find the first negative (using the OP's version of A2):
im = A2.shape[1] - 1 - np.argmax(A2[:,::-1]<0, axis=1)
# [4 6 3] # which are the indices of the last negative in A2
Also, though, note that if you have large arrays with many negative numbers, it might actually be faster to use a non-numpy approach so you can short circuit the search. That is, numpy will do the calculation on the entire array, so if you have 10000 elements in a row but typically will hit a negative number in the first 10 elements (of a reverse search), a pure Python approach might end up being faster.
Overall, iterating the rows might be faster for subsequent operations as well. For example, if your next step is multiplication, it could be faster to just multiply the slices at the ends that are non-zeros, or maybe find that longest non-zero section and just deal with the truncated array.
This basically comes down to the number of negatives per row. If you have 1000 negatives per row, you'll on average have non-zero segments that are 1/1000th of your full row length, so you could get a 1000x speed-up by just looking at the ends. The short example given in the question is great for understanding and answering the basic question, but I wouldn't take timing tests too seriously when your end application is a very different use case; especially since the fractional time savings from iteration improve in proportion to array size (assuming a constant ratio and random distribution of negative numbers).
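A minimal sketch of that short-circuiting idea (the function name and loop structure are illustrative, not from the original answer):
def shortcircuit_rows(A2):
    # Scan each row from the right; the first negative found is the last one
    # in the row, so zero out everything up to and including it and move on.
    for row in A2:
        for j in range(len(row) - 1, -1, -1):
            if row[j] < 0:
                row[:j + 1] = 0
                break
    return A2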
You can access individual rows:
A2[0] == array([-5, -4, 2, -2, -1, 0, 1, 2, 3, 4])
Assume I have the following arrays:
N = 8
M = 4
a = np.zeros(M)
b = np.random.randint(M, size=N) # contains indices for a
c = np.random.rand(N) # contains random values
I want to sum the values of c according to the indices provided in b, and store them in a. Writing a loop for this is trivial:
for i, v in enumerate(b):
    a[v] += c[i]
Since N can get quite big in my real-world problem I'd like to avoid using python loops, but I can't figure out how to write it as a numpy-statement. Can anyone help me out?
OK, here are some example values:
In [27]: b
Out[27]: array([0, 1, 2, 0, 2, 3, 1, 1])
In [28]: c
Out[28]:
array([ 0.15517108, 0.84717734, 0.86019899, 0.62413489, 0.24357903,
0.86015187, 0.85813481, 0.7071174 ])
In [30]: a
Out[30]: array([ 0.77930596, 2.41242955, 1.10377802, 0.86015187])
import numpy as np
N = 8
M = 4
b = np.array([0, 1, 2, 0, 2, 3, 1, 1])
c = np.array([ 0.15517108, 0.84717734, 0.86019899, 0.62413489, 0.24357903, 0.86015187, 0.85813481, 0.7071174 ])
a = ((np.mgrid[:M,:N] == b)[0] * c).sum(axis=1)
returns
array([ 0.77930597, 2.41242955, 1.10377802, 0.86015187])
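For reference, two alternatives not used in the answer above avoid building the M-by-N intermediate entirely, using np.bincount or np.add.at (both assume b holds valid indices into a and len(b) == len(c)):
a = np.bincount(b, weights=c, minlength=M)   # sums c grouped by the indices in b

# or, accumulating into an existing array:
a = np.zeros(M)
np.add.at(a, b, c)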