I have a large DataFrame (~1 million rows). I am ultimately going to be interpolating spectra between two ages, but first I need to find the nearest age values above and below any age I ask for.
The DataFrame looks roughly like this:
Age    Wavelength    Luminosity
1      ...           ...
1      ...           ...
1      ...           ...
4      ...           ...
4      ...           ...
6      ...           ...
6      ...           ...
I need to be able to input 5 and get back the values 4 and 6. I am struggling to find a way to do this. This is what I tried:
def findnearest(array, value):
    idx = np.searchsorted(array, value, side='left')
    if idx > 125893.0:
        return array[idx]
    else:
        return array[idx]
    idx1 = np.searchsorted(array, value, side='right')
    if idx1 < 2e10:
        return array[idx1]
    else:
        return array[idx1 - 1]
C = findnearest(m05_010['age'], 5.12e7)
print(C)
This only returns one value, and not both. Is this the right path or should I be doing something different? Is there a better way?
IIUC and assuming a sorted input array, you can do something like this -
below = arr[np.searchsorted(arr,value,'left')-1]
above = arr[np.searchsorted(arr,value,'right')]
Sample runs -
Case 1: Without exact match for value
In [17]: arr = np.array([1,1,1,4,4,4,4,4,4,4,6,6])
In [18]: value = 5
In [19]: below = arr[np.searchsorted(arr,value,'left')-1]
    ...: above = arr[np.searchsorted(arr,value,'right')]
    ...:
In [20]: below, above
Out[20]: (4, 6)
Case 2: With exact match for value
In [33]: arr = np.array([1,1,1,4,4,4,4,4,4,4,5,5,5,6,6])
In [34]: value = 5
In [35]: below = arr[np.searchsorted(arr,value,'left')-1]
    ...: above = arr[np.searchsorted(arr,value,'right')]
    ...:
In [36]: below, above
Out[36]: (4, 6)
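One caveat with this pattern: if value is smaller than every element, the 'left' index is 0 and the -1 wraps around to the last element, and if it is larger than every element, the 'right' index runs past the end. A minimal bounds-checked sketch (the helper name find_neighbours is mine; for the DataFrame in the question you would pass in something like np.sort(m05_010['age'].unique())):

import numpy as np

def find_neighbours(arr, value):
    """Return (below, above) neighbours of value in the sorted 1D array arr.
    An entry is None when value falls outside the array's range."""
    lo = np.searchsorted(arr, value, side='left')
    hi = np.searchsorted(arr, value, side='right')
    below = arr[lo - 1] if lo > 0 else None
    above = arr[hi] if hi < len(arr) else None
    return below, above

find_neighbours(np.array([1, 1, 1, 4, 4, 6, 6]), 5)  # (4, 6)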
I think you should use bisect; it's a lot faster and is designed for exactly this purpose.
from bisect import bisect_left, bisect_right
import numpy as np

arr = np.array([1,1,1,4,4,4,4,4,4,4,6,6])
value = 5
below = arr[bisect_left(arr, value) - 1]
above = arr[bisect_right(arr, value)]
below, above
Output -
(4, 6)
Here's the time comparison from IPython -
%timeit for x in range(100): arr[bisect_left(arr, value)]
Output -
10000 loops, best of 3: 92.4 µs per loop
And using searchsorted -
%timeit for x in range(100): arr[np.searchsorted(arr,value,'left')-1]
Output -
The slowest run took 7.62 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 142 µs per loop
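Note that both timings measure a single scalar lookup repeated in a Python loop. When you need neighbours for many query values at once, np.searchsorted accepts an array of values (bisect does not), which is where the numpy route pays off; a quick sketch with query values I made up:

values = np.array([2.0, 5.0])
below = arr[np.searchsorted(arr, values, 'left') - 1]   # array([1, 4])
above = arr[np.searchsorted(arr, values, 'right')]      # array([4, 6])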
Related
I want to count how many rows of one numpy array differ from the corresponding rows of another numpy array. The solution should not contain a list comprehension.
Something along these lines (note that a and b differ in the last row):
a = np.array( [[1,1,1],[2,2,2],[3,3,3],[4,4,4],[5,5,5]] )
b = np.array( [[1,1,1],[2,2,2],[3,3,3],[4,4,4],[5,0,0]] )
y = diff_count(a, b)
print(y)
# 1
Approach #1
Perform an element-wise comparison for inequality, then take an ANY reduction along the last axis, and finally count -
(a!=b).any(-1).sum()
Approach #2
A probably faster one, using np.count_nonzero, which is optimized for counting booleans -
np.count_nonzero((a!=b).any(-1))
Approach #3
A much faster one with views -
# https://stackoverflow.com/a/45313353/ @Divakar
def view1D(a, b):  # a, b are 2D arrays
    # Collapse each row into one np.void element so rows compare as scalars
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()
a1D,b1D = view1D(a,b)
out = np.count_nonzero(a1D!=b1D)
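A quick sanity check with the a and b from the question (this assumes both arrays share the same dtype and row width, which the void view requires):

a = np.array([[1,1,1],[2,2,2],[3,3,3],[4,4,4],[5,5,5]])
b = np.array([[1,1,1],[2,2,2],[3,3,3],[4,4,4],[5,0,0]])
a1D, b1D = view1D(a, b)
print(np.count_nonzero(a1D != b1D))  # 1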
Benchmarking
In [32]: np.random.seed(0)
...: m,n = 10000,100
...: a = np.random.randint(0,9,(m,n))
...: b = a.copy()
...:
...: # Let's set 10% of rows as different ones
...: b[np.random.choice(len(a), len(a)//10, replace=False)] = 0
In [33]: %timeit (a!=b).any(-1).sum() # app#1 from this soln
...: %timeit np.count_nonzero((a!=b).any(-1)) # app#2
...: %timeit np.any(a - b, axis=1).sum() # @Graipher's soln
1000 loops, best of 3: 1.14 ms per loop
1000 loops, best of 3: 1.08 ms per loop
100 loops, best of 3: 2.33 ms per loop
In [34]: %%timeit # app#3
...: a1D,b1D = view1D(a,b)
...: out = np.count_nonzero(a1D != b1D)
1000 loops, best of 3: 797 µs per loop
You can try np.ravel() if you want an element-wise comparison:
(a.ravel() != b.ravel()).sum()
or count the columns that differ anywhere:
(a - b).any(axis=0).sum()
Both lines above give 2 as output.
If you want a row-wise comparison, you can use:
(a - b).any(axis=1).sum()
This gives 1 as output.
You can use numpy.any for this:
y = np.any(a - b, axis=1).sum()
Would this work?
y = sum((a[i] != b[i]).any() for i in range(len(a)))
Sorry that I can’t test this myself right now.
Assume that I have two arrays A and B, where both A and B are m x n. My goal is now, for each row of A and B, to find where I should insert the elements of row i of A in the corresponding row of B. That is, I wish to apply np.digitize or np.searchsorted to each row of A and B.
My naive solution is to simply iterate over the rows. However, this is far too slow for my application. My question is therefore: is there a vectorized implementation of either algorithm that I haven't managed to find?
We can add to each row an offset that grows from row to row, using the same offsets for both arrays. The idea is then to use np.searchsorted on the flattened versions of the input arrays, so that each row of b is restricted to finding sorted positions within the corresponding row of a. Sizing the offset by the spread of the values (max minus min) rather than the maximum alone makes it work for negative numbers too.
So, we would have a vectorized implementation like so -
def searchsorted2d(a, b):
    m, n = a.shape
    # Offset each row by more than the full value range so rows cannot overlap
    max_num = np.maximum(a.max() - a.min(), b.max() - b.min()) + 1
    r = max_num * np.arange(a.shape[0])[:, None]
    p = np.searchsorted((a + r).ravel(), (b + r).ravel()).reshape(m, -1)
    return p - n * (np.arange(m)[:, None])
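A small worked example (arrays of my choosing):

a = np.array([[1, 3, 5],
              [2, 4, 6]])
b = np.array([[2, 2, 4],
              [1, 5, 7]])
print(searchsorted2d(a, b))
# [[1 1 2]
#  [0 2 3]]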
Runtime test -
In [173]: def searchsorted2d_loopy(a,b):
     ...:     out = np.zeros(a.shape,dtype=int)
     ...:     for i in range(len(a)):
     ...:         out[i] = np.searchsorted(a[i],b[i])
     ...:     return out
     ...:
In [174]: # Setup input arrays
...: a = np.random.randint(11,99,(10000,20))
...: b = np.random.randint(11,99,(10000,20))
...: a = np.sort(a,1)
...: b = np.sort(b,1)
...:
In [175]: np.allclose(searchsorted2d(a,b),searchsorted2d_loopy(a,b))
Out[175]: True
In [176]: %timeit searchsorted2d_loopy(a,b)
10 loops, best of 3: 28.6 ms per loop
In [177]: %timeit searchsorted2d(a,b)
100 loops, best of 3: 13.7 ms per loop
The solution provided by @Divakar is ideal for integer data, but beware of precision issues for floating point values, especially if they span multiple orders of magnitude (e.g. [[1.0, 2.0, 3.0, 1.0e+20],...]). In some cases r may be so large that adding it to a and b wipes out the original values you're trying to run searchsorted on, and you end up just comparing r to r.
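A one-line illustration of that precision loss (float64 carries roughly 16 significant digits, so an offset near 1e20 absorbs small values entirely):

print((1.0 + 1e20) == (2.0 + 1e20))  # True: both sums round to 1e20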
To make the approach more robust for floating-point data, you could embed the row information into the arrays as part of the values (as a structured dtype), and run searchsorted on these structured dtypes instead.
def searchsorted_2d(a, v, side='left', sorter=None):
    import numpy as np
    # Make sure a and v are numpy arrays.
    a = np.asarray(a)
    v = np.asarray(v)
    # Augment a with row id
    ai = np.empty(a.shape, dtype=[('row', int), ('value', a.dtype)])
    ai['row'] = np.arange(a.shape[0]).reshape(-1, 1)
    ai['value'] = a
    # Augment v with row id
    vi = np.empty(v.shape, dtype=[('row', int), ('value', v.dtype)])
    vi['row'] = np.arange(v.shape[0]).reshape(-1, 1)
    vi['value'] = v
    # Perform searchsorted on augmented array.
    # The row information is embedded in the values, so only the equivalent rows
    # between a and v are considered.
    result = np.searchsorted(ai.flatten(), vi.flatten(), side=side, sorter=sorter)
    # Restore the original shape, decode the searchsorted indices so they
    # apply to the original data.
    result = result.reshape(vi.shape) - vi['row'] * a.shape[1]
    return result
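A small demonstration of the robustness difference, on inputs I picked to trigger the precision issue (this assumes your numpy version supports searchsorted on structured dtypes, as the timing runs below did):

a = np.array([[1.0, 2.0, 3.0, 1.0e20],
              [1.0, 2.0, 3.0, 1.0e20]])
v = np.array([[2.5],
              [2.5]])
print(searchsorted_2d(a, v))  # [[2] [2]]  -- correct for both rows
print(searchsorted2d(a, v))   # [[2] [-1]] -- the offset swamps row 1's small values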
Edit: The timing on this approach is abysmal!
In [21]: %timeit searchsorted_2d(a,b)
10 loops, best of 3: 92.5 ms per loop
You would be better off just using map over the array:
In [22]: %timeit np.array(list(map(np.searchsorted,a,b)))
100 loops, best of 3: 13.8 ms per loop
For integer data, @Divakar's approach is still the fastest:
In [23]: %timeit searchsorted2d(a,b)
100 loops, best of 3: 7.26 ms per loop
I have two structured 2D numpy arrays which are equal in principle, meaning
A = numpy.array([[a1,b1,c1],
[a2,b2,c2],
[a3,b3,c3],
[a4,b4,c4]])
B = numpy.array([[a2,b2,c2],
[a4,b4,c4],
[a3,b3,c3],
[a1,b1,c1]])
Not in the sense of
numpy.array_equal(A,B) # False
numpy.array_equiv(A,B) # False
numpy.equal(A,B) # ndarray of True and False
But in the sense that one array (A) is the original and in the other one (B) the data is shuffled along one axis (could be along the rows or columns).
What is an efficient way to sort/shuffle B to match or become equal to A or alternatively sort A to become equal to B? An equality check is indeed not important, as long as both arrays are shuffled to match each other. A and hence B have unique rows.
I tried the view method to sort both the arrays like so
def sort2d(A):
    A_view = np.ascontiguousarray(A).view(np.dtype((np.void,
                                          A.dtype.itemsize * A.shape[1])))
    A_view.sort()
    return A_view.view(A.dtype).reshape(-1, A.shape[1])
but that doesn't work here, apparently. This operation needs to be performed for really large arrays, so performance and scalability are critical.
Based on your example, it seems that you have shuffled all of the columns simultaneously, such that there is a vector of row indices that maps A→B. Here's a toy example:
A = np.random.permutation(12).reshape(4, 3)
idx = np.random.permutation(4)
B = A[idx]
print(repr(A))
# array([[ 7, 11, 6],
# [ 4, 10, 8],
# [ 9, 2, 0],
# [ 1, 3, 5]])
print(repr(B))
# array([[ 1, 3, 5],
# [ 4, 10, 8],
# [ 7, 11, 6],
# [ 9, 2, 0]])
We want to recover a set of indices, idx, such that A[idx] == B. This will be a unique mapping if and only if A and B contain no repeated rows.
One efficient* approach would be to find the indices that would lexically sort the rows in A, then find where each row in B would fall within the sorted version of A. A useful trick is to view A and B as 1D arrays using an np.void dtype that treats each row as a single element:
rowtype = np.dtype((np.void, A.dtype.itemsize * A.size // A.shape[0]))
# A and B must be C-contiguous, might need to force a copy here
a = np.ascontiguousarray(A).view(rowtype).ravel()
b = np.ascontiguousarray(B).view(rowtype).ravel()
a_to_as = np.argsort(a) # indices that sort the rows of A in lexical order
Now we can use np.searchsorted to perform a binary search for where each row in B would fall within the sorted version of A:
# using the `sorter=` argument rather than `a[a_to_as]` avoids making a copy of `a`
as_to_b = a.searchsorted(b, sorter=a_to_as)
The mapping from A→B can be expressed as a composite of A→As→B
a_to_b = a_to_as.take(as_to_b)
print(np.all(A[a_to_b] == B))
# True
If A and B contain no repeated rows, the inverse mapping from B→A can also be obtained using
b_to_a = np.argsort(a_to_b)
print(np.all(B[b_to_a] == A))
# True
As a single function:
def find_row_mapping(A, B):
    """
    Given A and B, where B is a copy of A permuted over the first dimension,
    find a set of indices idx such that A[idx] == B.
    This is a unique mapping if and only if there are no repeated rows in
    A and B.
    Arguments:
        A, B: n-dimensional arrays with same shape and dtype
    Returns:
        idx: vector of indices into the rows of A
    """
    if not (A.shape == B.shape):
        raise ValueError('A and B must have the same shape')
    if not (A.dtype == B.dtype):
        raise TypeError('A and B must have the same dtype')
    rowtype = np.dtype((np.void, A.dtype.itemsize * A.size // A.shape[0]))
    a = np.ascontiguousarray(A).view(rowtype).ravel()
    b = np.ascontiguousarray(B).view(rowtype).ravel()
    a_to_as = np.argsort(a)
    as_to_b = a.searchsorted(b, sorter=a_to_as)
    return a_to_as.take(as_to_b)
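Used on the toy A and B from above, this recovers the permutation in one call; since the rows are unique, the mapping can also be inverted:

idx = find_row_mapping(A, B)
assert np.array_equal(A[idx], B)
assert np.array_equal(B[np.argsort(idx)], A)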
Benchmark:
In [1]: gen = np.random.RandomState(0)
In [2]: %%timeit A = gen.rand(1000000, 100); B = A.copy(); gen.shuffle(B)
....: find_row_mapping(A, B)
1 loop, best of 3: 2.76 s per loop
*The most costly step would be the quicksort over rows which is O(n log n) on average. I'm not sure it's possible to do any better than this.
Since either one of the arrays could be shuffled to match the other, nothing stops us from re-arranging both. Using Jaime's answer, we can vstack both arrays and find the unique rows. The inverse indices returned by unique then essentially give the desired mapping (as the arrays don't contain duplicate rows).
Let's first define a unique2d function for convenience:
def unique2d(arr, consider_sort=False, return_index=False, return_inverse=False):
    """Get unique values along an axis for 2D arrays.

    input:
        arr: 2D array
        consider_sort: Does permutation of the values within the axis matter?
            Two rows can contain the same values but with different
            arrangements. If consider_sort is True then those rows would
            be considered equal.
        return_index: Similar to numpy unique
        return_inverse: Similar to numpy unique
    returns:
        2D array of unique rows
        If return_index is True also returns indices
        If return_inverse is True also returns the inverse array
    """
    if consider_sort:
        a = np.sort(arr, axis=1)
    else:
        a = arr
    b = np.ascontiguousarray(a).view(np.dtype((np.void,
                                               a.dtype.itemsize * a.shape[1])))
    if not return_inverse:
        _, idx = np.unique(b, return_index=True)
    else:
        _, idx, inv = np.unique(b, return_index=True, return_inverse=True)
    if not return_index and not return_inverse:
        return arr[idx]
    elif return_index and not return_inverse:
        return arr[idx], idx
    elif not return_index and return_inverse:
        return arr[idx], inv
    else:
        return arr[idx], idx, inv
We can now define our mapping as follows
def row_mapper(a, b, consider_sort=False):
    """Given two 2D numpy arrays returns mappers idx_a and idx_b
    such that a[idx_a] = b[idx_b]"""
    assert a.dtype == b.dtype
    assert a.shape == b.shape
    c = np.concatenate((a, b))
    _, inv = unique2d(c, consider_sort=consider_sort, return_inverse=True)
    mapper_a = inv[:b.shape[0]]
    mapper_b = inv[b.shape[0]:]
    return np.argsort(mapper_a), np.argsort(mapper_b)
Verify:
n = 100000
A = np.arange(n).reshape(n//4,4)
B = A[::-1,:]
idx_a, idx_b = row_mapper(A,B)
print(np.all(A[idx_a] == B[idx_b]))
# True
Benchmark:
benchmark against @ali_m's solution
%timeit find_row_mapping(A,B) # ali_m's solution
%timeit row_mapper(A,B) # current solution
# n = 100
100000 loops, best of 3: 12.2 µs per loop
10000 loops, best of 3: 47.3 µs per loop
# n = 1000
10000 loops, best of 3: 49.1 µs per loop
10000 loops, best of 3: 148 µs per loop
# n = 10000
1000 loops, best of 3: 548 µs per loop
1000 loops, best of 3: 1.6 ms per loop
# n = 100000
100 loops, best of 3: 6.96 ms per loop
100 loops, best of 3: 19.3 ms per loop
# n = 1000000
10 loops, best of 3: 160 ms per loop
1 loops, best of 3: 372 ms per loop
# n = 10000000
1 loops, best of 3: 2.54 s per loop
1 loops, best of 3: 5.92 s per loop
Although there may be room for improvement, the current solution is 2-3 times slower than @ali_m's solution and perhaps a little messier, since both arrays need to be mapped. Just thought this could be an alternate solution.
I have 2 arrays: x and bigx. They span the same range, but bigx has many more points.
e.g.
x = np.linspace(0,10,100)
bigx = np.linspace(0,10,1000)
I want to find the indices in bigx where x and bigx match to 2 significant figures. I need to do this extremely quickly as I need the indices for each step of an integral.
Using numpy.where is very slow:
index_bigx = [np.where(np.around(bigx,2) == i) for i in np.around(x,2)]
Using numpy.in1d is ~30x faster
index_bigx = np.where(np.in1d(np.around(bigx,2), np.around(x,2)))
I also tried using zip and enumerate, as I know that's supposed to be faster, but it returns empty:
>>> index_bigx = [i for i,(v,myv) in enumerate(zip(np.around(bigx,2), np.around(x,2))) if myv == v]
>>> print(index_bigx)
[]
I think I must have muddled things here and I want to optimise it as much as possible. Any suggestions?
Since bigx is always evenly spaced, it's quite straightforward to just directly compute the indices:
start = bigx[0]
step = bigx[1] - bigx[0]
indices = ((x - start)/step).round().astype(int)
Linear time, no searching necessary.
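A quick check with the arrays from the question: every computed index should point at the bigx grid point nearest to the corresponding x value, i.e. land within half a step of it:

x = np.linspace(0, 10, 100)
bigx = np.linspace(0, 10, 1000)
start = bigx[0]
step = bigx[1] - bigx[0]
indices = ((x - start) / step).round().astype(int)
assert np.all(np.abs(bigx[indices] - x) <= step / 2)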
Since we are mapping x into bigx, whose elements are equidistant, you can use np.searchsorted as a binning operation to simulate the index finding, using its 'left' option. Here's the implementation -
out = np.searchsorted(np.around(bigx,2), np.around(x,2),side='left')
Runtime tests
In [879]: import numpy as np
...:
...: xlen = 10000
...: bigxlen = 70000
...: bigx = 100*np.linspace(0,1,bigxlen)
...: x = bigx[np.random.permutation(bigxlen)[:xlen]]
...:
In [880]: %timeit np.where(np.in1d(np.around(bigx,2), np.around(x,2)))
...: %timeit np.searchsorted(np.around(bigx,2), np.around(x,2),side='left')
...:
100 loops, best of 3: 4.1 ms per loop
1000 loops, best of 3: 1.81 ms per loop
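One caveat worth checking: searchsorted returns insertion positions, which coincide with the match positions only because every rounded x in this setup is actually present in the rounded bigx. A quick sanity check:

out = np.searchsorted(np.around(bigx, 2), np.around(x, 2), side='left')
assert np.all(np.around(bigx, 2)[out] == np.around(x, 2))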
If you want just the elements, this should work:
np.intersect1d(np.around(bigx,2), np.around(x,2))
If you want the indices, try this:
around_x = set(np.around(x,2))
index_bigx = [i for i,b in enumerate(np.around(bigx,2)) if b in around_x]
Note: these were not tested.