Find number of zeros before non-zero in a numpy array - python

I have a numpy array A. I would like to return the number of zeros before a non-zero in A in an efficient way as it is in a loop.
If A = np.array([0,1,2]) then np.nonzero(A)[0][0] returns 1. However if A = np.array([0,0,0]) this doesn't work (I would like the answer 3 in this case). And also if A is very big and the first non-zero is near the beginning this seems inefficient.

By adding a nonzero number at the end of the array, you can still use np.nonzero to get your desired outcome.
A = np.array([0,1,2])
B = np.array([0,0,0])
np.min(np.nonzero(np.hstack((A, 1)))) # --> 1
np.min(np.nonzero(np.hstack((B, 1)))) # --> 3

i = np.argmax(A!=0)
if i==0 and np.all(A==0): i=len(A)
This should be the most performant solution without extensions. Also easily vectorized to act along multiple axes.

Here's an iterative Cython version, which may be your best bet if this is a serious bottleneck
# saved as file count_leading_zeros.pyx
import numpy as np
cimport numpy as np
cimport cython
DTYPE = np.int
ctypedef np.int_t DTYPE_t
#cython.boundscheck(False)
def count_leading_zeros(np.ndarray[DTYPE_t, ndim=1] a):
cdef int elements = a.size
cdef int i = 0
cdef int count = 0
while i < elements:
if a[i] == 0:
count += 1
else:
return count
i += 1
return count
This is similar to #mtrw's answer but with indexing at native speeds. My Cython is a bit sketchy so there may be further improvements to be made.
A quick test of an extremely favourable case with IPython with a few different methods
In [1]: import numpy as np
In [2]: import pyximport; pyximport.install()
Out[2]: (None, <pyximport.pyximport.PyxImporter at 0x53e9250>)
In [3]: import count_leading_zeros
In [4]: %paste
def count_leading_zeros_python(x):
ctr = 0
for k in x:
if k == 0:
ctr += 1
else:
return ctr
return ctr
## -- End pasted text --
In [5]: a = np.zeros((10000000,), dtype=np.int)
In [6]: a[5] = 1
In [7]:
In [7]: %timeit np.min(np.nonzero(np.hstack((a, 1))))
10 loops, best of 3: 91.1 ms per loop
In [8]:
In [8]: %timeit np.where(a)[0][0] if np.shape(np.where(a)[0])[0] != 0 else np.shape(a)[0]
10 loops, best of 3: 107 ms per loop
In [9]:
In [9]: %timeit count_leading_zeros_python(a)
100000 loops, best of 3: 3.87 µs per loop
In [10]:
In [10]: %timeit count_leading_zeros.count_leading_zeros(a)
1000000 loops, best of 3: 489 ns per loop
However I'd only use something like this if I had evidence (with a profiler) that this was a bottleneck. Many things may seem inefficient but are never worth your time to fix.

What's wrong with the naive approach:
def countLeadingZeros(x):
""" Count number of elements up to the first non-zero element, return that count """
ctr = 0
for k in x:
if k == 0:
ctr += 1
else: #short circuit evaluation, we found a non-zero so return immediately
return ctr
return ctr #we get here in the case that x was all zeros
This returns as soon as a non-zero element is found, so it is O(n) in the worst case. You could make it faster by porting it to C, but it would be worth testing to see if that is really necessary for the arrays you're working with.

I am surprised why nobody has not used np.where yet
np.where(a)[0][0] if np.shape(np.where(a)[0])[0] != 0 else np.shape(a)[0] will do the trick
>> a = np.array([0,1,2])
>> np.where(a)[0][0] if np.shape(np.where(a)[0])[0] != 0 else np.shape(a)[0]
... 1
>> a = np.array([0,0,0))
>> np.where(a)[0][0] if np.shape(np.where(a)[0])[0] != 0 else np.shape(a)[0]
... 3
>> a = np.array([1,2,3))
>> np.where(a)[0][0] if np.shape(np.where(a)[0])[0] != 0 else np.shape(a)[0]
... 0

If you don't care about the speed, I have a small trick to do the job:
a = np.array([0,0,1,1,1])
t = np.where(a==0,1,0)+np.append(np.where(a==0,0,1),0)[1:]
print t
[1 2 1 1 0]
np.where(t==2)
(array([1]),)

Related

Run two nested for loops in parallel to create matrix

I've written a method that takes in an integer "n" and creates a square matrix where the values of each element are dictated by their respective i,j indices.
When I build a small matrix 30x30 it works just fine, but when I try to do something larger like 1000x1000 it takes very long. Is there any way that I can speed it up with multiprocessing?
def createMatrix(n):
matrix = []
for j in range(1,n+1):
row = []
for i in range(1,n+1):
value = 1/(i+j-1)
row.append(value)
matrix.append(row)
return np.array(matrix)
Parallelizing two computation-bound for loops in Python is not trivial because of GIL. The good news is that your case is perfectly vectorizeable:
def createMatrix(n):
return 1 / (np.arange(n)[None, :] + np.arange(n)[:, None] + 1)
Explanation:
essentially, your formula for the matrix is X[row][column] = 1/(row+column-1), where rows and columns are 1-based
np.arange(n) creates a range that can be used for rows or columns
[None, :] and [:, None] turn it into a 2d array, 1 x n or n x 1
numpy then broadcasts dimensions, replicating row and column indexes to match dimensions - thus, implicitly tiling both into n x n when added
since both ranges are 0-based, using +1 instead of -1
As a rule of thumb, it is almost never a good idea to use for loops on numpy arrays. A vectorized approach (i.e. matrix form computations) is orders of magnitude faster.
It's not a good idea to use fors to fill a list then convert it to a matrix. the operation that you have can be vectorized with numpy from scratch. if you think that given the i,j, M(i,j) = 1/(j+i-1) considering that both indices starts at 1.
Here's my proposal :
def createMatrix2(n):
arr =np.arange(1,n+1)
xx,yy = np.meshgrid(arr,arr)
matrix = 1/(xx+yy-1)
return matrix
looking at Marat answer, I think his/her it's better, so tested the 3 methods:
EDIT: added wwii method as createMatrix4 (correcting the errors):
import numpy as np
from time import time
def createMatrix1(n):
matrix = []
for j in range(1,n+1):
row = []
for i in range(1,n+1):
value = 1/(i+j-1)
row.append(value)
matrix.append(row)
return np.array(matrix)
def createMatrix2(n):
arr =np.arange(1,n+1)
xx,yy = np.meshgrid(arr,arr)
matrix = 1/(xx+yy-1)
return matrix
def createMatrix3(n):
"""Marat's proposed matrix"""
return 1 / (1 + np.arange(n)[None, :] + np.arange(n)[:, None])
def createMatrix4(n):
""" wwii method"""
i,j = np.ogrid[1:n,1:n]
return 1/(i+j-1)
#test all the three methods
n = 10000
t1 = time()
m1 = createMatrix1(n)
t2 = time()
m2 = createMatrix2(n)
t3 = time()
m3 = createMatrix3(n)
t4 = time()
m4 = createMatrix4(n)
t5 = time()
print(np.allclose(m1,m2))
print(np.allclose(m1,m3))
print(np.allclose(m1,m4))
print("Matrix 1 (OP): ",t2-t1)
print("Matrix 2: (mine)",t3-t2)
print("Matrix 3: (Marat)",t4-t3)
print("Matrix 4: (wwii)",t5-t4)
# the output is:
#True
#True
#True
#Matrix 1 (OP): 18.4886577129364
#Matrix 2: (mine) 1.005324363708496
#Matrix 3: (Marat) 0.43033909797668457
#Matrix 4: (wwii) 0.5138359069824219
So Marat's solution is faster. As general comments:
Try to avoid fors loops
Think your problem as operation with indices and dessing operations with numpy arrays directly.
For last, given Marat's answer I thought my proposal is a easier to read, and understand. But it's just a subjective view
Your code can be written in another style, accelerated by numba library in a parallel no python mode:
import numba as nb
#nb.njit("float64[:, ::1](int64)", parallel=True, fastmath=True)
def createMatrix(n):
matrix = np.empty((n, n)) # np.zeros is slower than np.empty
for j in nb.prange(1, n + 1):
for i in range(1, n + 1):
matrix[j - 1, i - 1] = 1 / (i + j - 1)
return matrix
This solution will be faster than the Marat answer above 3 times.
Benchmarks: (temporary link to colab)
n = 1000
1000 loops, best of 5: 3.52 ms per loop # Marat
1000 loops, best of 5: 1.5 ms per loop # numba accelerated with np.zeros
1000 loops, best of 5: 1.05 ms per loop # numba accelerated with np.empty
n = 3000
1000 loops, best of 5: 39.5 ms per loop
1000 loops, best of 5: 19.3 ms per loop
1000 loops, best of 5: 8.91 ms per loop
n = 5000
1000 loops, best of 5: 109 ms per loop
1000 loops, best of 5: 53.5 ms per loop
1000 loops, best of 5: 24.8 ms per loop

Customized search for consequtive values in numpy array

Please assume the following NumPy array:
A = array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0])
I would like to find the indices of this array that N consecutive values are equal to zero (inclusive).
For example, assume N=3. We know that A[2]=0 while A[3]>0. Thus, the second element of array A is not with three consecutive zero value (inclusive). A desirable outcome for array A would look like the below:
B = array([False, False, False, False, True, True, True, True, False, False])
I can write a loop as the answer to this question:
N = 3
A = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
B = np.zeros(len(A), dtype=bool)
for i in range(len(A)):
if (i + N <= len(A)) and (sum(A[i:i + N]) == 0):
B[i] = True
Since matrix A could be much larger and I need to do the above search process millions of times, I need the fastest possible way. What would be your suggestion?
TL;DR
This is a significant improvement over the original approach as long as the block size N (below referred to as m or size) is large enough, and the input has a significant number of non-zero values:
import numpy as np
import numba as nb
#nb.njit
def find_blocks_while2_nb(arr, size, value):
n = arr.size
result = np.zeros(n, dtype=np.bool_)
i = j = 0
while i < n - size + 1:
# last iteration found a block, only checking one value
if j < i and arr[i + size - 1] == value:
result[i] = True
i += 1
else:
# search backward for index of last non-block value
j = i + size - 1
while j >= i and arr[j] == value:
j -= 1
if j < i:
# block found, advance by 1
result[i] = True
i += 1
else:
# block not found, advance by last non-block value occurrence
j += 1
i = j
return result
Essentially it is a Numba accelerated version of your original approach with a number of micro-optimizations:
The number of times the main loop is executed is reduced by setting proper limit to the main counter
The part that checks if the block contains all the same values:
avoids a relatively expensive array slicing
returns the position where it failed
runs backward
avoids re-checking consecutive blocks
For small block sizes, more micro-optimized approach may be faster as introduced in other answers, and briefly discussed below.
Introduction
The problem of finding the indices where a block of identical values starts can be seen as a specialization of a classical problem in computer science: finding a sub-array within an array (discussed in some detail here).
In this case, the sub-string (i.e. the block) consists of a series of identical values, more precisely 0s.
Theory
The complexity of this problem depends, in general, on the size of the array (n) and the size of the block (m or size), as well as the actual values within them (obviously).
Of course, it is assumed that n > m, as it is clear that if n == m the solution is trivial, and when n < m the array cannot fit the block.
For the same reason, no index above n - m can be found.
While a naïve implementation would require essentially two nested loops, one for string and one for sub-string (which would result is O(nm) computational complexity), it is possible to make use of the fact that the block consists of identical values to run the algorithm in guaranteed linear time O(n) and with constant memory requirement O(1).
However, under the assumption that the input array has a mixed distribution of both block and non-block values, it is possible to run in approx. O(n / (m - ε)) time and still with O(1) memory consumption. The ε < m - 1 is a parameter proportional to the probability of finding a non-block value.
The idea is that when trying to determine if a block starts at position i if the array has a non-block value at position j with j - i < m then the next position to evaluate is j + 1, since, the block cannot fit between i and j.
Moreover, if the search starts from the back (i.e. at position i - m), then the number of position that are not checked is maximal for sufficiently heterogeneous inputs, as all j - i values do not need to be checked.
For similar reasons, if it is known that a block starts at position i, to check whether a block starts at position i + j with j < m, only the values between i + m and i + m + j needs to checked, and the values between i and i + m - 1 do not need to checked again.
Implementations
This a manual case where a generator is a perfect fit, because it is possible to leverage its lazy evaluation properties to perform the bit of computation only as needed.
Additionally, it is not known in advance how large the result will be (i.e. it is possible to find many blocks or no blocks at all).
However, if this is part of a larger computation it may be faster to pre-allocate the size of the result and just modify the value its indices.
There should not be a need to allocate more than n - m + 1 values.
However, for compatibility with the requirements of the question, an array the size of the input is pre-allocated.
If needed, it is easy to adapt the following implementations to avoid wasting this extra memory.
More precisely, the solutions based on explicit looping can be converted to generators by replacing result[i] = True with yield i (or similar).
Likewise, solutions that compute the boolean array natively can be converted to generators with something like:
for i in np.nonzero(result)[0]:
yield i
The remainder of the analysis will assume that the target result is the boolean array.
One of the challenges of implementing this in pure Python is that explicit looping is relatively slow.
For this reason, it is convenient to use some acceleration technique (e.g. Numba or Cython) to reduce computation time.
Alternatively, given that the algorithm consists essentially of two loops (one for the array and one for the sub-array) one may seek to vectorize one or both loops.
However, do note that these approaches would necessarily run in O(nm) worst-case time.
The basic naïve approach is the following:
def find_blocks_for2(arr, size, value):
n = arr.size
result = np.zeros(n, dtype=np.bool_)
for i in range(n - size + 1):
all_equal = True
for j in range(i, i + size):
if arr[j] != value:
all_equal = False
break # vectorized approaches may not have this
if all_equal:
result[i] = True
return result
This can be improved in a number of directions.
One way stems from the fact that the inner loop is actually useless, because it is sufficient to keep track of the number of consecutive "good" values present (this is essentially the same as MichaelSzczesny's answer):
import numpy as np
def find_blocks_for(arr, size, value):
n = arr.size
result = np.zeros(n, dtype=np.bool_)
block_size = 0
for i in range(n):
if arr[i] == value:
block_size += 1
else:
block_size = 0
if block_size >= size:
result[i - size + 1] = True
return result
Note that this can be seen as a specialization of the Rabin-Karp algorithm where the role of the hashing is played by the block_size counter.
Another possible approach is to skip the second loop only when the last iteration already identified the presence of a block and have it running backward to maximize the jump to the next possible block:
import numpy as np
def find_blocks_while2(arr, size, value):
n = arr.size
result = np.zeros(n, dtype=np.bool_)
i = j = 0
while i < n - size + 1:
# last iteration found a block, only checking one value
if j < i and arr[i + size - 1] == value:
result[i] = True
i += 1
else:
# search backward for index of last non-block value
j = i + size - 1
while j >= i and arr[j] == value:
j -= 1
if j < i:
# block found, advance by 1
result[i] = True
i += 1
else:
# block not found, advance by last non-block value occurrence
j += 1
i = j
return result
Note that this can be seen as a specialization of the Knuth-Morris-Pratt algorithm without the need of computing offsets for each value of the block (they are all the same).
A number of variations can be devised for the above approaches, all with the same asymptotic behavior but slightly different running time depending on the actual input.
For example, one could change the order of the conditions to speed-up the case where many consecutive blocks are found, at the expenses of running more instructions elsewhere, or the other way around.
All these run just too slow without acceleration either with Numba, Cython or Pypy.
Pypy uses a different interpreter and I will not discuss/benchmark it further.
Numba provides one of the most straightforward and effective acceleration (just apply a decorator), e.g.:
import numba as nb
find_blocks_while2_nb = nb.njit(find_blocks_while2)
find_blocks_while2_nb.__name__ = "find_blocks_while2_nb"
Cython can also be used, e.g. (using %load_ext Cython Jupyter magic), but to take full advantage of the compilation it would require extra care to make sure bottleneck code runs without Python interaction:
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True
import cython as cy
import numpy as np
cdef _find_blocks_lin_cy(long[::1] arr, char[::1] result, Py_ssize_t n, Py_ssize_t size, long value):
cdef Py_ssize_t block_size = 0
for i in range(n):
if arr[i] == value:
block_size += 1
else:
block_size = 0
if block_size >= size:
result[i - size + 1] = True
def find_blocks_lin_cy(arr, size, value):
n = arr.size
result = np.zeros(n, dtype=np.bool_)
_find_blocks_lin_cy(arr, result, n, size, value)
return result
Whether it is going to be faster Numba or Cython depends on how much optimization can be triggered by the code, and as such depends also on which combination of compiler and CPU is used.
Alternatively, either the inner or the outer loop (or both) can be vectorized.
Of course this is most effective on whether it is run more the inner or the outer loop.
The largest loop should be run vectorized for maximum speed-up.
Vectorizing the inner loop would lead to something close to this (similar to the original post, but with smaller boundaries and sum() replaced with np.all() which also sports short-circuiting):
def find_blocks_for_all(arr, size, value):
n = arr.size
result = np.zeros(n, dtype=np.bool_)
for i in range(n - size + 1):
if np.all(arr[i:i + size] == value):
result[i] = True
return result
Vectorizing the outer loop would lead to something close to this (similar to what is presented in DanielF's answer, but with overall cleaner code, especially avoiding unnecessary function calls):
def find_blocks_for_and(arr, size, value):
result = (arr == value)
for i in range(1, size):
result[:-1] &= result[1:]
result[1 - size:] = False
return result
Vectorizing both loops would lead to something close to this (similar to one approach from NaphatAmundsen's answer):
def find_blocks_strides(arr, size, value):
n = arr.size
strides = (arr.itemsize,) * 2
block = np.full(size, value, dtype=arr.dtype)
# note that `as_strided()` does not allocate extra memory
# the downside is that buffer overflow will occur and
# extra care must be taken not to use data outside the boundaries
check = np.lib.stride_tricks.as_strided(arr, (n, size), strides)
result = np.all(check == block, axis=-1)
result[1 - size:] = False
return result
The above code can be further specialized under the assumption that the input will contain only positive values and that block consists of zeros only.
This would mean replacing np.all() calls with np.sum() or similar.
One approach where this is actually beneficial is with the fully vectorized approach, where as_strided() can be replaced by computing the correlation with a non-zero block (similar to one approach from NaphatAmundsen's answer):
def find_blocks_0_conv(arr, size):
n = arr.size
result = np.zeros_like(arr, dtype=np.bool_)
block = np.ones(size, dtype=arr.dtype)
result[:1 - size] = (np.correlate(arr, block, 'valid') == 0)
return result
Aside of the looping, the running time will also depend on how the code translates to hardware instructions and ultimately how fast are them.
For example, if NumPy is compiled with full support of SIMD instructions, it may very well be that a vectorized (but not theoretically optimal) approach would outperform, for many input combinations, theoretically optimal approaches that are naïvely accelerated.
The same could happen for approaches that are written in a way that acceleration techniques can make very good use of SIMD instructions.
However, the extra speed they provide will be limited to specific systems (i.e. specific combinations of compilers and processors).
One such example is the following (essentially a polishing of Jérôme Richard's answer to avoid global variables and removing unnecessary looping):
def find_blocks_simd_nb_cached(arr, size, value=0, cache={}):
assert size > 0
if size not in cache:
print('cached: ', list(cache.keys()))
def find_blocks_simd_nb(arr, value):
n = arr.size
result = np.zeros(n, dtype=np.bool_)
check = (arr == value)
for i in range(n - size + 1):
# using a `temp` variable avoid unnecessarily accessing the `check` array
temp = check[i]
for j in range(1, size):
temp &= check[i + j]
result[i] = temp
return result
cache[size] = nb.njit(find_blocks_simd_nb)
return cache[size](arr, value)
Benchmarking
Here, some benchmarks are reported.
As always, they should be taken with a grain of salt.
Only the fastest approaches are considered, for varying input sizes, block sizes and values.
The _nb and _cy (except for find_blocks_for_cy() are essentially non-optimized versions of the same functions without the _nb or _cy suffix from above.
Varying input size, block size = 4
Uniform distribution of 1s and 0s
All 0s
All 1s
Varying block size, input size = 600_000
Uniform distribution of 1s and 0s
All 0s
All 1s
Note that:
find_blocks_for_all() is actually fast for sufficiently large values of size
find_blocks_for_and() is very fast for smaller values of size and does not require Numba or Cython
find_blocks_for_nb() is essentially independent on the block size and performs overall really well
find_blocks_simd_nb() is very fast for small values of size but performs comparatively poorly for larger values
find_blocks_while2_nb() really shines for larger values of size and it is respectably fast for smaller values
(Full benchmarks available here).
For small values of N (not changing much at runtime), the best solution is to use a SIMD-friendly Numba implementation that is specialized for each possible specific N value. The compiler can generate a very efficient code when N <= ~30. For larger N values, the solution of #norok2 is start to be better since it can skip many items of the array as long as there are not too many 0 items. When the number of 0 is big, this function is still competitive as long as N <= 64. When N > 64, please use a better suited implementation.
# Cache dictionary meant to store function for a given N
cache = {}
def compute(A, N):
assert N > 0
if N not in cache:
# Numba function to be compiled for a specific N
def compute_specific(A):
out = np.zeros(A.size, dtype=np.bool_)
isZero = (A == 0).view(np.uint8)
for i in range(isZero.size-N+1):
allSame = isZero[i]
for j in range(1, N):
allSame &= isZero[i+j]
out[i] = allSame
for i in range(A.size-N+1, A.size):
out[i] = 0
return out
cache[N] = nb.njit(compute_specific)
return cache[N](A)
The compiler can auto-vectorize this code with SIMD instruction (like AVX-2) so that 32~64 items are computed per cycle on modern x86 processors. Larger N values require more SIMD instructions causing the function to be less efficient. In comparison, implementations using conditional branches tends to be slow since a miss-predicted branch generally cost about 10~15 cycles on mainstream x86 processors and each iteration can only compute one scalar item at a time (ie. ~2 order of magnitude slower assuming the amount of work would be the same).
Benchmark
Here are results on my machine (Intel Xeon Skylake) with a random (32-bit) integer arrays of size 100_000. There is about 50% of 0 values in the array. The Numba compilation time is excluded in all implementations.
With N = 3:
- Initial solution: 115_000 µs
- Ali_Sh: 1_575 µs
- Naphat (stridetricks): 929 µs
- Naphat (correlational): 528 µs
- Michael Szczesny: 350 µs
- Norok2: 306 µs
- Franciska: 295 µs
- This code: 11 us <---
With N = 10:
- Initial solution: 149_000 µs
- Ali_Sh: 1_592 µs
- Naphat (stridetricks): 1_139 µs
- Naphat (correlational): 831 µs
- Michael Szczesny: 489 µs
- Franciska: 444 µs
- Norok2: 98 µs
- This code: 14 µs <---
With N = 30:
- Ali_Sh: [FAIL]
- Initial solution: 255_000 µs
- Franciska: 2_285 µs
- Naphat (correlational): 1_936 µs
- Naphat (stridetricks): 1_628 µs
- Michael Szczesny: 647 µs
- Norok2: 30 µs
- This code: 25 µs <---
With N = 60:
- Ali_Sh: [FAIL]
- Initial solution: 414_000 µs
- Naphat (correlational): 3_814 µs
- Franciska: 3_242 µs
- Naphat (stridetricks): 3_048 µs
- Michael Szczesny: 816 µs
- This code: 50 µs <---
- Norok2: 14 µs
We can clearly see that this function is the fastest one (by a large margin) except when N is big (in this case, the Norok2's code is faster).
A solution with linear time complexity, compiled with numba.
The idea is to look at each element of A once and count occurrences of zeros in counter c. If a non-zero element is encountered, c is reset to 0. As long as c >= N the result array is set to True at the appropriate index.
import numba as nb # tested with numba 0.55.1
#nb.jit
def numba_linear(A, N):
B = np.zeros(len(A), dtype=np.bool_)
c = 0
for i in range(len(A)):
c = c+1 if A[i] == 0 else 0
if c >= N: B[i - N + 1] = 1
return B
A = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0])
numba_linear(A, 3)
Output
array([False, False, False, False, True, True, True, True, False,
False])
Benchmark against #Nephat baseline_numba with #Mercury's loop optimization applied (sum was faster than any in my benchmarks).
import numpy as np
#nb.njit
def baseline_numba(A: np.ndarray, N: int):
'''
You may need to run this once first to see performance gains
as this has to be JIT compiled by Numba
'''
B = np.zeros(len(A), dtype=np.int8)
for i in nb.prange(len(A)-N+1):
if A[i:i + N].sum() == 0:
B[i] = 1
return B
rng = np.random.default_rng()
A = rng.integers(2, size=20000)
N = 3
%timeit baseline_numba(A, N)
#10000 loops, best of 5: 120 µs per loop
%timeit numba_linear(A, N)
#10000 loops, best of 5: 29.4 µs per loop
np.testing.assert_array_equal(numba_linear(A,N), baseline_numba(A,N))
I have implemented a few other alternatives that you could try:
Use numba to JIT compile the loopy code
Use np.correlate to do a signal-processing type of correlation, then use nonzero to obtain indices
Same as above, but use stride_tricks to do the correlation instead
I would like to benchmark the solutions, but it is getting a bit late here. I can maybe do it after work tomorrow if you'd like.
I also use dtype=np.int8 for B since it is easier to visually confirm that the output is correct (see output).
import numpy as np
import numba
from numpy.lib.stride_tricks import as_strided
A = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0])
N = 3
def baseline(A: np.ndarray):
'''
Your solution
'''
B = np.zeros(len(A), dtype=np.int8)
for i in range(len(A)):
if (i + N <= len(A)) and (sum(A[i:i + N]) == 0):
B[i] = 1
return B
#numba.njit()
def baseline_numba(A: np.ndarray, N: int):
'''
You may need to run this once first to see performance gains
as this has to be JIT compiled by Numba
'''
B = np.zeros(len(A), dtype=np.int8)
for i in range(len(A)):
if (i + N <= len(A)) and (A[i:i + N].sum() == 0):
B[i] = 1
return B
def correlational(A: np.ndarray):
'''
Use a correlation operation
'''
B = np.zeros_like(A, dtype=np.int8)
B[(np.correlate(A, np.full(N, 1), 'valid') == 0).nonzero()[0]] = 1
return B
def stridetricks(A: np.ndarray):
'''
Same idea as using correlation, but with stride tricks
Maybe it will be faster? Probably not.
'''
u = np.array(A.itemsize)
B = np.zeros_like(A, dtype=np.int8)
B[(as_strided(A, (len(A) - N + 1, N), u * (1, 1)).sum(1) == 0).nonzero()[0]] = 1
return B
print(baseline(A))
print(baseline_numba(A, N))
print(correlational(A))
print(stridetricks(A))
Output:
[0 0 0 0 1 1 1 1 0 0]
[0 0 0 0 1 1 1 1 0 0]
[0 0 0 0 1 1 1 1 0 0]
[0 0 0 0 1 1 1 1 0 0]
You can use a convolution filter, to see the consecutive zeros. And as the "leftover" zeros at the end can't be True, if there isn't an N length sequence of them, you can always complete the result with Falses.
import numpy as np
N = 3
A = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0])
conv=np.convolve(np.ones((N)),A,mode='valid')
B=np.where(conv==0,True,False)
B=np.append(B,np.zeros((N-len(A)%N),dtype=bool))
the result is:
[False False False False True True True True False False]
Should be possible with only bitwise math. Not sure about speed, probably doesn't beat numba, but should top anything with a for loop otherwise.
def find_blocks(A, N):
reduce = lambda x: np.logical_and(x[:-1], x[1:], out = x[:-1])
x = np.logical_not(A)
for i in range(N-1):
reduce(x)
x[-N+1:] = False
return x
find_blocks(A, 3)
Out[]:
array([False, False, False, False, True, True, True, True, False,
False])
Finding consecutive values and related indices is one of subjects that have solution in other SO question. In this regard we can find range indices for repeated zeros and their numbers:
def zero_runs(a):
iszero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
absdiff = np.abs(np.diff(iszero))
ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
return ranges
# [[ 2 3]
# [ 4 10]]
When we found the indices ranges, we can fill the 1D array by using
def filled_array(start, end, length):
out = np.zeros((length), dtype=int)
np.add.at(out, start, 1)
np.add.at(out, end, -1)
return out.cumsum() > 0
def filled_array_v2(start, end, length):
out =np.bincount(start, minlength=length) - np.bincount(end, minlength=length)
return out.cumsum().astype(bool)
So, by using the aforementioned functions and defining a master function as below:
def final(A, N):
runs = zero_runs(A)
repeats_cound = runs[:, 1] - runs[:, 0]
# [1 6]
suspected_ranges = runs[repeats_cound >= 3]
# [[ 4 10]]
start = suspected_ranges[:, 0]
end = start + N + 1
return filled_array(start, end, length=A.size) # filled_array_v2(start, end, length=A.size)
# [False False False False True True True True False False]
In one of my benchmarks on data array volume 10000, this code ran faster than the proposed numba method by Naphat.
Some benchmarks (Colab)
By using any (instead sum) as #Mercury's loop optimization mentioned, we can improve Naphat's baseline_numba by parallelizing it usingnb.prange along with parallel=True in the decorator line.
By A size: 20000 on Colab:
# N ----> 3:
1000 loops, best of 5: 105 µs per loop Norok
--------------------
100 loops, best of 5: 35 ms per loop Nephat baseline
100 loops, best of 5: 730 µs per loop Nephat baseline_numba
1000 loops, best of 5: 225 µs per loop Nephat correlational
1000 loops, best of 5: 397 µs per loop Nephat stridetricks
--------------------
1000 loops, best of 5: 42.3 µs per loop Michael Szczesny (The fastest)
--------------------
1000 loops, best of 5: 352 µs per loop This answer
1000 loops, best of 5: 279 µs per loop Nephat baseline_numba PARALLELIZED
=================================================================
# N ----> 10:
1000 loops, best of 5: 25.3 µs per loop Norok (The fastest)
--------------------
100 loops, best of 5: 46.8 ms per loop Nephat baseline
100 loops, best of 5: 794 µs per loop Nephat baseline_numba
1000 loops, best of 5: 318 µs per loop Nephat correlational
1000 loops, best of 5: 433 µs per loop Nephat stridetricks
--------------------
1000 loops, best of 5: 31.9 µs per loop Michael Szczesny
--------------------
1000 loops, best of 5: 341 µs per loop This answer
1000 loops, best of 5: 291 µs per loop Nephat baseline_numba PARALLELIZED
=================================================================
# N ----> 20
1000 loops, best of 5: 6.52 µs per loop Norok (The fastest)
--------------------
1000 loops, best of 5: 31.5 µs per loop Michael Szczesny
--------------------
1000 loops, best of 5: 350 µs per loop This answer
1000 loops, best of 5: 292 µs per loop Nephat baseline_numba PARALLELIZED
=================================================================
# N ----> 30
1000 loops, best of 5: 4.01 µs per loop Norok (The fastest)
--------------------
1000 loops, best of 5: 31.8 µs per loop Michael Szczesny
Latest benchmarks:
The following results show the best performances, respectively, based on my last test on the prepared colab link (the last two cells):
for N <10 ---> Richard, (==) Daniel
for 10 < N <20 ---> Richard, Michael, (==) Norok
for 20 < N <60 ---> Norok, Richard, Michael
for N >60 ---> Norok, Michael, Richard
This one is numpy only and O(len(A)) similar to the answer from Ali_Sh but shorter and quicker.
It does require some awkward handling of the edges of the array.
The choice for a smaller dtype for the cumsum is a micro-optimization that I'm not sure is worth it..
def numpy_linear(A, N):
dtype = np.min_scalar_type(N)
assert issubclass(dtype.type, np.unsignedinteger)
# The cumsum can overflow for longer arrays
# but the difference calculation will still be correct
c = np.cumsum(A == 0, dtype=dtype)
B = np.r_[c[N-1] == N, c[N:]-c[:-N] == N, np.zeros(N-1, bool)]
return B
To Obtain one's index, you can start with the following method:
import bumpy as np
my__Truth = np.array([True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True])
np.sort(my_Truth, in_place=my_Truth[0])
# ~ Obtain Index ~ #
index_values = np.where(my_Truth)
print(index_values)
# un-comment if using python2
# print index_values

Python/Pandas find closest value above/below in one Column

I have a large data frame (~1 million rows). I am ultimately going to be interpolating spectra between two age groups. However I need to first find the nearest values above and below any age I need to find.
The DataFrame briefly looks like this
Age Wavelength Luminosity
1
1
1
4
4
6
6
I need to be able to input 5, and return values 4 and 6. I am struggling to find a way to do this? This is what I tried :
def findnearest(array,value):
idx = np.searchsorted(array,value, side='left')
if idx > 125893.0:
return array[idx]
else:
return array[idx]
idx1 = np.searchsorted(array,value, side='right')
if idx1 < 2e10:
return array[idx1]
else:
return array [idx1-1]
C = findnearest(m05_010['age'], 5.12e7)
print(C)
This only returns one value, and not both. Is this the right path or should I be doing something different? Is there a better way?
IIUC and assuming sorted input array, you can do something like this -
above = arr[np.searchsorted(arr,value,'left')-1]
below = arr[np.searchsorted(arr,value,'right')]
Sample runs -
Case 1: Without exact match for value
In [17]: arr = np.array([1,1,1,4,4,4,4,4,4,4,6,6])
In [18]: value = 5
In [19]: above = arr[np.searchsorted(arr,value,'left')-1]
...: below = arr[np.searchsorted(arr,value,'right')]
...:
In [20]: above, below
Out[20]: (4, 6)
Case 2: With exact match for value
In [33]: arr = np.array([1,1,1,4,4,4,4,4,4,4,5,5,5,6,6])
In [34]: value = 5
In [35]: above = arr[np.searchsorted(arr,value,'left')-1]
...: below = arr[np.searchsorted(arr,value,'right')]
...:
In [36]: above, below
Out[36]: (4, 6)
I think you should use bisect, its a lot faster and is made for this purpose only.
from bisect import *
arr = np.array([1,1,1,4,4,4,4,4,4,4,6,6])
value = 5
lower = arr[bisect_left(arr, value) - 1]
above = arr[bisect_right(arr, value)]
lower, above
Output -
(4, 6)
Heres the time comparison from Ipython -
%timeit for x in range(100): arr[bisect_left(arr, value)]
Output -
10000 loops, best of 3: 92.4 µs per loop
And using searchsorted -
%timeit for x in range(100): arr[np.searchsorted(arr,value,'left')-1]
Output -
The slowest run took 7.62 times longer than the fastest. This could
mean that an intermediate result is being cached. 10000 loops, best of
3: 142 µs per loop

Filling a list faster

I have a small block of code which I use to fill a list with integers. I need to improve its performance, perhaps translating the whole thing into numpy arrays, but I'm not sure how.
Here's the MWE:
import numpy as np
# List filled with integers.
a = np.random.randint(0,100,1000)
N = 10
b = [[] for _ in range(N-1)]
for indx,integ in enumerate(a):
if 0<elem<N:
b[integ-1].append(indx)
This is what it does:
for every integer (integ) in a
see if it is located between a given range (0,N)
if it is, store its index in a sub-list of b where the index of said sub-list is the original integer minus 1 (integ-1)
This bit of code runs pretty fast but my actual code uses much larger lists, hence the need to improve its performance.
Here's one way of doing it:
mask = (a > 0) & (a < N)
elements = a[mask]
indicies = np.arange(a.size)[mask]
b = [indicies[elements == i] for i in range(1, N)]
If we time the two:
import numpy as np
a = np.random.randint(0,100,1000)
N = 10
def original(a, N):
b = [[] for _ in range(N-1)]
for indx,elem in enumerate(a):
if 0<elem<N:
b[elem-1].append(indx)
return b
def new(a, N):
mask = (a > 0) & (a < N)
elements = a[mask]
indicies = np.arange(a.size)[mask]
return [indicies[elements == i] for i in range(1, N)]
The "new" way is considerably (~20x) faster:
In [5]: %timeit original(a, N)
100 loops, best of 3: 1.21 ms per loop
In [6]: %timeit new(a, N)
10000 loops, best of 3: 57 us per loop
And the results are identical:
In [7]: new_results = new(a, N)
In [8]: old_results = original(a, N)
In [9]: for x, y in zip(new_results, old_results):
....: assert np.allclose(x, y)
....:
In [10]:
The "new" vectorized version also scales much better to longer sequences. If we use a million-item-long sequence for a, the original solution takes slightly over 1 second, while the new version takes only 17 milliseconds (a ~70x speedup).
Try this solution! The first half I shamelessly stole from Joe's answer, but after that it uses sorting and binary search, which scales better with N.
def new(a, N):
mask = (a > 0) & (a < N)
elements = a[mask]
indices = np.arange(a.size)[mask]
sorting_idx = np.argsort(elements, kind='mergesort')
ind_sorted = indices[sorting_idx]
x = np.searchsorted(elements, range(N), side='right', sorter=sorting_idx)
return [ind_sorted[x[i]:x[i+1]] for i in range(N-1)]
You could put x = x.tolist() in there for an additional albeit small speed improvement (NB: if you do an a = a.tolist() in your original code, you do get a significant speedup). Also, I used 'mergesort' which is a stable sort but if you don't need the final result sorted, you can get away with a faster sorting algorithm.

Performing operations on all values of a numpy array, referencing i and j

I am trying to improve numpy performance by applying operations on a 2d array, the problem is that the value at each element in the array depends on the i,j location of that element.
Obviously the easy way to do this is to use a nested for-loop, but I was wondering if there might be a better way by referencing np.indices or something along those lines? Here is my 'stupid' code:
for J in range(1025):
for I in range(1025):
PSI[I][J] = A*math.sin((float(I+1)-.5)*DI)*math.sin((float(J+1)-.5)*DJ)
P[I][J] = PCF*(math.cos(2.*float(I)*DI)+math.cos(2.*float(J)*DJ))+50000.
Since you're doing multiplication among your two arrays, you can use the outer function, after using arange to get arrays of your sin/cos.
Something like this (use numpy's trig functions, since they're vectorized)
PSI_i = numpy.sin((arange(1,1026)-0.5)*DI)
PSI_j = numpy.sin((arange(1,1026)-0.5)*DJ)
PSI = A*outer(PSI_i, PSI_j)
P_i = numpy.cos(2.*arange(1,1026)*DI)
P_j = numpy.cos(2.*arange(1,1026)*DJ)
P = PCF*outer(P_i, P_j) + 50000
If your environment is set up using from numpy import * or from pylab import *, then you don't need those numpy. prefixes before your trig functions. I kept them in to distinguish them from the math ones, which won't work for this approach.
You can get a grid of the index values with indices:
I,J=np.indices(PSI.shape)
#All constants set to one
PSI2=np.sin(I+1-.5)*np.sin(J+1-.5)
print PSI-PSI2 # should be zero.
I did some timings with ipython:
import numpy as np
import math
A = 1
P = 1
DI = 1
DJ = 1
def a():
PSI=np.zeros((1025,1025))
for J in range(1025):
for I in range(1025):
PSI[I][J] = A*math.sin((float(I+1)-.5)*DI)*math.sin((float(J+1)-.5)*DJ)
%timeit a()
def b():
PSI=np.zeros((1025,1025))
for I,J in np.ndindex(*PSI.shape):
PSI[I,J] = A*math.sin((float(I+1)-.5)*DI)*math.sin((float(J+1)-.5)*DJ)
%timeit b()
def c():
I,J=np.indices((1025, 1025))
P2=A*np.sin((I+1-.5)*DI)*np.sin((J+1-.5)*DJ)
%timeit c()
def d():
PSI_i = np.sin((np.arange(1,1026)-0.5)*DI)
PSI_j = np.sin((np.arange(1,1026)-0.5)*DJ)
PSI = A*np.outer(PSI_i, PSI_j)
%timeit d()
The result is not at all surprising on my machine:
1 loops, best of 3: 1.75 s per loop
1 loops, best of 3: 3.51 s per loop
10 loops, best of 3: 77.1 ms per loop
100 loops, best of 3: 7.16 ms per loop
Try the ndenumerate function of numpy, which returns the value as well as the indices:
>>> a
array([[5, 5, 5],
[1, 2, 3]])
>>> for index, value in numpy.ndenumerate(a):
... print index, value
(0, 0) 5
(0, 1) 5
(0, 2) 5
(1, 0) 1
(1, 1) 2
(1, 2) 3

Categories

Resources