Please assume the following NumPy array:
A = array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0])
I would like to find the indices of this array that N consecutive values are equal to zero (inclusive).
For example, assume N=3. We know that A[2]=0 while A[3]>0. Thus, the second element of array A is not with three consecutive zero value (inclusive). A desirable outcome for array A would look like the below:
B = array([False, False, False, False, True, True, True, True, False, False])
I can write a loop as the answer to this question:
N = 3
A = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
B = np.zeros(len(A), dtype=bool)
for i in range(len(A)):
if (i + N <= len(A)) and (sum(A[i:i + N]) == 0):
B[i] = True
Since matrix A could be much larger and I need to do the above search process millions of times, I need the fastest possible way. What would be your suggestion?
TL;DR
This is a significant improvement over the original approach as long as the block size N (below referred to as m or size) is large enough, and the input has a significant number of non-zero values:
import numpy as np
import numba as nb
#nb.njit
def find_blocks_while2_nb(arr, size, value):
n = arr.size
result = np.zeros(n, dtype=np.bool_)
i = j = 0
while i < n - size + 1:
# last iteration found a block, only checking one value
if j < i and arr[i + size - 1] == value:
result[i] = True
i += 1
else:
# search backward for index of last non-block value
j = i + size - 1
while j >= i and arr[j] == value:
j -= 1
if j < i:
# block found, advance by 1
result[i] = True
i += 1
else:
# block not found, advance by last non-block value occurrence
j += 1
i = j
return result
Essentially it is a Numba accelerated version of your original approach with a number of micro-optimizations:
The number of times the main loop is executed is reduced by setting proper limit to the main counter
The part that checks if the block contains all the same values:
avoids a relatively expensive array slicing
returns the position where it failed
runs backward
avoids re-checking consecutive blocks
For small block sizes, more micro-optimized approach may be faster as introduced in other answers, and briefly discussed below.
Introduction
The problem of finding the indices where a block of identical values starts can be seen as a specialization of a classical problem in computer science: finding a sub-array within an array (discussed in some detail here).
In this case, the sub-string (i.e. the block) consists of a series of identical values, more precisely 0s.
Theory
The complexity of this problem depends, in general, on the size of the array (n) and the size of the block (m or size), as well as the actual values within them (obviously).
Of course, it is assumed that n > m, as it is clear that if n == m the solution is trivial, and when n < m the array cannot fit the block.
For the same reason, no index above n - m can be found.
While a naïve implementation would require essentially two nested loops, one for string and one for sub-string (which would result is O(nm) computational complexity), it is possible to make use of the fact that the block consists of identical values to run the algorithm in guaranteed linear time O(n) and with constant memory requirement O(1).
However, under the assumption that the input array has a mixed distribution of both block and non-block values, it is possible to run in approx. O(n / (m - ε)) time and still with O(1) memory consumption. The ε < m - 1 is a parameter proportional to the probability of finding a non-block value.
The idea is that when trying to determine if a block starts at position i if the array has a non-block value at position j with j - i < m then the next position to evaluate is j + 1, since, the block cannot fit between i and j.
Moreover, if the search starts from the back (i.e. at position i - m), then the number of position that are not checked is maximal for sufficiently heterogeneous inputs, as all j - i values do not need to be checked.
For similar reasons, if it is known that a block starts at position i, to check whether a block starts at position i + j with j < m, only the values between i + m and i + m + j needs to checked, and the values between i and i + m - 1 do not need to checked again.
Implementations
This a manual case where a generator is a perfect fit, because it is possible to leverage its lazy evaluation properties to perform the bit of computation only as needed.
Additionally, it is not known in advance how large the result will be (i.e. it is possible to find many blocks or no blocks at all).
However, if this is part of a larger computation it may be faster to pre-allocate the size of the result and just modify the value its indices.
There should not be a need to allocate more than n - m + 1 values.
However, for compatibility with the requirements of the question, an array the size of the input is pre-allocated.
If needed, it is easy to adapt the following implementations to avoid wasting this extra memory.
More precisely, the solutions based on explicit looping can be converted to generators by replacing result[i] = True with yield i (or similar).
Likewise, solutions that compute the boolean array natively can be converted to generators with something like:
for i in np.nonzero(result)[0]:
yield i
The remainder of the analysis will assume that the target result is the boolean array.
One of the challenges of implementing this in pure Python is that explicit looping is relatively slow.
For this reason, it is convenient to use some acceleration technique (e.g. Numba or Cython) to reduce computation time.
Alternatively, given that the algorithm consists essentially of two loops (one for the array and one for the sub-array) one may seek to vectorize one or both loops.
However, do note that these approaches would necessarily run in O(nm) worst-case time.
The basic naïve approach is the following:
def find_blocks_for2(arr, size, value):
n = arr.size
result = np.zeros(n, dtype=np.bool_)
for i in range(n - size + 1):
all_equal = True
for j in range(i, i + size):
if arr[j] != value:
all_equal = False
break # vectorized approaches may not have this
if all_equal:
result[i] = True
return result
This can be improved in a number of directions.
One way stems from the fact that the inner loop is actually useless, because it is sufficient to keep track of the number of consecutive "good" values present (this is essentially the same as MichaelSzczesny's answer):
import numpy as np
def find_blocks_for(arr, size, value):
n = arr.size
result = np.zeros(n, dtype=np.bool_)
block_size = 0
for i in range(n):
if arr[i] == value:
block_size += 1
else:
block_size = 0
if block_size >= size:
result[i - size + 1] = True
return result
Note that this can be seen as a specialization of the Rabin-Karp algorithm where the role of the hashing is played by the block_size counter.
Another possible approach is to skip the second loop only when the last iteration already identified the presence of a block and have it running backward to maximize the jump to the next possible block:
import numpy as np
def find_blocks_while2(arr, size, value):
n = arr.size
result = np.zeros(n, dtype=np.bool_)
i = j = 0
while i < n - size + 1:
# last iteration found a block, only checking one value
if j < i and arr[i + size - 1] == value:
result[i] = True
i += 1
else:
# search backward for index of last non-block value
j = i + size - 1
while j >= i and arr[j] == value:
j -= 1
if j < i:
# block found, advance by 1
result[i] = True
i += 1
else:
# block not found, advance by last non-block value occurrence
j += 1
i = j
return result
Note that this can be seen as a specialization of the Knuth-Morris-Pratt algorithm without the need of computing offsets for each value of the block (they are all the same).
A number of variations can be devised for the above approaches, all with the same asymptotic behavior but slightly different running time depending on the actual input.
For example, one could change the order of the conditions to speed-up the case where many consecutive blocks are found, at the expenses of running more instructions elsewhere, or the other way around.
All these run just too slow without acceleration either with Numba, Cython or Pypy.
Pypy uses a different interpreter and I will not discuss/benchmark it further.
Numba provides one of the most straightforward and effective acceleration (just apply a decorator), e.g.:
import numba as nb
find_blocks_while2_nb = nb.njit(find_blocks_while2)
find_blocks_while2_nb.__name__ = "find_blocks_while2_nb"
Cython can also be used, e.g. (using %load_ext Cython Jupyter magic), but to take full advantage of the compilation it would require extra care to make sure bottleneck code runs without Python interaction:
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True
import cython as cy
import numpy as np
cdef _find_blocks_lin_cy(long[::1] arr, char[::1] result, Py_ssize_t n, Py_ssize_t size, long value):
cdef Py_ssize_t block_size = 0
for i in range(n):
if arr[i] == value:
block_size += 1
else:
block_size = 0
if block_size >= size:
result[i - size + 1] = True
def find_blocks_lin_cy(arr, size, value):
n = arr.size
result = np.zeros(n, dtype=np.bool_)
_find_blocks_lin_cy(arr, result, n, size, value)
return result
Whether it is going to be faster Numba or Cython depends on how much optimization can be triggered by the code, and as such depends also on which combination of compiler and CPU is used.
Alternatively, either the inner or the outer loop (or both) can be vectorized.
Of course this is most effective on whether it is run more the inner or the outer loop.
The largest loop should be run vectorized for maximum speed-up.
Vectorizing the inner loop would lead to something close to this (similar to the original post, but with smaller boundaries and sum() replaced with np.all() which also sports short-circuiting):
def find_blocks_for_all(arr, size, value):
n = arr.size
result = np.zeros(n, dtype=np.bool_)
for i in range(n - size + 1):
if np.all(arr[i:i + size] == value):
result[i] = True
return result
Vectorizing the outer loop would lead to something close to this (similar to what is presented in DanielF's answer, but with overall cleaner code, especially avoiding unnecessary function calls):
def find_blocks_for_and(arr, size, value):
result = (arr == value)
for i in range(1, size):
result[:-1] &= result[1:]
result[1 - size:] = False
return result
Vectorizing both loops would lead to something close to this (similar to one approach from NaphatAmundsen's answer):
def find_blocks_strides(arr, size, value):
n = arr.size
strides = (arr.itemsize,) * 2
block = np.full(size, value, dtype=arr.dtype)
# note that `as_strided()` does not allocate extra memory
# the downside is that buffer overflow will occur and
# extra care must be taken not to use data outside the boundaries
check = np.lib.stride_tricks.as_strided(arr, (n, size), strides)
result = np.all(check == block, axis=-1)
result[1 - size:] = False
return result
The above code can be further specialized under the assumption that the input will contain only positive values and that block consists of zeros only.
This would mean replacing np.all() calls with np.sum() or similar.
One approach where this is actually beneficial is with the fully vectorized approach, where as_strided() can be replaced by computing the correlation with a non-zero block (similar to one approach from NaphatAmundsen's answer):
def find_blocks_0_conv(arr, size):
n = arr.size
result = np.zeros_like(arr, dtype=np.bool_)
block = np.ones(size, dtype=arr.dtype)
result[:1 - size] = (np.correlate(arr, block, 'valid') == 0)
return result
Aside of the looping, the running time will also depend on how the code translates to hardware instructions and ultimately how fast are them.
For example, if NumPy is compiled with full support of SIMD instructions, it may very well be that a vectorized (but not theoretically optimal) approach would outperform, for many input combinations, theoretically optimal approaches that are naïvely accelerated.
The same could happen for approaches that are written in a way that acceleration techniques can make very good use of SIMD instructions.
However, the extra speed they provide will be limited to specific systems (i.e. specific combinations of compilers and processors).
One such example is the following (essentially a polishing of Jérôme Richard's answer to avoid global variables and removing unnecessary looping):
def find_blocks_simd_nb_cached(arr, size, value=0, cache={}):
assert size > 0
if size not in cache:
print('cached: ', list(cache.keys()))
def find_blocks_simd_nb(arr, value):
n = arr.size
result = np.zeros(n, dtype=np.bool_)
check = (arr == value)
for i in range(n - size + 1):
# using a `temp` variable avoid unnecessarily accessing the `check` array
temp = check[i]
for j in range(1, size):
temp &= check[i + j]
result[i] = temp
return result
cache[size] = nb.njit(find_blocks_simd_nb)
return cache[size](arr, value)
Benchmarking
Here, some benchmarks are reported.
As always, they should be taken with a grain of salt.
Only the fastest approaches are considered, for varying input sizes, block sizes and values.
The _nb and _cy (except for find_blocks_for_cy() are essentially non-optimized versions of the same functions without the _nb or _cy suffix from above.
Varying input size, block size = 4
Uniform distribution of 1s and 0s
All 0s
All 1s
Varying block size, input size = 600_000
Uniform distribution of 1s and 0s
All 0s
All 1s
Note that:
find_blocks_for_all() is actually fast for sufficiently large values of size
find_blocks_for_and() is very fast for smaller values of size and does not require Numba or Cython
find_blocks_for_nb() is essentially independent on the block size and performs overall really well
find_blocks_simd_nb() is very fast for small values of size but performs comparatively poorly for larger values
find_blocks_while2_nb() really shines for larger values of size and it is respectably fast for smaller values
(Full benchmarks available here).
For small values of N (not changing much at runtime), the best solution is to use a SIMD-friendly Numba implementation that is specialized for each possible specific N value. The compiler can generate a very efficient code when N <= ~30. For larger N values, the solution of #norok2 is start to be better since it can skip many items of the array as long as there are not too many 0 items. When the number of 0 is big, this function is still competitive as long as N <= 64. When N > 64, please use a better suited implementation.
# Cache dictionary meant to store function for a given N
cache = {}
def compute(A, N):
assert N > 0
if N not in cache:
# Numba function to be compiled for a specific N
def compute_specific(A):
out = np.zeros(A.size, dtype=np.bool_)
isZero = (A == 0).view(np.uint8)
for i in range(isZero.size-N+1):
allSame = isZero[i]
for j in range(1, N):
allSame &= isZero[i+j]
out[i] = allSame
for i in range(A.size-N+1, A.size):
out[i] = 0
return out
cache[N] = nb.njit(compute_specific)
return cache[N](A)
The compiler can auto-vectorize this code with SIMD instruction (like AVX-2) so that 32~64 items are computed per cycle on modern x86 processors. Larger N values require more SIMD instructions causing the function to be less efficient. In comparison, implementations using conditional branches tends to be slow since a miss-predicted branch generally cost about 10~15 cycles on mainstream x86 processors and each iteration can only compute one scalar item at a time (ie. ~2 order of magnitude slower assuming the amount of work would be the same).
Benchmark
Here are results on my machine (Intel Xeon Skylake) with a random (32-bit) integer arrays of size 100_000. There is about 50% of 0 values in the array. The Numba compilation time is excluded in all implementations.
With N = 3:
- Initial solution: 115_000 µs
- Ali_Sh: 1_575 µs
- Naphat (stridetricks): 929 µs
- Naphat (correlational): 528 µs
- Michael Szczesny: 350 µs
- Norok2: 306 µs
- Franciska: 295 µs
- This code: 11 us <---
With N = 10:
- Initial solution: 149_000 µs
- Ali_Sh: 1_592 µs
- Naphat (stridetricks): 1_139 µs
- Naphat (correlational): 831 µs
- Michael Szczesny: 489 µs
- Franciska: 444 µs
- Norok2: 98 µs
- This code: 14 µs <---
With N = 30:
- Ali_Sh: [FAIL]
- Initial solution: 255_000 µs
- Franciska: 2_285 µs
- Naphat (correlational): 1_936 µs
- Naphat (stridetricks): 1_628 µs
- Michael Szczesny: 647 µs
- Norok2: 30 µs
- This code: 25 µs <---
With N = 60:
- Ali_Sh: [FAIL]
- Initial solution: 414_000 µs
- Naphat (correlational): 3_814 µs
- Franciska: 3_242 µs
- Naphat (stridetricks): 3_048 µs
- Michael Szczesny: 816 µs
- This code: 50 µs <---
- Norok2: 14 µs
We can clearly see that this function is the fastest one (by a large margin) except when N is big (in this case, the Norok2's code is faster).
A solution with linear time complexity, compiled with numba.
The idea is to look at each element of A once and count occurrences of zeros in counter c. If a non-zero element is encountered, c is reset to 0. As long as c >= N the result array is set to True at the appropriate index.
import numba as nb # tested with numba 0.55.1
#nb.jit
def numba_linear(A, N):
B = np.zeros(len(A), dtype=np.bool_)
c = 0
for i in range(len(A)):
c = c+1 if A[i] == 0 else 0
if c >= N: B[i - N + 1] = 1
return B
A = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0])
numba_linear(A, 3)
Output
array([False, False, False, False, True, True, True, True, False,
False])
Benchmark against #Nephat baseline_numba with #Mercury's loop optimization applied (sum was faster than any in my benchmarks).
import numpy as np
#nb.njit
def baseline_numba(A: np.ndarray, N: int):
'''
You may need to run this once first to see performance gains
as this has to be JIT compiled by Numba
'''
B = np.zeros(len(A), dtype=np.int8)
for i in nb.prange(len(A)-N+1):
if A[i:i + N].sum() == 0:
B[i] = 1
return B
rng = np.random.default_rng()
A = rng.integers(2, size=20000)
N = 3
%timeit baseline_numba(A, N)
#10000 loops, best of 5: 120 µs per loop
%timeit numba_linear(A, N)
#10000 loops, best of 5: 29.4 µs per loop
np.testing.assert_array_equal(numba_linear(A,N), baseline_numba(A,N))
I have implemented a few other alternatives that you could try:
Use numba to JIT compile the loopy code
Use np.correlate to do a signal-processing type of correlation, then use nonzero to obtain indices
Same as above, but use stride_tricks to do the correlation instead
I would like to benchmark the solutions, but it is getting a bit late here. I can maybe do it after work tomorrow if you'd like.
I also use dtype=np.int8 for B since it is easier to visually confirm that the output is correct (see output).
import numpy as np
import numba
from numpy.lib.stride_tricks import as_strided
A = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0])
N = 3
def baseline(A: np.ndarray):
'''
Your solution
'''
B = np.zeros(len(A), dtype=np.int8)
for i in range(len(A)):
if (i + N <= len(A)) and (sum(A[i:i + N]) == 0):
B[i] = 1
return B
#numba.njit()
def baseline_numba(A: np.ndarray, N: int):
'''
You may need to run this once first to see performance gains
as this has to be JIT compiled by Numba
'''
B = np.zeros(len(A), dtype=np.int8)
for i in range(len(A)):
if (i + N <= len(A)) and (A[i:i + N].sum() == 0):
B[i] = 1
return B
def correlational(A: np.ndarray):
'''
Use a correlation operation
'''
B = np.zeros_like(A, dtype=np.int8)
B[(np.correlate(A, np.full(N, 1), 'valid') == 0).nonzero()[0]] = 1
return B
def stridetricks(A: np.ndarray):
'''
Same idea as using correlation, but with stride tricks
Maybe it will be faster? Probably not.
'''
u = np.array(A.itemsize)
B = np.zeros_like(A, dtype=np.int8)
B[(as_strided(A, (len(A) - N + 1, N), u * (1, 1)).sum(1) == 0).nonzero()[0]] = 1
return B
print(baseline(A))
print(baseline_numba(A, N))
print(correlational(A))
print(stridetricks(A))
Output:
[0 0 0 0 1 1 1 1 0 0]
[0 0 0 0 1 1 1 1 0 0]
[0 0 0 0 1 1 1 1 0 0]
[0 0 0 0 1 1 1 1 0 0]
You can use a convolution filter, to see the consecutive zeros. And as the "leftover" zeros at the end can't be True, if there isn't an N length sequence of them, you can always complete the result with Falses.
import numpy as np
N = 3
A = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0])
conv=np.convolve(np.ones((N)),A,mode='valid')
B=np.where(conv==0,True,False)
B=np.append(B,np.zeros((N-len(A)%N),dtype=bool))
the result is:
[False False False False True True True True False False]
Should be possible with only bitwise math. Not sure about speed, probably doesn't beat numba, but should top anything with a for loop otherwise.
def find_blocks(A, N):
reduce = lambda x: np.logical_and(x[:-1], x[1:], out = x[:-1])
x = np.logical_not(A)
for i in range(N-1):
reduce(x)
x[-N+1:] = False
return x
find_blocks(A, 3)
Out[]:
array([False, False, False, False, True, True, True, True, False,
False])
Finding consecutive values and related indices is one of subjects that have solution in other SO question. In this regard we can find range indices for repeated zeros and their numbers:
def zero_runs(a):
iszero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
absdiff = np.abs(np.diff(iszero))
ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
return ranges
# [[ 2 3]
# [ 4 10]]
When we found the indices ranges, we can fill the 1D array by using
def filled_array(start, end, length):
out = np.zeros((length), dtype=int)
np.add.at(out, start, 1)
np.add.at(out, end, -1)
return out.cumsum() > 0
def filled_array_v2(start, end, length):
out =np.bincount(start, minlength=length) - np.bincount(end, minlength=length)
return out.cumsum().astype(bool)
So, by using the aforementioned functions and defining a master function as below:
def final(A, N):
runs = zero_runs(A)
repeats_cound = runs[:, 1] - runs[:, 0]
# [1 6]
suspected_ranges = runs[repeats_cound >= 3]
# [[ 4 10]]
start = suspected_ranges[:, 0]
end = start + N + 1
return filled_array(start, end, length=A.size) # filled_array_v2(start, end, length=A.size)
# [False False False False True True True True False False]
In one of my benchmarks on data array volume 10000, this code ran faster than the proposed numba method by Naphat.
Some benchmarks (Colab)
By using any (instead sum) as #Mercury's loop optimization mentioned, we can improve Naphat's baseline_numba by parallelizing it usingnb.prange along with parallel=True in the decorator line.
By A size: 20000 on Colab:
# N ----> 3:
1000 loops, best of 5: 105 µs per loop Norok
--------------------
100 loops, best of 5: 35 ms per loop Nephat baseline
100 loops, best of 5: 730 µs per loop Nephat baseline_numba
1000 loops, best of 5: 225 µs per loop Nephat correlational
1000 loops, best of 5: 397 µs per loop Nephat stridetricks
--------------------
1000 loops, best of 5: 42.3 µs per loop Michael Szczesny (The fastest)
--------------------
1000 loops, best of 5: 352 µs per loop This answer
1000 loops, best of 5: 279 µs per loop Nephat baseline_numba PARALLELIZED
=================================================================
# N ----> 10:
1000 loops, best of 5: 25.3 µs per loop Norok (The fastest)
--------------------
100 loops, best of 5: 46.8 ms per loop Nephat baseline
100 loops, best of 5: 794 µs per loop Nephat baseline_numba
1000 loops, best of 5: 318 µs per loop Nephat correlational
1000 loops, best of 5: 433 µs per loop Nephat stridetricks
--------------------
1000 loops, best of 5: 31.9 µs per loop Michael Szczesny
--------------------
1000 loops, best of 5: 341 µs per loop This answer
1000 loops, best of 5: 291 µs per loop Nephat baseline_numba PARALLELIZED
=================================================================
# N ----> 20
1000 loops, best of 5: 6.52 µs per loop Norok (The fastest)
--------------------
1000 loops, best of 5: 31.5 µs per loop Michael Szczesny
--------------------
1000 loops, best of 5: 350 µs per loop This answer
1000 loops, best of 5: 292 µs per loop Nephat baseline_numba PARALLELIZED
=================================================================
# N ----> 30
1000 loops, best of 5: 4.01 µs per loop Norok (The fastest)
--------------------
1000 loops, best of 5: 31.8 µs per loop Michael Szczesny
Latest benchmarks:
The following results show the best performances, respectively, based on my last test on the prepared colab link (the last two cells):
for N <10 ---> Richard, (==) Daniel
for 10 < N <20 ---> Richard, Michael, (==) Norok
for 20 < N <60 ---> Norok, Richard, Michael
for N >60 ---> Norok, Michael, Richard
This one is numpy only and O(len(A)) similar to the answer from Ali_Sh but shorter and quicker.
It does require some awkward handling of the edges of the array.
The choice for a smaller dtype for the cumsum is a micro-optimization that I'm not sure is worth it..
def numpy_linear(A, N):
dtype = np.min_scalar_type(N)
assert issubclass(dtype.type, np.unsignedinteger)
# The cumsum can overflow for longer arrays
# but the difference calculation will still be correct
c = np.cumsum(A == 0, dtype=dtype)
B = np.r_[c[N-1] == N, c[N:]-c[:-N] == N, np.zeros(N-1, bool)]
return B
To Obtain one's index, you can start with the following method:
import bumpy as np
my__Truth = np.array([True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True])
np.sort(my_Truth, in_place=my_Truth[0])
# ~ Obtain Index ~ #
index_values = np.where(my_Truth)
print(index_values)
# un-comment if using python2
# print index_values
Related
I've written a method that takes in an integer "n" and creates a square matrix where the values of each element are dictated by their respective i,j indices.
When I build a small matrix 30x30 it works just fine, but when I try to do something larger like 1000x1000 it takes very long. Is there any way that I can speed it up with multiprocessing?
def createMatrix(n):
matrix = []
for j in range(1,n+1):
row = []
for i in range(1,n+1):
value = 1/(i+j-1)
row.append(value)
matrix.append(row)
return np.array(matrix)
Parallelizing two computation-bound for loops in Python is not trivial because of GIL. The good news is that your case is perfectly vectorizeable:
def createMatrix(n):
return 1 / (np.arange(n)[None, :] + np.arange(n)[:, None] + 1)
Explanation:
essentially, your formula for the matrix is X[row][column] = 1/(row+column-1), where rows and columns are 1-based
np.arange(n) creates a range that can be used for rows or columns
[None, :] and [:, None] turn it into a 2d array, 1 x n or n x 1
numpy then broadcasts dimensions, replicating row and column indexes to match dimensions - thus, implicitly tiling both into n x n when added
since both ranges are 0-based, using +1 instead of -1
As a rule of thumb, it is almost never a good idea to use for loops on numpy arrays. A vectorized approach (i.e. matrix form computations) is orders of magnitude faster.
It's not a good idea to use fors to fill a list then convert it to a matrix. the operation that you have can be vectorized with numpy from scratch. if you think that given the i,j, M(i,j) = 1/(j+i-1) considering that both indices starts at 1.
Here's my proposal :
def createMatrix2(n):
arr =np.arange(1,n+1)
xx,yy = np.meshgrid(arr,arr)
matrix = 1/(xx+yy-1)
return matrix
looking at Marat answer, I think his/her it's better, so tested the 3 methods:
EDIT: added wwii method as createMatrix4 (correcting the errors):
import numpy as np
from time import time
def createMatrix1(n):
matrix = []
for j in range(1,n+1):
row = []
for i in range(1,n+1):
value = 1/(i+j-1)
row.append(value)
matrix.append(row)
return np.array(matrix)
def createMatrix2(n):
arr =np.arange(1,n+1)
xx,yy = np.meshgrid(arr,arr)
matrix = 1/(xx+yy-1)
return matrix
def createMatrix3(n):
"""Marat's proposed matrix"""
return 1 / (1 + np.arange(n)[None, :] + np.arange(n)[:, None])
def createMatrix4(n):
""" wwii method"""
i,j = np.ogrid[1:n,1:n]
return 1/(i+j-1)
#test all the three methods
n = 10000
t1 = time()
m1 = createMatrix1(n)
t2 = time()
m2 = createMatrix2(n)
t3 = time()
m3 = createMatrix3(n)
t4 = time()
m4 = createMatrix4(n)
t5 = time()
print(np.allclose(m1,m2))
print(np.allclose(m1,m3))
print(np.allclose(m1,m4))
print("Matrix 1 (OP): ",t2-t1)
print("Matrix 2: (mine)",t3-t2)
print("Matrix 3: (Marat)",t4-t3)
print("Matrix 4: (wwii)",t5-t4)
# the output is:
#True
#True
#True
#Matrix 1 (OP): 18.4886577129364
#Matrix 2: (mine) 1.005324363708496
#Matrix 3: (Marat) 0.43033909797668457
#Matrix 4: (wwii) 0.5138359069824219
So Marat's solution is faster. As general comments:
Try to avoid fors loops
Think your problem as operation with indices and dessing operations with numpy arrays directly.
For last, given Marat's answer I thought my proposal is a easier to read, and understand. But it's just a subjective view
Your code can be written in another style, accelerated by numba library in a parallel no python mode:
import numba as nb
#nb.njit("float64[:, ::1](int64)", parallel=True, fastmath=True)
def createMatrix(n):
matrix = np.empty((n, n)) # np.zeros is slower than np.empty
for j in nb.prange(1, n + 1):
for i in range(1, n + 1):
matrix[j - 1, i - 1] = 1 / (i + j - 1)
return matrix
This solution will be faster than the Marat answer above 3 times.
Benchmarks: (temporary link to colab)
n = 1000
1000 loops, best of 5: 3.52 ms per loop # Marat
1000 loops, best of 5: 1.5 ms per loop # numba accelerated with np.zeros
1000 loops, best of 5: 1.05 ms per loop # numba accelerated with np.empty
n = 3000
1000 loops, best of 5: 39.5 ms per loop
1000 loops, best of 5: 19.3 ms per loop
1000 loops, best of 5: 8.91 ms per loop
n = 5000
1000 loops, best of 5: 109 ms per loop
1000 loops, best of 5: 53.5 ms per loop
1000 loops, best of 5: 24.8 ms per loop
I have a numpy array embed_vec of length tot_vec in which each entry is a 3d vector:
[[ 0.52483319 0.78015841 0.71117216]
[ 0.53041481 0.79462171 0.67234534]
[ 0.53645428 0.80896727 0.63119403]
...,
[ 0.72283509 0.40070804 0.15220522]
[ 0.71277758 0.38498613 0.16141834]
[ 0.70221445 0.36918032 0.17370776]]
For each of the elements in this array, I want to find out the number of other entries which are "close" to that entry. By close, I mean that the distance between two vectors is less than a specified value R. For this, I must compare all the possible pairs in this array with each other and then find out the number of close vectors for each of the vectors in the array. So I am doing this:
p = np.zeros(tot_vec) # This contains the number of close vectors
for i in range(tot_vec-1):
for j in range(i+1, tot_vec):
if np.linalg.norm(embed_vec[i]-embed_vec[j]) < R:
p[i] += 1
However, this is extremely inefficient because I have two nested python loops and for larger array sizes, this takes forever. If this were in C++ or Fortran, it wouldn't have been a great issue. My question is, can one achieve the same thing using numpy efficiently using some vectorization method? As a side note, I don't mind a solution using Pandas also.
Approach #1 : Vectorized approach -
def vectorized_app(embed_vec, R):
tot_vec = embed_vec.shape[0]
r,c = np.triu_indices(tot_vec,1)
subs = embed_vec[r] - embed_vec[c]
dists = np.einsum('ij,ij->i',subs,subs)
return np.bincount(r,dists<R**2,minlength=tot_vec)
Approach #2 : With less loop complexity (for very large arrays) -
def loopy_less_app(embed_vec, R):
tot_vec = embed_vec.shape[0]
Rsq = R**2
out = np.zeros(tot_vec,dtype=int)
for i in range(tot_vec):
subs = embed_vec[i] - embed_vec[i+1:tot_vec]
dists = np.einsum('ij,ij->i',subs,subs)
out[i] = np.count_nonzero(dists < Rsq)
return out
Benchmarking
Original approach -
def loopy_app(embed_vec, R):
tot_vec = embed_vec.shape[0]
p = np.zeros(tot_vec) # This contains the number of close vectors
for i in range(tot_vec-1):
for j in range(i+1, tot_vec):
if np.linalg.norm(embed_vec[i]-embed_vec[j]) < R:
p[i] += 1
return p
Timings -
In [76]: # Sample random array
...: embed_vec = np.random.rand(3000,3)
...: R = 0.5
...:
In [77]: %timeit loopy_app(embed_vec, R)
1 loops, best of 3: 50.5 s per loop
In [78]: %timeit loopy_less_app(embed_vec, R)
10 loops, best of 3: 143 ms per loop
350x+ speedup there!
Going with much bigger array with the proposed loopy_less_app -
In [81]: # Sample random array
...: embed_vec = np.random.rand(20000,3)
...: R = 0.5
...:
In [82]: %timeit loopy_less_app(embed_vec, R)
1 loops, best of 3: 4.47 s per loop
I am intrigued by that question and attempted to solve it efficintly using scipy's cKDTree. However, this approach may run out of memory because internally a list of all pairs with distance <= R is maintained. If your R and tot_vec are small enough it will work:
import numpy as np
from scipy.spatial import cKDTree as KDTree
tot_vec = 60000
embed_vec = np.random.randn(tot_vec, 3)
R = 0.1
tree = KDTree(embed_vec, leafsize=100)
p = np.zeros(tot_vec)
for pair in tree.query_pairs(R):
p[pair[0]] += 1
p[pair[1]] += 1
In case memory is an issue, with some effort it is possible to rewrite query_pairs as a generator function in Python at the cost of C performance.
first broadcast the difference:
disp_vecs=tot_vec[:,None,:]-tot_vec[None,:,:]
Now, depending on how big your dataset is, you may want to do a fist pass without all the math. If the distance is less than r, all the components should be less than r
first_mask=np.max(disp_vec, axis=-1)<r
Then do the actual calculation
disps=np.linlg.norm(disp_vec[first_mask],axis=-1)
second_mask=disps<r
Now reassign
disps=disps[second_mask]
first_mask[first_mask]=second_mask
disps are now the good values, and first_mask is a boolean mask of where they go. You can process from there.
Say I have a range r=numpy.array(range(1, 6)) and I am calculating the cumulative sum using numpy.cumsum(r). But instead of returning [1, 3, 6, 10, 15] I would like it to return [1, 3, 6] because of the condition that the cumulative result must be less than 10.
If the array is very large, I would like the cumulative sum to break out before it starts calculating values that are redundant and will be thrown away later. Of course, I am trivializing everything here for the sake of the question.
Is it possible to break out of cumsum or cumprod early based on a condition?
I don't think this is possible with any function in numpy, since in most cases these are meant for vectorized computations on fixed-length arrays. One obvious way to do what you want is to break out of a standard for-loop in Python (as I assume you know):
def limited_cumsum(x, limit):
y = []
sm = 0
for item in x:
sm += item
if sm > limit:
return y
y.append(sm)
return y
But this would obviously be an order of magnitude slower than numpy's cumsum.
Since you probably need some very specialized function, the changes are low to find the exact function you need in numpy. You should probably have a look at Cython, which allows you to implement custom functions that are as flexible as a Python function (and using a syntax that is almost Python), with a speed close to that of C.
Depending on size of the array you are computing the cumulative sum for and how quickly you expect the target value to be reached it may be quicker to calculate the cumulative sum in steps.
import numpy as np
size = 1000000
target = size
def stepped_cumsum():
arr = np.arange(size)
out = np.empty(len(arr), dtype=int)
step = 1000
last_value = 0
for i in range(0, len(arr), step):
np.cumsum(arr[i:i+step], out=out[i:i+step])
out[i:i+step] += last_value
last_value = out[i+step-1]
if last_value >= target:
break
else:
return out
greater_than_target_index = i + (out[i:i+step] >= target).argmax()
# .copy() required so rest of backing array can be freed
return out[:greater_than_target_index].copy()
def normal_cumsum():
arr = np.arange(size)
out = np.cumsum(arr)
return out
stepped_result = stepped_cumsum()
normal_result = normal_cumsum()
assert (stepped_result < target).all()
assert (stepped_result == normal_result[:len(stepped_result)]).all()
Results:
In [60]: %timeit cumsum.stepped_cumsum()
1000 loops, best of 3: 1.22 ms per loop
In [61]: %timeit cumsum.normal_cumsum()
100 loops, best of 3: 3.69 ms per loop
I have a small block of code which I use to fill a list with integers. I need to improve its performance, perhaps translating the whole thing into numpy arrays, but I'm not sure how.
Here's the MWE:
import numpy as np
# List filled with integers.
a = np.random.randint(0,100,1000)
N = 10
b = [[] for _ in range(N-1)]
for indx,integ in enumerate(a):
if 0<elem<N:
b[integ-1].append(indx)
This is what it does:
for every integer (integ) in a
see if it is located between a given range (0,N)
if it is, store its index in a sub-list of b where the index of said sub-list is the original integer minus 1 (integ-1)
This bit of code runs pretty fast but my actual code uses much larger lists, hence the need to improve its performance.
Here's one way of doing it:
mask = (a > 0) & (a < N)
elements = a[mask]
indicies = np.arange(a.size)[mask]
b = [indicies[elements == i] for i in range(1, N)]
If we time the two:
import numpy as np
a = np.random.randint(0,100,1000)
N = 10
def original(a, N):
b = [[] for _ in range(N-1)]
for indx,elem in enumerate(a):
if 0<elem<N:
b[elem-1].append(indx)
return b
def new(a, N):
mask = (a > 0) & (a < N)
elements = a[mask]
indicies = np.arange(a.size)[mask]
return [indicies[elements == i] for i in range(1, N)]
The "new" way is considerably (~20x) faster:
In [5]: %timeit original(a, N)
100 loops, best of 3: 1.21 ms per loop
In [6]: %timeit new(a, N)
10000 loops, best of 3: 57 us per loop
And the results are identical:
In [7]: new_results = new(a, N)
In [8]: old_results = original(a, N)
In [9]: for x, y in zip(new_results, old_results):
....: assert np.allclose(x, y)
....:
In [10]:
The "new" vectorized version also scales much better to longer sequences. If we use a million-item-long sequence for a, the original solution takes slightly over 1 second, while the new version takes only 17 milliseconds (a ~70x speedup).
Try this solution! The first half I shamelessly stole from Joe's answer, but after that it uses sorting and binary search, which scales better with N.
def new(a, N):
mask = (a > 0) & (a < N)
elements = a[mask]
indices = np.arange(a.size)[mask]
sorting_idx = np.argsort(elements, kind='mergesort')
ind_sorted = indices[sorting_idx]
x = np.searchsorted(elements, range(N), side='right', sorter=sorting_idx)
return [ind_sorted[x[i]:x[i+1]] for i in range(N-1)]
You could put x = x.tolist() in there for an additional albeit small speed improvement (NB: if you do an a = a.tolist() in your original code, you do get a significant speedup). Also, I used 'mergesort' which is a stable sort but if you don't need the final result sorted, you can get away with a faster sorting algorithm.
I need to extend this question, which sums values of an array based on indices from a second array. Let A be the result array, B be the index array, and C the array to be summed over. Then A[i] = sum over C such that index(B) == i.
Instead, my setup is
N = 5
M = 2
A = np.zeros((M,N))
B = np.random.randint(M, size=N) # contains indices for A
C = np.random.rand(N,N)
I need A[i,j] = sum_{k in 0...N} C[j,k] such that C[k] == i , i.e. a rowsum conditional on the indices of B matching i. Is there an efficient way to do this? For my application N is around 10,000 and M is around 20. This operation is called for every iteration in a minimization problem... my current looping method is terribly slow.
Thanks!
Following #DSM's comment, I'm assuming your C[k] == i supposed to be B[k] == i. If that's the case, does your loop version look something like this?
Nested Loop Version
import numpy as np
N = 5
M = 2
A = np.zeros((M,N))
B = np.random.randint(M, size=N) # contains indices for A
C = np.random.rand(N,N)
for i in range(M):
for j in range(N):
for k in range(N):
if B[k] == i:
A[i,j] += C[j,k]
There's more than one way to vectorize this problem. I'm going to show my thought process below, but there are more efficient ways to do it (e.g. #DSM's version that recognizes the matrix multiplication inherent in the problem).
For the sake of explanation, here's a walk-through of one approach.
Vectorizing the Inner Loop
Let's start by re-writing the inner k loop:
for i in range(M):
for j in range(N):
A[i,j] = C[j, B == i].sum()
It might be easier to think of this as C[j][B == i].sum(). We're just selecting jth row of C, selecting only the elements in that row where B is equal to i, and summing them.
Vectorizing the Outer-most Loop
Next let's break down the outer i loop. Now we're going to get to the point where readability will start to suffer, unfortunately...
i = np.arange(M)[:,np.newaxis]
mask = (B == i).astype(int)
for j in range(N):
A[:,j] = (C[j] * mask).sum(axis=-1)
There are a couple different tricks here. In this case, we're iterating over the columns of A. Each column of A is the sum of a subset of the corresponding row of C. The subset of the row of C is determined by where B is equal to the row index i.
To get around iterating through i, we're making a 2D array where B == i by adding a new axis to i. (Have a look at the documentation for numpy broadcasting if you're confused by this.) In other words:
B:
array([1, 1, 1, 1, 0])
i:
array([[0],
[1]])
B == i:
array([[False, False, False, False, True],
[ True, True, True, True, False]], dtype=bool)
What we want is to take two (M) filtered sums of C[j], one for each row in B == i. This will give us a two-element vector corresponding to the jth column in A.
We can't do this by indexing C directly because the result won't maintain it's shape, as each row may have a different number of elements. We'll get around this by multiplying the B == i mask by the current row of C, resulting in zeros where B == i is False, and the value in the current row of C where it's true.
To do this, we need to turn the boolean array B == i into integers:
mask = (B == i).astype(int):
array([[0, 0, 0, 0, 1],
[1, 1, 1, 1, 0]])
So when we multiply it by the current row of C:
C[j]:
array([ 0.19844887, 0.44858679, 0.35370919, 0.84074259, 0.74513377])
C[j] * mask:
array([[ 0. , 0. , 0. , 0. , 0.74513377],
[ 0.19844887, 0.44858679, 0.35370919, 0.84074259, 0. ]])
Then we can sum over each row to get the current column of A (This will be broadcast to a column when it's assigned to A[:,j]):
(C[j] * mask).sum(axis=-1):
array([ 0.74513377, 1.84148744])
Fully Vectorized Version
Finally, breaking down the last loop, we can apply the exact same principle to add a third dimension for the loop over j:
i = np.arange(M)[:,np.newaxis,np.newaxis]
mask = (B == i).astype(int)
A = (C * mask).sum(axis=-1)
#DSM's vectorized version
As #DSM suggested, you could also do:
A = (B == np.arange(M)[:,np.newaxis]).dot(C.T)
This is by far the fastest solution for most sizes of M and N, and arguably the most elegant (much more elegant than my solutions, anyway).
Let's break it down a bit.
The B == np.arange(M)[:,np.newaxis] is exactly equivalent to B == i in the "Vectorizing the Outer-most Loop" section above.
The key is in recognizing that all of the j and k loops are equivalent to matrix multiplication. dot will cast the boolean B == i array to the same dtype as C behind-the-scenes, so we don't need to worry about explicitly casting it to a different type.
After that, we're just performing matrix multiplication on the transpose of C (a 5x5 array) and the "mask" 0 and 1 array above, yielding a 2x5 array.
dot will take advantage of any optimized BLAS libraries you have installed (e.g. ATLAS, MKL), so it's very fast.
Timings
For small M's and N's, the differences are less apparent (~6x between looping and DSM's version):
M, N = 2, 5
%timeit loops(B,C,M)
10000 loops, best of 3: 83 us per loop
%timeit k_vectorized(B,C,M)
10000 loops, best of 3: 106 us per loop
%timeit vectorized(B,C,M)
10000 loops, best of 3: 23.7 us per loop
%timeit askewchan(B,C,M)
10000 loops, best of 3: 42.7 us per loop
%timeit einsum(B,C,M)
100000 loops, best of 3: 15.2 us per loop
%timeit dsm(B,C,M)
100000 loops, best of 3: 13.9 us per loop
However, once M and N start to grow, the difference becomes very significant (~600x) (note the units!):
M, N = 50, 20
%timeit loops(B,C,M)
10 loops, best of 3: 50.3 ms per loop
%timeit k_vectorized(B,C,M)
100 loops, best of 3: 10.5 ms per loop
%timeit ik_vectorized(B,C,M)
1000 loops, best of 3: 963 us per loop
%timeit vectorized(B,C,M)
1000 loops, best of 3: 247 us per loop
%timeit askewchan(B,C,M)
1000 loops, best of 3: 493 us per loop
%timeit einsum(B,C,M)
10000 loops, best of 3: 134 us per loop
%timeit dsm(B,C,M)
10000 loops, best of 3: 80.2 us per loop
I am assuming #DSM found your typo, and you want:
A[i,j] = sum_{k in 0...N} C[j,k] where B[k] == i
Then you can loop over i in range(M) since M is relatively small.
A = np.array([C[:,B == i].sum(axis=1) for i in range(M)])