Suppose I have a NumPy array arr that I want to element-wise filter (reduce) depending on the truth value of a (broadcastable) function, e.g.
I want to get only values below a certain threshold value k:
def cond(x):
    return x < k
There are a couple of methods, e.g.:
Using a generator: np.fromiter((x for x in arr if cond(x)), dtype=arr.dtype) (which is a memory efficient version of using a list comprehension: np.array([x for x in arr if cond(x)]) because np.fromiter() will produce a NumPy array directly, without the need to allocate an intermediate Python list)
Using boolean masking: arr[cond(arr)]
Using integer indexing: arr[np.nonzero(cond(arr))] (or equivalently using np.where() as it defaults to np.nonzero() with only one condition)
Using explicit looping with:
single pass and final copying/resizing
two passes: one to determine the size of the result and one to actually perform the computation
(The last two approaches can be accelerated with Cython or Numba.)
Which is the fastest? What about memory efficiency?
(EDITED: to use np.nonzero() directly instead of np.where(), as per @ShadowRanger's comment)
Summary
Using a loop-based approach with single pass and copying, accelerated with Numba, offers the best overall trade-off in terms of speed, memory efficiency and flexibility.
If the condition function executes sufficiently fast, the two-pass approach (filter2_nb()) may be faster, and it is more memory efficient regardless.
Also, for sufficiently large inputs, resizing instead of copying (filter_resize_xnb()) leads to faster execution.
If the data type (and the condition function) is known ahead of time and can be compiled, the Cython acceleration seems to be faster.
It is likely that a similar hard-coding of the condition would lead to a comparable speed-up with Numba acceleration as well.
When it comes to NumPy-only based approaches, boolean masking or integer indexing are of comparable speed, and which one comes out faster depends largely on the filtering factor, i.e. the portion of values that passes through the filtering condition.
The np.fromiter() approach is much slower (it would be off-charts in the plot), but does not produce large temporary objects.
Note that the following tests are meant to give some insights into the different approaches and should be taken with a grain of salt.
The most relevant assumptions are that the condition is broadcastable and that it would eventually compute very fast.
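For reference, here is a minimal sketch of the kind of setup assumed below (the names NUM, MAX_VAL, K, arr and cond are illustrative; they mirror the constants used in the Cython cell further down and are not part of the original benchmark code):
import numpy as np

NUM = 1048576
MAX_VAL = 1048576
K = MAX_VAL // 2

np.random.seed(0)
arr = np.random.randint(0, MAX_VAL, NUM)


def cond(x, k=K):
    # broadcastable and very cheap to compute, as assumed above
    return x < k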
Definitions
Using a generator:
def filter_fromiter(arr, cond):
    return np.fromiter((x for x in arr if cond(x)), dtype=arr.dtype)
Using boolean masking:
def filter_mask(arr, cond):
    return arr[cond(arr)]
Using integer indexing:
def filter_idx(arr, cond):
    return arr[np.nonzero(cond(arr))]
4a. Using explicit looping, with single pass and final copying/resizing
Cython-accelerated with copying (pre-compiled condition)
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True
import numpy as np
cdef long NUM = 1048576
cdef long MAX_VAL = 1048576
cdef long K = 1048576 // 2
cdef int cond_cy(long x, long k=K):
    return x < k


cdef size_t _filter_cy(long[:] arr, long[:] result, size_t size):
    cdef size_t j = 0
    for i in range(size):
        if cond_cy(arr[i]):
            result[j] = arr[i]
            j += 1
    return j


def filter_cy(arr):
    result = np.empty_like(arr)
    new_size = _filter_cy(arr, result, arr.size)
    return result[:new_size].copy()
Cython-accelerated with resizing (pre-compiled condition)
def filter_resize_cy(arr):
    result = np.empty_like(arr)
    new_size = _filter_cy(arr, result, arr.size)
    result.resize(new_size)
    return result
Numba-accelerated with copying
import numba as nb


@nb.njit
def cond_nb(x, k=K):
    return x < k


@nb.njit
def filter_nb(arr, cond_nb):
    result = np.empty_like(arr)
    j = 0
    for i in range(arr.size):
        if cond_nb(arr[i]):
            result[j] = arr[i]
            j += 1
    return result[:j].copy()
Numba-accelerated with resizing
@nb.njit
def _filter_out_nb(arr, out, cond_nb):
    j = 0
    for i in range(arr.size):
        if cond_nb(arr[i]):
            out[j] = arr[i]
            j += 1
    return j


def filter_resize_xnb(arr, cond_nb):
    result = np.empty_like(arr)
    j = _filter_out_nb(arr, result, cond_nb)
    result.resize(j, refcheck=False)  # unsupported in NoPython mode
    return result
Numba-accelerated with a generator and np.fromiter()
@nb.njit
def filter_gen_nb(arr, cond_nb):
    for i in range(arr.size):
        if cond_nb(arr[i]):
            yield arr[i]


def filter_gen_xnb(arr, cond_nb):
    return np.fromiter(filter_gen_nb(arr, cond_nb), dtype=arr.dtype)
4b. Using explicit looping with two passes: one to determine the size of the result and one to actually perform the computation
Cython-accelerated (pre-compiled condition)
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True

import numpy as np

# NOTE: cond_cy() and _filter_cy() are the same as in the previous Cython cell;
# since each %%cython cell is compiled separately, in practice they need to be
# defined in the same cell.

cdef size_t _filtered_size_cy(long[:] arr, size_t size):
    cdef size_t j = 0
    for i in range(size):
        if cond_cy(arr[i]):
            j += 1
    return j


def filter2_cy(arr):
    cdef size_t new_size = _filtered_size_cy(arr, arr.size)
    result = np.empty(new_size, dtype=arr.dtype)
    new_size = _filter_cy(arr, result, arr.size)
    return result
Numba-accelerated
@nb.njit
def filter2_nb(arr, cond_nb):
    j = 0
    for i in range(arr.size):
        if cond_nb(arr[i]):
            j += 1
    result = np.empty(j, dtype=arr.dtype)
    j = 0
    for i in range(arr.size):
        if cond_nb(arr[i]):
            result[j] = arr[i]
            j += 1
    return result
Timing Benchmarks
(The generator-based filter_fromiter() method is much slower than the others -- by approx. 2 orders of magnitude.
Similar (and perhaps slightly worse) performances can be expected from a list comprehension.
This would be true for any explicit looping with non-accelerated code.)
The timing would depend on both the input array size and the percent of filtered items.
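As an illustration only (the full benchmark code is linked further below), a minimal harness for timing the definitions above could look like the following, assuming the setup sketched in the Summary:
import timeit

for func in (filter_fromiter, filter_mask, filter_idx):
    print(func.__name__, timeit.timeit(lambda: func(arr, cond), number=10))

for func in (filter_nb, filter_resize_xnb, filter_gen_xnb, filter2_nb):
    print(func.__name__, timeit.timeit(lambda: func(arr, cond_nb), number=10))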
As a function of input size
The first graph addresses the timings as a function of the input size (for ~50% filtering factor -- i.e. 50% of the elements appear in the result):
In general, explicit looping with one form of acceleration leads to the fastest execution, with slight variations depending on input size.
Within NumPy, the integer indexing approaches are basically on par with boolean masking.
The benefits of using np.fromiter() (no pre-allocation) can be reaped by writing a Numba-accelerated generator, which would come out slower than the other approaches (within an order of magnitude), but much faster than pure-Python looping.
As a function of filling
The second graph addresses the timings as a function of items passing through the filter (for a fixed input size of ~1 million elements):
The first observation is that all methods are slowest when approaching ~50% filling; with less (or more) filling they are faster, and they are fastest towards no filling (i.e. the highest percentage of filtered-out values and the lowest percentage of values passing through, as indicated on the x-axis of the graph).
Again, explicit looping with some means of acceleration leads to the fastest execution.
Within NumPy, the integer indexing and boolean masking approaches are again basically the same.
(Full code available here)
Memory Considerations
The generator-based filter_fromiter() method requires only minimal temporary storage, independently of the size of the input.
Memory-wise this is the most efficient method.
This approach can be effectively sped up with a Numba-accelerated generator.
Of similar memory efficiency are the Cython / Numba two-passes methods, because the size of the output is determined during the first pass.
The caveat here is that computing the condition has to be fast for these methods to be fast.
On the memory side, the single-pass solutions for both Cython and Numba require a temporary array of the size of the input.
Hence, these are not very memory-efficient compared to two-passes or the generator-based one.
Their asymptotic temporary memory footprint is similar to that of masking, but the constant term is typically larger than for masking.
The boolean masking solution requires a temporary array of the size of the input but of type bool, which in NumPy is 1 byte, so this is ~8 times smaller than the default size of a NumPy array on a typical 64-bit system.
The integer indexing solution has the same requirement as the boolean mask slicing in the first step (inside np.nonzero() call), which gets converted to a series of ints (typically int64 on a 64-bit system) in the second step (the output of np.nonzero()).
This second step, therefore, has variable memory requirements, depending on the number of filtered elements.
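As a rough illustration (not part of the original post), the temporary sizes involved can be inspected directly:
import numpy as np

arr = np.arange(1_000_000)       # int64: ~8 MB
mask = arr < 500_000             # bool temporary: ~1 MB
idx = np.nonzero(mask)[0]        # int64 indices: ~8 bytes per passing element

print(arr.nbytes, mask.nbytes, idx.nbytes)
# 8000000 1000000 4000000  (here ~50% of the values pass the filter)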
Remarks
both boolean masking and integer indexing require some form of conditioning that is capable of producing a boolean mask (or, alternatively, a list of indices); in the above implementation, the condition is broadcastable
the generator and the Numba-accelerated methods are also the most flexible when it comes to specifying a different filtering condition
the Numba-accelerated methods require the condition to be Numba-compatible to access the Numba acceleration in NoPython mode
the Cython solution requires specifying the data types for it to be fast, or extra effort for multiple types dispatching, and extra effort (not explored here) to get the same level of flexibility as the other methods
for both Numba and Cython, the filtering condition can be hard-coded, leading to marginal but appreciable speed differences
the single-pass solutions require additional code to handle the unused (but otherwise initially allotted) memory.
the NumPy methods do NOT return a view of the input, but a copy, as a result of advanced indexing:
arr = np.arange(100)
k = 50
print('`arr[arr > k]` is a copy: ', arr[arr > k].base is None)
# `arr[arr > k]` is a copy: True
print('`arr[np.where(arr > k)]` is a copy: ', arr[np.where(arr > k)].base is None)
# `arr[np.where(arr > k)]` is a copy: True
print('`arr[:k]` is a copy: ', arr[:k].base is None)
# `arr[:k]` is a copy: False
(EDITED: various improvements based on @ShadowRanger, @PaulPanzer, @max9111 and @DavidW comments.)
Related
I have been testing the following block for numba speed up:
import numpy as np
import timeit
from numba import njit
import numba
@numba.guvectorize(["void(float64[:],float64[:],float64[:],float64, float64, float64[:])"],
                   "(m),(m),(m),(),()->(m)", nopython=True, target="parallel")
def func_diff_calc_numba_v2(X, refY, Y, lower, upper, arr):
    fac = 1000
    for i in range(len(X)):
        if X[i] >= lower and X[i] < upper:
            diff = Y[i] - refY[i]
            arr[i] = diff**2 * fac
        else:
            arr[i] = 0
@numba.vectorize('(float64, float64, float64, float64, float64)', nopython=True, target="parallel")
def func_diff_calc_numba_v3(X, refY, Y, lower, upper):
    fac = 1000
    if X >= lower and X < upper:
        return (Y - refY)**2 * fac
    else:
        return 0.0
@njit
def func_diff_calc_numba(X, refY, Y, lower, upper):
    fac = 1000
    arr = np.zeros(len(X))
    for i in range(len(X)):
        if X[i] >= lower and X[i] < upper:
            arr[i] = (Y[i] - refY[i])**2 * fac
        else:
            arr[i] = 0
    return arr
np.random.seed(69)
X=np.arange(10000)
refY = np.random.rand(10000)
Y = np.random.rand(10000)
lower=1
upper=10000
print("func_diff_calc_numba: {:.5f}".format(timeit.timeit(stmt="func_diff_calc_numba(X,refY,Y,lower,upper)", number=10000, globals=globals())))
print("func_diff_calc_numba_v2: {:.5f}".format(timeit.timeit(stmt="func_diff_calc_numba_v2(X,refY,Y,lower,upper)", number=10000, globals=globals())))
print("func_diff_calc_numba_v3: {:.5f}".format(timeit.timeit(stmt="func_diff_calc_numba_v3(X,refY,Y,lower,upper)", number=10000, globals=globals())))
The speedups for the v2 and v3 are significantly different:
func_diff_calc_numba: 0.58257
func_diff_calc_numba_v2: 0.49573
func_diff_calc_numba_v3: 1.07519
and if I change the number of iterations from 10,000 to 100,000 then:
func_diff_calc_numba: 1.67251
func_diff_calc_numba_v2: 4.85828
func_diff_calc_numba_v3: 11.63361
I was expecting vectorize and guvectorize to give almost the same speedup, but while njit and guvectorize are almost equal to each other in time, vectorize is ~2 and ~10 times slower than guvectorize and njit respectively. Is there something wrong in my implementation, or is it something else?
The task (function + inputs) is probably too small/simple to be effectively parallelized, causing the overhead of doing so to increase total runtime. If you compile both to the default cpu target the difference disappears I assume?
Because your input is 1D, with the given ufunc signature the guvectorize doesn't parallelize anything: there's only one task.
A like-for-like parallel comparison can be done by setting the signature to "(),(),(),(),()->()" basically telling it to also (like vectorize) apply the function element-wise. And those results should be very close again. But then you'll see that the overhead of parallelization makes it worse for both in this case.
For me timings are:
Using target="parallel" for both, and "(m),(m),(m),(),()->(m)":
numba_guvec : 0.26364
numba_vec : 3.26960
Using target="cpu" for both, and "(m),(m),(m),(),()->(m)":
numba_guvec : 0.21886
numba_vec : 0.26198
Using target="parallel" for both, and "(),(),(),(),()->()":
numba_guvec : 3.05748
numba_vec : 3.15587
You'll probably find similar behavior if you also compare @njit(parallel=True) with numba.prange.
At the end, there's just some extra work involved for parallelizing something, and that's only worth it for a sufficiently large (slow) task.
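For completeness, a sketch (not from the original answers) of what a @njit vs. @njit(parallel=True) + prange comparison could look like; for inputs of this size the parallel version typically pays more in threading overhead than it gains:
import numpy as np
from numba import njit, prange


@njit
def diff_serial(X, refY, Y, lower, upper):
    # same element-wise computation as above, serial loop
    fac = 1000
    arr = np.zeros(len(X))
    for i in range(len(X)):
        if X[i] >= lower and X[i] < upper:
            arr[i] = (Y[i] - refY[i]) ** 2 * fac
    return arr


@njit(parallel=True)
def diff_parallel(X, refY, Y, lower, upper):
    # identical body, but the loop iterations are distributed over threads
    fac = 1000
    arr = np.zeros(len(X))
    for i in prange(len(X)):
        if X[i] >= lower and X[i] < upper:
            arr[i] = (Y[i] - refY[i]) ** 2 * fac
    return arr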
Suppose I have N items and a multi-hot vector of values {0, 1} that represents inclusion of these items in a result:
N = 4
# items 1 and 3 will be included in the result
vector = [0, 1, 0, 1]
# item 2 will be included in the result
vector = [0, 0, 1, 0]
I'm also provided a matrix of conflicts which indicates which items cannot be included in the result at the same time:
conflicts = [
[0, 1, 1, 0], # any result that contains items 1 AND 2 is invalid
[0, 1, 1, 1], # any result that contains AT LEAST 2 items from {1, 2, 3} is invalid
]
Given this matrix of conflicts, we can determine the validity of the earlier vectors:
# invalid as it triggers conflict 1: [0, 1, 1, 1]
vector = [0, 1, 0, 1]
# valid as it triggers no conflicts
vector = [0, 0, 1, 0]
A naive solution to detect whether a given vector is "valid" (i.e. does not trigger any conflicts) may be done via a dot product and summation operation in numpy:
violation = np.dot(conflicts, vector)
is_valid = np.max(violation) <= 1
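For concreteness, a self-contained version of this naive check with the example data above (illustrative only):
import numpy as np

conflicts = np.array([[0, 1, 1, 0],
                      [0, 1, 1, 1]])


def naive_is_valid(vector):
    violation = np.dot(conflicts, vector)
    return np.max(violation) <= 1


print(naive_is_valid(np.array([0, 1, 0, 1])))  # False: the second conflict row is hit twice
print(naive_is_valid(np.array([0, 0, 1, 0])))  # True: no conflict row is hit more than once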
Are there more efficient ways to perform this operation, perhaps either via np.einsum or by bypassing numpy arrays entirely in favour of bit manipulation?
We assume that the number of vectors being checked can be very large (e.g. up to 2^N if we evaluate all possibilities) but that only one vector is likely being checked at a time (to avoid generating a matrix of shape up to (2^N, N) as input).
TL;DR: you can use Numba to optimize np.dot so that it operates only on binary values. More specifically, you can perform SIMD-like operations on 8 bytes at once using 64-bit views.
Converting lists to arrays
First of all, the lists can be efficiently converted to relatively-compact arrays using this approach:
vector = np.fromiter(vector, np.uint8)
conflicts = np.array([np.fromiter(conflicts[i], np.uint8) for i in range(len(conflicts))])
This is faster than the automatic NumPy conversion with np.array (there are fewer checks to perform internally, NumPy knows what type of array to build, and the resulting array is smaller in memory and thus faster to fill). This step can be used to speed up your np.dot-based solution.
If the inputs are already NumPy arrays, then check that they are of type np.uint8 or np.int8. Otherwise, cast them to such a type, for example with conflicts = conflicts.astype(np.uint8).
First try
Then, one solution could be to use np.packbits to pack the input binary values as tightly as possible into an array of bits in memory, and then perform logical ANDs. But it turns out that np.packbits is pretty slow, so this solution is not a good idea in the end. In fact, any solution creating temporary arrays with a shape similar to conflicts will be slow, since writing such an array to memory is generally slower than np.dot (which reads conflicts from memory only once).
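For reference, the np.packbits idea could look something like the following (a sketch of the rejected approach, not the final code of this answer):
import numpy as np

packed_vec = np.packbits(vector)                  # shape: (ceil(m / 8),)
packed_conf = np.packbits(conflicts, axis=1)      # shape: (n, ceil(m / 8))

hits = np.bitwise_and(packed_conf, packed_vec)    # bitwise AND per row
counts = np.unpackbits(hits, axis=1).sum(axis=1)  # conflicts triggered per row
is_valid = counts.max() <= 1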
Using Numba
Since np.dot is pretty well optimized, the only way to beat it is to use optimized native code. Numba can be used to generate native executable code at runtime from NumPy-based Python code thanks to a just-in-time compiler. The idea is to perform a logical AND between vector and each row of conflicts, block by block. Conflicts are checked for each block so as to stop the computation as early as possible. Blocks can be compared efficiently in groups of 8 bytes by comparing the uint64 views of the two arrays (in a SIMD-friendly way).
import numpy as np
import numba as nb


@nb.njit('bool_(uint8[::1], uint8[:,::1])')
def check_valid(vector, conflicts):
    n, m = conflicts.shape
    assert vector.size == m

    for i in range(n):
        block_size = 128  # In the range: 8, 16, ..., 248
        conflicts_row = conflicts[i, :]
        gsum = 0  # Global sum of conflicts
        m_limit = m // block_size * block_size

        for j in range(0, m_limit, block_size):
            vector_block = vector[j:j+block_size].view(np.uint64)
            conflicts_block = conflicts_row[j:j+block_size].view(np.uint64)

            # Matching
            lsum = np.uint64(0)  # 8 local sums of conflicts
            for k in range(block_size//8):
                lsum += vector_block[k] & conflicts_block[k]

            # Trick to perform the reduction of all the bytes in lsum
            lsum += lsum >> 32
            lsum += lsum >> 16
            lsum += lsum >> 8
            gsum += lsum & 0xFF

            # Check if there is a conflict
            if gsum >= 2:
                return False

        # Remaining part
        for j in range(m_limit, m):
            gsum += vector[j] & conflicts_row[j]
            if gsum >= 2:
                return False

    return True
Results
This is about 9 times faster than np.dot on my machine for a large conflicts array of shape (16, 65536) (without conflicts). The time to convert the lists is not included in either case. When there are conflicts, the provided solution is much faster still, since it can stop the computation early.
Theoretically, the computation should be even faster, but the Numba JIT does not succeed in vectorizing the loop with SIMD instructions. That being said, the same issue seems to appear for np.dot. If the arrays are even bigger, you can parallelize the computation of the blocks (at the expense of slower computation when the function returns False).
I wanted to code a prime number generator in Python - I've only done this in C and Java. I did the following, using an integer bitmap as an array. The cost of the algorithm should grow as n log(log(n)), but I am seeing an exponential increase in cost/time as the problem size n increases. Is there something obvious I am not seeing, or don't know about Python, as integers grow larger than practical? I am using python-3.8.3.
def countPrimes(n):
    if n < 3:
        return []
    arr = (1 << (n-1)) - 2
    for i in range(2, n):
        selector = 1 << (i - 1)
        if (selector & arr) == 0:
            continue
        # We have a prime
        composite = selector
        while (composite := composite << i) < arr:
            arr = arr & (~composite)
    primes = []
    for i in range(n):
        if (arr >> i) & 1 == 1:
            primes.append(i+1)
    return primes
Some analysis of my runtime:
A plot of y = nlog(log(n)) (red line which is steeper) and y = x (blue line which is less steep):
I'd normally not use integers with sizes exceeding uint64, but because Python allows unlimited-size integers and I'm just testing, I used the above approach. As I said, I am trying to understand why the algorithm's running time increases exponentially with problem size n.
I used an integer bitmap as an array
That's extremely expensive. Python ints are immutable. Every time you want to toggle a bit, you're building a whole new gigantic int.
You also need to build other giant ints just to access single bits you're interested in - for example, composite and ~composite are huge in arr = arr & (~composite), even though you're only interested in 1 bit.
Use an actual mutable sequence type. Maybe a list, maybe a NumPy array, maybe some bitvector type off of PyPI, but don't use an int.
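For example, a minimal sketch of the same sieve idea using a mutable bytearray instead of an int bitmap (illustrative, not a drop-in replacement for the code above):
def count_primes_bytearray(n):
    # One byte per candidate number; flipping a flag mutates in place
    # instead of rebuilding a huge immutable int.
    if n < 2:
        return []
    is_prime = bytearray([1]) * n
    is_prime[0] = is_prime[1] = 0
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            # mark all multiples of i as composite with one slice assignment
            is_prime[i * i::i] = bytearray(len(range(i * i, n, i)))
    return [i for i in range(n) if is_prime[i]]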
I am trying to expand the numpy "tensordot" such that things like:
K_ijklm = A_ki * B_jml can be written in a clear way like this: K = mytensordot(A,B,[2,0],[1,4,3])
To my understanding, numpy's tensordot (with optional argument 0) would be able to do something like this: K_kijml = A_ki * B_jml, i.e. keeping the order of the indexes. Therefore I would then have to do a number of np.swapaxes() to obtain the matrix K_ijklm, which in a complicated case can be an easy source of errors (potentially very hard to debug).
The problem is that my implementation is slow (10x slower than tensordot [EDIT: It is actually MUCH slower than that]), even when using numba. I was wondering if anyone would have some insight on what could be done to improve the performance of my algorithm.
MWE
import numpy as np
import numba as nb
import itertools
import timeit
@nb.jit()
def myproduct(dimN):
    N = np.prod(dimN)
    L = len(dimN)
    Product = np.zeros((N, L), dtype=np.int32)
    rn = 0
    for n in range(1, N):
        for l in range(L):
            if l == 0:
                rn = 1
            v = Product[n-1, L-1-l] + rn
            rn = 0
            if v == dimN[L-1-l]:
                v = 0
                rn = 1
            Product[n, L-1-l] = v
    return Product
@nb.jit()
def mytensordot(A, B, iA, iB):
    iA, iB = np.array(iA, dtype=np.int32), np.array(iB, dtype=np.int32)
    dimA, dimB = A.shape, B.shape
    NdimA, NdimB = len(dimA), len(dimB)
    if len(iA) != NdimA: raise ValueError("iA must be same size as dim A")
    if len(iB) != NdimB: raise ValueError("iB must be same size as dim B")
    NdimN = NdimA + NdimB
    dimN = np.zeros(NdimN, dtype=np.int32)
    dimN[iA] = dimA
    dimN[iB] = dimB
    Out = np.zeros(dimN)
    indexes = myproduct(dimN)
    for nidxs in indexes:
        idxA = tuple(nidxs[iA])
        idxB = tuple(nidxs[iB])
        v = A[idxA] * B[idxB]
        Out[tuple(nidxs)] = v
    return Out
A=np.random.random((4,5,3))
B=np.random.random((6,4))
def runmytdot():
    return mytensordot(A, B, [0, 2, 3], [1, 4])

def runtensdot():
    return np.tensordot(A, B, 0).swapaxes(1, 3).swapaxes(2, 3)
print(np.all(runmytdot()==runtensdot()))
print(timeit.timeit(runmytdot,number=100))
print(timeit.timeit(runtensdot,number=100))
Result:
True
1.4962144780438393
0.003484356915578246
You have run into a known issue. numpy.zeros requires a tuple when creating a multidimensional array. If you pass something other than a tuple, it sometimes works, but that's only because numpy is smart about converting the object into a tuple first.
The trouble is that numba does not currently support conversion of arbitrary iterables into tuples. So this line fails when you try to compile it in nopython=True mode. (A couple of others fail too, but this is the first.)
Out=np.zeros(dimN)
In theory you could call np.prod(dimN), create a flat array of zeros, and reshape it, but then you run into the very same problem: the reshape method of numpy arrays requires a tuple!
This is quite a vexing problem with numba -- I had not encountered it before. I really doubt the solution I have found is the correct one, but it is a working solution that allows us to compile a version in nopython=True mode.
The core idea is to avoid using tuples for indexing by directly implementing an indexer that follows the strides of the array:
@nb.jit(nopython=True)
def index_arr(a, ix_arr):
    strides = np.array(a.strides) / a.itemsize
    ix = int((ix_arr * strides).sum())
    return a.ravel()[ix]

@nb.jit(nopython=True)
def index_set_arr(a, ix_arr, val):
    strides = np.array(a.strides) / a.itemsize
    ix = int((ix_arr * strides).sum())
    a.ravel()[ix] = val
This allows us to get and set values without needing a tuple.
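A quick sanity check (illustrative, assuming numpy is imported as np) that the stride-based indexer agrees with ordinary tuple indexing:
import numpy as np

a = np.random.random((4, 5, 3))
ix = np.array([2, 1, 0])
print(index_arr(a, ix) == a[2, 1, 0])  # True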
We can also avoid using reshape by passing the output buffer into the jitted function, and wrapping that function in a helper:
@nb.jit()  # We can't use nopython mode here...
def mytensordot(A, B, iA, iB):
    iA, iB = np.array(iA, dtype=np.int32), np.array(iB, dtype=np.int32)
    dimA, dimB = A.shape, B.shape
    NdimA, NdimB = len(dimA), len(dimB)
    if len(iA) != NdimA:
        raise ValueError("iA must be same size as dim A")
    if len(iB) != NdimB:
        raise ValueError("iB must be same size as dim B")
    NdimN = NdimA + NdimB
    dimN = np.zeros(NdimN, dtype=np.int32)
    dimN[iA] = dimA
    dimN[iB] = dimB
    Out = np.zeros(dimN)
    return mytensordot_jit(A, B, iA, iB, dimN, Out)
Since the helper contains no loops, it adds some overhead, but the overhead is pretty trivial. Here's the final jitted function:
@nb.jit(nopython=True)
def mytensordot_jit(A, B, iA, iB, dimN, Out):
    for i in range(np.prod(dimN)):
        # int_to_idx (a helper not shown here) converts the flat index i
        # into a multi-index array for an array of shape dimN
        nidxs = int_to_idx(i, dimN)
        a = index_arr(A, nidxs[iA])
        b = index_arr(B, nidxs[iB])
        index_set_arr(Out, nidxs, a * b)
    return Out
Unfortunately, this does not wind up generating as much of a speedup as we might like. On smaller arrays it's about 5x slower than tensordot; on larger arrays it's still 50x slower. (But at least it's not 1000x slower!) This is not too surprising in retrospect, since dot and tensordot are both using BLAS under the hood, as @hpaulj reminds us.
After finishing this code, I saw that einsum has solved your real problem -- nice!
But the underlying issue that your original question points to -- that indexing with arbitrary-length tuples is not possible in jitted code -- is still a frustration. So hopefully this will be useful to someone else!
tensordot with scalar axes values can be obscure. I explored it in
How does numpy.tensordot function works step-by-step?
There I deduced that np.tensordot(A, B, axes=0) is equivalent to using axes=[[], []].
In [757]: A=np.random.random((4,5,3))
...: B=np.random.random((6,4))
In [758]: np.tensordot(A,B,0).shape
Out[758]: (4, 5, 3, 6, 4)
In [759]: np.tensordot(A,B,[[],[]]).shape
Out[759]: (4, 5, 3, 6, 4)
That in turn is equivalent to calling dot with a new size-1 sum-of-products dimension:
In [762]: np.dot(A[...,None],B[...,None,:]).shape
Out[762]: (4, 5, 3, 6, 4)
(4,5,3,1) * (6,1,4) # the 1 is the last of A and 2nd to the last of B
dot is fast, using BLAS (or equivalent) code. Swapping axes and reshaping is also relatively fast.
einsum gives us a lot of control over axes
replicating the above products:
In [768]: np.einsum('jml,ki->jmlki',A,B).shape
Out[768]: (4, 5, 3, 6, 4)
and with swapping:
In [769]: np.einsum('jml,ki->ijklm',A,B).shape
Out[769]: (4, 4, 6, 3, 5)
A minor point - the double swap can be written as one transpose:
.swapaxes(1,3).swapaxes(2,3)
.transpose(0,3,1,2,4)
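A quick check (illustrative) that the two spellings agree:
import numpy as np

A = np.random.random((4, 5, 3))
B = np.random.random((6, 4))
K = np.tensordot(A, B, 0)

print(np.allclose(K.swapaxes(1, 3).swapaxes(2, 3),
                  K.transpose(0, 3, 1, 2, 4)))  # True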
I have two numpy boolean arrays (a and b). I need to find how many of their elements are equal. Currently, I do len(a) - (a ^ b).sum(), but the xor operation creates an entirely new numpy array, as I understand. How do I efficiently implement this desired behavior without creating the unnecessary temporary array?
I've tried using numexpr, but I can't quite get it to work right. It doesn't support the notion that True is 1 and False is 0, so I have to use ne.evaluate("sum(where(a==b, 1, 0))"), which takes about twice as long.
Edit: I forgot to mention that one of these arrays is actually a view into another array of a different size, and both arrays should be considered immutable. Both arrays are 2-dimensional and tend to be somewhere around 25x40 in size.
Yes, this is the bottleneck of my program and is worth optimizing.
On my machine this is faster:
(a == b).sum()
If you don't want to use any extra storage, then I would suggest using numba.
I'm not too familiar with it, but this seems to work well.
I ran into some trouble getting Cython to take a boolean NumPy array.
from numba import autojit
from numpy.random import rand  # needed for the example arrays below


def pysumeq(a, b):
    tot = 0
    for i in xrange(a.shape[0]):
        for j in xrange(a.shape[1]):
            if a[i, j] == b[i, j]:
                tot += 1
    return tot


# make numba version
nbsumeq = autojit(pysumeq)

A = (rand(10, 10) < .5)
B = (rand(10, 10) < .5)

# do a simple dry run to get it to compile
# for this specific use case
nbsumeq(A, B)
If you don't have numba, I would suggest using the answer by @user2357112
Edit: Just got a Cython version working, here's the .pyx file. I'd go with this.
from numpy cimport ndarray as ar
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cysumeq(ar[np.uint8_t, ndim=2, cast=True] a, ar[np.uint8_t, ndim=2, cast=True] b):
    cdef int i, j, h=a.shape[0], w=a.shape[1], tot=0
    for i in xrange(h):
        for j in xrange(w):
            if a[i, j] == b[i, j]:
                tot += 1
    return tot
To start with you can skip the A*B step:
>>> a
array([ True, False, True, False, True], dtype=bool)
>>> b
array([False, True, True, False, True], dtype=bool)
>>> np.sum(~(a^b))
3
If you do not mind destroying array a or b, I am not sure you will get faster than this:
>>> a^=b #In place xor operator
>>> np.sum(~a)
3
If the problem is allocation and deallocation, maintain a single output array and tell numpy to put the results there every time:
out = np.empty_like(a) # Allocate this outside a loop and use it every iteration
num_eq = np.equal(a, b, out).sum()
This'll only work if the inputs are always the same dimensions, though. You may be able to make one big array and slice out a part that's the size you need for each call if the inputs have varying sizes, but I'm not sure how much that slows you down.
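A usage sketch (with illustrative names, not from the original answer) of reusing a single preallocated buffer across many comparisons:
import numpy as np

a = np.random.rand(25, 40) < 0.5
out = np.empty(a.shape, dtype=bool)   # allocated once, reused every iteration

for _ in range(1000):
    b = np.random.rand(25, 40) < 0.5
    num_eq = np.equal(a, b, out).sum()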
Improving upon IanH's answer, it's also possible to get access to the underlying C array in a numpy array from within Cython, by supplying mode="c" to ndarray.
from numpy cimport ndarray as ar
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef int cy_sum_eq(ar[np.uint8_t, ndim=2, cast=True, mode="c"] a, ar[np.uint8_t, ndim=2, cast=True, mode="c"] b):
    cdef int i, j, h=a.shape[0], w=a.shape[1], tot=0
    cdef np.uint8_t* adata = &a[0, 0]
    cdef np.uint8_t* bdata = &b[0, 0]
    for i in xrange(h):
        for j in xrange(w):
            if adata[j] == bdata[j]:
                tot += 1
        adata += w
        bdata += w
    return tot
This is about 40% faster on my machine than IanH's Cython version, and I've found that rearranging the loop contents doesn't seem to make much of a difference at this point probably due to compiler optimizations. At this point, one could potentially link to a C function optimized with SSE and such to perform this operation and pass adata and bdata as uint8_t*s