Numpy sum of operator results without allocating an unnecessary array - python

I have two numpy boolean arrays (a and b). I need to find how many of their elements are equal. Currently, I do len(a) - (a ^ b).sum(), but the xor operation creates an entirely new numpy array, as I understand. How do I efficiently implement this desired behavior without creating the unnecessary temporary array?
I've tried using numexpr, but I can't quite get it to work right. It doesn't support the notion that True is 1 and False is 0, so I have to use ne.evaluate("sum(where(a==b, 1, 0))"), which takes about twice as long.
Edit: I forgot to mention that one of these arrays is actually a view into another array of a different size, and both arrays should be considered immutable. Both arrays are 2-dimensional and tend to be somewhere around 25x40 in size.
Yes, this is the bottleneck of my program and is worth optimizing.

On my machine this is faster:
(a == b).sum()
If you don't want to use any extra storage, then I would suggest using numba.
I'm not too familiar with it, but this seems to work well.
I ran into some trouble getting Cython to take a boolean NumPy array.
from numba import autojit
from numpy.random import rand  # needed for the test arrays below

def pysumeq(a, b):
    tot = 0
    for i in xrange(a.shape[0]):
        for j in xrange(a.shape[1]):
            if a[i,j] == b[i,j]:
                tot += 1
    return tot

# make numba version
nbsumeq = autojit(pysumeq)
A = (rand(10,10)<.5)
B = (rand(10,10)<.5)
# do a simple dry run to get it to compile
# for this specific use case
nbsumeq(A, B)
If you don't have numba, I would suggest using the answer by @user2357112
Edit: Just got a Cython version working, here's the .pyx file. I'd go with this.
from numpy cimport ndarray as ar
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cysumeq(ar[np.uint8_t,ndim=2,cast=True] a, ar[np.uint8_t,ndim=2,cast=True] b):
    cdef int i, j, h=a.shape[0], w=a.shape[1], tot=0
    for i in xrange(h):
        for j in xrange(w):
            if a[i,j] == b[i,j]:
                tot += 1
    return tot
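One possible way to build and try the .pyx file is via pyximport (a minimal sketch; the file name cysum.pyx is an assumption, not part of the answer):
# Hypothetical usage sketch: assumes the Cython code above is saved as cysum.pyx
# next to this script; pyximport compiles it on first import.
import numpy as np
import pyximport
pyximport.install(setup_args={"include_dirs": np.get_include()})
from cysum import cysumeq  # hypothetical module name

a = np.random.rand(25, 40) < 0.5
b = np.random.rand(25, 40) < 0.5
assert cysumeq(a, b) == (a == b).sum()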

To start with, you can skip the A*B step:
>>> a
array([ True, False, True, False, True], dtype=bool)
>>> b
array([False, True, True, False, True], dtype=bool)
>>> np.sum(~(a^b))
3
If you do not mind destroying array a or b, I am not sure you will get faster than this:
>>> a^=b #In place xor operator
>>> np.sum(~a)
3
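Note that ~a still allocates a temporary boolean array; if a can be clobbered anyway, a possible variant (a sketch, not benchmarked) does the inversion in place as well:
a ^= b               # in-place xor, as above
np.invert(a, out=a)  # in-place logical not for boolean arrays
a.sum()              # 3 for the example arrays above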

If the problem is allocation and deallocation, maintain a single output array and tell numpy to put the results there every time:
out = np.empty_like(a) # Allocate this outside a loop and use it every iteration
num_eq = np.equal(a, b, out).sum()
This will only work if the inputs always have the same shape, though. If the shapes vary, you may be able to allocate one big array and slice out a view of the size you need for each call, but I'm not sure how much that slows you down.
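A minimal sketch of that buffer-slicing idea (the 25x40 maximum shape is taken from the question; the names are illustrative):
import numpy as np

# Allocate one scratch buffer at the maximum expected shape and reuse a view
# of it for each call, so nothing is allocated per call.
scratch = np.empty((25, 40), dtype=bool)

def count_equal(a, b):
    out = scratch[:a.shape[0], :a.shape[1]]  # a view, no new allocation
    return np.equal(a, b, out).sum()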

Improving upon IanH's answer, it's also possible to get access to the underlying C array in a numpy array from within Cython, by supplying mode="c" to ndarray.
from numpy cimport ndarray as ar
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef int cy_sum_eq(ar[np.uint8_t,ndim=2,cast=True,mode="c"] a, ar[np.uint8_t,ndim=2,cast=True,mode="c"] b):
    cdef int i, j, h=a.shape[0], w=a.shape[1], tot=0
    cdef np.uint8_t* adata = &a[0, 0]
    cdef np.uint8_t* bdata = &b[0, 0]
    for i in xrange(h):
        for j in xrange(w):
            if adata[j] == bdata[j]:
                tot += 1
        adata += w
        bdata += w
    return tot
This is about 40% faster on my machine than IanH's Cython version, and I've found that rearranging the loop contents doesn't seem to make much of a difference at this point, probably due to compiler optimizations. At this point, one could potentially link to a C function optimized with SSE and the like to perform this operation, passing adata and bdata as uint8_t* pointers.

Related

Filtering (reducing) a NumPy Array

Suppose I have a NumPy array arr that I want to element-wise filter (reduce) depending on the truth value of a (broadcastable) function, e.g.
I want to get only values below a certain threshold value k:
def cond(x):
    return x < k
There are a couple of methods, e.g.:
Using a generator: np.fromiter((x for x in arr if cond(x)), dtype=arr.dtype) (which is a memory efficient version of using a list comprehension: np.array([x for x in arr if cond(x)]) because np.fromiter() will produce a NumPy array directly, without the need to allocate an intermediate Python list)
Using boolean masking: arr[cond(arr)]
Using integer indexing: arr[np.nonzero(cond(arr))] (or equivalently using np.where() as it defaults to np.nonzero() with only one condition)
Using explicit looping with:
single pass and final copying/resizing
two passes: one to determine the size of the result and one to actually perform the computation
(The last two approaches can be accelerated with Cython or Numba.)
Which is the fastest? What about memory efficiency?
(EDITED: To use np.nonzero() directly instead of np.where(), as per @ShadowRanger's comment)
Summary
Using a loop-based approach with single pass and copying, accelerated with Numba, offers the best overall trade-off in terms of speed, memory efficiency and flexibility.
If the execution of the condition function is sufficiently fast, the two-pass approach (filter2_nb()) may be faster, and it is more memory efficient regardless.
Also, for sufficiently large inputs, resizing instead of copying (filter_resize_xnb()) leads to faster execution.
If the data type (and the condition function) is known ahead of time and can be compiled, the Cython acceleration seems to be faster.
It is likely that a similar hard-coding of the condition would lead to a comparable speed-up with Numba acceleration as well.
When it comes to NumPy-only based approaches, boolean masking or integer indexing are of comparable speed, and which one comes out faster depends largely on the filtering factor, i.e. the portion of values that passes through the filtering condition.
The np.fromiter() approach is much slower (it would be off-charts in the plot), but does not produce large temporary objects.
Note that the following tests are meant to give some insights into the different approaches and should be taken with a grain of salt.
The most relevant assumptions are that the condition is broadcastable and that it would eventually compute very fast.
Definitions
Using a generator:
def filter_fromiter(arr, cond):
    return np.fromiter((x for x in arr if cond(x)), dtype=arr.dtype)
Using boolean masking:
def filter_mask(arr, cond):
    return arr[cond(arr)]
Using integer indexing:
def filter_idx(arr, cond):
    return arr[np.nonzero(cond(arr))]
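A quick usage sketch (the array values and threshold here are only illustrative; all three calls should return the same elements):
import numpy as np

arr = np.arange(16)
k = 8
cond = lambda x: x < k  # the example condition from above

print(filter_fromiter(arr, cond))  # [0 1 2 3 4 5 6 7]
print(filter_mask(arr, cond))
print(filter_idx(arr, cond))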
4a. Using explicit looping, with single pass and final copying/resizing
Cython-accelerated with copying (pre-compiled condition)
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True
import numpy as np

cdef long NUM = 1048576
cdef long MAX_VAL = 1048576
cdef long K = 1048576 // 2

cdef int cond_cy(long x, long k=K):
    return x < k

cdef size_t _filter_cy(long[:] arr, long[:] result, size_t size):
    cdef size_t j = 0
    for i in range(size):
        if cond_cy(arr[i]):
            result[j] = arr[i]
            j += 1
    return j

def filter_cy(arr):
    result = np.empty_like(arr)
    new_size = _filter_cy(arr, result, arr.size)
    return result[:new_size].copy()
Cython-accelerated with resizing (pre-compiled condition)
def filter_resize_cy(arr):
    result = np.empty_like(arr)
    new_size = _filter_cy(arr, result, arr.size)
    result.resize(new_size)
    return result
Numba-accelerated with copying
import numba as nb

K = 1048576 // 2  # same threshold value as in the Cython cell above

@nb.njit
def cond_nb(x, k=K):
    return x < k

@nb.njit
def filter_nb(arr, cond_nb):
    result = np.empty_like(arr)
    j = 0
    for i in range(arr.size):
        if cond_nb(arr[i]):
            result[j] = arr[i]
            j += 1
    return result[:j].copy()
Numba-accelerated with resizing
@nb.njit
def _filter_out_nb(arr, out, cond_nb):
    j = 0
    for i in range(arr.size):
        if cond_nb(arr[i]):
            out[j] = arr[i]
            j += 1
    return j

def filter_resize_xnb(arr, cond_nb):
    result = np.empty_like(arr)
    j = _filter_out_nb(arr, result, cond_nb)
    result.resize(j, refcheck=False)  # unsupported in NoPython mode
    return result
Numba-accelerated with a generator and np.fromiter()
@nb.njit
def filter_gen_nb(arr, cond_nb):
    for i in range(arr.size):
        if cond_nb(arr[i]):
            yield arr[i]

def filter_gen_xnb(arr, cond_nb):
    return np.fromiter(filter_gen_nb(arr, cond_nb), dtype=arr.dtype)
4b. Using explicit looping with two passes: one to determine the size of the result and one to actually perform the computation
Cython-accelerated (pre-compiled condition)
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True

cdef size_t _filtered_size_cy(long[:] arr, size_t size):
    cdef size_t j = 0
    for i in range(size):
        if cond_cy(arr[i]):
            j += 1
    return j

def filter2_cy(arr):
    cdef size_t new_size = _filtered_size_cy(arr, arr.size)
    result = np.empty(new_size, dtype=arr.dtype)
    new_size = _filter_cy(arr, result, arr.size)
    return result
Numba-accelerated
@nb.njit
def filter2_nb(arr, cond_nb):
    j = 0
    for i in range(arr.size):
        if cond_nb(arr[i]):
            j += 1
    result = np.empty(j, dtype=arr.dtype)
    j = 0
    for i in range(arr.size):
        if cond_nb(arr[i]):
            result[j] = arr[i]
            j += 1
    return result
Timing Benchmarks
(The generator-based filter_fromiter() method is much slower than the others -- by approx. 2 orders of magnitude.
Similar (and perhaps slightly worse) performances can be expected from a list comprehension.
This would be true for any explicit looping with non-accelerated code.)
The timing would depend on both the input array size and the percent of filtered items.
As a function of input size
The first graph addresses the timings as a function of the input size (for ~50% filtering factor -- i.e. 50% of the elements appear in the result):
In general, explicit looping with one form of acceleration leads to the fastest execution, with slight variations depending on input size.
Within NumPy, the integer indexing approaches are basically on par with boolean masking.
The benefits of using np.fromiter() (no pre-allocation) can be reaped by writing a Numba-accelerated generator, which would come out slower than the other approaches (within an order of magnitude), but much faster than pure-Python looping.
As a function of filling
The second graph addresses the timings as a function of items passing through the filter (for a fixed input size of ~1 million elements):
The first observation is that all methods are slowest around a ~50% filling factor; with less (or more) filling they get faster, and they are fastest towards no filling (i.e. the highest percentage of filtered-out values and the lowest percentage of values passing through, as indicated on the x-axis of the graph).
Again, explicit looping with some means of acceleration leads to the fastest execution.
Within NumPy, the integer indexing and boolean masking approaches are again basically the same.
(Full code available here)
Memory Considerations
The generator-based filter_fromiter() method requires only minimal temporary storage, independently of the size of the input.
Memory-wise this is the most efficient method.
This approach can be effectively sped up with a Numba-accelerated generator.
Of similar memory efficiency are the Cython / Numba two-passes methods, because the size of the output is determined during the first pass.
The caveat here is that computing the condition has to be fast for these methods to be fast.
On the memory side, the single-pass solutions for both Cython and Numba require a temporary array of the size of the input.
Hence, these are not very memory-efficient compared to two-passes or the generator-based one.
Yet their asymptotic temporary-memory footprint is similar to that of masking, although the constant term is typically larger.
The boolean masking solution requires a temporary array of the size of the input but of type bool, which in NumPy is 1 byte per element, so this is ~8 times smaller than a default 64-bit numeric NumPy array on a typical 64-bit system.
The integer indexing solution has the same requirement as the boolean mask slicing in the first step (inside np.nonzero() call), which gets converted to a series of ints (typically int64 on a 64-bit system) in the second step (the output of np.nonzero()).
This second step, therefore, has variable memory requirements, depending on the number of filtered elements.
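A small illustration of that variable second-step cost (the sizes here are just for demonstration):
import numpy as np

arr = np.arange(1000000)
mask = arr < 500000        # boolean mask: 1 byte per element
idx = np.nonzero(mask)[0]  # int64 indices: 8 bytes per selected element
print(mask.nbytes, idx.nbytes)  # 1000000 vs 4000000 (half the elements pass)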
Remarks
both boolean masking and integer indexing require some form of conditioning that is capable of producing a boolean mask (or, alternatively, a list of indices); in the above implementation, the condition is broadcastable
the generator and the Numba-accelerated methods are also the most flexible when it comes to specifying a different filtering condition
the Numba-accelerated methods require the condition to be Numba-compatible to access the Numba acceleration in NoPython mode
the Cython solution requires specifying the data types for it to be fast, or extra effort for multiple types dispatching, and extra effort (not explored here) to get the same level of flexibility as the other methods
for both Numba and Cython, the filtering condition can be hard-coded, leading to marginal but appreciable speed differences
the single-pass solutions require additional code to handle the unused (but otherwise initially allotted) memory.
the NumPy methods do NOT return a view of the input, but a copy, as a result of advanced indexing:
arr = np.arange(100)
k = 50
print('`arr[arr > k]` is a copy: ', arr[arr > k].base is None)
# `arr[arr > k]` is a copy: True
print('`arr[np.where(arr > k)]` is a copy: ', arr[np.where(arr > k)].base is None)
# `arr[np.where(arr > k)]` is a copy: True
print('`arr[:k]` is a copy: ', arr[:k].base is None)
# `arr[:k]` is a copy: False
(EDITED: various improvements based on @ShadowRanger, @PaulPanzer, @max9111 and @DavidW comments.)

Algorithm for tensordot implemented in numba is much slower than numpy's

I am trying to expand the numpy "tensordot" such that things like:
K_ijklm = A_ki * B_jml can be written in a clear way like this: K = mytensordot(A,B,[2,0],[1,4,3])
To my understanding, numpy's tensordot (with optional argument 0) would be able to do something like this: K_kijml = A_ki * B_jml, i.e. keeping the order of the indexes. Therefore I would then have to do a number of np.swapaxes() to obtain the matrix K_ijklm, which in a complicated case can be an easy source of errors (potentially very hard to debug).
The problem is that my implementation is slow (10x slower than tensordot [EDIT: It is actually MUCH slower than that]), even when using numba. I was wondering if anyone would have some insight on what could be done to improve the performance of my algorithm.
MWE
import numpy as np
import numba as nb
import itertools
import timeit

@nb.jit()
def myproduct(dimN):
    N=np.prod(dimN)
    L=len(dimN)
    Product=np.zeros((N,L),dtype=np.int32)
    rn=0
    for n in range(1,N):
        for l in range(L):
            if l==0:
                rn=1
            v=Product[n-1,L-1-l]+rn
            rn = 0
            if v == dimN[L-1-l]:
                v = 0
                rn = 1
            Product[n,L-1-l]=v
    return Product
@nb.jit()
def mytensordot(A,B,iA,iB):
    iA,iB = np.array(iA,dtype=np.int32),np.array(iB,dtype=np.int32)
    dimA,dimB = A.shape,B.shape
    NdimA,NdimB=len(dimA),len(dimB)
    if len(iA) != NdimA: raise ValueError("iA must be same size as dim A")
    if len(iB) != NdimB: raise ValueError("iB must be same size as dim B")
    NdimN = NdimA + NdimB
    dimN=np.zeros(NdimN,dtype=np.int32)
    dimN[iA]=dimA
    dimN[iB]=dimB
    Out=np.zeros(dimN)
    indexes = myproduct(dimN)
    for nidxs in indexes:
        idxA = tuple(nidxs[iA])
        idxB = tuple(nidxs[iB])
        v=A[(idxA)]*B[(idxB)]
        Out[tuple(nidxs)]=v
    return Out
A=np.random.random((4,5,3))
B=np.random.random((6,4))
def runmytdot():
    return mytensordot(A,B,[0,2,3],[1,4])
def runtensdot():
    return np.tensordot(A,B,0).swapaxes(1,3).swapaxes(2,3)
print(np.all(runmytdot()==runtensdot()))
print(timeit.timeit(runmytdot,number=100))
print(timeit.timeit(runtensdot,number=100))
Result:
True
1.4962144780438393
0.003484356915578246
You have run into a known issue. numpy.zeros requires a tuple when creating a multidimensional array. If you pass something other than a tuple, it sometimes works, but that's only because numpy is smart about converting the object into a tuple first.
The trouble is that numba does not currently support conversion of arbitrary iterables into tuples. So this line fails when you try to compile it in nopython=True mode. (A couple of others fail too, but this is the first.)
Out=np.zeros(dimN)
In theory you could call np.prod(dimN), create a flat array of zeros, and reshape it, but then you run into the very same problem: the reshape method of numpy arrays requires a tuple!
This is quite a vexing problem with numba -- I had not encountered it before. I really doubt the solution I have found is the correct one, but it is a working solution that allows us to compile a version in nopython=True mode.
The core idea is to avoid using tuples for indexing by directly implementing an indexer that follows the strides of the array:
@nb.jit(nopython=True)
def index_arr(a, ix_arr):
    strides = np.array(a.strides) / a.itemsize
    ix = int((ix_arr * strides).sum())
    return a.ravel()[ix]

@nb.jit(nopython=True)
def index_set_arr(a, ix_arr, val):
    strides = np.array(a.strides) / a.itemsize
    ix = int((ix_arr * strides).sum())
    a.ravel()[ix] = val
This allows us to get and set values without needing a tuple.
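For instance, a quick sanity check (the values are only illustrative) that the indexer matches ordinary tuple indexing on a contiguous array:
A = np.arange(12.0).reshape(3, 4)
ix = np.array([1, 2])
print(index_arr(A, ix), A[1, 2])  # both 6.0
index_set_arr(A, ix, -1.0)
print(A[1, 2])                    # -1.0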
We can also avoid using reshape by passing the output buffer into the jitted function, and wrapping that function in a helper:
@nb.jit()  # We can't use nopython mode here...
def mytensordot(A, B, iA, iB):
    iA, iB = np.array(iA, dtype=np.int32), np.array(iB, dtype=np.int32)
    dimA, dimB = A.shape, B.shape
    NdimA, NdimB = len(dimA), len(dimB)
    if len(iA) != NdimA:
        raise ValueError("iA must be same size as dim A")
    if len(iB) != NdimB:
        raise ValueError("iB must be same size as dim B")
    NdimN = NdimA + NdimB
    dimN = np.zeros(NdimN, dtype=np.int32)
    dimN[iA] = dimA
    dimN[iB] = dimB
    Out = np.zeros(dimN)
    return mytensordot_jit(A, B, iA, iB, dimN, Out)
Since the helper contains no loops, it adds some overhead, but the overhead is pretty trivial. Here's the final jitted function:
@nb.jit(nopython=True)
def mytensordot_jit(A, B, iA, iB, dimN, Out):
    for i in range(np.prod(dimN)):
        nidxs = int_to_idx(i, dimN)
        a = index_arr(A, nidxs[iA])
        b = index_arr(B, nidxs[iB])
        index_set_arr(Out, nidxs, a * b)
    return Out
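The int_to_idx helper is not shown above; a plausible sketch (an assumption on my part, not the answer's actual code) would unravel a flat index into per-axis indices in C order:
@nb.jit(nopython=True)
def int_to_idx(i, dimN):
    # Hypothetical helper: convert flat index i into a multi-index for shape dimN,
    # assuming C order (last axis varies fastest).
    n = dimN.shape[0]
    out = np.empty(n, dtype=np.int32)
    for k in range(n - 1, -1, -1):
        out[k] = i % dimN[k]
        i //= dimN[k]
    return out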
Unfortunately, this does not wind up generating as much of a speedup as we might like. On smaller arrays it's about 5x slower than tensordot; on larger arrays it's still 50x slower. (But at least it's not 1000x slower!) This is not too surprising in retrospect, since dot and tensordot are both using BLAS under the hood, as @hpaulj reminds us.
After finishing this code, I saw that einsum has solved your real problem -- nice!
But the underlying issue that your original question points to -- that indexing with arbitrary-length tuples is not possible in jitted code -- is still a frustration. So hopefully this will be useful to someone else!
tensordot with scalar axes values can be obscure. I explored it in
How does numpy.tensordot function works step-by-step?
There I deduced that np.tensordot(A, B, axes=0) is equivalent to using axes=[[], []].
In [757]: A=np.random.random((4,5,3))
...: B=np.random.random((6,4))
In [758]: np.tensordot(A,B,0).shape
Out[758]: (4, 5, 3, 6, 4)
In [759]: np.tensordot(A,B,[[],[]]).shape
Out[759]: (4, 5, 3, 6, 4)
That in turn is equivalent to calling dot with a new size-1 sum-of-products dimension:
In [762]: np.dot(A[...,None],B[...,None,:]).shape
Out[762]: (4, 5, 3, 6, 4)
(4,5,3,1) * (6,1,4) # the 1 is the last of A and 2nd to the last of B
dot is fast, using BLAS (or equivalent) code. Swapping axes and reshaping is also relatively fast.
einsum gives us a lot of control over axes
replicating the above products:
In [768]: np.einsum('jml,ki->jmlki',A,B).shape
Out[768]: (4, 5, 3, 6, 4)
and with swapping:
In [769]: np.einsum('jml,ki->ijklm',A,B).shape
Out[769]: (4, 4, 6, 3, 5)
A minor point - the double swap .swapaxes(1,3).swapaxes(2,3) can be written as a single transpose, .transpose(0,3,1,2,4).

Most efficient way to do (lo <= k && k <= hi) ? 1 : 0 for k a numpy array, lo, hi constants

I have a large numpy array k, of unspecified shape, and I want to construct an identically shaped array d which is 1.0 when the corresponding entry in k is between two constants lo and hi, 0.0 otherwise. (Because of what the larger code is doing, I do not want a Boolean-valued array.)
The obvious way to do this is
d = np.ones_like(k)
d[np.less(k, lo)] = 0
d[np.greater(k, hi)] = 0
However, the np.less and np.greater calls involve the creation of large scratch Boolean arrays, and I have measured this to be a significant overhead. Is there a way to perform this operation that does not involve creating any large scratch objects, while remaining fully vectorized?
As others said, numpy is heavy on temporary buffers and it does not offer much control over them. If memory footprint is really a blocker, you can drop in your own little routine. For instance,
import numpy as np

def process(x, lo, hi):
    """ lo <= x < hi ? 1.0 : 0.0."""
    x_shape = x.shape
    xx = np.ascontiguousarray(x).ravel()
    out = np.empty_like(xx)
    _process(xx, lo, hi, out)
    return out.reshape(x_shape)
where _process is in cython:
%%cython --annotate
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
def _process(double[::1] x, double lo, double hi, double[::1] out):
    """ lo <= x < hi ? 1.0 : 0.0."""
    cdef:
        Py_ssize_t j
        double xj
    for j in range(x.shape[0]):
        xj = x[j]
        if lo <= xj < hi:
            out[j] = 1.0
        else:
            out[j] = 0.0
Here I used a Jupyter notebook (hence the funny %%cython syntax). In a real project you need to throw in a setup.py to compile the extension, etc. Whether the benefit from doing this is worth the hassle is up to you.
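For reference, a minimal setup.py sketch for compiling such a module outside the notebook (the file name process_impl.pyx is an assumption):
# Hypothetical setup.py: assumes the Cython code above is saved as process_impl.pyx.
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="process_impl",
    ext_modules=cythonize("process_impl.pyx", compiler_directives={"language_level": 3}),
)
Build in place with python setup.py build_ext --inplace and import _process from the resulting module.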
You can create a boolean array based upon the comparison and then convert it to float type, all in one go, like so -
d = ((k >=lo) & (k <= hi)).astype(float)
less and greater take an out parameter:
out=np.ones_like(k)
np.less(k,80,out=out);
out &= np.greater(k,20);
# np.logical_and(np.greater(k,20),out,out=out);
That might end up saving one intermediate array, although my impression is that a ufunc with out still creates a temporary array and then copies it into out.
On a small (10x10) array, this is faster than @zwol's method, but slower than @Divakar's. But the differences are not major.

Poor(er) performance of Cython with NumPy array memoryview compared to C arrays

I encountered a pretty weird result from a benchmark
Those are all different flavors of a bubblesort implementation, and the fastest approach at n=10^4 converts the Python list to a C array internally. In contrast, the yellow line corresponds to code where I am using NumPy arrays with a memoryview. I expected the results to be the other way around. I (and colleagues) repeated the benchmark a couple of times and always got the same results. Maybe someone has an idea of what is going on here ...
The black line in the plot would correspond to the code:
%%cython
cimport cython
from libc.stdlib cimport malloc, free
def cython_bubblesort_clist(a_list):
    """
    The Cython implementation of bubble sort with internal
    conversion between Python list objects and C arrays.
    """
    cdef int *c_list
    c_list = <int *>malloc(len(a_list)*cython.sizeof(int))
    cdef int count, i, j # static type declarations
    count = len(a_list)
    # convert Python list to C array
    for i in range(count):
        c_list[i] = a_list[i]
    for i in range(count):
        for j in range(1, count):
            if c_list[j] < c_list[j-1]:
                c_list[j-1], c_list[j] = c_list[j], c_list[j-1]
    # convert C array back to Python list
    for i in range(count):
        a_list[i] = c_list[i]
    free(c_list)
    return a_list
and the pink line to this code:
%%cython
import numpy as np
cimport numpy as np
cimport cython
def cython_bubblesort_numpy(long[:] np_ary):
    """
    The Cython implementation of bubble sort with NumPy memoryview.
    """
    cdef int count, i, j # static type declarations
    count = np_ary.shape[0]
    for i in range(count):
        for j in range(1, count):
            if np_ary[j] < np_ary[j-1]:
                np_ary[j-1], np_ary[j] = np_ary[j], np_ary[j-1]
    return np.asarray(np_ary)
As suggested in the comments above, I added the decorators
%%cython
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef cython_bubblesort_numpy(long[:] np_ary):
    """
    The Cython implementation of bubble sort with NumPy memoryview.
    """
    cdef int count, i, j # static type declarations
    count = np_ary.shape[0]
    for i in range(count):
        for j in range(1, count):
            if np_ary[j] < np_ary[j-1]:
                np_ary[j-1], np_ary[j] = np_ary[j], np_ary[j-1]
    return np.asarray(np_ary)
and the results are more what I expected now :)
It is worth making one trivial change to your code to see if it improves things further:
cpdef cython_bubblesort_numpy(long[::1] np_ary):
    # ...
This tells cython that np_ary is a C contiguous array, and the generated code in the nested for loops can be further optimized with this information.
This code won't accept non-contiguous arrays as arguments, but that is fairly trivial to handle by using numpy.ascontiguousarray().
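A minimal sketch of such a wrapper (the name bubblesort_any is hypothetical, and it assumes long maps to int64 on your platform):
import numpy as np

def bubblesort_any(ary):
    # Hypothetical convenience wrapper: ensure the input is C-contiguous and of
    # the expected dtype before calling the typed Cython function above.
    ary = np.ascontiguousarray(ary, dtype=np.int64)
    return cython_bubblesort_numpy(ary)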

Speeding up python code with cython

I have a function which just basically makes lots of calls to a simple defined hash function and tests to see when it finds a duplicate. I need to do lots of simulations with it so would like it to be as fast as possible. I am attempting to use cython to do this. The cython code is currently called with a normal python list of integers with values in the range 0 to m^2.
import math, random

cdef int a,b,c,d,m,pos,value, cyclelimit, nohashcalls

def h3(int a,int b,int c,int d, int m,int x):
    return (a*x**2 + b*x+c) %m

def floyd(inputx):
    dupefound, nohashcalls = (0,0)
    m = len(inputx)
    loops = int(m*math.log(m))
    for loopno in xrange(loops):
        if (dupefound == 1):
            break
        a = random.randrange(m)
        b = random.randrange(m)
        c = random.randrange(m)
        d = random.randrange(m)
        pos = random.randrange(m)
        value = inputx[pos]
        listofpos = [0] * m
        listofpos[pos] = 1
        setofvalues = set([value])
        cyclelimit = int(math.sqrt(m))
        for j in xrange(cyclelimit):
            pos = h3(a,b, c,d, m, inputx[pos])
            nohashcalls += 1
            if (inputx[pos] in setofvalues):
                if (listofpos[pos]==1):
                    dupefound = 0
                else:
                    dupefound = 1
                    print "Duplicate found at position", pos, " and value", inputx[pos]
                break
            listofpos[pos] = 1
            setofvalues.add(inputx[pos])
    return dupefound, nohashcalls
How can I convert inputx and listofpos to use C type arrays and to access the arrays at C speed? Are there any other speed ups I can use? Can setofvalues be sped up?
So that there is something to compare against, 50 calls to floyd() with m = 5000 currently takes around 30 seconds on my computer.
Update: Example code snippet to show how floyd is called.
m = 5000
inputx = random.sample(xrange(m**2), m)
(dupefound, nohashcalls) = edcython.floyd(inputx)
First of all, it seems that you must type the variables inside the function. A good example of it is here.
Second, cython -a, for "annotate", gives you a really excellent break down of the code generated by the cython compiler and a color-coded indication of how dirty (read: python api heavy) it is. This output is really essential when trying to optimize anything.
Third, the now famous page on working with Numpy explains how to get fast, C-style access to the Numpy array data. Unfortunately it's verbose and annoying. We're in luck, however, because more recent Cython provides Typed Memory Views, which are both easy to use and awesome. Read that entire page before you try to do anything else.
After ten minutes or so I came up with this:
# cython: infer_types=True

# Use the C math library to avoid Python overhead.
from libc cimport math
# For boundscheck below.
import cython
# We're lazy so we'll let Numpy handle our array memory management.
import numpy as np
# You would normally also import the Numpy pxd to get faster access to the Numpy
# API, but it requires some fancier compilation options so I'll leave it out for
# this demo.
# cimport numpy as np
import random

# This is a small function that doesn't need to be exposed to Python at all. Use
# `cdef` instead of `def` and inline it.
cdef inline int h3(int a,int b,int c,int d, int m,int x):
    return (a*x**2 + b*x+c) % m

# If we want to live fast and dangerously, we tell cython not to check our array
# indices for IndexErrors. This means we CAN overrun our array and crash the
# program or screw up our stack. Use with caution. Profiling suggests that we
# aren't gaining anything in this case so I leave it on for safety.
# @cython.boundscheck(False)
# `cpdef` so that calling this function from another Cython (or C) function can
# skip the Python function call overhead, while still allowing us to use it from
# Python.
cpdef floyd(int[:] inputx):
    # Type the variables in the scope of the function.
    cdef int a,b,c,d, value, cyclelimit
    cdef unsigned int dupefound = 0
    cdef unsigned int nohashcalls = 0
    cdef unsigned int loopno, pos, j

    # `m` has type int because inputx is already a Cython memory view and
    # `infer-types` is on.
    m = inputx.shape[0]

    cdef unsigned int loops = int(m*math.log(m))

    # Again using the memory view, but letting Numpy allocate an array of zeros.
    cdef int[:] listofpos = np.zeros(m, dtype=np.int32)

    # Keep this random sampling out of the loop
    cdef int[:, :] randoms = np.random.randint(0, m, (loops, 5)).astype(np.int32)

    for loopno in range(loops):
        if (dupefound == 1):
            break

        # From our precomputed array
        a = randoms[loopno, 0]
        b = randoms[loopno, 1]
        c = randoms[loopno, 2]
        d = randoms[loopno, 3]
        pos = randoms[loopno, 4]

        value = inputx[pos]

        # Unfortunately, Memory View does not support "vectorized" operations
        # like standard Numpy arrays. Otherwise we'd use listofpos *= 0 here.
        for j in range(m):
            listofpos[j] = 0

        listofpos[pos] = 1
        setofvalues = set((value,))
        cyclelimit = int(math.sqrt(m))
        for j in range(cyclelimit):
            pos = h3(a, b, c, d, m, inputx[pos])
            nohashcalls += 1
            if (inputx[pos] in setofvalues):
                if (listofpos[pos]==1):
                    dupefound = 0
                else:
                    dupefound = 1
                    print "Duplicate found at position", pos, " and value", inputx[pos]
                break
            listofpos[pos] = 1
            setofvalues.add(inputx[pos])
    return dupefound, nohashcalls
There are no tricks here that aren't explained on docs.cython.org, which is where I learned them myself, but helps to see it all come together.
The most important changes to your original code are in the comments, but they all amount to giving Cython hints about how to generate code that doesn't use the Python API.
As an aside: I really don't know why infer_types is not on by default. It lets the compiler implicitly use C types instead of Python types where possible, meaning less work for you.
If you run cython -a on this, you'll see that the only lines that call into Python are your calls to random.sample, and building or adding to a Python set().
On my machine, your original code runs in 2.1 seconds. My version runs in 0.6 seconds.
The next step is to get random.sample out of that loop, but I'll leave that to you.
I have edited my answer to demonstrate how to precompute the rand samples. This brings the time down to 0.4 seconds.
Do you need to use this particular hashing algorithm? Why not use the built-in hashing algorithm for dicts? For example:
from collections import Counter
cnt = Counter(inputx)
dupes = [k for k, v in cnt.iteritems() if v > 1]
