I am just beginning to understand how Cython works.
This snippet shows one of the slow parts of my program, and I am wondering whether this for loop can be improved.
It still looks pretty much like the original NumPy version, but I added the cdefs and the int conversions.
cdef Py_ssize_t i, j
cdef double ii, jj

for ii in np.arange(startx, endx+1, 0.1):
    for jj in np.arange(starty, endy+1, 0.1):
        if my_condition(ii, jj):
            i = <int>ii
            j = <int>jj
            data[i, j] += 1
Do you have any recommendations?
Study the Cython example in
https://docs.scipy.org/doc/numpy/reference/arrays.nditer.html
which uses nditer to hand out the array elements, and
https://cython.readthedocs.io/en/stable/src/userguide/memoryviews.html
which demonstrates the use of memoryviews (and C arrays) to iterate rapidly over the values.
Either way, your goal is to let Cython access the data buffer directly rather than going through the NumPy functions.
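For example, a minimal sketch of the memoryview route (my assumptions, not from the question: data was created with dtype=np.intc, my_condition is the question's function, and the coordinate grids are built once with np.arange and passed in as 1-D views):

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def count_hits(int[:, ::1] counts, double[:] xs, double[:] ys):
    cdef Py_ssize_t i, j
    for i in range(xs.shape[0]):
        for j in range(ys.shape[0]):
            if my_condition(xs[i], ys[j]):
                # writes go straight into the buffer, no Python-level indexing
                counts[<Py_ssize_t>xs[i], <Py_ssize_t>ys[j]] += 1

called as count_hits(data, np.arange(startx, endx+1, 0.1), np.arange(starty, endy+1, 0.1)).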
I know this is an old question, but I had the same problem and found this simple solution:
https://github.com/cython/cython/issues/3310#issuecomment-707252866
It converts a loop like this:
for i in range(start, stop, step):
    ...

to this:

i = start - step
for _ in range(((stop - start) + (step - 1)) // step):
    i += step
    ...
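Applied to the 0.1-step loops in the question above (a sketch only; startx, endx, starty, endy, my_condition and data are as defined there), both counters stay plain C integers and the float coordinates are rebuilt inside the loop, so no np.arange array is created at all:

cdef Py_ssize_t ix, iy
cdef double ii, jj, step = 0.1

# loop counts approximate len(np.arange(...)); check the endpoint handling for your data
for ix in range(<Py_ssize_t>((endx + 1 - startx) / step)):
    ii = startx + ix * step
    for iy in range(<Py_ssize_t>((endy + 1 - starty) / step)):
        jj = starty + iy * step
        if my_condition(ii, jj):
            data[<int>ii, <int>jj] += 1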
The code below runs very slowly. I tried using numpy.argwhere instead of an if statement to speed it up, and I got a fairly efficient result, but it is still very slow. I also tried numpy.frompyfunc and numpy.vectorize, but I failed. What would you suggest to speed up the code below?
import numpy as np
import time

time1 = time.time()
n = 1000000
k = 10000
velos = np.linspace(-1000, 1000, n)
line_centers = np.linspace(-1000, 1000, k)
weights = np.random.random_sample(k)
rvs = np.arange(-60, 60, 2)
m = len(rvs)
w = np.arange(10)
M = np.zeros((n, m))
for l, lc in enumerate(line_centers):
    vi = velos - lc
    for j in range(m - 1):
        w = np.argwhere((vi < rvs[j + 1]) & (vi > rvs[j])).T[0]
        M[w, j] = weights[l] * (rvs[j + 1] - vi[w]) / (rvs[j + 1] - rvs[j])
        M[w, j + 1] = weights[l] * (vi[w] - rvs[j]) / (rvs[j + 1] - rvs[j])
time2 = time.time()
print(time2 - time1)
EDIT:
The size of the array M was incorrect. I fixed it.
This seems like a situation where a C++ interface could come in handy. With pybind11 you can create C++ functions which take NumPy arrays as arguments, manipulate them, and return them to Python. That would speed up your loops. Take a look at it!
Of course it is slow; you have two nested loops! You need to rethink your algorithm using vector operations: no iteration over indices, but expressions in terms of index or boolean arrays and index shifts.
You have not given any background information, so it is hard for anyone to suggest something meaningful (given the soup of indices in the example). A few quick suggestions based on quickly glancing over your example:
An expression like (rvs[j + 1] - rvs[j]) is easily replaced with numpy.ediff1d.
You seem to be iterating through n in blocks of m, so maybe numpy.nditer will be of use.
I have a hunch that your inner loop has an error: are you sure you really mean to iterate over range(m - 1)? That would mean you are iterating from 0 to m - 2 (inclusive), and I doubt you meant that.
We can help with more concrete answers if you provide more background information.
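As a rough sketch of the kind of vectorization meant above (using the variable names from your snippet; np.digitize is my substitution for the argwhere test, and its edge handling is inclusive on the left where yours is strict), the inner loop over j can be removed entirely:

widths = np.ediff1d(rvs)                    # rvs[j + 1] - rvs[j] for every j, computed once
for l, lc in enumerate(line_centers):
    vi = velos - lc
    idx = np.digitize(vi, rvs) - 1          # bin j such that rvs[j] <= vi < rvs[j + 1]
    ok = (idx >= 0) & (idx < m - 1)         # drop velocities outside the rvs grid
    rows = np.nonzero(ok)[0]
    frac = (vi[ok] - rvs[idx[ok]]) / widths[idx[ok]]
    M[rows, idx[ok]] = weights[l] * (1.0 - frac)
    M[rows, idx[ok] + 1] = weights[l] * frac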
I have a large numpy array k, of unspecified shape, and I want to construct an identically shaped array d which is 1.0 when the corresponding entry in k is between two constants lo and hi, 0.0 otherwise. (Because of what the larger code is doing, I do not want a Boolean-valued array.)
The obvious way to do this is
d = np.ones_like(k)
d[np.less(k, lo)] = 0
d[np.greater(k, hi)] = 0
However, the np.less and np.greater calls involve the creation of large scratch Boolean arrays, and I have measured this to be a significant overhead. Is there a way to perform this operation that does not involve creating any large scratch objects, while remaining fully vectorized?
As others said, NumPy is heavy on temporary buffers, and it does not offer much control over them. If the memory footprint is really a blocker, you can drop in your own little routine. For instance,
def process(x, lo, hi):
    """ lo <= x < hi ? 1.0 : 0.0."""
    x_shape = x.shape
    xx = np.ascontiguousarray(x).ravel()
    out = np.empty_like(xx)
    _process(xx, lo, hi, out)
    return out.reshape(x_shape)
where _process is written in Cython:
%%cython --annotate
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
def _process(double[::1] x, double lo, double hi, double[::1] out):
    """ lo <= x < hi ? 1.0 : 0.0."""
    cdef:
        Py_ssize_t j
        double xj
    for j in range(x.shape[0]):
        xj = x[j]
        if lo <= xj < hi:
            out[j] = 1.0
        else:
            out[j] = 0.0
Here I used a Jupyter notebook (hence the funny %%cython syntax). In a real project you need to throw in a setup.py to compile the extension, etc. Whether the benefit of doing this is worth the hassle is up to you.
You can create a boolean array from the comparison and then convert it to float type, all in one go, like so:
d = ((k >= lo) & (k <= hi)).astype(float)
less and greater take an out parameter:
out = np.ones_like(k)
np.less(k, 80, out=out)
out &= np.greater(k, 20)   # works when k (and hence out) is an integer array
# for a float out, use: np.logical_and(np.greater(k, 20), out, out=out)
That might end up saving one intermediate array, although my impression with the ufunc out parameter is that it still creates a temporary array and then just copies it into out.
On a small (10x10) array, this is faster than @zwol's method, but slower than @Divakar's. But the differences are not major.
I have two numpy boolean arrays (a and b). I need to find how many of their elements are equal. Currently, I do len(a) - (a ^ b).sum(), but the xor operation creates an entirely new numpy array, as I understand. How do I efficiently implement this desired behavior without creating the unnecessary temporary array?
I've tried using numexpr, but I can't quite get it to work right. It doesn't support the notion that True is 1 and False is 0, so I have to use ne.evaluate("sum(where(a==b, 1, 0))"), which takes about twice as long.
Edit: I forgot to mention that one of these arrays is actually a view into another array of a different size, and both arrays should be considered immutable. Both arrays are 2-dimensional and tend to be somewhere around 25x40 in size.
Yes, this is the bottleneck of my program and is worth optimizing.
On my machine this is faster:
(a == b).sum()
If you don't want to use any extra storage, then I would suggest using numba.
I'm not too familiar with it, but this seems to work well.
I ran into some trouble getting Cython to take a boolean NumPy array.
from numba import autojit
from numpy.random import rand

def pysumeq(a, b):
    tot = 0
    for i in xrange(a.shape[0]):
        for j in xrange(a.shape[1]):
            if a[i,j] == b[i,j]:
                tot += 1
    return tot

# make numba version
nbsumeq = autojit(pysumeq)

A = (rand(10,10)<.5)
B = (rand(10,10)<.5)

# do a simple dry run to get it to compile
# for this specific use case
nbsumeq(A, B)
If you don't have numba, I would suggest using the answer by @user2357112.
Edit: Just got a Cython version working, here's the .pyx file. I'd go with this.
from numpy cimport ndarray as ar
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cysumeq(ar[np.uint8_t,ndim=2,cast=True] a, ar[np.uint8_t,ndim=2,cast=True] b):
    cdef int i, j, h=a.shape[0], w=a.shape[1], tot=0
    for i in xrange(h):
        for j in xrange(w):
            if a[i,j] == b[i,j]:
                tot += 1
    return tot
To start with, you can skip the A*B step:
>>> a
array([ True, False, True, False, True], dtype=bool)
>>> b
array([False, True, True, False, True], dtype=bool)
>>> np.sum(~(a^b))
3
If you do not mind destroying array a or b, I am not sure you will get faster than this:
>>> a ^= b  # in-place xor operator
>>> np.sum(~a)
3
If the problem is allocation and deallocation, maintain a single output array and tell numpy to put the results there every time:
out = np.empty_like(a) # Allocate this outside a loop and use it every iteration
num_eq = np.equal(a, b, out).sum()
This'll only work if the inputs are always the same dimensions, though. You may be able to make one big array and slice out a part that's the size you need for each call if the inputs have varying sizes, but I'm not sure how much that slows you down.
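For the varying-size case, a possible sketch (the 25x40 upper bound is just taken from your edit; size the buffer to your largest expected input): allocate one big array up front and reuse a view of the right shape on every call.

scratch = np.empty((25, 40), dtype=bool)       # allocated once, outside any loop

def count_equal(a, b):
    out = scratch[:a.shape[0], :a.shape[1]]    # a view into scratch, no new allocation
    np.equal(a, b, out)                        # write the comparison into the view
    return out.sum()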
Improving upon IanH's answer, it's also possible to get access to the underlying C array in a numpy array from within Cython, by supplying mode="c" to ndarray.
from numpy cimport ndarray as ar
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef int cy_sum_eq(ar[np.uint8_t,ndim=2,cast=True,mode="c"] a, ar[np.uint8_t,ndim=2,cast=True,mode="c"] b):
    cdef int i, j, h=a.shape[0], w=a.shape[1], tot=0
    cdef np.uint8_t* adata = &a[0, 0]
    cdef np.uint8_t* bdata = &b[0, 0]
    for i in xrange(h):
        for j in xrange(w):
            if adata[j] == bdata[j]:
                tot += 1
        adata += w
        bdata += w
    return tot
This is about 40% faster on my machine than IanH's Cython version, and I've found that rearranging the loop contents doesn't seem to make much of a difference at this point, probably due to compiler optimizations. At this point, one could potentially link to a C function optimized with SSE and the like to perform this operation, passing adata and bdata as uint8_t*.
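A rough sketch of that last idea (count_eq_u8 and count_eq.h are hypothetical names for a hand-written C routine and its header, not an existing library):

from numpy cimport ndarray as ar
cimport numpy as np

cdef extern from "count_eq.h":
    # hypothetical C routine, e.g. implemented with SSE intrinsics
    int count_eq_u8(np.uint8_t* a, np.uint8_t* b, int n) nogil

def cy_sum_eq_ext(ar[np.uint8_t,ndim=2,cast=True,mode="c"] a,
                  ar[np.uint8_t,ndim=2,cast=True,mode="c"] b):
    # hand the raw pointers straight to C, the same way adata/bdata are obtained above
    return count_eq_u8(&a[0, 0], &b[0, 0], <int>(a.shape[0] * a.shape[1]))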
I have a function which basically just makes lots of calls to a simple hash function and tests to see when it finds a duplicate. I need to do lots of simulations with it, so I would like it to be as fast as possible. I am attempting to use Cython to do this. The Cython code is currently called with a normal Python list of integers with values in the range 0 to m^2.
import math, random

cdef int a,b,c,d,m,pos,value, cyclelimit, nohashcalls

def h3(int a,int b,int c,int d, int m,int x):
    return (a*x**2 + b*x+c) % m

def floyd(inputx):
    dupefound, nohashcalls = (0,0)
    m = len(inputx)
    loops = int(m*math.log(m))
    for loopno in xrange(loops):
        if (dupefound == 1):
            break
        a = random.randrange(m)
        b = random.randrange(m)
        c = random.randrange(m)
        d = random.randrange(m)
        pos = random.randrange(m)
        value = inputx[pos]
        listofpos = [0] * m
        listofpos[pos] = 1
        setofvalues = set([value])
        cyclelimit = int(math.sqrt(m))
        for j in xrange(cyclelimit):
            pos = h3(a,b, c,d, m, inputx[pos])
            nohashcalls += 1
            if (inputx[pos] in setofvalues):
                if (listofpos[pos]==1):
                    dupefound = 0
                else:
                    dupefound = 1
                    print "Duplicate found at position", pos, " and value", inputx[pos]
                break
            listofpos[pos] = 1
            setofvalues.add(inputx[pos])
    return dupefound, nohashcalls
How can I convert inputx and listofpos to use C type arrays and to access the arrays at C speed? Are there any other speed ups I can use? Can setofvalues be sped up?
So that there is something to compare against, 50 calls to floyd() with m = 5000 currently takes around 30 seconds on my computer.
Update: Example code snippet to show how floyd is called.
m = 5000
inputx = random.sample(xrange(m**2), m)
(dupefound, nohashcalls) = edcython.floyd(inputx)
First of all, it seems that you must type the variables inside the function. A good example of it is here.
Second, cython -a, for "annotate", gives you a really excellent breakdown of the code generated by the Cython compiler and a color-coded indication of how dirty (read: Python API heavy) it is. This output is really essential when trying to optimize anything.
Third, the now famous page on working with NumPy explains how to get fast, C-style access to NumPy array data. Unfortunately it's verbose and annoying. We're in luck, however, because more recent Cython provides Typed Memory Views, which are both easy to use and awesome. Read that entire page before you try to do anything else.
After ten minutes or so I came up with this:
# cython: infer_types=True

# Use the C math library to avoid Python overhead.
from libc cimport math
# For boundscheck below.
import cython
# We're lazy so we'll let Numpy handle our array memory management.
import numpy as np
# You would normally also import the Numpy pxd to get faster access to the Numpy
# API, but it requires some fancier compilation options so I'll leave it out for
# this demo.
# cimport numpy as np
import random

# This is a small function that doesn't need to be exposed to Python at all. Use
# `cdef` instead of `def` and inline it.
cdef inline int h3(int a,int b,int c,int d, int m,int x):
    return (a*x**2 + b*x+c) % m

# If we want to live fast and dangerously, we tell cython not to check our array
# indices for IndexErrors. This means we CAN overrun our array and crash the
# program or screw up our stack. Use with caution. Profiling suggests that we
# aren't gaining anything in this case so I leave it on for safety.
# @cython.boundscheck(False)

# `cpdef` so that calling this function from another Cython (or C) function can
# skip the Python function call overhead, while still allowing us to use it from
# Python.
cpdef floyd(int[:] inputx):
    # Type the variables in the scope of the function.
    cdef int a,b,c,d, value, cyclelimit
    cdef unsigned int dupefound = 0
    cdef unsigned int nohashcalls = 0
    cdef unsigned int loopno, pos, j

    # `m` has type int because inputx is already a Cython memory view and
    # `infer-types` is on.
    m = inputx.shape[0]

    cdef unsigned int loops = int(m*math.log(m))

    # Again using the memory view, but letting Numpy allocate an array of zeros.
    cdef int[:] listofpos = np.zeros(m, dtype=np.int32)

    # Keep this random sampling out of the loop
    cdef int[:, :] randoms = np.random.randint(0, m, (loops, 5)).astype(np.int32)

    for loopno in range(loops):
        if (dupefound == 1):
            break

        # From our precomputed array
        a = randoms[loopno, 0]
        b = randoms[loopno, 1]
        c = randoms[loopno, 2]
        d = randoms[loopno, 3]
        pos = randoms[loopno, 4]

        value = inputx[pos]

        # Unfortunately, Memory View does not support "vectorized" operations
        # like standard Numpy arrays. Otherwise we'd use listofpos *= 0 here.
        for j in range(m):
            listofpos[j] = 0

        listofpos[pos] = 1
        setofvalues = set((value,))
        cyclelimit = int(math.sqrt(m))

        for j in range(cyclelimit):
            pos = h3(a, b, c, d, m, inputx[pos])
            nohashcalls += 1
            if (inputx[pos] in setofvalues):
                if (listofpos[pos]==1):
                    dupefound = 0
                else:
                    dupefound = 1
                    print "Duplicate found at position", pos, " and value", inputx[pos]
                break
            listofpos[pos] = 1
            setofvalues.add(inputx[pos])

    return dupefound, nohashcalls
There are no tricks here that aren't explained on docs.cython.org, which is where I learned them myself, but it helps to see it all come together.
The most important changes to your original code are in the comments, but they all amount to giving Cython hints about how to generate code that doesn't use the Python API.
As an aside: I really don't know why infer_types is not on by default. It lets the compiler implicitly use C types instead of Python types where possible, meaning less work for you.
If you run cython -a on this, you'll see that the only lines that call into Python are your calls to random.sample, and building or adding to a Python set().
On my machine, your original code runs in 2.1 seconds. My version runs in 0.6 seconds.
The next step is to get random.sample out of that loop, but I'll leave that to you.
I have edited my answer to demonstrate how to precompute the rand samples. This brings the time down to 0.4 seconds.
Do you need to use this particular hashing algorithm? Why not use the built-in hashing algorithm for dicts? For example:
from collections import Counter
cnt = Counter(inputx)
dupes = [k for k, v in cnt.iteritems() if v > 1]
I have this code:
for j in xrange (j_start, self.max_j):
    for i in xrange (0, self.max_i):
        new_i = round (i + ((j - j_start) * discriminant))
        if new_i >= self.max_i:
            continue
        self.grid[new_i, j] = standard[i]
and I want to speed it up by throwing away the slow native Python loops. It should be possible to use NumPy vector operations instead, since they are really fast. How do I do that?
j_start, self.max_j, self.max_i, discriminant: int, int, int, float (constants).
self.grid: two-dimensional numpy array (self.max_i x self.max_j).
standard: one-dimensional numpy array (self.max_i).
Here is a complete solution, perhaps that will help.
jrange = np.arange(self.max_j - j_start)
joffset = np.round(jrange * discriminant).astype(int)
i = np.arange(self.max_i)

for j in jrange:
    new_i = i + joffset[j]
    in_range = new_i < self.max_i
    self.grid[new_i[in_range], j+j_start] = standard[i[in_range]]
It may be possible to vectorize both loops but that will, I think, be tricky.
I haven't tested this but I believe it computes the same result as your code.
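For what it's worth, here is a sketch of vectorizing both loops (same variables as above; one caveat: if rounding maps several source rows i to the same new_i for a given j, NumPy's fancy-indexed assignment does not guarantee which one wins, whereas the loop above keeps the last i):

jrange = np.arange(self.max_j - j_start)
joffset = np.round(jrange * discriminant).astype(int)      # shape (J,)
i = np.arange(self.max_i)                                  # shape (I,)

new_i = i[:, None] + joffset[None, :]                      # every (i, j) target row, shape (I, J)
valid = new_i < self.max_i                                 # same test as the `continue` above
cols = np.broadcast_to(jrange + j_start, new_i.shape)      # column index for each entry
src = np.broadcast_to(i[:, None], new_i.shape)             # source index into `standard`

self.grid[new_i[valid], cols[valid]] = standard[src[valid]]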