Speeding up python code with cython

Speeding up python code with cython - python

I have a function which just basically makes lots of calls to a simple defined hash function and tests to see when it finds a duplicate. I need to do lots of simulations with it so would like it to be as fast as possible. I am attempting to use cython to do this. The cython code is currently called with a normal python list of integers with values in the range 0 to m^2.
import math, random
cdef int a,b,c,d,m,pos,value, cyclelimit, nohashcalls
def h3(int a,int b,int c,int d, int m,int x):
return (a*x**2 + b*x+c) %m
def floyd(inputx):
dupefound, nohashcalls = (0,0)
m = len(inputx)
loops = int(m*math.log(m))
for loopno in xrange(loops):
if (dupefound == 1):
break
a = random.randrange(m)
b = random.randrange(m)
c = random.randrange(m)
d = random.randrange(m)
pos = random.randrange(m)
value = inputx[pos]
listofpos = [0] * m
listofpos[pos] = 1
setofvalues = set([value])
cyclelimit = int(math.sqrt(m))
for j in xrange(cyclelimit):
pos = h3(a,b, c,d, m, inputx[pos])
nohashcalls += 1
if (inputx[pos] in setofvalues):
if (listofpos[pos]==1):
dupefound = 0
else:
dupefound = 1
print "Duplicate found at position", pos, " and value", inputx[pos]
break
listofpos[pos] = 1
setofvalues.add(inputx[pos])
return dupefound, nohashcalls
How can I convert inputx and listofpos to use C type arrays and to access the arrays at C speed? Are there any other speed ups I can use? Can setofvalues be sped up?
So that there is something to compare against, 50 calls to floyd() with m = 5000 currently takes around 30 seconds on my computer.
Update: Example code snippet to show how floyd is called.
m = 5000
inputx = random.sample(xrange(m**2), m)
(dupefound, nohashcalls) = edcython.floyd(inputx)

First of all, it seems that you must type the variables inside the function. A good example of it is here.
Second, cython -a, for "annotate", gives you a really excellent break down of the code generated by the cython compiler and a color-coded indication of how dirty (read: python api heavy) it is. This output is really essential when trying to optimize anything.
Third, the now famous page on working with Numpy explains how to get fast, C-style access to the Numpy array data. Unforunately it's verbose and annoying. We're in luck however, because more recent Cython provides Typed Memory Views, which are both easy to use and awesome. Read that entire page before you try to do anything else.
After ten minutes or so I came up with this:
# cython: infer_types=True
# Use the C math library to avoid Python overhead.
from libc cimport math
# For boundscheck below.
import cython
# We're lazy so we'll let Numpy handle our array memory management.
import numpy as np
# You would normally also import the Numpy pxd to get faster access to the Numpy
# API, but it requires some fancier compilation options so I'll leave it out for
# this demo.
# cimport numpy as np
import random
# This is a small function that doesn't need to be exposed to Python at all. Use
# `cdef` instead of `def` and inline it.
cdef inline int h3(int a,int b,int c,int d, int m,int x):
return (a*x**2 + b*x+c) % m
# If we want to live fast and dangerously, we tell cython not to check our array
# indices for IndexErrors. This means we CAN overrun our array and crash the
# program or screw up our stack. Use with caution. Profiling suggests that we
# aren't gaining anything in this case so I leave it on for safety.
# #cython.boundscheck(False)
# `cpdef` so that calling this function from another Cython (or C) function can
# skip the Python function call overhead, while still allowing us to use it from
# Python.
cpdef floyd(int[:] inputx):
# Type the variables in the scope of the function.
cdef int a,b,c,d, value, cyclelimit
cdef unsigned int dupefound = 0
cdef unsigned int nohashcalls = 0
cdef unsigned int loopno, pos, j
# `m` has type int because inputx is already a Cython memory view and
# `infer-types` is on.
m = inputx.shape[0]
cdef unsigned int loops = int(m*math.log(m))
# Again using the memory view, but letting Numpy allocate an array of zeros.
cdef int[:] listofpos = np.zeros(m, dtype=np.int32)
# Keep this random sampling out of the loop
cdef int[:, :] randoms = np.random.randint(0, m, (loops, 5)).astype(np.int32)
for loopno in range(loops):
if (dupefound == 1):
break
# From our precomputed array
a = randoms[loopno, 0]
b = randoms[loopno, 1]
c = randoms[loopno, 2]
d = randoms[loopno, 3]
pos = randoms[loopno, 4]
value = inputx[pos]
# Unforunately, Memory View does not support "vectorized" operations
# like standard Numpy arrays. Otherwise we'd use listofpos *= 0 here.
for j in range(m):
listofpos[j] = 0
listofpos[pos] = 1
setofvalues = set((value,))
cyclelimit = int(math.sqrt(m))
for j in range(cyclelimit):
pos = h3(a, b, c, d, m, inputx[pos])
nohashcalls += 1
if (inputx[pos] in setofvalues):
if (listofpos[pos]==1):
dupefound = 0
else:
dupefound = 1
print "Duplicate found at position", pos, " and value", inputx[pos]
break
listofpos[pos] = 1
setofvalues.add(inputx[pos])
return dupefound, nohashcalls
There are no tricks here that aren't explained on docs.cython.org, which is where I learned them myself, but helps to see it all come together.
The most important changes to your original code are in the comments, but they all amount to giving Cython hints about how to generate code that doesn't use the Python API.
As an aside: I really don't know why infer_types is not on by default. It lets the compiler
implicitly use C types instead of Python types where possible, meaning less work for you.
If you run cython -a on this, you'll see that the only lines that call into Python are your calls to random.sample, and building or adding to a Python set().
On my machine, your original code runs in 2.1 seconds. My version runs in 0.6 seconds.
The next step is to get random.sample out of that loop, but I'll leave that to you.
I have edited my answer to demonstrate how to precompute the rand samples. This brings the time down to 0.4 seconds.

Do you need to use this particular hashing algorithm? Why not use the built-in hashing algorithm for dicts? For example:
from collections import Counter
cnt = Counter(inputx)
dupes = [k for k, v in cnt.iteritems() if v > 1]

Related

c array with python class inside python function

I am working in Cython. How can i declare a C array of a python class instances and then pass the array to a python function and work on it?
cdef int n=100
class particle:
def __init__(self):
self.x=uniform(1,99)
self.y=uniform(1,99)
self.pot=0
cdef particle parlist[n]
def CalPot(parlist[]):
for i in range(N-1):
pot=0
for j in range(i,N):
dx=parlist[j].x-parlist[i].x
dy=parlist[j].y-parlist[j].y
r2=dx**2+dy**2
pot=pot+4*ep*r2*((sig/r2)**12 - (sig/r2)**6)
parlist[i].pot=pot

As Ioannis and DavidW told you, you should not create a c-array of python objects and should use a python list instead.
Cytonizing the resulting pure Python would bring a speed up of about factor 2, because cython would cut out the interpreter part. However, there is much more potential if you would also get rid of reference counting and dynamic dispatch - a speed-up up to factor 100 is pretty common. Some time ago I answered a question illustrating this.
What should you do to get this speed up? You need to replace python-multiplication with "bare metal" multiplications.
First step: Don't use a (python)-class for particle, use a simple c-struct - it is just a collection of data - nothing more, nothing less:
cdef struct particle:
double x
double y
double pot
First benefit: It is possible to define a global c-array of these structs (another question is, whether it is a very smart thing to do in a bigger project):
DEF n=2000 # known at compile time
cdef particle parlist[n]
After the initialization of the array (for more details see attached listings), we can use it in our calcpot-function (I slightly changed your definition):
def calcpot():
cdef double pot,dX,dY
cdef int i,j
for i in range(n):
pot=0.0
for j in range(i+1, n):
dX=parlist[i].x-parlist[j].x
dY=parlist[i].y-parlist[j].y
pot=pot+1.0/(dX*dX+dY*dY)
parlist[i].pot=pot
The main difference to the original code: parlist[i].xand Co. are no longer slow python objects but simple and fast doubles. There are a lot of subtle things to be considered in order to be able to get the maximal speed-up - one really should read/reread the cython documentation.
Was the trouble worth it? There are the timings(via %timeit calcpot()) on my machine:
Time Speed-up
pure python + interpreter: 924 ms ± 14.1 ms x1.0
pure python + cython: 609 ms ± 6.83 ms x1.5
cython version: 4.1 ms ± 55.3 µs x231.0
A speed up of 231 through using the lowly stucts!
Listing python code:
import random
class particle:
def __init__(self):
self.x=random.uniform(1,99)
self.y=random.uniform(1,99)
self.pot=0
n=2000
parlist = [particle() for _ in range(n)]
def calcpot():
for i in range(n):
pot=0.0
for j in range(i+1, n):
dX=parlist[i].x-parlist[j].x
dY=parlist[i].y-parlist[j].y
pot=pot+1.0/(dX*dX+dY*dY)
parlist[i].pot=pot
Listing cython code:
#call init_parlist prior to calcpot!
cdef struct particle:
double x
double y
double pot
DEF n=2000 # known at compile time
cdef particle parlist[n]
import random
def init_parlist():
for i in range(n):
parlist[i].x=random.uniform(1,99)
parlist[i].y=random.uniform(1,99)
parlist[i].pot=0.0
def calcpot():
cdef double pot,dX,dY
cdef int i,j
for i in range(n):
pot=0.0
for j in range(i+1, n):
dX=parlist[i].x-parlist[j].x
dY=parlist[i].y-parlist[j].y
pot=pot+1.0/(dX*dX+dY*dY)
parlist[i].pot=pot

BUG: use spaces, not tabs, REF: use fenced code block in Markdown
Instances of Python classes are Python objects, and better handled in Python (they are not C types, and I don't see any reason for creating some form of C representation for them within the Cython source). Also, global variables like n and parlist are better avoided (in this example they aren't necessary).
class particle:
def __init__(self):
self.x = uniform(1, 99)
self.y = uniform(1, 99)
self.pot = 0
def CalPot(parlist):
N = len(parlist)
for i in range(N):
pot = 0
for j in range(i, N):
dx = parlist[j].x - parlist[i].x
dy = parlist[j].y - parlist[j].y
r2 = dx**2 + dy**2
pot = pot + 4 * ep * r2 * ((sig / r2)**12 - (sig / r2)**6)
parlist[i].pot = pot
So this Cython code happens to be pure Python.

Numba #jit fails to optimise simple function

I have a pretty simple function which uses Numpy arrays and for loops, but adding the Numba #jit decorator gives absolutely no speed up:
# #jit(float64[:](int32,float64,float64,float64,int32))
#jit
def Ising_model_1D(N=200,J=1,T=1e-2,H=0,n_iter=1e6):
beta = 1/T
s = randn(N,1) > 10
s[N-1] = s[0]
mag = zeros((n_iter,1))
aux_idx = randint(low=0,high=N,size=(n_iter,1))
for i1 in arange(n_iter):
rnd_idx = aux_idx[i1]
s_1 = s[rnd_idx]*2 - 1
s_2 = s[(rnd_idx+1)%(N)]*2 - 1
s_3 = s[(rnd_idx-1)%(N)]*2 - 1
delta_E = 2.0*J*(s_2+s_3)*s_1 + 2.0*H*s_1
if(delta_E < 0):
s[rnd_idx] = np.logical_not(s[rnd_idx])
elif(np.exp(-1*beta*delta_E) >= rand()):
s[rnd_idx] = np.logical_not(s[rnd_idx])
s[N-1] = s[0]
mag[i1] = (s*2-1).sum()*1.0/N
return mag
MATLAB on the other hand takes less than 0.5 seconds to run this!
Why is Numba missing something so basic?

Here is a reworking of your code that runs in about 0.4 seconds on my machine:
def ising_model_1d(N=200,J=1,T=1e-2,H=0,n_iter=1e6):
n_iter = int(n_iter)
beta = 1/T
s = randn(N) > 10
s[N-1] = s[0]
mag = zeros(n_iter)
aux_idx = randint(low=0,high=N,size=n_iter)
pre_rand = rand(n_iter)
_ising_jitted(n_iter, aux_idx, s, J, N, H, beta, pre_rand, mag)
return mag
#jit(nopython=True)
def _ising_jitted(n_iter, aux_idx, s, J, N, H, beta, pre_rand, mag):
for i1 in range(n_iter):
rnd_idx = aux_idx[i1]
s_1 = s[rnd_idx*2] - 1
s_2 = s[(rnd_idx+1)%(N)]*2 - 1
s_3 = s[(rnd_idx-1)%(N)]*2 - 1
delta_E = 2.0*J*(s_2+s_3)*s_1 + 2.0*H*s_1
t = rand()
if delta_E < 0:
s[rnd_idx] = not s[rnd_idx]
elif np.exp(-1*beta*delta_E) >= pre_rand[i1]:
s[rnd_idx] = not s[rnd_idx]
s[N-1] = s[0]
mag[i1] = (s*2-1).sum()*1.0/N
Please make sure the results are as expected! I changed much of what you had, and can't guarantee that the calculations are correct!
Working with numba requires a little care. Python functions, as well as most numpy functions, cannot be optimized by the compiler. One thing I find helpful is to use the nopython option to #jit. This means that the compiler will complain whenever you give it some code that it can't really optimize. You can then look at the error message and find the line that will likely slow down your code.
The trick, I find, is to write a "gateway" function in Python that does as much of the work as possible using numpy and its vectorized functions. It should create the empty arrays that you'll need to store the results in. It should package all of the data you'll need during the computation. Then it should pass all of these into your jitted function in one big, long argument list.
Case in point: notice how I handle random number generation in the jitted code. In your original code, you called rand():
elif(np.exp(-1*beta*delta_E) >= rand()):
But rand() can't be optimized by numba (in older versions of numba, at least. In newer versions it can, provided that rand is called without arguments). The observation is that you need a single random number for every one of the n_iter iterations. So we simply create a random array using numpy in our wrapper function, then feed this random array to the jitted function. Getting a random number is then as simple as indexing into this array.
Lastly, for a list of the numpy functions that can be optimized by the latest version of the compiler, see here. In my reworking of your code I was aggressive in removing calls to numpy functions so that the code would work over more versions of numba.

multiprocessing in ipython notebook [duplicate]

I often need to apply a function to the groups of a very large DataFrame (of mixed data types) and would like to take advantage of multiple cores.
I can create an iterator from the groups and use the multiprocessing module, but it is not efficient because every group and the results of the function must be pickled for messaging between processes.
Is there any way to avoid the pickling or even avoid the copying of the DataFrame completely? It looks like the shared memory functions of the multiprocessing modules are limited to numpy arrays. Are there any other options?

From the comments above, it seems that this is planned for pandas some time (there's also an interesting-looking rosetta project which I just noticed).
However, until every parallel functionality is incorporated into pandas, I noticed that it's very easy to write efficient & non-memory-copying parallel augmentations to pandas directly using cython + OpenMP and C++.
Here's a short example of writing a parallel groupby-sum, whose use is something like this:
import pandas as pd
import para_group_demo
df = pd.DataFrame({'a': [1, 2, 1, 2, 1, 1, 0], 'b': range(7)})
print para_group_demo.sum(df.a, df.b)
and output is:
sum
key
0 6
1 11
2 4
Note Doubtlessly, this simple example's functionality will eventually be part of pandas. Some things, however, will be more natural to parallelize in C++ for some time, and it's important to be aware of how easy it is to combine this into pandas.
To do this, I wrote a simple single-source-file extension whose code follows.
It starts with some imports and type definitions
from libc.stdint cimport int64_t, uint64_t
from libcpp.vector cimport vector
from libcpp.unordered_map cimport unordered_map
cimport cython
from cython.operator cimport dereference as deref, preincrement as inc
from cython.parallel import prange
import pandas as pd
ctypedef unordered_map[int64_t, uint64_t] counts_t
ctypedef unordered_map[int64_t, uint64_t].iterator counts_it_t
ctypedef vector[counts_t] counts_vec_t
The C++ unordered_map type is for summing by a single thread, and the vector is for summing by all threads.
Now to the function sum. It starts off with typed memory views for fast access:
def sum(crit, vals):
cdef int64_t[:] crit_view = crit.values
cdef int64_t[:] vals_view = vals.values
The function continues by dividing the semi-equally to the threads (here hardcoded to 4), and having each thread sum the entries in its range:
cdef uint64_t num_threads = 4
cdef uint64_t l = len(crit)
cdef uint64_t s = l / num_threads + 1
cdef uint64_t i, j, e
cdef counts_vec_t counts
counts = counts_vec_t(num_threads)
counts.resize(num_threads)
with cython.boundscheck(False):
for i in prange(num_threads, nogil=True):
j = i * s
e = j + s
if e > l:
e = l
while j < e:
counts[i][crit_view[j]] += vals_view[j]
inc(j)
When the threads have completed, the function merges all the results (from the different ranges) into a single unordered_map:
cdef counts_t total
cdef counts_it_t it, e_it
for i in range(num_threads):
it = counts[i].begin()
e_it = counts[i].end()
while it != e_it:
total[deref(it).first] += deref(it).second
inc(it)
All that's left is to create a DataFrame and return the results:
key, sum_ = [], []
it = total.begin()
e_it = total.end()
while it != e_it:
key.append(deref(it).first)
sum_.append(deref(it).second)
inc(it)
df = pd.DataFrame({'key': key, 'sum': sum_})
df.set_index('key', inplace=True)
return df

Numpy sum of operator results without allocating an unnecessary array

I have two numpy boolean arrays (a and b). I need to find how many of their elements are equal. Currently, I do len(a) - (a ^ b).sum(), but the xor operation creates an entirely new numpy array, as I understand. How do I efficiently implement this desired behavior without creating the unnecessary temporary array?
I've tried using numexpr, but I can't quite get it to work right. It doesn't support the notion that True is 1 and False is 0, so I have to use ne.evaluate("sum(where(a==b, 1, 0))"), which takes about twice as long.
Edit: I forgot to mention that one of these arrays is actually a view into another array of a different size, and both arrays should be considered immutable. Both arrays are 2-dimensional and tend to be somewhere around 25x40 in size.
Yes, this is the bottleneck of my program and is worth optimizing.

On my machine this is faster:
(a == b).sum()
If you don't want to use any extra storage, than I would suggest using numba.
I'm not too familiar with it, but this seems to work well.
I ran into some trouble getting Cython to take a boolean NumPy array.
from numba import autojit
def pysumeq(a, b):
tot = 0
for i in xrange(a.shape[0]):
for j in xrange(a.shape[1]):
if a[i,j] == b[i,j]:
tot += 1
return tot
# make numba version
nbsumeq = autojit(pysumeq)
A = (rand(10,10)<.5)
B = (rand(10,10)<.5)
# do a simple dry run to get it to compile
# for this specific use case
nbsumeq(A, B)
If you don't have numba, I would suggest using the answer by #user2357112
Edit: Just got a Cython version working, here's the .pyx file. I'd go with this.
from numpy cimport ndarray as ar
cimport numpy as np
cimport cython
#cython.boundscheck(False)
#cython.wraparound(False)
def cysumeq(ar[np.uint8_t,ndim=2,cast=True] a, ar[np.uint8_t,ndim=2,cast=True] b):
cdef int i, j, h=a.shape[0], w=a.shape[1], tot=0
for i in xrange(h):
for j in xrange(w):
if a[i,j] == b[i,j]:
tot += 1
return tot

To start with you can skip then A*B step:
>>> a
array([ True, False, True, False, True], dtype=bool)
>>> b
array([False, True, True, False, True], dtype=bool)
>>> np.sum(~(a^b))
3
If you do not mind destroying array a or b, I am not sure you will get faster then this:
>>> a^=b #In place xor operator
>>> np.sum(~a)
3

If the problem is allocation and deallocation, maintain a single output array and tell numpy to put the results there every time:
out = np.empty_like(a) # Allocate this outside a loop and use it every iteration
num_eq = np.equal(a, b, out).sum()
This'll only work if the inputs are always the same dimensions, though. You may be able to make one big array and slice out a part that's the size you need for each call if the inputs have varying sizes, but I'm not sure how much that slows you down.

Improving upon IanH's answer, it's also possible to get access to the underlying C array in a numpy array from within Cython, by supplying mode="c" to ndarray.
from numpy cimport ndarray as ar
cimport numpy as np
cimport cython
#cython.boundscheck(False)
#cython.wraparound(False)
cdef int cy_sum_eq(ar[np.uint8_t,ndim=2,cast=True,mode="c"] a, ar[np.uint8_t,ndim=2,cast=True,mode="c"] b):
cdef int i, j, h=a.shape[0], w=a.shape[1], tot=0
cdef np.uint8_t* adata = &a[0, 0]
cdef np.uint8_t* bdata = &b[0, 0]
for i in xrange(h):
for j in xrange(w):
if adata[j] == bdata[j]:
tot += 1
adata += w
bdata += w
return tot
This is about 40% faster on my machine than IanH's Cython version, and I've found that rearranging the loop contents doesn't seem to make much of a difference at this point probably due to compiler optimizations. At this point, one could potentially link to a C function optimized with SSE and such to perform this operation and pass adata and bdata as uint8_t*s

Efficiently applying a function to a grouped pandas DataFrame in parallel

I often need to apply a function to the groups of a very large DataFrame (of mixed data types) and would like to take advantage of multiple cores.
I can create an iterator from the groups and use the multiprocessing module, but it is not efficient because every group and the results of the function must be pickled for messaging between processes.
Is there any way to avoid the pickling or even avoid the copying of the DataFrame completely? It looks like the shared memory functions of the multiprocessing modules are limited to numpy arrays. Are there any other options?

From the comments above, it seems that this is planned for pandas some time (there's also an interesting-looking rosetta project which I just noticed).
However, until every parallel functionality is incorporated into pandas, I noticed that it's very easy to write efficient & non-memory-copying parallel augmentations to pandas directly using cython + OpenMP and C++.
Here's a short example of writing a parallel groupby-sum, whose use is something like this:
import pandas as pd
import para_group_demo
df = pd.DataFrame({'a': [1, 2, 1, 2, 1, 1, 0], 'b': range(7)})
print para_group_demo.sum(df.a, df.b)
and output is:
sum
key
0 6
1 11
2 4
Note Doubtlessly, this simple example's functionality will eventually be part of pandas. Some things, however, will be more natural to parallelize in C++ for some time, and it's important to be aware of how easy it is to combine this into pandas.
To do this, I wrote a simple single-source-file extension whose code follows.
It starts with some imports and type definitions
from libc.stdint cimport int64_t, uint64_t
from libcpp.vector cimport vector
from libcpp.unordered_map cimport unordered_map
cimport cython
from cython.operator cimport dereference as deref, preincrement as inc
from cython.parallel import prange
import pandas as pd
ctypedef unordered_map[int64_t, uint64_t] counts_t
ctypedef unordered_map[int64_t, uint64_t].iterator counts_it_t
ctypedef vector[counts_t] counts_vec_t
The C++ unordered_map type is for summing by a single thread, and the vector is for summing by all threads.
Now to the function sum. It starts off with typed memory views for fast access:
def sum(crit, vals):
cdef int64_t[:] crit_view = crit.values
cdef int64_t[:] vals_view = vals.values
The function continues by dividing the semi-equally to the threads (here hardcoded to 4), and having each thread sum the entries in its range:
cdef uint64_t num_threads = 4
cdef uint64_t l = len(crit)
cdef uint64_t s = l / num_threads + 1
cdef uint64_t i, j, e
cdef counts_vec_t counts
counts = counts_vec_t(num_threads)
counts.resize(num_threads)
with cython.boundscheck(False):
for i in prange(num_threads, nogil=True):
j = i * s
e = j + s
if e > l:
e = l
while j < e:
counts[i][crit_view[j]] += vals_view[j]
inc(j)
When the threads have completed, the function merges all the results (from the different ranges) into a single unordered_map:
cdef counts_t total
cdef counts_it_t it, e_it
for i in range(num_threads):
it = counts[i].begin()
e_it = counts[i].end()
while it != e_it:
total[deref(it).first] += deref(it).second
inc(it)
All that's left is to create a DataFrame and return the results:
key, sum_ = [], []
it = total.begin()
e_it = total.end()
while it != e_it:
key.append(deref(it).first)
sum_.append(deref(it).second)
inc(it)
df = pd.DataFrame({'key': key, 'sum': sum_})
df.set_index('key', inplace=True)
return df

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Speeding up python code with cython - python

Do you need to use this particular hashing algorithm? Why not use the built-in hashing algorithm for dicts? For example: from collections import Counter cnt = Counter(inputx) dupes = [k for k, v in cnt.iteritems() if v > 1]

Related

c array with python class inside python function

Numba #jit fails to optimise simple function

multiprocessing in ipython notebook [duplicate]

Numpy sum of operator results without allocating an unnecessary array

Efficiently applying a function to a grouped pandas DataFrame in parallel

Categories

Resources