Reduction of array in cython parallel

I have an array that needs to accumulate sums of different things, so I want to perform a reduction on each of its elements.
Here's the code:
cdef int *a = <int *>malloc(sizeof(int) * 3)
for i in range(3):
    a[i] = 1 * i
cdef int *b
for i in prange(1000, nogil=True, num_threads=10):
    b = res()  # res returns an array initialized to 1s
    with gil:  # if this line is commented out, the results are wrong
        for k in range(3):
            a[k] += b[k]
for i in range(3):
    print a[i]
As long as the `with gil:` block is there, the code runs fine; without it, it gives wrong results.
How can I perform the reduction on each element of the array without taking the GIL? I think the GIL will block the other threads.

The way reductions usually work in practice is to do the sum individually for each thread, and then add the per-thread sums together at the end. You could do this manually with something like:
cdef int *b
cdef int *a_local  # version of a that is duplicated by each thread
cdef int i, j, k

# set up as before
cdef int *a = <int *>malloc(sizeof(int) * 3)
for i in range(3):
    a[i] = 1 * i

# multithreaded from here
with nogil, parallel(num_threads=10):
    # set up and initialise a_local on each thread
    a_local = <int *>malloc(sizeof(int) * 3)
    for k in range(3):
        a_local[k] = 0
    for i in prange(1000):
        b = res()  # note - you never free b; this is likely a memory leak...
        for j in range(3):
            a_local[j] += b[j]
    # finally, at the end, add the per-thread results together.
    # This needs to be done `with gil:` to avoid race conditions,
    # but it isn't a problem because it's only a small amount of work.
    with gil:
        for k in range(3):
            a[k] += a_local[k]
    free(a_local)
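If you want to avoid taking the GIL even for the final merge, a variant (a sketch of my own, assuming the same `a`, `res()` and thread count as above) is to give every thread its own slice of one shared scratch buffer, indexed by `cython.parallel.threadid()`, and to merge the slices serially after the parallel block:

from cython.parallel cimport parallel, prange, threadid
from libc.stdlib cimport calloc, free

cdef int num_threads = 10
cdef int *b
cdef int *scratch = <int *>calloc(num_threads * 3, sizeof(int))
cdef int i, k, t

with nogil, parallel(num_threads=num_threads):
    for i in prange(1000):
        b = res()
        t = threadid()  # each thread only writes to its own slice
        for k in range(3):
            scratch[t * 3 + k] += b[k]

# single-threaded merge - no lock or GIL needed
for t in range(num_threads):
    for k in range(3):
        a[k] += scratch[t * 3 + k]
free(scratch)

With only 3 ints per thread the slices share cache lines, so for bigger arrays you would want to pad each slice to a cache line to avoid false sharing; for a case this small the `with gil:` merge above is just as good.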


Cython: Create memoryview without NumPy array?

Since I find memoryviews handy and fast, I try to avoid creating NumPy arrays in Cython and to work with views of the given arrays. However, sometimes it cannot be avoided: not to alter an existing array, but to create a new one. In higher-level functions this is not noticeable, but in often-called subroutines it is. Consider the following function:
# @cython.profile(False)
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
cdef double [:] vec_eq(double [:] v1, int [:] v2, int cond):
    ''' Function output corresponds to v1[v2 == cond] '''
    cdef unsigned int n = v1.shape[0]
    cdef unsigned int n_ = 0  # size of array to create
    cdef size_t i
    for i in range(n):
        if v2[i] == cond:
            n_ += 1
    # create array for selection
    cdef double [:] s = np.empty(n_, dtype=np_float)  # slow line
    # copy selection to new array
    n_ = 0
    for i in range(n):
        if v2[i] == cond:
            s[n_] = v1[i]
            n_ += 1
    return s
Profiling tells me there is some speed to gain here.
What I could do is adapt the function, because sometimes, for instance, the mean of this vector is calculated and sometimes the sum, so I could rewrite it for summing or taking the average. But isn't there a way to create a memoryview directly with very little overhead, defining the size dynamically? Something like first creating a C buffer using malloc etc., and at the end of the function converting the buffer to a view by passing the pointer and strides or so.
Edit 1:
Maybe for simple cases, adapting the function, e.g. like this, is an acceptable approach. I only added an option argument and the summing/averaging. This way I don't have to create an array, and the function can use an easy-to-handle internal malloc. This won't get any faster, will it?
# ...
cdef double vec_eq(double [:] v1, int [:] v2, int cond, opt=0):
    # additional option argument
    ''' Function output corresponds to v1[v2 == cond].sum() / .mean() '''
    cdef unsigned int n = v1.shape[0]
    cdef int n_ = 0  # size of array to create
    cdef Py_ssize_t i
    for i in prange(n, nogil=True):
        if v2[i] == cond:
            n_ += 1
    # create array for selection
    cdef double s = 0
    cdef double *v3 = <double *> malloc(sizeof(double) * n_)
    if v3 == NULL:
        abort()
    # copy selection to new array
    n_ = 0
    for i in range(n):
        if v2[i] == cond:
            v3[n_] = v1[i]
            n_ += 1
    # do further computation here, according to the option
    # option 0 for the sum
    if opt == 0:
        for i in prange(n_, nogil=True):
            s += v3[i]
        free(v3)
        return s
    # option 1 for the mean
    else:
        for i in prange(n_, nogil=True):
            s += v3[i]
        free(v3)
        return s / n_
    # since in the end there is always only a single double value,
    # the memory can be freed right here
I didn't know how to deal with cpython arrays, so I finally solved this with a self-made 'memoryview', as proposed by fabrizioM. I wouldn't have thought that this would work. Creating a new np.array in a tight loop is pretty expensive, so this gave me a significant speed-up. Since I only need a 1-dimensional array, I didn't even have to bother with strides. But even for higher-dimensional arrays, I think this could go well.
cdef class Vector:
    cdef double *data
    cdef public int n_ax0

    def __init__(Vector self, int n_ax0):
        self.data = <double *> malloc(sizeof(double) * n_ax0)
        self.n_ax0 = n_ax0

    def __dealloc__(Vector self):
        free(self.data)

...

# @cython.profile(False)
@cython.boundscheck(False)
cdef Vector my_vec_func(double [:, ::1] a, int [:] v, int cond, int opt):
    # function returning a Vector, whose buffer is freed once the Vector
    # is garbage collected (e.g. after `del vec`)
    cdef int vecsize
    cdef size_t i
    # defs..
    # more stuff...
    vecsize = n
    cdef Vector vec = Vector(vecsize)  # named vec so it doesn't shadow the argument v
    for i in range(vecsize):
        # computation
        vec.data[i] = ...
    return vec

...

vec = my_vec_func(...
ptr_to_data = vec.data
length_of_vec = vec.n_ax0
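As an aside (my addition, a sketch building on the `Vector` above): if an actual memoryview over the buffer is ever needed, Cython can coerce a typed pointer directly, which ties in with the pointer-cast approach in the next answer:

# given `vec` from the sketch above (a Vector instance, used from Cython code)
cdef double[::1] view_of_vec = <double[:vec.n_ax0]> vec.data

Just keep the `Vector` object alive for as long as the view is in use, since `__dealloc__` frees the underlying buffer.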
The following thread on the Cython mailing list would probably be of interest to you:
https://groups.google.com/forum/#!topic/cython-users/CwtU_jYADgM
It looks like there are some decent options presented there, if you are fine with returning a memoryview from your function and letting it be coerced at a different level where performance isn't as much of an issue.
From http://docs.cython.org/src/userguide/memoryviews.html it follows that memory for Cython memoryviews can be allocated via:
cimport cython
cdef type [:] cview = cython.view.array(shape=(size,),
    itemsize=sizeof(type), format="type", allocate_buffer=True)
or by:
from libc.stdlib cimport malloc, free
cdef type [:] cview = <type[:size]> malloc(sizeof(type) * size)
Both cases work, but in the first I have an issue if I introduce my own type (ctypedef some mytype), because there is no suitable format string for it.
In the second case there is a problem with the deallocation of the memory.
According to the manual it should work as follows:
cview.callback_memory_free = free
which binds the function that frees the memory to the memoryview; however, this code does not compile.
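For what it's worth, the attribute in the current Cython documentation is `callback_free_data`, and it is set on a `cython.view.array` that wraps the raw pointer, not on the memoryview itself. A minimal sketch of that documented pattern (using `double` in place of the placeholder type):

from cython cimport view
from libc.stdlib cimport malloc, free

cdef double *data = <double *> malloc(sizeof(double) * size)
cdef view.array arr = view.array(shape=(size,), itemsize=sizeof(double),
                                 format="d", mode="c", allocate_buffer=False)
arr.data = <char *> data
arr.callback_free_data = free  # free() runs when arr is deallocated
cdef double [::1] cview = arr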

Bakeoff Part 2: Math with Cython Typed Memoryviews

I can't seem to do simple things like adding a value to each element of an array stored in a memoryview. I understand that is not what a typed memoryview is meant to do. But converting the memoryview back to an np.array is slower than tortoises herding cats.
When I try to write a cdef function like:
import numpy as np

cdef double[::1] _add(self, double[::1] arr, double val):
    cdef double[::1] newarr
    cdef int i, n
    n = arr.shape[0]  # was sizeof(arr)/sizeof(arr[0]), which doesn't work on memoryviews
    newarr = np.empty(n)
    for i in xrange(n):
        newarr[i] = arr[i] + val
    return newarr
I get errors saying that the memoryviews are not contiguous:
"ValueError: Buffer and memoryview are not contiguous in the same dimension."
This actually does work if the memoryview that is passed is not one that has been sliced. But it adds 10 seconds to the process!
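The contiguity error comes from the `double[::1]` declarations, which promise a C-contiguous buffer; a view sliced with a step no longer satisfies that promise. A sketch of one common fix (my suggestion, not from the original thread): loosen the argument to a strided view `double[:]`, which accepts sliced input at a small indexing cost:

import numpy as np

cdef double[::1] _add(self, double[:] arr, double val):
    # double[:] accepts any 1-D stride; the freshly allocated result
    # is still C-contiguous, so it can keep the [::1] declaration
    cdef int i, n = arr.shape[0]
    cdef double[::1] newarr = np.empty(n)
    for i in range(n):
        newarr[i] = arr[i] + val
    return newarr

Alternatively, np.ascontiguousarray on the caller's side makes a contiguous copy of a sliced array, but that copy is exactly the kind of overhead being complained about here.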

Speeding up python code with cython

I have a function which basically just makes lots of calls to a simple hash function and tests to see when it finds a duplicate. I need to run lots of simulations with it, so I would like it to be as fast as possible; I am attempting to use Cython for this. The Cython code is currently called with a normal Python list of integers, with values in the range 0 to m^2.
import math, random

cdef int a, b, c, d, m, pos, value, cyclelimit, nohashcalls

def h3(int a, int b, int c, int d, int m, int x):
    return (a*x**2 + b*x + c) % m

def floyd(inputx):
    dupefound, nohashcalls = (0, 0)
    m = len(inputx)
    loops = int(m*math.log(m))
    for loopno in xrange(loops):
        if (dupefound == 1):
            break
        a = random.randrange(m)
        b = random.randrange(m)
        c = random.randrange(m)
        d = random.randrange(m)
        pos = random.randrange(m)
        value = inputx[pos]
        listofpos = [0] * m
        listofpos[pos] = 1
        setofvalues = set([value])
        cyclelimit = int(math.sqrt(m))
        for j in xrange(cyclelimit):
            pos = h3(a, b, c, d, m, inputx[pos])
            nohashcalls += 1
            if (inputx[pos] in setofvalues):
                if (listofpos[pos] == 1):
                    dupefound = 0
                else:
                    dupefound = 1
                    print "Duplicate found at position", pos, " and value", inputx[pos]
                break
            listofpos[pos] = 1
            setofvalues.add(inputx[pos])
    return dupefound, nohashcalls
How can I convert inputx and listofpos to use C-type arrays and access them at C speed? Are there any other speed-ups I can use? Can setofvalues be sped up?
So that there is something to compare against: currently, 50 calls to floyd() with m = 5000 take around 30 seconds on my computer.
Update: Example code snippet to show how floyd is called.
m = 5000
inputx = random.sample(xrange(m**2), m)
(dupefound, nohashcalls) = edcython.floyd(inputx)
First of all, it seems that you must type the variables inside the function. A good example of this is here.
Second, cython -a, for "annotate", gives you a really excellent breakdown of the code generated by the cython compiler and a color-coded indication of how dirty (read: Python API heavy) it is. This output is really essential when trying to optimize anything.
Third, the now famous page on working with Numpy explains how to get fast, C-style access to Numpy array data. Unfortunately it's verbose and annoying. We're in luck, however, because more recent Cython provides Typed Memoryviews, which are both easy to use and awesome. Read that entire page before you try to do anything else.
After ten minutes or so I came up with this:
# cython: infer_types=True

# Use the C math library to avoid Python overhead.
from libc cimport math
# For boundscheck below.
import cython
# We're lazy so we'll let Numpy handle our array memory management.
import numpy as np
# You would normally also cimport the Numpy pxd to get faster access to the
# Numpy API, but it requires some fancier compilation options so I'll leave it
# out for this demo.
# cimport numpy as np
import random

# This is a small function that doesn't need to be exposed to Python at all.
# Use `cdef` instead of `def` and inline it.
cdef inline int h3(int a, int b, int c, int d, int m, int x):
    return (a*x**2 + b*x + c) % m

# If we want to live fast and dangerously, we tell cython not to check our
# array indices for IndexErrors. This means we CAN overrun our array and crash
# the program or screw up our stack. Use with caution. Profiling suggests that
# we aren't gaining anything in this case so I leave it on for safety.
# @cython.boundscheck(False)
# `cpdef` so that calling this function from another Cython (or C) function can
# skip the Python function call overhead, while still allowing us to use it
# from Python.
cpdef floyd(int[:] inputx):
    # Type the variables in the scope of the function.
    cdef int a, b, c, d, value, cyclelimit
    cdef unsigned int dupefound = 0
    cdef unsigned int nohashcalls = 0
    cdef unsigned int loopno, pos, j

    # `m` has type int because inputx is already a Cython memory view and
    # `infer_types` is on.
    m = inputx.shape[0]

    cdef unsigned int loops = int(m*math.log(m))

    # Again using the memory view, but letting Numpy allocate an array of zeros.
    cdef int[:] listofpos = np.zeros(m, dtype=np.int32)

    # Keep this random sampling out of the loop
    cdef int[:, :] randoms = np.random.randint(0, m, (loops, 5)).astype(np.int32)

    for loopno in range(loops):
        if (dupefound == 1):
            break

        # From our precomputed array
        a = randoms[loopno, 0]
        b = randoms[loopno, 1]
        c = randoms[loopno, 2]
        d = randoms[loopno, 3]
        pos = randoms[loopno, 4]

        value = inputx[pos]

        # Unfortunately, Memory View does not support "vectorized" operations
        # like standard Numpy arrays. Otherwise we'd use listofpos *= 0 here.
        for j in range(m):
            listofpos[j] = 0

        listofpos[pos] = 1
        setofvalues = set((value,))
        cyclelimit = int(math.sqrt(m))

        for j in range(cyclelimit):
            pos = h3(a, b, c, d, m, inputx[pos])
            nohashcalls += 1
            if (inputx[pos] in setofvalues):
                if (listofpos[pos] == 1):
                    dupefound = 0
                else:
                    dupefound = 1
                    print "Duplicate found at position", pos, " and value", inputx[pos]
                break
            listofpos[pos] = 1
            setofvalues.add(inputx[pos])

    return dupefound, nohashcalls
There are no tricks here that aren't explained on docs.cython.org, which is where I learned them myself, but it helps to see it all come together.
The most important changes to your original code are in the comments, but they all amount to giving Cython hints about how to generate code that doesn't use the Python API.
As an aside: I really don't know why infer_types is not on by default. It lets the compiler implicitly use C types instead of Python types where possible, meaning less work for you.
If you run cython -a on this, you'll see that the only lines that call into Python are your calls to random.sample, and building or adding to a Python set().
On my machine, your original code runs in 2.1 seconds. My version runs in 0.6 seconds.
The next step is to get random.sample out of that loop, but I'll leave that to you.
I have edited my answer to demonstrate how to precompute the rand samples. This brings the time down to 0.4 seconds.
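One caveat when calling the rewritten version (my note, not part of the original answer): since floyd now takes a typed memoryview `int[:]`, it can no longer be called with a plain Python list; the call site from the question needs a NumPy int32 array instead:

import numpy as np
import random

m = 5000
inputx = np.array(random.sample(xrange(m**2), m), dtype=np.int32)
(dupefound, nohashcalls) = edcython.floyd(inputx)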
Do you need to use this particular hashing algorithm? Why not use the built-in hashing algorithm for dicts? For example:
from collections import Counter
cnt = Counter(inputx)
dupes = [k for k, v in cnt.iteritems() if v > 1]

Efficiently applying a function to a grouped pandas DataFrame in parallel

I often need to apply a function to the groups of a very large DataFrame (of mixed data types) and would like to take advantage of multiple cores.
I can create an iterator from the groups and use the multiprocessing module, but it is not efficient because every group and the results of the function must be pickled for messaging between processes.
Is there any way to avoid the pickling, or even avoid copying the DataFrame completely? It looks like the shared-memory functions of the multiprocessing module are limited to numpy arrays. Are there any other options?
From the comments above, it seems that this is planned for pandas some time (there's also an interesting-looking rosetta project which I just noticed).
However, until all of this parallel functionality is incorporated into pandas, I noticed that it's very easy to write efficient, non-memory-copying parallel augmentations to pandas directly using Cython + OpenMP and C++.
Here's a short example of writing a parallel groupby-sum, whose use is something like this:
import pandas as pd
import para_group_demo
df = pd.DataFrame({'a': [1, 2, 1, 2, 1, 1, 0], 'b': range(7)})
print para_group_demo.sum(df.a, df.b)
and output is:
     sum
key
0      6
1     11
2      4
Note: doubtless, this simple example's functionality will eventually be part of pandas. Some things, however, will be more natural to parallelize in C++ for some time, and it's important to be aware of how easy it is to combine this with pandas.
To do this, I wrote a simple single-source-file extension whose code follows.
It starts with some imports and type definitions
from libc.stdint cimport int64_t, uint64_t
from libcpp.vector cimport vector
from libcpp.unordered_map cimport unordered_map
cimport cython
from cython.operator cimport dereference as deref, preincrement as inc
from cython.parallel import prange
import pandas as pd
ctypedef unordered_map[int64_t, uint64_t] counts_t
ctypedef unordered_map[int64_t, uint64_t].iterator counts_it_t
ctypedef vector[counts_t] counts_vec_t
The C++ unordered_map type is for summing by a single thread, and the vector is for summing by all threads.
Now to the function sum. It starts off with typed memory views for fast access:
def sum(crit, vals):
    cdef int64_t[:] crit_view = crit.values
    cdef int64_t[:] vals_view = vals.values
The function continues by dividing the range semi-equally among the threads (here hardcoded to 4), and having each thread sum the entries in its range:
    cdef uint64_t num_threads = 4
    cdef uint64_t l = len(crit)
    cdef uint64_t s = l / num_threads + 1
    cdef uint64_t i, j, e
    cdef counts_vec_t counts
    counts = counts_vec_t(num_threads)
    counts.resize(num_threads)
    with cython.boundscheck(False):
        for i in prange(num_threads, nogil=True):
            j = i * s
            e = j + s
            if e > l:
                e = l
            while j < e:
                counts[i][crit_view[j]] += vals_view[j]
                inc(j)
When the threads have completed, the function merges all the results (from the different ranges) into a single unordered_map:
    cdef counts_t total
    cdef counts_it_t it, e_it
    for i in range(num_threads):
        it = counts[i].begin()
        e_it = counts[i].end()
        while it != e_it:
            total[deref(it).first] += deref(it).second
            inc(it)
All that's left is to create a DataFrame and return the results:
    key, sum_ = [], []
    it = total.begin()
    e_it = total.end()
    while it != e_it:
        key.append(deref(it).first)
        sum_.append(deref(it).second)
        inc(it)
    df = pd.DataFrame({'key': key, 'sum': sum_})
    df.set_index('key', inplace=True)
    return df
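To build this, the extension has to be compiled as C++ with OpenMP enabled. A minimal setup.py sketch (the module/file names para_group_demo / para_group_demo.pyx follow the example above; the gcc-style flags are an assumption and would differ for MSVC):

from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

setup(ext_modules=cythonize(Extension(
    'para_group_demo',
    sources=['para_group_demo.pyx'],
    language='c++',                                 # for unordered_map / vector
    extra_compile_args=['-fopenmp', '-std=c++11'],  # OpenMP for prange
    extra_link_args=['-fopenmp'],
)))

After python setup.py build_ext --inplace, the import para_group_demo line above works as shown.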
