Since I find memoryviews handy and fast, I try to avoid creating NumPy arrays in Cython and work with views of the given arrays instead. Sometimes, however, I cannot alter an existing array and have to create a new one. In top-level functions the cost is not noticeable, but in frequently called subroutines it is. Consider the following function:
#@cython.profile(False)
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
cdef double [:] vec_eq(double [:] v1, int [:] v2, int cond):
    ''' Function output corresponds to v1[v2 == cond]'''
    cdef unsigned int n = v1.shape[0]
    cdef unsigned int n_ = 0
    # Size of array to create
    cdef size_t i
    for i in range(n):
        if v2[i] == cond:
            n_ += 1
    # Create array for selection
    cdef double [:] s = np.empty(n_, dtype=np_float) # Slow line
    # Copy selection to new array
    n_ = 0
    for i in range(n):
        if v2[i] == cond:
            s[n_] = v1[i]
            n_ += 1
    return s
Profiling tells me there is some speed to gain here.
What I could do is adapt the function, because sometimes the mean of this vector is calculated and sometimes the sum, so I could rewrite it to sum or average directly. But isn't there a way to create a memoryview directly with very little overhead, defining its size dynamically? Something like first creating a C buffer using malloc and, at the end of the function, converting the buffer to a view by passing the pointer and strides or so.
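For readers who want to see the count-then-fill logic in isolation, here is a minimal plain-Python sketch of it. It is not Cython and says nothing about speed; `array.array` merely stands in for the output buffer, and the function name is borrowed from the question.

```python
from array import array


def vec_eq(v1, v2, cond):
    """Return the elements of v1 at the positions where v2 == cond."""
    # Pass 1: count the matches so the output is allocated at its exact size.
    n_sel = 0
    for flag in v2:
        if flag == cond:
            n_sel += 1

    # Allocate once. This is zero-filled; np.empty or malloc would skip that.
    out = array('d', [0.0]) * n_sel

    # Pass 2: copy the selected values into the new buffer.
    k = 0
    for i in range(len(v1)):
        if v2[i] == cond:
            out[k] = v1[i]
            k += 1
    return out
```

The two passes trade a second scan of `v2` for a single allocation of known size, which is the same trade the Cython function makes.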
Edit 1:
Maybe for simple cases, adapting the function like this is an acceptable approach. I only added an argument and the summing/averaging. This way I don't have to create an array and can use a malloc inside the function, which is easy to handle. This won't get any faster, will it?
# ...
cdef double vec_eq(double [:] v1, int [:] v2, int cond, opt=0):
    # additional option argument
    ''' Function output corresponds to v1[v2 == cond].sum() / .mean()'''
    cdef unsigned int n = v1.shape[0]
    cdef int n_ = 0
    # Size of array to create
    cdef Py_ssize_t i
    for i in prange(n, nogil=True):
        if v2[i] == cond:
            n_ += 1
    # Create array for selection
    cdef double s = 0
    cdef double * v3 = <double *> malloc(sizeof(double) * n_)
    if v3 == NULL:
        abort()
    # Copy selection to new array
    n_ = 0
    for i in range(n):
        if v2[i] == cond:
            v3[n_] = v1[i]
            n_ += 1
    # Do further computation here, according to option
    # Option 0 for the sum
    if opt == 0:
        for i in prange(n_, nogil=True):
            s += v3[i]
        free(v3)
        return s
    # Option 1 for the mean
    else:
        for i in prange(n_, nogil=True):
            s += v3[i]
        free(v3)
        return s / n_
    # Since in the end there is always only a single double value,
    # the memory can be freed right here
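One possible simplification, sketched here in plain Python rather than Cython: when only the sum or the mean is needed, the intermediate buffer can be dropped entirely and both values accumulated in a single pass. The name `vec_eq_reduce` and the explicit empty-selection check are assumptions added for this illustration.

```python
def vec_eq_reduce(v1, v2, cond, opt=0):
    """Sum (opt == 0) or mean (otherwise) of v1 where v2 == cond."""
    total = 0.0
    count = 0
    # One pass: accumulate the sum and the number of matches together,
    # so no temporary buffer has to be allocated or freed.
    for value, flag in zip(v1, v2):
        if flag == cond:
            total += value
            count += 1
    if opt == 0:
        return total
    if count == 0:
        raise ValueError("no element matches cond, so the mean is undefined")
    return total / count
```

Whether this is faster in the Cython version would need to be measured, but it does remove the malloc/free pair and one of the loops.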
I didn't know how to deal with CPython arrays, so I finally solved this with a self-made 'memory view', as proposed by fabrizioM. I wouldn't have thought that this would work. Creating a new np.array in a tight loop is pretty expensive, so this gave me a significant speed-up. Since I only need a one-dimensional array, I didn't even have to bother with strides. But I think this could work well even for higher-dimensional arrays.
cdef class Vector:
    cdef double *data
    cdef public int n_ax0
    def __init__(Vector self, int n_ax0):
        self.data = <double*> malloc (sizeof(double) * n_ax0)
        self.n_ax0 = n_ax0
    def __dealloc__(Vector self):
        free(self.data)
...
#@cython.profile(False)
@cython.boundscheck(False)
cdef Vector my_vec_func(double [:, ::1] a, int [:] v, int cond, int opt):
    # function returning a Vector, which can be hopefully freed by del Vector
    cdef int vecsize
    cdef size_t i
    # defs..
    # more stuff...
    vecsize = n
    # named vec, because v is already taken by the int [:] argument
    cdef Vector vec = Vector(vecsize)
    for i in range(vecsize):
        # computation, written through the raw pointer
        vec.data[i] = ...
    return vec
...
vec = my_vec_func(...
ptr_to_data = vec.data
length_of_vec = vec.n_ax0
The following thread on the Cython mailing list would probably be of interest to you:
https://groups.google.com/forum/#!topic/cython-users/CwtU_jYADgM
It looks like there are some decent options presented there if you are fine with returning a memoryview from your function and having it coerced at a different level, where performance isn't as much of an issue.
According to http://docs.cython.org/src/userguide/memoryviews.html, memory for Cython memoryviews can be allocated via:
cimport cython
cdef type [:] cview = cython.view.array(size = size,
    itemsize = sizeof(type), format = "type", allocate_buffer = True)
or by
from libc.stdlib cimport malloc, free
cdef type [:] cview = <type[:size]> malloc(sizeof(type)*size)
Both cases work, but in the first I run into a problem if I introduce my own type (ctypedef some mytype), because there is no suitable format for it.
In the second case there is a problem with deallocating the memory.
According to the manual it should work as follows:
cview.callback_memory_free = free
which binds the function that frees the memory to the memoryview; however, this code does not compile.
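To illustrate what the format string is doing, here is a small plain-Python analogue, offered as a sketch only. It assumes the `format` argument follows the struct-module style codes ("d", "i", "I", ...), as Python's own memoryview does; `alloc_view` is a hypothetical helper, not part of Cython.

```python
import struct


def alloc_view(fmt, size):
    """Allocate a zeroed buffer and view it as `size` items of type `fmt`."""
    itemsize = struct.calcsize(fmt)      # bytes per element for this code
    buf = bytearray(itemsize * size)     # the owner of the memory
    # The view keeps `buf` alive, so no explicit free is needed.
    return memoryview(buf).cast(fmt)


view = alloc_view('d', 5)
view[2] = 1.5
```

For a ctypedef'd type, the analogous step would be to pass the format code of the underlying C type, which is why a type without a matching code is awkward with this approach.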
I'm implementing a simple Eratosthenes sieve. It's fairly straightforward when I make a constant-sized stack array for the results, but this approach will restrict the numbers of primes that I can calculate.
So I tried using some dynamic heap array; the issue here is that, in the "cpdef" function, I want to return a python list and I must cast the array to a list, but that's not possible.
If I do it without dereferencing, Cython will complain about "Python objects cannot be cast from pointers of primitive types" and, if I dereference the array, it will return the first element, which is no surprise.
I'm aware of NumPy or using some list comprehension, but I don't want to lower the performance or use third-party packages.
What is the fastest way for doing so, considering limitations? CPython arrays are slow: Maybe typed memoryview? And why is it that Cython cannot cast dynamic arrays?
# cython: boundscheck=False
# cython: wraparound=False
from libc.stdlib cimport malloc, free
from libc.math cimport sqrt
from cython.operator cimport dereference
cdef inline int is_prime(unsigned int num, unsigned int *primes, unsigned int counter):
    cdef unsigned int i
    for i in range(counter):
        if num % primes[i] == 0:
            return 0
        if primes[i] > <unsigned int> sqrt(num) + 1:
            break
    return 1
cpdef list primes_below(unsigned int x):
    # cdef:
    #     unsigned int primes[1_000_000]
    #     unsigned int counter = 0
    #     unsigned int i
    #
    # for i in range(2, x):
    #     if is_prime(i, primes, counter):
    #         primes[counter] = i
    #         counter += 1
    # return (<list> primes)[:counter]
    # Alternative approach: (It doesn't work!)
    cdef:
        unsigned int *primes = <unsigned int *> malloc(sizeof(int) * x)
        unsigned int counter = 0
        unsigned int i
    for i in range(2, x):
        if is_prime(i, primes, counter):
            primes[counter] = i
            counter += 1
    return <list> primes
    # return <list> dereference(primes)
P.S.: I'm a C newbie, so I may be missing some subtle details.
EDIT:
First, I should thank @DavidW for his endless comprehensive answers to Cython questions. Second, he was right!
List comprehensions are really fast, contrary to what I was thinking.
This is the new implementation:
# cython: boundscheck=False
# cython: wraparound=False
# cython: cdivision=True
from cython.view cimport array as cy_array
from libc.math cimport sqrt
cdef inline int is_prime(unsigned int num, unsigned int[::1] primes, unsigned int counter):
    cdef unsigned int i
    for i in range(counter):
        if num % primes[i] == 0:
            return 0
        if primes[i] > <unsigned int> sqrt(num) + 1:
            break
    return 1
cpdef list primes_below(unsigned int x):
    cdef:
        unsigned int[::1] primes = cy_array(shape=(x,), itemsize=sizeof(int), format="I")
        unsigned int counter = 0
        unsigned int i
    for i in range(2, x):
        if is_prime(i, primes, counter):
            primes[counter] = i
            counter += 1
    return [primes[i] for i in range(counter)]
I find Cython arrays to be fast and straightforward, so allocating memory with malloc is overkill.
And why Cython cannot cast dynamic arrays?
How could it? It doesn't know how long your array is. All it knows is that it's a pointer to some bit of memory that should be interpreted as ints. Even if it were clever enough to look at the malloc call, that wouldn't help: you actually fill up less memory than you allocate.
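The same point can be seen from Python with ctypes, used here purely as a stand-in for a raw C pointer (an assumption of this sketch, not what Cython does internally): the pointer carries no length, so the caller has to supply one when slicing.

```python
import ctypes

# A C array of 8 unsigned ints, of which only the first 4 are "filled".
buf = (ctypes.c_uint * 8)(2, 3, 5, 7)
ptr = ctypes.cast(buf, ctypes.POINTER(ctypes.c_uint))


def pointer_has_length(p):
    """Report whether len() works on the given ctypes object."""
    try:
        len(p)
    except TypeError:
        return False
    return True


counter = 4
# The bare pointer has no length, so the element count must be given here.
primes = ptr[:counter]
```

The array object `buf` knows its size, while `ptr` does not, which mirrors the difference between a typed memoryview and a plain `unsigned int *`.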
I'm aware of NumPy or using some list comprehension, but I don't want to lower the performance or use third-party packages. What is the fastest way for doing so, considering limitations? CPython arrays are slow;
Have you actually measured any of these performance claims you're making? Allocating an array.array is slightly slower than using malloc but access into it is very quick (using a memoryview).
Since you have to copy into a list at the end anyway, why not just append to a list as you go and skip the "fast" C array? It might well be quicker than copying at the end.
unsigned int *primes = <unsigned int *> malloc(sizeof(int) * x)
[...]
P.S: I'm a C newbie; so take it easy on me. Thanks! :)
Memory leak here - the memory you malloced is never freed. The big advantage of using Python containers is that you don't have to consider C memory allocation.
To actually answer your question: The easiest thing is probably to create a temporary Cython memoryview of the array and then pass that to list:
return list(<unsigned int[:counter:1]>primes)
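As a rough plain-Python sketch of "fill a preallocated buffer, then convert only the used part to a list": `array('I')` stands in for the malloc'd block here, and the `p * p > num` cutoff is swapped in for the sqrt test of the question, so treat both as assumptions of this illustration.

```python
from array import array


def primes_below(x):
    """Return all primes smaller than x as a Python list."""
    primes = array('I', [0]) * x   # oversized buffer, like malloc(sizeof(int) * x)
    counter = 0                    # number of slots actually filled
    for num in range(2, x):
        is_prime = True
        for i in range(counter):
            p = primes[i]
            if num % p == 0:
                is_prime = False
                break
            if p * p > num:        # no divisor can exist beyond sqrt(num)
                break
        if is_prime:
            primes[counter] = num
            counter += 1
    # Only the first `counter` entries are meaningful.
    return primes[:counter].tolist()
```

The slice to `counter` plays the role of the `[:counter:1]` bound in the cast above: the length has to come from the caller.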
I have an array that needs to contain sum of different things and therefore I want to perform reduction on each of its elements.
Here's the code:
cdef int *a=<int *>malloc(sizeof(int) * 3)
for i in range(3):
    a[i]=1*i
cdef int *b
for i in prange(1000,nogil=True,num_threads=10):
    b=res() #res returns an array initialized to 1s
    with gil: #if commented this line gives erroneous results
        for k in range(3):
            a[k]+=b[k]
for i in range(3):
    print a[i]
As long as the with gil block is there, the code runs fine; without it, it gives wrong results.
How can I do a reduction on each element of the array without using the GIL? I think the GIL will block the other threads.
The way reductions usually work in practice is to do the sum individually for each thread, and then add them together at the end. You could do this manually with something like
cdef int *b
cdef int *a_local # version of a that is duplicated by each thread
cdef int i,j,k
# set up as before
cdef int *a=<int *>malloc(sizeof(int) * 3)
for i in range(3):
    a[i]=1*i
# multithreaded from here
with nogil, parallel(num_threads=10):
    # setup and initialise a_local on each thread
    a_local = <int*>malloc(sizeof(int)*3)
    for k in range(3):
        a_local[k] = 0
    for i in prange(1000):
        b=res() # Note - you never free b
                # this is likely a memory leak....
        for j in range(3):
            a_local[j]+=b[j]
    # finally at the end add them all together.
    # this needs to be done `with gil:` to avoid race conditions
    # but it isn't a problem
    # because it's only a small amount of work being done
    with gil:
        for k in range(3):
            a[k] += a_local[k]
    free(a_local)
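The same pattern, sketched with Python's threading module so it can be run without OpenMP. This only illustrates the structure (private partial sums, then one short locked merge per thread); the function name, the chunking scheme and the stand-in `res` are assumptions, and Python threads will not give the speed-up that prange does.

```python
import threading


def parallel_reduce(n_iter=1000, n_threads=10, width=3):
    a = [1 * i for i in range(width)]      # shared result, set up as before
    lock = threading.Lock()

    def res():
        # Stand-in for the question's res(): an array initialised to 1s.
        return [1] * width

    def worker(start, stop):
        a_local = [0] * width              # private to this thread
        for _ in range(start, stop):
            b = res()
            for j in range(width):
                a_local[j] += b[j]
        # Merge once at the end; the lock is held only for `width` additions.
        with lock:
            for k in range(width):
                a[k] += a_local[k]

    chunk = (n_iter + n_threads - 1) // n_threads
    threads = []
    for t in range(n_threads):
        start = min(t * chunk, n_iter)
        stop = min(start + chunk, n_iter)
        threads.append(threading.Thread(target=worker, args=(start, stop)))
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return a
```

The lock is taken once per thread rather than once per iteration, which is what keeps the serial part negligible.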
The Cython documentation on typed memory views lists three ways of assigning to a typed memory view:
from a raw C pointer,
from a np.ndarray and
from a cython.view.array.
Assume that I don't have data passed in to my Cython function from outside but instead want to allocate memory and return it as a np.ndarray. Which of those options do I choose? Also assume that the size of that buffer is not a compile-time constant, i.e. I can't allocate on the stack but would need to malloc for option 1.
The three options would therefore look something like this:
from libc.stdlib cimport malloc, free
import numpy as np  # needed for np.empty at runtime
cimport numpy as np
from cython cimport view
np.import_array()
def memview_malloc(int N):
    cdef int * m = <int *>malloc(N * sizeof(int))
    cdef int[::1] b = <int[:N]>m
    free(<void *>m)
def memview_ndarray(int N):
    cdef int[::1] b = np.empty(N, dtype=np.int32)
def memview_cyarray(int N):
    cdef int[::1] b = view.array(shape=(N,), itemsize=sizeof(int), format="i")
What is surprising to me is that in all three cases, Cython generates quite a lot of code for the memory allocation, in particular a call to __Pyx_PyObject_to_MemoryviewSlice_dc_int. This suggests (and I might be wrong here, my insight into the inner workings of Cython is very limited) that it first creates a Python object and then "casts" it into a memory view, which seems like unnecessary overhead.
A simple benchmark doesn't reveal much difference between the three methods, with 2. being the fastest by a thin margin.
Which of the three methods is recommended? Or is there a different, better option?
Follow-up question: I want to finally return the result as a np.ndarray, after having worked with that memory view in the function. Is a typed memory view the best choice or would I rather just use the old buffer interface as below to create an ndarray in the first place?
cdef np.ndarray[DTYPE_t, ndim=1] b = np.empty(N, dtype=np.int32)
Look here for an answer.
The basic idea is that you want cpython.array.array and cpython.array.clone (not cython.array.*):
from cpython.array cimport array, clone
# This type is what you want and can be cast to things of
# the "double[:]" syntax, so no problems there
cdef array[double] armv, templatemv
templatemv = array('d')
# This is fast
armv = clone(templatemv, L, False)
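For comparison, a pure-Python approximation of what `clone` provides: a new array with the template's typecode and a given length. This is only a sketch; `clone_zeroed` is a hypothetical helper, and unlike the call above with `False` as the last argument, it always zero-fills because plain Python has no way to leave the buffer uninitialised.

```python
from array import array


def clone_zeroed(template, length):
    """Return a zero-filled array with the template's typecode."""
    # A bytes object of the right size is interpreted as `length` zero items.
    return array(template.typecode, bytes(template.itemsize * length))


template = array('d')
arr = clone_zeroed(template, 4)
view = memoryview(arr)   # the buffer protocol is what double[:] builds on
```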
EDIT
It turns out that the benchmarks in that thread were rubbish. Here's my set, with my timings:
# cython: language_level=3
# cython: boundscheck=False
# cython: wraparound=False
import time
import sys
from cpython.array cimport array, clone
from cython.view cimport array as cvarray
from libc.stdlib cimport malloc, free
import numpy as numpy
cimport numpy as numpy
cdef int loops
def timefunc(name):
    def timedecorator(f):
        cdef int L, i
        print("Running", name)
        for L in [1, 10, 100, 1000, 10000, 100000, 1000000]:
            start = time.clock()
            f(L)
            end = time.clock()
            print(format((end-start) / loops * 1e6, "2f"), end=" ")
            sys.stdout.flush()
        print("μs")
    return timedecorator
print()
print("INITIALISATIONS")
loops = 100000
@timefunc("cpython.array buffer")
def _(int L):
    cdef int i
    cdef array[double] arr, template = array('d')
    for i in range(loops):
        arr = clone(template, L, False)
    # Prevents dead code elimination
    str(arr[0])
@timefunc("cpython.array memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr
    cdef array template = array('d')
    for i in range(loops):
        arr = clone(template, L, False)
    # Prevents dead code elimination
    str(arr[0])
@timefunc("cpython.array raw C type")
def _(int L):
    cdef int i
    cdef array arr, template = array('d')
    for i in range(loops):
        arr = clone(template, L, False)
    # Prevents dead code elimination
    str(arr[0])
@timefunc("numpy.empty_like memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr
    template = numpy.empty((L,), dtype='double')
    for i in range(loops):
        arr = numpy.empty_like(template)
    # Prevents dead code elimination
    str(arr[0])
@timefunc("malloc")
def _(int L):
    cdef int i
    cdef double* arrptr
    for i in range(loops):
        arrptr = <double*> malloc(sizeof(double) * L)
        free(arrptr)
    # Prevents dead code elimination
    str(arrptr[0])
@timefunc("malloc memoryview")
def _(int L):
    cdef int i
    cdef double* arrptr
    cdef double[::1] arr
    for i in range(loops):
        arrptr = <double*> malloc(sizeof(double) * L)
        arr = <double[:L]>arrptr
        free(arrptr)
    # Prevents dead code elimination
    str(arr[0])
@timefunc("cvarray memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr
    for i in range(loops):
        arr = cvarray((L,),sizeof(double),'d')
    # Prevents dead code elimination
    str(arr[0])
print()
print("ITERATING")
loops = 1000
@timefunc("cpython.array buffer")
def _(int L):
    cdef int i
    cdef array[double] arr = clone(array('d'), L, False)
    cdef double d
    for i in range(loops):
        for i in range(L):
            d = arr[i]
    # Prevents dead-code elimination
    str(d)
@timefunc("cpython.array memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr = clone(array('d'), L, False)
    cdef double d
    for i in range(loops):
        for i in range(L):
            d = arr[i]
    # Prevents dead-code elimination
    str(d)
@timefunc("cpython.array raw C type")
def _(int L):
    cdef int i
    cdef array arr = clone(array('d'), L, False)
    cdef double d
    for i in range(loops):
        for i in range(L):
            d = arr[i]
    # Prevents dead-code elimination
    str(d)
@timefunc("numpy.empty_like memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr = numpy.empty((L,), dtype='double')
    cdef double d
    for i in range(loops):
        for i in range(L):
            d = arr[i]
    # Prevents dead-code elimination
    str(d)
@timefunc("malloc")
def _(int L):
    cdef int i
    cdef double* arrptr = <double*> malloc(sizeof(double) * L)
    cdef double d
    for i in range(loops):
        for i in range(L):
            d = arrptr[i]
    free(arrptr)
    # Prevents dead-code elimination
    str(d)
@timefunc("malloc memoryview")
def _(int L):
    cdef int i
    cdef double* arrptr = <double*> malloc(sizeof(double) * L)
    cdef double[::1] arr = <double[:L]>arrptr
    cdef double d
    for i in range(loops):
        for i in range(L):
            d = arr[i]
    free(arrptr)
    # Prevents dead-code elimination
    str(d)
@timefunc("cvarray memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr = cvarray((L,),sizeof(double),'d')
    cdef double d
    for i in range(loops):
        for i in range(L):
            d = arr[i]
    # Prevents dead-code elimination
    str(d)
Output:
INITIALISATIONS
Running cpython.array buffer
0.100040 0.097140 0.133110 0.121820 0.131630 0.108420 0.112160 μs
Running cpython.array memoryview
0.339480 0.333240 0.378790 0.445720 0.449800 0.414280 0.414060 μs
Running cpython.array raw C type
0.048270 0.049250 0.069770 0.074140 0.076300 0.060980 0.060270 μs
Running numpy.empty_like memoryview
1.006200 1.012160 1.128540 1.212350 1.250270 1.235710 1.241050 μs
Running malloc
0.021850 0.022430 0.037240 0.046260 0.039570 0.043690 0.030720 μs
Running malloc memoryview
1.640200 1.648000 1.681310 1.769610 1.755540 1.804950 1.758150 μs
Running cvarray memoryview
1.332330 1.353910 1.358160 1.481150 1.517690 1.485600 1.490790 μs
ITERATING
Running cpython.array buffer
0.010000 0.027000 0.091000 0.669000 6.314000 64.389000 635.171000 μs
Running cpython.array memoryview
0.013000 0.015000 0.058000 0.354000 3.186000 33.062000 338.300000 μs
Running cpython.array raw C type
0.014000 0.146000 0.979000 9.501000 94.160000 916.073000 9287.079000 μs
Running numpy.empty_like memoryview
0.042000 0.020000 0.057000 0.352000 3.193000 34.474000 333.089000 μs
Running malloc
0.002000 0.004000 0.064000 0.367000 3.599000 32.712000 323.858000 μs
Running malloc memoryview
0.019000 0.032000 0.070000 0.356000 3.194000 32.100000 327.929000 μs
Running cvarray memoryview
0.014000 0.026000 0.063000 0.351000 3.209000 32.013000 327.890000 μs
(The reason for the "iterations" benchmark is that some methods have surprisingly different characteristics in this respect.)
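If you want to rerun this kind of comparison on a current interpreter, note that `time.clock` was removed in Python 3.8, so `time.perf_counter` is the usual replacement. The harness below is a minimal pure-Python sketch of the same timing loop; the three allocators are stand-ins chosen for this illustration, not the Cython variants measured above, so the numbers it prints are not comparable with that table.

```python
import time
from array import array


def time_call(func, size, loops):
    """Average time of func(size) in microseconds over `loops` calls."""
    start = time.perf_counter()
    for _ in range(loops):
        func(size)
    return (time.perf_counter() - start) / loops * 1e6


def alloc_array(size):
    return array('d', [0.0]) * size


def alloc_bytearray_view(size):
    return memoryview(bytearray(8 * size)).cast('d')


def alloc_list(size):
    return [0.0] * size


def run(sizes=(1, 10, 100, 1000), loops=1000):
    candidates = (
        ("array.array", alloc_array),
        ("bytearray memoryview", alloc_bytearray_view),
        ("list", alloc_list),
    )
    results = {}
    for name, func in candidates:
        results[name] = [time_call(func, size, loops) for size in sizes]
    return results


if __name__ == "__main__":
    for name, timings in run().items():
        print(name, " ".join(format(t, ".3f") for t in timings), "μs")
```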
In order of initialisation speed:
malloc: This is a harsh world, but it's fast. If you need to allocate a lot of things and have unhindered iteration and indexing performance, this has to be it. But normally a good bet is...
cpython.array raw C type: Well damn, it's fast. And it's safe. Unfortunately it goes through Python to access its data fields. You can avoid that by using a wonderful trick:
arr.data.as_doubles[i]
which brings it up to the standard speed while removing safety! This makes this a wonderful replacement for malloc, being basically a pretty reference-counted version!
cpython.array buffer: Coming in at only three to four times the setup time of malloc, this looks like a wonderful bet. Unfortunately it has significant overhead when iterating (albeit small compared to the boundscheck and wraparound directives). That means it only really competes against full-safety variants, but it is the fastest of those to initialise. Your choice.
cpython.array memoryview: This is now an order of magnitude slower than malloc to initialise. That's a shame, but it iterates just as fast. This is the standard solution that I would suggest unless boundscheck or wraparound are on (in which case cpython.array buffer might be a more compelling tradeoff).
The rest. The only one worth anything is numpy's, due to the many fun methods attached to the objects. That's it, though.
As a follow-up to Veedrac's answer: be aware that using the memoryview support of cpython.array with Python 2.7 currently appears to lead to memory leaks. This seems to be a long-standing issue, as it is mentioned on the cython-users mailing list here in a post from November 2012. Running Veedrac's benchmark script with Cython version 0.22 with both Python 2.7.6 and Python 2.7.9 leads to a large memory leak when initialising a cpython.array using either a buffer or memoryview interface. No memory leaks occur when running the script with Python 3.4. I've filed a bug report on this to the Cython developers mailing list.
I can't seem to do simple things like add a value to an array of values stored in a memory view. I understand that is not what a typed memory view is supposed to do. But converting the memory view back to an np.array is slower than tortoises herding cats.
When I try to write a cdef function like:
cdef double[::1] _add(self,double[::1] arr,double val):
    cdef double[::1] newarr
    cdef int i, n
    #n = sizeof(arr)/sizeof(arr[0])
    newarr = np.empty(5)
    for i in xrange(n):
        newarr[i] = arr[i] + val
    return newarr
I get errors that say that the memoryviews are not contiguous.
"ValueError: Buffer and memoryview are not contiguous in the same dimension."
This actually does work if the memory view that is passed is not one that has been sliced. But it adds 10 seconds to the process!
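The contiguity complaint can be reproduced with Python's built-in memoryview, which may help to see why a `double[::1]` argument rejects a sliced view. This is a plain-Python sketch under that analogy; `add_scalar` is a hypothetical counterpart of `_add` that takes its length from the view itself and accepts strided input.

```python
from array import array


def add_scalar(view, val):
    """Return a new contiguous array holding view[i] + val."""
    n = len(view)                      # the length comes from the view itself
    out = array('d', [0.0]) * n
    for i in range(n):
        out[i] = view[i] + val
    return out


base = array('d', [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
full = memoryview(base)                # contiguous: stride equals itemsize
strided = full[::2]                    # every second element: not contiguous
# Copying the strided view gives a packed buffer that is contiguous again.
packed = memoryview(array('d', strided.tolist()))
```

In Cython terms, declaring the argument as `double[:]` instead of `double[::1]` accepts the strided case directly, at the cost of a stride multiplication per access.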
I use numexpr for fast math on large arrays, but if the size of the array is less than the CPU cache, writing my code in Cython using simple array math is way faster, especially if the function is called multiple times.
The issue is, how do you work with arrays in Cython, or more explicitly: is there a direct interface to Python's array.array type in Cython? What I would like to do is something like this (simple example)
cpdef array[double] running_sum(array[double] arr):
    cdef int i
    cdef int n = len(arr)
    cdef array[double] out = new_array_zeros(1.0, n)
    ... # some error checks
    out[0] = arr[0]
    for i in xrange(1,n-1):
        out[i] = out[i-1] + arr[i]
    return(out)
I first tried the Cython NumPy wrapper and worked with ndarrays, but creating them seems very costly for small 1D arrays compared with creating a C array with malloc (where memory handling becomes a pain).
Thanks!
You can roll your own simple array with basic functions and checks. Here is a mock-up to start with:
from libc.stdlib cimport malloc,free
cdef class SimpleArray:
    cdef double * handle
    cdef public int length
    def __init__(SimpleArray self, int n):
        self.handle = <double*>malloc(n * sizeof(double))
        self.length = n
    def __getitem__(self, int idx):
        # reject negative indices as well as indices past the end
        if 0 <= idx < self.length:
            return self.handle[idx]
        raise ValueError("Invalid Idx")
    def __dealloc__(SimpleArray self):
        free(self.handle)
cpdef SimpleArray running_sum(SimpleArray arr):
    cdef int i
    cdef SimpleArray out = SimpleArray(arr.length)
    out.handle[0] = arr.handle[0]
    # start at index 1 and include the last element
    for i in range(1, arr.length):
        out.handle[i] = out.handle[i-1] + arr.handle[i]
    return out
can be used as
>>> import test
>>> simple = test.SimpleArray(100)
>>> del simple
>>> test.running_sum(test.SimpleArray(100))
<test.SimpleArray object at 0x1002a90b0>
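For reference, the running sum itself can be checked against a short pure-Python version built on `array.array`. This is only a sketch for verifying the loop bounds and the empty-input case; it is not meant to match the speed of the Cython class.

```python
from array import array


def running_sum(arr):
    """Cumulative sum of a sequence of doubles, returned as array('d')."""
    n = len(arr)
    out = array('d', [0.0]) * n
    if n == 0:
        return out
    out[0] = arr[0]
    # Indices 1 .. n-1 inclusive, so the last element is covered as well.
    for i in range(1, n):
        out[i] = out[i - 1] + arr[i]
    return out
```

Comparing a few outputs of this version with the compiled one is a cheap way to catch off-by-one errors in the loop range.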