I need to get an overview of the performance one can get from using Cython in high-performance numerical code. One of the things I am interested in is finding out whether an optimizing C compiler can vectorize code generated by Cython. So I decided to write the following small example:
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef int f(np.ndarray[int, ndim=1] f):
    cdef int array_length = f.shape[0]
    cdef int sum = 0
    cdef int k
    for k in range(array_length):
        sum += f[k]
    return sum
I know that there are NumPy functions that do the job, but I would like a simple piece of code in order to understand what is possible with Cython. It turns out that the code compiled with:
from distutils.core import setup
from Cython.Build import cythonize
setup(ext_modules = cythonize("sum.pyx"))
and built with:
python setup.py build_ext --inplace
produces C code that looks like this for the loop:
for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_1; __pyx_t_2 += 1) {
  __pyx_v_sum = __pyx_v_sum + (*(int *)((char *) __pyx_pybuffernd_f.rcbuffer->pybuffer.buf +
      __pyx_t_2 * __pyx_pybuffernd_f.diminfo[0].strides));
}
The main problem with this code is that the compiler does not know at compile time that __pyx_pybuffernd_f.diminfo[0].strides is such that the elements of the array are close together in memory. Without that information, the compiler cannot vectorize efficiently.
Is there a way to do such a thing from Cython?
You have two problems in your code (use cython's -a option to make them visible):
The indexing of numpy array isn't efficient
You have forgotten int in cdef sum=0
Taking this into account we get:
cpdef int f(np.ndarray[np.int_t] f):  ##HERE
    assert f.dtype == np.int
    cdef int array_length = f.shape[0]
    cdef int sum = 0  ##HERE
    cdef int k
    for k in range(array_length):
        sum += f[k]
    return sum
For the loop, the following code is generated:
int __pyx_t_5;
int __pyx_t_6;
Py_ssize_t __pyx_t_7;
....
__pyx_t_5 = __pyx_v_array_length;
for (__pyx_t_6 = 0; __pyx_t_6 < __pyx_t_5; __pyx_t_6 += 1) {
  __pyx_v_k = __pyx_t_6;
  __pyx_t_7 = __pyx_v_k;
  __pyx_v_sum = (__pyx_v_sum + (*__Pyx_BufPtrStrided1d(__pyx_t_5numpy_int_t *, __pyx_pybuffernd_f.rcbuffer->pybuffer.buf, __pyx_t_7, __pyx_pybuffernd_f.diminfo[0].strides)));
}
This is not that bad, but not as easy for the optimizer to handle as normal code written by a human. As you have already pointed out, __pyx_pybuffernd_f.diminfo[0].strides isn't known at compile time, and this prevents vectorization.
However, you would get better results using typed memoryviews, i.e.:
cpdef int mf(int[::1] f):
    cdef int array_length = len(f)
    ...
which leads to less opaque C code - code that, at least with my compiler, can be optimized better:
__pyx_t_2 = __pyx_v_array_length;
for (__pyx_t_3 = 0; __pyx_t_3 < __pyx_t_2; __pyx_t_3 += 1) {
  __pyx_v_k = __pyx_t_3;
  __pyx_t_4 = __pyx_v_k;
  __pyx_v_sum = (__pyx_v_sum + (*((int *) ( /* dim=0 */ ((char *) (((int *) __pyx_v_f.data) + __pyx_t_4)) ))));
}
The most crucial thing here is that we make it clear to Cython that the memory is contiguous, i.e. int[::1] as opposed to int[:], which is how numpy arrays are seen and for which a possible stride != 1 must be taken into account.
In this case, the Cython-generated C code compiles to the same assembly as the code I would have written by hand. As crisb has pointed out, adding -march=native would lead to vectorization, but in that case the assembly of the two functions would differ slightly again.
However, in my experience, compilers quite often have problems optimizing loops created by Cython, and/or it is easy to miss a detail that prevents the generation of really good C code. So my strategy for workhorse loops is to write them in plain C and use Cython for wrapping/accessing them - it is often somewhat faster, because one can use dedicated compiler flags for this code snippet without affecting the whole Cython module.
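To illustrate that last point, here is a minimal sketch of the wrapping strategy - the file name sum_impl and the function name c_sum are made up for illustration, and the C file is assumed to contain the obvious summation loop:

cdef extern from "sum_impl.h":
    int c_sum(const int *data, int n)

def fast_sum(int[::1] f):
    # hand a pointer to the contiguous buffer straight to the C loop
    return c_sum(&f[0], f.shape[0])

The C file can then be compiled with its own dedicated flags (say, -O3 -march=native) and linked into the extension, without those flags applying to the rest of the module.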
I tried to compare the performance of Python and ctypes versions of the sum function. I found that Python is faster than the ctypes version.
Sum.c file:
int our_function(int num_numbers, int *numbers) {
    int i;
    int sum = 0;
    for (i = 0; i < num_numbers; i++) {
        sum += numbers[i];
    }
    return sum;
}

int our_function2(int num1, int num2) {
    return num1 + num2;
}
I compiled it to a shared library:
gcc -shared -o Sum.so sum.c
Then I imported both the shared library and ctypes to use the C function. Below Sum.py:
import ctypes

_sum = ctypes.CDLL('./Sum.so')
_sum.our_function.argtypes = (ctypes.c_int, ctypes.POINTER(ctypes.c_int))

def our_function_c(numbers):
    global _sum
    num_numbers = len(numbers)
    array_type = ctypes.c_int * num_numbers
    result = _sum.our_function(ctypes.c_int(num_numbers), array_type(*numbers))
    return int(result)

def our_function_py(numbers):
    sum = 0
    for i in numbers:
        sum += i
    return sum
import time
start = time.time()
print(our_function_c([1, 2, 3]))
end = time.time()
print("time taken C", end-start)
start1 = time.time()
print(our_function_py([1, 2, 3]))
end1 = time.time()
print("time taken py", end1-start1)
Output:
6
time taken C 0.0010006427764892578
6
time taken py 0.0
For a larger list, like list(range(int(1e5))):
start = time.time()
print(our_function_c(list(range(int(1e5)))))
end = time.time()
print("time taken C", end-start)
start1 = time.time()
print(our_function_py(list(range(int(1e5)))))
end1 = time.time()
print("time taken py", end1-start1)
Output (note that the C result is wrong here: the true sum, 4999950000, overflows the C function's 32-bit int):
704982704
time taken C 0.011005163192749023
4999950000
time taken py 0.00500178337097168
Question: I tried using more numbers, but Python still beats ctypes in terms of performance. So my question is: is there a rule of thumb for when I should move to ctypes over Python (in terms of the order of magnitude of code)? Also, what is the cost of converting Python objects to ctypes, please?
Why
Well, yes, in such a case it is not really worth it, because before calling the C function, you spend a lot of time converting the numbers into c_int - and a conversion is no less expensive than an addition.
Usually we use ctypes when either the data are generated on the C side, or when we generate them from Python but then use them for more than one simple operation.
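To see where the time goes, you can time the conversion alone, independent of the call - a minimal sketch reusing the array construction from the question:

import ctypes
import time

lst = list(range(10**5))
start = time.time()
arr = (ctypes.c_int * len(lst))(*lst)  # one Python-int-to-c_int conversion per element
end = time.time()
print("conversion only", end - start)

This conversion alone typically costs on the order of the pure-Python summation loop itself.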
Same with pandas
This is, for example, what happens with numpy or pandas - two well-known examples of libraries in C (or compiled, anyway) that allow huge time gains (on the order of 1000×), as long as the data don't go back and forth between C space and Python space.
Numpy is faster than lists for many operations, for example - as long as you don't count a data conversion for each atomic operation.
Pandas often works with data read from CSV by pandas itself; the data stay in pandas space.
import time
import pandas as pd

lst = list(range(1000000))

start1 = time.time()
s1 = 0
for x in lst:
    s1 += x
end1 = time.time()

start2 = time.time()
df = pd.DataFrame({'x': lst})
middle2 = time.time()
s2 = df.x.sum()
end2 = time.time()

print("python", s1, "t=", end1-start1)
print("pandas", s2, "t=", end2-start2, end2-middle2)
python 499999500000 t= 0.13175106048583984
pandas 499999500000 t= 0.35060644149780273 0.0020313262939453125
As you see, by this standard pandas is also way slower than pure Python - but way faster if you don't count data creation.
Faster without data conversion
Try running your code this way:
import time
lst=list(range(1000))*1000
c_lst = (ctypes.c_int * len(lst))(*lst)
c_num = ctypes.c_int(len(lst))
start = time.time()
print(int(_sum.our_function(c_num, c_lst)))
end = time.time()
print("time taken C", end-start)
start1 = time.time()
print(our_function_py(lst))
end1 = time.time()
print("time taken py", end1-start1)
And the C code is way faster. So, as with pandas, it isn't worth it if all you really need from it is one summation, after which you forget the data.
No such problem with c-extension
Note that with a Python C extension, which allows C functions to handle Python types, you don't have this problem (although it is often less efficient, because, well, Python lists are not just the int * that C loves; but at least you don't need a conversion to C done from Python).
That is why you may sometimes see libraries for which, even counting conversion, calling the external library is faster.
import numpy as np
np.array(lst).sum()
for example, is slightly faster - but only slightly, when we are used to numpy being 1000× faster. That is because numpy.array can help itself directly to the Python list's data.
But that is no longer ctypes as such (by ctypes, I mean "using C functions from the C world, handling C data, not caring about Python at all"). Plus, I am not even sure that this is the only reason: numpy might be cheating, using several threads and vectorization, which neither Python nor your C code does.
Example that needs no big data conversion
So let's add another example; add this to your C code:
int sumFirst(int n){
    int s = 0;
    for(int i = 0; i < n; i++){
        s += i;
    }
    return s;
}
And try it with
import ctypes

_sum = ctypes.CDLL('./ctypeBench.so')
_sum.sumFirst.argtypes = (ctypes.c_int,)

def c_sumFirst(n):
    return _sum.sumFirst(ctypes.c_int(n))

import time

lst = list(range(10000))

start1 = time.time()
s1 = 0
for x in lst:
    s1 += x
end1 = time.time()

start2 = time.time()
s2 = c_sumFirst(10000)
end2 = time.time()

print(f"python {s1=}, Δt={end1-start1}")
print(f"c {s2=}, Δt={end2-start2}")
Result is
python s1=49995000, Δt=0.0012884140014648438
c s2=49995000, Δt=4.267692565917969e-05
And note that I was fair to Python: I did not count data generation in its time (I explicitly built the list beforehand, which doesn't change much anyway).
So the conclusion is: you can't expect a ctypes function to gain time on a single operation per datum, such as +, when you need one conversion per datum just to use it.
Either you need to use a C extension and write ad hoc code that handles a Python list (and even there, you won't gain much if you have just one addition to do per value);
or you need to keep the data on the C side, creating them from C and leaving them there (like you do with pandas or numpy: you use DataFrames or ndarrays as much as possible with pandas and numpy functions or operators, not pulling all the values back into Python with full indexation or .iloc);
or you need to have really more than one addition per datum to do.
Addendum: c-extension
Just to add another argument in favor of "the problem is conversion", but also to make explicit what to do if you really need a single simple operation on a list and don't want to convert every element beforehand, you can try this.
modmy.c
#define PY_SSIZE_T_CLEAN
#include <Python.h>
#define PY3K

static PyObject *mysum(PyObject *self, PyObject *args){
    PyObject *list;
    PyArg_ParseTuple(args, "O", &list);
    PyObject *it = PyObject_GetIter(list);
    long int s = 0;
    for(;;){
        PyObject *v = PyIter_Next(it);
        if(!v) break;
        long int iv = PyLong_AsLong(v);
        Py_DECREF(v);  /* PyIter_Next returns a new reference */
        s += iv;
    }
    Py_DECREF(it);
    return PyLong_FromLong(s);
}
static PyMethodDef MyMethods[] = {
    {"mysum", mysum, METH_VARARGS, "Sum list"},
    {NULL, NULL, 0, NULL} /* Sentinel */
};

static struct PyModuleDef modmy = {
    PyModuleDef_HEAD_INIT,
    "modmy",
    NULL,
    -1,
    MyMethods
};

PyMODINIT_FUNC
PyInit_modmy()
{
    return PyModule_Create(&modmy);
}
Compile with
gcc -fPIC `python3-config --cflags` -c modmy.c
gcc -shared -fPIC `python3-config --ldflags` -o modmy.so modmy.o
Then
import time
import modmy

lst = list(range(10000000))

start1 = time.time()
s1 = 0
for x in lst:
    s1 += x
end1 = time.time()

start2 = time.time()
s2 = modmy.mysum(lst)
end2 = time.time()

print("python res=%d t=%5.2f"%(s1, end1-start1))
print("c res=%d t=%5.2f"%(s2, end2-start2))
This time there is no need for conversion (or, to be more accurate, yes, there is still a need for conversion, but it is done by the C code, since this is not just any C code but code written ad hoc to extend Python; and after all, the Python interpreter, under the hood, also needs to unpack the elements).
Note that my code checks nothing. It assumes that you are really calling mysum with a single argument that is a list of integers. God knows what happens if you don't. Well, not just God. Just try:
>>> import modmy
>>> modmy.mysum(12)
Segmentation fault (core dumped)
$
Python crashes (not just your Python code - it is not a Python error, the whole Python process dies).
But the result is worth it:
python res=49999995000000 t= 1.22
c res=49999995000000 t= 0.11
So you see, this time C wins, because the rules are really the same (both versions do the same thing; C just does it faster).
So you need to know what you are doing. But then this does what you expected: a very simple operation on a list of integers that runs faster in C than in Python.
I'm trying to make a Cython-built slice-sampling library - a generic slice-sampling library where you supply a log-density and a starter value and get a result. I am working on the univariate model now. Based on the response here, I've come up with the following.
So I have a function defined in cSlice.pyx:
cdef double univariate_slice_sample(f_type_1 logd, double starter,
                                    double increment_size = 0.5):
    # some stuff
    return value
I have defined in cSlice.pxd:
ctypedef double (*f_type_1)(double)

cdef double univariate_slice_sample(f_type_1 logd, double starter,
                                    double increment_size = *)
where logd is a generic univariate log-density.
In my distribution file, let's say cDistribution.pyx, I have the following:
from cSlice cimport univariate_slice_sample, f_type_1
cdef double log_distribution(alpha_k, y_k, prior):
some stuff
return value
cdef double _sample_alpha_k_slice(
double starter,
double[:] y_k,
Prior prior,
double increment_size
):
cdef f_type_1 f = lambda alpha_k: log_distribution(alpha_k), y_k, prior)
return univariate_slice_sample(f, starter, increment_size)
cpdef double sample_alpha_k_slice(
        double starter,
        double[:] y_1,
        Prior prior,
        double increment_size = 0.5
        ):
    return _sample_alpha_k_slice(starter, y_1, prior, increment_size)

The wrapper exists because apparently lambdas aren't allowed in cpdef functions.
When I try compiling the distribution file, I get the following:
cDistribution.pyx:289:22: Cannot convert Python object to 'f_type_1'
pointing at the cdef f_type_1 f = ... line.
I'm unsure of what else to do. I want this code to maintain C speed and, importantly, not to touch the GIL. Any ideas?
You can jit a C callback/wrapper for any Python function (a cast from a Python object to a pointer cannot be done implicitly), as explained for example in this SO post.
However, at its core the function will stay a slow, pure-Python function. Numba gives you the possibility to create real C callbacks via @cfunc. Here is a simplified example:
from numba import cfunc

@cfunc("float64(float64)")
def id_(x):
    return x
and this is how it could be used:
%%cython
ctypedef double(*f_type)(double)

cdef void c_print_double(double x, f_type f):
    print(2.0*f(x))

import numba
expected_signature = numba.float64(numba.float64)

def print_double(double x, f):
    # check the signature of f:
    if not f._sig == expected_signature:
        raise TypeError("cfunc has not the right type")
    # it is not possible to cast a Python object to a pointer directly,
    # so we cast the address first to unsigned long long
    c_print_double(x, <f_type><unsigned long long int>(f.address))
And now:
print_double(1.0, id_)
# 2.0
We need to check the signature of the cfunc object at run time; otherwise the cast <f_type><unsigned long long int>(f.address) would also "work" for functions with the wrong signature - only to (possibly) crash during the call or produce funny, hard-to-debug errors. I'm just not sure that my method is the best, though - even if it works:
...

@cfunc("float32(float32)")
def id3_(x):
    return x

print_double(1.0, id3_)
# TypeError: cfunc has not the right type
I am fairly new to Cython, so this is probably fairly trivial, but I haven't been able to find the answer anywhere.
I've defined a struct type and I want to write a function that will initialize all the fields properly and return a pointer to the new struct.
from cpython.mem import PyMem_Malloc

ctypedef struct cell_t:
    DTYPE_t[2] min_bounds
    DTYPE_t[2] max_bounds
    DTYPE_t size
    bint is_leaf
    cell_t * children[4]
    DTYPE_t[2] center_of_mass
    UINT32_t count

cdef cell_t * make_cell(DTYPE_t[2] min_bounds, DTYPE_t[2] max_bounds):
    cdef cell_t * cell = <cell_t *>PyMem_Malloc(sizeof(cell_t))  # <- Fails here
    if not cell:
        MemoryError()
    cell.min_bounds[:] = min_bounds
    cell.max_bounds[:] = max_bounds
    cell.size = min_bounds[0] - max_bounds[0]
    cell.is_leaf = True
    cell.center_of_mass[:] = [0, 0]
    cell.count = 0
    return cell
However, when I try to compile this, I get the following two errors during compilation:
cdef cell_t * make_cell(DTYPE_t[2] min_bounds, DTYPE_t[2] max_bounds):
cdef cell_t * cell = <cell_t *>PyMem_Malloc(sizeof(cell_t))
^
Casting temporary Python object to non-numeric non-Python type
------------------------------------------------------------
cdef cell_t * make_cell(DTYPE_t[2] min_bounds, DTYPE_t[2] max_bounds):
cdef cell_t * cell = <cell_t *>PyMem_Malloc(sizeof(cell_t))
^
Storing unsafe C derivative of temporary Python reference
------------------------------------------------------------
Now, I've looked all over, and from what I can gather, cell is actually stored in a temporary variable that gets deallocated at the end of the function.
Any help would be greatly appreciated.
cell.min_bounds = min_bounds
This doesn't do what you think it does (although I'm not 100% sure what it does do). You need to copy arrays element by element:
cell.min_bounds[0] = min_bounds[0]
cell.min_bounds[1] = min_bounds[1]
Same for max_bounds.
The line that I suspect is giving you that error message is:
cell.center_of_mass = [0, 0]
This is trying to assign a Python list to a C array (remembering that arrays and pointers are somewhat interchangeable in C), which doesn't make much sense. Again, you'd do
cell.center_of_mass[0] = 0
cell.center_of_mass[1] = 0
All this is largely consistent with the C behaviour that there aren't operators to copy whole arrays into each other, you need to copy element by element.
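As an aside, if you prefer copying the whole block at once, you can cimport memcpy from libc.string - a minimal sketch, assuming DTYPE_t is a plain C type as in the question:

from libc.string cimport memcpy

# copy both elements of the C array in one call
memcpy(cell.min_bounds, min_bounds, 2 * sizeof(DTYPE_t))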
Edit:
However that's not your immediate problem. You haven't declared PyMem_Malloc so it's assumed to be a Python function. You should do
from cpython.mem cimport PyMem_Malloc
Make sure it's cimported, not imported
Edit2:
The following compiles fine for me:
from cpython.mem cimport PyMem_Malloc

ctypedef double DTYPE_t

ctypedef struct cell_t:
    DTYPE_t[2] min_bounds
    DTYPE_t[2] max_bounds

cdef cell_t * make_cell(DTYPE_t[2] min_bounds, DTYPE_t[2] max_bounds) except NULL:
    cdef cell_t * cell = <cell_t *>PyMem_Malloc(sizeof(cell_t))
    if not cell:
        raise MemoryError()
    return cell
I've cut down cell_t a bit (just to avoid having to make declarations of UINT32_t). I've also given the cdef function an except NULL to allow it to signal an error if needed and added a raise before MemoryError(). I don't think either of these changes are directly related to your error.
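For completeness, memory obtained from PyMem_Malloc must eventually be released with PyMem_Free; a minimal sketch of the matching cleanup (the helper name free_cell is made up for illustration):

from cpython.mem cimport PyMem_Free

cdef void free_cell(cell_t * cell):
    if cell != NULL:
        PyMem_Free(cell)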
I am trying to build a wrapper to a camera driver written in C using Cython. I am new to Cython (started two weeks ago). After a lot of struggle, I could successfully develop wrappers for structures and 1D arrays, but now I am stuck on 2D arrays.
One of the camera's C APIs takes a 2D array pointer as input and assigns the captured image to it. This function needs to be called from Python, and the output image needs to be processed/displayed in Python. After going through the Cython docs and various posts on Stack Overflow, I ended up more confused. I could not figure out how to pass 2D arrays between Python and C. The driver API looks (somewhat) like this:
driver.h
void assign_values2D(double **matrix, unsigned int row_size, unsigned int column_size);
c_driver.pxd
cdef extern from "driver.h":
    void assign_values2D(double **matrix, unsigned int row_size, unsigned int column_size)
test.pyx
from c_driver cimport assign_values2D
import numpy as np
cimport numpy as np
cimport cython
from libc.stdlib cimport malloc, free
import ctypes

@cython.boundscheck(False)
@cython.wraparound(False)
def assignValues2D(self, np.ndarray[np.double_t, ndim=2, mode='c'] mat):
    row_size, column_size = np.shape(mat)
    cdef np.ndarray[double, ndim=2, mode="c"] temp_mat = np.ascontiguousarray(mat, dtype=ctypes.c_double)
    cdef double ** mat_pointer = <double **>malloc(column_size * sizeof(double*))
    if not mat_pointer:
        raise MemoryError
    try:
        for i in range(row_size):
            mat_pointer[i] = &temp_mat[i, 0]
        assign_values2D(<double **> &mat_pointer[0], row_size, column_size)
        return np.array(mat)
    finally:
        free(mat_pointer)
test_camera.py
b = np.zeros((5,5), dtype=np.float) # sample code
print "B Before = "
print b
assignValues2D(b)
print "B After = "
print b
When compiled, it gives the error:
Error compiling Cython file:
------------------------------------------------------------
...
if not mat_pointer:
raise MemoryError
try:
for i in range(row_size):
mat_pointer[i] = &temp_mat[i, 0]
^
------------------------------------------------------------
test.pyx:120:21: Cannot take address of Python variable
In fact, the above code was taken from a Stack Overflow post. I have tried several other ways, but none of them worked. Please let me know how I can get the 2D image into Python. Thanks in advance.
You need to type i:

cdef int i

(Alternatively, you can type row_size, and that also works.)
Once it knows that i is an int, it can work out the type that indexing temp_mat gives, and so the & operator works.
Normally Cython is pretty good about figuring out the type of loop variables like i, but I think the issue is that it can't deduce the type of row_size, so it decides it can't deduce the type of i (which is deduced from range(row_size)), and because of that it can't deduce the type of temp_mat[i, 0].
I suspect you also want to change the return statement to return np.array(temp_mat) - the code you have will likely work most of the time, but occasionally np.ascontiguousarray will have to make a copy, and then mat won't be changed.
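Putting the pieces together, a sketch of the fixed function might look like this (note, as an aside, that the malloc in the question probably wants row_size rather than column_size entries, since one pointer is stored per row):

@cython.boundscheck(False)
@cython.wraparound(False)
def assignValues2D(np.ndarray[np.double_t, ndim=2, mode='c'] mat):
    cdef int i  # typing i is what makes &temp_mat[i, 0] legal
    cdef int row_size, column_size
    row_size, column_size = np.shape(mat)
    cdef np.ndarray[double, ndim=2, mode="c"] temp_mat = np.ascontiguousarray(mat, dtype=ctypes.c_double)
    cdef double ** mat_pointer = <double **>malloc(row_size * sizeof(double*))
    if not mat_pointer:
        raise MemoryError
    try:
        for i in range(row_size):
            mat_pointer[i] = &temp_mat[i, 0]
        assign_values2D(&mat_pointer[0], row_size, column_size)
        return np.array(temp_mat)  # temp_mat holds the result even if a copy was made
    finally:
        free(mat_pointer)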
Sorting an array of integers with numpy's quicksort has become the
bottleneck of my algorithm. Unfortunately, numpy does not have
radix sort yet.
Although counting sort would be a one-liner in numpy:
np.repeat(np.arange(1+x.max()), np.bincount(x))
(see the accepted answer to the question How can I vectorize this python count sort so it is absolutely as fast as it can be?), the integers
in my application can run from 0 to 2**32.
Am I stuck with quicksort?
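As a quick sanity check of the counting-sort one-liner above, and of why it breaks down here:

import numpy as np

x = np.array([1, 0, 3, 1])
counts = np.bincount(x)                    # array([1, 2, 0, 1])
np.repeat(np.arange(1 + x.max()), counts)  # array([0, 1, 1, 3])

With keys running up to 2**32, the counts array alone would need 2**32 entries (tens of gigabytes), which is why counting sort is not an option.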
This post was primarily motivated by the
[Numpy grouping using itertools.groupby performance](https://stackoverflow.com/q/4651683/341970)
question.
Also note that
[it is not merely OK to ask and answer your own question, it is explicitly encouraged.](https://blog.stackoverflow.com/2011/07/its-ok-to-ask-and-answer-your-own-questions/)
No, you are not stuck with quicksort. You could use, for example,
integer_sort from
Boost.Sort
or u4_sort from usort. When sorting this array:
array(randint(0, high=1<<32, size=10**8), uint32)
I get the following results:
NumPy quicksort:          8.636 s   1.0  (baseline)
Boost.Sort integer_sort:  4.327 s   2.0x speedup
usort u4_sort:            2.065 s   4.2x speedup
I would not jump to conclusions based on this single experiment and use
usort blindly. I would test with my actual data and measure what happens.
Your mileage will vary depending on your data and on your machine. The
integer_sort in Boost.Sort has a rich set of options for tuning, see the
documentation.
Below I describe two ways to call a native C or C++ function from Python. Despite the long description, it's fairly easy to do it.
Boost.Sort
Put these lines into the spreadsort.cpp file:
#include <cinttypes>
#include "boost/sort/spreadsort/spreadsort.hpp"

using namespace boost::sort::spreadsort;

extern "C" {
    void spreadsort(std::uint32_t* begin, std::size_t len) {
        integer_sort(begin, begin + len);
    }
}
It basically instantiates the templated integer_sort for 32 bit
unsigned integers; the extern "C" part ensures C linkage by disabling
name mangling.
Assuming you are using gcc and that the necessary include files of boost
are under the /tmp/boost_1_60_0 directory, you can compile it:
g++ -O3 -std=c++11 -march=native -DNDEBUG -shared -fPIC -I/tmp/boost_1_60_0 spreadsort.cpp -o spreadsort.so
The key flags are -fPIC to generate position-independent code and -shared to generate a shared object (.so) file. (Read the docs of gcc for further details.)
Then, you wrap the spreadsort() C++ function
in Python using ctypes:
from ctypes import cdll, c_size_t, c_uint32
from numpy import uint32
from numpy.ctypeslib import ndpointer

__all__ = ['integer_sort']

# In spreadsort.cpp: void spreadsort(std::uint32_t* begin, std::size_t len)
lib = cdll.LoadLibrary('./spreadsort.so')
sort = lib.spreadsort
sort.restype = None
sort.argtypes = [ndpointer(c_uint32, flags='C_CONTIGUOUS'), c_size_t]

def integer_sort(arr):
    assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
    sort(arr, arr.size)
Alternatively, you can use cffi:
from cffi import FFI
from numpy import uint32

__all__ = ['integer_sort']

ffi = FFI()
ffi.cdef('void spreadsort(uint32_t* begin, size_t len);')
C = ffi.dlopen('./spreadsort.so')

def integer_sort(arr):
    assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
    begin = ffi.cast('uint32_t*', arr.ctypes.data)
    C.spreadsort(begin, arr.size)
At the cdll.LoadLibrary() and ffi.dlopen() calls I assumed that the
path to the spreadsort.so file is ./spreadsort.so. Alternatively,
you can write
lib = cdll.LoadLibrary('spreadsort.so')
or
C = ffi.dlopen('spreadsort.so')
if you append the path to spreadsort.so to the LD_LIBRARY_PATH environment
variable. See also Shared Libraries.
Usage. In both cases you simply call the above Python wrapper function integer_sort()
with your numpy array of 32 bit unsigned integers.
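For example, a minimal sketch (any C-contiguous uint32 array works):

import numpy as np

arr = np.random.randint(0, 2**32, size=10**6, dtype=np.uint32)
integer_sort(arr)  # sorts arr in place
assert (arr[:-1] <= arr[1:]).all()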
usort
As for u4_sort, you can compile it as follows:
cc -DBUILDING_u4_sort -I/usr/include -I./ -I../ -I../../ -I../../../ -I../../../../ -std=c99 -fgnu89-inline -O3 -g -fPIC -shared -march=native u4_sort.c -o u4_sort.so
Issue this command in the directory where the u4_sort.c file is located.
(Probably there is a less hackish way but I failed to figure that out. I
just looked into the deps.mk file in the usort directory to find out
the necessary compiler flags and include paths.)
Then, you can wrap the C function as follows:
from cffi import FFI
from numpy import uint32

__all__ = ['integer_sort']

ffi = FFI()
ffi.cdef('void u4_sort(unsigned* a, const long sz);')
C = ffi.dlopen('u4_sort.so')

def integer_sort(arr):
    assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
    begin = ffi.cast('unsigned*', arr.ctypes.data)
    C.u4_sort(begin, arr.size)
In the above code, I assumed that the path to u4_sort.so has been
appended to the LD_LIBRARY_PATH environment variable.
Usage. As before with Boost.Sort, you simply call the above Python wrapper function integer_sort() with your numpy array of 32 bit unsigned integers.
A radix sort base 256 (1 byte) can generate a matrix of counts used to determine bucket sizes in one pass over the data, then takes 4 passes to do the sort. On my system, an Intel 2600K at 3.4 GHz, using a Visual Studio release build of a C++ program, it takes about 1.15 seconds to sort 10^8 pseudo-random unsigned 32-bit integers.
Looking at the u4_sort code mentioned in Ali's answer, the programs are similar, but u4_sort uses field sizes of {10,11,11}, taking 3 passes to sort the data and 1 pass to copy back, while this example uses field sizes of {8,8,8,8}, taking 4 passes to sort the data. The process is probably memory-bandwidth limited due to the random-access writes, so the optimizations in u4_sort, like macros for shifts and one loop with a fixed shift per field, don't help much. My results are better probably due to system and/or compiler differences. (Note that u8_sort is for 64-bit integers.)
Example code:
#include <cstddef>  // size_t
#include <cstdint>  // uint32_t
#include <utility>  // std::swap

// a is input array, b is working array
void RadixSort(uint32_t * a, uint32_t *b, size_t count)
{
    size_t mIndex[4][256] = {0};            // count / index matrix
    size_t i, j, m, n;
    uint32_t u;
    for(i = 0; i < count; i++){             // generate histograms
        u = a[i];
        for(j = 0; j < 4; j++){
            mIndex[j][(size_t)(u & 0xff)]++;
            u >>= 8;
        }
    }
    for(j = 0; j < 4; j++){                 // convert to indices
        m = 0;
        for(i = 0; i < 256; i++){
            n = mIndex[j][i];
            mIndex[j][i] = m;
            m += n;
        }
    }
    for(j = 0; j < 4; j++){                 // radix sort
        for(i = 0; i < count; i++){         // sort by current lsb
            u = a[i];
            m = (size_t)(u >> (j << 3)) & 0xff;
            b[mIndex[j][m]++] = u;
        }
        std::swap(a, b);                    // swap ptrs
    }
}
A radix sort with python/numba (0.23), based on @rcgldr's C program, multithreaded on a 2-core processor.
First, the radix sort in numba, with two global arrays for efficiency:
from threading import Thread
from pylab import *
from numba import jit

n = uint32(10**8)
m = n//2

if 'x1' not in locals(): x1 = array(randint(0, 1<<16, 2*n), uint16)  # to avoid regeneration
x2 = x1.copy()
x = frombuffer(x2, uint32)  # randint doesn't work with 32 bits here :(
y = empty_like(x)
nbits = 8
buffsize = 1 << nbits
mask = buffsize - 1

@jit(nopython=True, nogil=True)
def radix(x, y):
    xs = x.size
    dec = 0
    while dec < 32:
        u = np.zeros(buffsize, uint32)
        k = 0
        while k < xs:
            u[(x[k] >> dec) & mask] += 1
            k += 1
        j = t = 0
        for j in range(buffsize):
            b = u[j]
            u[j] = t
            t += b
        v = u.copy()
        k = 0
        while k < xs:
            j = (x[k] >> dec) & mask
            y[u[j]] = x[k]
            u[j] += 1
            k += 1
        x, y = y, x
        dec += nbits
Then the parallelization, made possible by the nogil option:
def para(nthreads=2):
    threads = [Thread(target=radix,
                      args=(x[i*n//nthreads:(i+1)*n//nthreads],
                            y[i*n//nthreads:(i+1)*n//nthreads]))
               for i in range(nthreads)]
    for t in threads: t.start()
    for t in threads: t.join()
@jit
def fuse(x, y):
    kl = 0
    kr = n//2
    k = 0
    while k < n:
        if y[kl] < x[kr]:
            x[k] = y[kl]
            kl += 1
            if kl == m: break
        else:
            x[k] = x[kr]
            kr += 1
        k += 1

def sort():
    para(2)
    y[:m] = x[:m]
    fuse(x, y)
Benchmarks:
In [24]: %timeit x2=x1.copy();x=frombuffer(x2,uint32) # time offset
1 loops, best of 3: 431 ms per loop
In [25]: %timeit x2=x1.copy();x=frombuffer(x2,uint32);x.sort()
1 loops, best of 3: 37.8 s per loop
In [26]: %timeit x2=x1.copy();x=frombuffer(x2,uint32);para(1)
1 loops, best of 3: 5.7 s per loop
In [27]: %timeit x2=x1.copy();x=frombuffer(x2,uint32);sort()
1 loops, best of 3: 4.02 s per loop
So a pure Python solution with a 10x gain (37 s -> 3.5 s, net of the copy offset) on my poor 1 GHz machine. It can be enhanced with more cores and multi-way fusion.