Simple Python Challenge: Fastest Bitwise XOR on Data Buffers

Challenge:
Perform a bitwise XOR on two equal-sized buffers. The buffers are required to be the Python str type, since this is traditionally the type for data buffers in Python. Return the resulting value as a str. Do this as fast as possible.
The inputs are two 1 megabyte (2**20 byte) strings.
The challenge is to substantially beat my inefficient algorithm using Python or existing third-party Python modules (relaxed rule: or create your own module). Marginal increases are useless.
from os import urandom
from numpy import frombuffer, bitwise_xor, byte

def slow_xor(aa, bb):
    a = frombuffer(aa, dtype=byte)
    b = frombuffer(bb, dtype=byte)
    c = bitwise_xor(a, b)
    r = c.tostring()
    return r

aa = urandom(2**20)
bb = urandom(2**20)

def test_it():
    for x in xrange(1000):
        slow_xor(aa, bb)

First Try
Using scipy.weave and SSE2 intrinsics gives a marginal improvement. The first invocation is a bit slower since the code needs to be loaded from the disk and cached; subsequent invocations are faster:
import numpy
import time
from os import urandom
from scipy import weave

SIZE = 2**20

def faster_slow_xor(aa, bb):
    b = numpy.fromstring(bb, dtype=numpy.uint64)
    numpy.bitwise_xor(numpy.frombuffer(aa, dtype=numpy.uint64), b, b)
    return b.tostring()

code = """
const __m128i* pa = (__m128i*)a;
const __m128i* pend = (__m128i*)(a + arr_size);
__m128i* pb = (__m128i*)b;
__m128i xmm1, xmm2;
while (pa < pend) {
    xmm1 = _mm_loadu_si128(pa); // must use unaligned access
    xmm2 = _mm_load_si128(pb);  // numpy will align at 16 byte boundaries
    _mm_store_si128(pb, _mm_xor_si128(xmm1, xmm2));
    ++pa;
    ++pb;
}
"""

def inline_xor(aa, bb):
    a = numpy.frombuffer(aa, dtype=numpy.uint64)
    b = numpy.fromstring(bb, dtype=numpy.uint64)
    arr_size = a.shape[0]
    weave.inline(code, ["a", "b", "arr_size"], headers=['"emmintrin.h"'])
    return b.tostring()
Second Try
Taking into account the comments, I revisited the code to find out if the copying could be avoided. Turns out I read the documentation of the string object wrong, so here goes my second try:
support = """
#define ALIGNMENT 16
static void memxor(const char* in1, const char* in2, char* out, ssize_t n) {
const char* end = in1 + n;
while (in1 < end) {
*out = *in1 ^ *in2;
++in1;
++in2;
++out;
}
}
"""
code2 = """
PyObject* res = PyString_FromStringAndSize(NULL, real_size);
const ssize_t tail = (ssize_t)PyString_AS_STRING(res) % ALIGNMENT;
const ssize_t head = (ALIGNMENT - tail) % ALIGNMENT;
memxor((const char*)a, (const char*)b, PyString_AS_STRING(res), head);
const __m128i* pa = (__m128i*)((char*)a + head);
const __m128i* pend = (__m128i*)((char*)a + real_size - tail);
const __m128i* pb = (__m128i*)((char*)b + head);
__m128i xmm1, xmm2;
__m128i* pc = (__m128i*)(PyString_AS_STRING(res) + head);
while (pa < pend) {
xmm1 = _mm_loadu_si128(pa);
xmm2 = _mm_loadu_si128(pb);
_mm_stream_si128(pc, _mm_xor_si128(xmm1, xmm2));
++pa;
++pb;
++pc;
}
memxor((const char*)pa, (const char*)pb, (char*)pc, tail);
return_val = res;
Py_DECREF(res);
"""
def inline_xor_nocopy(aa, bb):
real_size = len(aa)
a = numpy.frombuffer(aa, dtype=numpy.uint64)
b = numpy.frombuffer(bb, dtype=numpy.uint64)
return weave.inline(code2, ["a", "b", "real_size"],
headers = ['"emmintrin.h"'],
support_code = support)
The difference is that the string is allocated inside the C code. It's impossible to guarantee that it is aligned on a 16-byte boundary as required by the SSE2 instructions, therefore the unaligned memory regions at the beginning and the end are handled with byte-wise access (memxor).
The input data is handed in using numpy arrays anyway, because weave insists on copying Python str objects to std::strings. frombuffer doesn't copy, so this is fine, but the memory is not aligned on a 16-byte boundary, so we need to use _mm_loadu_si128 instead of the faster _mm_load_si128.
Instead of using _mm_store_si128, we use _mm_stream_si128, which makes sure that any writes are streamed to main memory as soon as possible; this way, the output array does not use up valuable cache lines.
Timings
As for the timings: the slow_xor entry in an earlier edit referred to my improved version (inline bitwise xor, uint64); I have removed that confusion. slow_xor refers to the code from the original question. All timings are done for 1000 runs.
slow_xor: 1.85s (1x)
faster_slow_xor: 1.25s (1.48x)
inline_xor: 0.95s (1.95x)
inline_xor_nocopy: 0.32s (5.78x)
The code was compiled using gcc 4.4.3 and I've verified that the compiler actually uses the SSE instructions.

Performance comparison: numpy vs. Cython vs. C vs. Fortran vs. Boost.Python (pyublas)
| function | time, usec | ratio | type |
|------------------------+------------+-------+--------------|
| slow_xor | 2020 | 1.0 | numpy |
| xorf_int16 | 1570 | 1.3 | fortran |
| xorf_int32 | 1530 | 1.3 | fortran |
| xorf_int64 | 1420 | 1.4 | fortran |
| faster_slow_xor | 1360 | 1.5 | numpy |
| inline_xor | 1280 | 1.6 | C |
| cython_xor | 1290 | 1.6 | cython |
| xorcpp_inplace (int32) | 440 | 4.6 | pyublas |
| cython_xor_vectorised | 325 | 6.2 | cython |
| inline_xor_nocopy | 172 | 11.7 | C |
| xorcpp | 144 | 14.0 | boost.python |
| xorcpp_inplace | 122 | 16.6 | boost.python |
To reproduce the results, download http://gist.github.com/353005 and type make (to install dependencies, type: sudo apt-get install build-essential python-numpy python-scipy cython gfortran; dependencies for Boost.Python and pyublas are not included because they require manual intervention to work)
Where:
slow_xor() is from the OP's question
faster_slow_xor(), inline_xor(), inline_xor_nocopy() are from @Torsten Marek's answer
cython_xor() and cython_xor_vectorised() are from @gnibbler's answer
And xor_$type_sig() are:
! xorf.f90.template
subroutine xor_$type_sig(a, b, n, out)
  implicit none
  integer, intent(in)              :: n
  $type, intent(in), dimension(n)  :: a
  $type, intent(in), dimension(n)  :: b
  $type, intent(out), dimension(n) :: out
  integer i
  forall(i=1:n) out(i) = ieor(a(i), b(i))
end subroutine xor_$type_sig
It is used from Python as follows:
import xorf  # extension module generated from xorf.f90.template
import numpy as np

def xor_strings(a, b, type_sig='int64'):
    assert len(a) == len(b)
    a = np.frombuffer(a, dtype=np.dtype(type_sig))
    b = np.frombuffer(b, dtype=np.dtype(type_sig))
    return getattr(xorf, 'xor_' + type_sig)(a, b).tostring()
xorcpp_inplace() (Boost.Python, pyublas):
xor.cpp:
#include <inttypes.h>
#include <algorithm>
#include <boost/lambda/lambda.hpp>
#include <boost/python.hpp>
#include <pyublas/numpy.hpp>

namespace {
namespace py = boost::python;

template<class InputIterator, class InputIterator2, class OutputIterator>
void
xor_(InputIterator first, InputIterator last,
     InputIterator2 first2, OutputIterator result) {
    // `result` might alias `first` but not any of the input iterators
    namespace ll = boost::lambda;
    (void)std::transform(first, last, first2, result, ll::_1 ^ ll::_2);
}

template<class T>
py::str
xorcpp_str_inplace(const py::str& a, py::str& b) {
    const size_t alignment = std::max(sizeof(T), 16ul);
    const size_t n = py::len(b);
    const char* ai = py::extract<const char*>(a);
    char* bi = py::extract<char*>(b);
    char* end = bi + n;
    if (n < 2*alignment)
        xor_(bi, end, ai, bi);
    else {
        assert(n >= 2*alignment);
        // applying Marek's algorithm to align
        const ptrdiff_t head = (alignment - ((size_t)bi % alignment)) % alignment;
        const ptrdiff_t tail = (size_t)end % alignment;
        xor_(bi, bi + head, ai, bi);
        xor_((const T*)(bi + head), (const T*)(end - tail),
             (const T*)(ai + head),
             (T*)(bi + head));
        if (tail > 0) xor_(end - tail, end, ai + (n - tail), end - tail);
    }
    return b;
}

template<class Int>
pyublas::numpy_vector<Int>
xorcpp_pyublas_inplace(pyublas::numpy_vector<Int> a,
                       pyublas::numpy_vector<Int> b) {
    xor_(b.begin(), b.end(), a.begin(), b.begin());
    return b;
}
}  // namespace

BOOST_PYTHON_MODULE(xorcpp)
{
    py::def("xorcpp_inplace", xorcpp_str_inplace<int64_t>);     // for strings
    py::def("xorcpp_inplace", xorcpp_pyublas_inplace<int32_t>); // for numpy
}
It is used from Python as follows:
import os
import xorcpp
a = os.urandom(2**20)
b = os.urandom(2**20)
c = xorcpp.xorcpp_inplace(a, b) # it calls xorcpp_str_inplace()

Here are my results for Cython:
slow_xor 0.456888198853
faster_xor 0.400228977203
cython_xor 0.232881069183
cython_xor_vectorised 0.171468019485
Vectorising in Cython shaves about 25% off the for loop on my computer; however, more than half the time is spent building the Python string (the return statement). I don't think the extra copy can be avoided (legally), as the array may contain null bytes.
The illegal way would be to pass in a Python string and mutate it in place, which would double the speed of the function.
xor.py
from time import time
from os import urandom
from numpy import frombuffer, bitwise_xor, byte, uint64
import pyximport; pyximport.install()
import xor_

def slow_xor(aa, bb):
    a = frombuffer(aa, dtype=byte)
    b = frombuffer(bb, dtype=byte)
    c = bitwise_xor(a, b)
    r = c.tostring()
    return r

def faster_xor(aa, bb):
    a = frombuffer(aa, dtype=uint64)
    b = frombuffer(bb, dtype=uint64)
    c = bitwise_xor(a, b)
    r = c.tostring()
    return r

aa = urandom(2**20)
bb = urandom(2**20)

def test_it():
    t = time()
    for x in xrange(100):
        slow_xor(aa, bb)
    print "slow_xor ", time() - t
    t = time()
    for x in xrange(100):
        faster_xor(aa, bb)
    print "faster_xor", time() - t
    t = time()
    for x in xrange(100):
        xor_.cython_xor(aa, bb)
    print "cython_xor", time() - t
    t = time()
    for x in xrange(100):
        xor_.cython_xor_vectorised(aa, bb)
    print "cython_xor_vectorised", time() - t

if __name__ == "__main__":
    test_it()
xor_.pyx
cdef char c[1048576]

def cython_xor(char *a, char *b):
    cdef int i
    for i in range(1048576):
        c[i] = a[i] ^ b[i]
    return c[:1048576]

def cython_xor_vectorised(char *a, char *b):
    cdef int i
    for i in range(131072):  # 1048576 bytes / 8 bytes per unsigned long long
        (<unsigned long long *>c)[i] = (<unsigned long long *>a)[i] ^ (<unsigned long long *>b)[i]
    return c[:1048576]

An easy speedup is to use a larger 'chunk':
def faster_xor(aa, bb):
    a = frombuffer(aa, dtype=uint64)
    b = frombuffer(bb, dtype=uint64)
    c = bitwise_xor(a, b)
    r = c.tostring()
    return r
with uint64 also imported from numpy, of course. I timed this with timeit at about 4 milliseconds, vs. 6 milliseconds for the byte version.
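For reference, one way to reproduce that kind of per-call measurement with the standard timeit module (a sketch; it assumes slow_xor, faster_xor, aa and bb are defined at module level as in the question):
from timeit import timeit
setup = "from __main__ import slow_xor, faster_xor, aa, bb"
# average seconds per call over 100 runs; expect numbers in the ballpark quoted above
print "faster_xor:", timeit("faster_xor(aa, bb)", setup=setup, number=100) / 100
print "slow_xor:  ", timeit("slow_xor(aa, bb)", setup=setup, number=100) / 100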

Your problem isn't the speed of NumPy's XOR method, but rather all of the buffering and data type conversions. Personally I suspect that the point of this post may really have been to brag about Python, because what you are doing here is processing THREE GIGABYTES of data in timeframes on par with non-interpreted languages, which are inherently faster.
The code below shows that even on my humble computer Python can XOR "aa" (1MB) and "bb" (1MB) into "c" (1MB) one thousand times (3GB total) in under two seconds. Seriously, how much more improvement do you want? Especially from an interpreted language! 80% of the time was spent calling "frombuffer" and "tostring". The actual XOR-ing is completed in the other 20% of the time. At 3GB in 2 seconds, you would be hard-pressed to improve upon that substantially even just using memcpy in C.
In case this was a real question, and not just covert bragging about Python, the answer is to code so as to minimize the number, amount and frequency of your type conversions, such as "frombuffer" and "tostring". The actual XOR-ing is lightning fast already.
from os import urandom
from numpy import frombuffer, bitwise_xor, byte, uint64

def slow_xor(aa, bb):
    a = frombuffer(aa, dtype=byte)
    b = frombuffer(bb, dtype=byte)
    c = bitwise_xor(a, b)
    r = c.tostring()
    return r

bb = urandom(2**20)
aa = urandom(2**20)

def test_it():
    for x in xrange(1000):
        slow_xor(aa, bb)

def test_it2():
    a = frombuffer(aa, dtype=uint64)
    b = frombuffer(bb, dtype=uint64)
    for x in xrange(1000):
        c = bitwise_xor(a, b)
        r = c.tostring()

test_it()
print 'Slow Complete.'
# 6 seconds
test_it2()
print 'Fast Complete.'
# under 2 seconds
Anyway, the "test_it2" above accomplishes exactly the same amount of xOr-ing as "test_it" does, but in 1/5 the time. 5x speed improvement should qualify as "substantial", no?

The fastest bitwise XOR is "^". I can type that much quicker than "bitwise_xor" ;-)

Python3 has int.from_bytes and int.to_bytes, thus:
x = int.from_bytes(b"a" * (1024*1024), "big")
y = int.from_bytes(b"b" * (1024*1024), "big")
(x ^ y).to_bytes(1024*1024, "big")
It's faster than the surrounding I/O, so it's kinda hard to test just how fast it is; it looks like 0.018..0.020 s on my machine. Strangely, "little"-endian conversion is a little faster.
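A rough way to check numbers in that ballpark yourself (a sketch; Python 3, and it times only the XOR plus the to_bytes conversion, not the initial from_bytes calls):
import os
import timeit
x = int.from_bytes(os.urandom(2**20), "big")
y = int.from_bytes(os.urandom(2**20), "big")
# average seconds per XOR-and-convert over 100 runs
print(timeit.timeit(lambda: (x ^ y).to_bytes(2**20, "big"), number=100) / 100)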
CPython 2.x has the underlying function _PyLong_FromByteArray, it's not exported but accessible via ctypes:
In [1]: import ctypes
In [2]: p = ctypes.CDLL(None)
In [3]: p["_PyLong_FromByteArray"]
Out[3]: <_FuncPtr object at 0x2cc6e20>
Python 2 details are left as exercise to the reader.

If you want to do fast operations on array data types, then you should try Cython (cython.org). If you give it the right declarations, it should be able to compile down to pure C code.

How badly do you need the answer as a string? Note that the c.tostring() method has to copy the data in c to a new string, as Python strings are immutable (and c is mutable). Python 2.6 and 3.1 have a bytearray type, which acts like str (bytes in Python 3.x) except for being mutable.
Another optimization is using the out parameter to bitwise_xor to specify where to store the result.
On my machine I get
slow_xor (int8): 5.293521 (100.0%)
outparam_xor (int8): 4.378633 (82.7%)
slow_xor (uint64): 2.192234 (41.4%)
outparam_xor (uint64): 1.087392 (20.5%)
with the code at the end of this post. Notice in particular that the method using a preallocated buffer is twice as fast as creating a new object (when operating on 8-byte (uint64) chunks). This is consistent with the slower method doing two operations per chunk (xor + copy) versus the faster method's one (just the xor).
Also, FWIW, a ^ b is equivalent to bitwise_xor(a,b), and a ^= b is equivalent to bitwise_xor(a, b, a).
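As a small self-contained illustration of that in-place form (a sketch; it XORs straight into a preallocated, writable bytearray, so no new string object is ever built):
import numpy as np
from os import urandom
out = bytearray(2**20)                               # preallocated, writable output buffer
a = np.frombuffer(urandom(2**20), dtype=np.uint64)   # read-only views are fine for the inputs
b = np.frombuffer(urandom(2**20), dtype=np.uint64)
c = np.frombuffer(out, dtype=np.uint64)              # writable view over the bytearray
c[:] = a
c ^= b                                               # same as bitwise_xor(c, b, c); result lands in out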
So, 5x speedup without writing any external modules :)
from time import time
from os import urandom
from numpy import frombuffer, bitwise_xor, byte, uint64

def slow_xor(aa, bb, ignore, dtype=byte):
    a = frombuffer(aa, dtype=dtype)
    b = frombuffer(bb, dtype=dtype)
    c = bitwise_xor(a, b)
    r = c.tostring()
    return r

def outparam_xor(aa, bb, out, dtype=byte):
    a = frombuffer(aa, dtype=dtype)
    b = frombuffer(bb, dtype=dtype)
    c = frombuffer(out, dtype=dtype)
    assert c.flags.writeable
    return bitwise_xor(a, b, c)

aa = urandom(2**20)
bb = urandom(2**20)
cc = bytearray(2**20)

def time_routine(routine, dtype, base=None, ntimes=1000):
    t = time()
    for x in xrange(ntimes):
        routine(aa, bb, cc, dtype=dtype)
    et = time() - t
    if base is None:
        base = et
    print "%s (%s): %f (%.1f%%)" % (routine.__name__, dtype.__name__, et,
                                    (et/base)*100)
    return et

def test_it(ntimes=1000):
    base = time_routine(slow_xor, byte, ntimes=ntimes)
    time_routine(outparam_xor, byte, base, ntimes=ntimes)
    time_routine(slow_xor, uint64, base, ntimes=ntimes)
    time_routine(outparam_xor, uint64, base, ntimes=ntimes)

You could try the symmetric difference of the bitsets in Sage.
http://www.sagemath.org/doc/reference/sage/misc/bitset.html

The fastest way (speed-wise) will be to do what Max. S recommended: implement it in C.
The supporting code for this task should be rather simple to write. It is just one function in a module, creating a new string and doing the XOR. That's all. When you have implemented one module like that, it is simple to take the code as a template. Or you can even take a module implemented by somebody else that provides a simple extension module for Python and just throw out everything not needed for your task.
The only really complicated part is getting the reference counting right. But once you realize how it works, it is manageable, especially since the task at hand is really simple (allocate some memory and return it; the parameters are not to be touched, reference-wise).

Related

Comparing performance of python and ctypes equivalent code

I tried to compare the performance of the Python and ctypes versions of a sum function. I found that the Python version is faster than the ctypes version.
Sum.c file:
int our_function(int num_numbers, int *numbers) {
    int i;
    int sum = 0;
    for (i = 0; i < num_numbers; i++) {
        sum += numbers[i];
    }
    return sum;
}

int our_function2(int num1, int num2) {
    return num1 + num2;
}
I compiled it to a shared library:
gcc -shared -o Sum.so sum.c
Then I imported both the shared library and ctypes to use the C function. Below is Sum.py:
import ctypes
_sum = ctypes.CDLL('.\junk.so')
_sum.our_function.argtypes = (ctypes.c_int, ctypes.POINTER(ctypes.c_int))

def our_function_c(numbers):
    global _sum
    num_numbers = len(numbers)
    array_type = ctypes.c_int * num_numbers
    result = _sum.our_function(ctypes.c_int(num_numbers), array_type(*numbers))
    return int(result)

def our_function_py(numbers):
    sum = 0
    for i in numbers:
        sum += i
    return sum
import time
start = time.time()
print(our_function_c([1, 2, 3]))
end = time.time()
print("time taken C", end-start)
start1 = time.time()
print(our_function_py([1, 2, 3]))
end1 = time.time()
print("time taken py", end1-start1)
Output:
6
time taken C 0.0010006427764892578
6
time taken py 0.0
For a larger list like list(range(int(1e5))):
start = time.time()
print(our_function_c(list(range(int(1e5)))))
end = time.time()
print("time taken C", end-start)
start1 = time.time()
print(our_function_py(list(range(int(1e5)))))
end1 = time.time()
print("time taken py", end1-start1)
Output:
704982704
time taken C 0.011005163192749023
4999950000
time taken py 0.00500178337097168
Question: I tried to use more numbers, but Python still beats ctypes in terms of performance. So my question is, is there a rule of thumb when I should move to ctypes over Python (in terms of the order of magnitude of code)? Also, what is the cost to convert Python to Ctypes, please?
Why
Well, yes, in such a case it is not really worth it, because before calling the C function you spend a lot of time converting numbers into c_int, which is no less expensive than an addition.
Usually we use ctypes when either the data are generated on the C side, or when we generate them from Python but then use them for more than one simple operation.
Same with pandas
This is, for example, what happens with numpy or pandas: two well-known examples of libraries in C (or compiled anyway) that allow huge time gains (on the order of 1000×), as long as data don't go back and forth between C space and Python space.
Numpy is faster than a list for many operations, for example, as long as you don't count the data conversion for each atomic operation.
Pandas often works with data read from CSV by pandas itself; the data stays in pandas space.
import time
import pandas as pd

lst = list(range(1000000))

start1 = time.time()
s1 = 0
for x in lst:
    s1 += x
end1 = time.time()

start2 = time.time()
df = pd.DataFrame({'x': lst})
middle2 = time.time()
s2 = df.x.sum()
end2 = time.time()

print("python", s1, "t=", end1-start1)
print("pandas", s2, "t=", end2-start2, end2-middle2)
python 499999500000 t= 0.13175106048583984
pandas 499999500000 t= 0.35060644149780273 0.0020313262939453125
As you see, by this standard pandas is also way slower than pure Python, but way faster if you don't count the DataFrame creation.
Faster without data conversion
Try to run your code this way
import time
lst=list(range(1000))*1000
c_lst = (ctypes.c_int * len(lst))(*lst)
c_num = ctypes.c_int(len(lst))
start = time.time()
print(int(_sum.our_function(c_num, c_lst)))
end = time.time()
print("time taken C", end-start)
start1 = time.time()
print(our_function_py(lst))
end1 = time.time()
print("time taken py", end1-start1)
And the C code is way faster.
So, as with pandas, it isn't worth it if all you really need from it is one summation, after which you forget the data.
No such problem with c-extension
Note that with a Python C extension, which allows C functions to handle Python types, you don't have this problem (yet it is often less efficient, because, well, Python lists are not just int* like C loves; but at least you don't need a conversion to C done from Python).
That is why you may sometimes see libraries for which, even counting the conversion, calling the external library is faster.
import numpy as np
np.array(lst).sum()
for example, is slightly faster, but barely so when we are used to numpy being 1000× faster; that is because numpy.array has to build itself from the Python list data.
But that is not just ctypes (by ctypes, I mean "using C functions from the C world handling C data, not caring about Python at all"). Plus, I am not even sure that this is the only reason: numpy might be cheating, using several threads and vectorization, which neither Python nor your C code does.
Example that needs no big data conversion
So, let's add another example, and add this to your code
int sumFirst(int n){
    int s=0;
    for(int i=0; i<n; i++){
        s+=i;
    }
    return s;
}
And try it with
import ctypes
_sum = ctypes.CDLL('./ctypeBench.so')
_sum.sumFirst.argtypes = (ctypes.c_int,)

def c_sumFirst(n):
    return _sum.sumFirst(ctypes.c_int(n))

import time
lst = list(range(10000))

start1 = time.time()
s1 = 0
for x in lst:
    s1 += x
end1 = time.time()

start2 = time.time()
s2 = c_sumFirst(10000)
end2 = time.time()

print(f"python {s1=}, Δt={end1-start1}")
print(f"c {s2=}, Δt={end2-start2}")
Result is
python s1=49995000, Δt=0.0012884140014648438
c s2=49995000, Δt=4.267692565917969e-05
And note that I was fair to Python: I did not count data generation in its time (I explicitly built the list from the range, which doesn't change much).
So, the conclusion is: you can't expect a ctypes function to gain time for a single operation per datum, such as +, when you need one conversion per datum to use it.
Either you need to use a C extension and write ad-hoc code that handles a Python list (and even there, you won't gain much if you have just one addition to do per value).
Or you need to keep the data on the C side, creating them from C and leaving them there (like you do with pandas or numpy: you use DataFrames or ndarrays, as much as possible with pandas and numpy functions or operators, not pulling the elements back into Python with full indexation or .iloc).
Or you need to have really more than one operation per datum to do (a sketch of this last option follows below).
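Here is a hedged sketch of that last option: convert the list once, then reuse the C-side buffer across many calls so the conversion cost is amortized (it reuses _sum and our_function from the question; the call count is arbitrary):
import ctypes
import time
lst = list(range(10000))
c_arr = (ctypes.c_int * len(lst))(*lst)   # one-time conversion to a C int array
c_n = ctypes.c_int(len(lst))
start = time.time()
for _ in range(1000):                     # 1000 calls share the single conversion above
    _sum.our_function(c_n, c_arr)
print("time taken C, 1000 reused calls:", time.time() - start)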
Addendum: c-extension
Just to add another argument in favor of "the problem is the conversion", but also to make explicit what to do if you really need one simple operation on a list and don't want to convert every element beforehand, you can try this:
modmy.c
#define PY_SSIZE_T_CLEAN
#include <Python.h>
#define PY3K

static PyObject *mysum(PyObject *self, PyObject *args){
    PyObject *list;
    PyArg_ParseTuple(args, "O", &list);
    PyObject *it = PyObject_GetIter(list);
    long int s=0;
    for(;;){
        PyObject *v = PyIter_Next(it);
        if(!v) break;
        long int iv = PyLong_AsLong(v);
        s += iv;
    }
    return PyLong_FromLong(s);
}

static PyMethodDef MyMethods[] = {
    {"mysum", mysum, METH_VARARGS, "Sum list"},
    {NULL, NULL, 0, NULL} /* Sentinel */
};

static struct PyModuleDef modmy = {
    PyModuleDef_HEAD_INIT,
    "modmy",
    NULL,
    -1,
    MyMethods
};

PyMODINIT_FUNC
PyInit_modmy()
{
    return PyModule_Create(&modmy);
}
Compile with
gcc -fPIC `python3-config --cflags` -c modmy.c
gcc -shared -fPIC `python3-config --ldflags` -o modmy.so modmy.o
Then
import time
import modmy

lst = list(range(10000000))

start1 = time.time()
s1 = 0
for x in lst:
    s1 += x
end1 = time.time()

start2 = time.time()
s2 = modmy.mysum(lst)
end2 = time.time()

print("python res=%d t=%5.2f" % (s1, end1-start1))
print("c res=%d t=%5.2f" % (s2, end2-start2))
This time there is no need for conversion (or, to be more accurate, yes, there is still a need for conversion, but it is done by the C code, since it is not just any C code, but code written ad hoc to extend Python).
(And after all, the Python interpreter, under the hood, also needs to unpack the elements.)
Note that my code checks nothing. It assumes that you are really calling mysum with a single argument that is a list of integers. God knows what happens if you don't. Well, not just God; just try:
>>> import modmy
>>> modmy.mysum(12)
Segmentation fault (core dumped)
$
Python crashes (not just your Python code; it is not a Python exception, the whole Python process dies).
But the result is worth it:
python res=49999995000000 t= 1.22
c res=49999995000000 t= 0.11
So, you see, this time C wins, because the rules are really the same (they are doing the same thing; C just does it faster).
So, you need to know what you are doing. But this does what you expected: a very simple operation on a list of integers that runs faster in C than in Python.

assignment into arbitrary array locations using cython. assignment speed depends on value?

I'm seeing some bizarre behavior in cython code. I'm writing code to compute a forward Kalman filter, but I have a state transition model which has many 0s in it, so it would be nice to be able to calculate only certain elements of the covariance matrices.
So to test this out, I wanted to fill individual array elements using Cython. To my surprise, I found (1) that writing output to specific array locations is very slow (function fill(...)) compared to just assigning the result to a scalar variable every time (function nofill(...)), essentially throwing the results away, and (2) that setting C=0.1 or 31, while not affecting how long nofill(...) takes to run, makes fill(...) run twice as slowly with the latter choice of C. This is baffling to me. Can anyone explain why I'm seeing this?
Code:
################# file way_too_slow.pyx
from libc.math cimport sin

# Setting C=0.1 or 31 doesn't affect the performance of calling nofill(...),
# but it makes fill(...) slower. I have no clue why.
cdef double C = 0.1

# This function just throws away its output.
def nofill(double[::1] x, double[::1] y, long N):
    cdef int i
    cdef double *p_x = &x[0]
    cdef double *p_y = &y[0]
    cdef double d
    with nogil:
        for 0 <= i < N:
            d = ((p_x[i] + p_y[i])*3 + p_x[i] - p_y[i]) + sin(p_x[i]*C)  # C appears here

# Same function, but it keeps its output.
# However: #1 - MUCH slower than nofill(...)
def fill(double[::1] x, double[::1] y, double[::1] out, long N):
    cdef int i
    cdef double *p_x = &x[0]
    cdef double *p_y = &y[0]
    cdef double *p_o = &out[0]
    cdef double d
    with nogil:
        for 0 <= i < N:
            p_o[i] = ((p_x[i] + p_y[i])*3 + p_x[i] - p_y[i]) + sin(p_x[i]*C)  # C appears here
The above code is called by the python program
#################### run_way_too_slow.py
import way_too_slow as _wts
import numpy as _N  # used as _N below
import time as _tm

N = 80000
x = _N.random.randn(N)
y = _N.random.randn(N)
out = _N.empty(N)

t1 = _tm.time()
_wts.nofill(x, y, N)
t2 = _tm.time()
_wts.fill(x, y, out, N)
t3 = _tm.time()

print "nofill() ET: %.3e" % (t2-t1)
print "fill() ET: %.3e" % (t3-t2)
print "fill() is slower by factor %.3f" % ((t3-t2)/(t2-t1))
The cython was compiled using the setup.py file
################# setup.py
from distutils.core import setup, Extension
from distutils.sysconfig import get_python_inc
from distutils.extension import Extension
from Cython.Distutils import build_ext

incdir = [get_python_inc(plat_specific=1)]
libdir = ['/usr/local/lib']
cmdclass = {'build_ext' : build_ext}

ext_modules = Extension("way_too_slow",
                        ["way_too_slow.pyx"],
                        include_dirs=incdir,  # include_dirs for Mac
                        library_dirs=libdir)

setup(
    name="way_too_slow",
    cmdclass=cmdclass,
    ext_modules=[ext_modules]
)
Here is a typical output of running "run_way_too_slow.py" using C=0.1
>>> exf("run_way_too_slow.py")
nofill() ET: 6.700e-05
fill() ET: 6.409e-04
fill() is slower by factor 9.566
A typical run with C=31.
>>> exf("run_way_too_slow.py")
nofill() ET: 6.795e-05
fill() ET: 1.566e-03
fill() is slower by factor 23.046
As we can see:
Assigning into a specified array location is quite slow compared to assigning to a double.
For some reason, the assignment speed seems to depend on what operation was done in the calculation, which makes no sense to me.
Any insight would be greatly appreciated.
Two things explain your observation:
A: In the first version nothing happens: the C compiler is clever enough to see that the whole loop has no effect at all outside of the function and optimizes it away.
To enforce the execution you must make the result d visible outside, for example via:
cdef double d=0
....
d+=....
return d
It might still be faster than the writing-to-an-array version, because of the fewer costly memory accesses, but you will then see a slowdown when changing the value of C.
B: sin is a complicated function, and how long it takes to calculate depends on its argument. For example, for very small arguments the argument itself can be returned, but for bigger arguments much longer Taylor series must be evaluated. There is a similar example for the cost of tanh, which, like sin, is calculated via different approximations/Taylor series depending on the value of the argument; the most important point is that the time needed depends on the argument.

Difference between complex.real and creal(complex) in Cython

To separate the real and imaginary parts of complex numbers in Cython I have generally been using complex.real and complex.imag for the job. This does, however, generate code that is colored slightly "Python red" in the HTML output, and I guess that I should be using creal(complex) and cimag(complex) instead.
Consider the example below:
cdef double complex myfun():
    cdef double complex c1, c2, c3
    c1 = 1.0 + 1.2j
    c2 = 2.2 + 13.4j
    c3 = c2.real + c1*c2.imag
    c3 = creal(c2) + c1*c2.imag
    c3 = creal(c2) + c1*cimag(c2)
    return c2
The assignments to c3 give:
__pyx_v_c3 = __Pyx_c_sum(__pyx_t_double_complex_from_parts(__Pyx_CREAL(__pyx_v_c2), 0), __Pyx_c_prod(__pyx_v_c1, __pyx_t_double_complex_from_parts(__Pyx_CIMAG(__pyx_v_c2), 0)));
__pyx_v_c3 = __Pyx_c_sum(__pyx_t_double_complex_from_parts(creal(__pyx_v_c2), 0), __Pyx_c_prod(__pyx_v_c1, __pyx_t_double_complex_from_parts(__Pyx_CIMAG(__pyx_v_c2), 0)));
__pyx_v_c3 = __Pyx_c_sum(__pyx_t_double_complex_from_parts(creal(__pyx_v_c2), 0), __Pyx_c_prod(__pyx_v_c1, __pyx_t_double_complex_from_parts(cimag(__pyx_v_c2), 0)));
where the first line uses the (Python-colored) constructions __Pyx_CREAL and __Pyx_CIMAG.
Why is this, and does it affect performance "significantly"?
Surely the default C library
(complex.h) would work
for you.
However, it seems this library wouldn't give you any significant improvement
when compared to the c.real c.imag approach. By putting the code within a
with nogil: block, you can check that your code already makes no call to
Python APIs:
cdef double complex c1, c2, c3
with nogil:
    c1 = 1.0 + 1.2j
    c2 = 2.2 + 13.4j
    c3 = c2.real + c1*c2.imag
I use Windows 7 and Python 2.7, which doesn't have complex.h available in the
builtin C library for Visual Studio Compilers 9.0 (compatible with Python 2.7).
Because of that I created an equivalent pure C function to check any possible
gains compared to c.real and c.imag:
cdef double mycreal(double complex dc):
    cdef double complex* dcptr = &dc
    return (<double *>dcptr)[0]

cdef double mycimag(double complex dc):
    cdef double complex* dcptr = &dc
    return (<double *>dcptr)[1]
After running the following two testing functions:
def myfun1(double complex c1, double complex c2):
    return c2.real + c1*c2.imag

def myfun2(double complex c1, double complex c2):
    return mycreal(c2) + c1*mycimag(c2)
Got the timings:
In [3]: timeit myfun1(c1, c2)
The slowest run took 17.50 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 86.3 ns per loop
In [4]: timeit myfun2(c1, c2)
The slowest run took 17.24 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 87.6 ns per loop
This confirms that c.real and c.imag are already fast enough.
Actually you should not expect to see any difference: __Pyx_CREAL(c) and __Pyx_CIMAG(c) can be looked up here, and there is no rocket science/black magic involved:
#if CYTHON_CCOMPLEX
  #ifdef __cplusplus
    #define __Pyx_CREAL(z) ((z).real())
    #define __Pyx_CIMAG(z) ((z).imag())
  #else
    #define __Pyx_CREAL(z) (__real__(z))
    #define __Pyx_CIMAG(z) (__imag__(z))
  #endif
#else
    #define __Pyx_CREAL(z) ((z).real)
    #define __Pyx_CIMAG(z) ((z).imag)
#endif
It basically says: these are defines and lead, without overhead, to a call of:
std::complex<>.real() and std::complex<>.imag() if it is C++;
the GNU extensions __real__ and __imag__ if it is the gcc or Intel compiler (Intel added support some time ago).
You cannot beat it; e.g. in glibc, creal uses __real__. What is left is the Microsoft compiler (CYTHON_CCOMPLEX is not defined); for that, Cython uses its own special class:
#if CYTHON_CCOMPLEX
....
#else
static CYTHON_INLINE {{type}} {{type_name}}_from_parts({{real_type}} x, {{real_type}} y) {
    {{type}} z;
    z.real = x;
    z.imag = y;
    return z;
}
#endif
Normally, one should not write one's own complex number implementation, but there is not so much you can do wrong if only access to the real and imaginary parts is considered. I would not vouch for other functionality, but would not expect it to be very much slower than the usual Windows implementation (but would like to see the results if you have some).
In conclusion: there is no difference for the gcc/Intel compilers, and I would not fret too long about differences for the others.

How to sort an array of integers faster than quicksort?

Sorting an array of integers with numpy's quicksort has become the
bottleneck of my algorithm. Unfortunately, numpy does not have
radix sort yet.
Although counting sort would be a one-liner in numpy:
np.repeat(np.arange(1+x.max()), np.bincount(x))
(see the accepted answer to the "How can I vectorize this python count sort so it is absolutely as fast as it can be?" question), the integers in my application can run from 0 to 2**32, so the count array alone would need on the order of 2**32 entries.
Am I stuck with quicksort?
This post was primarily motivated by the
[Numpy grouping using itertools.groupby performance](https://stackoverflow.com/q/4651683/341970)
question.
Also note that
[it is not merely OK to ask and answer your own question, it is explicitly encouraged.](https://blog.stackoverflow.com/2011/07/its-ok-to-ask-and-answer-your-own-questions/)
No, you are not stuck with quicksort. You could use, for example,
integer_sort from
Boost.Sort
or u4_sort from usort. When sorting this array:
array(randint(0, high=1<<32, size=10**8), uint32)
I get the following results:
NumPy quicksort: 8.636 s 1.0 (baseline)
Boost.Sort integer_sort: 4.327 s 2.0x speedup
usort u4_sort: 2.065 s 4.2x speedup
I would not jump to conclusions based on this single experiment and use
usort blindly. I would test with my actual data and measure what happens.
Your mileage will vary depending on your data and on your machine. The
integer_sort in Boost.Sort has a rich set of options for tuning, see the
documentation.
Below I describe two ways to call a native C or C++ function from Python. Despite the long description, it's fairly easy to do it.
Boost.Sort
Put these lines into the spreadsort.cpp file:
#include <cinttypes>
#include "boost/sort/spreadsort/spreadsort.hpp"
using namespace boost::sort::spreadsort;

extern "C" {
    void spreadsort(std::uint32_t* begin, std::size_t len) {
        integer_sort(begin, begin + len);
    }
}
It basically instantiates the templated integer_sort for 32 bit
unsigned integers; the extern "C" part ensures C linkage by disabling
name mangling.
Assuming you are using gcc and that the necessary include files of boost
are under the /tmp/boost_1_60_0 directory, you can compile it:
g++ -O3 -std=c++11 -march=native -DNDEBUG -shared -fPIC -I/tmp/boost_1_60_0 spreadsort.cpp -o spreadsort.so
The key flags are -fPIC to generate
position-independent code
and -shared to generate a
shared object
.so file. (Read the docs of gcc for further details.)
Then, you wrap the spreadsort() C++ function
in Python using ctypes:
from ctypes import cdll, c_size_t, c_uint32
from numpy import uint32
from numpy.ctypeslib import ndpointer

__all__ = ['integer_sort']

# In spreadsort.cpp: void spreadsort(std::uint32_t* begin, std::size_t len)
lib = cdll.LoadLibrary('./spreadsort.so')
sort = lib.spreadsort
sort.restype = None
sort.argtypes = [ndpointer(c_uint32, flags='C_CONTIGUOUS'), c_size_t]

def integer_sort(arr):
    assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
    sort(arr, arr.size)
Alternatively, you can use cffi:
from cffi import FFI
from numpy import uint32

__all__ = ['integer_sort']

ffi = FFI()
ffi.cdef('void spreadsort(uint32_t* begin, size_t len);')
C = ffi.dlopen('./spreadsort.so')

def integer_sort(arr):
    assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
    begin = ffi.cast('uint32_t*', arr.ctypes.data)
    C.spreadsort(begin, arr.size)
At the cdll.LoadLibrary() and ffi.dlopen() calls I assumed that the
path to the spreadsort.so file is ./spreadsort.so. Alternatively,
you can write
lib = cdll.LoadLibrary('spreadsort.so')
or
C = ffi.dlopen('spreadsort.so')
if you append the path to spreadsort.so to the LD_LIBRARY_PATH environment
variable. See also Shared Libraries.
Usage. In both cases you simply call the above Python wrapper function integer_sort()
with your numpy array of 32 bit unsigned integers.
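For example (a sketch; the array size is arbitrary, and the same call works with either wrapper above):
import numpy as np
arr = np.array(np.random.randint(0, high=1 << 32, size=10**6), dtype=np.uint32)
integer_sort(arr)                   # sorts arr in place via the shared library
assert np.all(arr[:-1] <= arr[1:])  # sanity check: non-decreasing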
usort
As for u4_sort, you can compile it as follows:
cc -DBUILDING_u4_sort -I/usr/include -I./ -I../ -I../../ -I../../../ -I../../../../ -std=c99 -fgnu89-inline -O3 -g -fPIC -shared -march=native u4_sort.c -o u4_sort.so
Issue this command in the directory where the u4_sort.c file is located.
(Probably there is a less hackish way but I failed to figure that out. I
just looked into the deps.mk file in the usort directory to find out
the necessary compiler flags and include paths.)
Then, you can wrap the C function as follows:
from cffi import FFI
from numpy import uint32

__all__ = ['integer_sort']

ffi = FFI()
ffi.cdef('void u4_sort(unsigned* a, const long sz);')
C = ffi.dlopen('u4_sort.so')

def integer_sort(arr):
    assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
    begin = ffi.cast('unsigned*', arr.ctypes.data)
    C.u4_sort(begin, arr.size)
In the above code, I assumed that the path to u4_sort.so has been
appended to the LD_LIBRARY_PATH environment variable.
Usage. As before with Boost.Sort, you simply call the above Python wrapper function integer_sort() with your numpy array of 32 bit unsigned integers.
A base-256 (1 byte per digit) radix sort can generate a matrix of counts used to determine bucket sizes in one pass over the data, then takes 4 passes to do the sort. On my system, an Intel 2600K at 3.4 GHz, using a Visual Studio release build of a C++ program, it takes about 1.15 seconds to sort 10^8 pseudo-random unsigned 32-bit integers.
Looking at the u4_sort code mentioned in Ali's answer, the programs are similar, but u4_sort uses field sizes of {10,11,11}, taking 3 passes to sort data and 1 pass to copy back, while this example uses field sizes of {8,8,8,8}, taking 4 passes to sort data. The process is probably memory bandwidth limited due to the random access writes, so the optimizations in u4_sort, like macros for shift, one loop with fixed shift per field, aren't helping much. My results are better probably due to system and/or compiler differences. (Note u8_sort is for 64 bit integers).
Example code:
// a is input array, b is working array
void RadixSort(uint32_t *a, uint32_t *b, size_t count)
{
    size_t mIndex[4][256] = {0};            // count / index matrix
    size_t i, j, m, n;
    uint32_t u;
    for(i = 0; i < count; i++){             // generate histograms
        u = a[i];
        for(j = 0; j < 4; j++){
            mIndex[j][(size_t)(u & 0xff)]++;
            u >>= 8;
        }
    }
    for(j = 0; j < 4; j++){                 // convert to indices
        m = 0;
        for(i = 0; i < 256; i++){
            n = mIndex[j][i];
            mIndex[j][i] = m;
            m += n;
        }
    }
    for(j = 0; j < 4; j++){                 // radix sort
        for(i = 0; i < count; i++){         //  sort by current lsb
            u = a[i];
            m = (size_t)(u>>(j<<3))&0xff;
            b[mIndex[j][m]++] = u;
        }
        std::swap(a, b);                    //  swap ptrs
    }
}
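If you want to call this routine from Python as well, the same cffi pattern shown earlier for spreadsort applies. A sketch, assuming the function above is compiled with C linkage (extern "C") into a shared library named ./radixsort.so (the library name is hypothetical):
from cffi import FFI
import numpy as np
ffi = FFI()
ffi.cdef('void RadixSort(uint32_t* a, uint32_t* b, size_t count);')
lib = ffi.dlopen('./radixsort.so')
def radix_sort(arr):
    assert arr.dtype == np.uint32
    work = np.empty_like(arr)  # working array b; after the 4 passes the sorted data is back in a
    lib.RadixSort(ffi.cast('uint32_t*', arr.ctypes.data),
                  ffi.cast('uint32_t*', work.ctypes.data),
                  arr.size)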
A radix sort with Python/Numba (0.23), based on @rcgldr's C program, with multithreading on a 2-core processor.
First, the radix sort with Numba, using two global arrays for efficiency.
from threading import Thread
from pylab import *
from numba import jit

n = uint32(10**8)
m = n//2

if 'x1' not in locals(): x1 = array(randint(0, 1<<16, 2*n), uint16)  # to avoid regeneration
x2 = x1.copy()
x = frombuffer(x2, uint32)  # randint doesn't work with 32 bits here :(
y = empty_like(x)
nbits = 8
buffsize = 1 << nbits
mask = buffsize - 1

@jit(nopython=True, nogil=True)
def radix(x, y):
    xs = x.size
    dec = 0
    while dec < 32:
        u = np.zeros(buffsize, uint32)
        k = 0
        while k < xs:
            u[(x[k]>>dec) & mask] += 1
            k += 1
        j = t = 0
        for j in range(buffsize):
            b = u[j]
            u[j] = t
            t += b
        v = u.copy()
        k = 0
        while k < xs:
            j = (x[k]>>dec) & mask
            y[u[j]] = x[k]
            u[j] += 1
            k += 1
        x, y = y, x
        dec += nbits
Then the parallelisation, possible thanks to the nogil option:
def para(nthreads=2):
    threads = [Thread(target=radix,
                      args=(x[i*n//nthreads:(i+1)*n//nthreads],
                            y[i*n//nthreads:(i+1)*n//nthreads]))
               for i in range(nthreads)]
    for t in threads: t.start()
    for t in threads: t.join()

@jit
def fuse(x, y):
    kl = 0
    kr = n//2
    k = 0
    while k < n:
        if y[kl] < x[kr]:
            x[k] = y[kl]
            kl += 1
            if kl == m: break
        else:
            x[k] = x[kr]
            kr += 1
        k += 1

def sort():
    para(2)
    y[:m] = x[:m]
    fuse(x, y)
benchmarks :
In [24]: %timeit x2=x1.copy();x=frombuffer(x2,uint32) # time offset
1 loops, best of 3: 431 ms per loop
In [25]: %timeit x2=x1.copy();x=frombuffer(x2,uint32);x.sort()
1 loops, best of 3: 37.8 s per loop
In [26]: %timeit x2=x1.copy();x=frombuffer(x2,uint32);para(1)
1 loops, best of 3: 5.7 s per loop
In [27]: %timeit x2=x1.copy();x=frombuffer(x2,uint32);sort()
1 loops, best of 3: 4.02 s per loop
So, a pure Python solution with a roughly 10x gain (37 s -> 3.5 s, once the copy offset is subtracted) on my poor 1 GHz machine. It can be enhanced with more cores and multi-way fusion.

sparse_hash_map is very slow for specific data

tl;dr: why does key lookup in sparse_hash_map become about 50x slower for specific data?
I am testing the speed of key lookups for sparse_hash_map from Google's sparsehash library using a very simple Cython wrapper I've written. The hashtable contains uint32_t keys and uint16_t values. For random keys, values and queries I am getting more than 1M lookups/sec. However, for the specific data I need the performance barely exceeds 20k lookups/sec.
The wrapper is here. The table which runs slowly is here. The benchmarking code is:
benchmark.pyx:
from sparsehash cimport SparseHashMap
from libc.stdint cimport uint32_t
from libcpp.vector cimport vector
import time
import numpy as np

def fill_randomly(m, size):
    keys = np.random.random_integers(0, 0xFFFFFFFF, size)
    # 0 is a special domain-specific value
    values = np.random.random_integers(1, 0xFFFF, size)
    for j in range(size):
        m[keys[j]] = values[j]

def benchmark_get():
    cdef int dummy
    cdef uint32_t i, j, table_key
    cdef SparseHashMap m
    cdef vector[uint32_t] q_keys
    cdef int NUM_QUERIES = 1000000
    cdef uint32_t MAX_REQUEST = 7448 * 2**19 - 1  # this is domain-specific

    time_start = time.time()

    ### OPTION 1 ###
    m = SparseHashMap('17.shash')

    ### OPTION 2 ###
    # m = SparseHashMap(16130443)
    # fill_randomly(m, 16130443)

    q_keys = np.random.random_integers(0, MAX_REQUEST, NUM_QUERIES)

    print("Initialization: %.3f" % (time.time() - time_start))

    dummy = 0
    time_start = time.time()
    for i in range(NUM_QUERIES):
        table_key = q_keys[i]
        dummy += m.get(table_key)
        dummy %= 0xFFFFFF  # to prevent overflow error
    time_elapsed = time.time() - time_start

    if dummy == 42:
        # So that the unused variable is not optimized away
        print("Wow, lucky!")

    print("Table size: %d" % len(m))
    print("Total time: %.3f" % time_elapsed)
    print("Seconds per query: %.8f" % (time_elapsed / NUM_QUERIES))
    print("Queries per second: %.1f" % (NUM_QUERIES / time_elapsed))

def main():
    benchmark_get()
benchmark.pyxbld (because pyximport should compile in C++ mode):
def make_ext(modname, pyxfilename):
    from distutils.extension import Extension
    return Extension(
        name=modname,
        sources=[pyxfilename],
        language='c++'
    )
run.py:
import pyximport
pyximport.install()
import benchmark
benchmark.main()
The results for 17.shash are:
Initialization: 2.612
Table size: 16130443
Total time: 48.568
Seconds per query: 0.00004857
Queries per second: 20589.8
and for random data:
Initialization: 25.853
Table size: 16100260
Total time: 0.891
Seconds per query: 0.00000089
Queries per second: 1122356.3
The key distribution in 17.shash was plotted with plt.hist(np.fromiter(m.keys(), dtype=np.uint32, count=len(m)), bins=50) (histogram figure omitted).
From the documentation on sparsehash and gcc it seems that trivial hashing is used here (that is, x hashes to x).
Is there anything obvious that could be causing this behavior besides hash collisions? From what I have found, it is non-trivial to integrate a custom hash function (i.e. overload std::hash<uint32_t>) in a Cython wrapper.
I have found a solution that works, but it ain't pretty.
sparsehash_wrapper.cpp:
#include "sparsehash/sparse_hash_map"
#include "stdint.h"
// syntax borrowed from
// https://stackoverflow.com/questions/14094104/google-sparse-hash-with-murmur-hash-function
struct UInt32Hasher {
size_t operator()(const uint32_t& x) const {
return (x ^ (x << 17) ^ (x >> 13) + 3238229671);
}
};
template<class Key, class T>
class sparse_hash_map : public google::sparse_hash_map<Key, T, UInt32Hasher> {};
This is a custom hash function which I could integrate in the existing wrapper with minimal code changes: I only had to replace sparsehash/sparse_hash_map with the path to sparsehash_wrapper.cpp in the Cython .pxd file. So far, the only problem with it is that pyximport cannot find sparsehash_wrapper.cpp unless I specify the full absolute path in .pxd.
The problem was indeed with collisions: after recreating a hash map with the same contents as 17.shash from scratch (that is, creating an empty map and inserting every (key, value) pair from 17.shash into it), the performance went up to 1M+ req/sec.
