Filling 64 bit numpy array in Cython slower than 32 bit array - python

I am trying to understand why filling a 64 bit array is slower than filling a 32 bit array.
Here are the examples:
import cython
import numpy as np
cimport numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
def test32(long long int size):
    cdef np.ndarray[np.int32_t, ndim=1] index = np.zeros(size, np.int32)
    cdef Py_ssize_t i
    for i in range(size):
        index[i] = i  # fill every element with its own index
    return index
@cython.boundscheck(False)
@cython.wraparound(False)
def test64(long long int size):
    cdef np.ndarray[np.int64_t, ndim=1] index = np.zeros(size, np.int64)
    cdef Py_ssize_t i
    for i in range(size):
        index[i] = i  # fill every element with its own index
    return index
The timings:
In [4]: %timeit test32(1000000)
1000 loops, best of 3: 1.13 ms per loop
In [5]: %timeit test64(1000000)
100 loops, best of 3: 2.04 ms per loop
I am using a 64 bit computer and the Visual C++ for Python (9.0) compiler.
Edit:
Initializing a 64bit array and a 32bit array seems to be taking the same amount of time, which means the time difference is due to the filling process.
In [8]: %timeit np.zeros(1000000,'int32')
100000 loops, best of 3: 2.49 μs per loop
In [9]: %timeit np.zeros(1000000,'int64')
100000 loops, best of 3: 2.49 μs per loop
Edit2:
As DavidW points out, this behavior can be replicated with np.arange, which means that this is expected:
In [7]: %timeit np.arange(1000000,dtype='int32')
10 loops, best of 3: 1.22 ms per loop
In [8]: %timeit np.arange(1000000,dtype='int64')
10 loops, best of 3: 2.03 ms per loop

A quick set of measurements (which I initially posted in the comments) suggests that you see very similar behaviour (64 bits taking twice as long as 32 bits) for np.full, np.ones and np.arange, with all three showing similar times to each other (arange being about 10% slower than the other 2 for me).
I believe this suggests that this is expected behaviour: the cost is just the time it takes to fill the memory, and a 64-bit array occupies twice as many bytes as a 32-bit array of the same length.
The interesting question is then why np.zeros is so uniform (and fast). A complete answer is probably given in this C-based question, but the basic summary is that allocating zeros (as done by the C function calloc) can be done lazily, i.e. it doesn't actually allocate or fill much until you try to write to the memory.
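The "twice as much memory" point can be checked directly. This is a minimal sketch using only the array attributes itemsize and nbytes; the variable names are chosen for illustration and it says nothing about timing, only about how many bytes have to be written.

```python
import numpy as np

n = 1_000_000
a32 = np.zeros(n, dtype=np.int32)
a64 = np.zeros(n, dtype=np.int64)

# itemsize is the number of bytes per element, nbytes the size of the whole buffer
print(a32.itemsize, a32.nbytes)
print(a64.itemsize, a64.nbytes)

# the int64 array needs twice as many bytes for the same number of elements
ratio = a64.nbytes / a32.nbytes
print(ratio)
```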

Related

Slicing a Python list with a NumPy array of indices -- any fast way?

I have a regular list called a, and a NumPy array of indices b.
(No, it is not possible for me to convert a to a NumPy array.)
Is there any way for me to get the same effect as "a[b]" efficiently? To be clear, this implies that I don't want to extract every individual int in b, due to the performance implications.
(Yes, this is a bottleneck in my code. That's why I'm using NumPy arrays to begin with.)
a = list(range(1000000))
b = np.random.randint(0, len(a), 10000)
%timeit np.array(a)[b]
10 loops, best of 3: 84.8 ms per loop
%timeit [a[x] for x in b]
100 loops, best of 3: 2.93 ms per loop
%timeit operator.itemgetter(*b)(a)
1000 loops, best of 3: 1.86 ms per loop
%timeit np.take(a, b)
10 loops, best of 3: 91.3 ms per loop
I had high hopes for numpy.take() but it is far from optimal. I tried some Numba solutions as well, and they yielded similar times--around 92 ms.
So a simple list comprehension is not far from the best here, but operator.itemgetter() wins, at least for input sizes at these orders of magnitude.
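For reference, the approaches timed above all produce the same elements. The sketch below uses a much smaller list than the benchmark and converts the index array with tolist() in a single call (an assumption about what is acceptable here, since it does create Python ints); it only checks equivalence and makes no claim about speed.

```python
import operator
import numpy as np

a = list(range(1000))                  # plain Python list, as in the question
rng = np.random.default_rng(0)
b = rng.integers(0, len(a), size=50)   # NumPy array of indices

idx = b.tolist()  # convert all indices to Python ints in one call

by_comprehension = [a[i] for i in idx]
by_itemgetter = list(operator.itemgetter(*idx)(a))  # itemgetter returns a tuple
by_take = np.take(a, b).tolist()                    # converts `a` to an array first

print(by_comprehension == by_itemgetter == by_take)
```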
Write a cython function:
import cython
from cpython cimport PyList_New, PyList_SET_ITEM, Py_INCREF
@cython.wraparound(False)
@cython.boundscheck(False)
def take(list alist, Py_ssize_t[:] arr):
    cdef:
        Py_ssize_t i, idx, n = arr.shape[0]
        list res = PyList_New(n)
        object obj
    for i in range(n):
        idx = arr[i]
        obj = alist[idx]
        Py_INCREF(obj)  # PyList_SET_ITEM steals a reference, so take one first
        PyList_SET_ITEM(res, i, obj)
    return res
The result of %timeit:
import numpy as np
al= list(range(10000))
aa = np.array(al)
ba = np.random.randint(0, len(al), 10000)
bl = ba.tolist()
%timeit [al[i] for i in bl]
%timeit np.take(aa, ba)
%timeit take(al, ba)
1000 loops, best of 3: 1.68 ms per loop
10000 loops, best of 3: 51.4 µs per loop
1000 loops, best of 3: 254 µs per loop
numpy.take() is the fastest if both of the arguments are ndarray objects. In the timings above, the Cython version is more than 5x faster than the list comprehension.

numba guvectorize target='parallel' slower than target='cpu'

I've been attempting to optimize a piece of python code that involves large multi-dimensional array calculations. I am getting counterintuitive results with numba. I am running on an MBP, mid 2015, 2.5 GHz i7 quadcore, OS 10.10.5, python 2.7.11. Consider the following:
import numpy as np
from numba import jit, vectorize, guvectorize
import numexpr as ne
import timeit
def add_two_2ds_naive(A,B,res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i,j] = A[i,j]+B[i,j]
@jit
def add_two_2ds_jit(A,B,res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i,j] = A[i,j]+B[i,j]
@guvectorize(['float64[:,:],float64[:,:],float64[:,:]'],
    '(n,m),(n,m)->(n,m)',target='cpu')
def add_two_2ds_cpu(A,B,res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i,j] = A[i,j]+B[i,j]
@guvectorize(['(float64[:,:],float64[:,:],float64[:,:])'],
    '(n,m),(n,m)->(n,m)',target='parallel')
def add_two_2ds_parallel(A,B,res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i,j] = A[i,j]+B[i,j]
def add_two_2ds_numexpr(A,B,res):
    res = ne.evaluate('A+B')
if __name__=="__main__":
    np.random.seed(69)
    A = np.random.rand(10000,100)
    B = np.random.rand(10000,100)
    res = np.zeros((10000,100))
I can now run timeit on the various functions:
%timeit add_two_2ds_jit(A,B,res)
1000 loops, best of 3: 1.16 ms per loop
%timeit add_two_2ds_cpu(A,B,res)
1000 loops, best of 3: 1.19 ms per loop
%timeit add_two_2ds_parallel(A,B,res)
100 loops, best of 3: 6.9 ms per loop
%timeit add_two_2ds_numexpr(A,B,res)
1000 loops, best of 3: 1.62 ms per loop
It seems that 'parallel' is not even using the majority of a single core: its usage in top shows that python is hitting ~40% CPU for 'parallel', ~100% for 'cpu', and ~300% for numexpr.
There are two issues with your @guvectorize implementations. The first is that you are doing all the looping inside your @guvectorize kernel, so there is actually nothing for the Numba parallel target to parallelize. Both @vectorize and @guvectorize parallelize on the broadcast dimensions in a ufunc/gufunc. Since the signature of your gufunc is 2D, and your inputs are 2D, there is only a single call to the inner function, which explains why you never saw more than 100% CPU usage.
The best way to write the function you have above is to use a regular ufunc:
@vectorize('(float64, float64)', target='parallel')
def add_ufunc(a, b):
    return a + b
Then on my system, I see these speeds:
%timeit add_two_2ds_jit(A,B,res)
1000 loops, best of 3: 1.87 ms per loop
%timeit add_two_2ds_cpu(A,B,res)
1000 loops, best of 3: 1.81 ms per loop
%timeit add_two_2ds_parallel(A,B,res)
The slowest run took 11.82 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 2.43 ms per loop
%timeit add_two_2ds_numexpr(A,B,res)
100 loops, best of 3: 2.79 ms per loop
%timeit add_ufunc(A, B, res)
The slowest run took 9.24 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 2.03 ms per loop
(This is a very similar OS X system to yours, but with OS X 10.11.)
Although Numba's parallel ufunc now beats numexpr (and I see add_ufunc using about 280% CPU), it doesn't beat the simple single-threaded CPU case. I suspect that the bottleneck is due to memory (or cache) bandwidth, but I haven't done the measurements to check that.
Generally speaking, you will see much more benefit from the parallel ufunc target if you are doing more math operations per memory element (like, say, a cosine).
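The point about broadcast dimensions can be illustrated without Numba. The sketch below uses np.vectorize with a gufunc-style signature purely as a stand-in (it is not how Numba implements its targets) and counts how often the kernel is called: with a 2D core signature on 2D inputs there is one call, so nothing is left to distribute, while a 1D core signature leaves one loop dimension over the rows. The function and variable names are hypothetical.

```python
import numpy as np

calls = {"whole": 0, "rows": 0}

def add_whole(a, b):
    # kernel that receives the full 2D arrays
    calls["whole"] += 1
    return a + b

def add_row(a, b):
    # kernel that receives one 1D row at a time
    calls["rows"] += 1
    return a + b

A = np.random.rand(4, 3)
B = np.random.rand(4, 3)

# core dimensions (n,m) consume both axes: no broadcast (loop) dimension remains
whole = np.vectorize(add_whole, signature="(n,m),(n,m)->(n,m)")
# core dimension (m) consumes one axis: the first axis becomes the loop dimension
rows = np.vectorize(add_row, signature="(m),(m)->(m)")

out_whole = whole(A, B)
out_rows = rows(A, B)
print(calls)
```

Only the calls along a loop dimension are candidates for being spread across threads, which is why the elementwise ufunc form gives the parallel target the most to work with.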

Cython prange slower for 4 threads than with range

I am currently trying to follow a simple example for parallelizing a loop with cython's prange.
I have installed OpenBlas 0.2.14 with openmp allowed and compiled numpy 1.10.1 and scipy 0.16 from source against openblas. To test the performance of the libraries I am following this example: http://nealhughes.net/parallelcomp2/.
The functions to be timed are copied from the site:
import numpy as np
from math import exp
from libc.math cimport exp as c_exp
from cython.parallel import prange,parallel
def array_f(X):
    Y = np.zeros(X.shape)
    index = X > 0.5
    Y[index] = np.exp(X[index])
    return Y
def c_array_f(double[:] X):
    cdef int N = X.shape[0]
    cdef double[:] Y = np.zeros(N)
    cdef int i
    for i in range(N):
        if X[i] > 0.5:
            Y[i] = c_exp(X[i])
        else:
            Y[i] = 0
    return Y
def c_array_f_multi(double[:] X):
    cdef int N = X.shape[0]
    cdef double[:] Y = np.zeros(N)
    cdef int i
    with nogil, parallel():
        for i in prange(N):
            if X[i] > 0.5:
                Y[i] = c_exp(X[i])
            else:
                Y[i] = 0
    return Y
The author of the code reports the following speed-ups for 4 cores:
from thread_demo import *
import numpy as np
X = -1 + 2*np.random.rand(10000000)
%timeit array_f(X)
1 loops, best of 3: 222 ms per loop
%timeit c_array_f(X)
10 loops, best of 3: 87.5 ms per loop
%timeit c_array_f_multi(X)
10 loops, best of 3: 22.4 ms per loop
When I run these examples on my machine (MacBook Pro with OS X 10.10), I get the following timings for export OMP_NUM_THREADS=1:
In [1]: from bla import *
In [2]: import numpy as np
In [3]: X = -1 + 2*np.random.rand(10000000)
In [4]: %timeit c_array_f(X)
10 loops, best of 3: 89.7 ms per loop
In [5]: %timeit c_array_f_multi(X)
1 loops, best of 3: 343 ms per loop
and for OMP_NUM_THREADS=4
In [1]: from bla import *
In [2]: import numpy as np
In [3]: X = -1 + 2*np.random.rand(10000000)
In [4]: %timeit c_array_f(X)
10 loops, best of 3: 89.5 ms per loop
In [5]: %timeit c_array_f_multi(X)
10 loops, best of 3: 119 ms per loop
I see the same behavior on an openSUSE machine, hence my question: how can the author get a 4x speed-up while the same code runs slower with 4 threads on 2 of my systems?
The setup script for generating the .c and .so files is also identical to the one used in the blog.
from distutils.core import setup
from Cython.Build import cythonize
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy as np
ext_modules=[
    Extension("bla",
              ["bla.pyx"],
              libraries=["m"],
              extra_compile_args = ["-O3", "-ffast-math","-march=native", "-fopenmp" ],
              extra_link_args=['-fopenmp'],
              include_dirs = [np.get_include()]
              )
]
setup(
    name = "bla",
    cmdclass = {"build_ext": build_ext},
    ext_modules = ext_modules
)
Would be great if someone could explain to me why this happens.
1) An important feature of prange (like any other parallel for loop) is that it activates out-of-order execution, which means that the loop can execute in any arbitrary order. Out-of-order execution really pays off when you have no data dependency between iterations.
I do not know the internals of Cython, but I reckon that if bounds checking is not turned off, the loop cannot be executed in arbitrary order, since the next iteration depends on whether or not the array access goes out of bounds in the current iteration. The problem then becomes almost serial, as threads have to wait for the result. This is one of the issues with your code. In fact, Cython gives me the following warning:
warning: bla.pyx:42:16: Use boundscheck(False) for faster access
So add the following
from cython import boundscheck, wraparound
@boundscheck(False)
@wraparound(False)
def c_array_f(double[:] X):
    # Rest of your code
@boundscheck(False)
@wraparound(False)
def c_array_f_multi(double[:] X):
    # Rest of your code
Let's now time them with your data X = -1 + 2*np.random.rand(10000000).
With Bounds Checking:
In [2]:%timeit array_f(X)
10 loops, best of 3: 189 ms per loop
In [4]:%timeit c_array_f(X)
10 loops, best of 3: 93.6 ms per loop
In [5]:%timeit c_array_f_multi(X)
10 loops, best of 3: 103 ms per loop
Without Bounds Checking:
In [9]:%timeit c_array_f(X)
10 loops, best of 3: 84.2 ms per loop
In [10]:%timeit c_array_f_multi(X)
10 loops, best of 3: 42.3 ms per loop
These results are with num_threads=4 (I have 4 logical cores) and the speed-up is around 2x. Before getting further we can still shave off a few more ms by declaring our arrays to be contiguous i.e. declaring X and Y with double[::1].
Contiguous Arrays:
In [14]:%timeit c_array_f(X)
10 loops, best of 3: 81.8 ms per loop
In [15]:%timeit c_array_f_multi(X)
10 loops, best of 3: 39.3 ms per loop
2) Even more important is job scheduling, and this is what your benchmark suffers from. By default chunk sizes are determined at compile time, i.e. schedule=static. However, it is very likely that the environment variables (for instance OMP_SCHEDULE) and the work-load of the two machines (yours and the one from the blog post) are different, and that they schedule the jobs at runtime (dynamic, guided and so on). Let's experiment by replacing your prange with one of the following:
for i in prange(N, schedule='static'):
    # static scheduling...
for i in prange(N, schedule='dynamic'):
    # dynamic scheduling...
Let's time them now (only the multi-threaded code):
Scheduling Effect:
In [23]:%timeit c_array_f_multi(X) # static
10 loops, best of 3: 39.5 ms per loop
In [28]:%timeit c_array_f_multi(X) # dynamic
1 loops, best of 3: 319 ms per loop
You might be able to replicate this depending on the work-load on your own machine. As a side note, since you are just trying to measure the performance of a parallel vs serial code in a micro-benchmark test and not an actual code, I suggest you get rid of the if-else condition i.e. only keep Y[i] = c_exp(X[i]) within the for loop. This is because if-else statements also adversely affect branch-prediction and out-of-order execution in parallel code. On my machine I get almost 2.7x speed-up over serial code with this change.
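To make the scheduling difference concrete, here is a rough pure-Python model of what a static schedule does: the iteration space is cut once into contiguous, near-equal chunks, one per thread. The helper static_chunks is hypothetical and only a model; the real partition is decided by the OpenMP runtime. A dynamic schedule instead hands out chunks to threads at runtime (with a default chunk size of 1 in OpenMP when none is given), so a loop with very cheap iterations can spend most of its time on that bookkeeping.

```python
def static_chunks(n, num_threads):
    """Split range(n) into at most num_threads contiguous, near-equal chunks."""
    base, extra = divmod(n, num_threads)
    chunks = []
    start = 0
    for t in range(num_threads):
        # the first `extra` threads get one additional iteration
        size = base + (1 if t < extra else 0)
        if size:
            chunks.append(range(start, start + size))
        start += size
    return chunks

chunks = static_chunks(10, 4)
print([list(c) for c in chunks])
```

Each thread gets its whole block up front, so there is no per-iteration coordination during the loop.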

How are NumPy's in-place operators implemented to explain the significant performance gain

I know that in Python, the in-place operators use the __iadd__ method. For immutable types, __iadd__ is a workaround using __add__, e.g., like tmp = a + b; a = tmp, but mutable types (like lists) are modified in-place, which causes a slight speed boost.
However, if I have a NumPy array where I modify its contained immutable types, e.g., integers or floats, there is also an even more significant speed boost. How does this work? I did some example benchmarks below:
import numpy as np
def inplace(a, b):
    a += b
    return a
def assignment(a, b):
    a = a + b
    return a
int1 = 1
int2 = 1
list1 = [1]
list2 = [1]
npary1 = np.ones((1000,1000))
npary2 = np.ones((1000,1000))
print('Python integers')
%timeit inplace(int1, 1)
%timeit assignment(int2, 1)
print('\nPython lists')
%timeit inplace(list1, [1])
%timeit assignment(list2, [1])
print('\nNumPy Arrays')
%timeit inplace(npary1, 1)
%timeit assignment(npary2, 1)
What I would expect is a similar difference as for the Python integers when I used the in-place operators on NumPy arrays, however the results are completely different:
Python integers
1000000 loops, best of 3: 265 ns per loop
1000000 loops, best of 3: 249 ns per loop
Python lists
1000000 loops, best of 3: 449 ns per loop
1000000 loops, best of 3: 638 ns per loop
NumPy Arrays
100 loops, best of 3: 3.76 ms per loop
100 loops, best of 3: 6.6 ms per loop
Each call to assignment(npary2, 1) requires creating a new one million element array. Consider how much time it takes just to allocate a (1000, 1000)-shaped array of ones:
In [21]: %timeit np.ones((1000, 1000))
100 loops, best of 3: 3.84 ms per loop
Allocating this new temporary array takes about 3.84 ms on my machine, which is on the right order of magnitude to explain the entire difference between inplace(npary1, 1) and assignment(npary2, 1):
In [12]: %timeit inplace(npary1, 1)
1000 loops, best of 3: 1.8 ms per loop
In [13]: %timeit assignment(npary2, 1)
100 loops, best of 3: 4.04 ms per loop
So, given that allocation is a relatively slow process, it makes sense that in-place addition is significantly faster than assignment to a new array.
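The difference between the two forms can be observed directly. This is a small sketch (variable names are illustrative) that checks whether the data buffer is reused: a += 1 writes into the existing buffer, while a + 1 produces a separate array.

```python
import numpy as np

a = np.ones((1000, 1000))
buf_before = a.__array_interface__["data"][0]  # address of the data buffer

a += 1  # in-place: reuses the existing buffer
same_after_inplace = a.__array_interface__["data"][0] == buf_before

b = a + 1  # allocates a new array for the result
shares = np.shares_memory(a, b)

print(same_after_inplace, shares)
```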
NumPy operations on NumPy arrays may be fast, but creation of NumPy arrays is relatively slow. Consider, for example, how much more time it takes to create a NumPy array than a Python list:
In [14]: %timeit list()
10000000 loops, best of 3: 106 ns per loop
In [15]: %timeit np.array([])
1000000 loops, best of 3: 563 ns per loop
This is one reason why it is generally better to use one large NumPy array (allocated once) rather than thousands of small NumPy arrays.

Iterating over arrays in cython, is list faster than np.array?

TLDR: in cython, why (or when?) is iterating over a numpy array faster than iterating over a python list?
Generally:
I've used Cython before and was able to get tremendous speed-ups over a naive Python implementation.
However, figuring out what exactly needs to be done seems non-trivial.
Consider the following 3 implementations of a sum() function.
They reside in a Cython file called 'cy' (obviously, there's np.sum(), but that's beside the point).
Naive python:
def sum_naive(A):
    s = 0
    for a in A:
        s += a
    return s
Cython with a function that expects a python list:
def sum_list(A):
    cdef unsigned long s = 0
    for a in A:
        s += a
    return s
Cython with a function that expects a numpy array.
def sum_np(np.ndarray[np.int64_t, ndim=1] A):
    cdef unsigned long s = 0
    for a in A:
        s += a
    return s
I would expect that in terms of running time, sum_np < sum_list < sum_naive, however, the following script demonstrates to the contrary (for completeness, I added np.sum() )
N = 1000000
v_np = np.array(range(N))
v_list = range(N)
%timeit cy.sum_naive(v_list)
%timeit cy.sum_naive(v_np)
%timeit cy.sum_list(v_list)
%timeit cy.sum_np(v_np)
%timeit v_np.sum()
with results:
In [18]: %timeit cyMatching.sum_naive(v_list)
100 loops, best of 3: 18.7 ms per loop
In [19]: %timeit cyMatching.sum_naive(v_np)
1 loops, best of 3: 389 ms per loop
In [20]: %timeit cyMatching.sum_list(v_list)
10 loops, best of 3: 82.9 ms per loop
In [21]: %timeit cyMatching.sum_np(v_np)
1 loops, best of 3: 1.14 s per loop
In [22]: %timeit v_np.sum()
1000 loops, best of 3: 659 us per loop
What's going on?
Why is cython+numpy slow?
P.S.
I do use
#cython: boundscheck=False
#cython: wraparound=False
There is a better way to implement this in cython that at least on my machine beats np.sum because it avoids type checking and other things that numpy normally has to do when dealing with an arbitrary array:
#cython: wraparound=False
#cython: boundscheck=False
cimport numpy as np
def sum_np(np.ndarray[np.int64_t, ndim=1] A):
    cdef unsigned long s = 0
    for a in A:  # Python-style iteration: each element is an untyped object
        s += a
    return s
def sum_np2(np.int64_t[::1] A):
    cdef:
        unsigned long s = 0
        size_t k
    for k in range(A.shape[0]):  # typed index loop, compiled to plain C
        s += A[k]
    return s
And then the timings:
N = 1000000
v_np = np.array(range(N))
v_list = range(N)
%timeit sum(v_list)
%timeit sum_naive(v_list)
%timeit np.sum(v_np)
%timeit sum_np(v_np)
%timeit sum_np2(v_np)
10 loops, best of 3: 19.5 ms per loop
10 loops, best of 3: 64.9 ms per loop
1000 loops, best of 3: 1.62 ms per loop
1 loops, best of 3: 1.7 s per loop
1000 loops, best of 3: 1.42 ms per loop
You don't want to iterate over the NumPy array Python-style (for a in A); instead, access elements by index, since that can be translated into pure C rather than going through the Python API.
a is untyped, so there will be lots of conversions between Python and C types. These can be slow.
As JoshAdel correctly pointed out, instead of iterating over the array itself, you should iterate over a range and index into the array. Cython will convert the indexing to C, which is fast.
Using cython -a myfile.pyx will highlight these sorts of things for you; you want all of your loop logic to be white for maximum speed.
PS: Note that np.ndarray[np.int64_t, ndim=1] is outdated and has been deprecated in favour of the faster and more general typed memoryview syntax, e.g. long[:].
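The cost of Python-style iteration is visible even outside Cython. In this small sketch (names are illustrative), iterating over a NumPy array hands back a freshly created NumPy scalar object for each element, while iterating over a list returns the Python int that is already stored there. This is one plausible reason why the for a in A loops above are slower on arrays than on lists.

```python
import numpy as np

v_np = np.arange(5, dtype=np.int64)
v_list = list(range(5))

first_np = next(iter(v_np))      # a NumPy scalar object, created on the fly
first_list = next(iter(v_list))  # the int object already stored in the list

print(type(first_np).__name__, type(first_list).__name__)
```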
