Achieving Numba's performance with Cython

Usually I'm able to match Numba's performance when using Cython. However, in this example I have failed to do so: Numba is about 4 times faster than my Cython version.
Here is the Cython version:
%%cython -c=-march=native -c=-O3
cimport numpy as np
import numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_where(double[::1] df):
    cdef int i
    cdef int n = len(df)
    cdef np.ndarray[dtype=double] output = np.empty(n, dtype=np.float64)
    for i in range(n):
        if df[i] > 0.5:
            output[i] = 2.0*df[i]
        else:
            output[i] = df[i]
    return output
And here is the Numba version:
import numba as nb
import numpy as np

@nb.njit
def nb_where(df):
    n = len(df)
    output = np.empty(n, dtype=np.float64)
    for i in range(n):
        if df[i] > 0.5:
            output[i] = 2.0*df[i]
        else:
            output[i] = df[i]
    return output
When tested, the Cython version is on par with numpy's where, but is clearly inferior to Numba:
# Python 3.6 + Cython 0.28.3 + gcc-7.2
import numpy as np
np.random.seed(0)
n = 10000000
data = np.random.random(n)
assert (cy_where(data)==nb_where(data)).all()
assert (np.where(data>0.5,2*data, data)==nb_where(data)).all()
%timeit cy_where(data) # 179ms
%timeit nb_where(data) # 49ms (!!)
%timeit np.where(data>0.5,2*data, data) # 278 ms
What is the reason for Numba's performance and how can it be matched when using Cython?
As suggested by @max9111, I also tried eliminating the stride by using a contiguous memory view for the output, but this doesn't improve performance much:
@cython.boundscheck(False)
@cython.wraparound(False)
def cy_where_cont(double[::1] df):
    cdef int i
    cdef int n = len(df)
    cdef np.ndarray[dtype=double] output = np.empty(n, dtype=np.float64)
    cdef double[::1] view = output  # contiguous view of the output!
    for i in range(n):
        if df[i] > 0.5:
            view[i] = 2.0*df[i]
        else:
            view[i] = df[i]
    return output
%timeit cy_where_cont(data) # 165 ms

This seems to be completely driven by optimizations that LLVM is able to make. If I compile the Cython example with clang, performance between the two examples is identical. For what it's worth, MSVC on Windows shows a similar performance gap relative to Numba.
$ CC=clang ipython
<... setup code>
In [7]: %timeit cy_where(data)
   ...: %timeit nb_where(data)
30.8 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
30.2 ms ± 498 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Interestingly, compiling the original Numpy code with pythran, using clang as a backend, yields the same performance as the Numba version.
import numpy as np

#pythran export work(float64[])
def work(df):
    return np.where(df > 0.5, 2*df, df)
Compiled with
CXX=clang++ CC=clang pythran pythran_work.py -O3 -march=native
and the benchmark session:
import numpy as np
np.random.seed(0)
n = 10000000
data = np.random.random(n)
import numba_work, pythran_work
%timeit numba_work.work(data)
12.7 ms ± 20 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pythran_work.work(data)
12.7 ms ± 32.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Related

How can I speed up iterating over a large list and summing values

import random

# Generate test data
test = list(range(150))
groups = []
for _ in range(75_000):
    groups.append(random.sample(test, 6))
Set up the variables as numpy arrays:
# Best version
import numpy as np
import random
from numba import jit  # kind of optional, see below
# Generate test data
test = list(range(150))
groups = np.array([random.sample(test, 6) for _ in range(75_000)])
# This will change every time but just leaving the same for example
scores_dict = {i: random.uniform(0, 120) for i in range(150)}
scores = np.array(list(scores_dict.items()))
Here's the vectorized version using numpy's sum and take:
def fun1(scores, groups):
    for _ in range(6250):
        c = np.sum(np.take(scores[:, 1], groups), axis=1)
    return c
%timeit fun1(scores, groups)  # the whole %timeit run takes ~2.5 mins
18.6 s ± 625 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you really want to go all out, you can try using numba on top of numpy:
@jit(nopython=True)
def fun2(scores, groups):
    for _ in range(6250):
        c = np.sum(np.take(scores[:, 1], groups), axis=1)
    return c
%timeit fun2(scores, groups)  # the whole %timeit run takes ~1.2 mins
10.1 s ± 1.32 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
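A further step worth trying (my own sketch, not from the original answer; fun3 is a hypothetical name): replace take/sum with explicit loops, so numba needs no temporary arrays. This computes the group sums in a single pass:

import numpy as np
from numba import jit

@jit(nopython=True)
def fun3(scores, groups):
    # scores[:, 1] holds the per-item score; groups holds row indices into scores.
    n, k = groups.shape
    c = np.empty(n)
    for i in range(n):
        s = 0.0
        for j in range(k):
            s += scores[groups[i, j], 1]
        c[i] = s
    return c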

Can I do better on filtering numpy array

I have a somewhat contrived example to cythonize, where I want a function to:
accept a 1D numpy array of arbitrary length (roughly 100,000 to 1,000,000 np.float64 values)
do some filtering on it
return results as a new [numpy?] array of the same length
The code and profiling is as follows:
%%cython -a
from libc.stdlib cimport malloc, free
from cython cimport boundscheck, wraparound
import numpy as np

@boundscheck(False)
@wraparound(False)
def func_memview(double[:] arr):
    cdef:
        int N = arr.shape[0], i
        double *out_ptr = <double *> malloc(N * sizeof(double))
        double[:] out = <double[:N]>out_ptr
    for i in range(1, N):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.
    free(out_ptr)
    return np.asarray(out)
My question is: can I do any better with this?
As DavidW has pointed out, your code has issues with memory management: out_ptr is freed while the returned array still refers to that memory. It would be better to use a numpy array directly:
%%cython
from cython cimport boundscheck, wraparound
import numpy as np

@boundscheck(False)
@wraparound(False)
def func_memview_correct(double[:] arr):
    cdef:
        int N = arr.shape[0], i
        double[:] out = np.empty(N)
    for i in range(1, N):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.0
    return np.asarray(out)
It is about as fast as the faulty original version:
import numpy as np
np.random.seed(0)
k = np.random.rand(5*10**7)
%timeit func_memview(k) # 413 ms ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func_memview_correct(k) # 412 ms ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The question is how this code could be made faster. The most obvious options are:
Parallelization.
Using vectorization/SIMD instructions.
It is notoriously hard to ensure that the C code generated by Cython gets vectorized; see for example this SO post. For many compilers it is necessary to use a contiguous memory view to improve the situation, i.e.:
%%cython -c=/O3
from cython cimport boundscheck, wraparound
import numpy as np

@boundscheck(False)
@wraparound(False)
def func_memview_correct_cont(double[::1] arr):  # <---- HERE
    cdef:
        int N = arr.shape[0], i
        double[::1] out = np.empty(N)  # <---- HERE
    for i in range(1, N):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.0
    return np.asarray(out)
On my machine it is not really much faster:
%timeit func_memview_correct_cont(k) # 402 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Other compilers might do better. However, I've often seen gcc and msvc struggle to produce optimal assembly for the kind of code typical for filtering (see for example this SO question). Clang is much better at this, so the easiest solution would probably be to use numba:
import numba as nb
import numpy as np

@nb.njit
def nb_func(arr):
    N = arr.shape[0]
    out = np.empty(N)
    for i in range(1, N):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.0
    return out
which outperforms the Cython code by almost a factor of 3:
%timeit nb_func(k) # 151 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
It is easy to parallelize the numba version using prange, but the win is not that big: the parallelized version runs in 116 ms on my machine.
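For reference, a minimal sketch of that parallel variant (assuming the same body as nb_func; the name nb_func_par is mine):

import numba as nb
import numpy as np

@nb.njit(parallel=True)
def nb_func_par(arr):
    N = arr.shape[0]
    out = np.empty(N)
    out[0] = 0.0  # the serial version leaves out[0] undefined; set it explicitly here
    for i in nb.prange(1, N):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.0
    return out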
To summarize: for this type of task my advice is to use numba. Using Cython is trickier, and the final performance depends on the compiler used in the background.

Why is numpy.var O(N) in space?

I have an array of ~13 GB. I call numpy.var on it to compute the variance. However, it allocates another ~13 GB to do this. Why does it need O(N) space? Or am I calling numpy.var in the wrong way?
import numpy as np
# data = ...
print('Variance: ', np.var(data))
NumPy will create an intermediate array to compute abs(data - data.mean()) ** 2 in order to compute the variance. You can write your own variance function with a loop and make it fast with Numba:
import numpy as np
import numba as nb

@nb.njit(parallel=True)
def var_nb(a, ddof=0):
    n = len(a)
    s = a.sum()
    m = s / n  # the mean always uses n; ddof only enters the variance denominator
    v = 0.0
    for i in nb.prange(n):
        v += abs(a[i] - m) ** 2
    return v / (n - ddof)
np.random.seed(100)
a = np.random.rand(100_000)
print(np.var(a))
# 0.08349747560941487
print(var_nb(a))
# 0.08349747560941487
%timeit np.var(a)
# 143 µs ± 414 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit var_nb(a)
# 40.2 µs ± 530 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This is faster without parallelization:
import numpy as np

def var(a: np.ndarray, axis: int = 0):
    return np.sum(abs(a - (a.sum(axis=axis) / len(a))) ** 2, axis=axis) / len(a)
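Note that this one-liner still allocates O(N) temporaries. If staying within memory is the real goal, a chunked two-pass variance keeps the temporaries small; a sketch (the function name and chunk size are my own choices):

import numpy as np

def var_chunked(a, chunk=1_000_000, ddof=0):
    # First pass: the mean needs no large temporary.
    m = a.sum() / a.size
    # Second pass: accumulate squared deviations chunk by chunk,
    # so peak extra memory is O(chunk) instead of O(N).
    ss = 0.0
    for i in range(0, a.size, chunk):
        d = a[i:i + chunk] - m
        ss += np.dot(d, d)
    return ss / (a.size - ddof)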

Cython Numpy Array Manipulation Slower than Python

I would like to optimize this Python code with Cython:
import numpy as np

def updated_centers(point, start, center):
    return np.array([__cluster_mean(point[start[c]:start[c + 1]], center[c])
                     for c in range(center.shape[0])])

def __cluster_mean(point, center):
    return (np.sum(point, axis=0) + center) / (point.shape[0] + 1)
My Cython code:
cimport cython
cimport numpy as np
import numpy as np

# C-compatible Numpy integer type.
DTYPE = np.intc

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing
@cython.cdivision(True)     # Deactivate checking for division by zero
def updated_centers(double[:, :] point, int[:] label, double[:, :] center):
    if (point.shape[0] != label.size) or (point.shape[1] != center.shape[1]) or (center.shape[0] > point.shape[0]):
        raise ValueError("Incompatible dimensions")
    cdef Py_ssize_t i, c, j
    cdef Py_ssize_t n = point.shape[0]
    cdef Py_ssize_t m = point.shape[1]
    cdef Py_ssize_t nc = center.shape[0]
    # Updated centers. We accumulate point and center contributions into this array.
    # Start by adding the (unscaled) center contributions.
    new_center = np.zeros([nc, m])
    new_center[:] = center
    # Counter array. Will contain cluster sizes (including the center, whose
    # contribution is again added here) at the end of the point loop.
    cluster_size = np.ones([nc], dtype=DTYPE)
    # Add point contributions.
    for i in range(n):
        c = label[i]
        cluster_size[c] += 1
        for j in range(m):
            new_center[c, j] += point[i, j]
    # Scale the center+point summation to be a mean.
    for c in range(nc):
        for j in range(m):
            new_center[c, j] /= cluster_size[c]
    return new_center
However, the Cython version is slower than the Python one:
Python: %timeit f.updated_centers(point, start, center)
331 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Cython: %timeit fx.updated_centers(point, label, center)
433 ms ± 14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The annotated HTML output of cython -a reveals that almost all lines are yellow: allocating the array, +=, /=. I expected Cython to be an order of magnitude faster. What am I doing wrong?
You need to tell Cython that new_center and cluster_size are arrays:
cdef double[:, :] new_center = np.zeros((nc, m))
...
cdef int[:] cluster_size = np.ones((nc,), dtype=DTYPE)
...
Without these type annotations, Cython cannot generate efficient C code and has to call into the Python interpreter whenever you access those arrays. This is why the lines in the HTML output of cython -a where you access these arrays were yellow.
With just these two small modifications we immediately see the speedup we want:
%timeit python_updated_centers(point, start, center)
392 ms ± 41.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit cython_updated_centers(point, start, center)
1.18 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
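For reference, a sketch of the full function with just those two declarations changed (everything else as in the question):

cimport cython
import numpy as np

DTYPE = np.intc

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
def cython_updated_centers(double[:, :] point, int[:] label, double[:, :] center):
    cdef Py_ssize_t i, c, j
    cdef Py_ssize_t n = point.shape[0]
    cdef Py_ssize_t m = point.shape[1]
    cdef Py_ssize_t nc = center.shape[0]
    # Typed memoryviews: element access below compiles to plain C indexing.
    cdef double[:, :] new_center = np.zeros((nc, m))
    cdef int[:] cluster_size = np.ones((nc,), dtype=DTYPE)
    new_center[:, :] = center
    for i in range(n):
        c = label[i]
        cluster_size[c] += 1
        for j in range(m):
            new_center[c, j] += point[i, j]
    for c in range(nc):
        for j in range(m):
            new_center[c, j] /= cluster_size[c]
    return np.asarray(new_center)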
For such simple kernels, you can also use pythran to get nice speedups:
#pythran export updated_centers(float64 [:, :], int32 [:], float64 [:, :])
import numpy as np

def updated_centers(point, start, center):
    return np.array([__cluster_mean(point[start[c]:start[c + 1]], center[c])
                     for c in range(center.shape[0])])

def __cluster_mean(point, center):
    return (np.sum(point, axis=0) + center) / (point.shape[0] + 1)
Compiled with pythran updated_centers.py, one gets the following timings:
Numpy code (same code, not compiled):
$ python -m perf timeit -s 'import numpy as np; n, m = 100000, 5; k = n//2; point = np.random.rand(n, m); start = 2*np.arange(k+1, dtype=np.int32); center=np.random.rand(k, m); from updated_centers import updated_centers' 'updated_centers(point, start, center)'
.....................
Mean +- std dev: 271 ms +- 12 ms
Pythran (after compilation):
$ python -m perf timeit -s 'import numpy as np; n, m = 100000, 5; k = n//2; point = np.random.rand(n, m); start = 2*np.arange(k+1, dtype=np.int32); center=np.random.rand(k, m); from updated_centers import updated_centers' 'updated_centers(point, start, center)'
.....................
Mean +- std dev: 12.8 ms +- 0.3 ms
The key is to write the Cython code like the Python code, accessing arrays only when necessary.
cimport cython
cimport numpy as np
import numpy as np

# C-compatible Numpy integer type.
DTYPE = np.intc

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing
@cython.cdivision(True)     # Deactivate checking for division by zero
def updated_centers(double[:, :] point, int[:] start, double[:, :] center):
    """Returns the updated list of cluster centers (damped center-of-mass Pahkira scheme).
    Cluster c (and center[c]) corresponds to the point range point[start[c]:start[c+1]]."""
    if (point.shape[1] != center.shape[1]) or (center.shape[0] > point.shape[0]) or (start.size != center.shape[0] + 1):
        raise ValueError("Incompatible dimensions")
    # Py_ssize_t is the proper C type for Python array indices.
    cdef Py_ssize_t i, c, j, cluster_start, cluster_stop, cluster_size
    cdef Py_ssize_t n = point.shape[0]
    cdef Py_ssize_t m = point.shape[1]
    cdef Py_ssize_t nc = center.shape[0]
    cdef double center_of_mass
    # Updated centers. We accumulate point and center contributions into this array.
    new_center = np.zeros([nc, m])
    cluster_start = start[0]
    for c in range(nc):
        cluster_stop = start[c + 1]
        cluster_size = cluster_stop - cluster_start + 1
        for j in range(m):
            center_of_mass = center[c, j]
            for i in range(cluster_start, cluster_stop):
                center_of_mass += point[i, j]
            new_center[c, j] = center_of_mass / cluster_size
        cluster_start = cluster_stop
    return np.asarray(new_center)
With the same API we get:
n, m = 100000, 5
k = n//2
point = np.random.rand(n, m)
start = 2*np.arange(k+1, dtype=np.intc)
center = np.random.rand(k, m)
%timeit fx.updated_centers(point, start, center)
31 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit f.updated_centers(point, start, center)
734 ms ± 17.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Numpy efficient matrix self-multiplication (gram matrix)

I want to compute B = A @ A.T in numpy. Obviously, the answer would be a symmetric matrix (i.e. B[i, j] == B[j, i]).
However, it is not clear to me how to leverage this easily to cut the computation time down in half (by only computing the lower triangle of B and then using that to get the upper triangle for free).
Is there a way to perform this optimally?
As noted in @PaulPanzer's link, dot can detect this case. Here's the timing proof:
In [355]: A = np.random.rand(1000,1000)
In [356]: timeit A.dot(A.T)
57.4 ms ± 960 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [357]: B = A.T.copy()
In [358]: timeit A.dot(B)
98.6 ms ± 805 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
See also: Numpy dot too clever about symmetric multiplications
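If you want to exploit the symmetry explicitly, SciPy exposes the BLAS routine ?syrk, which computes alpha * A @ A.T while writing only one triangle of the result; a sketch:

import numpy as np
from scipy.linalg.blas import dsyrk

A = np.random.rand(1000, 1000)
# Computes 1.0 * A @ A.T; only the upper triangle of C is filled (lower=0 by default).
C = dsyrk(1.0, A)
# Mirror the upper triangle to obtain the full symmetric matrix.
B = np.triu(C) + np.triu(C, 1).T
assert np.allclose(B, A @ A.T)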
You can always use sklearn's pairwise_distances.
Usage:
from sklearn.metrics.pairwise import pairwise_distances
gram = pairwise_distances(x, metric=metric)
Where metric is a callable or a string naming one of their implemented metrics (full list in the link above).
But I wrote this for myself a while back, so I can share what I did:
import numpy as np

def computeGram(elements, dist):
    n = len(elements)
    gram = np.zeros([n, n])
    for i in range(n):
        for j in range(i + 1):
            gram[i, j] = dist(elements[i], elements[j])
    upTriIdxs = np.triu_indices(n)
    gram[upTriIdxs] = gram.T[upTriIdxs]
    return gram
Where dist is a callable, in your case np.inner.
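Hypothetical usage for the Gram-matrix question (A being the matrix of row vectors):

import numpy as np

A = np.random.rand(100, 10)
gram = computeGram(A, np.inner)  # equivalent to A @ A.T
assert np.allclose(gram, A @ A.T)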
