Speed up Cython code - Python

I have code that works in Python and want to use Cython to speed up the calculation. The function that I've copied below lives in a .pyx file and gets called from my Python code. V, C, train and I_k are 2-D numpy arrays, and lambda_u, user and hidden are ints.
I don't have any experience with C or Cython. What is an efficient way to make this code faster?
Compiling with cython -a shows me that the code is flawed (most lines still go through Python), but how can I improve it? Using for i in prange(user_size, nogil=True): results in "Constructing Python slice object not allowed without gil".
How does the code have to be modified to harness the power of Cython?
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def u_update(V, C, train, I_k, lambda_u, user, hidden):
    cdef int user_size = user
    cdef int hidden_dim = hidden
    cdef np.ndarray U = np.empty((hidden_dim, user_size), float)
    cdef int m = C.shape[1]
    for i in range(user_size):
        C_i = np.zeros((m, m), dtype=float)
        for j in range(m):
            C_i[j, j] = C[i, j]
        U[:, i] = np.dot(np.linalg.inv(np.dot(V, np.dot(C_i, V.T)) + lambda_u * I_k),
                         np.dot(V, np.dot(C_i, train[i, :].T)))
    return U

You are trying to use cython by diving into the deep end of the pool. You should start with something small, such as some of the numpy examples. Or even try to improve on np.diag:
i = 0
C_i = np.zeros((m, m), dtype=float)
for j in range(m):
    C_i[j, j] = C[i, j]
versus
C_i = np.diag(C[i, :])
Can you improve the speed of this simple expression? diag is not compiled, but it does perform an efficient indexed assignment.
res[:n-k].flat[i::n+1] = v
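Applied to the loop in the question, the same strided trick would fill the diagonal in one vectorized assignment (a small sketch, assuming m = C.shape[1] as above):
C_i = np.zeros((m, m), dtype=float)
C_i.flat[::m + 1] = C[i, :]    # same elements as np.diag(C[i, :]), written in one step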
But the real problem for cython is this expression:
U[:,i] = np.dot(np.linalg.inv(np.dot(V, np.dot(C_i,V.T)) + lambda_u*I_k), np.dot(V, np.dot(C_i,train[i,:].T)))
np.dot is compiled. cython won't turn that into C code, nor will it consolidate all 5 dots into one expression. It also won't touch the inv. So at best cython will speed up the iteration wrapper, but it will still call this Python expression user_size times.
My guess is that this expression can be cleaned up. Replacing the inner dots with einsum can probably eliminate the need for C_i. The inv might make 'vectorizing' the whole thing difficult. But I'd have to study it more.
But if you want to stick with the cython route, you need to transform that U expression into simple iterative code, without calls to numpy functions like dot and inv.
===================
I believe the following are equivalent:
np.dot(C_i,V.T)
C[i,:,None]*V.T
In:
np.dot(C_i,train[i,:].T)
if train is 2d, then train[i,:] is 1d, and the .T does nothing.
In [289]: np.dot(np.diag([1,2,3]),np.arange(3))
Out[289]: array([0, 2, 6])
In [290]: np.array([1,2,3])*np.arange(3)
Out[290]: array([0, 2, 6])
If I got that right, you don't need C_i.
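A quick check of that equivalence with small random arrays (illustrative only; shapes chosen to mimic the question):
import numpy as np
V = np.random.rand(4, 6)           # (hidden, m)
c = np.random.rand(6)              # one row of C
t = np.random.rand(6)              # one row of train
C_i = np.diag(c)
print(np.allclose(np.dot(C_i, V.T), c[:, None] * V.T))    # True
print(np.allclose(np.dot(C_i, t), c * t))                  # True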
======================
Furthermore, these calculations can be moved outside the loop, with expressions like (not tested):
CV1 = C[:, :, None] * V.T          # a 3d array, (user, m, hidden)
CV2 = C * train                    # (user, m); train[i,:].T is 1d, so no transpose needed
for i in range(user_size):
    U[:, i] = np.dot(np.linalg.inv(np.dot(V, CV1[i, ...]) + lambda_u * I_k), np.dot(V, CV2[i, ...]))
A further step is to move both np.dot(V, CV...) products out of the loop. That may require np.matmul (@) or np.einsum. Then we will have
for i in range(user_size):
    Ainv = np.linalg.inv(VCV1[i, ...] + lambda_u * I_k)
    U[:, i] = np.dot(Ainv, VCV2[i, ...])
or even
for i in range(user_size):
    Ainv[i, ...] = np.linalg.inv(VCV1[i, ...] + lambda_u * I_k)   # if inv can't be applied to the whole stack
U = np.einsum(..., Ainv, VCV2)
This is a rough sketch, and details will need to be worked out.
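Filling in those details, here is one way the whole loop might collapse (an untested sketch, assuming V has shape (hidden, m), C and train have shape (user, m), and I_k is the hidden x hidden identity, as the question's code implies):
# stack of matrices V @ diag(C[i]) @ V.T, shape (user, hidden, hidden)
A = np.einsum('hj,ij,gj->ihg', V, C, V) + lambda_u * I_k
# stack of vectors V @ (C[i] * train[i]), shape (user, hidden)
B = np.einsum('hj,ij->ih', V, C * train)
# solve the stacked systems instead of inverting; np.linalg.solve broadcasts over the leading axis
U = np.linalg.solve(A, B[:, :, None])[:, :, 0].T    # shape (hidden, user)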

The first thing that comes to mind is that you haven't typed the function arguments or specified the data type and number of dimensions, like so:
def u_update(np.ndarray[np.float64_t, ndim=2] V, np.ndarray[np.float64_t, ndim=2] C,
             np.ndarray[np.float64_t, ndim=2] train, np.ndarray[np.float64_t, ndim=2] I_k,
             int lambda_u, int user, int hidden):
This will greatly speed up indexing with 2 indices like you do in the inner loop.
It's best to do this to the array U as well, although you are using slicing:
cdef np.ndarray[np.float64_t, ndim=2] U = np.empty((hidden_dim, user_size), dtype=np.float64)
Next, you are redefining C_i, a large 2-D array, on every iteration of the outer loop. Also, you have not supplied any type information for it, which is a must if Cython is to offer any speedup. To fix this:
cdef np.ndarray[np.float64_t, ndim=2] C_i = np.zeros((m, m), dtype=np.float64)
for i in range(user_size):
    C_i.fill(0)
Here, we have defined it once (with type information), and reused the memory by filling with zeros instead of calling np.zeros() to make a new array every time.
Also, you might want to turn off bounds checking only after you have finished debugging.
If you need speedups in the U[:,i]=... step, you could consider writing another function with Cython to perform those operations using loops.
Do read this tutorial which should give you an idea of what to do when working with Numpy arrays in Cython and what not to do as well, and also to appreciate how much of a speedup you can get with these simple changes.
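Putting those pieces together, a hedged sketch of how the typed version of the question's function might look (the numerical body is unchanged from the question; only the declarations differ):
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)   # disable only after you have finished debugging
@cython.wraparound(False)
def u_update(np.ndarray[np.float64_t, ndim=2] V,
             np.ndarray[np.float64_t, ndim=2] C,
             np.ndarray[np.float64_t, ndim=2] train,
             np.ndarray[np.float64_t, ndim=2] I_k,
             int lambda_u, int user, int hidden):
    cdef int i, j
    cdef int user_size = user
    cdef int hidden_dim = hidden
    cdef int m = C.shape[1]
    cdef np.ndarray[np.float64_t, ndim=2] U = np.empty((hidden_dim, user_size), dtype=np.float64)
    cdef np.ndarray[np.float64_t, ndim=2] C_i = np.zeros((m, m), dtype=np.float64)
    for i in range(user_size):
        C_i.fill(0)
        for j in range(m):
            C_i[j, j] = C[i, j]
        U[:, i] = np.dot(np.linalg.inv(np.dot(V, np.dot(C_i, V.T)) + lambda_u * I_k),
                         np.dot(V, np.dot(C_i, train[i, :].T)))
    return U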

Related

How is numpy.einsum implemented?

I want to understand how the einsum function in Python is implemented. I found the source code in the numpy/core/src/multiarray/einsum.c.src file but couldn't completely understand it. In particular, I want to understand how it creates the required loops automatically.
For example:
import numpy as np

a = np.random.rand(2, 3, 4, 5)
b = np.random.rand(5, 3, 2, 4)

# This should loop over all four indices i, j, k, l.
# How does it create the loops for these indices automatically?
ll = np.einsum('ijkl, ljik ->', a, b)

# I assume that under the hood it does the following:
sum1 = 0
for i in range(2):
    for j in range(3):
        for k in range(4):
            for l in range(5):
                sum1 = sum1 + a[i, j, k, l] * b[l, j, i, k]
Thank you in advance
ps: This question is not about how to use numpy.einsum
I want to understand how it creates the required loops automatically?
Well, it does not create the loops the way you think it does. In this case, it creates an iterator operating over multiple arrays and then uses it in a generic main loop. In the more general case, there are two main loops: one to iterate over the output array items and one to perform a reduction.
The main function is PyArray_EinsteinSum. In your case, it takes an unoptimized path and ends up creating a basic iteration function based on the iterator created previously (i.e. iter). This function is get_sum_of_products_function. It basically analyzes the einsum operation so as to find the best sum-of-products function to call, based on a lookup table (like _outstride0_specialized_table). In your specific case, double_sum_of_products_outstride0_two is called. NumPy uses a template system to generate this function automatically at build time (*.c.src files are template files converted to *.c files based on predefined directives embedded in comments). In this case, the function is generated from #name#_sum_of_products_outstride0_#noplabel#, and once the template is expanded it gives something like the following function:
static void double_sum_of_products_outstride0_two(int nop,
                                                  char **dataptr,
                                                  npy_intp const *strides,
                                                  npy_intp count)
{
    npy_double accum = 0;
    char *data0 = dataptr[0];
    npy_intp stride0 = strides[0];
    char *data1 = dataptr[1];
    npy_intp stride1 = strides[1];

    while (count--)
    {
        accum += (*(npy_double *)data0) * (*(npy_double *)data1);
        data0 += stride0;
        data1 += stride1;
    }

    *((npy_double *)dataptr[2]) = (accum + (*((npy_double *)dataptr[2])));
}
As you can see, there is only one main loop iterating over the previously generated iterator. In your case, stride0 and stride1 are both equal to 8, data0 and data1 point to the raw input arrays, dataptr[2] points to the raw output and count is set to 120 initially. Note that the fact that both strides are equal to 8 is surprising at first glance, since the einsum does not iterate over the two arrays contiguously. This is because the second array is copied and reordered, as NumPy cannot create a uniform view based on the einsum parameters.
Note that the fallback case used for the example code is not particularly optimized, and it only produces one value. For example, the much more optimized double_sum_of_products_contig_contig_outstride0_two function can be called from unbuffered_loop_nop2_ndim2 for the following code:
import numpy as np

a = np.random.rand(3, 10)
b = np.random.rand(3, 10)

for i in range(1):
    ll = np.einsum('ij, ij -> i', a, b)
In this case, double_sum_of_products_contig_contig_outstride0_two performs the reduction for a given output item and unbuffered_loop_nop2_ndim2 iterates over the output array.
If the expression ij, ij -> j is instead used in the above code, then the function double_sum_of_products_contig_two is called, which operates the same way as double_sum_of_products_contig_contig_outstride0_two except that it reads/writes the whole output line during the reduction.
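As a sanity check on what those specialized kernels compute, the results match plain elementwise products and sums (an illustrative snippet, not taken from the NumPy sources):
import numpy as np

a = np.random.rand(2, 3, 4, 5)
b = np.random.rand(5, 3, 2, 4)
# the outstride0 kernel accumulates one scalar: sum over all i, j, k, l of a[i,j,k,l]*b[l,j,i,k]
print(np.allclose(np.einsum('ijkl, ljik ->', a, b), (a * b.transpose(2, 1, 3, 0)).sum()))   # True

a2 = np.random.rand(3, 10)
b2 = np.random.rand(3, 10)
# the contiguous kernels reduce one output item at a time
print(np.allclose(np.einsum('ij, ij -> i', a2, b2), (a2 * b2).sum(axis=1)))                 # True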

What is the Python 3 way to ensure the correct dimension of array arguments?

In my newbie Python 3.7 project, the arguments of many functions are numpy.ndarrays. These must be two-dimensional r x n matrices. The row dimension r is essential: some functions require 1 x n vectors, others 2 x n matrices, with r up to three and possibly more. There are also functions defined for any r x n array. (The column dimension n is not essential for design purposes.)
From my Matlab experience, this requirement can get confusing and error-prone. So I've considered the following approaches:
1. Document the method arguments (of course!)
2. Unit tests (of course!)
3. Do validation and throw exceptions inside some functions. (However, this is not very functional, nor performant.)
4. Define data classes: OneRow, TwoRows, ThreeRows and FourPlusRows. Each has an ndarray field, validated in the constructor. The upside includes type hints and better domain modelling, a la DDD. A downside is extra complexity.
Question: Given the type hints introduced in Python 3 and the trend towards functional programming, what's the current pythonic approach to this problem?
One of the best things about Python is duck typing, and Numpy is in general very compatible with that design approach. Say you have a vector-only function vecfunc. You can add some boilerplate to the beginning of the function that will inflate any 1D arrays into 1 x n vectors:
def vecfunc(arr):
    if arr.ndim == 1:
        arr = arr[None, :]
    ...function body goes here...
This will avoid any problems due to arr having too few dimensions, and will likely still give correct behavior in most cases. However, it doesn't do anything to prevent a user from passing in, say, an r x n x m array, or a 15 x n array. Ultimately, you're going to have to go with approach 3 for a bunch of this stuff and just throw some exceptions where it seems appropriate. For example:
def vecfunc(arr):
    if not 0 < arr.ndim < 3:
        raise ValueError("arr must have ndim of 1 or 2. arr.ndim: %d" % arr.ndim)
    elif arr.ndim == 1:
        arr = arr[None, :]
If it makes you feel any better, the code bases of both numpy and scipy have those kinds of shape-based exception checks in a number of functions, when and where they're needed.
Of course, you could always leave off adding those kinds of exception checks until the very end of developing any given function. You may be surprised at the range of input that produces reasonable behavior.
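If the same check recurs across many functions, it can be factored into a small helper and called at the top of each one (a sketch; ensure_rows is a hypothetical name, not an established API):
def ensure_rows(arr, r):
    """Raise if arr is not a 2-D array with exactly r rows."""
    if arr.ndim != 2 or arr.shape[0] != r:
        raise ValueError("expected a %d x n array, got shape %s" % (r, arr.shape))
    return arr

def two_row_func(arr):
    arr = ensure_rows(arr, 2)
    # ...function body goes here...
    return arr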
If you're dead set on type annotations, you can get something similar by writing your code using Cython. For example, if you wanted an add function that only took 2D integer arrays, you could write the following function in a .pyx file:
import numpy as np

def add(long[:, :] arr1, long[:, :] arr2):
    assert tuple(arr1.shape) == tuple(arr2.shape)
    result = np.zeros((arr1.shape[0], arr1.shape[1]), dtype=np.long)
    cdef long[:, :] result_view = result
    for x in range(arr1.shape[0]):
        for y in range(arr1.shape[1]):
            result_view[x, y] = arr1[x, y] + arr2[x, y]
    return result
For more details on writing and compiling Cython, see the docs linked above.
This isn't so much "type annotations" as it is actual strong typing, but it may do what you want. Sadly, I wasn't able to find a way to fix the size of a single dimension, just the total number of dimensions.

Defining more complicated static arrays

Quite often in numerical methods one has a lot of coefficients which are static, as they are fixed for the specific method. I was wondering what the best way is in Cython / C to set such arrays or variables.
In my case, Runge-Kutta integration methods are mostly the same, except for the coefficients and the number of stages. Right now I'm doing something like this (simplified):
# Define some struct such that it can be used for all different Runge-Kutta methods
ctypedef struct RKHelper:
    int numStages
    double* coeffs

cdef:
    RKHelper firstRKMethod
    # Later secondRKMethod, thirdRKMethod, etc.

firstRKMethod.numStages = 3
firstRKMethod.coeffs = <double*> malloc(firstRKMethod.numStages*sizeof(double))
# Arrays can be large and most entries are zero
for ii in range(firstRKMethod.numStages):
    firstRKMethod.coeffs[ii] = 0.
# Set non-zero elements
firstRKMethod.coeffs[2] = 1.3
Some points:
I know that malloc isn't for static arrays, but I don't know how to declare "numStages" or "RKHelper" as static in Cython, so I can't use a static array... Or I do something like "double[4]" in RKHelper, which doesn't allow me to use the same struct definition for all RK methods.
I'm wondering if there is a better way than a loop. I don't want to set the whole array manually (e.g. array = [0., 0., 0., 0., 1.3, ... lots of numbers, mostly zero]).
As far as I can see there are no "real" static variables in Cython, are there?
Is there a nicer way of doing what I want to do?
Cheers
One way to achieve what you want is to set the coefficients of the Runge-Kutta scheme as global variables; that way you can use static arrays. This would be fast but definitely ugly.
The ugly solution:
cdef int numStages = 3
# Using the pointer notation you can set a static array
# as well as its elements in one go
cdef double* coeffs = [0., 0., 1.3]
# You can always change the coefficients further as you wish

def RungeKutta_StaticArray():
    # Do stuff
    # Just to check
    return numStages
A better solution would be to define a cython class with Runge Kutta coefficients as its members
The elegant solution:
cdef class RungeKutta_StaticArrayClass:
    cdef double* coeffs
    cdef int numStages

    def __cinit__(self):
        # Note that due to the static nature of self.coeffs, its elements
        # expire beyond the scope of this function
        self.coeffs = [0., 0., 1.3]
        self.numStages = 3

    def GetnumStages(self):
        return self.numStages

    def Integrate(self):
        # Reset self.coeffs
        self.coeffs = [0., 0., 0., 0., 0.8, 2.1]
        # Perform integration
Regarding your question of setting the elements, let's modify your own code with dynamically allocated arrays using calloc instead of malloc
The dynamically allocated version:
from libc.stdlib cimport calloc, free

ctypedef struct RKHelper:
    int numStages
    double* coeffs

def RungeKutta_DynamicArray():
    cdef:
        RKHelper firstRKMethod
    firstRKMethod.numStages = 3
    # Use calloc instead, it zero initialises the buffer, so you don't
    # need to set the elements to zero within a loop
    firstRKMethod.coeffs = <double*> calloc(firstRKMethod.numStages, sizeof(double))
    # Set non-zero elements
    firstRKMethod.coeffs[2] = 1.3
    free(firstRKMethod.coeffs)
    # Just to check
    return firstRKMethod.numStages
Let's do a somewhat nonsensical benchmark, to verify that the arrays were truly static (i.e. had no runtime cost) in the first two examples
In[1]: print(RungeKutta_DynamicArray())
3
In[2]: print(RungeKutta_StaticArray())
3
In[3]: obj = RungeKutta_StaticArrayClass()
In[4]: print(obj.GetnumStages())
3
In[5]: %timeit RungeKutta_DynamicArray()
10000000 loops, best of 3: 65.2 ns per loop
In[6]: %timeit RungeKutta_StaticArray()
10000000 loops, best of 3: 25.2 ns per loop
In[6]: %timeit RungeKutta_StaticArrayClass()
10000000 loops, best of 3: 49.6 ns per loop
The RungeKutta_StaticArray version essentially has close to a no-op cost, implying no runtime penalty for the array allocation. You can choose to declare the coeffs within this function and the timing will still be the same. The RungeKutta_StaticArrayClass, despite the overhead of setting up a class with its members and constructor, is still faster than the dynamically allocated version.
Why not just use a numpy array? Practically it isn't actually static (see note at end), but you can allocate it at the global scope so it's created on module start-up. You can also access the raw C array underneath so there's no real efficiency cost.
import numpy as np
# at module global scope
cdef double[::1] rk_coeffs = np.zeros((50,)) # avoid having to manually fill with 0s
# illustratively fill the non-zero elements
rk_coeffs[1] = 2.0
rk_coeffs[3] = 5.0
# if you need to convert to a C array
cdef double* rk_coeffs_ptr = &rk_coeffs[0]
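For illustration, a hedged sketch of how that raw pointer might be consumed in a tight nogil helper (rk_weighted_sum is a hypothetical name, not something defined above):
cdef double rk_weighted_sum(double* coeffs, double[::1] stages) nogil:
    # e.g. combine Runge-Kutta stage values with the fixed coefficients
    cdef Py_ssize_t i
    cdef double total = 0.0
    for i in range(stages.shape[0]):
        total += coeffs[i] * stages[i]
    return total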
Note My reading of the question is that you're using "static" to mean "compiled into the module" rather than any of the numerous C-related definitions or anything to do with Python static methods/class variables.

Cython function with variable sized matrix input

I am trying to convert part of a native Python function to Cython to improve the compute time. I would like to write a Cython function just for the loop component that is taking up the time (as ipython lprun kindly told me). However, this function takes in variably sized matrices, and I can't see how to bring that across easily to statically typed Cython.
for index1 in range(0, num_products):
    for index2 in range(0, num_products):
        cond_prob = (data[index1] * data[index2]).sum() / max(col_sums[index1], col_sums[index2])
        prox[index1][index2] = cond_prob
The issue is that num_products changes from year to year, so the matrix (data) size is variable.
What is the best strategy here?
Should I write two C functions: one to create a matrix of a certain dimension using malloc, and one to do the loops over the created matrix?
Is there some fancy cython/numpy wizardry to help in this scenario? Can I write a C function that takes in a variably sized numpy array in memory and pass in the size?
Cython code is (strategically) statically typed, but that doesn't mean that arrays must have a fixed size. In straight C passing a multidimensional array to a function can be a little awkward maybe, but in Cython you should be able to do something like the following:
Note I took the function and variable names from your follow-up question.
import numpy as np
cimport numpy as np
cimport cython
@cython.boundscheck(False)
@cython.cdivision(True)
def cooccurance_probability_cy(double[:,:] X):
    cdef int P, i, j, k
    P = X.shape[0]
    cdef double item
    cdef double [:] CS = np.sum(X, axis=1)
    cdef double [:,:] D = np.empty((P, P), dtype=np.float64)
    for i in range(P):
        for j in range(P):
            item = 0
            for k in range(P):
                item += X[i,k] * X[j,k]
            D[i,j] = item / max(CS[i], CS[j])
    return D
On the other hand, using just Numpy should also be quite fast for this problem, if you use the right functions and some broadcasting. In fact, as the calculation complexity is dominated by the matrix multiplication, I found the following is much faster than the Cython code above (np.inner uses a highly optimized BLAS routine):
def new(X):
    CS = np.sum(X, axis=1, keepdims=True)
    D = np.inner(X, X) / np.maximum(CS, CS.T)
    return D
Have you tried getting rid of the for loops in numpy?
For the first part of your expression you could, for example, try:
(data[np.newaxis, :] * data[:, np.newaxis]).sum(2)
If memory is an issue, you can also use the np.einsum() function.
For the second part one could probably also cook up a numpy expression (a bit more difficult), if you've not already tried that.
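For concreteness, a hedged sketch of both parts with np.einsum (assuming data has shape (num_products, n) and col_sums holds the corresponding row sums, as in the question):
numerator = np.einsum('ik,jk->ij', data, data)                       # pairwise dot products, no (P, P, n) temporary
prox = numerator / np.maximum(col_sums[:, None], col_sums[None, :])  # divide by the larger of the two row sums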

Solve large number of small equation systems in numpy

I have a large number of small linear equation systems that I'd like to solve efficiently using numpy. Basically, given A[:,:,:] and b[:,:], I wish to find x[:,:] given by A[i,:,:].dot(x[i,:]) = b[i,:]. So if I didn't care about speed, I could solve this as
for i in range(n):
    x[i,:] = np.linalg.solve(A[i,:,:], b[i,:])
But since this involved explicit looping in python, and since A typically has a shape like (1000000,3,3), such a solution would be quite slow. If numpy isn't up to this, I could do this loop in fortran (i.e. using f2py), but I'd prefer to stay in python if possible.
For those coming back to read this question now, I thought I'd save others some time and mention that numpy now handles this using broadcasting.
So, in numpy 1.8.0 and higher, the following can be used to solve N linear equations.
x = np.linalg.solve(A,b)
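For concreteness, a small sketch with the shapes from the question; keeping an explicit trailing axis on b avoids any ambiguity about whether a 2-D right-hand side is treated as a stack of vectors or as a single matrix:
import numpy as np

n = 1000000
A = np.random.random((n, 3, 3))
b = np.random.random((n, 3))
x = np.linalg.solve(A, b[:, :, None])[:, :, 0]    # x[i] solves A[i] @ x[i] = b[i]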
I guess answering yourself is a bit of a faux pas, but this is the Fortran solution I have at the moment, i.e. what the other solutions are effectively competing against, both in speed and brevity.
function pixsolve(A, b) result(x)
  implicit none
  real*8    :: A(:,:,:), b(:,:), x(size(b,1),size(b,2))
  integer*4 :: i, n, m, piv(size(b,1)), err
  n = size(A,3); m = size(A,1)
  x = b
  do i = 1, n
    call dgesv(m, 1, A(:,:,i), m, piv, x(:,i), m, err)
  end do
end function
This would be compiled as:
f2py -c -m foo{,.f90} -llapack -lblas
And called from python as
x = foo.pixsolve(A.T, b.T).T
(The .Ts are needed due to a poor design choice in f2py, which causes unnecessary copying, inefficient memory access patterns and unnatural-looking Fortran indexing if the .Ts are left out.)
This also avoids a setup.py etc. I have no bone to pick with fortran (as long as strings aren't involved), but I was hoping that numpy might have something short and elegant which could do the same thing.
I think you're wrong about the explicit looping being a problem. Usually it's only the innermost loop that's worth optimizing, and I think that holds true here. For example, we can measure the cost of the loop overhead vs the cost of the actual computation:
import numpy as np

n = 10**6
A = np.random.random(size=(n, 3, 3))
b = np.random.random(size=(n, 3))
x = b*0

def f():
    for i in range(n):
        x[i,:] = np.linalg.solve(A[i,:,:], b[i,:])

np.linalg.pseudosolve = lambda a, b: b

def g():
    for i in range(n):
        x[i,:] = np.linalg.pseudosolve(A[i,:,:], b[i,:])
which gives me
In [66]: time f()
CPU times: user 54.83 s, sys: 0.12 s, total: 54.94 s
Wall time: 55.62 s
In [67]: time g()
CPU times: user 5.37 s, sys: 0.01 s, total: 5.38 s
Wall time: 5.40 s
IOW, it's only spending 10% of its time doing anything other than actually solving your problem. Now, I could totally believe that np.linalg.solve itself is too slow for you relative to what you could get out of Fortran, and so you want to do something else. That's probably especially true on small problems, come to think of it: IIRC I once found it faster to unroll certain small solutions by hand, although that was a while back.
But by itself, it's not true that using an explicit loop on the first index will make the overall solution quite slow. If np.linalg.solve is fast enough, the loop won't change it much here.
I think you can do it in one go, with a (3n x 3n) matrix composed of 3x3 blocks along the diagonal.
Not tested:
n = A.shape[0]
b_new = b.reshape(3*n)                     # stack the right-hand sides into one long vector
A_new = np.zeros(shape=(3*n, 3*n))
for i in range(n):
    A_new[3*i:3*(i+1), 3*i:3*(i+1)] = A[i,:,:]
x = np.linalg.solve(A_new, b_new).reshape(n, 3)
