Quite often in numerical methods one has a lot of coefficients which are static, as they are fixed for the specific method. I was wondering what's the best way in Cython / C to set such arrays or variables.
In my case Runge-Kutta integration methods are mostly the same, except for coefficients and the number of stages. Right now I'm doing something like this (simplified):
from libc.stdlib cimport malloc

# Define some struct such that it can be used for all different Runge-Kutta methods
ctypedef struct RKHelper:
    int numStages
    double* coeffs

cdef:
    RKHelper firstRKMethod
    # Later secondRKMethod, thirdRKMethod, etc.

firstRKMethod.numStages = 3
firstRKMethod.coeffs = <double*> malloc(firstRKMethod.numStages*sizeof(double))
# Arrays can be large and most entries are zero
for ii in range(firstRKMethod.numStages):
    firstRKMethod.coeffs[ii] = 0.
# Set non-zero elements
firstRKMethod.coeffs[2] = 1.3
Some points:
- I know that malloc isn't for static arrays, but I don't know how to declare "numStages" or "RKHelper" as static in Cython, so I can't use a static array... Or I could do something like "double[4]" in RKHelper, but that doesn't allow the same struct definition to be reused for all RK methods (see the sketch after this list).
- I'm wondering if there is a better way than a loop. I don't want to set the whole array manually (e.g. array = [0., 0., 0., 0., 1.3, ... lots of numbers, mostly zero]).
- As far as I can see there are no "real" static variables in Cython, are there?
- Is there a nicer way of doing what I want to do?
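For reference, the fixed-size variant alluded to in the first point would look like the sketch below (16 is an arbitrary upper bound picked for illustration); it trades flexibility for one maximum size shared by all methods:

ctypedef struct RKHelper:
    int numStages       # number of stages actually in use
    double coeffs[16]   # fixed upper bound, so no malloc/free is needed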
Cheers
One way to achieve what you want is to set the coefficients of the Runge-Kutta scheme as global variables; that way you can use static arrays. This would be fast but definitely ugly.
The ugly solution:
cdef int numStages = 3
# Using the pointer notation you can set a static array
# as well as its elements in one go
cdef double* coeffs = [0., 0., 1.3]
# You can always change the coefficients further as you wish

def RungeKutta_StaticArray():
    # Do stuff
    # Just to check
    return numStages
A better solution would be to define a Cython extension class with the Runge-Kutta coefficients as its members.
The elegant solution:
cdef class RungeKutta_StaticArrayClass:
    cdef double* coeffs
    cdef int numStages

    def __cinit__(self):
        # Note that due to the static nature of self.coeffs, its elements
        # expire beyond the scope of this function
        self.coeffs = [0., 0., 1.3]
        self.numStages = 3

    def GetnumStages(self):
        return self.numStages

    def Integrate(self):
        # Reset self.coeffs
        self.coeffs = [0., 0., 0., 0., 0.8, 2.1]
        # Perform integration
        pass
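The comment in __cinit__ above points at a real hazard: assigning a list literal to a double* leaves the pointer referencing a temporary. A hedged variant (invented name, minimal sketch) sidesteps this by copying the literal into storage the instance owns:

cdef class RungeKutta_SafeArrayClass:
    cdef double coeffs[8]   # fixed-size member owned by the instance
    cdef int numStages

    def __cinit__(self):
        # the array literal only lives through this function,
        # so copy it into the member array before it expires
        cdef double* init = [0., 0., 1.3]
        cdef int i
        self.numStages = 3
        for i in range(self.numStages):
            self.coeffs[i] = init[i]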
Regarding your question of setting the elements, let's modify your own code to use dynamically allocated arrays with calloc instead of malloc.
The dynamically allocated version:
from libc.stdlib cimport calloc, free

ctypedef struct RKHelper:
    int numStages
    double* coeffs

def RungeKutta_DynamicArray():
    cdef:
        RKHelper firstRKMethod

    firstRKMethod.numStages = 3
    # Use calloc instead, it zero-initialises the buffer, so you don't
    # need to set the elements to zero within a loop
    firstRKMethod.coeffs = <double*> calloc(firstRKMethod.numStages, sizeof(double))
    # Set non-zero elements
    firstRKMethod.coeffs[2] = 1.3
    free(firstRKMethod.coeffs)
    # Just to check
    return firstRKMethod.numStages
Let's do a somewhat nonsensical benchmark to verify that the arrays were truly static (i.e. had no runtime cost) in the first two examples:
In[1]: print(RungeKutta_DynamicArray())
3
In[2]: print(RungeKutta_StaticArray())
3
In[3]: obj = RungeKutta_StaticArrayClass()
In[4]: print(obj.GetnumStages())
3
In[5]: %timeit RungeKutta_DynamicArray()
10000000 loops, best of 3: 65.2 ns per loop
In[6]: %timeit RungeKutta_StaticArray()
10000000 loops, best of 3: 25.2 ns per loop
In[7]: %timeit RungeKutta_StaticArrayClass()
10000000 loops, best of 3: 49.6 ns per loop
The RungeKutta_StaticArray function has essentially close to a no-op cost, implying no runtime penalty for array allocation. You could equally declare the coeffs within this function and the timing would stay the same. RungeKutta_StaticArrayClass, despite the overhead of setting up a class with its members and constructor, is still faster than the dynamically allocated version.
Why not just use a numpy array? Practically it isn't actually static (see note at end), but you can allocate it at the global scope so it's created on module start-up. You can also access the raw C array underneath so there's no real efficiency cost.
import numpy as np
# at module global scope
cdef double[::1] rk_coeffs = np.zeros((50,)) # avoid having to manually fill with 0s
# illustratively fill the non-zero elements
rk_coeffs[1] = 2.0
rk_coeffs[3] = 5.0
# if you need to convert to a C array
cdef double* rk_coeffs_ptr = &rk_coeffs[0]
Note: My reading of the question is that you're using "static" to mean "compiled into the module" rather than any of the numerous C-related definitions or anything to do with Python static methods/class variables.
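For completeness, a minimal sketch of consuming those coefficients from C-level code (rk_weighted_sum is an invented name; it assumes the rk_coeffs_ptr declared above):

cdef double rk_weighted_sum(double* k_vals, int n) nogil:
    # plain pointer arithmetic on the buffer, so no GIL is needed
    cdef double acc = 0.0
    cdef int i
    for i in range(n):
        acc += rk_coeffs_ptr[i] * k_vals[i]
    return acc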
Related
I want to understand how the einsum function in numpy is implemented. I found the source code in the numpy/core/src/multiarray/einsum.c.src file but couldn't completely understand it. In particular I want to understand how it creates the required loops automatically.
For example:
import numpy as np

a = np.random.rand(2,3,4,5)
b = np.random.rand(5,3,2,4)

ll = np.einsum('ijkl, ljik ->', a, b)  # This should loop over all the
# four indices i,j,k,l. How does it create loops for these indices automatically?
# I assume that under the hood it does the following:
sum1 = 0
for i in range(2):
    for j in range(3):
        for k in range(4):
            for l in range(5):
                sum1 = sum1 + a[i,j,k,l]*b[l,j,i,k]
Thank you in advance
PS: This question is not about how to use numpy.einsum.
I want to understand how it creates the required loops automatically?
Well, it does not create the loops the way you think it does. In this case, it creates an iterator operating over multiple arrays and then uses it in a generic main loop. In the more general case, there are two main loops: one to iterate over the output array items and one to perform a reduction.
The main function is PyArray_EinsteinSum. In your case, it takes an unoptimized path and ends up creating a basic iteration function based on the iterator created previously (i.e. iter). This function is get_sum_of_products_function. It basically analyzes the einsum operation so as to find the best (sum of product) function to call, based on a lookup table (like _outstride0_specialized_table). In your specific case, double_sum_of_products_outstride0_two is called. Numpy uses a template system to generate this function automatically at build time (*.c.src files are template files converted to *.c files based on predefined basic comments). In this case, the function is generated from #name#_sum_of_products_outstride0_#noplabel# and once expanded by the template preprocessor it gives something like the following function:
static void double_sum_of_products_outstride0_two(int nop,
                                                  char **dataptr,
                                                  npy_intp const *strides,
                                                  npy_intp count)
{
    npy_double accum = 0;
    char *data0 = dataptr[0];
    npy_intp stride0 = strides[0];
    char *data1 = dataptr[1];
    npy_intp stride1 = strides[1];

    while (count--)
    {
        accum += (*(npy_double *)data0) * (*(npy_double *)data1);
        data0 += stride0;
        data1 += stride1;
    }

    *((npy_double *)dataptr[2]) = (accum + (*((npy_double *)dataptr[2])));
}
As you can see, there is only one main loop iterating over the previously generated iterator. In your case, stride0 and stride1 are both equal to 8, data0 and data1 are the raw input arrays, dataptr[2] points to the raw output array, and count is set to 120 initially. Note that the fact both strides are equal to 8 is surprising at first glance, since einsum does not iterate over the two arrays contiguously. This is because the second array is copied and reordered: Numpy cannot create a uniform view based on the einsum parameters.
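For intuition, here is a rough Python transliteration of that kernel (a sketch only; strides are counted in elements here, whereas the real code steps in bytes):

def sum_of_products_outstride0_two(data0, data1, out, stride0, stride1, count):
    # accumulate the sum of pairwise products, then add it into the
    # single output slot, mirroring the C kernel above
    accum = 0.0
    i0 = i1 = 0
    while count > 0:
        accum += data0[i0] * data1[i1]
        i0 += stride0
        i1 += stride1
        count -= 1
    out[0] += accum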
Note that the fallback case used for the example code is not particularly optimized and it only produces one value. For example, the much more optimized double_sum_of_products_contig_contig_outstride0_two function can be called from unbuffered_loop_nop2_ndim2 for the following code:
import numpy as np

a = np.random.rand(3, 10)
b = np.random.rand(3, 10)

for i in range(1):
    ll = np.einsum('ij, ij -> i', a, b)
In this case, double_sum_of_products_contig_contig_outstride0_two performs the reduction for a given output item and unbuffered_loop_nop2_ndim2 iterates over the output array.
If the expression ij, ij -> j is used instead in the above code, then the function double_sum_of_products_contig_two is called, which operates the same way as double_sum_of_products_contig_contig_outstride0_two except that it reads/writes the whole output line during the reduction.
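As an aside, if you want to see how Numpy would decompose a given expression when the optimized code paths are enabled, np.einsum_path reports the planned contraction order:

import numpy as np

a = np.random.rand(2, 3, 4, 5)
b = np.random.rand(5, 3, 2, 4)
path, description = np.einsum_path('ijkl,ljik->', a, b, optimize='optimal')
print(description)  # human-readable summary of the chosen contraction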
I want to create a large number of simulated samples from a model using Cython that I need to analyze later using Python. The result of one run of my simulation script should be a 10000 x 10000 array.
I have defined a function using def and tried to declare my arrays as cdef int my_array[10000][10000]. The my_script.pyx file compiles correctly, but when I run the script I get a "segmentation fault" error (I am on Linux).
Looking for a solution, I have learned that this issue is caused by allocating memory on the stack instead of the heap, so I decided to use PyMem_Malloc to allocate the memory. Here's kind of a minimum version of what I'm trying to do:
import cython
from cpython.mem cimport PyMem_Malloc
from libc.stdlib cimport rand, srand, RAND_MAX
from libc.time cimport time

srand(time(NULL))

def my_array_func(int a_param):
    cdef int i
    cdef int **my_array = <int **>PyMem_Malloc(sizeof(int *) * 10000)
    for i in range(10000):
        my_array[i] = <int *>PyMem_Malloc(sizeof(int) * 10000)
    cdef int j
    cdef int k
    for j in range(10000):
        for k in range(10000):
            my_array[j][k] = <int>(<float>rand()/RAND_MAX * a_param)
    return my_array
When I try to compile this file, I get the error Cannot convert 'int **' to Python object, which makes sense because my_array is not properly an array, so I guess it cannot be returned as a Python object (sorry, my knowledge of C is really, really rusty).
Is there a way to let the function return my 2D array such that it can be used as input to other Python functions? Another more than welcome solution might be to directly save the array in a file that can be imported later by a Python script.
Thanks.
In line with @DavidW's comment, when matrix computations are involved in Cython it is advisable to use numpy arrays to own the memory and to live in Python land.
In your case, it would look like this:
import cython
cimport numpy as np
import numpy as np
from libc.stdlib cimport rand, srand, RAND_MAX
from libc.time cimport time

srand(time(NULL))

def my_array_func(int a_param):
    cdef int n_rows = 10000, ncols = 10000
    # Mem alloc + Python object owning the memory
    # (np.intc matches the C int used by the memoryview below)
    cdef np.ndarray[dtype=int, ndim=2] my_array = np.empty((n_rows, ncols), dtype=np.intc)
    # Memoryview: iterate over my_array at C speed
    cdef int[:, ::1] my_array_view = my_array
    # Fill array
    cdef int i, j
    for i in range(n_rows):
        for j in range(ncols):
            # cast RAND_MAX to double so the division isn't integer division
            my_array_view[i, j] = <int> (rand() / <double> RAND_MAX * a_param)
    return my_array
Allocating an empty chunk of memory with defined size, making sure it is owned by a Python object and has all the nice array properties (like .shape) is what you get in a single line with the cdef np.ndarray[.... Looping over this array can be done with no Python interaction by using a memoryview.
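As a quick usage sketch (file name invented), the returned object is an ordinary ndarray, so saving it for later analysis, as asked, is plain Numpy I/O:

import numpy as np

arr = my_array_func(10)      # the compiled Cython function from above
np.save("samples.npy", arr)  # reload later with np.load("samples.npy")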
I have code that is working in python and want to use cython to speed up the calculation. The function that I've copied is in a .pyx file and gets called from my python code. V, C, train, I_k are 2-d numpy arrays and lambda_u, user, hidden are ints.
I don't have any experience in using C or Cython. What is an efficient way to make this code faster?
Compiling with cython -a shows me that the code falls back to Python in many places, but how can I improve it? Using for i in prange(user_size, nogil=True):
results in Constructing Python slice object not allowed without gil.
How does the code have to be modified to harness the power of Cython?
cimport cython
cimport numpy as np
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def u_update(V, C, train, I_k, lambda_u, user, hidden):
    cdef int user_size = user
    cdef int hidden_dim = hidden
    cdef np.ndarray U = np.empty((hidden_dim, user_size), float)
    cdef int m = C.shape[1]
    for i in range(user_size):
        C_i = np.zeros((m, m), dtype=float)
        for j in range(m):
            C_i[j, j] = C[i, j]
        U[:, i] = np.dot(np.linalg.inv(np.dot(V, np.dot(C_i, V.T)) + lambda_u*I_k),
                         np.dot(V, np.dot(C_i, train[i, :].T)))
    return U
You are trying to use cython by diving into the deep end of the pool. You should start with something small, such as some of the numpy examples. Or even try to improve on np.diag.
i = 0
C_i = np.zeros((m, m), dtype=float)
for j in range(m):
    C_i[j, j] = C[i, j]

vs.

C_i = np.diag(C[i, :])
Can you improve the speed of this simple expression? diag is not compiled, but it does perform an efficient indexed assignment.
res[:n-k].flat[i::n+1] = v
But the real problem for cython is this expression:
U[:,i] = np.dot(np.linalg.inv(np.dot(V, np.dot(C_i,V.T)) + lambda_u*I_k), np.dot(V, np.dot(C_i,train[i,:].T)))
np.dot is compiled. cython won't turn that into C code, nor will it consolidate all 5 dot calls into one expression. It also won't touch the inv. So at best cython will speed up the iteration wrapper, but it will still call this Python expression user_size times.
My guess is that this expression can be cleaned up. Replacing the inner dots with einsum can probably eliminate the need for C_i. The inv might make 'vectorizing' the whole thing difficult. But I'd have to study it more.
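For instance, since C_i is just diag(C[i,:]), both stacked products can be written directly with einsum (an untested sketch; shapes assumed from the code: V is (k, m), C and train are (user_size, m)); these are the VCV1/VCV2 quantities used further below:

VCV1 = np.einsum('km,im,lm->ikl', V, C, V)      # V @ diag(C[i]) @ V.T for every i
VCV2 = np.einsum('km,im,im->ik', V, C, train)   # V @ diag(C[i]) @ train[i] for every i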
But if you want to stick with the cython route, you need to transform that U expression into simple iterative code, without calls to numpy functions like dot and inv.
===================
I believe the following are equivalent:
np.dot(C_i,V.T)
C[i,:,None]*V.T
In:
np.dot(C_i,train[i,:].T)
if train is 2d, then train[i,:] is 1d, and the .T does nothing.
In [289]: np.dot(np.diag([1,2,3]),np.arange(3))
Out[289]: array([0, 2, 6])
In [290]: np.array([1,2,3])*np.arange(3)
Out[290]: array([0, 2, 6])
If I got that right, you don't need C_i.
======================
Furthermore, these calculations can be moved outside the loop, with expressions like (not tested)
CV1 = C[:,:,None]*V.T   # a 3d array
CV2 = C * train.T
for i in range(user_size):
    U[:,i] = np.dot(np.linalg.inv(np.dot(V, CV1[i,...]) + lambda_u*I_k), np.dot(V, CV2[i,...]))
A further step is to move both np.dot(V, CV...) calls out of the loop. That may require np.matmul (@) or np.einsum. Then we will have
for i in ...:
    I = np.linalg.inv(VCV1[i, ...] + lambda_u*I_k)
    U[:, i] = np.dot(I, VCV2[i, ...])
or even
for i in ...:
    I[i, ...] = np.linalg.inv(...)   # if inv can't be vectorized
U = np.einsum(..., I, VCV2)
This is a rough sketch, and details will need to be worked out.
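One possible end point, hedged and untested: np.linalg.inv and np.linalg.solve both broadcast over leading axes in modern numpy, so the loop can collapse entirely (and solve is generally preferable to inv followed by dot):

A = VCV1 + lambda_u * I_k       # (user_size, k, k); I_k broadcasts over the stack
U = np.linalg.solve(A, VCV2).T  # solve each system, then transpose to (k, user_size)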
The first thing that comes to mind is that you haven't typed the function arguments and specified the data type and number of dimensions, like so:

def u_update(np.ndarray[np.float64_t, ndim=2] V, np.ndarray[np.float64_t, ndim=2] C,
             np.ndarray[np.float64_t, ndim=2] train, np.ndarray[np.float64_t, ndim=2] I_k,
             int lambda_u, int user, int hidden):
This will greatly speed up indexing with 2 indices like you do in the inner loop.
It's best to do this to the array U as well, although you are using slicing:
cdef np.ndarray[np.float64_t, ndim=2] U = np.empty((hidden_dim, user_size), dtype=np.float64)
Next, you are redefining C_i, a large 2-D array, on every iteration of the outer loop. Also, you have not supplied any type information for it, which is a must if Cython is to offer any speedup. To fix this:

cdef np.ndarray[np.float64_t, ndim=2] C_i = np.zeros((m, m), dtype=np.float64)
for i in range(user_size):
    C_i.fill(0)
Here, we have defined it once (with type information), and we reuse the memory by filling it with zeros instead of calling np.zeros() to make a new array every time.
Also, you might want to turn off bounds checking only after you have finished debugging.
If you need speedups in the U[:,i]=... step, you could consider writing another function with Cython to perform those operations using loops.
Do read this tutorial which should give you an idea of what to do when working with Numpy arrays in Cython and what not to do as well, and also to appreciate how much of a speedup you can get with these simple changes.
I'm trying to get the pointer to a Numpy array so that I can manipulate it quickly in my Cython code. I found two ways of getting the buffer's pointer, one using array.__array_interface__['data'][0] and the other with array.ctypes.data. They are both painfully slow.
I have created a small Cython class that simply creates a numpy array and stores the pointer to its buffer:
import numpy as np

cdef class ArrayHolder:
    cdef array
    cdef long *ptr

    def __init__(ArrayHolder self, allocate=True):
        self.array = np.zeros((4, 12,), dtype=np.int)
        cdef long ptr = self.array.__array_interface__['data'][0]
        self.ptr = <long *>ptr
Then, back in Python, I create multiple instances of this class, like so:
for i in range(1000000):
    holder = ArrayHolder()
This takes around 3.6 seconds. Using array.ctypes.data is half a second slower.
When I comment out the call to __array_interface__['data'] and run the code again, it completes in around 1 second.
Why is obtaining the address of the Numpy array buffer so slow?
This can be helped a lot by using Cython's static typing mechanisms. That way Cython is aware that what you're dealing with is an appropriately typed array, and can generate optimised C code.
cimport numpy as np  # just so it knows np.int_t
import numpy as np

cdef class ArrayHolder:
    cdef np.int_t[:, :] array  # now specified as a specific array type
    cdef np.int_t *ptr  # note I've changed this to match the array type

    def __init__(ArrayHolder self, allocate=True):
        self.array = np.zeros((4, 12,), dtype=np.int)
        self.ptr = &self.array[0, 0]  # location of the first element
In this version there's a small cost at the assignment of self.array to check that the object is in fact an array. However, element lookup and taking the address are now around as fast as using a C pointer.
In your old version, the array was an arbitrary Python object, so there was a dictionary lookup for __array_interface__, a dictionary lookup for __getitem__ to allow a dictionary lookup for 'data', and a further dictionary lookup for __getitem__ to let you find index 0.
One thing to note: if you've used cdef to tell Cython the array type, you can do all your indexing directly on the array and it'll be pretty much the same speed as using the pointer, so you can probably skip creating the pointer entirely (unless you need it to pass to external C code). Turn off boundscheck and wraparound for the last little bit of speed.
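A minimal sketch of that direct-indexing style (total_elements is an invented name):

cimport cython
cimport numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
cdef np.int_t total_elements(np.int_t[:, :] arr):
    # typed indexing compiles down to the same pointer arithmetic
    cdef np.int_t s = 0
    cdef Py_ssize_t i, j
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            s += arr[i, j]
    return s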
I'm guessing it's some sort of lazy loading: Numpy only does the memset() on the buffer when you first access it. I would try to create this array without filling it with zeros, to save that time.
Here's my test:
import numpy as np

cdef class ArrayHolder:
    cdef array
    cdef long *ptr

    def __init__(ArrayHolder self, allocate=True):
        self.array = np.zeros((4, 12,), dtype=np.int)

    def ptr(ArrayHolder self):
        cdef long ptr = self.array.__array_interface__['data'][0]
from timeit import timeit
from cyth import ArrayHolder
print(timeit("ArrayHolder()", number=1000000, setup="from cyth import ArrayHolder"))
print(timeit("ArrayHolder().ptr()", number=1000000, setup="from cyth import ArrayHolder"))
$ python test.py
1.0442328620702028
3.4246508290525526
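One way to probe the lazy-zeroing hypothesis (an untested sketch): swap np.zeros for np.empty, which allocates without writing zeros, and time the constructor again:

def __init__(ArrayHolder self, allocate=True):
    # np.empty skips the zero fill; if the cost really is a deferred
    # memset, this version should close most of the gap
    self.array = np.empty((4, 12,), dtype=np.int)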
I am trying to convert part of a native Python function to Cython to improve the compute time. I would like to write a Cython function for just the loop component that is taking up the time (as ipython's lprun kindly told me). However, this function takes in variably sized matrices... and I can't see how to bring that across easily to statically typed Cython.
for index1 in range(0, num_products):
    for index2 in range(0, num_products):
        cond_prob = (data[index1] * data[index2]).sum() / max(col_sums[index1], col_sums[index2])
        prox[index1][index2] = cond_prob
The issue is that num_products changes year to year, so the matrix (data) size is variable.
What is the best strategy here?
Should I write two C functions: one to create a matrix of a certain dimension using malloc, and one to do the loops over the created matrix?
Is there some fancy cython/numpy wizardry to help in this scenario? Can I write a C function that takes in a variably sized Numpy array in memory along with its size?
Cython code is (strategically) statically typed, but that doesn't mean that arrays must have a fixed size. In straight C passing a multidimensional array to a function can be a little awkward maybe, but in Cython you should be able to do something like the following:
Note I took the function and variable names from your follow-up question.
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.cdivision(True)
def cooccurance_probability_cy(double[:,:] X):
    cdef int P = X.shape[0]
    cdef int N = X.shape[1]   # row length; P == N only if X is square
    cdef int i, j, k
    cdef double item
    cdef double[:] CS = np.sum(X, axis=1)
    cdef double[:,:] D = np.empty((P, P), dtype=np.float64)
    for i in range(P):
        for j in range(P):
            item = 0
            for k in range(N):
                item += X[i,k] * X[j,k]
            D[i,j] = item / max(CS[i], CS[j])
    return D
On the other hand, using just Numpy should also be quite fast for this problem, if you use the right functions and some broadcasting. In fact, as the calculation complexity is dominated by the matrix multiplication, I found the following is much faster than the Cython code above (np.inner uses a highly optimized BLAS routine):
def new(X):
    CS = np.sum(X, axis=1, keepdims=True)
    D = np.inner(X, X) / np.maximum(CS, CS.T)
    return D
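As a quick sanity check (hypothetical data), both versions should agree with the original loop:

import numpy as np

X = np.random.rand(200, 300)
CS = X.sum(axis=1)
D_loop = np.empty((200, 200))
for i in range(200):
    for j in range(200):
        D_loop[i, j] = (X[i] * X[j]).sum() / max(CS[i], CS[j])

assert np.allclose(new(X), D_loop)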
Have you tried getting rid of the for loops in numpy?
For the first part of your equation you could, for example, try:
(data[ np.newaxis,:] * data[:,np.newaxis]).sum(2)
If memory is an issue you can also use the np.einsum() function, as sketched below.
For the second part one could probably also cook up a numpy expression (a bit more difficult) if you've not already tried that.
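For example, a hedged sketch of the einsum form, which computes the pairwise products without materialising the large 3-D intermediate (variable names taken from the question, data shape invented):

import numpy as np

data = np.random.rand(100, 50)   # hypothetical (num_products, m) matrix
col_sums = data.sum(axis=1)

numerator = np.einsum('ik,jk->ij', data, data)  # pairwise row dot products
denominator = np.maximum(col_sums[:, None], col_sums[None, :])
prox = numerator / denominator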