Issue converting a python program to cython for speedup


I am trying to write a fast algorithm in Cython. The algorithm is described here (https://arxiv.org/pdf/1411.4357.pdf - the paragraph before Theorem 8 on page 17). The idea is relatively straightforward: assuming that the data is sparse, it can be given as input in the form (i,j,A_ij) for a matrix A which has n rows and d columns. To take advantage of the sparsity we need two functions: h, which maps the n rows into s buckets uniformly at random (the number of buckets is a parameter to the algorithm), and a sign function s, which is ±1 for each row with equal probability. Then for every triple (i,j,A_ij) the algorithm must output (h(i), j, s(i)*A_ij) and return these as arrays that can be given as input to a sparse matrix constructor.
The problem is that I can't get a speedup. This should run extremely fast and outperform explicitly forming the sketching matrix S and multiplying it by A, as illustrated in the above reference.
Python approach (roughly 11ms):
import numpy as np
import numpy.random as npr
from scipy.sparse import coo_matrix
def countSketch_faster(input_matrix, sketch_size, seed=None):
    '''
    input_matrix: sparse - coo_matrix type
    sketch_size: int
    seed=None : random seed
    '''
    observed_rows = set([])
    sign_hashes = []
    hash_table = {}
    sketched_data = []
    hashed_rows = []
    for row_id, col_id, Aij in zip(input_matrix.row, input_matrix.col, input_matrix.data):
        if row_id in observed_rows:
            # reuse the (hash, sign) pair stored for this row
            hash_row_val, sign_val = hash_table[row_id]
            sketched_data.append(sign_val*Aij)
            hashed_rows.append(hash_row_val)
        else:
            hash_row_val, sign_val = np.random.choice(sketch_size), np.random.choice([-1.0,1.0])
            hash_table[row_id] = (hash_row_val, sign_val) #hash-sign pair
            sketched_data.append(sign_val*Aij)
            hashed_rows.append(hash_row_val)
            observed_rows.add(row_id)
    hashed_rows = np.asarray(hashed_rows)
    sketched_data = np.asarray(sketched_data)
    S = coo_matrix((sketched_data, (hashed_rows, input_matrix.col))).tocsr()
    return S
Converting to Cython:
Analysing the annotated output highlights that the most python-heavy lines are those which call numpy. I have tried to cdef all variables and specify dtype for any call to numpy. Also, I have removed the zip command and tried to keep the loop simpler and more C-like. All of this has actually had an adverse effect though and it runs very slowly, and I'm not really sure why. It seems like quite a simple algorithm to implement so if someone could help me get the runtime down to something very small I would be extremely grateful.
%%cython --annotate
cimport cython
import numpy.random as npr
import numpy as np
from scipy.sparse import coo_matrix
def countSketch_cython(input_matrix, sketch_size, seed=None):
    '''
    input_matrix: sparse - coo_matrix type
    sketch_size: int
    seed=None : random seed
    '''
    cdef Py_ssize_t idx
    cdef int row_id
    cdef float data_id
    cdef float sketch_data_val
    cdef int hash_row_value
    cdef float sign_value
    hash_table = {}
    sketched_data = np.zeros_like(input_matrix.data,dtype=np.float64)
    # row indices passed to coo_matrix should be integers
    hashed_rows = np.zeros_like(sketched_data,dtype=np.int64)
    observed_rows = set([])
    for idx in range(input_matrix.row.shape[0]):
        row_id = input_matrix.row[idx]
        data_id = input_matrix.data[idx]
        if row_id in observed_rows:
            hash_row_value = hash_table[row_id][0]
            sign_value = hash_table[row_id][1]
            # write at idx (position of this entry), not at row_id
            sketched_data[idx] = sign_value*data_id
            hashed_rows[idx] = hash_row_value
        else:
            # high is exclusive, so buckets lie in [0, sketch_size)
            hash_row_value = np.random.randint(low=0,high=sketch_size)
            sign_value = np.random.choice([-1.0,1.0])
            hash_table[row_id] = (hash_row_value, sign_value) #hash-sign pair
            sketched_data[idx] = sign_value*data_id
            hashed_rows[idx] = hash_row_value
            observed_rows.add(row_id)
    S = coo_matrix((sketched_data, (hashed_rows, input_matrix.col)))
    return S
UPDATE: I have managed to speed the code up by removing some of the slower lines. Specifically, I replaced the calls to np.random with the C random number generator, which was faster; I pass the length in as an argument so that it does not need to be calculated; and I no longer convert to a sparse matrix before returning (for the purpose of this experiment I am interested in how quickly the transform can be done, as opposed to details surrounding the conversion for downstream use).
%%cython --annotate
cimport cython
import numpy.random as npr
import numpy as np
from libc.stdlib cimport rand
#@cython.boundscheck(False) # these don't contribute much in this example
#@cython.wraparound(False)
def countSketch(input_matrix, input_matrix_length, sketch_size, seed=None):
    '''
    input_matrix: sparse - coo_matrix type
    input_matrix_length - number of rows in input matrix
    sketch_size: int
    seed=None : random seed
    '''
    cdef Py_ssize_t idx
    cdef int row_id
    cdef float data_id
    cdef float sketch_data_val
    cdef int hash_row_value
    cdef float sign_value
    cdef int arr_lengths = input_matrix_length
    # These two lines are still annotated most boldly
    cdef double[:,:] sketched_data = np.zeros((arr_lengths,1),dtype=np.float64)
    cdef double[:,:] hashed_rows = np.zeros((arr_lengths,1))
    hash_table = {}
    observed_rows = set([])
    for idx in range(arr_lengths):
        row_id = input_matrix.row[idx]
        data_id = input_matrix.data[idx]
        if row_id in observed_rows:
            hash_row_value = hash_table[row_id][0]
            sign_value = hash_table[row_id][1]
            # write at idx (position of this entry), not at row_id
            sketched_data[idx, 0] = sign_value*data_id
            hashed_rows[idx, 0] = hash_row_value
        else:
            # assign to the cdef'd names so the typed variables are actually used
            hash_row_value = rand()%(sketch_size)
            #np.random.randint(low=0,high=sketch_size+1)
            # parenthesised: 2*rand()%2 is always 0, which would give -1 every time
            sign_value = 2*(rand()%2) - 1
            #np.random.choice([-1.0,1.0])
            hash_table[row_id] = (hash_row_value, sign_value) #hash-sign pair
            sketched_data[idx, 0] = sign_value*data_id
            hashed_rows[idx, 0] = hash_row_value
            observed_rows.add(row_id)
    #S = coo_matrix((sketched_data, (hashed_rows, input_matrix.col)), dtype=np.float64)
    return hashed_rows, sketched_data
On a random sparse matrix A = scipy.sparse.random(1000, 50, density=0.1) this now achieves 508 µs ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) using %timeit countSketch(A, A.shape[0],100). I should imagine there are still gains to be made as I don't know if I have set the arrays in the best way.
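For comparison, the transform described at the top of this question can also be sketched without any Python-level loop: draw a bucket and a sign for every row up front, then index those arrays with the row index of each stored entry. This is only a minimal NumPy sketch under stated assumptions, not the Cython version above. The function name count_sketch_triples, its argument layout, and the use of np.random.default_rng are all choices made here for illustration.

```python
import numpy as np


def count_sketch_triples(rows, data, n_rows, sketch_size, seed=None):
    """Hypothetical vectorized CountSketch on (row, value) arrays.

    rows : integer array with the row index i of each stored entry
    data : float array with the matching values A_ij
    """
    rng = np.random.default_rng(seed)
    # One bucket h(i) in [0, sketch_size) and one sign in {-1, +1} per row.
    bucket = rng.integers(0, sketch_size, size=n_rows)
    sign = rng.choice(np.array([-1.0, 1.0]), size=n_rows)
    # Look up the bucket and sign of each stored entry by its row index.
    hashed_rows = bucket[rows]
    sketched_data = sign[rows] * data
    return hashed_rows, sketched_data


rows = np.array([0, 0, 2, 5, 5, 5])
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
hashed_rows, sketched_data = count_sketch_triples(rows, data, n_rows=6, sketch_size=4, seed=0)
```

With a scipy COO matrix A one would pass A.row, A.data and A.shape[0], and the two returned arrays together with A.col could then be handed to coo_matrix, as in the commented-out line above.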

Related

Struct pointers in Cython with GSL Monte-Carlo minimization

I'm stuck on this exercise and am not good enough to resolve it. Basically I am writing a Monte-Carlo Maximum Likelihood algorithm for the Bernoulli distribution. The problem is that I have to pass the data as the parameter to the GSL minimization (one-dim) algorithm, and need to also pass the size of the data (since the outer loop are the different sample sizes of the "observed" data). So I'm attempting to pass these parameters as a struct. However, I'm running into seg faults and I'm SURE it is coming from the portion of the code that concerns the struct and treating it as a pointer.
[EDIT: I have corrected for allocation of the struct and its components]
%%cython
#!python
#cython: boundscheck=False, wraparound=False, nonecheck=False, cdivision=True
from libc.stdlib cimport rand, RAND_MAX, calloc, malloc, realloc, free, abort
from libc.math cimport log
#Use the CythonGSL package to get the low-level routines
from cython_gsl cimport *
######################### Define the Data Structure ############################
cdef struct Parameters:
    #Pointer for Y data array
    double* Y
    #size of the array
    int* Size
################ Support Functions for Monte-Carlo Function ##################
#Create a function that allocates the memory and verifies integrity
cdef void alloc_struct(Parameters* data, int N, unsigned int flag) nogil:
    #allocate the data array initially
    if flag==1:
        data.Y = <double*> malloc(N * sizeof(double))
    #reallocate the data array
    else:
        data.Y = <double*> realloc(data.Y, N * sizeof(double))
    #If the elements of the struct are not properly allocated, destroy it and return null
    if N!=0 and data.Y==NULL:
        destroy_struct(data)
        data = NULL
#Create the destructor of the struct to return memory to system
cdef void destroy_struct(Parameters* data) nogil:
    free(data.Y)
    free(data)
#This function fills in the Y observed variable with discrete 0/1
cdef void Y_fill(Parameters* data, double p_true, int* N) nogil:
    cdef:
        Py_ssize_t i
        double y
    for i in range(N[0]):
        y = rand()/<double>RAND_MAX
        if y <= p_true:
            data.Y[i] = 1
        else:
            data.Y[i] = 0
#Definition of the function to be maximized: LLF of Bernoulli
cdef double LLF(double p, void* data) nogil:
    cdef:
        #the sample structure (considered the parameter here)
        Parameters* sample
        #the total of the LLF
        double Sum = 0
        #the loop iterator
        Py_ssize_t i, n
    sample = <Parameters*> data
    n = sample.Size[0]
    for i in range(n):
        Sum += sample.Y[i]*log(p) + (1-sample.Y[i])*log(1-p)
    return (-(Sum/n))
########################## Monte-Carlo Function ##############################
def Monte_Carlo(int[::1] Samples, double[:,::1] p_hat,
                Py_ssize_t Sims, double p_true):
    #Define variables and pointers
    cdef:
        #Data Structure
        Parameters* Data
        #iterators
        Py_ssize_t i, j
        int status, GSL_CONTINUE, Iter = 0, max_Iter = 100
        #Variables
        int N = Samples.shape[0]
        double start_val, a, b, tol = 1e-6
        #GSL objects and pointer
        const gsl_min_fminimizer_type* T
        gsl_min_fminimizer* s
        gsl_function F
    #Set the GSL function
    F.function = &LLF
    #Allocate the minimization routine
    T = gsl_min_fminimizer_brent
    s = gsl_min_fminimizer_alloc(T)
    #allocate the struct
    Data = <Parameters*> malloc(sizeof(Parameters))
    #verify memory integrity
    if Data==NULL: abort()
    #set the starting value
    start_val = rand()/<double>RAND_MAX
    try:
        for i in range(N):
            if i==0:
                #allocate memory to the data array
                alloc_struct(Data, Samples[i], 1)
            else:
                #reallocate the data array in the struct if
                #we are past the first run of outer loop
                alloc_struct(Data, Samples[i], 2)
            #verify memory integrity
            if Data==NULL: abort()
            #pass the data size into the struct
            Data.Size = &Samples[i]
            for j in range(Sims):
                #fill in the struct
                Y_fill(Data, p_true, Data.Size)
                #set the parameters for the GSL function (the samples)
                F.params = <void*> Data
                a = tol
                b = 1
                #set the minimizer
                gsl_min_fminimizer_set(s, &F, start_val, a, b)
                #initialize conditions
                GSL_CONTINUE = -2
                status = -2
                while (status == GSL_CONTINUE and Iter < max_Iter):
                    Iter += 1
                    status = gsl_min_fminimizer_iterate(s)
                    start_val = gsl_min_fminimizer_x_minimum(s)
                    a = gsl_min_fminimizer_x_lower(s)
                    b = gsl_min_fminimizer_x_upper(s)
                    status = gsl_min_test_interval(a, b, tol, 0.0)
                    if (status == GSL_SUCCESS):
                        print ("Converged:\n")
                        p_hat[i,j] = start_val
    finally:
        destroy_struct(Data)
        gsl_min_fminimizer_free(s)
with the following python code to run the above function:
import numpy as np
#Sample Sizes
N = np.array([5,50,500,5000], dtype='i')
#Parameters for MC
T = 1000
p_true = 0.2
#Array of the outputs from the MC
p_hat = np.empty((N.size,T), dtype='d')
p_hat.fill(np.nan)
Monte_Carlo(N, p_hat, T, p_true)
I have separately tested the struct allocation and it works, doing what it should do. However, while running the Monte Carlo the kernel is killed with an abort call (per the output on my Mac) and the Jupyter output on my console is the following:
gsl: fsolver.c:39: ERROR: computed function value is infinite or NaN
Default GSL error handler invoked.
It seems now that the solver is not working. I'm not familiar with the GSL package, having used it only once to generate random numbers from the gumbel distribution (bypassing the scipy commands).
I would appreciate any help on this! Thanks
[EDIT: Change lower bound of a]
Redoing the exercise with the exponential distribution, whose log-likelihood function contains just one log, I narrowed the problem down to gsl_min_fminimizer_set initially evaluating the function at the lower bound a = 0, which yields -INF (it evaluates the function before solving in order to generate f(lower) and f(upper), where f is my function to optimise). When I set the lower bound to something other than 0 but really small (say the tol variable of my defined tolerance), the solution algorithm works and yields the correct results.
Many thanks @DavidW for the hints to get me to where I needed to go.

This is a somewhat speculative answer since I don't have GSL installed so struggle to test it (so apologies if it's wrong!)
I think the issue is the line
Sum += sample.Y[i]*log(p) + (1-sample.Y[i])*log(1-p)
It looks like Y[i] can be either 0 or 1. When p is at either end of the range 0-1 it gives 0*-inf = nan. In the case where all the Ys are the same, this point is the minimum (so the solver will reliably end up at the invalid point). Fortunately you should be able to rewrite the line to avoid getting a nan:
if sample.Y[i]:
    Sum += log(p)
else:
    Sum += log(1-p)
(the case which will generate the nan is the one not executed).
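The difference between the two forms can be checked outside Cython. The following is a small NumPy sketch, assuming IEEE floating-point behaviour where 0 * -inf is nan; the function names nll_product_form and nll_branch_form are made up for this illustration and do not appear in the question.

```python
import numpy as np


def nll_product_form(p, y):
    """Mean negative log-likelihood written as y*log(p) + (1-y)*log(1-p)."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


def nll_branch_form(p, y):
    """Same quantity, but each observation only keeps the log it needs."""
    with np.errstate(divide="ignore"):
        return -np.mean(np.where(y == 1, np.log(p), np.log(1 - p)))


y_all_zero = np.zeros(5)
# At p = 0 the product form computes 0 * log(0) = 0 * -inf, which is nan.
bad = nll_product_form(0.0, y_all_zero)
# The branch form only keeps log(1 - 0) = 0 for these observations.
good = nll_branch_form(0.0, y_all_zero)
print(bad, good)
```

For a sample that mixes 0s and 1s the branch form still returns inf at p = 0, which is the true value of the function there, rather than nan.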
There's a second minor issue I've spotted: in alloc_struct you do data = NULL in case of an error. This only affects the local pointer, so your test for NULL in Monte_Carlo is meaningless. You'd be better returning a true or false flag from alloc_struct and checking that. I doubt if you're hitting this error though.
Edit: Another better option would be to find the minimum analytically: the derivative of A log(p) + (1-A) log (1-p) is A/p - (1-A)/(1-p). Average all the sample.Ys to find A. Finding the place where the derivative is 0 gives p=A. (You'll want to double-check my working!). With this you can avoid having to use the GSL minimization routines.
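The analytic result p = A can be checked numerically. This is a rough NumPy sketch rather than the GSL routine: it simulates a 0/1 sample the way Y_fill does and compares the sample mean with a brute-force grid search over the mean negative log-likelihood. The sample size, seed and grid are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.2
# Simulated 0/1 sample, mirroring what Y_fill does in the question.
y = (rng.random(500) <= p_true).astype(float)

# Analytic maximum-likelihood estimate: the sample mean.
p_analytic = y.mean()

# Brute-force check: evaluate the mean negative log-likelihood on a grid
# that stays strictly inside (0, 1), so neither log is evaluated at 0.
k, n = y.sum(), y.size
p_grid = np.linspace(0.001, 0.999, 999)
nll = -(k * np.log(p_grid) + (n - k) * np.log(1 - p_grid)) / n
p_grid_min = p_grid[np.argmin(nll)]

print(p_analytic, p_grid_min)
```

The two values should agree to within the grid spacing of 0.001, since the function is convex in p on (0, 1).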

Cython actually slowing me down

I am trying to 'cythonize' my code. While the following code does work, it is not adding any speed (in fact it's a shade slower). I am wondering if anyone knows what I am doing wrong, if anything at all. Note I am passing a numpy array and its data type should be float16. I have never used Cython before and I am using a Jupyter notebook right now. In the cell above I have Cython loaded.
%%cython
import numpy as np
cimport numpy as np
DTYPE = np.float16
ctypedef np.int_t DTYPE_t
def arrange_waveforms(np.ndarray arr,dim,mintomaxEventslistrange):
    import timeit
    start_time = timeit.default_timer()
    cdef int key
    #mintomaxEventslistrange is basically range(2000000)
    dictlist = dict((key, [[] for _ in xrange(1536)]) for key in mintomaxEventslistrange)
    # A separate one to hold the timing info (did this to minimize memory of dictlist)
    window = dict((key, [[] for _ in xrange(1536)]) for key in mintomaxEventslistrange)
    #arrange waveforms
    cdef np.ndarray pixel=arr[:,0].astype(int)
    cdef int i
    cdef int lines = dim[0]
    for i in range(lines):
        dictlist[arr[i,1]][pixel[i]].extend(arr[i,9:])
        window[arr[i,1]][pixel[i]].append(arr[i,6])
    elapsed = timeit.default_timer() - start_time
    print elapsed
    return dictlist,window

Reading hdf5 file quickly with cython and h5py

I'm trying to speed up a python3 function that takes some data, which is an array of indexes, and saves them if they meet a certain criterion. I have tried to speed it up by using "cython -a script.py", but the bottleneck seems to be the h5py I/O when slicing datasets.
I'm relatively new to cython, so I was wondering whether there is any way to speed this up, or am I just limited by the h5py I/O here?
Here is the function I'm trying to improve:
import numpy as np
import h5py
cimport numpy as np
cimport cython
from libc.math cimport sqrt
DTYPE64 = np.int64
ctypedef np.int64_t DTYPE64_t
DTYPE32 = np.int32
ctypedef np.int32_t DTYPE32_t
@cython.boundscheck(False)
@cython.wraparound(False)
def tag_subhalo_branch(np.ndarray[DTYPE64_t] halos_z0_treeindxs,
                       np.ndarray[DTYPE64_t] tree_pindx,
                       np.ndarray[DTYPE32_t] tree_psnapnum,
                       np.ndarray[DTYPE64_t] tree_psnapid,
                       np.ndarray[DTYPE64_t] tree_hsnapid, hf,
                       int size):
    cdef int i
    cdef double radial, progen_x, progen_y, progen_z
    cdef double host_x, host_y, host_z, host_rvir
    cdef DTYPE64_t progen_indx, progen_haloid, host_id
    cdef DTYPE32_t progen_snap
    cdef int j = 0
    cdef int size_array = size
    cdef np.ndarray[DTYPE64_t] backsplash_ids = np.zeros(size_array,
                                                         dtype=DTYPE64)
    for i in range(0, size_array):
        progen_indx = tree_pindx[halos_z0_treeindxs[i]]
        if progen_indx != -1:
            progen_snap = tree_psnapnum[progen_indx]
            progen_haloid = tree_psnapid[progen_indx]
            while progen_indx != -1 and progen_snap != -1:
                # ** This is slow **
                grp = hf['Snapshots/snap_' + str('%03d' % progen_snap) + '/']
                host_id = grp['HaloCatalog'][(progen_haloid - 1), 2]
                # **
                if host_id != -1:
                    # ** This is slow **
                    progen_x = grp['HaloCatalog'][(progen_haloid - 1), 6]
                    host_x = grp['HaloCatalog'][(host_id - 1), 6]
                    progen_y = grp['HaloCatalog'][(progen_haloid - 1), 7]
                    host_y = grp['HaloCatalog'][(host_id - 1), 7]
                    progen_z = grp['HaloCatalog'][(progen_haloid - 1), 8]
                    host_z = grp['HaloCatalog'][(host_id - 1), 8]
                    # **
                    radial = 0
                    radial += (progen_x - host_x)**2
                    radial += (progen_y - host_y)**2
                    radial += (progen_z - host_z)**2
                    radial = sqrt(radial)
                    host_rvir = grp['HaloCatalog'][(host_id - 1), 24]
                    if radial <= host_rvir:
                        backsplash_ids[j] = tree_hsnapid[
                            halos_z0_treeindxs[i]]
                        j += 1
                        break
                # Find next progenitor information
                progen_indx = tree_pindx[progen_indx]
                progen_snap = tree_psnapnum[progen_indx]
                progen_haloid = tree_psnapid[progen_indx]
    return backsplash_ids

As described here: http://api.h5py.org/, h5py uses cython code to interface with the HDF5 c code. So your own cython code might be able to access that directly. But I suspect that will require a lot more study.
Your code is using the Python interface to h5py, and cythonizing isn't going to touch that.
cython code is best used for low level actions, especially iterative things that can't be expressed as array operations. Study and experiment with the numpy examples first. You are diving into cython at the deep end of the pool.
Have you tried to improve that code just with Python and numpy? Just at a glance I'm seeing a lot of redundant h5py calls.
====================
Your radial calculation accesses the h5py indexing 6 times when it could get by with 2. Maybe you wrote it that way in hopes that cython would perform the following calculation faster than numpy?
data = grp['HaloCatalog']
progen = data[progen_haloid-1, 6:9]
host = data[host_id-1, 6:9]
radial = np.sqrt(((progen-host)**2).sum())
Why not load all data[progen_haloid-1,:] and data[host_id-1,:]? Even all of data? I'd have to review when h5py switches from working directly with the arrays on the file and when they become numpy arrays. In any case, math on arrays in memory will be a lot faster than file reads.
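As a rough illustration of reading each row once, here is a sketch in which an in-memory NumPy array stands in for grp['HaloCatalog']. The column layout (coordinates in columns 6:9, virial radius in column 24) is taken from the question; the helper name is_inside_host and the random stand-in data are assumptions made for this example.

```python
import numpy as np

# Stand-in for grp['HaloCatalog']: random data with at least 25 columns.
rng = np.random.default_rng(0)
catalog = rng.random((10, 25))


def is_inside_host(catalog, progen_haloid, host_id):
    """Return True if the progenitor lies within the host's virial radius."""
    # Two whole-row reads replace the separate scalar reads of the loop.
    progen_row = catalog[progen_haloid - 1, :]
    host_row = catalog[host_id - 1, :]
    # Euclidean distance between the two positions (columns 6, 7, 8).
    radial = np.sqrt(((progen_row[6:9] - host_row[6:9]) ** 2).sum())
    return radial <= host_row[24]


print(is_inside_host(catalog, 3, 7))
```

With a real h5py dataset the same row slices return NumPy arrays, so the arithmetic afterwards runs in memory rather than going back to the file for each value.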

Cython/numpy vs pure numpy for least square fitting

A T.A at school showed me this code as an example of a least square fitting algorithm.
import numpy as np
#return the coefficients (a0,..aN) of the fit y=a0+a1*x+..an*x^n
#with associated sigma dy
#x,y,dy are all np.arrays with dtype= np.float64
def fit_poly(x,y,dy,n):
    V = np.asmatrix(np.diag(dy**2))
    M = []
    for k in range(n+1):
        M.append(x**k)
    M = np.asmatrix(M).T
    theta = (M.T*V.I*M).I*M.T*V.I*np.asmatrix(y).T
    cov_t = (M.T*V.I*M).I
    return np.asarray(theta.T)[0], np.asarray(cov_t)
I'm trying to optimize his code using cython. I got this code:
cimport numpy as np
import numpy as np
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef poly_c(np.ndarray[np.float64_t, ndim=1] x,
             np.ndarray[np.float64_t, ndim=1] y,
             np.ndarray[np.float64_t, ndim=1] dy, int n):
    cdef np.ndarray[np.float64_t, ndim=2] V, M
    V=np.asmatrix(np.diag(dy**2),dtype=np.float64)
    M=np.asmatrix([x**k for k in range(n+1)],dtype=np.float64).T
    return ((M.T*V.I*M).I*M.T*V.I*(np.asmatrix(y).T))[0],(M.T*V.I*M).I
But the runtime seems to be the same for both programs. I did use an 'assert' to make sure the outputs were the same. What am I missing or doing wrong?
Thank you for your time and hopefully you can help me.
PS: this is the code I'm profiling with (not sure if I can call this profiling but w/e)
import numpy as np
from polyC import poly_c
from time import time
from pancho_fit import fit_poly
#pancho's the T.A,sup pancho
x=np.arange(1,1000)
x=np.asarray(x,dtype=np.float64)
y=3*x+np.random.random(999)
y=np.asarray(y,dtype=np.float64)
dy=np.array([y.std() for i in range(1,1000)],dtype=np.float64)
t0=time()
a,b=poly_c(x,y,dy,4)
#a,b=fit_poly(x,y,dy,4)
print("time={}s".format(time()-t0))

Except for [x**k for k in range(n+1)] I don't see any iterations for cython to improve. Most of the action is in matrix products. Those are already done with compiled code (with np.dot for ndarrays).
And n is only 4, not many iterations.
But why iterate this?
In [24]: x=np.arange(1,1000.)
In [25]: M1=x[:,None]**np.arange(5)
# np.matrix(M1)
does the same thing.
So no, this does not look like a good cython candidate - not unless you are prepared to write out all those matrix products in compilable detail.
I'd also skip the asmatrix stuff and use regular dot, @ and einsum, but that's more a matter of style than speed.
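To make the suggestion concrete, here is one way the fit might look with plain ndarrays and the @ operator. It is a sketch, not a drop-in replacement that has been timed: the name fit_poly_ndarray is made up here, and it relies on V being diagonal, so V.I reduces to an elementwise 1/dy**2 and no large matrix inverse is formed. The small test data is also only for illustration.

```python
import numpy as np


def fit_poly_ndarray(x, y, dy, n):
    """Weighted fit y = a0 + a1*x + ... + an*x^n using plain ndarrays."""
    # Design matrix via broadcasting: column k holds x**k.
    M = x[:, None] ** np.arange(n + 1)
    # V is diagonal, so its inverse is just the elementwise weights 1/dy**2.
    w = 1.0 / dy**2
    A = M.T @ (w[:, None] * M)  # M^T V^-1 M
    b = M.T @ (w * y)           # M^T V^-1 y
    theta = np.linalg.solve(A, b)
    cov_t = np.linalg.inv(A)
    return theta, cov_t


x = np.linspace(0.0, 1.0, 50)
rng = np.random.default_rng(0)
y = 3.0 * x + 0.1 * rng.random(50)
dy = np.full(50, 0.1)
theta, cov_t = fit_poly_ndarray(x, y, dy, 2)
print(theta)
```

On well-conditioned data like this the coefficients should match np.polyfit(x, y, 2, w=1/dy) read in reverse order, since polyfit returns the highest power first.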

Why does numba have worse optimization than Cython in this code?

I am trying to optimize some code with numba. The problem is that a simple Cython optimization (just specifying data types) is six times faster than using autojit, so I don't know if I'm doing something wrong.
The function to optimize is:
from numba import autojit
@autojit(nopython=True)
def get_energy(system, i,j,m):
    #system is an array, (i,j) some indices and m the size of the array
    up=i-1; down=i+1; left=j-1; right=j+1
    if up<0: total=system[m,j]
    else: total=system[up,j]
    if down>m: total+=system[0,j]
    else: total+=system[down,j]
    if left<0: total+=system[i,m]
    else: total+=system[i,left]
    if right>m: total+=system[i,0]
    else: total+=system[i,right]
    return 2*system[i,j]*total
A simple run would be something like this:
import numpy as np
x=np.random.rand(50,50)
get_energy(x, 3, 5, 50)
I've understood that numba is good at loops but may not optimize other things very well. Anyhow, I would expect a similar performance to Cython, is numba slower accessing arrays or at conditional statements?
The .pyx file in Cython is:
import numpy as np
cimport cython
cimport numpy as np
def get_energy(np.ndarray[np.float64_t, ndim=2] system, int i,int j,unsigned int m):
    cdef int up
    cdef int down
    cdef int left
    cdef int right
    cdef np.float64_t total
    up=i-1; down=i+1; left=j-1; right=j+1
    if up<0: total=system[m,j]
    else: total=system[up,j]
    if down>m: total+=system[0,j]
    else: total+=system[down,j]
    if left<0: total+=system[i,m]
    else: total+=system[i,left]
    if right>m: total+=system[i,0]
    else: total+=system[i,right]
    return 2*system[i,j]*total
Please comment if I need to give further information.
