Parallelizing generation of random vectors with arbitrary length using numba - python

I would like to generate random vectors of the form
[i for i in range(K) if random.uniform(0, 1) <= probs[i]]
for a length-K array of probabilities probs. Each resulting vector has somewhere between 0 and K elements. Conceptually, this is like flipping K specific coins (each with its own probability of coming up heads) and recording which of the coins showed heads.
The arbitrary return length makes it difficult to use any of the automatic parallelization options in numba. E.g.,
from numba import prange, njit, int64, float64
import numpy as np
@njit([int64[:](float64[:], int64)])
def rand_coin(freqs, r):
    return np.arange(r)[np.random.uniform(0, 1, size=r) <= freqs]

@njit(parallel=True)
def rand_coins(freqs, n):
    r = freqs.shape[0]
    return [rand_coin(freqs, r) for i in range(n)]  # **

r = 10; n = 100
freqs = np.random.uniform(0, 1, r)
rand_coins(freqs, n)
works great serially but produces a double free or corruption error if the range in the starred line is replaced with prange.
Is it possible to parallelize functions returning arrays of random lengths in numba?

prange is a numba function. The typing error is just a generic error from numba, saying that it ran into an issue while compiling the function. The real issue is that you are trying to make a call to a function that isn't declared. You need to use the prange function like so:
from numba import njit, int64, float64
import numba
import numpy as np
@njit([int64[:](float64[:], int64)])
def rand_coin(freqs, r):
    return np.arange(r)[np.random.uniform(0, 1, size=r) <= freqs]

@njit(parallel=True)
def rand_coins(freqs, n):
    r = freqs.shape[0]
    return [rand_coin(freqs, r) for i in numba.prange(n)]  # **
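If the parallel list comprehension still crashes (building a list of variable-length arrays is generally not safe inside a prange region), one workaround is to flip all the coins into a preallocated boolean matrix in parallel and extract the variable-length index vectors afterwards. A minimal sketch, assuming the same freqs and n as above (rand_coin_masks is an illustrative name, not from the question):
from numba import njit, prange
import numpy as np

@njit(parallel=True)
def rand_coin_masks(freqs, n):
    r = freqs.shape[0]
    masks = np.empty((n, r), dtype=np.bool_)  # fixed-size output, safe to fill in parallel
    for i in prange(n):
        masks[i] = np.random.uniform(0, 1, r) <= freqs
    return masks

# variable-length extraction done serially, outside the parallel region
masks = rand_coin_masks(np.random.uniform(0, 1, 10), 100)
results = [np.flatnonzero(m) for m in masks]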

Numba is not enhancing the performance

I am testing Numba performance on a function that takes a NumPy array, and comparing:
import numpy as np
from numba import jit, vectorize, float64
import time
from numba.core.errors import NumbaWarning
import warnings
warnings.simplefilter('ignore', category=NumbaWarning)
@jit(nopython=True, boundscheck=False)  # Set "nopython" mode for best performance, equivalent to @njit
def go_fast(a):  # Function is compiled to machine code when called the first time
    trace = 0.0
    for i in range(a.shape[0]):  # Numba likes loops
        trace += np.tanh(a[i, i])  # Numba likes NumPy functions
    return a + trace  # Numba likes NumPy broadcasting

class Main(object):
    def __init__(self) -> None:
        super().__init__()
        self.mat = np.arange(100000000, dtype=np.float64).reshape(10000, 10000)

    def my_run(self):
        st = time.time()
        trace = 0.0
        for i in range(self.mat.shape[0]):
            trace += np.tanh(self.mat[i, i])
        res = self.mat + trace
        print('Python Duration: ', time.time() - st)
        return res

    def jit_run(self):
        st = time.time()
        res = go_fast(self.mat)
        print('Jit Duration: ', time.time() - st)
        return res

obj = Main()
x1 = obj.my_run()
x2 = obj.jit_run()
The output is:
Python Duration: 0.2164750099182129
Jit Duration: 0.5367801189422607
How can I obtain an enhanced version of this example?
The slower execution time of the Numba implementation is due to the compilation time, since Numba compiles the function at the time it is first used (and recompiles only if the argument types change). It does that because it cannot know the types of the arguments before the function is called. Fortunately, you can specify the argument types so Numba can compile the function directly (when the decorator is executed). Here is the resulting code:
@njit('float64[:,:](float64[:,:])')
def go_fast(a):
    trace = 0.0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace
Note that njit is a shortcut for jit with nopython=True, and that boundscheck is already set to False by default (see the doc).
On my machine this results in the same execution time for both NumPy and Numba. Indeed, the execution time is not bounded by the computation of the tanh function; it is bounded by the expression a + trace (for both Numba and NumPy). The same execution time is expected since both implement this the same way: they create a temporary new array to perform the addition. Creating a new temporary array is expensive because of page faults and the use of RAM (a is fully read from RAM and the temporary array is fully stored in RAM). If you want a faster computation, you need to perform the operation in place (this prevents page faults and expensive cache-line write allocations on x86 platforms).
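For instance, here is a minimal sketch of an in-place variant (note that, unlike the original, it mutates its input array):
@njit('float64[:,:](float64[:,:])')
def go_fast_inplace(a):
    trace = 0.0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    a += trace  # in-place addition: no temporary array is allocated
    return a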

Finding the Hessian matrix of this function

Hi, I have the following function:

$$f(x) = \sum_{i=1}^{5000} \left[ -\log\left(1 - x_i^2\right) - \log\left(1 - a_i^\top x\right) \right],$$

where each $a_i$ is a random vector, and we are trying to minimize this function's value via Newton's method.
I need a way to calculate the Hessian matrix with respect to $(x_1, x_2, x_3, \dots)$. I tried autograd, but it took too much time. Here is my current code:
from autograd import elementwise_grad as egrad
from autograd import jacobian
import autograd.numpy as np

x = np.zeros(5000)
a = np.random.rand(5000, 5000)

def f(x):
    sum = 0
    for i in range(5000):
        sum += -np.log(1 - x[i]*x[i]) - np.log(1 - np.dot(x, a[i]))
    return sum

df = egrad(f)
d2f = jacobian(df)  # Hessian: Jacobian of the elementwise gradient
print(d2f(x))
I have tried looking into SymPy, but I am confused about how to proceed.
PyTorch has a GPU-optimised Hessian operation:
import torch
torch.autograd.functional.hessian(func, inputs)
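As a minimal sketch of how that might look for this problem (with a smaller n for illustration; the names here are assumptions, not from the question):
import torch

n = 100  # smaller than 5000 just for illustration
a = torch.rand(n, n)

def f(x):
    # same objective as in the question, written with torch ops
    return torch.sum(-torch.log(1 - x**2) - torch.log(1 - a @ x))

x0 = torch.zeros(n)
H = torch.autograd.functional.hessian(f, x0)  # (n, n) Hessian matrix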
You can use regular NumPy vectorized array operations, which will significantly speed up the execution of the program:
from autograd import elementwise_grad as egrad
from autograd import jacobian
import autograd.numpy as np
from time import time
import datetime
n = 5000
x = np.zeros(n)
a = np.random.rand(n, n)
f = lambda x: -1 * np.sum(np.log(1-x**2) + np.log(1-np.dot(a, x)))
t_start = time()
df = egrad(f)
d2f = jacobian(df)  # Hessian: Jacobian of the elementwise gradient
t_end = time() - t_start
print('Execution time: ', datetime.datetime.fromtimestamp(t_end).strftime('%H:%M:%S'))
Output
Execution time: 02:02:27
In general, every time you deal with numeric data, you should avoid by all means using Python loops for calculations, as they usually become the bottleneck of the program due to their interpreter overhead and the maintenance of the loop counter.
NumPy, on the other hand, has very little per-array overhead and is highly optimized, as you'd expect, for numeric calculations.
Note the x**2, which squares every item of x instead of x[i]*x[i], and np.dot(a, x), which performs all the np.dot(x, a[i]) products in just one command (where x and a switch places to fit the required dimensions).
You can refer to this great e-book, which explains this point in greater detail.
Cheers

How can I generate an array of samples from a poisson distribution in python using jax (jit)?

I am using jax (https://github.com/google/jax) to code up a neural network, and to simulate my inputs I would like to generate an array of samples from a Poisson distribution. How can I do this given jax's restrictions?
I have already tried using np.random.poisson(mu, N) and scipy.stats.poisson.rvs(mu, size=N). Neither of these works, because they are not supported in jax.numpy and jax.scipy.stats. So basically, I need an alternative solution, either with another package that jax does support or by hard-coding the Poisson sampling function.
from jax import jit, vmap, random
import jax.numpy as np
import numpy as onp
from scipy.stats import poisson

def build_input_and_targets_simulated(ntime, key):
    """
    Function: Simulate inputs and targets.
    Args:
        ntime: number of time steps in input
        key: key for random number generator
    Returns:
        inputs: txu matrix of inputs
        targets: txu matrix of target classifications
    """
    mu = 0.03  # average number of events per interval
    # scipy method
    inputs = np.array([poisson.rvs(mu, size=ntime), poisson.rvs(mu, size=ntime)]).T
    # numpy method
    inputs = np.array([onp.random.poisson(mu, ntime), onp.random.poisson(mu, ntime)]).T
    targets = onp.zeros((ntime, 1))
    # determine target based on cumulative difference in inputs at each time point
    diffT = np.cumsum(inputs[:, 0]) - np.cumsum(inputs[:, 1])
    targets[diffT > 0] = 1  # if diffT > 0, set binary choice
    targets[diffT == 0] = int(random.randint(key, (1, 1), 0, 2))  # if inputs are equal, select target class randomly
    return inputs, targets

# Now batch it and jit.
build_input_and_target = build_input_and_targets_simulated
build_inputs_and_targets = vmap(build_input_and_target, in_axes=(None, 0))
build_inputs_and_targets_jit = jit(build_inputs_and_targets, static_argnums=(0,))

seed = onp.random.randint(0, 1000000)
key = random.PRNGKey(seed)
ntimesteps = 25
inputs, targets = build_inputs_and_targets_jit(ntimesteps, key)
If I use the scipy method, I get an error that looks like this:
Exception: Tracer can't be used with raw numpy functions. You might have
import numpy as np
instead of
import jax.numpy as np
If I use the numpy method, I get an error that looks like this:
TypeError: 'BatchTracer' object cannot be interpreted as an integer
Both of these errors appear to be related to the specialized data types that jax uses (and which are necessary when using jit).
How can I get around this??
Does this work?
import jax

key = jax.random.PRNGKey(0)
jax.random.poisson(key, 5.1, (10, 5))
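jax.random.poisson also composes with jit, so something along these lines could replace the scipy/numpy draws in the question. A minimal sketch under that assumption (sample_inputs and the two-stream setup are illustrative, not from the question):
from functools import partial
import jax
import jax.numpy as jnp

@partial(jax.jit, static_argnums=(1,))  # shape argument must be static under jit
def sample_inputs(key, ntime, mu=0.03):
    # draw two independent Poisson event streams and stack them as columns
    k1, k2 = jax.random.split(key)
    return jnp.stack([jax.random.poisson(k1, mu, (ntime,)),
                      jax.random.poisson(k2, mu, (ntime,))], axis=-1)

inputs = sample_inputs(jax.random.PRNGKey(0), 25)  # shape (25, 2)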

How to use Numba to perform multiple integration in SciPy with an arbitrary number of variables and parameters?

I'd like to use Numba to decorate the integrand of a multiple integral so that it can be called by SciPy's nquad function as a LowLevelCallable. Ideally, the decorator should allow for an arbitrary number of variables, and an arbitrary number of additional parameters from nquad's args argument. This is built off an excellent Q&A from earlier this year, but extended to the case of multiple variables and parameters.
As an example, suppose the following multiple integral with N variables and K parameters:

$$\int_0^1 \cdots \int_0^1 -\exp\left(\prod_{i=1}^{N} x_i \prod_{j=1}^{K} p_j\right) dx_1 \cdots dx_N$$
The following code works, but only for two variables and two parameters (N=2,K=2). It does not work for the more general case. This is because some of the arguments in the decorator are manually enumerated (xx[0],xx[1],xx[2],xx[3] inside the wrapped function). The decorator would have to be edited for every different number of variables or parameters. I'd like to avoid that, if possible. Note that the integrand function itself takes advantage of Numpy objects and methods and so does not have this problem.
import numpy as np
import scipy.integrate as si
import numba
from numba import cfunc,carray
from numba.types import intc, CPointer, float64
from scipy import LowLevelCallable
def jit_integrand_function(integrand_function):
    jitted_function = numba.jit(integrand_function, nopython=True)

    @cfunc(float64(intc, CPointer(float64)))
    def wrapped(n, xx):
        return jitted_function(xx[0], xx[1], xx[2], xx[3])
        #xx = carray(xx, len(xx))
        #return jitted_function(xx)
    return LowLevelCallable(wrapped.ctypes)

@jit_integrand_function
def integrand(*args):
    d = np.array([args])
    return -np.exp(d.prod())

#Two variable, two parameter example
parms = np.array([2, 3])
print(si.nquad(integrand, [[0, 1], [0, 1]], parms))
The ideal code would use just one decorator on the integrand function and would also run:
#Three variable, three parameter example
parms2 = np.array([1, 2, 3])
print(si.nquad(integrand, [[0, 1], [0, 1], [0, 1]], parms2))
The Numba documentation refers to a carray function that ought to return a NumPy array when given the low-level pointer and size of an array in the callback. Possibly, this could be used to generalize the code beyond the two-variable, two-parameter case. My (unsuccessful) attempt to implement this is in the two commented-out lines of code.
Help would be appreciated. Indeed, one of the Numba developers pointed out that SciPy integration was one of the reasons Numba was written, but that documentation and examples in this area are lacking.
The following code works. Note that nquad passes the integration variables followed by the extra args parameters through the xx pointer, with n equal to their total count, so carray(xx, n) recovers all of them as a single array:
import numpy as np
import scipy.integrate as si
import numba
from numba import cfunc,carray
from numba.types import intc, CPointer, float64
from scipy import LowLevelCallable
def jit_integrand_function(integrand_function):
    jitted_function = numba.jit(integrand_function, nopython=True)

    @cfunc(float64(intc, CPointer(float64)))
    def wrapped(n, xx):
        values = carray(xx, n)  # view the n incoming doubles as a NumPy array
        return jitted_function(values)
    return LowLevelCallable(wrapped.ctypes)

@jit_integrand_function
def integrand(args):
    return -np.exp(args.prod())

#Two variable, two parameter example
parms = np.array([2, 3])
print(si.nquad(integrand, [[0, 1], [0, 1]], parms))

#Three variable, three parameter example
parms2 = np.array([1, 2, 3])
print(si.nquad(integrand, [[0, 1], [0, 1], [0, 1]], parms2))

Cython/numpy vs pure numpy for least square fitting

A T.A. at school showed me this code as an example of a least-squares fitting algorithm.
import numpy as np
#return the coefficients (a0,..aN) of the fit y=a0+a1*x+..an*x^n
#with associated sigma dy
#x,y,dy are all np.arrays with dtype=np.float64
def fit_poly(x, y, dy, n):
    V = np.asmatrix(np.diag(dy**2))
    M = []
    for k in range(n+1):
        M.append(x**k)
    M = np.asmatrix(M).T
    theta = (M.T*V.I*M).I*M.T*V.I*np.asmatrix(y).T
    cov_t = (M.T*V.I*M).I
    return np.asarray(theta.T)[0], np.asarray(cov_t)
I'm trying to optimize his code using Cython. I got this code:
cimport numpy as np
import numpy as np
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef poly_c(np.ndarray[np.float64_t, ndim=1] x,
             np.ndarray[np.float64_t, ndim=1] y,
             np.ndarray[np.float64_t, ndim=1] dy,
             np.int n):
    cdef np.ndarray[np.float64_t, ndim=2] V, M
    V = np.asmatrix(np.diag(dy**2), dtype=np.float64)
    M = np.asmatrix([x**k for k in range(n+1)], dtype=np.float64).T
    return ((M.T*V.I*M).I*M.T*V.I*(np.asmatrix(y).T))[0], (M.T*V.I*M).I
But the runtime seems to be the same for both programs. I used an assert to make sure the outputs were the same. What am I missing/doing wrong?
Thank you for your time and hopefully you can help me.
P.S.: this is the code I'm profiling with (not sure if I can call this profiling, but w/e):
import numpy as np
from polyC import poly_c
from time import time
from pancho_fit import fit_poly
#pancho's the T.A,sup pancho
x=np.arange(1,1000)
x=np.asarray(x,dtype=np.float64)
y=3*x+np.random.random(999)
y=np.asarray(y,dtype=np.float64)
dy=np.array([y.std() for i in range(1,1000)],dtype=np.float64)
t0=time()
a,b=poly_c(x,y,dy,4)
#a,b=fit_poly(x,y,dy,4)
print("time={}s".format(time()-t0))
Except for [x**k for k in range(n+1)], I don't see any iterations for Cython to improve. Most of the action is in matrix products. Those are already done with compiled code (with np.dot for ndarrays).
And n is only 4, not many iterations.
But why iterate this?
In [24]: x=np.arange(1,1000.)
In [25]: M1=x[:,None]**np.arange(5)
# np.matrix(M1)
does the same thing.
So no, this does not look like a good Cython candidate, not unless you are prepared to write out all those matrix products in compilable detail.
I'd also skip the asmatrix stuff and use regular dot, @, and einsum, but that's more a matter of style than speed.
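For reference, a minimal numpy-only sketch of the same weighted fit without asmatrix (fit_poly_np is an illustrative name), applying V.I as a diagonal via broadcasting rather than forming the full matrix:
import numpy as np

def fit_poly_np(x, y, dy, n):
    M = x[:, None] ** np.arange(n + 1)  # Vandermonde matrix, shape (len(x), n+1)
    w = 1.0 / dy**2                     # inverse variances: the diagonal of V.I
    A = (M.T * w) @ M                   # M.T @ V.I @ M without forming V
    cov = np.linalg.inv(A)              # covariance of the fitted coefficients
    theta = cov @ (M.T @ (w * y))       # (M.T V.I M)^-1 M.T V.I y
    return theta, cov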
