This is my first attempt at using a JIT for Python, and this is the use case I want to speed up. I read a bit about numba and it seemed simple enough, but the following code doesn't provide any speedup. Please excuse any obvious mistakes I may be making.
I also tried what the basic Cython tutorial suggests, but again saw no difference in time.
http://docs.cython.org/src/tutorial/cython_tutorial.html
I'm guessing I have to do something like declare variables? Use other libraries? Use for loops exclusively for everything? I'd appreciate any guidance or examples I can refer to.
For example, I know from a previous question (Elementwise operations in mpmath slow compared to numpy) and its solution that using gmpy instead of mpmath was significantly faster.
import numpy as np
from scipy.special import eval_genlaguerre
from sympy import mpmath as mp
from sympy.mpmath import laguerre as genlag2
import collections
from numba import jit
import time

def len2(x):
    return len(x) if isinstance(x, collections.Sized) else 1

@jit  # <-- removing this doesn't change the output time; if anything it's slower with this
def laguerre(a, b, x):
    fun = np.vectorize(genlag2)
    return fun(a, b, x)

def f1(a, b, c):
    t = time.time()
    M = np.ones([len2(a), len2(b), len2(c)])
    A, B, C = np.meshgrid(a, b, c, indexing='ij')
    temp = laguerre(A, B, C)
    M *= temp
    print 'part1: ', time.time() - t
    t = time.time()

    A, B = np.meshgrid(a, b, indexing='ij')
    temp = np.array([[mp.fac(x1) / mp.fac(y1) for x1, y1 in zip(x2, y2)] for x2, y2 in zip(A, B)])
    temp = np.reshape(temp, [len(a), len(b), 1])
    temp = np.repeat(temp, len(c), axis=2)
    print 'part2 so far:', time.time() - t
    M *= temp
    print 'part2 finally', time.time() - t
    t = time.time()
    return M

a = mp.arange(30)
b = mp.arange(10)
c = mp.linspace(0, 100, 100)
M = f1(a, b, c)
It's better to use numba's vectorize with an explicitly declared signature (a self-defined decorator); if no signature is given, lazy compilation happens at the first call, which can slow things down.
In my opinion, jit is slow compared to vectorize for this kind of element-wise work.
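For illustration only, here is a minimal sketch of an eagerly compiled numba vectorize with an explicit signature. The function body is a stand-in: numba cannot compile mpmath calls such as genlag2, so the real Laguerre evaluation would have to be expressed in plain numpy/math operations for this to pay off.

import numpy as np
from numba import vectorize, float64

# Explicit signature -> compiled eagerly at decoration time, no lazy dispatch
# on the first call. The body is a placeholder, not the Laguerre polynomial.
@vectorize([float64(float64, float64, float64)])
def fake_laguerre(a, b, x):
    return a * x + b  # stand-in computation

A = np.random.rand(30, 10, 100)
B = np.random.rand(30, 10, 100)
X = np.random.rand(30, 10, 100)
out = fake_laguerre(A, B, X)  # behaves like a numpy ufunc, broadcasting element-wise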
Related
Is it possible to accelerate the following nested loop in Python (possibly with CUDA or parallel processing)? The order of the elements that are appended to outputList[i] doesn't matter.
Currently, the code takes a long time to complete. Which part is slowing it down? Is it the append() or the calculation f = c*sin(deg*pi/180) + (1/c)*cos(deg*pi/180)?
from math import sqrt, sin, cos, pi

N = 640*480
outputList = [[] for i in range(N)]

def foo(a, b):  # a, b are always integers
    c = sqrt(a**2 + b**2)
    for deg in range(360):
        f = c*sin(deg*pi/180) + (1/c)*cos(deg*pi/180)
        if (f < 1):
            outputList[a].append(f)

if __name__ == '__main__':
    for x in range(N):
        for y in range(60):
            foo(x, y)
As talonmies mentioned, your first step should be using numpy.
import math
import numpy as np

def foo(a, b):
    c = math.hypot(a, b)
    degs = np.arange(360, dtype='f8')
    rads = np.deg2rad(degs)
    fs = c * np.sin(rads) + (1/c) * np.cos(rads)
    below1 = fs[fs < 1.]
    outputList[a].extend(below1)
There are fancier methods to do the sin + cos computation, but I've kept it simple for now. Also, depending on your use case, consider changing outputList to a 2D numpy array or a list of 1D numpy arrays.
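As an illustration (not part of the original answer), here is one way to keep a list of 1D numpy arrays: have the function return its filtered values and concatenate them per index at the end. The helper name foo_array and the reduced N are assumptions made for this sketch.

import math
import numpy as np

def foo_array(a, b):
    # Same computation as foo above, but returns the values below 1 as an array.
    c = math.hypot(a, b)
    rads = np.deg2rad(np.arange(360, dtype='f8'))
    fs = c * np.sin(rads) + (1/c) * np.cos(rads)
    return fs[fs < 1.]

N = 1000  # reduced from 640*480 just to keep the sketch quick to run
parts = [[] for _ in range(N)]
for x in range(1, N):          # start at 1: (0, 0) would give c == 0
    for y in range(60):
        parts[x].append(foo_array(x, y))

# outputList becomes a list of 1D numpy arrays, one per index.
outputList = [np.concatenate(p) if p else np.empty(0) for p in parts]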
I have an issue: I'm trying to find the minimum of a function which depends on several parameters that I'd like to change as well. Let's take a simplified example:
import numpy as np
import scipy.optimize as opt
def f(x, a, b, c):
    f = a * x**2 + b * x + c
    return f
I'd like to find the x which minimizes the function for different sets of values of a, b, c, let's say for
a = [-1, 0, 1]
b = [0, 1, 2]
c = [0, 1]
ATM I have three nested loops and a minimization:
for p1 in a:
    for p2 in b:
        for p3 in c:
            y = opt.minimize(f, x0=[0, ], args=(p1, p2, p3, ))
            print(y)
which is really slow for the calculation I'm actually doing, but I haven't found anything better so far. Does anyone know of a way or a package that would let me improve the efficiency?
You could use a combination of different techniques to improve the efficiency of your script:
- Use itertools.product to generate every possible combination of the lists a, b, c.
- Use multiprocessing to execute the minimizations in parallel.

Other than this, I can't think of a way to optimize the efficiency of the code. As was pointed out in the comments, the constant value c has no influence on the minimization, but I'm sure the quadratic function is just an example.
I took the code of the multiprocessing part from here.
Here's the working code.
import numpy as np
import scipy.optimize as opt
import itertools
from multiprocessing import Pool

def f(x, a, b, c):
    f = a * x**2 + b * x + c
    return f

def mini(args):
    res = opt.minimize(f, x0=np.array([0]), args=args)
    return res.x

if __name__ == "__main__":
    a = np.linspace(-1, 2, 100)
    b = np.linspace(0, 2, 100)
    c = [0, 1]
    args = list(itertools.product(a, b, c))
    print("Number of combos: " + str(len(args)))
    p = Pool(4)

    import time
    t0 = time.time()
    res = p.map(mini, args)
    print(time.time() - t0)
Even these 20,000 combinations need only 5.28 seconds on my average laptop.
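A small usage note (not in the original answer): Pool.map preserves the input order, so each parameter combination can be matched back to its minimizer afterwards, e.g.:

# Assuming `args` and `res` come from the script above.
for (ai, bi, ci), x_min in zip(args[:3], res[:3]):
    print(ai, bi, ci, "->", x_min)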
scipy.optimize.newton can do this.
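For example, here is a minimal sketch of that idea, under the assumption that every parameter combination has an interior minimum (i.e. a > 0): newton accepts an array of starting points, so the stationary-point condition df/dx = 0 can be solved for all combinations in one vectorized call. The names df, a_vals and b_vals are made up for the illustration.

import numpy as np
import itertools
from scipy.optimize import newton

def df(x, a, b):
    # Derivative of f(x) = a*x**2 + b*x + c with respect to x; c drops out.
    return 2 * a * x + b

# Parameter grid restricted to a > 0 so the stationary point is a minimum.
a_vals = np.linspace(0.5, 2, 100)
b_vals = np.linspace(0, 2, 100)
combos = np.array(list(itertools.product(a_vals, b_vals)))
A, B = combos[:, 0], combos[:, 1]

# One vectorized call: newton solves df(x, a, b) = 0 for every starting point.
x_min = newton(df, x0=np.zeros(len(A)), args=(A, B))
print(np.allclose(x_min, -B / (2 * A)))  # analytic check: x* = -b / (2a)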
I need to minimize a cost function for a large number (1000s) of different inputs. Obviously, this can be implemented by looping over scipy.optimize.minimize or any other minimization routine. Here is an example:
import numpy as np
import scipy as sp
import scipy.optimize

def cost(x, a, b):
    return np.sum((np.sum(a * x.reshape(a.shape), axis=1) - b)**2)

a = np.random.randn(500, 40)
b = np.array(np.arange(500))

x = []
for i in range(a.shape[0]):
    res = sp.optimize.minimize(cost, np.zeros(40), args=(a[None, i], b[None, i]))
    x.append(res.x)
This finds the x[i, :] that minimizes cost for each a[i, :] and b[i], but it is very slow. I guess looping over minimize causes considerable overhead.
A partial solution is to solve for all x simultaneously:
res = sp.optimize.minimize(cost, np.zeros_like(a), args=(a, b))
This is even slower than the loop. minimize does not know that the elements in x are group-wise independent, so it computes the full Hessian, although a block-diagonal matrix would be sufficient given the problem structure. This is slow and overflows my computer's memory.
Is there any way to inform minimize or another optimization function about the problem structure so that it can solve multiple independent optimizations in a single function call? (Similar to certain options supported by Matlab's fsolve.)
First, a solution:
It turns out that scipy.optimize.least_squares supports exploiting the structure of the jacobian by setting the jac_sparsity argument.
The least_squares function works slightly differently from minimize, so the cost function needs to be rewritten to return residuals instead:
def residuals(x, a, b):
    return np.sum(a * x.reshape(a.shape), axis=1) - b
The jacobian has block-diagonal sparsity structure, given by
jacs = sp.sparse.block_diag([np.ones((1, 40), dtype=bool)]*500)
And calling the optimization routine:
res = sp.optimize.least_squares(residuals, np.zeros(500*40),
                                jac_sparsity=jacs, args=(a, b))
x = res.x.reshape(500, 40)
But is it really faster?
%timeit opt1_loopy_min(a, b) # 1 loop, best of 3: 2.43 s per loop
%timeit opt2_loopy_min_start(a, b) # 1 loop, best of 3: 2.55 s per loop
%timeit opt3_loopy_lsq(a, b) # 1 loop, best of 3: 13.7 s per loop
%timeit opt4_dense_lsq(a, b) # ValueError: array is too big; ...
%timeit opt5_jacs_lsq(a, b) # 1 loop, best of 3: 1.04 s per loop
Conclusions:
- There is no obvious difference between the original solution (opt1) and re-use of the starting point (opt2) without sorting.
- Looping over least_squares (opt3) is considerably slower than looping over minimize (opt1, opt2).
- The problem is too big to run naively with least_squares (opt4) because the jacobian matrix does not fit in memory.
- Exploiting the sparsity structure of the jacobian in least_squares (opt5) seems to be the fastest approach.
This is the timing test environment:
import numpy as np
import scipy as sp
import scipy.optimize
import scipy.sparse

def cost(x, a, b):
    return np.sum((np.sum(a * x.reshape(a.shape), axis=1) - b)**2)

def residuals(x, a, b):
    return np.sum(a * x.reshape(a.shape), axis=1) - b

a = np.random.randn(500, 40)
b = np.arange(500)

def opt1_loopy_min(a, b):
    x = []
    x0 = np.zeros(a.shape[1])
    for i in range(a.shape[0]):
        res = sp.optimize.minimize(cost, x0, args=(a[None, i], b[None, i]))
        x.append(res.x)
    return np.stack(x)

def opt2_loopy_min_start(a, b):
    x = []
    x0 = np.zeros(a.shape[1])
    for i in range(a.shape[0]):
        res = sp.optimize.minimize(cost, x0, args=(a[None, i], b[None, i]))
        x.append(res.x)
        x0 = res.x
    return np.stack(x)

def opt3_loopy_lsq(a, b):
    x = []
    x0 = np.zeros(a.shape[1])
    for i in range(a.shape[0]):
        res = sp.optimize.least_squares(residuals, x0, args=(a[None, i], b[None, i]))
        x.append(res.x)
    return x

def opt4_dense_lsq(a, b):
    res = sp.optimize.least_squares(residuals, np.zeros(a.size), args=(a, b))
    return res.x.reshape(a.shape)

def opt5_jacs_lsq(a, b):
    jacs = sp.sparse.block_diag([np.ones((1, a.shape[1]), dtype=bool)] * a.shape[0])
    res = sp.optimize.least_squares(residuals, np.zeros(a.size), jac_sparsity=jacs, args=(a, b))
    return res.x.reshape(a.shape)
"I guess looping over minimize causes considerable overhead."
Wrong guess. The time required for minimizing a function dwarfs any loop overhead. There is no vectorization magic for this problem.
Some time can be saved by using a better starting point for each minimization. First, sort the parameters so that consecutive iterations have similar parameters. Then use the end point of the previous minimization as the starting point of the next one:
a = np.sort(np.random.randn(500, 40), axis=0)  # sorted parameters
b = np.arange(500)  # no need for np.array here, np.arange already returns an ndarray

x = []
x0 = np.zeros(40)
for i in range(a.shape[0]):
    res = sp.optimize.minimize(cost, x0, args=(a[None, i], b[None, i]))
    x.append(res.x)
    x0 = res.x
This saves 30-40 percent of execution time in my test.
Another minor optimization is to preallocate an ndarray of the appropriate size for the resulting x values, instead of using a list and append: before the loop, x = np.zeros((500, 40)); within the loop, x[i, :] = res.x.
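Putting the warm start and the preallocation together, a minimal self-contained sketch (same cost, a and b as in the timing environment above):

import numpy as np
import scipy.optimize

def cost(x, a, b):
    return np.sum((np.sum(a * x.reshape(a.shape), axis=1) - b)**2)

a = np.sort(np.random.randn(500, 40), axis=0)   # sorted parameters keep warm starts close
b = np.arange(500)

x = np.zeros(a.shape)                 # preallocated result array instead of a list
x0 = np.zeros(a.shape[1])
for i in range(a.shape[0]):
    res = scipy.optimize.minimize(cost, x0, args=(a[None, i], b[None, i]))
    x[i, :] = res.x                   # write into the preallocated array
    x0 = res.x                        # warm start for the next minimization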
I'm trying to use NumbaPro to write the simple matrix-vector multiplication below:
from numbapro import cuda
from numba import *
import numpy as np
import math
from timeit import default_timer as time

n = 100

@cuda.jit('void(float32[:,:], float32[:], float32[:])')
def cu_matrix_vector(A, b, c):
    y, x = cuda.grid(2)
    if y < n:
        c[y] = 0.0
    if x < n and y < n:
        for i in range(n):
            c[y] += A[y, i] * b[i]

A = np.array(np.random.random((n, n)), dtype=np.float32)
B = np.array(np.random.random((n, 1)), dtype=np.float32)
C = np.empty_like(B)

s = time()
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dC = cuda.to_device(C)
cu_matrix_vector(dA, dB, dC)
dC.to_host()
e = time()
tcuda = e - s
but I'm getting the following error:
numbapro.cudadrv.error.CudaDriverError: CUDA_ERROR_LAUNCH_FAILED Failed to copy memory D->H
I don't understand why the device-to-host copy is failing. Please help.
Your code has multiple problems.
- The B and C vectors are Nx1 2D matrices, not 1D vectors, but the type signature of your kernel declares them as "float32[:]", i.e. 1D vectors. It also indexes them with a single index, which results in runtime errors on the GPU due to misaligned access (cuda-memcheck is your friend here!).
- Your kernel assumes a 2D grid but uses only one column of it, so many threads do the same computation and overwrite each other.
- There is no execution configuration given, so NumbaPro launches the kernel with 1 block of 1 thread (nvprof is your friend here!).
Here is code that works. Note that it uses a 1D grid of 1D blocks and loops over the columns of the matrix, so it is optimized for the case where the number of rows in the matrix/vector is large. A kernel optimized for a short, wide matrix would need another approach (parallel reductions). But I would use CUBLAS sgemv (which is also exposed in NumbaPro) instead.
from numbapro import cuda
from numba import *
import numpy as np
import math
from timeit import default_timer as time

m = 100000
n = 100

@cuda.jit('void(f4[:,:], f4[:], f4[:])')
def cu_matrix_vector(A, b, c):
    row = cuda.grid(1)
    if (row < m):
        sum = 0
        for i in range(n):
            sum += A[row, i] * b[i]
        c[row] = sum

A = np.array(np.random.random((m, n)), dtype=np.float32)
B = np.array(np.random.random(m), dtype=np.float32)  # note: the kernel only reads the first n entries of b
C = np.empty_like(B)

s = time()
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dC = cuda.to_device(C)
cu_matrix_vector[(m+511)/512, 512](dA, dB, dC)
dC.to_host()
print C
e = time()
tcuda = e - s
Is there a way to speed up this code:
import mpmath as mp
import numpy as np
from time import time as epochTime

def func(E):
    f = lambda theta: mp.sin(theta) * mp.exp(E * (mp.cos(theta**2) +
                                                  mp.cos(theta)**2))
    return f

start = epochTime()
mp.mp.dps = 15
mp.mp.pretty = True
E = np.linspace(0, 10, 200)
ints = [mp.quadgl(func(e), [0, mp.pi]) for e in E]  # Main Job
print ('Took:{:.3}s'.format(epochTime() - start))
Running your code, I timed it at 5.84 s.
Using Memoize and simplifying expressions:
cos = Memoize(mp.cos)   # Memoize: a simple caching wrapper around a function
sin = Memoize(mp.sin)

def func(E):
    def f(t):
        cost = cos(t)
        return sin(t) * mp.exp(E * (cos(t*t) + cost*cost))
    return f
I got it down to 3.25 s the first time, and ~2.8 s on subsequent runs.
(An even better approach might be using lru_cache from the standard library, but I did not try to time it.)
If you are running similar code many times, it may be sensible to Memoize() both func and f, so the computations become trivial (~0.364 s).
Replacing mp with math for cos/sin/exp, I got down to ~1.3 s, and now memoizing makes the performance worse for some reason (~1.5 s; I guess the lookup time became dominant).
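For reference, a minimal sketch of the lru_cache variant mentioned above (assuming Python 3; mpmath's mpf values are hashable, so they can be cached):

from functools import lru_cache
import mpmath as mp

# Cached wrappers around the transcendental calls; maxsize=None keeps every result.
cos = lru_cache(maxsize=None)(mp.cos)
sin = lru_cache(maxsize=None)(mp.sin)

def func(E):
    def f(t):
        cost = cos(t)
        return sin(t) * mp.exp(E * (cos(t*t) + cost*cost))
    return f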
In general, you want to avoid calls to transcendental functions like sin, cos, exp, ln as much as possible, especially in a "hot" function like an integrand.

- Replace x**2 with x*x (x**2 often calls a generic, and therefore slow, exponentiation routine).
- Use variables for "expensive" intermediate terms which are used more than once.
- Transform your equation to reduce or eliminate transcendental functions.
- Special-case typical parameter values; integer exponents are a frequent candidate.
- Precompute everything that is constant, especially in parameterized functions.
For this particular example you can substitute z = cos(theta), with dz = -sin(theta) dtheta. Your integrand becomes
-exp(E*(z^2 + cos(arccos(z)^2)))
saving you some of the transcendental function calls. The boundaries [0, pi] become [1, -1]. Also avoid x**2; better to use x*x.
Complete code:
import mpmath as mp
import numpy as np
from time import time as epochTime

def func(E):
    def f(z):
        acz = mp.acos(z)
        return -mp.exp(E * (mp.cos(acz*acz) + z*z))
    return f

start = epochTime()
mp.mp.dps = 15
mp.mp.pretty = True
E = np.linspace(0, 10, 200)
ints = [mp.quadgl(func(e), [1.0, -1.0]) for e in E]  # Main Job
print ('Took:{:.3}s'.format(epochTime() - start))