I am playing around with Numba to see how much faster I can make a Python+NumPy code. My test function computes the pairwise Euclidean distances of n points in three-dimensional space. I am getting 2 orders of magnitude speedup with Numba. If I comment out the lines where I store the distances in an array (i.e. distance[i, j] = d and distance[j, i] = d), I get 6 orders of magnitude speedup with Numba. So basically, the computations are lightning fast, but accessing the array which holds the results is slow. Is there a way to speed up array access?
NumPy and Numba functions
import numpy as np
from numba import jit, float64, void

def pairwise_distance_numpy(distance, point):
    numPoints = point.shape[0]
    for i in range(numPoints):
        for j in range(0, i):
            d = 0.0
            for k in range(3):
                tmp = point[i, k] - point[j, k]
                d += tmp*tmp
            d = d**0.5
            distance[i, j] = d
            distance[j, i] = d

pairwise_distance_numba = jit(void(float64[:,:], float64[:,:]), nopython=True)(pairwise_distance_numpy)
Benchmark script
import numpy as np
from time import time
from pairwise_distance import pairwise_distance_numpy as pd_numpy
from pairwise_distance import pairwise_distance_numba as pd_numba

n = 1000
point = np.random.rand(n, 3)
distance = np.empty([n, n], dtype=np.float64)

pd_numpy(distance, point)  # warm-up call
t = time()
pd_numpy(distance, point)
dt_numpy = time() - t
print('NumPy elapsed time: ', dt_numpy)

pd_numba(distance, point)  # warm-up call (triggers JIT compilation)
t = time()
pd_numba(distance, point)
dt_numba = time() - t
print('Numba elapsed time: ', dt_numba)

print('Numba speedup: ', dt_numpy/dt_numba)
It seems Numba just optimized the calculations away: since you're no longer storing the result anywhere, the compiler can eliminate the computation entirely (your code plus your comment confirm this).
Array access in NumPy should be pretty fast in most cases!
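If you want to benchmark the arithmetic alone without the stores, you need to keep the compiler from eliminating the work, e.g. by accumulating the distances into a scalar and returning it. A minimal sketch (pairwise_distance_sum is a hypothetical name, not from the question):

from numba import jit, float64

def pairwise_distance_sum(point):
    # Returning the accumulated sum forces Numba to keep the computation
    numPoints = point.shape[0]
    total = 0.0
    for i in range(numPoints):
        for j in range(i):
            d = 0.0
            for k in range(3):
                tmp = point[i, k] - point[j, k]
                d += tmp * tmp
            total += d**0.5
    return total

pairwise_distance_sum_numba = jit(float64(float64[:, :]),
                                  nopython=True)(pairwise_distance_sum)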
Related
I am trying to do a quite memory-intensive multiplication and it seems that I am always filling up my RAM. The idea is that I have a 2D Gaussian centered at (0,0) and another 2D Gaussian whose distance from (0,0) changes in time. For each time step I need to compute the product of the two Gaussians on a specific grid and sum over all indices, i.e. compute sum_ij(g1_ij * g2_ij) at each time step, ending up with a 1D array of the same length as the time axis.
The code here is just pseudo-code. The problem is the creation of a 1001x1001x25000 array via xx[:,:,np.newaxis] broadcasting against x0, which gives a huge array.
import numpy as np
import numexpr as ne

def gaussian2d(x, y, x0, y0, x_std, y_std):
    return np.exp(-(x-x0)**2/x_std**2 - (y-y0)**2/y_std**2)

x = np.linspace(-5, 5, 1001)
y = np.linspace(-5, 5, 1001)
xx, yy = np.meshgrid(x, y)
g1 = gaussian2d(xx, yy, 0, 0, 0.25, 0.25)

x0 = np.random.rand(25000)
y0 = np.random.rand(25000)

X = np.subtract(xx[:,:,np.newaxis], x0)  # huge (1001, 1001, 25000) array
Y = np.subtract(yy[:,:,np.newaxis], y0)
X_std = 0.75
Y_std = 0.75
temp = ne.evaluate('exp(-(X)**2/(2*X_std**2)-(Y)**2/(2*Y_std**2))')
final = np.sum(np.multiply(temp.T, g1), axis=(1,2))
A very slow alternative would be to just loop along the x0 length, but in the future x0 may be as long as 100000 points. The other solution would be to reduce the grid, but in that case I would lose resolution, and if the fixed function is not a Gaussian but something different, it may affect the calculations.
Any suggestion?
There is no need for such HUGE arrays. xx and yy, as well as X and Y, contain the same line/column repeated 1001 times, which is a huge waste of memory! RAM is a scarce resource (both in throughput and in space), so you should avoid operating on very large arrays and instead work in the CPU caches, which are far faster, or even in CPU registers. You can rewrite the code using loops and use a JIT compiler like Numba (or a static compiler like Cython) to run this efficiently without any of the big arrays. In fact, thinking in terms of loops helps to optimize the operation further even in pure NumPy (see later), so Numba/Cython is not strictly needed. Here is a naive implementation:
import numpy as np
import numba as nb

@nb.njit('(f8[:], f8[:], i8, i8, f8[:,::1])', parallel=True)
def compute(x0, y0, N, T, g1):
    # x and y are the global grid axes defined in the question
    X_std = 0.75
    Y_std = 0.75
    final = np.empty(T)
    for i in nb.prange(T):
        s = 0.0
        for k in range(N):
            for j in range(N):
                X = x[k] - x0[i]
                Y = y[j] - y0[i]
                temp = np.exp(-(X)**2/(2*X_std**2) - (Y)**2/(2*Y_std**2))
                s += g1[k, j] * temp
        final[i] = s
    return final
N = 1001
T = 25000
# [...] (same code as in the question, without the big temporary arrays)
final = compute(x0, y0, N, T, g1)
This code is much faster and uses only a tiny amount of RAM compared to the initial code, which could not even run on a regular PC. The same strategy can be used to compute g1, so as to avoid creating xx and yy, as sketched below.
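For instance, a minimal sketch of building g1 directly from the 1D axes by broadcasting (this assumes the g1[k, j] indexing convention used in compute above, i.e. x varies along the first axis; since the grid and the Gaussian are symmetric here, the result is the same either way):

# g1[k, j] = exp(-x[k]**2/0.25**2 - y[j]**2/0.25**2), no meshgrid needed
g1 = np.exp(-(x[:, None]**2)/0.25**2 - (y[None, :]**2)/0.25**2)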
Actually, the above code is still not optimal. One can split the exponential expression into two parts so as to pre-compute partial results using only basic math. The computation can then be factorized to reduce the number of mathematical operations even more. Here is a better implementation:
@nb.njit('(f8[:], f8[:], i8, i8, f8[:,::1])', parallel=True)
def compute_clever(x0, y0, N, T, g1):
    # x and y are the global grid axes defined in the question
    X_std = 0.75
    Y_std = 0.75
    final = np.empty(T)
    exp1 = np.empty((T, N))
    exp2 = np.empty((T, N))

    # Pre-compute the two separable exponential factors
    for i in nb.prange(T):
        for k in range(N):
            X = x[k] - x0[i]
            exp1[i, k] = np.exp(-(X)**2/(2*X_std**2))

    for i in nb.prange(T):
        for j in range(N):
            Y = y[j] - y0[i]
            exp2[i, j] = np.exp(-(Y)**2/(2*Y_std**2))

    # Factorized double sum: final[i] = sum_k exp1[i,k] * sum_j g1[k,j] * exp2[i,j]
    for i in nb.prange(T):
        s = 0.0
        for k in range(N):
            s2 = 0.0
            for j in range(N):
                s2 += g1[k, j] * exp2[i, j]
            s += s2 * exp1[i, k]
        final[i] = s

    return final
Here are results with N=1001 and T=250 on my 6-core machine:
Naive NumPy: 2380 ms (uses about 4 GiB of RAM)
compute: 374 ms (uses a few MiB of RAM)
compute_clever: 55 ms (uses a few MiB of RAM)
Note that the code can be further optimized using register blocking, though it will make the code more complex. Also note that the last kernel can certainly be computed efficiently using np.einsum (see the sketch below). exp1 and exp2 can also be computed using basic NumPy operations (though it will be a bit less efficient). Thus, you could even solve this with pure NumPy code.
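For reference, a minimal sketch of the last kernel as a single einsum call (assuming exp1, exp2 and g1 are built as above; this one-liner is my illustration, not from the original answer):

# final[i] = sum over k and j of exp1[i,k] * g1[k,j] * exp2[i,j]
final = np.einsum('ik,kj,ij->i', exp1, g1, exp2, optimize=True)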
So, I need help minimizing the time it takes to run this code on large inputs, using only NumPy. I think the for loops make my code inefficient, but I do not know how to turn them into something faster, such as a list comprehension.
def lagrange(p, node, n, x):
    m = []
    # base lagrange polynomial
    for i in range(n):
        for j in range(p+1):
            L = 1
            for k in range(p+1):
                if k != j:
                    L = L*(x[i] - node[k])/(node[j] - node[k])
            m.append(L)
    lagrange = np.array(m).reshape(n, p+1)
    return lagrange
def interpolant(a, b, p, n, x, f):
    m = []
    node = np.linspace(a, b, p+1)
    for j in range(n):
        polynomial = 0
        for i in range(p+1):
            polynomial += f(node[i]) * lagrange(p, node, n, x)
        m.append(polynomial)
    interpolant = np.array(m)
    return interpolant
It appears the value of lagrange(...) (called lagrange_poly below) is recomputed n*(p+1) times for no reason, which is very expensive! You can compute it once before the loop, store it in a variable, and reuse the variable later.
Here is the fixed code:
def uniform_poly_interpolation(a, b, p, n, x, f, produce_fig):
    inter = []
    xhat = np.linspace(a, b, p+1)
    # Compute the Lagrange matrix once, then reuse it inside the loop.
    mat = lagrange_poly(p, xhat, n, x, 1e-10)[0]
    for j in range(n):
        po = 0
        for i in range(p+1):
            po += f(xhat[i]) * mat[i, j]
        inter.append(po)
    interpolant = np.array(inter)
    return interpolant
This should be much, much faster. As a side note, the double loop itself can be collapsed into a single matrix-vector product, sketched below.
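A hedged sketch of that collapse (this assumes mat has shape (n, p+1) as returned by lagrange_poly below and that f accepts array input; transpose mat if your indexing convention differs):

# interpolant[j] = sum_i f(xhat[i]) * mat[j, i]  ==>  one matrix-vector product
interpolant = mat @ f(xhat)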
Moreover, the execution is slow because accessing scalar values of NumPy arrays from CPython is very slow. NumPy is designed to work with whole arrays, not to extract scalar values in loops. Additionally, loops in the CPython interpreter are relatively slow. You can solve this problem efficiently with Numba, which compiles your code to very fast native code using a JIT compiler.
Here is the Numba code:
import numpy as np
import numba as nb

@nb.njit
def lagrange_poly(p, xhat, n, x, tol):
    error_flag = 0
    er = 1
    lagrange_matrix = np.empty((n, p+1), dtype=np.float64)
    # Flag nodes that are too close together (within tol)
    for l in range(p):
        if abs(xhat[l] - xhat[l+1]) < tol:
            error_flag = er
    # Base lagrange polynomial
    for i in range(n):
        for j in range(p+1):
            L = 1.0
            for k in range(p+1):
                if k != j:
                    L = L * (x[i] - xhat[k]) / (xhat[j] - xhat[k])
            lagrange_matrix[i, j] = L
    return lagrange_matrix, error_flag
Overall, this should be several orders of magnitude faster.
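A hypothetical driver, just to show the call and return shapes (the values of a, b, p and n here are assumptions, not from the original post):

import numpy as np

a, b, p, n = 0.0, 1.0, 5, 1000
x = np.linspace(a, b, n)           # evaluation points
xhat = np.linspace(a, b, p + 1)    # interpolation nodes
mat, err = lagrange_poly(p, xhat, n, x, 1e-10)  # mat: (n, p+1), err: 0 or 1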
Is it possible to accelerate the following nested loop in Python (possibly with CUDA or parallel processing)? The order of the elements that are appended to outputList[i] doesn't matter.
Currently, the code takes a long time to complete. Which part is slowing down the code? Is it the append() or the calculation f = c*sin(i*pi/180) + (1/c)*cos(i*pi/180)?
from math import sqrt, sin, cos, pi  # imports needed for the code to run

N = 640*480
outputList = [[] for i in range(N)]

def foo(a, b):  # a, b are always integers
    c = sqrt(a**2 + b**2)
    for deg in range(360):
        f = c*sin(deg*pi/180) + (1/c)*cos(deg*pi/180)
        if f < 1:
            outputList[a].append(f)

if __name__ == '__main__':
    for x in range(N):
        for y in range(60):
            foo(x, y)
As talonmies mentioned, your first step should be using numpy.
import math
import numpy as np

def foo(a, b):
    c = math.hypot(a, b)
    degs = np.arange(360, dtype='f8')
    rads = np.deg2rad(degs)
    fs = c * np.sin(rads) + (1/c)*np.cos(rads)
    below1 = fs[fs < 1.]
    outputList[a].extend(below1)
There are fancier methods to do the sin + cos computation, but I've kept it simple for now; one such option is sketched below. Also, depending on your use case, consider changing outputList to a 2D NumPy array or a list of 1D NumPy arrays.
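For example, a minimal sketch of one such option (my own suggestion, not from the original answer): the sin/cos tables do not depend on a or b, so they can be computed once and reused, and the output can collect 1D arrays instead of individual floats:

import math
import numpy as np

N = 640*480  # as in the question

# The angle tables are the same for every (a, b): compute them once
RADS = np.deg2rad(np.arange(360, dtype='f8'))
SIN, COS = np.sin(RADS), np.cos(RADS)

outputList = [[] for _ in range(N)]  # each entry collects 1D numpy arrays

def foo(a, b):
    c = math.hypot(a, b)
    fs = c * SIN + (1/c) * COS
    outputList[a].append(fs[fs < 1.])

# Afterwards, outputList[i] can be flattened with np.concatenate(outputList[i])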
I have the following code:
# positions: np.ndarray of shape(N,d)
# fitness: np.ndarray of shape(N,)
# mass: np.ndarray of shape(N,)
iteration = 1
while iteration <= maxiter:
    K = round((iteration-maxiter)*(N-1)/(1-maxiter) + 1)
    for i in range(N):
        displacement = positions[:K]-positions[i]
        dist = np.linalg.norm(displacement, axis=-1)
        if i < K:
            dist[i] = 1.0  # prevent 1/0
        force_i = (mass[:K]/dist)[:,np.newaxis]*displacement
        rand = np.random.rand(K,1)
        force[i] = np.sum(np.multiply(rand, force_i), axis=0)
So I have an array that stores the coordinates of N particles in d dimensions. I need to first calculate the Euclidean distance between particle i and the first K particles, and then calculate the 'force' due to each of the K particles. Then, I need to sum over the K particles to find the total force acting on particle i, and repeat for all N particles. This is only part of the code, but after some profiling this part is the most time-critical step.
So my question is how I can optimize the above code. I have tried to vectorize it as much as possible, and I'm not sure if there is still room for improvement. The profiling results say that {method 'reduce' of 'numpy.ufunc' objects}, fromnumeric.py:1778(sum) and linalg.py:2103(norm) take the longest time to run. Is the first one due to array broadcasting? How can I optimize these three function calls?
We would keep the loops, but try to optimize by pre-computing certain things -
from scipy.spatial.distance import cdist

iteration = 1
while iteration <= maxiter:
    K = round((iteration-maxiter)*(N-1)/(1-maxiter) + 1)
    posd = cdist(positions, positions)  # all pairwise distances, computed once
    np.fill_diagonal(posd, 1)           # prevent 1/0
    rands = np.random.rand(N, K)
    s = rands*(mass[:K]/posd[:,:K])
    for i in range(N):
        displacement = positions[:K]-positions[i]
        force[i] = s[i].dot(displacement)
    iteration += 1  # added so the loop terminates
I had to make some adjustments since your code was missing a few parts. But the first optimization would be to get rid of the for i in range(N) loop:
import numpy as np

np.random.seed(42)

N = 10
d = 3
maxiter = 50

positions = np.random.random((N, d))
force = np.random.random((N, d))
fitness = np.random.random(N)
mass = np.random.random(N)

iteration = 1
while iteration <= maxiter:
    K = round((iteration-maxiter)*(N-1)/(1-maxiter) + 1)
    displacement = positions[:K, None]-positions[None, :]
    dist = np.linalg.norm(displacement, axis=-1)
    dist[dist == 0] = 1
    force = np.sum((mass[:K, None, None]/dist[:,:,None])*displacement * np.random.rand(K,N,1), axis=0)
    iteration += 1
Other improvements would be to try faster implementations of the norm, such as scipy.spatial.distance.cdist or numpy.einsum (see the sketch below).
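For instance, a minimal sketch of computing the norms with einsum instead of np.linalg.norm (same shapes as in the loop above; this line is my illustration, not from the original answer):

# dist[i, j] = sqrt(sum_k displacement[i, j, k]**2), without np.linalg.norm
dist = np.sqrt(np.einsum('ijk,ijk->ij', displacement, displacement))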
I'm trying to use numbapro to write a simple matrix vector multiplication below:
from numbapro import cuda
from numba import *
import numpy as np
import math
from timeit import default_timer as time

n = 100

@cuda.jit('void(float32[:,:], float32[:], float32[:])')
def cu_matrix_vector(A, b, c):
    y, x = cuda.grid(2)
    if y < n:
        c[y] = 0.0
    if x < n and y < n:
        for i in range(n):
            c[y] += A[y, i] * b[i]

A = np.array(np.random.random((n, n)), dtype=np.float32)
B = np.array(np.random.random((n, 1)), dtype=np.float32)
C = np.empty_like(B)

s = time()
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dC = cuda.to_device(C)
cu_matrix_vector(dA, dB, dC)
dC.to_host()
e = time()
tcuda = e - s
but I'm getting the following error:
numbapro.cudadrv.error.CudaDriverError: CUDA_ERROR_LAUNCH_FAILED Failed to copy memory D->H
I don't understand why the device-to-host copy is failing. Please help.
Your code has multiple problems.

1. The B and C vectors are Nx1 2D matrices, not 1D vectors, but the type signature of your kernel lists them as "float32[:]" -- 1D vectors. The kernel also indexes them with a single index, which results in runtime errors on the GPU due to misaligned access (cuda-memcheck is your friend here!).
2. Your kernel assumes a 2D grid but only uses one column of it, meaning many threads do the same computation and overwrite each other.
3. There is no execution configuration given, so NumbaPro launches the kernel with 1 block of 1 thread (nvprof is your friend here!).
Here is code that works. Note that it uses a 1D grid of 1D blocks and loops over the columns of the matrix, so it is optimized for the case where the number of rows in the vector/matrix is large. A kernel optimized for a short, wide matrix would need another approach (parallel reductions), but in that case I would use CUBLAS sgemv (which is also exposed in NumbaPro) instead.
from numbapro import cuda
from numba import *
import numpy as np
import math
from timeit import default_timer as time

m = 100000
n = 100

@cuda.jit('void(f4[:,:], f4[:], f4[:])')
def cu_matrix_vector(A, b, c):
    row = cuda.grid(1)
    if row < m:
        sum = 0
        for i in range(n):
            sum += A[row, i] * b[i]
        c[row] = sum

A = np.array(np.random.random((m, n)), dtype=np.float32)
B = np.array(np.random.random(m), dtype=np.float32)
C = np.empty_like(B)

s = time()
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dC = cuda.to_device(C)
# 1D grid: enough 512-thread blocks to cover all m rows
cu_matrix_vector[(m+511)//512, 512](dA, dB, dC)
dC.to_host()
print(C)
e = time()
tcuda = e - s