GPU and `jax` performance mysteries

I have been playing with jax lately, and it is very impressive, but then the following set of experiments confused me greatly:
First, we set up the timer utility:
import time
def timefunc(foo, *args):
    tic = time.perf_counter()
    tmp = foo(*args)
    toc = time.perf_counter()
    print(toc - tic)
    return tmp
Now, let’s see what happens when we compute the eigenvalues of a random symmetric matrix (jnp is jax.numpy, so the eigh is done on the GPU):
import numpy as np
import jax
import jax.numpy as jnp
def jfunc(n):
    tmp = np.random.randn(n, n)
    return jnp.linalg.eigh(tmp + tmp.T)
def nfunc(n):
    tmp = np.random.randn(n, n)
    return np.linalg.eigh(tmp + tmp.T)
Now for the timings (the machine is an NVIDIA DGX box, so the GPU is an A100, while the CPUs are some AMD EPYC2 parts):
>>> e1 = timefunc(nfunc, 10)
0.0002442029945086688
>>> e2 = timefunc(jfunc, 10)
0.013523647998226807
>>> e1 = timefunc(nfunc, 100)
0.11742364699603058
>>> e2 = timefunc(jfunc, 100)
0.11005625998950563
>>> e1 = timefunc(nfunc, 1000)
0.6572738009999739
>>> e2 = timefunc(jfunc, 1000)
0.5530761769914534
>>> e1 = timefunc(nfunc, 10000)
36.22587636699609
>>> e2 = timefunc(jfunc, 10000)
8.867857075005304
You will notice that the crossover is somewhere around 1000. Initially, I thought this was because of the overhead of moving stuff to/from the GPU, but if you define yet another function:
def jjfunc(n):
    key = jax.random.PRNGKey(0)
    tmp = jax.random.normal(key, [n, n])
    return jnp.linalg.eigh(tmp + tmp.T)
>>> e1=timefunc(jjfunc, 10)
0.01886096798989456
>>> e1=timefunc(jjfunc, 100)
0.2756766739912564
>>> e1=timefunc(jjfunc, 1000)
0.7205733209993923
>>> e1=timefunc(jjfunc, 10000)
6.8624101399909705
Note that the small examples are actually (much) slower than moving the numpy array to the GPU and back.
So, my question is: what is going on, and is there a silver bullet? Is this a jax implementation bug?

I don't think your timings are reflective of actual JAX vs. numpy performance, for a few reasons:
JAX's computation model uses Asynchronous Dispatch, which means that JAX operations return before the computation is finished. As mentioned at that link, you can use the block_until_ready() method to ensure you are timing the computation rather than the dispatch.
Because operations like eigh are JIT-compiled by default, the first time you run them for a given size will incur the one-time compilation cost. Subsequent runs will be faster as JAX caches previous compilations.
Your computations are indeed being foiled by device transfer costs. It's easiest to see if you measure it directly:
def transfer(n):
    tmp = np.random.randn(n, n)
    return jnp.array(tmp).block_until_ready()
timefunc(transfer, 10000);
# 4.600406924000026
Your jjfunc combines the eigh call with the jax.random.normal call. The latter is slower than numpy's random number generation, and I believe is dominating the difference for small n.
Unrelated to JAX, but in general timing a single call by hand (with time.time or a similar timer) can give you misleading results. Modules like timeit are much better for this kind of thing, particularly when you're dealing with microbenchmarks that complete in fractions of a second.
If you're interested in accurate benchmarks of JAX vs. NumPy versions of algorithms, I'd suggest isolating exactly the operations you're interested in benchmarking (i.e. generate the data and do any device transfer outside the benchmarks). Read up on the advice in Asynchronous Dispatch in JAX as it relates to benchmarking, and check out Python's timeit docs for tips on getting accurate timings of small code snippets (though I find the %timeit magic more convenient when working in IPython or a Jupyter notebook).
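As a rough illustration of the advice above, here is a minimal timing sketch. The helper name bench and its sync parameter are my own inventions, not part of JAX or NumPy, and the demo runs on NumPy only so that it works without a GPU. For JAX, one would pass sync=jax.block_until_ready and move the input to the device beforehand (for example with jax.device_put), assuming JAX is installed.

```python
import time
import numpy as np

def bench(fn, *args, sync=lambda out: out, repeats=5):
    """Return the best wall-clock time in seconds of fn(*args) over several runs."""
    # Warm-up call: for a JIT-compiled function this absorbs the one-time
    # compilation cost so that it is not counted in the measurements.
    sync(fn(*args))
    best = float("inf")
    for _ in range(repeats):
        tic = time.perf_counter()
        # sync() must block until the result is really available.
        # Hypothetical JAX usage: sync=jax.block_until_ready.
        sync(fn(*args))
        best = min(best, time.perf_counter() - tic)
    return best

# Build the input once, outside the timed region.
rng = np.random.default_rng(0)
tmp = rng.standard_normal((200, 200))
sym = tmp + tmp.T

t_numpy = bench(np.linalg.eigh, sym)
print(f"numpy eigh, n=200: {t_numpy:.6f} s")
```

Taking the minimum over several repeats is one common convention (timeit's documentation suggests something similar); a mean or median would be an equally reasonable choice depending on what you want to measure.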

Related

Can I use PyOpenCL in integration with Scipy to perform Differential Evolution in parallel with GPU?

I got my code for simulating a multivariate regression model to work using the Differential Evolution, and even got the multiprocessing option to help out in reducing runtime. However, with 7 independent variables with 10 values each and matrix operations on 21 100+ element matrices takes a bit of time to work on 24 cores.
I don't have much experience with multiprocessing with PyOpenCL, so I wanted to ask if it's worth entering into and trying to integrate the two to work on the GPU. I've attached the code snippet of 3 variables and 3 values for reference:
import scipy.optimize as op
import numpy as np
def func(vars, *args):
    res = []
    x = []
    for i in args[1:]:
        if len(res) + 1 > len(args)//2:
            x.append(i)
            continue
        res.append(np.array(i).T)
    f1 = 0
    for i in range(len(x[0])):
        for j in range(len(x[1])):
            diff = (vars[0]*x[0][i] + vars[1])*(vars[2]*x[1][j]*x[1][j] + vars[3]*x[1][j] + vars[4])*(vars[5]*50*50 + vars[6]*50 + vars[7])
            f1 = f1 + abs(res[0][i][j] - diff) # ID-Pitch
    f2 = 0
    for i in range(len(x[0])):
        for j in range(len(x[2])):
            diff = (vars[0]*x[0][i] + vars[1])*(vars[5]*x[2][j]*x[2][j] + vars[6]*x[2][j] + vars[7])*(vars[2]*10*10 + vars[3]*10 + vars[4])
            f2 = f2 + abs(res[1][i][j] - diff) # ID-Depth
    f3 = 0
    for i in range(len(x[1])):
        for j in range(len(x[2])):
            diff = (vars[2]*x[1][i]*x[1][i] + vars[3]*x[1][i] + vars[4])*(vars[5]*x[2][j]*x[2][j] + vars[6]*x[2][j] + vars[7])*(vars[0]*3.860424005 + vars[1])
            f3 = f3 + abs(res[2][i][j] - diff) # Pitch-Depth
    return f1 + f2 + f3
def main():
    res1 = [[134.3213274,104.8030828,75.28483813],[151.3351445,118.07797,84.82079556],[135.8343927,105.9836392,76.1328857]]
    res2 = [[131.0645086,109.1574174,91.1952225],[54.74920444,30.31300092,17.36537062],[51.8931954,26.45139822,17.28693162]]
    res3 = [[131.0645086,141.2210331,133.3192429],[54.74920444,61.75898314,56.52756593],[51.8931954,52.8191817,52.66531712]]
    x1 = np.array([3.860424005,7.72084801,11.58127201])
    x2 = np.array([10,20,30])
    x3 = np.array([50,300,500])
    interval = (-20,20)
    bds = [interval,interval,interval,interval,interval,interval,interval,interval]
    res = op.differential_evolution(func, bounds=bds, workers=-1, maxiter=100000, tol=0.01, popsize=15, args=([1,2,2], res1, res2, res3, x1, x2, x3))
    print(res)
if __name__ == '__main__':
    main()

Firstly, yes, it's possible: func can be a function that sends the data to the GPU, waits for the computations to finish, transfers the result back to RAM, and returns it to SciPy.
Moving computations from the CPU to the GPU is not always beneficial, because of the time required to transfer data back and forth. With a moderate laptop GPU, for example, you may not get any speedup at all, and your code might even be slower. Reducing data transfer between the GPU and RAM can make the GPU 2-4 times faster than an average CPU, but your code requires data transfer, so that won't be possible.
For powerful GPUs with high bandwidth (things like an RTX 2070 or RTX 3070, or APUs), you can expect computations on the GPU to be a few times faster than on the CPU even with the data transfer, but it depends on the implementation of both the CPU and the GPU code.
Lastly, your code can be sped up without a GPU, which is likely the first thing you should try. Compilers like Cython and Numba can speed up your code by almost 100 times with little effort and without major modifications. You should first convert your code to use only fixed-size preallocated NumPy arrays rather than lists: the code will be much faster, you can even release the GIL and run it multithreaded, and both tools provide good multithreaded loop implementations.
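To make the "fixed-size NumPy arrays instead of lists" suggestion concrete, here is a sketch of the first term (f1) of the objective from the question, written once with the original double loop and once with whole-array operations. The function names f1_loops and f1_vectorized and the test vector v are hypothetical; this only shows the array formulation and is not compiled with Numba or Cython.

```python
import numpy as np

def f1_loops(v, res0, x0, x1):
    # Same arithmetic as the f1 loop in the question.
    total = 0.0
    for i in range(len(x0)):
        for j in range(len(x1)):
            diff = ((v[0]*x0[i] + v[1])
                    * (v[2]*x1[j]*x1[j] + v[3]*x1[j] + v[4])
                    * (v[5]*50*50 + v[6]*50 + v[7]))
            total += abs(res0[i][j] - diff)
    return total

def f1_vectorized(v, res0, x0, x1):
    a = v[0]*x0 + v[1]                  # shape (len(x0),)
    b = v[2]*x1**2 + v[3]*x1 + v[4]     # shape (len(x1),)
    c = v[5]*50*50 + v[6]*50 + v[7]     # scalar
    model = np.outer(a, b) * c          # model[i, j] = a[i] * b[j] * c
    return np.abs(res0 - model).sum()

x0 = np.array([3.860424005, 7.72084801, 11.58127201])
x1 = np.array([10.0, 20.0, 30.0])
res0 = np.array([[134.3213274, 104.8030828, 75.28483813],
                 [151.3351445, 118.07797, 84.82079556],
                 [135.8343927, 105.9836392, 76.1328857]]).T
v = np.linspace(-1.0, 1.0, 8)           # arbitrary parameter vector for the check

print(f1_loops(v, res0, x0, x1), f1_vectorized(v, res0, x0, x1))
```

The other two terms follow the same pattern with different array pairs. Whether this beats a compiled loop for 10-element arrays is something you would need to measure.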

Efficient computation of a loop of integrals in Python

I was wondering how to speed up the following code, in which I compute a probability function that involves numerical integrals and then compute some confidence margins.
Some possibilities that I have thought about are Numba or vectorization of the code.
EDIT:
I have made minor modifications because there was a mistake. I am looking for some modifications that provide major time improvements (I know that there are some minor changes that would provide some minor time improvements, such as repeated functions, but I am not concerned about them)
The code is:
# -*- coding: utf-8 -*-
"""
Created on Tue Jan 26 17:05:46 2021
@author: Ignacio
"""
import numpy as np
from scipy.integrate import simpson as simps  # simps was renamed simpson and removed in SciPy 1.14
def pdf(V,alfa_points):
    alfa=np.linspace(0,2*np.pi,alfa_points)
    return simps(1/np.sqrt(2*np.pi)/np.sqrt(sigma_R2)*np.exp(-(V*np.cos(alfa)-eR)**2/2/sigma_R2)*1/np.sqrt(2*np.pi)/np.sqrt(sigma_I2)*np.exp(-(V*np.sin(alfa)-eI)**2/2/sigma_I2),x=alfa)
def find_nearest(array,value):
    array=np.asarray(array)
    idx = (np.abs(array-value)).argmin()
    return array[idx]
N = 20
n=np.linspace(0,N-1,N)
d=1
sigma_An=0.1
sigma_Pn=0.2
An=np.ones(N)
Pn=np.zeros(N)
Vs=np.linspace(0,30,1000)
inc=np.max(Vs)/len(Vs)
th=np.linspace(0,np.pi/2,250)
R=np.sum(An*np.cos(Pn+2*np.pi*np.sin(th[:,np.newaxis])*n*d),axis=1)
I=np.sum(An*np.sin(Pn+2*np.pi*np.sin(th[:,np.newaxis])*n*d),axis=1)
fmin=np.zeros(len(th))
fmax=np.zeros(len(th))
for tt in range(len(th)):
    eR=np.exp(-sigma_Pn**2/2)*np.sum(An*np.cos(Pn+2*np.pi*np.sin(th[tt])*n*d))
    eI=np.exp(-sigma_Pn**2/2)*np.sum(An*np.sin(Pn+2*np.pi*np.sin(th[tt])*n*d))
    sigma_R2=1/2*np.sum(An*sigma_An**2)+1/2*(1-np.exp(-sigma_Pn**2))*np.sum(An**2)+1/2*np.sum(np.cos(2*(Pn+2*np.pi*np.sin(th[tt])*n*d))*((An**2+sigma_An**2)*np.exp(-2*sigma_Pn**2)-An**2*np.exp(-sigma_Pn**2)))
    sigma_I2=1/2*np.sum(An*sigma_An**2)+1/2*(1-np.exp(-sigma_Pn**2))*np.sum(An**2)-1/2*np.sum(np.cos(2*(Pn+2*np.pi*np.sin(th[tt])*n*d))*((An**2+sigma_An**2)*np.exp(-2*sigma_Pn**2)-An**2*np.exp(-sigma_Pn**2)))
    PDF=np.zeros(len(Vs))
    for vv in range(len(Vs)):
        PDF[vv]=pdf(Vs[vv],100)
    total=simps(PDF,x=Vs)
    values=np.cumsum(PDF)*inc/total
    xval_05=find_nearest(values,0.05)
    fmin[tt]=Vs[values==xval_05]
    xval_95=find_nearest(values,0.95)
    fmax[tt]=Vs[values==xval_95]

This version's speedup: 31x
A simple profiling (%%prun) reveals that most of the time is spent in simps.
You are in control of the integration done in pdf(): for example, you can use the trapezoidal rule instead of Simpson's rule with negligible numerical difference if you slightly increase the resolution of alpha. In fact, the higher resolution obtained by sampling alpha more finely more than makes up for the difference between simps and the trapezoidal rule (see the picture at the bottom for why). This is by far the largest speedup. We go one step further by implementing the trapezoidal rule ourselves instead of using SciPy, since it is so simple. This alone yields a marginal gain, but opens the door to a more drastic optimization (see below, about pdf2D).
Also, the remaining simps(PDF, ...) goes faster when it knows that the dx step is constant, so we can just say so instead of passing the whole Vs array.
You can avoid doing the loop to compute PDF and use np.vectorize(pdf) directly on Vs, or better (as in the code below), do a 2-D version of that calculation.
There are some other minor things, such as using an index directly (fmin[tt] = Vs[closest(values, 0.05)]) instead of finding the index, returning the value, and then using a boolean mask for where values == xval_05, or taking all the constants (including alpha) outside the functions to avoid recalculating them every time.
The above gives us a 5.2x improvement. There are a number of things I don't understand in your code, e.g. why have An (ones) and Pn (zeros)?
But, importantly, another ~6x speedup comes from the observation that, since we are implementing our own trapezoidal rule using NumPy primitives, we can actually do it in 2D in one go for the whole PDF.
The final speedup of the code below is 31x. I believe that a better understanding of "the big picture" of what you want to do would yield additional, perhaps substantial, speed gains.
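For readers unfamiliar with the trick, here is a small standalone sketch of a hand-written trapezoidal rule on a uniform grid, applied first to a 1-D integrand and then to a 2-D array where every column is a separate integrand. The name trapz_uniform and the sine test integrand are illustrative choices of mine, not part of the code in this answer.

```python
import numpy as np

def trapz_uniform(y, dx):
    """Trapezoidal rule along axis 0 for samples on a uniform grid of spacing dx."""
    # Sum of all samples minus half of the two end points, times the step.
    return (y.sum(axis=0) - (y[0] + y[-1]) / 2) * dx

alpha = np.linspace(0, np.pi, 201)
dx = alpha[1] - alpha[0]

# 1-D check: the exact integral of sin over [0, pi] is 2.
one_d = trapz_uniform(np.sin(alpha), dx)

# 2-D: column j holds k[j] * sin(alpha), so one call integrates all columns.
k = np.array([1.0, 2.0, 3.0])
two_d = trapz_uniform(np.outer(np.sin(alpha), k), dx)

print(one_d, two_d)
```

Because the sum runs along axis 0 only, the same function handles any number of columns without a Python loop, which is the idea behind the 2-D version described above.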
Modified code:
import numpy as np
from scipy.integrate import simpson as simps  # simps was renamed simpson and removed in SciPy 1.14
alpha_points = 200 # more points as we'll use the trapezoidal rule, not simps
alpha = np.linspace(0, 2*np.pi, alpha_points)
cosalpha = np.cos(alpha)
sinalpha = np.sin(alpha)
d_alpha = np.mean(np.diff(alpha)) # constant dx
coeff = 1 / np.sqrt(2*np.pi)
Vs=np.linspace(0,30,1000)
d_Vs = np.mean(np.diff(Vs)) # constant dx
inc=np.max(Vs)/len(Vs)
def f2D(Vs, eR, sigma_R2, eI, sigma_I2):
    a = coeff / np.sqrt(sigma_R2)
    b = coeff / np.sqrt(sigma_I2)
    y = a * np.exp(-(np.outer(cosalpha, Vs) - eR)**2 / 2 / sigma_R2) * b * np.exp(-(np.outer(sinalpha, Vs) - eI)**2 / 2 / sigma_I2)
    return y
def pdf2D(Vs, eR, sigma_R2, eI, sigma_I2):
    y = f2D(Vs, eR, sigma_R2, eI, sigma_I2)
    s = y.sum(axis=0) - (y[0] + y[-1]) / 2 # our own impl of the trapezoidal rule, on 2D y
    return s * d_alpha
def closest(a, val):
    return np.abs(a - val).argmin()
N = 20
n = np.linspace(0,N-1,N)
d = 1
sigma_An = 0.1
sigma_Pn = 0.2
An=np.ones(N)
Pn=np.zeros(N)
th = np.linspace(0,np.pi/2,250)
R = np.sum(An*np.cos(Pn+2*np.pi*np.sin(th[:,np.newaxis])*n*d),axis=1)
I = np.sum(An*np.sin(Pn+2*np.pi*np.sin(th[:,np.newaxis])*n*d),axis=1)
fmin=np.zeros(len(th))
fmax=np.zeros(len(th))
for tt in range(len(th)):
    eR=np.exp(-sigma_Pn**2/2)*np.sum(An*np.cos(Pn+2*np.pi*np.sin(th[tt])*n*d))
    eI=np.exp(-sigma_Pn**2/2)*np.sum(An*np.sin(Pn+2*np.pi*np.sin(th[tt])*n*d))
    sigma_R2=1/2*np.sum(An*sigma_An**2)+1/2*(1-np.exp(-sigma_Pn**2))*np.sum(An**2)+1/2*np.sum(np.cos(2*(Pn+2*np.pi*np.sin(th[tt])*n*d))*((An**2+sigma_An**2)*np.exp(-2*sigma_Pn**2)-An**2*np.exp(-sigma_Pn**2)))
    sigma_I2=1/2*np.sum(An*sigma_An**2)+1/2*(1-np.exp(-sigma_Pn**2))*np.sum(An**2)-1/2*np.sum(np.cos(2*(Pn+2*np.pi*np.sin(th[tt])*n*d))*((An**2+sigma_An**2)*np.exp(-2*sigma_Pn**2)-An**2*np.exp(-sigma_Pn**2)))
    PDF=pdf2D(Vs, eR, sigma_R2, eI, sigma_I2)
    total = simps(PDF, dx=d_Vs)
    values = np.cumsum(PDF) * inc / total
    fmin[tt] = Vs[closest(values, 0.05)]
    fmax[tt] = Vs[closest(values, 0.95)]
Note: most of the fmin and fmax values are np.allclose() to those of the original code, but some of them have a small error. After some digging, it turns out that the implementation here is more precise: the function f() can be pretty abrupt, so more alpha points actually help (and more than compensate for the minuscule loss of precision from using the trapezoidal rule instead of Simpson's rule).
For example, at index tt=244, vv=400:

Considering several methods, the one that provides the largest time improvement is the Numba method. The method proposed by Pierre is very interesting and does not require installing other packages, which is an asset.
However, in the examples that I have computed, the time improvement is not as large as with the Numba example, especially when the number of points in th grows to a few tens of thousands (which is my actual case). I post the Numba code here in case someone is interested:
import numpy as np
from numba import njit
@njit
def margins(val_min,val_max):
    fmin=np.zeros(len(th))
    fmax=np.zeros(len(th))
    for tt in range(len(th)):
        eR=np.exp(-sigma_Pn**2/2)*np.sum(An*np.cos(Pn+2*np.pi*np.sin(th[tt])*n*d))
        eI=np.exp(-sigma_Pn**2/2)*np.sum(An*np.sin(Pn+2*np.pi*np.sin(th[tt])*n*d))
        sigma_R2=1/2*np.sum(An*sigma_An**2)+1/2*(1-np.exp(-sigma_Pn**2))*np.sum(An**2)+1/2*np.sum(np.cos(2*(Pn+2*np.pi*np.sin(th[tt])*n*d))*((An**2+sigma_An**2)*np.exp(-2*sigma_Pn**2)-An**2*np.exp(-sigma_Pn**2)))
        sigma_I2=1/2*np.sum(An*sigma_An**2)+1/2*(1-np.exp(-sigma_Pn**2))*np.sum(An**2)-1/2*np.sum(np.cos(2*(Pn+2*np.pi*np.sin(th[tt])*n*d))*((An**2+sigma_An**2)*np.exp(-2*sigma_Pn**2)-An**2*np.exp(-sigma_Pn**2)))
        Vs=np.linspace(0,30,1000)
        inc=np.max(Vs)/len(Vs)
        integration_points=200
        PDF=np.zeros(len(Vs))
        for vv in range(len(Vs)):
            PDF[vv]=np.trapz(1/np.sqrt(2*np.pi)/np.sqrt(sigma_R2)*np.exp(-(Vs[vv]*np.cos(np.linspace(0,2*np.pi,integration_points))-eR)**2/2/sigma_R2)*1/np.sqrt(2*np.pi)/np.sqrt(sigma_I2)*np.exp(-(Vs[vv]*np.sin(np.linspace(0,2*np.pi,integration_points))-eI)**2/2/sigma_I2),np.linspace(0,2*np.pi,integration_points))
        total=np.trapz(PDF,Vs)
        values=np.cumsum(PDF)*inc/total
        idx = (np.abs(values-val_min)).argmin()
        xval_05=values[idx]
        fmin[tt]=Vs[np.where(values==xval_05)[0][0]]
        idx = (np.abs(values-val_max)).argmin()
        xval_95=values[idx]
        fmax[tt]=Vs[np.where(values==xval_95)[0][0]]
    return fmin,fmax
N = 20
n=np.linspace(0,N-1,N)
d=1
sigma_An=1/2**6
sigma_Pn=2*np.pi/2**6
An=np.ones(N)
Pn=np.zeros(N)
th=np.linspace(0,np.pi/2,250)
R=np.sum(An*np.cos(Pn+2*np.pi*np.sin(th[:,np.newaxis])*n*d),axis=1)
I=np.sum(An*np.sin(Pn+2*np.pi*np.sin(th[:,np.newaxis])*n*d),axis=1)
F=R+1j*I
Fa=np.abs(F)/np.max(np.abs(F))
fmin, fmax = margins(0.05,0.95)

How to schedule multiple 1d FFTs using Scikit-cuda FFT?

I'm looking to parallelize multiple 1d FFTs using CUDA. I'm working on a GTX 1050 Ti (CUDA compute capability 6.1).
For instance in the code I attached, I have a 3d input array 'data', and I want to do 1d FFTs over the second dimension of this array. The purpose is, of course, to speed up the execution time by an order of magnitude.
I'm able to use Python's scikit-cuda's cufft package to run a batch of 1 1d FFT and the results match with NumPy's FFT. The problem comes when I go to a real batch size. There, I'm not able to match the NumPy's FFT output (which is the correct one) with cufft's output (which I believe isn't correct). In the code attached, parameter 'singleFFT' controls whether we schedule a batch of 1 or many. Help in correcting the output FFT and also speeding up execution further (if possible) will be greatly appreciated.
import numpy as np
from time import process_time
from skcuda import cufft as cf
import pycuda.autoinit
from pycuda import gpuarray
# params
nSamp = 512
nTx = 16
nRx = 16
nChirp = 256
NX = nChirp
# Uncomment the following line to generate same data always
# np.random.seed(seed=1)
data = (np.random.randn(nSamp,nChirp,nTx,nRx) + 1j*np.random.randn(nSamp,nChirp,nTx,nRx)).astype(np.complex64)
data = data.reshape(nSamp,-1,nTx*nRx)
dataShp0 = np.int32(data.shape[0])
dataShp2 = np.int32(data.shape[2])
idx1 = 0
idx2 = 0
idx3 = 0
singleFFT = 0
if (1 == singleFFT):
    data_t = data[0,:,0]
    fftAxis = 0
    BATCH = np.int32(1)
else:
    data_t = data
    fftAxis = 1
    BATCH = np.int32(nSamp*nTx*nRx)
# calculate and time NumPy FFT
t1 = process_time()
dataFft = np.fft.fft(data_t, axis=fftAxis)
t2 = process_time()
print('\nCPU NumPy time is: ',t2-t1)
data_o_gpu = gpuarray.empty((BATCH*NX),dtype=np.complex64)
# calculate and time GPU FFT
data_t = data_t.reshape((BATCH*NX))
t1 = process_time()
# transfer input data to Device
data_t_gpu = gpuarray.to_gpu(data_t)
# Make FFT plan
plan = cf.cufftPlan1d(NX, cf.CUFFT_C2C, BATCH)
# Execute FFT plan
res = cf.cufftExecC2C(plan, int(data_t_gpu.gpudata), int(data_o_gpu.gpudata), cf.CUFFT_FORWARD)
dataFft_gpu = data_o_gpu.get()
t2 = process_time()
if (0 == singleFFT):
    dataFft_gpu = dataFft_gpu.reshape((nSamp,-1,nTx*nRx))
print('\nGPU time is: ',t2-t1)
print(np.allclose(dataFft,dataFft_gpu,atol=1e-6))
The last line in the code matches the result of NumPy's FFT with cuFFT. It could be seen with singleFFT=1, the result is True, while for singleFFT=0 (i.e. batch of many 1d FFTs), the result is False.

After my attempts, I would conclude that:
Using the cufft library from skcuda is a bit tricky, and getting to the correct FFT output can take a long development time. I also noticed that there wasn't an order-of-magnitude difference in execution time between NumPy's FFT and cufft's FFT (from skcuda).
Using CuPy and arranging the data so that the FFT dimension is laid out in contiguous memory gives an order-of-magnitude improvement in FFT compute time. In my case, the speedup was a little better than 10x!
Using CuPy for FFTs is a great option if one wants to stick to Python-only development. The back-and-forth between C and Python when writing GPU kernels in C is an added overhead that CuPy removes very conveniently, although CuPy itself still lays out the plan and calls the FFT execution engine internally.
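The "contiguous FFT dimension" point can be illustrated on the CPU with NumPy alone. The sketch below uses small stand-in sizes and my own variable names. It moves the FFT axis last so that each transform occupies a contiguous run of memory, flattens to (batch, n_chirp), and compares against the reference. As far as I understand, a plain batched 1-D plan treats the buffer as consecutive length-NX transforms, which is what the naive flatten at the end mimics; treat that reading of the original mismatch as an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samp, n_chirp, n_ant = 8, 16, 4     # small stand-ins for the question's sizes
data = (rng.standard_normal((n_samp, n_chirp, n_ant))
        + 1j * rng.standard_normal((n_samp, n_chirp, n_ant))).astype(np.complex64)

# Reference: FFT along axis 1 of the original layout.
ref = np.fft.fft(data, axis=1)

# Move the FFT axis last and copy, so each transform is contiguous in memory,
# then flatten to (batch, n_chirp).
batched = np.ascontiguousarray(np.moveaxis(data, 1, -1)).reshape(-1, n_chirp)
out = np.fft.fft(batched, axis=-1)

# Undo the rearrangement to compare with the reference.
restored = np.moveaxis(out.reshape(n_samp, n_ant, n_chirp), -1, 1)
print(np.allclose(ref, restored, atol=1e-4))

# Flattening without moving the axis groups the wrong elements per transform.
naive = np.fft.fft(data.reshape(-1, n_chirp), axis=-1).reshape(data.shape)
print(np.allclose(ref, naive, atol=1e-4))
```

With CuPy the same rearrangement should carry over, since cupy.fft is designed to mirror numpy.fft, but I have not included GPU code here because it cannot be run without a device.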

A long term puzzle, how to optimize multi-level loops in python?

I have written a function in python to calculate Delta function in Gauss broadening, which involves 4-level loops. However, the efficiency is very low, about 10 times slower than using Fortran in a similar way.
def Delta_Gaussf(Nw, N_bd, N_kp, hw, eigv):
    Delta_Gauss = np.zeros((Nw,N_kp,N_bd,N_bd),dtype=float)
    for w1 in range(Nw):
        for k1 in range(N_kp):
            for i1 in range(N_bd):
                for j1 in range(N_bd):
                    if ( j1 >= i1 ):
                        Delta_Gauss[w1][k1][i1][j1] = np.exp(pow((eigv[k1][j1]-eigv[k1][i1]-hw[w1])/width,2))
    return Delta_Gauss
I have removed some constants to make it look simpler.
Could anyone help me optimize this script to increase efficiency?

Simply compile it
To get the best performance I recommend Numba (easy to use, good performance). Alternatively, Cython may be a good idea, but it needs a few more changes to your code.
You actually got everything right and implemented a solution that is easy to understand (for a human and, most importantly, for a compiler).
There are basically two ways to gain performance:
Vectorize the code, as @scnerd showed. This is usually a bit slower and more complex than simply compiling a quite simple code that only uses some for loops. Don't vectorize your code and then use a compiler: starting from a simple looping approach, vectorizing is usually some work and leads to a slower and more complex result. The advantage of vectorizing is that you only need NumPy, which is a standard dependency in nearly every Python project that deals with numerical calculations.
Compile the code. If you already have a solution with a few loops and no, or only a few, non-NumPy functions involved, this is often the simplest and fastest solution.
A solution using Numba
You do not have to change much. I changed the pow function to np.power and made some slight changes to the way the arrays are accessed (this isn't really necessary).
import numba as nb
import numpy as np
#performance-debug info
import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')
@nb.njit(fastmath=True)
def Delta_Gaussf_nb(Nw, N_bd, N_kp, hw, width,eigv):
    Delta_Gauss = np.zeros((Nw,N_kp,N_bd,N_bd),dtype=float)
    for w1 in range(Nw):
        for k1 in range(N_kp):
            for i1 in range(N_bd):
                for j1 in range(N_bd):
                    if ( j1 >= i1 ):
                        Delta_Gauss[w1,k1,i1,j1] = np.exp(np.power((eigv[k1,j1]-eigv[k1,i1]-hw[w1])/width,2))
    return Delta_Gauss
Due to the 'if' the SIMD-vectorization fails. In the next step we can remove it (maybe a call outside the njited function to np.triu(Delta_Gauss) will be necessary). I also parallelized the function.
@nb.njit(fastmath=True,parallel=True)
def Delta_Gaussf_1(Nw, N_bd, N_kp, hw, width,eigv):
    Delta_Gauss = np.zeros((Nw,N_kp,N_bd,N_bd),dtype=np.float64)
    for w1 in nb.prange(Nw):
        for k1 in range(N_kp):
            for i1 in range(N_bd):
                for j1 in range(N_bd):
                    Delta_Gauss[w1,k1,i1,j1] = np.exp(np.power((eigv[k1,j1]-eigv[k1,i1]-hw[w1])/width,2))
    return Delta_Gauss
Performance
Nw = 20
N_bd = 20
N_kp = 20
width=20
hw = np.linspace(0., 1.0, Nw)
eigv = np.zeros((N_kp, N_bd),dtype=np.float64)  # np.float was removed in NumPy 1.24
Your version: 0.5s
first_compiled version: 1.37ms
parallel version: 0.55ms
These easy optimizations lead to about 1000x speedup.

BLUF: Using NumPy's full functionality, plus another neat module, you can get the Python code over 100x faster than this raw for-loop code. Using @max9111's answer, however, you can get even faster with much cleaner code and less work.
The resulting code looks nothing like the original, so I'll do the optimization one step at a time so that the process and final code make sense. Essentially, we're going to use a lot of broadcasting in order to get Numpy to perform the looping under the hood (which is always faster than looping in Python). The result computes the full square of results, which means we're necessarily duplicating some work since the result is symmetrical, but it's easier, and honestly probably faster, to do this work in high-performance ways than to have an if at the deepest level of looping in order to avoid the computation. This might be avoidable in Fortran, but probably not in Python. If you want the result to be identical to your provided source, we'll need to take the upper triangle of the result of my code below (which I do in the sample code below... feel free to remove the triu call in actual production, it's not necessary).
First, we'll notice a few things. The main equation has a denominator that performs np.sqrt, but the content of that computation doesn't change at any iteration of the loop, so we'll compute it once and re-use the result. This turns out to be minor, but we'll do it anyway. Next, the main function of the inner two loops is to perform eigv[k1][j1] - eigv[k1][i1], which is quite easy to vectorize. If eigv is a matrix, then eigv[k1] - eigv[k1].T produces a matrix where result[i1, j1] = eigv[k1][j1] - eigv[k1][i1]. This allows us to entirely remove the innermost two loops:
def mine_Delta_Gaussf(Nw, N_bd, N_kp, hw, width, eigv):
    Delta_Gauss = np.zeros((Nw, N_kp, N_bd, N_bd), dtype=float)
    denom = np.sqrt(2.0 * np.pi) * width
    eigv = np.matrix(eigv)
    for w1 in range(Nw):
        for k1 in range(N_kp):
            this_eigv = (eigv[k1] - eigv[k1].T - hw[w1])
            v = np.power(this_eigv / width, 2)
            Delta_Gauss[w1, k1, :, :] = np.exp(-0.5 * v) / denom
    # Take the upper triangle to make the result exactly equal to the original code
    return np.triu(Delta_Gauss)
Well, now that we're on the broadcasting bandwagon, it really seems like the remaining two loops should be possible to remove in the same way. As it happens, it is easy! The only thing we need k1 for is to get the row out of eigv that we're trying to pairwise-subtract... so why not do this to all rows at the same time? We're currently basically subtracting matrices of shapes (1, B) - (B, 1) for each of N rows in eigv (where B is N_bd). We can abuse broadcasting to do this for all rows of eigv simultaneously by subtracting matrices of shapes (N, 1, B) - (N, B, 1) (where N is N_kp):
def mine_Delta_Gaussf(Nw, N_bd, N_kp, hw, width, eigv):
    Delta_Gauss = np.zeros((Nw, N_kp, N_bd, N_bd), dtype=float)
    denom = np.sqrt(2.0 * np.pi) * width
    for w1 in range(Nw):
        this_eigv = np.expand_dims(eigv, 1) - np.expand_dims(eigv, 2) - hw[w1]
        v = np.power(this_eigv / width, 2)
        Delta_Gauss[w1, :, :, :] = np.exp(-0.5 * v) / denom
    return np.triu(Delta_Gauss)
The next step should be clear now. We're only using w1 to index hw, so let's do some more broadcasting to make numpy do the looping instead. We're currently subtracting a scalar value from a matrix of shape (N, B, B), so to get the resulting matrix for each of the W values in hw, we need to perform subtraction on matrices of the shapes (1, N, B, B) - (W, 1, 1, 1) and numpy will broadcast everything to produce a matrix of the shape (W, N, B, B):
def Delta_Gaussf(hw, width, eigv):
    eigv_sub = np.expand_dims(eigv, 1) - np.expand_dims(eigv, 2)
    # -1 lets NumPy infer the length W; a literal 0 here would raise an error
    w_sub = np.expand_dims(eigv_sub, 0) - np.reshape(hw, (-1, 1, 1, 1))
    v = np.power(w_sub / width, 2)
    denom = np.sqrt(2.0 * np.pi) * width
    Delta_Gauss = np.exp(-0.5 * v) / denom
    return np.triu(Delta_Gauss)
On my example data, this code is ~100x faster (~900ms to ~10ms). Your mileage might vary.
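If you want to convince yourself that the broadcasting version computes the same thing as the nested loops, a check along the following lines may help. It is a sketch with small random inputs; the names delta_loops and delta_broadcast are mine, and both use the Gaussian with the -0.5 factor and the sqrt(2*pi)*width denominator from the vectorized code above rather than the simplified expression in the question.

```python
import numpy as np

def delta_loops(hw, width, eigv):
    n_w = len(hw)
    n_kp, n_bd = eigv.shape
    out = np.zeros((n_w, n_kp, n_bd, n_bd))
    denom = np.sqrt(2.0 * np.pi) * width
    for w in range(n_w):
        for k in range(n_kp):
            for i in range(n_bd):
                for j in range(i, n_bd):          # j >= i: upper triangle only
                    x = (eigv[k, j] - eigv[k, i] - hw[w]) / width
                    out[w, k, i, j] = np.exp(-0.5 * x**2) / denom
    return out

def delta_broadcast(hw, width, eigv):
    # diff[k, i, j] = eigv[k, j] - eigv[k, i], shape (N, B, B)
    diff = eigv[:, np.newaxis, :] - eigv[:, :, np.newaxis]
    # Broadcast (1, N, B, B) against (W, 1, 1, 1) to get (W, N, B, B)
    x = (diff[np.newaxis] - hw[:, np.newaxis, np.newaxis, np.newaxis]) / width
    denom = np.sqrt(2.0 * np.pi) * width
    # np.triu acts on the last two axes of an N-d array
    return np.triu(np.exp(-0.5 * x**2) / denom)

rng = np.random.default_rng(0)
hw = np.linspace(0.0, 1.0, 3)
eigv = rng.standard_normal((2, 4))
same = np.allclose(delta_loops(hw, 0.5, eigv), delta_broadcast(hw, 0.5, eigv))
print(same)
```

Running the check on tiny shapes first is cheap and catches axis mix-ups before you time anything on real data.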
But wait! There's more! Since our code is all numeric/numpy/python, we can use another handy module called numba to compile this function into an equivalent one with higher performance. Under the hood, it's basically reading what functions we're calling and converting the function into C-types and C-calls to remove the Python function call overhead. It's doing more than that, but that gives the gist of where we're going to gain benefit. To gain this benefit is trivial in this case:
import numba
@numba.jit
def Delta_Gaussf(hw, width, eigv):
    eigv_sub = np.expand_dims(eigv, 1) - np.expand_dims(eigv, 2)
    # -1 lets NumPy infer the length W; a literal 0 here would raise an error
    w_sub = np.expand_dims(eigv_sub, 0) - np.reshape(hw, (-1, 1, 1, 1))
    v = np.power(w_sub / width, 2)
    denom = np.sqrt(2.0 * np.pi) * width
    Delta_Gauss = np.exp(-0.5 * v) / denom
    return np.triu(Delta_Gauss)
The resulting function is down to about ~7ms on my sample data, down from ~10ms, just by adding that decorator. Pretty nice for no effort.
EDIT: @max9111 gave a better answer that points out that numba works much better with the loop syntax than with numpy broadcasting code. With almost no work besides the removal of the inner if statement, he shows that numba.jit can make the almost-original code even faster. The result is much cleaner, in that you still have just the single innermost equation that shows what each value is, and you don't have to follow the magical broadcasting used above. I highly recommend using his answer.
Conclusion
For my given sample data (Nw = 20, N_bd = 20, N_kp = 20), my final runtimes are the following (I've included timings on the same computer for @max9111's solution, first without using parallel execution and then with it on my 2-core VM):
Original code: ~900 ms
Fortran estimate: ~90 ms (based on OP saying it was ~10x faster)
Final numpy code: ~10 ms
Final code with numba.jit: ~7 ms
max9111's solution (serial): ~4ms
max9111 (parallel 2-core): ~3ms
Overall vectorized speedup: ~130x
max9111's numba speedup: ~300x (potentially more with more cores)
I don't know exactly how fast your Fortran code is, but it looks like proper usage of numpy allows you to easily beat it by an order of magnitude, and @max9111's numba solution gives you potentially another order of magnitude.

Rewriting a for loop in pure NumPy to decrease execution time

I recently asked about trying to optimise a Python loop for a scientific application, and received an excellent, smart way of recoding it within NumPy which reduced execution time by a factor of around 100 for me!
However, calculation of the B value is actually nested within a few other loops, because it is evaluated at a regular grid of positions. Is there a similarly smart NumPy rewrite to shave time off this procedure?
I suspect the performance gain for this part would be less marked, and the disadvantages would presumably be that it would not be possible to report back to the user on the progress of the calculation, that the results could not be written to the output file until the end of the calculation, and possibly that doing this in one enormous step would have memory implications? Is it possible to circumvent any of these?
import numpy as np
import time
def reshape_vector(v):
    b = np.empty((3,1))
    for i in range(3):
        b[i][0] = v[i]
    return b
def unit_vectors(r):
    return r / np.sqrt((r*r).sum(0))
def calculate_dipole(mu, r_i, mom_i):
    relative = mu - r_i
    r_unit = unit_vectors(relative)
    A = 1e-7
    num = A*(3*np.sum(mom_i*r_unit, 0)*r_unit - mom_i)
    den = np.sqrt(np.sum(relative*relative, 0))**3
    B = np.sum(num/den, 1)
    return B
N = 20000 # number of dipoles
r_i = np.random.random((3,N)) # positions of dipoles
mom_i = np.random.random((3,N)) # moments of dipoles
a = np.random.random((3,3)) # three basis vectors for this crystal
n = [10,10,10] # points at which to evaluate sum
gamma_mu = 135.5 # a constant
t_start = time.perf_counter()  # time.clock() was removed in Python 3.8
for i in range(n[0]):
    r_frac_x = float(i)/float(n[0])  # np.float was removed in NumPy 1.24
    r_test_x = r_frac_x * a[0]
    for j in range(n[1]):
        r_frac_y = float(j)/float(n[1])
        r_test_y = r_frac_y * a[1]
        for k in range(n[2]):
            r_frac_z = float(k)/float(n[2])
            r_test = r_test_x +r_test_y + r_frac_z * a[2]
            r_test_fast = reshape_vector(r_test)
            B = calculate_dipole(r_test_fast, r_i, mom_i)
            omega = gamma_mu*np.sqrt(np.dot(B,B))
            # write r_test, B and omega to a file
    frac_done = float(i+1)/(n[0]+1)
    t_elapsed = (time.perf_counter()-t_start)
    t_remain = (1-frac_done)*t_elapsed/frac_done
    print(frac_done*100,'% done in',t_elapsed/60.,'minutes...approximately',t_remain/60.,'minutes remaining')

One obvious thing you can do is replace the line
r_test_fast = reshape_vector(r_test)
with
r_test_fast = r_test.reshape((3,1))
Probably won't make any big difference in performance, but in any case it makes sense to use the numpy builtins instead of reinventing the wheel.
Generally speaking, as you probably have noticed by now, the trick with optimizing numpy is to express the algorithm with the help of numpy whole-array operations or at least with slices instead of iterating over each element in python code. What tends to prevent this kind of "vectorization" is so-called loop-carried dependencies, i.e. loops where each iteration is dependent on the result of a previous iteration. Looking briefly at your code, you have no such thing, and it should be possible to vectorize your code just fine.
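To illustrate what a loop-carried dependency looks like, here is a small sketch with toy data of my own choosing. The first loop has independent iterations and maps directly onto a whole-array expression. The second needs the previous iteration's result, so a plain element-wise expression cannot replace it (a running sum happens to have a dedicated primitive, np.cumsum, but many recurrences do not).

```python
import numpy as np

x = np.arange(1.0, 6.0)          # [1, 2, 3, 4, 5]

# No loop-carried dependency: each output depends only on its own input.
squares_loop = np.empty_like(x)
for i in range(len(x)):
    squares_loop[i] = x[i] ** 2
squares_vec = x ** 2             # same result as a whole-array operation

# Loop-carried dependency: iteration i needs the accumulator from i-1.
running_loop = np.empty_like(x)
acc = 0.0
for i in range(len(x)):
    acc += x[i]
    running_loop[i] = acc
running_vec = np.cumsum(x)       # dedicated primitive for this recurrence

print(squares_vec, running_vec)
```

The grid loops in the question are of the first kind: each grid point's B depends only on that point, which is why they can in principle be vectorized.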
EDIT: One solution
I haven't verified this is correct, but should give you an idea of how to approach it.
First, take the cartesian() function, which we'll use. Then
def calculate_dipole_vect(mus, r_i, mom_i):
    # Treat each mu sequentially
    Bs = []
    omega = []
    for mu in mus:
        rel = mu - r_i
        r_norm = np.sqrt((rel * rel).sum(1))
        r_unit = rel / r_norm[:, np.newaxis]
        A = 1e-7
        # Dot product per dipole: sum over the coordinate axis (axis 1 after the transpose)
        num = A*(3*np.sum(mom_i * r_unit, 1)[:, np.newaxis]*r_unit - mom_i)
        den = r_norm ** 3
        B = np.sum(num / den[:, np.newaxis], 0)
        Bs.append(B)
        omega.append(gamma_mu * np.sqrt(np.dot(B, B)))
    return Bs, omega
# Transpose to get more "natural" ordering with row-major numpy
r_i = r_i.T
mom_i = mom_i.T
t_start = time.perf_counter()  # time.clock() was removed in Python 3.8
# cartesian() is the helper referenced above; it is not defined in this snippet
r_frac = cartesian((np.arange(n[0]) / float(n[0]),
                    np.arange(n[1]) / float(n[1]),
                    np.arange(n[2]) / float(n[2])))
r_test = np.dot(r_frac, a)
B, omega = calculate_dipole_vect(r_test, r_i, mom_i)
print('Total time for vectorized: %f s' % (time.perf_counter() - t_start))
Well, in my testing, this is in fact slightly slower than the loop-based approach I started from. The thing is, in the original version in the question, it was already vectorized with whole-array operations over arrays of shape (20000, 3), so any further vectorization doesn't really bring much further benefit. In fact, it may worsen the performance, as above, maybe due to big temporary arrays.

If you profile your code, you'll see that 99% of the running time is in calculate_dipole so reducing the time for this looping really won't give a noticeable reduction in execution time. You still need to focus on calculate_dipole if you want to make this faster. I tried my Cython code for calculate_dipole on this and got a reduction by about a factor of 2 in the overall time. There might be other ways to improve the Cython code too.
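If you have not profiled Python code before, the standard library's cProfile is enough to see where the time goes. The sketch below uses two toy functions, inner and outer, as hypothetical stand-ins for calculate_dipole and the surrounding grid loops. It is not the profile of the code in the question.

```python
import cProfile
import io
import pstats

def inner(n):
    # Stand-in for the expensive kernel (calculate_dipole in the question).
    return sum(i * i for i in range(n))

def outer():
    # Stand-in for the surrounding grid loops.
    return [inner(20000) for _ in range(50)]

profiler = cProfile.Profile()
profiler.enable()
outer()
profiler.disable()

# Collect the report into a string, sorted by cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)            # show the ten most expensive entries
report = stream.getvalue()
print(report)
```

In the report, the cumtime column shows how much of the total is spent inside each function including its callees, which is how you would confirm that nearly all the time sits in the kernel and not in the loops around it.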
