Speed up the differential evolution algorithm with thousands of parameters - python

I am trying to build a lumped rainfall-runoff balance model with a lot of parameters (from 37 to 1099) in Python. As input it receives daily rainfall and temperature data, and it outputs daily flows.
I am stuck on the optimisation method for the model's calibration. I chose the differential evolution algorithm because it is easy to use and can be applied to this kind of problem. The algorithm I wrote works well and it seems to minimise the objective function (which is based on the Nash-Sutcliffe model efficiency, NSE). The problem starts with a higher number of parameters, which significantly slows down the whole algorithm.
The DE algorithm I wrote:
import numpy as np
import flow  # a python file from where I get observed daily flows as a np.array

def differential_evolution(func, bounds, popsize=10, mutate=0.8, CR=0.85, maxiter=50):
    #--- INITIALIZE THE FIRST POPULATION WITHIN THE BOUNDS-------------------+
    bounds = [(0, 250)] * 1 + [(0, 5)] * 366 + [(0, 2)] * 366 + [(0, 100)] * 366
    dim = len(bounds)
    pop_norm = np.random.rand(popsize, dim)
    min_bound, max_bound = np.asarray(bounds).T
    difference = np.fabs(min_bound - max_bound)
    population = min_bound + pop_norm * difference
    # Objective function values for the initial population
    fitness = np.asarray([func(x, flow.l_flow) for x in population])
    best_idx = np.argmin(fitness)
    best = population[best_idx]
    #--- MUTATION -----------------------------------------------------------+
    # This is the part which takes too much time to complete
    for i in range(maxiter):
        print('Generation: ', i)
        for j in range(popsize):
            # Random selection of three individuals to make a noise vector
            idxs = list(range(0, popsize))
            idxs.remove(j)
            x_1, x_2, x_3 = pop_norm[np.random.choice(idxs, 3, replace=True)]
            noice_vector = np.clip(x_1 + mutate * (x_2 - x_3), 0, 1)
            #--- RECOMBINATION ------------------------------------------------------+
            cross_points = np.random.rand(dim) < CR
            if not np.any(cross_points):
                cross_points[np.random.randint(0, dim)] = True
            trial_vector_norm = np.where(cross_points, noice_vector, pop_norm[j])
            trial_vector = min_bound + trial_vector_norm * difference
            crit = func(trial_vector, flow.l_flow)
            # Check for better fitness of objective function
            if crit < fitness[j]:
                fitness[j] = crit
                pop_norm[j] = trial_vector_norm
                if crit < fitness[best_idx]:
                    best_idx = j
                    best = trial_vector
    return best, fitness[best_idx]
The rainfall-runoff model itself is a function that works basically on lists: a for loop iterates over each row to compute the daily flows with a simple equation.
The objective function NSE is vectorised with NumPy arrays:
import model  # a python file where rainfall-runoff model function is defined

def nse_min(parameters, observations):
    # Modeled flows from model function
    Q_modeled = np.array(model.model(parameters))
    # Computation of the NSE fraction
    numerator = np.subtract(observations, Q_modeled) ** 2
    denominator = np.subtract(observations, np.sum(observations)/len(observations)) ** 2
    return np.sum(numerator) / np.sum(denominator)
Is there any chance to speed it up? I found out about the numba library, which "compiles Python code to machine code" and then lets you compute more efficiently on the CPU, or on the GPU using CUDA cores. But I do not study anything related to IT and have no idea how a CPU/GPU works, so I do not know how to use numba properly. Can anyone help me with it? Or can anyone suggest a different optimisation method?
What I use:
Python 3.7.0 64-bit,
Windows 10 Home x64,
Intel Core(TM) i7-7700HQ CPU @ 2.80 GHz,
NVIDIA GeForce GTX 1050 Ti 4GB GDDR5,
16 GB RAM DDR4.
I am a Python beginner who studies water management and sometimes uses Python for scripts that make my life easier in data processing. Thank you in advance for your help.

You can use the Python multiprocessing library. It starts additional processes to run your function, so the work can be spread over several CPU cores.
You can use it like this:
from multiprocessing import Process

def f(name):
    print('hello', name)

if __name__ == '__main__':
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()
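For the differential evolution loop in the question, the natural unit of parallel work is one objective evaluation per individual. The following is a minimal sketch of that idea, not the asker's code: `sphere`, `evaluate_population` and `run_demo` are hypothetical names, and `sphere` merely stands in for the real model plus NSE function. It assumes the objective is a module-level function (so it can be pickled) and that all trial vectors of a generation are built first and then evaluated together.

```python
import numpy as np
from multiprocessing import Pool


def sphere(x):
    """Hypothetical stand-in objective; swap in the real model + NSE function."""
    return float(np.sum(np.asarray(x) ** 2))


def evaluate_population(func, population, pool=None):
    """Evaluate func on every row of population, in parallel if a pool is given."""
    if pool is None:
        return np.array([func(ind) for ind in population])
    # pool.map preserves the order of the individuals
    return np.array(pool.map(func, list(population)))


def run_demo(workers=2, popsize=8, dim=5, seed=0):
    """Create one pool and reuse it, as a DE loop would do for every generation."""
    rng = np.random.default_rng(seed)
    population = rng.random((popsize, dim))
    with Pool(processes=workers) as pool:
        return evaluate_population(sphere, population, pool)
```

On Windows, `run_demo()` (or any code that creates the pool) has to be called from under `if __name__ == '__main__':`. An objective with extra arguments, such as `func(x, flow.l_flow)`, can be wrapped with `functools.partial` before it is passed to `pool.map`. The gain is bounded by the number of CPU cores, so it complements rather than replaces making a single model run faster.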

Related

Can I use PyOpenCL in integration with Scipy to perform Differential Evolution in parallel with GPU?

I got my code for simulating a multivariate regression model to work using differential evolution, and even got the multiprocessing option to help reduce the runtime. However, with 7 independent variables with 10 values each, the matrix operations on 21 matrices of 100+ elements take quite a while even on 24 cores.
I don't have much experience with multiprocessing or PyOpenCL, so I wanted to ask whether it is worth trying to integrate the two to work on the GPU. I've attached a code snippet with 3 variables and 3 values for reference:
import scipy.optimize as op
import numpy as np

def func(vars, *args):
    res = []
    x = []
    for i in args[1:]:
        if len(res) + 1 > len(args)//2:
            x.append(i)
            continue
        res.append(np.array(i).T)
    f1 = 0
    for i in range(len(x[0])):
        for j in range(len(x[1])):
            diff = (vars[0]*x[0][i] + vars[1])*(vars[2]*x[1][j]*x[1][j] + vars[3]*x[1][j] + vars[4])*(vars[5]*50*50 + vars[6]*50 + vars[7])
            f1 = f1 + abs(res[0][i][j] - diff) # ID-Pitch
    f2 = 0
    for i in range(len(x[0])):
        for j in range(len(x[2])):
            diff = (vars[0]*x[0][i] + vars[1])*(vars[5]*x[2][j]*x[2][j] + vars[6]*x[2][j] + vars[7])*(vars[2]*10*10 + vars[3]*10 + vars[4])
            f2 = f2 + abs(res[1][i][j] - diff) # ID-Depth
    f3 = 0
    for i in range(len(x[1])):
        for j in range(len(x[2])):
            diff = (vars[2]*x[1][i]*x[1][i] + vars[3]*x[1][i] + vars[4])*(vars[5]*x[2][j]*x[2][j] + vars[6]*x[2][j] + vars[7])*(vars[0]*3.860424005 + vars[1])
            f3 = f3 + abs(res[2][i][j] - diff) # Pitch-Depth
    return f1 + f2 + f3

def main():
    res1 = [[134.3213274,104.8030828,75.28483813],[151.3351445,118.07797,84.82079556],[135.8343927,105.9836392,76.1328857]]
    res2 = [[131.0645086,109.1574174,91.1952225],[54.74920444,30.31300092,17.36537062],[51.8931954,26.45139822,17.28693162]]
    res3 = [[131.0645086,141.2210331,133.3192429],[54.74920444,61.75898314,56.52756593],[51.8931954,52.8191817,52.66531712]]
    x1 = np.array([3.860424005,7.72084801,11.58127201])
    x2 = np.array([10,20,30])
    x3 = np.array([50,300,500])
    interval = (-20,20)
    bds = [interval,interval,interval,interval,interval,interval,interval,interval]
    res = op.differential_evolution(func, bounds=bds, workers=-1, maxiter=100000, tol=0.01, popsize=15, args=([1,2,2], res1, res2, res3, x1, x2, x3))
    print(res)

if __name__ == '__main__':
    main()
Firstly, yes, it is possible: func can be a function that sends the data to the GPU, waits for the computations to finish, transfers the result back to RAM and returns it to SciPy.
Moving computations from the CPU to the GPU is not always beneficial, because of the time required to transfer data back and forth. With a moderate laptop GPU, for example, you may not get any speedup at all, and your code might even be slower. Reducing data transfer between the GPU and RAM can make the GPU 2-4 times faster than an average CPU, but your code requires that transfer on every call, so that won't be possible.
With powerful, high-bandwidth GPUs (things like an RTX 2070 or RTX 3070, or APUs) you can expect the GPU computation to be a few times faster than the CPU even with the data transfer, but this depends on how both the CPU and the GPU code are implemented.
Lastly, your code can be sped up without a GPU, which is likely the first thing you should do before going for GPU computations. Compilers such as Cython and Numba can speed up this kind of code by almost 100 times with little effort and without major modifications. You should convert your code to use only fixed-size, preallocated NumPy arrays rather than lists, as that will be much faster; you can then even release the GIL and run the code multithreaded, and both tools provide good multithreaded loop implementations.
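Even before reaching for a compiler, the nested Python loops in func can be replaced by NumPy outer products. The following is a minimal sketch rather than the asker's code: `objective_vectorized` and its `fixed` argument are names introduced here, and it assumes `res` holds the three measurement matrices already transposed as func does, and `x` the three coordinate arrays.

```python
import numpy as np


def objective_vectorized(v, res, x, fixed=(3.860424005, 10.0, 50.0)):
    """Sum of absolute residuals over the three planes, without Python loops."""
    x0, x1, x2 = (np.asarray(a, dtype=float) for a in x)

    def id_term(t):       # linear factor, coefficients v[0], v[1]
        return v[0] * t + v[1]

    def pitch_term(t):    # quadratic factor, coefficients v[2], v[3], v[4]
        return v[2] * t * t + v[3] * t + v[4]

    def depth_term(t):    # quadratic factor, coefficients v[5], v[6], v[7]
        return v[5] * t * t + v[6] * t + v[7]

    # Each plane varies two factors on a grid and holds the third at a fixed value
    f1 = np.abs(res[0] - np.outer(id_term(x0), pitch_term(x1)) * depth_term(fixed[2])).sum()
    f2 = np.abs(res[1] - np.outer(id_term(x0), depth_term(x2)) * pitch_term(fixed[1])).sum()
    f3 = np.abs(res[2] - np.outer(pitch_term(x1), depth_term(x2)) * id_term(fixed[0])).sum()
    return f1 + f2 + f3


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = [np.array([3.860424005, 7.72084801, 11.58127201]),
         np.array([10.0, 20.0, 30.0]),
         np.array([50.0, 300.0, 500.0])]
    res = [rng.uniform(0.0, 150.0, (3, 3)) for _ in range(3)]  # made-up data
    print(objective_vectorized(rng.uniform(-1.0, 1.0, 8), res, x))
```

With 10 values per variable each plane is only a 10 x 10 array, so the benefit comes from removing interpreter overhead per element; the data is far too small to keep a GPU busy.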

Optimization of wind farm using Penalty function in Scipy

In the following code I want to optimize a wind farm using a penalty function.
In the first function (newSite) I define the number and layout of the wind turbines. In the next function, after receiving x0 (c = x0 = initial guess), for each 10-degree range of wind directions (wd) I take the c values at the mean wd of that range. For instance, for wd in [0, 10] the mean value is 5, so I take the c values at wd=5 and use them for all wd in [0, 10] and for each wind speed (ws). Note that c is the value that indicates whether a wind turbine is on or off (c=0 means the turbine is off). I then define operating from c, so that operating=0 where c=0 and that turbine is off.
Then I define the penalty function to optimize the power output. Wherever TI_eff > 0.14, a penalty must be subtracted from the original power output. For instance, if sim_res.TI_eff[1][2][3] > 0.14, then curr_func[1][2][3] = sim_res.Power[1][2][3] - 10000*(sim_res.TI_eff[1][2][3]-0.14)**2.
The problem is that when I ran this code it did not give me any result even after I waited for many hours. I think it is stuck in a loop that cannot converge. What is the problem?
import time
from py_wake.examples.data.hornsrev1 import V80
from py_wake.examples.data.hornsrev1 import Hornsrev1Site # We work with the Horns Rev 1 site, which comes already set up with PyWake.
from py_wake import BastankhahGaussian
from py_wake.turbulence_models import GCLTurbulence
from py_wake.deflection_models.jimenez import JimenezWakeDeflection
from scipy.optimize import minimize
from py_wake.wind_turbines.power_ct_functions import PowerCtFunctionList, PowerCtTabular
import numpy as np

def newSite(x,y):
    xNew=np.array([x[0]+560*i for i in range(4)])
    yNew=np.array([y[0]+560*i for i in range(4)])
    x_newsite=np.array([xNew[0],xNew[0],xNew[0],xNew[1]])
    y_newsite=np.array([yNew[0],yNew[1],yNew[2],yNew[0]])
    return (x_newsite,y_newsite)

def wt_simulation(c):
    c = c.reshape(4,360,23)
    site = Hornsrev1Site()
    x, y = site.initial_position.T
    x_newsite,y_newsite=newSite(x,y)
    windTurbines = V80()
    for item in range(4):
        for j in range(10,370,10):
            for i in range(j-10,j):
                c[item][i]=c[item][j-5]
    windTurbines.powerCtFunction = PowerCtFunctionList(
        key='operating',
        powerCtFunction_lst=[PowerCtTabular(ws=[0, 100], power=[0, 0], power_unit='w', ct=[0, 0]), # 0=No power and ct
                             windTurbines.powerCtFunction], # 1=Normal operation
        default_value=1)
    operating = np.ones((4,360,23)) # shape=(#wt,wd,ws)
    operating[c <= 0.5]=0
    wf_model = BastankhahGaussian(site, windTurbines,deflectionModel=JimenezWakeDeflection(),turbulenceModel=GCLTurbulence())
    # run wind farm simulation
    sim_res = wf_model(
        x_newsite, y_newsite, # wind turbine positions
        h=None, # wind turbine heights (defaults to the heights defined in windTurbines)
        wd=None, # Wind direction (defaults to site.default_wd (0,1,...,360 if not overriden))
        ws=None, # Wind speed (defaults to site.default_ws (3,4,...,25m/s if not overriden))
        operating=operating
    )
    curr_func=np.ones((4,360,23))
    for i in range(4):
        for l in range(360):
            for k in range(23):
                if sim_res.TI_eff[i][l][k]-0.14 > 0 :
                    curr_func[i][l][k]=sim_res.Power[i][l][k]-10000*(sim_res.TI_eff[i][l][k]-0.14)**2
                else:
                    curr_func[i][l][k]=sim_res.Power[i][l][k]
    return -float(np.sum(curr_func)) # negative because of scipy minimize

t0 = time.perf_counter()

def solve():
    wt =4 # for V80
    wd=360
    ws=23
    x0 = np.ones((wt,wd,ws)).reshape(-1) # initial value for c
    b=(0,1)
    bounds=np.full((wt,wd,ws,2),b).reshape(-1, 2)
    res = minimize(wt_simulation, x0=x0, bounds=bounds)
    return res

res=solve()
print(f'success status: {res.success}')
print(f'aep: {-res.fun}') # negative to get the true maximum aep
print(f'c values: {res.x}\n')
print(f'elapse: {round(time.perf_counter() - t0)}s')
sim_res=wt_simulation(res.x)
There are a number of things in your approach that are either wrong or incomprehensible to me. Just for fun I have tried your code. A few observations:
Your set of parameters (optimization variables) has a shape of (4, 360, 23), i.e. you are looking at 33,120 parameters. There is no nonlinear optimization algorithm that is going to give you any meaningful answer to a problem that big. Ever. But then again, you shouldn't be looking at SciPy optimize if your optimization variables should only assume 0/1 values.
Calling SciPy minimize like this:
res = minimize(wt_simulation, x0=x0, bounds=bounds)
is going to select a nonlinear optimizer among BFGS, L-BFGS-B and SLSQP (according to the documentation at https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html); with bounds and no constraints, that is L-BFGS-B.
Those algorithms are gradient-based, and since you're not providing a gradient of your objective function, SciPy is going to approximate it numerically with finite differences. Good luck with that when you have 33,000 parameters. Never going to finish.
At the beginning of your objective function you are doing this:
for item in range(4):
    for j in range(10,370,10):
        for i in range(j-10,j):
            c[item][i]=c[item][j-5]
I don't understand why you're doing it, but you are overwriting the input values of c coming from the optimizer.
Your objective function takes 20-25 seconds to evaluate on my powerful workstation. Even if you had only 10-15 optimization parameters, it would take you several days to get any answer out of an optimizer. You have 33,000+ variables. No way.
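To put a number on that, here is a back-of-the-envelope estimate. It assumes forward finite differences (one base evaluation plus one per parameter for each gradient) and the lower end of the 20-25 s per evaluation quoted above; the variable names are only for illustration.

```python
n_params = 4 * 360 * 23            # shape (4, 360, 23) flattened -> 33120
seconds_per_eval = 20.0            # lower end of the 20-25 s measured above
evals_per_gradient = n_params + 1  # forward differences: base point + one per parameter

seconds_per_gradient = evals_per_gradient * seconds_per_eval
days_per_gradient = seconds_per_gradient / 86400.0

print(f"{n_params} parameters -> {days_per_gradient:.1f} days per gradient estimate")
```

Under these assumptions a single gradient estimate already costs about a week of computing, and a gradient-based optimizer needs many of them.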
I don't know why you are doing this and why you're doing it the way you're doing it. You should rethink your approach.
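Independently of the choice of optimizer, the element-wise penalty loop at the end of wt_simulation can be written as one array expression. This is a minimal sketch under the assumption that the power and turbulence-intensity results can be converted to plain NumPy arrays of equal shape; `penalised_power`, `ti_limit` and `weight` are names introduced here, and the arrays in the demo are made up.

```python
import numpy as np


def penalised_power(power, ti_eff, ti_limit=0.14, weight=10000.0):
    """Subtract weight * (ti_eff - ti_limit)**2 wherever ti_eff exceeds ti_limit."""
    power = np.asarray(power, dtype=float)
    # Excess is zero where the limit is respected, so no penalty is applied there
    excess = np.clip(np.asarray(ti_eff, dtype=float) - ti_limit, 0.0, None)
    return power - weight * excess ** 2


if __name__ == "__main__":
    power = np.full((4, 360, 23), 100.0)   # made-up power values
    ti_eff = np.full((4, 360, 23), 0.10)   # made-up turbulence intensities
    ti_eff[1, 2, 3] = 0.24                 # one cell above the 0.14 limit
    print(-float(np.sum(penalised_power(power, ti_eff))))
```

This removes 33,120 Python-level iterations per call, but it does not change the conclusions above: the wake simulation itself still dominates the run time.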

Efficient computation of a loop of integrals in Python

I was wondering how to speed up the following code, in which I compute a probability function that involves numerical integrals and then compute some confidence margins.
Some possibilities that I have thought about are Numba or vectorization of the code.
EDIT:
I have made minor modifications because there was a mistake. I am looking for some modifications that provide major time improvements (I know that there are some minor changes that would provide some minor time improvements, such as repeated functions, but I am not concerned about them)
The code is:
# -*- coding: utf-8 -*-
"""
Created on Tue Jan 26 17:05:46 2021
@author: Ignacio
"""
import numpy as np
from scipy.integrate import simps

def pdf(V,alfa_points):
    alfa=np.linspace(0,2*np.pi,alfa_points)
    return simps(1/np.sqrt(2*np.pi)/np.sqrt(sigma_R2)*np.exp(-(V*np.cos(alfa)-eR)**2/2/sigma_R2)*1/np.sqrt(2*np.pi)/np.sqrt(sigma_I2)*np.exp(-(V*np.sin(alfa)-eI)**2/2/sigma_I2),alfa)

def find_nearest(array,value):
    array=np.asarray(array)
    idx = (np.abs(array-value)).argmin()
    return array[idx]

N = 20
n=np.linspace(0,N-1,N)
d=1
sigma_An=0.1
sigma_Pn=0.2
An=np.ones(N)
Pn=np.zeros(N)
Vs=np.linspace(0,30,1000)
inc=np.max(Vs)/len(Vs)
th=np.linspace(0,np.pi/2,250)
R=np.sum(An*np.cos(Pn+2*np.pi*np.sin(th[:,np.newaxis])*n*d),axis=1)
I=np.sum(An*np.sin(Pn+2*np.pi*np.sin(th[:,np.newaxis])*n*d),axis=1)
fmin=np.zeros(len(th))
fmax=np.zeros(len(th))
for tt in range(len(th)):
    eR=np.exp(-sigma_Pn**2/2)*np.sum(An*np.cos(Pn+2*np.pi*np.sin(th[tt])*n*d))
    eI=np.exp(-sigma_Pn**2/2)*np.sum(An*np.sin(Pn+2*np.pi*np.sin(th[tt])*n*d))
    sigma_R2=1/2*np.sum(An*sigma_An**2)+1/2*(1-np.exp(-sigma_Pn**2))*np.sum(An**2)+1/2*np.sum(np.cos(2*(Pn+2*np.pi*np.sin(th[tt])*n*d))*((An**2+sigma_An**2)*np.exp(-2*sigma_Pn**2)-An**2*np.exp(-sigma_Pn**2)))
    sigma_I2=1/2*np.sum(An*sigma_An**2)+1/2*(1-np.exp(-sigma_Pn**2))*np.sum(An**2)-1/2*np.sum(np.cos(2*(Pn+2*np.pi*np.sin(th[tt])*n*d))*((An**2+sigma_An**2)*np.exp(-2*sigma_Pn**2)-An**2*np.exp(-sigma_Pn**2)))
    PDF=np.zeros(len(Vs))
    for vv in range(len(Vs)):
        PDF[vv]=pdf(Vs[vv],100)
    total=simps(PDF,Vs)
    values=np.cumsum(PDF)*inc/total
    xval_05=find_nearest(values,0.05)
    fmin[tt]=Vs[values==xval_05]
    xval_95=find_nearest(values,0.95)
    fmax[tt]=Vs[values==xval_95]
This version's speedup: 31x
A simple profiling run (%%prun) reveals that most of the time is spent in simps.
You are in control of the integration done in pdf(): for example, you can use the trapezoidal rule instead of Simpson's rule with negligible numerical difference if you slightly increase the resolution of alpha. In fact, the finer sampling of alpha more than makes up for the difference between simps and the trapezoidal rule. This is by far the largest speedup. We go one step further by implementing the trapezoidal rule ourselves instead of using SciPy, since it is so simple. This alone yields a marginal gain, but it opens the door to a more drastic optimization (see pdf2D below).
Also, the remaining simps(PDF, ...) goes faster when it knows that the dx step is constant, so we can just say so instead of passing the whole Vs array.
You can avoid the loop that computes PDF and use np.vectorize(pdf) directly on Vs or, better (as in the code below), do a 2-D version of that calculation.
There are some other minor things, such as using an index directly (fmin[tt] = Vs[closest(values, 0.05)]) instead of finding the index, returning the value, and then using a boolean mask for where values == xval_05, or taking all the constants (including alpha) outside the functions to avoid recalculating them every time.
The above gives us a 5.2x improvement. There are a number of things I don't understand in your code, e.g. why have An (ones) and Pn (zeros)?
But, importantly, another ~6x speedup comes from the observation that, since we are implementing our own trapezoidal rule with NumPy primitives, we can actually do it in 2D in one go for the whole PDF.
The final speedup of the code below is 31x. I believe that a better understanding of "the big picture" of what you want to do would yield additional, perhaps substantial, speed gains.
Modified code:
import numpy as np
from scipy.integrate import simps

alpha_points = 200 # more points as we'll use trapeze not simps
alpha = np.linspace(0, 2*np.pi, alpha_points)
cosalpha = np.cos(alpha)
sinalpha = np.sin(alpha)
d_alpha = np.mean(np.diff(alpha)) # constant dx
coeff = 1 / np.sqrt(2*np.pi)
Vs=np.linspace(0,30,1000)
d_Vs = np.mean(np.diff(Vs)) # constant dx
inc=np.max(Vs)/len(Vs)

def f2D(Vs, eR, sigma_R2, eI, sigma_I2):
    a = coeff / np.sqrt(sigma_R2)
    b = coeff / np.sqrt(sigma_I2)
    y = a * np.exp(-(np.outer(cosalpha, Vs) - eR)**2 / 2 / sigma_R2) * b * np.exp(-(np.outer(sinalpha, Vs) - eI)**2 / 2 / sigma_I2)
    return y

def pdf2D(Vs, eR, sigma_R2, eI, sigma_I2):
    y = f2D(Vs, eR, sigma_R2, eI, sigma_I2)
    s = y.sum(axis=0) - (y[0] + y[-1]) / 2 # our own impl of trapeze, on 2D y
    return s * d_alpha

def closest(a, val):
    return np.abs(a - val).argmin()

N = 20
n = np.linspace(0,N-1,N)
d = 1
sigma_An = 0.1
sigma_Pn = 0.2
An=np.ones(N)
Pn=np.zeros(N)
th = np.linspace(0,np.pi/2,250)
R = np.sum(An*np.cos(Pn+2*np.pi*np.sin(th[:,np.newaxis])*n*d),axis=1)
I = np.sum(An*np.sin(Pn+2*np.pi*np.sin(th[:,np.newaxis])*n*d),axis=1)
fmin=np.zeros(len(th))
fmax=np.zeros(len(th))
for tt in range(len(th)):
    eR=np.exp(-sigma_Pn**2/2)*np.sum(An*np.cos(Pn+2*np.pi*np.sin(th[tt])*n*d))
    eI=np.exp(-sigma_Pn**2/2)*np.sum(An*np.sin(Pn+2*np.pi*np.sin(th[tt])*n*d))
    sigma_R2=1/2*np.sum(An*sigma_An**2)+1/2*(1-np.exp(-sigma_Pn**2))*np.sum(An**2)+1/2*np.sum(np.cos(2*(Pn+2*np.pi*np.sin(th[tt])*n*d))*((An**2+sigma_An**2)*np.exp(-2*sigma_Pn**2)-An**2*np.exp(-sigma_Pn**2)))
    sigma_I2=1/2*np.sum(An*sigma_An**2)+1/2*(1-np.exp(-sigma_Pn**2))*np.sum(An**2)-1/2*np.sum(np.cos(2*(Pn+2*np.pi*np.sin(th[tt])*n*d))*((An**2+sigma_An**2)*np.exp(-2*sigma_Pn**2)-An**2*np.exp(-sigma_Pn**2)))
    PDF=pdf2D(Vs, eR, sigma_R2, eI, sigma_I2)
    total = simps(PDF, dx=d_Vs)
    values = np.cumsum(PDF) * inc / total
    fmin[tt] = Vs[closest(values, 0.05)]
    fmax[tt] = Vs[closest(values, 0.95)]
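The hand-written trapezoidal rule in pdf2D (sum all rows, subtract half of the first and last row, multiply by the step) can be checked against NumPy's own implementation. The following is a small self-contained sketch; the integrand is an arbitrary smooth function chosen for illustration, not the PDF from the question, and the `getattr` line only picks whichever of `np.trapezoid` (newer NumPy) or `np.trapz` (older NumPy) is available.

```python
import numpy as np

alpha = np.linspace(0.0, 2.0 * np.pi, 200)
d_alpha = alpha[1] - alpha[0]          # constant step
Vs = np.linspace(0.0, 30.0, 50)

# Illustrative integrand sampled on an (alpha, V) grid, shape (200, 50)
y = np.exp(-np.outer(np.cos(alpha), Vs) ** 2 / 50.0)

# Trapezoidal rule along axis 0 for all columns at once
manual = (y.sum(axis=0) - 0.5 * (y[0] + y[-1])) * d_alpha

# Reference: NumPy's trapezoidal integration along the same axis
trapezoid = getattr(np, "trapezoid", None) or np.trapz
reference = trapezoid(y, dx=d_alpha, axis=0)

print(np.allclose(manual, reference))
```

Because the whole grid is integrated in one array expression, there is no Python-level loop over Vs, which is where the extra speedup reported above comes from.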
Note: most of the fmin and fmax values pass np.allclose() against the original function, but some of them show a small difference. After some digging, it turns out that the implementation here is more precise: the integrand (f2D) can be pretty abrupt, and more alpha points actually help (and more than compensate for the minuscule loss of precision from using the trapezoidal rule instead of Simpson's rule).
For example, at index tt=244, vv=400: [figure not preserved]
Of the several methods considered, the one that provides the largest time improvement is the Numba method. The method proposed by Pierre is very interesting and does not require installing other packages, which is an asset.
However, in the examples that I have computed, its time improvement is not as large as with the Numba version, especially when the number of points in th grows to a few tens of thousands (which is my actual case). I post the Numba code here in case someone is interested:
import numpy as np
from numba import njit

@njit
def margins(val_min,val_max):
    fmin=np.zeros(len(th))
    fmax=np.zeros(len(th))
    for tt in range(len(th)):
        eR=np.exp(-sigma_Pn**2/2)*np.sum(An*np.cos(Pn+2*np.pi*np.sin(th[tt])*n*d))
        eI=np.exp(-sigma_Pn**2/2)*np.sum(An*np.sin(Pn+2*np.pi*np.sin(th[tt])*n*d))
        sigma_R2=1/2*np.sum(An*sigma_An**2)+1/2*(1-np.exp(-sigma_Pn**2))*np.sum(An**2)+1/2*np.sum(np.cos(2*(Pn+2*np.pi*np.sin(th[tt])*n*d))*((An**2+sigma_An**2)*np.exp(-2*sigma_Pn**2)-An**2*np.exp(-sigma_Pn**2)))
        sigma_I2=1/2*np.sum(An*sigma_An**2)+1/2*(1-np.exp(-sigma_Pn**2))*np.sum(An**2)-1/2*np.sum(np.cos(2*(Pn+2*np.pi*np.sin(th[tt])*n*d))*((An**2+sigma_An**2)*np.exp(-2*sigma_Pn**2)-An**2*np.exp(-sigma_Pn**2)))
        Vs=np.linspace(0,30,1000)
        inc=np.max(Vs)/len(Vs)
        integration_points=200
        PDF=np.zeros(len(Vs))
        for vv in range(len(Vs)):
            PDF[vv]=np.trapz(1/np.sqrt(2*np.pi)/np.sqrt(sigma_R2)*np.exp(-(Vs[vv]*np.cos(np.linspace(0,2*np.pi,integration_points))-eR)**2/2/sigma_R2)*1/np.sqrt(2*np.pi)/np.sqrt(sigma_I2)*np.exp(-(Vs[vv]*np.sin(np.linspace(0,2*np.pi,integration_points))-eI)**2/2/sigma_I2),np.linspace(0,2*np.pi,integration_points))
        total=np.trapz(PDF,Vs)
        values=np.cumsum(PDF)*inc/total
        idx = (np.abs(values-val_min)).argmin()
        xval_05=values[idx]
        fmin[tt]=Vs[np.where(values==xval_05)[0][0]]
        idx = (np.abs(values-val_max)).argmin()
        xval_95=values[idx]
        fmax[tt]=Vs[np.where(values==xval_95)[0][0]]
    return fmin,fmax

N = 20
n=np.linspace(0,N-1,N)
d=1
sigma_An=1/2**6
sigma_Pn=2*np.pi/2**6
An=np.ones(N)
Pn=np.zeros(N)
th=np.linspace(0,np.pi/2,250)
R=np.sum(An*np.cos(Pn+2*np.pi*np.sin(th[:,np.newaxis])*n*d),axis=1)
I=np.sum(An*np.sin(Pn+2*np.pi*np.sin(th[:,np.newaxis])*n*d),axis=1)
F=R+1j*I
Fa=np.abs(F)/np.max(np.abs(F))
fmin, fmax = margins(0.05,0.95)

Numba CUDA speedup seems too low

Newbie starting with Numba/CUDA here.
I wrote this little test script to compare the speeds of @jit and @cuda.jit, just to get a feel for it. It calculates 10M steps of a logistic equation for 256 separate instances.
The CUDA part takes approximately 1.2 s to finish.
The CPU 'jitted' part finishes in close to 5 s (just one thread used on the CPU).
So there is a speedup of about 4x from going to the GPU (a dedicated GTX 1080 Ti not doing anything else). I expected the CUDA part, doing all 256 instances in parallel, to be much faster. What am I doing wrong?
Here is the working example:
#!/usr/bin/python3
#logistic equation on gpu/cpu comparison
import os,sys
#Set environment variables (needed for numba 0.42 to find lvvm)
os.environ['NUMBAPRO_NVVM'] = '/usr/lib/x86_64-linux-gnu/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/lib/nvidia-cuda-toolkit/libdevice/'
from time import time
from scipy import *
from numba import cuda, jit
from numba import int64,int32, float64

@cuda.jit
def logistic_cuda(array_in,array_out):
    pos = cuda.grid(1)
    x = array_in[pos]
    for k in range(10*1000*1000):
        x = 3.9 * x * (1 - x)
    array_out[pos] = x

@jit
def logistic_cpu(array_in,array_out):
    for pos,x in enumerate(array_in):
        for k in range(10*1000*1000):
            x = 3.9 * x * (1 - x)
        array_out[pos] = x

if __name__ == '__main__':
    N=256
    in_ary = random.uniform(low=0.2,high=0.9,size=N).astype('float32')
    out_ary = zeros(N,dtype='float32')
    t0 = time()
    #explicit copying. not really needed
    d_in_ary = cuda.to_device(in_ary)
    d_out_ary = cuda.to_device(out_ary)
    t1 = time()
    logistic_cuda[1,N](d_in_ary,d_out_ary)
    cuda.synchronize()
    t2 = time()
    out_ary = d_out_ary.copy_to_host()
    t3 = time()
    print(out_ary)
    print('Total time cuda: %g seconds.'%(t3-t0))
    out_ary2 = zeros(N)
    t4 = time()
    logistic_cpu(in_ary,out_ary2)
    t5 = time()
    print('Total time cpu: %g seconds.'%(t5-t4))
    print('\nDifference:')
    print(out_ary2-out_ary)
#Total time cuda: 1.19364 seconds.
#Total time cpu: 5.01788 seconds.
Thanks!
The problem likely comes from the very small amount of data and the loop dependency. Modern Nvidia GPUs can execute thousands of CUDA threads simultaneously (grouped in warps of 32 threads) thanks to their large number of CUDA cores. In your case, each thread computes one cell of array_out using a sequential loop, and that loop cannot be parallelised because each iteration depends on the previous one. However, there are only 256 cells. Thus, at most 256 threads (8 warps) can run simultaneously, which is only a tiny fraction of the number of simultaneous threads that your GPU should be able to manage. As a result, if you want a better speedup, you need to give the GPU more parallelism (for example by increasing the data size, i.e. computing many more instances at the same time).
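The size of that fraction can be estimated with plain arithmetic. The sketch below is not Numba code: `launch_config` and `resident_fraction` are helper names introduced here, and the hardware figures (28 streaming multiprocessors, up to 2048 resident threads each) are assumed values for a GTX 1080 Ti that should be checked against the device actually used.

```python
import math

# Assumed figures for a GTX 1080 Ti; verify them for your own device
SM_COUNT = 28
MAX_THREADS_PER_SM = 2048


def launch_config(n_items, threads_per_block=128):
    """Return (blocks, threads_per_block) covering n_items with one thread each."""
    blocks = math.ceil(n_items / threads_per_block)
    return blocks, threads_per_block


def resident_fraction(n_threads):
    """Fraction of the GPU's resident-thread capacity used by n_threads."""
    return n_threads / (SM_COUNT * MAX_THREADS_PER_SM)


if __name__ == "__main__":
    print(launch_config(256))                          # configuration for the 256 instances
    print(f"{resident_fraction(256):.2%} of capacity")
    print(launch_config(1_000_000, 256))               # a workload that can fill the GPU
```

Note also that the launch `logistic_cuda[1, N]` in the question uses a single block, and a block is executed by a single multiprocessor, so the rest of the GPU stays idle regardless of N. When the number of blocks times the block size exceeds the array length, the kernel needs a bounds check before indexing.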

Rewriting a for loop in pure NumPy to decrease execution time

I recently asked about trying to optimise a Python loop for a scientific application, and received an excellent, smart way of recoding it within NumPy which reduced execution time by a factor of around 100 for me!
However, calculation of the B value is actually nested within a few other loops, because it is evaluated at a regular grid of positions. Is there a similarly smart NumPy rewrite to shave time off this procedure?
I suspect the performance gain for this part would be less marked, and the disadvantages would presumably be that it would not be possible to report back to the user on the progress of the calculation, that the results could not be written to the output file until the end of the calculation, and possibly that doing this in one enormous step would have memory implications? Is it possible to circumvent any of these?
import numpy as np
import time

def reshape_vector(v):
    b = np.empty((3,1))
    for i in range(3):
        b[i][0] = v[i]
    return b

def unit_vectors(r):
    return r / np.sqrt((r*r).sum(0))

def calculate_dipole(mu, r_i, mom_i):
    relative = mu - r_i
    r_unit = unit_vectors(relative)
    A = 1e-7
    num = A*(3*np.sum(mom_i*r_unit, 0)*r_unit - mom_i)
    den = np.sqrt(np.sum(relative*relative, 0))**3
    B = np.sum(num/den, 1)
    return B

N = 20000 # number of dipoles
r_i = np.random.random((3,N)) # positions of dipoles
mom_i = np.random.random((3,N)) # moments of dipoles
a = np.random.random((3,3)) # three basis vectors for this crystal
n = [10,10,10] # points at which to evaluate sum
gamma_mu = 135.5 # a constant
t_start = time.perf_counter()
for i in range(n[0]):
    r_frac_x = float(i)/float(n[0])
    r_test_x = r_frac_x * a[0]
    for j in range(n[1]):
        r_frac_y = float(j)/float(n[1])
        r_test_y = r_frac_y * a[1]
        for k in range(n[2]):
            r_frac_z = float(k)/float(n[2])
            r_test = r_test_x +r_test_y + r_frac_z * a[2]
            r_test_fast = reshape_vector(r_test)
            B = calculate_dipole(r_test_fast, r_i, mom_i)
            omega = gamma_mu*np.sqrt(np.dot(B,B))
            # write r_test, B and omega to a file
    frac_done = float(i+1)/(n[0]+1)
    t_elapsed = (time.perf_counter()-t_start)
    t_remain = (1-frac_done)*t_elapsed/frac_done
    print(frac_done*100,'% done in',t_elapsed/60.,'minutes...approximately',t_remain/60.,'minutes remaining')
One obvious thing you can do is replace the line
r_test_fast = reshape_vector(r_test)
with
r_test_fast = r_test.reshape((3,1))
Probably won't make any big difference in performance, but in any case it makes sense to use the numpy builtins instead of reinventing the wheel.
Generally speaking, as you probably have noticed by now, the trick with optimizing numpy is to express the algorithm with the help of numpy whole-array operations or at least with slices instead of iterating over each element in python code. What tends to prevent this kind of "vectorization" is so-called loop-carried dependencies, i.e. loops where each iteration is dependent on the result of a previous iteration. Looking briefly at your code, you have no such thing, and it should be possible to vectorize your code just fine.
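The distinction can be made concrete with two small loops. This is an illustrative sketch with made-up function names: the first loop has independent iterations and collapses into a single whole-array expression, while the second is a recurrence in which every step needs the previous result, so only the work across independent instances can be vectorised.

```python
import numpy as np


def scale_loop(values, factor):
    """Element-wise loop: iteration i never reads the result of iteration i-1."""
    out = np.empty_like(values)
    for i in range(len(values)):
        out[i] = factor * values[i]
    return out


def scale_vectorized(values, factor):
    """Same computation as one whole-array operation."""
    return factor * values


def logistic_steps(x0, steps, r=3.9):
    """Recurrence x[k+1] = r*x[k]*(1-x[k]): the loop over steps is loop-carried.

    The step loop has to stay sequential, but each step is applied to all
    instances in x0 at once.
    """
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = r * x * (1.0 - x)
    return x


if __name__ == "__main__":
    values = np.arange(5, dtype=float)
    print(np.allclose(scale_loop(values, 2.0), scale_vectorized(values, 2.0)))
    print(logistic_steps([0.2, 0.5, 0.9], steps=10))
```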
EDIT: One solution
I haven't verified this is correct, but should give you an idea of how to approach it.
First, take the cartesian() function, which we'll use. Then
def calculate_dipole_vect(mus, r_i, mom_i):
    # Treat each mu sequentially
    Bs = []
    omega = []
    for mu in mus:
        rel = mu - r_i
        r_norm = np.sqrt((rel * rel).sum(1))
        r_unit = rel / r_norm[:, np.newaxis]
        A = 1e-7
        # Dot product of each moment with its unit vector, kept as a column for broadcasting
        m_dot_r = np.sum(mom_i * r_unit, 1)[:, np.newaxis]
        num = A*(3*m_dot_r*r_unit - mom_i)
        den = r_norm ** 3
        B = np.sum(num / den[:, np.newaxis], 0)
        Bs.append(B)
        omega.append(gamma_mu * np.sqrt(np.dot(B, B)))
    return Bs, omega

# Transpose to get more "natural" ordering with row-major numpy
r_i = r_i.T
mom_i = mom_i.T
t_start = time.perf_counter()
r_frac = cartesian((np.arange(n[0]) / float(n[0]),
                    np.arange(n[1]) / float(n[1]),
                    np.arange(n[2]) / float(n[2])))
r_test = np.dot(r_frac, a)
B, omega = calculate_dipole_vect(r_test, r_i, mom_i)
print('Total time for vectorized: %f s' % (time.perf_counter() - t_start))
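The cartesian() helper used in the code above is not defined in this answer. A minimal stand-in could look like the sketch below; it is an assumption about what that helper does (return every combination of the input values, one combination per row, with the last input varying fastest), not the original function.

```python
import numpy as np


def cartesian(arrays):
    """Return all combinations of the 1-D inputs as rows of a 2-D array.

    Assumed behaviour of the helper referenced above: the row order matches
    nested loops over the inputs, with the last input varying fastest.
    """
    grids = np.meshgrid(*arrays, indexing="ij")
    return np.stack(grids, axis=-1).reshape(-1, len(arrays))


if __name__ == "__main__":
    n = [2, 2, 3]
    r_frac = cartesian([np.arange(k) / float(k) for k in n])
    print(r_frac.shape)   # one row per grid point, three fractional coordinates
    print(r_frac[:4])
```

With this ordering, row index i*n[1]*n[2] + j*n[2] + k corresponds to the (i, j, k) iteration of the original triple loop, so results can still be written to the output file in the same order.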
Well, in my testing, this is in fact slightly slower than the loop-based approach I started from. The thing is, in the original version in the question, it was already vectorized with whole-array operations over arrays of shape (20000, 3), so any further vectorization doesn't really bring much further benefit. In fact, it may worsen the performance, as above, maybe due to big temporary arrays.
If you profile your code, you'll see that 99% of the running time is in calculate_dipole so reducing the time for this looping really won't give a noticeable reduction in execution time. You still need to focus on calculate_dipole if you want to make this faster. I tried my Cython code for calculate_dipole on this and got a reduction by about a factor of 2 in the overall time. There might be other ways to improve the Cython code too.
