Numba CUDA speedup seems too low - python

Newbie starting with Numba/cuda here.
I wrote this little test script to compare @jit and @cuda.jit speeds, just to get a feel for it. It calculates 10M steps of a logistic equation for 256 separate instances.
The cuda part takes approximately 1.2s to finish.
The cpu 'jitted' part finishes in close to 5s (just one thread used on the cpu).
So there is a speedup of about 4x from going to the GPU (a dedicated GTX 1080 Ti not doing anything else). I expected the cuda part, which does all 256 instances in parallel, to be much faster. What am I doing wrong?
Here is the working example:
#!/usr/bin/python3
# logistic equation on gpu/cpu comparison
import os, sys

# Set environment variables (needed for numba 0.42 to find NVVM)
os.environ['NUMBAPRO_NVVM'] = '/usr/lib/x86_64-linux-gnu/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/lib/nvidia-cuda-toolkit/libdevice/'

from time import time
from scipy import *
from numba import cuda, jit
from numba import int64, int32, float64

@cuda.jit
def logistic_cuda(array_in, array_out):
    pos = cuda.grid(1)
    x = array_in[pos]
    for k in range(10*1000*1000):
        x = 3.9 * x * (1 - x)
    array_out[pos] = x

@jit
def logistic_cpu(array_in, array_out):
    for pos, x in enumerate(array_in):
        for k in range(10*1000*1000):
            x = 3.9 * x * (1 - x)
        array_out[pos] = x

if __name__ == '__main__':
    N = 256
    in_ary = random.uniform(low=0.2, high=0.9, size=N).astype('float32')
    out_ary = zeros(N, dtype='float32')

    t0 = time()
    # explicit copying, not really needed
    d_in_ary = cuda.to_device(in_ary)
    d_out_ary = cuda.to_device(out_ary)
    t1 = time()
    logistic_cuda[1, N](d_in_ary, d_out_ary)
    cuda.synchronize()
    t2 = time()
    out_ary = d_out_ary.copy_to_host()
    t3 = time()
    print(out_ary)
    print('Total time cuda: %g seconds.' % (t3-t0))

    out_ary2 = zeros(N)
    t4 = time()
    logistic_cpu(in_ary, out_ary2)
    t5 = time()
    print('Total time cpu: %g seconds.' % (t5-t4))
    print('\nDifference:')
    print(out_ary2 - out_ary)

# Total time cuda: 1.19364 seconds.
# Total time cpu: 5.01788 seconds.
Thanks!

The problem likely comes from the very small amount of data and the loop dependency. Modern Nvidia GPUs can execute thousands of CUDA threads simultaneously (packed in warps of 32 threads) thanks to their large number of CUDA cores. In your case, each thread runs a sequential loop to compute one cell of array_out, and there are only 256 cells. Thus, at most 256 threads (8 warps) can run simultaneously - only a tiny fraction of the number of simultaneous threads your GPU is able to manage. As a result, if you want a better speed-up, you need to give the GPU more parallelism, for example by increasing the data size or by computing many more independent instances at the same time.
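For illustration, here is a minimal sketch (not the original benchmark) of what giving the GPU more work could look like: the same kernel launched over a much larger input, with a block/grid configuration chosen to cover it. The value of N, the block size and the bounds check are my own additions.

import numpy as np
from numba import cuda

@cuda.jit
def logistic_cuda(array_in, array_out):
    pos = cuda.grid(1)
    if pos < array_in.size:  # guard, since the grid may be larger than N
        x = array_in[pos]
        for k in range(10*1000*1000):
            x = 3.9 * x * (1 - x)
        array_out[pos] = x

N = 256 * 1024                        # many more independent instances
in_ary = np.random.uniform(0.2, 0.9, N).astype('float32')
out_ary = np.zeros(N, dtype='float32')

threads_per_block = 256
blocks = (N + threads_per_block - 1) // threads_per_block
logistic_cuda[blocks, threads_per_block](in_ary, out_ary)
cuda.synchronize()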

Related

Can I use PyOpenCL in integration with Scipy to perform Differential Evolution in parallel with GPU?

I got my code for simulating a multivariate regression model to work using differential evolution, and the multiprocessing option helps to reduce the runtime. However, with 7 independent variables of 10 values each, and matrix operations on 21 matrices of 100+ elements, it takes quite a while even on 24 cores.
I don't have much experience combining multiprocessing with PyOpenCL, so I wanted to ask whether it's worth trying to integrate the two to work on the GPU. I've attached a code snippet with 3 variables and 3 values for reference:
import scipy.optimize as op
import numpy as np

def func(vars, *args):
    res = []
    x = []
    for i in args[1:]:
        if len(res) + 1 > len(args)//2:
            x.append(i)
            continue
        res.append(np.array(i).T)
    f1 = 0
    for i in range(len(x[0])):
        for j in range(len(x[1])):
            diff = (vars[0]*x[0][i] + vars[1])*(vars[2]*x[1][j]*x[1][j] + vars[3]*x[1][j] + vars[4])*(vars[5]*50*50 + vars[6]*50 + vars[7])
            f1 = f1 + abs(res[0][i][j] - diff)  # ID-Pitch
    f2 = 0
    for i in range(len(x[0])):
        for j in range(len(x[2])):
            diff = (vars[0]*x[0][i] + vars[1])*(vars[5]*x[2][j]*x[2][j] + vars[6]*x[2][j] + vars[7])*(vars[2]*10*10 + vars[3]*10 + vars[4])
            f2 = f2 + abs(res[1][i][j] - diff)  # ID-Depth
    f3 = 0
    for i in range(len(x[1])):
        for j in range(len(x[2])):
            diff = (vars[2]*x[1][i]*x[1][i] + vars[3]*x[1][i] + vars[4])*(vars[5]*x[2][j]*x[2][j] + vars[6]*x[2][j] + vars[7])*(vars[0]*3.860424005 + vars[1])
            f3 = f3 + abs(res[2][i][j] - diff)  # Pitch-Depth
    return f1 + f2 + f3

def main():
    res1 = [[134.3213274,104.8030828,75.28483813],[151.3351445,118.07797,84.82079556],[135.8343927,105.9836392,76.1328857]]
    res2 = [[131.0645086,109.1574174,91.1952225],[54.74920444,30.31300092,17.36537062],[51.8931954,26.45139822,17.28693162]]
    res3 = [[131.0645086,141.2210331,133.3192429],[54.74920444,61.75898314,56.52756593],[51.8931954,52.8191817,52.66531712]]
    x1 = np.array([3.860424005,7.72084801,11.58127201])
    x2 = np.array([10,20,30])
    x3 = np.array([50,300,500])
    interval = (-20,20)
    bds = [interval,interval,interval,interval,interval,interval,interval,interval]
    res = op.differential_evolution(func, bounds=bds, workers=-1, maxiter=100000, tol=0.01, popsize=15, args=([1,2,2], res1, res2, res3, x1, x2, x3))
    print(res)

if __name__ == '__main__':
    main()
Firstly, yes, it's possible: func can be a function that sends the data to the GPU, waits for the computations to finish, transfers the results back to RAM and returns them to scipy.
However, moving computations from the CPU to the GPU is not always beneficial, because of the time required to transfer data back and forth. With a moderate laptop GPU, for example, you may not get any speedup at all, and your code might even be slower. Reducing data transfer between the GPU and RAM can make the GPU 2-4 times faster than an average CPU, but your code requires data transfer, so that won't be possible.
For powerful GPUs with high bandwidth (things like an RTX 2070 or RTX 3070, or APUs) you can expect faster computations, so the GPU will be a few times faster than the CPU even with the data transfer, but it depends on the implementation of both the CPU and GPU code.
Lastly, your code can be sped up without using the GPU at all, which is likely the first thing you should try before going for GPU computations. Compilers like Cython and Numba can speed up your code by almost 100 times with little effort and without major modifications, but you should convert it to use only fixed-size preallocated NumPy arrays rather than lists: the code will be much faster, you can even release the GIL and run it multithreaded, and both tools ship good multithreaded looping constructs.
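As a rough sketch of that last point (not a drop-in replacement for func), here is what one of its nested loops could look like once it operates on preallocated NumPy arrays and is compiled with Numba. The function name, the parallel=True flag and the dummy data are illustrative choices, not part of the original code.

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def partial_objective(vars, res, xa, xb, c):
    # one of the three double loops from func, compiled to machine code
    total = 0.0
    for i in prange(xa.shape[0]):
        for j in range(xb.shape[0]):
            diff = (vars[0]*xa[i] + vars[1]) \
                   * (vars[2]*xb[j]*xb[j] + vars[3]*xb[j] + vars[4]) \
                   * (vars[5]*c*c + vars[6]*c + vars[7])
            total += abs(res[i, j] - diff)
    return total

# example call with dummy data of the right shapes
vars = np.zeros(8)
res = np.random.rand(3, 3)
xa = np.array([3.86, 7.72, 11.58])
xb = np.array([10.0, 20.0, 30.0])
print(partial_objective(vars, res, xa, xb, 50.0))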

GPU and `jax` performance mysteries

I have been playing with jax lately, and it is very impressive, but then the following set of experiments confused me greatly:
First, we set up the timer utility:
import time
def timefunc(foo, *args):
tic = time.perf_counter()
tmp = foo(*args)
toc = time.perf_counter()
print(toc - tic)
return tmp
Now, let’s see what happens when we compute the eigenvalues of a random symmetric matrix (jnp is jax.numpy, so the eigh is done on the GPU):
def jfunc(n):
    tmp = np.random.randn(n, n)
    return jnp.linalg.eigh(tmp + tmp.T)

def nfunc(n):
    tmp = np.random.randn(n, n)
    return np.linalg.eigh(tmp + tmp.T)
Now for the timings (the machine is an Nvidia DGX box, so the GPU is an A100, while the CPUs are some AMD EPYC2 parts):
>>> e1 = timefunc(nfunc, 10)
0.0002442029945086688
>>> e2 = timefunc(jfunc, 10)
0.013523647998226807
>>> e1 = timefunc(nfunc, 100)
0.11742364699603058
>>> e2 = timefunc(jfunc, 100)
0.11005625998950563
>>> e1 = timefunc(nfunc, 1000)
0.6572738009999739
>>> e2 = timefunc(jfunc, 1000)
0.5530761769914534
>>> e1 = timefunc(nfunc, 10000)
36.22587636699609
>>> e2 = timefunc(jfunc, 10000)
8.867857075005304
You will notice that the crossover is somewhere around 1000. Initially, I thought this was because of the overhead of moving stuff to/from the GPU, but if you define yet another function:
def jjfunc(n):
    key = jax.random.PRNGKey(0)
    tmp = jax.random.normal(key, [n, n])
    return jnp.linalg.eigh(tmp + tmp.T)
>>> e1=timefunc(jjfunc, 10)
0.01886096798989456
>>> e1=timefunc(jjfunc, 100)
0.2756766739912564
>>> e1=timefunc(jjfunc, 1000)
0.7205733209993923
>>> e1=timefunc(jjfunc, 10000)
6.8624101399909705
Note that the small examples are actually (much) slower than moving the numpy array to the GPU and back.
So, my question is: what is going on, and is there a silver bullet? Is this a jax implementation bug?
I don't think your timings are reflective of actual JAX vs. numpy performance, for a few reasons:
1. JAX's computation model uses Asynchronous Dispatch, which means that JAX operations return before the computation is finished. As mentioned at that link, you can use the block_until_ready() method to ensure you are timing the computation rather than the dispatch.
2. Because operations like eigh are JIT-compiled by default, the first time you run them for a given size will incur the one-time compilation cost. Subsequent runs will be faster because JAX caches previous compilations.
3. Your computations are indeed being foiled by device transfer costs. It's easiest to see if you measure it directly:
def transfer(n):
    tmp = np.random.randn(n, n)
    return jnp.array(tmp).block_until_ready()

timefunc(transfer, 10000);
# 4.600406924000026
4. Your jjfunc combines the eigh call with the jax.random.normal call. The latter is slower than numpy's random number generation, and I believe it dominates the difference for small n.
5. Unrelated to JAX: in general, using time.time for profiling Python code can give you misleading results. Modules like timeit are much better for this kind of thing, particularly when you're dealing with microbenchmarks that complete in fractions of a second.
If you're interested in accurate benchmarks of JAX vs. NumPy versions of algorithms, I'd suggest isolating exactly the operations you're interested in benchmarking (i.e. generate the data and do any device transfer outside the benchmarks). Read up on the advice in Asynchronous Dispatch in JAX as it relates to benchmarking, and check out Python's timeit docs for tips on getting accurate timings of small code snippets (though I find the %timeit magic more convenient when working in IPython or a Jupyter notebook).
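Putting those points together, here is a minimal sketch of such a benchmark for eigh: data is generated and moved to the device outside the timed region, a warm-up run absorbs the JIT compilation cost, and block_until_ready() forces the asynchronous dispatch to finish before the timer stops. The matrix size and repeat count are arbitrary choices.

import timeit
import numpy as np
import jax.numpy as jnp

n = 1000
tmp = np.random.randn(n, n)
a_cpu = tmp + tmp.T
a_gpu = jnp.array(a_cpu)                      # transfer once, outside the timing

# warm-up run so JIT compilation is not included in the measurement
jnp.linalg.eigh(a_gpu)[0].block_until_ready()

t_np = timeit.timeit(lambda: np.linalg.eigh(a_cpu), number=5)
t_jax = timeit.timeit(
    lambda: jnp.linalg.eigh(a_gpu)[0].block_until_ready(), number=5)
print(f"numpy eigh: {t_np/5:.4f} s   jax eigh: {t_jax/5:.4f} s")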

Speed up the differental evolution algorithm with thousands of parameters

I am trying to build a lumped rainfall-runoff balance model with a lot of parameters (from 37 to 1099) in Python. As input it receives daily rainfall and temperature data, and it outputs daily flows.
I am stuck on the optimisation method for the model's calibration. I chose the differential evolution algorithm because it is easy to use and can be applied to this kind of problem. The algorithm I wrote works well and seems to minimise the objective function (the Nash–Sutcliffe model efficiency, NSE). The problem starts with a higher number of parameters, which significantly slows down the whole algorithm.
The DE algorithm I wrote:
import numpy as np
import flow  # a python file from which I get observed daily flows as a np.array

def differential_evolution(func, bounds, popsize=10, mutate=0.8, CR=0.85, maxiter=50):
    #--- INITIALIZE THE FIRST POPULATION WITHIN THE BOUNDS -------------------+
    bounds = [(0, 250)] * 1 + [(0, 5)] * 366 + [(0, 2)] * 366 + [(0, 100)] * 366
    dim = len(bounds)
    pop_norm = np.random.rand(popsize, dim)
    min_bound, max_bound = np.asarray(bounds).T
    difference = np.fabs(min_bound - max_bound)
    population = min_bound + pop_norm * difference
    # Computed value of objective function for initial population
    fitness = np.asarray([func(x, flow.l_flow) for x in population])
    best_idx = np.argmin(fitness)
    best = population[best_idx]
    #--- MUTATION ------------------------------------------------------------+
    # This is the part which takes too much time to complete
    for i in range(maxiter):
        print('Generation: ', i)
        for j in range(popsize):
            # Random selection of three individuals to make a noise vector
            idxs = list(range(0, popsize))
            idxs.remove(j)
            x_1, x_2, x_3 = pop_norm[np.random.choice(idxs, 3, replace=True)]
            noice_vector = np.clip(x_1 + mutate * (x_2 - x_3), 0, 1)
            #--- RECOMBINATION -------------------------------------------------+
            cross_points = np.random.rand(dim) < CR
            if not np.any(cross_points):
                cross_points[np.random.randint(0, dim)] = True
            trial_vector_norm = np.where(cross_points, noice_vector, pop_norm[j])
            trial_vector = min_bound + trial_vector_norm * difference
            crit = func(trial_vector, flow.l_flow)
            # Check for better fitness of objective function
            if crit < fitness[j]:
                fitness[j] = crit
                pop_norm[j] = trial_vector_norm
                if crit < fitness[best_idx]:
                    best_idx = j
                    best = trial_vector
    return best, fitness[best_idx]
The rainfall-runoff model itself is a function which works basically on lists and iterates over each row via a for loop to compute daily flows with a simple equation.
The objective function NSE is vectorised by numpy arrays:
import numpy as np
import model  # a python file where the rainfall-runoff model function is defined

def nse_min(parameters, observations):
    # Modeled flows from model function
    Q_modeled = np.array(model.model(parameters))
    # Computation of the NSE fraction
    numerator = np.subtract(observations, Q_modeled) ** 2
    denominator = np.subtract(observations, np.sum(observations)/len(observations)) ** 2
    return np.sum(numerator) / np.sum(denominator)
Is there any chance to speed it up? I found out about the numba library, which compiles Python code to machine code and lets you compute more efficiently on the CPU, or on the GPU using CUDA cores. But I do not study anything related to IT and have no idea how the CPU/GPU works at that level, so I do not know how to use numba properly. Can anyone help me with it? Or can anyone suggest a different optimisation method?
What I use:
Python 3.7.0 64-bit,
Windows 10 Home x64,
Intel Core(TM) i7-7700HQ CPU @ 2.80 GHz,
NVIDIA GeForce GTX 1050 Ti 4GB GDDR5,
16 GB RAM DDR4.
I am a Python beginner who studies water management and sometimes uses Python just for scripts that make my life easier in data processing. Thank you for your help in advance.
You can use the Python multiprocessing library. It simply spawns more processes to run your function.
You can use it like this:
from multiprocessing import Process

def f(name):
    print('hello', name)

if __name__ == '__main__':
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()
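A sketch of how that idea could plug into the DE loop above: the expensive nse_min evaluations for a whole generation are handed to a pool of worker processes instead of being computed one by one. The helper name, the use of functools.partial and the pool size are assumptions, not part of the original code.

from functools import partial
from multiprocessing import Pool

def evaluate_trials(trial_vectors, observations, processes=4):
    # evaluate a whole generation of trial vectors in parallel
    objective = partial(nse_min, observations=observations)
    with Pool(processes=processes) as pool:
        return pool.map(objective, trial_vectors)

# usage inside differential_evolution, once per generation:
# fitness_trials = evaluate_trials(list_of_trial_vectors, flow.l_flow)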

How to achieve a faster convolve2d using GPU

I was recently learning PyCuda and planning to replace part of a camera system's code to speed up image processing. That part originally used cv2.filter2D. My intention is to accelerate the processing with a GPU.
Time for signal.convolve2d: 1.6639747619628906
Time for cusignal.convolve2d: 0.6955723762512207
Time for cv2.filter2D: 0.18787837028503418
However, it seems that cv2.filter2D is still the fastest of the three. If the input is a long list of images, could a custom PyCUDA kernel outperform cv2.filter2D?
import time
import cv2
from cusignal.test.utils import array_equal
import cusignal
import cupy as cp
import numpy as np
from scipy import signal
from scipy import misc

ascent = misc.ascent()
ascent = np.array(ascent, dtype='int16')
ascentList = [ascent]*100
filterSize = 3
scharr = np.ones((filterSize, filterSize), dtype="float") * (1.0 / (filterSize*filterSize))

startTime = time.time()
for asc in ascentList:
    grad = signal.convolve2d(asc, scharr, boundary='symm', mode='same')
endTime = time.time()
print("Time for signal.convolve2d: "+str(endTime - startTime))

startTime = time.time()
for asc in ascentList:
    gpu_convolve2d = cp.asnumpy(cusignal.convolve2d(cp.asarray(asc), scharr, boundary='symm', mode='same'))
endTime = time.time()
print("Time for cusignal.convolve2d: "+str(endTime - startTime))
print("If signal equal to cusignal: "+ str(array_equal(grad, gpu_convolve2d)))

startTime = time.time()
for asc in ascentList:
    opencvOutput = cv2.filter2D(asc, -1, scharr)
endTime = time.time()
print("Time for cv2.filter2D: "+str(endTime - startTime))
print("If cv2 equal to cusignal: "+ str(array_equal(opencvOutput, gpu_convolve2d)))
In your timing analysis of the GPU, you are timing the time to copy asc to the GPU, execute convolve2d, and transfer the answer back. Transfers to and from the GPU are very slow in the scheme of things. If you want a true comparison of the compute just profile convolve2d.
Currently the cuSignal.convolve2d is written in Numba. We are in the process of porting this to use CuPy Raw Kernels, and there will be an improvement. I don't have an ETA on convolve2d.
It looks like there might be an OpenCV CUDA version: https://github.com/opencv/opencv_contrib/blob/master/modules/cudafilters/src/cuda/filter2d.cu
Have you tried scipy.ndimage.filters.convolve - http://blog.rtwilson.com/convolution-in-python-which-function-to-use/
Also, checkout CuPy's convolve - https://github.com/cupy/cupy/blob/master/cupyx/scipy/ndimage/filters.py
Now to your original question. When trying to determine if the GPU will be faster than the CPU, you need to ensure there is enough work to keep the GPU busy. It is known that in some cases, where the data size is small, the CPU will perform faster.
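A sketch of what timing only the compute could look like, along the lines of the first point above: the image and filter are moved to the GPU once, a warm-up call is made, and the device is synchronized before the timer stops. The warm-up call, repeat count and use of the null stream are my additions.

import time
import cupy as cp
import cusignal
import numpy as np
from scipy import misc

ascent = np.array(misc.ascent(), dtype='int16')
scharr = np.ones((3, 3), dtype='float') / 9.0

d_ascent = cp.asarray(ascent)          # one-time host-to-device transfer
d_scharr = cp.asarray(scharr)

cusignal.convolve2d(d_ascent, d_scharr, boundary='symm', mode='same')  # warm-up

start = time.time()
for _ in range(100):
    d_out = cusignal.convolve2d(d_ascent, d_scharr, boundary='symm', mode='same')
cp.cuda.Stream.null.synchronize()      # wait for the GPU before reading the clock
print("GPU compute only:", time.time() - start)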

Memory consumption when using multiprocessing in python

I am using python's multiprocessing module to launch some Monte-Carlo simulation in order to speed up the computation. The code I have looks like this:
def main():
    # (various parameters are being set up...)
    start = 0
    end = 10
    count = int(1e4)
    time = np.linspace(start, end, num=count)
    num_procs = 12
    mc_iterations_per_proc = int(1e5)
    mc_iterations = num_procs * mc_iterations_per_proc
    mean_estimate, mean_estimate_variance = np.zeros(count), np.zeros(count)
    pool = multiprocessing.Pool(num_procs)
    for index, (estimate, estimate_variance) in enumerate(pool.imap_unordered(
            mc_linear_estimate,
            ((disorder_mean, intensity, wiener_std, time) for index in xrange(mc_iterations)),
            chunksize=mc_iterations_per_proc)):
        delta = estimate - mean_estimate
        mean_estimate = mean_estimate + delta / float(index + 1)
        mean_estimate_variance = mean_estimate_variance + delta * (estimate - mean_estimate)
    mean_estimate_variance = mean_estimate_variance / float(index)
Ok, now mc_linear_estimate is a function taking *args and creating additional variables inside it. It looks like this:
def mc_linear_estimate(*args):
    disorder_mean, intensity, wiener_std, time = args[0]
    theta_process = source_process(time, intensity, disorder_mean)
    xi_process = observed_process(time, theta_process, wiener_std)
    gamma = error_estimate(time, intensity, wiener_std, disorder_mean)
    estimate = signal_estimate(time, intensity, wiener_std, disorder_mean, gamma, xi_process)
    estimate_variance = (estimate - theta_process) ** 2
    return estimate, estimate_variance
As you can see, the number of iterations is pretty large (1.2M) and each of the arrays holds 10K doubles, so I use Welford's algorithm to compute the mean and variance, because it does not require storing every element of the considered sequences in memory. However, this does not help.
The problem: I run out of memory. When I launch the application, 12 processes emerge (as seen with the top program on my Linux machine). They instantly start consuming a lot of memory, but as the Linux machine I'm using has 49G of RAM, things are OK for some time. Then, as each of the processes grows to around 4G of RAM, one of them fails and shows as <defunct> in top. Then another process drops off, and this continues until only one process is left, which ultimately fails with an "Out of memory" exception.
The questions:
What am I possibly doing wrong?
How could I improve the code so that it wouldn't consume all the memory?
