I tried this code both in normal mode and with Numba, but both runs completed in about 13 seconds and Numba added no speedup.
How can I set up Numba for this situation?
import numpy as np
from numba import jit, cuda

a = []

@jit(target_backend="cuda")
def func():
    for i in range(100000):
        a.append(i)
    return a

print(func())
CUDA cannot be used here because this code cannot run on the GPU (and even if it were modified so that it could, it would be inefficient). GPUs are very different from CPUs and they are also programmed differently. To understand why, please read the CUDA programming guide.
If you just run the code and read the warnings from Numba, you can see that the code falls back to a basic Python implementation:
Compilation is falling back to object mode WITH looplifting enabled because Function "func" failed type inference due to: Untyped global name 'a': Cannot type empty list
The reason is that the type of a is not provided and Numba fails to infer it. Numba is fast because of its typing system, which enables it to compile the code to a native binary.
Additionally, you should not modify global variables. This is not a good idea in terms of software engineering, and Numba does not support it anyway.
Thus, you need to use a typed list returned from the function. Note that typed lists are not much faster than Python lists when read from or written to CPython, because Numba has to convert them from/to CPython lists, which is an expensive operation. Still, on my machine this is about 3 times faster.
Corrected code:
import numpy as np
from numba import jit, cuda

@jit
def func():
    a = []
    for i in range(100000):
        a.append(i)
    return a

func()       # Compile the function during the first run
a = func()   # Execute the compiled code quickly
print(a)     # Printing is slow
For more information about lazy compilation, please read the Numba documentation.
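As a side note, if returning a NumPy array instead of a list is acceptable (an assumption on my part; it changes the return type of the original function), the list conversion cost disappears entirely. A minimal sketch:

import numpy as np
from numba import njit

@njit  # nopython compilation; np.arange is supported by Numba
def func():
    # Build the sequence directly as a native array: no CPython list conversion
    return np.arange(100000)

func()       # compile on the first call
a = func()   # later calls run the native binary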
I enjoy using a lot of functional-programming features when playing with Python lists. When I switch to NumPy for big datasets, I would expect it to be significantly more efficient than native Python list operations over ndarray.tolist(), since the data is stored differently.
So when I try to apply map, reduce, filter and similar FP things to a NumPy array, I first search NumPy's docs for some "optimized things". What I found is numpy.ufunc.reduce, which seems to be the right thing. However, out of curiosity, I did a simple test of both approaches:
Use Numpy reduce
import numpy as np
a = np.array(range(100000000))
adf = lambda res, a: res + a
u_adf = np.frompyfunc(adf, 2, 1)
print(u_adf.reduce(a, initial=0))
Use ndarray.tolist() and then use Python native reduce
import numpy as np
from functools import reduce
a = np.array(range(100000000))
adf = lambda res, a: res + a
print(reduce(adf, a.tolist(), 0))
Here comes the most unexpected thing:
> python 1.py
4999999950000000
python 1.py 28.00s user 5.71s system 102% cpu 32.925 total
> python 2.py
4999999950000000
python 2.py 26.38s user 6.38s system 103% cpu 31.792 total
The so-called "stupid" approach is actually the more efficient way?
How can that be? Can anyone please explain this for me? And hopefully give some advice on using functional-programming features on NumPy arrays.
Appreciate ^_^
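For reference, np.frompyfunc does not compile anything: the resulting ufunc still calls the Python lambda once per element, so both scripts above execute roughly 10^8 interpreted additions. The reduction NumPy actually optimizes is the one done by a built-in, C-implemented ufunc. A minimal sketch, assuming plain summation is the goal:

import numpy as np

a = np.arange(100000000)  # same values as np.array(range(...)), built without a Python loop
print(np.add.reduce(a))   # C loop over the buffer; equivalent to a.sum()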
I am kind of new to the multiprocessing library in Python. I have watched a couple of lectures on YouTube and have implemented basic instances of Process from multiprocessing. My problem is that I want to speed up my code by utilizing all 8 available cores. I asked a similar question in this regard a couple of days back but couldn't elicit a reasonable answer. I went back and did thorough homework, and now, after having understood the basics of Pool (among other things in multiprocessing), I am stuck in a peculiar situation. I went through similar problems posed here on SO, for instance this one (Python multiprocessing pool map with multiple arguments) and (Python multiprocessing pool.map for multiple arguments). I even implemented the solution mentioned in the former. However, my code is still throwing this error -
File "minimal_example_so.py", line 84, in <module>
X,u,t=pool.starmap(solver,[args])
ValueError: not enough values to unpack (expected 3, got 1)
My code looks something like this-
import time
ti=time.time()
import matplotlib.pyplot as plt
from scipy.integrate import ode
import numpy as np
from numpy import sin,cos,tan,zeros,exp,tanh,dot,array
from matplotlib import rc
import itertools
from const_for_drdo_mod4 import *
plt.style.use('bmh')
import mpltex
from functools import reduce
from multiprocessing import Pool
import multiprocessing
linestyles=mpltex.linestyle_generator()
def zetta(x, spr, c):
    num = len(x)*len(c)
    Mu = [[] for i in range(len(x))]
    for i in range(len(x)):
        Mu[i] = np.zeros(len(c))
    m = []
    for i in range(len(x)):
        for j in range(len(c)):
            Mu[i][j] = exp(-.5*((x[i]-c[j])/spr)**2)
    b = list(itertools.product(*Mu))
    for i in range(len(b)):
        m.append(reduce(lambda x, y: x*y, b[i]))
    m = np.array(m)
    S = np.sum(m)
    return m/S

def f(t, Y, a, b, spr, tim, so, k, K, C):
    x1, x2 = Y[0], Y[1]
    e = x1-2
    de = -2*x1+a*x2+b*sin(x1)
    s = de+2*e
    xx = [e, de]
    sold, ti = so, tim
    #import pdb;pdb.set_trace()
    theta = Y[2:2+len(C)**len(xx)]
    Z = zetta(xx, spr, C)
    u = dot(Z, theta)
    Z1 = list(Z)
    dt = time.time()-ti
    ti = time.time()
    sodt = (s-sold)/dt
    x1dot = de
    x2dot = -x2*cos(x1)+cos(2*x1)*u
    xdot = [x1dot, x2dot]
    thetadot = [-20*number*(sodt+k*s+K*tanh(s))-20*.1*number2 for number, number2 in zip(Z1, theta)]
    sold = s
    ydot = xdot+thetadot
    return [ydot, u]

def solver(t0, y0, t1, dt, a, b, spr, tim, so, k, K, C):
    num = 2
    x, t = [[] for i in range(2+len(C)**num)], []
    u = []
    r = ode(lambda t, y, a, b, spr, tim, so, k, K, C: f(t, y, a, b, spr, tim, so, k, K, C)[0]).set_integrator('dopri5', method='bdf')
    r.set_initial_value(y0, t0).set_f_params(a, b, spr, tim, so, k, K, C)
    while r.successful() and r.t < t1:
        r.integrate(r.t+dt)
        for i in range(2+len(C)**num):
            x[i].append(r.y[i])
        u.append(f(r.t, r.y, a, b, spr, tim, so, k, K, C)[1])
        t.append(r.t)
    return x, u, t

if __name__ == '__main__':
    spr, C = 1.5, [-3, -1.5, 0, 1.5, 3]
    num = 2
    k, K = 2, 5
    tim, so = 0, 0
    a, b = 1, 2
    y0, T = [0.1, 0], 100
    x1 = [0 for i in range(len(C)**num)]
    x0 = y0+x1
    args = (0, x0, T, 1e-2, a, b, spr, tim, so, k, K, C)
    pool = multiprocessing.Pool(3)
    X, u, t = pool.starmap(solver, [args])
    #X,u,t=solver(0,x0,T,1e-2,a,b,spr,tim,so,k,K,C)
    nam = ["x1", "x2"]
    pool.close()
    pool.join()
    plt.figure(1)
    for i in range(len(X[0:2])):
        plt.plot(t, X[i], label=nam[i])
    plt.legend(loc='upper right')
    plt.figure(2)
    for i in range(len(X[2:])):
        plt.plot(t, X[i])
    plt.figure(3)
    plt.plot(t, u)
    plt.show()
Here I wish to mention that all my arguments are either plain numbers or lists/arrays of numbers which need to be passed to the solver method. I tried a couple of ways to pass my arguments to the pool using map and even starmap, but all of them were in vain. Kindly help, thanks in advance.
PS - my code works absolutely fine without pool.
You're calling starmap with a list of just one tuple of arguments. The return value will therefore also be a list containing one element - the tuple returned by one call to solver. So you're effectively saying
X, u, t = [(x1, u1, t1)]
which is why you get the exception you're getting: you can't unpack one value (the returned tuple) into three variables. If you want to use starmap here you'll need to do something like:
[(X,u,t)] = pool.starmap(solver,[args])
instead. But for only one set of arguments it makes more sense to use apply, since it's designed for a single invocation:
X,u,t = pool.apply(solver, args)
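For completeness, starmap starts to pay off once there are several argument tuples to distribute across workers, e.g. a parameter sweep. A minimal sketch reusing the variables from the question's __main__ block (the swept values of a here are hypothetical):

from multiprocessing import Pool

# inside if __name__ == '__main__':, after x0, T, b, spr, tim, so, k, K, C are defined
arg_list = [(0, x0, T, 1e-2, a_i, b, spr, tim, so, k, K, C) for a_i in (0.5, 1.0, 2.0)]
with Pool(3) as pool:
    results = pool.starmap(solver, arg_list)  # one (X, u, t) tuple per argument set
for X, u, t in results:
    pass  # post-process each run here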
I'm trying to implement a transform which at some stage uses a lookup table < 1K in size. This doesn't seem like it should pose a problem for a modern graphics card.
But the code below is failing with an unknown error:
from numba import cuda, vectorize
import numpy as np

tmp = np.random.uniform(0, 100, 1000000).astype(np.int16)
tmp_device = cuda.to_device(tmp)

lut = np.arange(100).astype(np.float32) * 2.5
lut_device = cuda.to_device(lut)

@cuda.jit(device=True)
def lookup(x):
    return lut[x]

@vectorize("float32(int16)", target="cuda")
def test_lookup(x):
    return lookup(x)

test_lookup(tmp_device).copy_to_host()  # <-- fails with cuMemAlloc returning UNKNOWN_CUDA_ERROR
What am I doing against the spirit of numba.cuda?
Even replacing lookup with the following simplified code results in the same error:
@cuda.jit(device=True)
def lookup(x):
    return x + lut[1]
Once this error occurs, I am essentially no longer able to utilize the cuda context at all. For instance, allocating a new array via cuda.to_device results in a:
numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemAlloc results in UNKNOWN_CUDA_ERROR
Running on: 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04)
Driver Version: 390.25
numba: 0.33.0
The above code is fixed by modifying the device function so that it copies the lookup table into constant memory (the changed lines are inside lookup):

@cuda.jit(device=True)
def lookup(x):
    lut_device = cuda.const.array_like(lut)  # place a copy of the host array in GPU constant memory
    return lut_device[x]
I ran multiple variations of the code, including simply touching the lookup table from within this kernel but not using its output. This, combined with @talonmies' assertion that UNKNOWN_CUDA_ERROR usually occurs with invalid instructions, made me think that perhaps a shared memory constraint was causing the issue.
The above change makes the whole thing work. However, I still don't understand why in any profound way.
If anyone knows and understands why, please feel free to contribute to this answer.
I am trying to solve a linear system using Numba with GPU processing via CUDA.
I have installed all the relevant packages and tested the setup, so it seems that my GPU, CUDA, etc. are configured properly.
My code is:
import numpy as np
import time
from numba import vectorize, cuda

@vectorize(['float64(float64, float64)'], target='cuda')
def solver(A, b):
    return np.linalg.solve(A, b)

def main():
    A = np.random.rand(100, 100).astype(np.float64)
    b = np.random.rand(100, 1).astype(np.float64)

    start = time.time()
    C = solver(A, b)
    vector_add_time = time.time() - start

    print("Took " + str(vector_add_time) + " seconds to solve")

if __name__ == '__main__':
    main()
Commenting out the @vectorize... line, the code runs fine. However, when I try to run it with Numba and CUDA, I get a long list of errors, of which I think the most relevant one is:
raise TypingError(msg)
numba.errors.TypingError: Failed at nopython (nopython frontend)
np.linalg.solve() only supported for array types
I assume the problem is that numpy.linalg.solve does not accept the data types required by CUDA.
Am I correct in assuming this? Are there other data types that will work?
In this example problem, the same data type is passed to the function, so I think the problem lies with numpy.linalg.
Am I correct in assuming this?
No
Are there other data types that will work?
No
The problem here is that you cannot use numpy.linalg in code that is targeted to run on the Numba GPU backend.
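If the goal is simply to JIT-compile the solve, one option is the CPU backend, where Numba does support np.linalg.solve. A minimal sketch, assuming a CPU-side solve is acceptable:

import numpy as np
from numba import njit

@njit  # nopython CPU compilation; np.linalg.solve is supported here (via LAPACK)
def solver(A, b):
    return np.linalg.solve(A, b)

A = np.random.rand(100, 100)
b = np.random.rand(100, 1)
C = solver(A, b)  # the first call compiles; later calls run native code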