Solving linear system using Python with numba and CUDA - python

I am trying to solve a linear system using numba with GPU processing using CUDA.
I have installed all the relevant packages and tested them, so my GPU, CUDA, etc. appear to be set up properly.
My code is:
import numpy as np
import time
from numba import vectorize, cuda
@vectorize(['float64(float64, float64)'], target='cuda')
def solver(A, b):
    return np.linalg.solve(A, b)
def main():
    A = np.random.rand(100, 100).astype(np.float64)
    b = np.random.rand(100, 1).astype(np.float64)
    start = time.time()
    C = solver(A, b)
    vector_add_time = time.time() - start
    print("Took " + str(vector_add_time) + " seconds to solve")
if __name__ == '__main__':
    main()
With the @vectorize... line commented out, the code runs fine. However, when I try to run it with numba and CUDA, I get a long list of errors, of which I think the most relevant one is:
raise TypingError(msg)
numba.errors.TypingError: Failed at nopython (nopython frontend)
np.linalg.solve() only supported for array types
I assume the problem is that numpy.linalg.solve does not accept the data types required by cuda.
Am I correct in assuming this? Are there other data types that will work?
In this example problem, the same data type is passed to the function, so I think the problem lies with numpy.linalg.

Am I correct in assuming this?
No
Are there other data types that will work?
No
The problem here is that you cannot use numpy.linalg in code which is targeted to run on the numba GPU backend.
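One way to read the typing error: a vectorize-style function with a scalar signature such as float64(float64, float64) receives one element at a time, never a whole matrix, so a call like np.linalg.solve inside it has no array to work on. The sketch below illustrates this on the CPU only. It uses NumPy's np.vectorize as a stand-in (an assumption for illustration; it is not numba's decorator and does not touch the GPU), and the helper name record_ndim and the seed are hypothetical. It then solves a small system with an ordinary host-side call.

```python
import numpy as np

seen_ndims = []

def record_ndim(a, b):
    # Record the dimensionality of what the function actually receives.
    seen_ndims.append((np.ndim(a), np.ndim(b)))
    return a + b

# np.vectorize is a CPU stand-in here; it calls record_ndim once per element.
elementwise = np.vectorize(record_ndim, otypes=[np.float64])

rng = np.random.default_rng(0)
A = rng.random((4, 4)) + 4.0 * np.eye(4)   # diagonally dominant, hence nonsingular
b = rng.random((4, 1))

out = elementwise(A, b)        # b broadcasts against A; 16 scalar calls
x = np.linalg.solve(A, b)      # whole-array call on the host
residual = np.max(np.abs(A @ x - b))
print(out.shape, set(seen_ndims), residual)
```

Every recorded call sees zero-dimensional inputs, while the host-side solve works on the full arrays.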

Related

numba did not speed up the compilation of code

I tried this code with numba as well as in normal mode, but both runs completed in 13 seconds, so numba did not add any speed.
How can I set numba for this situation?
import numpy as np
from numba import jit, cuda
a=[]
@jit(target_backend="cuda")
def func():
    for i in range(100000):
        a.append(i)
    return a
print(func())
CUDA cannot be used here because this code cannot run on the GPU (and even if it were modified to do so, it would be inefficient). GPUs are very different from CPUs and are programmed differently. To understand why, please read the CUDA programming guide.
If you just run the code and read the warnings from Numba, you can see that the code falls back to a basic Python implementation:
Compilation is falling back to object mode WITH looplifting enabled because Function "func" failed type inference due to: Untyped global name 'a': Cannot type empty list
The reason is that the type of a is not provided and Numba fails to infer it. Numba is fast because of its type system, which enables it to compile the code to a native binary.
Additionally, you should not modify global variables. This is not a good idea in terms of software engineering, and Numba does not support it anyway.
Thus, you need to use a typed list returned from the function. Note that typed lists are not much faster than Python lists when read or written from CPython, because Numba has to convert from/to CPython lists, which is an expensive operation. Still, on my machine this is about 3 times faster.
Corrected code:
import numpy as np
from numba import jit, cuda
@jit
def func():
    a=[]
    for i in range(100000):
        a.append(i)
    return a
func() # Compile the function during the first run
a = func() # Later calls reuse the compiled code and run quickly
print(a) # Printing is slow
For more information about the lazy compilation please read the Numba documentation.
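The warm-up pattern can be sketched with a small timing harness. This is a minimal sketch and not the answerer's code: it swaps the list for a preallocated NumPy array, which is a different technique from the typed list discussed above, and the function name fill is hypothetical. It also assumes a no-op fallback decorator when numba is not installed, so the timing only reflects compiled code if numba is actually present.

```python
import time
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: without numba, fall back to a no-op decorator so the
    # sketch still runs (without any compilation).
    def njit(func):
        return func

@njit
def fill(n):
    # Preallocate the result instead of appending to a list.
    out = np.empty(n, dtype=np.int64)
    for i in range(n):
        out[i] = i
    return out

fill(10)                          # first call triggers compilation
start = time.perf_counter()
a = fill(100_000)                 # timed call reuses the compiled function
elapsed = time.perf_counter() - start
print(a[:5], a[-1], elapsed)
```

Timing only the second call keeps the one-time compilation cost out of the measurement.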

Can numba, multiprocessor and random number generators work together?

I'm trying to get numba, multiprocessing and random number generators to work together. I have reduced my real problem to the following piece of code, which contains the important elements. The following works for me.
import numpy as np
from numba import jit
import multiprocessing as mp
#@jit(nopython=True)
def compute_with_random(j,rng):
    x=rng.normal(0,0.3,j)
    y=np.sum(x)/j
    return y
#@jit(nopython=True)
def single(args):
    (n,se)=args
    rng = np.random.default_rng(se)
    s=0
    for i in range(1,n):
        s+=compute_with_random(i,rng)
    return s
def Call_Multi():
    seed_sequence = np.random.SeedSequence(12345)
    seeds = seed_sequence.spawn(4)
    all_ins=[ (500,seeds[0]), (700,seeds[1]), (400,seeds[2]), (200,seeds[3]) ]
    pool = mp.Pool(4)
    result = pool.map( single, all_ins )
    return result
if __name__=='__main__':
    print( Call_Multi() )
In my real problem, the two functions compute_with_random() and single() take quite long, and numba can accelerate them, so I want to use the numba decorator. However, enabling the @jit decorators above results in the following error.
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type pyobject
During: typing of argument at C:\test\test_rng_numba.py (16)
File "test_rng_numba.py", line 16:
def single(args):
    (n,se)=args
    ^
This error may have been caused by the following argument(s):
... (truncated)
If I replace x=rng.normal(0,0.3,j) with x=np.random.normal(0,0.3,j) and remove the arguments rng in compute_with_random() and se in single(), the example also works fine with numba. So there is probably a problem with numba and the parallel random number generator, or with the seeds/rng. The random number generator with rng.normal() is supposed to produce independent chains of random numbers for each process. Any ideas how to solve this issue? Thanks.
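The error above is raised while typing the argument of single(), which is a tuple containing a SeedSequence object. One possible restructuring, offered as a sketch and not a confirmed fix, keeps the tuple unpacking and the generator creation in a plain Python wrapper and moves the numeric loop into its own function. Whether that inner function can then be decorated with @jit(nopython=True) depends on the installed numba version accepting np.random.Generator arguments, which is an assumption here, so the decorator is left out. The names accumulate, worker and make_jobs are hypothetical, and the jobs run sequentially for brevity.

```python
import numpy as np

def accumulate(n, rng):
    # Numeric loop only: this is the candidate for compilation.
    s = 0.0
    for j in range(1, n):
        x = rng.normal(0.0, 0.3, j)
        s += np.sum(x) / j
    return s

def worker(args):
    # Plain Python: unpack the tuple and build the generator here.
    n, seed = args
    rng = np.random.default_rng(seed)
    return accumulate(n, rng)

def make_jobs():
    # One independent child seed per job, as in the question.
    children = np.random.SeedSequence(12345).spawn(4)
    sizes = [500, 700, 400, 200]
    return list(zip(sizes, children))

first = [worker(job) for job in make_jobs()]
second = [worker(job) for job in make_jobs()]
print(first)
```

With multiprocessing, the list comprehension would be replaced by pool.map(worker, make_jobs()), since worker stays an ordinary picklable function.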

How to use Dask to run python code on the GPU?

I have some code that uses Numba cuda.jit in order for me to run on the gpu, and I would like to layer dask on top of it if possible.
Example Code
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from numba import cuda, njit
import numpy as np
from dask.distributed import Client, LocalCluster
@cuda.jit()
def addingNumbersCUDA (big_array, big_array2, save_array):
    i = cuda.grid(1)
    if i < big_array.shape[0]:
        for j in range (big_array.shape[1]):
            save_array[i][j] = big_array[i][j] * big_array2[i][j]
if __name__ == "__main__":
    cluster = LocalCluster()
    client = Client(cluster)
    big_array = np.random.random_sample((100, 3000))
    big_array2 = np.random.random_sample((100, 3000))
    save_array = np.zeros(shape=(100, 3000))
    arraysize = 100
    threadsperblock = 64
    blockspergrid = (arraysize + (threadsperblock - 1))
    d_big_array = cuda.to_device(big_array)
    d_big_array2 = cuda.to_device(big_array2)
    d_save_array = cuda.to_device(save_array)
    addingNumbersCUDA[blockspergrid, threadsperblock](d_big_array, d_big_array2, d_save_array)
    save_array = d_save_array.copy_to_host()
If my function addingNumbersCUDA didn't use any CUDA I would just put client.submit in front of my function (along with gather after) and it would work. But, since I'm using CUDA putting submit in front of the function doesn't work. The dask documentation says that you can target the gpu, but it's unclear as to how to actually set it up in practice. How would I set up my function to use dask with the gpu targeted and with cuda.jit if possible?
You may want to look through Dask's documentation on GPUs.
But, since I'm using CUDA putting submit in front of the function doesn't work.
There is no particular reason why this should be the case. All Dask does is run your function on a different computer. It doesn't change or modify your function in any way.
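Independently of Dask, the launch configuration in the question's example is worth a second look. The block count for a one-dimensional launch is commonly computed by ceiling division, so that blocks times threads just covers the array. The expression in the question, (arraysize + (threadsperblock - 1)), has no division, so it requests more blocks than needed; the bounds check inside the kernel keeps that harmless. A minimal sketch of the usual calculation, with the hypothetical helper name blocks_for:

```python
def blocks_for(n_items, threads_per_block):
    # Ceiling division: smallest block count whose threads cover n_items.
    return (n_items + threads_per_block - 1) // threads_per_block

arraysize = 100
threadsperblock = 64
blockspergrid = blocks_for(arraysize, threadsperblock)
print(blockspergrid, blockspergrid * threadsperblock)
```

For 100 rows and 64 threads per block this gives 2 blocks, that is 128 threads, of which the last 28 are idle because of the i < big_array.shape[0] check.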

Why is this numba.cuda lookup table implementation failing?

I'm trying to implement a transform that, at some stage, uses a lookup table of less than 1K in size. It seems to me that this shouldn't pose a problem for a modern graphics card.
But the code below is failing with an unknown error:
from numba import cuda, vectorize
import numpy as np
tmp = np.random.uniform( 0, 100, 1000000 ).astype(np.int16)
tmp_device = cuda.to_device( tmp )
lut = np.arange(100).astype(np.float32) * 2.5
lut_device = cuda.to_device(lut)
@cuda.jit(device=True)
def lookup(x):
    return lut[x]
@vectorize("float32(int16)", target="cuda")
def test_lookup(x):
    return lookup(x)
test_lookup(tmp_device).copy_to_host() # <-- fails with cuMemAlloc returning UNKNOWN_CUDA_ERROR
What am I doing against the spirit of numba.cuda?
Even replacing lookup with the following simplified code results in the same error:
@cuda.jit(device=True)
def lookup(x):
    return x + lut[1]
Once this error occurs, I am essentially no longer able to utilize the cuda context at all. For instance, allocating a new array via cuda.to_device results in a:
numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemAlloc results in UNKNOWN_CUDA_ERROR
Running on: 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04)
Driver Version: 390.25
numba: 0.33.0
The above code is fixed by changing the lookup function so that it builds a constant-memory copy of the table:
@cuda.jit(device=True)
def lookup(x):
    lut_device = cuda.const.array_like(lut)
    return lut_device[x]
I ran multiple variations of the code, including simply touching the lookup table from within this kernel without using its output. Combining this with @talonmies' assertion that UNKNOWN_CUDA_ERROR usually occurs with invalid instructions, I thought that perhaps a shared memory constraint was causing the issue.
The above code makes the whole thing work. However, I still don't understand why in a profound way.
If anyone knows and understands why, please feel free to contribute to this answer.
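When experimenting with variants like these, a host-side reference makes it easier to tell a wrong result from a crashed context. The sketch below is an assumption-labeled check and does not explain the original failure: it verifies on the CPU that every index falls inside the table and computes the expected output with NumPy fancy indexing. The seed and the variable names in_range and expected are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
tmp = rng.uniform(0, 100, 1_000_000).astype(np.int16)
lut = np.arange(100).astype(np.float32) * 2.5

# Every index must fall inside the table before it is used on the device.
in_range = bool(tmp.min() >= 0 and tmp.max() < lut.shape[0])

# Host-side reference result via fancy indexing.
expected = lut[tmp]
print(in_range, expected.dtype, expected[:5])
```

The array returned by copy_to_host() could then be compared against expected with np.array_equal.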

CUDA-Python: How to launch CUDA kernel in Python (Numba 0.25)?

Could you please help me understand how to write CUDA kernels in Python? AFAIK, numba.vectorize can run on cuda, cpu, or parallel (multiple CPUs), depending on target. But target='cuda' requires setting up CUDA kernels.
The main issue is that many examples and answers on the Internet relate to the deprecated NumbaPro library, so outdated wikis and the like are hard to follow, especially for a newbie.
I have:
latest Anaconda (v2)
latest Numba (v0.25)
latest CUDA toolkit (v7)
Here is the error I'm getting:
numba.cuda.cudadrv.driver.CudaAPIError: 1 Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE
import numpy as np
import time
from numba import vectorize, cuda
@vectorize(['float32(float32, float32)'], target='cuda')
def VectorAdd(a, b):
    return a + b
def main():
    N = 32000000
    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)
    start = time.time()
    C = VectorAdd(A, B)
    vector_add_time = time.time() - start
    print "C[:5] = " + str(C[:5])
    print "C[-5:] = " + str(C[-5:])
    print "VectorAdd took for %s seconds" % vector_add_time
if __name__ == '__main__':
    main()
The code, as posted, is correct and will run on a Python 2 NumbaPro/Accelerate system without error.
It is likely that the particular system used to run the code did not have much capacity and was hitting a display driver watchdog timeout or running out of free memory with 32-million-element vectors. Reducing the size of the input data allowed the code to run correctly.
[This answer assembled from comments and added as a community wiki entry to get this question off the unanswered list]
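To put the input size in perspective, the arithmetic below estimates the memory needed for the two inputs and the result. It also sketches one hypothetical way of reducing the amount of data handled per call, namely processing the vectors in chunks. Chunking is not part of the answer above; the helper name add_in_chunks and the chunk size are assumptions, and np.add stands in for the GPU ufunc so the sketch runs on the CPU (written for Python 3).

```python
import numpy as np

N = 32_000_000
itemsize = np.dtype(np.float32).itemsize       # 4 bytes per element
bytes_per_array = N * itemsize
total_mib = 3 * bytes_per_array / 2**20        # A, B and the result C

def add_in_chunks(a, b, chunk, add=np.add):
    # `add` stands in for the GPU ufunc; each call sees at most `chunk` elements.
    out = np.empty_like(a)
    for start in range(0, a.shape[0], chunk):
        stop = start + chunk
        out[start:stop] = add(a[start:stop], b[start:stop])
    return out

small = np.ones(1_000, dtype=np.float32)
c = add_in_chunks(small, small, chunk=256)
print(bytes_per_array, round(total_mib, 1), c[:5])
```

Each 32-million-element float32 vector takes 128,000,000 bytes, so the three arrays together need roughly 366 MiB of device memory, before any driver overhead.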
