numba did not speed up the compilation of code - python

I ran this code both with Numba and in plain Python, but both runs took 13 seconds, so Numba gave no speedup.
How should I set up Numba for this situation?
import numpy as np
from numba import jit, cuda
a=[]
@jit(target_backend="cuda")
def func():
    for i in range(100000):
        a.append(i)
    return a
print(func())

CUDA cannot be used here because this code cannot run on the GPU (and even if it were modified to do so, it would be inefficient). GPUs are very different from CPUs and are programmed differently. To understand why, please read the CUDA programming guide.
If you run the code and read the warnings from Numba, you can see that the code falls back to a basic Python implementation:
Compilation is falling back to object mode WITH looplifting enabled because Function "func" failed type inference due to: Untyped global name 'a': Cannot type empty list
The reason is that the type of a is not provided and Numba fails to infer it. Numba is fast because of its type system, which enables it to compile the code to a native binary.
Additionally, you should not modify global variables. This is not a good idea in terms of software engineering, and Numba does not support it anyway.
Thus, you need to use a typed list returned from the function. Note that typed lists are not much faster than Python lists when read or written from CPython, because Numba has to convert from/to CPython lists, which is an expensive operation. Still, on my machine this is about 3 times faster.
Corrected code:
import numpy as np
from numba import jit, cuda
@jit
def func():
    a=[]
    for i in range(100000):
        a.append(i)
    return a
func() # Compile the function during the first run
a = func() # Quickly execute the already compiled code
print(a) # Printing is slow
For more information about lazy compilation, please read the Numba documentation.
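To see the effect of lazy compilation, one can time the first call separately from a later one. The sketch below is illustrative rather than the answer's own benchmark: it fills a preallocated NumPy array instead of a list (an assumption made to avoid the list conversion cost mentioned above), the name fill is hypothetical, and the no-op fallback decorator exists only so the snippet still runs where Numba is not installed.

```python
import time
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: Numba may be missing; use a no-op decorator so the sketch still runs.
    def njit(func):
        return func

@njit
def fill(n):
    # Hypothetical helper: write 0..n-1 into a typed array instead of a Python list.
    out = np.empty(n, dtype=np.int64)
    for i in range(n):
        out[i] = i
    return out

start = time.perf_counter()
fill(10)  # with Numba, the first call compiles the function for this argument type
first = time.perf_counter() - start

start = time.perf_counter()
result = fill(100_000)  # later calls reuse the compiled code
second = time.perf_counter() - start

print(f"first call: {first:.4f} s, later call: {second:.4f} s")
```

With Numba installed, the first call is usually dominated by compilation time, so benchmarks should measure a later call.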

Related

Can numba, multiprocessor and random number generators work together?

I'm trying to get numba, multiprocessing and random number generators to work together. I have reduced my real problem to the following piece of code, which contains the important elements. The following works for me.
import numpy as np
from numba import jit
import multiprocessing as mp
#@jit(nopython=True)
def compute_with_random(j,rng):
    x=rng.normal(0,0.3,j)
    y=np.sum(x)/j
    return y
#@jit(nopython=True)
def single(args):
    (n,se)=args
    rng = np.random.default_rng(se)
    s=0
    for i in range(1,n):
        s+=compute_with_random(i,rng)
    return s
def Call_Multi():
    seed_sequence = np.random.SeedSequence(12345)
    seeds = seed_sequence.spawn(4)
    all_ins=[ (500,seeds[0]), (700,seeds[1]), (400,seeds[2]), (200,seeds[3]) ]
    pool = mp.Pool(4)
    result = pool.map( single, all_ins )
    return result
if __name__=='__main__':
    print( Call_Multi() )
In my real problem the two functions compute_with_random() and single() take quite long, and numba can accelerate them, so I want to use the numba decorator. However, enabling the @jit decorators above results in the following error.
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type pyobject
During: typing of argument at C:\test\test_rng_numba.py (16)
File "test_rng_numba.py", line 16:
def single(args):
    (n,se)=args
    ^
This error may have been caused by the following argument(s):
... (truncated)
If I replace x=rng.normal(0,0.3,j) with x=np.random.normal(0,0.3,j) and remove the arguments rng in compute_with_random() and se in single(), the example also works fine with numba. So there is probably a problem with numba and the parallel random number generator, or with the seeds/rng. The random number generator with rng.normal() is supposed to produce independent chains of random numbers for each process. Any ideas how to solve this issue? Thanks.
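One possible workaround, which is not from the original post, is to keep the seed handling and the Generator in plain Python and hand only NumPy arrays to the jitted code. The error appears to come from single() receiving a tuple that holds a SeedSequence, an object Numba cannot type. In the sketch below the helper mean_of is hypothetical, the sizes are reduced, and the no-op fallback decorator is only there so the snippet runs without Numba. The single() function keeps the question's signature, so it could be passed to pool.map in the same way.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: Numba may be missing; use a no-op decorator so the sketch still runs.
    def njit(func):
        return func

@njit
def mean_of(x):
    # Hypothetical numeric kernel: it only sees a float64 array, never the Generator.
    total = 0.0
    for v in x:
        total += v
    return total / x.size

def single(args):
    # Plain Python wrapper: unpack the seed and draw the samples outside jitted code.
    n, seed_seq = args
    rng = np.random.default_rng(seed_seq)
    s = 0.0
    for i in range(1, n):
        s += mean_of(rng.normal(0.0, 0.3, i))
    return s

# Each child SeedSequence gives an independent stream, one per task.
seeds = np.random.SeedSequence(12345).spawn(2)
results = [single((50, seeds[0])), single((70, seeds[1]))]
print(results)
```

The trade-off is that the sampling itself is not compiled; this pays off only when the numeric kernel, rather than the random draws, dominates the runtime.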

Weird bug in Pandas and Numpy regarding multithreading

Most of NumPy's functions enable multithreading by default.
For example, I work on an 8-core Intel CPU workstation. If I run the script
import numpy as np
x=np.random.random(1000000)
for i in range(100000):
    np.sqrt(x)
then top on Linux shows 800% CPU usage while it runs.
Which means numpy automatically detects that my workstation has 8 cores, and np.sqrt automatically uses all 8 cores to accelerate computation.
However, I found a weird bug. If I run the script
import numpy as np
import pandas as pd
df=pd.DataFrame(np.random.random((10,10)))
df+df
x=np.random.random(1000000)
for i in range(100000):
    np.sqrt(x)
the CPU usage is only 100%!
This means that if you add two pandas DataFrames before running any numpy function, numpy's automatic multithreading is gone without any warning! This is not reasonable: why would a pandas DataFrame calculation affect numpy's threading settings? Is it a bug? How can I work around this?
PS:
I dug further using the Linux perf tool and profiled both scripts (the perf output is not preserved here).
Both scripts involve libmkl_vml_avx2.so, while the first script additionally involves libiomp5.so, which seems to be related to OpenMP.
Since vml stands for the Intel Vector Math Library, I guess, going by the VML documentation, that at least the functions it lists are all automatically multithreaded.
Pandas uses numexpr under the hood to evaluate some operations, and numexpr sets the maximal number of threads for VML to 1 when it is imported:
# The default for VML is 1 thread (see #39)
set_vml_num_threads(1)
and it gets imported by pandas when df+df is evaluated in expressions.py:
from pandas.core.computation.check import _NUMEXPR_INSTALLED
if _NUMEXPR_INSTALLED:
    import numexpr as ne
However, the Anaconda distribution also uses VML functionality for functions such as sqrt, sin, cos and so on, and once numexpr has set the maximal number of VML threads to 1, the numpy functions no longer use parallelization.
The problem can be easily seen in gdb (using your slow script):
>>> gdb --args python slow.py
(gdb) b mkl_serv_domain_set_num_threads
function "mkl_serv_domain_set_num_threads" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (mkl_serv_domain_set_num_threads) pending.
(gdb) run
Thread 1 "python" hit Breakpoint 1, 0x00007fffee65cd70 in mkl_serv_domain_set_num_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
(gdb) bt
#0 0x00007fffee65cd70 in mkl_serv_domain_set_num_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
#1 0x00007fffe978026c in _set_vml_num_threads(_object*, _object*) () from /home/ed/anaconda37/lib/python3.7/site-packages/numexpr/interpreter.cpython-37m-x86_64-linux-gnu.so
#2 0x00005555556cd660 in _PyMethodDef_RawFastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:694
...
(gdb) print $rdi
$1 = 1
i.e. we can see that numexpr sets the number of threads to 1. This value is used later, when the vml-sqrt function is called:
(gdb) b mkl_serv_domain_get_max_threads
Breakpoint 2 at 0x7fffee65a900
(gdb) c
Continuing.
Thread 1 "python" hit Breakpoint 2, 0x00007fffee65a900 in mkl_serv_domain_get_max_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
(gdb) bt
#0 0x00007fffee65a900 in mkl_serv_domain_get_max_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
#1 0x00007ffff01fcea9 in mkl_vml_serv_threader_d_1i_1o () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
#2 0x00007fffedf78563 in vdSqrt () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_lp64.so
#3 0x00007ffff5ac04ac in trivial_two_operand_loop () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-x86_64-linux-gnu.so
So we can see that numpy uses vml's implementation, vdSqrt, which utilizes mkl_vml_serv_threader_d_1i_1o to decide whether the calculation should be done in parallel, and it looks up the number of threads:
(gdb) fin
Run till exit from #0 0x00007fffee65a900 in mkl_serv_domain_get_max_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
0x00007ffff01fcea9 in mkl_vml_serv_threader_d_1i_1o () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
(gdb) print $rax
$2 = 1
The register %rax holds the maximal number of threads, and it is 1.
Now we can use numexpr to increase the number of vml-threads, i.e.:
import numpy as np
import numexpr as ne
import pandas as pd
df=pd.DataFrame(np.random.random((10,10)))
df+df
#HERE: reset number of vml-threads
ne.set_vml_num_threads(8)
x=np.random.random(1000000)
for i in range(10000):
    np.sqrt(x) # now in parallel
Now multiple cores are utilized!
Looking at numpy, it appears to have had on/off issues with multithreading under the hood, and depending on which version you are using, you may start to see crashes when you bump up ne.set_vml_num_threads().
http://numpy-discussion.10968.n7.nabble.com/ANN-NumExpr-2-7-0-Release-td47414.html
I need to get my head around how this is glued in to the python interpreter, given your code example where it seems to be somehow allowing multiple apparently synchronous/ordered calls to np.sqrt() to proceed in parallel. I guess if python interpreter is always just returning a reference to an object when it pops the stack, and in your example is just pitching those references and not assigning or manipulating them in any way it would be fine. But if subsequent loop iterations depend on previous ones then it seems less clear how these could be safely parallelized. Arguably silent failure / wrong results is an outcome worse than crashes.
I think that your initial premise may be incorrect -
You stated: Which means numpy automatically detects that my workstation has 8 cores, and np.sqrt automatically uses all 8 cores to accelerate computation.
A single function np.sqrt() cannot guess how it will next be invoked or return before it has partially completed. There are parallelism mechanisms in python, but none are automatic.
Now, having said that, the python interpreter may be able to optimize the for loop for parallelism, which may be what you are seeing, but I strongly suspect that if you look at the wall-clock time for this loop, it will be no different regardless of whether you are (apparently) using 8 cores or 1 core.
UPDATE: Having read a bit more of the comments it seems as though the multi-core behavior you are seeing is related to the anaconda distribution of the python interpreter. I took a look but was unable to find any source code for it, but it seems that the python license permits entities (like anaconda.com) to compile and distribute derivatives of the interpreter without requiring their changes to be published.
I guess that you can reach out to the anaconda folks - the behaviour you are seeing will be difficult to figure out without knowing what/if anything they've changed in the interpreter ..
Also do a quick check of the wall clock time with/without the optimization to see if it is indeed 8x faster - even if you've really got all 8 cores working instead of 1 it would be good to know if the results are actually 8x faster or if there are spinlocks in use which are still serializing on a single mutex.
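The wall-clock check suggested above can be sketched with the standard library alone. time.process_time() sums the CPU time of all threads in the process, so a CPU-to-wall ratio well above 1 indicates that several cores were doing work, while a ratio near 1 indicates a single busy core. The function name and the loop sizes are illustrative assumptions, smaller than in the question.

```python
import time
import numpy as np

def cpu_to_wall_ratio(repeats=50, size=1_000_000):
    # Hypothetical helper: time the np.sqrt loop in wall-clock and CPU seconds.
    x = np.random.random(size)
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    for _ in range(repeats):
        np.sqrt(x)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu, cpu / wall

wall, cpu, ratio = cpu_to_wall_ratio()
print(f"wall: {wall:.3f} s, cpu: {cpu:.3f} s, cpu/wall: {ratio:.2f}")
```

Running it once before and once after importing pandas and evaluating df+df would show whether the wall-clock time, and not just the CPU usage reported by top, actually changes.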

Caching jit-compiled functions in numba

I want to compile a range of functions using numba, and as I only need to run them on my machine with the same signatures, I want to cache them.
But when attempting to do so, numba tells me that the function cannot be cached because it uses large global arrays. This is the specific warning it displayed.
NumbaWarning: Cannot cache compiled function "sigmoid" as it uses dynamic globals (such as ctypes pointers and large global arrays)
I am aware that global arrays are usually frozen while large ones are not, but my function looks like this:
@njit(parallel=True, cache=True)
def sigmoid(x):
    return 1./(1. + np.exp(-x))
I cannot see any global arrays, especially large ones.
Where is the problem?
I observed this behavior too (running on: Windows 10, Dell Latitude 7480, Git for Windows), even for very simple tests. It seems parallel=True doesn't allow caching. This is independent of the actual presence of prange calls. Below is a simple example.
import numpy as np
from numba import jit, prange
def where_numba(arr: np.ndarray) -> np.ndarray:
    l0, l1 = np.shape(arr)[0], np.shape(arr)[1]
    for i0 in prange(l0):
        for i1 in prange(l1):
            if arr[i0, i1] > 0.5:
                arr[i0, i1] = arr[i0, i1] * 10
    return(arr)
where_numba_jit = jit(signature_or_function='float64[:,:](float64[:,:])',
                      nopython=True, parallel=True, cache=True, fastmath=True, nogil=True)(where_numba)
arr = np.random.random((10000, 10000))
seln = where_numba_jit(arr)
I get the same warning.
I think you should look at your specific code and decide which option (cache or parallel) is better to keep: cache for relatively short calculation times, and parallel when the compilation time is negligible compared to the actual calculation time. Please comment if you have updates.
There is also an open Numba issue on this:
https://github.com/numba/numba/issues/2439

Solving linear system using Python with numba and CUDA

I am trying to solve a linear system using numba with GPU processing via CUDA.
I have installed all the relevant packages and tested them, so it seems that my GPU, CUDA etc. are set up properly.
My code is:
import numpy as np
import time
from numba import vectorize, cuda
@vectorize(['float64(float64, float64)'], target='cuda')
def solver(A, b):
    return np.linalg.solve(A, b)
def main():
    A = np.random.rand(100, 100).astype(np.float64)
    b = np.random.rand(100, 1).astype(np.float64)
    start = time.time()
    C = solver(A, b)
    vector_add_time = time.time() - start
    print("Took " + str(vector_add_time) + " seconds to solve")
if __name__ == '__main__':
    main()
With the @vectorize... line commented out, the code runs fine. However, when I try to do it with numba and cuda, I get a long list of errors, of which I think the most relevant one is:
raise TypingError(msg)
numba.errors.TypingError: Failed at nopython (nopython frontend)
np.linalg.solve() only supported for array types
I assume the problem is that numpy.linalg.solve does not accept the data types required by cuda.
Am I correct in assuming this? Are there other data types that will work?
In this example problem, the same data type is passed to the function, so I think the problem lies with numpy.linalg.
Am I correct in assuming this?
No
Are there other data types that will work?
No
The problem here is that you cannot use numpy.linalg in code which is targeted to run on the numba GPU backend.
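A related point is that @vectorize compiles an elementwise function: with the signature float64(float64, float64), A and b are scalars inside solver, which matches the "only supported for array types" message. For a 100 x 100 system, a plain CPU solve is a reasonable baseline. The sketch below uses only NumPy and is not a GPU implementation; the seed and the residual check are illustrative additions.

```python
import time
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
A = rng.random((100, 100))
b = rng.random((100, 1))

start = time.perf_counter()
x = np.linalg.solve(A, b)  # dense solve on the CPU via LAPACK
elapsed = time.perf_counter() - start

# Check the solution by measuring how far A @ x is from b.
residual = np.linalg.norm(A @ x - b)
print(f"Took {elapsed:.6f} seconds to solve, residual {residual:.2e}")
```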

Cython reading in files in parallel and bypassing GIL

I'm trying to figure out how to use Cython to bypass the GIL and load files in parallel for IO-bound tasks. For now I have the following Cython code, which tries to load the files n0.npy, n1.npy ... n100.npy
import numpy as np
from cython.parallel import prange
def foo_parallel():
    cdef int i
    for i in prange(100, nogil=True, num_threads=8):
        with gil:
            np.load('n'+str(i)+'.npy')
    return []
def foo_serial():
    cdef int i
    for i in range(100):
        np.load('n'+str(i)+'.npy')
    return []
I'm not noticing a significant speedup. Does anyone have any experience with this?
Edit: I'm getting around 900 ms in parallel vs 1.3 seconds serially. I would expect more speedup given 8 threads.
As the comment states, you can't call NumPy inside a with gil block and expect it to run in parallel. You need C or C++ level file operations to do this. See this post for a potential solution: http://www.code-corner.de/?p=183
I.e. adapt file_io.pyx from that post to your problem (I'd post it here, but I can't figure out how on my phone). Add nogil to the end of the cdef statement there and call the function, from a foo_parallel function defined with cpdef, within your prange loop. Use read_file, not the slow one, and change it to cdef. Please post benchmarks after doing so, as I'm curious and have no computer while on vacation.
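For comparison, a different approach from the Cython one discussed above is to use ordinary Python threads: CPython releases the GIL while a thread blocks on file reads, so a thread pool can overlap the I/O without any nogil code. This is a sketch under assumptions, not a benchmarked solution. The helper name load_all, the worker count and the small temporary files are illustrative, and the gain depends on how much of the time is spent in I/O rather than in Python-level parsing.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def load_all(paths, max_workers=8):
    # Hypothetical helper: load every .npy file using a pool of threads.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(np.load, paths))  # results keep the input order

# Create a few small files so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(8):
        path = os.path.join(tmp, f"n{i}.npy")
        np.save(path, np.full(1000, i, dtype=np.float64))
        paths.append(path)
    arrays = load_all(paths)

print([int(a[0]) for a in arrays])
```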
