This question follows on from this one [1]. I have a big 3D array and I have to do some heavy calculations on it.
I would like to split each slice of my array into 4 parts and run the calculations for each part on one of the 4 cores of my computer.
I want to do that for every slice of my 3D array. What is the best way to do that?
import numpy
size = 8.
array[:,:,0] = 0
array[:,:,1] = X+Y
array[:,:,2] = X*cos(X)+Y*sin(Y)
array[:,:,3] = X**3+sin(X)+X**2+Y**2+sin(Y)
You can use Pool from the multiprocessing module:
from multiprocessing import Pool
def f(num):
    return num * 2  # replace with heavy computation
lst = [1,2,3,4,5,6,7,8,9,10,11]
p = Pool(4)
print(p.map(f, lst))
It will work equally well with a 3-dimensional numpy array:
from multiprocessing import Pool
import numpy
def f(num):
    return num * 2  # replace with heavy computation
arr = numpy.array(
p = Pool(4)
print(p.map(f, arr))
As an alternative to multiprocessing, you can use the concurrent.futures module:
import concurrent.futures
def f(num):
    return num * 2
arr = […]
with concurrent.futures.ProcessPoolExecutor() as exc:
    print(list(exc.map(f, arr)))
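To tie this back to the question, here is a minimal sketch of splitting one 2D slice into four blocks and processing them on four worker processes. The names `heavy`, `split_in_four`, `join_four` and `process_slice` are illustrative assumptions, not part of the original posts, and `heavy` only stands in for the real calculation. The sketch assumes the calculation is element-wise, so the blocks can be processed independently.

```python
from multiprocessing import Pool

import numpy as np


def heavy(block):
    # Placeholder for the real per-block computation (assumed element-wise).
    return np.sin(block) ** 2 + block


def split_in_four(slice2d):
    # Cut a 2D slice into quadrants: top-left, top-right, bottom-left, bottom-right.
    r, c = slice2d.shape[0] // 2, slice2d.shape[1] // 2
    return [slice2d[:r, :c], slice2d[:r, c:], slice2d[r:, :c], slice2d[r:, c:]]


def join_four(blocks):
    # Reassemble the quadrants in the order split_in_four produced them.
    return np.block([[blocks[0], blocks[1]], [blocks[2], blocks[3]]])


def process_slice(pool, slice2d):
    # Each quadrant is sent to a worker; the results are glued back together.
    return join_four(pool.map(heavy, split_in_four(slice2d)))


if __name__ == "__main__":
    array = np.random.rand(200, 200, 4)
    result = np.empty_like(array)
    with Pool(4) as pool:
        for k in range(array.shape[2]):
            result[:, :, k] = process_slice(pool, array[:, :, k])
    print(np.allclose(result, heavy(array)))
```

Every block is copied to the workers and back, so this layout is likely to pay off only when the per-block work costs much more than the data transfer.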
I'm trying to solve an ODE system with solve_ivp, and I want to update the local variables of the right-hand-side function every time the solver calls it.
In particular, I want to update the Lagrange multipliers (lambdas_i) so that the next time step of solve_ivp uses the values from the previous one.
(The "reconstruct" function comes from a Python module that reconstructs a size distribution from given moments.)
Is there a way to do this? I'll post the code below:
import time
import numpy as np
import scipy.integrate as integrate
from numpy import sqrt, sin, cos, pi
import math
import pylab as pp
from pymaxent import reconstruct
start = time.time()
'''Initialize variables'''
t=np.linspace(0, 60,31)
for i in range(4):
    def distr(L,i=i):
        return (L**i)*0.0399*np.exp(-((L-50)**2)/200)
    m, err=integrate.quad(distr, 0, np.inf)
''' Solving ode system using Maximum Entropy, G(L)=1+0.002*L'''
def moments(t,y):
m0 = y[0]
m1 = y[1]
m2 = y[2]
m3 = y[3]
sol, lambdas_i= reconstruct(mu=y ,bnds=bnds)
def moment1(L):
dm1dt, err1=integrate.quad(moment1,Lmin,Lmax)
def moment2(L):
dm2dt, err2=integrate.quad(moment2,Lmin,Lmax)
def moment3(L):
dm3dt, err3=integrate.quad(moment3,Lmin,Lmax)
'''Using BDF, step by step'''
end = time.time()
print('Total time =', end - start)
Here is one way to achieve what you want. I won't be using your actual code, but using a simpler example problem, and you can use the same strategy to solve yours.
def seq():
    l = [1, 0]
    while True:
        p = l[1]
        n = l[0] + p
        l = [p, n]
        print(n)
        input()
Above is an example function that prints the next (or first) term of the Fibonacci sequence, in which each term is the sum of the two previous terms. In this case, input is used solely to pause between iterations, since the loop is infinite. To turn this into a generator, which gives you more flexibility, you could rewrite the function as:
def seq():
    l = [1, 0]
    while True:
        p = l[1]
        n = l[0] + p
        l = [p, n]
        yield n
Now, if you wanted to get the same result, you could write:
for item in seq():
    print(item)
    input()
However, this is mostly pointless. The generator becomes useful when you want to get the next number of the sequence at any point in the code, not necessarily inside a loop that repeats until you're done. You can do this:
gen = seq()
next(gen) # returns 1
next(gen) # returns 1
next(gen) # returns 2
next(gen) # returns 3
next(gen) # returns 5
And so on...
Another way to solve this problem would be to use a global variable instead of a local variable.
l = [1, 0]
def seq():
    p = l[1]
    n = l[0] + p
    l[0] = p
    l[1] = n
    return n
The variable l is defined outside the function, not inside it, so it is not discarded when the function exits. To run the code the same way as before:
while True:
    print(seq())
    input()
Both of these approaches should be implementable in your code.
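Applied to a solver callback, the same idea can also be packaged as a callable object that keeps its state between calls. The sketch below is only an illustration: `StatefulRHS` is a hypothetical name, and the toy update inside `__call__` merely stands in for a call such as `reconstruct(...)` that would take the previous multipliers as a starting guess.

```python
class StatefulRHS:
    """Callable that remembers a value from its previous call."""

    def __init__(self, initial_guess):
        self.guess = initial_guess  # state that survives between calls
        self.calls = 0

    def __call__(self, t, y):
        # Hypothetical stand-in for reconstruct(): refine the stored guess
        # towards y[0] and keep the refined value for the next call.
        self.guess = 0.5 * (self.guess + y[0])
        self.calls += 1
        return [-self.guess * y[0]]


rhs = StatefulRHS(initial_guess=0.0)
rhs(0.0, [1.0])
rhs(0.1, [1.0])
print(rhs.guess, rhs.calls)
```

Since solve_ivp accepts any callable with the signature fun(t, y), an instance like `rhs` could be passed in place of a plain function. Keep in mind that adaptive solvers evaluate the function several times per step, including for steps they later reject, so the stored state reflects the most recent evaluation rather than the last accepted time step.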
I'm trying to do a multiprocessed version of Monte Carlo Pi calculation.
However, I always get an error that says
TypeError: map() missing 1 required positional argument: 'iterable'
I read that the error can be fixed by using starmap but it doesn't work either for some reason.
import random
import math
import itertools
import multiprocessing
from multiprocessing import Pool, current_process
from timeit import default_timer as timer
import functools
def monteCarlo(total):
    inside = 0
    for i in range(0, total):
        x2 = random.random()**2
        y2 = random.random()**2
        if math.sqrt(x2 + y2) < 1.0:
            inside += 1
    pi = (float(inside) / total) * 4
    return pi
if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    values = [100]
    data = sum(pool.map(functools.partial(monteCarlo, values)))
You don't need to call partial to get the map result; just use pool.map(monteCarlo, values). partial() is used for partial function application, which "freezes" some portion of a function's arguments and/or keywords, resulting in a new object with a simplified signature. Here you don't need to freeze anything. If you do want to use partial, you still need to pass two arguments to pool.map: the callable and the iterable. See the functools documentation for more.
import random
import math
import itertools
import multiprocessing
from multiprocessing import Pool, current_process
from timeit import default_timer as timer
import functools
def monteCarlo(total):
    inside = 0
    for i in range(0, total):
        x2 = random.random()**2
        y2 = random.random()**2
        if math.sqrt(x2 + y2) < 1.0:
            inside += 1
    pi = (float(inside) / total) * 4
    return pi
if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    values = [100]
    data = sum(pool.map(monteCarlo, values))
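To make the role of partial concrete, here is a small sketch that uses the built-in map so that it runs without worker processes. The function `scaled_estimate` and the name `times_four` are made up for illustration. Pool.map takes its arguments in the same (callable, iterable) order.

```python
import functools


def scaled_estimate(scale, total):
    # Toy two-argument function: `scale` is the argument we want to freeze.
    return scale * total


# partial() returns a new callable with `scale` fixed to 4.
times_four = functools.partial(scaled_estimate, 4)

# map() always needs both a callable and an iterable.
print(list(map(times_four, [100, 200])))

# Passing only the partial object leaves out the iterable, which is the
# same mistake that triggers the TypeError in the question.
try:
    map(times_four)
except TypeError as exc:
    print("TypeError:", exc)
```

In the question, partial(monteCarlo, values) was the only argument given to pool.map, so the iterable was missing.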
I was convinced I would save computation time by using a lambda function, but it's not that clear-cut. Look at this example:
import numpy as np
import timeit
def f_with_lambda():
    a = np.array(range(5))
    b = np.array(range(5))
    A,B = np.meshgrid(a,b)
    rst = list(map(lambda x,y : x+y , A, B))
    return np.array(rst)
def f_with_for():
    a = range(5)
    b = np.array(range(5))
    rst = [b+x for x in a]
    return np.array(rst)
lambda_rst = f_with_lambda()
for_rst = f_with_for()
if __name__ == '__main__':
    print(timeit.timeit("f_with_lambda()",setup = "from __main__ import f_with_lambda",number = 10000))
    print(timeit.timeit("f_with_for()",setup = "from __main__ import f_with_for",number = 10000))
The result is simple:
- lambda function, timed with timeit: 0.3514268280014221 s
- for loop: 0.10633227700236603 s
How do I write my lambda function to be competitive? I noticed that calling list() to get the results out of the map object costs time. Is there any other way to proceed? The meshgrid function is certainly not the best choice either...
Every tip is welcome!
Considering the remark about the list:
import numpy as np
import timeit
def f_with_lambda():
    A,B = np.meshgrid(range(150),range(150))
    return np.array(map(lambda x,y : x+y , A, B))
def f_with_for():
    return np.array([np.array(range(150))+x for x in range(150)])
if __name__ == '__main__':
    print(timeit.timeit("f_with_lambda()",setup = "from __main__ import f_with_lambda",number = 10000))
    print(timeit.timeit("f_with_for()",setup = "from __main__ import f_with_for",number = 10000))
This changes a lot. This time (lambda vs. for):
for 5:
0.30227499100146815 vs 0.2510572589999356 (quite similar)
for 150:
0.6687559890015109 vs 20.31807473200024 ( :) :) :) ) !! Great job! Thank you!
Memory allocation takes time (it may involve a call into the OS, which can be delayed).
In the lambda version, you allocate a, b, the meshgrid arrays, rst (as a list and as an array), plus the returned array.
In the for version, you allocate b and rst, plus the returned array. a is a range object, which is lazy, so creating it takes almost no time or memory.
This is why your function using lambda is slower.
Also, don't wrap the result of NumPy array operations in a list just to cast it back to a NumPy array.
Just removing the list() call makes it faster (from 0.9 to 0.4).
def f_with_lambda():
    a = np.array(range(SIZE))
    b = np.array(range(SIZE))
    A,B = np.meshgrid(a,b)
    rst = map(lambda x,y : x+y , A, B)
    return np.array(rst)
See for speed comparison.
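One thing worth checking before trusting the timing without list(): assuming Python 3, map returns a lazy iterator, and np.array applied to it wraps the iterator in a 0-dimensional object array instead of evaluating it. The sketch below (variable names are illustrative) compares that with the evaluated version and with a broadcasting alternative that needs neither map nor meshgrid.

```python
import numpy as np

n = 5
A, B = np.meshgrid(range(n), range(n))

# Without list(): the map object is never consumed, so no sums are computed.
lazy = np.array(map(lambda x, y: x + y, A, B))
print(lazy.shape, lazy.dtype)

# With list(): every row sum is actually evaluated.
eager = np.array(list(map(lambda x, y: x + y, A, B)))

# Broadcasting computes the same grid in a single vectorized operation.
a = np.arange(n)
vectorized = a[:, None] + a[None, :]
print(np.array_equal(eager, vectorized))
```

If `lazy.shape` turns out to be `()` on your setup, the shorter runtime mostly reflects work that was skipped rather than work done faster.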
I compacted the code:
import numpy as np
import timeit
def f_with_lambda():
    A,B = np.meshgrid(range(150),range(150))
    return np.array(list(map(lambda x,y : x+y , A, B)))
def f_with_for():
    return np.array([np.array(range(150))+x for x in range(150)])
if __name__ == '__main__':
    print(timeit.timeit("f_with_lambda()",setup = "from __main__ import f_with_lambda",number = 10000))
    print(timeit.timeit("f_with_for()",setup = "from __main__ import f_with_for",number = 10000))
This time, for a 5x5 grid, the result is
Lambda vs. for
0.38113487999726203 vs 0.24913009200099623
and with 150 it's better:
2.680842614001449 vs 20.176408246999927
But I found no way to integrate the meshgrid into the lambda function, and the list conversion before building the array is unfortunate as well.
I took time to integrate the last remark from politinsa:
import numpy as np
import timeit
def f_with_lambda():
    A,B = np.meshgrid(range(150),range(150))
    return np.array(list(map(lambda x,y : x+y , A, B)))
def f_with_for():
    return np.array([np.array(range(150))+x for x in range(150)])
def f_with_lambda_nolist():
    A,B = np.meshgrid(range(150),range(150))
    return np.array(map(lambda x,y : x+y , A, B))
if __name__ == '__main__':
    print(timeit.timeit("f_with_lambda()",setup = "from __main__ import f_with_lambda",number = 10000))
    print(timeit.timeit("f_with_for()",setup = "from __main__ import f_with_for",number = 10000))
    print(timeit.timeit("f_with_lambda_nolist()",setup = "from __main__ import f_with_lambda_nolist",number = 10000))
results are:
2.4421722999977646 s
18.75847979998798 s
0.6800016999914078 s -> list conversion has (as explained) a real impact on memory allocation
I am using numba's @jit decorator to add two NumPy arrays in Python. The performance with @jit is much higher than with plain Python.
However, it does not utilize all CPU cores, even if I pass in @numba.jit(nopython = True, parallel = True, nogil = True).
Is there any way to make use of all CPU cores with numba's @jit?
Here is my code:
import time
import numpy as np
import numba
SIZE = 2147483648 * 6
a = np.full(SIZE, 1, dtype = np.int32)
b = np.full(SIZE, 1, dtype = np.int32)
c = np.ndarray(SIZE, dtype = np.int32)
@numba.jit(nopython = True, parallel = True, nogil = True)
def add(a, b, c):
    for i in range(SIZE):
        c[i] = a[i] + b[i]
start = time.time()
add(a, b, c)
end = time.time()
print(end - start)
You can pass parallel=True to any numba-jitted function, but that doesn't mean it will always utilize all cores. numba uses heuristics to make the code execute in parallel, and sometimes these heuristics simply don't find anything to parallelize. There's currently a pull request so that it issues a warning if it wasn't possible to make the code "parallel". So it's more of a "please execute this in parallel if possible" parameter than an "enforce parallel execution" one.
However, you can always use threads or processes manually if you know that your code can be parallelized. Here is an adaptation of the multi-threading example from the numba docs:
#!/usr/bin/env python
from __future__ import print_function, division, absolute_import
import math
import threading
from timeit import repeat
import numpy as np
from numba import jit
nthreads = 4
size = 10**7 # CHANGED
def func_np(a, b):
    """Control function using Numpy."""
    return a + b
@jit('void(double[:], double[:], double[:])', nopython=True, nogil=True)
def inner_func_nb(result, a, b):
    """Function under test."""
    for i in range(len(result)):
        result[i] = a[i] + b[i]
def timefunc(correct, s, func, *args, **kwargs):
    """Benchmark *func* and print out its runtime."""
    print(s.ljust(20), end=" ")
    # Make sure the function is compiled before we start the benchmark
    res = func(*args, **kwargs)
    if correct is not None:
        assert np.allclose(res, correct), (res, correct)
    # time it
    print('{:>5.0f} ms'.format(min(repeat(lambda: func(*args, **kwargs),
                                          number=5, repeat=2)) * 1000))
    return res
def make_singlethread(inner_func):
    """Run the given function inside a single thread."""
    def func(*args):
        length = len(args[0])
        result = np.empty(length, dtype=np.float64)
        inner_func(result, *args)
        return result
    return func
def make_multithread(inner_func, numthreads):
    """
    Run the given function inside *numthreads* threads, splitting its
    arguments into equal-sized chunks.
    """
    def func_mt(*args):
        length = len(args[0])
        result = np.empty(length, dtype=np.float64)
        args = (result,) + args
        chunklen = (length + numthreads - 1) // numthreads
        # Create argument tuples for each input chunk
        chunks = [[arg[i * chunklen:(i + 1) * chunklen] for arg in args]
                  for i in range(numthreads)]
        # Spawn one thread per chunk
        threads = [threading.Thread(target=inner_func, args=chunk)
                   for chunk in chunks]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        return result
    return func_mt
func_nb = make_singlethread(inner_func_nb)
func_nb_mt = make_multithread(inner_func_nb, nthreads)
a = np.random.rand(size)
b = np.random.rand(size)
correct = timefunc(None, "numpy (1 thread)", func_np, a, b)
timefunc(correct, "numba (1 thread)", func_nb, a, b)
timefunc(correct, "numba (%d threads)" % nthreads, func_nb_mt, a, b)
I highlighted the parts which I changed, everything else was copied verbatim from the example. This utilizes all cores on my machine (4 core machine therefore 4 threads) but doesn't show a significant speedup:
numpy (1 thread) 539 ms
numba (1 thread) 536 ms
numba (4 threads) 442 ms
The reason for the lack of (much) speedup with multithreading in this case is that addition is a bandwidth-limited operation. That means it takes much more time to load the elements from the arrays and store the result in the result array than to do the actual addition.
In these cases you could even see slowdowns because of parallel execution!
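A rough back-of-the-envelope calculation illustrates the point. It assumes float64 arrays of the benchmark's size and counts only the two loads and one store per element, ignoring caches and any extra traffic the hardware may generate.

```python
size = 10**7            # number of elements, as in the benchmark above
bytes_per_double = 8

# a[i] + b[i] -> result[i]: two loads and one store for a single addition.
bytes_per_element = 3 * bytes_per_double
total_megabytes = size * bytes_per_element / 1e6

print(bytes_per_element, "bytes moved per addition")
print(total_megabytes, "MB moved per call")
```

With 24 bytes of memory traffic for every single addition, extra threads mostly end up waiting on the same memory bus.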
Only if the functions are more complex, and the actual operation takes significant time compared to loading and storing the array elements, will you see a big improvement with parallel execution. The example in the numba documentation is one like that:
def func_np(a, b):
    """Control function using Numpy."""
    return np.exp(2.1 * a + 3.2 * b)
@jit('void(double[:], double[:], double[:])', nopython=True, nogil=True)
def inner_func_nb(result, a, b):
    """Function under test."""
    for i in range(len(result)):
        result[i] = math.exp(2.1 * a[i] + 3.2 * b[i])
This actually scales (almost) with the number of threads because two multiplications, one addition and one call to math.exp is much slower than loading and storing results:
func_nb = make_singlethread(inner_func_nb)
func_nb_mt2 = make_multithread(inner_func_nb, 2)
func_nb_mt3 = make_multithread(inner_func_nb, 3)
func_nb_mt4 = make_multithread(inner_func_nb, 4)
a = np.random.rand(size)
b = np.random.rand(size)
correct = timefunc(None, "numpy (1 thread)", func_np, a, b)
timefunc(correct, "numba (1 thread)", func_nb, a, b)
timefunc(correct, "numba (2 threads)", func_nb_mt2, a, b)
timefunc(correct, "numba (3 threads)", func_nb_mt3, a, b)
timefunc(correct, "numba (4 threads)", func_nb_mt4, a, b)
numpy (1 thread) 3422 ms
numba (1 thread) 2959 ms
numba (2 threads) 1555 ms
numba (3 threads) 1080 ms
numba (4 threads) 797 ms
For the sake of completeness: as of 2018 (numba v0.39), you can just do
from numba import prange
and replace range with prange in your original function definition. That's it.
That immediately brings CPU utilization to 100% and, in my case, speeds things up from 2.9 to 1.7 seconds of runtime (for SIZE = 2147483648 * 1, on a machine with 16 cores and 32 threads).
More complex kernels can often be sped up even more by passing in fastmath=True.
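Spelled out, the prange variant of the question's function might look like the sketch below. The try/except fallback is an assumption added only so the snippet still runs where numba is not installed (it then executes as plain Python), and SIZE is reduced drastically compared with the question.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback (an assumption made for this sketch) so it runs without numba.
    prange = range

    def njit(*args, **kwargs):
        def decorator(func):
            return func
        return decorator


@njit(parallel=True)
def add(a, b, c):
    # prange marks the iterations as independent, so numba may spread
    # them across several threads.
    for i in prange(a.shape[0]):
        c[i] = a[i] + b[i]


SIZE = 100_000  # far smaller than in the question, to keep the sketch quick
a = np.full(SIZE, 1, dtype=np.int32)
b = np.full(SIZE, 1, dtype=np.int32)
c = np.empty(SIZE, dtype=np.int32)
add(a, b, c)
print(c[:5])
```

The first call includes compilation time, so any timing comparison should measure a second call.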
I have a multidimensional array (result) that should be filled by some nested loops. Function fun() is a complex and time-consuming function. I want to fill my array elements in a parallel manner, so I can use all my system's processing power.
Here's the code:
import numpy as np
def fun(x, y, z):
    # time-consuming computation...
    # ...
    return output
dim1 = 10
dim2 = 20
dim3 = 30
result = np.zeros([dim1, dim2, dim3])
for i in xrange(dim1):
    for j in xrange(dim2):
        for k in xrange(dim3):
            result[i, j, k] = fun(i, j, k)
My question is: can I parallelize this code or not? If yes, how?
I'm using Windows 10 64-bit and python 2.7.
Please provide your solution by changing my code if you can.
If you want a more general solution, taking advantage of fully parallel execution, then why not use something like this:
>>> import multiprocess as mp
>>> p = mp.Pool()
>>> # a time consuming function taking x,y,z,...
>>> def fun(*args):
...     import time
...     time.sleep(.1)
...     return sum(*args)
>>> dim1, dim2, dim3 = 10, 20, 30
>>> import itertools
>>> input = itertools.product(xrange(dim1), xrange(dim2), xrange(dim3))
>>> results = p.map(fun, input)
>>> p.close()
>>> p.join()
>>> results[:2]
[0, 1]
>>> results[-2:]
[56, 57]
Note that I'm using multiprocess instead of multiprocessing, but only to be able to work in the interpreter.
I didn't use a numpy.array, but if you had to, you could dump the output from p.map directly into a numpy.array and then set its shape attribute to (dim1, dim2, dim3), or you could do something like this:
>>> input = itertools.product(xrange(dim1), xrange(dim2), xrange(dim3))
>>> import numpy as np
>>> results = np.empty(dim1*dim2*dim3)
>>> res = p.imap(fun, input)
>>> for i,r in enumerate(res):
...     results[i] = r
>>> results.shape = (dim1,dim2,dim3)
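For readers on Python 3 with the standard multiprocessing module, the same flatten-then-reshape idea can be sketched as follows. Here `fun` is only a cheap placeholder, and `fill_parallel` and the chunksize value are hypothetical choices made for this illustration.

```python
import itertools
from multiprocessing import Pool

import numpy as np


def fun(x, y, z):
    # Placeholder for the time-consuming computation.
    return x + 10 * y + 100 * z


def fill_parallel(dims, processes=4):
    # product() yields (i, j, k) in row-major order, which matches reshape().
    indices = itertools.product(*(range(d) for d in dims))
    with Pool(processes) as pool:
        # starmap unpacks each (i, j, k) tuple into fun(i, j, k).
        flat = pool.starmap(fun, indices, chunksize=64)
    return np.array(flat).reshape(dims)


if __name__ == "__main__":
    result = fill_parallel((3, 4, 5))
    print(result[2, 3, 4] == fun(2, 3, 4))
```

A larger chunksize reduces inter-process communication overhead when each individual call to `fun` is short.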
Here is a version of the code that runs fun(i, j, k) in parallel for different k indices. This is done by running fun in different processes using pool.map:
import numpy as np
from multiprocessing import Pool
def fun(x, y, z):
    # time-consuming computation...
    # ...
    return output
def fun_wrapper(indices):
    return fun(*indices)
if __name__ == '__main__':
    dim1 = 10
    dim2 = 20
    dim3 = 30
    result = np.zeros([dim1, dim2, dim3])
    pool = Pool(processes=8)
    for i in xrange(dim1):
        for j in xrange(dim2):
            result[i, j] = pool.map(fun_wrapper, [(i, j, k) for k in xrange(dim3)])
This is not the most elegant solution, but you can start with it. You will get a speedup only if fun contains time-consuming computation.
A simple approach is to divide the array into sections and create some threads to operate on these sections. For example, one section covering i from 0 to 5 and another covering i from 5 to 10, each over the full j and k ranges.
You can use the threading module and do something like this:
import numpy as np
import threading as t
def fun(x, y, z):
    # time-consuming computation...
    # ...
    return output
dim1 = 10
dim2 = 20
dim3 = 30
result = np.zeros([dim1, dim2, dim3])
# b - beginning index, e - end index (exclusive)
def work(ib,jb,kb,ie,je,ke):
    for i in xrange(ib,ie):
        for j in xrange(jb,je):
            for k in xrange(kb,ke):
                result[i, j, k] = fun(i, j, k)
threads = list()
threads.append(t.Thread(target=work, args=(0,0,0,dim1//2,dim2,dim3)))
threads.append(t.Thread(target=work, args=(dim1//2,0,0,dim1,dim2,dim3)))
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
You can define these sections with some algorithm and determine the number of threads dynamically. I hope this helps, or at least gives you some ideas.
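One possible way to compute the sections for an arbitrary number of threads is sketched below in Python 3. The helper `chunk_bounds` is a hypothetical name, `fun` is a cheap placeholder, and nested lists are used instead of a NumPy array only to keep the sketch dependency-free.

```python
import threading


def chunk_bounds(n, parts):
    # Split range(n) into `parts` contiguous (start, stop) pairs of near-equal size.
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def fun(i, j, k):
    # Placeholder for the real computation.
    return i + j + k


dim1, dim2, dim3 = 10, 4, 3
result = [[[0] * dim3 for _ in range(dim2)] for _ in range(dim1)]


def work(ib, ie):
    # Each thread fills its own block of rows along the first axis.
    for i in range(ib, ie):
        for j in range(dim2):
            for k in range(dim3):
                result[i][j][k] = fun(i, j, k)


threads = [threading.Thread(target=work, args=b) for b in chunk_bounds(dim1, 3)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(chunk_bounds(dim1, 3))
```

In CPython, threads running pure-Python code take turns because of the global interpreter lock, so this layout pays off mainly when `fun` spends its time in code that releases the lock, such as I/O or many compiled extensions.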