Since I moved from Python 3.5 to 3.6, parallel computation with joblib no longer reduces the computation time.
Here are the installed library versions:
- python: 3.6.3
- joblib: 0.11
- numpy: 1.14.0
Based on a very well known example, I give below a sample code to reproduce the problem:
import time
import numpy as np
from joblib import Parallel, delayed
def square_int(i):
    return i * i
ndata = 1000000
ti = time.time()
results = []
for i in range(ndata):
    results.append(square_int(i))
duration = np.round(time.time() - ti,4)
print(f"standard computation: {duration} s" )
for njobs in [1,2,3,4] :
    ti = time.time()
    results = []
    results = Parallel(n_jobs=njobs, backend="multiprocessing")\
        (delayed(square_int)(i) for i in range(ndata))
    duration = np.round(time.time() - ti,4)
    print(f"{njobs} jobs computation: {duration} s" )
I got the following output:
standard computation: 0.2672 s
1 jobs computation: 352.3113 s
2 jobs computation: 6.9662 s
3 jobs computation: 7.2556 s
4 jobs computation: 7.097 s
When I increase ndata by a factor of 10 and drop the single-job run, I get these results:
standard computation: 2.4739 s
2 jobs computation: 77.8861 s
3 jobs computation: 79.9909 s
4 jobs computation: 83.1523 s
Does anyone have an idea which direction I should investigate?
I think the primary reason is that the overhead of running in parallel outweighs the benefits. In other words, your square_int is too simple to gain any performance from parallelism: passing the input and output between processes may take more time than executing square_int itself.
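To get a rough feel for that overhead, here is a minimal sketch that compares calling square_int directly with the same loop where every argument and result also passes through pickle, which is roughly what has to happen when values cross a process boundary. The helper names (roundtrip, time_direct, time_with_serialization) are made up for illustration, and the sketch ignores pipes, locks and scheduling, so it is only a lower bound on the real cost, not a model of what joblib does internally.

```python
import pickle
import time


def square_int(i):
    return i * i


def roundtrip(obj):
    """Serialize and deserialize obj, as would happen when it crosses a process boundary."""
    return pickle.loads(pickle.dumps(obj))


def time_direct(n):
    """Wall-clock seconds for n plain calls to square_int."""
    start = time.perf_counter()
    for i in range(n):
        square_int(i)
    return time.perf_counter() - start


def time_with_serialization(n):
    """Same loop, but each argument and each result also goes through pickle."""
    start = time.perf_counter()
    for i in range(n):
        roundtrip(square_int(roundtrip(i)))
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 100_000  # arbitrary illustrative size
    print(f"direct calls:       {time_direct(n):.4f} s")
    print(f"with serialization: {time_with_serialization(n):.4f} s")
```

If the second number dominates the first on your machine, the function is too cheap for per-item dispatch to pay off.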
I modified your code by creating a square_int_batch function. It reduced the computation time a lot, though it is still more than the serial implementation.
import time
import numpy as np
from joblib import Parallel, delayed
def square_int(i):
    return i * i
def square_int_batch(a,b):
    results=[]
    for i in range(a,b):
        results.append(square_int(i))
    return results
ndata = 1000000
ti = time.time()
results = []
for i in range(ndata):
    results.append(square_int(i))
# results = [square_int(i) for i in range(ndata)]
duration = np.round(time.time() - ti,4)
print(f"standard computation: {duration} s" )
batch_num = 3
batch_size=int(ndata/batch_num)
for njobs in [2,3,4] :
    ti = time.time()
    results = []
    a = list(range(ndata))
    # results = Parallel(n_jobs=njobs, )(delayed(square_int)(i) for i in range(ndata))
    # results = Parallel(n_jobs=njobs, backend="multiprocessing")(delayed(
    results = Parallel(n_jobs=njobs)(delayed(
        square_int_batch)(i*batch_size,(i+1)*batch_size) for i in range(batch_num))
    duration = np.round(time.time() - ti,4)
    print(f"{njobs} jobs computation: {duration} s" )
And the computation timings are
standard computation: 0.3184 s
2 jobs computation: 0.5079 s
3 jobs computation: 0.6466 s
4 jobs computation: 0.4836 s
A few other suggestions that will help reduce the time:
- Use a list comprehension, results = [square_int(i) for i in range(ndata)], instead of the for loop. In my tests it was faster for this specific case.
- Set batch_num to a reasonable size. The larger this value is, the more overhead there is. In my case it got significantly slower once batch_num exceeded 1000.
- I used the default backend, loky, instead of multiprocessing. It is slightly faster, at least in my case.
From a few other SO questions, I read that multiprocessing is good for CPU-heavy tasks, for which I don't have an official definition. You can explore that yourself.
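One detail about the batching above: batch_size = int(ndata/batch_num) rounds down, so with ndata = 1000000 and batch_num = 3 the three batches cover only indices 0 to 999998 and the last item is never processed. Below is a small sketch of a splitter that spreads the remainder over the first batches. The name split_range is hypothetical, not part of joblib.

```python
def split_range(n, parts):
    """Return (start, stop) pairs that together cover range(n) in `parts` near-equal pieces."""
    base, extra = divmod(n, parts)
    bounds = []
    start = 0
    for k in range(parts):
        # The first `extra` pieces each take one additional item.
        stop = start + base + (1 if k < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


if __name__ == "__main__":
    for a, b in split_range(1_000_000, 3):
        print(a, b)
```

Each (start, stop) pair can then be passed to square_int_batch in place of (i*batch_size, (i+1)*batch_size).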
The issue
I am trying to optimise some calculations which lend themselves to so-called embarrassingly parallel calculations, but I am finding that using python's multiprocessing package actually slows things down.
My question is: am I doing something wrong, or is there an intrinsic reason why parallelisation actually slows things down? Is it because I am using numba? Would other packages like joblib or dask make much of a difference?
There are loads of similar questions, in which the answer is always that the overhead costs more than the time savings, but all those questions tend to revolve around very simple functions, whereas I would have expected something with nested loops to lend itself better to parallelisation. I have also not found comparisons among joblib, multiprocessing and dask.
My function
I have a function which takes a one-dimensional numpy array as argument of shape n, and outputs a numpy array of shape (n x t), where each row is independent, i.e. row 0 of the output depends only on item 0 of the input, row 1 on item 1, etc. Something like this:
The underlying calculation is optimised with numba, which speeds things up by several orders of magnitude.
Toy example - results
I cannot share the exact code, so I have come up with a toy example. The calculation defined in my_fun_numba is actually irrelevant, it's just some very banal number crunching to keep the CPU busy.
With the toy example, the results on my PC are these, and they are very similar to what I get with my actual code.
As you can see, splitting the input array into different chunks and sending each of them to multiprocessing.pool actually slows things down vs just using numba on a single core.
What I have tried
I have tried various combinations of the cache and nogil options in the numba.jit decorator, but the difference is minimal.
I have profiled the code (not the timeit.Timer part, just a single run) with PyCharm and, if I understand the output correctly, it seems most of the time is spent waiting for the pool.
Sorted by time:
Sorted by own time:
Toy example - the code
import numpy as np
import pandas as pd
import multiprocessing
from multiprocessing import Pool
import numba
import timeit
@numba.jit(nopython = True, nogil = True, cache = True)
def my_fun_numba(x):
    dim2 = 10
    out = np.empty((len(x), dim2))
    n = len(x)
    for r in range(n):
        for c in range(dim2):
            out[r,c] = np.cos(x[r]) ** 2 + np.sin(x[r]) ** 2
    return out
def my_fun_non_numba(x):
    dim2 = 10
    out = np.empty((len(x), dim2))
    n = len(x)
    for r in range(n):
        for c in range(dim2):
            out[r,c] = np.cos(x[r]) ** 2 + np.sin(x[r]) ** 2
    return out
def my_func_parallel(inp, func, cpus = None):
    # Default to all cores but one when cpus is not given
    if cpus is None:
        cpus = max(1, multiprocessing.cpu_count() - 1)
    inp_split = np.array_split(inp,cpus)
    # Note: a new pool is created and torn down on every call
    pool = Pool(cpus)
    out = np.vstack(pool.map(func, inp_split) )
    pool.close()
    pool.join()
    return out
if __name__ == "__main__":
    inputs = np.array([100,10e3,1e6] ).astype(int)
    res = pd.DataFrame(index = inputs, columns =['no paral, no numba','no paral, numba','numba 6 cores','numba 12 cores'])
    r = 3
    n = 1
    for i in inputs:
        my_arg = np.arange(0,i)
        res.loc[i, 'no paral, no numba'] = min(
            timeit.Timer("my_fun_non_numba(my_arg)", globals=globals()).repeat(repeat=r, number=n)
        )
        res.loc[i, 'no paral, numba'] = min(
            timeit.Timer("my_fun_numba(my_arg)", globals=globals()).repeat(repeat=r, number=n)
        )
        res.loc[i, 'numba 6 cores'] = min(
            timeit.Timer("my_func_parallel(my_arg, my_fun_numba, cpus = 6)", globals=globals()).repeat(repeat=r, number=n)
        )
        res.loc[i, 'numba 12 cores'] = min(
            timeit.Timer("my_func_parallel(my_arg, my_fun_numba, cpus = 12)", globals=globals()).repeat(repeat=r, number=n)
        )
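Since my_func_parallel builds a new Pool inside every timed call, the measured time includes starting and stopping the worker processes, not just the mapped work. One way to see how much that contributes is to time the two parts separately. This is a minimal sketch with a stand-in workload (math.sqrt over a list) rather than the numba function. The timed helper, the worker count of 4 and the data size are all arbitrary choices for illustration.

```python
import math
import time
from multiprocessing import Pool


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    data = list(range(100_000))  # arbitrary illustrative workload

    # Cost of starting the worker processes only.
    pool, startup = timed(Pool, 4)
    try:
        # Cost of the map itself on an already running pool.
        _, work = timed(pool.map, math.sqrt, data)
    finally:
        pool.close()
        pool.join()

    print(f"pool startup:     {startup:.4f} s")
    print(f"map on warm pool: {work:.4f} s")
```

If startup is a large share of the total, creating the pool once and reusing it across calls may be worth trying before changing anything else.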
I have this code which I would like to use multi-processing to speed up:
matrix=[]
for i in range(len(datasplit)):
    matrix.append(np.array(np.asarray(datasplit[i].split()),dtype=float))
The variable "datasplit" is a list of strings. Each string has around 50 numbers separated by spaces. For each string, this code splits it on whitespace into a list of substrings, one per number, turns that list into an array of strings, and then converts the strings into floats, so we end up with an array of 50 numbers. After the code has run, printing "matrix" gives a list of arrays, where each array holds 50 floats.
Now my problem is that the length of datasplit is huge. It has a length of ~ 10^7. This code takes around 15 minutes to run. I need to run this for 124 other samples of similar size, so I would like to use multiprocessing to speed up the run time.
How exactly would I re-write my code using multiprocessing to get it to run faster?
I appreciate any help.
The Python standard library provides two options for multiprocessing: the modules multiprocessing and concurrent.futures. The second adds a layer of abstraction on top of the first. For simple map scenarios like yours, the usage is pretty simple.
Here's something to experiment with:
import numpy as np
from time import time
from os import cpu_count
from multiprocessing import Pool
from concurrent.futures import ProcessPoolExecutor
def string_to_float(string):
    return np.array(np.asarray(string.split()), dtype=float)
if __name__ == '__main__':
    # Example datasplit
    rng = np.random.default_rng()
    num_strings = 100000
    datasplit = [' '.join(str(n) for n in rng.random(50))
                 for _ in range(num_strings)]
    # Looping (sequential processing)
    start = time()
    matrix = []
    for i in range(len(datasplit)):
        matrix.append(np.array(np.asarray(datasplit[i].split()), dtype=float))
    print(f'Duration of sequential processing: {time() - start:.2f} secs')
    # Setting up multiprocessing
    num_workers = int(0.8 * cpu_count())
    chunksize = max(1, int(len(datasplit) / num_workers))
    # Multiprocessing with Pool
    start = time()
    with Pool(num_workers) as p:
        matrix = p.map(string_to_float, datasplit, chunksize)
    print(f'Duration of parallel processing (Pool): {time() - start:.2f} secs')
    # Multiprocessing with ProcessPoolExecutor
    start = time()
    with ProcessPoolExecutor(num_workers) as ppe:
        matrix = list(ppe.map(string_to_float, datasplit, chunksize=chunksize))
    print(f'Duration of parallel processing (PPE): {time() - start:.2f} secs')
You should play around with num_workers and, more importantly, the chunksize variable. The values I've used here worked well for me in quite a few situations. Both arguments are optional, so you can also let the system choose, but the results can be suboptimal, especially when the amount of data to be processed is large.
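To make the effect of those two settings more concrete, here is a small sketch of the arithmetic. Pool.map cuts the input into chunks of roughly chunksize items and submits each chunk as one task, so the number of round trips to the workers is about len(data) / chunksize. The helper names (pick_workers, pick_chunksize, n_tasks) are hypothetical. pick_workers also guards against two edge cases in the expression int(0.8 * cpu_count()): os.cpu_count() can return None, and on a single-core machine the product truncates to 0.

```python
import math
import os


def pick_workers(fraction=0.8):
    """Use a fraction of the available cores, but never fewer than one worker."""
    cpus = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, int(fraction * cpus))


def pick_chunksize(n_items, n_workers):
    """One large chunk per worker, as in the snippet above."""
    return max(1, n_items // n_workers)


def n_tasks(n_items, chunksize):
    """How many chunks (tasks) the pool has to dispatch."""
    return math.ceil(n_items / chunksize)


if __name__ == "__main__":
    workers = pick_workers()
    for chunksize in (1, 10_000, pick_chunksize(10_000_000, workers)):
        print(f"chunksize={chunksize}: {n_tasks(10_000_000, chunksize)} tasks")
```

Very small chunks mean many tasks and a lot of dispatch overhead. One chunk per worker minimizes dispatch but gives the pool no room to rebalance if some chunks take longer than others.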
For 10 million strings (your range) and chunksize=10000 my machine produced the following results:
Duration of sequential processing: 393.78 secs
Duration of parallel processing (Pool): 73.76 secs
Duration of parallel processing (PPE): 85.82 secs
PS: Why do you use np.array(np.asarray(string.split()), dtype=float) instead of np.asarray(string.split(), dtype=float)?
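This will split your tasks across multiple cores and can speed things up, potentially by 4-8x depending on how many cores you have:
from multiprocessing import Pool
import os
import numpy as np
# Pool.map pickles the callable, so it has to be a module-level function, not a lambda
def to_float_array(x):
    return np.array(np.asarray(x.split()),dtype=float)
if __name__ == '__main__':
    pool = Pool(os.cpu_count())
    # Add your data to the datasplit variable below:
    datasplit = []
    results = pool.map(to_float_array, datasplit)
    pool.close()
    pool.join()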
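One caveat when handing a callable to Pool.map: the function is sent to the workers via pickle, and the standard pickle module cannot serialize a lambda, so a lambda passed to pool.map fails at runtime. A quick check, using a hypothetical is_picklable helper and a plain-Python parse_line as a stand-in for the numpy conversion:

```python
import pickle


def parse_line(line):
    """Module-level function: split on whitespace and convert each token to float."""
    return [float(tok) for tok in line.split()]


def is_picklable(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


if __name__ == "__main__":
    print("named function:", is_picklable(parse_line))
    print("lambda:        ", is_picklable(lambda line: line.split()))
```

Defining the worker as a named function at module level avoids the problem.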
I'm trying to understand why the parallel processing using joblib is slower than single cpu operation?
Below is my code.
from joblib import Parallel, delayed
import multiprocessing
import time
inputs = range(10000)
def processInput(i):
    return i * i
if __name__ == '__main__':
    num_cores = multiprocessing.cpu_count()
    start_time = time.process_time()
    results = Parallel(n_jobs=num_cores)(delayed(processInput)(i) for i in inputs)
    #print(results)
    print("Time taken parallel: " + str(time.process_time() - start_time))
    results=[]
    start_time = time.process_time()
    for i in inputs:
        results.append(processInput(i))
    #print(results)
    print("Time taken single cpu: " + str(time.process_time() - start_time))
Output:
Time taken parallel: 2.4427331139999997
Time taken single cpu: 0.00196953699999991
The overhead of spawning the processes is much higher than the computation time. In practice there is nothing to gain from using multiprocessing in this context.
If you change your function you will start observing improvements.
For instance, let's just change the current function with a naive recursive Fibonacci function.
inputs = range(25, 35)
def processInput(n):
    if n < 2:
        return n
    return processInput(n-2) + processInput(n-1)
Time taken parallel: 0.06956500000000002
Time taken single cpu: 8.295273
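A word of caution about the timings in this thread: they are taken with time.process_time(), which reports the CPU time of the calling process only. It does not include time spent sleeping or waiting, and it does not include CPU time used by worker processes. For parallel code the parent mostly waits on its workers, so process_time() can report far less than the time you actually wait. time.perf_counter() measures elapsed wall-clock time instead. The sketch below uses time.sleep as a stand-in for "waiting on workers". The measure helper is hypothetical.

```python
import time


def measure(fn, *args):
    """Run fn(*args) and return (wall_seconds, cpu_seconds_of_this_process)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn(*args)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


if __name__ == "__main__":
    # Sleeping stands in for a parent process waiting on its workers.
    wall, cpu = measure(time.sleep, 0.2)
    print(f"wall-clock: {wall:.4f} s")
    print(f"CPU time:   {cpu:.4f} s")
```

For a fair comparison between the serial and parallel versions, it is worth re-running both with time.perf_counter().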
So the case is the following:
I wanted to compare the runtime for a matrix multiplication
with ipython parallel and just running on a single core.
Code for normal execution:
import timeit
import numpy as np
n = 13
dim_1, dim_2, dim_3, dim_4 = 2**n, 2**n, 2**n, 2**n
A = np.random.random((dim_1, dim_2))
B = np.random.random((dim_3, dim_4))
start = timeit.time.time()
C = np.matmul(A,B)
dur = timeit.time.time() - start
This takes about 24 seconds on my notebook.
Then I try the same thing in parallel.
I start four engines using: ipcluster start -n 4 (I have 4 cores).
Then I run in my notebook:
from ipyparallel import Client
c = Client()
dview = c.load_balanced_view()
%px import numpy
def pdot(view_obj, A_mat, B_mat):
    # Use the function's own arguments rather than the globals A and B
    view_obj['B'] = B_mat
    view_obj.scatter('A', A_mat)
    view_obj.execute('C=A.dot(B)')
    return view_obj.gather('C', block=True)
start = timeit.time.time()
pdot(dview, A, B)
dur1 = timeit.time.time() - start
dur1
which takes approximately 34 seconds.
When I view in the system monitor I can see, that in both
cases all cores are used. In the parallel case there seems to
be an overhead where they aren't on 100 % usage (I suppose that's
the part where they get scattered across the engines).
In the non parallel part immediately all cores are on 100 % usage.
This surprises me as I always thought python was intrinsically
run on a single core.
Would be happy if somebody has more insight into this.
I have to solve a complex network optimization problem using a heuristic algorithm (e.g. an ant algorithm). The algorithm has two steps: 1) calculate new solutions using a random component, 2) evaluate the new solutions. Both steps are very time-consuming, so I run them in parallel in subprograms using multiprocessing.
As the number of iterations grows, the duration of each step increases very fast. I traced this to the delay between the initialization of the parallel processes (labels timeMainCalculate and timeMainEvaluate) and the start of the first subprogram (labels timeSubCalculate and timeSubEvaluate). In the first iteration, the differences timeMainCalculate-timeSubCalculate and timeMainEvaluate-timeSubEvaluate are nearly 0. After 100 iterations they are about 10 seconds for both steps, and after 200 iterations about 20 seconds. So the delay grows linearly, while the time spent on calculation and evaluation inside the subprograms stays constant. This suggests a problem in the communication between the main program and the subprograms using multiprocessing.Pool.
For your information: I'm using Python 3.3 on an eight-core computer.
opt_heuristic.py:
import multiprocessing
import datetime
import calculate, evaluate
epsilon = 1e-10
nbOfCpusForParallel = 6
nbCalculation = 6
def calculate_bound_update_information(result):
    # Do some calculation using result
    return [bound,x,y,z]
if __name__ == '__main__':
    # Initialize x,y,z
    while bound > epsilon:
        # Calculate new Solution
        pool = multiprocessing.Pool(processes=nbOfCpusForParallel)
        result_parallel = list()
        for i in range(nbCalculation):
            result_parallel.append(pool.apply_async(calculate.main,[x,y,z]))
        timeMainCalculate = datetime.datetime.now()
        pool.close()
        pool.join()
        resultCalculation = [result_parallel[i].get() for i in range(nbCalculation)]
        # Evaluate Solutions
        pool = multiprocessing.Pool(processes=nbOfCpusForParallel)
        argsEvalute = [[resultCalculation[i][0],resultCalculation[i][1]] for i in range(len(resultCalculation))]
        result_evaluate = list()
        for i in range(len(resultCalculation)):
            result_evaluate.append(pool.apply_async(evaluate.main,argsEvalute[i]))
        timeMainEvaluate = datetime.datetime.now()
        pool.close()
        pool.join()
        resultEvaluation = [result_evaluate[i].get() for i in range(len(resultCalculation))]
        [bound,x,y,z] = calculate_bound_update_information(resultEvaluation)
calculate.py:
import datetime
def main(x,y,z):
    timeSubCalculate = datetime.datetime.now()
    # Do some random calculation using x,y,z
    return result
evaluate.py:
import datetime
def main(x,y):
    timeSubEvaluate = datetime.datetime.now()
    # Do some evaluation using x,y
    return result
It seems to me that the main program stores some information about the parallel processes. I tried a few things, such as deleting the pool variable, but without success.
Does anybody have an idea what the technical problem is and how it could be solved? Thanks.
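This is not a diagnosis of the growing delay, but one thing that may be worth trying: the loop above creates and tears down two pools per iteration, whereas a single pool can be created once and reused for both steps in every iteration. The sketch below shows that structure only. math.sqrt and math.floor are arbitrary stand-ins for calculate.main and evaluate.main, run_iterations is a hypothetical name, and the update rule for x is invented so the example runs.

```python
import math
from multiprocessing import Pool


def run_iterations(map_fn, n_iter, n_tasks):
    """Run a calculate/evaluate loop, using map_fn (e.g. pool.map) for both steps."""
    x = 2.0
    for _ in range(n_iter):
        # Stand-in for the parallel "calculate" step.
        candidates = list(map_fn(math.sqrt, [x + k for k in range(n_tasks)]))
        # Stand-in for the parallel "evaluate" step.
        scores = list(map_fn(math.floor, candidates))
        # Invented update rule, only here so the loop has state to carry forward.
        x = float(max(scores)) + 2.0
    return x


if __name__ == "__main__":
    # The pool is created once, outside the loop, and reused in every iteration.
    with Pool(processes=2) as pool:
        print(run_iterations(pool.map, n_iter=5, n_tasks=6))
```

Passing the mapping function in as an argument also makes it easy to run the same loop serially with the built-in map and compare timings.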