Why is python joblib parallel processing slower than single cpu? - python

I'm trying to understand why parallel processing using joblib is slower than single-CPU operation.
Below is my code.
from joblib import Parallel, delayed
import multiprocessing
import time

inputs = range(10000)

def processInput(i):
    return i * i

if __name__ == '__main__':
    num_cores = multiprocessing.cpu_count()

    start_time = time.process_time()
    results = Parallel(n_jobs=num_cores)(delayed(processInput)(i) for i in inputs)
    #print(results)
    print(str(time.process_time() - start_time))

    results = []
    start_time = time.process_time()
    for i in inputs:
        results.append(processInput(i))
    #print(results)
    print(str(time.process_time() - start_time))
Output:
Time taken parallel: 2.4427331139999997
Time taken single cpu: 0.00196953699999991

The overhead introduced to spawn the processes is much higher than the computation time itself. In practice there is no gain from using multiprocessing in this context.
If you change your function to do more work per call, you will start observing improvements.
For instance, let's just change the current function with a naive recursive Fibonacci function.
inputs = range(25, 35)

def processInput(n):
    if n < 2:
        return n
    return processInput(n-2) + processInput(n-1)
Time taken parallel: 0.06956500000000002
Time taken single cpu: 8.295273
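If you want to keep the cheap function anyway, joblib can also amortize the spawning and dispatch cost by grouping many tiny calls into one batch. A minimal sketch, assuming joblib's batch_size argument is available in your version (the value 1000 is just a starting point to tune):
from joblib import Parallel, delayed
import multiprocessing

inputs = range(10000)

def processInput(i):
    return i * i

if __name__ == '__main__':
    num_cores = multiprocessing.cpu_count()
    # batch_size groups many tiny tasks into one inter-process message,
    # which reduces the per-task dispatch and pickling overhead
    results = Parallel(n_jobs=num_cores, batch_size=1000)(
        delayed(processInput)(i) for i in inputs)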

Related

How do I use multiprocessing on Python to speed up a for loop?

I have this code which I would like to speed up using multiprocessing:
matrix = []
for i in range(len(datasplit)):
    matrix.append(np.array(np.asarray(datasplit[i].split()), dtype=float))
The variable "datasplit" is a comma-separated list of strings. Each string has around 50 numbers which are separated by a space. For each string, this code adds commas between these numbers instead of spaces, turns the entire string into an array, and turns each individual number into a string. This would now look like a an array of comma-separated strings where each string is 1 of the 50 numbers. The code then turns these strings into floats, so now we have an array of 50 comma separated numbers. After the code has run, printing, "matrix" would give a list of arrays, where each array has 50 comma separated numbers.
Now my problem is that the length of datasplit is huge. It has a length of ~ 10^7. This code takes around 15 minutes to run. I need to run this for 124 other samples of similar size, so I would like to use multiprocessing to speed up the run time.
How exactly would I re-write my code using multiprocessing to get it to run faster?
I appreciate any help.
The Python standard library provides two options for multiprocessing: the modules multiprocessing and concurrent.futures. The second adds a layer of abstraction onto the first. For map scenarios like yours the usage is straightforward.
Here's something to experiment with:
import numpy as np
from time import time
from os import cpu_count
from multiprocessing import Pool
from concurrent.futures import ProcessPoolExecutor

def string_to_float(string):
    return np.array(np.asarray(string.split()), dtype=float)

if __name__ == '__main__':
    # Example datasplit
    rng = np.random.default_rng()
    num_strings = 100000
    datasplit = [' '.join(str(n) for n in rng.random(50))
                 for _ in range(num_strings)]

    # Looping (sequential processing)
    start = time()
    matrix = []
    for i in range(len(datasplit)):
        matrix.append(np.array(np.asarray(datasplit[i].split()), dtype=float))
    print(f'Duration of sequential processing: {time() - start:.2f} secs')

    # Setting up multiprocessing
    num_workers = int(0.8 * cpu_count())
    chunksize = max(1, int(len(datasplit) / num_workers))

    # Multiprocessing with Pool
    start = time()
    with Pool(num_workers) as p:
        matrix = p.map(string_to_float, datasplit, chunksize)
    print(f'Duration of parallel processing (Pool): {time() - start:.2f} secs')

    # Multiprocessing with ProcessPoolExecutor
    start = time()
    with ProcessPoolExecutor(num_workers) as ppe:
        matrix = list(ppe.map(string_to_float, datasplit, chunksize=chunksize))
    print(f'Duration of parallel processing (PPE): {time() - start:.2f} secs')
You should play around with num_workers and, more importantly, the chunksize variable. The values I've used here have worked well for me in quite a few situations. You can also let the system decide what to choose (those arguments are optional), but the results can be suboptimal, especially when the amount of data to be processed is large.
For 10 million strings (your range) and chunksize=10000 my machine produced the following results:
Duration of sequential processing: 393.78 secs
Duration of parallel processing (Pool): 73.76 secs
Duration of parallel processing (PPE): 85.82 secs
PS: Why do you use np.array(np.asarray(string.split()), dtype=float) instead of np.asarray(string.split(), dtype=float)?
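As a side note to that PS, the simpler variant mentioned there is enough on its own; a minimal sketch of what it does (same result for well-formed input):
import numpy as np

def string_to_float(string):
    # split() already yields a list of number strings,
    # which np.asarray can convert to floats in one step
    return np.asarray(string.split(), dtype=float)

print(string_to_float("0.1 0.2 0.3"))  # -> array([0.1, 0.2, 0.3])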
This will split your tasks across multiple cores and can speed up the run time by roughly 4-8x on a typical machine. Note that multiprocessing cannot pickle a lambda, so the conversion is moved into a named function:
from multiprocessing import Pool
import os
import numpy as np

def to_float_array(x):
    # named function instead of a lambda, since multiprocessing cannot pickle lambdas
    return np.array(np.asarray(x.split()), dtype=float)

if __name__ == '__main__':
    # Add your data to the datasplit variable below:
    datasplit = []

    pool = Pool(os.cpu_count())
    results = pool.map(to_float_array, datasplit)
    pool.close()
    pool.join()

Multiprocessing from joblib doesn't parallelize?

Since I moved from Python 3.5 to 3.6, the parallel computation using joblib is not reducing the computation time.
Here are the installed library versions:
- python: 3.6.3
- joblib: 0.11
- numpy: 1.14.0
Based on a very well known example, I give below a sample code to reproduce the problem:
import time
import numpy as np
from joblib import Parallel, delayed

def square_int(i):
    return i * i

ndata = 1000000
ti = time.time()
results = []
for i in range(ndata):
    results.append(square_int(i))
duration = np.round(time.time() - ti, 4)
print(f"standard computation: {duration} s")

for njobs in [1, 2, 3, 4]:
    ti = time.time()
    results = []
    results = Parallel(n_jobs=njobs, backend="multiprocessing")\
        (delayed(square_int)(i) for i in range(ndata))
    duration = np.round(time.time() - ti, 4)
    print(f"{njobs} jobs computation: {duration} s")
I got the following output:
standard computation: 0.2672 s
1 jobs computation: 352.3113 s
2 jobs computation: 6.9662 s
3 jobs computation: 7.2556 s
4 jobs computation: 7.097 s
When I increase ndata by a factor of 10 and remove the 1-core computation, I get these results:
standard computation: 2.4739 s
2 jobs computation: 77.8861 s
3 jobs computation: 79.9909 s
4 jobs computation: 83.1523 s
Does anyone have an idea in which direction I should investigate?
I think the primary reason is that the overhead from parallelism beats the benefits. In other words, your square_int is too simple to gain any performance improvement from parallelization. It is so simple that passing input and output between processes may take more time than executing the function itself.
I modified your code by creating a square_int_batch function. It reduced the computation time a lot, though it is still more than the serial implementation.
import time
import numpy as np
from joblib import Parallel, delayed

def square_int(i):
    return i * i

def square_int_batch(a, b):
    results = []
    for i in range(a, b):
        results.append(square_int(i))
    return results

ndata = 1000000
ti = time.time()
results = []
for i in range(ndata):
    results.append(square_int(i))
# results = [square_int(i) for i in range(ndata)]
duration = np.round(time.time() - ti, 4)
print(f"standard computation: {duration} s")

batch_num = 3
batch_size = int(ndata / batch_num)

for njobs in [2, 3, 4]:
    ti = time.time()
    results = []
    # results = Parallel(n_jobs=njobs)(delayed(square_int)(i) for i in range(ndata))
    # results = Parallel(n_jobs=njobs, backend="multiprocessing")(...)
    results = Parallel(n_jobs=njobs)(delayed(square_int_batch)(i * batch_size, (i + 1) * batch_size)
                                     for i in range(batch_num))
    duration = np.round(time.time() - ti, 4)
    print(f"{njobs} jobs computation: {duration} s")
And the computation timings are
standard computation: 0.3184 s
2 jobs computation: 0.5079 s
3 jobs computation: 0.6466 s
4 jobs computation: 0.4836 s
A few other suggestions that will help reduce the time:
- Use a list comprehension, results = [square_int(i) for i in range(ndata)], to replace the for loop in your specific case; it is faster. I tested it.
- Set batch_num to a reasonable size. The larger this value is, the more overhead there is. It started to get significantly slower when batch_num exceeded 1000 in my case (see the sketch after this list for joblib's built-in alternative to manual batching).
- I used the default backend loky instead of multiprocessing. It is slightly faster, at least in my case.
- From a few other SO questions, I read that multiprocessing is good for CPU-heavy tasks, for which I don't have an official definition. You can explore that yourself.
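If you would rather not write the batching helper yourself, joblib's Parallel also accepts a batch_size argument that groups many small calls into a single dispatch. A minimal sketch under that assumption (the value 10000 is only a starting point to tune, not a figure from the original answer):
import time
from joblib import Parallel, delayed

def square_int(i):
    return i * i

ndata = 1000000

if __name__ == '__main__':
    ti = time.time()
    # batch_size bundles many tiny calls per worker dispatch, so the
    # inter-process overhead is paid once per batch instead of once per item
    results = Parallel(n_jobs=4, batch_size=10000)(
        delayed(square_int)(i) for i in range(ndata))
    print(f"batched joblib computation: {round(time.time() - ti, 4)} s")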

Python 3 multiprocessing on 1 core gives overhead that grows with workload

I am testing the parallel capabilities of Python 3, which I intend to use in my code. I observe unexpectedly slow behaviour, so I boiled my code down to the following proof of principle: calculate a simple logarithmic series, once serially and once in parallel using 1 core. One would imagine that the timing for these two examples would be the same, except for a small overhead associated with initializing and closing the multiprocessing.Pool class. However, what I observe is that the overhead grows linearly with problem size, and thus the parallel solution on 1 core is significantly worse relative to the serial solution even for large inputs. Please tell me if I am doing something wrong.
import time
import numpy as np
import multiprocessing
import matplotlib.pyplot as plt

def foo(x):
    return sum([np.log(1 + i*x) for i in range(10)])

def serial_series(rangeMax):
    return [foo(x) for x in range(rangeMax)]

def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(foo, tuple(range(rangeMax)))
    pool.terminate()
    pool.join()
    return rez

nTask = [1 + i ** 2 * 1000 for i in range(1, 2)]
nTimeSerial = []
nTimeParallel = []
for taskSize in nTask:
    print('TaskSize', taskSize)
    start = time.time()
    rez = serial_series(taskSize)
    end = time.time()
    nTimeSerial.append(end - start)

    start = time.time()
    rez = parallel_series_1core(taskSize)
    end = time.time()
    nTimeParallel.append(end - start)

plt.plot(nTask, nTimeSerial)
plt.plot(nTask, nTimeParallel)
plt.legend(['serial', 'parallel 1 core'])
plt.show()
Edit:
It was commented that the overhead may be due to creating multiple jobs. Here is a modification of the parallel function that should explicitly create only one job. I still observe linear growth of the overhead:
def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(serial_series, [rangeMax])
    pool.terminate()
    pool.join()
    return rez
Edit 2: Once more, here is the exact code that produces linear growth. Adding a print statement inside the serial_series function confirms that it is called only once per call of parallel_series_1core.
import time
import numpy as np
import multiprocessing
import matplotlib.pyplot as plt

def foo(x):
    return sum([np.log(1 + i*x) for i in range(10)])

def serial_series(rangeMax):
    return [foo(i) for i in range(rangeMax)]

def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(serial_series, [rangeMax])
    pool.terminate()
    pool.join()
    return rez

nTask = [1 + i ** 2 * 1000 for i in range(1, 20)]
nTimeSerial = []
nTimeParallel = []
for taskSize in nTask:
    print('TaskSize', taskSize)
    start = time.time()
    rez1 = serial_series(taskSize)
    end = time.time()
    nTimeSerial.append(end - start)

    start = time.time()
    rez2 = parallel_series_1core(taskSize)
    end = time.time()
    nTimeParallel.append(end - start)

plt.plot(nTask, nTimeSerial)
plt.plot(nTask, nTimeParallel)
plt.plot(nTask, [i / j for i, j in zip(nTimeParallel, nTimeSerial)])
plt.legend(['serial', 'parallel 1 core', 'ratio'])
plt.show()
When you use Pool.map() you're essentially telling it to split the passed iterable into jobs over all available sub-processes (which is one in your case) - the larger the iterable, the more 'jobs' are created on the first call. That's what initially adds a large (trumped only by the process creation itself), albeit linear, overhead.
Since sub-processes do not share memory, all changing data on POSIX systems (due to forking) and all data (even static) on Windows has to be pickled on one end and unpickled on the other. The pool also needs time to clear out the process stack for the next job, and there is overhead from system thread switching (that's out of your control - you'd have to mess with the system's scheduler to reduce that one).
For simple/quick tasks a single process will always trump multiprocessing.
UPDATE - As I was saying above, the additional overhead comes from the fact that for any data exchange between processes Python transparently performs a pickling/unpickling routine. Since the list you return from the serial_series() function grows in size over time, so does the performance penalty for pickling/unpickling. Here's a simple demonstration of it based on your code:
import math
import pickle
import sys
import time

# multi-platform precision timer
get_timer = time.clock if sys.platform == "win32" else time.time

def foo(x):  # logic/computation function
    return sum([math.log(1 + i*x) for i in range(10)])

def serial_series(max_range):  # main sub-process function
    return [foo(i) for i in range(max_range)]

def serial_series_slave(max_range):  # subprocess interface
    return pickle.dumps(serial_series(pickle.loads(max_range)))

def serial_series_master(max_range):  # main process interface
    return pickle.loads(serial_series_slave(pickle.dumps(max_range)))

tasks = [1 + i ** 2 * 1000 for i in range(1, 20)]
simulated_times = []
for task in tasks:
    print("Simulated task size: {}".format(task))
    start = get_timer()
    res = serial_series_master(task)
    simulated_times.append((task, get_timer() - start))
At the end, simulated_times will contain something like:
[(1001, 0.010015994115533963), (4001, 0.03402641167313844), (9001, 0.06755546622419131),
(16001, 0.1252664260421834), (25001, 0.18815836740279515), (36001, 0.28339434475444325),
(49001, 0.3757235840503601), (64001, 0.4813749807557435), (81001, 0.6115452710446636),
(100001, 0.7573718332506543), (121001, 0.9228750064147522), (144001, 1.0909038813527427),
(169001, 1.3017281342479343), (196001, 1.4830192955746764), (225001, 1.7117389965616931),
(256001, 1.9392146632682739), (289001, 2.19192682050668), (324001, 2.4497541011649187),
(361001, 2.7481495578097466)]
showing clear greater-than-linear processing time increase as the list grows bigger. This is what essentially happens with multiprocessing - if your sub-process function didn't return anything it would end up considerably faster.
If you have a large amount of data you need to share among processes, I'd suggest using some in-memory database (like Redis) and having your sub-processes connect to it to store/retrieve data.
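To illustrate that last point, here is a minimal sketch (with a hypothetical output file name) of keeping the bulky result out of the return value: the worker writes its list to disk and only a tiny value has to be pickled back to the parent:
import multiprocessing
import numpy as np

def foo(x):
    return sum(np.log(1 + i * x) for i in range(10))

def serial_series_to_disk(rangeMax):
    # compute in the worker and write the bulky result to disk,
    # so almost nothing has to be pickled back to the parent
    data = np.array([foo(i) for i in range(rangeMax)])
    np.save('series_result.npy', data)  # hypothetical output path
    return len(data)

if __name__ == '__main__':
    with multiprocessing.Pool(processes=1) as pool:
        n = pool.map(serial_series_to_disk, [100000])[0]
    result = np.load('series_result.npy')  # load in the parent only if needed
    print(n, result[:3])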

Multiprocessing for calculating eigen value

I'm generating 100 random int matrices of size 1000x1000. I'm using the multiprocessing module to calculate the eigen values of the 100 matrices.
The code is given below:
import timeit
import numpy as np
import multiprocessing as mp

def calEigen():
    S, U = np.linalg.eigh(a)

def multiprocess(processes):
    pool = mp.Pool(processes=processes)
    # Start timing here as I don't want to include time taken to initialize the processes
    start = timeit.default_timer()
    results = [pool.apply_async(calEigen, args=())]
    stop = timeit.default_timer()
    print ("Process:", processes, stop - start)
    results = [p.get() for p in results]
    results.sort()  # to sort the results

if __name__ == "__main__":
    global a
    a = []
    for i in range(0, 100):
        a.append(np.random.randint(1, 100, size=(1000, 1000)))

    # Print execution time without multiprocessing
    start = timeit.default_timer()
    calEigen()
    stop = timeit.default_timer()
    print stop - start

    # With 1 process
    multiprocess(1)
    # With 2 processes
    multiprocess(2)
    # With 3 processes
    multiprocess(3)
    # With 4 processes
    multiprocess(4)
The output is
0.510247945786
('Process:', 1, 5.1021575927734375e-05)
('Process:', 2, 5.698204040527344e-05)
('Process:', 3, 8.320808410644531e-05)
('Process:', 4, 7.200241088867188e-05)
Another iteration showed this output:
69.7296020985
('Process:', 1, 0.0009050369262695312)
('Process:', 2, 0.023727893829345703)
('Process:', 3, 0.0003509521484375)
('Process:', 4, 0.057518959045410156)
My questions are these:
Why doesn't the execution time decrease as the number of processes increases? Am I using the multiprocessing module correctly?
Am I calculating the execution time correctly?
I have edited the code given in the comments below. I want the serial and multiprocessing functions to find the eigenvalues of the same list of 100 matrices. The edited code is:
import numpy as np
import time
from multiprocessing import Pool

a = []
for i in range(0, 100):
    a.append(np.random.randint(1, 100, size=(1000, 1000)))

def serial(z):
    result = []
    start_time = time.time()
    for i in range(0, 100):
        result.append(np.linalg.eigh(z[i]))  # calculate eigenvalues and append to result list
    end_time = time.time()
    print("Single process took :", end_time - start_time, "seconds")

def caleigen(c):
    result = []
    result.append(np.linalg.eigh(c))  # calculate eigenvalues and append to result list
    return result

def mp(x, z):
    start_time = time.time()
    with Pool(processes=x) as pool:  # start a pool of x workers
        result = pool.map_async(caleigen, z)  # distribute work to workers
        result = result.get()  # collect result from MapResult object
    end_time = time.time()
    print("Multiprocessing took:", end_time - start_time, "seconds")

if __name__ == "__main__":
    serial(a)
    mp(1, a)
    mp(2, a)
    mp(3, a)
    mp(4, a)
There is no reduction in the time as the number of processes increases. Where am I going wrong? Does multiprocessing divide the list into chunks for the processes or do I have to do the division?
You're not using the multiprocessing module correctly. As @dopstar pointed out, you're not dividing your task. There is only one task for the process pool, so no matter how many workers you assign, only one will get the job. As for your second question, I didn't use timeit to measure process time precisely. I just used the time module to get a crude sense of how fast things are. It serves the purpose most of the time, though. If I understand what you're trying to do correctly, this should be the single-process version of your code:
import numpy as np
import time

result = []
start_time = time.time()
for i in range(100):
    a = np.random.randint(1, 100, size=(1000, 1000))  # generate random matrix
    result.append(np.linalg.eigh(a))  # calculate eigenvalues and append to result list
end_time = time.time()
print("Single process took :", end_time - start_time, "seconds")
The single process version took 15.27 seconds on my computer. Below is the multiprocess version, which took only 0.46 seconds on my computer. I also included the single process version for comparison. (The single process version has to be enclosed in the if block as well and placed after the multiprocess version.) Because you would like to repeat your calculation 100 times, it'd be a lot easier to create a pool of workers and let them take on unfinished tasks automatically than to manually start each process and specify what each process should do. Here in my code, the argument of the caleigen call is merely there to keep track of how many times the task has been executed. Finally, map_async is generally faster than apply_async, its downsides being that it consumes slightly more memory and takes only one argument for the function call. The reason for using map_async and not map is that in this case the order in which results are returned does not matter and map_async is much faster than map.
from multiprocessing import Pool
import numpy as np
import time

def caleigen(x):  # define work for each worker
    a = np.random.randint(1, 100, size=(1000, 1000))
    S, U = np.linalg.eigh(a)
    return S, U

if __name__ == "__main__":
    start_time = time.time()
    with Pool(processes=4) as pool:  # start a pool of 4 workers
        result = pool.map_async(caleigen, range(100))  # distribute work to workers
        result = result.get()  # collect result from MapResult object
    end_time = time.time()
    print("Multiprocessing took:", end_time - start_time, "seconds")

    # Run the single process version for comparison. This has to be within the if block as well.
    result = []
    start_time = time.time()
    for i in range(100):
        a = np.random.randint(1, 100, size=(1000, 1000))  # generate random matrix
        result.append(np.linalg.eigh(a))  # calculate eigenvalues and append to result list
    end_time = time.time()
    print("Single process took :", end_time - start_time, "seconds")

python pool.map vs map

I need to apply a complex function to a long series. Done sequentially this seems to be inefficient as I have access to a 12 core machine.
Before making a significant investment in this, I wrote up this simple version which compares pool.map and map.
1) Surprisingly map is much much faster (see results below).
2) And there is an overflow error in the pool function which does not come up on the map version.
3) with a smaller array, the run time warning does not appear, but map is still faster.
I am not a computer science guy (just a functional user) - any thoughts and suggestions? I went with pool.map because the async version may mess up the order of the series (which is a pain for me to sort out).
SEE UPDATE BELOW: Based on Jdog's suggestion.
Config: python v 2.7, 64 bit version, windows 7
#-------------------------------------------------------------------------------
# Name:        poolMap
#-------------------------------------------------------------------------------
import multiprocessing as mp
import numpy as np
import time

def func(x):
    y = x*x
    return y

def worker(inputs):
    num = mp.cpu_count()
    print 'num of cpus', num
    pool = mp.Pool(num)
    #inputs = list(inputs)
    #print "inputs type", type(inputs)
    results = pool.map(func, inputs)
    pool.close()
    pool.join()
    return results

if __name__ == '__main__':
    series = np.arange(500000)

    start = time.clock()
    poolAnswer = worker(series)
    end = time.clock()
    print 'pool time', (end - start)

    start = time.clock()
    answer = map(func, series)
    end = time.clock()
    print 'map time', (end - start)
results:
num of cpus 12
pool time 2.40276007188
D:\poolmap.py:19: RuntimeWarning: overflow encountered in long_scalars
y=x*x
map time 0.904187849745
##############UPDATE
Using this func gave me the results I was looking for
def func(x):
    x = float(x)
    y = (((x*x)**0.35))*x + np.ma.sqrt((((x*x)**0.35)))
    return y
results:
num of cpus 12
pool time 12.7410957475
map time 45.4550067581
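For the cheap x*x function in the original test, most of the pool time is per-item dispatch between processes. A minimal sketch (written for Python 3 rather than the 2.7 setup above, with a chunksize value that would need tuning) of passing pool.map an explicit chunksize to amortize that cost:
import multiprocessing as mp
import numpy as np
import time

def func(x):
    return x * x

if __name__ == '__main__':
    series = np.arange(500000)
    with mp.Pool(mp.cpu_count()) as pool:
        start = time.perf_counter()
        # a large chunksize sends thousands of items per message instead of one,
        # so the per-item pickling/dispatch overhead mostly disappears
        results = pool.map(func, series, chunksize=10000)
        print('pool time with chunksize', time.perf_counter() - start)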
