python pool.map vs map - python

I need to apply a complex function to a long series. Done sequentially this seems to be inefficient as I have access to a 12 core machine.
Before making significant investment in this, i wrote up this simple version which compares pool.map and map.
1) Surprisingly map is much much faster (see results below).
2) And there is an overflow error in the pool function which does not come up on the map version.
3) with a smaller array, the run time warning does not appear, but map is still faster.
I am not a computer science guy (just a functional user) - any thoughts and suggestions? I went with pool.map because the async version may mess up the order of the series (which is a pain for me to sort out).
SEE UPDATE BELOW : Based on Jdog's suggestion.
Config: python v 2.7, 64 bit version, windows 7
#-------------------------------------------------------------------------------
# Name: poolMap
#-------------------------------------------------------------------------------
import multiprocessing as mp
import numpy as np
import time
def func(x):
y=x*x
return y
def worker(inputs):
num=mp.cpu_count()
print 'num of cpus', num
pool = mp.Pool(num)
#inputs = list(inputs)
#print "inputs type",type(inputs)
results = pool.map(func, inputs)
pool.close()
pool.join()
return results
if __name__ == '__main__':
series = np.arange(500000)
start = time.clock()
poolAnswer = worker(series)
end = time.clock()
print 'pool time' ,(end - start)
start = time.clock()
answer = map(func,series)
end = time.clock()
print 'map time', (end - start)
results:
num of cpus 12
pool time 2.40276007188
D:\poolmap.py:19: RuntimeWarning: overflow encountered in long_scalars
y=x*x
map time 0.904187849745
##############UPDATE
Using this func gave me the results I was looking for
def func(x):
x=float(x)
y=(((x*x)**0.35))*x+np.ma.sqrt((((x*x)**0.35)))
return y
results:
num of cpus 12
pool time 12.7410957475
map time 45.4550067581

Related

Why is multiprocessing slower than single-core? Would using joblib or dask make a difference?

The issue
I am trying to optimise some calculations which lend themselves to so-called embarrassingly parallel calculations, but I am finding that using python's multiprocessing package actually slows things down.
My question is: am I doing something wrong, or is there an intrinsic reason why parallelisation actually slows things down? Is it because I am using numba? Would other packages like joblib or dak make much of a difference?
There are loads of similar questions, in which the answer is always that the overhead costs more than the time savings, but all those questions tend to revolve around very simple functions, whereas I would have expected something with nested loops to lend itself better to parallelisation. I have also not found comparisons among joblib, multiprocessing and dask.
My function
I have a function which takes a one-dimensional numpy array as argument of shape n, and outputs a numpy array of shape (n x t), where each row is independent, i.e. row 0 of the output depends only on item 0 of the input, row 1 on item 1, etc. Something like this:
The underlying calculation is optimised with numba , which speeds things up by various orders of magnitude.
Toy example - results
I cannot share the exact code, so I have come up with a toy example. The calculation defined in my_fun_numba is actually irrelevant, it's just some very banal number crunching to keep the CPU busy.
With the toy example, the results on my PC are these, and they are very similar to what I get with my actual code.
As you can see, splitting the input array into different chunks and sending each of them to multiprocessing.pool actually slows things down vs just using numba on a single core.
What I have tried
I have tried various combinations of the cache and nogil options in the numba.jit decorator, but the difference is minimal.
I have profiled the code (not the timeit.Timer part, just a single run) with PyCharm and, if I understand the output correctly, it seems most of the time is spent waiting for the pool.
Sorted by time:
Sorted by own time:
Toy example - the code
import numpy as np
import pandas as pd
import multiprocessing
from multiprocessing import Pool
import numba
import timeit
#numba.jit(nopython = True, nogil = True, cache = True)
def my_fun_numba(x):
dim2 = 10
out = np.empty((len(x), dim2))
n = len(x)
for r in range(n):
for c in range(dim2):
out[r,c] = np.cos(x[r]) ** 2 + np.sin(x[r]) ** 2
return out
def my_fun_non_numba(x):
dim2 = 10
out = np.empty((len(x), dim2))
n = len(x)
for r in range(n):
for c in range(dim2):
out[r,c] = np.cos(x[r]) ** 2 + np.sin(x[r]) ** 2
return out
def my_func_parallel(inp, func, cpus = None):
if cpus == None:
cpus = max(1, multiprocessing.cpu_count() - 1)
else:
cpus = cpus
inp_split = np.array_split(inp,cpus)
pool = Pool(cpus)
out = np.vstack(pool.map(func, inp_split) )
pool.close()
pool.join()
return out
if __name__ == "__main__":
inputs = np.array([100,10e3,1e6] ).astype(int)
res = pd.DataFrame(index = inputs, columns =['no paral, no numba','no paral, numba','numba 6 cores','numba 12 cores'])
r = 3
n = 1
for i in inputs:
my_arg = np.arange(0,i)
res.loc[i, 'no paral, no numba'] = min(
timeit.Timer("my_fun_non_numba(my_arg)", globals=globals()).repeat(repeat=r, number=n)
)
res.loc[i, 'no paral, numba'] = min(
timeit.Timer("my_fun_numba(my_arg)", globals=globals()).repeat(repeat=r, number=n)
)
res.loc[i, 'numba 6 cores'] = min(
timeit.Timer("my_func_parallel(my_arg, my_fun_numba, cpus = 6)", globals=globals()).repeat(repeat=r, number=n)
)
res.loc[i, 'numba 12 cores'] = min(
timeit.Timer("my_func_parallel(my_arg, my_fun_numba, cpus = 12)", globals=globals()).repeat(repeat=r, number=n)
)

Why is python joblib parallel processing slower than single cpu?

I'm trying to understand why the parallel processing using joblib is slower than single cpu operation?
Below is my code.
from joblib import Parallel, delayed
import multiprocessing
import time
inputs = range(10000)
def processInput(i):
return i * i
if __name__ == '__main__':
num_cores = multiprocessing.cpu_count()
start_time = time.process_time()
results = Parallel(n_jobs=num_cores)(delayed(processInput)(i) for i in inputs)
#print(results)
print(str(time.process_time() - start_time))
results=[]
start_time = time.process_time()
for i in inputs:
results.append(processInput(i))
#print(results)
print(str(time.process_time() - start_time))
Output:
Time taken parallel: 2.4427331139999997
Time taken single cpu: 0.00196953699999991
The overhead introduced to spawn the processes is much higher than the computation time. In practice there is no gain to use multiprocessing in this context.
If you change your function you will start observing improvements.
For instance, let's just change the current function with a naive recursive Fibonacci function.
inputs = range(25, 35)
def processInput(n):
if n < 2:
return n
return processInput(n-2) + processInput(n-1)
Time taken parallel: 0.06956500000000002
Time taken single cpu: 8.295273

Python 3 multiprocessing on 1 core gives overhead that grows with workload

I am testing the parallel capabilities of Python3, which I intend to use in my code. I observe unexpectedly slow behaviour, and so I boil down my code to the following proof of principle. Let's calculate a simple logarithmic series. Let's do it serial, and in parallel using 1 core. One would imagine that the timing for these two examples would be the same, except for a small overhead associated with initializing and closing the multiprocessing.Pool class. However, what I observe is that the overhead grows linearly with problem size, and thus the parallel solution on 1 core is significantly worse relative to the serial solution even for large inputs. Please tell me if I am doing something wrong
import time
import numpy as np
import multiprocessing
import matplotlib.pyplot as plt
def foo(x):
return sum([np.log(1 + i*x) for i in range(10)])
def serial_series(rangeMax):
return [foo(x) for x in range(rangeMax)]
def parallel_series_1core(rangeMax):
pool = multiprocessing.Pool(processes=1)
rez = pool.map(foo, tuple(range(rangeMax)))
pool.terminate()
pool.join()
return rez
nTask = [1 + i ** 2 * 1000 for i in range(1, 2)]
nTimeSerial = []
nTimeParallel = []
for taskSize in nTask:
print('TaskSize', taskSize)
start = time.time()
rez = serial_series(taskSize)
end = time.time()
nTimeSerial.append(end - start)
start = time.time()
rez = parallel_series_1core(taskSize)
end = time.time()
nTimeParallel.append(end - start)
plt.plot(nTask, nTimeSerial)
plt.plot(nTask, nTimeParallel)
plt.legend(['serial', 'parallel 1 core'])
plt.show()
Edit:
It was commented that the overhead my be due to creating multiple jobs. Here is a modification of the parallel function that should explicitly only make 1 job. I still observe linear growth of the overhead
def parallel_series_1core(rangeMax):
pool = multiprocessing.Pool(processes=1)
rez = pool.map(serial_series, [rangeMax])
pool.terminate()
pool.join()
return rez
Edit 2: Once more, the exact code that produces linear growth. It can be tested with a print statement inside the serial_series function that it is only called once for each call of parallel_series_1core.
import time
import numpy as np
import multiprocessing
import matplotlib.pyplot as plt
def foo(x):
return sum([np.log(1 + i*x) for i in range(10)])
def serial_series(rangeMax):
return [foo(i) for i in range(rangeMax)]
def parallel_series_1core(rangeMax):
pool = multiprocessing.Pool(processes=1)
rez = pool.map(serial_series, [rangeMax])
pool.terminate()
pool.join()
return rez
nTask = [1 + i ** 2 * 1000 for i in range(1, 20)]
nTimeSerial = []
nTimeParallel = []
for taskSize in nTask:
print('TaskSize', taskSize)
start = time.time()
rez1 = serial_series(taskSize)
end = time.time()
nTimeSerial.append(end - start)
start = time.time()
rez2 = parallel_series_1core(taskSize)
end = time.time()
nTimeParallel.append(end - start)
plt.plot(nTask, nTimeSerial)
plt.plot(nTask, nTimeParallel)
plt.plot(nTask, [i / j for i,j in zip(nTimeParallel, nTimeSerial)])
plt.legend(['serial', 'parallel 1 core', 'ratio'])
plt.show()
When you use Pool.map() you're essentially telling it to split the passed iterable into jobs over all available sub-processes (which is one in your case) - the larger the iterable the more 'jobs' are created on the first call. That's what initially adds a huge (trumped only by the process creation itself), albeit linear overhead.
Since sub-processes do not share memory, for all changing data on POSIX systems (due to forking) and all data (even static) on Windows it needs to pickle it on one end and unpickle it on the other. Plus it needs time to clear out the process stack for the next job, plus there is an overhead in system thread switching (that's out of your control, you'd have to mess with the system's scheduler to reduce that one).
For simple/quick tasks a single process will always trump multiprocessing.
UPDATE - As I was saying above, the additional overhead comes from the fact that for any data exchange between processes Python transparently does pickling/unpickling routine. Since the list you return from the serial_series() function grows in size over time, so does the performance penalty for pickling/unpickling. Here's a simple demonstration of it based on your code:
import math
import pickle
import sys
import time
# multi-platform precision timer
get_timer = time.clock if sys.platform == "win32" else time.time
def foo(x): # logic/computation function
return sum([math.log(1 + i*x) for i in range(10)])
def serial_series(max_range): # main sub-process function
return [foo(i) for i in range(max_range)]
def serial_series_slave(max_range): # subprocess interface
return pickle.dumps(serial_series(pickle.loads(max_range)))
def serial_series_master(max_range): # main process interface
return pickle.loads(serial_series_slave(pickle.dumps(max_range)))
tasks = [1 + i ** 2 * 1000 for i in range(1, 20)]
simulated_times = []
for task in tasks:
print("Simulated task size: {}".format(task))
start = get_timer()
res = serial_series_master(task)
simulated_times.append((task, get_timer() - start))
At the end, simulated_times will contain something like:
[(1001, 0.010015994115533963), (4001, 0.03402641167313844), (9001, 0.06755546622419131),
(16001, 0.1252664260421834), (25001, 0.18815836740279515), (36001, 0.28339434475444325),
(49001, 0.3757235840503601), (64001, 0.4813749807557435), (81001, 0.6115452710446636),
(100001, 0.7573718332506543), (121001, 0.9228750064147522), (144001, 1.0909038813527427),
(169001, 1.3017281342479343), (196001, 1.4830192955746764), (225001, 1.7117389965616931),
(256001, 1.9392146632682739), (289001, 2.19192682050668), (324001, 2.4497541011649187),
(361001, 2.7481495578097466)]
showing clear greater-than-linear processing time increase as the list grows bigger. This is what essentially happens with multiprocessing - if your sub-process function didn't return anything it would end up considerably faster.
If you have a large amount of data you need to share among processes, I'd suggest you to use some in-memory database (like Redis) and have your sub-processes connect to it to store/retrieve data.

Multiprocessing for calculating eigen value

I'm generating 100 random int matrices of size 1000x1000. I'm using the multiprocessing module to calculate the eigen values of the 100 matrices.
The code is given below:
import timeit
import numpy as np
import multiprocessing as mp
def calEigen():
S, U = np.linalg.eigh(a)
def multiprocess(processes):
pool = mp.Pool(processes=processes)
#Start timing here as I don't want to include time taken to initialize the processes
start = timeit.default_timer()
results = [pool.apply_async(calEigen, args=())]
stop = timeit.default_timer()
print (processes":", stop - start)
results = [p.get() for p in results]
results.sort() # to sort the results
if __name__ == "__main__":
global a
a=[]
for i in range(0,100):
a.append(np.random.randint(1,100,size=(1000,1000)))
#Print execution time without multiprocessing
start = timeit.default_timer()
calEigen()
stop = timeit.default_timer()
print stop - start
#With 1 process
multiprocess(1)
#With 2 processes
multiprocess(2)
#With 3 processes
multiprocess(3)
#With 4 processes
multiprocess(4)
The output is
0.510247945786
('Process:', 1, 5.1021575927734375e-05)
('Process:', 2, 5.698204040527344e-05)
('Process:', 3, 8.320808410644531e-05)
('Process:', 4, 7.200241088867188e-05)
Another iteration showed this output:
69.7296020985
('Process:', 1, 0.0009050369262695312)
('Process:', 2, 0.023727893829345703)
('Process:', 3, 0.0003509521484375)
('Process:', 4, 0.057518959045410156)
My questions are these:
Why doesn't the time execution time reduce as the number of
processes increase? Am I using the multiprocessing module correctly?
Am I calculating the execution time correctly?
I have edited the code given in the comments below. I want the serial and multiprocessing functions to find the eigen values for the same list of 100 matrices. The edited code is-
import numpy as np
import time
from multiprocessing import Pool
a=[]
for i in range(0,100):
a.append(np.random.randint(1,100,size=(1000,1000)))
def serial(z):
result = []
start_time = time.time()
for i in range(0,100):
result.append(np.linalg.eigh(z[i])) #calculate eigen values and append to result list
end_time = time.time()
print("Single process took :", end_time - start_time, "seconds")
def caleigen(c):
result = []
result.append(np.linalg.eigh(c)) #calculate eigenvalues and append to result list
return result
def mp(x):
start_time = time.time()
with Pool(processes=x) as pool: # start a pool of 4 workers
result = pool.map_async(caleigen,a) # distribute work to workers
result = result.get() # collect result from MapResult object
end_time = time.time()
print("Mutltiprocessing took:", end_time - start_time, "seconds" )
if __name__ == "__main__":
serial(a)
mp(1,a)
mp(2,a)
mp(3,a)
mp(4,a)
There is no reduction in the time as the number of processes increases. Where am I going wrong? Does multiprocessing divide the list into chunks for the processes or do I have to do the division?
You're not using the multiprocessing module correctly. As #dopstar pointed out, you're not dividing your task. There is only one task for the process pool, so not matter how many workers you assigned, only one will get the job. As for your second question, I didn't use timeit to measure process time precisely. I just use time module to get a crude sense of how fast things are. It serves the purpose most of the time, though. If I understand what you're trying to do correctly, this should be the single process version of your code
import numpy as np
import time
result = []
start_time = time.time()
for i in range(100):
a = np.random.randint(1, 100, size=(1000,1000)) #generate random matrix
result.append(np.linalg.eigh(a)) #calculate eigen values and append to result list
end_time = time.time()
print("Single process took :", end_time - start_time, "seconds")
The single process version took 15.27 seconds on my computer. Below is the multiprocess version, which took only 0.46 seconds on my computer. I also included the single process version for comparison. (The single process version has to be enclosed in the if block as well and placed after the multiprocess version.) Because you would like to repeat your calculation for 100 times, it'd be a lot easier to create a pool of workers and let them take on unfinished task automatically than to manually start each process and specify what each process should do. Here in my codes, the argument for the caleigen call is merely to keep track of how many times the task has been executed. Finally, map_async is generally faster than apply_async, with its downside being consuming slightly more memory and taking only one argument for function call. The reason for using map_async but not map is that in this case, the order in which result is returned does not matter and map_async is much faster than map.
from multiprocessing import Pool
import numpy as np
import time
def caleigen(x): # define work for each worker
a = np.random.randint(1,100,size=(1000,1000))
S, U = np.linalg.eigh(a)
return S, U
if __name__ == "main":
start_time = time.time()
with Pool(processes=4) as pool: # start a pool of 4 workers
result = pool.map_async(caleigen, range(100)) # distribute work to workers
result = result.get() # collect result from MapResult object
end_time = time.time()
print("Mutltiprocessing took:", end_time - start_time, "seconds" )
# Run the single process version for comparison. This has to be within the if block as well.
result = []
start_time = time.time()
for i in range(100):
a = np.random.randint(1, 100, size=(1000,1000)) #generate random matrix
result.append(np.linalg.eigh(a)) #calculate eigen values and append to result list
end_time = time.time()
print("Single process took :", end_time - start_time, "seconds")

Python Multiprocessing Increasing Time delay using pool

I have to solve a complex network optimization problem using a heuristic algorithm (e.g. ant algorithm). This algorithm is decomposed in two steps: 1.) Calculate new solutions using an random component, 2.) Evaluate new Solutions. These two parts are very highly time-consuming and solved the problem parallel using multiprocessing in subprograms. With increasing number of iteration the time duration of the steps increases very fast. I localized the time delay between the initialization of the parallel processes (labels timeMainCalculate and timeMainEvaluate) and the start of the first subprogram (labels timeSubCalculate and timeSubEvaluate). In the first iteration the time difference timeMainCalculate-timeSubCalculate respectively timeMainEvaluate-timeSubEvaluate is nearly 0, after 100 iterations it is about 10 seconds for both steps and after 200 steps the time difference is about 20. This time difference is linear increasing. The time duration for calculation and evaluation in the subprograms is constant. So it might be a problem in the communication between the main program and the subprograms using multiprocessing. Pool.
For your Information: I’m using Python 3.3 on a eight core computer.
opt_heuristic.py:
import multiprocessing.Pool
import datetime
import calculate, evaluate
epsilon = 1e-10
nbOfCpusForParallel = 6
nbCalculation = 6
def calculate_bound_update_information(result):
Do_some_calculation using result
return [bound,x,y,z]
if __name__ == '__main__':
Inititalize x,y,z
while bound > epsilon:
# Calculate new Solution
pool = multiprocessing.Pool(processes=nbOfCpusForParallel)
result_parallel = list()
for i in range(nbCalculation):
result_parallel.append(pool.apply_async(calculate.main,[x,y,z]))
timeMainCalculate = datetime.datetime.now()
pool.close()
pool.join()
resultCalculation = [result_parallel[i].get() for i in range(nbCalculation)]
# Evaluate Solutions
pool = multiprocessing.Pool(processes=nbOfCpusForParallel)
argsEvalute = [[resultCalculation[i][0],resultCalculation[i][1]] for i in range(len(resultCalculation))]
result_evaluate = list()
for i in range(len(resultCalculation)):
result_evaluate.append(pool.apply_async(evaluate.main,argsEvalute[i]))
timeMainEvaluate = datetime.datetime.now()
pool.close()
pool.join()
resultEvaluation = [result_evaluate[i].get() for i in range(len(resultCalculation))]
[bound,x,y,z] = calculate_bound_update_information(resultEvaluation)
calculate.py:
import datetime
def main(x,y,z):
timeSubCalculate = datetime.datetime.now()
Do some random calculation using x,y,z
return result
evaluate.py
import datetime
def main(x,y):
timeSubEvaluate = datetime.datetime.now()
Do some evaluation using x,y
return result
I seems to me that the main program store some information of the parallel process. I tried some things like delete the pool variable but it was not successful.
Has somebody an idea what's the technical problem and how it could be solved? Thanks.

Categories

Resources