Fastest way to perform Multiprocessing of a loop in a function? - python

1. I have a function var. I want to know the best way to run the loop inside this function in parallel, utilizing all of the system's processors, cores, and RAM.
import numpy as np
from pysheds.grid import Grid

xs = 82.1206, 72.4542, 65.0431, 83.8056, 35.6744
ys = 25.2111, 17.9458, 13.8844, 10.0833, 24.8306

a = r'/home/test/image1.tif'
b = r'/home/test/image2.tif'

def var(interest):
    variable_avg = []
    for (x, y) in zip(xs, ys):
        grid = Grid.from_raster(interest, data_name='map')
        grid.catchment(data='map', x=x, y=y, out_name='catch')
        variable = grid.view('catch', nodata=np.nan)
        variable = np.array(variable)
        variablemean = variable.mean()
        variable_avg.append(variablemean)
    return variable_avg
2. It would be great if I could run both the function var and the loop inside it in parallel, for multiple parameters of the function, e.g. var(a) and var(b) at the same time, since that would take much less time than parallelizing the loop alone.
Ignore 2 if it does not make sense.

TLDR:
You can use the multiprocessing library to run your var function in parallel. However, as written you likely don't make enough calls to var for multiprocessing to have a performance benefit because of its overhead. If all you need to do is run those two calls, running in serial is likely the fastest you'll get. However, if you need to make a lot of calls, multiprocessing can help you out.
We'll need to use a process pool to run this in parallel; threads won't work here because Python's global interpreter lock prevents true parallelism. The drawback of process pools is that processes are heavyweight to spin up. In the example of just running two calls to var, the time to create the pool overwhelms the time spent running var itself.
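As a small illustration of that point (not from the original answer): a CPU-bound function run through a thread pool takes about as long as running it serially, because the GIL serializes the work, while a process pool can actually use several cores.

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n):
    # pure-Python arithmetic holds the GIL, so threads cannot overlap it
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    work = [5_000_000] * 4

    start = time.time()
    with ThreadPoolExecutor(max_workers=4) as ex:
        list(ex.map(cpu_bound, work))
    print(f'Threads took {time.time() - start:.2f}s')    # roughly the serial time

    start = time.time()
    with ProcessPoolExecutor(max_workers=4) as ex:
        list(ex.map(cpu_bound, work))
    print(f'Processes took {time.time() - start:.2f}s')  # faster, minus pool startup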
To illustrate this, let's use a process pool with asyncio to run calls to var in parallel and compare it to running things sequentially. Note: to run this example I used an image from the Pysheds repository (https://github.com/mdbartos/pysheds/tree/master/data); if your image is much larger, the results below may not hold.
import functools
import time
from concurrent.futures.process import ProcessPoolExecutor
import asyncio

a = 'diem.tif'
xs = 10, 20, 30, 40, 50
ys = 10, 20, 30, 40, 50

async def main():
    loop = asyncio.get_event_loop()
    pool_start = time.time()
    with ProcessPoolExecutor() as pool:
        task_one = loop.run_in_executor(pool, functools.partial(var, a))
        task_two = loop.run_in_executor(pool, functools.partial(var, a))
        results = await asyncio.gather(task_one, task_two)
    pool_end = time.time()
    print(f'Process pool took {pool_end - pool_start}')

    serial_start = time.time()
    result_one = var(a)
    result_two = var(a)
    serial_end = time.time()
    print(f'Running in serial took {serial_end - serial_start}')

if __name__ == "__main__":
    asyncio.run(main())
Running the above on my machine (a 2.4 GHz 8-Core Intel Core i9) I get the following output:
Process pool took 1.7581260204315186
Running in serial took 0.32335805892944336
In this example, a process pool is over five times slower! This is due to the overhead of creating and managing multiple processes. That said, if you need to call var more than just a few times, a process pool may make more sense. Let's adapt this to run var 100 times and compare the results:
async def main():
    loop = asyncio.get_event_loop()
    pool_start = time.time()
    tasks = []
    with ProcessPoolExecutor() as pool:
        for _ in range(100):
            tasks.append(loop.run_in_executor(pool, functools.partial(var, a)))
        results = await asyncio.gather(*tasks)
    pool_end = time.time()
    print(f'Process pool took {pool_end - pool_start}')

    serial_start = time.time()
    for _ in range(100):
        result = var(a)
    serial_end = time.time()
    print(f'Running in serial took {serial_end - serial_start}')
Running 100 times, I get the following output:
Process pool took 3.442288875579834
Running in serial took 13.769982099533081
In this case, running in a process pool is about 4x faster. You may also wish to try running each iteration of your loop concurrently. You can do this by creating a function that processes one x,y coordinate at a time and then running each point you want to examine in a process pool:
def process_poi(interest, x, y):
    grid = Grid.from_raster(interest, data_name='map')
    grid.catchment(data='map', x=x, y=y, out_name='catch')
    variable = grid.view('catch', nodata=np.nan)
    variable = np.array(variable)
    return variable.mean()

async def var_loop_async(interest, pool, loop):
    tasks = []
    for (x, y) in zip(xs, ys):
        function_call = functools.partial(process_poi, interest, x, y)
        tasks.append(loop.run_in_executor(pool, function_call))
    return await asyncio.gather(*tasks)

async def main():
    loop = asyncio.get_event_loop()
    pool_start = time.time()
    tasks = []
    with ProcessPoolExecutor() as pool:
        for _ in range(100):
            tasks.append(var_loop_async(a, pool, loop))
        results = await asyncio.gather(*tasks)
    pool_end = time.time()
    print(f'Process pool took {pool_end - pool_start}')
In this case I get Process pool took 3.2950568199157715, so not really any faster than our first version with one process per call of var. This is likely because the limiting factor at this point is the number of cores available on our CPU; splitting our work into smaller increments does not add much value.
That said, if you have 1000 x and y coordinates you wish to examine across two images, this last approach may yield a performance gain.
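As a sketch of that scenario (the image list and pool sizing here are assumptions, reusing var_loop_async from above), you could schedule every point of both images on the same pool:

import asyncio
import os
import time
from concurrent.futures.process import ProcessPoolExecutor

# hypothetical inputs: two rasters, with xs/ys holding many coordinates
images = ['image1.tif', 'image2.tif']

async def main():
    loop = asyncio.get_event_loop()
    start = time.time()
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        # one var_loop_async per image; every (x, y) point of both
        # images becomes a task on the same pool
        results = await asyncio.gather(
            *(var_loop_async(image, pool, loop) for image in images)
        )
    print(f'Both images took {time.time() - start}')

if __name__ == "__main__":
    asyncio.run(main())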

I think this is a reasonable and straightforward way of speeding up your code by parallelizing only the main loop. You can saturate your cores with this, so there is no need to also parallelize over the interest variable. I can't test the code, so I assume your function is correct; I have just moved the loop body into a new function and parallelized it inside var().
import numpy as np
from multiprocessing import Pool

def var(interest, xs, ys):
    grid = Grid.from_raster(interest, data_name='map')
    with Pool(4) as p:  # uses 4 cores, adjust this as you need
        variable_avg = p.starmap(loop, [(x, y, grid) for x, y in zip(xs, ys)])
    return variable_avg

def loop(x, y, grid):
    grid.catchment(data='map', x=x, y=y, out_name='catch')
    variable = grid.view('catch', nodata=np.nan)
    variable = np.array(variable)
    return variable.mean()
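Usage would then look like the following, reusing a, b, xs, and ys from the question. Note that starmap pickles the grid object once per task; if the Grid instance turns out to be large or not picklable, a variant that calls Grid.from_raster inside loop may be safer.

if __name__ == '__main__':
    print(var(a, xs, ys))  # per-point averages for image1
    print(var(b, xs, ys))  # per-point averages for image2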

Related

Difference in CPU cores used when using Pool map and Pool starmap

I want to use Pool to split a task among n workers. When I use map with one argument in the task function, I observe that all the cores are used and all tasks are launched simultaneously.
On the other hand, when I use starmap, tasks launch one by one and I never reach 100% CPU load.
I want to use starmap in my case because I want to pass a second argument, but there's no point if it doesn't take advantage of multiprocessing.
This is the code that works
import numpy as np
from multiprocessing import Pool

# df_a = just a pandas dataframe which I split in n parts and I
# feed each part to a task. Each one may have a few
# thousand rows

n_jobs = 16

def run_parallel(df_a):
    dfs_a = np.array_split(df_a, n_jobs)
    print("done split")
    pool = Pool(n_jobs)
    result = pool.map(task_function, dfs_a)
    return result

def task_function(left_df):
    print("in task function")
    # execute task...
    return result

result = run_parallel(df_a)
in this case, "in task function" is printed at the same time, 16 times.
This is the code that doesn't work
n_jobs = 16
# df_b: a big pandas dataframe (~1.7M rows, ~20 columns) which I
# want to send to each task as is
def run_parallel(df_a, df_b):
dfs_a = np.array_split(df_a, n_jobs)
print("done split")
pool = Pool(n_jobs)
result = pool.starmap(task_function, zip(dfs_a, repeat(df_b)))
return result
def task_function(left_df, right_df):
print("in task function")
# execute task
return result
result = run_parallel(df_a, df_b)
Here, "in task function" is printed sequentially and the processors never reach 100% capacity. I also tried workarounds based on this answer:
https://stackoverflow.com/a/5443941/6941970
but no luck. Even when I used map in this way:
from functools import partial
pool.map(partial(task_function, b=df_b), dfs_a)
considering that maybe repeat(*very big df*) would introduce memory issues, still there wasn't any real parallelization
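One common workaround for exactly this symptom (the big second argument being pickled once per task by starmap) is to hand df_b to each worker a single time via Pool's initializer and then fall back to plain map. A minimal sketch, assuming df_a and df_b are pandas dataframes:

import numpy as np
from multiprocessing import Pool

n_jobs = 16
_df_b = None  # per-worker global, set once by the initializer

def init_worker(df_b):
    # runs once in each worker process; df_b is pickled n_jobs times
    # in total instead of once per task
    global _df_b
    _df_b = df_b

def task_function(left_df):
    # use left_df together with the worker-local _df_b
    return len(left_df) + len(_df_b)

def run_parallel(df_a, df_b):
    dfs_a = np.array_split(df_a, n_jobs)
    with Pool(n_jobs, initializer=init_worker, initargs=(df_b,)) as pool:
        return pool.map(task_function, dfs_a)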

Running different Python functions in separate CPUs

Using multiprocessing.Pool I can split an input list for a single function to be processed in parallel across multiple CPUs. Like this:
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)
    results = pool.map(f, range(100))
    pool.close()
    pool.join()
However, this does not allow running different functions on different processors. If I want to do something like this, in parallel / simultaneously:
foo1(args1) --> Processor1
foo2(args2) --> Processor2
How can this be done?
Edit: After Darkonaut's remarks, I do not care about specifically assigning foo1 to processor number 1. It can be any processor, as chosen by the OS. I am just interested in running independent functions in different / parallel processes. So rather:
foo1(args1) --> process1
foo2(args2) --> process2
I usually find it easiest to use the concurrent.futures module for concurrency. You can achieve the same with multiprocessing, but concurrent.futures has (IMO) a much nicer interface.
Your example would then be:
from concurrent.futures import ProcessPoolExecutor

def foo1(x):
    return x * x

def foo2(x):
    return x * x * x

if __name__ == '__main__':
    with ProcessPoolExecutor(2) as executor:
        # these return immediately and are executed in parallel, on separate processes
        future_1 = executor.submit(foo1, 1)
        future_2 = executor.submit(foo2, 2)

        # get results / re-raise exceptions that were thrown in workers
        result_1 = future_1.result()  # contains foo1(1)
        result_2 = future_2.result()  # contains foo2(2)
If you have many inputs, it is better to use executor.map with the chunksize argument instead:
from concurrent.futures import ProcessPoolExecutor

def foo1(x):
    return x * x

def foo2(x):
    return x * x * x

if __name__ == '__main__':
    with ProcessPoolExecutor(4) as executor:
        # these return immediately and are executed in parallel, on separate processes
        future_1 = executor.map(foo1, range(10000), chunksize=100)
        future_2 = executor.map(foo2, range(10000), chunksize=100)

        # executor.map returns an iterator which we have to consume to get the results
        result_1 = list(future_1)  # contains [foo1(x) for x in range(10000)]
        result_2 = list(future_2)  # contains [foo2(x) for x in range(10000)]
Note that the optimal values for chunksize, the number of processes, and whether process-based concurrency actually leads to increased performance depend on many factors:
- The runtime of foo1 / foo2. If they are extremely cheap (as in this example), the communication overhead between processes might dominate the total runtime.
- Spawning a process takes time, so the code inside with ProcessPoolExecutor needs to run long enough for this to amortize (see the timing sketch after this list).
- The actual number of physical processors in the machine you are running on.
- Whether your application is IO-bound or compute-bound.
- Whether the functions you use in foo are already parallelized (such as some np.linalg solvers, or scikit-learn estimators).
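A rough way to check whether the pool pays off for your workload is simply to time both variants; here is a sketch with a stand-in function (replace expensive with your real foo):

import time
from concurrent.futures import ProcessPoolExecutor

def expensive(x):
    # stand-in workload; substitute your real function here
    return sum(i * x for i in range(10**6))

if __name__ == '__main__':
    inputs = range(100)

    start = time.perf_counter()
    serial = [expensive(x) for x in inputs]
    print(f'serial: {time.perf_counter() - start:.2f}s')

    start = time.perf_counter()
    with ProcessPoolExecutor(4) as executor:
        parallel = list(executor.map(expensive, inputs, chunksize=10))
    print(f'pool:   {time.perf_counter() - start:.2f}s')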

Multiprocesses will not run in Parallel on Windows on Jupyter Notebook

I'm currently working on Windows in a Jupyter notebook and have been struggling to get multiprocessing to work. It does not run all my async calls in parallel; it runs them one at a time. Please provide some guidance on where I am going wrong. I need to put the results into a variable for future use. What am I not understanding?
import multiprocessing as mp
import cylib
Pool = mp.Pool(processes=4)
result1 = Pool.apply_async(cylib.f, [v]) # evaluate asynchronously
result2 = Pool.apply_async(cylib.f, [x]) # evaluate asynchronously
result3 = Pool.apply_async(cylib.f, [y]) # evaluate asynchronously
result4 = Pool.apply_async(cylib.f, [z]) # evaluate asynchronously
vr = result1.get(timeout=420)
xr = result2.get(timeout=420)
yr = result3.get(timeout=420)
zr = result4.get(timeout=420)
The tasks are executing in parallel.
However, this is fetching the results synchronously i.e. "wait until result1 is ready, then wait until result2 is ready, .." and so on.
vr = result1.get(timeout=420)
xr = result2.get(timeout=420)
yr = result3.get(timeout=420)
zr = result4.get(timeout=420)
Consider the following example code, where each task is polled asynchronously
from time import sleep
import multiprocessing as mp

pool = mp.Pool(processes=4)

# Create tasks with longer wait first
tasks = {i: pool.apply_async(sleep, [t]) for i, t in enumerate(reversed(range(3)))}

done = set()

# Keep polling until all tasks complete
while len(done) < len(tasks):
    for i, t in tasks.items():
        # Skip completed tasks
        if i in done:
            continue
        result = None
        try:
            result = t.get(timeout=0)
        except mp.TimeoutError:
            pass
        else:
            print("Task #:{} complete".format(i))
            done.add(i)
You can replicate something like the above or use the callback argument on apply_async to perform some handling automatically as tasks complete.
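The callback variant could look like this (a sketch; the callback runs in the parent process as each task completes):

from time import sleep
import multiprocessing as mp

def on_done(result):
    # called in the parent process, in completion order
    print("task finished, result:", result)

if __name__ == '__main__':
    pool = mp.Pool(processes=4)
    for t in reversed(range(3)):
        pool.apply_async(sleep, [t], callback=on_done)
    pool.close()  # no more tasks will be submitted
    pool.join()   # wait until all tasks (and callbacks) have finished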

Python 3 multiprocessing on 1 core gives overhead that grows with workload

I am testing the parallel capabilities of Python 3, which I intend to use in my code. I observe unexpectedly slow behaviour, so I boiled my code down to the following proof of principle: calculate a simple logarithmic series, first serially and then "in parallel" using one core. One would imagine that the timing for these two examples would be the same, except for a small overhead associated with initializing and closing the multiprocessing.Pool class. However, what I observe is that the overhead grows linearly with problem size, and thus the parallel solution on one core is significantly worse relative to the serial solution even for large inputs. Please tell me if I am doing something wrong.
import time
import numpy as np
import multiprocessing
import matplotlib.pyplot as plt

def foo(x):
    return sum([np.log(1 + i*x) for i in range(10)])

def serial_series(rangeMax):
    return [foo(x) for x in range(rangeMax)]

def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(foo, tuple(range(rangeMax)))
    pool.terminate()
    pool.join()
    return rez

nTask = [1 + i ** 2 * 1000 for i in range(1, 2)]
nTimeSerial = []
nTimeParallel = []

for taskSize in nTask:
    print('TaskSize', taskSize)
    start = time.time()
    rez = serial_series(taskSize)
    end = time.time()
    nTimeSerial.append(end - start)

    start = time.time()
    rez = parallel_series_1core(taskSize)
    end = time.time()
    nTimeParallel.append(end - start)

plt.plot(nTask, nTimeSerial)
plt.plot(nTask, nTimeParallel)
plt.legend(['serial', 'parallel 1 core'])
plt.show()
Edit: It was commented that the overhead may be due to creating multiple jobs. Here is a modification of the parallel function that explicitly makes only one job. I still observe linear growth of the overhead.
def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(serial_series, [rangeMax])
    pool.terminate()
    pool.join()
    return rez
Edit 2: Once more, the exact code that produces linear growth. Adding a print statement inside serial_series confirms that it is called only once per call of parallel_series_1core.
import time
import numpy as np
import multiprocessing
import matplotlib.pyplot as plt

def foo(x):
    return sum([np.log(1 + i*x) for i in range(10)])

def serial_series(rangeMax):
    return [foo(i) for i in range(rangeMax)]

def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(serial_series, [rangeMax])
    pool.terminate()
    pool.join()
    return rez

nTask = [1 + i ** 2 * 1000 for i in range(1, 20)]
nTimeSerial = []
nTimeParallel = []

for taskSize in nTask:
    print('TaskSize', taskSize)
    start = time.time()
    rez1 = serial_series(taskSize)
    end = time.time()
    nTimeSerial.append(end - start)

    start = time.time()
    rez2 = parallel_series_1core(taskSize)
    end = time.time()
    nTimeParallel.append(end - start)

plt.plot(nTask, nTimeSerial)
plt.plot(nTask, nTimeParallel)
plt.plot(nTask, [i / j for i, j in zip(nTimeParallel, nTimeSerial)])
plt.legend(['serial', 'parallel 1 core', 'ratio'])
plt.show()
When you use Pool.map() you're essentially telling it to split the passed iterable into jobs over all available sub-processes (one, in your case); the larger the iterable, the more 'jobs' are created on the first call. That is what initially adds a huge, albeit linear, overhead, trumped only by process creation itself.
Since sub-processes do not share memory, all changing data on POSIX systems (due to forking) and all data (even static) on Windows needs to be pickled on one end and unpickled on the other. It also takes time to clear out the process stack for the next job, and there is overhead from system thread switching (that's out of your control; you'd have to mess with the system's scheduler to reduce it).
For simple/quick tasks a single process will always trump multiprocessing.
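One related knob, not used in the original code, is Pool.map()'s chunksize argument: it controls how many items are pickled and dispatched per job, so fewer, larger chunks mean fewer IPC round trips. A quick sketch to see its effect on a single-worker pool:

import time
import multiprocessing
import numpy as np

def foo(x):
    return sum([np.log(1 + i * x) for i in range(10)])

if __name__ == '__main__':
    n = 50000
    pool = multiprocessing.Pool(processes=1)
    for chunksize in (1, 50, 5000):
        start = time.time()
        pool.map(foo, range(n), chunksize=chunksize)
        print('chunksize={}: {:.2f}s'.format(chunksize, time.time() - start))
    pool.terminate()
    pool.join()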
UPDATE: As I was saying above, the additional overhead comes from the fact that Python transparently runs a pickling/unpickling routine for any data exchanged between processes. Since the list you return from serial_series() grows with the input size, so does the pickling/unpickling penalty. Here's a simple demonstration of it based on your code:
import math
import pickle
import time

# precision timer (time.clock was removed in Python 3.8, so use perf_counter)
get_timer = time.perf_counter

def foo(x):  # logic/computation function
    return sum([math.log(1 + i*x) for i in range(10)])

def serial_series(max_range):  # main sub-process function
    return [foo(i) for i in range(max_range)]

def serial_series_slave(max_range):  # subprocess interface
    return pickle.dumps(serial_series(pickle.loads(max_range)))

def serial_series_master(max_range):  # main process interface
    return pickle.loads(serial_series_slave(pickle.dumps(max_range)))

tasks = [1 + i ** 2 * 1000 for i in range(1, 20)]
simulated_times = []
for task in tasks:
    print("Simulated task size: {}".format(task))
    start = get_timer()
    res = serial_series_master(task)
    simulated_times.append((task, get_timer() - start))
At the end, simulated_times will contain something like:
[(1001, 0.010015994115533963), (4001, 0.03402641167313844), (9001, 0.06755546622419131),
(16001, 0.1252664260421834), (25001, 0.18815836740279515), (36001, 0.28339434475444325),
(49001, 0.3757235840503601), (64001, 0.4813749807557435), (81001, 0.6115452710446636),
(100001, 0.7573718332506543), (121001, 0.9228750064147522), (144001, 1.0909038813527427),
(169001, 1.3017281342479343), (196001, 1.4830192955746764), (225001, 1.7117389965616931),
(256001, 1.9392146632682739), (289001, 2.19192682050668), (324001, 2.4497541011649187),
(361001, 2.7481495578097466)]
showing a clearly greater-than-linear increase in processing time as the list grows. This is essentially what happens with multiprocessing: if your sub-process function didn't return anything, it would end up considerably faster.
If you have a large amount of data you need to share among processes, I'd suggest using an in-memory database (like Redis) and having your sub-processes connect to it to store/retrieve data.
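A minimal sketch of that suggestion, assuming a Redis server running on localhost and the redis-py package (the key naming scheme here is made up for illustration):

import pickle
from multiprocessing import Pool

import numpy as np
import redis  # pip install redis; assumes a server on localhost:6379

def worker(task_id):
    r = redis.Redis()
    data = np.log1p(np.arange(task_id))
    # store the large result in Redis instead of returning (and pickling) it
    r.set("result:{}".format(task_id), pickle.dumps(data))
    return task_id  # only this tiny value travels back through the pool

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        done = pool.map(worker, [1000, 2000, 3000])
    r = redis.Redis()
    results = {i: pickle.loads(r.get("result:{}".format(i))) for i in done}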

Multiprocessing for calculating eigen value

I'm generating 100 random integer matrices of size 1000x1000, and I'm using the multiprocessing module to calculate the eigenvalues of the 100 matrices.
The code is given below:
import timeit
import numpy as np
import multiprocessing as mp

def calEigen():
    S, U = np.linalg.eigh(a)

def multiprocess(processes):
    pool = mp.Pool(processes=processes)
    # Start timing here as I don't want to include time taken to initialize the processes
    start = timeit.default_timer()
    results = [pool.apply_async(calEigen, args=())]
    stop = timeit.default_timer()
    print("Process:", processes, stop - start)
    results = [p.get() for p in results]
    results.sort()  # to sort the results

if __name__ == "__main__":
    global a
    a = []
    for i in range(0, 100):
        a.append(np.random.randint(1, 100, size=(1000, 1000)))

    # Print execution time without multiprocessing
    start = timeit.default_timer()
    calEigen()
    stop = timeit.default_timer()
    print(stop - start)

    # With 1 process
    multiprocess(1)
    # With 2 processes
    multiprocess(2)
    # With 3 processes
    multiprocess(3)
    # With 4 processes
    multiprocess(4)
The output is
0.510247945786
('Process:', 1, 5.1021575927734375e-05)
('Process:', 2, 5.698204040527344e-05)
('Process:', 3, 8.320808410644531e-05)
('Process:', 4, 7.200241088867188e-05)
Another iteration showed this output:
69.7296020985
('Process:', 1, 0.0009050369262695312)
('Process:', 2, 0.023727893829345703)
('Process:', 3, 0.0003509521484375)
('Process:', 4, 0.057518959045410156)
My questions are these:
1. Why doesn't the execution time reduce as the number of processes increases? Am I using the multiprocessing module correctly?
2. Am I calculating the execution time correctly?
I have edited the code based on the comments below. I want the serial and multiprocessing functions to find the eigenvalues for the same list of 100 matrices. The edited code is:
import numpy as np
import time
from multiprocessing import Pool

a = []
for i in range(0, 100):
    a.append(np.random.randint(1, 100, size=(1000, 1000)))

def serial(z):
    result = []
    start_time = time.time()
    for i in range(0, 100):
        result.append(np.linalg.eigh(z[i]))  # calculate eigenvalues and append to result list
    end_time = time.time()
    print("Single process took :", end_time - start_time, "seconds")

def caleigen(c):
    result = []
    result.append(np.linalg.eigh(c))  # calculate eigenvalues and append to result list
    return result

def mp(x, z):
    start_time = time.time()
    with Pool(processes=x) as pool:  # start a pool of x workers
        result = pool.map_async(caleigen, z)  # distribute work to workers
        result = result.get()  # collect result from MapResult object
    end_time = time.time()
    print("Multiprocessing took:", end_time - start_time, "seconds")

if __name__ == "__main__":
    serial(a)
    mp(1, a)
    mp(2, a)
    mp(3, a)
    mp(4, a)
There is no reduction in time as the number of processes increases. Where am I going wrong? Does multiprocessing divide the list into chunks for the processes, or do I have to do the division myself?
You're not using the multiprocessing module correctly. As #dopstar pointed out, you're not dividing your task. There is only one task for the process pool, so no matter how many workers you assign, only one will get the job. As for your second question, I didn't use timeit to measure process time precisely. I just used the time module to get a crude sense of how fast things are; it serves the purpose most of the time. If I understand what you're trying to do correctly, this should be the single-process version of your code:
import numpy as np
import time

result = []
start_time = time.time()
for i in range(100):
    a = np.random.randint(1, 100, size=(1000, 1000))  # generate random matrix
    result.append(np.linalg.eigh(a))  # calculate eigenvalues and append to result list
end_time = time.time()
print("Single process took :", end_time - start_time, "seconds")
The single-process version took 15.27 seconds on my computer. Below is the multiprocess version, which took only 0.46 seconds on my computer. I also included the single-process version for comparison. (The single-process version has to be enclosed in the if block as well and placed after the multiprocess version.) Because you would like to repeat your calculation 100 times, it'd be a lot easier to create a pool of workers and let them take on unfinished tasks automatically than to manually start each process and specify what each process should do. Here in my code, the argument for the caleigen call is merely to keep track of how many times the task has been executed. Finally, map_async is generally faster than apply_async, with the downside of consuming slightly more memory and taking only one argument per function call. The reason for using map_async and not map is that in this case the order in which results are returned does not matter, and map_async is much faster than map.
from multiprocessing import Pool
import numpy as np
import time

def caleigen(x):  # define work for each worker
    a = np.random.randint(1, 100, size=(1000, 1000))
    S, U = np.linalg.eigh(a)
    return S, U

if __name__ == "__main__":
    start_time = time.time()
    with Pool(processes=4) as pool:  # start a pool of 4 workers
        result = pool.map_async(caleigen, range(100))  # distribute work to workers
        result = result.get()  # collect result from MapResult object
    end_time = time.time()
    print("Multiprocessing took:", end_time - start_time, "seconds")

    # Run the single process version for comparison. This has to be within the if block as well.
    result = []
    start_time = time.time()
    for i in range(100):
        a = np.random.randint(1, 100, size=(1000, 1000))  # generate random matrix
        result.append(np.linalg.eigh(a))  # calculate eigenvalues and append to result list
    end_time = time.time()
    print("Single process took :", end_time - start_time, "seconds")
