Python sequence multi-threading for network tasks

I am trying to run a script with multiple threads to reduce the time it takes to complete.
I need to know how to implement multithreading in a program like this.
Script example:
import requests

someOtherArray = []

def getnetworkdata():
    data = ["somesite.com/1", "somesite.com/2", "somesite.com/3", "somesite.com/4"]
    for url in data:
        r = requests.get(url)
        someOtherArray.append(r.text)
The threads should handle the URLs for this task in order, so the results come back in sequence.
The output I am expecting:
someOtherArray = [1, 2, 3, 4]
I am using Python 2.x

import requests
from multiprocessing.dummy import Pool as ThreadPool

data = ["somesite.com/1", "somesite.com/2", "somesite.com/3", "somesite.com/4"]

# Make the Pool of workers
pool = ThreadPool(4)

# Open the URLs in their own threads and return the results;
# pool.map preserves the order of `data`
results = pool.map(requests.get, data)
someOtherArray = map(lambda x: x.text, results)

# Close the pool and wait for the work to finish
pool.close()
pool.join()
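A side note, not part of the original answer: on Python 3 the outer map() call is lazy, so you would build the list eagerly instead, for example:
# Python 3 variant of the last step (assumes `results` is the list
# returned by pool.map(requests.get, data) above)
someOtherArray = [r.text for r in results]            # ordered like `data`
# or, equivalently:
someOtherArray = list(map(lambda r: r.text, results))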

Related

Running multiple variants of a function asynchronously in Python

How can I run a multiprocessing pool where run1-3 are processed asynchronously, using a multiprocessing tool in Python?
def Numbers(number):
    value = number * 10 / 33
    return value

run1 = Numbers(10)
run2 = Numbers(2)
run3 = Numbers(55)
Simple usage of multiprocessing.Pool():
import multiprocessing  # import package

with multiprocessing.Pool(3) as pool:  # 3 processes
    run1, run2, run3 = pool.map(Numbers, [10, 2, 55])  # map input & output
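One caveat worth adding, not part of the original answer: on platforms that spawn worker processes (e.g. Windows), the pool creation needs to sit under an if __name__ == "__main__": guard, roughly like this:
import multiprocessing

def Numbers(number):
    return number * 10 / 33

if __name__ == "__main__":
    # Without this guard, spawn-based platforms (e.g. Windows) would
    # re-import the module in each worker and try to create pools recursively.
    with multiprocessing.Pool(3) as pool:
        run1, run2, run3 = pool.map(Numbers, [10, 2, 55])
    print(run1, run2, run3)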

Fastest way to perform Multiprocessing of a loop in a function?

1. I have a function var. I want to know the best possible way to run the loop within this function in parallel, using all the processors, cores, threads, and RAM the system has.
import numpy as np
from pysheds.grid import Grid

xs = 82.1206, 72.4542, 65.0431, 83.8056, 35.6744
ys = 25.2111, 17.9458, 13.8844, 10.0833, 24.8306

a = r'/home/test/image1.tif'
b = r'/home/test/image2.tif'

def var(interest):
    variable_avg = []
    for (x, y) in zip(xs, ys):
        grid = Grid.from_raster(interest, data_name='map')
        grid.catchment(data='map', x=x, y=y, out_name='catch')
        variable = grid.view('catch', nodata=np.nan)
        variable = np.array(variable)
        variablemean = variable.mean()
        variable_avg.append(variablemean)
    return variable_avg
2. It would be great if I could run both the function var and the loop inside it in parallel for the given multiple parameters of the function, e.g. var(a) and var(b) at the same time, since that should take much less time than parallelizing the loop alone.
Ignore 2 if it does not make sense.
TLDR:
You can use the multiprocessing library to run your var function in parallel. However, as written you likely don't make enough calls to var for multiprocessing to have a performance benefit because of its overhead. If all you need to do is run those two calls, running in serial is likely the fastest you'll get. However, if you need to make a lot of calls, multiprocessing can help you out.
We'll need to use a process pool to run this in parallel; threads won't work here because Python's global interpreter lock prevents true parallelism. The drawback of process pools is that they are heavyweight to spin up. In the example of just running two calls to var, the time to create the pool overwhelms the time spent running var itself.
To illustrate this, let's use a process pool with asyncio to run calls to var in parallel and compare it to just running things sequentially. Note that to run this example I used an image from the pysheds repository (https://github.com/mdbartos/pysheds/tree/master/data); if your image is much larger, the numbers below may not hold true.
import functools
import time
from concurrent.futures.process import ProcessPoolExecutor
import asyncio

a = 'diem.tif'
xs = 10, 20, 30, 40, 50
ys = 10, 20, 30, 40, 50

async def main():
    loop = asyncio.get_event_loop()

    pool_start = time.time()
    with ProcessPoolExecutor() as pool:
        task_one = loop.run_in_executor(pool, functools.partial(var, a))
        task_two = loop.run_in_executor(pool, functools.partial(var, a))
        results = await asyncio.gather(task_one, task_two)
    pool_end = time.time()
    print(f'Process pool took {pool_end - pool_start}')

    serial_start = time.time()
    result_one = var(a)
    result_two = var(a)
    serial_end = time.time()
    print(f'Running in serial took {serial_end - serial_start}')

if __name__ == "__main__":
    asyncio.run(main())
Running the above on my machine (a 2.4 GHz 8-Core Intel Core i9) I get the following output:
Process pool took 1.7581260204315186
Running in serial took 0.32335805892944336
In this example, a process pool is over five times slower! This is due to the overhead of creating and managing multiple processes. That said, if you need to call var more than just a few times, a process pool may make more sense. Let's adapt this to run var 100 times and compare the results:
async def main():
    loop = asyncio.get_event_loop()

    pool_start = time.time()
    tasks = []
    with ProcessPoolExecutor() as pool:
        for _ in range(100):
            tasks.append(loop.run_in_executor(pool, functools.partial(var, a)))
        results = await asyncio.gather(*tasks)
    pool_end = time.time()
    print(f'Process pool took {pool_end - pool_start}')

    serial_start = time.time()
    for _ in range(100):
        result = var(a)
    serial_end = time.time()
    print(f'Running in serial took {serial_end - serial_start}')
Running 100 times, I get the following output:
Process pool took 3.442288875579834
Running in serial took 13.769982099533081
In this case, running in a process pool is about 4x faster. You may also wish to try running each iteration of your loop concurrently. You can do this by creating a function that processes one x,y coordinate at a time and then running each point you want to examine in a process pool:
def process_poi(interest, x, y):
    grid = Grid.from_raster(interest, data_name='map')
    grid.catchment(data='map', x=x, y=y, out_name='catch')
    variable = grid.view('catch', nodata=np.nan)
    variable = np.array(variable)
    return variable.mean()

async def var_loop_async(interest, pool, loop):
    tasks = []
    for (x, y) in zip(xs, ys):
        function_call = functools.partial(process_poi, interest, x, y)
        tasks.append(loop.run_in_executor(pool, function_call))
    return await asyncio.gather(*tasks)

async def main():
    loop = asyncio.get_event_loop()

    pool_start = time.time()
    tasks = []
    with ProcessPoolExecutor() as pool:
        for _ in range(100):
            tasks.append(var_loop_async(a, pool, loop))
        results = await asyncio.gather(*tasks)
    pool_end = time.time()
    print(f'Process pool took {pool_end - pool_start}')

    serial_start = time.time()
In this case I get Process pool took 3.2950568199157715, so not really any faster than our first version with one process per call of var. This is likely because the limiting factor at this point is how many cores we have available on our CPU; splitting our work into smaller increments does not add much value.
That said, if you have 1000 x and y coordinates you wish to examine across two images, this last approach may yield a performance gain.
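As a rough sketch for point 2 of the question (untested, reusing process_poi and var_loop_async from above and assuming both image paths a and b are defined), the two images can simply be gathered together:
async def main():
    loop = asyncio.get_event_loop()
    with ProcessPoolExecutor() as pool:
        # one coroutine per image; each one fans out over all (x, y) points
        results_a, results_b = await asyncio.gather(
            var_loop_async(a, pool, loop),
            var_loop_async(b, pool, loop),
        )
    print(results_a, results_b)

if __name__ == "__main__":
    asyncio.run(main())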
I think this is a reasonable and straightforward way of speeding up your code by parallelizing only the main loop. You can saturate your cores with this, so there is no need to also parallelize over the interest variable. I can't test the code, so I assume that your function is correct; I have just moved the body of the loop into a new function and parallelized it in var().
from multiprocessing import Pool

def var(interest, xs, ys):
    grid = Grid.from_raster(interest, data_name='map')
    with Pool(4) as p:  # uses 4 cores, adjust this as you need
        variable_avg = p.starmap(loop, [(x, y, grid) for x, y in zip(xs, ys)])
    return variable_avg

def loop(x, y, grid):
    grid.catchment(data='map', x=x, y=y, out_name='catch')
    variable = grid.view('catch', nodata=np.nan)
    variable = np.array(variable)
    return variable.mean()
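A rough usage sketch for this version, untested and assuming the imports and data from the question (and that the pysheds grid object can be pickled to the worker processes):
import numpy as np
from pysheds.grid import Grid

if __name__ == '__main__':
    xs = 82.1206, 72.4542, 65.0431, 83.8056, 35.6744
    ys = 25.2111, 17.9458, 13.8844, 10.0833, 24.8306
    a = r'/home/test/image1.tif'

    # var now takes the coordinate sequences explicitly
    variable_avg = var(a, xs, ys)
    print(variable_avg)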

Multiprocesses will not run in Parallel on Windows on Jupyter Notebook

I'm currently working on Windows in a Jupyter notebook and have been struggling to get multiprocessing to work. It does not run all my async calls in parallel; it runs them one at a time. Please provide some guidance on where I am going wrong. I need to put the results into a variable for future use. What am I not understanding?
import multiprocessing as mp
import cylib
Pool = mp.Pool(processes=4)
result1 = Pool.apply_async(cylib.f, [v]) # evaluate asynchronously
result2 = Pool.apply_async(cylib.f, [x]) # evaluate asynchronously
result3 = Pool.apply_async(cylib.f, [y]) # evaluate asynchronously
result4 = Pool.apply_async(cylib.f, [z]) # evaluate asynchronously
vr = result1.get(timeout=420)
xr = result2.get(timeout=420)
yr = result3.get(timeout=420)
zr = result4.get(timeout=420)
The tasks are executing in parallel.
However, this is fetching the results synchronously, i.e. "wait until result1 is ready, then wait until result2 is ready", and so on.
vr = result1.get(timeout=420)
xr = result2.get(timeout=420)
yr = result3.get(timeout=420)
zr = result4.get(timeout=420)
Consider the following example code, where each task is polled asynchronously
from time import sleep
import multiprocessing as mp

pool = mp.Pool(processes=4)

# Create tasks with the longer waits first
tasks = {i: pool.apply_async(sleep, [t]) for i, t in enumerate(reversed(range(3)))}

done = set()

# Keep polling until all tasks complete
while len(done) < len(tasks):
    for i, t in tasks.items():
        # Skip completed tasks
        if i in done:
            continue
        result = None
        try:
            result = t.get(timeout=0)
        except mp.TimeoutError:
            pass
        else:
            print("Task #:{} complete".format(i))
            done.add(i)
You can replicate something like the above or use the callback argument on apply_async to perform some handling automatically as tasks complete.
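For completeness, a small sketch of the callback route (handle_result is a hypothetical name; the callback runs in the parent process as each task finishes):
from time import sleep
import multiprocessing as mp

results = []

def handle_result(value):
    # called in the parent process whenever a task completes
    results.append(value)

if __name__ == "__main__":
    pool = mp.Pool(processes=4)
    for t in reversed(range(3)):
        pool.apply_async(sleep, [t], callback=handle_result)
    pool.close()
    pool.join()  # wait for all tasks (and their callbacks) to finish
    print(len(results), "tasks completed")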

python multiprocessing hangs after a few iterations

I am running a multiprocessing pool in a for loop over chunks of data. It runs fine for two iterations and hangs on the third. If I reduce the size of each chunk, it hangs later, perhaps on the fourth or fifth iteration. In the program where I discovered the problem I am running a more extensive function, but this works to reproduce the error.
Is there a proper way to terminate a pool after it is finished, so that I can start it again?
import pandas as pd
import numpy as np
from multiprocess import Pool

df = pd.read_csv('paths.csv')

def do_something(user):
    v = df[df['userId'] == user]
    return v

if __name__ == '__main__':
    users = df['userId'].unique()
    n_chunks = round(len(users) / 40)
    subsets = [users[i:i + n_chunks] for i in range(0, len(users), n_chunks)]

    chunk_counter = 0
    for user_subset in subsets:
        chunk_counter += 1
        print(f'Beginning to process chunk {chunk_counter}...')

        with Pool() as pool:
            frames = pool.map(do_something, user_subset)

        pool.close()
        pool.terminate()
        print(f'Completed processing chunk {chunk_counter}.')
I was able to prevent the hanging with the code below:
with Pool(maxtasksperchild=1) as pool:
    frames = pool.map_async(do_something, user_subset).get()

pool.terminate()
pool.join()
I don't understand why using map_async would prevent the hanging. I will dive deeper if I have a chance and update if I understand the reason.
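Not an explanation of the hang, but a related pattern worth trying under the same assumptions (df, do_something and subsets as in the question) is to create the pool once and reuse it across chunks instead of rebuilding it every iteration:
if __name__ == '__main__':
    # Sketch: one pool reused for every chunk, torn down once at the end.
    with Pool(maxtasksperchild=1) as pool:
        all_frames = []
        for chunk_counter, user_subset in enumerate(subsets, start=1):
            print(f'Beginning to process chunk {chunk_counter}...')
            frames = pool.map_async(do_something, user_subset).get()
            all_frames.extend(frames)
            print(f'Completed processing chunk {chunk_counter}.')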

How to efficiently chain ipyparallel tasks and pass intermediate results to engines?

I am trying to chain several tasks together in iPyParallel, like
import ipyparallel

client = ipyparallel.Client()
view = client.load_balanced_view()

def task1(x):
    ## Do some work.
    return x * 2

def task2(x):
    ## Do some work.
    return x * 3

def task3(x):
    ## Do some work.
    return x * 4

results1 = view.map_async(task1, [1, 2, 3])
results2 = view.map_async(task2, results1.get())
results3 = view.map_async(task3, results2.get())
However, this code won't submit any task2 until all of task1 is done, and it is essentially blocking. My tasks can take different amounts of time, so this is very inefficient. Is there an easy way to chain these steps efficiently so that engines can get the results from previous steps? Something like:
def task2(x):
    ## Do some work.
    return x.get() * 3  ## Get AsyncResult out.

def task3(x):
    ## Do some work.
    return x.get() * 4  ## Get AsyncResult out.

results1 = [view.apply_async(task1, x) for x in [1, 2, 3]]

results2 = []
for x in results1:
    view.set_flags(after=x.msg_ids)
    results2.append(view.apply_async(task2, x))

results3 = []
for x in results2:
    view.set_flags(after=x.msg_ids)
    results3.append(view.apply_async(task3, x))
Apparently, this will fail as AsyncResult is not picklable.
I was considering a few solutions:
Use view.map_async(ordered=False).
results1 = view.map_async(task1, [1, 2, 3], ordered=False)
for x in results1:
    results2.append(view.apply_async(task2, x.get()))
But this has to wait for all task1 to finish before any task3 can be submitted. It is still blocking.
Use asyncio.
@asyncio.coroutine
def submitter(x):
    result1 = yield from asyncio.wrap_future(view.apply_async(task1, x))
    result2 = yield from asyncio.wrap_future(view.apply_async(task2, result1))
    result3 = yield from asyncio.wrap_future(view.apply_async(task3, result2))
    return result3

@asyncio.coroutine
def submit_all(ls):
    jobs = [submitter(x) for x in ls]
    results = []
    for async_r in asyncio.as_completed(jobs):
        r = yield from async_r
        results.append(r)
    ## Do some work, like analysing results.
It works, but the code soon becomes messy and unintuitive when more complicated tasks are introduced.
Thank you for your help.
Option 1: chain futures
IPython parallel isn't the best at doing this because the chaining has to be done at the client level. You do have to wait for results to complete and return to the client before submitting the tasks that depend on them. Essentially, your asyncio submit_all is the right way to do it for IPython parallel. You can get something a little more generic by writing a chain function that uses add_done_callback to submit the new task when the previous one completes:
from concurrent.futures import Future
from functools import partial

def chain_apply(view, func, future):
    """Chain a call to view.apply(func, future.result()) when future is ready.

    Returns a Future for the subsequent result.
    """
    f2 = Future()

    # when future is ready, submit a new task for func on its result
    def apply_func(f):
        if f.exception():
            f2.set_exception(f.exception())
            return
        print('submitting %s(%s)' % (func.__name__, f.result()))
        ar = view.apply_async(func, f.result())
        # when ar is done, pass through the result to f2
        ar.add_done_callback(lambda ar: f2.set_result(ar.get()))

    future.add_done_callback(apply_func)
    return f2

def chain_map(view, func, list_of_futures):
    """Chain a new callback on a list of futures."""
    return [chain_apply(view, func, f) for f in list_of_futures]

# use builtin map with apply, since we want one Future per item
results1 = map(partial(view.apply, task1), [1, 2, 3])
results2 = chain_map(view, task2, results1)
results3 = chain_map(view, task3, results2)

print("Waiting for results")
[r.result() for r in results3]
As with any example of add_done_callback, it can be written with coroutines, but I find the callbacks in this case to be fine. This should at least be a fairly generic utility that you can use to compose your pipeline.
Option 2: dask.distributed
Full disclosure: I'm the primary author of IPython Parallel, about to suggest that you use a different tool.
It is possible to pass results from one task to another via engine namespaces and DAG dependencies in IPython parallel, but honestly, if your workflow looks like this, you should consider using dask distributed, which is designed specifically for this kind of computation graph. If you are already comfortable and familiar with IPython parallel, getting started with dask should not be too much of a burden.
IPython 5.1 provides a handy command for turning your IPython parallel cluster into a dask distributed cluster:
import ipyparallel as ipp

client = ipp.Client()
executor = client.become_distributed(ncores=1)
And then the key relevant feature of dask is that you can submit futures as arguments to subsequent map calls, and the scheduler takes care of it when the results are ready, rather than having to do it explicitly in the client:
results1 = executor.map(task1, [1, 2, 3])
results2 = executor.map(task2, results1)
results3 = executor.map(task3, results2)
executor.gather(results3)
So basically, dask distributed works how you wish IPython parallel's load-balancing would work when you need to chain things like this.
This notebook illustrates both examples.
