How to use asyncio with ProcessPoolExecutor - python

I am searching for a huge number of addresses on the web. I want to use both asyncio and ProcessPoolExecutor in my task to search the addresses quickly.
async def main():
    n_jobs = 3
    addresses = [list of addresses]
    _addresses = list_splitter(data=addresses, n=n_jobs)
    with ProcessPoolExecutor(max_workers=n_jobs) as executor:
        futures_list = []
        for _address in _addresses:
            futures_list += [asyncio.get_event_loop().run_in_executor(executor, execute_parallel, _address)]
        for f in tqdm(as_completed(futures_list, loop=asyncio.get_event_loop()), total=len(_addresses)):
            results = await f

asyncio.get_event_loop().run_until_complete(main())
Expected:
I want the execute_parallel function to run in parallel.
Error:
Traceback (most recent call last):
File "/home/awaish/danamica/scraping/skraafoto/aerial_photos_scraper.py", line 228, in <module>
asyncio.run(main())
File "/usr/local/lib/python3.7/asyncio/runners.py", line 43, in run
return loop.run_until_complete(main)
File "/usr/local/lib/python3.7/asyncio/base_events.py", line 584, in run_until_complete
return future.result()
File "/home/awaish/danamica/scraping/skraafoto/aerial_photos_scraper.py", line 224, in main
results = await f
File "/usr/local/lib/python3.7/asyncio/tasks.py", line 533, in _wait_for_one
return f.result() # May raise f.exception().
TypeError: can't pickle coroutine objects

I'm not sure I'm answering the correct question, but it appears the intent of your code is to run your execute_parallel function across several processes using asyncio. Instead of using ProcessPoolExecutor, why not try a normal multiprocessing Pool and set up a separate asyncio loop to run in each worker? You might set up one process per core and let asyncio work its magic within each process.
import asyncio
import multiprocessing

def run_loop(addresses):
    # each worker process runs its own event loop over its chunk of addresses
    loop = asyncio.get_event_loop()
    tasks = [loop.create_task(execute_parallel(address)) for address in addresses]
    loop.run_until_complete(asyncio.wait(tasks))

def main():
    n_jobs = 3
    addresses = [list of addresses]
    _addresses = list_splitter(data=addresses, n=n_jobs)
    with multiprocessing.Pool(processes=n_jobs) as pool:
        # imap_unordered is lazy, so consume the iterator to actually run the jobs
        list(pool.imap_unordered(run_loop, _addresses))
I've used Pool.imap_unordered with great success, but depending on your needs you may prefer Pool.map or some other functionality. You can play around with chunksize or with the number of addresses in each list to achieve optimal results (i.e., if you're getting a lot of timeouts you may want to reduce the number of addresses being processed concurrently).
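For reference, here is a minimal sketch of the two helpers assumed above; the name list_splitter comes from the question, but its implementation and the use of aiohttp inside execute_parallel are my assumptions for illustration only:

import aiohttp

def list_splitter(data, n):
    # split data into n roughly equal chunks (assumed behaviour of the question's helper)
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

async def execute_parallel(address):
    # assumed coroutine: fetch a single address and return the response body
    async with aiohttp.ClientSession() as session:
        async with session.get(address) as response:
            return await response.text()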

Related

How to start another thread without waiting for function to finish?

Hey I am making a telegram bot and I need it to be able to run the same command multiple times at once.
dispatcher.add_handler(CommandHandler("send", send))
This is the command ^
And inside the command it starts a function:
sendmail(email, amount, update, context)
This function takes around 5 seconds to finish. I want to be able to run it multiple times at once without needing to wait for it to finish. I tried the following:
Thread(target=sendmail(email, amount, update, context)).start()
This gave me no errors, but it waits for the function to finish before proceeding. I also tried this:
with ThreadPoolExecutor(max_workers=100) as executor:
    executor.submit(sendmail, email, amount, update, context).result()
but it gave me the following error:
No error handlers are registered, logging exception.
Traceback (most recent call last):
File "C:\Users\seal\AppData\Local\Programs\Python\Python310\lib\site-packages\telegram\ext\dispatcher.py", line 557, in process_update
handler.handle_update(update, self, check, context)
File "C:\Users\seal\AppData\Local\Programs\Python\Python310\lib\site-packages\telegram\ext\handler.py", line 199, in handle_update
return self.callback(update, context)
File "c:\Users\seal\Downloads\telegrambot\main.py", line 382, in sendmailcmd
executor.submit(sendmail, email, amount, update, context).result()
File "C:\Users\main\AppData\Local\Programs\Python\Python310\lib\concurrent\futures\thread.py", line 169, in submit
raise RuntimeError('cannot schedule new futures after '
RuntimeError: cannot schedule new futures after interpreter shutdown
This is my first attempt at threading, but maybe try this:
import threading
x1 = threading.Thread(target=sendmail, args=(email, amount, update, context))
x1.start()
You can just put the x1 = threading... and x1.start() lines in a loop to have it run multiple times.
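For example, a minimal sketch of such a loop (the job list below is a made-up placeholder; sendmail, update and context are the names from the question):

import threading

jobs = [("a@example.com", 1), ("b@example.com", 2)]  # hypothetical (email, amount) pairs

threads = []
for email, amount in jobs:
    t = threading.Thread(target=sendmail, args=(email, amount, update, context))
    t.start()
    threads.append(t)

# optionally wait for all of them at some later point
for t in threads:
    t.join()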
Hope this helps
It is not that one function has to finish before the next one starts; in Python, the GIL (Global Interpreter Lock) executes only one thread at a given time. Because threads release the GIL while they wait on I/O such as sending mail, the time lost between two functions is negligible in most cases.
The following is a way to start threads with ThreadPoolExecutor; please adjust it to your use case.
from concurrent.futures import ThreadPoolExecutor

def async_send_email(emails_to_send):
    with ThreadPoolExecutor(max_workers=32) as executor:
        futures = [
            executor.submit(
                send_email,
                email=email_to_send.email,
                amount=email_to_send.amount,
                update=email_to_send.update,
                context=email_to_send.context
            )
            for email_to_send in emails_to_send
        ]
        for future, email_to_send in zip(futures, emails_to_send):
            try:
                future.result()
            except Exception as e:
                # Handle the exceptions.
                continue

def send_email(email, amount, update, context):
    # do what you want here.
    pass
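One possible way to call it, assuming a small container type for the per-email data (the EmailToSend namedtuple below is a made-up example, not part of the original answer):

from collections import namedtuple

EmailToSend = namedtuple("EmailToSend", ["email", "amount", "update", "context"])

def sendmailcmd(update, context):
    # build one entry per mail to send, then hand the whole batch to the thread pool
    emails_to_send = [
        EmailToSend(email="a@example.com", amount=1, update=update, context=context),
        EmailToSend(email="b@example.com", amount=2, update=update, context=context),
    ]
    async_send_email(emails_to_send)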

How to properly transform a sync function to an async one?

I'm writing a telegram bot and I need the bot to be available to users even when it is processing some previous request. My bot downloads videos and compresses them if they exceed the size limit, so it takes some time to process a request. I want to turn my sync functions into async ones and handle them within another process to make this happen.
I found a way to do this using this article, but it doesn't work for me. Here is my code to test the solution:
import asyncio
from concurrent.futures import ProcessPoolExecutor
from functools import wraps, partial

executor = ProcessPoolExecutor()

def async_wrap(func):
    @wraps(func)
    async def run(*args, **kwargs):
        loop = asyncio.get_running_loop()
        pfunc = partial(func, *args, **kwargs)
        return await loop.run_in_executor(executor, pfunc)
    return run

@async_wrap
def sync_func(a):
    import time
    time.sleep(10)

if __name__ == "__main__":
    asyncio.run(sync_func(4))
As a result, I've got the following error message:
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/mikhail/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/queues.py", line 245, in _feed
obj = _ForkingPickler.dumps(obj)
File "/home/mikhail/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function sync_func at 0x7f2e333625f0>: it's not the same object as __main__.sync_func
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/mikhail/Projects/social_network_parsing_bot/processes.py", line 34, in <module>
asyncio.run(sync_func(4))
File "/home/mikhail/.pyenv/versions/3.10.4/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/home/mikhail/.pyenv/versions/3.10.4/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
return future.result()
File "/home/mikhail/Projects/social_network_parsing_bot/processes.py", line 18, in run
return await loop.run_in_executor(executor, pfunc)
File "/home/mikhail/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/queues.py", line 245, in _feed
obj = _ForkingPickler.dumps(obj)
File "/home/mikhail/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function sync_func at 0x7f2e333625f0>: it's not the same object as __main__.sync_func
As I understand it, the error arises because the decorator changes the function and as a result returns a new object. What do I need to change in my code to make it work? Maybe I don't understand some crucial concept and there is a simple method to achieve this. Thanks for the help.
The article runs a nice experiment, but it really is just meant to work with a thread-pool executor, not a multiprocessing one.
If you look at its code, at some point it passes executor=None to the .run_in_executor call, and asyncio then creates a default executor, which is a ThreadPoolExecutor.
The main difference with a ProcessPoolExecutor is that all data moved across processes (and therefore all data sent to the workers, including the target functions) has to be serialized, and that is done via Python's pickle.
Now, pickle serialization of functions does not really send the function object, along with its bytecode, down the wire: rather, it just sends the function's qualified name, and it is expected that the function with the same qualname on the other end is the same as the original function.
In the case of your code, the func that is the target for the executor pool is the function as declared, prior to being wrapped by the decorator (__main__.sync_func). But what exists under that name in the target process is the post-decoration function. So, even if Python did not block it because the functions are not the same, you would end up in an infinite loop, creating hundreds of nested subprocesses and never actually calling your function, as the entry point in the target would be the wrapped function. That is simply an error in the article you read.
All this said, the simpler way to make this work is, instead of applying the decorator in the usual fashion, to keep the original, undecorated function in the module namespace and create a new name for the wrapped function - this way, the "raw" code can be the target for the executor:
(...)

def sync_func(a):
    import time
    time.sleep(2)
    print(f"finished {a}")

# this creates the decorated function with a new name,
# instead of replacing the original:
wrapped_sync = async_wrap(sync_func)

if __name__ == "__main__":
    asyncio.run(wrapped_sync("go go go"))
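As a usage note (my sketch, not part of the original answer): once wrapped_sync exists, several blocking calls can run concurrently in the process pool, for example:

async def main():
    # each call is submitted to its own worker process; gather awaits them all
    await asyncio.gather(
        wrapped_sync("job 1"),
        wrapped_sync("job 2"),
        wrapped_sync("job 3"),
    )

if __name__ == "__main__":
    asyncio.run(main())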

Getting erratic runtime exceptions trying to access persistent data in multiprocessing.Pool worker processes

Inspired by this solution I am trying to set up a multiprocessing pool of worker processes in Python. The idea is to pass some data to the worker processes before they actually start their work and reuse it eventually. It's intended to minimize the amount of data which needs to be packed/unpacked for every call into a worker process (i.e. reducing inter-process communication overhead). My MCVE looks like this:
import multiprocessing as mp
import numpy as np

def create_worker_context():
    global context # create "global" context in worker process
    context = {}

def init_worker_context(worker_id, some_const_array, DIMS, DTYPE):
    context.update({
        'worker_id': worker_id,
        'some_const_array': some_const_array,
        'tmp': np.zeros((DIMS, DIMS), dtype = DTYPE),
    }) # store context information in global namespace of worker
    return True # return True, verifying that the worker process received its data

class data_analysis:
    def __init__(self):
        self.DTYPE = 'float32'
        self.CPU_LEN = mp.cpu_count()
        self.DIMS = 100
        self.some_const_array = np.zeros((self.DIMS, self.DIMS), dtype = self.DTYPE)
        # Init multiprocessing pool
        self.cpu_pool = mp.Pool(processes = self.CPU_LEN, initializer = create_worker_context) # create pool and context in workers
        pool_results = [
            self.cpu_pool.apply_async(
                init_worker_context,
                args = (core_id, self.some_const_array, self.DIMS, self.DTYPE)
            ) for core_id in range(self.CPU_LEN)
        ] # pass information to workers' context
        result_batches = [result.get() for result in pool_results] # check if they got the information
        if not all(result_batches): # raise an error if things did not work
            raise SyntaxError('Workers could not be initialized ...')

    @staticmethod
    def process_batch(batch_data):
        context['tmp'][:,:] = context['some_const_array'] + batch_data # some fancy computation in worker
        return context['tmp'] # return result

    def process_all(self):
        input_data = np.arange(0, self.DIMS ** 2, dtype = self.DTYPE).reshape(self.DIMS, self.DIMS)
        pool_results = [
            self.cpu_pool.apply_async(
                data_analysis.process_batch,
                args = (input_data,)
            ) for _ in range(self.CPU_LEN)
        ] # let workers actually work
        result_batches = [result.get() for result in pool_results]
        for batch in result_batches[1:]:
            np.add(result_batches[0], batch, out = result_batches[0]) # reduce batches
        print(result_batches[0]) # show result

if __name__ == '__main__':
    data_analysis().process_all()
I am running the above with CPython 3.6.6.
The strange thing is ... sometimes it works, sometimes it does not. If it does not work, the process_batch method throws an exception, because it can not find some_const_array as a key in context. The full traceback looks like this:
(env) me#box:/path> python so.py
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/path/so.py", line 37, in process_batch
context['tmp'][:,:] = context['some_const_array'] + batch_data # some fancy computation in worker
KeyError: 'some_const_array'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/path/so.py", line 54, in <module>
data_analysis().process_all()
File "/path/so.py", line 48, in process_all
result_batches = [result.get() for result in pool_results]
File "/path/so.py", line 48, in <listcomp>
result_batches = [result.get() for result in pool_results]
File "/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
KeyError: 'some_const_array'
I am puzzled. What is going on here?
If my context dictionaries contain an object of "higher type", e.g. a database driver or similar, I am not getting this kind of problem. I can only reproduce this if my context dictionaries contain basic Python data types, collections or numpy arrays.
(Is there a potentially better approach for achieving the same thing in a more reliable manner? I know my approach is considered a hack ...)
You need to relocate the content of init_worker_context into your initializer function create_worker_context.
Your assumption that every single worker process will run init_worker_context is responsible for your confusion.
The tasks you submit to a pool are fed into one internal task queue that all worker processes read from. What happens in your case is that some worker processes complete their task and compete again for new tasks. So it can happen that one worker process executes multiple tasks while another one does not get a single one. Keep in mind that the OS schedules runtime for the threads (of the worker processes).
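A minimal sketch of that change, reusing the question's names where possible (this is my illustration, not the answerer's code; note that a per-worker worker_id cannot be assigned reliably via apply_async, so it is omitted here):

import multiprocessing as mp
import numpy as np

def create_worker_context(some_const_array, DIMS, DTYPE):
    # runs exactly once in every worker process, before it picks up any task
    global context
    context = {
        'some_const_array': some_const_array,
        'tmp': np.zeros((DIMS, DIMS), dtype=DTYPE),
    }

def process_batch(batch_data):
    context['tmp'][:, :] = context['some_const_array'] + batch_data
    return context['tmp']

if __name__ == '__main__':
    DIMS, DTYPE = 100, 'float32'
    some_const_array = np.zeros((DIMS, DIMS), dtype=DTYPE)
    input_data = np.arange(0, DIMS ** 2, dtype=DTYPE).reshape(DIMS, DIMS)
    with mp.Pool(processes=mp.cpu_count(),
                 initializer=create_worker_context,
                 initargs=(some_const_array, DIMS, DTYPE)) as pool:
        results = pool.map(process_batch, [input_data] * mp.cpu_count())
    print(results[0])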

How to pass classes into Pool.map as arguments - Pickling Error

I am trying to process a file by cutting it into chunks and running them through a function which processes each chunk and returns a numpy array. After looking around, it seems the best method would be to use Pool.map and pass class instances as the arguments. These instances are initialized with a chunk section as one variable, plus another variable to store the resulting numpy array. The output list of instances can then be parsed to extract the information I need to continue with the problem. Here is a simplified version of the script I am trying to write:
from multiprocessing import Pool

class container():
    def __init__(self, k):
        self.input_section = k
        self.output_answer = 0

def compute(object_class):
    # Main operation would go on in here....
    object_class.output_answer = object_class.input_section
    return object_class

def Main():
    # Create list of class instances to pass as arguments
    sections = [container(k) for k in range(10)]
    # Create pool and compute modified classes
    with Pool(4) as p:
        results = p.map(compute, sections)
    # Decode here to get answers
    sections = [k.output_answer for k in results]
    # Print answers
    print(sections)

if __name__ == '__main__':
    Main()
This is the error that I get when I run the script:
Exception in thread Thread-9:
Traceback (most recent call last):
File "C:\Users\rbernon\AppData\Local\Continuum\Anaconda3\lib\threading.py", line 916, in _bootstrap_inner
self.run()
File "C:\Users\rbernon\AppData\Local\Continuum\Anaconda3\lib\threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\rbernon\AppData\Local\Continuum\Anaconda3\lib\multiprocessing\pool.py", line 463, in _handle_results
task = get()
File "C:\Users\rbernon\AppData\Local\Continuum\Anaconda3\lib\multiprocessing\connection.py", line 251, in recv
return _ForkingPickler.loads(buf.getbuffer())
AttributeError: Can't get attribute 'container' on <module '__main__' from
'C:\\Users\\rbernon\\AppData\\Local\\Continuum\\Anaconda3\\lib\\site-packages\\spyder\\utils\\ipython\\start_kernel.py'>
Any help would be greatly appreciated!
Keep in mind that every piece of data you want to have processed needs to be pickled and sent to the worker processes.
The overhead of this will reduce (and might even eliminate) the advantages of using multiple processes.
If the data file is large, it is probably better to send each worker a start and end offset as a 2-tuple of numbers, so each worker can read part of the file and process it.
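A minimal sketch of that idea (the file name and the byte-range chunking below are made up for illustration; real code would also need to handle chunk boundaries that fall mid-record):

from multiprocessing import Pool
import os

FILENAME = "data.txt"  # hypothetical input file

def make_offsets(filename, n_chunks):
    # split the file into n_chunks byte ranges as (start, end) tuples
    size = os.path.getsize(filename)
    step = size // n_chunks + 1
    return [(start, min(start + step, size)) for start in range(0, size, step)]

def process_chunk(offsets):
    start, end = offsets
    with open(FILENAME, "rb") as f:
        f.seek(start)
        chunk = f.read(end - start)
    return len(chunk)  # placeholder for the real per-chunk computation

if __name__ == "__main__":
    with Pool(4) as p:
        results = p.map(process_chunk, make_offsets(FILENAME, 4))
    print(results)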

Python3 filling a dictionary concurrently

I want to fill a dictionary in a loop. The iterations in the loop are independent of each other. I want to perform this on a cluster with thousands of processors. Here is a simplified version of what I tried and need to do.
import multiprocessing

class Worker(multiprocessing.Process):
    def setName(self, name):
        self.name = name
    def run(self):
        print ('In %s' % self.name)
        return

if __name__ == '__main__':
    jobs = []
    names = dict()
    for i in range(10000):
        p = Worker()
        p.setName(str(i))
        names[str(i)] = i
        jobs.append(p)
        p.start()
    for j in jobs:
        j.join()
I tried this one in python3 on my own computer and received the following error:
..
In 249
Traceback (most recent call last):
File "test.py", line 16, in <module>
p.start()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/process.py", line 105, in start
In 250
self._popen = self._Popen(self)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/context.py", line 212, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/context.py", line 267, in _Popen
return Popen(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/popen_fork.py", line 66, in _launch
parent_r, child_w = os.pipe()
OSError: [Errno 24] Too many open files
Is there any better way to do this?
multiprocessing talks to its subprocesses via pipes. Each subprocess requires two open file descriptors, one for reading and one for writing. If you launch 10000 workers, you'll end up opening 20000 file descriptors, which exceeds the default limit on OS X (which your paths indicate you're using).
You can fix the issue by raising the limit. See https://superuser.com/questions/433746/is-there-a-fix-for-the-too-many-open-files-in-system-error-on-os-x-10-7-1 for details - basically, it amounts to setting two sysctl knobs and upping your shell's ulimit setting.
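To inspect or raise the limit from Python itself, the POSIX-only resource module can be used; this snippet is only an illustration, and anything above the hard limit still requires the ulimit/sysctl changes from the link above:

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("current file-descriptor limits:", soft, hard)

# raise the soft limit; 25000 covers ~10000 pipe pairs and is capped at the hard limit
target = 25000
new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))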
At the moment you are spawning 10000 processes at once. That really isn't a good idea.
The error you see is most definitely raised because the multiprocessing module (seemingly) uses pipes for inter-process communication, and there is a limit on open pipes/file descriptors.
I suggest using a Python interpreter without a Global Interpreter Lock, like Jython or IronPython, and simply replacing the multiprocessing module with the threading one.
If you still want to use the multiprocessing module, you could use a process Pool like this to collect the return values:
from multiprocessing import Pool

def worker(params):
    name, someArg = params
    print ('In %s' % name)
    # do something with someArg here
    return (name, someArg)

if __name__ == '__main__':
    jobs = []
    names = dict()
    # Spawn 100 worker processes
    pool = Pool(processes=100)
    # Fill with real data
    task_dict = dict(('name_{}'.format(i), i) for i in range(1000))
    # Process every task via our pool
    results = pool.map(worker, task_dict.items())
    # And convert the result to a dict
    results = dict(results)
    print (results)
This should work with minimal changes for the threading module, too.
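For example, one minimal-change variant swaps in multiprocessing.dummy, which exposes the same Pool API backed by threads (choosing that module rather than the threading module directly is my assumption here):

from multiprocessing.dummy import Pool  # same interface, but thread-based workers

def worker(params):
    name, someArg = params
    print ('In %s' % name)
    return (name, someArg)

if __name__ == '__main__':
    pool = Pool(processes=100)
    task_dict = dict(('name_{}'.format(i), i) for i in range(1000))
    results = dict(pool.map(worker, task_dict.items()))
    print (results)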
