Run multiple async loops in separate processes within a main async app - python

OK, so this is a bit convoluted, but I have an async class with a lot of async code.
I wish to parallelize a task inside that class: I want to spawn multiple processes to run a blocking task, and within each of these processes I want to create an asyncio loop to handle various subtasks.
I sort of managed to do this with a ThreadPoolExecutor, but when I try to use a ProcessPoolExecutor I get a "Can't pickle local object" error.
This is a simplified version of my code that runs with ThreadPoolExecutor. How can it be parallelized with ProcessPoolExecutor?
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

class MyClass:
    def __init__(self) -> None:
        self.event_loop = None
        # self.pool_executor = ProcessPoolExecutor(max_workers=8)
        self.pool_executor = ThreadPoolExecutor(max_workers=8)
        self.words = ["one", "two", "three", "four", "five"]
        self.multiplier = int(2)

    async def subtask(self, letter: str):
        await asyncio.sleep(1)
        return letter * self.multiplier

    async def task_gatherer(self, subtasks: list):
        return await asyncio.gather(*subtasks)

    def blocking_task(self, word: str):
        time.sleep(1)
        subtasks = [self.subtask(letter) for letter in word]
        result = asyncio.run(self.task_gatherer(subtasks))
        return result

    async def master_method(self):
        self.event_loop = asyncio.get_running_loop()
        master_tasks = [
            self.event_loop.run_in_executor(
                self.pool_executor,
                self.blocking_task,
                word,
            )
            for word in self.words
        ]
        results = await asyncio.gather(*master_tasks)
        print(results)

if __name__ == "__main__":
    my_class = MyClass()
    asyncio.run(my_class.master_method())

This is a very good question. Both the problem and the solution are quite interesting.
The Problem
One difference between multithreading and multiprocessing is how memory is handled. Threads share a memory space. Processes do not (in general, see below).
Objects are passed to a ThreadPoolExecutor simply by reference. There is no need to create new objects.
But a ProcessPoolExecutor lives in a separate memory space. To pass objects to it, the implementation pickles the objects and unpickles them again on the other side. This detail is often important.
Look carefully at the arguments to blocking_task in the original question. I don't mean word - I mean the first argument: self. The one that's always there. We've seen it a million times and hardly even think about it. To execute the function blocking_task, a value is required for the argument named self. To run this function in a ProcessPoolExecutor, self must get pickled and unpickled. Now look at some of the member objects of self: there's an event loop and also the executor itself, neither of which is picklable. That's the problem.
There is no way we can run that function, as is, in another Process.
Admittedly, the traceback message "Cannot pickle local object" leaves a lot to be desired. So does the documentation. But it actually makes total sense that the program works with a ThreadPool but not with a ProcessPool.
Note: There are mechanisms for sharing ctypes objects between Processes. However, as far as I'm aware, there is no way to share Python objects directly. That's why the pickle/unpickle mechanism is used.
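To see the failure in isolation, here is a minimal sketch (illustrative only, with made-up class names, not from the original post): an object that holds an executor refuses to pickle, while a plain data object pickles fine.

import pickle
from concurrent.futures import ProcessPoolExecutor

class HoldsExecutor:
    def __init__(self):
        self.pool_executor = ProcessPoolExecutor(max_workers=2)  # not picklable

class PlainData:
    def __init__(self):
        self.multiplier = 2  # plain attributes pickle without trouble

if __name__ == "__main__":
    pickle.dumps(PlainData())  # works fine
    try:
        pickle.dumps(HoldsExecutor())
    except Exception as exc:  # the exact exception type and message vary by Python version
        print("pickling failed:", type(exc).__name__, exc)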
The Solution
Refactor MyClass to separate the data from the multiprocessing framework. I created a second class, MyTask, which can be pickled and unpickled. I moved a few of the functions from MyClass into it. Nothing of importance has been modified from the original listing - just rearranged.
The script runs successfully with both ProcessPoolExecutor and ThreadPoolExecutor.
import asyncio
import time
# from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import ProcessPoolExecutor

# Refactored MyClass to break out MyTask
class MyTask:
    def __init__(self):
        self.multiplier = 2

    async def subtask(self, letter: str):
        await asyncio.sleep(1)
        return letter * self.multiplier

    async def task_gatherer(self, subtasks: list):
        return await asyncio.gather(*subtasks)

    def blocking_task(self, word: str):
        time.sleep(1)
        subtasks = [self.subtask(letter) for letter in word]
        result = asyncio.run(self.task_gatherer(subtasks))
        return result

class MyClass:
    def __init__(self):
        self.task = MyTask()
        self.event_loop: asyncio.AbstractEventLoop = None
        self.pool_executor = ProcessPoolExecutor(max_workers=8)
        # self.pool_executor = ThreadPoolExecutor(max_workers=8)
        self.words = ["one", "two", "three", "four", "five"]

    async def master_method(self):
        self.event_loop = asyncio.get_running_loop()
        master_tasks = [
            self.event_loop.run_in_executor(
                self.pool_executor,
                self.task.blocking_task,
                word,
            )
            for word in self.words
        ]
        results = await asyncio.gather(*master_tasks)
        print(results)

if __name__ == "__main__":
    my_class = MyClass()
    asyncio.run(my_class.master_method())

Related

Call 4 methods at once in Python 3

I want to call 4 methods at once so they run in parallel in Python. These methods make HTTP calls and do some basic operations like verifying the response. I want to call them at once so the total time taken is less: say each method takes ~20 min to run, I want all 4 methods to return a response in ~20 min, not 20*4 = 80 min.
It is important to note that the 4 methods I'm trying to run in parallel are async functions. When I tried using ThreadPoolExecutor to run the 4 methods in parallel, I didn't see much difference in the time taken.
Example code, edited from @tomerar's comment below:
from concurrent.futures import ThreadPoolExecutor

async def foo_1():
    print("foo_1")

async def foo_2():
    print("foo_2")

async def foo_3():
    print("foo_3")

async def foo_4():
    print("foo_4")

with ThreadPoolExecutor() as executor:
    for foo in [await foo_1, await foo_2, await foo_3, await foo_4]:
        executor.submit(foo)
Looking for suggestions
You can use ThreadPoolExecutor from concurrent.futures:
from concurrent.futures import ThreadPoolExecutor

def foo_1():
    print("foo_1")

def foo_2():
    print("foo_2")

def foo_3():
    print("foo_3")

def foo_4():
    print("foo_4")

with ThreadPoolExecutor() as executor:
    for foo in [foo_1, foo_2, foo_3, foo_4]:
        executor.submit(foo)
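Since the question says the four methods are already async functions, a thread pool isn't strictly needed; a minimal sketch using asyncio.gather (my own illustration, assuming the four coroutines can simply be awaited together) would be:

import asyncio

async def foo_1():
    print("foo_1")

async def foo_2():
    print("foo_2")

async def foo_3():
    print("foo_3")

async def foo_4():
    print("foo_4")

async def main():
    # Run all four coroutines concurrently and wait for them all to finish.
    await asyncio.gather(foo_1(), foo_2(), foo_3(), foo_4())

asyncio.run(main())

Note that this only gives a speedup if the HTTP calls inside the coroutines use an async client (e.g. aiohttp); awaiting blocking calls still runs them one after another.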
You can use "multiprocessing" in python.
it's so simple
from multiprocessing import Pool
pool = Pool()
result1 = pool.apply_async(solve1, [A]) # evaluate "solve1(A)"
result2 = pool.apply_async(solve2, [B]) # evaluate "solve2(B)"
answer1 = result1.get(timeout=10)
answer2 = result2.get(timeout=10)
You can see the full details in the multiprocessing documentation.

Python: How do I refactor and structure code in this scenario?

I'm quite stuck on structuring the code in this scenario. Can anyone help me with this?
module.py:
import asyncio

class Server:
    def __init__(self):
        self.d = {}

    @classmethod
    async def create(cls):
        self = cls()
        await self.func()
        return self

    async def func(self):
        await asyncio.sleep(5)  # Some other async code here
        self.a = 12

    def reg(self, ev):
        def decorator(func):
            self.d[ev] = func
            return func
        return decorator

    def reg2(self, ev, func):
        self.d[ev] = func
main.py:
import asyncio
from module import Server

async def main():
    ser = await Server.create()
    # This would be another way... but I find the decorator way neater
    ser.reg2('msg', some_handler)

# I want to decorate and register this using the
# reg func; but since the object is not created yet,
# how do I accomplish this?
# @ser.reg('msg')
async def some_handler():
    ...

if __name__ == "__main__":
    asyncio.run(main())
Some key points of my aim:
The function some_handler is never used other than at registration time. That is, the function solely exists to be registered and is not used anywhere else.
Since the Server class needs asynchronous initialisation, it cannot be created globally.
(I don't know whether this point is helpful.) Generally only one Server instance is created for a single program. There won't be any other instance, even in other modules.
How do I model my code to satisfy this scenario? I have mentioned an alternate way to register the function, but I feel I am missing something, as some_handler isn't used anywhere else. I have thought about making Server into a metaclass to do the registering and turning main() and some_handler() into parts of the metaclass's class, but I'm seeking different views and opinions.
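One common pattern for this kind of registration (a sketch based on my own assumptions, not taken from the thread) is to keep the registry on the class rather than on the instance, so decoration can happen at import time, before any instance exists:

import asyncio

class Server:
    _handlers = {}  # class-level registry, filled at import time

    def __init__(self):
        self.d = dict(self._handlers)  # each instance starts from the registry

    @classmethod
    def reg(cls, ev):
        def decorator(func):
            cls._handlers[ev] = func
            return func
        return decorator

    @classmethod
    async def create(cls):
        self = cls()
        await asyncio.sleep(0)  # stands in for the real async initialisation
        return self

@Server.reg('msg')  # no instance needed at decoration time
async def some_handler():
    ...

async def main():
    ser = await Server.create()
    print(ser.d)  # {'msg': <function some_handler at ...>}

if __name__ == "__main__":
    asyncio.run(main())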

How to add a pool of processes available for a multiprocessing queue

I am following a preceding question here: how to add more items to a multiprocessing queue while script in motion
The code I am working with now:
import multiprocessing

class MyFancyClass:
    def __init__(self, name):
        self.name = name

    def do_something(self):
        proc_name = multiprocessing.current_process().name
        print('Doing something fancy in {} for {}!'.format(proc_name, self.name))

def worker(q):
    while True:
        obj = q.get()
        if obj is None:
            break
        obj.do_something()

if __name__ == '__main__':
    queue = multiprocessing.Queue()

    p = multiprocessing.Process(target=worker, args=(queue,))
    p.start()

    queue.put(MyFancyClass('Fancy Dan'))
    queue.put(MyFancyClass('Frankie'))
    # print(queue.qsize())
    queue.put(None)

    # Wait for the worker to finish
    queue.close()
    queue.join_thread()
    p.join()
Right now, there are two items in the queue. If I replace those two lines with a list of, say, 50 items, how do I initiate a Pool to make a number of processes available? For example:
p = multiprocessing.Pool(processes=4)
Where does that go? I'd like to be able to run multiple items at once, especially if the items run for a bit.
Thanks!
As a rule, you either use Pool or Process(es) plus Queues. Mixing both is a misuse; the Pool already uses Queues (or a similar mechanism) behind the scenes.
If you want to do this with a Pool, change your code to (moving code to main function for performance and better resource cleanup than running in global scope):
def main():
    myfancyclasses = [MyFancyClass('Fancy Dan'), ...]  # define your MyFancyClass instances here
    with multiprocessing.Pool(processes=4) as p:
        # Submit all the work
        futures = [p.apply_async(fancy.do_something) for fancy in myfancyclasses]
        # Done submitting, let workers exit as they run out of work
        p.close()
        # Wait until all the work is finished
        for f in futures:
            f.wait()

if __name__ == '__main__':
    main()
This could be simplified further, at the expense of purity, with the map family of methods on Pool (map, imap, imap_unordered), e.g. to minimize memory usage redefine main as:
def main():
    myfancyclasses = [MyFancyClass('Fancy Dan'), ...]  # define your MyFancyClass instances here
    with multiprocessing.Pool(processes=4) as p:
        # No return value, so we ignore it, but we need to run out the result
        # or the work won't be done
        for _ in p.imap_unordered(MyFancyClass.do_something, myfancyclasses):
            pass
Yes, technically either approach has a slightly higher overhead, since the return value you're not using still needs to be serialized and sent back to the parent process. But in practice this cost is pretty low (since your function has no return, it returns None, which serializes to almost nothing). An advantage of this approach is that you generally don't want to print from the child processes (their output ends up interleaved), so you can replace the printing with return values and let the parent do the printing, e.g.:
import multiprocessing

class MyFancyClass:
    def __init__(self, name):
        self.name = name

    def do_something(self):
        proc_name = multiprocessing.current_process().name
        # Changed from print to return
        return 'Doing something fancy in {} for {}!'.format(proc_name, self.name)

def main():
    myfancyclasses = [MyFancyClass('Fancy Dan'), ...]  # define your MyFancyClass instances here
    with multiprocessing.Pool(processes=4) as p:
        # Using the return value now to avoid interleaved output
        for res in p.imap_unordered(MyFancyClass.do_something, myfancyclasses):
            print(res)

if __name__ == '__main__':
    main()
Note how all of these solutions remove the need to write your own worker function, or manually manage Queues, because Pools do that grunt work for you.
Alternate approach using concurrent.futures to efficiently process results as they become available, while allowing you to choose to submit new work (either based on the results, or based on external information) as you go:
import concurrent.futures
from concurrent.futures import FIRST_COMPLETED

def main():
    allow_new_work = True  # Set to False to indicate we'll no longer allow new work
    myfancyclasses = [MyFancyClass('Fancy Dan'), ...]  # define your initial MyFancyClass instances here
    with concurrent.futures.ProcessPoolExecutor() as executor:
        remaining_futures = {executor.submit(fancy.do_something)
                             for fancy in myfancyclasses}
        while remaining_futures:
            done, remaining_futures = concurrent.futures.wait(remaining_futures,
                                                              return_when=FIRST_COMPLETED)
            for fut in done:
                result = fut.result()
                # Do stuff with result, maybe submit new work in response

            if allow_new_work:
                if should_stop_checking_for_new_work():
                    allow_new_work = False
                    # Let the workers exit when all remaining tasks are done,
                    # and reject submitting more work from now on
                    executor.shutdown(wait=False)
                elif has_more_work():
                    # Assumed to return a collection of new MyFancyClass instances
                    new_fanciness = get_more_fanciness()
                    remaining_futures |= {executor.submit(fancy.do_something)
                                          for fancy in new_fanciness}
                    myfancyclasses.extend(new_fanciness)

Python Multiprocessing - TypeError: Pickling an AuthenticationString object is disallowed for security reasons

I'm having the following problem. I want to implement a web crawler; so far this worked, but it was so slow that I tried to use multiprocessing for fetching the URLs.
Unfortunately I'm not very experienced in this field.
After some reading, the easiest way seemed to be to use the map method of multiprocessing.Pool for this.
But I constantly get the following error:
TypeError: Pickling an AuthenticationString object is disallowed for security reasons
I found very few cases with the same error and they unfortunately did not help me.
I created a stripped version of my code which can reproduce the error:
import multiprocessing

class TestCrawler:
    def __init__(self):
        self.m = multiprocessing.Manager()
        self.queue = self.m.Queue()
        for i in range(50):
            self.queue.put(str(i))
        self.pool = multiprocessing.Pool(6)

    def mainloop(self):
        self.process_next_url(self.queue)
        while True:
            self.pool.map(self.process_next_url, (self.queue,))

    def process_next_url(self, queue):
        url = queue.get()
        print(url)

c = TestCrawler()
c.mainloop()
I would be very thankful for any help or suggestions!
Question: But I constantly get the following error:
The error you're getting is misleading; the cause is this line:
self.queue = self.m.Queue()
Move the Queue instantiation outside the class TestCrawler.
This leads to another error:
NotImplementedError: pool objects cannot be passed between processes or pickled
The cause of that one is:
self.pool = multiprocessing.Pool(6)
Both errors indicate that pickle cannot serialize these class members when self is sent to the worker processes.
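A quick way to see the second error in isolation (an illustrative sketch, not part of the original answer):

import multiprocessing
import pickle

if __name__ == "__main__":
    pool = multiprocessing.Pool(2)
    try:
        pickle.dumps(pool)
    except NotImplementedError as exc:
        print(exc)  # "pool objects cannot be passed between processes or pickled"
    pool.close()
    pool.join()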
Note: endless loop!
Your while loop below leads to an endless loop, which will overload your system!
Furthermore, each pool.map(...) call starts only one process with one task!
while True:
    self.pool.map(self.process_next_url, (self.queue,))
I suggest reading the examples in the multiprocessing documentation that demonstrate the use of a pool.
Change to the following:
import multiprocessing as mp

class TestCrawler:
    def __init__(self, tasks):
        # Assign the global task queue to a class member
        self.queue = tasks
        for i in range(50):
            self.queue.put(str(i))

    def mainloop(self):
        # Instantiate the pool locally
        pool = mp.Pool(6)
        for n in range(50):
            # .map requires an iterable of arguments; pass None as a dummy
            pool.map(self.process_next_url, (None,))

    # None is passed as the dummy argument
    def process_next_url(self, dummy):
        url = self.queue.get()
        print(url)

if __name__ == "__main__":
    # Create the Queue as a global
    tasks = mp.Manager().Queue()
    # Pass the Queue to your class TestCrawler
    c = TestCrawler(tasks)
    c.mainloop()
This example starts 5 processes, each processing 10 tasks (urls):

import multiprocessing as mp

class TestCrawler2:
    def __init__(self, tasks):
        self.tasks = tasks

    def start(self):
        pool = mp.Pool(5)
        pool.map(self.process_url, self.tasks)

    def process_url(self, url):
        print('self.process_url({})'.format(url))

if __name__ == "__main__":
    tasks = ['url{}'.format(n) for n in range(50)]
    TestCrawler2(tasks).start()
Tested with Python: 3.4.2

Asynchronous method call in Python?

I was wondering if there's any library for asynchronous method calls in Python. It would be great if you could do something like
@async
def longComputation():
    <code>

token = longComputation()
token.registerCallback(callback_function)
# alternative, polling
while not token.finished():
    doSomethingElse()
if token.finished():
    result = token.result()

Or to call a non-async routine asynchronously:

def longComputation():
    <code>

token = asynccall(longComputation())

It would be great to have a more refined strategy that is native to the language core. Was this considered?
Something like:
import threading
thr = threading.Thread(target=foo, args=(), kwargs={})
thr.start() # Will run "foo"
....
thr.is_alive() # Will return whether foo is running currently
....
thr.join() # Will wait till "foo" is done
See the documentation at https://docs.python.org/library/threading.html for more details.
You can use the multiprocessing module added in Python 2.6. You can use pools of processes and then get results asynchronously with:
apply_async(func[, args[, kwds[, callback]]])
E.g.:
from multiprocessing import Pool

def f(x):
    return x*x

def on_done(value):
    print('callback got:', value)

if __name__ == '__main__':
    pool = Pool(processes=1)  # Start a worker process.
    result = pool.apply_async(f, [10], callback=on_done)  # Evaluate "f(10)" asynchronously; on_done is called when finished.
    print(result.get(timeout=5))  # 100
This is only one alternative. This module provides lots of facilities to achieve what you want. Also it will be really easy to make a decorator from this.
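As an illustration of that last point, here is a minimal sketch of such a decorator (my own assumption of how it could look, using a module-level Pool and a top-level function, since bound methods don't pickle well):

from multiprocessing import Pool

_pool = None  # created in __main__ below

def async_call(func):
    # Wrap a top-level function so that calling the wrapper submits it to the
    # pool and returns an AsyncResult instead of blocking.
    def wrapper(*args, **kwargs):
        return _pool.apply_async(func, args, kwargs)
    return wrapper

def square(x):
    return x * x

if __name__ == '__main__':
    _pool = Pool(processes=2)
    future = async_call(square)(10)  # applied manually so 'square' stays picklable by name
    print(future.get(timeout=5))     # 100
    _pool.close()
    _pool.join()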
Since Python 3.4 you can use asyncio coroutines: first with the generator-based @asyncio.coroutine syntax, and since Python 3.5 with the async/await syntax.
import asyncio
import datetime
Generator-based syntax (deprecated in Python 3.8 and removed in 3.11):
@asyncio.coroutine
def display_date(loop):
    end_time = loop.time() + 5.0
    while True:
        print(datetime.datetime.now())
        if (loop.time() + 1.0) >= end_time:
            break
        yield from asyncio.sleep(1)

loop = asyncio.get_event_loop()
# Blocking call which returns when the display_date() coroutine is done
loop.run_until_complete(display_date(loop))
loop.close()
New async/await syntax:
async def display_date(loop):
    end_time = loop.time() + 5.0
    while True:
        print(datetime.datetime.now())
        if (loop.time() + 1.0) >= end_time:
            break
        await asyncio.sleep(1)

loop = asyncio.get_event_loop()
# Blocking call which returns when the display_date() coroutine is done
loop.run_until_complete(display_date(loop))
loop.close()
It's not in the language core, but a very mature library that does what you want is Twisted. It introduces the Deferred object, which you can attach callbacks or error handlers ("errbacks") to. A Deferred is basically a "promise" that a function will have a result eventually.
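For a flavour of what that looks like, here is a minimal sketch (my own illustration, assuming Twisted is installed) of attaching a callback and an errback to a Deferred:

from twisted.internet import defer, reactor

def long_computation():
    d = defer.Deferred()
    # Fire the Deferred with a result after 2 seconds.
    reactor.callLater(2, d.callback, 42)
    return d

def on_result(value):
    print("got result:", value)
    reactor.stop()

def on_error(failure):
    print("computation failed:", failure)
    reactor.stop()

d = long_computation()
d.addCallbacks(on_result, on_error)
reactor.run()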
You can implement a decorator to make your functions asynchronous, though that's a bit tricky. The multiprocessing module is full of little quirks and seemingly arbitrary restrictions – all the more reason to encapsulate it behind a friendly interface, though.
from inspect import getmodule
from multiprocessing import Pool

# NOTE: 'async' became a reserved keyword in Python 3.7, so on modern Python
# this decorator needs a different name (e.g. 'asynchronous').
def async(decorated):
    r'''Wraps a top-level function around an asynchronous dispatcher.

    When the decorated function is called, a task is submitted to a
    process pool, and a future object is returned, providing access to an
    eventual return value.

    The future object has a blocking get() method to access the task
    result: it will return immediately if the job is already done, or block
    until it completes.

    This decorator won't work on methods, due to limitations in Python's
    pickling machinery (in principle methods could be made pickleable, but
    good luck on that).
    '''
    # Keeps the original function visible from the module global namespace,
    # under a name consistent to its __name__ attribute. This is necessary for
    # the multiprocessing pickling machinery to work properly.
    module = getmodule(decorated)
    decorated.__name__ += '_original'
    setattr(module, decorated.__name__, decorated)

    def send(*args, **opts):
        return async.pool.apply_async(decorated, args, opts)

    return send
The code below illustrates usage of the decorator:
@async
def printsum(uid, values):
    summed = 0
    for value in values:
        summed += value
    print("Worker %i: sum value is %i" % (uid, summed))
    return (uid, summed)

if __name__ == '__main__':
    from random import sample

    # The process pool must be created inside __main__.
    async.pool = Pool(4)

    p = range(0, 1000)
    results = []
    for i in range(4):
        result = printsum(i, sample(p, 100))
        results.append(result)

    for result in results:
        print("Worker %i: sum value is %i" % result.get())
In a real-world case I would elaborate a bit more on the decorator, providing some way to turn it off for debugging (while keeping the future interface in place), or maybe a facility for dealing with exceptions; but I think this demonstrates the principle well enough.
Just
import threading, time

def f():
    print("f started")
    time.sleep(3)
    print("f finished")

threading.Thread(target=f).start()
My solution is:
import threading

class TimeoutError(RuntimeError):
    pass

class AsyncCall(object):
    def __init__(self, fnc, callback=None):
        self.Callable = fnc
        self.Callback = callback

    def __call__(self, *args, **kwargs):
        self.Thread = threading.Thread(target=self.run, name=self.Callable.__name__, args=args, kwargs=kwargs)
        self.Thread.start()
        return self

    def wait(self, timeout=None):
        self.Thread.join(timeout)
        if self.Thread.is_alive():
            raise TimeoutError()
        else:
            return self.Result

    def run(self, *args, **kwargs):
        self.Result = self.Callable(*args, **kwargs)
        if self.Callback:
            self.Callback(self.Result)

class AsyncMethod(object):
    def __init__(self, fnc, callback=None):
        self.Callable = fnc
        self.Callback = callback

    def __call__(self, *args, **kwargs):
        return AsyncCall(self.Callable, self.Callback)(*args, **kwargs)

def Async(fnc=None, callback=None):
    if fnc is None:
        def AddAsyncCallback(fnc):
            return AsyncMethod(fnc, callback)
        return AddAsyncCallback
    else:
        return AsyncMethod(fnc, callback)
And it works exactly as requested:

@Async
def fnc():
    pass
You could use eventlet. It lets you write what appears to be synchronous code, but have it operate asynchronously over the network.
Here's an example of a super minimal crawler (note this is Python 2 code, using urllib2):

urls = ["http://www.google.com/intl/en_ALL/images/logo.gif",
        "https://wiki.secondlife.com/w/images/secondlife.jpg",
        "http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif"]

import eventlet
from eventlet.green import urllib2

def fetch(url):
    return urllib2.urlopen(url).read()

pool = eventlet.GreenPool()

for body in pool.imap(fetch, urls):
    print "got body", len(body)
Something like this works for me: you can then call the function, and it will dispatch itself onto a new thread.

import time
from _thread import start_new_thread  # this module was called 'thread' in Python 2

def dowork(asynchronous=True):
    if asynchronous:
        args = (False,)  # must be a tuple, note the trailing comma
        start_new_thread(dowork, args)  # Call itself on a new thread.
    else:
        while True:
            # do something...
            time.sleep(60)  # sleep for a minute
    return
You can use concurrent.futures (added in Python 3.2).
import time
from concurrent.futures import ThreadPoolExecutor

def long_computation(duration):
    for x in range(0, duration):
        print(x)
        time.sleep(1)
    return duration * 2

print('Use polling')
with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(long_computation, 5)
    while not future.done():
        print('waiting...')
        time.sleep(0.5)
    print(future.result())

print('Use callback')
executor = ThreadPoolExecutor(max_workers=1)
future = executor.submit(long_computation, 5)
future.add_done_callback(lambda f: print(f.result()))

print('waiting for callback')

executor.shutdown(False)  # non-blocking
print('shutdown invoked')
The newer way to run asyncio code, in Python 3.7 and later, is to use asyncio.run() instead of creating a loop, calling loop.run_until_complete(), and closing it yourself:
import asyncio
import datetime

async def display_date(delay):
    loop = asyncio.get_running_loop()
    end_time = loop.time() + delay
    while True:
        print("Blocking...", datetime.datetime.now())
        await asyncio.sleep(1)
        if loop.time() > end_time:
            print("Done.")
            break

asyncio.run(display_date(5))
Is there any reason not to use threads? You can use the threading.Thread class.
Instead of the finished() function, use is_alive(). The result() function could join() the thread and retrieve the result. And, if you can, override the run() and __init__ methods to call the function specified in the constructor and save the return value somewhere on the instance.
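A minimal sketch of that idea (the names are my own, just for illustration):

import threading

class AsyncTask(threading.Thread):
    def __init__(self, func, *args, **kwargs):
        super().__init__()
        self._func = func
        self._args = args
        self._kwargs = kwargs
        self._result = None

    def run(self):
        # Save the return value on the instance so it can be read after join().
        self._result = self._func(*self._args, **self._kwargs)

    def result(self):
        self.join()  # wait for the computation to finish
        return self._result

def long_computation():
    return sum(range(1_000_000))

task = AsyncTask(long_computation)
task.start()
print(task.is_alive())  # poll instead of blocking, if you prefer
print(task.result())    # joins, then returns the saved value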
The native Python way for asynchronous calls in 2021 with Python 3.9, suitable also for the Jupyter / IPython kernel:
Camabeh's answer is the way to go since Python 3.3.
import asyncio
import datetime

async def display_date(loop):
    end_time = loop.time() + 5.0
    while True:
        print(datetime.datetime.now())
        if (loop.time() + 1.0) >= end_time:
            break
        await asyncio.sleep(1)

loop = asyncio.get_event_loop()
# Blocking call which returns when the display_date() coroutine is done
loop.run_until_complete(display_date(loop))
loop.close()
This will work in Jupyter Notebook / Jupyter Lab but throw an error:
RuntimeError: This event loop is already running
Due to IPython's use of an already-running event loop, we need something called nested asynchronous loops, which is not yet implemented in Python. Luckily there is nest_asyncio to deal with the issue. All you need to do is:
!pip install nest_asyncio # use ! within Jupyter Notebook, else pip install in shell
import nest_asyncio
nest_asyncio.apply()
(Based on this thread)
Only when you call loop.close() does it throw another error, since it probably refers to IPython's main loop:
RuntimeError: Cannot close a running event loop
I'll update this answer as soon as someone answers this GitHub issue.
You can use a Process. If you want it to run forever, use a while loop (as for networking) inside your function:

from multiprocessing import Process

def foo():
    while True:
        ...  # Do something here

if __name__ == '__main__':
    p = Process(target=foo)
    p.start()

If you just want to run it once, do it like this:

from multiprocessing import Process

def foo():
    ...  # Do something

if __name__ == '__main__':
    p = Process(target=foo)
    p.start()
    p.join()
