Python Asyncio/Trio for Asynchronous Computing/Fetching


I am looking for a way to efficiently fetch a chunk of values from disk, and then perform computation/calculations on the chunk. My thought was a for loop that would run the disk fetching task first, then run the computation on the fetched data. I want to have my program fetch the next batch as it is running the computation so I don't have to wait for another data fetch every time a computation completes. I expect the computation will take longer than the fetching of the data from disk, and likely cannot be done truly in parallel due to a single computation task already pinning the cpu usage at near 100%.
I have provided some code below in python using trio (but could alternatively be used with asyncio to the same effect) to illustrate my best attempt at performing this operation with async programming:
import trio
import numpy as np
from datetime import datetime as dt
import time
testiters=10
dim = 6000
def generateMat(arrlen):
    for _ in range(30):
        retval= np.random.rand(arrlen, arrlen)
    # print("matrix generated")
    return retval
def computeOpertion(matrix):
    return np.linalg.inv(matrix)
def runSync():
    for _ in range(testiters):
        mat=generateMat(dim)
        result=computeOpertion(mat)
    return result
async def matGenerator_Async(count):
    for _ in range(count):
        yield generateMat(dim)
async def computeOpertion_Async(matrix):
    return computeOpertion(matrix)
async def runAsync():
    async with trio.open_nursery() as nursery:
        async for value in matGenerator_Async(testiters):
            nursery.start_soon(computeOpertion_Async,value)
            #await computeOpertion_Async(value)
print("Sync:")
start=dt.now()
runSync()
print(dt.now()-start)
print("Async:")
start=dt.now()
trio.run(runAsync)
print(dt.now()-start)
This code will simulate getting data from disk by generating 30 random matrices, which uses a small amount of cpu. It will then perform matrix inversion on the generated matrix, which uses 100% cpu (with openblas/mkl configuration in numpy). I compare the time taken to run the tasks by timing the synchronous and asynchronous operations.
From what I can tell, both jobs take exactly the same amount of time to finish, meaning the async operation did not speed up the execution. Observing the behavior of each computation, the sequential operation runs the fetch and computation in order and the async operation runs all the fetches first, then all the computations afterwards.
Is there a way to fetch and compute asynchronously? Perhaps with futures or something like gather()? Asyncio has these functions, and trio has them in a separate package, trio_future. I am also open to solutions via other methods (threads and multiprocessing).
I believe that there likely exists a solution with multiprocessing that can make the disk reading operation run in a separate process. However, inter-process communication and blocking then becomes a hassle, as I would need some sort of semaphore to control how many blocks could be generated at a time due to memory constraints, and multiprocessing tends to be quite heavy and slow.
EDIT
Thank you VPfB for your answer. I am not able to call sleep(0) inside the operation, but I think even if I could, it would necessarily block the computation in favor of performing disk operations. I think this may be a hard limitation of Python threading and asyncio: only one thread can execute at a time. Running two tasks simultaneously is impossible if both need the CPU for anything other than waiting for some external resource to respond.
Perhaps there is a way with an executor for a multiprocessing pool. I have added the following code below:
import asyncio
import concurrent.futures
async def asynciorunAsync():
    loop = asyncio.get_running_loop()
    with concurrent.futures.ProcessPoolExecutor() as pool:
        async for value in matGenerator_Async(testiters):
            result = await loop.run_in_executor(pool, computeOpertion,value)
print("Async with PoolExecutor:")
start=dt.now()
asyncio.run(asynciorunAsync())
print(dt.now()-start)
However, when I time this, it still takes the same amount of time as the synchronous example. I think I will have to go with a more involved solution, as it seems that async and await are too crude a tool to properly do this type of task switching.

I don't work with trio; my answer is asyncio based.
Under these circumstances the only way to improve the asyncio performance I see is to break the computation into smaller pieces and insert await sleep(0) between them. This would allow the data fetching task to run.
Asyncio uses cooperative scheduling. A synchronous CPU bound routine does not cooperate, it blocks everything else while it is running.
sleep() always suspends the current task, allowing other tasks to run.
Setting the delay to 0 provides an optimized path to allow other tasks
to run. This can be used by long-running functions to avoid blocking
the event loop for the full duration of the function call.
(quoted from: asyncio.sleep)
If that is not possible, try to run the computation in an executor. This adds some multi-threading capabilities to otherwise pure asyncio code.
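To make the sleep(0) suggestion concrete, here is a minimal sketch using only the standard library. The names chunked_sum and fetcher are made up for illustration, and the summing loop is just a stand-in for a computation that can be split into pieces; it is not meant as a drop-in fix for the matrix inversion in the question.

```python
import asyncio

async def chunked_sum(values, chunk_size, log):
    # Stand-in for a long computation that has been split into pieces.
    total = 0
    for start in range(0, len(values), chunk_size):
        total += sum(values[start:start + chunk_size])
        log.append("compute")
        # Give other tasks a chance to run between chunks.
        await asyncio.sleep(0)
    return total

async def fetcher(count, log):
    # Stand-in for the data fetching task.
    for _ in range(count):
        log.append("fetch")
        await asyncio.sleep(0)

async def main():
    log = []
    total, _ = await asyncio.gather(
        chunked_sum(list(range(1000)), 250, log),
        fetcher(4, log),
    )
    return total, log

total, log = asyncio.run(main())
print(total, log)
```

The log shows the two tasks taking turns. Without the await asyncio.sleep(0) inside chunked_sum, all the "compute" entries would be recorded before the fetcher got to run at all.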

The point of async I/O is to make it easy to write programs where there is lots of network I/O but very little actual computation (or disk I/O). That applies to any async library (Trio or asyncio) or even different languages (e.g. ASIO in C++). So your program is ideally unsuited to async I/O! You will need to use multiple threads (or processes). Although, in fairness, async I/O including Trio can be useful for coordinating work on threads, and that might work well in your case.
As VPfB's answer says, if you're using asyncio then you can use executors, specifically a ThreadPoolExecutor passed to loop.run_in_executor(). For Trio, the equivalent would be trio.to_thread.run_sync() (see also Threads (if you must) in the Trio docs), which is even easier to use. In both cases, you can await the result, so the function is running in a separate thread while the main Trio thread can continue running your async code. Your code would end up looking something like this:
async def matGenerator_Async(count):
    for _ in range(count):
        yield await trio.to_thread.run_sync(generateMat, dim)
async def my_trio_main():
    async with trio.open_nursery() as nursery:
        async for matrix in matGenerator_Async(testiters):
            nursery.start_soon(trio.to_thread.run_sync, computeOperation, matrix)
trio.run(my_trio_main)
There's no need for the computation functions (generateMat and computeOperation) to be async. In fact, it's problematic if they are because you could no longer run them in a separate thread. In general, only make a function async if it needs to await something or use async with or async for.
You can see from the above example how to pass data to the functions running in the other thread: just pass them as parameters to trio.to_thread.run_sync(), and they will be passed along as parameters to the function. Getting the result back from generateMat() is also straightforward - the return value of the function called in the other thread is returned from await trio.to_thread.run_sync(). Getting the result of computeOperation() is trickier, because it's called in the nursery, so its return value is thrown away. You'll need to pass a mutable parameter to it (like a dict) and stash the result in there. But be careful about thread safety; the easiest way to do that is to pass a new object to each coroutine, and only inspect them all after the nursery has finished.
A few final footnotes that you can probably ignore:
Just to be clear, yield await in the code above isn't some sort of special syntax. It's just await foo(), which returns a value once foo() has finished, followed by yield of that value.
You can change the number of threads Trio uses for calls to to_thread.run_sync() by passing a CapacityLimiter object, or by finding the default one and setting the count on that. It looks like the default is currently 40, so you might want to turn that down a bit, but it's probably not too important.
There is a common myth that Python doesn't support threads, or at least can't do computation in multiple threads simultaneously, because it has a single global lock (the global interpreter lock, or GIL). That would mean that you need to use multiple processes, rather than threads, for your program to really compute things in parallel. It's true there is a GIL in Python, but so long as you're doing your computation using something like numpy, which you are, then it doesn't stop multithreading from working effectively.
Trio actually has great support for async file I/O. But I don't think it would be helpful in your case.

To supplement my other answer (which uses Trio like you asked), here's how to do it using just threads, without any async library. The easiest way to do this is with Future objects and a ThreadPoolExecutor.
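import concurrent.futures
# matGenerator is assumed to be a plain (non-async) generator that yields generateMat(dim)
futures = []
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    for matrix in matGenerator(testiters):
        futures.append(executor.submit(computeOperation, matrix))
results = [f.result() for f in futures]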
The code is actually pretty similar to the async code, but if anything it's simpler. If you don't need to do network I/O, you're better off with this method.
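If the goal is specifically to fetch the next batch while computing on the current one, while holding at most one extra batch in memory, a sketch along these lines might work. fetch_batch and compute are hypothetical stand-ins (not from the question). The overlap only helps when the fetch is real I/O or the computation releases the GIL, as numpy's linear algebra routines do.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_batch(index):
    # Hypothetical stand-in for reading one chunk from disk.
    return list(range(index * 3, index * 3 + 3))

def compute(batch):
    # Hypothetical stand-in for the CPU heavy step.
    return sum(x * x for x in batch)

def run_pipeline(num_batches):
    results = []
    # One worker thread is enough: it only ever fetches one batch ahead.
    with ThreadPoolExecutor(max_workers=1) as executor:
        pending = executor.submit(fetch_batch, 0)
        for index in range(num_batches):
            batch = pending.result()  # wait for the prefetched batch
            if index + 1 < num_batches:
                # Start fetching the next batch before computing on this one.
                pending = executor.submit(fetch_batch, index + 1)
            results.append(compute(batch))
    return results

results = run_pipeline(4)
print(results)
```

Because the next fetch is only submitted after the previous one has been consumed, memory use stays bounded at two batches without needing a semaphore.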

I think the main issue with using multiprocessing and not seeing any improvement is the 100% utilization of the CPU. It essentially leaves you with an async-like behavior where resources are occasionally being freed up and used for the I/O process. You could set a limit to the number of workers for your ProcessPoolExecutor and that might allow the I/O the room it needs to be more at the ready.
Disclaimer: I'm still new to multiprocessing and threading.

Related

Why use multithreading when multiprocessing is available in python? [duplicate]

I found that in Python 3.4, there are few different libraries for multiprocessing/threading: multiprocessing vs threading vs asyncio.
But I don't know which one to use or is the "recommended one". Do they do the same thing, or are different? If so, which one is used for what? I want to write a program that uses multicores in my computer. But I don't know which library I should learn.

TL;DR
Making the Right Choice:
We have walked through the most popular forms of concurrency. But the question remains: when should you choose which one? It really depends on the use case. From my experience (and reading), I tend to follow this pseudo code:
if io_bound:
    if io_very_slow:
        print("Use Asyncio")
    else:
        print("Use Threads")
else:
    print("Multi Processing")
CPU Bound => Multi Processing
I/O Bound, Fast I/O, Limited Number of Connections => Multi Threading
I/O Bound, Slow I/O, Many connections => Asyncio
Reference
[NOTE]:
If you have long-running calls (e.g. a method that sleeps or does slow I/O), the best choice is a coroutine-based approach such as asyncio, Twisted or Tornado, which achieves concurrency within a single thread.
asyncio works on Python 3.4 and later.
Tornado and Twisted have been available since Python 2.7.
uvloop is an ultra fast asyncio event loop (uvloop makes asyncio 2-4x faster).
[UPDATE (2019)]:
Japronto (GitHub) is a very fast pipelining HTTP server based on uvloop.

They are intended for (slightly) different purposes and/or requirements. CPython (a typical, mainline Python implementation) still has the global interpreter lock, so a multi-threaded application (a standard way to implement parallel processing nowadays) is suboptimal. That's why multiprocessing may be preferred over threading. But not every problem may be effectively split into [almost independent] pieces, so there may be a need for heavy interprocess communication. That's why multiprocessing may not be preferred over threading in general.
asyncio (this technique is available not only in Python, other languages and/or frameworks also have it, e.g. Boost.ASIO) is a method to effectively handle a lot of I/O operations from many simultaneous sources w/o need of parallel code execution. So it's just a solution (a good one indeed!) for a particular task, not for parallel processing in general.

In multiprocessing you leverage multiple CPUs to distribute your calculations. Since each of the CPUs runs in parallel, you're effectively able to run multiple tasks simultaneously. You would want to use multiprocessing for CPU-bound tasks. An example would be trying to calculate a sum of all elements of a huge list. If your machine has 8 cores, you can "cut" the list into 8 smaller lists, calculate the sum of each of those lists separately on a separate core, and then just add up those numbers. You can get up to a ~8x speedup by doing that.
In (multi)threading you don't need multiple CPUs. Imagine a program that sends lots of HTTP requests to the web. If you used a single-threaded program, it would stop the execution (block) at each request, wait for a response, and then continue once it received a response. The problem here is that your CPU isn't really doing work while waiting for some external server to do the job; it could have actually done some useful work in the meantime! The fix is to use threads: you can create many of them, each responsible for requesting some content from the web. The nice thing about threads is that, even if they run on one CPU, the CPU from time to time "freezes" the execution of one thread and jumps to executing the other one (it's called context switching and it happens constantly at non-deterministic intervals). So if your task is I/O bound, use threading.
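A small sketch of the threading case, using only the standard library. fake_request is a made-up stand-in for an HTTP call; time.sleep plays the role of waiting on a remote server, and like real blocking I/O it lets other threads run while it waits.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(delay):
    # Stand-in for a blocking network call; the thread just waits.
    time.sleep(delay)
    return delay

delays = [0.2] * 5

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as executor:
    # All five "requests" wait at the same time instead of one after another.
    responses = list(executor.map(fake_request, delays))
elapsed = time.perf_counter() - start

print(responses, round(elapsed, 2))
```

Run one after another, the five calls would take about a second in total; with five threads the elapsed time should be close to a single 0.2 second wait.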
asyncio is essentially threading where not the CPU but you, as a programmer (or actually your application), decide where and when the context switch happens. In Python you use the await keyword to suspend the execution of your coroutine (defined using the async keyword).
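To illustrate that the switch only happens where you write await, here is a minimal sketch (worker and the events list are names invented for this example):

```python
import asyncio

async def worker(name, events):
    events.append(f"{name} start")
    # The only point where this coroutine hands control back to the loop.
    await asyncio.sleep(0.01)
    events.append(f"{name} end")

async def main():
    events = []
    await asyncio.gather(worker("a", events), worker("b", events))
    return events

events = asyncio.run(main())
print(events)
```

Both workers record their "start" before either records its "end", because each one runs uninterrupted until it reaches its await.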

This is the basic idea:
Is it IO-BOUND ? -----------> USE asyncio
IS IT CPU-HEAVY ? ---------> USE multiprocessing
ELSE ? ----------------------> USE threading
So basically stick to threading unless you have IO/CPU problems.

Many of the answers suggest how to choose only 1 option, but why not be able to use all 3? In this answer I explain how you can use asyncio to manage combining all 3 forms of concurrency instead as well as easily swap between them later if need be.
The short answer
Many developers who are new to concurrency in Python end up using multiprocessing.Process and threading.Thread. However, these are the low-level APIs, which have been unified under the high-level API provided by the concurrent.futures module. Furthermore, spawning processes and threads has overhead, such as requiring more memory, a problem which plagued one of the examples I show below. To an extent, concurrent.futures manages this for you: it spawns only a few processes or threads and re-uses them each time one finishes, so you cannot as easily do something like spawn a thousand processes and crash your computer.
These high-level APIs are provided through concurrent.futures.Executor, which are then implemented by concurrent.futures.ProcessPoolExecutor and concurrent.futures.ThreadPoolExecutor. In most cases, you should use these over the multiprocessing.Process and threading.Thread, because it's easier to change from one to the other in the future when you use concurrent.futures and you don't have to learn the detailed differences of each.
Since these share a unified interface, you'll also find that code using multiprocessing or threading will often use concurrent.futures. asyncio is no exception to this, and provides a way to use it via the following code:
import asyncio
from concurrent.futures import Executor
from functools import partial
from typing import Any, Callable, Optional, TypeVar
T = TypeVar("T")
async def run_in_executor(
    executor: Optional[Executor],
    func: Callable[..., T],
    /,
    *args: Any,
    **kwargs: Any,
) -> T:
    """
    Run `func(*args, **kwargs)` asynchronously, using an executor.
    If the executor is None, use the default ThreadPoolExecutor.
    """
    return await asyncio.get_running_loop().run_in_executor(
        executor,
        partial(func, *args, **kwargs),
    )
# Example usage for running `print` in a thread.
async def main():
    await run_in_executor(None, print, "O" * 100_000)
asyncio.run(main())
In fact it turns out that using threading with asyncio was so common that in Python 3.9 they added asyncio.to_thread(func, *args, **kwargs) to shorten it for the default ThreadPoolExecutor.
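A minimal sketch of asyncio.to_thread (Python 3.9+). blocking_square is a made-up function for illustration; it also reports whether it ran on the main thread, to show that the call really is handed off to a worker thread.

```python
import asyncio
import threading

def blocking_square(x):
    # An ordinary synchronous function; no async keywords needed here.
    on_main = threading.current_thread() is threading.main_thread()
    return x * x, on_main

async def main():
    # Runs blocking_square(7) in the default thread pool and awaits the result.
    return await asyncio.to_thread(blocking_square, 7)

value, ran_in_main_thread = asyncio.run(main())
print(value, ran_in_main_thread)
```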
The long answer
Are there any disadvantages to this approach?
Yes. With asyncio, the biggest disadvantage is that asynchronous functions aren't the same as synchronous functions. This can trip up new users of asyncio a lot and cause a lot of rework to be done if you didn't start programming with asyncio in mind from the beginning.
Another disadvantage is that users of your code will also become forced to use asyncio. All of this necessary rework will often leave first-time asyncio users with a really sour taste in their mouth.
Are there any non-performance advantages to this?
Yes. Similar to how using concurrent.futures is advantageous over threading.Thread and multiprocessing.Process for its unified interface, this approach can be considered a further abstraction from an Executor to an asynchronous function. You can start off using asyncio, and if later you find a part of it you need threading or multiprocessing, you can use asyncio.to_thread or run_in_executor. Likewise, you may later discover that an asynchronous version of what you're trying to run with threading already exists, so you can easily step back from using threading and switch to asyncio instead.
Are there any performance advantages to this?
Yes... and no. Ultimately it depends on the task. In some cases, it may not help (though it likely does not hurt), while in other cases it may help a lot. The rest of this answer provides some explanations as to why using asyncio to run an Executor may be advantageous.
- Combining multiple executors and other asynchronous code
asyncio essentially provides significantly more control over concurrency, at the cost that you need to manage the concurrency more yourself. If you want to simultaneously run some code using a ThreadPoolExecutor alongside some other code using a ProcessPoolExecutor, it is not so easy to manage this using synchronous code, but it is very easy with asyncio.
import asyncio
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
async def with_processing():
    with ProcessPoolExecutor() as executor:
        tasks = [...]
        for task in asyncio.as_completed(tasks):
            result = await task
            ...
async def with_threading():
    with ThreadPoolExecutor() as executor:
        tasks = [...]
        for task in asyncio.as_completed(tasks):
            result = await task
            ...
async def main():
    await asyncio.gather(with_processing(), with_threading())
asyncio.run(main())
How does this work? Essentially asyncio asks the executors to run their functions. Then, while an executor is running, asyncio will go run other code. For example, the ProcessPoolExecutor starts a bunch of processes, and then while waiting for those processes to finish, the ThreadPoolExecutor starts a bunch of threads. asyncio will then check in on these executors and collect their results when they are done. Furthermore, if you have other code using asyncio, you can run them while waiting for the processes and threads to finish.
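Here is one way the elided tasks = [...] lines might be filled in, as a sketch only. To keep it runnable anywhere, both pools are thread pools and square, cube and run_batch are names invented for this example; swapping one pool for a ProcessPoolExecutor should work the same way, provided the functions are picklable and the script has an if __name__ == "__main__" guard.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

def cube(x):
    return x ** 3

async def run_batch(executor, func, inputs):
    loop = asyncio.get_running_loop()
    # Each call is handed to the executor; we get awaitable futures back.
    tasks = [loop.run_in_executor(executor, func, x) for x in inputs]
    results = []
    for task in asyncio.as_completed(tasks):
        results.append(await task)  # collected in completion order
    return sorted(results)

async def main():
    with ThreadPoolExecutor(max_workers=2) as pool_a, \
            ThreadPoolExecutor(max_workers=2) as pool_b:
        # Both batches are in flight at the same time.
        return await asyncio.gather(
            run_batch(pool_a, square, [1, 2, 3]),
            run_batch(pool_b, cube, [1, 2, 3]),
        )

squares, cubes = asyncio.run(main())
print(squares, cubes)
```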
- Narrowing in on what sections of code needs executors
It is not common that you will have many executors in your code, but what is a common problem that I have seen when people use threads/processes is that they will shove the entirety of their code into a thread/process, expecting it to work. For example, I once saw the following code (approximately):
from concurrent.futures import ThreadPoolExecutor
import requests
def get_data(url):
    return requests.get(url).json()["data"]
urls = [...]
with ThreadPoolExecutor() as executor:
    for data in executor.map(get_data, urls):
        print(data)
The funny thing about this piece of code is that it was slower with concurrency than without. Why? Because the resulting json was large, and having many threads each consume a huge amount of memory was disastrous. Luckily the solution was simple:
from concurrent.futures import ThreadPoolExecutor
import requests
urls = [...]
with ThreadPoolExecutor() as executor:
    for response in executor.map(requests.get, urls):
        print(response.json()["data"])
Now only one json is loaded into memory at a time, and everything is fine.
The lesson here?
You shouldn't try to just slap all of your code into threads/processes, you should instead focus in on what part of the code actually needs concurrency.
But what if get_data was not a function as simple as this case? What if we had to apply the executor somewhere deep in the middle of the function? This is where asyncio comes in:
import asyncio
import requests
async def get_data(url):
    # A lot of code.
    ...
    # The specific part that needs threading.
    response = await asyncio.to_thread(requests.get, url, some_other_params)
    # A lot of code.
    ...
    return data
urls = [...]
async def main():
    tasks = [get_data(url) for url in urls]
    for task in asyncio.as_completed(tasks):
        data = await task
        print(data)
asyncio.run(main())
Attempting the same with concurrent.futures is by no means pretty. You could use things such as callbacks, queues, etc., but it would be significantly harder to manage than basic asyncio code.

Already a lot of good answers. I can't elaborate more on when to use each one. This is more an interesting combination of two: multiprocessing + asyncio: https://pypi.org/project/aiomultiprocess/.
The use case for which it was designed was high I/O while still utilizing as many of the available cores as possible. Facebook used this library to write some kind of Python-based file server. asyncio handles the I/O-bound traffic, while multiprocessing allows multiple event loops and threads on multiple cores.
Ex code from the repo:
import asyncio
from aiohttp import request
from aiomultiprocess import Pool
async def get(url):
    async with request("GET", url) as response:
        return await response.text("utf-8")
async def main():
    urls = ["https://jreese.sh", ...]
    async with Pool() as pool:
        async for result in pool.map(get, urls):
            ... # process result
if __name__ == '__main__':
    # Python 3.7
    asyncio.run(main())
    # Python 3.6
    # loop = asyncio.get_event_loop()
    # loop.run_until_complete(main())
Just an addition here: this would not work very well in, say, a Jupyter notebook, as the notebook already has an asyncio loop running. Just a little note so you don't pull your hair out.

I’m not a professional Python user, but as a student in computer architecture I think I can share some of my considerations when choosing between multiprocessing and multithreading. Besides, some of the other answers (even among those with higher votes) are misusing technical terminology, so I think it’s also necessary to clarify those as well, and I’ll do it first.
The fundamental difference between multiprocessing and multithreading is whether they share the same memory space. Threads share access to the same virtual memory space, so it is efficient and easy for threads to exchange their computation results (zero copy, and totally user-space execution).
Processes on the other hand have separate virtual memory spaces. They cannot directly read or write another process’ memory space, just like a person cannot read or alter the mind of another person without talking to him. (Allowing so would be a violation of memory protection and defeat the purpose of using virtual memory.) To exchange data between processes, they have to rely on the operating system’s facilities (e.g. message passing), and for more than one reason this is more costly than the “shared memory” scheme used by threads. One reason is that invoking the OS’ message passing mechanism requires making a system call, which switches the code execution from user mode to kernel mode and is time consuming; another reason is that the OS message passing scheme will likely have to copy the data bytes from the sender’s memory space to the receiver’s memory space, so there is a non-zero copy cost.
It is incorrect to say a multithread program can only use one CPU. The reason why many people say so is due to an artifact of the CPython implementation: global interpreter lock (GIL). Because of the GIL, threads in a CPython process are serialized. As a result, it appears that the multithreaded python program only uses one CPU.
But multi thread computer programs in general are not restricted to one core, and for Python, implementations that do not use the GIL can indeed run many threads in parallel, that is, run on more than one CPU at the same time. (See https://wiki.python.org/moin/GlobalInterpreterLock).
Given that CPython is the predominant implementation of Python, it’s understandable why multithreaded python programs are commonly equated to being bound to a single core.
With Python with GIL, the only way to unleash the power of multicores is to use multiprocessing (there are exceptions to this as mentioned below). But your problem better be easily partition-able into parallel sub-problems that have minimal intercommunication, otherwise a lot of inter-process communication will have to take place and as explained above, the overhead of using the OS’ message passing mechanism will be costly, sometimes so costly the benefits of parallel processing are totally offset. If the nature of your problem requires intense communication between concurrent routines, multithreading is the natural way to go. Unfortunately with CPython, true, effectively parallel multithreading is not possible due to the GIL. In this case you should realize Python is not the optimal tool for your project and consider using another language.
There’s one alternative solution, that is to implement the concurrent processing routines in an external library written in C (or other languages), and import that module to Python. The CPython GIL will not bother to block the threads spawned by that external library.
So, with the burdens of GIL, is multithreading in CPython any good? It still offers benefits though, as other answers have mentioned, if you’re doing IO or network communication. In these cases the relevant computation is not done by your CPU but done by other devices (in the case of IO, the disk controller and DMA (direct memory access) controller will transfer the data with minimal CPU participation; in the case of networking, the NIC (network interface card) and DMA will take care of much of the task without CPU’s participation), so once a thread delegates such task to the NIC or disk controller, the OS can put that thread to a sleeping state and switch to other threads of the same program to do useful work.
In my understanding, the asyncio module is essentially a specific case of multithreading for IO operations.
So:
CPU-intensive programs that can easily be partitioned to run on multiple processes with limited communication: use multithreading if the GIL does not exist (e.g. Jython), or use multiprocessing if the GIL is present (e.g. CPython).
CPU-intensive programs that require intensive communication between concurrent routines: use multithreading if the GIL does not exist, or use another programming language.
Lots of I/O: asyncio

Multiprocessing can run in parallel.
Multithreading and asyncio cannot run in parallel.
With an Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz and 32.0 GB RAM, I timed counting the prime numbers between 2 and 100000 with 2 processes, 2 threads and 2 asyncio tasks, as shown below. This is a CPU-bound calculation:
Multiprocessing: 23.87 seconds
Multithreading: 45.24 seconds
asyncio: 44.77 seconds
Because multiprocessing runs in parallel, it is about twice as fast as multithreading and asyncio, as shown above.
I used 3 sets of code below:
Multiprocessing:
# "process_test.py"
from multiprocessing import Process
import time
start_time = time.time()
def test():
    num = 100000
    primes = 0
    for i in range(2, num + 1):
        for j in range(2, i):
            if i % j == 0:
                break
        else:
            primes += 1
    print(primes)
if __name__ == "__main__": # This is needed to run processes on Windows
    process_list = []
    for _ in range(0, 2): # 2 processes
        process = Process(target=test)
        process_list.append(process)
    for process in process_list:
        process.start()
    for process in process_list:
        process.join()
    print(round((time.time() - start_time), 2), "seconds") # 23.87 seconds
Result:
...
9592
9592
23.87 seconds
Multithreading:
# "thread_test.py"
from threading import Thread
import time
start_time = time.time()
def test():
    num = 100000
    primes = 0
    for i in range(2, num + 1):
        for j in range(2, i):
            if i % j == 0:
                break
        else:
            primes += 1
    print(primes)
thread_list = []
for _ in range(0, 2): # 2 threads
    thread = Thread(target=test)
    thread_list.append(thread)
for thread in thread_list:
    thread.start()
for thread in thread_list:
    thread.join()
print(round((time.time() - start_time), 2), "seconds") # 45.24 seconds
Result:
...
9592
9592
45.24 seconds
Asyncio:
# "asyncio_test.py"
import asyncio
import time
start_time = time.time()
async def test():
    num = 100000
    primes = 0
    for i in range(2, num + 1):
        for j in range(2, i):
            if i % j == 0:
                break
        else:
            primes += 1
    print(primes)
async def call_tests():
    tasks = []
    for _ in range(0, 2): # 2 asyncio tasks
        tasks.append(test())
    await asyncio.gather(*tasks)
asyncio.run(call_tests())
print(round((time.time() - start_time), 2), "seconds") # 44.77 seconds
Result:
...
9592
9592
44.77 seconds

Multiprocessing
Each process has its own Python interpreter and can run on a separate core of a processor. Python multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers true parallelism, effectively side-stepping the Global Interpreter Lock by using sub processes instead of threads.
Use multiprocessing when you have CPU intensive tasks.
Multithreading
Python multithreading allows you to spawn multiple threads within a process. These threads share the memory and resources of the process. In CPython, due to the global interpreter lock (GIL), only a single thread can execute Python code at any given time, hence you cannot utilize multiple cores. Multithreading in Python does not offer true parallelism due to this GIL limitation.
Asyncio
Asyncio works on co-operative multitasking concepts. Asyncio tasks run on the same thread so there is no parallelism, but it provides better control to the developer instead of the OS which is the case in multithreading.
There is a nice discussion on this link regarding the advantages of asyncio over threads.
There is a nice blog by Lei Mao on Python concurrency here
Multiprocessing VS Threading VS AsyncIO in Python Summary

AsyncIO and concurrent.futures.ThreadPoolExecutor

I'm building a web scraping API, and most of my scraping is done with AsyncIO coroutines, like this:
async def parse_event(self):
    ...  # do scraping
# call the func
asyncio.run(b.parse_event())
This works perfectly fine, but as I'm scraping multiple websites at the same time, I was using concurrent.futures.ThreadPoolExecutor at first to scrape with multiple threads.
But since I've implemented the coroutine logic, I cannot now use the asyncio.run method in my thread directly.
Before (without coroutine):
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    w1_future = executor.submit(self.w1.parse_event)
    w2_future = executor.submit(self.w2.parse_event)
    w3_future = executor.submit(self.w3.parse_event)
After, I would have expected something like below
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    w1_future = executor.submit(asyncio.run(self.w1.parse_event))
    w2_future = executor.submit(asyncio.run(self.w2.parse_event))
    w3_future = executor.submit(asyncio.run(self.w3.parse_event))
Unfortunately it is not working.

Both asyncio and threading are a means to use a single core for concurrent operations. However, this works via different mechanisms: asyncio uses the cooperative concurrency of async/await whereas threading uses the preemptive concurrency of the GIL.
Mixing the two does not speed up execution since both still use the same single core; instead, overhead of both mechanisms will slow down the program and the interaction of the mechanisms complicates writing correct code.
To achieve concurrency between multiple tasks, submit them all to a single asyncio event loop. The equivalent to executor.submit is asyncio.create_task; multiple tasks can be submitted at once using asyncio.gather. Note that both are called inside the loop as opposed to outside the executor.
async def parse_all():
    return await asyncio.gather(
        # all the parsing tasks that should run concurrently;
        # each coroutine function is called to get an awaitable
        self.w1.parse_event(),
        self.w2.parse_event(),
        self.w3.parse_event(),
    )
asyncio.run(parse_all())
If you absolutely do want to use separate, threaded event loops for each parse, you must use executor.submit(func, *args) instead of executor.submit(func(args)).
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    w1_future = executor.submit(asyncio.run, self.w1.parse_event())
    w2_future = executor.submit(asyncio.run, self.w2.parse_event())
    w3_future = executor.submit(asyncio.run, self.w3.parse_event())
Note that mixing asyncio and threading adds complexity and constraints. You might want to use debug mode to detect some thread and context safety issues. However, thread and context safety constraints/guarantees are often not documented; manually test or inspect the operations for safety if needed.
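The two approaches can be compared in a self-contained sketch. Here `parse_event` is a hypothetical stand-in for the scraper coroutine (it only sleeps), and `parse_all_threaded` is an illustrative name, so treat this as an outline under those assumptions rather than the asker's actual code:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def parse_event(site):
    # Hypothetical stand-in for a real scraping coroutine.
    await asyncio.sleep(0.01)
    return f"parsed {site}"


async def parse_all(sites):
    # Single event loop: all coroutines run concurrently in one thread.
    return await asyncio.gather(*(parse_event(s) for s in sites))


def parse_all_threaded(sites):
    # One event loop per worker thread. The callable (asyncio.run) and its
    # argument are passed separately, so asyncio.run executes in the worker.
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(asyncio.run, parse_event(s)) for s in sites]
        return [f.result() for f in futures]


if __name__ == "__main__":
    sites = ["w1", "w2", "w3"]
    print(asyncio.run(parse_all(sites)))
    print(parse_all_threaded(sites))
```

Both functions return the same results; the single-loop version simply avoids the extra threads.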

Why is it that only asynchronous functions can yield in asynchronous code?

In the article "I'm not feeling the async pressure" Armin Ronacher makes the following observation:
In threaded code any function can yield. In async code only async functions can. This means for instance that the writer.write method cannot block.
This observation is made with reference to the following code sample:
from asyncio import start_server, run
async def on_client_connected(reader, writer):
    while True:
        data = await reader.readline()
        if not data:
            break
        writer.write(data)
async def server():
    srv = await start_server(on_client_connected, '127.0.0.1', 8888)
    async with srv:
        await srv.serve_forever()
run(server())
I do not understand this comment. Specifically:
How come synchronous functions cannot yield when inside of asynchronous functions?
What does yield have to do with blocking execution? Why is it that a function that cannot yield, cannot block?

Going line-by-line:
In threaded code any function can yield.
Programs running on a machine are organized in terms of processes. Each process may have one or more threads. Threads, like processes, are scheduled by (and interruptible by) the operating system. The word "yield" in this context means "letting other code run". When work is split between multiple threads, functions "yield" easily: the operating system suspends the code running in one thread, runs some code in a different thread, suspends that, comes back, and works some more on the first thread, and so on. By switching between threads in this way, concurrency is achieved.
In this execution model, whether the code being suspended is synchronous or asynchronous does not matter. The code within the thread is being run line-by-line, so the fundamental assumption of a synchronous function---that no changes occurred in between running one line of code and the next---is not violated.
In async code only async functions can.
"Async code" in this context means a single-threaded application that does the same work as the multi-threaded application, except that it achieves concurrency by using asynchronous functions within a thread, instead of splitting the work between different threads. In this execution model, your interpreter, not the operating system, is responsible for switching between functions as needed to achieve concurrency.
In this execution model, it is unsafe for work to be suspended in the middle of a synchronous function that's located inside of an asynchronous function. Doing so would mean running some other code in the middle of running your synchronous function, breaking the "line-by-line" assumption made by the synchronous function.
As a result, the interpreter will only suspend the execution of an asynchronous function in between synchronous sub-functions, never within one. This is what is meant by the statement that synchronous functions in async code cannot yield: once a synchronous function starts running, it must complete.
This means for instance that the writer.write method cannot block.
The writer.write method is synchronous, and hence, when run in an async program, uninterruptible. If this method were to block, it would block not just the asynchronous function it is running inside of, but the entire program. That would be bad. writer.write avoids blocking the program by writing to a write buffer instead and returning immediately.
Strictly speaking, writer.write can block, it's just inadvisable to do so.
If you need to block inside of an async function, the proper way to do so is to await another async function. This is what e.g. await writer.drain() does. This will block asynchronously: while this specific function remains blocked, it will correctly yield to other functions that can run.
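The difference between blocking synchronously and blocking asynchronously can be observed directly. The sketch below is illustrative only (`ticker` and `measure` are made-up names): a background task records a tick whenever the event loop gives it a turn, and we count how many ticks happen while the main coroutine waits with `time.sleep` versus `await asyncio.sleep`.

```python
import asyncio
import time


async def ticker(log, stop):
    # Appends a tick every time the event loop lets this task run.
    while not stop.is_set():
        log.append("tick")
        await asyncio.sleep(0.01)


async def measure(blocking):
    log, stop = [], asyncio.Event()
    task = asyncio.create_task(ticker(log, stop))
    await asyncio.sleep(0)  # let the ticker start and record its first tick
    before = len(log)
    if blocking:
        time.sleep(0.2)  # synchronous: the whole event loop is stuck
    else:
        await asyncio.sleep(0.2)  # asynchronous: other tasks keep running
    ticks_while_waiting = len(log) - before
    stop.set()
    await task
    return ticks_while_waiting


if __name__ == "__main__":
    print("ticks during time.sleep:", asyncio.run(measure(blocking=True)))
    print("ticks during asyncio.sleep:", asyncio.run(measure(blocking=False)))
```

During the synchronous sleep the ticker never gets a turn, while during the awaited sleep it keeps running.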

“Yield” here refers to cooperative multitasking (albeit within a process rather than among them). In the context of the async/await style of Python programming, asynchronous functions are defined in terms of Python’s pre-existing generator support: if a function blocks (typically for I/O), all its callers that are performing awaits suspend (with an invisible yield/yield from that is indeed of the generator variety). The actual call for any generator is to its next method; that function actually returns.
Every caller, up to some sort of driver that most programmers never write, must participate for this approach to work: any function that did not suspend would suddenly have the responsibility of the driver of deciding what to do next while waiting on the function it called to complete. This “infectious” aspect of asynchronicity has been called a “color”; it can be problematic, as for example when people forget to await a coroutine call that looks correct because it looks like any other call. (The async/await syntax exists to minimize the disruption of the program’s structure from the concurrency by implicitly converting functions into state machines, but this ambiguity remains.) It can also be a good thing: an asynchronous function can be interrupted exactly when it awaits, so it’s straightforward to reason about the consistency of data structures.
A synchronous function therefore cannot yield simply as a matter of definition. The import of the restriction is rather that a function called with a normal (synchronous) call cannot yield: its caller is not prepared to handle such an interaction. (What will happen if it does anyway is of course the same “forgotten await”.) This also affects refactoring: a function cannot be changed to be asynchronous without changing all its clients (and making them asynchronous as well if they are not already). (This is similar to how all I/O works in Haskell, since it affects the type of any function that performs any.)
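The "forgotten await" is easy to reproduce. In this small sketch (`fetch_value` is a hypothetical coroutine), calling the function without `await` merely creates a coroutine object, and nothing in its body runs:

```python
import asyncio


async def fetch_value():
    await asyncio.sleep(0)
    return 42


async def main():
    forgotten = fetch_value()  # looks like a normal call, but nothing ran yet
    is_coroutine = asyncio.iscoroutine(forgotten)
    forgotten.close()  # discard it cleanly to avoid a "never awaited" warning
    value = await fetch_value()  # the await actually drives the coroutine
    return is_coroutine, value


if __name__ == "__main__":
    print(asyncio.run(main()))  # (True, 42)
```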
Note that yield is allowed in its role as a normal generator used with an ordinary for even in an asynchronous function, but that’s just the general fact that the caller must expect the same protocol as the callee: if an enhanced generator (an “old-style” coroutine) is used with for, it just gets None from every (yield), and if an async function is used with for, it produces awaitables that probably break when they are sent None.
The distinction with threading, or with so-called stackful coroutines or fibers, is that no special resumption support is needed from the caller because the actual function call simply doesn’t return until the thread/fiber is resumed. (In the thread case, the kernel also chooses when to resume it.) In that sense, these approaches are easier to use, but with fibers the ability to “sneak” a pause into any function is partially compromised by the need to specify arguments to that function to tell it about the userspace scheduler with which to register itself (unless you’re willing to use global variables for that…). Threads, on the other hand, have even higher overhead than fibers, which matters when great numbers of them are running.

Python synchronous code example faster than async

I was migrating a production system to async when I realized the synchronous version is 20x faster than the async version. I was able to create a very simple example to demonstrate this in a repeatable way;
Asynchronous Version
import asyncio, time
data = {}
async def process_usage(key):
    data[key] = key
async def main():
    await asyncio.gather(*(process_usage(key) for key in range(0,1000000)))
s = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - s
print(f"Took {elapsed:0.2f} seconds.")
This takes 19 seconds. The code loops through 1M keys and builds a dictionary, data with the same key and value.
$ python3.7 async_test.py
Took 19.08 seconds.
Synchronous Version
import time
data = {}
def process_usage(key):
    data[key] = key
def main():
    for key in range(0,1000000):
        process_usage(key)
s = time.perf_counter()
results = main()
elapsed = time.perf_counter() - s
print(f"Took {elapsed:0.2f} seconds.")
This takes 0.17 seconds! And does exactly the same thing as above.
$ python3.7 test.py
Took 0.17 seconds.
Asynchronous Version with create_task
import asyncio, time
data = {}
async def process_usage(key):
    data[key] = key
async def main():
    for key in range(0,1000000):
        asyncio.create_task(process_usage(key))
s = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - s
print(f"Took {elapsed:0.2f} seconds.")
This version brings it down to 11 seconds.
$ python3.7 async_test2.py
Took 11.91 seconds.
Why does this happen?
In my production code I will have a blocking call in process_usage where I save the value of key to a redis database.

When comparing those benchmarks one should note that the asynchronous version is, well, asynchronous: asyncio spends considerable effort to ensure that the coroutines you submit can run concurrently. In your particular case they don't actually run concurrently because process_usage doesn't await anything, but the system doesn't know that. The synchronous version, on the other hand, makes no such provisions: it just runs everything sequentially, hitting the happy path of the interpreter.
A more reasonable comparison would be for the synchronous version to try to parallelize things in the way idiomatic for synchronous code: by using threads. Of course, you won't be able to create a separate thread for each process_usage because, unlike asyncio with its tasks, the OS won't allow you to create a million threads. But you can create a thread pool and feed it tasks:
import concurrent.futures
def main():
    with concurrent.futures.ThreadPoolExecutor() as executor:
        for key in range(0,1000000):
            executor.submit(process_usage, key)
        # at the end of "with" the executor automatically
        # waits for all futures to finish
On my system this takes ~17s, whereas the asyncio version takes ~18s. (The faster asyncio version takes ~13s.)
If the speed gain of asyncio is so small, one could ask why bother with asyncio? The difference is that with asyncio, assuming idiomatic code and IO-bound coroutines, you have at your disposal a virtually unlimited number of tasks that in a very real sense execute concurrently. You can create tens of thousands of asynchronous connections at the same time, and asyncio will happily juggle them all at once, using a high-quality poller and a scalable coroutine scheduler. With a thread pool the number of tasks executed in parallel is always limited by the number of threads in the pool, typically in the hundreds at most.
Even toy examples have value, for learning if nothing else. If you are using such microbenchmarks to make decisions, I suggest investing some more effort to give the examples more realism. The coroutine in the asyncio example should contain at least one await, and the sync example should use threads to emulate the same amount of parallelism you obtain with async. If you adjust both to match your actual use case, then the benchmark actually puts you in a position to make a (more) informed decision.
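A sketch of what a slightly more realistic benchmark might look like: the coroutine below awaits `asyncio.sleep(0.05)` as a hypothetical stand-in for a Redis round trip (the delay and the function names other than `process_usage` are assumptions made for illustration), and the same work is run once concurrently and once one call at a time.

```python
import asyncio
import time


async def process_usage(key, data):
    await asyncio.sleep(0.05)  # assumed stand-in for a network round trip
    data[key] = key


async def run_concurrently(n):
    data = {}
    await asyncio.gather(*(process_usage(k, data) for k in range(n)))
    return data


async def run_sequentially(n):
    data = {}
    for k in range(n):
        await process_usage(k, data)  # each call waits for the previous one
    return data


def timed(coro_func, n):
    start = time.perf_counter()
    data = asyncio.run(coro_func(n))
    return len(data), time.perf_counter() - start


if __name__ == "__main__":
    for func in (run_concurrently, run_sequentially):
        count, elapsed = timed(func, 20)
        print(f"{func.__name__}: {count} keys in {elapsed:0.2f} seconds")
```

With waits in the coroutine, the concurrent run takes roughly one delay in total, while the sequential run pays the delay once per key.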

Why does this happen?
TL;DR
Because using asyncio by itself doesn't speed up code. You need multiple gathered network I/O operations to see a difference compared with the synchronous version.
Detailed
asyncio is not magic that lets you speed up arbitrary code. With or without asyncio, your code is still run by a CPU with limited performance.
asyncio is a way to manage multiple execution flows (coroutines) in a nice, clear way. Multiple execution flows allow you to start the next I/O-related operation (such as a request to a database) without waiting for the previous one to complete. Please read this answer for a more detailed explanation.
Please also read this answer for an explanation of when it makes sense to use asyncio.
Once you start using asyncio the right way, the overhead of using it should be much lower than the benefit you get from parallelizing I/O operations.

asyncio with multiple processors [duplicate]

As almost everyone is aware when they first look at threading in Python, there is the GIL that makes life miserable for people who actually want to do processing in parallel - or at least give it a chance.
I am currently looking at implementing something like the Reactor pattern. Effectively I want to listen for incoming socket connections on one thread-like, and when someone tries to connect, accept that connection and pass it along to another thread-like for processing.
I'm not (yet) sure what kind of load I might be facing. I know there is currently a 2MB cap set up on incoming messages. Theoretically we could get thousands per second (though I don't know if practically we've seen anything like that). The amount of time spent processing a message isn't terribly important, though obviously quicker would be better.
I was looking into the Reactor pattern, and developed a small example using the multiprocessing library that (at least in testing) seems to work just fine. However, now/soon we'll have the asyncio library available, which would handle the event loop for me.
Is there anything that could bite me by combining asyncio and multiprocessing?

You should be able to safely combine asyncio and multiprocessing without too much trouble, though you shouldn't be using multiprocessing directly. The cardinal sin of asyncio (and any other event-loop based asynchronous framework) is blocking the event loop. If you try to use multiprocessing directly, any time you block to wait for a child process, you're going to block the event loop. Obviously, this is bad.
The simplest way to avoid this is to use BaseEventLoop.run_in_executor to execute a function in a concurrent.futures.ProcessPoolExecutor. ProcessPoolExecutor is a process pool implemented using multiprocessing.Process, but asyncio has built-in support for executing a function in it without blocking the event loop. Here's a simple example:
import time
import asyncio
from concurrent.futures import ProcessPoolExecutor
def blocking_func(x):
    time.sleep(x)  # Pretend this is expensive calculations
    return x * 5
async def main():
    #pool = multiprocessing.Pool()
    #out = pool.apply(blocking_func, args=(10,)) # This blocks the event loop.
    loop = asyncio.get_running_loop()
    executor = ProcessPoolExecutor()
    out = await loop.run_in_executor(executor, blocking_func, 10)  # This does not
    print(out)
if __name__ == "__main__":
    asyncio.run(main())
For the majority of cases, this function alone is good enough. If you find yourself needing other constructs from multiprocessing, like Queue, Event, Manager, etc., there is a third-party library called aioprocessing (full disclosure: I wrote it), that provides asyncio-compatible versions of all the multiprocessing data structures. Here's an example demoing that:
import time
import asyncio
import aioprocessing
import multiprocessing
def func(queue, event, lock, items):
    # Runs in its own process and uses ordinary blocking calls.
    with lock:
        event.set()
        for item in items:
            time.sleep(3)
            queue.put(item+5)
    queue.close()
async def example(queue, event, lock):
    l = [1,2,3,4,5]
    p = aioprocessing.AioProcess(target=func, args=(queue, event, lock, l))
    p.start()
    while True:
        result = await queue.coro_get()
        if result is None:
            break
        print("Got result {}".format(result))
    await p.coro_join()
async def example2(queue, event, lock):
    await event.coro_wait()
    async with lock:
        await queue.coro_put(78)
        await queue.coro_put(None)  # Shut down the worker
async def main():
    queue = aioprocessing.AioQueue()
    lock = aioprocessing.AioLock()
    event = aioprocessing.AioEvent()
    # Run both coroutines concurrently on the same event loop.
    await asyncio.gather(
        example(queue, event, lock),
        example2(queue, event, lock),
    )
if __name__ == "__main__":
    asyncio.run(main())

Yes, there are quite a few bits that may (or may not) bite you.
When you run something like asyncio it expects to run on one thread or process. This does not (by itself) work with parallel processing. You somehow have to distribute the work while leaving the IO operations (specifically those on sockets) in a single thread/process.
While your idea to hand off individual connections to a different handler process is nice, it is hard to implement. The first obstacle is that you need a way to pull the connection out of asyncio without closing it. The next obstacle is that you cannot simply send a file descriptor to a different process unless you use platform-specific (probably Linux) code from a C-extension.
Note that the multiprocessing module is known to create a number of threads for communication. Most of the time when you use communication structures (such as Queues), a thread is spawned. Unfortunately those threads are not completely invisible. For instance they can fail to tear down cleanly (when you intend to terminate your program), but depending on their number the resource usage may be noticeable on its own.
If you really intend to handle individual connections in individual processes, I suggest to examine different approaches. For instance you can put a socket into listen mode and then simultaneously accept connections from multiple worker processes in parallel. Once a worker is finished processing a request, it can go accept the next connection, so you still use less resources than forking a process for each connection. Spamassassin and Apache (mpm prefork) can use this worker model for instance. It might end up easier and more robust depending on your use case. Specifically you can make your workers die after serving a configured number of requests and be respawned by a master process thereby eliminating much of the negative effects of memory leaks.
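The pre-forked worker model described above can be sketched with the standard library alone. This is a toy outline under stated assumptions: a Unix-like platform where the `fork` start method is available, an echo protocol invented for the example, and illustrative names (`worker`, `start_workers`, `ask`). It has no master process that respawns workers.

```python
import multiprocessing as mp
import os
import socket


def worker(listener, max_requests):
    # Every worker blocks in accept() on the same listening socket; the
    # kernel hands each incoming connection to exactly one of them.
    for _ in range(max_requests):
        conn, _addr = listener.accept()
        with conn:
            data = conn.recv(1024)
            conn.sendall(str(os.getpid()).encode() + b": " + data)
    # Returning ends the process; a real master would respawn it here.


def start_workers(n_workers=2, max_requests=2):
    listener = socket.create_server(("127.0.0.1", 0))  # port 0: OS picks one
    port = listener.getsockname()[1]
    ctx = mp.get_context("fork")  # assumption: fork() is available
    procs = [
        ctx.Process(target=worker, args=(listener, max_requests), daemon=True)
        for _ in range(n_workers)
    ]
    for p in procs:
        p.start()
    listener.close()  # the parent drops its copy; the workers keep theirs
    return port, procs


def ask(port, message):
    with socket.create_connection(("127.0.0.1", port), timeout=5) as sock:
        sock.sendall(message)
        return sock.recv(1024)


if __name__ == "__main__":
    port, procs = start_workers()
    for i in range(4):
        print(ask(port, b"hello %d" % i))
    for p in procs:
        p.join()
```

Each worker exits after serving `max_requests` connections, which mirrors the "die after a configured number of requests" idea; the respawning master is left out for brevity.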

Based on @dano's answer above I wrote this function to replace places where I used to use multiprocess pool + map.
import asyncio
from concurrent.futures import ProcessPoolExecutor
from typing import Callable
def asyncio_friendly_multiproc_map(fn: Callable, l: list):
    """
    This is designed to replace the use of this pattern:
        with multiprocessing.Pool(5) as p:
            results = p.map(analyze_day, list_of_days)
    By letting caller drop in replace:
        asyncio_friendly_multiproc_map(analyze_day, list_of_days)
    """
    tasks = []
    with ProcessPoolExecutor(5) as executor:
        for e in l:
            tasks.append(asyncio.get_event_loop().run_in_executor(executor, fn, e))
        res = asyncio.get_event_loop().run_until_complete(asyncio.gather(*tasks))
    return res
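Because the helper above calls run_until_complete, it is meant to be called from synchronous code; calling it from inside a coroutine on an already running loop raises a RuntimeError. A possible coroutine variant is sketched below. The name `executor_map` and its `executor` parameter are hypothetical; with `executor=None` the loop's default thread pool is used, and a `ProcessPoolExecutor` can be passed for CPU-bound work.

```python
import asyncio
import math


async def executor_map(fn, items, executor=None):
    # executor=None means the event loop's default ThreadPoolExecutor;
    # pass a concurrent.futures.ProcessPoolExecutor for CPU-bound functions.
    loop = asyncio.get_running_loop()
    futures = [loop.run_in_executor(executor, fn, item) for item in items]
    # gather returns the results in the same order as the inputs.
    return await asyncio.gather(*futures)


async def main():
    return await executor_map(math.sqrt, [1, 4, 9])


if __name__ == "__main__":
    print(asyncio.run(main()))  # [1.0, 2.0, 3.0]
```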

See PEP 3156, in particular the section on Thread interaction:
http://www.python.org/dev/peps/pep-3156/#thread-interaction
This documents clearly the new asyncio methods you might use, including run_in_executor(). Note that the Executor is defined in concurrent.futures, I suggest you also have a look there.
