Python synchronous code example faster than async

I was migrating a production system to async when I realized the synchronous version is 20x faster than the async version. I was able to create a very simple example that demonstrates this in a repeatable way:
Asynchronous Version
import asyncio, time
data = {}
async def process_usage(key):
    data[key] = key
async def main():
    await asyncio.gather(*(process_usage(key) for key in range(0,1000000)))
s = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - s
print(f"Took {elapsed:0.2f} seconds.")
This takes 19 seconds. The code loops through 1M keys and builds a dictionary, data, in which each key maps to itself.
$ python3.7 async_test.py
Took 19.08 seconds.
Synchronous Version
import time
data = {}
def process_usage(key):
    data[key] = key
def main():
    for key in range(0,1000000):
        process_usage(key)
s = time.perf_counter()
results = main()
elapsed = time.perf_counter() - s
print(f"Took {elapsed:0.2f} seconds.")
This takes 0.17 seconds! And does exactly the same thing as above.
$ python3.7 test.py
Took 0.17 seconds.
Asynchronous Version with create_task
import asyncio, time
data = {}
async def process_usage(key):
    data[key] = key
async def main():
    for key in range(0,1000000):
        asyncio.create_task(process_usage(key))
s = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - s
print(f"Took {elapsed:0.2f} seconds.")
This version brings it down to 11 seconds.
$ python3.7 async_test2.py
Took 11.91 seconds.
Why does this happen?
In my production code I will have a blocking call in process_usage where I save the value of key to a redis database.

When comparing those benchmarks one should note that the asynchronous version is, well, asynchronous: asyncio spends considerable effort to ensure that the coroutines you submit can run concurrently. In your particular case they don't actually run concurrently because process_usage doesn't await anything, but the system doesn't actually know that. The synchronous version, on the other hand, makes no such provisions: it just runs everything sequentially, hitting the happy path of the interpreter.
A more reasonable comparison would be for the synchronous version to try to parallelize things in the way idiomatic for synchronous code: by using threads. Of course, you won't be able to create a separate thread for each process_usage because, unlike asyncio with its tasks, the OS won't allow you to create a million threads. But you can create a thread pool and feed it tasks:
import concurrent.futures
def main():
    with concurrent.futures.ThreadPoolExecutor() as executor:
        for key in range(0,1000000):
            executor.submit(process_usage, key)
        # at the end of "with" the executor automatically
        # waits for all futures to finish
On my system this takes ~17s, whereas the asyncio version takes ~18s. (The faster asyncio version takes ~13s.)
If the speed gain of asyncio is so small, one could ask why bother with asyncio? The difference is that with asyncio, assuming idiomatic code and IO-bound coroutines, you have at your disposal a virtually unlimited number of tasks that in a very real sense execute concurrently. You can create tens of thousands of asynchronous connections at the same time, and asyncio will happily juggle them all at once, using a high-quality poller and a scalable coroutine scheduler. With a thread pool the number of tasks executed in parallel is always limited by the number of threads in the pool, typically in the hundreds at most.
Even toy examples have value, for learning if nothing else. If you are using such microbenchmarks to make decisions, I suggest investing some more effort to give the examples more realism. The coroutine in the asyncio example should contain at least one await, and the sync example should use threads to emulate the same amount of parallelism you obtain with async. If you adjust both to match your actual use case, then the benchmark actually puts you in a position to make a (more) informed decision.
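To make the suggestion above concrete, here is a minimal sketch of a fairer benchmark. It assumes the Redis call can be approximated by a 10 ms delay (asyncio.sleep in the coroutine, time.sleep in the thread pool version). N_CALLS, IO_DELAY, the max_workers value and the helper names are illustrative assumptions, not taken from the production code.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

N_CALLS = 200     # assumed number of simulated I/O calls
IO_DELAY = 0.01   # assumed latency of one simulated I/O call, in seconds


async def process_usage_async(key, data):
    await asyncio.sleep(IO_DELAY)  # stands in for an awaited Redis call
    data[key] = key


def process_usage_sync(key, data):
    time.sleep(IO_DELAY)  # stands in for a blocking Redis call
    data[key] = key


async def run_async():
    data = {}
    # All coroutines wait for their simulated I/O at the same time.
    await asyncio.gather(*(process_usage_async(k, data) for k in range(N_CALLS)))
    return data


def run_threaded(max_workers=20):
    data = {}
    # Concurrency is capped by the number of worker threads.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for k in range(N_CALLS):
            executor.submit(process_usage_sync, k, data)
    return data


if __name__ == "__main__":
    start = time.perf_counter()
    asyncio.run(run_async())
    print(f"asyncio: {time.perf_counter() - start:0.2f} seconds")

    start = time.perf_counter()
    run_threaded()
    print(f"threads: {time.perf_counter() - start:0.2f} seconds")
```

With real I/O the numbers depend on the latency of the backend and on the pool size, so both should be adjusted to match the actual workload before drawing conclusions.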

Why does this happen?
TL;DR
Because using asyncio by itself doesn't speed up code. You need multiple network I/O operations gathered together to see a difference compared with the synchronous version.
Detailed
asyncio is not magic that lets you speed up arbitrary code. With or without asyncio, your code is still run by a CPU with limited performance.
asyncio is a way to manage multiple execution flows (coroutines) in a nice, clear way. Multiple execution flows allow you to start the next I/O-related operation (such as a request to a database) without waiting for the previous one to complete. Please read this answer for a more detailed explanation.
Please also read this answer for an explanation of when it makes sense to use asyncio.
Once you start using asyncio the right way, the overhead of using it should be much lower than the benefit you get from parallelizing I/O operations.
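As a rough illustration of the point about starting the next I/O operation without waiting for the previous one, the sketch below compares awaiting simulated requests one by one with gathering them. fake_request and its delay are hypothetical stand-ins for real network calls.

```python
import asyncio
import time


async def fake_request(delay):
    # Hypothetical stand-in for a network call that takes `delay` seconds.
    await asyncio.sleep(delay)
    return delay


async def sequential(delays):
    # Each request starts only after the previous one has finished.
    return [await fake_request(d) for d in delays]


async def concurrent(delays):
    # All requests are started first, then awaited together.
    return await asyncio.gather(*(fake_request(d) for d in delays))


def timed(coro_fn, delays):
    start = time.perf_counter()
    result = asyncio.run(coro_fn(delays))
    return result, time.perf_counter() - start


if __name__ == "__main__":
    delays = [0.1, 0.1, 0.1]
    _, seq_time = timed(sequential, delays)
    _, con_time = timed(concurrent, delays)
    print(f"sequential: {seq_time:0.2f}s, gathered: {con_time:0.2f}s")
```

The sequential version takes roughly the sum of the delays, while the gathered version takes roughly the longest single delay.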

Related

Why use multithreading when multiprocessing is available in python? [duplicate]

I found that in Python 3.4, there are few different libraries for multiprocessing/threading: multiprocessing vs threading vs asyncio.
But I don't know which one to use or is the "recommended one". Do they do the same thing, or are different? If so, which one is used for what? I want to write a program that uses multicores in my computer. But I don't know which library I should learn.

TL;DR
Making the Right Choice:
We have walked through the most popular forms of concurrency. But the question remains: when should you choose which one? It really depends on the use case. From my experience (and reading), I tend to follow this pseudo code:
if io_bound:
    if io_very_slow:
        print("Use Asyncio")
    else:
        print("Use Threads")
else:
    print("Multi Processing")
CPU Bound => Multi Processing
I/O Bound, Fast I/O, Limited Number of Connections => Multi Threading
I/O Bound, Slow I/O, Many connections => Asyncio
Reference
[NOTE]:
If you have long-running calls (e.g. a method that sleeps or performs slow I/O), the best choice is a coroutine-based approach such as asyncio, Twisted or Tornado, which provides concurrency on a single thread.
asyncio works on Python 3.4 and later.
Tornado and Twisted have been available since Python 2.7.
uvloop is an ultra fast asyncio event loop (uvloop makes asyncio 2-4x faster).
[UPDATE (2019)]:
Japronto (GitHub) is a very fast pipelining HTTP server based on uvloop.

They are intended for (slightly) different purposes and/or requirements. CPython (the typical, mainline Python implementation) still has the global interpreter lock, so a multi-threaded application (a standard way to implement parallel processing nowadays) is suboptimal. That's why multiprocessing may be preferred over threading. But not every problem can be effectively split into [almost independent] pieces, so heavy interprocess communication may be needed. That's why multiprocessing may not be preferred over threading in general.
asyncio (this technique is available not only in Python, other languages and/or frameworks also have it, e.g. Boost.ASIO) is a method to effectively handle a lot of I/O operations from many simultaneous sources w/o need of parallel code execution. So it's just a solution (a good one indeed!) for a particular task, not for parallel processing in general.

In multiprocessing you leverage multiple CPUs to distribute your calculations. Since each of the CPUs runs in parallel, you're effectively able to run multiple tasks simultaneously. You would want to use multiprocessing for CPU-bound tasks. An example would be trying to calculate a sum of all elements of a huge list. If your machine has 8 cores, you can "cut" the list into 8 smaller lists and calculate the sum of each of those lists separately on separate core and then just add up those numbers. You'll get a ~8x speedup by doing that.
In (multi)threading you don't need multiple CPUs. Imagine a program that sends lots of HTTP requests to the web. If you used a single-threaded program, it would stop the execution (block) at each request, wait for a response, and then continue once it has received the response. The problem here is that your CPU isn't really doing work while waiting for some external server to do the job; it could have actually done some useful work in the meantime! The fix is to use threads: you can create many of them, each responsible for requesting some content from the web. The nice thing about threads is that, even if they run on one CPU, the CPU from time to time "freezes" the execution of one thread and jumps to executing another one (this is called context switching, and it happens constantly at non-deterministic intervals). So if your task is I/O bound, use threading.
asyncio is essentially threading where not the CPU but you, as a programmer (or actually your application), decide where and when the context switch happens. In Python you use the await keyword to suspend the execution of your coroutine (defined using the async keyword).
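The idea that await marks the only places where a switch can happen can be seen in a small sketch. The worker coroutine and the log list are hypothetical names used only for illustration; asyncio.sleep(0) is used as the simplest possible suspension point.

```python
import asyncio


async def worker(name, steps, log):
    for step in range(steps):
        log.append(f"{name}:{step}")
        # The coroutine gives up control only here, at the await.
        await asyncio.sleep(0)


async def main():
    log = []
    await asyncio.gather(worker("A", 2, log), worker("B", 2, log))
    return log


if __name__ == "__main__":
    print(asyncio.run(main()))  # ['A:0', 'B:0', 'A:1', 'B:1']
```

Without the await inside the loop, worker A would run all of its steps before worker B got a chance to start.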

This is the basic idea:
Is it IO-BOUND ? -----------> USE asyncio
IS IT CPU-HEAVY ? ---------> USE multiprocessing
ELSE ? ----------------------> USE threading
So basically stick to threading unless you have IO/CPU problems.

Many of the answers suggest how to choose only 1 option, but why not be able to use all 3? In this answer I explain how you can use asyncio to combine all 3 forms of concurrency, and to easily swap between them later if need be.
The short answer
Many developers who are new to concurrency in Python will end up using multiprocessing.Process and threading.Thread. However, these are the low-level APIs, which have been merged together by the high-level API provided by the concurrent.futures module. Furthermore, spawning processes and threads has overhead, such as requiring more memory, a problem which plagued one of the examples I show below. To an extent, concurrent.futures manages this for you: instead of letting you spawn a thousand processes and crash your computer, it spawns only a few processes and re-uses each one as it finishes its work.
These high-level APIs are provided through concurrent.futures.Executor, which is implemented by concurrent.futures.ProcessPoolExecutor and concurrent.futures.ThreadPoolExecutor. In most cases, you should use these over multiprocessing.Process and threading.Thread, because it's easier to change from one to the other in the future when you use concurrent.futures, and you don't have to learn the detailed differences of each.
Since these share a unified interface, you'll also find that code using multiprocessing or threading will often use concurrent.futures. asyncio is no exception to this, and provides a way to use it via the following code:
import asyncio
from concurrent.futures import Executor
from functools import partial
from typing import Any, Callable, Optional, TypeVar
T = TypeVar("T")
async def run_in_executor(
    executor: Optional[Executor],
    func: Callable[..., T],
    /,
    *args: Any,
    **kwargs: Any,
) -> T:
    """
    Run `func(*args, **kwargs)` asynchronously, using an executor.
    If the executor is None, use the default ThreadPoolExecutor.
    """
    return await asyncio.get_running_loop().run_in_executor(
        executor,
        partial(func, *args, **kwargs),
    )
# Example usage for running `print` in a thread.
async def main():
    await run_in_executor(None, print, "O" * 100_000)
asyncio.run(main())
In fact it turns out that using threading with asyncio was so common that in Python 3.9 they added asyncio.to_thread(func, *args, **kwargs) to shorten it for the default ThreadPoolExecutor.
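A minimal sketch of asyncio.to_thread (Python 3.9+) in use might look as follows. blocking_lookup is a hypothetical blocking function standing in for something like a synchronous database or HTTP client call.

```python
import asyncio
import time


def blocking_lookup(key):
    # Hypothetical blocking call; time.sleep stands in for real I/O.
    time.sleep(0.05)
    return key * 2


async def main():
    # Each call runs in the default ThreadPoolExecutor; gather keeps input order.
    return await asyncio.gather(
        *(asyncio.to_thread(blocking_lookup, key) for key in range(5))
    )


if __name__ == "__main__":
    print(asyncio.run(main()))  # [0, 2, 4, 6, 8]
```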
The long answer
Are there any disadvantages to this approach?
Yes. With asyncio, the biggest disadvantage is that asynchronous functions aren't the same as synchronous functions. This can trip up new users of asyncio a lot and cause a lot of rework to be done if you didn't start programming with asyncio in mind from the beginning.
Another disadvantage is that users of your code will also become forced to use asyncio. All of this necessary rework will often leave first-time asyncio users with a really sour taste in their mouth.
Are there any non-performance advantages to this?
Yes. Similar to how using concurrent.futures is advantageous over threading.Thread and multiprocessing.Process for its unified interface, this approach can be considered a further abstraction from an Executor to an asynchronous function. You can start off using asyncio, and if later you find a part of it you need threading or multiprocessing, you can use asyncio.to_thread or run_in_executor. Likewise, you may later discover that an asynchronous version of what you're trying to run with threading already exists, so you can easily step back from using threading and switch to asyncio instead.
Are there any performance advantages to this?
Yes... and no. Ultimately it depends on the task. In some cases, it may not help (though it likely does not hurt), while in other cases it may help a lot. The rest of this answer provides some explanations as to why using asyncio to run an Executor may be advantageous.
- Combining multiple executors and other asynchronous code
asyncio essentially provides significantly more control over concurrency, at the cost of requiring you to manage that concurrency yourself. If you want to simultaneously run some code using a ThreadPoolExecutor alongside some other code using a ProcessPoolExecutor, it is not so easy to manage this using synchronous code, but it is very easy with asyncio.
import asyncio
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
async def with_processing():
    with ProcessPoolExecutor() as executor:
        tasks = [...]
        for task in asyncio.as_completed(tasks):
            result = await task
            ...
async def with_threading():
    with ThreadPoolExecutor() as executor:
        tasks = [...]
        for task in asyncio.as_completed(tasks):
            result = await task
            ...
async def main():
    await asyncio.gather(with_processing(), with_threading())
asyncio.run(main())
How does this work? Essentially asyncio asks the executors to run their functions. Then, while an executor is running, asyncio will go run other code. For example, the ProcessPoolExecutor starts a bunch of processes, and then while waiting for those processes to finish, the ThreadPoolExecutor starts a bunch of threads. asyncio will then check in on these executors and collect their results when they are done. Furthermore, if you have other code using asyncio, you can run them while waiting for the processes and threads to finish.
- Narrowing in on what sections of code needs executors
It is not common that you will have many executors in your code, but what is a common problem that I have seen when people use threads/processes is that they will shove the entirety of their code into a thread/process, expecting it to work. For example, I once saw the following code (approximately):
from concurrent.futures import ThreadPoolExecutor
import requests
def get_data(url):
    return requests.get(url).json()["data"]
urls = [...]
with ThreadPoolExecutor() as executor:
    for data in executor.map(get_data, urls):
        print(data)
The funny thing about this piece of code is that it was slower with concurrency than without. Why? Because the resulting json was large, and having many threads consume a huge amount of memory was disastrous. Luckily the solution was simple:
from concurrent.futures import ThreadPoolExecutor
import requests
urls = [...]
with ThreadPoolExecutor() as executor:
    for response in executor.map(requests.get, urls):
        print(response.json()["data"])
Now only one JSON document is decoded into memory at a time, and everything is fine.
The lesson here?
You shouldn't try to just slap all of your code into threads/processes. Instead, focus on the part of the code that actually needs concurrency.
But what if get_data was not a function as simple as this case? What if we had to apply the executor somewhere deep in the middle of the function? This is where asyncio comes in:
import asyncio
import requests
async def get_data(url):
    # A lot of code.
    ...
    # The specific part that needs threading.
    response = await asyncio.to_thread(requests.get, url, some_other_params)
    # A lot of code.
    ...
    return data
urls = [...]
async def main():
    tasks = [get_data(url) for url in urls]
    for task in asyncio.as_completed(tasks):
        data = await task
        print(data)
asyncio.run(main())
Attempting the same with concurrent.futures is by no means pretty. You could use things such as callbacks, queues, etc., but it would be significantly harder to manage than basic asyncio code.

There are already a lot of good answers, and I can't elaborate further on when to use each one. This is more of an interesting combination of two, multiprocessing + asyncio: https://pypi.org/project/aiomultiprocess/.
The use case for which it was designed was high I/O, while still utilizing as many of the available cores as possible. Facebook used this library to write some kind of Python-based file server. asyncio handles the I/O-bound traffic, while multiprocessing allows multiple event loops and threads on multiple cores.
Ex code from the repo:
import asyncio
from aiohttp import request
from aiomultiprocess import Pool
async def get(url):
    async with request("GET", url) as response:
        return await response.text("utf-8")
async def main():
    urls = ["https://jreese.sh", ...]
    async with Pool() as pool:
        async for result in pool.map(get, urls):
            ... # process result
if __name__ == '__main__':
    # Python 3.7
    asyncio.run(main())
    # Python 3.6
    # loop = asyncio.get_event_loop()
    # loop.run_until_complete(main())
Just an addition here: this would not work very well in, say, a Jupyter notebook, as the notebook already has an asyncio loop running. Just a little note so you don't pull your hair out.

I’m not a professional Python user, but as a student in computer architecture I think I can share some of my considerations when choosing between multiprocessing and multithreading. Besides, some of the other answers (even among those with higher votes) misuse technical terminology, so I think it’s also necessary to clarify those points, and I’ll do that first.
The fundamental difference between multiprocessing and multithreading is whether they share the same memory space. Threads share access to the same virtual memory space, so it is efficient and easy for threads to exchange their computation results (zero copy, and totally user-space execution).
Processes, on the other hand, have separate virtual memory spaces. They cannot directly read or write another process’ memory space, just like a person cannot read or alter the mind of another person without talking to them. (Allowing this would be a violation of memory protection and defeat the purpose of using virtual memory.) To exchange data between processes, they have to rely on the operating system’s facilities (e.g. message passing), and for more than one reason this is more costly than the “shared memory” scheme used by threads. One reason is that invoking the OS’ message passing mechanism requires making a system call, which switches the code execution from user mode to kernel mode, and that is time consuming; another reason is that the OS message passing scheme will likely have to copy the data bytes from the sender’s memory space to the receiver’s memory space, so there is a non-zero copy cost.
It is incorrect to say a multithread program can only use one CPU. The reason why many people say so is due to an artifact of the CPython implementation: global interpreter lock (GIL). Because of the GIL, threads in a CPython process are serialized. As a result, it appears that the multithreaded python program only uses one CPU.
But multi thread computer programs in general are not restricted to one core, and for Python, implementations that do not use the GIL can indeed run many threads in parallel, that is, run on more than one CPU at the same time. (See https://wiki.python.org/moin/GlobalInterpreterLock).
Given that CPython is the predominant implementation of Python, it’s understandable why multithreaded python programs are commonly equated to being bound to a single core.
With Python with GIL, the only way to unleash the power of multicores is to use multiprocessing (there are exceptions to this as mentioned below). But your problem better be easily partition-able into parallel sub-problems that have minimal intercommunication, otherwise a lot of inter-process communication will have to take place and as explained above, the overhead of using the OS’ message passing mechanism will be costly, sometimes so costly the benefits of parallel processing are totally offset. If the nature of your problem requires intense communication between concurrent routines, multithreading is the natural way to go. Unfortunately with CPython, true, effectively parallel multithreading is not possible due to the GIL. In this case you should realize Python is not the optimal tool for your project and consider using another language.
There’s one alternative solution, that is to implement the concurrent processing routines in an external library written in C (or other languages), and import that module to Python. The CPython GIL will not bother to block the threads spawned by that external library.
So, with the burdens of GIL, is multithreading in CPython any good? It still offers benefits though, as other answers have mentioned, if you’re doing IO or network communication. In these cases the relevant computation is not done by your CPU but done by other devices (in the case of IO, the disk controller and DMA (direct memory access) controller will transfer the data with minimal CPU participation; in the case of networking, the NIC (network interface card) and DMA will take care of much of the task without CPU’s participation), so once a thread delegates such task to the NIC or disk controller, the OS can put that thread to a sleeping state and switch to other threads of the same program to do useful work.
In my understanding, the asyncio module is essentially a specific case of multithreading for IO operations.
So:
CPU-intensive programs, that can easily be partitioned to run on multiple processes with limited communication: Use multithreading if GIL does not exist (eg Jython), or use multiprocess if GIL is present (eg CPython).
CPU-intensive programs, that requires intensive communication between concurrent routines: Use multithreading if GIL does not exist, or use another programming language.
Lots of IO: asyncio
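The claim that threads still help with I/O under the GIL can be checked with a small sketch. fake_io is a hypothetical stand-in for a blocking read; time.sleep is used because, like blocking I/O calls in CPython, it releases the GIL while waiting.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(seconds):
    # Hypothetical blocking I/O call; the GIL is released while sleeping.
    time.sleep(seconds)
    return seconds


def run_sequential(delays):
    return [fake_io(d) for d in delays]


def run_threaded(delays):
    # One worker per call, so all waits overlap.
    with ThreadPoolExecutor(max_workers=len(delays)) as executor:
        return list(executor.map(fake_io, delays))


def timed(fn, delays):
    start = time.perf_counter()
    result = fn(delays)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    delays = [0.1] * 4
    _, seq_time = timed(run_sequential, delays)
    _, thr_time = timed(run_threaded, delays)
    print(f"sequential: {seq_time:0.2f}s, threaded: {thr_time:0.2f}s")
```

Replacing fake_io with a pure-Python CPU-bound loop would remove most of the gain on a GIL-based interpreter, which is the distinction the summary above draws.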

Multiprocessing can run in parallel.
Multithreading and asyncio cannot run in parallel.
With an Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz and 32.0 GB RAM, I timed counting the prime numbers between 2 and 100000 with 2 processes, 2 threads and 2 asyncio tasks, as shown below. *This is a CPU-bound calculation:
Multiprocessing | Multithreading | asyncio
23.87 seconds   | 45.24 seconds  | 44.77 seconds
Because multiprocessing can run in parallel, it is about twice as fast as multithreading and asyncio, as shown above.
I used 3 sets of code below:
Multiprocessing:
# "process_test.py"
from multiprocessing import Process
import time
start_time = time.time()
def test():
    num = 100000
    primes = 0
    for i in range(2, num + 1):
        for j in range(2, i):
            if i % j == 0:
                break
        else:
            primes += 1
    print(primes)
if __name__ == "__main__": # This is needed to run processes on Windows
    process_list = []
    for _ in range(0, 2): # 2 processes
        process = Process(target=test)
        process_list.append(process)
    for process in process_list:
        process.start()
    for process in process_list:
        process.join()
    print(round((time.time() - start_time), 2), "seconds") # 23.87 seconds
Result:
...
9592
9592
23.87 seconds
Multithreading:
# "thread_test.py"
from threading import Thread
import time
start_time = time.time()
def test():
    num = 100000
    primes = 0
    for i in range(2, num + 1):
        for j in range(2, i):
            if i % j == 0:
                break
        else:
            primes += 1
    print(primes)
thread_list = []
for _ in range(0, 2): # 2 threads
    thread = Thread(target=test)
    thread_list.append(thread)
for thread in thread_list:
    thread.start()
for thread in thread_list:
    thread.join()
print(round((time.time() - start_time), 2), "seconds") # 45.24 seconds
Result:
...
9592
9592
45.24 seconds
Asyncio:
# "asyncio_test.py"
import asyncio
import time
start_time = time.time()
async def test():
    num = 100000
    primes = 0
    for i in range(2, num + 1):
        for j in range(2, i):
            if i % j == 0:
                break
        else:
            primes += 1
    print(primes)
async def call_tests():
    tasks = []
    for _ in range(0, 2): # 2 asyncio tasks
        tasks.append(test())
    await asyncio.gather(*tasks)
asyncio.run(call_tests())
print(round((time.time() - start_time), 2), "seconds") # 44.77 seconds
Result:
...
9592
9592
44.77 seconds

Multiprocessing
Each process has its own Python interpreter and can run on a separate core of a processor. Python multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers true parallelism, effectively side-stepping the Global Interpreter Lock by using sub processes instead of threads.
Use multiprocessing when you have CPU intensive tasks.
Multithreading
Python multithreading allows you to spawn multiple threads within a process. These threads share the memory and resources of the process. In CPython, due to the global interpreter lock (GIL), only a single thread can run at any given time, hence you cannot utilize multiple cores. Multithreading in Python does not offer true parallelism due to this GIL limitation.
Asyncio
Asyncio works on co-operative multitasking concepts. Asyncio tasks run on the same thread so there is no parallelism, but it provides better control to the developer instead of the OS which is the case in multithreading.
There is a nice discussion on this link regarding the advantages of asyncio over threads.
There is a nice blog by Lei Mao on Python concurrency here
Multiprocessing VS Threading VS AsyncIO in Python Summary

Python Asyncio/Trio for Asynchronous Computing/Fetching

I am looking for a way to efficiently fetch a chunk of values from disk, and then perform computation/calculations on the chunk. My thought was a for loop that would run the disk fetching task first, then run the computation on the fetched data. I want to have my program fetch the next batch as it is running the computation so I don't have to wait for another data fetch every time a computation completes. I expect the computation will take longer than the fetching of the data from disk, and likely cannot be done truly in parallel due to a single computation task already pinning the cpu usage at near 100%.
I have provided some code below in python using trio (but could alternatively be used with asyncio to the same effect) to illustrate my best attempt at performing this operation with async programming:
import trio
import numpy as np
from datetime import datetime as dt
import time
testiters=10
dim = 6000
def generateMat(arrlen):
    for _ in range(30):
        retval= np.random.rand(arrlen, arrlen)
    # print("matrix generated")
    return retval
def computeOpertion(matrix):
    return np.linalg.inv(matrix)
def runSync():
    for _ in range(testiters):
        mat=generateMat(dim)
        result=computeOpertion(mat)
    return result
async def matGenerator_Async(count):
    for _ in range(count):
        yield generateMat(dim)
async def computeOpertion_Async(matrix):
    return computeOpertion(matrix)
async def runAsync():
    async with trio.open_nursery() as nursery:
        async for value in matGenerator_Async(testiters):
            nursery.start_soon(computeOpertion_Async,value)
            #await computeOpertion_Async(value)
print("Sync:")
start=dt.now()
runSync()
print(dt.now()-start)
print("Async:")
start=dt.now()
trio.run(runAsync)
print(dt.now()-start)
This code will simulate getting data from disk by generating 30 random matrices, which uses a small amount of cpu. It will then perform matrix inversion on the generated matrix, which uses 100% cpu (with openblas/mkl configuration in numpy). I compare the time taken to run the tasks by timing the synchronous and asynchronous operations.
From what I can tell, both jobs take exactly the same amount of time to finish, meaning the async operation did not speed up the execution. Observing the behavior of each computation, the sequential operation runs the fetch and computation in order and the async operation runs all the fetches first, then all the computations afterwards.
Is there a way to asynchronously fetch and compute? Perhaps with futures or something like gather()? Asyncio has these functions, and trio has them in a separate package, trio_future. I am also open to solutions via other methods (threads and multiprocessing).
I believe that there likely exists a solution with multiprocessing that can make the disk reading operation run in a separate process. However, inter-process communication and blocking then becomes a hassle, as I would need some sort of semaphore to control how many blocks could be generated at a time due to memory constraints, and multiprocessing tends to be quite heavy and slow.
EDIT
Thank you VPfB for your answer. I am not able to sleep(0) in the operation, but I think even if I did, it would necessarily block the computation in favor of performing disk operations. I think this may be a hard limitation of python threading and asyncio, that it can only execute 1 thread at a time. Running two different processes simultaneously is impossible if both require anything but waiting for some external resource to respond from your CPU.
Perhaps there is a way with an executor for a multiprocessing pool. I have added the following code below:
import asyncio
import concurrent.futures
async def asynciorunAsync():
    loop = asyncio.get_running_loop()
    with concurrent.futures.ProcessPoolExecutor() as pool:
        async for value in matGenerator_Async(testiters):
            result = await loop.run_in_executor(pool, computeOpertion,value)
print("Async with PoolExecutor:")
start=dt.now()
asyncio.run(asynciorunAsync())
print(dt.now()-start)
Timing this, however, it still takes the same amount of time as the synchronous example. I think I will have to go with a more involved solution, as it seems that async and await are too crude a tool to properly do this type of task switching.

I don't work with trio; my answer is asyncio based.
Under these circumstances the only way to improve the asyncio performance I see is to break the computation into smaller pieces and insert await sleep(0) between them. This would allow the data fetching task to run.
Asyncio uses cooperative scheduling. A synchronous CPU bound routine does not cooperate, it blocks everything else while it is running.
sleep() always suspends the current task, allowing other tasks to run.
Setting the delay to 0 provides an optimized path to allow other tasks
to run. This can be used by long-running functions to avoid blocking
the event loop for the full duration of the function call.
(quoted from: asyncio.sleep)
If that is not possible, try to run the computation in an executor. This adds some multi-threading capabilities to otherwise pure asyncio code.
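As a rough sketch of the first suggestion, the CPU-bound loop below yields to the event loop with await asyncio.sleep(0) after every chunk of work, which lets a second task make progress in between. count_primes, fetcher, the chunk size and the events list are hypothetical names chosen for this illustration; the fetcher only simulates awaiting I/O.

```python
import asyncio


async def count_primes(limit, events, chunk=1000):
    primes = 0
    for n in range(2, limit + 1):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            primes += 1
        if n % chunk == 0:
            # Give other tasks a chance to run after each chunk of work.
            await asyncio.sleep(0)
    events.append("compute done")
    return primes


async def fetcher(n_batches, events):
    for i in range(n_batches):
        await asyncio.sleep(0)  # stands in for awaiting real I/O
        events.append(f"fetch {i}")


async def main():
    events = []
    primes, _ = await asyncio.gather(
        count_primes(10_000, events), fetcher(3, events)
    )
    return primes, events


if __name__ == "__main__":
    primes, events = asyncio.run(main())
    print(primes, events)
```

The fetches are recorded before the computation finishes only because the computation cooperates; without the sleep(0) calls the fetcher would not run until the whole loop had completed.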

The point of async I/O is to make it easy to write programs where there is lots of network I/O but very little actual computation (or disk I/O). That applies to any async library (Trio or asyncio) or even different languages (e.g. ASIO in C++). So your program is ideally unsuited to async I/O! You will need to use multiple threads (or processes). Although, in fairness, async I/O including Trio can be useful for coordinating work on threads, and that might work well in your case.
As VPfB's answer says, if you're using asyncio then you can use executors, specifically a ThreadPoolExecutor passed to loop.run_in_executor(). For Trio, the equivalent would be trio.to_thread.run_sync() (see also Threads (if you must) in the Trio docs), which is even easier to use. In both cases, you can await the result, so the function is running in a separate thread while the main Trio thread can continue running your async code. Your code would end up looking something like this:
# Assumes generateMat, dim and testiters from the question, and its computation
# function (spelled computeOpertion there) available as computeOperation.
async def matGenerator_Async(count):
    for _ in range(count):
        yield await trio.to_thread.run_sync(generateMat, dim)
async def my_trio_main():
    async with trio.open_nursery() as nursery:
        async for matrix in matGenerator_Async(testiters):
            nursery.start_soon(trio.to_thread.run_sync, computeOperation, matrix)
trio.run(my_trio_main)
There's no need for the computation functions (generateMat and computeOperation) to be async. In fact, it's problematic if they are because you could no longer run them in a separate thread. In general, only make a function async if it needs to await something or use async with or async for.
You can see from the above example how to pass data to the functions running in the other thread: just pass them as parameters to trio.to_thread.run_sync(), and they will be passed along as parameters to the function. Getting the result back from generateMat() is also straightforward - the return value of the function called in the other thread is returned from await trio.to_thread.run_sync(). Getting the result of computeOperation() is trickier, because it's called in the nursery, so its return value is thrown away. You'll need to pass a mutable parameter to it (like a dict) and stash the result in there. But be careful about thread safety; the easiest way to do that is to pass a new object to each coroutine, and only inspect them all after the nursery has finished.
A few final footnotes that you can probably ignore:
Just to be clear, yield await in the code above isn't some sort of special syntax. It's just await foo(), which returns a value once foo() has finished, followed by yield of that value.
You can change the number of threads Trio uses for calls to to_thread.run_sync() by passing a CapacityLimiter object, or by finding the default one and setting the count on that. It looks like the default is currently 40, so you might want to turn that down a bit, but it's probably not too important.
There is a common myth that Python doesn't support threads, or at least can't do computation in multiple threads simultaneously, because it has a single global lock (the global interpreter lock, or GIL). That would mean that you need to use multiple processes, rather than threads, for your program to really compute things in parallel. It's true there is a GIL in Python, but so long as you're doing your computation using something like numpy, which you are, then it doesn't stop multithreading from working effectively.
Trio actually has great support for async file I/O. But I don't think it would be helpful in your case.

To supplement my other answer (which uses Trio like you asked), here's how to do it using just threads, without any async library. The easiest way to do this is with Future objects and a ThreadPoolExecutor.
import concurrent.futures

futures = []
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    for matrix in matGenerator(testiters):
        # submit() returns immediately with a Future for the pending result.
        futures.append(executor.submit(computeOperation, matrix))
# result() blocks until the corresponding computation has finished.
results = [f.result() for f in futures]
The code is actually pretty similar to the async code, but if anything it's simpler. If you don't need to do network I/O, you're better off with this method.

I think the main issue with using multiprocessing and not seeing any improvement is the 100% utilization of the CPU. It essentially leaves you with an async-like behavior where resources are occasionally being freed up and used for the I/O process. You could set a limit to the number of workers for your ProcessPoolExecutor and that might allow the I/O the room it needs to be more at the ready.
Disclaimer: I'm still new to multiprocessing and threading.

python asyncio.gather vs asyncio.as_completed when IO task followed by CPU-bound task

I have a program workflow as follows: 1. IO-bound (web page fetch) -> 2. cpu-bound (processing information) -> 3. IO-bound (writing results to database).
I'm presently using aiohttp to fetch the web pages. I am currently using asyncio.as_completed to gather Step 1 tasks and pass them to Step 2 as completed. My concern is that this may be interfering with the completion of Step 1 tasks by consuming cpu resources and blocking program flow in Step 2.
I've tried to use ProcessPoolExecutor to farm out the Step 2 tasks to other processes, but Step 2 tasks uses non-pickleable data structures and functions. I've tried ThreadPoolExecutor, and while that worked (e.g. it didn't crash), it is my understanding that doing so for CPU-bound tasks is counter-productive.
Because the workflow has an intermediate cpu-bound task, would it be more efficient to use asyncio.gather (instead of asyncio.as_completed) to complete all of the Step 1 processes before moving on to Step 2?
Sample asyncio.as_completed code:
async with ClientSession() as session:
    tasks = {self.fetch(session, url) for url in self.urls}
    for task in asyncio.as_completed(tasks):
        raw_data = await asyncio.shield(task)
        data = self.extract_data(*raw_data)
        await self.store_data(data)
Sample asyncio.gather code:
async with ClientSession() as session:
    tasks = {self.fetch(session, url) for url in self.urls}
    results = await asyncio.gather(*tasks)
    for result in results:
        data = self.extract_data(*result)
        await self.store_data(data)
Preliminary tests with limited samples show as_completed to be slightly more efficient than gather: ~2.98s (as_completed) vs ~3.15s (gather). But is there an asyncio conceptual issue that would favor one solution over another?

"I've tried ThreadPoolExecutor, [...] it is my understanding that doing so for CPU-bound tasks is counter-productive." - it is counterproductive in the sense that you won't have two such tasks running Python code in parallel on multiple CPU cores. Otherwise, it will work to free up your asyncio loop so it can continue working, even if Python code is only executing for one task at a time.
If you can't pickle things to a subprocess, running the CPU-bound tasks in a ThreadPoolExecutor is good enough.
Otherwise, just sprinkle your CPU code with some await asyncio.sleep(0) (inside the loops) and run them normally as coroutines: that is enough for a CPU-bound task not to lock the asyncio loop.
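To make the ThreadPoolExecutor suggestion concrete, here is a minimal sketch of the three-step workflow from the question. The functions fetch, extract_data and store_data are hypothetical stand-ins for the aiohttp fetch, the CPU-bound processing and the database write; only the run_in_executor call is the point.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

async def fetch(url):
    """Hypothetical stand-in for the aiohttp fetch in step 1."""
    await asyncio.sleep(0.01)
    return url, "raw page body for " + url

def extract_data(url, body):
    """Hypothetical CPU-bound step 2; a plain function so it can run in a thread."""
    return url, len(body)

async def store_data(store, data):
    """Hypothetical stand-in for the database write in step 3."""
    await asyncio.sleep(0)
    store.append(data)

async def pipeline(urls):
    loop = asyncio.get_running_loop()
    store = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        tasks = [asyncio.ensure_future(fetch(u)) for u in urls]
        for finished in asyncio.as_completed(tasks):
            raw = await finished
            # Step 2 runs in a worker thread; the loop keeps servicing fetches.
            data = await loop.run_in_executor(pool, extract_data, *raw)
            await store_data(store, data)
    return store

if __name__ == "__main__":
    print(asyncio.run(pipeline(["a", "b", "c"])))
```

Because the event loop is only waiting on the executor future, the remaining fetches keep progressing while step 2 runs, which addresses the concern about step 2 stalling step 1.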

Reserve cpu-time in multiprocessing or multithreading application

I'm working on a project with a Raspberry Pi 3 for some environmental control with a number of simple recurring events in a continuous loop. The RP3 is way overqualified for this job, but it allows me to focus on other stuff.
Characteristics of the application:
The application should collect sensor data (at a variable interval of n seconds) from a dozen sensors (temperature, humidity, pH, ORP, etc).
Based on time and this sensor data, the controller calculates output (switches, valves and PWM-drivers).
Almost no events need to run sequentially.
Some events are in the category "safety" and should run instantly when triggered (fail-safe sensors, emergency button).
Most events run repeatedly at an interval of seconds (from every second up to every 30 seconds).
Some events trigger an action, activating a relay for 1 to 120 seconds.
Some events use a time-based value. This value needs to be calculated several times a day and is fairly CPU intensive (it uses a few iterative interpolating formulas, and therefore has a variable runtime).
Display with environment status (in a continuous loop)
I'm familiar (not by profession) with VB.NET, but decided to do this project in Python 3.6.
Over the last few months I have read a lot about subjects like design patterns, threads, processes, events, parallel processing, etc.
Based on my reading I think Asyncio combined with some tasks in an Executor will do the job.
Most tasks/events are not time-critical. Controller output can use 'most recent' sensordata.
Some tasks, on the other hand, activate a relay for a certain period of time. I would like to know how to program these tasks without the chance that another 'time consuming' task blocks the processor during the period of time that (for example) a CO2 valve is open. This could be disastrous for my environment.
So I need some advice.
See below for my code so far. I'm not sure I make correct use of the Asyncio functions in Python.
For the sake of readability, I will store the contents of the various tasks in separate modules.
import asyncio
import concurrent.futures
import datetime
import time
import random
import math

# define a task...
async def firstTask():
    while True:
        await asyncio.sleep(1)
        print("First task executed")

# define another task...
async def secondTask():
    while True:
        await asyncio.sleep(5)
        print("Second Worker Executed")

# define/simulate heavy CPU-bound task
def heavy_load():
    while True:
        print('Heavy_load started')
        i = 0
        for i in range(50000000):
            f = math.sqrt(i)*math.sqrt(i)
        print('Heavy_load finished')
        time.sleep(4)

def main():
    # Create a process pool (for CPU bound tasks).
    processpool = concurrent.futures.ProcessPoolExecutor()
    # Create a thread pool (for I/O bound tasks).
    threadpool = concurrent.futures.ThreadPoolExecutor()
    loop = asyncio.get_event_loop()
    try:
        # Add all tasks. (Correct use?)
        asyncio.ensure_future(firstTask())
        asyncio.ensure_future(secondTask())
        loop.run_in_executor(processpool, heavy_load)
        loop.run_forever()
    except KeyboardInterrupt:
        pass
    finally:
        print("Loop will be ended")
        loop.close()

if __name__ == '__main__':
    main()

Most tasks/events are not time-critical. Controller output can use 'most recent' sensordata. Some tasks, on the other hand, activate a relay for a certain period of time. I would like to know how to program these tasks without the chance that another 'time consuming' task blocks the processor during the period of time that (for example) a CO2 valve is open. This could be disastrous for my environment.
Allow me to stress that Python is not a real-time language, and asyncio is not a real-time component. They sport neither the infrastructure for real-time execution (Python is garbage collected and typically runs on time-shared systems), nor have they been tested in such environments in practice. Consequently I would strongly advise against using them in any scenario where a misstep could be disastrous for your environment.
With that out of the way, your code has a problem: while the heavy_load calculation will not block the event loop, it will never complete either, nor will it provide information on its progress. The idea behind run_in_executor is that the calculation you are running will eventually halt, and that the event loop will want to be notified about it. Idiomatic usage of run_in_executor could look like this:
def do_heavy_calc(param):
    print('Heavy_load started')
    f = 0
    for i in range(50000000):
        f += math.sqrt(i)*math.sqrt(i)
    return f

def heavy_calc(param):
    loop = asyncio.get_event_loop()
    # Forward param to the worker; run_in_executor returns an awaitable future.
    return loop.run_in_executor(processpool, do_heavy_calc, param)
The expression heavy_calc(...) not only runs without blocking the event loop, but it is also awaitable. That means that asynchronous code can await its result, also without blocking other coroutines:
async def sum_params(p1, p2):
    s1 = await heavy_calc(p1)
    s2 = await heavy_calc(p2)
    return s1 + s2
The above runs the two calculations one after the other. It can also be done in parallel:
async def sum_params_parallel(p1, p2):
    s1, s2 = await asyncio.gather(heavy_calc(p1), heavy_calc(p2))
    return s1 + s2
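The difference between the sequential and the parallel variant can be observed with a small timing sketch. To keep it self-contained, this version makes two assumptions that differ from the answer: it uses the loop's default thread pool (executor None) instead of processpool, and do_heavy_calc is replaced by a time.sleep stand-in. Sleeping releases the GIL, so threads suffice here; a real pure-Python computation would need the process pool to run in parallel.

```python
import asyncio
import time

def do_heavy_calc(param):
    time.sleep(0.2)  # assumed stand-in for the real computation
    return param * 2

def heavy_calc(param):
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor.
    return loop.run_in_executor(None, do_heavy_calc, param)

async def sum_params(p1, p2):
    s1 = await heavy_calc(p1)
    s2 = await heavy_calc(p2)
    return s1 + s2

async def sum_params_parallel(p1, p2):
    s1, s2 = await asyncio.gather(heavy_calc(p1), heavy_calc(p2))
    return s1 + s2

async def timed(coro):
    """Await coro and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = await coro
    return result, time.perf_counter() - start

async def main():
    seq = await timed(sum_params(1, 2))
    par = await timed(sum_params_parallel(1, 2))
    return seq, par

(seq_result, seq_time), (par_result, par_time) = asyncio.run(main())
print(f"sequential: {seq_result} in {seq_time:.2f}s")
print(f"parallel:   {par_result} in {par_time:.2f}s")
```

Both variants return the same sum; the sequential one takes roughly the sum of the two delays, while the gather version takes roughly the longer of the two.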
Another thing that could improve is the setup code:
asyncio.ensure_future(firstTask())
asyncio.ensure_future(secondTask())
loop.run_in_executor(processpool, heavy_load)
loop.run_forever()
Calling asyncio.ensure_future and then never awaiting the result is somewhat of an asyncio anti-pattern. Exceptions raised by unawaited tasks are silently swallowed, which is almost certainly not something you'd want. Sometimes people simply forget to write await, which is why asyncio complains about unawaited pending tasks when the loop is destroyed.
It is good coding practice to arrange for every task to be awaited by someone, either immediately with await or gather to combine it with other tasks, or at a later point. For instance, if the task needs to run in the background, you can store it somewhere and await or cancel it at the end of the application lifecycle. In your case, I would combine gather with loop.run_until_complete:
everything = asyncio.gather(firstTask(), secondTask(),
                            loop.run_in_executor(processpool, heavy_load))
loop.run_until_complete(everything)
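The following sketch shows why awaiting the tasks matters: a failure in one task surfaces at the gather call, and the remaining tasks can then be cancelled in an orderly way. sensor_loop and display_loop are hypothetical tasks invented for the illustration, and the failure on the third iteration is simulated.

```python
import asyncio

async def sensor_loop(readings):
    """Hypothetical periodic task that fails on its third iteration."""
    for i in range(5):
        await asyncio.sleep(0.01)
        if i == 2:
            raise RuntimeError("sensor read failed")
        readings.append(i)

async def display_loop():
    """Hypothetical task that would otherwise run forever."""
    while True:
        await asyncio.sleep(0.01)

async def main():
    readings = []
    error = None
    tasks = [asyncio.ensure_future(sensor_loop(readings)),
             asyncio.ensure_future(display_loop())]
    try:
        # Awaiting the tasks means a failure in any of them is raised here.
        await asyncio.gather(*tasks)
    except RuntimeError as exc:
        error = str(exc)
    finally:
        # Cancel whatever is still running and wait for it to unwind.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
    return readings, error

if __name__ == "__main__":
    print(asyncio.run(main()))
```

Had the tasks only been started with ensure_future and never awaited, the RuntimeError would not have reached any code that could react to it.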

Some events are in the category "safety" and should run instantly when triggered (fail-safe sensors, emergency button).
Then I strongly advise you not to rely on software to fulfil this function. Emergency stop buttons that cut off the power are the way that kind of thing is normally done. If you have software doing that, and you genuinely have a threat-to-life situation to handle, you're in for a whole pile of woe - there's almost certainly a ton of regulations with which you have to comply.

asyncio with multiple processors [duplicate]

As almost everyone is aware when they first look at threading in Python, there is the GIL that makes life miserable for people who actually want to do processing in parallel - or at least give it a chance.
I am currently looking at implementing something like the Reactor pattern. Effectively I want to listen for incoming socket connections on one thread-like, and when someone tries to connect, accept that connection and pass it along to another thread-like for processing.
I'm not (yet) sure what kind of load I might be facing. I know there is currently a 2MB cap set up on incoming messages. Theoretically we could get thousands per second (though I don't know if practically we've seen anything like that). The amount of time spent processing a message isn't terribly important, though obviously quicker would be better.
I was looking into the Reactor pattern, and developed a small example using the multiprocessing library that (at least in testing) seems to work just fine. However, now/soon we'll have the asyncio library available, which would handle the event loop for me.
Is there anything that could bite me by combining asyncio and multiprocessing?

You should be able to safely combine asyncio and multiprocessing without too much trouble, though you shouldn't be using multiprocessing directly. The cardinal sin of asyncio (and any other event-loop based asynchronous framework) is blocking the event loop. If you try to use multiprocessing directly, any time you block to wait for a child process, you're going to block the event loop. Obviously, this is bad.
The simplest way to avoid this is to use BaseEventLoop.run_in_executor to execute a function in a concurrent.futures.ProcessPoolExecutor. ProcessPoolExecutor is a process pool implemented using multiprocessing.Process, but asyncio has built-in support for executing a function in it without blocking the event loop. Here's a simple example:
import time
import asyncio
from concurrent.futures import ProcessPoolExecutor

def blocking_func(x):
    time.sleep(x) # Pretend this is expensive calculations
    return x * 5

@asyncio.coroutine
def main():
    #pool = multiprocessing.Pool()
    #out = pool.apply(blocking_func, args=(10,)) # This blocks the event loop.
    executor = ProcessPoolExecutor()
    out = yield from loop.run_in_executor(executor, blocking_func, 10) # This does not
    print(out)

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
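The example above uses generator-based coroutines, which were removed in Python 3.11. A sketch of the same idea with async/await syntax might look like the following. The helper name offload is hypothetical, and the delay is shortened to 0.1 seconds so the sketch runs quickly.

```python
import asyncio
import time
from concurrent.futures import ProcessPoolExecutor

def blocking_func(x):
    time.sleep(0.1)  # Pretend this is an expensive calculation
    return x * 5

async def offload(executor, x):
    loop = asyncio.get_running_loop()
    # Awaiting the future suspends only this coroutine, not the event loop.
    return await loop.run_in_executor(executor, blocking_func, x)

async def main():
    # The pool is shut down when the with block exits.
    with ProcessPoolExecutor() as executor:
        print(await offload(executor, 10))
```

To run it as a script, call asyncio.run(main()) under an if __name__ == "__main__": guard, which process pools need on platforms that spawn rather than fork their workers. Passing None as the executor makes run_in_executor fall back to the loop's default thread pool.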
For the majority of cases, this function alone is good enough. If you find yourself needing other constructs from multiprocessing, like Queue, Event, Manager, etc., there is a third-party library called aioprocessing (full disclosure: I wrote it), that provides asyncio-compatible versions of all the multiprocessing data structures. Here's an example demoing that:
import time
import asyncio
import aioprocessing
import multiprocessing

def func(queue, event, lock, items):
    with lock:
        event.set()
        for item in items:
            time.sleep(3)
            queue.put(item+5)
    queue.close()

@asyncio.coroutine
def example(queue, event, lock):
    l = [1,2,3,4,5]
    p = aioprocessing.AioProcess(target=func, args=(queue, event, lock, l))
    p.start()
    while True:
        result = yield from queue.coro_get()
        if result is None:
            break
        print("Got result {}".format(result))
    yield from p.coro_join()

@asyncio.coroutine
def example2(queue, event, lock):
    yield from event.coro_wait()
    with (yield from lock):
        yield from queue.coro_put(78)
        yield from queue.coro_put(None) # Shut down the worker

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    queue = aioprocessing.AioQueue()
    lock = aioprocessing.AioLock()
    event = aioprocessing.AioEvent()
    # ensure_future replaces asyncio.async, which is a syntax error since Python 3.7.
    tasks = [
        asyncio.ensure_future(example(queue, event, lock)),
        asyncio.ensure_future(example2(queue, event, lock)),
    ]
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()

Yes, there are quite a few bits that may (or may not) bite you.
When you run something like asyncio it expects to run on one thread or process. This does not (by itself) work with parallel processing. You somehow have to distribute the work while leaving the IO operations (specifically those on sockets) in a single thread/process.
While your idea to hand off individual connections to a different handler process is nice, it is hard to implement. The first obstacle is that you need a way to pull the connection out of asyncio without closing it. The next obstacle is that you cannot simply send a file descriptor to a different process unless you use platform-specific (probably Linux) code from a C-extension.
Note that the multiprocessing module is known to create a number of threads for communication. Most of the time when you use communication structures (such as Queues), a thread is spawned. Unfortunately those threads are not completely invisible. For instance, they can fail to tear down cleanly (when you intend to terminate your program), and depending on their number the resource usage may be noticeable on its own.
If you really intend to handle individual connections in individual processes, I suggest examining different approaches. For instance you can put a socket into listen mode and then accept connections from multiple worker processes in parallel. Once a worker has finished processing a request, it can go on to accept the next connection, so you still use fewer resources than forking a process for each connection. Spamassassin and Apache (mpm prefork) can use this worker model, for instance. It might end up easier and more robust depending on your use case. Specifically, you can make your workers die after serving a configured number of requests and be respawned by a master process, thereby eliminating much of the negative effect of memory leaks.

Based on @dano's answer above I wrote this function to replace places where I used to use multiprocessing pool + map.
import asyncio
from concurrent.futures import ProcessPoolExecutor
from typing import Callable

def asyncio_friendly_multiproc_map(fn: Callable, l: list):
    """
    This is designed to replace the use of this pattern:
        with multiprocessing.Pool(5) as p:
            results = p.map(analyze_day, list_of_days)
    By letting caller drop in replace:
        asyncio_friendly_multiproc_map(analyze_day, list_of_days)
    """
    tasks = []
    with ProcessPoolExecutor(5) as executor:
        for e in l:
            tasks.append(asyncio.get_event_loop().run_in_executor(executor, fn, e))
        # run_until_complete raises if the loop is already running, so call
        # this from synchronous code rather than from inside a coroutine.
        res = asyncio.get_event_loop().run_until_complete(asyncio.gather(*tasks))
    return res

See PEP 3156, in particular the section on Thread interaction:
http://www.python.org/dev/peps/pep-3156/#thread-interaction
This clearly documents the new asyncio methods you might use, including run_in_executor(). Note that the Executor is defined in concurrent.futures; I suggest you also have a look there.
