For an internship on the Python library fluidimage, we are investigating whether it would be a good idea to write an HPC parallel application with a client/servers model using the library trio.
For asynchronous programming and I/O, trio is indeed great!
Then, I'm wondering how to spawn processes (the servers doing the CPU/GPU-bound work) and how to communicate complex Python objects (potentially containing large numpy arrays) between the processes.
I didn't find the recommended way to do this with trio in its documentation (even though the echo client/server tutorial is a good start).
One obvious way to spawn processes in Python and communicate between them is multiprocessing.
In the HPC context, I think one good solution would be to use MPI (http://mpi4py.readthedocs.io/en/stable/overview.html#dynamic-process-management). For reference, I also have to mention rpyc (https://rpyc.readthedocs.io/en/latest/docs/zerodeploy.html#zerodeploy).
I don't know whether such tools can be used together with trio, or what the right way to do so would be.
An interesting related question: Share python object between multiprocess in python3
Remark on PEP 574: It seems to me that PEP 574 (see https://pypi.org/project/pickle5/) could also be part of a good solution to this problem.
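To illustrate, here is a minimal sketch of what PEP 574 out-of-band buffers look like (assuming Python 3.8+, where pickle protocol 5 is built in; the pickle5 package backports it to older versions):

import pickle
import numpy as np

data = np.arange(1_000_000)

# collect zero-copy views of the array's buffer instead of embedding them
# in the pickle stream
buffers = []
payload = pickle.dumps(data, protocol=5, buffer_callback=buffers.append)

# the buffers could travel over a separate, efficient channel (shared memory,
# MPI, ...); here we simply hand them straight back to loads
restored = pickle.loads(payload, buffers=buffers)
print(np.array_equal(data, restored))  # True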
Unfortunately, as of today (July 2018), Trio doesn't yet have support for spawning and communicating with subprocesses, or any kind of high-level wrappers for MPI or other high-level inter-process coordination protocols.
This is definitely something we want to get to eventually, and if you want to talk in more detail about what would need to be implemented, then you can hop in our chat, or this issue has an overview of what's needed for core subprocess support. But if your goal is to have something working within a few months for your internship, honestly you might want to consider more mature HPC tools like dask.
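For reference, a minimal sketch of what that could look like with dask.distributed (assuming the distributed package is installed; Client() starts a local cluster of worker processes by default):

import numpy as np
from dask.distributed import Client

def cpu_bounded_task(input_data):
    result = input_data.copy()
    for _ in range(10000):
        result += input_data
    return result

if __name__ == "__main__":
    client = Client()  # local cluster of worker processes
    future = client.submit(cpu_bounded_task, np.arange(10))
    # the numpy array is serialized to a worker and the result shipped back
    print(future.result())
    client.close()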
As of mid-2018, Trio doesn't do that yet. Your best option to date is to use trio_asyncio to leverage asyncio's support for the features which Trio still needs to learn.
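A minimal sketch of that approach, assuming the trio_asyncio package (aio_as_trio bridges an asyncio awaitable so it can be awaited from Trio; asyncio.sleep here just stands in for an asyncio-only library call):

import asyncio
import trio
import trio_asyncio

async def main():
    async with trio_asyncio.open_loop():
        # call an asyncio coroutine from inside Trio
        await trio_asyncio.aio_as_trio(asyncio.sleep)(0.1)
        print("asyncio call finished under Trio")

trio.run(main)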
Here is a very naive example of code using multiprocessing and trio (in the main program and in the server). It seems to work.
from multiprocessing import Process, Queue
import trio
import numpy as np


async def sleep():
    print("enter sleep")
    await trio.sleep(0.2)
    print("end sleep")


def cpu_bounded_task(input_data):
    result = input_data.copy()
    for i in range(1000000-1):
        result += input_data
    return result


def server(q_c2s, q_s2c):
    async def main_server():
        # get the data to be processed
        input_data = await trio.run_sync_in_worker_thread(q_c2s.get)
        print("in server: input_data received", input_data)
        # a CPU-bounded task
        result = cpu_bounded_task(input_data)
        print("in server: sending back the answer", result)
        await trio.run_sync_in_worker_thread(q_s2c.put, result)

    trio.run(main_server)


async def client(q_c2s, q_s2c):
    input_data = np.arange(10)
    print("in client: sending the input_data", input_data)
    await trio.run_sync_in_worker_thread(q_c2s.put, input_data)
    result = await trio.run_sync_in_worker_thread(q_s2c.get)
    print("in client: result received", result)


async def parent(q_c2s, q_s2c):
    async with trio.open_nursery() as nursery:
        nursery.start_soon(sleep)
        nursery.start_soon(client, q_c2s, q_s2c)
        nursery.start_soon(sleep)


def main():
    q_c2s = Queue()
    q_s2c = Queue()
    p = Process(target=server, args=(q_c2s, q_s2c))
    p.start()
    trio.run(parent, q_c2s, q_s2c)
    p.join()


if __name__ == '__main__':
    main()
A simple example with mpi4py... It may be a bad workaround from the trio point of view, but it seems to work.
Communications are done with trio.run_sync_in_worker_thread, so (as written by Nathaniel J. Smith) (1) there is no cancellation (and no control-C support) and (2) the worker threads use more memory than trio tasks (though one Python thread does not use that much memory).
But for communications involving large numpy arrays, I would go this way, since communication of buffer-like objects is going to be very efficient with mpi4py.
import sys
from functools import partial

import trio
import numpy as np
from mpi4py import MPI


async def sleep():
    print("enter sleep")
    await trio.sleep(0.2)
    print("end sleep")


def cpu_bounded_task(input_data):
    print("cpu_bounded_task starting")
    result = input_data.copy()
    for i in range(1000000-1):
        result += input_data
    print("cpu_bounded_task finished ")
    return result


if "server" not in sys.argv:
    comm = MPI.COMM_WORLD.Spawn(sys.executable,
                                args=['trio_spawn_comm_mpi.py', 'server'])

    async def client():
        input_data = np.arange(4)
        print("in client: sending the input_data", input_data)
        send = partial(comm.send, dest=0, tag=0)
        await trio.run_sync_in_worker_thread(send, input_data)

        print("in client: recv")
        recv = partial(comm.recv, tag=1)
        result = await trio.run_sync_in_worker_thread(recv)
        print("in client: result received", result)

    async def parent():
        async with trio.open_nursery() as nursery:
            nursery.start_soon(sleep)
            nursery.start_soon(client)
            nursery.start_soon(sleep)

    trio.run(parent)

    print("in client, end")
    comm.barrier()

else:
    comm = MPI.Comm.Get_parent()

    async def main_server():
        # get the data to be processed
        recv = partial(comm.recv, tag=0)
        input_data = await trio.run_sync_in_worker_thread(recv)
        print("in server: input_data received", input_data)
        # a CPU-bounded task
        result = cpu_bounded_task(input_data)
        print("in server: sending back the answer", result)
        send = partial(comm.send, dest=0, tag=1)
        await trio.run_sync_in_worker_thread(send, result)

    trio.run(main_server)
    comm.barrier()
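As an aside, for large numpy arrays the uppercase mpi4py methods (Send/Recv) use the buffer protocol and avoid pickling entirely. A minimal sketch, to be run with mpirun -n 2 (the file name is only an illustration):

# send_array_mpi.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = np.arange(1000, dtype='i')
    comm.Send([data, MPI.INT], dest=1, tag=7)    # zero-copy send of the buffer
elif rank == 1:
    data = np.empty(1000, dtype='i')
    comm.Recv([data, MPI.INT], source=0, tag=7)  # receive directly into the array
    print("rank 1 received", data[:5], "...")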
You could also check out tractor, which finally seems to have a first alpha release out.
It has a built-in, function-focused RPC system (much in the style of trio) using TCP and msgpack (but I think they have more transports planned). You just call functions in other processes directly and stream/get results back in a variety of different ways.
Here's their first example:
"""
Run with a process monitor from a terminal using::
$TERM -e watch -n 0.1 "pstree -a $$" \
& python examples/parallelism/single_func.py \
&& kill $!
"""
import os
import tractor
import trio
async def burn_cpu():
pid = os.getpid()
# burn a core # ~ 50kHz
for _ in range(50000):
await trio.sleep(1/50000/50)
return os.getpid()
async def main():
async with tractor.open_nursery() as n:
portal = await n.run_in_actor(burn_cpu)
# burn rubber in the parent too
await burn_cpu()
# wait on result from target function
pid = await portal.result()
# end of nursery block
print(f"Collected subproc {pid}")
if __name__ == '__main__':
trio.run(main)
Related
I want to implement a class with the possibility to start various websockets in different threads to retrieve market data and update the class attributes. I am using the kucoin-python-sdk library for that purpose.
The code below works fine in Spyder; however, when I run my script via a conda batch file, it fails with the following errors over and over.
Thank you.
<Task finished name='Task-4' coro=<ConnectWebsocket._run() done, defined at path\lib\site-packages\kucoin\websocket\websocket.py:33> exception=RuntimeError("can't register atexit after shutdown")>
got an exception can't register atexit after shutdown
pending name='Task-3' coro=<ConnectWebsocket._recover_topic_req_msg() running at path\lib\site-packages\kucoin\websocket\websocket.py:127> wait_for=
cancel ok.
_reconnect over.

<Task finished name='Task-7' coro=<ConnectWebsocket._run() done, defined at path\lib\site-packages\kucoin\websocket\websocket.py:33> exception=RuntimeError("can't register atexit after shutdown")>
got an exception can't register atexit after shutdown
pending name='Task-6' coro=<ConnectWebsocket._recover_topic_req_msg() running at path\lib\site-packages\kucoin\websocket\websocket.py:127> wait_for=
cancel ok.
_reconnect over.
Hence I am wondering:
Does the issue come from the Kucoin package, or is my implementation of threads/asyncio incorrect?
How to explain the different behavior between execution in Spyder and via conda in the same environment?
Python 3.9.13 | Spyder 5.3.3 | Spyder kernel 2.3.3 | websocket 0.2.1 | nest-asyncio 1.5.6 | kucoin-python 1.0.11
Class_X.py
import asyncio
import nest_asyncio
nest_asyncio.apply()

from kucoin.client import WsToken
from kucoin.ws_client import KucoinWsClient
from threading import Thread


class class_X():
    def __init__(self):
        self.msg = ""

    async def main(self):
        async def book_msg(msg):
            self.msg = msg

        client = WsToken()
        ws_client = await KucoinWsClient.create(None, client, book_msg, private=False)
        await ws_client.subscribe(f'/market/level2:BTC-USDT')
        while True:
            await asyncio.sleep(20)

    def launch(self):
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        loop.run_until_complete(self.main())


instance = class_X()
t = Thread(target=instance.launch)
t.start()
Batch
call path\anaconda3\Scripts\activate myENV
python "path1\class_X.py"
conda deactivate
I want to say it's your implementation, but I haven't tried using that client the way you're doing it. Here's a pared-down skeleton of what I'm doing to use kucoin-python with asyncio.
import asyncio
from kucoin.client import WsToken
from kucoin.ws_client import KucoinWsClient
from kucoin.client import Market
from kucoin.client import User
from kucoin.client import Trade


async def main():

    async def handle_event(msg):
        if '/market/snapshot:' in msg['topic']:
            snapshot = msg['data']['data']
            ## trade logic here using snapshot data
        elif msg['topic'] == '/spotMarket/tradeOrders':
            print(msg['data'])
        else:
            print("Unhandled message type")
            print(msg)

    async def unsubscribeFromPublicSnapsot(symbol):
        ksm.unsubscribe('/market/snapshot:' + symbol)

    async def subscribeToPublicSnapshot(symbol):
        try:
            print("subscribing to " + symbol)
            await ksm.subscribe('/market/snapshot:' + symbol)
        except Exception as e:
            print("Error subscribing to snapshot for " + doc['currency'])
            print(e)

    pubClient = WsToken()
    print("creating websocket client")
    ksm = await KucoinWsClient.create(None, pubClient, handle_event, private=False)
    # for private topics pass private=True
    privateClient = WsToken(config["tradeKey"], config["tradeSecret"], config["tradePass"])
    ksm_private = await KucoinWsClient.create(None, privateClient, handle_event, private=True)

    # Always subscribe to BTC-USDT
    await subscribeToPublicSnapshot('BTC-USDT')

    # Subscribe to the currency-BTC spot market for each available currency
    for doc in tradeable_holdings:
        if doc['currency'] != 'BTC':  # Don't need to resubscribe :D
            await subscribeToPublicSnapshot(doc['currency'] + "-BTC")

    # Subscribe to spot market trade orders
    await ksm_private.subscribe('/spotMarket/tradeOrders')


if __name__ == "__main__":
    print("Step 1: Kubot initialzied")
    print("Step 2: ???")
    print("Step 2: Profit")
    loopMain = asyncio.get_event_loop()
    loopMain.create_task(main())
    loopMain.run_forever()
    loopMain.close()
As you can probably guess, "tradeable_holdings" is a list of symbols I'm interested in that I already own. You'll also notice I'm using the snapshot instead of the market/ticker subscription. I think at 100ms updates on the ticker, it could quickly run into latency and race conditions - at least until I figure out how to deal with those. So I opted for the snapshot which only updates every 2 seconds and for the less active coins, not even that often.
Anyway, I'm not yet to the point where it's actually trading, but I'm quickly getting to that logic.
Hope this helps you figure your implementation out even though it's different.
I would like to combine asyncio and multiprocessing as I have a task where one part is IO-bound and another is CPU-bound. I first tried to use loop.run_in_executor(), but I couldn't get it to work properly. Instead I went with creating two processes where one uses asyncio and the other doesn't.
The code is such that I have a class with some non-blocking functions and one blocking function. I have an asyncio.Queue to pass information between the non-blocking parts and a multiprocessing.Queue to pass information between the non-blocking and the blocking functions.
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor
import asyncio
import time


class TestClass:
    def __init__(self):
        m = mp.Manager()
        self.blocking_queue = m.Queue()

    async def run(self):
        loop = asyncio.get_event_loop()
        self.non_blocking_queue = asyncio.Queue()  # asyncio Queue must be declared within event loop

        task1 = loop.create_task(self.non_blocking1())
        task2 = loop.create_task(self.non_blocking2())
        task3 = loop.create_task(self.print_msgs())

        await asyncio.gather(task1, task2)
        task3.cancel()

    def blocking(self):
        i = 0
        while i < 5:
            time.sleep(0.6)
            i += 1
            print("Blocking ", i)
            line = self.blocking_queue.get()
            print("Blocking: ", line)
        print("blocking done")

    async def non_blocking1(self):
        for i in range(5):
            await self.non_blocking_queue.put("Hello")
            await asyncio.sleep(0.4)

    async def non_blocking2(self):
        for i in range(5):
            await self.non_blocking_queue.put("World")
            await asyncio.sleep(0.5)

    async def print_msgs(self):
        while True:
            line = await self.non_blocking_queue.get()
            self.blocking_queue.put(line)
            print(line)


test_class = TestClass()

with ProcessPoolExecutor() as pool:
    pool.submit(test_class.blocking)
    pool.submit(asyncio.run(test_class.run()))

print("done")
About half the time I run this, it works fine and prints out the text from both the blocking and the non-blocking queues. The other half, it only prints out the results of the non-blocking queue; it looks like the blocking process isn't started at all. It doesn't fail consistently every other time either: it might work five times in a row and then not work five times in a row.
What might cause such a problem? Is there a better way to do this, using both multiprocessing and asyncio?
Running the async task "inside" the other process works for me, e.g.:
def runfn(fn):
    return asyncio.run(fn())


with ProcessPoolExecutor() as pool:
    pool.submit(test_class.blocking)
    pool.submit(runfn, test_class.run)
Note that in the original code, pool.submit(asyncio.run(test_class.run())) evaluates asyncio.run(...) in the parent process before submit is even called, so only its return value gets submitted; wrapping the call in runfn ships the coroutine function to the worker process instead. Beyond that, presumably there's some state inside asyncio/the task that needs to be consistent or gets broken when running in another process.
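As a small sketch of that difference (function names here are only for illustration): the arguments of submit() are evaluated before submit() runs, so calling asyncio.run(...) inline executes the coroutine in the parent process, while passing a plain callable ships the work to a worker process.

import asyncio
import os
from concurrent.futures import ProcessPoolExecutor


async def where_am_i():
    return os.getpid()


def run_coro(coro_fn):
    return asyncio.run(coro_fn())


if __name__ == "__main__":
    print("parent pid:", os.getpid())
    with ProcessPoolExecutor() as pool:
        eager_pid = asyncio.run(where_am_i())                     # runs in the parent
        worker_pid = pool.submit(run_coro, where_am_i).result()   # runs in a worker
    print("eager:", eager_pid)       # equals the parent pid
    print("submitted:", worker_pid)  # a different pid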
I'm trying to create a client which uses an asyncio.Queue to feed the messages I want to send to the server. Receiving data from the websocket server works great. Sending data which is just generated by the producer works, too. To explain what works and what fails, here's my code first:
import sys
import asyncio
import websockets


class WebSocketClient:
    def __init__(self):
        self.send_queue = asyncio.Queue()
        #self.send_queue.put_nowait('test-message-1')

    async def startup(self):
        await self.connect_websocket()
        consumer_task = asyncio.create_task(
            self.consumer_handler()
        )
        producer_task = asyncio.create_task(
            self.producer_handler()
        )
        done, pending = await asyncio.wait(
            [consumer_task, producer_task],
            return_when=asyncio.ALL_COMPLETED
        )
        for task in pending:
            task.cancel()

    async def connect_websocket(self):
        try:
            self.connection = await websockets.client.connect('ws://my-server')
        except ConnectionRefusedError:
            sys.exit('error: cannot connect to backend')

    async def consumer_handler(self):
        async for message in self.connection:
            await self.consumer(message)

    async def consumer(self, message):
        self.send_queue.put_nowait(message)
        # await self.send_queue.put(message)
        print('mirrored message %s now in queue, queue size is %s' % (message, self.send_queue.qsize()))

    async def producer_handler(self):
        while True:
            message = await self.producer()
            await self.connection.send(message)

    async def producer(self):
        result = await self.send_queue.get()
        self.send_queue.task_done()
        #await asyncio.sleep(10)
        #result = 'test-message-2'
        return result


if __name__ == '__main__':
    wsc = WebSocketClient()
    asyncio.run(wsc.startup())
Connecting works great. If I send something from my server to the client, this works great too and prints the message in consumer(). But producer never gets any message I put in send_queue inside consumer().
The reason why I chose send_queue.put_nowait in consumer() was that I wanted to prevent deadlocks. If I use await self.send_queue.put(message) instead of self.send_queue.put_nowait(message), it makes no difference.
I thought maybe the queue does not work at all, so I put something into the queue at creation time in __init__(): self.send_queue.put_nowait("test-message-1"). This works and is sent to my server. So the basic concept of the queue and await queue.get() works.
I also thought maybe there is some issue with the producer, so I just generated messages during runtime: result = "test-message-2" instead of result = await self.send_queue.get(). This works too: every 10 seconds 'test-message-2' is sent to my server.
EDIT: This also happens if I try to add stuff from another source to the queue on the fly. I built a small asyncio socket server which pushes any message to the queue, which works great, and you can see the messages I added from the other source with qsize() in consumer(), but still no successful queue.get(). So the queue itself seems to work, just not get(). This is, by the way, the reason for the queue in the first place: I would like to send data from quite different sources.
So, this is the point where I'm stuck. My wild guess is that the queue I use in producer() is not the same as the one in consumer(), something which happens quite easily with threading if you use non-thread-safe queues like asyncio.Queue, but as I understand it I don't use threading at all, just coroutines. So, what else went wrong here?
Just for context: it's Ubuntu 20.04 with Python 3.8.2 inside a Docker container.
Thanks,
Ernesto
Just for the record: the solution to my problem was quite simple. I defined send_queue outside the event loop created by my websocket client. So it called events.get_event_loop() and got its own loop, which was not part of the main loop and was therefore never run, so await queue.get() really never got anything back.
In normal mode, you don't see any message hinting at this issue. But the Python documentation came to the rescue: of course it is mentioned at https://docs.python.org/3/library/asyncio-dev.html. Enabling logging at the DEBUG level gave the hints I needed to find the problem.
It should look like this:
class WebSocketClient:

    async def startup(self):
        self.send_queue = asyncio.Queue()
        await self.connect_websocket()
Then the queue is defined inside the main loop.
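As an aside, the debug hints mentioned above come from asyncio's debug mode; a minimal sketch of switching it on (the body of main() is just a placeholder):

import asyncio
import logging

logging.basicConfig(level=logging.DEBUG)


async def main():
    ...  # your client code here


asyncio.run(main(), debug=True)  # or set the PYTHONASYNCIODEBUG=1 environment variable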
I have a function that continuously monitors an API. Basically, the function gets the data, parses it, then appends it to a file. Then it waits for 15 minutes and does the same thing over and over.
What I want is to run this loop in the background so it doesn't block the rest of my code from executing.
If you are using asyncio (I assume you are due to the asyncio tag) a scheduled operation can be performed using a task.
import asyncio

loop = asyncio.get_event_loop()


async def check_api():
    while True:
        # Do API check, helps if this is using async methods
        await asyncio.sleep(15 * 60)  # 15 minutes (in seconds)


loop.create_task(check_api())

...  # Rest of your application

loop.run_forever()
If your API check is not async (or the library you are using to interact with it is not async), you can use an Executor to run the operation in a separate thread or process while still keeping the asyncio API.
For example:
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor()


def call_api():
    ...


async def check_api():
    while True:
        await loop.run_in_executor(executor, call_api)
        await asyncio.sleep(15 * 60)  # 15 minutes (in seconds)
Note that asyncio does not automatically make your code parallel; it is co-operative multitasking. All of your methods need to cooperate by using await, and a long-running synchronous operation will still block all other tasks; in that case an Executor will help.
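A small sketch of that point: a synchronous time.sleep stalls every task on the loop, while await asyncio.sleep lets the other task keep running.

import asyncio
import time


async def ticker():
    for i in range(5):
        print("tick", i)
        await asyncio.sleep(0.2)


async def well_behaved():
    await asyncio.sleep(1)   # yields to the event loop: ticks keep appearing


async def badly_behaved():
    time.sleep(1)            # blocks the whole loop: the ticker stalls until this returns


async def main():
    print("-- well behaved --")
    await asyncio.gather(ticker(), well_behaved())
    print("-- badly behaved --")
    await asyncio.gather(ticker(), badly_behaved())


asyncio.run(main())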
This is very broad, but you could take a look at the multiprocessing or threading Python modules.
Running a thread in the background would look something like this:
from threading import Thread


def background_task():
    # your code here
    ...


t = Thread(target=background_task)
t.start()
Try multithreading:

import threading
import time


def background():
    while True:
        number = int(len(oilrigs)) * 49
        number += money
        time.sleep(1)


def foreground():
    # What you want to run in the foreground
    ...


b = threading.Thread(name='background', target=background)
f = threading.Thread(name='foreground', target=foreground)

b.start()
f.start()
Try multithreading:

import threading


def background():
    # The loop you want to run in the background
    ...


b = threading.Thread(target=background)
b.start()
Similar Question (but answer does not work for me): How to cancel long-running subprocesses running using concurrent.futures.ProcessPoolExecutor?
Unlike the question linked above and the solution provided, in my case the computation itself is rather long (CPU bound) and cannot be run in a loop to check if some event has happened.
Reduced version of the code below:
import asyncio
import concurrent.futures as futures
import time


class Simulator:
    def __init__(self):
        self._loop = None
        self._lmz_executor = None
        self._tasks = []
        self._max_execution_time = time.monotonic() + 60
        self._long_running_tasks = []

    def initialise(self):
        # Initialise the main asyncio loop
        self._loop = asyncio.get_event_loop()
        self._loop.set_default_executor(
            futures.ThreadPoolExecutor(max_workers=3))

        # Run separate processes of long computation task
        self._lmz_executor = futures.ProcessPoolExecutor(max_workers=3)

    def run(self):
        self._tasks.extend(
            [self.bot_reasoning_loop(bot_id) for bot_id in [1, 2, 3]]
        )

        try:
            # Gather bot reasoner tasks
            _reasoner_tasks = asyncio.gather(*self._tasks)
            # Send the reasoner tasks to main monitor task
            asyncio.gather(self.sample_main_loop(_reasoner_tasks))
            self._loop.run_forever()
        except KeyboardInterrupt:
            pass
        finally:
            self._loop.close()

    async def sample_main_loop(self, reasoner_tasks):
        """This is the main monitor task"""
        await asyncio.wait_for(reasoner_tasks, None)
        for task in self._long_running_tasks:
            try:
                await asyncio.wait_for(task, 10)
            except asyncio.TimeoutError:
                print("Oops. Some long operation timed out.")
                task.cancel()  # Doesn't cancel and has no effect
                task.set_result(None)  # Doesn't seem to have an effect

        self._lmz_executor.shutdown()
        self._loop.stop()
        print('And now I am done. Yay!')

    async def bot_reasoning_loop(self, bot):
        import math

        _exec_count = 0
        _sleepy_time = 15
        _max_runs = math.floor(self._max_execution_time / _sleepy_time)

        self._long_running_tasks.append(
            self._loop.run_in_executor(
                self._lmz_executor, really_long_process, _sleepy_time))

        while time.monotonic() < self._max_execution_time:
            print("Bot#{}: thinking for {}s. Run {}/{}".format(
                bot, _sleepy_time, _exec_count, _max_runs))
            await asyncio.sleep(_sleepy_time)
            _exec_count += 1

        print("Bot#{} Finished Thinking".format(bot))


def really_long_process(sleepy_time):
    print("I am a really long computation.....")
    _large_val = 9729379273492397293479237492734 ** 344323
    print("I finally computed this large value: {}".format(_large_val))


if __name__ == "__main__":
    sim = Simulator()
    sim.initialise()
    sim.run()
The idea is that there is a main simulation loop that runs and monitors three bot threads. Each of these bot threads performs some reasoning but also starts a really long background process using ProcessPoolExecutor, which may end up running longer than its own threshold/max execution time for reasoning on things.
As you can see in the code above, I attempted to .cancel() these tasks when a timeout occurs. However, this does not really cancel the actual computation, which keeps happening in the background, and the asyncio loop doesn't terminate until all the long-running computations have finished.
How do I terminate such long running CPU-bound computations within a method?
Other similar SO questions, but not necessarily related or helpful:
asyncio: Is it possible to cancel a future been run by an Executor?
How to terminate a single async task in multiprocessing if that single async task exceeds a threshold time in Python
Asynchronous multiprocessing with a worker pool in Python: how to keep going after timeout?
How do I terminate such long running CPU-bound computations within a method?
The approach you tried doesn't work because the futures returned by ProcessPoolExecutor are not cancellable. Although asyncio's run_in_executor tries to propagate the cancellation, it is simply ignored by Future.cancel once the task starts executing.
There is no fundamental reason for that. Unlike threads, processes can be safely terminated, so it would be perfectly possible for ProcessPoolExecutor.submit to return a future whose cancel terminated the corresponding process. Asyncio coroutines have well-defined cancellation semantics and could automatically make use of it. Unfortunately, ProcessPoolExecutor.submit returns a regular concurrent.futures.Future, which assumes the lowest common denominator of the underlying executors, and treats a running future as untouchable.
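A small stand-alone demonstration of that behavior: once a worker has picked up the task, Future.cancel returns False and the computation keeps going.

import time
from concurrent.futures import ProcessPoolExecutor


def slow():
    time.sleep(5)


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=1) as ex:
        fut = ex.submit(slow)
        time.sleep(0.5)        # give the worker time to start executing
        print(fut.cancel())    # False: a running future cannot be cancelled
        print(fut.running())   # True: the subprocess keeps computing
    # the with-block still waits ~5 s for the worker to finish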
As a result, to cancel tasks executed in subprocesses, one must circumvent the ProcessPoolExecutor altogether and manage one's own processes. The challenge is how to do this without reimplementing half of multiprocessing. One option offered by the standard library is to (ab)use multiprocessing.Pool for this purpose, because it supports reliable shutdown of worker processes. A CancellablePool could work as follows:
Instead of spawning a fixed number of processes, spawn a fixed number of 1-worker pools.
Assign tasks to pools from an asyncio coroutine. If the coroutine is canceled while waiting for the task to finish in the other process, terminate the single-process pool and create a new one.
Since everything is coordinated from the single asyncio thread, don't worry about race conditions such as accidentally killing a process which has already started executing another task. (This would need to be prevented if one were to support cancellation in ProcessPoolExecutor.)
Here is a sample implementation of that idea:
import asyncio
import multiprocessing


class CancellablePool:
    def __init__(self, max_workers=3):
        self._free = {self._new_pool() for _ in range(max_workers)}
        self._working = set()
        self._change = asyncio.Event()

    def _new_pool(self):
        return multiprocessing.Pool(1)

    async def apply(self, fn, *args):
        """
        Like multiprocessing.Pool.apply_async, but:
         * is an asyncio coroutine
         * terminates the process if cancelled
        """
        while not self._free:
            await self._change.wait()
            self._change.clear()
        pool = usable_pool = self._free.pop()
        self._working.add(pool)

        loop = asyncio.get_event_loop()
        fut = loop.create_future()

        def _on_done(obj):
            loop.call_soon_threadsafe(fut.set_result, obj)

        def _on_err(err):
            loop.call_soon_threadsafe(fut.set_exception, err)

        pool.apply_async(fn, args, callback=_on_done, error_callback=_on_err)

        try:
            return await fut
        except asyncio.CancelledError:
            pool.terminate()
            usable_pool = self._new_pool()
        finally:
            self._working.remove(pool)
            self._free.add(usable_pool)
            self._change.set()

    def shutdown(self):
        for p in self._working | self._free:
            p.terminate()
        self._free.clear()
A minimalistic test case showing cancellation:
def really_long_process():
    print("I am a really long computation.....")
    large_val = 9729379273492397293479237492734 ** 344323
    print("I finally computed this large value: {}".format(large_val))


async def main():
    loop = asyncio.get_event_loop()
    pool = CancellablePool()

    tasks = [loop.create_task(pool.apply(really_long_process))
             for _ in range(5)]
    for t in tasks:
        try:
            await asyncio.wait_for(t, 1)
        except asyncio.TimeoutError:
            print('task timed out and cancelled')

    pool.shutdown()


asyncio.get_event_loop().run_until_complete(main())
Note how the CPU usage never exceeds 3 cores, and how it starts dropping near the end of the test, indicating that the processes are being terminated as expected.
To apply it to the code from the question, make self._lmz_executor an instance of CancellablePool and change self._loop.run_in_executor(...) to self._loop.create_task(self._lmz_executor.apply(...)).