How do I implement async generators? - python

I have subscribed to an MQ queue. Every time I get a message, I pass it to a function that then performs a number of time-consuming I/O actions on it.
The issue is that everything happens serially.
A request comes in, it picks up the request, performs the action by calling the function, and then picks up the next request.
I want to do this asynchronously so that multiple requests can be dealt with in an async manner.
results = []
queue = queue.subscribe(name)
async for message in queue:
    yield my_function(message)
The biggest issue is that my_function is slow because it calls external web services and I want my code to process other messages in the meantime.
I tried to implement it above but it doesn't work! I am not sure how to implement async here.
I can't create the tasks up front because I don't know how many requests will be received. It's an MQ that I have subscribed to. I loop over each message and perform an action. I don't want to wait for the function to complete before I perform the action on the next message. I want it to happen asynchronously.

If I understand your request, what you need is a queue that your request handlers fill, and that you read from in the code that needs to do something with the results.
If you insist on an async iterator, it is straightforward to use a generator to expose the contents of a queue. For example:
def make_asyncgen():
    queue = asyncio.Queue(1)
    async def feed(item):
        await queue.put(item)
    async def exhaust():
        while True:
            item = await queue.get()
            yield item
    return feed, exhaust()
make_asyncgen returns two objects: an async function and an async generator. The two are connected in such a way that, when you call the function with an item, the item gets emitted by the generator. For example:
import random, asyncio
# Emulate a server that takes some time to process each message,
# and then provides a result. Here it takes an async function
# that it will call with the result.
async def serve(server_ident, on_message):
    while True:
        await asyncio.sleep(random.uniform(1, 5))
        await on_message('%s %s' % (server_ident, random.random()))
async def main():
    # create the feed function, and the generator
    feed, get = make_asyncgen()
    # subscribe to serve several requests in parallel
    asyncio.create_task(serve('foo', feed))
    asyncio.create_task(serve('bar', feed))
    asyncio.create_task(serve('baz', feed))
    # process results from all three servers as they arrive
    async for msg in get:
        print('received', msg)
asyncio.run(main())
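For the original question, where each message should be handled without waiting for the previous one, one possible pattern is to spawn a task per message as it arrives and cap how many are in flight. This is only a minimal sketch under stated assumptions: fake_queue, handle and max_in_flight are hypothetical stand-ins for the real MQ subscription, my_function and a limit of your choosing.

```python
import asyncio

async def fake_queue():
    # Hypothetical stand-in for the MQ subscription.
    for text in ("a", "b", "c", "d"):
        yield text

async def handle(message):
    # Hypothetical stand-in for my_function: slow I/O simulated with sleep.
    await asyncio.sleep(0.01)
    return message.upper()

async def consume(messages, max_in_flight=2):
    limit = asyncio.Semaphore(max_in_flight)
    tasks = []
    results = []

    async def run_one(message):
        try:
            results.append(await handle(message))
        finally:
            limit.release()  # free a slot for the next message

    async for message in messages:
        await limit.acquire()  # waits only when too many handlers are in flight
        tasks.append(asyncio.create_task(run_one(message)))
    await asyncio.gather(*tasks)  # let the remaining handlers finish
    return results

results = asyncio.run(consume(fake_queue()))
print(sorted(results))
```

With an endless queue you would probably also want to drop finished tasks from the list so it does not grow without bound.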

Python HTTPX | RuntimeError: The connection pool was closed while 6 HTTP requests/responses were still in-flight

I've come across this error multiple times while using the HTTPX module. I believe I know what it means but I don't know how to solve it.
In the following example, I have an asynchronous function gather_players() that sends get requests to an API I'm using and then returns a list of all the players from a specified NBA team. Inside of teamRoster() I'm using asyncio.run() to initiate gather_players() and that's the line that produces this error: RuntimeError: The connection pool was closed while 6 HTTP requests/responses were still in-flight
async def gather_players(list_of_urlCodes):
    async def get_json(client, link):
        response = await client.get(BASE_URL + link)
        return response.json()['league']['standard']['players']
    async with httpx.AsyncClient() as client:
        tasks = []
        for code in list_of_urlCodes:
            link = f'/prod/v1/2022/teams/{code}/roster.json'
            tasks.append(asyncio.create_task(get_json(client, link)))
        list_of_people = await asyncio.gather(*tasks)
        return list_of_people
def teamRoster(list_of_urlCodes: list) -> list:
    list_of_personIds = asyncio.run(gather_players(list_of_urlCodes))
    finalResult = []
    for person in list_of_personIds:
        personId = person['personId']
        # listOfPlayers is a list of every NBA player that I got
        # from a previous get request
        for player in listOfPlayers:
            if personId == player['personId']:
                finalResult.append({
                    "playerName": f"{player['firstName']} {player['lastName']}",
                    "personId": player['personId'],
                    "jersey": player['jersey'],
                    "pos": player['pos'],
                    "heightMeters": player['heightMeters'],
                    "weightKilograms": player['weightKilograms'],
                    "dateOfBirthUTC": player['dateOfBirthUTC'],
                    "nbaDebutYear": player['nbaDebutYear'],
                    "country": player['country']
                })
    return finalResult
*Note: The teamRoster() function in my original script is actually a class method and I've also used the same technique with the asynchronous function to send multiple get request in an earlier part of my script.
I was able to finally find a solution to this problem. For some reason the context manager: async with httpx.AsyncClient() as client fails to properly close the AsyncClient. A quick fix to this problem is closing it manually using: client.aclose()
Before:
async with httpx.AsyncClient() as client:
    tasks = []
    for code in list_of_urlCodes:
        link = f'/prod/v1/2022/teams/{code}/roster.json'
        tasks.append(asyncio.create_task(get_json(client, link)))
    list_of_people = await asyncio.gather(*tasks)
    return list_of_people
After:
client = httpx.AsyncClient()
tasks = []
for code in list_of_urlCodes:
    link = f'/prod/v1/2022/teams/{code}/roster.json'
    tasks.append(asyncio.create_task(get_json(client, link)))
list_of_people = await asyncio.gather(*tasks)
client.aclose()
return list_of_people
The accepted answer claims that the original code failed to properly close the client because it didn't call aclose(), and while that's technically true the implementation of the async context manager exit method (__aexit__) essentially duplicates the aclose() implementation.
In fact, you can tell that the connection is closed because the error message complains about 6 HTTP requests remaining in-flight after the connection is closed.
By contrast, the accepted answer "fixes" the error by explicitly not closing the connection. Because httpx.AsyncClient.aclose is an async function, calling it without awaiting creates a coroutine that is not actually scheduled for execution on the event loop. That coroutine is then destroyed when the function returns immediately after without having ever actually executed, meaning the connection is never closed. Python should print a RuntimeWarning that client.aclose() was never awaited. As a result, each request has plenty of time to complete before the process terminates and force-closes each connection so the RuntimeError is never raised.
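The behaviour described above can be reproduced without httpx. The following is a small sketch in which FakeClient is a hypothetical class standing in for httpx.AsyncClient; it only illustrates that calling an async method without await creates a coroutine object and does not run the method body.

```python
import asyncio

class FakeClient:
    # Hypothetical stand-in for httpx.AsyncClient, for illustration only.
    def __init__(self):
        self.closed = False

    async def aclose(self):
        self.closed = True

async def main():
    forgotten = FakeClient()
    coro = forgotten.aclose()          # no await: the body has not run
    closed_without_await = forgotten.closed
    coro.close()                       # discard it explicitly to avoid the RuntimeWarning

    awaited = FakeClient()
    await awaited.aclose()             # with await: the body runs
    return closed_without_await, awaited.closed

result = asyncio.run(main())
print(result)
```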
While I don't know the full reason that some requests were still in-flight, I suspect it was some cleanup at the end that didn't finish before the function returned and the connections were closed. For instance, if you put await asyncio.sleep(1) right before the return, then the error would likely go away as the client would have time to finish and clean up after each of its requests. (Note I'm not saying this is a good fix, but rather would help provide evidence to back up my explanation.)
Instead of using asyncio.gather, try using TaskGroups as recommended by the Python docs for asyncio.gather. So your new code could look something like this:
async def gather_players(list_of_urlCodes):
    async def get_json(client, link):
        response = await client.get(BASE_URL + link)
        return response.json()['league']['standard']['players']
    async with httpx.AsyncClient() as client:
        async with asyncio.TaskGroup() as tg:
            tasks = [tg.create_task(get_json(client, f'/prod/v1/2022/teams/{code}/roster.json')) for code in list_of_urlCodes]
        # all tasks have finished once the TaskGroup block exits
        list_of_people = [task.result() for task in tasks]
        return list_of_people
This is obviously not production-grade code, as it is missing error-handling, but demonstrates the suggestion clearly enough.

Python Process blocking the rest of application

I have a program that basically does two things:
it opens a websocket and keeps listening for messages, and it starts a video stream in a forever loop.
I was trying to use multiprocessing to manage both, but one piece stops the other from running.
The app is
if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(start_client())
async def start_client():
    async with WSClient() as client:
        pass
class WSClient:
    async def __aenter__(self):
        async with websockets.connect(url,max_size= None) as websocket:
            self.websocket = websocket
            await self.on_open() ## it goes
            p = Process(target = init(self)) # This is the streaming method
            p.start()
            async for message in websocket:
                await on_message(message, websocket) ## listen for websocket messages
            return self
the init method is
def init(ws):
    logging.info('Firmware Version: ' + getVersion())
    startStreaming(ws)
    return
Basically, startStreaming has an infinite loop in it.
In this configuration, the stream starts but the websocket's on_message is never called, because the Process call freezes the rest of the application.
How can I run both methods?
Thanks
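In your code, init(self) is called immediately in the current process, and multiprocessing.Process is given whatever init returns as its target. Because startStreaming loops forever, that call never returns and the rest of the application is blocked. What you want is for the new process to call init itself (with an argument). Here's how you can do that: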
p = Process(target=init, args=(self,))
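The difference between the two spellings can be seen in a small sketch. As an assumption made only to keep the example short and portable, threading.Thread is used here in place of multiprocessing.Process; both accept target and args in the same way, and the init below is a simplified stand-in for the one in the question.

```python
import threading

caller = threading.current_thread()
calls = []

def init(ws):
    # Record the argument and whether we ran in the calling thread.
    calls.append((ws, threading.current_thread() is caller))

# Wrong: init("a") runs right here, and its return value (None) becomes the target.
t1 = threading.Thread(target=init("a"))
t1.start()
t1.join()

# Right: pass the callable and its arguments separately.
t2 = threading.Thread(target=init, args=("b",))
t2.start()
t2.join()

print(calls)
```

The first call is recorded as having run in the calling thread, the second in the worker.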
I have to note that you're passing an asynchronous websocket object to your init function. This will likely break, as asyncio objects usually aren't meant to be used from two threads, let alone two processes. Unless you're somehow recreating the websocket object in the new process and making a new loop there too, what you're actually looking for is how to create an asyncio task.
Assuming startStreaming is already an async function, you should change the init function to this:
async def init(ws): # note the async
    logging.info('Firmware Version: ' + getVersion())
    await startStreaming(ws) # note the await
    return
and change the line creating and starting the process to this:
asyncio.create_task(init(self))
This will run your startStreaming function in a new task while you also read incoming messages at (basically) the same time.
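As a rough, self-contained illustration of that idea, the sketch below replaces the real websocket and stream with hypothetical stand-ins (start_streaming and incoming_messages are made up for this example) and shows the streaming task making progress while messages are being read.

```python
import asyncio

events = []

async def start_streaming():
    # Hypothetical stand-in for startStreaming: emits a few "frames".
    for frame in range(3):
        events.append(f"frame {frame}")
        await asyncio.sleep(0.01)

async def incoming_messages():
    # Hypothetical stand-in for iterating over the websocket.
    for text in ("hello", "world"):
        await asyncio.sleep(0.015)
        yield text

async def client():
    stream_task = asyncio.create_task(start_streaming())  # runs in the background
    async for message in incoming_messages():
        events.append(f"message {message}")
    await stream_task  # wait for the stream to finish before exiting

asyncio.run(client())
print(events)
```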
Also, I'm not sure what you're trying to do with the async context manager, as everything could just be in a normal async function. If you're interested in using one for learning purposes, I'd suggest checking out contextlib.asynccontextmanager and putting your message reading code inside the async with statement in start_client rather than inside __aenter__.

How to chain coroutines in asyncio when fetch nested urls

I'm currently designing a spider to crawl a specific website. I can do it synchronously, but I'm trying to get my head around asyncio to make it as efficient as possible. I've tried a lot of different approaches, with yield, chained functions and queues, but I can't make it work.
I'm most interested in the design and the logic needed to solve the problem. I don't necessarily need runnable code, just the most important aspects of asyncio highlighted. I can't post any code, because my attempts are not worth sharing.
The mission:
The exemple.com (I know, it should be example.com) got the following design:
In a synchronous manner the logic would be like this:
for table in my_url_list:
    # Get HTML
    # Extract urls from HTML to user_list
    for user in user_list:
        # Get HTML
        # Extract urls from HTML to user_subcat_list
        for subcat in user_subcat_list:
            # extract content
But now I would like to scrape the site asynchronously. Let's say we're using 5 instances (tabs in pyppeteer or requests in aiohttp) to parse the content. How should we design it to make it most efficient, and what asyncio syntax should we use?
Update
Thanks to #user4815162342, who solved my problem. I've been playing around with his solution and am posting runnable code below in case someone else wants to play around with asyncio.
import asyncio
import random
my_url_list = ['exemple.com/table1', 'exemple.com/table2', 'exemple.com/table3']
# Random sleeps to simulate requests to the server
async def randsleep(caller=None):
    i = random.randint(1, 6)
    if caller:
        print(f"Request HTML for {caller} sleeping for {i} seconds.")
    await asyncio.sleep(i)
async def process_urls(url_list):
    print(f'async def process_urls: added {url_list}')
    limit = asyncio.Semaphore(5)
    coros = [process_user_list(table, limit) for table in url_list]
    await asyncio.gather(*coros)
async def process_user_list(table, limit):
    async with limit:
        # Simulate HTML request and extracting urls to populate user_list
        await randsleep(table)
        if table[-1] == '1':
            user_list = ['exemple.com/user1', 'exemple.com/user2', 'exemple.com/user3']
        elif table[-1] == '2':
            user_list = ['exemple.com/user4', 'exemple.com/user5', 'exemple.com/user6']
        else:
            user_list = ['exemple.com/user7', 'exemple.com/user8', 'exemple.com/user9']
        print(f'async def process_user_list: Extracted {user_list} from {table}')
    # Execute process_user in parallel, but do so outside the `async with`
    # because process_user will also need the semaphore, and we don't need
    # it any more since we're done with fetching HTML.
    coros = [process_user(user, limit) for user in user_list]
    await asyncio.gather(*coros)
async def process_user(user, limit):
    async with limit:
        # Simulate HTML request and extracting urls to populate user_subcat_list
        await randsleep(user)
        user_subcat_list = [user + '/profile', user + '/info', user + '/followers']
        print(f'async def process_user: Extracted {user_subcat_list} from {user}')
    coros = [process_subcat(subcat, limit) for subcat in user_subcat_list]
    await asyncio.gather(*coros)
async def process_subcat(subcat, limit):
    async with limit:
        # Simulate HTML request and extracting content
        await randsleep(subcat)
        print(f'async def process_subcat: Extracted content from {subcat}')
if __name__ == '__main__':
    asyncio.run(process_urls(my_url_list))
Let's restructure the sync code so that each piece that can access the network is in a separate function. The functionality is unchanged, but it will make things easier later:
def process_urls(url_list):
    for table in url_list:
        process_user_list(table)
def process_user_list(table):
    # Get HTML, extract user_list
    for user in user_list:
        process_user(user)
def process_user(user):
    # Get HTML, extract user_subcat_list
    for subcat in user_subcat_list:
        process_subcat(subcat)
def process_subcat(subcat):
    # get HTML, extract content
if __name__ == '__main__':
    process_urls(my_url_list)
Assuming that the order of processing doesn't matter, we'd like the async version to run all the functions that are now called in for loops in parallel. They'll still run on a single thread, but they will await anything that might block, allowing the event loop to parallelize the waiting and drive them to completion by resuming each coroutine whenever it is ready to proceed. This is achieved by spawning each coroutine as a separate task that runs independent of other tasks and therefore in parallel. For example, a sequential (but still async) version of process_urls would look like this:
async def process_urls(url_list):
    for table in url_list:
        await process_user_list(table)
This is async because it is running inside an event loop, and you could run several such functions in parallel (which we'll show how to do shortly), but it's also sequential because it chooses to await each invocation of process_user_list. At each loop iteration the await explicitly instructs asyncio to suspend execution of process_urls until the result of process_user_list is available.
What we want instead is to tell asyncio to run all invocations of process_user_list in parallel, and to suspend execution of process_urls until they're all done. The basic primitive to spawn a coroutine in the "background" is to schedule it as a task using asyncio.create_task, which is the closest async equivalent of a light-weight thread. Using create_task the parallel version of process_urls would look like this:
async def process_urls(url_list):
    # spawn a task for each table
    tasks = []
    for table in url_list:
        task = asyncio.create_task(process_user_list(table))
        tasks.append(task)
    # The tasks are now all spawned, so awaiting one task lets
    # them all run.
    for task in tasks:
        await task
At first glance the second loop looks like it awaits tasks in sequence like the previous version, but this is not the case. Since each await suspends to the event loop, awaiting any task allows all tasks to progress, as long as they were scheduled beforehand using create_task(). The total waiting time will be no longer than the time of the longest task, regardless of the order in which they finish.
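The timing claim can be checked with a small experiment. This is just a sketch: job and the chosen delays are made up for illustration, with asyncio.sleep standing in for a network request.

```python
import asyncio
import time

async def job(delay):
    await asyncio.sleep(delay)  # simulated network wait
    return delay

async def main():
    start = time.perf_counter()
    # Spawn everything first, then await the tasks one by one.
    tasks = [asyncio.create_task(job(d)) for d in (0.3, 0.1, 0.2)]
    results = [await task for task in tasks]
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
print(results, round(elapsed, 2))
```

The elapsed time should come out close to the longest delay (0.3 s) rather than the sum of all three (0.6 s).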
This pattern is used so often that asyncio has a dedicated utility function for it, asyncio.gather. Using this function the same code can be expressed in a much shorter version:
async def process_urls(url_list):
    coros = [process_user_list(table) for table in url_list]
    await asyncio.gather(*coros)
But there is another thing to take care of: since process_user_list will get HTML from the server and there will be many instances of it running in parallel, we cannot allow it to hammer the server with hundreds of simultaneous connections. We could create a pool of worker tasks and some sort of queue, but asyncio offers a more elegant solution: the semaphore. A semaphore is a synchronization device that doesn't allow more than a pre-determined number of activations in parallel, making the rest wait in line.
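To see the limiting effect in isolation, here is a small sketch that counts how many coroutines are inside the semaphore at once. The fetch function and the stats dictionary are hypothetical helpers used only for this demonstration.

```python
import asyncio

async def fetch(limit, stats):
    async with limit:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # simulated request
        stats["active"] -= 1

async def main():
    limit = asyncio.Semaphore(2)  # at most two "requests" at a time
    stats = {"active": 0, "peak": 0}
    await asyncio.gather(*(fetch(limit, stats) for _ in range(6)))
    return stats

stats = asyncio.run(main())
print(stats)
```

Although six coroutines are started together, the recorded peak never exceeds the semaphore's limit of two.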
The final version of process_urls creates a semaphore and just passes it down. It doesn't activate the semaphore because process_urls doesn't actually fetch any HTML itself, so there is no reason for it to hold a semaphore slot while process_user_lists are running.
async def process_urls(url_list):
    limit = asyncio.Semaphore(5)
    coros = [process_user_list(table, limit) for table in url_list]
    await asyncio.gather(*coros)
process_user_list looks similar, but it does need to activate the semaphore using async with:
async def process_user_list(table, limit):
    async with limit:
        # Get HTML using aiohttp, extract user_list
    # Execute process_user in parallel, but do so outside the `async with`
    # because process_user will also need the semaphore, and we don't need
    # it any more since we're done with fetching HTML.
    coros = [process_user(user, limit) for user in user_list]
    await asyncio.gather(*coros)
process_user and process_subcat are more of the same:
async def process_user(user, limit):
    async with limit:
        # Get HTML, extract user_subcat_list
    coros = [process_subcat(subcat, limit) for subcat in user_subcat_list]
    await asyncio.gather(*coros)
async def process_subcat(subcat, limit):
    async with limit:
        # get HTML, extract content
    # do something with content
if __name__ == '__main__':
    asyncio.run(process_urls(my_url_list))
In practice you will probably want the async functions to share the same aiohttp session, so you'd probably create it in the top-level function (process_urls in your case) and pass it down along with the semaphore. Each function that fetches HTML would have another async with for the aiohttp request/response, such as:
async with limit:
    async with session.get(url, params...) as resp:
        # get HTML data here
        resp.raise_for_status()
        resp = await resp.read()
# extract content from HTML data here
The two async withs can be collapsed into one, reducing the indentation but keeping the same meaning:
async with limit, session.get(url, params...) as resp:
    # get HTML data here
    resp.raise_for_status()
    resp = await resp.read()
# extract content from HTML data here

Python asyncio (aiohttp, aiofiles)

I seem to be having a difficult time understanding Python's asyncio. I have not written any code, as all the examples I see are for one-off runs: create a few coroutines, add them to an event loop, then run the loop; it runs the tasks, switching between them, and it's done. That does not seem all that helpful for me.
I want to use asyncio to not interrupt the operation in my application (using pyqt5). I want to create some functions that when called run in the asyncio event loop, then when they are done they do a callback.
What I imagine is. Create a separate thread for asyncio, create the loop and run it forever. Create some functions getFile(url, fp), get(url), readFile(file), etc. Then in the UI, I have a text box with a submit button, user enters url, clicks submit, it downloads the file.
But, every example I see, I cannot see how to add a coroutine to a running loop. And I do not see how I could do what I want without adding to a running loop.
#!/bin/python3
import asyncio
import aiohttp
import threading
loop = asyncio.get_event_loop()
def async_in_thread(loop):
    asyncio.set_event_loop(loop)
    loop.run_forever()
async def _get(url, callback):
    print("get: " + url)
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            result = await response.text()
            callback(result)
            return
def get(url, callback):
    asyncio.ensure_future(_get(url, callback))
thread = threading.Thread(target=async_in_thread, args=(loop, ))
thread.start()
def stop():
    loop.close()
def callme(data):
    print(data)
    stop()
get("http://google.com", callme)
thread.join()
This is what I imagine, but it does not work.
To add a coroutine to a loop running in a different thread, use asyncio.run_coroutine_threadsafe:
def get(url, callback):
    asyncio.run_coroutine_threadsafe(_get(url, callback), loop)
In general, when you are interacting with the event loop from outside the thread that runs it, you must run everything through either run_coroutine_threadsafe (for coroutines) or loop.call_soon_threadsafe (for functions). For example, to stop the loop, use loop.call_soon_threadsafe(loop.stop). Also note that loop.close() must not be invoked inside a loop callback, so you should place that call in async_in_thread, right after the call to run_forever(), at which point the loop has definitely stopped running.
Another thing with asyncio is that passing explicit when_done callbacks isn't idiomatic because asyncio exposes the concept of futures (akin to JavaScript promises), which allow attaching callbacks to a not-yet-available result. For example, one could write _get like this:
async def _get(url):
    print("get: " + url)
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()
It doesn't need a callback argument because any interested party can convert it to a task using loop.create_task and use add_done_callback to be notified when the task is complete. For example:
def _get_with_callback(url, callback):
    loop = asyncio.get_event_loop()
    task = loop.create_task(_get(url))
    task.add_done_callback(lambda _fut: callback(task.result()))
In your case you're not dealing with the task directly because your code aims to communicate with the event loop from another thread. However, run_coroutine_threadsafe returns a very useful value - a full-fledged concurrent.futures.Future which you can use to register done callbacks. Instead of accepting a callback argument, you can expose the future object to the caller:
def get(url):
    return asyncio.run_coroutine_threadsafe(_get(url), loop)
Now the caller can choose a callback-based approach:
future = get(url)
# call me when done
future.add_done_callback(some_callback)
# ... proceed with other work ...
or, when appropriate, they can even wait for the result:
# give me the response, I'll wait for it
result = get(url).result()
The latter is by definition blocking, but since the event loop is safely running in a different thread, it is not affected by the blocking call.
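Putting the pieces of this answer together, a complete round trip might look roughly like the sketch below. It avoids the network so that it can run anywhere: slow_double is a hypothetical coroutine standing in for _get(url), and on_done is an illustrative callback.

```python
import asyncio
import threading

def async_in_thread(loop):
    asyncio.set_event_loop(loop)
    try:
        loop.run_forever()
    finally:
        loop.close()  # safe here: run_forever() has already returned

async def slow_double(x):
    # Hypothetical stand-in for _get(url).
    await asyncio.sleep(0.01)
    return x * 2

loop = asyncio.new_event_loop()
thread = threading.Thread(target=async_in_thread, args=(loop,))
thread.start()

received = []
done = threading.Event()

def on_done(fut):
    # Called once the concurrent.futures.Future has a result.
    received.append(fut.result())
    done.set()

# Callback-based approach.
future = asyncio.run_coroutine_threadsafe(slow_double(21), loop)
future.add_done_callback(on_done)
done.wait(timeout=5)

# Blocking approach: wait for the result in the calling thread.
blocking_result = asyncio.run_coroutine_threadsafe(slow_double(5), loop).result(timeout=5)

# Stop the loop from outside its thread, then wait for the thread to exit.
loop.call_soon_threadsafe(loop.stop)
thread.join()
print(received, blocking_result)
```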
Install Quamash to smooth the integration between Qt and asyncio.
The example from the project's README gives an idea of what it looks like:
import sys
import asyncio
import time
from PyQt5.QtWidgets import QApplication, QProgressBar
from quamash import QEventLoop, QThreadExecutor
app = QApplication(sys.argv)
loop = QEventLoop(app)
asyncio.set_event_loop(loop) # NEW must set the event loop
progress = QProgressBar()
progress.setRange(0, 99)
progress.show()
async def master():
    await first_50()
    with QThreadExecutor(1) as exec:
        await loop.run_in_executor(exec, last_50)
async def first_50():
    for i in range(50):
        progress.setValue(i)
        await asyncio.sleep(.1)
def last_50():
    for i in range(50,100):
        loop.call_soon_threadsafe(progress.setValue, i)
        time.sleep(.1)
with loop: ## context manager calls .close() when loop completes, and releases all resources
    loop.run_until_complete(master())

Enforce serial requests in asyncio

I've been using asyncio and the http requests package aiohttp recently and I've run into a problem.
My application talks to a REST API.
For some API endpoints it makes sense to be able to dispatch multiple requests in parallel, e.g. sending different queries in the request to the same endpoint to get different data.
For other endpoints, though, this doesn't make sense: the endpoint always takes the same arguments (authentication) and returns the requested information, so there is no point asking for the same data multiple times before the server has responded once. For these endpoints I need to enforce a 'serial' flow of requests, meaning that my program should only be able to send a request when it's not waiting for a response (the typical behavior of blocking requests).
Of course I don't want to block.
This is an abstraction of what I intend to do. Essentially wrap the endpoint in an async generator that enforces this serial behavior.
I feel like I'm reinventing the wheel, Is there a common solution to this issue?
import asyncio
from time import sleep
# Encapsulate the idea of an endpoint that can't handle multiple requests
async def serialendpoint():
    count = 0
    while True:
        count += 1
        await asyncio.sleep(2)
        yield str(count)
# Pretend client object
class ExampleClient(object):
    gen = serialendpoint()
    # Simulate a http request that sends multiple requests
    async def simulate_multiple_http_requests(self):
        print(await self.gen.asend(None))
        print(await self.gen.asend(None))
        print(await self.gen.asend(None))
        print(await self.gen.asend(None))
async def other_stuff():
    for _ in range(6):
        await asyncio.sleep(1)
        print('doing async stuff')
client = ExampleClient()
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(client.simulate_multiple_http_requests(),
                                       client.simulate_multiple_http_requests(),
                                       other_stuff()))
outputs
doing async stuff
1
doing async stuff
doing async stuff
2
doing async stuff
doing async stuff
3
doing async stuff
4
5
6
7
8
update
This is the actual async generator I implemented:
All the endpoints that require serial behavior get assigned a serial_request_async_generator during the import phase. Which meant I couldn't initialize them with an await 'async_gen'.asend(None) as the await is only allowed in an async coroutine. The compromise is that every serial request at runtime must .asend(None) before asending the actual arguments. There must be a better way!
async def serial_request_async_generator():
    args, kwargs = yield
    while True:
        yield await request(*args, **kwargs) # request is an aiohttp request
        args, kwargs = yield
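One commonly used way to get this kind of serial behaviour, offered here only as a sketch and not as the asker's own solution, is to guard the request with an asyncio.Lock so that later callers wait their turn without blocking the event loop. SerialEndpoint and its _request method are hypothetical; _request stands in for the real aiohttp call.

```python
import asyncio

class SerialEndpoint:
    """Allows only one request to be in flight at a time."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self.in_flight = 0
        self.max_in_flight = 0

    async def _request(self, payload):
        # Hypothetical stand-in for the real aiohttp request.
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)
        await asyncio.sleep(0.01)
        self.in_flight -= 1
        return f"response to {payload}"

    async def request(self, payload):
        async with self._lock:  # later callers wait here; the loop keeps running
            return await self._request(payload)

async def main():
    endpoint = SerialEndpoint()  # created inside the running event loop
    responses = await asyncio.gather(*(endpoint.request(i) for i in range(3)))
    return responses, endpoint.max_in_flight

responses, max_in_flight = asyncio.run(main())
print(responses, max_in_flight)
```

Unlike the generator approach, nothing needs to be primed with asend(None) before the first real request.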
