I would like to read from multiple simultaneous HTTP streaming requests inside coroutines using httpx, and yield the data back to my non-async function running the event loop, rather than just returning the final data.
But if I make my async functions yield instead of return, I get complaints that asyncio.as_completed() and loop.run_until_complete() expect a coroutine or a Future, not an async generator.
So the only way I can get this to work at all is by collecting all the streamed data inside each coroutine and returning it once the request finishes, then collecting all the coroutine results and finally returning them to the non-async calling function.
That means I have to keep everything in memory and wait until the slowest request has completed before I get my data, which defeats the whole point of streaming HTTP requests.
Is there any way I can accomplish something like this? My current silly implementation looks like this:
import asyncio

import httpx


def collect_data(urls):
    """Non-async function wishing it was a non-async generator"""

    async def stream(async_client, url):
        data = []
        async with async_client.stream("GET", url=url) as ar:
            ar.raise_for_status()
            async for line in ar.aiter_lines():
                data.append(line)
                # would like to yield each line here
        return data

    async def execute_tasks(urls):
        all_data = []
        async with httpx.AsyncClient() as async_client:
            tasks = [stream(async_client, url) for url in urls]
            for coroutine in asyncio.as_completed(tasks):
                all_data += await coroutine
                # would like to iterate and yield each line here
        return all_data

    try:
        loop = asyncio.get_event_loop()
        data = loop.run_until_complete(execute_tasks(urls=urls))
        return data
        # would like to iterate and yield the data here as it becomes available
    finally:
        loop.close()
EDIT: I've tried some solutions using asyncio.Queue and Trio memory channels as well, but since I can only read from those in an async scope, they don't get me any closer to a solution.
EDIT 2: The reason I want to use this from a non-asynchronous generator is that I want to use it from a Django app using a Django Rest Framework streaming API.
Normally you should just make collect_data async, and use async code throughout - that's how asyncio was designed to be used. But if that's for some reason not feasible, you can iterate an async iterator manually by applying some glue code:
def iter_over_async(ait, loop):
    ait = ait.__aiter__()

    async def get_next():
        try:
            obj = await ait.__anext__()
            return False, obj
        except StopAsyncIteration:
            return True, None

    while True:
        done, obj = loop.run_until_complete(get_next())
        if done:
            break
        yield obj
The way the above works is by providing an async closure that keeps retrieving values from the async iterator using the __anext__ magic method and returning the objects as they arrive. This async closure is invoked with run_until_complete() in a loop inside an ordinary sync generator. (The closure actually returns a pair of a done flag and the actual object in order to avoid propagating StopAsyncIteration through run_until_complete, which might be unsupported.)
With this in place, you can make your execute_tasks an async generator (async def with yield) and iterate over it using:
for chunk in iter_over_async(execute_tasks(urls), loop):
    ...
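For completeness, execute_tasks rewritten as an async generator might look roughly like the sketch below, building on the httpx setup from the question. The queue-based merging is just one possible way to interleave lines from several streams as they arrive; it is not part of this answer.
import asyncio
import httpx

async def stream_lines(async_client, url, queue):
    # Push each line onto the shared queue as soon as it arrives
    async with async_client.stream("GET", url) as response:
        response.raise_for_status()
        async for line in response.aiter_lines():
            await queue.put(line)

async def execute_tasks(urls):
    # Async generator: yields lines from all streams as they become available
    queue = asyncio.Queue()
    async with httpx.AsyncClient() as async_client:
        pending = asyncio.gather(
            *(stream_lines(async_client, url, queue) for url in urls)
        )
        while not (pending.done() and queue.empty()):
            try:
                yield await asyncio.wait_for(queue.get(), timeout=0.1)
            except asyncio.TimeoutError:
                continue
        await pending  # re-raise any exceptions from the streams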
Just note that this approach is incompatible with asyncio.run, and might cause problems later down the line.
Just wanting to update @user4815162342's solution to use asyncio.run_coroutine_threadsafe instead of loop.run_until_complete.
import asyncio
from typing import Any, AsyncGenerator


def _iter_over_async(loop: asyncio.AbstractEventLoop, async_generator: AsyncGenerator):
    ait = async_generator.__aiter__()

    async def get_next() -> tuple[bool, Any]:
        try:
            obj = await ait.__anext__()
            done = False
        except StopAsyncIteration:
            obj = None
            done = True
        return done, obj

    while True:
        done, obj = asyncio.run_coroutine_threadsafe(get_next(), loop).result()
        if done:
            break
        yield obj
I'd also like to add that I have found tools like this quite helpful when converting synchronous code to asyncio code piece by piece.
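To illustrate how this variant might be wired up (not part of the original answer): asyncio.run_coroutine_threadsafe needs the event loop to already be running, typically in a dedicated background thread. A minimal sketch, using a throwaway countdown async generator as the thing being consumed:
import asyncio
import threading

async def countdown(n):
    # Example async generator to consume from synchronous code
    for i in range(n, 0, -1):
        await asyncio.sleep(0.1)
        yield i

# Run an event loop in a background thread so that
# asyncio.run_coroutine_threadsafe() has a running loop to submit work to.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

for value in _iter_over_async(loop, countdown(3)):
    print(value)  # 3, 2, 1

loop.call_soon_threadsafe(loop.stop)  # shut the loop down when done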
There is a nice library that does this (and more!) called pypeln:
import pypeln as pl
import asyncio
from random import random


async def slow_add1(x):
    await asyncio.sleep(random())  # <= some slow computation
    return x + 1


async def slow_gt3(x):
    await asyncio.sleep(random())  # <= some slow computation
    return x > 3


data = range(10)  # [0, 1, 2, ..., 9]

stage = pl.task.map(slow_add1, data, workers=3, maxsize=4)
stage = pl.task.filter(slow_gt3, stage, workers=2)

data = list(stage)  # e.g. [5, 6, 9, 4, 8, 10, 7]
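If keeping everything in memory is a concern, the stage can also be consumed lazily from synchronous code instead of being collected with list(stage), since a pypeln stage is itself an iterable. A small sketch of that variant:
# Results are yielded to the synchronous caller as they complete,
# rather than being materialized into a list first
for item in stage:
    print(item)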
I have a scraper that requires the use of a websocket server (can't go into too much detail on why because of company policy) that I'm trying to turn into a template/module for easier use on other websites.
I have one main function that runs the server loop (e.g. it ping-pongs to keep the connection alive and sends work and stop commands when necessary) that I'm trying to turn into a generator that yields the HTML of scraped pages (asynchronously, of course). However, I can't figure out a way to turn the server into a generator.
This is essentially the code I would want (simplified to just show the main idea, of course):
import asyncio, websockets

needsToStart = False  # Setting this to true gets handled somewhere else in the script

async def run(ws):
    global needsToStart
    while True:
        data = await ws.recv()
        if data == "ping":
            await ws.send("pong")
        elif "<html" in data:
            yield data  # Yielding the page data
        if needsToStart:
            await ws.send("work")  # Starts the next scraping session
            needsToStart = False

generator = websockets.serve(run, 'localhost', 9999)
while True:
    html = await anext(generator)
    # Do whatever with html
This, of course, doesn't work, giving the error "TypeError: 'Serve' object is not callable". But is there any way to set up something along these lines? An alternative I could try is creating an intermediary object that holds the data, which the end loop awaits, but that seems messier to me than figuring out a way to get this idea to work.
Thanks in advance.
I found a solution that essentially works backwards, for those in need of the same functionality: instead of yielding the data, I pass along the function that processes said data. Here's the updated example case:
import asyncio, websockets
from functools import partial

needsToStart = False  # Setting this to true gets handled somewhere else in the script

def process(html):
    pass

async def run(ws, htmlFunc):
    global needsToStart
    while True:
        data = await ws.recv()
        if data == "ping":
            await ws.send("pong")
        elif "<html" in data:
            htmlFunc(data)  # Processing the page data
        if needsToStart:
            await ws.send("work")  # Starts the next scraping session
            needsToStart = False

func = partial(run, htmlFunc=process)
websockets.serve(func, 'localhost', 9999)
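Note that websockets.serve(...) on its own only creates the server; something still has to run the event loop. A minimal sketch of how it might be started, assuming the same pre-asyncio.run style used elsewhere in this question:
import asyncio

start_server = websockets.serve(func, 'localhost', 9999)
loop = asyncio.get_event_loop()
loop.run_until_complete(start_server)
loop.run_forever()  # keep serving until the process is stopped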
I have an IO-bound task running in a loop. This task does a lot of work and often hogs the loop (is that the right word for it?). My plan is to run it in a separate process or thread using run_in_executor with a ProcessPoolExecutor or ThreadPoolExecutor, so the main loop can do its work. Currently I use asyncio.PriorityQueue() and asyncio.Event() for communication between tasks and would like to reuse these, or something with the same interface, if possible.
Current code:
# Getter for events and queues so communication can happen
send, receive, send_event, receive_event = await process_obj.get_queues()
# Creates task based off the process object
future = asyncio.create_task(process_obj.main())
Current process code:
async def main():
    while True:
        ...  # does things that hog the loop
What I want to do:
# Getter for events and queues so communication can happen
send, receive, send_event, receive_event = await process_obj.get_queues()
# I assume I could use Thread or Process executors
pool = concurrent.futures.ThreadPoolExecutor()
result = await loop.run_in_executor(pool, process_obj.run)
New process code:
def run():
asyncio.create_task(main())
async def main():
while True:
#does things that hogs loop
How do I communicate between this new thread and the original loop like I could originally?
There is not much of your code that I could reproduce, so please consider this code from a YouTube downloader as an example; I hope it will help you understand how to get a result back from a thread function:
example code:
def on_download(self, is_mp3: bool, is_mp4: bool, url: str) -> None:
    if is_mp3 == False and is_mp4 == False:
        self.ids.info_lbl.text = 'Please select a type of file to download.'
    else:
        self.ids.info_lbl.text = 'Downloading...'
        self.is_mp3 = is_mp3
        self.is_mp4 = is_mp4
        self.url = url
        Clock.schedule_once(self.schedule_download, 2)
        Clock.schedule_interval(self.start_progress_bar, 0.1)

def schedule_download(self, dt: float) -> None:
    '''
    Callback method for the download.
    '''
    pool = ThreadPool(processes=1)
    _downloader = Downloader(self.d_path)
    self.async_result = pool.apply_async(_downloader.download,
                                         (self.is_mp3, self.is_mp4, self.url))
    Clock.schedule_interval(self.check_process, 0.1)

def check_process(self, dt: float) -> None:
    '''
    Check if download is complete.
    '''
    if self.async_result.ready():
        resp = self.async_result.get()
        if resp[0] == 'Error. Download failed.':
            self.ids.info_lbl.text = resp[0]
            # progress bar gray if error
            self.stop_progress_bar(value=0)
        else:
            # progress bar blue if success
            self.stop_progress_bar(value=100)
            self.ids.file_name.text = resp[0]
            self.ids.info_lbl.text = 'Finished downloading.'
        self.ids.url_input.text = ''
        Clock.unschedule(self.check_process)
Personally I prefer from multiprocessing.pool import ThreadPool. Right now it looks like your code 'hogs up' because you are awaiting the result, so obviously the program will wait until there is a result (and that may take a long time). If you look at my example code:
on_download schedules the event schedule_download, which in turn schedules another event, check_process. I can't tell whether your app is a GUI app or a terminal app, as there is pretty much no code in your question, but what you have to do is schedule a check_process-style event in your loop.
If you look at my check_process: if self.async_result.ready(): only proceeds once the result is ready.
Where you currently wait for the result, here everything happens in the background and every now and then the main loop checks for the result (it won't hog up, because if there is no result the main loop carries on doing what it has to rather than waiting for it).
So basically you have to schedule some events (especially the one for the result) in your loop, rather than going line by line and waiting for each one. Does that make sense, and is my example code helpful? Sorry, I am really bad at explaining what is in my head ;)
-> mainloop
   -> new Thread if there is any
   -> check for result if there are any Threads
      -> if there is a result
         -> do something
   -> mainloop keeps running
-> back to top
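For the asyncio setting in the question, the same ready()-polling idea might look roughly like the sketch below; blocking_work is a placeholder for the loop-hogging task, and the 0.1-second polling interval is an arbitrary choice.
import asyncio
import time
from multiprocessing.pool import ThreadPool

def blocking_work():
    # Placeholder for the long-running, loop-hogging task
    time.sleep(2)
    return "done"

async def main():
    pool = ThreadPool(processes=1)
    async_result = pool.apply_async(blocking_work)
    while not async_result.ready():
        # Keep the event loop free while the worker thread runs
        await asyncio.sleep(0.1)
    print(async_result.get())

asyncio.get_event_loop().run_until_complete(main())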
When you execute the while True in your main coroutine, it doesn't so much hog the loop as block it, preventing the rest of the tasks from doing their jobs. Running a process in your event-based application is not the best solution, as processes are not very friendly when it comes to sharing data.
It is possible to do everything concurrently without using parallelism. All you need is to execute an await asyncio.sleep(0) at the end of the while True body. It yields control back to the loop and allows the rest of the tasks to be executed, so we never exit the coroutine.
In the following example, I have a listener that uses while True and handles the data added to the queue by an emitter.
import asyncio
from queue import Empty
from queue import Queue
from random import choice

queue = Queue()

async def listener():
    while True:
        try:
            # data polling from the queue
            data = queue.get_nowait()
            print(data)  # {"type": "event", "data": {...}}
        except (Empty, Exception):
            pass
        finally:
            # the magic action
            await asyncio.sleep(0)

async def emitter():
    # add a data to the queue
    queue.put({"type": "event", "data": {...}})

async def main():
    # first create a task for listener
    running_loop = asyncio.get_running_loop()
    running_loop.create_task(listener())
    for _ in range(5):
        # create tasks for emitter with random intervals to
        # demonstrate that the listener is still running in
        # the loop and handling the data put into the queue
        running_loop.create_task(emitter())
        await asyncio.sleep(choice(range(2)))

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
I am downloading some information from webpages in the form
http://example.com?p=10
http://example.com?p=20
...
The point is that I don't know how many there are. At some point I will receive an error from the server, or maybe at some point I will want to stop processing because I have enough. I want to run them in parallel.
import requests

def generator_query(step=10):
    i = 0
    while True:
        yield "http://example.com?p=%d" % i
        i += step

def task(url):
    t = requests.get(url).text
    if not t:  # after the last one
        return None
    return t
I can implement it with a consumer/producer pattern using queues, but I am wondering whether it is possible to have a higher-level implementation, for example with the concurrent.futures module.
Non-concurrent example:
results = []
for url in generator_query():
    results.append(task(url))
You could use concurrent.futures' ThreadPoolExecutor. An example of how to use it is provided here.
You'll need to break out of the example's for-loop when you're getting invalid answers from the server (the except section) or whenever you feel like you've got enough data (you could count valid responses in the else section, for example).
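A rough sketch of that idea, reusing the task() and generator_query() functions from the question; the window size and the "stop after a whole window of empty responses" rule are assumptions, not part of the linked example:
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import islice

def fetch_all(window=20):
    urls = generator_query()
    results = []
    with ThreadPoolExecutor(max_workers=10) as executor:
        while True:
            batch = list(islice(urls, window))
            futures = [executor.submit(task, url) for url in batch]
            got_valid = False
            for future in as_completed(futures):
                try:
                    result = future.result()
                except Exception:
                    result = None  # treat request errors like an empty page
                if result:
                    results.append(result)
                    got_valid = True
            if not got_valid:
                break  # a whole window of empty/invalid responses: stop
    return results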
You could use aiohttp for this purpose:
import asyncio

import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def coro(step):
    url = 'https://example.com?p={}'.format(step)
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, url)
        print(html)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    tasks = [coro(i * 10) for i in range(10)]
    loop.run_until_complete(asyncio.wait(tasks))
As for the page error, you might have to figure that out yourself since I don't know what website you're dealing with. Maybe try...except?
Notice: if your Python version is higher than 3.5, you might get an SSL certificate verification error.
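One possible way to flesh out the try...except idea, building on the imports above; treating a non-200 status as "past the last page" is an assumption about how the server signals the end:
async def fetch(session, url):
    try:
        async with session.get(url) as response:
            if response.status != 200:
                return None  # assume a non-200 status means we're past the last page
            return await response.text()
    except aiohttp.ClientError:
        return None  # network-level errors are also treated as "no more pages"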
I'm very new to using asyncio/aiohttp, but I have a Python script that reads a batch of URLs from a Postgres table, downloads the URLs, runs a processing function on each download (not relevant for the question), and saves the result of the processing back to the table.
In simplified form it looks like this:
import asyncio

import psycopg2
from aiohttp import ClientSession, TCPConnector

BATCH_SIZE = 100

def _get_pgconn():
    return psycopg2.connect()

def db_conn(func):
    def _db_conn(*args, **kwargs):
        with _get_pgconn() as conn:
            with conn.cursor() as cur:
                return func(cur, *args, **kwargs)
            conn.commit()
    return _db_conn

async def run():
    async with ClientSession(connector=TCPConnector(ssl=False, limit=100)) as session:
        while True:
            count = await run_batch(session)
            if count == 0:
                break

async def run_batch(session):
    tasks = []
    for url in get_batch():
        task = asyncio.ensure_future(process_url(url, session))
        tasks.append(task)
    await asyncio.gather(*tasks)
    results = [task.result() for task in tasks]
    save_batch_result(results)
    return len(results)

async def process_url(url, session):
    try:
        async with session.get(url, timeout=15) as response:
            body = await response.read()
            return process_body(body)
    except:
        return {...}

@db_conn
def get_batch(cur):
    sql = "SELECT id, url FROM db.urls WHERE processed IS NULL LIMIT %s"
    cur.execute(sql, (BATCH_SIZE,))
    return cur.fetchall()

@db_conn
def save_batch_result(cur, results):
    sql = "UPDATE db.urls SET a = %(a)s, processed = true WHERE id = %(id)s"
    cur.executemany(sql, tuple(results))

loop = asyncio.get_event_loop()
loop.run_until_complete(run())
But I have the feeling that I must be missing something here. The script runs, but it seems to become slower and slower with each batch. In particular, the call to the process_url function seems to become slower over time. Also, the memory usage keeps growing, so I'm guessing there is something I'm failing to clean up properly between runs?
I also have problems increasing the batch size much: if I go much over 200 I seem to get a much higher proportion of exceptions from the call to session.get. I have tried playing with the limit argument to the TCPConnector, setting it both higher and lower, but I can't see that it helps much. I have also tried running it on a few different servers but it seems to be the same. Is there some way to think about how to set these values more effectively?
I would be grateful for some pointers to what I might be doing wrong here!
The problem with your code is that it mixes the asynchronous aiohttp library with the synchronous psycopg2 client.
As a consequence, calls to the DB block the event loop entirely, affecting all other parallel tasks.
To solve it you need to use an asynchronous DB client: aiopg (a wrapper around psycopg2's async mode) or asyncpg (it has a different API but works faster).
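For example, the get_batch function from the question might look roughly like the sketch below with aiopg. The DSN is a placeholder, and the pool would have to be created once (e.g. in run()) and passed to the functions that need it instead of using the psycopg2 decorator:
import aiopg

async def get_batch(pool):
    # Same query as before, but it no longer blocks the event loop
    async with pool.acquire() as conn:
        async with conn.cursor() as cur:
            await cur.execute(
                "SELECT id, url FROM db.urls WHERE processed IS NULL LIMIT %s",
                (BATCH_SIZE,),
            )
            return await cur.fetchall()

async def run():
    async with aiopg.create_pool(dsn="dbname=db") as pool:
        ...  # create the ClientSession and call run_batch(pool, session) as before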
In Tornado we can use the coroutine decorator to write an asynchronous function neatly as a Python generator, where each yield statement returns to the scheduler and the final raise/return returns a single value to the caller. But is there any way to return a sequence of values to the caller, interspersed with asynchronous calls?
E.g. how could I turn this synchronous function:
def crawl_site_sync(rooturi):
    rootpage = fetch_page_sync(rooturi)
    links = extract_links(rootpage)
    for link in links:
        yield fetch_page_sync(link.uri)
...which I can call like this:
for page in crawl_site_sync("http://example.com/page.html"):
    show_summary(page)
...into a similar-looking asynchronous function in Tornado? E.g.:
@tornado.gen.coroutine
def crawl_site_async(rooturi):
    # Yield a future to the scheduler:
    rootpage = yield fetch_page_async(rooturi)
    links = extract_links(rootpage)
    for link in links:
        # Yield a future to the scheduler:
        sub_page = yield fetch_page_async(link.uri)
        # Yield a value to the caller:
        really_really_yield sub_page  # ???
And how would I call it?
for page in yield crawl_site_async("http://example.com/page.html"):
    # This won't work, the yield won't return until the entire
    # coroutine has finished, and it won't give us an iterable.
    show_summary(page)
I can think of ways to get it done, but all of them involve changing the call-site and the function to such a degree that it completely loses the benefit of the asynchronous version looking very similar to the synchronous version, and it no longer composes cleanly. I feel like I must be missing a trick here. Is there some way to simultaneously use a Python generator as a sequence of lazily computed values and as a Tornado coroutine?
I'd use a Queue from Toro, which is designed for coroutines to cooperate like this. Here's a simple example:
from tornado.ioloop import IOLoop
from tornado import gen
from tornado.httpclient import AsyncHTTPClient
from toro import Queue

q = Queue(maxsize=1)

@gen.coroutine
def consumer():
    item = yield q.get()
    while item:
        print(item)
        item = yield q.get()

@gen.coroutine
def producer():
    try:
        client = AsyncHTTPClient()
        for url in [
                'http://tornadoweb.org',
                'http://python.org',
                'http://readthedocs.org']:
            response = yield client.fetch(url)
            item = (url, len(response.body))
            yield q.put(item)
        # Done.
        q.put(None)
    except Exception:
        IOLoop.current().stop()
        raise

future = producer()
IOLoop.current().run_sync(consumer, timeout=20)
A more detailed web crawler example is in Toro's docs, here:
https://toro.readthedocs.org/en/stable/examples/web_spider_example.html