I'm trying to write some asynchronous code. I started with some public code like the following:
import asyncio
import aiohttp

urls = ['www.example.com/1', 'www.example.com/2', ...]
tasks = []

async def fetch(url, session) -> str:
    async with session.get(url) as resp:
        return await resp.text()

async def main():
    async with aiohttp.ClientSession() as session:
        for url in urls:
            tasks.append(asyncio.create_task(fetch(url, session)))
        response = await asyncio.gather(*tasks, return_exceptions=True)

asyncio.run(main())
I realized that there is another way to get the same result by writing main() as below:
async def main_2():
    async with aiohttp.ClientSession() as session:
        for url in urls:
            tasks.append(asyncio.create_task(fetch(url, session)))
        response = []
        for t in tasks:
            response.append(await t)
Both methods take the same time to finish. Since processing the responses inside main_2() is so easy, what are the benefits of using asyncio.gather?
Advantages:
It automatically schedules any coroutines as tasks for you. If you hadn't been creating the tasks manually, the non-gather approach wouldn't even start running them until you tried to await them (losing all the benefits of async processing), whereas gather creates tasks for all of them up front and then awaits them in bulk.
When using return_exceptions=False (the default), you'll know immediately when something has gone wrong; with the loop, you might process dozens of results before one turns out to have failed. This may or may not be advantageous, depending on your needs. asyncio.as_completed may serve better in certain cases (getting results in completion order, as soon as they come in, rather than waiting for everything to finish).
If you save off the gather to a name before awaiting it, you can bulk cancel any outstanding tasks when an exception occurs and return_exceptions=False (just try:/except Exception: gathername.cancel(), without needing to know which tasks need canceling).
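A minimal sketch of that cancel-everything pattern, reusing fetch() and urls from the question's code (the try/except structure here is illustrative, not part of the original code):

async def main():
    async with aiohttp.ClientSession() as session:
        gathered = asyncio.gather(*(fetch(url, session) for url in urls))
        try:
            responses = await gathered
        except Exception:
            # one coroutine failed; cancel everything that is still running
            gathered.cancel()
            raise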
Personally, I usually find asyncio.as_completed more useful, in the same way multiprocessing.Pool.imap_unordered is nicer than multiprocessing.Pool.map (because result ordering rarely matters, and it's nice to process results immediately as they become available), but asyncio.gather is the simpler "all-in-one, wait for everything before continuing" interface.
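For comparison, a hedged sketch of the as_completed style with the same fetch() and urls (the prints are just placeholders for real result handling):

async def main():
    async with aiohttp.ClientSession() as session:
        pending = [fetch(url, session) for url in urls]
        for earliest in asyncio.as_completed(pending):
            try:
                text = await earliest  # results arrive in completion order
            except Exception as exc:
                print(f"request failed: {exc!r}")
            else:
                print(f"fetched {len(text)} characters")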
I'm trying to make a simple web crawler using trio and asks. I use a nursery to start a couple of crawlers at once, and a memory channel to maintain a list of urls to visit.
Each crawler receives clones of both ends of that channel, so they can grab a url (via receive_channel), read it, find and add new urls to be visited (via send_channel).
async def main():
    send_channel, receive_channel = trio.open_memory_channel(0)
    async with trio.open_nursery() as nursery:
        async with send_channel, receive_channel:
            nursery.start_soon(crawler, send_channel.clone(), receive_channel.clone())
            nursery.start_soon(crawler, send_channel.clone(), receive_channel.clone())
            nursery.start_soon(crawler, send_channel.clone(), receive_channel.clone())

async def crawler(send_channel, receive_channel):
    async for url in receive_channel:  # I'm a consumer!
        content = await ...
        urls_found = ...
        for u in urls_found:
            await send_channel.send(u)  # I'm a producer too!
In this scenario the consumers are also producers. How do I stop everything gracefully?
The conditions for shutting everything down are:
channel is empty
AND
all crawlers are stuck at the first for loop, waiting for the url to appear in receive_channel (which... won't happen anymore)
I tried using async with send_channel inside crawler(), but could not find a good way to do it. I also tried some different approaches (a memory-channel-bound worker pool, etc.), with no luck there either.
There are at least two problems here.
First, there is your assumption about stopping when the channel is empty. Since you allocate the memory channel with a size of 0, it will always be empty. You are only able to hand off a url if a crawler is ready to receive it.
This creates problem number two: if you ever find more urls than you have allocated crawlers, your application will deadlock.
The reason is that since you won't be able to hand off all of your found urls to a crawler, that crawler will never be ready to receive a new url to crawl, because it is stuck waiting for another crawler to take one of its urls.
This gets even worse: if one of the other crawlers finds new urls, they too will get stuck behind the crawler that is already waiting to hand off its urls, and they will never be able to take one of the urls that are waiting to be processed.
Relevant portion of the documentation:
https://trio.readthedocs.io/en/stable/reference-core.html#buffering-in-channels
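To see the blocking behaviour in isolation, here is a small self-contained demo (not from the original post) of a task that both produces and consumes on a zero-capacity channel and therefore wedges itself on its own send():

import trio

async def worker(send_channel, receive_channel):
    async for item in receive_channel:
        # With capacity 0, this send() only completes once another task is
        # already waiting in receive(); if every worker is parked here,
        # nothing ever receives and the program deadlocks.
        await send_channel.send(item + 1)

async def main():
    send_channel, receive_channel = trio.open_memory_channel(0)
    with trio.move_on_after(2):  # bail out so the demo terminates
        async with trio.open_nursery() as nursery:
            nursery.start_soon(worker, send_channel.clone(), receive_channel.clone())
            await send_channel.send(0)  # succeeds: the worker is ready to receive
            # main never calls receive(), so the worker is now stuck in its send()
    print("timed out: the worker deadlocked on its own send()")

trio.run(main)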
Assuming we fix that, where to go next?
You probably need to keep a list (set?) of all visited urls, to make sure you don't visit them again.
To actually figure out when to stop, instead of closing the channels, it is probably a lot easier to simply cancel the nursery.
Let's say we modify the main loop like this:
async def main():
    send_channel, receive_channel = trio.open_memory_channel(math.inf)
    active_workers = trio.CapacityLimiter(3)  # Number of workers

    async with trio.open_nursery() as nursery:
        async with send_channel, receive_channel:
            nursery.start_soon(crawler, active_workers, send_channel, receive_channel)
            nursery.start_soon(crawler, active_workers, send_channel, receive_channel)
            nursery.start_soon(crawler, active_workers, send_channel, receive_channel)
            while True:
                await trio.sleep(1)  # Give the workers a chance to start up.
                if active_workers.borrowed_tokens == 0 and send_channel.statistics().current_buffer_used == 0:
                    nursery.cancel_scope.cancel()  # All done!
Now we need to modify the crawlers slightly, to pick up a token when active.
async def crawler(active_workers, send_channel, receive_channel):
    async for url in receive_channel:  # I'm a consumer!
        with active_workers:
            content = await ...
            urls_found = ...
            for u in urls_found:
                await send_channel.send(u)  # I'm a producer too!
Other things to consider -
You may want to use send_channel.send_nowait(u) in the crawler. Since you have an unbounded buffer, there is no chance of a WouldBlock exception, and not triggering a checkpoint on every send might be desirable. That way you know for sure that a particular url is fully processed and all new urls have been added before other tasks get a chance to grab a new url, or before the parent task gets a chance to check whether the work is done.
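A sketch of that variant (only the loop body changes; fetch_page and extract_links are made-up names standing in for the elided request and parsing code):

async def crawler(active_workers, send_channel, receive_channel):
    async for url in receive_channel:
        with active_workers:
            content = await fetch_page(url)       # placeholder for the real request
            urls_found = extract_links(content)   # placeholder for the real parsing
            for u in urls_found:
                # No checkpoint here: with an unbounded buffer this cannot raise
                # trio.WouldBlock, and no other task runs until the whole batch
                # of new urls has been queued.
                send_channel.send_nowait(u)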
This is a solution I came up with when I tried to reorganize the problem:
async def main():
    send_channel, receive_channel = trio.open_memory_channel(math.inf)
    limit = trio.CapacityLimiter(3)

    async with send_channel:
        await send_channel.send(('https://start-url', send_channel.clone()))
    #HERE1

    async with trio.open_nursery() as nursery:
        async for url, send_channel in receive_channel:  #HERE3
            nursery.start_soon(crawler, url, send_channel, limit)

async def crawler(url, send_channel, limit):
    async with limit, send_channel:
        content = await ...
        links = ...
        for link in links:
            await send_channel.send((link, send_channel.clone()))
    #HERE2
(I left out the handling of already-visited urls.)
Here there are no three long-lived consumers; instead there are at most three consumers whenever there is enough work for them.
At #HERE1 the send_channel is closed (because it was used as a context manager); the only thing keeping the channel alive is the clone of it sitting inside the channel.
At #HERE2 the clone is also closed (again, because of the context manager). If the channel is empty, then that clone was the last thing keeping the channel alive. The channel dies, and the for loop ends (#HERE3).
UNLESS there were urls found, in which case they were added to the channel, together with more clones of send_channel that will keep the channel alive long enough to process those urls.
Both this and Anders E. Andersen's solutions feel hacky to me: one uses sleep and statistics(), the other creates clones of send_channel and puts them in the channel... it feels like a software implementation of a Klein bottle to me. I will probably look for some other approach.
I'm currently designing a spider to crawl a specific website. I can do it synchronously, but I'm trying to get my head around asyncio to make it as efficient as possible. I've tried a lot of different approaches, with yield, chained functions, and queues, but I can't make it work.
I'm most interested in the design part and the logic to solve the problem. Not necessarily runnable code, but rather a highlight of the most important aspects of asyncio. I can't post any code, because my attempts are not worth sharing.
The mission:
The site exemple.com (I know, it should be example.com) has the following design:
In a synchronous manner, the logic would be like this:
for table in my_url_list:
    # Get HTML
    # Extract urls from HTML to user_list
    for user in user_list:
        # Get HTML
        # Extract urls from HTML to user_subcat_list
        for subcat in user_subcat_list:
            # extract content
But now I would like to scrape the site asynchronously. Let's say we're using 5 instances (tabs in pyppeteer or requests in aiohttp) to parse the content. How should we design it to make it most efficient, and what asyncio syntax should we use?
Update
Thanks to @user4815162342, who solved my problem. I've been playing around with his solution, and I post runnable code below in case someone else wants to play around with asyncio.
import asyncio
import random

my_url_list = ['exemple.com/table1', 'exemple.com/table2', 'exemple.com/table3']

# Random sleeps to simulate requests to the server
async def randsleep(caller=None):
    i = random.randint(1, 6)
    if caller:
        print(f"Request HTML for {caller} sleeping for {i} seconds.")
    await asyncio.sleep(i)

async def process_urls(url_list):
    print(f'async def process_urls: added {url_list}')
    limit = asyncio.Semaphore(5)
    coros = [process_user_list(table, limit) for table in url_list]
    await asyncio.gather(*coros)

async def process_user_list(table, limit):
    async with limit:
        # Simulate HTML request and extracting urls to populate user_list
        await randsleep(table)
        if table[-1] == '1':
            user_list = ['exemple.com/user1', 'exemple.com/user2', 'exemple.com/user3']
        elif table[-1] == '2':
            user_list = ['exemple.com/user4', 'exemple.com/user5', 'exemple.com/user6']
        else:
            user_list = ['exemple.com/user7', 'exemple.com/user8', 'exemple.com/user9']
        print(f'async def process_user_list: Extracted {user_list} from {table}')

    # Execute process_user in parallel, but do so outside the `async with`
    # because process_user will also need the semaphore, and we don't need
    # it any more since we're done with fetching HTML.
    coros = [process_user(user, limit) for user in user_list]
    await asyncio.gather(*coros)

async def process_user(user, limit):
    async with limit:
        # Simulate HTML request and extracting urls to populate user_subcat_list
        await randsleep(user)
        user_subcat_list = [user + '/profile', user + '/info', user + '/followers']
        print(f'async def process_user: Extracted {user_subcat_list} from {user}')
    coros = [process_subcat(subcat, limit) for subcat in user_subcat_list]
    await asyncio.gather(*coros)

async def process_subcat(subcat, limit):
    async with limit:
        # Simulate HTML request and extracting content
        await randsleep(subcat)
        print(f'async def process_subcat: Extracted content from {subcat}')

if __name__ == '__main__':
    asyncio.run(process_urls(my_url_list))
Let's restructure the sync code so that each piece that can access the network is in a separate function. The functionality is unchanged, but it will make things easier later:
def process_urls(url_list):
    for table in url_list:
        process_user_list(table)

def process_user_list(table):
    # Get HTML, extract user_list
    for user in user_list:
        process_user(user)

def process_user(user):
    # Get HTML, extract user_subcat_list
    for subcat in user_subcat_list:
        process_subcat(subcat)

def process_subcat(subcat):
    # get HTML, extract content

if __name__ == '__main__':
    process_urls(my_url_list)
Assuming that the order of processing doesn't matter, we'd like the async version to run all the functions that are now called in for loops in parallel. They'll still run on a single thread, but they will await anything that might block, allowing the event loop to parallelize the waiting and drive them all to completion by resuming each coroutine whenever it is ready to proceed. This is achieved by spawning each coroutine as a separate task that runs independently of the other tasks and therefore in parallel. For example, a sequential (but still async) version of process_urls would look like this:
async def process_urls(url_list):
    for table in url_list:
        await process_user_list(table)
This is async because it is running inside an event loop, and you could run several such functions in parallel (which we'll show how to do shortly), but it's also sequential because it chooses to await each invocation of process_user_list. At each loop iteration the await explicitly instructs asyncio to suspend execution of process_urls until the result of process_user_list is available.
What we want instead is to tell asyncio to run all invocations of process_user_list in parallel, and to suspend execution of process_urls until they're all done. The basic primitive to spawn a coroutine in the "background" is to schedule it as a task using asyncio.create_task, which is the closest async equivalent of a light-weight thread. Using create_task the parallel version of process_urls would look like this:
async def process_urls(url_list):
    # spawn a task for each table
    tasks = []
    for table in url_list:
        task = asyncio.create_task(process_user_list(table))
        tasks.append(task)

    # The tasks are now all spawned, so awaiting one task lets
    # them all run.
    for task in tasks:
        await task
At first glance the second loop looks like it awaits tasks in sequence like the previous version, but this is not the case. Since each await suspends to the event loop, awaiting any task allows all tasks to progress, as long as they were scheduled beforehand using create_task(). The total waiting time will be no longer than the time of the longest task, regardless of the order in which they finish.
This pattern is used so often that asyncio has a dedicated utility function for it, asyncio.gather. Using it, the same code can be expressed much more concisely:
async def process_urls(url_list):
    coros = [process_user_list(table) for table in url_list]
    await asyncio.gather(*coros)
But there is another thing to take care of: since process_user_list fetches HTML from the server, and many instances of it will run in parallel, we cannot allow it to hammer the server with hundreds of simultaneous connections. We could create a pool of worker tasks and some sort of queue, but asyncio offers a more elegant solution: the semaphore. A semaphore is a synchronization device that doesn't allow more than a pre-determined number of activations in parallel, making the rest wait in line.
The final version of process_urls creates a semaphore and just passes it down. It doesn't activate the semaphore because process_urls doesn't actually fetch any HTML itself, so there is no reason for it to hold a semaphore slot while the process_user_list calls are running.
async def process_urls(url_list):
    limit = asyncio.Semaphore(5)
    coros = [process_user_list(table, limit) for table in url_list]
    await asyncio.gather(*coros)
process_user_list looks similar, but it does need to activate the semaphore using async with:
async def process_user_list(table, limit):
    async with limit:
        # Get HTML using aiohttp, extract user_list

    # Execute process_user in parallel, but do so outside the `async with`
    # because process_user will also need the semaphore, and we don't need
    # it any more since we're done with fetching HTML.
    coros = [process_user(user, limit) for user in user_list]
    await asyncio.gather(*coros)
process_user and process_subcat are more of the same:
async def process_user(user, limit):
    async with limit:
        # Get HTML, extract user_subcat_list
    coros = [process_subcat(subcat, limit) for subcat in user_subcat_list]
    await asyncio.gather(*coros)

async def process_subcat(subcat, limit):
    async with limit:
        # get HTML, extract content
        # do something with content
if __name__ == '__main__':
    asyncio.run(process_urls(my_url_list))
In practice you will probably want the async functions to share the same aiohttp session, so you'd create it in the top-level function (process_urls in your case) and pass it down along with the semaphore. Each function that fetches HTML would have another async with for the aiohttp request/response, such as:
async with limit:
    async with session.get(url, params...) as resp:
        # get HTML data here
        resp.raise_for_status()
        data = await resp.read()
        # extract content from HTML data here
The two async withs can be collapsed into one, reducing the indentation but keeping the same meaning:
async with limit, session.get(url, params...) as resp:
    # get HTML data here
    resp.raise_for_status()
    data = await resp.read()
    # extract content from HTML data here
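For a concrete (if stripped-down) picture of the session sharing, here is a minimal runnable sketch; fetch_one is a made-up helper, not part of the crawl hierarchy above, and exists only to show the shared session and semaphore being passed down:

import asyncio
import aiohttp

async def fetch_one(url, limit, session):
    # semaphore and request collapsed into one `async with`, as shown above
    async with limit, session.get(url) as resp:
        resp.raise_for_status()
        return await resp.read()

async def process_urls(url_list):
    limit = asyncio.Semaphore(5)
    # one session for the whole crawl, created at the top level and passed down
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch_one(u, limit, session) for u in url_list))
    for url, page in zip(url_list, pages):
        print(url, len(page), "bytes")

if __name__ == '__main__':
    asyncio.run(process_urls(["https://example.com/"]))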
I am running a program that makes three different requests to a REST API. The data, indicator, and request functions all fetch data from BitMEX's API using a wrapper I've made.
I have used asyncio to try to speed up the process, so that while I am waiting on a response from a previous request, it can begin to make another one.
However, my asynchronous version is not running any quicker for some reason. The code works and, as far as I know, I have set everything up correctly. But could there be something wrong with how I am setting up the coroutines?
Here is the asynchronous version:
import time
import asyncio
from bordemwrapper import BitMEXData, BitMEXFunctions

'''
asynchronous I/O
'''

async def data():
    data = BitMEXData().get_ohlcv(symbol='XBTUSD', timeframe='1h',
                                  instances=25)
    await asyncio.sleep(0)
    return data

async def indicator():
    indicator = BitMEXData().get_indicator(symbol='XBTUSD',
        timeframe='1h', indicator='RSI', period=20, source='close',
        instances=25)
    await asyncio.sleep(0)
    return indicator

async def request():
    request = BitMEXFunctions().get_price()
    await asyncio.sleep(0)
    return request

async def chain():
    data_ = await data()
    indicator_ = await indicator()
    request_ = await request()
    return data_, indicator_, request_

async def main():
    await asyncio.gather(chain())

if __name__ == '__main__':
    start = time.perf_counter()
    asyncio.run(main())
    end = time.perf_counter()
    print('process finished in {} seconds'.format(end - start))
Unfortunately, asyncio isn't magic. Although you've put them in async functions, the BitMEXData().get_<foo> functions are not themselves async (i.e. you can't await them), and therefore block while they run. The concurrency in asyncio can only occur while awaiting something.
You'll need a library which makes the actual HTTP requests asynchronously, like aiohttp. It sounds like you wrote bordemwrapper yourself - you should rewrite the get_<foo> functions to use asynchronous HTTP requests. Feel free to submit a separate question if you need help with that.
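Purely for illustration (the endpoints and parameters below are placeholders, not bordemwrapper's or BitMEX's real API), an async version of one of those fetchers would look roughly like this, after which the separate calls can be gathered and run concurrently:

import asyncio
import aiohttp

async def get_ohlcv(session, symbol, timeframe, instances):
    # placeholder URL/params; the point is that the request itself is awaited,
    # so the event loop can run the other requests while this one waits
    async with session.get('https://example.invalid/ohlcv',
                           params={'symbol': symbol, 'tf': timeframe, 'n': instances}) as resp:
        resp.raise_for_status()
        return await resp.json()

async def get_price(session, symbol):
    async with session.get('https://example.invalid/price', params={'symbol': symbol}) as resp:
        resp.raise_for_status()
        return await resp.json()

async def main():
    async with aiohttp.ClientSession() as session:
        ohlcv, price = await asyncio.gather(
            get_ohlcv(session, 'XBTUSD', '1h', 25),
            get_price(session, 'XBTUSD'),
        )

asyncio.run(main())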
I have a list of URLs of websites that I want to download repeatedly (in variable time intervals) using Python. It is necessary to do that asynchronously to cope with a large number of websites and/or long response times.
I've tried many things with event loops, queues, async functions, asyncio, etc., but I cannot get it working. The following very simple version downloads the websites repeatedly, but it does not download them concurrently - instead the next download only starts after the previous one is finished.
import asyncio
import datetime
import aiohttp

def produce_helper(url: str):
    # helper, because I cannot call an async function with loop.call_later
    loop.create_task(produce(url))

async def produce(url: str):
    await q.put(url)
    print(f'{datetime.datetime.now().strftime("%H:%M:%S.%f")} - Produced {url}')

async def consume():
    async with aiohttp.ClientSession() as session:
        while True:
            url = await q.get()
            print(f'{datetime.datetime.now().strftime("%H:%M:%S.%f")} - Start: {url}')
            async with session.get(url, timeout=10) as response:
                print(f'{datetime.datetime.now().strftime("%H:%M:%S.%f")} - Finished: {url}')
            q.task_done()
            loop.call_later(10, produce_helper, url)

q = asyncio.Queue()
url_list = ["https://www.google.com/", "https://www.bing.com/", "https://www.yelp.com/"]
loop = asyncio.get_event_loop()
for url in url_list:
    loop.create_task(produce(url))
loop.create_task(consume())
loop.run_forever()
Is this a suitable approach for my problem? Is there anything better conceptually?
And how do I accomplish concurrent downloads?
Any help is appreciated.
EDIT:
The challenge (as described in the comment below) is the following: after each successful download, I want to add the respective URL back to the queue, to become due after a specified waiting time (10 s in the example in my question). As soon as it is due, I want to download the website again, add the URL back to the queue, and so on.
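One possible shape for that (a sketch under the stated requirements, not a drop-in fix): let the consumer only dequeue and spawn a task per URL, and let each task do the download, sleep out the waiting time, and put its URL back on the queue. fetch_and_requeue is a made-up helper name.

import asyncio
import datetime
import aiohttp

async def fetch_and_requeue(url, session, q, interval=10):
    print(f'{datetime.datetime.now().strftime("%H:%M:%S.%f")} - Start: {url}')
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as response:
        await response.read()
    print(f'{datetime.datetime.now().strftime("%H:%M:%S.%f")} - Finished: {url}')
    await asyncio.sleep(interval)  # wait until this url is due again
    await q.put(url)

async def consume(q, session):
    while True:
        url = await q.get()
        # one task per url, so downloads overlap instead of running one by one
        asyncio.create_task(fetch_and_requeue(url, session, q))
        q.task_done()

async def main(url_list):
    q = asyncio.Queue()
    for url in url_list:
        await q.put(url)
    async with aiohttp.ClientSession() as session:
        await consume(q, session)  # runs forever

asyncio.run(main(["https://www.google.com/", "https://www.bing.com/", "https://www.yelp.com/"]))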
I have a Python Discord bot built with discord.py, meaning the entire program runs inside an event loop.
The function I'm working on involves making several hundred HTTP requests and adding the results to a final list. It takes about two minutes to do them in order, so I'm using aiohttp to make them async. The related parts of my code are identical to the quickstart example in the aiohttp docs, but it's throwing RuntimeError: Session is closed. The methodology was taken from an example at https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html under 'Fetch multiple URLs'.
async def searchPostList(postUrls, searchString):
    futures = []
    async with aiohttp.ClientSession() as session:
        for url in postUrls:
            task = asyncio.ensure_future(searchPost(url, searchString, session))
            futures.append(task)

    return await asyncio.gather(*futures)

async def searchPost(url, searchString, session):
    async with session.get(url) as response:
        page = await response.text()
        # Assorted verification and parsing
        return data
I don't know why this error turns up since my code is so similar to two presumably functional examples. The event loop itself is working fine. It runs forever, since this is a bot application.
In the example you linked, the gathering of results was within the async with block. If you do it outside, there's no guarantee that the session won't close before the requests are even made!
Moving your return statement inside the block should work:
async with aiohttp.ClientSession() as session:
    for url in postUrls:
        task = asyncio.ensure_future(searchPost(url, searchString, session))
        futures.append(task)
    return await asyncio.gather(*futures)