I have a list of URLs of websites that I want to download repeatedly (at variable time intervals) using Python. The downloads need to happen asynchronously to cope with a large number of websites and/or long response times.
I've tried many things with event loops, queues, async functions, asyncio, etc., but I cannot get it working. The following very simple version downloads the websites repeatedly, but it does not download them concurrently - instead, the next download only starts after the previous one has finished.
import asyncio
import datetime
import aiohttp

def produce_helper(url: str):
    # helper, because I cannot call an async function with loop.call_later
    loop.create_task(produce(url))

async def produce(url: str):
    await q.put(url)
    print(f'{datetime.datetime.now().strftime("%H:%M:%S.%f")} - Produced {url}')

async def consume():
    async with aiohttp.ClientSession() as session:
        while True:
            url = await q.get()
            print(f'{datetime.datetime.now().strftime("%H:%M:%S.%f")} - Start: {url}')
            async with session.get(url, timeout=10) as response:
                print(f'{datetime.datetime.now().strftime("%H:%M:%S.%f")} - Finished: {url}')
            q.task_done()
            loop.call_later(10, produce_helper, url)

q = asyncio.Queue()
url_list = ["https://www.google.com/", "https://www.bing.com/", "https://www.yelp.com/"]
loop = asyncio.get_event_loop()
for url in url_list:
    loop.create_task(produce(url))
loop.create_task(consume())
loop.run_forever()
Is this a suitable approach for my problem? Is there anything better conceptually?
And how do I accomplish concurrent downloads?
Any help is appreciated.
EDIT:
The challenge (as described in the comment below) is the following: after each successful download, I want to add the respective URL back to the queue - to become due after a specified waiting time (10 s in the example in my question). As soon as it is due, I want to download the website again, add the URL back to the queue, and so on.
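For reference, a minimal sketch of one way to get concurrent repeated downloads: drop the queue entirely and run one looping task per URL. The poll coroutine, the shared session, and the fixed 10-second interval below are illustrative, not taken from the code above:

import asyncio
import datetime
import aiohttp

url_list = ["https://www.google.com/", "https://www.bing.com/", "https://www.yelp.com/"]

async def poll(url, session, interval=10):
    # One task per URL: download, wait, repeat. All tasks run concurrently.
    while True:
        print(f'{datetime.datetime.now():%H:%M:%S.%f} - Start: {url}')
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as response:
            await response.read()
        print(f'{datetime.datetime.now():%H:%M:%S.%f} - Finished: {url}')
        await asyncio.sleep(interval)

async def main():
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*[poll(url, session) for url in url_list])

asyncio.run(main())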
Related
I have 10 links in my CSV which I'm trying to run all at the same time in a loop from the getTasks function. However, the way it's working now, it sends a request to link 1, waits for it to complete, then link 2, and so on. I want the 10 links that I have to run all at once whenever startTask is called, leading to 10 requests a second.
Anyone know how to code that using the code below? Thanks in advance.
import requests
from bs4 import BeautifulSoup
import asyncio

def getTasks(tasks):
    for task in tasks:
        asyncio.run(startTask(task))

async def startTask(task):
    success = await getProduct(task)
    if success is None:
        return startTask(task)

    success = await addToCart(task)
    if success is None:
        return startTask(task)

    ...
    ...
    ...

getTasks(tasks)
First of all, to send your requests concurrently, you should use aiohttp instead of the requests package, which blocks on I/O. You can also use asyncio's semaphore to limit the number of requests that run at the same time.
import asyncio
import aiohttp

# read links from CSV
links = [
    ...
]

semaphore = asyncio.BoundedSemaphore(10)
# 10 is the max count of concurrent tasks
# that can be processed at the same time.
# In this case, tasks are requests.

async def async_request(url):
    async with aiohttp.ClientSession() as session:
        async with semaphore, session.get(url) as response:
            return await response.text()

async def main():
    result = await asyncio.gather(*[
        async_request(link) for link in links
    ])
    print(result)  # [response1, response2, ...]

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    loop.close()
I'm trying to write some asynchronous code. I started with public code like the following:
import asyncio
import aiohttp

urls = ['www.example.com/1', 'www.example.com/2', ...]
tasks = []

async def fetch(url, session) -> str:
    async with session.get(url) as resp:
        return await resp.text()

async def main():
    async with aiohttp.ClientSession() as session:
        for url in urls:
            tasks.append(asyncio.create_task(fetch(url, session)))
        response = await asyncio.gather(*tasks, return_exceptions=True)

asyncio.run(main())
I realized that there is another way to get the same result by writing main() as below:
async def main_2():
    async with aiohttp.ClientSession() as session:
        for url in urls:
            tasks.append(asyncio.create_task(fetch(url, session)))
        response = []
        for t in tasks:
            response.append(await t)
Both methods take the same time to finish. So, given that processing the responses inside main_2() is so easy, what are the benefits of using asyncio.gather?
Advantages:
It automatically schedules any coroutines as tasks for you. If you hadn't been creating the tasks manually, the non-gather approach wouldn't even start running them until you tried to await them (losing all the benefits of async processing), whereas gather creates tasks for all of them up front and then awaits them in bulk.
When using return_exceptions=False (the default), you'll know when something has gone wrong immediately; with the loop, you might process dozens of results before one turns out to have failed. This may or may not be advantageous, depending on your needs. asyncio.as_completed may serve better in certain cases (getting results in completion order, as soon as they come in, rather than waiting for everything to finish).
If you save off the gather to a name before awaiting it, you can bulk cancel any outstanding tasks when an exception occurs and return_exceptions=False (just try:/except Exception: gathername.cancel(), without needing to know which tasks need canceling).
Personally, I usually find asyncio.as_completed more useful, in the same way multiprocessing.Pool.imap_unordered is nicer than multiprocessing.Pool.map (because result ordering rarely matters, and it's nice to process results immediately as they become available), but asyncio.gather is the simpler "all-in-one, wait for everything before continuing" interface.
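As a small self-contained sketch of the as_completed pattern (the work coroutine and its sleep-based delays are just stand-ins for real I/O):

import asyncio

async def work(i):
    await asyncio.sleep(3 - i)
    return i

async def main():
    # Results arrive in completion order (2, 1, 0 here), not submission order.
    for fut in asyncio.as_completed([work(i) for i in range(3)]):
        print("finished:", await fut)

asyncio.run(main())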
I'm currently designing a spider to crawl a specific website. I can do it synchronously, but I'm trying to get my head around asyncio to make it as efficient as possible. I've tried a lot of different approaches, with yield, chained functions and queues, but I can't make it work.
I'm most interested in the design part and the logic to solve the problem. Not necessarily runnable code, but rather a highlight of the most important aspects of asyncio. I can't post any code, because my attempts are not worth sharing.
The mission:
The site exemple.com (I know, it should be example.com) has the following design:
In a synchronous manner, the logic would look like this:
for table in my_url_list:
    # Get HTML
    # Extract urls from HTML to user_list
    for user in user_list:
        # Get HTML
        # Extract urls from HTML to user_subcat_list
        for subcat in user_subcat_list:
            # extract content
But now I would like to scrape the site asynchronously. Let's say we are using 5 instances (tabs in pyppeteer or requests in aiohttp) to parse the content. How should we design it to make it most efficient, and what asyncio syntax should we use?
Update
Thanks to #user4815162342, who solved my problem. I've been playing around with his solution, and I post runnable code below in case someone else wants to play around with asyncio.
import asyncio
import random

my_url_list = ['exemple.com/table1', 'exemple.com/table2', 'exemple.com/table3']

# Random sleeps to simulate requests to the server
async def randsleep(caller=None):
    i = random.randint(1, 6)
    if caller:
        print(f"Request HTML for {caller} sleeping for {i} seconds.")
    await asyncio.sleep(i)

async def process_urls(url_list):
    print(f'async def process_urls: added {url_list}')
    limit = asyncio.Semaphore(5)
    coros = [process_user_list(table, limit) for table in url_list]
    await asyncio.gather(*coros)

async def process_user_list(table, limit):
    async with limit:
        # Simulate HTML request and extracting urls to populate user_list
        await randsleep(table)
        if table[-1] == '1':
            user_list = ['exemple.com/user1', 'exemple.com/user2', 'exemple.com/user3']
        elif table[-1] == '2':
            user_list = ['exemple.com/user4', 'exemple.com/user5', 'exemple.com/user6']
        else:
            user_list = ['exemple.com/user7', 'exemple.com/user8', 'exemple.com/user9']
        print(f'async def process_user_list: Extracted {user_list} from {table}')

    # Execute process_user in parallel, but do so outside the `async with`
    # because process_user will also need the semaphore, and we don't need
    # it any more since we're done with fetching HTML.
    coros = [process_user(user, limit) for user in user_list]
    await asyncio.gather(*coros)

async def process_user(user, limit):
    async with limit:
        # Simulate HTML request and extracting urls to populate user_subcat_list
        await randsleep(user)
        user_subcat_list = [user + '/profile', user + '/info', user + '/followers']
        print(f'async def process_user: Extracted {user_subcat_list} from {user}')
    coros = [process_subcat(subcat, limit) for subcat in user_subcat_list]
    await asyncio.gather(*coros)

async def process_subcat(subcat, limit):
    async with limit:
        # Simulate HTML request and extracting content
        await randsleep(subcat)
        print(f'async def process_subcat: Extracted content from {subcat}')

if __name__ == '__main__':
    asyncio.run(process_urls(my_url_list))
Let's restructure the sync code so that each piece that can access the network is in a separate function. The functionality is unchanged, but it will make things easier later:
def process_urls(url_list):
    for table in url_list:
        process_user_list(table)

def process_user_list(table):
    # Get HTML, extract user_list
    for user in user_list:
        process_user(user)

def process_user(user):
    # Get HTML, extract user_subcat_list
    for subcat in user_subcat_list:
        process_subcat(subcat)

def process_subcat(subcat):
    # get HTML, extract content

if __name__ == '__main__':
    process_urls(my_url_list)
Assuming that the order of processing doesn't matter, we'd like the async version to run all the functions that are now called in for loops in parallel. They'll still run on a single thread, but they will await anything that might block, allowing the event loop to parallelize the waiting and drive them to completion by resuming each coroutine whenever it is ready to proceed. This is achieved by spawning each coroutine as a separate task that runs independently of other tasks and therefore in parallel. For example, a sequential (but still async) version of process_urls would look like this:
async def process_urls(url_list):
    for table in url_list:
        await process_user_list(table)
This is async because it is running inside an event loop, and you could run several such functions in parallel (which we'll show how to do shortly), but it's also sequential because it chooses to await each invocation of process_user_list. At each loop iteration the await explicitly instructs asyncio to suspend execution of process_urls until the result of process_user_list is available.
What we want instead is to tell asyncio to run all invocations of process_user_list in parallel, and to suspend execution of process_urls until they're all done. The basic primitive to spawn a coroutine in the "background" is to schedule it as a task using asyncio.create_task, which is the closest async equivalent of a light-weight thread. Using create_task the parallel version of process_urls would look like this:
async def process_urls(url_list):
    # spawn a task for each table
    tasks = []
    for table in url_list:
        task = asyncio.create_task(process_user_list(table))
        tasks.append(task)
    # The tasks are now all spawned, so awaiting one task lets
    # them all run.
    for task in tasks:
        await task
At first glance the second loop looks like it awaits tasks in sequence like the previous version, but this is not the case. Since each await suspends to the event loop, awaiting any task allows all tasks to progress, as long as they were scheduled beforehand using create_task(). The total waiting time will be no longer than the time of the longest task, regardless of the order in which they finish.
This pattern is used so often that asyncio has a dedicated utility function for it, asyncio.gather. Using this function the same code can be expressed in a much shorter version:
async def process_urls(url_list):
    coros = [process_user_list(table) for table in url_list]
    await asyncio.gather(*coros)
But there is another thing to take care of: since process_user_list will get HTML from the server and there will be many instances of it running in parallel, we cannot allow it to hammer the server with hundreds of simultaneous connections. We could create a pool of worker tasks and some sort of queue, but asyncio offers a more elegant solution: the semaphore. A semaphore is a synchronization device that doesn't allow more than a pre-determined number of activations in parallel, making the rest wait in line.
The final version of process_urls creates a semaphore and just passes it down. It doesn't activate the semaphore because process_urls doesn't actually fetch any HTML itself, so there is no reason for it to hold a semaphore slot while process_user_lists are running.
async def process_urls(url_list):
    limit = asyncio.Semaphore(5)
    coros = [process_user_list(table, limit) for table in url_list]
    await asyncio.gather(*coros)
process_user_list looks similar, but it does need to activate the semaphore using async with:
async def process_user_list(table, limit):
    async with limit:
        # Get HTML using aiohttp, extract user_list

    # Execute process_user in parallel, but do so outside the `async with`
    # because process_user will also need the semaphore, and we don't need
    # it any more since we're done with fetching HTML.
    coros = [process_user(user, limit) for user in user_list]
    await asyncio.gather(*coros)
process_user and process_subcat are more of the same:
async def process_user(user, limit):
    async with limit:
        # Get HTML, extract user_subcat_list
    coros = [process_subcat(subcat, limit) for subcat in user_subcat_list]
    await asyncio.gather(*coros)

async def process_subcat(subcat, limit):
    async with limit:
        # get HTML, extract content
        # do something with content

if __name__ == '__main__':
    asyncio.run(process_urls(my_url_list))
In practice you will probably want the async functions to share the same aiohttp session, so you'd probably create it in the top-level function (process_urls in your case) and pass it down along with the semaphore. Each function that fetches HTML would have another async with for the aiohttp request/response, such as:
async with limit:
    async with session.get(url, params...) as resp:
        # get HTML data here
        resp.raise_for_status()
        resp = await resp.read()
# extract content from HTML data here
The two async withs can be collapsed into one, reducing the indentation but keeping the same meaning:
async with limit, session.get(url, params...) as resp:
    # get HTML data here
    resp.raise_for_status()
    resp = await resp.read()
# extract content from HTML data here
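For concreteness, here is a minimal self-contained sketch of that structure, with the session and the semaphore both created at the top level and passed down (fetch_one and the placeholder URLs are illustrative, not part of the code above):

import asyncio
import aiohttp

url_list = ['https://example.com/', 'https://example.org/']  # placeholders

async def fetch_one(url, session, limit):
    # Hold a semaphore slot only for the duration of the network request.
    async with limit, session.get(url) as resp:
        resp.raise_for_status()
        return await resp.read()

async def process_urls(url_list):
    # One shared session and one shared semaphore for the whole crawl.
    limit = asyncio.Semaphore(5)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*[fetch_one(url, session, limit) for url in url_list])

if __name__ == '__main__':
    pages = asyncio.run(process_urls(url_list))
    print([len(page) for page in pages])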
I don't fully understand how asyncio and aiohttp work yet.
I am trying to make a bunch of asynchronous API requests from a list of URLs and save the responses in a variable so I can process them later.
So far I am generating the list, which is no problem, and setting up the request framework.
urls = []
for i in range(0, 20):
    urls.append('https://api.binance.com/api/v1/klines?symbol={}&interval={}&limit={}'.format(
        pairs_list_pairs[i], time_period, pull_limit))
import asyncio
import aiohttp

async def request(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.text()

async def main():
    results = await asyncio.gather(
        request(urls[0]),
        request(urls[1]),
    )
    print(len(results))
    print(results)

loop = asyncio.get_event_loop()
try:
    loop.run_until_complete(main())
    loop.run_until_complete(loop.shutdown_asyncgens())
finally:
    loop.close()
If I manually type out my requests one by one using indexing (like below), I can make the requests. But the problem is that my list has upwards of 100 API requests that I don't want to type by hand. How can I iterate through my list? Also, how can I save my results into a variable? When the script ends, it does not save "results" anywhere.
async def main():
    results = await asyncio.gather(
        request(urls[0]),
        request(urls[1]),
    )
    print(len(results))
    print(results)
Below are some sample urls to replicate the code:
[
    'https://api.binance.com/api/v1/klines?symbol=ETHBTC&interval=15m&limit=1',
    'https://api.binance.com/api/v1/klines?symbol=LTCBTC&interval=15m&limit=1',
    'https://api.binance.com/api/v1/klines?symbol=BNBBTC&interval=15m&limit=1',
    'https://api.binance.com/api/v1/klines?symbol=NEOBTC&interval=15m&limit=1',
]
To pass a variable number of arguments to gather, use the * function argument syntax:
results = await asyncio.gather(*[request(u) for u in urls])
Note that f(*args) is a standard Python feature to invoke f with positional arguments calculated at run-time.
results will be available once all requests are done, and they will be in a list in the same order as the URLs. Then you can return them from main, which will cause them to be returned by run_until_complete.
Also, you will have much better performance if you create the session only once, and reuse it for all requests, e.g. by passing it as a second argument to the request function.
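Putting those two suggestions together, a minimal version could look like this (with the URL list shortened for illustration):

import asyncio
import aiohttp

urls = [
    'https://api.binance.com/api/v1/klines?symbol=ETHBTC&interval=15m&limit=1',
    'https://api.binance.com/api/v1/klines?symbol=LTCBTC&interval=15m&limit=1',
]

async def request(url, session):
    async with session.get(url) as resp:
        return await resp.text()

async def main():
    # Create the session once and reuse it for every request.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*[request(u, session) for u in urls])

loop = asyncio.get_event_loop()
results = loop.run_until_complete(main())
loop.close()
print(len(results))
print(results)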
Using gather and a helper function (request) only makes a quite simple task more complicated and difficult to work with. You can simply use the same ClientSession throughout all your individual requests with a loop, saving each response into a resultant list.
async def main():
    results = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            async with session.get(url) as resp:
                results.append(await resp.text())
    print(len(results))
    print(results)
For the other part of your question, when you said:
When the script ends it does not save "results" anywhere.
if you meant that you want to access results outside of the main coroutine, you can simply add a return statement.
At the end of main, add:
return results
and change
loop.run_until_complete(main())
# into:
results = loop.run_until_complete(main())
I have a python discord bot built with discord.py, meaning the entire program runs inside an event loop.
The function I'm working on involves making several hundred HTTP requests and adding the results to a final list. It takes about two minutes to do these in order, so I'm using aiohttp to make them asynchronous. The related parts of my code are identical to the quickstart example in the aiohttp docs, but it's throwing a RuntimeError: Session is closed. The methodology was taken from an example at https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html under 'Fetch multiple URLs'.
async def searchPostList(postUrls, searchString):
    futures = []
    async with aiohttp.ClientSession() as session:
        for url in postUrls:
            task = asyncio.ensure_future(searchPost(url, searchString, session))
            futures.append(task)
    return await asyncio.gather(*futures)

async def searchPost(url, searchString, session):
    async with session.get(url) as response:
        page = await response.text()
    # Assorted verification and parsing
    return data
I don't know why this error turns up since my code is so similar to two presumably functional examples. The event loop itself is working fine. It runs forever, since this is a bot application.
In the example you linked, the gathering of results was within the async with block. If you do it outside, there's no guarantee that the session won't close before the requests are even made!
Moving your return statement inside the block should work:
async with aiohttp.ClientSession() as session:
    for url in postUrls:
        task = asyncio.ensure_future(searchPost(url, searchString, session))
        futures.append(task)

    return await asyncio.gather(*futures)