Parallel async tasks emitted by endless generator - python

I have a code that very close to this:
class Parser:
async def fetch(self, url):
html = await get(url)
return html
#property
def endless_generator(self):
while True:
yield url
async def loop():
htmls = []
for url in self.endless_generator:
htmls.append(await self.fetch(url))
return htmls
def run(self):
loop = asyncio.get_event_loop()
try:
htmls = loop.run_until_complete(self.loop())
finally:
loop.close()
parser = Parser()
parser.run()
Now Parser.loop run synchronously.
I've tried asyncio.wait and asyncio.gather to achieve async invocations of Parser.fetch, but I don't know the number of URLs in advance (because URLs yielding by endless generator).
So, how do I get asynchronous calls if the number of tasks is not known in advance?

I've tried asyncio.wait and asyncio.gather to achieve async invocations of Parser.fetch, but I don't know the number of URLs in advance (because URLs yielding by endless generator).
I assume that by endless generator you mean a generator whose number of URLs is not known in advance, rather than a truly endless generator (generating an infinite list). Here is a version that creates a task as soon as a URL is available, and gathers the results as they arrive:
async def loop():
lp = asyncio.get_event_loop()
tasks = set()
result = {}
any_done = asyncio.Event()
def _task_done(t):
tasks.remove(t)
any_done.set()
result[t.fetch_url] = t.result()
for url in self.endless_generator:
new_task = lp.create_task(self.fetch(url))
new_task.fetch_url = url
tasks.add(new_task)
new_task.add_done_callback(_task_done)
await any_done.wait()
any_done.clear()
while tasks:
await any_done.wait()
any_done.clear()
return result # mapping url -> html
One cannot simply call gather or wait in each iteration because that would wait for all the existing tasks to finish before queuing a new one. wait(return_when=FIRST_COMPLETED) could work, but it would be O(n**2) in the number of tasks because it would set up its own callback each time anew.

Related

Running blocking function (f.e. requests) concurrently but asynchronous with Python

There is a function that blocks event loop (f.e. that function makes an API request). I need to make continuous stream of requests which will run in parallel but not synchronous. So every next request will be started before the previous request will be finished.
So I found this solved question with the loop.run_in_executer() solution and use it in the beginning:
import asyncio
import requests
#blocking_request_func() defined somewhere
async def main():
loop = asyncio.get_event_loop()
future1 = loop.run_in_executor(None, blocking_request_func, 'param')
future2 = loop.run_in_executor(None, blocking_request_func, 'param')
response1 = await future1
response2 = await future2
print(response1)
print(response2)
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
this works well, requests run in parallel but there is a problem for my task - in this example we make group of tasks/futures in the beginning and then run this group synchronous. But I need something like this:
1. Sending request_1 and not awaiting when it's done.
(AFTER step 1 but NOT in the same time when step 1 starts):
2. Sending request_2 and not awaiting when it's done.
(AFTER step 2 but NOT in the same time when step 2 starts):
3. Sending request_3 and not awaiting when it's done.
(Request 1(or any other) gives the response)
(AFTER step 3 but NOT in the same time when step 3 starts):
4. Sending request_4 and not awaiting when it's done.
(Request 2(or any other) gives the response)
and so on...
I tried using asyncio.TaskGroup():
async def request_func():
global result #the list of results of requests defined somewhere in global area
loop = asyncio.get_event_loop()
result.append(await loop.run_in_executor(None, blocking_request_func, 'param')
await asyncio.sleep(0) #adding or removing this line gives the same result
async def main():
async with asyncio.TaskGroup() as tg:
for i in range(0, 10):
tg.create_task(request_func())
all these things gave the same result: first of all we defined group of tasks/futures and only then run this group synchronous and concurrently. But is there a way to run all these requests concurrently but "in the stream"?
I tried to make visualization if my explanation is not clear enough.
What I have for now
What I need
================ Update with the answer ===================
The most close answer however with some limitations:
import asyncio
import random
import time
def blockme(n):
x = random.random() * 2.0
time.sleep(x)
return n, x
def cb(fut):
print("Result", fut.result())
async def main():
#You need to control threads quantity
pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)
loop = asyncio.get_event_loop()
futs = []
#You need to control requests per second
delay = 0.5
while await asyncio.sleep(delay, result=True):
fut = loop.run_in_executor(pool, blockme, n)
fut.add_done_callback(cb)
futs.append(fut)
#You need to control futures quantity, f.e. like this:
if len(futs)>40:
completed, futs = await asyncio.wait(futs,
timeout=5,
return_when=FIRST_COMPLETED)
asyncio.run(main())
I think this might be what you want. You don't have to await each request - the run_in_executor function returns a Future. Instead of awaiting that, you can attach a callback function:
import asyncio
import random
import time
def blockme(n):
x = random.random() * 2.0
time.sleep(x)
return n, x
def cb(fut):
print("Result", fut.result())
async def main():
loop = asyncio.get_event_loop()
futs = []
for n in range(20):
fut = loop.run_in_executor(None, blockme, n)
fut.add_done_callback(cb)
futs.append(fut)
await asyncio.gather(*futs)
# await asyncio.sleep(10)
asyncio.run(main())
All the requests are started at the beginning, but they don't all execute in parallel because the number of threads is limited by the ThreadPool. You can adjust the number of threads if you want.
Here I simulated a blocking call with time.sleep. I needed a way to prevent main() from ending before all the callbacks occurred, so I used gather for that purpose. You can also wait for some length of time, but gather is cleaner.
Apologies if I don't understand what you want. But I think you want to avoid using await for each call, and I tried to show one way you can do that.
This is directly referenced from Python documentation. The code snippet from documentation of asyncio library explains how you can run a blocking code concurrently using asyncio. It uses to_thread method to create task
you can find more here - https://docs.python.org/3/library/asyncio-task.html#running-in-threads
def blocking_io():
print(f"start blocking_io at {time.strftime('%X')}")
# Note that time.sleep() can be replaced with any blocking
# IO-bound operation, such as file operations.
time.sleep(1)
print(f"blocking_io complete at {time.strftime('%X')}")
async def main():
print(f"started main at {time.strftime('%X')}")
await asyncio.gather(
asyncio.to_thread(blocking_io),
asyncio.sleep(1))
print(f"finished main at {time.strftime('%X')}")
asyncio.run(main())

How to chain coroutines in asyncio when fetch nested urls

I'm currently designing a spider to crawl a specific website. I can do it synchronous but I'm trying to get my head around asyncio to make it as efficient as possible. I've tried a lot of different approaches, with yield, chained functions and queues but I can't make it work.
I'm most interested in the design part and logic to solve the problem. Not necessary runnable code, rather highlight the most important aspects of assyncio. I can't post any code, because my attempts are not worth sharing.
The mission:
The exemple.com (I know, it should be example.com) got the following design:
In synchronous manner the logic would be like this:
for table in my_url_list:
# Get HTML
# Extract urls from HTML to user_list
for user in user_list:
# Get HTML
# Extract urls from HTML to user_subcat_list
for subcat in user_subcat_list:
# extract content
But now I would like to scrape the site asynchronous. Lets say we using 5 instances (tabs in pyppeteer or requests in aiohttp) to parse the content. How should we design it to make it most efficient and what asyncio syntax should we use?
Update
Thanks to #user4815162342 who solved my problem. I've been playing around with his solution and I post runnable code below if someone else want to play around with asyncio.
import asyncio
import random
my_url_list = ['exemple.com/table1', 'exemple.com/table2', 'exemple.com/table3']
# Random sleeps to simulate requests to the server
async def randsleep(caller=None):
i = random.randint(1, 6)
if caller:
print(f"Request HTML for {caller} sleeping for {i} seconds.")
await asyncio.sleep(i)
async def process_urls(url_list):
print(f'async def process_urls: added {url_list}')
limit = asyncio.Semaphore(5)
coros = [process_user_list(table, limit) for table in url_list]
await asyncio.gather(*coros)
async def process_user_list(table, limit):
async with limit:
# Simulate HTML request and extracting urls to populate user_list
await randsleep(table)
if table[-1] == '1':
user_list = ['exemple.com/user1', 'exemple.com/user2', 'exemple.com/user3']
elif table[-1] == '2':
user_list = ['exemple.com/user4', 'exemple.com/user5', 'exemple.com/user6']
else:
user_list = ['exemple.com/user7', 'exemple.com/user8', 'exemple.com/user9']
print(f'async def process_user_list: Extracted {user_list} from {table}')
# Execute process_user in parallel, but do so outside the `async with`
# because process_user will also need the semaphore, and we don't need
# it any more since we're done with fetching HTML.
coros = [process_user(user, limit) for user in user_list]
await asyncio.gather(*coros)
async def process_user(user, limit):
async with limit:
# Simulate HTML request and extracting urls to populate user_subcat_list
await randsleep(user)
user_subcat_list = [user + '/profile', user + '/info', user + '/followers']
print(f'async def process_user: Extracted {user_subcat_list} from {user}')
coros = [process_subcat(subcat, limit) for subcat in user_subcat_list]
await asyncio.gather(*coros)
async def process_subcat(subcat, limit):
async with limit:
# Simulate HTML request and extracting content
await randsleep(subcat)
print(f'async def process_subcat: Extracted content from {subcat}')
if __name__ == '__main__':
asyncio.run(process_urls(my_url_list))
Let's restructure the sync code so that each piece that can access the network is in a separate function. The functionality is unchanged, but it will make things easier later:
def process_urls(url_list):
for table in url_list:
process_user_list(table)
def process_user_list(table):
# Get HTML, extract user_list
for user in user_list:
process_user(user)
def process_user(user):
# Get HTML, extract user_subcat_list
for subcat in user_subcat_list:
process_subcat(subcat)
def process_subcat(subcat):
# get HTML, extract content
if __name__ == '__main__':
process_urls(my_url_list)
Assuming that the order of processing doesn't matter, we'd like the async version to run all the functions that are now called in for loops in parallel. They'll still run on a single thread, but they will await anything that might block, allowing the event loop to parallelize the waiting and drive them to completion by resuming each coroutine whenever it is ready to proceed. This is achieved by spawning each coroutine as a separate task that runs independent of other tasks and therefore in parallel. For example, a sequential (but still async) version of process_urls would look like this:
async def process_urls(url_list):
for table in url_list:
await process_user_list(table)
This is async because it is running inside an event loop, and you could run several such functions in parallel (which we'll show how to do shortly), but it's also sequential because it chooses to await each invocation of process_user_list. At each loop iteration the await explicitly instructs asyncio to suspend execution of process_urls until the result of process_user_list is available.
What we want instead is to tell asyncio to run all invocations of process_user_list in parallel, and to suspend execution of process_urls until they're all done. The basic primitive to spawn a coroutine in the "background" is to schedule it as a task using asyncio.create_task, which is the closest async equivalent of a light-weight thread. Using create_task the parallel version of process_urls would look like this:
async def process_urls(url_list):
# spawn a task for each table
tasks = []
for table in url_list:
asyncio.create_task(process_user_list(table))
tasks.append(task)
# The tasks are now all spawned, so awaiting one task lets
# them all run.
for task in tasks:
await task
At first glance the second loop looks like it awaits tasks in sequence like the previous version, but this is not the case. Since each await suspends to the event loop, awaiting any task allows all tasks to progress, as long as they were scheduled beforehand using create_task(). The total waiting time will be no longer than the time of the longest task, regardless of the order in which they finish.
This pattern is used so often that asyncio has a dedicated utility function for it, asyncio.gather. Using this function the same code can be expressed in a much shorter version:
async def process_urls(url_list):
coros = [process_user_list(table) for table in url_list]
await asyncio.gather(*coros)
But there is another thing to take care of: since process_user_list will get HTML from the server and there will be many instances of it running in parallel, and we cannot allow it to hammer the server with hundreds of simultaneous connections. We could create a pool of worker tasks and some sort of queue, but asyncio offers a more elegant solution: the semaphore. Semaphore is a synchronization device that doesn't allow more than a pre-determined number of activations in parallel, making the rest wait in line.
The final version of process_urls creates a semaphore and just passes it down. It doesn't activate the semaphore because process_urls doesn't actually fetch any HTML itself, so there is no reason for it to hold a semaphore slot while process_user_lists are running.
async def process_urls(url_list):
limit = asyncio.Semaphore(5)
coros = [process_user_list(table, limit) for table in url_list]
await asyncio.gather(*coros)
process_user_list looks similar, but it does need to activate the semaphore using async with:
async def process_user_list(table, limit):
async with limit:
# Get HTML using aiohttp, extract user_list
# Execute process_user in parallel, but do so outside the `async with`
# because process_user will also need the semaphore, and we don't need
# it any more since we're done with fetching HTML.
coros = [process_user(user, limit) for user in user_list]
await asyncio.gather(*coros)
process_user and process_subcat are more of the same:
async def process_user(user, limit):
async with limit:
# Get HTML, extract user_subcat_list
coros = [process_subcat(subcat, limit) for subcat in user_subcat_list]
await asyncio.gather(*coros)
def process_subcat(subcat, limit):
async with limit:
# get HTML, extract content
# do something with content
if __name__ == '__main__':
asyncio.run(process_urls(my_url_list))
In practice you will probably want the async functions to share the same aiohttp session, so you'd probably create it in the top-level function (process_urls in your case) and pass it down along with the semaphore. Each function that fetches HTML would have another async with for the aiohttp request/response, such as:
async with limit:
async with session.get(url, params...) as resp:
# get HTML data here
resp.raise_for_status()
resp = await resp.read()
# extract content from HTML data here
The two async withs can be collapsed into one, reducing the indentation but keeping the same meaning:
async with limit, session.get(url, params...) as resp:
# get HTML data here
resp.raise_for_status()
resp = await resp.read()
# extract content from HTML data here

Where to put BeautifulSoup code in Asyncio Web Scraping Application

I need to scrape and get the raw text of the body paragraphs for many (5-10k per day) news articles. I've written some threading code, but given the highly I/O bound nature of this project I am dabbling in asyncio. The code snippet below is not any faster than a 1-threaded version, and far worse than my threaded version. Could anyone tell me what I am doing wrong? Thank you!
async def fetch(session,url):
async with session.get(url) as response:
return await response.text()
async def scrape_urls(urls):
results = []
tasks = []
async with aiohttp.ClientSession() as session:
for url in urls:
html = await fetch(session,url)
soup = BeautifulSoup(html,'html.parser')
body = soup.find('div', attrs={'class':'entry-content'})
paras = [normalize('NFKD',para.get_text()) for para in body.find_all('p')]
results.append(paras)
return results
await means "wait until the result is ready", so when you await the fetching in each loop iteration, you request (and get) sequential execution. To parallelize fetching, you need to spawn each fetch into a background tasks using something like asyncio.create_task(fetch(...)), and then await them, similar to how you'd do it with threads. Or even more simply, you can let the asyncio.gather convenience function do it for you. For example (untested):
async def fetch(session, url):
async with session.get(url) as response:
return await response.text()
def parse(html):
soup = BeautifulSoup(html,'html.parser')
body = soup.find('div', attrs={'class':'entry-content'})
return [normalize('NFKD',para.get_text())
for para in body.find_all('p')]
async def fetch_and_parse(session, url):
html = await fetch(session, url)
paras = parse(html)
return paras
async def scrape_urls(urls):
async with aiohttp.ClientSession() as session:
return await asyncio.gather(
*(fetch_and_parse(session, url) for url in urls)
)
If you find that this still runs slower than the multi-threaded version, it is possible that the parsing of HTML is slowing down the IO-related work. (Asyncio runs everything in a single thread by default.) To prevent CPU-bound code from interfering with asyncio, you can move the parsing to a separate thread using run_in_executor:
async def fetch_and_parse(session, url):
html = await fetch(session, url)
loop = asyncio.get_event_loop()
# run parse(html) in a separate thread, and
# resume this coroutine when it completes
paras = await loop.run_in_executor(None, parse, html)
return paras
Note that run_in_executor must be awaited because it returns an awaitable that is "woken up" when the background thread completes the given assignment. As this version uses asyncio for IO and threads for parsing, it should run about as fast as your threaded version, but scale to a much larger number of parallel downloads.
Finally, if you want the parsing to run actually in parallel, using multiple cores, you can use multi-processing instead:
_pool = concurrent.futures.ProcessPoolExecutor()
async def fetch_and_parse(session, url):
html = await fetch(session, url)
loop = asyncio.get_event_loop()
# run parse(html) in a separate process, and
# resume this coroutine when it completes
paras = await loop.run_in_executor(pool, parse, html)
return paras

Process list of async tasks immediately upon each task's completion

How can I process the results of a list of async tasks immediately as a task is completed?
For instance, the following should display whichever page loads first:
urls = ['stackoverflow.com', 'google.com']
tasks = [asyncio.create_task(fetch_page(x)) for x in urls]
for page in asyncio.give_me_results_ASAP(tasks):
print(page.url)
Since google loads faster, I'd like it to print:
google.com
stackoverflow.com
asyncio.as_completed is designed for exactly this, and returns an iterator of coroutines in the order that the tasks are completed. The first coroutine returned from the iterator will correspond to the first task that completes, and you can await on each coroutine to get the result of the task.
# With Python 3.8+
import asyncio
import time
async def fetch_page(url):
reponse_time = 0.1 if url == 'google.com' else 0.8
await asyncio.sleep(reponse_time)
return url
async def main():
urls = ['stackoverflow.com', 'google.com']
tasks = [asyncio.create_task(fetch_page(x)) for x in urls]
for coro in asyncio.as_completed(tasks):
print(f"{time.time():.3f}", await coro)
if __name__ == '__main__':
asyncio.run(main())
produces:
1600821288.178 google.com
1600821288.280 stackoverflow.com

Python loop through list to get api call in asyncio and save results

I don't fully understand how asyncio and aiohttp work yet.
I am trying to make a bunch of asynchronous api requests from a list of urls and save them as a variable so I can processes them later.
so far I am generating the list which is no problem and setting up the request framework.
urls = []
for i in range(0,20):
urls.append('https://api.binance.com/api/v1/klines?symbol={}&interval=
{}&limit={}'.format(pairs_list_pairs[i],time_period,
pull_limit))
import asyncio
import aiohttp
async def request(url):
async with aiohttp.ClientSession() as session:
async with session.get(url) as resp:
return await resp.text()
async def main():
results = await asyncio.gather(
request(urls[0]),
request(urls[1]),
)
print(len(results))
print(results)
loop = asyncio.get_event_loop()
try:
loop.run_until_complete(main())
loop.run_until_complete(loop.shutdown_asyncgens())
finally:
loop.close()
If I manually type out my requests one by one using indexing (like below), I can make the request. But the problem is that my list has upwards of 100 apis requests that I don't want to type by hand. How can I iterate through my list? Also how can I save my results into a variable? When the script ends it does not save "results" anywhere.
async def main():
results = await asyncio.gather(
request(urls[0]),
request(urls[1]),
)
print(len(results))
print(results)
Below are some sample urls to replicate the code:
[
'https://api.binance.com/api/v1/klines?symbol=ETHBTC&interval=15m&limit=1',
'https://api.binance.com/api/v1/klines?symbol=LTCBTC&interval=15m&limit=1',
'https://api.binance.com/api/v1/klines?symbol=BNBBTC&interval=15m&limit=1',
'https://api.binance.com/api/v1/klines?symbol=NEOBTC&interval=15m&limit=1',
]
To pass a variable number of arguments to gather, use the * function argument syntax:
results = await asyncio.gather(*[request(u) for u in urls])
Note that f(*args) is a standard Python feature to invoke f with positional arguments calculated at run-time.
results will be available once all requests are done, and they will be in a list in the same order as the URLs. Then you can return them from main, which will cause them to be returned by run_until_complete.
Also, you will have much better performance if you create the session only once, and reuse it for all requests, e.g. by passing it as a second argument to the request function.
Using gather and a helper function (request) are only making a quite simple task more complicated and difficult to work with. You can simply use the same ClientSession throughout all your individual requests with a loop whilst saving each response into a resultant list.
async def main():
results = []
async with aiohttp.ClientSession() as session:
for url in urls:
async with session.get(url) as resp:
results.append(await resp.text())
print(len(results))
print(results)
For the other part of your question, when you said:
When the script ends it does not save "results" anywhere.
if you meant that you want to access results outside of the main coroutine, you simply can add a return statement.
At the end of main, add:
return results
and change
loop.run_until_complete(main())
# into:
results = loop.run_until_complete(main())

Categories

Resources