Asyncio aiohttp populate list of HTML from list of URL

Asyncio aiohttp populate list of HTML from list of URL - python

I'm trying to get a list of HTML source code from a list of URLs asynchronously with asyncio and aiohttp but I get 2 Exceptions:
TypeError('An asyncio.Future, a coroutine or an awaitable is ' TypeError: An asyncio.Future, a coroutine or an awaitable is required Exception ignored in: <function _ProactorBasePipeTransport.__del__ at 0x000001C92141F310>
raise RuntimeError('Event loop is closed') RuntimeError: Event loop is closed
This is my code:
import asyncio
import aiohttpasync
def main():
tasks = []
html_list = []
async with aiohttp.ClientSession() as session:
for url in ['http://www.apple.com', 'http://www.google.cl']:
async with session.get(url) as resp:
tasks.append(html_list.append(await resp.read()))
print(html_list[0])
print(html_list[1])
await(asyncio.wait(tasks))
if __name__ == '__main__':
asyncio.run(main())
This code gets to print the html codes of ['http://www.apple.com', 'http://www.google.cl'] but immediatly after I get the aforementioned Exceptions.

There are a few things that aren't quite right with your example:
You're not running anything concurrently. When you await resp.read() in your for loop, this blocks main until the result comes back. Perhaps this is what you want, but you need to use asyncio.create_task to run your requests concurrently.
As pointed out, you don't need the task array at all because of point number one. You can just append to html_list.
You don't need to call asynio.wait because you're not awaiting any tasks or coroutines at this point.
You can resolve your direct issues by what you have done in the comments, but a version that runs concurrently looks like this:
import asyncio
import aiohttp
async def read_url(session, url):
async with session.get(url) as resp:
return await resp.read()
async def main():
tasks = []
async with aiohttp.ClientSession() as session:
for url in ['http://www.apple.com', 'http://www.google.cl']:
tasks.append(asyncio.create_task(read_url(session, url)))
html_list = await asyncio.gather(*tasks)
print(html_list[0])
print(html_list[1])
if __name__ == '__main__':
asyncio.run(main())
Here we define a coroutine read_url which gets the contents of a single url. Then in the loop, you create a task for reading the url and append it to the tasks list. Then you use asyncio.gather - this will wait for all tasks to finish concurrently.
As written, I'm unable to reproduce your RuntimeError: Event loop is closed error.

Related

python for each run async function without await and parallel

I have 10 links in my CSV which I'm trying to run all at the same time in a loop from getTasks function. However, the way it's working now, it send a request to link 1, waits for it to complete, then link 2, etc, etc. I want the 10 links that I have to run all whenever startTask is called, leading to 10 requests a second.
Anyone know how to code that using the code below? Thanks in advance.
import requests
from bs4 import BeautifulSoup
import asyncio
def getTasks(tasks):
for task in tasks:
asyncio.run(startTask(task))
async def startTask(task):
success = await getProduct(task)
if success is None:
return startTask(task)
success = await addToCart(task)
if success is None:
return startTask(task)
...
...
...
getTasks(tasks)

First of all, to make your requests sent concurrently, you should use the aiohttp instead of the requests package that blocks I/O. And use the asyncio's semaphore to limit the count of concurrent processes at the same time.
import asyncio
import aiohttp
# read links from CSV
links = [
...
]
semaphore = asyncio.BoundedSemaphore(10)
# 10 is the max count of concurrent tasks
# that can be processed at the same time.
# In this case, tasks are requests.
async def async_request(url):
async with aiohttp.ClientSession() as session:
async with semaphore, session.get(url) as response:
return await response.text()
async def main():
result = await asyncio.gather(*[
async_request(link) for link in links
])
print(result) # [response1, response2, ...]
if __name__ == "__main__":
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()

Where to put BeautifulSoup code in Asyncio Web Scraping Application

I need to scrape and get the raw text of the body paragraphs for many (5-10k per day) news articles. I've written some threading code, but given the highly I/O bound nature of this project I am dabbling in asyncio. The code snippet below is not any faster than a 1-threaded version, and far worse than my threaded version. Could anyone tell me what I am doing wrong? Thank you!
async def fetch(session,url):
async with session.get(url) as response:
return await response.text()
async def scrape_urls(urls):
results = []
tasks = []
async with aiohttp.ClientSession() as session:
for url in urls:
html = await fetch(session,url)
soup = BeautifulSoup(html,'html.parser')
body = soup.find('div', attrs={'class':'entry-content'})
paras = [normalize('NFKD',para.get_text()) for para in body.find_all('p')]
results.append(paras)
return results

await means "wait until the result is ready", so when you await the fetching in each loop iteration, you request (and get) sequential execution. To parallelize fetching, you need to spawn each fetch into a background tasks using something like asyncio.create_task(fetch(...)), and then await them, similar to how you'd do it with threads. Or even more simply, you can let the asyncio.gather convenience function do it for you. For example (untested):
async def fetch(session, url):
async with session.get(url) as response:
return await response.text()
def parse(html):
soup = BeautifulSoup(html,'html.parser')
body = soup.find('div', attrs={'class':'entry-content'})
return [normalize('NFKD',para.get_text())
for para in body.find_all('p')]
async def fetch_and_parse(session, url):
html = await fetch(session, url)
paras = parse(html)
return paras
async def scrape_urls(urls):
async with aiohttp.ClientSession() as session:
return await asyncio.gather(
*(fetch_and_parse(session, url) for url in urls)
)
If you find that this still runs slower than the multi-threaded version, it is possible that the parsing of HTML is slowing down the IO-related work. (Asyncio runs everything in a single thread by default.) To prevent CPU-bound code from interfering with asyncio, you can move the parsing to a separate thread using run_in_executor:
async def fetch_and_parse(session, url):
html = await fetch(session, url)
loop = asyncio.get_event_loop()
# run parse(html) in a separate thread, and
# resume this coroutine when it completes
paras = await loop.run_in_executor(None, parse, html)
return paras
Note that run_in_executor must be awaited because it returns an awaitable that is "woken up" when the background thread completes the given assignment. As this version uses asyncio for IO and threads for parsing, it should run about as fast as your threaded version, but scale to a much larger number of parallel downloads.
Finally, if you want the parsing to run actually in parallel, using multiple cores, you can use multi-processing instead:
_pool = concurrent.futures.ProcessPoolExecutor()
async def fetch_and_parse(session, url):
html = await fetch(session, url)
loop = asyncio.get_event_loop()
# run parse(html) in a separate process, and
# resume this coroutine when it completes
paras = await loop.run_in_executor(pool, parse, html)
return paras

How to run DB requests asynchronously?

When I run this it lists off the websites in the database one by one with the response code and it takes about 10 seconds to run through a very small list. It should be way faster and isn't running asynchronously but I'm not sure why.
import dblogin
import aiohttp
import asyncio
import async_timeout
dbconn = dblogin.connect()
dbcursor = dbconn.cursor(buffered=True)
dbcursor.execute("SELECT thistable FROM adatabase")
website_list = dbcursor.fetchall()
async def fetch(session, url):
with async_timeout.timeout(30):
async with session.get(url, ssl=False) as response:
await response.read()
return response.status, url
async def main():
async with aiohttp.ClientSession() as session:
for all_urls in website_list:
url = all_urls[0]
resp = await fetch(session, url)
print(resp, url)
if __name__ == '__main__':
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()
dbcursor.close()
dbconn.close()

This article explains the details. What you need to do is pass each fetch call in a Future object, and then pass a list of those to either asyncio.wait or asyncio.gather depending on your needs.
Your code would look something like this:
async def fetch(session, url):
with async_timeout.timeout(30):
async with session.get(url, ssl=False) as response:
await response.read()
return response.status, url
async def main():
tasks = []
async with aiohttp.ClientSession() as session:
for all_urls in website_list:
url = all_urls[0]
task = asyncio.create_task(fetch(session, url))
tasks.append(task)
responses = await asyncio.gather(*tasks)
if __name__ == '__main__':
loop = asyncio.get_event_loop()
future = asyncio.create_task(main())
loop.run_until_complete(future)
Also, are you sure that loop.close() call is needed? The docs mention that
The loop must not be running when this function is called. Any pending callbacks will be discarded.
This method clears all queues and shuts down the executor, but does not wait for the executor to finish.
As mentioned in the docs and in the link that #user4815162342 posted, it is better to use the create_task method instead of the ensure_future method when we know that the argument is a coroutine. Note that this was added in Python 3.7, so previous versions should continue using ensure_future instead.

Scraping content using pyppeteer in association with asyncio

I've written a script in python in combination with pyppeteer along with asyncio to scrape the links of different posts from its landing page and eventually get the title of each post by tracking the url leading to its inner page. The content I parsed here are not dynamic ones. However, I made use of pyppeteer and asyncio to see how efficiently it performs asynchronously.
The following script goes well for some moments but then enounters an error:
File "C:\Users\asyncio\tasks.py", line 526, in ensure_future
raise TypeError('An asyncio.Future, a coroutine or an awaitable is '
TypeError: An asyncio.Future, a coroutine or an awaitable is required
This is what I've wriiten so far:
import asyncio
from pyppeteer import launch
link = "https://stackoverflow.com/questions/tagged/web-scraping"
async def fetch(page,url):
await page.goto(url)
linkstorage = []
elements = await page.querySelectorAll('.summary .question-hyperlink')
for element in elements:
linkstorage.append(await page.evaluate('(element) => element.href', element))
tasks = [await browse_all_links(link, page) for link in linkstorage]
results = await asyncio.gather(*tasks)
return results
async def browse_all_links(link, page):
await page.goto(link)
title = await page.querySelectorEval('.question-hyperlink','(e => e.innerText)')
print(title)
async def main(url):
browser = await launch(headless=True,autoClose=False)
page = await browser.newPage()
await fetch(page,url)
if __name__ == '__main__':
loop = asyncio.get_event_loop()
future = asyncio.ensure_future(main(link))
loop.run_until_complete(future)
loop.close()
My question: how can I get rid of that error and do the doing asynchronously?

The problem is in the following lines:
tasks = [await browse_all_links(link, page) for link in linkstorage]
results = await asyncio.gather(*tasks)
The intention is for tasks to be a list of awaitable objects, such as coroutine objects or futures. The list is to be passed to gather, so that the awaitables can run in parallel until they all complete. However, the list comprehension contains an await, which means that it:
executes each browser_all_links to completion in series rather than in parallel;
places the return values of browse_all_links invocations into the list.
Since browse_all_links doesn't return a value, you are passing a list of None objects to asyncio.gather, which complains that it didn't get an awaitable object.
To resolve the issue, just drop the await from the list comprehension.

Python loop through list to get api call in asyncio and save results

I don't fully understand how asyncio and aiohttp work yet.
I am trying to make a bunch of asynchronous api requests from a list of urls and save them as a variable so I can processes them later.
so far I am generating the list which is no problem and setting up the request framework.
urls = []
for i in range(0,20):
urls.append('https://api.binance.com/api/v1/klines?symbol={}&interval=
{}&limit={}'.format(pairs_list_pairs[i],time_period,
pull_limit))
import asyncio
import aiohttp
async def request(url):
async with aiohttp.ClientSession() as session:
async with session.get(url) as resp:
return await resp.text()
async def main():
results = await asyncio.gather(
request(urls[0]),
request(urls[1]),
)
print(len(results))
print(results)
loop = asyncio.get_event_loop()
try:
loop.run_until_complete(main())
loop.run_until_complete(loop.shutdown_asyncgens())
finally:
loop.close()
If I manually type out my requests one by one using indexing (like below), I can make the request. But the problem is that my list has upwards of 100 apis requests that I don't want to type by hand. How can I iterate through my list? Also how can I save my results into a variable? When the script ends it does not save "results" anywhere.
async def main():
results = await asyncio.gather(
request(urls[0]),
request(urls[1]),
)
print(len(results))
print(results)
Below are some sample urls to replicate the code:
[
'https://api.binance.com/api/v1/klines?symbol=ETHBTC&interval=15m&limit=1',
'https://api.binance.com/api/v1/klines?symbol=LTCBTC&interval=15m&limit=1',
'https://api.binance.com/api/v1/klines?symbol=BNBBTC&interval=15m&limit=1',
'https://api.binance.com/api/v1/klines?symbol=NEOBTC&interval=15m&limit=1',
]

To pass a variable number of arguments to gather, use the * function argument syntax:
results = await asyncio.gather(*[request(u) for u in urls])
Note that f(*args) is a standard Python feature to invoke f with positional arguments calculated at run-time.
results will be available once all requests are done, and they will be in a list in the same order as the URLs. Then you can return them from main, which will cause them to be returned by run_until_complete.
Also, you will have much better performance if you create the session only once, and reuse it for all requests, e.g. by passing it as a second argument to the request function.

Using gather and a helper function (request) are only making a quite simple task more complicated and difficult to work with. You can simply use the same ClientSession throughout all your individual requests with a loop whilst saving each response into a resultant list.
async def main():
results = []
async with aiohttp.ClientSession() as session:
for url in urls:
async with session.get(url) as resp:
results.append(await resp.text())
print(len(results))
print(results)
For the other part of your question, when you said:
When the script ends it does not save "results" anywhere.
if you meant that you want to access results outside of the main coroutine, you simply can add a return statement.
At the end of main, add:
return results
and change
loop.run_until_complete(main())
# into:
results = loop.run_until_complete(main())

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Asyncio aiohttp populate list of HTML from list of URL - python

Related

python for each run async function without await and parallel

Where to put BeautifulSoup code in Asyncio Web Scraping Application

How to run DB requests asynchronously?

Scraping content using pyppeteer in association with asyncio

Python loop through list to get api call in asyncio and save results

Categories

Resources