Scraping content using pyppeteer in association with asyncio - python

I've written a script in Python using pyppeteer together with asyncio to scrape the links of different posts from a landing page and then get the title of each post by following the URL to its inner page. The content I'm parsing here isn't dynamic; I only used pyppeteer and asyncio to see how efficiently they perform asynchronously.
The following script runs fine for a while but then encounters an error:
File "C:\Users\asyncio\tasks.py", line 526, in ensure_future
raise TypeError('An asyncio.Future, a coroutine or an awaitable is '
TypeError: An asyncio.Future, a coroutine or an awaitable is required
This is what I've written so far:
import asyncio
from pyppeteer import launch

link = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(page, url):
    await page.goto(url)
    linkstorage = []
    elements = await page.querySelectorAll('.summary .question-hyperlink')
    for element in elements:
        linkstorage.append(await page.evaluate('(element) => element.href', element))
    tasks = [await browse_all_links(link, page) for link in linkstorage]
    results = await asyncio.gather(*tasks)
    return results

async def browse_all_links(link, page):
    await page.goto(link)
    title = await page.querySelectorEval('.question-hyperlink', '(e => e.innerText)')
    print(title)

async def main(url):
    browser = await launch(headless=True, autoClose=False)
    page = await browser.newPage()
    await fetch(page, url)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(main(link))
    loop.run_until_complete(future)
    loop.close()
My question: how can I get rid of that error and perform the scraping asynchronously?

The problem is in the following lines:
tasks = [await browse_all_links(link, page) for link in linkstorage]
results = await asyncio.gather(*tasks)
The intention is for tasks to be a list of awaitable objects, such as coroutine objects or futures. The list is to be passed to gather, so that the awaitables can run in parallel until they all complete. However, the list comprehension contains an await, which means that it:
executes each browse_all_links call to completion in series rather than in parallel;
places the return values of browse_all_links invocations into the list.
Since browse_all_links doesn't return a value, you are passing a list of None objects to asyncio.gather, which complains that it didn't get an awaitable object.
To resolve the issue, just drop the await from the list comprehension.
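For example, a minimal sketch of the fix (note that browse_all_links would also need to return the title rather than only print it if you want gather to collect useful results):

async def browse_all_links(link, page):
    await page.goto(link)
    title = await page.querySelectorEval('.question-hyperlink', '(e => e.innerText)')
    return title

# inside fetch(): no await in the comprehension, so tasks is a list of coroutine objects
tasks = [browse_all_links(link, page) for link in linkstorage]
results = await asyncio.gather(*tasks)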

Related

Asyncio aiohttp populate list of HTML from list of URL

I'm trying to get a list of HTML source code from a list of URLs asynchronously with asyncio and aiohttp but I get 2 Exceptions:
raise TypeError('An asyncio.Future, a coroutine or an awaitable is '
TypeError: An asyncio.Future, a coroutine or an awaitable is required
Exception ignored in: <function _ProactorBasePipeTransport.__del__ at 0x000001C92141F310>
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
This is my code:
import asyncio
import aiohttp

async def main():
    tasks = []
    html_list = []
    async with aiohttp.ClientSession() as session:
        for url in ['http://www.apple.com', 'http://www.google.cl']:
            async with session.get(url) as resp:
                tasks.append(html_list.append(await resp.read()))
        print(html_list[0])
        print(html_list[1])
        await(asyncio.wait(tasks))

if __name__ == '__main__':
    asyncio.run(main())
This code does print the HTML of ['http://www.apple.com', 'http://www.google.cl'], but immediately afterwards I get the aforementioned exceptions.
There are a few things that aren't quite right with your example:
You're not running anything concurrently. When you await resp.read() in your for loop, this blocks main until the result comes back. Perhaps this is what you want, but you need to use asyncio.create_task to run your requests concurrently.
As pointed out, you don't need the task array at all because of point number one. You can just append to html_list.
You don't need to call asyncio.wait because you're not awaiting any tasks or coroutines at this point.
You can resolve your immediate issues with the changes mentioned in the comments, but a version that actually runs concurrently looks like this:
import asyncio
import aiohttp

async def read_url(session, url):
    async with session.get(url) as resp:
        return await resp.read()

async def main():
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in ['http://www.apple.com', 'http://www.google.cl']:
            tasks.append(asyncio.create_task(read_url(session, url)))
        html_list = await asyncio.gather(*tasks)
        print(html_list[0])
        print(html_list[1])

if __name__ == '__main__':
    asyncio.run(main())
Here we define a coroutine read_url which gets the contents of a single url. Then in the loop, you create a task for reading the url and append it to the tasks list. Then you use asyncio.gather - this will wait for all tasks to finish concurrently.
As written, I'm unable to reproduce your RuntimeError: Event loop is closed error.

Where to put BeautifulSoup code in Asyncio Web Scraping Application

I need to scrape and get the raw text of the body paragraphs for many (5-10k per day) news articles. I've written some threading code, but given the highly I/O bound nature of this project I am dabbling in asyncio. The code snippet below is not any faster than a 1-threaded version, and far worse than my threaded version. Could anyone tell me what I am doing wrong? Thank you!
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def scrape_urls(urls):
    results = []
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            html = await fetch(session, url)
            soup = BeautifulSoup(html, 'html.parser')
            body = soup.find('div', attrs={'class': 'entry-content'})
            paras = [normalize('NFKD', para.get_text()) for para in body.find_all('p')]
            results.append(paras)
    return results
await means "wait until the result is ready", so when you await the fetching in each loop iteration, you request (and get) sequential execution. To parallelize fetching, you need to spawn each fetch into a background task using something like asyncio.create_task(fetch(...)), and then await them, similar to how you'd do it with threads. Or even more simply, you can let the asyncio.gather convenience function do it for you. For example (untested):
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

def parse(html):
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find('div', attrs={'class': 'entry-content'})
    return [normalize('NFKD', para.get_text())
            for para in body.find_all('p')]

async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    paras = parse(html)
    return paras

async def scrape_urls(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(fetch_and_parse(session, url) for url in urls)
        )
If you find that this still runs slower than the multi-threaded version, it is possible that the parsing of HTML is slowing down the IO-related work. (Asyncio runs everything in a single thread by default.) To prevent CPU-bound code from interfering with asyncio, you can move the parsing to a separate thread using run_in_executor:
async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    loop = asyncio.get_event_loop()
    # run parse(html) in a separate thread, and
    # resume this coroutine when it completes
    paras = await loop.run_in_executor(None, parse, html)
    return paras
Note that run_in_executor must be awaited because it returns an awaitable that is "woken up" when the background thread completes the given assignment. As this version uses asyncio for IO and threads for parsing, it should run about as fast as your threaded version, but scale to a much larger number of parallel downloads.
Finally, if you want the parsing to run actually in parallel, using multiple cores, you can use multi-processing instead:
_pool = concurrent.futures.ProcessPoolExecutor()

async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    loop = asyncio.get_event_loop()
    # run parse(html) in a separate process, and
    # resume this coroutine when it completes
    paras = await loop.run_in_executor(_pool, parse, html)
    return paras
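For completeness, a hypothetical driver for scrape_urls could look like the sketch below (the URLs are made up purely for illustration):

import asyncio

urls = [
    'https://example.com/news/article-1',
    'https://example.com/news/article-2',
]

# run the coroutine to completion; the result is one list of paragraphs
# per URL, in the same order as urls
all_paras = asyncio.run(scrape_urls(urls))
print(len(all_paras))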

Get aiohttp results as string

I'm trying to get data from a website using async in python. As an example I used this code (under A Better Coroutine Example): https://www.blog.pythonlibrary.org/2016/07/26/python-3-an-intro-to-asyncio/
Now this works fine, but it writes the binary chunks to a file and I don't want it in a file. I want the resulting data directly. But I currently have a list of coroutine objects which I can not get the data out of.
The code:
# -*- coding: utf-8 -*-
import aiohttp
import asyncio
import async_timeout

async def fetch(session, url):
    with async_timeout.timeout(10):
        async with session.get(url) as response:
            return await response.text()

async def main(loop, urls):
    async with aiohttp.ClientSession(loop=loop) as session:
        tasks = [fetch(session, url) for url in urls]
        await asyncio.gather(*tasks)
        return tasks

# time normal way of retrieval
if __name__ == '__main__':
    urls = [a list of urls..]
    loop = asyncio.get_event_loop()
    details_async = loop.run_until_complete(main(loop, urls))
Thanks
The problem is in return tasks at the end of main(), which is not present in the original article. Instead of returning the coroutine objects (which are not useful once passed to asyncio.gather), you should be returning the list returned by asyncio.gather, which contains the results of running the coroutines in the correct order. For example:
async def main(loop, urls):
    async with aiohttp.ClientSession(loop=loop) as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        return results
Now loop.run_until_complete(main(loop, urls)) will return a list of texts in the same order as the URLs.
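As a quick usage sketch (variable names are illustrative), the returned list can then be consumed directly:

texts = loop.run_until_complete(main(loop, urls))
# pair each URL with the text that was fetched for it
for url, text in zip(urls, texts):
    print(url, len(text))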

Python loop through list to get api call in asyncio and save results

I don't fully understand how asyncio and aiohttp work yet.
I am trying to make a bunch of asynchronous API requests from a list of URLs and save the responses as a variable so I can process them later.
So far I am generating the list, which is no problem, and setting up the request framework.
urls = []
for i in range(0, 20):
    urls.append('https://api.binance.com/api/v1/klines?symbol={}&interval={}&limit={}'
                .format(pairs_list_pairs[i], time_period, pull_limit))
import asyncio
import aiohttp

async def request(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.text()

async def main():
    results = await asyncio.gather(
        request(urls[0]),
        request(urls[1]),
    )
    print(len(results))
    print(results)

loop = asyncio.get_event_loop()
try:
    loop.run_until_complete(main())
    loop.run_until_complete(loop.shutdown_asyncgens())
finally:
    loop.close()
If I manually type out my requests one by one using indexing (like below), I can make the requests. But the problem is that my list has upwards of 100 API requests that I don't want to type out by hand. How can I iterate through my list? Also, how can I save my results into a variable? When the script ends, it does not save "results" anywhere.
async def main():
    results = await asyncio.gather(
        request(urls[0]),
        request(urls[1]),
    )
    print(len(results))
    print(results)
Below are some sample urls to replicate the code:
[
    'https://api.binance.com/api/v1/klines?symbol=ETHBTC&interval=15m&limit=1',
    'https://api.binance.com/api/v1/klines?symbol=LTCBTC&interval=15m&limit=1',
    'https://api.binance.com/api/v1/klines?symbol=BNBBTC&interval=15m&limit=1',
    'https://api.binance.com/api/v1/klines?symbol=NEOBTC&interval=15m&limit=1',
]
To pass a variable number of arguments to gather, use the * function argument syntax:
results = await asyncio.gather(*[request(u) for u in urls])
Note that f(*args) is a standard Python feature to invoke f with positional arguments calculated at run-time.
results will be available once all requests are done, and they will be in a list in the same order as the URLs. Then you can return them from main, which will cause them to be returned by run_until_complete.
Also, you will have much better performance if you create the session only once, and reuse it for all requests, e.g. by passing it as a second argument to the request function.
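A minimal sketch of that suggestion, with the session created once in main and passed into request (this is an illustrative rewrite, not part of the original snippet):

import asyncio
import aiohttp

async def request(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def main():
    # one shared session for all requests
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*[request(session, url) for url in urls])

loop = asyncio.get_event_loop()
results = loop.run_until_complete(main())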
Using gather and a helper function (request) only makes a quite simple task more complicated and harder to work with. You can simply use the same ClientSession for all your individual requests in a loop, saving each response into a result list.
async def main():
    results = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            async with session.get(url) as resp:
                results.append(await resp.text())
    print(len(results))
    print(results)
For the other part of your question, when you said:
When the script ends it does not save "results" anywhere.
if you meant that you want to access results outside of the main coroutine, you can simply add a return statement.
At the end of main, add:
return results
and change
loop.run_until_complete(main())
# into:
results = loop.run_until_complete(main())

Parallel async tasks emitted by endless generator

I have code that is very close to this:
class Parser:
    async def fetch(self, url):
        html = await get(url)
        return html

    @property
    def endless_generator(self):
        while True:
            yield url

    async def loop(self):
        htmls = []
        for url in self.endless_generator:
            htmls.append(await self.fetch(url))
        return htmls

    def run(self):
        loop = asyncio.get_event_loop()
        try:
            htmls = loop.run_until_complete(self.loop())
        finally:
            loop.close()

parser = Parser()
parser.run()
Right now Parser.loop runs synchronously.
I've tried asyncio.wait and asyncio.gather to achieve async invocations of Parser.fetch, but I don't know the number of URLs in advance (because the URLs are yielded by an endless generator).
So, how do I get asynchronous calls if the number of tasks is not known in advance?
I've tried asyncio.wait and asyncio.gather to achieve async invocations of Parser.fetch, but I don't know the number of URLs in advance (because the URLs are yielded by an endless generator).
I assume that by endless generator you mean a generator whose number of URLs is not known in advance, rather than a truly endless generator (generating an infinite list). Here is a version that creates a task as soon as a URL is available, and gathers the results as they arrive:
async def loop(self):
    lp = asyncio.get_event_loop()
    tasks = set()
    result = {}
    any_done = asyncio.Event()

    def _task_done(t):
        tasks.remove(t)
        any_done.set()
        result[t.fetch_url] = t.result()

    for url in self.endless_generator:
        new_task = lp.create_task(self.fetch(url))
        new_task.fetch_url = url
        tasks.add(new_task)
        new_task.add_done_callback(_task_done)
        await any_done.wait()
        any_done.clear()

    while tasks:
        await any_done.wait()
        any_done.clear()

    return result  # mapping url -> html
One cannot simply call gather or wait in each iteration because that would wait for all the existing tasks to finish before queuing a new one. wait(return_when=FIRST_COMPLETED) could work, but it would be O(n**2) in the number of tasks because it would set up its own callback each time anew.
