How to run DB requests asynchronously? - python

When I run this it lists off the websites in the database one by one with the response code and it takes about 10 seconds to run through a very small list. It should be way faster and isn't running asynchronously but I'm not sure why.
import dblogin
import aiohttp
import asyncio
import async_timeout
dbconn = dblogin.connect()
dbcursor = dbconn.cursor(buffered=True)
dbcursor.execute("SELECT thistable FROM adatabase")
website_list = dbcursor.fetchall()
async def fetch(session, url):
with async_timeout.timeout(30):
async with session.get(url, ssl=False) as response:
await response.read()
return response.status, url
async def main():
async with aiohttp.ClientSession() as session:
for all_urls in website_list:
url = all_urls[0]
resp = await fetch(session, url)
print(resp, url)
if __name__ == '__main__':
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()
dbcursor.close()
dbconn.close()

This article explains the details. What you need to do is pass each fetch call in a Future object, and then pass a list of those to either asyncio.wait or asyncio.gather depending on your needs.
Your code would look something like this:
async def fetch(session, url):
with async_timeout.timeout(30):
async with session.get(url, ssl=False) as response:
await response.read()
return response.status, url
async def main():
tasks = []
async with aiohttp.ClientSession() as session:
for all_urls in website_list:
url = all_urls[0]
task = asyncio.create_task(fetch(session, url))
tasks.append(task)
responses = await asyncio.gather(*tasks)
if __name__ == '__main__':
loop = asyncio.get_event_loop()
future = asyncio.create_task(main())
loop.run_until_complete(future)
Also, are you sure that loop.close() call is needed? The docs mention that
The loop must not be running when this function is called. Any pending callbacks will be discarded.
This method clears all queues and shuts down the executor, but does not wait for the executor to finish.
As mentioned in the docs and in the link that #user4815162342 posted, it is better to use the create_task method instead of the ensure_future method when we know that the argument is a coroutine. Note that this was added in Python 3.7, so previous versions should continue using ensure_future instead.

Related

Where to put BeautifulSoup code in Asyncio Web Scraping Application

I need to scrape and get the raw text of the body paragraphs for many (5-10k per day) news articles. I've written some threading code, but given the highly I/O bound nature of this project I am dabbling in asyncio. The code snippet below is not any faster than a 1-threaded version, and far worse than my threaded version. Could anyone tell me what I am doing wrong? Thank you!
async def fetch(session,url):
async with session.get(url) as response:
return await response.text()
async def scrape_urls(urls):
results = []
tasks = []
async with aiohttp.ClientSession() as session:
for url in urls:
html = await fetch(session,url)
soup = BeautifulSoup(html,'html.parser')
body = soup.find('div', attrs={'class':'entry-content'})
paras = [normalize('NFKD',para.get_text()) for para in body.find_all('p')]
results.append(paras)
return results
await means "wait until the result is ready", so when you await the fetching in each loop iteration, you request (and get) sequential execution. To parallelize fetching, you need to spawn each fetch into a background tasks using something like asyncio.create_task(fetch(...)), and then await them, similar to how you'd do it with threads. Or even more simply, you can let the asyncio.gather convenience function do it for you. For example (untested):
async def fetch(session, url):
async with session.get(url) as response:
return await response.text()
def parse(html):
soup = BeautifulSoup(html,'html.parser')
body = soup.find('div', attrs={'class':'entry-content'})
return [normalize('NFKD',para.get_text())
for para in body.find_all('p')]
async def fetch_and_parse(session, url):
html = await fetch(session, url)
paras = parse(html)
return paras
async def scrape_urls(urls):
async with aiohttp.ClientSession() as session:
return await asyncio.gather(
*(fetch_and_parse(session, url) for url in urls)
)
If you find that this still runs slower than the multi-threaded version, it is possible that the parsing of HTML is slowing down the IO-related work. (Asyncio runs everything in a single thread by default.) To prevent CPU-bound code from interfering with asyncio, you can move the parsing to a separate thread using run_in_executor:
async def fetch_and_parse(session, url):
html = await fetch(session, url)
loop = asyncio.get_event_loop()
# run parse(html) in a separate thread, and
# resume this coroutine when it completes
paras = await loop.run_in_executor(None, parse, html)
return paras
Note that run_in_executor must be awaited because it returns an awaitable that is "woken up" when the background thread completes the given assignment. As this version uses asyncio for IO and threads for parsing, it should run about as fast as your threaded version, but scale to a much larger number of parallel downloads.
Finally, if you want the parsing to run actually in parallel, using multiple cores, you can use multi-processing instead:
_pool = concurrent.futures.ProcessPoolExecutor()
async def fetch_and_parse(session, url):
html = await fetch(session, url)
loop = asyncio.get_event_loop()
# run parse(html) in a separate process, and
# resume this coroutine when it completes
paras = await loop.run_in_executor(pool, parse, html)
return paras

Aiohttp try to get page response with page numbers until hit empty response

I'm trying to have a small function scraping data from a JSON end point,
the url is like https://xxxxxxxx.com/products.json?&page=" which I can insert a page number,
While I was using requests module I just had a while loop and incrementing the page number and break until I get a empty response (which page is empty)
Is there a possible way to do the same thing with aiohttp?
What I only achieved so far is just pre-genenrate certain number of urls and pass it into tasks
Wondering if I can use a loop as well and stop when see empty response
Thank you very much
'''
import asyncio
import aiohttp
async def download_one(url):
async with aiohttp.ClientSession() as session:
async with session.get(url) as resp:
pprint.pprint(await resp.json(content_type=None))
async def download_all(sites):
tasks = [asyncio.create_task(download_one(site)) for site in sites]
await asyncio.gather(*tasks)
def main():
sites = list(map(lambda x: request_url + str(x), range(1, 50)))
asyncio.run(download_all(sites))
'''
Here is a piece of untested code. Even if it won't work, it will give you an idea how to do the job
import asyncio
import aiohttp
async def download_one(session, url):
async with session.get(url) as resp:
resp = await resp.json()
if not resp:
raise Exception("No data found") # needs to be there for breaking the loop
async def download_all(sites):
async with aiohttp.ClientSession() as session:
futures = [download_one(session, site) for site in sites]
done, pending = await asyncio.wait(
futures, return_when=FIRST_EXCEPTION # will return the result when exception is raised by any future
)
for future in pending:
future.cancel() # it will shut down all redundant jobs
def main():
sites = list(map(lambda x: request_url + str(x), range(1, 50)))
asyncio.run_until_complete(download_all(sites))

asyncio tasks using aiohttp.ClientSession

I'm using python 3.7 and trying to make a crawler that can go multiple domains asynchronously. I'm using for this asyncio and aiohttp but i'm experiencing problems with the aiohttp.ClientSession. This is my reduced code:
import aiohttp
import asyncio
async def fetch(session, url):
async with session.get(url) as response:
print(await response.text())
async def main():
loop = asyncio.get_event_loop()
async with aiohttp.ClientSession(loop=loop) as session:
cwlist = [loop.create_task(fetch(session, url)) for url in ['http://python.org', 'http://google.com']]
asyncio.gather(*cwlist)
if __name__ == "__main__":
asyncio.run(main())
The thrown exception is this:
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=RuntimeError('Session is closed')>
What am i doing wrong here?
You forgot to await the asyncio.gather result:
async with aiohttp.ClientSession(loop=loop) as session:
cwlist = [loop.create_task(fetch(session, url)) for url in ['http://python.org', 'http://google.com']]
await asyncio.gather(*cwlist)
If you ever have an async with containing no await expressions you should be fairly suspicious.

My script encounters an error when it is supposed to run asynchronously

I've written a script in python using asyncio association with aiohttp library to parse the names out of pop up boxes initiated upon clicking on contact info buttons out of diffetent agency information located within a table from this website asynchronously. The webpage displayes the tabular contents across 513 pages.
I encountered this error too many file descriptors in select() when I tried with asyncio.get_event_loop() but when I came across this thread I could see that there is a suggestion to use asyncio.ProactorEventLoop() to avoid such error so I used the latter but noticed that, even when I complied with the suggestion, the script collects the names only from few pages until it throws the following error. How can i fix this?
raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host www.tursab.org.tr:443 ssl:None [The semaphore timeout period has expired]
This is my try so far with:
import asyncio
import aiohttp
from bs4 import BeautifulSoup
links = ["https://www.tursab.org.tr/en/travel-agencies/search-travel-agency?sayfa={}".format(page) for page in range(1,514)]
lead_link = "https://www.tursab.org.tr/en/displayAcenta?AID={}"
async def get_links(url):
async with asyncio.Semaphore(10):
async with aiohttp.ClientSession() as session:
async with session.get(url) as response:
text = await response.text()
result = await process_docs(text)
return result
async def process_docs(html):
coros = []
soup = BeautifulSoup(html,"lxml")
items = [itemnum.get("data-id") for itemnum in soup.select("#acentaTbl tr[data-id]")]
for item in items:
coros.append(fetch_again(lead_link.format(item)))
await asyncio.gather(*coros)
async def fetch_again(link):
async with asyncio.Semaphore(10):
async with aiohttp.ClientSession() as session:
async with session.get(link) as response:
text = await response.text()
sauce = BeautifulSoup(text,"lxml")
try:
name = sauce.select_one("p > b").text
except Exception: name = ""
print(name)
if __name__ == '__main__':
loop = asyncio.ProactorEventLoop()
asyncio.set_event_loop(loop)
loop.run_until_complete(asyncio.gather(*(get_links(link) for link in links)))
In short, What the process_docs() function does is collect data-id numbers from each pages to reuse them as the prefix of this https://www.tursab.org.tr/en/displayAcenta?AID={} link to collect the names from pop up boxes. One such id is 8757 and one such qualified links therefore https://www.tursab.org.tr/en/displayAcenta?AID=8757.
Btw, If I change the highest number used in the links variable to 20 or 30 or so, It goes smoothly.
async def get_links(url):
async with asyncio.Semaphore(10):
You can't do such a thing: it means that on each function call new semaphore instance will be created, while you need to single semaphore instance for all requests. Change your code this way:
sem = asyncio.Semaphore(10) # module level
async def get_links(url):
async with sem:
# ...
async def fetch_again(link):
async with sem:
# ...
You can also return default loop once you're using semaphore correctly:
if __name__ == '__main__':
loop = asyncio.get_event_loop()
loop.run_until_complete(...)
Finally, you should alter both get_links(url) and fetch_again(link) to do parsing outside of semaphore to release it as soon as possible, before semaphore is needed inside process_docs(text).
Final code:
import asyncio
import aiohttp
from bs4 import BeautifulSoup
links = ["https://www.tursab.org.tr/en/travel-agencies/search-travel-agency?sayfa={}".format(page) for page in range(1,514)]
lead_link = "https://www.tursab.org.tr/en/displayAcenta?AID={}"
sem = asyncio.Semaphore(10)
async def get_links(url):
async with sem:
async with aiohttp.ClientSession() as session:
async with session.get(url) as response:
text = await response.text()
result = await process_docs(text)
return result
async def process_docs(html):
coros = []
soup = BeautifulSoup(html,"lxml")
items = [itemnum.get("data-id") for itemnum in soup.select("#acentaTbl tr[data-id]")]
for item in items:
coros.append(fetch_again(lead_link.format(item)))
await asyncio.gather(*coros)
async def fetch_again(link):
async with sem:
async with aiohttp.ClientSession() as session:
async with session.get(link) as response:
text = await response.text()
sauce = BeautifulSoup(text,"lxml")
try:
name = sauce.select_one("p > b").text
except Exception:
name = "o"
print(name)
if __name__ == '__main__':
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*(get_links(link) for link in links)))

RuntimeError Session is closed when trying to make async requests

First of all heres the code:
import random
import asyncio
from aiohttp import ClientSession
import csv
headers =[]
def extractsites(file):
sites = []
readfile = open(file, "r")
reader = csv.reader(readfile, delimiter=",")
raw = list(reader)
for a in raw:
sites.append((a[1]))
return sites
async def fetchheaders(url, session):
async with session.get(url) as response:
responseheader = await response.headers
print(responseheader)
return responseheader
async def bound_fetch(sem, url, session):
async with sem:
print("doing request for "+ url)
await fetchheaders(url, session)
async def run():
urls = extractsites("cisco-umbrella.csv")
tasks = []
# create instance of Semaphore
sem = asyncio.Semaphore(100)
async with ClientSession() as session:
for i in urls:
task = asyncio.ensure_future(bound_fetch(sem, "http://"+i, session))
tasks.append(task)
return tasks
def main():
loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run())
loop.run_until_complete(future)
if __name__ == '__main__':
main()
Most of this code was taken from this blog post:
https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html
Here is my problem that I'm facing: I am trying to read a million urls from a file and then make async request for each of them.
But when I try to execute the code above I get the Session expired error.
This is my line of thought:
I am relatively new to async programming so bear with me.
My though process was to create a long task list (that only allows 100 parallel requests), that I build in the run function, and then pass as a future to the event loop to execute.
I have included a print debug in the bound_fetch (which I copied from the blog post) and it looks like it loops over all urls that I have and as soon as it should start making requests in the fetchheaders function I get the runtime errors.
How do I fix my code ?
A couple things here.
First, in your run function you actually want to gather the tasks there and await them to fix your session issue, like so:
async def run():
urls = ['google.com','amazon.com']
tasks = []
# create instance of Semaphore
sem = asyncio.Semaphore(100)
async with ClientSession() as session:
for i in urls:
task = asyncio.ensure_future(bound_fetch(sem, "http://"+i, session))
tasks.append(task)
await asyncio.gather(*tasks)
Second, the aiohttp API is a little odd in dealing with headers in that you can't await them. I worked around this by awaiting body so that headers are populated and then returning the headers:
async def fetchheaders(url, session):
async with session.get(url) as response:
data = await response.read()
responseheader = response.headers
print(responseheader)
return responseheader
There is some additional overhead here in pulling the body however. I couldn't find another way to load headers though without doing a body read.

Categories

Resources