I'm making a Python web-scraper script. I want to do this using asyncio, so for the async HTTP requests I use aiohttp.
That part is OK, but when I try to keep the app non-blocking (await), BeautifulSoup4 blocks the application (because beautifulsoup4 doesn't support async).
This is what I tried:
import asyncio, aiohttp
from bs4 import BeautifulSoup

async def extractLinks(html):
    soup = BeautifulSoup(html, 'html.parser')
    return soup.select(".c-pro-box__title a")

async def getHtml(session, url):
    async with session.get(url) as response:
        return await response.text()

async def loadPage(url):
    async with aiohttp.ClientSession() as session:
        html = await getHtml(session, url)
        links = await extractLinks(html)
        return links

loop = asyncio.get_event_loop()
loop.run_until_complete(loadPage(url))  # url is the page to scrape
The extractLinks() call blocks the program flow.
So, is it possible to make it non-blocking? Or is there any library other than beautifulsoup4 that supports async well?
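One common workaround (my suggestion, not something the question confirms trying) is to leave BeautifulSoup synchronous and hand the parsing to a thread with run_in_executor, so the event loop keeps serving other coroutines while the parse runs. A minimal sketch:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

def extract_links(html):
    # plain synchronous function; BeautifulSoup itself stays blocking
    soup = BeautifulSoup(html, 'html.parser')
    return soup.select(".c-pro-box__title a")

async def load_page(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            html = await response.text()
    loop = asyncio.get_event_loop()
    # run the blocking parse in the default thread-pool executor
    return await loop.run_in_executor(None, extract_links, html)

The same idea is worked out in more detail in the run_in_executor answer further down.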
Okay, so I am using yahoo_fin in Python, but the requests were too slow for my short-term use case (pre-redis/cron) and I need an implementation that can make many requests to Yahoo Finance.
I converted the code from requests to aiohttp/asyncio, and despite awaiting, I now get the error "No Tables Found" on some of the requests -- it varies. Did I not implement the async correctly, or am I forced to use Selenium? (It's ~10x faster now, fwiw.)
BEFORE
tables = pd.read_html(requests.get(site, headers=headers).text)
AFTER
async with aiohttp.ClientSession() as session:
    async with session.get(site, headers=headers) as resp:
        text = await resp.text()
        tables = pd.read_html(text)
I call the AFTER code like so... (small excerpts to illustrate)
async def get_ranking(arr, expiration, optionType, symbol):
    try:
        print(symbol, flush=True)
        price = await async_get_live_price(symbol)
        chain = await async_get_options_chain(symbol, expiration)
        # ...

async with ClientSession() as session:
    await asyncio.gather(*[get_ranking(arr, expiration, optionType, symbol) for symbol in symbols])
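A plausible cause (my assumption; the question doesn't confirm it) is that gathering every symbol at once floods Yahoo with concurrent requests, and throttled or partial responses come back without the expected tables. A sketch that caps concurrency with a shared semaphore:

import asyncio
import aiohttp
import pandas as pd

sem = asyncio.Semaphore(10)  # hypothetical cap; tune to what the site tolerates

async def get_tables(session, site, headers):
    async with sem:  # at most 10 requests in flight at once
        async with session.get(site, headers=headers) as resp:
            text = await resp.text()
    # parse outside the semaphore so the slot frees up sooner
    return pd.read_html(text)

If the tables still go missing at low concurrency, the pages may require JavaScript or different headers, and Selenium would be the fallback.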
How can I use the aiter method with aiomultiprocess?
I have a small script to download URLs. They are directly available in the get(url) function, but I need/want them in the main() function.
Here is a small code example:
import aiomultiprocess
from requests_html import AsyncHTMLSession
from bs4 import BeautifulSoup

async def get(url):
    asession = AsyncHTMLSession()
    response = await asession.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    return soup

async def main():
    async with aiomultiprocess.Pool(processes=4, childconcurrency=8) as pool:
        await pool.map(get, urls)  # --> What would the code be here to use them as they are ready?
I would be thankful for an example of using the aiter method in the main() function.
Thank you.
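In recent aiomultiprocess releases, Pool.map returns a PoolResult that can be async-iterated instead of awaited, yielding each result as the pool delivers it. A sketch under that assumption (urls being your list of pages):

import asyncio
import aiomultiprocess
from bs4 import BeautifulSoup
from requests_html import AsyncHTMLSession

async def get(url):
    asession = AsyncHTMLSession()
    response = await asession.get(url)
    return BeautifulSoup(response.text, "lxml")

async def main():
    urls = ["https://example.com"]  # replace with your URLs
    async with aiomultiprocess.Pool(processes=4, childconcurrency=8) as pool:
        # async for consumes the PoolResult incrementally instead of
        # waiting for the whole batch to finish
        async for soup in pool.map(get, urls):
            print(soup.title)

if __name__ == "__main__":
    asyncio.run(main())

Note that whatever get() returns has to travel back from the worker process, so it must be picklable.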
I need to scrape and get the raw text of the body paragraphs for many (5-10k per day) news articles. I've written some threading code, but given the highly I/O-bound nature of this project I am dabbling in asyncio. The code snippet below is not any faster than a single-threaded version, and far worse than my threaded version. Could anyone tell me what I am doing wrong? Thank you!
import asyncio
from unicodedata import normalize

import aiohttp
from bs4 import BeautifulSoup

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def scrape_urls(urls):
    results = []
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            html = await fetch(session, url)
            soup = BeautifulSoup(html, 'html.parser')
            body = soup.find('div', attrs={'class': 'entry-content'})
            paras = [normalize('NFKD', para.get_text()) for para in body.find_all('p')]
            results.append(paras)
    return results
await means "wait until the result is ready", so when you await the fetching in each loop iteration, you request (and get) sequential execution. To parallelize fetching, you need to spawn each fetch into a background task using something like asyncio.create_task(fetch(...)) and then await them, similar to how you'd do it with threads. Or, even more simply, you can let the asyncio.gather convenience function do it for you. For example (untested):
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

def parse(html):
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find('div', attrs={'class': 'entry-content'})
    return [normalize('NFKD', para.get_text())
            for para in body.find_all('p')]

async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    paras = parse(html)
    return paras

async def scrape_urls(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(fetch_and_parse(session, url) for url in urls)
        )
If you find that this still runs slower than the multi-threaded version, it is possible that the parsing of HTML is slowing down the IO-related work. (Asyncio runs everything in a single thread by default.) To prevent CPU-bound code from interfering with asyncio, you can move the parsing to a separate thread using run_in_executor:
async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    loop = asyncio.get_event_loop()
    # run parse(html) in a separate thread, and
    # resume this coroutine when it completes
    paras = await loop.run_in_executor(None, parse, html)
    return paras
Note that run_in_executor must be awaited because it returns an awaitable that is "woken up" when the background thread completes the given assignment. As this version uses asyncio for IO and threads for parsing, it should run about as fast as your threaded version, but scale to a much larger number of parallel downloads.
Finally, if you want the parsing to actually run in parallel, using multiple cores, you can use multiprocessing instead:
import concurrent.futures

_pool = concurrent.futures.ProcessPoolExecutor()

async def fetch_and_parse(session, url):
    html = await fetch(session, url)
    loop = asyncio.get_event_loop()
    # run parse(html) in a separate process, and
    # resume this coroutine when it completes
    paras = await loop.run_in_executor(_pool, parse, html)
    return paras
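For completeness, a minimal driver for either fetch_and_parse variant might look like this (my addition; asyncio.run needs Python 3.7+, and on Windows the process-pool variant must sit behind the __main__ guard):

if __name__ == '__main__':
    urls = ["https://example.com/article"]  # replace with your article URLs
    all_paras = asyncio.run(scrape_urls(urls))

Note that ProcessPoolExecutor requires parse and its arguments to be picklable; plain HTML strings and lists of strings are, so the hand-off above is safe.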
I've written a script in Python using asyncio in association with the aiohttp library to asynchronously parse the names out of pop-up boxes initiated upon clicking the contact info buttons of different agencies, located within a table on this website. The webpage displays the tabular content across 513 pages.
I encountered the error too many file descriptors in select() when I tried with asyncio.get_event_loop(), but when I came across this thread I saw a suggestion to use asyncio.ProactorEventLoop() to avoid such errors, so I used the latter. However, even when I complied with the suggestion, the script collects the names from only a few pages until it throws the following error. How can I fix this?
raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host www.tursab.org.tr:443 ssl:None [The semaphore timeout period has expired]
This is my attempt so far:
import asyncio
import aiohttp
from bs4 import BeautifulSoup

links = ["https://www.tursab.org.tr/en/travel-agencies/search-travel-agency?sayfa={}".format(page) for page in range(1, 514)]
lead_link = "https://www.tursab.org.tr/en/displayAcenta?AID={}"

async def get_links(url):
    async with asyncio.Semaphore(10):
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                text = await response.text()
                result = await process_docs(text)
            return result

async def process_docs(html):
    coros = []
    soup = BeautifulSoup(html, "lxml")
    items = [itemnum.get("data-id") for itemnum in soup.select("#acentaTbl tr[data-id]")]
    for item in items:
        coros.append(fetch_again(lead_link.format(item)))
    await asyncio.gather(*coros)

async def fetch_again(link):
    async with asyncio.Semaphore(10):
        async with aiohttp.ClientSession() as session:
            async with session.get(link) as response:
                text = await response.text()
                sauce = BeautifulSoup(text, "lxml")
                try:
                    name = sauce.select_one("p > b").text
                except Exception:
                    name = ""
                print(name)

if __name__ == '__main__':
    loop = asyncio.ProactorEventLoop()
    asyncio.set_event_loop(loop)
    loop.run_until_complete(asyncio.gather(*(get_links(link) for link in links)))
In short, what the process_docs() function does is collect the data-id numbers from each page so they can be reused as the suffix of the https://www.tursab.org.tr/en/displayAcenta?AID={} link, in order to collect the names from the pop-up boxes. One such id is 8757, and one such qualified link is therefore https://www.tursab.org.tr/en/displayAcenta?AID=8757.
Btw, if I change the highest number used in the links variable to 20 or 30 or so, it goes smoothly.
async def get_links(url):
    async with asyncio.Semaphore(10):
You can't do that: it means that a new semaphore instance will be created on each function call, while you need a single semaphore instance shared by all requests. Change your code this way:
sem = asyncio.Semaphore(10)  # module level

async def get_links(url):
    async with sem:
        # ...

async def fetch_again(link):
    async with sem:
        # ...
You can also return to the default loop once you're using the semaphore correctly:
if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(...)
Finally, you should alter both get_links(url) and fetch_again(link) to do the parsing outside of the semaphore, releasing it as soon as possible, before the semaphore is needed inside process_docs(text).
Final code:
import asyncio
import aiohttp
from bs4 import BeautifulSoup

links = ["https://www.tursab.org.tr/en/travel-agencies/search-travel-agency?sayfa={}".format(page) for page in range(1, 514)]
lead_link = "https://www.tursab.org.tr/en/displayAcenta?AID={}"

sem = asyncio.Semaphore(10)

async def get_links(url):
    async with sem:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                text = await response.text()
    # parse outside the semaphore so it is released as soon as
    # the network work is done
    result = await process_docs(text)
    return result

async def process_docs(html):
    coros = []
    soup = BeautifulSoup(html, "lxml")
    items = [itemnum.get("data-id") for itemnum in soup.select("#acentaTbl tr[data-id]")]
    for item in items:
        coros.append(fetch_again(lead_link.format(item)))
    await asyncio.gather(*coros)

async def fetch_again(link):
    async with sem:
        async with aiohttp.ClientSession() as session:
            async with session.get(link) as response:
                text = await response.text()
    sauce = BeautifulSoup(text, "lxml")
    try:
        name = sauce.select_one("p > b").text
    except Exception:
        name = ""
    print(name)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*(get_links(link) for link in links)))
I have a project on Python 2.7. I need to make an async POST call to connect to AWS. I have code for async in Python 3.5 using asyncio, but my code needs to work on Python 2.7. Please guide me on how to resolve this issue.
import asyncio
import json
from aiohttp import ClientSession

HEADERS = {'Content-type': 'application/json'}

async def hello(url):
    data = {"mac": 'mm', 'minor': 3, 'distance': 1, 'timestamp': 4444, 'uuid': 'aa', 'rssi': 1, 'tx': 34}
    async with ClientSession() as session:
        async with session.post(url, json=data) as response:
            response = await response.read()
            print(response)

loop = asyncio.get_event_loop()
url = "http://192.168.101.74:9090/api/postreader"
while True:
    loop.run_until_complete(hello(url))
Try using gevent instead of asyncio?
http://www.gevent.org/
https://pypi.org/project/gevent/#downloads
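A minimal sketch of how the gevent approach could look on Python 2.7 (my adaptation of the asyncio code above, untested; gevent's monkey patching makes the blocking requests call cooperative):

import gevent.monkey
gevent.monkey.patch_all()

import gevent
import requests

HEADERS = {'Content-type': 'application/json'}
url = "http://192.168.101.74:9090/api/postreader"

def hello(url):
    data = {"mac": 'mm', 'minor': 3, 'distance': 1, 'timestamp': 4444, 'uuid': 'aa', 'rssi': 1, 'tx': 34}
    response = requests.post(url, json=data, headers=HEADERS)
    print(response.content)

# spawn several concurrent greenlets; each hello() yields to the
# others whenever its socket would block
jobs = [gevent.spawn(hello, url) for _ in range(10)]
gevent.joinall(jobs)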