Fetch multiple URLs with asyncio/aiohttp and retry for failures - python

I'm attempting to write some asynchronous GET requests with the aiohttp package, and have most of the pieces figured out, but am wondering what the standard approach is when handling the failures (returned as exceptions).
A general idea of my code so far (after some trial and error, I am following the approach here):
import asyncio
import aiofiles
import aiohttp
from pathlib import Path

with open('urls.txt', 'r') as f:
    urls = [s.rstrip() for s in f.readlines()]

async def fetch(session, url):
    async with session.get(url) as response:
        if response.status != 200:
            response.raise_for_status()
        data = await response.text()
    # (Omitted: some more URL processing goes on here)
    out_path = Path(f'out/')
    if not out_path.is_dir():
        out_path.mkdir()
    fname = url.split("/")[-1]
    async with aiofiles.open(out_path / f'{fname}.html', 'w+') as f:
        await f.write(data)

async def fetch_all(urls, loop):
    async with aiohttp.ClientSession(loop=loop) as session:
        results = await asyncio.gather(*[fetch(session, url) for url in urls],
                                       return_exceptions=True)
        return results

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    results = loop.run_until_complete(fetch_all(urls, loop))
Now this runs fine:
As expected, the results variable holds None for every URL that was requested successfully (each entry sits at the same index as its URL in the urls list, i.e. at the same line number in the input file urls.txt), and the corresponding file is written to disk.
This means I can use the results variable to determine which URLs failed (the entries in results that are not None).
I have looked at a few different guides to using the various asynchronous Python packages (aiohttp, aiofiles, and asyncio) but I haven't seen the standard way to handle this final step.
Should the retrying to send a GET request be done after the await statement has 'finished'/'completed'?
...or should the retrying to send a GET request be initiated by some sort of callback upon failure
The errors look like this: ClientConnectorError(111, "Connect call failed ('000.XXX.XXX.XXX', 443)"), i.e. the request to IP address 000.XXX.XXX.XXX on port 443 failed, probably because the server enforces some limit that I should respect by waiting for a while before retrying.
Is there some sort of limit I might consider putting on, to batch the number of requests rather than trying them all?
I am getting about 40-60 successful requests when attempting a few hundred (over 500) URLs in my list.
Naively, I was expecting run_until_complete to handle this in such a way that it would finish upon succeeding at requesting all URLs, but this isn't the case.
I haven't worked with asynchronous Python and sessions/loops before, so would appreciate any help to find how to get the results. Please let me know if I can give any more information to improve this question, thank you!

Should the retrying to send a GET request be done after the await statement has 'finished'/'completed'? ...or should the retrying to send a GET request be initiated by some sort of callback upon failure
You can do the former. You don't need any special callback, since you are executing inside a coroutine, so a simple while loop will suffice, and won't interfere with execution of other coroutines. For example:
async def fetch(session, url):
    data = None
    while data is None:
        try:
            async with session.get(url) as response:
                response.raise_for_status()
                data = await response.text()
        except aiohttp.ClientError:
            # sleep a little and try again
            await asyncio.sleep(1)
    # (Omitted: some more URL processing goes on here)
    out_path = Path(f'out/')
    if not out_path.is_dir():
        out_path.mkdir()
    fname = url.split("/")[-1]
    async with aiofiles.open(out_path / f'{fname}.html', 'w+') as f:
        await f.write(data)
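Note that a while loop like this retries forever, which may not be what you want if a URL is permanently broken. A minimal sketch of a bounded variant with exponential backoff could look like the following. The helper name retry_async and its parameters are my own invention for illustration, not part of asyncio or aiohttp, and the delays are arbitrary.

```python
import asyncio
import random


async def retry_async(make_attempt, *, max_attempts=5, base_delay=0.5,
                      retry_on=(Exception,)):
    """Await make_attempt() until it succeeds or max_attempts is reached.

    make_attempt is a zero-argument callable that returns a fresh awaitable
    on every call, because a coroutine object cannot be awaited twice.
    (Hypothetical helper, not a library function.)
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_attempt()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the last error
            # Exponential backoff plus a little jitter before the next try
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
```

Inside fetch you could then wrap only the download step, for example by passing a lambda that calls a small coroutine doing the session.get, with retry_on=(aiohttp.ClientError,). A URL that keeps failing then surfaces as an exception in the gather results instead of looping indefinitely.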
Naively, I was expecting run_until_complete to handle this in such a way that it would finish upon succeeding at requesting all URLs
The term "complete" is meant in the technical sense of a coroutine completing (running its course), which is achieved either by the coroutine returning or raising an exception.
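As for limiting how many requests run at once: one common approach is an asyncio.Semaphore around each request, combined with gather(..., return_exceptions=True) so you can see afterwards which URLs failed. Below is a rough sketch with no network access involved. The names gather_limited and split_results are made up for this example, and the limit of 10 is an arbitrary assumption you would tune for the server.

```python
import asyncio


async def gather_limited(coro_funcs, limit=10):
    """Run zero-argument coroutine functions with at most `limit` in flight.

    Results come back in input order; failures are returned as exception
    objects (return_exceptions=True) rather than raised.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run_one(coro_func):
        async with semaphore:  # waits here while `limit` jobs are running
            return await coro_func()

    return await asyncio.gather(*(run_one(f) for f in coro_funcs),
                                return_exceptions=True)


def split_results(items, results):
    """Pair each input item with its outcome and separate the failures."""
    succeeded = [(item, res) for item, res in zip(items, results)
                 if not isinstance(res, BaseException)]
    failed = [(item, res) for item, res in zip(items, results)
              if isinstance(res, BaseException)]
    return succeeded, failed
```

With the code from the question you would pass something like [lambda u=url: fetch(session, u) for url in urls] and then feed the failed URLs into another round. aiohttp's TCPConnector also accepts a limit argument that caps the number of simultaneous connections, which may be enough on its own.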

Related

How to fire and forget an HTTP request?

Is it possible to fire a request and not wait for the response at all?
For Python, most internet searches lead to:
asynchronous-requests-with-python-requests
grequests
requests-futures
However, all of the above solutions spawn a new thread and wait for the response on that thread. Is it possible to not wait for any response at all, anywhere?
You can run your thread as a daemon; see the code below. If I comment out the line t.daemon = True, the program waits for the thread to finish before exiting. With daemon set to True, it simply exits. You can try it with the example below.
import requests
import threading
import time

def get_thread():
    g = requests.get("http://www.google.com")
    time.sleep(2)
    print(g.text[0:100])

if __name__ == '__main__':
    t = threading.Thread(target=get_thread)
    t.daemon = True  # Try commenting this out, running it, and see the difference
    t.start()
    print("Done")
I don't really know what you are trying to achieve by just firing an HTTP request, so I will list some use cases I can think of.
Ignoring the result
If all you want is for your program not to stall while it makes a request, you can use a library like aiohttp to make concurrent requests without awaiting the responses.
import aiohttp
import asyncio

async def main():
    async with aiohttp.ClientSession() as session:
        session.get('http://python.org')

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
but how can you know that the request was made successfully if you don't check anything?
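One thing worth keeping in mind with asyncio is that a coroutine object which is never awaited or scheduled does not run at all, so "not awaiting" usually means handing the coroutine to the event loop as a task. Here is a minimal sketch of that pattern using only the standard library. The function names fire_and_forget and send_ping are hypothetical, and send_ping just sleeps as a stand-in for a real request.

```python
import asyncio

background_tasks = set()


def fire_and_forget(coro):
    """Schedule coro on the running loop without awaiting its result."""
    task = asyncio.create_task(coro)
    # Keep a strong reference so the task is not garbage-collected while it
    # is still running, and drop the reference once the task finishes.
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return task


async def send_ping(log):
    # Stand-in for a real request (hypothetical); records that it ran.
    await asyncio.sleep(0.01)
    log.append("sent")


async def main():
    log = []
    fire_and_forget(send_ping(log))
    log.append("continued")    # runs before the ping has completed
    await asyncio.sleep(0.05)  # the loop must stay alive for the task to finish
    return log
```

The catch is the last step: if the event loop stops (or the session is closed) before the task finishes, the request is cancelled, so something still has to keep the loop running long enough.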
Ignoring the body
Maybe you want to be very performant, and you are worried about losing time reading the body. In that case you can just fire the request, check the status code, and then close the connection.
import http.client

def make_request(url="yahoo.com", timeout=50):
    # HTTPConnection expects a host name here, not a full URL
    conn = http.client.HTTPConnection(url, timeout=timeout)
    conn.request("GET", "/")
    res = conn.getresponse()
    print(res.status)
    conn.close()
If you close the connection as I did above, you won't be able to reuse it.
The right way
I would recommend awaiting asynchronous calls with aiohttp, so you can add the necessary logic without blocking.
But if you are looking for performance, a custom solution with the http library is necessary. Maybe you could also consider very small requests/responses, short timeouts, and compression in your client and server.

Python: async programming and pool_maxsize with HTTPAdapter

What's the correct way to use HTTPAdapter with Async programming and calling out to a method? All of these requests are being made to the same domain.
I'm doing some async programming in Celery using eventlet and testing the load on one of my sites. I have a method that I call out to which makes the request to the url.
def get_session(url):
    # gets session returns source
    headers, proxies = header_proxy()
    # set all of our necessary variables to None so that in the event of an error
    # we can make sure we dont break
    response = None
    status_code = None
    out_data = None
    content = None
    try:
        # we are going to use request-html to be able to parse the
        # data upon the initial request
        with HTMLSession() as session:
            # you can swap out the original request session here
            # session = requests.session()
            # passing the parameters to the session
            session.mount('https://', HTTPAdapter(max_retries=0, pool_connections=250, pool_maxsize=500))
            response = session.get(url, headers=headers, proxies=proxies)
            status_code = response.status_code
            try:
                # we are checking to see if we are getting a 403 error on all requests. If so,
                # we update the status code
                code = response.html.xpath('''//*[@id="accessDenied"]/p[1]/b/text()''')
                if code:
                    status_code = str(code[0][:-1])
                else:
                    pass
            except Exception as error:
                pass
                # print(error)
            # assign the content to content
            content = response.content
    except Exception as error:
        print(error)
        pass
If I leave out the pool_connections and pool_maxsize parameters and run the code, I get an error indicating that I do not have enough open connections. However, I don't want to open a large number of connections unnecessarily if I don't need to.
Based on https://laike9m.com/blog/requests-secret-pool_connections-and-pool_maxsize,89/ I'm going to guess that this setting applies to the host and not so much to the async task. Therefore, I set pool_maxsize to the maximum number of connections that can be reused per host. If I hit a domain several times, the connection is reused.

asyncio and aiohttp "Cannot connect to host"

I have a piece of code which checks whether domains from a list host a website or not.
I'm running 100 parallel tasks which consume the domains from a queue.
The issue I'm facing is that I get false-negative "Cannot connect to host" errors on some domains, while the same domains may actually produce a valid HTTP 200 response when processed individually using the exact same code.
Here's a cleaned-up version of the code I use to do the actual call:
def get_session():
    connector = aiohttp.TCPConnector(ssl=False, family=socket.AF_INET, resolver=aiohttp.AsyncResolver(timeout=5))
    return aiohttp.ClientSession(connector=connector)

async def ping(url, session):
    result = PingResult()
    try:
        async with session.get(url, timeout=timeout, headers=headers) as r:
            result.status_code = r.status
            result.redirect = r.headers['location'] if 'location' in r.headers else None
    except BaseException as e:
        result.exception = classify_exception(e)
    return result
When it's called, it gets the session returned by get_session() as a parameter (all tasks share the same session, I tried it with one session / url, didn't work):
async with get_session() as session:
    await ping(url, session)
(PingResult and classify_exception, headers, timeout are defined outside).
I'm using uvloop and aiodns, and running it on Ubuntu 18.04.
Is there a reason why this code should run fine when executed alone, but sometimes fail with Cannot connect to host when run in multiple tasks?

How to open different URLs at the same time using Python Selenium?

Is it possible to open 50 different URLs at the same time using Selenium with Python?
Is this possible using threading?
If so, how would I go about doing this?
If not, then what would be a good method to do so?
You can try the code below to open 50 URLs one by one, each in a new tab:
urls = ["http://first.com", "http://second.com", ...]
for url in urls:
    driver.execute_script('window.open("%s")' % url)
You can use Celery (a distributed task queue) to open all these URLs.
Or you can use async and await with aiohttp on Python >= 3.5, which runs in a single thread of a single process but concurrently (it uses the time spent waiting on one URL to fetch other URLs).
Here is a code sample for that approach. The event loop takes care of scheduling the concurrent tasks.
#!/usr/local/bin/python3.5
import asyncio
from aiohttp import ClientSession

async def hello(url):
    async with ClientSession() as session:
        async with session.get(url) as response:
            response = await response.read()
            print(response)

loop = asyncio.get_event_loop()
loop.run_until_complete(hello("http://httpbin.org/headers"))
Well, opening 50 URLs at the same time seems unreasonable and will require a LOT of processing, but it is possible. However, I would recommend iterating instead: open one URL at a time, 50 times.
from selenium import webdriver

# named urls rather than list, to avoid shadowing the built-in
urls = ['list of urls here', '2nd url'...]
driver = webdriver.Firefox()
for url in urls:
    driver.get(url)
    ...  # rest of your code
driver.quit()
But you can also make one driver.get('url') call for each URL you want, using different drivers or tabs. It will require a lot of processing, though.

The tasks from asyncio.gather do not run concurrently

I want to scrape data from a website concurrently, but I found that the following program is NOT executed concurrently.
async def return_soup(url):
    r = requests.get(url)
    r.encoding = "utf-8"
    soup = BeautifulSoup(r.text, "html.parser")
    future = asyncio.Future()
    future.set_result(soup)
    return future

async def parseURL_async(url):
    print("Started to download {0}".format(url))
    soup = await return_soup(url)
    print("Finished downloading {0}".format(url))
    return soup

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
t = [parseURL_async(url_1), parseURL_async(url_2)]
loop.run_until_complete(asyncio.gather(*t))
However, this program starts downloading the second page only after the first one finishes. If my understanding is correct, the await keyword in await return_soup(url) waits for the function to complete, and while waiting it returns control to the event loop, which enables the loop to start the second download.
And once the function finally finishes the execution, the future instance within it gets the result value.
But why does this not work concurrently? What am I missing here?
Using asyncio is different from using threads in that you cannot add it to an existing code base to make it concurrent. Specifically, code that runs in the asyncio event loop must not block - all blocking calls must be replaced with non-blocking versions that yield control to the event loop. In your case, requests.get blocks and defeats the parallelism implemented by asyncio.
To avoid this problem, you need to use an http library that is written with asyncio in mind, such as aiohttp.
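To make the effect of a blocking call concrete, here is a small sketch that uses only the standard library, with time.sleep standing in for requests.get and asyncio.sleep standing in for a non-blocking request. The function names and the 0.2 second delay are arbitrary choices for the illustration.

```python
import asyncio
import time


async def blocking_job():
    time.sleep(0.2)           # blocks the whole event loop; nothing else runs


async def cooperative_job():
    await asyncio.sleep(0.2)  # yields to the event loop while waiting


async def timed(job):
    """Run two copies of job with gather and return the elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(job(), job())
    return time.perf_counter() - start


blocking_elapsed = asyncio.run(timed(blocking_job))        # roughly 0.4 s
cooperative_elapsed = asyncio.run(timed(cooperative_job))  # roughly 0.2 s
```

Both versions go through gather, but only the second overlaps the waiting, because only there does each coroutine hand control back to the loop.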
I'll add a little more to user4815162342's response. The asyncio framework uses coroutines that must cede control of the thread while they do the long operation. See the diagram at the end of this section for a nice graphical representation. As user4815162342 mentioned, the requests library doesn't support asyncio. I know of two ways to make this work concurrently. First, is to do what user4815162342 suggested and switch to a library with native support for asynchronous requests. The second is to run this synchronous code in separate threads or processes. The latter is easy because of the run_in_executor function.
loop = asyncio.get_event_loop()

async def return_soup(url):
    r = await loop.run_in_executor(None, requests.get, url)
    r.encoding = "utf-8"
    return BeautifulSoup(r.text, "html.parser")

async def parseURL_async(url):
    print("Started to download {0}".format(url))
    soup = await return_soup(url)
    print("Finished downloading {0}".format(url))
    return soup

t = [parseURL_async(url_1), parseURL_async(url_2)]
loop.run_until_complete(asyncio.gather(*t))
This solution removes some of the benefit of using asyncio, as the long operation will still probably be executed from a fixed size thread pool, but it's also much easier to start with.
The reason, as mentioned in other answers, is the lack of library support for coroutines.
As of Python 3.9, though, you can use the function asyncio.to_thread as an alternative for I/O concurrency.
Obviously this is not exactly equivalent because, as the name suggests, it runs your functions in separate threads as opposed to a single thread in the event loop, but it can be a way to achieve I/O concurrency without relying on proper async support from the library.
In your example the code would be:
def return_soup(url):
    r = requests.get(url)
    r.encoding = "utf-8"
    return BeautifulSoup(r.text, "html.parser")

def parseURL_async(url):
    print("Started to download {0}".format(url))
    soup = return_soup(url)
    print("Finished downloading {0}".format(url))
    return soup

async def main():
    result_url_1, result_url_2 = await asyncio.gather(
        asyncio.to_thread(parseURL_async, url_1),
        asyncio.to_thread(parseURL_async, url_2),
    )

asyncio.run(main())
