asyncio and aiohttp "Cannot connect to host" - python

I have a piece of code which checks whether domains from a list host a website or not.
I'm running 100 parallel tasks which consume the domains from a queue.
The issue I'm facing is that I get false-negative Cannot connect to host errors on some domains, while the same domains produce a valid 200 HTTP response when processed individually using the exact same code.
Here's a cleaned-up version of the code I use to do the actual call:
def get_session():
    connector = aiohttp.TCPConnector(ssl=False, family=socket.AF_INET, resolver=aiohttp.AsyncResolver(timeout=5))
    return aiohttp.ClientSession(connector=connector)

async def ping(url, session):
    result = PingResult()
    try:
        async with session.get(url, timeout=timeout, headers=headers) as r:
            result.status_code = r.status
            result.redirect = r.headers['location'] if 'location' in r.headers else None
    except BaseException as e:
        result.exception = classify_exception(e)
    return result
When it's called, it receives the session returned by get_session() as a parameter (all tasks share the same session; I also tried one session per URL, which didn't work either):
async with get_session() as session:
    await ping(url, session)
(PingResult, classify_exception, headers, and timeout are defined elsewhere.)
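For completeness, this is roughly how the 100 tasks consume the queue (a simplified sketch; worker and check_domains are illustrative names, not my exact code):

import asyncio

async def worker(queue, session):
    # each of the 100 tasks pulls domains off the shared queue until it is empty
    while True:
        try:
            url = queue.get_nowait()
        except asyncio.QueueEmpty:
            return
        await ping(url, session)
        queue.task_done()

async def check_domains(domains):
    queue = asyncio.Queue()
    for domain in domains:
        queue.put_nowait(domain)
    async with get_session() as session:
        # all workers share the single session created above
        await asyncio.gather(*(worker(queue, session) for _ in range(100)))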
I'm using uvloop and aiodns, and running it on Ubuntu 18.04.
Is there a reason why this code runs fine when executed alone, but sometimes fails with Cannot connect to host when run in many parallel tasks?

Related

How to tell if requests.Session() is good?

I have the following session-dependent code which must be run continuously.
Code
import requests

http = requests.Session()
while True:
    # if http is not good, then run http = requests.Session() again
    response = http.get(....)
    # process response
    # wait for 5 seconds
Note: I moved the line http = requests.Session() out of the loop.
Issue
How to check if the session is working?
An example of a non-working session would be after the web server is restarted, or after a load balancer redirects to a different web server.
The requests.Session object is just a persistence and connection-pooling object that allows shared state between different HTTP requests on the client side.
If the server unexpectedly closes a session so that it becomes invalid, it will probably respond with an error-indicating HTTP status code, and requests will then raise an error. See Errors and Exceptions:
All exceptions that Requests explicitly raises inherit from requests.exceptions.RequestException.
See the extended classes of RequestException.
Approach 1: implement open/close using try/except
Your code can catch such exceptions within a try/except-block.
How an invalidated/closed session is signalled depends on the server's API specification. This signal response should be evaluated in the except block.
Here we use the session_was_closed(exception) function to evaluate the exception/response and Session.close() to close the session correctly before opening a new one.
import requests

# initially open a session object
s = requests.Session()

# execute requests continuously
while True:
    try:
        response = s.get(....)
        # process response
    except requests.exceptions.RequestException as e:
        if session_was_closed(e):
            s.close()  # close the session
            s = requests.Session()  # open a new session
        else:
            pass  # process non-session-related errors
    # wait for 5 seconds
Implement session_was_closed(exception) according to how your server signals a closed/invalidated session.
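As an illustration only (the status codes here are hypothetical and assume you call response.raise_for_status() after the GET; adapt it to your server's actual behaviour), session_was_closed could look something like this:

import requests

def session_was_closed(exception):
    # hypothetical rule: treat auth/session-related HTTP errors as a closed session
    if isinstance(exception, requests.exceptions.HTTPError):
        response = exception.response
        if response is not None and response.status_code in (401, 403):
            return True
    # a dropped connection may also mean the server side went away
    return isinstance(exception, requests.exceptions.ConnectionError)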
Approach 2: automatically open/close using with
From Advanced Usage, Session Objects:
Sessions can also be used as context managers:
with requests.Session() as s:
    s.get('https://httpbin.org/cookies/set/sessioncookie/123456789')
This will make sure the session is closed as soon as the with block is exited, even if unhandled exceptions occurred.
I would flip the logic and add a try-except.
import requests

http = requests.Session()
while True:
    try:
        response = http.get(....)
    except requests.ConnectionError:
        http = requests.Session()
        continue
    # process response
    # wait for 5 seconds
See this answer for more info. I didn't test if the raised exception is that one, so please test it.

Error while multi threading API queries in Python

I'm performing queries against a server with Open Source Routing Machine (OSRM) deployed. I send a set of coordinates and obtain an n x n matrix of network distances over a street network.
In order to improve the speed of the computations, I want to use ThreadPoolExecutor to parallelize the queries.
So far, I've set up the connection in two ways, both giving me the same error:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def osrm_query(url_input):
    'Send request'
    response = requests.get(url_input)
    r = response.json()
    return r

def osrm_query_2(url_input):
    'Send request'
    s = requests.Session()
    retries = Retry(total=3,
                    backoff_factor=0.1,
                    status_forcelist=[500, 502, 503, 504])
    s.mount('https://', HTTPAdapter(max_retries=retries))
    response = s.get(url_input)
    r = response.json()
    return r
I generate the set of URLs (in the _urls list) that I want to send as requests and parallelize this way:
from concurrent.futures import ThreadPoolExecutor

r = []
with ThreadPoolExecutor(max_workers=5) as executor:
    for each in executor.map(osrm_query_2, _urls):
        r.append(each)
So far everything works fine, but when processing more than 40,000 URLs, I get this error back:
OSError: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted
As far as I understand, the problem is that I'm sending too many requests from my machine, exhausting the ephemeral ports available for outgoing connections (it looks like this has nothing to do with the machine I'm sending the requests to).
How can I fix this?
Is there a way to tell the ThreadPoolExecutor to re-use the connections?
I was guided in the right direction by someone outside Stack Overflow.
The trick was to point the workers of the pool to a shared requests Session. The function that sends the queries was re-worked as follows:
def osrm_query(url_input, session):
    'Send request'
    response = session.get(url_input)
    r = response.json()
    return r
And the parallelization would be:
from itertools import repeat

r = []
with ThreadPoolExecutor(max_workers=50) as executor:
    with requests.Session() as s:
        for each in executor.map(osrm_query, _urls, repeat(s)):
            r.append(each)
This way I reduced the execution time from 100 minutes (not parallelized) to 7 minutes with max_workers=50, for 200,000 URLs.
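One refinement worth noting (my addition, not part of the original answer): with 50 workers, the default urllib3 pool size of 10 connections per host can cause extra connections to be discarded instead of reused, so it may help to mount an adapter with a larger pool on the shared session:

import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
# let each of the 50 worker threads keep its own connection to the OSRM host alive
adapter = HTTPAdapter(pool_maxsize=50)
s.mount('https://', adapter)
s.mount('http://', adapter)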

Fetch multiple URLs with asyncio/aiohttp and retry for failures

I'm attempting to write some asynchronous GET requests with the aiohttp package, and have most of the pieces figured out, but am wondering what the standard approach is when handling the failures (returned as exceptions).
A general idea of my code so far (after some trial and error, I am following the approach here):
import asyncio
import aiofiles
import aiohttp
from pathlib import Path

with open('urls.txt', 'r') as f:
    urls = [s.rstrip() for s in f.readlines()]

async def fetch(session, url):
    async with session.get(url) as response:
        if response.status != 200:
            response.raise_for_status()
        data = await response.text()
    # (Omitted: some more URL processing goes on here)
    out_path = Path(f'out/')
    if not out_path.is_dir():
        out_path.mkdir()
    fname = url.split("/")[-1]
    async with aiofiles.open(out_path / f'{fname}.html', 'w+') as f:
        await f.write(data)

async def fetch_all(urls, loop):
    async with aiohttp.ClientSession(loop=loop) as session:
        results = await asyncio.gather(*[fetch(session, url) for url in urls],
                                       return_exceptions=True)
        return results

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    results = loop.run_until_complete(fetch_all(urls, loop))
Now this runs fine:
As expected, the results variable is populated with None entries where the corresponding URL (i.e. at the same index in the urls list, i.e. at the same line number in the input file urls.txt) was successfully requested, and the corresponding file is written to disk.
This means I can use the results variable to determine which URLs were not successful (those entries in results that are not None).
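(Concretely, I assume I could collect them with something like failed_urls = [url for url, res in zip(urls, results) if res is not None], since gather with return_exceptions=True puts the exception object at the failed URL's index.)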
I have looked at a few different guides to using the various asynchronous Python packages (aiohttp, aiofiles, and asyncio) but I haven't seen the standard way to handle this final step.
Should the retrying to send a GET request be done after the await statement has 'finished'/'completed'?
...or should the retrying to send a GET request be initiated by some sort of callback upon failure?
The errors look like this: ClientConnectorError(111, "Connect call failed ('000.XXX.XXX.XXX', 443)"), i.e. the request to IP address 000.XXX.XXX.XXX on port 443 failed, probably because there's some limit on the server side which I should respect by waiting with a timeout before retrying.
Is there some sort of limit I might consider putting on the requests, to batch them rather than attempting them all at once?
I am getting about 40-60 successful requests when attempting a few hundred (over 500) URLs in my list.
Naively, I was expecting run_until_complete to handle this in such a way that it would finish upon succeeding at requesting all URLs, but this isn't the case.
I haven't worked with asynchronous Python and sessions/loops before, so would appreciate any help to find how to get the results. Please let me know if I can give any more information to improve this question, thank you!
Should the retrying to send a GET request be done after the await statement has 'finished'/'completed'? ...or should the retrying to send a GET request be initiated by some sort of callback upon failure?
You can do the former. You don't need any special callback, since you are executing inside a coroutine, so a simple while loop will suffice, and won't interfere with execution of other coroutines. For example:
async def fetch(session, url):
    data = None
    while data is None:
        try:
            async with session.get(url) as response:
                response.raise_for_status()
                data = await response.text()
        except aiohttp.ClientError:
            # sleep a little and try again
            await asyncio.sleep(1)
    # (Omitted: some more URL processing goes on here)
    out_path = Path(f'out/')
    if not out_path.is_dir():
        out_path.mkdir()
    fname = url.split("/")[-1]
    async with aiofiles.open(out_path / f'{fname}.html', 'w+') as f:
        await f.write(data)
Naively, I was expecting run_until_complete to handle this in such a way that it would finish upon succeeding at requesting all URLs
The term "complete" is meant in the technical sense of a coroutine completing (running its course), which is achieved either by the coroutine returning or raising an exception.

Get url without blocking the function

I'm trying to fetch a resource from a URL inside a route handler on a web server without blocking it, since the fetch sometimes takes 11+ seconds.
I switched from flask to aiohttp for this.
async def process(request):
    data = await request.json()
    req = urllib.request.Request(
        request["resource_url"],
        data=None,
        headers=hdrs
    )
    # Do processing on the resource
But I'm not sure how to make the call, and would it allow other calls to be made to this route while the resource is being fetched?
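For reference, this is the kind of non-blocking call I think I'm after (just a sketch: I'm assuming aiohttp's ClientSession is the right tool here, that the URL comes from the JSON body as data["resource_url"], and that hdrs is the headers dict from the snippet above):

import aiohttp
from aiohttp import web

async def process(request):
    data = await request.json()
    # awaiting the GET suspends only this handler; other requests to the route
    # keep being served while the resource is downloading
    async with aiohttp.ClientSession() as session:
        async with session.get(data["resource_url"], headers=hdrs) as resp:
            body = await resp.read()
    # Do processing on the resource
    return web.Response(text="ok")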

Python: async programming and pool_maxsize with HTTPAdapter

What's the correct way to use HTTPAdapter with Async programming and calling out to a method? All of these requests are being made to the same domain.
I'm doing some async programming in Celery using eventlet and load-testing one of my sites. I have a method that I call out to which makes the request to the URL.
import requests
from requests.adapters import HTTPAdapter
from requests_html import HTMLSession

def get_session(url):
    # gets session, returns source
    headers, proxies = header_proxy()
    # set all of our necessary variables to None so that in the event of an error
    # we can make sure we don't break
    response = None
    status_code = None
    out_data = None
    content = None
    try:
        # we are going to use requests-html to be able to parse the
        # data upon the initial request
        with HTMLSession() as session:
            # you can swap out the original requests session here
            # session = requests.session()
            # passing the parameters to the session
            session.mount('https://', HTTPAdapter(max_retries=0, pool_connections=250, pool_maxsize=500))
            response = session.get(url, headers=headers, proxies=proxies)
            status_code = response.status_code
            try:
                # we are checking to see if we are getting a 403 error on all requests. If so,
                # we update the status code
                code = response.html.xpath('''//*[@id="accessDenied"]/p[1]/b/text()''')
                if code:
                    status_code = str(code[0][:-1])
                else:
                    pass
            except Exception as error:
                pass
                # print(error)
            # assign the content to content
            content = response.content
    except Exception as error:
        print(error)
        pass
If I leave out the pool_connections and pool_maxsize parameters and run the code, I get an error indicating that I do not have enough open connections. However, I don't want to open up an unnecessarily large number of connections if I don't need to.
Based on this: https://laike9m.com/blog/requests-secret-pool_connections-and-pool_maxsize,89/ I'm going to guess that these settings apply per host and not so much per async task. Therefore, I set the maximum to the maximum number of connections that can be reused per host. If I hit a domain several times, the connection is reused.
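To make that distinction concrete, here is a minimal sketch (my own illustration, with arbitrary numbers): pool_connections is how many per-host connection pools the adapter caches, while pool_maxsize is how many connections each of those pools may keep alive for reuse, which is what matters when many concurrent tasks hit the same domain.

import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# pool_connections: number of distinct host pools to cache (1 is enough for a single domain)
# pool_maxsize: number of live connections each host pool may keep for reuse
adapter = HTTPAdapter(pool_connections=1, pool_maxsize=100)
session.mount('https://', adapter)
session.mount('http://', adapter)

# every request to the same host now reuses connections from the same pool
# response = session.get('https://example.com/some/path')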
