What's the correct way to use HTTPAdapter with Async programming and calling out to a method? All of these requests are being made to the same domain.
I'm doing some async programming in Celery using eventlet and testing the load on one of my sites. I have a method that I call out to which makes the request to the url.
from requests.adapters import HTTPAdapter
from requests_html import HTMLSession

def get_session(url):
    # gets the session, returns the source
    headers, proxies = header_proxy()
    # set all of our necessary variables to None so that in the event of an error
    # we can make sure we don't break
    response = None
    status_code = None
    out_data = None
    content = None
    try:
        # we are going to use requests-html to be able to parse the
        # data upon the initial request
        with HTMLSession() as session:
            # you can swap in the original requests session here
            # session = requests.session()
            # mount the adapter and pass the parameters to the session
            session.mount('https://', HTTPAdapter(max_retries=0, pool_connections=250, pool_maxsize=500))
            response = session.get(url, headers=headers, proxies=proxies)
            status_code = response.status_code
            try:
                # we are checking to see if we are getting a 403 error on all requests. If so,
                # we update the status code
                code = response.html.xpath('''//*[@id="accessDenied"]/p[1]/b/text()''')
                if code:
                    status_code = str(code[0][:-1])
            except Exception as error:
                # print(error)
                pass
            # assign the response body to content
            content = response.content
    except Exception as error:
        print(error)
If I leave out the pool_connections and pool_maxsize parameters and run the code, I get an error indicating that I do not have enough open connections. However, I don't want to open up an unnecessarily large number of connections if I don't need to.
Based on this... https://laike9m.com/blog/requests-secret-pool_connections-and-pool_maxsize,89/ I'm going to guess that these settings apply per host and not so much per async task. Therefore, I set pool_maxsize to the maximum number of connections that can be reused per host. If I hit a domain several times, the connection is reused.
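For reference, here is a minimal sketch of how I understand the pooling to work, with placeholder pool sizes and URL: the adapter is mounted once on a shared session, and repeated requests to the same host reuse connections from that host's pool.

import requests
from requests.adapters import HTTPAdapter

# one shared session, created once and reused by every task
session = requests.Session()
# pool_connections: how many per-host connection pools to cache
# pool_maxsize: how many connections to keep open in each host's pool
session.mount('https://', HTTPAdapter(max_retries=0, pool_connections=1, pool_maxsize=100))

for _ in range(10):
    # all of these hit the same host, so they reuse connections
    # from that host's pool instead of opening new ones
    session.get('https://example.com/')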
I have the following session-dependent code which must be run continuously.
Code
import requests

http = requests.Session()

while True:
    # if http is not good, then run http = requests.Session() again
    response = http.get(....)
    # process response
    # wait for 5 seconds
Note: I moved the line http = requests.Session() out of the loop.
Issue
How to check if the session is working?
An example of a session that stops working might be after the web server is restarted, or when a load balancer redirects to a different web server.
The requests.Session object is just a persistence and connection-pooling object that allows shared state between different HTTP requests on the client side.
If the server unexpectedly closes a session so that it becomes invalid, it will probably respond to the next request with an error-indicating HTTP status code.
Thus requests would raise an error. See Errors and Exceptions:
All exceptions that Requests explicitly raises inherit from requests.exceptions.RequestException.
See the extended classes of RequestException.
Approach 1: implement open/close using try/except
Your code can catch such exceptions within a try/except-block.
How the server signals an invalidated/closed session depends on its API specification; that signal should be evaluated in the except block.
Here we use a session_was_closed(exception) function to evaluate the exception/response, and Session.close() to close the session correctly before opening a new one.
import requests

# initially open a session object
s = requests.Session()

# execute requests continuously
while True:
    try:
        response = s.get(....)
        # process response
    except requests.exceptions.RequestException as e:
        if session_was_closed(e):
            s.close()               # close the invalidated session
            s = requests.Session()  # open a new session
        else:
            # process non-session-related errors
            pass
    # wait for 5 seconds
Depending on the server responses in your case, implement the function session_was_closed(exception).
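For illustration, here is a hedged sketch of what session_was_closed might look like; the exception types and status codes that actually mean "session closed" depend on your server, so these are only assumptions:

import requests

def session_was_closed(exc):
    # assumption: dropped/reset connections indicate a closed session
    if isinstance(exc, (requests.exceptions.ConnectionError,
                        requests.exceptions.ChunkedEncodingError)):
        return True
    # if the server answered, inspect the response attached to the exception
    # (e.g. 401 or another server-specific code for an expired session)
    response = getattr(exc, 'response', None)
    if response is not None and response.status_code in (401,):
        return True
    return False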
Approach 2: automatically open/close using with
From Advanced Usage, Session Objects:
Sessions can also be used as context managers:
with requests.Session() as s:
    s.get('https://httpbin.org/cookies/set/sessioncookie/123456789')
This will make sure the session is closed as soon as the with block is exited, even if unhandled exceptions occurred.
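Applied to the continuous loop from the question, a minimal sketch (assuming you simply want a fresh session whenever a request-related error occurs; the URL is a placeholder) could look like this:

import time
import requests

while True:
    # each pass through the with-block gets a new session, which is closed
    # automatically even if an unhandled exception occurs
    with requests.Session() as s:
        try:
            while True:
                response = s.get('https://example.com')
                # process response
                time.sleep(5)  # wait for 5 seconds
        except requests.exceptions.RequestException:
            # fall through: the with-block closes the session,
            # and the outer loop opens a new one
            pass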
I would flip the logic and add a try-except.
import requests

http = requests.Session()

while True:
    try:
        response = http.get(....)
    except requests.exceptions.ConnectionError:
        http = requests.Session()
        continue
    # process response
    # wait for 5 seconds
See this answer for more info. I didn't test if the raised exception is that one, so please test it.
The API I'm sending requests to has a bit of an unusual format for its responses:
1. It always returns status_code = 200.
2. There's an additional error key inside the returned JSON that details the actual status of the response:
2.1. error = 0 means it completed successfully.
2.2. error != 0 means something went wrong.
I'm trying to use the Retry class in urllib3, but as far as I understand it only uses the status_code from the response, not its actual content.
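For context, a typical Retry setup looks roughly like the sketch below (the URL and status codes are placeholders); it can retry on specific HTTP status codes via status_forcelist, but it never looks inside the response body:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
session = requests.Session()
session.mount('https://', HTTPAdapter(max_retries=retry))

# Retries on the listed status codes, but a 200 response with
# {"error": 123} in its JSON body is still treated as a success.
response = session.get('https://api.example.com/endpoint')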
Are there any other options?
If I'm hearing you right, then there are two cases in which you have 'errors' to handle:
1. Any non-200 response from the web server (i.e. 500, 403, etc.).
2. Whenever the API returns a non-zero value for 'error' in the JSON response, since the server always responds with HTTP 200 even if your request is bad.
Given that we need to handle two completely different cases which trigger a retry, it'd be easier to write your own retry handler rather than trying to hack your way into this with the urllib3 library or similar, because then you can specify exactly which cases should trigger a retry.
You might try something like the approach below. It also tracks the number of requests made, so it can detect a repeated error case, and for both API response errors and HTTP errors it uses 'exponential backoff' (suggested via comments on my initial answer) so you don't constantly tax the server: each successive retry sleeps longer before retrying, until we reach a MAX_RETRY count. As written, that's a base of 1 second before the first retry, 2 seconds before the second, 4 seconds before the third, and so on, which lets the server catch up if it has to rather than being constantly over-taxed.
import requests
import time

MAX_RETRY = 5

def make_request():
    '''This makes a single request to the server to get data from it.'''
    # Replace 'get' with whichever method you're using, and the URL with the actual API URL
    r = requests.get('http://api.example.com')
    # If r.status_code is not 200, treat it as an error.
    if r.status_code != 200:
        raise RuntimeError(f"HTTP Response Code {r.status_code} received from server.")
    else:
        j = r.json()
        if j['error'] != 0:
            raise RuntimeError(f"API Error Code {j['error']} received from server.")
        else:
            return j

def request_with_retry(backoff_in_seconds=1):
    '''This retries a request up to MAX_RETRY set above with exponential backoff.'''
    attempts = 1
    while True:
        try:
            data = make_request()
            return data
        except RuntimeError as err:
            print(err)
            if attempts > MAX_RETRY:
                raise RuntimeError("Maximum number of attempts exceeded, aborting.")
            sleep = backoff_in_seconds * 2 ** (attempts - 1)
            print(f"Retrying request (attempt #{attempts}) in {sleep} seconds...")
            time.sleep(sleep)
            attempts += 1
Then you couple these two functions together with the following, which actually attempts to get data from the API server and then either fails hard or lets you do something with the data if no errors were encountered:
# This code actually *calls* these functions which contain the request with retry and
# exponential backoff *and* the individual request process for a single request.
try:
    data = request_with_retry()
except RuntimeError as err:
    print(err)
    exit(1)
After that code, you can just 'do something' with data, which is the JSON(?) output of your API, even if this part lives in another function. You just need the two dependent functions (done this way to reduce code duplication).
I have a piece of code which checks whether domains from a list host a website or not.
I'm running 100 parallel tasks which consume the domains from a queue.
The issue I'm facing is that I get false-negative Cannot connect to host errors on some domains, while the same domains actually produce a valid 200 HTTP response when processed individually using the exact same code.
Here's a cleaned-up version of the code I use to do the actual call:
import socket
import aiohttp

def get_session():
    connector = aiohttp.TCPConnector(ssl=False, family=socket.AF_INET,
                                     resolver=aiohttp.AsyncResolver(timeout=5))
    return aiohttp.ClientSession(connector=connector)

async def ping(url, session):
    result = PingResult()
    try:
        async with session.get(url, timeout=timeout, headers=headers) as r:
            result.status_code = r.status
            result.redirect = r.headers['location'] if 'location' in r.headers else None
    except BaseException as e:
        result.exception = classify_exception(e)
    return result
When it's called, it gets the session returned by get_session() as a parameter (all tasks share the same session; I also tried one session per URL, which didn't work either):
async with get_session() as session:
    await ping(url, session)
(PingResult, classify_exception, headers and timeout are defined elsewhere.)
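For readers who want to run the snippet, hypothetical minimal stand-ins for those names could look like this; they are not my actual definitions, just enough to make the example self-contained:

from dataclasses import dataclass
from typing import Optional

headers = {'User-Agent': 'ping-checker'}  # placeholder headers
timeout = 10                              # placeholder timeout in seconds

@dataclass
class PingResult:
    status_code: Optional[int] = None
    redirect: Optional[str] = None
    exception: Optional[str] = None

def classify_exception(exc: BaseException) -> str:
    # simply record the exception type and message
    return f"{type(exc).__name__}: {exc}"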
I'm using uvloop and aiodns, and running it on Ubuntu 18.04.
Is there a reason why this code should run fine when executed alone, but sometimes fail with Cannot connect to host when run in multiple tasks?
I'm currently taking a web scraping class with other students, and we are supposed to make 'get' requests to a dummy site, parse it, and visit another site.
The problem is that the content of the dummy site is only up for several minutes before it disappears, and then comes back up at a certain interval. While the content is available, everyone tries to make 'get' requests at once, so mine just hangs until the others clear up, by which point the content has disappeared again. So I end up not being able to successfully make the 'get' request:
import requests
from splinter import Browser
browser = Browser('chrome')
# Hangs here
requests.get('http://dummysite.ca').text
# Even if the get is successful, it hangs here as well
browser.visit(parsed_url)
So my question is, what's the fastest/best way to make endless concurrent 'get' requests until I get a response?
Decide to use either requests or splinter
Read about Requests: HTTP for Humans
Read about Splinter
Related
Read about keep-alive
Read about blocking-or-non-blocking
Read about timeouts
Read about errors-and-exceptions
If you are able to get requests that don't hang, you can think about repeating the request, for instance:
import time
import requests

while True:
    response = requests.get('http://dummysite.ca')  # URL taken from the question
    if response.status_code == 200:                 # request was successful
        break
    time.sleep(1)
Gevent provides a framework for running asynchronous network requests.
It can patch Python's standard library so that existing libraries like requests and splinter work out of the box.
Here is a short example of how to make 10 concurrent requests, based on the above code, and get their responses.
from gevent import monkey
monkey.patch_all()

import gevent.pool
import requests

pool = gevent.pool.Pool(size=10)
greenlets = [pool.spawn(requests.get, 'http://dummysite.ca')
             for _ in range(10)]

# Wait for all requests to complete
pool.join()

for greenlet in greenlets:
    # This will raise any exceptions raised by the request.
    # Need to catch errors, or check if an exception was
    # thrown by checking `greenlet.exception`.
    response = greenlet.get()
    text_response = response.text
You could also use map and a response-handling function instead of get.
See gevent documentation for more information.
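As a hedged sketch of that map variant (same placeholder URL as above, and a made-up handler name):

from gevent import monkey
monkey.patch_all()

import gevent.pool
import requests

def handle_response(url):
    # one request per URL; each call runs inside its own greenlet
    response = requests.get(url)
    return response.text

pool = gevent.pool.Pool(size=10)
urls = ['http://dummysite.ca'] * 10

# map blocks until every greenlet has finished and returns the results in order
texts = pool.map(handle_response, urls)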
In this situation, concurrency will not help much, since the server seems to be the limiting factor. One solution is to send a request with a timeout; if the timeout is exceeded, try the request again after a few seconds, and gradually increase the time between retries until you get the data that you want. For instance, your code might look like this:
import time
import requests

def get_content(url, timeout):
    # raises a Timeout exception if more than `timeout` seconds have passed
    resp = requests.get(url, timeout=timeout)
    # raise a generic exception if the request is unsuccessful
    if resp.status_code != 200:
        raise LookupError('status is not 200')
    return resp.content

timeout = 5  # seconds
retry_interval = 0
max_retry_interval = 120

while True:
    try:
        response = get_content('https://example.com', timeout=timeout)
        retry_interval = 0  # reset retry interval after success
        break
    except (LookupError, requests.exceptions.Timeout):
        retry_interval += 10
        if retry_interval > max_retry_interval:
            retry_interval = max_retry_interval
        time.sleep(retry_interval)

# process response
If concurrency is required, consider the Scrapy project. It uses the Twisted framework. In Scrapy you can replace time.sleep with reactor.callLater(delay, fn, *args, **kw) or use one of hundreds of middleware plugins.
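Outside of Scrapy, a minimal sketch of the callLater idea with plain Twisted (the fetch itself is left as a comment, since how you issue the request is up to you):

from twisted.internet import reactor

RETRY_DELAY = 5  # seconds between polls

def poll():
    # ... issue the request and process the response here ...
    # schedule the next poll instead of blocking with time.sleep
    reactor.callLater(RETRY_DELAY, poll)

reactor.callLater(0, poll)  # kick off the first poll once the reactor starts
reactor.run()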
From the documentation for requests:
If the remote server is very slow, you can tell Requests to wait
forever for a response, by passing None as a timeout value and then
retrieving a cup of coffee.
import requests

# Wait potentially forever
r = requests.get('http://dummysite.ca', timeout=None)

# Check the status code to see how the server is handling the request
print(r.status_code)
Status codes beginning with 2 mean the request was received, understood, and accepted. 200 means the request was a success and the information was returned, while 503 means the server is overloaded or undergoing maintenance.
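As a hedged illustration of acting on those codes (this assumes a Retry-After header given in seconds, which the server may or may not send):

import time
import requests

r = requests.get('http://dummysite.ca', timeout=None)

if 200 <= r.status_code < 300:
    # success: the body is available
    content = r.text
elif r.status_code == 503:
    # overloaded or maintenance: honour Retry-After if the server provides it
    wait = int(r.headers.get('Retry-After', 30))
    time.sleep(wait)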
Requests used to include a module called async which could send concurrent requests. It is now an independent module named grequests
which you can use to make concurrent requests endlessly until a 200 response:
import grequests

urls = [
    'http://python-requests.org',  # Just include one URL if you want
    'http://httpbin.org',
    'http://python-guide.org',
    'http://kennethreitz.com'
]

def keep_going():
    rs = (grequests.get(u) for u in urls)  # Make a set of unsent Requests
    out = grequests.map(rs)                # Send them all at the same time
    for i in out:
        # failed requests come back from map() as None, so guard against that
        if i is not None and i.status_code == 200:
            print(i.text)
            del urls[out.index(i)]  # If we have the content, delete the URL
            return

while urls:
    keep_going()
We have a custom module in which we have redefined the open, seek, read and tell functions to read only a part of a file, according to the arguments.
However, this logic overrides the default tell, and python-requests tries to calculate the Content-Length, which involves using tell(). That call is redirected to our custom tell function, whose logic is buggy somewhere and returns a wrong value; when I tried some changes, it throws an error.
I found the following in requests' models.py:
def prepare_content_length(self, body):
    if hasattr(body, 'seek') and hasattr(body, 'tell'):
        body.seek(0, 2)
        self.headers['Content-Length'] = builtin_str(body.tell())
        body.seek(0, 0)
    elif body is not None:
        l = super_len(body)
        if l:
            self.headers['Content-Length'] = builtin_str(l)
    elif (self.method not in ('GET', 'HEAD')) and (self.headers.get('Content-Length') is None):
        self.headers['Content-Length'] = '0'
For now, I am not able to figure out where the bug is, and I'm too stretched to investigate further and fix it. Everything else works except the Content-Length calculation by python-requests.
So I have created my own function for finding the Content-Length and included the value in the request headers. But requests still prepares the Content-Length itself and throws an error.
How can I stop requests from preparing the Content-Length and make it use the value I specify?
Requests lets you modify a request before sending. See Prepared Requests.
For example:
from requests import Request, Session
s = Session()
req = Request('POST', url, data=data, headers=headers)
prepped = req.prepare()
# do something with prepped.headers
prepped.headers['Content-Length'] = your_custom_content_length_calculation()
resp = s.send(prepped, ...)
If your session has its own configuration (like cookie persistence or connection-pooling), then you should use s.prepare_request(req) instead of req.prepare().
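A hedged sketch of that session-aware variant (url, data, headers and the custom length function are the same placeholders as in the example above):

from requests import Request, Session

s = Session()

req = Request('POST', url, data=data, headers=headers)
# let the session merge in its own cookies, auth, etc. while preparing
prepped = s.prepare_request(req)

# override the Content-Length computed by requests with our own value
prepped.headers['Content-Length'] = your_custom_content_length_calculation()

resp = s.send(prepped)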