Retry with Python requests when status_code = 200

The API I'm sending requests to has a bit of an unusual format for its responses:
It always returns status_code = 200.
There's an additional error key inside the returned JSON that details the actual status of the response:
error = 0 means the request completed successfully
error != 0 means something went wrong
I'm trying to use the Retry class in urllib3, but as far as I understand it only looks at the status_code of the response, not its actual content.
Are there any other options?

If I'm hearing you right, then there are two cases in which you have 'errors' to handle:
Any non-200 response from the web server (e.g. 500, 403, etc.)
Any non-zero value for 'error' in the JSON response, since the server always responds with HTTP 200 even if your request failed.
Given that two completely different cases trigger a retry, it's easier to write your own retry handler than to hack around this with the urllib3 library or similar, because you can state exactly which cases should cause a retry.
You might try something like the approach below. It also tracks the number of requests you've made, so a repeated error eventually aborts, and for both API errors and HTTP errors it uses an exponential backoff between retries (suggested via comments on my initial answer) so you don't constantly tax the server: each successive retry waits longer before trying again, up to a MAX_RETRY count. As written, the base increment is 1 second before the first retry attempt, 2 seconds before the second retry, 4 seconds before the third, and so on, which gives the server a chance to catch up if it needs to rather than being constantly over-taxed.
import requests
import time

MAX_RETRY = 5

def make_request():
    '''Make a single request to the server and return the decoded JSON.'''
    # Replace 'get' with whichever method you're using, and the URL with the actual API URL
    r = requests.get('http://api.example.com')
    # If r.status_code is not 200, treat it as an error.
    if r.status_code != 200:
        raise RuntimeError(f"HTTP Response Code {r.status_code} received from server.")
    else:
        j = r.json()
        if j['error'] != 0:
            raise RuntimeError(f"API Error Code {j['error']} received from server.")
        else:
            return j

def request_with_retry(backoff_in_seconds=1):
    '''Retry a request up to MAX_RETRY times, set above, with exponential backoff.'''
    attempts = 1
    while True:
        try:
            data = make_request()
            return data
        except RuntimeError as err:
            print(err)
            if attempts > MAX_RETRY:
                raise RuntimeError("Maximum number of attempts exceeded, aborting.")
            sleep = backoff_in_seconds * 2 ** (attempts - 1)
            print(f"Retrying request (attempt #{attempts}) in {sleep} seconds...")
            time.sleep(sleep)
            attempts += 1
Then, you couple these two functions together with the following to actually attempt to get data from the API server, and either fail hard or do something with the data if no errors were encountered:
# This code actually *calls* these functions, which contain the request with retry and
# exponential backoff *and* the individual request process for a single request.
try:
    data = request_with_retry()
except RuntimeError as err:
    print(err)
    exit(1)
After that code, you can just 'do something' with data, which is the JSON output of your API, even if this part lives in another function. You only need the two functions above (split this way to reduce code duplication).
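For instance, a minimal sketch of that 'do something' step might look like this (the 'result' key is hypothetical; substitute whatever fields your API actually returns):
# data is the decoded JSON returned by request_with_retry() above.
# 'result' is a hypothetical key; use the fields your API actually returns.
for item in data.get('result', []):
    print(item)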

Related

Python: async programming and pool_maxsize with HTTPAdapter

What's the correct way to use HTTPAdapter with Async programming and calling out to a method? All of these requests are being made to the same domain.
I'm doing some async programming in Celery using eventlet and testing the load on one of my sites. I have a method that I call out to which makes the request to the url.
from requests.adapters import HTTPAdapter
from requests_html import HTMLSession

def get_session(url):
    # gets session, returns source
    headers, proxies = header_proxy()
    # set all of our necessary variables to None so that in the event of an error
    # we can make sure we don't break
    response = None
    status_code = None
    out_data = None
    content = None
    try:
        # we are going to use requests-html to be able to parse the
        # data upon the initial request
        with HTMLSession() as session:
            # you can swap out the original requests session here
            # session = requests.session()
            # passing the parameters to the session
            session.mount('https://', HTTPAdapter(max_retries=0, pool_connections=250, pool_maxsize=500))
            response = session.get(url, headers=headers, proxies=proxies)
            status_code = response.status_code
            try:
                # we are checking to see if we are getting a 403 error on all requests. If so,
                # we update the status code
                code = response.html.xpath('''//*[@id="accessDenied"]/p[1]/b/text()''')
                if code:
                    status_code = str(code[0][:-1])
                else:
                    pass
            except Exception as error:
                pass
                # print(error)
            # assign the content to content
            content = response.content
    except Exception as error:
        print(error)
        pass
If I leave out the pool_connections and pool_maxsize parameters and run the code, I get an error indicating that I do not have enough open connections. However, I don't want to unnecessarily open up a large number of connections if I don't need to.
Based on this: https://laike9m.com/blog/requests-secret-pool_connections-and-pool_maxsize,89/ I'm going to guess that these settings apply per host rather than per async task. Therefore, I set the maximum to the maximum number of connections that can be reused per host. If I hit a domain several times, the connection is reused.
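A minimal sketch of that idea, assuming all requests go to a single host (the pool sizes here are illustrative, not recommendations):
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# pool_connections: number of connection pools to cache (roughly one per host).
# pool_maxsize: number of connections kept alive in each pool for reuse.
session.mount('https://', HTTPAdapter(pool_connections=1, pool_maxsize=100))

# Repeated requests to the same host reuse connections from the pool.
for _ in range(10):
    session.get('https://example.com')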

Python + requests + splinter: What's the fastest/best way to make multiple concurrent 'get' requests?

I'm currently taking a web scraping class with other students, and we are supposed to make ‘get’ requests to a dummy site, parse it, and visit another site.
The problem is that the content of the dummy site is only up for several minutes before it disappears, and then it comes back up at a certain interval. During the time the content is available, everyone tries to make their ‘get’ requests, so mine just hangs until everyone clears out, and by then the content has disappeared. So I end up not being able to successfully make the ‘get’ request:
import requests
from splinter import Browser
browser = Browser('chrome')
# Hangs here
requests.get('http://dummysite.ca').text
# Even if get is successful hangs here as well
browser.visit(parsed_url)
So my question is, what's the fastest/best way to make endless concurrent 'get' requests until I get a response?
Decide to use either requests or splinter
Read about Requests: HTTP for Humans
Read about Splinter
Related reading:
Read about keep-alive
Read about blocking-or-non-blocking
Read about timeouts
Read about errors-and-exceptions
If you are able to get requests that don't hang, you can think about repeating the request until it succeeds, for instance:
import time
import requests

while True:
    response = requests.get(...)  # fill in your URL and arguments
    if response.status_code == 200:  # request was successful
        break
    time.sleep(1)
Gevent provides a framework for running asynchronous network requests.
It can patch Python's standard library so that existing libraries like requests and splinter work out of the box.
Here is a short example of how to make 10 concurrent requests, based on the above code, and get their responses.
from gevent import monkey
monkey.patch_all()

import gevent.pool
import requests

pool = gevent.pool.Pool(size=10)
greenlets = [pool.spawn(requests.get, 'http://dummysite.ca')
             for _ in range(10)]

# Wait for all requests to complete
pool.join()

for greenlet in greenlets:
    # This will raise any exceptions raised by the request.
    # Need to catch errors, or check if an exception was
    # thrown by checking `greenlet.exception`.
    response = greenlet.get()
    text_response = response.text
You could also use map and a response-handling function instead of get.
See gevent documentation for more information.
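For example, a minimal sketch of that map-based variant (errors are handled inside the handler so one failure does not abort the whole batch; the URL is the same dummy site as above):
from gevent import monkey
monkey.patch_all()

import gevent.pool
import requests

def handle(url):
    # Catch errors here so a single failed request doesn't abort the whole batch.
    try:
        return requests.get(url).text
    except requests.RequestException:
        return None

pool = gevent.pool.Pool(size=10)
# map blocks until all greenlets finish and returns their results in order.
texts = pool.map(handle, ['http://dummysite.ca'] * 10)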
In this situation, concurrency will not help much since the server seems to be the limiting factor. One solution is to send a request with a timeout; if the timeout is exceeded, try the request again after a few seconds, gradually increasing the time between retries until you get the data you want. For instance, your code might look like this:
import time
import requests

def get_content(url, timeout):
    # raises a Timeout exception if more than `timeout` seconds have passed
    resp = requests.get(url, timeout=timeout)
    # raise a generic exception if the request is unsuccessful
    if resp.status_code != 200:
        raise LookupError('status is not 200')
    return resp.content

timeout = 5  # seconds
retry_interval = 0
max_retry_interval = 120

while True:
    try:
        response = get_content('https://example.com', timeout=timeout)
        retry_interval = 0  # reset retry interval after success
        break
    except (LookupError, requests.exceptions.Timeout):
        retry_interval += 10
        if retry_interval > max_retry_interval:
            retry_interval = max_retry_interval
        time.sleep(retry_interval)

# process response
If concurrency is required, consider the Scrapy project. It uses the Twisted framework. In Scrapy you can replace time.sleep with reactor.callLater(fn, *args, **kw) or use one of hundreds of middleware plugins.
From the documentation for requests:
If the remote server is very slow, you can tell Requests to wait
forever for a response, by passing None as a timeout value and then
retrieving a cup of coffee.
import requests

# Wait potentially forever
r = requests.get('http://dummysite.ca', timeout=None)
# Check the status code to see how the server is handling the request
print(r.status_code)
Status codes beginning with 2 mean the request was received, understood, and accepted. 200 means the request was a success and the information was returned. But 503 means the server is overloaded or undergoing maintenance.
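A minimal sketch of acting on that status code, assuming you simply want to keep polling while the server responds with 503:
import time
import requests

while True:
    r = requests.get('http://dummysite.ca', timeout=None)
    if r.status_code == 200:
        break  # success; r.text holds the content
    # 503 (or anything else): the server is overloaded or in maintenance,
    # so back off before retrying. If the server sends a Retry-After header,
    # you could honour that instead of a fixed delay.
    time.sleep(10)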
Requests used to include a module called async which could send concurrent requests. It is now an independent module named grequests
which you can use to make concurrent requests endlessly until a 200 response:
import grequests

urls = [
    'http://python-requests.org',  # Just include one url if you want
    'http://httpbin.org',
    'http://python-guide.org',
    'http://kennethreitz.com'
]

def keep_going():
    rs = (grequests.get(u) for u in urls)  # Make a set of unsent Requests
    out = grequests.map(rs)  # Send them all at the same time
    for i in out:
        if i.status_code == 200:
            print(i.text)
            del urls[out.index(i)]  # If we have the content, delete the URL
            return

while urls:
    keep_going()

Mock an HTTP request that times out with HTTPretty

Using the HTTPretty library for Python, I can create mock HTTP responses of choice and then pick them up, e.g. with the requests library, like so:
import httpretty
import requests

# set up a mock
httpretty.enable()
httpretty.register_uri(
    method=httpretty.GET,
    uri='http://www.fakeurl.com',
    status=200,
    body='My Response Body'
)

response = requests.get('http://www.fakeurl.com')

# clean up
httpretty.disable()
httpretty.reset()

print(response)
Out: <Response [200]>
Is it also possible to register a URI which cannot be reached (e.g. connection timed out, connection refused, ...) such that no response is received at all (which is not the same as an established connection that returns an HTTP error code like 404)?
I want to use this behaviour in unit testing to ensure that my error handling works as expected (it does different things in the case of 'no connection established' versus 'connection established, bad HTTP status code'). As a workaround, I could try to connect to an invalid server like http://192.0.2.0, which would time out in any case. However, I would prefer to do all my unit testing without using any real network connections.
Meanwhile I got it: using an HTTPretty callback body seems to produce the desired behaviour. See the inline comments below.
This is actually not exactly what I was looking for (it is not a server that cannot be reached and hence the request times out, but a server that throws a timeout exception once it is reached); however, the effect is the same for my use case.
Still, if anybody knows a different solution, I'm looking forward to it.
import httpretty
import requests

# enable HTTPretty
httpretty.enable()

# create a callback body that raises an exception when opened
def exceptionCallback(request, uri, headers):
    # raise your favourite exception here, e.g. requests.ConnectionError or requests.Timeout
    raise requests.Timeout('Connection timed out.')

# set up a mock and use the callback function as the response's body
httpretty.register_uri(
    method=httpretty.GET,
    uri='http://www.fakeurl.com',
    status=200,
    body=exceptionCallback
)

# try to get a response from the mock server and catch the exception
try:
    response = requests.get('http://www.fakeurl.com')
except requests.Timeout as e:
    print('requests.Timeout exception got caught...')
    print(e)
    # do whatever...

# clean up
httpretty.disable()
httpretty.reset()
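The same callback trick can be used to exercise the 'no connection established' branch by raising requests.ConnectionError instead, as the comment above suggests; a minimal sketch:
import httpretty
import requests

def connectionErrorCallback(request, uri, headers):
    # simulate a server that cannot be reached at all
    raise requests.ConnectionError('Connection refused.')

httpretty.enable()
httpretty.register_uri(
    method=httpretty.GET,
    uri='http://www.fakeurl.com',
    body=connectionErrorCallback
)

try:
    requests.get('http://www.fakeurl.com')
except requests.ConnectionError:
    print('requests.ConnectionError exception got caught...')

# clean up
httpretty.disable()
httpretty.reset()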

Avoiding ChunkedEncodingError for an empty chunk with Requests 2.3.0

I'm using Requests to download a file (several gigabytes) from a server. To provide progress updates (and to prevent the entire file from having to be stored in memory) I've set stream=True and written the download to a file:
with open('output', 'w') as f:
    response = requests.get(url, stream=True)
    if not response.ok:
        print('There was an error')
        exit()
    for block in response.iter_content(1024 * 100):
        f.write(block)
        completed_bytes += len(block)
        write_progress(completed_bytes, total_bytes)
However, at some random point in the download, Requests throws a ChunkedEncodingError. I've gone into the source and found that this corresponds to an IncompleteRead exception. I inserted a log statement around those lines and found that e.partial = "\r". I know that the server gives the downloads low priority and I suspect that this exception occurs when the server waits too long to send the next chunk.
As is expected, the exception stops the download. Unfortunately, the server does not implement HTTP/1.1's content ranges, so I cannot simply resume it. I've played around with increasing urllib3's internal timeout, but the exception still persists.
Is there any way to make the underlying urllib3 (or Requests) more tolerant of these empty (or late) chunks so that the file can download completely?
import httplib

def patch_http_response_read(func):
    def inner(*args):
        try:
            return func(*args)
        except httplib.IncompleteRead as e:
            return e.partial
    return inner

httplib.HTTPResponse.read = patch_http_response_read(httplib.HTTPResponse.read)
I cannot reproduce your problem right now, but I think this patch could help. It allows you to deal with defective HTTP servers.
Most bad servers transmit all the data, but due to implementation errors they close the session incorrectly, and httplib raises an error and buries your precious bytes.
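If you would rather not monkey-patch httplib, a sketch of the equivalent idea at the requests level is to catch the ChunkedEncodingError around the download loop from the question and keep whatever was written so far (url as in the question; without HTTP range support you cannot resume, only keep the partial file):
import requests

completed_bytes = 0
with open('output', 'wb') as f:
    response = requests.get(url, stream=True)  # url as in the question
    try:
        for block in response.iter_content(1024 * 100):
            f.write(block)
            completed_bytes += len(block)
    except requests.exceptions.ChunkedEncodingError:
        # The server ended the chunked stream early; keep the partial file.
        print('Download ended early after %d bytes' % completed_bytes)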

Can I set max_retries for requests.request?

The Python requests module is simple and elegant but one thing bugs me.
It is possible to get a requests.exception.ConnectionError with a message like:
Max retries exceeded with url: ...
This implies that requests can attempt to access the data several times. But there is not a single mention of this possibility anywhere in the docs. Looking at the source code I didn't find any place where I could alter the default (presumably 0) value.
So is it possible to somehow set the maximum number of retries for requests?
This will not only change the max_retries but also enable a backoff strategy which makes requests to all http:// addresses sleep for a period of time before retrying (to a total of 5 times):
import requests
from requests.adapters import HTTPAdapter, Retry

s = requests.Session()

retries = Retry(total=5,
                backoff_factor=0.1,
                status_forcelist=[500, 502, 503, 504])

s.mount('http://', HTTPAdapter(max_retries=retries))

s.get('http://httpstat.us/500')
As per documentation for Retry: if the backoff_factor is 0.1, then sleep() will sleep for [0.05s, 0.1s, 0.2s, 0.4s, ...] between retries. It will also force a retry if the status code returned is 500, 502, 503 or 504.
Various other options to Retry allow for more granular control:
total – Total number of retries to allow.
connect – How many connection-related errors to retry on.
read – How many times to retry on read errors.
redirect – How many redirects to perform.
method_whitelist – Set of uppercased HTTP method verbs that we should retry on.
status_forcelist – A set of HTTP status codes that we should force a retry on.
backoff_factor – A backoff factor to apply between attempts.
raise_on_redirect – Whether, if the number of redirects is exhausted, to raise a MaxRetryError, or to return a response with a response code in the 3xx range.
raise_on_status – Similar meaning to raise_on_redirect: whether we should raise an exception, or return a response, if status falls in status_forcelist range and retries have been exhausted.
NB: raise_on_status is relatively new and has not yet made it into a release of urllib3 or requests. The raise_on_status keyword argument appears to have made it into the standard library at most in Python version 3.6.
To make requests retry on specific HTTP status codes, use status_forcelist. For example, status_forcelist=[503] will retry on status code 503 (service unavailable).
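For instance, a minimal sketch retrying only on 503 (httpstat.us is used here purely as a test endpoint, as in the earlier snippet):
import requests
from requests.adapters import HTTPAdapter, Retry

s = requests.Session()
s.mount('http://', HTTPAdapter(max_retries=Retry(total=3, status_forcelist=[503])))
# Each 503 response is retried up to 3 times; once the retries are exhausted,
# requests raises an exception (or returns the response, depending on raise_on_status).
s.get('http://httpstat.us/503')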
By default, the retry only fires for these conditions:
Could not get a connection from the pool.
TimeoutError
HTTPException raised (from http.client in Python 3, httplib otherwise). These seem to be low-level HTTP exceptions, like a URL or protocol not formed correctly.
SocketError
ProtocolError
Notice that these are all exceptions that prevent a regular HTTP response from being received. If any regular response is generated, no retry is done. Without using the status_forcelist, even a response with status 500 will not be retried.
To make it behave in a manner which is more intuitive for working with a remote API or web server, I would use the above code snippet, which forces retries on statuses 500, 502, 503 and 504, all of which are not uncommon on the web and (possibly) recoverable given a big enough backoff period.
It is the underlying urllib3 library that does the retrying. To set a different maximum retry count, use alternative transport adapters:
import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
s.mount('http://stackoverflow.com', HTTPAdapter(max_retries=5))
The max_retries argument takes an integer or a Retry() object; the latter gives you fine-grained control over what kinds of failures are retried (an integer value is turned into a Retry() instance which only handles connection failures; errors after a connection is made are by default not handled as these could lead to side-effects).
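A small sketch of that distinction, as I understand it (the hosts and the Retry configuration shown are illustrative only):
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

s = requests.Session()

# An integer only retries connection failures (errors before a connection is made).
s.mount('https://one.example.com', HTTPAdapter(max_retries=5))

# A Retry() object gives fine-grained control, e.g. also retrying read errors.
s.mount('https://two.example.com', HTTPAdapter(max_retries=Retry(connect=5, read=2)))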
Old answer, predating the release of requests 1.2.1:
The requests library doesn't really make this configurable, nor does it intend to (see this pull request). Currently (requests 1.1), the retries count is set to 0. If you really want to set it to a higher value, you'll have to set this globally:
import requests
requests.adapters.DEFAULT_RETRIES = 5
This constant is not documented; use it at your own peril as future releases could change how this is handled.
Update: and this did change; in version 1.2.1 the option to set the max_retries parameter on the HTTPAdapter() class was added, so that now you have to use alternative transport adapters, see above. The monkey-patch approach no longer works, unless you also patch the HTTPAdapter.__init__() defaults (very much not recommended).
Be careful, Martijn Pieters's answer isn't suitable for version 1.2.1+. You can't set it globally without patching the library.
You can do this instead:
import requests
from requests.adapters import HTTPAdapter
s = requests.Session()
s.mount('http://www.github.com', HTTPAdapter(max_retries=5))
s.mount('https://www.github.com', HTTPAdapter(max_retries=5))
After struggling a bit with some of the answers here, I found a library called backoff that worked better for my situation. A basic example:
import backoff
import requests

@backoff.on_exception(
    backoff.expo,
    requests.exceptions.RequestException,
    max_tries=5,
    giveup=lambda e: e.response is not None and e.response.status_code < 500
)
def publish(self, data):
    r = requests.post(url, timeout=10, json=data)
    r.raise_for_status()
I'd still recommend giving the library's native functionality a shot, but if you run into any problems or need broader control, backoff is an option.
A cleaner way to gain higher control might be to package the retry stuff into a function and make that function retriable using a decorator and whitelist the exceptions.
I have created the same here:
http://www.praddy.in/retry-decorator-whitelisted-exceptions/
Reproducing the code from that link:
import time
import functools

def retry(exceptions, delay=0, times=2):
    """
    A decorator for retrying a function call with a specified delay in case of a set of exceptions.

    Parameter List
    --------------
    :param exceptions: A tuple of all exceptions that need to be caught for retry,
                       e.g. retry(exceptions=(Timeout, ReadTimeout))
    :param delay: Amount of delay (seconds) needed between successive retries.
    :param times: Number of times the function should be retried.
    """
    def outer_wrapper(function):
        @functools.wraps(function)
        def inner_wrapper(*args, **kwargs):
            final_excep = None
            for counter in xrange(times):
                if counter > 0:
                    time.sleep(delay)
                final_excep = None
                try:
                    value = function(*args, **kwargs)
                    return value
                except exceptions as e:
                    final_excep = e
                    pass  # or log it
            if final_excep is not None:
                raise final_excep
        return inner_wrapper
    return outer_wrapper

@retry(exceptions=(TimeoutError, ConnectTimeoutError), delay=0, times=3)
def call_api():
    ...  # your request code goes here
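As a hypothetical usage example (the endpoint and exception types are placeholders; substitute your own):
import requests

@retry(exceptions=(requests.exceptions.Timeout,), delay=1, times=3)
def call_api():
    # Any Timeout here triggers up to 3 attempts, 1 second apart.
    return requests.get('http://api.example.com', timeout=5).json()

data = call_api()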
You can use the requests library to accomplish this all in one go.
The following code will retry 3 times if you receive a 429, 500, 502, 503 or 504 status code, each time with a longer delay set through backoff_factor. See https://findwork.dev/blog/advanced-usage-python-requests-timeouts-retries-hooks/ for a nice tutorial.
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

retry_strategy = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504],
    method_whitelist=["HEAD", "GET", "OPTIONS"]
)
adapter = HTTPAdapter(max_retries=retry_strategy)

http = requests.Session()
http.mount("https://", adapter)
http.mount("http://", adapter)

response = http.get("https://en.wikipedia.org/w/api.php")
