How to avoid having many requests refused when using the Python requests package? - python

I use the requests package to fetch an earthquake catalog from a website: the ISC earthquake bulletin.
When the result table is small, all is good. But when it comes to a massive search, or a loop search (i.e., setting different parameters and running the request in a loop), it returns no data, only:
Sorry, but your request cannot be processed at the present time. Please try again in a few minutes.
Can anyone tell me how I can avoid having so many requests refused?
Here's my script:
# import package
import requests

url = 'http://www.isc.ac.uk/cgi-bin/web-db-v4?iscreview=on&out_format=CSV&ttime=on&ttres=on&tdef=on&amps=on&phaselist=&stnsearch=STN&sta_list=CLC&stn_ctr_lat=&stn_ctr_lon=&stn_radius=&max_stn_dist_units=deg&stn_top_lat=&stn_bot_lat=&stn_left_lon=&stn_right_lon=&stn_srn=&stn_grn=&bot_lat=&top_lat=&left_lon=&right_lon=&ctr_lat=&ctr_lon=&radius=&max_dist_units=deg&searchshape=GLOBAL&srn=&grn=&start_year=2009&start_month=7&start_day=01&start_time=00%3A00%3A00&end_year=2019&end_month=8&end_day=01&end_time=00%3A00%3A00&min_dep=&max_dep=&min_mag=6.0&max_mag=6.9&req_mag_type=Any&req_mag_agcy=Any&include_links=on&request=STNARRIVALS'

r = requests.get(url)
print(r.text)

Use the Retry mechanism of HTTPAdapter to automatically re-send the request when a temporary failure happens. Some settings you may be interested in:
total - Total number of retries to allow. If the limit is reached without a successful response, then the request is considered a failure.
backoff_factor - Since failed requests usually happen when the server is under load, it is beneficial to add a delay between retries so that the server can breathe. The delay grows exponentially: think of the 1st retry happening after 1 second, the 2nd after 2 seconds, the 3rd after 4 seconds, the 4th after 8 seconds, the 5th after 16 seconds, and so on, up to the configured BACKOFF_MAX.
allowed_methods - The HTTP methods that you want to retry.
status_forcelist - The HTTP status codes that should be retried. Commonly, this is the 5xx series since those are the errors that originated from the server and might be successful if retried.
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

retry_strategy = Retry(
    total=5,
    backoff_factor=0.1,
    allowed_methods=["GET"],
    status_forcelist=[500, 502, 504],
)
adapter = HTTPAdapter(max_retries=retry_strategy)

http = requests.Session()
http.mount("https://", adapter)
http.mount("http://", adapter)

response = http.get("https://google.com")
print(response.status_code)
In this example, if the response still fails, we know that a series of 5 retries was made, spaced out over time by a backoff factor of 0.1, and that all of them failed. But if the server failure isn't persistent, it is highly likely that one of those retries will succeed, because they are spread out over time.
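Applied to the ISC bulletin query from the question, a minimal sketch might look like the following. It reuses the retry strategy above but adds 503 to status_forcelist, on the assumption that the ISC "Sorry, but your request cannot be processed" page goes along with a temporarily overloaded server; base_url, param_sets and the pause between loop iterations are illustrative placeholders, not ISC-documented values.

import time
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

# Same idea as above, with 503 added and a gentler backoff for a busy catalogue server
retry_strategy = Retry(
    total=5,
    backoff_factor=1,
    allowed_methods=["GET"],
    status_forcelist=[500, 502, 503, 504],
)
session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=retry_strategy))

base_url = "http://www.isc.ac.uk/cgi-bin/web-db-v4"
# One dict per search; fill in the remaining ISC parameters as in the original URL
param_sets = [{"request": "STNARRIVALS", "out_format": "CSV", "sta_list": "CLC"}]

for params in param_sets:
    response = session.get(base_url, params=params)
    print(response.status_code)
    time.sleep(2)  # be polite between catalogue requests; adjust as needed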

Related

Unshorten URL for Invalid / expired Hostnames?

I used the following code snippet to unshorten URLs using the requests library. The snippet runs correctly for URL redirects of hostnames that are valid, running webpages. But this code, and every other variant of the unshortening snippets I have tried, seems to fail when the final URL is an invalid website. I would still like to get the final web page URL, regardless of whether it is an invalid one.
The snippet is:
def unshorten_url(url):
    return requests.head(url, allow_redirects=True).url

print(unshorten_url(<shortened URL>))
The shortened URL should redirect to this webpage, which has an invalid host:
http://trekingear.com/product/4-get-a-real-rocky-mountain-high/?utm_source=Content&utm_medium=Postings&utm_campaign=Guffey%20X%20Mass
But it returns this error:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='trekingear.com', port=80): Max retries exceeded with url: /product/4-get-a-real-rocky-mountain-high/?utm_source=Content&utm_medium=Postings&utm_campaign=Guffey%20X%20Mass (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10556dc50>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',))
Here is the URL I am trying to unshorten:
How can I extract the final URL of this invalid host from this redirection chain?
You should not use requests.head like that, since by default it follows a 302 Redirect up to three times.
You could disable redirect handling (with retries=False) and use urllib3's urlopen directly. The returned response will then hold the 302 response itself, and you can read the redirect target from it:
urlopen(method, url, body=None, headers=None, retries=None,
        redirect=True, assert_same_host=True, timeout=<object object>,
        pool_timeout=None, release_conn=None, chunked=False, body_pos=None,
        **response_kw)
Get a connection from the pool and perform an HTTP request. This is the lowest level call for making a request, so you’ll need to specify all the raw details.
Parameters:
method – HTTP request method (such as GET, POST, PUT, etc.)
body – Data to send in the request body (useful for creating POST requests, see HTTPConnectionPool.post_url for more convenience).
headers – Dictionary of custom headers to send, such as User-Agent, If-None-Match, etc. If None, pool headers are used. If provided, these headers completely replace any pool-specific headers.
retries (Retry, False, or an int.) –
Configure the number of retries to allow before raising a MaxRetryError exception.
Pass None to retry until you receive a response. Pass a Retry object for fine-grained control over different types of retries. Pass an integer number to retry connection errors that many times, but no other types of errors. Pass zero to never retry.
And this is the relevant note:
If False, then retries are disabled and any exception is raised immediately. Also, instead of raising a MaxRetryError on redirects, the redirect response will be returned.
Example
(I actually ran a different test on my local web server, but couldn't find a public one supplying wrong 302 responses.)
from urllib3 import PoolManager

manager = PoolManager(10)
req = manager.urlopen("GET", "http://en.wikipedia.org/wiki/Claude_E._Shannon", retries=False)
print(req.get_redirect_location())
The above requests an HTTP page from Wikipedia, which generates a redirect to HTTPS:
https://en.wikipedia.org/wiki/Claude_E._Shannon
Redirects plus no retries
Your case is a bit different. You want to do redirects since the original URL will not yield the real redirection on the first try, but you want to get the failed redirect.
The problem here is that redirects are handled by the same code path as error retries, so you can't disable only the latter: it's either both or neither.
You then have to enable both and do it the long way, intercepting the error. You might need to increase retries, which will slow things down when errors occur.
from urllib3.exceptions import MaxRetryError

try:
    # Did not know you can't post a URL shortener in a SO answer. Live and learn.
    req = manager.urlopen("GET", "http://t.co/eWWk8s8Hzj")
    loc = req.get_redirect_location()
except MaxRetryError as fail:
    # build "loc" from scheme, host and url
    loc = "%s://%s%s" % (fail.pool.scheme, fail.pool.host, fail.url)
print(loc)
Your specific case
Since requests is a wrapper around urllib3, you can just unwrap the exception:
try:
    # This is your existing code
    return requests.head(url, allow_redirects=True).url
except requests.ConnectionError as fail:
    return "%s://%s%s" % (fail.args[0].pool.scheme, fail.args[0].pool.host, fail.args[0].url)
You ought to provide for other possible errors, though.
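For instance, a minimal sketch of what that could look like, assuming you also want to handle timeouts and redirect loops (these particular exception choices are illustrative, not part of the original answer):

import requests

def unshorten_url(url):
    try:
        return requests.head(url, allow_redirects=True, timeout=10).url
    except requests.ConnectionError as fail:
        # Rebuild the last URL in the redirect chain from the failed connection pool
        return "%s://%s%s" % (fail.args[0].pool.scheme, fail.args[0].pool.host, fail.args[0].url)
    except (requests.Timeout, requests.TooManyRedirects):
        # No final URL could be resolved at all; fall back to the input
        return url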

Python + requests + splinter: What's the fastest/best way to make multiple concurrent 'get' requests?

I'm currently taking a web scraping class with other students, and we are supposed to make 'get' requests to a dummy site, parse it, and visit another site.
The problem is that the content of the dummy site is only up for several minutes before it disappears, and then comes back up at a certain interval. While the content is available, everyone tries to make 'get' requests at once, so mine just hangs until the others clear, by which point the content has disappeared. So I end up not being able to successfully make the 'get' request:
import requests
from splinter import Browser
browser = Browser('chrome')
# Hangs here
requests.get('http://dummysite.ca').text
# Even if get is successful hangs here as well
browser.visit(parsed_url)
So my question is, what's the fastest/best way to make endless concurrent 'get' requests until I get a response?
Decide to use either requests or splinter
Read about Requests: HTTP for Humans
Read about Splinter
Related
Read about keep-alive
Read about blocking-or-non-blocking
Read about timeouts
Read about errors-and-exceptions
If you are able to get requests that don't hang, you can simply retry them in a loop, for instance:
import time
import requests

while True:
    response = requests.get(...)  # fill in the URL and any parameters
    if response.status_code == 200:  # request was successful
        break
    time.sleep(1)
Gevent provides a framework for running asynchronous network requests.
It can patch Python's standard library so that existing libraries like requests and splinter work out of the box.
Here is a short example of how to make 10 concurrent requests, based on the above code, and get their response.
from gevent import monkey
monkey.patch_all()

import gevent.pool
import requests

pool = gevent.pool.Pool(size=10)
greenlets = [pool.spawn(requests.get, 'http://dummysite.ca')
             for _ in range(10)]
# Wait for all requests to complete
pool.join()

for greenlet in greenlets:
    # This will raise any exceptions raised by the request.
    # Need to catch errors, or check if an exception was
    # thrown by checking `greenlet.exception`.
    response = greenlet.get()
    text_response = response.text
Could also use map and a response-handling function instead of get, as sketched below.
See gevent documentation for more information.
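A minimal sketch of that map variant, assuming gevent's Pool.map as in current releases (the handler name and dummy URL are illustrative):

from gevent import monkey
monkey.patch_all()

import gevent.pool
import requests

def handle_response(response):
    # Called for each completed requests.Response object
    return response.text

pool = gevent.pool.Pool(size=10)
urls = ['http://dummysite.ca'] * 10

# map blocks until all requests finish and returns responses in input order
responses = pool.map(requests.get, urls)
texts = [handle_response(r) for r in responses]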
In this situation, concurrency will not help much since the server seems to be the limiting factor. One solution is to send a request with a timeout; if the timeout is exceeded, try the request again after a few seconds, and gradually increase the time between retries until you get the data you want. For instance, your code might look like this:
import time
import requests

def get_content(url, timeout):
    # raises a Timeout exception if more than `timeout` seconds have passed
    resp = requests.get(url, timeout=timeout)
    # raise a generic exception if the request is unsuccessful
    if resp.status_code != 200:
        raise LookupError('status is not 200')
    return resp.content

timeout = 5  # seconds
retry_interval = 0
max_retry_interval = 120

while True:
    try:
        response = get_content('https://example.com', timeout=timeout)
        retry_interval = 0  # reset retry interval after success
        break
    except (LookupError, requests.exceptions.Timeout):
        retry_interval += 10
        if retry_interval > max_retry_interval:
            retry_interval = max_retry_interval
        time.sleep(retry_interval)

# process response
If concurrency is required, consider the Scrapy project. It uses the Twisted framework. In Scrapy you can replace time.sleep with reactor.callLater(delay, fn, *args, **kw) or use one of hundreds of middleware plugins.
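As a rough illustration of what that non-blocking delay looks like in plain Twisted, outside Scrapy (the retry_fetch function is a placeholder):

from twisted.internet import reactor

def retry_fetch():
    # issue the next request attempt here (e.g. a Scrapy request or Twisted Agent call)
    print("retrying now")
    reactor.stop()

# Schedule retry_fetch 10 seconds from now without blocking the event loop,
# which is what reactor.callLater gives you in place of time.sleep.
reactor.callLater(10, retry_fetch)
reactor.run()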
From the documentation for requests:
If the remote server is very slow, you can tell Requests to wait
forever for a response, by passing None as a timeout value and then
retrieving a cup of coffee.
import requests

# Wait potentially forever
r = requests.get('http://dummysite.ca', timeout=None)
# Check the status code to see how the server is handling the request
print(r.status_code)
Status codes beginning with 2 mean the request was received, understood, and accepted. 200 means the request was a success and the information was returned. 503 means the server is overloaded or undergoing maintenance.
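Building on that, a small sketch of checking the status code and backing off on a 503; the Retry-After header is a standard hint some overloaded servers send (assumed here to be given in seconds), but not all servers set it:

import time
import requests

r = requests.get('http://dummysite.ca', timeout=None)
if r.status_code == 200:
    print(r.text)
elif r.status_code == 503:
    # Honour the server's Retry-After hint if present, otherwise wait a default 30 s
    wait = int(r.headers.get('Retry-After', 30))
    time.sleep(wait)
    r = requests.get('http://dummysite.ca', timeout=None)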
Requests used to include a module called async which could send concurrent requests. It is now an independent module named grequests, which you can use to make concurrent requests endlessly until a 200 response:
import grequests

urls = [
    'http://python-requests.org',  # Just include one url if you want
    'http://httpbin.org',
    'http://python-guide.org',
    'http://kennethreitz.com'
]

def keep_going():
    rs = (grequests.get(u) for u in urls)  # Make a set of unsent Requests
    out = grequests.map(rs)                # Send them all at the same time
    for i in out:
        if i.status_code == 200:
            print(i.text)
            del urls[out.index(i)]  # If we have the content, delete the URL
    return

while urls:
    keep_going()
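One caveat with the snippet above: grequests.map returns None for requests that raised an exception, so i.status_code would fail on those. A small sketch of guarding against that, using grequests' exception_handler callback (a sketch under the assumption that your grequests version supports it):

import grequests

def on_exception(request, exception):
    # Called by grequests.map for each request that failed outright
    print("request failed:", exception)

urls = ['http://python-requests.org', 'http://httpbin.org']
rs = (grequests.get(u) for u in urls)
out = grequests.map(rs, exception_handler=on_exception)

for response in out:
    if response is not None and response.status_code == 200:
        print(response.text)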

Is google.appengine.api.urlfetch deadline limited to 60s?

I'm using python on google app engine, and keep getting google.appengine.api.urlfetch_errors.DeadlineExceededError on requests made from a machine which does some backend processing. The requests take approximately 60s, sometimes a little longer, so I've attempted to increase the deadline.
The requests are wrapped in a retry, and from the logs I can see that the time between retries is always ~60s. I assume this is either because I've configured things incorrectly, or misunderstand the limitations of the deadline.
The machine config is:
instance_class: B8
basic_scaling:
  max_instances: 1
  idle_timeout: 10m
The code I'm using is (redacted for simplicity):
from google.appengine.api import urlfetch
from retrying import retry

timeout = 600
retries = 10

@retry(
    stop_max_attempt_number=retries,
    wait_exponential_multiplier=1000,
    wait_exponential_max=1000*60*5
)
def fetch(url):
    """Fetch remote data, retrying as necessary"""
    urlfetch.set_default_fetch_deadline(timeout)
    result = urlfetch.fetch(url)
    if result.status_code != 200:
        raise IOError("Did not receive OK response from server")
    return result.content

data = fetch(config['url'])
I've tried setting the deadline explicitly as urlfetch.fetch(url, deadline=timeout) but setting the default seems to be the approach most people suggest.
Can anyone clarify whether there is a maximum value which can be set for deadline?
The Request Timer
The Google App Engine request timer (Java/Python/Go) ensures that requests have a finite lifespan and do not get caught in an infinite loop. Currently, the deadline for requests to frontend instances is 60 seconds. (Backend instances have no corresponding limit.) Every request, including warmup (request to /_ah/warmup) and loading requests ("loading_request=1" log header), is subject to this restriction.
If a request fails to return within 60 seconds and a DeadlineExceededError is thrown and not caught, the request is aborted and a 500 internal server error is returned. If the DeadlineExceededError is caught but a response is not produced quickly enough (you have less than a second), the request is aborted and a 500 internal server error is returned.
As far as I can tell from the documentation, the maximum request timeout in App Engine is 60 seconds.
Here is the link to the documentation

Use of urllib2 in Google App Engine throws "Deadline exceeded while waiting for HTTP response from URL:..."

I am using urllib2 in Python on Google App Engine (GAE).
Very often the app crashes because of the following error:
Deadline exceeded while waiting for HTTP response from URL: ....
The source looks like this:
import logging

import webapp2
import urllib2
from bs4 import BeautifulSoup

def functionRunning2To5Seconds_1():
    # Check if the URL could be parsed
    try:
        url = "http://...someUrl..."
        req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        page = urllib2.urlopen(req)
        htmlSource = BeautifulSoup(page)
    except Exception as e:
        logging.info("Error : {er}".format(er=str(e)))
    # do some calculation with the data of htmlSource, which takes 2 to 5 seconds
# and the handler looks like:
class xyHandler(webapp2.RequestHandler):
    def post(self, uurl=None):
        r_data1 = functionRunning2To5Seconds_1()
        r_data2 = functionRunning2To5Seconds_2()
        r_data3 = functionRunning2To5Seconds_3()
        ...
        # show the results in a web page
I found this doc which states:
You can use the Python standard libraries urllib, urllib2 or httplib
to make HTTP requests. When running in App Engine, these libraries
perform HTTP requests using App Engine's URL fetch service
and this:
You can set a deadline for a request, the most amount of time the
service will wait for a response. By default, the deadline for a fetch
is 5 seconds. The maximum deadline is 60 seconds for HTTP requests and
60 seconds for task queue and cron job requests.
So HOW do I do this? How to set a timeout on urllib2?
Or do I have to rewrite the whole application to use App Engine's URL fetch service?
(PS: Does anybody know a secure way to run the "r_data1 = functionRunning2To5Seconds_...()" calls in parallel?)
https://docs.python.org/2/library/urllib2.html
urllib2.urlopen(url[, data][, timeout])
The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the
global default timeout setting will be used).
As suggested by Paul, you can pass the timeout parameter. On App Engine it is tied to the URL Fetch service and adjusts its deadline, up to a maximum of 60 seconds. Keep in mind that if urlopen takes longer than the time specified in the timeout parameter, you'll get a DeadlineExceededError from google.appengine.api.urlfetch_errors instead of the usual socket.timeout. It's good practice to catch this error and retry or log as necessary. See [1] for more information on dealing with DeadlineExceededError.
[1] - https://developers.google.com/appengine/articles/deadlineexceedederrors
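Put together, a minimal sketch of that advice on the App Engine Python 2.7 runtime might look like this (the URL and the single retry are placeholders; the error class path is the one quoted in the question):

import logging
import urllib2

from google.appengine.api import urlfetch_errors

def fetch_with_timeout(url):
    req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        # On App Engine this timeout is passed to the URL Fetch deadline (max 60 s)
        return urllib2.urlopen(req, timeout=60).read()
    except urlfetch_errors.DeadlineExceededError:
        logging.warning("URL fetch deadline exceeded for %s, retrying once", url)
        return urllib2.urlopen(req, timeout=60).read()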

Can I set max_retries for requests.request?

The Python requests module is simple and elegant but one thing bugs me.
It is possible to get a requests.exception.ConnectionError with a message like:
Max retries exceeded with url: ...
This implies that requests can attempt to access the data several times. But there is not a single mention of this possibility anywhere in the docs. Looking at the source code I didn't find any place where I could alter the default (presumably 0) value.
So is it possible to somehow set the maximum number of retries for requests?
This will not only change the max_retries but also enable a backoff strategy which makes requests to all http:// addresses sleep for a period of time before retrying (to a total of 5 times):
import requests
from requests.adapters import HTTPAdapter, Retry

s = requests.Session()
retries = Retry(total=5,
                backoff_factor=0.1,
                status_forcelist=[500, 502, 503, 504])

s.mount('http://', HTTPAdapter(max_retries=retries))
s.get('http://httpstat.us/500')
As per documentation for Retry: if the backoff_factor is 0.1, then sleep() will sleep for [0.05s, 0.1s, 0.2s, 0.4s, ...] between retries. It will also force a retry if the status code returned is 500, 502, 503 or 504.
Various other options to Retry allow for more granular control:
total – Total number of retries to allow.
connect – How many connection-related errors to retry on.
read – How many times to retry on read errors.
redirect – How many redirects to perform.
method_whitelist – Set of uppercased HTTP method verbs that we should retry on.
status_forcelist – A set of HTTP status codes that we should force a retry on.
backoff_factor – A backoff factor to apply between attempts.
raise_on_redirect – Whether, if the number of redirects is exhausted, to raise a MaxRetryError, or to return a response with a response code in the 3xx range.
raise_on_status – Similar meaning to raise_on_redirect: whether we should raise an exception, or return a response, if status falls in status_forcelist range and retries have been exhausted.
NB: raise_on_status is relatively new, and has not made it into a release of urllib3 or requests yet. The raise_on_status keyword argument appears to have made it into the standard library at most in python version 3.6.
To make requests retry on specific HTTP status codes, use status_forcelist. For example, status_forcelist=[503] will retry on status code 503 (service unavailable).
By default, the retry only fires for these conditions:
Could not get a connection from the pool.
TimeoutError
HTTPException raised (from http.client in Python 3 else httplib). This seems to be low-level HTTP exceptions, like a URL or protocol not formed correctly.
SocketError
ProtocolError
Notice that these are all exceptions that prevent a regular HTTP response from being received. If any regular response is generated, no retry is done. Without using the status_forcelist, even a response with status 500 will not be retried.
To make it behave in a manner which is more intuitive for working with a remote API or web server, I would use the above code snippet, which forces retries on statuses 500, 502, 503 and 504, all of which are not uncommon on the web and (possibly) recoverable given a big enough backoff period.
It is the underlying urllib3 library that does the retrying. To set a different maximum retry count, use alternative transport adapters:
import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
s.mount('http://stackoverflow.com', HTTPAdapter(max_retries=5))
The max_retries argument takes an integer or a Retry() object; the latter gives you fine-grained control over what kinds of failures are retried (an integer value is turned into a Retry() instance which only handles connection failures; errors after a connection is made are by default not handled as these could lead to side-effects).
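For illustration, the two forms side by side (a minimal sketch; the hosts are arbitrary):

import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

s = requests.Session()

# Integer: retries connection failures only, nothing after a connection is made
s.mount('http://stackoverflow.com', HTTPAdapter(max_retries=5))

# Retry object: fine-grained control, e.g. also back off between connection attempts
s.mount('https://stackoverflow.com',
        HTTPAdapter(max_retries=Retry(connect=3, backoff_factor=0.5)))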
Old answer, predating the release of requests 1.2.1:
The requests library doesn't really make this configurable, nor does it intend to (see this pull request). Currently (requests 1.1), the retries count is set to 0. If you really want to set it to a higher value, you'll have to set this globally:
import requests
requests.adapters.DEFAULT_RETRIES = 5
This constant is not documented; use it at your own peril as future releases could change how this is handled.
Update: and this did change; in version 1.2.1 the option to set the max_retries parameter on the HTTPAdapter() class was added, so that now you have to use alternative transport adapters, see above. The monkey-patch approach no longer works, unless you also patch the HTTPAdapter.__init__() defaults (very much not recommended).
Be careful, Martijn Pieters's answer isn't suitable for version 1.2.1+. You can't set it globally without patching the library.
You can do this instead:
import requests
from requests.adapters import HTTPAdapter
s = requests.Session()
s.mount('http://www.github.com', HTTPAdapter(max_retries=5))
s.mount('https://www.github.com', HTTPAdapter(max_retries=5))
After struggling a bit with some of the answers here, I found a library called backoff that worked better for my situation. A basic example:
import backoff
import requests

@backoff.on_exception(
    backoff.expo,
    requests.exceptions.RequestException,
    max_tries=5,
    giveup=lambda e: e.response is not None and e.response.status_code < 500
)
def publish(self, data):
    r = requests.post(url, timeout=10, json=data)
    r.raise_for_status()
I'd still recommend giving the library's native functionality a shot, but if you run into any problems or need broader control, backoff is an option.
A cleaner way to gain more control might be to package the retry logic into a function and make that function retryable using a decorator, whitelisting the exceptions that should trigger a retry.
I have created the same here:
http://www.praddy.in/retry-decorator-whitelisted-exceptions/
Reproducing the code in that link :
import functools
import time

def retry(exceptions, delay=0, times=2):
    """
    A decorator for retrying a function call with a specified delay in case of a set of exceptions

    Parameter List
    --------------
    :param exceptions: A tuple of all exceptions that need to be caught for retry
                       e.g. retry(exceptions=(Timeout, Readtimeout))
    :param delay: Amount of delay (seconds) needed between successive retries.
    :param times: no of times the function should be retried
    """
    def outer_wrapper(function):
        @functools.wraps(function)
        def inner_wrapper(*args, **kwargs):
            final_excep = None
            for counter in range(times):
                if counter > 0:
                    time.sleep(delay)
                final_excep = None
                try:
                    value = function(*args, **kwargs)
                    return value
                except exceptions as e:
                    final_excep = e
                    pass  # or log it
            if final_excep is not None:
                raise final_excep
        return inner_wrapper
    return outer_wrapper

@retry(exceptions=(TimeoutError, ConnectTimeoutError), delay=0, times=3)
def call_api():
    ...
You can use the requests library to accomplish this all in one go.
The following code will retry 3 times if you receive a 429, 500, 502, 503 or 504 status code, each time with a longer delay set through "backoff_factor". See https://findwork.dev/blog/advanced-usage-python-requests-timeouts-retries-hooks/ for a nice tutorial.
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

retry_strategy = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504],
    method_whitelist=["HEAD", "GET", "OPTIONS"]
)
adapter = HTTPAdapter(max_retries=retry_strategy)

http = requests.Session()
http.mount("https://", adapter)
http.mount("http://", adapter)

response = http.get("https://en.wikipedia.org/w/api.php")
