I'm gathering statistics on a list of websites and I'm using requests for it for simplicity. Here is my code:
import requests
data = []
websites = ['http://google.com', 'http://bbc.co.uk']
for w in websites:
    r = requests.get(w, verify=False)
    data.append((r.url, len(r.content), r.elapsed.total_seconds(), str([(l.status_code, l.url) for l in r.history]), str(r.headers.items()), str(r.cookies.items())))
Now, I want requests.get to time out after 10 seconds so the loop doesn't get stuck.
This question has been asked before, but none of the answers are clean.
I hear that maybe not using requests is a good idea, but then how should I get the nice things requests offers (the ones in the tuple)?
Set the timeout parameter:
r = requests.get(w, verify=False, timeout=10) # 10 seconds
Changes in version 2.25.1
The code above will cause the call to requests.get() to time out if establishing the connection, or the delay between reads, takes more than ten seconds. See: https://requests.readthedocs.io/en/stable/user/advanced/#timeouts
What about using eventlet? If you want to timeout the request after 10 seconds, even if data is being received, this snippet will work for you:
import requests
import eventlet
eventlet.monkey_patch()
with eventlet.Timeout(10):
    requests.get("http://ipv4.download.thinkbroadband.com/1GB.zip", verify=False)
UPDATE: https://requests.readthedocs.io/en/master/user/advanced/#timeouts
In newer versions of requests:
If you specify a single value for the timeout, like this:
r = requests.get('https://github.com', timeout=5)
The timeout value will be applied to both the connect and the read timeouts. Specify a tuple if you would like to set the values separately:
r = requests.get('https://github.com', timeout=(3.05, 27))
If the remote server is very slow, you can tell Requests to wait forever for a response, by passing None as a timeout value and then retrieving a cup of coffee.
r = requests.get('https://github.com', timeout=None)
My old (probably outdated) answer (which was posted long time ago):
There are other ways to overcome this problem:
1. Use the TimeoutSauce internal class
From: https://github.com/kennethreitz/requests/issues/1928#issuecomment-35811896
import requests
from requests.adapters import TimeoutSauce
class MyTimeout(TimeoutSauce):
    def __init__(self, *args, **kwargs):
        connect = kwargs.get('connect', 5)
        read = kwargs.get('read', connect)
        super(MyTimeout, self).__init__(connect=connect, read=read)
requests.adapters.TimeoutSauce = MyTimeout
This code should cause us to set the read timeout as equal to the
connect timeout, which is the timeout value you pass on your
Session.get() call. (Note that I haven't actually tested this code, so
it may need some quick debugging, I just wrote it straight into the
GitHub window.)
2. Use a fork of requests from kevinburke: https://github.com/kevinburke/requests/tree/connect-timeout
From its documentation: https://github.com/kevinburke/requests/blob/connect-timeout/docs/user/advanced.rst
If you specify a single value for the timeout, like this:
r = requests.get('https://github.com', timeout=5)
The timeout value will be applied to both the connect and the read
timeouts. Specify a tuple if you would like to set the values
separately:
r = requests.get('https://github.com', timeout=(3.05, 27))
kevinburke has requested it to be merged into the main requests project, but it hasn't been accepted yet.
timeout = int(seconds)
Since requests >= 2.4.0, you can use the timeout argument, e.g.:
requests.get('https://duckduckgo.com/', timeout=10)
Note:
timeout is not a time limit on the entire response download; rather,
an exception is raised if the server has not issued a response for
timeout seconds ( more precisely, if no bytes have been received on the
underlying socket for timeout seconds). If no timeout is specified
explicitly, requests do not time out.
To create a timeout you can use signals.
The best way to solve this case is probably to
Set an exception as the handler for the alarm signal
Call the alarm signal with a ten second delay
Call the function inside a try-except-finally block.
The except block is reached if the function timed out.
In the finally block you abort the alarm, so it's not signaled later.
Here is some example code:
import signal
from time import sleep
class TimeoutException(Exception):
    """ Simple Exception to be called on timeouts. """
    pass
def _timeout(signum, frame):
    """ Raise a TimeoutException.
    This is intended for use as a signal handler.
    The signum and frame arguments passed to this are ignored.
    """
    # Raise TimeoutException with system default timeout message
    raise TimeoutException()
# Set the handler for the SIGALRM signal:
signal.signal(signal.SIGALRM, _timeout)
# Send the SIGALRM signal in 10 seconds:
signal.alarm(10)
try:
    # Do our code:
    print('This will take 11 seconds...')
    sleep(11)
    print('done!')
except TimeoutException:
    print('It timed out!')
finally:
    # Abort the sending of the SIGALRM signal:
    signal.alarm(0)
There are some caveats to this:
It is not thread-safe: signals are always delivered to the main thread, so you can't use this in any other thread.
There is a slight delay after the scheduling of the signal and the execution of the actual code. This means that the example would time out even if it only slept for ten seconds.
But it's all in the standard Python library! Apart from the sleep function import, it needs only one import. If you are going to use timeouts in many places, you can easily put the TimeoutException, _timeout and the signaling in a function and just call that. Or you can make a decorator and put it on functions; see the answer linked below.
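As a rough illustration of the decorator idea, here is a minimal sketch. The names timeout, slow and fast are made up for this example, and it uses signal.setitimer instead of signal.alarm only so that fractional seconds work; the same caveats apply (Unix only, main thread only).

```python
import functools
import signal
import time


class TimeoutException(Exception):
    """Raised when the decorated call exceeds its time budget."""


def timeout(seconds):
    """Decorator sketch: abort the wrapped call after `seconds` (Unix, main thread only)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            def handler(signum, frame):
                raise TimeoutException()

            previous = signal.signal(signal.SIGALRM, handler)
            # setitimer accepts fractional seconds, unlike signal.alarm.
            signal.setitimer(signal.ITIMER_REAL, seconds)
            try:
                return func(*args, **kwargs)
            finally:
                signal.setitimer(signal.ITIMER_REAL, 0)  # cancel the pending alarm
                signal.signal(signal.SIGALRM, previous)  # restore the old handler
        return wrapper
    return decorator


@timeout(0.2)
def slow():
    # Hypothetical stand-in for a call that hangs, e.g. a slow download.
    time.sleep(1)
    return "finished"


@timeout(0.5)
def fast():
    return "finished"
```

Calling slow() raises TimeoutException after roughly 0.2 seconds, while fast() returns normally.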
You can also set this up as a "context manager" so you can use it with the with statement:
import signal
class Timeout():
    """ Timeout for use with the `with` statement. """
    class TimeoutException(Exception):
        """ Simple Exception to be called on timeouts. """
        pass
    def _timeout(signum, frame):
        """ Raise a TimeoutException.
        This is intended for use as a signal handler.
        The signum and frame arguments passed to this are ignored.
        """
        raise Timeout.TimeoutException()
    def __init__(self, timeout=10):
        self.timeout = timeout
        signal.signal(signal.SIGALRM, Timeout._timeout)
    def __enter__(self):
        signal.alarm(self.timeout)
    def __exit__(self, exc_type, exc_value, traceback):
        signal.alarm(0)
        return exc_type is Timeout.TimeoutException
# Demonstration:
from time import sleep
print('This is going to take maximum 10 seconds...')
with Timeout(10):
    sleep(15)
    print('No timeout?')
print('Done')
One possible downside of this context manager approach is that you can't tell whether the code actually timed out or not.
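One way around that limitation, sketched here as an assumption rather than as part of the original answer, is to have the context manager return itself and record a flag. The timed_out attribute is a name invented for this example, and signal.setitimer is used so the demonstration can run with a sub-second limit.

```python
import signal
import time


class Timeout:
    """Variant of the context manager that remembers whether it fired."""

    class TimeoutException(Exception):
        pass

    def __init__(self, seconds=10):
        self.seconds = seconds
        self.timed_out = False

    def _handler(self, signum, frame):
        raise Timeout.TimeoutException()

    def __enter__(self):
        signal.signal(signal.SIGALRM, self._handler)
        signal.setitimer(signal.ITIMER_REAL, self.seconds)
        return self  # lets the caller inspect `timed_out` afterwards

    def __exit__(self, exc_type, exc_value, traceback):
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel the pending alarm
        self.timed_out = exc_type is Timeout.TimeoutException
        return self.timed_out  # swallow only our own exception


with Timeout(0.2) as t:
    time.sleep(1)  # stand-in for the slow operation
print("timed out:", t.timed_out)
```

After the with block, t.timed_out tells the caller which of the two outcomes happened.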
Sources and recommended reading:
The documentation on signals
This answer on timeouts by @David Narayan. He has organized the above code as a decorator.
Try this request with timeout & error handling:
import requests
try:
    url = "http://google.com"
    r = requests.get(url, timeout=10)
except requests.exceptions.Timeout as e:
    print(e)
The connect timeout is the number of seconds Requests will wait for your client to establish a connection to a remote machine (corresponding to the connect() call on the socket). It’s a good practice to set connect timeouts to slightly larger than a multiple of 3, which is the default TCP packet retransmission window.
Once your client has connected to the server and sent the HTTP request, the read timeout starts. It is the number of seconds the client will wait for the server to send a response. (Specifically, it’s the number of seconds that the client will wait between bytes sent from the server. In 99.9% of cases, this is the time before the server sends the first byte).
If you specify a single value for the timeout, it will be applied to both the connect and the read timeouts, as below:
r = requests.get('https://github.com', timeout=5)
Specify a tuple if you would like to set the values separately for connect and read:
r = requests.get('https://github.com', timeout=(3.05, 27))
If the remote server is very slow, you can tell Requests to wait forever for a response, by passing None as a timeout value and then retrieving a cup of coffee.
r = requests.get('https://github.com', timeout=None)
https://docs.python-requests.org/en/latest/user/advanced/#timeouts
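To make the single-value versus tuple rules concrete, here is a small sketch of how a timeout argument maps onto separate connect and read limits. The helper split_timeout is hypothetical and not part of requests; it only mirrors the behaviour described above.

```python
def split_timeout(timeout):
    """Return (connect, read) limits in seconds; None means wait forever.

    Hypothetical helper, not part of requests: it only mirrors the rules
    described above for the `timeout` argument.
    """
    if timeout is None:
        return (None, None)
    if isinstance(timeout, tuple):
        if len(timeout) != 2:
            raise ValueError("expected a (connect, read) tuple")
        connect, read = timeout
        return (connect, read)
    # A single number applies to both phases.
    return (timeout, timeout)


print(split_timeout(5))
print(split_timeout((3.05, 27)))
print(split_timeout(None))
```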
Most other answers are incorrect
Despite all the answers, I believe that this thread still lacks a proper solution and no existing answer presents a reasonable way to do something which should be simple and obvious.
Let's start by saying that as of 2022, there is still absolutely no way to do it properly with requests alone. It is a conscious design decision by the library's developers.
Solutions utilizing the timeout parameter simply do not accomplish what they intend to do. The fact that it "seems" to work at the first glance is purely incidental:
The timeout parameter has absolutely nothing to do with the total execution time of the request. It merely controls the maximum amount of time that can pass before underlying socket receives any data. With an example timeout of 5 seconds, server can just as well send 1 byte of data every 4 seconds and it will be perfectly okay, but won't help you very much.
Answers with stream and iter_content are somewhat better, but they still do not cover everything in a request. You do not actually receive anything from iter_content until after response headers are sent, which falls under the same issue - even if you use 1 byte as a chunk size for iter_content, reading full response headers could take a totally arbitrary amount of time and you can never actually get to the point in which you read any response body from iter_content.
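The first point can be reproduced without requests at all. The following is a scaled-down sketch using plain sockets: the per-read limit is 0.5 seconds instead of 5, and the drip helper is a stand-in server written for this example. Every read finishes within the limit, yet the whole transfer takes longer than the limit and no timeout is ever raised.

```python
import socket
import threading
import time


def drip(server, count, delay):
    """Accept one client and send `count` bytes, pausing `delay` seconds between them."""
    conn, _ = server.accept()
    with conn:
        for _ in range(count):
            conn.sendall(b"a")
            time.sleep(delay)


server = socket.socket()
server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
server.listen(1)
threading.Thread(target=drip, args=(server, 5, 0.2), daemon=True).start()

client = socket.create_connection(server.getsockname())
client.settimeout(0.5)  # per-read limit, analogous to the read timeout

start = time.monotonic()
received = b""
while True:
    chunk = client.recv(1)  # each individual read returns well within 0.5 s
    if not chunk:
        break  # server closed the connection
    received += chunk
elapsed = time.monotonic() - start
client.close()
server.close()

print(len(received), "bytes; total time exceeded the per-read limit:", elapsed > 0.5)
```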
Here are some examples that completely break both timeout and stream-based approach. Try them all. They all hang indefinitely, no matter which method you use.
server.py
import socket
import time
server = socket.socket()
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, True)
server.bind(('127.0.0.1', 8080))
server.listen()
while True:
    try:
        sock, addr = server.accept()
        print('Connection from', addr)
        sock.send(b'HTTP/1.1 200 OK\r\n')
        # Send some garbage headers very slowly but steadily.
        # Never actually complete the response.
        while True:
            sock.send(b'a')
            time.sleep(1)
    except:
        pass
demo1.py
import requests
requests.get('http://localhost:8080')
demo2.py
import requests
requests.get('http://localhost:8080', timeout=5)
demo3.py
import requests
requests.get('http://localhost:8080', timeout=(5, 5))
demo4.py
import requests
with requests.get('http://localhost:8080', timeout=(5, 5), stream=True) as res:
    for chunk in res.iter_content(1):
        break
The proper solution
My approach utilizes Python's sys.settrace function. It is dead simple. You do not need to use any external libraries or turn your code upside down. Unlike most other answers, this actually guarantees that the code executes in specified time. Be aware that you still need to specify the timeout parameter, as settrace only concerns Python code. Actual socket reads are external syscalls which are not covered by settrace, but are covered by the timeout parameter. Due to this fact, the exact time limit is not TOTAL_TIMEOUT, but a value which is explained in comments below.
import requests
import sys
import time
# This function serves as a "hook" that executes for each Python statement
# down the road. There may be some performance penalty, but as downloading
# a webpage is mostly I/O bound, it's not going to be significant.
def trace_function(frame, event, arg):
    if time.time() - start > TOTAL_TIMEOUT:
        raise Exception('Timed out!') # Use whatever exception you consider appropriate.
    return trace_function
# The following code will terminate at most after TOTAL_TIMEOUT + the highest
# value specified in `timeout` parameter of `requests.get`.
# In this case 10 + 6 = 16 seconds.
# For most cases though, it's gonna terminate no later than TOTAL_TIMEOUT.
TOTAL_TIMEOUT = 10
start = time.time()
sys.settrace(trace_function)
try:
    res = requests.get('http://localhost:8080', timeout=(3, 6)) # Use whatever timeout values you consider appropriate.
except:
    raise
finally:
    sys.settrace(None) # Remove the time constraint and continue normally.
# Do something with the response
Condensed
import requests, sys, time
TOTAL_TIMEOUT = 10
def trace_function(frame, event, arg):
    if time.time() - start > TOTAL_TIMEOUT:
        raise Exception('Timed out!')
    return trace_function
start = time.time()
sys.settrace(trace_function)
try:
    res = requests.get('http://localhost:8080', timeout=(3, 6))
except:
    raise
finally:
    sys.settrace(None)
That's it!
Set stream=True and use r.iter_content(1024). Yes, eventlet.Timeout just somehow doesn't work for me.
from time import time
from requests import get, exceptions
# `config` is assumed to be a dict-like object loaded elsewhere.
try:
    start = time()
    timeout = 5
    with get(config['source']['online'], stream=True, timeout=timeout) as r:
        r.raise_for_status()
        content = bytes()
        content_gen = r.iter_content(1024)
        while True:
            if time()-start > timeout:
                raise TimeoutError('Time out! ({} seconds)'.format(timeout))
            try:
                content += next(content_gen)
            except StopIteration:
                break
    data = content.decode().split('\n')
    if len(data) in [0, 1]:
        raise ValueError('Bad requests data')
except (exceptions.RequestException, ValueError, IndexError, KeyboardInterrupt,
        TimeoutError) as e:
    print(e)
    with open(config['source']['local']) as f:
        data = [line.strip() for line in f.readlines()]
The discussion is here https://redd.it/80kp1h
This may be overkill, but the Celery distributed task queue has good support for timeouts.
In particular, you can define a soft time limit that just raises an exception in your process (so you can clean up) and/or a hard time limit that terminates the task when the time limit has been exceeded.
Under the covers, this uses the same signals approach as referenced in your "before" post, but in a more usable and manageable way. And if the list of web sites you are monitoring is long, you might benefit from its primary feature -- all kinds of ways to manage the execution of a large number of tasks.
I believe you can use multiprocessing and not depend on a 3rd party package:
import multiprocessing
import requests
def call_with_timeout(func, args, kwargs, timeout):
    manager = multiprocessing.Manager()
    return_dict = manager.dict()
    # define a wrapper of `return_dict` to store the result.
    # Note: a nested function can only be a Process target with the 'fork' start method.
    def function(return_dict):
        return_dict['value'] = func(*args, **kwargs)
    p = multiprocessing.Process(target=function, args=(return_dict,))
    p.start()
    # Force a max. `timeout` or wait for the process to finish
    p.join(timeout)
    # If the process is still alive, it didn't finish: raise TimeoutError
    if p.is_alive():
        p.terminate()
        p.join()
        raise TimeoutError
    else:
        return return_dict['value']
call_with_timeout(requests.get, args=(url,), kwargs={'timeout': 10}, timeout=60)
The timeout passed to kwargs is the timeout to get any response from the server, the argument timeout is the timeout to get the complete response.
Despite the question being about requests, I find this very easy to do with pycurl CURLOPT_TIMEOUT or CURLOPT_TIMEOUT_MS.
No threading or signaling required:
import traceback
from io import BytesIO
import pycurl
url = 'http://www.example.com/example.zip'
timeout_ms = 1000
raw = BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.TIMEOUT_MS, timeout_ms)  # total timeout in milliseconds
c.setopt(pycurl.WRITEFUNCTION, raw.write)
c.setopt(pycurl.NOSIGNAL, 1)
c.setopt(pycurl.URL, url)
c.setopt(pycurl.HTTPGET, 1)
try:
    c.perform()
except pycurl.error:
    traceback.print_exc()  # error generated on timeout
    pass  # or just pass if you don't want to print the error
In case you're using the option stream=True you can do this:
import time
import requests
r = requests.get(
    'http://url_to_large_file',
    timeout=1,  # relevant only for underlying socket
    stream=True)
with open('/tmp/out_file.txt', 'wb') as f:
    start_time = time.time()
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:  # filter out keep-alive new chunks
            f.write(chunk)
        if time.time() - start_time > 8:
            raise Exception('Request took longer than 8s')
The solution does not need signals or multiprocessing.
Just one more solution (got it from http://docs.python-requests.org/en/master/user/advanced/#streaming-uploads)
Before downloading the body you can find out the content size:
TOO_LONG = 10*1024*1024 # 10 MB
big_url = "http://ipv4.download.thinkbroadband.com/1GB.zip"
r = requests.get(big_url, stream=True)
print(r.headers['content-length'])
# 1073741824
if int(r.headers['content-length']) < TOO_LONG:
    # download content:
    content = r.content
But be careful, a sender can set an incorrect value in the 'content-length' response field.
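Because the header cannot be trusted, one option is to also count the bytes actually received while streaming. This is a minimal sketch under that assumption; read_capped is a hypothetical helper, and the list of chunks stands in for something like r.iter_content(chunk_size=1024).

```python
def read_capped(chunks, limit):
    """Join byte chunks, raising ValueError once more than `limit` bytes arrive."""
    body = bytearray()
    for chunk in chunks:
        body.extend(chunk)
        if len(body) > limit:
            raise ValueError("body exceeds %d bytes" % limit)
    return bytes(body)


# Stand-in for r.iter_content(chunk_size=...); any iterable of bytes works.
fake_chunks = [b"abcd", b"efgh", b"ij"]
print(read_capped(fake_chunks, limit=16))
```

With a streamed response this would be called as read_capped(r.iter_content(chunk_size=1024), TOO_LONG), so the download stops as soon as the cap is exceeded regardless of what the header claimed.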
timeout = (connection timeout, data read timeout), or give a single value (timeout=1):
import requests
try:
    req = requests.request('GET', 'https://www.google.com', timeout=(1, 1))
    print(req)
except requests.ReadTimeout:
    print("READ TIME OUT")
This code works for socket errors 11004 and 10060:
# -*- encoding:UTF-8 -*-
__author__ = 'ACE'
import requests
from PyQt4.QtCore import *
from PyQt4.QtGui import *
class TimeOutModel(QThread):
    Existed = pyqtSignal(bool)
    TimeOut = pyqtSignal()
    def __init__(self, fun, timeout=500, parent=None):
        """
        @param fun: function or lambda
        @param timeout: ms
        """
        super(TimeOutModel, self).__init__(parent)
        self.fun = fun
        self.timeer = QTimer(self)
        self.timeer.setInterval(timeout)
        self.timeer.timeout.connect(self.time_timeout)
        self.Existed.connect(self.timeer.stop)
        self.timeer.start()
        self.setTerminationEnabled(True)
    def time_timeout(self):
        self.timeer.stop()
        self.TimeOut.emit()
        self.quit()
        self.terminate()
    def run(self):
        self.fun()
bb = lambda: requests.get("http://ipv4.download.thinkbroadband.com/1GB.zip")
a = QApplication([])
z = TimeOutModel(bb, 500)
print('timeout')
a.exec_()
Well, I tried many solutions on this page and still faced instabilities, random hangs, and poor connection performance.
I'm now using curl and I'm really happy with its "max time" functionality and with the overall performance, even with such a poor implementation:
import subprocess
content = subprocess.getoutput('curl -m6 -Ss "http://mywebsite.xyz"')
Here, I defined a max time of 6 seconds, covering both connection and transfer time.
I'm sure curl has a nice Python binding, if you prefer to stick to Pythonic syntax :)
There is a package called timeout-decorator that you can use to time out any python function.
import time
import timeout_decorator
@timeout_decorator.timeout(5)
def mytest():
    print("Start")
    for i in range(1,10):
        time.sleep(1)
        print("{} seconds have passed".format(i))
It uses the signals approach that some answers here suggest. Alternatively, you can tell it to use multiprocessing instead of signals (e.g. if you are in a multi-thread environment).
If it comes to that, create a watchdog thread that messes up requests' internal state after 10 seconds, e.g.:
closes the underlying socket, and ideally
triggers an exception if requests retries the operation
Note that depending on the system libraries you may be unable to set deadline on DNS resolution.
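The first step of that idea can be sketched with plain sockets. This is only an illustration under stated assumptions: it operates on an ordinary socket from socket.socketpair(), not on requests' internals (which are not a public API), and the abort helper is a name made up for this example.

```python
import socket
import threading
import time

# A connected pair stands in for the client socket and a server that never replies.
client, silent_peer = socket.socketpair()


def abort(sock):
    """Watchdog action: shut the socket down so a blocked recv() returns."""
    try:
        sock.shutdown(socket.SHUT_RDWR)
    except OSError:
        pass  # already closed or disconnected


watchdog = threading.Timer(0.3, abort, args=(client,))
watchdog.start()

start = time.monotonic()
data = client.recv(1024)  # would block forever without the watchdog
elapsed = time.monotonic() - start
watchdog.cancel()

print("recv returned", data, "after the watchdog fired")
client.close()
silent_peer.close()
```

The blocked read comes back empty once the watchdog shuts the socket down; the caller still has to treat that as a failure and avoid retrying on the same connection.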
I'm using requests 2.2.1 and eventlet didn't work for me. I was able to use gevent's Timeout instead, since gevent is already used in my service for gunicorn.
import gevent
import gevent.monkey
gevent.monkey.patch_all(subprocess=True)
import requests
try:
    with gevent.Timeout(5):
        ret = requests.get(url)
        print(ret.status_code, ret.content)
except gevent.timeout.Timeout as e:
    print("timeout: {}".format(e))
Please note that gevent.timeout.Timeout is not caught by general Exception handling.
So either explicitly catch gevent.timeout.Timeout
or pass in a different exception to be used like so: with gevent.Timeout(5, requests.exceptions.Timeout): although no message is passed when this exception is raised.
The biggest problem is that if the connection can't be established, the requests package waits too long and blocks the rest of the program.
There are several ways how to tackle the problem but when I looked for a oneliner similar to requests, I couldn't find anything. That's why I built a wrapper around requests called reqto ("requests timeout"), which supports proper timeout for all standard methods from requests.
pip install reqto
The syntax is identical to requests
import reqto
response = reqto.get(f'https://pypi.org/pypi/reqto/json',timeout=1)
# Will raise an exception on Timeout
print(response)
Moreover, you can set up a custom timeout function
def custom_function(parameter):
    print(parameter)
response = reqto.get(f'https://pypi.org/pypi/reqto/json',timeout=5,timeout_function=custom_function,timeout_args="Timeout custom function called")
# Will call timeout_function instead of raising an exception on Timeout
print(response)
Important note is that the import line
import reqto
needs to be earlier import than all other imports working with requests, threading, etc. due to monkey_patch which runs in the background.
I came up with a more direct solution that is admittedly ugly but fixes the real problem. It goes a bit like this:
resp = requests.get(some_url, stream=True)
resp.raw._fp.fp._sock.settimeout(read_timeout)
# This will load the entire response even though stream is set
content = resp.content
You can read the full explanation here
Related
I'm working in an environment where web applications fork processes on demand and each process has its own thread pool to service web requests. The threads may need to issue HTTPS requests to outside services, and the requests library is currently used to do so. When requests usage was first added, it was used naively by creating a new requests.Session and requests.adapters.HTTPAdapter for each request, or even by simply calling requests.get or requests.post on demand. The problem that arises is that a new connection is established each time instead of potentially taking advantage of HTTP persistent connections. A potential fix would be to use a connection pool, but what is the recommended way of sharing a HTTP connection pool between threads when using the requests library? Is there one?
The first thought would be to share a single requests.Session, but that is currently not safe, as described in "Is the Session object from Python's Requests library thread safe?" and "Document threading contract for Session class". Is it safe and sufficient to have a single global requests.adapters.HTTPAdapter that is shared between requests.Session objects that are created on demand in each thread? According to "Our use of urllib3's ConnectionPools is not threadsafe.", even that may not be a valid use. Only needing to connect to a small number of distinct remote endpoints may allow it to be a viable approach regardless.
I doubt there is an existing way to do this in requests. But you can modify my code to wrap a requests session instead of the standard urllib2.
This is my code that I use when I want to get data from multiple sites at the same time:
# Following code I keep in file named do.py
# It can be use to perform any number of otherwise blocking IO operations simultaneously
# Results are available to you when all IO operations are completed.
# Completed means either IO finished successfully or an exception got raised.
# So when all tasks are completed, you pick up results.
# Usage example:
# >>> import do
# >>> results = do.simultaneously([
# ... (func1, func1_args, func1_kwargs),
# ... (func2, func2_args, func2_kwargs), ...])
# >>> for x in results:
# ... print x
# ...
from thread import start_new_thread as thread
from thread import allocate_lock
from collections import deque
from time import sleep
class Task:
    """A task's thread holder. Keeps results or exceptions raised.
    This could be a bit more robustly implemented using
    threading module.
    """
    def __init__ (self, func, args, kwargs, pool):
        self.func = func
        self.args = args
        self.kwargs = kwargs
        self.result = None
        self.done = 0
        self.started = 0
        self.xraised = 0
        self.tasks = pool
        pool.append(self)
        self.allow = allocate_lock()
        self.run()
    def run (self):
        thread(self._run,())
    def _run (self):
        self.allow.acquire() # Prevent same task from being started multiple times
        self.started = 1
        self.result = None
        self.done = 0
        self.xraised = 0
        try:
            self.result = self.func(*self.args, **self.kwargs)
        except Exception, e:
            e.task = self # Keep reference to the task in an exception
            # This way we can access original task from caught exception
            self.result = e
            self.xraised = 1
        self.done = 1
        self.allow.release()
    def wait (self):
        while not self.done:
            try: sleep(0.001)
            except: break
    def withdraw (self):
        if not self.started: self.run()
        if not self.done: self.wait()
        self.tasks.remove(self)
        return self.result
    def remove (self):
        self.tasks.remove(self)
def simultaneously (tasks, xraise=0):
    """Starts all functions within iterable <tasks>.
    Then waits for all to be finished.
    Iterable <tasks> may contain a subiterables with:
    (function, [[args,] kwargs])
    or just functions. These would be called without arguments.
    Returns an iterator that yields result of each called function.
    If an exception is raised within a task the Exception()'s instance will be returned unless
    <xraise> is 1 or True. Then first encountered exception within results will be raised.
    Results will start to yield after all funcs() either return or raise an exception.
    """
    pool = deque()
    for x in tasks:
        func = lambda: None
        args = ()
        kwargs = {}
        if not isinstance(x, (tuple, list)):
            Task(x, args, kwargs, pool)
            continue
        l = len(x)
        if l: func = x[0]
        if l>1:
            args = x[1]
            if not isinstance(args, (tuple, list)): args = (args,)
        if l>2:
            if isinstance(x[2], dict):
                kwargs = x[2]
        Task(func, args, kwargs, pool)
    for t in pool: t.wait()
    while pool:
        t = pool.popleft()
        if xraise and t.xraised:
            raise t.result
        yield t.result
# So, I do this using urllib2, you can do it using requests if you want.
from urllib2 import URLError, HTTPError, urlopen
import do
class AccessError(Exception):
    """Raised if server rejects us because we bombarded same server with multiple connections in too small time slots."""
    pass
def retrieve (url):
    try:
        u = urlopen(url)
        r = u.read()
        u.close()
        return r
    except HTTPError, e:
        msg = "HTTPError %i - %s" % (e.code, e.msg)
        t = AccessError()
        if e.code in (401, 403, 429):
            msg += " (perhaps you're making too many calls)"
            t.reason = "perhaps you are making too many calls"
        elif e.code in (502, 504):
            msg += " (service temporarily not available)"
            t.reason = "service temporarily not available"
        else: t.reason = e.msg
        t.args = (msg,)
        t.message = msg
        t.msg = e.msg; t.code = e.code
        t.orig = e
        raise t
    except URLError, e:
        msg = "URLError %s - %s (%s)" % (str(e.errno), str(e.message), str(e.reason))
        t = AccessError(msg)
        t.reason = str(e.reason)
        t.msg = str(t.message)
        t.code = e.errno
        t.orig = e
        raise t
    except: raise
urls = ["http://www.google.com", "http://www.amazon.com", "http://stackoverflow.com", "http://blah.blah.sniff-sniff"]
retrieval = []
for u in urls:
    retrieval.append((retrieve, u))
x = 0
for data in do.simultaneously(retrieval):
    url = urls[x]
    if isinstance(data, Exception):
        print url, "not retrieved successfully!\nThe error is:"
        print data
    else:
        print url, "returned", len(data), "characters!!\nFirst 100:"
        print data[:100]
    x += 1
# If you need persistent HTTP, you tweak the retrieve() function to be able to hold the connection open.
# After retrieving currently requested data You save opened connections in a global dict() with domains as keys.
# When the next retrieve is called and the domain already has an opened connection, you remove the connection from dict (to prevent any other retrieve grabbing it in the middle of nowhere), then you use it
# to send a new request if possible. (If it isn't timed out or something), if connection broke, you just open a new one.
# You will probably have to introduce some limits if you will be using multiple connections to same server at once.
# Like no more than 4 connections to same server at once, some delays between requests and so on.
# No matter which approach to multithreading you will choose (something like I propose or some other mechanism) thread safety is in trouble because HTTP is serialized protocol.
# You send a request, you await the answer. After you receive whole answer, then you can make a new request if HTTP/1.1 is used and connection is being kept alive.
# If your thread tries to send a new request during the data download a general mess will occur.
# So you design your system to open as much connections as possible, but always wait for one to be free before reusing it. Thats a trick here.
# As for any other part of requests being thread unsafe for some reason, well, you should check the code to see which calls exactly should be kept atomic and then use a lock. But don't put it in a block where major IO is occurring or it will be as you aren't using threads at all.
I'm basing this off of the sample from https://docs.python.org/3/library/concurrent.futures.html#id1.
I've updated the following:
data = future.result()
to this:
data = future.result(timeout=0.1)
The doc for concurrent.futures.Future.result states:
If the call hasn’t completed in timeout seconds, then a TimeoutError will be raised. timeout can be an int or float
(I know there is a timeout on the request, for 60, but in my real code I'm performing a different action that doesn't use a urllib request)
import concurrent.futures
import urllib.request
URLS = ['http://www.foxnews.com/',
'http://www.cnn.com/',
'http://europe.wsj.com/',
'http://www.bbc.co.uk/',
'http://some-made-up-domain.com/']
# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    conn = urllib.request.urlopen(url, timeout=timeout)
    return conn.read()
# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            # The below timeout isn't raising the TimeoutError.
            data = future.result(timeout=0.01)
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
TimeoutError is raised if I set it on the call to as_completed, but I need to set the timeout on a per Future basis, not all of them as a whole.
Update
Thanks @jme, that works with a single Future, but not with multiples using the below. Do I need to yield at the beginning of the functions to allow the build-up of the futures dict? From the docs it sounds like the calls to submit shouldn't block.
import concurrent.futures
import time
import sys
def wait():
    time.sleep(5)
    return 42
with concurrent.futures.ThreadPoolExecutor(4) as executor:
    waits = [wait, wait]
    futures = {executor.submit(w): w for w in waits}
    for future in concurrent.futures.as_completed(futures):
        try:
            future.result(timeout=1)
        except concurrent.futures.TimeoutError:
            print("Too long!")
            sys.stdout.flush()
        print(future.result())
The issue seems to be with the call to concurrent.futures.as_completed().
If I replace that with just a for loop, everything seems to work:
for wait, future in [(w, executor.submit(w)) for w in waits]:
    ...
I misinterpreted the doc for as_completed which states:
...yields futures as they complete (finished or were cancelled)...
as_completed will handle timeouts but as a whole, not on a per future basis.
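A per-future limit can be sketched by iterating over the futures directly and passing each one the time it has left. In this example the work function, the PER_TASK_LIMIT name and the short delays are all assumptions chosen to keep the demonstration quick.

```python
import concurrent.futures
import time


def work(seconds):
    """Stand-in task that just sleeps and returns its own duration."""
    time.sleep(seconds)
    return seconds


PER_TASK_LIMIT = 0.3  # seconds each task is allowed, counted from submission

outcomes = {}
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    submitted = {}
    for delay in (0.05, 0.8):
        submitted[executor.submit(work, delay)] = (delay, time.monotonic())

    # Iterate in submission order instead of using as_completed().
    for future, (delay, started) in submitted.items():
        remaining = max(0.0, PER_TASK_LIMIT - (time.monotonic() - started))
        try:
            outcomes[delay] = future.result(timeout=remaining)
        except concurrent.futures.TimeoutError:
            outcomes[delay] = "timed out"

print(outcomes)
```

The slow task is reported as timed out, but the executor still waits for it to finish when the with block exits, since a timeout on result() does not stop the worker thread.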
The exception is being raised in the main thread, you just aren't seeing it because stdout hasn't been flushed. Try for example:
import concurrent.futures
import time
import sys
def wait():
    time.sleep(5)
    return 42
with concurrent.futures.ThreadPoolExecutor(4) as executor:
    future = executor.submit(wait)
    try:
        future.result(timeout=1)
    except concurrent.futures.TimeoutError:
        print("Too long!")
        sys.stdout.flush()
    print(future.result())
Run this and you'll see "Too long!" appear after one second, but the interpreter will wait an additional four seconds for the threads to finish executing. Then you'll see 42 -- the result of wait() -- appear.
What does this mean? Setting a timeout doesn't kill the thread, and that's actually a good thing. What if the thread is holding a lock? If we kill it abruptly, that lock is never freed. No, it's much better to let the thread handle its own demise. Likewise, the purpose of future.cancel is to prevent a thread from starting, not to kill it.
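Letting the thread handle its own demise usually means cooperative cancellation. Here is a minimal sketch using threading.Event; the worker function and the short intervals are assumptions made for illustration.

```python
import threading
import time

stop = threading.Event()
progress = []


def worker():
    """Do small units of work, checking between them whether to stop."""
    while not stop.is_set():
        progress.append(time.monotonic())
        stop.wait(0.05)  # sleeps, but wakes immediately once stop is set


thread = threading.Thread(target=worker)
thread.start()
thread.join(timeout=0.3)  # give the worker its time budget
timed_out = thread.is_alive()
stop.set()     # ask the worker to finish cleanly
thread.join()  # returns promptly; the worker exits at a point of its own choosing

print("timed out:", timed_out, "| still running:", thread.is_alive())
```

The worker only stops between units of work, so any lock it takes inside a unit is released normally before it exits.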
I managed to code a rather silly bug that would make one of my request handlers run a very slow DB query.
Interestingly, I noticed that even long after siege had completed, Tornado was still churning through requests (sometimes 90s later). (Comment: I'm not 100% sure how Siege works, but I'm fairly sure it closed the connection.)
My question in two parts:
- Does Tornado cancel request handlers when client closes the connection?
- Is there a way to timeout request handlers in Tornado?
I read through the code and can't seem to find anything. Even though my request handlers are running asynchronously in the above bug the number of pending requests piled up to a level where it was slowing down the app and it would have been better to close out the connections.
Tornado does not automatically close the request handler when the client drops the connection. However, you can override on_connection_close to be alerted when the client drops, which would allow you to cancel the connection on your end. A context manager (or a decorator) could be used to handle setting a timeout for handling the request; use tornado.ioloop.IOLoop.add_timeout to schedule some method that times out the request to run after timeout as part of the __enter__ of the context manager, and then cancel that callback in the __exit__ block of the context manager. Here's an example demonstrating both of those ideas:
import time
import contextlib
from tornado.ioloop import IOLoop
import tornado.web
from tornado import gen

@gen.coroutine
def async_sleep(timeout):
    yield gen.Task(IOLoop.instance().add_timeout, time.time() + timeout)

@contextlib.contextmanager
def auto_timeout(self, timeout=2): # Seconds
    handle = IOLoop.instance().add_timeout(time.time() + timeout, self.timed_out)
    try:
        yield handle
    except Exception as e:
        print("Caught %s" % e)
    finally:
        IOLoop.instance().remove_timeout(handle)
        if not self._timed_out:
            self.finish()
        else:
            raise Exception("Request timed out") # Don't continue past this point

class TimeoutableHandler(tornado.web.RequestHandler):
    def initialize(self):
        self._timed_out = False

    def timed_out(self):
        self._timed_out = True
        self.write("Request timed out!\n")
        self.finish() # Connection to client closes here.
        # You might want to do other clean up here.

class MainHandler(TimeoutableHandler):
    @gen.coroutine
    def get(self):
        with auto_timeout(self): # We'll timeout after 2 seconds spent in this block.
            self.sleeper = async_sleep(5)
            yield self.sleeper
        print("writing") # get will abort before we reach here if we timed out.
        self.write("hey\n")

    def on_connection_close(self):
        # This isn't the greatest way to cancel a future, since it will not actually
        # stop the work being done asynchronously. You'll need to cancel that some
        # other way. Should be pretty straightforward with a DB connection (close
        # the cursor/connection, maybe?)
        self.sleeper.set_exception(Exception("cancelled"))

application = tornado.web.Application([
    (r"/test", MainHandler),
])
application.listen(8888)
IOLoop.instance().start()
Another solution to this problem is to use gen.with_timeout:
import logging
import time
import tornado.web
from tornado import gen
from tornado.util import TimeoutError

logger = logging.getLogger(__name__)

class MainHandler(tornado.web.RequestHandler):
    @gen.coroutine
    def get(self):
        try:
            # I'm using gen.sleep here but you can use any future in this place
            yield gen.with_timeout(time.time() + 2, gen.sleep(5))
            self.write("This will never be reached!!")
        except TimeoutError as te:
            logger.warning(repr(te))
            self.timed_out()

    def timed_out(self):
        self.write("Request timed out!\n")
I liked the approach of the contextlib solution, but I was always getting logging leftovers.
The native coroutine solution would be:
async def get(self):
    try:
        await gen.with_timeout(time.time() + 2, gen.sleep(5))
        self.write("This will never be reached!!")
    except TimeoutError as te:
        logger.warning(repr(te))
        self.timed_out()
I have code for reading an url like this:
from urllib2 import Request, urlopen
req = Request(url)
for key, val in headers.items():
    req.add_header(key, val)
res = urlopen(req, timeout = timeout)
# This line blocks
content = res.read()
The timeout works for the urlopen() call. But then the code gets to the res.read() call where I want to read the response data and the timeout isn't applied there. So the read call may hang almost forever waiting for data from the server. The only solution I've found is to use a signal to interrupt the read() which is not suitable for me since I'm using threads.
What other options are there? Is there an HTTP library for Python that handles read timeouts? I've looked at httplib2 and requests and they seem to suffer from the same issue. I don't want to write my own nonblocking network code using the socket module because I think there should already be a library for this.
Update: None of the solutions below are doing it for me. You can see for yourself that setting the socket or urlopen timeout has no effect when downloading a large file:
from urllib2 import urlopen
url = 'http://iso.linuxquestions.org/download/388/7163/http/se.releases.ubuntu.com/ubuntu-12.04.3-desktop-i386.iso'
c = urlopen(url)
c.read()
At least on Windows with Python 2.7.3, the timeouts are being completely ignored.
It's not possible for any library to do this without using some kind of asynchronous timer through threads or otherwise. The reason is that the timeout parameter used in httplib, urllib2 and other libraries sets the timeout on the underlying socket. And what this actually does is explained in the documentation.
SO_RCVTIMEO
Sets the timeout value that specifies the maximum amount of time an input function waits until it completes. It accepts a timeval structure with the number of seconds and microseconds specifying the limit on how long to wait for an input operation to complete. If a receive operation has blocked for this much time without receiving additional data, it shall return with a partial count or errno set to [EAGAIN] or [EWOULDBLOCK] if no data is received.
The key phrase is "without receiving additional data". A socket.timeout is only raised if not a single byte has been received for the duration of the timeout window. In other words, this is a timeout between received bytes.
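The behaviour is easy to reproduce locally. The sketch below is written for Python 3 and uses only the standard library. The `trickle_server` helper, the chunk count and the intervals are assumptions chosen for the demonstration. The server sends one byte every 0.1 s, the client socket has a 0.5 s timeout, and the read still runs for longer than 0.5 s without any timeout being raised:

```python
import socket
import threading
import time

def trickle_server(listener, chunks=8, interval=0.1):
    """Hypothetical slow server: send one byte every `interval` seconds."""
    conn, _ = listener.accept()
    with conn:
        for _ in range(chunks):
            conn.sendall(b"x")
            time.sleep(interval)

listener = socket.socket()
listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
listener.listen(1)
threading.Thread(target=trickle_server, args=(listener,), daemon=True).start()

client = socket.create_connection(listener.getsockname(), timeout=0.5)
start = time.monotonic()
received = b""
while True:
    chunk = client.recv(1024)     # every recv() gets a fresh 0.5 s window
    if not chunk:                 # empty bytes: the server closed the connection
        break
    received += chunk
elapsed = time.monotonic() - start
client.close()
listener.close()

print(len(received), "bytes received, elapsed longer than timeout:", elapsed > 0.5)
```

Because each gap between bytes is shorter than the socket timeout, the timeout never fires even though the whole transfer takes longer than it. A slow download that keeps trickling can therefore block far beyond the configured value.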
A simple function using threading.Timer could be as follows.
import httplib
import socket
import threading
def download(host, path, timeout = 10):
    content = None
    http = httplib.HTTPConnection(host)
    http.request('GET', path)
    response = http.getresponse()
    timer = threading.Timer(timeout, http.sock.shutdown, [socket.SHUT_RD])
    timer.start()
    try:
        content = response.read()
    except httplib.IncompleteRead:
        pass
    timer.cancel() # cancel on triggered Timer is safe
    http.close()
    return content
>>> host = 'releases.ubuntu.com'
>>> content = download(host, '/15.04/ubuntu-15.04-desktop-amd64.iso', 1)
>>> print content is None
True
>>> content = download(host, '/15.04/MD5SUMS', 1)
>>> print content is None
False
Other than checking for None, it's also possible to catch the httplib.IncompleteRead exception not inside the function, but outside of it. The latter case will not work though if the HTTP request doesn't have a Content-Length header.
I found in my tests (using the technique described here) that a timeout set in the urlopen() call also affects the read() call:
import urllib2 as u
c = u.urlopen('http://localhost/', timeout=5.0)
s = c.read(1<<20)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
File "/usr/lib/python2.7/httplib.py", line 561, in read
s = self.fp.read(amt)
File "/usr/lib/python2.7/httplib.py", line 1298, in read
return s + self._file.read(amt - len(s))
File "/usr/lib/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
socket.timeout: timed out
Maybe it's a feature of newer versions? I'm using Python 2.7 on a 12.04 Ubuntu straight out of the box.
One possible (imperfect) solution is to set the global socket timeout, explained in more detail here:
import socket
import urllib2
# timeout in seconds
socket.setdefaulttimeout(10)
# this call to urllib2.urlopen now uses the default timeout
# we have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
However, this only works if you're willing to globally modify the timeout for all users of the socket module. I'm running the request from within a Celery task, so doing this would mess up timeouts for the Celery worker code itself.
I'd be happy to hear any other solutions...
I'd expect this to be a common problem, and yet - no answers to be found anywhere... Just built a solution for this using timeout signal:
import urllib2
import socket
timeout = 10
socket.setdefaulttimeout(timeout)
import time
import signal
def timeout_catcher(signum, _):
    raise urllib2.URLError("Read timeout")

signal.signal(signal.SIGALRM, timeout_catcher)

def safe_read(url, timeout_time):
    signal.setitimer(signal.ITIMER_REAL, timeout_time)
    try:
        return urllib2.urlopen(url, timeout=timeout_time).read()
    finally:
        # always disarm the timer, whether the read succeeded or raised,
        # so that exceptions from urlopen are passed on with the timer reset
        signal.setitimer(signal.ITIMER_REAL, 0)
The credit for the signal part of the solution goes here btw: python timer mystery
Any asynchronous network library should let you enforce a total timeout on any I/O operation. For example, here's a gevent code example:
#!/usr/bin/env python2
import gevent
import gevent.monkey # $ pip install gevent
gevent.monkey.patch_all()
import urllib2
with gevent.Timeout(2): # enforce total timeout
    response = urllib2.urlopen('http://localhost:8000')
    encoding = response.headers.getparam('charset')
    print response.read().decode(encoding)
And here's asyncio equivalent:
#!/usr/bin/env python3
import asyncio
import aiohttp # $ pip install aiohttp

async def fetch_text(url):
    # the module-level aiohttp.get() is gone from current aiohttp releases;
    # go through a ClientSession instead
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

text = asyncio.run(
    asyncio.wait_for(fetch_text('http://localhost:8000'), timeout=2))
print(text)
The test http server is defined here.
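For readers without aiohttp installed, here is a standard-library-only sketch of the same idea. The `slow_handler` server, the `read_all` helper and the timings are assumptions made up for the demonstration. The server sends a byte every 0.1 s, so a per-read timeout would never fire, but asyncio.wait_for still enforces a total deadline on the whole read:

```python
import asyncio

async def slow_handler(reader, writer):
    """Hypothetical slow server: one byte every 0.1 s, about 2 s in total."""
    try:
        for _ in range(20):
            writer.write(b"x")
            await writer.drain()
            await asyncio.sleep(0.1)
    except ConnectionError:
        pass  # the client gave up and closed the connection
    finally:
        writer.close()

async def read_all(host, port):
    reader, writer = await asyncio.open_connection(host, port)
    try:
        return await reader.read()  # read until EOF
    finally:
        writer.close()

async def main():
    server = await asyncio.start_server(slow_handler, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    try:
        # Total deadline for connect + read, regardless of how data trickles in.
        await asyncio.wait_for(read_all("127.0.0.1", port), timeout=0.3)
        return "completed"
    except asyncio.TimeoutError:
        return "timed out"
    finally:
        server.close()

outcome = asyncio.run(main())
print(outcome)
```

wait_for cancels the pending coroutine when the deadline passes, so the `finally` clause in `read_all` still closes the client side of the connection.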
pycurl.TIMEOUT option works for the whole request:
#!/usr/bin/env python3
"""Test that pycurl.TIMEOUT does limit the total request timeout."""
import sys
import pycurl
timeout = 2 #NOTE: it does limit both the total *connection* and *read* timeouts
c = pycurl.Curl()
c.setopt(pycurl.CONNECTTIMEOUT, timeout)
c.setopt(pycurl.TIMEOUT, timeout)
c.setopt(pycurl.WRITEFUNCTION, sys.stdout.buffer.write)
c.setopt(pycurl.HEADERFUNCTION, sys.stderr.buffer.write)
c.setopt(pycurl.NOSIGNAL, 1)
c.setopt(pycurl.URL, 'http://localhost:8000')
c.setopt(pycurl.HTTPGET, 1)
c.perform()
The code raises the timeout error in ~2 seconds. I've tested the total read timeout with the server that sends the response in multiple chunks with the time less than the timeout between chunks:
$ python -mslow_http_server 1
where slow_http_server.py:
#!/usr/bin/env python
"""Usage: python -mslow_http_server [<read_timeout>]

Return an http response with *read_timeout* seconds between parts.
"""
import time
try:
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer, test
except ImportError: # Python 3
    from http.server import BaseHTTPRequestHandler, HTTPServer, test

def SlowRequestHandlerFactory(read_timeout):
    class HTTPRequestHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            n = 5
            data = b'1\n'
            self.send_response(200)
            self.send_header("Content-type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", n*len(data))
            self.end_headers()
            for i in range(n):
                self.wfile.write(data)
                self.wfile.flush()
                time.sleep(read_timeout)
    return HTTPRequestHandler

if __name__ == "__main__":
    import sys
    read_timeout = int(sys.argv[1]) if len(sys.argv) > 1 else 5
    test(HandlerClass=SlowRequestHandlerFactory(read_timeout),
         ServerClass=HTTPServer)
I've tested the total connection timeout with http://google.com:22222.
This isn't the behavior I see. I get a URLError when the call times out:
from urllib2 import Request, urlopen
req = Request('http://www.google.com')
res = urlopen(req,timeout=0.000001)
# Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# ...
# raise URLError(err)
# urllib2.URLError: <urlopen error timed out>
Can't you catch this error and then avoid trying to read res?
When I try to use res.read() after this I get NameError: name 'res' is not defined. Is something like this what you need:
try:
    res = urlopen(req,timeout=3.0)
except:
    print 'Doh!'
finally:
    print 'yay!'
    print res.read()
I suppose the way to implement a timeout manually is via multiprocessing, no? If the job hasn't finished you can terminate it.
Had the same issue with socket timeout on the read statement. What worked for me was putting both the urlopen and the read inside a try statement. Hope this helps!
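A minimal Python 3 sketch of that last suggestion, with both the open and the read inside one try block. The `fetch` name and its return-None-on-failure convention are assumptions for illustration. Note that this still only guards against gaps between received bytes, not against a slow but steady download:

```python
import socket
import urllib.error
import urllib.request

def fetch(url, timeout=3.0):
    """Return the response body as bytes, or None if the request failed."""
    try:
        # Both the connection attempt and the read happen inside the try,
        # so a timeout in either phase ends up in the same handler.
        with urllib.request.urlopen(url, timeout=timeout) as res:
            return res.read()
    except (urllib.error.URLError, socket.timeout) as exc:
        print("request failed: %r" % (exc,))
        return None
```

A caller can then treat None as "skip this site" instead of letting the exception escape from whichever call happened to time out.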
I'd better use the following sample codes to explain my problem:
while True:
    NewThread = threading.Thread(target = CheckSite, args = ("http://example.com", "http://demo.com"))
    NewThread.start()
    time.sleep(300)

def CheckSite(Url1, Url2):
    try:
        Response1 = urllib2.urlopen(Url1)
        Response2 = urllib2.urlopen(Url2)
        del Response1
        del Response2
    except Exception, reason:
        print "How should I delete Response1 and Response2 when exception occurs?"
        del Response1
        del Response2 #### You can't simply write this as Response2 might not even exist if exception shows up running Response1
I've written a really long script that checks the running status of different sites (response time and similar). As in the code above, I use a couple of threads to check the sites separately. As you can see, each thread makes several server requests, and of course you will get a 403 or similar every now and then. I always assumed those wasted connections (the ones with exceptions) would be collected by some kind of garbage collector in Python, so I just left them alone.
But when I checked my network monitor, I found those wasted connections still there, wasting resources. The longer the script runs, the more wasted connections appear. I really don't want to add a try-except clause around every server request just so that del response can be used in each except part to destroy the wasted connection. There has to be a better way to do this. Can anybody help me out?
What exactly do you expect "delete" to mean in this context, anyway, and what are you hoping to accomplish?
Python has automatic garbage collection. These objects are defined, further, in such a way that the connection will be closed whenever the garbage collector gets around to collecting the corresponding objects.
If you want to ensure that connections are closed as soon as you no longer need the object, you can use the with construct. For example:
def CheckSite(Url1, Url2):
    with urllib2.urlopen(Url1) as Response1:
        with urllib2.urlopen(Url2) as Response2:
            pass  # do stuff
I'd also suggest using the with statement in conjunction with the contextlib.closing function.
It should close the connection when it finishes the job or when it gets an exception.
Something like:
with contextlib.closing(urllib2.urlopen(url)) as response:
    pass
#del response #to assure the connection does not have references...
You should use Response1.close(). with doesn't work with urllib2.urlopen directly, but see the contextlib.closing example in the Python documentation.
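To show what contextlib.closing guarantees without needing a network connection, here is a small sketch. `FakeResponse` is a hypothetical stand-in for the object returned by urlopen; the only thing closing() requires of it is a close() method:

```python
import contextlib

class FakeResponse:
    """Hypothetical stand-in for a urlopen() response object."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

response = FakeResponse()
try:
    with contextlib.closing(response) as res:
        # Simulate a failure while the response is in use.
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

print(response.closed)
```

close() runs when the block is left, whether normally or through an exception, so no explicit del or close call is needed in an except clause.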
Connections can stay open for hours if not properly closed, even if the process creating them exits, due to the reliable packet delivery features of TCP.
You should not catch Exception; rather, you should catch URLError, as noted in the documentation.
If an exception isn't thrown, does the connection persist? Maybe what you're looking for is
Response1 = Response2 = None
try:
    Response1 = urllib2.urlopen(Url1)
    Response2 = urllib2.urlopen(Url2)
    Response1.close()
    Response2.close()
except URLError, reason:
    print "How should I delete Response1 and Response2 when exception occurs?"
    # close whichever responses were actually opened
    if Response2 is not None:
        Response2.close()
    if Response1 is not None:
        Response1.close()
But I don't understand why you're encapsulating both in a single try. I would do the following personally.
def CheckSites(Url1, Url2):
    try:
        Response1 = urllib2.urlopen(Url1)
    except URLError, reason:
        print "Response 1 failed"
        return

    try:
        Response2 = urllib2.urlopen(Url2)
    except URLError, reason:
        print "Response 2 failed"
        ## close Response1
        Response1.close()
        ## do something or don't based on 1 passing and 2 failing
        return

    print "Both responded"
    ## party time. rm -rf /
Note that this accomplishes the same thing because in your code, if Url1 fails, you'll never even try to open the Url2 connection.
** Side Note **
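Threading may be helping you less than you expect here. Because of the GIL, only one thread executes Python bytecode at a time, although threads that are blocked waiting on the network do release it. Within a single CheckSite call the two requests run sequentially anyway.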
http://dabeaz.blogspot.com/2009/08/inside-inside-python-gil-presentation.html
http://wiki.python.org/moin/GlobalInterpreterLock