I'm implementing a little service that fetches web pages from various servers. I need to be able to configure different types of timeouts. I've tried mucking around with the settimeout method of sockets but it's not exactly as I'd like it. Here are the problems.
I need to specify a timeout for the initial DNS lookup. I understand this is done when I instantiate the HTTPConnection at the beginning.
My code is written in such a way that I first .read a chunk of data (around 10 MB) and if the entire payload fits in this, I move on to other parts of the code. If it doesn't fit in this, I directly stream the payload out to a file rather than into memory. When this happens, I do an unbounded .read() to get the data and if the remote side sends me, say, a byte of data every second, the connection just keeps waiting receiving one byte every second. I want to be able to disconnect with a "you're taking too long". A thread based solution would be the last resort.
httplib is to straight forward for what you are looking for.
I would recommend to take a look for http://pycurl.sourceforge.net/ and the http://curl.haxx.se/libcurl/c/curl_easy_setopt.html#CURLOPTTIMEOUT option.
The http://curl.haxx.se/libcurl/c/curl_easy_setopt.html#CURLOPT_NOSIGNAL option sounds also interesting:
Consider building libcurl with c-ares support to enable asynchronous DNS lookups, which enables nice timeouts for name resolves without signals.
Have you tried requests?
You can set timeouts conveniently http://docs.python-requests.org/en/latest/user/quickstart/#timeouts
>>> requests.get('http://github.com', timeout=0.001)
EDIT:
I missed the part 2 of the question. For that you could use this:
import sys
import signal
import requests
class TimeoutException(Exception):
pass
def get_timeout(url, dns_timeout=10, load_timeout=60):
def timeout_handler(signum, frame):
raise TimeoutException()
signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(load_timeout) # triger alarm in seconds
try:
response = requests.get(url, timeout=dns_timeout)
except TimeoutException:
return "you're taking too long"
return response
and in your code use the get_timeout function.
If you need the timeout to be available for other functions you could create a decorator.
Above code from http://pguides.net/python-tutorial/python-timeout-a-function/.
Related
I'm using Tornado to send requests in rapid, periodic succession (every 0.1s or even 0.01s) to a server. For this, I'm using AsyncHttpClient.fetch with a callback to handle the response.
Here's a very simple code to show what I mean:
from functools import partial
from tornado import gen, locks, httpclient
from datetime import timedelta, datetime
# usually many of these running on the same thread, maybe requesting the same server
#gen.coroutine
def send_request(url, interval):
wakeup_condition = locks.Condition()
#using this to allow requests to send immediately
http_client = httpclient.AsyncHTTPClient(max_clients=1000)
for i in range(300):
req_time = datetime.now()
current_callback = partial(handle_response, req_time)
http_client.fetch(url, current_callback, method='GET')
yield wakeup_condition.wait(timeout=timedelta(seconds=interval))
def handle_response(req_time, response):
resp_time = datetime.now()
write_to_log(req_time, resp_time, resp_time - req_time) #opens the log and writes to it
When I was testing it against a local server, it was working fine, the requests were being sent on time, the round trip time was obviously minimal.
However, when I test it against a remote server, with larger round trip times (especially for higher request loads), the request timing gets messed up by multiple seconds: The period of wait between each request becomes much larger than the desired period.
How come? I thought the async code wouldn't be affected by the roundtrip time since it isn't blocking while waiting for the response. Is there any known solution to this?
After some tinkering and tcpdumping, I've concluded that two things were really slowing down my coroutine. With these two corrected stalling has gone down enormously drastically and the timeout in yield wakeup_condition.wait(timeout=timedelta(seconds=interval)) is much better respected:
The computer I'm running on doesn't seem to be caching DNS, which for AsyncHTTPClient seem to be a blocking network call. As such every coroutine sending requests has the added time to wait for the DNS to resolve. Tornado docs say:
tornado.httpclient in the default configuration blocks on DNS
resolution but not on other network access (to mitigate this use
ThreadedResolver or a tornado.curl_httpclient with a
properly-configured build of libcurl).
...and in the AsynHTTPClient docs
To select curl_httpclient, call AsyncHTTPClient.configure at startup:
AsyncHTTPClient.configure("tornado.curl_httpclient.CurlAsyncHTTPClient")
I ended up implementing my own thread which resolves and caches DNS, however, and that resolved the issue by issuing the request directly to the IP address.
The URL I was using was HTTPS, changing to a HTTP url improved performance. For my use case that's not always possible, but it's good to be able to localize part of the issue
TL;DR: I have a beautifully crafted, continuously running piece of Python code controlling and reading out a physics experiment. Now I want to add an HTTP API.
I have written a module which controls the hardware using USB. I can script several types of autonomously operating experiments, but I'd like to control my running experiment over the internet. I like the idea of an HTTP API, and have implemented a proof-of-concept using Flask's development server.
The experiment runs as a single process claiming the USB connection and periodically (every 16 ms) all data is read out. This process can write hardware settings and commands, and reads data and command responses.
I have a few problems choosing the 'correct' way to communicate with this process. It works if the HTTP server only has a single worker. Then, I can use python's multiprocessing.Pipe for communication. Using more-or-less low-level sockets (or things like zeromq) should work, even for request/response, but I have to implement some sort of protocol: send {'cmd': 'set_voltage', 'value': 900} instead of calling hardware.set_voltage(800) (which I can use in the stand-alone scripts). I can use some sort of RPC, but as far as I know they all (SimpleXMLRPCServer, Pyro) use some sort of event loop for the 'server', in this case the process running the experiment, to process requests. But I can't have an event loop waiting for incoming requests; it should be reading out my hardware! I googled around quite a bit, but however I try to rephrase my question, I end up with Celery as the answer, which mostly fires off one job after another, but isn't really about communicating with a long-running process.
I'm confused. I can get this to work, but I fear I'll be reinventing a few wheels. I just want to launch my app in the terminal, open a web browser from anywhere, and monitor and control my experiment.
Update: The following code is a basic example of using the module:
from pysparc.muonlab.muonlab_ii import MuonlabII
muonlab = MuonlabII()
muonlab.select_lifetime_measurement()
muonlab.set_pmt1_voltage(900)
muonlab.set_pmt1_threshold(500)
lifetimes = []
while True:
data = muonlab.read_lifetime_data()
if data:
print "Muon decays detected with lifetimes", data
lifetimes.extend(data)
The module lives at https://github.com/HiSPARC/pysparc/tree/master/pysparc/muonlab.
My current implementation of the HTTP API lives at https://github.com/HiSPARC/pysparc/blob/master/bin/muonlab_with_http_api.
I'm pretty happy with the module (with lots of tests) but the HTTP API runs using Flask's single-threaded development server (which the documentation and the internet tells me is a bad idea) and passes dictionaries through a Pipe as some sort of IPC. I'd love to be able to do something like this in the above script:
while True:
data = muonlab.read_lifetime_data()
if data:
print "Muon decays detected with lifetimes", data
lifetimes.extend(data)
process_remote_requests()
where process_remote_requests is a fairly short function to call the muonlab instance or return data. Then, in my Flask views, I'd have something like:
muonlab = RemoteMuonlab()
#app.route('/pmt1_voltage', methods=['GET', 'PUT'])
def get_data():
if request.method == 'PUT':
voltage = request.form['voltage']
muonlab.set_pmt1_voltage(voltage)
else:
voltage = muonlab.get_pmt1_voltage()
return jsonify(voltage=voltage)
Getting the measurement data from the app is perhaps less of a problem, since I could store that in SQLite or something else that handles concurrent access.
But... you do have an IO loop; it runs every 16ms.
You can use BaseHTTPServer.HTTPServer in such a case; just set the timeout attribute to something small. bascially...
class XmlRPCApi:
def do_something(self):
print "doing something"
server = SimpleXMLRPCServer(("localhost", 8000))
server.register_instance(XMLRpcAPI())
server.timeout = 0
while True:
sleep(0.016)
do_normal_thing()
x.handle_request()
Edit: python has a built in server, also built on BaseHTTPServer, capable of serving a flask app. since flask.Flask() happens to be a wsgi compliant application, your process_remote_requests() should look like this:
import wsgiref.simple_server
remote_server = wsgire.simple_server('localhost', 8000, app)
# app here is just your Flask() application!
# as before, set timeout to zero so that you can go right back
# to your event loop if there are no requests to handle
remote_server.timeout = 0
def process_remote_requests():
remote_server.handle_request()
This works well enough if you have only short running requests; but if you need to handle requests that may possibly take longer than your event loop's normal polling interval, or if you need to handle more requests than you have polls per unit of time, then you can't use this approach, exactly.
You don't necessarily need to fork off another process, though, You can potentially get by using a pool of workers in another thread. roughly:
import threading
import wsgiref.simple_server
remote_server = wsgire.simple_server('localhost', 8000, app)
POOL_SIZE = 10 # or some other value.
pool = [threading.Thread(target=remote_server.serve_forever) for dummy in xrange(POOL_SIZE)]
for thread in pool:
thread.daemon = True
thread.start()
while True:
pass # normal experiment processing here; don't handle requests in this thread.
However; this approach has one major shortcoming, you now have to deal with concurrency! It's not safe to manipulate your program state as freely as you could with the above loop, since you might be, concurrently manipulating that same state in the main thread (or another http server thread). It's up to you to know when this is valid, wrapping each resource with some sort of mutex lock or whatever is appropriate.
We have a network client based on asyncore with the user's network connection is embodied in a Dispatcher. The goal is for a user working from an interactive terminal to be able to enter network request commands which would go out to a server and eventually come back with an answer. The client is written to be asynchronous so that the user can start several requests on different servers all at once, collecting the results as they become available.
How can we allow the user to type in commands while we're going around a select loop? If we hit the select() call registered only as readable, then we'll sit there until we read data or timeout. During this (possibly very long) time user input will be ignored. If we instead always register as writable, we get a hot loop.
One bad solution is as follows. We run the select loop in its own thread and have the user inject input into a thread safe write Queue by invoking a method we define on our Dispatcher. Something like
def myConnection.enqueue(self, someData):
self.lock.acquire()
self.queue.put(someData)
self.lock.release()
We then only register as writable if the Queue is not emtpy
def writable(self):
return not self.queue.is_empty()
We would then specify a timeout for select() that's short on human scales but long for a computer. This way if we're in the select call registered only for reading when the user puts in new data, the loop will eventually run around again and find out that there's new data to write. This is a bad solution though because we might want to use this code for servers' client connections as well, in which case we don't want the dead time you get waiting for select() to time out. Again, I realize this is a bad solution.
It seems like the correct solution would be to bring the user input in through a file descriptor so that we can detect new input while sitting in the select call registered only as readable. Is there a way to do this?
NOTE: This is an attempt to simplify the question posted here
stdin is selectable. Put stdin into your dispatcher.
Also, I recommend Twisted for future development of any event-driven software. It is much more featureful than asyncore, has better documentation, a bigger community, performs better, etc.
I am trying to implement a basic lib to issue HTTP GET requests. My target is to receive data through socket connections - minimalistic design to improve performance - usage with threads, thread pool(s).
I have a bunch of links which I group by their hostnames, so here's a simple demonstration of input URLs:
hostname1.com - 500 links
hostname2.org - 350 links
hostname3.co.uk - 100 links
...
I intend to use sockets because of performance issues. I intend to use a number of sockets which keeps connected (if possible and it usually is) and issue HTTP GET requests. The idea came from urllib low performance on continuous requests, then I met urllib3, then I realized it uses httplib and then I decided to try sockets. So here's what I accomplished till now:
GETSocket class, SocketPool class, ThreadPool and Worker classes
GETSocket class is a minified, "HTTP GET only" version of Python's httplib.
So, I use these classes like that:
sp = Comm.SocketPool(host,size=self.poolsize, timeout=5)
for link in linklist:
pool.add_task(self.__get_url_by_sp, self.count, sp, link, results)
self.count += 1
pool.wait_completion()
pass
__get_url_by_sp function is a wrapper which calls sp.urlopen and saves the result to results list. I am using a pool of 5 threads which has a socket pool of 5 GETSocket classes.
What I wonder is, is there any other possible way that I can improve performance of this system?
I've read about asyncore here, but I couldn't figure out how to use same socket connection with class HTTPClient(asyncore.dispatcher) provided.
Another point, I don't know if I'm using a blocking or a non-blocking socket, which would be better for performance or how to implement which one.
Please be specific about your experiences, I don't intend to import another library to do just HTTP GET so I want to code my own tiny library.
Any help appreciated, thanks.
Do this.
Use multiprocessing. http://docs.python.org/library/multiprocessing.html.
Write a worker Process which puts all of the URL's into a Queue.
Write a worker Process which gets a URL from a Queue and does a GET, saving a file and putting the File information into another Queue. You'll probably want multiple copies of this Process. You'll have to experiment to find how many is the correct number.
Write a worker Process which reads file information from a Queue and does whatever it is that you're trying do.
I finally found a well chosen path to solve my problems. I was using Python 3 for my project and my only option was to use pycurl, so this made me have to port my project back to Python 2.7 series.
Using pycurl, I gained:
- Consistent responses to my requests (actually my script has to deal with minimum 10k URLs)
- With the usage of ThreadPool class I am receiving responses as fast as my system can (received data is processed later - so multiprocessing is not much of a possibility here)
I tried httplib2 first, I realized that it is not acting as solid as it acts on Python 2, by switching to pycurl I lost caching support.
Final conclusion: When it comes to HTTP communication, one could need a tool like (py)curl at his disposal. It is a lifesaver, especially when one is dealing with loads of URLs (try sometimes for fun: you will get lots of weird responses from them)
Thanks for the replies, folks.
I'm downloading a huge set of files with following code in a loop:
try:
urllib.urlretrieve(url2download, destination_on_local_filesystem)
except KeyboardInterrupt:
break
except:
print "Timed-out or got some other exception: "+url2download
If the server times-out on URL url2download when connection is just initiating, the last exception is handled properly. But sometimes server responded, and downloading is started, but the server is so slow, that it'll takes hours for even one file, and eventually it returns something like:
Enter username for Clients Only at albrightandomalley.com:
Enter password for in Clients Only at albrightandomalley.com:
and just hangs there (although no username/passworde is aksed if the same link is downloaded through the browser).
My intention in this situation would be -- skip this file and go to the next one. The question is -- how to do that? Is there a way in python to specify how long is OK to work on downloading one file, and if more time is already spent, interrupt, and go forward?
Try:
import socket
socket.setdefaulttimeout(30)
If you're not limited to what's shipped with python out of the box, then the urlgrabber module might come in handy:
import urlgrabber
urlgrabber.urlgrab(url2download, destination_on_local_filesystem,
timeout=30.0)
There's a discussion of this here. Caveats (in addition to the ones they mention): I haven't tried it, and they're using urllib2, not urllib (would that be a problem for you?) (Actually, now that I think about it, this technique would probably work for urllib, too).
This question is more general about timing out a function:
How to limit execution time of a function call in Python
I've used the method described in my answer there to write a wait for text function that times out to attempt an auto-login. If you'd like similar functionality you can reference the code here:
http://code.google.com/p/psftplib/source/browse/trunk/psftplib.py