Python HTTP client with request pipelining - python

The problem: I need to send many HTTP requests to a server. I can only use one connection (non-negotiable server limit). The server's response time plus the network latency is too high – I'm falling behind.
The requests typically don't change server state and don't depend on the previous request's response. So my idea is simply to send them on top of each other, enqueue the response objects, and rely on the Content-Length: header of each incoming response to know where one reply ends and the next begins, feeding the data to the next-waiting response object. In other words: pipeline the requests to the server.
This is of course not entirely safe (any reply without Content-Length: means trouble), but I don't care -- in that case I can always retry any queued requests. (The safe way would be to wait for the header before sending the next request. That might help me enough. No way to test beforehand.)
So, ideally I want the following client code (which uses client delays to mimic network latency) to run in three seconds.
Now for the $64,000 question: is there a Python library which already does this, or do I need to roll my own? My code uses gevent; I could use Twisted if necessary, but Twisted's standard connection pool does not support pipelined requests. I could also write a wrapper for some C library if necessary, but I'd prefer native Python code.
#!/usr/bin/python
import gevent.pool
from gevent import sleep
from time import time
from geventhttpclient import HTTPClient

url = 'http://local_server/100k_of_lorem_ipsum.txt'
http = HTTPClient.from_url(url, concurrency=1)

def get_it(http):
    print time(), "Queueing request"
    response = http.get(url)
    print time(), "Expect header data"
    # Do something with the header, just to make sure that it has arrived
    # (the greenlet should block until then)
    assert response.status_code == 200
    assert int(response['content-length']) > 0
    for h in response.items():
        pass
    print time(), "Wait before reading body data"
    # Now I can read the body. The library should send at
    # least one new HTTP request during this time.
    sleep(2)
    print time(), "Reading body data"
    while response.read(10000):
        pass
    print time(), "Processing my response"
    # The next request should definitely be transmitted NOW.
    sleep(1)
    print time(), "Done"

# Run parallel requests
pool = gevent.pool.Pool(3)
for i in range(3):
    pool.spawn(get_it, http)
pool.join()
http.close()

Dugong is an HTTP/1.1-only client which claims to support real HTTP/1.1 pipelining. The tutorial includes several examples on how to use it, including one using threads and another using asyncio.
Be sure to verify that the server you're communicating with actually supports HTTP/1.1 pipelining—some servers claim to support HTTP/1.1 but don't implement pipelining.
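For illustration, here is a rough sketch of how pipelining with dugong could look, adapted from the pattern in its tutorial (Python 3; the hostname and path are placeholders, and the exact API names should be checked against the dugong docs). Note that queueing many large requests this way can fill the socket buffers, which is why the tutorial's own pipelining examples read responses concurrently from a thread or an asyncio task.
from dugong import HTTPConnection

conn = HTTPConnection('local_server')
paths = ['/100k_of_lorem_ipsum.txt'] * 3

# Queue all requests on the single connection without waiting for any response.
for path in paths:
    conn.send_request('GET', path)

# Read the responses back in the same order the requests were sent.
for path in paths:
    resp = conn.read_response()
    assert resp.status == 200
    body = conn.readall()

conn.disconnect()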

I think txrequests could get you most of what you are looking for, using the background_callback to enqueue processing of responses on a separate thread. Each request would still be its own thread, but using a session means that by default it would reuse the same connection.
https://github.com/tardyp/txrequests#working-in-the-background
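A rough sketch of that pattern, based on the "working in the background" example in the README linked above (the URL is a placeholder, and the constructor arguments and callback signature should be verified against the README itself):
from twisted.internet import defer, reactor
from txrequests import Session

@defer.inlineCallbacks
def main():
    with Session() as session:
        def bg_cb(sess, resp):
            # Runs on a worker thread as soon as the response arrives.
            resp.data = resp.json()

        # Fire off several requests; each returns a Deferred immediately.
        deferreds = [session.get('http://httpbin.org/get?i=%d' % i,
                                 background_callback=bg_cb)
                     for i in range(3)]
        for d in deferreds:
            response = yield d
            print(response.status_code, response.data)
    reactor.stop()

reactor.callWhenRunning(main)
reactor.run()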

It seems you are running Python 2.
With Python 3.5 or later you could use async/await with an event loop; see asyncio.
There is also a separate async library called Trio, available on pip, which aims to be easier to use.
Another thing I can think of is multiple threads with locks, though I'm not sure how well that would work here. A rough sketch of the async/await idea is below.
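As an illustration only, here is a minimal sketch of hand-rolling the Content-Length-based pipelining from the question on top of asyncio streams (Python >= 3.5). The host and path are placeholders, and the parsing assumes every response carries a Content-Length header:
import asyncio

HOST, PORT, PATH = 'local_server', 80, '/100k_of_lorem_ipsum.txt'

async def pipelined_get(n):
    reader, writer = await asyncio.open_connection(HOST, PORT)
    # Send all n requests back to back without waiting for any response.
    for _ in range(n):
        writer.write(('GET %s HTTP/1.1\r\nHost: %s\r\n'
                      'Connection: keep-alive\r\n\r\n' % (PATH, HOST)).encode('ascii'))
    await writer.drain()
    bodies = []
    # Read the responses in order, using Content-Length to find each boundary.
    for _ in range(n):
        header = await reader.readuntil(b'\r\n\r\n')
        length = 0
        for line in header.split(b'\r\n'):
            if line.lower().startswith(b'content-length:'):
                length = int(line.split(b':', 1)[1])
        bodies.append(await reader.readexactly(length))
    writer.close()
    return bodies

print(len(asyncio.get_event_loop().run_until_complete(pipelined_get(3))))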

Related

Does setting socket timeout cancel the initial request

I have a request that can only run once. At times, the request takes much longer than it should.
If I were to set a default socket timeout value (using socket.setdefaulttimeout(5)), and it took longer than 5 seconds, will the original request be cancelled so it's safe to retry (see example code below)?
If not, what is the best way to cancel the original request and retry it, while ensuring it never runs more than once?
import socket
from googleapiclient.discovery import build
from tenacity import retry, stop_after_attempt, wait_fixed, retry_if_exception_type

@retry(
    retry=retry_if_exception_type(socket.timeout),
    wait=wait_fixed(4),
    stop=stop_after_attempt(3)
)
def create_file_once_only(creds, body):
    service = build('drive', 'v3', credentials=creds)
    file = service.files().create(body=body, fields='id').execute()

socket.setdefaulttimeout(5)
create_file_once_only(creds, body)
It's unlikely that this can be made to work as you hope. An HTTP POST (as with any other HTTP request) is implemented by sending a command to the web server, then receiving a response. The python requests library encapsulates a lot of tedious parts of that for you, but at the core, it's going to do a socket send followed by a socket recv (it may of course require more than one send or recv depending on the size of the data).
Now, if you were able to connect to the web server initially (again, this is taken care of for you by the requests library but typically only takes a few milliseconds), then it's highly likely that the data in your POST request has long since been sent. (If the data you are sending is megabytes long, it's possible that it's only been partially sent, but if it is reasonably short, it's almost certainly been sent in full.)
That in turn means that in all likelihood the server has received your entire request and is working on it or has enqueued your request to work on it eventually. In either case, even if you break the connection to the server by timing out on the recv, it's unlikely that the server will actually even notice that until it gets to the point in its execution where it would be sending its response to your request. By that point, it has probably finished doing whatever it was going to do.
In other words, your socket timeout is not going to apply to the "HTTP request" -- it applies to the underlying socket operations instead -- and almost certainly to the recv part on the tail end. And just breaking the socket connection doesn't cancel the HTTP request.
There is no reliable way to do what you want without designing a transactional protocol with the close cooperation of the HTTP server.
With the cooperation of the HTTP server, though, you could do something approximating it:
Create a unique ID (UUID or the like)
Send a request to the server that contains that UUID along with the other account info (name, password, whatever else)
The server then only creates the account if it hasn't already created an account with the same unique ID.
That way, you can request the operation multiple times, but know that it will only actually be implemented once. If asked to do the same operation a second time, the server would simply respond with "yep, already did that".
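A small sketch of that idea using the requests library (the endpoint, field names, and timeout are made up for illustration; the server is assumed to deduplicate on request_id):
import uuid
import requests

body = {'name': 'example-account', 'request_id': str(uuid.uuid4())}
try:
    resp = requests.post('https://example.com/accounts', json=body, timeout=5)
except requests.exceptions.Timeout:
    # Retrying with the *same* request_id is safe: the server creates the
    # account at most once per request_id, so a duplicate is a no-op.
    resp = requests.post('https://example.com/accounts', json=body, timeout=5)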

HTTP Callback URL vs. WebSocket for asynchronous response?

I have two servers: Golang and Python (2.7). The Python (Bottle) server has a computation-intensive task to perform and exposes a RESTful URI to start the execution of the process. That is, the Go server sends:
HTTP GET to myserver.com/data
The python server performs the computation and needs to inform the Go server of the completion of the processing. There are two ways in which I see this can be designed:
Go sends a callback URL/data to Python and python responds by hitting that URL. E.g:
HTTP GET | myserver.com/data | Data{callbackURI:goserver.com/process/results, Type: POST, response:"processComplete"}
Have a WebSocket based response be sent back from Python to Go.
What would be a more suitable design? Are there pros/cons of doing one over the other? Other than error conditions (server crashed etc.,) the only thing that the Python server needs to actually "inform" the client is about completing the computation. That's the only response.
The team working on the Go server is not very well versed in writing a Go client based on WebSockets/AJAX (nor am I, and I've never written a single line of Go :)). Option #1 seems to be easier, but I'm not sure whether it's an accepted design approach or just a hack. What's the recommended way to proceed in this regard?
If you want to do it RESTful, then when the client requests HTTP GET myserver.com/data the server should return a 202 Accepted status code:
202 Accepted
The request has been accepted for processing, but the processing has not been completed. The request might or might not eventually be acted upon, as it might be disallowed when processing actually takes place. There is no facility for re-sending a status code from an asynchronous operation such as this.
The 202 response is intentionally non-committal. Its purpose is to allow a server to accept a request for some other process (perhaps a batch-oriented process that is only run once per day) without requiring that the user agent's connection to the server persist until the process is completed. The entity returned with this response SHOULD include an indication of the request's current status and either a pointer to a status monitor or some estimate of when the user can expect the request to be fulfilled.
The Python server could return an ETA and a URL to a temporary resource from which the current status of the operation can be requested (e.g. myserver.com/temp_data?processing_status). It is then up to the Go client to poll that resource and read the ETA until the task finishes. Once the processing is done, the Python server could return a 410 Gone status with the definitive URL of the new resource.
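A minimal Bottle sketch of that 202-plus-status-resource pattern (the route names, task store, and ETA values are illustrative only):
import uuid
from bottle import Bottle, response

app = Bottle()
tasks = {}  # task_id -> None while running, or the finished result

@app.get('/data')
def start_task():
    task_id = str(uuid.uuid4())
    tasks[task_id] = None
    # ... kick off the long computation in the background here ...
    response.status = 202
    return {'status_url': '/status/%s' % task_id, 'eta_seconds': 60}

@app.get('/status/<task_id>')
def poll_task(task_id):
    result = tasks.get(task_id)
    if result is None:
        return {'state': 'processing', 'eta_seconds': 30}
    # Finished: point the client at the final resource, as suggested above.
    response.status = 410
    return {'result_url': '/results/%s' % task_id}

app.run(host='localhost', port=8080)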
It depends on how often these signals are being sent. If it's many times per second, keeping a WebSocket open might make more sense. Otherwise, use option #1, since it will have less overhead and be more loosely coupled.

How to speed up an HTTP request

I need to get JSON data and I'm using urllib2:
import urllib2

request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'gzip')
opener = urllib2.build_opener()
connection = opener.open(request)
data = connection.read()
but although the data isn't very big, it is too slow.
Is there a way to speed it up? I can use 3rd party libraries too.
Accept-Encoding: gzip means that the client is ready to accept gzip-encoded content if the server is willing to send it. The rest of the request goes down the socket, through your operating system's TCP/IP stack, and onto the physical layer.
If the server supports ETags, then you can send an If-None-Match header to check whether the content has changed and rely on a local cache; a sketch follows below.
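A sketch of a conditional GET with urllib2 (Python 2); cached_etag and cached_data are assumed to have been saved from an earlier response, and a 304 reply means the cached copy is still valid:
import urllib2

request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'gzip')
request.add_header('If-None-Match', cached_etag)
try:
    connection = urllib2.build_opener().open(request)
    data = connection.read()
    cached_etag = connection.info().getheader('ETag')  # remember for next time
except urllib2.HTTPError as e:
    if e.code == 304:
        data = cached_data  # content unchanged, reuse the cached body
    else:
        raise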
Beyond that, there is not much you can do on the client side alone to improve the speed of an HTTP request.
You're dependent on a number of different things here that may not be within your control:
Latency/Bandwidth of your connection
Latency/Bandwidth of server connection
Load of server application and its individual processes
Items 2 and 3 are probably where the problem lies, and you won't be able to do much about them. Is the content cacheable? This will depend on your own application needs and the HTTP headers (e.g. ETags, Cache-Control, Last-Modified) that are returned from the server. The server may only update once a day, in which case you might be better off only requesting the data every hour.
It is unlikely that the issue is with urllib. If you have network issues and performance problems, consider using tools like Wireshark to investigate at the network level. I have very strong doubts that this is related to Python in any way.
If you are making lots of requests, look into threading. Having about 10 workers making requests can speed things up - you don't grind to a halt if one of them takes too long getting a connection.
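For example, a rough sketch of the worker-pool idea (Python 2 to match the question; urls is assumed to be a list of URLs to fetch):
import urllib2
from multiprocessing.dummy import Pool  # a thread pool, despite the module name

def fetch(url):
    request = urllib2.Request(url)
    request.add_header('Accept-Encoding', 'gzip')
    return urllib2.build_opener().open(request).read()

pool = Pool(10)                   # about 10 worker threads
results = pool.map(fetch, urls)   # one slow request no longer blocks the rest
pool.close()
pool.join()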

How can I make an http request without getting back an http response in Python?

I want to send it and forget it. The http rest service call I'm making takes a few seconds to respond. The goal is to avoid waiting those few seconds before more code can execute.
I'd rather not use python threads
I'll use twisted async calls if I must and ignore the response.
You are going to have to implement that asynchronously, since the HTTP protocol specifies that every request gets a reply.
Another option would be to work directly with the socket, bypassing any pre-built module. This would allow you to violate protocol and write your own bit that ignores any responses, in essence dropping the connection after it has made the request.
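As an illustration of that socket-level approach (host and path are placeholders): the request bytes are sent and the connection is closed without ever reading the reply.
import socket

def fire_and_forget(host, path, port=80):
    s = socket.create_connection((host, port), timeout=5)
    s.sendall(('GET %s HTTP/1.1\r\nHost: %s\r\n'
               'Connection: close\r\n\r\n' % (path, host)).encode('ascii'))
    s.close()  # drop the connection; whatever the server replies is ignored

fire_and_forget('example.com', '/slow-endpoint')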
HTTP implies a request and a reply for that request. Go with an async approach.
You do not need twisted for this, just urllib will do. See http://pythonquirks.blogspot.com/2009/12/asynchronous-http-request.html
I am copying the relevant code here but the credit goes to that link:
import urllib2

class MyHandler(urllib2.HTTPHandler):
    def http_response(self, req, response):
        return response

o = urllib2.build_opener(MyHandler())
o.open('http://www.google.com/')

Python: Asynchronous http requests sent in order with automatic handling of cookies?

I am coding a Python (2.6) interface to a web service. I need to communicate via HTTP in such a way that:
Cookies are handled automatically,
The requests are asynchronous,
The order in which the requests are sent is respected (the order in which the responses to these requests are received does not matter).
I have tried what could easily be derived from the built-in libraries, and ran into different problems:
Using httplib and urllib2, the requests are synchronous unless I use threads, in which case the order is not guaranteed to be respected,
Using asyncore, there was no library to automatically deal with the cookies sent by the web service.
After some googling, it seems that there are many examples of Python scripts or libraries that match 2 out of the 3 criteria, but not all three. I am thinking of reading through the cookielib sources and adapting what I need of it to asyncore (or only to my application, in an ad hoc manner), but it seems strange that nothing like this exists yet, as I guess I am not the only one interested. If anyone knows of pointers about this problem, it would be greatly appreciated.
Thank you.
Edit to clarify :
What I am doing is a local proxy that interfaces my IRC client with a webchat. It creates a socket that listens for IRC connections; upon receiving one, it logs in to the webchat via HTTP. I don't have access to the behaviour of the webchat, and it uses cookies for session IDs. When the client sends several IRC requests to my Python proxy, I have to forward them to the webchat's server via HTTP, with cookies. I also want to do this asynchronously (I don't want to wait for the HTTP response before I send the next request), and currently what happens is that the order in which the HTTP requests are sent is not the order in which the IRC commands were received.
I hope this clarifies the question, and I will of course detail more if it doesn't.
"Using httplib and urllib2, the requests are synchronous unless I use thread, in which case the order is not guaranteed to be respected"
How would you know that the order has been respected unless you get your response back from the first connection before you send the response to the second connection? After all, you don't care what order the responses come in, so it's very possible that the responses come back in the order you expect but that your requests were processed in the wrong order!
The only way you can guarantee the ordering is by waiting for confirmation that the first request has successfully arrived (e.g. you start receiving the response for it) before beginning the second request. You can do this by not launching the second thread until you reach the response-handling part of the first thread, as sketched below.
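A rough sketch of that idea with threads and a shared cookie-handling opener (Python 2; urls and handle_response are hypothetical placeholders):
import threading
import urllib2

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())  # shared cookie jar

def fetch(url, started_receiving):
    response = opener.open(url)       # blocks until the response headers arrive
    started_receiving.set()           # the next request may now be sent
    handle_response(response.read())  # process the body; arrival order no longer matters

threads = []
previous = None
for url in urls:
    if previous is not None:
        previous.wait()               # confirm the previous request reached the server
    event = threading.Event()
    t = threading.Thread(target=fetch, args=(url, event))
    t.start()
    threads.append(t)
    previous = event
for t in threads:
    t.join()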
