multiple send on httplib.HTTPConnection, and multiple read on HTTPResponse? - python

Should it be possible to send a plain, single http POST request (not chunk-encoded), in more than one segment? I was thinking of using httplib.HTTPConnection and calling the send method more than once (and calling read on the response object after each send).
(Context: I'm collaborating on the design of a server that offers services analogous to transactions [series of interrelated requests-responses]. I'm looking for the simplest, most compatible HTTP representation.)

After being convinced by friends that this should be possible, I found a way to do it. I override httplib.HTTPResponse (n.b. httplib.HTTPConnection is nice enough to let you specify the response_class it will instantiate).
Looking at socket.py and httplib.py (especially _fileobject.read()), I had noticed that read() only allowed 2 things:
read an exact number of bytes (this returns immediately, even if the connection is not closed)
read all bytes until the connection is closed
I was able to extend this behavior and allow free streaming with just a few lines of code. I also had to set the will_close member of my HTTPResponse to 0.
I'd still be interested to hear if this is considered acceptable or abusive usage of HTTP.
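For the plain part of the question (one non-chunked POST whose body is written in several pieces, with a single response read at the end), here is a minimal sketch. It assumes Python 3, where httplib is named http.client, and it uses a throwaway local server (the hypothetical EchoLengthHandler) only so that the example is self-contained. It does not reproduce the overridden HTTPResponse described above, and it does not read from the response between sends.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoLengthHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: reports how many body bytes it received."""

    def do_POST(self):
        length = int(self.headers["Content-Length"])
        body = self.rfile.read(length)
        reply = b"received %d bytes" % len(body)
        self.send_response(200)
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def post_in_segments(host, port, segments):
    """Send one POST whose body is written with several send() calls."""
    total = sum(len(segment) for segment in segments)
    conn = http.client.HTTPConnection(host, port, timeout=5)
    conn.putrequest("POST", "/")
    # The total length is declared up front, so no chunked encoding is needed.
    conn.putheader("Content-Length", str(total))
    conn.endheaders()
    for segment in segments:
        conn.send(segment)
    response = conn.getresponse()
    data = response.read()
    conn.close()
    return response.status, data


def demo():
    server = HTTPServer(("127.0.0.1", 0), EchoLengthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return post_in_segments("127.0.0.1", server.server_port,
                                [b"hello ", b"world"])
    finally:
        server.shutdown()
        server.server_close()
```

Because Content-Length is sent before any body bytes, the server sees an ordinary request no matter how many send() calls produced the body.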

Related

How to use dataReceived in Twisted?

I have implemented a server program using Twisted. I am using basic.LineReceiver with the method dataReceived to receive data from multiple clients. Also, I am using protocol.ServerFactory to keep track of connected clients. The server sends some commands to each connected client. Based on the response that the server gets from each client, it (the server) should perform some tasks. Thus, the best solution that came to my mind was to create a buffer for received messages as a python list, and each time that the functions at server side want to know the response from a client, they access the last element of the buffer list (of that client).
This approach has turned out to be unreliable. The first issue is that since TCP streaming is used, sometimes messages merge (I can use a delimiter for this). Second, the received messages are sometimes not in their appropriate sequence. Third, the networking communication seems to be too slow, as when the server initially tries to access the last element of the buffered list, the list is empty (this shows that the last messages on the buffer might not be the response to the last sent commands).
Could you tell me what the best practice is for using dataReceived or its equivalents in the above problem? Thank you in advance.
EDIT 1: Answer - While I accept @Jean-Paul Calderone's answer, since I certainly learned from it, I would like to add that in my own research of Twisted's documentation, I learned that in order to avoid delays in the server's communications, one should use return at the end of the dataReceived() or lineReceived() functions, and this solved part of my problem. The rest was explained in the answer.
I have implemented a server program using Twisted. I am using basic.LineReceiver with the method dataReceived to receive data from multiple clients.
This is a mistake - an unfortunately common one brought on by the mistaken use of inheritance in many of Twisted's protocol implementations as the mechanism for building up more and more sophisticated behaviors. When you use twisted.protocols.basic.LineReceiver, the dataReceived callback is not for you. LineReceiver.dataReceived is an implementation detail of LineReceiver. The callback for you is LineReceiver.lineReceived. LineReceiver.dataReceived looks like it might be for you - it doesn't start with an underscore or anything - but it's not. dataReceived is how LineReceiver receives information from its transport. It is one of the public methods of IProtocol - the interface between a transport and the protocol interpreting the data received over that transport. Yes, I just said "public method" there. The trouble is it's public for the benefit of someone else. This is confusing and perhaps not communicated as well as it could be. No doubt this is why it is a Frequently Asked Question.
This approach has turned out to be unreliable. The first issue is that since TCP streaming is used, sometimes messages merge (I can use a delimiter for this).
Use of dataReceived is why this happens. LineReceiver already implements delimiter-based parsing for you. That's why it's called "line" receiver - it receives lines separated by a delimiter. If you override lineReceived instead of dataReceived then you'll be called with each line that is received, regardless of how TCP splits things up or smashes them together.
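To make the idea of delimiter-based parsing concrete, here is a small pure-Python sketch of the technique. It is not Twisted's implementation; the class LineBuffer and its method names are made up for illustration. It only shows why a line callback is unaffected by how the byte stream was fragmented.

```python
class LineBuffer:
    """Toy stand-in for delimiter-based framing (not Twisted's actual code)."""

    def __init__(self, on_line, delimiter=b"\r\n"):
        self.buffer = b""
        self._on_line = on_line
        self._delimiter = delimiter

    def data_received(self, data):
        # Accumulate raw bytes, then hand every complete line to the callback.
        self.buffer += data
        while True:
            line, sep, rest = self.buffer.partition(self._delimiter)
            if not sep:
                break  # no complete line yet; wait for more data
            self.buffer = rest
            self._on_line(line)


lines = []
buf = LineBuffer(lines.append)
# The same two lines, split at arbitrary points as TCP might deliver them.
for fragment in (b"HEL", b"LO\r\nSTAT", b"US ok\r\nPA", b"RTIAL"):
    buf.data_received(fragment)

print(lines)       # [b'HELLO', b'STATUS ok']
print(buf.buffer)  # b'PARTIAL' (still waiting for its delimiter)
```

With the real LineReceiver you get this behavior by overriding lineReceived and leaving dataReceived alone.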
Second, the received messages are sometimes not in their appropriate sequence.
TCP is a reliable, ordered, stream-oriented transport. "Ordered" means that bytes arrive in the same order they are sent. Put another way, when you write("x"); write("y") it is guaranteed that the receiver will receive "x" before they receive "y" (they may receive "x" and "y" in the same call to recv() but if they do, the data will definitely be "xy" and not "yx"; or they may receive the two bytes in two calls to recv() and if they do, the first recv() will definitely be "x" and the second will definitely be "y", not the other way around).
If bytes appear to be arriving in a different order than you sent them, there's probably another bug somewhere that makes it look like this is happening - but it actually isn't. Your platform's TCP stack is very likely very close to bug free and in particular it probably doesn't have TCP data re-ordering bugs. Likewise, this area of Twisted is extremely well tested and probably works correctly. This leaves a bug in your application code or a misinterpretation of your observations. Perhaps your code doesn't always append data to a list or perhaps the data isn't being sent in the order you expected.
Another possibility is that you are talking about the order in which data arrives across multiple separate TCP connections. TCP is only ordered over a single connection. If you have two connections, there are very few (if any) guarantees about the order in which data will arrive over them.
Third, the networking communication seems to be too slow, as when the server initially tries to access the last element of the buffered list, the list is empty (this shows that the last messages on the buffer might not be the response to the last sent commands).
What defines "too slow"? The network is as fast as the network is. If that's not fast enough for you, find a fatter piece of copper. It sounds like what you really mean here is that your server sometimes expects data to have arrived before that data actually arrives. This doesn't mean the network is too slow, though, it means your server isn't properly event driven. If you're inspecting a buffer and not finding the information you expected, it's because you inspected it before the occurrence of the event which informs you of the arrival of that information. This is why Twisted has all these callback methods - dataReceived, lineReceived, connectionLost, etc. When lineReceived is called, this is an event notification telling you that right now something happened which resulted in a line being available (and, for convenience, lineReceived takes one argument - an object representing the line which is now available).
If you have some code that is meant to run when a line has arrived, consider putting that code inside an implementation of the lineReceived method. That way, when it runs (in response to a line being received), you can be 100% sure that you have a line to operate on. You can also be sure that it will run as soon as possible (as soon as the line arrives) but no sooner.
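As a sketch of what "event driven" can look like here: instead of polling a list for the latest message, remember which command is waiting for a reply and run its handler when the line arrives. This is pure Python with hypothetical names (CommandChannel, send_command, line_received), and it assumes each command gets exactly one line in reply, in order, on a single connection. In a Twisted program the body of line_received would live in your lineReceived override.

```python
from collections import deque


class CommandChannel:
    """Pairs each received line with the oldest command still awaiting a reply."""

    def __init__(self, send_line):
        self._send_line = send_line  # callable that writes one line to the peer
        self._pending = deque()      # (command, on_reply) in the order sent

    def send_command(self, command, on_reply):
        self._pending.append((command, on_reply))
        self._send_line(command)

    def line_received(self, line):
        # Called when a complete line has arrived, never before.
        if not self._pending:
            raise RuntimeError("line received with no command pending: %r" % (line,))
        command, on_reply = self._pending.popleft()
        on_reply(command, line)


sent = []
replies = []
channel = CommandChannel(sent.append)
channel.send_command(b"GET_STATUS", lambda cmd, line: replies.append((cmd, line)))
channel.send_command(b"GET_TIME", lambda cmd, line: replies.append((cmd, line)))

# Later, the network delivers the two replies.
channel.line_received(b"status: idle")
channel.line_received(b"time: 12:00")
print(replies)
```

The handler for a command never runs against an empty buffer, because it only runs in response to the arrival of its reply.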

What could cause the response body to be cut off (on client-side)?

I am writing a python language plugin for an active code generator that makes calls to our Rest API. After making many attempts to use the requests library and failing, I opted to use the much lower level socket and ssl modules, which have been working fine so far. I am using a very crude method to parse the responses; for fairly short responses in the body, this works fine, but I am now trying to retrieve much larger json objects (lists of users). The response is being cut off as follows (note: I removed a couple user entries for the sake of brevity):
{"page-start":1,"total":5,"userlist":[{"userid":"jim.morrison","first-name":"Jim","last-name":"Morrison","language":"English","timezone":"(GMT+5:30)CHENNAI,KOLKATA,MUMBAI,NEW DELHI","currency":"US DOLLAR","roles":
There should be a few more users after this and the response body is on a single line in the console.
Here is the code I am using to request the user list from the Rest API server:
import socket, ssl, json
host = self.WrmlClientSession.api_host
port = 8443
pem_file = "<pem file>"
url = self.WrmlClientSession.buildURI(host, port, '<root path>')
#Create the header
http_header = 'GET {0} HTTP/1.1\n\n'
req = http_header.format(url)
#Socket configuration and connection execution
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
conn = ssl.wrap_socket(sock, ca_certs = pem_file)
conn.connect((host, port))
conn.send(req)
response = conn.recv()
(headers, body) = response.split("\r\n\r\n")
#Here I would convert the body into a json object, but because the response is
#cut off, it cannot be properly decoded.
print(response)
Any insight into this matter would be greatly appreciated!
Edit: I forgot to mention that I debugged the response on the server-side, and everything was perfectly normal.
You can't assume that you can just call recv() once and get all the data since the TCP connection will only buffer a limited amount. Also, you're not parsing any of the headers to determine the size of body that you're expecting. You could use a non-blocking socket and keep reading until it blocks, which will mostly work but is simply not reliable and quite poor practice so I'm not going to document it here.
HTTP has ways of indicating the size of the body for exactly this reason and the correct approach is to use them if you want your code to be reliable. There are two things to look for. Firstly, if the HTTP response has a Content-Length then that indicates how many bytes will occur in the response body - you need to keep reading until you've had that much. The second option is that the server may send you a response which uses chunked encoding - it indicates this by including a Transfer-Encoding header whose value will contain the text chunked. I won't go into chunked encoding here, read the wikipedia article for details. In essence the body contains small headers for each "chunk" of data which indicate the size of that chunk. In this case you have to keep reading chunks until you get an empty one, which indicates the end of the response. This approach is used instead of Content-Length when the size of the response body isn't known by the server when it starts to send it.
Typically a server won't use both Content-Length and chunked encoding, but there's nothing to actually stop it so that's also something to consider. If you only need to interoperate with a specific server then you can just tell what it does and work with that, but be aware you'll be making your code less portable and more fragile to future changes.
Note that when using these headers, you'll still need to read in a loop because any given read operation may return incomplete data: recv() returns whatever has arrived so far, and TCP flow control makes the sender pause whenever the receiver's buffer is full until the reading application starts to empty it, so this isn't something you can work around. Also note that each read may not even contain a complete chunk, so you need to keep track of state about the size of the current chunk and the amount of it you've already seen. You only know to read the next chunk header when you've seen the number of bytes specified by the previous chunk header.
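A rough sketch of such a read loop follows, written for Python 3. The function name read_http_response and the helper make_fake_recv are made up for this example; recv is any callable with the signature of socket.recv(bufsize). It handles only the two cases discussed above (Content-Length and chunked, with chunked taking precedence if both are present) and ignores everything else a real client has to deal with, such as responses delimited by connection close, 1xx/204/304 responses, and trailers with content.

```python
def read_http_response(recv, bufsize=4096):
    """Read one response via recv(bufsize); return (status_line, headers, body)."""
    buffer = b""

    def fill():
        nonlocal buffer
        data = recv(bufsize)
        if not data:
            raise ConnectionError("connection closed before the response was complete")
        buffer += data

    def take(count):
        # Keep reading until `count` bytes are available, then consume them.
        nonlocal buffer
        while len(buffer) < count:
            fill()
        data, buffer = buffer[:count], buffer[count:]
        return data

    def take_line():
        # Keep reading until a full CRLF-terminated line is available.
        nonlocal buffer
        while b"\r\n" not in buffer:
            fill()
        line, _, buffer = buffer.partition(b"\r\n")
        return line

    status_line = take_line().decode("iso-8859-1")
    headers = {}
    while True:
        line = take_line().decode("iso-8859-1")
        if not line:
            break  # blank line ends the header section
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    if "chunked" in headers.get("transfer-encoding", "").lower():
        body = b""
        while True:
            # Chunk header: hex size, optionally followed by ";extensions".
            size = int(take_line().split(b";")[0].strip(), 16)
            if size == 0:
                while take_line():
                    pass  # skip trailer lines up to the final blank line
                break
            body += take(size)
            take(2)  # CRLF that follows each chunk's data
    elif "content-length" in headers:
        body = take(int(headers["content-length"]))
    else:
        raise ValueError("no Content-Length or chunked encoding; not handled in this sketch")
    return status_line, headers, body


def make_fake_recv(data, step):
    """Hypothetical test helper: hands out `data` at most `step` bytes per call."""
    pieces = iter([data[i:i + step] for i in range(0, len(data), step)])
    return lambda bufsize: next(pieces, b"")


plain = b"HTTP/1.1 200 OK\r\nContent-Length: 11\r\n\r\nhello world"
chunked = (b"HTTP/1.1 200 OK\r\nTransfer-Encoding: chunked\r\n\r\n"
           b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n")
print(read_http_response(make_fake_recv(plain, 7))[2])    # b'hello world'
print(read_http_response(make_fake_recv(chunked, 3))[2])  # b'hello world'
```

With a real connection you would pass the socket's own recv method, for example read_http_response(conn.recv).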
Of course, you don't have to worry about any of this if you use any of Python's myriad of HTTP libraries. Speaking as someone who's had to implement a fairly complete HTTP/1.1 client before, you really want to let someone else do it if you possibly can - there's quite a few tricky corner cases to consider and your simple code above is going to fail in a lot of cases. If requests doesn't work for you, have you tried any of the standard Python libraries? There's urllib and urllib2 for higher level interfaces and httplib provides a lower-level approach which you might find allows you to work around some of your problems.
Remember that you can always modify the code in these (after copying to your local repository of course) if you really have to fix issues, or possibly just import them and monkey-patch your changes in. You'd have to be quite clear it was an issue in the library and not just a mistaken use of it, though.
If you really want to implement an HTTP client that's fine, but just be aware that it's harder than it looks.
As a final aside, I've always used the read() method of SSL sockets instead of recv() - I'd hope they'd be equivalent, but you may wish to try that if you're still having issues.

Simple server/client string exchange protocol

I am looking for an abstract and clean way to exchange strings between two Python programs. The protocol is really simple: client/server sends a string to the server/client and it takes the corresponding action - via a handler, I suppose - and replies OR NOT to the other side with another string. Strings can be three things: an acknowledgement, signalling one side that the other one is still alive; a pickled class containing a command, if going from the "client" to the "server", or a response, if going from the "server" to the "client"; and finally a "lock" command, that signals a side of the conversation that the other is working and no further questions should be asked until another lock packet is received.
I have been looking at Python's built-in SocketServer.TCPServer, but it's way too low level, it does not easily support reconnection and the client has to use the socket interface, which I would prefer to be encapsulated.
I then explored the Twisted framework, particularly the LineOnlyReceiver protocol and server examples, but I found the initial learning curve to be too steep, the online documentation assuming a little too much knowledge and a general lack of examples and good documentation (except the 2005 O'Reilly book, is this still valid?).
I then tried the pyliblo library, which is perfect for the task; alas, it is unidirectional: there is no way to "answer" a client, and I need the answer to be associated with the specific command.
So my question is: is there an existing framework/library/module that allows me to have a client object in the server, to read the commands from and send the replies to, and a server object in the client, to read the replies from and send the commands to, that I can use after a simple setup (client, the server address is host:port; server, you are listening on port X), having the underlying socket, reconnection engine and so on handled?
Thanks in advance for any answer (pardon my English and inexperience, this is my first question).
Python also provides an asynchat module that simplifies much of the server/client behavior common to chat-like communications.
What you want to do seems a lot like RPC, so the things that come to my mind are XMLRPC or JSON RPC, if you don't want to use XML.
Python has an XMLRPC library that you can use; it uses HTTP as the transport, so it also solves your problem of not being too low level. However, if you could provide more detail in terms of what exactly you want to do, perhaps we can give a better solution.
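As a rough illustration of the XMLRPC suggestion, here is a minimal sketch using the Python 3 module names (xmlrpc.server and xmlrpc.client; in Python 2 these were SimpleXMLRPCServer and xmlrpclib). The handler handle_command, the exposed method name "command" and the reply dictionary layout are all made up for the example. Each call returns the reply to that specific command, which gives the command/response association asked for in the question.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def handle_command(name, argument):
    """Hypothetical server-side handler: one reply per command."""
    if name == "echo":
        return {"ok": True, "result": argument}
    return {"ok": False, "result": "unknown command: " + name}


def run_demo():
    # Port 0 lets the OS pick a free port; a real server would use a fixed one.
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(handle_command, "command")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        with ServerProxy("http://127.0.0.1:%d" % port) as proxy:
            # Each call blocks until the reply for that command comes back.
            return proxy.command("echo", "hello"), proxy.command("lock", "")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

Note that this is strictly request/response: the server cannot push a message to the client on its own, so an unsolicited "lock" notification would need polling or a second channel.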

Checking files retrieved by Twisted's FTPClient.retrieveFile method for completeness

I'm writing a custom ftp client to act as a gatekeeper for incoming multimedia content from subcontractors hired by one of our partners. I chose twisted because it allows me to parse the file contents before writing the files to disk locally, and I've been looking for occasion to explore twisted anyway. I'm using 'twisted.protocols.ftp.FTPClient.retrieveFile' to get the file, passing the escaped path to the file, and a protocol to the 'retrieveFile' method. I want to be absolutely sure that the entire file has been retrieved because the event handler in the callback is going to write the file to disk locally, then delete the remote file from the ftp server, à la the '-E' switch behavior of the lftp client. My question is, do I really need to worry about this, or can I assume that an errback will happen if the file is not fully retrieved?
There are a couple unit tests for behavior in this area.
twisted.test.test_ftp.FTPClientTestCase.test_failedRETR is the most directly relevant one. It covers the case where the control and data connections are lost while a file transfer is in progress.
It seems to me that test coverage in this area could be significantly improved. There are no tests covering the case where just the data connection is lost while a transfer is in progress, for example. One thing that makes this tricky, though, is that FTP is not a very robust protocol. The end of a file transfer is signaled by the data connection closing. To be safe, you have to check to see if you received as many bytes as you expected to receive. The only way to perform this check is to know the file size in advance or ask the server for it using LIST (FTPClient.list).
Given all this, I'd suggest that when a file transfer completes, you always ask the server how many bytes you should have gotten and make sure it agrees with the number of bytes delivered to your protocol. You may sometimes get an errback on the Deferred returned from retrieveFile, but this will keep you safe even in the cases where you don't.
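The byte-count check can be sketched roughly as follows. This is plain Python so that it runs without Twisted installed; in a real client the class would subclass twisted.internet.protocol.Protocol and be passed to retrieveFile, and expected_size would come from the server (for example, parsed from the LIST output). The names CountingReceiver and transfer_is_complete are hypothetical.

```python
import io


class CountingReceiver:
    """Counts the bytes delivered to dataReceived while passing them to a sink."""

    def __init__(self, sink):
        self.sink = sink
        self.bytes_received = 0

    def dataReceived(self, data):
        # Same method name a Twisted protocol uses to receive transfer data.
        self.bytes_received += len(data)
        self.sink.write(data)


def transfer_is_complete(receiver, expected_size):
    """True only if the delivered byte count matches what the server reported."""
    return receiver.bytes_received == expected_size


receiver = CountingReceiver(io.BytesIO())
for piece in (b"abc", b"defg"):
    receiver.dataReceived(piece)

print(transfer_is_complete(receiver, 7))   # True
print(transfer_is_complete(receiver, 10))  # False: do not delete the remote file
```

Only when the counts agree would the callback go on to delete the remote file.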

Python Sockets - Creating a message format

I have built a Python server to which various clients can connect, and I need to set a predefined series of messages from clients to the server - For example the client passes in a name to the server when it first connects.
I was wondering what the best way to approach this is? How should I build a simple protocol for their communication?
Should the messages start with a specific set of bytes to mark them out as part of this protocol, then contain some sort of message id? Any suggestions or further reading appreciated.
First, you need to decide whether you want your protocol to be human readable (much more overhead) or binary. If the first, you probably want to use regular expressions to decode the messages. For this, use the python module re. If the latter, the module struct is your friend.
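For the binary option, a minimal sketch with struct might look like this. The layout is invented for illustration: a 2-byte magic marker (b"MP"), a 1-byte message id and a 4-byte big-endian payload length, followed by the payload. It is not a standard format.

```python
import struct

# Assumed layout: magic (2 bytes), message id (1 byte), payload length (4 bytes).
HEADER = struct.Struct("!2sBI")
MAGIC = b"MP"


def pack_message(message_id, payload):
    """Prefix the payload with the fixed-size header."""
    return HEADER.pack(MAGIC, message_id, len(payload)) + payload


def unpack_messages(buffer):
    """Return (list of (message_id, payload), leftover bytes not yet complete)."""
    messages = []
    while len(buffer) >= HEADER.size:
        magic, message_id, length = HEADER.unpack_from(buffer)
        if magic != MAGIC:
            raise ValueError("stream is not aligned on a message boundary")
        end = HEADER.size + length
        if len(buffer) < end:
            break  # payload not fully received yet
        messages.append((message_id, buffer[HEADER.size:end]))
        buffer = buffer[end:]
    return messages, buffer


stream = pack_message(1, b"alice") + pack_message(2, b"hello")[:9]
messages, leftover = unpack_messages(stream)
print(messages)       # [(1, b'alice')]
print(len(leftover))  # 9 bytes of the second, incomplete message
```

The length field is what lets the receiver find message boundaries in a TCP stream; the magic bytes only help detect a desynchronized stream.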
Second, if you are building a protocol that is somehow stateful (e.g. first we do a handshake, then we transfer data, then we check checksums and say goodbye) you probably want to create some sort of FSM (finite-state machine) to track the state.
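A state machine for such a protocol can be as small as a transition table. The states and message types below (HELLO, DATA, BYE) are hypothetical and only illustrate the handshake / transfer / goodbye shape described above.

```python
from enum import Enum, auto


class State(Enum):
    HANDSHAKE = auto()
    TRANSFER = auto()
    CLOSED = auto()


# (current state, message type) -> next state; anything else is a protocol error.
TRANSITIONS = {
    (State.HANDSHAKE, "HELLO"): State.TRANSFER,
    (State.TRANSFER, "DATA"): State.TRANSFER,
    (State.TRANSFER, "BYE"): State.CLOSED,
}


class Session:
    """Tracks the protocol state of one connection."""

    def __init__(self):
        self.state = State.HANDSHAKE

    def handle(self, message_type):
        try:
            self.state = TRANSITIONS[(self.state, message_type)]
        except KeyError:
            raise ValueError(
                "%s not allowed in state %s" % (message_type, self.state.name)
            ) from None
        return self.state


session = Session()
for message_type in ("HELLO", "DATA", "DATA", "BYE"):
    session.handle(message_type)
print(session.state)  # State.CLOSED
```

Keeping the allowed transitions in one table makes it easy to reject out-of-order messages, such as DATA before the handshake.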
Third, if protocol design is not a familiar subject, read some simple protocol specifications, for example those published by the IETF.
If this is not a learning exercise, you might want to build on top of something else, like Python Twisted.
Depending on the requirements, you might want to consider using JSON: use "newline" terminated strings with JSON encoding. The transport protocol could be HTTP: with this, you could have access to all the "connection related" facilities (e.g. status codes) and have JSON encoded payload.
The advantages of using JSON over HTTP:
human readable (debugging etc.)
libraries available for all languages/platforms
cross-platform
browser debuggable (to some extent)
Of course, there are many other ways to skin this cat but the time to working prototype using this approach is very low. This is worth considering if your requirements (which aren't very detailed here) can be met.
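The "newline terminated strings with JSON encoding" idea can be sketched on its own, independent of the transport. The function names and the message fields ("type", "name") are made up for the example. json.dumps escapes any newline inside a string value, so a raw newline can safely be used as the message terminator.

```python
import json


def encode_message(message):
    """Serialize one message as a single JSON line terminated by a newline."""
    return json.dumps(message, separators=(",", ":")).encode("utf-8") + b"\n"


def decode_messages(buffer):
    """Return (decoded complete messages, leftover bytes of an incomplete line)."""
    *complete, remainder = buffer.split(b"\n")
    return [json.loads(line) for line in complete if line], remainder


stream = encode_message({"type": "hello", "name": "alice"})
stream += encode_message({"type": "ping"})[:5]  # second message only partly arrived

messages, leftover = decode_messages(stream)
print(messages)  # [{'type': 'hello', 'name': 'alice'}]
print(leftover)  # b'{"typ'
```

The leftover bytes are kept and prepended to whatever the next read returns, exactly as with any other delimiter-based framing.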
Read some protocols, and try to find one that looks like what you need. Does it need to be message-oriented or stream-oriented? Does it need request order to be preserved, does it need requests to be paired with responses? Do you need message identifiers? Retries, back-off? Is it an RPC protocol, a message queue protocol?
See http://www.faqs.org/docs/artu/ch05s02.html and http://www.faqs.org/docs/artu/ch05s03.html for a good overview and discussion on data file formats and protocols.
