Sending big files in Twisted - python

I have a really simple piece of code that lets me send an image from client to server, and it works.
As simple as this:
On the client side...
def sendFile(self):
    image = open(picname)
    data = image.read()
    self.transport.write(data)
On the server side...
def dataReceived(self, data):
    print 'Received'
    f = open("image.png", 'wb')
    f.write(data)
    f.close()
The problem is that this only works if the image is up to about 4 kB; it stops working when the image is bigger (at least it doesn't work once the image reaches 6 kB). That's when I see "Received" printed more than once, which makes me think the data is being split into smaller chunks. However, even though those chunks of data reach the server (as I can tell from the repeated print in dataReceived), the image is corrupted and can't be opened.
I don't know that much about protocols, but I assumed TCP was reliable, so packets arriving in a different order shouldn't... happen? So I was thinking that maybe Twisted is doing something there that I'm not aware of, and maybe I should use another protocol.
So here is my question: is there something I could do now to make this work, or should I switch to another protocol? If so, any ideas? My goal is to send bigger images, maybe on the order of hundreds of kB.

This is a variant of an entry in the Twisted FAQ:
Why is protocol.dataReceived called with only part of the data I called transport.write with?
TCP is a stream-based protocol. It is delivering a stream of bytes, which may be broken up into an arbitrary number of fragments. If you write one big blob of bytes, it may be broken up into an arbitrary number of smaller chunks, depending on the characteristics of your physical network connection. When you say that TCP should be "reliable", and that the chunks should arrive in order, you are roughly correct: however, what arrives in order is the bytes, not the chunks.
What you are doing in your dataReceived method is, upon receiving each chunk, opening a file and writing the contents of just that chunk to "image.png", then closing it. If you change it to open the file in connectionMade and close the file in connectionLost you should see at least vaguely the right behavior, although this will still cause you to get corrupted / truncated images if the connection is lost unexpectedly, with no warning. You should really use a framing protocol like AMP; although if you're just sending big blobs of data around, HTTP is probably a better choice.
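For concreteness, here is a minimal sketch of that first suggestion on the server side, assuming a plain Twisted Protocol; the class name, the hard-coded "image.png" path, and the port are illustrative, not taken from the question:

from twisted.internet import protocol, reactor

class ImageReceiver(protocol.Protocol):
    """Accumulate every chunk of the TCP stream into a single file."""

    def connectionMade(self):
        # Open the file once, when the client connects.
        self.outfile = open("image.png", "wb")

    def dataReceived(self, data):
        # Each call delivers an arbitrary slice of the byte stream;
        # append it rather than treating it as a whole image.
        self.outfile.write(data)

    def connectionLost(self, reason):
        # The client closing the connection marks the end of the image.
        self.outfile.close()

factory = protocol.Factory()
factory.protocol = ImageReceiver
reactor.listenTCP(8000, factory)
reactor.run()

As noted above, this still leaves you with a truncated file if the connection drops mid-transfer, which is why a framing protocol such as AMP, or plain HTTP, is the more robust choice.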

Related

Socket lose part of message in UDP

I am trying to send an image frame through a UDP socket with Python 2.7. The current frame I am trying to send is 921600 bytes (640 x 480, 3 bytes per pixel), and the buffer limit for UDP messages is 65507 bytes, so I need to split the message. Here is how I am doing it.
From client.py:
image_string = frame.tostring()  # frame is a multi-d numpy array
message_size = len(image_string)
sock.sendto(str(message_size), (HOST, PORT))  # First send the size
for i in xrange(0, message_size, 65507):  # Split it and send
    sock.sendto(image_string[i:i + 65507], (HOST, PORT))
sock.sendto("\n", (HOST, PORT))  # Mark the end to avoid hanging.
Here is how I am receiving it in server.py; I inserted some prints for debugging.
image_string = ""
data, addr = sock.recvfrom(1024)  # receive the image size
message_size = int(data)
print "Incoming image with size: " + data
for i in xrange(0, message_size, 65507):
    data, addr = sock.recvfrom(65507)
    image_string += data.strip()
    print "received part, image is now:", len(image_string)
print "End of image"
So I am reading the message the same way I send it, and it checks out in theory, but not in practice. Possibly because of some packet loss, the server is still stuck trying to read (blocked) after the client is done sending.
I know that UDP is unreliable and hard to work with, however I read that UDP is used in many video streaming applications, so I believe there should exist a solution to this problem, but I can not find it.
All help is appreciated, thanks.
Edit1: The reason I suspect packet loss is the problem is that every time I run the test, a different amount of the image has been received by the time the server hangs.
Edit2: I forgot to mention that I tried different chunk sizes while partitioning; 1024 and 500 bytes revealed no difference (5-20 bytes lost out of 921600). I should also mention that I am sending and receiving on localhost, which already keeps errors to a minimum.
I know that UDP is unreliable and hard to work with, however I read that UDP is used in many video streaming applications, so I believe there should exist a solution to this problem, but I can not find it.
Those guys can. They design their protocols knowing that data may be lost (or, on the contrary, arrive multiple times) and that it may arrive out of order, and their protocols/applications expect that.
You cannot simply cut your data into pieces and send them with UDP. You have to form each individual message in a way that gives it a meaning of its own. If it is a "stream", each message has to say where that particular piece of data is located in the stream, so that when your application receives it, it knows whether the piece can be handled, whether it is obsolete (arrived too late, or arrived already), whether it should be set aside in the hope that some preceding parts still arrive, or whether it is perhaps so unusable on its own that the application should send a direct request to the sender in order to get things synchronized again.
In the case of transferring an image - or a series of images - you could send the offset of the data with each chunk, and on the receiving side simply overwrite a fixed-size buffer (one that can hold an entire image) at the given offset whenever something arrives, then render the result. The buffer would then always contain some image, or at least a mixture of several images - or, in an extremely lucky case, a single "real" image.
EDIT: an example of 'evaluating' what to do with a packet: besides the offset, the number ('timestamp') of the image could be included too, and then the application could avoid overwriting a newer part of the image with something old, should some packet from the past (re)appear for any reason.
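A minimal sketch of that idea, assuming each datagram carries a small header with the frame number and byte offset; the struct layout, chunk size, and helper names (send_frame, handle_datagram) are illustrative choices, not part of the original code:

import struct

HEADER = struct.Struct("!IQ")  # frame number, byte offset (network byte order)
CHUNK = 60000                  # payload per datagram, comfortably under 65507

# Sender side: prefix every chunk with (frame_no, offset).
def send_frame(sock, addr, frame_no, image_string):
    for offset in xrange(0, len(image_string), CHUNK):
        chunk = image_string[offset:offset + CHUNK]
        sock.sendto(HEADER.pack(frame_no, offset) + chunk, addr)

# Receiver side: keep one fixed-size buffer and drop stale packets.
def handle_datagram(datagram, state):
    frame_no, offset = HEADER.unpack(datagram[:HEADER.size])
    payload = datagram[HEADER.size:]
    if frame_no < state['latest_frame']:
        return  # a packet from an older frame reappeared; ignore it
    state['latest_frame'] = frame_no
    state['buffer'][offset:offset + len(payload)] = payload

state = {'latest_frame': 0, 'buffer': bytearray(640 * 480 * 3)}

The receiver would call handle_datagram on every recvfrom result and render state['buffer'] whenever convenient; lost packets simply leave that region of the buffer holding data from an earlier frame, which is the "mixture of images" behavior described above.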
Why do you use the maximum buffer limit? The largest payload you can reliably send through UDP is 534 bytes. Sending more than that may cause fragmentation. If you are concerned about data loss, use TCP.

How to use dataReceived in Twisted?

I have implemented a server program using Twisted. I am using basic.lineReceiver with the method dataReceived to receive data from multiple clients. Also, I am using protocol.ServerFactory to keep track of connected clients. The server sends some commands to each connected client. Based on the response that the server gets from each client, it should perform some tasks. Thus, the best solution that came to my mind was to create a buffer for received messages as a Python list, and each time the functions on the server side want to know the response from a client, they access the last element of that client's buffer list.
This approach has turned out to be unreliable. The first issue is that since TCP streaming is used, sometimes messages merge (I can use a delimiter for this). Second, the received messages are sometimes not in their appropriate sequence. Third, the networking communication seems to be too slow, as when the server initially tries to access the last element of the buffered list, the list is empty (this shows that the last messages on the buffer might not be the response to the last sent commands).
Could you tell me what the best practice is for using dataReceived or its equivalents for the above problem? Thank you in advance.
EDIT 1: Answer - While I accept @Jean-Paul Calderone's answer, since I certainly learned from it, I would like to add that in my own research of Twisted's documentation I learned that, in order to avoid delays in the server's communications, one should use return at the end of the dataReceived() or lineReceived() functions; this solved part of my problem. The rest is explained in the answer.
I have implemented a server program using Twisted. I am using basic.lineReceiver with the method dataReceived to receive data from multiple clients.
This is a mistake - an unfortunately common one brought on by the mistaken use of inheritance in many of Twisted's protocol implementations as the mechanism for building up more and more sophisticated behaviors. When you use twisted.protocols.basic.LineReceiver, the dataReceived callback is not for you. LineReceiver.dataReceived is an implementation detail of LineReceiver. The callback for you is LineReceiver.lineReceived. LineReceiver.dataReceived looks like it might be for you - it doesn't start with an underscore or anything - but it's not. dataReceived is how LineReceiver receives information from its transport. It is one of the public methods of IProtocol - the interface between a transport and the protocol interpreting the data received over that transport. Yes, I just said "public method" there. The trouble is it's public for the benefit of someone else. This is confusing and perhaps not communicated as well as it could be. No doubt this is why it is a Frequently Asked Question.
This approach has turned out to be unreliable. The first issue is that since TCP streaming is used, sometimes messages merge (I can use a delimiter for this).
Use of dataReceived is why this happens. LineReceiver already implements delimiter-based parsing for you. That's why it's called "line" receiver - it receives lines separated by a delimiter. If you override lineReceived instead of dataReceived then you'll be called with each line that is received, regardless of how TCP splits things up or smashes them together.
Second, the received messages are sometimes not in their appropriate sequence.
TCP is a reliable, ordered, stream-oriented transport. "Ordered" means that bytes arrive in the same order they are sent. Put another way, when you write("x"); write("y") it is guaranteed that the receiver will receive "x" before they receive "y" (they may receive "x" and "y" in the same call to recv() but if they do, the data will definitely be "xy" and not "yx"; or they may receive the two bytes in two calls to recv() and if they do, the first recv() will definitely be "x" and the second will definitely be "y", not the other way around).
If bytes appear to be arriving in a different order than you sent them, there's probably another bug somewhere that makes it look like this is happening - but it actually isn't. Your platform's TCP stack is very likely very close to bug free and in particular it probably doesn't have TCP data re-ordering bugs. Likewise, this area of Twisted is extremely well tested and probably works correctly. This leaves a bug in your application code or a misinterpretation of your observations. Perhaps your code doesn't always append data to a list or perhaps the data isn't being sent in the order you expected.
Another possibility is that you are talking about the order in which data arrives across multiple separate TCP connections. TCP is only ordered over a single connection. If you have two connections, there are very few (if any) guarantees about the order in which data will arrive over them.
Third, the networking communication seems to be too slow, as when the server initially tries to access the last element of the buffered list, the list is empty (this shows that the last messages on the buffer might not be the response to the last sent commands).
What defines "too slow"? The network is as fast as the network is. If that's not fast enough for you, find a fatter piece of copper. It sounds like what you really mean here is that your server sometimes expects data to have arrived before that data actually arrives. This doesn't mean the network is too slow, though, it means your server isn't properly event driven. If you're inspecting a buffer and not finding the information you expected, it's because you inspected it before the occurrence of the event which informs you of the arrival of that information. This is why Twisted has all these callback methods - dataReceived, lineReceived, connectionLost, etc. When lineReceived is called, this is an event notification telling you that right now something happened which resulted in a line being available (and, for convenience, lineReceived takes one argument - an object representing the line which is now available).
If you have some code that is meant to run when a line has arrived, consider putting that code inside an implementation of the lineReceived method. That way, when it runs (in response to a line being received), you can be 100% sure that you have a line to operate on. You can also be sure that it will run as soon as possible (as soon as the line arrives) but no sooner.
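For example, here is a minimal sketch of driving the server's logic from lineReceived rather than from a shared buffer list; everything beyond the LineReceiver/sendLine machinery (the class name, port, and command strings) is made up for illustration, and the string literals assume Python 2 as in the question:

from twisted.internet import protocol, reactor
from twisted.protocols import basic

class CommandProtocol(basic.LineReceiver):
    """One instance per connected client; react to each response as it arrives."""

    def connectionMade(self):
        # Kick things off by sending the first command to this client.
        self.sendLine("STATUS")

    def lineReceived(self, line):
        # Called once per complete, delimiter-terminated line; no manual
        # splitting or merging of TCP chunks is needed here.
        self.handleResponse(line)

    def handleResponse(self, response):
        # Illustrative: decide the next command based on this response.
        if response == "READY":
            self.sendLine("START")

factory = protocol.ServerFactory()
factory.protocol = CommandProtocol
reactor.listenTCP(1234, factory)
reactor.run()

Because handleResponse runs in response to the line-received event, it never has to guess whether the buffer has been populated yet.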

Streaming data of unknown size from client to server over HTTP in Python

As my previous question unfortunately got closed for being an "exact copy" of a question, while it definitely IS NOT, here it is again.
It is NOT a duplicate of Python: HTTP Post a large file with streaming
That one deals with streaming a big file; I want to send arbitrary chunks of a file one by one to the same http connection. So I have a file of say 20 MB, and what I want to do is open an HTTP connection, then send 1 MB, send another 1 MB, etc, until it's complete. Using the same connection, so the server sees a 20 MB chunk appear over that connection.
Mmapping a file is what I ALSO intend to do, but that does not work when the data is read from stdin, and it is primarily for that second case that I am looking for this part-by-part feeding of data.
Honestly I wonder whether it can be done at all - if not, I'd like to know, then can close the issue. But if it can be done, how could it be done?
From the client’s perspective, it’s easy. You can use httplib’s low-level interface—putrequest, putheader, endheaders, and send—to send whatever you want to the server in chunks of any size.
But you also need to indicate where your file ends.
If you know the total size of the file in advance, you can simply include the Content-Length header, and the server will stop reading your request body after that many bytes. The code may then look like this.
import httplib
import os.path

total_size = os.path.getsize('/path/to/file')
infile = open('/path/to/file', 'rb')  # binary mode, since raw bytes are sent
conn = httplib.HTTPConnection('example.org')
conn.connect()
conn.putrequest('POST', '/upload/')
conn.putheader('Content-Type', 'application/octet-stream')
conn.putheader('Content-Length', str(total_size))
conn.endheaders()
while True:
    chunk = infile.read(1024)
    if not chunk:
        break
    conn.send(chunk)
resp = conn.getresponse()
If you don’t know the total size in advance, the theoretical answer is the chunked transfer encoding. Problem is, while it is widely used for responses, it seems less popular (although just as well defined) for requests. Stock HTTP servers may not be able to handle it out of the box. But if the server is under your control too, you could try manually parsing the chunks from the request body and reassembling them into the original file.
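If you do go the chunked route and control both ends, a sketch of emitting the chunked framing by hand over the same low-level httplib interface might look like this; the URL, chunk size, and use of stdin are placeholders, and the server-side parsing is assumed to exist:

import httplib
import sys

conn = httplib.HTTPConnection('example.org')
conn.connect()
conn.putrequest('POST', '/upload/')
conn.putheader('Content-Type', 'application/octet-stream')
conn.putheader('Transfer-Encoding', 'chunked')
conn.endheaders()

while True:
    chunk = sys.stdin.read(1024 * 1024)  # data of unknown total size
    if not chunk:
        break
    # Each chunk is framed as: <hex length>\r\n<bytes>\r\n
    conn.send('%x\r\n%s\r\n' % (len(chunk), chunk))
conn.send('0\r\n\r\n')  # a zero-length chunk terminates the body
resp = conn.getresponse()

Whether the server accepts this depends entirely on its support for chunked request bodies, which is exactly the caveat above.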
Another option is to send each chunk as a separate request (with Content-Length) over the same connection. But you still need to implement custom logic on the server. Moreover, you need to persist state between requests.
Added 2012-12-27. There’s an nginx module that converts chunked requests into regular ones. May be helpful so long as you don’t need true streaming (start handling the request before the client is done sending it).

How can I deserialize incoming data on the TCP server?

I've set up a server reusing the code found in the documentation where I have self.data = self.request.recv(1024).strip().
But how do I go from this to deserializing it into a protobuf message (Message.proto/Message_pb2.py)? Right now it seems to be receiving chunks of 1024 bytes, and more than one at a time... making it all rubbish :D
TCP is typically just a stream of data. Just because you sent each packet as a unit, doesn't mean the receiver gets that. Large messages may be split into multiple packets; small messages may be combined into a single packet.
The only way to interpret multiple messages over TCP is with some kind of "framing". With text-based protocols, a CR/LF/CRLF/zero-byte might signify the end of each frame, but that won't work with binary protocols like protobuf. In such cases, the most common approach is to simply prepend each message with the length, for example in a fixed-size (4 bytes?) network-byte-order chunk. Then the payload. In the case of protobuf, the API for your platform may also provide a mechanism to write the length as a "varint".
Then, reading is a matter of:
read an entire length-header
read (and buffer) that many bytes
process the buffered data
rinse and repeat
But keeping in mind that you might have (in a single packet) the end of one message, 2 complete messages, and the start of another message (maybe half of the length-header, just to make it interesting). So: keeping track of exactly what you are reading at any point becomes paramount.
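A rough sketch of that length-prefixed framing on the receiving side, assuming a fixed 4-byte big-endian length header and a generated Message class in Message_pb2; the MessageFramer class and its feed method are made up for illustration:

import struct

import Message_pb2

class MessageFramer(object):
    """Accumulate raw TCP data and return complete protobuf messages."""

    def __init__(self):
        self.buffer = b""

    def feed(self, data):
        # Append whatever the socket handed us; it may hold a partial
        # header, several whole messages, or any mix of the two.
        self.buffer += data
        messages = []
        while True:
            if len(self.buffer) < 4:
                break  # not even a full length header yet
            (length,) = struct.unpack("!I", self.buffer[:4])
            if len(self.buffer) < 4 + length:
                break  # header present, payload still incomplete
            payload = self.buffer[4:4 + length]
            self.buffer = self.buffer[4 + length:]
            msg = Message_pb2.Message()
            msg.ParseFromString(payload)
            messages.append(msg)
        return messages

In the handler from the question, each recv(1024) result would be passed to feed and only the returned complete messages acted upon; the sender would prepend struct.pack('!I', len(payload)) to every serialized message.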

Checking files retrieved by Twisted's FTPClient.retrieveFile method for completeness

I'm writing a custom FTP client to act as a gatekeeper for incoming multimedia content from subcontractors hired by one of our partners. I chose Twisted because it allows me to parse the file contents before writing the files to disk locally, and I've been looking for an occasion to explore Twisted anyway. I'm using twisted.protocols.ftp.FTPClient.retrieveFile to get the file, passing the escaped path to the file and a protocol to the retrieveFile method. I want to be absolutely sure that the entire file has been retrieved, because the event handler in the callback is going to write the file to disk locally and then delete the remote file from the FTP server, à la the '-E' switch behavior in the lftp client. My question is: do I really need to worry about this, or can I assume that an errback will happen if the file is not fully retrieved?
There are a couple unit tests for behavior in this area.
twisted.test.test_ftp.FTPClientTestCase.test_failedRETR is the most directly relevant one. It covers the case where the control and data connections are lost while a file transfer is in progress.
It seems to me that test coverage in this area could be significantly improved. There are no tests covering the case where just the data connection is lost while a transfer is in progress, for example. One thing that makes this tricky, though, is that FTP is not a very robust protocol. The end of a file transfer is signaled by the data connection closing. To be safe, you have to check to see if you received as many bytes as you expected to receive. The only way to perform this check is to know the file size in advance or ask the server for it using LIST (FTPClient.list).
Given all this, I'd suggest that when a file transfer completes, you always ask the server how many bytes you should have gotten and make sure it agrees with the number of bytes delivered to your protocol. You may sometimes get an errback on the Deferred returned from retrieveFile, but this will keep you safe even in the cases where you don't.
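A sketch of that suggestion, assuming you already know (for example from a prior FTPClient.list) how many bytes the remote file should be; the CountingFileWriter protocol and checkAndStore callback are illustrative, not part of Twisted's API:

from twisted.internet import protocol

class CountingFileWriter(protocol.Protocol):
    """Buffer the retrieved file and remember how many bytes arrived."""

    def __init__(self, expected_size):
        self.expected_size = expected_size
        self.received = 0
        self.chunks = []

    def dataReceived(self, data):
        self.received += len(data)
        self.chunks.append(data)

    def fileComplete(self):
        # Only trust the transfer if the byte count matches what the
        # server reported; otherwise treat it as truncated.
        return self.received == self.expected_size

def checkAndStore(result, writer, local_path):
    # Intended as a callback on the Deferred returned by retrieveFile.
    if not writer.fileComplete():
        raise RuntimeError("Truncated transfer: got %d of %d bytes"
                           % (writer.received, writer.expected_size))
    with open(local_path, "wb") as f:
        f.write("".join(writer.chunks))

Hooking it up would look roughly like writer = CountingFileWriter(expected_size); ftpClient.retrieveFile(remotePath, writer).addCallback(checkAndStore, writer, "local.bin"), with the expected size obtained via FTPClient.list or out-of-band knowledge, as the answer suggests. Only delete the remote file once checkAndStore has succeeded.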
