I'm writing a custom FTP client to act as a gatekeeper for incoming multimedia content from subcontractors hired by one of our partners. I chose Twisted because it allows me to parse the file contents before writing the files to disk locally, and I've been looking for an occasion to explore Twisted anyway. I'm using 'twisted.protocols.ftp.FTPClient.retrieveFile' to get the file, passing the escaped path to the file and a protocol to the 'retrieveFile' method. I want to be absolutely sure that the entire file has been retrieved, because the handler in the callback is going to write the file to disk locally and then delete the remote file from the FTP server, à la the '-E' switch behavior in the lftp client. My question is: do I really need to worry about this, or can I assume that an errback will fire if the file is not fully retrieved?
There are a couple unit tests for behavior in this area.
twisted.test.test_ftp.FTPClientTestCase.test_failedRETR is the most directly relevant one. It covers the case where the control and data connections are lost while a file transfer is in progress.
It seems to me that test coverage in this area could be significantly improved. There are no tests covering the case where just the data connection is lost while a transfer is in progress, for example. One thing that makes this tricky, though, is that FTP is not a very robust protocol. The end of a file transfer is signaled by the data connection closing. To be safe, you have to check to see if you received as many bytes as you expected to receive. The only way to perform this check is to know the file size in advance or ask the server for it using LIST (FTPClient.list).
Given all this, I'd suggest that when a file transfer completes, you always ask the server how many bytes you should have gotten and make sure it agrees with the number of bytes delivered to your protocol. You may sometimes get an errback on the Deferred returned from retrieveFile, but this will keep you safe even in the cases where you don't.
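For what it's worth, here is a rough sketch of that check. It assumes the consumer protocol simply counts bytes, that LIST on the file's path returns a single entry, and that FTPFileListProtocol is used to parse the listing; adapt it to however you are actually buffering the data:

from twisted.internet import defer, protocol
from twisted.protocols.ftp import FTPFileListProtocol

class ByteCounter(protocol.Protocol):
    # Consumer passed to retrieveFile; counts every byte delivered.
    def __init__(self):
        self.received = 0

    def dataReceived(self, data):
        self.received += len(data)
        # ... buffer or write the data here as you see fit ...

@defer.inlineCallbacks
def retrieveAndVerify(ftpClient, remotePath):
    counter = ByteCounter()
    yield ftpClient.retrieveFile(remotePath, counter)
    # Ask the server how many bytes we should have received.
    listing = FTPFileListProtocol()
    yield ftpClient.list(remotePath, listing)
    expected = listing.files[0]['size']
    if counter.received != expected:
        raise IOError("short transfer: got %d of %d bytes"
                      % (counter.received, expected))
    defer.returnValue(counter)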
Related
I have some really simple code that allows me to send an image from the client to the server, and it works.
As simple as this:
On the client side...
def sendFile(self):
    # Read the whole image in binary mode and write it to the transport.
    image = open(picname, 'rb')
    data = image.read()
    image.close()
    self.transport.write(data)
On the server side...
def dataReceived(self, data):
    print 'Received'
    f = open("image.png", 'wb')
    f.write(data)
    f.close()
The problem is that this only works if the image is up to about 4 kB; it stops working when the image is bigger (at least it doesn't work once it gets to 6 kB). That's when I see "Received" being printed more than once, which makes me think the data is being split into smaller chunks. However, even though those chunks of data reach the server (as I can see from the repeated print in dataReceived), the image is corrupted and can't be opened.
I don't know that much about protocols, but I assumed TCP was supposed to be reliable, so packets arriving in a different order or getting lost shouldn't... happen? So I was thinking that maybe Twisted is doing something there that I'm not aware of, and maybe I should use another protocol.
So here is my question: is there something I could do now to make this work, or should I definitely switch to another protocol? If so... any ideas? My goal is to send bigger images, maybe on the order of hundreds of kB.
This is a variant of an entry in the Twisted FAQ:
Why is protocol.dataReceived called with only part of the data I called transport.write with?
TCP is a stream-based protocol. It is delivering a stream of bytes, which may be broken up into an arbitrary number of fragments. If you write one big blob of bytes, it may be broken up into an arbitrary number of smaller chunks, depending on the characteristics of your physical network connection. When you say that TCP should be "reliable", and that the chunks should arrive in order, you are roughly correct: however, what arrives in order is the bytes, not the chunks.
What you are doing in your dataReceived method is, upon receiving each chunk, opening a file and writing the contents of just that chunk to "image.png", then closing it. If you change it to open the file in connectionMade and close the file in connectionLost you should see at least vaguely the right behavior, although this will still cause you to get corrupted / truncated images if the connection is lost unexpectedly, with no warning. You should really use a framing protocol like AMP; although if you're just sending big blobs of data around, HTTP is probably a better choice.
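For concreteness, a minimal sketch of that adjustment, still writing to a hard-coded "image.png" as in your example (in a real application you would want one file per connection):

from twisted.internet import protocol

class ImageReceiver(protocol.Protocol):
    def connectionMade(self):
        # Open the output file once, when the client connects.
        self.outfile = open("image.png", 'wb')

    def dataReceived(self, data):
        # Append each chunk as it arrives; TCP may deliver any number of chunks.
        self.outfile.write(data)

    def connectionLost(self, reason):
        # The sender closing the connection is the only end-of-file signal here.
        self.outfile.close()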
I have implemented a server program using Twisted. I am using basic.lineReceiver with the method dataReceived to receive data from multiple clients. Also, I am using protocol.ServerFactory to keep track of connected clients. The server sends some commands to each connected client. Based on the response that the server gets from each client, it (the server) should perform some tasks. Thus, the best solution that came to my mind was to create a buffer for received messages as a python list, and each time that the functions at server side want to know the response from a client, they access the last element of the buffer list (of that client).
This approach has turned out to be unreliable. The first issue is that since TCP streaming is used, sometimes messages merge (I can use a delimiter for this). Second, the received messages are sometimes not in their appropriate sequence. Third, the networking communication seems to be too slow, as when the server initially tries to access the last element of the buffered list, the list is empty (this shows that the last messages on the buffer might not be the response to the last sent commands).
Could you tell me what the best practice is for using dataReceived or its equivalents in the above problem? Thank you in advance.
EDIT 1: While I accept @Jean-Paul Calderone's answer, since I certainly learned from it, I would like to add that in my own research of Twisted's documentation I learned that, in order to avoid delays in the server's communications, one should use return at the end of the dataReceived() or lineReceived() functions; this solved part of my problem. The rest is explained in the answer.
I have implemented a server program using Twisted. I am using basic.lineReceiver with the method dataReceived to receive data from multiple clients.
This is a mistake - an unfortunately common one brought on by the mistaken use of inheritance in many of Twisted's protocol implementations as the mechanism for building up more and more sophisticated behaviors. When you use twisted.protocols.basic.LineReceiver, the dataReceived callback is not for you. LineReceiver.dataReceived is an implementation detail of LineReceiver. The callback for you is LineReceiver.lineReceived. LineReceiver.dataReceived looks like it might be for you - it doesn't start with an underscore or anything - but it's not. dataReceived is how LineReceiver receives information from its transport. It is one of the public methods of IProtocol - the interface between a transport and the protocol interpreting the data received over that transport. Yes, I just said "public method" there. The trouble is it's public for the benefit of someone else. This is confusing and perhaps not communicated as well as it could be. No doubt this is why it is a Frequently Asked Question.
This approach has turned out to be unreliable. The first issue is that since TCP streaming is used, sometimes messages merge (I can use a delimiter for this).
Use of dataReceived is why this happens. LineReceiver already implements delimiter-based parsing for you. That's why it's called "line" receiver - it receives lines separated by a delimiter. If you override lineReceived instead of dataReceived then you'll be called with each line that is received, regardless of how TCP splits things up or smashes them together.
Second, the received messages are sometimes not in their appropriate sequence.
TCP is a reliable, ordered, stream-oriented transport. "Ordered" means that bytes arrive in the same order they are sent. Put another way, when you write("x"); write("y") it is guaranteed that the receiver will receive "x" before they receive "y" (they may receive "x" and "y" in the same call to recv(), but if they do, the data will definitely be "xy" and not "yx"; or they may receive the two bytes in two calls to recv(), and if they do, the first recv() will definitely be "x" and the second will definitely be "y", not the other way around).
If bytes appear to be arriving in a different order than you sent them, there's probably another bug somewhere that makes it look like this is happening - but it actually isn't. Your platform's TCP stack is very likely very close to bug free and in particular it probably doesn't have TCP data re-ordering bugs. Likewise, this area of Twisted is extremely well tested and probably works correctly. This leaves a bug in your application code or a misinterpretation of your observations. Perhaps your code doesn't always append data to a list or perhaps the data isn't being sent in the order you expected.
Another possibility is that you are talking about the order in which data arrives across multiple separate TCP connections. TCP is only ordered over a single connection. If you have two connections, there are very few (if any) guarantees about the order in which data will arrive over them.
Third, the networking communication seems to be too slow, as when the server initially tries to access the last element of the buffered list, the list is empty (this shows that the last messages on the buffer might not be the response to the last sent commands).
What defines "too slow"? The network is as fast as the network is. If that's not fast enough for you, find a fatter piece of copper. It sounds like what you really mean here is that your server sometimes expects data to have arrived before that data actually arrives. This doesn't mean the network is too slow, though, it means your server isn't properly event driven. If you're inspecting a buffer and not finding the information you expected, it's because you inspected it before the occurrence of the event which informs you of the arrival of that information. This is why Twisted has all these callback methods - dataReceived, lineReceived, connectionLost, etc. When lineReceived is called, this is an event notification telling you that right now something happened which resulted in a line being available (and, for convenience, lineReceived takes one argument - an object representing the line which is now available).
If you have some code that is meant to run when a line has arrived, consider putting that code inside an implementation of the lineReceived method. That way, when it runs (in response to a line being received), you can be 100% sure that you have a line to operate on. You can also be sure that it will run as soon as possible (as soon as the line arrives) but no sooner.
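To illustrate, here is a minimal sketch along those lines; the protocol, factory, port number, and the handleResponse placeholder are all made up for the example, not taken from your code:

from twisted.internet import protocol, reactor
from twisted.protocols import basic

class CommandProtocol(basic.LineReceiver):
    def connectionMade(self):
        self.factory.clients.append(self)

    def lineReceived(self, line):
        # Called once per complete, delimiter-terminated line.
        # React to the client's response here, instead of polling a shared list later.
        self.handleResponse(line)

    def handleResponse(self, line):
        print 'client responded:', line

    def connectionLost(self, reason):
        self.factory.clients.remove(self)

class CommandFactory(protocol.ServerFactory):
    protocol = CommandProtocol

    def __init__(self):
        self.clients = []

reactor.listenTCP(8000, CommandFactory())
reactor.run()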
Unfortunately my previous question got closed for being an "exact copy" of another question, while it definitely IS NOT, so here it is again.
It is NOT a duplicate of Python: HTTP Post a large file with streaming
That one deals with streaming a big file; I want to send arbitrary chunks of a file one by one over the same HTTP connection. So I have a file of, say, 20 MB, and what I want to do is open an HTTP connection, then send 1 MB, send another 1 MB, and so on, until it's complete. Using the same connection, so the server sees a 20 MB body arrive over that connection.
Mmapping the file is what I ALSO intend to do, but that does not work when the data is read from stdin, and primarily for that second case I am looking for this part-by-part feeding of data.
Honestly, I wonder whether it can be done at all. If not, I'd like to know that, and then I can close the issue. But if it can be done, how could it be done?
From the client's perspective, it's easy. You can use httplib's low-level interface - putrequest, putheader, endheaders, and send - to send whatever you want to the server, in chunks of any size.
But you also need to indicate where your file ends.
If you know the total size of the file in advance, you can simply include the Content-Length header, and the server will stop reading your request body after that many bytes. The code may then look like this.
import httplib
import os.path

total_size = os.path.getsize('/path/to/file')
infile = open('/path/to/file', 'rb')

conn = httplib.HTTPConnection('example.org')
conn.connect()
conn.putrequest('POST', '/upload/')
conn.putheader('Content-Type', 'application/octet-stream')
conn.putheader('Content-Length', str(total_size))
conn.endheaders()

while True:
    chunk = infile.read(1024)
    if not chunk:
        break
    conn.send(chunk)

resp = conn.getresponse()
If you don’t know the total size in advance, the theoretical answer is the chunked transfer encoding. Problem is, while it is widely used for responses, it seems less popular (although just as well defined) for requests. Stock HTTP servers may not be able to handle it out of the box. But if the server is under your control too, you could try manually parsing the chunks from the request body and reassembling them into the original file.
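If both ends are under your control and you want to experiment with chunked requests, the client side is only a matter of framing each piece yourself. A rough sketch, reading from stdin since that is your second case; the URL is a placeholder and a stock server may simply reject the request:

import httplib
import sys

conn = httplib.HTTPConnection('example.org')
conn.connect()
conn.putrequest('POST', '/upload/')
conn.putheader('Content-Type', 'application/octet-stream')
conn.putheader('Transfer-Encoding', 'chunked')
conn.endheaders()

while True:
    chunk = sys.stdin.read(1024)
    if not chunk:
        break
    # Each chunk is prefixed with its length in hex and terminated by CRLF.
    conn.send('%x\r\n%s\r\n' % (len(chunk), chunk))

# A zero-length chunk marks the end of the request body.
conn.send('0\r\n\r\n')
resp = conn.getresponse()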
Another option is to send each chunk as a separate request (with Content-Length) over the same connection. But you still need to implement custom logic on the server. Moreover, you need to persist state between requests.
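A sketch of that variant too, assuming a hypothetical /upload/ endpoint that knows how to append parts belonging to the same upload id:

import httplib

conn = httplib.HTTPConnection('example.org')
infile = open('/path/to/file', 'rb')

part = 0
while True:
    chunk = infile.read(1024 * 1024)  # 1 MB per request
    if not chunk:
        break
    # Each part is an ordinary request with its own Content-Length; the query
    # string tells the (hypothetical) server how to reassemble the parts.
    conn.request('POST', '/upload/?id=myfile&part=%d' % part, chunk,
                 {'Content-Type': 'application/octet-stream'})
    conn.getresponse().read()  # must read each response before reusing the connection
    part += 1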
Added 2012-12-27. There’s an nginx module that converts chunked requests into regular ones. May be helpful so long as you don’t need true streaming (start handling the request before the client is done sending it).
So here's the deal. I have a server that has a ton of clients. I need a way to send them all a python script, and once they receive it they must immediately execute the script.
Said script will create a file which I then need to download back to the server.
The only thing I have to start with is a file with a list of client IP addresses (though with not too much effort I can change that to be client "names" if that would make the code easier).
As of right now it does not matter whether this is accomplished through POST or FTP or any other file transfer service you can think of; the only goal is that it is fast.
The script that is being executed on all the clients is a simple key generator, which I can provide if need be.
As said before, the main goal here is speed; any help that can be provided would be appreciated.
I'm part of a project trying to "map" the internet. I was picked not because of any sort of networking skills (of which I have none) but because I was the only candidate who knew any Python. Right now I'm just trying to establish a connection with all the (1000+) clients we are using and get the SSH (RSA) keys from them. Later I will be telling the clients to send traceroutes.
EDIT ---- provided additional information to make the question clearer
Have a look at Fabric. It is a tool that exactly fits what you need, though I don't know how fast it is.
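Very roughly, a Fabric (1.x) fabfile for your workflow might look like the sketch below; the script name, remote paths, and task name are made up, and whether the parallel mode is fast enough for 1000+ hosts is something you would have to measure:

# fabfile.py
from fabric.api import env, put, run, get

# Read the host list from your existing file of client IP addresses.
env.hosts = [line.strip() for line in open('clients.txt') if line.strip()]
env.user = 'youruser'

def collect_keys():
    # Push the key-generator script, run it, and pull back the file it creates.
    put('keygen.py', '/tmp/keygen.py')
    run('python /tmp/keygen.py')
    get('/tmp/generated.key', 'keys/%(host)s.key')

Then run it with fab -P collect_keys (-P runs the hosts in parallel).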
I've been scouring the Internet looking for a solution to my problem with Python. I'm trying to use a urllib2 connection to read a potentially endless stream of data from an HTTP server. It's part of some interactive communication, so it's important that I can get the data that's available, even if it's not a whole buffer full. There seems to be no way to have read / readline return the available data; they will block forever waiting for the entire (endless) stream before returning.
Even if I set the underlying file descriptor to non-blocking using fcntl, the urllib2 file-object still blocks! In general there seems to be no way to make Python file-objects, upon read, return all available data if there is some, and block otherwise.
I've seen a few posts about people seeking help with this, but I have seen no solutions. What gives? Am I missing something? This seems like such a normal use-case to completely ruin! I'm hoping to utilize urllib2's ability to detect configured proxies and use chunked encoding, but I can't if it won't cooperate.
Edit: Upon request, here is some example code
Client:
connection = urllib2.urlopen(commandpath)
id = connection.readline()
Now suppose that the server is using chunked transfer encoding, and writes one chunk down the stream and the chunk contains the line, and then waits. The connection is still open, but the client has data waiting in a buffer.
I cannot get read or readline to return the data I know it has waiting for it, because it tries to read until the end of the connection. In this case the connection may never close so it will wait either forever or until an inactivity timeout occurs, severing the connection. Once the connection is severed it will return, but that's obviously not the behavior I want.
urllib2 operates at the HTTP level, which works with complete documents. I don't think there's a way around that without hacking into the urllib2 source code.
What you can do is use plain sockets (you'll have to talk HTTP yourself in this case), and call sock.recv(maxbytes) which does read only available data.
Update: you may want to try to call conn.fp._sock.recv(maxbytes), instead of conn.read(bytes) on an urllib2 connection.
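If you go the plain-socket route, a bare-bones sketch looks something like this; the host, port, and path are placeholders, and you still have to parse the response headers and any chunked framing yourself:

import socket

sock = socket.create_connection(('example.org', 80))
sock.sendall('GET /stream HTTP/1.1\r\n'
             'Host: example.org\r\n'
             '\r\n')

while True:
    # recv returns as soon as *some* data is available (up to 4096 bytes here),
    # instead of waiting for the whole, possibly endless, response.
    data = sock.recv(4096)
    if not data:
        break  # the server closed the connection
    print 'got %d bytes' % len(data)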