Twisted, FTP, and "streaming" large files - python

I'm attempting to implement what can best be described as "an FTP interface to an HTTP API". Essentially, there is an existing REST API that can be used to manage a user's files for a site, and I'm building a mediator server that re-exposes this API as an FTP server. So you can log in with, say, FileZilla and list your files, upload new ones, delete old ones, etc.
I'm attempting this with twisted.protocols.ftp for the (FTP) server, and twisted.web.client for the (HTTP) client.
The thing I'm running up against is how, when a user tries to download a file, to "stream" that file from the HTTP response into my FTP response. The same applies to uploading.
The most straightforward approach would be to download the entire file from the HTTP server, then turn around and send the contents to the user. The problem is that any given file could be many gigabytes in size (think drive images, ISO files, etc.). With this approach, the contents of the file would be held in memory between the time I download it from the API and the time I send it to the user - not good.
So my solution is to try to "stream" it - as I get chunks of data from the API's HTTP response, I just want to turn around and send those chunks along to the FTP user. Seems straightforward.
For my "custom FTP functionality", I'm using a subclass of ftp.FTPShell. The reading method of this, openForReading, returns a Deferred that fires with an implementation of IReadFile.
Below is my (initial, simple) implementation for "streaming HTTP". I use the fetch function to set up an HTTP request, and the callback I pass in gets called with each chunk received from the response.
I thought I could use some sort of two-ended buffer object to transport the chunks between the HTTP and FTP sides, using the buffer object as the file-like object required by ftp._FileReader. That's quickly proving not to work: the consumer from the send call closes the buffer almost immediately, because reading from it returns an empty string while no data has arrived yet. Thus, I'm "sending" empty files before I even start receiving the HTTP response chunks.
Am I close, but missing something? Am I on the wrong path altogether? Is what I want to do really impossible (I highly doubt that)?
from twisted.web import client
import urlparse

class HTTPStreamer(client.HTTPPageGetter):
    def __init__(self):
        self.callbacks = []

    def addHandleResponsePartCallback(self, callback):
        self.callbacks.append(callback)

    def handleResponsePart(self, data):
        for cb in self.callbacks:
            cb(data)
        client.HTTPPageGetter.handleResponsePart(self, data)

class HTTPStreamerFactory(client.HTTPClientFactory):
    protocol = HTTPStreamer

    def __init__(self, *args, **kwargs):
        client.HTTPClientFactory.__init__(self, *args, **kwargs)
        self.callbacks = []

    def addChunkCallback(self, callback):
        self.callbacks.append(callback)

    def buildProtocol(self, addr):
        p = client.HTTPClientFactory.buildProtocol(self, addr)
        for cb in self.callbacks:
            p.addHandleResponsePartCallback(cb)
        return p

def fetch(url, callback):
    parsed = urlparse.urlsplit(url)
    f = HTTPStreamerFactory(parsed.path)
    f.addChunkCallback(callback)
    from twisted.internet import reactor
    reactor.connectTCP(parsed.hostname, parsed.port or 80, f)
As a side note, this is only my second day with Twisted - I spent most of yesterday reading through Dave Peticolas' Twisted Introduction, which has been a great starting point, even if it's based on an older version of Twisted.
That said, I may be doing things wrong.

I thought I could use some sort of two-ended buffer object to transport the chunks between the HTTP and FTP sides, using the buffer object as the file-like object required by ftp._FileReader. That's quickly proving not to work: the consumer from the send call closes the buffer almost immediately, because reading from it returns an empty string while no data has arrived yet. Thus, I'm "sending" empty files before I even start receiving the HTTP response chunks.
Instead of using ftp._FileReader, you want something that does a write to the consumer whenever a chunk arrives from your HTTPStreamer. You never need or want to read from a buffer on the HTTP side, because there's no reason to even have such a buffer. As soon as HTTP bytes arrive, write them to the consumer. Something like...
from zope.interface import implements
from twisted.protocols.ftp import IReadFile

class FTPStreamer(object):
    implements(IReadFile)

    def __init__(self, url):
        self.url = url

    def send(self, consumer):
        fetch(self.url, consumer.write)
        # You also need a Deferred to return here, so the
        # FTP implementation knows when you're done.
        return someDeferred
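For context, a rough sketch of how your FTPShell subclass could hand one of these back from openForReading (the shell class name and the URL-mapping helper below are hypothetical):

from twisted.internet import defer
from twisted.protocols import ftp

class HTTPBackedShell(ftp.FTPShell):  # hypothetical subclass name
    def openForReading(self, path):
        # path is a sequence of path segments; _urlForPath is a
        # hypothetical helper that maps it to the REST API's URL.
        url = self._urlForPath(path)
        return defer.succeed(FTPStreamer(url))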
You may also want to use Twisted's producer/consumer interface to allow the transfer to be throttled, as may be necessary if your connection to the HTTP server is faster than your user's FTP connection to you.
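For instance, a minimal sketch of that wiring, assuming your HTTPStreamer exposes its TCP transport (the class name and attribute here are hypothetical):

from zope.interface import implements
from twisted.internet.interfaces import IPushProducer

class HTTPThrottle(object):  # hypothetical name
    implements(IPushProducer)

    def __init__(self, httpTransport):
        self.httpTransport = httpTransport

    def pauseProducing(self):
        # The FTP consumer's buffer is full: stop reading from the
        # HTTP socket so the download slows to match the FTP client.
        self.httpTransport.pauseProducing()

    def resumeProducing(self):
        # The FTP consumer drained its buffer: resume the download.
        self.httpTransport.resumeProducing()

    def stopProducing(self):
        # The FTP client went away: abandon the HTTP transfer.
        self.httpTransport.loseConnection()

In send you would then call consumer.registerProducer(HTTPThrottle(streamer.transport), True) once the HTTP connection is made, and consumer.unregisterProducer() when the response completes.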

Related

Streaming uploads using Python Requests

I want to stream an "infinite" (i.e. continuous) amount of data using HTTP POST. Basically, I want to send the POST request header and then stream the content (where the content length is unknown). I looked through http://docs.python-requests.org/en/latest/user/advanced/ and it seems to have the facility. The one question I have: the document says "To stream and upload, simply provide a file-like object for your body". What does "file-like object" mean? The data I wish to stream comes from a sensor. How do I implement a "file-like object" that will read data from the sensor and pass it to the caller?
Sorry about my ignorance here but I am feeling my way through python (i.e. learning as I go along. hmm.. looks like a snake. It feels slithery. Trying to avoid the business end of the critter... :-) ).
Thank you in advance for your help.
Ranga.
Just wanted to give you an answer so you could close your question:
It sounds like what you're really looking for is Python websockets. Internally, you make an HTTP request to upgrade the connection to a websocket, and after the handshake you are free to stream data both ways. Python makes this easy; for example:
from flask import Flask
from flask_sockets import Sockets

app = Flask(__name__)
sockets = Sockets(app)

@sockets.route('/echo')
def echo_socket(ws):
    while True:
        message = ws.receive()
        ws.send(message)

@app.route('/')
def hello():
    return 'Hello World!'
Websockets do support full-duplex communication, but you seem interested only in the server-to-client part. In that case you can just stream data using ws.send(). I'm not sure if this is what you're looking for, but it should provide a solution.
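For instance, a sketch of a one-way server-to-client stream in the same style (read_next_sample is a stand-in for whatever produces your data):

@sockets.route('/stream')
def stream_socket(ws):
    # Push data until the client disconnects; no receive needed.
    while not ws.closed:
        ws.send(read_next_sample())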
A file-like object is an object with a "read" method that accepts a size and returns a binary data buffer for the next chunk of data.
One example that looks like that is, indeed, the file object, if you want to read from the filesystem.
Another common case is the StringIO class, which reads from and writes to an in-memory buffer.
In your case, you would need to implement a "file-like object" yourself, one that simply reads from the sensor.
class Sensor(object):
    def __init__(self, sensor_thing):
        self.sensor_thing = sensor_thing

    def read(self, size):
        # Read the next sample from the sensor, converted to bytes.
        return self.convert_to_binary(self.sensor_thing.read_from_sensor())

    def convert_to_binary(self, sensor_data):
        ....
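With such a class in place, one well-documented route is to wrap the reads in a generator and pass it to requests, which sends any iterable body with chunked transfer encoding since it cannot determine a length (the URL and sensor_thing below are stand-ins):

import requests

def sensor_chunks(sensor):
    # Yield fixed-size reads until the sensor reports no more data;
    # an empty read ends the upload.
    while True:
        chunk = sensor.read(1024)
        if not chunk:
            break
        yield chunk

response = requests.post('http://example.com/ingest',
                         data=sensor_chunks(Sensor(sensor_thing)))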

Twisted: callbacks design for a chatty protocol

I am using Twisted to handle a text-based protocol. Initially the client connects to the server. After connecting, the server will send commands to which the client should respond. Each type of command takes a different amount of time to formulate a response (e.g. returning a value from a local hash versus returning a value from a complex MySQL query on a big database). The responses from the client should go in the order the commands were received. The server will not wait for the response to one command before sending another command, but it expects responses in the order the commands were sent. The client cannot expect any order for commands sent from the server.
Following is a minimal code showing the outline of how my program works currently.
class ExternalListener(LineReceiver):
    def connectionMade(self):
        log.msg("Listener: New connection")

    def lookupMethod(self, command):
        return getattr(self, 'do_' + command.lower(), None)

    def lineReceived(self, verb):
        method = self.lookupMethod(verb)
        method(verb)

    def do_cmd1(self, verb):
        d = self.getResult(verb)
        d.addCallback(self._cbValidate1)

    def _cbValidate1(self, result):
        resp = "response"
        self.transport.write(resp)

    def do_cmd2(self, verb):
        d = self.getResult(verb)
        d.addCallback(self._cbValidate2)

    def _cbValidate2(self, result):
        resp = "response"
        self.transport.write(resp)
As can be seen, this does not take care of the ordering of responses. I am not in a position to use DeferredList, because the Deferreds are created as and when a command is received, and there is no list of Deferreds that I can put in a DeferredList.
What is the twisted way to handle this scenario?
Thanks and Regards,
One solution is to use a protocol with tags for requests and responses. This means you can generate responses in any order. See AMP (or even IMAP4) as an example of such a protocol.
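For illustration, the canonical shape of an AMP command and responder; every AMP box carries a tag, so responses can safely arrive out of order:

from twisted.protocols import amp

class Sum(amp.Command):
    arguments = [('a', amp.Integer()), ('b', amp.Integer())]
    response = [('total', amp.Integer())]

class Math(amp.AMP):
    @Sum.responder
    def sum(self, a, b):
        return {'total': a + b}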
However, if the protocol is out of your control and you cannot fix it, then a not-quite-as-nice solution is to buffer the responses to ensure proper ordering.
I don't think there is any particularly Twisted solution here; it's just a matter of holding responses to newer requests until all of the responses to older requests have been sent. That's probably a matter of using a counter internally to assign an ordering to responses, and then implementing the logic to either buffer a response if it needs to wait, or send it (along with any appropriate buffered responses) if it doesn't.
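A minimal sketch of that buffering logic (the names are mine, not part of Twisted):

class OrderedResponder(object):
    def __init__(self, transport):
        self.transport = transport
        self.next_to_assign = 0  # sequence number for the next request
        self.next_to_send = 0    # sequence number the peer expects next
        self.ready = {}          # responses finished ahead of their turn

    def request_received(self):
        seq = self.next_to_assign
        self.next_to_assign += 1
        return seq

    def response_ready(self, seq, response):
        self.ready[seq] = response
        # Flush every response that is now in order.
        while self.next_to_send in self.ready:
            self.transport.write(self.ready.pop(self.next_to_send))
            self.next_to_send += 1

In lineReceived you would take a sequence number before calling getResult, and have each validation callback pass its response to response_ready together with that number.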

Streaming data of unknown size from client to server over HTTP in Python

As my previous question unfortunately got closed for being an "exact copy" of a question, while it definitely IS NOT, here it is again.
It is NOT a duplicate of Python: HTTP Post a large file with streaming
That one deals with streaming a big file; I want to send arbitrary chunks of a file one by one over the same HTTP connection. So I have a file of, say, 20 MB, and what I want to do is open an HTTP connection, then send 1 MB, send another 1 MB, and so on, until it's complete. Using the same connection, so the server sees a 20 MB chunk appear over that connection.
Mmap-ing a file is what I ALSO intend to do, but that does not work when the data is read from stdin. And primarily for that second case I am looking for this part-by-part feeding of data.
Honestly I wonder whether it can be done at all - if not, I'd like to know, and then I can close the issue. But if it can be done, how could it be done?
From the client’s perspective, it’s easy. You can use httplib’s low-level interface—putrequest, putheader, endheaders, and send—to send whatever you want to the server in chunks of any size.
But you also need to indicate where your file ends.
If you know the total size of the file in advance, you can simply include the Content-Length header, and the server will stop reading your request body after that many bytes. The code may then look like this.
import httplib
import os.path

total_size = os.path.getsize('/path/to/file')
infile = open('/path/to/file', 'rb')

conn = httplib.HTTPConnection('example.org')
conn.connect()
conn.putrequest('POST', '/upload/')
conn.putheader('Content-Type', 'application/octet-stream')
conn.putheader('Content-Length', str(total_size))
conn.endheaders()

while True:
    chunk = infile.read(1024)
    if not chunk:
        break
    conn.send(chunk)

resp = conn.getresponse()
If you don’t know the total size in advance, the theoretical answer is the chunked transfer encoding. Problem is, while it is widely used for responses, it seems less popular (although just as well defined) for requests. Stock HTTP servers may not be able to handle it out of the box. But if the server is under your control too, you could try manually parsing the chunks from the request body and reassembling them into the original file.
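If the server is yours, a hand-rolled chunked request over the same httplib interface might look like this (an untested sketch; chunks is any iterable of strings):

import httplib

def post_chunked(host, path, chunks):
    conn = httplib.HTTPConnection(host)
    conn.connect()
    conn.putrequest('POST', path)
    conn.putheader('Content-Type', 'application/octet-stream')
    conn.putheader('Transfer-Encoding', 'chunked')
    conn.endheaders()
    for chunk in chunks:
        # Each chunk is framed as: <size in hex>\r\n<data>\r\n
        conn.send('%x\r\n%s\r\n' % (len(chunk), chunk))
    conn.send('0\r\n\r\n')  # a zero-length chunk terminates the body
    return conn.getresponse()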
Another option is to send each chunk as a separate request (with Content-Length) over the same connection. But you still need to implement custom logic on the server. Moreover, you need to persist state between requests.
Added 2012-12-27. There’s an nginx module that converts chunked requests into regular ones. May be helpful so long as you don’t need true streaming (start handling the request before the client is done sending it).

pyftpdlib slow .read on file blocks entire mainloop

Hello,
I am using a custom AbstractFS on pyftpdlib that maps files on an HTTP server to FTP.
These files are returned by my implementation of open (from AbstractFS), which returns an httplib.HTTPResponse wrapped in the following class:
class HTTPConnWrapper:
    def __init__(self, obj, filename):
        # make it more file-obj like
        self.obj = obj
        self.closed = True
        self.name = filename.split(os.sep)[-1]

    def seek(self, arg):
        pass

    def read(self, bytes):
        #print 'read', bytes
        read = self.obj.read(100)  # we DON'T read `bytes` bytes, but 100 bytes
        #print 'ok'
        return read
The problem is that if a client is downloading files, the entire server becomes sluggish.
What can I do?
Any ideas?
PS:
And why doesn't just monkey-patching everything with eventlet magically make everything work?
pyftpdlib uses Python's asyncore module, which polls and interacts with dispatchers. Each time you map an FTP request to a request to the HTTP server, you're blocking the asyncore loop that pyftpdlib is using. You should implement your HTTP requests as dispatchers that fit the asyncore model, or spawn threads to handle the requests asynchronously and post the result back to the FTP request handler when the data has arrived. This is somewhat difficult, as there's no provided mechanism to interrupt asyncore's polling loop from external threads.
As for eventlet, I don't know that it would play nicely with asyncore, which is already utilizing a nonblocking IO mechanism.
Ok, I posted a bug report on pyftpdlib:
I wouldn't even know what to recommend exactly as it's a problem which is hard to resolve and there's no easy or standard way to deal with it.
But I came up with a crazy solution to this problem without using pyftpdlib:

1. Rewrite everything using wsgidav (which uses the cherrypy wsgiserver, so it's threaded).
2. Mount that WebDAV filesystem as a native filesystem (net use on Windows, mount.davfs on Linux).
3. Serve the mounted filesystem with any FTP server that can handle blocking file systems.

How to serve data from UDP stream over HTTP in Python?

I am currently working on exposing data from a legacy system over the web. I have a (legacy) server application that sends and receives data over UDP. The software uses UDP to send sequential updates to a given set of variables in (near) real time (updates every 5-10 ms). Thus, I do not need to capture all UDP data - it is sufficient that the latest update is retrieved.
In order to expose this data over the web, I am considering building a lightweight web server that reads/writes UDP data and exposes it over HTTP.
As I am experienced with Python, I am considering using it.
The question is the following: how can I (continuously) read data from UDP and send snapshots of it over TCP/HTTP on-demand with Python? So basically, I am trying to build a kind of "UDP2HTTP" adapter to interface with the legacy app so that I wouldn't need to touch the legacy code.
A solution that is WSGI compliant would be much preferred. Of course any tips are very welcome and MUCH appreciated!
Twisted would be very suitable here. It supports many protocols (UDP, HTTP), and its asynchronous nature makes it possible to stream UDP data directly to HTTP without shooting yourself in the foot with (blocking) threading code. It also supports WSGI.
Here's a quick "proof of concept" app using the Twisted framework. It assumes that the legacy UDP service is listening on localhost:8000 and will start sending UDP data in response to a datagram containing "Send me data", and that the data is three 32-bit integers. Additionally, it responds to an HTTP GET / on port 2080.
You could start this with twistd -noy example.py:
example.py
from twisted.internet import protocol, defer
from twisted.application import service
from twisted.python import log
from twisted.web import resource, server as webserver
import struct

class legacyProtocol(protocol.DatagramProtocol):
    def startProtocol(self):
        self.transport.connect(self.service.legacyHost, self.service.legacyPort)
        self.sendMessage("Send me data")

    def stopProtocol(self):
        # Assume the transport is closed, do any tidying that you need to.
        return

    def datagramReceived(self, datagram, addr):
        # Inspect the datagram payload, do sanity checking.
        try:
            val1, val2, val3 = struct.unpack("!iii", datagram)
        except struct.error, err:
            # Problem unpacking data: log and ignore
            log.err()
            return
        self.service.update_data(val1, val2, val3)

    def sendMessage(self, message):
        self.transport.write(message)

class legacyValues(resource.Resource):
    def __init__(self, service):
        resource.Resource.__init__(self)
        self.service = service
        self.putChild("", self)

    def render_GET(self, request):
        data = "\n".join(["<li>%s</li>" % x for x in self.service.get_data()])
        return """<html><head><title>Legacy Data</title></head>
<body><h1>Data</h1><ul>
%s
</ul></body></html>""" % (data,)

class protocolGatewayService(service.Service):
    def __init__(self, legacyHost, legacyPort):
        self.legacyHost = legacyHost
        self.legacyPort = legacyPort
        self.udpListeningPort = None
        self.httpListeningPort = None
        self.lproto = None
        self.reactor = None
        self.data = [1, 2, 3]

    def startService(self):
        # called by application handling
        if not self.reactor:
            from twisted.internet import reactor
            self.reactor = reactor
        self.reactor.callWhenRunning(self.startStuff)

    def stopService(self):
        # called by application handling
        defers = []
        if self.udpListeningPort:
            defers.append(defer.maybeDeferred(self.udpListeningPort.loseConnection))
        if self.httpListeningPort:
            defers.append(defer.maybeDeferred(self.httpListeningPort.stopListening))
        return defer.DeferredList(defers)

    def startStuff(self):
        # UDP legacy stuff
        proto = legacyProtocol()
        proto.service = self
        self.udpListeningPort = self.reactor.listenUDP(0, proto)
        # Website
        factory = webserver.Site(legacyValues(self))
        self.httpListeningPort = self.reactor.listenTCP(2080, factory)

    def update_data(self, *args):
        self.data[:] = args

    def get_data(self):
        return self.data

application = service.Application('LegacyGateway')
services = service.IServiceCollection(application)
s = protocolGatewayService('127.0.0.1', 8000)
s.setServiceParent(services)
Afterthought
This isn't a WSGI design. The idea would be to run this program daemonized, with its HTTP port on a local IP, and have Apache or similar proxy requests to it. It could be refactored for WSGI, but it was quicker to knock it up this way, and easier to debug.
The software uses UDP to send sequential updates to a given set of variables in (near) real time (updates every 5-10 ms). Thus, I do not need to capture all UDP data - it is sufficient that the latest update is retrieved.
What you must do is this.
Step 1.
Build a Python app that collects the UDP data and caches it into a file. Create the file using XML, CSV or JSON notation.
This runs independently as some kind of daemon. This is your listener or collector.
Write the file to a directory from which it can be trivially downloaded by Apache or some other web server. Choose names and directory paths wisely and you're done.
Done.
If you want fancier results, you can do more. You don't need to, since you're already done.
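A minimal sketch of such a collector, reusing the three-integer payload from the Twisted answer above (the port and file paths are assumptions):

import json
import os
import socket
import struct

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(('127.0.0.1', 9000))  # wherever the legacy app sends updates

while True:
    datagram, addr = sock.recvfrom(65535)
    val1, val2, val3 = struct.unpack('!iii', datagram)  # assumed payload layout
    # Write to a temp file and rename, so readers never see a partial file.
    with open('/var/www/data/snapshot.json.tmp', 'w') as f:
        json.dump({'val1': val1, 'val2': val2, 'val3': val3}, f)
    os.rename('/var/www/data/snapshot.json.tmp', '/var/www/data/snapshot.json')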
Step 2.
Build a web application that allows someone to request this data being accumulated by the UDP listener or collector.
Use a web framework like Django for this. Write as little as possible. Django can serve flat files created by your listener.
You're done. Again.
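If you'd rather serve it through Django than have Apache pick up the flat file, a view can be as small as this (the snapshot path is the hypothetical one from the sketch above):

from django.http import HttpResponse

def latest_snapshot(request):
    # Return whatever the UDP collector last wrote.
    with open('/var/www/data/snapshot.json') as f:
        return HttpResponse(f.read(), content_type='application/json')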
Some folks think relational databases are important. If so, you can do this. Even though you're already done.
Step 3.
Modify your data collection to create a database that the Django ORM can query. This requires some learning and some adjusting to get a tidy, simple ORM model.
Then write your final Django application to serve the UDP data being collected by your listener and loaded into your Django database.
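The ORM model itself can stay tiny; a sketch with hypothetical field names matching the payload above:

from django.db import models

class Reading(models.Model):
    # One row per UDP update received by the collector.
    val1 = models.IntegerField()
    val2 = models.IntegerField()
    val3 = models.IntegerField()
    received_at = models.DateTimeField(auto_now_add=True)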
