Synchronous reading of a multipart upload alongside Django

Synchronous reading of a multipart upload alongside Django - python

I have a Django application which needs to have access to reading multipart file uploads as file-like objects as they're uploaded, which means that I need more or less synchronous access to the request object and a way to unpack it in chunks to binary data. Django unfortunately handles uploads by moving them directly into memory or to temporary files, which won't work for my use case.
Some one recommended that I use gevent/greenlet to handle the upload, but I'm not sure how that plays into the equation and what setup is required alongside Django to make it work. Plus, running something outside of Django would mean that I would have to implement a database connection layer to validate that the upload is allowed (using a ticket id).
With this said, how can I set this up? Django should be running in a WSGI application, and someone had also recommended writing a second WSGI application to capture a single URL path for uploads. I'd like to essentially take as much advantage of the Django framework as I can, while being able to read uploads synchronously?
(I just became familiar with the requests Python library and have to say I'm a pretty big fan, though I wouldn't know the first thing about using it in a server context.)

I believe a lot of these suggestions are overcomplicating things.
You need to change the way Django handles uploaded files? Simply modify the upload handler.
The base class is relatively straightforward, and gives you lots of great hooks. You should be able to extend it to do what you want.

While I can't write all the code for you (it's complex), here is my recommended setup.
Use Tornado + Django: Tornado can embed WSGI processes, so this gives you the ability to have a single process host both Django, and this one off Tornado handler. Here's a quick sample from one of my active projects (though this using Tornado as a Socket.io handler, it should give you the gist of the solution (:
# Socket Server
db = momoko.Pool(DB_DSN, **DB_CONFIG)
router = tornadio2.TornadioRouter(QueryRouter, user_settings={'db':db})
sock_app = tornado.web.Application(router.urls, flash_policy_port = FL_PORT, flash_policy_file = path.join(PROJECT_ROOT, 'assets/xml/flashpolicy.xml'), socket_io_port = WS_PORT, debug=True)
# Django Server
os.environ['DJANGO_SETTINGS_MODULE'] = 'sever.settings'
application = django.core.handlers.wsgi.WSGIHandler()
container = tornado.wsgi.WSGIContainer(application)
# Start the web servers
if __name__ == "__main__":
try:
import logging
tornado.options.parse_command_line()
logging.getLogger().setLevel(logging.INFO)
logging.info('Server started')
tornado.locale.set_default_locale('us_US')
http_server = tornado.httpserver.HTTPServer(container)
http_server.listen(8000)
tornadio2.SocketServer(sock_app, auto_start=True)
tornado.ioloop.IOLoop.instance().start()
except KeyboardInterrupt:
tornado.ioloop.IOLoop.instance().stop()
logging.info("Stopping servers.")
This could easily be converted to two server instances running on two different ports, with 80 being reserved for Django and 8080 being used for your upload handler.
I'm recommending Tornado because it supports streaming request body, and is very well suited to this type of use. Here's a gist that might help you.
Your proxying setup will matter. If you're using NGINX, make sure to turn proxy_buffering off.
I wouldn't use a database for the ticket/upload check. Redis or memcache would probably be a much faster way to handle this. A cache would also be great way to bass upload progress back and forth between Django and Tornado, since the overhead for setting/getting a new value would be so small.
This is a big hairy problem that will serious engineering to come up with something elegant, but it's more than doable.

Related

Testing a Flask application with requests

Ever since I read
A untested application is broken
in the flask documentation about testing here
I have been working down my list of things to make for some of my applications.
I currently have a flask web app when I write a new route I just write a requests.get('https://api.github.com/user', auth=('user', 'pass')), post, put, etc to test the route.
Is this a decent alternative? Or should I try and do tests via what flask's documentation says, and if so why?

Fundamentally it's the same concept, you are running functionality tests as they do. However, you have a prerequisite, a live application running somewhere (if I got it right). They create a fake application (aka mock) so you can test it without being live, e.g. you want to run tests in a CI environment.
In my opinion it's a better alternative than a live system. Your current approach consumes more resources on your local machine, since you are required to run the whole system to test something (i.e. at least a DB and the application itself). In their approach they don't, the fake instance does not need to have real data, thus no connection to a DB or any other external dependency.
I suggest you to switch to their testing, in the end you will like it.

Controlling a Twisted Server from Django

I'm trying to build a Twisted/Django mashup that will let me control various client connections managed by a Twisted server via Django's admin interface. Meaning, I want to be able to login to Django's admin and see what protocols are currently in use, any details specific to each connection (e.g. if the server is connected to freenode via IRC, it should list all the channels currently connected to), and allow me to disconnect or connect new clients by modifying or creating database records.
What would be the best way to do this? There are lots of posts out there about combining Django with Twisted, but I haven't found any prior art for doing quite what I've outlined. All the Twisted examples I've seen use hardcoded connection parameters, which makes it difficult for me to imagine how I would dynamically running reactor.connectTCP(...) or loseConnection(...) when signalled by a record in the database.
My strategy is to create a custom ClientFactory that solely polls the Django/managed database every N seconds for any commands, and to modify/create/delete connections as appropriate, reflecting the new status in the database when complete.
Does this seem feasible? Is there a better approach? Does anyone know of any existing projects that implement similar functionality?

Polling the database is lame, but unfortunately, databases rarely have good tools (and certainly there are no database-portable tools) for monitoring changes. So your approach might be okay.
However, if your app is in Django and you're not supporting random changes to the database from other (non-Django) clients, and your WSGI container is Twisted, then you can do this very simply by doing callFromThread(connectTCP, ...).

I've been working on yet another way of combing django and twisted. Fell free to give it a try: https://github.com/kowalski/featdjango.
The way it works, is slightly different that the others. It starts a twisted application and http site. The requests done to django are processed inside a special thread pool. What makes it special, is that that these threads can wait on Deferred, which makes it easy to combine synchronous django application code with asynchronous twisted code.
The reason I came up with structure like this, is that my application needs to perform a lot of http requests from inside the django views. Instead of performing them one by one I can delegate all of them at once to "the main application thread" which runs twisted and wait for them. The similarity to your problem is, that I also have an asynchronous component, which is a singleton and I access it from django views.
So this is, for example, this is how you would initiate the twisted component and later to get the reference from the view.
import threading
from django.conf import settings
_initiate_lock = threading.Lock()
def get_component():
global _initiate_lock
if not hasattr(settings, 'YOUR_CLIENT')
_initiate_lock.acquire()
try:
# other thread might have did our job while we
# were waiting for the lock
if not hasattr(settings, 'YOUR_CLIENT'):
client = YourComponent(**whatever)
threading.current_thread().wait_for_deferred(
client.initiate)
settings.YOUR_CLIENT = client
finally:
_initiate_lock.release()
return settings.YOUR_CLIENT
The code above, initiates my client and calls the initiate method on it. This method is asynchronous and returns a Deferred. I do all the necessary setup in there. The django thread will wait for it to finish before processing to next line.
This is how I do it, because I only access it from the request handler. You probably would want to initiate your component at startup, to call ListenTCP|SSL. Than your django request handlers could get the data about the connections just accessing some public methods on the your client. These methods could even return Deferred, in which case you should use .wait_for_defer() to call them.

Non-Message Queue / Simple Long-Polling in Python (and Flask)

I am looking for a simple (i.e., not one that requires me to setup a separate server to handle a messaging queue) way to do long-polling for a small web-interface that runs calculations and produces a graph. This is what my web-interface needs to do:
User requests a graph/data in a web-interface
Server runs some calculations.
While the server is running calculations, a small container is updated (likely via AJAX/jQuery) with the calculation progress (similar to what you'd do in a consol with print (i.e. print 'calculating density function...'))
Calculation finishes and graph is shown to user.
As the calculation is all done server-side, I'm not really sure how to easily set this up. Obviously I'll want to setup a REST API to handle the polling, which would be easy in Flask. However, I'm not sure how to retrieve the actual updates. An obvious, albeit complicated for this purpose, solution would be to setup a messaging queue and do some long polling. However, I'm not sure sure this is the right approach for something this simple.
Here are my questions:
Is there a way to do this using the file system? Performance isn't a huge issue. Can AJAX/jQuery find messages from a file? Save the progress to some .json file?
What about pickling? (I don't really know much about pickling, but maybe I could pickle a message dict and it could be read by an API that is handling the polling).
Is polling even the right approach? Is there a better or more common pattern to handle this?
I have a feeling I'm overcomplicating things as I know this kind of thing is common on the web. Quite often I see stuff happening and a little "loading.gif" image is running while some calculation is going on (for example, in Google Analytics).
Thanks for your help!

I've built several apps like this using just Flask and jQuery. Based on that experience, I'd say your plan is good.
Do not use the filesystem. You will run into JavaScript security issues/protections. In the unlikely event you find reasonable workarounds, you still wouldn't have anything portable or scalable. Instead, use a small local web serving framework, like Flask.
Do not pickle. Use JSON. It's the language of web apps and REST interfaces. jQuery and those nice jQuery-based plugins for drawing charts, graphs and such will expect JSON. It's easy to use, human-readable, and for small-scale apps, there's no reason to go any place else.
Long-polling is fine for what you want to accomplish. Pure HTTP-based apps have some limitations. And WebSockets and similar socket-ish layers like Socket.IO "are the future." But finding good, simple examples of the server-side implementation has, in my experience, been difficult. I've looked hard. There are plenty of examples that want you to set up Node.js, REDIS, and other pieces of middleware. But why should we have to set up two or three separate middleware servers? It's ludicrous. So long-polling on a simple, pure-Python web framework like Flask is the way to go IMO.
The code is a little more than a snippet, so rather than including it here, I've put a simplified example into a Mercurial repository on bitbucket that you can freely review, copy, or clone. There are three parts:
serve.py a Python/Flask based server
templates/index.html 98% HTML, 2% template file the Flask-based server will render as HTML
static/lpoll.js a jQuery-based client

Long-polling was a reasonable work-around before simple, natural support for Web Sockets came to most browsers, and before it was easily integrated alongside Flask apps. But here in mid-2013, Web Socket support has come a long way.
Here is an example, similar to the one above, but integrating Flask and Web Sockets. It runs atop server components from gevent and gevent-websocket.
Note this example is not intended to be a Web Socket masterpiece. It retains a lot of the lpoll structure, to make them more easily comparable. But it immediately improves responsiveness, server overhead, and interactivity of the Web app.
Update for Python 3.7+
5 years since the original answer, WebSocket has become easier to implement. As of Python 3.7, asynchronous operations have matured into mainstream usefulness. Python web apps are the perfect use case. They can now use async just as JavaScript and Node.js have, leaving behind some of the quirks and complexities of "concurrency on the side." In particular, check out Quart. It retains Flask's API and compatibility with a number of Flask extensions, but is async-enabled. A key side-effect is that WebSocket connections can be gracefully handled side-by-side with HTTP connections. E.g.:
from quart import Quart, websocket
app = Quart(__name__)
#app.route('/')
async def hello():
return 'hello'
#app.websocket('/ws')
async def ws():
while True:
await websocket.send('hello')
app.run()
Quart is just one of the many great reasons to upgrade to Python 3.7.

Replace AppEngine Devserver With Spawning (BaseHTTPRequestHandler as WSGI)

I'm looking to replace AppEngine's devserver with spawning. Spawning handles standard wsgi handlers, just like appengine, so running your app on it is easy.
But the devserver takes into account your app.yaml file that has url redirects etc. I've been going through the devserver code and it is pretty easy to get the BaseHTTPRequestHandler like this:
from google.appengine.tools.dev_appserver import CreateRequestHandler
dev = CreateRequestHandler(os.path.dirname(__file__), '', require_indexes=False, static_caching=True)
But the BaseHTTPRequestHandler is not a WSGI app, so my guess is I need to put something around it to make it work. Any hints?

I don't think you're going to be able to pull out a part of the dev_appserver and use it in a custom WSGI server quite so easily. The dev_appserver does a lot of 'magic', and it isn't really structured to be pulled out and used as a WSGI wrapper in another server (more's the pity).
You may want to check out TwistedAE, which is working on creating an alternate serving environment; if you really want to use spawning, you can probably use TwistedAE's work as a basis.
That said, if you do want to do it yourself, there's a couple of options:
You can write your own shim to interface WSGI with the class returned by CreateRequestHandler. In that case, you need to replicate the interface in BaseHTTPServer.BaseHTTPRequestHandler from the Python SDK. Converting WSGI to that, just so the dev_appserver code can convert it back seems a bit perverse, though.
You can rip out the code from the _HandleRequest method of DevAppServerRequestHandler, modify it to work with WSGI, and create a WSGI app from that (probably your best bet if you want to DIY).
You can start from scratch, which I believe is the approach taken by TwistedAE.
One thing to bear in mind whatever you do: App Engine explicitly expects a single-threaded environment for its apps. Don't use a multithreaded approach if you want apps to work the same locally as they do in production or on the dev_appserver!

A good multithreaded python webserver?

I am looking for a python webserver which is multithreaded instead of being multi-process (as in case of mod_python for apache). I want it to be multithreaded because I want to have an in memory object cache that will be used by various http threads. My webserver does a lot of expensive stuff and computes some large arrays which needs to be cached in memory for future use to avoid recomputing. This is not possible in a multi-process web server environment. Storing this information in memcache is also not a good idea as the arrays are large and storing them in memcache would lead to deserialization of data coming from memcache apart from the additional overhead of IPC.
I implemented a simple webserver using BaseHttpServer, it gives good performance but it gets stuck after a few hours time. I need some more matured webserver. Is it possible to configure apache to use mod_python under a thread model so that I can do some object caching?

CherryPy. Features, as listed from the website:
A fast, HTTP/1.1-compliant, WSGI thread-pooled webserver. Typically, CherryPy itself takes only 1-2ms per page!
Support for any other WSGI-enabled webserver or adapter, including Apache, IIS, lighttpd, mod_python, FastCGI, SCGI, and mod_wsgi
Easy to run multiple HTTP servers (e.g. on multiple ports) at once
A powerful configuration system for developers and deployers alike
A flexible plugin system
Built-in tools for caching, encoding, sessions, authorization, static content, and many more
A native mod_python adapter
A complete test suite
Swappable and customizable...everything.
Built-in profiling, coverage, and testing support.

Consider reconsidering your design. Maintaining that much state in your webserver is probably a bad idea. Multi-process is a much better way to go for stability.
Is there another way to share state between separate processes? What about a service? Database? Index?
It seems unlikely that maintaining a huge array of data in memory and relying on a single multi-threaded process to serve all your requests is the best design or architecture for your app.

Twisted can serve as such a web server. While not multithreaded itself, there is a (not yet released) multithreaded WSGI container present in the current trunk. You can check out the SVN repository and then run:
twistd web --wsgi=your.wsgi.application

Its hard to give a definitive answer without knowing what kind of site you are working on and what kind of load you are expecting. Sub second performance may be a serious requirement or it may not. If you really need to save that last millisecond then you absolutely need to keep your arrays in memory. However as others have suggested it is more than likely that you don't and could get by with something else. Your usage pattern of the data in the array may affect what kinds of choices you make. You probably don't need access to the entire set of data from the array all at once so you could break your data up into smaller chunks and put those chunks in the cache instead of the one big lump. Depending on how often your array data needs to get updated you might make a choice between memcached, local db (berkley, sqlite, small mysql installation, etc) or a remote db. I'd say memcached for fairly frequent updates. A local db for something in the frequency of hourly and remote for the frequency of daily. One thing to consider also is what happens after a cache miss. If 50 clients all of a sudden get a cache miss and all of them at the same time decide to start regenerating those expensive arrays your box(es) will quickly be reduced to 8086's. So you have to take in to consideration how you will handle that. Many articles out there cover how to recover from cache misses. Hope this is helpful.

Not multithreaded, but twisted might serve your needs.

You could instead use a distributed cache that is accessible from each process, memcached being the example that springs to mind.

web.py has made me happy in the past. Consider checking it out.
But it does sound like an architectural redesign might be the proper, though more expensive, solution.

Perhaps you have a problem with your implementation in Python using BaseHttpServer. There's no reason for it to "get stuck", and implementing a simple threaded server using BaseHttpServer and threading shouldn't be difficult.
Also, see http://pymotw.com/2/BaseHTTPServer/index.html#module-BaseHTTPServer about implementing a simple multi-threaded server with HTTPServer and ThreadingMixIn

I use CherryPy both personally and professionally, and I'm extremely happy with it. I even do the kinds of thing you're describing, such as having global object caches, running other threads in the background, etc. And it integrates well with Apache; simply run CherryPy as a standalone server bound to localhost, then use Apache's mod_proxy and mod_rewrite to have Apache transparently forward your requests to CherryPy.
The CherryPy website is http://cherrypy.org/

I actually had the same issue recently. Namely: we wrote a simple server using BaseHTTPServer and found that the fact that it's not multi-threaded was a big drawback.
My solution was to port the server to Pylons (http://pylonshq.com/). The port was fairly easy and one benefit was it's very easy to create a GUI using Pylons so I was able to throw a status page on top of what's basically a daemon process.
I would summarize Pylons this way:
it's similar to Ruby on Rails in that it aims to be very easy to deploy web apps
it's default templating language, Mako, is very nice to work with
it uses a system of routing urls that's very convenient
for us performance is not an issue, so I can't guarantee that Pylons would perform adequately for your needs
you can use it with Apache & Lighthttpd, though I've not tried this
We also run an app with Twisted and are happy with it. Twisted has good performance, but I find Twisted's single-threaded/defer-to-thread programming model fairly complicated. It has lots of advantages, but would not be my choice for a simple app.
Good luck.

Just to point out something different from the usual suspects...
Some years ago while I was using Zope 2.x I read about Medusa as it was the web server used for the platform. They advertised it to work well under heavy load and it can provide you with the functionality you asked.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.