nginx complex procedural rewrite rule -- plugin? separate server?

nginx complex procedural rewrite rule -- plugin? separate server? - python

I need to use nginx to serve static files (many, many files, most small, but also a mix of big video files). Configuring nginx as a static file server is, of course, pretty easy except...
... the processing we need to take the "input" path (where the user thinks the file is) and convert it to the real location is too complex to be expressed using nginx's rewrite engines. In general, the logic that does the computation is currently expressed in a Python library. (Just accept it's too complex to be an nginx rewrite rule.)
What are my options for making nginx be able to apply my custom rewrite rules to arrive at the proper path for serving the files? I've thought of these 2 options:
Write some C code that I add to nginx that calls out to Python and does my computation. (For this answer, it doesn't even matter that it's Python.
I'm just injecting C code into nginx that will do an arbitrary computation.) I have, though, no clue as to where to even start looking for where to make this modification to allow nginx to do this.
Is it possible that nginx can query a separate backend process (basically a very simple tornado server) and say "hey, given path X what path Y should I serve)?
While not ideal to go cross process, I figure if the two processes are on the same machine, the latency should be low.
Just write a tornado server which does the computation and then does a redirect to nginx.
This is do-able, but then requires our client code to handle redirects, and (a) this feels slower and (b) I don't feel like messing with the clients to make them handle redirects, since now they have to go roundtrip twice to the server.
??

The Lua nginx module is probably the answer you are looking for. It can be configured to call out to an external process to decide how to handle a request, similar to what you're describing in your question, or even to figure out how to handle a request entirely on its own (possibly by querying a database).

Related

Patterns for waiting for a server side script to finish in flask. How to handle errors and premature termination

I'm writing a web app in flask that will be used to manipulate some data server side. Basically it accepts a path to a zip, decompresses it and places it in a temporary folder to be cleaned up later, asks you which folders in the zip to use and then runs a script over the files in that folder. This could take a long time for the script to run but I'm not sure how to communicate this to the user. I don't really want to use javascript, as I don't know any.
Also this process may fail. How do I communicate the failure back to the flask app and how can I make sure that if the python script does fail it cleans up after itself by deleting the uncompressed temporary files.
Are there any good patterns, examples or packages that handle such tasks in Flask?

Celery is the canonical answer to all "how do I manage long-running jobs in Python web apps" questions.
But it's going to be difficult to avoid Javascript. The basic idea is to offload the job to a separate process - ie Celery - and set a flag somewhere (probably in the database) when it's finished. But if you want your front end to know when that flag is set, it's probably going to have to do a repeated call to the back end to check the status: and that does mean Javascript.
If you use a library like jQuery though, it's only going to be a few lines. It's worth learning in any case.

Non-Message Queue / Simple Long-Polling in Python (and Flask)

I am looking for a simple (i.e., not one that requires me to setup a separate server to handle a messaging queue) way to do long-polling for a small web-interface that runs calculations and produces a graph. This is what my web-interface needs to do:
User requests a graph/data in a web-interface
Server runs some calculations.
While the server is running calculations, a small container is updated (likely via AJAX/jQuery) with the calculation progress (similar to what you'd do in a consol with print (i.e. print 'calculating density function...'))
Calculation finishes and graph is shown to user.
As the calculation is all done server-side, I'm not really sure how to easily set this up. Obviously I'll want to setup a REST API to handle the polling, which would be easy in Flask. However, I'm not sure how to retrieve the actual updates. An obvious, albeit complicated for this purpose, solution would be to setup a messaging queue and do some long polling. However, I'm not sure sure this is the right approach for something this simple.
Here are my questions:
Is there a way to do this using the file system? Performance isn't a huge issue. Can AJAX/jQuery find messages from a file? Save the progress to some .json file?
What about pickling? (I don't really know much about pickling, but maybe I could pickle a message dict and it could be read by an API that is handling the polling).
Is polling even the right approach? Is there a better or more common pattern to handle this?
I have a feeling I'm overcomplicating things as I know this kind of thing is common on the web. Quite often I see stuff happening and a little "loading.gif" image is running while some calculation is going on (for example, in Google Analytics).
Thanks for your help!

I've built several apps like this using just Flask and jQuery. Based on that experience, I'd say your plan is good.
Do not use the filesystem. You will run into JavaScript security issues/protections. In the unlikely event you find reasonable workarounds, you still wouldn't have anything portable or scalable. Instead, use a small local web serving framework, like Flask.
Do not pickle. Use JSON. It's the language of web apps and REST interfaces. jQuery and those nice jQuery-based plugins for drawing charts, graphs and such will expect JSON. It's easy to use, human-readable, and for small-scale apps, there's no reason to go any place else.
Long-polling is fine for what you want to accomplish. Pure HTTP-based apps have some limitations. And WebSockets and similar socket-ish layers like Socket.IO "are the future." But finding good, simple examples of the server-side implementation has, in my experience, been difficult. I've looked hard. There are plenty of examples that want you to set up Node.js, REDIS, and other pieces of middleware. But why should we have to set up two or three separate middleware servers? It's ludicrous. So long-polling on a simple, pure-Python web framework like Flask is the way to go IMO.
The code is a little more than a snippet, so rather than including it here, I've put a simplified example into a Mercurial repository on bitbucket that you can freely review, copy, or clone. There are three parts:
serve.py a Python/Flask based server
templates/index.html 98% HTML, 2% template file the Flask-based server will render as HTML
static/lpoll.js a jQuery-based client

Long-polling was a reasonable work-around before simple, natural support for Web Sockets came to most browsers, and before it was easily integrated alongside Flask apps. But here in mid-2013, Web Socket support has come a long way.
Here is an example, similar to the one above, but integrating Flask and Web Sockets. It runs atop server components from gevent and gevent-websocket.
Note this example is not intended to be a Web Socket masterpiece. It retains a lot of the lpoll structure, to make them more easily comparable. But it immediately improves responsiveness, server overhead, and interactivity of the Web app.
Update for Python 3.7+
5 years since the original answer, WebSocket has become easier to implement. As of Python 3.7, asynchronous operations have matured into mainstream usefulness. Python web apps are the perfect use case. They can now use async just as JavaScript and Node.js have, leaving behind some of the quirks and complexities of "concurrency on the side." In particular, check out Quart. It retains Flask's API and compatibility with a number of Flask extensions, but is async-enabled. A key side-effect is that WebSocket connections can be gracefully handled side-by-side with HTTP connections. E.g.:
from quart import Quart, websocket
app = Quart(__name__)
#app.route('/')
async def hello():
return 'hello'
#app.websocket('/ws')
async def ws():
while True:
await websocket.send('hello')
app.run()
Quart is just one of the many great reasons to upgrade to Python 3.7.

Apache + mod_wsgi / Lighttpd + wsgi - am I going to see differences in performance?

I'm a newbie to developing with Python and I'm piecing together the information I need to make intelligent choices in two other open questions. (This isn't a duplicate.)
I'm not developing using a framework but building a web app from scratch using the gevent library. As far as front-end web servers go, it seems I have three choices: nginx, apache, and lighttpd.
From all accounts that I've read, nginx's mod_wsgi isn't suitable.
That leaves two choices - lighttpd and Apache. Under heavy load, am I going to see major differences in performance and memory consumption characteristics? I'm under the impression Apache tends to be memory hungry even when not using prefork, but I don't know how suitable lighttp is for Python apps.
Are there any caveats or benefits to using lighttpd over apache? I really want to hear all the information you can possibly bore me with!

Apache...
Apache is by far the most widely used web server out there. Which is a good thing. There is so much more information on how to do stuff with it, and when something goes wrong there are a lot of people who know how to fix it. But, it is also the slowest out of the box; requring a lot of tweaking and a beefier server than Lighttpd. In your case, it will be a lot easier to get off the ground using Apache and Python. There are countless AMP packages out there, and many guides on how to setup python and make your application work. Just a quick google search will get you on your way. Under heavy load, Lighttpd will outshine Apache, but Apache is like a train. It just keeps chugging along.
Pros
Wide User Base
Universal support
A lot of plugins
Cons
Slow out of the box
Requires performance tweaking
Memory whore (No way you could get it working on a 64MB VPS)
Lighttpd...
Lighttpd is the new kid on the block. It is fast, powerful, and kicks ass performance wise (not to mention use like no memory). Out of the box, Lighttpd wipes the floor with Apache. But, not as many people know Lighttpd, so getting it to work is harder. Yes, it is the second most used webserver, but it does not have as much community support behind it. If you look here, on stackoverflow, there is this dude who keeps asking about how to get his Python app working but nobody has helped him. Under heavy load, if configured correctly, Lighttpd will out preform Apache (I did some tests a while back, and you might see a 200-300% performance increase in requests per second).
Pros
Fast out of the box
Uses very little memory
Cons
Not as much support as Apache
Sometimes just does not work
Nginx
If you were running a static website, then you would use nginx. you are correct in saying nginx's mod_wsgi isn't suitable.
Conclusion
Benefits? There are both web servers; designed to be able to replace one another. If both web servers are tuned correctly and you have ample hardware, then there is no real benefit of using one over another. You should try and see which web server meets your need, but asking me; I would say go with Lighttpd. It is, in my opinion, easier to configure and just works.
Also, You should look at Cherokee Web Server. Mad easy to set up and, the performance aint half bad. And you should ask this on Server Fault as well.

That you have mentioned gevent is important. Does that mean you are specifically trying to implement a long polling application? If you are and that functionality is the bulk of the application, then you will need to put your gevent server behind a front end web server that is implemented using async techniques rather that processes/threading model. Lighttd is an async server and fits that bill whereas Apache isn't. So use of Apache isn't good as front end proxy for long polling application. If that is the criteria though, would actually suggest you use nginx rather than Lighttpd.
Now if you are not doing long polling or anything else that needs high concurrency for long running requests, then you aren't necessarily going to gain too much by using gevent, especially if intention is to use a WSGI layer on top. For WSGI applications, ultimately the performance difference between different servers is minimal because your application is unlikely to be a hello world program that the benchmarks all use. The real bottlenecks are not the server but your application code, database, external callouts, lack of caching etc etc. In light of that, you should just use whatever WSGI hosting mechanism you find easier to use initially and when you properly work out what the hosting requirements are for your application, based on having an actual real application to test, then you can switch to something more appropriate if necessary.
In summary, you are just wasting your time trying to prematurely optimize by trying to find what may be the theoretically best server when in practice your application is what you should be concentrating on initially. After that, you also should be looking at application monitoring tools, because without monitoring tools how are you even going to determine if one hosting solution is better than another.

Need CGI (or another solution compatible with IIS 7) to handle massive uploads

We need to handle massive file uploads without spending resources on an IIS 7 server. To emphasize how light-weight this needs to be, let's say that we need to handle file uploads of sizes that are completely insane, like 100GB uploads, or something that can continue running for an extremely long time without consuming additional resources. Basically we need something that gives us control over the reception of the file from the moment it starts to the moment it ends.
A bit of background:
We're using ColdFusion as the server-side processor, but it has failed us when handling uploads beyond about 1GB and we've exhausted our configuration options. There's a long story behind that, but essentially, if a .cfm page (ColdFusion) is the destination of the file upload and it goes over about 1GB, it gives a 503 error... even if the target file doesn't exist. So clearly too much is going on merely by telling the server that we intend to process the file with a .cfm page.
We suspect that this is due to Java limitations because the server (or really, the workstation in this case) does not show any signs of load on CPU or memory. Since we have limited memory and this website is intended for a lot of concurrent uploads, we can't trust simply raising the virtual machine memory usage, especially because that simply doesn't work currently, even for a single connection... let alone the hundreds of concurrent connections we expect when we go live.
So we're down to writing a specialized solution using CGI that will handle file uploads only. Basically, we need control on the server-side that we don't get with ColdFusion or ASP.NET because those technologies do so many things on their own, behind the scenes, without giving us the control we need. They always end up spending up too many resources one way or the other for an arguably obvious reason; what we're trying to do is completely insane and not the intended function of those technologies. That's why we want a specialized uploader through CGI that bypasses all that ColdFusion/ASP.NET magic that keeps getting in the way, hoping it gives us the control we need.
But before we spent countless hours on this, I figured I'd ask around and see if anyone knows of a proper solution to this problem that might be viable in our case.
The only real restriction here is that it has to be CGI, and it has to run on IIS 7, therefore a Windows "Server" environment. We're fine with it being written in Python, Perl, name it... provided it can run as a CGI, but it has to run as a CGI... unless of course someone has better ideas on how to do this.
So the magic question is; are there CGI solutions out there that already do this or are we stuck with writing it on our own, hoping that the reason no one else has done it already is some other than it being impossible?
Thanks in advance.

You're not going to get reliable multi-GB uploads from a dumb client (eg a browser and standard upload behaviour). Been there, done that, written commercial digital asset management solutions handling huge files.
The key to any degree of reliability in this scenario is chunking - you need to be able to chunk the upload, send each chunk as a discrete file, and re-assemble it server side.
What are your client restrictions though (if any)? Can you use a java applet? Could you even have a client side app ?
One possible starting point for a browser based solution would be the jupload opensource project but there are plenty of others.

You want WebDAV, not CGI. It provides all the nice bits that make file transfers not suck, like resuming and pausing.

the windows TCP stack is limited to 4GB file uploads. Anymore than that is not possible.

A good multithreaded python webserver?

I am looking for a python webserver which is multithreaded instead of being multi-process (as in case of mod_python for apache). I want it to be multithreaded because I want to have an in memory object cache that will be used by various http threads. My webserver does a lot of expensive stuff and computes some large arrays which needs to be cached in memory for future use to avoid recomputing. This is not possible in a multi-process web server environment. Storing this information in memcache is also not a good idea as the arrays are large and storing them in memcache would lead to deserialization of data coming from memcache apart from the additional overhead of IPC.
I implemented a simple webserver using BaseHttpServer, it gives good performance but it gets stuck after a few hours time. I need some more matured webserver. Is it possible to configure apache to use mod_python under a thread model so that I can do some object caching?

CherryPy. Features, as listed from the website:
A fast, HTTP/1.1-compliant, WSGI thread-pooled webserver. Typically, CherryPy itself takes only 1-2ms per page!
Support for any other WSGI-enabled webserver or adapter, including Apache, IIS, lighttpd, mod_python, FastCGI, SCGI, and mod_wsgi
Easy to run multiple HTTP servers (e.g. on multiple ports) at once
A powerful configuration system for developers and deployers alike
A flexible plugin system
Built-in tools for caching, encoding, sessions, authorization, static content, and many more
A native mod_python adapter
A complete test suite
Swappable and customizable...everything.
Built-in profiling, coverage, and testing support.

Consider reconsidering your design. Maintaining that much state in your webserver is probably a bad idea. Multi-process is a much better way to go for stability.
Is there another way to share state between separate processes? What about a service? Database? Index?
It seems unlikely that maintaining a huge array of data in memory and relying on a single multi-threaded process to serve all your requests is the best design or architecture for your app.

Twisted can serve as such a web server. While not multithreaded itself, there is a (not yet released) multithreaded WSGI container present in the current trunk. You can check out the SVN repository and then run:
twistd web --wsgi=your.wsgi.application

Its hard to give a definitive answer without knowing what kind of site you are working on and what kind of load you are expecting. Sub second performance may be a serious requirement or it may not. If you really need to save that last millisecond then you absolutely need to keep your arrays in memory. However as others have suggested it is more than likely that you don't and could get by with something else. Your usage pattern of the data in the array may affect what kinds of choices you make. You probably don't need access to the entire set of data from the array all at once so you could break your data up into smaller chunks and put those chunks in the cache instead of the one big lump. Depending on how often your array data needs to get updated you might make a choice between memcached, local db (berkley, sqlite, small mysql installation, etc) or a remote db. I'd say memcached for fairly frequent updates. A local db for something in the frequency of hourly and remote for the frequency of daily. One thing to consider also is what happens after a cache miss. If 50 clients all of a sudden get a cache miss and all of them at the same time decide to start regenerating those expensive arrays your box(es) will quickly be reduced to 8086's. So you have to take in to consideration how you will handle that. Many articles out there cover how to recover from cache misses. Hope this is helpful.

Not multithreaded, but twisted might serve your needs.

You could instead use a distributed cache that is accessible from each process, memcached being the example that springs to mind.

web.py has made me happy in the past. Consider checking it out.
But it does sound like an architectural redesign might be the proper, though more expensive, solution.

Perhaps you have a problem with your implementation in Python using BaseHttpServer. There's no reason for it to "get stuck", and implementing a simple threaded server using BaseHttpServer and threading shouldn't be difficult.
Also, see http://pymotw.com/2/BaseHTTPServer/index.html#module-BaseHTTPServer about implementing a simple multi-threaded server with HTTPServer and ThreadingMixIn

I use CherryPy both personally and professionally, and I'm extremely happy with it. I even do the kinds of thing you're describing, such as having global object caches, running other threads in the background, etc. And it integrates well with Apache; simply run CherryPy as a standalone server bound to localhost, then use Apache's mod_proxy and mod_rewrite to have Apache transparently forward your requests to CherryPy.
The CherryPy website is http://cherrypy.org/

I actually had the same issue recently. Namely: we wrote a simple server using BaseHTTPServer and found that the fact that it's not multi-threaded was a big drawback.
My solution was to port the server to Pylons (http://pylonshq.com/). The port was fairly easy and one benefit was it's very easy to create a GUI using Pylons so I was able to throw a status page on top of what's basically a daemon process.
I would summarize Pylons this way:
it's similar to Ruby on Rails in that it aims to be very easy to deploy web apps
it's default templating language, Mako, is very nice to work with
it uses a system of routing urls that's very convenient
for us performance is not an issue, so I can't guarantee that Pylons would perform adequately for your needs
you can use it with Apache & Lighthttpd, though I've not tried this
We also run an app with Twisted and are happy with it. Twisted has good performance, but I find Twisted's single-threaded/defer-to-thread programming model fairly complicated. It has lots of advantages, but would not be my choice for a simple app.
Good luck.

Just to point out something different from the usual suspects...
Some years ago while I was using Zope 2.x I read about Medusa as it was the web server used for the platform. They advertised it to work well under heavy load and it can provide you with the functionality you asked.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.