Python web application: How to keep state

Python web application: How to keep state - python

I wrote a WSGI compatible web application using web.py that loads a few dozen MB data into memory during startup.
It works quite well with the web.py integrated server.
However, using Apache 2 + mod_wsgi, every single request reloads the data, essentially starting the program again. Due to the loading time of several seconds, this is unbearable.
Is it inherent to mod_wsgi or can it be configured? What are my alternatives?

"Is it inherent to mod_wsgi?" No. It's inherent in HTTP
Since you didn't post your mod_wsgi configuration, it's impossible to say what you did wrong.
I can only guess that you didn't use daemon mode.
See http://code.google.com/p/modwsgi/wiki/ConfigurationGuidelines#Defining_Process_Groups for more information on daemon mode.
This may not be the best solution. It may be better (far, far better) to use a proper database. Without actual code examples, and more details, this is all just random guessing.

Related

Reload single module in cherrypy?

Is it possible to use the python reload command (or similar) on a single module in a standalone cherrypy web app? I have a CherryPy based web application that often is under continual usage. From time to time I'll make an "important" change that only affects one module. I would like to be able to reload just that module immediately, without affecting the rest of the web application. A full restart is, admittedly, fast, however there are still several seconds of downtime that I would prefer to avoid if possible.

Reloading modules is very, very hard to do in a sane way. It leads to the potential of stale objects in your code with impossible-to-interrogate state and subtle bugs. It's not something you want to do.
What real web applications tend to do is to have a server that stays alive in front of their application, such as Apache with mod_proxy, to serve as a reverse proxy. You start your new app server, change your reverse proxy's routing, and only then kill the old app server.
No downtime. No insane, undebuggable code.

Fastest, simplest way to handle long-running upstream requests for Django

I'm using Django with Uwsgi. We have 8 processes running, and I have no real indication that our code is particularly thread safe, as it was never designed with threads in mind.
Recently, we added the ability to get live rates from vendors of a service through their various APIs and display them at once for the user. The problem is these requests are old web services technologies, and due to their response times, the time needed before the all rates from vendors are acquired (or it gives up), can be up to 10 seconds.
This presents a problem. We have a pretty decent amount of traffic on our site, and the customers need to look at these rates pretty often. With only 8 processes, it's quite easy to see how the server can get tied up waiting on these upstream requests. Especially when other optimizations need to be made to make the site baseline faster anyway (we're working on that).
We made a separate library (which should be mostly threadsafe, and if not, should be converted to it easily enough) for the rates requesting, and we can separate out its configuration. So I was thinking of making a separate service with its own threads, perhaps in Twisted, and having the browser contact that service for JSON instead of having it run in the main Django server.
Is this solution a good one? Can you think of a better or simpler way to do it? Should I use something other than Twisted, and if so, why?

If you want to use your code in-process with Django, you can simply call out to your Twisted by using Crochet, which can automatically manage the creation, running, and shutdown of the reactor within whatever WSGI implementation you choose (presuming that it behaves like a regular Python process, at least).
Obviously it might be less complex to just run within the Twisted WSGI container :-).
It might also be worth looking at TReq to issue your service client requests; your new "thread safe" library will still have the disadvantage of tying up an entire thread for each blocking client, which is a non-trivial amount of memory and additional concurrency overhead, whereas with Twisted you will only need to worry about a couple of objects.

Apache + mod_wsgi / Lighttpd + wsgi - am I going to see differences in performance?

I'm a newbie to developing with Python and I'm piecing together the information I need to make intelligent choices in two other open questions. (This isn't a duplicate.)
I'm not developing using a framework but building a web app from scratch using the gevent library. As far as front-end web servers go, it seems I have three choices: nginx, apache, and lighttpd.
From all accounts that I've read, nginx's mod_wsgi isn't suitable.
That leaves two choices - lighttpd and Apache. Under heavy load, am I going to see major differences in performance and memory consumption characteristics? I'm under the impression Apache tends to be memory hungry even when not using prefork, but I don't know how suitable lighttp is for Python apps.
Are there any caveats or benefits to using lighttpd over apache? I really want to hear all the information you can possibly bore me with!

Apache...
Apache is by far the most widely used web server out there. Which is a good thing. There is so much more information on how to do stuff with it, and when something goes wrong there are a lot of people who know how to fix it. But, it is also the slowest out of the box; requring a lot of tweaking and a beefier server than Lighttpd. In your case, it will be a lot easier to get off the ground using Apache and Python. There are countless AMP packages out there, and many guides on how to setup python and make your application work. Just a quick google search will get you on your way. Under heavy load, Lighttpd will outshine Apache, but Apache is like a train. It just keeps chugging along.
Pros
Wide User Base
Universal support
A lot of plugins
Cons
Slow out of the box
Requires performance tweaking
Memory whore (No way you could get it working on a 64MB VPS)
Lighttpd...
Lighttpd is the new kid on the block. It is fast, powerful, and kicks ass performance wise (not to mention use like no memory). Out of the box, Lighttpd wipes the floor with Apache. But, not as many people know Lighttpd, so getting it to work is harder. Yes, it is the second most used webserver, but it does not have as much community support behind it. If you look here, on stackoverflow, there is this dude who keeps asking about how to get his Python app working but nobody has helped him. Under heavy load, if configured correctly, Lighttpd will out preform Apache (I did some tests a while back, and you might see a 200-300% performance increase in requests per second).
Pros
Fast out of the box
Uses very little memory
Cons
Not as much support as Apache
Sometimes just does not work
Nginx
If you were running a static website, then you would use nginx. you are correct in saying nginx's mod_wsgi isn't suitable.
Conclusion
Benefits? There are both web servers; designed to be able to replace one another. If both web servers are tuned correctly and you have ample hardware, then there is no real benefit of using one over another. You should try and see which web server meets your need, but asking me; I would say go with Lighttpd. It is, in my opinion, easier to configure and just works.
Also, You should look at Cherokee Web Server. Mad easy to set up and, the performance aint half bad. And you should ask this on Server Fault as well.

That you have mentioned gevent is important. Does that mean you are specifically trying to implement a long polling application? If you are and that functionality is the bulk of the application, then you will need to put your gevent server behind a front end web server that is implemented using async techniques rather that processes/threading model. Lighttd is an async server and fits that bill whereas Apache isn't. So use of Apache isn't good as front end proxy for long polling application. If that is the criteria though, would actually suggest you use nginx rather than Lighttpd.
Now if you are not doing long polling or anything else that needs high concurrency for long running requests, then you aren't necessarily going to gain too much by using gevent, especially if intention is to use a WSGI layer on top. For WSGI applications, ultimately the performance difference between different servers is minimal because your application is unlikely to be a hello world program that the benchmarks all use. The real bottlenecks are not the server but your application code, database, external callouts, lack of caching etc etc. In light of that, you should just use whatever WSGI hosting mechanism you find easier to use initially and when you properly work out what the hosting requirements are for your application, based on having an actual real application to test, then you can switch to something more appropriate if necessary.
In summary, you are just wasting your time trying to prematurely optimize by trying to find what may be the theoretically best server when in practice your application is what you should be concentrating on initially. After that, you also should be looking at application monitoring tools, because without monitoring tools how are you even going to determine if one hosting solution is better than another.

How can I keep on-the-fly application-level statistics in an application running under Apache?

I have an application running under apache that I want to keep "in the moment" statistics on. I want to have the application tell me things like:
requests per second, broken down by types of request
latency to make requests to various backend services via thrift (broken down by service and server)
number of errors being served per second
etc.
I want to do this without any external dependencies. However, I'm running into issues sharing statistics between apache processes. Obviously, I can't just use global memory. What is a good pattern for this sort of issue?
The application is written in python using pylons, though I suspect this is more of a "communication across processes" design question than something that's python specific.

Perhaps you could keep the relevant counters and other statistics in a memcached, that is accessed by all apache processes?

I want to do this without any external
dependencies.
What if your apache dies somehow? (Separation of concerns?)
Personally I am using (redundant) Nagios to monitor the hardware itself, services, and application metrics. This way i can easily/automatically plot "requests per second/users online", "cpu load/user activy X per second" etc. graphs which help with lots of things.
Writing plugins for nagios is really easy, also there are thousands of premade scripts in any language.
Apache monitoring
I am monitoring apache by extracting the information I need from the apache mod_status page via nagios plugin.
Example plugin response:
APACHE OK - 0.080 sec. response time, Busy/Idle 18/16, open 766/800, ReqPerSec 12.4, BytesPerReq 3074, BytesPerSec 38034
Application Monitoring
I used mod_status just as an example for your list of things you'd like to monitor.
For our application we have a very small framework for Nagios plugins, so basically every nagios check is a small class which runs its check against a cache or database and returns its value to nagios (small and simple commandline-script).
more examples:
Memcache:
OK - consumption: 82.88% (106.1 MBytes/128.0 MBytes), connections: 2, requests/s: 10.99, hitrate: 95.2% (34601210/36346999), getrate: 50.1% (36346999/72542987)
Application feature #1 usage:
OK - last 5m: 22 last 24h: 655 ever: 26121
Application feature #2 usage:
OK - last 5m: 39 last 24h: 11011
Other applications metrics:
OK - users online: 556
What I want to say: Extending Nagios for application monitoring is very easy.
With my little framework which took me 3-4 hours to write, any check I am adding takes me just some minutes now.
Nagios plug-in development guidelines

Use pylons.g object. It is an instance of Globals class in your Pylons application's lib/app_globals.py file. Its state changes will be visible to all threads, so stuff in it needs to be threadsafe.
lib/app_globals.py:
class Globals(object):
def __init__(self):
self.requests_served = 0
controllers/status.py:
from pylons import g
class StatusController(BaseController):
def status(self):
g.requests_served += 1
return "Served %d requests." % g.requests_served

A good multithreaded python webserver?

I am looking for a python webserver which is multithreaded instead of being multi-process (as in case of mod_python for apache). I want it to be multithreaded because I want to have an in memory object cache that will be used by various http threads. My webserver does a lot of expensive stuff and computes some large arrays which needs to be cached in memory for future use to avoid recomputing. This is not possible in a multi-process web server environment. Storing this information in memcache is also not a good idea as the arrays are large and storing them in memcache would lead to deserialization of data coming from memcache apart from the additional overhead of IPC.
I implemented a simple webserver using BaseHttpServer, it gives good performance but it gets stuck after a few hours time. I need some more matured webserver. Is it possible to configure apache to use mod_python under a thread model so that I can do some object caching?

CherryPy. Features, as listed from the website:
A fast, HTTP/1.1-compliant, WSGI thread-pooled webserver. Typically, CherryPy itself takes only 1-2ms per page!
Support for any other WSGI-enabled webserver or adapter, including Apache, IIS, lighttpd, mod_python, FastCGI, SCGI, and mod_wsgi
Easy to run multiple HTTP servers (e.g. on multiple ports) at once
A powerful configuration system for developers and deployers alike
A flexible plugin system
Built-in tools for caching, encoding, sessions, authorization, static content, and many more
A native mod_python adapter
A complete test suite
Swappable and customizable...everything.
Built-in profiling, coverage, and testing support.

Consider reconsidering your design. Maintaining that much state in your webserver is probably a bad idea. Multi-process is a much better way to go for stability.
Is there another way to share state between separate processes? What about a service? Database? Index?
It seems unlikely that maintaining a huge array of data in memory and relying on a single multi-threaded process to serve all your requests is the best design or architecture for your app.

Twisted can serve as such a web server. While not multithreaded itself, there is a (not yet released) multithreaded WSGI container present in the current trunk. You can check out the SVN repository and then run:
twistd web --wsgi=your.wsgi.application

Its hard to give a definitive answer without knowing what kind of site you are working on and what kind of load you are expecting. Sub second performance may be a serious requirement or it may not. If you really need to save that last millisecond then you absolutely need to keep your arrays in memory. However as others have suggested it is more than likely that you don't and could get by with something else. Your usage pattern of the data in the array may affect what kinds of choices you make. You probably don't need access to the entire set of data from the array all at once so you could break your data up into smaller chunks and put those chunks in the cache instead of the one big lump. Depending on how often your array data needs to get updated you might make a choice between memcached, local db (berkley, sqlite, small mysql installation, etc) or a remote db. I'd say memcached for fairly frequent updates. A local db for something in the frequency of hourly and remote for the frequency of daily. One thing to consider also is what happens after a cache miss. If 50 clients all of a sudden get a cache miss and all of them at the same time decide to start regenerating those expensive arrays your box(es) will quickly be reduced to 8086's. So you have to take in to consideration how you will handle that. Many articles out there cover how to recover from cache misses. Hope this is helpful.

Not multithreaded, but twisted might serve your needs.

You could instead use a distributed cache that is accessible from each process, memcached being the example that springs to mind.

web.py has made me happy in the past. Consider checking it out.
But it does sound like an architectural redesign might be the proper, though more expensive, solution.

Perhaps you have a problem with your implementation in Python using BaseHttpServer. There's no reason for it to "get stuck", and implementing a simple threaded server using BaseHttpServer and threading shouldn't be difficult.
Also, see http://pymotw.com/2/BaseHTTPServer/index.html#module-BaseHTTPServer about implementing a simple multi-threaded server with HTTPServer and ThreadingMixIn

I use CherryPy both personally and professionally, and I'm extremely happy with it. I even do the kinds of thing you're describing, such as having global object caches, running other threads in the background, etc. And it integrates well with Apache; simply run CherryPy as a standalone server bound to localhost, then use Apache's mod_proxy and mod_rewrite to have Apache transparently forward your requests to CherryPy.
The CherryPy website is http://cherrypy.org/

I actually had the same issue recently. Namely: we wrote a simple server using BaseHTTPServer and found that the fact that it's not multi-threaded was a big drawback.
My solution was to port the server to Pylons (http://pylonshq.com/). The port was fairly easy and one benefit was it's very easy to create a GUI using Pylons so I was able to throw a status page on top of what's basically a daemon process.
I would summarize Pylons this way:
it's similar to Ruby on Rails in that it aims to be very easy to deploy web apps
it's default templating language, Mako, is very nice to work with
it uses a system of routing urls that's very convenient
for us performance is not an issue, so I can't guarantee that Pylons would perform adequately for your needs
you can use it with Apache & Lighthttpd, though I've not tried this
We also run an app with Twisted and are happy with it. Twisted has good performance, but I find Twisted's single-threaded/defer-to-thread programming model fairly complicated. It has lots of advantages, but would not be my choice for a simple app.
Good luck.

Just to point out something different from the usual suspects...
Some years ago while I was using Zope 2.x I read about Medusa as it was the web server used for the platform. They advertised it to work well under heavy load and it can provide you with the functionality you asked.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.