Python/mod_wsgi server global data

Python/mod_wsgi server global data - python

I have been looking into different systems for creating a fast cache in a web-farm running Python/mod_wsgi. Memcache and others are options ... But I was wondering:
Because I don't need to share data across machines, wanting each machine to maintain a local cache ...
Does Python or WSGI provide a mechanism for Python native shared data in Apache such that the data persists and is available to all threads/processes until the server is restarted? This way I could just keep a cache of objects with concurrency control in the memory space of all running application instances?
If not, it sure would be useful
Thanks!

This is thoroughly covered by the Sharing and Global Data section of the mod_wsgi documentation. The short answer is: No, not unless you run everything in one process, but that's not an ideal solution.
It should be noted that caching is ridiculously easy to do with Beaker middleware, which supports multiple backends including memcache.

There's Django's thread-safe in-memory cache back-end, see here. It's cPickle-based, and although it's designed for use with Django, it has minimal dependencies on the rest of Django and you could easily refactor it to remove these. Obviously each process would get its own cache, shared between its threads; If you want a cache shared by all processes on the same machine, you could just use this cache in its own process with an IPC interface of your choice (domain sockets, say) or use memcached locally, or, if you might ever want persistence across restarts, something like Tokyo Cabinet with a Python interface like this.

I realize this is an old thread, but here's another option for a "server-wide dict": http://poshmodule.sourceforge.net/posh/html/posh.html (POSH, Python Shared Objects). Disclaimer: haven't used it myself yet.

Related

Multiple flask processes with a shared resource

Multiple flask processes (managed by gunicorn) serve the frontend and have to use a shared resource: A data structure that allows reads and updates and therefore needs to be protected by a simple (or RW) lock.
What options do I have regarding the communication between web frontend and data structure? I already had a look at the following libraries:
pyZMQ. I'm held back by the problem that arises when the service is restarting while the client is expecting data. Also I would need to implement method calling, de-/serialization and the like.
https://github.com/0rpc/zerorpc-python This is an additional layer around pyZMQ and works around this issue but seems not very actively developed and I don't want to be forced to use gevent.
Pyro. Seems to provide the functionality I need (using a single instance or python threads for the service). Might be a bit heavyweight for my needs.
socketserver. Pretty lowlevel but might also do what I want as long as I implement method calling, de-/serialization, ...
Are there better options?

Minimal example of uWSGI's shared memory?

Does anyone have a minimal working example of how to use uWSGI to share memory across requests in say Django?
I have a large file in proprietary format (not database-compatible) that I need to load for each request.
An instagram post got me thinking which states:
For the application server, we use uWSGI with pre-fork mode to leverage memory sharing between master and worker processes.
How would you set something like that up?

There are multiple ways to handle this:
Share by "abusing" copy-on-write for read-only data
If your data is read-only, you can leverage the fact that uWSGI is executing your python code to get the application before forking into multiple processes. This means all the data that is already loaded before the fork happens will be shared with all your processes.
This can be a great tool because you don't have to do anything dealing with multi-processing to enjoy this mechanism. But be careful, as soon as any process writes to this data, it will copy it first to get its own local version.
Django doesn't make it easy because all the views are lazy. This means django won't try to run code related to your view when application is created. Therefore to enjoy the pre-fork sharing you need to load the data in a code outside your views. For instance you can consider loading the data right before or after you built your application object (like in the gist linked by #john-strood).
Use uWSGI cache framework
If you need to write to this data, a first solution is to use uWSGI cache framework. It's fairly easy to use. You need to configure in advance how much memory you need, and then all your processes can read and write to it. You don't have to deal with locking or other multi-processing related issues.
The drawback is that you still incur IO latency between your processes and uWSGI cache's process. This is insignificant for tiny chunks of data, but would be prohibitive for gigabytes.
Use shared memory manually
As a last resort, if your data is not read-only and you need to load large chunks at all request, so large that even sending through a unix socket would take too long, then you need to load your data directly in the shared memory space. Here uWSGI won't help, and you will have to deal with locking and multi-processing issues yourself.
You can refer to multiprocessing's shared memory documentation.

Questions about django thread safety

I have a django app which is used for managing registrations to a survey.
There are fixed number of slots and I want to "reserve" slots for users when they sign up.
In one of my views, I get the next available slot and reserve it (or redirect the user if there are no slots available.)
I want to protect against the case where two user's signing up at the same time get registered for the same slot because the the method "get_next_available_slot" returned the same slot for both users.
For this I am trying to understand the use of processes and threads with Django's views.
1) Is each request handled in a separate thread and will using python threading module's LOCK() take care of exclusive access?
2) I am running apache on RHEL with modwsgi. How do I configure Apache/modwsgi to ensure a more easy and simple solution to handle the above situation?
I am not expecting a huge load on the web application at all. So I would like a simpler solution instead of a high performance one.

You should not make assumptions about thread/process setup of django application, because it depends on web server you're using and how django is integrated to it. Therefore, interprocess communication methods should not rely on these details to be reliable. One good solution is using built-in cache library for locks and shared data.
Here's a good example of cache lock ensuring only once instance of celery task is run at a time. You can apply it to regular requests as well.

You shouldn't be worrying about this kind of stuff.
These slots are stored in a database right? The database should handle all the locking mechanisms for you, just make sure you run everything under a transaction and you will be fine.

A good persistent synchronous queue in python

I don't immediately care about fifo or filo options, but it might be nice in the future..
What I'm looking for a is a nice fast simple way to store (at most a gig of data or tens of millions of entries) on disk that can be get and put by multiple processes. The entries are just simple 40 byte strings, not python objects. Don't really need all the functionality of shelve.
I've seen this http://code.activestate.com/lists/python-list/310105/
It looks simple. It needs to be upgraded to the new Queue version.
Wondering if there's something better? I'm concerned that in the event of a power interruption, the entire pickled file becomes corrupt instead of just one record.

Try using Celery. It's not pure python, as it uses RabbitMQ as a backend, but it's reliable, persistent and distributed, and, all in all, far better then using files or database in the long run.

I think that PyBSDDB is what you want. You can choose a queue as the access type. PyBSDDB is a Python module based on Oracle Berkeley DB.
It has synchronous access and can be accessed from different processes although I don't know if that is possible from the Python bindings. About multiple processes writing to the db I found this thread.

This is a very old question, but persist-queue seems to be a nice tool for this kind of task.
persist-queue implements a file-based queue and a serial of
sqlite3-based queues. The goals is to achieve following requirements:
Disk-based: each queued item should be stored in disk in case of any crash.
Thread-safe: can be used by multi-threaded producers and multi-threaded consumers.
Recoverable: Items can be read after process restart.
Green-compatible: can be used in greenlet or eventlet environment.
By default, persist-queue use pickle object serialization module to
support object instances. Most built-in type, like int, dict, list are
able to be persisted by persist-queue directly, to support customized
objects, please refer to Pickling and unpickling extension
types(Python2) and Pickling Class Instances(Python3)

Using files is not working?...
Use a journaling file system to recover from power interruptions. That's their purpose.

A good multithreaded python webserver?

I am looking for a python webserver which is multithreaded instead of being multi-process (as in case of mod_python for apache). I want it to be multithreaded because I want to have an in memory object cache that will be used by various http threads. My webserver does a lot of expensive stuff and computes some large arrays which needs to be cached in memory for future use to avoid recomputing. This is not possible in a multi-process web server environment. Storing this information in memcache is also not a good idea as the arrays are large and storing them in memcache would lead to deserialization of data coming from memcache apart from the additional overhead of IPC.
I implemented a simple webserver using BaseHttpServer, it gives good performance but it gets stuck after a few hours time. I need some more matured webserver. Is it possible to configure apache to use mod_python under a thread model so that I can do some object caching?

CherryPy. Features, as listed from the website:
A fast, HTTP/1.1-compliant, WSGI thread-pooled webserver. Typically, CherryPy itself takes only 1-2ms per page!
Support for any other WSGI-enabled webserver or adapter, including Apache, IIS, lighttpd, mod_python, FastCGI, SCGI, and mod_wsgi
Easy to run multiple HTTP servers (e.g. on multiple ports) at once
A powerful configuration system for developers and deployers alike
A flexible plugin system
Built-in tools for caching, encoding, sessions, authorization, static content, and many more
A native mod_python adapter
A complete test suite
Swappable and customizable...everything.
Built-in profiling, coverage, and testing support.

Consider reconsidering your design. Maintaining that much state in your webserver is probably a bad idea. Multi-process is a much better way to go for stability.
Is there another way to share state between separate processes? What about a service? Database? Index?
It seems unlikely that maintaining a huge array of data in memory and relying on a single multi-threaded process to serve all your requests is the best design or architecture for your app.

Twisted can serve as such a web server. While not multithreaded itself, there is a (not yet released) multithreaded WSGI container present in the current trunk. You can check out the SVN repository and then run:
twistd web --wsgi=your.wsgi.application

Its hard to give a definitive answer without knowing what kind of site you are working on and what kind of load you are expecting. Sub second performance may be a serious requirement or it may not. If you really need to save that last millisecond then you absolutely need to keep your arrays in memory. However as others have suggested it is more than likely that you don't and could get by with something else. Your usage pattern of the data in the array may affect what kinds of choices you make. You probably don't need access to the entire set of data from the array all at once so you could break your data up into smaller chunks and put those chunks in the cache instead of the one big lump. Depending on how often your array data needs to get updated you might make a choice between memcached, local db (berkley, sqlite, small mysql installation, etc) or a remote db. I'd say memcached for fairly frequent updates. A local db for something in the frequency of hourly and remote for the frequency of daily. One thing to consider also is what happens after a cache miss. If 50 clients all of a sudden get a cache miss and all of them at the same time decide to start regenerating those expensive arrays your box(es) will quickly be reduced to 8086's. So you have to take in to consideration how you will handle that. Many articles out there cover how to recover from cache misses. Hope this is helpful.

Not multithreaded, but twisted might serve your needs.

You could instead use a distributed cache that is accessible from each process, memcached being the example that springs to mind.

web.py has made me happy in the past. Consider checking it out.
But it does sound like an architectural redesign might be the proper, though more expensive, solution.

Perhaps you have a problem with your implementation in Python using BaseHttpServer. There's no reason for it to "get stuck", and implementing a simple threaded server using BaseHttpServer and threading shouldn't be difficult.
Also, see http://pymotw.com/2/BaseHTTPServer/index.html#module-BaseHTTPServer about implementing a simple multi-threaded server with HTTPServer and ThreadingMixIn

I use CherryPy both personally and professionally, and I'm extremely happy with it. I even do the kinds of thing you're describing, such as having global object caches, running other threads in the background, etc. And it integrates well with Apache; simply run CherryPy as a standalone server bound to localhost, then use Apache's mod_proxy and mod_rewrite to have Apache transparently forward your requests to CherryPy.
The CherryPy website is http://cherrypy.org/

I actually had the same issue recently. Namely: we wrote a simple server using BaseHTTPServer and found that the fact that it's not multi-threaded was a big drawback.
My solution was to port the server to Pylons (http://pylonshq.com/). The port was fairly easy and one benefit was it's very easy to create a GUI using Pylons so I was able to throw a status page on top of what's basically a daemon process.
I would summarize Pylons this way:
it's similar to Ruby on Rails in that it aims to be very easy to deploy web apps
it's default templating language, Mako, is very nice to work with
it uses a system of routing urls that's very convenient
for us performance is not an issue, so I can't guarantee that Pylons would perform adequately for your needs
you can use it with Apache & Lighthttpd, though I've not tried this
We also run an app with Twisted and are happy with it. Twisted has good performance, but I find Twisted's single-threaded/defer-to-thread programming model fairly complicated. It has lots of advantages, but would not be my choice for a simple app.
Good luck.

Just to point out something different from the usual suspects...
Some years ago while I was using Zope 2.x I read about Medusa as it was the web server used for the platform. They advertised it to work well under heavy load and it can provide you with the functionality you asked.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.