I'd like people's views on the current design I'm considering for a Tornado app. Although I'm using MongoDB to store permanent information, I currently keep the session information in a Python data structure that I've simply added to the Application object at initialisation.
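For context, the structure I mean is roughly this (just a sketch; the handler list and session fields are placeholders):

    import tornado.web

    class Application(tornado.web.Application):
        def __init__(self):
            handlers = []  # request handlers omitted for brevity
            super(Application, self).__init__(handlers)
            # All session state lives in this one process, keyed by session id
            self.sessions = {}  # e.g. {"abc123": {"user_id": 42}}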
I will need to perform some iteration and manipulation of the sessions while the server is running. I keep debating whether to move these into MongoDB as well or just keep them as a Python structure.
Is there anything wrong with keeping session information this way?
If you store session data in Python, your application will:
lose it if you stop the Python process;
likely consume more memory, as Python isn't very efficient at memory management (and you will have to keep all the sessions in memory, not just the ones you need right now).
If these are not problems for you, you can go with Python structures. But usually these are serious concerns, and most projects use some external storage for sessions.
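For example, here's a minimal sketch of keeping sessions in MongoDB with pymongo (the database, collection, and field names are made up, and it assumes pymongo 3+):

    from pymongo import MongoClient

    client = MongoClient()            # assumes MongoDB on localhost:27017
    sessions = client.myapp.sessions  # hypothetical "myapp" DB

    def save_session(session_id, data):
        doc = dict(data, _id=session_id)
        sessions.replace_one({"_id": session_id}, doc, upsert=True)

    def load_session(session_id):
        return sessions.find_one({"_id": session_id})

This way sessions survive a restart of the Python process, and each worker only loads the sessions it actually touches.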
Related
We have an existing Python application (let's call it the control app) that does operational data logging as well as smaller controlling tasks on a machine. We want to extend this application with a web interface based on Flask (let's call it the web app). Both parts, the control app and the web app, already exist; however, the setup feels somehow fishy. In the process of rethinking the setup, I'm undecided on how to structure those two parts.
At the moment, the control app gathers machine data and stores it in a postgres database. Based on several machine states, additional operations are performed that provide new input for the PLCs that control the machine.
The web app currently polls the database to react to machine states, e.g. to update visualisation data, change some (state-representing) images, and such things.
The web app polling the database is the part that smells. So my idea was to unify both apps into one, coupling the web app tightly to the control app so it can react to machine state changes instead of polling the database for them.
Based on that idea, I'm wondering how to add a Flask app to an existing Python app. If I'm not mistaken, the Flask app consumes the application's main thread, which would break the already existing logic. Thus I would need to have one of the two parts running on another thread. Thinking about this problem, I'm further wondering whether this merging is a good idea at all.
So, the questions are: Is it a good idea to merge both applications? If yes, how do I merge them without breaking one of them? If not, how else can I get rid of the database polling (and how do I synchronize and also move some data from the web app to the control app)?
It's not a good idea to merge them per se: problems in one part will affect the other, and this sort of tight coupling is a bad idea both because you can't run the two parts of the program on separate machines and because if one crashes, so does the other. It's better to have them communicate over some sort of protocol.
If I were designing this, I would probably do the same thing as you did, except that instead of using an SQL database for this, I would use something like Redis, which stores its data in memory. Redis allows you to subscribe to events rather than poll for updates, and even polling is cheaper because it's in memory.
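A rough sketch of that idea with redis-py (the channel name and payload are illustrative, and it assumes a local Redis instance): the control app publishes state changes, and the web app blocks on them instead of polling.

    import json
    import redis

    r = redis.Redis()

    # Control app side: announce the state change when it happens.
    r.publish("machine_state", json.dumps({"state": "running"}))

    # Web app side: subscribe and react instead of polling the database.
    pubsub = r.pubsub()
    pubsub.subscribe("machine_state")
    for message in pubsub.listen():
        if message["type"] == "message":
            state = json.loads(message["data"])
            print("machine is now", state["state"])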
Is there any way to maintain a variable that is accessible and mutable across processes?
Example
User A makes a request to a view called make_foo, and the operation within that view takes time. We want a flag variable making_foo = True that is visible to User B when they make a request, and to any other user or service within that Django app, with the ability to set it to False when done.
Don't take the example too seriously; I know about task queues, but what I am trying to understand is the idea of having a shared mutable variable across processes without the need to use a database.
Is there any best practice to achieve that?
One thing you need to be aware of is that when your Django server is running in production, there is not just one Django process; there will be several worker processes (and often several threads per process) running at the same time.
If you want to share data between processes, even internally, you will need some kind of database to do so, whether that's with SQLite3 or Redis (which I recommend for stuff like this).
I won't go into the details because other people have already covered them, but Redis is an in-memory database that uses key-value storage (unlike Django's model layer, Redis is essentially a giant dictionary). Redis is fast, and most of its operations are atomic, which means you are unlikely to encounter race conditions.
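As a sketch of the making_foo example from the question, using redis-py (the key name, expiry, and the stand-in slow operation are my inventions):

    import time
    import redis

    r = redis.Redis()  # assumes a local Redis instance

    def do_slow_operation():
        time.sleep(5)  # stand-in for the real work

    def make_foo():
        # Set the flag with a safety expiry so a crashed worker can't
        # leave it stuck on forever; every worker process sees this key.
        r.set("making_foo", 1, ex=600)
        try:
            do_slow_operation()
        finally:
            r.delete("making_foo")

    def is_making_foo():
        return r.exists("making_foo") == 1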
I am having a hard time trying to figure out the big picture of how the uwsgi server handles multiple requests with a Django or Pyramid application.
My understanding at the moment is this:
When multiple HTTP requests are sent to the uwsgi server concurrently, the server creates separate processes or threads (copies of itself) for every request (or assigns the requests to existing ones), and every process/thread loads the web application's code (say Django or Pyramid) into the computer's memory, executes it, and returns the response. In between, every copy of the code can access the session, cache, or database. There is usually a separate database server, and it can also handle concurrent requests to the database.
So here are some questions I am fighting with.
Is my above understanding correct or not?
Do the copies of the code interact with each other somehow, or are they wholly separated from each other?
What about the session or cache? Are they shared between them or are they local to each copy?
How are they created: by the webserver or by copies of python code?
How are responses returned to the requesters: by each process concurrently, or are they put into some kind of queue and sent synchronously?
I have googled these questions and found very interesting answers on StackOverflow, but I still can't get the whole picture; the whole process remains a mystery to me. It would be fantastic if someone could explain the whole picture in terms of Django or Pyramid with uwsgi or whatever web server.
Sorry for asking kind of dumb questions, but they really torment me every night and I am looking forward to your help:)
There's no magic in Pyramid or Django that gets you past process boundaries. The answers depend entirely on the particular server you've selected and the settings you've chosen. For example, uwsgi has the ability to run multiple threads and multiple processes. If uwsgi spins up multiple processes, then they will each have their own copies of data, which are not shared unless you take the time to create some IPC (this is why you should keep state in a third party like a database instead of in in-memory objects, which are not shared across processes). Each process initializes a WSGI object (let's call it app) which the server calls via body_iter = app(environ, start_response). This app object is shared across all of the threads in the process and is invoked concurrently, thus it needs to be threadsafe (usually the structures the app uses are either threadlocal or read-only to deal with this, for example a connection pool to the database).
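To make the shape of that concrete, here is a minimal WSGI callable; every thread in a worker process invokes this same object concurrently, so it keeps all per-request state in local variables:

    def app(environ, start_response):
        # Local variables only: concurrent calls can't trample each other.
        # Anything shared (connection pools etc.) must itself be threadsafe.
        body = ("Hello from %s\n" % environ["PATH_INFO"]).encode("utf-8")
        start_response("200 OK", [("Content-Type", "text/plain"),
                                  ("Content-Length", str(len(body)))])
        return [body]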
In general, the answers to your questions are that things happen concurrently, and that objects may or may not be shared depending on your server model; either way, you should take anything that you want shared and store it somewhere that can handle concurrency properly (a database).
The power and weakness of web servers is that they are in principle stateless. This enables them to be massively parallel. So indeed, for each page request a different thread may be spawned. Whether or not this actually happens depends on the configuration settings of your web server. There's also a cost to spawning many threads, so if possible threads are reused from a thread pool.
Almost all serious web servers have a page cache. So if the same page is requested multiple times, it can be retrieved from the cache. In addition, browsers do their own caching. A web server has to be clever about what to cache and what not. Static pages aren't a big problem, although they may be replaced, in which case it is quite confusing to still be served the old page because of the cache.
One way to defeat the cache is by adding (dummy) parameters to the page request.
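For example (a throwaway sketch, with a placeholder URL), appending a timestamp makes every request look like a distinct URL to intermediate caches:

    import time
    import urllib.request

    url = "http://example.com/status"
    busted = "%s?nocache=%d" % (url, int(time.time()))
    response = urllib.request.urlopen(busted)  # the cache sees a fresh URL
    print(response.status)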
The statelessness of the web was initially welcomed as a necessity for achieving scalability, where web pages of busy sites could even be served concurrently from different servers at nearby or remote locations.
However, the trend is to have stateful apps. State can be maintained in the browser, easing the burden on the server. If it's maintained on the server, it requires the server to know 'who's talking'. One way to do this is by saving and recognizing cookies (small identifiable bits of data) on the client.
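A minimal sketch of that cookie mechanism in Flask (the cookie name and route are arbitrary):

    import uuid
    from flask import Flask, request, make_response

    app = Flask(__name__)

    @app.route("/")
    def index():
        session_id = request.cookies.get("session_id")
        if session_id is None:
            # First visit: hand the browser an id it will send back later.
            session_id = uuid.uuid4().hex
        resp = make_response("your id is %s" % session_id)
        resp.set_cookie("session_id", session_id)
        return resp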
For databases the story is a bit different. As soon as anything gets stored that relates to a particular user, the application is in principle stateful. While there's no conceptual difference between retaining state on disk and in RAM, traditionally statefulness was left to the database, which in turn used thread pools and load balancing to do its job efficiently.
With the advent of very large internet shops like Amazon and Google, mandatory disk access to achieve statefulness created a performance problem. The answer was in-memory databases. While they may be accessed traditionally using e.g. SQL, they offer much more flexibility in the way data is stored conceptually.
A type of database that enjoys growing popularity is the persistent object store. With this kind of database, while the distinction can still be made formally, the boundary between web server and database is blurred. Both keep their data in RAM (but can swap to disk if needed), and both work with objects rather than the flat records of SQL tables. These objects can be interconnected in complex ways.
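ZODB is one example of such a store; a hedged sketch (the class and file names are mine):

    import persistent
    import transaction
    from ZODB import FileStorage, DB

    class Player(persistent.Persistent):
        def __init__(self, name):
            self.name = name
            self.friends = []  # objects can reference each other freely

    db = DB(FileStorage.FileStorage("game.fs"))  # RAM-resident, disk-backed
    root = db.open().root()

    root["alice"] = Player("alice")
    transaction.commit()  # the object graph is now persistent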
In short, there's an explosion of innovative storage, thread pooling, caching, persistence, redundancy, and synchronisation technology, driving what has become popularly known as 'the cloud'.
I've done a couple of years of large-scale game server development in PHP. A load balancer delegates incoming requests to one server in a cluster. In the name of better performance, we began caching all static data (essentially the game world's model objects) on each of the instances in that cluster, directly in Apache shared memory, using apc_store and apc_fetch.
For a number of reasons, we're now beginning to develop a similar game framework in Python, using the Flask microframework. At first glance, this per-instance memory store is the one piece that doesn't appear to translate directly to Python/Flask. We're presently considering running Memcached locally on each instance (to avoid streaming fairly large model objects over the wire from our main Memcached cluster).
What can we use instead?
I would think that even in this case you might want to consider having a centralized key/value store rather than a series of independent ones on each server. Unless your load balancer always routes the same users to the same servers, you could run into a case where a user's requests land on a different server each time, so each node would have to retrieve the game state instead of accessing it from a shared cache.
Also, the memory strain that a local key/value store on each system might incur could slow down your game server's other functions, though that largely depends on the amount of data being cached.
In general the best approach would be to run some benchmarks to see what kind of performance you'd get with a memcached cluster and the types of objects you're storing vs local storage.
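Something as simple as this gives a first-order comparison (a sketch assuming a local memcached on the default port and the python-memcached client; keys and payloads are made up):

    import timeit
    import memcache

    mc = memcache.Client(["127.0.0.1:11211"])
    payload = {"tiles": list(range(1000))}
    mc.set("world:region:1", payload)

    local_cache = {"world:region:1": payload}

    print("memcached:", timeit.timeit(lambda: mc.get("world:region:1"), number=1000))
    print("local dict:", timeit.timeit(lambda: local_cache["world:region:1"], number=1000))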
Depending on what other features you want from your key/value store, you might also want to look into some alternatives like MongoDB (http://www.mongodb.org/).
[Five months later]
Our game framework is done.
In the end, we decided to store the static data as fully initialized SQLAlchemy model instances in each web server. When a newly booted game server is warming up, these instances are first constructed by hitting a shared MySQL DB.
Since our Model factories defer to an instance pool, the Model instances need only be constructed once per deployment per server – this is important, because at our scale, MySQL would weep under any sort of ongoing load. We accomplished our goal of not streaming this data over the wire by keeping the item definitions as close to our app code as possible: in the app code itself.
I now realize that my original question was naive: unlike in the LAMP stack, the Flask server keeps running between requests, so the server's memory itself is "shared memory" – there's no need for something like APC to make it so. In fact, anything outside of the request-processing scope itself and Flask's threadsafe local store can be considered "shared memory".
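In other words, something like this sketch (the names are invented; in the real framework the warm-up hits our shared MySQL DB through SQLAlchemy):

    import json
    from flask import Flask

    app = Flask(__name__)

    # Module-level "shared memory": built once per process at warm-up and
    # treated as read-only afterwards, so request threads share it safely.
    ITEM_DEFINITIONS = {}

    def warm_up():
        # Stand-in rows; really these come from the database.
        for item_id, name in [(1, "sword"), (2, "shield")]:
            ITEM_DEFINITIONS[item_id] = {"id": item_id, "name": name}

    warm_up()

    @app.route("/item/<int:item_id>")
    def item(item_id):
        return json.dumps(ITEM_DEFINITIONS.get(item_id, {}))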
I have been looking into different systems for creating a fast cache in a web farm running Python/mod_wsgi. Memcache and others are options ... but I was wondering:
Because I don't need to share data across machines, and want each machine to maintain its own local cache ...
Does Python or WSGI provide a mechanism for Python-native shared data in Apache, such that the data persists and is available to all threads/processes until the server is restarted? That way I could just keep a cache of objects with concurrency control in the memory space of all running application instances.
If not, it sure would be useful.
Thanks!
This is thoroughly covered by the Sharing and Global Data section of the mod_wsgi documentation. The short answer is: No, not unless you run everything in one process, but that's not an ideal solution.
It should be noted that caching is ridiculously easy to do with Beaker middleware, which supports multiple backends including memcache.
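A minimal sketch with Beaker's cache manager (the in-memory backend is shown here, 'cache.type' can point at memcached instead, and the cached function is invented):

    from beaker.cache import CacheManager
    from beaker.util import parse_cache_config_options

    cache_opts = {"cache.type": "memory", "cache.expire": 300}
    cache = CacheManager(**parse_cache_config_options(cache_opts))

    @cache.cache("expensive_lookup", expire=300)
    def expensive_lookup(key):
        # Pretend this is a slow computation or a database hit.
        return key.upper()

    print(expensive_lookup("abc"))  # computed once, then served from cache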
There's Django's thread-safe in-memory cache backend, see here. It's cPickle-based, and although it's designed for use with Django, it has minimal dependencies on the rest of Django, so you could easily refactor it to remove these. Obviously each process would get its own cache, shared between its threads. If you want a cache shared by all processes on the same machine, you could run this cache in its own process with an IPC interface of your choice (domain sockets, say), or use memcached locally, or, if you might ever want persistence across restarts, something like Tokyo Cabinet with a Python interface like this.
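For reference, a sketch of wiring up that local-memory backend in a modern Django project (each process gets its own copy, shared between its threads):

    # settings.py
    CACHES = {
        "default": {
            "BACKEND": "django.core.cache.backends.locmem.LocMemCache",
            "LOCATION": "unique-snowflake",
        }
    }

    # anywhere in app code
    from django.core.cache import cache

    cache.set("making_foo", True, timeout=300)
    print(cache.get("making_foo"))  # True, within this process only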
I realize this is an old thread, but here's another option for a "server-wide dict": http://poshmodule.sourceforge.net/posh/html/posh.html (POSH, Python Shared Objects). Disclaimer: haven't used it myself yet.