I have a very long list of objects that I would like to only load from the db once to memory (Meaning not for each session) this list WILL change it's values and grow over time by user inputs, The reason I need it in memory is because I am doing some complex searches on it and want to give a quick answer back.
My question is how do I load a list on the start of the server and keep it alive through sessions letting them all READ/WRITE to it.
Will it be better to do a heavy SQL search instead of keeping the list alive through my server?
The answer is that this is bad idea, you are opening a pandora's box specially since you need write access as well. However all is not lost. You can quite easily use redis for this task.
Redis is a peristent data store but at the same time everything is held in memory. If the redis server runs on the same device as the web server access is almost instantaneous
Related
I am having a hard time trying to figure out the big picture of the handling of multiple requests by the uwsgi server with django or pyramid application.
My understanding at the moment is this:
When multiple http requests are sent to uwsgi server concurrently, the server creates a separate processes or threads (copies of itself) for every request (or assigns to them the request) and every process/thread loads the webapplication's code (say django or pyramid) into computers memory and executes it and returns the response. In between every copy of the code can access the session, cache or database. There is a separate database server usually and it can also handle concurrent requests to the database.
So here some questions I am fighting with.
Is my above understanding correct or not?
Are the copies of code interact with each other somehow or are they wholly separated from each other?
What about the session or cache? Are they shared between them or are they local to each copy?
How are they created: by the webserver or by copies of python code?
How are responses returned to the requesters: by each process concurrently or are they put to some kind of queue and sent synchroniously?
I have googled these questions and have found very interesting answers on StackOverflow but anyway can't get the whole picture and the whole process remains a mystery for me. It would be fantastic if someone can explain the whole picture in terms of django or pyramid with uwsgi or whatever webserver.
Sorry for asking kind of dumb questions, but they really torment me every night and I am looking forward to your help:)
There's no magic in pyramid or django that gets you past process boundaries. The answers depend entirely on the particular server you've selected and the settings you've selected. For example, uwsgi has the ability to run multiple threads and multiple processes. If uwsig spins up multiple processes then they will each have their own copies of data which are not shared unless you took the time to create some IPC (this is why you should keep state in a third party like a database instead of in-memory objects which are not shared across processes). Each process initializes a WSGI object (let's call it app) which the server calls via body_iter = app(environ, start_response). This app object is shared across all of the threads in the process and is invoked concurrently, thus it needs to be threadsafe (usually the structures the app uses are either threadlocal or readonly to deal with this, for example a connection pool to the database).
In general the answers to your questions are that things happen concurrently, and objects may or may not be shared based on your server model but in general you should take anything that you want to be shared and store it somewhere that can handle concurrency properly (a database).
The power and weakness of webservers is that they are in principle stateless. This enables them to be massively parallel. So indeed for each page request a different thread may be spawned. Wether or not this indeed happens depends on the configuration settings of you webserver. There's also a cost to spawning many threads, so if possible threads are reused from a thread pool.
Almost all serious webservers have page cache. So if the same page is requested multiple times, it can be retrieved from cache. In addition, browsers do their own caching. A webserver has to be clever about what to cache and what not. Static pages aren't a big problem, although they may be replaced, in which case it is quite confusing to still get the old page served because of the cache.
One way to defeat the cache is by adding (dummy) parameters to the page request.
The statelessness of the web was initialy welcomed as a necessity to achieve scalability, where webpages of busy sites even could be served concurrently from different servers at nearby or remote locations.
However the trend is to have stateful apps. State can be maintained on the browser, easing the burden on the server. If it's maintained on the server it requires the server to know 'who's talking'. One way to do this is saving and recognizing cookies (small identifiable bits of data) on the client.
For databases the story is a bit different. As soon as anything gets stored that relates to a particular user, the application is in principle stateful. While there's no conceptual difference between retaining state on disk and in RAM memory, traditionally statefulness was left to the database, which in turned used thread pools and load balancing to do its job efficiently.
With the advent of very large internet shops like amazon and google, mandatory disk access to achieve statefulness created a performance problem. The answer were in-memory databases. While they may be accessed traditionally using e.g. SQL, they offer much more flexibility in the way data is stored conceptually.
A type of database that enjoys growing popularity is persistent object store. With this database, while the distinction still can be made formally, the boundary between webserver and database is blurred. Both have their data in RAM (but can swap to disk if needed), both work with objects rather than flat records as in SQL tables. These objects can be interconnected in complex ways.
In short there's an explosion of innovative storage / thread pooling / caching/ persistence / redundance / synchronisation technology, driving what has become popularly know as 'the cloud'.
I'd like people's views on current design I'm considering for a tornado app. Although I'm using mongoDB to store permanent information I currently have the session information as a python data structure that I've simply added within the Application object at initialisation.
I will need to perform some iteration and manipulation of the sessions while the server is running. I keep debating whether to move these to another mongoDB or just keep it as a python structure.
Is there anything wrong with keeping session information this way?
If you store session data in Python your apllication will:
loose it if you stop the Python process;
likely consume more memory as Python isn't very efficient in memory management (and you will have to store all the sessions in memory, not the ones you need right now).
If these are not problems for you you can go with Python structures. But usually these are serious concerns and most of the projects use some external storage for sessions.
I am creating an application (app A) in Python that listens on a port, receives NetFlow records, encapsulates them and securely sends them to another application (app B). App A also checks if the record was successfully sent. If not, it has to be saved. App A waits few seconds and then tries to send it again etc. This is the important part. If the sending was unsuccessful, records must be stored, but meanwhile many more records can arrive and they need to be stored too. The ideal way to do that is a queue. However I need this queue to be in file (on the disk). I found for example this code http://code.activestate.com/recipes/576642/ but it "On open, loads full file into memory" and that's exactly what I want to avoid. I must assume that this file with records will have up to couple of GBs.
So my question is, what would you recommend to store these records in? It needs to handle a lot of data, on the other hand it would be great if it wasn't too slow because during normal activity only one record is saved at a time and it's read and removed immediately. So the basic state is an empty queue. And it should be thread safe.
Should I use a database (dbm, sqlite3..) or something like pickle, shelve or something else?
I am a little consfused in this... thank you.
You can use Redis as a database for this. It is very very fast, does queuing amazingly well, and it can save its state to disk in a few manners, depending on the fault tolerance level you want. being an external process, you might not need to have it use a very strict saving policy, since if your program crashes, everything is saved externally.
see here http://redis.io/documentation , and if you want more detailed info on how to do this in redis, I'd be glad to elaborate.
I've done a couple of years of large-scale game server development in PHP. A load balancer delegates incoming requests to one server in a cluster. In the name of better performance, we began caching all static data (essentially the game world's model objects) on each of the instances in that cluster, directly in Apache shared memory, using apc_store and apc_fetch.
For a number of reasons, we're now beginning to develop a similar game framework in Python, using the Flask microframework. At first glance, this instance's memory store is the one piece that doesn't appear to translate directly to Python/Flask. We're presently considering running Memcached locally on each instance (to avoid streaming fairly large model objects over-the-wire from our main Memcached cluster.)
What can we use instead?
I would think that even in this case you might want to consider having a centralized key/value store system rather than a series of independent ones on each server. Unless your load balancer always redirects the same users to the same servers you could run into a case where a user's requests are routed to different servers each time so each node would have to retrieve the game state instead of accessing it from a shared cache.
Also the memory strain that a local key/value store on each system might incur could slow down your game server's other functions. Though that largely depends on the amount of data being cached.
In general the best approach would be to run some benchmarks to see what kind of performance you'd get with a memcached cluster and the types of objects you're storing vs local storage.
Depending on what other features you want from you key/value store you might also want to look into some alternatives like mongodb (http://www.mongodb.org/).
[Five-months later]
Our game framework is done.
In the end, we decided to store the static data in fully initialized sqlalchemy model instances in each web server. When a newly-booted game server is warming up, these instances are first constructed by hitting a shared MySQL db.
Since our Model factories defer to an instance pool, the Model instances need only be constructed once per deployment per server – this is important, because at our scale, MySQL would weep under any sort of ongoing load. We accomplished our goal of not streaming this data over the wire by keeping the item definitions as close to our app code as possible: in the app code itself.
I now realize that my original question was naive, because unlike in the LAMP stack, the Flask server keeps running between requests, the server's memory itself is "shared memory" – there's no need for something like APC to make it so. In fact, anything outside of the request processing scope it self and Flask's threadsafe local store, can be considered "shared memory".
I am using PostgreSQL 8.4. I really like the new unnest() and array_agg() features; it is about time they realize the dynamic processing potential of their Arrays!
Anyway, I am working on web server back ends that uses long Arrays a lot. Their will be two successive processes which will each occur on a different physical machine. Each such process is a light python application which ''manage'' SQL queries to the database on each of their machines as well as requests from the front ends.
The first process will generate an Array which will be buffered into an SQL Table. Each such generated Array is accessible via a Primary Key. When its done the first python app sends the key to the second python app. Then the second python app, which is running on a different machine, uses it to go get the referenced Array found in the first machine. It then sends it to it's own db for generating a final result.
The reason why I send a key is because I am hopping that this will make the two processes go faster. But really what I would like is for a way to have the second database send a query to the first database in the hope of minimizing serialization delay and such.
Any help/advice would be appreciated.
Thanks
Sounds like you want dblink from contrib. This allows some inter-db postgres communication. The pg docs are great and should provide the needed examples.
not sure I totally understand, but you've looked at notify/listen? http://www.postgresql.org/docs/8.1/static/sql-listen.html
I am thinking either listen/notify or something with a cache such as memcache. You would send the key to memcache and have the second python app retrieve it from there. You could even do it with listen/notify... e.g; send the key and notify your second app that the key is in memcache waiting to be retrieved.