I'm writing a reasonably complex web application. The Python backend runs an algorithm whose state depends on data stored in several interrelated database tables which does not change often, plus user specific data which does change often. The algorithm's per-user state undergoes many small changes as a user works with the application. This algorithm is used often during each user's work to make certain important decisions.
For performance reasons, re-initializing the state on every request from the (semi-normalized) database data quickly becomes non-feasible. It would be highly preferable, for example, to cache the state's Python object in some way so that it can simply be used and/or updated whenever necessary. However, since this is a web application, there several processes serving requests, so using a global variable is out of the question.
I've tried serializing the relevant object (via pickle) and saving the serialized data to the DB, and am now experimenting with caching the serialized data via memcached. However, this still has the significant overhead of serializing and deserializing the object often.
I've looked at shared memory solutions but the only relevant thing I've found is POSH. However POSH doesn't seem to be widely used and I don't feel easy integrating such an experimental component into my application.
I need some advice! This is my first shot at developing a web application, so I'm hoping this is a common enough issue that there are well-known solutions to such problems. At this point solutions which assume the Python back-end is running on a single server would be sufficient, but extra points for solutions which scale to multiple servers as well :)
Notes:
I have this application working, currently live and with active users. I started out without doing any premature optimization, and then optimized as needed. I've done the measuring and testing to make sure the above mentioned issue is the actual bottleneck. I'm sure pretty sure I could squeeze more performance out of the current setup, but I wanted to ask if there's a better way.
The setup itself is still a work in progress; assume that the system's architecture can be whatever suites your solution.
Be cautious of premature optimization.
Addition: The "Python backend runs an algorithm whose state..." is the session in the web framework. That's it. Let the Django framework maintain session state in cache. Period.
"The algorithm's per-user state undergoes many small changes as a user works with the application." Most web frameworks offer a cached session object. Often it is very high performance. See Django's session documentation for this.
Advice. [Revised]
It appears you have something that works. Leverage to learn your framework, learn the tools, and learn what knobs you can turn without breaking a sweat. Specifically, using session state.
Second, fiddle with caching, session management, and things that are easy to adjust, and see if you have enough speed. Find out whether MySQL socket or named pipe is faster by trying them out. These are the no-programming optimizations.
Third, measure performance to find your actual bottleneck. Be prepared to provide (and defend) the measurements as fine-grained enough to be useful and stable enough to providing meaningful comparison of alternatives.
For example, show the performance difference between persistent sessions and cached sessions.
I think that the multiprocessing framework has what might be applicable here - namely the shared ctypes module.
Multiprocessing is fairly new to Python, so it might have some oddities. I am not quite sure whether the solution works with processes not spawned via multiprocessing.
I think you can give ZODB a shot.
"A major feature of ZODB is transparency. You do not need to write any code to explicitly read or write your objects to or from a database. You just put your persistent objects into a container that works just like a Python dictionary. Everything inside this dictionary is saved in the database. This dictionary is said to be the "root" of the database. It's like a magic bag; any Python object that you put inside it becomes persistent."
Initailly it was a integral part of Zope, but lately a standalone package is also available.
It has the following limitation:
"Actually there are a few restrictions on what you can store in the ZODB. You can store any objects that can be "pickled" into a standard, cross-platform serial format. Objects like lists, dictionaries, and numbers can be pickled. Objects like files, sockets, and Python code objects, cannot be stored in the database because they cannot be pickled."
I have read it but haven't given it a shot myself though.
Other possible thing could be a in-memory sqlite db, that may speed up the process a bit - being an in-memory db, but still you would have to do the serialization stuff and all.
Note: In memory db is expensive on resources.
Here is a link: http://www.zope.org/Documentation/Articles/ZODB1
First of all your approach is not a common web development practice. Even multi threading is being used, web applications are designed to be able to run multi-processing environments, for both scalability and easier deployment .
If you need to just initialize a large object, and do not need to change later, you can do it easily by using a global variable that is initialized while your WSGI application is being created, or the module contains the object is being loaded etc, multi processing will do fine for you.
If you need to change the object and access it from every thread, you need to be sure your object is thread safe, use locks to ensure that. And use a single server context, a process. Any multi threading python server will serve you well, also FCGI is a good choice for this kind of design.
But, if multiple threads are accessing and changing your object the locks may have a really bad effect on your performance gain, which is likely to make all the benefits go away.
This is Durus, a persistent object system for applications written in the Python
programming language. Durus offers an easy way to use and maintain a consistent
collection of object instances used by one or more processes. Access and change of a
persistent instances is managed through a cached Connection instance which includes
commit() and abort() methods so that changes are transactional.
http://www.mems-exchange.org/software/durus/
I've used it before in some research code, where I wanted to persist the results of certain computations. I eventually switched to pytables as it met my needs better.
Another option is to review the requirement for state, it sounds like if the serialisation is the bottle neck then the object is very large. Do you really need an object that large?
I know in the Stackoverflow podcast 27 the reddit guys discuss what they use for state, so that maybe useful to listen to.
Related
# Context -- skip if you want to get right to the point
I've been building a rather complex web application in Python (Bottle/gevent/MongoDB). It is a RSVP system which allows several independent front-end instances with registration forms als well as back-end access with granular user permissions (those users are our clients). I now need to implement a flexible map-reduce engine to collect statistics on the registration data. A one-size-fits-all solution is impossible since the data gathered varies from instance to instance. I also want to keep this open for our more technically inclined clients.
# End of context
So I need to execute arbitrary strings of code (some kind of ad-hoc plugin - language doesn't matter) entered through a web interface. I've already learned that it's virtually impossible to properly sandbox Python, so that's no option.
As of now I've looked into Lua and found Lupa, Lunatic Python and Lupy, but all three of them allow access to parts of the Python runtime.
There's also PyExecJS and its various runtimes (V8, Node, SpiderMonkey), but I have no idea whether it poses any security risks.
Questions:
1. Does anyone know of another (more fitting) option?
2. To those familiar with any of the Lua bindings: Is it possible to make them completely safe without too much hassle?
3. To those familiar with PyExecJS: How secure is it? Also, what kind of performance should I expect for, say, calling a short mapping function 1000 times and then iterating over a 1000-item list?
Here are a few ways you can run untrusted code:
a docker container that runs the code, I would suggest checking codecube.io out, it does exactly what you want and you can learn more about the process here
using the libsandbox libraries but at the present time the documentation is pretty bad
PyPy’s sandboxing
Sneklang is strict subset of Python, that is safely evaluated in your provided scope.
It is limited by scope size, and by number of node evaluation steps and protects from infinite loops, stack overflows, and excessive memory usage.
There is an online sandbox as well: https://sneklang.functup.com
I've made this project specifically because I had the same requirements.
My Django application is extremely performance sensitive and all requests require access to the same data structure. How do I store the data structure in such a way that it is accessible to all the requests?
Background:
I'm currently using the cache backend. This is a bit slow because the DS is large and it has to be retrieved and unpickled each time.
I understand that HTTP interactions should be stateless and knowingly need to break this constraint. Nothing bad should happen because it's read-only right?
There are a several ways to deal with this issue:
Move the data structure out of Python completely (rather than loading it from a storage medium every time). For example, if your structure is conducive to it you could store it in Redis, MongoDB, Riak, or Neo4j. (As a bonus, you get the ability to query the data, if you need that ability).
Move the structure to a separate process and communicate with it using a pipe or queue.
Use a memory mapped file to share the data.
HTTP is stateless, but that doesn't mean that you can't preserve state between requests. You just have to do the work yourself (at the application level). The protocol doesn't do it for you. Ideally you avoid the state, as it makes it easier to scale horizontally, but not every application is easy to scale
Django, and probably the majority of web applications, use caching. Of course the efficacy of caching depends on how you use it, i.e. by storing the data retrieved most frequently.
Really useful and informative article on caching in Django here. Gives quantified speed improvements. Amazing how much faster you can get with a little tweaking.
I doubt this is even possible, but here is the problem and proposed solution (the feasibility of the proposed solution is the object of this question):
I have some "global data" that needs to be available for all requests. I'm persisting this data to Riak and using Redis as a caching layer for access speed (for now...). The data is split into about 30 logical chunks, each about 8 KB.
Each request is required to read 4 of these 8KB chunks, resulting in 32KB of data read in from Redis or Riak. This is in ADDITION to any request-specific data which would also need to be read (which is quite a bit).
Assuming even 3000 requests per second (this isn't a live server so I don't have real numbers, but 3000ps is a reasonable assumption, could be more), this means 96KBps of transfer from Redis or Riak in ADDITION to the already not-insignificant other calls being made from the application logic. Also, Python is parsing the JSON of these 8KB objects 3000 times every second.
All of this - especially Python having to repeatedly deserialize the data - seems like an utter waste, and a perfectly elegant solution would be to just have the deserialized data cached in an in-memory native object in Python, which I can refresh periodically as and when all this "static" data becomes stale. Once in a few minutes (or hours), instead of 3000 times per second.
But I don't know if this is even possible. You'd realistically need an "always running" application for it to cache any data in its memory. And I know this is not the case in the nginx+uwsgi+python combination (versus something like node) - python in-memory data will NOT be persisted across all requests to my knowledge, unless I'm terribly mistaken.
Unfortunately this is a system I have "inherited" and therefore can't make too many changes in terms of the base technology, nor am I knowledgeable enough of how the nginx+uwsgi+python combination works in terms of starting up Python processes and persisting Python in-memory data - which means I COULD be terribly mistaken with my assumption above!
So, direct advice on whether this solution would work + references to material that could help me understand how the nginx+uwsgi+python would work in terms of starting new processes and memory allocation, would help greatly.
P.S:
Have gone through some of the documentation for nginx, uwsgi etc but haven't fully understood the ramifications per my use-case yet. Hope to make some progress on that going forward now
If the in-memory thing COULD work out, I would chuck Redis, since I'm caching ONLY the static data I mentioned above, in it. This makes an in-process persistent in-memory Python cache even more attractive for me, reducing one moving part in the system and at least FOUR network round-trips per request.
What you're suggesting isn't directly feasible. Since new processes can be spun up and down outside of your control, there's no way to keep native Python data in memory.
However, there are a few ways around this.
Often, one level of key-value storage is all you need. And sometimes, having fixed-size buffers for values (which you can use directly as str/bytes/bytearray objects; anything else you need to struct in there or otherwise serialize) is all you need. In that case, uWSGI's built-in caching framework will take care of everything you need.
If you need more precise control, you can look at how the cache is implemented on top of SharedArea and do something customize. However, I wouldn't recommend that. It basically gives you the same kind of API you get with a file, and the only real advantages over just using a file are that the server will manage the file's lifetime; it works in all uWSGI-supported languages, even those that don't allow files; and it makes it easier to migrate your custom cache to a distributed (multi-computer) cache if you later need to. I don't think any of those are relevant to you.
Another way to get flat key-value storage, but without the fixed-size buffers, is with Python's stdlib anydbm. The key-value lookup is as pythonic as it gets: it looks just like a dict, except that it's backed up to an on-disk BDB (or similar) database, cached as appropriate in memory, instead of being stored in an in-memory hash table.
If you need to handle a few other simple types—anything that's blazingly fast to un/pickle, like ints—you may want to consider shelve.
If your structure is rigid enough, you can use key-value database for the top level, but access the values through a ctypes.Structure, or de/serialize with struct. But usually, if you can do that, you can also eliminate the top level, at which point your whole thing is just one big Structure or Array.
At that point, you can just use a plain file for storage—either mmap it (for ctypes), or just open and read it (for struct).
Or use multiprocessing's Shared ctypes Objects to access your Structure directly out of a shared memory area.
Meanwhile, if you don't actually need all of the cache data all the time, just bits and pieces every once in a while, that's exactly what databases are for. Again, anydbm, etc. may be all you need, but if you've got complex structure, draw up an ER diagram, turn it into a set of tables, and use something like MySQL.
"python in-memory data will NOT be persisted across all requests to my knowledge, unless I'm terribly mistaken."
you are mistaken.
the whole point of using uwsgi over, say, the CGI mechanism is to persist data across threads and save the overhead of initialization for each call. you must set processes = 1 in your .ini file, or, depending on how uwsgi is configured, it might launch more than 1 worker process on your behalf. log the env and look for 'wsgi.multiprocess': False and 'wsgi.multithread': True, and all uwsgi.core threads for the single worker should show the same data.
you can also see how many worker processes, and "core" threads under each, you have by using the built-in stats-server.
that's why uwsgi provides lock and unlock functions for manipulating data stores by multiple threads.
you can easily test this by adding a /status route in your app that just dumps a json representation of your global data object, and view it every so often after actions that update the store.
You said nothing about writing this data back, is it static? In this case, the solution is every simple, and I have no clue what is up with all the "it's not feasible" responses.
Uwsgi workers are always-running applications. So data absolutely gets persisted between requests. All you need to do is store stuff in a global variable, that is it. And remember it's per-worker, and workers do restart from time to time, so you need proper loading/invalidation strategies.
If the data is updated very rarely (rarely enough to restart the server when it does), you can save even more. Just create the objects during app construction. This way, they will be created exactly once, and then all the workers will fork off the master, and reuse the same data. Of course, it's copy-on-write, so if you update it, you will lose the memory benefits (same thing will happen if python decides to compact its memory during a gc run, so it's not super predictable).
I have never actually tried it myself, but could you possibly use uWSGI's SharedArea to accomplish what you're after?
Everybody in Django world seems to hate threadlocals(http://code.djangoproject.com/ticket/4280, http://code.djangoproject.com/wiki/CookBookThreadlocalsAndUser). I read Armin's essay on this(http://lucumr.pocoo.org/2006/7/10/why-i-cant-stand-threadlocal-and-others), but most of it hinges on threadlocals is bad because it is inelegant.
I have a scenario where theadlocals will make things significantly easier. (I have a app where people will have subdomains, so all the models need to have access to the current subdomain, and passing them from requests is not worth it, if the only problem with threadlocals is that they are inelegant, or make for brittle code.)
Also a lot of Java frameworks seem to be using threadlocals a lot, so how is their case different from Python/Django 's?
I avoid this sort of usage of threadlocals, because it introduces an implicit non-local coupling. I frequently use models in all kinds of non-HTTP-oriented ways (local management commands, data import/export, etc). If I access some threadlocals data in models.py, now I have to find some way to ensure that it is always populated whenever I use my models, and this could get quite ugly.
In my opinion, more explicit code is cleaner and more maintainable. If a model method requires a subdomain in order to operate, that fact should be made obvious by having the method accept that subdomain as a parameter.
If I absolutely could find no way around storing request data in threadlocals, I would at least implement wrapper methods in a separate module that access threadlocals and call the model methods with the needed data. This way the models.py remains self-contained and models can be used without the threadlocals coupling.
I don't think there is anything wrong with threadlocals - yes, it is a global variable, but besides that it's a normal tool. We use it just for this purpose (storing subdomain model in the context global to the current request from middleware) and it works perfectly.
So I say, use the right tool for the job, in this case threadlocals make your app much more elegant than passing subdomain model around in all the model methods (not mentioning the fact that it is even not always possible - when you are overriding django manager methods to limit queries by subdomain, you have no way to pass anything extra to get_query_set, for example - so threadlocals is the natural and only answer).
Also a lot of Java frameworks seem to be using threadlocals a lot, so how is their case different from Python/Django 's?
CPython's interpreter has a Global Interpreter Lock (GIL) which means that only one Python thread can be executed by the interpreter at any given time. It isn't clear to me that a Python interpreter implementation would necessarily need to use more than one operating system thread to achieve this, although in practice CPython does.
Java's main locking mechanism is via objects' monitor locks. This is a decentralized approach that allows the use of multiple concurrent threads on multi-core and or multi-processor CPUs, but also produces much more complicated synchronization issues for the programmer to deal with.
These synchronization issues only arise with "shared-mutable state". If the state isn't mutable, or as in the case of a ThreadLocal it isn't shared, then that is one less complicated problem for the Java programmer to solve.
A CPython programmer still has to deal with the possibility of race conditions, but some of the more esoteric Java problems (such as publication) are presumably solved by the interpreter.
A CPython programmer also has the option to code performance critical code in Python-callable C or C++ code where the GIL restriction does not apply. Technically a Java programmer has a similar option via JNI, but this is rightly or wrongly considered less acceptable in Java than in Python.
You want to use threadlocals when you're working with multiple threads and want to localize some objects to a specific thread, eg. having one database connection for each thread.
In your case, you want to use it more as a global context (if I understand you correctly), which is probably a bad idea. It will make your app a bit slower, more coupled and harder to test.
Why is passing it from request not worth it? Why don't you store it in session or user profile?
There difference with Java is that web development there is much more stateful than in Python/PERL/PHP/Ruby world so people are used to all kind of contexts and stuff like that. I don't think that is an advantage, but it does seem like it at the beginning.
I have found using ThreadLocal is an excellent way to implement Dependency Injection in a HTTP request/response environment (i.e. any webapp). You just set up a servlet filter to 'inject' the object you need into the thread on receiving the request and 'uninject' it on returning the response.
It's a smart man's DI without all the XML ugliness, without the MB of Spring Jars (not to mention its learning curve) and without all the cryptic repetitive #annotation nonsense and because it doesn't individually inject many object instances with the dependencies it's probably a heck of a lot faster and uses less memory.
It worked so well we opened sourced our exPOJO Filter that can inject a Hibernate session or a JDO PersistenceManager using ThreadLocal:
http://www.expojo.com
I'm working on a client class which needs to load data from a networked database. It's been suggested that adding a standard caching service to the client could improve it's performance.
I'd dearly like not to have to build my own caching class - it's well known that these provide common points of failure. It would be far better to use a class that somebody else has developed rather than spend a huge amount of my own time debugging a home-made caching system.
Java developers have this:
http://ehcache.sourceforge.net/
It's a general purpose high-performance caching class which can support all kinds of storage. It's got options for time-based expiry and other methods for garbage-collecting. It looks really good. Unfortunately I cannot find anything this good for Python.
So, can somebody suggest a cache-class that's ready for me to use. My wish-list is:
Ability to limit the number of objects in the cache.
Ability to limit the maximum age of objects in the cache.
LRU object expirey
Ability to select multiple forms of storage ( e.g. memory, disk )
Well debugged, well maintained, in use by at least one well-known application.
Good performance.
So, any suggestions?
UPDATE: I'm looking for LOCAL caching of objects. The server which I connect to is already heavily cached. Memcached is not appropriate because it requires an additional network traffic between the Windows client and the server.
I'd recommend using memcached and using cmemcache to access it. You can't necessarily limit the number of objects in the cache, but you can set an expiration time and limit the amount of memory it uses. And memcached is used by a lot of big names. In fact, I'd call it kind of the industry standard.
UPDATE:
I'm looking for LOCAL caching of objects.
You can run memcached locally and access it via localhost. I've done this a few times.
Other than that, the only solution that I can think of is django's caching system. It offers several backends and some other configuration options. But that may be a little bit heavyweight if you're not using django.
UPDATE 2: I suppose as a last resort, you can also use jython and access the java caching system. This may be a little difficult to do if you've already got clients using CPython though.
UPDATE 3: It's probably a bit late to be of use to you, but a previous employer of mine used ZODB for this kind of thing. It's an actual database, but its read performance is fast enough to make it useful for caching.