Is there a standard 3rd party Python caching class? - python

I'm working on a client class which needs to load data from a networked database. It's been suggested that adding a standard caching service to the client could improve its performance.
I'd dearly like not to have to build my own caching class - it's well known that home-made caches are a common source of subtle bugs. It would be far better to use a class that somebody else has developed rather than spend a huge amount of my own time debugging a home-made caching system.
Java developers have this:
http://ehcache.sourceforge.net/
It's a general purpose high-performance caching class which can support all kinds of storage. It's got options for time-based expiry and other methods for garbage-collecting. It looks really good. Unfortunately I cannot find anything this good for Python.
So, can somebody suggest a cache-class that's ready for me to use. My wish-list is:
Ability to limit the number of objects in the cache.
Ability to limit the maximum age of objects in the cache.
LRU object expiry
Ability to select multiple forms of storage (e.g. memory, disk)
Well debugged, well maintained, in use by at least one well-known application.
Good performance.
So, any suggestions?
UPDATE: I'm looking for LOCAL caching of objects. The server which I connect to is already heavily cached. Memcached is not appropriate because it requires additional network traffic between the Windows client and the server.

I'd recommend using memcached and using cmemcache to access it. You can't necessarily limit the number of objects in the cache, but you can set an expiration time and limit the amount of memory it uses. And memcached is used by a lot of big names. In fact, I'd call it kind of the industry standard.
UPDATE:
I'm looking for LOCAL caching of objects.
You can run memcached locally and access it via localhost. I've done this a few times.
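For the local case, here is a minimal cache-aside sketch against a memcached instance on localhost. It assumes the python-memcached client (import name memcache) rather than cmemcache, and the key format and expiry time are just illustrative:

```python
# Sketch: cache-aside against a local memcached, assuming the python-memcached
# client ("memcache"); adjust to cmemcache or another client as needed.
import memcache

mc = memcache.Client(['127.0.0.1:11211'])

def get_record(record_id, loader):
    """Return a cached record, falling back to the slow networked lookup."""
    key = 'record:%s' % record_id
    value = mc.get(key)
    if value is None:
        value = loader(record_id)      # hit the networked database
        mc.set(key, value, time=300)   # let memcached expire it after 5 minutes
    return value
```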
Other than that, the only solution that I can think of is Django's caching system. It offers several backends and some other configuration options. But that may be a little bit heavyweight if you're not using Django.
UPDATE 2: I suppose as a last resort, you can also use Jython and access the Java caching system. This may be a little difficult to do if you've already got clients using CPython though.
UPDATE 3: It's probably a bit late to be of use to you, but a previous employer of mine used ZODB for this kind of thing. It's an actual database, but its read performance is fast enough to make it useful for caching.

Related

Python - Multiprocessing and database entries

I'm working on a framework for Digital Forensic Investigators to use to compare files with each other for my Master's capstone project. However, I ran into a bit of a snag...
I'm trying to implement multiprocessing on the comparisons since using a single core seems to be really slow. The trouble I'm having, however, is when the code goes to enter information into an SQLite database. It will occasionally get a "Database is locked" error when two cores complete at nearly the same time.
So, simple side of my question, is it unsafe to operate database functions within a multiprocessing environment due to the errors I'm encountering? If not, is there a method of going about this that is safe and won't result in random errors?
Thanks!
Your problem is that you are trying to have multiple writers access a toy database -- i.e. sqlite -- which is stored in a single file. Using Lock might help, but it's going to kill your multiprocess throughput because of all the waiting-for-the-lock time. In essence, the lock choke point will serialize your program.
Setting up either MySQL or Postgres on almost any platform is straightforward, and there are several excellent Python modules for accessing them. Using one of those will completely eliminate this problem.
Update for an extended response to comment:
I always ask clients / students, "What problem are you trying to solve?" I'm assuming that you are not trying to create a database system, simply to use one. SQLite3 is fine for a well-defined set of problems, but multiprocess access is not one of them. I could veer off into asking what aspect of your project requires multiprocess access, but I'll assume that you have already determined that this is needed. I don't know either your programming skills or your understanding of how a database works, so forgive me if the following is a bit basic.
Normally you need a database (my preference is Postgres), and a Python module that understands all of the fiddly details of how to talk to that database. Then you need to know what it is you want the DBMS to do for you. The Good News is that you are hardly the first to go down this path.
The Postgres Wiki is full of good stuff. See their page on Python Drivers. Psycopg2 is the category leader and runs on Win/Linux/Mac. Also check out PyPi, the Python Package Index, for many well-written extensions.
If you want to stay more object-oriented, as opposed to writing straight SQL, you might want to look at an ORM like SQLAlchemy. This is another category leader that is well-maintained and widely deployed.
The value of using an ORM is that you can (mostly) keep your head in ObjectLand, where most of your problem lives, and not get tangled up in the cognitive dissonance created by object-oriented programming vs. relational database management, which are two very different views of the world of data.
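As a hedged sketch of what that looks like in practice, here is a small SQLAlchemy ORM model. The FileHash class, table name, and connection URL are made-up placeholders for the forensic comparison results, not anything from the original project:

```python
# Minimal SQLAlchemy ORM sketch; model, table, and URL are illustrative only.
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker  # sqlalchemy.ext.declarative on older versions

Base = declarative_base()

class FileHash(Base):
    __tablename__ = 'file_hashes'
    id = Column(Integer, primary_key=True)
    path = Column(String)
    sha256 = Column(String)

# Postgres copes with concurrent writers from multiple worker processes.
engine = create_engine('postgresql+psycopg2://user:secret@localhost/forensics')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

def record_hash(path, digest):
    session = Session()
    session.add(FileHash(path=path, sha256=digest))
    session.commit()
    session.close()
```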
If you need more help, email me. My address is in my profile.
You can make use of Lock. Take a look at https://docs.python.org/2/library/multiprocessing.html#synchronization-between-processes
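If you do stay on SQLite, a rough sketch of the Lock approach is below. The table and column names are hypothetical, and note that this serializes the writes, which is exactly the throughput cost described in the other answer:

```python
# Sketch: guard SQLite writes with a multiprocessing.Lock so only one worker
# process writes at a time. Table/columns are illustrative.
import multiprocessing
import sqlite3

DB_PATH = 'results.db'

def worker(lock, rows):
    with lock:   # only one process inserts at a time
        conn = sqlite3.connect(DB_PATH)
        conn.executemany('INSERT INTO results (file_a, file_b, score) VALUES (?, ?, ?)', rows)
        conn.commit()
        conn.close()

if __name__ == '__main__':
    conn = sqlite3.connect(DB_PATH)
    conn.execute('CREATE TABLE IF NOT EXISTS results (file_a TEXT, file_b TEXT, score REAL)')
    conn.commit()
    conn.close()

    lock = multiprocessing.Lock()
    chunks = [[('a.bin', 'b.bin', 0.91)], [('c.bin', 'd.bin', 0.42)]]
    procs = [multiprocessing.Process(target=worker, args=(lock, chunk)) for chunk in chunks]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```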

I need to run untrusted server-side code in a web app - what are my options?

# Context -- skip if you want to get right to the point
I've been building a rather complex web application in Python (Bottle/gevent/MongoDB). It is an RSVP system which allows several independent front-end instances with registration forms, as well as back-end access with granular user permissions (those users are our clients). I now need to implement a flexible map-reduce engine to collect statistics on the registration data. A one-size-fits-all solution is impossible since the data gathered varies from instance to instance. I also want to keep this open for our more technically inclined clients.
# End of context
So I need to execute arbitrary strings of code (some kind of ad-hoc plugin - language doesn't matter) entered through a web interface. I've already learned that it's virtually impossible to properly sandbox Python, so that's not an option.
As of now I've looked into Lua and found Lupa, Lunatic Python and Lupy, but all three of them allow access to parts of the Python runtime.
There's also PyExecJS and its various runtimes (V8, Node, SpiderMonkey), but I have no idea whether it poses any security risks.
Questions:
1. Does anyone know of another (more fitting) option?
2. To those familiar with any of the Lua bindings: Is it possible to make them completely safe without too much hassle?
3. To those familiar with PyExecJS: How secure is it? Also, what kind of performance should I expect for, say, calling a short mapping function 1000 times and then iterating over a 1000-item list?
Here are a few ways you can run untrusted code:
A Docker container that runs the code. I would suggest checking out codecube.io; it does exactly what you want, and you can learn more about the process here (a rough sketch of this approach follows this list).
Using the libsandbox libraries, though at the present time the documentation is pretty bad.
PyPy’s sandboxing
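Here is a rough sketch of the container approach: shell out to docker run with networking disabled and resource caps. The image name, limits, and timeout are placeholders, and this is not a complete hardening recipe:

```python
# Sketch: run an untrusted snippet in a throw-away container with no network
# and capped resources. Image and limits are illustrative.
import subprocess

def run_untrusted(code, timeout=5):
    cmd = [
        'docker', 'run', '--rm',
        '--network', 'none',    # no network access from inside the container
        '--memory', '128m',     # cap memory
        '--cpus', '0.5',        # cap CPU
        'python:3-alpine',
        'python', '-c', code,
    ]
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)

result = run_untrusted('print(sum(range(10)))')
print(result.stdout)   # "45"
```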
Sneklang is a strict subset of Python that is safely evaluated in your provided scope.
It is limited by scope size and by the number of node evaluation steps, and it protects against infinite loops, stack overflows, and excessive memory usage.
There is an online sandbox as well: https://sneklang.functup.com
I've made this project specifically because I had the same requirements.

How do I keep a shared data structure in-memory between Django Requests

My Django application is extremely performance sensitive and all requests require access to the same data structure. How do I store the data structure in such a way that it is accessible to all the requests?
Background:
I'm currently using the cache backend. This is a bit slow because the data structure is large and it has to be retrieved and unpickled each time.
I understand that HTTP interactions should be stateless, and I knowingly need to break this constraint. Nothing bad should happen because it's read-only, right?
There are several ways to deal with this issue:
Move the data structure out of Python completely (rather than loading it from a storage medium every time). For example, if your structure is conducive to it, you could store it in Redis, MongoDB, Riak, or Neo4j. (As a bonus, you get the ability to query the data, if you need that ability.) A small Redis sketch follows this list.
Move the structure to a separate process and communicate with it using a pipe or queue.
Use a memory mapped file to share the data.
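For the first option, here is a hedged sketch using redis-py: keep the structure in Redis as a hash so each request reads only the fields it needs, instead of unpickling the whole blob. The key and field names are made up for illustration:

```python
# Sketch: store the shared structure as a Redis hash; key/field names are
# illustrative only.
import json
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# Populate once (e.g. from a management command), one hash field per entry.
r.hset('big_structure', 'entry:42', json.dumps({'score': 7, 'label': 'foo'}))

# Each request reads only what it needs instead of unpickling the whole thing.
def get_entry(entry_id):
    raw = r.hget('big_structure', 'entry:%s' % entry_id)
    return None if raw is None else json.loads(raw)
```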
HTTP is stateless, but that doesn't mean that you can't preserve state between requests. You just have to do the work yourself (at the application level); the protocol doesn't do it for you. Ideally you avoid the state, as it makes it easier to scale horizontally, but not every application is easy to scale.
Django, and probably the majority of web applications, use caching. Of course the efficacy of caching depends on how you use it, i.e. by storing the data retrieved most frequently.
Really useful and informative article on caching in Django here. Gives quantified speed improvements. Amazing how much faster you can get with a little tweaking.
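As a concrete illustration of Django's cache framework, here is a minimal sketch with the local-memory backend. Note that LocMemCache is per-process, so each worker keeps its own copy; the cache key and timeout are illustrative:

```python
# settings.py -- a purely local, in-process cache backend (per-process copy)
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.locmem.LocMemCache',
    }
}

# views.py (or anywhere else)
from django.core.cache import cache

def expensive_lookup(key, compute):
    value = cache.get(key)
    if value is None:
        value = compute()                   # the slow path
        cache.set(key, value, timeout=600)  # keep for 10 minutes
    return value
```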

Comparing persistent storage solutions in python

I'm starting on a new scientific project which has a lot of data (millions of entries) I'd like to store in an easily and quickly accessible format. I've come across a number of different potential options, but I'm not sure how to pick amongst them. My data can probably just be stored as a dictionary, or potentially a dictionary of dictionaries. Some potential considerations:
Speed. I can't load all the data off disk every time I start a new script, and I'd like as quick access to random entries as possible.
Ease-of-use. This is python. The storage should feel like python.
Stability/maturity. I'd like something that's currently supported, although something that works well but is still in development would be fine.
Ease of installation. My sysadmin should be able to get this running on our cluster.
I don't really care that much about the size of the storage, but it could be a consideration if an option is really terrible on this front. Also, if it matters, I'll most likely be creating the database once, and thereafter only reading from it.
Some potential options that I've started looking at (see this post):
pyTables
ZopeDB
shove
shelve
redis
durus
Any suggestions on which of these might be better for my purposes? Any better ideas? Some of these have a back-end; any suggestions on which file-system back-end would be best?
Might want to give mongodb a shot - the PyMongo library works with dictionaries and supports most Python types. Easy to install, very performant + scalable. MongoDB (and PyMongo) is also used in production at some big names.
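A small PyMongo sketch of the dictionary-like feel mentioned above; the database, collection, and document fields are placeholders, and the calls use the current PyMongo API:

```python
# Sketch: dictionaries in and out of MongoDB via PyMongo; names are illustrative.
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
entries = client.science_db.entries

# Store plain dictionaries...
entries.insert_one({'_id': 'sample-0001', 'temperature': 271.3, 'tags': ['raw']})

# ...and get fast random access back by key.
doc = entries.find_one({'_id': 'sample-0001'})
print(doc['temperature'])
```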
An RDBMS.
Nothing is more reliable than using tables on a well-known RDBMS. PostgreSQL comes to mind.
That automatically gives you some choices for the future like clustering. Also you automatically have a lot of tools to administer your database, and you can use it from other software written in virtually any language.
It is really fast.
In the "feel like python" point, I might add that you can use an ORM. A strong name is sqlalchemy. Maybe with the elixir "extension".
Using SQLAlchemy you can let your user/sysadmin choose which database backend they want to use. Maybe they already have MySQL installed - no problem.
RDBMSs are still the best choice for data storage.
I'm working on such a project and I'm using SQLite.
SQLite stores everything in one file and is part of Python's standard library. Hence, installation and configuration is virtually for free (ease of installation).
You can easily manage the database file with small Python scripts or via various tools. There is also a Firefox plugin (ease of installation / ease-of-use).
I find it very convenient to use SQL to filter/sort/manipulate/... the data. Although, I'm not an SQL expert. (ease-of-use)
I'm not sure whether SQLite is the fastest DB system for this kind of work, and it lacks some features you might need, e.g. stored procedures.
Anyway, SQLite works for me.
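Here is a quick sketch of that approach using the standard-library sqlite3 module; the file, table, and key names are illustrative:

```python
# Sketch: simple key/value-style storage in SQLite; names are illustrative.
import sqlite3

conn = sqlite3.connect('experiment.db')
conn.execute('CREATE TABLE IF NOT EXISTS entries (key TEXT PRIMARY KEY, value REAL)')
conn.executemany('INSERT OR REPLACE INTO entries VALUES (?, ?)',
                 [('run1/alpha', 0.25), ('run1/beta', 1.75)])
conn.commit()

# Random access to single entries without loading the whole dataset.
row = conn.execute('SELECT value FROM entries WHERE key = ?', ('run1/beta',)).fetchone()
print(row[0])
conn.close()
```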
If you really just need dictionary-like storage, some of the new key/value or column stores like Cassandra or MongoDB might provide a lot more speed than you'd get with a relational database. Of course, if you decide to go with an RDBMS, SQLAlchemy is the way to go (disclaimer: I am its creator), but your desired feature list seems to lean in the direction of "I just want a dictionary that feels like Python" - if you aren't interested in relational queries or strong ACIDity, those facets of an RDBMS will probably feel cumbersome.
SQLite -- it comes with Python, is fast, widely available, and easy to maintain.
If you only need simple (dict like) access mechanisms and need efficiency for processing a lot of data, then HDF5 might be a good option. If you are going to be using numpy then it is really worth considering.
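A minimal HDF5 sketch via h5py (the question mentions PyTables; using h5py here is an assumption on my part). The file name, dataset path, and sizes are made up:

```python
# Sketch: write a numpy array to HDF5 once, then slice it from disk later.
# Uses h5py; PyTables offers similar functionality with a different API.
import numpy as np
import h5py

# Write once...
with h5py.File('experiment.h5', 'w') as f:
    f.create_dataset('run1/measurements', data=np.random.rand(1_000_000))

# ...then later scripts slice straight from disk without loading everything.
with h5py.File('experiment.h5', 'r') as f:
    print(f['run1/measurements'][5000:5010])
```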
Go with an RDBMS; it is reliable, scalable, and fast.
If you need a more scalable solution and don't need the features of an RDBMS, you can go with a key-value store like CouchDB, which has a good Python API.
The NEMO collaboration (building a cosmic neutrino detector underwater) had many of the same problems, and they used MySQL and PostgreSQL without major problems.
It really depends on what you're trying to do. An RDBMS is designed for relational data, so if your data is relational, then use one of the various SQL options. But it sounds like your data is more oriented towards a key-value store with very fast random GET operations. If that's the case, compare the benchmarks of the various key-stores, focusing on the GET speed. The ideal key-value store will keep or cache requests in memory, and be able to handle many GET requests concurrently. You may actually want to create your own benchmark suite so you can effectively compare random concurrent GET operations.
Why do you need a cluster? Is the size of each value very large? If not, you shouldn't need a cluster to handle storage of a million entries. But if you're storing large blobs of data, that matters, and you may need something that easily supports read slaves and/or transparent partitioning. Some of the key-value stores are document oriented and/or optimized for storing larger values. Redis is technically more storage efficient for larger values due to the indexing overhead required for fast GETs, but that doesn't necessarily mean it's slower. In fact, the extra indexing makes lookups faster.
You're the only one that can truly answer this question, and I strongly recommend putting together a custom benchmark suite to test available options with actual usage scenarios. The data you get from that will give you more insight than anything else.
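A bare-bones skeleton for such a benchmark is sketched below: it times random GETs against anything exposing a get(key) callable. It is single-threaded; the concurrent variant suggested above would wrap this in threads or processes. All names and sizes are illustrative:

```python
# Sketch: time random GETs against any store with a get(key) callable.
import random
import time

def benchmark_random_gets(get, keys, n=10_000):
    sample = [random.choice(keys) for _ in range(n)]
    start = time.perf_counter()
    for key in sample:
        get(key)
    return n / (time.perf_counter() - start)   # GETs per second

# Baseline: a plain in-process dict. Swap in Redis/memcached/SQLite clients to compare.
store = {i: i * i for i in range(100_000)}
print('%.0f GETs/sec' % benchmark_random_gets(store.get, list(store)))
```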

How would one make Python objects persistent in a web-app?

I'm writing a reasonably complex web application. The Python backend runs an algorithm whose state depends on data stored in several interrelated database tables, which does not change often, plus user-specific data which does change often. The algorithm's per-user state undergoes many small changes as a user works with the application. This algorithm is used often during each user's work to make certain important decisions.
For performance reasons, re-initializing the state on every request from the (semi-normalized) database data quickly becomes non-feasible. It would be highly preferable, for example, to cache the state's Python object in some way so that it can simply be used and/or updated whenever necessary. However, since this is a web application, there are several processes serving requests, so using a global variable is out of the question.
I've tried serializing the relevant object (via pickle) and saving the serialized data to the DB, and am now experimenting with caching the serialized data via memcached. However, this still has the significant overhead of serializing and deserializing the object often.
I've looked at shared memory solutions but the only relevant thing I've found is POSH. However POSH doesn't seem to be widely used and I don't feel easy integrating such an experimental component into my application.
I need some advice! This is my first shot at developing a web application, so I'm hoping this is a common enough issue that there are well-known solutions to such problems. At this point solutions which assume the Python back-end is running on a single server would be sufficient, but extra points for solutions which scale to multiple servers as well :)
Notes:
I have this application working, currently live and with active users. I started out without doing any premature optimization, and then optimized as needed. I've done the measuring and testing to make sure the above mentioned issue is the actual bottleneck. I'm pretty sure I could squeeze more performance out of the current setup, but I wanted to ask if there's a better way.
The setup itself is still a work in progress; assume that the system's architecture can be whatever suits your solution.
Be cautious of premature optimization.
Addition: The "Python backend runs an algorithm whose state..." is the session in the web framework. That's it. Let the Django framework maintain session state in cache. Period.
"The algorithm's per-user state undergoes many small changes as a user works with the application." Most web frameworks offer a cached session object. Often it is very high performance. See Django's session documentation for this.
Advice. [Revised]
It appears you have something that works. Leverage it to learn your framework, learn the tools, and learn what knobs you can turn without breaking a sweat. Specifically, using session state.
Second, fiddle with caching, session management, and things that are easy to adjust, and see if you have enough speed. Find out whether MySQL socket or named pipe is faster by trying them out. These are the no-programming optimizations.
Third, measure performance to find your actual bottleneck. Be prepared to provide (and defend) measurements that are fine-grained enough to be useful and stable enough to provide a meaningful comparison of alternatives.
For example, show the performance difference between persistent sessions and cached sessions.
I think that the multiprocessing framework has what might be applicable here - namely the shared ctypes module.
Multiprocessing is fairly new to Python, so it might have some oddities. I am not quite sure whether the solution works with processes not spawned via multiprocessing.
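A rough sketch of the shared-ctypes idea: a read-mostly array placed in shared memory so that worker processes see it without pickling on each access. As noted above, this only helps for processes started via multiprocessing; the values are illustrative:

```python
# Sketch: share a ctypes-backed array between processes via multiprocessing.
import multiprocessing

def worker(shared, index):
    # The child reads straight from shared memory; no pickling per access.
    print(shared[index])

if __name__ == '__main__':
    shared = multiprocessing.Array('d', [1.5, 2.5, 3.5, 4.5])   # 'd' = C double
    p = multiprocessing.Process(target=worker, args=(shared, 2))
    p.start()
    p.join()
```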
I think you can give ZODB a shot.
"A major feature of ZODB is transparency. You do not need to write any code to explicitly read or write your objects to or from a database. You just put your persistent objects into a container that works just like a Python dictionary. Everything inside this dictionary is saved in the database. This dictionary is said to be the "root" of the database. It's like a magic bag; any Python object that you put inside it becomes persistent."
Initially it was an integral part of Zope, but lately a standalone package has also become available.
It has the following limitation:
"Actually there are a few restrictions on what you can store in the ZODB. You can store any objects that can be "pickled" into a standard, cross-platform serial format. Objects like lists, dictionaries, and numbers can be pickled. Objects like files, sockets, and Python code objects, cannot be stored in the database because they cannot be pickled."
I have read it but haven't given it a shot myself though.
Another possibility would be an in-memory SQLite db. That may speed up the process a bit - being an in-memory db - but you would still have to do the serialization stuff and all.
Note: an in-memory db is expensive on resources.
Here is a link: http://www.zope.org/Documentation/Articles/ZODB1
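Based on the quoted description, a minimal ZODB sketch looks roughly like this; the storage file name and the keys stored under the root are illustrative:

```python
# Sketch: the ZODB root mapping behaves like a dictionary whose contents
# persist on disk; file name and keys are illustrative.
import transaction
from ZODB import FileStorage, DB

storage = FileStorage.FileStorage('appstate.fs')
db = DB(storage)
connection = db.open()
root = connection.root()          # behaves like a persistent dictionary

root['user_state'] = {'user42': {'step': 3, 'score': 0.8}}   # any picklable object
transaction.commit()

connection.close()
db.close()
```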
First of all, your approach is not a common web development practice. Even when multithreading is used, web applications are designed to be able to run in multi-process environments, for both scalability and easier deployment.
If you just need to initialize a large object and do not need to change it later, you can do it easily by using a global variable that is initialized while your WSGI application is being created, or when the module containing the object is loaded; multiprocessing will do fine for you.
If you need to change the object and access it from every thread, you need to be sure your object is thread-safe; use locks to ensure that. And use a single server context, i.e. a single process. Any multithreaded Python server will serve you well; FCGI is also a good choice for this kind of design.
But, if multiple threads are accessing and changing your object the locks may have a really bad effect on your performance gain, which is likely to make all the benefits go away.
This is Durus, a persistent object system for applications written in the Python programming language. Durus offers an easy way to use and maintain a consistent collection of object instances used by one or more processes. Access and change of persistent instances is managed through a cached Connection instance which includes commit() and abort() methods so that changes are transactional.
http://www.mems-exchange.org/software/durus/
I've used it before in some research code, where I wanted to persist the results of certain computations. I eventually switched to pytables as it met my needs better.
Another option is to review the requirement for state; it sounds like, if serialisation is the bottleneck, then the object is very large. Do you really need an object that large?
I know that in Stack Overflow podcast 27 the reddit guys discuss what they use for state, so that may be useful to listen to.
