Django / general caching question - python

One common pattern when caching with Django is to use the current site's ID in each cache key, in order to, in essence, namespace your keys. The problem I have is that I'd love to be able to delete all values in the cache under a namespace (e.g. delete all cache values for site 45 because they've made some fundamental change). The current patterns for dealing with this involve sending signals all over the place, etc. I've used the Site.id cache key example because it is a common pattern that others may recognize, but the way I'm using the cache for a custom multi-tenant application makes this problem even more pronounced. So my question: is there a cache back-end and/or pattern that works well for deleting objects in a namespaced way, or pseudo-namespaced way, that is not extraordinarily expensive (i.e., not looping through all possible cache keys for a given namespace, deleting each one)? I would prefer to use memcached, but am open to any solution that works well, plug-in or not.

It's generally difficult to delete large categories of keys. A better approach is for each site to have a generation number associated with it. Start the generation at 1. Use the generation number in the cache keys for that site. When you make a fundamental change, or any other time you want to invalidate the entire cache for the site, increment the site's generation number. Now all cache accesses will be misses, until everything is cached anew. All the old data will still be in the cache, but it will be discarded as it ages, and it isn't accessed any more.
This scheme is extremely efficient, since it doesn't require finding or touching all the old data at all. It can also be generalized to any class of cache contents; it doesn't have to be per-site.
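A minimal sketch of the generation-number scheme, assuming Django's low-level cache API; the key names and helpers here are illustrative, not a specific library:

from django.core.cache import cache

def site_generation(site_id):
    key = "site_gen:%s" % site_id
    gen = cache.get(key)
    if gen is None:
        # add() is a no-op if the key already exists, so races are safe.
        # timeout=None means "never expire" (Django 1.6+ semantics).
        cache.add(key, 1, timeout=None)
        gen = cache.get(key) or 1
    return gen

def site_key(site_id, key):
    # Embed the generation in every per-site key.
    return "site:%s:gen:%s:%s" % (site_id, site_generation(site_id), key)

def invalidate_site(site_id):
    # One increment orphans every key built with the old generation;
    # memcached evicts the stale entries as they age out.
    try:
        cache.incr("site_gen:%s" % site_id)
    except ValueError:  # generation key missing (e.g. evicted)
        cache.set("site_gen:%s" % site_id, 1, timeout=None)

# Usage:
#   cache.set(site_key(45, "homepage"), html)
#   cache.get(site_key(45, "homepage"))
#   invalidate_site(45)  # every cached value for site 45 is now effectively gone

One caveat: if the generation key itself is ever evicted, the generation restarts and old entries could briefly reappear, so give it a long (or no) timeout.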

I have a feeling that Django's cache versioning support, new in Django 1.3, is exactly what you're looking for.
It lets you set a site-wide VERSION on your cache configuration, or you can explicitly set the version when creating a record:
cache.set("mykey", True, version=2)
With this method, when you need to invalidate your cache, simply bump your VERSION and you're in the clear.
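For reference, the site-wide version lives in the CACHES setting (this config is a sketch; the backend and location are whatever you already use):

# settings.py
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
        "LOCATION": "127.0.0.1:11211",
        "VERSION": 2,  # bump to invalidate everything written under version 1
    }
}

# There's also cache.incr_version("mykey") for bumping a single key.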

Related

Python Django cache vs store in model field? Which is more efficient?

My view displays a table of data (specific to a customer, who may have many users), and this table takes a lot of computational resources to populate. A customer's data changes 4/5 times a week, usually on the same day.
Caching is an obvious solution to this, but I was wondering if Django's cache framework is significantly more efficient than creating a TextField at the customer level and storing the data there instead?
I feel it's easier to implement (and clear the TextField when the data changes), but what are the drawbacks, and is there anything else I need to look out for? (Problems if the dataset gets too big? Additional fields in the model? Etc.)
Any help would be much appreciated!
A cache is a cache is a cache, however you implement it, and the main problem with caches is invalidation.
As Melvyn rightly answered, the case for the cache framework is that it lives (well, can live, depending on which backend you choose) outside your database. Whether that's a pro or a con really depends on your database load, infrastructure and whatnot... If you already use the cache framework (for more than plain unconditional full-page caching, I mean) and want to minimize the load on your database, then it's probably worth the added complexity.
Else, storing your computed result in the db is quite straightforward and doesn't require additional servers, installs etc. I'd personally go for a dedicated model (to avoid unnecessary overhead at the db level), including both the cached result and a checksum of the params on which this result depends (the canonical memoization pattern), so you can easily detect whether it needs to be recomputed. I found this solution easier to maintain than trying to detect changes to each and every one of those params and invalidate/recompute the cache on the fly (which is what can make proper cache invalidation difficult, or at least complex to implement), but this again depends on what those params are and where they come from.
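A sketch of that dedicated-model approach; the model, field names, and compute() hook are illustrative, and it assumes the params are JSON-serializable:

import hashlib
import json

from django.db import models

class CachedReport(models.Model):
    # One row per customer: the computed result plus a checksum of its inputs.
    customer_id = models.IntegerField(unique=True)
    params_checksum = models.CharField(max_length=40, blank=True)
    result = models.TextField(blank=True)

def make_checksum(params):
    # Canonical JSON so identical params always hash identically.
    return hashlib.sha1(json.dumps(params, sort_keys=True).encode()).hexdigest()

def get_report(customer_id, params, compute):
    digest = make_checksum(params)
    row, _ = CachedReport.objects.get_or_create(customer_id=customer_id)
    if row.params_checksum != digest:  # inputs changed: recompute once
        row.result = compute(params)
        row.params_checksum = digest
        row.save()
    return row.result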
The upside to using the cache framework is that you don't have to use the database. You can scale your cache store independent of your database and run the cache on different physical (or virtual) machines.
In addition you don't have to implement the stale vs fresh logic, but that's a one-off.
4-5 times a week doesn't look like a big challenge, but nobody except you knows what kind of computation you have, how much data you need to store, how many users you have, and so on.
If you implement this with a TextField, it's still a caching system of sorts, so I suggest using Django's caching system with the database backend first: https://docs.djangoproject.com/en/1.11/topics/cache/#database-caching. You can't retrieve the data with one query as you could with a TextField, but you can later replace the database with another cache layer if necessary.
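From the linked docs, the database backend is just configuration plus one management command; the table name here is an arbitrary example:

# settings.py
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.db.DatabaseCache",
        "LOCATION": "my_cache_table",  # table name of your choosing
    }
}

# Then create the table once:
#   python manage.py createcachetable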

update existing cache data with newer items in django

I want to use caching in Django and I am stuck on how to go about it. I have data in some specific models which are write-intensive. Records get added to the model continuously. Each user has some specific data in the model, similar to an orders table.
Since my model is write-intensive, I am not sure how effective Django's caching framework is going to be. I tried Django's view-specific caching, and I am trying to develop a view where it first picks up data from the cache. Then I will have another call which brings in the data added to the model after the cache was populated. What I want to do is add the updated data to the original cached data and store it again.
It is like I don't want to expire my cache; I just want to keep adding to my existing cached data. Maybe once every 3 hours I can clear it.
Is what I am doing right? Are there better ways? Can I really add items to existing cached data?
I would be very grateful for your help.
You ask about "caching", which is a really broad topic, and the answer is always a mix of opinion, style and the specific app requirements. Here are a few points to consider.
If the data is per user, you can cache it per user:
from django.core.cache import cache

cache.set(request.user.id, "foo")
cache.get(request.user.id)
The common practice is to keep a database flag that tells you whether the user's data has changed since it was cached. Before you fetch the data from the cache, check only this flag in the DB. If the flag says nothing changed, get the data from the cache. If it did change, pull from the DB, replace the cache, and reset the flag.
The flag check should be fast and simple: one table, indexed by user.id, with a boolean flag field. This squeezes a lot of index rows into a single DB page and enables fast fetching of a single one-field row. You still get a persistent, up-to-date main storage that prevents the use of stale cache data. You can check this flag in a middleware.
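A minimal sketch of that flag pattern; the model, key format, and compute_fresh_data() hook are illustrative:

from django.core.cache import cache
from django.db import models

class UserDataFlag(models.Model):
    user_id = models.IntegerField(primary_key=True)  # one tiny indexed row per user
    is_stale = models.BooleanField(default=True)

def get_user_data(user_id, compute_fresh_data):
    flag, _ = UserDataFlag.objects.get_or_create(user_id=user_id)
    if not flag.is_stale:
        data = cache.get("user_data:%s" % user_id)
        if data is not None:
            return data
    # Stale or missing: recompute, refresh the cache, reset the flag.
    data = compute_fresh_data(user_id)
    cache.set("user_data:%s" % user_id, data)
    UserDataFlag.objects.filter(pk=user_id).update(is_stale=False)
    return data

# Writers only flip the flag instead of touching the cache:
#   UserDataFlag.objects.filter(pk=user_id).update(is_stale=True)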
You can run expiry in many ways: clear the cache when the user logs out, run a cron script that clears items, or let the cache backend expire items. If you check the flag before using the cache, there is no issue with keeping items in the cache except space, and caching backends handle that. If you use Django's simple file cache (which is easy, simple and zero-config), you will have to clear the cache yourself; a simple cron script will do.

Dynamically Created Top Articles List in Django?

I'm building a Django-powered version of my newspaper-ish site. The task that has proved least obvious and common-sense in getting the site together is how best to generate a "top articles" list for the sidebar of the page.
The first thing that came to mind was some sort of database column that is updated (based on what?) with every view. That seems (to my instincts) ridiculously database-intensive and impractical, so I'd like to find another solution.
Thanks all.
I would give celery a try (with django-celery). While it's not as easy to configure and use as the cache, it lets you queue tasks like incrementing counters and run them in the background. It can even be combined with the cache technique: in views, increment counters in the cache, and define a PeriodicTask that runs every now and then, resetting the counters and writing them to the database, along the lines of the sketch below.
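A sketch of that combination; the Article model, its views field, and the key scheme are illustrative, and it uses the old periodic_task decorator from the django-celery era (newer celery configures this via a beat schedule instead):

from datetime import timedelta

from celery.task import periodic_task
from django.core.cache import cache
from django.db.models import F

from myapp.models import Article  # hypothetical model with a views counter

def record_view(article_id):
    # Called from the view: a cheap cache increment, no DB write.
    key = "views:%s" % article_id
    if not cache.add(key, 1):  # add() fails if the key exists
        cache.incr(key)

@periodic_task(run_every=timedelta(minutes=10))
def flush_view_counters():
    for article_id in Article.objects.values_list("pk", flat=True):
        key = "views:%s" % article_id
        count = cache.get(key) or 0
        if count:
            Article.objects.filter(pk=article_id).update(views=F("views") + count)
            cache.decr(key, count)  # atomic, so concurrent views aren't lost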
I just remembered - I once found this blog entry, which provides a nice way of incrementing a 'viewed_count' (or similar) column in the database with an AJAX JS call. If you don't have heavy traffic, maybe it's a good idea?
Also mentioned in that post is django-tracking, but I don't know much about it; I've never used it myself (yet).
Premature optimization: first try the db way and then see if it really is too database-intensive. Any decent database has caches good enough that it probably won't matter very much. And even if it is a problem, take a look at the other db/cache suggestions here.
Most likely, by the way, you will run many more intensive db queries for each view than one simple view-count update.
If you do something like sorting by top views, it will be fast if you index the views column in the DB. Another option is to collect the top X articles only every hour or so, and toss that list into Django's cache framework.
The nice thing about caching the list is that the algorithm you use to determine top articles can be as complex as you like without hitting the DB hard on every page view. Django's cache framework can use memory, the db, or the file system. I prefer the DB, but many others prefer memory. I believe it uses pickle, so you can also store Python objects directly. It's easy to use; recommended.
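A sketch of that hourly list cache; the Article model and ranking query are stand-ins for whatever algorithm you pick:

from django.core.cache import cache

from myapp.models import Article  # hypothetical

def top_articles():
    articles = cache.get("top_articles")
    if articles is None:
        # The ranking can be arbitrarily expensive; it only runs on a miss.
        articles = list(Article.objects.order_by("-view_count")[:10])
        cache.set("top_articles", articles, 60 * 60)  # refresh at most hourly
    return articles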
An index wouldn't help, as the main problem, I believe, is not so much getting the sorted list as doing a DB write with every page view of an article. Another index actually makes that problem worse, albeit only a little.
So I'd go with the cache. I think Django's cache shim is a problem here because it requires timeouts on all keys. I'm not sure if that's imposed by memcached; if not, then go with redis. Actually, just go with redis anyway: the Python library is great, I've used it from Django projects before, and it has atomic increments and powerful sorting - everything you need.
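A sketch of the redis approach with a sorted set, using the redis-py client (3.x argument order); the key name is illustrative:

import redis

r = redis.Redis()

def record_view(article_id):
    # ZINCRBY is atomic, so concurrent page views never lose counts.
    r.zincrby("article_views", 1, article_id)

def top_articles(n=10):
    # Highest score first, returned as (article_id, views) pairs.
    return r.zrevrange("article_views", 0, n - 1, withscores=True)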

Where is the best place to put cache-evicting logic in an AppEngine application?

I've written an application for Google AppEngine, and I'd like to make use of the memcache API to cut down on per-request CPU time. I've profiled the application and found that a large chunk of the CPU time is in template rendering and API calls to the datastore, and after chatting with a co-worker I jumped (perhaps a bit early?) to the conclusion that caching a chunk of a page's rendered HTML would cut down on the CPU time per request significantly. The caching pattern is pretty clean, but the question of where to put this logic of caching and evicting is a bit of a mystery to me.
For example, imagine an application's main page has an Announcements section. This section would need to be re-rendered after:
first read for anyone in the account,
a new announcement being added, and
an old announcement being deleted
Some options for where to put the evict_announcements_section_from_cache() method call:
in the Announcement Model's .delete(), and .put() methods
in the RequestHandler's .post() method
anywhere else?
Then in the RequestHandler's .get() method, I could potentially call get_announcements_section(), which would follow the standard memcache pattern (check cache, add to cache on miss, return value) and pass that HTML down to the template for that chunk of the page.
Is it the typical design pattern to put the cache-evicting logic in the Model, or the Controller/RequestHandler, or somewhere else? Ideally I'd like to avoid having evicting logic with tentacles all over the code.
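Concretely, the read path described above might look like this; render_announcements() is a hypothetical helper standing in for whatever builds the HTML:

from google.appengine.api import memcache

ANNOUNCEMENTS_KEY = "announcements_html"

def get_announcements_section():
    html = memcache.get(ANNOUNCEMENTS_KEY)
    if html is None:  # miss: render once, cache, return
        html = render_announcements()  # hypothetical rendering helper
        memcache.set(ANNOUNCEMENTS_KEY, html)
    return html

def evict_announcements_section_from_cache():
    memcache.delete(ANNOUNCEMENTS_KEY)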
I've got just such a decorator up in an open source Github project:
http://github.com/jamslevy/gae_memoize/tree/master
It's a bit more in-depth, allowing for things like forcing execution of the function (when you want to refresh the cache) or forcing caching locally... These were just things I needed in my app, so I baked them into my memoize decorator.
A couple of alternatives to regular eviction:
The obvious one: don't evict, and set a timer instead. Even a really short one - a few seconds - can cut the work down enormously for a popular app, without users even noticing that data may be a few seconds stale.
Instead of evicting, generate the cache key from criteria that change when the data does. For example, if retrieving the key of the most recent announcement is cheap, you could use that as part of the key for the cached data. When a new announcement is posted, you go looking for a key that doesn't exist, and create a new entry as a result.
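A sketch of that key scheme, assuming an Announcement model with a created property; render_announcements() is again a hypothetical helper:

from google.appengine.api import memcache

def announcements_cache_key():
    # Cheap query for the newest announcement; a new post (or deleting the
    # newest one) changes the key, so the old cache entry simply goes cold.
    latest = Announcement.all().order("-created").get()
    return "announcements:%s" % (latest.key() if latest else "empty")

def get_announcements_section():
    key = announcements_cache_key()
    html = memcache.get(key)
    if html is None:
        html = render_announcements()
        memcache.set(key, html)
    return html

Note that deleting an older announcement wouldn't change this key, so you might fold a count into it as well.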

appengine: cached reference property?

How can I cache a Reference Property in Google App Engine?
For example, let's say I have the following models:
class Few(db.Model):
    year = db.IntegerProperty()

class Many(db.Model):
    few = db.ReferenceProperty(Few)
Then I create many Many's that point to only one Few:
one_few = Few.get_or_insert("2009", year=2009)  # get_or_insert needs a key name
Many(few=one_few).put()
Many(few=one_few).put()
Many(few=one_few).put()
Many(few=one_few).put()
Many(few=one_few).put()
Many(few=one_few).put()
Now, if I want to iterate over all the Many's, reading their few value, I would do this:
for many in Many.all().fetch(1000):
    print "%s" % many.few.year
The question is:
Will each access to many.few trigger a database lookup?
If yes, is it possible to cache somewhere, as only one lookup should be enough to bring the same entity every time?
As noted in one comment: I know about memcache, but I'm not sure how I can "inject it" when I'm calling the other entity through a reference.
In any case memcache wouldn't be useful, as I need caching within a single execution, not between executions. Using memcache wouldn't help optimize this call.
The first time you dereference any reference property, the entity is fetched - even if you'd previously fetched the same entity associated with a different reference property. This involves a datastore get operation, which isn't as expensive as a query, but is still worth avoiding if you can.
There's a good module that adds seamless caching of entities, available here. It works at a lower level of the datastore, and will cache all datastore gets, not just ReferenceProperty dereferences.
If you want to resolve a bunch of reference properties at once, there's another way: You can retrieve all the keys and fetch the entities in a single round trip, like so:
keys = [MyModel.ref.get_value_for_datastore(x) for x in referers]  # raw keys, no fetch
referees = db.get(keys)  # one batch round trip for all referenced entities
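Adapted to the question's Many/Few models, that turns the loop into two round trips (a sketch, using Python 2 print to match the question's code):

manys = Many.all().fetch(1000)
# Pull the raw keys off the reference property without dereferencing it.
few_keys = [Many.few.get_value_for_datastore(m) for m in manys]
fews = dict((f.key(), f) for f in db.get(few_keys) if f)

for m in manys:
    print "%s" % fews[Many.few.get_value_for_datastore(m)].year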
Finally, I've written a library that monkeypatches the db module to locally cache entities on a per-request basis (no memcache involved). It's available here. One warning, though: it has unit tests, but it's not widely used, so it could be broken.
The question is:
Will each access to many.few trigger a database lookup? Yes, though I'm not sure whether it's 1 or 2 calls.
If yes, is it possible to cache somewhere, as only one lookup should be enough to bring the same entity every time? You should be able to use memcache for this; it lives in the google.appengine.api.memcache package.
Details for memcache are in http://code.google.com/appengine/docs/python/memcache/usingmemcache.html
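For the cross-request case, a minimal sketch of memcaching just the dereferenced value, keyed by the reference's datastore key (str(key) gives a stable string key); few_year() is a hypothetical helper:

from google.appengine.api import memcache

def few_year(many):
    key = Many.few.get_value_for_datastore(many)
    year = memcache.get(str(key))
    if year is None:
        year = many.few.year  # one datastore get, only on a miss
        memcache.set(str(key), year)
    return year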
