How can I cache a Reference Property in Google App Engine?
For example, let's say I have the following models:
class Few(db.Model):
    year = db.IntegerProperty()

class Many(db.Model):
    few = db.ReferenceProperty(Few)  # Few must be defined before Many references it
Then I create many Many's that point to only one Few:
one_few = Few.get_or_insert('2009', year=2009)  # get_or_insert needs a key_name
Many.get_or_insert('m1', few=one_few)
Many.get_or_insert('m2', few=one_few)
Many.get_or_insert('m3', few=one_few)
Many.get_or_insert('m4', few=one_few)
Many.get_or_insert('m5', few=one_few)
Many.get_or_insert('m6', few=one_few)
Now, if I want to iterate over all the Many's, reading their few value, I would do this:
for many in Many.all().fetch(1000):
    print "%s" % many.few.year
The question is:
Will each access to many.few trigger a database lookup?
If yes, is it possible to cache somewhere, as only one lookup should be enough to bring the same entity every time?
As noted in one comment: I know about memcache, but I'm not sure how I can "inject it" when I'm calling the other entity through a reference.
In any case memcache wouldn't be useful, as I need caching within a single execution, not between executions; memcache wouldn't help optimize this call.
The first time you dereference any reference property, the entity is fetched - even if you'd previously fetched the same entity associated with a different reference property. This involves a datastore get operation, which isn't as expensive as a query, but is still worth avoiding if you can.
There's a good module that adds seamless caching of entities available here. It works at a lower level of the datastore, and will cache all datastore gets, not just dereferencing ReferenceProperties.
If you want to resolve a bunch of reference properties at once, there's another way: You can retrieve all the keys and fetch the entities in a single round trip, like so:
keys = [MyModel.ref.get_value_for_datastore(x) for x in referers]
referees = db.get(keys)
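Those two lines fetch the referenced entities in one batch, but they don't attach the results back to the referers. A sketch of the full pattern (essentially the prefetching recipe from Nick Johnson's blog post linked further down; it assumes every referer actually has the property set) might look like:

def prefetch_refprops(entities, *props):
    # Pair every entity with every reference property to resolve.
    fields = [(entity, prop) for entity in entities for prop in props]
    # get_value_for_datastore returns the raw key without fetching.
    ref_keys = [prop.get_value_for_datastore(x) for x, prop in fields]
    # One batch get for all referenced entities.
    ref_entities = dict((x.key(), x) for x in db.get(set(ref_keys)))
    # Write the resolved entities back so later dereferences are free.
    for (entity, prop), ref_key in zip(fields, ref_keys):
        prop.__set__(entity, ref_entities[ref_key])
    return entities

manys = Many.all().fetch(1000)
prefetch_refprops(manys, Many.few)
for many in manys:
    print "%s" % many.few.year  # no per-entity datastore gets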
Finally, I've written a library that monkeypatches the db module to locally cache entities on a per-request basis (no memcache involved). It's available here. One warning, though: it's got unit tests, but it's not widely used, so it could be broken.
The question is:
Will each access to many.few trigger a database lookup? Yes, although I'm not sure whether it's 1 or 2 calls.
If yes, is it possible to cache somewhere, as only one lookup should be enough to bring the same entity every time? You should be able to use the memcache repository to do this. This is in the google.appengine.api.memcache package.
Details for memcache are in http://code.google.com/appengine/docs/python/memcache/usingmemcache.html
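As a rough sketch of that idea (reusing the models from the question): look up the referenced key without dereferencing, and keep the entity itself in memcache:

from google.appengine.api import memcache
from google.appengine.ext import db

for many in Many.all().fetch(1000):
    # get_value_for_datastore returns the key without fetching the entity.
    few_key = Many.few.get_value_for_datastore(many)
    few = memcache.get(str(few_key))
    if few is None:
        few = db.get(few_key)
        memcache.set(str(few_key), few)
    print "%s" % few.year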
Related
I'm finally upgrading from db to ndb (it is a much bigger headache than I anticipated...).
I used a lot of ReferenceProperty and I've converted these to KeyProperty. Now, every place where I used a ReferenceProperty I need to add an explicit get because it was previously done for me automatically.
My question relates to whether I should restructure my code to make it more efficient. I have many methods that use a KeyProperty and I need to do an explicit get(). I'm wondering if I should change these methods so that I am passing the entity to them instead of using the KeyProperty and a get().
Is the automatic memcaching for ndb good enough that I don't need to restructure? Or should I restructure my code to avoid repeated gets of the same entity?
We are not looking at huge inefficiencies here. But for a single HTTP GET/POST I might be getting the same entity 3-5 times.
In your case the In-Context Cache will take over and save you the db calls:
The in-context cache is fast; this cache lives in memory. When an NDB function writes to the Datastore, it also writes to the in-context cache. When an NDB function reads an entity, it checks the in-context cache first. If the entity is found there, no Datastore interaction takes place.
Each request will get a new context.
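For example (a sketch; the model and property names are made up), repeated gets of the same key within one request should be served from the context cache:

from google.appengine.ext import ndb

class Author(ndb.Model):
    name = ndb.StringProperty()

class Book(ndb.Model):
    author_key = ndb.KeyProperty(kind=Author)

def handle_request(book):
    a1 = book.author_key.get()  # first get: datastore (or memcache)
    a2 = book.author_key.get()  # same request: in-context cache, no RPC
    assert a1 is a2             # the same cached instance comes back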
ReferenceProperty was very helpful in handling references between two models. For example:
class UserProf(db.Model):
    name = db.StringProperty(required=True)

class Team(db.Model):
    manager_name = db.ReferenceProperty(UserProf, collection_name='teams')
    name = db.StringProperty(required=True)
To get the manager from a team instance, we use team_ins.manager_name.
To get the 'teams' managed by a particular user instance, we use user_instance.teams and iterate over them.
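For example:

manager = team_ins.manager_name   # dereferences automatically
for team in user_instance.teams:  # back-reference query
    print "team name:", team.name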
Doesn't it look easy and understandable?
To do the same thing using NDB, we have to change:
db.ReferenceProperty(UserProf, collection_name='teams') --> ndb.KeyProperty(kind=UserProf)
Now team_ins.manager_name.get() would give you the manager.
To get all the teams managed by a particular user, we have to do:

for team in Team.query(Team.manager_name == user_ins.key):
    print "team name:", team.name

As you can see, handling these kinds of scenarios looks easier and more readable in db than in ndb.
What is the reason for removing ReferenceProperty in ndb?
Even db's user_instance.teams query does the same thing as ndb's for loop; it's just that in ndb we have to write the query out explicitly.
What is happening behind the scenes when we do user_instance.teams?
Thanks in advance.
Tim explained it well. We found that a common anti-pattern was using reference properties and loading them one at a time, because the notation "entity.property1.property2" doesn't make it clear that the first dot causes a database "get" operation. So we made it more obvious by forcing you to write "entity.property1.get().property2", and we made it easier to do batch prefetching (without the complex solution from Nick's blog) by simply writing "entity.property1.get_async()" for a bunch of entities. This queues a single batch get operation without blocking for the result, and when you next reference any of these properties using "entity.property1.get().property2", it won't start another get operation but just waits for that batch get to complete (and the second time you do this, the batch get is already complete). Also, this way in-process and memcache integration comes for free.
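As a sketch of the idiom Guido describes (using the Team/manager_name names from the question above):

teams = Team.query().fetch(100)
# Queue one batched get for all the managers, without blocking.
futures = [team.manager_name.get_async() for team in teams]
for team in teams:
    # Each .get() now just waits on the already-queued batch.
    manager = team.manager_name.get()
    print "team:", team.name, "manager:", manager.name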
I don't know the answer as to why Guido didn't implement ReferenceProperty.
However, I spent a lot of time using prefetch_refprops (http://blog.notdot.net/2010/01/ReferenceProperty-prefetching-in-App-Engine), which prefetches all of the reference properties by grabbing all the keys with get_value_for_datastore and then doing a get_multi on the keys.
This was vastly more efficient.
Also, if the referenced object doesn't exist, you would get an error when trying to fetch it.
If you pickled an object which had references, you ended up pickling a lot more than you probably planned to.
So I found that, except for the one case where you have a single entity and want to grab the referenced object with a .name-style accessor, you had to jump through all sorts of hoops to prevent the referenced entity from being fetched.
One common pattern when caching with Django is to use the current site's ID in each cache key, in order to, in essence, namespace your keys. The problem I have is that I'd love to be able to delete all values in cache under a namespace (e.g. delete all cache values for site 45 because they've made some fundamental change). The current pattern for dealing with this sends signals all over the place, etc.

I've used the Site.id cache-key example because that is a common pattern others may recognize, but the way I'm using cache for a custom multi-tenant application makes this problem even more profound. So my question: is there a cache backend and/or pattern that works well for deleting objects in a namespaced way, or pseudo-namespaced way, that is not extraordinarily expensive (i.e., not looping through all possible cache keys for a given namespace, deleting the cache for each)? I would prefer to use memcached, but am open to any solution that works well, plug-in or not.
It's generally difficult to delete large categories of keys. A better approach is for each site to have a generation number associated with it. Start the generation at 1. Use the generation number in the cache keys for that site. When you make a fundamental change, or any other time you want to invalidate the entire cache for the site, increment the site's generation number. Now all cache accesses will be misses, until everything is cached anew. All the old data will still be in the cache, but it will be discarded as it ages, and it isn't accessed any more.
This scheme is extremely efficient, since it doesn't require finding or touching all the old data at all. It can also be generalized to any class of cache contents, it doesn't have to be per-site.
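A sketch of the scheme using Django's cache API (the key names here are made up):

from django.core.cache import cache

def site_generation(site_id):
    gen = cache.get('site-generation:%d' % site_id)
    if gen is None:
        # add() is atomic, so concurrent initializers won't clobber each other.
        cache.add('site-generation:%d' % site_id, 1)
        gen = cache.get('site-generation:%d' % site_id)
    return gen

def make_key(site_id, key):
    # Embed the generation so bumping it orphans every old key at once.
    return 'site:%d:gen:%d:%s' % (site_id, site_generation(site_id), key)

def invalidate_site(site_id):
    try:
        cache.incr('site-generation:%d' % site_id)
    except ValueError:
        # The key expired or was never set; start a fresh generation.
        cache.set('site-generation:%d' % site_id, 1)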
I have a feeling that Django's cache versioning support, new in Django 1.3, does exactly this.
It allows you to set a cache-wide VERSION (in your CACHES setting), or you can explicitly set the version when creating a record:
cache.set("mykey", True, version=2)
With this method, when you need to invalidate your cache, simply bump your VERSION and you're in the clear.
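For example, assuming Django 1.3 or later:

from django.core.cache import cache

cache.set('mykey', True, version=2)
cache.get('mykey', version=2)  # True
cache.get('mykey', version=3)  # None: a different version is a miss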
I am struggling to find a good tutorial or best practices document for the use of memcache in app engine.
I'm pretty happy with it at the level presented in the docs (get an object by ID, checking memcache first), but I'm unclear on things like:
If you cache a query, is there an accepted method for ensuring that the cache is cleared/updated when an object stored in that query is updated?
What are the effects of using ReferenceProperties? If I cache a Foo object with a Bar reference, is my foo.bar in memcache too, and in need of clearing if it gets updated from some other part of my application?
I don't expect answers to this here (unless you are feeling particularly generous!), but pointers to things I could read would be very gratefully received.
If you cache a query, is there an accepted method for ensuring that the cache is cleared/updated when an object stored in that query is updated?
Typically you wrap your reads with a conditional that falls back to the main DB if the value isn't in the cache, and you wrap your updates as well, so the cache is refilled whenever you write the data. That's if you need the results to be as up to date as possible; if you're not so bothered about staleness, just set an expiry time low enough that the application re-requests the data from the main DB often enough.
One easy way to do this is to search for every datastore put() call and set memcache after it. Then make sure to check memcache before attempting a datastore query.
Set memcache after writing to the datastore:

from google.appengine.api import memcache

data.put()
memcache.set(user_id, data)
Try getting data from memcache before doing a datastore query:
data = memcache.get(user_id)
if data is None:
    data = Data.get_by_key_name(user_id)
    memcache.set(user_id, data)
Using memcache reduces App Engine costs significantly; I've written up more details on how I optimized on App Engine.
About reference properties: let's say you have MainModel and RefModel with a reference property 'ref' that points to a MainModel instance. Whenever you call ref_model.ref, it does a datastore get operation and retrieves the object from the datastore. It does not interact with memcache in any way.
I have a list of entities which I want to store in memcache. The problem is that I have large models referenced by their ReferenceProperty which are automatically also stored in memcache. As a result I'm exceeding the size limit for objects stored in memcache.

Is there any possibility to prevent the ReferenceProperties from loading the referenced models while putting them in memcache?
I tried something like

def __getstate__(self):
    odict = self.__dict__.copy()
    odict['model'] = None
    return odict

in the class I want to store in memcache, but that doesn't seem to do the trick.
Any suggestions would be highly appreciated.
Edit: I verified by adding a logging statement that the __getstate__ method is executed.
For large entities, you might want to manually handle the loading of the related entities by storing the keys of the large entities as something other than a ReferenceProperty. That way you can choose when to load the large entity and when not to. Just use an integer property to store ids or a string property to store key names.
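A minimal sketch of that idea (the model names here are made up):

from google.appengine.ext import db

class BigThing(db.Model):
    payload = db.BlobProperty()

class Small(db.Model):
    # A plain string key name instead of db.ReferenceProperty(BigThing),
    # so pickling Small for memcache never drags BigThing along.
    big_key_name = db.StringProperty()

    def big(self):
        # Load the large entity only when explicitly asked for.
        return BigThing.get_by_key_name(self.big_key_name)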
odict = self.__dict__.copy()
del odict['model']

would probably be better than setting the value to None (unless __getstate__ needs to return the full dict; I'm not familiar with it). Not sure if this solves your problem, though... You could implement __del__ in the Model to test whether it's freed. To me it looks like you still hold a reference somewhere.
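For what it's worth, one possible reason the original __getstate__ has no visible effect is that db.ReferenceProperty caches the resolved entity under an internal attribute (in the SDK source it appears to be named _RESOLVED_<property_name>, but treat that as an assumption). A sketch that strips any such cached entities before pickling:

def __getstate__(self):
    odict = self.__dict__.copy()
    # Drop cached resolved references; the attribute prefix is an
    # assumption based on the db.ReferenceProperty implementation.
    for attr in list(odict):
        if attr.startswith('_RESOLVED'):
            del odict[attr]
    return odict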
Also check out the pickle module; you would have to store everything under a single key, but it automatically protects you from multiple references to the same object (each is stored only once). Sorry, no link, mobile client ;)
Good luck!