GAE/P: Migrating to NDB efficiently - python

I'm finally upgrading from db to ndb (it is a much bigger headache than I anticipated...).
I used a lot of ReferenceProperty and I've converted these to KeyProperty. Now, every place where I used a ReferenceProperty I need to add an explicit get because it was previously done for me automatically.
My question relates to whether I should restructure my code to make it more efficient. I have many methods that use a KeyProperty and I need to do an explicit get(). I'm wondering if I should change these methods so that I am passing the entity to them instead of using the KeyProperty and a get().
Is the automatic memcaching for ndb good enough that I don't need to restructure? Or should I restructure my code to avoid repeated gets of the same entity?
We are not looking at huge inefficiencies here. But for a single HTTP GET/POST I might be getting the same entity 3-5 times.

In your case the In-Context Cache will take over and save you the db calls:
The in-context cache is fast; this cache lives in memory. When an NDB function writes to the Datastore, it also writes to the in-context cache. When an NDB function reads an entity, it checks the in-context cache first. If the entity is found there, no Datastore interaction takes place.
Each request will get a new context.

Related

Python Django cache vs store in model field? Which is more efficient?

My view displays a table of data (specific to a customer who may have many users) and this table takes up a lot of computational resource to populate. A customers data changes 4/5 times a week, usually on the same day.
Caching is an obvious solution to this but I was wondering if Django's cache framework is significantly more efficient than creating a Textfield at the customer level and storing the data there instead?
I feel it's easier to implement (and clear the Textfield when the data changes) but what are the drawbacks and is there anything else I need to look out for? (problems if the dataset gets too big? additional fields in the model etc etc???)
Any help would be much appreciated!
A cache is a cache is cache, however you implement it, and the main problem with caches is invalidation.
As Melvyn rightly answered, the case for the cache framework is that it's (well, can be, depending on which backend you choose) outside your database. Whether it's a pro or cons really depends on your database load, infrastructure and whatnots... if you already use the cache framework (for more than plain unconditional full-page caching I mean) and want to mimimize the load on your database then it's possibly worth the added complexity.
Else storing your computed result in the db is quite straightforward and doesn't require additional servers, install etc. I'd personnally go for a dedicated model - to avoid unnecessary overhead at the db level -, including both the cached result and a checksum of the params on which this result depends (canonical memoization pattern) so you can easily detect whether it needs to be recomputed. I found this solution to be easier to maintain than trying to detect changes to each and any of those params and invalid/recompute the cache "on the fly" (which is what can make proper cache invalidation difficult or at least complex to implement) but this again depends on what those params are and where they come from.
The upside to using the cache framework is that you don't have to use the database. You can scale your cache store independent of your database and run the cache on different physical (or virtual) machines.
In addition you don't have to implement the stale vs fresh logic, but that's a one-off.
4-5 times a week doesn't look like a big challenge, but nobody knows except you what kind of computation do you have, how many data you should store, how many users do you have and so on.
If you want to implement this with TextField, it still some kind of caching system, so I suggest to use django's caching system with database backend first https://docs.djangoproject.com/en/1.11/topics/cache/#database-caching You can't retrieve data with 1 query like in case of TextField, but later you can replace database with other layer if necessary.

How to cache a row instead of an model instance when using SQLAlchemy?

I'm using SQLAlchemy with python-memcached.
Before actually using a real SQLAlchemy query, I first check if the corresponding key is cached, and in case it's not already cached I would set the found instance (which is a SQLAlchemy model object) in cache.
For most of my functions this is fast enough, but in functions where thousands of objects get queried, fair amount of time is spent in cPickle.loads for deserialising the objects.
Because serialisation and deserialisation of a tuple/dict row could be several times faster than that of an object, I'm wondering if I can somehow cache the row instead.
You could use dogpile cache package for this.
There's even an example of usage on the SQLAlchemy documentation website.

Dynamically Created Top Articles List in Django?

I'm creating a Django-powered site for my newspaper-ish site. The least obvious and common-sense task that I have come across in getting the site together is how best to generate a "top articles" list for the sidebar of the page.
The first thing that came to mind was some sort of database column that is updated (based on what?) with every view. That seems (to my instincts) ridiculously database intensive and impractical and thus I think I'd like to find another solution.
Thanks all.
I would give celery a try (with django-celery). While it's not so easy to configure and use as cache, it enables you to queue tasks like incrementing counters and do them in background. It could be even combined with cache technique - in views increment counters in cache and define PeriodicTask that will run every now and then, resetting counters and writing them to the database.
I just remembered - I once found this blog entry which provides nice way of incrementing 'viewed_count' (or similar) column in database with AJAX JS call. If you don't have heavy traffic maybe it's good idea?
Also mentioned in this post is django-tracking, but I don't know much about it, I never used it myself (yet).
Premature optimization, first try the db way and then see if it really is too database sensitive. Any decent database has so good caches it probably won't matter very much. And even if it is a problem, take a look at the other db/cache suggestions here.
It is most likely by the way is that you will have many more intensive db queries with each view than a simple view update.
If you do something like sort by top views, it would be fast if you index the view column in the DB. Another option is to only collect the top x articles every hour or so, and toss that value into Django's cache framework.
The nice thing about caching the list is that the algorithm you use to determine top articles can be as complex as you like without hitting the DB hard with every page view. Django's cache framework can use memory, db, or file system. I prefer DB, but many others prefer memory. I believe it uses pickle, so you can also store Python objects directly. It's easy to use, recommended.
An index wouldn't help as them main problem I believe is not so much getting the sorted list as having a DB write with every page view of an article. Another index actually makes that problem worse, albeit only a little.
So I'd go with the cache. I think django's cache shim is a problem here because it requires timeouts on all keys. I'm not sure if that's imposed by memcached, if not then go with redis. Actually just go with redis anyway, the python library is great, I've used it from django projects before, and it has atomic increments and powerful sorting - everything you need.

Idiots guide to app engine and memcache

I am struggling to find a good tutorial or best practices document for the use of memcache in app engine.
I'm pretty happy with it on the level presented in the docs. Get an object by ID, checking memcache first, but I'm unclear on things like:
If you cache a query, is there an accepted method for ensuring that the cache is cleared/updated when an object stored in that query is updated.
What are the effects of using ReferenceProperties ? If a cache a Foo object with a Bar reference. Is my foo.bar in memcache too and in need of clearing down if it gets updated from some other part of my application.
I don't expect answers to this here (unless you are feeling particularly generous!), but pointers to things I could read would be very gratefully received.
If you cache a query, is there an
accepted method for ensuring that the
cache is cleared/updated when an
object stored in that query is
updated.
Typically you wrap your reads with a conditional to retrieve the value from the main DB if it's not in the cache. Just wrap your updates as well to fill the cache whenever you write the data. That's if you need the results to be as up to date as possible - if you're not so bothered about it being out of date just set an expiry time that is low enough for the application to have to re-request the data from the main DB often enough.
One way to easily do this is by searching for any datastore put() call and set memcache after it. Next make sure to get data from memcache before attempting a datastore query.
Set memcache after writing to datastore:
data.put()
memcache.set(user_id, data)
Try getting data from memcache before doing a datastore query:
data = memcache.get(user_id)
if data is None:
data = Data.get_by_key_name(user_id)
memcache.set(user_id, data)
Using Memcache reduces app engine costs significantly. More details on how I optimized on app engine.
About reference properties, let's say you have MainModel and RefModel with a reference property 'ref' that points to a MainModel instance. Whenever you cal ref_model.ref it does a datastore get operation and retrieves the object from datastore. It does not interact with memcache in any way.

appengine: cached reference property?

How can I cache a Reference Property in Google App Engine?
For example, let's say I have the following models:
class Many(db.Model):
few = db.ReferenceProperty(Few)
class Few(db.Model):
year = db.IntegerProperty()
Then I create many Many's that point to only one Few:
one_few = Few.get_or_insert(year=2009)
Many.get_or_insert(few=one_few)
Many.get_or_insert(few=one_few)
Many.get_or_insert(few=one_few)
Many.get_or_insert(few=one_few)
Many.get_or_insert(few=one_few)
Many.get_or_insert(few=one_few)
Now, if I want to iterate over all the Many's, reading their few value, I would do this:
for many in Many.all().fetch(1000):
print "%s" % many.few.year
The question is:
Will each access to many.few trigger a database lookup?
If yes, is it possible to cache somewhere, as only one lookup should be enough to bring the same entity every time?
As noted in one comment: I know about memcache, but I'm not sure how I can "inject it" when I'm calling the other entity through a reference.
In any case memcache wouldn't be useful, as I need caching within an execution, not between them. Using memcache wouldn't help optimizing this call.
The first time you dereference any reference property, the entity is fetched - even if you'd previously fetched the same entity associated with a different reference property. This involves a datastore get operation, which isn't as expensive as a query, but is still worth avoiding if you can.
There's a good module that adds seamless caching of entities available here. It works at a lower level of the datastore, and will cache all datastore gets, not just dereferencing ReferenceProperties.
If you want to resolve a bunch of reference properties at once, there's another way: You can retrieve all the keys and fetch the entities in a single round trip, like so:
keys = [MyModel.ref.get_value_for_datastore(x) for x in referers]
referees = db.get(keys)
Finally, I've written a library that monkeypatches the db module to locally cache entities on a per-request basis (no memcache involved). It's available, here. One warning, though: It's got unit tests, but it's not widely used, so it could be broken.
The question is:
Will each access to many.few trigger a database lookup? Yes. Not sure if its 1 or 2 calls
If yes, is it possible to cache somewhere, as only one lookup should be enough to bring the same entity every time? You should be able to use the memcache repository to do this. This is in the google.appengine.api.memcache package.
Details for memcache are in http://code.google.com/appengine/docs/python/memcache/usingmemcache.html

Categories

Resources