How do you effectively cache a large django queryset?

I am working on a Django project using a PostgreSQL database in which I need to run a sometimes complex query on a user profile table with ~1M records, and the dataset returned can be anywhere from 0 to all 1M records. My issue is that once I grab all of the records, I want to be able to filter them further to run analytics on these profiles. The analytics cannot be completed in the same request/response loop, as this will time out for all but the smallest querysets. So I am using async javascript to shoot off new requests for each type of analytics I want.
For example, I will grab all of the profiles in the initial request and then I will send subsequent requests to take those profiles and return, say, the percentage breakdown of genders or the percentage of users who have a certain job title, etc. The issue is that every subsequent request has to run the original query again. What I would love to do is somehow cache that query and then run my further filters on this subset of the profile table without having to execute the original, long-running query.
I have tried using a cache function to cache the queryset itself, but this actually slows performance down a bit, I assume because the queryset has to be pickled and unpickled and then the query still has to run? I also tried caching a list of ids from the parent long-running query (this is potentially a VERY long list, up to 1M integers), and that grinds my system to a complete halt for anything more than about 44k records.
Has anyone dealt with this kind of issue before? I know that I could set up a worker/queue system, which is on my roadmap, but it would be lovely if there was a simple solution to this that utilizes the built-in capabilities of Django.
Some sample code:
def get_analytics(request):
    data_type = request.POST.get('data_type')
    query_params = request.POST.get('query_params')  # a crazy filter with lots of Q objects
    profile_ids = get_profiles(query_params)  # I WANT TO CACHE THIS
    profiles = Profile.objects.filter(id__in=profile_ids).distinct()
    if data_type == 'overview':
        return profiles.count()
    elif data_type == 'gender':
        gender_breakdown = profiles.filter(a_filter_for_gender).values('gender').annotate(Count('gender', distinct=True))
        return gender_breakdown
def cache_function(length):
    """
    A variant of the snippet posted by Jeff Wheeler at
    http://www.djangosnippets.org/snippets/109/

    Caches a function, using the function and its arguments as the key, and the
    return value as the value saved. It passes all arguments on to the function,
    as it should.

    The decorator itself takes a length argument, which is the number of
    seconds the cache will keep the result around.
    """
    def decorator(func):
        def inner_func(*args, **kwargs):
            if hasattr(settings, 'IS_IN_UNITTEST'):
                return func(*args, **kwargs)
            key = get_cache_key(func.__name__, func.__module__, args, kwargs)
            value = cache.get(key)
            if key in cache:
                return value
            else:
                result = func(*args, **kwargs)
                cache.set(key, result, length)
                return result
        return inner_func
    return decorator

@cache_function(60*2)
def get_profiles(query_params):
    return Profile.objects.filter(query_params).values_list('id')
Why does caching the ids slow my system down? Is there a better way to accomplish this?
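(For context, one workaround if caching is still desired is to cache a flat, already-evaluated list of ids rather than a queryset object holding full model instances, so the pickled payload is as small as possible. The sketch below is purely illustrative: it reuses the get_cache_key helper from the snippet above, and the 10-minute timeout is an arbitrary choice, not something from the original post.)
from django.core.cache import cache

def get_profile_ids(query_params, timeout=60 * 10):
    # Hypothetical variant of get_profiles(): cache a plain list of integer
    # ids, evaluated up front, instead of a lazy queryset object.
    key = get_cache_key('get_profile_ids', __name__, (query_params,), {})
    ids = cache.get(key)
    if ids is None:
        # flat=True plus list() forces evaluation here, so only a flat list
        # of integers is pickled into the cache backend.
        ids = list(Profile.objects.filter(query_params)
                                  .values_list('id', flat=True))
        cache.set(key, ids, timeout)
    return ids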

Related

Async behaviour within a with statement

I am building a wrapper for an API, in order to make it more accessible to our users. The user initialises the SomeAPI object and then has access to lots of class methods, as defined below.
One of the operations I wish to support is creating what we call an "instance".
Once the instance is no longer required, it should be deleted, so I use contextlib.contextmanager like so:
class SomeAPI:

    # Lots of methods
    ...

    def create_instance(self, some_id):
        # Create an instance for some_id
        payload = {"id": some_id}
        resp_url = ".../instance"
        # This specific line of code may take a long time
        resp = self.requests.post(resp_url, json=payload)
        return resp.json()["instance_id"]

    def delete_instance(self, instance_id):
        # Delete a specific instance
        resp_url = f".../instance/{instance_id}"
        resp = self.requests.delete(resp_url)
        return

    @contextlib.contextmanager
    def instance(self, some_id):
        instance_id = self.create_instance(some_id)
        try:
            yield instance_id
        finally:
            if instance_id:
                self.delete_instance(instance_id)
So then our users can write code like this:
some_api = SomeAPI()

# Necessary preprocessing - anywhere between 0-10 minutes
x = preprocess_1()
y = preprocess_2()

my_id = 1234

with some_api.instance(my_id):
    # Once the instance is created, do some stuff with it in here
    # Uses the preprocesses above
    some_api.do_other_class_method_1(x)
    some_api.do_other_class_method_2(y)

# Exited the with block - instance has been deleted
Which works fine. The problem is that creation of this instance always takes 60-90 seconds (as commented within the create_instance method), therefore if possible I would like to make this whole code more efficient by:
1. Start the process of creating the instance (using a with block)
2. Only then, start the preprocessing (which, as commented, may take anywhere between 0 and 10 minutes)
3. Once the preprocessing has completed, use its results with the instance
This order of operations would save up to 60 seconds each time, if the preprocessing happens to take more than 60 seconds. Note that there is no guarantee that the preprocessing will be longer or shorter than the creation of the instance.
I am aware of the existence of contextlib.asynccontextmanager, but the whole async side of things ties a knot in my brain. I have no idea how to get the order of operations right while also maintaining the ability for the user to create and destroy the instance easily using a with statement.
Can anyone help?
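One way to get that order of operations without touching asyncio at all is to start create_instance in a worker thread and hand the caller a Future, while keeping the familiar with statement. This is only a rough sketch under that assumption; eager_instance and pending are made-up names, not part of the real API:
import concurrent.futures
import contextlib

@contextlib.contextmanager
def eager_instance(api, some_id):
    """Start creating the instance in the background and yield a Future."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(api.create_instance, some_id)
        try:
            yield future
        finally:
            # Make sure creation finished before attempting the delete,
            # so an instance is never leaked.
            instance_id = future.result()
            if instance_id:
                api.delete_instance(instance_id)

# Usage: the preprocessing overlaps with the slow create_instance call.
with eager_instance(some_api, my_id) as pending:
    x = preprocess_1()
    y = preprocess_2()
    pending.result()  # blocks only if creation is still in flight
    some_api.do_other_class_method_1(x)
    some_api.do_other_class_method_2(y)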

Delete Data store entries synchronously in Google App Engine

I use Python in GAE and try to delete one entry in the datastore using db.delete(model_obj). I assumed this operation is performed synchronously, since the documentation distinguishes between delete() and delete_async(), but when I read the source code of db, the delete method simply calls delete_async, which does not match what the documentation says :(
So is there any way to do the delete in a synchronous flow?
Here is the source code in db:
def delete_async(models, **kwargs):
    """Asynchronous version of delete one or more Model instances.

    Identical to db.delete() except returns an asynchronous object. Call
    get_result() on the return value to block on the call.
    """
    if isinstance(models, (basestring, Model, Key)):
        models = [models]
    else:
        try:
            models = iter(models)
        except TypeError:
            models = [models]
    keys = [_coerce_to_key(v) for v in models]
    return datastore.DeleteAsync(keys, **kwargs)

def delete(models, **kwargs):
    """Delete one or more Model instances.
    """
    delete_async(models, **kwargs).get_result()
EDIT: From a comment, this is the original misbehaving code:
def tearDown(self):
    print self.account
    db.delete(self.device)
    db.delete(self.account)
    print Account.get_by_email(self.email, case_sensitive=False)
The result of the two print statements is <Account object at 0x10d1827d0> and <Account object at 0x10d1825d0>. Even though the two memory addresses are different, they point to the same entity. If I add some latency after the delete (a for loop, for example), the fetched object is None.
The code you show for delete calls delete_async, yes, but then it calls get_result on the returned asynchronous handle, which will block until the delete actually occurs. So, delete is synchronous.
The reason the sample code you show is returning an object is that you're probably running a query to fetch the account; I presume the email is not the db.Key of the account? Normal queries are not guaranteed to return updated results immediately. To avoid seeing stale data, you either need to use an ancestor query or look up the entity by key, both of which are strongly consistent.
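To make the distinction concrete, here is a rough sketch of what that looks like in the test above; get_by_email is the question's own helper and is assumed to run an ordinary, eventually consistent query:
key = self.account.key()
db.delete(self.account)        # blocks until the delete has been applied

print db.get(key)              # lookup by key is strongly consistent -> None
print Account.get_by_email(self.email, case_sensitive=False)
                               # ordinary query: may still return stale data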

Efficient pagination and database querying in django

There were some code examples for Django pagination which I used a while back. I may be wrong, but looking over the code again, it seems to waste tons of memory. I was looking for a better solution; here is the code:
# in views.py
from django.core.paginator import Paginator, EmptyPage, PageNotAnInteger
...
...

def someView(request):
    models = Model.objects.order_by('-timestamp')
    paginator = Paginator(models, 7)
    pageNumber = request.GET.get('page')
    try:
        paginatedPage = paginator.page(pageNumber)
    except PageNotAnInteger:
        pageNumber = 1
    except EmptyPage:
        pageNumber = paginator.num_pages
    models = paginator.page(pageNumber)
    return render_to_resp ( ..... models ....)
I'm not sure of the subtleties of this code, but from what it looks like, the first line retrieves every single model from the database and pushes it into memory. Then it is passed into Paginator, which chunks it up based on which page the user requested via an HTML GET. Is Paginator somehow making this acceptable, or is this totally memory-inefficient? If it is inefficient, how can it be improved?
Also, a related topic. If someone does:
Model.objects.all()[:40]
Does this code mean that all models are pushed into memory and we slice out 40 of them (which would be bad)? Or does it mean that we query for and push only 40 objects into memory, period?
Thank you for your help!
mymodel.objects.all() yields a queryset, not a list. Querysets are lazy - no request is issued and nothing is done until you actually try to use them. Also, slicing a queryset does not load the whole thing into memory only to get a subset, but adds LIMIT and OFFSET to the SQL query before hitting the database.
There is nothing memory-inefficient about using Paginator. Querysets are evaluated lazily. In your call Paginator(models, 7), models is a queryset which has not been evaluated up to this point, so the database hasn't been hit yet and no list containing all the instances of the model is in memory.
When you want to get a page, i.e. at paginatedPage = paginator.page(pageNumber), this queryset is sliced; only at this point is the database hit, and it returns just the rows for that slice as model instances. So only the sliced objects go into a list held in memory. Say one page shows 10 objects; only these 10 objects will stay in memory.
When someone does:
Model.objects.all()[:40]
the slice is applied to the queryset before it is evaluated, so only 40 elements are created and stored in memory. No other list is built, and so there won't be any list containing all the instances of Model in memory.
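One quick way to check this for yourself (a sketch, assuming a model called Model) is to print the SQL Django will run for the sliced queryset:
qs = Model.objects.all()[:40]
print(qs.query)   # the generated SQL includes "LIMIT 40", so only 40 rows
                  # are fetched when the queryset is finally evaluated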
Using the above information I came up with a view function decorator. json_list_objects converts Django objects to JSON-ready Python dicts of the known relationship fields of the Django objects and returns the jsonified list as {count: ..., results: ...}.
Others may find it useful.
from functools import wraps

def with_paging(fn):
    """
    Decorator providing paging behavior. It is for decorating a function that
    takes a request and other arguments and returns the appropriate query
    doing select and filter operations. The decorator adds paging by examining
    the QueryParams of the request for page_size (default 2000) and
    page_num (default 0). The query supplied is used to return the appropriate
    slice.
    """
    @wraps(fn)
    def inner(request, *args, **kwargs):
        page_size = int(request.GET.get('page_size', 2000))
        page_num = int(request.GET.get('page_num', 0))
        query = fn(request, *args, **kwargs)
        start = page_num * page_size
        end = start + page_size
        data = query[start:end]
        total_size = query.count()
        return json_list_objects(data, overall_count=total_size)
    return inner
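For what it's worth, usage might look roughly like this (Model, the view name and the URL are placeholders, not something from the original answer):
@with_paging
def recent_models(request):
    # The decorated view only builds the queryset; with_paging slices it
    # and wraps the result via json_list_objects.
    return Model.objects.order_by('-timestamp')

# GET /recent-models/?page_size=50&page_num=2 would then return items 100-149.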

How to retrieve properties only once from database in django

I have some relationships in my database that I describe like this:
@property
def translations(self):
    """
    :return: QuerySet
    """
    if not hasattr(self, '_translations'):
        self._translations = ClientTranslation.objects.filter(base=self)
    return self._translations
The idea behind the hasattr() check and self._translations is to have the db hit only once, with the stored value returned on subsequent accesses.
However, after reading the docs, I'm not sure the code is doing that, as queries only hit the db when the values are actually needed - which happens after my code runs.
What would a correct approach look like?
Yes, DB is hit the first time someone needs the value. But as you pointed out, you save the query, not the results. Wrap the query with list(...) to save the results.
By the way, you can use the cached_property decorator to make it more elegant. It is not a built-in, though; it comes from a small third-party snippet. You end up with:
@cached_property
def translations(self):
    return list(ClientTranslation.objects.filter(base=self))
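Since the original link to that snippet is not included here, a minimal implementation of such a decorator (reconstructed for illustration, roughly what those snippets do) looks like this:
class cached_property(object):
    """Compute the value on first access, then store it in the instance's
    __dict__ so later accesses bypass the descriptor entirely."""

    def __init__(self, func):
        self.func = func
        self.__doc__ = getattr(func, '__doc__')

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        value = instance.__dict__[self.func.__name__] = self.func(instance)
        return value
Newer Django versions ship an equivalent as django.utils.functional.cached_property, and Python 3.8+ has functools.cached_property.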

struggling with memcache on google app engine python

I've been struggling to get memcache working on my app for a bit now. I thought I had finally got it working so that it never reads from the database (unless the memcache data is lost, of course), only to have my site shut down because of an over-quota number of datastore reads! I'm currently using a free appspot instance and would like to keep it that way for as long as possible. Anyway, here's my code; maybe somebody can help me find the hole in it.
I am currently trying to implement memcache by overriding the db.Model.all(), delete(), and put() methods to query memcache first. I have memcache set up so that each object in the datastore has its own memcache value with its id as the key. Then for each model class I keep a list of the ids under a key it knows how to query. I hope I explained this clearly enough.
""" models.py """
#classmethod
def all(cls, order="sent"):
result = get_all("messages", Message)
if not result or memcache.get("updatemessages"):
result = list(super(Message, cls).all())
set_all("messages", result)
memcache.set("updatemessages", False)
logging.info("DB Query for messages")
result.sort(key=lambda x: getattr(x, order), reverse=True)
return result
#classmethod
def delete(cls, message):
del_from("messages", message)
super(Message, cls).delete(message)
def put(self):
super(Message, self).put()
add_to_all("messages", self)
""" helpers.py """
def get_all(type, Class):
all = []
ids = memcache.get(type+"allid")
query_amount = 0
if ids:
for id in ids:
ob = memcache.get(str(id))
if ob is None:
ob = Class.get_by_id(int(id))
if ob is None:
continue
memcache.set(str(id), ob)
query_amount += 1
all.append(ob)
if query_amount: logging.info(str(query_amount) + " ob queries")
return all
return None
def add_to_all(type, object):
memcache.set(str(object.key().id()), object)
all = memcache.get(type+"allid")
if not all:
all = [str(ob.key().id()) for ob in object.__class__.all()]
logging.info("DB query for %s" % type)
assert all is not None, "query returned None. Send this error code to ____: 2 3-193A"
if not str(object.key().id()) in all:
all.append(str(object.key().id()))
memcache.set(type+"allid", all)
#log_on_fail
def set_all(type, objects):
assert type in ["users", "messages", "items"], "set_all was not passed a valid type. Send this error code to ____: 33-205"
assert not objects is None, "set_all was passed None as the list of objects. Send this error code to _________: 33-206"
all = []
for ob in objects:
error = not memcache.set(str(ob.key().id()), ob)
if error:
logging.warning("keys not setting properly. Object must not be pickleable")
all.append(str(ob.key().id()))
memcache.set(type+"allid", all)
#log_on_fail
def del_from(type, object):
all = memcache.get(type+"allid")
if not all:
all = object.__class__.all()
logging.info("DB query %s" % type)
assert all, "Could not find any objects. Send this error code to _____: 13- 219"
assert str(object.key().id()) in all, "item not found in cache. Send this error code to ________: 33-220"
del all[ all.index(str(object.key().id())) ]
memcache.set(type+"allid", all)
memcache.delete(str(object.key().id()))
I apologize for all of the clutter and lack of elegance. Hopefully somebody will be able to help. I've thought about switching to ndb, but for now I'd rather stick with my custom cache. You'll notice the logging.info("<some number> ob queries") call; I get this log quite often, maybe once or twice every half hour. Does memcache really lose data that often, or is something wrong with my code?
Simple solution: switch to NDB.
NDB models will store values in memcache and in an instance cache (which is 100% free) and these models will also invalidate the cache for you when you update/delete your objects. Retrieval will first try to get from the instance cache, then if that fails from memcache, and finally from the datastore, and it will set the value in any of the caches missed on the way up.
App Engine memcache removes objects using an optimized eviction algorithm, so seeing this log message as often as you describe suggests two possible explanations: either this data is not accessed very often, or the amount of data you keep in memcache is fairly large, so some of it gets evicted from time to time.
I would also propose moving to ndb, which handles memcache and the instance cache quite efficiently.
Hope this helps!
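For reference, an ndb version of the model above stays tiny because the caching is automatic. This is only a rough sketch with assumed property names, not the poster's actual schema:
from google.appengine.ext import ndb

class Message(ndb.Model):
    # Hypothetical fields -- the real schema isn't shown in the question.
    sent = ndb.DateTimeProperty(auto_now_add=True)
    body = ndb.TextProperty()

# get_by_id() checks the in-context cache, then memcache, then the datastore,
# and put() / key.delete() keep both caches consistent automatically.
msg = Message(body="hello")
key = msg.put()
same_msg = Message.get_by_id(key.id())
key.delete()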
