I'm having a discussion with a coworker about iterating through large tables via the Django ORM. Up to now I've been using an implementation of a queryset_iterator as seen here:
import gc

def queryset_iterator(queryset, chunksize=1000):
    '''
    Iterate over a Django Queryset ordered by the primary key.

    This method loads a maximum of chunksize (default: 1000) rows in its
    memory at the same time, while Django normally would load all rows into
    memory. Using the iterator() method only keeps it from preloading all
    the instances.

    Note that this implementation does not support ordered querysets,
    since it applies its own ordering by primary key.
    '''
    pk = 0
    last_pk = queryset.order_by('-pk')[0].pk
    queryset = queryset.order_by('pk')
    while pk < last_pk:
        for row in queryset.filter(pk__gt=pk)[:chunksize]:
            pk = row.pk
            yield row
        gc.collect()
My coworker suggested using Django's Paginator and passing the queryset into that. It seems like similar work would be done, and the only difference I can find is that the Paginator doesn't make any garbage-collection calls.
Can anybody shed light on the difference between the two? Is there any?
The implementation here is totally different from what Paginator does; there are almost no similarities at all.
Your function iterates through an entire queryset, requesting chunksize items at a time, with each chunk being a separate query. It can only be used on unordered querysets, because it applies its own order_by call.
Paginator does nothing like this. It is not for iterating over an entire queryset, but for returning a single page from the full qs, which it does with a single query using the slice operators (which map to LIMIT/OFFSET).
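For contrast, here is a minimal sketch of typical Paginator usage, assuming a hypothetical Article model; each page is fetched with one LIMIT/OFFSET query, plus a COUNT query to work out the number of pages:

from django.core.paginator import Paginator
from myapp.models import Article   # hypothetical model, just for illustration

def article_page(request):
    qs = Article.objects.order_by('pk')
    paginator = Paginator(qs, 25)                       # 25 items per page
    page = paginator.page(request.GET.get('page', 1))   # triggers SELECT COUNT(*) ...
    # page.object_list is a sliced queryset: roughly SELECT ... LIMIT 25 OFFSET n
    return page.object_list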
Separately, I'm not sure what you think calling gc.collect would do here. The garbage collector is an add-on to the main memory management system, which is reference counting. It is only useful in cleaning up circular references, and there is no reason to believe any would be created here.
Related
I am loving Django and like its built-in pagination functionality. However, I run into issues when attempting to split a randomly ordered queryset across multiple pages.
For example, I have 100 elements in a queryset, and wish to display them 25 at a time. When the context object is a randomly ordered queryset (using the .order_by('?') specification), a completely new queryset is loaded into the context each time a new page is requested (page 2, 3, 4).
Explicitly stated: how do I (or can I) request a single queryset, randomly ordered, and display it across digestible pages?
I ran into the same problem recently where I didn't want to have to cache all the results.
What I did to resolve this was a combination of .extra() and raw().
This is what it looks like:
raw_sql = str(queryset.extra(select={'sort_key': 'random()'})
                      .order_by('sort_key').query)
set_seed = "SELECT setseed(%s);" % float(random_seed)
queryset = self.model.objects.raw(set_seed + raw_sql)
I believe this will only work for postgres. Doing a similar thing in MySQL is probably simpler since you can pass the seed directly to RAND(123).
The seed can be stored in the session/a cookie/your frontend in the case of ajax calls.
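For example, a sketch of how this could be wired into a view so that every page of the same session sees the same order; setseed()/random() assume PostgreSQL, and the session key name is arbitrary:

import random

def seeded_random_queryset(request, queryset):
    # One seed per session, so pages 2, 3, ... reuse the same random order.
    seed = request.session.get('random_seed')
    if seed is None:
        seed = random.random()                  # setseed() expects a value in [-1, 1]
        request.session['random_seed'] = seed
    raw_sql = str(queryset.extra(select={'sort_key': 'random()'})
                          .order_by('sort_key').query)
    set_seed = "SELECT setseed(%s);" % float(seed)
    return queryset.model.objects.raw(set_seed + raw_sql)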
Warning - There is a better way
This is actually a very slow operation. I found this blog post, which describes a very good method both for retrieving a single result and for retrieving sets of results.
In this case the seed will be used in your local random number generator.
I think this really good answer will be useful to you: How to have a "random" order on a set of objects with paging in Django?
Basically, he suggests caching the list of objects and referring to it with a session variable, so it can be maintained between pages (using Django pagination).
Or you could manually randomize the list and pass a seed to maintain the same random order for the same user!
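A rough sketch of the cached-list idea, assuming a placeholder MyModel: shuffle the primary keys once, keep them in the session, and let Paginator page through the pk list, fetching only one page's worth of rows per request:

import random
from django.core.paginator import Paginator
from myapp.models import MyModel   # placeholder model, just for illustration

def random_page(request, page_number):
    pk_list = request.session.get('shuffled_pks')
    if pk_list is None:
        pk_list = list(MyModel.objects.values_list('pk', flat=True))
        random.shuffle(pk_list)
        request.session['shuffled_pks'] = pk_list

    page = Paginator(pk_list, 25).page(page_number)
    # Fetch only this page's rows, then restore the shuffled order in Python.
    objs = MyModel.objects.in_bulk(page.object_list)
    return [objs[pk] for pk in page.object_list]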
The best way to achieve this is to use a pagination app like:
pure-pagination
django-pagination
django-infinite-pagination
Personally I use the first one; it integrates pretty well with Haystack.
""" EXAMPLE: (django-pagination) """
#paginate 10 results.
{% autopaginate my_query 10 %}
from django.core.paginator import Paginator
from .models import blog
b = blog.objects.all()
p = Paginator(b, 2)
The above code is from the official Django Pagination Documentation.
Can somebody please explain to me how this is not inefficient if I have a table with many records? Doesn't the above code fetch everything from the db and then just chop it down? That would defeat the purpose of pagination...
Can somebody please explain to me how this is not inefficient if I have a table with many records?
QuerySets are lazy. blog.objects.all() will not retrieve the records from the database unless you "consume" the QuerySet; since you did not do that, neither will the paginator. You "consume" a queryset by iterating over it, calling len(..) on it, converting it to a str(..)ing, and a few other operations. As long as you do none of those, the QuerySet is just an object that represents a query that Django will perform only when necessary.
The Paginator will construct a QuerySet that is a sliced variant of the queryset you pass. It will first call queryset.count() to retrieve the number of objects, and thus make a SELECT COUNT(*) FROM … query (which does not retrieve the elements themselves), and then use a sliced variant such as queryset[:2], which translates to SELECT … FROM … LIMIT 2. The count is performed to determine the number of pages that exist.
In short, it will thus construct new queries, and perform these on the database. Normally these limit the amount of data returned.
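To make that concrete, here is a small sketch built on the question's code; only a COUNT and a LIMIT query ever hit the database, never the full table (inspecting connection.queries requires DEBUG = True):

from django.core.paginator import Paginator
from django.db import connection, reset_queries

from .models import blog        # the model from the question

reset_queries()

b = blog.objects.all()          # no query yet: QuerySets are lazy
p = Paginator(b, 2)

page = p.page(1)                # triggers SELECT COUNT(*) FROM ...
list(page.object_list)          # triggers SELECT ... FROM ... LIMIT 2

# With DEBUG = True, connection.queries now shows exactly those two queries.
print(connection.queries)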
I want to iterate over all the objects in a QuerySet. However, the QuerySet matches hundreds of thousands if not millions of objects. So when I try to start an iteration, my CPU usage goes to 100%, and all my memory fills up, and then things crash. This happens before the first item is returned:
bts = Backtrace.objects.all()
for bt in bts:
    print(bt)
I can ask for an individual object and it returns immediately:
bts = Backtrace.objects.all()
print(bts[5])
But getting a count of all objects crashes just as above, so I can't iterate using this method since I don't know how many objects there will be.
What's a way to iterate without causing the whole result to get pre-cached?
First of all make sure that you understand when a queryset is evaluated (hits the database).
Of course, one approach would be to filter the queryset.
If that is out of the question, there are some workarounds you can use, depending on your needs (a couple of them are sketched below the list):
Queryset Iterator
Pagination
A django snippet that claims to do the job using a generator
Raw SQL using cursors.
Fetch a list of necessary values or a dictionary of necessary values and work with them.
Here is a nice article that tries to tackle this issue on a more theoretical level.
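For instance, here is a minimal sketch of two of those workarounds, using the Backtrace model from the question (the app path is a guess): Django's built-in .iterator() streams rows without filling the queryset's result cache, and values_list() fetches only the columns you need:

from myapp.models import Backtrace   # the model from the question; app name is a guess

# Stream rows without filling the queryset's result cache.
# (Newer Django versions also accept a chunk_size argument to iterator().)
for bt in Backtrace.objects.all().iterator():
    print(bt)

# Fetch only the column you need as plain values instead of full model instances.
for pk in Backtrace.objects.values_list('pk', flat=True).iterator():
    print(pk)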
I have a memory issue with mongoengine (in python).
Let's say I have a very large number of custom_documents (several thousand).
I want to process them all, like this:
for item in custom_documents.objects():
    process(item)
The problem is that custom_documents.objects() loads every object into memory, and my app uses several GB...
How can I make this more memory-efficient?
Is there a way to make mongoengine query the DB lazily (i.e. request objects as we iterate over the queryset)?
According to the docs (and in my experience), collection.objects returns a lazy QuerySet. Your first problem might be that you're calling the objects attribute, rather than just using it as an iterable. I feel like there must be some other reason your app is using so much memory, perhaps process(object) stores a reference to it somehow? Try the following code and check your app's memory usage:
queryset = custom_documents.objects
print(queryset.count())
Since QuerySets are lazy, you can also do things like custom_documents.objects.skip(500).limit(100) in order to return only objects 500-600.
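A small sketch of both points, reusing custom_documents and process() from the question; documents come off the underlying cursor in batches as the loop advances, rather than all up front:

# Iterate over the manager directly instead of calling it.
for item in custom_documents.objects:
    process(item)

# Slicing stays lazy too and maps to skip/limit on the cursor.
for item in custom_documents.objects[500:600]:
    process(item)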
I think you want to look at querysets - these are the MongoEngine wrapper for cursors:
http://mongoengine.org/docs/v0.4/apireference.html#querying
They let you control the number of objects returned, essentially taking care of the batch size settings etc. that you can set directly in the pymongo driver:
http://api.mongodb.org/python/current/api/pymongo/cursor.html
Cursors are set up to generally behave this way by default, you have to try to get them to return everything in one shot, even in the native mongodb shell.
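For reference, here is a minimal sketch of the corresponding knob in the raw pymongo driver (the database and collection names are placeholders); the cursor streams documents in batches instead of returning everything at once:

from pymongo import MongoClient

client = MongoClient()                          # assumes a local mongod
cursor = client.mydb.custom_documents.find()    # placeholder db/collection names
cursor.batch_size(100)                          # fetch 100 documents per round trip

for doc in cursor:
    process(doc)                                # process() is the function from the question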
Trying to understand what happens during a django low-level cache.set()
Particularly, details about what part of the queryset gets stored in memcached.
First, am I interpreting the django docs correctly?
1. A queryset (python object) has/maintains its own cache.
2. Access to the database is lazy; even if the queryset would match 1000 rows, if I do a .get() for 1 record, then the database will only be accessed once, for that 1 record.
3. When accessing a Django view via the Apache prefork MPM, every time a particular daemon instance X ends up invoking a particular view that includes something like "tournres_qset = TournamentResult.objects.all()", a new tournres_qset object is created each time. That is, anything that may have been cached internally by a tournres_qset python object from a previous (tcp/ip) visit is not used at all by a new request's tournres_qset.
Now the questions about saving things to memcached within the view.
Let's say I add something like this at the top of the view:
tournres_qset = cache.get('tournres', None)
if tournres_qset is None:
    tournres_qset = TournamentResult.objects.all()
    cache.set('tournres', tournres_qset, timeout)
# now start accessing tournres_qset
# ...
What gets stored during the cache.set()?
Does the whole queryset (python object) get serialized and saved?
Since the queryset hasn't been used yet to get any records, is this
just a waste of time, since no particular records' contents are actually
being saved in memcache? (Any future requests will get the queryset
object from memcache, which will always start fresh, with an empty local
queryset cache; access to the dbase will always occur.)
If the above is true, should I instead re-save the queryset at the end of the view, after it has been used throughout the view to access some records? That would update the queryset's local cache, and the updated queryset would then be re-saved to memcached. But then, this would mean serializing the queryset object again on every request.
So much for speeding things up.
Or does cache.set() force the queryset object to iterate and fetch all the records from the dbase, which will then also get saved in memcache? Would everything get saved, even if the view only accesses a subset of the queryset?
I see pitfalls in all directions, which makes me think that I'm
misunderstanding a whole bunch of things.
Hope this makes sense; I'd appreciate clarifications or pointers to some "standard" guidelines. Thanks.
Querysets are lazy, which means they don't call the database until they're evaluated. One way they could get evaluated is by serializing them, which is what cache.set does behind the scenes. So no, this isn't a waste of time: the entire contents of your TournamentResult model will be cached, if that's what you want. It probably isn't, though: if you filter the queryset further, Django will just go back to the database, which would make the whole thing a bit pointless. You should just cache the model instances you actually need.
Note that the third point in your initial set isn't quite right, in that this has nothing to do with Apache or preforking. It's simply that a view is a function like any other, and anything defined in a local variable inside a function goes out of scope when that function returns. So a queryset defined and evaluated inside a view goes out of scope when the view returns the response, and a new one will be created the next time the view is called, ie on the next request. This is the case whichever way you are serving Django.
However, and this is important, if you do something like set your queryset to a global (module-level) variable, it will persist between requests. Most of the ways that Django is served, and this definitely includes mod_wsgi, keep a process alive for many requests before recycling it, so the value of the queryset will be the same for all of those requests. This can be useful as a sort of bargain-basement cache, but is difficult to get right because you have no idea how long the process will last, plus other processes are likely to be running in parallel which have their own versions of that global variable.
Updated to answer questions in the comment
Your questions show that you still haven't quite grokked how querysets work. It's all about when they are evaluated: if you list, or iterate, or slice a queryset, that evaluates it, and it's at that point the database call is made (I count serialization under iterating, here), and the results stored in the queryset's internal cache. So, if you've already done one of those things to your queryset, and then set it to the (external) cache, that won't cause another database hit.
But, every filter() operation on a queryset, even one that's already evaluated, is another database hit. That's because it's a modification of the underlying SQL query, so Django goes back to the database - and returns a new queryset, with its own internal cache.
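To make the practical upshot concrete, here is a small sketch using the TournamentResult queryset from the question; caching an evaluated list stores the actual instances, and any further narrowing is then done in Python rather than with filter(), which would go back to the database:

from django.core.cache import cache

timeout = 300   # seconds; the value is arbitrary for this sketch

results = cache.get('tournres')
if results is None:
    # list() evaluates the queryset once; the concrete instances get pickled.
    results = list(TournamentResult.objects.all())
    cache.set('tournres', results, timeout)

# Work with the cached instances in Python. Calling .filter() on a cached
# queryset here would hit the database again and return a new, uncached queryset.
top_results = [r for r in results if r.pk <= 100]   # purely illustrative condition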