I want to iterate over all the objects in a QuerySet. However, the QuerySet matches hundreds of thousands if not millions of objects. So when I try to start an iteration, my CPU usage goes to 100%, and all my memory fills up, and then things crash. This happens before the first item is returned:
bts = Backtrace.objects.all()
for bt in bts:
    print(bt)
I can ask for an individual object and it returns immediately:
bts = Backtrace.objects.all()
print(bts[5])
But getting a count of all objects crashes just as above, so I can't iterate using this method since I don't know how many objects there will be.
What's a way to iterate without causing the whole result to get pre-cached?
First of all, make sure you understand when a queryset is evaluated (i.e. when it hits the database).
Of course, one approach would be to filter the queryset.
If this is out of the question, there are some workarounds you can use, depending on your needs (a short sketch of the first and last options follows the list):
Queryset Iterator
Pagination
A Django snippet that claims to do the job using a generator
Raw SQL using cursors.
Fetch only the values you need (as a list or dictionary) and work with those.
Here is a nice article that tackles this issue at a more theoretical level.
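As a minimal sketch of the first and last options, assuming the Backtrace model from the question:
# Option 1: iterator() streams rows from the database instead of caching the
# whole result set on the queryset, so memory use stays roughly flat.
for bt in Backtrace.objects.all().iterator():
    print(bt)

# Option 5: fetch only the columns you need; values_list() returns plain
# tuples (or, with flat=True, bare values) instead of full model instances.
for pk in Backtrace.objects.values_list('pk', flat=True).iterator():
    print(pk)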
Related
I don't know where else I could have asked this question, so I'm asking it here. If I apply multiple Django filters on a page, and those filters use multiple DB tables, will that affect RAM consumption whenever a user visits the page, given that only the filtered data is shown to the user? I'm using Django with PostgreSQL on an Ubuntu-based VM. If there is any documentation that helps in understanding RAM utilization, please suggest it.
Django filters and querysets are lazy, which means you don't actually hit the database until you evaluate them. Quoting the official documentation:
Internally, a QuerySet can be constructed, filtered, sliced, and generally passed around without actually hitting the database. No database activity actually occurs until you do something to evaluate the queryset.
So the only space taken up in your RAM is the queryset object itself and your program. Memory only fills up when the queryset is evaluated and data is extracted from the database, and how much depends on how much data is extracted. It would also be a good idea to look at iterators.
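A small sketch of that behaviour, using a hypothetical Entry model with a published flag and an author relation (both placeholders) and a process() function standing in for your own code:
# Building and chaining filters hits no database at all; the queryset is lazy.
qs = Entry.objects.filter(published=True).filter(author__name='Alice')

# The query only runs when the queryset is evaluated, e.g. by slicing and
# converting to a list, iterating, or calling len()/bool() on it.
first_ten = list(qs[:10])      # one query with LIMIT 10

# For large result sets, iterator() streams rows instead of caching them on
# the queryset, keeping memory usage roughly constant while you loop.
for entry in qs.iterator():
    process(entry)             # placeholder for your own processing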
I'm having a discussion with a coworker about iterating through large tables via the Django ORM. Up to now I've been using an implementation of a queryset_iterator as seen here:
import gc

def queryset_iterator(queryset, chunksize=1000):
    '''
    Iterate over a Django Queryset ordered by the primary key.

    This method loads a maximum of chunksize (default: 1000) rows in its
    memory at the same time, while Django normally would load all rows in
    its memory. Using the iterator() method only causes it to not preload
    all the classes.

    Note that the implementation of the iterator does not support ordered
    query sets.
    '''
    pk = 0
    last_pk = queryset.order_by('-pk')[0].pk
    queryset = queryset.order_by('pk')
    while pk < last_pk:
        for row in queryset.filter(pk__gt=pk)[:chunksize]:
            pk = row.pk
            yield row
        gc.collect()
My coworker suggested using Django's Paginator and passing the queryset into that. It seems like similar work would be done, and the only difference I can find is that the Paginator doesn't make any garbage collection calls.
Can anybody shed light on the difference between the two? Is there any?
The implementation here is totally different from what Paginator does; there are almost no similarities at all.
Your function iterates through an entire queryset, requesting chunksize items at a time, with each chunk being a separate query. It can only be used on non-ordered querysets because it imposes its own order_by call.
Paginator does nothing like this. It is not for iterating over an entire queryset, but for returning a single page from the full queryset, which it does with a single query using the slice operators (which map to LIMIT/OFFSET).
Separately, I'm not sure what you think calling gc.collect would do here. The garbage collector is an add-on to the main memory management system, which is reference counting. It is only useful in cleaning up circular references, and there is no reason to believe any would be created here.
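For comparison, a minimal sketch of typical Paginator usage, with MyModel standing in for whatever model you are paging over:
from django.core.paginator import Paginator

# Each page is fetched by slicing the queryset, i.e. one LIMIT/OFFSET query
# (plus a single COUNT the first time, to validate page numbers).
paginator = Paginator(MyModel.objects.order_by('pk'), 1000)

page = paginator.page(3)       # roughly: SELECT ... LIMIT 1000 OFFSET 2000
for obj in page.object_list:
    print(obj)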
I have a model, Reading, which has a foreign key, Type. I'm trying to get a reading for each type that I have, using the following code:
reading_list = []
for type in Type.objects.all():
    readings = Reading.objects.filter(type=type.pk)
    if readings.exists():
        reading_list.append(readings[0])
The problem with this, of course, is that it hits the database for every type. I've played around with some queries to try to optimize this to a single database call, but none of them seem efficient. .values, for instance, will give me a list of readings grouped by type, but it will give me EVERY reading for each type, and I would have to filter them with Python in memory. That's out of the question, as we're dealing with potentially millions of readings.
If you use PostgreSQL as your DB backend, you can do this in one line with something like:
Reading.objects.order_by('type__pk', 'any_other_order_field').distinct('type__pk')
Note that the field on which distinct happens must always be the first argument in the order_by method. Feel free to replace type__pk with the actual field you want to order types on (e.g. type__name if the Type model has a name property). You can read more about distinct here: https://docs.djangoproject.com/en/dev/ref/models/querysets/#distinct.
If you do not use PostgreSQL you could use the prefetch_related method for this purpose:
# reading_set could be replaced with whatever your reverse relation name actually is
reading_list = []
for type in Type.objects.prefetch_related('reading_set').all():
    readings = type.reading_set.all()
    if len(readings):
        reading_list.append(readings[0])
The above will perform only 2 queries in total. Note I use len() so that no extra query is performed when counting the objects. You can read more about prefetch_related here https://docs.djangoproject.com/en/dev/ref/models/querysets/#prefetch-related.
The downside of this approach is that you first retrieve all related objects from the DB and then only use the first one.
The above code is not tested, but I hope it will at least point you towards the right direction.
I have a memory issue with mongoengine (in python).
Let's say I have a very large amount of custom_documents (several thousands).
I want to process them all, like this:
for item in custom_documents.objects():
    process(item)
The problem is that custom_documents.objects() loads every object into memory, and my app uses several GB ...
What can I do to make it more memory-efficient?
Is there a way to make mongoengine query the DB lazily (requesting objects only as we iterate over the queryset)?
According to the docs (and in my experience), collection.objects returns a lazy QuerySet. Your first problem might be that you're calling the objects attribute rather than just using it as an iterable. I feel like there must be some other reason your app is using so much memory; perhaps process(item) stores a reference to each document somehow? Try the following code and check your app's memory usage:
queryset = custom_documents.objects
print(queryset.count())
Since QuerySets are lazy, you can also do things like custom_documents.objects.skip(500).limit(100) in order to return only objects 500-600.
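As a minimal sketch of chunked processing under those assumptions (custom_documents is the Document class from the question, process() is your own function):
chunk_size = 100
total = custom_documents.objects.count()

# Slicing a mongoengine QuerySet maps to skip/limit on the underlying cursor,
# so only one chunk of documents is materialised at a time.
for offset in range(0, total, chunk_size):
    for item in custom_documents.objects[offset:offset + chunk_size]:
        process(item)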
I think you want to look at querysets - these are the MongoEngine wrapper for cursors:
http://mongoengine.org/docs/v0.4/apireference.html#querying
They let you control the number of objects returned, essentially taking care of the batch size settings etc. that you can set directly in the pymongo driver:
http://api.mongodb.org/python/current/api/pymongo/cursor.html
Cursors are generally set up to behave this way by default; you have to go out of your way to get them to return everything in one shot, even in the native MongoDB shell.
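For reference, a hedged sketch of the same control at the pymongo level; the database and collection names are placeholders, and process() stands in for your own code:
from pymongo import MongoClient

client = MongoClient()                      # assumes a local mongod
collection = client['mydb']['custom_documents']

# find() returns a lazy Cursor; batch_size() controls how many documents the
# server sends per round trip, so memory stays bounded while you iterate.
for doc in collection.find({}).batch_size(100):
    process(doc)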
I'm generating a feed that merges the comments of many users, so your feed might be of comments by user1+user2+user1000 whereas mine might be user1+user2. So I have the line:
some_comments = Comment.gql("WHERE username IN :1",user_list)
I can't just memcache the whole thing since everyone will have different feeds, even if the feeds for user1 and user2 would be common to many viewers. According to the documentation:
...the IN operator executes a separate underlying datastore query for every item in the list. The entities returned are a result of the cross-product of all the underlying datastore queries and are de-duplicated. A maximum of 30 datastore queries are allowed for any single GQL query.
Is there a library function to merge some sorted and cached queries, or am I going to have to:
for user in user_list:
    if memcached(user):
        add it to the results
    else:
        add Comment.gql("WHERE username = :1", user) to the results
        cache it too
sort the results
(In the worst case, where nothing is cached, I expect sending 30 separate GQL queries is slower than one giant IN query.)
There's nothing built-in to do this, but you can do it yourself, with one caveat: If you do an in query and return 30 results, these will be the 30 records that sort lowest according to your sort criteria across all the subqueries. If you want to assemble the resultset from cached individual queries, though, either you are going to have to cache as many results for each user as the total result set (eg, 30), and throw away most of those results, or you're going to have to store fewer results per user, and accept that sometimes you'll throw away newer results from one user in favor of older results from another.
That said, here's how you can do this:
Do a memcache.get_multi to retrieve cached result sets for all the users
For each user that doesn't have a result set cached, execute the individual query, fetching the most results you need. Use memcache.set_multi to cache the result sets.
Do a merge-join on all the result sets and take the top n results as your final result set. Because username is presumably not a list field (eg, every comment has a single author), you don't need to worry about duplicates.
Currently, in queries are executed serially, so this approach won't be any slower than executing an in query, even when none of the results are cached. This may change in future, though. If you want to improve performance now, you'll probably want to use Guido's NDB project, which will allow you to execute all the subqueries in parallel.
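A rough sketch of those steps, assuming the Comment model from the question, a created property to sort on, and a 'feed:' memcache key prefix (the last two are assumptions, not anything from the original post):
from google.appengine.api import memcache

def merged_feed(user_list, n=30):
    # Step 1: pull any cached per-user result sets in one round trip.
    cached = memcache.get_multi(user_list, key_prefix='feed:')

    # Step 2: run an individual query for each user with no cached feed,
    # fetching at most n results each, then cache the misses in one call.
    to_cache = {}
    for user in user_list:
        if user not in cached:
            results = Comment.gql(
                "WHERE username = :1 ORDER BY created DESC", user).fetch(n)
            cached[user] = results
            to_cache[user] = results
    if to_cache:
        memcache.set_multi(to_cache, key_prefix='feed:', time=60)

    # Step 3: combine the per-user lists and keep the overall newest n.
    # (A plain sort of the concatenated lists instead of a streaming
    # merge-join; with at most n * len(user_list) items that's cheap.)
    all_comments = [c for results in cached.values() for c in results]
    all_comments.sort(key=lambda c: c.created, reverse=True)
    return all_comments[:n]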
You can use memcache.get_multi() to see which of the users' feeds are already in memcache. Then use set().difference() on the original user list vs. the user list found in memcache to find out which feeds weren't retrieved. Then finally fetch the missing user feeds from the datastore in a batch get.
From there you can combine the two lists and, if it isn't too long, sort it in memory. If you're working on something Ajaxy, you could hand off sorting to the client.