Caching a queryset - Python

I'm working on an app that requires me to filter through a large number of records. I've been reading about caching QuerySets and related topics and found some good material.
qs = MyModel.objects.filter(Q(<initial_filter_to_narrow_down_size>))
After this, I want to put this qs in the cache for later use, so I can apply all the other filters without hitting the database. Something like:
cache.set('qs', qs)
But what happens when I do qs = qs.filter(q_object)? Will the cache be modified? I don't want that; I want qs to remain constant until I update it. What should I do in this case?

.filter() clones the queryset before applying the filter, so the cache will not be affected.
BTW, you might want to check out JohnnyCache ... a great app for queryset caching.
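A minimal sketch of that behaviour, reusing MyModel from the question (the field names and filters here are made up for illustration, and a configured cache backend is assumed):
from django.core.cache import cache
from django.db.models import Q

qs = MyModel.objects.filter(Q(created__year=2023))  # hypothetical narrowing filter
cache.set('qs', qs)  # pickling the queryset for the cache also evaluates it

cached = cache.get('qs')
narrower = cached.filter(name__startswith='a')  # .filter() returns a *new* QuerySet

# Only narrower carries the extra condition; the cached queryset is untouched.
assert str(cached.query) != str(narrower.query)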

What I have understood from your question is that you just need to get the queryset from the cache, as shown below:
your_cached_qs = cache.get('qs')
And then apply whatever filter you want:
your_new_qs = your_cached_qs.filter(further_filter)
This will not affect the queryset in the cache. Hence, your cache will remain unchanged until you update it yourself, and your desired result will be achieved.

Related

Does it matter in which order you use prefetch_related and filter in Django?

The title says it all.
Let's take a look at this code for example:
objs = Model.objects.prefetch_related('model2').filter()
objs.first().model2_set.first().field
vs
objs = Model.objects.filter().prefetch_related('model2')
objs.first().model2_set.first().field
Question
When using prefetch_related() first, does Django fetch all the ManyToOne/ManyToMany relations without taking .filter() into consideration, and only apply the filter after everything is fetched?
IMO, that doesn't matter since there's still one query executed at the end.
Thanks in advance.
It doesn't matter where you specify prefetch_related as long as it's before any records are fetched. Personally I put things like prefetch_related, select_related and only() at the end of the chain, but that just feels more expressive to me from a code-readability perspective.
But that is not true of all queryset methods; some do have different effects depending on their position in the chain. For example, order_by can have positional significance when used with distinct().
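As a rough sketch you can confirm this yourself (the model, lookup, and field names below are hypothetical, and connection.queries is only populated when DEBUG=True):
from django.db import connection, reset_queries

objs_a = Model.objects.prefetch_related('model2_set').filter(name__startswith='a')
objs_b = Model.objects.filter(name__startswith='a').prefetch_related('model2_set')

reset_queries()
list(objs_a)   # one filtered query for Model, plus one prefetch query for the related rows
queries_a = [q['sql'] for q in connection.queries]

reset_queries()
list(objs_b)   # the same two queries
queries_b = [q['sql'] for q in connection.queries]

assert len(queries_a) == len(queries_b) == 2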

Does Django fetch ALL database objects for the Paginator and make it inefficient?

from django.core.paginator import Paginator
from .models import blog
b = blog.objects.all()
p = Paginator(b, 2)
The above code is from the official Django Pagination Documentation.
Can somebody please explain to me how this is not inefficient if I have a table with many records? Doesn't the above code fetch everything from the DB and then just chop it down, defeating the purpose of pagination?
Can somebody please explain to me how this is not inefficient if I have a table with many records?
QuerySets are lazy. blog.objects.all() will not retrieve the records from the database unless you "consume" the QuerySet. You did not do that, and the paginator will not do it either. You "consume" a queryset by iterating over it, calling len(..) on it, converting it to a str(..)ing, and a few other operations. As long as you do not do that, the QuerySet is just an object that represents a "query" that Django will perform if necessary.
The Paginator will construct a QuerySet that is a sliced variant of the queryset you pass. It will first call queryset.count() to retrieve the number of objects, which makes a SELECT COUNT(*) FROM … query (so the elements themselves are not retrieved), and then use a sliced variant such as queryset[:2], which translates to SELECT … FROM … LIMIT 2. The count is performed to determine the number of pages that exist.
In short, it constructs new queries and performs these on the database; normally these limit the amount of data returned.
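A small sketch of what that looks like in practice (it assumes DEBUG=True so that connection.queries is populated, and reuses the blog model from the question):
from django.core.paginator import Paginator
from django.db import connection, reset_queries

from .models import blog

reset_queries()
p = Paginator(blog.objects.all(), 2)   # no query yet: the queryset is lazy
page1 = p.page(1)                      # triggers SELECT COUNT(*) FROM ...
objects = list(page1.object_list)      # triggers SELECT ... FROM ... LIMIT 2

for q in connection.queries:           # expect just the COUNT query and the LIMIT query
    print(q['sql'])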

Django prefetch related and exists

I am using prefetch_related when querying a model that has several m2m relationships:
qs = context.mymodel_set.prefetch_related('things1', 'things2', 'things3')
So that when I do this, there is no need to perform an additional query to get things1; they should have been fetched already:
r = list(qs)
r[0].things1.all()
But what if I do r[0].things1.exists()? Will this generate a new query? Or will it use the prefetched information? If it generates a new query, does that mean that going for r[0].things1.all() for the purposes of existence checking is more efficient?
PS: cached information being in desync with the database does not worry me for this particular question.
It's easy to check the queries that Django is running for yourself.
When I tried it, it appeared that obj.things.exists() did not cause any additional queries when things was prefetched.
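One way to check this yourself, as a sketch using the names from the question (CaptureQueriesContext records the SQL statements actually executed):
from django.db import connection
from django.test.utils import CaptureQueriesContext

qs = context.mymodel_set.prefetch_related('things1', 'things2', 'things3')
r = list(qs)  # the main query plus one prefetch query per lookup

with CaptureQueriesContext(connection) as captured:
    r[0].things1.all()     # served from the prefetch cache
    r[0].things1.exists()  # in my test this was also served from the prefetch cache

print(len(captured.captured_queries))  # expect 0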
To capture only objects that have a relation to things1, it can go in the query like this:
context.mymodel_set.prefetch_related(
    'things1', 'things2', 'things3'
).filter(
    things1__isnull=False
)

Differences between queryset_iterator and Django Paginator

I'm having a discussion with a coworker about iterating through large tables via the Django ORM. Up to now I've been using an implementation of a queryset_iterator as seen here:
import gc

def queryset_iterator(queryset, chunksize=1000):
    '''
    Iterate over a Django QuerySet ordered by the primary key.

    This method loads a maximum of chunksize (default: 1000) rows into
    memory at a time, while Django would normally load all rows into
    memory at once. Using the iterator() method only prevents Django from
    preloading all the model instances.

    Note that this implementation does not support ordered querysets.
    '''
    pk = 0
    last_pk = queryset.order_by('-pk')[0].pk
    queryset = queryset.order_by('pk')
    while pk < last_pk:
        for row in queryset.filter(pk__gt=pk)[:chunksize]:
            pk = row.pk
            yield row
        gc.collect()
My coworker suggested using Django's Paginator and passing the queryset into that. It seems like similar work would be done, and the only difference I can find is that the Paginator doesn't make any garbage-collection calls.
Can anybody shed light on the difference between the two? Is there any?
The implementation here is totally different to what Paginator does; there are almost no similarities at all.
Your function iterates through an entire queryset, requesting chunksize items at a time, with each chunk being a separate query. It can only be used on non-ordered querysets because it does its own order_by call.
Paginator does nothing like this. It is not for iterating over an entire queryset, but for returning a single page from the full qs, which it does with a single query using the slice operators (which map to LIMIT/OFFSET).
Separately, I'm not sure what you think calling gc.collect would do here. The garbage collector is an add-on to the main memory management system, which is reference counting. It is only useful in cleaning up circular references, and there is no reason to believe any would be created here.
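For comparison, here is a rough sketch of what iterating an entire queryset via Paginator could look like (a hypothetical helper, which slices with LIMIT/OFFSET per page rather than chunking by pk):
from django.core.paginator import Paginator

def paginator_iterator(queryset, chunksize=1000):
    # Paginator slices with LIMIT/OFFSET, so a stable ordering is needed.
    paginator = Paginator(queryset.order_by('pk'), chunksize)
    for page_number in paginator.page_range:
        for row in paginator.page(page_number).object_list:
            yield row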

Django. Remove select_related from queryset

Is there any way to remove select_related from a queryset?
I found that Django adds a JOIN to the SQL query for the count() operation.
So, if we have code like this:
entities = Entities.objects.select_related('subentity').all()
# We will have an INNER JOIN here
entities.count()
I'm looking for a way to remove the join.
One important detail: I pass this queryset into the Django paginator, so I can't simply write:
Entities.objects.all().count()
I believe this code comment provides a relatively good answer to the general question asked here:
If select_related(None) is called, the list is cleared.
https://github.com/django/django/blob/stable/1.8.x/django/db/models/query.py#L735
In the general sense, if you want to do something to the entities queryset but first remove the select_related items from it, call entities.select_related(None).
However, that probably doesn't solve your particular situation with the paginator. If you do entities.count(), it will already remove the select_related items. If you find yourself with extra JOINs taking place, it could be due to several non-ideal factors; it could be that the ORM fails to remove them because of other logic that may or may not affect the count when combined with the select_related.
As a simple example of one of these non-ideal cases, consider Foo.objects.select_related('bar').count() versus Foo.objects.select_related('bar').distinct().count(). It might be obvious to you that the original queryset does not contain multiple entries, but it is not obvious to the Django ORM. As a result, the SQL that executes contains a JOIN, and there is no universal prescription to work around that. Even applying .select_related(None) will not help you.
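As a small sketch (using the Entities/subentity names from the question), you can compare the generated SQL before and after clearing select_related:
entities = Entities.objects.select_related('subentity').all()

print(str(entities.query))                        # the SQL includes the JOIN on subentity
print(str(entities.select_related(None).query))   # the JOIN is dropped, unless something else requires it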
Can you show the code where you need this? I think refactoring is the best answer here.
If you want a quick answer: entities.query.select_related = False, but it's rather hacky (and don't forget to restore the value if you need select_related later).
