How do you split a Django queryset without evaluating it? - python

I am dealing with a queryset of over 5 million items (for batch ML purposes) and I need to split it so I can run multithreaded operations on the parts. I want to do this without evaluating the queryset: I only ever need to access each item once, so I don't want the caching of items that evaluation causes.
Is it possible to select the items into one queryset and split it without evaluating it? Or will I have to query for multiple querysets, using limits ([:size]), to achieve this behaviour?
N.B.: I am aware that an iterator can be used to cycle through a queryset without caching it, but my question is about how to split a queryset (if possible) so that I can then run an iterator on each of the resulting querysets.

Django provides a few classes that help you manage paginated data – that is, data that’s split across several pages, with “Previous/Next” links:
from django.core.paginator import Paginator
object_list = MyModel.objects.all()
paginator = Paginator(object_list, 10)  # Show 10 objects per page; you can choose any other value
for i in paginator.page_range:  # page_range is a property: a 1-based range of page numbers, e.g. 1, 2, 3, 4
    data = iter(paginator.get_page(i))
    # use data

If your Django version is 1.11 or lower (1.10, 1.9, and so on), use paginator.page(page_no), but be aware that it raises an InvalidPage exception when the page number is invalid or the page does not exist.
For versions <= 1.11, use the code below:
from django.core.paginator import Paginator
qs = MyModel.objects.all()
paginator = Paginator(qs, 20)
for page_no in paginator.page_range:
    current_page = paginator.page(page_no)
    current_qs = current_page.object_list
If you're using Django >= 2.0, prefer paginator.get_page(page_no), although paginator.page(page_no) still works.
For versions >= 2.0, use the code below:
from django.core.paginator import Paginator
qs = MyModel.objects.all()
paginator = Paginator(qs, 20)
for page_no in paginator.page_range:
    current_page = paginator.get_page(page_no)
    current_qs = current_page.object_list
The advantage of using paginator.get_page(page_no), according to the Django documentation, is as follows:
Return a valid page, even if the page argument isn't a number or isn't
in range.
While in the case of paginator.page(page_no), you have to handle the exception manually if page_no is not a number or is out of range.

Passing querysets to threads is not something I would recommend. I know the sort of thing you are trying to do and why, but it's best to pass some sort of parameter set to each thread and have the thread perform the partial query itself.
Working this way, your threads are independent of the calling code.
On a different note, if you are trying to use threads as a workaround for the lag caused by heavy DB queries, you might find transaction management a better route.
This link has some useful tips. I use this instead of threads.
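A minimal sketch of the "pass a param set to each thread" idea, assuming an integer primary key. The helper name split_pk_range and the worker shown in the comment are hypothetical, not from the answer; each worker would receive two plain integers and build its own queryset from them.

```python
def split_pk_range(first_pk, last_pk, parts):
    """Split the inclusive range [first_pk, last_pk] into at most `parts`
    half-open (lo, hi) ranges of nearly equal width."""
    if parts < 1:
        raise ValueError("parts must be >= 1")
    total = last_pk - first_pk + 1
    if total <= 0:
        return []
    size = -(-total // parts)  # ceiling division
    return [
        (lo, min(lo + size, last_pk + 1))
        for lo in range(first_pk, last_pk + 1, size)
    ]


# Hypothetical worker: the thread gets plain ints and runs its own partial query.
# def worker(lo, hi):
#     for obj in MyModel.objects.filter(pk__gte=lo, pk__lt=hi).iterator():
#         process(obj)

print(split_pk_range(1, 10, 3))  # [(1, 5), (5, 9), (9, 11)]
```

Ranges are by pk value, not by row count, so gaps in the pk sequence will make some ranges lighter than others.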

Yes you can, as from this gist
Per the updated answer:
import gc

from django.core.exceptions import ObjectDoesNotExist


def queryset_iterator(queryset, chunk_size=1000):
    """
    Iterate over a Django Queryset ordered by the primary key
    This method loads a maximum of chunk_size (default: 1000) rows in its
    memory at the same time while django normally would load all rows in its
    memory. Using the iterator() method only causes it to not preload all the
    classes.
    Note that the implementation of the iterator does not support ordered query sets.
    """
    try:
        # Highest pk tells us when to stop
        last_pk = queryset.order_by('-pk')[:1].get().pk
    except ObjectDoesNotExist:
        return
    pk = 0
    queryset = queryset.order_by('pk')
    while pk < last_pk:
        # Fetch the next chunk of rows whose pk is above the last one seen
        for row in queryset.filter(pk__gt=pk)[:chunk_size]:
            pk = row.pk
            yield row
        gc.collect()
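If whole chunks are wanted rather than single rows (for example to hand one chunk to each worker), the same pk-based idea might be sketched as below. This is an illustration, not the gist's code: iter_chunks, fetch_after and the fake table are assumed names, and the Django version of fetch_after appears only as a comment.

```python
from collections import namedtuple


def iter_chunks(fetch_after, chunk_size=1000):
    """Yield lists of rows; each list starts strictly after the previous list's last pk."""
    last_pk = None
    while True:
        rows = fetch_after(last_pk, chunk_size)
        if not rows:
            return
        yield rows
        last_pk = rows[-1].pk


# With Django, fetch_after might look something like this (assumption):
#   def fetch_after(last_pk, limit):
#       qs = MyModel.objects.order_by('pk')
#       if last_pk is not None:
#           qs = qs.filter(pk__gt=last_pk)
#       return list(qs[:limit])

# In-memory stand-in for a table, so the sketch runs without a database.
Row = namedtuple("Row", "pk")
TABLE = [Row(pk) for pk in (1, 2, 3, 5, 8, 13, 21)]


def fake_fetch_after(last_pk, limit):
    rows = [r for r in TABLE if last_pk is None or r.pk > last_pk]
    return rows[:limit]


chunks = [[r.pk for r in chunk] for chunk in iter_chunks(fake_fetch_after, chunk_size=3)]
print(chunks)  # [[1, 2, 3], [5, 8, 13], [21]]
```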

Related

When are Django Querysets executed in the view?

I read the Querysets Django docs regarding querysets being lazy, but still am a bit confused here.
So in my view, I set a queryset to a variable like so
players = Players.objects.filter(team=team)
Later on, I have a sorting mechanism that I can apply
sort = '-player_last_name'
if pts > 20:
    players = players.filter(pts__gte=pts).order_by(sort)
else:
    players = players.filter(pts__lte=pts).order_by(sort)
if ast < 5:
    players = players.filter(asts__lte=ast).order_by(sort)
else:
    players = players.filter(asts__gte=ast).order_by(sort)
context = {'players': players}
return render(request, '2021-2022/allstars.html', context)
What I want to know is, when is the players queryset evaluated? Is it when each page is rendered, or every time I assign the queryset to a variable? Because if it's the former, then I can just apply the .order_by(sort) call once at the end and the previous applications are redundant.
QuerySets are evaluated when you "consume" them. You consume a queryset by iterating over it, calling .get(…), .exists(…), .aggregate(…) or .count(…), checking its truthiness (for example with if myqueryset: or bool(myqueryset)), calling len(…) on it, etc. As a rule of thumb, it gets evaluated if you perform an action on it such that the result is no longer a QuerySet.
If you enumerate over it, or you call len(…) the result is cached in the QuerySet, so calling it a second time, or enumerating over it after you have called len(…) will not make another trip to the database.
In this specific case, none of the QuerySets are evaluated before you make the call to the render(…) function. If in the template you use, for example, {% if players %}, {% for player in players %}, {{ players|length }}, or {{ players.exists }}, then it will evaluate the queryset.
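To make the "evaluated on consumption, then cached" behaviour concrete, here is a toy model of it. LazyQuery is an invented class for illustration only; it is not how Django implements QuerySet, it just mimics the two properties described above.

```python
class LazyQuery:
    """Toy stand-in for a QuerySet: calls `fetch` only when data is needed."""

    def __init__(self, fetch, filters=()):
        self._fetch = fetch
        self._filters = tuple(filters)
        self._cache = None

    def filter(self, predicate):
        # Chaining returns a new, still unevaluated object.
        return LazyQuery(self._fetch, self._filters + (predicate,))

    def _evaluate(self):
        # First consumption hits the "database"; later ones reuse the cache.
        if self._cache is None:
            rows = self._fetch()
            self._cache = [r for r in rows if all(p(r) for p in self._filters)]
        return self._cache

    def __iter__(self):
        return iter(self._evaluate())

    def __len__(self):
        return len(self._evaluate())


hits = []  # records every simulated database round trip


def fetch():
    hits.append(1)
    return [5, 12, 25, 40]


q = LazyQuery(fetch).filter(lambda n: n > 10).filter(lambda n: n < 30)
print(len(hits))        # 0: building and chaining fetched nothing
print(list(q), len(q))  # [12, 25] 2
print(len(hits))        # 1: iterating and len() shared one fetch
```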
Django queries are designed to be "lazy", that is, they don't run the database query until something requests actual data. Until then, queries can be modified by adding filters and other similar functions.
For example, the following code requests all TeamMember objects when the search string is 'all', but otherwise adds a filter to restrict names to those matching the given search.
squad_list = TeamMember.objects.filter(state__in={"Hired", "Joined", "Staff", "Recruiting"})
if squadname != 'all':
    squad_list = squad_list.filter(squad__icontains=squadname.lower())
When the squad_list query is finally executed, it will retrieve only the required records. Does this help?

Django prefetch_related GenericForeignKey with multiple content types

I'm using django-activity-stream to display a list of recent events. For the sake of example these could be someone commenting or someone editing an article. I.e. the GenericForeignKey action_object could reference a Comment or an Article. I'd like to display a link to whatever the action_object is too:
<a href="{{ action.action_object.get_absolute_url }}">
{{ action.action_object }}
</a>
The problem is this causes queries for every single item, particularly as Comment.get_absolute_url requires the comment's article, which has not been fetched yet, and Article.__unicode__ requires its revision.content, which also hasn't been fetched.
django-activity-stream already calls prefetch_related('action_object') automatically (related discussion).
This appears to be working as testing with {{ action.action_object.id }} results in a single query per action_object_content_type, despite the docs saying:
It also supports prefetching of GenericRelation and GenericForeignKey, however, it must be restricted to a homogeneous set of results. For example, prefetching objects referenced by a GenericForeignKey is only supported if the query is restricted to one ContentType.
And there is more than one content type. However in my use case above I need extra prefetch_related calls, for example:
query = query.prefetch_related('action_object__article', 'action_object__revision')
But this complains because Articles don't have an __article (and would probably complain about Comments not having a __revision too if it got that far). I'm assuming this is what the docs are really referring to. So I thought I'd try this:
comments = query._clone().filter(action_object_content_type=comment_ctype).prefetch_related('action_object__article')
articles = query._clone().filter(action_object_content_type=article_ctype).prefetch_related('action_object__revision')
query = comments | articles
But the results are always empty. I guess querysets only support a single prefetch_related list and can't be joined like that.
I like a single queryset to return because further filtering is done later in the code which this part doesn't know about. Although once the queryset is finally evaluated I want to be able to have django fetch everything needed to render the events.
Is there another way?
I had a look at Prefetch objects but I don't think they offer any help in this situation.
A solution can be found in django-notify-x which is derived from django-notifications which, in turn, is derived from django-activity-stream. It makes use of a "django snippet" linked in the copied text below.
https://github.com/v1k45/django-notify-x/pull/19
Using a snippet from https://djangosnippets.org/snippets/2492/,
prefetch generic relations to reduce the number of queries.
Currently, we trigger one additional query for each generic relation
for each record, with this code, we reduce to one additional query for
each generic relation for each type of generic relation used.
If all your notifications are related to a Badges model, only one
aditional query will be triggered.
For Django 1.10 and 1.11, I am using the snippet above modified as below (just in case you are not using django-activity-stream):
from django.contrib.contenttypes.models import ContentType
from django.contrib.contenttypes import fields as generic
def get_field_by_name(meta, fname):
    return [f for f in meta.get_fields() if f.name == fname]
def prefetch_relations(weak_queryset):
    weak_queryset = weak_queryset.select_related()
    # reverse model's generic foreign keys into a dict:
    # { 'field_name': generic.GenericForeignKey instance, ... }
    gfks = {}
    for name, gfk in weak_queryset.model.__dict__.items():
        if not isinstance(gfk, generic.GenericForeignKey):
            continue
        gfks[name] = gfk
    data = {}
    for weak_model in weak_queryset:
        for gfk_name, gfk_field in gfks.items():
            related_content_type_id = getattr(
                weak_model,
                get_field_by_name(gfk_field.model._meta, gfk_field.ct_field)[0].get_attname())
            if not related_content_type_id:
                continue
            related_content_type = ContentType.objects.get_for_id(related_content_type_id)
            related_object_id = int(getattr(weak_model, gfk_field.fk_field))
            if related_content_type not in data.keys():
                data[related_content_type] = []
            data[related_content_type].append(related_object_id)
    for content_type, object_ids in data.items():
        model_class = content_type.model_class()
        models = prefetch_relations(model_class.objects.filter(pk__in=object_ids))
        for model in models:
            for weak_model in weak_queryset:
                for gfk_name, gfk_field in gfks.items():
                    related_content_type_id = getattr(
                        weak_model,
                        get_field_by_name(gfk_field.model._meta, gfk_field.ct_field)[0].get_attname())
                    if not related_content_type_id:
                        continue
                    related_content_type = ContentType.objects.get_for_id(related_content_type_id)
                    related_object_id = int(getattr(weak_model, gfk_field.fk_field))
                    if related_object_id != model.pk:
                        continue
                    if related_content_type != content_type:
                        continue
                    setattr(weak_model, gfk_name, model)
    return weak_queryset
This is giving me the intended results.
EDIT:
To use it, you simply call prefetch_relations, with your QuerySet as the argument.
For example, instead of:
my_objects = MyModel.objects.all()
you can do this:
my_objects = prefetch_relations(MyModel.objects.all())

Efficient pagination and database querying in django

There were some code examples for django pagination which I used a while back. I may be wrong but when looking over the code it looks like it wastes tons of memory. I was looking for a better solution, here is the code:
# in views.py
from django.core.paginator import Paginator, EmptyPage, PageNotAnInteger
...
...
def someView(request):
    models = Model.objects.order_by('-timestamp')
    paginator = Paginator(models, 7)
    pageNumber = request.GET.get('page')
    try:
        paginatedPage = paginator.page(pageNumber)
    except PageNotAnInteger:
        pageNumber = 1
    except EmptyPage:
        pageNumber = paginator.num_pages
    models = paginator.page(pageNumber)
    return render_to_resp ( ..... models ....)
I'm not sure of the subtleties of this code, but from what it looks like, the first line retrieves every single model from the database and pushes it into memory. Then it is passed into Paginator, which chunks it up based on which page the user is on, taken from an HTML GET. Is Paginator somehow making this acceptable, or is this totally memory inefficient? If it is inefficient, how can it be improved?
Also, a related topic. If someone does:
Model.objects.all()[:40]
Does this code mean that all models are pushed into memory, and we splice out 40 of them? Which is bad. Or does it mean that we query and push only 40 objects into memory period?
Thank you for your help!
mymodel.objects.all() yields a queryset, not a list. Querysets are lazy - no request is issued and nothing done until you actually try to use them. Also slicing a query set does not load the whole damn thing in memory only to get a subset but adds limit and offset to the SQL query before hitting the database.
There is nothing memory inefficient when using paginator. Querysets are evaluated lazily. In your call Paginator(models, 7), models is a queryset which has not been evaluated till this point. So, till now database hasn't been hit. Also no list containing all the instances of model is in the memory at this point.
When you want to get a page i.e at paginatedPage = paginator.page(pageNumber), slicing is done on this queryset, only at this point the database is hit and database returns you a queryset containing instances of model. And then slicing only returns the objects which should be there on the page. So, only the sliced objects will go in a list which will be there in the memory. Say on one page you want to show 10 objects, only these 10 objects will stay in the memory.
When someone does:
Model.objects.all()[:40]
Slicing a queryset returns a new queryset, still unevaluated, with a LIMIT of 40 added to the SQL. When it is evaluated, only those 40 objects are fetched and held in memory. No list containing all the instances of Model is ever created.
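The slice arithmetic behind a page can be sketched in a few lines. This is a simplified illustration of the idea (it ignores Paginator's orphans option and page-number validation), and page_bounds is a made-up helper name rather than part of Django.

```python
def page_bounds(number, per_page, count):
    """Return (start, stop) slice indexes for the 1-based page `number`."""
    start = (number - 1) * per_page
    stop = min(start + per_page, count)
    return start, stop


start, stop = page_bounds(3, 7, 17)
print(start, stop)  # 14 17

# queryset[start:stop] on an unevaluated queryset then corresponds roughly to
#   ... LIMIT (stop - start) OFFSET start
# so only the rows of that one page are fetched.
```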
Using the above information I came up with a view function decorator. The json_list_objects helper converts Django objects into JSON-ready Python dicts of the known relationship fields of those objects and returns the jsonified list as {count: results: }.
Others may find it useful.
from functools import wraps


def with_paging(fn):
    """
    Decorator providing paging behavior. It is for decorating a function that
    takes a request and other arguments and returns the appropriate query
    doing select and filter operations. The decorator adds paging by examining
    the QueryParams of the request for page_size (default 2000) and
    page_num (default 0). The query supplied is used to return the appropriate
    slice.
    """
    @wraps(fn)
    def inner(request, *args, **kwargs):
        page_size = int(request.GET.get('page_size', 2000))
        page_num = int(request.GET.get('page_num', 0))
        query = fn(request, *args, **kwargs)
        # Slicing adds LIMIT/OFFSET, so only one page of rows is fetched
        start = page_num * page_size
        end = start + page_size
        data = query[start:end]
        total_size = query.count()
        return json_list_objects(data, overall_count=total_size)
    return inner

Only process subset of a queryset in Django (and return original queryset)

I have a generic list view in Django which returns around 300 objects (unless a search is performed).
I have pagination set to only display 10 of the objects.
So I query the database, then iterate through the objects, processing them and adding extra values before the display. I noticed that all 300 objects get the processing done, and the pagination is done after the processing. So I want to only do the processing on the objects that are going to displayed to increase performance.
I calculate the indexes of the objects in the queryset that should be processed 0-10 for page 1, 11-20 for page 2, 21-30 for page 3, etc.
Now I want to process only the objects in the display range but return the full queryset (so the generic view works as expected).
Initially I tried:
for object in queryset[slice_start:slice_end]:
    # process things here
return queryset
But the slicing seems to return a new queryset, and the original queryset objects do not have any of the calculated values.
Currently my solution is:
index = -1
for object in queryset:
    index += 1
    if index < slice_start or index > slice_end:
        continue
    # process things here
return queryset
Now this works, but it seems rather hacky and inelegant for Python.
Is there a more pythonic way to do this?
If you're using Django's Paginator class (docs), then you can request the current page and iterate over those objects in the view:
from django.core.paginator import Paginator
# in the view:
p = Paginator(objects, 2)
page = p.page(current_page)
for o in page.object_list:
    # do processing
    pass
You can obtain the value for current_page from the request's parameters (e.g, a page parameter in request.GET).
You should do the processing on the results of page.object_list, which will be guaranteed to only contain the objects for that page.
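If the full collection still has to be returned, as in the question, one possible tidier form of the index-counting loop is itertools.islice. The sketch below uses plain objects so it runs without Django; process_window and add_extra are illustrative names. Applied to a queryset, iterating would still evaluate and cache all rows, so this only saves the per-object processing, not the query.

```python
from itertools import islice
from types import SimpleNamespace


def process_window(items, start, stop, process):
    """Apply `process` in place to items[start:stop] without copying them."""
    for item in islice(items, start, stop):
        process(item)
    return items


rows = [SimpleNamespace(name=f"row{i}", extra=None) for i in range(30)]


def add_extra(row):
    row.extra = row.name.upper()


process_window(rows, 10, 20, add_extra)
print(rows[9].extra, rows[10].extra, rows[19].extra, rows[20].extra)
# None ROW10 ROW19 None
```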

Django ORM: Selecting related set

Say I have 2 models:
class Poll(models.Model):
    category = models.CharField(u"Category", max_length = 64)
    [...]
class Choice(models.Model):
    poll = models.ForeignKey(Poll)
    [...]
Given a Poll object, I can query its choices with:
poll.choice_set.all()
But, is there a utility function to query all choices from a set of Poll?
Actually, I'm looking for something like the following (which is not supported, and I don't see how it could be):
polls = Poll.objects.filter(category = 'foo').select_related('choice_set')
for poll in polls:
    print(poll.choice_set.all())  # this shouldn't perform a SQL query at each iteration
I made an (ugly) function to help me achieve that:
def qbind(objects, target_name, model, field_name):
    objects = list(objects)
    objects_dict = dict([(object.id, object) for object in objects])
    for foreign in model.objects.filter(**{field_name + '__in': objects_dict.keys()}):
        id = getattr(foreign, field_name + '_id')
        if id in objects_dict:
            object = objects_dict[id]
            if hasattr(object, target_name):
                getattr(object, target_name).append(foreign)
            else:
                setattr(object, target_name, [foreign])
    return objects
which is used as follow:
polls = Poll.objects.filter(category = 'foo')
polls = qbind(polls, 'choices', Choice, 'poll')
# Now, each object in polls have a 'choices' member with the list of choices.
# This was achieved with 2 SQL queries only.
Is there something easier already provided by Django? Or at least, a snippet doing the same thing in a better way.
How do you handle this problem usually?
Time has passed and this functionality is now available in Django 1.4 with the introduction of the prefetch_related() QuerySet function. This function effectively does what is performed by the suggested qbind function. ie. Two queries are performed and the join occurs in Python land, but now this is handled by the ORM.
The original query request would now become:
polls = Poll.objects.filter(category = 'foo').prefetch_related('choice_set')
As is shown in the following code sample, the polls QuerySet can be used to obtain all Choice objects per Poll without requiring any further database hits:
for poll in polls:
    for choice in poll.choice_set.all():  # served from the prefetch cache
        print(choice)
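The "two queries, join in Python" idea can be illustrated without the ORM. In this sketch the Poll and Choice namedtuples are stand-ins for the rows returned by the two queries, and group_by_parent is a hypothetical helper; it shows the general technique, not Django's actual prefetch code.

```python
from collections import defaultdict, namedtuple

# Stand-ins for model rows (not the real Django models).
Poll = namedtuple("Poll", "id category")
Choice = namedtuple("Choice", "id poll_id text")

polls = [Poll(1, "foo"), Poll(2, "foo")]                  # result of query 1
choices = [Choice(10, 1, "a"), Choice(11, 2, "b"),
           Choice(12, 1, "c")]                            # result of query 2


def group_by_parent(children, key):
    """Bucket child rows by their parent id in a single pass."""
    grouped = defaultdict(list)
    for child in children:
        grouped[key(child)].append(child)
    return grouped


by_poll = group_by_parent(choices, key=lambda c: c.poll_id)
for poll in polls:
    print(poll.id, [c.text for c in by_poll[poll.id]])
# 1 ['a', 'c']
# 2 ['b']
```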
Update: Since Django 1.4, this feature is built in: see prefetch_related.
First answer: don't waste time writing something like qbind until you've already written a working application, profiled it, and demonstrated that N queries is actually a performance problem for your database and load scenarios.
But maybe you've done that. So second answer: qbind() does what you'll need to do, but it would be more idiomatic if packaged in a custom QuerySet subclass, with an accompanying Manager subclass that returns instances of the custom QuerySet. Ideally you could even make them generic and reusable for any reverse relation. Then you could do something like:
Poll.objects.filter(category='foo').fetch_reverse_relations('choices_set')
For an example of the Manager/QuerySet technique, see this snippet, which solves a similar problem but for the case of Generic Foreign Keys, not reverse relations. It wouldn't be too hard to combine the guts of your qbind() function with the structure shown there to make a really nice solution to your problem.
I think what you're saying is, "I want all Choices for a set of Polls." If so, try this:
polls = Poll.objects.filter(category='foo')
choices = Choice.objects.filter(poll__in=polls)
I think what you are trying to do is the term "eager loading" of child data - meaning you are loading the child list (choice_set) for each Poll, but all in the first query to the DB, so that you don't have to make a bunch of queries later on.
If this is correct, then what you are looking for is 'select_related' - see https://docs.djangoproject.com/en/dev/ref/models/querysets/#select-related
I noticed you tried 'select_related' but it didn't work. Can you try doing the 'select_related' and then the filter. That might fix it.
UPDATE: This doesn't work, see comments below.
