How to cache a paginated Django queryset

How do you cache a paginated Django queryset, specifically in a ListView?
I noticed one query was taking a long time to run, so I'm attempting to cache it. The queryset is huge (over 100k records), so I'm attempting to only cache paginated subsections of it. I can't cache the entire view or template because there are sections that are user/session specific and need to change constantly.
ListView has a couple standard methods for retrieving the queryset, get_queryset(), which returns the non-paginated data, and paginate_queryset(), which filters it by the current page.
I first tried caching the query in get_queryset(), but quickly realized calling cache.set(my_query_key, super(MyView, self).get_queryset()) was causing the entire query to be serialized.
So then I tried overriding paginate_queryset() like:
import time
from functools import partial
from django.core.cache import cache
from django.views.generic import ListView
class MyView(ListView):
    ...
    def paginate_queryset(self, queryset, page_size):
        cache_key = 'myview-queryset-%s-%s' % (self.page, page_size)
        print('paginate_queryset.cache_key: %s' % cache_key)
        t0 = time.time()
        ret = cache.get(cache_key)
        if ret is None:
            print('re-caching')
            ret = super(MyView, self).paginate_queryset(queryset, page_size)
            cache.set(cache_key, ret, 60*60)
        td = time.time() - t0
        print('paginate_queryset.time.seconds: %s' % td)
        (paginator, page, object_list, other_pages) = ret
        print('total objects: %s' % len(object_list))
        return ret
However, this takes almost a minute to run, even though only 10 objects are retrieved, and every request shows "re-caching", implying nothing is being saved to the cache.
My settings.CACHES looks like:
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
    }
}
and service memcached status shows memcached is running and tail -f /var/log/memcached.log shows absolutely nothing.
What am I doing wrong? What is the proper way to cache a paginated query so that the entire queryset isn't retrieved?
Edit: I think there may be a bug in either memcached or the Python wrapper. Django appears to support two different memcached backends, one using python-memcached and one using pylibmc. The python-memcached backend seems to silently hide the error raised when caching the paginate_queryset() value. When I switched to the pylibmc backend, I got an explicit error message "error 10 from memcached_set: SERVER ERROR" tracing back to django/core/cache/backends/memcached.py in set, line 78.

You can extend the Paginator to support caching by a provided cache_key.
A blog post describes the usage and implementation of such a CachedPaginator, and the source code was posted at djangosnippets.org (the original page no longer works, but a web archive copy exists).
However, I will post a slightly modified version of the original, which caches not only the objects per page but also the total count (sometimes even the count can be an expensive operation).
from django.core.cache import cache
from django.utils.functional import cached_property
from django.core.paginator import Paginator, Page, PageNotAnInteger
class CachedPaginator(Paginator):
    """A paginator that caches the results on a page by page basis."""
    def __init__(self, object_list, per_page, orphans=0, allow_empty_first_page=True, cache_key=None, cache_timeout=300):
        super(CachedPaginator, self).__init__(object_list, per_page, orphans, allow_empty_first_page)
        self.cache_key = cache_key
        self.cache_timeout = cache_timeout
    @cached_property
    def count(self):
        """
        The original django.core.paginator.count attribute in Django 1.8
        is not writable and can't be set manually, but we would like
        to override it when loading data from cache (instead of recalculating it).
        So we make it writable via @cached_property.
        """
        return super(CachedPaginator, self).count
    def set_count(self, count):
        """
        Override the paginator.count value (to prevent recalculation)
        and clear num_pages and page_range, whose values depend on it.
        """
        self.count = count
        # if somehow we have stored .num_pages or .page_range (which are cached properties)
        # this can lead to wrong page calculations (because they depend on paginator.count value)
        # so we clear their values to force recalculation on the next calls
        try:
            del self.num_pages
        except AttributeError:
            pass
        try:
            del self.page_range
        except AttributeError:
            pass
    @cached_property
    def num_pages(self):
        """This is not writable in Django 1.8. We want to make it writable."""
        return super(CachedPaginator, self).num_pages
    @cached_property
    def page_range(self):
        """This is not writable in Django 1.8. We want to make it writable."""
        return super(CachedPaginator, self).page_range
    def page(self, number):
        """
        Returns a Page object for the given 1-based page number.
        This will attempt to pull the results out of the cache first, based on
        the requested page number. If not found in the cache,
        it will pull a fresh list and then cache that result + the total result count.
        """
        if self.cache_key is None:
            return super(CachedPaginator, self).page(number)
        # In order to prevent counting the queryset
        # we only validate that the provided number is an integer.
        # The rest of the validation will happen when we fetch fresh data,
        # so if the number is invalid, nothing will be cached.
        # number = self.validate_number(number)
        try:
            number = int(number)
        except (TypeError, ValueError):
            raise PageNotAnInteger('That page number is not an integer')
        page_cache_key = "%s:%s:%s" % (self.cache_key, self.per_page, number)
        page_data = cache.get(page_cache_key)
        if page_data is None:
            page = super(CachedPaginator, self).page(number)
            # cache not only the objects, but the total count too.
            page_data = (page.object_list, self.count)
            cache.set(page_cache_key, page_data, self.cache_timeout)
        else:
            cached_object_list, cached_total_count = page_data
            self.set_count(cached_total_count)
            page = Page(cached_object_list, number, self)
        return page

The problem turned out to be a combination of factors. Mainly, the result returned by the paginate_queryset() contains a reference to the unlimited queryset, meaning it's essentially uncachable. When I called cache.set(mykey, (paginator, page, object_list, other_pages)), it was trying to serialize thousands of records instead of just the page_size number of records I was expecting, causing the cached item to exceed memcached's limits and fail.
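To illustrate why the cached tuple balloons, here is a minimal plain-Python sketch (no Django involved). FakePaginator is a hypothetical stand-in for a paginator that keeps a reference to the full object list, and the record dicts are made-up data; the only point is that pickling the paginator drags every record along, while pickling just the page does not.

```python
import pickle


class FakePaginator:
    """Hypothetical stand-in for a paginator holding the full object list."""

    def __init__(self, object_list, per_page):
        self.object_list = object_list
        self.per_page = per_page


# Made-up stand-in for a large result set; real model instances are bigger.
all_records = [{"id": i, "title": "record %d" % i} for i in range(100000)]
paginator = FakePaginator(all_records, per_page=10)
page_items = all_records[:10]

# Pickling the paginator serializes all 100,000 records it references.
whole_size = len(pickle.dumps((paginator, page_items)))
# Pickling only the page serializes 10 records.
page_only_size = len(pickle.dumps(page_items))

print("paginator + page: %d bytes" % whole_size)
print("page only: %d bytes" % page_only_size)
```

The exact byte counts depend on the data, but the first value grows with the size of the whole collection and the second only with the page size, which matches the behaviour described above.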
The other factor was the horrible default error reporting in memcached/python-memcached, which silently hides all errors and turns cache.set() into a no-op if anything goes wrong, making it very time-consuming to track down the problem.
I fixed this by essentially rewriting paginate_queryset() to ditch Django's builtin paginator functionality altogether and calculate the queryset myself with:
object_list = queryset[page_size*(page-1):page_size*(page-1)+page_size]
and then caching that object_list.
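As a rough sketch of that approach, assuming a dict stands in for the cache backend so the example runs without Django: get_cached_page and its key_prefix parameter are hypothetical names, and in a real view you would call cache.get and cache.set from django.core.cache instead of using the dict.

```python
def get_cached_page(items, page, page_size, cache, key_prefix="myview"):
    """Return one page of `items` as a plain list, cached by page number.

    `cache` is a plain dict here (an assumption for this sketch); with
    Django you would use django.core.cache.cache.get / cache.set instead.
    """
    cache_key = "%s-%s-%s" % (key_prefix, page, page_size)
    object_list = cache.get(cache_key)
    if object_list is None:
        start = page_size * (page - 1)
        # list() materializes only this slice, so the cached value holds
        # page_size objects and no reference to the full collection.
        object_list = list(items[start:start + page_size])
        cache[cache_key] = object_list
    return object_list


cache = {}  # stand-in for a real cache backend
records = list(range(1, 101))
first_call = get_cached_page(records, page=2, page_size=10, cache=cache)
second_call = get_cached_page(records, page=2, page_size=10, cache=cache)
```

The second call returns the stored list without slicing the collection again. Converting the slice to a list before caching is what keeps the cached value small.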

I wanted to paginate the infinite scrolling view on my home page, and this is the solution I came up with. It's a mix of the ListView code shown on CCBV and the asker's initial solution.
The response times, however, didn't improve as much as I had hoped, but that's probably because I am testing it locally with just 6 posts and 2 users.
# Import
from django.core.cache import cache
from django.core.paginator import InvalidPage
from django.http import Http404
from django.utils.translation import gettext as _
from django.views.generic.list import ListView
class MyListView(ListView):
    template_name = 'MY TEMPLATE NAME'
    model = MyPostModel  # placeholder: replace with your post model
    paginate_by = 10
    def paginate_queryset(self, queryset, page_size):
        """Paginate the queryset"""
        paginator = self.get_paginator(
            queryset, page_size, orphans=self.get_paginate_orphans(),
            allow_empty_first_page=self.get_allow_empty())
        page_kwarg = self.page_kwarg
        page = self.kwargs.get(page_kwarg) or self.request.GET.get(page_kwarg) or 1
        try:
            page_number = int(page)
        except ValueError:
            if page == 'last':
                page_number = paginator.num_pages
            else:
                raise Http404(_("Page is not 'last', nor can it be converted to an int."))
        try:
            page = paginator.page(page_number)
            cache_key = 'mylistview-%s-%s' % (page_number, page_size)
            retrieve_cache = cache.get(cache_key)
            if retrieve_cache is None:
                print('re-caching')
                retrieve_cache = super(MyListView, self).paginate_queryset(queryset, page_size)
                # Caching for 1 day
                cache.set(cache_key, retrieve_cache, 86400)
            return retrieve_cache
        except InvalidPage as e:
            raise Http404(_('Invalid page (%(page_number)s): %(message)s') % {
                'page_number': page_number,
                'message': str(e)
            })

Related

Django pagination: EmptyPage: That page contains no results

When using Django CBV ListView with pagination:
class Proposals(ListView):
    model = Proposal
    ordering = "id"
    paginate_by = 10
In the browser, if I provide a page that is out of range, I get an error:
I would like to have a different behaviour: to fallback to the last existing page if the provided page is out of range.
I dug into Django source code paginator.py file and was surprised to find some code that does exactly this:
So using paginator.get_page(page) (and not paginator.page(page)) would be the way to go. However, ListView does not use it as you can see here:
What is the best way to deal with this?
Thanks.

The only solution I found is by overriding the paginate_queryset method.
However I don't like it as I'm forced to rewrite the whole logic while I just want to change a single line.
Open to any better suggestion.
from django.core.paginator import InvalidPage
from django.http import Http404
from django.utils.translation import gettext as _
from django.views.generic import ListView
class PermissivePaginationListView(ListView):
    def paginate_queryset(self, queryset, page_size):
        """
        This is an exact copy of the original method, just changing `page` to `get_page` to prevent errors with out-of-range pages.
        This is useful with HTMX, when the last row of the table is deleted, as the current page in the URL is no longer valid because there is no result in it.
        """
        paginator = self.get_paginator(
            queryset,
            page_size,
            orphans=self.get_paginate_orphans(),
            allow_empty_first_page=self.get_allow_empty(),
        )
        page_kwarg = self.page_kwarg
        page = self.kwargs.get(page_kwarg) or self.request.GET.get(page_kwarg) or 1
        try:
            page_number = int(page)
        except ValueError:
            if page == "last":
                page_number = paginator.num_pages
            else:
                raise Http404(_("Page is not “last”, nor can it be converted to an int."))
        try:
            page = paginator.get_page(page_number)
            return (paginator, page, page.object_list, page.has_other_pages())
        except InvalidPage as e:
            raise Http404(
                _("Invalid page (%(page_number)s): %(message)s")
                % {"page_number": page_number, "message": str(e)}
            )
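For readers who want to see the fallback rule in isolation, here is a simplified plain-Python sketch of how I understand get_page to pick a page number: input that is not an integer falls back to page 1, and a number outside the valid range falls back to the last page. resolve_page_number is a hypothetical helper written for this illustration, not part of Django, and it ignores edge cases such as an empty paginator.

```python
def resolve_page_number(raw_number, num_pages):
    """Bad input -> page 1, out-of-range number -> last page (simplified)."""
    try:
        number = int(raw_number)
    except (TypeError, ValueError):
        return 1
    if number < 1 or number > num_pages:
        return num_pages
    return number


# A few requests against a paginator that has 5 pages.
for raw in ("3", "99", "abc", None, 0):
    print(repr(raw), "->", resolve_page_number(raw, 5))
```

With this rule a stale URL such as ?page=99 lands on the last existing page instead of raising a 404, which is the behaviour the question asks for.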

Django; Tips for making views thin

I'm a beginner and I have been working on a project using Django.
I am wondering if there's a good way to avoid repeating the same code.
Also, if there is similar logic in several functions, how can I decide whether that logic should be factored out?
For example:
def entry_list(request):
    entry_list = Entry.objects.all()
    #this part is repeated
    page = request.GET.get('page', 1)
    paginator = Paginator(entry_list, 10)
    try:
        entries = paginator.page(page)
    except PageNotAnInteger:
        entries = paginator.page(1)
    except EmptyPage:
        entries = paginator.page(paginator.num_pages)
    return render(request, 'blog/entry_list.html', {'entries': entries})
The logic to paginate is repeated in some other functions as well.
Where should I put the repeated code and how can I decide if I should organize code?

Using function-based views
You could encapsulate it in another function (for example construct such function in a file named utils.py):
# in app/utils.py
from django.core.paginator import Paginator, PageNotAnInteger, EmptyPage
def get_page_entries(entry_list, page, per_page=10):
    paginator = Paginator(entry_list, per_page)
    try:
        return paginator.page(page)
    except PageNotAnInteger:
        return paginator.page(1)
    except EmptyPage:
        return paginator.page(paginator.num_pages)
You can then use it like:
# app/views.py
from app.utils import get_page_entries
def entry_list(request):
    entry_list = Entry.objects.all()
    entries = get_page_entries(entry_list, request.GET.get('page', 1))
    return render(request, 'blog/entry_list.html', {'entries': entries})
You can provide an optional third parameter with the number of elements per page. If not provided, it will by default be 10.
Or we can encapsulate the request.GET.get(..) logic as well, like:
# in app/utils.py
from django.core.paginator import Paginator, PageNotAnInteger, EmptyPage
def get_page_entries(entry_list, querydict, per_page=10, key='page', default_page=1):
    page = querydict.get(key, default_page)
    paginator = Paginator(entry_list, per_page)
    try:
        return paginator.page(page)
    except PageNotAnInteger:
        return paginator.page(default_page)
    except EmptyPage:
        return paginator.page(paginator.num_pages)
and thus call it with:
# app/views.py
from app.utils import get_page_entries
def entry_list(request):
    return render(request, 'blog/entry_list.html', {
        'entries': get_page_entries(Entry.objects.all(), request.GET)
    })
Using class-based views
You however do not need to use function-based views. This use case is covered by the ListView class [Django-doc]:
class EntryListView(ListView):
    model = Entry
    template_name = 'blog/entry_list.html'
    context_object_name = 'entries'
    paginate_by = 10
and then in the urls.py, use EntryListView.as_view() instead of the function, so:
# app/urls.py
from django.urls import path
from app.views import EntryListView
urlpatterns = [
    path('entries', EntryListView.as_view(), name='entry-list'),
]
Not only did we reduce the number of lines of code, this is also a more declarative way of developing: instead of specifying how we want to do something, we specify what we want to do. How it is done is left to Django's implementation of ListView. Furthermore, by providing the settings as attributes of the class, we can easily develop more tooling that takes these parameters into account.

I asked myself the same questions for several things including:
functions I might use many times in my views,
messages I would use many times in my views but also tests (in order to test behavior of my apps),
exceptions that could be used in many models or views,
choices when defining CharFields whose value can only be one of a specific set,
context values that could be used in many templates,
etc.
Here is how I work, for instance:
functions: a utils.py file, as Willem said, does the trick,
messages: I've got a customMessages.py file for that, where I define simple messages (e.g. SUCCESS_M_CREATE_OBJECT=_("You created this object successfully."), then used by calling customMessages.SUCCESS_M_CREATE_OBJECT in my models, tests or views) or more complex ones with variable fields, defined by a function and called with lambda in my tests,
context: by defining dedicated functions in my context_processors.py that return dictionaries of context vars and registering them in my settings.py, after which I can use any of them in my templates,
etc.
Well, there is always a way to not repeat yourself for anything you do in Python and Django, and you are right to ask yourself those questions at any time in your development, because your future self will thank you for that!

Django save behaving randomly

I have a Story model with a M2M relationship to some Resource objects. Some of the Resource objects are missing a name so I want to copy the title of the Story to the assigned Resource objects.
Here is my code:
from collector import models
from django.core.paginator import Paginator
paginator = Paginator(models.Story.objects.all(), 1000)
def fix_issues():
    for page in range(1, paginator.num_pages + 1):
        for story in paginator.page(page).object_list:
            name_story = story.title
            for r in story.resources.select_subclasses():
                if r.name != name_story:
                    r.name = name_story
                    r.save()
                if len(r.name) == 0:
                    print("Something went wrong: " + name_story)
        print("done processing page %s out of %s" % (page, paginator.num_pages))
fix_issues()
I need to use a paginator because I'm dealing with a million objects. The weird part is that after calling fix_issues() about half of my resources that had no name, now have the correct name, while the other half still has no name. I can call fix_issues() again and again and every time more objects receive a name. This seems really weird to me, why would an object not be updated the first time but only the second time?
Additional information:
The "Something went wrong: " message is never printed.
I'm using select_subclasses from django-model-utils to iterate over all resources (any type).
The story.title is never empty.
No error message is printed, when I run these commands.
I did not override the save method of the Resource model (only the save method of the Story model).
I tried to use @transaction.atomic but the result was the same.
My Model:
class Resource(models.Model):
    name = models.CharField(max_length=200)
    # Important for retrieving the correct subtype.
    objects = InheritanceManager()
    def __str__(self):
        return str(self.name)
class CustomResource(Resource):
    homepage = models.CharField(max_length=3000, default="", blank=True, null=True)
class Story(models.Model):
    url = models.URLField(max_length=3000)
    resources = models.ManyToManyField(Resource)
    popularity = models.FloatField()
    def _update_popularity(self):
        self.popularity = 3
    def save(self, *args, **kwargs):
        super(Story, self).save(*args, **kwargs)
        self._update_popularity()
        super(Story, self).save(*args, **kwargs)
Documentation for the select_subclasses:
http://django-model-utils.readthedocs.io/en/latest/managers.html#inheritancemanager
Further investigation:
I thought that maybe select_subclasses did not return all the objects. Right now every story has exactly one resource. So it was easy enough to check that select_subclasses always returns one item. This is the function I used:
def find_issues():
    for page in range(1, paginator.num_pages + 1):
        for story in paginator.page(page).object_list:
            assert(len(story.resources.select_subclasses()) == 1)
        print("done processing page %s out of %s" % (page, paginator.num_pages))
But again, this executes without any problems, so I don't think select_subclasses is to blame. I also checked whether paginator.num_pages is right, and it is. If I divide by 1000 (items per page) I get exactly the number of stories I have in my database.

I think I know what is happening:
The Paginator loads a Queryset and gives me the first n items. I process these and update some of the values. But for the next iteration the order of the items in the queryset changes (because I updated some of them and did not define an order). So I'm skipping over items that are now on the first page. I can avoid it by specifying an order (pk for example).
If you think I'm wrong, please let me know. Otherwise I will accept this as the correct answer. Thank you.
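The effect can be reproduced without a database. The following is a plain-Python simulation, not Django code: fetch_page and fix_names are hypothetical helpers, and sorting by an "updated" counter is only an assumed stand-in for whatever order the database happens to return rows in after a write.

```python
import itertools


def fetch_page(rows, page, size, order_key):
    """Return one page of rows sorted by order_key, like LIMIT/OFFSET."""
    ordered = sorted(rows, key=order_key)
    start = (page - 1) * size
    return ordered[start:start + size]


def fix_names(order_key, total=8, size=2):
    """Copy title to name page by page; return how many rows got a name."""
    clock = itertools.count(1)
    rows = [{"pk": i, "title": "story %d" % i, "name": "", "updated": 0}
            for i in range(total)]
    num_pages = -(-total // size)  # ceiling division
    for page in range(1, num_pages + 1):
        for row in fetch_page(rows, page, size, order_key):
            row["name"] = row["title"]
            row["updated"] = next(clock)  # the write changes the row's position
    return sum(1 for row in rows if row["name"])


# Order shifts after each write: updated rows move to the end.
fixed_unstable = fix_names(order_key=lambda row: row["updated"])
# Order pinned to the primary key: pages never shift.
fixed_stable = fix_names(order_key=lambda row: row["pk"])
print(fixed_unstable, fixed_stable)
```

In this simulation the shifting order names only 4 of the 8 rows in one pass, while ordering by pk names all 8. In the Django code above, the equivalent change would be to build the paginator from models.Story.objects.order_by('pk').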

Django Rest Framework 3.1 breaks pagination.PaginationSerializer

I just updated to Django Rest Framework 3.1 and it seems that all hell broke loose.
In my serializers.py I had the following code:
class TaskSerializer(serializers.ModelSerializer):
    class Meta:
        model = task
        exclude = ('key', ...)
class PaginatedTaskSerializer(pagination.PaginationSerializer):
    class Meta:
        object_serializer_class = TaskSerializer
which was working just fine. Now with the release of 3.1 I can't find examples on how to do the same thing since PaginationSerializer is no longer there.
I have tried to subclass PageNumberPagination and use its default paginate_queryset and get_paginated_response methods but I can no longer get their results serialized.
In other words my problem is that I can no longer do this:
class Meta:
    object_serializer_class = TaskSerializer
Any ideas?
Thanks in advance

I think I figured it out (for the most part at least):
What we should have used from the very beginning is this:
Just use the built-in paginator and change your views.py to this:
from rest_framework.pagination import PageNumberPagination
class CourseListView(AuthView):
    def get(self, request, format=None):
        """
        Returns a JSON response with a listing of course objects
        """
        courses = Course.objects.order_by('name').all()
        paginator = PageNumberPagination()
        # From the docs:
        # The paginate_queryset method is passed the initial queryset
        # and should return an iterable object that contains only the
        # data in the requested page.
        result_page = paginator.paginate_queryset(courses, request)
        # Now we just have to serialize the data just like you suggested.
        serializer = CourseSerializer(result_page, many=True)
        # From the docs:
        # The get_paginated_response method is passed the serialized page
        # data and should return a Response instance.
        return paginator.get_paginated_response(serializer.data)
For the desired page size just set the PAGE_SIZE in settings.py:
REST_FRAMEWORK = {
    'DEFAULT_PAGINATION_CLASS': 'rest_framework.pagination.PageNumberPagination',
    'PAGE_SIZE': 15
}
You should be all set now with all the options present in the body of the response (count, next and back links) ordered just like before the update.
However there is one more thing that still troubles me: We should also be able to get the new html pagination controls which for some reason are missing for now...
I could definitely use a couple more suggestions on this...

I am not sure if this is the completely correct way to do it, but it works for my needs. It uses the Django Paginator and a custom serializer.
Here is my View Class that retrieves the objects for serialization
class CourseListView(AuthView):
    def get(self, request, format=None):
        """
        Returns a JSON response with a listing of course objects
        """
        courses = Course.objects.order_by('name').all()
        serializer = PaginatedCourseSerializer(courses, request, 25)
        return Response(serializer.data)
Here is the hacked together Serializer that uses my Course serializer.
from django.core.paginator import Paginator, PageNotAnInteger, EmptyPage
class PaginatedCourseSerializer():
    def __init__(self, courses, request, num):
        paginator = Paginator(courses, num)
        # query_params replaces the older QUERY_PARAMS alias in DRF 3.x
        page = request.query_params.get('page')
        try:
            courses = paginator.page(page)
        except PageNotAnInteger:
            courses = paginator.page(1)
        except EmptyPage:
            courses = paginator.page(paginator.num_pages)
        count = paginator.count
        previous = None if not courses.has_previous() else courses.previous_page_number()
        next = None if not courses.has_next() else courses.next_page_number()
        serializer = CourseSerializer(courses, many=True)
        self.data = {'count':count,'previous':previous,
                     'next':next,'courses':serializer.data}
This gives me a result that is similar to the behavior that the old paginator gave.
{
    "previous": 1,
    "next": 3,
    "courses": [...],
    "count": 384
}
I hope this helps. I still think there has got to be a better way to do this with the new API, but it's just not documented well. If I figure anything more out, I'll edit my post.
EDIT
I think I have found a better, more elegant way to do it by creating my own custom paginator to get behavior like I used to get with the old Paginated Serializer class.
This is a custom paginator class. I overloaded the response and next page methods to get the result I want (i.e. ?page=2 instead of the full url).
from rest_framework import pagination
from rest_framework.response import Response
from rest_framework.utils.urls import replace_query_param
class CustomCoursePaginator(pagination.PageNumberPagination):
    def get_paginated_response(self, data):
        return Response({'count': self.page.paginator.count,
                         'next': self.get_next_link(),
                         'previous': self.get_previous_link(),
                         'courses': data})
    def get_next_link(self):
        if not self.page.has_next():
            return None
        page_number = self.page.next_page_number()
        return replace_query_param('', self.page_query_param, page_number)
    def get_previous_link(self):
        if not self.page.has_previous():
            return None
        page_number = self.page.previous_page_number()
        return replace_query_param('', self.page_query_param, page_number)
Then my course view is very similar to how you implemented it, only this time using the Custom paginator.
class CourseListView(AuthView):
    def get(self, request, format=None):
        """
        Returns a JSON response with a listing of course objects
        """
        courses = Course.objects.order_by('name').all()
        paginator = CustomCoursePaginator()
        result_page = paginator.paginate_queryset(courses, request)
        serializer = CourseSerializer(result_page, many=True)
        return paginator.get_paginated_response(serializer.data)
Now I get the result that I'm looking for.
{
    "count": 384,
    "next": "?page=3",
    "previous": "?page=1",
    "courses": []
}
I am still not certain about how this works for the Browsable API (I don't use this feature of DRF). I think you can also create your own custom class for this. I hope this helps!

I realize over a year has passed since this was posted but hoping this helps others. The response to my similar question was the solution for me. I am using DRF 3.2.3.
Django Rest Framework 3.2.3 pagination not working for generics.ListCreateAPIView
Seeing how it was implemented gave me the solution needed to get pagination + the controls in the visible API.
https://github.com/tomchristie/django-rest-framework/blob/master/rest_framework/mixins.py#L39

Will this cost me two queries for the same thing?

I have the normal pagination like this in my view:
paginator = Paginator(book_list, 100)
And then in my view I am passing the values to my template:
return render(request,
...
'paginator': paginator,
...
And I have a custom tag for my pagination, which I am loading like this:
{% if paginator.count > paginator.per_page %}
{% load paginator %}
{% paginator 3 %}
{% endif %}
In my custom template pagination tag, I have the following along the code:
def paginator(context, adjacent_pages=2):
    page_obj = context['paginator'].page(context['object_list'].number)
    ...
    'hits': context['paginator'].count,
    ...
Everything is working as expected, but I am worried about context['paginator'].page(context['object_list'].number). Is Django fetching the data from the DB again with this call, or is it using the same data that was fetched in my main view?
Please advise. Thanks.

The paginator keeps the queryset as object_list. In Django 1.3.4, the page method is:
def page(self, number):
    "Returns a Page object for the given 1-based page number."
    number = self.validate_number(number)
    bottom = (number - 1) * self.per_page
    top = bottom + self.per_page
    if top + self.orphans >= self.count:
        top = self.count
    return Page(self.object_list[bottom:top], number, self)
Only the last line touches the database:
self.object_list[bottom:top]
The object_list is just a QuerySet, so the question becomes whether invoking query_set[x:y] several times results in multiple queries.
Django's QuerySet is lazy: if you don't iterate over it, no SQL is triggered. Each time you evaluate a new slice, though, a database query is issued.
You can check django.db.connection.queries with the following code:
from django.db import connection
original = XXXX.objects.filter(...)
res1 = original[x:y]
for item in res1:
    print(item)
print('%s %s' % (len(connection.queries), connection.queries[-1]))
res2 = original[x:y]
for item in res2:
    print(item)
print('%s %s' % (len(connection.queries), connection.queries[-1]))
You'll find that the number of recorded queries grows.

My understanding here is that it's simply using whatever object you passed it in your main view. context['paginator'] is going to return the object stored in the paginator variable that you passed to the context, an instance of the Paginator class.
The question of whether or not it's going back to the database is simply about the .page(...) method. If calling Paginator.page(...) issues a database query, then it will be going back to the database--it wouldn't cache that value. However, if that information is already available locally in the paginator variable and that is what is called up by the .page method, then you're not refetching the data from the database.
