Efficient pagination and database querying in django - python

A while back I used some code examples for Django pagination. Looking over that code again, it appears to waste a lot of memory, though I may be wrong. I was looking for a better solution. Here is the code:
# in views.py
from django.core.paginator import Paginator, EmptyPage, PageNotAnInteger
...
...
def someView(request):
    models = Model.objects.order_by('-timestamp')
    paginator = Paginator(models, 7)
    pageNumber = request.GET.get('page')
    try:
        paginatedPage = paginator.page(pageNumber)
    except PageNotAnInteger:
        pageNumber = 1
    except EmptyPage:
        pageNumber = paginator.num_pages
    models = paginator.page(pageNumber)
    return render_to_resp ( ..... models ....)
I'm not sure of the subtleties of this code, but from what it looks like, the first line retrieves every single model from the database and pushes them all into memory. Then it is passed into Paginator, which chunks it up based on which page the user is on, taken from an HTML GET parameter. Is Paginator somehow making this acceptable, or is this totally memory-inefficient? If it is inefficient, how can it be improved?
Also, a related topic. If someone does:
Model.objects.all()[:40]
Does this code mean that all models are pushed into memory and we splice out 40 of them (which would be bad)? Or does it mean that we query and push only 40 objects into memory, period?
Thank you for your help!

mymodel.objects.all() yields a queryset, not a list. Querysets are lazy: no query is issued and nothing is done until you actually try to use them. Also, slicing a queryset does not load the whole thing into memory only to get a subset; it adds LIMIT and OFFSET to the SQL query before hitting the database.
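A quick way to see this is to print the SQL Django generates before the queryset is ever evaluated; a minimal sketch, assuming a model named MyModel:
qs = MyModel.objects.all()[:40]   # still lazy: no database access yet
print(qs.query)                   # SELECT ... FROM ... LIMIT 40
objects = list(qs)                # the query runs here, fetching only 40 rows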

There is nothing memory-inefficient about using Paginator. Querysets are evaluated lazily. In your call Paginator(models, 7), models is a queryset that has not been evaluated yet. So at this point the database hasn't been hit, and no list containing all the instances of Model is in memory.
When you want to get a page, i.e. at paginatedPage = paginator.page(pageNumber), the queryset is sliced, and only at this point is the database hit. The database returns only the objects that should be on that page, so only the sliced objects go into a list in memory. Say you want to show 10 objects per page; only those 10 objects will stay in memory.
When someone does:
Model.objects.all()[:40]
slicing an unevaluated queryset does not copy everything into a list first: Django adds LIMIT 40 to the SQL query, so a list is created with only those 40 elements. No other list exists, so there won't be any list containing all the instances of Model in memory.
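To convince yourself, here is a hedged sketch (it assumes some model MyModel and DEBUG = True, since Django only records queries in debug mode):
from django.core.paginator import Paginator
from django.db import connection, reset_queries

reset_queries()
paginator = Paginator(MyModel.objects.order_by('pk'), 10)
page = paginator.page(1)          # issues one COUNT(*) query to validate the page number
objects = list(page.object_list)  # issues one SELECT ... LIMIT 10 query
print(len(connection.queries))    # 2 queries total; never more than 10 rows in memory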

Using the above information I came up with a view-function decorator. The json_list_objects function (my own helper) converts Django objects to JSON-ready Python dicts of the known relationship fields of the objects, and returns the jsonified list as {count: ..., results: ...}.
Others may find it useful.
from functools import wraps

def with_paging(fn):
    """
    Decorator providing paging behavior. It is for decorating a function that
    takes a request and other arguments and returns the appropriate query
    doing select and filter operations. The decorator adds paging by examining
    the QueryParams of the request for page_size (default 2000) and
    page_num (default 0). The query supplied is used to return the appropriate
    slice.
    """
    @wraps(fn)
    def inner(request, *args, **kwargs):
        page_size = int(request.GET.get('page_size', 2000))
        page_num = int(request.GET.get('page_num', 0))
        query = fn(request, *args, **kwargs)
        start = page_num * page_size
        end = start + page_size
        data = query[start:end]
        total_size = query.count()
        return json_list_objects(data, overall_count=total_size)
    return inner
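A hypothetical usage sketch (Widget, its fields, and the URL are stand-ins; json_list_objects is the helper described above):
@with_paging
def list_widgets(request):
    # Return the unevaluated queryset; the decorator slices it lazily.
    return Widget.objects.filter(active=True).order_by('-created')

# GET /widgets/?page_size=50&page_num=2 would then return rows 100-149.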

Related

How do you effectively cache a large django queryset?

I am working on a Django project using a PostgreSQL database in which I need to run a sometimes complex query on a user profile table with ~1M records, and the dataset returned can be anywhere from 0 to all 1M records. My issue is that once I grab all of the records, I want to be able to filter them further to run analytics on these profiles. The analytics cannot be completed in the same request/response loop, as this will time out for all but the smallest querysets. So I am using async javascript to shoot off new requests for each type of analytics I want.
For example, I will grab all of the profiles in the initial request, and then I will send subsequent requests to take those profiles and return to me the % of genders or the % of users who have a certain job title, etc. The issue is that every subsequent request has to run the original query again. What I would love to do is somehow cache that query and then run my further queries on this subset of the profile table without having to execute the original, long-running query.
I have tried to use a cache function to cache the queryset itself, but this actually slows down performance a bit, I assume because the queryset has to be pickled or unpickled and then it still has to run? I also tried to cache a list of ids from the parent long query (this is potentially a VERY long list, up to 1M integers), and that grinds my system to a complete halt for anything more than about 44k records.
Has anyone dealt with this kind of issue before? I know that I could set up a worker/queue system, which is on my roadmap, but it would be lovely if there was a simple solution to this that utilizes the built-in capabilities of Django.
Some sample code:
def get_analytics(request):
    data_type = request.POST.get('data_type')
    query_params = request.POST.get('query_params') # a crazy filter with lots of Q objects
    profile_ids = get_profiles(query_params) # I WANT TO CACHE THIS
    profiles = Profile.objects.filter(id__in=profile_ids).distinct()
    if data_type == 'overview':
        return profiles.count()
    elif data_type == 'gender':
        gender_breakdown = profiles.filter(a_filter_for_gender).values('gender').annotate(Count('gender', distinct=True))
        return gender_breakdown
def cache_function(length):
    """
    A variant of the snippet posted by Jeff Wheeler at
    http://www.djangosnippets.org/snippets/109/

    Caches a function, using the function and its arguments as the key, and
    the return value as the value saved. It passes all arguments on to the
    function, as it should.

    The decorator itself takes a length argument, which is the number of
    seconds the cache will keep the result around.
    """
    def decorator(func):
        def inner_func(*args, **kwargs):
            if hasattr(settings, 'IS_IN_UNITTEST'):
                return func(*args, **kwargs)
            key = get_cache_key(func.__name__, func.__module__, args, kwargs)
            value = cache.get(key)
            if value is not None:  # cache hit
                return value
            result = func(*args, **kwargs)
            cache.set(key, result, length)
            return result
        return inner_func
    return decorator
@cache_function(60*2)
def get_profiles(query_params):
    return Profile.objects.filter(query_params).values_list('id')
Why does caching the ids slow my system down? Is there a better way to accomplish this?

How do you split a Django queryset without evaluating it?

I am dealing with a queryset of over 5 million items (for batch ML purposes), and I need to split the queryset (so I can perform multithreaded operations) without evaluating it, since I only ever need to access each item once and therefore don't want the caching of items that evaluation causes.
Is it possible to select the items into one queryset and split it without evaluating? Or am I going to have to query for multiple querysets using limits ([:size]) to achieve this behaviour?
N.B.: I am aware that an iterator can be used to cycle through a queryset without evaluating it, but my question is about how I can split a queryset (if possible) so I can then run an iterator on each of the split querysets.
Django provides a few classes that help you manage paginated data – that is, data that’s split across several pages, with “Previous/Next” links:
from django.core.paginator import Paginator

object_list = MyModel.objects.all()
paginator = Paginator(object_list, 10) # Show 10 objects per page; any other value works too
for i in paginator.page_range: # a 1-based range of page numbers, e.g. [1, 2, 3, 4]
    data = iter(paginator.get_page(i))
    # use data
If your Django version is 1.11 or older (1.10, 1.9, and so on), use paginator.page(page_no), but be careful: this may raise an InvalidPage exception when an invalid page (or no page) is requested.
For versions <= 1.11, use the code below:
from django.core.paginator import Paginator

qs = MyModel.objects.all()
paginator = Paginator(qs, 20)
for page_no in paginator.page_range:
    current_page = paginator.page(page_no)
    current_qs = current_page.object_list
And if you're using Django version >= 2.0, please use paginator.get_page(page_no) instead, though paginator.page(page_no) is still available.
For versions >= 2.0, use the code below:
from django.core.paginator import Paginator

qs = MyModel.objects.all()
paginator = Paginator(qs, 20)
for page_no in paginator.page_range:
    current_page = paginator.get_page(page_no)
    current_qs = current_page.object_list
The advantage of using paginator.get_page(page_no), according to the Django documentation, is as follows:
Return a valid page, even if the page argument isn't a number or isn't in range.
With paginator.page(page_no), on the other hand, you have to handle the exception manually if page_no is not a number or is out of range, as in the sketch below.
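A minimal sketch of that manual handling (qs and page_no as above; InvalidPage is the common base class of PageNotAnInteger and EmptyPage):
from django.core.paginator import Paginator, InvalidPage

paginator = Paginator(qs, 20)
try:
    current_page = paginator.page(page_no)
except InvalidPage:  # covers PageNotAnInteger and EmptyPage
    current_page = paginator.page(1)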
Passing querysets to threads is not something I would recommend. I know the sort of thing you are trying to do and why, but it's best to just pass some sort of parameter set to each thread and then have the thread perform the partial query itself.
Working this way, your threads are distinct from the calling code.
On a different note, if you are trying to use threads as a workaround for the lag caused by heavy DB queries, you might find transaction management a better route.
This link has some useful tips; I use this approach instead of threads.
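A minimal sketch of the parameters-not-querysets idea (MyModel, the chunk size, and handle() are hypothetical; Django gives each thread its own database connection, which the worker closes when done):
from concurrent.futures import ThreadPoolExecutor
from django.db import connection

def process_range(start_pk, end_pk):
    # Each thread runs its own bounded query instead of sharing a queryset.
    for obj in MyModel.objects.filter(pk__gte=start_pk, pk__lt=end_pk).iterator():
        handle(obj)  # hypothetical per-object work
    connection.close()  # release this thread's connection

max_pk = MyModel.objects.order_by('-pk').values_list('pk', flat=True).first() or 0
chunk = 100000
ranges = [(start, start + chunk) for start in range(0, max_pk + 1, chunk)]
with ThreadPoolExecutor(max_workers=4) as pool:
    pool.map(lambda bounds: process_range(*bounds), ranges)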
Yes you can, as in this gist. Per the updated answer:
import gc

from django.core.exceptions import ObjectDoesNotExist

def queryset_iterator(queryset, chunk_size=1000):
    """
    Iterate over a Django Queryset ordered by the primary key.

    This method loads a maximum of chunk_size (default: 1000) rows in its
    memory at the same time, while Django normally would load all rows in
    its memory. Using the iterator() method only causes it to not preload
    all the classes.

    Note that the implementation of the iterator does not support ordered
    query sets.
    """
    try:
        last_pk = queryset.order_by('-pk')[:1].get().pk
    except ObjectDoesNotExist:
        return
    pk = 0
    queryset = queryset.order_by('pk')
    while pk < last_pk:
        for row in queryset.filter(pk__gt=pk)[:chunk_size]:
            pk = row.pk
            yield row
        gc.collect()
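Hypothetical usage, streaming a large table one chunk at a time (Profile and process() are stand-ins):
for profile in queryset_iterator(Profile.objects.all(), chunk_size=1000):
    process(profile)  # at most ~1000 rows are held in memory at once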

How to paginate in Flask-SQLAlchemy for db.session joined queries?

Say, we have the following relationships:
a person can have many email addresses
an email service provider can (obviously) serve multiple email addresses
So, it's a many-to-many relationship. I have three tables: emails, providers, and users. The emails table has two foreign keys, for provider and user.
Now, given a specific person, I want to print all the email providers and the email address each one hosts for this person, if it exists. (If the person does not have an email at Gmail, I still want Gmail to be in the result; I believe otherwise I would only need a plain inner join to solve this.)
I figured out how to do this with the following subqueries (following the sqlalchemy tutorial):
email_subq = db.session.query(Emails).\
    filter(Emails.user_id==current_user.id).\
    subquery()

provider_and_email = db.session.query(Provider, email_subq).\
    outerjoin(email_subq, Provider.emails).\
    all()
This works okay (it returns a 4-tuple of (Provider, user_id, provider_id, email_address), all the information that I want), but I later found out this is not using the Flask BaseQuery class, so that pagination provided by Flask-SQLAlchemy does not work. Apparently db.session.query() is not the Flask-SQLAlchemy Query instance.
I tried to do Emails.query.outerjoin[...] but that returns only columns in the email table though I want both the provider info and the emails.
My question: how can I do the same thing with Flask-SQLAlchemy so that I do not have to re-implement pagination that is already there?
I guess the simplest option at this point is to implement my own paginate function, but I'd love to know if there is another proper way of doing this.
I'm not sure if this is going to end up being the long-term solution, and it does not directly address my concern about not using the Flask-SQLAlchemy's BaseQuery, but the most trivial way around to accomplish what I want is to reimplement the paginate function.
And, in fact, it is pretty easy to use the original Flask-SQLAlchemy routine to do this:
def paginate(query, page, per_page=20, error_out=True):
    if error_out and page < 1:
        abort(404)
    items = query.limit(per_page).offset((page - 1) * per_page).all()
    if not items and page != 1 and error_out:
        abort(404)
    # No need to count if we're on the first page and there are fewer
    # items than we expected.
    if page == 1 and len(items) < per_page:
        total = len(items)
    else:
        total = query.order_by(None).count()
    return Pagination(query, page, per_page, total, items)
Modified from the paginate function found around line 376: https://github.com/mitsuhiko/flask-sqlalchemy/blob/master/flask_sqlalchemy.py
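Hypothetical usage with the outer-join query from the question (note that you pass the query itself, without calling .all() on it first):
page_obj = paginate(
    db.session.query(Provider, email_subq).outerjoin(email_subq, Provider.emails),
    page=1, per_page=20)
for provider, user_id, provider_id, email_address in page_obj.items:
    print(provider, email_address)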
Your question is how to use Flask-SQLAlchemy's Pagination with regular SQLAlchemy queries.
Since Flask-SQLAlchemy's BaseQuery object holds no state of its own, and is derived from SQLAlchemy's Query, and is really just a container for methods, you can use this hack:
from flask.ext.sqlalchemy import BaseQuery

def paginate(sa_query, page, per_page=20, error_out=True):
    sa_query.__class__ = BaseQuery
    # We can now use BaseQuery methods like .paginate on our SA query
    return sa_query.paginate(page, per_page, error_out)
To use:
@route(...)
def provider_and_email_view(page):
    provider_and_email = db.session.query(...) # any SQLAlchemy query
    paginated_results = paginate(provider_and_email, page)
    return render_template('...', paginated_results=paginated_results)
Edit:
Please be careful doing this. It's really just a way to avoid copying/pasting the paginate function, as seen in the other answer. Note that BaseQuery has no __init__ method. See How dangerous is setting self.__class__ to something else?.
Edit 2:
If BaseQuery had an __init__, you could construct one using the SA query object, rather than hacking .__class__.
Hey, I found a quick fix for this; here it is:
provider_and_email = Provider.query.with_entities(email_subq).\
    outerjoin(email_subq, Provider.emails).paginate(page, POST_PER_PAGE_LONG, False)
I'm currently using this approach:
query = BaseQuery([Provider, email_subq], db.session())
to create my own BaseQuery. db is the SQLAlchemy instance.
Update: as @afilbert suggests, you can also do this:
query = BaseQuery(provider_and_email.subquery(), db.session())
How do you init your application with SQLAlchemy?
Probably your current SQLAlchemy connection has nothing to do with flask.ext.sqlalchemy and you are using the original sqlalchemy.
Check this tutorial and verify that your imports really come from flask.ext.sqlalchemy:
http://pythonhosted.org/Flask-SQLAlchemy/quickstart.html#a-minimal-application
You can also just paginate a plain list of results:
my_list = [my_list[i:i + per_page] for i in range(0, len(my_list), per_page)][page]
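A quick sketch of how that slicing behaves (page is 0-indexed here):
my_list = list(range(10))
per_page, page = 3, 1
pages = [my_list[i:i + per_page] for i in range(0, len(my_list), per_page)]
print(pages[page])  # [3, 4, 5]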
I did this and it works:
query = db.session.query(Table1, Table2, ...).filter(...)
if page_size is not None:
    query = query.limit(page_size)
if page is not None:
    query = query.offset(page*page_size)
query = query.all()
I could be wrong, but I think your problem may be the .all(). By using that, you're getting a list, not a query object.
Try leaving it off, and pass your query to the pagination method like so (I left off all the subquery details for clarity's sake):
email_query = db.session.query(Emails).filter(**filters)
email_query.paginate(page, per_page)

Only process subset of a queryset in Django (and return original queryset)

I have a generic list view in Django which returns around 300 objects (unless a search is performed).
I have pagination set to only display 10 of the objects.
So I query the database and then iterate through the objects, processing them and adding extra values before display. I noticed that the processing is done for all 300 objects, and the pagination happens after the processing. So I want to do the processing only on the objects that are going to be displayed, to increase performance.
I calculate the indexes of the objects in the queryset that should be processed: 0-9 for page 1, 10-19 for page 2, 20-29 for page 3, etc.
Now I want to process only the objects in the display range but return the full queryset (so the generic view works as expected).
Initially I tried:
for object in queryset[slice_start:slice_end]:
    # process things here
return queryset
But the slicing seems to return a new queryset, and the original queryset objects do not have any of the calculated values.
Currently my solution is:
index = -1
for object in queryset:
index += 1
if index < slice_start or index > slice_end : continue
# process things here
return queryset
Now this works, but it seems rather hacky and inelegant for Python.
Is there a more pythonic way to do this?
If you're using Django's Paginator class (docs), then you can request the current page and iterate over those objects in the view:
from django.core.paginator import Paginator

# in the view:
p = Paginator(objects, 2)
page = p.page(current_page)
for o in page.object_list:
    # do processing
    pass
You can obtain the value for current_page from the request's parameters (e.g, a page parameter in request.GET).
You should do the processing on the results of page.object_list, which will be guaranteed to only contain the objects for that page.
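For completeness, a hedged sketch of reading current_page from the request, defaulting to page 1:
current_page = int(request.GET.get('page', 1))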

I came across a tricky problem with a Django queryset

Tricky code:
user = User.objects.filter(id=123)
user[0].last_name = 'foo'
user[0].save() # Cannot be saved.
id(user[0]) # 32131
id(user[0]) # 44232 ( different )
The user cannot be saved this way.
Normal code:
user = User.objects.filter(id=123)
if user:
    user[0].last_name = 'foo'
    user[0].save() # Saved successfully.
id(user[0]) # 32131
id(user[0]) # 32131 ( same )
So, what is the problem?
In the first variant, your user queryset isn't evaluated yet, so every time you write user[0] the ORM makes an independent query to the DB. In the second variant, the queryset is evaluated and acts like a normal Python list.
And BTW if you want just one row, use get:
user = User.objects.get(id=123)
When you index into a queryset, Django fetches the data (or looks in its cache) and creates a model instance for you. As you discovered with id(), each call creates a new instance. So while you can set properties on these instances with qs[0].last_name = 'foo', the subsequent call to qs[0].save() creates a new instance (with the original last_name) and saves that.
I'm guessing your particular issue has to do with when Django caches query results. When you are just indexing into the qs, nothing gets cached, but your call if user causes the entire (original) qs to be evaluated and thus cached. So in that case each call to [0] retrieves the same model instance.
Saving is possible, but every time you access user[0] you actually fetch it from the database again, so the instance you see is unchanged.
Indeed, when you slice a queryset, Django issues a SELECT ... FROM ... OFFSET ... LIMIT ... query to your database.
A queryset is not a list, so if you want it to behave like a list, you need to evaluate it; to do so, call list() on it:
user = list(User.objects.filter(id=123))
In your second example, calling if user actually evaluates the queryset (gets it from the database into your Python program), so you then work with the queryset's internal cache.
Alternatively, you can use u = user[0], edit that, and then save it, which will work.
Finally, you should actually be calling Queryset.get, not filter, here, since you're looking up by the unique key.
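Putting those suggestions together, a minimal sketch:
# Preferred: look up a single row by its unique key.
user = User.objects.get(id=123)
user.last_name = 'foo'
user.save()

# If you keep the queryset, bind the instance to a name first,
# so you edit and save the same object:
users = User.objects.filter(id=123)
u = users[0]
u.last_name = 'foo'
u.save()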
