Django QuerySet chaining, subset and caching - python

I know that QuerySets are lazy and are evaluated only under certain conditions, to avoid hitting the database all the time.
What I don't know is whether taking a generic queryset (retrieving all the items) and then using it to construct a more refined queryset (adding a filter, for example) leads to multiple SQL queries or not.
Example:
all_items = MyModel.objects.all()
subset1 = all_items.filter(**some_conditions)
subset2 = subset1.filter(**other_condition)
1) Would this create 3 different SQL queries?
Or does it all depend on whether the 3 variables are evaluated (for example, by iterating over them)?
2) Is this efficient, or would it be better to fetch all the items, convert them into a list, and filter them in Python?

1) If you evaluate only the final queryset subset2, then only one database query is executed, which is optimal.
2) Avoid premature optimization, that is, optimization before measuring on a realistic amount of data, after most of the application code is written. You never know what the most important problem will be once the database gets bigger. For example, if you ask for a subset, the query is usually faster thanks to caching in the database. Memory consumption also works against other optimizations: maybe you can't hold all the data in memory later, and users will access it only one page at a time. Clean, readable code matters more for later optimization than a 20% speedup that must be removed later in order to continue.
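A minimal sketch of the lazy behaviour described in 1), assuming DEBUG=True (otherwise connection.queries stays empty) and the MyModel and filter arguments from the question:
from django.db import connection, reset_queries

reset_queries()
all_items = MyModel.objects.all()              # no query yet
subset1 = all_items.filter(**some_conditions)  # still no query
subset2 = subset1.filter(**other_condition)    # still no query
print(len(connection.queries))                 # 0

results = list(subset2)                        # evaluation: exactly one SELECT
print(len(connection.queries))                 # 1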
Other important documentation sections about (lazy) evaluation of queries:
When QuerySets are evaluated
QuerySets are lazy
Laziness in Django

Is there a way to error on related queries in Django ORM?

I have a Django model backed by a very large table (Log) containing millions of rows. This model has a foreign key reference to a much smaller table (Host). Example models:
class Host(models.Model):
    name = models.CharField(max_length=100)

class Log(models.Model):
    value = models.CharField(max_length=100)
    host = models.ForeignKey(Host, on_delete=models.CASCADE)
In reality there are many more fields and also more foreign keys similar to Log.host.
There is an iter_logs() function that efficiently iterates over Log records using a paginated query scheme. Other places in the program use iter_logs() to process large volumes of Log records, doing things like dumping to a file, etc.
For efficient operation, any code that uses iter_logs() should only access fields like value. But problems arise when someone innocently accesses log.host. In this case Django will issue a separate query each time a new Log record's host is accessed, killing the performance of the efficient paginated query in iter_logs().
I know I can use select_related to efficiently fetch the related host records, but none of the known uses of iter_logs() need this, so it would be wasteful. If a use case for accessing log.host did arise in this context, I would want to add a parameter to iter_logs() to optionally use .select_related("host"), but that has not become necessary yet.
I am looking for a way to tell the underlying query logic in Django to never perform additional database queries except those explicitly allowed in iter_logs(). If such a query becomes necessary it should raise an error instead. Is there a way to do that with Django?
One solution I'd prefer to avoid: wrap or otherwise modify objects yielded by iter_logs() to prevent access to foreign keys.
More generally, Django's deferred query logic breaks encapsulation of code that constructs queries. Dependent code must know about the implementation or risk imposing major inefficiencies. This is usually fine at small scale where a little inefficiency does not matter, but becomes a real problem at larger scale. An early error would be much better because it would be easy to detect in small-scale tests rather than deferring the problem to production run time where it manifests as general slowness.
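One possible direction (a sketch, not a built-in Django guard; the keyset pagination scheme and page_size are assumptions): Django 2.0+ provides connection.execute_wrapper(), a context manager that intercepts every SQL statement on the connection, so iter_logs() can run each page query normally and then raise on anything issued while its records are being consumed:
from django.db import connection

def _forbid_queries(execute, sql, params, many, context):
    raise RuntimeError("unexpected query during iteration: %s" % sql)

def iter_logs(page_size=1000):
    last_pk = 0
    while True:
        # The page query itself runs outside the blocking wrapper.
        page = list(Log.objects.filter(pk__gt=last_pk).order_by("pk")[:page_size])
        if not page:
            return
        last_pk = page[-1].pk
        with connection.execute_wrapper(_forbid_queries):
            # Accessing log.host here raises immediately instead of
            # silently issuing one query per record.
            yield from page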

Which filter is better and faster

I am relatively new to Django and Python, so I would really like to know which of these two implementations is the better or faster approach. I am currently using the filter, but because I really like list comprehensions, I wrote the same code using a list comprehension. Both snippets do exactly the same thing; I just want to know from developers with more experience which is better and why. Both are below.
posts = Post.objects.filter(approved=True).order_by('-date_posted')
posts = [post for post in Post.objects.all().order_by('-date_posted') if post.approved]
A .filter(..) call is not implemented to perform filtering at the Django/Python level: filtering is done in the database, with a WHERE (or HAVING) clause. Databases are systems designed to store, retrieve, and aggregate large amounts of data.
If you often filter on the approved value, you can add an index on the column:
class Post(models.Model):
    approved = models.BooleanField(db_index=True)
In that case, the database will maintain an indexing structure that makes filtering more efficient.
It is usually better to filter in the database, since that means the database sends fewer records to the Python/Django layer, and Django in turn has to deserialize fewer rows into Python objects. So even if filtering with a list comprehension were as fast as filtering in the database, it would still be less efficient overall, because Python/Django first has to deserialize more elements. As the number of elements grows, it will eventually cause memory problems, since you cannot hold all the records in memory at the same time.

Reasons why Django was slow

This is simply a question of curiosity. I have a script that loads a specific queryset without evaluating it, and then I print the count(). I understand that count() has to go through the rows, so depending on the size it could potentially take some time, but it took over a minute to return 0 as the count of an empty queryset. Why is that taking so long? Is it Django or my server?
Notes: the queryset was all one type.
It all depends on the query that you're running. If you're running a SELECT COUNT(*) FROM foo on a table that has ten rows, it's going to be very fast; but if your query involves a dozen joins, sub-selects, or filters on un-indexed columns, or if the target table simply has a lot of rows, the query can take an arbitrary amount of time. In all likelihood, the bottleneck is not Django (although its ORM has some quirks), but rather the database and your query. Just because no rows meet the criteria doesn't mean the database didn't need to examine the other rows in the table.
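To see what the database is actually doing, you can inspect the generated SQL and the query plan (a sketch; Foo and the filter are placeholders, and .explain() requires Django 2.1+):
qs = Foo.objects.filter(active=True)
print(qs.query)      # the SELECT Django will send
print(qs.explain())  # the database's query plan, to spot missing indexes or big scans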

Optimizing Django: nested queries vs relation lookups

I have legacy code that uses a nested ORM query, which produces an SQL SELECT query with a JOIN, and conditions that also contain a SELECT and JOIN. Executing this query takes an enormous amount of time. However, when I execute the same query as raw SQL, taken from Django_ORM_query.query, it runs in reasonable time.
What are best practices for optimization in such cases?
Would the query perform faster if I used ManyToMany and ForeignKey relations?
Performance issues in Django are usually caused by following relations in a loop, which results in multiple database queries. If you have django-debug-toolbar installed, you can check how many queries you're running and figure out which query needs to be optimized. The debug toolbar also shows the time of each query, which is essential for optimizing Django; you're missing out on a lot if you don't have it installed or don't use it.
You'd generally solve the problem of following relations by using select_related() or prefetch_related().
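A minimal sketch of both, with hypothetical Author/Book models (each Book has a ForeignKey to Author):
# select_related() follows single-valued relations with a SQL JOIN:
books = Book.objects.select_related("author")          # one query
for book in books:
    print(book.author.name)                            # no extra queries

# prefetch_related() batches multi-valued relations into a second query:
authors = Author.objects.prefetch_related("book_set")  # two queries total
for author in authors:
    print([b.title for b in author.book_set.all()])    # no per-row queries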
A page should generally have at most 20-30 queries; any more and it will seriously affect performance. Most pages should have only 5-10 queries. You want to reduce the number of queries because round trips are the number one killer of database performance. In general, one big query is faster than 100 small queries.
The number two killer of database performance is a much rarer problem, though it sometimes arises because of techniques that reduce the number of queries. Your query might simply be too big; if this is the case, you should use defer() or only() so you don't load large fields that you know you won't be using.
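For example (a sketch; the body field is an assumption):
posts = Post.objects.defer("body")        # load every field except the large body
posts = Post.objects.only("id", "title")  # load only these fields up front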
When in doubt, use raw SQL. That's a completely valid optimization in the Django world.

Merging cached GQL queries instead of using IN

I'm generating a feed that merges the comments of many users, so your feed might consist of comments by user1+user2+user1000, whereas mine might be user1+user2. So I have the line:
some_comments = Comment.gql("WHERE username IN :1", user_list)
I can't just memcache the whole thing, since everyone will have a different feed, even if the feeds for user1 and user2 are common to many viewers. According to the documentation:
...the IN operator executes a separate underlying datastore query for every item in the list. The entities returned are a result of the cross-product of all the underlying datastore queries and are de-duplicated. A maximum of 30 datastore queries are allowed for any single GQL query.
Is there a library function to merge some sorted and cached queries, or am I going to have to:
for user in user_list:
    if memcached(user):
        add it to the results
    else:
        add Comment.gql("WHERE username = :1", user) to the results
        cache it too
sort the results
(In the worst case, when nothing is cached, I expect that sending off 30 GQL queries is slower than one giant IN query.)
There's nothing built-in to do this, but you can do it yourself, with one caveat: if you do an IN query and return 30 results, these will be the 30 records that sort lowest according to your sort criteria across all the subqueries. If you want to assemble the result set from cached individual queries, though, either you will have to cache as many results for each user as the total result set (e.g., 30) and throw away most of them, or you will have to store fewer results per user and accept that sometimes you'll throw away newer results from one user in favor of older results from another.
That said, here's how you can do this:
Do a memcache.get_multi to retrieve cached result sets for all the users
For each user that doesn't have a result set cached, execute the individual query, fetching the maximum number of results you need. Use memcache.set_multi to cache the result sets.
Do a merge-join on all the result sets and take the top n results as your final result set. Because username is presumably not a list field (e.g., every comment has a single author), you don't need to worry about duplicates.
Currently, IN queries are executed serially, so this approach won't be any slower than executing an IN query, even when none of the results are cached. This may change in the future, though. If you want to improve performance now, you'll probably want to use Guido's NDB project, which will allow you to execute all the subqueries in parallel.
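Putting those steps together, a hedged sketch (the created ordering property, the "feed:" key prefix, and PAGE_SIZE are all assumptions):
from google.appengine.api import memcache

PAGE_SIZE = 30

def merged_feed(user_list):
    # Step 1: fetch whatever per-user result sets are already cached.
    cached = memcache.get_multi(user_list, key_prefix="feed:")
    # Step 2: run individual queries for the misses, then cache them.
    misses = {}
    for user in user_list:
        if user not in cached:
            misses[user] = Comment.gql(
                "WHERE username = :1 ORDER BY created DESC", user
            ).fetch(PAGE_SIZE)
    if misses:
        memcache.set_multi(misses, key_prefix="feed:")
        cached.update(misses)
    # Step 3: merge the per-user sets and keep the newest PAGE_SIZE overall.
    merged = [c for comments in cached.values() for c in comments]
    merged.sort(key=lambda c: c.created, reverse=True)
    return merged[:PAGE_SIZE]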
You can use memcache.get_multi() to see which of the users' feeds are already in memcache. Then use set().difference() between the original user list and the user list found in memcache to find out which feeds weren't retrieved. Finally, fetch the missing user feeds from the datastore in a batch get.
From there you can combine the two lists and, if the result isn't too long, sort it in memory. If you're working on something Ajaxy, you could hand sorting off to the client.
