Django prefetch_related and exists - Python

I am using prefetch_related when querying a model that has several m2m relationships:
qs = context.mymodel_set.prefetch_related('things1', 'things2', 'things3')
So when I do the following, there should be no need for an additional query to get things1; they should have been fetched already:
r = list(qs)
r[0].things1.all()
But what if I do r[0].things1.exists()? Will this generate a new query, or will it use the prefetched information? If it generates a new query, does that mean that using r[0].things1.all() for the existence check is more efficient?
PS: cached information being in desync with the database does not worry me for this particular question.

It's easy to check the queries that Django is running for yourself.
When I tried it, it appeared that obj.things.exists() did not cause any additional queries when things was prefetched.
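A minimal way to check it yourself, assuming DEBUG=True so that connection.queries is populated; the model and relation names follow the question:

from django.db import connection, reset_queries

reset_queries()
r = list(context.mymodel_set.prefetch_related('things1', 'things2', 'things3'))
after_prefetch = len(connection.queries)

r[0].things1.exists()   # should be answered from the prefetch cache
r[0].things1.all()      # likewise

print(len(connection.queries) - after_prefetch)  # 0 means no additional queries were issued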

To fetch only the objects that have a related things1, you can add it to the query like this:
context.mymodel_set.prefetch_related(
    'things1', 'things2', 'things3'
).filter(
    things1__isnull=False
)
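Note that filtering across a to-many relation like this can return the same parent object once per matching related row, so a hedged variant adds distinct():

context.mymodel_set.prefetch_related(
    'things1', 'things2', 'things3'
).filter(
    things1__isnull=False
).distinct()  # avoid duplicate parents when several things1 rows match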


How to optimize lazy loading of related object, if we already have its instance?

I like how the Django ORM lazy-loads related objects in a queryset, but I find it quite unpredictable as it is.
The queryset API doesn't keep the related objects that were used to build a queryset, so it fetches them again when they are accessed later.
Suppose I have a ModelA instance (say instance_a) that is referenced by a foreign key (say for_a) on some N instances of ModelB. Now I want to query ModelB for the rows that have the given ModelA instance as the foreign key.
Django ORM provides two ways:
Using .filter() on ModelB:
b_qs = ModelB.objects.filter(for_a=instance_a)
for instance_b in b_qs:
    instance_b.for_a  # <-- fetches the same row for ModelA again
Results in 1 + N queries here.
Using the reverse relation on the ModelA instance:
b_qs = instance_a.for_a_set.all()
for instance_b in b_qs:
    instance_b.for_a  # <-- this uses the instance_a from memory
Results in 1 query only here.
While the second way achieves the result, it's not part of the standard API and isn't usable in every scenario, for example when I have instances for two foreign keys of ModelB (say, ModelA and ModelC) and I want to get the objects related to both of them.
Something like the following works:
ModelB.objects.filter(for_a=instance_a, for_c=instance_c)
I guess it's possible to use .intersection() for this scenario, but I would like a way to achieve this via the standard API. After all, covering such cases would require more code with non-standard queryset functions, which may not make sense to the next developer.
So, the first question: is it possible to optimize such scenarios with the standard API itself?
The second question: if it's not possible right now, can it be added with some tweaks to the QuerySet?
PS: It's my first time asking a question here, so forgive me if I made any mistake.
You could improve the query by using select_related():
b_qs = ModelB.objects.select_related('for_a').filter(for_a=instance_a)
or
b_qs = instance_a.for_a_set.select_related('for_a')
Does that help?
You use .select_related(..) [Django-doc] for ForeignKeys, or .prefetch_related(..) [Django-doc] for something-to-many relations.
With .select_related(..) you make a LEFT OUTER JOIN at the database side, fetch the records for both objects in a single query, and then deserialize them into the proper objects.
ModelB.objects.select_related('for_a').filter(for_a=instance_a)
For relations that are one-to-many (so a reverse ForeignKey), or ManyToManyFields, this is not a good idea, since it could retrieve a large number of duplicate objects. That would mean a large response from the database and a lot of work at the Python end to deserialize these objects. .prefetch_related will make individual queries, and then do the linking itself.
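For the two-foreign-key case from the question, both relations can be joined in one query; a minimal sketch, assuming for_a and for_c are forward ForeignKeys on ModelB (names follow the question):

b_qs = (
    ModelB.objects
    .select_related('for_a', 'for_c')            # single query with JOINs
    .filter(for_a=instance_a, for_c=instance_c)
)
for instance_b in b_qs:
    instance_b.for_a  # no extra query: populated from the JOIN
    instance_b.for_c  # no extra query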

Bulk insert with multiprocessing using peewee

I'm working on a simple HTML scraper in Python 3.4, using peewee as the ORM (great ORM, btw!). My script takes a bunch of sites, extracts the necessary data and saves it to the database; however, every site is scraped in a detached process to improve performance, and the saved data should be unique. There can be duplicate data not only across sites but also on a particular site, so I want to store it only once.
Example:
Post and Category - a many-to-many relation. During scraping, the same category appears multiple times in different posts. The first time, I want to save that category to the database (create a new row). If the same category shows up in a different post, I want to bind that post to the already created row in the db.
My question is: do I have to use atomic updates/inserts (insert one post, save, get_or_create categories, save, insert new rows into the many-to-many table, save), or can I use bulk insert somehow? What is the fastest solution to this problem? Maybe some temporary tables shared between processes, which would be bulk-inserted at the end of the work? I'm using a MySQL db.
Thanks for your answers and your time.
You can rely on the database to enforce unique constraints by adding unique=True to fields or by defining multi-column unique indexes. You can also check the docs on get/create and bulk inserts (a sketch follows the links below):
http://docs.peewee-orm.com/en/latest/peewee/models.html#indexes-and-unique-constraints
http://docs.peewee-orm.com/en/latest/peewee/querying.html#get-or-create
http://docs.peewee-orm.com/en/latest/peewee/querying.html#bulk-inserts
http://docs.peewee-orm.com/en/latest/peewee/querying.html#upsert - upsert with on conflict
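A minimal sketch tying these pieces together; the models, field names, and connection settings are hypothetical, not taken from the question:

from peewee import (MySQLDatabase, Model, CharField,
                    ForeignKeyField, CompositeKey)

db = MySQLDatabase('scraper', user='root', password='secret')  # hypothetical connection

class BaseModel(Model):
    class Meta:
        database = db

class Post(BaseModel):
    title = CharField()

class Category(BaseModel):
    name = CharField(unique=True)  # duplicates rejected at the database level

class PostCategory(BaseModel):
    post = ForeignKeyField(Post)
    category = ForeignKeyField(Category)
    class Meta:
        primary_key = CompositeKey('post', 'category')  # one link row per pair

def save_post(title, category_names):
    # Safe to call from several scraper processes: get_or_create and the
    # unique constraints keep categories and link rows unique.
    with db.atomic():
        post = Post.create(title=title)
        for name in category_names:
            category, _ = Category.get_or_create(name=name)
            (PostCategory
             .insert(post=post, category=category)
             .on_conflict_ignore()
             .execute())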
Looked for this myself for a while, but found it!
You can use the on_conflict_replace() or on_conflict_ignore() functions to define the behaviour when a record already exists in a table that has a uniqueness constraint.
PriceData.insert_many(values).on_conflict_replace().execute()
or
PriceData.insert_many(values).on_conflict_ignore().execute()
More info under "Upsert" here
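For reference, values in that answer is a list of row dicts; a hypothetical example (PriceData's fields are assumptions, not from the answer):

values = [
    {'symbol': 'AAPL', 'price': 172.5},   # hypothetical fields
    {'symbol': 'MSFT', 'price': 331.2},
]
PriceData.insert_many(values).on_conflict_replace().execute()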

caching queryset

I'm working on an app that requires me to filter through a large number of records. I've been reading about caching QuerySets and related topics and found some good material.
qs = MyModel.objects.filter(Q(<initial_filter_to_narrow_down_size>))
After this, I wish to put this qs in the cache for later use. I want to apply all the other filters without hitting the database, something like:
cache.set('qs', qs)
But what happens when I do qs = qs.filter(q_object)? Will the cache be modified? I don't want that; I want qs to remain constant until I update it. What should I do in this case?
.filter() clones the queryset before applying the filter, so the cache will not be affected.
BTW, you might want to check out JohnnyCache, a great app for queryset caching.
What I have understood from your question is that you just need to get the queryset from the cache, as shown below:
your_cached_qs = cache.get('qs')
And then apply whatever filter you want:
your_new_qs = your_cached_qs.filter(further_filter)
This will not affect the queryset in the cache. Hence, your cache will remain unchanged until you update it yourself, and your desired result will be achieved.
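Putting the two answers together, a minimal sketch assuming a configured Django cache backend; MyModel and the filters are placeholders:

from django.core.cache import cache

qs = MyModel.objects.filter(is_active=True)      # initial narrowing filter (placeholder)
cache.set('qs', qs, timeout=300)                 # pickling the queryset evaluates it and stores its results

cached_qs = cache.get('qs')
narrowed = cached_qs.filter(name__icontains='foo')  # .filter() returns a clone; the cached queryset stays as it was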

Django: What kind of querysets should I look for when deciding on model indexes?

In general, is there a type of Model query you look for to optimize by indexing a field (db_index=True)?
In case it's relevant: I'm using MySQL.
Elaboration:
Although I appreciate the responses already given, I was looking more for advice such as this from a colleague:
You should definitely index the fields in your default ordering and any field you use for filtering.
Think that about covers it?
Install django-debug-toolbar
Look at the SQL panel, look for long-running queries
Index the columns selected in those queries
If you need help with the queries, try the "EXPLAIN" MySQL command on the query.
Basically, you should index fields that are searched on often. For example, if you have a user table and things are constantly queried by username, then username should be indexed as well. There are trade-offs, of course.
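A minimal model sketch of that advice; the model and field names are illustrative, and Meta.indexes needs Django 1.11+ (older versions can rely on db_index=True alone):

from django.db import models

class Profile(models.Model):
    # Filtered on constantly, e.g. Profile.objects.filter(username=...)
    username = models.CharField(max_length=150, db_index=True)
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        ordering = ['-created_at']                 # default ordering ...
        indexes = [
            models.Index(fields=['-created_at']),  # ... backed by a matching index
        ]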

How to make Django QuerySet bulk delete() more efficient

Setup:
Django 1.1.2, MySQL 5.1
Problem:
Blob.objects.filter(foo=foo) \
    .filter(status=Blob.PLEASE_DELETE) \
    .delete()
This snippet results in the ORM first generating a SELECT * from xxx_blob where ... query, then doing a DELETE from xxx_blob where id in (BLAH), where BLAH is a ridiculously long list of ids. Since I'm deleting a large number of blobs, this makes both me and the DB very unhappy.
Is there a reason for this? I don't see why the ORM can't convert the above snippet into a single DELETE query. Is there a way to optimize this without resorting to raw SQL?
For those who are still looking for an efficient way to bulk delete in Django, here's a possible solution:
The reason delete() may be so slow is twofold: 1) Django has to ensure cascade deleting functions properly, thus looking for foreign key references to your models; 2) Django has to handle pre- and post-delete signals for your models.
If you know your models don't have cascade deleting or signals to be handled, you can accelerate this process by resorting to the private API _raw_delete as follows:
queryset._raw_delete(queryset.db)
More details here. Please note that Django already tries to handle these events well, though using the raw delete is, in many situations, much more efficient.
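Applied to the queryset from the question, a minimal sketch, assuming Blob really has no cascading relations or delete signals:

# _raw_delete is a private Django API and may change between versions.
qs = Blob.objects.filter(foo=foo, status=Blob.PLEASE_DELETE)
qs._raw_delete(qs.db)  # issues a single DELETE ... WHERE ... statement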
Not without writing your own custom SQL or managers or something; they are apparently working on it though.
http://code.djangoproject.com/ticket/9519
Bulk delete is already part of Django:
Keep in mind that this will, whenever possible, be executed purely in SQL.
