Setup:
Django 1.1.2, MySQL 5.1
Problem:
Blob.objects.filter(foo=foo) \
    .filter(status=Blob.PLEASE_DELETE) \
    .delete()
This snippet results in the ORM first generating a SELECT * FROM xxx_blob WHERE ... query, then issuing a DELETE FROM xxx_blob WHERE id IN (BLAH); where BLAH is a ridiculously long list of IDs. Since I'm deleting a large number of blobs, this makes both me and the DB very unhappy.
Is there a reason for this? I don't see why the ORM can't convert the above snippet into a single DELETE query. Is there a way to optimize this without resorting to raw SQL?
For those who are still looking for an efficient way to bulk delete in django, here's a possible solution:
The reason delete() may be so slow is twofold: 1) Django has to ensure cascade deleting works properly, so it looks for foreign key references to your models; 2) Django has to handle the pre- and post-delete signals for your models.
If you know your models don't have cascade deleting or signals to be handled, you can accelerate this process by resorting to the private API _raw_delete as follows:
queryset._raw_delete(queryset.db)
More details here. Note that Django already tries to handle these cases well, but using the raw delete is, in many situations, much more efficient.
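As a concrete sketch, reusing the queryset from the question above (and assuming no cascades or signals actually need to run for Blob):

# Build the queryset as usual; _raw_delete issues a single DELETE statement,
# skipping cascade collection and delete signals (note: private API).
queryset = Blob.objects.filter(foo=foo, status=Blob.PLEASE_DELETE)
queryset._raw_delete(queryset.db)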
Not without writing your own custom SQL or managers or something; they are apparently working on it though.
http://code.djangoproject.com/ticket/9519
Bulk delete is already part of Django.
Keep in mind that this will, whenever possible, be executed purely in SQL.
I am writing a REST backend for a project. Here's a basic example of the kind of problem I'm trying to solve.
These students have tests with grades and each student has a column current_average_grade.
Every time a test is stored, this current_average_grade should be updated (using all existing tests).
So the question is: should this be calculated and stored within the POST view in Django (get all grades from the DB and then do the calculation), or with an SQL trigger, using Django only to convert JSON to SQL?
The advantage of using SQL for this is, of course, that it should theoretically be much faster, and you also get concurrency for free.
The disadvantage is that, since I am now programming in SQL, I have yet another codebase to manage, and it might even create problems with Django.
So what's the ideal solution here? How do I enforce business logic in an elegant way?
I think handling this in Django views is a better idea. That way you can control the business logic more cleanly, and you also don't need to manage the database extensively.
And for handling concurrency, Django provides a clean mechanism in the form of select_for_update().
Handling Concurrency in Django
To acquire a lock on a resource, we use a database lock.
In Django, that is done with select_for_update().
Example Code
from django.db import transaction

# The queryset is lazy: the SELECT ... FOR UPDATE (and hence the row locks)
# only runs when it is evaluated, which must happen inside the transaction.
entries = Entry.objects.select_for_update().filter(author=request.user)
with transaction.atomic():
    for entry in entries:
        ...  # act on each entry in a thread-safe way while its row is locked
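Applied to the grades scenario from the question, a rough sketch might look like the following (Student and Test are hypothetical models with a grade field and a current_average_grade column, as described above):

from django.db import transaction
from django.db.models import Avg

def store_test(student_id, grade):
    # Student and Test are the hypothetical models from the question.
    with transaction.atomic():
        # Lock the student row so two concurrent test submissions
        # cannot both compute and save a stale average.
        student = Student.objects.select_for_update().get(pk=student_id)
        Test.objects.create(student=student, grade=grade)
        student.current_average_grade = (
            Test.objects.filter(student=student).aggregate(avg=Avg('grade'))['avg']
        )
        student.save(update_fields=['current_average_grade'])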
I have customized my Django project to work with MariaDB (MySQL).
It works fine; however, I have some issues (or concerns) with models.
First of all, I am not sure why I need them, since for me (personally) it's much easier to use SQL statements to get the data.
Using an API for DB queries might be useful for people who do not know SQL, but for me it's less flexible.
Can anybody explain the main benefits of using models?
Here is one of the issues I have. See the code below.
class Quotes(models.Model):
    updated = models.DateTimeField()
    tdate = models.DateField(default='1900-01-01')
    ticker = models.CharField(max_length=15)
    open = models.FloatField(default=0)
    vol = models.BigIntegerField(default=0)
Why doesn't Django apply 'default' when the DB table and fields are created?
Why is what I define as a FloatField created in the DB as 'double' and not 'float'? (I checked this using phpMyAdmin.)
How can I properly set a default value?
My table will have at least 1 million entries.
Do I need to be concerned about performance when using the API instead of direct SQL queries? Usually one query will select 700-800 entries.
Is it a good approach to use MySQLdb and direct SQL instead of models?
Sorry if some of these questions sound too simple, but I just started with Django. Before this I worked with PHP. The main reason I want to use Python for web development is a library which I have developed.
Question zero, i.e. why models: Django's models are a nice abstraction on top of relational database tables – most (if not all) web apps end up having (or being) some sort of CRUD where you manipulate objects or graphs of objects saved in the database, so an object-oriented approach is nice to work with there.
In addition, many features in Django (and libraries that work with Django) are built around models (such as the admin, ModelForms, serialization, etc.).
Question 1: That date should preferably be datetime.date(1900, 1, 1), not a string, but that aside, Django deals with defaults on model instantiation, not necessarily in the database.
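A sketch of how that field could be declared with a real Python default rather than a string (Django applies this when the model is instantiated; it does not become a DB-level DEFAULT):

import datetime
from django.db import models

class Quotes(models.Model):
    # Applied in Python when a new instance is created;
    # it does not turn into a DEFAULT clause in the CREATE TABLE.
    tdate = models.DateField(default=datetime.date(1900, 1, 1))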
Question 2: Because that's how it's mapped, presumably to avoid programmers accidentally losing floating-point precision (since MySQL is rather notorious about doing precision-losing conversions "behind your back").
Question 3: Django's ORM is, to be absolutely honest, not the fastest when it generates queries and instantiates model instances. Most of the time, in regular operations, that's not a problem. Depending on what you're doing with those 700 to 800 instances, you may be able to work around that anyway; for instance, using .values() or .values_list() on a queryset if you don't need the actual instances, just the data.
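For example, a sketch of pulling only the raw data from the Quotes model above (the ticker value is just illustrative):

# Both return plain data structures instead of full model instances,
# which is noticeably cheaper when iterating over 700-800 rows.
rows = Quotes.objects.filter(ticker='IBM').values('tdate', 'open', 'vol')   # yields dicts
prices = Quotes.objects.filter(ticker='IBM').values_list('tdate', 'open')   # yields tuples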
Regarding direct SQL, please don't hard-code any MySQLdb calls in a Django app though; Django has very nice "escape hatches" for doing raw SQL:
You can perform .raw() SQL queries that map into models, or if that's not enough,
you can just execute SQL against the database connection like you would with MySQLdb.
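A sketch of both escape hatches, using the Quotes model from the question (the table name here just assumes Django's usual appname_model convention):

from django.db import connection

# 1) A raw query that still yields Quotes instances:
quotes = Quotes.objects.raw('SELECT * FROM myapp_quotes WHERE ticker = %s', ['IBM'])

# 2) SQL executed directly against the connection, no models involved:
cursor = connection.cursor()
cursor.execute('SELECT ticker, AVG(vol) FROM myapp_quotes GROUP BY ticker')
rows = cursor.fetchall()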
Oh, and one more thing: your model name should be singular (Quote) :)
I'm using Django with an sqlite backend, and write performance is a problem. I may graduate to a "proper" db at some stage, but for the moment I'm stuck with sqlite. I think that my write performance problems are probably related to the fact that I'm creating a large number of rows, and presumably each time I save() one it's locking, unlocking and syncing the DB on disk.
How can I aggregate a large number of save() calls into a single database operation?
EDITED: commit_on_success is deprecated and was removed in Django 1.8. Use transaction.atomic instead. See Fraser Harris's answer.
Actually this is easier to do than you think. You can use transactions in Django. These batch database operations (specifically save, insert and delete) into one operation. I've found the easiest one to use is commit_on_success. Essentially you wrap your database save operations into a function and then use the commit_on_success decorator.
from django.db.transaction import commit_on_success

@commit_on_success
def lot_of_saves(queryset):
    for item in queryset:
        modify_item(item)
        item.save()
This will give a huge speed increase. You'll also get the benefit of rollbacks if any of the items fail. If you have millions of save operations then you may have to commit them in blocks using commit_manually and transaction.commit(), but I've rarely needed that.
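For reference, a rough sketch of that legacy block-commit pattern (pre-Django 1.6 API, since removed; the block size is arbitrary):

from django.db import transaction

@transaction.commit_manually
def lots_of_saves(queryset):
    try:
        for i, item in enumerate(queryset):
            modify_item(item)
            item.save()
            if i % 1000 == 0:
                # Flush this block of saves to the database.
                transaction.commit()
        transaction.commit()
    except Exception:
        transaction.rollback()
        raise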
New as of Django 1.6 is atomic, a simple API to control DB transactions. Copied verbatim from the docs:
atomic is usable both as a decorator:
from django.db import transaction

@transaction.atomic
def viewfunc(request):
    # This code executes inside a transaction.
    do_stuff()
and as a context manager:
from django.db import transaction

def viewfunc(request):
    # This code executes in autocommit mode (Django's default).
    do_stuff()

    with transaction.atomic():
        # This code executes inside a transaction.
        do_more_stuff()
Legacy django.db.transaction functions autocommit(), commit_on_success(), and commit_manually() have been deprecated and will be removed in Django 1.8.
I think this is the method you are looking for: https://docs.djangoproject.com/en/dev/ref/models/querysets/#bulk-create
Code copied from the docs:
Entry.objects.bulk_create([
    Entry(headline='This is a test'),
    Entry(headline='This is only a test'),
])
Which in practice, would look like:
my_entries = []
for i in range(100):
    my_entries.append(Entry(headline='Headline #' + str(i)))

Entry.objects.bulk_create(my_entries)
According to the docs, this executes a single query, regardless of the size of the list (maximum 999 items on SQLite3), which can't be said for the atomic decorator.
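If you do hit that kind of backend limit with very large lists, bulk_create also takes a batch_size argument to split the work into several inserts:

# Splits the insert into chunks of 500 objects per query.
Entry.objects.bulk_create(my_entries, batch_size=500)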
There is an important distinction to make. It sounds like, from the OP's question, that he is attempting to bulk create rather than bulk save. The atomic decorator is the fastest solution for saving, but not for creating.
"How can I aggregate a large number of save() calls into a single database operation?"
You don't need to. Django already manages a cache for you. You can't improve its DB caching by trying to fuss around with saves.
"write performance problems are probably related to the fact that I'm creating a large number of rows"
Correct.
SQLite is pretty slow. That's the way it is. Queries are faster than in most other DBs; writes are pretty slow.
Consider a more serious architecture change. Are you loading rows during a web transaction (i.e., bulk uploading files and loading the DB from those files)?
If you're doing bulk loading inside a web transaction, stop. You need to do something smarter. Use Celery or some other "batch" facility to do your loads in the background.
We try to limit ourselves to file validation in a web transaction and do the loads when the user isn't waiting for their page of HTML.
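A minimal sketch of that approach with Celery (assuming a working Celery setup; load_rows_from_file is a hypothetical helper that does the actual inserts):

from celery import shared_task

@shared_task
def import_uploaded_file(file_path):
    # Runs in a worker process, so the web request only has to
    # validate the upload and enqueue this task.
    load_rows_from_file(file_path)  # hypothetical bulk-load helper

The view would then call import_uploaded_file.delay(saved_path) after validating the file and return immediately.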
I've just started learning Python/Django and have a lot of experience building high-traffic websites using PHP and MySQL. What worries me so far is Django's overly optimistic approach that you will never need to write custom SQL, and that it automatically creates all these foreign key relationships in your database. The one thing I've learned in the last few years of building Chess.com is that it's impossible to NOT write custom SQL when you're dealing with something like MySQL, which frequently needs to be told which indexes it should use (or avoid), and that foreign keys are a death sentence. Percona's strongest recommendation was for us to remove all FKs for optimal performance.
Is there a way in Django to do this in the models file, i.e. create relationships without creating actual DB FKs? Or is there a way to start at the database level, design/create my database, and then have Django reverse-engineer the models file?
If you don't want foreign keys, then avoid using
models.ForeignKey(),
models.ManyToManyField(), and
models.OneToOneField().
Django will automatically create an auto-increment int field named id that you can use to refer to individual records, or you can override that by marking a field as primary_key=True.
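For example, a sketch of a model that stores the related row's id as a plain integer instead of a real foreign key (model and field names are purely illustrative):

from django.db import models

class Order(models.Model):
    # No ForeignKey: just the other table's id, so no FK constraint,
    # no cascade behaviour, and only the indexes you ask for explicitly.
    customer_id = models.IntegerField(db_index=True)
    total = models.DecimalField(max_digits=10, decimal_places=2)

Any joins then happen in your own queries or application code rather than through the ORM's relation traversal.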
There is also documentation on running raw SQL queries on the database.
Raw SQL is as easy as this:
for obj in MyModel.objects.raw('SELECT * FROM myapp_mymodel'):
    print(obj)
Denormalizing a database is up to you at model definition time.
You can use non-relational databases (MongoDB, ...) too with Django NonRel
django-admin inspectdb allows you to reverse engineer a models file from existing tables. That is only a very partial response to your question ;)
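For example (the generated file is a starting point you would then hand-edit):

python manage.py inspectdb > myapp/models.py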
You can just create models.py yourself and avoid having Django create the tables automatically (via syncdb), leaving it up to you to define the actual tables as you please. So although there are foreign key relationships in models.py, this does not mean that they must exist in the actual tables. This is a very good thing, considering how ludicrously foreign key constraints are implemented in MySQL: MyISAM just ignores them, and InnoDB creates a non-optional index on every single one regardless of whether it makes sense.
I concur with the 'no foreign keys' advice (with the disclaimer: I also work for Percona).
The reason it is recommended is concurrency / reducing locking internally.
It can be a difficult "optimization" to sell, but if you consider that the database has transactions (and is more or less ACID-compliant), then it should only be application-logic errors that cause foreign-key violations. That's not to say such errors don't exist, but if you enable foreign keys in development you should hopefully find at least a few bugs.
In terms of whether or not you need to write custom SQL:
The explanation I usually give is that "optimization rarely decreases complexity". I think it is okay to stick with an ORM by default, but if in a profiler it looks like one particular piece of functionality is taking a lot more time than you suspect it would when written by hand, then you need to be prepared to fix it (assuming the code is called often enough).
The real secret here is that you need good instrumentation / profiling in order to be frugal with your complexity adding optimization(s).
For performance reasons I can't use the ORM query methods of Django, and I have to use raw SQL for some complex queries. I want to find a way to map the results of an SQL query to several models.
I know I can use the following statement to map the query results to one model, but I can't figure out how to use it to map to related models as well (like I can do by using select_related in Django).
model_instance = MyModel(**dict(zip(field_names, row_data)))
Is there a relatively easy way to be able to map fields of related tables that are also in the query result set?
First, can you prove the ORM is stopping your performance? Sometimes performance problems are simply poor database design, or improper indexes. Usually this comes from trying to force-fit Django's ORM onto a legacy database design. Stored procedures and triggers can have adverse impact on performance -- especially when working with Django where the trigger code is expected to be in the Python model code.
Sometimes poor performance is an application issue. This includes needless order-by operations being done in the database.
The most common performance problem is an application that "over-fetches" data. Casually using the .all() method and creating large in-memory collections. This will crush performance. The Django query sets have to be touched as little as possible so that the query set iterator is given to the template for display.
Once you choose to bypass the ORM, you have to fight out the Object-Relational Impedance Mismatch problem. Again. Specifically, relational "navigation" has no concept of "related": it has to be a first-class fetch of a relational set using foreign keys. Assembling a complex in-memory object model via SQL is simply hard. Circular references make this very hard; resolving FKs into collections is hard.
If you're going to use raw SQL, you have two choices.
Eschew "select related" entirely: it doesn't exist in raw SQL, and it's painful to implement yourself.
Invent your own ORM-like "select related" features. A common approach is to add stateful getters that (a) check a private cache to see if the related object has already been fetched and, if it hasn't, (b) fetch it from the database and update the cache.
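A sketch of such a stateful getter, assuming plain objects built from raw rows as in the question, that the row included a blog_id column, and that Blog is a hypothetical Django model for the related table (all names here are illustrative):

class Entry(object):
    def __init__(self, **fields):
        # fields come from zip(field_names, row_data), as in the question.
        self.__dict__.update(fields)
        self._blog_cache = None

    @property
    def blog(self):
        # (a) Check the private cache first.
        if self._blog_cache is None:
            # (b) Fetch the related object once and remember it.
            # Blog is a hypothetical model; blog_id is assumed to be in the row.
            self._blog_cache = Blog.objects.get(pk=self.blog_id)
        return self._blog_cache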
In the process of inventing your own stateful getters, you'll be reinventing Django's, and you'll probably discover that it isn't the ORM layer, but a database design or an application design issue.