Aggregating save()s in Django? - python

I'm using Django with an sqlite backend, and write performance is a problem. I may graduate to a "proper" db at some stage, but for the moment I'm stuck with sqlite. I think that my write performance problems are probably related to the fact that I'm creating a large number of rows, and presumably each time I save() one it's locking, unlocking and syncing the DB on disk.
How can I aggregate a large number of save() calls into a single database operation?

EDITED: commit_on_success is deprecated and was removed in Django 1.8. Use transaction.atomic instead. See Fraser Harris's answer.
Actually this is easier to do than you think. You can use transactions in Django. These batch database operations (specifically save, insert and delete) into one operation. I've found the easiest one to use is commit_on_success. Essentially you wrap your database save operations into a function and then use the commit_on_success decorator.
from django.db.transaction import commit_on_success

@commit_on_success
def lot_of_saves(queryset):
    for item in queryset:
        modify_item(item)
        item.save()
This gives a huge speed increase. You also get a rollback if any of the items fail. If you have millions of save operations you may have to commit them in blocks using commit_manually and transaction.commit(), but I've rarely needed that.
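The "commit in blocks" idea can be sketched outside Django with the standard-library sqlite3 module. This is only an illustration of the mechanism, not Django's API: the entry table, the chunked helper and the block size are assumptions made up for the example.

```python
import sqlite3
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        block = list(islice(iterator, size))
        if not block:
            return
        yield block


def insert_in_blocks(conn, headlines, block_size=1000):
    """Insert rows, committing once per block instead of once per row.

    Returns the number of commits performed.
    """
    commits = 0
    for block in chunked(headlines, block_size):
        # `with conn:` commits on success and rolls back if the block raises.
        with conn:
            conn.executemany(
                "INSERT INTO entry (headline) VALUES (?)",
                ((headline,) for headline in block),
            )
        commits += 1
    return commits


def demo():
    # Hypothetical table, used only for this illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, headline TEXT)")
    commits = insert_in_blocks(conn, (f"Headline #{i}" for i in range(2500)))
    rows = conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0]
    conn.close()
    return commits, rows
```

Here 2500 rows cost three commits rather than 2500. A failure rolls back only the block in progress, so earlier blocks stay committed.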

New as of Django 1.6 is atomic, a simple API to control DB transactions. Copied verbatim from the docs:
atomic is usable both as a decorator:
from django.db import transaction

@transaction.atomic
def viewfunc(request):
    # This code executes inside a transaction.
    do_stuff()
and as a context manager:
from django.db import transaction

def viewfunc(request):
    # This code executes in autocommit mode (Django's default).
    do_stuff()
    with transaction.atomic():
        # This code executes inside a transaction.
        do_more_stuff()
Legacy django.db.transaction functions autocommit(), commit_on_success(), and commit_manually() have been deprecated and will be removed in Django 1.8.

I think this is the method you are looking for: https://docs.djangoproject.com/en/dev/ref/models/querysets/#bulk-create
Code copied from the docs:
Entry.objects.bulk_create([
    Entry(headline='This is a test'),
    Entry(headline='This is only a test'),
])
Which in practice, would look like:
my_entries = []
for i in range(100):
    my_entries.append(Entry(headline='Headline #' + str(i)))
Entry.objects.bulk_create(my_entries)
According to the docs, this executes a single query regardless of the size of the list (except on SQLite3, where batches are capped so that a query uses at most 999 variables), which can't be said for the atomic decorator.
There is an important distinction to make. From the OP's question, it sounds like they are attempting to bulk create rather than bulk save. The atomic decorator is the fastest solution for saving, but not for creating.
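The 999 figure is, per Django's bulk_create documentation, a cap on query variables rather than on rows, so the number of rows that fit in one INSERT depends on how many columns each row fills. A rough sketch of that arithmetic follows. The helper names are made up for illustration, and the exact batching Django performs may differ.

```python
import math

# Assumed limit, taken from the "999" figure quoted above.
SQLITE_MAX_VARIABLES = 999


def rows_per_statement(fields_per_row, max_variables=SQLITE_MAX_VARIABLES):
    """How many rows fit in one INSERT if each field uses one variable."""
    return max(1, max_variables // fields_per_row)


def statements_needed(row_count, fields_per_row,
                      max_variables=SQLITE_MAX_VARIABLES):
    """How many INSERT statements are needed for `row_count` rows."""
    per_statement = rows_per_statement(fields_per_row, max_variables)
    return math.ceil(row_count / per_statement)
```

For example, 1000 rows with four fields each would need five statements under these assumptions, which is still far fewer than 1000 individual save() calls.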

"How can I aggregate a large number of save() calls into a single database operation?"
You don't need to. Django already manages a cache for you. You can't improve its DB caching by trying to fuss around with saves.
"write performance problems are probably related to the fact that I'm creating a large number of rows"
Correct.
SQLite is pretty slow. That's the way it is. Queries are faster than in most other DBs. Writes are pretty slow.
Consider more serious architecture change. Are you loading rows during a web transaction (i.e., bulk uploading files and loading the DB from those files)?
If you're doing bulk loading inside a web transaction, stop. You need to do something smarter. Use celery or use some other "batch" facility to do your loads in the background.
We try to limit ourselves to file validation in a web transaction and do the loads when the user's not waiting for their page of HTML.

Related

Should business logic be enforced in django python or in sql?

I am writing a REST backend for a project. Here's a basic example of the kind of problem I'm trying to solve.
Students have tests with grades, and each student has a column current_average_grade.
Every time a test is stored, this average grade should be updated (using all existing tests).
So the question is: should this be calculated and stored within the POST view of Django (get all grades from the DB and then do the calculation), or with an SQL trigger, using Django only to convert JSON to SQL?
The advantage of using sql for this, is of course it should theoretically be much faster and you also get concurrency for free.
The disadvantage is that since I am now programming sql, I have yet another codebase to manage and it might even create problems with django.
So what's the ideal solution here? How do I enforce business logic in an elegant way?
I think handling it in Django views is the better idea. That way you keep control of the business logic in one place and you don't need to manage the database extensively.
And for handling concurrency Django provides a beautiful way in the form of select_for_update().
Handling Concurrency in Django
To acquire a lock on a resource we use a database lock.
And in Django we use select_for_update() for database lock.
Example Code
from django.db import transaction

entries = Entry.objects.select_for_update().filter(author=request.user)
with transaction.atomic():
    for entry in entries:
        # action on each entry in thread-safe way
        ...
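As a possible middle ground, and not what the answer above prescribes, application code can also ask the database to recompute the average in a single UPDATE inside the same transaction as the insert, so no stale value is read into Python. The sketch below uses the standard-library sqlite3 module. The student and test tables and the record_test helper are assumptions based on the question's description.

```python
import sqlite3


def record_test(conn, student_id, grade):
    """Store a test and refresh the student's average in one transaction."""
    # `with conn:` commits both statements together or rolls both back.
    with conn:
        conn.execute(
            "INSERT INTO test (student_id, grade) VALUES (?, ?)",
            (student_id, grade),
        )
        conn.execute(
            "UPDATE student SET current_average_grade = "
            "(SELECT AVG(grade) FROM test WHERE student_id = ?) "
            "WHERE id = ?",
            (student_id, student_id),
        )


def make_demo_db():
    # Hypothetical schema, used only for this illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE student (id INTEGER PRIMARY KEY, current_average_grade REAL)"
    )
    conn.execute(
        "CREATE TABLE test (id INTEGER PRIMARY KEY, student_id INTEGER, grade REAL)"
    )
    conn.execute("INSERT INTO student (id) VALUES (1)")
    conn.commit()
    return conn
```

The logic still lives in application code, so there is no trigger to maintain, yet the averaging itself runs in the database.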

Having issues doing fast enough db inserts inside a Flask endpoint

I have an HTTP POST endpoint in Flask which needs to insert whatever data comes in into a database. This endpoint can receive up to hundreds of requests per second. Doing an insert every time a new request comes takes too much time. I have thought that doing a bulk insert every 1000 request with all the previous 1000 request data should work like some sort of caching mechanism. I have tried saving 1000 incoming data objects into some collection and then doing a bulk insert once the array is 'full'.
Currently my code looks like this:
from flask import request

@app.route('/user', methods=['POST'])
def add_user():
    global bulk
    firstname = request.json['firstname']
    lastname = request.json['lastname']
    email = request.json['email']
    usr = User(firstname, lastname, email)
    bulk.append(usr)
    if len(bulk) > 1000:
        # Save the buffered objects first, then reset the buffer.
        db.session.bulk_save_objects(bulk)
        db.session.commit()
        bulk = []
    return user_schema.jsonify(usr)
The problem I'm having with this is that the database becomes 'locked', and I really don't know if this is a good solution but just poorly implemented, or a stupid idea.
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked
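The buffering idea in the question can be sketched so that a full batch is handed off before the buffer is reset and concurrent requests cannot interleave. This is a minimal, framework-free illustration. The BatchBuffer class and its flush callback are hypothetical names, and the flush callback would be whatever performs the bulk insert.

```python
import threading


class BatchBuffer:
    """Collect items and pass them to `flush` in batches of `batch_size`."""

    def __init__(self, flush, batch_size=1000):
        self._flush = flush
        self._batch_size = batch_size
        self._items = []
        self._lock = threading.Lock()

    def add(self, item):
        with self._lock:
            self._items.append(item)
            if len(self._items) < self._batch_size:
                return
            # Swap the full batch out while still holding the lock.
            batch, self._items = self._items, []
        # Flush outside the lock so other requests are not blocked on I/O.
        self._flush(batch)

    def flush_remaining(self):
        """Flush whatever is left, for example at shutdown."""
        with self._lock:
            batch, self._items = self._items, []
        if batch:
            self._flush(batch)
```

A buffer like this lives in one process only, so a server running several worker processes would hold several independent buffers, and any unflushed items are lost if a worker dies.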
Your error message indicates that you are using an sqlite DB with SQLAlchemy. You may want to try changing the setting of the sqlite "synchronous" flag to turn syncing OFF. This can speed INSERT queries up dramatically, but it does come with the increased risk of data loss. See https://sqlite.org/pragma.html#pragma_synchronous for more details.
With synchronous OFF (0), SQLite continues without syncing as soon as
it has handed data off to the operating system. If the application
running SQLite crashes, the data will be safe, but the database might
become corrupted if the operating system crashes or the computer loses
power before that data has been written to the disk surface. On the
other hand, commits can be orders of magnitude faster with synchronous
OFF
If your application and use case can tolerate the increased risks, then disabling syncing may negate the need for bulk inserts.
See "How to set SQLite PRAGMA statements with SQLAlchemy".
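For reference, this is roughly what setting the flag looks like with the standard-library sqlite3 module. It is a sketch only: the helper names are made up, and with SQLAlchemy the same PRAGMA would have to be issued on each new connection, which is what the linked question covers.

```python
import os
import sqlite3
import tempfile


def open_without_sync(path):
    """Open a SQLite database with syncing turned off for this connection."""
    conn = sqlite3.connect(path)
    # Faster commits, at the cost of possible corruption on power loss.
    conn.execute("PRAGMA synchronous = OFF")
    return conn


def current_synchronous(conn):
    """Return the numeric synchronous setting (0 means OFF)."""
    return conn.execute("PRAGMA synchronous").fetchone()[0]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        conn = open_without_sync(os.path.join(tmp, "demo.sqlite3"))
        try:
            return current_synchronous(conn)
        finally:
            conn.close()
```

The setting applies per connection, so every connection that writes needs to set it.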
Once I moved the code to AWS and used an Aurora instance as the database, the problems went away, so I suppose it's safe to conclude that the issues were solely related to my sqlite3 instance.
The final solution gave me satisfactory results and I ended up changing only this line:
db.session.bulk_save_objects(bulk)
to this:
db.session.add_all(bulk)
I can now safely make 400 or more calls per second to that endpoint (I haven't tested for more), all ending with valid inserts.
Not an expert on this, but it seems like the database has reached its concurrency limits. You can try using Pony for better concurrency and transaction management:
https://docs.ponyorm.org/transactions.html
By default Pony uses the optimistic concurrency control concept for increasing performance. With this concept, Pony doesn’t acquire locks on database rows. Instead it verifies that no other transaction has modified the data it has read or is trying to modify.

Caching a static Database table in Django

I have a Django web application that is currently live and receiving a lot of queries. I am looking for ways to optimize its performance and one area that can be improved is how it interacts with its database.
In its current state, each request to a particular view loads an entire database table into a pandas dataframe, against which queries are done. This table consists of over 55,000 rows of text data (co-ordinates mostly).
To avoid needless queries, I have been advised to cache the database into memory and have it be cached upon the first time its loaded. This will remove some overhead on the DB side of things. I've never used this feature of Django before so I am a bit lost.
The Django manual does not seem to have a concrete implementation of what I want to do. Would it be a good idea to just store the entire table in memory or would storing it in a file be a better idea?
I had a similar problem and django-cache-machine worked like a charm. It uses the Django caching features to cache the results of your queries. It is very easy to set up (assuming you have already configured Django's cache backend):
pip install django-cache-machine
Then in the model you want to cache:
from django.db import models
from caching.base import CachingManager, CachingMixin

class MyModel(CachingMixin, models.Model):
    objects = CachingManager()
And that's it, your queries will be cached.

How to make Django QuerySet bulk delete() more efficient

Setup:
Django 1.1.2, MySQL 5.1
Problem:
Blob.objects.filter(foo = foo) \
.filter(status = Blob.PLEASE_DELETE) \
.delete()
This snippet results in the ORM first generating a SELECT * from xxx_blob where ... query, then doing a DELETE from xxx_blob where id in (BLAH); where BLAH is a ridiculously long list of id's. Since I'm deleting a large amount of blobs, this makes both me and the DB very unhappy.
Is there a reason for this? I don't see why the ORM can't convert the above snippet into a single DELETE query. Is there a way to optimize this without resorting to raw SQL?
For those who are still looking for an efficient way to bulk delete in django, here's a possible solution:
The reason delete() may be so slow is twofold: 1) Django has to ensure cascade deletion works properly, so it looks for foreign key references to your models; 2) Django has to handle pre_delete and post_delete signals for your models.
If you know your models don't have cascade deleting or signals to be handled, you can accelerate this process by resorting to the private API _raw_delete as follows:
queryset._raw_delete(queryset.db)
More details here. Note that Django already tries to handle these cases well, but the raw delete is, in many situations, much more efficient.
Not without writing your own custom SQL or managers or something; they are apparently working on it though.
http://code.djangoproject.com/ticket/9519
Bulk delete is already part of django
Keep in mind that this will, whenever possible, be executed purely in SQL

Django, how to make a view atomic?

I have a simple django app to simulate a stock market, users come in and buy/sell. When they choose to trade,
the market price is read, and
based on the buy/sell order the market price is increased/decreased.
I'm not sure how this works in django, but is there a way to make the view atomic? i.e. I'm concerned that user A's actions may read the price but before it's updated because of his order, user B's action reads the price.
Couldn't find a simple, clean solution for this online. Thanks.
This is what database transactions are for, with some caveats. All of these notes are for PostgreSQL; every database has locking mechanisms, but the details differ.
Many databases don't do this level of locking by default, even if you're in a transaction. You need to get an explicit lock on the data.
In PostgreSQL, you probably want SELECT ... FOR UPDATE, which locks the returned rows. You need to use FOR UPDATE on every SELECT that should block while another user is about to update those rows.
When this was written there was no way to do a FOR UPDATE in Django's ORM, so you'd either need to hack the ORM a bit or use raw SQL, as far as I know (later Django releases added select_for_update()). If this is low-performance code and you can afford to serialize all access to the table, you can use a table-level LOCK IN EXCLUSIVE MODE, which will serialize the whole table.
http://www.postgresql.org/docs/current/static/explicit-locking.html
http://www.postgresql.org/docs/current/static/sql-lock.html
http://www.postgresql.org/docs/current/static/sql-select.html
Pretty old question, but since 1.6 you can use transaction.atomic as a decorator.
views.py
from django.db import transaction

@transaction.atomic
def stock_action(request):
    # trade here
    ...
Wrap the DB queries that read and the ones that update in a transaction. The syntax depends on what ORM you are using.
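The read-then-update pattern can be sketched with the standard-library sqlite3 module. This swaps in SQLite's BEGIN IMMEDIATE, which takes the write lock before the read, in place of PostgreSQL's SELECT ... FOR UPDATE. The market table, the place_order helper and the fixed price step are assumptions made for illustration.

```python
import sqlite3


def place_order(conn, side, step=1.0):
    """Read and update the price under a write lock taken before the read."""
    if side not in ("buy", "sell"):
        raise ValueError("side must be 'buy' or 'sell'")
    # BEGIN IMMEDIATE acquires the write lock up front, so no other writer
    # can change the price between our SELECT and our UPDATE.
    conn.execute("BEGIN IMMEDIATE")
    try:
        price = conn.execute("SELECT price FROM market WHERE id = 1").fetchone()[0]
        new_price = price + step if side == "buy" else price - step
        conn.execute("UPDATE market SET price = ? WHERE id = 1", (new_price,))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return price, new_price


def make_market(initial_price=100.0):
    # isolation_level=None leaves transaction control to the explicit
    # BEGIN / COMMIT / ROLLBACK statements above.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE market (id INTEGER PRIMARY KEY, price REAL)")
    conn.execute("INSERT INTO market (id, price) VALUES (1, ?)", (initial_price,))
    return conn
```

The point is that the lock is taken before the price is read, not merely that the two statements share a transaction.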
