I have a set of Celery tasks that I've written. Each of these tasks takes (as an example) an author id as a parameter and, for each of that author's books, fetches the latest price and stores it in the database.
I'd like to add transactions to my tasks by applying Django's @transaction.commit_on_success decorator to them. If any task crashes, I'd like the whole task to fail and nothing to be saved to the database.
I have a dozen or so Celery workers that check the prices of books for an author, and I'm wondering whether this simple transactional logic would cause locking and race conditions in my Postgres database.
I've dug around and found a project called django-celery-transactions, but I still haven't understood the real issue behind this and what that project tries to solve.
The reason is that, with the decorator applied, the DB transaction in your Django view is not committed until the view has exited. Inside the view, before it returns and triggers the commit, you may invoke tasks that expect that transaction to already be committed, i.e. that expect those rows to already exist in the database.
To guard against this race condition (the task starting before your view, and consequently its transaction, has finished) you can either manage it manually or use the module you mentioned, which handles it for you automatically.
An example of where it might fail in your case: you add a new author and you have a task that fetches prices for all/any of its books. Should the task execute before the transaction that inserts the new author is committed, the task will try to fetch an Author with an id that does not yet exist.
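For illustration, here is a minimal sketch of the manual guard on newer Django versions (1.9+, where transaction.on_commit exists); the Author model, the fetch_book_prices task and the view name are made up, and django-celery-transactions achieves roughly the same thing automatically:

from django.db import transaction
from django.http import HttpResponse

from myapp.models import Author            # assumed model
from myapp.tasks import fetch_book_prices  # assumed Celery task

def create_author(request):
    with transaction.atomic():
        author = Author.objects.create(name=request.POST["name"])
        # Enqueue only once the surrounding transaction has committed,
        # so the worker is guaranteed to find the new Author row.
        transaction.on_commit(lambda: fetch_book_prices.delay(author.id))
    return HttpResponse("ok")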
It depends on several things, including the transaction isolation level of your database, how frequently you check for price updates, and how often you expect prices to change. If, for example, you were making a very large number of updates per second against stock-standard PostgreSQL, you might get different results when executing the same SELECT statement multiple times within one transaction.
Databases are optimized to handle concurrency so I don't think this is going to be a problem for you; especially if you don't open the transaction until after fetching prices (i.e. use a context manager rather than decorating the task). If — for some reason — things get slow in the future, optimize then (fetch prices less frequently, tweak database configuration, etc.).
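As a rough sketch of that idea (using transaction.atomic from Django 1.6+; on older versions commit_on_success works as a context manager in the same spot, and the task, model and fetch_latest_price helper are assumed names):

from celery import shared_task
from django.db import transaction

from myapp.models import Book                 # assumed model
from myapp.pricing import fetch_latest_price  # assumed helper doing the network call

@shared_task
def update_prices(author_id):
    books = list(Book.objects.filter(author_id=author_id))

    # Do the slow network I/O outside of any database transaction.
    prices = {book.pk: fetch_latest_price(book.isbn) for book in books}

    # Keep the transaction (and any locks it takes) as short as possible.
    with transaction.atomic():
        for book in books:
            Book.objects.filter(pk=book.pk).update(price=prices[book.pk])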
As for your other question: django-celery-transactions aims to prevent race conditions between Django and Celery. One example is passing the primary key of a newly created object to a task: the task may attempt to retrieve the object before the view's transaction has been committed. Boom!
I am trying to update a postgresql table where the business logic for the updates is written in python and the database connection uses psycopg2. To ensure consistency, this will require locking the table. Loosely what I would like to do is:
Lock the table for updates (SHARE LOCK)
Query the table for some information
Given the state of that table and the user's input, determine how the table needs to change (python business logic)
Update the table
Unlock the table
The logic for determining the differences between the existing table's state and the user input will be done in python using pandas.
As I understand it, locks are per transaction, so in order to hold a lock on the table, all of this must happen as part of one SQL transaction. However, I need to lock for updates before querying the data, and I have to run some business logic in Python before actually updating the table. This sounds like two transactions to me, so I'm not sure how I can ensure that the data doesn't change during these steps. Any help is appreciated. Thanks!
Database transactions should never be that long – they should definitely never contain user interaction.
You have two options: add a table column such as locked_at and persist the "locks" that way (essentially, lock in the application), or use advisory locks, which can span several transactions.
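A sketch of the advisory-lock variant with psycopg2 (the DSN, the prices table and the lock key are placeholders, not taken from your schema):

import psycopg2

LOCK_KEY = 42  # application-chosen bigint identifying "the prices table"

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN
cur = conn.cursor()
try:
    # Session-level advisory lock: held until explicitly released (or the
    # connection closes), so it can span several transactions if needed.
    cur.execute("SELECT pg_advisory_lock(%s)", (LOCK_KEY,))

    cur.execute("SELECT id, price FROM prices")
    rows = cur.fetchall()

    # ... python/pandas business logic decides the new values here ...
    updates = [(price * 2, row_id) for row_id, price in rows]

    cur.executemany("UPDATE prices SET price = %s WHERE id = %s", updates)
    conn.commit()
except Exception:
    conn.rollback()
    raise
finally:
    cur.execute("SELECT pg_advisory_unlock(%s)", (LOCK_KEY,))
    conn.commit()
    conn.close()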
I am using SQLAlchemy's ORM. I have a model that has multiple many-to-many relationships:
User
User <--MxN--> Organization
User <--MxN--> School
User <--MxN--> Credentials
I am implementing these using association tables, so there are also User_to_Organization, User_to_School and User_to_Credentials tables that I don't directly use.
Now, when I attempt to load a single User (using its PK identifier) and its relationships (and related models) using joined eager loading, I get horrible performance (15+ seconds). I assume this is due to this issue:
When multiple levels of depth are used with joined or subquery loading, loading collections-within-collections will multiply the total number of rows fetched in a cartesian fashion. Both forms of eager loading always join from the original parent class.
If I introduce another level or two to the hierarchy:
Organization <--1xN--> Project
School <--1xN--> Course
Project <--MxN--> Credentials
Course <--MxN--> Credentials
The query takes 50+ seconds to complete, even though the total amount of records in each table is fairly small.
Using lazy loading, I am required to manually load each relationship, and there are multiple round trips to the server.
e.g.
Operations, executed serially as queries:
Get user
Get user's Organizations
Get user's Schools
Get user's credentials
For each Organization, get its Projects
For each School, get its Courses
For each Project, get its Credentials
For each Course, get its Credentials
Still, it all finishes in less than 200ms.
I was wondering if there is any way to keep using lazy loading but perform the relationship-loading queries in parallel, for example using the concurrent.futures module, asyncio, or gevent.
e.g.
Step 1 (in parallel):
Get user
Get user's Organizations
Get user's Schools
Get user's credentials
Step 2 (in parallel):
For each Organization, get its Projects
For each School, get its Courses
Step 3 (in parallel):
For each Project, get its Credentials
For each Course, get its Credentials
Actually, at this point, a subquery-style load could also work, that is, returning Organization and OrganizationID/Project/Credentials in two separate queries:
e.g.
Step 1 (in parallel):
Get user
Get user's Organizations
Get user's Schools
Get user's credentials
Step 2 (in parallel):
Get Organizations
Get Schools
Get the Organizations' Projects, join with Credentials
Get the Schools' Courses, join with Credentials
The first thing you're going to want to do is check what queries are actually being executed on the DB. I wouldn't assume that SQLAlchemy is doing what you expect unless you're very familiar with it. You can use echo=True in your engine configuration or look at the DB logs (I'm not sure how to do that with MySQL).
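For example (the connection URL below is just a placeholder):

from sqlalchemy import create_engine

# echo=True logs every SQL statement the engine emits, which is the quickest
# way to see exactly what each loader strategy is doing.
engine = create_engine("mysql+pymysql://user:pass@localhost/mydb", echo=True)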
You've mentioned that you're using different loading strategies, so I guess you've read through the docs on that (http://docs.sqlalchemy.org/en/latest/orm/loading_relationships.html). For what you're doing, I'd probably recommend subquery load, but it totally depends on the number of rows/columns you're dealing with. In my experience it's a good general starting point though.
One thing to note: you might need to do something like:
from sqlalchemy.orm import subqueryload
db.query(Thing).options(subqueryload('A').subqueryload('B')).filter(Thing.id == x).first()
Use filter(...).first() rather than get(), as the latter won't re-execute queries according to your loading strategy if the primary object is already in the identity map.
Finally, I don't know your data - but those numbers sound pretty abysmal for anything short of a huge data set. Check that you have the correct indexes specified on all your tables.
You may have already been through all of this, but based on the information you've provided, it sounds like you need to do more work to narrow down your issue. Is it the db schema, or is it the queries SQLA is executing?
Either way, I'd say, "no" to running multiple queries on different connections. Any attempt to do that could result in inconsistent data coming back to your app, and if you think you've got issues now..... :-)
MySQL has no parallelism within a single connection. For the ORM to do so would require multiple connections to MySQL. Generally, the overhead of trying to do this is "not worth it".
Getting a user, his Organizations, Schools, etc. can all be done (in MySQL) via a single query:
SELECT user, organization, ...
FROM Users
JOIN Organizations ON ...
etc.
This is significantly more efficient than
SELECT user FROM ...;
SELECT organization ... WHERE user = ...;
etc.
(This is not "parallelism".)
Or maybe your "steps" are not quite 'right'?...
SELECT user, organization, project
FROM Users
JOIN Organizations ...
JOIN Projects ...
That gets, in a single step, all users, together with all their organizations and projects.
But is a "user" associated with a "project"? If not, then this is the wrong approach.
If the ORM is not providing a mechanism to generate queries like those, then it is "getting in the way".
Hi there, so I am using Django REST Framework 3.1 and I was wondering if it's possible to "protect" my viewsets / database against writes on a per-user basis.
In other words, if one user is saving something, another one cannot save at the same time; it either waits until the first user finishes or returns some kind of error.
I tried looking for this answer but couldn't find it.
Is this behavior already implemented? If not, how can I achieve this in practice?
UPDATE after some more thinking:
This is still just a theory and needs more thinking, but if we use a queue (Redis or RabbitMQ) we can put all synchronization write requests in the queue instead of processing them right away. In conjunction with some user-specific lock variable (maybe in the user sessions DB table), we can check whether there are any users in front of us belonging to the same proponent and whether those users have finished writing their updates or not (using the lock).
cheers
Database transactions will provide some of the safety you're looking for, I think. If a number of database operations are wrapped in a transaction, they are applied to the database together, so a sequence of operations cannot fail mid-way through and leave the database in an invalid state.
Other users will see the results of the operations as if they were applied all at once, or not at all (in the case of an error).
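As a minimal sketch of what that looks like in a DRF viewset (the Order model and serializer are made-up names):

from django.db import transaction
from rest_framework import viewsets

from myapp.models import Order                  # assumed model
from myapp.serializers import OrderSerializer   # assumed serializer

class OrderViewSet(viewsets.ModelViewSet):
    queryset = Order.objects.all()
    serializer_class = OrderSerializer

    def perform_create(self, serializer):
        # Either everything inside this block is committed, or nothing is.
        with transaction.atomic():
            serializer.save()
            # ...any related writes placed here belong to the same transaction...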
Background
The get() method is special in SQLAlchemy's ORM because it tries to return objects from the identity map before issuing a SQL query to the database (see the documentation).
This is great for performance, but it can cause problems for distributed applications, because an object may have been modified by another process; the local process has no way of knowing that the object is stale and will keep retrieving the stale object from the identity map whenever get() is called.
Question
How can I force get() to ignore the identity map and issue a call to the DB every time?
Example
I have a Company object defined in the ORM.
I have a price_updater() process which updates the stock_price attribute of all the Company objects every second.
I have a buy_and_sell_stock() process which buys and sells stocks occasionally.
Now, inside this process, I may have loaded a microsoft = Company.query.get(123) object.
A few minutes later, I may issue another call for Company.query.get(123). The stock price has changed since then, but my buy_and_sell_stock() process is unaware of the change because it happened in another process.
Thus, the get(123) call returns the stale version of the Company from the session's identity map, which is a problem.
I've done a search on SO (under the [sqlalchemy] tag) and read the SQLAlchemy docs to try to figure out how to do this, but haven't found a way.
Using session.expire(my_instance) will cause the data to be re-selected on access. However, even if you use expire (or expunge), the next data that is fetched will be based on the transaction isolation level. See the PostgreSQL docs on isolations levels (it applies to other databases as well) and the SQLAlchemy docs on setting isolation levels.
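A small sketch following the Company example above (assuming session is the session behind Company.query):

microsoft = Company.query.get(123)   # may come straight from the identity map
session.expire(microsoft)            # mark all of its attributes as stale
print(microsoft.stock_price)         # attribute access now re-selects from the DB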
You can test whether an instance is in the session with the in operator: my_instance in session.
You can use filter instead of get to bypass the cache, but it still has the same isolation level restriction.
Company.query.filter_by(id=123).one()
I have a simple django app to simulate a stock market, users come in and buy/sell. When they choose to trade,
the market price is read, and
based on the buy/sell order the market price is increased/decreased.
I'm not sure how this works in Django, but is there a way to make the view atomic? That is, I'm concerned that user A's action may read the price and, before it is updated as a result of his order, user B's action reads the same price.
Couldn't find a simple, clean solution for this online. Thanks.
This is what database transactions are for, with some notes. All notes are for PostgreSQL; all databases have locking mechanisms, but the details are different.
Many databases don't do this level of locking by default, even if you're in a transaction. You need to get an explicit lock on the data.
In Postgresql, you probably want SELECT ... FOR UPDATE, which will lock the returned rows. You need to use FOR UPDATE on every SELECT that wants to block if another user is about to update them.
Unfortunately, there's no way to do a FOR UPDATE in Django's ORM. You'd either need to hack the ORM a bit or use raw SQL, as far as I know. If this is low-performance code and you can afford to serialize all access to the table, you can use a table-level LOCK IN EXCLUSIVE MODE, which will serialize the whole table.
http://www.postgresql.org/docs/current/static/explicit-locking.html
http://www.postgresql.org/docs/current/static/sql-lock.html
http://www.postgresql.org/docs/current/static/sql-select.html
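As a sketch of the raw-SQL route from Django (the market_stock table and price column are placeholders; transaction.atomic needs Django 1.6+, older versions can use commit_on_success instead):

from django.db import connection, transaction

with transaction.atomic():
    cursor = connection.cursor()
    # FOR UPDATE locks the selected row until the transaction ends, so a
    # concurrent trade blocks here instead of reading a soon-to-change price.
    cursor.execute("SELECT price FROM market_stock WHERE id = %s FOR UPDATE", [1])
    (price,) = cursor.fetchone()
    new_price = price + 1  # adjust according to the buy/sell order
    cursor.execute("UPDATE market_stock SET price = %s WHERE id = %s", [new_price, 1])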
Pretty old question, but since Django 1.6 you can use transaction.atomic() as a decorator.
views.py
from django.db import transaction

@transaction.atomic()
def stock_action(request):
    # trade here
    ...
Wrap the DB queries that read and the ones that update in a transaction. The syntax depends on what ORM you are using.
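For example, with Django's ORM (the Stock model and price field are assumed names; select_for_update() is available on reasonably recent Django versions and locks the row between the read and the update):

from django.db import transaction

from market.models import Stock  # assumed model

def trade(stock_id, delta):
    with transaction.atomic():
        stock = Stock.objects.select_for_update().get(pk=stock_id)  # read, row locked
        stock.price += delta                                        # business logic
        stock.save(update_fields=["price"])                         # write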