Non-nesting version of @atomic() in Django?

From the docs of atomic():
atomic blocks can be nested
This sounds like a great feature, but in my use case I want the opposite: I want the transaction to be durable as soon as the block decorated with @atomic() is left successfully.
Is there a way to ensure durability in Django's transaction handling?
Background
Transactions are ACID. The "D" stands for durability. That's why I think transactions can't be nested without losing the "D".
Example: if the inner transaction succeeds but the outer transaction does not, both the outer and the inner transaction get rolled back. The result: the inner transaction was not durable.
I use PostgreSQL, but AFAIK this should not matter much.

You can't do that through any API.
Transactions can't be nested while retaining all ACID properties, and not all databases support nested transactions.
Only the outermost atomic block creates a transaction. Inner atomic blocks create a savepoint inside the transaction, and release or roll back the savepoint when exiting the inner block. As such, inner atomic blocks provide atomicity, but as you noted, not e.g. durability.
Since the outermost atomic block creates a transaction, it must provide atomicity, and you can't commit a nested atomic block to the database if the containing transaction is not committed.
The only way to ensure that the inner block is committed is to make sure that the code in the transaction finishes executing without any errors.
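To make the savepoint behaviour concrete, here is a minimal sketch (the Account model and its field are hypothetical):

from django.db import transaction

def demo():
    try:
        with transaction.atomic():        # outermost block: opens the transaction
            with transaction.atomic():    # inner block: only a savepoint
                Account.objects.create(name='inner')  # Account is a hypothetical model
            # The inner block exited successfully, releasing its savepoint...
            raise ValueError('outer failure after the inner block succeeded')
    except ValueError:
        pass
    # ...but releasing a savepoint is not a COMMIT, so the outer rollback
    # took the inner work with it:
    assert not Account.objects.filter(name='inner').exists()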

I agree with knbk's answer that it is not possible: durability is only present at the level of a transaction, and atomic provides that. It does not provide it at the level of save points. Depending on the use case, there may be workarounds.
I'm guessing your use case is something like:
@atomic  # possibly implicit if ATOMIC_REQUESTS is enabled
def my_view():
    run_some_code()          # It's fine if this gets rolled back.
    charge_a_credit_card()   # It's not OK if this gets rolled back.
    run_some_more_code()     # This shouldn't roll back the credit card.
I think you'd want something like:
@transaction.non_atomic_requests
def my_view():
    with atomic():
        run_some_code()
    with atomic():
        charge_a_credit_card()
    with atomic():
        run_some_more_code()
If your use case is for credit cards specifically (as mine was when I had this issue a few years ago), my coworker discovered that credit card processors actually provide mechanisms for handling this. A similar mechanism might work for your use case, depending on the problem structure:
@atomic
def my_view():
    run_some_code()
    result = charge_a_credit_card(capture=False)
    if result.successful:
        transaction.on_commit(lambda: result.capture())
    run_some_more_code()
Another option would be to use a non-transactional persistence mechanism for recording what you're interested in, like a log database, or a redis queue of things to record.
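For example, a minimal sketch of the redis-queue idea, assuming the redis-py package and an illustrative queue name:

import json
import redis

r = redis.Redis()

def record_event(event):
    # Redis is outside the database transaction, so this write survives
    # even if the surrounding atomic() block is rolled back later.
    r.rpush('audit-log', json.dumps(event))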

This type of durability (a nested block staying committed while the outer block gets rolled back) is impossible with one connection, because of ACID. It is a consequence of ACID, not a problem of Django. Imagine, in any database, the case where table B has a foreign key to table A:
CREATE TABLE A (id serial PRIMARY KEY);
CREATE TABLE B (id serial PRIMARY KEY, a_id integer REFERENCES A (id));

-- transaction
INSERT INTO A DEFAULT VALUES RETURNING id AS new_a_id;
-- imagine an inner transaction were possible here
INSERT INTO B (a_id) VALUES (new_a_id);
-- inner commit
-- outer rollback (= integrity problem)
If the inner "transaction" were durable while the (outer) transaction got rolled back, integrity would be broken. The rollback operation must always be implemented so that it can never fail, therefore no database implements nested independent transactions. Such a selective rollback would violate causality, integrity could no longer be guaranteed, and it is also against atomicity.
A transaction is tied to a database connection. If you create two connections, two independent transactions are created. One connection doesn't see uncommitted rows of other transactions (this isolation level can be configured, depending on the database backend), no foreign keys to them can be created, and integrity is preserved after a rollback by the database backend's design.
Django supports multiple databases, therefore multiple connections.
# no ATOMIC_REQUESTS should be set for "other_db" in DATABASES
@transaction.atomic  # atomic for the database "default"
def my_view():
    with atomic():  # or set atomic() here, for the database "default"
        some_code()
        with atomic("other_db"):
            row = OtherModel.objects.using("other_db").create(**kwargs)
        raise DatabaseError
The data in "other_db" stays committed.
It is probably possible in Django to play a trick with two connections to the same database, as if they were two databases, with some database backends, but I'm sure it is untested: it would be prone to mistakes, cause problems with migrations, put a bigger load on the database backend (which must run truly parallel transactions on every request), and it cannot be optimized. It is better to use two real databases or to reorganize the code.
The setting DATABASE_ROUTERS is very useful, but I'm not sure yet if you are interested in multiple connections.
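For reference, a minimal router sketch (the app label "audit" and the module path are made up):

# settings.py
DATABASE_ROUTERS = ['myapp.routers.AuditRouter']

# myapp/routers.py
class AuditRouter:
    # Route everything in the hypothetical "audit" app to "other_db".
    def db_for_read(self, model, **hints):
        return 'other_db' if model._meta.app_label == 'audit' else None

    def db_for_write(self, model, **hints):
        return 'other_db' if model._meta.app_label == 'audit' else None

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        if app_label == 'audit':
            return db == 'other_db'
        return None  # no opinion for other apps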

Even though this exact behaviour is not possible, since Django 3.2 there is a durable=True option (@transaction.atomic(durable=True)) to make sure that such a block of code isn't nested, so that if such code is accidentally run nested it raises a RuntimeError.
https://docs.djangoproject.com/en/dev/topics/db/transactions/#django.db.transaction.atomic
An article on this issue https://seddonym.me/2020/11/19/trouble-atomic/
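A quick sketch of how durable behaves (the function names are made up):

from django.db import transaction

@transaction.atomic(durable=True)
def charge():
    ...  # guaranteed to run in the outermost atomic block

@transaction.atomic
def caller():
    charge()  # raises RuntimeError: a durable block must not be nested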


What is the use case for Django's on_commit?

Reading this documentation https://docs.djangoproject.com/en/4.0/topics/db/transactions/#django.db.transaction.on_commit
This is the use case given for on_commit:
with transaction.atomic():  # Outer atomic, start a new transaction
    transaction.on_commit(foo)
    # Do things...
    with transaction.atomic():  # Inner atomic block, create a savepoint
        transaction.on_commit(bar)
        # Do more things...
# foo() and then bar() will be called when leaving the outermost block
But why not just write the code like normal without on_commit hooks? Like this:
with transaction.atomic():  # Outer atomic, start a new transaction
    # Do things...
    with transaction.atomic():  # Inner atomic block, create a savepoint
        # Do more things...
foo()
bar()
# foo() and then bar() will be called when leaving the outermost block
It's easier to read since it doesn't require more knowledge of the Django APIs and the statements are put in the order of when they are executed. It's easier to test since you don't have to use any special test classes for Django.
So what is the use-case for the on_commit hook?
The example code given in the Django docs is transaction.on_commit(lambda: some_celery_task.delay('arg1')), and that example was probably chosen because this comes up a lot with Celery tasks.
Imagine if you do the following within a transaction:
my_object = MyObject.objects.create()
some_celery_task.delay(my_object.pk)
Then in your celery task you try doing this:
@app.task
def some_celery_task(object_pk):
    my_object = MyObject.objects.get(pk=object_pk)
This may work a lot of the time, but randomly you'll get errors where it's not able to find the object (depending on how fast the work task is run because it's a race condition). This is because you created a MyObject record within a transaction, but it isn't actually available in the database until a COMMIT is run. Celery has no access to that open transaction, so it needs to be run after the COMMIT. There's also the very real possibility that something later on causes a ROLLBACK and that celery task should never actually be called.
So... You need to do:
my_object = MyObject.objects.create()
transaction.on_commit(lambda: some_celery_task.delay(my_object.pk))
Now, the celery task won't be called until the MyObject has actually been saved to the database after the COMMIT was called.
I should note, though, this is primarily only a concern when you aren't using AUTOCOMMIT mode (which is actually the default). If you're in AUTOCOMMIT mode then you can be certain that a commit has finished as part of a .create() or .save(). However, if your code base has any possibility of being called within a @transaction.atomic() then it's no longer AUTOCOMMIT and you're back to needing .on_commit(), so it's best/safest to always use it.
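On the testability concern raised above: since Django 3.2 you can assert on these hooks in tests with TestCase.captureOnCommitCallbacks. A sketch, with a made-up URL and payload:

from django.test import TestCase

class SignupTests(TestCase):
    def test_task_enqueued_on_commit(self):
        # execute=True runs the captured callbacks as if the
        # surrounding transaction had committed.
        with self.captureOnCommitCallbacks(execute=True) as callbacks:
            self.client.post('/signup/', {'email': 'a@example.com'})
        self.assertEqual(len(callbacks), 1)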
Django documentation:
Django provides the on_commit() function to register callback functions that should be executed after a transaction is successfully committed.
That is its main purpose. A transaction is a unit of work that you want to treat atomically: it either happens completely or not at all. The same applies to your code: if something went wrong during the DB operations, you might not need to do some of the follow-up actions.
Let's consider some business logic flow:
1. The user sends his registration data to our endpoint; we validate it, etc.
2. We save the new user to our DB.
3. We send him a "hello" email with a link for confirming his account.
If something goes wrong during step 2, we shouldn't go on to step 3.
You might think: well, I'll get an exception and the later code won't execute anyway. Why do we still need on_commit?
Sometimes you take actions in your code based on the assumption that the transaction will succeed, before potentially dangerous DB operations. For example, you might first want to check whether you can send an email to your user at all, because you know your third-party email provider often returns a 500. In that case you'd want to raise a 500 for the user and ask him to register later (a very bad idea, btw, but it's just a synthetic example).
When your function (e.g. with the @atomic decorator) contains a lot of DB operations, you surely don't want to memorize all the variables' states in order to use them after all the DB-related code. Consider this flow:
1. Validate the user's order.
2. Check in the DB whether it can be completed.
3. If it can, send a request with the order's details to a third-party CRM.
4. If it can't, create a support ticket in another third party.
5. Save the user's order to the DB and update the user's model.
6. Send a messenger notification to the employee responsible for the order.
7. Save to the DB that the notification was sent successfully.
You can imagine what a mess we would have without on_commit in this situation, with one really big try-catch around all of it.
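A minimal sketch of that flow (all helper names are hypothetical), deferring the side effects until the whole transaction commits:

from django.db import transaction

@transaction.atomic
def place_order(request):
    order = save_order(request)           # DB write
    update_user(request.user, order)      # DB write
    # These run only if every DB operation above actually commits:
    transaction.on_commit(lambda: notify_crm(order))
    transaction.on_commit(lambda: message_employee(order))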

Is it thread-safe to use SQLAlchemy with engine/connections instead of sessions?

I've got a simple webservice which uses SQLAlchemy to connect to a database using the pattern
engine = create_engine(database_uri)
connection = engine.connect()
In each endpoint of the service, I then use the same connection, in the following fashion:
for result in connection.execute(query):
    <do something fancy>
Since Sessions are not thread-safe, I'm afraid that connections aren't either.
Can I safely keep doing this? If not, what's the easiest way to fix it?
Minor note -- I don't know if the service will ever run multithreaded, but I'd rather be sure that I don't get into trouble when it does.
Short answer: you should be fine.
There is a difference between a connection and a Session. The short description is that connections represent just that… a connection to a database. Information you pass into it will come out pretty plain. It won't keep track of your transactions unless you tell it to, and it won't care about what order you send it data. So if it matters that you create your Widget object before you create your Sprocket object, then you better call that in a thread-safe context. Same generally goes for if you want to keep track of a database transaction.
Session, on the other hand, keeps track of data and transactions for you. If you check out the source code, you'll notice quite a bit of back and forth over database transactions, and without a way to know that you have everything you want in a transaction, you could very well end up committing in one thread while you expect to be able to add another object (or several) in another.
In case you don't know what a transaction is, see the Wikipedia article on database transactions; the short version is that transactions help make sure your data stays stable. If you have 15 inserts and updates, and insert 15 fails, you might not want to keep the other 14. A transaction lets you cancel the entire operation in bulk.
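If there is any chance of the service going multithreaded, one common pattern is to share the engine (whose connection pool is thread-safe) and check a connection out per unit of work. A sketch, assuming database_uri is defined as in the question:

from sqlalchemy import create_engine, text

engine = create_engine(database_uri)  # safe to share across threads

def handle_request(query, params):
    # Each call checks its own connection out of the pool,
    # so no connection is ever shared between threads.
    with engine.connect() as connection:
        for result in connection.execute(text(query), params):
            ...  # <do something fancy>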

Execute multiple independent statements in SQLAlchemy Core?

I'm using SQLAlchemy Core to run a few independent statements. The statements go to separate tables and are unrelated. Because of that I can't use the standard table.insert() with multiple dictionaries of params passed in. Right now, I'm doing this:
sql_conn.execute(query1)
sql_conn.execute(query2)
Is there any way I can run these in one shot instead of needing two back-and-forths to the db? I'm on MySQL 5.7 and Python 2.7.11.
Sounds like you want a Transaction:
with engine.connect() as sql_conn:
    with sql_conn.begin():
        sql_conn.execute(query1)
        sql_conn.execute(query2)
There is an implicit sql_conn.commit() above (when using the context manager) which commits the changes to the database in one trip. If you want to do it explicitly, it's done like this:
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://scott:tiger@localhost/test")
connection = engine.connect()
trans = connection.begin()
connection.execute(text("insert into x (a, b) values (1, 2)"))
trans.commit()
https://docs.sqlalchemy.org/en/14/core/connections.html#basic-usage
While this is mostly geared towards creating real database transactions, it has a useful side effect for your use case, where it will maintain a "virtual" transaction in SQLAlchemy, see this link for more info:
https://docs.sqlalchemy.org/en/14/orm/session_transaction.html#session-level-vs-engine-level-transaction-control
The Session tracks the state of a single "virtual" transaction at a time, using an object called SessionTransaction. This object then makes use of the underlying Engine or engines to which the Session object is bound in order to start real connection-level transactions using the Connection object as needed. This "virtual" transaction is created automatically when needed, or can alternatively be started using the Session.begin() method. To as great a degree as possible, Python context manager use is supported both at the level of creating Session objects as well as to maintain the scope of the SessionTransaction.
The above describes the ORM functionality but the link shows that it has parity with the Core functionality.
It is neither wise nor practical to run two queries at once. I am referring to a single call to the server containing two SQL statements one after another: "SELECT ...; SELECT ...;"
It is not wise because allowing such multi-statement calls gives hackers another way to do nasty things via SQL injection.
On the other hand, it is possible, but not necessarily practical. You could create a stored procedure that contains any number of related (or unrelated) queries, then CALL that procedure; a sketch follows the list below. Some things may make it impractical:
- The only way to get data in is via a finite number of scalar arguments.
- The output comes back as multiple result sets; you need to code differently to see what happened.
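A sketch of the stored-procedure route from SQLAlchemy Core, assuming a hypothetical procedure insert_both that wraps both inserts server-side:

from sqlalchemy import text

with engine.connect() as sql_conn:
    with sql_conn.begin():
        # One round trip; the procedure runs both INSERTs on the server.
        sql_conn.execute(text("CALL insert_both(:a, :b)"), {"a": 1, "b": 2})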
Roundtrip latency is insignificant if you are on the same machine as the MySQL server. It can usually be ignored even if the two servers are in the same datacenter. Latency becomes important when the client and server are separated by a long distance: cross-Atlantic latency is over 100ms, and Brazil to China is about 250ms. (Be glad we are not living on Jupiter.)

Does setting autocommit to true take longer than batch committing?

I have run a few trials and there seems to be some improvement in speed if I set autocommit to False.
However, I am worried that if I do only one commit at the end of my code, the database rows will not be updated in the meantime. So, for example, if I do several updates to the database and none are committed, does querying the database then give me the old data? Or does it know it should commit first?
Or, am I completely mistaken as to what commit actually does?
Note: I'm using pyodbc and MySQL. Also, the tables I'm using are InnoDB; does that make a difference?
There are some situations which trigger an implicit commit. However, under most circumstances not committing means data will be unavailable to other connections.
It also means that if another connection tries to perform an action that conflicts with an ongoing transaction (another connection locked that resource) the last request will have to wait for the lock to be released.
As for performance concerns, autocommit causes every change to take effect immediately. The performance hit will be quite noticeable on big tables, as on each commit indexes and constraints need to be updated/checked too. If you only commit after a series of queries, indexes/constraints are only updated at that time.
On the other hand, not committing frequently enough might cause the server to have too much work trying to maintain consistency between the two sets of data. So there is a trade-off.
And yes, using InnoDB makes a difference. If you were using, for instance, MyISAM, you wouldn't have transactions at all, so every change would be permanent immediately (similar to autocommit=True). On MyISAM you can play with the delay_key_write option.
For more information about transactions have a look at the official documentation. For more tips about optimization have a look at this article.
The default isolation level for InnoDB is REPEATABLE READ: all reads within a transaction see a consistent snapshot. A transaction always sees its own uncommitted changes, so if you insert rows and query them in the same transaction, you will see the newly inserted rows; other connections will not see them until you commit. (A different connection could only read your uncommitted rows by setting its isolation level to READ UNCOMMITTED.)
As long as you use the same connection, the database should show you a consistent view on the data, e.g. with all changes made so far in this transaction.
Once you commit, the changes will be written to disk and be visible to other (new) transactions and connections.
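A pyodbc sketch of the batched approach (the connection string and table are illustrative):

import pyodbc

conn = pyodbc.connect("DSN=mydb;UID=user;PWD=secret", autocommit=False)
cursor = conn.cursor()
try:
    cursor.execute("UPDATE t SET x = ? WHERE id = ?", (1, 10))
    cursor.execute("UPDATE t SET x = ? WHERE id = ?", (2, 20))
    # Queries on THIS connection already see the pending updates;
    # other connections see them only after the commit below.
    conn.commit()
except Exception:
    conn.rollback()
    raise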

SQLAlchemy: What's the difference between flush() and commit()?

What is the difference between flush() and commit() in SQLAlchemy?
I've read the docs, but am none the wiser - they seem to assume a pre-understanding that I don't have.
I'm particularly interested in their impact on memory usage. I'm loading some data into a database from a series of files (around 5 million rows in total) and my session is occasionally falling over - it's a large database and a machine with not much memory.
I'm wondering if I'm using too many commit() and not enough flush() calls - but without really understanding what the difference is, it's hard to tell!
A Session object is basically an ongoing transaction of changes to a database (update, insert, delete). These operations aren't persisted to the database until they are committed (if your program aborts for some reason in mid-session transaction, any uncommitted changes within are lost).
The session object registers transaction operations with session.add(), but doesn't yet communicate them to the database until session.flush() is called.
session.flush() communicates a series of operations to the database (insert, update, delete). The database maintains them as pending operations in a transaction. The changes aren't persisted permanently to disk, or visible to other transactions until the database receives a COMMIT for the current transaction (which is what session.commit() does).
session.commit() commits (persists) those changes to the database.
flush() is always called as part of a call to commit().
When you use a Session object to query the database, the query will return results both from the database and from the flushed parts of the uncommitted transaction it holds. By default, Session objects autoflush their operations, but this can be disabled.
Hopefully this example will make this clearer:
# ---
s = Session()
s.add(Foo('A'))  # The Foo('A') object has been added to the session.
                 # It has not been committed to the database yet,
                 # but is returned as part of a query.
print(1, s.query(Foo).all())
s.commit()

# ---
s2 = Session()
s2.autoflush = False

s2.add(Foo('B'))
print(2, s2.query(Foo).all())  # The Foo('B') object is *not* returned
                               # as part of this query because it hasn't
                               # been flushed yet.
s2.flush()                     # Now, Foo('B') is in the same state as
                               # Foo('A') was above.
print(3, s2.query(Foo).all())
s2.rollback()                  # Foo('B') has not been committed, and rolling
                               # back the session's transaction removes it
                               # from the session.
print(4, s2.query(Foo).all())
# ---
Output:
1 [<Foo('A')>]
2 [<Foo('A')>]
3 [<Foo('A')>, <Foo('B')>]
4 [<Foo('A')>]
This does not strictly answer the original question but some people have mentioned that with session.autoflush = True you don't have to use session.flush()... And this is not always true.
If you want to use the id of a newly created object in the middle of a transaction, you must call session.flush().
# Given a model with at least this id
class AModel(Base):
    __tablename__ = 'amodel'  # added so the sketch is complete; the name is arbitrary
    id = Column(Integer, primary_key=True)  # autoincrements by default on an integer primary key

session.autoflush = True

a = AModel()
session.add(a)
a.id             # None
session.flush()
a.id             # autoincremented integer
This is because autoflush does NOT fill in the id by itself (although a query for the object will trigger a flush that does, which sometimes causes confusion, as in "why does this work here but not there?" But snapshoe already covered this part).
One related aspect that seems pretty important to me and wasn't really mentioned:
Why would you not commit all the time? - The answer is atomicity.
A fancy word meaning: an ensemble of operations must all be executed successfully, or none of them will take effect.
For example, if you want to create/update/delete some object (A) and then create/update/delete another (B), but if (B) fails you want to revert (A). This means those 2 operations are atomic.
Therefore, if (B) needs a result of (A), you want to call flush after (A) and commit after (B).
Also, if session.autoflush is True, except for the case that I mentioned above or others in Jimbo's answer, you will not need to call flush manually.
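A sketch of that pattern with two hypothetical models, where B needs A's generated primary key:

a = ModelA(name='a')
session.add(a)
session.flush()               # a.id is populated; nothing is committed yet

try:
    session.add(ModelB(a_id=a.id))
    session.commit()          # A and B persist together
except Exception:
    session.rollback()        # the flushed A is undone as well
    raise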
Use flush when you need to simulate a write, for example to get a primary key ID from an autoincrementing counter.
john = Person(name='John Smith', parent=None)
session.add(john)
session.flush()
son = Person(name='Bill Smith', parent=john.id)
Without flushing, john.id would be None.
Like others have said, without commit() none of this will be permanently persisted to DB.
Why flush if you can commit?
As someone new to working with databases and sqlalchemy, the previous answers - that flush() sends SQL statements to the DB and commit() persists them - were not clear to me. The definitions make sense but it isn't immediately clear from the definitions why you would use a flush instead of just committing.
Since a commit always flushes (https://docs.sqlalchemy.org/en/13/orm/session_basics.html#committing), these sound really similar. I think the big issue to highlight is that a flush is not permanent and can be undone, whereas a commit is permanent, in the sense that you can't ask the database to undo the last commit (I think).
@snapshoe highlights that if you want to query the database and get results that include newly added objects, you need to have flushed first (or committed, which will flush for you). Perhaps this is useful for some people, although I'm not sure why you would want to flush rather than commit (other than the trivial answer that it can be undone).
In another example, I was syncing documents between a local DB and a remote server, and if the user decided to cancel, all adds/updates/deletes had to be undone (i.e. no partial sync, only a full sync). When updating a single document, I decided to simply delete the old row and add the updated version from the remote server. It turns out that, due to the way sqlalchemy is written, the order of operations when committing is not guaranteed. This resulted in adding a duplicate version (before attempting to delete the old one), which made the DB fail a unique constraint. To get around this I used flush() so that order was maintained, but I could still undo if the sync process later failed.
See my post on this at: Is there any order for add versus delete when committing in sqlalchemy
Similarly, someone wanted to know whether add order is maintained when committing, i.e. if I add object1 then add object2, does object1 get added to the database before object2
Does SQLAlchemy save order when adding objects to session?
Again, here presumably the use of a flush() would ensure the desired behavior. So in summary, one use for flush is to provide order guarantees (I think), again while still allowing yourself an "undo" option that commit does not provide.
Autoflush and Autocommit
Note, autoflush can be used to ensure queries act on an updated database as sqlalchemy will flush before executing the query. https://docs.sqlalchemy.org/en/13/orm/session_api.html#sqlalchemy.orm.session.Session.params.autoflush
Autocommit is something else that I don't completely understand but it sounds like its use is discouraged:
https://docs.sqlalchemy.org/en/13/orm/session_api.html#sqlalchemy.orm.session.Session.params.autocommit
Memory Usage
Now, the original question actually wanted to know about the impact of flush vs. commit for memory purposes. As holding pending changes is something the database takes on (I think), simply flushing should be sufficient to offload memory to the database - although committing shouldn't hurt (actually, it probably helps; see below) if you don't care about undoing.
sqlalchemy uses weak referencing for objects that have been flushed: https://docs.sqlalchemy.org/en/13/orm/session_state_management.html#session-referencing-behavior
This means if you don't have an object explicitly held onto somewhere, like in a list or dict, sqlalchemy won't keep it in memory.
However, then you have the database side of things to worry about. Presumably flushing without committing comes with some memory penalty to maintain the transaction. Again, I'm new to this but here's a link that seems to suggest exactly this: https://stackoverflow.com/a/15305650/764365
In other words, commits should reduce memory usage, although presumably there is a trade-off between memory and performance here. That is, you probably don't want to commit every single database change one at a time (for performance reasons), but waiting too long will increase memory usage.
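A sketch of that trade-off for a large load like the one in the question (Record and rows are placeholders for your model and data source):

BATCH_SIZE = 1000
for i, fields in enumerate(rows, start=1):
    session.add(Record(**fields))
    if i % BATCH_SIZE == 0:
        # Ends the transaction; flushed objects become weakly
        # referenced and can be garbage collected.
        session.commit()
session.commit()  # commit the final partial batch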
The existing answers don't make a lot of sense unless you understand what a database transaction is. (That was the case for me until recently.)
Sometimes you might want to run multiple SQL statements and have them succeed or fail as a whole. For example, if you want to execute a bank transfer from account A to account B, you'll need to do two queries like
UPDATE accounts SET value = value - 100 WHERE acct = 'A'
UPDATE accounts SET value = value + 100 WHERE acct = 'B'
If the first query succeeds but the second fails, this is bad (for obvious reasons). So, we need a way to treat these two queries "as a whole". The solution is to start with a BEGIN statement and end with either a COMMIT statement or a ROLLBACK statement, like
BEGIN
UPDATE accounts SET value = value - 100 WHERE acct = 'A'
UPDATE accounts SET value = value + 100 WHERE acct = 'B'
COMMIT
This is a single transaction.
In SQLAlchemy's ORM, this might look like
# BEGIN issued here
acctA = session.query(Account).get(1) # SELECT issued here
acctB = session.query(Account).get(2) # SELECT issued here
acctA.value -= 100
acctB.value += 100
session.commit() # UPDATEs and COMMIT issued here
If you monitor when the various queries get executed, you'll see the UPDATEs don't hit the database until you call session.commit().
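An easy way to watch this for yourself is to have the engine log every statement it issues; echo=True is a real create_engine flag, and the URL is illustrative:

from sqlalchemy import create_engine

# Logs each SQL statement as it is sent, so you can see the
# UPDATEs appear only when commit() (or flush()) is called.
engine = create_engine('sqlite:///example.db', echo=True)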
In some situations you might want to execute the UPDATE statements before issuing a COMMIT. (Perhaps the database issues an auto-incrementing id to the object and you want to fetch it before COMMITing). In these cases, you can explicitly flush() the session.
# BEGIN issued here
acctA = session.query(Account).get(1) # SELECT issued here
acctB = session.query(Account).get(2) # SELECT issued here
acctA.value -= 100
acctB.value += 100
session.flush() # UPDATEs issued here
session.commit() # COMMIT issued here
commit() records these changes in the database; flush() is always called as part of a call to commit(). When you use a Session object to query the database, the query returns results both from the database and from the flushed parts of the uncommitted transaction it holds.
As a simple rule of thumb:
commit makes real changes (they become visible in the database)
flush makes tentative changes (they become visible only to you)
Imagine that databases work like git-branching.
First you have to understand that during a transaction you are not manipulating the real database data.
Instead, you get something like a new branch, and there you play around.
If at some point you write the command commit, that means: "merge my data-changes into main DB data".
But if you need some future data that you can only get after a commit (e.g. after an insert into a table you need the inserted primary key id), then you use the flush command, meaning: "calculate the future PK id and reserve it for me". Then you can use that PK id value further in your code and be sure that the real data will end up as expected.
Commit must always come at the end, to merge into main DB data.
