The Django documentation states:
If you were relying on “automatic transactions” to provide locking
between select_for_update() and a subsequent write operation — an
extremely fragile design, but nonetheless possible — you must wrap the
relevant code in atomic(). Since Django 1.6.3, executing a query with
select_for_update() in autocommit mode will raise a
TransactionManagementError.
Why is this considered fragile? I would have thought that this would result in proper transactionality.
select_for_update isn't fragile.
I wrote that "if you were relying on "automatic transactions"" then you need to review your code when you upgrade from 1.5 from 1.6.
If you weren't relying on "automatic transaction", and even more if the concept doesn't ring a bell, then you don't need to do anything.
As pointed out in yuvi's answer (which is very good, thank you!) Django will raise an exception when it encounters invalid code. There's no need to think about this until you see a TransactionManagementError raised by select_for_update.
The answer is just around the corner, in the docs for select_for_update (emphasis mine):
Evaluating a queryset with select_for_update in autocommit mode is an
error because the rows are then not locked. If allowed, this would
facilitate data corruption, and could easily be caused by calling,
outside of any transaction, code that expects to be run in one.
In other words, there's a contradiction between autocommit and select_for_update, which can cause data corruption. Here's the Django developers' discussion where they first proposed solving this issue; to quote (again, emphasis mine):
[...] under Oracle, in autocommit mode, the automatic commit happens
immediately after the command is executed -- and so, trying to fetch
the results fails for being done in a separate transaction.
However, with any backend, select-for-update in autocommit mode
makes very little sense. Even if it doesn't break (as it does on
Oracle), it doesn't really lock anything. So, IMO, executing a
query that is a select-for-update in autocommit mode is probably an
error, and one that is likely to cause data-corruption bugs.
So I'm suggesting we change the behavior of select-for-update queries,
to error out [...] This is a backwards-incompatible change [...]
These projects should probably be thankful -- they were running with a
subtle bug that is now exposed -- but still.
So it was an Oracle-only bug, which shed light on a deeper problem that's relevant for all backends, and so they made the decision to make this an error in Django.
Atomic, on the other hand, only commits things to the database after verifying that there are no errors, thus solving the issue.
Aymeric clarified over email that such a design is fragile because it relies on the transaction boundaries implicitly created by Django 1.5's automatic transactions.
select_for_update(...)
more_code()
save()
This code works in straightforward cases, but if more_code() results in a write operation to the database, the automatic transaction is committed at that point, releasing the locks taken by select_for_update() and producing unintended behavior.
Forcing the user to specify the transaction boundaries also leads to clearer code.
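For illustration, here is a minimal sketch of the same flow wrapped in an explicit atomic() block, which is the pattern Django now requires (Account is a hypothetical model and more_code() a placeholder for the intermediate work):

from django.db import transaction

with transaction.atomic():
    # The row lock taken here is held until the atomic() block exits.
    account = Account.objects.select_for_update().get(pk=1)
    more_code()              # intermediate writes stay inside the same transaction
    account.balance -= 10
    account.save()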
Related
I'm trying to understand SQLAlchemy (1.4) and I can't quite put things together. I can find examples that put either session.commit() (SO, FAQ) or session.add() (SO, session documentation) inside try.
For me add() never raises an exception unless I manually call flush() as well.
My initial guess was that having autoflush enabled means add() is automatically flushed, and the following example from the flushing documentation seems to suggest as much (otherwise why use it as an example?):
with mysession.no_autoflush:
    mysession.add(some_object)
    mysession.flush()
but it makes no difference, and reading on, it seems that autoflush only affects query operations.
I can just put commit() inside try, but I'd appreciate it if someone could help me understand why a decade-old SO answer works for me while the official documentation doesn't. Am I missing some other configuration?
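For reference, this is the pattern I mean, with commit() inside the try (SQLAlchemy 1.4 style; User is a hypothetical mapped class and engine an already-configured Engine):

from sqlalchemy.orm import Session

with Session(engine) as session:
    session.add(User(name="x"))   # only stages the object; nothing is sent to the DB yet
    try:
        session.commit()          # the flush happens here, so errors surface here
    except Exception:
        session.rollback()
        raise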
I'm using pyscopg2 to manage some postgresql databases connections.
As I have found here and in the docs, it seems psycopg2 simulates non-autocommit mode by default. Also, PostgreSQL treats every statement as a transaction, which is basically autocommit mode.
My doubt is, which one of these cases happens if both psycopg2 and PostgreSQL stay in default mode? Or what exactly happens if it's neither one of these two? Any performance advice will be appreciated too.
Code              Psycopg2                PostgreSQL
Some statements --> One big transaction --> Multiple simple transactions
or
Some statements --> One big transaction --> One big transaction
First, my interpretation of the two documents is that when running psycopg2 with postgresql you will be running by default in simulated non-autocommit mode by virtue of psycopg2 having started a transaction. You can, of course, override that default with autocommit=True. Now to answer your question:
By default you will not be using autocommit=True and this will require you to do a commit anytime you do an update to the database that you wish to be permanent. That may seem inconvenient. But there are many instances when you need to do multiple updates and either they must all succeed or none must succeed. If you specified autocommit=True, then you would have to explicitly start a transaction for these cases. With autocommit=False, you are saved the trouble of ever having to start a transaction at the price of always having to do a commit or rollback. It seems to be a question of preference. I personally prefer autocommit=False.
As far as performance is concerned, specifying autocommit=True will save you the cost of starting a needless transaction in many instances. But I can't quantify how much of a performance savings that really is.
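Here is a small sketch of both modes (the connection string and the accounts table are made up):

import psycopg2

conn = psycopg2.connect("dbname=test")
# Default: conn.autocommit is False, so psycopg2 opens a transaction at the
# first statement and holds it until commit() or rollback().
with conn.cursor() as cur:
    cur.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1")
    cur.execute("UPDATE accounts SET balance = balance + 10 WHERE id = 2")
conn.commit()   # both updates reach PostgreSQL as one transaction

conn.autocommit = True   # opt in to autocommit: each execute() now commits immediately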
I took a slight peek behind the curtain at the MySQLdb Python driver, and to my horror I saw it was simply escaping the parameters and putting them directly into the query string. I realize that escaping inputs should be fine in most cases, but coming from PHP, I have seen bugs where, given certain database character sets and versions of the MySQL driver, SQL injection was still possible.
This question had some incredibly detailed responses regarding the edge cases of string escaping in PHP, and has led me to the belief that prepared statements should be used whenever possible.
So then my questions are: Are there any known cases where the MySQLdb driver has been successfully exploited due to this? When a query needs to be run in a loop, say in the case of an incremental DB migration script, will this degrade performance? Are my concerns regarding escaped input fundamentally flawed?
I can't point to any known exploit cases, but I can say that yes, this is terrible.
The Python project calling itself MySQLdb is no longer maintained. It's been piling up unresolved GitHub issues since 2014, and just quickly looking at the source code, I can find more bugs not yet reported - for example, it uses a regex to parse queries in executemany, leading it to mishandle any query containing the string " values()" where values isn't the keyword.
Instead of MySQLdb, you should be using MySQL's Connector/Python. That is still maintained, and it's hopefully less terrible than MySQLdb. (Hopefully. I didn't check that one.)
...prepared statements should be used whenever possible.
Yes. That's the best advice. If the prepared statement system is broken there will be klaxons blaring from the rooftops and everyone in the Python world will pounce on the problem to fix it. If there's a mistake in your own code that doesn't use prepared statements you're on your own.
What you're seeing in the driver is probably prepared statement emulation; that is, the driver is responsible for inserting data into the placeholders and forwarding the final, composed statement to the server. This is done for various reasons, some historical, some to do with compatibility.
Drivers are generally given a lot of serious scrutiny as they're the foundation of most systems. If there is a security bug in there then there's a lot of people that are going to be impacted by it, the stakes are very high.
The difference between using prepared statements with placeholder values and your own interpolated code is massive even if behind the scenes the same thing happens. This is because the driver, by design, always escapes your data. Your code might not, you may omit the escaping on one value and then you have a catastrophic hole.
Use placeholder values like your life depends on it, because it very well might. You do not want to wake up to a phone call or email one day saying your site got hacked and now your database is floating around on the internet.
Using prepared statements is faster than concatenating a query. The database can precompile the statement, so only the parameters will be changed when iterating in a loop.
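As a concrete sketch of the placeholder style using MySQL Connector/Python (the connection details and the events table are hypothetical):

import mysql.connector

conn = mysql.connector.connect(user="app", password="secret", database="test")
cur = conn.cursor(prepared=True)   # server-side prepared statement

name = "O'Brien"
# Never build the SQL yourself, e.g. "INSERT ... VALUES ('%s')" % name.
# Pass the values separately and let the driver/server handle them:
cur.execute("INSERT INTO events (name) VALUES (%s)", (name,))

# In a loop (e.g. a migration script) the statement is prepared once and reused:
cur.executemany("INSERT INTO events (name) VALUES (%s)", [("a",), ("b",), ("c",)])
conn.commit()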
PEP 249 -- Python Database API Specification v2.0 in the description of .commit() states:
Note that if the database supports an auto-commit feature, this must
be initially off. An interface method may be provided to turn it back
on.
What is the rationale behind that, given that most databases default to auto-commit on?
According to Discovering SQL:
The transaction model, as it is defined in the ANSI/ISO SQL Standard,
utilizes the implicit start of a transaction, with an explicit COMMIT, in
the case of the successful execution of all the logical units of the
transaction, or an explicit ROLLBACK, when the noncommitted changes need to
be rolled back (for example, when the program terminates abnormally); most
RDBMSs follow this model.
I.e., the SQL standard states transactions should be explicitly committed or
rolled-back.
The case for having explicit committing is best described by SQL-Transactions:
Some DBMS products, for example, SQL Server, MySQL/InnoDB, PostgreSQL and
Pyrrho operate by default in the AUTOCOMMIT mode. This means that the result
of every single SQL command is automatically committed to the
database, thus the effects/changes made to the database by the statement in
question cannot be rolled back. So, in case of errors the application needs
to do reverse-operations for the logical unit of work, which may be impossible
after operations of concurrent SQL-clients. Also in case of broken
connections the database might be left in inconsistent state.
I.e., error handling and reversal of operations can be vastly simpler when
using explicit commits instead of auto-committing.
Also, from my observation of the users on the Python mailing list, the
consensus was that it is bad for auto-commit to be on by default.
One post states:
Auto commit is a bad thing and a pretty evil invention of ODBC. While it
does make writing ODBC drivers simpler (ones which don't support
transactions that is), it is potentially dangerous at times, e.g. take a
crashing program: there is no way to recover from errors because the
database has no way of knowing which data is valid and which is not. No
commercial application handling "mission critical" (I love that term ;-)
data would ever want to run in auto-commit mode.
Another post says:
ANY serious application MUST manage its own transactions, as otherwise you
can't ever hope to control failure modes.
It is my impression that Python developers took this sort of information into consideration and decided that the benefit of having auto-commit off by default (easier error handling and reversal) outweighed that of having auto-commit on (increased concurrency).
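To make the "all or nothing" argument concrete, here is a minimal sketch using sqlite3, which follows PEP 249 and therefore does not auto-commit data-modifying statements (the accounts table is made up):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")
conn.commit()

try:
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")
    conn.commit()        # the transfer becomes visible as a single unit
except sqlite3.Error:
    conn.rollback()      # on failure, neither half of the transfer is applied
    raise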
I found that Python's assert statement is a good way to catch situations that should never happen. And it can be removed by Python optimization when the code is trusted to be correct.
It seems to be a perfect mechanism to run Python applications in debug mode. But looking at several Python projects like django, twisted and zope, the assert is almost never used. So, why does this happen?
Why are assert statements not frequently used in the Python community?
I guess the main reason for assert not being used more often is that nobody uses Python's "optimized" mode.
Asserts are a great tool to detect programming mistakes, to guard yourself from unexpected situations, but all this error checking comes with a cost. In compiled languages such as C/C++, this does not really matter, since asserts are only enabled in debug builds, and completely removed from release builds.
In Python, on the other hand, there is no strict distinction between debug and release mode. The interpreter features an "optimization flag" (-O), but currently this does not actually optimize the byte code, but only removes asserts.
Therefore, most Python users just ignore the -O flag and run their scripts in "normal mode", which is kind of the debug mode since asserts are enabled and __debug__ is True, but is considered "production ready".
Maybe it would be wiser to switch the logic, i.e., "optimize" by default and only enable asserts in an explicit debug mode (*), but I guess this would confuse a lot of users and I doubt we will ever see such a change.
((*) This is for example how the Java VM does it, by featuring a -ea (enable assertions) switch.)
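A tiny example of the point about -O and __debug__ (check.py is just a hypothetical file name):

# check.py
def mean(values):
    assert len(values) > 0, "mean() of an empty sequence"
    return sum(values) / len(values)

if __debug__:
    print("asserts are enabled")

mean([])   # "python check.py"    -> AssertionError from the assert
           # "python -O check.py" -> the assert is stripped, so ZeroDivisionError instead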
Several reasons come to mind...
It is not a primary function
Many programmers (let's not get bogged down by the rationale) disrespect anything which is not a direct participant in the program's ultimate functionality. The assert statement is intended for debugging and testing, and so is a luxury they can ill afford.
Unit Testing
The assert statement predates the rise and rise of unit-testing. Whilst the assert statement still has its uses, unit-testing is now widely used for constructing a hostile environment with which to bash the crap out of a subroutine and its system. Under these conditions assert statements start to feel like knives in a gunfight.
Improved industry respect for testing
The assert statement serves best as the last line of defence. It rose to lofty and untouchable heights under the C language, when that language ruled the world, as a great way to implement the new-fangled "defensive programming"; it recognises and traps catastrophic disasters in the moment they teeter on the brink. This was before the value of Testing became widely recognised and respected and disasters were substantially more common.
Today, it is unheard of for any serious commercial software to be released without some form of testing. Testing is taken seriously and has evolved into a massive field. There are Testing professionals and Quality Assurance departments with big checklists and formal sign-offs. Under these conditions programmers tend not to bother with asserts because they have confidence that their code will be subjected to so much tiresome testing that the odds of wacky brink-of-disaster conditions are so remote as to be negligible. That's not to say they're right, but if the blame for lazy programming can be shifted to the QA department, hell, why not?
I'm not an author of any of those projects, so this is just a guess based on my own experiences. Without directly asking people in those projects you won't get a concrete answer.
Assert is great when you're trying to do debugging, etc in your own application. As stated in the link you provided, however, using a conditional is better when the application might be able to predict and recover from a state. I haven't used zope, but in both Twisted and Django, their applications are able to recover and continue from many errors in your code. In a sense, they have already 'compiled away' the assertions since they actually can handle them.
Another reason, related to that, is that often applications using external libraries such as those you listed might want to do error handling. If the library simply uses assertions, no matter what the error is it will raise an AssertionError. With a conditional, the libraries can actually throw useful errors that can be caught and handled by your application.
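A small sketch of that difference (set_timeout_* are made-up functions):

# assert style: every failure surfaces as a bare AssertionError,
# and the check disappears entirely under python -O
def set_timeout_assert(seconds):
    assert seconds > 0, "timeout must be positive"

# conditional style: callers get a specific error they can catch and handle
def set_timeout_checked(seconds):
    if seconds <= 0:
        raise ValueError("timeout must be positive, got %r" % seconds)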
In my experience, asserts are mostly used in the development phase of a program to check user-defined inputs. Asserts are not really needed to catch programming errors; Python itself is very capable of trapping genuine programming errors like ZeroDivisionError, TypeError and the like.