PEP 249 -- Python Database API Specification v2.0 in the description of .commit() states:
Note that if the database supports an auto-commit feature, this must
be initially off. An interface method may be provided to turn it back
on.
What is the rationale behind that, given that most databases default to auto-commit on?
According to Discovering SQL:
The transaction model, as it is defined in the ANSI/ISO SQL Standard,
utilizes the implicit start of a transaction, with an explicit COMMIT, in
the case of the successful execution of all the logical units of the
transaction, or an explicit ROLLBACK, when the noncommitted changes need to
be rolled back (for example, when the program terminates abnormally); most
RDBMSs follow this model.
I.e., the SQL standard states transactions should be explicitly committed or
rolled-back.
The case for having explicit committing is best described by SQL-Transactions:
Some DBMS products, for example, SQL Server, MySQL/InnoDB, PostgreSQL and
Pyrrho operate by default in the AUTOCOMMIT mode. This means that the result
of every single SQL command is automatically committed to the
database, thus the effects/changes made to the database by the statement in
question cannot be rolled back. So, in case of errors the application needs
to do reverse operations for the logical unit of work, which may be impossible
after operations of concurrent SQL clients. Also in case of broken
connections the database might be left in an inconsistent state.
I.e., error handling and reversal of operations can be vastly simpler when using
explicit commits instead of auto-committing.
Also, from my reading of the Python mailing list, the consensus was that it is
bad for auto-commit to be on by default.
One post states:
Auto commit is a bad thing and a pretty evil invention of ODBC. While it
does make writing ODBC drivers simpler (ones which don't support
transactions that is), it is potentially dangerous at times, e.g. take a
crashing program: there is no way to recover from errors because the
database has no way of knowing which data is valid and which is not. No
commercial application handling "mission critical" (I love that term ;-)
data would ever want to run in auto-commit mode.
Another post says:
ANY serious application MUST manage its own transactions, as otherwise you
can't ever hope to control failure modes.
It is my impression that Python's developers took this sort of information into consideration and decided that the benefit of having auto-commit off by default (easier error handling and reversal) outweighed that of having it on (increased concurrency).
Related
I'm using psycopg2 to manage some PostgreSQL database connections.
As I have found here and in the docs, it seems psycopg2 simulates non-autocommit mode by default. PostgreSQL itself, on the other hand, treats every statement as its own transaction, which is basically autocommit mode.
My doubt is: which of these cases happens if both psycopg2 and PostgreSQL stay in their default modes? And what exactly happens if it's neither of the two? Any performance advice will be appreciated too.
Code               Psycopg2                PostgreSQL
Some statements --> One big transaction --> Multiple simple transactions
or
Some statements --> One big transaction --> One big transaction
First, my interpretation of the two documents is that when running psycopg2 with postgresql you will be running by default in simulated non-autocommit mode by virtue of psycopg2 having started a transaction. You can, of course, override that default with autocommit=True. Now to answer your question:
By default you will not be using autocommit=True, and this requires you to do a commit any time you make an update to the database that you wish to be permanent. That may seem inconvenient. But there are many instances where you need to do multiple updates and either they must all succeed or none must succeed. With autocommit=True you would have to explicitly start a transaction for these cases. With autocommit=False, you are saved the trouble of ever having to start a transaction, at the price of always having to do a commit or rollback. It seems to be a question of preference. I personally prefer autocommit=False.
As far as performance is concerned, specifying autocommit=True will save you the cost of starting a needless transaction in many instances. But I can't quantify how much of a performance savings that really is.
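The default behaviour described above can be sketched with the stdlib sqlite3 module, used here as a stand-in for psycopg2 (both follow the PEP 249 default of opening an implicit transaction and deferring changes until .commit(); the file path is just a throwaway temp file):

```python
import os
import sqlite3
import tempfile

# Two connections to the same database file, so we can observe when the
# writer's changes become visible to another client.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path)
reader = sqlite3.connect(path)

writer.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
writer.commit()

writer.execute("INSERT INTO t VALUES (1)")  # opens an implicit transaction
# The second connection cannot see the uncommitted row yet.
before = reader.execute("SELECT COUNT(*) FROM t").fetchall()[0][0]

writer.commit()  # only now does the row become visible to other connections
after = reader.execute("SELECT COUNT(*) FROM t").fetchall()[0][0]

writer.execute("INSERT INTO t VALUES (2)")
writer.rollback()  # undoes the uncommitted insert, as PEP 249 intends
final = writer.execute("SELECT COUNT(*) FROM t").fetchall()[0][0]
```

With autocommit on, the `before` read would already see the row and the rollback would have nothing to undo, which is exactly the error-handling advantage the quoted sources describe.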
I'm quite new to Python and Flask, and while working through the examples, couldn't help noticing cursors. Before this I programmed in PHP, where I never needed cursors. So I got to wondering: What are cursors and why are they used so much in these code examples?
But no matter where I turned, I saw no clear verdict and lots of warnings:
Wikipedia: "Fetching a row from the cursor may result in a network round trip each time", and "Cursors allocate resources on the server, such as locks, packages, processes, and temporary storage."
StackOverflow: See the answer by AndreasT.
The Island of Misfit Cursors: "A good developer is never reluctant to use a tool only because it's often misused by others."
And to top it all, I learned that MySQL does NOT support cursors!
It looks like the only code that doesn't use cursors in the mysqlclient library is the _mysql module, and the author repeatedly warns not to use it for compatibility reasons: "If you want to write applications which are portable across databases, use MySQLdb, and avoid using this module directly."
Well, I hope I have explained and supported my dilemma sufficiently well. Here are two big questions troubling me:
Since MySQL doesn't support cursors, what's the whole point of building the entire thing on a Cursor class hierarchy?
Why aren't cursors optional in mysqlclient?
You are confusing database-engine-level cursors and Python DB-API cursors. The latter exist only at the Python code level and are not necessarily tied to database-level ones.
At the Python level, cursors are a way to encapsulate a query and its results. This abstraction allows a simple, usable, common API across different vendors. Whether the actual implementation for a given vendor relies on database-level cursors or not is a totally different problem.
To make a long story short: there are two distinct concepts here:
database (server) cursors, a feature that exists in some but not all SQL engines
db api (client) cursors (as defined in PEP 249), which are used to execute a query and then fetch its results.
DB-API cursors are so named because they have some conceptual similarity with database cursors, but are technically unrelated to them.
As to why mysqlclient works this way, the reason is simple: it implements PEP 249, which is the community-defined API for Python SQL database clients.
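The client-side nature of DB-API cursors is easy to see with the stdlib sqlite3 module (through mysqlclient the code would look the same): the cursor object is created, used, and discarded entirely in Python, regardless of whether the engine has server cursors.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()  # a PEP 249 client-side cursor; no server cursor involved

cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
conn.commit()

cur.execute("SELECT name FROM users ORDER BY id")
columns = [d[0] for d in cur.description]  # metadata about the result set
names = [row[0] for row in cur.fetchall()]  # fetch results through the cursor
```

Whether fetchall() pulls everything in one round trip or the driver streams rows from a server-side cursor is an implementation detail hidden behind this same API.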
Python application, standard web app.
If a particular request gets executed twice by error, the second request will try to insert a row with an already-existing primary key.
What is the most sensible way to deal with it?
a) Execute a query to check if the primary key already exists and do the checking and error handling in the python app
b) Let the SQL engine reject the insertion with a constraint failure and use exception handling to handle it back in the app
From a speed perspective it might seem that a failed request takes about the same amount of time as a successful one, making (b) faster because it is only one request and not two.
However, once you take things like read-only DB slaves and table write-locks into account, things get fuzzy, in my experience with scaling standard SQL databases.
The best option is (b), from almost any perspective. As mentioned in a comment, there is a multi-threading issue. That means that option (a) doesn't even protect data integrity. And that is a primary reason why you want data integrity checks inside the database, not outside it.
There are other reasons. Consider performance. Passing data into and out of the database takes effort. There are multiple levels of protocol and data preparation, not to mention round trip, sequential communication from the database server. One call has one such unit of overhead. Two calls have two such units.
It is true that under some circumstances, a failed query can have a long clean-up period. However, constraint checking for unique values is a single lookup in an index, which is both fast and has minimal overhead for cleaning up. The extra overhead for handling the error should be tiny in comparison to the overhead for running the queries from the application -- both are small, one is much smaller.
If you had a query load where the inserts were really rare with respect to the comparison, then you might consider doing the check in the application. It is probably a tiny bit faster to check to see if something exists using a SELECT rather than using INSERT. However, unless your query load is many such checks for each actual insert, I would go with checking in the database and move on to other issues.
You need to handle the latter case anyway, so I do not see much value in querying for duplicates first, except to show the user information beforehand - e.g. reporting "This username has been taken already, please choose another" while the user is still filling in the form.
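Option (b) can be sketched with the stdlib sqlite3 module; with psycopg2 or mysqlclient the same pattern applies, since PEP 249 standardises the IntegrityError exception name across drivers (insert_user is a hypothetical helper):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

def insert_user(conn, user_id, name):
    """Attempt the INSERT and let the engine enforce the primary key."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute("INSERT INTO users VALUES (?, ?)", (user_id, name))
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate primary key - handled back in the app

first = insert_user(conn, 1, "alice")  # succeeds
dup = insert_user(conn, 1, "alice")    # rejected by the constraint
```

Because the constraint check happens inside the engine, this stays correct even when two clients race to insert the same key, which the SELECT-first approach of option (a) cannot guarantee.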
I'm trying to insert a row if the same primary key does not exist yet (ignore in that case). Doing this from Python, using psycopg2 and Postgres version 9.3.
There are several options how to do this: 1) use subselect, 2) use transaction, 3) let it fail.
It seems easiest to do something like this:
try:
    cursor.execute('INSERT...')
except psycopg2.IntegrityError:
    pass
Are there any drawbacks to this approach? Is there any performance penalty with the failure?
The foolproof way to do it at the moment is to try the insert and let it fail. You can do that at the app level or at the Postgres level; assuming it's not part of a procedure being executed on the server, it doesn't materially matter which, performance-wise, since either way you're sending a request to the server and retrieving the result. (Where it may matter is if you need to define a savepoint because you're trying it from within a larger transaction, or, as highlighted in Craig's answer, if you have many failed statements.)
In future releases, a proper merge and upsert are on the radar, but as the near-decade long discussion will suggest implementing them properly is rather thorny:
https://wiki.postgresql.org/wiki/SQL_MERGE
https://wiki.postgresql.org/wiki/UPSERT
With respect to the other options you mentioned, the above wiki pages and the links within them should highlight the difficulties. Basically: using a subselect is cheap, as noted by Erwin, but isn't concurrency-safe unless you lock properly; using locks amounts to either locking the entire table (trivial but not great) or reinventing the wheel that's being forged in core (trivial for existing rows, much less so for potentially new ones that are inserted concurrently, if you seek to use predicates instead of a table-level lock); and using a transaction and catching the exception is what you'll end up doing anyway.
Work is ongoing to add a native upsert to PostgreSQL 9.5, which will probably take the form of an INSERT ... ON CONFLICT UPDATE ... statement.
In the meantime, you must attempt the insert and, if it fails, retry. There's no safe alternative, though you can loop within a PL/PgSQL function to hide this from the application.
Re trying and letting it fail:
Are there any drawbacks to this approach?
It creates a large volume of annoying noise in the log files. It also burns through transaction IDs very rapidly if the conflict rate is high, potentially requiring more frequent VACUUM FREEZE to be run by autovacuum, which can be an issue on large databases.
Is there any performance penalty with the failure?
If the conflict rate is high, you'll be doing a bunch of extra round trips to the database. Otherwise not much really.
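For illustration, the upsert syntax that eventually landed in PostgreSQL 9.5 can be sketched below; SQLite 3.24+ happens to accept the same ON CONFLICT form, which keeps the sketch runnable without a server (on PostgreSQL you would run the identical statement through psycopg2):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")

# Insert-if-not-exists in a single statement: the conflicting insert is
# skipped inside the engine, with no IntegrityError and no log noise.
stmt = "INSERT INTO kv VALUES (?, ?) ON CONFLICT (k) DO NOTHING"
conn.execute(stmt, ("a", "1"))
conn.execute(stmt, ("a", "2"))  # duplicate key: silently ignored
value = conn.execute("SELECT v FROM kv WHERE k = 'a'").fetchall()[0][0]
```

Compared with try-and-let-it-fail, this avoids both the aborted transactions and the burned transaction IDs described above, which is why the native statement was worth the long wait.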
The Django documentation states
If you were relying on “automatic transactions” to provide locking
between select_for_update() and a subsequent write operation — an
extremely fragile design, but nonetheless possible — you must wrap the
relevant code in atomic(). Since Django 1.6.3, executing a query with
select_for_update() in autocommit mode will raise a
TransactionManagementError.
Why is this considered fragile? I would have thought that this would result in proper transactionality.
select_for_update isn't fragile.
I wrote that "if you were relying on "automatic transactions"" then you need to review your code when you upgrade from 1.5 to 1.6.
If you weren't relying on "automatic transaction", and even more if the concept doesn't ring a bell, then you don't need to do anything.
As pointed out in yuvi's answer (which is very good, thank you!) Django will raise an exception when it encounters invalid code. There's no need to think about this until you see a TransactionManagementError raised by select_for_update.
The answer is just around the corner, in the docs for select_for_update (emphasis mine):
Evaluating a queryset with select_for_update in autocommit mode is an
error because the rows are then not locked. If allowed, this would
facilitate data corruption, and could easily be caused by calling,
outside of any transaction, code that expects to be run in one.
In other words, there's a contradiction between autocommit and select_for_update which can cause data corruption. Here's the Django developers' discussion where they first proposed solving this issue, to quote (again, emphasis mine):
[...] under Oracle, in autocommit mode, the automatic commit happens
immediately after the command is executed -- and so, trying to fetch
the results fails for being done in a separate transaction.
However, with any backend, select-for-update in autocommit mode
makes very little sense. Even if it doesn't break (as it does on
Oracle), it doesn't really lock anything. So, IMO, executing a
query that is a select-for-update in autocommit mode is probably an
error, and one that is likely to cause data-corruption bugs.
So I'm suggesting we change the behavior of select-for-update queries,
to error out [...] This is a backwards-incompatible change [...]
These projects should probably be thankful -- they were running with a
subtle bug that is now exposed -- but still.
So it was an Oracle-only bug which shed light on a deeper problem relevant to all backends, and so they made the decision to make this an error in Django.
Atomic, on the other hand, only commits things to the database after verifying that there are no errors, thus solving the issue.
Aymeric clarified over email that such a design is fragile because it relies on the transaction boundaries implicitly created by Django 1.5's automatic transactions.
select_for_update(...)
more_code()
save()
This code works in straightforward cases, but if more_code() results in a write operation to the database, then the transaction would close, producing unintended behavior.
Forcing the user to specify the transaction boundaries also leads to clearer code.
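The fix the documentation prescribes can be sketched as follows; Account is a hypothetical model and the fragment assumes a configured Django project, so it is illustrative rather than runnable on its own:

```python
from django.db import transaction

# Hypothetical Account model. select_for_update() only holds its row lock
# for as long as the surrounding transaction is open, so the write must
# happen inside the same atomic() block.
def credit(account_id, amount):
    with transaction.atomic():
        account = Account.objects.select_for_update().get(pk=account_id)
        account.balance += amount
        account.save()  # the row stays locked until atomic() exits
```

With explicit boundaries like this, an intervening write elsewhere in the function cannot silently end the transaction and release the lock, which was exactly the failure mode of the old implicit behaviour.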