I'm using psycopg2 to manage some PostgreSQL database connections.
From what I have found here and in the docs, it seems psycopg2 simulates non-autocommit mode by default, while PostgreSQL treats every statement as its own transaction, which is basically autocommit mode.
My question is: which of these cases applies if both psycopg2 and PostgreSQL stay in their default modes? And what exactly happens if it's neither of the two? Any performance advice would be appreciated too.
Code               Psycopg2                 PostgreSQL
Some statements --> One big transaction --> Multiple simple transactions
or
Some statements --> One big transaction --> Big transaction
First, my interpretation of the two documents is that when running psycopg2 with PostgreSQL you will, by default, be running in simulated non-autocommit mode, by virtue of psycopg2 having started a transaction. You can, of course, override that default with autocommit=True. Now to answer your question:
By default you will not be using autocommit=True, and this will require you to do a commit any time you make an update to the database that you wish to be permanent. That may seem inconvenient. But there are many instances when you need to do multiple updates and either they must all succeed or none must succeed. If you specified autocommit=True, then you would have to explicitly start a transaction for these cases. With autocommit=False, you are saved the trouble of ever having to start a transaction, at the price of always having to do a commit or rollback. It seems to be a question of preference. I personally prefer autocommit=False.
As far as performance is concerned, specifying autocommit=True will save you the cost of starting a needless transaction in many instances. But I can't quantify how much of a performance savings that really is.
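To make the two modes concrete, here is a minimal sketch with psycopg2 (the connection parameters and the items table are placeholders, not from the question):

import psycopg2

conn = psycopg2.connect(dbname="example", user="example")  # placeholder credentials
cur = conn.cursor()

# Default (autocommit off): psycopg2 opens a transaction at the first statement,
# and nothing is permanent until you call commit() (or discard it with rollback()).
cur.execute("UPDATE items SET qty = qty - 1 WHERE id = %s", (1,))
conn.commit()

# Autocommit: every statement is committed immediately, as in plain psql.
conn.autocommit = True
cur.execute("UPDATE items SET qty = qty - 1 WHERE id = %s", (2,))  # already committed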
Related
I am currently writing code to insert a bunch of object data into a MySQL database from a plain Python script. The number of rows I need to insert is on the order of a few thousand. I want to do this as fast as possible, and wanted to know if there is a performance difference between calling executemany() on a bunch of rows and then calling commit(), vs. calling execute() many times and then calling commit().
It is always more efficient to perform all operations at once and commit at the end of the process. commit incurs additional processing that you don't want to repeat for each and every row if performance matters.
The more operations you perform, the greater the performance benefit. On the other hand, you need to consider the side effects of a long-lasting operation. For example, if you have several processes inserting concurrently, the risk of deadlock increases, especially if duplicate key errors arise. An intermediate approach is to insert in batches. You may want to have a look at the MySQL documentation on locking mechanisms.
The MySQL documentation also has an interesting section about how to optimize INSERT statements; here are a few picks:
- the LOAD DATA syntax is the fastest available option
- using multiple VALUES() lists is also quite a bit faster than running multiple INSERTs
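As a rough sketch of the batch-then-commit approach (the connection details and the things table are made up; mysql-connector-python is assumed, but any DB-API driver looks much the same):

import mysql.connector

conn = mysql.connector.connect(user="example", database="example")  # placeholder credentials
cur = conn.cursor()

rows = [(i, "name-%d" % i) for i in range(5000)]  # a few thousand rows to insert

# One executemany() call lets the driver batch the rows (for INSERTs the MySQL
# connector rewrites this into multi-row INSERT ... VALUES (...), (...) statements),
# and a single commit() at the end pays the commit overhead only once.
cur.executemany("INSERT INTO things (id, name) VALUES (%s, %s)", rows)
conn.commit()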
Here are some tips: Tuning the MySQL settings in /etc/mysql/my.cnf (on Ubuntu) can increase MySQL's performance a lot; more memory and cache is usually better for queries. Building one long string containing many INSERT queries separated by semicolons will also improve your speed a lot. Keeping the entire database in memory gives maximum speed, but that is not suitable for most projects. Tips for MySQL tuning are at: https://duckduckgo.com/?q=mysql+tune+for+speed&t=newext&atb=v275-1&ia=web.
In Python it should make little difference, because the data are not persisted until they are committed, so there should be little difference between execute() and executemany(). But, as stated here, the MySQL documentation also states:
With the executemany() method, it is not possible to specify multiple statements to execute in the operation argument. Doing so raises an InternalError exception. Consider using execute() with multi=True instead.
So if you have doubts about performance, you can have a look at SQLAlchemy; it seems to be a bit faster, but it takes some time to get it working.
Flask example applications Flasky and Flaskr create, drop, and re-seed their entire database between each test. Even if this doesn't make the test suite run slowly, I wonder if there is a way to accomplish the same thing while not being so "destructive". I'm surprised there isn't a "softer" way to roll back any changes. I've tried a few things that haven't worked.
For context, my tests call endpoints through the Flask test_client using something like self.client.post('/things'), and within the endpoints session.commit() is called.
I've tried making my own "commit" function that actually only flushes during tests, but then if I make two sequential requests like self.client.post('/things') and self.client.get('/things'), the newly created item is not present in the result set because the new request has a new request context with a new DB session (and transaction) which is not aware of changes that are merely flushed, not committed. This seems like an unavoidable problem with this approach.
I've tried using subtransactions with db.session.begin(subtransactions=True), but then I run into an even worse problem. Because I have autoflush=False, nothing actually gets committed OR flushed until the outer transaction is committed. So again, any requests that rely on data modified by earlier requests in the same test will fail. Even with autoflush=True, the earlier problem would occur for sequential requests.
I've tried nested transactions with the same result as subtransactions, and apparently they don't do what I was hoping they would do. I saw that nested transactions issue a SAVEPOINT command to the DB. I hoped that would allow commits to happen, visible to other sessions, and then be able to rollback to that save point at an arbitrary time, but that's not what they do. They're used within transactions, and have the same issues as the previous approach.
Update: Apparently there is a way of using nested transactions on a Connection rather than a Session, which might work but requires some restructuring of an application to use a Connection created by the test code. I haven't tried this yet. I'll get around to it eventually, but meanwhile I hope there's another way. Some say this approach may not work with MySQL due to a distinction between "real nested transactions" and savepoints, but the Postgres documentation also says to use SAVEPOINT rather than attempting to nest transactions. I think we can disregard this warning. I don't see any difference between these two databases anymore and if it works on one it will probably work on the other.
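For reference, the Connection-level pattern mentioned above looks roughly like the "Joining a Session into an External Transaction" recipe from the SQLAlchemy docs. A minimal sketch, assuming plain SQLAlchemy (the URL is a placeholder, and the application has to be pointed at the test-owned session):

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine("postgresql://localhost/test_db")  # placeholder URL

# per test: the test owns the connection and the outer transaction
connection = engine.connect()
outer = connection.begin()
session = sessionmaker(bind=connection)()

# ... exercise the application with this session; its session.commit() calls
# stay inside the outer transaction instead of ending it ...

session.close()
outer.rollback()    # everything the test did, including commits, is undone
connection.close()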
Another option that avoids a DB drop_all, create_all, and re-seeding with data, is to manually un-do the changes that a test introduces. But when testing an endpoint, many rows could be inserted into many tables, and reliably undoing this manually would be both exhausting and bug prone.
After trying all those things, I start to see the wisdom in dropping and creating between tests. However, is there something I've tried above that SHOULD work, but I'm simply doing something incorrectly? Or is there yet another method that someone is aware of that I haven't tried yet?
Update: Another method I just found on StackOverflow is to truncate all the tables instead of dropping and creating them. This is apparently about twice as fast, but it still seems heavy-handed and isn't as convenient as a rollback (which would not delete any sample data placed in the DB prior to the test case).
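For reference, a generic version of that might look like the sketch below, which emits a DELETE for every mapped table (so it also works on SQLite) instead of TRUNCATE; db is assumed to be the Flask-SQLAlchemy instance:

def clear_tables():
    # delete children before parents so foreign keys are respected
    for table in reversed(db.metadata.sorted_tables):
        db.session.execute(table.delete())
    db.session.commit()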
For unit tests I think the standard approach of regenerating the entire database is what makes the most sense, as you've seen in my examples and many others. But I agree, for large applications this can take a lot of time during your test run.
Thanks to SQLAlchemy you can get away with writing a lot of generic database code that runs on your production database, which might be MySQL, Postgres, etc., and at the same time runs on SQLite for tests. It is not possible for every application out there to use 100% generic SQLAlchemy, since SQLite has some important differences from the others, but in many cases this works well.
So whenever possible, I set up a SQLite database for my tests. Even for large databases, using an in-memory SQLite database should be pretty fast. Another very fast alternative is to generate your tables once, make a backup of your SQLite file with all the empty tables, then before each test restore the file instead of doing a create_all().
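For example, a sketch of the test configuration, assuming Flask-SQLAlchemy (a bare 'sqlite://' URI means an in-memory database):

app.config['SQLALCHEMY_DATABASE_URI'] = 'sqlite://'  # in-memory SQLite for tests

db.create_all()   # cheap, since there is no file I/O
# ... run the test ...
db.drop_all()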
I have not explored the idea of doing an initial backup of the database with empty tables and then use file based restores between tests for MySQL or Postgres, but in theory that should work as well, so I guess that is one solution you haven't mentioned in your list. You will need to stop and restart the db service in between your tests, though.
I'm trying to insert a row if the same primary key does not exist yet (ignore in that case). Doing this from Python, using psycopg2 and Postgres version 9.3.
There are several options for how to do this: 1) use a subselect, 2) use a transaction, 3) let it fail.
It seems easiest to do something like this:
try:
    cursor.execute('INSERT...')
except psycopg2.IntegrityError:
    pass
Are there any drawbacks to this approach? Is there any performance penalty with the failure?
The foolproof way to do it at the moment is to try the insert and let it fail. You can do that at the app level or at the Postgres level; assuming it's not part of a procedure being executed on the server, it doesn't materially matter whether it's one or the other when it comes to performance, since either way you're sending a request to the server and retrieving the result. (Where it may matter is in your need to define a savepoint if you're trying it from within a transaction, for the same reason. Or, as highlighted in Craig's answer, if you have many failed statements.)
In future releases, a proper MERGE and UPSERT are on the radar, but as the near-decade-long discussion suggests, implementing them properly is rather thorny:
https://wiki.postgresql.org/wiki/SQL_MERGE
https://wiki.postgresql.org/wiki/UPSERT
With respect to the other options you mentioned, the above wiki pages and the links within them should highlight the difficulties. Basically, though, using a subselect is cheap, as noted by Erwin, but isn't concurrency-proof (unless you lock properly); using locks basically amounts to locking the entire table (trivial but not great) or reinventing the wheel that's being forged in core (trivial for existing rows, less so for potentially new ones that are inserted concurrently, if you seek to use predicates instead of a table-level lock); and using a transaction and catching the exception is what you'll end up doing anyway.
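For completeness, the subselect variant looks roughly like this with psycopg2 (the things table and its columns are made up; as noted above, it is not safe against concurrent inserts of the same key without extra locking):

cursor.execute(
    """
    INSERT INTO things (id, name)
    SELECT %(id)s, %(name)s
    WHERE NOT EXISTS (SELECT 1 FROM things WHERE id = %(id)s)
    """,
    {'id': 1, 'name': 'example'},
)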
Work is ongoing to add a native upsert to PostgreSQL 9.5, which will probably take the form of an INSERT ... ON CONFLICT UPDATE ... statement.
In the mean time, you must attempt the update and if it fails, retry. There's no safe alternative, though you can loop within a PL/PgSQL function to hide this from the application.
Re trying and letting it fail:
Are there any drawbacks to this approach?
It creates a large volume of annoying noise in the log files. It also burns through transaction IDs very rapidly if the conflict rate is high, potentially requiring more frequent VACUUM FREEZE to be run by autovacuum, which can be an issue on large databases.
Is there any performance penalty with the failure?
If the conflict rate is high, you'll be doing a bunch of extra round trips to the database. Otherwise not much really.
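If the insert happens inside a larger transaction, wrapping it in a savepoint keeps a duplicate-key failure from aborting the rest of the work. A rough psycopg2 sketch (the things table is made up; cursor is a psycopg2 cursor inside an open transaction):

cursor.execute("SAVEPOINT before_insert")
try:
    cursor.execute("INSERT INTO things (id, name) VALUES (%s, %s)", (1, 'example'))
except psycopg2.IntegrityError:
    # duplicate key: undo only the failed statement and keep the transaction usable
    cursor.execute("ROLLBACK TO SAVEPOINT before_insert")
else:
    cursor.execute("RELEASE SAVEPOINT before_insert")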
PEP 249 -- Python Database API Specification v2.0 in the description of .commit() states:
Note that if the database supports an auto-commit feature, this must
be initially off. An interface method may be provided to turn it back
on.
What is the rationale behind that, given that most databases default to auto-commit on?
According to Discovering SQL:
The transaction model, as it is defined in the ANSI/ISO SQL Standard,
utilizes the implicit start of a transaction, with an explicit COMMIT, in
the case of the successful execution of all the logical units of the
transaction, or an explicit ROLLBACK, when the noncommitted changes need to
be rolled back (for example, when the program terminates abnormally); most
RDBMSs follow this model.
I.e., the SQL standard states that transactions should be explicitly committed or
rolled back.
The case for having explicit committing is best described by SQL-Transactions:
Some DBMS products, for example SQL Server, MySQL/InnoDB, PostgreSQL and
Pyrrho, operate by default in the AUTOCOMMIT mode. This means that the result
of every single SQL command will be automatically committed to the
database, thus the effects/changes made to the database by the statement in
question cannot be rolled back. So, in case of errors the application needs
to do reverse-operations for the logical unit of work, which may be impossible
after operations of concurrent SQL-clients. Also in case of broken
connections the database might be left in an inconsistent state.
I.e., error handling and reversal of operations can be vastly simpler when using
explicit commits instead of auto-committing.
Also, from my observation of the users on the Python mailing list, the
consensus was that it is bad for auto-commit to be on by default.
One post states:
Auto commit is a bad thing and a pretty evil invention of ODBC. While it
does make writing ODBC drivers simpler (ones which don't support
transactions that is), it is potentially dangerous at times, e.g. take a
crashing program: there is no way to recover from errors because the
database has no way of knowing which data is valid and which is not. No
commercial application handling "mission critical" (I love that term ;-)
data would ever want to run in auto-commit mode.
Another post says:
ANY serious application MUST manage its own transactions, as otherwise you
can't ever hope to control failure modes.
It is my impression that Python developers took this sort of information into consideration and decided that the benefit of having auto-commit off by default (easier error handling and reversing) outweighed that of having auto-commit on (increased concurrency).
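As a concrete illustration of the model PEP 249 prescribes, here is a small sketch using the standard-library sqlite3 module (the table and values are made up):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")

try:
    # two changes that must succeed or fail together
    conn.execute("INSERT INTO accounts VALUES (1, 100)")
    conn.execute("INSERT INTO accounts VALUES (2, 200)")
    conn.commit()      # both rows become permanent at once
except sqlite3.Error:
    conn.rollback()    # neither row is kept if anything went wrong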
What is the most reliable way to determine what statements are "querying" versus "modifying"? For example, SELECT versus UPDATE / INSERT / CREATE.
Parsing the statement myself seems the obvious first attempt, but I can't help but think that this would be a flaky solution. Just looking for SELECT at the beginning doesn't work, as PRAGMA can also return results, and I'm sure there are a multitude of ways that strategy could fail. Testing for zero rows returned from the cursor doesn't work either, as a SELECT can obviously return zero results.
I'm working with SQLite via the Python sqlite3 module.
Use the sqlite3_changes API call, which is also available from SQL using the changes function.
As TokenMacGuy mentioned, you can rollback the transaction containing the statement that caused the changes; the sqlite3_changes function will let you know if that is necessary.
There is also the update_hook callback if you need more fine-grained information about the tables and rows affected.
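In Python's sqlite3 module the counter behind sqlite3_changes is exposed as Connection.total_changes, so a rough way to tell whether a statement modified anything is to compare it before and after the call (note that an UPDATE or DELETE matching no rows also reports 0):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")

def run(sql, params=()):
    before = conn.total_changes
    cur = conn.execute(sql, params)
    return cur, conn.total_changes - before  # > 0 means rows were modified

cur, changed = run("INSERT INTO t VALUES (?)", (1,))
print(changed)                  # 1 -> the statement modified the database

cur, changed = run("SELECT * FROM t")
print(changed, cur.fetchall())  # 0 -> a pure query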