I'm trying to insert a row if the same primary key does not exist yet (ignore in that case). Doing this from Python, using psycopg2 and Postgres version 9.3.
There are several options for how to do this: 1) use a subselect, 2) use a transaction, 3) let it fail.
It seems easiest to do something like this:
try:
    cursor.execute('INSERT...')
except psycopg2.IntegrityError:
    pass
Are there any drawbacks to this approach? Is there any performance penalty with the failure?
The foolproof way to do it at the moment is to try the insert and let it fail. You can do that at the app level or at the Postgres level; assuming it's not part of a procedure being executed on the server, it doesn't materially matter whether it's one or the other when it comes to performance, since either way you're sending a request to the server and retrieving the result. (Where it may matter is in your need to define a savepoint if you're trying it from within a transaction, for the same reason. Or, as highlighted in Craig's answer, if you have many failed statements.)
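If you do need to attempt the insert from inside a larger transaction, a savepoint keeps a failed attempt from aborting the whole thing. A minimal sketch with psycopg2; the table, columns, and connection string are made up for illustration:

import psycopg2

conn = psycopg2.connect('dbname=test')  # assumed connection parameters
cur = conn.cursor()

cur.execute('SAVEPOINT before_insert')
try:
    cur.execute('INSERT INTO things (id, name) VALUES (%s, %s)', (1, 'foo'))
except psycopg2.IntegrityError:
    # roll back only to the savepoint; the enclosing transaction survives
    cur.execute('ROLLBACK TO SAVEPOINT before_insert')
else:
    cur.execute('RELEASE SAVEPOINT before_insert')
# ... carry on with the rest of the transaction, then conn.commit()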
In future releases, a proper merge and upsert are on the radar, but as the near-decade-long discussion will suggest, implementing them properly is rather thorny:
https://wiki.postgresql.org/wiki/SQL_MERGE
https://wiki.postgresql.org/wiki/UPSERT
With respect to the other options you mentioned, the above wiki pages and the links within them should highlight the difficulties. Basically though, using a subselect is cheap, as noted by Erwin, but isn't concurrency-proof (unless you lock properly); using locks basically amounts to locking the entire table (trivial but not great) or reinventing the wheel that's being forged in core (trivial for existing rows, less so for potentially new ones which are inserted concurrently, if you seek to use predicates instead of a table-level lock); and using a transaction and catching the exception is what you'll end up doing anyway.
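For completeness, the subselect variant looks roughly like this (a sketch with a made-up table name and an assumed psycopg2 cursor; remember it is not concurrency-safe without locking):

cur.execute(
    'INSERT INTO things (id, name) '
    'SELECT %s, %s WHERE NOT EXISTS (SELECT 1 FROM things WHERE id = %s)',
    (1, 'foo', 1),
)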
Work is ongoing to add a native upsert to PostgreSQL 9.5, which will probably take the form of an INSERT ... ON CONFLICT UPDATE ... statement.
In the meantime, you must attempt the insert and, if it fails, retry. There's no safe alternative, though you can loop within a PL/PgSQL function to hide this from the application.
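For the ignore-duplicates case in this question, the companion DO NOTHING form of that proposal should reduce it to a single statement along these lines (a sketch only; the exact syntax may change before release, and the cursor is an assumed psycopg2 cursor):

cur.execute(
    'INSERT INTO things (id, name) VALUES (%s, %s) ON CONFLICT (id) DO NOTHING',
    (1, 'foo'),
)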
Re trying and letting it fail:
Are there any drawbacks to this approach?
It creates a large volume of annoying noise in the log files. It also burns through transaction IDs very rapidly if the conflict rate is high, potentially requiring more frequent VACUUM FREEZE to be run by autovacuum, which can be an issue on large databases.
Is there any performance penalty with the failure?
If the conflict rate is high, you'll be doing a bunch of extra round trips to the database. Otherwise not much really.
Related
I am currently writing code to insert a bunch of object data into a MySQL database through a plain Python script. The number of rows I need to insert is on the order of a few thousand. I want to do this as fast as possible, and wanted to know if there is a performance difference between calling executemany() on a bunch of rows and then calling commit(), vs calling execute() many times and then calling commit().
It is always more efficient to perform all operations at once, and commit at the end of the process. commit incurs additional processing that you don't want to repeat for each and every row, if performance matters.
The more operations you perform, the greater the performance benefit. On the other hand, you need to consider the side effects of a long-lasting operation. For example, if you have several processes inserting concurrently, the risk of deadlock increases - especially if duplicate key errors arise. An intermediate approach is to insert in batches, as sketched below. You may want to have a look at the MySQL documentation on locking mechanisms.
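A sketch of the batched approach, assuming a DB-API connection and cursor named conn and cursor, and rows as the list of tuples to insert (table and column names are placeholders):

BATCH = 1000
sql = 'INSERT INTO book (id, title) VALUES (%s, %s)'
for start in range(0, len(rows), BATCH):
    cursor.executemany(sql, rows[start:start + BATCH])
    conn.commit()  # committing per batch keeps transactions, and locks, short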
The MySQL documentation has an interesting section about how to optimize insert statements - here are a few picks:
the LOAD DATA syntax is the fastest available option
using multiple VALUES() lists is also considerably faster than running multiple single-row INSERTs
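If your driver does not rewrite executemany() into a multi-row statement for you, you can build one yourself. A rough sketch, with made-up table and column names and an assumed DB-API conn/cursor:

rows = [(1, 'a'), (2, 'b'), (3, 'c')]
placeholders = ', '.join(['(%s, %s)'] * len(rows))
flat_params = [value for row in rows for value in row]
cursor.execute('INSERT INTO book (id, title) VALUES ' + placeholders, flat_params)
conn.commit()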
Here are some tips: tuning the MySQL settings in /etc/mysql/my.cnf (on Ubuntu) can increase MySQL's performance a lot; more memory and cache is usually better for queries. Building one very long string containing many INSERT statements separated by semicolons will also improve your speed a lot. Keeping the entire database in memory gives maximum speed, but is not suitable for most projects. Tips for MySQL tuning are at: https://duckduckgo.com/?q=mysql+tune+for+speed&t=newext&atb=v275-1&ia=web.
In Python it should make little difference, because nothing is actually stored until the data is committed,
so there should be little difference between execute and executemany. But, as stated here,
the MySQL homepage also states:
With the executemany() method, it is not possible to specify multiple statements to execute in the operation argument. Doing so raises an InternalError exception. Consider using execute() with multi=True instead.
So if you have doubts about performance, you can have a look at SQLAlchemy; it seems to be a bit faster, but it takes time to get it to work.
Flask example applications Flasky and Flaskr create, drop, and re-seed their entire database between each test. Even if this doesn't make the test suite run slowly, I wonder if there is a way to accomplish the same thing while not being so "destructive". I'm surprised there isn't a "softer" way to roll back any changes. I've tried a few things that haven't worked.
For context, my tests call endpoints through the Flask test_client using something like self.client.post('/things'), and within the endpoints session.commit() is called.
I've tried making my own "commit" function that actually only flushes during tests, but then if I make two sequential requests like self.client.post('/things') and self.client.get('/things'), the newly created item is not present in the result set because the new request has a new request context with a new DB session (and transaction) which is not aware of changes that are merely flushed, not committed. This seems like an unavoidable problem with this approach.
I've tried using subtransactions with db.session.begin(subtransactions=True), but then I run into an even worse problem. Because I have autoflush=False, nothing actually gets committed OR flushed until the outer transaction is committed. So again, any requests that rely on data modified by earlier requests in the same test will fail. Even with autoflush=True, the earlier problem would occur for sequential requests.
I've tried nested transactions with the same result as subtransactions, and apparently they don't do what I was hoping they would do. I saw that nested transactions issue a SAVEPOINT command to the DB. I hoped that would allow commits to happen, visible to other sessions, and then be able to rollback to that save point at an arbitrary time, but that's not what they do. They're used within transactions, and have the same issues as the previous approach.
Update: Apparently there is a way of using nested transactions on a Connection rather than a Session, which might work but requires some restructuring of an application to use a Connection created by the test code. I haven't tried this yet. I'll get around to it eventually, but meanwhile I hope there's another way. Some say this approach may not work with MySQL due to a distinction between "real nested transactions" and savepoints, but the Postgres documentation also says to use SAVEPOINT rather than attempting to nest transactions, so I think we can disregard that warning. I don't see a relevant difference between the two databases here, and if it works on one it will probably work on the other.
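For reference, the basic shape of that connection-level pattern with plain SQLAlchemy looks something like this (the engine URL and names are placeholders; wiring it into Flask-SQLAlchemy means pointing db.session at this connection, and if the code under test calls session.commit() you also need the SAVEPOINT recipe from the SQLAlchemy docs on joining a session into an external transaction):

import unittest
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

engine = create_engine('postgresql://localhost/myapp_test')  # assumed test database

class TransactionalTestCase(unittest.TestCase):
    def setUp(self):
        self.connection = engine.connect()
        self.trans = self.connection.begin()       # outer transaction, never committed
        self.session = Session(bind=self.connection)

    def tearDown(self):
        self.session.close()
        self.trans.rollback()                      # discard everything the test did
        self.connection.close()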
Another option that avoids a DB drop_all, create_all, and re-seeding with data, is to manually un-do the changes that a test introduces. But when testing an endpoint, many rows could be inserted into many tables, and reliably undoing this manually would be both exhausting and bug prone.
After trying all those things, I start to see the wisdom in dropping and creating between tests. However, is there something I've tried above that SHOULD work, but I'm simply doing something incorrectly? Or is there yet another method that someone is aware of that I haven't tried yet?
Update: Another method I just found on StackOverflow is to truncate all the tables instead of dropping and creating them. This is apparently about twice as fast, but it still seems heavy-handed and isn't as convenient as a rollback (which would not delete any sample data placed in the DB prior to the test case).
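A sketch of that wipe-tables-between-tests idea, using DELETEs issued in reverse dependency order (close enough to TRUNCATE for most test suites); db is the usual Flask-SQLAlchemy object and is assumed here:

with db.engine.begin() as conn:
    for table in reversed(db.metadata.sorted_tables):
        conn.execute(table.delete())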
For unit tests I think the standard approach of regenerating the entire database is what makes the most sense, as you've seen in my examples and many others. But I agree, for large applications this can take a lot of time during your test run.
Thanks to SQLAlchemy you can get away with writing a lot of generic database code that runs on your production database, which might be MySQL, Postgres, etc., and at the same time runs on sqlite for tests. It is not possible for every application out there to use 100% generic SQLAlchemy, since sqlite has some important differences from the others, but in many cases this works well.
So whenever possible, I set up a sqlite database for my tests. Even for large databases, using an in-memory sqlite database should be pretty fast. Another very fast alternative is to generate your tables once, make a backup of your sqlite file with all the empty tables, then before each test restore the file instead of doing a create_all().
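A sketch of the sqlite setup, assuming the usual Flask-SQLAlchemy objects (a create_app factory and a db instance; adjust to however your app is configured):

app = create_app()
app.config['TESTING'] = True
app.config['SQLALCHEMY_DATABASE_URI'] = 'sqlite://'  # in-memory database
with app.app_context():
    db.create_all()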
I have not explored the idea of doing an initial backup of the database with empty tables and then using file-based restores between tests for MySQL or Postgres, but in theory that should work as well, so I guess that is one solution you haven't mentioned in your list. You will need to stop and restart the DB service between your tests, though.
Which of the following would perform better?
(1) **INSERT IGNORE**
cursor.execute('INSERT IGNORE INTO table VALUES (%s,%s)')
(2) **SELECT or CREATE**
cursor.execute('SELECT 1 FROM table WHERE id=%s')
if not cursor.fetchone():
    cursor.execute('INSERT INTO table VALUES (%s,%s)')
I have to do this pattern millions of times, so I'm looking for the best performance for this pattern. Which one is preferable? Why?
The insert ignore is the better method, for several reasons.
In terms of performance, only one query is being compiled and executed, rather than two. This saves the overhead of moving stuff in and out of the database.
In terms of maintenance, only having one query is more maintainable, because the logic is all in one place. If you added a where clause, for instance, you would be more likely to miss adding it in two separate queries.
In terms of accuracy, a single query has no (or at least far fewer) opportunities for race conditions. With the two-query version, if a row is inserted between the select and the insert, you will still get an error.
However, better than insert ignore is insert ... on duplicate key update. The latter only suppresses the error for duplicate keys, whereas insert ignore might be ignoring errors that you actually care about.
By the way, you should be checking for errors from the statement anyway.
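A sketch of that on duplicate key update variant, keeping the schematic table from the question and assuming the second column is named val; row_id and row_val stand in for your actual values:

cursor.execute(
    'INSERT INTO table (id, val) VALUES (%s, %s) '
    'ON DUPLICATE KEY UPDATE val = VALUES(val)',
    (row_id, row_val),
)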
With most performance issues, the best approach is to try it both ways and measure them to see which is actually faster. Most of the time, there are many small things which affect performance that aren't obvious on the surface. Trying to predict the performance of something ahead of time often takes longer than conducting the test and may even be impossible to do with any accuracy.
It is important, though, to be as careful as possible to simulate your actual production conditions exactly. As I said before, small things can make a big difference in performance, and you'll want to avoid invalidating your test by changing one of them between your test environment and the production environment.
With SQL performance, one of the most relevant items is the content of the database during the test. Queries which perform well with a few rows become very slow with many rows. Or, queries which are fast when all the data is very similar become very slow when it is very diverse. The best approach (if possible) is to create a clone of your production database in which to run your tests. That way, you're sure about not fooling yourself with an inaccurate test environment.
Once you've got your tests running, you may want to run your database's explain plan equivalent to find out exactly what is going on with each approach. This will often allow you to start tuning both to remove obvious issues. Sometimes, this will make enough difference to change which is faster, or even suggest a third approach which beats both of them.
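As a rough harness for the "try it both ways and measure" advice above (conn, cursor, and sample_rows are assumed to exist, and the table should be reset between runs so both approaches see the same starting data):

import time

def timed(label, fn):
    start = time.perf_counter()
    fn()
    conn.commit()
    print(label, time.perf_counter() - start)

def insert_ignore():
    for row in sample_rows:
        cursor.execute('INSERT IGNORE INTO table VALUES (%s,%s)', row)

def select_or_create():
    for row in sample_rows:
        cursor.execute('SELECT 1 FROM table WHERE id=%s', (row[0],))
        if not cursor.fetchone():
            cursor.execute('INSERT INTO table VALUES (%s,%s)', row)

timed('insert ignore:', insert_ignore)
timed('select or create:', select_or_create)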
For a single or couple of entries, I would use the first approach "INSERT IGNORE" without any doubts.
We don't know many details about your case, but in case you have bulk inserts (since you mentioned you need to run this millions of times), then the key to boosting your insert performance is to use one INSERT statement for a bulk of entries instead of an INSERT statement per entry.
This can be achieved either by:
Using INSERT IGNORE.
INSERT IGNORE INTO table VALUES (id1,'val1'), (id2,'val2')....
Or, run a single SELECT statement that, for a bulk of entries, gets the ones that already exist, i.e.: SELECT id FROM table WHERE id in (id1, id2, id3....)
Then, programmatically in your code, exclude the retrieved ids from the initial list (a Python sketch of this approach appears at the end of this answer).
Then run your INSERT statement:
INSERT INTO table VALUES (id1,'val1'), (id5,'val5')..
Normally, we would expect that INSERT IGNORE bulk inserts would be optimal since they are handled by the DB engine, but this cannot be guaranteed. Therefore, for your solution it is better to do a small validation of both cases using a bulk of data.
If you don't want to run a small comparison test to validate, then start with the INSERT IGNORE bulk inserts (bulk inserts are needed in both cases); if you notice slowness during your tests, you can try the second approach.
Normally, the second approach would also be fast, since the first SELECT is done on a bulk of ids (the primary key), so the query is fast and much better than running a SELECT per entry. Filtering the ids programmatically is also fast.
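A Python sketch of that second approach, with conn/cursor and a rows list of (id, value) tuples assumed:

ids = [row[0] for row in rows]
in_list = ', '.join(['%s'] * len(ids))
cursor.execute('SELECT id FROM table WHERE id IN ({})'.format(in_list), ids)
existing = {r[0] for r in cursor.fetchall()}

to_insert = [row for row in rows if row[0] not in existing]
if to_insert:
    values = ', '.join(['(%s,%s)'] * len(to_insert))
    flat = [v for row in to_insert for v in row]
    cursor.execute('INSERT INTO table VALUES ' + values, flat)
conn.commit()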
Python application, standard web app.
If a particular request gets executed twice by mistake, the second request will try to insert a row with an already existing primary key.
What is the most sensible way to deal with it?
a) Execute a query to check if the primary key already exists and do the checking and error handling in the python app
b) Let the SQL engine reject the insertion with a constraint failure and use exception handling to handle it back in the app
From a speed perspective it might seem that a failed request takes about the same amount of time as a successful one, making (b) faster because it's only one request and not two.
However, when you take things into account like read-only DB slaves, table write-locks, and so on, things get fuzzy, in my experience with scaling standard SQL databases.
The best option is (b), from almost any perspective. As mentioned in a comment, there is a multi-threading issue. That means that option (a) doesn't even protect data integrity. And that is a primary reason why you want data integrity checks inside the database, not outside it.
There are other reasons. Consider performance. Passing data into and out of the database takes effort. There are multiple levels of protocol and data preparation, not to mention round trip, sequential communication from the database server. One call has one such unit of overhead. Two calls have two such units.
It is true that under some circumstances, a failed query can have a long clean-up period. However, constraint checking for unique values is a single lookup in an index, which is both fast and has minimal overhead for cleaning up. The extra overhead for handling the error should be tiny in comparison to the overhead for running the queries from the application -- both are small, one is much smaller.
If you had a query load where inserts were really rare relative to the existence checks, then you might consider doing the check in the application. It is probably a tiny bit faster to check whether something exists using a SELECT rather than using an INSERT. However, unless your query load involves many such checks for each actual insert, I would go with checking in the database and move on to other issues.
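A sketch of option (b) in the application, assuming psycopg2 (any PEP 249 driver raises its own IntegrityError for a duplicate key); the table, cursor, and values are placeholders:

import psycopg2

try:
    cur.execute('INSERT INTO requests (id, payload) VALUES (%s, %s)', (req_id, payload))
    conn.commit()
except psycopg2.IntegrityError:
    conn.rollback()  # duplicate primary key: the first request already did the work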
The latter (b) you need to do and handle in any case, thus I do not see much value in querying for duplicates, except to show the user information beforehand - e.g. reporting "This username has been taken already, please choose another" while the user is still filling in the form.
To set the background: I'm interested in:
Capturing implicit signals of interest in books as users browse around a site. The site is written in Django (Python) using MySQL, memcached, nginx, and Apache.
Let's say, for instance, my site sells books. As a user browses around my site I'd like to keep track of which books they've viewed, and how many times they've viewed them.
Not that I'd store the data this way, but ideally I could have on-the-fly access to a structure like:
{user_id : {book_id: number_of_views, book_id_2: number_of_views}}
I realize there are a few approaches here:
Some flat-file log
Writing an object to a database every time
Writing to an object in memcached
I don't really know the performance implications, but I'd rather not be writing to a database on every single page view; the lag of writing to a log and computing the structure later seems not quick enough to give good recommendations on the fly as you use the site; and the memcached approach seems fine, but there's a cost in keeping this object in memory: you might lose it, and it never gets written somewhere 'permanent'.
What approach would you suggest? (doesn't have to be one of the above) Thanks!
If this data is not just an unimportant statistic that might or might not be available, I'd suggest taking the simple approach and using a model. Yes, it will hit the database every time.
Unless you are absolutely positively sure these queries are actually degrading overall experience there is no need to worry about it. Even if you optimize this one, there's a good chance other unexpected queries are wasting more CPU time. I assume you wouldn't be asking this question if you were testing all other queries. So why risk premature optimization on this one?
An advantage of the model approach would be having an API in place. When you have tested and decided to optimize you can keep this API and change the underlying model with something else (which will most probably be more complex than a model).
I'd definitely go with a model first and see how it performs. (and also how other parts of the project perform)
What approach would you suggest? (doesn't have to be one of the above) Thanks!
Hmmm... this is like being in a four-walled room with only one door and saying you want to get out of the room, but not through the only door...
There was an article I was reading some time back (can't get the link now) that says memcached can handle huge sets of data in memory (Facebook uses it) with very little degradation in performance. My advice is that you will need to explore memcached more; I think it will do the trick.
Either a document datastore (mongo/couchdb) or a persistent key-value store (tokyodb, memcachedb, etc.) may be explored.
No definite recommendations from me as the final solution depends on multiple factors - load, your willingness to learn/deploy a new technology, size of the data...
Seems to me that one approach could be to use memcached to keep the counter, but have a cron running regularly to store the value from memcached to the db or disk. That way you'd get all the performance of memcached, but in the case of a crash you wouldn't lose more than a couple of minutes' data.
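A rough sketch of that counter-plus-cron idea, assuming the python-memcached client and a hypothetical save_view_count() function that persists a row; note that memcached cannot enumerate its keys, so the flush job needs its own list of (user_id, book_id) pairs to check:

import memcache

mc = memcache.Client(['127.0.0.1:11211'])

def record_view(user_id, book_id):
    key = 'views:%d:%d' % (user_id, book_id)
    if not mc.add(key, 1):  # add() fails if the key already exists
        mc.incr(key)

def flush_views(pairs, save_view_count):
    # run from cron every few minutes
    for user_id, book_id in pairs:
        key = 'views:%d:%d' % (user_id, book_id)
        count = mc.get(key)
        if count:
            save_view_count(user_id, book_id, count)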