I am inserting several tens of thousands of records into a database with referential integrity rules. Some of the rows are unfortunately duplicates (that is, they already exist in the database). It would be too expensive to check every row for existence before inserting it, so I intend to proceed by handling the IntegrityError exceptions thrown by SQLAlchemy, logging the error, and then continuing.
My code will look something like this:
from sqlalchemy.exc import IntegrityError

# establish connection to db etc.
tbl = obtain_binding_to_sqlalchemy_orm()
datarows = load_rows_to_import()
try:
    conn.execute(tbl.insert(), datarows)
except IntegrityError as ie:
    pass  # eat error and keep going
except Exception as e:
    pass  # do something else
The (implicit) assumption I am making above is that SQLAlchemy is not rolling the multiple inserts into ONE transaction. If my assumption is wrong, then an IntegrityError would abort the rest of the insert. Can anyone confirm whether the pseudocode "pattern" above will work as expected, or will I end up losing data as a result of thrown IntegrityError exceptions?
Also, if anyone has a better idea of doing this, I will be interested to hear it.
It may work like this if you haven't started a transaction beforehand, since in that case SQLAlchemy's autocommit feature will kick in. But then you should set it explicitly, as described in the link.
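For illustration, here is a minimal sketch of what relying on autocommit explicitly might look like; the isolation_level="AUTOCOMMIT" option and the row-by-row loop are my assumptions, not something stated above:

from sqlalchemy import create_engine
from sqlalchemy.exc import IntegrityError

# Assumption: with AUTOCOMMIT each statement commits on its own, so one
# failing row does not take the previously inserted rows down with it.
engine = create_engine("sqlite:///example.db", isolation_level="AUTOCOMMIT")

with engine.connect() as conn:
    for row in datarows:  # one dict of column values per row
        try:
            conn.execute(tbl.insert(), row)
        except IntegrityError:
            pass  # duplicate row; log it and keep going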
There is almost no portable way to tell the SQL engine to do a bulk insert with an ignore-on-duplicate action. But we can try a fallback solution on the Python end: attempt the bulk insert, and if it fails, retry the rows one by one. If your duplicates are not distributed in a very bad way (see the comment in the code), this pretty much gets the benefits of both worlds.
from sqlalchemy import insert
from sqlalchemy.exc import IntegrityError, SQLAlchemyError

try:
    # by "very bad" I mean: what if each batch of the items contains one duplicate?
    session.bulk_insert_mappings(mapper, items)
    session.commit()
except IntegrityError:
    logger.info("bulk inserting rows failed, falling back to one by one")
    session.rollback()
    for item in items:
        try:
            session.execute(insert(mapper).values(**item))
            session.commit()
        except SQLAlchemyError:
            logger.exception("Error inserting item: %s", item)
            session.rollback()
I also encountered this problem when I was parsing ASCII data files to import into a table. The problem is that I instinctively wanted SQLAlchemy to skip the duplicate rows while letting the unique data through. Or a random error might be thrown for a particular row by the current SQL engine, such as unicode strings not being allowed.
However, this behavior is outside the scope of the SQL interface. SQL APIs, and hence SQLAlchemy, only understand transactions and commits, and do not account for this selective behavior. Moreover, it sounds dangerous to depend on the autocommit feature, since the insertion halts after the exception, leaving the rest of the data unimported.
My solution (which I am not sure is the most elegant one) is to process every line in a loop, catch and log exceptions, and commit the changes at the very end.
Assume that you have somehow acquired the data as a list of lists, i.e. a list of rows, each of which is a list of column values. Then you process every row in a loop:
# Python 3.5
from sqlalchemy import Table, create_engine
import logging

# Create the engine
# Create the table
# Parse the data file and save data in `rows`

conn = engine.connect()
trans = conn.begin()  # Disables autocommit
exceptions = {}
totalRows = 0
importedRows = 0
ins = table.insert()

for currentRowIdx, cols in enumerate(rows):
    try:
        conn.execute(ins.values(cols))  # try to insert the column values
        importedRows += 1
    except Exception as e:
        exc_name = type(e).__name__  # save the exception name
        if exc_name not in exceptions:
            exceptions[exc_name] = []
        exceptions[exc_name].append(currentRowIdx)
    totalRows += 1

for key, val in exceptions.items():
    logging.warning("%d out of %d lines were not imported due to %s." % (len(val), totalRows, key))
logging.info("%d rows were imported." % importedRows)

trans.commit()  # Commit at the very end
conn.close()
To maximize speed in this operation, you should disable autocommit. I am using this code with SQLite, and it is still 3-5 times slower than my older version that used sqlite3 directly, even with autocommit disabled. (The reason I ported to SQLAlchemy was to be able to use it with MySQL.)
It is not the most elegant solution in the sense that it is not as fast as a direct interface to SQLite. If I profile the code and find the bottleneck in the near future, I will update this answer.
Related
I'm trying to NOT add data to the database if a certain condition is met.
Is it possible, and if so, what's the syntax?
import sqlite3

text = input('> ')

def DO_NOT_ADD_DATA():
    con = sqlite3.connect('my db file.db')
    cursor = con.cursor()
    if "a" in text():
        print("Uh Oh")
        # I want to NOT add data in this part, it still gets added but it does print out Uh Oh
    else:
        cursor.execute("INSERT INTO table VALUES ('value1', 'value2')")
        con.commit()
    con.close()
Yes. It is quite possible. And you have multiple ways of doing it.
You could do it as in your example (if you fix the syntax errors), where the Python script can perform some complex evaluation of whether to perform the operation.
If you instead want to avoid inserting duplicates, you would probably not check for them in Python, as you can run into race conditions (e.g. you query the database first to see whether entry 'a' already exists, it doesn't, but another process sneaks the entry in between the moment you checked and the moment you actually insert it).
In such cases you can instead build your database to ensure it always upholds certain constraints. Here, you could put a UNIQUE constraint on the column, and if you attempt to insert a duplicate, the database will throw an error that you can react to accordingly.
See e.g. https://sqlite.org/lang_createtable.html, https://sqlite.org/syntax/column-constraint.html, https://sqlite.org/syntax/table-constraint.html.
Whether you want to do one or the other really depends on what you actually want to achieve.
(Note: The race conditions could be prevented by using transactions, and sometimes transactions and locking rows/tables/databases is preferred over using constraints in the database schema. It all really depends.)
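As a rough sketch of the constraint-based approach (the table and column names here are invented for illustration), it could look like this:

import sqlite3

con = sqlite3.connect('my db file.db')
cursor = con.cursor()

# The UNIQUE constraint makes the database itself reject duplicates.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS entries (
        value1 TEXT UNIQUE,
        value2 TEXT
    )
""")

try:
    cursor.execute("INSERT INTO entries VALUES (?, ?)", ('value1', 'value2'))
    con.commit()
except sqlite3.IntegrityError:
    print("Uh Oh")  # the duplicate was rejected, react here

con.close()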
Using SQLAlchemy with a large dataset, I would like to insert all rows using something efficient like session.add_all() followed by session.commit(). I am looking for a way to ignore inserting any rows which raise duplicate / unique key errors. The problem is that these errors only come up on the session.commit() call, so there is no way to fail that specific row and move on to the next.
The closest question I have seen is here: SQLAlchemy - bulk insert ignore: "Duplicate entry" ; however, the accepted answer proposes not using the bulk method and committing after every single row insert, which is extremely slow and causes huge amounts of I/O, so I am looking for a better solution.
Indeed.
Same issue here. They seem to have forgotten about performance, and especially with a remote DB this is an issue.
What I then always do is code around it in Python using a dictionary or list. The trick with a dictionary, for instance, is to set both the key and the value to the same key data.
i.e.
myEmailAddressesDict = {}
myEmailList = []

for emailAddress in allEmailAddresses:
    if emailAddress not in myEmailAddressesDict:
        # not seen before, can add
        myEmailList.append(emailAddress)
        myEmailAddressesDict[emailAddress] = emailAddress

Session = sessionmaker(bind=self.engine)
mySession = Session()  # sessionmaker returns a factory; instantiate it
try:
    mySession.add_all(myEmailList)
    mySession.commit()
except Exception as e:
    print("Add exception: ", str(e))
mySession.close()
It's not a fix for the actual problem, but a workaround for the moment. The catch with this solution is that it assumes you have cleared the DB (delete_all) or are starting with nothing; if the DB already contains data, the code will still fail.
What we really need is something like a parameter in SQLAlchemy to ignore dupes on add_all, or a merge_all.
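Depending on the backend, one possible alternative (my suggestion, not part of the workaround above; the table and column names are invented) is a dialect-specific insert that skips conflicting rows in a single round trip, e.g. with SQLAlchemy 1.4+ on SQLite or PostgreSQL:

from sqlalchemy.dialects.sqlite import insert  # or sqlalchemy.dialects.postgresql

# email_table is assumed to have a unique constraint on the "email" column
stmt = insert(email_table).on_conflict_do_nothing()

with engine.begin() as conn:
    # rows is a list of dicts, e.g. [{"email": "a@example.com"}, ...]
    conn.execute(stmt, rows)  # duplicates are silently skipped by the database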
A typical MySQLdb query can use a lot of memory and perform poorly in Python when a large result set is generated. For example:
cursor.execute("SELECT id, name FROM `table`")
for i in xrange(cursor.rowcount):
    id, name = cursor.fetchone()
    print id, name
There is an optional cursor class that fetches just one row at a time, really speeding up the script and cutting its memory footprint a lot.
import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(user="user", passwd="password", db="dbname",
                       cursorclass=MySQLdb.cursors.SSCursor)
cur = conn.cursor()
cur.execute("SELECT id, name FROM users")
row = cur.fetchone()
while row is not None:
    doSomething()
    row = cur.fetchone()
cur.close()
conn.close()
But I can't find anything about using SSCursor with nested queries. If this is the definition of doSomething():
def doSomething():
    cur2 = conn.cursor()
    cur2.execute('select id,x,y from table2')
    rows = cur2.fetchall()
    for row in rows:
        doSomethingElse(row)
    cur2.close()
then the script throws the following error:
_mysql_exceptions.ProgrammingError: (2014, "Commands out of sync; you can't run this command now")
It sounds as if SSCursor is not compatible with nested queries. Is that true? If so that's too bad because the main loop seems to run too slowly with the standard cursor.
This problem is discussed a bit in the MySQLdb User's Guide, under the heading of the threadsafety attribute (emphasis mine):
The MySQL protocol can not handle multiple threads using the same connection at once. Some earlier versions of MySQLdb utilized locking to achieve a threadsafety of 2. While this is not terribly hard to accomplish using the standard Cursor class (which uses mysql_store_result()), it is complicated by SSCursor (which uses mysql_use_result()); with the latter you must ensure all the rows have been read before another query can be executed.
The documentation for the MySQL C API function mysql_use_result() gives more information about your error message:
When using mysql_use_result(), you must execute mysql_fetch_row() until a NULL value is returned, otherwise, the unfetched rows are returned as part of the result set for your next query. The C API gives the error Commands out of sync; you can't run this command now if you forget to do this!
In other words, you must completely fetch the result set from any unbuffered cursor (i.e., one that uses mysql_use_result() instead of mysql_store_result() - with MySQLdb, that means SSCursor and SSDictCursor) before you can execute another statement over the same connection.
In your situation, the most direct solution would be to open a second connection to use while iterating over the result set of the unbuffered query. (It wouldn't work to simply get a buffered cursor from the same connection; you'd still have to advance past the unbuffered result set before using the buffered cursor.)
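A rough sketch of that two-connection approach (the connection parameters and the inner query are placeholders based on the code in the question):

import MySQLdb
import MySQLdb.cursors

# unbuffered connection for streaming the big result set
conn_outer = MySQLdb.connect(user="user", passwd="password", db="dbname",
                             cursorclass=MySQLdb.cursors.SSCursor)
# separate buffered connection for the nested queries
conn_inner = MySQLdb.connect(user="user", passwd="password", db="dbname")

cur = conn_outer.cursor()
cur.execute("SELECT id, name FROM users")
for user_id, name in cur:
    cur2 = conn_inner.cursor()
    cur2.execute("SELECT id, x, y FROM table2 WHERE id = %s", (user_id,))
    for row in cur2.fetchall():
        doSomethingElse(row)
    cur2.close()

cur.close()
conn_outer.close()
conn_inner.close()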
If your workflow is something like "loop through a big result set, executing N little queries for each row," consider looking into MySQL's stored procedures as an alternative to nesting cursors from different connections. You can still use MySQLdb to call the procedure and get the results, though you'll definitely want to read the documentation of MySQLdb's callproc() method since it doesn't conform to Python's database API specs when retrieving procedure outputs.
A second alternative is to stick to buffered cursors, but split up your query into batches. That's what I ended up doing for a project last year where I needed to loop through a set of millions of rows, parse some of the data with an in-house module, and perform some INSERT and UPDATE queries after processing each row. The general idea looks something like this:
QUERY = r"SELECT id, name FROM `table` WHERE id BETWEEN %s AND %s;"
BATCH_SIZE = 5000
i = 0
while True:
    cursor.execute(QUERY, (i + 1, i + BATCH_SIZE))
    result = cursor.fetchall()

    # If there's no possibility of a gap as large as BATCH_SIZE in your table ids,
    # you can test to break out of the loop like this (otherwise, adjust accordingly):
    if not result:
        break

    for row in result:
        doSomething()

    i += BATCH_SIZE
One other thing I would note about your example code is that you can iterate directly over a cursor in MySQLdb instead of calling fetchone() explicitly over xrange(cursor.rowcount). This is especially important when using an unbuffered cursor, because the rowcount attribute is undefined and will give a very unexpected result (see: Python MysqlDB using cursor.rowcount with SSDictCursor returning wrong count).
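For instance, the SSCursor loop from the question can simply be written as:

cur = conn.cursor()
cur.execute("SELECT id, name FROM users")
for user_id, name in cur:  # works for buffered and unbuffered cursors alike
    doSomething()
cur.close()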
I have records that I am processing, and part of the process is getting rid of duplicates. To that end, I've created UniqueConstraints on my tables using SQLAlchemy, and I have the following try/except in my save function, which runs in a for loop:
code1
for record in millionsofrecords:
    try:
        # code to save
        session.flush()
    except IntegrityError:
        logger("live_loader", LOGLEVEL.warning, "Duplicate Entry")
    except:
        logger("live_loader", LOGLEVEL.critical, "\n%s" % (sys.exc_info()[1]))
        raise
With the above, I capture the error okay, but then SQLAlchemy complains on the next loop iteration: sqlalchemy.exc.InvalidRequestError: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). So I changed it to the following:
code2
for record in millionsofrecords:
    try:
        # code to save
        session.flush()
    except IntegrityError:
        logger("live_loader", LOGLEVEL.warning, "Duplicate Entry")
        session.rollback()
    except:
        logger("live_loader", LOGLEVEL.critical, ":%s" % (sys.exc_info()[1]))
        raise
I changed the function to include the session.rollback, but then any record inserted prior to the duplicate detection is discarded.
Questions:
1. When a duplicate is detected, I want to skip that one record but insert the rest that are not duplicates. When I add the session.rollback as shown in code2 above, all the records that had already been flushed earlier are discarded. I only want to discard the duplicate record and allow all the rest to be saved.
2. Which is the better design, given that I'll be processing many records: do a quick SELECT on the DB to detect duplicates, or do what I'm doing now and let the database's unique keys work for me while I capture the duplication exception and move on?
What you are doing (especially if you have many records) is executing many queries, and that is highly inefficient. It would be a better idea to build a query that detects the duplicates up front. You should still keep the unique constraint to avoid future conflicts, but for the transition (or whatever you are doing there) it is far more efficient to have such a query. You could even transform the results into a dictionary, because that is often more efficient if you need to look things up by a key.
Furthermore, concerning speed, use a set instead of a list if you need to do something like value in duplicates because it is more efficient.
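A rough sketch of that idea (the model name and key column are invented for illustration): load the existing keys into a set once, then filter before adding:

# Hypothetical ORM model "Record" with a unique "key" column
existing_keys = {key for (key,) in session.query(Record.key).all()}

new_records = [r for r in millionsofrecords if r.key not in existing_keys]
session.add_all(new_records)
session.commit()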
Now if you'd like to keep your old way, use savepoints. The idea is that you say "okay, so far I am fine, let's keep this, so when I roll back I only lose the last bit I wasn't sure about yet". Your example is a bit too short to build a full example on, but usually, before adding something to the database you issue a begin_nested and then either commit or roll back.
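A minimal sketch of the savepoint approach, assuming a recent SQLAlchemy where begin_nested() can be used as a context manager:

from sqlalchemy.exc import IntegrityError

for record in millionsofrecords:
    try:
        with session.begin_nested():  # SAVEPOINT; rolled back on error
            session.add(record)
            session.flush()
    except IntegrityError:
        logger("live_loader", LOGLEVEL.warning, "Duplicate Entry")

session.commit()  # keeps everything that survived its savepoint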
In Python, I am populating a SQLite database using executemany, so I can import tens of thousands of rows of data at once. My data is contained in a list of tuples. I had my database set up with the primary keys where I wanted them.
The problem I ran into was that primary key violations raise an IntegrityError. If I handle the exception, my script stops importing at the primary key conflict.
try:
    self.curs.executemany("INSERT into towers values (NULL,?,?,?,?)", self.insertList)
except IntegrityError:
    print("Primary key error")
conn.commit()
So my questions are: in Python, using executemany, can I:
1. Capture the values that violate the primary key?
2. Continue loading data after I get my primary key errors.
I get why it doesn't continue to load: after the exception I commit the data to the database. I don't know how to continue where I left off, however.
Unfortunately I cannot copy and paste all the code on this network; any help would be greatly appreciated. Right now I have no PKs set as a workaround...
To answer (2) first, if you want to continue loading after you get an error, it's a simple fix on the SQL side:
INSERT OR IGNORE INTO towers VALUES (NULL,?,?,?,?)
This will successfully insert any rows that don't have any violations, and gracefully ignore the conflicts. Please do note however that the IGNORE clause will still fail on Foreign Key violations.
Another option for a conflict resolution clause in your case is: INSERT OR REPLACE INTO .... I strongly recommend the SQLite docs for more information on conflicts and conflict resolution.
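As a rough sketch (counting the skipped rows via total_changes is my addition, not part of the answer above), the whole batch can then go through in one call:

before = conn.total_changes
self.curs.executemany("INSERT OR IGNORE INTO towers VALUES (NULL,?,?,?,?)",
                      self.insertList)
conn.commit()
inserted = conn.total_changes - before
print("inserted %d of %d rows; %d duplicates ignored"
      % (inserted, len(self.insertList), len(self.insertList) - inserted))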
As far as I know you cannot do both (1) and (2) simultaneously in an efficient way. You could possibly create a trigger to fire before insertions that can capture conflicting rows but this will impose a lot of unnecessary overhead on all of your insertions. (Someone please let me know if you can do this in a smarter way.) Therefore I would recommend you consider whether you truly need to capture the values of the conflicting rows or whether a redesign of your schema is required, if possible/applicable.
You could use lastrowid to get the point where you stopped:
http://docs.python.org/library/sqlite3.html#sqlite3.Cursor.lastrowid
If you use it, however, you can't use executemany.
Use a for loop to iterate through the list and use execute instead of executemany, with the try inside the loop so that execution continues after an exception. Something like this:
for it in self.insertList:
    try:
        self.curs.execute("INSERT into towers values (NULL,?,?,?,?)", it)
    except IntegrityError:
        # here you could insert the items that were rejected into a temporary table
        # without constraints, for later use (question 1)
        pass
conn.commit()
You can even count how many items of the list were really inserted.