Continue loading after IntegrityError - python

In Python, I am populating a SQLite database using executemany so I can import tens of thousands of rows of data at once. My data is contained in a list of tuples. I had my database set up with the primary keys where I wanted them.
The problem I ran into is that primary key conflicts raise an IntegrityError. If I handle the exception, my script stops importing at the primary key conflict.
try:
    self.curs.executemany("INSERT into towers values (NULL,?,?,?,?)", self.insertList)
except IntegrityError:
    print "Primary key error"
conn.commit()
So my questions are: in Python, using executemany, can I:
1. Capture the values that violate the primary key?
2. Continue loading data after I get my primary key errors?
I get why it doesn't continue to load: after the exception I commit the data to the database. I don't know how to continue where I left off, however.
Unfortunately I cannot copy and paste all the code on this network; any help would be greatly appreciated. Right now I have no PKs set as a workaround...

To answer (2) first, if you want to continue loading after you get an error, it's a simple fix on the SQL side:
INSERT OR IGNORE INTO towers VALUES (NULL,?,?,?,?)
This will successfully insert any rows that don't have any violations, and gracefully ignore the conflicts. Please do note however that the IGNORE clause will still fail on Foreign Key violations.
Another option for a conflict resolution clause in your case is: INSERT OR REPLACE INTO .... I strongly recommend the SQLite docs for more information on conflicts and conflict resolution.
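For reference, a minimal sketch of what the OR IGNORE route looks like with executemany, reusing the towers table, self.curs, self.insertList and conn from the question (untested, names assumed from your post):

# OR IGNORE silently drops the conflicting rows, so executemany runs to completion.
self.curs.executemany(
    "INSERT OR IGNORE INTO towers VALUES (NULL,?,?,?,?)",
    self.insertList)
conn.commit()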
As far as I know, you cannot do both (1) and (2) simultaneously in an efficient way. You could possibly create a trigger that fires before insertions and captures conflicting rows, but that imposes a lot of unnecessary overhead on all of your insertions. (Someone please let me know if there is a smarter way to do this.) I would therefore recommend you consider whether you truly need to capture the values of the conflicting rows, or whether a redesign of your schema is possible/applicable.

You could use lastrowid to get the point where you stopped:
http://docs.python.org/library/sqlite3.html#sqlite3.Cursor.lastrowid
If you use it, however, you can't use executemany.
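A rough sketch of that approach, assuming the same towers table, cursor and list from the question; you switch to single execute calls so lastrowid stays meaningful:

# Insert rows one at a time so cursor.lastrowid reflects the last successful insert;
# on an IntegrityError you then know exactly where you stopped.
for row in self.insertList:
    try:
        self.curs.execute("INSERT INTO towers VALUES (NULL,?,?,?,?)", row)
    except IntegrityError:
        print("Stopped after rowid %d" % self.curs.lastrowid)
        break
conn.commit()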

Use a for loop to iterate through the list and use execute instead of executemany. Put the insert inside a try within the loop so execution continues after an exception. Something like this:
for it in self.insertList:
    try:
        self.curs.execute("INSERT into towers values (NULL,?,?,?,?)", it)
    except IntegrityError:
        # here you could insert the items that were rejected into a temporary
        # table without constraints for later use (question 1)
        pass
conn.commit()
You can even count how many items of the list were really inserted.
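For example, extending the loop above with a counter (a small sketch; the inserted variable is made up here):

inserted = 0
for it in self.insertList:
    try:
        self.curs.execute("INSERT into towers values (NULL,?,?,?,?)", it)
        inserted += 1   # only counts rows that did not raise IntegrityError
    except IntegrityError:
        pass
conn.commit()
print("%d of %d rows inserted" % (inserted, len(self.insertList)))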

Related

Is it possible to NOT add data into a database when a certain condition is met? Sqlite3

I'm trying to NOT add data to the database if a certain condition is met.
Is it possible, and if so, what's the syntax?
import sqlite3

text = input('> ')

def DO_NOT_ADD_DATA():
    con = sqlite3.connect('my db file.db')
    cursor = con.cursor()
    if "a" in text():
        print("Uh Oh")
        # I want to NOT add data in this part, it still gets added but it does print out Uh Oh
    else:
        cursor.execute("INSERT INTO table VALUES ('value1', 'value2')")
        con.commit()
    con.close()
Yes. It is quite possible. And you have multiple ways of doing it.
You could do it as in your example (if you fix the syntax errors), where the Python script performs some evaluation of whether to carry out the operation.
If you instead want to avoid inserting duplicates, you would probably not do that check in Python, as you can run into race conditions (e.g. you query the database first to see whether entry 'a' already exists, it doesn't, but another process sneaks the entry in between your check and your actual insert).
In these cases you can build your database so that it always upholds certain constraints. Here, you could put a UNIQUE constraint on the column; if you then attempt to insert a duplicate, the database will throw an error, and you can react accordingly.
See e.g. https://sqlite.org/lang_createtable.html, https://sqlite.org/syntax/column-constraint.html, https://sqlite.org/syntax/table-constraint.html.
Whether you want to do one or the other really, really depends on what you actually want to achieve.
(Note: The race conditions could be prevented by using transactions, and sometimes transactions and locking rows/tables/databases is preferred over using constraints in the database schema. It all really depends.)
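As a rough sketch of the constraint-based approach (the table name "entries" and its columns are made up here for illustration):

import sqlite3

con = sqlite3.connect('my db file.db')
cursor = con.cursor()

# Hypothetical table: the UNIQUE constraint lets the database itself reject duplicates.
cursor.execute("CREATE TABLE IF NOT EXISTS entries (col1 TEXT UNIQUE, col2 TEXT)")

try:
    cursor.execute("INSERT INTO entries VALUES (?, ?)", ('value1', 'value2'))
    con.commit()
except sqlite3.IntegrityError:
    print("Uh Oh")  # duplicate col1 value; nothing was inserted

con.close()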

SQLAlchemy - bulk insert ignore Duplicate / Unique

Using SQLAlchemy with a large dataset, I would like to insert all rows using something efficient like session.add_all() followed by session.commit(). I am looking for a way to ignore inserting any rows which raise duplicate / unique key errors. The problem is that these errors only come up on the session.commit() call, so there is no way to fail that specific row and move on to the next.
The closest question I have seen is here: SQLAlchemy - bulk insert ignore: "Duplicate entry"; however, the accepted answer proposes not using the bulk method and committing after every single row insert, which is extremely slow and causes huge amounts of I/O, so I am looking for a better solution.
Indeed.
Same issue here. They seem to have forgotten about performance, and especially with a remote DB this is an issue.
What I then always do is code around it in Python using a dictionary or a list. The trick with a dictionary, for instance, is to set both the key and the value to the same key data, i.e.:
myEmailAddressesDict = {}
myEmailList = []
for emailAddress in allEmailAddresses:
    if emailAddress not in myEmailAddressesDict:
        # can add
        myEmailList.append(emailAddress)
        myEmailAddressesDict[emailAddress] = emailAddress

Session = sessionmaker(bind=self.engine)
mySession = Session()
try:
    mySession.add_all(myEmailList)
    mySession.commit()
except Exception as e:
    print("Add exception: ", str(e))
mySession.close()
It's not a fix to the actual problem but a workaround for the moment. The key to this solution is that you have actually cleared the DB (delete_all) or are starting with nothing; otherwise, when the DB already contains data, the code will still fail.
For this we need something like a parameter in SQLAlchemy to ignore dupes on add_all, or they should provide a merge_all.
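In the meantime, if you can drop down to Core for this one insert, a possible workaround is to let the database skip the duplicates itself. This is only a sketch: it assumes SQLAlchemy 1.4+ on an SQLite backend, and EmailAddress is a hypothetical mapped class with a unique email column.

from sqlalchemy.dialects.sqlite import insert

# One statement executed for the whole parameter list; rows that collide on the
# unique "email" column are silently skipped instead of raising IntegrityError.
stmt = insert(EmailAddress.__table__).on_conflict_do_nothing(index_elements=["email"])

with engine.begin() as conn:
    conn.execute(stmt, [{"email": addr} for addr in allEmailAddresses])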

Skip a record after a duplicate error (except IntegrityError)

I have records that I am processing, and part of the process is getting rid of duplicates. To that end, I've created UniqueConstraints on my tables using SQLAlchemy, and I have the following try/except in my save function, which runs in a for loop:
code1
for record in millionsofrecords:
    try:
        # code to save
        session.flush()
    except IntegrityError:
        logger("live_loader", LOGLEVEL.warning, "Duplicate Entry")
    except:
        logger("live_loader", LOGLEVEL.critical, "\n%s" % (sys.exc_info()[1]))
        raise
With the above, I capture the error okay, but then on the next loop iteration SQLAlchemy states: sqlalchemy.exc.InvalidRequestError: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). So I changed to the following:
code2
for record in millionsofrecords:
    try:
        # code to save
        session.flush()
    except IntegrityError:
        logger("live_loader", LOGLEVEL.warning, "Duplicate Entry")
        session.rollback()
    except:
        logger("live_loader", LOGLEVEL.critical, ":%s" % (sys.exc_info()[1]))
        raise
I changed the function to include session.rollback, but then any record inserted prior to the duplicate detection is discarded.
Questions:
When a duplicate is detected, I want to skip that one record but insert the rest that are not duplicates. When I add the session.rollback as shown in code2 above, all the records that had been flushed earlier are discarded. I only want to discard the duplicate record and allow all the rest to be saved.
Which is the better design, given that I'll be processing many records: do a quick SELECT against the DB to detect the duplicate, or do what I am doing now and let the database's unique keys work for me while I capture the duplication exception and move on?
What you are doing (especially if you have many records) is executing many queries, and that is highly inefficient. It would therefore be a better idea to build one query that detects the duplicates up front. You should still keep the unique constraint to avoid further conflicts, but for the transition (or whatever you are doing there) it is far more efficient to have such a query. You could even transform the results into a dictionary, because that is often more efficient if you need to look things up by a key.
Furthermore, concerning speed, use a set instead of a list if you need to do something like value in duplicates, because membership tests on a set are more efficient.
Now, if you'd rather keep your current approach, use savepoints. The idea is to say "okay, so far I am fine, let's keep this, so that when I roll back, only the last bit I wasn't sure about yet is undone". Your example is a bit too short to tailor an example to it, but usually, before adding a record to the database you call begin_nested and then either commit or roll back, roughly as in the sketch below.
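A rough sketch of that savepoint pattern (SQLAlchemy 1.4-style context manager; the other names are taken from your code1/code2, and session.add stands in for your "code to save"):

for record in millionsofrecords:
    try:
        with session.begin_nested():   # SAVEPOINT per record
            session.add(record)        # placeholder for your "code to save"
            session.flush()
    except IntegrityError:
        logger("live_loader", LOGLEVEL.warning, "Duplicate Entry")
session.commit()   # keeps every record whose savepoint was released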

MySQLdb Python prevent duplicate and optimize muliple inserts

I wrote this Python script to import a specific xls file into MySQL. It works fine, but if it's run twice on the same data it will create duplicate entries. I'm pretty sure I need to use MySQL JOIN but I'm not clear on how to do that. Also, is executemany() going to have the same overhead as doing inserts in a loop? I'm obviously trying to avoid that.
Here's the code in question...
for row in range(sheet.nrows):
    """name is in the 0th col. email is the 4th col."""
    name = sheet.cell(row, 0).value
    email = sheet.cell(row, 4).value
    if name and email:
        mailing_list[name.lstrip()] = email.strip()

for n, e in sorted(mailing_list.iteritems()):
    rows.append((n, e))

db = MySQLdb.connect(host=host, user=user, db=dbname, passwd=pwd)
cursor = db.cursor()
cursor.executemany("""
    INSERT IGNORE INTO mailing_list (name, email) VALUES (%s,%s)""", (rows))
CLARIFICATION...
I read here that...
To be sure, executemany() is effectively the same as simple iteration.
However, it is typically faster. It provides an optimized means of
affecting INSERT and REPLACE across multiple rows.
Also, I took Unode's suggestion and used the UNIQUE constraint. But the IGNORE keyword is better than ON DUPLICATE KEY UPDATE because I want it to fail silently.
TL;DR
1. What's the best way to prevent duplicate inserts?
ANSWER 1: UNIQUE constraint on the column, with INSERT IGNORE to fail silently or ON DUPLICATE KEY UPDATE to update the existing row instead.
2. Is executemany() as expensive as INSERT in a loop?
@Unode says it's not, but my research tells me otherwise. I would like a definitive answer.
3. Is this the best way, or is it going to be really slow with bigger tables, and how would I test to be sure?
1 - What's the best way to prevent duplicate inserts?
Depending on what "preventing" means in your case, you have two strategies and one requirement.
The requirement is that you add a UNIQUE constraint on the column/columns that you want to be unique. This alone will cause an error if insertion of a duplicate entry is attempted. However, given that you are using executemany, the outcome may not be what you would expect.
Then as strategies you can do:
An initial filter step by running a SELECT statement before. This means running one SELECT statement per item in your rows to check if it exists already. This strategy works but is inefficient.
Using ON DUPLICATE KEY UPDATE. This automatically triggers an update if the data already exists. For more information refer to the official documentation.
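As a rough sketch of how the requirement and a conflict clause fit together with the question's code (the UNIQUE index is an assumption, since the original schema isn't shown):

# One-time schema change (assumed) so the database itself can detect duplicates:
cursor.execute("ALTER TABLE mailing_list ADD UNIQUE KEY uniq_email (email)")

# With the constraint in place, INSERT IGNORE silently skips conflicting rows;
# swapping in ON DUPLICATE KEY UPDATE would update the existing row instead.
cursor.executemany(
    "INSERT IGNORE INTO mailing_list (name, email) VALUES (%s, %s)",
    rows)
db.commit()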
2 - Is executemany() as expensive as INSERT in a loop?
No: executemany builds one bulk insert query, while a for loop creates as many queries as there are elements in your rows.

SQLAlchemy IntegrityError and bulk data imports

I am inserting several tens of thousands of records into a database with referential integrity rules. Some of the rows are unfortunately duplicates (in that they already exist in the database). It would be too expensive to check the existence of every row in the database before inserting it, so I intend to proceed by handling IntegrityError exceptions thrown by SQLAlchemy, logging the error and then continuing.
My code will look something like this:
# establish connection to db etc.
tbl = obtain_binding_to_sqlalchemy_orm()
datarows = load_rows_to_import()

try:
    conn.execute(tbl.insert(), datarows)
except IntegrityError as ie:
    pass  # eat error and keep going
except Exception as e:
    pass  # do something else
The (implicit) assumption I am making above is that SQLAlchemy is not rolling the multiple inserts into ONE transaction. If my assumption is wrong, it means that if an IntegrityError occurs, the rest of the insert is aborted. Can anyone confirm whether the pseudocode "pattern" above will work as expected, or will I end up losing data as a result of thrown IntegrityError exceptions?
Also, if anyone has a better idea of doing this, I will be interested to hear it.
It may work like this if you didn't start any transaction before, as in that case SQLAlchemy's autocommit feature will kick in. But you should set it explicitly, as described in the link.
There is almost no way to tell the SQL engine to do a bulk insert with an on-duplicate-ignore action. But we can try a fallback solution on the Python end. If your duplicates are not distributed in a very bad way*, this pretty much gets the benefits of both worlds.
from sqlalchemy import insert
from sqlalchemy.exc import IntegrityError, SQLAlchemyError

try:
    # * by "very bad" I mean: what if each batch of items contains one duplicate?
    session.bulk_insert_mappings(mapper, items)
    session.commit()
except IntegrityError:
    logger.info("bulk inserting rows failed, fallback to one by one")
    session.rollback()  # the failed bulk insert left the session in a rolled-back state
    for item in items:
        try:
            session.execute(insert(mapper).values(**item))
            session.commit()
        except SQLAlchemyError:
            logger.exception("Error inserting item: %s", item)
            session.rollback()  # discard just this failed row and keep going
I also encountered this problem when I was parsing ASCII data files to import into a table. The problem is that I instinctively and intuitively wanted SQLAlchemy to skip the duplicate rows while letting the unique data through. Or it could be the case that a random error is thrown for a row by the current SQL engine, such as unicode strings not being allowed.
However, this behavior is outside the scope of the SQL interface's definition. SQL APIs, and hence SQLAlchemy, only understand transactions and commits, and do not account for this selective behavior. Moreover, it sounds dangerous to depend on the autocommit feature, since the insertion halts after the exception, leaving the rest of the data unimported.
My solution (which I am not sure is the most elegant one) is to process every line in a loop, catch and log exceptions, and commit the changes at the very end.
Assume that you have somehow acquired the data as a list of lists, i.e. a list of rows, each of which is a list of column values. Then you process every row in a loop:
# Python 3.5
from sqlalchemy import Table, create_engine
import logging

# Create the engine
# Create the table
# Parse the data file and save data in `rows`

conn = engine.connect()
trans = conn.begin()  # Disables autocommit

exceptions = {}
totalRows = 0
importedRows = 0
ins = table.insert()

for currentRowIdx, cols in enumerate(rows):
    try:
        conn.execute(ins.values(cols))  # try to insert the column values
        importedRows += 1
    except Exception as e:
        exc_name = type(e).__name__  # save the exception name
        if exc_name not in exceptions:
            exceptions[exc_name] = []
        exceptions[exc_name].append(currentRowIdx)
    totalRows += 1

for key, val in exceptions.items():
    logging.warning("%d out of %d lines were not imported due to %s." % (len(val), totalRows, key))
logging.info("%d rows were imported." % (importedRows))

trans.commit()  # Commit at the very end
conn.close()
In order to maximize the speed in this operation, you should disable autocommit. I am using this code with SQLite and it is still 3-5 times slower than my older version using only sqlite3, even with autocommit disabled. (The reason that I ported to SQLAlchemy was to be able to use it with MySQL.)
It is not the most elegant solution in the sense that it is not as fast as a direct interface to SQLite would be. If I profile the code and find the bottleneck in the near future, I will update this answer with the solution.
