I wrote this Python script to import a specific .xls file into MySQL. It works fine, but if it's run twice on the same data it will create duplicate entries. I'm pretty sure I need to use MySQL JOIN but I'm not clear on how to do that. Also, is executemany() going to have the same overhead as doing inserts in a loop? I'm obviously trying to avoid that.
Here's the code in question...
mailing_list = {}
rows = []
for row in range(sheet.nrows):
    # name is in the 0th col, email is in the 4th col
    name = sheet.cell(row, 0).value
    email = sheet.cell(row, 4).value
    if name and email:
        mailing_list[name.lstrip()] = email.strip()
for n, e in sorted(mailing_list.iteritems()):
    rows.append((n, e))
db = MySQLdb.connect(host=host, user=user, db=dbname, passwd=pwd)
cursor = db.cursor()
cursor.executemany("""
    INSERT IGNORE INTO mailing_list (name, email) VALUES (%s, %s)""", rows)
CLARIFICATION...
I read here that...
To be sure, executemany() is effectively the same as simple iteration.
However, it is typically faster. It provides an optimized means of
affecting INSERT and REPLACE across multiple rows.
Also, I took Unode's suggestion and used the UNIQUE constraint. But the IGNORE keyword is better than ON DUPLICATE KEY UPDATE because I want it to fail silently.
TL;DR
1. What's the best way to prevent duplicate inserts?
ANSWER 1: a UNIQUE constraint on the column, with INSERT IGNORE to fail silently, or ON DUPLICATE KEY UPDATE to update the existing row (e.g. increment a counter) instead.
2. Is executemany() as expensive as INSERT in a loop?
@Unode says it's not, but my research tells me otherwise. I would like a definitive answer.
3. Is this the best way, or is it going to be really slow with bigger tables, and how would I test to be sure?
1 - What's the best way to prevent duplicate inserts?
Depending on what "preventing" means in your case, you have two strategies and one requirement.
The requirement is that you add a UNIQUE constraint on the column(s) that you want to be unique. This alone will cause an error if insertion of a duplicate entry is attempted. However, given that you are using executemany, the outcome may not be what you would expect.
Then as strategies you can do:
An initial filter step by running a SELECT statement first. This means running one SELECT per item in your rows to check whether it already exists. This strategy works but is inefficient.
Using ON DUPLICATE KEY UPDATE. This automatically triggers an update if the data already exists; see the sketch below. For more information refer to the official documentation.
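For the second strategy, a minimal sketch with MySQLdb, assuming the mailing_list table from the question has a UNIQUE key on name:

# Duplicates no longer raise an error: the existing row's email is updated instead
cursor.executemany("""
    INSERT INTO mailing_list (name, email) VALUES (%s, %s)
    ON DUPLICATE KEY UPDATE email = VALUES(email)""", rows)
db.commit()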
2 - Is executemany() as expensive as INSERT in a loop?
No, executemany creates one query which inserts in bulk, while a for loop creates as many queries as there are elements in your rows.
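For illustration, with MySQLdb the call below is rewritten into a single multi-row statement rather than three round trips (a sketch, reusing the question's mailing_list table):

rows = [('alice', 'a@example.com'),
        ('bob', 'b@example.com'),
        ('carol', 'c@example.com')]
# Sent to the server roughly as one statement:
#   INSERT IGNORE INTO mailing_list (name, email)
#   VALUES ('alice','a@example.com'),('bob','b@example.com'),('carol','c@example.com')
cursor.executemany(
    "INSERT IGNORE INTO mailing_list (name, email) VALUES (%s, %s)", rows)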
Related
I'm trying to NOT add data to the database if a certain condition is met.
Is it possible, and if so, what's the syntax?
import sqlite3
text = input('> ')
def DO_NOT_ADD_DATA():
    con = sqlite3.connect('my db file.db')
    cursor = con.cursor()
    if "a" in text():
        print("Uh Oh")
        # I want to NOT add data in this part, it still gets added but it does print out Uh Oh
    else:
        cursor.execute("INSERT INTO table VALUES ('value1', 'value2')")
    con.commit()
    con.close()
Yes. It is quite possible. And you have multiple ways of doing it.
You could do it as in your example (if you fix the syntax errors), where the python script can perform some complex evaluation of whether to perform the operation.
If you instead want to avoid inserting duplicates, you would probably not check in Python, as you can run into race conditions: e.g. you query the database first for whether entry 'a' already exists, it doesn't, but another process sneaks the entry in between your check and your insert.
In these cases you can build your database to ensure it always upholds some constraints: put a UNIQUE constraint on the column, and if you attempt to insert a duplicate, the database will throw an error, so you can react accordingly.
See e.g. https://sqlite.org/lang_createtable.html, https://sqlite.org/syntax/column-constraint.html, https://sqlite.org/syntax/table-constraint.html.
Whether you want to do one or the other really, really depends on what you actually want to achieve.
(Note: The race conditions could be prevented by using transactions, and sometimes transactions and locking rows/tables/databases is preferred over using constraints in the database schema. It all really depends.)
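A minimal sketch of the constraint-based approach with sqlite3 (the table and column names are made up for illustration):

import sqlite3

con = sqlite3.connect('my db file.db')
cursor = con.cursor()
# UNIQUE makes the database itself reject duplicates
cursor.execute("CREATE TABLE IF NOT EXISTS people (name TEXT UNIQUE, email TEXT)")
try:
    cursor.execute("INSERT INTO people VALUES (?, ?)", ('value1', 'value2'))
    con.commit()
except sqlite3.IntegrityError:
    print("Uh Oh")  # duplicate detected by the database; react here
con.close()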
Hello StackEx community.
I am implementing a relational database using SQLite interfaced with Python. My table consists of 5 attributes with around a million tuples.
To avoid large number of database queries, I wish to execute a single query that updates 2 attributes of multiple tuples. These updated values depend on the tuples' Primary Key value and so, are different for each tuple.
I am trying something like the following in Python 2.7:
stmt= 'UPDATE Users SET Userid (?,?), Neighbours (?,?) WHERE Username IN (?,?)'
cursor.execute(stmt, [(_id1, _Ngbr1, _name1), (_id2, _Ngbr2, _name2)])
In other words, I am trying to update the rows that have Primary Keys _name1 and _name2 by substituting the Neighbours and Userid columns with corresponding values. The execution of the two statements returns the following error:
OperationalError: near "(": syntax error
I am reluctant to use executemany() because I want to reduce the number of trips across the database.
I have been struggling with this issue for a couple of hours now but couldn't figure out either the error or an alternative on the web. Please help.
Thanks in advance.
If the column that is used to look up the row to update is properly indexed, then executing multiple UPDATE statements is likely to be more efficient than a single statement, because in the latter case the database would probably need to scan all rows.
Anyway, if you really want to do this, you can use CASE expressions (and explicitly numbered parameters, to avoid duplicates):
UPDATE Users
SET Userid = CASE Username
WHEN ?5 THEN ?1
WHEN ?6 THEN ?2
END,
Neighbours = CASE Username
WHEN ?5 THEN ?3
WHEN ?6 THEN ?4
END
WHERE Username IN (?5, ?6);
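From Python, the numbered placeholders map to positions in one flat parameter sequence; a sketch using the question's variable names, assuming stmt holds the statement above:

# ?1 = _id1, ?2 = _id2, ?3 = _Ngbr1, ?4 = _Ngbr2, ?5 = _name1, ?6 = _name2
cursor.execute(stmt, (_id1, _id2, _Ngbr1, _Ngbr2, _name1, _name2))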
There exists a table Users, and in my code I have a big list of User objects. To insert them I can use:
session.add_all(user_list)
session.commit()
The problem is that there can be several duplicates which I want to update, but the database won't allow inserting duplicate entries. Sure, I can iterate over user_list, try to insert each user into the database, and if that fails, update it:
for u in users:
    q = session.query(T).filter(T.fullname==u.fullname).first()
    if q:
        session.query(T).filter_by(index=q.index).update(
            {column: getattr(u, column)
             for column in Users.__table__.columns.keys()
             if column != 'id'})
        session.commit()
    else:
        session.add(u)
        session.commit()
but I find this solution quite inefficient: first, I am making a separate query to retrieve each object q, and instead of batch-inserting the new items I insert them one by one. I wonder if there exists a better solution for this task.
UPD better version:
for u in users:
    q = session.query(T).filter(Users.fullname==u.fullname).first()
    if q:
        for column in Users.__table__.columns.keys():
            if column != 'index':
                setattr(q, column, getattr(u, column))
        session.add(q)
    else:
        session.add(u)
session.commit()
A better solution would be to use the bulk MySQL construct

INSERT ... ON DUPLICATE KEY UPDATE ...

(I assume you're using MySQL because your post is tagged with 'mysql'). This way you're both inserting new entries and updating existing ones in one statement / transaction; see http://dev.mysql.com/doc/refman/5.6/en/insert-on-duplicate.html
It's not ideal if you have multiple unique indexes and, depending on your schema, you'll have to fill in all NOT NULL values (hence issuing one bulk SELECT before calling it), but it's definitely the most efficient option and we use it a lot. The bulk version will look something like (let's assume name is a unique key):
INSERT INTO User (name, phone, ...) VALUES
    ('ksmith', '111-11-11', ...),
    ('jford', '222-22-22', ...),
    ...
ON DUPLICATE KEY UPDATE
    phone = VALUES(phone),
    ... ;
Unfortunately, INSERT ... ON DUPLICATE KEY UPDATE ... is not supported natively by SQLAlchemy, so you'll have to implement a little helper function which will build the query for you.
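A minimal sketch of such a helper, assuming a DB-API cursor with %s placeholders (MySQLdb-style); the function name and arguments are made up for illustration:

def upsert_many(cursor, table, columns, update_columns, rows):
    # rows is a list of tuples ordered like `columns`
    placeholders = ', '.join(['%s'] * len(columns))
    assignments = ', '.join('{0} = VALUES({0})'.format(c) for c in update_columns)
    sql = 'INSERT INTO {0} ({1}) VALUES ({2}) ON DUPLICATE KEY UPDATE {3}'.format(
        table, ', '.join(columns), placeholders, assignments)
    cursor.executemany(sql, rows)

upsert_many(cursor, 'User', ['name', 'phone'], ['phone'],
            [('ksmith', '111-11-11'), ('jford', '222-22-22')])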
I understand that the fastest way to check if a row exists isn't even to check, but to use an INSERT IGNORE when inserting new data. This is most excellent for my application. However, I need to be able to check if the insert was ignored (or conversely, if the row was actually inserted).
I could use a try/except, but that's not very elegant. I was hoping that someone might have a cleaner and more elegant solution.
Naturally, a final search after I post the question yields the result.
mysql - after insert ignore get primary key
However, this still requires a second trip to the database. I would love to see if there's a clean pythonic way to do this with a single query.
query = "INSERT IGNORE ..."
cursor.execute(query)
# Last row was ignored
if cursor.lastrowid == 0:
This does an INSERT IGNORE query and if the insert is ignored (duplicate), the lastrowid will be 0.
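A fuller sketch with MySQLdb (the users table and its UNIQUE email column are made up for illustration):

cursor.execute("INSERT IGNORE INTO users (email, name) VALUES (%s, %s)",
               ('ksmith@example.com', 'ksmith'))
if cursor.lastrowid == 0:
    print("duplicate: the insert was ignored")
else:
    print("inserted new row with id %d" % cursor.lastrowid)
db.commit()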
In Python, I am populating a SQLite database using executemany, so I can import tens of thousands of rows of data at once. My data is contained in a list of tuples. I had my database set up with the primary keys where I wanted them.
The problem I ran into was that primary key violations would raise an IntegrityError. If I handle the exception, my script stops importing at the primary key conflict.
try:
    self.curs.executemany("INSERT into towers values (NULL,?,?,?,?)", self.insertList)
except IntegrityError:
    print "Primary key error"
conn.commit()
So my questions are: in Python, using executemany, can I:
1. Capture the values that violate the primary key?
2. Continue loading data after I get my primary key errors.
I get why it doesn't continue to load: after the exception I commit the data to the database. I don't know how to continue where I left off, however.
Unfortunately I cannot copy and paste all the code on this network; any help would be greatly appreciated. Right now I have no PKs set as a workaround...
To answer (2) first, if you want to continue loading after you get an error, it's a simple fix on the SQL side:
INSERT OR IGNORE INTO towers VALUES (NULL,?,?,?,?)
This will successfully insert any rows that don't have any violations, and gracefully ignore the conflicts. Please do note however that the IGNORE clause will still fail on Foreign Key violations.
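Plugged into the question's executemany call, this becomes (a sketch; conflicting rows are silently skipped):

self.curs.executemany("INSERT OR IGNORE INTO towers VALUES (NULL,?,?,?,?)",
                      self.insertList)
conn.commit()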
Another option for a conflict resolution clause in your case is: INSERT OR REPLACE INTO .... I strongly recommend the SQLite docs for more information on conflicts and conflict resolution.
As far as I know you cannot do both (1) and (2) simultaneously in an efficient way. You could possibly create a trigger to fire before insertions that can capture conflicting rows but this will impose a lot of unnecessary overhead on all of your insertions. (Someone please let me know if you can do this in a smarter way.) Therefore I would recommend you consider whether you truly need to capture the values of the conflicting rows or whether a redesign of your schema is required, if possible/applicable.
You could use lastrowid to get the point where you stopped:
http://docs.python.org/library/sqlite3.html#sqlite3.Cursor.lastrowid
If you use it, however, you can't use executemany.
Use a for loop to iterate through the list and use execute instead of executemany, with the try inside the loop so that execution continues after an exception. Something like this:
for it in self.insertList:
    try:
        self.curs.execute("INSERT into towers values (NULL,?,?,?,?)", it)
    except IntegrityError:
        # here you could insert the items that were rejected into a temporary
        # table without constraints, for later use (question 1)
        pass
conn.commit()
You can even count how many items of the list were really inserted.
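For example, a small sketch built on the loop above that tallies successful inserts:

inserted = 0
for it in self.insertList:
    try:
        self.curs.execute("INSERT into towers values (NULL,?,?,?,?)", it)
        inserted += 1
    except IntegrityError:
        pass
conn.commit()
print "%d of %d rows inserted" % (inserted, len(self.insertList))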