I am using Python's mysql.connector and have to make a lot of inserts. The data I am inserting is likely to contain some rows that fail the foreign key constraint and therefore return MySQL error 1452.
add_specific="""
INSERT INTO `specific info type`
(`name`, `Classification Type_idClassificationType`)
VALUES
(%s, %s);
"""
cursor.executemany(add_specific, specific_info)
Is there a way to execute all of the inserts so that any row raising a 1452 is simply skipped? I have read that executemany is more efficient, so I would prefer to keep using it. I suppose I could iterate through all the rows, make individual inserts, and catch the exception, but I would rather avoid that.
Use INSERT IGNORE INTO your_table and MySQL will skip the rows that fail insertion instead of aborting with an error.
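Applied to the executemany call from the question, a minimal sketch could look like this (table and column names are copied from the question; specific_info is assumed to already hold the parameter tuples, and cnx is assumed to be the mysql.connector connection):

add_specific = """
    INSERT IGNORE INTO `specific info type`
    (`name`, `Classification Type_idClassificationType`)
    VALUES
    (%s, %s)
"""
# Rows that would violate the foreign key (error 1452) are downgraded to
# warnings and skipped; all other rows are inserted.
cursor.executemany(add_specific, specific_info)
cnx.commit()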
Related
I am inserting around 500K rows of data from a Pandas dataframe into a DuckDB database. Some of the data is duplicated, and I have unique columns set up to improve query speed.
When using
conn.execute('INSERT OR IGNORE INTO Main SELECT * FROM df')
I receive the error
duckdb.ParserException: Parser Error: syntax error at or near "OR"
due to the IGNORE keyword not being supported in DuckDB.
Cleaning the data before insertion is not possible as the size of the data is very large and removing duplicates from the dataframe also does not work. How can I effectively insert this data into the database while avoiding duplicate records?
Upsert support was added in the latest release (0.7.0) via the ON CONFLICT clause, as well as the SQLite-compatible INSERT OR REPLACE / INSERT OR IGNORE syntax.
INSERT INTO <table_name> ... ON CONFLICT <optional_columns_list> <optional_where_clause> DO NOTHING | DO UPDATE SET column_name = <optional 'excluded.' qualifier> column_name, ... <optional_where_clause>;
Examples:
insert into tbl VALUES (3,5,1) ON CONFLICT (i) WHERE k < 5 DO UPDATE SET k = 1;
-- shorter syntax
-- assuming tbl has a primary key/unique constraint, do nothing on conflict
INSERT OR IGNORE INTO tbl(i) VALUES(1);
-- or update the table with the new values instead
INSERT OR REPLACE INTO tbl(i) VALUES(1);
However, there are still a few limitations, listed here.
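Applied to the situation in the question, this is roughly what it looks like (a minimal sketch with a stand-in table and dataframe; it assumes duckdb >= 0.7.0 and is subject to the limitations mentioned above):

import duckdb
import pandas as pd

# Stand-in for the 500K-row dataframe from the question
df = pd.DataFrame({"id": [1, 1, 2], "value": ["a", "a", "b"]})

conn = duckdb.connect("my.duckdb")
conn.execute("CREATE TABLE IF NOT EXISTS Main (id INTEGER PRIMARY KEY, value VARCHAR)")

# With 0.7.0+ the SQLite-style syntax from the question parses; rows that hit
# the primary key are silently skipped. `df` is resolved from the surrounding
# Python scope by DuckDB's replacement scans.
conn.execute("INSERT OR IGNORE INTO Main SELECT * FROM df")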
I am trying to insert data contained in a .csv file from my PC into a remote server. The values are inserted into a table that contains 3 columns, namely Timestamp, Value and TimeseriesID. I have to insert approximately 3000 rows at a time, so I am currently using pyodbc and executemany.
My code up to now is the one shown below:
with contextlib.closing(pyodbc.connect(connection_string, autocommit=True)) as conn:
    with contextlib.closing(conn.cursor()) as cursor:
        cursor.fast_executemany = True  # new in pyodbc 4.0.19
        # Insert values into the DataTable table
        insert_df = df[["Time (UTC)", column]]
        insert_df["id"] = timeseriesID
        insert_df = insert_df[["id", "Time (UTC)", column]]
        sql = "INSERT INTO %s (%s, %s, %s) VALUES (?, ?, ?)" % (
            sqltbl_datatable, 'TimeseriesId', 'DateTime', 'Value')
        params = [i.tolist() for i in insert_df.values]
        cursor.executemany(sql, params)
As I am using pyodbc 4.0.19, I have the fast_executemany option set to True, which is supposed to speed things up. However, for some reason I do not see any great improvement when I enable it. Is there an alternative way I could use to speed up the insertion of my file?
Moreover, regarding the performance of the code shown above, I noticed that when I disable autocommit=True and instead call cursor.commit() at the end, the data is imported significantly faster. Is there a specific reason for this that I am not aware of?
Any help would be greatly appreciated :)
Regarding the cursor.commit() speed-up you are noticing: with autocommit=True you are asking the database to execute one transaction per insert. This means the code resumes only after the database confirms the data is stored on disk. When you instead call cursor.commit() after the numerous INSERTs, you are effectively executing a single transaction, and in the interim the data sits in RAM (it may be written to disk, but not necessarily before you instruct the database to finalize the transaction).
Finalizing a transaction typically entails updating tables on disk, updating indexes, flushing logs, syncing copies, etc., which is costly. That is why you observe such a speed-up between the two scenarios.
When going the faster way, note that until you execute cursor.commit() you cannot be 100% sure the data is in the database, so you may need to reissue the query in case of an error (any partial transaction will be rolled back).
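As a rough sketch of that single-transaction variant, reusing the names from the question (connection_string, sql and params are assumed to be defined as before):

import contextlib
import pyodbc

# Same structure as the code in the question, but with autocommit off and one
# explicit commit at the end, so all ~3000 rows land in a single transaction.
with contextlib.closing(pyodbc.connect(connection_string, autocommit=False)) as conn:
    with contextlib.closing(conn.cursor()) as cursor:
        cursor.fast_executemany = True
        cursor.executemany(sql, params)
        cursor.commit()  # one commit (one durable flush) for the whole batch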
I am using the Python-MySQL (MySQLdb) library to insert values into a database. I want to avoid duplicate entries from being inserted into the database, so I have added the unique constraint to that column in MySQL. I am checking for duplicates in the title column. In my Python script, I am using the following statement:
cursor.execute ("""INSERT INTO `database` (title, introduction) VALUES (%s, %s)""", (title, pure_introduction))
Now when a duplicate entry is added, it produces an error. I do not want an error message to appear; I just want duplicate entries to be silently skipped rather than inserted. How do I do this?
You can utilize the INSERT IGNORE syntax to suppress this type of error.
If you use the IGNORE keyword, errors that occur while executing the INSERT statement are ignored. For example, without IGNORE, a row that duplicates an existing UNIQUE index or PRIMARY KEY value in the table causes a duplicate-key error and the statement is aborted. With IGNORE, the row is discarded and no error occurs. Ignored errors may generate warnings instead, although duplicate-key errors do not.
In your case, the query would become:
INSERT IGNORE INTO `database` (title, introduction) VALUES (%s, %s)
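In the Python script that is just a matter of swapping IGNORE into the original execute call (a minimal sketch):

# Duplicate titles are discarded with a warning instead of raising an error
cursor.execute(
    """INSERT IGNORE INTO `database` (title, introduction) VALUES (%s, %s)""",
    (title, pure_introduction),
)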
Aside from what @Andy suggested (which should really be posted as an answer), you can also catch the exception in Python and silence it:
try:
    cursor.execute("""INSERT INTO `database` (title, introduction) VALUES (%s, %s)""",
                   (title, pure_introduction))
except MySQLdb.IntegrityError:
    pass  # or maybe at least log it?
I understand that the fastest way to check if a row exists isn't even to check, but to use an INSERT IGNORE when inserting new data. This is most excellent for my application. However, I need to be able to check if the insert was ignored (or conversely, if the row was actually inserted).
I could use a try/except, but that's not very elegant. I was hoping someone might have a cleaner, more elegant solution.
Naturally, a final search after posting the question yielded the result:
mysql - after insert ignore get primary key
However, this still requires a second trip to the database. I would love to see if there's a clean pythonic way to do this with a single query.
query = "INSERT IGNORE ..."
cursor.execute(query)
# Last row was ignored
if cursor.lastrowid == 0:
This does an INSERT IGNORE query and if the insert is ignored (duplicate), the lastrowid will be 0.
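Putting it together as a self-contained sketch (the subscribers table and the values are hypothetical; the check assumes the table has an AUTO_INCREMENT primary key plus a UNIQUE column that triggers the ignore):

query = "INSERT IGNORE INTO subscribers (name, email) VALUES (%s, %s)"
cursor.execute(query, ("Alice", "alice@example.com"))

if cursor.lastrowid == 0:
    # lastrowid stays 0 when the row was ignored as a duplicate
    print("duplicate - nothing inserted")
else:
    print("inserted new row with id", cursor.lastrowid)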
I wrote this Python script to import a specific xls file into MySQL. It works fine, but if it's run twice on the same data it will create duplicate entries. I'm pretty sure I need to use a MySQL JOIN, but I'm not clear on how to do that. Also, is executemany() going to have the same overhead as doing inserts in a loop? I'm obviously trying to avoid that.
Here's the code in question...
for row in range(sheet.nrows):
    # name is in the 0th col, email is in the 4th col
    name = sheet.cell(row, 0).value
    email = sheet.cell(row, 4).value
    if name and email:
        mailing_list[name.lstrip()] = email.strip()

for n, e in sorted(mailing_list.iteritems()):
    rows.append((n, e))

db = MySQLdb.connect(host=host, user=user, db=dbname, passwd=pwd)
cursor = db.cursor()
cursor.executemany("""
    INSERT IGNORE INTO mailing_list (name, email) VALUES (%s, %s)""", rows)
CLARIFICATION...
I read here that...
To be sure, executemany() is effectively the same as simple iteration. However, it is typically faster. It provides an optimized means of affecting INSERT and REPLACE across multiple rows.
Also, I took @Unode's suggestion and used the UNIQUE constraint. But the IGNORE keyword is better than ON DUPLICATE KEY UPDATE because I want it to fail silently.
TL;DR
1. What's the best way to prevent duplicate inserts?
ANSWER 1: A UNIQUE constraint on the column, combined with INSERT IGNORE to fail silently, or ON DUPLICATE KEY UPDATE to update the existing row with the new values instead.
2. Is executemany() as expensive as INSERT in a loop?
@Unode says it is not, but my research tells me otherwise. I would like a definitive answer.
3. Is this the best way, or is it going to be really slow with bigger tables, and how would I test to be sure?
1 - What's the best way to prevent duplicate inserts?
Depending on what "preventing" means in your case, you have two strategies and one requirement.
The requirement is that you add a UNIQUE constraint on the column(s) you want to be unique. This alone will cause an error if insertion of a duplicate entry is attempted. However, given you are using executemany, the outcome may not be what you would expect.
Then as strategies you can do:
An initial filter step: run a SELECT statement first, i.e. one SELECT per item in your rows to check whether it already exists. This strategy works but is inefficient.
Using ON DUPLICATE KEY UPDATE. This automatically triggers an update if the data already exists. For more information refer to the official documentation.
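For illustration, the second strategy applied to the mailing_list table from the question could look roughly like this (a sketch that assumes email carries the UNIQUE constraint):

cursor.executemany("""
    INSERT INTO mailing_list (name, email)
    VALUES (%s, %s)
    ON DUPLICATE KEY UPDATE name = VALUES(name)""", rows)
# One commit for the whole batch
db.commit()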
2 - Is executemany() as expensive as INSERT in a loop?
No, executemany creates one query which inserts in bulk, while a for loop creates as many queries as there are elements in your rows.
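As a rough illustration of the difference, with names reused from the question (the exact batching behaviour depends on the driver):

# One round-trip: MySQLdb expands this into a single multi-row INSERT.
cursor.executemany(
    "INSERT IGNORE INTO mailing_list (name, email) VALUES (%s, %s)", rows)

# Many round-trips: one INSERT statement per element of rows.
for n, e in rows:
    cursor.execute(
        "INSERT IGNORE INTO mailing_list (name, email) VALUES (%s, %s)", (n, e))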