Insert into PostgreSQL table from pandas with "on conflict" update - Python

I have a pandas DataFrame that I need to store into the database. Here's my current line of code for inserting:
df.to_sql(table,con=engine,if_exists='append',index_label=index_col)
This works fine if none of the rows in df exist in my table. If a row already exists, I get this error:
sqlalchemy.exc.IntegrityError: (psycopg2.IntegrityError) duplicate key
value violates unique constraint "mypk"
DETAIL: Key (id)=(42) already exists.
[SQL: 'INSERT INTO mytable (id, owner,...) VALUES (%(id)s, %(owner)s,...']
[parameters:...] (Background on this error at: http://sqlalche.me/e/gkpj)
and nothing is inserted.
PostgreSQL has an optional ON CONFLICT clause, which could be used to UPDATE the existing table rows. I read the entire pandas.DataFrame.to_sql manual page and couldn't find any way to use ON CONFLICT within the DataFrame.to_sql() function.
I have considered splitting my DataFrame in two based on what's already in the db table. So now I have two DataFrames, insert_rows and update_rows, and I can safely execute
insert_rows.to_sql(table, con=engine, if_exists='append', index_label=index_col)
But then, there seems to be no UPDATE equivalent to DataFrame.to_sql(). So how do I update the table using the update_rows DataFrame?
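For reference, the closest I can get for the update half is to execute parameterized UPDATE statements myself, something like the sketch below (column names taken from my example error above), but I was hoping to_sql had something built in:
import sqlalchemy as sa

# sketch only: "id" is the index column, "owner" one of the data columns
update_stmt = sa.text("UPDATE mytable SET owner = :owner WHERE id = :id")
with engine.begin() as conn:
    conn.execute(update_stmt, update_rows.reset_index().to_dict(orient="records"))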

I know it's an old thread, but I ran into the same issue and this thread showed up in Google. None of the answers is really satisfying yet, so here's what I came up with:
My solution is pretty similar to zdgriffith's answer, but much more performant as there's no need to iterate over data_iter:
def postgres_upsert(table, conn, keys, data_iter):
    from sqlalchemy.dialects.postgresql import insert

    data = [dict(zip(keys, row)) for row in data_iter]

    insert_statement = insert(table.table).values(data)
    upsert_statement = insert_statement.on_conflict_do_update(
        constraint=f"{table.table.name}_pkey",
        set_={c.key: c for c in insert_statement.excluded},
    )
    conn.execute(upsert_statement)
Now you can use this custom upsert method in pandas' to_sql method like zdgriffith showed.
Please note that my upsert function uses the primary key constraint of the table. You can target another constraint by changing the constraint argument of .on_conflict_do_update.
This SO answer on a related thread explains the use of .excluded a bit more: https://stackoverflow.com/a/51935542/7066758
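If you'd rather not rely on the constraint's name, .on_conflict_do_update also accepts index_elements; a minimal sketch reusing insert_statement from above, where "id" stands in for your key column(s):
upsert_statement = insert_statement.on_conflict_do_update(
    index_elements=["id"],  # assumed conflict column; replace with your key column(s)
    set_={c.key: c for c in insert_statement.excluded},
)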

@SaturnFromTitan, thanks for the reply to this old thread. That worked like magic. I would upvote, but I don't have the rep.
For those who are as new to all this as I am:
You can cut and paste SaturnFromTitan's answer and call it with something like:
df.to_sql(
    'my_table_name',
    dbConnection,
    schema='my_schema',
    if_exists='append',
    index=False,
    method=postgres_upsert,
)
And that's it. The upsert works.

To follow up on Brendan's answer with an example, this is what worked for me:
import os

import pandas as pd
import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import insert

engine = sa.create_engine(os.getenv("DBURL"))
meta = sa.MetaData()
meta.bind = engine
meta.reflect(views=True)


def upsert(table, conn, keys, data_iter):
    upsert_args = {"constraint": "test_table_col_a_col_b_key"}
    for data in data_iter:
        data = {k: data[i] for i, k in enumerate(keys)}
        upsert_args["set_"] = data
        insert_stmt = insert(meta.tables[table.name]).values(**data)
        upsert_stmt = insert_stmt.on_conflict_do_update(**upsert_args)
        conn.execute(upsert_stmt)


if __name__ == "__main__":
    df = pd.read_csv("test_data.txt")
    with engine.connect() as conn:
        df.to_sql(
            "test_table",
            con=conn,
            if_exists="append",
            method=upsert,
            index=False,
        )
where in this example the schema would be something like:
CREATE TABLE test_table (
    col_a text NOT NULL,
    col_b text NOT NULL,
    col_c text,
    UNIQUE (col_a, col_b)
);

If you look at the to_sql docs, there's mention of a method argument that takes a callable. Creating this callable should allow you to use the Postgres clauses you need. Here's an example of a callable they mention in the docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method
It's pretty different from what you need, but follow the arguments passed to this callable. They will allow you to construct a regular SQL statement.
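For orientation, the callable receives the pandas SQLTable wrapper, a connection, the column names, and an iterator of row tuples. A minimal sketch of such a callable (a plain insert with no conflict handling yet, just to show the signature):
import sqlalchemy as sa

def plain_insert(pd_table, conn, keys, data_iter):
    # pd_table.table is the underlying SQLAlchemy Table object,
    # keys are the column names, data_iter yields row tuples
    data = [dict(zip(keys, row)) for row in data_iter]
    conn.execute(sa.insert(pd_table.table), data)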

If anybody wants to build on top of the answer from zdgriffith and dynamically generate the table constraint name, you can use the following query for PostgreSQL:
select distinct tco.constraint_name
from information_schema.table_constraints tco
join information_schema.key_column_usage kcu
  on kcu.constraint_name = tco.constraint_name
 and kcu.constraint_schema = tco.constraint_schema
where kcu.table_name = '{table.name}'
  and tco.constraint_type = 'PRIMARY KEY';
You can then format this string to populate table.name inside the upsert() method.
I also didn't need the meta.bind and meta.reflect() lines; bound MetaData (meta.bind) is deprecated in newer SQLAlchemy versions anyway.
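Putting that lookup together with SaturnFromTitan's callable, a sketch of a dynamic version might look like this (untested; it assumes the target table has exactly one primary-key constraint):
import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import insert

def postgres_upsert_dynamic(table, conn, keys, data_iter):
    # Look up the primary-key constraint name instead of hard-coding it
    constraint_name = conn.execute(
        sa.text(
            "select distinct tco.constraint_name "
            "from information_schema.table_constraints tco "
            "join information_schema.key_column_usage kcu "
            "  on kcu.constraint_name = tco.constraint_name "
            " and kcu.constraint_schema = tco.constraint_schema "
            "where kcu.table_name = :table_name "
            "  and tco.constraint_type = 'PRIMARY KEY'"
        ),
        {"table_name": table.table.name},
    ).scalar()

    data = [dict(zip(keys, row)) for row in data_iter]
    insert_statement = insert(table.table).values(data)
    upsert_statement = insert_statement.on_conflict_do_update(
        constraint=constraint_name,
        set_={c.key: c for c in insert_statement.excluded},
    )
    conn.execute(upsert_statement)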

Related

Write dataframe from Jupyter notebook to Snowflake without defining table column types

I have a data frame in a Jupyter notebook. My objective is to import this df into Snowflake as a new table.
Is there any way to write a new table into Snowflake directly without defining any of the table columns' names and types?
I am using
import snowflake.connector as snow
from snowflake.connector.pandas_tools import write_pandas
from sqlalchemy import create_engine
import pandas as pd

connection = snow.connect(
    user='XXX',
    password='XXX',
    account='XXX',
    warehouse='COMPUTE_WH',
    database='SNOWPLOW',
    schema='DBT_WN',
)

df.to_sql('aaa', connection, index=False)
It ran into an error:
DatabaseError: Execution failed on sql 'SELECT name FROM sqlite_master WHERE type='table' AND name=?;': not all arguments converted during string formatting
Can anyone provide the sample code to fix this issue?
Here's one way to do it. Apologies in advance for my code formatting on SO combined with Python's spaces-vs-tabs model; check the tabs/spaces if you cut and paste.
Because of the Snowflake security model, be sure to also specify the ROLE you are using in your connection parameters (often the default role is PUBLIC).
Since you already have SQLAlchemy in the mix, this idea doesn't use Snowflake's write_pandas, so it isn't a good answer for large dataframes. There are some odd behaviors with SQLAlchemy and Snowflake: make sure the dataframe column names are upper case, yet use a lowercase table name in the argument to to_sql().
def df2sf_alch(target_df, target_table):
    # create a sqlAlchemy connection object
    # (your_sf_account_url is a placeholder for your Snowflake account URL)
    engine = create_engine(f"snowflake://{your_sf_account_url}",
                           creator=lambda: connection)

    # re/create table in Snowflake
    try:
        # sqlAlchemy creates the table based on a lower-case table name,
        # and it works to have uppercase df column names
        target_df.to_sql(target_table.lower(), con=engine,
                         if_exists='replace', index=False)
        print(f"Table {target_table.upper()} re/created")
    except Exception as e:
        print(f"Could not replace table {target_table.upper()}: {e}")

    nrows = connection.cursor().execute(
        f"select count(*) from {target_table}").fetchone()[0]
    print(f"Table {target_table.upper()} rows = {nrows}")
Note this function needs to be changed to use your Snowflake account URL in order to create the SQLAlchemy connection object. Also, assuming the case-naming oddities are taken care of in the df, along with your already-defined connection, you'd call this function by simply passing the df and the name of the table, e.g. df2sf_alch(my_df, 'MY_TABLE').
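For larger dataframes, the write_pandas helper already imported in the question is probably the better fit. A minimal sketch, assuming the target table MY_TABLE already exists in the connection's database/schema:
from snowflake.connector.pandas_tools import write_pandas

# Bulk-load the dataframe into an existing table via Snowflake's COPY path
success, nchunks, nrows, _ = write_pandas(connection, df, "MY_TABLE")
print(f"success={success}, chunks={nchunks}, rows={nrows}")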

Example of using the 'callable' method in pandas.to_sql()?

I'm trying to make a specific insert statement that has an ON CONFLICT argument (I'm uploading to a Postgres database); will the df.to_sql(method='callable') allow that? Or is it intended for another purpose? I've read through the documentation, but I wasn't able to grasp the concept. I looked around on this website and others for similar questions, but I haven't found one yet. If possible I would love to see an example of how to use the 'callable' method in practice. Any other ideas on how to effectively load large numbers of rows from pandas using ON CONFLICT logic would be much appreciated as well. Thanks in advance for the help!
Here's an example of how to use Postgres's ON CONFLICT DO NOTHING with to_sql:
# import the Postgres-specific insert
from sqlalchemy.dialects.postgresql import insert


def to_sql_on_conflict_do_nothing(pd_table, conn, keys, data_iter):
    # This is very similar to the default to_sql function in pandas;
    # only the conn.execute line is changed
    data = [dict(zip(keys, row)) for row in data_iter]
    conn.execute(insert(pd_table.table).on_conflict_do_nothing(), data)


conn = engine.connect()
df.to_sql("some_table", conn, if_exists="append", index=False,
          method=to_sql_on_conflict_do_nothing)
I just had a similar problem, and following this answer I came up with a solution for how to send a df to PostgreSQL with ON CONFLICT:
1. Send some initial data to the database to create the table:
from sqlalchemy import create_engine

engine = create_engine(connection_string)
df.to_sql(table_name, engine)
2. Add a primary key:
ALTER TABLE table_name ADD COLUMN id SERIAL PRIMARY KEY;
3. Create a unique index on the column (or columns) whose uniqueness you want to check:
CREATE UNIQUE INDEX review_id ON table_name (review_id);
4. Map the SQL table with SQLAlchemy's automap:
from sqlalchemy.ext.automap import automap_base

ABase = automap_base()
ABase.prepare(engine, reflect=True)  # reflect so ABase.classes.table_name exists
Table = ABase.classes.table_name
Table.__tablename__ = 'table_name'
5. Do your insert on conflict with:
from sqlalchemy.dialects.postgresql import insert

insrt_vals = df.to_dict(orient='records')
insrt_stmnt = insert(Table).values(insrt_vals)
do_nothing_stmt = insrt_stmnt.on_conflict_do_nothing(index_elements=['review_id'])
results = engine.execute(do_nothing_stmt)
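If you want conflicting rows to be updated rather than skipped, the same statement can be switched to on_conflict_do_update. A sketch reusing the names above (the set_ mapping overwrites every non-key column with the incoming value):
do_update_stmt = insrt_stmnt.on_conflict_do_update(
    index_elements=['review_id'],
    set_={c.key: c for c in insrt_stmnt.excluded if c.key != 'review_id'},
)
results = engine.execute(do_update_stmt)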

How to prevent aliasing in SQLAlchemy Query

I am using SQLAlchemy to extract data from a SQL Server DB into a Pandas Dataframe:
q: Query = self._session(db).query(tbl_obj)
return pd.read_sql(
    str(q),
    db.conn()
)
tbl_obj is a SQLAlchemy Table object that has been autoloaded from an existing table in the DB.
My problem is that the query being created automatically aliases the column names to 'TABLE_NAME_COLUMN_NAME', when I just want them to be 'COLUMN_NAME'.
I figure this is a fairly simple solution, but I haven't figured it out yet. Any thoughts?
Posting this because I figured it out while I was typing up the question. The problem was that I was calling str(q) when I should have been calling q.statement.
This code works as expected, because the statement attribute doesn't include the aliasing:
q: Query = self._session(db).query(tbl_obj)
return pd.read_sql(
    q.statement,
    db.conn()
)

How to update all the values of a column in a db2 table using Python [duplicate]

Is there any way to do an SQL update-where from a dataframe without iterating through each line? I have a postgresql database and to update a table in the db from a dataframe I would use psycopg2 and do something like:
con = psycopg2.connect(database='mydb', user='abc', password='xyz')
cur = con.cursor()
for index, row in df.iterrows():
    sql = 'update table set column = %s where column = %s'
    cur.execute(sql, (row['whatever'], row['something']))
con.commit()
But on the other hand, if I'm either reading a table from SQL or writing an entire dataframe to SQL (with no update-where), then I would just use pandas and SQLAlchemy. Something like:
engine = create_engine('postgresql+psycopg2://user:pswd@mydb')
df.to_sql('table', engine, if_exists='append')
It's great just having a one-liner using to_sql. Isn't there something similar to do an update-where from pandas to PostgreSQL? Or is the only way to do it by iterating through each row as I've done above? Isn't iterating through each row an inefficient way to do it?
Consider a temp table which would be an exact replica of your final table, cleaned out with each run:
engine = create_engine('postgresql+psycopg2://user:pswd@mydb')

df.to_sql('temp_table', engine, if_exists='replace')

sql = """
    UPDATE final_table AS f
    SET col1 = t.col1
    FROM temp_table AS t
    WHERE f.id = t.id
"""

with engine.begin() as conn:  # TRANSACTION
    conn.execute(sql)
It looks like you are using some external data stored in df for the conditions on updating your database table. If it is possible, why not just do a one-line SQL update?
If you are working with a smallish database (where loading all the data into a pandas DataFrame isn't going to kill you), then you can definitely conditionally update the dataframe after loading it using read_sql. Then you can use the keyword arg if_exists="replace" to replace the DB table with the new, updated table.
df = pandas.read_sql("select * from your_table;", engine)

# update information (update your_table set column = "new value" where column = "old value")
# still may need to iterate for many old value/new value pairs
df.loc[df['column'] == "old value", "column"] = "new value"

# send data back to sql
df.to_sql("your_table", engine, if_exists="replace")
Pandas is a powerful tool, where limited SQL support was just a small feature at first. As time goes by people are trying to use pandas as their only database interface software. I don't think pandas was ever meant to be an end-all for database interaction, but there are a lot of people working on new features all the time. See: https://github.com/pandas-dev/pandas/issues
I have so far not seen a case where the pandas sql connector can be used in any scalable way to update database data. It may have seemed like a good idea to build one, but really, for operational work it just does not scale.
What I would recommend is to dump your entire dataframe as CSV using
df.to_csv('filename.csv', encoding='utf-8')
Then load the CSV into the database using COPY for PostgreSQL or LOAD DATA INFILE for MySQL.
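For PostgreSQL, a minimal sketch of that COPY step with psycopg2 (connection parameters and the staging_table name are placeholders, and it assumes the CSV columns line up with the table, e.g. the dataframe was written with index=False):
import psycopg2

conn = psycopg2.connect(database='mydb', user='abc', password='xyz')
with conn, conn.cursor() as cur, open('filename.csv', encoding='utf-8') as f:
    # stream the CSV into the (staging) table via COPY
    cur.copy_expert(
        "COPY staging_table FROM STDIN WITH (FORMAT csv, HEADER true)",
        f,
    )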
If you do not make other changes to the table in question while the data is being manipulated by pandas, you can just load into the table.
If there are concurrency issues, you will have to load the data into a staging table that you then use to update your primary table from.
In the latter case, your primary table needs a datetime column that tells you when it was last modified, so you can determine whether your pandas changes are the latest or whether the database changes should remain.
I was wondering why you don't update the df first based on your equation and then store the df to the database; you could use if_exists='replace' to store into the same table.
In case the column names have not changed, I prefer removing all rows and then appending the data to the now-empty table. Otherwise, dependent views would have to be regenerated as well:
from sqlalchemy import create_engine
from sqlalchemy import MetaData

engine = create_engine(f'postgresql://postgres:{pw}@localhost:5432/table')

# Get main table and delete all rows
# without deleting the table
meta = MetaData(engine)
meta.reflect(engine)
table = meta.tables['table']
del_st = table.delete()
conn = engine.connect()
res = conn.execute(del_st)

# Insert new data
df.to_sql('table', engine, if_exists='append', index=False)
I tried the first answer and found it didn't work well in all situations, so I changed some parts to handle them using pandas + SQLAlchemy for the update:
def update_to_sql(self, table_name, key_name):
    a = []
    self.table = table_name
    self.primary_key = key_name
    for col in df.columns:
        if col == self.primary_key:
            continue
        a.append("f.{col}=t.{col}".format(col=col))
    df.to_sql('temporary_table', self.sql_engine, if_exists='replace', index=False)
    update_stmt_1 = "UPDATE {final_table} AS f".format(final_table=self.table)
    update_stmt_2 = " INNER JOIN (SELECT * FROM temporary_table) AS t ON t.{primary_key}=f.{primary_key} ".format(primary_key=self.primary_key)
    update_stmt_3 = "SET "
    update_stmt_4 = ", ".join(a)
    update_stmt_5 = update_stmt_1 + update_stmt_2 + update_stmt_3 + update_stmt_4 + ";"
    print(update_stmt_5)
    with self.sql_engine.begin() as cnx:
        cnx.execute(update_stmt_5)
Here is an approach that I found to be somewhat clean. This utilizes sqlalchemy. It only updates one column at a time but can easily be generalized.
from sqlalchemy import MetaData, Table
from sqlalchemy.orm import sessionmaker


def dataframe_update(df, table, engine, primary_key, column):
    md = MetaData(engine)
    table = Table(table, md, autoload=True)
    session = sessionmaker(bind=engine)()
    for _, row in df.iterrows():
        session.query(table).filter(
            table.columns[primary_key] == row[primary_key]
        ).update({column: row[column]})
    session.commit()

How do you safely and efficiently get the row id after an insert with mysql using MySQLdb in python?

I have a simple table in mysql with the following fields:
id -- Primary key, int, autoincrement
name -- varchar(50)
description -- varchar(256)
Using MySQLdb, a python module, I want to insert a name and description into the table, and get back the id.
In pseudocode:
db = MySQLdb.connection(...)
queryString = "INSERT INTO tablename (name, description) VALUES (%s, %s)"
db.execute(queryString, (a_name, a_desc))
newID = ???
I think it might be
newID = db.insert_id()
Edit by Original Poster:
It turns out that in the version of MySQLdb I am using (1.2.2), you would do the following:
conn = MySQLdb.connect(host=...)
c = conn.cursor()
c.execute("INSERT INTO...")
newID = c.lastrowid
I am leaving this as the correct answer, since it got me pointed in the right direction.
I don't know if there's a MySQLdb specific API for this, but in general you can obtain the last inserted id by SELECTing LAST_INSERT_ID()
It is on a per-connection basis, so you don't risk race conditions if some other client performs an insert as well.
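A minimal sketch of that approach (connection parameters are placeholders; the name/description values stand in for the question's a_name and a_desc):
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="mydb")
cur = conn.cursor()
a_name, a_desc = "widget", "a test row"  # example values
cur.execute("INSERT INTO tablename (name, description) VALUES (%s, %s)",
            (a_name, a_desc))
# LAST_INSERT_ID() is scoped to this connection, so concurrent inserts
# from other clients cannot interfere with it
cur.execute("SELECT LAST_INSERT_ID()")
new_id = cur.fetchone()[0]
conn.commit()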
You could also do a
conn.insert_id()
The easiest way of all is to wrap your insert together with a select query into a single stored procedure and call that in your code. You would pass the needed parameters into the stored procedure, and it would then select your row id.
