Example of using the 'callable' method in pandas.to_sql()? - python

I'm trying to make a specific insert statement that has an ON CONFLICT argument (I'm uploading to a Postgres database); will the df.to_sql(method='callable') allow that? Or is it intended for another purpose? I've read through the documentation, but I wasn't able to grasp the concept. I looked around on this website and others for similar questions, but I haven't found one yet. If possible I would love to see an example of how to use the 'callable' method in practice. Any other ideas on how to effectively load large numbers of rows from pandas using ON CONFLICT logic would be much appreciated as well. Thanks in advance for the help!

Here's an example of how to use Postgres's ON CONFLICT DO NOTHING with to_sql:
# import the Postgres-specific insert construct
from sqlalchemy.dialects.postgresql import insert

def to_sql_on_conflict_do_nothing(pd_table, conn, keys, data_iter):
    # This is very similar to the default to_sql insert in pandas;
    # only the conn.execute line is changed
    data = [dict(zip(keys, row)) for row in data_iter]
    conn.execute(insert(pd_table.table).on_conflict_do_nothing(), data)

conn = engine.connect()
df.to_sql("some_table", conn, if_exists="append", index=False, method=to_sql_on_conflict_do_nothing)

I have just had a similar problem, and following this answer I came up with a solution for how to send a df to PostgreSQL with ON CONFLICT:
1. Send some initial data to the database to create the table
from sqlalchemy import create_engine
engine = create_engine(connection_string)
df.to_sql(table_name,engine)
2. Add a primary key:
ALTER TABLE table_name ADD COLUMN id SERIAL PRIMARY KEY;
3. Create a unique index on the column (or columns) whose uniqueness you want to check (a Python sketch for running steps 2 and 3 follows step 5):
CREATE UNIQUE INDEX review_id ON table_name (review_id);
4. Map the SQL table with SQLAlchemy's automap:
from sqlalchemy.ext.automap import automap_base
ABase = automap_base()
ABase.prepare(engine, reflect=True)  # reflect the existing tables so the mapped class is available
Table = ABase.classes.table_name
Table.__tablename__ = 'table_name'
5. Do your insert ON CONFLICT with:
from sqlalchemy.dialects.postgresql import insert
insrt_vals = df.to_dict(orient='records')
insrt_stmnt = insert(Table).values(insrt_vals)
do_nothing_stmt = insrt_stmnt.on_conflict_do_nothing(index_elements=['review_id'])
results = engine.execute(do_nothing_stmt)
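For reference, here is a small sketch (my addition, reusing the engine from step 1; the table and column names are the placeholders used above) of running steps 2 and 3 from Python instead of a SQL console:
from sqlalchemy import text

with engine.begin() as conn:
    conn.execute(text("ALTER TABLE table_name ADD COLUMN id SERIAL PRIMARY KEY;"))
    conn.execute(text("CREATE UNIQUE INDEX review_id ON table_name (review_id);"))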

Related

Problems while inserting df values with Python into an Oracle DB

I am having trouble when trying to insert data from a df into an Oracle database table; this is the error: DatabaseError: ORA-01036: illegal variable name/number
These are the steps I did:
This is the dataframe I imported from the yfinance package and processed so that the data types match those of my table.
I transformed my df into a list; these are my data in the list:
This is the table where I want to insert my data:
This is the code:
sql_insert_temp = "INSERT INTO TEMPO('GIORNO','MESE','ANNO') VALUES(:2,:3,:4)"
index = 0
for i in df.iterrows():
    cursor.execute(sql_insert_temp, df_list[index])
    index += 1
connection.commit()
I have tried a single insert in the SQL Developer worksheet, using the data you can see in the list, and it worked, so I guess I have made some mistake in the code. I have seen other discussions, but I couldn't find any solution to my problem. Do you have any idea how I can solve this, or is it maybe possible to do it in another way?
I have tried to print the iterated queries and that's the result; that's why it's not inserting my data:
If you already have a pandas DataFrame, then you should be able to use the to_sql() method provided by the pandas library.
import cx_Oracle
import sqlalchemy
import pandas as pd

DATABASE = 'DB'
SCHEMA = 'DEV'
PASSWORD = 'password'
connection_string = f'oracle://{SCHEMA}:{PASSWORD}@{DATABASE}'
db_conn = sqlalchemy.create_engine(connection_string)
df_to_insert = df[['GIORNO', 'MESE', 'ANNO']]  # creates a dataframe with only the columns you want to insert
df_to_insert.to_sql(name='TEMPO', con=db_conn, if_exists='append')
name is the name of the table
con is the connection object
if_exists='append' will add the rows to the end of the table. The other options are 'fail' (the default, which raises an error if the table already exists) and 'replace' (which drops and re-creates the table).
The other parameters can be found in the pandas documentation for pandas.DataFrame.to_sql().
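Alternatively, the original loop can be fixed directly in cx_Oracle. ORA-01036 is typically caused by a mismatch between the bind placeholders and the parameters supplied; here is a hedged sketch (reusing the question's cursor, connection and df_list, which is assumed to be a list of (giorno, mese, anno) tuples): leave the column names unquoted, number the binds from :1, and let executemany insert all rows at once.
sql_insert_temp = "INSERT INTO TEMPO (GIORNO, MESE, ANNO) VALUES (:1, :2, :3)"
cursor.executemany(sql_insert_temp, df_list)  # one round trip instead of a row-by-row loop
connection.commit()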

Write dataframe from Jupyter notebook to Snowflake without defining table column types

I have a data frame in a Jupyter notebook. My objective is to import this df into Snowflake as a new table.
Is there any way to write a new table into Snowflake directly, without defining the table columns' names and types?
I am using:
import snowflake.connector as snow
from snowflake.connector.pandas_tools import write_pandas
from sqlalchemy import create_engine
import pandas as pd

connection = snow.connect(
    user='XXX',
    password='XXX',
    account='XXX',
    warehouse='COMPUTE_WH',
    database='SNOWPLOW',
    schema='DBT_WN'
)
df.to_sql('aaa', connection, index=False)
It ran into an error:
DatabaseError: Execution failed on sql 'SELECT name FROM sqlite_master WHERE type='table' AND name=?;': not all arguments converted during string formatting
Can anyone provide sample code to fix this issue?
Here's one way to do it -- apologies in advance for my code formatting on SO combined with Python's spaces-vs-tabs "model". Check the tabs/spaces if you cut-and-paste ...
Because of the Snowflake security model, be sure to also specify in your connection parameters the ROLE you are using (often the default role is 'PUBLIC').
Since you already have SQLAlchemy in the mix, this idea doesn't use Snowflake's write_pandas, so it isn't a good answer for large dataframes. There are some odd behaviors with SQLAlchemy and Snowflake: make sure the dataframe column names are upper case, yet use a lower-case table name in the argument to to_sql() ...
def df2sf_alch(target_df, target_table):
    # create a sqlAlchemy engine that reuses the already-open Snowflake connection
    engine = create_engine(f"snowflake://{your_sf_account_url}",  # placeholder for your Snowflake account URL
                           creator=lambda: connection)
    # re/create table in Snowflake
    try:
        # sqlAlchemy creates the table based on a lower-case table name,
        # and it works to have uppercase df column names
        target_df.to_sql(target_table.lower(), con=engine, if_exists='replace', index=False)
        print(f"Table {target_table.upper()} re/created")
    except Exception as e:
        print(f"Could not replace table {target_table.upper()}: {e}")
    nrows = connection.cursor().execute(f"select count(*) from {target_table}").fetchone()[0]
    print(f"Table {target_table.upper()} rows = {nrows}")
Note this function needs to be changed to use your actual Snowflake account URL when creating the SQLAlchemy engine. Also, assuming the case-naming oddities are taken care of in the df, and given your already-defined connection, you'd call this function by simply passing the df and the name of the table, like df2sf_alch(my_df, 'MY_TABLE').
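For large dataframes, here is a rough sketch of the write_pandas route imported in the question (an assumption on my part: a recent snowflake-connector-python that supports the auto_create_table parameter, and the connection object defined in the question):
from snowflake.connector.pandas_tools import write_pandas

# bulk-loads the dataframe via a staged file; auto_create_table creates the table
# from the dataframe if it does not exist yet (upper-case name to match Snowflake's
# identifier folding)
result = write_pandas(connection, df, "AAA", auto_create_table=True)
print(result)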

Is there any way to automatically load column data types into SQLite using SQLAlchemy?

I have a large csv file with nearly 100 columns of varying data types that I would like to load into a sqlite database using sqlalchemy. This will be an ongoing thing where I will periodically load new data as a new table in the database. This seems like it should be trivial, but I cannot get anything to work.
All the solutions I've looked at so far have defined the columns explicitly when creating the tables.
Here is a minimal example (with far fewer columns) of what I have at the moment.
from sqlalchemy import *
import pandas as pd
values_list = []
url = r"https://raw.githubusercontent.com/amanthedorkknight/fifa18-all-player-statistics/master/2019/data.csv"
df = pd.read_csv(url,sep=",")
df = df.to_dict()
metadata = MetaData()
engine = create_engine("sqlite:///" + r"C:\Users\...\example.db")
connection = engine.connect()
# I would like define just the primary key column and the others be automatically loaded...
t1 = Table('t1', metadata, Column('ID',Integer,primary_key=True))
metadata.create_all(engine)
stmt = insert(t1).values()
values_list.append(df)
results = connection.execute(stmt, values_list)
values_list = []
connection.close()
Thanks for the suggestions. After some time searching, a decent solution is the sqlathanor package. It has a function called generate_model_from_csv which allows you to read in a csv (also available for dict, JSON, etc.) and build a sqlalchemy model directly. It is imperfect at data-type recognition, but it will certainly save you some time if you have a lot of columns.
https://sqlathanor.readthedocs.io/en/latest/api.html#generate-model-from-csv
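For comparison, pandas' own to_sql will also create the table for you and infer the SQLAlchemy column types from the DataFrame dtypes. A minimal sketch using the same example CSV (the sqlite file name is just an example):
import pandas as pd
from sqlalchemy import create_engine

url = "https://raw.githubusercontent.com/amanthedorkknight/fifa18-all-player-statistics/master/2019/data.csv"
df = pd.read_csv(url)

engine = create_engine("sqlite:///example.db")
# the table is created automatically, with column types inferred from the dtypes
df.to_sql("t1", engine, if_exists="replace", index=False)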

Insert into postgreSQL table from pandas with "on conflict" update

I have a pandas DataFrame that I need to store into the database. Here's my current line of code for inserting:
df.to_sql(table,con=engine,if_exists='append',index_label=index_col)
This works fine if none of the rows in df exist in my table. If a row already exists, I get this error:
sqlalchemy.exc.IntegrityError: (psycopg2.IntegrityError) duplicate key
value violates unique constraint "mypk"
DETAIL: Key (id)=(42) already exists.
[SQL: 'INSERT INTO mytable (id, owner,...) VALUES (%(id)s, %(owner)s,...']
[parameters:...] (Background on this error at: http://sqlalche.me/e/gkpj)
and nothing is inserted.
PostgreSQL has an optional ON CONFLICT clause, which could be used to UPDATE the existing table rows. I read the entire pandas.DataFrame.to_sql manual page and I couldn't find any way to use ON CONFLICT within the DataFrame.to_sql() function.
I have considered splitting my DataFrame in two based on what's already in the db table. So now I have two DataFrames, insert_rows and update_rows, and I can safely execute
insert_rows.to_sql(table, con=engine, if_exists='append', index_label=index_col)
But then, there seems to be no UPDATE equivalent to DataFrame.to_sql(). So how do I update the table using DataFrame update_rows?
I know it's an old thread, but I ran into the same issue and this thread showed up in Google. None of the answers is really satisfying yet, so here's what I came up with:
My solution is pretty similar to zdgriffith's answer, but much more performant as there's no need to iterate over data_iter:
def postgres_upsert(table, conn, keys, data_iter):
    from sqlalchemy.dialects.postgresql import insert

    data = [dict(zip(keys, row)) for row in data_iter]

    insert_statement = insert(table.table).values(data)
    upsert_statement = insert_statement.on_conflict_do_update(
        constraint=f"{table.table.name}_pkey",
        set_={c.key: c for c in insert_statement.excluded},
    )
    conn.execute(upsert_statement)
Now you can use this custom upsert method in pandas' to_sql method like zdgriffith showed.
Please note that my upsert function uses the primary key constraint of the table. You can target another constraint by changing the constraint argument of .on_conflict_do_update.
This SO answer on a related thread explains the use of .excluded a bit more: https://stackoverflow.com/a/51935542/7066758
@SaturnFromTitan, thanks for the reply to this old thread. That worked like magic. I would upvote, but I don't have the rep.
For those that are as new to all this as I am:
You can cut and paste SaturnFromTitan's answer and call it with something like:
df.to_sql('my_table_name',
          dbConnection,
          schema='my_schema',
          if_exists='append',
          index=False,
          method=postgres_upsert)
And that's it. The upsert works.
To follow up on Brendan's answer with an example, this is what worked for me:
import os
import sqlalchemy as sa
import pandas as pd
from sqlalchemy.dialects.postgresql import insert

engine = sa.create_engine(os.getenv("DBURL"))
meta = sa.MetaData()
meta.bind = engine
meta.reflect(views=True)

def upsert(table, conn, keys, data_iter):
    upsert_args = {"constraint": "test_table_col_a_col_b_key"}
    for data in data_iter:
        data = {k: data[i] for i, k in enumerate(keys)}
        upsert_args["set_"] = data
        insert_stmt = insert(meta.tables[table.name]).values(**data)
        upsert_stmt = insert_stmt.on_conflict_do_update(**upsert_args)
        conn.execute(upsert_stmt)

if __name__ == "__main__":
    df = pd.read_csv("test_data.txt")
    with engine.connect() as conn:
        df.to_sql(
            "test_table",
            con=conn,
            if_exists="append",
            method=upsert,
            index=False,
        )
where in this example the schema would be something like:
CREATE TABLE test_table (
    col_a text NOT NULL,
    col_b text NOT NULL,
    col_c text,
    UNIQUE (col_a, col_b)
);
If you look at the to_sql docs, there's mention of a method argument that takes a callable. Creating this callable should allow you to use the Postgres clauses you need. Here's the example of a callable they mention in the docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method
It's pretty different from what you need, but follow the arguments passed to this callable: they will allow you to construct a regular SQL statement.
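For illustration, here is a minimal sketch (mine, not from the docs) of such a callable that builds a plain INSERT ... ON CONFLICT DO NOTHING statement from the arguments pandas passes in; the conflict target column 'id' is just an assumed example and should match your unique constraint:
from sqlalchemy import text

def insert_do_nothing(table, conn, keys, data_iter):
    columns = ", ".join(keys)
    placeholders = ", ".join(f":{k}" for k in keys)
    # 'id' is an assumed primary-key/unique column; adjust to your table
    stmt = text(f"INSERT INTO {table.name} ({columns}) "
                f"VALUES ({placeholders}) ON CONFLICT (id) DO NOTHING")
    conn.execute(stmt, [dict(zip(keys, row)) for row in data_iter])

# usage: df.to_sql('mytable', engine, if_exists='append', index=False, method=insert_do_nothing)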
If anybody wants to build on top of the answer from zdgriffith and dynamically generate the table constraint name, you can use the following query for PostgreSQL:
select distinct tco.constraint_name
from information_schema.table_constraints tco
join information_schema.key_column_usage kcu
    on kcu.constraint_name = tco.constraint_name
    and kcu.constraint_schema = tco.constraint_schema
where kcu.table_name = '{table.name}'
    and constraint_type = 'PRIMARY KEY';
You can then format this string to populate table.name inside the upsert() method.
I also didn't require the meta.bind and meta.reflect() lines. The latter will be deprecated soon anyway.
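Here is a rough sketch (my own, combining that query with SaturnFromTitan's callable above) of looking up the primary-key constraint name at run time inside the method callable, using a bound parameter rather than string formatting:
from sqlalchemy import text
from sqlalchemy.dialects.postgresql import insert

def postgres_upsert_dynamic(table, conn, keys, data_iter):
    # look up the primary-key constraint name for this table
    pk_query = text(
        "select tco.constraint_name "
        "from information_schema.table_constraints tco "
        "join information_schema.key_column_usage kcu "
        "  on kcu.constraint_name = tco.constraint_name "
        "  and kcu.constraint_schema = tco.constraint_schema "
        "where kcu.table_name = :table_name "
        "  and tco.constraint_type = 'PRIMARY KEY'"
    )
    constraint_name = conn.execute(pk_query, {"table_name": table.name}).scalar()

    data = [dict(zip(keys, row)) for row in data_iter]
    insert_statement = insert(table.table).values(data)
    upsert_statement = insert_statement.on_conflict_do_update(
        constraint=constraint_name,
        set_={c.key: c for c in insert_statement.excluded},
    )
    conn.execute(upsert_statement)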

How to update all the values of a column in a DB2 table using Python [duplicate]

Is there any way to do an SQL update-where from a dataframe without iterating through each line? I have a postgresql database and to update a table in the db from a dataframe I would use psycopg2 and do something like:
con = psycopg2.connect(database='mydb', user='abc', password='xyz')
cur = con.cursor()
for index, row in df.iterrows():
    sql = 'update table set column = %s where column = %s'
    cur.execute(sql, (row['whatver'], row['something']))
con.commit()
But on the other hand, if I'm either reading a table from SQL or writing an entire dataframe to SQL (with no update-where), then I would just use pandas and SQLAlchemy. Something like:
engine = create_engine('postgresql+psycopg2://user:pswd@mydb')
df.to_sql('table', engine, if_exists='append')
It's great just having a 'one-liner' using to_sql. Isn't there something similar to do an update-where from pandas to PostgreSQL? Or is the only way to do it by iterating through each row like I've done above? Isn't iterating through each row an inefficient way to do it?
Consider a temp table which would be exact replica of your final table, cleaned out with each run:
engine = create_engine('postgresql+psycopg2://user:pswd@mydb')
df.to_sql('temp_table', engine, if_exists='replace')
sql = """
UPDATE final_table AS f
SET col1 = t.col1
FROM temp_table AS t
WHERE f.id = t.id
"""
with engine.begin() as conn:    # TRANSACTION
    conn.execute(sql)
It looks like you are using some external data stored in df for the conditions on updating your database table. If it is possible, why not just do a one-line SQL update?
If you are working with a smallish database (where loading the whole data into a Python dataframe object isn't going to kill you), then you can definitely conditionally update the dataframe after loading it with read_sql. Then you can use the keyword arg if_exists="replace" to replace the DB table with the new, updated table.
df = pandas.read_sql("select * from your_table;", engine)

# update information (update your_table set column = "new value" where column = "old value")
# still may need to iterate for many old value/new value pairs
df.loc[df['column'] == "old value", "column"] = "new value"

# send data back to sql
df.to_sql("your_table", engine, if_exists="replace")
Pandas is a powerful tool, where limited SQL support was just a small feature at first. As time goes by people are trying to use pandas as their only database interface software. I don't think pandas was ever meant to be an end-all for database interaction, but there are a lot of people working on new features all the time. See: https://github.com/pandas-dev/pandas/issues
I have so far not seen a case where the pandas sql connector can be used in any scalable way to update database data. It may have seemed like a good idea to build one, but really, for operational work it just does not scale.
What I would recommend is to dump your entire dataframe as CSV using
df.to_csv('filename.csv', encoding='utf-8')
Then load the CSV into the database using COPY for PostgreSQL or LOAD DATA INFILE for MySQL.
If you do not make other changes to the table in question while the data is being manipulated by pandas, you can just load into the table.
If there are concurrency issues, you will have to load the data into a staging table that you then use to update your primary table from.
In the latter case, your primary table needs a datetime column which tells you when the latest modification to it was, so you can determine whether your pandas changes are the latest or whether the database changes should remain.
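For the PostgreSQL side, here is a hedged sketch (not part of the answer above) of loading that CSV with COPY via psycopg2; the file name, table name and connection details are placeholders, and it assumes the CSV columns match the table (e.g. the dataframe was written with index=False):
import psycopg2

conn = psycopg2.connect(database='mydb', user='abc', password='xyz')
with conn, conn.cursor() as cur, open('filename.csv', encoding='utf-8') as f:
    # HEADER skips the column-name row written by df.to_csv()
    cur.copy_expert("COPY your_table FROM STDIN WITH (FORMAT csv, HEADER)", f)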
I was wondering why you don't update the df first based on your equation and then store the df back to the database; you could use if_exists='replace' to store it in the same table.
In case the column names have not changed, I prefer removing all rows and then appending the new data to the now-empty table. If the table were dropped and re-created instead, dependent views would have to be regenerated as well:
from sqlalchemy import create_engine
from sqlalchemy import MetaData
engine = create_engine(f'postgresql://postgres:{pw}@localhost:5432/table')
# Get main table and delete all rows
# without deleting the table
meta = MetaData(engine)
meta.reflect(engine)
table = meta.tables['table']
del_st = table.delete()
conn = engine.connect()
res = conn.execute(del_st)
# Insert new data
df.to_sql('table', engine, if_exists='append', index=False)
I tried the first answer and found it did not work well in every situation, so I changed some parts to handle all cases, still using pandas + SQLAlchemy to do the update.
def update_to_sql(self, table_name, key_name):
    a = []
    self.table = table_name
    self.primary_key = key_name
    for col in df.columns:
        if col == self.primary_key:
            continue
        a.append("f.{col}=t.{col}".format(col=col))
    df.to_sql('temporary_table', self.sql_engine, if_exists='replace', index=False)
    update_stmt_1 = "UPDATE {final_table} AS f".format(final_table=self.table)
    update_stmt_2 = " INNER JOIN (SELECT * FROM temporary_table) AS t ON t.{primary_key}=f.{primary_key} ".format(primary_key=self.primary_key)
    update_stmt_3 = "SET "
    update_stmt_4 = ", ".join(a)
    update_stmt_5 = update_stmt_1 + update_stmt_2 + update_stmt_3 + update_stmt_4 + ";"
    print(update_stmt_5)
    with self.sql_engine.begin() as cnx:
        cnx.execute(update_stmt_5)
Here is an approach that I found to be somewhat clean. It uses SQLAlchemy. It only updates one column at a time, but can easily be generalized.
from sqlalchemy import MetaData, Table
from sqlalchemy.orm import sessionmaker

def dataframe_update(df, table, engine, primary_key, column):
    md = MetaData(engine)
    table = Table(table, md, autoload=True)
    session = sessionmaker(bind=engine)()
    for _, row in df.iterrows():
        session.query(table).filter(table.columns[primary_key] == row[primary_key]).update({column: row[column]})
    session.commit()
