How to update existing rows when importing csv with psycopg2 [duplicate] - python

I want to do
"ON CONFLICT (time) DO UPDATE SET name, description"
but when I use STDIN with CSV I have no idea what to set name and description equal to.
table_a:
xxx.csv:
with open('xxx/xxx.csv', 'r', encoding='utf8') as f:
    sql = """
    COPY table_a FROM STDIN WITH CSV ON CONFLICT (time)
    DO UPDATE SET name=??, description=??;
    """
    cur.copy_expert(sql, f)
    conn.commit()

In this SO post, there are two answers that, combined, provide a nice solution for successfully using ON CONFLICT. The example below uses ON CONFLICT DO NOTHING:
BEGIN;
CREATE TEMP TABLE tmp_table
(LIKE main_table INCLUDING DEFAULTS)
ON COMMIT DROP;
COPY tmp_table FROM 'full/file/name/here';
INSERT INTO main_table
SELECT *
FROM tmp_table
ON CONFLICT DO NOTHING;
COMMIT;
Replace both instances of main_table with the name of your table.
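For the psycopg2 / COPY FROM STDIN case in the question, the same pattern can be driven with copy_expert. A minimal sketch, assuming main_table is the target and the CSV path from the question (the connection string is a placeholder):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder connection string
with conn, conn.cursor() as cur, open('xxx/xxx.csv', 'r', encoding='utf8') as f:
    # temp table with the same columns as the target, dropped when the transaction commits
    cur.execute("CREATE TEMP TABLE tmp_table (LIKE main_table INCLUDING DEFAULTS) ON COMMIT DROP;")
    # stream the CSV into the temp table
    cur.copy_expert("COPY tmp_table FROM STDIN WITH CSV", f)
    # move the rows into the real table, skipping rows that conflict
    cur.execute("INSERT INTO main_table SELECT * FROM tmp_table ON CONFLICT DO NOTHING;")
# leaving the "with conn" block commits the transaction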

Thanks for everyone's solutions.
This is my solution.
sql = """
CREATE TEMP TABLE temp_h (LIKE table_a INCLUDING DEFAULTS);
COPY temp_h FROM STDIN WITH CSV;
INSERT INTO table_a (time, name, description)
SELECT * FROM temp_h
ON CONFLICT (time)
DO UPDATE SET name = EXCLUDED.name, description = EXCLUDED.description;
DROP TABLE temp_h;
"""

I've managed to accomplish a bulk upsert with the following function (suggestions are welcome):
import io

from sqlalchemy.engine import Engine
from sqlalchemy.ext.declarative import declarative_base

BaseModel = declarative_base()


def upsert_bulk(engine: Engine, model: BaseModel, data: io.StringIO) -> None:
    """
    Fast way to upsert multiple entries at once

    :param `engine`: SQLAlchemy engine bound to the target database
    :param `model`: declarative model of the target table
    :param `data`: CSV in a stream object
    """
    table_name = model.__tablename__
    temp_table_name = f"temp_{table_name}"

    columns = [c.key for c in model.__table__.columns]

    # Select only columns to be updated (in my case, all non-id columns)
    variable_columns = [c for c in columns if c != "id"]

    # Create string with the set of columns to be updated
    update_set = ", ".join([f"{v}=EXCLUDED.{v}" for v in variable_columns])

    # Rewind data and prepare it for `copy_from`
    data.seek(0)

    # Get a raw DBAPI (psycopg2) connection from the engine
    conn = engine.raw_connection()
    with conn.cursor() as cur:
        # Create a temporary empty table with the same columns and types as
        # the final table
        cur.execute(
            f"""
            CREATE TEMPORARY TABLE {temp_table_name} (LIKE {table_name})
            ON COMMIT DROP
            """
        )

        # Copy stream data to the created temporary table in DB
        cur.copy_from(data, temp_table_name)

        # Insert copied data from the temporary table into the final table,
        # updating existing values at each conflict
        cur.execute(
            f"""
            INSERT INTO {table_name}({', '.join(columns)})
            SELECT * FROM {temp_table_name}
            ON CONFLICT (id) DO UPDATE SET {update_set}
            """
        )

        # Drop the temporary table (I believe this step is unnecessary,
        # but table sizes were growing without any new data modifications
        # if this command isn't executed)
        cur.execute(f"DROP TABLE {temp_table_name}")

    # Commit everything through the connection
    conn.commit()
    conn.close()
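A hypothetical usage sketch (the Item model, engine URL and CSV stream below are made up for illustration; the table must exist and have id as its primary key):

import io
from sqlalchemy import create_engine, Column, Integer, Text

engine = create_engine("postgresql+psycopg2://user:password@localhost/mydb")  # placeholder URL

class Item(BaseModel):
    __tablename__ = "items"
    id = Column(Integer, primary_key=True)
    name = Column(Text)

BaseModel.metadata.create_all(engine)  # create the table if it does not exist yet

# copy_from() expects tab-separated rows by default, columns in table order (id, name)
data = io.StringIO("1\tfirst\n2\tsecond\n")
upsert_bulk(engine, Item, data)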

https://www.postgresql.org/docs/current/static/sql-copy.html
There is no COPY ... ON CONFLICT DO statement in Postgres.
https://www.postgresql.org/docs/current/static/sql-insert.html
Only INSERT ... ON CONFLICT DO exists.

Related

UPDATE values of a column in a table with Python in SQL Server

I want to update a column with a SQL Server query in Python; as you can see, I am building the new value for that column as below.
I have a CSV file with some A values of the relevant table:
CSV file: (a.csv)
ART-B-C-ART0015-D-E01
ADC-B-C-ADC00112-V-E01
Python Code: (create Name Value)
ff = pd.read_csv("C:\\a.csv", encoding='cp1252')
ff["Name"] = ff["A"].str.extract(r'([a-zA-Z]{3}\d{4,5})') + "-A"
Result of python Code:
ART0015-A
ADC00112-A
Table :
A Name FamilyName
ART-B-C-ART0015-D-E01 NULL ART
ADC-B-C-ADC00112-V-E01 NULL ADC00112
Also, A is a column in my table (not all of the A records, but some of them), and based on the A value I want to update the Name column.
My database is SQL Server, and I don't know how to update the Name column where the A value in the CSV file is equal to A in the table.
Code in Python:
conn = pyodbc.connect('Driver={SQL Server}; Server=ipaddress; Database=dbname; UID=username; PWD={password};')
cursor = conn.cursor()
for row in ff.itertuples():
    cursor.execute('''UPDATE database.dbo.tablename SET Name where ?''')  # incomplete - this is the part I don't know how to write
conn.commit()
Expected result in table
A Name FamilyName
ART-B-C-ART0015-D-E01 ART0015-A ART
ADC-B-C-ADC00112-V-E01 ADC00112-A ADC00112
I would use a SQL temp table and an inner join to update the values. This works when only updating a subset of records in your SQL table, and it can also be efficient at updating many records. A consolidated sketch follows the individual steps below.
SQL Cursor
# reduce number of calls to server on inserts
cursor.fast_executemany = True
Create Temporary Table
statement = "CREATE TABLE #temp_tablename(A VARCHAR(200), Name VARCHAR(200))"
cursor.execute(statement)
Insert Values into a Temporary Table
# insert only the key and the updated values
subset = ff[['A','Name']]
# form SQL insert statement
columns = ", ".join(subset.columns)
values = '('+', '.join(['?']*len(subset.columns))+')'
# insert
statement = "INSERT INTO #temp_tablename ("+columns+") VALUES "+values
insert = [tuple(x) for x in subset.values]
cursor.executemany(statement, insert)
Update Values in Main Table from Temporary Table
statement = '''
UPDATE
    t
SET
    t.Name = u.Name
FROM
    tablename AS t
INNER JOIN
    #temp_tablename AS u
ON
    u.A = t.A;
'''
cursor.execute(statement)
Drop Temporary Table
cursor.execute("DROP TABLE #temp_tablename")

How to transfer data from one MySQL database to another and map the data to different column names using Python

How can I transfer data from one MySQL database to another? The other database may have different field names, except id, which will act as the primary key.
I have tried using sqlalchemy, but the only data that gets mapped are the field names that are the same in both databases.
import pandas as pd
import sqlalchemy

db1 = sqlalchemy.create_engine("mysql+pymysql://root:@localhost:3306/mydatabase1")
db2 = sqlalchemy.create_engine("mysql+pymysql://root:@localhost:3306/nava")

print('Writing...')
query = '''SELECT * FROM customers1'''
df = pd.read_sql(query, db1)
print(df)
# query1 = '''UPDATE `leap` SET `leap` value '''
df.to_sql('nap', db2, index=False, if_exists='append')
I get an error that the other database doesn't have the same field names, but what I want is that even if the field names change, the data still gets mapped with reference to the primary key id.
This is the program that I asked about in the question above; there was a formatting error, so the code didn't appear correctly.
import pandas as pd
import sqlalchemy

db1 = sqlalchemy.create_engine("mysql+pymysql://root:@localhost:3306/mydatabase1")
db2 = sqlalchemy.create_engine("mysql+pymysql://root:@localhost:3306/nava")

print('Writing...')
query = '''SELECT * FROM customers1'''
df = pd.read_sql(query, db1)
df.to_sql('nap', db2, index=False, if_exists='append')
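There is no accepted answer in this excerpt, but one possible approach, sketched here with made-up column names, is to rename the dataframe columns to the target table's names (keeping id as the shared key) before calling to_sql:

import pandas as pd
import sqlalchemy

db1 = sqlalchemy.create_engine("mysql+pymysql://root:@localhost:3306/mydatabase1")
db2 = sqlalchemy.create_engine("mysql+pymysql://root:@localhost:3306/nava")

df = pd.read_sql("SELECT * FROM customers1", db1)

# Hypothetical mapping from source column names to target column names;
# 'id' keeps its name and acts as the primary key in both tables.
column_map = {"cust_name": "name", "cust_phone": "phone"}
df = df.rename(columns=column_map)

df.to_sql('nap', db2, index=False, if_exists='append')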

Python: create a SQLite3 table with primary key from DataFrame [duplicate]

I have created a sqlite database using pandas df.to_sql; however, accessing it seems considerably slower than just reading in the 500 MB csv file.
I need to:
1) set the primary key for each table using the df.to_sql method
2) tell the sqlite database what datatype each of the columns in my dataframe is
3) can I pass a list like [integer, integer, text, text]?
code.... (format code button not working)
if ext == ".csv":
    df = pd.read_csv("/Users/data/" + filename)
    columns = df.columns
    columns = [i.replace(' ', '_') for i in columns]
    df.columns = columns
    df.to_sql(name, con, flavor='sqlite', schema=None, if_exists='replace', index=True, index_label=None, chunksize=None, dtype=None)
Unfortunately there is no way right now to set a primary key in the pandas df.to_sql() method. Additionally, just to make things more of a pain there is no way to set a primary key on a column in sqlite after a table has been created.
However, a work around at the moment is to create the table in sqlite with the pandas df.to_sql() method. Then you could create a duplicate table and set your primary key followed by copying your data over. Then drop your old table to clean up.
It would be something along the lines of this.
import pandas as pd
import sqlite3

df = pd.read_csv("/Users/data/" + filename)
columns = df.columns
columns = [i.replace(' ', '_') for i in columns]
df.columns = columns

# write the pandas dataframe to a sqlite table
df.to_sql(name, con, flavor='sqlite', schema=None, if_exists='replace', index=True,
          index_label=None, chunksize=None, dtype=None)

# connect to the database
conn = sqlite3.connect('database')
c = conn.cursor()

c.executescript('''
PRAGMA foreign_keys=off;

BEGIN TRANSACTION;
ALTER TABLE table RENAME TO old_table;

/* create a new table with the same column names and types while
   defining a primary key for the desired column */
CREATE TABLE new_table (col_1 TEXT PRIMARY KEY NOT NULL,
                        col_2 TEXT);

INSERT INTO new_table SELECT * FROM old_table;

DROP TABLE old_table;
COMMIT TRANSACTION;

PRAGMA foreign_keys=on;''')

# close out the connection
c.close()
conn.close()
In the past I have done this as I have faced this issue. Just wrapped the whole thing as a function to make it more convenient...
In my limited experience with sqlite I have found that not being able to add a primary key after a table has been created, not being able to perform Update Inserts or UPSERTS, and UPDATE JOIN has caused a lot of frustration and some unconventional workarounds.
Lastly, in the pandas df.to_sql() method there is a dtype keyword argument that can take a dictionary of column names to types, e.g. dtype={'col_1': 'TEXT'}.
Building on Chris Guarino's answer, here are some functions that provide a more general solution. See the example at the bottom for how to use them.
import re

def get_create_table_string(tablename, connection):
    sql = """
    select * from sqlite_master where name = "{}" and type = "table"
    """.format(tablename)
    result = connection.execute(sql)

    create_table_string = result.fetchmany()[0][4]
    return create_table_string

def add_pk_to_create_table_string(create_table_string, colname):
    regex = "(\n.+{}[^,]+)(,)".format(colname)
    return re.sub(regex, "\\1 PRIMARY KEY,", create_table_string, count=1)

def add_pk_to_sqlite_table(tablename, index_column, connection):
    cts = get_create_table_string(tablename, connection)
    cts = add_pk_to_create_table_string(cts, index_column)
    template = """
    BEGIN TRANSACTION;
        ALTER TABLE {tablename} RENAME TO {tablename}_old_;

        {cts};

        INSERT INTO {tablename} SELECT * FROM {tablename}_old_;

        DROP TABLE {tablename}_old_;
    COMMIT TRANSACTION;
    """

    create_and_drop_sql = template.format(tablename=tablename, cts=cts)
    connection.executescript(create_and_drop_sql)

# Example:
# import pandas as pd
# import sqlite3
#
# df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]})
# con = sqlite3.connect("deleteme.db")
#
# df.to_sql("df", con, if_exists="replace")
# add_pk_to_sqlite_table("df", "index", con)
#
# r = con.execute("select sql from sqlite_master where name = 'df' and type = 'table'")
# print(r.fetchone()[0])
There is a gist of this code here
In pandas version 0.15, to_sql() got an argument dtype, which can be used to set both dtype and the primary key attribute for all columns:
import sqlite3
import pandas as pd

df = pd.DataFrame({'MyID': [1, 2, 3], 'Data': [3, 2, 6]})
with sqlite3.connect('foo.db') as con:
    df.to_sql('df', con=con, dtype={'MyID': 'INTEGER PRIMARY KEY',
                                    'Data': 'FLOAT'})
Building on Chris Guarino's answer: it is almost impossible to assign a primary key to an already existing column using the df.to_sql() method, and with your 500 MB CSV file you also cannot create a duplicate table with a huge number of columns.
However, a small workaround is to add a new column as the primary key when writing the dataframe to SQL. It is possible to iterate over the pandas dataframe.columns to create the new table, and during that creation you can add a primary key. With this, a duplicate table is not needed.
I am adding a small code snippet of it.
import pandas as pd
import sqlite3
import sqlalchemy
from sqlalchemy import create_engine

df = pd.read_excel(r'C:\XXX\XXX\XXXX\XXX.xlsx')
X1 = df.iloc[0:, 0:]
dataset = X1.astype('float32')
dataset['date'] = pd.date_range(start='1/1/2020', periods=len(dataset), freq='D')
dataset = dataset.set_index('date')

engine = create_engine('sqlite:///measurement.db')
sqlite_connection = engine.connect()
sqlite_table = "table1"

# create the table manually, adding an autoincrementing primary key column
sqlite_connection.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY AUTOINCREMENT, date TIMESTAMP, " +
                          ",".join(["%s REAL" % x for x in dataset.columns]) + ")")

dataset.to_sql(sqlite_table, sqlite_connection, if_exists='append')
Output database table:
[(0, 'id', 'INTEGER', 0, None, 1),
(1, 'date', 'TIMESTAMP', 0, None, 0),
(2, 'time_stamp', 'REAL', 0, None, 0),
(3, 'column_1', 'REAL', 0, None, 0),
(4, 'column_2', 'REAL', 0, None, 0)]
This method works only if the dataframe has an index. Also, to have the index as a column in our table, it must be explicitly defined in the CREATE TABLE statement (as date is above).
Hope this helps for huge database creations.
In Sqlite, with a normal rowid table, unless the primary key is a single INTEGER column (See ROWIDs and the INTEGER PRIMARY KEY in the documentation), it's equivalent to a UNIQUE index (Because the real PK of a normal table is the rowid).
Notes from the documentation for rowid tables:
The PRIMARY KEY of a rowid table (if there is one) is usually not the true primary key for the table, in the sense that it is not the unique key used by the underlying B-tree storage engine. The exception to this rule is when the rowid table declares an INTEGER PRIMARY KEY. In the exception, the INTEGER PRIMARY KEY becomes an alias for the rowid.
The true primary key for a rowid table (the value that is used as the key to look up rows in the underlying B-tree storage engine) is the rowid.
The PRIMARY KEY constraint for a rowid table (as long as it is not the true primary key or INTEGER PRIMARY KEY) is really the same thing as a UNIQUE constraint. Because it is not a true primary key, columns of the PRIMARY KEY are allowed to be NULL, in violation of all SQL standards.
So you can easily fake a primary key after creating the table with:
CREATE UNIQUE INDEX mytable_fake_pk ON mytable(pk_column)
Besides the NULL thing, you won't get the benefits of an INTEGER PRIMARY KEY if your column is supposed to hold integers, like taking up less space and auto-generating values on insert if left out, but it'll otherwise work for most purposes.
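For example, after writing a dataframe with df.to_sql(), the fake primary key could be added like this (table, column and file names are placeholders):

import sqlite3
import pandas as pd

df = pd.DataFrame({"pk_column": [1, 2, 3], "value": ["a", "b", "c"]})

con = sqlite3.connect("mydata.db")
df.to_sql("mytable", con, index=False, if_exists="replace")

# fake a primary key by adding a unique index on the column
con.execute("CREATE UNIQUE INDEX mytable_fake_pk ON mytable(pk_column)")
con.commit()
con.close()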
There is another option for getting pandas to create a primary key on table creation using some undocumented methods from the pandas internals (at your own risk). You can peruse the code here. The key is the keys param of SQLTable which is not exposed in the to_sql API.
Note that I reset_index and set index=False in the call to SQLTable to prevent a duplicate/unnecessary index from being created in addition to the primary key constraint.
from pandas.io.sql import SQLTable, pandasSQL_builder

df = <your dataframe>
engine = <sqlalchemy engine>

table = SQLTable(
    "my_table",
    pandasSQL_builder(engine, schema="my_schema"),
    frame=df.reset_index(),
    index=False,
    keys=df.index.names,
    if_exists=if_exists,
    schema="my_schema",
)
table.create()  # Will honor your if_exists settings
table.insert(chunksize, method="multi")  # This hits limits in allowed sqlite params if chunks are too large
There is also a get_schema function in that file that can get you a create table statement if you want to do something manually.
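A brief sketch of that manual route; to the best of my knowledge get_schema() accepts a keys argument that marks the primary key in the generated DDL, but treat the exact signature as an assumption:

import sqlite3
import pandas as pd
from pandas.io.sql import get_schema

df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

with sqlite3.connect(":memory:") as con:
    # keys= is assumed to mark the primary key column(s) in the CREATE TABLE statement
    print(get_schema(df, "my_table", keys="id", con=con))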
There's no way to do that. You can only set the primary key directly in the database after you move the data.

Python Script for SQL Server - Update values with MERGE

I have this python function which inserts to a SQL database. The script is such that every time it is rerun it will have to insert the same row over again in addition to new rows. Eventually I will be changing this so that it only inserts new rows but for now I have to work with some sort of update statement.
I'm aware that I can use MERGE in SQL Server to achieve something similar to MySQL's ON DUPLICATE KEY UPDATE, but I'm not exactly sure how it should be used. Any advice is welcome. Thanks!
def sqlInsrt(headers, values):
    # create string input of mylisth
    strheaders = ','.join(str(i) for i in headers)

    # create string ? params for the INSERT clause
    placestr = ','.join(["?" for i in headers])

    # create string ? params for the UPDATE clause
    replacestr = ', '.join(['{}=?'.format(h) for h in headers])

    # Setup and execute SQL query
    insert = ("INSERT INTO " + part + " (" + strheaders + ") VALUES (" + placestr + ")")
    cursor.execute(insert, values)
    cnx.commit()
You should read the docs for MERGE.
Basically:
MERGE INTO TargetTable
USING SourceTable
ON TargetTable.id = SourceTable.id
....
Then you can read the docs about using WHEN NOT MATCHED BY TARGET etc.
So your Python would maybe swap out the table names and joins using params.
I wrote a script that solves the simplest case of merging two identically structured tables, one containing new/updated data. This is useful in incremental data imports. You can expand it depending on your needs (eg. if you need a type 2 SCD):
import pyodbc

def create_merge_query(
    stg_schema: str,
    stg_table: str,
    schema: str,
    table: str,
    primary_key: str,
    con: pyodbc.Connection,
) -> str:
    """
    Create a merge query for the simplest possible upsert scenario:
    - updating and inserting all fields
    - merging on a single column, which has the same name in both tables

    Args:
        stg_schema (str): The schema where the staging table is located.
        stg_table (str): The table with new/updated data.
        schema (str): The schema where the table is located.
        table (str): The table to merge into.
        primary_key (str): The column on which to merge.
        con (pyodbc.Connection): Connection used to read the target table's columns.
    """
    columns_query = f"""
    SELECT
        col.name
    FROM sys.tables AS tab
    INNER JOIN sys.columns AS col
        ON tab.object_id = col.object_id
    WHERE tab.name = '{table}'
    AND schema_name(tab.schema_id) = '{schema}'
    ORDER BY column_id;
    """
    columns_query_result = con.execute(columns_query)
    columns = [tup[0] for tup in columns_query_result]
    columns_stg_fqn = [f"stg.{col}" for col in columns]
    update_pairs = [f"existing.{col} = stg.{col}" for col in columns]

    merge_query = f"""
    MERGE INTO {schema}.{table} existing
    USING {stg_schema}.{stg_table} stg
    ON stg.{primary_key} = existing.{primary_key}
    WHEN MATCHED
        THEN UPDATE SET {", ".join(update_pairs)}
    WHEN NOT MATCHED
        THEN INSERT({", ".join(columns)})
        VALUES({", ".join(columns_stg_fqn)});
    """
    return merge_query
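A hypothetical usage sketch (connection string, schema and table names are placeholders):

import pyodbc

con = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;UID=user;PWD=password")

query = create_merge_query(
    stg_schema="stg",
    stg_table="orders",
    schema="dbo",
    table="orders",
    primary_key="order_id",
    con=con,
)
con.execute(query)
con.commit()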

REPLACE rows in mysql database table with pandas DataFrame

Python Version - 2.7.6
Pandas Version - 0.17.1
MySQLdb Version - 1.2.5
In my database (PRODUCT), I have a table (XML_FEED). The table XML_FEED is huge (millions of records).
I have a pandas.DataFrame() ( PROCESSED_DF ). The dataframe has thousands of rows.
Now I need to run this
REPLACE INTO PRODUCT.XML_FEED
(COL1, COL2, COL3, COL4, COL5)
VALUES (PROCESSED_DF.values)
Question:-
Is there a way to run REPLACE INTO in pandas? I already checked pandas.DataFrame.to_sql(), but that is not what I need. I would prefer not to read the XML_FEED table into pandas because it is very huge.
With the release of pandas 0.24.0, there is now an official way to achieve this by passing a custom insert method to the to_sql function.
I was able to achieve the behavior of REPLACE INTO by passing this callable to to_sql:
def mysql_replace_into(table, conn, keys, data_iter):
    from sqlalchemy.dialects.mysql import insert
    from sqlalchemy.ext.compiler import compiles
    from sqlalchemy.sql.expression import Insert

    @compiles(Insert)
    def replace_string(insert, compiler, **kw):
        s = compiler.visit_insert(insert, **kw)
        s = s.replace("INSERT INTO", "REPLACE INTO")
        return s

    data = [dict(zip(keys, row)) for row in data_iter]

    conn.execute(table.table.insert(replace_string=""), data)
You would pass it like so (with engine being the SQLAlchemy engine to write through):
df.to_sql('XML_FEED', engine, if_exists='append', method=mysql_replace_into)
Alternatively, if you want the behavior of INSERT ... ON DUPLICATE KEY UPDATE ... instead, you can use this:
def mysql_replace_into(table, conn, keys, data_iter):
    from sqlalchemy.dialects.mysql import insert

    data = [dict(zip(keys, row)) for row in data_iter]

    stmt = insert(table.table).values(data)
    update_stmt = stmt.on_duplicate_key_update(**dict(zip(stmt.inserted.keys(),
                                                          stmt.inserted.values())))
    conn.execute(update_stmt)
Credits to https://stackoverflow.com/a/11762400/1919794 for the compile method.
As of this version (0.17.1), I am unable to find any direct way to do this in pandas. I reported a feature request for the same.
I did this in my project by executing some queries using MySQLdb and then using DataFrame.to_sql(if_exists='append').
Suppose
1) product_id is my primary key in table PRODUCT
2) feed_id is my primary key in table XML_FEED.
SIMPLE VERSION
import MySQLdb
import sqlalchemy
import pandas

con = MySQLdb.connect('localhost', 'root', 'my_password', 'database_name')
con_str = 'mysql+mysqldb://root:my_password@localhost/database_name'
engine = sqlalchemy.create_engine(con_str)  # because I am using mysql

df = pandas.read_sql('SELECT * from PRODUCT', con=engine)
df_product_id = df['product_id']
product_id_str = (str(list(df_product_id.values))).strip('[]')
delete_str = 'DELETE FROM XML_FEED WHERE feed_id IN ({0})'.format(product_id_str)

cur = con.cursor()
cur.execute(delete_str)
con.commit()

df.to_sql('XML_FEED', if_exists='append', con=engine)  # you can use flavor='mysql' if you do not want to create a sqlalchemy engine, but it is deprecated
Please note:-
The REPLACE [INTO] syntax allows us to INSERT a row into a table, except that if a UNIQUE KEY (including PRIMARY KEY) violation occurs, the old row is deleted prior to the new INSERT, hence no violation.
I needed a generic solution to this problem, so I built on shiva's answer--maybe it will be helpful to others. This is useful in situations where you grab a table from a MySQL database (whole or filtered), update/add some rows, and want to perform a REPLACE INTO statement with df.to_sql().
It finds the table's primary keys, performs a delete statement on the MySQL table with all keys from the pandas dataframe, and then inserts the dataframe into the MySQL table.
def to_sql_update(df, engine, schema, table):
    df.reset_index(inplace=True)
    sql = ''' SELECT column_name from information_schema.columns
              WHERE table_schema = '{schema}' AND table_name = '{table}' AND
                    COLUMN_KEY = 'PRI';
          '''.format(schema=schema, table=table)
    id_cols = [x[0] for x in engine.execute(sql).fetchall()]
    id_vals = [df[col_name].tolist() for col_name in id_cols]

    sql = ''' DELETE FROM {schema}.{table} WHERE 0 '''.format(schema=schema, table=table)
    for row in zip(*id_vals):
        sql_row = ' AND '.join([''' {}='{}' '''.format(n, v) for n, v in zip(id_cols, row)])
        sql += ' OR ({}) '.format(sql_row)
    engine.execute(sql)

    df.to_sql(table, engine, schema=schema, if_exists='append', index=False)
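A hypothetical call, assuming SQLAlchemy 1.x (where engine.execute() is available) and the XML_FEED table from the question; column names are placeholders:

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('mysql+mysqldb://root:my_password@localhost/database_name')

df = pd.read_sql('SELECT * FROM XML_FEED WHERE feed_id IN (1, 2, 3)', con=engine)
df['col1'] = 'updated value'   # modify some rows (col1 is a made-up column name)
df = df.set_index('feed_id')   # the function calls reset_index(), so keep the key in the index

to_sql_update(df, engine, 'database_name', 'XML_FEED')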
If you use to_sql you should be able to define it so that you replace values if they exist, so for a table named 'mydb' and a dataframe named 'df', you'd use:
df.to_sql('mydb', engine, if_exists='replace')
That should replace values if they already exist, but I am not 100% sure if that's what you're looking for.
