I have this Python function which inserts into a SQL database. Every time the script is rerun it has to insert the same rows again in addition to new rows. Eventually I will change this so that it only inserts new rows, but for now I have to work with some sort of update statement.
I'm aware that I can use MERGE in SQL Server to achieve something similar to MySQL's ON DUPLICATE KEY UPDATE, but I'm not exactly sure how it should be used. Any advice is welcome. Thanks!
def sqlInsrt(headers, values):
    # `part` (the table name), `cursor` and `cnx` are defined elsewhere in the script
    # Comma-separated column names
    strheaders = ','.join(str(i) for i in headers)
    # One ? placeholder per column for the INSERT clause
    placestr = ','.join('?' for _ in headers)
    # col=? pairs for the UPDATE clause (not used yet)
    replacestr = ', '.join('{}=?'.format(h) for h in headers)
    # Set up and execute the SQL query
    insert = "INSERT INTO " + part + " (" + strheaders + ") VALUES (" + placestr + ")"
    cursor.execute(insert, values)
    cnx.commit()
You should read the docs for MERGE.
Basically:
MERGE INTO TargetTable
USING SourceTable
ON TargetTable.id = SourceTable.id
....
Then you can read the docs about WHEN NOT MATCHED BY TARGET and the other clauses.
So your Python would maybe swap in the table names and join columns, and pass the values as params.
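As a rough sketch of how the question's function could build such a statement: the helper below is hypothetical (the name `build_merge_sql` and the example table `parts` are not from the question) and assumes a single key column, at least one non-key column, and table and column names that come from trusted code, since identifiers cannot be bound as `?` parameters.

```python
def build_merge_sql(table, headers, key):
    """Return a T-SQL MERGE statement that upserts one row of ? parameters."""
    cols = ", ".join(headers)
    placeholders = ", ".join("?" for _ in headers)
    # Every column except the key is overwritten when the row already exists.
    updates = ", ".join(f"{h} = source.{h}" for h in headers if h != key)
    source_cols = ", ".join(f"source.{h}" for h in headers)
    return (
        f"MERGE INTO {table} AS target "
        f"USING (VALUES ({placeholders})) AS source ({cols}) "
        f"ON target.{key} = source.{key} "
        f"WHEN MATCHED THEN UPDATE SET {updates} "
        f"WHEN NOT MATCHED BY TARGET THEN INSERT ({cols}) VALUES ({source_cols});"
    )


# Hypothetical table and columns, for illustration only.
sql = build_merge_sql("parts", ["id", "name", "qty"], "id")
print(sql)
```

With a pyodbc cursor the result would be run the same way as the original INSERT, i.e. `cursor.execute(sql, values)`, with one value per header.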
I wrote a script that solves the simplest case of merging two identically structured tables, one containing new/updated data. This is useful in incremental data imports. You can expand it depending on your needs (eg. if you need a type 2 SCD):
import pyodbc
def create_merge_query(
    stg_schema: str,
    stg_table: str,
    schema: str,
    table: str,
    primary_key: str,
    con: pyodbc.Connection,
) -> str:
    """
    Create a merge query for the simplest possible upsert scenario:
    - updating and inserting all fields
    - merging on a single column, which has the same name in both tables
    Args:
        stg_schema (str): The schema where the staging table is located.
        stg_table (str): The table with new/updated data.
        schema (str): The schema where the table is located.
        table (str): The table to merge into.
        primary_key (str): The column on which to merge.
        con (pyodbc.Connection): Connection used to look up the table's columns.
    """
    # Look up the target table's column names, in column order
    columns_query = f"""
    SELECT
        col.name
    FROM sys.tables AS tab
    INNER JOIN sys.columns AS col
        ON tab.object_id = col.object_id
    WHERE tab.name = '{table}'
    AND schema_name(tab.schema_id) = '{schema}'
    ORDER BY column_id;
    """
    columns_query_result = con.execute(columns_query)
    columns = [tup[0] for tup in columns_query_result]
    columns_stg_fqn = [f"stg.{col}" for col in columns]
    update_pairs = [f"existing.{col} = stg.{col}" for col in columns]
    merge_query = f"""
    MERGE INTO {schema}.{table} existing
    USING {stg_schema}.{stg_table} stg
    ON stg.{primary_key} = existing.{primary_key}
    WHEN MATCHED
        THEN UPDATE SET {", ".join(update_pairs)}
    WHEN NOT MATCHED
        THEN INSERT({", ".join(columns)})
        VALUES({", ".join(columns_stg_fqn)});
    """
    return merge_query
I have a GUI interacting with my database, and the MySQL database has around 50 tables. I need to search each table for a value and return the field and key of the item in each table if it is found. I would like to search for partial matches (e.g. for the search value "test", both "Protest" and "Test123" would be matches). Here is my attempt.
def searchdatabase(self, event):
    print('Searching...')
    self.connect_mysql()  # Function to connect to database
    d_tables = []
    results_list = []  # I will store results here
    s_string = "test"  # Value I am searching
    self.cursor.execute("USE db")  # select the database
    self.cursor.execute("SHOW TABLES")
    for (table_name,) in self.cursor:
        d_tables.append(table_name)
    # Loop through tables list, get column name, and check if value is in the column
    for table in d_tables:
        # Get the columns
        self.cursor.execute(f"SELECT * FROM `{table}` WHERE 1=0")
        field_names = [i[0] for i in self.cursor.description]
        # Find Value
        for f_name in field_names:
            print("RESULTS:", self.cursor.execute(f"SELECT * FROM `{table}` WHERE {f_name} LIKE {s_string}"))
        print(table)
I get an error on print("RESULTS:", self.cursor.execute(f"SELECT * FROM `{table}` WHERE {f_name} LIKE {s_string}"))
Exception: (1054, "Unknown column 'test' in 'where clause'")
I use a similar insert query that works fine so I am not understanding what the issue is.
ex. insert_query = (f"INSERT INTO `{source_tbl}` ({query_columns}) VALUES ({query_placeholders})")
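It may be because of the single quotes you have missed around the search value: without them, MySQL reads test as a column name. Quote the value (not the column name, which takes backticks) and add % wildcards for a partial match.
TRY:
print("RESULTS:", self.cursor.execute(f"SELECT * FROM `{table}` WHERE `{f_name}` LIKE '%{s_string}%'"))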
Don’t insert user-provided data into SQL queries like this. It is begging for SQL injection attacks. Your database library will have a way of sending parameters to queries. Use that.
The whole design is fishy. Normally, there should be no need to look for a string across several columns of 50 different tables. Admittedly, sometimes you end up in these situations because of reasons outside your control.
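A minimal sketch of the parameterized version, using the standard-library sqlite3 module as a stand-in so it runs without a server. The function name `search_all_tables` and the sample table are made up for illustration. With a MySQL driver the placeholder is typically `%s` instead of `?`, the table list would come from SHOW TABLES, and identifiers would be quoted with backticks.

```python
import sqlite3


def search_all_tables(conn, needle):
    """Return (table, column, row) for every partial match of `needle`."""
    hits = []
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        # Identifiers cannot be bound as parameters, so they are quoted here;
        # they come from the database catalog, not from the user.
        cur = conn.execute(f'SELECT * FROM "{table}" WHERE 1=0')
        for column in [d[0] for d in cur.description]:
            query = f'SELECT * FROM "{table}" WHERE "{column}" LIKE ?'
            # The search value is bound as a parameter, wrapped in %
            # wildcards so that it matches anywhere in the field.
            for row in conn.execute(query, (f"%{needle}%",)):
                hits.append((table, column, row))
    return hits


# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER, body TEXT)")
conn.executemany("INSERT INTO notes VALUES (?, ?)",
                 [(1, "Protest"), (2, "Test123"), (3, "other")])
hits = search_all_tables(conn, "test")
```

Because the value never becomes part of the SQL text, quoting mistakes like the one behind the 1054 error cannot occur, and neither can injection through the search string.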
I want to do
" on conflict (time) do update set name , description "
but I have no idea how to do it when loading a CSV through STDIN: I don't know what name and description should be set to.
table_a:
xxx.csv:
with open('xxx/xxx.csv', 'r', encoding='utf8') as f:
sql = """
COPY table_a FROM STDIN With CSV on conflict (time)
do update set name=??, description=??;
"""
cur.copy_expert(sql, f)
conn.commit()
In this SO post, there are two answers that, combined, provide a nice solution for successfully using ON CONFLICT. The example below uses ON CONFLICT DO NOTHING:
BEGIN;
CREATE TEMP TABLE tmp_table
(LIKE main_table INCLUDING DEFAULTS)
ON COMMIT DROP;
COPY tmp_table FROM 'full/file/name/here';
INSERT INTO main_table
SELECT *
FROM tmp_table
ON CONFLICT DO NOTHING;
COMMIT;
Replace both instances of main_table with the name of your table.
Thanks, everyone, for the solutions.
This is my solution:
sql = """
CREATE TABLE temp_h (
time ,
name,
description
);
COPY temp_h FROM STDIN With CSV;
INSERT INTO table_a(time, name, description)
SELECT *
FROM temp_h ON conflict (time)
DO update set name=EXCLUDED.name, description=EXCLUDED.description;
DROP TABLE temp_h;
"""
I've managed to accomplish a bulk upsert with the following function (suggestions are welcome):
import io
from sqlalchemy.engine import Engine
# In SQLAlchemy 1.4+ declarative_base is also available from sqlalchemy.orm
from sqlalchemy.ext.declarative import declarative_base
BaseModel = declarative_base()
def upsert_bulk(engine: Engine, model: BaseModel, data: io.StringIO) -> None:
    """
    Fast way to upsert multiple entries at once
    :param `engine`: SQLAlchemy engine
    :param `model`: declarative model of the target table
    :param `data`: CSV in a stream object
    """
    table_name = model.__tablename__
    temp_table_name = f"temp_{table_name}"
    columns = [c.key for c in model.__table__.columns]
    # Select only columns to be updated (in my case, all non-id columns)
    variable_columns = [c for c in columns if c != "id"]
    # Create string with set of columns to be updated
    update_set = ", ".join([f"{v}=EXCLUDED.{v}" for v in variable_columns])
    # Rewind data and prepare it for `copy_from`
    data.seek(0)
    # Raw DBAPI (psycopg2) connection, needed for `copy_from`
    conn = engine.raw_connection()
    with conn.cursor() as cur:
        # Creates temporary empty table with same columns and types as
        # the final table
        cur.execute(
            f"""
            CREATE TEMPORARY TABLE {temp_table_name} (LIKE {table_name})
            ON COMMIT DROP
            """
        )
        # Copy stream data to the created temporary table in DB
        # (copy_from expects tab-separated input unless `sep` is passed)
        cur.copy_from(data, temp_table_name)
        # Inserts copied data from the temporary table to the final table
        # updating existing values at each new conflict
        cur.execute(
            f"""
            INSERT INTO {table_name}({', '.join(columns)})
            SELECT * FROM {temp_table_name}
            ON CONFLICT (id) DO UPDATE SET {update_set}
            """
        )
        # Drops temporary table (I believe this step is unnecessary,
        # but table sizes were growing without any new data modifications
        # if this command isn't executed)
        cur.execute(f"DROP TABLE {temp_table_name}")
    # Commit everything and release the connection
    conn.commit()
    conn.close()
There is no COPY ... ON CONFLICT DO statement in Postgres:
https://www.postgresql.org/docs/current/static/sql-copy.html
Only INSERT ... ON CONFLICT DO exists:
https://www.postgresql.org/docs/current/static/sql-insert.html
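Since COPY cannot upsert by itself, the usual workaround is the staging-table pattern from the answers above. Here is a small sketch that only builds the three statements. The helper name `build_staging_upsert` and the default staging table name are assumptions, and the table and column names are expected to come from trusted code.

```python
def build_staging_upsert(table, columns, conflict_col, staging="tmp_staging"):
    """Return the CREATE, COPY and INSERT statements for a COPY-based upsert."""
    cols = ", ".join(columns)
    # Overwrite every column except the one the conflict is detected on.
    updates = ", ".join(
        f"{c} = EXCLUDED.{c}" for c in columns if c != conflict_col
    )
    create_sql = (
        f"CREATE TEMP TABLE {staging} "
        f"(LIKE {table} INCLUDING DEFAULTS) ON COMMIT DROP"
    )
    copy_sql = f"COPY {staging} ({cols}) FROM STDIN WITH CSV"
    insert_sql = (
        f"INSERT INTO {table} ({cols}) SELECT {cols} FROM {staging} "
        f"ON CONFLICT ({conflict_col}) DO UPDATE SET {updates}"
    )
    return create_sql, copy_sql, insert_sql


create_sql, copy_sql, insert_sql = build_staging_upsert(
    "table_a", ["time", "name", "description"], "time"
)
```

With psycopg2, `create_sql` and `insert_sql` would go through `cur.execute(...)` and `copy_sql` through `cur.copy_expert(copy_sql, f)`, all inside one transaction. Note that Postgres rejects the INSERT if the CSV itself contains the same conflict key twice, so the file may need de-duplicating first.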
I've been querying a few APIs with Python to individually create CSVs for a table.
Instead of recreating the table each time, I would like to update the existing table with any new API data.
At the moment the way the Query is working, I have a table that looks like this,
From this I am taking the suburbs of each state and copying them into a csv for each different state.
Then using this script I am cleaning them into a list (the API needs "%20" in place of any spaces),
#suburbs = ["want this", "want this (meh)", "this as well (nope)"]
suburb_cleaned = []
#dont_want = frozenset( ["(meh)", "(nope)"] )
for urb in suburbs:
    cleaned_name = []
    name_parts = urb.split()
    for part in name_parts:
        if part in dont_want:
            continue
        cleaned_name.append(part)
    suburb_cleaned.append('%20'.join(cleaned_name))
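For comparison, a shorter sketch of the same cleaning step that lets the standard library do the URL encoding. The function name `clean_suburbs` is made up here, and the sample values are the ones from the commented-out lines above.

```python
from urllib.parse import quote


def clean_suburbs(suburbs, dont_want):
    """Drop unwanted words and percent-encode each remaining suburb name."""
    cleaned = []
    for suburb in suburbs:
        parts = [p for p in suburb.split() if p not in dont_want]
        # quote() encodes spaces as %20 (and escapes other unsafe characters)
        cleaned.append(quote(" ".join(parts)))
    return cleaned


suburbs = ["want this", "want this (meh)", "this as well (nope)"]
dont_want = frozenset(["(meh)", "(nope)"])
suburb_cleaned = clean_suburbs(suburbs, dont_want)
print(suburb_cleaned)
```

If the API call goes through `requests.get(url, params=...)`, the library encodes query parameters on its own and this step may not be needed at all.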
Then taking the suburbs for each state and putting them into this API to return a csv,
timestr = time.strftime("%Y%m%d-%H%M%S")
Name = "price_data_NT"+timestr+".csv"
url_price = "http://mwap.com/api"
string = 'gxg&state='
api_results = {}
n = 0
y = 2
for urbs in suburb_cleaned:
url = url_price + urbs + string + "NT"
print(url)
print(urbs)
request = requests.get(url)
api_results[urbs] = pd.DataFrame(request.json())
n = n+1
if n == y:
dfs = pd.concat(api_results).reset_index(level=1, drop=True).rename_axis(
'key').reset_index().set_index(['key'])
dfs.to_csv(Name, sep='\t', encoding='utf-8')
y = y+2
continue
print("made it through"+urbs)
# print(request.json())
# print(api_results)
dfs = pd.concat(api_results).reset_index(level=1, drop=True).rename_axis(
'key').reset_index().set_index(['key'])
dfs.to_csv(Name, sep='\t', encoding='utf-8')
Then adding the states manually in excel, and combining and cleaning the suburb names.
# use pd.concat
df = pd.concat([act, vic,nsw,SA,QLD,WA]).reset_index().set_index(['key']).rename_axis('suburb').reset_index().set_index(['state'])
# apply lambda to clean the %20
f = lambda s: s.replace('%20', ' ')
df['suburb'] = df['suburb'].apply(f)
and then finally inserting it into a db
engine = create_engine('mysql://username:password@localhost/dbname')
with engine.connect() as conn, conn.begin():
df.to_sql('Price_historic', conn, if_exists='replace',index=False)
Leading to this sort of output
Now, this is a heck of a process. I would love to simplify it and make the database only update the values that are needed from the API, without this much complexity in getting the data.
I would love some helpful tips on achieving this goal. I'm thinking I could do an update on the MySQL database instead of an insert, or something similar? And with the querying of the API, I feel like I'm overcomplicating it.
Thanks!
I don't see any reason why you would be creating CSV files in this process. It sounds like you can just query the data and then load it into a MySql table directly. You say that you are adding the states manually in excel? Is that data not available through your prior api calls? If not, could you find that information and save it to a CSV, so you can automate that step by loading it into a table and having python look up the values for you?
Generally, you wouldn't want to overwrite the mysql table every time. When you have a table, you can identify the column or columns that uniquely identify a specific record, then create a UNIQUE INDEX for them. For example if your street and price values designate a unique entry, then in mysql you could run:
ALTER TABLE `Price_historic` ADD UNIQUE INDEX(street, price);
After this, your table will not allow duplicate records based on those values. Then, instead of creating a new table every time, you can insert your data into the existing table, with instructions to either update or ignore when you encounter a duplicate. For example:
final_str = "INSERT INTO Price_historic (state, suburb, property_price_id, type, street, price, date) " \
            "VALUES (%s, %s, %s, %s, %s, %s, %s) " \
            "ON DUPLICATE KEY UPDATE " \
            "state = VALUES(state), date = VALUES(date)"
con = pdb.connect(db_host, db_user, db_pass, db_name)
with con:
    cur = con.cursor()
    # insert_list holds one 7-item tuple per row, in the column order above
    cur.executemany(final_str, insert_list)
    con.commit()
If this setup is meant for the longer term, I would suggest running two different processes in parallel.
Process 1:
Query API 1, obtain the required data and insert it into the DB table, with a binary/bit flag specifying that only API 1 has been called.
Process 2:
Query the DB for all records that still need API call 2, based on the binary/bit flag set in process 1. For those records, run call 2 and update the data back to the DB table based on the primary key.
Database: I would suggest adding a primary key as well as a [Bit Flag][1] that records the status of the different API calls. The bit flag also helps you to:
- double-check whether a specific API call has been made for a specific record.
- expand your project to additional API calls while still tracking the status of each API call at record level.
[1]: Bit Flags: https://docs.oracle.com/cd/B28359_01/server.111/b28286/functions014.htm#SQLRF00612
I am attempting to query a subset of a MySql database table, feed the results into a Pandas DataFrame, alter some data, and then write the updated rows back to the same table. My table size is ~1MM rows, and the number of rows I will be altering will be relatively small (<50,000) so bringing back the entire table and performing a df.to_sql(tablename,engine, if_exists='replace') isn't a viable option. Is there a straightforward way to UPDATE the rows that have been altered without iterating over every row in the DataFrame?
I am aware of this project, which attempts to simulate an "upsert" workflow, but it seems it only accomplishes the task of inserting new non-duplicate rows rather than updating parts of existing rows:
GitHub Pandas-to_sql-upsert
Here is a skeleton of what I'm attempting to accomplish on a much larger scale:
import pandas as pd
from sqlalchemy import create_engine
import threading
#Get sample data
d = {'A' : [1, 2, 3, 4], 'B' : [4, 3, 2, 1]}
df = pd.DataFrame(d)
engine = create_engine(SQLALCHEMY_DATABASE_URI)
#Create a table with a unique constraint on A.
engine.execute("""DROP TABLE IF EXISTS test_upsert """)
engine.execute("""CREATE TABLE test_upsert (
A INTEGER,
B INTEGER,
PRIMARY KEY (A))
""")
#Insert data using pandas.to_sql
df.to_sql('test_upsert', engine, if_exists='append', index=False)
#Read the table back into a DataFrame
df_in_db = pd.read_sql_table('test_upsert', engine)
#Alter row where 'A' == 2
df_in_db.loc[df_in_db['A'] == 2, 'B'] = 6
Now I would like to write df_in_db back to my 'test_upsert' table with the updated data reflected.
This SO question is very similar, and one of the comments recommends using an "sqlalchemy table class" to perform the task.
Update table using sqlalchemy table class
Can anyone expand on how I would implement this for my specific case above if that is the best (only?) way to implement it?
I think the easiest way would be to:
first delete those rows that are going to be "upserted". This can be done in a loop, but it's not very efficient for bigger data sets (5K+ rows), so i'd save this slice of the DF into a temporary MySQL table:
# assuming we have already changed values in the rows and saved those changed rows in a separate DF: `x`
x = df[mask]  # `mask` should help us to find changed rows...
# make sure `x` DF has a Primary Key column as index
x = x.set_index('a')
# dump a slice with changed rows to temporary MySQL table
x.to_sql('my_tmp', engine, if_exists='replace', index=True)
conn = engine.connect()
trans = conn.begin()
try:
    # delete those rows that we are going to "upsert"
    conn.execute('delete from test_upsert where a in (select a from my_tmp)')
    # insert changed rows on the same connection, so both steps commit together
    x.to_sql('test_upsert', conn, if_exists='append', index=True)
    trans.commit()
except:
    trans.rollback()
    raise
PS i didn't test this code so it might have some small bugs, but it should give you an idea...
A MySQL-specific solution using the "method" argument of pandas' to_sql and SQLAlchemy's MySQL insert on_duplicate_key_update feature:
import sqlalchemy as db
import sqlalchemy.dialects.mysql  # makes db.dialects.mysql available
def create_method(meta):
    def method(table, conn, keys, data_iter):
        # Reflect the target table, then build one multi-row INSERT
        sql_table = db.Table(table.name, meta, autoload=True)
        insert_stmt = db.dialects.mysql.insert(sql_table).values([dict(zip(keys, data)) for data in data_iter])
        # On a duplicate key, overwrite every column with the inserted value
        upsert_stmt = insert_stmt.on_duplicate_key_update({x.name: x for x in insert_stmt.inserted})
        conn.execute(upsert_stmt)
    return method
engine = db.create_engine(...)
conn = engine.connect()
with conn.begin():
    meta = db.MetaData(conn)
    method = create_method(meta)
    df.to_sql(table_name, conn, if_exists='append', method=method)
Here is a general function that will update each row (but all values in the row simultaneously)
def update_table_from_df(df, table, where):
    '''Will take a dataframe and update each specified row in the SQL table
    with the DF values -- DF columns MUST match SQL columns
    WHERE statement should be triple-quoted string
    Will not update any columns contained in the WHERE statement'''
    for idx, row in df.iterrows():
        assignments = []
        for col in df.columns:
            if col == 'datetime' or col in where:
                continue
            # Quote string values; emit everything else as-is
            if isinstance(row[col], str):
                assignments.append(f"{col} = '{row[col]}'")
            else:
                assignments.append(f"{col} = {row[col]}")
        upstr = f"UPDATE {table} set {', '.join(assignments)} {where}"
        cursor.execute(upstr)
    cursor.commit()
I was struggling with this before and now I've found a way.
Basically, create a separate DataFrame that holds only the rows you need to update.
df  # DataFrame containing the rows to update
s_update = ""  # string of UPDATE statements
# Loop through the data frame
for i in range(len(df)):
    s_update += "update your_table_name set column_name = '%s' where column_name = '%s';" % (df[col_name1][i], df[col_name2][i])
Now pass s_update to cursor.execute or engine.execute (wherever you execute SQL query)
This will update your data instantly.
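A variation on the same idea that binds the values as parameters instead of pasting them into one long string. This is only a sketch: it uses the standard-library sqlite3 module and the small test_upsert table from the question so that it runs on its own, and the `changed` list stands in for whatever rows the DataFrame holds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_upsert (A INTEGER PRIMARY KEY, B INTEGER)")
conn.executemany("INSERT INTO test_upsert VALUES (?, ?)",
                 [(1, 4), (2, 3), (3, 2), (4, 1)])

# Changed rows as (new B, key A) pairs. With pandas these could be produced
# by something like df[["B", "A"]].itertuples(index=False, name=None).
changed = [(6, 2), (9, 4)]

# One parameterized UPDATE, executed once per changed row.
conn.executemany("UPDATE test_upsert SET B = ? WHERE A = ?", changed)
conn.commit()

rows = conn.execute("SELECT A, B FROM test_upsert ORDER BY A").fetchall()
print(rows)
```

Only the changed rows are sent to the database, and values containing quotes cannot break the statement. With a MySQL driver the placeholders would typically be `%s`.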
Normally, if I want to insert values into a table, I will do something like this (assuming that I know which columns the values belong to):
conn = sqlite3.connect('mydatabase.db')
conn.execute("INSERT INTO MYTABLE (ID,COLUMN1,COLUMN2)\
VALUES(?,?,?)",[myid,value1,value2])
But now I have a list of columns (the length of the list may vary) and a list of values, one for each column in the list.
For example, say I have a table with 10 columns (namely column1, column2, ..., column10). I have a list of columns that I want to update, let's say [column3, column4], and a list of values for those columns: [value for column3, value for column4].
How do I insert the values in the list into the columns they belong to?
As far as I know the parameter list in conn.execute works only for values, so we have to use string formatting like this:
import sqlite3
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (a integer, b integer, c integer)')
col_names = ['a', 'b', 'c']
values = [0, 1, 2]
conn.execute('INSERT INTO t (%s, %s, %s) values(?,?,?)'%tuple(col_names), values)
Please note that this is a risky approach, since strings interpolated into SQL should always be checked for injection attacks. However, you could pass the list of column names through some injection-checking function before insertion.
EDITED:
For lists of varying length you could try something like
exec_text = 'INSERT INTO t (' + ','.join(col_names) + ') values(' + ','.join(['?'] * len(values)) + ')'
conn.execute(exec_text, values)
# as long as len(col_names) == len(values)
Of course string formatting will work, you just need to be a bit cleverer about it.
col_names = ','.join(col_list)
col_spaces = ','.join(['?'] * len(col_list))
sql = 'INSERT INTO t (%s) values(%s)' % (col_names, col_spaces)
conn.execute(sql, values)
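One way to do the injection check mentioned above is to accept only column names that really exist in the table. A minimal sketch follows; the helper name `insert_row` is made up, and the table name itself is assumed to come from trusted code.

```python
import sqlite3


def insert_row(conn, table, col_list, values):
    """Insert `values` into `col_list`, rejecting unknown column names."""
    # Column names cannot be bound with ?, so compare them against the
    # table's real columns (field 1 of each PRAGMA table_info row).
    known = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
    unknown = [c for c in col_list if c not in known]
    if unknown:
        raise ValueError(f"unknown columns: {unknown}")
    if len(col_list) != len(values):
        raise ValueError("need exactly one value per column")
    col_names = ", ".join(f'"{c}"' for c in col_list)
    col_spaces = ", ".join("?" for _ in col_list)
    sql = f'INSERT INTO "{table}" ({col_names}) VALUES ({col_spaces})'
    conn.execute(sql, values)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a integer, b integer, c integer)")
insert_row(conn, "t", ["b", "c"], [1, 2])
result = conn.execute("SELECT a, b, c FROM t").fetchall()
print(result)
```

Columns that are not listed are simply left at their default, which is NULL here.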
I was looking for a solution to create columns based on a list of unknown / variable length and found this question. However, I managed to find a nicer solution (for me anyway), that's also a bit more modern, so thought I'd include it in case it helps someone:
import sqlite3
def create_sql_db(my_list):
    file = 'my_sql.db'
    table_name = 'table_1'
    init_col = 'id'
    col_type = 'TEXT'
    conn = sqlite3.connect(file)
    c = conn.cursor()
    # CREATE TABLE (IF IT DOESN'T ALREADY EXIST)
    c.execute('CREATE TABLE IF NOT EXISTS {tn} ({nf} {ft})'.format(
        tn=table_name, nf=init_col, ft=col_type))
    # CREATE A COLUMN FOR EACH ITEM IN THE LIST
    for new_column in my_list:
        c.execute('ALTER TABLE {tn} ADD COLUMN "{cn}" {ct}'.format(
            tn=table_name, cn=new_column, ct=col_type))
    conn.close()
my_list = ["Col1", "Col2", "Col3"]
create_sql_db(my_list)
All my data is of the type text, so I just have a single variable "col_type" - but you could for example feed in a list of tuples (or a tuple of tuples, if that's what you're into):
my_other_list = [("ColA", "TEXT"), ("ColB", "INTEGER"), ("ColC", "BLOB")]
and change the CREATE A COLUMN step to:
for tupl in my_other_list:
    new_column = tupl[0]  # "ColA", "ColB", "ColC"
    col_type = tupl[1]  # "TEXT", "INTEGER", "BLOB"
    c.execute('ALTER TABLE {tn} ADD COLUMN "{cn}" {ct}'.format(
        tn=table_name, cn=new_column, ct=col_type))
As a noob, I can't comment on the very succinct, updated solution @ron_g offered. While testing, though, I had to frequently delete the sample database itself, so for any other noobs using this to test, I would advise adding in:
c.execute('DROP TABLE IF EXISTS {tn}'.format(
tn=table_name))
Prior to the 'CREATE TABLE ...' portion.
It appears there are multiple instances of
.format(
tn=table_name ....)
in both the 'CREATE TABLE ...' and 'ALTER TABLE ...' statements, so I am trying to figure out whether it's possible to define it once (similar to, or as part of, the def section).
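One possible way to avoid repeating the table name is a small inner helper that always fills in `tn`. This is a sketch under assumptions: the inner function `run` is hypothetical, and the connection is passed in (an in-memory database here) rather than opened inside the function as in the original.

```python
import sqlite3


def create_sql_db(conn, table_name, my_list, col_type="TEXT"):
    """Recreate the table and add one column per item in `my_list`."""
    def run(template, **fields):
        # The table name is filled in here, once, for every statement.
        conn.execute(template.format(tn=table_name, **fields))

    run('DROP TABLE IF EXISTS "{tn}"')
    run('CREATE TABLE "{tn}" (id {ct})', ct=col_type)
    for new_column in my_list:
        run('ALTER TABLE "{tn}" ADD COLUMN "{cn}" {ct}',
            cn=new_column, ct=col_type)


conn = sqlite3.connect(":memory:")
create_sql_db(conn, "table_1", ["Col1", "Col2", "Col3"])
column_names = [row[1] for row in conn.execute('PRAGMA table_info("table_1")')]
print(column_names)
```

Each statement then only names the fields that differ, and changing the table name touches a single argument.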