Pandas / sqlite3: Change part of pandas dataframe and replace in sqlite database - python

Experts,
I am struggling to find an efficient way to work with pandas and sqlite.
I am building a tool that lets users
- extract part of an SQL database (sub_table) based on some filters
- change part of the sub_table
- upload the changed sub_table back to the overall SQL table, replacing the old values
Users will only see Excel data (so I need to write back and forth to Excel, which is not part of my example as it is out of scope).
Users can
- replace existing rows (entries) with new data
- delete existing rows
- add new rows
Question: how can I most efficiently do this "replace/delete/add" using Pandas / sqlite3?
Here is my example code. If I use df_sub.to_sql("MyTable", con = conn, index = False, if_exists="replace") at the bottom, then obviously the entire table is replaced... so there must be another way I cannot think of.
import pandas as pd
import sqlite3
import numpy as np
#### SETTING EXAMPLE UP
### Create DataFrame
data = dict({"City": ["London", "Frankfurt", "Berlin", "Paris", "Brondby"],
             "Population": [8, 2, 4, 9, 0.5]})
df = pd.DataFrame(data,index = pd.Index(np.arange(5)))
### Create SQL DataBase
conn = sqlite3.connect("MyDB.db")
### Upload DataFrame as Table into SQL Database
df.to_sql("MyTable", con = conn, index = False, if_exists="replace")
### Read DataFrame from SQL DB
query = "SELECT * from MyTable"
pd.read_sql_query(query, con = conn)
#### CREATE SUB_TABLE AND AMEND
#### EXTRACT sub_table FROM SQL TABLE
query = "SELECT * from MyTable WHERE Population > 2"
df_sub = pd.read_sql_query(query, con = conn)
df_sub
#### Amend Sub DF
df_sub[df_sub["City"] == "London"] = ["Brussel",4]
df_sub
#### Replace new data in SQL DB
df_sub.to_sql("MyTable", con = conn, index = False, if_exists="replace")
query = "SELECT * from MyTable"
pd.read_sql_query(query, con = conn)
Thanks for your help!
Note: I did try to achieve this via pure SQL queries but gave up. As I am not an expert on SQL, I would prefer to go with pandas if a solution exists. If not, a hint on how to achieve this via SQL would be great!

I think there is no way around using SQL queries for this task.
With pandas it is only possible to read a query into a DataFrame and to write a DataFrame to a database (replace or append).
If you want to update specific values/rows or delete rows, you have to use SQL queries.
Commands you should look into are, for example:
UPDATE, REPLACE, INSERT, DELETE
# Update the database, change City to 'Brussel' and Population to 4, for the first row
# (Attention! python indices start at 0, SQL indices at 1)
cur = conn.cursor()
cur.execute('UPDATE MyTable SET City=?, Population=? WHERE ROWID=?', ('Brussel', 4, 1))
conn.commit()
conn.close()
# Display the changes
conn = sqlite3.connect("MyDB.db")
query = "SELECT * from MyTable"
pd.read_sql_query(query, con=conn)
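The same pattern covers the delete and add cases from the question. Here is a minimal sketch (not part of the original answer) assuming the MyTable schema above; the city values are made up purely for illustration:
# Delete rows the user removed from the sub_table
cur = conn.cursor()
cur.executemany('DELETE FROM MyTable WHERE City=?', [('Paris',)])
# Insert rows the user added to the sub_table
new_rows = [('Madrid', 6.7), ('Rome', 2.8)]
cur.executemany('INSERT INTO MyTable (City, Population) VALUES (?, ?)', new_rows)
conn.commit()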
For more examples on sql and pandas you can look at
https://www.dataquest.io/blog/python-pandas-databases/

Related

More Efficient Way To Insert Dataframe into SQL Server

I am trying to update a SQL table with updated information which is in a dataframe in pandas.
I have about 100,000 rows to iterate through and it's taking a long time. Is there any way I can make this code more efficient? Do I even need to truncate the data? Most rows will probably be the same.
conn = pyodbc.connect ("Driver={xxx};"
"Server=xxx;"
"Database=xxx;"
"Trusted_Connection=yes;")
cursor = conn.cursor()
cursor.execute('TRUNCATE dbo.Sheet1$')
for index, row in df_union.iterrows():
    print(row)
    cursor.execute("INSERT INTO dbo.Sheet1$ (Vendor, Plant) values(?,?)", row.Vendor, row.Plant)
Update: This is what I ended up doing.
params = urllib.parse.quote_plus(r'DRIVER={xxx};SERVER=xxx;DATABASE=xxx;Trusted_Connection=yes')
conn_str = 'mssql+pyodbc:///?odbc_connect={}'.format(params)
engine = create_engine(conn_str)
df = pd.read_excel('xxx.xlsx')
print("loaded")
df.to_sql(name='tablename',schema= 'dbo', con=engine, if_exists='replace',index=False, chunksize = 1000, method = 'multi')
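A further option (not in the original post) is SQLAlchemy's fast_executemany flag for the mssql+pyodbc dialect, which lets the driver batch the inserts instead of pandas building one large multi-row INSERT. A minimal sketch, assuming the same conn_str and df as above:
from sqlalchemy import create_engine
# fast_executemany batches parameterized inserts on the pyodbc side
engine = create_engine(conn_str, fast_executemany=True)
df.to_sql(name='tablename', schema='dbo', con=engine, if_exists='replace', index=False, chunksize=1000)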
Don't use for loops or cursors, just SQL:
insert into TABLENAMEA (A,B,C,D)
select A,B,C,D from TABLENAMEB
Take a look at this link to see another demo:
https://www.sqlservertutorial.net/sql-server-basics/sql-server-insert-into-select/
You just need to update this part to run a normal insert:
conn = pyodbc.connect ("Driver={xxx};"
"Server=xxx;"
"Database=xxx;"
"Trusted_Connection=yes;")
cursor = conn.cursor()
cursor.execute('insert into TABLENAMEA (A,B,C,D) select A,B,C,D from TABLENAMEB')
You don't need to store the dataset in a variable, just run the query directly as normal SQL; performance will be better than an iteration.

Insert Python DuckDB table into SQL statement

I am trying to use a registered virtual table as a table in a SQL statement using a connection to another database. I can't just turn the column into a string and use that; I need the table/dataframe itself to work in the statement and join with the other tables in the SQL statement. I'm trying this out on an Access database to start. This is what I have so far:
import os
import pyodbc
import pandas as pd
import duckdb
conn = duckdb.connect()
starterset = pd.read_excel (r'e:\Data Analytics\Python_Projects\Applications\DB_Test.xlsx')
conn.register("test_starter", starterset)
IDS = conn.execute("SELECT * FROM test_starter WHERE ProjectID > 1").fetchdf()
StartDate = '1/1/2015'
EndDate = '12/1/2021'
# establish the connection
connt = pyodbc.connect(r'Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=E:\Databases\Offline.accdb;')
cursor = connt.cursor()  # cursor on the Access connection
# Run the query
query = ("Select ProjectID, Revenue, ClosedDate from Projects INNER JOIN " + IDS + " Z on Z.ProjectID = Projects.ProjectID "
"where ClosedDate between #" + StartDate + "# and #" + EndDate + "# AND Revenue > 0 order by ClosedDate")
df = pd.read_sql(query, connt)
df.to_excel(r'TEMP.xlsx', index=False)
os.system("start EXCEL.EXE TEMP.xlsx")
# Close the connection
cursor.close()
connt.close()
I have a list of IDs in the Excel sheet that I'm trying to use as a filter in the database query. Ultimately, this will form into several criteria from the same table: dates, revenue, and IDs among others.
Honestly, I'm surprised I'm having so much trouble doing this. In SAS, with PROC SQL, it's so easy, but I can't get a dataframe to interface within the SQL parameters how I need it to. Am I making a syntax mistake?
The most common error so far is "UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('<U55'), dtype('<U55')) -> dtype('<U55')", but the types are the same.
It looks like you are pushing the contents of a DataFrame into an Access database query. I don't think there is a native way to do this in Pandas. The technique I use is database vendor specific, but I just build up a text string as either a CTE/WITH Clause or a temporary table.
Ex:
"""WITH my_data as (
SELECT 'raw_text_within_df' as df_column1, 'raw_text_within_df' as df_column2
UNION ALL
SELECT 'raw_text_within_df' as df_column1, 'raw_text_within_df' as df_column2
UNION ALL
...
)
[Your original query here]
"""

how to transfer data from one mysql database to another and mapping the data with different column names using python

How can I transfer data from one MySQL database to another? The other database may have different field names, except id, which will act as the primary key.
I have tried using SQLAlchemy (with pymysql), but the only data that gets mapped is for the field names that are the same in both databases.
import sqlalchemy
db1 = sqlalchemy.create_engine("mysql+pymysql://root:#localhost:3306/mydatabase1")
db2 = sqlalchemy.create_engine("mysql+pymysql://root:#localhost:3306/nava")
print('Writing...')
query = ''' (SELECT * FROM customers1)'''
df = pd.read_sql(query, db1)
print(df)
#query1 = ''UPDATE 'leap' SET `leap`value '''
df.to_sql('nap', db2, index=False, if_exists='append')
I get an error that the other database doesn't have the same field names, but what I want is that even if the field names change, the data still gets mapped with reference to the primary key id.
This is the program that I asked about in the above question, but there was an error so the code hasn't appeared in the right way:
import pandas as pd
import sqlalchemy
db1 = sqlalchemy.create_engine("mysql+pymysql://root:#localhost:3306/mydatabase1")
db2 = sqlalchemy.create_engine("mysql+pymysql://root:#localhost:3306/nava")
print('Writing...')
query = ''' (SELECT * FROM customers1)'''
df = pd.read_sql(query, db1)
df.to_sql('nap', db2, index=False, if_exists='append')
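One way to handle the differing field names (not shown in the original post) would be to rename the DataFrame columns to the target table's names before calling to_sql, keeping id as the key. A minimal sketch, assuming the imports and the db1/db2 engines from the code above; the column names in the mapping are hypothetical:
# Hypothetical mapping from source column names to target column names
column_map = {"customer_name": "name", "customer_phone": "phone"}
df = pd.read_sql("SELECT * FROM customers1", db1)
df = df.rename(columns=column_map)
df.to_sql("nap", db2, index=False, if_exists="append")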

How to drop a column using pyodbc

I am using pyodbc to connect to my SQL Server. I have a table from which I need to delete a column.
I can read this table; the code I used to read it is as follows:
import pandas as pd
import pyodbc
cnxn = pyodbc.connect("Driver={SQL Server Native Client 11.0}; Server=xyz; database=db; Trusted_Connection=yes;")
cursor = cnxn.cursor()
df = pd.read_sql("select * from [db].[username].[mytable]", cnxn)
df.shape
The above code works as expected. But when I try to drop a column from this table, it says it cannot find the object.
Here is my code trial:
query = 'ALTER TABLE [db].[username].[mytable] DROP COLUMN [TEMP CELCIUS]'
cursor.execute(query)
My question is how to drop this column. To add here, this column has a white space in its name.
Try:
query = 'ALTER TABLE [db].[username].[mytable] DROP COLUMN "TEMP CELCIUS"'
Note that backtick quoting (`TEMP CELCIUS`) is MySQL syntax and will not work on SQL Server; square brackets (as in the question) or double quotes are the usual T-SQL ways to quote an identifier containing a space.
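For completeness, a minimal sketch (not part of the original answer) of the full drop using the bracket quoting from the question; pyodbc does not autocommit by default, so the DDL change needs an explicit commit:
cursor = cnxn.cursor()
cursor.execute('ALTER TABLE [db].[username].[mytable] DROP COLUMN [TEMP CELCIUS]')
cnxn.commit()  # without this, the ALTER TABLE is rolled back when the connection closes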

How to insert multiple IDs into a sql statement?

New to Python and pandas, I'm facing the following issue:
I would like to pass multiple strings into a SQL query and struggle to insert the delimiter ',':
Example data
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print (df)
# Remove header (Not sure whether that is necessary)
df.columns = df.iloc[0]
pd.read_sql(
    """
    SELECT *
    FROM emptable
    WHERE empID IN ('{}',)
    """.format(df.ix[:, 0]),  # which corresponds to 'Alex','Bob','Clarke'
    con=connection)
I tried different combinations; however, none of them worked out.
Demo:
sql_ = """
SELECT *
FROM emptable
WHERE empID IN ({})
"""
sql = sql_.format(','.join(['?'] * len(df)))
print(sql)
new = pd.read_sql(sql, conn, params=tuple(df['Name']))
Output:
In [166]: print(sql)
SELECT *
FROM emptable
WHERE empID IN (?,?,?)
NOTE: this approach will not work if your DF is large, because the generated SQL string would be too big.
In this case you can save/dump Names in a helper temporary table and use it in SQL:
df[['Name']].to_sql('tmp', conn, if_exists='replace')
sql = """
SELECT *
FROM emptable
WHERE empID IN (select Name from tmp)
"""
new = pd.read_sql(sql, conn)
