More Efficient Way To Insert Dataframe into SQL Server - python

I am trying to update a SQL table with updated information which is in a dataframe in pandas.
I have about 100,000 rows to iterate through and it's taking a long time. Any way I can make this code more efficient. Do I even need to truncate the data? Most rows will probably be the same.
conn = pyodbc.connect ("Driver={xxx};"
"Server=xxx;"
"Database=xxx;"
"Trusted_Connection=yes;")
cursor = conn.cursor()
cursor.execute('TRUNCATE dbo.Sheet1$')
for index, row in df_union.iterrows():
print(row)
cursor.execute("INSERT INTO dbo.Sheet1$ (Vendor, Plant) values(?,?)", row.Vendor, row.Plant)
Update: This is what I ended up doing.
params = urllib.parse.quote_plus(r'DRIVER={xxx};SERVER=xxx;DATABASE=xxx;Trusted_Connection=yes')
conn_str = 'mssql+pyodbc:///?odbc_connect={}'.format(params)
engine = create_engine(conn_str)
df = pd.read_excel('xxx.xlsx')
print("loaded")
df.to_sql(name='tablename',schema= 'dbo', con=engine, if_exists='replace',index=False, chunksize = 1000, method = 'multi')

Don't use for or cursors just SQL
insert into TABLENAMEA (A,B,C,D)
select A,B,C,D from TABLENAMEB
Take a look to this link to see another demo:
https://www.sqlservertutorial.net/sql-server-basics/sql-server-insert-into-select/
You just need to update this part to run a normal insert
conn = pyodbc.connect ("Driver={xxx};"
"Server=xxx;"
"Database=xxx;"
"Trusted_Connection=yes;")
cursor = conn.cursor()
cursor.execute('insert into TABLENAMEA (A,B,C,D) select A,B,C,D from TABLENAMEB')
You don't need to store the dataset in a variable, just run the query directly as normal SQL, performance will be better than a iteration

Related

How to the store a column from a SQL request into a variable?

I want to put the result of each column of the result of my request and store them into separate variables, so I can exploit its results.
I precise this is with a SELECt * and not separate requests.
So, If I do for example:
with connection.cursor() as cursor:
# Read a single record
sql = 'SELECT * FROM table'
cursor.execute(sql)
result = cursor.fetchall()
print(result)
I want to do :
a = [results from column1]
b = [results from column2]
The results should be turned into a row and not be left as a column, to make it a dictionary.
It's probably very simple but I'm new with Python / SQL, thank you.

Pandas / sqlite3: Change part of pandas dataframe and replace in sqlite database

Experts,
I am struggling to find an efficient way to work with pandas and sqlite.
I am building a tool that let's users
extract part of a sql database (sub_table) based on some filters
change part of sub_table
upload changed sub_table back to
overall sql table replacing old values
Users will only see excel data (so I need to write back and forth to excel which is not part of my example as out of scope).
Users can
replace existing rows (entries) with new data
delete existing rows
add new rows
Question: how can I most efficiently do this "replace/delete/add" using Pandas / sqlite3?
Here is my example code. If I use df_sub.to_sql("MyTable", con = conn, index = False, if_exists="replace") at the bottom than obviously the entire table is replaced...so there must be another way I cannot think of.
import pandas as pd
import sqlite3
import numpy as np
#### SETTING EXAMPLE UP
### Create DataFrame
data = dict({"City": ["London","Frankfurt","Berlin","Paris","Brondby"],
"Population":[8,2,4,9,0.5]})
df = pd.DataFrame(data,index = pd.Index(np.arange(5)))
### Create SQL DataBase
conn = sqlite3.connect("MyDB.db")
### Upload DataFrame as Table into SQL Database
df.to_sql("MyTable", con = conn, index = False, if_exists="replace")
### Read DataFrame from SQL DB
query = "SELECT * from MyTable"
pd.read_sql_query(query, con = conn)
#### CREATE SUB_TABLE AND AMEND
#### EXTRACT sub_table FROM SQL TABLE
query = "SELECT * from MyTable WHERE Population > 2"
df_sub = pd.read_sql_query(query, con = conn)
df_sub
#### Amend Sub DF
df_sub[df_sub["City"] == "London"] = ["Brussel",4]
df_sub
#### Replace new data in SQL DB
df_sub.to_sql("MyTable", con = conn, index = False, if_exists="replace")
query = "SELECT * from MyTable"
pd.read_sql_query(query, con = conn)
Thanks for your help!
Note: I did try to achieve via pure SQL queries but gave up. As I am not an expert on SQL I would want to go with pandas if a solution exists. If not a hint on how to achieve this via sql would be great!
I think there is no way around using SQL queries for this task.
With pandas it is only possible to read a query to a DataFrame and to write a DataFrame to a Database (replace or append).
If you want to update specific values/ rows or want to delete rows, you have to use SQL queries.
Commands you should look into are for example:
UPDATE, REPLACE, INSERT, DELETE
# Update the database, change City to 'Brussel' and Population to 4, for the first row
# (Attention! python indices start at 0, SQL indices at 1)
cur = conn.cursor()
cur.execute('UPDATE MyTable SET City=?, Population=? WHERE ROWID=?', ('Brussel', 4, 1))
conn.commit()
conn.close()
# Display the changes
conn = sqlite3.connect("MyDB.db")
query = "SELECT * from MyTable"
pd.read_sql_query(query, con=conn)
For more examples on sql and pandas you can look at
https://www.dataquest.io/blog/python-pandas-databases/

How to get table column-name/header for SQL query in python

I have the data in pandas dataframe which I am storing in SQLITE database using Python. When I am trying to query the tables inside it, I am able to get the results but without the column names. Can someone please guide me.
sql_query = """Select date(report_date), insertion_order_id, sum(impressions), sum(clicks), (sum(clicks)+0.0)/sum(impressions)*100 as CTR
from RawDailySummaries
Group By report_date, insertion_order_id
Having report_date like '2014-08-12%' """
cursor.execute(sql_query)
query1 = cursor.fetchall()
for i in query1:
print i
Below is the output that I get
(u'2014-08-12', 10187, 2024, 8, 0.3952569169960474)
(u'2014-08-12', 12419, 15054, 176, 1.1691244851866613)
What do I need to do to display the results in a tabular form with column names
In DB-API 2.0 compliant clients, cursor.description is a sequence of 7-item sequences of the form (<name>, <type_code>, <display_size>, <internal_size>, <precision>, <scale>, <null_ok>), one for each column, as described here. Note description will be None if the result of the execute statement is empty.
If you want to create a list of the column names, you can use list comprehension like this: column_names = [i[0] for i in cursor.description] then do with them whatever you'd like.
Alternatively, you can set the row_factory parameter of the connection object to something that provides column names with the results. An example of a dictionary-based row factory for SQLite is found here, and you can see a discussion of the sqlite3.Row type below that.
Step 1: Select your engine like pyodbc, SQLAlchemy etc.
Step 2: Establish connection
cursor = connection.cursor()
Step 3: Execute SQL statement
cursor.execute("Select * from db.table where condition=1")
Step 4: Extract Header from connection variable description
headers = [i[0] for i in cursor.description]
print(headers)
Try Pandas .read_sql(), I can't check it right now but it should be something like:
pd.read_sql( Q , connection)
Here is a sample code using cx_Oracle, that should do what is expected:
import cx_Oracle
def test_oracle():
connection = cx_Oracle.connect('user', 'password', 'tns')
try:
cursor = connection.cursor()
cursor.execute('SELECT day_no,area_code ,start_date from dic.b_td_m_area where rownum<10')
#only print head
title = [i[0] for i in cursor.description]
print(title)
# column info
for x in cursor.description:
print(x)
finally:
cursor.close()
if __name__ == "__main__":
test_oracle();

Select query in pymysql

When executing the following:
import pymysql
db = pymysql.connect(host='localhost', port=3306, user='root')
cur = db.cursor()
print(cur.execute("SELECT ParentGuardianID FROM ParentGuardianInformation WHERE UserID ='" + UserID + "'"))
The output is1
How could I alter the code so that the actual value of the ParentGuardianID (which is '001') is printed as opposed to 1.
I'm sure the answer is simple but I am a beginner so any help would be much appreciated - thanks!
cur.execute() just returns the number of rows affected. You should do cur.fetchone() to get the actual result, or cur.fetchall() if you are expecting multiple rows.
The cursor.execute() method gives out a cursor related to the result of the SQL sentence. In case of a select query, it returns the rows (if any) that meet it. So, you can iterate over these rows using a for loop for instance. In addition, I would recommend you to use pymysql.cursors.DictCursor because it allows treating the query results as a dictionary.
import pymysql
db = pymysql.connect(host='localhost', port=3306, user='root')
cur = db.cursor(pymysql.cursors.DictCursor)
UserId = 'whatsoever'
sql = "SELECT ParentGuardianID FROM ParentGuardianInformation WHERE UserID ='%s'"
cur.execute(sql % UserId)
for row in cur:
print(row['ParentGuardianID'])
Good luck!

Column names missing from generator with pyodbc

Im using Pyodbc to connect to sqlserver to get few rows. The select query I execute fetches almost 200,000 rows causing a memory issue.
To resolve this issue Im using a generator object, to fetch 5000 rows at any point in time..
The problem with this kind of execution is the generator object. I lose the data column names..
For example, if my table1 has column NAME, through normal exection I can access the result set as result.NAME
but I can't do the same with the generator object..It doesn't allow me to access through column names.
Any inputs will be useful?
Using Cursor.fetchmany() to process query result in batches returns a list of pyodbc.Row objects, which allows reference by column name.
Take these examples of a SQL Server query that returns database names in batches of 5:
Without generator
connection = pyodbc.connect(driver='{SQL Server Native Client 11.0}',
server='localhost', database='master',
trusted_connection='yes')
sql = 'select name from sys.databases'
cursor = connection.cursor().execute(sql)
while True:
rows = cursor.fetchmany(5)
if not rows:
break
for row in rows:
print row.name
With generator (modified from sample here)
def rows(cursor, size=5):
while True:
rows = cursor.fetchmany(size)
if not rows:
break
for row in rows:
yield row
connection = pyodbc.connect(driver='{SQL Server Native Client 11.0}',
server='localhost', database='master',
trusted_connection='yes')
sql = 'select name from sys.databases'
cursor = connection.cursor().execute(sql)
for row in rows(cursor):
print row.name

Categories

Resources