PYODBC Insert Statement in MS Access DB Extremely slow - python

I am looking to speed up my insert statement into an Access DB. The data is only 86,500 records but is taking more than 24 hours to process. The part of the code I am looking to speed up is comparing two tables for duplicates; if no duplicates are found, then insert that row. I am running 64-bit Windows 10, 32-bit Python 2.7, the 32-bit MS Access ODBC driver, and a 32-bit pyodbc module. Any help would be greatly appreciated; the code sample is below.
def importDIDsACC():
    """Compare the Ledger to ImportDids to find any missing records"""
    imdidsLst = []
    ldgrLst = readMSAccess("ActivityNumber", "Ledger")
    for x in readMSAccess("DISP_NUM", "ImportDids"):
        if x not in ldgrLst and x not in imdidsLst:
            imdidsLst.append(x)
    #Select the records to import
    if len(imdidsLst) > 0:
        sql = ""
        for row in imdidsLst:
            sql += "DISP_NUM = '" + row[0] + "' OR "
        sql = sql[:-4]  # drop the trailing " OR "
        cursor.execute("SELECT * FROM ImportDids WHERE " + sql)
        rows = cursor.fetchall()
        #Import to Ledger
        dupChk = []
        for row in rows:
            if row[4] not in dupChk:
                cursor.execute('INSERT into Ledger ([ActivityNumber], [WorkArea], [ClientName], [SurfacePurpose], [OpsApsDist], [AppDate], [LOADate], [EffDate], [AmnDate], [CanDate], [RenDate], [ExpDate], [ReiDate], [AmlDate], [DispType], [TRM], [Section], [Quarter], [Inspected_Date], [Inspection_Reason], [Inspected_By], [InspectionStatus], [REGION], [DOC], [STATCD]) VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)',
                               str(row[1]), str(row[18]), str(row[17]), row[14], str(row[26]), row[4], row[5], row[6], row[7], row[8], row[9], row[10], row[11], row[12], str(row[1][0:3]), trmCal(str(row[21]), str(row[20]), str(row[19])), str(row[22]), str(row[23]), inspSts(str(row[1]), 0), inspSts(str(row[1]), 1), inspSts(str(row[1]), 2), inspSts(str(row[1]), 3), str(row[27]), str(row[3]), str(row[13]))
                dupChk.append(row[4])
        cnxn.commit()
def readMSAccess(columns, table):
    """Select all records from the chosen field"""
    sql = "SELECT " + columns + " FROM " + table
    cursor.execute(sql)
    rows = cursor.fetchall()
    return rows

def dbConn():
    """Connects to Access dataBase"""
    connStr = """
    DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};
    DBQ=""" + getDatabasepath() + ";"
    cnxn = pyodbc.connect(connStr)
    cursor = cnxn.cursor()
    return cursor, cnxn

def getDatabasepath():
    """get the path to the access database"""
    mycwd = os.getcwd()
    os.chdir("..")
    dataBasePath = os.getcwd() + os.sep + "LandsAccessTool.accdb"
    os.chdir(mycwd)
    return dataBasePath

# Connect to the Access Database
cursor, cnxn = dbConn()

# Update the Ledger with any new records from importDids
importDIDsACC()

Don't use external code to check for duplicates. The power of a database (even Access) lies in its data-set operations. Don't try to rewrite that kind of code yourself, especially since, as you've discovered, it is not efficient. Instead, import everything into a temporary database table, then use Access (or the appropriate Access Data Engine) to execute SQL statements to compare tables, either finding or excluding duplicate rows. Results of those queries can then be used to create and/or update other tables, all within the context of the database engine. Of course, set up the temporary table(s) with appropriate indexes and keys to maximize the efficiency.
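For instance, here is a minimal sketch of that idea with the tables from the question: a single unmatched-records query does the duplicate check and the insert in one statement, entirely inside the engine. The single-column INSERT is illustrative only; the real statement would list all of the Ledger columns.

# Sketch: let the Access engine find and insert the missing rows itself,
# assuming ImportDids is a table in the same database and that DISP_NUM
# corresponds to Ledger.ActivityNumber (per the question's code).
sql = """
    INSERT INTO Ledger (ActivityNumber)
    SELECT i.DISP_NUM
    FROM ImportDids AS i
    LEFT JOIN Ledger AS l ON i.DISP_NUM = l.ActivityNumber
    WHERE l.ActivityNumber IS NULL
    """
cursor.execute(sql)
cnxn.commit()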
In the meantime, it is usually faster (am I allowed to say always?) when comparing data sets locally (i.e. tables) to load all values into some searchable collection with a single database request (i.e. one SQL SELECT statement), then use that in-memory collection to search for matches. This may seem ironic after my last statement about maximizing the database capabilities, but the big idea is understanding how the data set as a whole is being processed. Transporting the data back and forth between the python process and the database engine, even on the same machine, will be much slower than processing everything within the python process or everything within the database engine process. The only time that might not be useful is when the remote data set is much too large to download, but 87,000 key values is definitely small enough to load into a python collection.
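Applied to the question's code, a minimal sketch (reusing its readMSAccess helper): replace the repeated list scans, which cost a full pass per lookup, with sets, which are constant-time per lookup.

# Sketch: one round trip per table, then O(1) membership tests.
# The nested "x not in ldgrLst" list scans are what make the original loop slow.
ledger_keys = set(row[0] for row in readMSAccess("ActivityNumber", "Ledger"))
seen = set()
new_keys = []
for row in readMSAccess("DISP_NUM", "ImportDids"):
    key = row[0]
    if key not in ledger_keys and key not in seen:
        seen.add(key)
        new_keys.append(key)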

Related

Shorten SQLite3 insert statement for efficiency and readability

From this answer:
cursor.execute("INSERT INTO booking_meeting (room_name,from_date,to_date,no_seat,projector,video,created_date,location_name) VALUES (?, ?, ?, ?, ?, ?, ?, ?)", (rname, from_date, to_date, seat, projector, video, now, location_name ))
I'd like to shorten it to something like:
simple_insert(booking_meeting, rname, from_date, to_date, seat, projector, video, now, location_name)
The first parameter is the table name which can be read to get list of column names to format the first section of the SQLite3 statement:
cursor.execute("INSERT INTO booking_meeting (room_name,from_date,to_date,no_seat,projector,video,created_date,location_name)
Then the values clause (second part of the insert statement):
VALUES (?, ?, ?, ?, ?, ?, ?, ?)"
can be formatted by counting the number of column names in the table.
I hope I explained the question properly and you can appreciate the time savings of such a function. How to write this function in Python? ...is my question.
There may already be a simple_insert() function in SQLite3, but I just haven't stumbled across it yet.
If you're inserting into all the columns, then you don't need to specify the column names in the INSERT query. For that scenario, you could write a function like this:
def simple_insert(cursor, table, *args):
    query = f'INSERT INTO {table} VALUES (' + '?, ' * (len(args) - 1) + '?)'
    cursor.execute(query, args)
For your example, you would call it as:
simple_insert(cursor, 'booking_meeting', rname, from_date, to_date, seat, projector, video, now, location_name)
Note I've chosen to pass cursor to the function; you could choose to just rely on it as a global variable instead.
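If you do want the column names spelled out in the query, here is a sketch of a variant that reads them from SQLite's table metadata. simple_insert_named is a hypothetical name, it assumes one argument per column, and the table name must come from trusted code since it is interpolated into the SQL:

def simple_insert_named(cursor, table, *args):
    # PRAGMA table_info returns one row per column; field 1 is the column name.
    cols = [row[1] for row in cursor.execute(f'PRAGMA table_info({table})')]
    col_list = ', '.join(cols)
    markers = ', '.join('?' * len(cols))  # assumes len(args) == len(cols)
    cursor.execute(f'INSERT INTO {table} ({col_list}) VALUES ({markers})', args)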

Is there a way to check and load missing data to SQL?

I am trying to figure out a way to check data I am loading into a SQL table from a dataframe so I can load missing data and avoid loading duplicate data.
Here is a really rough idea.
sql_data = []
data = [('2020-01-01', 'Monday', 20, 0.1),
        ('2020-01-02', 'Tuesday', 12, 0.4),
        ('2020-01-03', 'Wednesday', 26, 0.3)]
cursor.execute('''SELECT * FROM Table''')
for row in cursor.fetchall():
    sql_data.append(row)
for record in data:
    if record in sql_data:
        pass
    else:
        query = '''INSERT INTO Table (Time, Day, Number, Decimal)
                   VALUES (?, ?, ?, ?)'''
        cursor.execute(query, record)
conn.commit()
Consider the rarely used EXCEPT clause (part of the UNION and INTERSECT set-operator family), since SQL Server supports scalar values in a SELECT without a FROM data source:
query = '''INSERT INTO Table (Time, Day, Number, Decimal)
           SELECT ?, ?, ?, ?
           EXCEPT
           SELECT Time, Day, Number, Decimal
           FROM Table
        '''
cursor.executemany(query, data)
conn.commit()
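If this runs through pyodbc (as in the questions above), turning on fast_executemany can speed up the bulk call considerably. Note it is a driver-dependent feature: the SQL Server ODBC drivers support it, but the Access driver does not.

cursor.fast_executemany = True  # send parameter sets in bulk instead of row by row
cursor.executemany(query, data)
conn.commit()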

(TypeError: Can't convert 'int' object to str implicitly) when pushing a data into a table using python environment

Age and phone_num are int values; the rest are all strings. When trying to push this into a DB using the code below, I am getting the error in the title:
insert_query = "insert into employee.details (name,emp_id,age,contact,address) values('"+name+"','"+emp_id+"',"+age+","+phone_num+",'"+address+"')"
cursor = connection.cursor()
result = cursor.execute(insert_query)
print("Table updated successfully ")
I think you were getting this error because Python cannot combine integers and strings unless they are explicitly converted using str().
I assume you are using SQLite3? If so, here is the proper syntax for the query:
insert_query = """INSERT INTO employee.details (name, emp_id, age, contact, address) VALUES (?, ?, ?, ?, ?)"""
cur = conn.cursor()
cur.execute(insert_query, (name, emp_id, age, phone_num, address))
one_row = cur.fetchone()   # gets a single row (only meaningful after a SELECT)
all_data = cur.fetchall()  # gets all remaining rows as a list of tuples (likewise)
conn.commit()
conn.close() # only if this is last db change
Passing your values as a tuple of query parameters automatically escapes strings and prevents SQL injection. It also removes the string concatenation that mixed int and str, fixing your error.

Insert or update rows in MS Access database in Python

I've got an MS Access table (SearchAdsAccountLevel) which needs to be updated frequently from a python script. I've set up the pyodbc connection and now I would like to UPDATE/INSERT rows from my pandas df to the MS Access table based on whether the Date_ AND CampaignId fields match with the df data.
Looking at previous examples I've built the UPDATE statement which uses iterrows to iterate through all rows within df and execute the SQL code as per below:
connection_string = (
    r"Driver={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=c:\AccessDatabases\Database2.accdb;"
)
cnxn = pyodbc.connect(connection_string, autocommit=True)
crsr = cnxn.cursor()
for index, row in df.iterrows():
    crsr.execute("UPDATE SearchAdsAccountLevel SET [OrgId]=?, [CampaignName]=?, [CampaignStatus]=?, [Storefront]=?, [AppName]=?, [AppId]=?, [TotalBudgetAmount]=?, [TotalBudgetCurrency]=?, [DailyBudgetAmount]=?, [DailyBudgetCurrency]=?, [Impressions]=?, [Taps]=?, [Conversions]=?, [ConversionsNewDownloads]=?, [ConversionsRedownloads]=?, [Ttr]=?, [LocalSpendAmount]=?, [LocalSpendCurrency]=?, [ConversionRate]=?, [Week_]=?, [Month_]=?, [Year_]=?, [Quarter]=?, [FinancialYear]=?, [RowUpdatedTime]=? WHERE [Date_]=? AND [CampaignId]=?",
                 row['OrgId'],
                 row['CampaignName'],
                 row['CampaignStatus'],
                 row['Storefront'],
                 row['AppName'],
                 row['AppId'],
                 row['TotalBudgetAmount'],
                 row['TotalBudgetCurrency'],
                 row['DailyBudgetAmount'],
                 row['DailyBudgetCurrency'],
                 row['Impressions'],
                 row['Taps'],
                 row['Conversions'],
                 row['ConversionsNewDownloads'],
                 row['ConversionsRedownloads'],
                 row['Ttr'],
                 row['LocalSpendAmount'],
                 row['LocalSpendCurrency'],
                 row['ConversionRate'],
                 row['Week_'],
                 row['Month_'],
                 row['Year_'],
                 row['Quarter'],
                 row['FinancialYear'],
                 row['RowUpdatedTime'],
                 row['Date_'],
                 row['CampaignId'])
crsr.commit()
I would like to iterate through each row of my df (around 3000) and, if ['Date_'] AND ['CampaignId'] match, UPDATE all other fields; otherwise, INSERT the whole df row into my Access table (create a new row). What's the most efficient and effective way to achieve this?
Consider DataFrame.values and pass the list into an executemany call, making sure to order the columns accordingly for the UPDATE query:
cols = ['OrgId', 'CampaignName', 'CampaignStatus', 'Storefront',
'AppName', 'AppId', 'TotalBudgetAmount', 'TotalBudgetCurrency',
'DailyBudgetAmount', 'DailyBudgetCurrency', 'Impressions',
'Taps', 'Conversions', 'ConversionsNewDownloads', 'ConversionsRedownloads',
'Ttr', 'LocalSpendAmount', 'LocalSpendCurrency', 'ConversionRate',
'Week_', 'Month_', 'Year_', 'Quarter', 'FinancialYear',
'RowUpdatedTime', 'Date_', 'CampaignId']
sql = '''UPDATE SearchAdsAccountLevel
SET [OrgId]=?, [CampaignName]=?, [CampaignStatus]=?, [Storefront]=?,
[AppName]=?, [AppId]=?, [TotalBudgetAmount]=?,
[TotalBudgetCurrency]=?, [DailyBudgetAmount]=?,
[DailyBudgetCurrency]=?, [Impressions]=?, [Taps]=?, [Conversions]=?,
[ConversionsNewDownloads]=?, [ConversionsRedownloads]=?, [Ttr]=?,
[LocalSpendAmount]=?, [LocalSpendCurrency]=?, [ConversionRate]=?,
[Week_]=?, [Month_]=?, [Year_]=?, [Quarter]=?, [FinancialYear]=?,
[RowUpdatedTime]=?
WHERE [Date_]=? AND [CampaignId]=?'''
crsr.executemany(sql, df[cols].values.tolist())
cnxn.commit()
For the insert, use a temp staging table with the exact structure of the final table, which you can create with a make-table query: SELECT TOP 1 * INTO temp FROM final. This temp table is regularly cleaned out and re-inserted with all data frame rows. A final query then migrates only the new rows from temp into final with NOT EXISTS, NOT IN, or LEFT JOIN/NULL. You can run this query anytime and never worry about duplicates per the Date_ and CampaignId columns.
# CLEAN OUT TEMP
sql = '''DELETE FROM SearchAdsAccountLevel_Temp'''
crsr.execute(sql)
cnxn.commit()
# APPEND TO TEMP
sql = '''INSERT INTO SearchAdsAccountLevel_Temp (OrgId, CampaignName, CampaignStatus, Storefront,
AppName, AppId, TotalBudgetAmount, TotalBudgetCurrency,
DailyBudgetAmount, DailyBudgetCurrency, Impressions,
Taps, Conversions, ConversionsNewDownloads, ConversionsRedownloads,
Ttr, LocalSpendAmount, LocalSpendCurrency, ConversionRate,
Week_, Month_, Year_, Quarter, FinancialYear,
RowUpdatedTime, Date_, CampaignId)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?,
?, ?, ?, ?, ?, ?, ?, ?, ?,
?, ?, ?, ?, ?, ?, ?, ?, ?);'''
crsr.executemany(sql, df[cols].values.tolist())
cnxn.commit()
# MIGRATE TO FINAL
sql = '''INSERT INTO SearchAdsAccountLevel
SELECT t.*
FROM SearchAdsAccountLevel_Temp t
LEFT JOIN SearchAdsAccountLevel f
ON t.Date_ = f.Date_ AND t.CampaignId = f.CampaignId
WHERE f.OrgId IS NULL'''
crsr.execute(sql)
cnxn.commit()

How to format and build query strings in Python Sqlite?

What is the most used way to create a Sqlite query in Python?
Solution 1:
query = '''insert into events (date, title, col3, col4, int5, int6)
           values("%s", "%s", "%s", "%s", %s, %s)''' % (date, title, col3, col4, int5, int6)
print query
c.execute(query)
Problem: it won't work for example if title contains a quote ".
Solution 2:
query = '''insert into events (date, title, col3, col4, int5, int6)
           values(?, ?, ?, ?, ?, ?)'''
c.execute(query, (date, title, col3, col4, int5, int6))
Problem: in solution 1 we could display/print the query (to log it); here in solution 2 we can't log the query string anymore, because the replacement of each ? by a variable happens during the execute.
Is there another, cleaner way to do it? Can we avoid repeating ?, ?, ?, ..., ? and have one single values(?) that is still replaced by all the parameters in the tuple?
You should always use the parameter substitution of the DB API to avoid SQL injection. Query logging is then relatively trivial by subclassing sqlite3.Cursor:
import sqlite3

class MyConnection(sqlite3.Connection):
    def cursor(self):
        return super().cursor(MyCursor)

class MyCursor(sqlite3.Cursor):
    def execute(self, sql, parameters=''):
        print(f'statement: {sql!r}, parameters: {parameters!r}')
        return super().execute(sql, parameters)

conn = sqlite3.connect(':memory:', timeout=60, factory=MyConnection)
conn.execute('create table if not exists "test" (id integer, value integer)')
conn.execute('insert into test values (?, ?)', (1, 0))
conn.commit()
yields:
statement: 'create table if not exists "test" (id integer, value integer)', parameters: ''
statement: 'insert into test values (?, ?)', parameters: (1, 0)
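A lighter-weight sketch of the same idea, using sqlite3's built-in trace hook instead of subclassing; the callback is invoked with each SQL statement the engine actually executes:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.set_trace_callback(print)  # print every executed statement
conn.execute('create table test (id integer, value integer)')
conn.execute('insert into test values (?, ?)', (1, 0))
conn.commit()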
To avoid formatting problems and SQL injection attacks, you should always use parameters.
When you want to log the query, you can simply log the parameter list together with the query string.
(SQLite has a function to get the expanded query, but Python does not expose it.)
Each parameter marker corresponds to exactly one value. If writing many markers is too tedious for you, let the computer do it:
parms = (1, 2, 3)
markers = ",".join("?" * len(parms))
