Speeding up performance when writing from pandas to sqlite - python

Hoping for a few pointers on how I can optimise this code. Ideally I'd like to keep using pandas, but I assume there are some nifty SQLite tricks I can use to get a good speed-up. For additional "points", I would love to know whether Cython could help at all here.
In case it's not obvious from the code: for context, I have to read millions of very small SQLite files (the files in "unCompressedDir") and write their contents into a much larger "master" SQLite DB ("6thJan.db").
Thanks in advance everyone!
%%cython -a
import os
import pandas as pd
import sqlite3
import time
import sys

def main():
    rootDir = "/Users/harryrobinson/Desktop/dataForMartin/"
    unCompressedDir = "/Users/harryrobinson/Desktop/dataForMartin/unCompressedSqlFiles/"
    with sqlite3.connect(rootDir+'6thJan.db') as conn:
        destCursor = conn.cursor()
        createTable = "CREATE TABLE IF NOT EXISTS userData(TimeStamp, Category, Action, Parameter1Name, Parameter1Value, Parameter2Name, Parameter2Value, formatVersion, appVersion, userID, operatingSystem)"
        destCursor.execute(createTable)
        for i in os.listdir(unCompressedDir):
            try:
                with sqlite3.connect(unCompressedDir+i) as connection:
                    cursor = connection.cursor()
                    cursor.execute('SELECT * FROM Events')
                    df_events = pd.DataFrame(cursor.fetchall())
                    cursor.execute('SELECT * FROM Global')
                    df_global = pd.DataFrame(cursor.fetchall())
                    cols = ['TimeStamp', 'Category', 'Action', 'Parameter1Name', 'Parameter1Value', 'Parameter2Name', 'Parameter2Value']
                    df_events = df_events.drop(0,axis=1)
                    df_events.columns = cols
                    df_events['formatVersion'] = df_global.iloc[0,0]
                    df_events['appVersion'] = df_global.iloc[0,1]
                    df_events['userID'] = df_global.iloc[0,2]
                    df_events['operatingSystem'] = df_global.iloc[0,3]
            except Exception as e:
                print(e, sys.exc_info()[-1].tb_lineno)
            try:
                df_events.to_sql("userData", conn, if_exists="append", index=False)
            except Exception as e:
                print("Sqlite error, {0} - line {1}".format(e, sys.exc_info()[-1].tb_lineno))
UPDATE: halved the time by adding a transaction instead of to_sql
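The update does not show the code, so the following is only a minimal sketch of what "a transaction instead of to_sql" might look like, using nothing but the standard library. The trimmed three-column table and the sample rows are made up for illustration; the point is that all rows go through one executemany call inside a single transaction rather than being committed piecemeal.

```python
import sqlite3


def bulk_insert(conn, rows):
    """Insert every row inside one transaction (one commit for the whole batch)."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT INTO userData(TimeStamp, Category, Action) VALUES (?, ?, ?)",
            rows,
        )


# Hypothetical, trimmed-down version of the userData table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE userData(TimeStamp TEXT, Category TEXT, Action TEXT)")

# Made-up sample rows standing in for the contents of one small source file
rows = [("2016-01-06T00:00:{:02d}".format(s), "ui", "click") for s in range(60)]
bulk_insert(conn, rows)

row_count = conn.execute("SELECT COUNT(*) FROM userData").fetchone()[0]
print(row_count)
```

With millions of small files, it may also be worth committing once per batch of files rather than once per file, though that trade-off depends on how much work you can afford to lose on a failure.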

Reconsider using pandas as a staging tool; leave the library for data analysis. Instead, write pure SQL queries and use SQLite's ATTACH to query the external databases directly.
with sqlite3.connect(os.path.join(rootDir, '6thJan.db')) as conn:
    destCursor = conn.cursor()
    createTable = """CREATE TABLE IF NOT EXISTS userData(
                     TimeStamp TEXT, Category TEXT, Action TEXT, Parameter1Name TEXT,
                     Parameter1Value TEXT, Parameter2Name TEXT, Parameter2Value TEXT,
                     formatVersion TEXT, appVersion TEXT, userID TEXT, operatingSystem TEXT
                     );"""
    destCursor.execute(createTable)
    conn.commit()

    for i in os.listdir(unCompressedDir):
        # ATTACH needs the full path, passed as a one-element sequence
        destCursor.execute("ATTACH ? AS curr_db;", (os.path.join(unCompressedDir, i),))
        # NOTE: e.* must yield exactly the seven event columns of userData;
        # if Events has an extra leading id column, list the columns explicitly
        sql = """INSERT INTO userData
                 SELECT e.*, g.formatVersion, g.appVersion, g.userID, g.operatingSystem
                 FROM curr_db.[events] e
                 CROSS JOIN (SELECT * FROM curr_db.[global] LIMIT 1) g;"""
        destCursor.execute(sql)
        conn.commit()
        destCursor.execute("DETACH curr_db;")
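To try the ATTACH approach without touching real data, here is a small self-contained sketch. The helper names (make_source_db, merge_all), the reduced column set and the sample values are all assumptions for illustration, not the asker's actual schema. The columns are listed explicitly instead of using e.*, so an extra id column in Events does not break the insert.

```python
import os
import sqlite3
import tempfile


def make_source_db(path, user_id, n_events):
    """Create one small source file with hypothetical Events and Global tables."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE Events(id INTEGER PRIMARY KEY, TimeStamp TEXT, Action TEXT)")
    conn.execute("CREATE TABLE Global(appVersion TEXT, userID TEXT)")
    conn.executemany(
        "INSERT INTO Events(TimeStamp, Action) VALUES (?, ?)",
        [("t{}".format(k), "click") for k in range(n_events)],
    )
    conn.execute("INSERT INTO Global VALUES (?, ?)", ("1.0", user_id))
    conn.commit()
    conn.close()


def merge_all(source_dir, dest_path):
    """Append every source file's events to the master table via ATTACH."""
    dest = sqlite3.connect(dest_path)
    dest.execute(
        "CREATE TABLE IF NOT EXISTS userData("
        "TimeStamp TEXT, Action TEXT, appVersion TEXT, userID TEXT)"
    )
    dest.commit()
    for name in sorted(os.listdir(source_dir)):
        if not name.endswith(".db"):
            continue
        dest.execute("ATTACH DATABASE ? AS curr_db", (os.path.join(source_dir, name),))
        dest.execute(
            """INSERT INTO userData
               SELECT e.TimeStamp, e.Action, g.appVersion, g.userID
               FROM curr_db.Events e
               CROSS JOIN (SELECT * FROM curr_db.Global LIMIT 1) g"""
        )
        dest.commit()  # commit first: DETACH fails inside an open transaction
        dest.execute("DETACH DATABASE curr_db")
    total = dest.execute("SELECT COUNT(*) FROM userData").fetchone()[0]
    dest.close()
    return total


src_dir = tempfile.mkdtemp()
dst_dir = tempfile.mkdtemp()
make_source_db(os.path.join(src_dir, "a.db"), "user1", 2)
make_source_db(os.path.join(src_dir, "b.db"), "user2", 3)
total = merge_all(src_dir, os.path.join(dst_dir, "master.db"))
print(total)
```

The data never leaves SQLite here, so no Python objects are built per row, which is where most of the pandas overhead would come from.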

Related

Python looping to obtain different dataframes from a SQL database

I'm trying to connect to an SQL database and, within a loop, create separate dataframes for each different instance of Id, containing all the data related to that Id. I've tried a number of ways, without any success so far. I'm pretty new to all of this, so I'm probably making some rookie mistakes.
Attempt 1:
import pandas as pd
import pyodbc
conn = pyodbc.connect('Driver={SQL Server};'
'Server=Server_name;'
'Database=Database;'
'UID=Username;'
'PWD=password;'
'Trusted_Connection=yes;')
Name = ['HR','ZA','PR','FW']
for x in Name:
    SQL = '''
    SELECT *
    FROM Database
    WHERE Id = {x}'''.format(x = x)
    cursor = conn.cursor()
    cursor.execute(SQL)
    df = pd.read_sql_query(SQL)
On this code, I get an 'invalid column name' programming error on the first Name 'HR'.
Attempt 2:
import pandas as pd
import pyodbc
conn = pyodbc.connect('Driver={SQL Server};'
'Server=Server_name;'
'Database=Database;'
'UID=Username;'
'PWD=password;'
'Trusted_Connection=yes;')
SQL = '''
SELECT *
FROM Database
'''
conn.autocommit = True
cursor.execute(SQL)
for [Id] in cursor:
    df = pd.Dataframe(SQL,conn)
On this code, I get a 'ValueError: too many values to unpack (expected 1)' - on the for statement.
I want to put a lot more code in the for loop so I need it to be set up to work through each Id. I hope that makes sense. Any guidance would be greatly appreciated. Thanks
UPDATE:
Thanks for all the comments/answers. For some reason I just couldn't get it to work in either of the formats above, so I went back to where I started, now that I understand the syntax for including the loop variable. The following now works:
import pandas as pd
import pyodbc
conn = pyodbc.connect('Driver={SQL Server};'
'Server=Server_name;'
'Database=Database;'
'UID=Username;'
'PWD=password;'
'Trusted_Connection=yes;')
Name = ['HR','ZA','PR','FW']
for x in Name:
    SQL = pd.read_sql_query(
        '''
        SELECT *
        FROM Database_table
        WHERE Id = '{x}'
        '''.format(x = x), conn)
    df = pd.DataFrame(SQL)
I think that if you try a variation on your first attempt, passing the value as a query parameter instead of formatting it into the string, like:
for x in Name:
    SQL = '''
    SELECT *
    FROM Database
    WHERE Id = ?'''
    df = pd.read_sql_query(SQL, conn, params=[x])
It should probably work :)
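The 'invalid column name' error in the first attempt comes from the unquoted value: WHERE Id = HR is read as a comparison with a column called HR. A bound parameter avoids both the quoting problem and SQL injection. Below is a minimal sketch of the looping pattern using the standard-library sqlite3 module in place of pyodbc (both use ? placeholders); the records table and its rows are made up for illustration.

```python
import sqlite3

# In-memory stand-in for the real SQL Server table (hypothetical data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records(Id TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [("HR", 1), ("HR", 2), ("ZA", 3), ("PR", 4)],
)

names = ["HR", "ZA", "PR", "FW"]
rows_by_id = {}
for x in names:
    # The value is bound to the ? placeholder, never pasted into the SQL text
    cur = conn.execute(
        "SELECT Id, amount FROM records WHERE Id = ? ORDER BY amount", (x,)
    )
    rows_by_id[x] = cur.fetchall()

print(rows_by_id)
```

Keeping one result per Id in a dictionary means the rest of the loop body can work on rows_by_id[x]. With pandas, the same placeholder can be passed through the params argument of read_sql_query.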

How to conduct SQL queries on multiple .db files and store the results in a .csv?

I have about 100 .db files stored on my Google Drive which I want to run the same SQL query on. I'd like to store these query results in a single .csv file.
I've managed to use the following code to write the results of a single SQL query into a .csv file, but I am unable to make it work for multiple files.
conn = sqlite3.connect('/content/drive/My Drive/Data/month_2014_01.db')
df = pd.read_sql_query("SELECT * FROM messages INNER JOIN users ON messages.id = users.id WHERE text LIKE '%house%'", conn)
df.to_csv('/content/drive/My Drive/Data/Query_Results.csv')
This is the code that I have used so far to try and make it work for all files, based on this post.
databases = []
directory = '/content/drive/My Drive/Data/'
for filename in os.listdir(directory):
    flname = os.path.join(directory, filename)
    databases.append(flname)
for database in databases:
    try:
        with sqlite3.connect(database) as conn:
            conn.text_factory = str
            cur = conn.cursor()
            cur.execute(row["SELECT * FROM messages INNER JOIN users ON messages.id = users.id WHERE text LIKE '%house%'"])
            df.loc[index,'Results'] = cur.fetchall()
    except sqlite3.Error as err:
        print ("[INFO] %s" % err)
But this throws me an error: TypeError: tuple indices must be integers or slices, not str.
I'm obviously doing something wrong and I would much appreciate any tips that would point towards an answer.
Consider building a list of data frames, then concatenate them together in a single data frame with pandas.concat:
gdrive = "/content/drive/My Drive/Data/"
sql = """SELECT * FROM messages
INNER JOIN users ON messages.id = users.id
WHERE text LIKE '%house%'
"""
def build_df(db):
    with sqlite3.connect(os.path.join(gdrive, db)) as conn:
        df = pd.read_sql_query(sql, conn)
    return df
# BUILD LIST OF DFs WITH LIST COMPREHENSION
df_list = [build_df(db) for db in os.listdir(gdrive) if db.endswith('.db')]
# CONCATENATE ALL DFs INTO SINGLE DF FOR EXPORT
final_df = pd.concat(df_list, ignore_index = True)
final_df.to_csv(os.path.join(gdrive, 'Query_Results.csv'), index = False)
Better yet, consider SQLite's ATTACH DATABASE and append the query results to a master table. This avoids pulling in pandas, a heavy third-party data science library, for a simple data migration. It also keeps all the data inside SQLite, so you do not need to worry about data type conversion or I/O transfer issues.
import csv
import os
import sqlite3

main_db = 'month_2014_01.db'
with sqlite3.connect(os.path.join(gdrive, main_db)) as conn:
    # CREATE MASTER TABLE FROM THE MAIN DATABASE'S OWN TABLES (NOTHING ATTACHED YET)
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS master_query")
    cur.execute("""CREATE TABLE master_query AS
                   SELECT * FROM messages
                   INNER JOIN users
                      ON messages.id = users.id
                   WHERE text LIKE '%house%'
                """)
    conn.commit()

    # ITERATIVELY ATTACH AND APPEND RESULTS (SKIP THE MAIN DATABASE ITSELF)
    for db in os.listdir(gdrive):
        if db.endswith('.db') and db != main_db:
            cur.execute("ATTACH DATABASE ? AS tmp", [os.path.join(gdrive, db)])
            cur.execute("""INSERT INTO master_query
                           SELECT * FROM tmp.messages
                           INNER JOIN tmp.users
                              ON tmp.messages.id = tmp.users.id
                           WHERE text LIKE '%house%'
                        """)
            conn.commit()  # COMMIT BEFORE DETACH; DETACH FAILS INSIDE A TRANSACTION
            cur.execute("DETACH DATABASE tmp")

    # WRITE ROWS TO CSV (PYTHON 3: TEXT MODE WITH newline='')
    data = cur.execute("SELECT * FROM master_query")
    with open(os.path.join(gdrive, 'Query_Results.csv'), 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow([i[0] for i in cur.description])  # HEADERS
        writer.writerows(data)                            # DATA
    cur.close()
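The CSV step above can be tried in isolation. This is a minimal sketch, assuming a made-up two-column master_query table; the helper name query_to_csv is hypothetical. It writes to an in-memory buffer so it runs anywhere. For a real file, open it with mode 'w' and newline='' as the csv module documentation recommends.

```python
import csv
import io
import sqlite3


def query_to_csv(conn, sql, out):
    """Write the header row and every result row of a query to a text stream."""
    cur = conn.execute(sql)
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cur.description])  # column names
    writer.writerows(cur)  # the cursor is iterated lazily, row by row


# Hypothetical sample data standing in for the real query results
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE master_query(id INTEGER, text TEXT)")
conn.executemany(
    "INSERT INTO master_query VALUES (?, ?)",
    [(1, "my house"), (2, "house, garden")],
)

buf = io.StringIO()
query_to_csv(conn, "SELECT id, text FROM master_query ORDER BY id", buf)
csv_text = buf.getvalue()
print(csv_text)
```

Because the writer consumes the cursor directly, the full result set is never held in a Python list, which matters once the master table grows.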

Retrieve data from sql server database using Python

I am trying to execute the following script, but I get neither the desired results nor an error message, and I can't figure out what I'm doing wrong.
import pyodbc
cnxn = pyodbc.connect("Driver={SQL Server Native Client 11.0};"
"Server=mySRVERNAME;"
"Database=MYDB;"
"uid=sa;pwd=MYPWD;"
"Trusted_Connection=yes;")
cursor = cnxn.cursor()
cursor.execute('select DISTINCT firstname,lastname,coalesce(middlename,\' \') as middlename from Person.Person')
for row in cursor:
    print('row = %r' % (row,))
Any ideas? Any help is appreciated :)
Try using a fetch method along with the cursor. For example:
for row in cursor.fetchall():
    print('row = %r' % (row,))
EDIT :
The fetchall function returns all remaining rows in a list.
If there are no rows, an empty list is returned.
If there are a lot of rows, *this will use a lot of memory.*
Unread rows are stored by the database driver in a compact format and are often sent in batches from the database server.
Reading in only the rows you need at one time will save a lot of memory.
If we are going to process the rows one at a time, we can use the cursor itself as an iterator.
Moreover, we can simplify it, since cursor.execute() always returns a cursor:
for row in cursor.execute("select bla, anotherbla from blabla"):
    print(row.bla, row.anotherbla)
Documentation
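Between fetchall() (everything in memory at once) and row-by-row iteration there is a middle ground: fetchmany(), which is part of the Python DB-API and so is available on pyodbc cursors as well. The sketch below uses the standard-library sqlite3 module so that it runs without a server; the helper name iter_rows, the person table and the batch size are assumptions for illustration.

```python
import sqlite3


def iter_rows(cursor, batch_size=1000):
    """Yield rows one at a time while fetching them from the driver in batches."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:  # an empty list means the result set is exhausted
            break
        for row in batch:
            yield row


# Made-up table standing in for Person.Person
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person(firstname TEXT, lastname TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [("first{}".format(i), "last{}".format(i)) for i in range(2500)],
)

cur = conn.execute("SELECT firstname, lastname FROM person")
seen = sum(1 for _ in iter_rows(cur, batch_size=1000))
print(seen)
```

At most batch_size rows are held in Python at any moment, regardless of how large the result set is.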
I found this information useful to retrieve data from SQL database to python as a data frame.
import pandas as pd
import pymssql
con = pymssql.connect(server='use-et-aiml-cloudforte-aiops-db.database.windows.net',user='login_username',password='login_password',database='database_name')
cursor = con.cursor()
query = "SELECT * FROM <TABLE_NAME>"
cursor.execute(query)
df = pd.read_sql(query, con)
con.close()
df
import pandas as pd
import mysql.connector as mc
# connection creation
conn = mc.connect(host='localhost', user='root', passwd='password')
print(conn)
# create cursor object
cur = conn.cursor()
print(cur)
cur.execute('show databases')
for i in cur:
    print(i)
query = "Select * from employee_performance.employ_mod_recent"
emp_data = pd.read_sql(query, conn)
emp_data

Can't use cx_Oracle LOB in Spatialite WKB insert statement

I have some Python code that selects data from Oracle Spatial and inserts it into Spatialite. My problem is that the cursor contains the geometry in binary, and I can't figure out how to read the binary into the Spatialite insert statement. Just to add: this all works if I use WKT, but some of the geometries are too long, hence the binary format.
Can anyone help please?
# Import system modules
import cx_Oracle
from pyspatialite import dbapi2 as sl_db

def db_connect():
    # Build connect from TNS names
    o_db = cx_Oracle.connect("xxxxx", "xxxxx", "xxxxx_gl_dev")
    cursor = o_db.cursor()
    return cursor

def db_lookup(cursor):
    # Select records
    sql = "SELECT sdo_util.to_wkbgeometry(a.shape), a.objectid FROM span a WHERE a.objectid = 1382372"
    cursor.execute(sql)
    row = cursor.fetchall()
    return row

def db_insert(row):
    # Insert rows in new Spatialite table
    database_name = 'C:\\Temp\\MYDATABASE.sqlite'
    db_connection = sl_db.connect(database_name)
    db_cursor = db_connection.cursor()
    sql = 'INSERT INTO "SPAN_OFL" ("geometry", "OBJECTID") Values GeomFromWKB(?,27700),?);'
    db_cursor.executemany(sql, row)
    db_connection.commit()
    db_connection.close()

# main code
cursor = db_connect()
row = db_lookup(cursor)
db_insert(row)

Checking if a postgresql table exists under python (and probably Psycopg2)

How can I determine if a table exists using the Psycopg2 Python library? I want a true or false boolean.
How about:
>>> import psycopg2
>>> conn = psycopg2.connect("dbname='mydb' user='username' host='localhost' password='foobar'")
>>> cur = conn.cursor()
>>> cur.execute("select * from information_schema.tables where table_name=%s", ('mytable',))
>>> bool(cur.rowcount)
True
An alternative using EXISTS is better in that it doesn't require that all rows be retrieved, but merely that at least one such row exists:
>>> cur.execute("select exists(select * from information_schema.tables where table_name=%s)", ('mytable',))
>>> cur.fetchone()[0]
True
I don't know the psycopg2 lib specifically, but the following query can be used to check for existence of a table:
SELECT EXISTS(SELECT 1 FROM information_schema.tables
WHERE table_catalog='DB_NAME' AND
table_schema='public' AND
table_name='TABLE_NAME');
The advantage of using information_schema over selecting directly from the pg_* tables is some degree of portability of the query.
select exists(select relname from pg_class
where relname = 'mytablename' and relkind='r');
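As a rough sketch, the information_schema check can be wrapped in a helper that takes any DB-API cursor using %s placeholders (as psycopg2 does). The function name table_exists, the default schema 'public' and the FakeCursor class are assumptions for illustration. The fake cursor exists only so the example runs without a PostgreSQL server; in real use you would pass conn.cursor().

```python
def table_exists(cur, table, schema="public"):
    """Return True if schema.table is listed in information_schema.tables."""
    cur.execute(
        "SELECT EXISTS(SELECT 1 FROM information_schema.tables "
        "WHERE table_schema = %s AND table_name = %s)",
        (schema, table),  # bound parameters, never string concatenation
    )
    return bool(cur.fetchone()[0])


class FakeCursor:
    """Hypothetical stand-in for a psycopg2 cursor (no server needed)."""

    def __init__(self, known_tables):
        self.known_tables = known_tables
        self.last_sql = None
        self._result = None

    def execute(self, sql, params):
        self.last_sql = sql
        # Mimic EXISTS: a single row holding a single boolean
        self._result = (tuple(params) in self.known_tables,)

    def fetchone(self):
        return self._result


cur = FakeCursor({("public", "mytable")})
found = table_exists(cur, "mytable")
missing = table_exists(cur, "no_such_table")
print(found, missing)
```

Filtering on table_schema as well as table_name avoids a false positive when a table of the same name exists in another schema.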
The first answer did not work for me. I found success checking for the relation in pg_class:
def table_exists(con, table_str):
    exists = False
    try:
        cur = con.cursor()
        # pass the table name as a parameter rather than concatenating it into the SQL
        cur.execute("select exists(select relname from pg_class where relname=%s)", (table_str,))
        exists = cur.fetchone()[0]
        print(exists)
        cur.close()
    except psycopg2.Error as e:
        print(e)
    return exists
#!/usr/bin/python
# -*- coding: utf-8 -*-
import psycopg2
import sys

con = None
try:
    con = psycopg2.connect(database='testdb', user='janbodnar')
    cur = con.cursor()
    cur.execute('SELECT 1 from mytable')
    ver = cur.fetchone()
    print(ver)  # our code on success goes here
except psycopg2.DatabaseError as e:
    print('Error %s' % e)
    sys.exit(1)
finally:
    if con:
        con.close()
I know you asked for psycopg2 answers, but I thought I'd add a utility function based on pandas (which uses psycopg2 under the hood), just because pd.read_sql_query() makes things so convenient, e.g. avoiding creating/closing cursors.
import pandas as pd
def db_table_exists(conn, tablename):
    # thanks to Peter Hansen's answer for this sql
    sql = "select * from information_schema.tables where table_name = %s"
    # return results of sql query from conn as a pandas dataframe,
    # passing the table name as a query parameter
    results_df = pd.read_sql_query(sql, conn, params=(tablename,))
    # True if we got any results back, False if we didn't
    return bool(len(results_df))
I still use psycopg2 to create the db-connection object conn similarly to the other answers here.
The following solution is handling the schema too:
import psycopg2
with psycopg2.connect("dbname='dbname' user='user' host='host' port='port' password='password'") as conn:
    cur = conn.cursor()
    query = "select to_regclass(%s)"
    cur.execute(query, ['{}.{}'.format('schema', 'table')])
    exists = bool(cur.fetchone()[0])
Expanding on the above use of EXISTS, I needed something to test table existence generally. I found that testing for results using fetch on a select statement yielded the result "None" on an empty existing table -- not ideal.
Here's what I came up with:
import psycopg2
def exist_test(tabletotest):
    schema, table = tabletotest.split('.')
    existtest = "SELECT EXISTS (SELECT 1 FROM information_schema.tables WHERE table_schema = %s AND table_name = %s);"
    print('existtest', existtest)
    cur.execute(existtest, (schema, table))  # assumes you've already got your connection and cursor established
    return cur.fetchone()[0]  # returns True/False depending on whether table exists
exist_test('someschema.sometable')
You can look into pg_class catalog:
The catalog pg_class catalogs tables and most everything else that has
columns or is otherwise similar to a table. This includes indexes (but
see also pg_index), sequences (but see also pg_sequence), views,
materialized views, composite types, and TOAST tables; see relkind.
Below, when we mean all of these kinds of objects we speak of
“relations”. Not all columns are meaningful for all relation types.
Assuming an open connection with cur as cursor,
table = 'mytable'
# pass the name as a parameter so it is compared as a string, not read as a column name
cur.execute("SELECT EXISTS(SELECT relname FROM pg_class WHERE relname = %s);", (table,))
if cur.fetchone()[0]:
    # if table exists, do something here
    return True
cur.fetchone()[0] will resolve to either True or False because of the EXISTS() function.
