I am quite new to Python, so any advice or link will help.
I have created two Python scripts:
Main.py, which calls SQLcon.py.
SQLcon.py only creates a connection to SQL Server and downloads data based on multiple queries.
Later, Main.py reads the Excel files downloaded by SQLcon into pandas dataframes and does calculations and so on.
The file for the SQL connection and queries in SQLcon.py has the main structure shown below.
Problems:
A) Quite a lot of queries are run and quite a lot of temporary files are created.
B) I do not want to keep the SQL-related code in the Main file.
Wanted outcome:
I want to use dfX = pd.read_sql_query(qryX, engine) (or similar) in the main file and get rid of the part that saves/reads Excel files.
Also, it would be nice to keep one connection open during all these queries, as multiple re-connections will slow down the code.
I am not sure how to start...
I am thinking of putting the SQL connection into a function and calling it from Main...
But that will create multiple re-connections...
import sqlalchemy as sa # and other imports
load_dotenv()
# .env passwords and etc.
'''...'''
# creating SQL connection via sqlalchemy
connection_url = URL.create("mssql+pyodbc", query={"odbc_connect": connection_string})
engine = sa.create_engine(connection_url)
engine.echo = False
# creating dfs
df1 = pd.read_sql_query(qry1, engine)
dfA = pd.read_sql_query(qryA, engine)
dfZ = pd.read_sql_query(qryZ, engine)
engine.dispose() #not sure if dispose() is needed
# saving dfs
df1.to_excel(r'C:\Test\df1_tbl_Data.xlsx', index=False)
dfA.to_excel(r'C:\Test\dfA_tbl_Data.xlsx', index=False)
dfZ.to_excel(r'C:\Test\dfZ_tbl_Data.xlsx', index=False)
Consider building a collection of your data pulls in a user-defined function. Then call it whenever needed from Main or other scripts:
SQLcon.py
import sqlalchemy as sa
# and other imports
load_dotenv()
# .env passwords and etc. '''...'''

def pull_data():
    # creating SQL connection via sqlalchemy
    connection_url = URL.create(
        "mssql+pyodbc",
        query={"odbc_connect": connection_string}
    )
    engine = sa.create_engine(connection_url)
    engine.echo = False
    # creating dfs
    df_dict = {
        "df1": pd.read_sql_query(qry1, engine),
        "dfA": pd.read_sql_query(qryA, engine),
        "dfZ": pd.read_sql_query(qryZ, engine)
    }
    # releasing engine
    engine.dispose()
    return df_dict
Main.py (import above as a module)
from SQLcon import pull_data
...
# CALL AS NEEDED
df_dict = pull_data()
# ACCESS DICT ELEMENTS
df_dict["df1"]
df_dict["dfA"]
df_dict["dfZ"]
...
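If re-creating the engine inside pull_data() on every call is a concern (the multiple re-connections mentioned above), a variation is to build the engine once at module level in SQLcon.py and reuse it. This is only a minimal sketch: DB_ODBC_STRING, get_engine and read_query are placeholder names, and connection_string is assumed to be built from the .env values as in the question.

# SQLcon.py -- minimal sketch, not the original code
import os

import pandas as pd
import sqlalchemy as sa
from sqlalchemy.engine import URL

# placeholder: build connection_string from your .env values as in the question
connection_string = os.getenv("DB_ODBC_STRING", "")

_engine = None  # created lazily, once per process

def get_engine():
    global _engine
    if _engine is None:
        connection_url = URL.create("mssql+pyodbc", query={"odbc_connect": connection_string})
        _engine = sa.create_engine(connection_url)
    return _engine

def read_query(qry):
    # every call reuses the same engine, so connections come from its pool
    # instead of being re-established per query
    return pd.read_sql_query(qry, get_engine())

Main.py can then call SQLcon.read_query(qry1), SQLcon.read_query(qryA), etc. directly, with no intermediate Excel files and no new engine per query, since connections are checked out of the single engine's pool.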
Related
I am running a loop, and in each iteration I go to the Postgres database to find certain matching records. My issue is that it takes too long to open a new database connection each time through the loop.
import pandas as pd
import psycopg2
from psycopg2 import sql
from sqlalchemy import create_engine

engine = create_engine(mycredentials)
db_connect = engine.connect()

query = ("""SELECT ("CustomerId") AS "CustomerId",
                   ("LastName") AS "LastName",
                   ("FirstName") AS "FirstName",
                   ("City") AS "City"
            FROM
                public."postgres_table"
            WHERE (LEFT("LastName", 4) = %(blocking_lastname)s
                   AND LEFT("City", 4) = %(blocking_city)s)
               OR (LEFT("FirstName", 3) = %(blocking_firstname)s
                   AND LEFT("City", 4) = %(blocking_city)s)
        """)
for index, row in df_loop.iterrows():
    blocking_lastname = row['LastName'][:4]
    blocking_firstname = row['FirstName'][:3]
    blocking_city = row['City'][:4]

    df = pd.read_sql(query, db_connect, params={
        'blocking_firstname': blocking_firstname,
        'blocking_lastname': blocking_lastname,
        'blocking_city': blocking_city})
This code works, but it doesn't appear to be keeping the database connection open since there is too much latency with the query. (Note: The table columns have indexes.)
UPDATE: I updated the code above, using db_connect as an object outside of the loop rather than engine.connect() as a call in pd.read_sql() as I originally had it written.
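One way to check whether connection setup or the query itself is the slow part is to hold a single connection open for the whole loop with a context manager and time each query. The sketch below reuses the engine, query and df_loop from the code above and only adds timing.

import time

import pandas as pd

with engine.connect() as db_connect:   # one connection for the entire loop
    for index, row in df_loop.iterrows():
        params = {
            'blocking_lastname': row['LastName'][:4],
            'blocking_firstname': row['FirstName'][:3],
            'blocking_city': row['City'][:4],
        }
        start = time.perf_counter()
        df = pd.read_sql(query, db_connect, params=params)
        # if these per-query times are already high, the latency is in the
        # query itself (plan, round trips), not in opening connections
        print('query took {:.3f}s'.format(time.perf_counter() - start))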
I am having problems loading data into an Access database. For testing purposes I built a little convert function which takes all datasets from an HDF file and writes them into the .accdb. Without the @event.listens_for(engine, "before_cursor_execute") functionality it works, but very slowly. With it, I get odd behavior: it creates only one empty table (from the first df) in the db and finishes execution. The for loop is never finished and no error is raised.
Maybe it’s because the sqlalchemy-access package doesn’t support fast_executemany, but I couldn’t find any related information about it. Does anyone have input on how I can solve this or write the data into the db in a faster way?
Big thanks!
import urllib.parse
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine, event

# PATHS
HOME = Path(__file__).parent
DATA_DIR = HOME / 'output'
FILE_ACCESS = DATA_DIR / 'db.accdb'
FILE_HDF5 = DATA_DIR / 'Data.hdf'

# FUNCTIONS
def convert_from_hdf_to_accb():
    # https://github.com/gordthompson/sqlalchemy-access/wiki/Getting-Connected
    driver = '{Microsoft Access Driver (*.mdb, *.accdb)}'
    conn_str = 'DRIVER={};DBQ={};'.format(driver, FILE_ACCESS)
    conn_url = "access+pyodbc:///?odbc_connect={}".format(urllib.parse.quote_plus(conn_str))

    # https://medium.com/analytics-vidhya/speed-up-bulk-inserts-to-sql-db-using-pandas-and-python-61707ae41990
    # https://github.com/pandas-dev/pandas/issues/15276
    # https://stackoverflow.com/questions/48006551/speeding-up-pandas-dataframe-to-sql-with-fast-executemany-of-pyodbc
    engine = create_engine(conn_url)

    @event.listens_for(engine, "before_cursor_execute")
    def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
        if executemany:
            cursor.fast_executemany = True

    with pd.HDFStore(path=FILE_HDF5, mode="r") as store:
        for key in store.keys():
            df = store.get(key)
            df.to_sql(name=key, con=engine, index=False, if_exists='replace')
    print(' IT NEVER REACHES AND DOESNT RAISE AN ERROR :( ')

# EXECUTE
if __name__ == "__main__":
    convert_from_hdf_to_accb()
Maybe it’s because the sqlalchemy-access package doesn’t support fast_executemany
That is true. pyodbc's fast_executemany feature requires that the driver support an internal ODBC mechanism called "parameter arrays", and the Microsoft Access ODBC driver does not support them.
See also
https://github.com/mkleehammer/pyodbc/wiki/Driver-support-for-fast_executemany
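Given that limitation, a hedged fallback (not part of the answer above) is simply to leave the fast_executemany listener out for the Access target and let to_sql write in bounded chunks. It will not make the Access driver faster, but it keeps the conversion loop working and keeps memory use predictable. The sketch assumes the same engine and FILE_HDF5 as in the question.

import pandas as pd

# no event listener is registered: the Access ODBC driver cannot use
# fast_executemany (parameter arrays), so the listener would not help here
with pd.HDFStore(path=FILE_HDF5, mode="r") as store:
    for key in store.keys():
        df = store.get(key)
        # chunksize only bounds how many rows pandas sends per batch
        df.to_sql(name=key, con=engine, index=False,
                  if_exists='replace', chunksize=1000)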
I am writing a Python script to query about 60 database tables based on the current timestamp and store the results as CSV files in an S3 bucket. There are some global variables that I need access to, like engine, AWS credentials, current_time, etc. The file currently consists of 60 functions, each querying one table and then calling a function to write to S3.
How do I organize this code better so I won't have to call these 60 functions from the main function?
More importantly, how do I organize this code following OOP principles? I am very new to this and any help would be greatly appreciated.
This is what my current code looks like:
import (bunch of imports)

engine = create_engine('sqlite:///bookdatabase.db', echo=False)
access_key = 'adasdasdasdasd'
access_id = 'asdasdasd'

def table_name():
    table_name = 'book'
    sql = "select * from book where modified_date < current_date"
    mn = pandas.read_sql(sql, engine)
    # write_to_s3

def another_table_name():
    # .....

# etc. etc.
Functions that do the same thing with only a single variance are a clue that those actions can be combined into a better structure.
In your case, you are doing the same thing (querying a database and updating a bucket); the difference is that you call different databases and read different tables.
So why not create a function like this:
from sqlalchemy import create_engine, text

S3_ACCESS_KEY = '....'
S3_ACCESS_ID = '....'

def export_to_s3(db_configuration):
    for db, tables in db_configuration.items():
        engine = create_engine('sqlite:///{}'.format(db), echo=False)
        for table_name in tables:
            sql = "SELECT * FROM {} WHERE modified_date < current_date".format(table_name)
            with engine.connect() as connection:
                for result in connection.execute(text(sql)):
                    pass  # push result to s3

db_table_names = {'bookdatabase.db': ['book'],
                  'another.db': ['fruits', 'planets']}
export_to_s3(db_table_names)
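If the end goal is CSV files in S3 (as described in the question), a hedged variation of the same idea lets pandas do the reading and boto3 do the upload. The bucket name and the db/table.csv key layout below are placeholders, not anything from the original post.

import io

import boto3
import pandas as pd
from sqlalchemy import create_engine

S3_BUCKET = 'my-export-bucket'  # placeholder bucket name

def export_tables_to_s3(db_configuration, s3_client):
    for db, tables in db_configuration.items():
        engine = create_engine('sqlite:///{}'.format(db), echo=False)
        for table_name in tables:
            df = pd.read_sql(
                "SELECT * FROM {} WHERE modified_date < current_date".format(table_name),
                engine)
            buffer = io.StringIO()
            df.to_csv(buffer, index=False)
            # key layout (db/table.csv) is an assumption for this sketch
            s3_client.put_object(Bucket=S3_BUCKET,
                                 Key='{}/{}.csv'.format(db, table_name),
                                 Body=buffer.getvalue())

s3 = boto3.client('s3', aws_access_key_id=S3_ACCESS_ID,
                  aws_secret_access_key=S3_ACCESS_KEY)
export_tables_to_s3({'bookdatabase.db': ['book']}, s3)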
I have a dictionary with 3 keys which correspond to field names in a SQL Server table. The values of these keys come from an Excel file, and I store this dictionary in a dataframe which I now need to insert into a SQL table. This can all be seen in the code below:
import pandas as pd
import pymssql
df=[]
fp = "file path"
data = pd.read_excel(fp, sheet_name="CRM View")
row_date = data.loc[3, ]
row_sita = "ABZPD"
row_event = data.iloc[12, :]
df = pd.DataFrame({'date': row_date,
                   'sita': row_sita,
                   'event': row_event
                   }, index=None)
df = df[4:]
df = df.fillna("")
print(df)
My question is how do I insert this dictionary into a SQL table now?
Also, as a side note, this code is part of a loop which needs to go through several Excel files one by one, insert the data into the dictionary, then into SQL, then clear the dictionary and start again with the next Excel file.
You could try something like this:
import MySQLdb
# connect
conn = MySQLdb.connect("127.0.0.1", "username", "password", "database")
x = conn.cursor()
# write -- let the driver quote the parameters instead of string formatting
x.execute('INSERT INTO table_name (row_date, sita, event) VALUES (%s, %s, %s)',
          (row_date, sita, event))
# close
conn.commit()
conn.close()
You might have to change it a little based on your SQL restrictions, but it should give you a good start anyway.
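Since the question actually imports pymssql (SQL Server) rather than MySQL, here is a hedged sketch of the same idea, inserting the dataframe built in the question row by row with pymssql; the server, database and table names are placeholders.

import pymssql

# placeholder connection details
conn = pymssql.connect(server='my_server', user='username',
                       password='password', database='my_db')
cursor = conn.cursor()

# insert every row of the dataframe built in the question
for _, r in df.iterrows():
    cursor.execute(
        'INSERT INTO my_table (date, sita, event) VALUES (%s, %s, %s)',
        (r['date'], r['sita'], r['event']))

conn.commit()
conn.close()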
For the pandas dataframe, you can use the built-in pandas method to_sql to store it in the db. The following is the way to use it.
import urllib.parse
import sqlalchemy as sa

params = urllib.parse.quote_plus(
    "DRIVER={};SERVER={};DATABASE={};Trusted_Connection=True;".format(
        "{SQL Server}", "<db_server_url>", "<db_name>"))
conn_str = 'mssql+pyodbc:///?odbc_connect={}'.format(params)
engine = sa.create_engine(conn_str)
df.to_sql(<table_name>, engine, schema=<schema_name>, if_exists="append", index=False)
For this method you will need to install the sqlalchemy package.
pip install sqlalchemy
You will also need to set up the MSSQL DSN on the machine.
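To tie this back to the loop over several Excel files mentioned in the question, a hedged sketch could rebuild the dataframe for each workbook and append it with to_sql; the folder, sheet and table names below are placeholders, and engine is the one created above.

import glob

import pandas as pd

for fp in glob.glob(r'C:\files\*.xlsx'):          # placeholder folder
    data = pd.read_excel(fp, sheet_name="CRM View")
    df = pd.DataFrame({'date': data.loc[3, ],
                       'sita': "ABZPD",
                       'event': data.iloc[12, :]}, index=None)
    df = df[4:].fillna("")
    # df is rebuilt on every iteration, so there is nothing to clear by hand
    df.to_sql("my_table", engine, if_exists="append", index=False)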
I have the following function in python:
def add_odm_object(obj, table_name, primary_key, unique_column):
    db = create_engine('mysql+pymysql://root:#127.0.0.1/mydb')
    metadata = MetaData(db)
    t = Table(table_name, metadata, autoload=True)
    s = t.select(t.c[unique_column] == obj[unique_column])
    rs = s.execute()
    r = rs.fetchone()
    if not r:
        i = t.insert()
        i_res = i.execute(obj)
        v_id = i_res.inserted_primary_key[0]
        return v_id
    else:
        return r[primary_key]
This function checks whether the object obj is in the database, and if it is not found, it saves it to the DB. Now, I have a problem. I call the above function in a loop many times, and after a few hundred iterations I get an error: user root has exceeded the max_user_connections resource (current value: 30). I tried to search for answers; for example, the question "How to close sqlalchemy connection in MySQL" recommends creating a conn = db.connect() object, where db is the engine, and calling conn.close() after my query is completed.
But where should I open and close the connection in my code? I am not working with the connection directly; I'm using the Table() and MetaData() constructs in my code.
The engine is an expensive-to-create factory for database connections. Your application should call create_engine() exactly once per database server.
Similarly, the MetaData and Table objects describe a fixed schema object within a known database. These are also configurational constructs that in most cases are created once, just like classes, in a module.
In this case, your function seems to want to load up tables dynamically, which is fine; the MetaData object acts as a registry, which has the convenience feature that it will give you back an existing table if it already exists.
Within a Python function and especially within a loop, for best performance you typically want to refer to a single database connection only.
Taking these things into account, your module might look like:
from sqlalchemy import create_engine, MetaData, Table

# module level variable. can be initialized later,
# but generally just want to create this once.
db = create_engine('mysql+pymysql://root:#127.0.0.1/mydb')

# module level MetaData collection.
metadata = MetaData()

def add_odm_object(obj, table_name, primary_key, unique_column):
    with db.begin() as connection:
        # will load table_name exactly once, then store it persistently
        # within the above MetaData
        t = Table(table_name, metadata, autoload=True, autoload_with=connection)

        s = t.select(t.c[unique_column] == obj[unique_column])
        rs = connection.execute(s)
        r = rs.fetchone()
        if not r:
            i_res = connection.execute(t.insert(), obj)
            v_id = i_res.inserted_primary_key[0]
            return v_id
        else:
            return r[primary_key]
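A short usage sketch, assuming the module above were saved as odm.py (a hypothetical name) and that my_objects is an iterable of dicts keyed by column name: because the engine and MetaData are created once at import time, calling the function in a loop no longer opens a fresh connection per object.

from odm import add_odm_object  # hypothetical module name for the code above

my_objects = [
    {"name": "alpha", "value": 1},   # placeholder rows
    {"name": "beta", "value": 2},
]

for obj in my_objects:
    # each call checks out a pooled connection from the single engine
    # and returns it to the pool when the "with db.begin()" block exits
    pk = add_odm_object(obj, "my_table", "id", "name")
    print(pk)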