SQL Server slow/slower than Python/SQL Alchemy - python

Our team has been experiencing slow returns for our SQL queries in Microsoft SQL Server Management Studio. This started recently, the slowness fluctuates seemingly at random (it doesn't correlate with periods when large amounts of data are being written to the DB), and a new data point is that sending the same query using Python's pandas and SQLAlchemy libraries returns the data much more quickly.
Python:
import urllib.parse
import pandas as pd
from sqlalchemy import create_engine

sqlserver = 'servername'  # placeholder: SQL Server host name
database = 'database'
params = urllib.parse.quote_plus(
    'DRIVER={ODBC Driver 17 for SQL Server};' +
    'SERVER=' + sqlserver + ';DATABASE=' + database + ';Trusted_Connection=yes;')
engine = create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
df = pd.read_sql('SELECT * FROM table', con=engine)
SQL in SSMS:
SELECT * FROM table
Both return the same data.
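For what it's worth, here is a quick way to put a number on the Python side of the comparison (a sketch; it reuses the pandas import and the engine from the snippet above, and the table name is still a placeholder):
import time

start = time.perf_counter()
df = pd.read_sql('SELECT * FROM table', con=engine)
print(f'{len(df)} rows in {time.perf_counter() - start:.2f} s')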

Related

How to speed up pandas to_sql

I am trying to upload data to an MS Azure SQL database using pandas to_sql and it takes a very long time. I often have to start it before I go to bed; by the time I wake up it is done, but it has taken several hours, and if an error comes up along the way I am not there to address it. Here is the code I have:
import urllib.parse
from sqlalchemy import create_engine, event

# driver, server, database, username and password are defined earlier
params = urllib.parse.quote_plus(
    'Driver=%s;' % driver +
    'Server=%s,1433;' % server +
    'Database=%s;' % database +
    'Uid=%s;' % username +
    'Pwd={%s};' % password +
    'Encrypt=yes;' +
    'TrustServerCertificate=no;'
)
conn_str = 'mssql+pyodbc:///?odbc_connect=' + params
engine = create_engine(conn_str)

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True
        cursor.commit()

connection = engine.connect()
connection
Then I run this command for the SQL ingestion:
master_data.to_sql('table_name', engine, chunksize=500, if_exists='append', method='multi', index=False)
I have played around with the chunksize and the sweet spot seems to be 100, which isn't fast enough considering I am usually trying to upload 800,000-2,000,000 records at a time. If I increase it beyond that I will get an error which seems to only be related to the chunk size.
OperationalError: (pyodbc.OperationalError) ('08S01', '[08S01] [Microsoft][ODBC Driver 17 for SQL Server]Communication link failure (0) (SQLExecDirectW)')
Not sure if you have already resolved your issue, but I did want to provide an answer here with Azure SQL Database information specific to the Python libraries, plus some useful resources for investigating and resolving this issue, as applicable.
An example of using pyodbc to directly query an Azure SQL Database:
Quickstart: Use Python to query Azure SQL Database Single Instance & Managed Instance
An example of using Pandas dataframe: How to read and write to an Azure SQL database from a Pandas dataframe
main.py
"""Read write to Azure SQL database from pandas"""
import pyodbc
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
# 1. Constants
AZUREUID = 'myuserid' # Azure SQL database userid
AZUREPWD = '************' # Azure SQL database password
AZURESRV = 'shareddatabaseserver.database.windows.net' # Azure SQL database server name (fully qualified)
AZUREDB = 'Pandas' # Azure SQL database name (if it does not exit, pandas will create it)
TABLE = 'DataTable' # Azure SQL database table name
DRIVER = 'ODBC Driver 13 for SQL Server' # ODBC Driver
def main():
"""Main function"""
# 2. Build a connectionstring
connectionstring = 'mssql+pyodbc://{uid}:{password}#{server}:1433/{database}?driver={driver}'.format(
uid=AZUREUID,
password=AZUREPWD,
server=AZURESRV,
database=AZUREDB,
driver=DRIVER.replace(' ', '+'))
# 3. Read dummydata into dataframe
df = pd.read_csv('./data/data.csv')
# 4. Create SQL Alchemy engine and write data to SQL
engn = create_engine(connectionstring)
df.to_sql(TABLE, engn, if_exists='append')
# 5. Read data from SQL into dataframe
query = 'SELECT * FROM {table}'.format(table=TABLE)
dfsql = pd.read_sql(query, engn)
print(dfsql.head())
if __name__ == "__main__":
main()
And finally, the following resources should assist in comparing specific implementations and their performance issues. The Stack Overflow thread is likely the best resource, but the monitoring and performance tuning document is useful for investigating and mitigating any server-side performance issues.
Speeding up pandas.DataFrame.to_sql with fast_executemany of pyODBC
Monitoring and performance tuning in Azure SQL Database and Azure SQL Managed Instance
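For reference, here is a minimal sketch of the fast_executemany approach that thread describes, assuming SQLAlchemy 1.3+ (which accepts the flag directly on create_engine) and ODBC Driver 17; the server, credential, and table names below are placeholders:
import urllib.parse
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection details; substitute your own server, database and credentials.
params = urllib.parse.quote_plus(
    'Driver={ODBC Driver 17 for SQL Server};'
    'Server=tcp:myserver.database.windows.net,1433;'
    'Database=mydb;Uid=myuser;Pwd=mypassword;'
    'Encrypt=yes;TrustServerCertificate=no;'
)
# fast_executemany batches the INSERTs on the pyodbc side instead of sending them row by row.
engine = create_engine('mssql+pyodbc:///?odbc_connect=' + params, fast_executemany=True)

df = pd.DataFrame({'col_a': range(1000), 'col_b': ['x'] * 1000})  # dummy data
df.to_sql('table_name', engine, if_exists='append', index=False, chunksize=10000)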
Regards,
Mike
Using the same connection setup as in the question above, database ingestion is done with this next line. I had issues with chunksize before but fixed them by adding the method and index arguments.
ingest_data.to_sql('db_table_name', engine, if_exists='append', chunksize=100000, method=None, index=False)

Connect to SQL Server and run query as "passthrough" from Python

I currently have code that executes queries on data stored on a SQL Server database, such as the following:
import pyodbc
conn = pyodbc.connect(
    r'DRIVER={SQL Server};'
    r'SERVER=SQL2SRVR;'
    r'DATABASE=DBO732;'
    r'Trusted_Connection=yes;'
)
sqlstr = '''
SELECT Company, Street_Address, City, State
FROM F556
WHERE [assume complicated criteria statement here]
'''
crsr = conn.cursor()
for row in crsr.execute(sqlstr):
    print(row.Company, row.Street_Address, row.City, row.State)
I can't find documentation online about whether pyodbc can run (or by default runs) my queries on the SQL Server itself (as passthrough queries), or whether (if pyodbc can't do that) there is another way (maybe SQLAlchemy or similar?) of doing that. Any insight?
Or is there a way to execute passthrough queries directly from Pandas?
If you are working with pandas and SQL Server then you should already have created a SQLAlchemy Engine object (usually named engine). To execute a raw DML statement you can use the construct:
with engine.begin() as conn:
    conn.execute("UPDATE table_name SET column_name ...")
    print("table updated")

django panda read sql query map parameters

I am trying to connect to a SQL Server database within the Django framework, to read a SQL query result into a pandas DataFrame.
from django.db import connections
query = """SELECT * FROM [dbo].[table] WHERE project=%(Name)s"""
data = pd.read_sql(query, connections[database], params={'Name': input} )
The error message I get is 'format requires a mapping'.
If I do it something like below, it works, but I really want to be able to map each parameter by name:
from django.db import connections
query = """SELECT * FROM [dbo].[table] WHERE project=%s"""
data = pd.read_sql(query, connections[database], params={input} )
I was using ODBC Driver 17 for SQL Server.
You can format the query at string level and then run pd.read_sql, as in the sketch below.
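A minimal sketch of that suggestion, reusing the question's query (the connection alias and project name are placeholders; note that interpolating untrusted input directly into SQL opens you up to SQL injection, so only do this with values you control):
import pandas as pd
from django.db import connections

project_name = 'my_project'  # placeholder for the user-supplied value
query = "SELECT * FROM [dbo].[table] WHERE project = '{Name}'".format(Name=project_name)
data = pd.read_sql(query, connections['default'])  # 'default' is a placeholder database alias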

Utility to find join columns

I have been given several tables in SQL Server and am trying to figure out the best way to join them.
What I've done is:
1) open a connection in R to the database
2) pull all the column names from the INFORMATION_SCHEMA.COLUMNS table
3) build loops in R to try every combination of columns and see what the row count is of the inner join of the 2 columns
I'm wondering if there's a better way to do this or if there's a package or utility that helps with this type of problem.
You could do your joins in Python using pandas. Pandas has a powerful IO engine, so you could import from SQL Server into a pandas DataFrame, perform your joins in Python, and write back to SQL Server.
Below is a script I use to perform an import from SQL Server and an export to a MySQL table. I use the Python package SQLAlchemy for my ORM connections. You could follow this example and read up on joins in pandas; a small merge sketch follows the script.
import pyodbc
import pandas as pd
from sqlalchemy import create_engine

# MySQL info
username = 'user'
password = 'pw'
sqlDB = 'mydb'

# Create MSSQL PSS Connector
server = 'server'
database = 'mydb'
connMSSQL = pyodbc.connect(
    'DRIVER={ODBC Driver 13 for SQL Server};' +
    f'SERVER={server};PORT=1433;DATABASE={database};Trusted_Connection=yes;')

# Read table into pandas dataframe
tsql = '''
SELECT [Index],
       Tag
FROM [dbo].[Tags]
'''
df = pd.read_sql(tsql, connMSSQL, index_col='Index')

# Write df to MySQL db
engine = create_engine(
    f'mysql+mysqldb://{username}:{password}@localhost/mydb', pool_recycle=3600)
with engine.connect() as connMySQL:
    df.to_sql('pss_alarms', connMySQL, if_exists='replace')
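To illustrate the join step the answer alludes to, here is a small pandas merge sketch; the frame and column names are made up, and the row count of the inner join is what the question was computing in R:
import pandas as pd

# Two hypothetical tables already pulled from SQL Server into DataFrames.
df_left = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
df_right = pd.DataFrame({'id': [2, 3, 4], 'value': [10, 20, 30]})

# Inner join on a candidate column pair, then inspect the resulting row count.
joined = pd.merge(df_left, df_right, left_on='id', right_on='id', how='inner')
print(len(joined))  # number of rows produced by the inner join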

python cx_oracle hang when storing as DataFrame?

I'm trying to store the results of an Oracle SQL query in a DataFrame and the execution hangs indefinitely. But when I print the query results, they come out instantly. What is causing the problem when saving this as a DataFrame?
import cx_Oracle
import pandas as pd
dsn_tns = cx_Oracle.makedsn('HOST', 'PORT', service_name='SID')
conn = cx_Oracle.connect(user='USER', password='PASSWORD', dsn=dsn_tns)
curr =conn.cursor()
curr.execute('alter session set current_schema= apps')
df = pd.read_sql('select * from TABLE', curr)
#### THE ALTERNATIVE CODE TO PRINT THE RESULTS
# curr.execute('select * from TABLE')
# for line in curr:
#     print(line)
curr.close()
conn.close()
Pandas' read_sql requires a connection object for its con argument, not the result of a cursor's execute. Also, consider using SQLAlchemy, the recommended interface between pandas and databases, where you define the schema in the engine connection string. This engine also allows to_sql calls.
engine = create_engine("oracle+cx_oracle://user:pwd#host:port/dbname")
df = pd.read_sql('select * from TABLE', con=engine)
engine.dispose()
And as mentioned in this DBA post, in Oracle users and schemas are essentially the same thing (unlike in other RDBMSs). Therefore, try passing apps as the user in the create_engine call with the needed credentials:
engine = create_engine("oracle+cx_oracle://apps:PASSWORD#HOST:PORT/SID")
df = pd.read_sql('select * from TABLE', con=engine)
engine.dispose()
