How to speed up pandas to_sql - python

I am trying to upload data to a Microsoft Azure SQL database using pandas to_sql and it takes very long. I often have to start it before I go to bed; by morning it is done, but it has taken several hours, and if an error comes up overnight I am not there to address it. Here is the code I have:
import urllib.parse
from sqlalchemy import create_engine, event

params = urllib.parse.quote_plus(
    'Driver=%s;' % driver +
    'Server=%s,1433;' % server +
    'Database=%s;' % database +
    'Uid=%s;' % username +
    'Pwd={%s};' % password +
    'Encrypt=yes;' +
    'TrustServerCertificate=no;'
)
conn_str = 'mssql+pyodbc:///?odbc_connect=' + params
engine = create_engine(conn_str)

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    # Enable pyodbc's fast_executemany for bulk INSERTs
    if executemany:
        cursor.fast_executemany = True

connection = engine.connect()
Then I run this command for the SQL ingestion:
master_data.to_sql('table_name', engine, chunksize=500, if_exists='append', method='multi', index=False)
I have played around with the chunksize and the sweet spot seems to be 100, which isn't fast enough considering I usually try to upload 800,000-2,000,000 records at a time. If I increase it beyond that, I get an error which seems to be related only to the chunk size:
OperationalError: (pyodbc.OperationalError) ('08S01', '[08S01] [Microsoft][ODBC Driver 17 for SQL Server]Communication link failure (0) (SQLExecDirectW)')

Not sure if you have resolved your issue, but I wanted to provide an answer here with Python-specific information for Azure SQL Database and some useful resources to investigate and resolve this issue, as applicable.
An example of using pyodbc to directly query an Azure SQL Database:
Quickstart: Use Python to query Azure SQL Database Single Instance & Managed Instance
An example of using a Pandas dataframe: How to read and write to an Azure SQL database from a Pandas dataframe
main.py
"""Read write to Azure SQL database from pandas"""
import pyodbc
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
# 1. Constants
AZUREUID = 'myuserid' # Azure SQL database userid
AZUREPWD = '************' # Azure SQL database password
AZURESRV = 'shareddatabaseserver.database.windows.net' # Azure SQL database server name (fully qualified)
AZUREDB = 'Pandas' # Azure SQL database name (if it does not exit, pandas will create it)
TABLE = 'DataTable' # Azure SQL database table name
DRIVER = 'ODBC Driver 13 for SQL Server' # ODBC Driver
def main():
"""Main function"""
# 2. Build a connectionstring
connectionstring = 'mssql+pyodbc://{uid}:{password}#{server}:1433/{database}?driver={driver}'.format(
uid=AZUREUID,
password=AZUREPWD,
server=AZURESRV,
database=AZUREDB,
driver=DRIVER.replace(' ', '+'))
# 3. Read dummydata into dataframe
df = pd.read_csv('./data/data.csv')
# 4. Create SQL Alchemy engine and write data to SQL
engn = create_engine(connectionstring)
df.to_sql(TABLE, engn, if_exists='append')
# 5. Read data from SQL into dataframe
query = 'SELECT * FROM {table}'.format(table=TABLE)
dfsql = pd.read_sql(query, engn)
print(dfsql.head())
if __name__ == "__main__":
main()
And finally, the following resources should assist in comparing specific implementations and investigating performance issues. The Stack Overflow thread below is likely the best resource, while the monitoring and performance tuning document is useful for investigating and mitigating any server-side performance issues, etc.
Speeding up pandas.DataFrame.to_sql with fast_executemany of pyODBC
Monitoring and performance tuning in Azure SQL Database and Azure SQL Managed Instance
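To illustrate the fast_executemany approach from that thread: recent versions of SQLAlchemy (1.3 and later) accept the flag directly in create_engine, so no event listener is needed. A minimal sketch, assuming the connectionstring and dataframe from the example above:
from sqlalchemy import create_engine

# fast_executemany batches the INSERT parameters at the ODBC layer
# instead of issuing one round trip per row
engn = create_engine(connectionstring, fast_executemany=True)
df.to_sql(TABLE, engn, if_exists='append', index=False, chunksize=10000)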
Regards,
Mike

import urllib.parse
from sqlalchemy import create_engine, event

params = urllib.parse.quote_plus(
    'Driver=%s;' % driver +
    'Server=%s,1433;' % server +
    'Database=%s;' % database +
    'Uid=%s;' % username +
    'Pwd={%s};' % password +
    'Encrypt=yes;' +
    'TrustServerCertificate=no;'
)
conn_str = 'mssql+pyodbc:///?odbc_connect=' + params
engine = create_engine(conn_str)

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    # Enable pyodbc's fast_executemany for bulk INSERTs
    if executemany:
        cursor.fast_executemany = True

connection = engine.connect()
Database ingestion is done with this next line. I had issues with chunksize before but fixed them by adding the method and index arguments.
ingest_data.to_sql('db_table_name', engine, if_exists='append', chunksize=100000, method=None, index=False)
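Since the original problem was a multi-hour upload failing unattended, here is a minimal sketch of splitting the upload into explicit chunks so a failure is reported with the offending chunk rather than silently killing the whole run. It assumes the ingest_data dataframe and engine above; the chunk size is illustrative:
chunk_size = 100000
total = (len(ingest_data) + chunk_size - 1) // chunk_size
for i, start in enumerate(range(0, len(ingest_data), chunk_size)):
    chunk = ingest_data.iloc[start:start + chunk_size]
    try:
        chunk.to_sql('db_table_name', engine, if_exists='append', index=False)
        print('chunk %d/%d written' % (i + 1, total))
    except Exception as ex:
        print('chunk %d/%d failed: %s' % (i + 1, total, ex))
        raise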

Related

How can I access an Azure SQL Database from a Python Function App?

I need to create a connection to an Azure SQL Database (Azure-Server:1) from an Azure Function which is hosted in Azure-Server:2. Both accounts are different, but I need to fetch some data from the Azure SQL Database hosted in Azure-Server:1.
Is it even possible?
I tried:
import pandas as pd
import pyodbc
from sqlalchemy import create_engine

server = 'server.database.windows.net'
database = 'A'
username = 'B'
password = '###'

cnxn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER=' + server + ';DATABASE=' + database + ';UID=' + username + ';PWD=' + password)
cursor = cnxn.cursor()

driver = 'ODBC Driver 17 for SQL Server'
DATABASE_CONNECTION = f'mssql+pyodbc://{username}:{password}@{server}/{database}?driver={driver}'
engine = create_engine(DATABASE_CONNECTION, fast_executemany=True)
connection = engine.connect()
The above code works when the Azure Function connects to a SQL DB on the same Azure server, but it does not work for the SQL database on the other, linked Azure server.
The way you are connecting to the Azure database from your Python Azure Function is correct.
conn = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=IP_ADDRESS;DATABASE=DataLake;UID=USERID;PWD=PASSWORD"
quotedConnection = quote_plus(conn)
db_con = 'mssql+pyodbc:///?odbc_connect={}'.format(quotedConnection )
engine = create_engine(db_con)
#event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
if executemany:
cursor.fast_executemany = True
If you want to use two different database connections, make them separate calls so that you can avoid timeout issues: either call the databases one by one, or set up two separate database connections, as sketched below.
Refer here to use different calls to connect to databases.
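For illustration, a minimal sketch of keeping the two servers as two separate engines, one per call; the server names and credentials below are placeholders:
from urllib.parse import quote_plus
from sqlalchemy import create_engine

def make_engine(server, database, user, pwd):
    # Build one engine per server so each call gets its own connection
    conn = ('DRIVER={ODBC Driver 17 for SQL Server};SERVER=%s;DATABASE=%s;UID=%s;PWD=%s'
            % (server, database, user, pwd))
    return create_engine('mssql+pyodbc:///?odbc_connect={}'.format(quote_plus(conn)))

engine1 = make_engine('azure-server-1.database.windows.net', 'A', 'user1', 'pwd1')
engine2 = make_engine('azure-server-2.database.windows.net', 'B', 'user2', 'pwd2')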

SQL Server slow/slower than Python/SQL Alchemy

Our team has been experiencing slow returns for our SQL queries in Microsoft SQL Server Management Studio. This started recently, and the slowness fluctuates seemingly randomly (it doesn't correlate with when large amounts of data are being written to the DB). A new data point is that sending the same query using Python's pandas and SQLAlchemy libraries returns data much more quickly.
Python:
import urllib.parse

import pandas as pd
from sqlalchemy import create_engine

database = 'database'
# sqlserver holds the server name and is defined elsewhere
params = urllib.parse.quote_plus(
    'DRIVER={ODBC Driver 17 for SQL Server};' +
    'SERVER=' + sqlserver + ';DATABASE=' + database + ';Trusted_Connection=yes;')
engine = create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
df = pd.read_sql('SELECT * FROM table', con=engine)
SQL in SSMS:
SELECT * FROM table
Both return the same data.
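To quantify the difference on the Python side, a minimal timing sketch, assuming the engine built above (the table name is illustrative):
import time

start = time.perf_counter()
df = pd.read_sql('SELECT * FROM table', con=engine)
print('%d rows in %.2f s' % (len(df), time.perf_counter() - start))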

Python: existing connection forcibly closed by remote host with pandas read_sql and to_sql across different databases

Trying to read SQL from one database and put the data into a dataframe, and then write that dataframe to a SQL table in another database. Each time I try to do this I get the following error:
TCP Provider: An existing connection was forcibly closed by the remote host.\r\n (10054) (SQLExecute); [08S01] [Microsoft][ODBC Driver 17 for SQL Server]Communication link failure (10054)')
The code works when I read SQL from one database and put the dataframe into a SQL table in the SAME database.
My two connections are enginemb and engine. I read from enginemb and then want to use the df to write into a SQL table through engine.
Example code of the engine connection string (I have the same for enginemb, but with a different server and user credentials):
import urllib.parse

import pandas as pd
import sqlalchemy as db
from sqlalchemy import event

# SQL CONNECTION TO DAMDB
params = urllib.parse.quote_plus("DRIVER={ODBC Driver 17 for SQL Server};"
                                 "SERVER=server.database.windows.net;"
                                 "DATABASE=db;"
                                 "UID=admin;"
                                 "PWD=**********")
engine = db.create_engine("mssql+pyodbc:///?odbc_connect={}".format(params))

def write_to_df():
    # Added to make insert into SQL faster using cursor.fast_executemany
    @event.listens_for(enginemb, 'before_cursor_execute')
    def plugin_bef_cursor_execute(conn, cursor, statement, params, context, executemany):
        if executemany:
            cursor.fast_executemany = True

    df = pd.read_sql(sql='EXEC [prod].[spData] ?', con=enginemb, params=['A'])
    return df

def write_db(df):
    # Added to make insert into SQL faster using cursor.fast_executemany
    @event.listens_for(engine, 'before_cursor_execute')
    def plugin_bef_cursor_execute(conn, cursor, statement, params, context, executemany):
        if executemany:
            cursor.fast_executemany = True

    df.to_sql('tblData', con=engine, if_exists='append', index=False, schema='tmp')

def main():
    df = write_to_df()
    write_db(df)

if __name__ == '__main__':
    main()
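As an aside, a minimal sketch of an alternative setup: SQLAlchemy 1.3 and later accept fast_executemany directly in create_engine, which avoids registering event listeners inside each function. The params_mb / params variables stand in for the two quoted connection strings:
from sqlalchemy import create_engine

# One engine per database, both with fast_executemany enabled up front
enginemb = create_engine('mssql+pyodbc:///?odbc_connect={}'.format(params_mb), fast_executemany=True)
engine = create_engine('mssql+pyodbc:///?odbc_connect={}'.format(params), fast_executemany=True)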

Connect to SQL Server and run query as "passthrough" from Python

I currently have code that executes queries on data stored on a SQL Server database, such as the following:
import pyodbc

conn = pyodbc.connect(
    r'DRIVER={SQL Server};'
    r'SERVER=SQL2SRVR;'
    r'DATABASE=DBO732;'
    r'Trusted_Connection=yes;'
)
sqlstr = '''
SELECT Company, Street_Address, City, State
FROM F556
WHERE [assume complicated criteria statement here]
'''
crsr = conn.cursor()
for row in crsr.execute(sqlstr):
    print(row.Company, row.Street_Address, row.City, row.State)
I can't find documentation online on whether pyodbc can run my queries on the SQL Server as passthrough queries (or does so by default), or whether, if pyodbc can't do that, there is another way (maybe sqlalchemy or similar?) of doing it. Any insight?
Or is there a way to execute passthrough queries directly from Pandas?
If you are working with pandas and SQL Server then you should already have created a SQLAlchemy Engine object (usually named engine). To execute a raw DML statement you can use this construct:
from sqlalchemy import text

with engine.begin() as conn:
    # text() marks the string as a literal SQL statement (required in SQLAlchemy 2.0+)
    conn.execute(text("UPDATE table_name SET column_name ..."))
    print("table updated")

Passing data to a stored procedure that accepts Table Valued Parameter using pyodbc

Trying to send data to a stored procedure that accepts a table-valued parameter. Getting the following error:
[Error] ('HY004', '[HY004] [Microsoft][ODBC SQL Server Driver]Invalid SQL data type (0) (SQLBindParameter)')
I know it is due to a datatype mismatch, but how do I correct it?
When I used SQL Server Profiler, I saw the following:
exec sp_sproc_columns N'[MyTestTvp]',N'dbo',@ODBCVer=3
Python Code
import pandas as pd
import pyodbc

def main():
    cnxn = pyodbc.connect("Driver={SQL Server};Server=dataserver;UID=UserName;PWD=Password#123;Database=MySQLServerDatabase;")
    dfInput = pd.read_sql_query('exec dbo.usp_Temp_GetAllPatientBKs_ToEncrypt ?', cnxn, params=['None'])

    c01 = [1, 2, 3]
    param_array = []
    for i in range(3):
        param_array.append([c01[i]])

    try:
        cursor = cnxn.cursor()
        # The TVP is a list of rows, wrapped in an outer list for the single parameter
        result_array = cursor.execute("EXEC dbo.[MyTestTvp] ?", [param_array]).fetchall()
        cursor.commit()  # very important to commit
    except Exception as ex:
        print("Failed to execute MyTestTvp")
        print("Exception: [" + type(ex).__name__ + "]", ex.args)

if __name__ == "__main__":
    main()
TVP in SQL Server
CREATE TYPE dbo.[MyList] AS TABLE
(
    [Id] INT NOT NULL
);

-- create stored procedure
CREATE PROCEDURE dbo.[MyTestTvp]
(
    @tvp dbo.[MyList] READONLY
)
AS
BEGIN
    SET NOCOUNT ON;
    SELECT * FROM @tvp
END
UPDATE
Thanks a lot to Gord Thompson. Based on the answer posted by Gord Thompson, I changed the connection:
cnxn = pyodbc.connect("Driver={ODBC Driver 13 for SQL Server};Server=dataserver.sandbox.rcoanalytics.com;UID=SimpleTest;PWD=SimpleTest#123;Database=RCO_DW;")
Then I got the following error:
Data source name not found and no default driver specified
Referred to pyodbc + MySQL + Windows: Data source name not found and no default driver specified
Then I installed ODBC Driver 13 for SQL Server on the server via the ODBC Data Source Administrator, in the System DSN tab:
Control Panel > Systems and Security > Administrative Tools > ODBC Data Sources
REFERENCES
Step 1: Configure development environment for pyodbc Python development
Step 2: Create a SQL database for pyodbc Python development
Step 3: Proof of concept connecting to SQL using pyodbc
Python on Azure
I was able to reproduce your issue. You are using the very old "SQL Server" ODBC driver, which was written for SQL Server 2000. TVPs were introduced in SQL Server 2008.
So, you are getting the error because the driver you are using does not understand TVPs; they did not exist at the time the driver was created. You will need to use a more modern version of the driver, e.g., "ODBC Driver 17 for SQL Server".
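A minimal sketch of the corrected call with a modern driver, reusing the server, credentials, and TVP objects from the question; the only substantive change is the driver name:
import pyodbc

# Same TVP call as above, but with a driver that understands table-valued parameters
cnxn = pyodbc.connect("Driver={ODBC Driver 17 for SQL Server};Server=dataserver;UID=UserName;PWD=Password#123;Database=MySQLServerDatabase;")
cursor = cnxn.cursor()
rows = cursor.execute("EXEC dbo.[MyTestTvp] ?", [[[1], [2], [3]]]).fetchall()
cnxn.commit()
print(rows)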
