ODBC for SQL Server in Python

I have a requirement to extract data from SQL Server and create a .csv file from each of numerous tables. I wrote a Python script for this that uses a pyodbc/turbodbc connection with the SQL Server ODBC drivers. It works sometimes, but it disconnects when it hits a large table (over 11 million rows), and performance is very slow. I also tried FreeTDS, but it looks about the same as pyodbc in terms of performance.
This is my connection:
pyodbc.connect(Driver='/opt/microsoft/msodbcsql17/lib64/libmsodbcsql-17.5.so.2.1',server=systemname,UID=user_name,PWD=pwd)
def connect_to_SQL_Server(logins):
    '''Connects to SQL Server.
    Returns connection object or None
    '''
    con = None
    try:
        hostname = logins['hostname']
        username = logins['sql_username']
        password = logins['snow_password']
        #con = turbodbc.connect(Driver='/usr/lib64/libtdsodbc.so',server=hostname,UID=username,PWD=password,TDS_Version=8.0)
        #con = pyodbc.connect(Driver='/usr/lib64/libtdsodbc.so',server=hostname,UID=username,PWD=password,TDS_Version=8.0,Trace='Yes',ForceTrace='Yes',TraceFile='/maxbill_mvp_data/all_data/sql.log')
        con = pyodbc.connect(Driver='/opt/microsoft/msodbcsql17/lib64/libmsodbcsql-17.5.so.2.1',server=hostname,UID=username,PWD=password)
        #con = turbodbc.connect(Driver='/opt/microsoft/msodbcsql17/lib64/libmsodbcsql-17.5.so.2.1',server=hostname,UID=username,PWD=password)
        #con = pyodbc.connect(DSN='MSSQLDEV',server=hostname,UID=username,PWD=password)
        return con
    except (pyodbc.ProgrammingError, Exception) as error:
        logging.critical(error)

sqlCon = connect_to_SQL_Server(logins)
sql = 'select * from table'
i = 0
for partial_df in pd.read_sql(sql, sqlCon, chunksize=300000):
    #chunk.to_csv(f+'_'+str(i)+'.csv',index = False,header = False,sep = ',',mode = 'a+')
    partial_df.to_csv(filenamewithpath + '_' + str(i) + '.csv.gz', compression='gzip', index=False, sep='\01', header=False, mode='a+')
    i += 1
Are there any parameters I can try for better performance? Note that these Python scripts run on a different server from the one hosting SQL Server, a Linux cloud instance.
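A minimal sketch of one alternative worth trying: stream rows with pyodbc's fetchmany() and write them with the csv and gzip modules directly, which skips the per-chunk DataFrame construction. The batch size, output file name, and query below are illustrative assumptions, and hostname/username/password are assumed to be defined as in the connect function above.

import csv
import gzip
import pyodbc

# Same driver path as in the question; credentials are assumed to be defined already.
con = pyodbc.connect(Driver='/opt/microsoft/msodbcsql17/lib64/libmsodbcsql-17.5.so.2.1',
                     server=hostname, UID=username, PWD=password)
cursor = con.cursor()
cursor.execute('select * from table')        # placeholder query, as in the question

batch_size = 100000                          # assumed batch size; tune for memory/network
with gzip.open('table_export.csv.gz', 'wt', newline='') as fh:
    writer = csv.writer(fh, delimiter='\x01')
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        writer.writerows(rows)               # pyodbc Row objects are iterable, so csv accepts them

cursor.close()
con.close()

If the disconnects persist even with smaller batches, it may point to a network or server-side timeout rather than a client-side parameter.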

Related

Connect to Oracle Database from SQLAlchemy in Python on AWS EC2

I'm using Python JupyterLab inside a Docker container, which runs on an AWS EC2 instance. The container has Oracle Instant Client installed, so everything is set up. The problem is that I'm still having trouble connecting from this container to my AWS RDS Oracle database, but only when using SQLAlchemy.
When I make the connection using a cx-Oracle==8.2.1 engine:
host = '***********************'
user = '*********'
password = '**********'
port = '****'
service = '****'
dsn_tns = cx_Oracle.makedsn(host, port, service)
engine_oracle = cx_Oracle.connect(user=user, password=password, dsn=dsn_tns)
Everything works fine. I can read tables using pandas read_sql(), I can create tables using cx_Oracle execute(), etc.
But when I try to take a DataFrame and send it to my RDS using pandas to_sql(), my cx_Oracle connection returns the error:
DatabaseError: ORA-01036: illegal variable name/number
I then tried to use a SQLAlchemy==1.4.22 engine from the string:
tns = """
(DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = %s)(PORT = %s))
    (CONNECT_DATA =
        (SERVER = DEDICATED)
        (SERVICE_NAME = %s)
    )
)
""" % (host, port, service)
engine_alchemy = create_engine('oracle+cx_oracle://%s:%s@%s' % (user, password, tns))
But I get this error:
DatabaseError: ORA-12154: TNS:could not resolve the connect identifier specified
And I keep getting this error even when I try to use pandas read_sql with the SQLAlchemy engine. Thus, I ran out of options. Can somebody help me please?
EDIT*
I tried again with SQLAlchemy==1.3.9 and it worked. Does anybody know why?
The code I'm using for reading and sending a test table from and to Oracle is:
sql = """
SELECT
    *
FROM
    DADOS_MIS.DR_ACIO_ATIVOS_HASH
WHERE
    ROWNUM <= 5"""
df = pd.read_sql(sql, engine_oracle)

dtyp1 = {c: 'VARCHAR2(' + str(df[c].str.len().max()) + ')'
         for c in df.columns[df.dtypes == 'object'].tolist()}
dtyp2 = {c: 'NUMBER'
         for c in df.columns[df.dtypes == 'float64'].tolist()}
dtyp3 = {c: 'DATE'
         for c in df.columns[df.dtypes == 'datetime'].tolist()}
dtyp4 = {c: 'NUMBER'
         for c in df.columns[df.dtypes == 'int64'].tolist()}

dtyp_total = dtyp1
dtyp_total.update(dtyp2)
dtyp_total.update(dtyp3)
dtyp_total.update(dtyp4)

df.to_sql(name='teste', con=engine_oracle, if_exists='replace', dtype=dtyp_total, index=False)
The dtyp_total is:
{'IDENTIFICADOR': 'VARCHAR2(32)',
'IDENTIFICADOR_PRODUTO': 'VARCHAR2(32)',
'DATA_CHAMADA': 'VARCHAR2(19)',
'TABULACAO': 'VARCHAR2(25)'}
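For context, which may explain both errors above: pandas' to_sql officially supports only a SQLAlchemy connectable (or a sqlite3 connection), so handing it a raw cx_Oracle connection commonly ends in errors such as ORA-01036, and the SQLAlchemy cx_oracle dialect documents passing the service name as a URL query parameter instead of embedding a full TNS descriptor in the URL. A sketch under those assumptions, reusing the host, port, service, credentials, sql, and dtyp_total defined above:

from sqlalchemy import create_engine

# Assumed URL form documented by the cx_oracle dialect:
# oracle+cx_oracle://user:password@host:port/?service_name=...
engine_alchemy = create_engine(
    'oracle+cx_oracle://%s:%s@%s:%s/?service_name=%s' % (user, password, host, port, service)
)

df = pd.read_sql(sql, engine_alchemy)
df.to_sql(name='teste', con=engine_alchemy, if_exists='replace', dtype=dtyp_total, index=False)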

How to speed up pandas to_sql

I am trying to upload data to an MS Azure SQL database using pandas to_sql, and it takes a very long time. I often have to run it before I go to bed; when I wake up in the morning it is done, but it has taken several hours, and if an error comes up I am not there to address it. Here is the code I have:
params = urllib.parse.quote_plus(
    'Driver=%s;' % driver +
    'Server=%s,1433;' % server +
    'Database=%s;' % database +
    'Uid=%s;' % username +
    'Pwd={%s};' % password +
    'Encrypt=yes;' +
    'TrustServerCertificate=no;'
)
conn_str = 'mssql+pyodbc:///?odbc_connect=' + params
engine = create_engine(conn_str)

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True
        cursor.commit()

connection = engine.connect()
connection
Then I run this command for the sql ingestion:
master_data.to_sql('table_name', engine, chunksize=500, if_exists='append', method='multi',index=False)
I have played around with the chunksize, and the sweet spot seems to be 100, which isn't fast enough considering I usually upload 800,000-2,000,000 records at a time. If I increase it beyond that, I get an error that seems to be related only to the chunk size.
OperationalError: (pyodbc.OperationalError) ('08S01', '[08S01] [Microsoft][ODBC Driver 17 for SQL Server]Communication link failure (0) (SQLExecDirectW)')
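One thing worth trying, as a sketch rather than a confirmed fix: since SQLAlchemy 1.3 the mssql+pyodbc dialect accepts fast_executemany=True directly on create_engine, which removes the need for the event listener, and with fast_executemany enabled to_sql is normally run with the default method=None rather than method='multi'. The connection string below reuses the question's params; the chunk size is an assumption to tune.

from sqlalchemy import create_engine

# Assumed: conn_str built exactly as in the question above.
engine = create_engine(conn_str, fast_executemany=True)

master_data.to_sql('table_name', engine, chunksize=100000,
                   if_exists='append', method=None, index=False)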
Not sure if you have your issue resolved, but I did want to provide an answer here with Python-specific information on the Azure SQL Database libraries and some useful resources for investigating and resolving this issue, as applicable.
An example of using pyodbc to directly query an Azure SQL Database:
Quickstart: Use Python to query Azure SQL Database Single Instance & Managed Instance
An example of using Pandas dataframe: How to read and write to an Azure SQL database from a Pandas dataframe
main.py
"""Read write to Azure SQL database from pandas"""
import pyodbc
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

# 1. Constants
AZUREUID = 'myuserid'                                   # Azure SQL database userid
AZUREPWD = '************'                               # Azure SQL database password
AZURESRV = 'shareddatabaseserver.database.windows.net'  # Azure SQL database server name (fully qualified)
AZUREDB = 'Pandas'                                      # Azure SQL database name
TABLE = 'DataTable'                                     # Azure SQL database table name
DRIVER = 'ODBC Driver 13 for SQL Server'                # ODBC Driver

def main():
    """Main function"""
    # 2. Build a connection string
    connectionstring = 'mssql+pyodbc://{uid}:{password}@{server}:1433/{database}?driver={driver}'.format(
        uid=AZUREUID,
        password=AZUREPWD,
        server=AZURESRV,
        database=AZUREDB,
        driver=DRIVER.replace(' ', '+'))

    # 3. Read dummy data into a dataframe
    df = pd.read_csv('./data/data.csv')

    # 4. Create SQLAlchemy engine and write data to SQL
    engn = create_engine(connectionstring)
    df.to_sql(TABLE, engn, if_exists='append')

    # 5. Read data from SQL into a dataframe
    query = 'SELECT * FROM {table}'.format(table=TABLE)
    dfsql = pd.read_sql(query, engn)
    print(dfsql.head())

if __name__ == "__main__":
    main()
Finally, the following resources should help with comparing specific implementations and performance issues. The Stack Overflow thread is likely the best resource, while the monitoring and performance tuning document is useful for investigating and mitigating any server-side performance issues:
Speeding up pandas.DataFrame.to_sql with fast_executemany of pyODBC
Monitoring and performance tuning in Azure SQL Database and Azure SQL Managed Instance
Regards,
Mike
params = urllib.parse.quote_plus(
    'Driver=%s;' % driver +
    'Server=%s,1433;' % server +
    'Database=%s;' % database +
    'Uid=%s;' % username +
    'Pwd={%s};' % password +
    'Encrypt=yes;' +
    'TrustServerCertificate=no;'
)
conn_str = 'mssql+pyodbc:///?odbc_connect=' + params
engine = create_engine(conn_str)

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True
        cursor.commit()

connection = engine.connect()
connection
Database ingestion is done with the next line. I had issues with chunksize before, but fixed them by adding the method and index arguments.
ingest_data.to_sql('db_table_name', engine, if_exists='append',chunksize=100000, method=None,index=False)

Getting error on python while transferring data from SQL server to snowflake

I am getting the error below:
query = command % processed_params
TypeError: not all arguments converted during string formatting
I am trying to pull data from SQL Server and then insert it into Snowflake. My code is below:
import pyodbc
import sqlalchemy
import snowflake.connector

driver = 'SQL Server'
server = 'tanmay'
db1 = 'testing'
tcon = 'no'
uname = 'sa'
pword = '123'

cnxn = pyodbc.connect(driver='{SQL Server}',
                      host=server, database=db1, trusted_connection=tcon,
                      user=uname, password=pword)
cursor = cnxn.cursor()
cursor.execute("select * from Admin_tbldbbackupdetails")
rows = cursor.fetchall()

#for row in rows:
#    #data = [(row[0], row[1],row[2], row[3],row[4], row[5],row[6], row[7])]
print(rows[0])

cnxn.commit()
cnxn.close()

connection = snowflake.connector.connect(user='****', password='****', account='*****')
cursor2 = connection.cursor()
cursor2.execute("USE WAREHOUSE FOOD_WH")
cursor2.execute("USE DATABASE Test")

sql1 = ("INSERT INTO CN_RND.Admin_tbldbbackupdetails_ip"
        "(id,dbname, dbpath, backupdate, backuptime, backupStatus, FaildMsg, Backupsource)"
        "values (?,?,?,?,?,?,?,?)")
cursor2.execute(sql1, *rows[0])
It's obviously a string formatting error: the statement gets %-formatted and not all of the placeholders receive matching parameters.
If you cannot fix it, step back and try another approach.
Use another script to achieve the same thing and get back to your bug tomorrow :-)
My script is doing pretty much the same:
1. Connect to SQL Server
2. fetchmany
3. Multipart upload to S3
4. COPY INTO the Snowflake table
Details are here: Snowpipe-for-SQLServer
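For context (not stated in the thread): the Snowflake Python connector binds parameters with pyformat (%s placeholders) by default, so a statement written with ? placeholders ends up being %-formatted against the row, which matches the TypeError above. A minimal sketch of one way around it, keeping the question's table and first fetched row, is to switch the connector to qmark binding before connecting:

import snowflake.connector

# Module-level switch to qmark binding so '?' placeholders are honored.
snowflake.connector.paramstyle = 'qmark'

connection = snowflake.connector.connect(user='****', password='****', account='*****')
cursor2 = connection.cursor()
cursor2.execute("USE WAREHOUSE FOOD_WH")
cursor2.execute("USE DATABASE Test")

sql1 = ("INSERT INTO CN_RND.Admin_tbldbbackupdetails_ip "
        "(id, dbname, dbpath, backupdate, backuptime, backupStatus, FaildMsg, Backupsource) "
        "VALUES (?,?,?,?,?,?,?,?)")
# Pass the row as a single sequence of bind values rather than unpacking it.
cursor2.execute(sql1, list(rows[0]))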

How to fetch data from read-only mysql database using python?

I am trying to fetch data in Python from a MySQL database using a username that has read-only permission. I am using the mysql.connector package to connect to the database.
It connects to the database properly, as I checked with the following:
connection = mysql.connector.connect(host = HOSTNAME, user = USERNAME, passwd = PASSWORD, db = DATABASE, port=PORT)
print(connection.cmd_statistics())
But when I try to fetch data from the database using a cursor, it returns 'None'.
My code is:
cursor = connection.cursor()
try:
    query1 = 'SELECT * FROM table_name'
    result = cursor.execute(query1)
    print(result)
finally:
    connection.close()
And the output is:
None
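For context (a property of mysql.connector rather than something stated in the thread): cursor.execute() does not return the result set, so printing its return value gives None regardless of permissions; the rows have to be fetched from the cursor afterwards. A minimal sketch:

cursor = connection.cursor()
try:
    cursor.execute('SELECT * FROM table_name')
    rows = cursor.fetchall()   # fetch the result set explicitly
    for row in rows:
        print(row)
finally:
    cursor.close()
    connection.close()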
This works with Python 3.6.5 and MySQL Workbench 8.0; I have not tried other Python versions.
import _mysql_connector

avi = _mysql_connector.MySQL()
avi.connect(host='127.0.0.1', user='root', port=3306, password='root', database='hr_table')
avi.query("select * from hr_table.countries")
row = avi.fetch_row()
while row:
    print(row)
    row = avi.fetch_row()
avi.free_result()
avi.close()

Python to SQL Server Insert

I'm trying to follow the method for inserting a pandas DataFrame into SQL Server that is mentioned here, as it appears to be the fastest way to import a lot of rows.
However, I am struggling to figure out the connection parameters.
I am not using a DSN; I have a server name, a database name, and a trusted connection (i.e. Windows login).
import sqlalchemy
import urllib
server = 'MYServer'
db = 'MyDB'
cxn_str = "DRIVER={SQL Server Native Client 11.0};SERVER=" + server +",1433;DATABASE="+db+";Trusted_Connection='Yes'"
#cxn_str = "Trusted_Connection='Yes',Driver='{ODBC Driver 13 for SQL Server}',Server="+server+",Database="+db
params = urllib.parse.quote_plus(cxn_str)
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
conn = engine.connect().connection
cursor = conn.cursor()
I'm just not sure what the correct way to specify my connection string is. Any suggestions?
I have been working with pandas and SQL Server for a while, and the fastest way I found to insert a lot of data into a table was this:
You can create a temporary CSV using:
df.to_csv('new_file_name.csv', sep=',', encoding='utf-8')
Then use pyodbc and the BULK INSERT Transact-SQL statement:
import pyodbc
conn = pyodbc.connect(DRIVER='{SQL Server}', Server='server_name', Database='Database_name', trusted_connection='yes')
cur = conn.cursor()
cur.execute("""BULK INSERT table_name
FROM 'C:\\Users\\folders path\\new_file_name.csv'
WITH
(
CODEPAGE = 'ACP',
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)""")
conn.commit()
cur.close()
conn.close()
Then you can delete the file:
import os
os.remove('new_file_name.csv')
It took about a second to load a lot of data at once into SQL Server. I hope this gives you an idea.
Note: don't forget to have a field for the index. That was my mistake when I started using this, lol.
Connection string parameter values should not be enclosed in quotes so you should use Trusted_Connection=Yes instead of Trusted_Connection='Yes'.
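A sketch of the question's connection string with that correction applied; the server and database names are the question's placeholders:

import urllib
import sqlalchemy

server = 'MYServer'
db = 'MyDB'

# Same string as in the question, with the quotes around Yes removed.
cxn_str = ("DRIVER={SQL Server Native Client 11.0};"
           "SERVER=" + server + ",1433;"
           "DATABASE=" + db + ";"
           "Trusted_Connection=Yes")
params = urllib.parse.quote_plus(cxn_str)
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
conn = engine.connect().connection
cursor = conn.cursor()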
