I am working on a product where I have to write a Python script that fetches large files (around 1-1.5 GB) from a database, does some processing, and finally uploads the results into other tables, multiple times. I wrote code for this, but it takes far too long to run. Most of the time is spent uploading the files into the tables, so I want to optimize the fetching and uploading steps. I need help from you guys on that.
My function for creating a connection to the database:
def create_sqlalchemy_engine(server, db, username, passwrd, driver):
    try:
        engine = create_engine("mssql+pyodbc://{user}:{pw}@{server}/{db}?driver={drivr}"
                               .format(user=username,
                                       server=server,
                                       pw=passwrd,
                                       db=db,
                                       drivr=driver))
    except Exception as e:
        raise e
    return engine
For fetching the file:
df = pd.read_sql_query('''
    SELECT *
    FROM {}'''.format(Table_A), engine)
For uploading:
df.to_sql(table_name, engine)
Reading the SQL query: use the chunksize parameter to speed things up. If specified, read_sql_query returns an iterator where chunksize is the number of rows to include in each chunk. See the pandas documentation for the available parameters:
df = pd.read_sql_query('''
    SELECT *
    FROM {}'''.format(Table_A), engine, chunksize=1000)
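Note that when chunksize is set, read_sql_query returns an iterator of DataFrames rather than a single frame, so the result has to be consumed in a loop; a minimal sketch (the per-chunk processing step is a placeholder):

for chunk in pd.read_sql_query('''
    SELECT *
    FROM {}'''.format(Table_A), engine, chunksize=1000):
    # placeholder: do the per-chunk processing here
    process_chunk(chunk)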
Writing to SQL: you can speed up writing to the SQL database in two steps.
Set fast_executemany=True in create_engine (see the SQLAlchemy documentation) and make sure you're using SQLAlchemy 1.3 or later.
Change your df.to_sql code to the following:
df.to_sql(table_name, con=engine, index=False, if_exists="append", schema="dbo", chunksize=1000)
Remove index=False from the above if you do want the index written. The meaning of these parameters can be found in the documentation.
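Putting the two steps together, a hedged end-to-end sketch (assuming SQLAlchemy 1.3+ with pyodbc; the driver name, chunk sizes and table names are illustrative):

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine(
    "mssql+pyodbc://{user}:{pw}@{server}/{db}?driver=ODBC+Driver+17+for+SQL+Server".format(
        user=username, pw=passwrd, server=server, db=db),
    fast_executemany=True)  # batches the parameterized INSERTs on the pyodbc side

for chunk in pd.read_sql_query("SELECT * FROM {}".format(Table_A),
                               engine, chunksize=100000):
    # per-chunk processing goes here
    chunk.to_sql(table_name, con=engine, index=False,
                 if_exists="append", schema="dbo", chunksize=1000)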
Related
I am trying to run this code once a day to log the dataframes and build a historical dataset.
I have connected to MySQL with pymysql and saved my pandas dataframe into MySQL using the .to_sql method.
However, if I run this code a second time, the table name collides and it won't run again.
Therefore I need to change the name of the table (data_day001, data_day002, data_day003, ...) each time I run this code.
from sqlalchemy import create_engine

# Credentials to database connection
hostname="hostname"
dbname="sql_database"
uname="admin"
pwd="password"
# Create SQLAlchemy engine to connect to MySQL Database
engine = create_engine("mysql+pymysql://{user}:{pw}#{host}/{db}"
.format(host=hostname, db=dbname, user=uname, pw=pwd))
# Convert dataframe to sql table
channel_data.to_sql('data_day001', engine, index=False)
Please advise me how I could solve this problem.
Thank you so much in advance.
Use the inspect function:
from sqlalchemy import create_engine, inspect
def get_table_name(engine):
    names = inspect(engine).get_table_names()
    return f"data_day{len(names):03}"
engine = create_engine(...)
channel_data.to_sql(get_table_name(engine), engine, index=False)
After some days:
>>> inspect(engine).get_table_names()
['data_day000', 'data_day001', 'data_day002', 'data_day003', 'data_day004']
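If the schema can also contain tables that do not follow the data_dayNNN pattern, a hypothetical refinement (not part of the original answer) is to count only the matching names:

def get_table_name(engine):
    # count only the tables that follow the data_dayNNN naming scheme
    day_tables = [n for n in inspect(engine).get_table_names()
                  if n.startswith("data_day")]
    return f"data_day{len(day_tables):03}"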
What approach should I follow to download DDL, DML and stored procedures from a Teradata database using Python?
I have created the sample code below, but what is the right approach to download these SQL files for the data migration process?
udaExec = teradata.UdaExec(appName="HelloWorld", version="1.0", logConsole=False)
session = udaExec.connect(method="odbc", system="xxx", username="xxx", password="xxx")
for row in session.execute("show tables {} > {}".format(tables, export_tables)):
    print(row)
Unlike MSSQL, which has mssql-scripter to download .sql files, does Teradata provide any such option? Also, does it provide support for downloading sequences, views and procedures?
For the schema migration process, what should be the best approach to download these files from Teradata as a source?
Happy to share that I found a solution for this. To get the files in .sql format, use the code below to extract the DDL and DML.
The given code uses the sample database dbc.
with teradatasql.connect(host='enter_host_ip', user='---', password='---') as connect:
    # get the table names (exported earlier via a SELECT statement) from a csv file
    tables_df = pd.read_csv("result.csv", index_col=None)
    for table_name in tables_df['TableName']:
        query = "SHOW TABLE DBC." + table_name
        try:
            ddl_df = pd.read_sql(query, connect)
            ddl_text = ddl_df['Request Text'][0]
            write_path = "C:\\Users\\SQL\\" + table_name + ".sql"
            with open(write_path, 'a') as f:
                f.write(ddl_text)
        except Exception as e:
            # log the tables whose DDL could not be extracted
            print(table_name, e)
Note: out of 192 tables I was able to get DDL/DML scripts for 189. The remaining tables needed manual intervention.
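For the views and procedures the question also asks about, a hedged extension of the same loop is to use Teradata's SHOW VIEW / SHOW PROCEDURE statements; the csv file, column name and output column ('Request Text') below are assumptions mirroring the table loop above:

views_df = pd.read_csv("result_views.csv", index_col=None)  # hypothetical list of view names
for view_name in views_df['ViewName']:
    try:
        ddl = pd.read_sql("SHOW VIEW DBC." + view_name, connect)['Request Text'][0]
        with open("C:\\Users\\SQL\\" + view_name + ".sql", 'a') as f:
            f.write(ddl)
    except Exception as e:
        print(view_name, e)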
I am extracting millions of rows from SQL Server and inserting them into an Oracle DB using Python. It is taking about 1 second per record to insert into the Oracle table, so the load takes hours. What is the fastest approach to load the data?
My code below:
def insert_data(conn, cursor, query, data, batch_size=10000):
    recs = []
    count = 1
    for rec in data:
        recs.append(rec)
        if count % batch_size == 0:
            cursor.executemany(query, recs, batcherrors=True)
            conn.commit()
            recs = []
        count = count + 1
    cursor.executemany(query, recs, batcherrors=True)
    conn.commit()
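For reference, the reading side can be batched the same way so rows are never handled one at a time; a minimal sketch with pyodbc fetchmany feeding cx_Oracle executemany (connection strings, column names and batch size are illustrative assumptions):

import pyodbc
import cx_Oracle

src = pyodbc.connect(mssql_conn_str)        # placeholder connection string
dst = cx_Oracle.connect(oracle_conn_str)    # placeholder connection string
src_cur, dst_cur = src.cursor(), dst.cursor()

src_cur.execute("SELECT col_a, col_b, col_c FROM source_table")
insert_sql = "INSERT INTO target_table (col_a, col_b, col_c) VALUES (:1, :2, :3)"

while True:
    batch = src_cur.fetchmany(10000)        # pull rows from SQL Server in batches
    if not batch:
        break
    dst_cur.executemany(insert_sql, [tuple(r) for r in batch])
    dst.commit()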
Perhaps you cannot buy a 3rd-party ETL tool, but you can certainly write a procedure in PL/SQL in the Oracle database.
First, install the Oracle Transparent Gateway for ODBC. There is no license cost involved.
Second, in the Oracle DB, create a database link that references the MSSQL database via the gateway.
Third, write a PL/SQL procedure to pull the data from the MSSQL database via the db link.
I was once presented with a problem similar to yours. A developer was using SSIS to copy around a million rows from MSSQL to Oracle, and it was taking over 4 hours. I ran a trace on his process and saw that it was copying row by row, slow by slow. It took me less than 30 minutes to write a PL/SQL proc to copy the data, and it completed in less than 4 minutes.
I give a high-level view of the entire setup and process, here:
EDIT:
Thought you might like to see exactly how simple the actual procedure is:
create or replace procedure my_load_proc
as
begin
    insert into my_oracle_table (col_a,
                                 col_b,
                                 col_c)
    select sql_col_a,
           sql_col_b,
           sql_col_c
      from mssql_tbl@mssql_link;
end;
My actual procedure has more to it, dealing with run-time logging, emailing notification of completion, etc. But the above is the 'guts' of it, pulling the data from mssql into oracle.
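Since the rest of the pipeline is in Python, the procedure can then be invoked from the existing script; a minimal sketch with cx_Oracle (the connect string is a placeholder):

import cx_Oracle

conn = cx_Oracle.connect("user/password@oracle_host/service_name")  # placeholder DSN
cur = conn.cursor()
cur.callproc("my_load_proc")  # the copy runs entirely inside the Oracle database
conn.commit()
conn.close()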
Then you might want to use pandas, PySpark or other big-data frameworks available in Python.
There are a lot of examples out there; here is how to load the data, based on the Microsoft docs:
import pyodbc
import pandas as pd
import cx_Oracle
from sqlalchemy import create_engine
server = 'servername'
database = 'AdventureWorks'
username = 'yourusername'
password = 'yourpassword'
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)
cursor = cnxn.cursor()
query = "SELECT [CountryRegionCode], [Name] FROM Person.CountryRegion;"
df = pd.read_sql(query, cnxn)
# you do data manipulation that is needed here
# then insert data into oracle
conn = create_engine('oracle+cx_oracle://xxxxxx')
df.to_sql(table_name, conn, index=False, if_exists="replace")
Something like that (it might not work 100%, but it should give you an idea of how you can do it).
I'm trying to insert data with Python from SQL Server into a Snowflake table. It works in general, but if I want to insert a bigger chunk of data, it gives me an error:
snowflake connector SQL compilation error: maximum number of expressions in a list exceeded, expected at most 16,384
I'm using the Snowflake connector for Python. So it only works if you insert at most 16,384 rows at once. My table has over a million records, and I don't want to use CSV files.
I was able to insert > 16k recs using sqlalchemy and pandas as:
pandas_df.to_sql(sf_table, con=engine, index=False, if_exists='append', chunksize=16000)
where engine is sqlalchemy.create_engine(...)
This is not the ideal way to load data into Snowflake, but since you specified that you didn't want to create CSV files, you could look into loading the data into a pandas dataframe and then using the write_pandas function in the Python connector, which will (behind the scenes) leverage a flat file and a COPY INTO statement, the fastest way to get data into Snowflake. The issue with this method is likely that pandas requires a lot of memory on the machine running the script. There is a chunk_size parameter, though, so you can control it with that.
https://docs.snowflake.com/en/user-guide/python-connector-api.html#write_pandas
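A minimal sketch of that approach with the connector's pandas_tools (account details, table name and chunk size are placeholders, and the target table is assumed to exist already):

import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

conn = snowflake.connector.connect(
    account='your_account', user='your_user', password='your_password',
    warehouse='your_wh', database='your_db', schema='PUBLIC')

# stages the dataframe as files and runs COPY INTO behind the scenes
success, nchunks, nrows, _ = write_pandas(
    conn, pandas_df, 'TARGET_TABLE', chunk_size=100000)
print(success, nchunks, nrows)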
For whoever is facing that problem, below is a complete solution to connect and insert data into Snowflake using SQLAlchemy.
from sqlalchemy import create_engine
import pandas as pd
snowflake_username = 'username'
snowflake_password = 'password'
snowflake_account = 'accountname'
snowflake_warehouse = 'warehouse_name'
snowflake_database = 'database_name'
snowflake_schema = 'public'
engine = create_engine(
    'snowflake://{user}:{password}@{account}/{db}/{schema}?warehouse={warehouse}'.format(
        user=snowflake_username,
        password=snowflake_password,
        account=snowflake_account,
        db=snowflake_database,
        schema=snowflake_schema,
        warehouse=snowflake_warehouse,
    ), echo_pool=True, pool_size=10, max_overflow=20
)
try:
    connection = engine.connect()
    results = connection.execute('select current_version()').fetchone()
    print(results[0])
    df.columns = map(str.upper, df.columns)
    df.to_sql('table'.lower(), con=connection, schema='amostraschema', index=False, if_exists='append', chunksize=16000)
finally:
    connection.close()
    engine.dispose()
Use executemany with server-side binding (qmark paramstyle). This will stage the data in files behind the scenes and allows you to insert more than 16,384 rows.
con = snowflake.connector.connect(
    account='',
    user='',
    password='',
    database='',
    paramstyle='qmark')
sql = "insert into tablename (col1, col2) values (?, ?)"
rows = [[1, 2], [3, 4]]
con.cursor().executemany(sql, rows)
See https://docs.snowflake.com/en/user-guide/python-connector-example.html#label-python-connector-binding-batch-inserts for more details
Note: this will not work with client-side binding, the %s (pyformat) style.
Edit - I am using Windows 10
Is there a faster alternative to pd.read_sql_query for a MS SQL database?
I was using pandas to read the data and add some columns and calculations. I have cut out most of the alterations now, and I am basically just reading the data (1-2 million rows per day at a time; my query reads all of the data from the previous date) and saving it to a local database (Postgres).
The server I am connecting to is across the world and I have no privileges other than to query the data. I want the solution to remain in Python if possible, but I'd like to speed it up and remove any overhead. Also, you can see that I am writing a file to disk temporarily and then opening it to COPY FROM STDIN. Is there a way to skip the file creation? It is sometimes over 500 MB, which seems like a waste.
engine = create_engine(engine_name)
query = 'SELECT * FROM {} WHERE row_date = %s;'
df = pd.read_sql_query(query.format(table_name), engine, params=[query_date])
df.to_csv('../raw/temp_table.csv', index=False)
df = open('../raw/temp_table.csv')
process_file(conn=pg_engine, table_name=table_name, file_object=df)
UPDATE:
you can also try to unload the data using the bcp utility, which might be a lot faster compared to pd.read_sql(), but you will need a local installation of the Microsoft Command Line Utilities for SQL Server.
After that you can use PostgreSQL's COPY ... FROM on the exported file.
OLD answer:
you can try to write your DF directly to PostgreSQL (skipping the df.to_csv(...) and df= open('../raw/temp_table.csv') parts):
from sqlalchemy import create_engine
engine = create_engine(engine_name)
query = 'SELECT * FROM {} WHERE row_date = %s;'
df = pd.read_sql_query(query.format(table_name), engine, params=[query_date])
pg_engine = create_engine('postgresql+psycopg2://user:password@host:port/dbname')
df.to_sql(table_name, pg_engine, if_exists='append')
Just test whether it's faster compared to COPY FROM STDIN...
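To address the "skip the file creation" part of the question: if df.to_sql is still too slow, a hedged middle ground is to stream the dataframe through an in-memory buffer straight into COPY, so nothing touches the disk (the psycopg2 connection string and table name are placeholders):

import io
import psycopg2

pg_conn = psycopg2.connect("dbname=dbname user=user password=password host=host")  # placeholder
buf = io.StringIO()
df.to_csv(buf, index=False, header=False)  # CSV is written to memory, not to disk
buf.seek(0)

with pg_conn.cursor() as cur:
    cur.copy_expert("COPY {} FROM STDIN WITH (FORMAT csv)".format(table_name), buf)
pg_conn.commit()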