Edit - I am using Windows 10
Is there a faster alternative to pd.read_sql_query for an MS SQL database?
I was using pandas to read the data and add some columns and calculations. I have cut out most of those alterations now, so I am basically just reading the data (1-2 million rows per day; my query reads all of the data from the previous date) and saving it to a local database (Postgres).
The server I am connecting to is across the world, and I have no privileges other than to query for the data. I want the solution to remain in Python if possible, but I'd like to speed it up and remove any overhead. Also, you can see that I am writing a file to disk temporarily and then opening it to COPY FROM STDIN. Is there a way to skip the file creation? The file is sometimes over 500 MB, which seems like a waste.
engine = create_engine(engine_name)
query = 'SELECT * FROM {} WHERE row_date = %s;'
df = pd.read_sql_query(query.format(table_name), engine, params=(query_date,))
df.to_csv('../raw/temp_table.csv', index=False)
df= open('../raw/temp_table.csv')
process_file(conn=pg_engine, table_name=table_name, file_object=df)
UPDATE:
You can also try to unload the data using the bcp utility, which might be a lot faster than pd.read_sql(), but you will need a local installation of the Microsoft Command Line Utilities for SQL Server.
After that you can use PostgreSQL's COPY ... FROM ...
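A minimal sketch of how the bcp export could be driven from Python. The function names and the argument values are placeholders of mine; the bcp options used (queryout, -S, -d, -U, -P, -c, -t) are the standard ones for a character-mode export, but check them against your installed version.

```python
import subprocess


def build_bcp_export(query, out_file, server, database, user, password, sep=","):
    """Return the argument list for a bcp `queryout` export in character mode."""
    return [
        "bcp", query, "queryout", out_file,
        "-S", server,    # server name
        "-d", database,  # database name
        "-U", user,
        "-P", password,
        "-c",            # character (text) mode
        "-t", sep,       # field terminator
    ]


def run_bcp_export(**kwargs):
    """Run the export; raises CalledProcessError if bcp exits non-zero."""
    subprocess.run(build_bcp_export(**kwargs), check=True)
```

Passing an argument list instead of one command string avoids shell quoting problems with the query text. Note that character-mode bcp output does not quote fields, so pick a separator that cannot occur in the data.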
OLD answer:
you can try to write your DF directly to PostgreSQL (skipping the df.to_csv(...) and df= open('../raw/temp_table.csv') parts):
from sqlalchemy import create_engine
engine = create_engine(engine_name)
query = 'SELECT * FROM {} WHERE row_date = %s;'
df = pd.read_sql_query(query.format(table_name), engine, params=(query_date,))
pg_engine = create_engine('postgresql+psycopg2://user:password@host:port/dbname')
df.to_sql(table_name, pg_engine, if_exists='append')
Just test whether it's faster compared to COPY FROM STDIN...
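If COPY FROM STDIN turns out to be the faster route, the temporary file can probably be replaced by an in-memory buffer. The sketch below assumes a psycopg2 cursor (its copy_expert method) and works on plain row tuples; the function names are mine, not from the question's process_file.

```python
import csv
import io


def rows_to_buffer(rows):
    """Serialize rows to an in-memory CSV buffer and rewind it for reading."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    buf.seek(0)
    return buf


def copy_rows(cursor, table_name, rows):
    """Stream rows into table_name with COPY ... FROM STDIN, no file on disk.

    `cursor` is assumed to be a psycopg2 cursor; table_name is interpolated
    directly, so it must come from trusted code, not user input.
    """
    buf = rows_to_buffer(rows)
    cursor.copy_expert(
        "COPY {} FROM STDIN WITH (FORMAT csv)".format(table_name), buf
    )
```

With a DataFrame, the buffer can be filled with df.to_csv(buf, index=False, header=False) instead of csv.writer. The whole result still sits in memory once, so for very large days it may be worth copying in chunks.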
Related
I am extracting millions of rows from SQL Server and inserting them into an Oracle database using Python. Inserting takes about 1 second per record, so the load takes hours. What is the fastest approach to load the data?
My code below:
def insert_data(conn, cursor, query, data, batch_size=10000):
    recs = []
    count = 1
    for rec in data:
        recs.append(rec)
        # send a full batch in one round trip, then start a new one
        if count % batch_size == 0:
            cursor.executemany(query, recs, batcherrors=True)
            conn.commit()
            recs = []
        count = count + 1
    # flush the final partial batch, if any
    if recs:
        cursor.executemany(query, recs, batcherrors=True)
        conn.commit()
Perhaps you cannot buy a third-party ETL tool, but you can certainly write a procedure in PL/SQL in the Oracle database.
First, install the Oracle Transparent Gateway for ODBC. No license cost involved.
Second, in the Oracle db, create a db link to reference the MSSQL database via the gateway.
Third, write a PL/SQL procedure to pull the data from the MSSQL database, via the db link.
I was once presented with a problem similar to yours. A developer was using SSIS to copy around a million rows from MSSQL to Oracle, and it was taking over 4 hours. I ran a trace on his process and saw that it was copying row by row, slow by slow. It took me less than 30 minutes to write a PL/SQL procedure to copy the data, and it completed in less than 4 minutes.
I give a high-level view of the entire setup and process, here:
EDIT:
Thought you might like to see exactly how simple the actual procedure is:
create or replace procedure my_load_proc
as
begin
    insert into my_oracle_table (col_a,
                                 col_b,
                                 col_c)
    select sql_col_a,
           sql_col_b,
           sql_col_c
      from mssql_tbl@mssql_link;
end;
My actual procedure has more to it, dealing with run-time logging, emailing notification of completion, etc. But the above is the 'guts' of it, pulling the data from mssql into oracle.
Then you might want to use pandas, PySpark, or another big data framework available in Python.
There are a lot of examples out there; here is how to load the data, adapted from the Microsoft Docs:
import pyodbc
import pandas as pd
import cx_Oracle
from sqlalchemy import create_engine
server = 'servername'
database = 'AdventureWorks'
username = 'yourusername'
password = 'yourpassword'
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)
cursor = cnxn.cursor()
query = "SELECT [CountryRegionCode], [Name] FROM Person.CountryRegion;"
df = pd.read_sql(query, cnxn)
# you do data manipulation that is needed here
# then insert data into oracle
conn = create_engine('oracle+cx_oracle://xxxxxx')
df.to_sql(table_name, conn, index=False, if_exists="replace")
Something like that (it might not work 100%, but it should give you an idea of how you can do it).
I am working on a product where I have to write a Python script that fetches big files (around 1-1.5 GB), does some processing, and finally uploads the results into other tables multiple times. I wrote code for this, but it takes far too long; it mostly gets stuck when uploading the files into the tables. I want to optimize the uploading and the fetching from the DB, and I need help with that.
My function for Creating connection with Database:
def create_sqlalchemy_engine(server, db, username, passwrd, driver):
    # Build the connection URL; any error from create_engine propagates to the caller
    engine = create_engine("mssql+pyodbc://{user}:{pw}@{server}/{db}?driver={drivr}"
                           .format(user=username,
                                   server=server,
                                   pw=passwrd,
                                   db=db,
                                   drivr=driver))
    return engine
For Fetching File:
df = pd.read_sql_query('''
SELECT *
FROM {}'''.format(Table_A), engine)
For Uploading:
df.to_sql(table_name)
Reading SQL Query: Use the chunksize param to read the result in pieces, which keeps memory use bounded and can speed things up on large tables.
If specified, return an iterator where chunksize is the number of rows to include in each chunk.
documentation on available params:
# with chunksize set, this returns an iterator of DataFrames rather than one DataFrame
chunks = pd.read_sql_query('''
SELECT *
FROM {}'''.format(Table_A), engine, chunksize=1000)
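To make the chunking idea concrete without a real server, here is a small sketch using the DB-API fetchmany call against an in-memory SQLite table that stands in for the source database. The table name and the iter_chunks helper are illustrative only; with pandas, looping over the iterator returned by read_sql_query with chunksize plays the same role.

```python
import sqlite3


def iter_chunks(cursor, size):
    """Yield lists of at most `size` rows until the result set is exhausted."""
    while True:
        rows = cursor.fetchmany(size)
        if not rows:
            return
        yield rows


# In-memory SQLite table standing in for the real source table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (id INTEGER)")
conn.executemany("INSERT INTO table_a VALUES (?)", [(i,) for i in range(2500)])

cur = conn.execute("SELECT id FROM table_a ORDER BY id")
chunk_sizes = [len(chunk) for chunk in iter_chunks(cur, 1000)]
print(chunk_sizes)  # [1000, 1000, 500]
```

Each chunk can be processed and written out before the next one is fetched, so only one chunk needs to be held in memory at a time.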
Writing to SQL: You can speed up writing to the SQL database in two steps.
Set fast_executemany=True in create_engine, link to the documentation. Make sure you're using SQLAlchemy 1.3 or later.
Change your df.to_sql code to the following:
df.to_sql(table_name, con=engine, index=False, if_exists="append", schema="dbo", chunksize=1000)
Remove index=False from the call above if you need the index written. The meaning of those params can be found in the documentation.
I'm looking for an efficient way to import data from a CSV file into a PostgreSQL table using Python, in batches, as I have quite large files and the server I'm importing the data to is far away. I need an efficient solution, as everything I tried was either slow or just didn't work. I'm using SQLAlchemy.
I wanted to use raw SQL but it's so hard to parameterize and I need multiple loops to execute the query for multiple rows
I was given the task of manipulating & migrating some data from CSV files into a remote Postgres Instance.
I decided to use the Python script below:
import csv
import uuid
import psycopg2
import psycopg2.extras
import time
# Instant Time at the start of the Script
start = time.time()
psycopg2.extras.register_uuid()
# List of CSV Files that I want to manipulate & migrate.
file_list = ["Address.csv"]
conn = psycopg2.connect("host=localhost dbname=address user=postgres password=docker")
cur = conn.cursor()
i = 1
for f in file_list:
    f = open(f)
    csv_f = csv.reader(f)
    next(csv_f)
    for row in csv_f:
        # Some simple manipulations on each row
        # Inserting a uuid4 into the first column
        row.pop(0)
        row.insert(0, uuid.uuid4())
        row.pop(10)
        row.insert(10, False)
        row.pop(13)
        # Tracking the number of rows inserted
        print(i)
        i = i + 1
        # INSERT QUERY
        postgres_insert_query = """ INSERT INTO "public"."address"("address_id","address_line_1","locality_area_street","address_name","app_version","channel_type","city","country","created_at","first_name","is_default","landmark","last_name","mobile","pincode","territory","updated_at","user_id") VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
        record_to_insert = row
        cur.execute(postgres_insert_query, record_to_insert)
    f.close()
conn.commit()
conn.close()
print(time.time()-start)
The script worked quite well and promptly when testing it locally. But connecting to a remote Database Server added a lot more latency.
As a workaround, I migrated the manipulated data into my local postgres instance.
I then generated a .sql file of the migrated data & manually imported the .sql file on the remote server.
Alternatively, you can also use Python's Multithreading features, to launch multiple concurrent connections to the remote server and dedicate an isolated batch process to each connection, and flush the data.
This should make your migration considerably faster.
I have personally not tried the multi threading approach as it wasn't required in my case. But it seems darn efficient.
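For what it's worth, the multithreaded idea could look roughly like the sketch below. It is untested against a real server: `connect` is a hypothetical zero-argument callable that returns a new DB-API connection (for example a lambda wrapping psycopg2.connect), and each batch gets its own connection and its own commit.

```python
from concurrent.futures import ThreadPoolExecutor


def split_batches(rows, batch_size):
    """Cut a list of rows into consecutive batches of at most batch_size."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def load_batch(connect, insert_sql, batch):
    """Open a dedicated connection, insert one batch, commit, and close."""
    conn = connect()
    try:
        cur = conn.cursor()
        cur.executemany(insert_sql, batch)
        conn.commit()
    finally:
        conn.close()
    return len(batch)


def parallel_load(connect, insert_sql, rows, batch_size=1000, workers=4):
    """Insert all rows using `workers` threads; returns the number of rows sent."""
    batches = split_batches(rows, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda b: load_batch(connect, insert_sql, b), batches))
```

Because every batch commits on its own, a failure part-way through leaves the earlier batches in the table, so this only suits loads that can be re-run or cleaned up afterwards.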
Hope this helped ! :)
Resources:
CSV Manipulation using Python for Beginners.
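Use the copy_from command; it copies all the rows into the table in one operation.
path = open('file.csv', 'r')
next(path)  # skip the header row
# copy_from defaults to tab-separated input, so pass sep=',' for a comma-separated file
cur.copy_from(path, 'table_name', sep=',', columns=('id', 'name', 'email'))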
I am a new Python coder and also a new data scientist so please forgive any foolish sounding things here. I'll keep the details out unless anyone's curious but basically I need to connect to Microsoft SQL Server and upload a Pandas DF that is relatively large (~500k rows) and I need to do this almost every day as the project currently stands.
It doesn't have to be a Pandas DF - I've read about using odo for csv files but I haven't been able to get anything to work. The issue I'm having is that I can't bulk insert the DF because the file isn't on the same machine as the SQL Server instance. I'm consistently getting errors like the following:
pyodbc.ProgrammingError: ('42000', "[42000] [Microsoft][ODBC SQL
Server Driver][SQL Server]Incorrect syntax near the keyword 'IF'.
(156) (SQLExecDirectW)")
As I've attempted different SQL statements you can replace IF with whatever has been the first COL_NAME in the CREATE statement. I'm using SQLAlchemy to create the engine and connect to the database. This may go without saying but the pd.to_sql() method is just way too slow for how much data I'm moving so that's why I need something faster.
I'm using Python 3.6 by the way. I've put down here most of the things that I've tried that haven't been successful.
import pandas as pd
from sqlalchemy import create_engine
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 1)), columns=['test_col'])
address = 'mssql+pyodbc://uid:pw@server/path/database?driver=SQL Server'
engine = create_engine(address)
connection = engine.raw_connection()
cursor = connection.cursor()
# Attempt 1 <- This failed to even create a table at the cursor_execute statement so my issues could be way in the beginning here but I know that I have a connection to the SQL Server because I can use pd.to_sql() to create tables successfully (just incredibly slowly for my tables of interest)
create_statement = """
DROP TABLE test_table
CREATE TABLE test_table (test_col)
"""
cursor.execute(create_statement)
test_insert = '''
INSERT INTO test_table
(test_col)
values ('abs');
'''
cursor.execute(test_insert)
# Attempt 2 <- From iabdb WordPress blog I came across
def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))
records = [str(tuple(x)) for x in take_rates.values]
insert_ = """
INSERT INTO test_table
("A")
VALUES
"""
for batch in chunker(records, 2): # This would be set to 1000 in practice I hope
    print(batch)
    rows = str(batch).strip('[]')
    print(rows)
    insert_rows = insert_ + rows
    print(insert_rows)
    cursor.execute(insert_rows)
#conn.commit() # don't know when I would need to commit
conn.close()
# Attempt 3 # From a related Stack Exchange Post
# create the table, but first drop it if it already exists
command = """DROP TABLE IF EXISTS test_table
CREATE TABLE test_table # these columns are from my real dataset
"Serial Number" serial primary key,
"Dealer Code" text,
"FSHIP_DT" timestamp without time zone,
;"""
cursor.execute(command)
connection.commit()
# stream the data using 'to_csv' and StringIO(); then use sql's 'copy_from' function
output = io.StringIO()
# ignore the index
take_rates.to_csv(output, sep='~', header=False, index=False)
# jump to start of stream
output.seek(0)
contents = output.getvalue()
cur = connection.cursor()
# null values become ''
cur.copy_from(output, 'Config_Take_Rates_TEST', null="")
connection.commit()
cur.close()
It seems to me that MS SQL Server is just not a nice Database to play around with...
I want to apologize for the rough formatting - I've been at this script for weeks now but just finally decided to try to organize something for StackOverflow. Thank you very much for any help anyone can offer!
If you only need to replace the existing table, truncate it and use the bcp utility to upload the data. It's much faster.
from subprocess import call
command = "TRUNCATE TABLE test_table"  # run this through your existing cursor before loading
# header=False so the column names are not loaded as a data row
take_rates.to_csv('take_rates.csv', sep='\t', index=False, header=False)
# error_file is a placeholder for the path of bcp's -e error log
call('bcp {t} in {f} -S {s} -U {u} -P {p} -d {db} -c -t "{sep}" -r "{nl}" -e {e}'.format(t='test_table', f='take_rates.csv', s=server, u=user, p=password, db=database, sep='\t', nl='\n', e=error_file), shell=True)
You will need to install the bcp utility (yum install mssql-tools on CentOS/RedHat).
'DROP TABLE IF EXISTS test_table' is only valid T-SQL on SQL Server 2016 and later; on older versions it is a syntax error.
On older versions you can do something like this:
if (object_id('test_table') is not null)
    DROP TABLE test_table
I have a dataframe in Python. Can I write this data to Redshift as a new table?
I have successfully created a db connection to Redshift and am able to execute simple sql queries.
Now I need to write a dataframe to it.
You can use to_sql to push data to a Redshift database. I've been able to do this using a connection to my database through a SQLAlchemy engine. Just be sure to set index = False in your to_sql call. The table will be created if it doesn't exist, and you can specify whether you want your call to replace the table, append to the table, or fail if the table already exists.
from sqlalchemy import create_engine
import pandas as pd
conn = create_engine('postgresql://username:password@yoururl.com:5439/yourdatabase')
df = pd.DataFrame([{'A': 'foo', 'B': 'green', 'C': 11},{'A':'bar', 'B':'blue', 'C': 20}])
df.to_sql('your_table', conn, index=False, if_exists='replace')
Note that you may need to pip install psycopg2 in order to connect to Redshift through SQLAlchemy.
to_sql Documentation
import pandas_redshift as pr
pr.connect_to_redshift(dbname = <dbname>,
host = <host>,
port = <port>,
user = <user>,
password = <password>)
pr.connect_to_s3(aws_access_key_id = <aws_access_key_id>,
aws_secret_access_key = <aws_secret_access_key>,
bucket = <bucket>,
subdirectory = <subdirectory>)
# Write the DataFrame to S3 and then to redshift
pr.pandas_to_redshift(data_frame = data_frame,
redshift_table_name = 'gawronski.nba_shots_log')
Details: https://github.com/agawronski/pandas_redshift
I tried using pandas df.to_sql() but it was tremendously slow. It was taking me well over 10 minutes to insert 50 rows. See this open issue (as of writing)
I tried using odo from the blaze ecosystem (as per the recommendations in the issue discussion), but faced a ProgrammingError which I didn't bother to investigate into.
Finally what worked:
import psycopg2
# Fill in the blanks for the conn object
conn = psycopg2.connect(user = 'user',
password = 'password',
host = 'host',
dbname = 'db',
port = 666)
cursor = conn.cursor()
# Adjust ... according to number of columns
args_str = b','.join(cursor.mogrify("(%s,%s,...)", x) for x in tuple(map(tuple,np_data)))
cursor.execute("insert into table (a,b,...) VALUES "+args_str.decode("utf-8"))
cursor.close()
conn.commit()
conn.close()
Yep, plain old psycopg2. This is for a numpy array but converting from a df to a ndarray shouldn't be too difficult. This gave me around 3k rows/minute.
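The multi-row VALUES trick can also be written with bound parameters and a batch size, so a single statement never grows without limit. This is only a sketch of that variant, not the code above: it runs against an in-memory SQLite database as a stand-in, the helper name is mine, and for psycopg2 you would pass placeholder="%s".

```python
import sqlite3


def multirow_insert(cursor, table, columns, rows, batch_size=100, placeholder="?"):
    """Insert rows using one multi-row VALUES statement per batch."""
    row_tpl = "(" + ",".join([placeholder] * len(columns)) + ")"
    col_list = ",".join(columns)
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        sql = "INSERT INTO {} ({}) VALUES {}".format(
            table, col_list, ",".join([row_tpl] * len(batch))
        )
        # Flatten the batch so the values line up with the placeholders.
        flat = [value for row in batch for value in row]
        cursor.execute(sql, flat)


# SQLite stands in for the real target database here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
rows = [(i, "row%d" % i) for i in range(250)]
multirow_insert(conn.cursor(), "t", ["a", "b"], rows)
conn.commit()
row_count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

The table and column names are interpolated directly, so they must come from trusted code; only the values go through parameter binding.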
However, the fastest solution, as recommended by teammates, is to dump the dataframe as a TSV/CSV into an S3 bucket and then load it with the COPY command. You should look into this if you're copying really huge datasets. (I will update here if and when I try it out)
Assuming you have access to S3, this approach should work:
Step 1: Write the DataFrame as a csv to S3 (I use AWS SDK boto3 for this)
Step 2: You know the columns, datatypes, and key/index for your Redshift table from your DataFrame, so you should be able to generate a create table script and push it to Redshift to create an empty table
Step 3: Send a copy command from your Python environment to Redshift to copy data from S3 into the empty table created in step 2
Step 4: Before your cloud storage folks start yelling at you, delete the csv from S3
Works like a charm every time.
If you see yourself doing this several times, wrapping all four steps in a function keeps it tidy.
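A rough sketch of such a wrapper, under some assumptions: `upload`, `execute`, and `delete` are hypothetical callables you would supply (for example thin wrappers around boto3 and your Redshift cursor), the uploaded CSV is assumed to have no header row, and the column types are given by hand rather than derived from the DataFrame.

```python
def build_create_table(table, columns):
    """columns is a list of (name, redshift_type) pairs."""
    cols = ", ".join("{} {}".format(name, type_) for name, type_ in columns)
    return "CREATE TABLE IF NOT EXISTS {} ({});".format(table, cols)


def build_copy(table, bucket, key, iam_role):
    """Build the Redshift COPY statement that loads a CSV object from S3."""
    return "COPY {} FROM 's3://{}/{}' IAM_ROLE '{}' CSV;".format(
        table, bucket, key, iam_role
    )


def load_via_s3(upload, execute, delete, table, columns, bucket, key, iam_role):
    """Run the four steps in order; the S3 object is removed even if a step fails."""
    upload(bucket, key)                                     # step 1: CSV to S3
    try:
        execute(build_create_table(table, columns))         # step 2: empty table
        execute(build_copy(table, bucket, key, iam_role))   # step 3: COPY from S3
    finally:
        delete(bucket, key)                                 # step 4: clean up
```

Keeping the SQL builders separate from the I/O makes it easy to print and inspect the statements before running them against the cluster.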
I used to rely on pandas to_sql() function, but it is just too slow. I have recently switched to doing the following:
import pandas as pd
import s3fs # great module which allows you to read/write to s3 easily
import sqlalchemy
df = pd.DataFrame([{'A': 'foo', 'B': 'green', 'C': 11},{'A':'bar', 'B':'blue', 'C': 20}])
s3 = s3fs.S3FileSystem(anon=False)
filename = 'my_s3_bucket_name/file.csv'
with s3.open(filename, 'w') as f:
    df.to_csv(f, index=False, header=False)
con = sqlalchemy.create_engine('postgresql://username:password@yoururl.com:5439/yourdatabase')
# make sure the schema for mytable exists
# if you need to delete the table but not the schema leave DELETE mytable
# if you want to only append, I think just removing the DELETE mytable would work
con.execute("""
DELETE mytable;
COPY mytable
from 's3://%s'
iam_role 'arn:aws:iam::xxxx:role/role_name'
csv;""" % filename)
The role has to allow Redshift access to S3; see here for more details.
I found that for a 300KB file (12000x2 dataframe) this takes 4 seconds compared to the 8 minutes I was getting with pandas to_sql() function
For the purpose of this conversation Postgres = RedShift
You have two options:
Option 1:
From Pandas:
http://pandas.pydata.org/pandas-docs/stable/io.html#io-sql
The pandas.io.sql module provides a collection of query wrappers to both facilitate data retrieval and to reduce dependency on DB-specific API. Database abstraction is provided by SQLAlchemy if installed. In addition you will need a driver library for your database. Examples of such drivers are psycopg2 for PostgreSQL or pymysql for MySQL.
Writing DataFrames
Assuming the following data is in a DataFrame data, we can insert it into the database using to_sql().
id Date Col_1 Col_2 Col_3
26 2012-10-18 X 25.7 True
42 2012-10-19 Y -12.4 False
63 2012-10-20 Z 5.73 True
In [437]: data.to_sql('data', engine)
With some databases, writing large DataFrames can result in errors due to packet size limitations being exceeded. This can be avoided by setting the chunksize parameter when calling to_sql. For example, the following writes data to the database in batches of 1000 rows at a time:
In [438]: data.to_sql('data_chunked', engine, chunksize=1000)
Option 2
Or you can simply do your own
If you have a dataframe called data simply loop over it using iterrows:
for row in data.iterrows():
then add each row to your database. I would use copy instead of insert for each row, as it will be much faster.
http://initd.org/psycopg/docs/usage.html#using-copy-to-and-copy-from
None of the answers solved my problem, so I googled and found the following snippet, which completed the work in 2 minutes. I am using Python 3.8.5 on Windows.
from red_panda import RedPanda
import pandas as pd
df = pd.read_csv('path_to_read_csv_file')
redshift_conf = {
    "user": "username",
    "password": "password",
    "host": "hostname",
    "port": 5439,  # port number as an integer (5439 is the Redshift default)
    "dbname": "dbname",
}
aws_conf = {
    "aws_access_key_id": "<access_key>",
    "aws_secret_access_key": "<secret_key>",
    # "aws_session_token": "temporary-token-if-you-have-one",
}
rp = RedPanda(redshift_conf, aws_conf)
s3_bucket = "bucketname"
s3_path = "subfolder if any" # optional, if you don't have any sub folders
s3_file_name = "filename" # optional, randomly generated if not provided
rp.df_to_redshift(df, "table_name", bucket=s3_bucket, path=s3_path, append=False)
for more info check out the package on github here