Reading a huge PostgreSQL table using psycopg2 - python

I'm trying to read a huge PostgreSQL table (~3 million rows of jsonb data, ~30GB size) to do some ETL in Python. I use psycopg2 for working with the database. I want to execute a Python function for each row of the PostgreSQL table and save the results in a .csv file.
The problem is that I need to select the whole 30GB table, and the query runs for a very long time without any possibility to monitor progress. I have found out that there exists a cursor parameter called itersize which determines the number of rows to be buffered on the client.
So I have written the following code:
import psycopg2
conn = psycopg2.connect("host=... port=... dbname=... user=... password=...")
cur = conn.cursor()
cur.itersize = 1000
sql_statement = """
select * from <HUGE TABLE>
"""
cur.execute(sql_statement)
for row in cur:
    print(row)
cur.close()
conn.close()
Since the cursor buffers 1000 rows at a time on the client, I expect the following behavior:
1. The Python script buffers the first 1000 rows
2. We enter the for loop and print the buffered 1000 rows in the console
3. We reach the point where the next 1000 rows have to be buffered
4. The Python script buffers the next 1000 rows
5. GOTO 2
However, the code just hangs on the cur.execute() statement and no output is printed in the console. Why? Could you please explain what exactly is happening under the hood?
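For context, psycopg2 only honours itersize on named (server-side) cursors; a regular client-side cursor transfers the entire result set during execute(), before iteration even starts, which is why the script appears to hang there. Below is a minimal sketch of the named-cursor variant, with the connection details assumed as in the question:

import psycopg2

conn = psycopg2.connect("host=... port=... dbname=... user=... password=...")

# A named cursor makes psycopg2 declare a server-side cursor,
# so rows are streamed from the server itersize rows per round trip.
cur = conn.cursor(name="huge_table_cursor")
cur.itersize = 1000

cur.execute("select * from <HUGE TABLE>")  # returns quickly; no rows are fetched yet

for row in cur:
    print(row)

cur.close()
conn.close()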

Related

Inserting DataFrame into MS SQL Server DB using pandas .to_sql() with SQLAlchemy takes too much time

Background:
I am creating a platform in Python where a user (a layperson) will be able to upload data into the database on their own.
The user will select an Excel file, and the Python code will create multiple DataFrames, each stored in its respective table on MS SQL Server in a database.
Situation:
I am creating 12 different DataFrames from the Excel file and storing them in the MS SQL database. The file has approximately 50k rows and about 150 columns (a 16 MB file in total). The code works perfectly fine but is not time efficient, since it takes approximately 2-3 minutes just to upload these 12 frames to the database. I did a test run on a bigger file (approx. 50 MB), and the time it took to upload the 12 frames was 7 minutes.
Where I need support:
Is there any way I can speed up this process of storing the data in the database? Ideally it should only be a matter of seconds, not minutes. I have tried the following libraries and got the results below.
Connection String and Data load in DataFrames:
import urllib
import pandas as pd
import pyodbc
import sqlalchemy as sa

#Connection String
connection_string = f"""
DRIVER={{{DRIVER_NAME}}};
SERVER={{{SERVER_NAME}}};
DATABASE={{{DATABASE_NAME}}};
uid=XYZ;
pwd=XYZ;
Trust_Connection=yes;
ColumnEncryption=Enabled;
"""
#Connection to Database
params=urllib.parse.quote_plus(connection_string)
engine = sa.create_engine("mssql+pyodbc:///?odbc_connect={}".format(params), fast_executemany=True)
con=engine.connect()
#DataFrame 1 to be stored in DB table_1 of DB
df_Addr = pd.read_excel(excel_file, sheet_name = "Address_Details")
#DataFrame 2 to be stored in DB table_2 of DB
df_Bank = pd.read_excel(excel_file, sheet_name = "Bank_Details")
.
.
.
#DataFrame 12 to be stored in DB table_12 of DB
df_N = pd.read_excel(excel_file, sheet_name = "N_Details")
Option 1: Using SQLAlchemy
#Saving Frame 1 in Table 1
saving_query_Address='DQ_Raw_Address'
df_Addr.to_sql(saving_query_Address,engine,schema="dbo",if_exists='append',index=False, chunksize = 5000, dtype={'NAME1': sa.types.NVARCHAR(length=100), 'CITY1': sa.types.NVARCHAR(length=100), 'STREET': sa.types.NVARCHAR(length=100)})
#Saving Frame 2 in Table 2
saving_query_Bank='DQ_Raw_Bank'
df_Bank.to_sql(saving_query_Bank,engine,schema="dbo",if_exists='append',index=False, chunksize = 5000, dtype={'_COMMENT':sa.types.VARCHAR(length=100),'_ACTION_CODE':sa.types.VARCHAR(length=100),'SOURCE_ID':sa.types.VARCHAR(length=100),'BKVID':sa.types.VARCHAR(length=100),'PARTNER':sa.types.VARCHAR(length=100),'BANKS':sa.types.VARCHAR(length=100),'IBAN':sa.types.VARCHAR(length=100),'ACCOUNT_ID':sa.types.VARCHAR(length=50),'CHECK_DIGIT':sa.types.VARCHAR(length=50),'ACCOUNT_TYPE':sa.types.VARCHAR(length=50),'BP_EEW_BUT0BK':sa.types.VARCHAR(length=50)})
#The logic follows for the remaining 10 Tables as well with the same settings.
#Total Time Taken: 130 seconds
Option 2: Using PyODBC
#Saving Frame 1 in Table 1
saving_query_Address='DQ_Raw_Address'
insert_to_tbl = f"INSERT INTO {saving_query_Address} VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"
conn = pyodbc.connect(connection_string)  # pyodbc connection built from the same connection string
cursor = conn.cursor()
cursor.fast_executemany = True
cursor.executemany(insert_to_tbl, df_Addr.values.tolist())
cursor.commit()
cursor.close()
#Saving Frame 2 in Table 2
saving_query_Bank='DQ_Raw_Bank'
insert_to_tmp_tbl_stmt = f"INSERT INTO {saving_query_Bank} VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"
cursor = conn.cursor()
cursor.fast_executemany = True
cursor.executemany(insert_to_tmp_tbl_stmt, df_Bank.values.tolist())
cursor.commit()
cursor.close()
#The logic follows for the remaining 10 Tables as well with the same settings.
#Total Time Taken: 200 seconds
Note: I have tried loading the data as CSV into the DataFrame, but no improvement so far. I cannot execute a BULK INSERT query because I do not have Bulk Admin rights on the SQL Server. Also, I need to use a VPN to connect to the server.
Versions Used:
Pandas: 1.5.0
PyODBC: 4.0.34
SQLAlchemy: 1.4.42
I hope I made the issue clear.
Many Thanks!
Turns out, the issue was fixed using the following two approaches.
1. Reading the DataFrame with pd.read_excel() was taking roughly 10 seconds per frame. Switching to pd.read_csv() cut this from 10 seconds to merely half a second.
2. For storing the data, I found that TurbODBC works best for me, loading all 12 frames in merely 20 seconds. Here is the link for TurbODBC that helped me store the data in the database in a timely fashion (a rough sketch of the pattern follows below).
https://erickfis.medium.com/etl-process-with-turbodbc-1d19ed71510e
I hope this helps someone facing similar issues.
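For reference, here is a rough sketch of the column-wise insert pattern described in the linked TurbODBC article; the driver name, file name, and table name are assumptions for illustration, not the poster's exact code:

import numpy as np
import pandas as pd
import turbodbc

connection = turbodbc.connect(driver="ODBC Driver 17 for SQL Server",
                              server="SERVER_NAME",
                              database="DATABASE_NAME",
                              uid="XYZ",
                              pwd="XYZ")
cursor = connection.cursor()

# Reading the exported sheet as CSV instead of Excel (point 1 above)
df_Addr = pd.read_csv("Address_Details.csv")

insert_sql = "INSERT INTO dbo.DQ_Raw_Address ({}) VALUES ({})".format(
    ", ".join(df_Addr.columns),
    ", ".join("?" * len(df_Addr.columns)))

# TurbODBC inserts column-wise: one masked array per column, with NaNs masked as NULLs
values = [np.ma.MaskedArray(df_Addr[col].values, mask=df_Addr[col].isnull().values)
          for col in df_Addr.columns]

cursor.executemanycolumns(insert_sql, values)
connection.commit()
cursor.close()
connection.close()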

extract millions of records from sql server and load into oracle database using python script

I am extracting millions of rows from SQL Server and inserting them into an Oracle DB using Python. It is taking about 1 second to insert each record into the Oracle table, so the load takes hours. What is the fastest approach to load the data?
My code below:
def insert_data(conn, cursor, query, data, batch_size=10000):
    recs = []
    count = 1
    for rec in data:
        recs.append(rec)
        if count % batch_size == 0:
            cursor.executemany(query, recs, batcherrors=True)
            conn.commit()
            recs = []
        count = count + 1
    # flush the final partial batch
    cursor.executemany(query, recs, batcherrors=True)
    conn.commit()
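For completeness, here is a hedged sketch of how insert_data above might be wired up; the connection strings, table, and column names are hypothetical. The key point is that the source cursor is iterable, so rows stream into Oracle in batches of batch_size via executemany:

import pyodbc
import cx_Oracle

src = pyodbc.connect("DRIVER={SQL Server};SERVER=...;DATABASE=...;UID=...;PWD=...")
tgt = cx_Oracle.connect("user/password@host:1521/service")

src_cur = src.cursor()
tgt_cur = tgt.cursor()

src_cur.execute("SELECT col_a, col_b, col_c FROM source_table")
insert_sql = "INSERT INTO target_table (col_a, col_b, col_c) VALUES (:1, :2, :3)"

# src_cur is iterable, so it can be passed straight in as `data`
insert_data(tgt, tgt_cur, insert_sql, src_cur)

src.close()
tgt.close()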
Perhaps you cannot buy a 3rd-party ETL tool, but you can certainly write a procedure in PL/SQL in the Oracle database.
First, install the Oracle Transparent Gateway for ODBC. There is no license cost involved.
Second, in the Oracle DB, create a db link to reference the MSSQL database via the gateway.
Third, write a PL/SQL procedure to pull the data from the MSSQL database via the db link.
I was once presented with a problem similar to yours. A developer was using SSIS to copy around a million rows from MSSQL to Oracle, taking over 4 hours. I ran a trace on his process and saw that it was copying row-by-row, slow-by-slow. It took me less than 30 minutes to write a PL/SQL proc to copy the data, and it completed in less than 4 minutes.
I give a high-level view of the entire setup and process here:
EDIT:
Thought you might like to see exactly how simple the actual procedure is:
create or replace procedure my_load_proc as
begin
  insert into my_oracle_table (col_a,
                               col_b,
                               col_c)
  select sql_col_a,
         sql_col_b,
         sql_col_c
    from mssql_tbl@mssql_link;
end;
My actual procedure has more to it, dealing with run-time logging, emailing notification of completion, etc. But the above is the 'guts' of it, pulling the data from mssql into oracle.
Then you might want to use pandas or PySpark or other big-data frameworks available in Python.
There are a lot of examples out there; here is how to load data, based on the Microsoft Docs:
import pyodbc
import pandas as pd
import cx_Oracle
from sqlalchemy import create_engine

server = 'servername'
database = 'AdventureWorks'
username = 'yourusername'
password = 'yourpassword'
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)
cursor = cnxn.cursor()
query = "SELECT [CountryRegionCode], [Name] FROM Person.CountryRegion;"
df = pd.read_sql(query, cnxn)
# you do data manipulation that is needed here
# then insert data into oracle
# 'xxxxxx' stands for your Oracle connection string; table_name is your target table
conn = create_engine('oracle+cx_oracle://xxxxxx')
df.to_sql(table_name, conn, index=False, if_exists="replace")
Something like that (it might not work 100%, but it should give you an idea of how you can do it).

Multithread Insert csvreader row into postgres connected by psycopg2

I have a pipeline that reads gzipped CSV data into Python and inserts the data into a Postgres database, row by row, connected using psycopg2. I've created a threaded connection pool, but I'm unsure how to leverage it to insert each row in a separate thread rather than inserting sequentially. The internet gives me mixed messages about whether this is even possible, and I have some experience with the Python threading module, but not a lot.
The pipeline currently works, but it is slow, and I'm hoping it can be made faster by inserting the rows across threads rather than sequentially.
The following code is simplified for clarity:
main script
for row in reader:
    insertrows(configs, row)
insertrows script
threadpool = pool.ThreadedConnectionPool(5, 20, database=dbname, port=port, user=user, password=password, host=host)
con = threadpool.getconn()
con.autocommit = True
cur = con.cursor()
cur.execute("INSERT INTO table VALUES row")
cur.close()
threadpool.putconn(con)
What I would like to do, rather than looping through the rows, is create something like the threading example in this link, but without a strong frame of reference for multithreading it's hard for me to figure out how to write something like that for my purposes.
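Without knowing the full pipeline, here is one hedged sketch of the usual pattern: batch the rows, let a ThreadPoolExecutor hand each batch to a worker, and have each worker borrow a connection from the ThreadedConnectionPool. The table name, column count, and batch size below are assumptions; dbname, port, user, password, host, and reader are taken from the question.

from concurrent.futures import ThreadPoolExecutor
from psycopg2 import pool

threadpool = pool.ThreadedConnectionPool(5, 20, database=dbname, port=port,
                                         user=user, password=password, host=host)

def insert_batch(rows):
    con = threadpool.getconn()
    try:
        con.autocommit = True
        with con.cursor() as cur:
            # placeholder column count; match it to your table
            cur.executemany("INSERT INTO table_name VALUES (%s, %s, %s)", rows)
    finally:
        threadpool.putconn(con)

def chunks(iterable, size=1000):
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

with ThreadPoolExecutor(max_workers=5) as executor:
    # each worker borrows its own connection and inserts one batch at a time
    list(executor.map(insert_batch, chunks(reader)))

threadpool.closeall()

Batching also helps on its own: one INSERT per row spends most of its time on round trips, with or without threads.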

Reading from mysql in batch end up hitting all rows

I am trying to read a 100GB+ table in Python using the pymysql package.
The query I am firing is
select * from table
But I want to be able to process records in chunks instead of hitting the database for 100 GB of records all at once. Below is my code:
with self.connection.cursor() as cursor:
    logging.info("Executing Read query")
    logging.info(cursor.mogrify(query))
    cursor.execute(query)
    schema = cursor.description
    size = cursor.rowcount
    for i in range((size // batch) + 1):
        records = cursor.fetchmany(size=batch)
        yield records, schema
But when the query gets executed at cursor.execute(query), it tries to fetch all 100 GB of records and ends up killing the process.
Is there any better way to read data in chunks from MySQL using Python?
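One thing worth knowing: pymysql's default cursor buffers the whole result set on the client as soon as execute() returns, which matches the behaviour described. Below is a hedged sketch using the unbuffered SSCursor instead, with connection details assumed; note that rowcount is not meaningful for an unbuffered cursor, so the loop runs until fetchmany() comes back empty.

import pymysql

def read_in_chunks(connection, query, batch=10000):
    # SSCursor streams rows from the server instead of buffering them all client-side
    with connection.cursor(pymysql.cursors.SSCursor) as cursor:
        cursor.execute(query)
        schema = cursor.description
        while True:
            records = cursor.fetchmany(size=batch)
            if not records:
                break
            yield records, schema

connection = pymysql.connect(host="...", user="...", password="...", database="...")

for records, schema in read_in_chunks(connection, "select * from table"):
    ...  # process one chunk at a time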

How can I select part of sqlite database using python

I have a very big database and I want to send part of it (1/1000) to someone I am collaborating with to perform test runs. How can I (a) select 1/1000 of the total rows (or something similar) and (b) save the selection as a new .db file?
This is my current code, but I am stuck.
import sqlite3
import json
from pprint import pprint
conn = sqlite3.connect('C:/data/responses.db')
c = conn.cursor()
c.execute("SELECT * FROM responses;")
Create another database with a table structure similar to the original DB's. Sample records from the original database and insert them into the new database:
import sqlite3

conn = sqlite3.connect("responses.db")
sample_conn = sqlite3.connect("responses_sample.db")
c = conn.cursor()
c_sample = sample_conn.cursor()

rows = c.execute("select no, nm from responses")
sample_rows = [r for i, r in enumerate(rows) if i % 1000 == 0]  # keep 1 row in 1000

# create sample table with similar structure
c_sample.execute("create table responses(no int, nm varchar(100))")
for r in sample_rows:
    c_sample.execute("insert into responses (no, nm) values (?, ?)", r)

sample_conn.commit()
c_sample.close()
sample_conn.close()
The simplest way to do this would be:
Copy the database file in your filesystem the same as you would any other file (e.g. Ctrl+C then Ctrl+V in Windows to make responses-partial.db or something).
Then open this new copy in an SQLite editor such as http://sqlitebrowser.org/ and run a delete query to remove however many rows you want to. Then you might want to run "Compact Database" from the File menu.
Close the SQLite editor and confirm the file size is smaller.
Email the copy.
Unless you need to create a repeatable system, I wouldn't bother with doing this in Python. But you could perform similar steps in Python (copy the file, open it, run the delete query, etc.) if you need to.
The easiest way to do this is to
make a copy of the database file;
delete 999/1000th of the data, either by keeping the first few rows:
DELETE FROM responses WHERE SomeID > 1000;
or, if you want really random samples:
DELETE FROM responses
WHERE rowid NOT IN (SELECT rowid
                    FROM responses
                    ORDER BY random()
                    LIMIT (SELECT count(*)/1000 FROM responses));
run VACUUM to reduce the file size.
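And if you do want a repeatable Python version of this copy-then-delete approach, here is a minimal sketch (file names assumed):

import shutil
import sqlite3

shutil.copyfile("responses.db", "responses_sample.db")

conn = sqlite3.connect("responses_sample.db")
conn.execute("""
    DELETE FROM responses
    WHERE rowid NOT IN (SELECT rowid
                        FROM responses
                        ORDER BY random()
                        LIMIT (SELECT count(*)/1000 FROM responses))
""")
conn.commit()
conn.execute("VACUUM")  # shrink the file after the delete
conn.close()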
