I have a pipeline that reads gzipped CSV data into Python and inserts it into a Postgres database row by row, connected through psycopg2. I've created a threaded connection pool, but I'm unsure how to use it to insert each row in a separate thread rather than sequentially. The internet gives me mixed messages about whether this is even possible, and I have some experience with Python's threading module, but not a lot.
The pipeline currently works, but it is slow, and I'm hoping it can be made faster by inserting the rows across threads rather than sequentially.
The following code is simplified for clarity:
main script
for row in reader:
    insertrows(configs, row)
insertrows script
threadpool = pool.ThreadedConnectionPool(5, 20, database=dbname, port=port, user=user, password=password, host=host)
con = threadpool.getconn()
con.autocommit = True
cur = con.cursor()
cur.execute("INSERT INTO table VALUES row")
cur.close()
threadpool.putconn(con)
What I would like to do, rather than looping through the rows, is create something like the threading example in this link, but without a strong frame of reference for multithreading it's hard for me to work out how to write that for my purposes.
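For what it's worth, here is a minimal sketch of how that could look. It assumes the pool is created once at startup (not once per row, as in the simplified code above) and that rows are grouped into batches so each thread has a meaningful amount of work. `insert_batch`, `insert_all`, the table `my_table` and its columns are hypothetical names for illustration; `getconn` and `putconn` are the pool methods already used above.

```python
from concurrent.futures import ThreadPoolExecutor


def insert_batch(conn_pool, rows):
    """Insert one batch of rows using a connection borrowed from the pool."""
    conn = conn_pool.getconn()
    try:
        with conn.cursor() as cur:
            # Hypothetical table and columns; adjust to the real schema.
            cur.executemany(
                "INSERT INTO my_table (col_a, col_b) VALUES (%s, %s)", rows
            )
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        # Always hand the connection back, even if the insert failed.
        conn_pool.putconn(conn)
    return len(rows)


def insert_all(conn_pool, batches, max_workers=5):
    """Run insert_batch over all batches on a small pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        counts = executor.map(lambda batch: insert_batch(conn_pool, batch), batches)
        return sum(counts)
```

Keep `max_workers` at or below the pool's maximum connection count (20 above), otherwise `getconn` can fail once the pool is exhausted. Whether this actually beats a single connection doing batched inserts is something you would have to measure on your own data.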
Related
I am extracting millions of rows from SQL Server and inserting them into an Oracle database using Python. Inserting a single record into the Oracle table takes about 1 second, so the whole load takes hours. What is the fastest approach to load the data?
My code below:
def insert_data(conn, cursor, query, data, batch_size=10000):
    recs = []
    for rec in data:
        recs.append(rec)
        # Send a full batch in one executemany call
        if len(recs) >= batch_size:
            cursor.executemany(query, recs, batcherrors=True)
            conn.commit()
            recs = []
    # Send whatever is left over, if anything
    if recs:
        cursor.executemany(query, recs, batcherrors=True)
        conn.commit()
Perhaps you cannot buy a third-party ETL tool, but you can certainly write a procedure in PL/SQL in the Oracle database.
First, install the Oracle Transparent Gateway for ODBC. No license cost involved.
Second, in the Oracle database, create a db link to reference the MSSQL database via the gateway.
Third, write a PL/SQL procedure to pull the data from the MSSQL database via the db link.
I was once presented with a problem similar to yours. A developer was using SSIS to copy around a million rows from MSSQL to Oracle, and it was taking over 4 hours. I ran a trace on his process and saw that it was copying row-by-row, slow-by-slow. It took me less than 30 minutes to write a PL/SQL proc to copy the data, and it completed in less than 4 minutes.
I give a high-level view of the entire setup and process, here:
EDIT:
Thought you might like to see exactly how simple the actual procedure is:
create or replace procedure my_load_proc
as
begin
    insert into my_oracle_table (col_a,
                                 col_b,
                                 col_c)
    select sql_col_a,
           sql_col_b,
           sql_col_c
    from mssql_tbl@mssql_link;
end;
My actual procedure has more to it, dealing with run-time logging, emailing notification of completion, etc. But the above is the 'guts' of it, pulling the data from mssql into oracle.
In that case you might want to use pandas, PySpark, or one of the other big-data frameworks available for Python.
There are a lot of examples out there; here is how to load the data, adapted from the Microsoft Docs:
import pyodbc
import pandas as pd
import cx_Oracle  # driver used by the oracle+cx_oracle SQLAlchemy dialect
from sqlalchemy import create_engine

server = 'servername'
database = 'AdventureWorks'
username = 'yourusername'
password = 'yourpassword'
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)
query = "SELECT [CountryRegionCode], [Name] FROM Person.CountryRegion;"
df = pd.read_sql(query, cnxn)
# you do data manipulation that is needed here
# then insert data into oracle
table_name = 'country_region'  # placeholder name for the target table
engine = create_engine('oracle+cx_oracle://xxxxxx')
df.to_sql(table_name, engine, index=False, if_exists="replace")
Something like that (it might not work 100% as written, but it should give you an idea of how you can do it).
I'm trying to read a huge PostgreSQL table (~3 million rows of jsonb data, ~30GB size) to do some ETL in Python. I use psycopg2 for working with the database. I want to execute a Python function for each row of the PostgreSQL table and save the results in a .csv file.
The problem is that I need to select the whole 30GB table, and the query runs for a very long time without any possibility to monitor progress. I have found out that there exists a cursor parameter called itersize which determines the number of rows to be buffered on the client.
So I have written the following code:
import psycopg2
conn = psycopg2.connect("host=... port=... dbname=... user=... password=...")
cur = conn.cursor()
cur.itersize = 1000
sql_statement = """
select * from <HUGE TABLE>
"""
cur.execute(sql_statement)
for row in cur:
    print(row)
cur.close()
conn.close()
Since the client is supposed to buffer 1000 rows at a time, I expect the following behavior:
1. The Python script buffers the first 1000 rows
2. We enter the for loop and print the buffered 1000 rows in the console
3. We reach the point where the next 1000 rows have to be buffered
4. The Python script buffers the next 1000 rows
5. GOTO 2
However, the code just hangs on the cur.execute() statement and no output is printed in the console. Why? Could you please explain what exactly is happening under the hood?
I'm looking for an efficient way to import data from a CSV file into a PostgreSQL table using Python, in batches, as I have quite large files and the server I'm importing the data to is far away. I need an efficient solution, as everything I tried was either slow or just didn't work. I'm using SQLAlchemy.
I wanted to use raw SQL, but it's hard to parameterize and I need multiple loops to execute the query for multiple rows.
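One way to structure the batching, independent of whether the insert ends up going through SQLAlchemy or raw SQL, is to read the CSV in fixed-size chunks and hand each chunk to a single bulk-insert call. A minimal sketch using only the standard library; `read_csv_batches` is a hypothetical helper name, and the batch size of 1000 is an arbitrary starting point:

```python
import csv
from itertools import islice


def read_csv_batches(file_obj, batch_size=1000, skip_header=True):
    """Yield lists of at most batch_size rows from an open CSV file."""
    reader = csv.reader(file_obj)
    if skip_header:
        next(reader, None)  # drop the header row if the file has one
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch
```

With a far-away server, each batch then costs roughly one network round trip instead of one per row, which is usually where most of the time goes.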
I was given the task of manipulating & migrating some data from CSV files into a remote Postgres Instance.
I decided to use the Python script below:
import csv
import uuid
import psycopg2
import psycopg2.extras
import time
#Instant Time at the start of the Script
start = time.time()
psycopg2.extras.register_uuid()
#List of CSV Files that I want to manipulate & migrate.
file_list=["Address.csv"]
conn = psycopg2.connect("host=localhost dbname=address user=postgres password=docker")
cur = conn.cursor()
i = 1
for f in file_list:
    f = open(f)
    csv_f = csv.reader(f)
    next(csv_f)
    for row in csv_f:
        # Some simple manipulations on each row
        #Inserting a uuid4 into the first column
        row.pop(0)
        row.insert(0,uuid.uuid4())
        row.pop(10)
        row.insert(10,False)
        row.pop(13)
        #Tracking the number of rows inserted
        print(i)
        i = i + 1
        #INSERT QUERY
        postgres_insert_query = """ INSERT INTO "public"."address"("address_id","address_line_1","locality_area_street","address_name","app_version","channel_type","city","country","created_at","first_name","is_default","landmark","last_name","mobile","pincode","territory","updated_at","user_id") VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
        record_to_insert = row
        cur.execute(postgres_insert_query,record_to_insert)
    f.close()
conn.commit()
conn.close()
print(time.time()-start)
The script worked quite well and promptly when testing it locally. But connecting to a remote Database Server added a lot more latency.
As a workaround, I migrated the manipulated data into my local postgres instance.
I then generated a .sql file of the migrated data & manually imported the .sql file on the remote server.
Alternatively, you can use Python's multithreading features to launch multiple concurrent connections to the remote server, dedicate an isolated batch of rows to each connection, and flush the data.
This should make your migration considerably faster.
I have personally not tried the multi threading approach as it wasn't required in my case. But it seems darn efficient.
Hope this helped ! :)
Resources:
CSV Manipulation using Python for Beginners.
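Use psycopg2's copy_from method; it streams all the rows into the table with a single COPY instead of one INSERT per row.
f = open('file.csv', 'r')
next(f)  # skip the header row
# copy_from expects tab-separated input by default, so set sep for a CSV file
cur.copy_from(f, 'table_name', sep=',', columns=('id', 'name', 'email'))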
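If the rows need some manipulation first, or the file contains quoted fields with embedded commas (which copy_from does not parse), a variation is to write the processed rows into an in-memory CSV buffer and load them with the cursor's `copy_expert` method. This is only a sketch: `copy_rows` is a hypothetical helper, and the table and column names are placeholders.

```python
import csv
import io


def copy_rows(cur, table, columns, rows):
    """Load rows through a single COPY ... FROM STDIN using an in-memory CSV buffer."""
    buf = io.StringIO()
    # The csv module takes care of quoting fields that contain commas or quotes.
    csv.writer(buf, lineterminator="\n").writerows(rows)
    buf.seek(0)
    # table and columns are interpolated directly: only pass trusted identifiers.
    sql = "COPY {} ({}) FROM STDIN WITH (FORMAT csv)".format(table, ", ".join(columns))
    cur.copy_expert(sql, buf)
```

For very large files you would probably call this once per batch of rows rather than building one huge buffer, and commit on the connection afterwards.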
I am trying to read a 100GB+ table in Python using the pymysql package.
The query I am firing is:
select * from table
I want to process the records in chunks instead of pulling all 100 GB from the database at once. Below is my code:
with self.connection.cursor() as cursor:
    logging.info("Executing Read query")
    logging.info(cursor.mogrify(query))
    cursor.execute(query)
    schema = cursor.description
    size = cursor.rowcount
    for i in range((size//batch)+1):
        records = cursor.fetchmany(size=batch)
        yield records, schema
But when the query is executed at cursor.execute(query), it tries to fetch all 100 GB of records and ends up killing the process.
Is there a better way to read data in chunks from MySQL using Python?
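Some background that may help: pymysql's default cursor buffers the whole result set on the client when execute() runs, which matches the behaviour described above. The library also ships an unbuffered cursor class, `pymysql.cursors.SSCursor`, that fetches rows from the server as they are consumed. With an unbuffered cursor the row count is not known up front, so the loop below runs until fetchmany returns nothing instead of relying on rowcount. `iter_chunks` is a hypothetical helper, and the cursor is assumed to have been created with `connection.cursor(pymysql.cursors.SSCursor)`:

```python
def iter_chunks(cursor, query, batch=10000):
    """Yield (records, schema) chunks until the cursor is exhausted."""
    cursor.execute(query)
    schema = cursor.description
    while True:
        records = cursor.fetchmany(batch)
        if not records:
            break  # no more rows on the server side
        yield records, schema
```

Memory use should then be bounded by roughly one batch of rows, at the cost of keeping the connection busy until the whole result has been read.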
Say I have only 1 GB of memory and 1 TB of hard disk space.
This is my code, and I am using a Postgres database.
import psycopg2
try:
    db = psycopg2.connect("database parameters")
    conn = db.cursor()
    conn.execute(query)
    #At this point, i am running
    for row in conn:
For this case, I guess it is safe to assume that conn is a generator, but I cannot find a definitive answer online, and I cannot try it in my environment because I cannot afford to crash the system.
I am expecting this query to return more than 100 GB of data.
I am using Python 2.7 and the psycopg2 library.
If you use an anonymous cursor, which you are doing in your example, then the entire query result will be read into memory.
If you use a named cursor then it will read from the server in chunks as it loops over the data.
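To illustrate the difference, here is a small sketch assuming `conn` is an open psycopg2 connection. Passing a `name` to `conn.cursor()` is what makes the cursor server-side, and `itersize` only has an effect on such named cursors. The function name `stream_rows` and the cursor name are arbitrary choices for this example:

```python
def stream_rows(conn, query, batch_size=1000):
    """Iterate over a query result using a server-side (named) cursor."""
    # A cursor created with a name lives on the server; rows are pulled
    # over the network itersize at a time while iterating.
    with conn.cursor(name="stream_rows_cursor") as cur:
        cur.itersize = batch_size
        cur.execute(query)
        for row in cur:
            yield row
```

Used as `for row in stream_rows(conn, "select * from big_table"): ...`, the client should only ever hold about `batch_size` rows in memory at once.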