How to optimize fetching 5 million rows from a cursor - Python

I have an MSSQL table with 5M rows, and fetching all of its rows takes 2-3 minutes. I want (if possible) to optimize that.
This is my code:

cursor.execute("SELECT * FROM MyTable")
rows = cursor.fetchall()  # this takes 2~3 minutes
# some code to set up the output, which takes only a few seconds

I already tried using:

while True:
    rows = cursor.fetchmany(500000)
    if not rows:
        break
    # Do some stuff

and also fetchone.
But again I'm at 2-3 minutes. How can I optimize this? Maybe using threads, but I don't know how.
Thanks for your help.

I think you can limit the number of rows returned by each query, even if that means making several calls to your database (a paginated sketch for SQL Server follows after the pool example below).
As for the threads, you have several options:
A single connection but a different cursor for each Thread
One connection for each Thread and one cursor from that connection
In either case you need a ThreadedConnectionPool. Here is a small example of one way to do it:
import psycopg2
from psycopg2 import pool
from threading import Thread
from time import sleep

threaded_connection_pool = None
thread_table = list()

def get_new_connection():
    global threaded_connection_pool
    connection = None
    while not isinstance(connection, psycopg2.extensions.connection):
        try:
            connection = threaded_connection_pool.getconn()
        except pool.PoolError:
            sleep(10)  # Wait for a free connection
    return connection, connection.cursor()

def thread_target():
    connection, cursor = get_new_connection()
    with connection, cursor:
        # Do some stuff
        pass

threaded_connection_pool = psycopg2.pool.ThreadedConnectionPool(
    # YOUR PARAM
)

for counter_thread in range(10):
    thread = Thread(
        target=thread_target,
        name=f"Thread n°{counter_thread}"
    )
    thread_table.append(thread)
    thread.start()

#
# Do many more stuff
#

for thread in thread_table:
    thread.join()
# End
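Since the original question targets MSSQL rather than PostgreSQL, here is a hedged sketch of the paginated approach on SQL Server using OFFSET/FETCH NEXT. The pyodbc driver, the connection string, and the indexed id column used for ORDER BY are assumptions for illustration, not part of the question:

import pyodbc

connection = pyodbc.connect("DSN=MyDSN")  # placeholder connection string
cursor = connection.cursor()

page_size = 500000
offset = 0
while True:
    cursor.execute(
        "SELECT * FROM MyTable ORDER BY id "   # 'id' is an assumed indexed column
        "OFFSET ? ROWS FETCH NEXT ? ROWS ONLY",
        offset, page_size)
    rows = cursor.fetchall()
    if not rows:
        break
    # hand this page to a worker thread or process here
    offset += page_size

Each iteration fetches one fixed-size page, so you can process or parallelize page by page instead of holding 5M rows in one fetchall() call.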

I prefer to use the first solution, "A single connection but a different cursor for each Thread".
To do that, do I have to do something like this?
result = []
cursor = connection.cursor()

def fetch_cursor(cursor):
    global result
    rows = cursor.fetchall()
    if rows:
        result += beautify_output(rows)

######### THIS CODE BELOW IS INSIDE A FUNCTION ######

thread_table = []
limit = 1000000
offset = 0
sql = "SELECT * FROM myTABLE"

while True:
    try:
        cursor.execute(f"{sql} LIMIT {limit} OFFSET {offset}")
    except Exception as e:
        break
    offset += limit
    thread = Thread(target=fetch_cursor, args=(cursor,))
    thread_table.append(thread)
    thread.start()

for thread in thread_table:
    thread.join()

print(result)
So something like that should work? (I will try it tomorrow.)

Related

Read sql queries via pandas quickly with pyoracle

I am using Oracle SQL Developer and have built a script to read SQL queries in parallel under a thread. However, I have noticed no significant difference in speed from implementing this (even with chunksizes) compared to reading the table directly. Could my approach be wrong, and what improvement could I make to speed things up?
For example:
# My table size is only 38k rows and this takes ~1.2 minutes to run
def table(self, table = None, query = None, chunksize = None):
    from concurrent.futures import ThreadPoolExecutor
    with self._ENGINE.connect() as conn:
        tables = []
        if query is None and table is not None:
            with ThreadPoolExecutor(max_workers = 8) as executor:
                for results in executor.submit(pd.read_sql, f"SELECT /*+ PARALLEL(16) */ NAME FROM {table}", conn, chunksize=chunksize).result():
                    tables.append(results)
            table = pd.concat([pd.concat([x]) for x in tables])
            conn.close()
            return table
        else:
            print('something else')
After reading the following documentation:
tuning fetch
It takes approximately 118 seconds for the code above to run, whereas after a slight modification:
def table2(self, table = None, query = None):
    from concurrent.futures import ThreadPoolExecutor
    self._cursor.arraysize = 10000
    self._cursor.prefetchrows = 1000000
    tables = []
    start = time.time()
    if query is None and table is not None:
        with ThreadPoolExecutor(max_workers = 8) as executor:
            for results in executor.submit(self._cursor.execute, f"SELECT /*+ PARALLEL(16) */ NAME FROM {table}").result():
                tables.append(results)
        end = time.time()
        start_second = time.time()
        self._cursor.execute(f"SELECT /*+ PARALLEL(16) */ NAME FROM {table}").fetchall()
        end_second = time.time()
        print("Threadpool time: %s, fetchall time: %s" % (str(end-start), str(end_second-start_second)))
Takes the following time to execute:
Threadpool time: 1.0487918853759766, fetchall time: 0.48572492599487305
Here's an example of fetching data from a single table using multiple connections. Whether it's faster than a single thread doing a full table scan is something for you to check. Maybe Python's GIL is a bottleneck. Maybe your database is on a single disk instead of multiple disks, so there is no extra throughput possible. Maybe the OFFSET/FETCH NEXT and ORDER BY are a limiting factor because the DB is busy doing work for other users (or maybe you're the only user so they are fast). Maybe for you it's what you do with the data when you get it in Python that will be a bottleneck.
Fundamentally from the Oracle side, tuning arraysize will be the biggest factor for any single SELECT that returns a large number of rows over a slower network.
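For comparison, the fetch tuning on its own, with no threads at all, often captures most of the win. A minimal single-connection sketch (assuming python-oracledb and the same environment variables and demo table as the fuller example below):

import os
import oracledb

connection = oracledb.connect(user=os.environ.get('PYTHON_USERNAME'),
                              password=os.environ.get('PYTHON_PASSWORD'),
                              dsn=os.environ.get('PYTHON_CONNECTSTRING'))
with connection.cursor() as cursor:
    # A larger arraysize means fewer round-trips per fetch call; prefetchrows
    # can save an extra round-trip when the result fits in the prefetch buffer.
    cursor.arraysize = 10000
    cursor.prefetchrows = 10001
    cursor.execute("select data from demo")  # demo table from the example below
    rows = cursor.fetchall()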
# Fetch batches of rows from a table using multiple connections

import csv
import os
import platform
import threading

import oracledb

# To fetch everything, keep NUM_THREADS * BATCH_SIZE >= TABLE_SIZE

# number of rows to insert into the demo table
TABLE_SIZE = 10000

# The degree of parallelism / number of connections to open
NUM_THREADS = 10

# How many rows to fetch in each thread
BATCH_SIZE = 1000

# Internal buffer size: Tune this for performance
ARRAY_SIZE = 1000

SQL = """
select data
from demo
order by id
offset :rowoffset rows fetch next :maxrows rows only
"""

un = os.environ.get('PYTHON_USERNAME')
pw = os.environ.get('PYTHON_PASSWORD')
cs = os.environ.get('PYTHON_CONNECTSTRING')

if os.environ.get('DRIVER_TYPE') == 'thick':
    ld = None
    if platform.system() == 'Darwin' and platform.machine() == 'x86_64':
        ld = os.environ.get('HOME')+'/Downloads/instantclient_19_8'
    elif platform.system() == 'Windows':
        ld = r'C:\oracle\instantclient_19_17'
    oracledb.init_oracle_client(lib_dir=ld)

# Create a connection pool
pool = oracledb.create_pool(user=un, password=pw, dsn=cs, min=NUM_THREADS, max=NUM_THREADS)

#
# Create the table for the demo
#
def create_schema():
    with oracledb.connect(user=un, password=pw, dsn=cs) as connection:
        with connection.cursor() as cursor:
            connection.autocommit = True
            cursor.execute("""
                begin
                  begin
                    execute immediate 'drop table demo';
                  exception when others then
                    if sqlcode <> -942 then
                      raise;
                    end if;
                  end;

                  execute immediate 'create table demo (
                                       id   number generated by default as identity,
                                       data varchar2(40))';

                  insert into demo (data)
                  select to_char(rownum)
                  from dual
                  connect by level <= :table_size;
                end;""", table_size=TABLE_SIZE)

# Write the data to separate CSV files
def do_write_csv(tn):
    with pool.acquire() as connection:
        with connection.cursor() as cursor:
            cursor.arraysize = ARRAY_SIZE
            f = open(f"emp{tn}.csv", "w")
            writer = csv.writer(f, lineterminator="\n", quoting=csv.QUOTE_NONNUMERIC)
            cursor.execute(SQL, rowoffset=(tn*BATCH_SIZE), maxrows=BATCH_SIZE)
            col_names = [row[0] for row in cursor.description]
            writer.writerow(col_names)
            while True:
                rows = cursor.fetchmany()  # extra call at end won't incur extra round-trip
                if not rows:
                    break
                writer.writerows(rows)
            f.close()

# Print the data to the terminal
def do_query(tn):
    with pool.acquire() as connection:
        with connection.cursor() as cursor:
            cursor.arraysize = ARRAY_SIZE
            cursor.execute(SQL, rowoffset=(tn*BATCH_SIZE), maxrows=BATCH_SIZE)
            while True:
                rows = cursor.fetchmany()  # extra call at end won't incur extra round-trip
                if not rows:
                    break
                print(f'Thread {tn}', rows)

#
# Start the desired number of threads.
#
def start_workload():
    thread = []
    for i in range(NUM_THREADS):
        t = threading.Thread(target=do_write_csv, args=(i,))
        #t = threading.Thread(target=do_query, args=(i,))
        t.start()
        thread.append(t)

    for i in range(NUM_THREADS):
        thread[i].join()

if __name__ == '__main__':
    create_schema()
    start_workload()
    print("All done!")

Create a process from a function that will run in parallel in Python

I have a function that executes a SELECT sql query (using postgresql).
Now I want to INSERT the execution time of this query into some table in my DB. However, I want to do it in parallel, so that even if my INSERT query is still running I can continue my program and call other functions.
I tried to use multiprocessing.Process; however, my function waits for the process to finish, so I actually lose the parallelism I wanted.
My code in a nutshell:
def select_func():
    with connection.cursor() as cursor:
        query = "SELECT * FROM myTable WHERE \"UserName\" = 'Alice'"
        start = time.time()
        cursor.execute(query)
        end = time.time()
        process = Process(target = insert_func, args = (query, (end-start)))
        process.start()
        process.join()
        return cursor.fetchall()

def insert_func(query, time):
    with connection.cursor() as cursor:
        query = ("INSERT INTO infoTable (\"query\", \"exec_time\") "
                 "VALUES (\"" + query + "\", \"" + time + "\")")
        cursor.execute(query)
        connection.commit()
Now the problem is that this operation is not really asynchronous, since select_func waits until insert_func has finished. I want the execution of these functions to be independent, so that the select function can return even while insert_func is still running, and I can continue and call other functions in my script.
Thanks!
There are quite a lot of issues with your code snippet, but let's at least try to give you a structure to implement.
def select_func():
    with connection.cursor() as cursor:  # I don't think the same global connection variable should be used for read/write simultaneously
        query = "SELECT * FROM myTable WHERE \"UserName\" = 'Alice'"  # quotation issues
        start = time.time()
        cursor.execute(query)
        end = time.time()
        process = Process(target = insert_func, args = (query, (end-start)))
        process.start()  # you start the process here BUT
        process.join()   # you force python to wait for it here....
        return cursor.fetchall()

def insert_func(query, time):
    with connection.cursor() as cursor:
        query = ("INSERT INTO infoTable (\"query\", \"exec_time\") "
                 "VALUES (\"" + query + "\", \"" + time + "\")")
        cursor.execute(query)
        connection.commit()
Consider an alternative:
import time
from multiprocessing import Process, Queue
# "sql" below stands in for your DB driver module ("sqlite syntax but use your
# connection"); psycopg2 fits the question's PostgreSQL setup.

def select_func():
    read_con = sql.connect()  # sqlite syntax but use your connection
    with read_con.cursor() as cursor:
        query = "SELECT * FROM myTable WHERE \"UserName\" = 'Alice'"  # where does Alice come from?
        start = time.time()
        cursor.execute(query)
        end = time.time()
        return cursor.fetchall(), (query, (end-start))  # our tuple has query at position 0 and time at position 1

def insert_function(insert_queue):  # the insert you want to parallelize
    connection = sql.connect("db")  # initialize your 'writer'. Note: may be good to initialize the connection on each insert. Not sure if optimal.
    while True:  # we keep pulling from the queue
        data = insert_queue.get()  # we pull from our queue
        if data == 'STOP':  # example of a kill instruction to stop our process
            break  # breaks the while loop and the function can 'exit'
        with connection.cursor() as cursor:
            query_data = data  # I assume you would want to pass your query through the queue
            query = query_data[0]  # see how we stored the tuple
            exec_time = query_data[1]  # as above
            insert_query = ("INSERT INTO infoTable (\"query\", \"exec_time\") "
                            "VALUES (\"" + query + "\", \"" + str(exec_time) + "\")")  # somehow query and time go into the insert query
            cursor.execute(insert_query)
            connection.commit()

if __name__ == '__main__':  # typical python main guard
    query_queue = Queue()  # we initialize a Queue here to feed into your inserting function
    process = Process(target = insert_function, args = (query_queue,))
    process.start()

    stuff = []
    for i in range(5):
        data, insert_query = select_func()  # the select function gets the data you want to insert
        stuff.append(data)
        query_queue.put(insert_query)
    #
    # Do other stuff and even put more stuff into the queue.
    #
    query_queue.put('STOP')  # we want to kill our process so we send the stop command
    process.join()

looping mycursor.execute and mycursor.fetchall() until it gets a result or a specific number of loops is reached in Python [duplicate]

This question already has answers here:
Why are some mysql connections selecting old data the mysql database after a delete + insert?
(2 answers)
Closed 3 months ago.
I need to repeatedly query a MySQL DB from Python, as the data is rapidly changing. Each time the data is read, it is transferred into a list.
I had assumed that simply putting the query in a loop would fetch the data from the database on each iteration. It seems not.
import mysql.connector
from mysql.connector import Error
from time import sleep

# Create empty list to store values from database.
listSize = 100
myList = []
for i in range(listSize):
    myList.append([[0,0,0]])

# Connect to MySQL Server
mydb = mysql.connector.connect(host='localhost',
                               database='db',
                               user='user',
                               password='pass')

# Main loop
while True:
    # SQL query
    sql = "SELECT * FROM table"

    # Read the database, store as a dictionary
    mycursor = mydb.cursor(dictionary=True)
    mycursor.execute(sql)

    # Store data in rows
    myresult = mycursor.fetchall()

    # Transfer data into list
    for row in myresult:
        myList[int(row["rowID"])] = (row["a"], row["b"], row["c"])
        print(myList[int(row["rowID"])])

    print("---")
    sleep(0.1)
I have tried using fetchall, fetchmany, and fetchone.
You need to commit the connection after each query. This commits the current transaction and ensures that the next (implicit) transaction will pick up changes made while the previous transaction was active.
# Main loop
while True:
    # SQL query
    sql = "SELECT * FROM table"

    # Read the database, store as a dictionary
    mycursor = mydb.cursor(dictionary=True)
    mycursor.execute(sql)

    # Store data in rows
    myresult = mycursor.fetchall()

    # Transfer data into list
    for row in myresult:
        myList[int(row["rowID"])] = (row["a"], row["b"], row["c"])
        print(myList[int(row["rowID"])])

    # Commit!
    mydb.commit()

    print("---")
    sleep(0.1)
The concept here is isolation levels. From the docs (emphasis mine):
REPEATABLE READ
This is the default isolation level for InnoDB. Consistent reads within the same transaction read the snapshot established by the first read.
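As an alternative to committing after every read, you can enable autocommit or lower the session's isolation level so each SELECT sees fresh data. A minimal sketch (assuming mysql-connector-python, matching the code above):

import mysql.connector

# Option 1: autocommit, so every statement runs in its own transaction.
mydb = mysql.connector.connect(host='localhost', database='db',
                               user='user', password='pass',
                               autocommit=True)

# Option 2: READ COMMITTED, so each consistent read uses a fresh snapshot
# instead of the one established by the first read of the transaction.
cur = mydb.cursor()
cur.execute("SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED")
cur.close()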
I'd make a few changes. First, declare the cursor before the while loop. I would also use a buffered cursor. And finally, close the cursor and the DB connection when you are done. Hope this helps.
import mysql.connector
from mysql.connector import Error
from time import sleep

# Create empty list to store values from database.
listSize = 100
myList = []
for i in range(listSize):
    myList.append([[0,0,0]])

# Connect to MySQL Server
mydb = mysql.connector.connect(host='localhost',
                               database='db',
                               user='user',
                               password='pass')

mycursor = mydb.cursor(buffered=True, dictionary=True)

# Main loop
while True:
    # SQL query
    sql = "SELECT * FROM table"

    # Read the database, store as a dictionary
    mycursor.execute(sql)

    # Store data in rows
    myresult = mycursor.fetchall()

    # Transfer data into list
    for row in myresult:
        myList[int(row["rowID"])] = (row["a"], row["b"], row["c"])
        print(myList[int(row["rowID"])])

    print("---")
    sleep(0.1)

mycursor.close()
mydb.close()
For SQLAlchemy, you need to close the session to see the latest changes:
try:
    results = session.query(TableName).all()
    return results
except Exception as e:
    print(e)
    return e
finally:
    session.close()  # optional, depends on use case
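If closing the session is not desirable, a hedged alternative with the same effect is to end the current transaction explicitly (standard SQLAlchemy session methods):

session.commit()    # or session.rollback(): ends the current transaction,
                    # so the next query starts a new one and sees fresh data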

Python function is executed more than once

I want to get and print all the records from a table in a MySQL DB on a VPS, but when I use a for loop to print the retrieved records, they are printed 2-3 times and not just once.
#!/usr/bin/env python

# Modules imported

# VPS
# Parameters to connect to the DB in the VPS

def connDB():
    global conn
    global cur
    try:
        conn = MySQLdb.connect(DBhost, DBuser, DBpass, DBdb, charset='utf8', use_unicode=True)
        cur = conn.cursor()
        print("...DB VPS connect")
    except:
        print("...DB VPS ERROR")
        pass

def selectallDB(query):
    global conn
    global cur
    try:
        cur.execute(query)
        localrpis = cur.fetchall()
        conn.commit()
        print("... select All OK")
        print('Total Row(s):', cur.rowcount)
        for i in localrpis:
            print(i)
    except:
        print("... select ERROR")
        connDBLocal()
        pass

def getallDB():
    c_select = """
    SELECT * FROM %s
    """%(trpistmsMCSIR)
    selectallDB(c_select)

def checktime(sec):
    # Function to trigger the read-data function every "sec" seconds
    while True:
        res = round(time()%sec)
        if res==0.0:
            getallDB()
        sleep(0.2)  # Changed to 0.5

connDB()
while True:
    checktime(10)
I assume that the for loop inside the try is executed 2 times (sometimes even 3) but I don't get why.
...DB connect
...DB VPS connect
... select All OK
('Total Row(s):', 2L)
('SELECT result OK')
... select All OK
('Total Row(s):', 2L)
('SELECT result OK')
As a workaround, after many changes I got it "working" by changing sleep(0.2) to sleep(0.5), but I'm not sure whether this resolves the problem or whether it's just an illusion that the loop is working as expected.
It is not the for loop, as can be seen from the duplicated ... select All OK lines. The problem is your while loop: round(0.2) equals 0.0, so the condition matches more than once within the same second. That's why it appears fixed when you make the sleep 0.5. Theoretically it may even run 3 times (at offsets 0.0, 0.2, and 0.4 seconds) if the database operation is fast enough.
If you want to run your code every 10 seconds, sleeping 0.5 seconds between checks is a good compromise; a more robust pattern is sketched below.
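For reference, a hedged sketch of a trigger loop that remembers the last run time, so the interval can never fire twice no matter how short the sleep is (the run_every helper name is made up for illustration):

from time import time, sleep

def run_every(sec, func):
    last_run = 0.0
    while True:
        now = time()
        if now - last_run >= sec:   # fires at most once per interval
            func()
            last_run = now
        sleep(0.2)  # polling granularity; safe because of the last_run check

# run_every(10, getallDB)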

Python Thread: can't start new thread

I'm trying to run this code:
def VideoHandler(id):
    try:
        cursor = conn.cursor()
        print "Doing {0}".format(id)
        data = urllib2.urlopen("http://myblogfms2.fxp.co.il/video" + str(id) + "/").read()
        title = re.search("<span class=\"style5\"><strong>([\\s\\S]+?)</strong></span>", data).group(1)
        picture = re.search("#4F9EFF;\"><img src=\"(.+?)\" width=\"120\" height=\"90\"", data).group(1)
        link = re.search("flashvars=\"([\\s\\S]+?)\" width=\"612\"", data).group(1)
        id = id
        print "Done with {0}".format(id)
        cursor.execute("insert into videos (`title`, `picture`, `link`, `vid_id`) values('{0}', '{1}', '{2}', {3})".format(title, picture, link, id))
        print "Added {0} to the database".format(id)
    except:
        pass

x = 1
while True:
    if x != 945719:
        currentX = x
        thread.start_new_thread(VideoHandler, (currentX))
    else:
        break
    x += 1
and it says "can't start new thread"
The real reason for the error is most likely that you create way too many threads (more than 100k!!!) and hit an OS-level limit.
Your code can be improved in many ways besides this:
don't use the low level thread module, use the Thread class in the threading module.
join the threads at the end of your code
limit the number of threads you create to something reasonable: to process all elements, create a small number of threads and let each one process a subset of the whole data (this is what I propose below, but you could also adopt a producer-consumer pattern with worker threads getting their data from a queue.Queue instance)
and never, ever have a bare except: pass statement in your code. Or if you do, don't come crying here if your code does not work and you cannot figure out why. :-)
Here's a proposal:
from threading import Thread
import urllib2
import re

def VideoHandler(id_list):
    for id in id_list:
        try:
            cursor = conn.cursor()
            print "Doing {0}".format(id)
            data = urllib2.urlopen("http://myblogfms2.fxp.co.il/video" + str(id) + "/").read()
            title = re.search("<span class=\"style5\"><strong>([\\s\\S]+?)</strong></span>", data).group(1)
            picture = re.search("#4F9EFF;\"><img src=\"(.+?)\" width=\"120\" height=\"90\"", data).group(1)
            link = re.search("flashvars=\"([\\s\\S]+?)\" width=\"612\"", data).group(1)
            id = id
            print "Done with {0}".format(id)
            cursor.execute("insert into videos (`title`, `picture`, `link`, `vid_id`) values('{0}', '{1}', '{2}', {3})".format(title, picture, link, id))
            print "Added {0} to the database".format(id)
        except:
            import traceback
            traceback.print_exc()

conn = get_some_dbapi_connection()

threads = []
nb_threads = 8
max_id = 945718
for i in range(nb_threads):
    id_range = range(i*max_id//nb_threads, (i+1)*max_id//nb_threads + 1)
    thread = Thread(target=VideoHandler, args=(id_range,))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()  # wait for completion
The OS has a limit on the number of threads, so you can't create more threads than that limit allows.
A thread pool is a good choice for this kind of high-concurrency work.
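A minimal sketch of the pool approach with concurrent.futures (assuming Python 3 or the "futures" backport; max_workers=16 and the handle_one stub are illustrative placeholders, not from the question):

from concurrent.futures import ThreadPoolExecutor

def handle_one(video_id):
    # Placeholder for the per-id work done in VideoHandler (fetch page, parse, insert).
    pass

# The pool keeps at most 16 worker threads alive and hands them ids from the
# iterable, so the OS-level thread limit is never hit.
with ThreadPoolExecutor(max_workers=16) as executor:
    executor.map(handle_one, range(1, 945719))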
