cx_Oracle: fetchall() stops working with big SELECT statements - python

I'm trying to read data from an Oracle DB.
In Python I have to read the results of a simple SELECT that returns about a million rows.
I use the fetchall() function, changing the arraysize property of the cursor.
import time

import cx_Oracle

select_qry = db_functions.read_sql_file('src/data/scripts/03_perimetro_select.sql')
dsn_tns = cx_Oracle.makedsn(ip, port, sid)
con = cx_Oracle.connect(user, pwd, dsn_tns)
start = time.time()
cur = con.cursor()
cur.arraysize = 1000
cur.execute('select * from bigtable where rownum < 10000')
res = cur.fetchall()
# print(res)  # uncomment to display the query results
elapsed = (time.time() - start)
print(elapsed, " seconds")
cur.close()
con.close()
If I remove the condition where rownum < 10000, the Python environment freezes and the fetchall() call never returns.
After some trials I found a limit for this particular SELECT: it works up to 50k rows, but it fails if I select 60k rows.
What is causing this problem? Do I have to find another way to fetch this amount of data, or is the problem the ODBC connection? How can I test it?

Consider running the query in batches using Oracle's ROWNUM. To combine the batches back into a single object, append to a growing list. The code below assumes the table has a total of one million rows; adjust as needed:
table_row_count = 1000000
batch_size = 10000

# prepared statement: number the rows with ROWNUM and bind the window bounds
sql = """SELECT t.*
           FROM (SELECT sub_t.*, ROWNUM AS row_num
                   FROM (SELECT * FROM bigtable ORDER BY primary_id) sub_t) t
          WHERE t.row_num BETWEEN :LOWER_BOUND AND :UPPER_BOUND"""

data = []
for lower_bound in range(1, table_row_count + 1, batch_size):
    # bind the batch limits (ROWNUM starts at 1)
    cur.execute(sql, {'LOWER_BOUND': lower_bound,
                      'UPPER_BOUND': lower_bound + batch_size - 1})
    for row in cur.fetchall():
        data.append(row)

You are probably running out of memory on the computer running cx_Oracle. Don't use fetchall(), because this requires cx_Oracle to hold the entire result set in memory. Use something like this to fetch batches of records:
cursor = connection.cursor()
cursor.execute("select employee_id from employees")
res = cursor.fetchmany(numRows=3)
print(res)
res = cursor.fetchmany(numRows=3)
print(res)
Stick the fetchmany() calls in a loop, process each batch of rows in your app before fetching the next set of rows, and exit the loop when there is no more data.
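For instance, a minimal sketch of such a loop, reusing the cursor from the example above (process() is just a placeholder for whatever your application does with each row):
cursor.arraysize = 1000                  # fetchmany() defaults its batch size to arraysize
cursor.execute("select * from bigtable")
while True:
    rows = cursor.fetchmany()            # at most cursor.arraysize rows per call
    if not rows:
        break
    for row in rows:
        process(row)                     # hypothetical per-row processing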
Whatever solution you use, tune cursor.arraysize to get the best performance.
The suggestion already given, to repeat the query and select subsets of rows, is also worth considering. If you are using Oracle Database 12c or later there is a newer (easier) syntax, for example SELECT * FROM mytab ORDER BY id OFFSET 5 ROWS FETCH NEXT 5 ROWS ONLY.
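For completeness, a hedged sketch of paging that way with bind variables (it assumes bigtable can be ordered by a column named id; the page size is arbitrary):
page_size = 10000
offset = 0
data = []
while True:
    cursor.execute("select * from bigtable order by id "
                   "offset :off rows fetch next :cnt rows only",
                   off=offset, cnt=page_size)
    rows = cursor.fetchall()
    if not rows:
        break
    data.extend(rows)
    offset += page_size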
PS cx_Oracle does not use ODBC.

Related

Selecting more than 1000000 rows from table with SQL, Python, Numpy, Pandas, and ODBC

I have a database that I need to extract a million rows from; the problem is I cannot select more than 100,000 at a time, and when I do select 100,000 the query takes very long, up to 10 minutes, to execute. This needs to be done once a day, so it is not a problem for the final project, but it is for debugging.
The following code works to select 100,000 rows and create a NumPy array, which is turned into a Pandas dataframe.
How can I extend it to run several times and select 1,000,000 rows, either by continuously appending to table_array or to df, or can I just append to the cursor directly with a few more lines of SQL?
What I tried so far without success is cursor.execute( "DECLARE # count int; SET # count = (SELECT COUNT(*) FROM vYield; SELECT TOP(#count) * FROM vYield")
A note here: I don't know how many rows the database may have. I am assuming that if it's fewer than 1,000,000 it will return the maximum number of rows and not crash; in that situation I just want to append the maximum number of rows available.
import sys

import numpy
import pandas
import pyodbc

cnxn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER=' + server +
                      ';DATABASE=' + database + ';ENCRYPT=yes;UID=' + username +
                      ';PWD=' + password + ';Authentication=ActiveDirectoryInteractive')
cursor = cnxn.cursor()
cursor.execute("select top 100000 * from vYield")
table_array = numpy.empty((0, 3))
with open('C:\\xampp\\htdocs\\dataout.txt', 'w') as f:
    sys.stdout = f  # change the standard output to the file we created
    row = cursor.fetchone()
    while row:
        table_array = numpy.append(
            table_array, numpy.array([[row[27], row[8], row[19]]]), axis=0)
        row = cursor.fetchone()
table_array = table_array[table_array[:, 0].argsort()]
df = pandas.DataFrame(table_array, columns=['Serial', 'Test', 'Fail'])
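One way to extend this, sketched under the assumption that vYield can be ordered deterministically by some column (a hypothetical Serial column is used below), is to page through the view with OFFSET/FETCH and append one batch at a time until an empty page comes back:
batch_size = 100000
offset = 0
table_array = numpy.empty((0, 3))
while True:
    cursor.execute("SELECT * FROM vYield ORDER BY Serial "
                   "OFFSET ? ROWS FETCH NEXT ? ROWS ONLY",
                   offset, batch_size)
    rows = cursor.fetchall()
    if not rows:
        break
    # build each batch in one go rather than calling numpy.append per row
    table_array = numpy.append(
        table_array,
        numpy.array([[row[27], row[8], row[19]] for row in rows]),
        axis=0)
    offset += batch_size
Appending one array per batch (rather than per row) also avoids repeatedly copying the growing array, which is usually where most of the time goes.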

More Efficient Way To Insert Dataframe into SQL Server

I am trying to update a SQL table with updated information that is in a pandas dataframe.
I have about 100,000 rows to iterate through and it's taking a long time. Is there any way I can make this code more efficient? Do I even need to truncate the data? Most rows will probably be the same.
import pyodbc

conn = pyodbc.connect("Driver={xxx};"
                      "Server=xxx;"
                      "Database=xxx;"
                      "Trusted_Connection=yes;")
cursor = conn.cursor()
cursor.execute('TRUNCATE TABLE dbo.Sheet1$')
for index, row in df_union.iterrows():
    print(row)
    cursor.execute("INSERT INTO dbo.Sheet1$ (Vendor, Plant) values(?,?)",
                   row.Vendor, row.Plant)
Update: This is what I ended up doing.
import urllib.parse

import pandas as pd
from sqlalchemy import create_engine

params = urllib.parse.quote_plus(r'DRIVER={xxx};SERVER=xxx;DATABASE=xxx;Trusted_Connection=yes')
conn_str = 'mssql+pyodbc:///?odbc_connect={}'.format(params)
engine = create_engine(conn_str)
df = pd.read_excel('xxx.xlsx')
print("loaded")
df.to_sql(name='tablename', schema='dbo', con=engine, if_exists='replace',
          index=False, chunksize=1000, method='multi')
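If the load is still slow, one alternative worth trying (assuming SQLAlchemy 1.3+ with the mssql+pyodbc dialect) is enabling pyodbc's fast_executemany on the engine and letting to_sql use its default insert method:
# fast_executemany batches parameter sets on the pyodbc side;
# with it enabled, the default to_sql method is usually faster than method='multi'
engine = create_engine(conn_str, fast_executemany=True)
df.to_sql(name='tablename', schema='dbo', con=engine,
          if_exists='replace', index=False, chunksize=1000)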
Don't use for loops or cursors, just SQL:
insert into TABLENAMEA (A,B,C,D)
select A,B,C,D from TABLENAMEB
Take a look at this link to see another demo:
https://www.sqlservertutorial.net/sql-server-basics/sql-server-insert-into-select/
You just need to update this part to run a plain INSERT:
conn = pyodbc.connect("Driver={xxx};"
                      "Server=xxx;"
                      "Database=xxx;"
                      "Trusted_Connection=yes;")
cursor = conn.cursor()
cursor.execute('insert into TABLENAMEA (A,B,C,D) select A,B,C,D from TABLENAMEB')
You don't need to store the dataset in a variable; just run the query directly as normal SQL. Performance will be better than with iteration.

Python cx_Oracle Insert Into table with multiple columns, automating the values (:1, :2 ... :100)

I am working on a script to read from an Oracle table with about 75 columns in one environment and load it into the same table definition in a different environment. Until now I have been using the cx_Oracle cur.execute() method with 'INSERT INTO TABLENAME VALUES(:1,:2,:3..:8)' and then loading the data with 'cur.execute(sql, conn)'.
However, this table that I'm trying to load has 75+ columns, and writing (:1, :2 ... :75) by hand would be tedious and, I'm guessing, not best practice.
Is there an automated way to loop over the number of columns and automatically fill in the VALUES() portion of the SQL query?
user = 'username'
pwd = getpass.getpass()
dsn_prod = cx_Oracle.makedsn(host, port, service_name='')
connection_prod = cx_Oracle.connect(user, pwd, dsn_prod)
cursor_prod = connection_prod.cursor()
dsn_dev = cx_Oracle.makedsn(host, port, service_name='')
connection_dev = cx_Oracle.connect(user, pwd, dsn_dev)
cursor_dev = connection_dev.cursor()
SQL_Read = """Select * from Table_name_Prod"""
Data = cursor_prod.execute(SQL_Read)
for row in Data:
    # This part is ugly and tedious -- this is where I need help
    SQL_Load = "INSERT INTO TABLE_NAME_DEV VALUES(:1, :2, :3, :4 ... :75)"
    cursor_dev.execute(SQL_Load, row)
connection_dev.commit()
cursor_prod.close()
connection_prod.close()
You can do the following, which should help not only in reducing code but also in improving performance:
connection_prod = cx_Oracle.connect(...)
cursor_prod = connection_prod.cursor()

# set array size for source cursor to some reasonable value
# increasing this value reduces round-trips but increases memory usage
cursor_prod.arraysize = 500

connection_dev = cx_Oracle.connect(...)
cursor_dev = connection_dev.cursor()

cursor_prod.execute("select * from table_name_prod")
bind_names = ",".join(":" + str(i + 1)
                      for i in range(len(cursor_prod.description)))
sql_load = "insert into table_name_dev values (" + bind_names + ")"

while True:
    rows = cursor_prod.fetchmany()
    if not rows:
        break
    cursor_dev.executemany(sql_load, rows)
    # can call connection_dev.commit() here if you want to commit each batch
The use of cursor.executemany() will significantly help in terms of performance. Hope this helps you out!

python - SQL Select Conditional statements in python

This is my R code, but I want to do the same thing in Python. As I am new to it and having problems writing the correct code, can anybody guide me on how to write this in Python? I have already made the database connection and also tried simple queries, but here I am struggling.
sql_command <- "SELECT COUNT(DISTINCT Id) FROM \"Bowlers\";"
total<-as.numeric(dbGetQuery(con, sql_command))
data<-setNames(data.frame(matrix(ncol=8,
nrow=total)),c("Name","Wkts","Ave","Econ","SR","WicketTaker","totalovers",
"Matches"))
for (i in 1:total){
sql_command <- paste("SELECT * FROM \"Bowlers\" where Id = ", i ,";",
sep="")
p<-dbGetQuery(con, sql_command)
p[is.na(p)] <- 0
data$Name[i] = p$bowler[1]
}
After this, which works fine, how should I proceed to write the loop code?
with engine.connect() as con:
    rs = con.execute('SELECT COUNT(DISTINCT id) FROM "Bowlers"')
    for row in rs:
        print(row)
Use the format method for strings in Python to achieve it.
I am using PostgreSQL, but your connection should be similar. Something like this:
Connect to the test database:
import psycopg2
con = psycopg2.connect("dbname='test' user='your_user' host='your_host' password='your_password'")
cur = con.cursor() # cursor method may differ for your connection
Loop over your ids:
for i in range(1, total + 1):
    sql_command = 'SELECT * FROM "Bowlers" WHERE id = {}'.format(i)
    cur.execute(sql_command)  # execute and fetchall methods may differ
    rows = cur.fetchall()     # check for your connection
    print("output first row for id = {}".format(i))
    print(rows[0])  # sanity check, printing first row for each id
    print('\n')     # rows is a list of tuples
    # you can convert them into numpy arrays
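As a small variant under the same assumptions, psycopg2 can also bind the id as a query parameter instead of formatting it into the SQL string, which avoids quoting problems:
for i in range(1, total + 1):
    # %s is psycopg2's bind placeholder; the value is passed as a tuple
    cur.execute('SELECT * FROM "Bowlers" WHERE id = %s', (i,))
    rows = cur.fetchall()
    print("output first row for id = {}".format(i))
    print(rows[0])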

Python foreach from a MySQLdb

I'm trying to fetch a list of timestamps from MySQL in Python. Once I have the list, I check the current time and see which timestamps are more than 15 minutes old. Once I have those, I would really like a final total count. This seems more challenging to pull off than I had originally thought.
So, I'm using this to fetch the list from MySQL:
db = MySQLdb.connect(host=server, user=mysql_user, passwd=mysql_pwd,
                     db=mysql_db, connect_timeout=10)
cur = db.cursor()
cur.execute("SELECT heartbeat_time FROM machines")
row = cur.fetchone()
print row
while row is not None:
    print ", ".join([str(c) for c in row])
    row = cur.fetchone()
cur.close()
db.close()
>> 2016-06-04 23:41:17
>> 2016-06-05 03:36:02
>> 2016-06-04 19:08:56
And this is the snippet I use to check whether they are more than 15 minutes old:
fmt = '%Y-%m-%d %H:%M:%S'
d2 = datetime.strptime('2016-06-05 07:51:48', fmt)
d1 = datetime.strptime('2016-06-04 23:41:17', fmt)
d1_ts = time.mktime(d1.timetuple())
d2_ts = time.mktime(d2.timetuple())
result = int(d2_ts - d1_ts) / 60
if str(result) >= 15:
    print "more than 15m ago"
I'm at a loss as to how to combine these, though. Also, now that I put it in writing, there must be an easier/better way to filter these?
Thanks for the suggestions!
You could incorporate the 15-minute check directly into your SQL query. That way there is no need to mess around with timestamps, and IMO the code is far easier to read.
If you need some data from other columns of your table:
select * from machines where now() > heartbeat_time + INTERVAL 15 MINUTE;
If the total count is the only thing you are interested in:
SELECT count(*) FROM machines WHERE NOW() > heartbeat_time + INTERVAL 15 MINUTE;
That way you can do a cur.fetchone() and get back a tuple whose first value is the number of rows with a timestamp older than 15 minutes.
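A minimal sketch of that, reusing the cur from the question's code (a COUNT(*) query always returns exactly one row):
cur.execute("SELECT COUNT(*) FROM machines "
            "WHERE NOW() > heartbeat_time + INTERVAL 15 MINUTE")
(stale_count,) = cur.fetchone()  # single row, single column
print "%d machine(s) last sent a heartbeat more than 15 minutes ago" % stale_count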
For iterating over a resultset it should be sufficient to write
cur.execute('SELECT * FROM machines')
for row in cur:
    print row
because the base cursor already behaves like an iterator using .fetchone().
(all assuming you have timestamps in your DB as you stated in the question)
#user5740843: if str(result) >= 15: will not work as intended. This will always be True because of the str().
I assume the heartbeat_time field is a datetime field.
import datetime
import MySQLdb
import MySQLdb.cursors

db = MySQLdb.connect(host=server, user=mysql_user, passwd=mysql_pwd,
                     db=mysql_db, connect_timeout=10,
                     cursorclass=MySQLdb.cursors.DictCursor)
cur = db.cursor()
ago = datetime.datetime.utcnow() - datetime.timedelta(minutes=15)
try:
    cur.execute("SELECT heartbeat_time FROM machines")
    for row in cur:
        if row['heartbeat_time'] <= ago:
            print row['heartbeat_time'], 'more than 15 minutes ago'
finally:
    cur.close()
    db.close()
If the data size is not that huge, loading all of it into memory is good practice, since it releases the memory buffer on the MySQL server. And with DictCursor there is not much of a difference between
rows = cur.fetchall()
for r in rows:
and
for r in cur:
Both load the data onto the client. MySQLdb.SSCursor and SSDictCursor will transfer data as needed, although that requires the MySQL server to support it.
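If the result set is too big to buffer client-side, a hedged sketch of the streaming variant (same connection parameters as the earlier example) would be:
import MySQLdb
import MySQLdb.cursors

# SSDictCursor streams rows from the server instead of buffering them all on the client
db = MySQLdb.connect(host=server, user=mysql_user, passwd=mysql_pwd,
                     db=mysql_db, connect_timeout=10,
                     cursorclass=MySQLdb.cursors.SSDictCursor)
cur = db.cursor()
try:
    cur.execute("SELECT heartbeat_time FROM machines")
    for row in cur:  # rows are fetched as you iterate, not all at once
        print row['heartbeat_time']
finally:
    cur.close()  # read or discard the full result before reusing the connection
    db.close()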
