I am using a Jupyter Notebook with Python 3 and connecting to a SQL Server database with pyodbc version 4.0.22.
My goal is to store the SQL results in a pandas DataFrame, but the query is very slow.
Here is the code:
import time

import pandas as pd
import pyodbc

cnxn = pyodbc.connect("DSN=ISTPRD02;"
                      "Trusted_Connection=yes;")
ontem = '20180521'
query = "SELECT LOJA, COUNT(DISTINCT RA) FROM VENDAS_CONTRATO(NOLOCK) WHERE DT_RETIRADA_RA = '" + ontem + "' AND SITUACAO IN ('ABERTO', 'FECHADO') GROUP BY LOJA"
start = time.time()
ra_ontem = pd.read_sql_query(query, cnxn)
end = time.time()
print("Tempo: ", end - start)
Tempo: 26.379971981048584
Since it took so long, I monitored the database server; the query itself takes about 3 seconds to run on the server, as you can see below:
query = "SELECT LOJA, COUNT(DISTINCT RA) FROM VENDAS_CONTRATO(NOLOCK) WHERE DT_RETIRADA_RA = '" + ontem + "' AND SITUACAO IN ('ABERTO', 'FECHADO') GROUP BY LOJA"
start = time.time()
crsr = cnxn.cursor()
crsr.execute(query)
end = time.time()
print("Tempo: ", end - start)
Tempo: 3.7947773933410645
start = time.time()
crsr.fetchone()
end = time.time()
print("Tempo: ", end - start)
Tempo: 0.2396855354309082
start = time.time()
crsr.fetchall()
end = time.time()
print("Tempo: ", end - start)
Tempo: 23.67447066307068
So it seems that my problem is local: once the data has been retrieved from the database server, the Python code is slow at handling it.
But I only have 892 rows!
ra_ontem.shape
(189, 2)
So my question is: how can I make this faster and load the results into a pandas DataFrame?
Thanks
This might get you a bit faster than usual (note that, as far as I know, fetchallarrow() comes from the turbodbc driver rather than pyodbc):
cursor.execute(query)
df = cursor.fetchallarrow().to_pandas()
I had the same issue; it turned out to be simply because tracing was switched on.
Just open the ODBC Data Source Administrator, go to the Tracing tab, and turn off tracing. That completely solved the problem for me.
Your problem is not with pyodbc but with SQL Server. Your code has two problems:
1) You need to create indexes on the columns that appear in the WHERE clause (i.e. DT_RETIRADA_RA and SITUACAO). Note that if you always filter SITUACAO on those same two values, you can use a filtered index. If you already have an index on these two fields, the best solution is to rebuild it.
2) Your query most probably suffers from "parameter sniffing"; you should read more about that.
This is what I am currently doing but it is not fast...
Get the latest timestamp and subtract an hour, then take the average of all the data that lies within that hour. Then do this for a 24-hour period, and eventually longer. I am currently doing it with 24 queries, but I was wondering whether there is a better way in MySQL to do this in a single query. If I run it as 24 queries in Python it takes about 16 seconds to finish. I also tried semicolon-separated multi-statements, but that only gave me about a 2x speed-up. I was hoping to get it done much quicker, as I will eventually do this many times. Here is the Python code for this query:
import pymysql
from datetime import timedelta

db_connection = pymysql.connect(myStuffGoesHere)
count = 24
try:
    with db_connection.cursor() as cursor:
        # first get current server time and subtract 1 day from it
        sql_current = "SELECT now()"
        cursor.execute(sql_current)
        current_time = cursor.fetchall()
        current_time = current_time[0]['now()']
        current_string_time = current_time.strftime("%Y-%m-%d %H:%M:%S")
        previous_times = []
        real_data = []
        previous_time = current_time
        for i in range(count):
            previous_time -= timedelta(days=1)
            previous_string_time = previous_time.strftime("%Y-%m-%d %H:%M:%S")
            previous_times.append(previous_string_time)
        for i in range(count):
            sql_data = "SELECT AVG(value) FROM `000027` WHERE time<'" + current_string_time + "' AND time>'" + previous_times[i] + "'"
            print(sql_data)
            cursor.execute(sql_data)
            data = cursor.fetchall()
            for row in data:
                real_data.append(int(row['AVG(value)']))
except Exception as e:
    print("Exception occurred: {}".format(e))
It is possible to do this in one query.
SELECT HOUR(time) AS hour, AVG(value) AS average
FROM `000027`
WHERE time BETWEEN {start-of-day} and {end-of-day}
GROUP BY HOUR(time)
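As a rough sketch of how that single query could be run from Python with pymysql (the connection arguments are placeholders, and the `000027` table with its time/value columns is taken from the question):

import pymysql
from datetime import datetime, timedelta

# Placeholder connection details; DictCursor so rows come back as dicts.
db_connection = pymysql.connect(host="myhost", user="myuser", password="mypassword",
                                db="mydb", cursorclass=pymysql.cursors.DictCursor)

end_of_day = datetime.now()
start_of_day = end_of_day - timedelta(hours=24)

sql = ("SELECT HOUR(time) AS hour, AVG(value) AS average "
       "FROM `000027` "
       "WHERE time BETWEEN %s AND %s "
       "GROUP BY HOUR(time)")

with db_connection.cursor() as cursor:
    cursor.execute(sql, (start_of_day, end_of_day))
    hourly_averages = cursor.fetchall()   # one row per hour that has data

for row in hourly_averages:
    print(row['hour'], row['average'])

Note that HOUR(time) groups by hour of day, so if the 24-hour window crosses midnight, the same clock hour from two different days ends up averaged together; group on DATE(time) as well if that matters.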
I am working on a script to read from an Oracle table with about 75 columns in one environment and load it into the same table definition in a different environment. Until now I have been using the cx_Oracle cur.execute() method to 'INSERT INTO TABLENAME VALUES(:1,:2,:3..:8);' and then loading the data using the 'cur.execute(sql, conn)' method.
However, this table I'm trying to load has 75+ columns, and writing (:1, :2 ... :75) would be tedious and, I'm guessing, not best practice.
Is there an automated way to loop over the number of columns and automatically fill in the VALUES() portion of the SQL query?
import getpass
import cx_Oracle

user = 'username'
password = getpass.getpass()

dsn_prod = cx_Oracle.makedsn(host, port, service_name='')
connection_prod = cx_Oracle.connect(user, password, dsn_prod)
cursor_prod = connection_prod.cursor()

dsn_dev = cx_Oracle.makedsn(host, port, service_name='')
connection_dev = cx_Oracle.connect(user, password, dsn_dev)
cursor_dev = connection_dev.cursor()

SQL_Read = """SELECT * FROM Table_name_Prod"""

cursor_prod.execute(SQL_Read)
for row in cursor_prod:
    SQL_Load = "INSERT INTO TABLE_NAME_DEV VALUES(:1, :2, :3, :4 ... :75)"  # This part is ugly and tedious.
    cursor_dev.execute(SQL_Load, row)  # This is where I need help.

connection_dev.commit()
cursor_prod.close()
connection_prod.close()
You can do the following which should help not only in reducing code but also in improving performance:
connection_prod = cx_Oracle.connect(...)
cursor_prod = connection_prod.cursor()

# set array size for source cursor to some reasonable value
# increasing this value reduces round-trips but increases memory usage
cursor_prod.arraysize = 500

connection_dev = cx_Oracle.connect(...)
cursor_dev = connection_dev.cursor()

cursor_prod.execute("select * from table_name_prod")
bind_names = ",".join(":" + str(i + 1)
                      for i in range(len(cursor_prod.description)))
sql_load = "insert into table_name_dev values (" + bind_names + ")"

while True:
    rows = cursor_prod.fetchmany()
    if not rows:
        break
    cursor_dev.executemany(sql_load, rows)
    # can call connection_dev.commit() here if you want to commit each batch
The use of cursor.executemany() will significantly help in terms of performance. Hope this helps you out!
I have a task to run 8 identical queries (1 query per country) and return data from a MySQL database. The reason I can't run 1 query covering all countries is that each country needs to have different column names. Also, the results need to be updated daily with a dynamic date range (the last 7 days).

Yes, I could run all countries at once and do the column naming and everything with pandas, but I thought the following solution would be more efficient. My solution was to create a for loop that uses predefined lists of all the countries, their respective dimensions, and date-range variables that change according to the current date.

The problem I'm having is that the MySQL query running in the loop takes much more time than if I run the same query directly in our data warehouse (~140-500 seconds vs. 30 seconds). The solution works with smaller tables from the DWH. The thing is that I don't know which part exactly is causing the problem or how to solve it.
Here is an example of my code with some smaller "tests" implemented in it:
#Import libraries:
from google.cloud import storage
from google.oauth2 import service_account
import mysql.connector
import pandas as pd
import time
from datetime import timedelta, date

#Create a connection to new DWH:
coon = mysql.connector.connect(
    host="the host goes here",
    user="the user goes here",
    passwd="the password goes here"
)

#Create Google Cloud Service credential references:
credentials = service_account.Credentials.from_service_account_file(r'C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\my credential json goes here.json')
project_id = 'my project id goes here'

cursor = coon.cursor()

#Create lists of countries and dimensions
countries = ['EE', 'FI', 'LV', 'LT']
appp_id_dim = ['ga:dimension5', 'ga:dimension5', 'ga:dimension5', 'ga:dimension5']
status_dim = ['ga:dimension21', 'ga:dimension12', 'ga:dimension20', 'ga:dimension15']
score_dim = ['ga:dimension11', 'ga:dimension11', 'ga:dimension19', 'ga:dimension14']

#Define the current date and the date that was 7 days before the current date:
date_now = date.today() - timedelta(days=1)
date_7d_prev = date_now - timedelta(days=7)

#Create a loop
for c, s in zip(countries, score_dim):
    start_time = time.time()
    #Create the query using string formatting:
    query = f"""select ca.ID, sv.subType, SUM(svl.score) as '{s}'
    from aio.CreditApplication ca
    join aio.ScoringResult sr
        on sr.creditApplication_ID = ca.ID
    join aio.ScorecardVariableLine svl
        on svl.id = sr.scorecardVariableLine_ID
    join aio.ScorecardVariable sv
        on sv.ID = svl.scorecardVariable_ID
    where sv.country = '{c}'
    #and sv.subType = "asc"
    and sv.subType != 'fsc'
    and sr.created >= '2020-01-01'
    and sr.created between '{date_7d_prev} 00:00:00' and '{date_now} 23:59:59'
    group by ca.id, sv.subType"""
    #Check of sql query
    print('query is done', time.time() - start_time)

    start_time = time.time()
    sql = pd.read_sql_query(query, coon)
    #Check of assigning sql:
    print('sql is assigned', time.time() - start_time)

    start_time = time.time()
    df = pd.DataFrame(sql
                      #, columns = ['created','ID','state']
                      )
    #Check the df assignment:
    print('df has been assigned', time.time() - start_time)

    #Create a .csv file from the final dataframe:
    start_time = time.time()
    df.to_csv(fr"C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\Testing Ground\{c}_sql_loop_test.csv", index=False, header=True, encoding='utf-8', sep=';')
    #Check csv file creation:
    print('csv has been created', time.time() - start_time)

#Close the session
start_time = time.time()
cursor.close()
#Check the session closing:
print('The cursor is closed', time.time() - start_time)
This example has 4 countries because I tried cutting the amount in half, but that doesn't help either. I was thinking I had hit some sort of query restriction on the DWH end, because the major slowdown always started with the 5th country. Running them separately takes almost the same time for each, but it still takes too long.
So, my tests show that the loop always lags at the step of querying data. Every other step takes less than a second, but the querying time goes up to 140-500 seconds, sometimes even more, as mentioned previously. So, what do you think is the problem?
Found the solution! After talking to a person in my company who has a lot more experience with SQL and our particular DWH engine, he agreed to help and rewrote the SQL part. Instead of left joining a subquery, I had to rewrite it so that there would be no subquery. Why? Because our particular engine doesn't create an index for a subquery, but separately joined tables will have indexes. That improved the run time of the whole script dramatically, from ~40 minutes to less than 1 minute.
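Purely as an illustration of the shape of that rewrite (the table names below are made up, not the real DWH schema): instead of aggregating inside a left-joined subquery, join the underlying table directly and aggregate at the top level, so its indexes can be used.

# Hypothetical before/after, not the actual production query.
slow_query = """
    select a.id, s.total
    from applications a
    left join (select application_id, sum(score) as total
               from scores
               group by application_id) s
        on s.application_id = a.id
"""

fast_query = """
    select a.id, sum(sc.score) as total
    from applications a
    left join scores sc
        on sc.application_id = a.id
    group by a.id
"""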
I'm trying to read data from an Oracle DB.
I have to read in Python the results of a simple select that returns a million rows.
I use the fetchall() function, changing the arraysize property of the cursor.
import time

import cx_Oracle

select_qry = db_functions.read_sql_file('src/data/scripts/03_perimetro_select.sql')
dsn_tns = cx_Oracle.makedsn(ip, port, sid)
con = cx_Oracle.connect(user, pwd, dsn_tns)
start = time.time()
cur = con.cursor()
cur.arraysize = 1000
cur.execute('select * from bigtable where rownum < 10000')
res = cur.fetchall()
# print res # uncomment to display the query results
elapsed = (time.time() - start)
print(elapsed, " seconds")
cur.close()
con.close()
If I remove the where condition where rownum < 10000, the Python environment freezes and the fetchall() function never ends.
After some trials I found a limit for this particular select: it works up to 50k rows, but it fails if I select 60k rows.
What is causing this problem? Do I have to find another way to fetch this amount of data, or is the problem the ODBC connection? How can I test it?
Consider running in batches using Oracle's ROWNUM. To combine back into a single object, append to a growing list. The example below assumes a total row count of 1 million for the table; adjust as needed:
table_row_count = 1000000
batch_size = 10000

# PREPARED STATEMENT (no trailing semicolon when executing through cx_Oracle)
sql = """SELECT t.* FROM
             (SELECT sub_t.*, ROWNUM AS row_num
              FROM
                  (SELECT * FROM bigtable ORDER BY primary_id) sub_t
             ) t
         WHERE t.row_num BETWEEN :LOWER_BOUND AND :UPPER_BOUND"""

data = []
for lower_bound in range(1, table_row_count, batch_size):
    # BIND PARAMS WITH BOUND LIMITS
    cur.execute(sql, {'LOWER_BOUND': lower_bound,
                      'UPPER_BOUND': lower_bound + batch_size - 1})
    for row in cur.fetchall():
        data.append(row)
You are probably running out of memory on the computer running cx_Oracle. Don't use fetchall() because it requires cx_Oracle to hold the entire result set in memory. Use something like this to fetch batches of records:
cursor = connection.cursor()
cursor.execute("select employee_id from employees")
res = cursor.fetchmany(numRows=3)
print(res)
res = cursor.fetchmany(numRows=3)
print(res)
Stick the fetchmany() calls in a loop, process each batch of rows in your app before fetching the next set of rows, and exit the loop when there is no more data.
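For example, a minimal sketch of that loop (reusing the user/pwd/dsn_tns variables from the question; process() is just a stand-in for your own row handling):

import cx_Oracle

connection = cx_Oracle.connect(user, pwd, dsn_tns)
cursor = connection.cursor()
cursor.arraysize = 1000              # rows fetched per round-trip
cursor.execute("select * from bigtable")

while True:
    rows = cursor.fetchmany()        # returns at most cursor.arraysize rows
    if not rows:
        break
    for row in rows:
        process(row)                 # placeholder for per-row handling

cursor.close()
connection.close()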
Whatever solution you use, tune cursor.arraysize to get the best performance.
The suggestion already given to repeat the query and select subsets of rows is also worth considering. If you are using Oracle DB 12, there is a newer (easier) syntax like SELECT * FROM mytab ORDER BY id OFFSET 5 ROWS FETCH NEXT 5 ROWS ONLY.
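A rough sketch of paging that way (assuming the same bigtable with an id column to order by; process() is again a placeholder):

batch_size = 10000
offset = 0
while True:
    cur.execute(
        "select * from bigtable order by id "
        "offset :off rows fetch next :batch rows only",
        off=offset, batch=batch_size)
    rows = cur.fetchall()
    if not rows:
        break
    for row in rows:
        process(row)                 # placeholder for per-row handling
    offset += batch_size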
PS cx_Oracle does not use ODBC.
I'm trying to fetch a list of timestamps from MySQL with Python. Once I have the list, I check the time now and see which ones are more than 15 minutes old. Once I have those, I would really like a final total count. This seems more challenging to pull off than I had originally thought.
So, I'm using this to fetch the list from MySQL:
import MySQLdb

db = MySQLdb.connect(host=server, user=mysql_user, passwd=mysql_pwd, db=mysql_db, connect_timeout=10)
cur = db.cursor()
cur.execute("SELECT heartbeat_time FROM machines")
row = cur.fetchone()
print row
while row is not None:
    print ", ".join([str(c) for c in row])
    row = cur.fetchone()
cur.close()
db.close()
>> 2016-06-04 23:41:17
>> 2016-06-05 03:36:02
>> 2016-06-04 19:08:56
And this is the snippet I use to check whether they are more than 15 minutes old:
from datetime import datetime
import time

fmt = '%Y-%m-%d %H:%M:%S'
d2 = datetime.strptime('2016-06-05 07:51:48', fmt)
d1 = datetime.strptime('2016-06-04 23:41:17', fmt)
d1_ts = time.mktime(d1.timetuple())
d2_ts = time.mktime(d2.timetuple())

result = int(d2_ts - d1_ts) / 60
if str(result) >= 15:
    print "more than 15m ago"
I'm at a loss as to how to combine these, though. Also, now that I put it in writing, there must be an easier/better way to filter these?
Thanks for the suggestions!
You could incorporate the 15-minute check directly into your SQL query. That way there is no need to mess around with timestamps, and IMO it makes the code far easier to read.
If you need data from other columns of your table:
select * from machines where now() > heartbeat_time + INTERVAL 15 MINUTE;
If the total count is the only thing you are interested in:
SELECT count(*) FROM machines WHERE NOW() > heartbeat_time + INTERVAL 15 MINUTE;
That way you can do a cur.fetchone() and get either None or a tuple where the first value is the number of rows with a timestamp older than 15 minutes.
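For example, a minimal sketch reusing the db connection from the question:

cur = db.cursor()
cur.execute("SELECT COUNT(*) FROM machines "
            "WHERE NOW() > heartbeat_time + INTERVAL 15 MINUTE")
(stale_count,) = cur.fetchone()   # COUNT(*) always returns exactly one row
print(stale_count)
cur.close()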
For iterating over a resultset it should be sufficient to write
cur.execute('SELECT * FROM machines')
for row in cur:
print row
because the base cursor already behaves like an iterator using .fetchone().
(all assuming you have timestamps in your DB as you stated in the question)
@user5740843: if str(result) >= 15: will not work as intended. It will always be True because of the str().
I assume the heartbeat_time field is a datetime field.
import datetime

import MySQLdb
import MySQLdb.cursors

db = MySQLdb.connect(host=server, user=mysql_user, passwd=mysql_pwd, db=mysql_db, connect_timeout=10,
                     cursorclass=MySQLdb.cursors.DictCursor)
cur = db.cursor()

ago = datetime.datetime.utcnow() - datetime.timedelta(minutes=15)

try:
    cur.execute("SELECT heartbeat_time FROM machines")
    for row in cur:
        if row['heartbeat_time'] <= ago:
            print row['heartbeat_time'], 'more than 15 minutes ago'
finally:
    cur.close()
    db.close()
If the data size is not that huge, loading all of it into memory is good practice, since that releases the memory buffer on the MySQL server. And with DictCursor, there is not much of a difference between
rows = cur.fetchall()
for r in rows:
and
for r in cur:
They both load all the data to the client. MySQLdb.SSCursor and SSDictCursor instead try to transfer data as needed, which requires the MySQL server to support that.
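If the data does get big, a minimal sketch of the streaming variant (reusing the connection parameters from the question) might look like this:

import MySQLdb
import MySQLdb.cursors

# Server-side cursor: rows are streamed from the server as you iterate
# instead of being buffered on the client first.
db = MySQLdb.connect(host=server, user=mysql_user, passwd=mysql_pwd, db=mysql_db,
                     cursorclass=MySQLdb.cursors.SSDictCursor)
cur = db.cursor()
cur.execute("SELECT heartbeat_time FROM machines")
for row in cur:                    # fetched from the server as needed
    print row['heartbeat_time']
cur.close()
db.close()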