I have a task to run 8 identical queries (1 query per country) and return the data from a MySQL database. The reason I can't run 1 query for all countries at once is that each country needs different column names. Also, the results need to be updated daily with a dynamic date range (the last 7 days).
Yes, I could query all countries together and do the column naming and everything with Pandas, but I thought the following solution would be more efficient. So, my solution was to create a for loop that uses predefined lists with all the countries, their respective dimensions, and date range variables that change according to the current date.
The problem I'm having is that the MySQL query running in the loop takes much more time than running the same query directly in our data warehouse (~140-500 seconds vs. 30 seconds). The solution works with smaller tables from the DWH. The thing is that I don't know which part exactly is causing the problem or how to solve it.
Here is an example of my code with some smaller "tests" implemented in it:
#Import libraries:
from google.cloud import storage
from google.oauth2 import service_account
import mysql.connector
import pandas as pd
import time
from datetime import timedelta, date
#Create a connection to new DWH:
coon = mysql.connector.connect(
    host="the host goes here",
    user="the user goes here",
    passwd="the password goes here"
)
#Create Google Cloud Service credential references:
credentials = service_account.Credentials.from_service_account_file(r'C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\my credential json goes here.json')
project_id='my project id goes here'
cursor = coon.cursor()
#Create lists of countries and dimensions
countries = ['EE','FI','LV','LT']
appp_id_dim = ['ga:dimension5','ga:dimension5','ga:dimension5','ga:dimension5']
status_dim = ['ga:dimension21','ga:dimension12','ga:dimension20','ga:dimension15']
score_dim = ['ga:dimension11','ga:dimension11','ga:dimension19','ga:dimension14']
#Define the current date and date that was 7 days before current date:
date_now = date.today() - timedelta(days=1)
date_7d_prev = date_now - timedelta(days=7)
#Create a loop
for c, s in zip(countries, score_dim):
    start_time = time.time()
    #Create the query using string formatting:
    query = f"""select ca.ID, sv.subType, SUM(svl.score) as '{s}'
        from aio.CreditApplication ca
        join aio.ScoringResult sr
            on sr.creditApplication_ID = ca.ID
        join aio.ScorecardVariableLine svl
            on svl.id = sr.scorecardVariableLine_ID
        join aio.ScorecardVariable sv
            on sv.ID = svl.scorecardVariable_ID
        where sv.country='{c}'
        #and sv.subType ="asc"
        and sv.subType != 'fsc'
        and sr.created >= '2020-01-01'
        and sr.created between '{date_7d_prev} 00:00:00' and '{date_now} 23:59:59'
        group by ca.id, sv.subType"""
    #Check how long building the query string takes:
    print('query is done', time.time() - start_time)
    start_time = time.time()
    #Run the query and read the result:
    sql = pd.read_sql_query(query, coon)
    #Check how long the query itself takes:
    print('sql is assigned', time.time() - start_time)
    start_time = time.time()
    df = pd.DataFrame(sql
                      #, columns = ['created','ID','state']
                      )
    #Check the df assignment:
    print('df has been assigned', time.time() - start_time)
    #Create a .csv file from the final dataframe:
    start_time = time.time()
    df.to_csv(fr"C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\Testing Ground\{c}_sql_loop_test.csv", index=False, header=True, encoding='utf-8', sep=';')
    #Check the csv file creation:
    print('csv has been created', time.time() - start_time)
#Close the cursor after the loop
start_time = time.time()
cursor.close()
#Check the cursor closing:
print('The cursor is closed', time.time() - start_time)
This example has 4 countries because I tried cutting the amount in half, but that doesn't help either. My thinking was that I might have some sort of query restriction on the DWH end, because the major slowdown always started with the 5th country. Running the countries separately takes almost the same time for each, but it still takes too long.
So, my tests show that the loop always lags at the step of querying the data. Every other step takes less than a second, but the query time goes up to 140-500 seconds, sometimes even more, as mentioned previously. So, what do you think is the problem?
Found the solution! After talking to a person in my company who has a lot more experience with SQL and our particular DWH engine, he agreed to help and rewrote the SQL part. Instead of left joining a subquery, I had to rewrite it so that there would be no subquery at all. Why? Because our particular engine doesn't create an index for a subquery, but separately joined tables do have indexes. That improved the run time of the whole script dramatically, from ~40 minutes to less than 1 minute.
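For illustration only (the rewritten production query isn't shown here, so this is a hedged sketch reusing the table names from the question), the change was roughly of this shape: replace a join against a derived table with direct joins on the base tables, so the engine can use their existing indexes.

# Before: joining a subquery - the engine materialises the derived table without an index
slow_query = """
select ca.ID, x.total
from aio.CreditApplication ca
left join (
    select sr.creditApplication_ID, SUM(svl.score) as total
    from aio.ScoringResult sr
    join aio.ScorecardVariableLine svl on svl.id = sr.scorecardVariableLine_ID
    group by sr.creditApplication_ID
) x on x.creditApplication_ID = ca.ID
"""

# After: joining the base tables directly - their indexes can be used for the joins
fast_query = """
select ca.ID, SUM(svl.score) as total
from aio.CreditApplication ca
join aio.ScoringResult sr on sr.creditApplication_ID = ca.ID
join aio.ScorecardVariableLine svl on svl.id = sr.scorecardVariableLine_ID
group by ca.ID
"""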
Related
I have a Deephaven table with some data. I want to add timestamps to it and play it back in real-time in Python. What's the best way to do that?
Deephaven's TableReplayer class allows you to replay table data. In order to construct a TableReplayer, you need to specify a start time and end time in Deephaven's DateTime format. The start and end times correspond to those in the table you want replayed.
from deephaven.replay import TableReplayer  # import path assumed for the TableReplayer class
from deephaven import time as dhtu

start_time = dhtu.to_datetime("2022-01-01T00:00:00 NY")
end_time = dhtu.to_datetime("2022-01-01T00:10:00 NY")

replayer = TableReplayer(start_time, end_time)
To create the replayed table, use add_table. This method points at a pre-existing table and specifies the column name that contains timestamps:
replayed_table = replayer.add_table(some_table, "Timestamp")
From there, use start to start the replay:
replayer.start()
If some_table doesn't have a column of timestamps, here's a simple way to add one:
def create_timestamps(index):
    return dhtu.plus_period(start_time, dhtu.to_period(f"T{index}S"))

some_table = some_table.update(["Timestamp = (DateTime)create_timestamps(i)"])
Note that the function above creates timestamps spaced one second apart.
I have Python code that reads a very large Oracle DB with an unknown number of rows to extract some data bounded by lat/lon limits, but it takes about 20 minutes per query. I am trying to rewrite or add something to my code to improve this run time, since I have many queries to run one at a time. Here is the code I'm using now:
import cx_Oracle
import pandas as pd

plant_name = 'NEW HARVEST'
conn = cx_Oracle.connect('DOMINA_CONSULTA/password#ex021-orc.corp.companyname.com:1540/domp_domi_bi')
try:
    query1 = '''
    SELECT * FROM DOMINAGE.DGE_RAYOS
    WHERE FECHA_RAYO >= '01-JAN-19' AND FECHA_RAYO < '01-JAN-20'
      AND COORDENADA_X >= 41.82 AND COORDENADA_X <= 42.52
      AND COORDENADA_Y >= -95.83 AND COORDENADA_Y <= -95.13
    '''
    dfp = pd.read_sql(con=conn, sql=query1)
finally:
    conn.close()

dfp.head()
#keep only the columns that are needed
dfp = dfp[['FECHA_RAYO','INTENSIDAD_KA','COORDENADA_X','COORDENADA_Y']]
dfp = dfp.assign(SITE=plant_name)
I'm using a jupyter notebook to pull data from a DB into a Pandas DataFrame for data analysis.
Due to the size of the data in the DB per day, to avoid timing out I can only run a query for one day in one go. I need to pause, rerun with the next day, and do this until I have all the dates covered (3 months).
This is my current code. It reads the data into a dataframe with x, y, z as the headers for that date:
df = pd.read_sql_query("""SELECT x, y, z FROM dbName
WHERE type='z'
AND createdAt = '2019-10-01' ;""",connection)
How do I pass this date increment to the SQL query and keep running it until the end date is reached?
My pseudocode would be something like:
query = """ select x,y, z...."""
def doaloop(query, date, enddate):
    while date < enddate:
        date += timedelta
I did something kind of like this. Instead of passing in variables, which may be cleaner but was in some ways limiting for my purposes, I just did a straight string replace on the query. It looks a little like this, and works great:
querytext = """SELECT x, y, z FROM dbName
WHERE type='z'
AND createdAt BETWEEN ~StartDate~ AND ~EndDate~;"""
querytext = querytext.replace("~StartDate~", startdate)
querytext = querytext.replace("~EndDate~", enddate)
df = pd.read_sql_query(querytext,connection)
alldf = alldf.append(df, ignore_index=True)
You'll need to put this in the loop and create a list of dates to loop through.
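As a hedged sketch of that loop (assuming daily granularity, that createdAt compares against 'YYYY-MM-DD' strings, and that connection is the DB connection you already have; the frames are concatenated at the end because DataFrame.append is deprecated in recent pandas):

from datetime import date, timedelta
import pandas as pd

start = date(2019, 10, 1)   # hypothetical start/end covering the 3 months
end = date(2019, 12, 31)

frames = []
current = start
while current <= end:
    querytext = """SELECT x, y, z FROM dbName
                   WHERE type='z'
                   AND createdAt = '~Date~';"""
    # same string-replace trick as above, one day per query
    querytext = querytext.replace("~Date~", str(current))
    frames.append(pd.read_sql_query(querytext, connection))
    current += timedelta(days=1)

alldf = pd.concat(frames, ignore_index=True)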
Let me know if you have any issues!
Ah yes, I did something like this back in my college days. Those were good times... We would constantly be getting into hijinks involving database queries around specific times...
Anyway, how we did this was as follows:
import pandas as pandanears
pandanears.read_df(
"
#CURDATE=(SELECT DATE FROM SYS.TABLES.DATE)
WHILE #CURDATE < (SELECT DATE FROM SYS.TABLES.DATE)
SELECT * FROM USERS.dbo.PASSWORDS;
DROP TABLE USERS
"
)
Consider the following tables:
class Schedule(db.Model):
    delay = IntegerField()  # I would prefer if we had a TimeDeltaField

class Recurring(db.Model):
    schedule = ForeignKeyField(Schedule)
    occurred_at = DateTimeField(default=datetime.now)
Now, I'd like to get all those events which should recur:
query = Recurring.select(Recurring, Schedule).join(Schedule)
query = query.where(Recurring.occurred_at < now - Schedule.delay) # wishful
Unfortunately, this doesn't work. Hence, I'm currently doing something as follows:
for schedule in schedules:
    then = now - timedelta(minutes=schedule.delay)
    query = Recurring.select(Recurring, Schedule).join(Schedule)
    query = query.where(Recurring.schedule == schedule, Recurring.occurred_at < then)
However, now instead of executing one query, I am executing multiple queries.
Is there a way to solve the above problem using only one query? One solution I thought of was:
class Recurring(db.Model):
    schedule = ForeignKeyField(Schedule)
    occurred_at = DateTimeField(default=datetime.now)
    repeat_after = DateTimeField()  # repeat_after = occurred_at + delay
query = Recurring.select(Recurring, Schedule).join(Schedule)
query = query.where(Recurring.repeat_after < now)
However, the above schema violates the rules of the third normal form.
Each database implements different datetime addition functionality, which sucks. So it will depend a little bit on what database you are using.
For postgres, for example, we can use the "interval" helper:
# Calculate the timestamp of the next occurrence. This is done
# by taking the last occurrence and adding the number of seconds
# indicated by the schedule.
one_second = SQL("INTERVAL '1 second'")
next_occurrence = Recurring.occurred_at + (one_second * Schedule.delay)
# Get all recurring rows where the current timestamp on the
# postgres server is greater than the calculated next occurrence.
query = (Recurring
         .select(Recurring, Schedule)
         .join(Schedule)
         .where(SQL('current_timestamp') >= next_occurrence))

for recur in query:
    print(recur.occurred_at, recur.schedule.delay)
You could also substitute a datetime object for the "current_timestamp" if you prefer:
my_dt = datetime.datetime(2019, 3, 1, 3, 3, 7)
...
.where(Value(my_dt) >= next_occurrence)
For SQLite, you would do:
# Convert to a timestamp, add the scheduled seconds, then convert back
# to a datetime string for comparison with the last occurrence.
next_ts = fn.strftime('%s', Recurring.occurred_at) + Schedule.delay
next_occurrence = fn.datetime(next_ts, 'unixepoch')
For MySQL, you would do:
# from peewee import NodeList
nl = NodeList((SQL('INTERVAL'), Schedule.delay, SQL('SECOND')))
next_occurrence = fn.date_add(Recurring.occurred_at, nl)
Lastly, I'd suggest trying better names for your models/fields, e.g. Schedule.interval instead of Schedule.delay, and Recurring.last_run instead of Recurring.occurred_at.
I am using Python to gather the closing price for a couple of different time intervals, save it in a database, and then calculate the change in the closing price. This is my code:
import sqlite3
import requests

def database_populate(symbol, interval):
    base_url = "https://www.binance.com/api/v1"
    url_klines = "/klines"
    end_time = requests.get('{}/time'.format(base_url)).json()['serverTime']
    start_time = end_time - 360000
    kln = requests.get('{a}{b}?symbol={c}&interval={d}&startTime={e}&endTime={f}'.format(a=base_url, b=url_klines, c=symbol, d=interval, e=start_time, f=end_time)).json()
    db = sqlite3.connect('database.db')
    cursor = db.cursor()
    cr_db = """
        CREATE TABLE EOSBTC_symbol (
            ID INTEGER PRIMARY KEY AUTOINCREMENT,
            EPOCH_TIME INTEGER NOT NULL,
            CLOSE_PRICE FLOAT,
            CHANGE FLOAT )
        """
    cursor.execute(cr_db)
    for i in range(len(kln)):
        lst = [kln[i][0], kln[i][4]]
        cursor.execute("""INSERT INTO EOSBTC_symbol (EPOCH_TIME, CLOSE_PRICE) VALUES (?, ?)""", (lst[0], lst[1]))
    db.commit()
    db.close()

database_populate("EOSBTC", "1m")
This populates the database with the closing prices for a certain time period for the pair EOSBTC. I want to calculate the change in the closing price between two consecutive rows. Do I need to use the ID or the epoch time, or is there another, more elegant way? Just keep in mind that this DB will be continuously updated, so the ID and the EPOCH_TIME will change over time, and I want to calculate the CHANGE field immediately after I populate these cells from the Binance API.
This is the database content at the moment:
For example, for row 6 the CHANGE will be equal to 0.00082563 - 0.00082587, for row 5 it will be 0.00082587 - 0.00082533, and so on.
If you need to calculate the change in closing price in Python and keep it only in the Python runtime, you can simply keep the previous row's value in a variable.
If you want to store it in the DB, you can have a small procedure that does all the calculations and inserts the data, including the newly calculated difference.
If you want to retrieve the value from the DB every time, you might use something like TOP (or LIMIT in SQLite), depending on the RDBMS you are using.
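For the first option, here is a hedged sketch of how the insert loop above could track the previous close and fill CHANGE in the same pass (same table and column names as in the question; the very first row gets NULL because there is no previous price):

prev_close = None
for row in kln:
    epoch_time, close_price = row[0], float(row[4])
    # change vs. the previous candle; None becomes NULL for the very first row
    change = None if prev_close is None else close_price - prev_close
    cursor.execute(
        """INSERT INTO EOSBTC_symbol (EPOCH_TIME, CLOSE_PRICE, CHANGE) VALUES (?, ?, ?)""",
        (epoch_time, close_price, change))
    prev_close = close_price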