This is what I am currently doing but it is not fast...
Get the latest timestamp and subtract an hour, then take the average of all the data that lies within that hour. Then repeat this over a 24-hour period, and eventually longer. I am currently doing it with 24 separate queries, but I was wondering whether there is a better way in MySQL to do this in a single query. Running the 24 queries from Python takes about 16 seconds. I also tried semicolon-separated multi-statements, but that only gave me about a 2x speedup. I was hoping to get it done much quicker, as I will eventually run this many times. Here is the Python code for this query...
from datetime import timedelta
import pymysql

db_connection = pymysql.connect(myStuffGoesHere)
count = 24
try:
    with db_connection.cursor() as cursor:
        # first get the current server time; the loop below steps back one hour at a time
        sql_current = "SELECT now()"
        cursor.execute(sql_current)
        current_time = cursor.fetchall()
        current_time = current_time[0]['now()']
        current_string_time = current_time.strftime("%Y-%m-%d %H:%M:%S")
        previous_times = []
        real_data = []
        previous_time = current_time
        for i in range(count):
            previous_time -= timedelta(hours=1)
            previous_string_time = previous_time.strftime("%Y-%m-%d %H:%M:%S")
            previous_times.append(previous_string_time)
        for i in range(count):
            # each window is one hour wide: from previous_times[i] up to the previous boundary
            upper_bound = current_string_time if i == 0 else previous_times[i - 1]
            sql_data = "SELECT AVG(value) FROM `000027` WHERE time<'" + upper_bound + "' AND time>'" + previous_times[i] + "'"
            print(sql_data)
            cursor.execute(sql_data)
            data = cursor.fetchall()
            for row in data:
                real_data.append(int(row['AVG(value)']))
except Exception as e:
    print("Exception occurred: {}".format(e))
It is possible to do this in one query.
SELECT HOUR(time) AS hour, AVG(value) AS average
FROM `000027`
WHERE time BETWEEN {start-of-day} AND {end-of-day}
GROUP BY HOUR(time)
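From Python this becomes a single round trip; a minimal sketch, assuming the same pymysql connection and `000027` table as in the question, with illustrative bounds:

from datetime import datetime, timedelta
import pymysql

db_connection = pymysql.connect(myStuffGoesHere)  # same placeholder connection as above
with db_connection.cursor() as cursor:
    end_time = datetime.now()                     # illustrative bounds
    start_time = end_time - timedelta(hours=24)
    sql = ("SELECT HOUR(time) AS hour, AVG(value) AS average "
           "FROM `000027` "
           "WHERE time BETWEEN %s AND %s "
           "GROUP BY HOUR(time)")
    cursor.execute(sql, (start_time, end_time))   # parameters instead of string concatenation
    hourly_averages = cursor.fetchall()

Note that GROUP BY HOUR(time) folds the same clock hour from different days together, so for ranges longer than 24 hours you would group by DATE(time), HOUR(time) instead.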
Currently I'm getting the datetime of a Discord message in UTC, then adding it to the database in the xp_time column.
My confusion is around retrieving that datetime and comparing it against the current datetime to see whether 60 seconds have passed. I've tried a few different solutions but I can't figure it out.
db = sqlite3.connect('xpdata.db')
cursor = db.cursor()
user_id = int(msg.author.id)
guild_id = msg.guild.id
time_id = msg.created_at
sqlupdate = 'INSERT OR IGNORE INTO xpdata(user_id, guild_id, xp, level, xp_time) VALUES(?,?,?,?,?)'
val = (user_id, guild_id, xp_inc, 0, time_id)
cursor.execute(sqlupdate, val)
db.commit()
Below is roughly where I'm confused. I want to return the difference (if possible), then check whether it's over a 60-second difference, and then continue updating the user's xp.
I know it's written wrong! But that's why I have the question, haha.
#Check for cooldown
select = (f"SELECT julianday('now') - julianday(xp_time) FROM xpdata")
cursor.execute(select)
if select > 59:
    select = (f'SELECT * FROM xpdata WHERE user_id = {user_id} AND guild_id = {guild_id}')
    cursor.execute(select)
    for row in cursor.fetchall():
        xp_grab = row[2]
        lvl_grab = row[3]
[...]
When using julianday, 1 is equal to 1 day, with the time of day as the decimal portion. I'd suggest working in seconds instead, which can be obtained with strftime('%s', time_value), where 1 is 1 second.
The strftime function is very flexible; for example, strftime('%s','now','-59 seconds') is the time 59 seconds ago.
So you could get the difference in seconds with:
SELECT strftime('%s','now') - strftime('%s',xp_time) FROM xpdata;
Example
CREATE TABLE IF NOT EXISTS xpdata (xp_time TEXT);
INSERT INTO xpdata VALUES(datetime('now','-120 seconds')); /* inserts value 120 seconds ago */
SELECT strftime('%s','now') - strftime('%s',xp_time) FROM xpdata;
The above results in 120 (the row was inserted 120 seconds ago).
Other considerations
Alternatively, you could use:
SELECT strftime('%s',xp_time) < strftime('%s','now','-59 seconds') FROM xpdata;
This would return 1 (true for more than 59 seconds) or 0 (false if 59 seconds or less).
Of course, the above could be used in the SELECT, or even to just do the update without any prior checks (you said: "I want to return the difference (if possible) then check if it's over a 60 second difference then continue updating the user's xp").
Along the lines of the following, which adds 10 xp points if xp_time is more than 59 seconds ago:
UPDATE xpdata SET xp = xp + 10 WHERE user_id = 1 AND guild_id = 1 AND strftime('%s',xp_time) < strftime('%s','now','-59 seconds');
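In Python that could look like the sketch below; xp_inc, user_id and guild_id come from your code above, and refreshing xp_time in the same statement (so the cooldown restarts) is an assumption about the intended behaviour:

import sqlite3

db = sqlite3.connect('xpdata.db')
cursor = db.cursor()
# Add xp only if the stored xp_time is more than 59 seconds ago;
# xp_time is refreshed in the same statement (assumed behaviour).
cursor.execute(
    "UPDATE xpdata SET xp = xp + ?, xp_time = datetime('now') "
    "WHERE user_id = ? AND guild_id = ? "
    "AND strftime('%s', xp_time) < strftime('%s', 'now', '-59 seconds')",
    (xp_inc, user_id, guild_id))
db.commit()
awarded = cursor.rowcount == 1  # 1 row updated means the cooldown had expired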
You may wish to consider reading SQLite Date and Time Functions
I have a task to run 8 equal queries (1 query per country) and return data from a MySQL database. The reason I can't run 1 query with all countries in it is that each country needs different column names. Also, the results need to be updated daily with a dynamic date range (the last 7 days). Yes, I could run all countries together and do the column naming and everything with Pandas, but I thought the following solution would be more efficient.

So, my solution was to create a for loop that uses predefined lists of all the countries, their respective dimensions, and date range variables that change according to the current date. The problem I'm having is that the MySQL query run in the loop takes much longer than the same query run directly in our data warehouse (~140-500 seconds vs. 30 seconds). The solution works with smaller tables from the DWH. The thing is that I don't know which part exactly is causing the problem and how to solve it.
Here is an example of my code with some smaller "tests" implemented in it:
#Import libraries:
from google.cloud import storage
from google.oauth2 import service_account
import mysql.connector
import pandas as pd
import time
from datetime import timedelta, date
#Create a connection to new DWH:
conn = mysql.connector.connect(
host="the host goes here",
user="the user goes here",
passwd="the password goes here"
)
#Create Google Cloud Service credential references:
credentials = service_account.Credentials.from_service_account_file(r'C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\my credential json goes here.json')
project_id='my project id goes here'
cursor = conn.cursor()
#Create lists of countries and dimensions
countries = ['EE','FI','LV','LT']
appp_id_dim = ['ga:dimension5','ga:dimension5','ga:dimension5','ga:dimension5']
status_dim = ['ga:dimension21','ga:dimension12','ga:dimension20','ga:dimension15']
score_dim = ['ga:dimension11','ga:dimension11','ga:dimension19','ga:dimension14']
#Define the current date and date that was 7 days before current date:
date_now = date.today() - timedelta(days=1)
date_7d_prev = date_now - timedelta(days=7)
#Create a loop
for c, s in zip(countries, score_dim):
    start_time = time.time()
    #Create the query using string formatting:
    query = f"""select ca.ID, sv.subType, SUM(svl.score) as '{s}'
        from aio.CreditApplication ca
        join aio.ScoringResult sr
            on sr.creditApplication_ID = ca.ID
        join aio.ScorecardVariableLine svl
            on svl.id = sr.scorecardVariableLine_ID
        join aio.ScorecardVariable sv
            on sv.ID = svl.scorecardVariable_ID
        where sv.country='{c}'
        #and sv.subType ="asc"
        and sv.subType != 'fsc'
        and sr.created >= '2020-01-01'
        and sr.created between '{date_7d_prev} 00:00:00' and '{date_now} 23:59:59'
        group by ca.id, sv.subType"""
    #Check of sql query:
    print('query is done', time.time() - start_time)
    start_time = time.time()
    sql = pd.read_sql_query(query, conn)
    #Check of assigning sql:
    print('sql is assigned', time.time() - start_time)
    start_time = time.time()
    df = pd.DataFrame(sql
                      #, columns = ['created','ID','state']
                      )
    #Check the df assignment:
    print('df has been assigned', time.time() - start_time)
    #Create a .csv file from the final dataframe:
    start_time = time.time()
    df.to_csv(fr"C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\Testing Ground\{c}_sql_loop_test.csv", index=False, header=True, encoding='utf-8', sep=';')
    #Check csv file creation:
    print('csv has been created', time.time() - start_time)
#Close the session
start_time = time.time()
cursor.close()
#Check the session closing:
print('The cursor is closed', time.time() - start_time)
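One way to narrow down which part is slow is to time the server-side execution separately from the row transfer, using a plain cursor instead of pd.read_sql_query; a sketch reusing the conn connection and one of the generated queries from above:

import time

cur = conn.cursor()
start_time = time.time()
cur.execute(query)           # sends the query to the server
first_row = cur.fetchone()   # blocks until the server begins returning rows
print('time to first row:', time.time() - start_time)

start_time = time.time()
rest = cur.fetchall()        # transfers the remainder of the result set
print('time to fetch the rest:', time.time() - start_time)
cur.close()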
This example has 4 countries because I tried cutting the amount in half, but that didn't help either. I thought I might have some sort of query restriction on the DWH end, because the major slowdown always started with the 5th country. Running them separately takes almost the same time for each, but it still takes too long.
So, my tests show that the loop always lags at the querying step. Every other step takes less than a second, but the querying time goes up to 140-500 seconds, sometimes even more, as mentioned previously. So, what do you think is the problem?
Found the solution! After talking to a person in my company who has a lot more experience with SQL and our particular DWH engine, he agreed to help and rewrote the SQL part. Instead of left joining a subquery, I had to rewrite the query so that there was no subquery. Why? Because our particular engine doesn't create an index for a subquery, but separately joined tables do have indexes. That improved the run time of the whole script dramatically, from ~40 minutes to less than a minute.
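For illustration only (the actual rewritten query was not posted), the shape of such a rewrite using the table names from the loop above might be:

-- Before (illustrative): the derived table has no index the join can use.
SELECT ca.ID, d.total_score
FROM aio.CreditApplication ca
LEFT JOIN (SELECT sr.creditApplication_ID, SUM(svl.score) AS total_score
           FROM aio.ScoringResult sr
           JOIN aio.ScorecardVariableLine svl ON svl.id = sr.scorecardVariableLine_ID
           GROUP BY sr.creditApplication_ID) d
       ON d.creditApplication_ID = ca.ID;

-- After (illustrative): join the base tables directly and aggregate once at the end,
-- so the engine can use the indexes on the joined tables.
SELECT ca.ID, SUM(svl.score) AS total_score
FROM aio.CreditApplication ca
LEFT JOIN aio.ScoringResult sr ON sr.creditApplication_ID = ca.ID
LEFT JOIN aio.ScorecardVariableLine svl ON svl.id = sr.scorecardVariableLine_ID
GROUP BY ca.ID;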
I have two tables with ~9 to 12 million records that I am joining on their primary key, selecting all columns (30 columns each). I then load 100,000 rows (I have also tried 1,000, 10,000, etc.) into memory at a time to process in chunks.
It takes ~9 seconds in SQL for this query:
9477056 rows returned in 9646ms from: SELECT A.*, B.* FROM 'tableA_2019-12-08' A JOIN 'tableB_2019-12-08' B USING(Symbol);
But I am seeing 250-300 seconds in python.
Here's the python code:
myQuery = "SELECT {} FROM '{}' A JOIN '{}' B USING(Symbol)".format(ColumnsToSelect,table1_name,table2_name)
c.execute(myQuery)
nr = 0
while c.fetchmany(100000) != []: # this will break when no more rows will be returned i.e end of table
nr+=1
print(nr)
results = c.fetchmany(100000)
#code to process results goes here but I commented it out to test the fetchmany speed```
Does anyone know why this is happening?
Edit: Tried changing it to this but I have the same issue, taking 250 seconds:
DataProcessed = False
start_time_diffgen = time.time()
while DataProcessed is not True:  # this will break when no more rows will be returned i.e. end of table
    result = c.fetchmany(1000000)
    nr += 1
    if result == []:
        DataProcessed = True
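For reference, a common shape for this kind of chunked read fetches each chunk exactly once and stops on the empty result; a sketch against the same cursor c, where process() is a hypothetical stand-in for the commented-out per-chunk processing:

CHUNK_SIZE = 100000

c.execute(myQuery)
while True:
    rows = c.fetchmany(CHUNK_SIZE)  # each chunk is fetched exactly once
    if not rows:                    # an empty result means the table is exhausted
        break
    process(rows)                   # hypothetical per-chunk processing function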
I am using Jupyter Notebook with Python 3 and connecting to a SQL Server database. I am using pyodbc version 4.0.22 to connect to the database.
My goal is to store the SQL results in a pandas DataFrame, but the query was very slow.
Here is the code:
import time
import pyodbc
import pandas as pd
cnxn = pyodbc.connect("DSN=ISTPRD02;"
"Trusted_Connection=yes;")
ontem = '20180521'
query = "SELECT LOJA, COUNT(DISTINCT RA) FROM VENDAS_CONTRATO(NOLOCK) WHERE DT_RETIRADA_RA = '" + ontem + "' AND SITUACAO IN ('ABERTO', 'FECHADO') GROUP BY LOJA"
start = time.time()
ra_ontem = pd.read_sql_query(query, cnxn)
end = time.time()
print("Tempo: ", end - start)
Tempo: 26.379971981048584
Since it took a long time, I have monitored the database server, and it takes about 3 seconds to run the query on the server, as you can see below:
query = "SELECT LOJA, COUNT(DISTINCT RA) FROM VENDAS_CONTRATO(NOLOCK) WHERE DT_RETIRADA_RA = '" + ontem + "' AND SITUACAO IN ('ABERTO', 'FECHADO') GROUP BY LOJA"
start = time.time()
crsr = cnxn.cursor()
crsr.execute(query)
end = time.time()
print("Tempo: ", end - start)
Tempo: 3.7947773933410645
start = time.time()
crsr.fetchone()
end = time.time()
print("Tempo: ", end - start)
Tempo: 0.2396855354309082
start = time.time()
crsr.fetchall()
end = time.time()
print("Tempo: ", end - start)
Tempo: 23.67447066307068
So it seems that my problem is local: it happens after the data has already been retrieved from the database server, and it looks like the Python code is slow when dealing with the data.
But I have only 892 lines !
ra_ontem.shape
(189, 2)
So my question is: how can I make this faster and load the results into a pandas DataFrame?
Thanks
This might get you a bit faster than usual. Note that fetchallarrow() comes from turbodbc rather than pyodbc, so this assumes switching drivers:
cursor.execute(query)
df = cursor.fetchallarrow().to_pandas()
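A fuller sketch of that approach, assuming the turbodbc package is installed and reusing the DSN and query from the question:

import turbodbc

# turbodbc, not pyodbc, provides fetchallarrow(); DSN and query as in the question.
connection = turbodbc.connect(dsn="ISTPRD02")
cursor = connection.cursor()
cursor.execute(query)
# The result arrives as an Apache Arrow table and converts to pandas
# without per-row Python objects, which is where fetchall() spends its time.
df = cursor.fetchallarrow().to_pandas()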
I had the same issue, and it was just because tracing was switched on.
Just open ODBC Data Source Administrator, go to the Tracing tab, and turn off tracing. It completely solved the problem for me.
Your problem is not with pyodbc but with SQL Server. Your code has two problems:
1) You need to create indexes on the columns that appear in the WHERE clause (i.e. DT_RETIRADA_RA and SITUACAO). Note that if you always filter SITUACAO on those two constant values, you can use a filtered index. If you already have an index on these two fields, the best solution is to rebuild it.
2) Your query most probably suffers from "parameter sniffing"; that is worth reading up on.
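For illustration (the index name and exact shape are assumptions, not taken from the question), a filtered, covering index for this query could look like:

-- Illustrative only: a filtered index covering the WHERE clause and selected columns.
CREATE NONCLUSTERED INDEX IX_VENDAS_CONTRATO_RETIRADA
    ON VENDAS_CONTRATO (DT_RETIRADA_RA, SITUACAO)
    INCLUDE (LOJA, RA)
    WHERE SITUACAO IN ('ABERTO', 'FECHADO');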
I'm trying to fetch a list of timestamps from MySQL in Python. Once I have the list, I check the time now and see which ones are more than 15 minutes old. Once I have those, I would really like a final total count. This seems more challenging to pull off than I had originally thought.
So, I'm using this to fetch the list from MySQL:
db = MySQLdb.connect(host=server, user=mysql_user, passwd=mysql_pwd, db=mysql_db, connect_timeout=10)
cur = db.cursor()
cur.execute("SELECT heartbeat_time FROM machines")
row = cur.fetchone()
print row
while row is not None:
    print ", ".join([str(c) for c in row])
    row = cur.fetchone()
cur.close()
db.close()
>> 2016-06-04 23:41:17
>> 2016-06-05 03:36:02
>> 2016-06-04 19:08:56
And this is the snippet I use to check if they are longer than 15min ago:
fmt = '%Y-%m-%d %H:%M:%S'
d2 = datetime.strptime('2016-06-05 07:51:48', fmt)
d1 = datetime.strptime('2016-06-04 23:41:17', fmt)
d1_ts = time.mktime(d1.timetuple())
d2_ts = time.mktime(d2.timetuple())
result = int(d2_ts-d1_ts) / 60
if str(result) >= 15:
    print "more than 15m ago"
I'm at a loss as to how to combine these, though. Also, now that I put it in writing, there must be an easier/better way to filter these?
Thanks for the suggestions!
You could incorporate the 15-minute check directly into your SQL query. That way there is no need to mess around with timestamps, and IMO it's far easier to read the code.
If you need some data from other columns of your table:
select * from machines where now() > heartbeat_time + INTERVAL 15 MINUTE;
If the total count is the only thing you are interested in:
SELECT count(*) FROM machines WHERE NOW() > heartbeat_time + INTERVAL 15 MINUTE;
That way you can do a cur.fetchone() and get a tuple whose first value is the number of rows with a timestamp older than 15 minutes.
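Put together, the count variant might look like this sketch, using the cur cursor from the question:

cur.execute("SELECT COUNT(*) FROM machines WHERE NOW() > heartbeat_time + INTERVAL 15 MINUTE")
(stale_count,) = cur.fetchone()  # single row back: the number of machines older than 15 minutes
print "%d machine(s) have not sent a heartbeat in 15 minutes" % stale_count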
For iterating over a resultset it should be sufficient to write
cur.execute('SELECT * FROM machines')
for row in cur:
print row
because the base cursor already behaves like an iterator using .fetchone().
(all assuming you have timestamps in your DB as you stated in the question)
#user5740843: if str(result) >= 15: will not work as intended. It will always be True because of the str(); in Python 2 a string always compares greater than an integer.
I assume the heartbeat_time field is a datetime field.
import datetime
import MySQLdb
import MySQLdb.cursors
db = MySQLdb.connect(host=server, user=mysql_user, passwd=mysql_pwd, db=mysql_db, connect_timeout=10,
                     cursorclass=MySQLdb.cursors.DictCursor)
cur = db.cursor()
ago = datetime.datetime.utcnow() - datetime.timedelta(minutes=15)
try:
    cur.execute("SELECT heartbeat_time FROM machines")
    for row in cur:
        if row['heartbeat_time'] <= ago:
            print row['heartbeat_time'], 'more than 15 minutes ago'
finally:
    cur.close()
    db.close()
If the data size is not that huge, loading all of it into memory is good practice, as it releases the result buffer on the MySQL server sooner. And for DictCursor, there is not much difference between
rows = cur.fetchall()
for r in rows:
and
for r in cur:
They both load all the data to the client. MySQLdb.SSCursor and SSDictCursor, by contrast, transfer data from the server as needed, provided the MySQL server supports it.
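For very large result sets, a streaming cursor might look like this sketch, with the same connection parameters as above:

import MySQLdb
import MySQLdb.cursors

# Streams rows from the server instead of buffering the whole result client-side.
db = MySQLdb.connect(host=server, user=mysql_user, passwd=mysql_pwd, db=mysql_db,
                     cursorclass=MySQLdb.cursors.SSDictCursor)
cur = db.cursor()
try:
    cur.execute("SELECT heartbeat_time FROM machines")
    for row in cur:  # rows are fetched from the server as you iterate
        print row['heartbeat_time']
finally:
    cur.close()      # drain/close the cursor before issuing another query
    db.close()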