How to subtract a timedelta from a datetime in peewee?

Consider the following tables:
class Schedule(db.Model):
    delay = IntegerField()  # I would prefer if we had a TimeDeltaField

class Recurring(db.Model):
    schedule = ForeignKeyField(Schedule)
    occurred_at = DateTimeField(default=datetime.now)
Now, I'd like to get all those events which should recur:
query = Recurring.select(Recurring, Schedule).join(Schedule)
query = query.where(Recurring.occurred_at < now - Schedule.delay) # wishful
Unfortunately, this doesn't work. Hence, I'm currently doing something as follows:
for schedule in schedules:
    then = now - timedelta(minutes=schedule.delay)
    query = Recurring.select(Recurring, Schedule).join(Schedule)
    query = query.where(Recurring.schedule == schedule, Recurring.occurred_at < then)
However, now instead of executing one query, I am executing multiple queries.
Is there a way to solve the above problem only using one query? One solution that I thought of was:
class Recurring(db.Model):
    schedule = ForeignKeyField(Schedule)
    occurred_at = DateTimeField(default=datetime.now)
    repeat_after = DateTimeField()  # repeat_after = occurred_at + delay
query = Recurring.select(Recurring, Schedule).join(Schedule)
query = query.where(Recurring.repeat_after < now)
However, the above schema violates the rules of the third normal form.

Each database implements datetime arithmetic differently, which is a pain, so the answer will depend a bit on which database you are using.
For postgres, for example, we can use the "interval" helper:
# Calculate the timestamp of the next occurrence. This is done
# by taking the last occurrence and adding the number of seconds
# indicated by the schedule.
one_second = SQL("INTERVAL '1 second'")
next_occurrence = Recurring.occurred_at + (one_second * Schedule.delay)
# Get all recurring rows where the current timestamp on the
# postgres server is greater than the calculated next occurrence.
query = (Recurring
         .select(Recurring, Schedule)
         .join(Schedule)
         .where(SQL('current_timestamp') >= next_occurrence))

for recur in query:
    print(recur.occurred_at, recur.schedule.delay)
You could also substitute a datetime object for the "current_timestamp" if you prefer:
my_dt = datetime.datetime(2019, 3, 1, 3, 3, 7)
...
.where(Value(my_dt) >= next_occurrence)
For SQLite, you would do:
# Convert to a timestamp, add the scheduled seconds, then convert back
# to a datetime string for comparison with the last occurrence.
next_ts = fn.strftime('%s', Recurring.occurred_at) + Schedule.delay
next_occurrence = fn.datetime(next_ts, 'unixepoch')
For MySQL, you would do:
# from peewee import NodeList
nl = NodeList((SQL('INTERVAL'), Schedule.delay, SQL('SECOND')))
next_occurrence = fn.date_add(Recurring.occurred_at, nl)
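Either way, the computed next_occurrence drops into the same query shape as the Postgres example. A minimal sketch for the SQLite variant (note that SQLite's datetime('now') returns UTC; use datetime('now', 'localtime') if your stored timestamps are local time):
# Sketch only: compare the server-side current time against the SQLite-style
# next_occurrence expression built above.
query = (Recurring
         .select(Recurring, Schedule)
         .join(Schedule)
         .where(fn.datetime('now') >= next_occurrence))
For the MySQL variant, compare against fn.NOW() instead, or against a Python datetime wrapped in Value() as in the Postgres example.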
Lastly, I'd suggest trying better names for your models and fields, e.g. Schedule.interval instead of Schedule.delay, and Recurring.last_run instead of occurred_at.

Efficiently get first and last model instances in Django Model with timestamp, by day

Suppose you have this model:
from django.db import models
from django.contrib.postgres.indexes import BrinIndex

class MyModel(models.Model):
    device_id = models.IntegerField()
    timestamp = models.DateTimeField(auto_now_add=True)
    my_value = models.FloatField()

    class Meta:
        indexes = [BrinIndex(fields=['timestamp'])]
There is a periodic process that creates an instance of this model every 2 minutes or so. This process is supposed to run for years, with multiple devices, so this table will contain a great number of records.
My goal is, for each day when there are records, to get the first and last records in that day.
So far, what I could come up with is this:
from django.db.models import Min, Max

results = []
device_id = 1  # Could be another device id, of course, but 1 for illustration's sake

# This will get me a list of dictionaries that have first and last fields
# with the desired timestamps, but not the field my_value for them.
first_last = MyModel.objects.filter(device_id=device_id).values('timestamp__date')\
    .annotate(first=Min('timestamp'), last=Max('timestamp'))
# So now I have to iterate over that list to get the instances/values
for f in first_last:
    first = f['first']
    last = f['last']
    first_value = MyModel.objects.get(device_id=device_id, timestamp=first).my_value
    last_value = MyModel.objects.get(device_id=device_id, timestamp=last).my_value
    results.append({
        'first': first,
        'last': last,
        'first_value': first_value,
        'last_value': last_value,
    })

# Do something with results[]
This works, but takes a long time (about 50 seconds on my machine, retrieving first and last values for about 450 days).
I have tried other combinations of annotate(), values(), values_list(), extra() etc, but this is the best I could come up with so far.
Any help or insight is appreciated!
You can take advantage of .distinct() if you are using PostgreSQL as your DBMS.
first_models = MyModel.objects.order_by('timestamp__date', 'timestamp').distinct('timestamp__date')
last_models = MyModel.objects.order_by('timestamp__date', '-timestamp').distinct('timestamp__date')
first_last = first_models.union(last_models)
# do something with first_last
One more thing needs to be mentioned: first_last might collapse the first and last rows into one when there is only one record for a date, because union() removes the duplicate. That should not be a problem for you, but if it is, you can iterate first_models and last_models separately, as sketched below.
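A minimal sketch of that separate iteration, pairing each day's first and last records by date (it assumes you want the same first/last/value fields as in the question):
# Pair up each day's first and last record without union(), so a day with a
# single record still yields both entries (they just point at the same row).
firsts = {m.timestamp.date(): m for m in first_models}
lasts = {m.timestamp.date(): m for m in last_models}

results = []
for day in sorted(firsts):
    f, l = firsts[day], lasts[day]
    results.append({
        'first': f.timestamp,
        'last': l.timestamp,
        'first_value': f.my_value,
        'last_value': l.my_value,
    })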

How can I replay a table in Deephaven?

I have a Deephaven table with some data. I want to add timestamps to it and play it back in real-time in Python. What's the best way to do that?
Deephaven's TableReplayer class allows you to replay table data. In order to construct a TableReplayer, you need to specify a start time and end time in Deephaven's DateTime format. The start and end times correspond to those in the table you want replayed.
from deephaven import time as dhtu
from deephaven.replay import TableReplayer

start_time = dhtu.to_datetime("2022-01-01T00:00:00 NY")
end_time = dhtu.to_datetime("2022-01-01T00:10:00 NY")

replayer = TableReplayer(start_time, end_time)
To create the replayed table, use add_table. This method points at a pre-existing table and specifies the column name that contains timestamps:
replayed_table = replayer.add_table(some_table, "Timestamp")
From there, use start to start the replay:
replayer.start()
If some_table doesn't have a column of timestamps, here's a simple way to add one:
def create_timestamps(index):
    return dhtu.plus_period(start_time, dhtu.to_period(f"T{index}S"))

some_table = some_table.update(["Timestamp = (DateTime)create_timestamps(i)"])
Note that the function above creates timestamps spaced one second apart.
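Putting the pieces together, here is a minimal end-to-end sketch; the some_table built with empty_table and its Value column are made up purely for illustration:
from deephaven import empty_table
from deephaven import time as dhtu
from deephaven.replay import TableReplayer

start_time = dhtu.to_datetime("2022-01-01T00:00:00 NY")
end_time = dhtu.to_datetime("2022-01-01T00:10:00 NY")

def create_timestamps(index):
    return dhtu.plus_period(start_time, dhtu.to_period(f"T{index}S"))

# A stand-in table with one row per second and a dummy value column.
some_table = empty_table(600).update([
    "Timestamp = (DateTime)create_timestamps(i)",
    "Value = i",
])

replayer = TableReplayer(start_time, end_time)
replayed_table = replayer.add_table(some_table, "Timestamp")
replayer.start()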

Slow MySQL database query time in a Python for loop

I have a task to run 8 equal queries (1 query per country) that return data from a MySQL database. The reason I can't run 1 query covering all countries is that each country needs different column names. Also, results need to be updated daily with a dynamic date range (the last 7 days). Yes, I could run all countries together and do the column naming and everything with Pandas, but I thought the following solution would be more efficient. So, my solution was to create a for loop that uses predefined lists of all the countries, their respective dimensions, and date-range variables that change according to the current date. The problem I'm having is that the MySQL query running in the loop takes much more time than if I run the same query directly in our data warehouse (~140-500 seconds vs. 30 seconds). The solution works with smaller tables from the DWH. The thing is that I don't know which part exactly is causing the problem or how to solve it.
Here is an example of my code with some smaller "tests" implemented in it:
#Import libraries:
from google.cloud import storage
from google.oauth2 import service_account
import mysql.connector
import pandas as pd
import time
from datetime import timedelta, date
#Create a connection to new DWH:
conn = mysql.connector.connect(
    host="the host goes here",
    user="the user goes here",
    passwd="the password goes here"
)
#Create Google Cloud Service credential references:
credentials = service_account.Credentials.from_service_account_file(r'C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\my credential json goes here.json')
project_id='my project id goes here'
cursor = conn.cursor()
#Create lists of countries and dimensions
countries = ['EE','FI','LV','LT']
appp_id_dim = ['ga:dimension5','ga:dimension5','ga:dimension5','ga:dimension5']
status_dim = ['ga:dimension21','ga:dimension12','ga:dimension20','ga:dimension15']
score_dim = ['ga:dimension11','ga:dimension11','ga:dimension19','ga:dimension14']
#Define the current date and date that was 7 days before current date:
date_now = date.today() - timedelta(days=1)
date_7d_prev = date_now - timedelta(days=7)
#Create a loop
for c, s in zip(countries, score_dim):
    start_time = time.time()
    # Create the query using string formatting:
    query = f"""select ca.ID, sv.subType, SUM(svl.score) as '{s}'
        from aio.CreditApplication ca
        join aio.ScoringResult sr
            on sr.creditApplication_ID = ca.ID
        join aio.ScorecardVariableLine svl
            on svl.id = sr.scorecardVariableLine_ID
        join aio.ScorecardVariable sv
            on sv.ID = svl.scorecardVariable_ID
        where sv.country = '{c}'
        #and sv.subType = "asc"
        and sv.subType != 'fsc'
        and sr.created >= '2020-01-01'
        and sr.created between '{date_7d_prev} 00:00:00' and '{date_now} 23:59:59'
        group by ca.id, sv.subType"""
    # Check of the sql query:
    print('query is done', time.time() - start_time)

    start_time = time.time()
    sql = pd.read_sql_query(query, conn)
    # Check of assigning sql:
    print('sql is assigned', time.time() - start_time)

    start_time = time.time()
    df = pd.DataFrame(sql
                      # , columns=['created', 'ID', 'state']
                      )
    # Check the df assignment:
    print('df has been assigned', time.time() - start_time)

    # Create a .csv file from the final dataframe:
    start_time = time.time()
    df.to_csv(fr"C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\Testing Ground\{c}_sql_loop_test.csv",
              index=False, header=True, encoding='utf-8', sep=';')
    # Check csv file creation:
    print('csv has been created', time.time() - start_time)

# Close the session
start_time = time.time()
cursor.close()
# Check the session closing:
print('The cursor is closed', time.time() - start_time)
This example has 4 countries because I tried cutting the amount in half, but that doesn't help either. I thought I might be hitting some sort of query restriction on the DWH end, because the major slowdown always started with the 5th country, but running the countries separately takes almost the same time for each, and it is still too long.
So, my tests show that the loop always lags at the step of querying data. Every other step takes less than a second, but querying time goes up to 140-500 seconds, sometimes even more, as mentioned previously. So, what do you think is the problem?
Found the solution! After talking to a person in my company who has a lot more experience with SQL and our particular DWH engine, he agreed to help and rewrote the SQL part. Instead of left joining a subquery, I had to rewrite it so that there would be no subquery. Why? Because our particular engine doesn't create an index for a subquery, but separately joined tables do have indexes. That improved the run time of the whole script dramatically, from ~40 minutes to less than a minute.
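The production query isn't shown here, but the shape of the change was roughly the following; this is a hypothetical sketch with made-up table names, only to illustrate the subquery-vs-direct-join difference:
# Before: the aggregating subquery gets no index on this DWH engine.
slow = """
    select t.id, agg.total
    from big_table t
    left join (select fk_id, sum(x) as total from detail group by fk_id) agg
        on agg.fk_id = t.id
"""

# After: join the detail table directly (its indexes are used) and aggregate
# in the outer query instead. (Assumes every big_table row has detail rows;
# otherwise a left join plus COALESCE is needed to keep the same rows.)
fast = """
    select t.id, sum(d.x) as total
    from big_table t
    join detail d on d.fk_id = t.id
    group by t.id
"""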

Python SQL cells interaction

I am using Python to try to gather the closing price for couple of different time intervals, save it in a database and then calculate the change in the closing price. This is my code:
import requests
import sqlite3

def database_populate(symbol, interval):
    base_url = "https://www.binance.com/api/v1"
    url_klines = "/klines"
    end_time = requests.get('{}/time'.format(base_url)).json()['serverTime']
    start_time = end_time - 360000
    kln = requests.get('{a}{b}?symbol={c}&interval={d}&startTime={e}&endTime={f}'.format(
        a=base_url, b=url_klines, c=symbol, d=interval, e=start_time, f=end_time)).json()
    db = sqlite3.connect('database.db')
    cursor = db.cursor()
    cr_db = """
        CREATE TABLE EOSBTC_symbol (
            ID INTEGER PRIMARY KEY AUTOINCREMENT,
            EPOCH_TIME INTEGER NOT NULL,
            CLOSE_PRICE FLOAT,
            CHANGE FLOAT )
    """
    cursor.execute(cr_db)
    for i in range(len(kln)):
        lst = [kln[i][0], kln[i][4]]
        cursor.execute("""INSERT INTO EOSBTC_symbol (EPOCH_TIME, CLOSE_PRICE) VALUES (?, ?)""",
                       (lst[0], lst[1]))
    db.commit()
    db.close()

database_populate("EOSBTC", "1m")
This is populating the database with the closing price for a certain time period for the pair EOSBTC. I want to calculate the change in the closing price between two consecutive rows. Do I need to use the ID or the epoch time, or is there a more elegant way? Just keep in mind that this DB will be continuously updated, so the ID and the EPOCH_TIME will change over time, and I want to calculate the CHANGE field immediately after I populate these cells from the Binance API.
This is the database content at the moment (screenshot omitted):
For example, for row 6 the CHANGE will be equal to 0.00082563 - 0.00082587, for row 5 it will be 0.00082587 - 0.00082533, and so on.
If you only need the change in closing price inside the Python runtime, simply keep the previous row's value in a variable; a sketch of this approach follows below.
If you want to store it in the DB, you can have a small procedure that does all the calculations and inserts the data, including the newly calculated difference.
If you want to retrieve the value from the DB every time, you might use something like TOP (or LIMIT ... ORDER BY, depending on the RDBMS you are using).
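A minimal sketch of the first approach, computing CHANGE while inserting; it reuses the EOSBTC_symbol table, cursor, and kln payload from the question, and assumes CHANGE means the current close minus the previous close:
prev_close = None
for row in kln:
    epoch_time, close_price = row[0], float(row[4])
    # The first row has no predecessor, so its CHANGE stays NULL.
    change = None if prev_close is None else close_price - prev_close
    cursor.execute(
        "INSERT INTO EOSBTC_symbol (EPOCH_TIME, CLOSE_PRICE, CHANGE) VALUES (?, ?, ?)",
        (epoch_time, close_price, change))
    prev_close = close_price
db.commit()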

Get range of columns from Cassandra based on TimeUUIDType using Python and the datetime module

I've got a table set up like so:
{"String" : {uuid1 : "String", uuid1: "String"}, "String" : {uuid : "String"}}
Or...
Row_validation_class = UTF8Type
Default_validation_class = UTF8Type
Comparator = UUID
(It's basically got website as a row label, and has dynamically generated columns based on datetime.datetime.now() with TimeUUIDType in Cassandra and a string as the value)
I'm looking to use Pycassa to retrieve slices of the data based on both the row and the columns. However, on other (smaller) tables I've done this but by downloading the whole data set (or at least filtered to one row) and then had an ordered dictionary I could compare with datetime objects.
I'd like to be able to use something like the Pycassa multiget or get_indexed_slice function to pull certain columns and rows. Does something like this exist that allows filtering on datetime. All my current attempts result in the following error message:
TypeError: can't compare datetime.datetime to UUID
The best I've managed to come up with so far is...
def get_number_of_visitors(site, start_date,
                           end_date=datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S:%f")):
    pool = ConnectionPool('Logs', timeout=2)
    col_fam = ColumnFamily(pool, 'sessions')
    result = col_fam.get(site)
    number_of_views = [(k, v) for k, v in col_fam.get(site).items()
                       if get_posixtime(k) > datetime.datetime.strptime(str(start_date), "%Y-%m-%d %H:%M:%S:%f")
                       and get_posixtime(k) < datetime.datetime.strptime(str(end_date), "%Y-%m-%d %H:%M:%S:%f")]
    total_unique_sessions = len(number_of_views)
    return total_unique_sessions
With get_posixtime being defined as:
def get_posixtime(uuid1):
    assert uuid1.version == 1, ValueError('only applies to type 1')
    t = uuid1.time
    t = t - 0x01b21dd213814000
    t = t / 1e7
    return datetime.datetime.fromtimestamp(t)
This doesn't seem to work (isn't returning the data I'd expect) and also feels like it shouldn't be necessary. I'm creating the column timestamps using:
timestamp = datetime.datetime.now()
Does anybody have any ideas? It feels like this is the sort of thing that Pycassa (or another python library) would support but I can't figure out how to do it.
p.s. table schema as described by cqlsh:
CREATE COLUMNFAMILY sessions (
KEY text PRIMARY KEY
) WITH
comment='' AND
comparator='TimeUUIDType' AND
row_cache_provider='ConcurrentLinkedHashCacheProvider' AND
key_cache_size=200000.000000 AND
row_cache_size=0.000000 AND
read_repair_chance=1.000000 AND
gc_grace_seconds=864000 AND
default_validation=text AND
min_compaction_threshold=4 AND
max_compaction_threshold=32 AND
row_cache_save_period_in_seconds=0 AND
key_cache_save_period_in_seconds=14400 AND
replicate_on_write=True;
p.p.s. I know you can specify a column range in Pycassa, but I can't guarantee that the start and end values of the range will have entries for each of the rows, and hence the column may not exist.
You do want to request a "slice" of columns using the column_start and column_finish parameters to get(), multiget(), get_count(), get_range(), etc. For TimeUUIDType comparators, pycassa actually accepts datetime instances or timestamps for those two parameters; it will internally convert them to a TimeUUID-like form with a matching timestamp component. There's a section of the documentation dedicated to working with TimeUUIDs that provides more details.
For example, I would implement your function like this:
def get_number_of_visitors(site, start_date, end_date=None):
    """
    start_date and end_date should be datetime.datetime instances or
    timestamps like those returned from time.time().
    """
    if end_date is None:
        end_date = datetime.datetime.now()
    pool = ConnectionPool('Logs', timeout=2)
    col_fam = ColumnFamily(pool, 'sessions')
    return col_fam.get_count(site, column_start=start_date, column_finish=end_date)
You could use the same form with col_fam.get() or col_fam.xget() to get the actual list of visitors.
P.S. Try not to create a new ConnectionPool() for every request; reuse one, or at least set a lower pool size. A sketch of that is below.
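A minimal sketch of reusing a single pool and column family at module level (pool_size is a standard ConnectionPool argument; the value here is just an example), plus an xget()-based variant that returns the individual sessions rather than a count:
# Create the pool and column family once and reuse them across requests.
pool = ConnectionPool('Logs', timeout=2, pool_size=5)
col_fam = ColumnFamily(pool, 'sessions')

def get_number_of_visitors(site, start_date, end_date=None):
    if end_date is None:
        end_date = datetime.datetime.now()
    return col_fam.get_count(site, column_start=start_date, column_finish=end_date)

def get_visitors(site, start_date, end_date=None):
    if end_date is None:
        end_date = datetime.datetime.now()
    # xget() lazily yields (timeuuid_column, value) pairs within the time slice.
    return list(col_fam.xget(site, column_start=start_date, column_finish=end_date))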
