I am working on a django-based open source project called OpenREM (http://demo.openrem.org/openrem/, http://openrem.org).
To calculate data for one of the plots that are used I am carrying out a series of queries to obtain the number of items that fall into each of the 24 hours for each day of the week. This data is used to plot the pie chart of studies per weekday on the CT page of the demo site, with a drill-down to studies per hour for that day:
studiesPerHourInWeekdays = [[0 for x in range(24)] for x in range(7)]
for day in range(7):
    studyTimesOnThisWeekday = f.qs.filter(study_date__week_day=day+1).values('study_time')
    if studyTimesOnThisWeekday:
        for hour in range(24):
            try:
                studiesPerHourInWeekdays[day][hour] = studyTimesOnThisWeekday.filter(
                    study_time__gte=str(hour) + ':00').filter(
                    study_time__lte=str(hour) + ':59').values('study_time').count()
            except:
                studiesPerHourInWeekdays[day][hour] = 0
This takes a little while to run on a production system. I think the second for loop could be removed by using a qsstats-magic time_series, aggregated over hours. Unfortunately there isn't a suitable datetime object stored in the database that I can use for this.
Does anyone know how I can combine the "study_date" datetime.date object and "study_time" datetime.time object into a single datetime.datetime object for me to be able to run a qsstats-magic time_series by hour?
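(I know the two objects can be combined per row in plain Python, as in the sketch below with illustrative values, but that runs client-side per record rather than in the database:)

import datetime

# Combining the stored date and time client-side; what I'm after is the
# queryset/database equivalent of this.
study_date = datetime.date(2014, 3, 2)
study_time = datetime.time(14, 30)
study_datetime = datetime.datetime.combine(study_date, study_time)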
Thanks,
David
If you can at all (though it doesn't sound like you can, given your circumstances), it would be best to change the database schema to reflect the kinds of queries you're making: a datetime field that holds this information, some kind of foreign key setup, etc.
You probably already know that, though, so the practical answer to your question is that you want to use the underlying database tools to your advantage through an extra() call. Maybe something like this* if you're using postgres:
date_hour_set = f.qs.extra(
    select={
        'date_hour': "study_date + interval '1h' * date_part('hour', study_time)",
        'date_hour_count': "count(study_date + interval '1h' * date_part('hour', study_time))"
    }).values('date_hour', 'date_hour_count').distinct()
which would give you a queryset of datetimes (hours only) with their associated occurrence counts. Handwritten SQL is the easiest option at the moment because of Django's lagging TimeField support, and will probably be the most performant, too.
*Note I don't write SQL regularly and am being lazy, so there are cleaner ways to work this.
If you really really need to be database portable and still can't edit the schema, you can stack together features of Django aggregation that are maybe a little convoluted all together:
from django.db.models import Value, Count, ExpressionWrapper, CharField
from django.db.models.functions import Substr, Concat

hour_counts = f.qs.annotate(hour=Concat(Substr('study_time', 1, 2), Value(':00:00')))
date_hour_pairs = hour_counts.annotate(
    date_hour=ExpressionWrapper(Concat('study_date', 'hour'),
                                output_field=CharField())
    ).values('study_date', 'hour', 'date_hour')
date_hour_counts = date_hour_pairs.annotate(count=Count('date_hour')).distinct()
which should give you a set of dicts with a datetime.time object for 'hour', the datetime.date you started with for 'study_date', a concatenated string version of the date and time under 'date_hour', and then the all-important (date, hour) count under 'count'.
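If it does, folding those counts back into the 7x24 weekday/hour matrix from your question is a short loop. A hedged sketch (hypothetical glue code, not tested; note that Python's weekday() numbers Monday as 0, whereas the __week_day lookup you used numbers Sunday as 1):

studiesPerHourInWeekdays = [[0] * 24 for _ in range(7)]
for entry in date_hour_counts:
    weekday = entry['study_date'].weekday()  # Monday == 0
    hour = int(str(entry['hour'])[:2])       # works for time objects and 'HH:...' strings alike
    studiesPerHourInWeekdays[weekday][hour] += entry['count']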
I wish to keep all objects from the database which have a particular datetime field with values before the current day. I can see how you can filter with hardcoded dates or between two dates, but how can I keep all items with dates before today?
So it seems it is quite straightforward: you can do direct datetime comparisons in SQLAlchemy queries like this:
from datetime import datetime, timedelta

q = DBSession.query(User).filter(
    User.sign_up_date <= datetime.now() - timedelta(hours=1),
)
which would return all user objects that signed up at least one hour ago.
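For the literal "before today" case, the same comparison works against midnight at the start of the current day. A minimal sketch, assuming the same DBSession and User names:

from datetime import date, datetime, time

# Everything strictly before midnight today is "before today".
today_start = datetime.combine(date.today(), time.min)
q = DBSession.query(User).filter(User.sign_up_date < today_start)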
I made a huge mistake building up a database, but it works perfectly except for one feature. Changing the program in all the places where it needs to be changed for that feature to work would be a titanic job of weeks, so let's hope this workaround is possible.
The issue: I've stored data in a SQLite database as "dd/mm/yyyy" TextField format instead of DateField.
The need: I need to sort by dates on a union query, to get the last number of records in that union following my custom date format. They are from different tables, so I can't just use rowid or the like to get the last ones; I need to do it by date, and I can't change the already stored data in the database because there are already invoices created with that format ("dd/mm/yyyy" is the default date format in my country).
This is the query that captures data:
records = []
limited_to = 25
union = (facturas | contado | albaranes | presupuestos)
for record in (union
               .select_from(union.c.idunica, union.c.fecha, union.c.codigo,
                            union.c.tipo, union.c.clienterazonsocial,
                            union.c.totalimporte, union.c.pagada,
                            union.c.contabilizar, union.c.concepto1,
                            union.c.cantidad1, union.c.precio1,
                            union.c.observaciones)
               .order_by(union.c.fecha.desc())  # TODO this is what I need to change.
               .limit(limited_to)
               .tuples()):
    records.append(record)
Now to complicate things even more, the union is already built from a really complex where clause for each table before it's transformed into a union query.
So my only hope is: Is there a way to make order_by follow a custom date format instead?
To clarify, this is the simple transformation that I'd need the order_by clause to follow, because I assume SQLite wouldn't have issues sorting if this were the date format:
def reverse_date(date: str) -> str:
    """Reverse the date order from dd/mm/yyyy into yyyy-mm-dd."""
    dd, mm, yyyy = date.split("/")
    return f"{yyyy}-{mm}-{dd}"
Note: I've left a lot of code out because I think it's unnecessary. This is the minimum amount of code needed to understand the problem. Let me know if you need more data.
Update: Trying this workaround, it seems to work fine. Need more testing but it's promising. If someone ever faces the same issue, here you go:
.order_by(sqlfunc.Substr(union.c.fecha, 7)
          .concat('/')
          .concat(sqlfunc.Substr(union.c.fecha, 4, 2))
          .concat('/')
          .concat(sqlfunc.Substr(union.c.fecha, 1, 2))
          .desc())
Happy end of 2020!
As you pointed out, if you want the dates to sort properly, they need to be in yyyy-mm-dd format, which is the text format you should always use in SQLite (or something with the same year, month, day, order).
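A quick illustration, with made-up values, of why the field order matters:

# Lexicographic order only matches chronological order when the most
# significant field (the year) comes first.
dates_dmy = ["02/01/2021", "31/12/2020"]  # dd/mm/yyyy
print(sorted(dates_dmy))   # ['02/01/2021', '31/12/2020']: 2 Jan 2021 sorts first (wrong)
dates_iso = ["2021-01-02", "2020-12-31"]  # yyyy-mm-dd
print(sorted(dates_iso))   # ['2020-12-31', '2021-01-02']: chronological (correct)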
You might be able to do the rearrangement with a regular expression, using capture groups:

re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\2-\1', fecha)

Here we are using regex to capture the day, month, and year components in separate capture groups, then replacing with these components in the correct order for sorting. Note, though, that re.sub runs in Python, not in SQLite, so it can only rearrange values after they have been fetched; since your ORDER BY has to happen inside the query (before the LIMIT), an SQL-side rearrangement like your Substr() workaround is the way to express the same idea in the database.
(using django 1.11.2, python 2.7.10, mysql 5.7.18)
If we imagine a simple model:
class Event(models.Model):
    happened_datetime = models.DateTimeField()
    value = models.IntegerField()
What would be the most elegant (and quickest) way to run something similar to:
res = Event.objects.all().aggregate(
    Avg('happened_datetime')
)
But that would be able to extract the average time of day for all members of the queryset. Something like:
res = Event.objects.all().aggregate(
    AvgTimeOfDay('happened_datetime')
)
Would it be possible to do this on the DB directly, i.e., without running a long loop client-side over each queryset member?
EDIT:
There may be a solution, along those lines, using raw SQL:
select sec_to_time(avg(time_to_sec(extract(HOUR_SECOND from happened_datetime)))) from event_event;
Performance-wise, this runs in 0.015 seconds for ~23k rows on a laptop, not optimised, etc. Assuming it yields accurate/correct results, and since time is only a secondary factor, could I just use that?
Add another integer field to your model that contains only the hour of the day extracted from the happened_datetime.
When creating/updating a model instance you need to update this new field accordingly whenever happened_datetime is set/updated. You can extract the hour of the day, for example, by reading datetime.datetime.hour, or use strftime to create a value to your liking.
Aggregation should then work as proposed by yourself.
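A minimal sketch of that approach (the happened_hour field name and the save() override are assumptions on my part, not part of your original model):

class Event(models.Model):
    happened_datetime = models.DateTimeField()
    value = models.IntegerField()
    happened_hour = models.PositiveSmallIntegerField(editable=False)

    def save(self, *args, **kwargs):
        # Keep the denormalized hour column in sync with the datetime.
        self.happened_hour = self.happened_datetime.hour
        super(Event, self).save(*args, **kwargs)

Aggregating is then just Event.objects.aggregate(Avg('happened_hour')).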
EDIT:
Django's ORM has Extract() as a function. Example from the docs adapted to your use case:
>>> # Average hour of day across all events, computed in the database
>>> from django.db.models import Avg
>>> from django.db.models.functions import Extract
>>> Event.objects.annotate(
...     hour=Extract('happened_datetime', 'hour')
... ).aggregate(Avg('hour'))
(Not tested!)
https://docs.djangoproject.com/en/1.11/ref/models/database-functions/#extract
So after a little searching and a few tries, the below seems to work. Any comments on how to improve it (or hints as to why it is completely wrong) are welcome! :-)
res = Event.objects.raw('''
    SELECT id, sec_to_time(avg(time_to_sec(extract(HOUR_SECOND from happened_datetime)))) AS average_time_of_day
    FROM event_event
    WHERE happened_datetime BETWEEN %s AND %s;''', [start_datetime, end_datetime])

print res[0].__dict__
# {'average_time_of_day': datetime.time(18, 48, 10, 247700), '_state': <django.db.models.base.ModelState object at 0x0445B370>, 'id': 9397L}
Now the ID returned is that of the last object falling in the datetime range for the WHERE clause. I believe Django just inserts that because of "InvalidQuery: Raw query must include the primary key".
Quick explanation of the series of SQL function calls:
Extract HH:MM:SS from each datetime field.
Convert the time values to seconds via time_to_sec.
Average all the seconds values with avg.
Convert the averaged seconds value back into time format (HH:MM:SS) via sec_to_time.
Don't know why Django insists on returning microseconds, but that is not really relevant. (Most likely the avg of the seconds values is fractional, and sec_to_time carries the fraction through as microseconds.)
Performance note: this seems to be extremely fast but then again I haven't tested that bit. Any insight would be kindly appreciated :)
I have a Python program that uses historical data coming from a database and allows the user to select the dates input. However, not all possible dates are available in the database, since these are financial data: in other words, if the user inserts "02/03/2014" (which is a Sunday), he won't find any record in the database because the stock exchange was closed.
This causes SQL problems, because when the record is not found the SQL statement fails, and the user needs to adjust the date until he finds an existing record. To avoid this I would like to build an algorithm which is able to change the date input itself, choosing the closest to the original input. For example, if the user inserts "02/03/2014", the closest would be "03/03/2014".
I have thought about something like this, where the table MyDates contains date values only (I'm still working out the proper syntax, but it's just to show the idea):
import sqlite3 as lite
from datetime import datetime

con = lite.connect('C:/.../MyDatabase.db')
cur = con.cursor()
cur.execute('SELECT * FROM MyDates')
rowsD = cur.fetchall()
data = []
for row in rowsD:
    data.append(row[0])
>>>data
['01/01/2010', '02/01/2010', .... '31/12/2013']
inputDate = datetime.strptime('07/01/2010', '%d/%m/%Y')
differences = []
for i in range(0, len(data)):
    differences.append(abs(datetime.strptime(data[i], '%d/%m/%Y') - inputDate))
After that, I was thinking about:
getting the minimum value from the vector differences: mV = min(differences)
getting the corresponding date value into the list data
However, this costs me two things:
I need to load the whole database, which is huge;
I have to iterate many times (once to build the list data, then again for the list of differences, etc.).
Does anyone have a better idea to build this, or knows a different approach to the problem?
Query the database on the dates that are smaller than the input date and take the maximum of these. This will give you the closest date before.
Symmetrically, you can query the minimum of the larger dates to get the closest date after. And keep the preferred of the two.
These should be efficient queries.
SELECT MAX(Date)
FROM MyDates
WHERE Date <= InputDate;
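Putting the two queries together in Python (a sketch assuming a sqlite3 cursor, the MyDates/Date names above, and dates stored in a sortable yyyy-mm-dd text form):

from datetime import datetime

def closest_date(cur, input_date):
    # Closest date at or before the input...
    cur.execute("SELECT MAX(Date) FROM MyDates WHERE Date <= ?", (input_date,))
    before = cur.fetchone()[0]
    # ...and the closest date after it.
    cur.execute("SELECT MIN(Date) FROM MyDates WHERE Date > ?", (input_date,))
    after = cur.fetchone()[0]
    if before is None or after is None:
        return before or after
    fmt = '%Y-%m-%d'
    target = datetime.strptime(input_date, fmt)
    return min(before, after,
               key=lambda d: abs(datetime.strptime(d, fmt) - target))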
I would try getting the record with the maximum date smaller than the given one directly from the database (this can be done with SQL). If you put an index on the date column, this can be done in O(log(n)). That's of course not really the same as "being closest", but if you combine it with "the minimum date bigger than the given one", you will achieve it.
Also, if you know more or less the distribution of your data, for example that each run of 7 consecutive days contains some data, then you can restrict the search to a smaller range like [-3 days, +3 days].
Combining both of these solutions should give you quite nice performance.
I've got a django app that is doing some logging. My model looks like this:
class MessageLog(models.Model):
    logtime = models.DateTimeField(auto_now_add=True)
    user = models.CharField(max_length=50)
    message = models.CharField(max_length=512)
What I want to do is get the average number of messages logged per day of the week so that I can see which days are the most active. I've managed to write a query that pulls the total number of messages per weekday, which is:
for i in range(1, 8):
    MessageLog.objects.filter(logtime__week_day=i).count()
But I'm having trouble calculating the average in a query. What I have right now is:
for i in range(1, 8):
    MessageLog.objects.filter(logtime__week_day=i).annotate(num_msgs=Count('id')).aggregate(Avg('num_msgs'))
For some reason this is returning 1.0 for every day though. I looked at the SQL it is generating and it is:
SELECT AVG(num_msgs) FROM (
    SELECT
        `myapp_messagelog`.`id` AS `id`, `myapp_messagelog`.`logtime` AS `logtime`,
        `myapp_messagelog`.`user` AS `user`, `myapp_messagelog`.`message` AS `message`,
        COUNT(`myapp_messagelog`.`id`) AS `num_msgs`
    FROM `myapp_messagelog`
    WHERE DAYOFWEEK(`myapp_messagelog`.`logtime`) = 1
    GROUP BY `myapp_messagelog`.`id` ORDER BY NULL
) subquery
I think the problem might be coming from the GROUP BY id but I'm not really sure. Anyone have any ideas or suggestions? Thanks in advance!
The reason your listed query always gives 1 is because you're not grouping by date. Basically, you've asked the database to take the MessageLog rows that fall on a given day of the week. For each such row, count how many ids it has (always 1). Then take the average of all those counts, which is of course also 1.
Normally, you would need to use a values clause to group your MessageLog rows prior to your annotate and aggregate parts. However, since your logtime field is a datetime rather than just a date, I am not sure you can express that directly with Django's ORM. You can definitely do it with an extra clause, as shown here. Or if you felt like it you could declare a view in your SQL with as much of the aggregating and average math as you liked and declare an unmanaged model for it, then just use the ORM normally.
So an extra field works to get the total number of records per actual day, but doesn't handle aggregating the average of the computed annotation. I think this may be sufficiently abstracted from the model that you'd have to use a raw SQL query, or at least I can't find anything that makes it work in one call.
That said, you already know how you can get the total number of records per weekday in a simple query as shown in your question.
And this query will tell you how many distinct date records there are on a given weekday:
MessageLog.objects.filter(logtime__week_day=i).dates('logtime', 'day').count()
So you could do the averaging math in Python instead, which might be simpler than trying get the SQL right.
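For instance, a sketch combining the two queries above per weekday (the zero-division guard is my addition):

# Average messages per weekday, computed client-side from two queries each.
averages = {}
for i in range(1, 8):
    qs = MessageLog.objects.filter(logtime__week_day=i)
    total_msgs = qs.count()
    distinct_days = qs.dates('logtime', 'day').count()
    averages[i] = float(total_msgs) / distinct_days if distinct_days else 0.0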
Alternatively, this query will get you the raw number of messages for all weekdays in one query rather than a for loop:
MessageLog.objects.extra({'weekday': "dayofweek(logtime)"}).values('weekday').annotate(Count('id'))
But I haven't been able to get a nice query to give you the count of distinct dates for each weekday annotated to that - dates querysets lose the ability to handle annotate calls, and annotating over an extra value doesn't seem to work either.
This has been surprisingly tricky, given that it's not that hard a SQL expression.
I do something similar with a datetime field, but annotating over extra values does work for me. I have a Record model with a datetime field "created_at" and a "my_value" field I want to get the average for.
from django.db.models import Avg
qs = Record.objects.extra({'created_day': "date(created_at)"}).\
    values('created_day').\
    annotate(avg_value=Avg('my_value'))
The above will group by the day of the datetime value in the "created_at" field.
queryset.extra(select={'day': 'date(logtime)'}).values('day').order_by('-day').annotate(Count('id'))