Django - Time aggregates of DatetimeField across queryset - python

(using Django 1.11.2, Python 2.7.10, MySQL 5.7.18)
If we imagine a simple model:
class Event(models.Model):
    happened_datetime = models.DateTimeField()
    value = models.IntegerField()
What would be the most elegant (and quickest) way to run something similar to:
res = Event.objects.all().aggregate(
    Avg('happened_datetime')
)
But that would be able to extract the average time of day for all members of the queryset. Something like:
res = Event.objects.all().aggregate(
    AvgTimeOfDay('happened_datetime')
)
Would it be possible to do this on the DB directly, i.e., without running a long loop client-side for each queryset member?
EDIT:
There may be a solution, along those lines, using raw SQL:
select sec_to_time(avg(time_to_sec(extract(HOUR_SECOND from happened_datetime)))) from event_event;
Performance-wise, this runs in 0.015 seconds for ~23k rows on a laptop, not optimised, etc. Assuming it yields accurate/correct results, and since time is only a secondary factor, could I just use that?

Add another integer field to your model that contains only the hour of the day extracted from the happened_datetime.
When creating/updating a model instance, you need to update this new field accordingly whenever happened_datetime is set/updated. You can extract the hour of the day, for example, by reading datetime.datetime.hour, or use strftime to create a value to your liking.
Aggregation should then work as proposed by yourself.
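A minimal sketch of that approach (the happened_hour field name and default are my own, untested):

from django.db import models
from django.db.models import Avg

class Event(models.Model):
    happened_datetime = models.DateTimeField()
    value = models.IntegerField()
    happened_hour = models.IntegerField(editable=False, default=0)

    def save(self, *args, **kwargs):
        # Keep the denormalised hour in sync whenever the row is saved.
        self.happened_hour = self.happened_datetime.hour
        super(Event, self).save(*args, **kwargs)

# Then the aggregation is a plain ORM call:
# Event.objects.aggregate(Avg('happened_hour'))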
EDIT:
Django's ORM has Extract() as a database function. Example from the docs adapted to your use case (note that Extract annotates each row with a value; it is not itself an aggregate):
>>> Event.objects.annotate(
...     happened_hour=Extract('happened_datetime', 'hour'))
(Not tested!)
https://docs.djangoproject.com/en/1.11/ref/models/database-functions/#extract
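Combined with Avg, an untested sketch of how that could give an average hour of day (variable names are mine):

from django.db.models import Avg
from django.db.models.functions import Extract

res = Event.objects.annotate(
    hour=Extract('happened_datetime', 'hour')
).aggregate(avg_hour=Avg('hour'))
# res['avg_hour'] is the mean hour as a number; note this averages only
# the hour component, not the full time of day.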

So after a little searching and a few tries, the below seems to work. Any comments on how to improve (or hints as to why it is completely wrong) are welcome! :-)
res = Event.objects.raw('''
    SELECT id, sec_to_time(avg(time_to_sec(extract(HOUR_SECOND from happened_datetime)))) AS average_time_of_day
    FROM event_event
    WHERE happened_datetime BETWEEN %s AND %s;''', [start_datetime, end_datetime])
print res[0].__dict__
# {'average_time_of_day': datetime.time(18, 48, 10, 247700), '_state': <django.db.models.base.ModelState object at 0x0445B370>, 'id': 9397L}
Now the ID returned is that of the last object falling in the datetime range for the WHERE clause. I believe Django just inserts that because of "InvalidQuery: Raw query must include the primary key".
Quick explanation of the SQL series of function calls:
Extract HH:MM:SS from each datetime field (extract(HOUR_SECOND ...)).
Convert the time values to seconds (time_to_sec).
Average all the seconds values (avg).
Convert the averaged seconds value back into time format, HH:MM:SS (sec_to_time).
Don't know why Django insists on returning microseconds, but that is not really relevant. (Most likely the averaged seconds value is fractional, and the fractional part comes back as microseconds in the time object.)
Performance note: this seems to be extremely fast but then again I haven't tested that bit. Any insight would be kindly appreciated :)
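As an aside, if the extra id in the result bothers you, the same statement can go through a plain database cursor, which sidesteps the primary-key requirement mentioned above (a sketch, untested):

from django.db import connection

cursor = connection.cursor()
cursor.execute('''
    SELECT sec_to_time(avg(time_to_sec(extract(HOUR_SECOND from happened_datetime))))
    FROM event_event
    WHERE happened_datetime BETWEEN %s AND %s''', [start_datetime, end_datetime])
average_time_of_day = cursor.fetchone()[0]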

Related

Mongodb get new values from collection without timestamp

I want to fetch newly added values from MongoDB collections that have no timestamp value. I guess the only choice is using the ObjectId field. I'm using the test dataset from GitHub: "https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json"
For example, if I add new data to this collection, how do I fetch or find these new values?
Some MongoDB collections have a timestamp value, and I use that timestamp value to get new values. But I do not know how to find them without a timestamp.
Example dataset: (screenshot omitted)
I want a filter like this, but it doesn't work:
{_id: {$gt: '622e04d69edb39455e06d4af'}}
If you don't want to create a new field in the document:
SomeGlobalObj = [] // holds the latest document IDs, length limit is 10
// you will need Redis or other outside storage if you have multiple servers
SomeGlobalObj.unshift(newDocumentId)
SomeGlobalObj = SomeGlobalObj.slice(0, 10)
// make sure to keep only the latest 10 IDs
Now, if you want to retrieve the latest documents, you can use this array. If the new items should disappear once they have been checked, you can remove their IDs from this array after the query.
In the comments you mentioned that you want to do this using Python, so I shall answer from that perspective.
In Mongo, an ObjectId is composed of 3 sections:
a 4-byte timestamp value, representing the ObjectId's creation, measured in seconds since the Unix epoch
a 5-byte random value generated once per process. This random value is unique to the machine and process.
a 3-byte incrementing counter, initialized to a random value
Because of this, we can use the ObjectId to sort or filter by created timestamp. To construct an ObjectId for a specific date, we can use the following code:
import datetime
from bson.objectid import ObjectId

gen_time = datetime.datetime(2010, 1, 1)
dummy_id = ObjectId.from_datetime(gen_time)
result = collection.find({"_id": {"$lt": dummy_id}})  # collection: a pymongo Collection
Source: objectid - Tools for working with MongoDB ObjectIds
This example will find all documents created before 2010/01/01. Substituting $gt would allow this query to function as you desire.
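For example, to poll for documents newer than the last ObjectId you have seen (the collection variable is assumed, as above):

from bson.objectid import ObjectId

last_seen = ObjectId('622e04d69edb39455e06d4af')  # newest _id from the previous fetch
new_docs = collection.find({'_id': {'$gt': last_seen}}).sort('_id', 1)
for doc in new_docs:
    last_seen = doc['_id']  # remember the high-water mark for the next poll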
If you need to get the timestamp from an ObjectId, you can use the following code:
id = myObjectId.generation_time
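For example, using the ObjectId from the question:

from bson.objectid import ObjectId

oid = ObjectId('622e04d69edb39455e06d4af')
print(oid.generation_time)  # timezone-aware UTC datetime of creation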

Order_By custom date in peewee for SQLite

I made a huge mistake building up a database, but it works perfectly except for one feature. Changing the program in all the places where it would need to be changed for that feature to work would be a titanic job of weeks, so let's hope this workaround is possible.
The issue: I've stored data in a SQLite database as "dd/mm/yyyy" TextField format instead of DateField.
The need: I need to sort by dates on a union query, to get the last number of records in that union following my custom date format. They come from different tables, so I can't just use rowid or the like to get the last ones; I need to do it by date. And I can't change the already-stored data in the database, because invoices have already been created with that format ("dd/mm/yyyy" is the default date format in my country).
This is the query that captures data:
records = []
limited_to = 25
union = (facturas | contado | albaranes | presupuestos)
for record in (union
               .select_from(union.c.idunica, union.c.fecha, union.c.codigo,
                            union.c.tipo, union.c.clienterazonsocial,
                            union.c.totalimporte, union.c.pagada,
                            union.c.contabilizar, union.c.concepto1,
                            union.c.cantidad1, union.c.precio1,
                            union.c.observaciones)
               .order_by(union.c.fecha.desc())  # TODO this is what I need to change.
               .limit(limited_to)
               .tuples()):
    records.append(record)
Now to complicate things even more, the union is already built from a really complex where clause for each table before it's combined into a union query.
So my only hope is: Is there a way to make order_by follow a custom date format instead?
To clarify, this is the simple transformation that I'd need the order_by clause to follow, because I assume SQLite wouldn't have issues sorting if this would be the date format:
def reverse_date(date: str) -> str:
    """Reverse the date order from dd/mm/yyyy into yyyy-mm-dd."""
    dd, mm, yyyy = date.split("/")
    return f"{yyyy}-{mm}-{dd}"
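For example, as a client-side sort key:

rows = ["31/12/2020", "01/06/2019", "15/03/2020"]
rows.sort(key=reverse_date, reverse=True)
# rows == ["31/12/2020", "15/03/2020", "01/06/2019"]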
Note: I've left a lot of code out because I think it's unnecessary. This is the minimum amount of code needed to understand the problem. Let me know if you need more data.
Update: Trying this workaround, it seems to work fine. Need more testing but it's promising. If someone ever faces the same issue, here you go:
.order_by(sqlfunc.Substr(union.c.fecha, 7)
          .concat('/')
          .concat(sqlfunc.Substr(union.c.fecha, 4, 2))
          .concat('/')
          .concat(sqlfunc.Substr(union.c.fecha, 1, 2))
          .desc())
Happy end of 2020 year!
As you pointed out, if you want the dates to sort properly, they need to be in yyyy-mm-dd format, which is the text format you should always use in SQLite (or something with the same year, month, day, order).
You might be able to do a rearrangement using re.sub, but note that re.sub operates on Python strings, not SQL expressions, so it can't be passed to order_by directly; it works as a client-side sort key after fetching:

records.sort(key=lambda r: re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\2-\1', r[1]),
             reverse=True)  # r[1] is fecha, the second column in the tuples

Here we are using a regex to capture the day, month, and year components in separate capture groups. Then, we replace with these components in big-endian order, which sorts correctly as a plain string.

Speed up django nested for loop time series

I am working on a django-based open source project called OpenREM (http://demo.openrem.org/openrem/, http://openrem.org).
To calculate data for one of the plots that are used I am carrying out a series of queries to obtain the number of items that fall into each of the 24 hours for each day of the week. This data is used to plot the pie chart of studies per weekday on the CT page of the demo site, with a drill-down to studies per hour for that day:
studiesPerHourInWeekdays = [[0 for x in range(24)] for x in range(7)]
for day in range(7):
    studyTimesOnThisWeekday = f.qs.filter(study_date__week_day=day+1).values('study_time')
    if studyTimesOnThisWeekday:
        for hour in range(24):
            try:
                studiesPerHourInWeekdays[day][hour] = studyTimesOnThisWeekday.filter(
                    study_time__gte=str(hour) + ':00').filter(
                    study_time__lte=str(hour) + ':59').values('study_time').count()
            except:
                studiesPerHourInWeekdays[day][hour] = 0
This takes a little while to run on a production system. I think the second FOR loop could be removed by using a qsstats-magic time_series, aggregated over hours. Unfortunately there isn't a suitable datetime object stored in the database that I can use for this.
Does anyone know how I can combine the "study_date" datetime.date object and "study_time" datetime.time object into a single datetime.datetime object for me to be able to run a qsstats-magic time_series by hour?
Thanks,
David
If you can at all (though you don't seem able given your circumstance) it would be best to change the database schema to reflect the kinds of queries you're making. A datetime field that had this information, some type of foreign key set up, etc.
You probably already know that, though, so the practical answer to your question is that you want to use the underlying database tools to your advantage through an extra() call. Maybe something like this* if you're using postgres:
date_hour_set = f.qs.extra(
    select={
        'date_hour': "study_date + interval '1h' * date_part('hour', study_time)",
        'date_hour_count': "count(study_date + interval '1h' * date_part('hour', study_time))"
    }).values('date_hour', 'date_hour_count').distinct()
which would give you queryset of datetimes (hours only) with their associated occurrence count. Handwritten SQL will give you the easiest option at the moment because of Django's lagging TimeField support, and will probably be the most performant, too.
*Note I don't write SQL regularly and am being lazy, so there are cleaner ways to work this.
If you really really need to be database portable and still can't edit the schema, you can stack together features of Django aggregation that are maybe a little convoluted all together:
from django.db.models import Value, Count, ExpressionWrapper, CharField
from django.db.models.functions import Substr, Concat

hour_counts = f.qs.annotate(hour=Concat(Substr('study_time', 1, 2), Value(':00:00')))
date_hour_pairs = hour_counts.annotate(
    date_hour=ExpressionWrapper(Concat('study_date', 'hour'),
                                output_field=CharField())
).values('study_date', 'hour', 'date_hour')
date_hour_counts = date_hour_pairs.annotate(count=Count('date_hour')).distinct()
which should give you a set of dicts with a datetime.time object for 'hour', the datetime.date you started with for 'study_date', a concatted string version of the date and time under 'date_hour', and then the all important (date, hour) count under 'count'.

PostgreSQL time conversion/formatting from "24:00:00" to "00:00:00" in SELECT statement

I have some market data with time fields stored in a PostgreSQL database. PostgreSQL uses the format "00:00:00" to "24:00:00" to store times (see http://www.postgresql.org/docs/9.1/static/datatype-datetime.html) which works perfectly as long as I only work within the database.
The problem is that I have to do some data processing (using Python) afterwards, and Python's datetime.time type only supports hours 0 through 23. So if I fetch a record that contains "24:00:00" using the psycopg2 module, I get "ValueError: hour must be in 0..23" because the time field cannot be converted properly.
My idea for a clean workaround is to convert "24:00:00" values to "00:00:00" already in the SELECT statement. This would solve the problem, as the converter function would not fail afterwards.
I have already looked at the formatting functions (see http://www.postgresql.org/docs/9.4/static/functions-formatting.html) but could not find anything suitable.
Is there a way to realize this using SQL?
Thanks in advance!
The problem with the value '24:00:00'::time is clearly a psycopg2 error. While we wait for Daniele or me to fix it (if possible at all), here's a workaround: just use a CASE expression to check for the specific value that causes the error. If your table is named tab and the time column is t, then you can do:
SELECT CASE t WHEN '24:00:00'::time THEN '0:00:00'::time ELSE t END FROM tab;
And everything should work.
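If you would rather fix it once on the client instead of in every query, another option is to register a custom psycopg2 typecaster that special-cases '24:00:00' (a sketch; the caster name is mine, untested, and fractional seconds are truncated):

import datetime
import psycopg2.extensions

def cast_time_24(value, cur):
    # value is the raw string from the server, e.g. '24:00:00'
    if value is None:
        return None
    if value.startswith('24'):
        return datetime.time(0, 0)
    hh, mm, ss = value.split(':')
    return datetime.time(int(hh), int(mm), int(float(ss)))

TIME24 = psycopg2.extensions.new_type(
    psycopg2.extensions.TIME.values, 'TIME24', cast_time_24)
psycopg2.extensions.register_type(TIME24)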
Note that this is a problem only if you extract a time column. It seems that PostgreSQL converts timestamp columns (even ones representing a leap second) to the corresponding midnight, i.e., 2012-6-30T24:00:00 (30 June 2012 leap second) results in 2012-7-1T00:00:00.
When you add time to a date the result is a timestamp which you can cast to time:
select (current_date + market_data_time)::time;
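For example, from Python (connection details and table name are assumed):

import psycopg2

conn = psycopg2.connect('dbname=market')
cur = conn.cursor()
# date + time yields a timestamp; '24:00:00' rolls over to the next day's
# midnight, so casting back to time returns a parseable '00:00:00'.
cur.execute('SELECT (current_date + market_data_time)::time FROM market_data')
times = [row[0] for row in cur.fetchall()]  # datetime.time objects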

Django aggregate count of records per day

I've got a django app that is doing some logging. My model looks like this:
class MessageLog(models.Model):
    logtime = models.DateTimeField(auto_now_add=True)
    user = models.CharField(max_length=50)
    message = models.CharField(max_length=512)
What I want to do is get the average number of messages logged per day of the week, so that I can see which days are the most active. I've managed to write a query that pulls the total number of messages per day, which is:
for i in range(1, 8):
    MessageLog.objects.filter(logtime__week_day=i).count()
But I'm having trouble calculating the average in a query. What I have right now is:
for i in range(1, 8):
    MessageLog.objects.filter(logtime__week_day=i).annotate(
        num_msgs=Count('id')).aggregate(Avg('num_msgs'))
For some reason this is returning 1.0 for every day though. I looked at the SQL it is generating and it is:
SELECT AVG(num_msgs) FROM (
    SELECT
        `myapp_messagelog`.`id` AS `id`, `myapp_messagelog`.`logtime` AS `logtime`,
        `myapp_messagelog`.`user` AS `user`, `myapp_messagelog`.`message` AS `message`,
        COUNT(`myapp_messagelog`.`id`) AS `num_msgs`
    FROM `myapp_messagelog`
    WHERE DAYOFWEEK(`myapp_messagelog`.`logtime`) = 1
    GROUP BY `myapp_messagelog`.`id` ORDER BY NULL
) subquery
I think the problem might be coming from the GROUP BY id but I'm not really sure. Anyone have any ideas or suggestions? Thanks in advance!
The reason your listed query always gives 1 is because you're not grouping by date. Basically, you've asked the database to take the MessageLog rows that fall on a given day of the week. For each such row, count how many ids it has (always 1). Then take the average of all those counts, which is of course also 1.
Normally, you would need to use a values clause to group your MessageLog rows prior to your annotate and aggregate parts. However, since your logtime field is a datetime rather than just a date, I am not sure you can express that directly with Django's ORM. You can definitely do it with an extra clause, as shown here. Or if you felt like it you could declare a view in your SQL with as much of the aggregating and average math as you liked and declare an unmanaged model for it, then just use the ORM normally.
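A rough sketch of that view-plus-unmanaged-model idea (the view name and columns are mine, untested). First create the view, e.g. in a migration:

CREATE VIEW myapp_dailycount AS
SELECT DATE(logtime) AS day, COUNT(*) AS num_msgs
FROM myapp_messagelog
GROUP BY DATE(logtime);

Then declare an unmanaged model over it and use the ORM normally:

from django.db import models
from django.db.models import Avg

class DailyCount(models.Model):
    day = models.DateField(primary_key=True)
    num_msgs = models.IntegerField()

    class Meta:
        managed = False  # Django neither creates nor migrates this relation
        db_table = 'myapp_dailycount'

# DailyCount.objects.aggregate(Avg('num_msgs'))  # average messages per day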
So an extra field works to get the total number of records per actual day, but doesn't handle aggregating the average of the computed annotation. I think this may be sufficiently abstracted from the model that you'd have to use a raw SQL query, or at least I can't find anything that makes it work in one call.
That said, you already know how you can get the total number of records per weekday in a simple query as shown in your question.
And this query will tell you how many distinct date records there are on a given weekday:
MessageLog.objects.filter(logtime__week_day=i).dates('logtime', 'day').count()
So you could do the averaging math in Python instead, which might be simpler than trying to get the SQL right.
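Combining those two queries, the averaging in Python could look something like this (a sketch):

averages = {}
for i in range(1, 8):
    qs = MessageLog.objects.filter(logtime__week_day=i)
    total = qs.count()                          # messages on this weekday
    days = qs.dates('logtime', 'day').count()   # distinct dates on this weekday
    averages[i] = total / float(days) if days else 0.0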
Alternately, this query will get you the raw number of messages for all weekdays in one query rather than a for loop:
MessageLog.objects.extra({'weekday': "dayofweek(logtime)"}).values('weekday').annotate(Count('id'))
But I haven't been able to get a nice query to give you the count of distinct dates for each weekday annotated to that - dates querysets lose the ability to handle annotate calls, and annotating over an extra value doesn't seem to work either.
This has been surprisingly tricky, given that it's not that hard a SQL expression.
I do something similar with a datetime field, but annotating over extra values does work for me. I have a Record model with a datetime field "created_at" and a "my_value" field I want to get the average for.
from django.db.models import Avg

qs = Record.objects.extra({'created_day': "date(created_at)"}).\
    values('created_day').\
    annotate(count=Avg('my_value'))
The above will group by the day of the datetime value in "created_at" field.
queryset.extra(select={'day': 'date(logtime)'}).values('day').order_by('-day').annotate(Count('id'))
