Grouping by week, and padding out 'missing' weeks - python

In my Django app, I've got a very simple model that represents a single occurrence of an event (such as a server alert occurring):
class EventOccurrence(models.Model):
    event = models.ForeignKey(Event)
    time = models.DateTimeField()
My end goal is to produce a table or graph that shows how many times an event occurred over the past n weeks.
So my question has two parts:
How can I group_by the week of the time field?
How can I "pad out" the result of this group_by to add a zero-value for any missing weeks?
For example, for the second part, I'd like to transform a result like this:
| week | count |                    | week | count |
| 2    | 3     |                    | 2    | 3     |
| 3    | 5     |   -- becomes -->   | 3    | 5     |
| 5    | 1     |                    | 4    | 0     |
                                    | 5    | 1     |
What's the best way to do this in Django? General Python solutions are also OK.

Neither Django's DateField nor Python's datetime supports a week attribute directly. To fetch everything in one query you need to drop down to raw SQL:
from django.db import connection

cursor = connection.cursor()
cursor.execute(
    "SELECT WEEK(`time`) AS 'week', COUNT(*) AS 'count'"
    " FROM %s GROUP BY WEEK(`time`) ORDER BY WEEK(`time`)"
    % EventOccurrence._meta.db_table
)

data = []
results = cursor.fetchall()
for i, row in enumerate(results[:-1]):
    data.append(row)
    week = row[0] + 1
    next_week = results[i + 1][0]
    while week < next_week:  # pad any gap with zero-count weeks
        data.append((week, 0))
        week += 1
data.append(results[-1])
print(data)

After digging through the Django query API docs, I haven't found a way to express this query through the Django ORM. A raw cursor is a workaround if your database is MySQL:
from django.db import connection

cursor = connection.cursor()
cursor.execute("""
    select
        week(time) as `week`,
        count(*) as `count`
    from EventOccurrence
    group by week(time)
    order by 1;""")
myData = dictfetchall(cursor)
This is, in my opinion, the best-performing solution. But note that it doesn't pad missing weeks.
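For reference, dictfetchall above is the small helper from the Django docs that turns cursor rows into dicts:
def dictfetchall(cursor):
    # Return all rows from a cursor as a list of dicts (helper from the Django docs).
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]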
EDIT: database-independent solution via Python (lower performance)
If you are looking for database-independent code, you should take the dates week by week and aggregate them in Python. If this is your case, the code may look like:
import datetime

# get all weeks:
weeks = set()
d7 = datetime.timedelta(days=7)
iterDay = datetime.date(2012, 1, 1)
while iterDay <= datetime.date.today():
    weeks.add(iterDay.isocalendar()[1])
    iterDay += d7

# get all events
allEvents = EventOccurrence.objects.values_list('time', flat=True)

# aggregate events by week
result = dict()
for w in weeks:
    result.setdefault(w, 0)
for e in allEvents:
    result[e.isocalendar()[1]] += 1
(Disclaimer: not tested)
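For what it's worth, newer Django versions (1.11+) can express the grouping directly in the ORM via the TruncWeek database function. A minimal sketch (it still doesn't pad missing weeks):
from django.db.models import Count
from django.db.models.functions import TruncWeek

counts = (
    EventOccurrence.objects
    .annotate(week=TruncWeek('time'))  # truncate each timestamp to the start of its week
    .values('week')
    .annotate(count=Count('id'))
    .order_by('week')
)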

Since I have to query multiple tables by joining them, I'm using a DB view to meet these requirements.
CREATE VIEW my_view AS
SELECT
    *,                       -- other fields go here
    YEAR(time_field) AS year,
    WEEK(time_field) AS week
FROM my_table;
and the model as:
from django.db import models

class MyView(models.Model):
    # other fields go here
    year = models.IntegerField()
    week = models.IntegerField()

    class Meta:
        managed = False
        db_table = 'my_view'

def query():
    rows = MyView.objects.filter(week__range=[2, 5])
    # handle the rows
After getting rows from this DB view, use @danihp's approach to pad zeros for the "hole" weeks/months.
NOTE: this is only tested on the MySQL backend; I'm not sure if it's OK for MS SQL Server or others.
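Whichever way you fetch the per-week counts, padding the gaps can be done backend-independently in Python. A minimal sketch of a hypothetical helper (pad_weeks and its arguments are illustrative, not part of any answer above):
def pad_weeks(rows, first_week, last_week):
    # rows: iterable of (week, count) pairs; returns a list covering every
    # week in [first_week, last_week], inserting zero counts for weeks
    # missing from the input.
    counts = dict(rows)
    return [(week, counts.get(week, 0))
            for week in range(first_week, last_week + 1)]

# usage: pad_weeks([(2, 3), (3, 5), (5, 1)], 2, 5)
#        -> [(2, 3), (3, 5), (4, 0), (5, 1)]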


Complex query using Django QuerySets

I am working on a personal project and I am trying to write a complex query that:
Gets every device that belongs to a certain user
Gets every sensor belonging to every one of the user's devices
Gets the last recorded value and timestamp for each sensor on each of the user's devices.
I am using SQLite, and I managed to write the query as plain SQL, but for the life of me I cannot figure out a way to do it in Django. I looked at other questions and tried going through the documentation, but to no avail.
My models:
class User(AbstractBaseUser):
    email = models.EmailField()

class Device(models.Model):
    user = models.ForeignKey(User)
    name = models.CharField()

class Unit(models.Model):
    name = models.CharField()

class SensorType(models.Model):
    name = models.CharField()
    unit = models.ForeignKey(Unit)

class Sensor(models.Model):
    gpio_port = models.IntegerField()
    device = models.ForeignKey(Device)
    sensor_type = models.ForeignKey(SensorType)

class SensorData(models.Model):
    sensor = models.ForeignKey(Sensor)
    value = models.FloatField()
    timestamp = models.DateTimeField()
And here is the SQL query:
SELECT acc.email,
       dev.name AS device_name,
       stype.name AS sensor_type,
       sen.gpio_port AS sensor_port,
       sdata.value AS sensor_latest_value,
       unit.name AS sensor_units,
       sdata.latest AS value_received_on
FROM devices_device AS dev
INNER JOIN accounts_user AS acc ON dev.user_id = acc.id
INNER JOIN devices_sensor AS sen ON sen.device_id = dev.id
INNER JOIN devices_sensortype AS stype ON stype.id = sen.sensor_type_id
INNER JOIN devices_unit AS unit ON unit.id = stype.unit_id
LEFT JOIN (
    SELECT MAX(sd.timestamp) AS latest, sd.value, sd.sensor_id
    FROM devices_sensordata AS sd
    INNER JOIN devices_sensor AS s ON s.id = sd.sensor_id
    GROUP BY sd.sensor_id
) AS sdata ON sdata.sensor_id = sen.id
WHERE acc.id = 1
ORDER BY dev.id
I have been playing with the Django shell trying to find a way to implement this query with the QuerySet API, but I cannot figure it out...
The closest I managed to get is with this:
>>> sub = SensorData.objects.values('sensor_id', 'value').filter(sensor_id=OuterRef('pk')).order_by('-timestamp')[:1]
>>> Sensor.objects.annotate(data_id=Subquery(sub.values('sensor_id'))).filter(id=F('data_id')).values(...)
However it has two problems:
It does not include the sensors that do not yet have any values in SensorData
If I include the SensorData value field in the .values() call, I start to get previously recorded values of the sensors
If someone could please show me how to do it, or at least tell me what I am doing wrong I will be very grateful!
Thanks!
P.S. Please excuse my grammar and spelling errors, I am writing this in the middle of the night and I am tired.
EDIT:
Based on the answers I should clarify:
I only want the latest sensor value for each sensor. For example, if I have in SensorData:
id | sensor_id | value | timestamp
---+-----------+-------+-------------
 1 |     1     |   2   | <today>
 2 |     1     |   5   | <yesterday>
 3 |     2     |   3   | <yesterday>
only the latest row should be returned for each sensor_id:
id | sensor_id | value | timestamp
---+-----------+-------+-------------
 1 |     1     |   2   | <today>
 3 |     2     |   3   | <yesterday>
Or, if the sensor does not yet have any data in this table, I want the query to return a record for it with NULL for value and timestamp (basically the LEFT JOIN in my SQL query).
EDIT2:
Based on @ivissani's answer, I managed to produce this:
>>> latest_sensor_data = Sensor.objects.annotate(is_latest=~Exists(SensorData.objects.filter(sensor=OuterRef('id'),timestamp__gt=OuterRef('sensordata__timestamp')))).filter(is_latest=True)
>>> user_devices = latest_sensor_data.filter(device__user=1)
>>> for x in user_devices.values_list('device__name','sensor_type__name', 'gpio_port','sensordata__value', 'sensor_type__unit__name', 'sensordata__timestamp').order_by('device__name'):
... print(x)
Which seems to do the job.
This is the SQL it produces:
SELECT
    "devices_device"."name",
    "devices_sensortype"."name",
    "devices_sensor"."gpio_port",
    "devices_sensordata"."value",
    "devices_unit"."name",
    "devices_sensordata"."timestamp"
FROM "devices_sensor"
LEFT OUTER JOIN "devices_sensordata"
    ON ("devices_sensor"."id" = "devices_sensordata"."sensor_id")
INNER JOIN "devices_device"
    ON ("devices_sensor"."device_id" = "devices_device"."id")
INNER JOIN "devices_sensortype"
    ON ("devices_sensor"."sensor_type_id" = "devices_sensortype"."id")
INNER JOIN "devices_unit"
    ON ("devices_sensortype"."unit_id" = "devices_unit"."id")
WHERE (
    NOT EXISTS(
        SELECT U0."id", U0."sensor_id", U0."value", U0."timestamp"
        FROM "devices_sensordata" U0
        WHERE (
            U0."sensor_id" = ("devices_sensor"."id")
            AND U0."timestamp" > ("devices_sensordata"."timestamp")
        )
    ) = True
    AND "devices_device"."user_id" = 1
)
ORDER BY "devices_device"."name" ASC
Actually your query is rather simple; the only complex part is establishing which SensorData is the latest for each Sensor. I would use annotations and an Exists subquery, in the following way:
latest_data = SensorData.objects.annotate(
    is_latest=~Exists(
        SensorData.objects.filter(sensor=OuterRef('sensor'),
                                  timestamp__gt=OuterRef('timestamp'))
    )
).filter(is_latest=True)
Then it's just a matter of filtering this queryset by user in the following way:
certain_user_latest_data = latest_data.filter(sensor__device__user=certain_user)
Now, as you want to retrieve the sensors even if they don't have any data, this query will not suffice: only SensorData instances are retrieved, and the Sensor and Device must be accessed through fields. Unfortunately, Django does not allow explicit joins through its ORM, so I suggest the following (and let me say, this is far from ideal from a performance perspective).
The idea is to annotate the Sensor queryset with the specific values of the latest SensorData (value and timestamp), if any exist, in the following way:
latest_data = SensorData.objects.annotate(
    is_latest=~Exists(
        SensorData.objects.filter(sensor=OuterRef('sensor'),
                                  timestamp__gt=OuterRef('timestamp'))
    )
).filter(is_latest=True, sensor=OuterRef('pk'))

sensors_with_value = Sensor.objects.annotate(
    latest_value=Subquery(latest_data.values('value')),
    latest_value_timestamp=Subquery(latest_data.values('timestamp'))
)  # this will generate two subqueries...

certain_user_sensors = sensors_with_value.filter(
    device__user=certain_user
).select_related('device__user')
If there aren't any instances of SensorData for a certain Sensor then the annotated fields latest_value and latest_value_timestamp will simply be set to None.
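A quick usage sketch for the queryset above (attribute access follows the question's models; the two annotated fields are plain values):
for sensor in certain_user_sensors:
    # latest_value / latest_value_timestamp are None when a sensor
    # has no SensorData rows yet
    print(sensor.device.name, sensor.gpio_port,
          sensor.latest_value, sensor.latest_value_timestamp)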
For these kinds of queries I strongly recommend using Q objects; see the docs: https://docs.djangoproject.com/en/2.2/topics/db/queries/#complex-lookups-with-q-objects
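A minimal sketch of a Q lookup (the conditions here are illustrative, not taken from the question):
from django.db.models import Q

# sensors that belong to a given user's devices OR sit on a given GPIO port
Sensor.objects.filter(Q(device__user=certain_user) | Q(gpio_port=4))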
It's perfectly fine to execute raw queries with Django, especially when they are this complex.
If you want to map the results to models use this :
https://docs.djangoproject.com/en/2.2/topics/db/sql/#performing-raw-queries
Otherwise, see this : https://docs.djangoproject.com/en/2.2/topics/db/sql/#executing-custom-sql-directly
Note that in both cases, Django does no checking on the query.
This means the security of the query is entirely your responsibility, so sanitize the parameters.
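A minimal sketch of a parameterized raw query (the SQL here is a placeholder, not the question's full query; passing params separately lets the database driver escape them):
# maps rows onto Sensor instances; %s placeholders are filled by the driver
sensors = Sensor.objects.raw(
    "SELECT * FROM devices_sensor WHERE device_id = %s",
    [device_id],
)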
Something like this?:
Multiple Devices for 1 User
device_ids = Device.objects.filter(user=user).values_list("id", flat=True)
SensorData.objects.filter(
    sensor__device__id__in=device_ids
).values(
    "sensor__device__name", "sensor__sensor_type__name", "value", "timestamp"
).order_by("-timestamp")
1 Device, 1 User
SensorData.objects.filter(
    sensor__device__user=user
).values(
    "sensor__device__name", "sensor__sensor_type__name", "value", "timestamp"
).order_by("-timestamp")
That queryset will:
1. Get every device that belongs to a certain user
2. Get every sensor belonging to each of the user's devices (it returns sensor__sensor_type__name for each sensor, because Sensor has no name field of its own)
3. Get every recorded value and timestamp (ordered by latest timestamp first) for each of the user's devices' sensors
UPDATE
try this:
list_data = []
for _id in device_ids:
    # filter by device id (device_ids holds the ids of the user's devices)
    sensor_data = SensorData.objects.filter(sensor__device__id=_id)
    if sensor_data.exists():
        data = sensor_data.values(
            "sensor__id", "value", "timestamp", "sensor__device__user__id"
        ).latest("timestamp")
        list_data.append(data)

Effectively classifying a DB of event logs by a column

Situation
I am using Python 3.7.2 with its built-in sqlite3 module. (sqlite3.version == 2.6.0)
I have a sqlite database that looks like:
| user_id | action | timestamp  |
| ------- | ------ | ---------- |
| Alice   | 0      | 1551683796 |
| Alice   | 23     | 1551683797 |
| James   | 1      | 1551683798 |
| ....... | ...... | .......... |
where user_id is TEXT, action is an arbitrary INTEGER, and timestamp is an INTEGER representing UNIX time.
The database has 200M rows, and there are 70K distinct user_ids.
Goal
I need to make a Python dictionary that looks like:
{
    "Alice": [(0, 1551683796), (23, 1551683797)],
    "James": [(1, 1551683798)],
    ...
}
that has user_ids as keys and respective event logs as values, which are lists of tuples (action, timestamp). Hopefully each list will be sorted by timestamp in increasing order, but even if it isn't, I think it can be easily achieved by sorting each list after a dictionary is made.
Effort
I have the following code to query the database. It first queries for the list of users (with user_list_cursor), and then queries for all rows belonging to each user.
import sqlite3

connection = sqlite3.connect("database.db")

user_list_cursor = connection.cursor()
user_list_cursor.execute("SELECT DISTINCT user_id FROM EVENT_LOG")
user_id = user_list_cursor.fetchone()

classified_log = {}
log_cursor = connection.cursor()
while user_id:
    user_id = user_id[0]  # cursor.fetchone() returns a tuple
    query = (
        "SELECT action, timestamp"
        " FROM EVENT_LOG"
        " WHERE user_id = ?"
        " ORDER BY timestamp ASC"
    )
    parameters = (user_id,)
    log_cursor.execute(query, parameters)  # here is the bottleneck
    classified_log[user_id] = list()
    for row in log_cursor.fetchall():
        classified_log[user_id].append(row)
    user_id = user_list_cursor.fetchone()
Problem
The query execution for each user is too slow. That single line of code (commented as the bottleneck) takes around 10 seconds for each user_id. I think I am taking the wrong approach with the queries. What is the right way to achieve the goal?
I tried searching with the keywords "classify db by a column", "classify sql by a column", and "sql log to dictionary python", but nothing seems to match my situation. I think this can't be a rare need, so maybe I'm missing the right keyword to search with.
Reproducibility
If anyone is willing to reproduce the situation with a 200M row sqlite database, the following code will create a 5GB database file.
But I hope there is somebody who is familiar with such a situation and knows how to write the right query.
import sqlite3
import random

connection = sqlite3.connect("tmp.db")
cursor = connection.cursor()
cursor.execute(
    "CREATE TABLE IF NOT EXISTS EVENT_LOG (user_id TEXT, action INTEGER, timestamp INTEGER)"
)

query = "INSERT INTO EVENT_LOG VALUES (?, ?, ?)"
parameters = []
for timestamp in range(200_000_000):
    user_id = f"user{random.randint(0, 70000)}"
    action = random.randint(0, 1_000_000)
    parameters.append((user_id, action, timestamp))
cursor.executemany(query, parameters)
connection.commit()
cursor.close()
connection.close()
Big thanks to @Strawberry and @Solarflare for the help they gave in comments.
The following solution achieved a more than 70x performance increase, so I'm leaving what I did as an answer for completeness' sake.
I used indices and queried for the whole table, as they suggested.
import sqlite3
from operator import itemgetter

connection = sqlite3.connect("database.db")

# Creating an index, thanks to @Solarflare
cursor = connection.cursor()
cursor.execute("CREATE INDEX IF NOT EXISTS idx_user_id ON EVENT_LOG (user_id)")
connection.commit()

# Reading the whole table, then making lists by user_id. Thanks to @Strawberry
cursor.execute("SELECT user_id, action, timestamp FROM EVENT_LOG ORDER BY user_id ASC")
previous_user_id = None
log_per_user = list()
classified_log = dict()
for row in cursor:
    user_id, action, timestamp = row
    if user_id != previous_user_id:
        if previous_user_id:
            log_per_user.sort(key=itemgetter(1))
            classified_log[previous_user_id] = log_per_user[:]
            log_per_user = list()
    log_per_user.append((action, timestamp))
    previous_user_id = user_id
# don't forget to flush the last user's log after the loop
if previous_user_id:
    log_per_user.sort(key=itemgetter(1))
    classified_log[previous_user_id] = log_per_user
So the points are:
Indexing by user_id to make ORDER BY user_id ASC execute in acceptable time.
Reading the whole table, then classify by user_id, instead of making individual queries for each user_id.
Iterating over cursor to read row by row, instead of cursor.fetchall().
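As an aside, the same grouping loop can be written with itertools.groupby, since the rows arrive already sorted by user_id; a minimal sketch under the same schema assumptions:
from itertools import groupby
from operator import itemgetter

cursor.execute(
    "SELECT user_id, action, timestamp FROM EVENT_LOG"
    " ORDER BY user_id ASC, timestamp ASC"
)
# consecutive rows with equal user_id form one group; sorting by timestamp
# in SQL removes the need for the per-user Python sort
classified_log = {
    user_id: [(action, timestamp) for _, action, timestamp in rows]
    for user_id, rows in groupby(cursor, key=itemgetter(0))
}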

Web2py DAL find the record with the latest date

Hi, I have a table with the following structure.
Table Name: DOCUMENTS
Sample Table Structure:
ID | UIN      | COMPANY_ID | DOCUMENT_NAME  | MODIFIED_ON         |
---|----------|------------|----------------|---------------------|
1  | UIN_TX_1 | 1          | txn_summary    | 2016-09-02 16:02:42 |
2  | UIN_TX_2 | 1          | txn_summary    | 2016-09-02 16:16:56 |
3  | UIN_AD_3 | 2          | some other doc | 2016-09-02 17:15:43 |
I want to fetch the UIN of the latest modified record for the company whose id is 1 and whose document_name is "txn_summary".
This is the PostgreSQL query that works:
select distinct on (company_id)
    uin
from documents
where company_id = 1
  and document_name = 'txn_summary'
order by company_id, "modified_on" DESC;
This query fetches me UIN_TX_2 which is correct.
I am using web2py DAL to get this value. After some research I managed to do this:
fmax = db.documents.modified_on.max()
query = (db.documents.company_id==1) & (db.documents.document_name=='txn_summary')
rows = db(query).select(fmax)
Now "rows" contains only the value of the modified_on date which has maximum value. I want to fetch the record which has the maximum date inside "rows". Please suggest a way. Help is much appreciated.
And my requirement extends to find each such records for each company_id for each document_name.
Your approach will not return complete row, it will only return last modified_on value.
To fetch last modified record for the company whose id is 1 and document_name "txn_summary", query will be
query = (db.documents.company_id==1) & (db.documents.document_name=='txn_summary')
row = db(query).select(db.documents.ALL,
                       orderby=~db.documents.modified_on,
                       limitby=(0, 1)).first()
orderby=~db.documents.modified_on returns records in descending order of modified_on (the last modified record first), and first() selects the first record, so the complete query returns the last modified record having company_id 1 and document_name "txn_summary".
There may be other/better ways to achieve this. Hope this helps!
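For the extended requirement (the latest record per company_id/document_name pair), one straightforward, if not maximally efficient, sketch is to fetch the distinct pairs first and reuse the same query per pair:
# fetch distinct (company_id, document_name) pairs, then the latest row for each
pairs = db().select(db.documents.company_id,
                    db.documents.document_name,
                    distinct=True)
latest = {}
for p in pairs:
    q = ((db.documents.company_id == p.company_id) &
         (db.documents.document_name == p.document_name))
    latest[(p.company_id, p.document_name)] = db(q).select(
        db.documents.ALL,
        orderby=~db.documents.modified_on,
        limitby=(0, 1)).first()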

Getting database object in Django python

The following is my database model:
id | dept | bp | s | o | d | created_by | created_date
dept and bp together have a unique index for the table. This means there will always be 30 different records in the database under the same dept and bp. I was trying to run an update on the records by getting the object first. The following is the way I tried to get the object:
try:
    Sod_object = Sod.objects.get(dept=dept_name, bp=bp_name)
except Sod.DoesNotExist:
    print("Object doesn't exist")
    msg = "Sod doesn't exist!"
else:
    for s in Sod_object:
        # Do something
But it's always giving me 30 records (obviously). How can I make this a single object? Any suggestions?
In Django, every model gets a primary key column automatically unless you define one yourself; by default it is a column called id, unique for every object in a table.
So you can do
Sod_object = Sod.objects.get(id=1)
to fetch the first object in your table.
To update all the 30 objects for your dept-bp combination, you can filter() by those values
sods = Sod.objects.filter(dept=dept_name, bp=bp_name)
and then update each one and save. Django will internally remember the individual records by their id.
for sod in sods:
    sod.o = 1
    sod.save()
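As a side note, when every matched row gets the same value, a single update() call pushes the change down as one SQL UPDATE instead of one query per row (it bypasses save() and any signals, though):
# same effect as the loop above, in one query
Sod.objects.filter(dept=dept_name, bp=bp_name).update(o=1)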

Pivoting data and complex annotations in Django ORM

The ORM in Django lets us easily annotate (add fields to) querysets based on related data; however, I can't find a way to get multiple annotations for different filtered subsets of related data.
This is being asked in relation to django-helpdesk, an open-source Django-powered trouble-ticket tracker. I need to have data pivoted like this for charting and reporting purposes.
Consider these models:
CHOICE_LIST = (
    ('open', 'Open'),
    ('closed', 'Closed'),
)

class Queue(models.Model):
    name = models.CharField(max_length=40)

class Issue(models.Model):
    subject = models.CharField(max_length=40)
    queue = models.ForeignKey(Queue)
    status = models.CharField(max_length=10, choices=CHOICE_LIST)
And this dataset:
Queues:
ID | Name
---+------------------------------
1 | Product Information Requests
2 | Service Requests
Issues:
ID | Queue | Status
---+-------+---------
1 | 1 | open
2 | 1 | open
3 | 1 | closed
4 | 2 | open
5 | 2 | closed
6 | 2 | closed
7 | 2 | closed
I would like an annotation/aggregate that looks something like this:
Queue ID | Name | open | closed
---------+-------------------------------+------+--------
1 | Product Information Requests | 2 | 1
2 | Service Requests | 1 | 3
This is basically a crosstab or pivot table, in Excel parlance. I am currently building this output using some custom SQL queries, however if I can move to using the Django ORM I can more easily filter the data dynamically without doing dodgy insertion of WHERE clauses in my SQL.
For "bonus points": How would one do this where the pivot field (status in the example above) was a date, and we wanted the columns to be months / weeks / quarters / days?
You have Python, use it.
from collections import defaultdict

summary = defaultdict(int)
for issue in Issue.objects.all():
    summary[issue.queue, issue.status] += 1
Now your summary object has queue, status as a two-tuple key. You can display it directly, using various template techniques.
Or, you can regroup it into a table-like structure, if that's simpler.
table = []
queues = sorted(set(q for q, _ in summary), key=lambda q: q.id)
for q in queues:
    table.append((q.id, q.name, summary[q, 'open'], summary[q, 'closed']))
You have lots and lots of Python techniques for doing pivot tables.
If you measure, you may find that a mostly-Python solution like this is actually faster than a pure SQL solution. Why? Mappings can be faster than SQL algorithms which require a sort as part of a GROUP-BY.
Django has added a lot of functionality to the ORM since this question was originally asked. The answer to how to pivot data since Django 1.8 is to use Case/When conditional expressions. There is also a third-party app that will do it for you (see PyPI and its documentation).
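A minimal sketch of the Case/When approach against the models above (assuming the default issue reverse relation from Queue to Issue):
from django.db.models import Case, Count, IntegerField, When

# Count() ignores NULLs, so each Case only counts issues in the matching status
pivot = Queue.objects.annotate(
    open=Count(Case(When(issue__status='open', then=1),
                    output_field=IntegerField())),
    closed=Count(Case(When(issue__status='closed', then=1),
                      output_field=IntegerField())),
).values('id', 'name', 'open', 'closed')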
