Django - get distinct dates from timestamp - python

I'm trying to filter users by date, but can't until I can find the first and last date of users in the db. While I can have my script filter out dups later on, I want to do it from the outset using Django's distinct() since it significantly reduces the number of rows returned. I tried
User.objects.values('install_time').distinct().order_by()
but since install_time is a timestamp, it includes the date AND time (which I don't really care about). As a result, the only rows it filters out are those where multiple users share both the same install date and the same install time.
Any idea how to do this? I'm running this using Django 1.3.1, Postgres 9.0.5, and the latest version of psycopg2.
EDIT: I forgot to add the data type of install_time:
install_time = models.DateTimeField()
EDIT 2: Here's some sample output from the Postgres shell, along with a quick explanation of what I want:
2011-09-19 00:00:00
2011-09-11 00:00:00
2011-09-11 00:00:00 <--filtered out by distinct() (same date and time)
2011-10-13 06:38:37.576
2011-10-13 00:00:00 <--NOT filtered out by distinct() (same date but different time)
I am aware of Manager.raw, but would rather use django.db.connection.cursor to write the query directly, since Manager.raw returns a RawQuerySet which, IMO, is worse than just writing the SQL query manually and iterating.

When doing reports on larger datasets, itertools.groupby might be too slow. In those cases I let Postgres handle the grouping:
from django.db import connection
from django.db.models import Sum

truncate_date = connection.ops.date_trunc_sql('day', 'timestamp')
qs = qs.extra(select={'date': truncate_date})
return qs.values('date').annotate(Sum('amount')).order_by('date')
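Applied to the original question, a hedged sketch along the same lines (assuming the User model with the install_time field shown above) might look like this:

from django.db import connection

truncate_date = connection.ops.date_trunc_sql('day', 'install_time')
qs = User.objects.extra(select={'install_date': truncate_date})
# order_by() clears any default ordering so distinct() applies to the dates alone
distinct_dates = qs.values('install_date').distinct().order_by()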

I've voted to close this since it's a dup of this question, so here's the answer if you don't want to visit the link, courtesy of nosklo.
Create a small function to extract just the date:
def extract_date(entity):
    '''Extracts the starting date from an entity.'''
    return entity.start_time.date()
Then you can use it with itertools.groupby:
from itertools import groupby

entities = Entity.objects.order_by('start_time')
for start_date, group in groupby(entities, key=extract_date):
    do_something_with(start_date, list(group))
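A hedged adaptation to the models in the question (assuming the User model and install_time field described above):

from itertools import groupby

users = User.objects.order_by('install_time')
for install_date, group in groupby(users, key=lambda u: u.install_time.date()):
    print(install_date, len(list(group)))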

Related

Python - Filtering SQL query based on dates

I am trying to build a SQL query that will filter based on system date (Query for all sales done in the last 7 days):
import datetime
import pandas as pd
import psycopg2
con = psycopg2.connect(db_details)
cur = con.cursor()
df = pd.read_sql("""select store_name,count(*) from sales
where created_at between datetime.datetime.now() - (datetime.today() - timedelta(7))""",con=con)
I get an error
psycopg2.NotSupportedError: cross-database references are not implemented: datetime.datetime.now
You are mixing Python syntax into your SQL query. SQL is parsed and executed by the database, not by Python, and the database knows nothing about datetime.datetime.now(), datetime.date() or timedelta()! The specific error you see is caused by your Python code being interpreted as SQL: read as SQL, datetime.datetime.now references the now column of the datetime table in the datetime database, which is a cross-database reference, and psycopg2 doesn't support queries that involve multiple databases.
Instead, use SQL parameters to pass in values from Python to the database. Use placeholders in the SQL to show the database driver where the values should go:
params = {
    # all rows after this timestamp, 7 days ago relative to 'now'
    'earliest': datetime.datetime.now() - datetime.timedelta(days=7),
    # if you must have a date *only* (no time component), use
    # 'earliest': datetime.date.today() - datetime.timedelta(days=7),
}
df = pd.read_sql("""
    select store_name, count(*) from sales
    where created_at >= %(earliest)s""", params=params, con=con)
This uses placeholders as defined by the psycopg2 parameters documentation, where %(earliest)s refers to the earliest key in the params dictionary. datetime.datetime() instances are supported directly by the driver.
Note that I also fixed your 7 days ago expression, and replaced your BETWEEN syntax with >=; without a second date you are not querying for values between two dates, so use >= to limit the column to dates at or after the given date.
datetime.datetime.now() is not valid SQL syntax, and thus cannot be executed by read_sql(). I suggest either using the SQL syntax that computes the current time, or creating variables for each of datetime.datetime.now() and datetime.today() - timedelta(7) and substituting them into your string.
edit: Do not follow the second suggestion. See comments below by Martijn Pieters.
Maybe you should remove that Python code inside your SQL, compute your dates in Python, and then use the strftime function to convert them to strings.
Then you'll be able to use them in your SQL query.
Actually, you do not necessarily need any params or computations in Python. Just use the corresponding SQL statement which should look like this:
select store_name,count(*)
from sales
where created_at >= now()::date - 7
group by store_name
Edit: I also added a group by which I think is missing.
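For completeness, a hedged sketch of running that pure-SQL variant through pandas, reusing the db_details placeholder and the psycopg2 connection from the question:

import pandas as pd
import psycopg2

con = psycopg2.connect(db_details)
df = pd.read_sql("""
    select store_name, count(*)
    from sales
    where created_at >= now()::date - 7
    group by store_name""", con=con)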

Django: Order by evaluation of whether or not a date is empty

In Django, is it possible to order by whether or not a field is None, instead of the value of the field itself?
I know I can send the QuerySet to python sorted() but I want to keep it as a QuerySet for subsequent filtering. So, I'd prefer to order in the QuerySet itself.
For example, I have a termination_date field and I want to first sort the ones without a termination_date, then I want to order by a different field, like last_name, first_name.
Is this possible or am I stuck using sorted() and then having to do an entire new Query with the included ids and run sorted() on the new QuerySet? I can do this, but would prefer not to waste the overhead and use the beauty of QuerySets that they don't run until evaluated.
Translation: how can I get this SQL from Django, assuming my app is employee, my model is Employee, and it has three fields, first_name (varchar), last_name (varchar), and termination_date (date)?
SELECT
"employee_employee"."last_name",
"employee_employee"."first_name",
"employee_employee"."termination_date"
FROM "employee_employee"
ORDER BY
"employee_employee"."termination_date" IS NOT NULL,
"employee_employee"."last_name",
"employee_employee"."first_name"
You should be able to order by query expressions, like this:
from django.db.models import IntegerField, Case, Value, When

MyModel.objects.all().order_by(
    Case(
        When(some_field=None, then=Value(1)),
        default=Value(0),
        output_field=IntegerField(),
    ).asc(),
    'some_other_field'
)
I cannot test here so it might require a bit of fiddling around, but this should put rows that have a NULL some_field after those that have a some_field. And each set of rows should be sorted by some_other_field.
Granted, the CASE/WHEN is a bit more cumbersome than what you put in your question, but I don't know how to get the Django ORM to output exactly that. Maybe someone else will have a better answer.
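As a hedged sketch, mapping the same pattern onto the Employee model from the question (rows without a termination_date first, then ordered by name) might look like this:

from django.db.models import Case, IntegerField, Value, When

Employee.objects.order_by(
    Case(
        When(termination_date=None, then=Value(0)),  # no termination_date sorts first
        default=Value(1),
        output_field=IntegerField(),
    ).asc(),
    'last_name',
    'first_name',
)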
Spectras' answer works fine, but it only orders your records by 'null or not'. There is a shorter way that allows you to put empty dates wherever you want them in your date ordering - Coalesce:
from datetime import datetime

from django.db.models import Value
from django.db.models.functions import Coalesce

wayback = datetime(year=1, month=1, day=1)  # or whatever date you want

MyModel.objects.annotate(
    null_date=Coalesce('date_field', Value(wayback))
).order_by('null_date')
This will essentially sort by date_field, with all records whose date_field is None ordered as if they had the date wayback. This works perfectly with PostgreSQL, but might need some raw SQL casting in MySQL, as described in the documentation.

Django day and month event date query

I have a django model that looks like this:
class Event(models.Model):
    name = models.CharField(...etc...)
    date = models.DateField(...etc...)
What I need is a way to get all events that are on a given day and month - much like an "on this day" page.
def on_this_day(self, day, month):
    return Events.filter(????)
I've tried all the regular date query types, but they all seem to require a year, and short of iterating through all years, I can't see how this could be done.
You can make a query like this by specifying the day and the month:
def on_this_day(day, month):
    return Event.objects.filter(date__day=day, date__month=month)
It most likely scans your database table using SQL operators like MONTH(date) and DAY(date), or some lookup equivalent.
You might get better query performance if you add and index separate Event.day and Event.month fields (if Event.date is stored internally as an integer in the DB, that makes it less suited to your (day, month) queries).
Here's some docs from Django: https://docs.djangoproject.com/en/dev/ref/models/querysets/#month
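A hedged sketch of that denormalization idea (the extra day and month fields, their options, and the save() override are illustrative assumptions, not part of the original model):

from django.db import models

class Event(models.Model):
    name = models.CharField(max_length=200)
    date = models.DateField(db_index=True)
    # denormalized, indexed copies of the day and month for fast (day, month) lookups
    day = models.PositiveSmallIntegerField(db_index=True, editable=False)
    month = models.PositiveSmallIntegerField(db_index=True, editable=False)

    def save(self, *args, **kwargs):
        # keep the denormalized columns in sync with the date
        self.day = self.date.day
        self.month = self.date.month
        super(Event, self).save(*args, **kwargs)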

How do I GROUP BY on every given increment of a field value?

I have a Python application. It has an SQLite database, full of data about things that happen, retrieved by a Web scraper from the Web. This data includes time-date groups, as Unix timestamps, in a column reserved for them. I want to retrieve the names of organisations that did things and count how often they did them, but to do this for each week (i.e. 604,800 seconds) I have data for.
Pseudocode:
for each 604800-second increment in time:
    select count(time), org from table group by org
Essentially what I'm trying to do is iterate through the database like a list sorted on the time column, with a step value of 604800. The aim is to analyse how the distribution of different organisations in the total changed over time.
If at all possible, I'd like to avoid pulling all the rows from the db and processing them in Python as this seems a) inefficient and b) probably pointless given that the data is in a database.
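For reference, the pseudocode above translates almost literally into sqlite3 calls with a parameterized per-week query (a hedged sketch; the events table, the time/org column names, and the database filename are assumptions):

import sqlite3

con = sqlite3.connect('scraper.db')  # hypothetical database file
week = 604800

# assumes the events table is non-empty
t_min, t_max = con.execute("SELECT MIN(time), MAX(time) FROM events").fetchone()
start = t_min
while start <= t_max:
    rows = con.execute(
        "SELECT org, COUNT(*) FROM events WHERE time >= ? AND time < ? GROUP BY org",
        (start, start + week),
    ).fetchall()
    print(start, rows)
    start += week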
Not being familiar with SQLite, I think this approach should work for most databases, as it finds the week number and subtracts the offset:
SELECT org, ROUND(time/604800) - week_offset, COUNT(*)
FROM table
GROUP BY org, ROUND(time/604800) - week_offset
In Oracle I would use the following if time was a date column:
SELECT org, TO_CHAR(time, 'YYYY-IW'), COUNT(*)
FROM table
GROUP BY org, TO_CHAR(time, 'YYYY-IW')
SQLite probably has similar functionality that allows this kind of SELECT which is easier on the eye.
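It does: SQLite's strftime() with the 'unixepoch' modifier can compute a year-week label directly from a Unix timestamp. A hedged sketch (the events table, the time/org column names, and the database filename are assumptions):

import sqlite3

con = sqlite3.connect('scraper.db')  # hypothetical database file
rows = con.execute("""
    SELECT org, strftime('%Y-%W', time, 'unixepoch') AS week, COUNT(*)
    FROM events
    GROUP BY org, week
    ORDER BY week, org
""").fetchall()
for org, week, count in rows:
    print(week, org, count)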
Create a table listing all weeks since the epoch, and JOIN it to your table of events.
CREATE TABLE Weeks (
    week INTEGER PRIMARY KEY
);
INSERT INTO Weeks (week) VALUES (200919); -- e.g. this week

SELECT w.week, e.org, COUNT(*)
FROM Events e
JOIN Weeks w ON (w.week = strftime('%Y%W', e.time, 'unixepoch')) -- 'unixepoch' since e.time holds Unix timestamps
GROUP BY w.week, e.org;
There are only 52-53 weeks per year. Even if you populate the Weeks table for 100 years, that's still a small table.
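A hedged sketch of populating that Weeks table from Python (the database filename, the start date, and the 100-year range are assumptions):

import datetime
import sqlite3

con = sqlite3.connect('scraper.db')  # hypothetical database file
start = datetime.date(1970, 1, 1)    # the epoch
for i in range(52 * 100):            # roughly 100 years of weeks
    week = int((start + datetime.timedelta(weeks=i)).strftime('%Y%W'))
    con.execute("INSERT OR IGNORE INTO Weeks (week) VALUES (?)", (week,))
con.commit()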
To do this in a set-based manner (which is what SQL is good at) you will need a set-based representation of your time increments. That can be a temporary table, a permanent table, or a derived table (i.e. subquery). I'm not too familiar with SQLite and it's been a while since I've worked with UNIX. Timestamps in UNIX are just the number of seconds since some set date/time? Using a standard Calendar table (which is useful to have in a database)...
SELECT
    C1.start_time,
    C2.end_time,
    T.org,
    COUNT(time)
FROM
    Calendar C1
INNER JOIN Calendar C2 ON
    C2.start_time = DATEADD(dy, 6, C1.start_time)
INNER JOIN My_Table T ON
    T.time BETWEEN C1.start_time AND C2.end_time -- You'll need to convert to timestamp here
WHERE
    DATEPART(dw, C1.start_time) = 1 AND -- Basically, only get dates that are a Sunday or whatever other day starts your intervals
    C1.start_time BETWEEN @start_range_date AND @end_range_date -- Period for which you're running the report
GROUP BY
    C1.start_time,
    C2.end_time,
    T.org
The Calendar table can take whatever form you want, so you could use UNIX timestamps in it for the start_time and end_time. You just pre-populate it with all of the dates in any conceivable range that you might want to use. Even going from 1900-01-01 to 9999-12-31 won't be a terribly large table. It can come in handy for a lot of reporting type queries.
Finally, this code is T-SQL, so you'll probably need to convert the DATEPART and DATEADD to whatever the equivalent is in SQLite.

Django - SQL Query - Timestamp

Can anyone point me to a tutorial, code, or some other resource that will help me out with the following problem?
I have a table in a MySQL database. It contains an ID, a timestamp, another ID, and a value. I'm passing it the 'main' ID, which can uniquely identify a piece of data. However, I want to do a time search on this piece of data (hence the timestamp field). Therefore what would be ideal is to say: between the hours of 12 and 1, show me all the values logged for ID = 1987.
How would I go about querying this in Django? I know in MySQL it'd be something like less than/greater than etc., but how would I go about doing this in Django? I've been using Object.Filter for most of my database handling so far. Finally, I'd like to stress that I'm new to Django and I'm genuinely stumped!
If the table in question maps to a Django model MyModel, e.g.
class MyModel(models.Model):
    ...
    primaryid = ...
    timestamp = ...
    secondaryid = ...
    valuefield = ...
then you can use
MyModel.objects.filter(
    primaryid=1987
).exclude(
    timestamp__lt=<min_timestamp>
).exclude(
    timestamp__gt=<max_timestamp>
).values_list('valuefield', flat=True)
This selects entries with the primaryid 1987, with timestamp values between <min_timestamp> and <max_timestamp>, and returns the corresponding values in a list.
Update: Corrected bug in query (filter -> exclude).
I don't think Vinay Sajip's answer is correct. The closest correct variant based on his code is:
MyModel.objects.filter(
    primaryid=1987
).exclude(
    timestamp__lt=min_timestamp
).exclude(
    timestamp__gt=max_timestamp
).values_list('valuefield', flat=True)
That's "exclude the ones less than the minimum timestamp and exclude the ones greater than the maximum timestamp." Alternatively, you can do this:
MyModel.objects.filter(
    primaryid=1987
).filter(
    timestamp__gte=min_timestamp
).exclude(
    timestamp__gte=max_timestamp
).values_list('valuefield', flat=True)
exclude() and filter() are opposites: exclude() omits the identified rows and filter() includes them. You can use a combination of them to include/exclude whichever you prefer. In your case, you want to exclude() those below your minimum time stamp and to exclude() those above your maximum time stamp.
Here is the documentation on chaining QuerySet filters.
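As a follow-up, a hedged sketch of the "between the hours of 12 and 1" example from the question, using the timestamp__range lookup (the concrete datetimes are illustrative; the model and field names are those defined above):

from datetime import datetime

start = datetime(2011, 10, 13, 12, 0, 0)
end = datetime(2011, 10, 13, 13, 0, 0)

values = MyModel.objects.filter(
    primaryid=1987,
    timestamp__range=(start, end),
).values_list('valuefield', flat=True)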
