How to retrieve only the year from a timestamp column? - python

I have the following query that runs correctly on Postgres 9.3:
select distinct date_part('year', date_created)
from "Topic";
The intention is to return only the distinct years in the column date_created, which is defined thus:
date_created | timestamp with time zone | not null default now()
I need to turn it into a SQLAlchemy query but what I wrote does a select distinct on the date_created, not on the year, and returns the whole row, not just the distinct value:
topics = Topic.query.distinct(func.date_part('YEAR', Topic.date_created)).all()
How can I get only the distinct years from the table Topic?

Here are two variants:
Using ORM:
from sqlalchemy import func, distinct
result = session.query(distinct(func.date_part('YEAR', Topic.date_created)))
for row in result:
    print(row[0])
SQL Expression:
from sqlalchemy import func, select, distinct
query = select([distinct(func.date_part('YEAR', Topic.date_created))])
for row in session.execute(query):
    print(row[0])
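The SQL Expression variant uses the 1.x select([...]) list style; on SQLAlchemy 1.4/2.0, select() takes its columns positionally. A sketch of the same query in the newer style:
from sqlalchemy import func, select
stmt = select(func.date_part('YEAR', Topic.date_created)).distinct()
for year in session.execute(stmt).scalars():
    print(year)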

SQL Alchemy syntax aside, you have a potential problem in your query.
Your data type is timestamptz (timestamp with time zone), which is a good choice. However, you cannot tell the year reliably from a timestamptz alone; you need to specify the time zone as well. If you don't, the current time zone setting of the session is applied silently, which may or may not be what you want.
Think of New Year's Eve: timestamptz '2016-01-01 04:00:00+00' - what year is it?
It's 2016 in Europe, but still 2015 in the USA.
You should make that explicit with the AT TIME ZONE construct to avoid sneaky mistakes:
SELECT extract(year FROM timestamptz '2016-01-01 04:00:00+00'
AT TIME ZONE 'America/New_York') AS year;
Detailed explanation:
Ignoring timezones altogether in Rails and PostgreSQL
date_part() and extract() do the same thing in Postgres; extract() is the SQL standard, so prefer that.
BTW, you could also just:
SELECT extract(year FROM date_created) AS year
FROM "Topic"
GROUP BY 1;
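Carried back to the SQLAlchemy question, a sketch of the zone-aware distinct-years query (Postgres exposes AT TIME ZONE as the timezone(zone, timestamp) function, which func.timezone() renders):
from sqlalchemy import distinct, extract, func
# convert to an explicit zone first, then extract the year
local_ts = func.timezone('America/New_York', Topic.date_created)
years = session.query(distinct(extract('year', local_ts))).all()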

Use the extract function:
from sqlalchemy import extract
session.query(extract('year', Topic.date_created))
This is concept code, not tested.

Related

SQL Counter Loop

Coming from Python/R but new to SQL...
Is there a way to use 'counter' in a SQL loop?
For example, how can I mimic the following simple Python for-loop in SQL:
counter = 7
for i in range(1, counter + 1):
    some_function_that_takes_i_as_argument(i)
I am trying to substitute a numeric value "i" (over a range) into a SQL query!
With SQL you need to develop a different way of thinking: not procedural, but in terms of what set of rows you are looking to get. From your statement SELECT * FROM tableA WHERE date > SYSDATE - i, it seems you just want data from the last 8 days (today and the 7 days prior). With SQL this is a single statement (no loop required):
select *
from table_a
where date >= trunc(sysdate - 7)
order by date desc;
NOTE: The above uses "date" as a column name. This is an extremely poor choice, as date is an Oracle data type as well as an Oracle and SQL-standard reserved word.
The query uses the trunc function since the Oracle date data type always includes time.
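If the number of days really does need to vary from Python, bind it as a single parameter rather than looping; a sketch using the python-oracledb driver (table_a, date_col, and the connection details are assumed names):
import oracledb  # assuming the python-oracledb driver

days = 7  # the would-be loop counter becomes one bind variable
with oracledb.connect(user="scott", password="tiger", dsn="localhost/XEPDB1") as con:
    with con.cursor() as cur:
        cur.execute(
            "select * from table_a "
            "where date_col >= trunc(sysdate - :days) "
            "order by date_col desc",
            days=days,
        )
        rows = cur.fetchall()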

Python - Filtering SQL query based on dates

I am trying to build a SQL query that will filter based on system date (Query for all sales done in the last 7 days):
import datetime
import pandas as pd
import psycopg2
con = psycopg2.connect(db_details)
cur = con.cursor()
df = pd.read_sql("""select store_name,count(*) from sales
where created_at between datetime.datetime.now() - (datetime.today() - timedelta(7))""",con=con)
I get an error
psycopg2.NotSupportedError: cross-database references are not implemented: datetime.datetime.now
You are mixing Python syntax into your SQL query. SQL is parsed and executed by the database, not by Python, and the database knows nothing about datetime.datetime.now(), datetime.date(), or timedelta(). The specific error you see is caused by your Python code being interpreted as SQL: as SQL, datetime.datetime.now references the now column of the datetime table in the datetime database. That is a cross-database reference, and psycopg2 doesn't support queries that involve multiple databases.
Instead, use SQL parameters to pass in values from Python to the database. Use placeholders in the SQL to show the database driver where the values should go:
params = {
# all rows after this timestamp, 7 days ago relative to 'now'
'earliest': datetime.datetime.now() - datetime.timedelta(days=7),
# if you must have a date *only* (no time component), use
# 'earliest': datetime.date.today() - datetime.timedelta(days=7),
}
df = pd.read_sql("""
select store_name,count(*) from sales
where created_at >= %(earliest)s""", params=params, con=con)
This uses placeholders as defined by the psycopg2 parameters documentation, where %(earliest)s refers to the earliest key in the params dictionary. datetime.datetime instances are directly supported by the driver.
Note that I also fixed your 7 days ago expression, and replaced your BETWEEN syntax with >=; without a second date you are not querying for values between two dates, so use >= to limit the column to dates at or after the given date.
datetime.datetime.now() is not proper SQL syntax, and thus cannot be executed by read_sql(). I suggest either using the SQL syntax for computing the current time, or creating variables for each of datetime.datetime.now() and datetime.today() - timedelta(7) and substituting them into your string.
edit: Do not follow the second suggestion. See comments below by Martijn Pieters.
Maybe you should remove that Python code inside your SQL, compute your dates in python and then use the strftime function to convert them to strings.
Then you'll be able to use them in your SQL query.
Actually, you do not necessarily need any params or computations in Python. Just use the corresponding SQL statement which should look like this:
select store_name,count(*)
from sales
where created_at >= now()::date - 7
group by store_name
Edit: I also added a group by which I think is missing.
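Combined with pandas, a sketch of the whole corrected call (db_details as in the question):
import pandas as pd
import psycopg2

con = psycopg2.connect(db_details)
df = pd.read_sql("""
    select store_name, count(*)
    from sales
    where created_at >= now()::date - 7
    group by store_name
""", con=con)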

Django day and month event date query

I have a django model that looks like this:
class Event(models.Model):
    name = models.CharField(...etc...)
    date = models.DateField(...etc...)
What I need is a way to get all events that are on a given day and month - much like an "on this day" page.
def on_this_day(self, day, month):
    return Events.filter(????)
I've tried all the regular date query types, but they all seem to require a year, and short of iterating through all years, I can't see how this could be done.
You can make a query like this by specifying the day and the month:
def on_this_day(day, month):
    return Event.objects.filter(date__day=day, date__month=month)
Under the hood this most likely scans your table using SQL functions like MONTH(date) and DAY(date), or whatever lookup equivalent your backend provides.
You might get better query performance if you add, and index, separate day and month fields on Event (if Event.date is stored internally as an int in the DB, that makes it less suited to your (day, month) queries); a sketch of that follows.
Here's some docs from Django: https://docs.djangoproject.com/en/dev/ref/models/querysets/#month
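If you do go that denormalization route, a sketch of what it might look like (the day and month fields here are hypothetical additions, kept in sync on save):
from django.db import models

class Event(models.Model):
    name = models.CharField(max_length=200)
    date = models.DateField()
    day = models.PositiveSmallIntegerField(db_index=True, editable=False)
    month = models.PositiveSmallIntegerField(db_index=True, editable=False)

    def save(self, *args, **kwargs):
        # keep the indexed lookup fields in sync with date
        self.day = self.date.day
        self.month = self.date.month
        super().save(*args, **kwargs)
on_this_day() can then filter on the indexed columns directly: Event.objects.filter(day=day, month=month).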

Django - get distinct dates from timestamp

I'm trying to filter users by date, but can't until I can find the first and last date of users in the db. While I could have my script filter out duplicates later on, I want to do it from the outset using Django's distinct, since it significantly reduces the number of rows returned. I tried
User.objects.values('install_time').distinct().order_by()
but since install_time is a timestamp, it includes the date AND time (which I don't really care about). As a result, the only ones it filters out are dates where we could retrieve multiple users' install dates but not times.
Any idea how to do this? I'm running this using Django 1.3.1, Postgres 9.0.5, and the latest version of psycopg2.
EDIT: I forgot to add the data type of install_time:
install_time = models.DateTimeField()
EDIT 2: Here's some sample output from the Postgres shell, along with a quick explanation of what I want:
2011-09-19 00:00:00
2011-09-11 00:00:00
2011-09-11 00:00:00 <--filtered out by distinct() (same date and time)
2011-10-13 06:38:37.576
2011-10-13 00:00:00 <--NOT filtered out by distinct() (same date but different time)
I am aware of Manager.raw, but would rather use django.db.connection.cursor to write the query directly, since Manager.raw returns a RawQuerySet which, IMO, is worse than just writing the SQL query manually and iterating.
When doing reports on larger datasets, itertools.groupby might be too slow. In those cases I make Postgres handle the grouping:
from django.db import connection
from django.db.models import Sum

truncate_date = connection.ops.date_trunc_sql('day', 'timestamp')
qs = qs.extra({'date': truncate_date})
return qs.values('date').annotate(Sum('amount')).order_by('date')
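On Django 1.10+ the same grouping by date can be done without extra(), using the TruncDate database function; a sketch against the model in this question:
from django.db.models.functions import TruncDate

dates = (User.objects
         .annotate(date=TruncDate('install_time'))
         .values_list('date', flat=True)
         .distinct()
         .order_by('date'))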
I've voted to close this since it's a dup of this question, so here's the answer if you don't want to visit the link, courtesy of nosklo.
Create a small function to extract just the date:
def extract_date(entity):
    'extracts the starting date from an entity'
    return entity.start_time.date()
Then you can use it with itertools.groupby:
from itertools import groupby
entities = Entity.objects.order_by('start_time')
for start_date, group in groupby(entities, key=extract_date):
    do_something_with(start_date, list(group))
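Adapted to the model in the question (User with an install_time DateTimeField), the same pattern yields the distinct install dates:
from itertools import groupby

def extract_install_date(user):
    return user.install_time.date()

users = User.objects.order_by('install_time')
distinct_dates = [d for d, _ in groupby(users, key=extract_install_date)]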

How do I GROUP BY on every given increment of a field value?

I have a Python application. It has an SQLite database, full of data about things that happen, retrieved by a Web scraper from the Web. This data includes time-date groups, as Unix timestamps, in a column reserved for them. I want to retrieve the names of organisations that did things and count how often they did them, but to do this for each week (i.e. 604,800 seconds) I have data for.
Pseudocode:
for each 604800-second increment in time:
    select count(time), org from table group by org
Essentially what I'm trying to do is iterate through the database like a list sorted on the time column, with a step value of 604800. The aim is to analyse how the distribution of different organisations in the total changed over time.
If at all possible, I'd like to avoid pulling all the rows from the db and processing them in Python as this seems a) inefficient and b) probably pointless given that the data is in a database.
Not being familiar with SQLite, I think this approach should work for most databases, as it computes the week number and subtracts the offset:
SELECT org, ROUND(time/604800) - week_offset, COUNT(*)
FROM table
GROUP BY org, ROUND(time/604800) - week_offset
In Oracle I would use the following if time was a date column:
SELECT org, TO_CHAR(time, 'YYYY-IW'), COUNT(*)
FROM table
GROUP BY org, TO_CHAR(time, 'YYYY-IW')
SQLite probably has similar functionality that allows this kind of SELECT which is easier on the eye.
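For SQLite specifically, strftime() with the 'unixepoch' modifier gives that kind of readable bucket; a sketch run from Python (the events, org, and time names are assumptions, with time holding Unix timestamps):
import sqlite3

con = sqlite3.connect("scraper.db")
query = """
    SELECT strftime('%Y-%W', time, 'unixepoch') AS week, org, COUNT(*)
    FROM events
    GROUP BY week, org
    ORDER BY week
"""
for week, org, n in con.execute(query):
    print(week, org, n)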
Create a table listing all weeks since the epoch, and JOIN it to your table of events.
CREATE TABLE Weeks (
week INTEGER PRIMARY KEY
);
INSERT INTO Weeks (week) VALUES (200919); -- e.g. this week
SELECT w.week, e.org, COUNT(*)
FROM Events e JOIN Weeks w ON (w.week = strftime('%Y%W', e.time, 'unixepoch'))
GROUP BY w.week, e.org;
There are only 52-53 weeks per year. Even if you populate the Weeks table for 100 years, that's still a small table.
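Populating Weeks is a one-off job; a sketch in Python, assuming the %Y%W-style keys used in the join above:
import datetime
import sqlite3

con = sqlite3.connect("scraper.db")
rows = []
day = datetime.date(2008, 1, 1)
while day < datetime.date(2012, 1, 1):
    # walk day by day so the short week 00 at each year start is
    # covered too; INSERT OR IGNORE collapses the duplicates
    rows.append((int(day.strftime("%Y%W")),))
    day += datetime.timedelta(days=1)
con.executemany("INSERT OR IGNORE INTO Weeks (week) VALUES (?)", rows)
con.commit()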
To do this in a set-based manner (which is what SQL is good at) you will need a set-based representation of your time increments. That can be a temporary table, a permanent table, or a derived table (i.e. subquery). I'm not too familiar with SQLite and it's been a while since I've worked with UNIX. UNIX timestamps are just a number of seconds since some set date/time? Using a standard Calendar table (which is useful to have in a database)...
SELECT
C1.start_time,
C2.end_time,
T.org,
COUNT(time)
FROM
Calendar C1
INNER JOIN Calendar C2 ON
C2.start_time = DATEADD(dy, 6, C1.start_time)
INNER JOIN My_Table T ON
T.time BETWEEN C1.start_time AND C2.end_time -- You'll need to convert to timestamp here
WHERE
DATEPART(dw, C1.start_time) = 1 AND -- Basically, only get dates that are a Sunday or whatever other day starts your intervals
C1.start_time BETWEEN #start_range_date AND #end_range_date -- Period for which you're running the report
GROUP BY
C1.start_time,
C2.end_time,
T.org
The Calendar table can take whatever form you want, so you could use UNIX timestamps in it for the start_time and end_time. You just pre-populate it with all of the dates in any conceivable range that you might want to use. Even going from 1900-01-01 to 9999-12-31 won't be a terribly large table. It can come in handy for a lot of reporting type queries.
Finally, this code is T-SQL, so you'll probably need to convert the DATEPART and DATEADD to whatever the equivalent is in SQLite.
