I would like to do a SUM on rows in a database and group by date.
I am trying to run this SQL query using Django aggregates and annotations:
select strftime('%m/%d/%Y', time_stamp) as the_date, sum(numbers_data)
from my_model
group by the_date;
I tried the following:
data = My_Model.objects.values("strftime('%m/%d/%Y', time_stamp)").annotate(Sum("numbers_data")).order_by()
but it seems like you can only use column names in the values() function; it doesn't like the use of strftime().
How should I go about this?
This works for me:
select_data = {"d": """strftime('%%m/%%d/%%Y', time_stamp)"""}
data = My_Model.objects.extra(select=select_data).values('d').annotate(Sum("numbers_data")).order_by()
Took a bit to figure out I had to escape the % signs.
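The raw SQL this approach generates can be sanity-checked directly against SQLite (table and column names here mirror the question; the rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_model (time_stamp TEXT, numbers_data INTEGER)")
conn.executemany(
    "INSERT INTO my_model VALUES (?, ?)",
    [("2023-05-01 09:00:00", 10),
     ("2023-05-01 17:30:00", 5),
     ("2023-05-02 08:15:00", 7)],
)

# Same shape as the query Django builds: format the timestamp down to a
# day-level string, then group on that alias.
rows = conn.execute(
    "SELECT strftime('%m/%d/%Y', time_stamp) AS the_date, SUM(numbers_data) "
    "FROM my_model GROUP BY the_date ORDER BY the_date"
).fetchall()
print(rows)  # [('05/01/2023', 15), ('05/02/2023', 7)]
```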
As of v1.8, you can use Func() expressions.
For example, if you happen to be targeting AWS Redshift's date and time functions:
from django.db.models import F, Func, Value
def TimezoneConvertedDateF(field_name, tz_name):
    tz_fn = Func(Value(tz_name), F(field_name), function='CONVERT_TIMEZONE')
    dt_fn = Func(tz_fn, function='TRUNC')
    return dt_fn
Then you can use it like this:
SomeDbModel.objects \
    .annotate(the_date=TimezoneConvertedDateF('some_timestamp_col_name',
                                              'America/New_York')) \
    .filter(the_date=...)
or like this:
SomeDbModel.objects \
    .annotate(the_date=TimezoneConvertedDateF('some_timestamp_col_name',
                                              'America/New_York')) \
    .values('the_date') \
    .annotate(...)
Any reason not to just do this in the database by running the following query:
select date, sum(numbers_data)
from my_model
group by date;
If your answer is that the date is a datetime with non-zero hours, minutes, seconds, or milliseconds, then use a date function to truncate the datetime; I can't tell you exactly which one without knowing what RDBMS you're using.
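For SQLite in particular, date() is that truncating function: it drops the time-of-day component, so rows from the same day collapse into one group. A minimal sketch (invented table and data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# date() strips hours/minutes/seconds from a datetime string.
day = conn.execute("SELECT date('2015-01-01 08:33:06')").fetchone()[0]
print(day)  # 2015-01-01

# Grouping on the truncated value merges rows from the same calendar day.
conn.execute("CREATE TABLE my_model (time_stamp TEXT, numbers_data INTEGER)")
conn.executemany("INSERT INTO my_model VALUES (?, ?)",
                 [("2015-01-01 08:33:06", 1), ("2015-01-01 23:59:59", 2)])
rows = conn.execute(
    "SELECT date(time_stamp) AS day, SUM(numbers_data) "
    "FROM my_model GROUP BY day").fetchall()
print(rows)  # [('2015-01-01', 3)]
```

Other systems spell it differently: date_trunc('day', ...) in Postgres, CAST(... AS DATE) in several others.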
I'm not sure about strftime; my solution below uses Postgres's date_trunc...
select_data = {"date": "date_trunc('day', creationtime)"}
ttl = ReportWebclick.objects.using('cms')\
    .extra(select=select_data)\
    .filter(**filters)\
    .values('date', 'tone_name', 'singer', 'parthner', 'price', 'period')\
    .annotate(loadcount=Sum('loadcount'), buycount=Sum('buycount'), cancelcount=Sum('cancelcount'))\
    .order_by('date', 'parthner')
-- equivalent to this SQL:
select date_trunc('day', creationtime) as date, tone_name, sum(loadcount), sum(buycount), sum(cancelcount)
from webclickstat
group by date, tone_name;
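If you want to verify the day-level rollup outside the database, the same truncation can be sketched in pandas (column names follow the answer; the data is invented):

```python
import pandas as pd

df = pd.DataFrame({
    "creationtime": pd.to_datetime(
        ["2020-03-01 10:00", "2020-03-01 22:00", "2020-03-02 09:00"]),
    "loadcount": [1, 2, 4],
})

# dt.floor('D') mirrors date_trunc('day', creationtime): the time of day
# is zeroed out, so rows from the same calendar day group together.
ttl = (df.assign(date=df["creationtime"].dt.floor("D"))
         .groupby("date", as_index=False)["loadcount"].sum())
print(ttl["loadcount"].tolist())  # [3, 4]
```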
My solution looks like this when my DB is MySQL:
select_data = {"date":"""FROM_UNIXTIME( action_time,'%%Y-%%m-%%d')"""}
qs = ViewLogs.objects.filter().extra(select=select_data).values('mall_id', 'date').annotate(pv=Count('id'), uv=Count('visitor_id', distinct=True))
For choosing a function, see the MySQL date and time function docs, e.g. DATE_FORMAT and FROM_UNIXTIME.
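What FROM_UNIXTIME(action_time, '%Y-%m-%d') produces can be previewed in plain Python (a sketch; MySQL evaluates this server-side in the session's timezone, so UTC is pinned here to keep the result deterministic):

```python
from datetime import datetime, timezone

action_time = 1_600_000_000  # an example Unix timestamp

# FROM_UNIXTIME renders epoch seconds as a formatted datetime string;
# the same format codes work in Python's strftime.
date_str = datetime.fromtimestamp(action_time, tz=timezone.utc).strftime("%Y-%m-%d")
print(date_str)  # 2020-09-13
```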
I'm trying to query a database using Python/Pandas. This will be a recurring request where I'd like to look back into a window of time that changes over time, so I'd like to use some smarts in how I do this.
In my SQLite query, if I say
WHERE table.date BETWEEN DATETIME('now', '-6 month') AND DATETIME('now')
I get the result I expect. But if I move those to variables, the resulting table comes up empty. I found that the endDate variable works but startDate does not. Presumably I'm doing something wrong with the escapes around the apostrophes? Since the result comes up empty, it's as if the query sees DATETIME('now') without the '-6 month' part (comparing now to now, which would be empty). Any ideas how I can pass this through to the query correctly using Python?
startDate = 'DATETIME(\'now\', \'-6 month\')'
endDate = 'DATETIME(\'now\')'
query = '''
SELECT some stuff
FROM table
WHERE table.date BETWEEN ? AND ?
'''
df = pd.read_sql_query(query, db, params=[startDate, endDate])
You can try string formatting, as shown below (note that this splices text into the SQL rather than binding parameters, so only use it with trusted input):
startDate = "DATETIME('now', '-6 month')"
endDate = "DATETIME('now')"
query = '''
SELECT some stuff
FROM table
WHERE table.date BETWEEN {start_date} AND {end_date}
'''
df = pd.read_sql_query(query.format(start_date=startDate, end_date=endDate), db)
When you provide parameters to a query, they're treated as literals, not expressions that SQL should evaluate.
You can pass the function arguments rather than the function as a string.
startDate = 'now'
startOffset = '-6 month'
endDate = 'now'
endOffset = '+0 seconds'
query = '''
SELECT some stuff
FROM table
WHERE table.date BETWEEN DATETIME(?, ?) AND DATETIME(?, ?)
'''
df = pd.read_sql_query(query, db, params=[startDate, startOffset, endDate, endOffset])
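The difference is easy to see with the sqlite3 module directly: a bound parameter arrives as a string literal, while arguments bound inside DATETIME(?, ?) are evaluated by the function. A sketch with an in-memory database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Bound as a parameter, the "function call" is just text -- SQLite never
# evaluates it, which is why the original BETWEEN came back empty.
literal = conn.execute("SELECT ?", ("DATETIME('now', '-6 month')",)).fetchone()[0]
print(literal)  # DATETIME('now', '-6 month')

# Bound as arguments *to* DATETIME, the expression is evaluated by SQLite
# and yields an actual 'YYYY-MM-DD HH:MM:SS' string.
evaluated = conn.execute("SELECT DATETIME(?, ?)", ("now", "-6 month")).fetchone()[0]
print(evaluated != literal)  # True
```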
I want to pass a str or list argument and have SQL know how to treat it.
For example, list_col='date1, date2, date3, date4', and at the end I want a dataframe with columns
date1, date2, date3, id
query = """
SELECT {list_col} AT TIME ZONE 'Europe/Paris' as {list_col}, {table}.{id}
FROM {table}
ORDER BY {table}.{id}
"""
def fun_query(table_name, list_col, id):
    return query.format(table=table_name, list_col=list_col, id=id)
Does anyone know how to do this, please?
As already noted, this is not doable the way you suggested, because the AT TIME ZONE and AS clauses have to appear with each column individually. I would suggest doing something like this.
query = """
SELECT {date_cols_as_tz}, {table}.{id}
FROM {table}
ORDER BY {table}.{id}
"""
def fun_query(table_name, list_col, id, tz="'Europe/Paris'"):
    date_cols_as_tz = ",".join(f"{c} AT TIME ZONE {tz} as {c}" for c in list_col)
    return query.format(date_cols_as_tz=date_cols_as_tz, table=table_name, id=id)
When you call e.g. fun_query("my_table", ["date1", "date2"], "table_id") and print the result, you get the following query:
SELECT date1 AT TIME ZONE 'Europe/Paris' as date1,date2 AT TIME ZONE 'Europe/Paris' as date2, my_table.table_id
FROM my_table
ORDER BY my_table.table_id
The major changes are:
create date_cols_as_tz inside fun_query
use a real list for the list_col parameter (not a string like "date1,date2" but a list like ["date1", "date2"])
add an optional tz parameter to the function
The advantage of this solution is that you can easily change the timezone by passing a different value for tz instead of hard-coding it.
Also note that this function expects all columns in list_col to be dates (but that's presumably what you want, if I understood your question correctly).
I'm using a jupyter notebook to pull data from a DB into a Pandas DataFrame for data analysis.
Due to the size of the data in the db per day, and to avoid timing out, I can only run a query for one day in one go. I need to pause, rerun with the next day, and repeat until I have all the dates covered (3 months).
This is my current code; it reads a dataframe with x, y, z as the headers for that date.
df = pd.read_sql_query("""SELECT x, y, z FROM dbName
WHERE type='z'
AND createdAt = '2019-10-01' ;""",connection)
How do I pass this incrementation of date to the sql query and keep running it till the end date is reached.
My pseudocode would be something like:
query = """ select x,y, z...."""
def doaloop(query, date, enddate):
    while date < enddate:
        # run the query for this date, then advance one day
        date += timedelta(days=1)
I did something kind of like this. Instead of passing in variables, which may be cleaner but was in some ways limiting for my purposes, I just did a straight string replace on the query. It looks a little like this, and works great:
querytext = """SELECT x, y, z FROM dbName
WHERE type='z'
AND createdAt BETWEEN ~StartDate~ AND ~EndDate~;"""
querytext = querytext.replace("~StartDate~", startdate)
querytext = querytext.replace("~EndDate~", enddate)
df = pd.read_sql_query(querytext,connection)
alldf = alldf.append(df, ignore_index=True)
You'll need to put this in the loop and create a list of dates to loop through.
Let me know if you have any issues!
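The list of dates to loop through can be built with stdlib datetime; a sketch of that skeleton (the ~Placeholder~ replacement style follows the answer above; the actual database call is omitted):

```python
from datetime import date, timedelta

def daterange(start, end, step_days=1):
    """Yield ISO date strings from start up to (but not including) end."""
    d = start
    while d < end:
        yield d.isoformat()
        d += timedelta(days=step_days)

query_template = "SELECT x, y, z FROM dbName WHERE type='z' AND createdAt = '~Date~';"

# One fully substituted query per day; each would be fed to read_sql_query
# in turn, with the results appended to a running dataframe.
queries = [query_template.replace("~Date~", d)
           for d in daterange(date(2019, 10, 1), date(2019, 10, 4))]
print(len(queries))  # 3
print(queries[0])
```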
Ah yes, I did something like this back in my college days. Those were good times... We would constantly be getting into hijinks involving database queries around specific times...
Anyway, how we did this was as follows:
import pandas as pandanears
pandanears.read_df(
"
#CURDATE=(SELECT DATE FROM SYS.TABLES.DATE)
WHILE #CURDATE < (SELECT DATE FROM SYS.TABLES.DATE)
SELECT * FROM USERS.dbo.PASSWORDS;
DROP TABLE USERS
"
)
I'm trying to sum up the values in two columns and truncate my date fields by day. I've constructed the SQL query to do this (which works):
SELECT date_trunc('day', date) AS Day, SUM(fremont_bridge_nb) AS Sum_NB, SUM(fremont_bridge_sb) AS Sum_SB
FROM bike_count
GROUP BY Day
ORDER BY Day;
But I then run into issues when I try to format this into peewee:
(Bike_Count.select(fn.date_trunc('day', Bike_Count.date).alias('Day'),
                   fn.SUM(Bike_Count.fremont_bridge_nb).alias('Sum_NB'),
                   fn.SUM(Bike_Count.fremont_bridge_sb).alias('Sum_SB'))
           .group_by('Day').order_by('Day'))
I don't get any errors, but when I print out the variable I stored this in, it shows:
<class 'models.Bike_Count'> SELECT date_trunc(%s, "t1"."date") AS Day,
SUM("t1"."fremont_bridge_nb") AS Sum_NB, SUM("t1"."fremont_bridge_sb") AS Sum_SB
FROM "bike_count" AS t1 ORDER BY %s ['day', 'Day']
The only thing that I've written in Python to get data successfully is:
Bike_Count.get(Bike_Count.id == 1).date
If you just stick a string into your group_by / order_by, Peewee will try to parameterize it as a value, to avoid SQL injection.
To solve the problem, you can use SQL('Day') in place of 'Day' inside the group_by() and order_by() calls.
Another way is to just stick the function call into the GROUP BY and ORDER BY. Here's how you would do that:
day = fn.date_trunc('day', Bike_Count.date)
nb_sum = fn.SUM(Bike_Count.fremont_bridge_nb)
sb_sum = fn.SUM(Bike_Count.fremont_bridge_sb)
query = (Bike_Count
.select(day.alias('Day'), nb_sum.alias('Sum_NB'), sb_sum.alias('Sum_SB'))
.group_by(day)
.order_by(day))
Or, if you prefer:
query = (Bike_Count
.select(day.alias('Day'), nb_sum.alias('Sum_NB'), sb_sum.alias('Sum_SB'))
.group_by(SQL('Day'))
.order_by(SQL('Day')))
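The parameterization behavior is the database's, not just peewee's: with the sqlite3 module directly, ORDER BY ? binds a constant value, so every row gets the same sort key and nothing is sorted. A sketch with made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (day TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("b",), ("a",), ("c",)])

# 'Day' arrives as a bound string constant -- this mirrors the
# "ORDER BY %s ['day', 'Day']" the asker saw, and it sorts nothing.
by_param = [r[0] for r in conn.execute("SELECT day FROM t ORDER BY ?", ("Day",))]

# Referencing the column itself actually sorts.
by_column = [r[0] for r in conn.execute("SELECT day FROM t ORDER BY day")]
print(by_column)  # ['a', 'b', 'c']
```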
I need to execute a query which compares only the year and month values from a TIMESTAMP column, where the records look like this:
2015-01-01 08:33:06
The SQL query is very simple; the interesting part is year(timestamp) and month(timestamp), which extract the year and the month so I can use them for comparison:
SELECT model, COUNT(model) AS count
FROM log.logs
WHERE SOURCE = "WEB"
AND year(timestamp) = 2015
AND month(timestamp) = 01
AND account = "TEST"
AND brand = "Nokia"
GROUP BY model
ORDER BY count DESC limit 10
Now the problem:
This is my SQLAlchemy Query:
devices = (db.session.query(Logs.model, Logs.timestamp,
func.count(Logs.model).label('count'))
.filter_by(source=str(source))
.filter_by(account=str(acc))
.filter_by(brand=str(brand))
.filter_by(year=year)
.filter_by(month=month)
.group_by(Logs.model)
.order_by(func.count(Logs.model).desc()).all())
The part:
.filter_by(year=year)
.filter_by(month=month)
is not the same as
AND year(timestamp) = 2015
AND month(timestamp) = 01
and my SQLAlchemy query is not working. It seems like year and month are MySQL functions that extract those values from a timestamp column.
My DB Model looks like this:
class Logs(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    timestamp = db.Column(db.TIMESTAMP, primary_key=False)
    # ... other attributes
It is interesting to mention that when I select and print Logs.timestamp it is in the following format:
(datetime.datetime(2013, 7, 11, 12, 47, 28))
How should this part be written in SQLAlchemy if I want my query to compare against the DB timestamp's year and month?
.filter_by(year=year) #MySQL - year(timestamp)
.filter_by(month=month) #MySQL- month(timestamp)
I tried .filter(Logs.timestamp == year(timestamp)) and similar variations, but no luck. Any help will be greatly appreciated.
Simply replace:
.filter_by(year=year)
.filter_by(month=month)
with:
from sqlalchemy.sql.expression import func
# ...
.filter(func.year(Logs.timestamp) == year)
.filter(func.month(Logs.timestamp) == month)
Read more on this in the SQL and Generic Functions section of the documentation.
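The same year/month filter can be checked against SQLite, where strftime('%Y', ...) and strftime('%m', ...) stand in for MySQL's year() and month() (a sketch with made-up rows; on SQLite the SQLAlchemy spelling would be func.strftime):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (model TEXT, timestamp TEXT)")
conn.executemany("INSERT INTO logs VALUES (?, ?)", [
    ("Lumia", "2015-01-01 08:33:06"),
    ("Lumia", "2015-01-15 12:00:00"),
    ("N95",   "2014-12-31 23:59:59"),  # wrong year, filtered out
])

# Extract year and month from the timestamp column -- the same shape
# that func.year(...) / func.month(...) compile to on MySQL.
rows = conn.execute(
    "SELECT model, COUNT(model) AS count FROM logs "
    "WHERE strftime('%Y', timestamp) = '2015' "
    "AND strftime('%m', timestamp) = '01' "
    "GROUP BY model ORDER BY count DESC"
).fetchall()
print(rows)  # [('Lumia', 2)]
```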
You can use custom constructs if you want functions that are specific to your database, like the year function you mention for MySQL. However, I don't use MySQL and cannot give you tested code for it (I did not even know about this function, by the way).
Here is a simple (if not very useful) example for Oracle, which is tested. I hope you can deduce yours from it fairly easily.
from sqlalchemy.sql import expression
from sqlalchemy.ext.compiler import compiles
from sqlalchemy import Date

class get_current_date(expression.FunctionElement):
    type = Date()

@compiles(get_current_date, 'oracle')
def ora_get_current_date(element, compiler, **kw):
    return "CURRENT_DATE"

session = schema_mgr.get_session()
q = session.query(sch.Tweet).filter(sch.Tweet.created_at == get_current_date())
tweets_today = pd.read_sql(q.statement, session.bind)
Needless to say, this makes your otherwise highly portable SQLAlchemy code a bit less portable.
Hope it helps.