Let's say I have a table with a column that holds some integer values, and I want to calculate the percentage of values that are over 200 for that column.
Here's the kicker: I would prefer to do it inside one query that I could use group_by on.
results = db.session.query(
    ClassA.some_variable,
    label('entries', func.count(ClassA.some_variable)),
    label('percent', *no clue*)
).filter(ClassA.value.isnot(None)).group_by(ClassA.some_variable)
Alternatively, it would be okay, though not preferred, to do the percentage calculation on the client side, something like this.
results = db.session.query(
    ClassA.some_variable,
    label('entries', func.count(ClassA.some_variable)),
    label('total_count', func.count(ClassA.value)),
    label('over_200_count', func.count(ClassA.value > 200)),
).filter(ClassA.value.isnot(None)).group_by(ClassA.some_variable)
But I obviously can't filter within the count statement, and I can't apply the filter at the end of the query, since if I apply the > 200 constraint at the end, total_count wouldn't work.
Using raw SQL is an option too; it doesn't have to be SQLAlchemy.
MariaDB unfortunately does not support the aggregate FILTER clause, but you can work around that using a CASE expression or NULLIF, since COUNT returns the count of non-null values of the given expression:
from sqlalchemy import case
...
func.count(case([(ClassA.value > 200, 1)])).label('over_200_count')
With that in mind you can calculate the percentage simply as
(func.count(case([(ClassA.value > 200, 1)])) * 1.0 /
func.count(ClassA.value)).label('percent')
though there's one edge case: what if func.count(ClassA.value) is 0? Depending on whether you'd consider 0 or NULL a valid return value, you could use either yet another CASE expression or NULLIF:
dividend = func.count(case([(ClassA.value > 200, 1)])) * 1.0
divisor = func.count(ClassA.value)
# Zero
case([(divisor == 0, 0)],
     else_=dividend / divisor).label('percent')
# NULL
(dividend / func.nullif(divisor, 0)).label('percent')
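Putting those pieces together, a sketch of the whole grouped query from the question could look like this (using the NULLIF variant; untested, and assuming the ClassA model from the question):
from sqlalchemy import case, func

# Numerator: rows with value > 200; denominator: all non-NULL values,
# with NULLIF guarding against division by zero (percent becomes NULL then).
dividend = func.count(case([(ClassA.value > 200, 1)])) * 1.0
divisor = func.nullif(func.count(ClassA.value), 0)

results = db.session.query(
    ClassA.some_variable,
    func.count(ClassA.some_variable).label('entries'),
    (dividend / divisor).label('percent'),
).filter(ClassA.value.isnot(None)).group_by(ClassA.some_variable)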
Finally, you could create a compilation extension for mysql dialect that rewrites a FILTER clause to a suitable CASE expression:
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.sql.expression import FunctionFilter
from sqlalchemy.sql.functions import Function
from sqlalchemy import case

@compiles(FunctionFilter, 'mysql')
def compile_functionfilter_mysql(element, compiler, **kwgs):
    # Support unary functions only
    arg0, = element.func.clauses
    new_func = Function(
        element.func.name,
        case([(element.criterion, arg0)]),
        packagenames=element.func.packagenames,
        type_=element.func.type,
        bind=element.func._bind)
    return new_func._compiler_dispatch(compiler, **kwgs)
With that in place you could express the dividend as
dividend = func.count(1).filter(ClassA.value > 200) * 1.0
which compiles to
In [28]: print(dividend.compile(dialect=mysql.dialect()))
count(CASE WHEN (class_a.value > %s) THEN %s END) * %s
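With the hook registered, the percentage expression can be written in the portable FILTER style and will be rendered as the CASE rewrite on MySQL/MariaDB; a sketch:
# Same NULLIF-guarded percentage as above, expressed with FILTER
percent = (func.count(1).filter(ClassA.value > 200) * 1.0 /
           func.nullif(func.count(ClassA.value), 0)).label('percent')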
I have a table in a database that looks like this:
phase_type phase_start phase_end
Obsolete 01/01/2021 02/02/2022
Obsolete 01/03/2021 02/07/2022
Obsolete 05/01/2021 09/02/2022
Available 05/07/2021 09/02/2027
Available 05/07/2023 09/02/2025
Available 05/07/2024 09/02/2029
If I want to select from this table and return only the rows where today's date lies within the 30 days leading up to phase_end, I could do it like this:
from datetime import date, timedelta
from sqlalchemy import select

past = my_table.phase_end - timedelta(30)
future = my_table.phase_end
query = select(my_table).where(date.today() >= past, date.today() <= future)
session.exec(query).fetchall()
However, I would like to use phase_start when calculating past and future in the case when phase_type is Obsolete; for all other cases I would like to use phase_end as above. Thus the range should be calculated based on the value that phase_type takes. How can I do this and return all rows that pass the conditions?
I am not sure that I understand your problem correctly, but if you want to return rows that meet the specified conditions, you can use a case statement in the where clause of your select query.
from datetime import date, timedelta
from sqlalchemy import case, select

past = case([(my_table.phase_type == 'Obsolete', my_table.phase_start)],
            else_=my_table.phase_end) - timedelta(30)
future = case([(my_table.phase_type == 'Obsolete', my_table.phase_start)],
              else_=my_table.phase_end)
query = select(my_table).where(date.today() >= past, date.today() <= future)
session.exec(query).fetchall()
This query will return all rows from my_table where date.today() falls within the 30 days leading up to either phase_start (if phase_type is Obsolete) or phase_end (for all other cases).
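As a side note, the case([...]) list form above is SQLAlchemy 1.3 style; on SQLAlchemy 1.4/2.0 (which SQLModel builds on) the when-tuples are passed positionally instead. A minimal sketch of the same idea in that style:
from datetime import date, timedelta
from sqlalchemy import case, select

# 1.4/2.0 style: when-tuples are positional arguments, not a list
anchor = case(
    (my_table.phase_type == 'Obsolete', my_table.phase_start),
    else_=my_table.phase_end,
)
query = select(my_table).where(date.today() >= anchor - timedelta(30),
                               date.today() <= anchor)
session.exec(query).fetchall()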
I have a booking system and I save the booked daterange in a DATERANGE column:
booked_date = Column(DATERANGE(), nullable=False)
I already know that I can access the actual dates with booked_date.lower or booked_date.upper
For example I do this here:
for bdate in room.RoomObject_addresses_UserBooksRoom:
    unaviable_ranges['ranges'].append([str(bdate.booked_date.lower),
                                       str(bdate.booked_date.upper)])
Now I need to filter my bookings by a given daterange. For example I want to see all bookings between 01.01.2018 and 10.01.2018.
Usually it's simple, because dates can be compared like this: date <= other date
But if I do it with the DATERANGE:
the_daterange_lower = datetime.strptime(the_daterange[0], '%d.%m.%Y')
the_daterange_upper = datetime.strptime(the_daterange[1], '%d.%m.%Y')
bookings = UserBooks.query.filter(
    UserBooks.booked_date.lower >= the_daterange_lower,
    UserBooks.booked_date.upper <= the_daterange_upper).all()
I get an error:
AttributeError: Neither 'InstrumentedAttribute' object nor 'Comparator' object associated with UserBooks.booked_date has an attribute 'lower'
EDIT
I found a sheet with useful range operators, and it looks like there are better options to do what I want, but for this I somehow need to create a range variable, and Python can't do this. So I am still confused.
In my database my daterange column entries look like this:
[2018-11-26,2018-11-28)
EDIT
I am trying to use native SQL and not SQLAlchemy, but I don't understand how to create a daterange object.
bookings = db_session.execute('SELECT * FROM usersbookrooms WHERE booked_date && [' + str(the_daterange_lower) + ',' + str(the_daterange_upper) + ')')
The query
the_daterange_lower = datetime.strptime(the_daterange[0], '%d.%m.%Y')
the_daterange_upper = datetime.strptime(the_daterange[1], '%d.%m.%Y')
bookings = UserBooks.query.\
    filter(UserBooks.booked_date.lower >= the_daterange_lower,
           UserBooks.booked_date.upper <= the_daterange_upper).\
    all()
could be implemented using the "range is contained by" operator <@. In order to pass the right operand you have to create an instance of psycopg2.extras.DateRange, which represents a PostgreSQL daterange value in Python:
from psycopg2.extras import DateRange

the_daterange_lower = datetime.strptime(the_daterange[0], '%d.%m.%Y').date()
the_daterange_upper = datetime.strptime(the_daterange[1], '%d.%m.%Y').date()
the_daterange = DateRange(the_daterange_lower, the_daterange_upper)
bookings = UserBooks.query.\
    filter(UserBooks.booked_date.contained_by(the_daterange)).\
    all()
Note that the attributes lower and upper are part of the psycopg2.extras.Range types. The SQLAlchemy range column types do not provide them, as your error states.
If you want to use raw SQL and pass date ranges, you can use the same DateRange objects to pass values as well:
bookings = db_session.execute(
    'SELECT * FROM usersbookrooms WHERE booked_date && %s',
    (DateRange(the_daterange_lower, the_daterange_upper),))
You can also build literals manually, if you want to:
bookings = db_session.execute(
    'SELECT * FROM usersbookrooms WHERE booked_date && %s::daterange',
    (f'[{the_daterange_lower}, {the_daterange_upper})',))
The trick is to build the literal in Python and pass it as a single value, using placeholders as always. This should avoid any SQL injection possibilities; the only thing that can happen is that the literal has invalid syntax for a daterange. Alternatively you can pass the bounds to a range constructor:
bookings = db_session.execute(
    'SELECT * FROM usersbookrooms WHERE booked_date && daterange(%s, %s)',
    (the_daterange_lower, the_daterange_upper))
All in all it is easier to just use the Psycopg2 Range types and let them handle the details.
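For completeness, the same overlap test can also stay in the ORM: the SQLAlchemy range comparators expose an overlaps() method, so a sketch along these lines should be equivalent to the && query (assuming the UserBooks model from the question):
from psycopg2.extras import DateRange

wanted = DateRange(the_daterange_lower, the_daterange_upper)

# booked_date && :wanted  -- bookings whose range overlaps the requested period
bookings = UserBooks.query.filter(
    UserBooks.booked_date.overlaps(wanted)).all()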
I have a large dataset with 50M+ records in a PostgreSQL database that requires massive calculations and inner joins.
Python is the tool of choice with Psycopg2.
Running the process with fetchmany of 20,000 records takes a couple of hours to finish.
The execution needs to take place sequentially, as in each record of the 50M needs to be fetched separately, then another query (in the below example) needs to run before a result is returned and saved in a separate table.
Indexes are properly configured on each table (5 tables in total), and the complex query (that returns a calculated value - example below) takes around 240 ms to return results (when the database is not under load).
Celery is used to take care of database inserts of the calculated values in a separate table.
My question is about common strategies to reduce overall running time and produce results/calculations faster.
In other words, what is an effective way to go through all the records, one by one, calculate the value of a field via a second query then save the result.
UPDATE:
There is an important piece of information that I unintentionally missed mentioning while trying to obfuscate sensitive details. Sorry for that.
The original SELECT query calculates a value aggregated from different tables as follows:
SELECT CR.gg, (AX.b + BF.f)/CR.d AS calculated_field
FROM table_one CR
LEFT JOIN table_two AX ON AX.x = CR.x
LEFT JOIN table_three BF ON BF.x = CR.x
WHERE CR.gg = '123'
GROUP BY CR.gg;
PS: the SQL query is written by our experienced DBA, so I trust that it is optimised.
Don't loop over records and call the DBMS repeatedly for every record.
Instead, let the DBMS process large chunks (preferably: all) of the data,
and let it spit out all the results.
Below is a snippet of my twitter-sucker (with a rather complex, ugly query):
def fetch_referred_tweets(self):
    self.curs = self.conn.cursor()
    tups = ()
    selrefd = """SELECT twx.id, twx.in_reply_to_id, twx.seq, twx.created_at
        FROM (
            SELECT tw1.id, tw1.in_reply_to_id, tw1.seq, tw1.created_at
            FROM tt_tweets tw1
            WHERE 1=1
            AND tw1.in_reply_to_id > 0
            AND tw1.is_retweet = False
            AND tw1.did_resolve = False
            AND NOT EXISTS ( SELECT * FROM tweets nx
                WHERE nx.id = tw1.in_reply_to_id)
            AND NOT EXISTS ( SELECT * FROM tt_tweets nx
                WHERE nx.id = tw1.in_reply_to_id)
            UNION ALL
            SELECT tw2.id, tw2.in_reply_to_id, tw2.seq, tw2.created_at
            FROM tweets tw2
            WHERE 1=1
            AND tw2.in_reply_to_id > 0
            AND tw2.is_retweet = False
            AND tw2.did_resolve = False
            AND NOT EXISTS ( SELECT * FROM tweets nx
                WHERE nx.id = tw2.in_reply_to_id)
            AND NOT EXISTS ( SELECT * FROM tt_tweets nx
                WHERE nx.id = tw2.in_reply_to_id)
            -- ORDER BY tw2.created_at DESC
        ) twx
        LIMIT %s;"""
    # -- AND tw.created_at < now() - '15 min':: interval
    # -- AND tw.created_at >= now() - '72 hour':: interval
    count = 0
    uniqs = 0
    self.curs.execute(selrefd, (quotum_referred_tweets, ))
    tups = self.curs.fetchmany(quotum_referred_tweets)
    for tup in tups:
        if tup is None:
            break
        print('%d -->> %d [seq=%d] datum=%s' % tup)
        self.resolve_list.append(tup[0])  # this tweet
        if tup[1] not in self.refetch_tweets:
            self.refetch_tweets[tup[1]] = [tup[0]]  # referred tweet
            uniqs += 1
        count += 1
    self.curs.close()
Note: your query makes no sense:
you only select fields from the er table
so, the two LEFT JOINed tables could be omitted
if the joined tables do contain multiple matching rows, the result set could be larger than just all the rows selected from er, resulting in duplicated er records
there is a GROUP BY present, but no aggregates are in the select list
select er.gg, er.z, er.y
from table_one er
where er.gg = '123'
-- or:
where er.gg >= '123'
and er.gg <= '456'
ORDER BY er.gg, er.z, er.y -- Or: some other ordering
;
Since you are doing a join in your query, the logical thing to do is to work around it, meaning create what's known as a summary table. This summary table, residing in the database, will hold the final joined dataset, so in your Python code you will just fetch/select data from it.
Another way is to use a materialized view.
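A minimal sketch of the materialized-view route, with assumed names based on the query in the question (calculated_summary is hypothetical):
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection details
with conn, conn.cursor() as cur:
    # Persist the joined/calculated result inside the database so the
    # Python side only has to read it back.
    cur.execute("""
        CREATE MATERIALIZED VIEW calculated_summary AS
        SELECT CR.gg, (AX.b + BF.f) / CR.d AS calculated_field
        FROM table_one CR
        LEFT JOIN table_two AX ON AX.x = CR.x
        LEFT JOIN table_three BF ON BF.x = CR.x
    """)
    # Re-run whenever the underlying tables change
    cur.execute("REFRESH MATERIALIZED VIEW calculated_summary")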
I took @wildplasser's advice and moved the calculation operation inside the database as a function.
The result has been impressively efficient, to say the least, and the total run time dropped from hours to somewhere between minutes and about an hour.
To recap:
Database records are no longer fetched one by one in the sequence mentioned earlier.
Calculations happen inside the database via a PostgreSQL function.
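For reference, a rough sketch of what such a function-based approach could look like; the function name, the text type, and the target table are assumptions, and the formula is the one from the question's update:
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection details
with conn, conn.cursor() as cur:
    # A scalar SQL function that does the per-gg calculation entirely
    # inside the database (calc_field is a hypothetical name).
    cur.execute("""
        CREATE OR REPLACE FUNCTION calc_field(p_gg text)
        RETURNS numeric
        LANGUAGE sql STABLE AS $$
            SELECT (AX.b + BF.f) / CR.d
            FROM table_one CR
            LEFT JOIN table_two AX ON AX.x = CR.x
            LEFT JOIN table_three BF ON BF.x = CR.x
            WHERE CR.gg = p_gg
        $$;
    """)
    # One set-based pass instead of 50M client round trips;
    # calculated_results is a hypothetical target table.
    cur.execute("""
        INSERT INTO calculated_results (gg, calculated_field)
        SELECT DISTINCT gg, calc_field(gg) FROM table_one
    """)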
Is it possible to calculate the cumulative (running) sum using Django's ORM? Consider the following model:
class AModel(models.Model):
    a_number = models.IntegerField()
with a set of data where a_number = 1, such that I have a number (> 1) of AModel instances in the database, all with a_number = 1. I'd like to be able to return the following:
AModel.objects.annotate(cumsum=??).values('id', 'cumsum').order_by('id')
>>> ({id: 1, cumsum: 1}, {id: 2, cumsum: 2}, ... {id: N, cumsum: N})
Ideally I'd like to be able to limit/filter the cumulative sum. So in the above case I'd like to limit the result to cumsum <= 2
I believe that in postgresql one can achieve a cumulative sum using window functions. How is this translated to the ORM?
For reference, starting with Django 2.0 it is possible to use the Window function to achieve this result:
from django.db.models import F, Sum, Window

AModel.objects.annotate(cumsum=Window(Sum('a_number'), order_by=F('id').asc()))\
    .values('id', 'cumsum').order_by('id', 'cumsum')
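As a follow-up to the cumsum <= 2 part of the question: filtering directly on a window annotation requires Django 4.2 or newer (earlier versions raise NotSupportedError), where the ORM wraps the query for you. A sketch:
from django.db.models import F, Sum, Window

qs = (AModel.objects
      .annotate(cumsum=Window(Sum('a_number'), order_by=F('id').asc()))
      .filter(cumsum__lte=2)   # supported since Django 4.2
      .values('id', 'cumsum')
      .order_by('id'))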
From Dima Kudosh's answer and based on https://stackoverflow.com/a/5700744/2240489 I had to do the following:
I removed the reference to PARTITION BY in the SQL and replaced it with ORDER BY, resulting in:
AModel.objects.annotate(
    cumsum=Func(
        Sum('a_number'),
        template='%(expressions)s OVER (ORDER BY %(order_by)s)',
        order_by="id"
    )
).values('id', 'cumsum').order_by('id', 'cumsum')
This gives the following sql:
SELECT "amodel"."id",
SUM("amodel"."a_number")
OVER (ORDER BY id) AS "cumsum"
FROM "amodel"
GROUP BY "amodel"."id"
ORDER BY "amodel"."id" ASC, "cumsum" ASC
Dima Kudosh's answer was not summing the results, but the above does.
For posterity, I found this to be a good solution for me. I didn't need the result to be a QuerySet, so I could afford to do this, since I was just going to plot the data using D3.js:
import numpy as np
import datetime

today = datetime.date.today()
raw_data = MyModel.objects.filter(date=today).values_list('a_number', flat=True)
cumsum = np.cumsum(raw_data)
You can try to do this with a Func expression.
from django.db.models import Func, Sum

AModel.objects.annotate(
    cumsum=Func(
        Sum('a_number'),
        template='%(expressions)s OVER (PARTITION BY %(partition_by)s)',
        partition_by='id'
    )
).values('id', 'cumsum').order_by('id')
Check this
AModel.objects.order_by("id").extra(select={"cumsum":'SELECT SUM(m.a_number) FROM table_name m WHERE m.id <= table_name.id'}).values('id', 'cumsum')
where table_name should be the name of table in database.
I have a model which has an IntegerField named vote_threshold.
I need to get the total SUM of the vote_threshold values, regardless of whether they are negative.
vote_threshold
100
-200
-5
result = 305
Right now I am doing it like this.
earning = 0
result = Vote.objects.all().values('vote_threshold')
for v in result:
    if v['vote_threshold'] > 0:
        earning += v['vote_threshold']
    else:
        earning -= v['vote_threshold']
What is a faster and more proper way?
Use the Abs function in Django:
from django.db.models.functions import Abs
from django.db.models import Sum
<YourModel>.objects.aggregate(s=Sum(Abs("vote_threshold")))
try this:
objects = Vote.objects.extra(select={'abs_vote_threshold': 'abs(vote_threshold)'}).values('abs_vote_threshold')
earning = sum([obj['abs_vote_threshold'] for obj in objects])
I don't think there is an easy way to do the calculation using the Django ORM. Unless you have performance issues, there is nothing wrong with doing the calculation in Python. You can simplify your code slightly by using sum() and abs().
votes = Vote.objects.all()
earning = sum(abs(v.vote_threshold) for v in votes)
If performance is an issue, you can use raw SQL.
from django.db import connection
cursor = connection.cursor()
cursor.execute("SELECT sum(abs(vote_theshold)) from vote")
row = cursor.fetchone()
earning = row[0]
Here is one example, if you want to sum negative and positive values in one query:
select = {'positive': 'sum(if(value>0, value, 0))',
'negative': 'sum(if(value<0, value, 0))'}
summary = items.filter(query).extra(select=select).values('positive', 'negative')[0]
positive, negative = summary['positive'], summary['negative']
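The same split can also be written without extra(), using conditional aggregation with Case/When, which is portable across backends (a sketch; it assumes the value field and queryset from the snippet above):
from django.db.models import Case, F, IntegerField, Sum, Value, When

summary = items.filter(query).aggregate(
    positive=Sum(Case(When(value__gt=0, then=F('value')),
                      default=Value(0), output_field=IntegerField())),
    negative=Sum(Case(When(value__lt=0, then=F('value')),
                      default=Value(0), output_field=IntegerField())),
)
positive, negative = summary['positive'], summary['negative']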