SQLAlchemy: several counts in one query - python

I am having a hard time optimizing my SQLAlchemy queries. My SQL knowledge is very basic, and I can't find what I need in the SQLAlchemy docs.
Suppose the following very basic one-to-many relationship:
class Parent(Base):
    __tablename__ = "parents"
    id = Column(Integer, primary_key = True)
    children = relationship("Child", backref = "parent")
class Child(Base):
    __tablename__ = "children"
    id = Column(Integer, primary_key = True)
    parent_id = Column(Integer, ForeignKey("parents.id"))
    naughty = Column(Boolean)
How could I:
Query tuples of (Parent, count_of_naughty_children, count_of_all_children) for each parent?
After a fair amount of googling, I found how to query those values separately:
# The following returns tuples of (Parent, count_of_all_children):
session.query(Parent, func.count(Child.id)).outerjoin(Child, Parent.children).\
    group_by(Parent.id)
# The following returns tuples of (Parent, count_of_naughty_children):
al = aliased(Child, session.query(Child).filter_by(naughty = True).\
    subquery())
session.query(Parent, func.count(al.id)).outerjoin(al, Parent.children).\
    group_by(Parent.id)
I tried to combine them in different ways, but didn't manage to get what I want.
Query all parents which have more than 80% naughty children? Edit: naughty could be NULL.
I guess this query is going to be based on the previous one, filtering by naughty/all ratio.
Any help is appreciated.
EDIT: Thanks to Antti Haapala's help, I found a solution to the second question:
avg = func.avg(func.coalesce(Child.naughty, 0)) # coalesce() treats NULLs as 0
# avg = func.avg(Child.naughty) - if you want to ignore NULLs
session.query(Parent).join(Child, Parent.children).group_by(Parent).\
    having(avg > 0.8)
It computes the average of the children's naughty values, treating False and NULL as 0 and True as 1. Tested with a MySQL backend, but it should work on others, too.

The count() SQL aggregate function is pretty simple: it gives you the total number of non-NULL values in each group. With that in mind, we can adjust your query to give you the proper result.
print(Query([
    Parent,
    func.count(Child.id),
    func.count(case(
        [((Child.naughty == True), Child.id)], else_=literal_column("NULL"))).label("naughty")])
    .join(Parent.children).group_by(Parent)
)
Which produces the following SQL:
SELECT
    parents.id AS parents_id,
    count(children.id) AS count_1,
    count(CASE WHEN (children.naughty = 1)
        THEN children.id
        ELSE NULL END) AS naughty
FROM parents
JOIN children ON parents.id = children.parent_id
GROUP BY parents.id
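The same idea can be checked outside the ORM. The following is a minimal sketch using Python's built-in sqlite3 module with made-up rows; the data and the use of LEFT JOIN are assumptions for illustration and are not part of the answer above. It relies only on COUNT() skipping NULLs, and it uses a LEFT JOIN so that parents without children still appear with zero counts:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parents (id INTEGER PRIMARY KEY);
    CREATE TABLE children (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES parents(id),
        naughty BOOLEAN
    );
    INSERT INTO parents (id) VALUES (1), (2), (3);
    -- parent 1: two naughty children and one nice child
    -- parent 2: one child whose naughty flag is NULL
    -- parent 3: no children at all
    INSERT INTO children (id, parent_id, naughty) VALUES
        (1, 1, 1), (2, 1, 1), (3, 1, 0), (4, 2, NULL);
""")

# CASE without ELSE yields NULL, which COUNT() does not count.
rows = conn.execute("""
    SELECT parents.id,
           COUNT(children.id) AS all_children,
           COUNT(CASE WHEN children.naughty = 1 THEN children.id END)
               AS naughty_children
    FROM parents
    LEFT JOIN children ON parents.id = children.parent_id
    GROUP BY parents.id
    ORDER BY parents.id
""").fetchall()

for row in rows:
    print(row)  # (parent id, count of all children, count of naughty children)
```

With an inner JOIN, as in the generated SQL above, parent 3 would be missing from the result entirely.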

If you only need the parents for whom more than 80% of the children are naughty, on most databases you can cast naughty to an integer, take its average, and require that average to be greater than 0.8 in a HAVING clause.
Thus you get something like:
from sqlalchemy.sql.expression import cast
naughtyp = func.avg(cast(Child.naughty, Integer))
session.query(Parent, func.count(Child.id), naughtyp).join(Child)\
    .group_by(Parent.id).having(naughtyp > 0.8).all()
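How NULLs are handled changes which parents pass the 80% threshold. Here is a small sketch of the difference in plain SQL, run through Python's built-in sqlite3 module; the rows and the helper name parents_over_threshold are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE children (
        id INTEGER PRIMARY KEY, parent_id INTEGER, naughty BOOLEAN
    );
    INSERT INTO children (parent_id, naughty) VALUES
        (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 0),  -- 5 of 6 naughty
        (2, 1), (2, NULL),                               -- 1 naughty, 1 unknown
        (3, 1), (3, 0);                                  -- 1 of 2 naughty
""")

def parents_over_threshold(avg_expr):
    """Return parent ids whose average naughtiness exceeds 0.8.

    avg_expr is a fixed, trusted SQL fragment (hypothetical helper for this
    sketch); never build SQL like this from user input.
    """
    sql = (
        "SELECT parent_id FROM children "
        f"GROUP BY parent_id HAVING {avg_expr} > 0.8 ORDER BY parent_id"
    )
    return [row[0] for row in conn.execute(sql)]

# AVG() skips NULLs, so parent 2 averages 1.0 over its single known value.
ignoring_nulls = parents_over_threshold("AVG(naughty)")
# COALESCE() turns NULL into 0, so parent 2 averages 0.5 instead.
nulls_as_false = parents_over_threshold("AVG(COALESCE(naughty, 0))")

print(ignoring_nulls, nulls_as_false)
```

Which variant is right depends on whether an unknown naughty value should count against the ratio.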

Related

Filter by field on relationship in SQLAlchemy

I have a special case in which I want a Group entity whose list contains only the elements that meet some conditions.
These are the ORM classes that I have defined:
class Group(Base):
    __tablename__ = 'groups'
    id = Column(Integer, Identity(1, 1), primary_key=True)
    name = Column(String(50), nullable=False)
    elements = relationship('Element', foreign_keys='[Element.group_id]')
class Element(Base):
    __tablename__ = 'elements'
    id = Column(Integer, Identity(1, 1), primary_key=True)
    date = Column(Date, nullable=False)
    value = Column(Numeric(38, 10), nullable=False)
    group_id = Column(Integer, ForeignKey('groups.id'), nullable=False)
Now, I want to retrieve a group with all the elements of a specific date.
result = session.query(Group).filter(Group.name == 'group 1' and Element.date == '2021-05-27').all()
Sadly enough, the Group.name filter is working, but the retrieved group contains all elements, ignoring the Element.date condition.
As suggested by @van, I have tried:
query(Group).join(Element).filter(Group.name == 'group 1' and Element.date == '2021-05-27')
But I get every element again. In the logs I noticed:
SELECT groups.id AS group_id, groups.name AS groups_name, element_1.id AS element_1_id, element_1.date AS element_1_date, element_1.value AS element_1_value, element_1.group_id AS element_1_group_id
FROM groups JOIN elements ON groups.id = elements.group_id LEFT OUTER JOIN elements AS elements_1 ON groups.id = elements_1.group_id
WHERE groups.name = %(name_1)s
There, I noticed two things. First, the join is being done twice (I guess one was already done just getting groups, before join). Second and most important: the date filter doesn't appear on the query.
I'm using the mssql+pymssql driver.
OK, there seems to be a combination of a few things happening here.
First, your relationship Group.elements will basically always contain all Elements of the Group. And this is completely separate from the filter, and this is how SA is supposed to work.
You can understand your current query (session.query(Group).filter(Group.name == 'group 1' and Element.date == '2021-05-27').all()) as the following:
"Return all Group instances which contain an Element for a given date."
But when you iterate over the Group.elements, the SA will make sure to return all children. This is what you are trying to solve.
Second, as pointed out by Yeong, you cannot use Python's plain and operator to build a SQL AND clause. Use and_() instead, or chain separate filter() calls:
result = (
    session.query(Group)
    .filter(Group.name == "group 1")
    .filter(Element.date == dat1)
    .all()
)
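To see why the date condition vanished from the logged WHERE clause, recall that Python evaluates "a and b" itself and returns one of the two operands; the library never sees both. The following is a toy sketch only: Clause and this and_ are hypothetical stand-ins, not SQLAlchemy's classes, and the falsy truth value is an assumption chosen to match the log above, where only the name filter survived.

```python
class Clause:
    """Toy stand-in for a SQL expression object (not SQLAlchemy's real class)."""

    def __init__(self, text):
        self.text = text

    def __bool__(self):
        # Assumption for this sketch: a comparison object has no useful truth
        # value, so it evaluates as False when Python asks for one.
        return False

    def __repr__(self):
        return f"Clause({self.text!r})"


def and_(*clauses):
    """Toy conjunction that keeps every clause, in the spirit of and_()."""
    return Clause(" AND ".join(clause.text for clause in clauses))


name_clause = Clause("groups.name = 'group 1'")
date_clause = Clause("elements.date = '2021-05-27'")

# Python short-circuits: the left operand is falsy, so the result is the
# left operand itself and the date clause is silently discarded.
with_python_and = name_clause and date_clause

# An explicit conjunction function receives both clauses and keeps both.
with_and_ = and_(name_clause, date_clause)

print(with_python_and)
print(with_and_)
```

The design point is that the and operator cannot be overloaded in Python, which is why query libraries expose a function (or the & operator) for conjunctions.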
Third, as you later pointed out, your relationship is lazy="joined", which is why, whenever you query for Group, ALL the related Element instances are retrieved using an OUTER JOIN. This is also why adding .join(Element) to your query resulted in two JOINs.
Solution
You can "trick" SA into thinking that it loaded the whole Group.elements relationship when it only loaded the children you want, by using the orm.contains_eager() option. Your query would then look like this:
result = (
    session.query(Group)
    .join(Element)
    .filter(Group.name == "group 1")
    .filter(Element.date == dat1)
    .options(contains_eager(Group.elements))
    .all()
)
Above should work also with the lazy="joined" as the extra JOIN should not be generated anymore.
Update
If you would like to get the groups even when there are no Elements matching the criteria, you need to:
Replace join with outerjoin.
Place the filter on Element inside the outerjoin clause.
result = (
    session.query(Group)
    .filter(Group.name == "group 1")
    .outerjoin(
        Element, and_(Element.group_id == Group.id, Element.date == dat1)
    )
    .options(contains_eager(Group.elements))
    .all()
)
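The reason the date condition has to move into the join clause can be shown in plain SQL. This is a minimal sketch with Python's built-in sqlite3 module and made-up rows (table contents are assumptions for illustration): a condition in WHERE filters after the outer join and removes groups without a match, while the same condition in ON keeps them with NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "groups" (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE elements (id INTEGER PRIMARY KEY, date TEXT, group_id INTEGER);
    INSERT INTO "groups" VALUES (1, 'group 1'), (2, 'group 2');
    INSERT INTO elements VALUES
        (1, '2021-05-27', 1),
        (2, '2021-05-28', 1),
        (3, '2021-05-28', 2);
""")

# Date condition in WHERE: groups without a matching element disappear.
in_where = conn.execute("""
    SELECT g.name, e.id
    FROM "groups" AS g
    LEFT JOIN elements AS e ON e.group_id = g.id
    WHERE e.date = '2021-05-27'
    ORDER BY g.id
""").fetchall()

# Date condition in ON: every group is kept; unmatched ones get NULL.
in_on = conn.execute("""
    SELECT g.name, e.id
    FROM "groups" AS g
    LEFT JOIN elements AS e
        ON e.group_id = g.id AND e.date = '2021-05-27'
    ORDER BY g.id
""").fetchall()

print(in_where)
print(in_on)
```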
Python's and is not the same as AND in SQL. SQLAlchemy provides the and_() function for conjunctions instead, i.e.
result = session.query(Group).join(Element).filter(and_(Group.name == 'group 1', Element.date == '2021-05-27')).all()

sqlalchemy query related tables: to join or not to join (and then to loop)?

Objective:
Retrieve a list of query results from one table (call it "groups").
Retrieve lists of results from a related table (call it "items"). One results list for each of the results from step 1.
Combine the "items" lists from step 2 with the corresponding "groups" result from step 1 in a tuple that contains both the "group" data and a list of all the related "items" data.
Question:
Is it more efficient to combine steps 1 and 2 above with a join query and then loop through to sift and aggregate the results by group? Or is it more efficient to query the results for step 1, then loop to query the results for step 2 and aggregate the results?
Examples of each approach follow, hoping there is some other much better way...
Single query with loop approach (with join):
# query all "groups" in "category1" and all related "items"
results = session.query(Group.id, Group.name, Item.id, Item.name).\
    outerjoin(Item, Group.items).\
    filter(Group.category == 'category1').\
    order_by(Group.id).\
    all()
groups = list()
group_ids = {results[0][0]}  # set(results[0][0]) would fail on an integer id
current_group = results[0][:2]
current_group_items = list()
for result in results:
    # for each result, combine "group" with all related "items"
    if result[0] in group_ids:
        current_group_items.append(result[2:])
    else:
        groups.append(current_group + (current_group_items,))
        group_ids.add(result[0])
        current_group = result[:2]
        current_group_items = [result[2:]]
# append the last group, which the loop above never flushes
groups.append(current_group + (current_group_items,))
Multiple query with loop approach (no join):
# query all "groups" in "category1"
groups = session.query(Group.id, Group.name).\
    filter(Group.category == 'category1').\
    all()
results = []
for group in groups:
    # for each "group", query all related "items"
    items = session.query(Item.id, Item.name).\
        filter(Item.group_id == group[0]).\
        all()
    # append list of related "items" to "group" result
    results.append(group + (items,))
Example schema for reference:
from sqlalchemy import Column, Integer, String, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship
Base = declarative_base()
class Group(Base):
    __tablename__ = 'groups'
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False, index=True)
    category = Column(String, nullable=False, index=True)
    items = relationship('Item', back_populates='group', cascade='all')
class Item(Base):
    __tablename__ = 'items'
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False, index=True)
    group_id = Column(Integer, ForeignKey('groups.id'), nullable=False)
    group = relationship('Group', back_populates='items')
There's also a third option using a join: let SQLAlchemy eager-load the items and handle the grouping for you, since you've already established the relationship between Group and Item:
from sqlalchemy.orm import joinedload
groups = session.query(Group).\
    options(joinedload(Group.items)).\
    filter(Group.category == '...').\
    all()
You'd then access items of a group using the Group.items collection.
Generally speaking a joinedload performs better than the "1+N" queries approach in your second example, because of the latencies involved in performing queries. That's of course a generalization and at times separate queries might even win, but even in that case you could still use the relationships – the default relationship loading strategy is 'select'.
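If plain tuples are wanted rather than ORM objects, the rows of a single joined query can also be grouped with itertools.groupby instead of a hand-written loop. This is a sketch over hypothetical, hard-coded rows in the shape (Group.id, Group.name, Item.id, Item.name); it assumes the rows are already ordered by group id, as in the question's first example:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical result rows, already ordered by Group.id.
rows = [
    (1, "fruit", 10, "apple"),
    (1, "fruit", 11, "pear"),
    (2, "tools", 12, "saw"),
    (3, "empty", None, None),  # outer join: a group without items
]

groups = []
# groupby() only merges adjacent rows, hence the ordering requirement.
for key, group_rows in groupby(rows, key=itemgetter(0, 1)):
    # Skip the all-NULL item that the outer join yields for empty groups.
    items = [row[2:] for row in group_rows if row[2] is not None]
    groups.append(key + (items,))

for group in groups:
    print(group)
```

Unlike the manual loop, this needs no special handling for the first or last group.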

How to perform a natural join on two tables using SQLAlchemy and Flask?

I have two tables, Entry and Group, defined in Python using Flask-SQLAlchemy and connected to a PostgreSQL database:
class Entry (db.Model):
    __tablename__ = "entry"
    id = db.Column('id', db.Integer, primary_key = True)
    group_title = db.Column('group_title', db.Unicode, db.ForeignKey('group.group_title'))
    url = db.Column('url', db.Unicode)
    next_entry_id = db.Column('next_entry_id', db.Integer, db.ForeignKey('entry.id'))
    entry = db.relationship('Entry', foreign_keys=next_entry_id)
    group = db.relationship('Group', foreign_keys=group_title)
class Group (db.Model):
    __tablename__ = "group"
    group_title = db.Column('group_title', db.Unicode, primary_key = True)
    group_start_id = db.Column('group_start_id', db.Integer)
    #etc.
I am trying to combine the two tables with a natural join using the Entry.id and Group.group_start_id as the common field.
I have been able to query a single table for all records. But I want to join tables by foreign key ID to give records relating Group.group_start_id and Group.group_title to a specific Entry record.
I am having trouble with the Flask-SQLAlchemy query syntax or process.
I have tried several approaches (to list a few):
db.session.query(Group, Entry.id).all()
db.session.query(Entry, Group).all()
db.session.query.join(Group).first()
db.session.query(Entry, Group).join(Group)
All of them have returned a list of tuples that is bigger than expected and does not contain what I want.
I am looking for the following result:
(Entry.id, group_title, Group.group_start_id, Entry.url)
I would be grateful for any help.
I used the following query to perform a natural join of the Group and Entry tables:
db.session.query(Entry, Group).join(Group).filter(Group.group_start_id == Entry.id).order_by(Group.order.asc())
The .join() call joins the Group table to the Entry table. I then filter the results on Group.group_start_id, which is a foreign key in the Group table referring to Entry.id, the primary key of the Entry table.
Since the relationship() calls already describe how the two tables are linked, we can focus on getting the data you want. A query such as db.session.query(Entry, Group).all() returns tuples of (Entry, Group), from which you can easily do something like:
test = db.session.query(Entry, Group).one()
print(test[0].id) #prints entry.id
print(test[1].group_start_id) # prints Group.group_start_id
#...
SQLAlchemy has a great article on how joins work.
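The "bigger than expected" result lists in the question are what a query over two tables returns when no join condition restricts it. Here is a minimal sketch of the difference in plain SQL, using Python's built-in sqlite3 module; the rows and the example.com URLs are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "group" (
        group_title TEXT PRIMARY KEY, group_start_id INTEGER
    );
    CREATE TABLE entry (
        id INTEGER PRIMARY KEY,
        group_title TEXT REFERENCES "group"(group_title),
        url TEXT
    );
    INSERT INTO "group" VALUES ('a', 1), ('b', 3);
    INSERT INTO entry VALUES
        (1, 'a', 'http://example.com/1'),
        (2, 'a', 'http://example.com/2'),
        (3, 'b', 'http://example.com/3');
""")

# No join condition: every entry is paired with every group (3 x 2 rows).
cross = conn.execute(
    'SELECT entry.id, "group".group_title FROM entry, "group"'
).fetchall()

# Join on the foreign key, then keep only the entry that starts each group.
joined = conn.execute("""
    SELECT entry.id, entry.group_title, "group".group_start_id, entry.url
    FROM entry
    JOIN "group" ON entry.group_title = "group".group_title
    WHERE "group".group_start_id = entry.id
    ORDER BY entry.id
""").fetchall()

print(len(cross))
for row in joined:
    print(row)
```

The second query has the column shape asked for in the question: (Entry.id, group_title, Group.group_start_id, Entry.url).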

SQLAlchemy: Multiple tables and joins, duplicate rows

[I feel that this is maybe / for sure (?) a duplicate, however, I've searched all day long to find a solution for this and it seems I can't get it working the way I would like to by myself.]
In MySQL, I've got three tables, named ecordov (A), ecordovadr (B) and ecrgposvk (C).
I've got one key linking all these; in (A) there is one row per key, in (B) and (C) there might be multiple rows per key, so, without being an expert in these questions, I think these are one-to-many relations.
I've read the SQLAlchemy docs and set up my tables like this:
class Ecordov(Base):
    __tablename__ = 'ecordov'
    oovkey = Column(BIGINT, primary_key=True)
    oovorder = Column(BIGINT)
    ecordovadr = relationship('Ecordovadr')
    ecrgposvk = relationship('Ecrgposvk')
class Ecordovadr(Base):
    __tablename__ = 'ecordovadr'
    ooakey = Column(BIGINT, primary_key=True)
    ooaname1 = Column(VARCHAR)
    ooaorder = Column(BIGINT, ForeignKey('ecordov.oovorder'))
class Ecrgposvk(Base):
    __tablename__ = 'ecrgposvk'
    rgkey = Column(BIGINT, primary_key=True)
    rgposvalue = Column(DOUBLE)
    rgposordnum = Column(BIGINT, ForeignKey('ecordov.oovorder'))
[So, as you see, the ForeignKeys aren't the primary_key(s), not really sure if this is a problem? However, I can't change the structure of the database.]
My sample query looks like:
jobs = session.query(Ecordov, func.group_concat(Ecordovadr.ooaname1.op('ORDER BY')(text('ecordovadr.ooatype, ecordovadr.ooarank separator "{}"'))).label('ooaname1')).outerjoin(Ecordovadr).filter(Ecordov.oovorder.like('75289')).group_by(Ecordov.oovorder)
gets evaluated to:
SELECT ecordov.oovkey AS ecordov_oovkey, ecordov.oovorder AS ecordov_oovorder, group_concat(ecordovadr.ooaname1 ORDER BY ecordovadr.ooatype, ecordovadr.ooarank separator "{}") AS ooaname1
FROM ecordov LEFT OUTER JOIN ecordovadr ON ecordov.oovorder = ecordovadr.ooaorder
WHERE ecordov.oovorder LIKE '75289'
GROUP BY ecordov.oovorder
and gives me the following:
for x in jobs:
x.ooaname1
u'Sorbe priv.{}Lebensn\xe4he GmbH'
which is my desired outcome.
However, after joining the second table as well, regardless if using an inner- or outerjoin, via, for example, this:
jobs = session.query(Ecordov, func.group_concat(Ecordovadr.ooaname1.op('ORDER BY')(text('ecordovadr.ooatype, ecordovadr.ooarank separator "{}"'))).label('ooaname1')).outerjoin(Ecordovadr).outerjoin(Ecrgposvk).filter(Ecordov.oovorder.like('75289')).group_by(Ecordov.oovorder)
which gets evaluated to:
SELECT ecordov.oovkey AS ecordov_oovkey, ecordov.oovorder AS ecordov_oovorder, group_concat(ecordovadr.ooaname1 ORDER BY ecordovadr.ooatype, ecordovadr.ooarank separator "{}") AS ooaname1
FROM ecordov LEFT OUTER JOIN ecordovadr ON ecordov.oovorder = ecordovadr.ooaorder LEFT OUTER JOIN ecrgposvk ON ecordov.oovorder = ecrgposvk.rgposordnum
WHERE ecordov.oovorder LIKE '75289'
GROUP BY ecordov.oovorder
gives me:
for x in jobs:
x.ooaname1
u'Sorbe priv.{}Sorbe priv.{}Sorbe priv.{}Lebensn\xe4he GmbH{}Lebensn\xe4he GmbH{}Lebensn\xe4he GmbH'
So, the data is tripled now. I've read in other threads about this topic that this is to be expected, especially when using the same ForeignKey for multiple tables.
But I need this data "like before", which means just one entry, instead of three. I've tried using distinct() but without success so far.
Could someone please point me in the right direction on how to fix this?
Thanks in advance and all the best!
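The tripling described above can be reproduced in miniature. The sketch below uses Python's built-in sqlite3 module with simplified, made-up tables a, b and c standing in for ecordov, ecordovadr and ecrgposvk. It also shows one common way to avoid the fan-out, namely aggregating one child table in a subquery before joining the other; this is offered as an assumption about what might help, not as a fix confirmed in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (order_no INTEGER);
    CREATE TABLE b (order_no INTEGER, name TEXT);
    CREATE TABLE c (order_no INTEGER, value REAL);
    INSERT INTO a VALUES (75289);
    INSERT INTO b VALUES (75289, 'first'), (75289, 'second');
    INSERT INTO c VALUES (75289, 1.0), (75289, 2.0), (75289, 3.0);
""")

# Joining both child tables directly multiplies rows: 2 names x 3 values.
fanned_out = conn.execute("""
    SELECT group_concat(b.name, '{}')
    FROM a
    LEFT JOIN b ON a.order_no = b.order_no
    LEFT JOIN c ON a.order_no = c.order_no
    GROUP BY a.order_no
""").fetchone()[0]

# Aggregating b on its own first leaves one row per order before c is joined.
pre_aggregated = conn.execute("""
    SELECT MAX(b_agg.names), SUM(c.value)
    FROM a
    LEFT JOIN (
        SELECT order_no, group_concat(name, '{}') AS names
        FROM b
        GROUP BY order_no
    ) AS b_agg ON a.order_no = b_agg.order_no
    LEFT JOIN c ON a.order_no = c.order_no
    GROUP BY a.order_no
""").fetchone()

print(fanned_out)
print(pre_aggregated)
```

Each name appears three times in the first result because it is paired with each of the three rows in c; in the second query the names are concatenated before that pairing happens.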

SQLalchemy: quantiles for all permutations of column value combinations

We have a sql server query in which we need to generate ntiles for increasingly large numbers of variables, such that the variables are combined with each other in their various permutations. Here's an excerpt exemplifying what I mean:
statement 1:
ntile(10) over (partition by MAUorALL, User_Type, fsi.Month_ID
order by Objects_Created) AS Ntile_Mon_Objects_Created,
statement 2:
ntile(10) over (partition by MAUorALL, User_Type, fsi.Month_ID, *Country*
order by Objects_Created) AS Ntile_Country_Objects_Created
statement 3:
ntile(10) over (partition by MAUorALL, User_Type, fsi.Month_ID, *User_Type*
order by Objects_Created) AS Ntile_UT_Objects_Created
You can see that the statements are the same except that in the second and third the emphasized columns (Country and User_Type) have been added to the partition. So we take ntiles for the same variable, Objects_Created, at different levels of specificity, and we also have to take ntiles for the various possible permutations of these variables, e.g.:
statement 4:
ntile(10) over (partition by MAUorALL, User_Type, fsi.Month_ID, *Country, User_Type*
order by Objects_Created) AS Ntile_Country_UT_Objects_Created
We can manually code these permutations to a point, but if we could use sqlalchemy to execute all the permutations of these variables it might make things easier. Does anyone have an example I could re-purpose?
Thanks for your help!
I have no idea how fsi is related to other columns, but assuming all data is in one model (which is easy to extend with sqlalchemy query) like below:
class User(Base):
    __tablename__ = 't_users'
    id = Column(Integer, primary_key=True)
    MAUorALL = Column(String)
    User_Type = Column(String)
    Country = Column(String)
    Month_ID = Column(Integer)
    Objects_Created = Column(Integer)
the task can be accomplished with itertools.permutations (or itertools.combinations, depending on what you want to achieve) to build the query. The code below generates a query for the User table with the various ntiles. I assume reading the code suffices for understanding what is happening:
from itertools import permutations  # or combinations
from sqlalchemy import func, over
# configuration: {label: Column}
column_labels = {
    'Country': User.Country,
    'UT': User.User_Type,
}
def get_ntile(additional_columns=None):
    """ @return: sqlalchemy expression for selecting a given ntile() using
    predefined as well as *additional* columns.
    """
    partition_by = [
        User.MAUorALL,
        User.User_Type,
        User.Month_ID,
    ]
    label = "Ntile_Objects_Created"
    if additional_columns:
        lbls = []
        for col_name in additional_columns:
            col = column_labels[col_name]
            partition_by.append(col)
            lbls.append(col_name)
        label = "Ntile_{}_Objects_Created".format("_".join(lbls))
    xprs = over(
        func.ntile(10),
        partition_by = partition_by,
        order_by = User.Objects_Created,
    ).label(label)
    return xprs
def get_query(additional_columns=('UT', 'Country')):
    """ @return: a query object which selects a User with additional ntiles
    for predefined columns (fixed) and all possible permutations of
    *additional_columns*
    """
    tiles = [get_ntile(comb)
             for r in range(len(additional_columns) + 1)
             for comb in permutations(additional_columns, r)
             ]
    q = session.query(User, *tiles)
    return q
q = get_query()
print([_c["name"] for _c in q.column_descriptions])
# >>> ['User', 'Ntile_Objects_Created', 'Ntile_UT_Objects_Created', 'Ntile_Country_Objects_Created', 'Ntile_UT_Country_Objects_Created', 'Ntile_Country_UT_Objects_Created']
for tile in q.all():
    print(tile)
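Whether permutations or combinations is the better fit can be checked without a database. Because the order of columns in PARTITION BY does not change which rows fall into a partition, permutations produce windows that differ only in their label. The sketch below compares the labels each chooser would generate; the helper name labels is hypothetical and the label format simply mirrors the code above:

```python
from itertools import combinations, permutations

additional_columns = ["UT", "Country"]

def labels(chooser):
    """Build the ntile labels that one itertools chooser would generate."""
    result = []
    for r in range(len(additional_columns) + 1):
        for cols in chooser(additional_columns, r):
            if cols:
                result.append("Ntile_{}_Objects_Created".format("_".join(cols)))
            else:
                # r == 0 yields one empty tuple: the base partition only.
                result.append("Ntile_Objects_Created")
    return result

perm_labels = labels(permutations)
comb_labels = labels(combinations)

print(perm_labels)
print(comb_labels)
```

With two additional columns, permutations yield five windows and combinations four; the extra one (Country, UT) partitions the rows exactly as (UT, Country) does, so combinations avoid computing the same ntile twice.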
