SQLAlchemy: quantiles for all permutations of column value combinations - python

We have a SQL Server query in which we need to generate ntiles for an increasingly large number of variables, such that the variables are combined with each other in their various permutations. Here's an excerpt exemplifying what I mean:
statement 1:
ntile(10) over (partition by MAUorALL, User_Type, fsi.Month_ID
order by Objects_Created) AS Ntile_Mon_Objects_Created,
statement 2:
ntile(10) over (partition by MAUorALL, User_Type, fsi.Month_ID, *Country*
order by Objects_Created) AS Ntile_Country_Objects_Created
statement 3:
ntile(10) over (partition by MAUorALL, User_Type, fsi.Month_ID, *User_Type*
order by Objects_Created) AS Ntile_UT_Objects_Created
You can see that the statements are the same except that in the second and third ones the italicized columns "Country" and "User_Type" have been added. So we take ntiles for the same variable "Objects_Created" at different levels of specificity, and we also have to take ntiles for the various possible permutations of these variables, e.g.:
statement 4:
ntile(10) over (partition by MAUorALL, User_Type, fsi.Month_ID, *Country, User_Type*
order by Objects_Created) AS Ntile_Country_UT_Objects_Created
We can manually code these permutations up to a point, but if we could use SQLAlchemy to generate all the permutations of these variables, it might make things easier. Does anyone have an example I could re-purpose?
Thanks for your help!

I have no idea how fsi is related to other columns, but assuming all data is in one model (which is easy to extend with sqlalchemy query) like below:
class User(Base):
    __tablename__ = 't_users'
    id = Column(Integer, primary_key=True)
    MAUorALL = Column(String)
    User_Type = Column(String)
    Country = Column(String)
    Month_ID = Column(Integer)
    Objects_Created = Column(Integer)
the task is accomplished by a simple use of itertools.permutations (or itertools.combinations, depending on what you want to achieve) to build the query. The code below generates a query for the User table with various ntiles for it. I assume reading the code suffices for understanding what is happening:
from sqlalchemy import func, over
# configuration: {label: Column}
column_labels = {
    'Country': User.Country,
    'UT': User.User_Type,
}
def get_ntile(additional_columns=None):
    """ @return: sqlalchemy expression for selecting a given ntile() using
        predefined as well as *additional* columns.
    """
    # columns that every ntile is partitioned by
    partition_by = [
        User.MAUorALL,
        User.User_Type,
        User.Month_ID,
    ]
    label = "Ntile_Objects_Created"
    if additional_columns:
        lbls = []
        for col_name in additional_columns:
            col = column_labels[col_name]
            partition_by.append(col)
            lbls.append(col_name)
        label = "Ntile_{}_Objects_Created".format("_".join(lbls))
    xprs = over(
        func.ntile(10),
        partition_by=partition_by,
        order_by=User.Objects_Created,
    ).label(label)
    return xprs
def get_query(additional_columns=['UT', 'Country']):
    """ @return: a query object which selects a User with additional ntiles
        for predefined columns (fixed) and all possible permutations of
        *additional_columns*
    """
    from itertools import permutations  # , combinations
    tiles = [get_ntile(comb)
             for r in range(len(additional_columns) + 1)
             for comb in permutations(additional_columns, r)
             ]
    q = session.query(User, *tiles)
    return q
q = get_query()
print([_c["name"] for _c in q.column_descriptions])
# >>> ['User', 'Ntile_Objects_Created', 'Ntile_UT_Objects_Created', 'Ntile_Country_Objects_Created', 'Ntile_UT_Country_Objects_Created', 'Ntile_Country_UT_Objects_Created']
for tile in q.all():
    print(tile)
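The choice between itertools.permutations and itertools.combinations mentioned in this answer matters because the order of the columns in PARTITION BY does not change the partitions, so permutations produce the same ntile more than once under different labels. A minimal sketch of the difference, working on label strings only (the helper name build_labels is made up for illustration):

```python
from itertools import combinations, permutations

additional_columns = ["UT", "Country"]


def build_labels(picker):
    """Return the ntile labels produced by the given itertools picker."""
    return [
        "_".join(["Ntile", *combo, "Objects_Created"])
        for r in range(len(additional_columns) + 1)
        for combo in picker(additional_columns, r)
    ]


perm_labels = build_labels(permutations)
comb_labels = build_labels(combinations)

print(len(perm_labels), perm_labels)
print(len(comb_labels), comb_labels)
```

With two additional columns, permutations give five labels and combinations give four; the gap grows quickly as more columns are added.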

Related

Filter by field on relationship in SQLAlchemy

I have a very special case in which I want a group entity whose list contains only the elements that fit some conditions.
These are the ORM classes that I have defined:
class Group(Base):
    __tablename__ = 'groups'
    id = Column(Integer, Identity(1, 1), primary_key=True)
    name = Column(String(50), nullable=False)
    elements = relationship('Element', foreign_keys='[Element.group_id]')
class Element(Base):
    __tablename__ = 'elements'
    id = Column(Integer, Identity(1, 1), primary_key=True)
    date = Column(Date, nullable=False)
    value = Column(Numeric(38, 10), nullable=False)
    group_id = Column(Integer, ForeignKey('groups.id'), nullable=False)
Now, I want to retrieve a group with all the elements of a specific date.
result = session.query(Group).filter(Group.name == 'group 1' and Element.date == '2021-05-27').all()
Sadly enough, the Group.name filter is working, but the retrieved group contains all elements, ignoring the Element.date condition.
As suggested by @van, I have tried:
query(Group).join(Element).filter(Group.name == 'group 1' and Element.date == '2021-05-27')
But I get every element again. On the logs I have noticed:
SELECT groups.id AS group_id, groups.name AS groups_name, element_1.id AS element_1_id, element_1.date AS element_1_date, element_1.value AS element_1_value, element_1.group_id AS element_1_group_id
FROM groups JOIN elements ON groups.id = elements.group_id LEFT OUTER JOIN elements AS elements_1 ON groups.id = elements_1.group_id
WHERE groups.name = %(name_1)s
There, I noticed two things. First, the join is being done twice (I guess one was already done just getting groups, before join). Second and most important: the date filter doesn't appear on the query.
I'm using the mssql+pymssql driver.
OK, there seems to be a combination of a few things happening here.
First, your relationship Group.elements will basically always contain all Elements of the Group. And this is completely separate from the filter, and this is how SA is supposed to work.
You can understand your current query (session.query(Group).filter(Group.name == 'group 1' and Element.date == '2021-05-27').all()) as the following:
"Return all Group instances which contain an Element for a given date."
But when you iterate over the Group.elements, the SA will make sure to return all children. This is what you are trying to solve.
Second, as pointed out by Yeong, you cannot use Python's plain "and" to create an SQL AND clause. Fix this either by using and_() or by chaining separate filter clauses:
result = (
    session.query(Group)
    .filter(Group.name == "group 1")
    .filter(Element.date == dat1)
    .all()
)
Third, as you later pointed out, your relationship is lazy="joined", which is why, whenever you query for Group, ALL related Element instances are retrieved using an OUTER JOIN. This is also why adding .join(Element) to your query resulted in two JOINs.
Solution
You can "trick" SA into thinking that it loaded the whole Group.elements relationship when it only loaded the children you want, by using the orm.contains_eager() option. Your query would then look like this:
result = (
    session.query(Group)
    .join(Element)
    .filter(Group.name == "group 1")
    .filter(Element.date == dat1)
    .options(contains_eager(Group.elements))
    .all()
)
Above should work also with the lazy="joined" as the extra JOIN should not be generated anymore.
Update
If you would like to get the groups even if there are no Elements matching the criteria, you need to:
replace join with outerjoin
place the filter on Element inside the outerjoin clause
result = (
    session.query(Group)
    .filter(Group.name == "group 1")
    .outerjoin(
        Element, and_(Element.group_id == Group.id, Element.date == dat1)
    )
    .options(contains_eager(Group.elements))
    .all()
)
Python's "and" is not the same as the AND condition in SQL. SQLAlchemy handles conjunctions with the and_() function instead, i.e.
result = session.query(Group).join(Element).filter(and_(Group.name == 'group 1', Element.date == '2021-05-27')).all()
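A small sketch of why the date condition vanished from the logged WHERE clause. It assumes SQLAlchemy is installed and uses two throwaway Core tables that only mirror the column names from the question; no database connection is needed. Python's "and" asks the first comparison for its truth value, which is false for a column compared with a literal, so the whole expression evaluates to the first operand and the second condition is dropped without any error.

```python
from sqlalchemy import Column, Date, Integer, MetaData, String, Table, and_

metadata = MetaData()
# Stand-in tables; only the column names are taken from the question.
groups = Table(
    "groups", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String(50)),
)
elements = Table(
    "elements", metadata,
    Column("id", Integer, primary_key=True),
    Column("date", Date),
    Column("group_id", Integer),
)

name_cond = groups.c.name == "group 1"
date_cond = elements.c.date == "2021-05-27"

broken = name_cond and date_cond   # Python-level "and": keeps only name_cond
fixed = and_(name_cond, date_cond)  # SQL-level AND: keeps both conditions

print(broken is name_cond)
print(str(broken))
print(str(fixed))
```

Chaining two .filter() calls, as in the first answer, has the same effect as and_().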

How to dynamically use select with SQLAlchemy?

I am trying to create a function which can filter a sql table using SQLAlchemy, with optional parameters.
The function is
def fetch_new_requests(status, request_type, request_subtype, engine, id_r=None):
    table = Table('Sample_Table', metadata, autoload=True,
                  autoload_with=engine)
    query = session.query(load_requests).filter_by(status = status,
                                                   request_type = request_type,
                                                   request_subtype = request_subtype,
                                                   id_r = id_r)
    return pd.read_sql((query).statement,session.bind)
But it returns an empty table every time if I do not define the id_r variable.
I have googled but cannot find a workaround.
Then I tried to use **kwargs, but it is not what I really need: I still have to explicitly define the column names, and I again run into the issue with id_r.
def fetch_new_requests(**kwargs):
    for x in kwargs.values():
        query = session.query(load_requests).filter_by(status=x)
    return pd.read_sql((query).statement,session.bind)
My ideal result
def fetch_new_requests(any column names, values of the columns):
    for x in kwargs.values():
        query = session.query(load_requests).filter_by(column_name=column_value)
    return pd.read_sql((query).statement,session.bind)
In theory I can use 2 lists or a dict, but if there is another solution I would be happy to hear it.
I can only give you an answer for SQLAlchemy Core syntax, but it works with a dict! The dict's keys are the column names and its values are the required values for those columns.
table = Table('Sample_Table', metadata, autoload=True,
              autoload_with=engine)
query = table.select()
where_dict = {"status": 1, "request_type": "something"}
for k, v in where_dict.items():
    query = query.where(getattr(table.c, k) == v)
just for completeness: here's the syntax to select only specific fields (your question kinda sounds like you're also looking for this):
query = table.select().with_only_columns(select_columns) # select_columns is a list
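On the empty result when id_r is left unset: filter_by(id_r=None) is rendered as id_r IS NULL, so it only matches rows whose id_r is NULL. One hedged way to combine that observation with the dict approach above is to skip keys whose value is None. The sketch below assumes SQLAlchemy is installed, declares a stand-in table explicitly instead of reflecting it (the column types are guesses), and uses a hypothetical helper name build_query:

```python
from sqlalchemy import Column, Integer, MetaData, String, Table

metadata = MetaData()
# Stand-in for the reflected 'Sample_Table'; the column types are assumptions.
table = Table(
    "Sample_Table",
    metadata,
    Column("id_r", Integer),
    Column("status", Integer),
    Column("request_type", String),
    Column("request_subtype", String),
)


def build_query(**filters):
    """Build a SELECT with one equality condition per non-None filter."""
    query = table.select()
    for column_name, value in filters.items():
        if value is None:
            continue  # treat None as "no filter on this column"
        query = query.where(getattr(table.c, column_name) == value)
    return query


query = build_query(status=1, request_type="something", id_r=None)
where_clause = str(query).split("WHERE", 1)[1]
print(where_clause)
```

If NULL is ever a value you genuinely need to filter on, a dedicated sentinel object would be a safer "not set" marker than None.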

sqlalchemy join and order by on multiple tables

I'm working with a database that has a relationship that looks like:
class Source(Model):
    id = Identifier()
class SourceA(Source):
    source_id = ForeignKey('source.id', nullable=False, primary_key=True)
    name = Text(nullable=False)
class SourceB(Source):
    source_id = ForeignKey('source.id', nullable=False, primary_key=True)
    name = Text(nullable=False)
class SourceC(Source, ServerOptions):
    source_id = ForeignKey('source.id', nullable=False, primary_key=True)
    name = Text(nullable=False)
What I want to do is join all tables Source, SourceA, SourceB, SourceC and then order_by name.
Sounds easy to me, but I've been banging my head on this for a while now and my head's starting to hurt. Also, I'm not very familiar with SQL or sqlalchemy, so there's been a lot of browsing the docs, but to no avail. Maybe I'm just not seeing it. This seems to be close, albeit related to a newer version than what I have available (see versions below).
I feel close, not that that means anything. Here's my latest attempt, which seems good up until the order_by call.
Sources = [SourceA, SourceB, SourceC]
# list of join on Source
joins = [session.query(Source).join(source) for source in Sources]
# union the list of joins
query = joins.pop(0).union_all(*joins)
query seems right at this point as far as I can tell i.e. query.all() works. So now I try to apply order_by which doesn't throw an error until .all is called.
Attempt 1: I just use the attribute I want
query.order_by('name').all()
# throws sqlalchemy.exc.ProgrammingError: (ProgrammingError) column "name" does not exist
Attempt 2: I just use the defined column attribute I want
query.order_by(SourceA.name).all()
# throws sqlalchemy.exc.ProgrammingError: (ProgrammingError) missing FROM-clause entry for table "SourceA"
Is it obvious? What am I missing? Thanks!
versions:
sqlalchemy.version = '0.8.1'
(PostgreSQL) 9.1.3
EDIT
I'm dealing with a framework that wants a handle to a query object. I have a bare query that appears to accomplish what I want but I would still need to wrap it in a query object. Not sure if that's possible. Googling ...
select = """
select s.*, a.name from Source s inner join SourceA a on s.id = a.Source_id
union
select s.*, b.name from Source s inner join SourceB b on s.id = b.Source_id
union
select s.*, c.name from Source s inner join SourceC c on s.id = c.Source_id
ORDER BY "name";
"""
selectText = text(select)
result = session.execute(selectText)
# how to put result into a query. maybe Query(selectText)? googling...
result.fetchall()
Assuming that the coalesce function is good enough, the examples below should point you in the right direction. One option builds the list of children automatically, while the other is explicit.
This is not the query you specified in your edit, but you are able to sort (your original request):
def test_explicit():
    # specify all children tables to be queried
    Sources = [SourceA, SourceB, SourceC]
    AllSources = with_polymorphic(Source, Sources)
    name_col = func.coalesce(*(_s.name for _s in Sources)).label("name")
    query = session.query(AllSources).order_by(name_col)
    for x in query:
        print(x)
def test_implicit():
    # get all children tables in the query
    from sqlalchemy.orm import class_mapper
    _map = class_mapper(Source)
    Sources = [_smap.class_
               for _smap in _map.self_and_descendants
               if _smap != _map  # @note: exclude base class, it has no `name`
               ]
    AllSources = with_polymorphic(Source, Sources)
    name_col = func.coalesce(*(_s.name for _s in Sources)).label("name")
    query = session.query(AllSources).order_by(name_col)
    for x in query:
        print(x)
Your first attempt sounds like it isn't working because there is no name in Source, which is the root table of the query. In addition, there will be multiple name columns after your joins, so you will need to be more specific. Try
query.order_by('SourceA.name').all()
As for your second attempt, what is ServerA?
query.order_by(ServerA.name).all()
Probably a typo, but not sure if it's for SO or your code. Try:
query.order_by(SourceA.name).all()

SQLAlchemy: several counts in one query

I am having hard time optimizing my SQLAlchemy queries. My SQL knowledge is very basic, and I just can't get the stuff I need from the SQLAlchemy docs.
Suppose the following very basic one-to-many relationship:
class Parent(Base):
    __tablename__ = "parents"
    id = Column(Integer, primary_key = True)
    children = relationship("Child", backref = "parent")
class Child(Base):
    __tablename__ = "children"
    id = Column(Integer, primary_key = True)
    parent_id = Column(Integer, ForeignKey("parents.id"))
    naughty = Column(Boolean)
How could I:
Query tuples of (Parent, count_of_naughty_children, count_of_all_children) for each parent?
After decent time spent googling, I found how to query those values separately:
# The following returns tuples of (Parent, count_of_all_children):
session.query(Parent, func.count(Child.id)).outerjoin(Child, Parent.children).\
    group_by(Parent.id)
# The following returns tuples of (Parent, count_of_naughty_children):
al = aliased(Child, session.query(Child).filter_by(naughty = True).\
    subquery())
session.query(Parent, func.count(al.id)).outerjoin(al, Parent.children).\
    group_by(Parent.id)
I tried to combine them in different ways, but didn't manage to get what I want.
Query all parents which have more than 80% naughty children? Edit: naughty could be NULL.
I guess this query is going to be based on the previous one, filtering by naughty/all ratio.
Any help is appreciated.
EDIT : Thanks to Antti Haapala's help, I found solution to the second question:
avg = func.avg(func.coalesce(Child.naughty, 0)) # coalesce() treats NULLs as 0
# avg = func.avg(Child.naughty) - if you want to ignore NULLs
session.query(Parent).join(Child, Parent.children).group_by(Parent).\
    having(avg > 0.8)
It finds the average of the children's naughty variable, treating False and NULLs as 0, and True as 1. Tested with a MySQL backend, but should work on others, too.
The count() SQL aggregate function is pretty simple; it gives you the total number of non-null values in each group. With that in mind, we can adjust your query to give you the proper result.
print(Query([
    Parent,
    func.count(Child.id),
    func.count(case(
        [((Child.naughty == True), Child.id)], else_=literal_column("NULL"))).label("naughty")])
    .join(Parent.children).group_by(Parent)
)
Which produces the following sql:
SELECT
parents.id AS parents_id,
count(children.id) AS count_1,
count(CASE WHEN (children.naughty = 1)
THEN children.id
ELSE NULL END) AS naughty
FROM parents
JOIN children ON parents.id = children.parent_id
GROUP BY parents.id
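To see the generated SQL above at work, here is a minimal sketch that runs it against an in-memory SQLite database using Python's sqlite3 module. The sample rows are invented for illustration, and an ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE parents (id INTEGER PRIMARY KEY);
    CREATE TABLE children (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES parents (id),
        naughty BOOLEAN
    );
    INSERT INTO parents (id) VALUES (1), (2);
    -- parent 1: two naughty children out of three
    -- parent 2: none naughty out of two (one value is NULL)
    INSERT INTO children (parent_id, naughty) VALUES
        (1, 1), (1, 1), (1, 0), (2, 0), (2, NULL);
    """
)

# count() only counts non-NULL values, so the CASE yields NULL for
# children that are not naughty and they drop out of the second count.
rows = conn.execute(
    """
    SELECT parents.id,
           count(children.id) AS all_children,
           count(CASE WHEN children.naughty = 1
                      THEN children.id
                      ELSE NULL END) AS naughty
    FROM parents
    JOIN children ON parents.id = children.parent_id
    GROUP BY parents.id
    ORDER BY parents.id
    """
).fetchall()

print(rows)
```

Note that the inner JOIN leaves out parents that have no children at all; the outerjoin used in the question would keep them, with both counts equal to 0.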
If your query is only to get the parents who have more than 80% naughty children, on most databases you can cast naughty to an integer, take its average, and then require that average to be greater than 0.8.
Thus you get something like
from sqlalchemy.sql.expression import cast
naughtyp = func.avg(cast(Child.naughty, Integer))
session.query(Parent, func.count(Child.id), naughtyp).join(Child)\
    .group_by(Parent.id).having(naughtyp > 0.8).all()
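The edit in the question notes that naughty can be NULL, and that changes the result: avg() skips NULLs, while wrapping the column in coalesce(..., 0) counts them as not naughty. A minimal sketch of the difference, using Python's sqlite3 module and invented sample rows (the helper name naughty_parents is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE children (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER,
        naughty BOOLEAN
    );
    -- parent 1: 5 of 5 naughty; parent 2: 1 naughty, 1 unknown (NULL)
    INSERT INTO children (parent_id, naughty) VALUES
        (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (2, 1), (2, NULL);
    """
)


def naughty_parents(avg_expression):
    """Return ids of parents whose naughty ratio exceeds 0.8."""
    # avg_expression is one of the two fixed strings below, never user input.
    sql = (
        "SELECT parent_id FROM children "
        "GROUP BY parent_id HAVING {} > 0.8 ORDER BY parent_id"
    ).format(avg_expression)
    return [row[0] for row in conn.execute(sql)]


ignoring_nulls = naughty_parents("avg(naughty)")
nulls_as_zero = naughty_parents("avg(coalesce(naughty, 0))")
print(ignoring_nulls, nulls_as_zero)
```

Parent 2 passes the 80% threshold only when its NULL row is ignored, so the choice between the two expressions depends on what NULL is meant to represent.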

How can I rank entries using sqlalchemy?

I have created a model User that has the columns score and rank. I would like to periodically update the rank of all users in User such that the user with the highest score has rank 1, the second highest score rank 2, etc. Is there any way to achieve this efficiently in Flask-SQLAlchemy?
Thanks!
btw, here is the model:
app = Flask(__name__)
db = SQLAlchemy(app)
class User(db.Model):
id = db.Column(db.Integer, primary_key=True)
score = db.Column(db.Integer)
rank = db.Column(db.Integer)
Well, as for why one might do this: it lets you query for "rank" without performing an aggregate query, which can be more performant, especially if you want to see "what's the rank for user #456?" without hitting every row.
The most efficient way to do this is a single UPDATE. Using standard SQL, we can use a correlated subquery like this:
UPDATE user SET rank=(SELECT count(*) FROM user AS u1 WHERE u1.score > user.score) + 1
Some databases have extensions like PG's UPDATE..FROM, which I have less experience with. If you could UPDATE..FROM a SELECT statement that computes the rank in one pass using a window function, that might be more efficient, though I'm not totally sure.
Anyway our standard SQL with SQLAlchemy looks like:
from sqlalchemy.orm import aliased
from sqlalchemy import func
u1 = aliased(User)
subq = session.query(func.count(u1.id)).filter(u1.score > User.score).as_scalar()
session.query(User).update({"rank": subq + 1}, synchronize_session=False)
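As a quick check of what the correlated-subquery UPDATE does, the sketch below runs the same statement with Python's sqlite3 module against an in-memory table. The four sample rows are invented; the table and column names follow the SQL shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE user (
        id INTEGER PRIMARY KEY,
        score INTEGER,
        rank INTEGER
    );
    INSERT INTO user (id, score) VALUES (1, 50), (2, 90), (3, 70), (4, 90);
    """
)

# For every row, count the users with a strictly higher score and add 1.
conn.execute(
    """
    UPDATE user
    SET rank = (SELECT count(*) FROM user AS u1 WHERE u1.score > user.score) + 1
    """
)

rows = conn.execute("SELECT id, score, rank FROM user ORDER BY id").fetchall()
print(rows)
```

Users with equal scores share a rank under this statement, and the next rank is skipped (1, 1, 3, 4 in the sample data).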
Just loop over all your users:
users = User.query.order_by(User.score.desc()).all()  # fetch them all in one query
for (rank, user) in enumerate(users):
    user.rank = rank + 1  # plus 1 because enumerate starts from zero
db.session.commit()
