SQLAlchemy - aggregate inner query joining to outer query - python

Below is an example of a query which I'm trying to write in SQLAlchemy without much luck. I'm quite new to SQLA and am able to convert some queries, but not this one:
select car, min(units)
from (
    select car,
           (select sum(case when side = 0 then 1 else -1 end * doors)
            from p.trades i
            where i.car = o.car and i.date = o.date
              and i.server_time <= o.server_time) units
    from p.trades o
    where date = '2016-01-19'
      and car in ('Golf')
    order by server_time, trade_id
) boff
group by car
Can anyone be of assistance?
Thanks, much appreciated

I know that this is not what you expect, but I'd just use SQL query.
I have worked with a few different ORMs, and my experience is that it is usually not worth trying to write relatively complex queries using object-like syntax.
Anything simple, like read / write a record or do a simple query, is usually obvious and clear, so it's easy to write it and to maintain it.
For more complex queries you will both spend time initially to convert to the ORM language and spend time later, when you need to modify it, to remember how it worked and to understand how it can be modified.
So I would just do this:
data = session.query(MyModel).from_statement(text(
"""
select * from
....
....
""")).params(x=a, y=b).all()

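As a rough sketch of the raw-SQL route suggested above, the query can be run with bound parameters. This uses the standard-library sqlite3 module instead of SQLAlchemy so that it is self-contained; with SQLAlchemy the same string would be wrapped in text() and bound through params() as in the answer. The table layout (an unqualified trades table with integer server_time) and the sample rows are assumptions made up for illustration.

```python
import sqlite3

# Hypothetical stand-in for p.trades; column types are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trades (trade_id INTEGER PRIMARY KEY, car TEXT, date TEXT, "
    "server_time INTEGER, side INTEGER, doors INTEGER)"
)
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Golf", "2016-01-19", 1, 0, 4),  # side 0 adds doors
        (2, "Golf", "2016-01-19", 2, 1, 6),  # any other side subtracts doors
        (3, "Golf", "2016-01-19", 3, 0, 5),
        (4, "Polo", "2016-01-19", 1, 1, 9),  # filtered out by the car parameter
    ],
)

# The inner correlated subquery computes a running total per trade;
# the outer query takes the minimum of those running totals per car.
query = """
SELECT car, MIN(units)
FROM (
    SELECT car,
           (SELECT SUM(CASE WHEN side = 0 THEN 1 ELSE -1 END * doors)
            FROM trades AS i
            WHERE i.car = o.car AND i.date = o.date
              AND i.server_time <= o.server_time) AS units
    FROM trades AS o
    WHERE o.date = :date AND o.car = :car
) AS boff
GROUP BY car
"""
rows = conn.execute(query, {"date": "2016-01-19", "car": "Golf"}).fetchall()
print(rows)  # [('Golf', -2)]
```

The running totals for the sample rows are 4, -2 and 3, so the minimum is -2. Binding the date and car as parameters, instead of formatting them into the string, keeps the query safe to reuse with other values.
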
Related

How to convert strings to arrays and join on another table in sql

At my work, I have two SQL tables. One is called jobs, with string attributes job and code. The other is called skills, with string attributes code and skills.
job  code
---  ----
j1   s0001,s0003
j2   s0002,20003
j3   s0003,s0004

code   skills
-----  ------
s0001  python programming language
s0002  oracle java
s0003  structured query language sql
s0004  microsoft excel
What my boss wants me to do is: Take values from the attribute code in jobs, split the string into an array, join this array on the codes (from skills table) and return the query in the format of job skills like:
job  skills
---  ------
j1   python programming language,structured query language sql
At this point, I'd just like to know (A) if this is possible and (B) if there is a preferred alternative to this approach. I've listed my Python solution, using dictionaries, below to illustrate the concept:
jobs = {'j1':'s0001,s0003',
        'j2':'s0002,20003',
        'j3':'s0003,s0004'}
skills = {'s0001':'python programming language',
          's0002':'oracle java',
          's0003':'structured query language sql',
          's0004':'microsoft excel'}
job_skills = {k:[] for k in jobs.keys()}
for j,s in jobs.items():
    for code,skill in skills.items():
        for i in s.split(','):
            if i == code:
                job_skills[j].append(skill)
for k,v in job_skills.items():
    job_skills[k] = ','.join(v)
And the output:
{'j1': 'python programming language,structured query language sql',
'j2': 'oracle java',
'j3': 'structured query language sql,microsoft excel'}
The real crux of this problem is that there aren't just 4 different skills in our data. Our company's data includes ~5000 skills. My boss would greatly like to avoid creating a table with 5000 attributes, 1 for each skill; he believes the above approach will result in simpler queries, with potentially better memory management.
I'm still pretty new to SQL, and technically only do SQLite3 anyway so the best I can probably do is a Python solution. I'll tell you how I would solve it, and hopefully someone can come along and fix it, because doing things purely in SQL is vastly faster than ever using Python.
I'm going to assume that this is SQLite, because you tagged Python. If it's not, there are probably ways to convert the database to the .db format in order to use this solution if you prefer it.
I'm assuming that conn is your connection to the database, created with conn = sqlite3.connect(your_database_path), or a cursor for it. I don't use cursors here, but it's almost certainly better practice to use them.
First, I would fetch the 'skills' table and convert it to a dict. I would do so with:
skills_array = conn.execute("""SELECT * FROM skills""")
skills_dict = dict()
#replace i with something else. I just did it so that I could use 'skill' as a variable
for i in skills_array:
    #skills array is an iterator of tuples, which means the first position is the code number, and the second position is the skill itself
    code = i[0]
    skill = i[1]
    skills_dict[code] = skill
There are probably better ways to do this. If it's important, I recommend researching them. But if it's a one-time thing this will work just fine. All this does is give you an easy way to look up a skill given its code. You could do this dozens of ways. You didn't mention it being a particularly large database, so this should be fine.
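One shorter way to build the same lookup, shown as a sketch: a sqlite3 cursor yields one tuple per row, so a two-column SELECT can be passed straight to dict(). The in-memory database and the column names code and skill are assumptions standing in for the real skills table.

```python
import sqlite3

# In-memory stand-in for the real database; column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE skills (code TEXT PRIMARY KEY, skill TEXT)")
conn.executemany(
    "INSERT INTO skills VALUES (?, ?)",
    [
        ("s0001", "python programming language"),
        ("s0002", "oracle java"),
        ("s0003", "structured query language sql"),
        ("s0004", "microsoft excel"),
    ],
)

# Each row is a (code, skill) tuple, which is exactly what dict() accepts.
skills_dict = dict(conn.execute("SELECT code, skill FROM skills"))
print(skills_dict["s0003"])  # structured query language sql
```

Naming the two columns in the SELECT, instead of using SELECT *, keeps the lookup correct even if more columns are added to the table later.
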
Before the next part, something should be mentioned about SQLite. It has very limited table modifying mechanics-- something I coincidentally found out about today. The recommended method is just to create a new table instead of trying to finagle with an old one. But there are easy ways to modify them with SQLiteBrowser-- something I highly recommend you use. At the very least it's much easier to view info in it for me, and it's available on all the important OS's.
Second, we need to combine the jobs table and the skills dict. There are much better ways to go about it, but I chose the easy approach: splitting the comma-separated codes column of the jobs table and going from there. I'll also create a new table and insert directly into it.
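conn.execute("""CREATE TABLE combined (job TEXT PRIMARY KEY, skills TEXT)""")
conn.commit()
job_array = conn.execute("""SELECT * FROM jobs""").fetchall()
for i in job_array:
    job = i[0]
    skill = i[1]
    for code in skill.split(","):
        # str.replace returns a new string, so the result has to be assigned back;
        # a code with no match in skills_dict (e.g. 20003) is left unchanged
        skill = skill.replace(code, skills_dict.get(code, code))
    conn.execute("""INSERT INTO combined VALUES (?, ?)""", (job, skill))
conn.commit()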
And to combine it all...
import sqlite3
conn = sqlite3.connect(your_database_path)
skills_array = conn.execute("""SELECT * FROM skills""")
skills_dict = dict()
#replace i with something else. I just did it so that I could use 'skill' as a variable
for i in skills_array:
    #skills array is an iterator of tuples, which means the first position is the code number, and the second position is the skill itself
    code = i[0]
    skill = i[1]
    skills_dict[code] = skill
conn.execute("""CREATE TABLE combined (job TEXT PRIMARY KEY, skills TEXT)""")
conn.commit()
job_array = conn.execute("""SELECT * FROM jobs""").fetchall()
for i in job_array:
    job = i[0]
    skill = i[1]
    for code in skill.split(","):
        # str.replace returns a new string, so the result has to be assigned back;
        # a code with no match in skills_dict (e.g. 20003) is left unchanged
        skill = skill.replace(code, skills_dict.get(code, code))
    conn.execute("""INSERT INTO combined VALUES (?, ?)""", (job, skill))
conn.commit()
To explain a little further if you/someone is confused on the job_array for loop:
Splitting skills allows you to see each individual code, meaning that all you have to do is replace every instance of the code being looked up with the corresponding skill.
And that's it. There's probably a mistake or two in the above code, so I would back up your database/tables before trying it, but this should work. One thing that you might find helpful is context managers, which would make it far more Pythonic. If you plan to use this consistently (for some strange reason), refactoring for speed and readability may also be prudent.
I would also like to believe that there's an SQLite only approach, since this is exactly what databases are made for.
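For what it's worth, here is a sketch of what a SQL-only version might look like in SQLite, run through sqlite3 so it can be tried on the sample data. It joins each job to every skill whose code appears in the comma-separated list, then glues the skill names back together with group_concat. The table and column names (jobs.job, jobs.code, skills.code, skills.skill) are assumptions based on the question, and the order of names inside each concatenated string is not guaranteed by SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (job TEXT PRIMARY KEY, code TEXT);
    CREATE TABLE skills (code TEXT PRIMARY KEY, skill TEXT);
    INSERT INTO jobs VALUES
        ('j1', 's0001,s0003'),
        ('j2', 's0002,20003'),
        ('j3', 's0003,s0004');
    INSERT INTO skills VALUES
        ('s0001', 'python programming language'),
        ('s0002', 'oracle java'),
        ('s0003', 'structured query language sql'),
        ('s0004', 'microsoft excel');
""")

# Wrapping both sides in commas makes instr() match whole codes only,
# so 's0001' cannot accidentally match inside a longer code.
rows = conn.execute("""
    SELECT j.job, group_concat(s.skill) AS skills
    FROM jobs AS j
    JOIN skills AS s
      ON instr(',' || j.code || ',', ',' || s.code || ',') > 0
    GROUP BY j.job
    ORDER BY j.job
""").fetchall()

job_skills = dict(rows)
for job, names in job_skills.items():
    print(job, names)
```

A join on a substring match cannot use an index, so with ~5000 skills and many jobs a separate job/skill link table with one row per pair would likely scale better than the comma-separated column.
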
Hope this helps. If it did, let me know. :>
P.S. If you're confused by something/want more explanation feel free to comment.

How do I define a custom metric in Apache Superset?

I've been implementing superset at work, and I like it so far. However, I have such a table:
name,age,gender
John,42,M
Sally,38,F
Patricia,27,F
Steven,29,M
Amanda,51,F
I want to define a new metric against each name, counting the number of people who are younger. My data is in a MySQL database, and I suppose that for one person, I could write the query thus:
SELECT COUNT(DISTINCT name) from users where users.age <= 42;
for, say, John's row. So, how do I do this continuously for the entire table?
Your query could look something like
select *,
    (select count(distinct all_users.name) from users all_users where all_users.age <= users.age)
FROM users
To shadow's point - this would get quite expensive to run on a large dataset.
If that were the case, you'd probably want to try putting an index on age, or denormalize that count altogether - the tradeoff being that inserts would become slower.
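To check what the correlated subquery returns for the sample table, here is a small sketch using sqlite3 in memory. SQLite stands in for MySQL here, which is an assumption, though the query uses nothing MySQL-specific. The column alias not_older is a made-up name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER, gender TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        ("John", 42, "M"),
        ("Sally", 38, "F"),
        ("Patricia", 27, "F"),
        ("Steven", 29, "M"),
        ("Amanda", 51, "F"),
    ],
)

# For every row, the subquery counts the users whose age is <= that row's age.
# Because of <=, each person is included in their own count.
rows = conn.execute("""
    SELECT name, age,
           (SELECT COUNT(DISTINCT all_users.name)
            FROM users AS all_users
            WHERE all_users.age <= users.age) AS not_older
    FROM users
    ORDER BY age
""").fetchall()

for row in rows:
    print(row)
# ('Patricia', 27, 1)
# ('Steven', 29, 2)
# ('Sally', 38, 3)
# ('John', 42, 4)
# ('Amanda', 51, 5)
```

Changing <= to < would count only strictly younger people, which gives 0 for Patricia and 4 for Amanda.
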

Django ORM: Get latest record for distinct field

I'm having loads of trouble translating some SQL into Django.
Imagine we have some cars, each with a unique VIN, and we record the dates that they are in the shop with some other data. (Please ignore the reason one might structure the data this way. It's specifically for this question. :-) )
class ShopVisit(models.Model):
    vin = models.CharField(...)
    date_in_shop = models.DateField(...)
    mileage = models.DecimalField(...)
    boolfield = models.BooleanField(...)
We want a single query to return a Queryset with the most recent record for each vin and update it!
special_vins = [...]
# Doesn't work
ShopVisit.objects.filter(vin__in=special_vins).annotate(max_date=Max('date_in_shop')).filter(date_in_shop=F('max_date')).update(boolfield=True)
# Distinct doesn't work with update
ShopVisit.objects.filter(vin__in=special_vins).order_by('vin', '-date_in_shop').distinct('vin').update(boolfield=True)
Yes, I could iterate over a queryset. But that's not very efficient and it takes a long time when I'm dealing with around 2M records. The SQL that could do this is below (I think!):
SELECT *
FROM cars
INNER JOIN (
    SELECT MAX(dateInShop) as maxtime, vin
    FROM cars
    GROUP BY vin
) AS latest_record ON (cars.dateInShop = maxtime)
    AND (latest_record.vin = cars.vin)
So how can I make this happen with Django?
This is somewhat untested and relies on Django 1.11 for subqueries, but perhaps something like the following, where the subquery is correlated on vin so that it picks the latest visit for each car:
latest_visits = Subquery(ShopVisit.objects.filter(vin=OuterRef('vin')).order_by('-date_in_shop').values('id')[:1])
ShopVisit.objects.filter(id__in=latest_visits)
I had a similar model, so went to test it but got an error of:
"This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'"
The SQL it generated looked reasonably like what you want, so I think the idea is sound. If you use PostgreSQL, perhaps it has support for that type of subquery.
Here's the SQL it produced (trimmed up a bit and replaced actual names with fake ones):
SELECT `mymodel_activity`.* FROM `mymodel_activity` WHERE `mymodel_activity`.`id` IN (SELECT U0.`id` FROM `mymodel_activity` U0 WHERE U0.`vin` = (`mymodel_activity`.`vin`) ORDER BY U0.`date_in_shop` DESC LIMIT 1)
I wonder if you found the solution yourself.
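I could only come up with a raw query string (see the Django manual on raw SQL queries). The subquery has to return vin together with the latest date, otherwise a visit whose date merely coincides with another car's latest date would also be updated:
UPDATE "yourapplabel_shopvisit"
SET boolfield = True
WHERE (vin, date_in_shop) IN (SELECT vin, MAX(date_in_shop) FROM "yourapplabel_shopvisit" GROUP BY vin);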
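A minimal sketch of a per-vin version of this UPDATE, run against an in-memory SQLite table so the effect is visible. The table name shopvisit, the sample rows and the use of SQLite are all assumptions; the real table would be whatever Django generated for the model. Here the subquery is correlated on vin, so only each car's own latest visit is flagged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shopvisit (
        id INTEGER PRIMARY KEY,
        vin TEXT,
        date_in_shop TEXT,
        boolfield INTEGER DEFAULT 0
    );
    INSERT INTO shopvisit (vin, date_in_shop) VALUES
        ('VIN_A', '2017-01-10'),
        ('VIN_A', '2017-03-05'),
        ('VIN_B', '2017-03-05'),
        ('VIN_B', '2017-06-01');
""")

# Flag a row only if its date is the maximum date for the same vin.
# VIN_B's 2017-03-05 visit shares a date with VIN_A's latest visit,
# but it is not VIN_B's latest, so it stays unflagged.
conn.execute("""
    UPDATE shopvisit
    SET boolfield = 1
    WHERE date_in_shop = (
        SELECT MAX(s2.date_in_shop)
        FROM shopvisit AS s2
        WHERE s2.vin = shopvisit.vin
    )
""")
conn.commit()

flagged = conn.execute(
    "SELECT vin, date_in_shop FROM shopvisit WHERE boolfield = 1 ORDER BY vin"
).fetchall()
print(flagged)  # [('VIN_A', '2017-03-05'), ('VIN_B', '2017-06-01')]
```

If two visits for the same vin share the latest date, both are flagged; a tie-breaker on id would be needed to pick exactly one.
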

SQLAlchemy: Perform double filter and sum in the same query

I have a general ledger table in my DB with the columns: member_id, is_credit and amount. I want to get the current balance of the member.
Ideally that can be got by two queries where the first query has is_credit == True and the second query is_credit == False something close to:
credit_amount = session.query(func.sum(Funds.amount).label('Credit_Amount')).filter(Funds.member_id==member_id, Funds.is_credit==True).scalar()
debit_amount = session.query(func.sum(Funds.amount).label('Debit_Amount')).filter(Funds.member_id==member_id, Funds.is_credit==False).scalar()
balance = credit_amount - debit_amount
and then subtract the result. Is there a way to have the above run in one query to give the balance?
From the comments you state that hybrids are too advanced right now, so I will propose an easier but less efficient solution (it is still okay):
(session.query(Funds.is_credit, func.sum(Funds.amount).label('Debit_Amount')).
    filter(Funds.member_id==member_id).group_by(Funds.is_credit))
What will this do? You will receive a two-row result: one row holds the credit total and the other the debit total, distinguished by the is_credit value in each row. The second column (Debit_Amount) holds the sum. You then subtract one from the other to get the balance, with only one query fetching both values.
If you are unsure what group_by does, I recommend you read up on SQL before doing it in SQLAlchemy. SQLAlchemy offers very easy usage of SQL but it requires that you understand SQL as well. Thus, I recommend: First build a query in SQL and see that it does what you want - then translate it to SQLAlchemy and see that it does the same. Otherwise SQLAlchemy will often generate highly inefficient queries, because you asked for the wrong thing.
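Following that advice, here is a sketch of the grouped query in plain SQL, run through sqlite3, together with the final subtraction in Python. The funds table, its integer amounts and the sample rows are assumptions for illustration; the SQLAlchemy query above should produce an equivalent GROUP BY statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE funds (member_id INTEGER, is_credit INTEGER, amount INTEGER);
    INSERT INTO funds VALUES
        (1, 1, 100),
        (1, 1, 50),
        (1, 0, 30),
        (2, 1, 10);
""")

member_id = 1
# One query returns up to two rows: (0, total debits) and (1, total credits).
totals = dict(
    conn.execute(
        "SELECT is_credit, SUM(amount) FROM funds "
        "WHERE member_id = ? GROUP BY is_credit",
        (member_id,),
    )
)

# .get with a default covers members that have only credits or only debits.
balance = totals.get(1, 0) - totals.get(0, 0)
print(balance)  # 120
```

Using a default of 0 matters because a member with no debits produces a single-row result, and a missing key would otherwise raise a KeyError.
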

SQLAlchemy filter query by related object

Using SQLAlchemy, I have a one to many relation with two tables - users and scores. I am trying to query the top 10 users sorted by their aggregate score over the past X amount of days.
users:
id
user_name
score
scores:
user
score_amount
created
My current query is:
top_users = DBSession.query(User).options(eagerload('scores')).filter_by(User.scores.created > somedate).order_by(func.sum(User.scores).desc()).all()
I know this is clearly not correct, it's just my best guess. However, after looking at the documentation and googling I cannot find an answer.
EDIT:
Perhaps it would help if I sketched what the MySQL query would look like:
SELECT user.*, SUM(scores.amount) as score_increase
FROM user LEFT JOIN scores ON scores.user_id = user.user_id
WHERE scores.created_at > someday
ORDER BY score_increase DESC
The single-joined-row way, with a group_by added in for all user columns although MySQL will let you group on just the "id" column if you choose:
sess.query(User, func.sum(Score.amount).label('score_increase')).\
    join(User.scores).\
    filter(Score.created_at > someday).\
    group_by(User).\
    order_by(desc("score_increase"))  # desc is imported from sqlalchemy
Or if you just want the users in the result:
sess.query(User).\
    join(User.scores).\
    filter(Score.created_at > someday).\
    group_by(User).\
    order_by(func.sum(Score.amount).desc())
The above two have an inefficiency in that you're grouping on all columns of "user" (or you're using MySQL's "group on only a few columns" thing, which is MySQL only). To minimize that, the subquery approach:
subq = sess.query(Score.user_id, func.sum(Score.amount).label('score_increase')).\
    filter(Score.created_at > someday).\
    group_by(Score.user_id).subquery()
sess.query(User).join(subq, subq.c.user_id == User.user_id).order_by(subq.c.score_increase.desc())
An example of the identical scenario is in the ORM tutorial at: http://docs.sqlalchemy.org/en/latest/orm/tutorial.html#selecting-entities-from-subqueries
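The SQL that the subquery approach aims for can be sketched with sqlite3 as follows. The table names (users, scores), the column names and the sample rows are assumptions made up for illustration, and the LIMIT 10 is added here to match the "top 10" in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_name TEXT);
    CREATE TABLE scores (user_id INTEGER, amount INTEGER, created_at TEXT);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO scores VALUES
        (1, 5,  '2024-01-10'),
        (1, 7,  '2024-02-01'),
        (2, 20, '2023-12-01'),
        (2, 3,  '2024-01-15'),
        (3, 9,  '2024-01-20');
""")

someday = "2024-01-01"
# The subquery aggregates scores per user first, grouping on one column only;
# the outer query then joins the totals back to the users table.
top_users = conn.execute("""
    SELECT u.user_name, s.score_increase
    FROM users AS u
    JOIN (
        SELECT user_id, SUM(amount) AS score_increase
        FROM scores
        WHERE created_at > ?
        GROUP BY user_id
    ) AS s ON s.user_id = u.user_id
    ORDER BY s.score_increase DESC
    LIMIT 10
""", (someday,)).fetchall()
print(top_users)  # [('ann', 12), ('cy', 9), ('bob', 3)]
```

Bob's 20 points fall before the cutoff date and are excluded, which is why he ends up last despite having the largest single score.
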
You will need to use a subquery in order to compute the aggregate score for each user. Subqueries are described here: http://www.sqlalchemy.org/docs/05/ormtutorial.html?highlight=subquery#using-subqueries
I am assuming the column (not the relation) you're using for the join is called Score.user_id, so change it if this is not the case.
You will need to do something like this:
DBSession.query(Score.user_id, func.sum(Score.score_amount).label('total_score')).group_by(Score.user_id).filter(Score.created > somedate).order_by(desc('total_score'))[:10]
However this will result in tuples of (user_id, total_score). I'm not sure if the computed score is actually important to you, but if it is, you will probably want to do something like this:
users_scores = []
q = DBSession.query(Score.user_id, func.sum(Score.score_amount).label('total_score')).group_by(Score.user_id).filter(Score.created > somedate).order_by(desc('total_score'))[:10]
for user_id, total_score in q:
    # look up the User for this id; this is one extra query per user
    user = DBSession.query(User).get(user_id)
    users_scores.append((user, total_score))
This will result in 11 queries being executed, however. It is possible to do it all in a single query, but due to various limitations in SQLAlchemy, it will likely create a very ugly multi-join query or subquery (dependent on engine) and it won't be very performant.
If you plan on doing something like this often and you have a large amount of scores, consider denormalizing the current score onto the user table. It's more work to upkeep, but will result in a single non-join query like:
DBSession.query(User).order_by(User.computed_score.desc())
Hope that helps.
