How can I query rows with unique values on a joined column? - python

I'm trying to have my popular_query subquery remove duplicate Place.id rows, but it doesn't. The code is below. I tried using distinct, but it does not respect the order_by clause.
SimilarPost = aliased(Post)
SimilarPostOption = aliased(PostOption)
popular_query = (db.session.query(Post, func.count(SimilarPost.id))
    .join(Place, Place.id == Post.place_id)
    .join(PostOption, PostOption.post_id == Post.id)
    .outerjoin(SimilarPostOption, PostOption.val == SimilarPostOption.val)
    .join(SimilarPost, SimilarPost.id == SimilarPostOption.post_id)
    .filter(Place.id == Post.place_id)
    .filter(self.radius_cond())
    .group_by(Post.id)
    .group_by(Place.id)
    .order_by(desc(func.count(SimilarPost.id)))
    .order_by(desc(Post.timestamp))
).subquery().select()
all_posts = db.session.query(Post).select_from(filter.pick()).all()
I did a test printout with
print [x.place.name for x in all_posts]
[u'placeB', u'placeB', u'placeB', u'placeC', u'placeC', u'placeA']
How can I fix this?
Thanks!

This should get you what you want:
SimilarPost = aliased(Post)
SimilarPostOption = aliased(PostOption)
post_popularity = (db.session.query(func.count(SimilarPost.id))
    .select_from(PostOption)
    .filter(PostOption.post_id == Post.id)
    .correlate(Post)
    .outerjoin(SimilarPostOption, PostOption.val == SimilarPostOption.val)
    .join(SimilarPost, sql.and_(
        SimilarPost.id == SimilarPostOption.post_id,
        SimilarPost.place_id == Post.place_id))
    .as_scalar())
popular_post_id = (db.session.query(Post.id)
    .filter(Post.place_id == Place.id)
    .correlate(Place)
    .order_by(post_popularity.desc())
    .limit(1)
    .as_scalar())
deduped_posts = (db.session.query(Post, post_popularity)
    .join(Place)
    .filter(Post.id == popular_post_id)
    .order_by(post_popularity.desc(), Post.timestamp.desc())
    .all())
I can't speak to the runtime performance with large data sets, and there may be a better solution, but that's what I managed to synthesize from quite a few sources (MySQL JOIN with LIMIT 1 on joined table, SQLAlchemy - subquery in a WHERE clause, SQLAlchemy Query documentation). The biggest complicating factor is that you apparently need to use as_scalar to nest the subqueries in the right places, and therefore can't return both the Post id and the count from the same subquery.
FWIW, this is kind of a behemoth and I concur with user1675804 that SQLAlchemy code this deep is hard to grok and not very maintainable. You should take a hard look at lower-tech solutions, like adding columns to the database or doing more of the work in Python code.

I don't want to sound like the bad guy here, but in my opinion your approach to the issue is far from optimal. If you're using PostgreSQL, you could simplify the whole thing using WITH. A better approach, though, assuming these posts will be read much more often than they are updated, would be to add some columns to your tables that are updated by triggers on inserts/updates to the other tables. If performance is ever likely to become an issue, that's the solution I'd go with.
I'm not very familiar with SQLAlchemy, so I can't write it in clear code for you, but the only other solution I can come up with uses at least one subquery to select the ORDER BY value for each of the columns in the GROUP BY, and that will add significantly to your already slow query.
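To make the trigger suggestion concrete, here is a minimal sketch of the idea using SQLite (the suggestion above was about PostgreSQL, but the principle is the same): a cached count column on the posts table that a trigger keeps up to date, so the expensive aggregate never runs at read time. The schema and names here are hypothetical, not taken from the question.

```python
import sqlite3

# Hypothetical schema: a cached option_count column on post, maintained
# by a trigger so reads never have to aggregate over post_option.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post (id INTEGER PRIMARY KEY,
                       option_count INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE post_option (id INTEGER PRIMARY KEY, post_id INTEGER, val TEXT);
    CREATE TRIGGER post_option_insert AFTER INSERT ON post_option
    BEGIN
        UPDATE post SET option_count = option_count + 1 WHERE id = NEW.post_id;
    END;
""")
conn.execute("INSERT INTO post (id) VALUES (1)")
conn.execute("INSERT INTO post_option (post_id, val) VALUES (1, 'a')")
conn.execute("INSERT INTO post_option (post_id, val) VALUES (1, 'b')")
# the cached column is already up to date; no COUNT(*) needed at read time
count = conn.execute("SELECT option_count FROM post WHERE id = 1").fetchone()[0]
```

Reads then become a plain ORDER BY on the cached column, at the cost of keeping a trigger per mutating table.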

Related

Returning null where values don't exist in SQLAlchemy (Python)

I've got 3 tables
tblOffers (tsin, offerId)
tblProducts (tsin)
tblThresholds (offerId)
I'm trying to do a select on columns from all 3 tables.
The thing is, there might not be a record in tblThresholds which matches an offerId. In that instance, I still need the information from the other two tables to return... I don't mind if those columns or fields that are missing are null or whatever in the response.
Currently, I'm not getting anything back at all unless there is information in tblThresholds which correctly matches the offerId.
I suspect the issue lies with the way I'm doing the joining but I'm not very experienced with SQL and brand new to SQLAlchemy.
(Using MySQL by the way)
query = db.select([
    tblOffers.c.title,
    tblOffers.c.currentPrice,
    tblOffers.c.rrp,
    tblOffers.c.offerId,
    tblOffers.c.gtin,
    tblOffers.c.status,
    tblOffers.c.mpBarcode,
    tblThresholds.c.minPrice,
    tblThresholds.c.maxPrice,
    tblThresholds.c.increment,
    tblProducts.c.currentSellerId,
    tblProducts.c.brand,
    tblOffers.c.productTakealotURL,
    tblOffers.c.productLineId
]).select_from(
    tblOffers
    .join(tblProducts, tblProducts.c.tsin == tblOffers.c.tsinId)
    .join(tblThresholds, tblThresholds.c.offerId == tblOffers.c.offerId)
)
I'm happy to add to this question or provide more information but since I'm pretty new to this, I don't entirely know what other information might be needed.
Thanks
Try for hours -> ask here -> find the answer minutes later on your own 🤦‍♂️
So for those who might end up here for the same reason I did, here you go.
Turns out SQLAlchemy's join() does an inner join by default, so any offer without a matching row in tblThresholds was dropped entirely. I added isouter=True to my join on tblThresholds, which turns it into a LEFT OUTER JOIN, and it worked!
Link to the info in the docs: https://docs.sqlalchemy.org/en/13/orm/query.html?highlight=join#sqlalchemy.orm.query.Query.join.params.isouter
Final code:
query = db.select([
    tblOffers.c.title,
    tblOffers.c.currentPrice,
    tblOffers.c.rrp,
    tblOffers.c.offerId,
    tblOffers.c.gtin,
    tblOffers.c.status,
    tblOffers.c.mpBarcode,
    tblThresholds.c.minPrice,
    tblThresholds.c.maxPrice,
    tblThresholds.c.increment,
    tblProducts.c.brand,
    tblOffers.c.productTakealotURL,
    tblOffers.c.productLineId
]).select_from(
    tblOffers
    .join(tblProducts, tblProducts.c.tsin == tblOffers.c.tsinId)
    .join(tblThresholds, tblThresholds.c.offerId == tblOffers.c.offerId, isouter=True)
)
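The difference between the default inner join and isouter=True is easy to see in plain SQL. A minimal sketch with sqlite3 and two hypothetical offers, only one of which has a threshold row (column names follow the question, data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblOffers (offerId INTEGER, title TEXT);
    CREATE TABLE tblThresholds (offerId INTEGER, minPrice REAL);
    INSERT INTO tblOffers VALUES (1, 'offer with threshold'), (2, 'offer without');
    INSERT INTO tblThresholds VALUES (1, 9.99);
""")
# INNER JOIN (what join() emits by default): the unmatched offer disappears
inner = conn.execute("""
    SELECT o.title, t.minPrice FROM tblOffers o
    JOIN tblThresholds t ON t.offerId = o.offerId
    ORDER BY o.offerId
""").fetchall()
# LEFT OUTER JOIN (isouter=True): the unmatched offer survives with NULLs
outer = conn.execute("""
    SELECT o.title, t.minPrice FROM tblOffers o
    LEFT OUTER JOIN tblThresholds t ON t.offerId = o.offerId
    ORDER BY o.offerId
""").fetchall()
```

The inner query returns one row; the outer query returns both offers, with NULL in the threshold columns for the one that has no match.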

How to convert strings to arrays and join on another table in sql

At my work, I have two SQL tables. One is called jobs, with string attributes job and code. The other is called skills, with string attributes code and skills.
job code
--- ----
j1 s0001,s0003
j2 s0002,20003
j3 s0003,s0004
code skills
----- ------
s0001 python programming language
s0002 oracle java
s0003 structured query language sql
s0004 microsoft excel
What my boss wants me to do is: Take values from the attribute code in jobs, split the string into an array, join this array on the codes (from skills table) and return the query in the format of job skills like:
job skills
--- ------
j1 python programming language,structured query language sql
At this point, I'd just like to know if (A) this is possible and (B) whether there is a preferred alternative to this approach. I've listed my Python solution, using dictionaries, below to illustrate the concept:
jobs = {'j1': 's0001,s0003',
        'j2': 's0002,20003',
        'j3': 's0003,s0004'}
skills = {'s0001': 'python programming language',
          's0002': 'oracle java',
          's0003': 'structured query language sql',
          's0004': 'microsoft excel'}
job_skills = {k: [] for k in jobs.keys()}
for j, s in jobs.items():
    for code, skill in skills.items():
        for i in s.split(','):
            if i == code:
                job_skills[j].append(skill)
for k, v in job_skills.items():
    job_skills[k] = ','.join(v)
And the output:
{'j1': 'python programming language,structured query language sql',
'j2': 'oracle java',
'j3': 'structured query language sql,microsoft excel'}
The real crux of this problem is that there aren't just 4 different skills in our data; our company's data includes ~5000 skills. My boss would very much like to avoid creating a table with 5000 columns, one for each skill; he believes the above approach will result in simpler queries, with potentially better memory management.
I'm still pretty new to SQL, and I really only use SQLite3 anyway, so the best I can probably do is a Python solution. I'll tell you how I would solve it, and hopefully someone can come along and improve it, because doing things purely in SQL is usually much faster than going through Python.
I'm going to assume this is SQLite, because you tagged Python. If it's not, there are probably ways to convert the database to the .db format if you prefer this solution.
I'm assuming that conn is your connection to the database, conn = sqlite3.connect(your_database_path), or a cursor for it. I don't use cursors, but it's almost certainly better practice to use them.
First, I would fetch the 'skills' table and convert it to a dict. I would do so with:
skills_array = conn.execute("""SELECT * FROM skills""")
skills_dict = dict()
# replace i with something else; I just did it so that I could use 'skill' as a variable
for i in skills_array:
    # skills_array is an iterator of tuples: position 0 is the code, position 1 the skill
    code = i[0]
    skill = i[1]
    skills_dict[code] = skill
There are probably better ways to do this; if it matters, I recommend researching them. But if it's a one-time thing this will work just fine. All this does is give you an easy way to look up a skill given its code. You could do this dozens of ways. You didn't mention the database being particularly large, so this should be fine.
Before the next part, something should be mentioned about SQLite: it has very limited support for modifying existing tables, something I coincidentally found out about today. The recommended method is to create a new table instead of trying to finagle with an old one. There are also easy ways to modify tables with SQLiteBrowser, which I highly recommend you use; at the very least it makes viewing data much easier, and it's available on all the important OSes.
Second, we need to combine the jobs table and the skills dict. There are better ways to go about it, but I chose the easy approach: splitting the jobs.code column on commas and going from there. I'll also create a new table and insert directly into it.
conn.execute("""CREATE TABLE combined (job TEXT PRIMARY KEY, skills TEXT)""")
conn.commit()
# fetchall() so we aren't inserting while still iterating a live cursor
job_array = conn.execute("""SELECT * FROM jobs""").fetchall()
for i in job_array:
    job = i[0]
    skill = i[1]
    for code in skill.split(","):
        # str.replace returns a new string, so reassign or the change is lost;
        # .get(code, code) leaves unknown codes (like '20003') untouched
        skill = skill.replace(code, skills_dict.get(code, code))
    conn.execute("""INSERT INTO combined VALUES (?, ?)""", (job, skill))
conn.commit()
And to combine it all...
import sqlite3
conn = sqlite3.connect(your_database_path)
skills_array = conn.execute("""SELECT * FROM skills""")
skills_dict = dict()
for i in skills_array:
    # skills_array yields (code, skill) tuples
    code = i[0]
    skill = i[1]
    skills_dict[code] = skill
conn.execute("""CREATE TABLE combined (job TEXT PRIMARY KEY, skills TEXT)""")
conn.commit()
job_array = conn.execute("""SELECT * FROM jobs""").fetchall()
for i in job_array:
    job = i[0]
    skill = i[1]
    for code in skill.split(","):
        # reassign: str.replace does not modify in place
        skill = skill.replace(code, skills_dict.get(code, code))
    conn.execute("""INSERT INTO combined VALUES (?, ?)""", (job, skill))
conn.commit()
To explain a little further if you/someone is confused on the job_array for loop:
Splitting skills allows you to see each individual code, meaning that all you have to do is replace every instance of the code being looked up with the corresponding skill.
And that's it. There may still be rough edges in the above code, so I would back up your database/tables before trying it, but it should work. One thing you might find helpful is context managers, which would make it far more Pythonic. If you plan to use this consistently (for some strange reason), refactoring for speed and readability may also be prudent.
I would also like to believe that there's an SQLite only approach, since this is exactly what databases are made for.
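For what it's worth, there is a SQLite-only way to do the lookup without splitting in Python, at least for data shaped like the sample: wrap both sides in commas and match with LIKE, then aggregate with GROUP_CONCAT. A sketch using the sample data from the question (this trick assumes codes never contain commas or LIKE wildcard characters):

```python
import sqlite3

# Build the sample tables from the question in memory.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (job TEXT, code TEXT);
    CREATE TABLE skills (code TEXT, skills TEXT);
    INSERT INTO jobs VALUES
        ('j1', 's0001,s0003'), ('j2', 's0002,20003'), ('j3', 's0003,s0004');
    INSERT INTO skills VALUES
        ('s0001', 'python programming language'),
        ('s0002', 'oracle java'),
        ('s0003', 'structured query language sql'),
        ('s0004', 'microsoft excel');
""")
# ',' || jobs.code || ',' turns 's0001,s0003' into ',s0001,s0003,' so each
# code can be matched as a whole token; GROUP_CONCAT stitches skills per job.
rows = conn.execute("""
    SELECT jobs.job, GROUP_CONCAT(skills.skills, ',') AS skills
    FROM jobs
    JOIN skills
      ON ',' || jobs.code || ',' LIKE '%,' || skills.code || ',%'
    GROUP BY jobs.job
    ORDER BY jobs.job
""").fetchall()
```

Note that GROUP_CONCAT's ordering is unspecified, and the mistyped code 20003 in j2 simply fails to match, which mirrors the Python output in the answer above.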
Hope this helps. If it did, let me know. :>
P.S. If you're confused by something/want more explanation feel free to comment.

SQLAlchemy: Perform double filter and sum in the same query

I have a general ledger table in my DB with the columns: member_id, is_credit and amount. I want to get the current balance of the member.
Ideally that could be done with two queries, where the first query filters on is_credit == True and the second on is_credit == False, something close to:
credit_amount = session.query(func.sum(Funds.amount).label('Debit_Amount')).filter(Funds.member_id==member_id, Funds.is_credit==True)
debit_amount = session.query(func.sum(Funds.amount).label('Debit_Amount')).filter(Funds.member_id==member_id, Funds.is_credit==False)
balance = credit_amount - debit_amount
and then subtracting the results. Is there a way to run the above as one query to get the balance?
From the comments you state that hybrids are too advanced right now, so I will propose an easier but less efficient solution (still, it's okay):
(session.query(Funds.is_credit, func.sum(Funds.amount).label('Debit_Amount'))
    .filter(Funds.member_id == member_id)
    .group_by(Funds.is_credit))
What will this do? You will receive a two-row result: one row holds the credit sum, the other the debit sum, distinguished by the is_credit column. The second column (Debit_Amount) is the summed value. You then evaluate the two rows to get the balance, with only one query fetching both values.
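Evaluating the two rows is then a couple of lines of Python. A sketch of the whole round trip, using the plain-SQL equivalent of the grouped query with sqlite3 (the funds table and its columns follow the question; the data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE funds (member_id INTEGER, is_credit INTEGER, amount REAL)")
conn.executemany("INSERT INTO funds VALUES (?, ?, ?)",
                 [(1, 1, 100.0), (1, 1, 50.0), (1, 0, 30.0), (2, 1, 10.0)])
# the grouped query described above, for member 1
rows = conn.execute(
    "SELECT is_credit, SUM(amount) FROM funds WHERE member_id = ? GROUP BY is_credit",
    (1,),
).fetchall()
# one row per is_credit value; .get(..., 0) covers members with no debits/credits
totals = {bool(is_credit): total for is_credit, total in rows}
balance = totals.get(True, 0) - totals.get(False, 0)
```

Here the credits sum to 150 and the debits to 30, so the balance comes out as 120.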
If you are unsure what group_by does, I recommend you read up on SQL before doing it in SQLAlchemy. SQLAlchemy offers very easy usage of SQL but it requires that you understand SQL as well. Thus, I recommend: First build a query in SQL and see that it does what you want - then translate it to SQLAlchemy and see that it does the same. Otherwise SQLAlchemy will often generate highly inefficient queries, because you asked for the wrong thing.

Convert SQL query with JOIN ON to SQLAlchemy

My query looks like so (the '3' and '4' of course will be different in real usage):
SELECT op_entries.*, op_entries_status.*
FROM op_entries
LEFT OUTER JOIN op_entries_status ON op_entries.id = op_entries_status.op_id AND op_entries_status.order_id = 3
WHERE op_entries.op_artikel_id = 4 AND op_entries.active
ORDER BY op_entries.id
This is to get all stages (operations) in the production of an article/order combination, as well as the current status (progress) for each stage, if a status entry exists. If no status exists, the stage must still be returned, with the status columns null.
I'm having immense problems getting this to play in SQLAlchemy. This would have been a two-part question, but I already found how to do it in plain SQL here. Now the ORM is a different story; I can't even figure out how to write JOIN ON conditions from the documentation!
Edit (new users are not allowed to answer their own question):
Believe I solved it, I guess writing it down as a question helped! Maybe this will help some other newbie out there.
query = db.session.query(OpEntries, OpEntriesStatus).\
    outerjoin(OpEntriesStatus, db.and_(
        OpEntries.id == OpEntriesStatus.op_id,
        OpEntriesStatus.order_id == arg_order_id)).\
    filter(db.and_(
        OpEntries.op_artikel_id == artidQuery,
        OpEntries.active)).\
    order_by(OpEntries.id).\
    all()
I'm keeping this open in case someone got a better solution or any insights.
Assuming some naming convention, the below should do it:
qry = (session.query(OpEntry, OpEntryStatus)
       .outerjoin(OpEntryStatus, and_(
           OpEntry.id == OpEntryStatus.op_id,
           OpEntryStatus.order_id == 3))
       .filter(OpEntry.op_artikel_id == 4)
       .filter(OpEntry.active == 1)
       .order_by(OpEntry.id)
       )
Read join and outerjoin for more information on joins; the second parameter is an onclause. If you need more than one condition, use and_ or or_ to build whatever expression you need. Note that your SQL uses a LEFT OUTER JOIN, so outerjoin is the matching call here.

Reduce Queries by optimizing _sets in Django

The following results in 4 DB hits. Since lines 3 & 4 are just filtering what I grabbed in line 2, what do I need to change so it doesn't hit the DB again?
page = get_object_or_404(Page, url__iexact = page_url)
installed_modules = page.module_set.all()
navigation_links = installed_modules.filter(module_type=ModuleTypeCode.MODAL)
module_map = dict([(m.module_static_object.key, m) for m in installed_modules])
Django querysets are lazy, so the following line doesn't hit the database:
installed_modules = page.module_set.all()
The query isn't executed until you iterate over the queryset in this line:
module_map = dict([(m.module_static_object.key, m) for m in installed_modules])
So the code you posted only looks like 3 database queries to me, not 4.
Since you are fetching all of the modules from the database already, you could filter the navigation links using a list comprehension instead of another query:
navigation_links = [m for m in installed_modules if m.module_type == ModuleTypeCode.MODAL]
You would have to do some benchmarking to see if this improved performance. It looks like it could be premature optimisation to me.
You might be doing one database query for each module where you fetch module_static_object.key. In this case, you could use select_related.
This is a case of premature optimization. 4 DB queries for a page load is not bad. The idea is to use as few queries as possible, but you're never going to get it down to 1 in every scenario. The code you have there doesn't seem off-the-wall in terms of needlessly creating queries, so it's highly probable that it's already as optimized as you'll be able to make it.