I have a fairly heavy query in SQLAlchemy and I'm trying to optimise it, but I'm struggling with the joins as I don't have much experience with them. My very small test showed the scalar subselects were 7x slower than the equivalent join, so switching could potentially be quite a speed increase.
Here are the relevant tables and their relationships:
ActionInfo (id, session_id = SessionInfo.id)
SessionInfo (id)
SessionLink (info_id = SessionInfo.id, data_id = SessionData.id)
SessionData (id, key, value)
I basically want to read SessionData.value where SessionData.key equals something, from a select of ActionInfo.
Here is the current way I've been doing things:
stmt = select(
    ActionInfo.id,
    select(SessionData.value).where(
        SessionData.key == 'username',
        SessionLink.data_id == SessionData.id,
        SessionLink.info_id == ActionInfo.session_id,
    ).label('username'),
    select(SessionData.value).where(
        SessionData.key == 'country',
        SessionLink.data_id == SessionData.id,
        SessionLink.info_id == ActionInfo.session_id,
    ).label('country'),
)
In doing the above mentioned speed test, I got a single join working, but I'm obviously limited to only 1 value via this method:
stmt = select(
    ActionInfo.id,
    SessionData.value.label('country')
).filter(
    SessionData.key == 'country'
).outerjoin(SessionInfo).outerjoin(SessionLink).outerjoin(SessionData)
How would I adapt it to end up something like this?
stmt = select(
    ActionInfo.id,
    select(SessionData.value).where(SessionData.key == 'username').label('username'),
    select(SessionData.value).where(SessionData.key == 'country').label('country'),
).outerjoin(SessionInfo).outerjoin(SessionLink).outerjoin(SessionData)
If it's at all helpful, this is the join code as generated by SQLAlchemy:
SELECT action_info.id
FROM action_info
LEFT OUTER JOIN session_info ON session_info.id = action_info.session_id
LEFT OUTER JOIN session_link ON session_info.id = session_link.info_id
LEFT OUTER JOIN session_data ON session_data.id = session_link.data_id
As a side note, I'm assuming I want a left outer join because I want to still include any records with missing SessionData records. Once I have this working though I'll test what difference an inner join makes to be sure.
The code below:
from sqlalchemy import and_, select
from sqlalchemy.orm import aliased

keys = ["username", "country", "gender"]
q = select(ActionInfo.id).join(SessionInfo)
for key in keys:
    # Fresh aliases per key, so each key gets its own pair of joins
    SD = aliased(SessionData)
    SL = aliased(SessionLink)
    q = (
        q.outerjoin(SL, SessionInfo.id == SL.info_id)
        .outerjoin(SD, and_(SL.data_id == SD.id, SD.key == key))
        .add_columns(SD.value.label(key))
    )
is generic, can be extended to any number of keys, and should generate SQL similar to the following:
SELECT action_info.id,
       session_data_1.value AS username,
       session_data_2.value AS country,
       session_data_3.value AS gender
FROM action_info
JOIN session_info ON session_info.id = action_info.session_id
LEFT OUTER JOIN session_link AS session_link_1 ON session_info.id = session_link_1.info_id
LEFT OUTER JOIN session_data AS session_data_1 ON session_link_1.data_id = session_data_1.id
    AND session_data_1.key = :key_1
LEFT OUTER JOIN session_link AS session_link_2 ON session_info.id = session_link_2.info_id
LEFT OUTER JOIN session_data AS session_data_2 ON session_link_2.data_id = session_data_2.id
    AND session_data_2.key = :key_2
LEFT OUTER JOIN session_link AS session_link_3 ON session_info.id = session_link_3.info_id
LEFT OUTER JOIN session_data AS session_data_3 ON session_link_3.data_id = session_data_3.id
    AND session_data_3.key = :key_3
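One thing worth checking with this pattern is the row count. Each session_link alias is joined without a key filter, so a session with several linked rows can make the same action come back more than once. Below is a minimal sketch using Python's built-in sqlite3 module with made-up rows. It runs a two-key version of the join chain next to a variant that resolves each key inside a subquery first. The subquery variant, the short aliases (a, si, u, c) and the sample data are my own assumptions, not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE session_info (id INTEGER PRIMARY KEY);
    CREATE TABLE action_info (id INTEGER PRIMARY KEY, session_id INTEGER);
    CREATE TABLE session_data (id INTEGER PRIMARY KEY, "key" TEXT, value TEXT);
    CREATE TABLE session_link (info_id INTEGER, data_id INTEGER);
    -- Made-up sample data: one action whose session has two linked values
    INSERT INTO session_info VALUES (1);
    INSERT INTO action_info VALUES (1, 1);
    INSERT INTO session_data VALUES (1, 'username', 'alice'), (2, 'country', 'NZ');
    INSERT INTO session_link VALUES (1, 1), (1, 2);
""")

# Join chain in the same shape as the generated SQL (two keys only).
chained = conn.execute("""
    SELECT a.id, sd1.value AS username, sd2.value AS country
    FROM action_info AS a
    JOIN session_info AS si ON si.id = a.session_id
    LEFT JOIN session_link AS sl1 ON si.id = sl1.info_id
    LEFT JOIN session_data AS sd1
        ON sl1.data_id = sd1.id AND sd1."key" = 'username'
    LEFT JOIN session_link AS sl2 ON si.id = sl2.info_id
    LEFT JOIN session_data AS sd2
        ON sl2.data_id = sd2.id AND sd2."key" = 'country'
""").fetchall()

# Hypothetical variant: filter by key inside a subquery, then left join
# that subquery once per key.
per_key = conn.execute("""
    SELECT a.id, u.value AS username, c.value AS country
    FROM action_info AS a
    JOIN session_info AS si ON si.id = a.session_id
    LEFT JOIN (SELECT sl.info_id, sd.value
               FROM session_link AS sl
               JOIN session_data AS sd ON sd.id = sl.data_id
               WHERE sd."key" = 'username') AS u ON u.info_id = si.id
    LEFT JOIN (SELECT sl.info_id, sd.value
               FROM session_link AS sl
               JOIN session_data AS sd ON sd.id = sl.data_id
               WHERE sd."key" = 'country') AS c ON c.info_id = si.id
""").fetchall()

print(len(chained), chained)
print(len(per_key), per_key)
```

With this sample data the chained version returns the action several times, with NULLs in some of the copies, while the per-key subquery version returns it once. Whether that matters depends on how many linked rows a session really has, so it is worth comparing row counts on real data before relying on either form.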
I'm getting the warning
<mypath>/lib/python3.10/site-packages/pandas/io/sql.py:1405: SAWarning: SELECT statement has a cartesian product between FROM element(s) "Election" and FROM element "Contest". Apply join condition(s) between each element to resolve.
return self.connectable.execution_options().execute(*args, **kwargs)
from this line in my program:
import pandas as pd
pd.read_sql_query(stmt, session.bind)
where stmt has this value:
SELECT DISTINCT `VoteCount`.`Contest_Id` AS `Contest_Id`, `VoteCount`.`Selection_Id` AS `Selection_Id`, `VoteCount`.`ReportingUnit_Id` AS `ReportingUnit_Id`, _datafile.`Election_Id` AS `Election_Id`, `VoteCount`.`_datafile_Id` AS `_datafile_Id`, `VoteCount`.`CountItemType` AS `CountItemType`, `VoteCount`.`Count` AS `Count`
FROM `VoteCount`
INNER JOIN _datafile ON _datafile.`Id` = `VoteCount`.`_datafile_Id`
INNER JOIN `Contest` ON `Contest`.`Id` = `VoteCount`.`Contest_Id`
INNER JOIN `ComposingReportingUnitJoin` ON `ComposingReportingUnitJoin`.`ChildReportingUnit_Id` = `VoteCount`.`ReportingUnit_Id`
INNER JOIN `CandidateSelection` ON `CandidateSelection`.`Id` = `VoteCount`.`Selection_Id`
INNER JOIN `Candidate` ON `Candidate`.`Id` = `CandidateSelection`.`Candidate_Id`
INNER JOIN `Party` ON `Party`.`Id` = `CandidateSelection`.`Party_Id`
INNER JOIN `CandidateContest` ON `CandidateContest`.`Id` = `Contest`.`Id`
INNER JOIN `Office` ON `Office`.`Id` = `CandidateContest`.`Office_Id`
INNER JOIN `ReportingUnit` AS `ReportingUnit_1` ON `ReportingUnit_1`.`Id` = `Office`.`ElectionDistrict_Id`
INNER JOIN `ReportingUnit` AS `ReportingUnit_2` ON `ReportingUnit_2`.`Id` = `VoteCount`.`ReportingUnit_Id`
INNER JOIN `Election` ON `Election`.`Id` = _datafile.`Election_Id`
WHERE _datafile.`Election_Id` = %(Election_Id_1)s AND `ComposingReportingUnitJoin`.`ParentReportingUnit_Id` = %(ParentReportingUnit_Id_1)s
It seems to me that there is already a join condition between each table. Schematically (where "|" indicates a join between the table to the right and the table above):
VoteCount
    |_datafile
        |Election
    |Contest
        |CandidateContest
            |Office
                |ReportingUnit_1
    |ComposingReportingUnitJoin
    |CandidateSelection
        |Candidate
        |Party
    |ReportingUnit_2
What am I missing? Why am I getting the warning, and what might I do to resolve it?
This is a problem that took me a long time to solve, and I wanted to share my solution. Here's the problem.
We have 2 pandas DataFrames that need to be outer joined on a very complex condition. Here was mine:
condition_statement = """
ON (
    A.var0 = B.var0
    OR (
        A.var1 = B.var1
        AND (
            A.var2 = B.var2
            OR A.var3 = B.var3
            OR A.var4 = B.var4
            OR A.var5 = B.var5
            OR A.var6 = B.var6
            OR A.var7 = B.var7
            OR (
                A.var8 = B.var8
                AND A.var9 = B.var9
            )
        )
    )
)
"""
Doing this in pandas would be a nightmare.
I like to do most of my DataFrame massaging with the pandasql package. It lets you run SQL queries on top of the DataFrames in your local environment.
The problem with pandasql is that it runs on a SQLite engine, and SQLite versions before 3.39 don't support RIGHT or FULL OUTER joins.
So how do you approach this problem?
Well, you can achieve a FULL OUTER join with two LEFT joins, a condition, and a UNION.
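To see the trick in isolation, here is a minimal sketch that uses Python's built-in sqlite3 module rather than pandasql. The two small tables A and B, the join column k and the sample rows are made up for illustration. It uses UNION ALL instead of UNION because the WHERE clause in the second half already excludes every row that the first half matched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (k INTEGER, a_val TEXT);
    CREATE TABLE B (k INTEGER, b_val TEXT);
    -- Made-up rows: k=1 only in A, k=2 in both, k=3 only in B
    INSERT INTO A VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO B VALUES (2, 'b2'), (3, 'b3');
""")

rows = conn.execute("""
    -- First half: every row of A, with B attached where it matches
    SELECT COALESCE(A.k, B.k) AS k, A.a_val, B.b_val
    FROM A
    LEFT JOIN B ON A.k = B.k
    UNION ALL
    -- Second half: only the rows of B that had no match in A
    SELECT COALESCE(A.k, B.k) AS k, A.a_val, B.b_val
    FROM B
    LEFT JOIN A ON A.k = B.k
    WHERE A.k IS NULL
    ORDER BY k
""").fetchall()

print(rows)
```

The result contains one row per key from either table, with None on whichever side had no match, which is what a FULL OUTER join would produce.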
First, declare a snippet with the columns you want to retrieve:
select_statement = """
SELECT
    A.var0
    , B.var1
    , COALESCE(A.var2, B.var2) as var2
"""
Next, build a condition that represents all values in A being NULL. I built mine using the columns in my DataFrame:
where_a_is_null_statement = f"""
WHERE
{" AND ".join(["A." + col + " is NULL" for col in A.columns])}
"""
Now, do the 2-LEFT-JOIN-with-a-UNION trick using all of these snippets:
sqldf(f"""
{select_statement}
FROM A
LEFT JOIN B
{condition_statement}
UNION
{select_statement}
FROM B
LEFT JOIN A
{condition_statement}
{where_a_is_null_statement}
""")
I'm trying to perform an inner join of distinct values of three tables from an SQLite DB. I tried multiple times and failed. Please guide me.
Below is pseudo-code for what I'm trying to achieve:
sql = '''
SELECT DISTINCT lesson_id, question_id FROM lesson_practice_questions as lpq
INNER JOIN
SELECT DISTINCT topic_id, lesson_id FROM chapter_lessons as cl
WHERE cl.topic_id==2
ON cl.lesson_id = lpq.lesson_id
INNER JOIN
SELECT DISTINCT question_id, subject_id, question_type_id, knowledge_type_ids complexity_level FROM questions as q
ON q.question_id = lpq.question_id;'''
cur.execute(sql)
Many thanks to @eshirvana for taking the time to help out!
Working solution, with the ambiguous column name error resolved, for future Stack Overflow reference:
sql = '''SELECT
lpq.lesson_id, cl.lesson_id,
topic_id,
q.question_id, lpq.question_id,
subject_id,
question_type_id,
knowledge_type_ids,
complexity
FROM lesson_practice_questions as lpq
INNER JOIN chapter_lessons as cl on cl.topic_id = 2 and cl.lesson_id = lpq.lesson_id
INNER JOIN questions as q ON q.question_id = lpq.question_id;'''
Here is the right SQL syntax. However, you need to provide sample data and the desired output if this is not the right output:
SELECT
lpq.lesson_id,
question_id,
topic_id,
lesson_id,
question_id,
subject_id,
question_type_id,
knowledge_type_ids,
complexity_level
FROM lesson_practice_questions as lpq
INNER JOIN chapter_lessons as cl on cl.topic_id = 2 and cl.lesson_id = lpq.lesson_id
INNER JOIN questions as q ON q.question_id = lpq.question_id;
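If the DISTINCT projections from the question are actually needed, each one can be written as a parenthesised subquery with its own alias and then joined like an ordinary table. The following is a minimal sketch using Python's sqlite3 module. The sample rows are made up, only a subset of the columns is included, and the topic id is passed as a bound parameter instead of being hard-coded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lesson_practice_questions (lesson_id INTEGER, question_id INTEGER);
    CREATE TABLE chapter_lessons (topic_id INTEGER, lesson_id INTEGER);
    CREATE TABLE questions (question_id INTEGER, subject_id INTEGER);
    -- Made-up rows, including duplicates that DISTINCT should collapse
    INSERT INTO lesson_practice_questions VALUES (10, 100), (10, 100), (11, 101);
    INSERT INTO chapter_lessons VALUES (2, 10), (2, 10), (3, 11);
    INSERT INTO questions VALUES (100, 7), (101, 8);
""")

sql = """
    SELECT lpq.lesson_id, lpq.question_id, cl.topic_id, q.subject_id
    FROM (SELECT DISTINCT lesson_id, question_id
          FROM lesson_practice_questions) AS lpq
    INNER JOIN (SELECT DISTINCT topic_id, lesson_id
                FROM chapter_lessons
                WHERE topic_id = ?) AS cl
        ON cl.lesson_id = lpq.lesson_id
    INNER JOIN questions AS q
        ON q.question_id = lpq.question_id
"""
cur = conn.cursor()
cur.execute(sql, (2,))
rows = cur.fetchall()
print(rows)
```

Every selected column is qualified with its alias (lpq, cl or q), which avoids the ambiguous column name error that appears when the same column name exists in more than one joined table.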
I have a relatively complex sql statement that I want to execute with sqlalchemy ORM. But when I try to do so I always get the error {NoSuchColumnError}"Could not locate column in row for column 'transaction_out.value'". My sql statement looks as follows:
sql = """
Select
addresses.address,
transaction_out1.value As sent,
transaction_out1.transaction_id As sent_id,
transactions.block As block_sent,
transactions.time As time_sent,
transactions.txid As txid_sent,
"sent" as type
From
transaction_out INNER Join
transaction_out_address On transaction_out_address.transaction_out_id = transaction_out.id INNER Join
addresses On transaction_out_address.address_id = addresses.id INNER Join
transaction_in On transaction_in.transaction_out_id = transaction_out.id INNER Join
transactions On transaction_in.transaction_id = transactions.id INNER Join
transaction_out transaction_out1 On transaction_out1.transaction_id = transactions.id INNER Join
transactions transactions1 On transaction_out.transaction_id = transactions1.id
WHERE addresses.address=:address_string
UNION
Select
addresses.address,
transaction_out.value As received,
transaction_out.transaction_id As received_id,
transactions.block As received_block,
transactions.time As received_time,
transactions.txid As received_txid,
"received"
From
transaction_out LEFT Join
transaction_out_address On transaction_out_address.transaction_out_id = transaction_out.id LEFT Join
addresses On transaction_out_address.address_id = addresses.id LEFT Join
transaction_in On transaction_in.transaction_out_id = transaction_out.id LEFT Join
transactions On transaction_out.transaction_id = transactions.id
WHERE addresses.address=:address_string
"""
And I tried to execute the statement in the following way:
query = session.query(
    Address.address, TransactionOut.value, TransactionOut.id,
    Block.height, Transaction.time, Transaction.txid,
).from_statement(
    stmt.bindparams(bindparam("address_string", value=address_string))
)
I can execute the raw sql statement with engine.execute() without any problems but I need to do it with session.query() so I can use sqlalchemy-datatables. My database looks more or less like the one here: https://dba.stackexchange.com/questions/137791/blockchain-bitcoin-as-a-database/137800#137800.
What is the problem with the way I try to execute it?
The column aliases in the raw SQL are hiding the columns from the SQLAlchemy query. Either remove them, or alter the query to accommodate them:
query = session.query(Address.address,
                      TransactionOut.value.label('sent'),
                      TransactionOut.id.label('sent_id'),
                      Transaction.block.label('block_sent'),
                      Transaction.time.label('time_sent'),
                      Transaction.txid.label('txid_sent')).\
    from_statement(stmt).\
    params(address_string=address_string)
I have to join two tables, where the first table contains two references to the second. For example, the first table has the columns START_ID and END_ID, and the second table contains the positions. How can I join them so that I have access to both the START and END values?
Here is what I have tried:
endp = aliased(POSITIONS)
startp = aliased(POSITIONS)
trans_data = self.atomic_db.session.query(ONE, endp, startp
).join(
endp,
ONE.start_id == startp.id).join(source_level,
ONE.end_id == endp.id).values(
ONE.id, endp.value ,
startp.value)
Your joins are not ordered correctly. You should write:
endp = aliased(POSITIONS)
startp = aliased(POSITIONS)
trans_data = self.atomic_db.session.query(ONE, endp, startp)\
    .join(endp, ONE.end_id == endp.id)\
    .join(startp, ONE.start_id == startp.id)\
    .values(ONE.id, endp.value, startp.value)
Also, did you want to update, or just get query results? If the latter, you do not need ".values".
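For reference, the SQL shape that the two aliased joins aim for can be tried out directly with Python's sqlite3 module. This is only a sketch: the table names one and positions stand in for the ONE and POSITIONS models, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE positions (id INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE one (id INTEGER PRIMARY KEY, start_id INTEGER, end_id INTEGER);
    -- Made-up rows: record 10 starts at position 1 and ends at position 2
    INSERT INTO positions VALUES (1, 'A'), (2, 'B');
    INSERT INTO one VALUES (10, 1, 2);
""")

# The same table is joined twice under two different aliases,
# one per foreign key column.
rows = conn.execute("""
    SELECT one.id, endp.value AS end_value, startp.value AS start_value
    FROM one
    JOIN positions AS endp ON one.end_id = endp.id
    JOIN positions AS startp ON one.start_id = startp.id
""").fetchall()

print(rows)
```

Each alias must appear in its own join together with the condition that refers to it, which is the ordering problem in the original attempt.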