I have to join two tables, where the first table contains two references to the second. E.g. the first table has the columns START_ID and END_ID, and the second table contains the positions. How can I join them so that I have access to both the START and END values?
Here is what I have tried:
endp = aliased(POSITIONS)
startp = aliased(POSITIONS)
trans_data = self.atomic_db.session.query(ONE, endp, startp).join(
    endp, ONE.start_id == startp.id
).join(
    source_level, ONE.end_id == endp.id
).values(
    ONE.id, endp.value, startp.value
)
Your joins are not ordered correctly; you should write:
endp = aliased(POSITIONS)
startp = aliased(POSITIONS)
trans_data = self.atomic_db.session.query(ONE, endp, startp)\
.join(endp, ONE.end_id == endp.id)\
.join(startp, ONE.start_id == startp.id)\
.values(ONE.id, endp.value, startp.value)
Also, did you want to update rows or just get query results? If the latter, you do not need .values().
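For the read-only case, a minimal sketch (reusing the names from the question, and assuming you only need to fetch the rows):
endp = aliased(POSITIONS)
startp = aliased(POSITIONS)

# Query the columns directly and fetch the rows; no .values() needed.
rows = (
    self.atomic_db.session.query(ONE.id, endp.value, startp.value)
    .join(endp, ONE.end_id == endp.id)
    .join(startp, ONE.start_id == startp.id)
    .all()
)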
This is a problem that took me a long time to solve, and I wanted to share my solution. Here's the problem.
We have 2 pandas DataFrames that need to be outer joined on a very complex condition. Here was mine:
condition_statement = """
ON (
A.var0 = B.var0
OR (
A.var1 = B.var1
AND (
A.var2 = B.var2
OR A.var3 = B.var3
OR A.var4 = B.var4
OR A.var5 = B.var5
OR A.var6 = B.var6
OR A.var7 = B.var7
OR (
A.var8 = B.var8
AND A.var9 = B.var9
)
)
)
)
"""
Doing this in pandas would be a nightmare.
I like to do most of my DataFrame massaging with the pandasql package. It lets you run SQL queries on top of the DataFrames in your local environment.
The problem with pandasql is it runs on a SQLite engine, so you can't do RIGHT or FULL OUTER joins.
So how do you approach this problem?
Well you can achieve a FULL OUTER join with two LEFT joins, a condition, and a UNION.
First, declare a snippet with the columns you want to retrieve:
select_statement = """
SELECT
A.var0
, B.var1
, COALESCE(A.var2, B.var2) as var2
"""
Next, build a condition that represents all values in A being NULL. I built mine using the columns in my DataFrame:
where_a_is_null_statement = f"""
WHERE
{" AND ".join(["A." + col + " is NULL" for col in A.columns])}
"""
Now, do the 2-LEFT-JOIN-with-a-UNION trick using all of these snippets:
sqldf(f"""
{select_statement}
FROM A
LEFT JOIN B
{condition_statement}
UNION
{select_statement}
FROM B
LEFT JOIN A
{condition_statement}
{where_a_is_null_statement}
""")
How can I create a function that uses a list of strings to iterate over the following? The intention is that a, b, and c represent tables the users upload. The goal is to iterate programmatically no matter how many tables the users upload. I'm just looking to pull the counts of new rows, broken out by table.
mylist = df.select('S_ID').distinct().rdd.flatMap(lambda x: x).collect()
mylist
>> ['a', 'b', 'c']
##Count new rows by S_ID type
a = df.filter(df.S_ID == 'a').count()
b = df.filter(df.S_ID == 'b').count()
c = df.filter(df.S_ID == 'c').count()
##Count current rows from Snowflake
a_current = spark.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("query", "select R_ID FROM mytable WHERE S_ID = 'a'").load()
a_current = a_current.count()
b_current = spark.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("query", "select R_ID FROM mytable WHERE S_ID = 'b'").load()
b_current = b_current.count()
c_current = spark.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("query", "select R_ID FROM mytable WHERE S_ID = 'c'").load()
c_current = c_current.count()
##Calculate count of new rows
a_new = a - a_current
a_new = str(a_new)
b_new = b - b_current
b_new = str(b_new)
c_new = c - c_current
c_new = str(c_new)
Something like this:
new_counts_list = []
for i in mylist:
    i = df.filter(df.S_ID == 'i').count()
    i_current = spark.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("query", "select R_ID FROM mytable WHERE S_ID = 'i'").load()
    i_current = i_current.count()
    i_new = i - i_current
    i_new = str(i_new)
    new_counts_list.append(i)
I'm stuck on keeping the {names : new_counts}
As it pertains to:
I'm stuck on keeping the {names : new_counts}
at the end of your for loop you may use
new_counts_list[i]=i_new
instead of
new_counts_list.append(i)
assuming that you change how new_counts_list is initialized, i.e. initialize it as a dict (new_counts_list = {}) instead of a list.
You also seem to be hardcoding the literal string 'i' instead of using the variable i (i.e. without the quotes) in your proposed solution. Your updated solution may look like:
new_counts_list = {}
for i in mylist:
    # Count of rows for this S_ID in the uploaded data (kept in a separate
    # variable so the loop variable i, the S_ID itself, is not overwritten)
    i_count = df.filter(df.S_ID == i).count()
    i_current = spark.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("query", "select R_ID FROM mytable WHERE S_ID = '{0}'".format(i)).load()
    i_current = i_current.count()
    i_new = i_count - i_current
    i_new = str(i_new)
    new_counts_list[i] = i_new
On another note, while your approach, i.e. sequentially looping through each S_ID in mylist and running the operations, i.e.
Running the action collect to pull all the S_ID values from your initial dataframe df to your driver node into a list mylist
Separately counting the number of occurrences of each S_ID in your initial dataframe, then executing another potentially expensive (IO reads/network communication/shuffles) collect()
Creating another dataframe with spark.read.format(SNOWFLAKE_SOURCE_NAME) that will load all records filtered by each S_ID into memory before executing a count
Finding the difference between the initial dataframe and the Snowflake source
will work, it is expensive in IO reads and, depending on your cluster/setup, potentially expensive in network communication and shuffles.
You may consider using a groupBy to reduce the number of times you execute the potentially expensive collect. Furthermore, you may also join the initial dataframe to your Snowflake source and let Spark optimize your operations as a lazy execution plan distributed across your cluster. Moreover, similar to how you are using the pushdown filter for the Snowflake source, you may combine all selected S_ID values in that query so Snowflake can return all the desired results in one read. You would not need a loop. This could potentially look like:
Approach 1
In this approach, I will provide a pure Spark solution to achieve your desired results.
from pyspark.sql import functions as F

# Ask spark to select only the `S_ID` and group the data but not execute the transformation
my_existing_counts_df = df.select('S_ID').groupBy('S_ID').count()

# Ask spark to select only the `S_ID` counts from the snowflake source
current_counts_df = (
    spark.read
         .format(SNOWFLAKE_SOURCE_NAME)
         .options(**sfOptions)
         .option("query", "select R_ID, COUNT(1) as cnt FROM mytable GROUP BY R_ID")
         .load()
)
# Join both datasets which will filter to only selected `S_ID`
# and determine the differences between the existing and current counts
results_df = (
    my_existing_counts_df.alias("existing")
    .join(
        current_counts_df.alias("current"),
        F.col("S_ID") == F.col("R_ID"),
        "inner"
    )
    .selectExpr(
        "S_ID",
        "`count` - cnt as actual_count"
    )
)
# Execute the above transformations with `collect` and
# convert the resulting rows into your desired final dictionary
new_counts = {}
for row in results_df.collect():
    new_counts[row['S_ID']] = row['actual_count']
# your desired results are in `new_counts`
Approach 2
In this approach, I will collect the results of the group by and then use them to build a pushdown query against the Snowflake schema that returns the desired results.
my_list_counts = df.select('S_ID').groupBy('S_ID').count().collect()
selected_sids = []
case_expression = ""
for row in my_list_counts:
    selected_sids.append(row['S_ID'])
    case_expression = case_expression + " WHEN R_ID='{0}' THEN {1} ".format(
        row['S_ID'],
        row['count']
    )
# The collected rows above have fields `S_ID` and `count`, where the
# latter is the number of occurrences of `S_ID` in the dataset `df`
snowflake_push_down_query = """
SELECT
    R_ID AS S_ID,
    ((CASE
        {0}
    END) - cnt) AS actual_count
FROM (
    SELECT
        R_ID,
        COUNT(1) AS cnt
    FROM
        mytable
    WHERE
        R_ID IN ('{1}')
    GROUP BY
        R_ID
) t
""".format(
    case_expression,
    "','".join(selected_sids)
)
results_df = (
    spark.read
         .format(SNOWFLAKE_SOURCE_NAME)
         .options(**sfOptions)
         .option("query", snowflake_push_down_query)
         .load()
)
# Execute the above transformations with `collect` and
# convert the resulting rows into your desired final dictionary
new_counts = {}
for row in results_df.collect():
    new_counts[row['S_ID']] = row['actual_count']
# your desired results are in `new_counts`
Let me know if this works for you.
I have a fairly heavy query in SQLAlchemy and I'm trying to optimise it a bit, but I'm struggling with the joins as it's not something I have much experience with. A very small test showed the selects were 7x slower than the joins, so it could be quite a speed increase.
Here are the relevant tables and their relationships:
ActionInfo (id, session_id = SessionInfo.id)
SessionInfo (id)
SessionLink (info_id = SessionInfo.id, data_id = SessionData.id)
SessionData (id, key, value)
I basically want to read SessionData.value where SessionData.key equals something, from a select of ActionInfo.
Here is the current way I've been doing things:
stmt = select(
ActionInfo.id,
select(SessionData.value).where(
SessionData.key == 'username',
SessionLink.data_id == SessionData.id,
SessionLink.info_id == ActionInfo.session_id,
).label('username'),
select(SessionData.value).where(
SessionData.key == 'country',
SessionLink.data_id == SessionData.id,
SessionLink.info_id == ActionInfo.session_id,
).label('country'),
)
In doing the above-mentioned speed test, I got a single join working, but I'm obviously limited to only one value via this method:
stmt = select(
ActionInfo.id,
SessionData.value.label('country')
).filter(
SessionData.key == 'country'
).outerjoin(SessionInfo).outerjoin(SessionLink).outerjoin(SessionData)
How would I adapt it to end up something like this?
stmt = select(
ActionInfo.id,
select(SessionData.value).where(SessionData.key=='username').label('username'),
select(SessionData.value).where(SessionData.key=='country').label('country'),
).outerjoin(SessionInfo).outerjoin(SessionLink).outerjoin(SessionData)
If it's at all helpful, this is the join code as generated by SQLAlchemy:
SELECT action_info.id
FROM action_info LEFT OUTER JOIN session_info ON session_info.id = action_info.session_id LEFT OUTER JOIN session_link ON session_info.id = session_link.info_id LEFT OUTER JOIN session_data ON session_data.id = session_link.data_id
As a side note, I'm assuming I want a left outer join because I want to still include any records with missing SessionData records. Once I have this working though I'll test what difference an inner join makes to be sure.
The code below:
keys = ["username", "country", "gender"]
q = select(ActionInfo.id).join(SessionInfo)
for key in keys:
    SD = aliased(SessionData)
    SL = aliased(SessionLink)
    q = (
        q.outerjoin(SL, SessionInfo.id == SL.info_id)
        .outerjoin(SD, and_(SL.data_id == SD.id, SD.key == key))
        .add_columns(SD.value.label(key))
    )
is generic, can be extended to a different number of fields, and should generate SQL similar to the below:
SELECT action_info.id,
session_data_1.value AS username,
session_data_2.value AS country,
session_data_3.value AS gender
FROM action_info
JOIN session_info ON session_info.id = action_info.session_id
LEFT OUTER JOIN session_link AS session_link_1 ON session_info.id = session_link_1.info_id
LEFT OUTER JOIN session_data AS session_data_1 ON session_link_1.data_id = session_data_1.id
AND session_data_1.key = :key_1
LEFT OUTER JOIN session_link AS session_link_2 ON session_info.id = session_link_2.info_id
LEFT OUTER JOIN session_data AS session_data_2 ON session_link_2.data_id = session_data_2.id
AND session_data_2.key = :key_2
LEFT OUTER JOIN session_link AS session_link_3 ON session_info.id = session_link_3.info_id
LEFT OUTER JOIN session_data AS session_data_3 ON session_link_3.data_id = session_data_3.id
AND session_data_3.key = :key_3
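To actually run the accumulated statement q and read the labelled columns, a small sketch assuming a SQLAlchemy 1.4+ Session named session:
# Each Row exposes the selected/labelled columns as attributes.
for row in session.execute(q):
    print(row.id, row.username, row.country, row.gender)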
I have a very specific case in which I want a group entity that has a list with the elements that fit some conditions.
These are the ORM classes that I have defined:
class Group(Base):
    __tablename__ = 'groups'
    id = Column(Integer, Identity(1, 1), primary_key=True)
    name = Column(String(50), nullable=False)
    elements = relationship('Element', foreign_keys='[Element.group_id]')


class Element(Base):
    __tablename__ = 'elements'
    id = Column(Integer, Identity(1, 1), primary_key=True)
    date = Column(Date, nullable=False)
    value = Column(Numeric(38, 10), nullable=False)
    group_id = Column(Integer, ForeignKey('groups.id'), nullable=False)
Now, I want to retrieve a group with all the elements of a specific date.
result = session.query(Group).filter(Group.name == 'group 1' and Element.date == '2021-05-27').all()
Sadly enough, the Group.name filter is working, but the retrieved group contains all elements, ignoring the Element.date condition.
As suggested by #van, I have tried:
query(Group).join(Element).filter(Group.name == 'group 1' and Element.date == '2021-05-27')
But I get every element again. In the logs I noticed:
SELECT groups.id AS group_id, groups.name AS groups_name, element_1.id AS element_1_id, element_1.date AS element_1_date, element_1.value AS element_1_value, element_1.group_id AS element_1_group_id
FROM groups JOIN elements ON groups.id = elements.group_id LEFT OUTER JOIN elements AS elements_1 ON groups.id = elements_1.group_id
WHERE groups.name = %(name_1)s
There, I noticed two things. First, the join is being done twice (I guess one was already there from just getting the groups, before the explicit join). Second, and most important: the date filter doesn't appear in the query.
The driver I'm using is mssql+pymssql.
OK, there seems to be a combination of a few things happening here.
First, your relationship Group.elements will basically always contain all Elements of the Group. This is completely separate from the filter, and it is how SA is supposed to work.
You can understand your current query (session.query(Group).filter(Group.name == 'group 1' and Element.date == '2021-05-27').all()) as the following:
"Return all Group instances which contain an Element for a given date."
But when you iterate over Group.elements, SA will make sure to return all children. This is what you are trying to solve.
Second, as pointed out by Yeong, you cannot use a plain Python and to create an AND SQL clause. Fix it either by using and_ or by just having separate filter clauses:
result = (
session.query(Group)
.filter(Group.name == "group 1")
.filter(Element.date == dat1)
.all()
)
Third, as you later pointed out, your relationship is lazy="joined", and this is why whenever you query for Group, the related Element instances will ALL be retrieved using an OUTER JOIN. This is why adding .join(Element) to your query resulted in two JOINs.
Solution
You can "trick" SA into thinking that it has loaded the whole Group.elements relationship when it has only loaded the children you want by using the orm.contains_eager() option, in which case your query would look like this:
result = (
session.query(Group)
.join(Element)
.filter(Group.name == "group 1")
.filter(Element.date == dat1)
.options(contains_eager(Group.elements))
.all()
)
The above should also work with lazy="joined", as the extra JOIN should not be generated anymore.
Update
If you would like to get the groups even if there are no Elements matching the needed criteria, you need to:
replace join with outerjoin
place the filter on Element inside the outerjoin clause
result = (
session.query(Group)
.filter(Group.name == "group 1")
.outerjoin(
Element, and_(Element.group_id == Group.id, Element.date == dat1)
)
.options(contains_eager(Group.elements))
.all()
)
The and in Python is not the same as the AND condition in SQL. SQLAlchemy handles the conjunction with the and_() function instead, i.e.
result = session.query(Group).join(Element).filter(and_(Group.name == 'group 1', Element.date == '2021-05-27')).all()
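For completeness, and_ is imported from the top-level SQLAlchemy package:
from sqlalchemy import and_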
I have 2 tables in MySQL.
One has transactions with important columns, where each row has a Debit account ID and a Credit account ID. The second table contains the Account name and a special number associated with each Account ID. I want a SQL query that will take data from the transactions table and attach the account name and account number from the second table.
I tried doing everything with two queries: one would get the transactions and the second would get the account details, and then I iterated over the dataframe and assigned everything one by one, which doesn't seem like a good idea:
query = "SELECT tr_id, tr_date, description, dr_acc, cr_acc, amount, currency, currency_rate, document, comment FROM transactions WHERE " \
"company_id = {} {} and deleted = 0 {} LIMIT {}, {}".format(
company_id, filter, sort, sn, en)
df = ncon.getDF(query)
df.insert(4, 'dr_name', '')
df.insert(6, 'cr_name', '')
data = tuple(list(set(df['dr_acc'].values.tolist() + df['cr_acc'].values.tolist())))
query = "SELECT account_number, acc_id, account_name FROM tb_accounts WHERE company_id = {} and deleted = 0 and acc_id in {}".format(
company_id, data)
df_accs = ncon.getDF(query)
for index, row in df_accs.iterrows():
    acc = str(row['acc_id'])
    ac = row['account_number']
    nm = row['account_name']
    indx = df.index[df['dr_acc'] == acc].tolist()
    df.at[indx, 'dr_acc'] = ac
    df.at[indx, 'dr_name'] = nm
    indx = df.index[df['cr_acc'] == acc].tolist()
    df.at[indx, 'cr_acc'] = ac
    df.at[indx, 'cr_name'] = nm
What you're looking for, I think, is a SQL JOIN statement.
Taking a crack at writing a query that might work based on your code:
query = '''
    SELECT transactions.tr_id,
           transactions.tr_date,
           transactions.description,
           transactions.dr_acc,
           transactions.cr_acc,
           transactions.amount,
           transactions.currency,
           transactions.currency_rate,
           transactions.document,
           transactions.comment,
           tb_accounts.account_number,
           tb_accounts.account_name
    FROM transactions INNER JOIN tb_accounts ON tb_accounts.acc_id = transactions.dr_acc
    WHERE
        transactions.company_id = {} AND
        tb_accounts.company_id = {} AND
        transactions.deleted = 0 AND
        tb_accounts.deleted = 0
    ORDER BY transactions.tr_id
    LIMIT 10;'''
The above query will, roughly, present query results with all the fields listed from the two tables for each pair of rows where the account IDs match.
NOTE, the query above will probably not have very good performance. SQL JOIN statements must be written with care, but I wrote it above in a way that's easy to understand, so as to illustrate the power of the JOIN.
You should as a matter of habit NEVER try to program something when you could use a join instead. As long as you take care to write a join properly so that it can be efficient, the MySQL engine will beat your python code for performance almost every time.
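For example, the joined query can be pulled into one DataFrame with the same helper already used in the question (ncon.getDF is assumed to accept a plain SQL string), so there is no Python-side stitching at all:
# One round trip: MySQL performs the join and returns the combined rows.
joined_df = ncon.getDF(query.format(company_id, company_id))
print(joined_df.head())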
Sort the two DataFrames and use merge to combine them:
df1 = df1.sort_values(['dr_acc'], ascending=True)
df2 = df2.sort_values(['acc_id'], ascending=True)
merge2df = pd.merge(df1, df2, how='outer',
left_on=['dr_acc'], right_on=['acc_id'])
I assumed df1 is the first query's data set and df2 is the second query's data set.
SQL query:
'''SELECT tr_id, tr_date,
       description,
       dr_acc, cr_acc,
       amount, currency,
       currency_rate,
       document,
       comment,
       account_number, acc_id, account_name
FROM transactions
LEFT JOIN tb_accounts ON transactions.dr_acc = tb_accounts.acc_id'''