I have a SQL query that I want to convert to PySpark:
select * from Table_output where cct_id not in (select * from df_hr_excl)
Pseudo code:
Table_output = Table_output.select(col("cct_id")).exceptAll(df_hr_excl.select("cct_id"))
or
col("cct_id").isin(df_hr_excl.select("cct_id"))
Correlated subqueries in the WHERE clause with NOT IN or NOT EXISTS can be rewritten using a left anti join:
Table_output = Table_output.join(df_hr_excl, ["cct_id"], "left_anti")
As per your comment, if you have a condition in your subquery, you can put it in the join condition. E.g.:
import pyspark.sql.functions as F

Table_output = Table_output.alias("a").join(
    df_hr_excl.alias("b"),
    (F.col("a.x") > F.col("b.y")) & (F.col("a.id") == F.col("b.id")),
    "left_anti",
)
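A minimal runnable sketch of the anti join, using a local SparkSession and made-up toy data (both illustrative assumptions, not from the question). Note that unlike NOT IN, a left anti join simply treats NULLs in df_hr_excl as non-matches:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins for the two tables in the question
Table_output = spark.createDataFrame([(1,), (2,), (3,)], ["cct_id"])
df_hr_excl = spark.createDataFrame([(2,)], ["cct_id"])

# Keep only rows whose cct_id has no match in df_hr_excl (NOT IN semantics)
result = Table_output.join(df_hr_excl, ["cct_id"], "left_anti")
result.show()  # leaves cct_id 1 and 3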
I have a query which performs a UNION ALL operation on two SELECT statements in SQLAlchemy. It looks like this,
union_query = query1.union_all(query2)
What I want to do now is to perform a GROUP BY using several attributes and then get only the rows where COUNT(*) equals 1. How can I do this?
I know I can do a GROUP BY like this,
group_query = union_query.group_by(*columns)
But, how do I add the COUNT(*) condition?
So, the final outcome should be the equivalent of this query,
SELECT * FROM (
<query1>
UNION ALL
<query2>) AS result
GROUP BY <columns>
HAVING COUNT(*) = 1
Additionally, I would also like to know if I can get only the distinct values of a certain column from the result. That would be the equivalent of this,
SELECT DISTINCT <column> FROM (
<query1>
UNION ALL
<query2>) AS result
GROUP BY <columns>
HAVING COUNT(*) = 1
These are basically queries to get only the unique results of two SELECT statements.
Note: The easiest way to accomplish this is to use EXCEPT or EXCEPT ALL, but my database is running on MariaDB 8, and therefore these operations are not supported.
For the first query, try the following, where final_query is the query you want to run:
from sqlalchemy import func

union_query = query1.union_all(query2)
group_query = union_query.group_by(*columns)
final_query = group_query.having(func.count() == 1)
For the second query, try the following.
union_query = query1.union_all(query2)
group_query = union_query.group_by(*columns)
subquery = group_query.having(func.count() == 1).subquery()
final_query = session.query(subquery.c.<column>).distinct()
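Putting it together, here is a self-contained sketch of both queries; the Item model, its columns, and the session setup are illustrative assumptions, so adapt the names to your schema:
from sqlalchemy import Column, Integer, String, create_engine, func
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Item(Base):  # hypothetical model standing in for query1/query2's entity
    __tablename__ = "item"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    source = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

query1 = session.query(Item.name).filter(Item.source == "a")
query2 = session.query(Item.name).filter(Item.source == "b")

# UNION ALL, then keep only groups that occur exactly once
union_query = query1.union_all(query2)
group_query = union_query.group_by(Item.name)
final_query = group_query.having(func.count() == 1)

# Distinct values of one column from that result
subquery = group_query.having(func.count() == 1).subquery()
distinct_query = session.query(subquery.c.name).distinct()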
References
https://docs.sqlalchemy.org/en/14/orm/query.html#sqlalchemy.orm.Query.having
https://docs.sqlalchemy.org/en/14/changelog/migration_20.html#migration-20-query-distinct
https://docs.sqlalchemy.org/en/14/orm/tutorial.html#using-subqueries
I am trying to translate the following query to peewee:
select count(*) from A where
id not in (select distinct package_id FROM B)
What is the correct Python code? So far I have this:
A.select(A.id).where(A.id.not_in(B.select(B.package_id).distinct())).count()
This code is not returning the same result. A and B are large, 10-20M rows each, so I can't hold a set of existing package_id values in memory.
For example, this takes a lot of time:
A.select(A.id).where(A.id.not_in({x.package_id for x in B.select(B.package_id).distinct()})).count()
Maybe a LEFT JOIN?
Update: I ended up calling database.execute_sql().
Your SQL:
select count(*) from A where
id not in (select distinct package_id FROM B)
Equivalent peewee:
from peewee import fn

q = (A
     .select(fn.COUNT(A.id))
     .where(A.id.not_in(B.select(B.package_id.distinct()))))
count = q.scalar()
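As for the LEFT JOIN idea from the question: the same count can be written as a left outer join that keeps only rows with no match, a sketch assuming the A.id / B.package_id columns from the question:
from peewee import JOIN, fn

q = (A
     .select(fn.COUNT(A.id))
     .join(B, JOIN.LEFT_OUTER, on=(A.id == B.package_id))
     .where(B.package_id.is_null()))
count = q.scalar()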
How to select * in a PySpark join
impression_rdd.join(
    click_rdd,
    impression_rdd.session_id == click_rdd.session_id,
    "left_outer"
).select(impression_rdd.*)  # pseudo code; how do you do this?
Basically, the SQL equivalent:
SELECT impression.* FROM impression LEFT JOIN click on (impression.session_id = click.session_id)
You can simply add an alias and a couple of quotes to your pseudocode:
(impressions.alias("impressions")
 .join(clicks, ["session_id"], "left_outer")
 .select("impressions.*"))
Two other equivalent constructs to zero323's answer:
(impressions.join(clicks, 'session_id', 'left_outer')
.select(*impressions.columns))
And if you only have one column, say 'count', to drop from the right-hand table, this might be more readable:
(impressions.join(clicks, 'session_id', 'left_outer')
.drop('count'))
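For comparison, a minimal runnable sketch of all three constructs on toy data (the SparkSession setup and sample rows are illustrative assumptions):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data; column names follow the question
impressions = spark.createDataFrame([("s1", 10), ("s2", 20)], ["session_id", "imp"])
clicks = spark.createDataFrame([("s1", 1)], ["session_id", "count"])

# All three return only the impression columns
a = (impressions.alias("impressions")
     .join(clicks, ["session_id"], "left_outer")
     .select("impressions.*"))
b = (impressions.join(clicks, "session_id", "left_outer")
     .select(*impressions.columns))
c = (impressions.join(clicks, "session_id", "left_outer")
     .drop("count"))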
Suppose I have two DataFrames in PySpark and I'd like to run a nested SQL-like SELECT query, along the lines of
SELECT * FROM table1
WHERE b IN
(SELECT b FROM table2
WHERE c='1')
Now, I can achieve a select query by using where, as in
df.where(df.a.isin(my_list))
given that I have selected the my_list tuple of values beforehand. How would I perform a nested query in one go instead?
As of now, Spark doesn't support subqueries in the WHERE clause (SPARK-4226). The closest thing you can get without collecting is a join and distinct, roughly equivalent to this:
SELECT DISTINCT table1.*
FROM table1 JOIN table2
WHERE table1.b = table2.b AND table2.c = '1'
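In DataFrame code, that join + distinct looks roughly like the sketch below; the SparkSession setup and toy rows are illustrative assumptions:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins for table1 and table2
table1 = spark.createDataFrame([(1, "x"), (2, "y")], ["a", "b"])
table2 = spark.createDataFrame([("x", "1"), ("z", "1")], ["b", "c"])

result = (table1.join(table2, (table1.b == table2.b) & (table2.c == "1"))
          .select(table1["*"])
          .distinct())
result.show()  # distinct table1 rows whose b appears in table2 with c = '1'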
Part of my raw sql statement looks like this:
select /*some selects*/
if(/*condition*/, table1.price, if(/*condition*/, t2.price, t3.price)) as price
/*some joins*/
left join table2 t2 on table1.type=t2.id
left join table3 t3 on table1.type=t3.id
This statement works as expected.
SQLAlchemy ORM:
query = db_session.query(
    Table1,
    func.IF(Table1.field5 == 5, Table1.price,
            func.IF(Table1.new_model == 1, Table2.price, Table3.price)))
# + some selects
# + some joins
query = (query.join(Table2, Table1.type == Table2.id)
              .join(Table3, Table1.type == Table3.id))
And it doesn't work the same way: it returns only the rows that match Table2. Dropping these joins returns the needed rows, but then without the needed fields from Table2 and Table3, of course.
What is my mistake?
You need to use outerjoin for LEFT JOIN.
LEFT JOIN and JOIN are different operations.
For LEFT JOIN, use outerjoin. For JOIN (aka INNER JOIN), use join.
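Applied to the query from the question, the fix is just swapping join for outerjoin:
query = (query.outerjoin(Table2, Table1.type == Table2.id)
              .outerjoin(Table3, Table1.type == Table3.id))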