Suppose I have two DataFrames in PySpark and I want to run a nested SQL-like SELECT query, along the lines of
SELECT * FROM table1
WHERE b IN
(SELECT b FROM table2
WHERE c='1')
Now, I can achieve a select query by using where, as in
df.where(df.a.isin(my_list))
given that I have built the my_list tuple of values beforehand. How would I perform a nested query in one go instead?
As of now, Spark doesn't support subqueries in the WHERE clause (SPARK-4226). The closest thing you can get without collecting is a join and distinct, roughly equivalent to this:
SELECT DISTINCT table1.*
FROM table1 JOIN table2
ON table1.b = table2.b
WHERE table2.c = '1'
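In DataFrame terms, a minimal sketch of the same idea (df1 and df2 are assumed to be the DataFrames behind table1 and table2):

# A left semi join is the closest relational equivalent of IN:
# keep rows of df1 whose b appears in the filtered df2.
df1.join(df2.where(df2.c == '1'), on='b', how='left_semi')

# Or the join-plus-distinct variant matching the SQL above:
df1.join(df2, df1.b == df2.b).where(df2.c == '1').select(df1['*']).distinct()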
I have a query which performs a UNION ALL operation on two SELECT statements in SQLAlchemy. It looks like this,
union_query = query1.union_all(query2)
What I want to do now is to perform a GROUP BY on several attributes and then get only the rows where COUNT(*) is equal to 1. How can I do this?
I know I can do a GROUP BY like this,
group_query = union_query.group_by(*columns)
But, how do I add the COUNT(*) condition?
So, the final outcome should be the equivalent of this query,
SELECT * FROM (
<query1>
UNION ALL
<query2>) AS result
GROUP BY <columns>
HAVING COUNT(*) = 1
Additionally, I would like to know if I can get only the distinct values of a certain column from the result. That would be the equivalent of this,
SELECT DISTINCT <column> FROM (
<query1>
UNION ALL
<query2>) AS result
GROUP BY <columns>
HAVING COUNT(*) = 1
These are basically queries to get only the unique results of two SELECT statements.
Note: The easiest way to accomplish this would be EXCEPT or EXCEPT ALL, but my database is running on MariaDB 8 and therefore these operations are not supported.
For the first query, try the following, where final_query is the query you want to run.
# func comes from SQLAlchemy: from sqlalchemy import func
union_query = query1.union_all(query2)
group_query = union_query.group_by(*columns)
final_query = group_query.having(func.count() == 1)
For the second query, try the following.
union_query = query1.union_all(query2)
group_query = union_query.group_by(*columns)
subquery = group_query.having(func.count() == 1).subquery()
final_query = session.query(subquery.c.<column>).distinct()
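Put together, a minimal sketch (query1, query2, columns, and session are assumed to already exist; some_column is a stand-in name):

from sqlalchemy import func

# Union the two queries, group, and keep groups that occur exactly once.
union_query = query1.union_all(query2)
unique_rows = union_query.group_by(*columns).having(func.count() == 1)

# Wrap as a subquery and take distinct values of one column from it.
subq = unique_rows.subquery()
distinct_values = session.query(subq.c.some_column).distinct()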
References
https://docs.sqlalchemy.org/en/14/orm/query.html#sqlalchemy.orm.Query.having
https://docs.sqlalchemy.org/en/14/changelog/migration_20.html#migration-20-query-distinct
https://docs.sqlalchemy.org/en/14/orm/tutorial.html#using-subqueries
I am trying to translate the following query to peewee:
select count(*) from A where
id not in (select distinct package_id FROM B)
What is the correct Python code? So far I have this:
A.select(A.id).where(A.id.not_in(B.select(B.package_id).distinct())).count()
This code is not returning the same result. A and B are large, 10-20M rows each, so I can't build an in-memory set of existing package_id values.
For example, this takes a lot of time:
A.select(A.id).where(A.id.not_in({x.package_id for x in B.select(B.package_id).distinct()})).count()
Maybe a LEFT JOIN?
Update: I ended up calling database.execute_sql()
Your SQL:
select count(*) from A where
id not in (select distinct package_id FROM B)
Equivalent peewee:
# fn comes from peewee: from peewee import fn
q = (A
     .select(fn.COUNT(A.id))
     .where(A.id.not_in(B.select(B.package_id.distinct()))))
count = q.scalar()
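Regarding "Maybe a LEFT JOIN?": the same count can be written as a left-join anti-pattern, which some engines plan better on tables this size. A sketch against the same A and B models (assuming the fields shown in the question):

from peewee import fn, JOIN

q = (A
     .select(fn.COUNT(A.id))
     .join(B, JOIN.LEFT_OUTER, on=(A.id == B.package_id))
     .where(B.package_id.is_null()))
count = q.scalar()

Note that NOT IN and the anti-join can disagree when B.package_id contains NULLs, so check which semantics you want.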
I am trying to use an SQL query on the result of a previous SQL query, but I'm not able to.
I am writing a Python script and using PostgreSQL.
I have 3 tables from which I need to match different columns and join the data, using only 2 tables at a time.
For example:
I have table1 with a code column, and there is a matching column of codes in table2.
Now I am matching the values of both columns and selecting the 'area' column from table2 that corresponds to each code, along with the 'pincode' column from table1.
For this I used the following query which is working:
'''
select
table1.code,table2.code,table2.area,table1.pincode
from
table1 left join table2
ON
table1.code=table2.code
order by table1.row_num '''
I am getting the result, but some rows have the area value returned as None.
Wherever the area comes back as None from the code match, I need to use the pincode column in table1 and the pincode column in table3 to find the corresponding area from table3.area instead.
So I used the following Query:
'''
select
table1.code,table3.area,table1.pincode
from
table1 left join table3
ON
table1.pincode=table3.pincode
IN (
select
table1.code,table2.code,table2.area,table1.pincode
from
table1 left join table2
ON
table1.code=table2.code
where table2.area is NULL
order by table1.row_num) '''
and I got the following error:
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.SyntaxError) subquery has too many columns
My python code is as follows:
import psycopg2
from sqlalchemy import create_engine
engine=create_engine('postgresql+psycopg2://credentials')
conn=engine.connect()
query = '''
select
table1.code,table2.code,table2.area,table1.pincode
from
table1 left join table2
ON
table1.code=table2.code
order by table1.row_num '''
area = conn.execute(query)
area_x = area.fetchall()
for i in area_x:
    print(i)
query2 = '''
select
table1.code,table3.area,table1.pincode
from
table1 left join table3
ON
table1.pincode=table3.pincode
IN (
select
table1.code,table2.code,table2.area,table1.pincode
from
table1 left join table2
ON
table1.code=table2.code
where table2.area is NULL
order by table1.row_num) '''
area = conn.execute(query2)
area_x = area.fetchall()
for i in area_x:
    print(i)
This is how my first query is returning the data:
Wherever I am not able to match the code columns, I get a None value in the area column from table2, and whenever the area value is None I have to apply another query to find this data.
Now I have to match data in table1.pincode with data in table3.pincode to find table3.area and replace the None value with table3.area.
These are the two ways to find the area.
The desired result should be:
What could be the correct solution?
Thank you.
It looks like your query2 needs a WHERE clause, and the subquery, as the error message says, should be reduced to the single column you are comparing in the outer query. query2 should be something like this:
query2 = '''
select
table1.code,table3.area,table1.pincode
from
table1 left join table3
ON
table1.pincode=table3.pincode
WHERE table1.code
IN (
select
table1.code
from
table1 left join table2
ON
table1.code=table2.code
where table2.area is NULL
) '''
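Alternatively, both lookups can be collapsed into a single query with COALESCE, so table3.area is used only where the code match returns no area. A sketch under the same table layout (untested against your schema):

query3 = '''
select
table1.code,
coalesce(table2.area, table3.area) as area,
table1.pincode
from
table1
left join table2 on table1.code = table2.code
left join table3 on table1.pincode = table3.pincode
order by table1.row_num '''
area = conn.execute(query3)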
I have a sql query that I want to convert to pyspark:
select * from Table_output where cct_id not in (select * from df_hr_excl)
Pseudo Code:
Table_output=Table_output.select(col("cct_id")).exceptAll(df_hr_excl.select("cct_id")) or
col("cct_id").isin(df_hr_excl.select("cct_id"))
Subqueries in the WHERE clause with NOT IN or NOT EXISTS can be rewritten using a left anti join:
Table_output = Table_output.join(df_hr_excl, ["cct_id"], "left_anti")
As per your comment, if you have a condition in your subquery then you can put it in the join condition. E.g.:
Table_output = Table_output.alias("a").join(df_hr_excl.alias("b"), (F.col("a.x") > F.col("b.y")) & (F.col("a.id") == F.col("b.id")), "left_anti")
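A minimal end-to-end sketch (DataFrame names follow the question; the sample data is made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

Table_output = spark.createDataFrame([(1,), (2,), (3,)], ["cct_id"])
df_hr_excl = spark.createDataFrame([(2,)], ["cct_id"])

# Keep only rows of Table_output whose cct_id does not appear in df_hr_excl
result = Table_output.join(df_hr_excl, ["cct_id"], "left_anti")
print(result.count())  # 2

(In the conditional variant above, F is the usual import pyspark.sql.functions as F.)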
I have a large query I am trying to write. I perform a series of joins, and then from that resulting relation I want to perform another join and filter out certain tuples.
SELECT *
FROM
(
SELECT *
FROM
market_instrument
inner join exchange_instrument
on market_instrument.id = exchange_instrument.instrument_id
inner join Table1 on market_instrument.id = Table1.instrument_id
left join Table2 on market_instrument.id = Table2.instrument_id
left join `options` on market_instrument.id = `options`.instrument_id
left join Table3 on market_instrument.id = Table3.instrument_id
) as R
inner join Table4 on R.instrument_id = Table4.instrument_id
where Table4.fill_timestamp between CURDATE() - INTERVAL 30 DAY AND NOW();
R is the "series of joins" I'm referring to. I want to inner join R with Table4 and then filter the resulting relation to the last 30 days (the date attribute is Table4.fill_timestamp). I'm using SQLAlchemy, so I thought about saving R to some result relation variable and performing a separate query on that, but I don't know how SQLAlchemy handles that, so I wanted to try doing the entire query in SQL first.
I keep getting the Duplicate Column Name "instrument_id" error. instrument_id is the primary key for all tables except market_instrument, where it's the same key but called id instead. What can I do to get around this issue?
The problem is that R has all the columns from several tables, and more than one of those tables has a column named "instrument_id". You have not assigned aliases to any of those column names, so SQL does not know which instrument_id column you mean when you say "R.instrument_id".
If market_instrument is the only table with an id column then you could join on R.id instead of R.instrument_id.
Alternatively, another group of solutions involves assigning different names to some or all of the columns in R. For example,
SELECT
market_instrument.*,
exchange_instrument.*,
Table1.instrument_id AS the_one_true_id,
Table1.another_column,
Table1.yet_another_column,
...
Table2.*,
options.*,
Table3.*
FROM
market_instrument
inner join exchange_instrument
on market_instrument.id = exchange_instrument.instrument_id
inner join Table1 on market_instrument.id = Table1.instrument_id
left join Table2 on market_instrument.id = Table2.instrument_id
left join `options` on market_instrument.id = `options`.instrument_id
left join Table3 on market_instrument.id = Table3.instrument_id
With the above, you could then join on R.the_one_true_id. Alternatively, you could leave your current join as it is, and rename all the instrument_id columns but one. It might (or might not) be convenient to do that in the context of replacing R with a full-fledged VIEW in your schema.
Alternatively, your select list could enumerate all the columns of all the tables in the join. That might be tedious, but if you really do need all of them, then you will need to do that to disambiguate the other duplicate names, which include, at least, the various other instrument_id columns. Presented with such a task, however, perhaps you would discover that you don't really need every one of them.
As yet another alternative, you could add more columns instead of renaming existing ones. For example,
SELECT
*,
exchange_instrument.instrument_id AS ei_instrument_id,
Table1.instrument_id AS t1_instrument_id,
Table2.instrument_id AS t2_instrument_id,
options.instrument_id AS op_instrument_id,
Table3.instrument_id AS t3_instrument_id
FROM
...
Then you can access, say, R.t1_instrument_id, whose name is presumably unique.
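Since the question mentions SQLAlchemy: the same disambiguation works there by labeling the colliding columns before wrapping R as a subquery. A rough sketch with Core-style Table objects (the table variables and the trimmed column list are assumptions for illustration):

from sqlalchemy import select, func, text

# Only the colliding column is renamed with .label(), so R exposes unique names.
r = (
    select(
        market_instrument.c.id.label("mi_id"),
        table1.c.instrument_id.label("t1_instrument_id"),
    )
    .join(table1, market_instrument.c.id == table1.c.instrument_id)
    .subquery("R")
)

q = (
    select(r)
    .join(table4, r.c.t1_instrument_id == table4.c.instrument_id)
    .where(table4.c.fill_timestamp.between(
        text("CURDATE() - INTERVAL 30 DAY"), func.now()))
)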