How to select * in pyspark join
impression_rdd.join(
click_rdd,
impression_rdd.session_id == click_rdd.session_id,
"left_outer"
).select(impression_rdd.*) <------- pseudo code; how do you do this?
Basically, the sql equivalent
SELECT impression.* FROM impression LEFT JOIN click on (impression.session_id = click.session_id)
You can simply add an alias and a couple of quotes to your pseudocode:
(impressions.alias("impressions")
.join(clicks, ["id"], "left_outer")
.select("impressions.*"))
Two other constructs equivalent to zero323's answer:
(impressions.join(clicks, 'session_id', 'left_outer')
.select(*impressions.columns))
And if you only have one column to drop from the right-hand table, say 'count', this might be more readable:
(impressions.join(clicks, 'session_id', 'left_outer')
.drop('count'))
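For reference, a minimal self-contained sketch of all three variants (the impressions/clicks data below is made up):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data; only session_id is shared between the two frames.
impressions = spark.createDataFrame(
    [(1, "ad_a"), (2, "ad_b")], ["session_id", "ad"])
clicks = spark.createDataFrame(
    [(1, 10)], ["session_id", "count"])

# 1. Alias the left side and select "alias.*"
(impressions.alias("impressions")
    .join(clicks, ["session_id"], "left_outer")
    .select("impressions.*")
    .show())

# 2. Select the left side's column list explicitly
(impressions.join(clicks, "session_id", "left_outer")
    .select(*impressions.columns)
    .show())

# 3. Drop the single unwanted right-hand column
(impressions.join(clicks, "session_id", "left_outer")
    .drop("count")
    .show())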
Related
I have a sql query that I want to convert to pyspark:
select * from Table_output where cct_id not in (select * from df_hr_excl)
Pseudo Code:
Table_output=Table_output.select(col("cct_id")).exceptAll(df_hr_excl.select("cct_id")) or
col("cct_id").isin(df_hr_excl.select("cct_id"))
Subqueries in the WHERE clause with NOT IN or NOT EXISTS can be rewritten as a left anti join:
Table_output = Table_output.join(df_hr_excl, ["cct_id"], "left_anti")
As per your comment, if you have a condition in your subquery then you can put it in the join condition. E.g.:
from pyspark.sql import functions as F

Table_output = Table_output.alias("a").join(df_hr_excl.alias("b"), (F.col("a.x") > F.col("b.y")) & (F.col("a.id") == F.col("b.id")), "left_anti")
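A tiny self-contained sketch of the left anti join (the data here is made up; only rows whose cct_id has no match in the exclusion DataFrame survive):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: cct_id 2 appears in the exclusion frame and is removed.
Table_output = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["cct_id", "val"])
df_hr_excl = spark.createDataFrame([(2,)], ["cct_id"])

Table_output.join(df_hr_excl, ["cct_id"], "left_anti").show()
# Only cct_id 1 and 3 remain.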
I'm trying to create the following (very much simplified) SQL statement with Flask-SQLAlchemy ORM
SELECT TableA.attr
FROM (SELECT DISTINCT TableB.a_id FROM TableB) AS TableB
LEFT JOIN TableA
ON TableA.id = TableB.a_id
To achieve that I used the following SQLAlchemy statements
sq = db.session.query(distinct(TableB.a_id).label('a_id')).subquery()
results = db.session.query(TableA.attr).join(sq, sq.c.a_id==TableA.id, isouter=True).all()
This works; however, rather than joining my subquery TableB (left) with TableA (right), it does the reverse and joins TableA with TableB:
SELECT TableA.attr
FROM TableA
LEFT JOIN (SELECT DISTINCT TableB.a_id FROM TableB) AS TableB
ON TableB.a_id = TableA.id
Since I understand that SQLAlchemy doesn't have a right join, I'll have to somehow reverse the order while still getting TableA.attr as the result, and I can't figure out how to do that with a subquery.
The answer to the question was indeed the select_from() method. That is, the actual solution to my problem was the following statements:
sq = db.session.query(distinct(TableB.a_id).label('a_id')).subquery()
results = db.session.query(TableA.attr) \
.select_from(sq) \
.join(TableA, TableA.id==sq.c.a_id, isouter=True).all()
In general terms: The select_from() method gets the left part of the join. The join() gets the right part as its first argument.
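As a generic sketch of that rule, with hypothetical models Parent and Child (Child.parent_id referencing Parent.id):
# Hypothetical models; select_from() fixes Child as the left side of the join,
# join() then brings in Parent as the right side. Emits roughly:
#   SELECT parent.attr FROM child LEFT OUTER JOIN parent ON parent.id = child.parent_id
results = (db.session.query(Parent.attr)
           .select_from(Child)
           .join(Parent, Parent.id == Child.parent_id, isouter=True)
           .all())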
Suppose I have two DataFrames in PySpark and I want to run a nested SQL-like SELECT query, along the lines of
SELECT * FROM table1
WHERE b IN
(SELECT b FROM table2
WHERE c='1')
Now, I can achieve a select query by using where, as in
df.where(df.a.isin(my_list))
given that I have collected the my_list tuple of values beforehand. How would I perform the nested query in one go instead?
As of now, Spark doesn't support subqueries in the WHERE clause (SPARK-4226). The closest thing you can get without collecting is a join and distinct, roughly equivalent to this:
SELECT DISTINCT table1.*
FROM table1 JOIN table2
ON table1.b = table2.b
WHERE table2.c = '1'
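A sketch of the same thing on DataFrames, using aliases as in the first answer above (df1 and df2 stand in for table1 and table2):
from pyspark.sql.functions import col

# Inner join on b, keep rows where table2.c = '1', then take only table1's
# columns and deduplicate.
result = (df1.alias("t1")
          .join(df2.alias("t2"), col("t1.b") == col("t2.b"))
          .where(col("t2.c") == "1")
          .select("t1.*")
          .distinct())
Where available, a "leftsemi" join gives the same rows without the explicit distinct().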
Part of my raw SQL statement looks like this:
select /*some selects*/
if(/*condition*/, table1.price , if(/*condition*/, t2.price, t3.price)) as price
/*some joins*/
left join table2 t2 on table1.type=t2.id
left join table3 t3 on table1.type=t3.id
This statement works as expected.
SQLAlchemy ORM:
query = db_session.query(Table1,
                         func.IF(Table1.field5 == 5, Table1.price,
                                 func.IF(Table1.new_model == 1, Table2.price, Table3.price)))
# + some selects
# + some joins
query = query.join(Table2, Table1.type == Table2.id) \
             .join(Table3, Table1.type == Table3.id)
And it doesn't work the same way: it returns only the rows that are connected to Table2. Leaving these joins out of the query returns the needed rows, but without the needed fields from Table2 and Table3, of course.
What is my mistake?
You need to use outerjoin for LEFT JOIN.
LEFT JOIN and JOIN are different operations.
For LEFT JOIN use outerjoin. For JOIN (aka INNER JOIN) use join.
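Applied to the query in the question, that would look roughly like this (same columns and conditions, with join swapped for outerjoin):
query = db_session.query(Table1,
                         func.IF(Table1.field5 == 5, Table1.price,
                                 func.IF(Table1.new_model == 1, Table2.price, Table3.price)))
# + some selects
# + some joins
query = query.outerjoin(Table2, Table1.type == Table2.id) \
             .outerjoin(Table3, Table1.type == Table3.id)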
I have this large query I am trying to perform. I perform a series of joins, and then from that resulting relation I want to perform another join and filter out certain tuples.
SELECT *
FROM
(
SELECT *
FROM
market_instrument
inner join exchange_instrument
on market_instrument.id = exchange_instrument.instrument_id
inner join Table1 on market_instrument.id = Table1.instrument_id
left join Table2 on market_instrument.id = Table2.instrument_id
left join `options` on market_instrument.id = `options`.instrument_id
left join Table3 on market_instrument.id = Table3.instrument_id
) as R
inner join Table4 on R.instrument_id = Table4.instrument_id
where Table4.fill_timestamp between CURDATE() - INTERVAL 30 DAY AND NOW();
R is the "series of joins" I'm referring to. I want to inner join R with Table4 and then filter out the resulting relation for the last 30 days (where the date attribute is Table4.fill_timestamp). I'm using SQLAlchemy so I thought about somehow saving R to some result relation variable and performing a separate query on that, but I don't know how SQLAlchemy handles that, so I wanted to try doing the entire query in SQL first.
I keep getting the Duplicate Column Name "instrument_id" error. instrument_id is the primary key for all tables except market_instrument, where the same key is simply named id instead. What can I do to get around this issue?
The problem is that R has all the columns from several tables, and more than one of those tables has a column named "instrument_id". You have not assigned aliases to any of those column names, so SQL does not know which instrument_id column you mean when you say "R.instrument_id".
If market_instrument is the only table with an id column then you could join on R.id instead of R.instrument_id.
Alternatively, another group of solutions involves assigning different names to some or all of the columns in R. For example,
SELECT
market_instrument.*,
exchange_instrument.*,
Table1.instrument_id AS the_one_true_id,
Table1.another_column,
Table1.yet_another_column,
...
Table2.*,
options.*,
Table3.*
FROM
market_instrument
inner join exchange_instrument
on market_instrument.id = exchange_instrument.instrument_id
inner join Table1 on market_instrument.id = Table1.instrument_id
left join Table2 on market_instrument.id = Table2.instrument_id
left join `options` on market_instrument.id = `options`.instrument_id
left join Table3 on market_instrument.id = Table3.instrument_id
With the above, you could then join on R.the_one_true_id. Alternatively, you could leave your current join as it is, and rename all the instrument_id columns but one. It might (or might not) be convenient to do that in the context of replacing R with a full-fledged VIEW in your schema.
Alternatively, your select list could enumerate all the columns of all the tables in the join. That might be tedious, but if you really do need all of them, then you will need to do that to disambiguate the other duplicate names, which include, at least, the various other instrument_id columns. Presented with such a task, however, perhaps you would discover that you don't really need every one of them.
As yet another alternative, you could add more columns instead of renaming existing ones. For example,
SELECT
*,
exchange_instrument.instrument_id AS ei_instrument_id,
Table1.instrument_id AS t1_instrument_id,
Table2.instrument_id AS t2_instrument_id,
options.instrument_id AS op_instrument_id,
Table3.instrument_id AS t3_instrument_id
FROM
...
Then you can access, say, R.t1_instrument_id, whose name is presumably unique.
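Since the question mentions SQLAlchemy: the same disambiguation can be done there by labelling the columns of the inner query before wrapping it as a subquery. A rough sketch, assuming a Session named session and hypothetical mapped classes MarketInstrument, ExchangeInstrument and Table4:
from datetime import datetime, timedelta

# label() gives each duplicate instrument_id its own name inside the subquery,
# so the outer join has an unambiguous column to reference.
r = (session.query(
        MarketInstrument.id.label("instrument_id"),
        ExchangeInstrument.instrument_id.label("ei_instrument_id"))
     .join(ExchangeInstrument,
           ExchangeInstrument.instrument_id == MarketInstrument.id)
     .subquery("R"))

results = (session.query(Table4)
           .join(r, r.c.instrument_id == Table4.instrument_id)
           .filter(Table4.fill_timestamp >= datetime.now() - timedelta(days=30))
           .all())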