Joining multiple data frames in one statement and selecting only required columns - python

I have the following Spark DataFrames:
df1 with columns (id, name, age)
df2 with columns (id, salary, city)
df3 with columns (name, dob)
I want to join all of these Spark data frames using Python. This is the SQL statement I need to replicate.
SQL:
select df1.*,df2.salary,df3.dob
from df1
left join df2 on df1.id=df2.id
left join df3 on df1.name=df3.name
I tried something like the following in PySpark, but it raises an error:
joined_df = df1.join(df2,df1.id=df2.id,'left')\
.join(df3,df1.name=df3.name)\
.select(df1.(*),df2(name),df3(dob)
My question: Can we join all the three DataFrames in one go and select the required columns?

If you have a SQL query that works, why not use pyspark-sql?
First use pyspark.sql.DataFrame.createOrReplaceTempView() to register your DataFrame as a temporary view:
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
df3.createOrReplaceTempView('df3')
Now you can access these DataFrames as tables with the names you provided in the argument to createOrReplaceTempView(). Use pyspark.sql.SparkSession.sql() to execute your query:
query = "select df1.*, df2.salary, df3.dob " \
"from df1 " \
"left join df2 on df1.id=df2.id "\
"left join df3 on df1.name=df3.name"
joined_df = spark.sql(query)

You can leverage col and alias to get the SQL-like syntax to work. Ensure your DataFrames are aliased:
df1 = df1.alias('df1')
df2 = df2.alias('df2')
df3 = df3.alias('df3')
Then the following should work:
from pyspark.sql.functions import col
joined_df = df1.join(df2, col('df1.id') == col('df2.id'), 'left') \
.join(df3, col('df1.name') == col('df3.name'), 'left') \
.select('df1.*', 'df2.salary', 'df3.dob')

Related

Join in Pandas Dataframe using conditional join statement

I am trying to join two dataframes, df1 and df2 (the example tables from the original post are not reproduced here, nor is the desired result).
I want to join them on a condition: if 'col2' of df2 is blank/NULL, the join should occur only on 'column1' of df1 and 'col1' of df2; if it is not blank/NULL, the join should occur on two conditions, i.e. 'column1' and 'column2' of df1 with 'col1' and 'col2' of df2 respectively.
My current approach is to slice these two dataframes into four and join them separately based on the condition. Is there a way to do this without slicing them, or a better approach that I'm missing?
The idea is to rename df2's key columns and left join on both of them first, then fill the remaining missing values by matching on column1 alone. It is necessary to remove duplicates with DataFrame.drop_duplicates before Series.map, so the values in col1 are unique:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
s = df2.drop_duplicates('col1').set_index('col1')['col3']
df['col3'] = df['col3'].fillna(df['column1'].map(s))
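For example, with small made-up frames (only the column names are taken from the question), the two-step fill works like this:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'column1': ['A', 'B', 'C'],
                    'column2': ['x', 'y', 'z']})
df2 = pd.DataFrame({'col1': ['A', 'B'],
                    'col2': ['x', np.nan],
                    'col3': [1, 2]})

# Step 1: exact left join on both key columns.
df22 = df2.rename(columns={'col1': 'column1', 'col2': 'column2'})
df = df1.merge(df22, on=['column1', 'column2'], how='left')

# Step 2: where col3 is still missing, fall back to matching on column1 alone.
s = df2.drop_duplicates('col1').set_index('col1')['col3']
df['col3'] = df['col3'].fillna(df['column1'].map(s))

print(df['col3'].tolist())  # [1.0, 2.0, nan]
```

Row A matches on both keys, row B falls back to the column1-only match, and row C has no match at all.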
EDIT: A general solution that works with multiple columns. The first part is the same left join; the second part merges on one column and uses DataFrame.combine_first to replace the missing values:
df22 = df2.rename(columns={'col1':'column1','col2':'column2'})
df = df1.merge(df22, on=['column1','column2'], how='left')
df23 = df22.drop_duplicates('column1').drop('column2', axis=1)
df = df.merge(df23, on='column1', how='left', suffixes=('','_'))
cols = df.columns[df.columns.str.endswith('_')]
df = df.combine_first(df[cols].rename(columns=lambda x: x.strip('_'))).drop(cols, axis=1)
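A sketch of the general version on made-up frames with two value columns (col3 and col4 are invented for illustration):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'column1': ['A', 'B', 'C'],
                    'column2': ['x', 'y', 'z']})
df2 = pd.DataFrame({'col1': ['A', 'B'],
                    'col2': ['x', np.nan],
                    'col3': [1, 2],
                    'col4': ['p', 'q']})

# Left join on both key columns first.
df22 = df2.rename(columns={'col1': 'column1', 'col2': 'column2'})
df = df1.merge(df22, on=['column1', 'column2'], how='left')

# Then merge on column1 alone and fill every still-missing value column.
df23 = df22.drop_duplicates('column1').drop('column2', axis=1)
df = df.merge(df23, on='column1', how='left', suffixes=('', '_'))
cols = df.columns[df.columns.str.endswith('_')]
df = df.combine_first(df[cols].rename(columns=lambda x: x.strip('_'))).drop(cols, axis=1)

print(df[['col3', 'col4']])
```

Both col3 and col4 are filled for row B in one combine_first call, which is the advantage over mapping each column separately.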

Pyspark dataframe joins with few duplicated column names and few without duplicate columns

I need to implement pyspark dataframe joins in my project.
I need to join in 3 different cases.
1) If both dataframes have the same join column names, I join as below. This eliminates the duplicate columns col1 and col2:
cond = ['col1', 'col2']
df1.join(df2, cond, "inner")
2) If the dataframes have differently named join columns, I join as below. This keeps all 4 join columns, as expected:
cond = [df1.col_x == df2.col_y,
df1.col_a == df2.col_b]
df1.join(df2, cond, "inner")
3) If the dataframes have some join columns with the same names and some with different names, I tried the following, but it fails:
cond = [df1.col_x == df2.col_y,
df1.col_a == df2.col_b,
'col1',
'col2',
'col3']
df1.join(df2, cond, "inner")
I then tried the following, which worked:
cond = [df1.col_x == df2.col_y,
df1.col_a == df2.col_b,
df1.col1 == df2.col1,
df1.col2 == df2.col2,
df1.col3 == df2.col3]
df1.join(df2, cond, "inner")
But col1, col2, col3 end up duplicated. I want to eliminate these duplicate columns during the join itself instead of dropping them afterwards.
Please suggest how case 3 can be achieved, or suggest alternative approaches.
There is no way to do it in a single condition list: behind the scenes, an equi-join whose condition is given as a (sequence of) column name string(s) (similar to SQL's USING join) is executed as if it were a regular equi-join (source), so
frame1.join(frame2,
"shared_column",
"inner")
gets translated to
frame1.join(frame2,
frame1.shared_column == frame2.shared_column,
"inner")
Afterwards, the duplicate gets dropped (projection).
If you have a condition that uses both keys that could be a natural join as well as a regular equi-join, then either drop the duplicate columns afterwards, or rename the ones that are not equally named before the join.

Key error when joining dfs in Pandas

I have a dataframe with these columns:
df1:
Index(['cnpj', '#CNAE', 'Estado', 'Capital_Social', '#CNAEpai', '#CNAEvo',
'#CNAEbisavo', 'Porte'],
dtype='object')
I have another dataframe with these columns:
df2:
Index(['#CNAEpai', 'ROA_t12_Peers_CNAEpai', 'MgBruta_t12_Peers_CNAEpai',
'MgEBITDA_t12_Peers_CNAEpai', 'LiqCorrente_t12_Peers_CNAEpai',
'Crescimento_t12_Peers_CNAEpai', 'MgLucro_t12_Peers_CNAEpai',
'Custo/Receita_t12_Peers_CNAEpai', 'Passivo/EBITDA_t12_Peers_CNAEpai',
'ROE_t12_Peers_CNAEpai', 'RFinanceiro/Receita_t12_Peers_CNAEpai',
'cnpj_t12_Peers_CNAEpai', 'LiqGeral_t12_Peers_CNAEpai'],
dtype='object')
I'm trying to join them, using this line:
df1=df1.join(df2,on=['#CNAEpai'],how='left',rsuffix='_bbb')
But I'm getting this error:
KeyError: '#CNAEpai'
Since #CNAEpai is a column in both dfs, that shouldn't be happening, right?
What's going on?
As #root indicated, pd.DataFrame.join joins index-on-index or index-on-column, but not column-on-column.
To join on column(s), use pd.DataFrame.merge (note that merge takes a suffixes pair, not join's rsuffix):
df1 = df1.merge(df2, on='#CNAEpai', how='left', suffixes=('', '_bbb'))
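A minimal reproduction with made-up values (only the '#CNAEpai' column name comes from the question):

```python
import pandas as pd

left = pd.DataFrame({'#CNAEpai': [1, 2], 'Porte': ['a', 'b']})
right = pd.DataFrame({'#CNAEpai': [1, 3],
                      'ROA_t12_Peers_CNAEpai': [0.1, 0.2]})

# merge() matches column-on-column; join() would instead try to match
# left's '#CNAEpai' column against right's index.
out = left.merge(right, on='#CNAEpai', how='left', suffixes=('', '_bbb'))
print(out)
```

Unmatched keys on the left simply come through with NaN in the right-hand columns.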

isnull(col1,col2) equivalent in pandas

How can I write an equivalent statement to the below SQL query?
SELECT ISNULL(df2.id, df1.id) as new_id
FROM dataframe1 df1
LEFT JOIN dataframe2 df2
ON df1.id = df2.id
One such equivalent would be:
df2.set_index('id').combine_first(df1.set_index('id')).reset_index()
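To see the effect on toy frames (the value column v is invented for illustration): for each id, combine_first keeps df2's value and falls back to df1's, mirroring ISNULL's prefer-the-first-non-null behaviour over the union of ids:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'v': [10, 20, 30]})
df2 = pd.DataFrame({'id': [2, 3], 'v': [200, 300]})

# Prefer df2's value per id, fall back to df1's where df2 has no row.
out = df2.set_index('id').combine_first(df1.set_index('id')).reset_index()
print(out)
```

Note that this aligns on the union of ids rather than strictly on df1's rows, so it is closest to the SQL when every df2.id also appears in df1.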

Outer join Spark dataframe with non-identical join column and then merge join column

Suppose I have the following dataframes in pySpark:
df1 = sqlContext.createDataFrame([Row(name='john', age=50), Row(name='james', age=25)])
df2 = sqlContext.createDataFrame([Row(name='john', weight=150), Row(name='mike', weight=115)])
df3 = sqlContext.createDataFrame([Row(name='john', age=50, weight=150), Row(name='james', age=25, weight=None), Row(name='mike', age=None, weight=115)])
Now suppose I want to create df3 from joining/merging df1 and df2.
I tried doing
df1.join(df2, df1.name == df2.name, 'outer')
This doesn't quite work because it produces two name columns. I then need to combine them somehow, so that a missing name in one column is filled in from the other.
How would I do that? Or is there a better way to create df3 from df1 and df2?
You can use the coalesce function, which returns the first non-null argument.
from pyspark.sql.functions import coalesce
df1 = df1.alias("df1")
df2 = df2.alias("df2")
(df1.join(df2, df1.name == df2.name, 'outer')
.withColumn("name_", coalesce("df1.name", "df2.name"))
.drop("name")
.withColumnRenamed("name_", "name"))
This is a little late, but there is a simpler solution if someone needs it. Instead of a join expression, pass the join column name (or a list of names):
df3 = df1.join(df2, ['name'], 'outer')
Joining this way prevents the duplication of the name column. https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html
