Pyspark replace for loop over dates - python

I have an array of dates:
date_set = ["2019-01-01", "2019-02-01", "2019-03-01"....."2020-01-01"]
and I have this dataframe:
+-------+----+-----+
|   DATE|  ID|VALUE|
+-------+----+-----+
|2019-04|1234|100.0|
|2019-05|4567|200.0|
+-------+----+-----+
For each element of my list, I have to apply the following transformations:
from dateutil.relativedelta import relativedelta
import pyspark.sql.functions as F

for date in date_set:
    target = date - relativedelta(months=+6)
    dfTemp = df.where(
        (F.col("DATE") <= date) &
        (F.col("DATE") >= target)
    ).groupBy("ID").agg(F.sum("VALUE").alias("VALUE"))
I want to avoid this for loop. How can I do it in an efficient way?

Make a dataframe of your date list, cross join it to your dataframe, and then use window functions to get the rolling sum over the past 6 months:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
df.show()
+-------+----+-----+
| DATE| ID|VALUE|
+-------+----+-----+
|2019-04|1234|100.0|
|2019-05|4567|200.0|
+-------+----+-----+
date_set = ["2019-%02d-01" % i for i in range(1,13)] + ["2020-01-01"]
dates = spark.createDataFrame([[i] for i in date_set]).toDF('date')
# cross join the dates to every distinct ID so each ID gets a row for each date
dates_id = dates.crossJoin(df.select('ID').distinct())

df2 = df.withColumn(
    'date',
    F.to_date('DATE', 'yyyy-MM')
).join(
    dates_id,
    ['date', 'ID'],
    'right'
).withColumn(
    'deltamonth',
    # note: differencing month() assumes all dates fall within one calendar year
    F.month('date') - F.min(F.month('date')).over(Window.partitionBy('id'))
).withColumn(
    'sum_value',
    F.sum('value').over(
        Window.partitionBy('id').orderBy('deltamonth').rangeBetween(-6, 0)
    )
).drop('deltamonth').orderBy('date', 'ID')
Results:
df2.show(50)
+----------+----+-----+---------+
| date| ID|VALUE|sum_value|
+----------+----+-----+---------+
|2019-01-01|1234| null| null|
|2019-01-01|4567| null| null|
|2019-02-01|1234| null| null|
|2019-02-01|4567| null| null|
|2019-03-01|1234| null| null|
|2019-03-01|4567| null| null|
|2019-04-01|1234|100.0| 100.0|
|2019-04-01|4567| null| null|
|2019-05-01|1234| null| 100.0|
|2019-05-01|4567|200.0| 200.0|
|2019-06-01|1234| null| 100.0|
|2019-06-01|4567| null| 200.0|
|2019-07-01|1234| null| 100.0|
|2019-07-01|4567| null| 200.0|
|2019-08-01|1234| null| 100.0|
|2019-08-01|4567| null| 200.0|
|2019-09-01|1234| null| 100.0|
|2019-09-01|4567| null| 200.0|
|2019-10-01|1234| null| 100.0|
|2019-10-01|4567| null| 200.0|
|2019-11-01|1234| null| null|
|2019-11-01|4567| null| 200.0|
|2019-12-01|1234| null| null|
|2019-12-01|4567| null| null|
|2020-01-01|1234| null| null|
|2020-01-01|4567| null| null|
+----------+----+-----+---------+
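One caveat: F.month() resets every January, so deltamonth collapses back to 0 for the 2020-01-01 rows. With this sample data it happens not to change the result, but values from late 2019 would be missed in the 2020 rows. A hedged variant, not part of the original answer, computes a continuous month index with months_between instead; joined below stands for the frame produced right after the right join:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# 'joined' is assumed to be df right-joined to dates_id, as in the answer above.
rolled = joined.withColumn(
    'month_idx',
    # whole months since a fixed anchor date; keeps increasing across years
    F.months_between(F.col('date'), F.to_date(F.lit('2019-01-01'))).cast('int')
).withColumn(
    'sum_value',
    F.sum('VALUE').over(
        Window.partitionBy('ID').orderBy('month_idx').rangeBetween(-6, 0)
    )
).drop('month_idx')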

Related

PySpark Multiple Lagged Variables for TimeSeries Data with Gaps

I am trying to generate multiple lags of the 'status' variable in the dataframe below. The data is a time series and it is possible to have gaps in it. What I am trying to do is generate the lags, but when there is a gap, the value should be set to missing/null.
Input DF:
+---+----------+------+
| id| s_date|status|
+---+----------+------+
|123|2007-01-31| 1|
|123|2007-02-28| 1|
|123|2007-03-31| 2|
|123|2007-04-30| 2|
|123|2007-05-31| 1|
|123|2007-06-30| 1|
|123|2007-07-31| 2|
|123|2007-08-31| 2|
|345|2007-08-31| 3|
|123|2007-09-30| 2|
|345|2007-09-30| 2|
|123|2007-10-31| 1|
|345|2007-10-31| 1|
|123|2007-11-30| 1|
|345|2007-11-30| 2|
|123|2008-01-31| 3|
|345|2007-12-31| 2|
|567|2007-12-31| 3|
|123|2008-03-31| 4|
|345|2008-01-31| 2|
+---+----------+------+
from datetime import date
rdd = sc.parallelize([
[123,date(2007,1,31),1],
[123,date(2007,2,28),1],
[123,date(2007,3,31),2],
[123,date(2007,4,30),2],
[123,date(2007,5,31),1],
[123,date(2007,6,30),1],
[123,date(2007,7,31),2],
[123,date(2007,8,31),2],
[345,date(2007,8,31),3],
[123,date(2007,9,30),2],
[345,date(2007,9,30),2],
[123,date(2007,10,31),1],
[345,date(2007,10,31),1],
[123,date(2007,11,30),1],
[345,date(2007,11,30),2],
[123,date(2008,1,31),3],
[345,date(2007,12,31),2],
[567,date(2007,12,31),3],
[123,date(2008,3,31),4],
[345,date(2008,1,31),2],
[567,date(2008,1,31),3],
[123,date(2008,4,30),3],
[123,date(2008,5,31),2],
[123,date(2008,6,30),1]
])
df = rdd.toDF(['id','s_date','status'])
df.show()
# Below is the code that works
from pyspark.sql import functions as fn
from pyspark.sql.window import Window

w = Window().partitionBy('id').orderBy('s_date')
for i in range(1, 25):
    df = df.withColumn(f"lag_{i}", fn.lag(fn.col('status'), i).over(w)) \
        .withColumn(f"lag_month_{i}", fn.lag(fn.col('s_date'), i).over(w)) \
        .withColumn(f"lag_status_{i}", fn.expr(
            f"case when {i} = 1 and (last_day(add_months(lag_month_{i}, 1)) = last_day(s_date)) "
            f"then lag_{i} else null end"))
In the code above, the column lag_status_1 is correctly populated for i = 1: it is null for January and March 2008, which is what I want for every lag. However, when I add the line below to handle the other lags (i.e. lag_2, lag_3, etc.), the code does not work.
.withColumn(f"lag_status_{i}", fn.expr("case when "f"{i} = 1 and (last_day(add_months("f"lag_month_{i}"",1)) = last_day(s_date)) then "f"lag_{i}"" " +
"when "f"{i} > 1 and (last_day(add_months("f"lag_month_{i}"",1)) = last_day("f"lag_month_{i-1})) then "f"lag_{i}"" else null end"))\
Output DF: lag_status_2 and lag_status_3 are null, but I would like to populate them with logic similar to lag_status_1 (each with respect to its own lag).
+---+----------+------+---------+---------+---------+
| id| s_date|status|lag_status_1|lag_status_2|lag_status_3|
+---+----------+------+---------+---------+---------+
|123|2007-01-31| 1| null| null| null|
|123|2007-02-28| 1| 1| null| null|
|123|2007-03-31| 2| 1| null| null|
|123|2007-04-30| 2| 2| null| null|
|123|2007-05-31| 1| 2| null| null|
|123|2007-06-30| 1| 1| null| null|
|123|2007-07-31| 2| 1| null| null|
|123|2007-08-31| 2| 2| null| null|
|123|2007-09-30| 2| 2| null| null|
|123|2007-10-31| 1| 2| null| null|
|123|2007-11-30| 1| 1| null| null|
|123|2008-01-31| 3| null| null| null|
|123|2008-03-31| 4| null| null| null|
|123|2008-04-30| 3| 4| null| null|
|123|2008-05-31| 2| 3| null| null|
|123|2008-06-30| 1| 2| null| null|
|345|2007-08-31| 3| null| null| null|
|345|2007-09-30| 2| 3| null| null|
|345|2007-10-31| 1| 2| null| null|
|345|2007-11-30| 2| 1| null| null|
+---+----------+------+---------+---------+---------+
Can you please guide me on how to resolve this? If there is a better or more efficient solution, please suggest it.
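A hedged sketch of one way the gap check could be generalized (not code from the thread): instead of chaining lag_month_{i} against lag_month_{i-1}, compare the i-th lagged month directly to s_date shifted back by i months, so any gap anywhere inside the span nulls that lag.
from pyspark.sql import functions as fn
from pyspark.sql.window import Window

w = Window.partitionBy('id').orderBy('s_date')
for i in range(1, 25):
    df = df.withColumn(
        f"lag_status_{i}",
        # keep the lag only if the row i steps back is exactly i months earlier;
        # with at most one row per month this implies there is no gap in the span
        fn.when(
            fn.last_day(fn.add_months(fn.lag('s_date', i).over(w), i)) == fn.last_day('s_date'),
            fn.lag('status', i).over(w)
        )
    )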

Pyspark: how to filter a table using a UDF?

I have a dataframe and I'd like to filter some rows out based on one column. But my conditions are quite complex and will require a separate function; it's not something I can do in a single expression or where clause.
My plan was to return True or False according to whether to keep or filter out that row:
from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf

def my_filter(col1):
    # I will have more conditions here, but for simplicity...
    if col1 is None:
        return True

my_filter_udf = udf(my_filter, BooleanType())

df = table1_df \
    .select('col1') \
    .filter(my_filter_udf('col1')) \
    .show()
I get the error "SparkException: Exception thrown in Future.get".
What's weird is in the function, if I just use:
def my_filter(col1):
    return True
It works. But if I reference the col1 value in the function, I get this error. So the code is reaching and parsing the function, it just doesn't seem to like what I'm passing to it.
Any idea where I'm going wrong here?
The idea of a UDF is that its logic is executed on every row. Here you are using it inside filter(); use withColumn() instead.
Some sample code:
import pyspark.sql.functions as F

@F.udf("boolean")
def my_filter(col1):
    if col1 == "NIF":
        return True  # non-matching rows fall through and return None, i.e. null

df = df.withColumn("filter_col", my_filter("EVENT"))
df.show()
+----+-----+----------+
| ID|EVENT|filter_col|
+----+-----+----------+
|id_1| ST| null|
|id_1| ST| null|
|id_1| ST| null|
|id_1| ST| null|
|id_1| NIF| true|
|id_1| ST| null|
|id_1| SB| null|
|id_2| NIF| true|
|id_2| NIF| true|
|id_2| NIF| true|
|id_2| NIF| true|
|id_2| NIF| true|
|id_2| NIF| true|
|id_2| NIF| true|
|id_3| AB| null|
|id_3| NIF| true|
|id_3| DR| null|
|id_3| NIF| true|
|id_3| ST| null|
|id_3| NIF| true|
+----+-----+----------+
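To then actually drop the unwanted rows, a small hedged follow-up to the snippet above: filter on the flag column (rows with a null flag are treated as not matching) and drop it afterwards.
# Keep only the rows the UDF flagged as True; rows with a null flag are removed.
df_filtered = df.filter(F.col("filter_col") == True).drop("filter_col")
df_filtered.show()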

How to apply groupby and transpose in Pyspark?

I have a dataframe like as shown below
import pandas as pd

df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    'readings': ['READ_1', 'READ_2', 'READ_1', 'READ_3', 'READ_1', 'READ_5', 'READ_6', 'READ_8', 'READ_10', 'READ_12', 'READ_11', 'READ_14', 'READ_09', 'READ_08', 'READ_07'],
    'val': [5, 6, 7, 11, 5, 7, 16, 12, 13, 56, 32, 13, 45, 43, 46],
})
The code below (thanks to Jezrael) works fine in pandas, but when I apply it to my real data (more than 4M records) it runs for a very long time, so I am trying to use PySpark instead. Note that I have already tried Dask, modin and pandarallel, which are pandas equivalents for large-scale processing, but they didn't help either. The code below generates summary statistics for each subject and each reading; you can have a look at the expected output below to get an idea.
df_op = (df.groupby(['subject_id', 'readings'])['val']
           .describe()
           .unstack()
           .swaplevel(0, 1, axis=1)
           .reindex(df['readings'].unique(), axis=1, level=0))
df_op.columns = df_op.columns.map('_'.join)
df_op = df_op.reset_index()
Can you help me achieve the above operation in pyspark? When I tried the below, it threw an error
df.groupby(['subject_id','readings'])['val']
For example, subject_id = 1 has 4 readings but only 3 unique readings, so we get 3 * 8 = 24 columns for subject_id = 1. Why 8? Because the statistics are min, max, count, std, mean, 25th percentile, 50th percentile and 75th percentile. Hope this helps.
When I started off with this in PySpark, it returned the error below:
TypeError: 'GroupedData' object is not subscriptable
I expect my output to be in the wide format described above (one row per subject_id, one block of statistics per reading).
You need to group by subject_id and readings and compute the statistics for each reading first, and then pivot to get the expected outcome:
import pyspark.sql.functions as F
agg_df = df.groupby("subject_id", "readings").agg(
    F.mean(F.col("val")), F.min(F.col("val")), F.max(F.col("val")),
    F.count(F.col("val")),
    F.expr('percentile_approx(val, 0.25)').alias("quantile_25"),
    F.expr('percentile_approx(val, 0.75)').alias("quantile_75"))
This will give you the following output:
+----------+--------+--------+--------+--------+----------+-----------+-----------+
|subject_id|readings|avg(val)|min(val)|max(val)|count(val)|quantile_25|quantile_75|
+----------+--------+--------+--------+--------+----------+-----------+-----------+
| 2| READ_1| 5.0| 5| 5| 1| 5| 5|
| 2| READ_5| 7.0| 7| 7| 1| 7| 7|
| 2| READ_8| 12.0| 12| 12| 1| 12| 12|
| 4| READ_08| 43.0| 43| 43| 1| 43| 43|
| 1| READ_2| 6.0| 6| 6| 1| 6| 6|
| 1| READ_1| 6.0| 5| 7| 2| 5| 7|
| 2| READ_6| 16.0| 16| 16| 1| 16| 16|
| 1| READ_3| 11.0| 11| 11| 1| 11| 11|
| 4| READ_11| 32.0| 32| 32| 1| 32| 32|
| 3| READ_10| 13.0| 13| 13| 1| 13| 13|
| 3| READ_12| 56.0| 56| 56| 1| 56| 56|
| 4| READ_14| 13.0| 13| 13| 1| 13| 13|
| 4| READ_07| 46.0| 46| 46| 1| 46| 46|
| 4| READ_09| 45.0| 45| 45| 1| 45| 45|
+----------+--------+--------+--------+--------+----------+-----------+-----------+
If you group by subject_id and pivot on readings, you will get the expected output:
agg_df2 = df.groupby("subject_id").pivot("readings").agg(
    F.mean(F.col("val")), F.min(F.col("val")), F.max(F.col("val")),
    F.count(F.col("val")),
    F.expr('percentile_approx(val, 0.25)').alias("quantile_25"),
    F.expr('percentile_approx(val, 0.75)').alias("quantile_75"))

for i in agg_df2.columns:
    agg_df2 = agg_df2.withColumnRenamed(i, i.replace("(val)", ""))

agg_df2.show()
+----------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+---------------+---------------+---------------+-----------------+------------------+------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+
|subject_id|READ_07_avg(val)|READ_07_min(val)|READ_07_max(val)|READ_07_count(val)|READ_07_quantile_25|READ_07_quantile_75|READ_08_avg(val)|READ_08_min(val)|READ_08_max(val)|READ_08_count(val)|READ_08_quantile_25|READ_08_quantile_75|READ_09_avg(val)|READ_09_min(val)|READ_09_max(val)|READ_09_count(val)|READ_09_quantile_25|READ_09_quantile_75|READ_1_avg(val)|READ_1_min(val)|READ_1_max(val)|READ_1_count(val)|READ_1_quantile_25|READ_1_quantile_75|READ_10_avg(val)|READ_10_min(val)|READ_10_max(val)|READ_10_count(val)|READ_10_quantile_25|READ_10_quantile_75|READ_11_avg(val)|READ_11_min(val)|READ_11_max(val)|READ_11_count(val)|READ_11_quantile_25|READ_11_quantile_75|READ_12_avg(val)|READ_12_min(val)|READ_12_max(val)|READ_12_count(val)|READ_12_quantile_25|READ_12_quantile_75|READ_14_avg(val)|READ_14_min(val)|READ_14_max(val)|READ_14_count(val)|READ_14_quantile_25|READ_14_quantile_75|READ_2_avg(val)|READ_2_min(val)|READ_2_max(val)|READ_2_count(val)|READ_2_quantile_25|READ_2_quantile_75|READ_3_avg(val)|READ_3_min(val)|READ_3_max(val)|READ_3_count(val)|READ_3_quantile_25|READ_3_quantile_75|READ_5_avg(val)|READ_5_min(val)|READ_5_max(val)|READ_5_count(val)|READ_5_quantile_25|READ_5_quantile_75|READ_6_avg(val)|READ_6_min(val)|READ_6_max(val)|READ_6_count(val)|READ_6_quantile_25|READ_6_quantile_75|READ_8_avg(val)|READ_8_min(val)|READ_8_max(val)|READ_8_count(val)|READ_8_quantile_25|READ_8_quantile_75|
+----------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+---------------+---------------+---------------+-----------------+------------------+------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+
| 1| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 6.0| 5| 7| 2| 5| 7| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 6.0| 6| 6| 1| 6| 6| 11.0| 11| 11| 1| 11| 11| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
| 3| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 13.0| 13| 13| 1| 13| 13| null| null| null| null| null| null| 56.0| 56| 56| 1| 56| 56| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
| 2| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 5.0| 5| 5| 1| 5| 5| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 7.0| 7| 7| 1| 7| 7| 16.0| 16| 16| 1| 16| 16| 12.0| 12| 12| 1| 12| 12|
| 4| 46.0| 46| 46| 1| 46| 46| 43.0| 43| 43| 1| 43| 43| 45.0| 45| 45| 1| 45| 45| null| null| null| null| null| null| null| null| null| null| null| null| 32.0| 32| 32| 1| 32| 32| null| null| null| null| null| null| 13.0| 13| 13| 1| 13| 13| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
+----------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+---------------+---------------+---------------+-----------------+------------------+------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+
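The question asks for eight statistics per reading, but std and the 50th percentile are missing from the aggregation above. A hedged extension of the same pattern (the column aliases are illustrative):
import pyspark.sql.functions as F

# Hedged sketch: same groupby/pivot, extended with stddev and the median so
# that all eight requested statistics come out per reading.
agg_df3 = df.groupby("subject_id").pivot("readings").agg(
    F.mean("val").alias("mean"),
    F.stddev("val").alias("std"),
    F.min("val").alias("min"),
    F.max("val").alias("max"),
    F.count("val").alias("count"),
    F.expr("percentile_approx(val, 0.25)").alias("quantile_25"),
    F.expr("percentile_approx(val, 0.5)").alias("quantile_50"),
    F.expr("percentile_approx(val, 0.75)").alias("quantile_75"))
agg_df3.show()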

How to convert string semi colon-separated column to MapType in pyspark?

Sample of data:
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|customtargeting |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|nocid=no;store=2007;tppid=45c566dd-00d7-4193-b5c7-17843c2764e9 |
|nocid=no;store=3084;tppid=4cd36fde-c59a-41d2-a2b4-b731b6cfbe05 |
|nocid=no;tppid=c688c1be-a9c5-47a2-8c09-aef175a19847 |
|nocid=yes;search=washing liquid;store=3060 |
|pos=top;tppid=278bab7b-d40b-4783-8f89-bef94a9f5150 |
|pos=top;tppid=00bb87fa-f3f5-4b0e-bbf8-16079a1a5efe |
|nocid=no;shelf=cleanser-toner-and-face-mask;store=2019;tppid=84006d41-eb63-4ae1-8c3c-3ac9436d446c |
|pos=top;tppid=ed02b037-066b-46bd-99e6-d183160644a2 |
|nocid=yes;search=salad;store=3060 |
|pos=top;nocid=no;store=2882;tppid=164563e4-8e5c-4366-a5a8-438ffb10da9d |
|nocid=yes;search=beer;store=3060 |
|nocid=no;search=washing capsules;store=5528;tppid=4f9b99eb-65ff-4fbc-b11c-b0552b7f158d |
|pos=right;tppid=ddb54247-a5c9-40a0-9f99-8412d8542b4c |
|nocid=yes;search=bedding;store=3060 |
|pos=top |
|pos=mpu1;keywords=helium canisters;keywords=tesco.com;keywords=helium canisters reviews;keywords=tesco;keywords=helium canisters uk;keywords=balloons;pagetype=category|
I want to convert a PySpark dataframe column to a map type. The column is a string and can contain any number of key=value pairs; for some keys there are multiple values, which I want to collect into an array as the value for that key.
Try this,
import pyspark.sql.functions as F
from pyspark.sql.types import *
def convert_to_json(_str):
    _split_str = [tuple(x.split('=')) for x in _str.split(';') if len(tuple(x.split('='))) == 2]
    _json = {}
    for k, v in _split_str:
        if k in _json:
            _json[k].append(v)
        else:
            _json[k] = [v]
    return _json

convert_udf = F.udf(convert_to_json, MapType(StringType(), ArrayType(StringType())))
df = df.withColumn('customtargeting', convert_udf('customtargeting'))

print(df.schema)
print(df.limit(5).collect())
This gives you the schema and output as,
StructType(List(StructField(
customtargeting,MapType(StringType,ArrayType(StringType,true),true),true)))
[Row(customtargeting={u'store': [u'2007'], u'tppid': [u'45c566dd-00d7-4193-b5c7-17843c2764e9'], u'nocid': [u'no']}),
Row(customtargeting={u'store': [u'3084'], u'tppid': [u'4cd36fde-c59a-41d2-a2b4-b731b6cfbe05'], u'nocid': [u'no']}),
Row(customtargeting={u'nocid': [u'no'], u'tppid': [u'c688c1be-a9c5-47a2-8c09-aef175a19847']}),
Row(customtargeting={u'search': [u'washing liquid'], u'nocid': [u'yes'], u'store': [u'3060']}),
Row(customtargeting={u'pos': [u'top'], u'tppid': [u'278bab7b-d40b-4783-8f89-bef94a9f5150']})]
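If no key ever repeated within a row, a hedged non-UDF alternative would be Spark's built-in str_to_map. Note that rows like the keywords example above, where a key appears several times, would hit the duplicate-map-key check in Spark 3 (or silently keep only one value on older versions), so the UDF above is the safer fit for this data:
import pyspark.sql.functions as F

# Hedged sketch: built-in parsing into map<string,string>.
# Only appropriate when each key occurs at most once per row.
df_map = df.withColumn(
    'customtargeting',
    F.expr("str_to_map(customtargeting, ';', '=')")
)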
If you want to separate the columns and create a new dataframe, you can use pandas features. Find my solution below:
>>> import pandas as pd
>>>
>>> rdd = sc.textFile('/home/ali/text1.txt')
>>> rdd.first()
'nocid=no;store=2007;tppid=45c566dd-00d7-4193-b5c7-17843c2764e9'
>>> rddMap = rdd.map(lambda x: x.split(';'))
>>> rddMap.first()
['nocid=no', 'store=2007', 'tppid=45c566dd-00d7-4193-b5c7-17843c2764e9']
>>>
>>> df1 = pd.DataFrame()
>>> for rdd in rddMap.collect():
...     a = {i.split('=')[0]: i.split('=')[1] for i in rdd}
...     df2 = pd.DataFrame([a], columns=a.keys())
...     df1 = pd.concat([df1, df2])
...
>>> df = spark.createDataFrame(df1.astype(str)).replace('nan',None)
>>> df.show()
+--------+-----+--------+-----+----------------+--------------------+-----+--------------------+
|keywords|nocid|pagetype| pos| search| shelf|store| tppid|
+--------+-----+--------+-----+----------------+--------------------+-----+--------------------+
| null| no| null| null| null| null| 2007|45c566dd-00d7-419...|
| null| no| null| null| null| null| 3084|4cd36fde-c59a-41d...|
| null| no| null| null| null| null| null|c688c1be-a9c5-47a...|
| null| yes| null| null| washing liquid| null| 3060| null|
| null| null| null| top| null| null| null|278bab7b-d40b-478...|
| null| null| null| top| null| null| null|00bb87fa-f3f5-4b0...|
| null| no| null| null| null|cleanser-toner-an...| 2019|84006d41-eb63-4ae...|
| null| null| null| top| null| null| null|ed02b037-066b-46b...|
| null| yes| null| null| salad| null| 3060| null|
| null| no| null| top| null| null| 2882|164563e4-8e5c-436...|
| null| yes| null| null| beer| null| 3060| null|
| null| no| null| null|washing capsules| null| 5528|4f9b99eb-65ff-4fb...|
| null| null| null|right| null| null| null|ddb54247-a5c9-40a...|
| null| yes| null| null| bedding| null| 3060| null|
| null| null| null| top| null| null| null| null|
|balloons| null|category| mpu1| null| null| null| null|
+--------+-----+--------+-----+----------------+--------------------+-----+--------------------+
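Keep in mind that collect() pulls every row to the driver and the pandas concat loop grows slower with each iteration, so this won't scale to large data. A hedged all-Spark sketch of the same reshaping, using explode and pivot (column and dataframe names are illustrative):
import pyspark.sql.functions as F

# Assumes df has the raw 'customtargeting' string column.
kv = (df
      .withColumn('row_id', F.monotonically_increasing_id())  # row identifier to pivot on
      .withColumn('pair', F.explode(F.split('customtargeting', ';')))
      .withColumn('k', F.split('pair', '=').getItem(0))
      .withColumn('v', F.split('pair', '=').getItem(1)))

wide = (kv.groupBy('row_id')
          .pivot('k')
          .agg(F.concat_ws(',', F.collect_list('v')))  # repeated keys become a comma-joined list
          .drop('row_id'))
wide.show()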

spark outer join with source

I am relatively new to Spark, and I was wondering if I could get the source dataframe of the key that was used in an outer join.
Let's say I have 3 DF
DF 1
+-----+----+
|item1| key|
+-----+----+
|Item1|key1|
|Item2|key2|
|Item3|key3|
|Item4|key4|
|Item5|key5|
+-----+----+
DF2
+-----+----+
|item2| key|
+-----+----+
| t1|key1|
| t2|key2|
| t3|key6|
| t4|key7|
| t5|key8|
+-----+----+
DF3
+-----+-----+
|item3| key|
+-----+-----+
| t1| key1|
| t2| key2|
| t3| key8|
| t4| key9|
| t5|key10|
+-----+-----+
I want to do a full outer join on these 3 dataframes and include a new column to indicate the source of each key.
E.g.
+-----+-----+-----+-----+------+
| key|item1|item2|item3|source|
+-----+-----+-----+-----+------+
| key8| null| t5| t3| DF2|
| key5|Item5| null| null| DF1|
| key7| null| t4| null| DF2|
| key3|Item3| null| null| DF1|
| key6| null| t3| null| DF2|
| key1|Item1| t1| t1| DF1|
| key4|Item4| null| null| DF1|
| key2|Item2| t2| t2| DF1|
| key9| null| null| t4| DF3|
|key10| null| null| t5| DF3|
+-----+-----+-----+-----+------+
Is there any way to achieve this?
I'd do something like this:
from pyspark.sql.functions import col, lit, coalesce, when
df1 = spark.createDataFrame(
    [("Item1", "key1"), ("Item2", "key2"), ("Item3", "key3"),
     ("Item4", "key4"), ("Item5", "key5")],
    ["item1", "key"])

df2 = spark.createDataFrame(
    [("t1", "key1"), ("t2", "key2"), ("t3", "key6"),
     ("t4", "key7"), ("t5", "key8")],
    ["item2", "key"])

df3 = spark.createDataFrame(
    [("t1", "key1"), ("t2", "key2"), ("t3", "key8"),
     ("t4", "key9"), ("t5", "key10")],
    ["item3", "key"])

df1.join(df2, ["key"], "outer").join(df3, ["key"], "outer").withColumn(
    "source",
    coalesce(
        when(col("item1").isNotNull(), "df1"),
        when(col("item2").isNotNull(), "df2"),
        when(col("item3").isNotNull(), "df3")))
Result is:
## +-----+-----+-----+-----+------+
## | key|item1|item2|item3|source|
## +-----+-----+-----+-----+------+
## | key8| null| t5| t3| df2|
## | key5|Item5| null| null| df1|
## | key7| null| t4| null| df2|
## | key3|Item3| null| null| df1|
## | key6| null| t3| null| df2|
## | key1|Item1| t1| t1| df1|
## | key4|Item4| null| null| df1|
## | key2|Item2| t2| t2| df1|
## | key9| null| null| t4| df3|
## |key10| null| null| t5| df3|
## +-----+-----+-----+-----+------+
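Note that coalesce picks the first non-null branch, so keys present in several dataframes (e.g. key1, key2) are attributed to df1 only. If you instead wanted every contributing dataframe listed, a hedged variant could concatenate the flags, since concat_ws skips nulls; joined below stands for the result of the two outer joins above:
from pyspark.sql.functions import col, concat_ws, when

joined = df1.join(df2, ["key"], "outer").join(df3, ["key"], "outer")

sources = joined.withColumn(
    "source",
    concat_ws(",",  # nulls are skipped, so only the matching frames are listed
              when(col("item1").isNotNull(), "df1"),
              when(col("item2").isNotNull(), "df2"),
              when(col("item3").isNotNull(), "df3")))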
