I have a PySpark dataframe like this,
+----------+--------+----------+----------+
|id_ | p |d1 | d2 |
+----------+--------+----------+----------+
| 1 | A |2018-09-26|2018-10-26|
| 2 | B |2018-06-21|2018-07-19|
| 2 | B |2018-08-13|2018-10-07|
| 2 | B |2018-12-31|2019-02-27|
| 2 | B |2019-05-28|2019-06-25|
| 3 |C |2018-06-15|2018-07-13|
| 3 |C |2018-08-15|2018-10-09|
| 3 |C |2018-12-03|2019-03-12|
| 3 |C |2019-05-10|2019-06-07|
| 4 | A |2019-01-30|2019-03-01|
| 4 | B |2019-05-30|2019-07-25|
| 5 |C |2018-09-19|2018-10-17|
+----------+--------+----------+----------+
From this dataframe I have to derive another dataframe which has n columns, where each column is a month from month(min(d1)) to month(max(d2)).
For every row in the original dataframe I want a row in the derived dataframe, and each column value must be the number of days covered in that month.
For example,
for the first row, where id_ is 1 and p is A, I want a row in the derived dataframe where the column 201809 has value 5 and the column 201810 has value 26.
For the second row, where id_ is 2 and p is B, I want a row where the column 201806 is 9 and 201807 is 19.
For the second-to-last row, I want the column 201905 filled with value 1, column 201906 with 30, and 201907 with 25.
So basically, I want the dataframe populated in such a way that, for each row in my original dataframe, there is a row in the derived dataframe where every month column in the range min(d1) to max(d2) is filled with the number of days of that row's interval that fall in that particular month.
I am currently doing this the hard way. I make n columns, one for every date from min(d1) to max(d2), fill these columns with 1, then melt the data and filter on the value. Finally I aggregate this dataframe to get my desired result, and then select the max-valued p.
In code:
import pandas as pd
import pyspark.sql.functions as F

# overall date range of the data
d = df.select(F.min('d1').alias('d1'), F.max('d2').alias('d2')).first()
# one column per calendar day in that range
cols = [ c.strftime('%Y-%m-%d') for c in pd.period_range(d.d1, d.d2, freq='D') ]
# flag each day with 1 if it falls inside [d1, d2], else 0
result = df.select('id_', 'p', *[ F.when((df.d1 <= c) & (df.d2 >= c), 1).otherwise(0).alias(c) for c in cols ])
# melt to long format, derive the month, and sum the day flags per month
melted_data = melt(result, id_vars=['id_', 'p'], value_vars=cols)
melted_data = melted_data.withColumn('Month', F.substring(F.regexp_replace('variable', '-', ''), 1, 6))
melted_data = melted_data.groupBy('id_', 'Month', 'p').agg(F.sum('value').alias('days'))
melted_data = melted_data.orderBy('id_', 'Month', 'days', ascending=[False, False, False])
# keep the p with the most days per (id_, Month) - relies on the ordering above
final_data = melted_data.groupBy('id_', 'Month').agg(F.first('p').alias('p'))
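Note: melt is not a built-in PySpark function - I use a standard explode-based helper, roughly like this:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def melt(df: DataFrame, id_vars, value_vars, var_name='variable', value_name='value') -> DataFrame:
    # one (variable, value) struct per melted column
    vars_and_vals = F.array(*[
        F.struct(F.lit(c).alias(var_name), F.col(c).alias(value_name))
        for c in value_vars])
    # explode the array into rows, then flatten the struct back into columns
    tmp = df.withColumn('_vars_and_vals', F.explode(vars_and_vals))
    cols = list(id_vars) + [F.col('_vars_and_vals')[x].alias(x) for x in (var_name, value_name)]
    return tmp.select(*cols)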
This code takes a lot of time to run, even on a decent configuration. How can I improve this?
How can I achieve this task in a more optimized manner? Creating a column for every single date in the range does not seem to be the best solution.
A small sample of the needed output is shown below,
+---+---+----------+----------+----------+----------+-------+
|id_|p |201806 |201807 |201808 | 201809 | 201810|
+---+---+----------+----------+----------+----------+-------+
| 1 | A | 0| 0 | 0| 4 | 26 |
| 2 | B | 9| 19| 0| 0 | 0 |
| 2 | B | 0| 0 | 18| 30 | 7 |
+---+---+----------+----------+----------+----------+-------+
I think it is slowing down because of freq='D' and the multiple transformations on the dataset.
Please try the approach below:
Edit 1: Update for the quarter
Edit 2: Per comment, the start date should be included in the final result
Edit 3: Per comment, update for the daily case
Prepared data
#Imports
import pyspark.sql.functions as f
from pyspark.sql.functions import when
import pandas as pd
df.show()
+---+---+----------+----------+
| id| p| d1| d2|
+---+---+----------+----------+
| 1| A|2018-09-26|2018-10-26|
| 2| B|2018-06-21|2018-07-19|
| 2| B|2018-08-13|2018-10-07|
| 2| B|2018-12-31|2019-02-27|
| 2| B|2019-05-28|2019-06-25|
| 3| C|2018-06-15|2018-07-13|
| 3| C|2018-08-15|2018-10-09|
| 3| C|2018-12-03|2019-03-12|
| 3| C|2019-05-10|2019-06-07|
| 4| A|2019-01-30|2019-03-01|
| 4| B|2019-05-30|2019-07-25|
| 5| C|2018-09-19|2018-10-17|
| 5| C|2019-05-16|2019-05-29| # --> Same month case
+---+---+----------+----------+
Get the min and max dates from the dataset, then build a month-frequency range with freq='M'
d = df.select(f.min('d1').alias('min'), f.max('d2').alias('max')).first()
dates = pd.period_range(d.min, d.max, freq='M').strftime("%Y%m").tolist()
dates
['201806', '201807', '201808', '201809', '201810', '201811', '201812', '201901', '201902', '201903', '201904', '201905', '201906', '201907']
Now, the final business logic using Spark date operators and functions
df1 = df.select('id',
                'p',
                'd1',
                'd2',
                *[ (when((f.trunc(df.d1, "month") == f.trunc(df.d2, "month"))
                         & (f.to_date(f.lit(c), 'yyyyMM') == f.trunc(df.d1, "month")),
                         f.datediff(df.d2, df.d1) + 1)                                       # same month: (d2 - d1) + 1
                    .when(f.to_date(f.lit(c), 'yyyyMM') == f.trunc(df.d1, "month"),
                          f.datediff(f.last_day(f.to_date(f.lit(c), 'yyyyMM')), df.d1) + 1)  # d1's month: (last day of month - d1) + 1
                    .when(f.to_date(f.lit(c), 'yyyyMM') == f.trunc(df.d2, "month"),
                          f.datediff(df.d2, f.to_date(f.lit(c), 'yyyyMM')) + 1)              # d2's month: (d2 - first day of month) + 1
                    .when(f.to_date(f.lit(c), 'yyyyMM').between(f.trunc(df.d1, "month"), df.d2),
                          f.dayofmonth(f.last_day(f.to_date(f.lit(c), 'yyyyMM'))))           # months in between: all days of the month
                    ).otherwise(0)                                                           # all other months: 0
                    .alias(c) for c in dates ])
df1.show()
+---+---+----------+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
| id| p| d1| d2|201806|201807|201808|201809|201810|201811|201812|201901|201902|201903|201904|201905|201906|201907|
+---+---+----------+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
| 1| A|2018-09-26|2018-10-26| 0| 0| 0| 5| 26| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B|2018-06-21|2018-07-19| 10| 19| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B|2018-08-13|2018-10-07| 0| 0| 19| 30| 7| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B|2018-12-31|2019-02-27| 0| 0| 0| 0| 0| 0| 1| 31| 27| 0| 0| 0| 0| 0|
| 2| B|2019-05-28|2019-06-25| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 4| 25| 0|
| 3| C|2018-06-15|2018-07-13| 16| 13| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 3| C|2018-08-15|2018-10-09| 0| 0| 17| 30| 9| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 3| C|2018-12-03|2019-03-12| 0| 0| 0| 0| 0| 0| 29| 31| 28| 12| 0| 0| 0| 0|
| 3| C|2019-05-10|2019-06-07| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 22| 7| 0|
| 4| A|2019-01-30|2019-03-01| 0| 0| 0| 0| 0| 0| 0| 2| 28| 1| 0| 0| 0| 0|
| 4| B|2019-05-30|2019-07-25| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 2| 30| 25|
| 5| C|2018-09-19|2018-10-17| 0| 0| 0| 12| 17| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 5| C|2019-05-16|2019-05-29| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 14| 0| 0|
+---+---+----------+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
Edit 2: Update for the quarter range:
Note: The quarter date-range dictionary is taken from @jxc's answer. We are more interested in the optimal solution here; @jxc has done an excellent job, so there is no point in reinventing the wheel unless there is a performance issue.
Create the date-range dictionary:
q_dates = dict([
(str(c), [ c.to_timestamp().strftime("%Y-%m-%d") ,(c.to_timestamp() + pd.tseries.offsets.QuarterEnd()).strftime("%Y-%m-%d")
]) for c in pd.period_range(d.min, d.max, freq='Q')
])
# {'2018Q2': ['2018-04-01', '2018-06-30'],
# '2018Q3': ['2018-07-01', '2018-09-30'],
# '2018Q4': ['2018-10-01', '2018-12-31'],
# '2019Q1': ['2019-01-01', '2019-03-31'],
# '2019Q2': ['2019-04-01', '2019-06-30'],
# '2019Q3': ['2019-07-01', '2019-09-30']}
Now apply the business logic on quarters.
df1 = df.select('id',
'p',
'd1',
'd2',
*[(when( (df.d1.between(q_dates[c][0], q_dates[c][1])) & (f.trunc(df.d1, "month") == f.trunc(df.d2, "month")),
f.datediff(df.d2 , df.d1) +1 ) # Same month ((Last day - start day) +1 )
.when(df.d1.between(q_dates[c][0], q_dates[c][1]),
f.datediff(f.to_date(f.lit(q_dates[c][1])), df.d1) +1) # Min date , remaining days (Last day of quarter - Min day)
.when(df.d2.between(q_dates[c][0], q_dates[c][1]),
f.datediff(df.d2, f.to_date(f.lit(q_dates[c][0]))) +1 ) # Max date , remaining days (Max day - Start day of quarter )
.when(f.to_date(f.lit(q_dates[c][0])).between(df.d1, df.d2),
f.datediff(f.to_date(f.lit(q_dates[c][1])), f.to_date(f.lit(q_dates[c][0]))) +1) # All remaining days
).otherwise(0)
.alias(c) for c in q_dates ])
df1.show()
+---+---+----------+----------+------+------+------+------+------+------+
| id| p| d1| d2|2018Q2|2018Q3|2018Q4|2019Q1|2019Q2|2019Q3|
+---+---+----------+----------+------+------+------+------+------+------+
| 1| A|2018-09-26|2018-10-26| 0| 5| 26| 0| 0| 0|
| 2| B|2018-06-21|2018-07-19| 10| 19| 0| 0| 0| 0|
| 2| B|2018-08-13|2018-10-07| 0| 49| 7| 0| 0| 0|
| 2| B|2018-12-31|2019-02-27| 0| 0| 1| 58| 0| 0|
| 2| B|2019-05-28|2019-06-25| 0| 0| 0| 0| 34| 0|
| 3| C|2018-06-15|2018-07-13| 16| 13| 0| 0| 0| 0|
| 3| C|2018-08-15|2018-10-09| 0| 47| 9| 0| 0| 0|
| 3| C|2018-12-03|2019-03-12| 0| 0| 29| 71| 0| 0|
| 3| C|2019-05-10|2019-06-07| 0| 0| 0| 0| 52| 0|
| 4| A|2019-01-30|2019-03-01| 0| 0| 0| 61| 0| 0|
| 4| B|2019-05-30|2019-07-25| 0| 0| 0| 0| 32| 25|
| 5| C|2018-09-19|2018-10-17| 0| 12| 17| 0| 0| 0|
| 5| C|2019-05-16|2019-05-29| 0| 0| 0| 0| 14| 0|
+---+---+----------+----------+------+------+------+------+------+------+
Edit 3: Per comment, update for the daily case
Since there are many more evaluations here, we need to be careful about performance.
Approach 1: Dataframe/Dataset
Get the date list in yyyy-MM-dd format, but as strings
df_dates = pd.period_range(d.min, d.max, freq='D').strftime("%Y-%m-%d").tolist()
Now the business logic is quite simple. It's either 1 or 0
df1 = df.select('id',
                'p',
                'd1',
                'd2',
                *[ when(f.lit(c).between(df.d1, df.d2), 1)  # 1 for days inside [d1, d2]
                   .otherwise(0)                            # 0 for the rest of the days
                   .alias(c) for c in df_dates ])
df1.show()
+---+---+----------+----------+----------+----------+----------+
| id| p| d1| d2|2018-06-15|2018-06-16|2018-06-17| # and so on....
+---+---+----------+----------+----------+----------+----------+
| 1| A|2018-09-26|2018-10-26| 0| 0| 0|
| 2| B|2018-06-21|2018-07-19| 0| 0| 0|
| 2| B|2018-08-13|2018-10-07| 0| 0| 0|
| 2| B|2018-12-31|2019-02-27| 0| 0| 0|
| 2| B|2019-05-28|2019-06-25| 0| 0| 0|
| 3| C|2018-06-15|2018-07-13| 1| 1| 1|
| 3| C|2018-08-15|2018-10-09| 0| 0| 0|
| 3| C|2018-12-03|2019-03-12| 0| 0| 0|
| 3| C|2019-05-10|2019-06-07| 0| 0| 0|
| 4| A|2019-01-30|2019-03-01| 0| 0| 0|
| 4| B|2019-05-30|2019-07-25| 0| 0| 0|
| 5| C|2018-09-19|2018-10-17| 0| 0| 0|
| 5| C|2019-05-16|2019-05-29| 0| 0| 0|
+---+---+----------+----------+----------+----------+----------+
# Due to the answer character limit, the full result cannot be shown here.
Approach 2: RDD evaluations
Get the date list as date objects
rdd_dates = [ c.to_timestamp().date() for c in pd.period_range(d.min, d.max, freq='D') ]
Use map on the RDD
df1 = df \
.rdd \
.map(lambda x : tuple([x.id, x.p, x.d1, x.d2 , *[ 1 if ( x.d1 <= c <=x.d2) else 0 for c in rdd_dates]])) \
.toDF(df.columns + [ c.strftime("%Y-%m-%d") for c in rdd_dates])
df1.show()
+---+---+----------+----------+----------+----------+----------+
| id| p| d1| d2|2018-06-15|2018-06-16|2018-06-17| # and so on....
+---+---+----------+----------+----------+----------+----------+
| 1| A|2018-09-26|2018-10-26| 0| 0| 0|
| 2| B|2018-06-21|2018-07-19| 0| 0| 0|
| 2| B|2018-08-13|2018-10-07| 0| 0| 0|
| 2| B|2018-12-31|2019-02-27| 0| 0| 0|
| 2| B|2019-05-28|2019-06-25| 0| 0| 0|
| 3| C|2018-06-15|2018-07-13| 1| 1| 1|
| 3| C|2018-08-15|2018-10-09| 0| 0| 0|
| 3| C|2018-12-03|2019-03-12| 0| 0| 0|
| 3| C|2019-05-10|2019-06-07| 0| 0| 0|
| 4| A|2019-01-30|2019-03-01| 0| 0| 0|
| 4| B|2019-05-30|2019-07-25| 0| 0| 0|
| 5| C|2018-09-19|2018-10-17| 0| 0| 0|
| 5| C|2019-05-16|2019-05-29| 0| 0| 0|
+---+---+----------+----------+----------+----------+----------+
IIUC, your problem can be simplified using some Spark SQL tricks:
# get start_date and end_date
d = df.select(F.min('d1').alias('start_date'), F.max('d2').alias('end_date')).first()
# get a list of month strings (using the first day of the month) between d.start_date and d.end_date
mrange = [ c.strftime("%Y-%m-01") for c in pd.period_range(d.start_date, d.end_date, freq='M') ]
#['2018-06-01',
# '2018-07-01',
# ....
# '2019-06-01',
# '2019-07-01']
Write the following Spark SQL snippet to count the number of days in each month, where {0} will be replaced by the month string (e.g. "2018-06-01") and {1} by the column name (e.g. "201806").
stmt = '''
IF(d2 < "{0}" OR d1 > LAST_DAY("{0}")
, 0
, DATEDIFF(LEAST(d2, LAST_DAY("{0}")), GREATEST(d1, TO_DATE("{0}")))
+ IF(d1 BETWEEN "{0}" AND LAST_DAY("{0}"), 0, 1)
) AS `{1}`
'''
This SQL snippet does the following, assuming m is the month string:
if (d1, d2) is out of range, i.e. d1 > LAST_DAY(m) or d2 < m, then return 0
otherwise, we calculate the datediff() between LEAST(d2, LAST_DAY(m)) and GREATEST(d1, m).
Notice there is a 1-day offset added to the above datediff(); it is applied only when d1 is NOT in the current month, i.e. not between m and LAST_DAY(m).
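To see the offset in action, here is the first sample row (d1 = 2018-09-26, d2 = 2018-10-26) traced through a small plain-Python mirror of the snippet (just an illustration, not part of the Spark job):
from datetime import date
import calendar

def days_in_month(d1, d2, m):
    # LAST_DAY(m)
    last = date(m.year, m.month, calendar.monthrange(m.year, m.month)[1])
    if d2 < m or d1 > last:
        return 0
    diff = (min(d2, last) - max(d1, m)).days        # DATEDIFF(LEAST(d2, LAST_DAY(m)), GREATEST(d1, m))
    return diff + (0 if m <= d1 <= last else 1)     # the 1-day offset, only when d1 is outside m

days_in_month(date(2018, 9, 26), date(2018, 10, 26), date(2018, 9, 1))    # 201809 -> 4  (d1 is in September, no offset)
days_in_month(date(2018, 9, 26), date(2018, 10, 26), date(2018, 10, 1))   # 201810 -> 26 (d1 is outside October, +1)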
We can then calculate the new columns using selectExpr and this SQL snippet:
df_new = df.withColumn('d1', F.to_date('d1')) \
.withColumn('d2', F.to_date('d2')) \
.selectExpr(
'id_'
, 'p'
, *[ stmt.format(m, m[:7].replace('-','')) for m in mrange ]
)
df_new.show()
+---+---+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|id_| p|201806|201807|201808|201809|201810|201811|201812|201901|201902|201903|201904|201905|201906|201907|
+---+---+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
| 1| A| 0| 0| 0| 4| 26| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B| 9| 19| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B| 0| 0| 18| 30| 7| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B| 0| 0| 0| 0| 0| 0| 0| 31| 27| 0| 0| 0| 0| 0|
| 2| B| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 3| 25| 0|
| 3| C| 15| 13| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 3| C| 0| 0| 16| 30| 9| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 3| C| 0| 0| 0| 0| 0| 0| 28| 31| 28| 12| 0| 0| 0| 0|
| 3| C| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 21| 7| 0|
| 4| A| 0| 0| 0| 0| 0| 0| 0| 1| 28| 1| 0| 0| 0| 0|
| 4| B| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 30| 25|
| 5| C| 0| 0| 0| 11| 17| 0| 0| 0| 0| 0| 0| 0| 0| 0|
+---+---+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
Edit-1: About the Quarterly list
Per your comment, I modified the SQL snippet so that you can extend it to more named date ranges. See below: {0} will be replaced by range_start_date, {1} by range_end_date, and {2} by range_name:
stmt = '''
IF(d2 < "{0}" OR d1 > "{1}"
, 0
, DATEDIFF(LEAST(d2, TO_DATE("{1}")), GREATEST(d1, TO_DATE("{0}")))
+ IF(d1 BETWEEN "{0}" AND "{1}", 0, 1)
) AS `{2}`
'''
Create a dictionary using quarter names as keys and a list of the corresponding start_date and end_date as values (this part is a pure Python or pandas problem):
range_dict = dict([
(str(c), [ c.to_timestamp().strftime("%Y-%m-%d")
,(c.to_timestamp() + pd.tseries.offsets.QuarterEnd()).strftime("%Y-%m-%d")
]) for c in pd.period_range(d.start_date, d.end_date, freq='Q')
])
#{'2018Q2': ['2018-04-01', '2018-06-30'],
# '2018Q3': ['2018-07-01', '2018-09-30'],
# '2018Q4': ['2018-10-01', '2018-12-31'],
# '2019Q1': ['2019-01-01', '2019-03-31'],
# '2019Q2': ['2019-04-01', '2019-06-30'],
# '2019Q3': ['2019-07-01', '2019-09-30']}
df_new = df.withColumn('d1', F.to_date('d1')) \
.withColumn('d2', F.to_date('d2')) \
.selectExpr(
'id_'
, 'p'
, *[ stmt.format(range_dict[n][0], range_dict[n][1], n) for n in sorted(range_dict.keys()) ]
)
df_new.show()
+---+---+------+------+------+------+------+------+
|id_| p|2018Q2|2018Q3|2018Q4|2019Q1|2019Q2|2019Q3|
+---+---+------+------+------+------+------+------+
| 1| A| 0| 4| 26| 0| 0| 0|
| 2| B| 9| 19| 0| 0| 0| 0|
| 2| B| 0| 48| 7| 0| 0| 0|
| 2| B| 0| 0| 0| 58| 0| 0|
| 2| B| 0| 0| 0| 0| 28| 0|
| 3| C| 15| 13| 0| 0| 0| 0|
| 3| C| 0| 46| 9| 0| 0| 0|
| 3| C| 0| 0| 28| 71| 0| 0|
| 3| C| 0| 0| 0| 0| 28| 0|
| 4| A| 0| 0| 0| 30| 0| 0|
| 4| B| 0| 0| 0| 0| 31| 25|
| 5| C| 0| 11| 17| 0| 0| 0|
+---+---+------+------+------+------+------+------+
Edit-2: Regarding the Segmentation errors
I tested the code with a sample dataframe of 560K rows (see below), and everything ran well in my testing environment (a VM with CentOS 7.3, 1 CPU and 2GB RAM, spark-2.4.0-bin-hadoop2.7 run in local mode in a docker container; this is far below any production environment), so I suspect it might be a Spark version issue. I rewrote the same code logic using two different approaches: one using only Spark SQL (with a TempView etc.) and another using pure dataframe API functions (similar to @SMaZ's approach). I'd like to see if either of these runs through your environment and data. BTW, given that most of the fields are numeric, I think 1M rows + 100 columns should not be very huge in terms of big-data projects.
Also, please do check whether there is missing data (null d1/d2) or incorrect data (i.e. d1 > d2), and adjust the code to handle such rows if needed, for example as sketched below.
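A minimal pre-check could look like this (a sketch; it assumes the simple policy of just dropping such rows):
from pyspark.sql import functions as F

# rows with missing dates or reversed ranges (d1 > d2)
bad_rows = df.filter(F.col('d1').isNull() | F.col('d2').isNull() | (F.col('d1') > F.col('d2')))
bad_rows.show()

# simplest policy: drop them before computing the month/quarter columns
df = df.dropna(subset=['d1', 'd2']).filter(F.col('d1') <= F.col('d2'))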
# sample data-set
import pandas as pd, numpy as np
N = 560000
df1 = pd.DataFrame({
'id_': sorted(np.random.choice(range(100),N))
, 'p': np.random.choice(list('ABCDEFGHIJKLMN'),N)
, 'd1': sorted(np.random.choice(pd.date_range('2016-06-30','2019-06-30',freq='D'),N))
, 'n': np.random.choice(list(map(lambda x: pd.Timedelta(days=x), range(300))),N)
})
df1['d2'] = df1['d1'] + df1['n']
df = spark.createDataFrame(df1)
df.printSchema()
#root
# |-- id_: long (nullable = true)
# |-- p: string (nullable = true)
# |-- d1: timestamp (nullable = true)
# |-- n: long (nullable = true)
# |-- d2: timestamp (nullable = true)
# get the overall date-range of dataset
d = df.select(F.min('d1').alias('start_date'), F.max('d2').alias('end_date')).first()
#Row(start_date=datetime.datetime(2016, 6, 29, 20, 0), end_date=datetime.datetime(2020, 4, 22, 20, 0))
# range_dict for the month data
range_dict = dict([
(c.strftime('%Y%m'), [ c.to_timestamp().date()
,(c.to_timestamp() + pd.tseries.offsets.MonthEnd()).date()
]) for c in pd.period_range(d.start_date, d.end_date, freq='M')
])
#{'201606': [datetime.date(2016, 6, 1), datetime.date(2016, 6, 30)],
# '201607': [datetime.date(2016, 7, 1), datetime.date(2016, 7, 31)],
# '201608': [datetime.date(2016, 8, 1), datetime.date(2016, 8, 31)],
# ....
# '202003': [datetime.date(2020, 3, 1), datetime.date(2020, 3, 31)],
# '202004': [datetime.date(2020, 4, 1), datetime.date(2020, 4, 30)]}
Method-1: Using Spark SQL:
# create TempView `df_table`
df.createOrReplaceTempView('df_table')
# SQL snippet to calculate new column
stmt = '''
IF(d2 < "{0}" OR d1 > "{1}"
, 0
, DATEDIFF(LEAST(d2, to_date("{1}")), GREATEST(d1, to_date("{0}")))
+ IF(d1 BETWEEN "{0}" AND "{1}", 0, 1)
) AS `{2}`
'''
# set up the SQL field list
sql_fields_list = [
'id_'
, 'p'
, *[ stmt.format(range_dict[n][0], range_dict[n][1], n) for n in sorted(range_dict.keys()) ]
]
# create SQL statement
sql_stmt = 'SELECT {} FROM df_table'.format(', '.join(sql_fields_list))
# run the Spark SQL:
df_new = spark.sql(sql_stmt)
Method-2: Using dataframe API functions:
from pyspark.sql.functions import when, col, greatest, least, lit, datediff
df_new = df.select(
'id_'
, 'p'
, *[
when((col('d2') < range_dict[n][0]) | (col('d1') > range_dict[n][1]), 0).otherwise(
datediff(least('d2', lit(range_dict[n][1])), greatest('d1', lit(range_dict[n][0])))
+ when(col('d1').between(range_dict[n][0], range_dict[n][1]), 0).otherwise(1)
).alias(n)
for n in sorted(range_dict.keys())
]
)
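If you run both methods, a quick way to confirm they return the same rows (a sketch; df_new_sql and df_new_api are hypothetical names for the Method-1 and Method-2 results, and exceptAll requires Spark 2.4+):
# both counts are 0 when the two results contain exactly the same rows
assert df_new_sql.exceptAll(df_new_api).count() == 0
assert df_new_api.exceptAll(df_new_sql).count() == 0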
If you want to avoid pandas completely (it brings the data back to the driver), then a pure PySpark-based solution can be:
from pyspark.sql import functions as psf
# Assumption: your dataframe is named sample_data and has id, p, d1, d2 columns.
# Add month and days-left columns using pyspark functions.
# A row_id is kept as well so that duplicate rows on the key columns are still handled correctly - not mandatory though.
data = sample_data.select("id", "p",
                          psf.monotonically_increasing_id().alias("row_id"),
                          psf.date_format("d2", 'yyyyMM').alias("d2_month"),   # 'yyyyMM' = calendar year + month ('YYYY' would be the week-based year)
                          psf.dayofmonth("d2").alias("d2_id"),                 # day of month of d2
                          psf.date_format("d1", 'yyyyMM').alias("d1_month"),
                          psf.datediff(psf.last_day("d1"), sample_data["d1"]).alias("d1_id"))  # days left in d1's month
data.show(5, False)
Result:
+---+---+-----------+--------+-----+--------+-----+
|id |p |row_id |d2_month|d2_id|d1_month|d1_id|
+---+---+-----------+--------+-----+--------+-----+
|1 |A |8589934592 |201810 |26 |201809 |4 |
|2 |B |25769803776|201807 |19 |201806 |9 |
|2 |B |34359738368|201810 |7 |201808 |18 |
|2  |B  |51539607552|201902  |27   |201812  |0    |
|2 |B |60129542144|201906 |25 |201905 |3 |
+---+---+-----------+--------+-----+--------+-----+
only showing top 5 rows
Then you can split the dataframe and pivot it:
####
# Create two separate dataframes by pivoting on d1_month and d2_month
####
df1 = data.groupby(["id", "p", "row_id"]).pivot("d1_month").max("d1_id")
df2 = data.groupby(["id", "p", "row_id"]).pivot("d2_month").max("d2_id")
df1.show(5, False), df2.show(5, False)
Result:
+---+---+------------+------+------+------+------+------+------+
|id |p  |row_id      |201806|201808|201809|201812|201901|201905|
+---+---+------------+------+------+------+------+------+------+
|3  |C  |85899345920 |null  |16    |null  |null  |null  |null  |
|2  |B  |51539607552 |null  |null  |null  |0     |null  |null  |
|3  |C  |77309411328 |15    |null  |null  |null  |null  |null  |
|3  |C  |103079215104|null  |null  |null  |28    |null  |null  |
|4  |A  |128849018880|null  |null  |null  |null  |1     |null  |
+---+---+------------+------+------+------+------+------+------+
only showing top 5 rows
+---+---+------------+------+------+------+------+------+------+------+
|id |p |row_id |201807|201809|201810|201902|201903|201906|201907|
+---+---+------------+------+------+------+------+------+------+------+
|3 |C |85899345920 |null |null |9 |null |null |null |null |
|2 |B |51539607552 |null |null |null |27 |null |null |null |
|3 |C |77309411328 |13 |null |null |null |null |null |null |
|3 |C |103079215104|null |null |null |null |12 |null |null |
|4 |A |128849018880|null |null |null |null |1 |null |null |
+---+---+------------+------+------+------+------+------+------+------+
only showing top 5 rows
Join back and get your result:
result = df1.join(df2, on=["id", "p","row_id"])\
.select([psf.coalesce(df1[x_], df2[x_]).alias(x_)
if (x_ in df1.columns) and (x_ in df2.columns) else x_
for x_ in set(df1.columns + df2.columns)])\
.orderBy("row_id").drop("row_id")
result.na.fill(0).show(5, False)
Result:
+------+------+------+------+---+------+------+------+------+------+------+------+------+---+
|201906|201907|201901|201810|p  |201812|201905|201902|201903|201809|201808|201807|201806|id |
+------+------+------+------+---+------+------+------+------+------+------+------+------+---+
|0     |0     |0     |26    |A  |0     |0     |0     |0     |4     |0     |0     |0     |1  |
|0     |0     |0     |0     |B  |0     |0     |0     |0     |0     |0     |19    |9     |2  |
|0     |0     |0     |7     |B  |0     |0     |0     |0     |0     |18    |0     |0     |2  |
|0     |0     |0     |0     |B  |0     |0     |27    |0     |0     |0     |0     |0     |2  |
|25    |0     |0     |0     |B  |0     |3     |0     |0     |0     |0     |0     |0     |2  |
+------+------+------+------+---+------+------+------+------+------+------+------+------+---+
only showing top 5 rows
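Since the selected column order comes from a Python set, it is arbitrary; if you want the month columns laid out chronologically, a possible follow-up is:
# keep id and p first, then the month columns sorted chronologically
month_cols = sorted(c for c in result.columns if c not in ('id', 'p'))
result.na.fill(0).select('id', 'p', *month_cols).show(5, False)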