Grouped window operation in pyspark: restart sum by condition [duplicate] - python

I have this dataframe
+---+----+---+
| A| B| C|
+---+----+---+
| 0|null| 1|
| 1| 3.0| 0|
| 2| 7.0| 0|
| 3|null| 1|
| 4| 4.0| 0|
| 5| 3.0| 0|
| 6|null| 1|
| 7|null| 1|
| 8|null| 1|
| 9| 5.0| 0|
| 10| 2.0| 0|
| 11|null| 1|
+---+----+---+
What I need to do is a cumulative sum of the values in column C that restarts whenever the value is zero.
Expected output:
+---+----+---+----+
| A| B| C| D|
+---+----+---+----+
| 0|null| 1| 1|
| 1| 3.0| 0| 0|
| 2| 7.0| 0| 0|
| 3|null| 1| 1|
| 4| 4.0| 0| 0|
| 5| 3.0| 0| 0|
| 6|null| 1| 1|
| 7|null| 1| 2|
| 8|null| 1| 3|
| 9| 5.0| 0| 0|
| 10| 2.0| 0| 0|
| 11|null| 1| 1|
+---+----+---+----+
To reproduce the dataframe:
from pyspark.shell import sc
from pyspark.sql import Window
from pyspark.sql.functions import lag, when, sum
x = sc.parallelize([
    [0, None], [1, 3.], [2, 7.], [3, None], [4, 4.],
    [5, 3.], [6, None], [7, None], [8, None], [9, 5.], [10, 2.], [11, None]])
x = x.toDF(['A', 'B'])
# Transform null values into "1"
x = x.withColumn('C', when(x.B.isNull(), 1).otherwise(0))

Create a temporary column (grp) that increments a counter each time column C is equal to 0 (the reset condition) and use this as a partitioning column for your cumulative sum.
import pyspark.sql.functions as f
from pyspark.sql import Window
x.withColumn(
    "grp",
    f.sum((f.col("C") == 0).cast("int")).over(Window.orderBy("A"))
).withColumn(
    "D",
    f.sum(f.col("C")).over(Window.partitionBy("grp").orderBy("A"))
).drop("grp").show()
#+---+----+---+---+
#| A| B| C| D|
#+---+----+---+---+
#| 0|null| 1| 1|
#| 1| 3.0| 0| 0|
#| 2| 7.0| 0| 0|
#| 3|null| 1| 1|
#| 4| 4.0| 0| 0|
#| 5| 3.0| 0| 0|
#| 6|null| 1| 1|
#| 7|null| 1| 2|
#| 8|null| 1| 3|
#| 9| 5.0| 0| 0|
#| 10| 2.0| 0| 0|
#| 11|null| 1| 1|
#+---+----+---+---+
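Side note (not part of the original answer): both windows here use Window.orderBy("A") without a partitionBy, so Spark moves all rows into a single partition to evaluate them and logs a warning. If the real data has a natural grouping key, adding it via Window.partitionBy(...).orderBy("A") keeps the computation distributed.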

Related

Create label for last 3 months data in Pyspark

I have a PySpark df with id, date, group, and label columns as shown below:
>>> df.show()
+----+------------+-----+-----+
| id | date |group|label|
+----+------------+-----+-----+
| ID1| 2021-04-30| A| 0|
| ID1| 2021-04-30| B| 0|
| ID1| 2021-04-30| C| 0|
| ID1| 2021-04-30| D| 0|
| ID1| 2021-03-31| A| 0|
| ID1| 2021-03-31| B| 1|
| ID1| 2021-03-31| C| 0|
| ID1| 2021-03-31| D| 0|
| ID1| 2021-02-28| A| 0|
| ID1| 2021-02-28| B| 0|
| ID1| 2021-02-28| C| 0|
| ID1| 2021-02-28| D| 0|
| ID1| 2021-01-31| A| 0|
| ID1| 2021-01-31| B| 0|
| ID1| 2021-01-31| C| 0|
| ID1| 2021-01-31| D| 0|
| ID1| 2020-12-31| A| 1|
| ID1| 2020-12-31| B| 0|
| ID1| 2020-12-31| C| 0|
| ID1| 2020-12-31| D| 0|
+----+------------+-----+-----+
I want to create a flag that indicates the last-3-months grouping, as shown in the group_l3m example column in the df below. Expected output:
+----+------------+-----+-----+---------+
| id | date |group|label|group_l3m|
+----+------------+-----+-----+---------+
| ID1| 2021-04-30| A| 0| A|
| ID1| 2021-03-31| A| 0| A|
| ID1| 2021-02-28| A| 0| A|
| ID1| 2021-03-31| A| 0| B|
| ID1| 2021-02-28| A| 0| B|
| ID1| 2021-01-31| A| 0| B|
| ID1| 2021-02-28| A| 0| C|
| ID1| 2021-01-31| A| 0| C|
| ID1| 2020-12-31| A| 1| C|
| ID1| 2021-04-30| B| 0| D|
| ID1| 2021-03-31| B| 1| D|
| ID1| 2021-02-28| B| 0| D|
| ID1| 2021-03-31| B| 1| E|
| ID1| 2021-02-28| B| 0| E|
| ID1| 2021-01-31| B| 0| E|
| ID1| 2021-02-28| B| 0| F|
| ID1| 2021-01-31| B| 0| F|
| ID1| 2020-12-31| B| 0| F|
| ID1| 2021-04-30| C| 0| G|
| ID1| 2021-03-31| C| 0| G|
| ID1| 2021-02-28| C| 0| G|
| ID1| 2021-03-31| C| 0| H|
| ID1| 2021-02-28| C| 0| H|
| ID1| 2021-01-31| C| 0| H|
| ID1| 2021-02-28| C| 0| I|
| ID1| 2021-01-31| C| 0| I|
| ID1| 2020-12-31| C| 0| I|
| ID1| 2021-04-30| D| 0| J|
| ID1| 2021-03-31| D| 0| J|
| ID1| 2021-02-28| D| 0| J|
| ID1| 2021-03-31| D| 0| K|
| ID1| 2021-02-28| D| 0| K|
| ID1| 2021-01-31| D| 0| K|
| ID1| 2021-02-28| D| 0| L|
| ID1| 2021-01-31| D| 0| L|
| ID1| 2020-12-31| D| 0| L|
+----+------------+-----+-----+---------+
After getting group_l3m, I can then do a groupby/cube on that column to perform the summation. Any idea how to get the expected output shown above?
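No answer to this related question is included in this excerpt. A minimal sketch of one possible approach (my own, assuming the frame holds a single id, the date and group column names as shown, and that only complete 3-month windows are kept; a dense_rank number stands in for the A, B, C, ... labels):
from pyspark.sql import functions as F, Window

bounds = df.agg(F.min(F.trunc("date", "month")).alias("lo"),
                F.max(F.trunc("date", "month")).alias("hi")).first()

exploded = (df
    .withColumn("m", F.trunc("date", "month"))
    # every month belongs to the rolling 3-month windows ending at m, m+1 and m+2
    .withColumn("w_end", F.explode(F.array("m", F.add_months("m", 1), F.add_months("m", 2))))
    # keep only complete windows whose end month lies inside the data range
    .where((F.col("w_end") >= F.add_months(F.lit(str(bounds["lo"])), 2)) &
           (F.col("w_end") <= F.lit(str(bounds["hi"])))))

# one label per (group, window end); dense_rank yields 1, 2, 3, ... instead of A, B, C, ...
result = exploded.withColumn(
    "group_l3m",
    F.dense_rank().over(Window.orderBy("group", F.desc("w_end"))))
For the sample data this produces windows ending in April, March and February per group (ranks 1-12 in place of labels A-L); a groupBy or cube on group and group_l3m then gives the per-window summation the question asks about.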

Populate month wise dataframe from two date columns

I have a PySpark dataframe like this,
+----------+--------+----------+----------+
|id_ | p |d1 | d2 |
+----------+--------+----------+----------+
| 1 | A |2018-09-26|2018-10-26|
| 2 | B |2018-06-21|2018-07-19|
| 2 | B |2018-08-13|2018-10-07|
| 2 | B |2018-12-31|2019-02-27|
| 2 | B |2019-05-28|2019-06-25|
| 3 |C |2018-06-15|2018-07-13|
| 3 |C |2018-08-15|2018-10-09|
| 3 |C |2018-12-03|2019-03-12|
| 3 |C |2019-05-10|2019-06-07|
| 4 | A |2019-01-30|2019-03-01|
| 4 | B |2019-05-30|2019-07-25|
| 5 |C |2018-09-19|2018-10-17|
+----------+--------+----------+----------+
From this dataframe I have to derive another dataframe with n columns, where each column is a month from month(min(d1)) to month(max(d2)).
I want one row in the derived dataframe for each row in the actual dataframe, and the column values must be the number of days covered in that month.
For example,
for first row, where id_ is 1 and p is A, I want to get a row in the derived dataframe where column of 201809 with value 5 and column 201810 with value 26.
For second row where id_ is 2 and p is B, I want to get a row in the derived dataframe where column of 201806 should be 9 and 201807 should be 19.
For the second last row, I want the columns 201905 filled with value 1, column 201906 with value 30, 201907 with 25.
So basically, I want the dataframe populated in such a way that, for each row in my original dataframe, there is a row in the derived dataframe where the columns corresponding to the months in the range min(d1) to max(d2) are filled with the number of days covered in that particular month.
I am currently doing this the hard way. I create n columns, one for each date from min(d1) to max(d2), fill these columns with 1, then melt the data and filter on the value. Finally I aggregate this dataframe to get my desired result and select the max-valued p.
In code:
d = df.select(F.min('d1').alias('d1'), F.max('d2').alias('d2')).first()
cols = [ c.strftime('%Y-%m-%d') for c in pd.period_range(d.d1, d.d2, freq='D') ]
result = df.select('id_', 'p', *[ F.when((df.d1 <= c)&(df.d2 >= c), 1).otherwise(0).alias(c) for c in cols ])
melted_data = melt(result, id_vars=['id_','p'], value_vars=cols)  # melt() is a user-defined helper; Spark has no built-in melt here
melted_data = melted_data.withColumn('Month', F.substring(F.regexp_replace('variable', '-', ''), 1, 6))
melted_data = melted_data.groupBy('id_', 'Month', 'p').agg(F.sum('value').alias('days'))
melted_data = melted_data.orderBy('id_', 'Month', 'days', ascending=[False, False, False])
final_data = melted_data.groupBy('id_', 'Month').agg(F.first('p').alias('p'))
This code takes a lot of time to run even on a decent configuration. How can I improve it?
How can I achieve this task in a more optimized manner? Materializing every single date in the range does not seem to be the best solution.
A small sample of the needed output is shown below,
+---+---+----------+----------+----------+----------+-------+
|id_|p |201806 |201807 |201808 | 201809 | 201810|
+---+---+----------+----------+----------+----------+-------+
| 1 | A | 0| 0 | 0| 4 | 26 |
| 2 | B | 9| 19| 0| 0 | 0 |
| 2 | B | 0| 0 | 18| 30 | 7 |
+---+---+----------+----------+----------+----------+-------+
I think it's slowing down because of freq='D' and the multiple transformations on the dataset.
Please try the below:
Edit 1: Update for quarters
Edit 2: Per comment, the start date should be included in the final result
Edit 3: Per comment, update for daily granularity
Prepared data
#Imports
import pyspark.sql.functions as f
from pyspark.sql.functions import when
import pandas as pd
df.show()
+---+---+----------+----------+
| id| p| d1| d2|
+---+---+----------+----------+
| 1| A|2018-09-26|2018-10-26|
| 2| B|2018-06-21|2018-07-19|
| 2| B|2018-08-13|2018-10-07|
| 2| B|2018-12-31|2019-02-27|
| 2| B|2019-05-28|2019-06-25|
| 3| C|2018-06-15|2018-07-13|
| 3| C|2018-08-15|2018-10-09|
| 3| C|2018-12-03|2019-03-12|
| 3| C|2019-05-10|2019-06-07|
| 4| A|2019-01-30|2019-03-01|
| 4| B|2019-05-30|2019-07-25|
| 5| C|2018-09-19|2018-10-17|
| 5| C|2019-05-16|2019-05-29| # --> Same month case
+---+---+----------+----------+
Get the min and max dates from the dataset, then build the month list with freq='M':
d = df.select(f.min('d1').alias('min'), f.max('d2').alias('max')).first()
dates = pd.period_range(d.min, d.max, freq='M').strftime("%Y%m").tolist()
dates
['201806', '201807', '201808', '201809', '201810', '201811', '201812', '201901', '201902', '201903', '201904', '201905', '201906', '201907']
Now, the final business logic using Spark date operators and functions:
df1 = df.select('id',
                'p',
                'd1',
                'd2',
                *[(when((f.trunc(df.d1, "month") == f.trunc(df.d2, "month"))
                        & (f.to_date(f.lit(c), 'yyyyMM') == f.trunc(df.d1, "month")),
                        f.datediff(df.d2, df.d1) + 1)                                       # Same month: (last day - first day) + 1
                   .when(f.to_date(f.lit(c), 'yyyyMM') == f.trunc(df.d1, "month"),
                         f.datediff(f.last_day(f.to_date(f.lit(c), 'yyyyMM')), df.d1) + 1)  # d1's month: (last day of month - d1) + 1
                   .when(f.to_date(f.lit(c), 'yyyyMM') == f.trunc(df.d2, "month"),
                         f.datediff(df.d2, f.to_date(f.lit(c), 'yyyyMM')) + 1)              # d2's month: (d2 - first day of month) + 1
                   .when(f.to_date(f.lit(c), 'yyyyMM').between(f.trunc(df.d1, "month"), df.d2),
                         f.dayofmonth(f.last_day(f.to_date(f.lit(c), 'yyyyMM'))))           # In-between months: total days in month
                   ).otherwise(0)                                                           # Remaining months: 0
                   .alias(c) for c in dates])
df1.show()
+---+---+----------+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
| id| p| d1| d2|201806|201807|201808|201809|201810|201811|201812|201901|201902|201903|201904|201905|201906|201907|
+---+---+----------+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
| 1| A|2018-09-26|2018-10-26| 0| 0| 0| 5| 26| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B|2018-06-21|2018-07-19| 10| 19| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B|2018-08-13|2018-10-07| 0| 0| 19| 30| 7| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B|2018-12-31|2019-02-27| 0| 0| 0| 0| 0| 0| 1| 31| 27| 0| 0| 0| 0| 0|
| 2| B|2019-05-28|2019-06-25| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 4| 25| 0|
| 3| C|2018-06-15|2018-07-13| 16| 13| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 3| C|2018-08-15|2018-10-09| 0| 0| 17| 30| 9| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 3| C|2018-12-03|2019-03-12| 0| 0| 0| 0| 0| 0| 29| 31| 28| 12| 0| 0| 0| 0|
| 3| C|2019-05-10|2019-06-07| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 22| 7| 0|
| 4| A|2019-01-30|2019-03-01| 0| 0| 0| 0| 0| 0| 0| 2| 28| 1| 0| 0| 0| 0|
| 4| B|2019-05-30|2019-07-25| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 2| 30| 25|
| 5| C|2018-09-19|2018-10-17| 0| 0| 0| 12| 17| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 5| C|2019-05-16|2019-05-29| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 14| 0| 0|
+---+---+----------+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
Edit 2: Update for the quarter range:
Note: Taking the quarter date-range dictionary from #jxc's answer. We are more interested in the optimal solution here; #jxc has done an excellent job, and there is no point in reinventing the wheel unless there is a performance issue.
Create the date range dictionary:
q_dates = dict([
    (str(c), [ c.to_timestamp().strftime("%Y-%m-%d"),
               (c.to_timestamp() + pd.tseries.offsets.QuarterEnd()).strftime("%Y-%m-%d") ])
    for c in pd.period_range(d.min, d.max, freq='Q')
])
# {'2018Q2': ['2018-04-01', '2018-06-30'],
# '2018Q3': ['2018-07-01', '2018-09-30'],
# '2018Q4': ['2018-10-01', '2018-12-31'],
# '2019Q1': ['2019-01-01', '2019-03-31'],
# '2019Q2': ['2019-04-01', '2019-06-30'],
# '2019Q3': ['2019-07-01', '2019-09-30']}
Now apply the business logic on quarters.
df1 = df.select('id',
                'p',
                'd1',
                'd2',
                *[(when((df.d1.between(q_dates[c][0], q_dates[c][1]))
                        & (f.trunc(df.d1, "month") == f.trunc(df.d2, "month")),
                        f.datediff(df.d2, df.d1) + 1)                               # Same month: (last day - start day) + 1
                   .when(df.d1.between(q_dates[c][0], q_dates[c][1]),
                         f.datediff(f.to_date(f.lit(q_dates[c][1])), df.d1) + 1)    # Quarter containing d1: (last day of quarter - d1) + 1
                   .when(df.d2.between(q_dates[c][0], q_dates[c][1]),
                         f.datediff(df.d2, f.to_date(f.lit(q_dates[c][0]))) + 1)    # Quarter containing d2: (d2 - first day of quarter) + 1
                   .when(f.to_date(f.lit(q_dates[c][0])).between(df.d1, df.d2),
                         f.datediff(f.to_date(f.lit(q_dates[c][1])), f.to_date(f.lit(q_dates[c][0]))) + 1)  # Fully covered quarters: all days
                   ).otherwise(0)
                   .alias(c) for c in q_dates])
df1.show()
+---+---+----------+----------+------+------+------+------+------+------+
| id| p| d1| d2|2018Q2|2018Q3|2018Q4|2019Q1|2019Q2|2019Q3|
+---+---+----------+----------+------+------+------+------+------+------+
| 1| A|2018-09-26|2018-10-26| 0| 5| 26| 0| 0| 0|
| 2| B|2018-06-21|2018-07-19| 10| 19| 0| 0| 0| 0|
| 2| B|2018-08-13|2018-10-07| 0| 49| 7| 0| 0| 0|
| 2| B|2018-12-31|2019-02-27| 0| 0| 1| 58| 0| 0|
| 2| B|2019-05-28|2019-06-25| 0| 0| 0| 0| 34| 0|
| 3| C|2018-06-15|2018-07-13| 16| 13| 0| 0| 0| 0|
| 3| C|2018-08-15|2018-10-09| 0| 47| 9| 0| 0| 0|
| 3| C|2018-12-03|2019-03-12| 0| 0| 29| 71| 0| 0|
| 3| C|2019-05-10|2019-06-07| 0| 0| 0| 0| 52| 0|
| 4| A|2019-01-30|2019-03-01| 0| 0| 0| 61| 0| 0|
| 4| B|2019-05-30|2019-07-25| 0| 0| 0| 0| 32| 25|
| 5| C|2018-09-19|2018-10-17| 0| 12| 17| 0| 0| 0|
| 5| C|2019-05-16|2019-05-29| 0| 0| 0| 0| 14| 0|
+---+---+----------+----------+------+------+------+------+------+------+
Edit 3: Per comment, update for daily granularity
Since there are far more evaluations here, we need to be careful about performance.
Approach 1: DataFrame/Dataset
Get the date list in yyyy-MM-dd format, but as strings:
df_dates = pd.period_range(d.min, d.max, freq='D').strftime("%Y-%m-%d").tolist()
Now the business logic is quite simple: it's either 1 or 0.
df1 = df.select('id',
                'p',
                'd1',
                'd2',
                *[(when(f.lit(c).between(df.d1, df.d2), 1))  # 1 for days inside the date range
                  .otherwise(0)                              # 0 for the rest of the days
                  .alias(c) for c in df_dates])
df1.show()
+---+---+----------+----------+----------+----------+----------+
| id| p| d1| d2|2018-06-15|2018-06-16|2018-06-17| # and so on....
+---+---+----------+----------+----------+----------+----------+
| 1| A|2018-09-26|2018-10-26| 0| 0| 0|
| 2| B|2018-06-21|2018-07-19| 0| 0| 0|
| 2| B|2018-08-13|2018-10-07| 0| 0| 0|
| 2| B|2018-12-31|2019-02-27| 0| 0| 0|
| 2| B|2019-05-28|2019-06-25| 0| 0| 0|
| 3| C|2018-06-15|2018-07-13| 1| 1| 1|
| 3| C|2018-08-15|2018-10-09| 0| 0| 0|
| 3| C|2018-12-03|2019-03-12| 0| 0| 0|
| 3| C|2019-05-10|2019-06-07| 0| 0| 0|
| 4| A|2019-01-30|2019-03-01| 0| 0| 0|
| 4| B|2019-05-30|2019-07-25| 0| 0| 0|
| 5| C|2018-09-19|2018-10-17| 0| 0| 0|
| 5| C|2019-05-16|2019-05-29| 0| 0| 0|
+---+---+----------+----------+----------+----------+----------+
# Due to the answer character limit, the full result is not shown.
Approach 2: RDD evaluation
Get the date list as date objects:
rdd_dates = [ c.to_timestamp().date() for c in pd.period_range(d.min, d.max, freq='D') ]
Use map on the RDD:
df1 = df \
    .rdd \
    .map(lambda x: tuple([x.id, x.p, x.d1, x.d2, *[ 1 if (x.d1 <= c <= x.d2) else 0 for c in rdd_dates ]])) \
    .toDF(df.columns + [ c.strftime("%Y-%m-%d") for c in rdd_dates ])
df1.show()
+---+---+----------+----------+----------+----------+----------+
| id| p| d1| d2|2018-06-15|2018-06-16|2018-06-17| # and so on....
+---+---+----------+----------+----------+----------+----------+
| 1| A|2018-09-26|2018-10-26| 0| 0| 0|
| 2| B|2018-06-21|2018-07-19| 0| 0| 0|
| 2| B|2018-08-13|2018-10-07| 0| 0| 0|
| 2| B|2018-12-31|2019-02-27| 0| 0| 0|
| 2| B|2019-05-28|2019-06-25| 0| 0| 0|
| 3| C|2018-06-15|2018-07-13| 1| 1| 1|
| 3| C|2018-08-15|2018-10-09| 0| 0| 0|
| 3| C|2018-12-03|2019-03-12| 0| 0| 0|
| 3| C|2019-05-10|2019-06-07| 0| 0| 0|
| 4| A|2019-01-30|2019-03-01| 0| 0| 0|
| 4| B|2019-05-30|2019-07-25| 0| 0| 0|
| 5| C|2018-09-19|2018-10-17| 0| 0| 0|
| 5| C|2019-05-16|2019-05-29| 0| 0| 0|
+---+---+----------+----------+----------+----------+----------+
IIUC, your problem can be simplified using some Spark SQL tricks:
# get start_date and end_date
d = df.select(F.min('d1').alias('start_date'), F.max('d2').alias('end_date')).first()
# get a list of month strings (using the first day of the month) between d.start_date and d.end_date
mrange = [ c.strftime("%Y-%m-01") for c in pd.period_range(d.start_date, d.end_date, freq='M') ]
#['2018-06-01',
# '2018-07-01',
# ....
# '2019-06-01',
# '2019-07-01']
Write the following Spark SQL snippet to count the number of days in each month, where {0} will be replaced by the month string (e.g. "2018-06-01") and {1} will be replaced by the column name (e.g. "201806").
stmt = '''
    IF(d2 < "{0}" OR d1 > LAST_DAY("{0}")
     , 0
     , DATEDIFF(LEAST(d2, LAST_DAY("{0}")), GREATEST(d1, TO_DATE("{0}")))
       + IF(d1 BETWEEN "{0}" AND LAST_DAY("{0}"), 0, 1)
    ) AS `{1}`
'''
This SQL snippet does the following, assuming m is the month string:
if (d1, d2) is out of range, i.e. d1 > LAST_DAY(m) or d2 < m, then return 0
otherwise, calculate the DATEDIFF() between LEAST(d2, LAST_DAY(m)) and GREATEST(d1, m).
Notice the 1-day offset added to the above DATEDIFF(): it applies only when d1 is NOT in the current month, i.e. not between m and LAST_DAY(m).
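For example (a worked instance of the formula, not part of the original answer): for the first row, d1 = 2018-09-26 and d2 = 2018-10-26. For m = 2018-09-01, LEAST(d2, LAST_DAY(m)) = 2018-09-30 and GREATEST(d1, m) = 2018-09-26, so DATEDIFF = 4 and no offset is added because d1 lies inside the month, giving 4. For m = 2018-10-01, DATEDIFF(2018-10-26, 2018-10-01) = 25 and the 1-day offset applies because d1 is outside that month, giving 26. Both match the 201809 and 201810 columns in the output below.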
We can then calculate the new columns using selectExpr and this SQL snippet:
df_new = df.withColumn('d1', F.to_date('d1')) \
           .withColumn('d2', F.to_date('d2')) \
           .selectExpr(
               'id_',
               'p',
               *[ stmt.format(m, m[:7].replace('-', '')) for m in mrange ]
           )
df_new.show()
+---+---+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|id_| p|201806|201807|201808|201809|201810|201811|201812|201901|201902|201903|201904|201905|201906|201907|
+---+---+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
| 1| A| 0| 0| 0| 4| 26| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B| 9| 19| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B| 0| 0| 18| 30| 7| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B| 0| 0| 0| 0| 0| 0| 0| 31| 27| 0| 0| 0| 0| 0|
| 2| B| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 3| 25| 0|
| 3| C| 15| 13| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 3| C| 0| 0| 16| 30| 9| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 3| C| 0| 0| 0| 0| 0| 0| 28| 31| 28| 12| 0| 0| 0| 0|
| 3| C| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 21| 7| 0|
| 4| A| 0| 0| 0| 0| 0| 0| 0| 1| 28| 1| 0| 0| 0| 0|
| 4| B| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 30| 25|
| 5| C| 0| 0| 0| 11| 17| 0| 0| 0| 0| 0| 0| 0| 0| 0|
+---+---+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
Edit-1: About the quarterly list
Per your comment, I modified the SQL snippet so that you can extend it to more named date ranges. See below: {0} will be replaced by range_start_date, {1} by range_end_date and {2} by range_name:
stmt = '''
    IF(d2 < "{0}" OR d1 > "{1}"
     , 0
     , DATEDIFF(LEAST(d2, TO_DATE("{1}")), GREATEST(d1, TO_DATE("{0}")))
       + IF(d1 BETWEEN "{0}" AND "{1}", 0, 1)
    ) AS `{2}`
'''
Create a dictionary using quarter names as keys and a list of the corresponding start_date and end_date as values (this part is a pure Python/pandas problem):
range_dict = dict([
    (str(c), [ c.to_timestamp().strftime("%Y-%m-%d"),
               (c.to_timestamp() + pd.tseries.offsets.QuarterEnd()).strftime("%Y-%m-%d") ])
    for c in pd.period_range(d.start_date, d.end_date, freq='Q')
])
#{'2018Q2': ['2018-04-01', '2018-06-30'],
# '2018Q3': ['2018-07-01', '2018-09-30'],
# '2018Q4': ['2018-10-01', '2018-12-31'],
# '2019Q1': ['2019-01-01', '2019-03-31'],
# '2019Q2': ['2019-04-01', '2019-06-30'],
# '2019Q3': ['2019-07-01', '2019-09-30']}
df_new = df.withColumn('d1', F.to_date('d1')) \
           .withColumn('d2', F.to_date('d2')) \
           .selectExpr(
               'id_',
               'p',
               *[ stmt.format(range_dict[n][0], range_dict[n][1], n) for n in sorted(range_dict.keys()) ]
           )
df_new.show()
+---+---+------+------+------+------+------+------+
|id_| p|2018Q2|2018Q3|2018Q4|2019Q1|2019Q2|2019Q3|
+---+---+------+------+------+------+------+------+
| 1| A| 0| 4| 26| 0| 0| 0|
| 2| B| 9| 19| 0| 0| 0| 0|
| 2| B| 0| 48| 7| 0| 0| 0|
| 2| B| 0| 0| 0| 58| 0| 0|
| 2| B| 0| 0| 0| 0| 28| 0|
| 3| C| 15| 13| 0| 0| 0| 0|
| 3| C| 0| 46| 9| 0| 0| 0|
| 3| C| 0| 0| 28| 71| 0| 0|
| 3| C| 0| 0| 0| 0| 28| 0|
| 4| A| 0| 0| 0| 30| 0| 0|
| 4| B| 0| 0| 0| 0| 31| 25|
| 5| C| 0| 11| 17| 0| 0| 0|
+---+---+------+------+------+------+------+------+
Edit-2: Regarding the segmentation errors
I tested the code with a sample dataframe of 560K rows (see below) and everything ran well in my testing environment (a VM, CentOS 7.3, 1 CPU and 2GB RAM, spark-2.4.0-bin-hadoop2.7 running in local mode in a Docker container; this is far below any production environment), so I suspect a Spark version issue. I rewrote the same code logic using two different approaches: one using only Spark SQL (with a TempView etc.) and another using pure dataframe API functions (similar to #SMaZ's approach). I'd like to see whether either of these runs through your environment and data. By the way, given that most of the fields are numeric, 1M rows x 100 columns should not be very large by big-data standards.
Also, please do check whether there is missing data (null d1/d2) or incorrect data (i.e. d1 > d2), and adjust the code to handle such issues if needed.
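For instance, a quick sanity check along those lines (a sketch, not part of the original answer) could be:
from pyspark.sql import functions as F

# rows with missing or inverted date ranges that the month/quarter logic would mishandle
bad = df.where(F.col("d1").isNull() | F.col("d2").isNull() | (F.col("d1") > F.col("d2")))
bad.count()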
# sample data-set
import pandas as pd, numpy as np
N = 560000
df1 = pd.DataFrame({
    'id_': sorted(np.random.choice(range(100), N)),
    'p': np.random.choice(list('ABCDEFGHIJKLMN'), N),
    'd1': sorted(np.random.choice(pd.date_range('2016-06-30', '2019-06-30', freq='D'), N)),
    'n': np.random.choice(list(map(lambda x: pd.Timedelta(days=x), range(300))), N)
})
df1['d2'] = df1['d1'] + df1['n']
df = spark.createDataFrame(df1)
df.printSchema()
#root
# |-- id_: long (nullable = true)
# |-- p: string (nullable = true)
# |-- d1: timestamp (nullable = true)
# |-- n: long (nullable = true)
# |-- d2: timestamp (nullable = true)
# get the overall date-range of dataset
d = df.select(F.min('d1').alias('start_date'), F.max('d2').alias('end_date')).first()
#Row(start_date=datetime.datetime(2016, 6, 29, 20, 0), end_date=datetime.datetime(2020, 4, 22, 20, 0))
# range_dict for the month data
range_dict = dict([
    (c.strftime('%Y%m'), [ c.to_timestamp().date(),
                           (c.to_timestamp() + pd.tseries.offsets.MonthEnd()).date() ])
    for c in pd.period_range(d.start_date, d.end_date, freq='M')
])
#{'201606': [datetime.date(2016, 6, 1), datetime.date(2016, 6, 30)],
# '201607': [datetime.date(2016, 7, 1), datetime.date(2016, 7, 31)],
# '201608': [datetime.date(2016, 8, 1), datetime.date(2016, 8, 31)],
# ....
# '202003': [datetime.date(2020, 3, 1), datetime.date(2020, 3, 31)],
# '202004': [datetime.date(2020, 4, 1), datetime.date(2020, 4, 30)]}
Method-1: Using Spark SQL:
# create TempView `df_table`
df.createOrReplaceTempView('df_table')
# SQL snippet to calculate new column
stmt = '''
    IF(d2 < "{0}" OR d1 > "{1}"
     , 0
     , DATEDIFF(LEAST(d2, to_date("{1}")), GREATEST(d1, to_date("{0}")))
       + IF(d1 BETWEEN "{0}" AND "{1}", 0, 1)
    ) AS `{2}`
'''
# set up the SQL field list
sql_fields_list = [
    'id_',
    'p',
    *[ stmt.format(range_dict[n][0], range_dict[n][1], n) for n in sorted(range_dict.keys()) ]
]
# create SQL statement
sql_stmt = 'SELECT {} FROM df_table'.format(', '.join(sql_fields_list))
# run the Spark SQL:
df_new = spark.sql(sql_stmt)
Method-2: Using dataframe API functions:
from pyspark.sql.functions import when, col, greatest, least, lit, datediff
df_new = df.select(
    'id_',
    'p',
    *[
        when((col('d2') < range_dict[n][0]) | (col('d1') > range_dict[n][1]), 0).otherwise(
            datediff(least('d2', lit(range_dict[n][1])), greatest('d1', lit(range_dict[n][0])))
            + when(col('d1').between(range_dict[n][0], range_dict[n][1]), 0).otherwise(1)
        ).alias(n)
        for n in sorted(range_dict.keys())
    ]
)
If you want to avoid pandas completely (which brings the data back to the driver), then a pure PySpark-based solution can be:
from pyspark.sql import functions as psf
# Assumption made: your dataframe's name is sample_data and it has id, p, d1, d2 columns.
# Add month and days-left columns using pyspark functions.
# A row_id is kept as well, just to ensure that duplicates on the keys are still handled correctly - no obligation though.
data = sample_data.select("id", "p",
                          psf.monotonically_increasing_id().alias("row_id"),
                          psf.date_format("d2", 'YYYYMM').alias("d2_month"),
                          psf.dayofmonth("d2").alias("d2_id"),
                          psf.date_format("d1", 'YYYYMM').alias("d1_month"),
                          psf.datediff(psf.last_day("d1"), sample_data["d1"]).alias("d1_id"))
data.show(5, False)
Result:
+---+---+-----------+--------+-----+--------+-----+
|id |p |row_id |d2_month|d2_id|d1_month|d1_id|
+---+---+-----------+--------+-----+--------+-----+
|1 |A |8589934592 |201810 |26 |201809 |4 |
|2 |B |25769803776|201807 |19 |201806 |9 |
|2 |B |34359738368|201810 |7 |201808 |18 |
|2 |B |51539607552|201902 |27 |201912 |0 |
|2 |B |60129542144|201906 |25 |201905 |3 |
+---+---+-----------+--------+-----+--------+-----+
only showing top 5 rows
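Note (an observation about the snippet above, not part of the original answer): date_format with the 'YYYY' pattern uses the week-based year, which is why 2018-12-31 appears as d1_month 201912 in this output; the pattern 'yyyyMM' would give the calendar year-month instead.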
Then you can split the dataframe and pivot it:
####
# Create two separate dataframes by pivoting on d1_month and d2_month
####
df1 = data.groupby(["id", "p", "row_id"]).pivot("d1_month").max("d1_id")
df2 = data.groupby(["id", "p", "row_id"]).pivot("d2_month").max("d2_id")
df1.show(5, False), df2.show(5, False)
Result:
+---+---+------------+------+------+------+------+------+------+------+
|id |p |row_id |201806|201808|201809|201812|201901|201905|201912|
+---+---+------------+------+------+------+------+------+------+------+
|3 |C |85899345920 |null |16 |null |null |null |null |null |
|2 |B |51539607552 |null |null |null |null |null |null |0 |
|3 |C |77309411328 |15 |null |null |null |null |null |null |
|3 |C |103079215104|null |null |null |28 |null |null |null |
|4 |A |128849018880|null |null |null |null |1 |null |null |
+---+---+------------+------+------+------+------+------+------+------+
only showing top 5 rows
+---+---+------------+------+------+------+------+------+------+------+
|id |p |row_id |201807|201809|201810|201902|201903|201906|201907|
+---+---+------------+------+------+------+------+------+------+------+
|3 |C |85899345920 |null |null |9 |null |null |null |null |
|2 |B |51539607552 |null |null |null |27 |null |null |null |
|3 |C |77309411328 |13 |null |null |null |null |null |null |
|3 |C |103079215104|null |null |null |null |12 |null |null |
|4 |A |128849018880|null |null |null |null |1 |null |null |
+---+---+------------+------+------+------+------+------+------+------+
only showing top 5 rows
Join back and get your result:
result = df1.join(df2, on=["id", "p", "row_id"])\
    .select([psf.coalesce(df1[x_], df2[x_]).alias(x_)
             if (x_ in df1.columns) and (x_ in df2.columns) else x_
             for x_ in set(df1.columns + df2.columns)])\
    .orderBy("row_id").drop("row_id")
result.na.fill(0).show(5, False)
Result:
+------+------+------+------+------+---+------+------+------+------+------+------+------+------+---+
|201906|201907|201912|201901|201810|p |201812|201905|201902|201903|201809|201808|201807|201806|id |
+------+------+------+------+------+---+------+------+------+------+------+------+------+------+---+
|0 |0 |0 |0 |26 |A |0 |0 |0 |0 |4 |0 |0 |0 |1 |
|0 |0 |0 |0 |0 |B |0 |0 |0 |0 |0 |0 |19 |9 |2 |
|0 |0 |0 |0 |7 |B |0 |0 |0 |0 |0 |18 |0 |0 |2 |
|0 |0 |0 |0 |0 |B |0 |0 |27 |0 |0 |0 |0 |0 |2 |
|25 |0 |0 |0 |0 |B |0 |3 |0 |0 |0 |0 |0 |0 |2 |
+------+------+------+------+------+---+------+------+------+------+------+------+------+------+---+
only showing top 5 rows

Change all row values in a window pyspark dataframe based on a condition

I have a pyspark dataframe with three columns: id, seq, and seq_checker. I need to order by id and check for 4 consecutive 1's in the seq_checker column.
I tried using window functions. I'm unable to change all values in a window based on a condition.
new_window = Window.partitionBy().orderBy("id").rangeBetween(0, 3)
output = df.withColumn('check_sequence',F.when(F.min(df['seq_checker']).over(new_window) == 1, True))
original pyspark df:
+---+---+-----------+--------------+
| Id|seq|seq_checker|check_sequence|
+---+---+-----------+--------------+
| 1| 1| 1| false|
| 2| 2| 1| false|
| 3| 3| 1| false|
| 4| 4| 1| false|
| 5| 10| 0| false|
| 6| 14| 1| false|
| 7| 13| 1| false|
| 8| 18| 0| false|
| 9| 23| 0| false|
| 10| 5| 0| false|
| 11| 56| 0| false|
| 12| 66| 0| false|
| 13| 34| 1| false|
| 14| 35| 1| false|
| 15| 36| 1| false|
| 16| 37| 1| false|
| 17| 39| 0| false|
| 18| 54| 0| false|
| 19| 68| 0| false|
| 20| 90| 0| false|
+---+---+-----------+--------------+
Required output:
+---+---+-----------+--------------+
| Id|seq|seq_checker|check_sequence|
+---+---+-----------+--------------+
| 1| 1| 1| true|
| 2| 2| 1| true|
| 3| 3| 1| true|
| 4| 4| 1| true|
| 5| 10| 0| false|
| 6| 14| 1| false|
| 7| 13| 1| false|
| 8| 18| 0| false|
| 9| 23| 0| false|
| 10| 5| 0| false|
| 11| 56| 0| false|
| 12| 66| 0| false|
| 13| 34| 1| true|
| 14| 35| 1| true|
| 15| 36| 1| true|
| 16| 37| 1| true|
| 17| 39| 0| false|
| 18| 54| 0| false|
| 19| 68| 0| false|
| 20| 90| 0| false|
+---+---+-----------+--------------+
Based on the above code, my output is:
+---+---+-----------+--------------+
| Id|seq|seq_checker|check_sequence|
+---+---+-----------+--------------+
| 1| 1| 1| true|
| 2| 2| 1| null|
| 3| 3| 1| null|
| 4| 4| 1| null|
| 5| 10| 0| null|
| 6| 14| 1| null|
| 7| 13| 1| null|
| 8| 18| 0| null|
| 9| 23| 0| null|
| 10| 5| 0| null|
| 11| 56| 0| null|
| 12| 66| 0| null|
| 13| 34| 1| true|
| 14| 35| 1| null|
| 15| 36| 1| null|
| 16| 37| 1| null|
| 17| 39| 0| null|
| 18| 54| 0| null|
| 19| 68| 0| null|
| 20| 90| 0| null|
+---+---+-----------+--------------+
Edit:
1. If we have more than 4 consecutive rows with 1's, we need to set the check_sequence flag to True for all of those rows.
My actual problem is to check for sequences of length greater than 4 in the 'seq' column. I was able to create the seq_checker column using the lead and lag functions.
Initially define a window with just the id ordering. Then use a difference-of-row-numbers approach (with a different ordering) to assign the same group number to consecutive 1's (it also groups consecutive identical values). Once the grouping is done, just check whether the max and min of the group are 1 and there are at least four 1's in the group, to get the desired boolean output.
from pyspark.sql import Window
from pyspark.sql.functions import row_number, count, min, max

w1 = Window.orderBy(df.id)
w2 = Window.orderBy(df.seq_checker, df.id)
groups = df.withColumn('grp', row_number().over(w1) - row_number().over(w2))

w3 = Window.partitionBy(groups.grp)
output = groups.withColumn('check_seq',
                           (max(groups.seq_checker).over(w3) == 1) &
                           (min(groups.seq_checker).over(w3) == 1) &
                           (count(groups.id).over(w3) >= 4))
output.show()
rangeBetween gives you access to rows relative to the current row. You defined a window over (0, 3), which gives you access to the current row and the three following rows, but this only sets the correct value for the first of 4 consecutive 1's. The second element of 4 consecutive 1's needs access to the previous row and the following two rows (-1, 2). The third element needs access to the two previous rows and the following row (-2, 1). Finally, the fourth element needs access to the three previous rows (-3, 0).
import pyspark.sql.functions as F
from pyspark.sql import Window
l = [
( 1, 1, 1),
( 2, 2, 1),
( 3, 3, 1),
( 4, 4, 1),
( 5, 10, 0),
( 6, 14, 1),
( 7, 13, 1),
( 8, 18, 0),
( 9, 23, 0),
( 10, 5, 0),
( 11, 56, 0),
( 12, 66, 0),
( 13, 34, 1),
( 14, 35, 1),
( 15, 36, 1),
( 16, 37, 1),
( 17, 39, 0),
( 18, 54, 0),
( 19, 68, 0),
( 20, 90, 0)
]
columns = ['Id','seq','seq_checker']
df=spark.createDataFrame(l, columns)
w1 = Window.partitionBy().orderBy("id").rangeBetween(0, 3)
w2 = Window.partitionBy().orderBy("id").rangeBetween(-1, 2)
w3 = Window.partitionBy().orderBy("id").rangeBetween(-2, 1)
w4 = Window.partitionBy().orderBy("id").rangeBetween(-3, 0)
output = df.withColumn('check_sequence',F.when(
(F.min(df['seq_checker']).over(w1) == 1) |
(F.min(df['seq_checker']).over(w2) == 1) |
(F.min(df['seq_checker']).over(w3) == 1) |
(F.min(df['seq_checker']).over(w4) == 1)
, True).otherwise(False))
output.show()
Output:
+---+---+-----------+--------------+
| Id|seq|seq_checker|check_sequence|
+---+---+-----------+--------------+
| 1| 1| 1| true|
| 2| 2| 1| true|
| 3| 3| 1| true|
| 4| 4| 1| true|
| 5| 10| 0| null|
| 6| 14| 1| null|
| 7| 13| 1| null|
| 8| 18| 0| null|
| 9| 23| 0| null|
| 10| 5| 0| null|
| 11| 56| 0| null|
| 12| 66| 0| null|
| 13| 34| 1| true|
| 14| 35| 1| true|
| 15| 36| 1| true|
| 16| 37| 1| true|
| 17| 39| 0| null|
| 18| 54| 0| null|
| 19| 68| 0| null|
| 20| 90| 0| null|
+---+---+-----------+--------------+

Add unique identifier (Serial No.) for consecutive column values in pyspark

I created a dataframe using
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.functions import col, concat, lit
from pyspark.sql import Window

df = pd.DataFrame({"b": ['A','A','A','A','A','A','A','B','B','B','C','C','D','D','D','D','D','D','D','D','D','D','D'],
                   "Sno": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23],
                   "a": [3,-4,2,-1,-3,1,-7,-6,-4,-5,-1,1,4,5,-3,2,-5,4,-4,-2,5,-5,-4]})
df2 = spark.createDataFrame(df)
df2 = df2.withColumn("pos_neg", col("a") < 0)
df2 = df2.withColumn("huyguyg", concat(col("b"), lit(" "), col("pos_neg")))
+---+---+---+-------+---+-------+
| b|Sno| a|pos_neg|val|huyguyg|
+---+---+---+-------+---+-------+
| B| 8| -6| true| 1| B true|
| B| 9| -4| true| 1| B true|
| B| 10| -5| true| 1| B true|
| D| 13| 4| false| 0|D false|
| D| 14| 5| false| 0|D false|
| D| 15| -3| true| 1| D true|
| D| 16| 2| false| 1|D false|
| D| 17| -5| true| 2| D true|
| D| 18| 4| false| 2|D false|
| D| 19| -4| true| 3| D true|
| D| 20| -2| true| 3| D true|
| D| 21| 5| false| 3|D false|
| D| 22| -5| true| 4| D true|
| D| 23| -4| true| 4| D true|
| C| 11| -1| true| 1| C true|
| C| 12| 1| false| 1|C false|
| A| 1| 3| false| 0|A false|
| A| 2| -4| true| 1| A true|
| A| 3| 2| false| 1|A false|
| A| 4| -1| true| 2| A true|
+---+---+---+-------+---+-------+
I want an additional column at the end which adds a unique identifier (serial no.) for consecutive values. For instance, the starting value in column 'huyguyg' is 'B true', so it gets a number, say '1'; since the next 2 values are also 'B true' they also get '1'. Subsequently the serial number increases, and it stays constant for consecutive rows with the same 'huyguyg' value.
Any support in this regard will be helpful. A lag function may be useful here, but I am not able to sum the numbers:
df2 = df2.withColumn("serial no.",(df2.pos_neg != F.lag('pos_neg').over(w)).cast('int'))
Simple! Just use the dense_rank function with an orderBy clause.
Here is how it looks:
from pyspark.sql.functions import dense_rank

df3 = df2.withColumn("denseRank", dense_rank().over(Window.orderBy(df2.huyguyg)))
df3.show()
+---+---+---+-------+-------+---------+
|Sno| a| b|pos_neg|huyguyg|denseRank|
+---+---+---+-------+-------+---------+
| 1| 3| A| false|A false| 1|
| 3| 2| A| false|A false| 1|
| 6| 1| A| false|A false| 1|
| 2| -4| A| true| A true| 2|
| 4| -1| A| true| A true| 2|
| 5| -3| A| true| A true| 2|
| 7| -7| A| true| A true| 2|
| 8| -6| B| true| B true| 3|
| 9| -4| B| true| B true| 3|
| 10| -5| B| true| B true| 3|
| 12| 1| C| false|C false| 4|
| 11| -1| C| true| C true| 5|
| 13| 4| D| false|D false| 6|
| 14| 5| D| false|D false| 6|
| 16| 2| D| false|D false| 6|
| 18| 4| D| false|D false| 6|
| 21| 5| D| false|D false| 6|
| 15| -3| D| true| D true| 7|
| 17| -5| D| true| D true| 7|
| 19| -4| D| true| D true| 7|
+---+---+---+-------+-------+---------+
only showing top 20 rows
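If the identifier should instead increase only when the value changes while scanning in Sno order (which is what the lag-based attempt above was heading towards), a common pattern (a sketch, not part of this answer) is to cumulatively sum the change flag:
from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.orderBy("Sno")
# 1 where the value differs from the previous row, 0 otherwise (the first row has no previous value)
change = (F.col("huyguyg") != F.lag("huyguyg").over(w)).cast("int")
df2 = df2.withColumn("serial_no", F.sum(F.coalesce(change, F.lit(0))).over(w) + 1)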

How to set new flag based on condition in pyspark?

I have two data frames like below.
df = spark.createDataFrame(sc.parallelize([[1,1,2],[1,2,9],[2,1,2],[2,2,1],
                                           [4,1,5],[4,2,6],[5,1,3],[5,2,8]]), ["sid","cid","Cr"])
df.show()
+---+---+---+
|sid|cid| Cr|
+---+---+---+
| 1| 1| 2|
| 1| 2| 9|
| 2| 1| 2|
| 2| 2| 1|
| 4| 1| 5|
| 4| 2| 6|
| 5| 1| 3|
| 5| 2| 8|
+---+---+---+
Next, I created df1 as below.
df1 = spark.createDataFrame(sc.parallelize([[1,1],[1,2],[1,3], [2,1],[2,2],[2,3],[4,1],[4,2],[4,3],[5,1],[5,2],[5,3]]), ["sid","cid"])
df1.show()
+---+---+
|sid|cid|
+---+---+
| 1| 1|
| 1| 2|
| 1| 3|
| 2| 1|
| 2| 2|
| 2| 3|
| 4| 1|
| 4| 2|
| 4| 3|
| 5| 1|
| 5| 2|
| 5| 3|
+---+---+
Now I want my final output to be like below, i.e. if a (sid, cid) pair from df1 is present in df, i.e.
(df1.sid == df.sid) & (df1.cid == df.cid), then the flag value is 1, else 0,
and the missing Cr values should be 0:
+---+---+---+----+
|sid|cid| Cr|flag|
+---+---+---+----+
| 1| 1| 2| 1 |
| 1| 2| 9| 1 |
| 1| 3| 0| 0 |
| 2| 1| 2| 1 |
| 2| 2| 1| 1 |
| 2| 3| 0| 0 |
| 4| 1| 5| 1 |
| 4| 2| 6| 1 |
| 4| 3| 0| 0 |
| 5| 1| 3| 1 |
| 5| 2| 8| 1 |
| 5| 3| 0| 0 |
+---+---+---+----+
Please help me with this.
With data:
from pyspark.sql.functions import col, when, lit, coalesce
df = spark.createDataFrame(
    [(1, 1, 2), (1, 2, 9), (2, 1, 2), (2, 2, 1), (4, 1, 5), (4, 2, 6), (5, 1, 3), (5, 2, 8)],
    ("sid", "cid", "Cr"))
df1 = spark.createDataFrame(
    [[1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3], [4, 1], [4, 2], [4, 3], [5, 1], [5, 2], [5, 3]],
    ["sid", "cid"])
outer join:
joined = (df.alias("df")
.join(
df1.alias("df1"),
(col("df.sid") == col("df1.sid")) & (col("df.cid") == col("df1.cid")),
"rightouter"))
and select
joined.select(
    col("df1.*"),
    coalesce(col("Cr"), lit(0)).alias("Cr"),
    col("df.sid").isNotNull().cast("integer").alias("flag")
).orderBy("sid", "cid").show()
# +---+---+---+----+
# |sid|cid| Cr|flag|
# +---+---+---+----+
# | 1| 1| 2| 1|
# | 1| 2| 9| 1|
# | 1| 3| 0| 0|
# | 2| 1| 2| 1|
# | 2| 2| 1| 1|
# | 2| 3| 0| 0|
# | 4| 1| 5| 1|
# | 4| 2| 6| 1|
# | 4| 3| 0| 0|
# | 5| 1| 3| 1|
# | 5| 2| 8| 1|
# | 5| 3| 0| 0|
# +---+---+---+----+
