Pyspark : How to lead from specific column value in Dataframe - python

I have a dataframe, call it df, that is already sorted by date.
The col1 == 1 value is unique; only the 0 values have duplicates.
It looks like this:
+--------+----+----+
date |col1|col2|
+--------+----+----+
2020-08-01| 5| -1|
2020-08-02| 4| -1|
2020-08-03| 3| 3|
2020-08-04| 2| 2|
2020-08-05| 1| 4|
2020-08-06| 0| 1|
2020-08-07| 0| 2|
2020-08-08| 0| 3|
2020-08-09| 0| -1|
+--------+----+----+
The condition is: at the row where col1 == 1, take the value of col2 (here 4) and count upwards going backwards in date (e.g. 4, 5, 6, 7, 8, ...).
Every row after the col1 == 1 row gets 0 (e.g. 4, 0, 0, 0, 0, ...).
So my resulting df will look something like this:
+--------+----+----+----+
date |col1|col2|want
+--------+----+----+----+
2020-08-01| 5| -1| 8 |
2020-08-02| 4| -1| 7 |
2020-08-03| 3| 3| 6 |
2020-08-04| 2| 2| 5 |
2020-08-05| 1| 4| 4 |
2020-08-06| 0| 1| 0 |
2020-08-07| 0| 2| 0 |
2020-08-08| 0| 3| 0 |
2020-08-09| 0| -1| 0 |
+---------+----+----+----+
Enhancement:
I want to add an additional condition for when col2 == -1 at the col1 == 1 row and the -1 values run consecutively: count the consecutive -1s, then add that count to the next col2 value that is not -1. Here is an example to make it clear.
+--------+----+----+----+
date |col1|col2|want
+--------+----+----+----+
2020-08-01| 5| -1| 11|
2020-08-02| 4| -1| 10|
2020-08-03| 3| 3| 9 |
2020-08-04| 2| 2| 8 |
2020-08-05| 1| -1| 7 |
2020-08-06| 0| -1| 0 |
2020-08-07| 0| -1| 0 |
2020-08-08| 0| 4| 0 |
2020-08-09| 0| -1| 0 |
+---------+----+----+----+
Here we see 3 consecutive -1s (we only care about the first run of consecutive -1s), and after the run we have 4, so we get 4 + 3 = 7 at the col1 == 1 row. Is it possible?

Here is my try:
from pyspark.sql.functions import sum, when, col, rank, desc
from pyspark.sql import Window
w1 = Window.orderBy(desc('date'))
w2 = Window.partitionBy('case').orderBy(desc('date'))
df.withColumn('case', sum(when(col('col1') == 1, col('col2')).otherwise(0)).over(w1)) \
.withColumn('rank', when(col('case') != 0, rank().over(w2) - 1).otherwise(0)) \
.withColumn('want', col('case') + col('rank')) \
.orderBy('date') \
.show(10, False)
+----------+----+----+----+----+----+
|date |col1|col2|case|rank|want|
+----------+----+----+----+----+----+
|2020-08-01|5 |-1 |4 |4 |8 |
|2020-08-02|4 |-1 |4 |3 |7 |
|2020-08-03|3 |3 |4 |2 |6 |
|2020-08-04|2 |2 |4 |1 |5 |
|2020-08-05|1 |4 |4 |0 |4 |
|2020-08-06|0 |1 |0 |0 |0 |
|2020-08-07|0 |2 |0 |0 |0 |
|2020-08-08|0 |3 |0 |0 |0 |
|2020-08-09|0 |-1 |0 |0 |0 |
+----------+----+----+----+----+----+
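As a reference for checking any Spark implementation, the rule described in the question (including the enhancement) can be sketched in plain Python on lists. This is only an illustration: the function name compute_want is hypothetical, and the handling of a run of -1 values that reaches the end of the data (treated as adding 0) is an assumption, since the question does not cover that case.

```python
def compute_want(col1, col2):
    """Return the 'want' column for rows that are already sorted by date."""
    anchor = col1.index(1)  # the question states that col1 == 1 is unique
    # Count the consecutive -1 values starting at the anchor row.
    run = 0
    i = anchor
    while i < len(col2) and col2[i] == -1:
        run += 1
        i += 1
    # Add the first value after the run (assumption: 0 if the run reaches the end).
    base = run + (col2[i] if i < len(col2) else 0)
    # Rows at or before the anchor count upwards going back in time; later rows get 0.
    return [base + (anchor - k) if k <= anchor else 0 for k in range(len(col1))]


col1 = [5, 4, 3, 2, 1, 0, 0, 0, 0]
print(compute_want(col1, [-1, -1, 3, 2, 4, 1, 2, 3, -1]))     # first example
print(compute_want(col1, [-1, -1, 3, 2, -1, -1, -1, 4, -1]))  # enhancement example
```

When the anchor row does not hold -1, the run length is 0 and the rule reduces to the original one, so a single function covers both tables.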

Related

best way to perform one-hot encoding and explode column with prefix pyspark

I am trying to replicate pandas code in PySpark 2.x.
Say I have a dataframe as follows:
age education country
0 22 A Canada
1 34 B Mongolia
2 55 A Peru
3 44 C Korea
Usually in pandas I would scale numerical columns and one hot encode categorical and get:
age education_A education_B education_C country_Canada country_Mongolia ...
0 0 1 0 0 1 0
1 0.3 0 1 0 0 0
2 1 1 0 0 0 0 ...
3 0.7 0 0 1 ... ...
In pyspark I've done
from pyspark.ml.feature import StringIndexer, OneHotEncoder
str_indexer1 = StringIndexer(inputCol="education", outputCol="education_si", handleInvalid="skip")
str_indexer2 = StringIndexer(inputCol="country", outputCol="country_si", handleInvalid="skip")
mod_df = str_indexer1.fit(df).transform(df)
mod_df = str_indexer2.fit(df).transform(mod_df)
# The encoders consume the numeric index columns, not the raw string columns
ohe1 = OneHotEncoder(inputCol="education_si", outputCol="education_ohe")
ohe2 = OneHotEncoder(inputCol="country_si", outputCol="country_ohe")
ohe1.fit(mod_df).transform(mod_df)
This gives me
age education country education_si country_si education_ohe
0 0 A Canada 1 1 (1,0,0,0)
1 0.3 B Mongolia 2 2 (0,1,0,0)
2 1 A Peru 1 3 (1,0,0,0)
3 0.7 C Korea 3 4 (0,0,1,0)
From here I cannot figure out how to explode education_ohe into education_A, etc.
How can I do this, and is there a more efficient way to perform one-hot encoding and scaling on a large dataframe?
There is a built-in OneHotEncoder in pyspark.ml.feature, but I could not get it to produce true one-hot encoded columns.
That being said the following code will get the desired result.
Using the following dataframe.
df.show()
+---+---------+--------+
|age|education| country|
+---+---------+--------+
| 22| A| Canada|
| 34| B|Mongolia|
| 55| A| Peru|
| 44| C| Korea|
+---+---------+--------+
You can select all the distinct column values, and iteratively create additional columns.
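from pyspark.sql import functions
# Collect the distinct values of each column on the driver
variables_dict = {}
for col in df.columns:
    variables_dict[col] = [row[col] for row in df.select(col).distinct().collect()]
# Add one 0/1 indicator column per (column, value) pair
for col in variables_dict.keys():
    for val in variables_dict[col]:
        df = df.withColumn("{}_{}".format(col, val), functions.when((df[col] == val), 1).otherwise(0))
df.show()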
+---+---------+--------+------+------+------+------+-----------+-----------+-----------+--------------+-------------+------------+----------------+
|age|education| country|age_22|age_44|age_55|age_34|education_A|education_C|education_B|country_Canada|country_Korea|country_Peru|country_Mongolia|
+---+---------+--------+------+------+------+------+-----------+-----------+-----------+--------------+-------------+------------+----------------+
| 22| A| Canada| 1| 0| 0| 0| 1| 0| 0| 1| 0| 0| 0|
| 34| B|Mongolia| 0| 0| 0| 1| 0| 0| 1| 0| 0| 0| 1|
| 55| A| Peru| 0| 0| 1| 0| 1| 0| 0| 0| 0| 1| 0|
| 44| C| Korea| 0| 1| 0| 0| 0| 1| 0| 0| 1| 0| 0|
+---+---------+--------+------+------+------+------+-----------+-----------+-----------+--------------+-------------+------------+----------------+
You can then use the same variables_dict to apply the one_hot_encoder to another dataframe.
df2.show()
+---+---------+--------+
|age|education| country|
+---+---------+--------+
| 22| A| Canada|
| 66| B|Mongolia|
| 55| D| Peru|
| 44| C| China|
+---+---------+--------+
The dataframe above has column values that have not been seen before, i.e. (66, D, China).
We can use the following code to transform 'df2' to have identical columns as 'df'.
for col in variables_dict.keys():
    for val in variables_dict[col]:
        df2 = df2.withColumn("{}_{}".format(col, val), functions.when((df2[col] == val), 1).otherwise(0))
df2.show()
+---+---------+--------+------+------+------+------+-----------+-----------+-----------+--------------+-------------+------------+----------------+
|age|education| country|age_22|age_44|age_55|age_34|education_A|education_C|education_B|country_Canada|country_Korea|country_Peru|country_Mongolia|
+---+---------+--------+------+------+------+------+-----------+-----------+-----------+--------------+-------------+------------+----------------+
| 22| A| Canada| 1| 0| 0| 0| 1| 0| 0| 1| 0| 0| 0|
| 66| B|Mongolia| 0| 0| 0| 0| 0| 0| 1| 0| 0| 0| 1|
| 55| D| Peru| 0| 0| 1| 0| 0| 0| 0| 0| 0| 1| 0|
| 44| C| China| 0| 1| 0| 0| 0| 1| 0| 0| 0| 0| 0|
+---+---------+--------+------+------+------+------+-----------+-----------+-----------+--------------+-------------+------------+----------------+
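The idea behind this answer (learn the value list once, then reuse it so every dataframe gets identical indicator columns) can be sketched without Spark. The helper names fit_categories and one_hot are hypothetical and the rows are plain dicts; this only illustrates the logic and is not a replacement for the Spark code.

```python
def fit_categories(rows, columns):
    """Collect the distinct values seen in each column, in first-seen order."""
    seen = {c: [] for c in columns}
    for row in rows:
        for c in columns:
            if row[c] not in seen[c]:
                seen[c].append(row[c])
    return seen


def one_hot(row, categories):
    """Add one 0/1 indicator per known (column, value) pair; unseen values give all zeros."""
    out = dict(row)
    for c, values in categories.items():
        for v in values:
            out["{}_{}".format(c, v)] = 1 if row[c] == v else 0
    return out


train = [
    {"education": "A", "country": "Canada"},
    {"education": "B", "country": "Mongolia"},
    {"education": "A", "country": "Peru"},
    {"education": "C", "country": "Korea"},
]
categories = fit_categories(train, ["education", "country"])
# "D" was never seen during fitting, so every education_* indicator is 0 for this row.
print(one_hot({"education": "D", "country": "Peru"}, categories))
```

Fixing the category list up front is what keeps the column set stable between the two dataframes, at the cost of silently encoding unseen values as all zeros.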

Pyspark get first value from a column for each group

I have a data frame in pyspark which would look like this
|Id1| id2 |row |grp |
|12 | 1234 |1 | 1 |
|23 | 1123 |2 | 1 |
|45 | 2343 |3 | 2 |
|65 | 2345 |1 | 2 |
|67 | 3456 |2 | 2 |
I need to retrieve value for id2 corresponding to row = 1 and update all id2 values within a grp to that value.
This should be the final result
|Id1 | id2 |row |grp|
|12 |1234 |1 |1 |
|23 |1234 |2 |1 |
|45 |2345 |3 |2 |
|65 |2345 |1 |2 |
|67 |2345 |2 |2 |
I tried doing something like df.groupby('grp').sort('row').first('id2')
But apparently sort and orderby don't work with groupby in pyspark.
Any idea how to go about this?
Very similar to @Steven's answer, without using .rowsBetween
You basically create a Window for each grp, then sort the rows by row and pick the first id2 for each grp.
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window.partitionBy('grp').orderBy('row')
df = df.withColumn('id2', F.first('id2').over(w))
df.show()
+---+----+---+---+
|Id1| id2|row|grp|
+---+----+---+---+
| 12|1234| 1| 1|
| 23|1234| 2| 1|
| 65|2345| 1| 2|
| 67|2345| 2| 2|
| 45|2345| 3| 2|
+---+----+---+---+
try this :
from pyspark.sql import functions as F, Window as W
df.withColumn(
"id2",
F.first("id2").over(
W.partitionBy("grp")
.orderBy("row")
.rowsBetween(W.unboundedPreceding, W.currentRow)
),
).show()
+---+----+---+---+
|id1| id2|row|grp|
+---+----+---+---+
| 12|1234| 1| 1|
| 23|1234| 2| 1|
| 65|2345| 1| 2|
| 67|2345| 2| 2|
| 45|2345| 3| 2|
+---+----+---+---+
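To make the window logic concrete, here is a minimal plain-Python sketch of what "first id2 per grp, ordered by row" computes. The function name fill_first_id2 is hypothetical and the rows are plain dicts mirroring the question's table.

```python
def fill_first_id2(rows):
    """Overwrite id2 in every row with the id2 of the lowest-'row' entry in its grp."""
    first = {}
    for r in rows:
        grp = r["grp"]
        if grp not in first or r["row"] < first[grp]["row"]:
            first[grp] = r
    return [dict(r, id2=first[r["grp"]]["id2"]) for r in rows]


rows = [
    {"Id1": 12, "id2": 1234, "row": 1, "grp": 1},
    {"Id1": 23, "id2": 1123, "row": 2, "grp": 1},
    {"Id1": 45, "id2": 2343, "row": 3, "grp": 2},
    {"Id1": 65, "id2": 2345, "row": 1, "grp": 2},
    {"Id1": 67, "id2": 3456, "row": 2, "grp": 2},
]
for r in fill_first_id2(rows):
    print(r)
```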

Count occurrences in pyspark dataframe

I need to count the occurrences of repeated values in a pyspark dataframe as shown.
In short, the count increases while the value stays the same and resets when the value changes. I need the count in a new column.
What I have:
+------+
| val |
+------+
| 0 |
| 0 |
| 0 |
| 1 |
| 1 |
| 2 |
| 2 |
| 2 |
| 3 |
| 3 |
| 3 |
| 3 |
+------+
What I need:
+------+-----+
| val |ocurr|
+------+-----+
| 0 | 0 |
| 0 | 1 |
| 0 | 2 |
| 1 | 0 |
| 1 | 1 |
| 2 | 0 |
| 2 | 1 |
| 2 | 2 |
| 3 | 0 |
| 3 | 1 |
| 3 | 2 |
| 3 | 3 |
+------+-----+
Use the when and lag functions to group consecutive equal values, then use row_number to get the counts. You need an appropriate ordering column; my temporary ordering column id is not a good choice because it is not guaranteed to preserve the order.
df = spark.createDataFrame([0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 0, 0, 0], 'int').toDF('val')
from pyspark.sql.functions import *
from pyspark.sql import Window
w1 = Window.orderBy('id')
w2 = Window.partitionBy('group').orderBy('id')
df.withColumn('id', monotonically_increasing_id()) \
.withColumn('group', sum(when(col('val') == lag('val', 1, 1).over(w1), 0).otherwise(1)).over(w1)) \
.withColumn('order', row_number().over(w2) - 1) \
.orderBy('id').show()
+---+---+-----+-----+
|val| id|group|order|
+---+---+-----+-----+
| 0| 0| 1| 0|
| 0| 1| 1| 1|
| 0| 2| 1| 2|
| 1| 3| 2| 0|
| 1| 4| 2| 1|
| 2| 5| 3| 0|
| 2| 6| 3| 1|
| 2| 7| 3| 2|
| 3| 8| 4| 0|
| 3| 9| 4| 1|
| 3| 10| 4| 2|
| 3| 11| 4| 3|
| 0| 12| 5| 0|
| 0| 13| 5| 1|
| 0| 14| 5| 2|
+---+---+-----+-----+
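The same "start a new group whenever the value changes, then number the rows inside each group" logic can be checked quickly in plain Python. This sketch assumes the values are already in the desired order; the function name run_counters is hypothetical.

```python
from itertools import groupby


def run_counters(values):
    """Number the items inside each run of consecutive equal values, starting at 0."""
    out = []
    for _, run in groupby(values):  # groupby splits on every change of value
        out.extend(range(len(list(run))))
    return out


vals = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 0, 0, 0]
print(run_counters(vals))
```

Because itertools.groupby only merges adjacent equal items, the trailing zeros form their own run, just like group 5 in the Spark output above.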

How to add column with range values to DataFrame

I have dataframe with the current structure
user_id | country | event |
1 | CA | 1 |
2 | USA | 1 |
and I want to add the new column with period range (0-n) and get something like this
user_id | country | event |period|
1 | CA | 1 |1
1 | CA | 1 |2
1 | CA | 1 |...
1 | CA | 1 |n
2 | USA | 1 |1
2 | USA | 1 |2
2 | USA | 1 |...
2 | USA | 1 |n
As I understand it, this should involve a window function and withColumn:
w = Window.partitionBy(['user_id', 'country', 'event'])
df = df.withColumn('period', (???).over(w))
How can I add the new column and, at the same time, new rows for some range?
First use spark.range() to create a second DataFrame containing the periods. For example, with n=3:
n = 3
periods = spark.range(1, n+1).withColumnRenamed("id", "period")
periods.show()
#+------+
#|period|
#+------+
#| 1|
#| 2|
#| 3|
#+------+
Now crossJoin this with df to get the desired output:
df = df.crossJoin(periods)
df.show()
#+-------+-------+-----+------+
#|user_id|country|event|period|
#+-------+-------+-----+------+
#| 1| CA| 1| 1|
#| 1| CA| 1| 2|
#| 1| CA| 1| 3|
#| 2| USA| 1| 1|
#| 2| USA| 1| 2|
#| 2| USA| 1| 3|
#+-------+-------+-----+------+
Note that range doesn't actually materialize the DataFrame, so the Cartesian product will not be expensive.
df.explain()
#== Physical Plan ==
#BroadcastNestedLoopJoin BuildRight, Cross
#:- Scan ExistingRDD[user_id#0,country#1,event#2]
#+- BroadcastExchange IdentityBroadcastMode
# +- *(1) Project [id#31L AS period#33L]
# +- *(1) Range (1, 4, step=1, splits=2)
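Conceptually, the cross join pairs every input row with every period. A minimal plain-Python sketch of the same expansion, using tuples in place of dataframe rows (the variable names are illustrative only):

```python
from itertools import product

rows = [(1, "CA", 1), (2, "USA", 1)]  # (user_id, country, event)
n = 3

# Pair each row with each period 1..n, like df.crossJoin(periods).
expanded = [row + (period,) for row, period in product(rows, range(1, n + 1))]

for r in expanded:
    print(r)
```

The result has len(rows) * n entries, which is why this only stays cheap while n is small relative to the data.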

Populate month wise dataframe from two date columns

I have a PySpark dataframe like this,
+----------+--------+----------+----------+
|id_ | p |d1 | d2 |
+----------+--------+----------+----------+
| 1 | A |2018-09-26|2018-10-26|
| 2 | B |2018-06-21|2018-07-19|
| 2 | B |2018-08-13|2018-10-07|
| 2 | B |2018-12-31|2019-02-27|
| 2 | B |2019-05-28|2019-06-25|
| 3 |C |2018-06-15|2018-07-13|
| 3 |C |2018-08-15|2018-10-09|
| 3 |C |2018-12-03|2019-03-12|
| 3 |C |2019-05-10|2019-06-07|
| 4 | A |2019-01-30|2019-03-01|
| 4 | B |2019-05-30|2019-07-25|
| 5 |C |2018-09-19|2018-10-17|
-------------------------------------------
From this dataframe I have to derive another dataframe which has n columns, where each column is a month from month(min(d1)) to month(max(d2)).
I want a row in the derived dataframe for each row in the original dataframe, and the column values must be the number of days covered in that month.
For example,
for first row, where id_ is 1 and p is A, I want to get a row in the derived dataframe where column of 201809 with value 5 and column 201810 with value 26.
For second row where id_ is 2 and p is B, I want to get a row in the derived dataframe where column of 201806 should be 9 and 201807 should be 19.
For the second last row, I want the columns 201905 filled with value 1, column 201906 with value 30, 201907 with 25.
So basically, I want the derived dataframe to have one row for each row in my original dataframe, where the columns for the months in the range min(d1) to max(d2) are filled with the number of days covered in that particular month.
I am currently doing this the hard way. I am making n columns, one for each date from min(d1) to max(d2). I am filling these columns with 1, then melting the data and filtering based on value. Finally I aggregate this dataframe to get my desired result, then select the p with the max value.
In code:
d = df.select(F.min('d1').alias('d1'), F.max('d2').alias('d2')).first()
cols = [ c.strftime('%Y-%m-%d') for c in pd.period_range(d.d1, d.d2, freq='D') ]
result = df.select('id_', 'p', *[ F.when((df.d1 <= c)&(df.d2 >= c), 1).otherwise(0).alias(c) for c in cols ])
melted_data = melt(result, id_vars=['id_','p'], value_vars=cols)
melted_data = melted_data.withColumn('Month', F.substring(F.regexp_replace('variable', '-', ''), 1, 6))
melted_data = melted_data.groupBy('id_', 'Month', 'p').agg(F.sum('value').alias('days'))
melted_data = melted_data.orderBy('id_', 'Month', 'days', ascending=[False, False, False])
final_data = melted_data.groupBy('id_', 'Month').agg(F.first('p').alias('p'))
This code takes a lot of time to run, even on decent configurations. How can I improve this?
How can I achieve this task in a more optimized manner? Generating every single date in the range doesn't seem to be the best solution.
A small sample of the needed output is shown below:
+---+---+----------+----------+----------+----------+-------+
|id_|p |201806 |201807 |201808 | 201809 | 201810|
+---+---+----------+----------+----------+----------+-------+
| 1 | A | 0| 0 | 0| 4 | 26 |
| 2 | B | 9| 19| 0| 0 | 0 |
| 2 | B | 0| 0 | 18| 30 | 7 |
+---+---+----------+----------+----------+----------+-------+
I think it's slowing down because of the freq='D' and multiple transformations on dataset.
Please try below:
Edit 1: Update for the quarter
Edit 2: Per comment, the start date should be included in Final result
Edit 3: Per comment, Update for the daily
Prepared data
#Imports
import pyspark.sql.functions as f
from pyspark.sql.functions import when
import pandas as pd
df.show()
+---+---+----------+----------+
| id| p| d1| d2|
+---+---+----------+----------+
| 1| A|2018-09-26|2018-10-26|
| 2| B|2018-06-21|2018-07-19|
| 2| B|2018-08-13|2018-10-07|
| 2| B|2018-12-31|2019-02-27|
| 2| B|2019-05-28|2019-06-25|
| 3| C|2018-06-15|2018-07-13|
| 3| C|2018-08-15|2018-10-09|
| 3| C|2018-12-03|2019-03-12|
| 3| C|2019-05-10|2019-06-07|
| 4| A|2019-01-30|2019-03-01|
| 4| B|2019-05-30|2019-07-25|
| 5| C|2018-09-19|2018-10-17|
| 5| C|2019-05-16|2019-05-29| # --> Same month case
+---+---+----------+----------+
Get the min and max date from a dataset with month frequency freq='M'
d = df.select(f.min('d1').alias('min'), f.max('d2').alias('max')).first()
dates = pd.period_range(d.min, d.max, freq='M').strftime("%Y%m").tolist()
dates
['201806', '201807', '201808', '201809', '201810', '201811', '201812', '201901', '201902', '201903', '201904', '201905', '201906', '201907']
Now the final business logic, using Spark date operators and functions
df1 = df.select('id',
'p',
'd1',
'd2', *[ (when( (f.trunc(df.d1, "month") == f.trunc(df.d2, "month")) & (f.to_date(f.lit(c),'yyyyMM') == f.trunc(df.d1, "month"))
, f.datediff(df.d2 , df.d1) +1 ) # Same month: (last day - first day) + 1
.when(f.to_date(f.lit(c),'yyyyMM') == f.trunc(df.d1, "month") ,
f.datediff(f.last_day(f.to_date(f.lit(c),'yyyyMM')) , df.d1) +1 ) # d1 date (Last day - current day)
.when(f.to_date(f.lit(c),'yyyyMM') == f.trunc(df.d2, "month") ,
f.datediff(df.d2, f.to_date(f.lit(c),'yyyyMM')) +1 ) # d2 date (Currentday - Firstday)
.when(f.to_date(f.lit(c),'yyyyMM').between(f.trunc(df.d1, "month"), df.d2),
f.dayofmonth(f.last_day(f.to_date(f.lit(c),'yyyyMM')))) # Between date (Total days in month)
).otherwise(0) # Rest of the months (0)
.alias(c) for c in dates ])
df1.show()
+---+---+----------+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
| id| p| d1| d2|201806|201807|201808|201809|201810|201811|201812|201901|201902|201903|201904|201905|201906|201907|
+---+---+----------+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
| 1| A|2018-09-26|2018-10-26| 0| 0| 0| 5| 26| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B|2018-06-21|2018-07-19| 10| 19| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B|2018-08-13|2018-10-07| 0| 0| 19| 30| 7| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B|2018-12-31|2019-02-27| 0| 0| 0| 0| 0| 0| 1| 31| 27| 0| 0| 0| 0| 0|
| 2| B|2019-05-28|2019-06-25| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 4| 25| 0|
| 3| C|2018-06-15|2018-07-13| 16| 13| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 3| C|2018-08-15|2018-10-09| 0| 0| 17| 30| 9| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 3| C|2018-12-03|2019-03-12| 0| 0| 0| 0| 0| 0| 29| 31| 28| 12| 0| 0| 0| 0|
| 3| C|2019-05-10|2019-06-07| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 22| 7| 0|
| 4| A|2019-01-30|2019-03-01| 0| 0| 0| 0| 0| 0| 0| 2| 28| 1| 0| 0| 0| 0|
| 4| B|2019-05-30|2019-07-25| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 2| 30| 25|
| 5| C|2018-09-19|2018-10-17| 0| 0| 0| 12| 17| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 5| C|2019-05-16|2019-05-29| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 14| 0| 0|
+---+---+----------+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
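The chained when() cases above amount to an inclusive interval overlap: clip [d1, d2] to the month and count the days, with both end dates included. A small plain-Python sketch of that reading (the function name days_in_month is hypothetical) can be used to spot-check individual cells of the table:

```python
import calendar
from datetime import date


def days_in_month(d1, d2, year, month):
    """Days of [d1, d2] (both ends inclusive) that fall in the given month."""
    month_start = date(year, month, 1)
    month_end = date(year, month, calendar.monthrange(year, month)[1])
    overlap = (min(d2, month_end) - max(d1, month_start)).days + 1
    return max(overlap, 0)  # no overlap gives a non-positive value, reported as 0


d1, d2 = date(2018, 9, 26), date(2018, 10, 26)
for month in (8, 9, 10, 11):
    print(month, days_in_month(d1, d2, 2018, month))
```

Writing it as a single min/max expression avoids the separate "same month", "start month", "end month" and "in between" branches.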
Edit 2: Update for the quarter range:
Note: Taking the quarter date range dictionary from @jxc's answer. We are more interested in the optimal solution here. @jxc has done an excellent job, and there is no point in reinventing the wheel unless there is a performance issue.
Create date range dictionary :
q_dates = dict([
(str(c), [ c.to_timestamp().strftime("%Y-%m-%d") ,(c.to_timestamp() + pd.tseries.offsets.QuarterEnd()).strftime("%Y-%m-%d")
]) for c in pd.period_range(d.min, d.max, freq='Q')
])
# {'2018Q2': ['2018-04-01', '2018-06-30'],
# '2018Q3': ['2018-07-01', '2018-09-30'],
# '2018Q4': ['2018-10-01', '2018-12-31'],
# '2019Q1': ['2019-01-01', '2019-03-31'],
# '2019Q2': ['2019-04-01', '2019-06-30'],
# '2019Q3': ['2019-07-01', '2019-09-30']}
Now Apply business logic on quarters.
df1 = df.select('id',
'p',
'd1',
'd2',
*[(when( (df.d1.between(q_dates[c][0], q_dates[c][1])) & (df.d2.between(q_dates[c][0], q_dates[c][1])),
f.datediff(df.d2 , df.d1) +1 ) # Same quarter ((End day - start day) +1 )
.when(df.d1.between(q_dates[c][0], q_dates[c][1]),
f.datediff(f.to_date(f.lit(q_dates[c][1])), df.d1) +1) # Min date , remaining days (Last day of quarter - Min day)
.when(df.d2.between(q_dates[c][0], q_dates[c][1]),
f.datediff(df.d2, f.to_date(f.lit(q_dates[c][0]))) +1 ) # Max date , remaining days (Max day - Start day of quarter )
.when(f.to_date(f.lit(q_dates[c][0])).between(df.d1, df.d2),
f.datediff(f.to_date(f.lit(q_dates[c][1])), f.to_date(f.lit(q_dates[c][0]))) +1) # All remaining days
).otherwise(0)
.alias(c) for c in q_dates ])
df1.show()
+---+---+----------+----------+------+------+------+------+------+------+
| id| p| d1| d2|2018Q2|2018Q3|2018Q4|2019Q1|2019Q2|2019Q3|
+---+---+----------+----------+------+------+------+------+------+------+
| 1| A|2018-09-26|2018-10-26| 0| 5| 26| 0| 0| 0|
| 2| B|2018-06-21|2018-07-19| 10| 19| 0| 0| 0| 0|
| 2| B|2018-08-13|2018-10-07| 0| 49| 7| 0| 0| 0|
| 2| B|2018-12-31|2019-02-27| 0| 0| 1| 58| 0| 0|
| 2| B|2019-05-28|2019-06-25| 0| 0| 0| 0| 29| 0|
| 3| C|2018-06-15|2018-07-13| 16| 13| 0| 0| 0| 0|
| 3| C|2018-08-15|2018-10-09| 0| 47| 9| 0| 0| 0|
| 3| C|2018-12-03|2019-03-12| 0| 0| 29| 71| 0| 0|
| 3| C|2019-05-10|2019-06-07| 0| 0| 0| 0| 29| 0|
| 4| A|2019-01-30|2019-03-01| 0| 0| 0| 31| 0| 0|
| 4| B|2019-05-30|2019-07-25| 0| 0| 0| 0| 32| 25|
| 5| C|2018-09-19|2018-10-17| 0| 12| 17| 0| 0| 0|
| 5| C|2019-05-16|2019-05-29| 0| 0| 0| 0| 14| 0|
+---+---+----------+----------+------+------+------+------+------+------+
Edit 3: Per comment, Update for the daily
Since there are many more columns to evaluate here, we need to be careful about performance.
Approach 1 : Dataframe/Dataset
Get Date list in yyyy-MM-dd format but as string
df_dates = pd.period_range(d.min, d.max, freq='D').strftime("%Y-%m-%d").tolist()
Now the business logic is quite simple. It's either 1 or 0
df1 = df.select('id'
, 'p'
, 'd1'
,'d2'
, *[ (when(f.lit(c).between (df.d1, df.d2),1)) # For date range 1
.otherwise(0) # For rest of days
.alias(c) for c in df_dates ])
df1.show()
+---+---+----------+----------+----------+----------+----------+
| id| p| d1| d2|2018-06-15|2018-06-16|2018-06-17| # and so on....
+---+---+----------+----------+----------+----------+----------+
| 1| A|2018-09-26|2018-10-26| 0| 0| 0|
| 2| B|2018-06-21|2018-07-19| 0| 0| 0|
| 2| B|2018-08-13|2018-10-07| 0| 0| 0|
| 2| B|2018-12-31|2019-02-27| 0| 0| 0|
| 2| B|2019-05-28|2019-06-25| 0| 0| 0|
| 3| C|2018-06-15|2018-07-13| 1| 1| 1|
| 3| C|2018-08-15|2018-10-09| 0| 0| 0|
| 3| C|2018-12-03|2019-03-12| 0| 0| 0|
| 3| C|2019-05-10|2019-06-07| 0| 0| 0|
| 4| A|2019-01-30|2019-03-01| 0| 0| 0|
| 4| B|2019-05-30|2019-07-25| 0| 0| 0|
| 5| C|2018-09-19|2018-10-17| 0| 0| 0|
| 5| C|2019-05-16|2019-05-29| 0| 0| 0|
+---+---+----------+----------+----------+----------+----------+
# Due to answer character limit unable to give the result.
Approach 2: RDD evaluations
Get Date list as a date object
rdd_dates = [ c.to_timestamp().date() for c in pd.period_range(d.min, d.max, freq='D') ]
Use map from rdd
df1 = df \
.rdd \
.map(lambda x : tuple([x.id, x.p, x.d1, x.d2 , *[ 1 if ( x.d1 <= c <=x.d2) else 0 for c in rdd_dates]])) \
.toDF(df.columns + [ c.strftime("%Y-%m-%d") for c in rdd_dates])
df1.show()
+---+---+----------+----------+----------+----------+----------+
| id| p| d1| d2|2018-06-15|2018-06-16|2018-06-17| # and so on....
+---+---+----------+----------+----------+----------+----------+
| 1| A|2018-09-26|2018-10-26| 0| 0| 0|
| 2| B|2018-06-21|2018-07-19| 0| 0| 0|
| 2| B|2018-08-13|2018-10-07| 0| 0| 0|
| 2| B|2018-12-31|2019-02-27| 0| 0| 0|
| 2| B|2019-05-28|2019-06-25| 0| 0| 0|
| 3| C|2018-06-15|2018-07-13| 1| 1| 1|
| 3| C|2018-08-15|2018-10-09| 0| 0| 0|
| 3| C|2018-12-03|2019-03-12| 0| 0| 0|
| 3| C|2019-05-10|2019-06-07| 0| 0| 0|
| 4| A|2019-01-30|2019-03-01| 0| 0| 0|
| 4| B|2019-05-30|2019-07-25| 0| 0| 0|
| 5| C|2018-09-19|2018-10-17| 0| 0| 0|
| 5| C|2019-05-16|2019-05-29| 0| 0| 0|
+---+---+----------+----------+----------+----------+----------+
IIUC, your problem can be simplified using some Spark SQL tricks:
from pyspark.sql import functions as F
import pandas as pd
# get start_date and end_date
d = df.select(F.min('d1').alias('start_date'), F.max('d2').alias('end_date')).first()
# get a list of month strings (using the first day of the month) between d.start_date and d.end_date
mrange = [ c.strftime("%Y-%m-01") for c in pd.period_range(d.start_date, d.end_date, freq='M') ]
#['2018-06-01',
# '2018-07-01',
# ....
# '2019-06-01',
# '2019-07-01']
Write the following Spark SQL snippet to count the number of days in each month, where {0} will be replaced by the month string (e.g. "2018-06-01") and {1} by the column name (e.g. "201806").
stmt = '''
IF(d2 < "{0}" OR d1 > LAST_DAY("{0}")
, 0
, DATEDIFF(LEAST(d2, LAST_DAY("{0}")), GREATEST(d1, TO_DATE("{0}")))
+ IF(d1 BETWEEN "{0}" AND LAST_DAY("{0}"), 0, 1)
) AS `{1}`
'''
This SQL snippet does the following, assuming m is the month string:
if (d1, d2) is out of range, i.e. d1 > last_day(m) or d2 < m, then return 0
otherwise, we calculate the datediff() between LEAST(d2, LAST_DAY(m)) and GREATEST(d1, m).
Notice there is a 1-day offset in the above datediff() calculation. It is added only when d1 is NOT in the current month, i.e. when d1 is not between m and LAST_DAY(m).
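For readers who want to trace the arithmetic, the SQL expression can be transliterated into plain Python. This is only a sketch for checking values by hand; the function name month_days_sql is hypothetical, and m is the first day of the month, as in the "{0}" placeholder.

```python
import calendar
from datetime import date


def month_days_sql(d1, d2, m):
    """Mirror of the SQL snippet: m is the first day of the month being counted."""
    last = date(m.year, m.month, calendar.monthrange(m.year, m.month)[1])  # LAST_DAY(m)
    if d2 < m or d1 > last:
        return 0
    diff = (min(d2, last) - max(d1, m)).days  # DATEDIFF(LEAST(...), GREATEST(...))
    return diff + (0 if m <= d1 <= last else 1)  # +1 only when d1 is outside this month


d1, d2 = date(2018, 9, 26), date(2018, 10, 26)
print(month_days_sql(d1, d2, date(2018, 9, 1)))
print(month_days_sql(d1, d2, date(2018, 10, 1)))
```

With this formula the start date d1 itself is not counted in its own month, which matches the 4 (rather than 5) requested for 201809 in the question's sample output.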
We can then calculate the new columns using selectExpr and this SQL snippet:
df_new = df.withColumn('d1', F.to_date('d1')) \
.withColumn('d2', F.to_date('d2')) \
.selectExpr(
'id_'
, 'p'
, *[ stmt.format(m, m[:7].replace('-','')) for m in mrange ]
)
df_new.show()
+---+---+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|id_| p|201806|201807|201808|201809|201810|201811|201812|201901|201902|201903|201904|201905|201906|201907|
+---+---+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
| 1| A| 0| 0| 0| 4| 26| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B| 9| 19| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B| 0| 0| 18| 30| 7| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B| 0| 0| 0| 0| 0| 0| 0| 31| 27| 0| 0| 0| 0| 0|
| 2| B| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 3| 25| 0|
| 3| C| 15| 13| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 3| C| 0| 0| 16| 30| 9| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 3| C| 0| 0| 0| 0| 0| 0| 28| 31| 28| 12| 0| 0| 0| 0|
| 3| C| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 21| 7| 0|
| 4| A| 0| 0| 0| 0| 0| 0| 0| 1| 28| 1| 0| 0| 0| 0|
| 4| B| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 30| 25|
| 5| C| 0| 0| 0| 11| 17| 0| 0| 0| 0| 0| 0| 0| 0| 0|
+---+---+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
Edit-1: About the Quarterly list
Per your comment, I modified the SQL snippet so that you can extend it to more named date ranges. See below: {0} will be replaced by range_start_date, {1} by range_end_date and {2} by range_name:
stmt = '''
IF(d2 < "{0}" OR d1 > "{1}"
, 0
, DATEDIFF(LEAST(d2, TO_DATE("{1}")), GREATEST(d1, TO_DATE("{0}")))
+ IF(d1 BETWEEN "{0}" AND "{1}", 0, 1)
) AS `{2}`
'''
Create a dictionary using quarter name as keys and a list of corresponding start_date and end_date as values: (this part is a pure python or pandas problem)
range_dict = dict([
(str(c), [ c.to_timestamp().strftime("%Y-%m-%d")
,(c.to_timestamp() + pd.tseries.offsets.QuarterEnd()).strftime("%Y-%m-%d")
]) for c in pd.period_range(d.start_date, d.end_date, freq='Q')
])
#{'2018Q2': ['2018-04-01', '2018-06-30'],
# '2018Q3': ['2018-07-01', '2018-09-30'],
# '2018Q4': ['2018-10-01', '2018-12-31'],
# '2019Q1': ['2019-01-01', '2019-03-31'],
# '2019Q2': ['2019-04-01', '2019-06-30'],
# '2019Q3': ['2019-07-01', '2019-09-30']}
df_new = df.withColumn('d1', F.to_date('d1')) \
.withColumn('d2', F.to_date('d2')) \
.selectExpr(
'id_'
, 'p'
, *[ stmt.format(range_dict[n][0], range_dict[n][1], n) for n in sorted(range_dict.keys()) ]
)
df_new.show()
+---+---+------+------+------+------+------+------+
|id_| p|2018Q2|2018Q3|2018Q4|2019Q1|2019Q2|2019Q3|
+---+---+------+------+------+------+------+------+
| 1| A| 0| 4| 26| 0| 0| 0|
| 2| B| 9| 19| 0| 0| 0| 0|
| 2| B| 0| 48| 7| 0| 0| 0|
| 2| B| 0| 0| 0| 58| 0| 0|
| 2| B| 0| 0| 0| 0| 28| 0|
| 3| C| 15| 13| 0| 0| 0| 0|
| 3| C| 0| 46| 9| 0| 0| 0|
| 3| C| 0| 0| 28| 71| 0| 0|
| 3| C| 0| 0| 0| 0| 28| 0|
| 4| A| 0| 0| 0| 30| 0| 0|
| 4| B| 0| 0| 0| 0| 31| 25|
| 5| C| 0| 11| 17| 0| 0| 0|
+---+---+------+------+------+------+------+------+
Edit-2: Regarding the Segmentation errors
I tested the code with a sample dataframe of 56K rows (see below), and everything ran well in my testing environment (VM, CentOS 7.3, 1 CPU and 2GB RAM, spark-2.4.0-bin-hadoop2.7 running in local mode in a docker container; this is far below any production environment). So I wonder whether it is a Spark version issue. I rewrote the same logic using two different approaches: one uses only Spark SQL (with a TempView etc.) and the other uses pure dataframe API functions (similar to @SMaZ's approach). I'd like to see whether either of these can run in your environment with your data. BTW, given that most of the fields are numeric, I think 1M rows + 100 columns should not be very large for a big data project.
Also, please check whether there is missing data (null for d1/d2) or incorrect data (e.g. d1 > d2), and adjust the code to handle such issues if needed.
# sample data-set
import pandas as pd, numpy as np
N = 560000
df1 = pd.DataFrame({
'id_': sorted(np.random.choice(range(100),N))
, 'p': np.random.choice(list('ABCDEFGHIJKLMN'),N)
, 'd1': sorted(np.random.choice(pd.date_range('2016-06-30','2019-06-30',freq='D'),N))
, 'n': np.random.choice(list(map(lambda x: pd.Timedelta(days=x), range(300))),N)
})
df1['d2'] = df1['d1'] + df1['n']
df = spark.createDataFrame(df1)
df.printSchema()
#root
# |-- id_: long (nullable = true)
# |-- p: string (nullable = true)
# |-- d1: timestamp (nullable = true)
# |-- n: long (nullable = true)
# |-- d2: timestamp (nullable = true)
# get the overall date-range of dataset
d = df.select(F.min('d1').alias('start_date'), F.max('d2').alias('end_date')).first()
#Row(start_date=datetime.datetime(2016, 6, 29, 20, 0), end_date=datetime.datetime(2020, 4, 22, 20, 0))
# range_dict for the month data
range_dict = dict([
(c.strftime('%Y%m'), [ c.to_timestamp().date()
,(c.to_timestamp() + pd.tseries.offsets.MonthEnd()).date()
]) for c in pd.period_range(d.start_date, d.end_date, freq='M')
])
#{'201606': [datetime.date(2016, 6, 1), datetime.date(2016, 6, 30)],
# '201607': [datetime.date(2016, 7, 1), datetime.date(2016, 7, 31)],
# '201608': [datetime.date(2016, 8, 1), datetime.date(2016, 8, 31)],
# ....
# '202003': [datetime.date(2020, 3, 1), datetime.date(2020, 3, 31)],
# '202004': [datetime.date(2020, 4, 1), datetime.date(2020, 4, 30)]}
Method-1: Using Spark SQL:
# create TempView `df_table`
df.createOrReplaceTempView('df_table')
# SQL snippet to calculate new column
stmt = '''
IF(d2 < "{0}" OR d1 > "{1}"
, 0
, DATEDIFF(LEAST(d2, to_date("{1}")), GREATEST(d1, to_date("{0}")))
+ IF(d1 BETWEEN "{0}" AND "{1}", 0, 1)
) AS `{2}`
'''
# set up the SQL field list
sql_fields_list = [
'id_'
, 'p'
, *[ stmt.format(range_dict[n][0], range_dict[n][1], n) for n in sorted(range_dict.keys()) ]
]
# create SQL statement
sql_stmt = 'SELECT {} FROM df_table'.format(', '.join(sql_fields_list))
# run the Spark SQL:
df_new = spark.sql(sql_stmt)
Method-2: Using dataframe API functions:
from pyspark.sql.functions import when, col, greatest, least, lit, datediff
df_new = df.select(
'id_'
, 'p'
, *[
when((col('d2') < range_dict[n][0]) | (col('d1') > range_dict[n][1]), 0).otherwise(
datediff(least('d2', lit(range_dict[n][1])), greatest('d1', lit(range_dict[n][0])))
+ when(col('d1').between(range_dict[n][0], range_dict[n][1]), 0).otherwise(1)
).alias(n)
for n in sorted(range_dict.keys())
]
)
If you want to avoid pandas completely (which brings the data back to driver) then a pure pyspark based solution can be:
from pyspark.sql import functions as psf
# Assumption made: your dataframe's name is : sample_data and has id, p, d1, d2 columns.
# Add month and days left column using pyspark functions
# I have kept a row id as well just to ensure that if you have duplicates in your data on the keys then it would still be able to handle it - no obligations though
data = sample_data.select("id", "p",
    psf.monotonically_increasing_id().alias("row_id"),
    # 'yyyy' is the calendar year. 'YYYY' is the week-based year, which puts 2018-12-31 in 201912 (the stray 201912 column in the results)
    psf.date_format("d2", 'yyyyMM').alias("d2_month"),
    psf.dayofmonth("d2").alias("d2_id"),
    psf.date_format("d1", 'yyyyMM').alias("d1_month"),
    psf.datediff(psf.last_day("d1"), sample_data["d1"]).alias("d1_id"))
data.show(5, False)
Result:
+---+---+-----------+--------+-----+--------+-----+
|id |p |row_id |d2_month|d2_id|d1_month|d1_id|
+---+---+-----------+--------+-----+--------+-----+
|1 |A |8589934592 |201810 |26 |201809 |4 |
|2 |B |25769803776|201807 |19 |201806 |9 |
|2 |B |34359738368|201810 |7 |201808 |18 |
|2 |B |51539607552|201902 |27 |201912 |0 |
|2 |B |60129542144|201906 |25 |201905 |3 |
+---+---+-----------+--------+-----+--------+-----+
only showing top 5 rows
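The d1_month value 201912 in the fourth row above belongs to a start date of 2018-12-31, which is consistent with a week-based year being used in place of the calendar year. Python's ISO calendar gives a close analogue of that effect; note this is an illustration in plain Python under that assumption, not a reproduction of Spark's formatter, whose week rules may depend on locale.

```python
from datetime import date

d = date(2018, 12, 31)

calendar_year = d.year          # the year the date is written with
week_year = d.isocalendar()[0]  # the year that owns the ISO week containing the date

print(calendar_year, week_year)
# Formatting the month with each year shows how the label can drift by one year.
print("{}{:02d}".format(calendar_year, d.month), "{}{:02d}".format(week_year, d.month))
```

Dates in the last days of December or the first days of January are the ones affected, so month labels built from a week-based year should be checked around year boundaries.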
Then you can split the dataframe and pivot it:
####
# Create two separate dataframes by pivoting on d1_month and d2_month
####
df1 = data.groupby(["id", "p", "row_id"]).pivot("d1_month").max("d1_id")
df2 = data.groupby(["id", "p", "row_id"]).pivot("d2_month").max("d2_id")
df1.show(5, False), df2.show(5, False)
Result:
+---+---+------------+------+------+------+------+------+------+------+
|id |p |row_id |201806|201808|201809|201812|201901|201905|201912|
+---+---+------------+------+------+------+------+------+------+------+
|3 |C |85899345920 |null |16 |null |null |null |null |null |
|2 |B |51539607552 |null |null |null |null |null |null |0 |
|3 |C |77309411328 |15 |null |null |null |null |null |null |
|3 |C |103079215104|null |null |null |28 |null |null |null |
|4 |A |128849018880|null |null |null |null |1 |null |null |
+---+---+------------+------+------+------+------+------+------+------+
only showing top 5 rows
+---+---+------------+------+------+------+------+------+------+------+
|id |p |row_id |201807|201809|201810|201902|201903|201906|201907|
+---+---+------------+------+------+------+------+------+------+------+
|3 |C |85899345920 |null |null |9 |null |null |null |null |
|2 |B |51539607552 |null |null |null |27 |null |null |null |
|3 |C |77309411328 |13 |null |null |null |null |null |null |
|3 |C |103079215104|null |null |null |null |12 |null |null |
|4 |A |128849018880|null |null |null |null |1 |null |null |
+---+---+------------+------+------+------+------+------+------+------+
only showing top 5 rows
Join back and get your result:
result = df1.join(df2, on=["id", "p","row_id"])\
.select([psf.coalesce(df1[x_], df2[x_]).alias(x_)
if (x_ in df1.columns) and (x_ in df2.columns) else x_
for x_ in set(df1.columns + df2.columns)])\
.orderBy("row_id").drop("row_id")
result.na.fill(0).show(5, False)
Result:
+------+------+------+------+------+---+------+------+------+------+------+------+------+------+---+
|201906|201907|201912|201901|201810|p |201812|201905|201902|201903|201809|201808|201807|201806|id |
+------+------+------+------+------+---+------+------+------+------+------+------+------+------+---+
|0 |0 |0 |0 |26 |A |0 |0 |0 |0 |4 |0 |0 |0 |1 |
|0 |0 |0 |0 |0 |B |0 |0 |0 |0 |0 |0 |19 |9 |2 |
|0 |0 |0 |0 |7 |B |0 |0 |0 |0 |0 |18 |0 |0 |2 |
|0 |0 |0 |0 |0 |B |0 |0 |27 |0 |0 |0 |0 |0 |2 |
|25 |0 |0 |0 |0 |B |0 |3 |0 |0 |0 |0 |0 |0 |2 |
+------+------+------+------+------+---+------+------+------+------+------+------+------+------+---+
only showing top 5 rows
