Create label for last 3 months data in Pyspark - python

I have a PySpark DataFrame with id, date, group and label columns, as shown below:
>>> df.show()
+----+------------+-----+-----+
| id | date |group|label|
+----+------------+-----+-----+
| ID1| 2021-04-30| A| 0|
| ID1| 2021-04-30| B| 0|
| ID1| 2021-04-30| C| 0|
| ID1| 2021-04-30| D| 0|
| ID1| 2021-03-31| A| 0|
| ID1| 2021-03-31| B| 1|
| ID1| 2021-03-31| C| 0|
| ID1| 2021-03-31| D| 0|
| ID1| 2021-02-28| A| 0|
| ID1| 2021-02-28| B| 0|
| ID1| 2021-02-28| C| 0|
| ID1| 2021-02-28| D| 0|
| ID1| 2021-01-31| A| 0|
| ID1| 2021-01-31| B| 0|
| ID1| 2021-01-31| C| 0|
| ID1| 2021-01-31| D| 0|
| ID1| 2020-12-31| A| 1|
| ID1| 2020-12-31| B| 0|
| ID1| 2020-12-31| C| 0|
| ID1| 2020-12-31| D| 0|
+----+------------+-----+-----+
I want to create a flag that indicates the last-3-months grouping, as shown in the group_l3m example column below. Expected output:
+----+------------+-----+-----+---------+
| id | date |group|label|group_l3m|
+----+------------+-----+-----+---------+
| ID1| 2021-04-30| A| 0| A|
| ID1| 2021-03-31| A| 0| A|
| ID1| 2021-02-28| A| 0| A|
| ID1| 2021-03-31| A| 0| B|
| ID1| 2021-02-28| A| 0| B|
| ID1| 2021-01-31| A| 0| B|
| ID1| 2021-02-28| A| 0| C|
| ID1| 2021-01-31| A| 0| C|
| ID1| 2020-12-31| A| 1| C|
| ID1| 2021-04-30| B| 0| D|
| ID1| 2021-03-31| B| 1| D|
| ID1| 2021-02-28| B| 0| D|
| ID1| 2021-03-31| B| 1| E|
| ID1| 2021-02-28| B| 0| E|
| ID1| 2021-01-31| B| 0| E|
| ID1| 2021-02-28| B| 0| F|
| ID1| 2021-01-31| B| 0| F|
| ID1| 2020-12-31| B| 0| F|
| ID1| 2021-04-30| C| 0| G|
| ID1| 2021-03-31| C| 0| G|
| ID1| 2021-02-28| C| 0| G|
| ID1| 2021-03-31| C| 0| H|
| ID1| 2021-02-28| C| 0| H|
| ID1| 2021-01-31| C| 0| H|
| ID1| 2021-02-28| C| 0| I|
| ID1| 2021-01-31| C| 0| I|
| ID1| 2020-12-31| C| 0| I|
| ID1| 2021-04-30| D| 0| J|
| ID1| 2021-03-31| D| 0| J|
| ID1| 2021-02-28| D| 0| J|
| ID1| 2021-03-31| D| 0| K|
| ID1| 2021-02-28| D| 0| K|
| ID1| 2021-01-31| D| 0| K|
| ID1| 2021-02-28| D| 0| L|
| ID1| 2021-01-31| D| 0| L|
| ID1| 2020-12-31| D| 0| L|
+----+------------+-----+-----+---------+
Once I have group_l3m I can run a groupBy/cube on that column to perform the summation later. Any idea how to get the expected output shown above?
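One possible approach (a minimal sketch, not tested against your full data): rank the months per (id, group) with the most recent month first, attach each row to every 3-month window it falls into by exploding an offset of 0-2, and keep only complete windows. The labels come out as group plus a window index (e.g. A_1) rather than fresh letters:
import pyspark.sql.functions as F
from pyspark.sql import Window

w = Window.partitionBy("id", "group").orderBy(F.col("date").desc())

result = (
    df.withColumn("rn", F.row_number().over(w))                  # 1 = most recent month
      .withColumn("n_months", F.max("rn").over(Window.partitionBy("id", "group")))
      .withColumn("offset", F.explode(F.array(*[F.lit(i) for i in range(3)])))
      .withColumn("win", F.col("rn") - F.col("offset"))          # index of the 3-month window this row falls into
      .filter((F.col("win") >= 1) & (F.col("win") <= F.col("n_months") - 2))  # keep complete windows only
      .withColumn("group_l3m", F.concat_ws("_", F.col("group"), F.col("win").cast("string")))
      .drop("rn", "n_months", "offset", "win")
)
result.show()
You can then groupBy or cube on group_l3m for the later summation.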

Related

How to replace value in a column based on maximum value in same column in Pyspark?

I have a column named version with integer values 1, 2, ... up to 8. I want to replace all the values with the maximum value present in the version column, in this case 8, so 1 through 7 should all become 8. I tried a couple of methods but couldn't find a solution.
testDF = spark.createDataFrame([(1,"a"), (2,"b"), (3,"c"), (4,"d"), (5,"e"), (6,"f"), (7,"g"), (8,"h")], ["version", "name"])
testDF.show()
+-------+----+
|version|name|
+-------+----+
| 1| a|
| 2| b|
| 3| c|
| 4| d|
| 5| e|
| 6| f|
| 7| g|
| 8| h|
+-------+----+
expected
+-------+----+
|version|name|
+-------+----+
| 8| a|
| 8| b|
| 8| c|
| 8| d|
| 8| e|
| 8| f|
| 8| g|
| 8| h|
+-------+----+
Try this:
from pyspark.sql.functions import lit

# Collect the max of "version" on the driver and write it back as a literal column
testDF = testDF.withColumn("version", lit(testDF.agg({"version": "max"}).collect()[0][0]))
Output:
+-------+----+
|version|name|
+-------+----+
| 8| a|
| 8| b|
| 8| c|
| 8| d|
| 8| e|
| 8| f|
| 8| g|
| 8| h|
+-------+----+
To increment the value, add 1 to the collected maximum:
testDF.withColumn("version", lit(testDF.agg({"version": "max"}).collect()[0][0] + 1))
Output:
+-------+----+
|version|name|
+-------+----+
| 9| a|
| 9| b|
| 9| c|
| 9| d|
| 9| e|
| 9| f|
| 9| g|
| 9| h|
+-------+----+
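A possible alternative (not part of the original answer, just a sketch) is to compute the max as a window aggregate instead of collecting it to the driver. Note that an unpartitioned window moves all rows to a single partition, so this is only reasonable for small data:
from pyspark.sql import Window
from pyspark.sql import functions as F

# max("version") over the whole DataFrame, attached to every row
testDF = testDF.withColumn("version", F.max("version").over(Window.partitionBy()))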

Problem faced when using groupBy()

I have the following (example) DataFrame, df (obtained using the rollup function):
+----+---------+--------+
|Week|DayofWeek| count|
+----+---------+--------+
|null| null| 28|
| 27| null| 12|
| 27| 1| 1|
| 27| 1| 1|
| 27| 1| 4|
| 27| 2| 6|
| 28| null| 16|
| 28| 1| 4|
| 28| 1| 1|
| 28| 1| 3|
| 28| 2| 4|
| 28| 2| 2|
| 28| 2| 2|
+----+---------+--------+
My expected output (example) dataframe is:
+----+---------+--------+
|Week|DayofWeek| count|
+----+---------+--------+
|null| null| 28|
| 27| null| 12|
| 27| 1| 6|
| 27| 2| 6|
| 28| null| 16|
| 28| 1| 8|
| 28| 2| 8|
+----+---------+--------+
I am trying to achieve this by executing this line of code, but it does not give me my expected output:
df = df.groupBy('Week', 'DayofWeek').agg(F.sum('count')).orderBy(df.Week, df.DayofWeek)
Any help would be appreciated.
Actual Dataframe before GroupBy:
+----+---------+--------+
|Week|DayofWeek| count|
+----+---------+--------+
|null| null|19702637|
| 27| null| 5176492|
| 27| 1| 288|
| 27| 1| 326|
| 27| 1| 688|
| 27| 1| 343|
| 27| 1| 327|
| 27| 1| 784|
| 27| 1| 231|
| 27| 1| 1159|
| 27| 1| 492|
| 27| 1| 217|
| 27| 1| 386|
| 27| 1| 165|
| 27| 1| 2761|
| 27| 1| 3233|
| 27| 1| 81|
| 27| 1| 310|
| 27| 1| 341|
| 27| 1| 248|
+----+---------+--------+
only showing top 20 rows
Actual Dataframe after GroupBy (which is not my expected dataframe):
+----+---------+----------+
|Week|DayofWeek|sum(count)|
+----+---------+----------+
|null| null| 19702637|
| 27| null| 5176492|
| 27| 1| 1061084|
| 27| 2| 1356286|
| 27| 3| 1407338|
| 27| 4| 1510112|
| 27| 5| 1585684|
| 27| 6| 1876438|
| 27| 7| 1556042|
| 28| null| 4877306|
| 28| 1| 918296|
| 28| 2| 1560506|
| 28| 3| 1555056|
| 28| 4| 1502152|
| 28| 5| 1456802|
| 28| 6| 1550272|
| 28| 7| 1211528|
| 29| null| 5011023|
| 29| 1| 1138154|
| 29| 2| 1337084|
+----+---------+----------+
only showing top 20 rows
The following works if df has been converted to a pandas DataFrame (dropna=False keeps the null rollup rows as their own groups):
import pandas as pd

df_list = []
for (week, day_of_week), sub_df in df.groupby(["Week", "DayofWeek"], dropna=False):
    count = sub_df["count"].sum()
    df_list.append({"Week": week, "DayofWeek": day_of_week, "count": count})
df = pd.DataFrame(df_list)
So this is how I solved it:
I decided to use groupBy first and then rollup, unlike the previous scenario (rollup first followed by groupBy). This required refactoring and some minor changes to the implementation.
But I then faced another problem: after the rollup, each non-null row had a duplicate. I used df.distinct() and finally got my expected DataFrame.
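A minimal sketch of that groupBy-then-rollup approach, assuming the pre-rollup detail rows live in a DataFrame called raw_df with Week, DayofWeek and count columns (the raw_df name is made up):
import pyspark.sql.functions as F

df = (
    raw_df.groupBy("Week", "DayofWeek")              # aggregate the detail rows first
          .agg(F.sum("count").alias("count"))
          .rollup("Week", "DayofWeek")               # then add the subtotal and grand-total rows
          .agg(F.sum("count").alias("count"))
          .distinct()                                # drop any duplicated non-null rows, as described above
          .orderBy("Week", "DayofWeek")
)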

Grouped window operation in pyspark: restart sum by condition [duplicate]

I have this dataframe
+---+----+---+
| A| B| C|
+---+----+---+
| 0|null| 1|
| 1| 3.0| 0|
| 2| 7.0| 0|
| 3|null| 1|
| 4| 4.0| 0|
| 5| 3.0| 0|
| 6|null| 1|
| 7|null| 1|
| 8|null| 1|
| 9| 5.0| 0|
| 10| 2.0| 0|
| 11|null| 1|
+---+----+---+
What I need to do is a cumulative sum of the values in column C that resets each time a zero is reached.
Expected output:
+---+----+---+----+
| A| B| C| D|
+---+----+---+----+
| 0|null| 1| 1|
| 1| 3.0| 0| 0|
| 2| 7.0| 0| 0|
| 3|null| 1| 1|
| 4| 4.0| 0| 0|
| 5| 3.0| 0| 0|
| 6|null| 1| 1|
| 7|null| 1| 2|
| 8|null| 1| 3|
| 9| 5.0| 0| 0|
| 10| 2.0| 0| 0|
| 11|null| 1| 1|
+---+----+---+----+
To reproduce dataframe:
from pyspark.shell import sc
from pyspark.sql import Window
from pyspark.sql.functions import lag, when, sum
x = sc.parallelize([
    [0, None], [1, 3.], [2, 7.], [3, None], [4, 4.],
    [5, 3.], [6, None], [7, None], [8, None], [9, 5.], [10, 2.], [11, None]])
x = x.toDF(['A', 'B'])
# Transform null values into "1"
x = x.withColumn('C', when(x.B.isNull(), 1).otherwise(0))
Create a temporary column (grp) that increments a counter each time column C is equal to 0 (the reset condition) and use this as a partitioning column for your cumulative sum.
import pyspark.sql.functions as f
from pyspark.sql import Window
x.withColumn(
    "grp",
    f.sum((f.col("C") == 0).cast("int")).over(Window.orderBy("A"))
).withColumn(
    "D",
    f.sum(f.col("C")).over(Window.partitionBy("grp").orderBy("A"))
).drop("grp").show()
#+---+----+---+---+
#| A| B| C| D|
#+---+----+---+---+
#| 0|null| 1| 1|
#| 1| 3.0| 0| 0|
#| 2| 7.0| 0| 0|
#| 3|null| 1| 1|
#| 4| 4.0| 0| 0|
#| 5| 3.0| 0| 0|
#| 6|null| 1| 1|
#| 7|null| 1| 2|
#| 8|null| 1| 3|
#| 9| 5.0| 0| 0|
#| 10| 2.0| 0| 0|
#| 11|null| 1| 1|
#+---+----+---+---+

Change all row values in a window pyspark dataframe based on a condition

I have a PySpark DataFrame with three columns: id, seq and seq_checker. I need to order by id and check for 4 consecutive 1's in the seq_checker column.
I tried using window functions, but I'm unable to change all the values in a window based on a condition.
new_window = Window.partitionBy().orderBy("id").rangeBetween(0, 3)
output = df.withColumn('check_sequence',F.when(F.min(df['seq_checker']).over(new_window) == 1, True))
original pyspark df:
+---+---+-----------+--------------+
| Id|seq|seq_checker|check_sequence|
+---+---+-----------+--------------+
| 1| 1| 1| false|
| 2| 2| 1| false|
| 3| 3| 1| false|
| 4| 4| 1| false|
| 5| 10| 0| false|
| 6| 14| 1| false|
| 7| 13| 1| false|
| 8| 18| 0| false|
| 9| 23| 0| false|
| 10| 5| 0| false|
| 11| 56| 0| false|
| 12| 66| 0| false|
| 13| 34| 1| false|
| 14| 35| 1| false|
| 15| 36| 1| false|
| 16| 37| 1| false|
| 17| 39| 0| false|
| 18| 54| 0| false|
| 19| 68| 0| false|
| 20| 90| 0| false|
+---+---+-----------+--------------+
Required output:
+---+---+-----------+--------------+
| Id|seq|seq_checker|check_sequence|
+---+---+-----------+--------------+
| 1| 1| 1| true|
| 2| 2| 1| true|
| 3| 3| 1| true|
| 4| 4| 1| true|
| 5| 10| 0| false|
| 6| 14| 1| false|
| 7| 13| 1| false|
| 8| 18| 0| false|
| 9| 23| 0| false|
| 10| 5| 0| false|
| 11| 56| 0| false|
| 12| 66| 0| false|
| 13| 34| 1| true|
| 14| 35| 1| true|
| 15| 36| 1| true|
| 16| 37| 1| true|
| 17| 39| 0| false|
| 18| 54| 0| false|
| 19| 68| 0| false|
| 20| 90| 0| false|
+---+---+-----------+--------------+
Based on the above code, my output is:
+---+---+-----------+--------------+
| Id|seq|seq_checker|check_sequence|
+---+---+-----------+--------------+
| 1| 1| 1| true|
| 2| 2| 1| null|
| 3| 3| 1| null|
| 4| 4| 1| null|
| 5| 10| 0| null|
| 6| 14| 1| null|
| 7| 13| 1| null|
| 8| 18| 0| null|
| 9| 23| 0| null|
| 10| 5| 0| null|
| 11| 56| 0| null|
| 12| 66| 0| null|
| 13| 34| 1| true|
| 14| 35| 1| null|
| 15| 36| 1| null|
| 16| 37| 1| null|
| 17| 39| 0| null|
| 18| 54| 0| null|
| 19| 68| 0| null|
| 20| 90| 0| null|
+---+---+-----------+--------------+
Edit:
1. If there are more than 4 consecutive rows with 1's, the check_sequence flag needs to be set to True for all of those rows.
My actual problem is to check for sequences of length greater than 4 in the 'seq' column. I was able to create the seq_checker column using lead and lag functions.
First define a window ordered only by id. Then use a difference-of-row-numbers approach (with a different ordering) to give consecutive 1's the same group number (it groups any run of equal values). Once the grouping is done, just check whether the min and max of the group are 1 and there are at least four 1's in the group, to get the desired boolean output.
from pyspark.sql import Window
from pyspark.sql.functions import row_number, count, min, max

w1 = Window.orderBy("Id")
w2 = Window.orderBy("seq_checker", "Id")
groups = df.withColumn('grp', row_number().over(w1) - row_number().over(w2))
w3 = Window.partitionBy("grp")
output = groups.withColumn('check_seq',
    (max("seq_checker").over(w3) == 1) &
    (min("seq_checker").over(w3) == 1) &
    (count("Id").over(w3) >= 4))
output.show()
rangeBetween gives you access to rows relative to the current row. You defined a window of (0, 3), which covers the current row and the three following rows, but that only sets the correct value for the first of 4 consecutive 1's. The second element of a run of 4 needs access to the previous row and the following two rows (-1, 2), the third needs the two previous rows and the following row (-2, 1), and the fourth needs the three previous rows (-3, 0).
import pyspark.sql.functions as F
from pyspark.sql import Window
l = [
( 1, 1, 1),
( 2, 2, 1),
( 3, 3, 1),
( 4, 4, 1),
( 5, 10, 0),
( 6, 14, 1),
( 7, 13, 1),
( 8, 18, 0),
( 9, 23, 0),
( 10, 5, 0),
( 11, 56, 0),
( 12, 66, 0),
( 13, 34, 1),
( 14, 35, 1),
( 15, 36, 1),
( 16, 37, 1),
( 17, 39, 0),
( 18, 54, 0),
( 19, 68, 0),
( 20, 90, 0)
]
columns = ['Id','seq','seq_checker']
df=spark.createDataFrame(l, columns)
w1 = Window.partitionBy().orderBy("id").rangeBetween(0, 3)
w2 = Window.partitionBy().orderBy("id").rangeBetween(-1, 2)
w3 = Window.partitionBy().orderBy("id").rangeBetween(-2, 1)
w4 = Window.partitionBy().orderBy("id").rangeBetween(-3, 0)
output = df.withColumn('check_sequence', F.when(
    (F.min(df['seq_checker']).over(w1) == 1) |
    (F.min(df['seq_checker']).over(w2) == 1) |
    (F.min(df['seq_checker']).over(w3) == 1) |
    (F.min(df['seq_checker']).over(w4) == 1),
    True).otherwise(False))
output.show()
Output:
+---+---+-----------+--------------+
| Id|seq|seq_checker|check_sequence|
+---+---+-----------+--------------+
| 1| 1| 1| true|
| 2| 2| 1| true|
| 3| 3| 1| true|
| 4| 4| 1| true|
| 5| 10| 0| false|
| 6| 14| 1| false|
| 7| 13| 1| false|
| 8| 18| 0| false|
| 9| 23| 0| false|
| 10| 5| 0| false|
| 11| 56| 0| false|
| 12| 66| 0| false|
| 13| 34| 1| true|
| 14| 35| 1| true|
| 15| 36| 1| true|
| 16| 37| 1| true|
| 17| 39| 0| false|
| 18| 54| 0| false|
| 19| 68| 0| false|
| 20| 90| 0| false|
+---+---+-----------+--------------+

Add unique identifier (Serial No.) for consecutive column values in pyspark

I created a DataFrame using:
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import Window
from pyspark.sql.functions import col, concat, lit

df = pd.DataFrame({"b": ['A','A','A','A','A','A','A','B','B','B','C','C','D','D','D','D','D','D','D','D','D','D','D'],
                   "Sno": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23],
                   "a": [3,-4,2,-1,-3,1,-7,-6,-4,-5,-1,1,4,5,-3,2,-5,4,-4,-2,5,-5,-4]})
df2 = spark.createDataFrame(df)
df2 = df2.withColumn("pos_neg", col("a") < 0)
df2 = df2.withColumn("huyguyg", concat(col("b"), lit(" "), col("pos_neg")))
+---+---+---+-------+---+-------+
| b|Sno| a|pos_neg|val|huyguyg|
+---+---+---+-------+---+-------+
| B| 8| -6| true| 1| B true|
| B| 9| -4| true| 1| B true|
| B| 10| -5| true| 1| B true|
| D| 13| 4| false| 0|D false|
| D| 14| 5| false| 0|D false|
| D| 15| -3| true| 1| D true|
| D| 16| 2| false| 1|D false|
| D| 17| -5| true| 2| D true|
| D| 18| 4| false| 2|D false|
| D| 19| -4| true| 3| D true|
| D| 20| -2| true| 3| D true|
| D| 21| 5| false| 3|D false|
| D| 22| -5| true| 4| D true|
| D| 23| -4| true| 4| D true|
| C| 11| -1| true| 1| C true|
| C| 12| 1| false| 1|C false|
| A| 1| 3| false| 0|A false|
| A| 2| -4| true| 1| A true|
| A| 3| 2| false| 1|A false|
| A| 4| -1| true| 2| A true|
+---+---+---+-------+---+-------+
I want an additional column at the end that assigns a unique identifier (serial no.) to consecutive values. For instance, the starting value in column 'huyguyg' is 'B true', so it can get a number, say '1'; since the next 2 values are also 'B true' they also get '1', and after that the serial number increases, staying constant for each run of the same 'huyguyg' value.
Any support in this regard would be helpful. A lag function may be useful here, but I am not able to sum the numbers:
df2 = df2.withColumn("serial no.", (df2.pos_neg != F.lag('pos_neg').over(w)).cast('int'))
Simple! Just use the dense_rank function with an orderBy clause.
Here is how it looks:
from pyspark.sql.functions import dense_rank

df3 = df2.withColumn("denseRank", dense_rank().over(Window.orderBy(df2.huyguyg)))
+---+---+---+-------+-------+---------+
|Sno| a| b|pos_neg|huyguyg|denseRank|
+---+---+---+-------+-------+---------+
| 1| 3| A| false|A false| 1|
| 3| 2| A| false|A false| 1|
| 6| 1| A| false|A false| 1|
| 2| -4| A| true| A true| 2|
| 4| -1| A| true| A true| 2|
| 5| -3| A| true| A true| 2|
| 7| -7| A| true| A true| 2|
| 8| -6| B| true| B true| 3|
| 9| -4| B| true| B true| 3|
| 10| -5| B| true| B true| 3|
| 12| 1| C| false|C false| 4|
| 11| -1| C| true| C true| 5|
| 13| 4| D| false|D false| 6|
| 14| 5| D| false|D false| 6|
| 16| 2| D| false|D false| 6|
| 18| 4| D| false|D false| 6|
| 21| 5| D| false|D false| 6|
| 15| -3| D| true| D true| 7|
| 17| -5| D| true| D true| 7|
| 19| -4| D| true| D true| 7|
+---+---+---+-------+-------+---------+
only showing top 20 rows
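Note that dense_rank numbers distinct huyguyg values, not consecutive runs. If the serial number really has to follow consecutive runs in Sno order, a hypothetical completion of the lag-based idea from the question looks like this (the changed and serial_no column names are made up):
import pyspark.sql.functions as F
from pyspark.sql import Window

w = Window.orderBy("Sno")
df2 = (
    df2.withColumn("changed",
                   (F.col("huyguyg") != F.lag("huyguyg").over(w)).cast("int"))
       .fillna({"changed": 1})                             # the first row starts run number 1
       .withColumn("serial_no", F.sum("changed").over(w))  # running count of run boundaries
       .drop("changed")
)
df2.show()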
