How to set new flag based on condition in pyspark?

I have two data frames like below.
df = spark.createDataFrame(sc.parallelize(
    [[1, 1, 2], [1, 2, 9], [2, 1, 2], [2, 2, 1],
     [4, 1, 5], [4, 2, 6], [5, 1, 3], [5, 2, 8]]), ["sid", "cid", "Cr"])
df.show()
+---+---+---+
|sid|cid| Cr|
+---+---+---+
| 1| 1| 2|
| 1| 2| 9|
| 2| 1| 2|
| 2| 2| 1|
| 4| 1| 5|
| 4| 2| 6|
| 5| 1| 3|
| 5| 2| 8|
+---+---+---+
next I have created df1 like below.
df1 = spark.createDataFrame(sc.parallelize(
    [[1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3],
     [4, 1], [4, 2], [4, 3], [5, 1], [5, 2], [5, 3]]), ["sid", "cid"])
df1.show()
+---+---+
|sid|cid|
+---+---+
| 1| 1|
| 1| 2|
| 1| 3|
| 2| 1|
| 2| 2|
| 2| 3|
| 4| 1|
| 4| 2|
| 4| 3|
| 5| 1|
| 5| 2|
| 5| 3|
+---+---+
Now I want my final output to look like below: if a (sid, cid) pair from df1 is present in df, i.e. (df1.sid == df.sid) & (df1.cid == df.cid), then flag should be 1, otherwise 0, and the missing Cr values should be filled with 0.
+---+---+---+----+
|sid|cid| Cr|flag|
+---+---+---+----+
| 1| 1| 2| 1 |
| 1| 2| 9| 1 |
| 1| 3| 0| 0 |
| 2| 1| 2| 1 |
| 2| 2| 1| 1 |
| 2| 3| 0| 0 |
| 4| 1| 5| 1 |
| 4| 2| 6| 1 |
| 4| 3| 0| 0 |
| 5| 1| 3| 1 |
| 5| 2| 8| 1 |
| 5| 3| 0| 0 |
+---+---+---+----+
please help me on this.

With data:
from pyspark.sql.functions import col, when, lit, coalesce
df = spark.createDataFrame(
    [(1, 1, 2), (1, 2, 9), (2, 1, 2), (2, 2, 1),
     (4, 1, 5), (4, 2, 6), (5, 1, 3), (5, 2, 8)],
    ("sid", "cid", "Cr"))
df1 = spark.createDataFrame(
    [[1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3],
     [4, 1], [4, 2], [4, 3], [5, 1], [5, 2], [5, 3]],
    ["sid", "cid"])
right outer join:
joined = (df.alias("df")
    .join(
        df1.alias("df1"),
        (col("df.sid") == col("df1.sid")) & (col("df.cid") == col("df1.cid")),
        "rightouter"))
and select
joined.select(
    col("df1.*"),
    coalesce(col("Cr"), lit(0)).alias("Cr"),
    col("df.sid").isNotNull().cast("integer").alias("flag")
).orderBy("sid", "cid").show()
# +---+---+---+----+
# |sid|cid| Cr|flag|
# +---+---+---+----+
# | 1| 1| 2| 1|
# | 1| 2| 9| 1|
# | 1| 3| 0| 0|
# | 2| 1| 2| 1|
# | 2| 2| 1| 1|
# | 2| 3| 0| 0|
# | 4| 1| 5| 1|
# | 4| 2| 6| 1|
# | 4| 3| 0| 0|
# | 5| 1| 3| 1|
# | 5| 2| 8| 1|
# | 5| 3| 0| 0|
# +---+---+---+----+
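An equivalent formulation, shown here only as a sketch on the same data, is to left join from df1 and build the flag from a marker column added to df before the join:

from pyspark.sql.functions import coalesce, lit

# Left join from df1 so every (sid, cid) pair is kept; matched rows carry flag = 1.
flagged = (df1
    .join(df.withColumn("flag", lit(1)), ["sid", "cid"], "left")
    .withColumn("flag", coalesce("flag", lit(0)))
    .withColumn("Cr", coalesce("Cr", lit(0)))
    .orderBy("sid", "cid"))

flagged.show()

Joining on the list of column names avoids duplicate sid/cid columns, so no aliases are needed.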

Related

Problem faced when using groupBy()

I have the following (example) DataFrame, df (obtained using the rollup function):
+----+---------+--------+
|Week|DayofWeek| count|
+----+---------+--------+
|null| null| 28|
| 27| null| 12|
| 27| 1| 1|
| 27| 1| 1|
| 27| 1| 4|
| 27| 2| 6|
| 28| null| 16|
| 28| 1| 4|
| 28| 1| 1|
| 28| 1| 3|
| 28| 2| 4|
| 28| 2| 2|
| 28| 2| 2|
+----+---------+--------+
My expected output (example) dataframe is:
+----+---------+--------+
|Week|DayofWeek| count|
+----+---------+--------+
|null| null| 28|
| 27| null| 12|
| 27| 1| 6|
| 27| 2| 6|
| 28| null| 16|
| 28| 1| 8|
| 28| 2| 8|
+----+---------+--------+
I am trying to achieve this by executing this line of code, but it does not give me my expected output:
df = df.groupBy('Week', 'DayofWeek').agg(F.sum('count')).orderBy(df.Week, df.DayofWeek)
Any help would be appreciated.
Actual Dataframe before GroupBy:
+----+---------+--------+
|Week|DayofWeek| count|
+----+---------+--------+
|null| null|19702637|
| 27| null| 5176492|
| 27| 1| 288|
| 27| 1| 326|
| 27| 1| 688|
| 27| 1| 343|
| 27| 1| 327|
| 27| 1| 784|
| 27| 1| 231|
| 27| 1| 1159|
| 27| 1| 492|
| 27| 1| 217|
| 27| 1| 386|
| 27| 1| 165|
| 27| 1| 2761|
| 27| 1| 3233|
| 27| 1| 81|
| 27| 1| 310|
| 27| 1| 341|
| 27| 1| 248|
+----+---------+--------+
only showing top 20 rows
Actual Dataframe after GroupBy (which is not my expected dataframe):
+----+---------+----------+
|Week|DayofWeek|sum(count)|
+----+---------+----------+
|null| null| 19702637|
| 27| null| 5176492|
| 27| 1| 1061084|
| 27| 2| 1356286|
| 27| 3| 1407338|
| 27| 4| 1510112|
| 27| 5| 1585684|
| 27| 6| 1876438|
| 27| 7| 1556042|
| 28| null| 4877306|
| 28| 1| 918296|
| 28| 2| 1560506|
| 28| 3| 1555056|
| 28| 4| 1502152|
| 28| 5| 1456802|
| 28| 6| 1550272|
| 28| 7| 1211528|
| 29| null| 5011023|
| 29| 1| 1138154|
| 29| 2| 1337084|
+----+---------+----------+
only showing top 20 rows
One workaround is to do the aggregation in pandas (this assumes df has been converted to a pandas DataFrame, e.g. with df.toPandas()):
import pandas as pd

# dropna=False keeps the null keys produced by rollup as their own groups.
df_list = []
for (week, day_of_week), sub_df in df.groupby(["Week", "DayofWeek"], dropna=False):
    count = sub_df["count"].sum()
    df_list.append({"Week": week, "DayofWeek": day_of_week, "count": count})
df = pd.DataFrame(df_list)
So this is how I solved it:
I used groupBy first and then rollup, the reverse of the previous order (rollup followed by groupBy). This required some refactoring and minor changes to the implementation.
I then faced another problem: after the rollup, each non-null row had a duplicate. I used df.distinct(), and finally got my expected DataFrame. A rough sketch of that order of operations is shown below.
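This is a minimal sketch of that approach, assuming the raw (pre-rollup) DataFrame is called raw_df (a hypothetical name) and pyspark.sql.functions is imported as F:

import pyspark.sql.functions as F

# Aggregate first, then roll up the aggregated result.
weekly = raw_df.groupBy("Week", "DayofWeek").agg(F.sum("count").alias("count"))

rolled = (weekly
    .rollup("Week", "DayofWeek")
    .agg(F.sum("count").alias("count"))
    .distinct()  # removes the duplicated non-null rows mentioned above
    .orderBy("Week", "DayofWeek"))

rolled.show()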

Grouped window operation in pyspark: restart sum by condition [duplicate]

I have this dataframe
+---+----+---+
| A| B| C|
+---+----+---+
| 0|null| 1|
| 1| 3.0| 0|
| 2| 7.0| 0|
| 3|null| 1|
| 4| 4.0| 0|
| 5| 3.0| 0|
| 6|null| 1|
| 7|null| 1|
| 8|null| 1|
| 9| 5.0| 0|
| 10| 2.0| 0|
| 11|null| 1|
+---+----+---+
What I need to do is a cumulative sum of the values in column C that resets each time the value returns to zero.
Expected output:
+---+----+---+----+
| A| B| C| D|
+---+----+---+----+
| 0|null| 1| 1|
| 1| 3.0| 0| 0|
| 2| 7.0| 0| 0|
| 3|null| 1| 1|
| 4| 4.0| 0| 0|
| 5| 3.0| 0| 0|
| 6|null| 1| 1|
| 7|null| 1| 2|
| 8|null| 1| 3|
| 9| 5.0| 0| 0|
| 10| 2.0| 0| 0|
| 11|null| 1| 1|
+---+----+---+----+
To reproduce dataframe:
from pyspark.shell import sc
from pyspark.sql import Window
from pyspark.sql.functions import lag, when, sum

x = sc.parallelize([
    [0, None], [1, 3.], [2, 7.], [3, None], [4, 4.],
    [5, 3.], [6, None], [7, None], [8, None], [9, 5.], [10, 2.], [11, None]])
x = x.toDF(['A', 'B'])
# Transform null values into "1"
x = x.withColumn('C', when(x.B.isNull(), 1).otherwise(0))
Create a temporary column (grp) that increments a counter each time column C is equal to 0 (the reset condition) and use this as a partitioning column for your cumulative sum.
import pyspark.sql.functions as f
from pyspark.sql import Window
x.withColumn(
    "grp",
    f.sum((f.col("C") == 0).cast("int")).over(Window.orderBy("A"))
).withColumn(
    "D",
    f.sum(f.col("C")).over(Window.partitionBy("grp").orderBy("A"))
).drop("grp").show()
#+---+----+---+---+
#| A| B| C| D|
#+---+----+---+---+
#| 0|null| 1| 1|
#| 1| 3.0| 0| 0|
#| 2| 7.0| 0| 0|
#| 3|null| 1| 1|
#| 4| 4.0| 0| 0|
#| 5| 3.0| 0| 0|
#| 6|null| 1| 1|
#| 7|null| 1| 2|
#| 8|null| 1| 3|
#| 9| 5.0| 0| 0|
#| 10| 2.0| 0| 0|
#| 11|null| 1| 1|
#+---+----+---+---+
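For what it's worth, keeping the intermediate grp column makes the mechanics visible: the running count of zeros gives every reset segment its own partition key. A sketch, reusing x, f and Window from above:

# Same logic as above, but keep grp so the grouping is visible.
x.withColumn(
    "grp",
    f.sum((f.col("C") == 0).cast("int")).over(Window.orderBy("A"))
).withColumn(
    "D",
    f.sum("C").over(Window.partitionBy("grp").orderBy("A"))
).show()
# Rows A = 0..11 get grp = 0, 1, 2, 2, 3, 4, 4, 4, 4, 5, 6, 6, and D restarts inside each grp.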

Change all row values in a window pyspark dataframe based on a condition

I have a pyspark dataframe which has three columns id, seq, seq_checker. I need to order by id and check for 4 consecutive 1's in seq_checker column.
I tried using window functions. I'm unable to change all values in a window based on a condition.
new_window = Window.partitionBy().orderBy("id").rangeBetween(0, 3)
output = df.withColumn('check_sequence',F.when(F.min(df['seq_checker']).over(new_window) == 1, True))
original pyspark df:
+---+---+-----------+--------------+
| Id|seq|seq_checker|check_sequence|
+---+---+-----------+--------------+
| 1| 1| 1| false|
| 2| 2| 1| false|
| 3| 3| 1| false|
| 4| 4| 1| false|
| 5| 10| 0| false|
| 6| 14| 1| false|
| 7| 13| 1| false|
| 8| 18| 0| false|
| 9| 23| 0| false|
| 10| 5| 0| false|
| 11| 56| 0| false|
| 12| 66| 0| false|
| 13| 34| 1| false|
| 14| 35| 1| false|
| 15| 36| 1| false|
| 16| 37| 1| false|
| 17| 39| 0| false|
| 18| 54| 0| false|
| 19| 68| 0| false|
| 20| 90| 0| false|
+---+---+-----------+--------------+
Required output:
+---+---+-----------+--------------+
| Id|seq|seq_checker|check_sequence|
+---+---+-----------+--------------+
| 1| 1| 1| true|
| 2| 2| 1| true|
| 3| 3| 1| true|
| 4| 4| 1| true|
| 5| 10| 0| false|
| 6| 14| 1| false|
| 7| 13| 1| false|
| 8| 18| 0| false|
| 9| 23| 0| false|
| 10| 5| 0| false|
| 11| 56| 0| false|
| 12| 66| 0| false|
| 13| 34| 1| true|
| 14| 35| 1| true|
| 15| 36| 1| true|
| 16| 37| 1| true|
| 17| 39| 0| false|
| 18| 54| 0| false|
| 19| 68| 0| false|
| 20| 90| 0| false|
+---+---+-----------+--------------+
Based on the above code, my output is:
+---+---+-----------+--------------+
| Id|seq|seq_checker|check_sequence|
+---+---+-----------+--------------+
| 1| 1| 1| true|
| 2| 2| 1| null|
| 3| 3| 1| null|
| 4| 4| 1| null|
| 5| 10| 0| null|
| 6| 14| 1| null|
| 7| 13| 1| null|
| 8| 18| 0| null|
| 9| 23| 0| null|
| 10| 5| 0| null|
| 11| 56| 0| null|
| 12| 66| 0| null|
| 13| 34| 1| true|
| 14| 35| 1| null|
| 15| 36| 1| null|
| 16| 37| 1| null|
| 17| 39| 0| null|
| 18| 54| 0| null|
| 19| 68| 0| null|
| 20| 90| 0| null|
+---+---+-----------+--------------+
Edit:
1. If there are more than 4 consecutive rows with 1's, the check_sequence flag needs to be changed to True for all of those rows.
My actual problem is to check for sequences of length 4 or more in the 'seq' column. I was able to create the seq_checker column using the lead and lag functions.
Initially define a window with just an id ordering. Then use a difference-of-row-numbers approach (with a different ordering) to give consecutive 1's (it also groups any run of repeated values) the same group number. Once the grouping is done, check whether the max and min of the group are 1 and the group contains at least four rows, to get the desired boolean output.
from pyspark.sql import Window
from pyspark.sql.functions import row_number, count, when, min, max

w1 = Window.orderBy("Id")
w2 = Window.orderBy("seq_checker", "Id")
groups = df.withColumn('grp', row_number().over(w1) - row_number().over(w2))

w3 = Window.partitionBy(groups.grp)
output = groups.withColumn(
    'check_seq',
    (max(groups.seq_checker).over(w3) == 1) &
    (min(groups.seq_checker).over(w3) == 1) &
    (count("Id").over(w3) >= 4))
output.show()
rangeBetween gives you access to rows relative to the current row. You defined a window of (0, 3), which covers the current row and the three following rows, but that only sets the correct value for the first of four consecutive 1's. The second element of the run needs access to the previous row and the following two rows (-1, 2), the third needs the two previous rows and the following row (-2, 1), and the fourth needs the three previous rows (-3, 0).
import pyspark.sql.functions as F
from pyspark.sql import Window
l = [
( 1, 1, 1),
( 2, 2, 1),
( 3, 3, 1),
( 4, 4, 1),
( 5, 10, 0),
( 6, 14, 1),
( 7, 13, 1),
( 8, 18, 0),
( 9, 23, 0),
( 10, 5, 0),
( 11, 56, 0),
( 12, 66, 0),
( 13, 34, 1),
( 14, 35, 1),
( 15, 36, 1),
( 16, 37, 1),
( 17, 39, 0),
( 18, 54, 0),
( 19, 68, 0),
( 20, 90, 0)
]
columns = ['Id','seq','seq_checker']
df = spark.createDataFrame(l, columns)
w1 = Window.partitionBy().orderBy("id").rangeBetween(0, 3)
w2 = Window.partitionBy().orderBy("id").rangeBetween(-1, 2)
w3 = Window.partitionBy().orderBy("id").rangeBetween(-2, 1)
w4 = Window.partitionBy().orderBy("id").rangeBetween(-3, 0)
output = df.withColumn('check_sequence', F.when(
    (F.min(df['seq_checker']).over(w1) == 1) |
    (F.min(df['seq_checker']).over(w2) == 1) |
    (F.min(df['seq_checker']).over(w3) == 1) |
    (F.min(df['seq_checker']).over(w4) == 1), True).otherwise(False))
output.show()
Output:
+---+---+-----------+--------------+
| Id|seq|seq_checker|check_sequence|
+---+---+-----------+--------------+
| 1| 1| 1| true|
| 2| 2| 1| true|
| 3| 3| 1| true|
| 4| 4| 1| true|
| 5| 10| 0| false|
| 6| 14| 1| false|
| 7| 13| 1| false|
| 8| 18| 0| false|
| 9| 23| 0| false|
| 10| 5| 0| false|
| 11| 56| 0| false|
| 12| 66| 0| false|
| 13| 34| 1| true|
| 14| 35| 1| true|
| 15| 36| 1| true|
| 16| 37| 1| true|
| 17| 39| 0| false|
| 18| 54| 0| false|
| 19| 68| 0| false|
| 20| 90| 0| false|
+---+---+-----------+--------------+

Add unique identifier (Serial No.) for consecutive column values in pyspark

I created a DataFrame using
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.functions import col, concat, lit
from pyspark.sql import Window

df = pd.DataFrame({"b": ['A','A','A','A','A','A','A','B','B','B','C','C','D','D','D','D','D','D','D','D','D','D','D'],
                   "Sno": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23],
                   "a": [3,-4,2,-1,-3,1,-7,-6,-4,-5,-1,1,4,5,-3,2,-5,4,-4,-2,5,-5,-4]})
df2 = spark.createDataFrame(df)
df2 = df2.withColumn("pos_neg", col("a") < 0)
df2 = df2.withColumn("huyguyg", concat(col("b"), lit(" "), col("pos_neg")))
+---+---+---+-------+---+-------+
| b|Sno| a|pos_neg|val|huyguyg|
+---+---+---+-------+---+-------+
| B| 8| -6| true| 1| B true|
| B| 9| -4| true| 1| B true|
| B| 10| -5| true| 1| B true|
| D| 13| 4| false| 0|D false|
| D| 14| 5| false| 0|D false|
| D| 15| -3| true| 1| D true|
| D| 16| 2| false| 1|D false|
| D| 17| -5| true| 2| D true|
| D| 18| 4| false| 2|D false|
| D| 19| -4| true| 3| D true|
| D| 20| -2| true| 3| D true|
| D| 21| 5| false| 3|D false|
| D| 22| -5| true| 4| D true|
| D| 23| -4| true| 4| D true|
| C| 11| -1| true| 1| C true|
| C| 12| 1| false| 1|C false|
| A| 1| 3| false| 0|A false|
| A| 2| -4| true| 1| A true|
| A| 3| 2| false| 1|A false|
| A| 4| -1| true| 2| A true|
+---+---+---+-------+---+-------+
I want an additional column at the end which adds a unique identifier (serial no.) for consecutive values. For instance, the starting value in column 'huyguyg' is 'B true', so it can get a number, say '1'; since the next 2 values are also 'B true', they also get '1'. Subsequently the serial number increases, and it remains constant for consecutive rows with the same 'huyguyg' value.
Any support in this regard will be helpful. A lag function may be helpful here, but I am not able to sum the resulting flags:
df2 = df2.withColumn("serial no.", (df2.pos_neg != F.lag('pos_neg').over(w)).cast('int'))
Simple! Just use the dense_rank function with an orderBy clause.
Here is how it looks:
from pyspark.sql.functions import dense_rank

df3 = df2.withColumn("denseRank", dense_rank().over(Window.orderBy(df2.huyguyg)))
+---+---+---+-------+-------+---------+
|Sno| a| b|pos_neg|huyguyg|denseRank|
+---+---+---+-------+-------+---------+
| 1| 3| A| false|A false| 1|
| 3| 2| A| false|A false| 1|
| 6| 1| A| false|A false| 1|
| 2| -4| A| true| A true| 2|
| 4| -1| A| true| A true| 2|
| 5| -3| A| true| A true| 2|
| 7| -7| A| true| A true| 2|
| 8| -6| B| true| B true| 3|
| 9| -4| B| true| B true| 3|
| 10| -5| B| true| B true| 3|
| 12| 1| C| false|C false| 4|
| 11| -1| C| true| C true| 5|
| 13| 4| D| false|D false| 6|
| 14| 5| D| false|D false| 6|
| 16| 2| D| false|D false| 6|
| 18| 4| D| false|D false| 6|
| 21| 5| D| false|D false| 6|
| 15| -3| D| true| D true| 7|
| 17| -5| D| true| D true| 7|
| 19| -4| D| true| D true| 7|
+---+---+---+-------+-------+---------+
only showing top 20 rows
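Note that dense_rank over huyguyg numbers each distinct value, not each consecutive run. If the serial number has to increase only at the boundaries between runs, as the lag-based attempt in the question suggests, a sketch along those lines (assuming the Sno column defines the row order) would be:

import pyspark.sql.functions as F
from pyspark.sql import Window

# Flag rows where huyguyg changes from the previous row (ordered by Sno),
# then take a running sum of those flags to get a run-based serial number.
w_order = Window.orderBy("Sno")
w_running = Window.orderBy("Sno").rowsBetween(Window.unboundedPreceding, Window.currentRow)

df4 = (df2
    .withColumn("changed",
                (F.col("huyguyg") != F.lag("huyguyg").over(w_order)).cast("int"))
    .withColumn("serial_no",
                F.sum(F.coalesce(F.col("changed"), F.lit(1))).over(w_running))
    .drop("changed"))

df4.show()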

retrieving pyspark offset lag dynamic value in to other dataframe

I am using pyspark 2.1. Below are my input dataframes. I am stuck on taking dynamic offset values from a different dataframe; please help.
df1=
category value
1 3
2 2
4 5
df2
category year month weeknumber lag_attribute runs
1 0 0 0 0 2
1 2019 1 1 1 0
1 2019 1 2 2 0
1 2019 1 3 3 0
1 2019 1 4 4 1
1 2019 1 5 5 2
1 2019 1 6 6 3
1 2019 1 7 7 4
1 2019 1 8 8 5
1 2019 1 9 9 6
2 0 0 0 9 0
2 2018 1 1 2 0
2 2018 1 2 3 2
2 2018 1 3 4 3
2 2018 1 3 5 4
As shown in the example above, df1 is my lookup table holding the offset values: for category 1 the offset value is 3 and for category 2 the offset value is 2.
In df2, runs is my output column: for every category, the lag_attribute column should be lagged down by the offset value from df1, which is why you can see the runs values repeating lag_attribute shifted by that many rows.
I tried the code below and it didn't work. Please help.
df1=df1.registerTempTable("df1")
df2=df2.registerTempTable("df2")
sqlCtx.sql("select st.category,st.Year,st.Month,st.weekyear,st.lag_attribute,LAG(st.lag_attribute,df1.value, 0) OVER (PARTITION BY st.cagtegory ORDER BY st.Year,st.Month,st.weekyear) as return_test from df1 st,df2 lkp where df1.category=df2.category")
Please help me to cross this hurdle
lag takes in a column object and a plain Python integer, as shown in the function's signature:
Signature: psf.lag(col, count=1, default=None)
The value for count cannot be a Spark column; it has to be an actual integer. There are workarounds though; let's start with the sample data:
df1 = spark.createDataFrame([[1, 3], [2, 2], [4, 5]], ["category", "value"])
df2 = spark.createDataFrame(
    [[1, 0, 0, 0, 0, 2], [1, 2019, 1, 1, 1, 0], [1, 2019, 1, 2, 2, 0], [1, 2019, 1, 3, 3, 0],
     [1, 2019, 1, 4, 4, 1], [1, 2019, 1, 5, 5, 2], [1, 2019, 1, 6, 6, 3], [1, 2019, 1, 7, 7, 4],
     [1, 2019, 1, 8, 8, 5], [1, 2019, 1, 9, 9, 6], [2, 0, 0, 0, 9, 0], [2, 2018, 1, 1, 2, 0],
     [2, 2018, 1, 2, 3, 2], [2, 2018, 1, 3, 4, 3], [2, 2018, 1, 3, 5, 4]],
    ["category", "year", "month", "weeknumber", "lag_attribute", "runs"])
What you could do, if df1 is not too big (meaning a small number of categories, each potentially with a lot of values), is convert df1 to a list and build a chained when condition from its values:
list1 = df1.collect()
sc.broadcast(list1)
import pyspark.sql.functions as psf
from pyspark.sql import Window
w = Window.partitionBy("category").orderBy("year", "month", "weeknumber")
cond = eval('psf' + ''.join(['.when(df2.category == ' + str(c) + ', psf.lag("lag_attribute", ' + str(l) + ', 0).over(w))' for c, l in list1]))
Note: this is if c and l are integers, if they are strings then:
cond = eval('psf' + ''.join(['.when(df2.category == "' + str(c) + '", psf.lag("lag_attribute", "' + str(l) + '", 0).over(w))' for c, l in list1]))
Now we can apply the condition:
df2.select("*", cond.alias("return_test")).show()
+--------+----+-----+----------+-------------+----+-----------+
|category|year|month|weeknumber|lag_attribute|runs|return_test|
+--------+----+-----+----------+-------------+----+-----------+
| 1| 0| 0| 0| 0| 2| 0|
| 1|2019| 1| 1| 1| 0| 0|
| 1|2019| 1| 2| 2| 0| 0|
| 1|2019| 1| 3| 3| 0| 0|
| 1|2019| 1| 4| 4| 1| 1|
| 1|2019| 1| 5| 5| 2| 2|
| 1|2019| 1| 6| 6| 3| 3|
| 1|2019| 1| 7| 7| 4| 4|
| 1|2019| 1| 8| 8| 5| 5|
| 1|2019| 1| 9| 9| 6| 6|
| 2| 0| 0| 0| 9| 0| 0|
| 2|2018| 1| 1| 2| 0| 0|
| 2|2018| 1| 2| 3| 2| 9|
| 2|2018| 1| 3| 4| 3| 2|
| 2|2018| 1| 3| 5| 4| 3|
+--------+----+-----+----------+-------------+----+-----------+
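As a side note, the same chained condition can be built without eval by folding over the collected list; this is only a sketch of that alternative, reusing psf, w and list1 from above:

from functools import reduce

# Fold the (category, lag) pairs from df1 into one chained when() expression.
cond = reduce(
    lambda acc, row: acc.when(df2.category == row[0],
                              psf.lag("lag_attribute", row[1], 0).over(w)),
    list1,
    psf)

df2.select("*", cond.alias("return_test")).show()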
If df1 is big, you can instead self-join df2 on a computed row number.
First we'll bring the values from df1 into df2 using a join:
df = df2.join(df1, "category")
if df1 is not too big, you should broadcast it:
import pyspark.sql.functions as psf
df = df2.join(psf.broadcast(df1), "category")
Now we'll enumerate the rows in each partition and build a lag column:
from pyspark.sql import Window
w = Window.partitionBy("category").orderBy("year", "month", "weeknumber")
left = df.withColumn('rn', psf.row_number().over(w))
right = left.select(
    "category",
    (left.rn + left.value).alias("rn"),
    left.lag_attribute.alias("return_test"))

left.join(right, ["category", "rn"], "left")\
    .na.fill(0)\
    .sort("category", "rn").show()
+--------+---+----+-----+----------+-------------+----+-----+-----------+
|category| rn|year|month|weeknumber|lag_attribute|runs|value|return_test|
+--------+---+----+-----+----------+-------------+----+-----+-----------+
| 1| 1| 0| 0| 0| 0| 2| 3| 0|
| 1| 2|2019| 1| 1| 1| 0| 3| 0|
| 1| 3|2019| 1| 2| 2| 0| 3| 0|
| 1| 4|2019| 1| 3| 3| 0| 3| 0|
| 1| 5|2019| 1| 4| 4| 1| 3| 1|
| 1| 6|2019| 1| 5| 5| 2| 3| 2|
| 1| 7|2019| 1| 6| 6| 3| 3| 3|
| 1| 8|2019| 1| 7| 7| 4| 3| 4|
| 1| 9|2019| 1| 8| 8| 5| 3| 5|
| 1| 10|2019| 1| 9| 9| 6| 3| 6|
| 2| 1| 0| 0| 0| 9| 0| 2| 0|
| 2| 2|2018| 1| 1| 2| 0| 2| 0|
| 2| 3|2018| 1| 2| 3| 2| 2| 9|
| 2| 4|2018| 1| 3| 4| 3| 2| 2|
| 2| 5|2018| 1| 3| 5| 4| 2| 3|
+--------+---+----+-----+----------+-------------+----+-----+-----------+
Note: there is a problem with your runs lag values; for category=2 they are only lagged by 1 instead of 2, for instance. Also, some rows share the same ordering keys (e.g. the last two rows of your sample dataframe df2 have the same category, year, month and weeknumber); since shuffling is involved, you might get different results every time you run the code.
