Related
I am trying to generate multiple lag for the 'status' variable in the dataframe below. The data that I have is timeseries and it is possible to have gap within the data. What I am trying to do is generate the lags but when there is a gap, the value should be set to Missing/Null.
Input DF:
+---+----------+------+
| id| s_date|status|
+---+----------+------+
|123|2007-01-31| 1|
|123|2007-02-28| 1|
|123|2007-03-31| 2|
|123|2007-04-30| 2|
|123|2007-05-31| 1|
|123|2007-06-30| 1|
|123|2007-07-31| 2|
|123|2007-08-31| 2|
|345|2007-08-31| 3|
|123|2007-09-30| 2|
|345|2007-09-30| 2|
|123|2007-10-31| 1|
|345|2007-10-31| 1|
|123|2007-11-30| 1|
|345|2007-11-30| 2|
|123|2008-01-31| 3|
|345|2007-12-31| 2|
|567|2007-12-31| 3|
|123|2008-03-31| 4|
|345|2008-01-31| 2|
+---+----------+------+
from datetime import date
rdd = sc.parallelize([
[123,date(2007,1,31),1],
[123,date(2007,2,28),1],
[123,date(2007,3,31),2],
[123,date(2007,4,30),2],
[123,date(2007,5,31),1],
[123,date(2007,6,30),1],
[123,date(2007,7,31),2],
[123,date(2007,8,31),2],
[345,date(2007,8,31),3],
[123,date(2007,9,30),2],
[345,date(2007,9,30),2],
[123,date(2007,10,31),1],
[345,date(2007,10,31),1],
[123,date(2007,11,30),1],
[345,date(2007,11,30),2],
[123,date(2008,1,31),3],
[345,date(2007,12,31),2],
[567,date(2007,12,31),3],
[123,date(2008,3,31),4],
[345,date(2008,1,31),2],
[567,date(2008,1,31),3],
[123,date(2008,4,30),3],
[123,date(2008,5,31),2],
[123,date(2008,6,30),1]
])
df = rdd.toDF(['id','s_date','status'])
df.show()
# Below is the code that works
from pyspark.sql.window import Window
w = Window().partitionBy('id').orderBy('s_date')
for i in range(1, 25):
df = df.withColumn(f"lag_{i}", fn.lag(fn.col('status'), i).over(w))\
.withColumn(f"lag_month_{i}", fn.lag(fn.col('s_date), i).over(w))\
.withColumn(f"lag_status_{i}", fn.expr("case when "f"{i} = 1 and (last_day(add_months("f"lag_month_{i}"",1)) = last_day(s_date)) then "f"lag_{i}"" else null end"))
In the code above, column 'lag_status_1" is correctly getting population for i = 1. This column has value of Null for Jan and March 2008 which is what I want for every lag. However, I add below line to handle other lags (i.e. Lag_2, Lag_3, etc.) the code does not work.
.withColumn(f"lag_status_{i}", fn.expr("case when "f"{i} = 1 and (last_day(add_months("f"lag_month_{i}"",1)) = last_day(s_date)) then "f"lag_{i}"" " +
"when "f"{i} > 1 and (last_day(add_months("f"lag_month_{i}"",1)) = last_day("f"lag_month_{i-1})) then "f"lag_{i}"" else null end"))\
Output DF: lag_dlq_2 and lag_dlq_3 is Null but would like to populate with similar logic as lag_dlq_1 (but with respect to its lag)
+---+----------+------+---------+---------+---------+
| id| s_date|status|lag_status_1|lag_status_2|lag_status_3|
+---+----------+------+---------+---------+---------+
|123|2007-01-31| 1| null| null| null|
|123|2007-02-28| 1| 1| null| null|
|123|2007-03-31| 2| 1| null| null|
|123|2007-04-30| 2| 2| null| null|
|123|2007-05-31| 1| 2| null| null|
|123|2007-06-30| 1| 1| null| null|
|123|2007-07-31| 2| 1| null| null|
|123|2007-08-31| 2| 2| null| null|
|123|2007-09-30| 2| 2| null| null|
|123|2007-10-31| 1| 2| null| null|
|123|2007-11-30| 1| 1| null| null|
|123|2008-01-31| 3| null| null| null|
|123|2008-03-31| 4| null| null| null|
|123|2008-04-30| 3| 4| null| null|
|123|2008-05-31| 2| 3| null| null|
|123|2008-06-30| 1| 2| null| null|
|345|2007-08-31| 3| null| null| null|
|345|2007-09-30| 2| 3| null| null|
|345|2007-10-31| 1| 2| null| null|
|345|2007-11-30| 2| 1| null| null|
+---+----------+------+---------+---------+---------+
Can you please guide on how to resolve? If there is a better or more efficient solution, please suggest.
I have a dataframe, and I am trying to write a for loop on it.
|ID | from_dt | To_dt |row_number|diff|negetive_or_not|
+----------+----------+----------+----------+----+---------------+
|11111|2020-07-30|2020-07-31| 1| -2| 0|
|11111|2020-08-02|2020-08-11| 2| 4| 1|
|11111|2020-08-07|2020-08-08| 3| -4| 0|
|11111|2020-08-12|2020-08-18| 4| 1| 1|
|11111|2020-08-17|2020-08-19| 5| 0| 1|
|11111|2020-08-19|2020-08-22| 6| 2| 1|
|11111|2020-08-20|2020-08-24| 7| -1| 0|
|11111|2020-08-25|2020-08-27| 8| 0| 1|
|11111|2020-08-27|2020-08-31| 9|-999| 0|
The goal is to determine the episode. if the negative or not start with 0, it is an episode, if the negative or not start with 1, until it hit 0, it is an episode.
Here is the ideal output
+----------+----------+----------+----------+----+---------------+
|ID | from_dt | To_dt |row_number|diff|negetive_or_not| Episode
+----------+----------+----------+----------+----+---------------+
|11111|2020-07-30|2020-07-31| 1| -2| 0| 1
|11111|2020-08-02|2020-08-11| 2| 4| 1| 2
|11111|2020-08-07|2020-08-08| 3| -4| 0| 2
|11111|2020-08-12|2020-08-18| 4| 1| 1| 3
|11111|2020-08-17|2020-08-19| 5| 0| 1| 3
|11111|2020-08-19|2020-08-22| 6| 2| 1| 3
|11111|2020-08-20|2020-08-24| 7| -1| 0| 3
|11111|2020-08-25|2020-08-27| 8| 0| 1| 4
|11111|2020-08-27|2020-08-31| 9|-999| 0| 4
|22222|2020-07-30|2020-07-31| 1| -2| 0| 1
|22222|2020-08-02|2020-08-11| 2| 4| 1| 2
|22222|2020-08-07|2020-08-08| 3| -4| 0| 2
+----------+----------+----------+----------+----+---------------+
I tried to use case when and rank, such as
case when negetive_or_not = 0 then "eps1" else then "eps2", both not working.
df2 = df.selectExpr('*') .withColumn("Episode",lead(col("to_dt")).over(Window.partitionBy("patient_id").orderBy(col("negetive_or_not"))))
I also try to write a for loop in pyspark, but I have difficulty transferring dataframe into the list, any suggestions will be highly appreciated.
The approach goes as
First flag the row when the value should change.
Based on the flag generate a new value.
Forward fill the nulls
df.withColumn('flag', when(((col('negetive_or_not')==1) & (lag('negetive_or_not').over(Window.partitionBy('ID').orderBy('row_number'))==0)) | (lag('negetive_or_not').over(Window.partitionBy('ID').orderBy('row_number')).isNull()), lit('change')).otherwise(lit('no'))).\
withColumn('ep', when(col('flag')=='change',row_number().over(Window.partitionBy('ID','flag').orderBy('row_number')))).\
withColumn('Episode', last('ep', ignorenulls=True).over(Window.partitionBy('ID').orderBy('row_number'))).\
drop('flag','ep').show()
+-----+----------+----------+----------+----+---------------+-------+
| ID| from_dt| To_dt|row_number|diff|negetive_or_not|Episode|
+-----+----------+----------+----------+----+---------------+-------+
|11111|2020-07-30|2020-07-31| 1| -2| 0| 1|
|11111|2020-08-02|2020-08-11| 2| 4| 1| 2|
|11111|2020-08-07|2020-08-08| 3| -4| 0| 2|
|11111|2020-08-12|2020-08-18| 4| 1| 1| 3|
|11111|2020-08-17|2020-08-19| 5| 0| 1| 3|
|11111|2020-08-19|2020-08-22| 6| 2| 1| 3|
|11111|2020-08-20|2020-08-24| 7| -1| 0| 3|
|11111|2020-08-25|2020-08-27| 8| 0| 1| 4|
|11111|2020-08-27|2020-08-31| 9|-999| 0| 4|
+-----+----------+----------+----------+----+---------------+-------+
I have a dataframew like below in Pyspark
df = spark.createDataFrame([(2,'john',1,1),
(2,'john',1,2),
(3,'pete',8,3),
(3,'pete',8,4),
(5,'steve',9,5)],
['id','/na/me','val/ue', 'rank/'])
df.show()
+---+------+------+-----+
| id|/na/me|val/ue|rank/|
+---+------+------+-----+
| 2| john| 1| 1|
| 2| john| 1| 2|
| 3| pete| 8| 3|
| 3| pete| 8| 4|
| 5| steve| 9| 5|
+---+------+------+-----+
Now in this data frame I want to replace the column names where / to under scrore _. But if the / comes at the start or end of the column name then remove the / but don't replace with _.
I have done like below
for name in df.schema.names:
df = df.withColumnRenamed(name, name.replace('/', '_'))
>>> df
DataFrame[id: bigint, _na_me: string, val_ue: bigint, rank_: bigint]
>>>df.show()
+---+------+------+-----+
| id|_na_me|val_ue|rank_|
+---+------+------+-----+
| 2| john| 1| 1|
| 2| john| 1| 2|
| 3| pete| 8| 3|
| 3| pete| 8| 4|
| 5| steve| 9| 5|
+---+------+------+-----+
How can I achieve my desired result which is below
+---+------+------+-----+
| id| na_me|val_ue| rank|
+---+------+------+-----+
| 2| john| 1| 1|
| 2| john| 1| 2|
| 3| pete| 8| 3|
| 3| pete| 8| 4|
| 5| steve| 9| 5|
+---+------+------+-----+
Try with regular expression replace(re.sub) in python way.
import re
cols=[re.sub(r'(^_|_$)','',f.replace("/","_")) for f in df.columns]
df = spark.createDataFrame([(2,'john',1,1),
(2,'john',1,2),
(3,'pete',8,3),
(3,'pete',8,4),
(5,'steve',9,5)],
['id','/na/me','val/ue', 'rank/'])
df.toDF(*cols).show()
#+---+-----+------+----+
#| id|na_me|val_ue|rank|
#+---+-----+------+----+
#| 2| john| 1| 1|
#| 2| john| 1| 2|
#| 3| pete| 8| 3|
#| 3| pete| 8| 4|
#| 5|steve| 9| 5|
#+---+-----+------+----+
#or using for loop on schema.names
for name in df.schema.names:
df = df.withColumnRenamed(name, re.sub(r'(^_|_$)','',name.replace('/', '_')))
df.show()
#+---+-----+------+----+
#| id|na_me|val_ue|rank|
#+---+-----+------+----+
#| 2| john| 1| 1|
#| 2| john| 1| 2|
#| 3| pete| 8| 3|
#| 3| pete| 8| 4|
#| 5|steve| 9| 5|
#+---+-----+------+----+
I have the following example Spark DataFrame:
rdd = sc.parallelize([(1,"19:00:00", "19:30:00", 30), (1,"19:30:00", "19:40:00", 10),(1,"19:40:00", "19:43:00", 3), (2,"20:00:00", "20:10:00", 10), (1,"20:05:00", "20:15:00", 10),(1,"20:15:00", "20:35:00", 20)])
df = spark.createDataFrame(rdd, ["user_id", "start_time", "end_time", "duration"])
df.show()
+-------+----------+--------+--------+
|user_id|start_time|end_time|duration|
+-------+----------+--------+--------+
| 1| 19:00:00|19:30:00| 30|
| 1| 19:30:00|19:40:00| 10|
| 1| 19:40:00|19:43:00| 3|
| 2| 20:00:00|20:10:00| 10|
| 1| 20:05:00|20:15:00| 10|
| 1| 20:15:00|20:35:00| 20|
+-------+----------+--------+--------+
I want to group consecutive rows based on the start and end times. For instance, for the same user_id, if a row's start time is the same as the previous row's end time, I want to group them together and sum the duration.
The desired result is:
+-------+----------+--------+--------+
|user_id|start_time|end_time|duration|
+-------+----------+--------+--------+
| 1| 19:00:00|19:43:00| 43|
| 2| 20:00:00|20:10:00| 10|
| 1| 20:05:00|20:35:00| 30|
+-------+----------+--------+--------+
The first three rows of the dataframe were grouped together because they all correspond to user_id 1 and the start times and end times form a continuous timeline.
This was my initial approach:
Use the lag function to get the next start time:
from pyspark.sql.functions import *
from pyspark.sql import Window
import sys
# compute next start time
window = Window.partitionBy('user_id').orderBy('start_time')
df = df.withColumn("next_start_time", lag(df.start_time, -1).over(window))
df.show()
+-------+----------+--------+--------+---------------+
|user_id|start_time|end_time|duration|next_start_time|
+-------+----------+--------+--------+---------------+
| 1| 19:00:00|19:30:00| 30| 19:30:00|
| 1| 19:30:00|19:40:00| 10| 19:40:00|
| 1| 19:40:00|19:43:00| 3| 20:05:00|
| 1| 20:05:00|20:15:00| 10| 20:15:00|
| 1| 20:15:00|20:35:00| 20| null|
| 2| 20:00:00|20:10:00| 10| null|
+-------+----------+--------+--------+---------------+
get the difference between the current row's end time and the next row's start time:
time_fmt = "HH:mm:ss"
timeDiff = unix_timestamp('next_start_time', format=time_fmt) - unix_timestamp('end_time', format=time_fmt)
df = df.withColumn("difference", timeDiff)
df.show()
+-------+----------+--------+--------+---------------+----------+
|user_id|start_time|end_time|duration|next_start_time|difference|
+-------+----------+--------+--------+---------------+----------+
| 1| 19:00:00|19:30:00| 30| 19:30:00| 0|
| 1| 19:30:00|19:40:00| 10| 19:40:00| 0|
| 1| 19:40:00|19:43:00| 3| 20:05:00| 1320|
| 1| 20:05:00|20:15:00| 10| 20:15:00| 0|
| 1| 20:15:00|20:35:00| 20| null| null|
| 2| 20:00:00|20:10:00| 10| null| null|
+-------+----------+--------+--------+---------------+----------+
Now my idea was to use the sum function with a window to get the cumulative sum of duration and then do a groupBy. But my approach was flawed for many reasons.
Here's one approach:
Gather together rows into groups where a group is a set of rows with the same user_id that are consecutive (start_time matches previous end_time). Then you can use this group to do your aggregation.
A way to get here is by creating intermediate indicator columns to tell you if the user has changed or the time is not consecutive. Then perform a cumulative sum over the indicator column to create the group.
For example:
import pyspark.sql.functions as f
from pyspark.sql import Window
w1 = Window.orderBy("start_time")
df = df.withColumn(
"userChange",
(f.col("user_id") != f.lag("user_id").over(w1)).cast("int")
)\
.withColumn(
"timeChange",
(f.col("start_time") != f.lag("end_time").over(w1)).cast("int")
)\
.fillna(
0,
subset=["userChange", "timeChange"]
)\
.withColumn(
"indicator",
(~((f.col("userChange") == 0) & (f.col("timeChange")==0))).cast("int")
)\
.withColumn(
"group",
f.sum(f.col("indicator")).over(w1.rangeBetween(Window.unboundedPreceding, 0))
)
df.show()
#+-------+----------+--------+--------+----------+----------+---------+-----+
#|user_id|start_time|end_time|duration|userChange|timeChange|indicator|group|
#+-------+----------+--------+--------+----------+----------+---------+-----+
#| 1| 19:00:00|19:30:00| 30| 0| 0| 0| 0|
#| 1| 19:30:00|19:40:00| 10| 0| 0| 0| 0|
#| 1| 19:40:00|19:43:00| 3| 0| 0| 0| 0|
#| 2| 20:00:00|20:10:00| 10| 1| 1| 1| 1|
#| 1| 20:05:00|20:15:00| 10| 1| 1| 1| 2|
#| 1| 20:15:00|20:35:00| 20| 0| 0| 0| 2|
#+-------+----------+--------+--------+----------+----------+---------+-----+
Now that we have the group column, we can aggregate as follows to get the desired result:
df.groupBy("user_id", "group")\
.agg(
f.min("start_time").alias("start_time"),
f.max("end_time").alias("end_time"),
f.sum("duration").alias("duration")
)\
.drop("group")\
.show()
#+-------+----------+--------+--------+
#|user_id|start_time|end_time|duration|
#+-------+----------+--------+--------+
#| 1| 19:00:00|19:43:00| 43|
#| 1| 20:05:00|20:35:00| 30|
#| 2| 20:00:00|20:10:00| 10|
#+-------+----------+--------+--------+
Here is a working solution derived from Pault's answer:
Create the Dataframe:
rdd = sc.parallelize([(1,"19:00:00", "19:30:00", 30), (1,"19:30:00", "19:40:00", 10),(1,"19:40:00", "19:43:00", 3), (2,"20:00:00", "20:10:00", 10), (1,"20:05:00", "20:15:00", 10),(1,"20:15:00", "20:35:00", 20)])
df = spark.createDataFrame(rdd, ["user_id", "start_time", "end_time", "duration"])
df.show()
+-------+----------+--------+--------+
|user_id|start_time|end_time|duration|
+-------+----------+--------+--------+
| 1| 19:00:00|19:30:00| 30|
| 1| 19:30:00|19:40:00| 10|
| 1| 19:40:00|19:43:00| 3|
| 1| 20:05:00|20:15:00| 10|
| 1| 20:15:00|20:35:00| 20|
+-------+----------+--------+--------+
Create an indicator column that indicates whenever the time has changed, and use cumulative sum to give each group a unique id:
import pyspark.sql.functions as f
from pyspark.sql import Window
w1 = Window.partitionBy('user_id').orderBy('start_time')
df = df.withColumn(
"indicator",
(f.col("start_time") != f.lag("end_time").over(w1)).cast("int")
)\
.fillna(
0,
subset=[ "indicator"]
)\
.withColumn(
"group",
f.sum(f.col("indicator")).over(w1.rangeBetween(Window.unboundedPreceding, 0))
)
df.show()
+-------+----------+--------+--------+---------+-----+
|user_id|start_time|end_time|duration|indicator|group|
+-------+----------+--------+--------+---------+-----+
| 1| 19:00:00|19:30:00| 30| 0| 0|
| 1| 19:30:00|19:40:00| 10| 0| 0|
| 1| 19:40:00|19:43:00| 3| 0| 0|
| 1| 20:05:00|20:15:00| 10| 1| 1|
| 1| 20:15:00|20:35:00| 20| 0| 1|
+-------+----------+--------+--------+---------+-----+
Now GroupBy on user id and the group variable.
+-------+----------+--------+--------+
|user_id|start_time|end_time|duration|
+-------+----------+--------+--------+
| 1| 19:00:00|19:43:00| 43|
| 1| 20:05:00|20:35:00| 30|
+-------+----------+--------+--------+
I got this dataframe Sample:
from pyspark.sql.types import *
schema = StructType([
StructField("ClientId", IntegerType(), True),
StructField("m_ant21", IntegerType(), True),
StructField("m_ant22", IntegerType(), True),
StructField("m_ant23", IntegerType(), True),
StructField("m_ant24", IntegerType(), True)])
df = sqlContext.createDataFrame(
data=[(0, None, None, None, None),
(1, 23, 13, 17, 99),
(2, 0, 0, 0, 1),
(3, 0, None, 1, 0),
(4, None, None, None, None)],
schema=schema)
I have this data frame:
+--------+-------+-------+-------+-------+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|
+--------+-------+-------+-------+-------+
| 0| null| null| null| null|
| 1| 23| 13| 17| 99|
| 2| 0| 0| 0| 1|
| 3| 0| null| 1| 0|
| 4| null| null| null| null|
+--------+-------+-------+-------+-------+
And I need to solve this question:
I'd like to create a new variable which counts how many null values have the data per row. For example:
ClientId 0 should be 4
ClientId 1 should be 0
ClientId 3 should be 1
Note that df is a pyspark.sql.dataframe.DataFrame.
Here is one option:
from pyspark.sql import Row
# add the column schema to the original schema
schema.add(StructField("count_null", IntegerType(), True))
# convert data frame to rdd and append an element to each row to count the number of nulls
df.rdd.map(lambda row: row + Row(sum(x is None for x in row))).toDF(schema).show()
+--------+-------+-------+-------+-------+----------+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|count_null|
+--------+-------+-------+-------+-------+----------+
| 0| null| null| null| null| 4|
| 1| 23| 13| 17| 99| 0|
| 2| 0| 0| 0| 1| 0|
| 3| 0| null| 1| 0| 1|
| 4| null| null| null| null| 4|
+--------+-------+-------+-------+-------+----------+
If you don't want to deal with schema, here is another option:
from pyspark.sql.functions import col, when
df.withColumn("count_null", sum([when(col(x).isNull(),1).otherwise(0) for x in df.columns])).show()
+--------+-------+-------+-------+-------+----------+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|count_null|
+--------+-------+-------+-------+-------+----------+
| 0| null| null| null| null| 4|
| 1| 23| 13| 17| 99| 0|
| 2| 0| 0| 0| 1| 0|
| 3| 0| null| 1| 0| 1|
| 4| null| null| null| null| 4|
+--------+-------+-------+-------+-------+----------+