I have the following dataset and am working with PySpark:
df = sparkSession.createDataFrame([(5, 'Samsung', '2018-02-23'),
(8, 'Apple', '2018-02-22'),
(5, 'Sony', '2018-02-21'),
(5, 'Samsung', '2018-02-20'),
(8, 'LG', '2018-02-20')],
['ID', 'Product', 'Date']
)
+---+-------+----------+
| ID|Product| Date|
+---+-------+----------+
| 5|Samsung|2018-02-23|
| 8| Apple|2018-02-22|
| 5| Sony|2018-02-21|
| 5|Samsung|2018-02-20|
| 8| LG|2018-02-20|
+---+-------+----------+
# Each ID ALWAYS appears at least 2 times (ignore the case of unique IDs in this df)
Each ID should add +1 to the counter of the product it chose most frequently.
In case of equal frequency, the most recent date decides which product receives the +1.
From the sample above, the desired output would be:
+-------+-------+
|Product|Counter|
+-------+-------+
|Samsung| 1|
| Apple| 1|
| Sony| 0|
| LG| 0|
+-------+-------+
# Samsung - 1 (preferred twice by ID=5)
# Apple - 1 (preferred by ID=8 more recently than LG)
# Sony - 0 (because ID=5 preferred Samsung 2 times, and Sony only 1)
# LG - 0 (because ID=8 preferred Apple more recently)
What is the most efficient way with PySpark to achieve this result?
IIUC, you want to pick the most frequent product for each ID, breaking ties using the most recent Date.
So first, we can get the count for each product/ID pair using:
import pyspark.sql.functions as f
from pyspark.sql import Window
df = df.select(
'ID',
'Product',
'Date',
f.count('Product').over(Window.partitionBy('ID', 'Product')).alias('count')
)
df.show()
#+---+-------+----------+-----+
#| ID|Product| Date|count|
#+---+-------+----------+-----+
#| 5| Sony|2018-02-21| 1|
#| 8| LG|2018-02-20| 1|
#| 8| Apple|2018-02-22| 1|
#| 5|Samsung|2018-02-23| 2|
#| 5|Samsung|2018-02-20| 2|
#+---+-------+----------+-----+
Now you can use a Window to rank the products within each ID. We can use pyspark.sql.functions.desc() to sort by count and Date descending. If row_number() equals 1, that row ranks first for its ID, i.e. it holds the most frequent product (with ties broken by the most recent Date).
w = Window.partitionBy('ID').orderBy(f.desc('count'), f.desc('Date'))
df = df.select(
'Product',
(f.row_number().over(w) == 1).cast("int").alias('Counter')
)
df.show()
#+-------+-------+
#|Product|Counter|
#+-------+-------+
#|Samsung| 1|
#|Samsung| 0|
#| Sony| 0|
#| Apple| 1|
#| LG| 0|
#+-------+-------+
Finally, groupBy() the Product and take the maximum value of Counter:
df.groupBy('Product').agg(f.max('Counter').alias('Counter')).show()
#+-------+-------+
#|Product|Counter|
#+-------+-------+
#| Sony| 0|
#|Samsung| 1|
#| LG| 0|
#| Apple| 1|
#+-------+-------+
Update
Here's a slightly simpler way, starting again from the original DataFrame (ID, Product, Date):
w = Window.partitionBy('ID').orderBy(f.desc('count'), f.desc('Date'))
df.groupBy('ID', 'Product')\
.agg(f.max('Date').alias('Date'), f.count('Product').alias('count'))\
.select('Product', (f.row_number().over(w) == 1).cast("int").alias('Counter'))\
.show()
#+-------+-------+
#|Product|Counter|
#+-------+-------+
#|Samsung| 1|
#| Sony| 0|
#| Apple| 1|
#| LG| 0|
#+-------+-------+
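As a cross-check on a small sample, the same rule can be written in plain Python rather than Spark. This is only a minimal sketch, assuming the rows fit in memory and the dates are ISO-formatted strings (so string comparison matches date order); the helper name count_preferred is made up for illustration.

```python
from collections import defaultdict

def count_preferred(rows):
    """rows: list of (id, product, date) tuples with ISO-formatted date strings."""
    # (id, product) -> [frequency, most recent date]
    stats = defaultdict(lambda: [0, ""])
    for id_, product, date in rows:
        entry = stats[(id_, product)]
        entry[0] += 1
        entry[1] = max(entry[1], date)

    # For each id, keep the product with the highest (frequency, latest date)
    best = {}
    for (id_, product), (freq, latest) in stats.items():
        if id_ not in best or (freq, latest) > best[id_][0]:
            best[id_] = ((freq, latest), product)

    # Every product starts at 0; each id adds 1 to its preferred product
    counter = {product: 0 for _, product, _ in rows}
    for _, product in best.values():
        counter[product] += 1
    return counter

rows = [
    (5, "Samsung", "2018-02-23"),
    (8, "Apple", "2018-02-22"),
    (5, "Sony", "2018-02-21"),
    (5, "Samsung", "2018-02-20"),
    (8, "LG", "2018-02-20"),
]
print(count_preferred(rows))
# {'Samsung': 1, 'Apple': 1, 'Sony': 0, 'LG': 0}
```

Note that this sketch adds one per ID, so a product preferred by two different IDs would get 2, whereas an aggregation with max('Counter') caps the value at 1.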
If I have a table
|a | b | c|
|"hello"|"world"| 1|
and the variables
start =2000
end =2015
How do I add 16 columns in PySpark, named m2000, m2001, and so on up to m2015, all filled with 0, so that the new DataFrame is
|a | b | c|m2000 | m2001 | m2002 | ... | m2015|
|"hello"|"world"| 1| 0 | 0 | 0 | ... | 0 |
I have tried the code below:
df = df.select(
'*',
*["0".alias(f'm{i}') for i in range(2000, 2016)]
)
df.show()
I get the error
AttributeError: 'str' object has no attribute 'alias'
You can simply use withColumn to add the relevant columns.
from pyspark.sql.functions import lit
df = spark.createDataFrame(data=[("hello","world",1)],schema=["a","b","c"])
df.show()
+-----+-----+---+
| a| b| c|
+-----+-----+---+
|hello|world| 1|
+-----+-----+---+
# range() excludes the stop value, so 2016 is needed to include m2015
for i in range(2000, 2016):
    df = df.withColumn("m"+str(i), lit(0))
df.show()
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|    a|    b|  c|m2000|m2001|m2002|m2003|m2004|m2005|m2006|m2007|m2008|m2009|m2010|m2011|m2012|m2013|m2014|m2015|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|hello|world|  1|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
You can use a one-liner:
from pyspark.sql import functions as F
df = df.select(df.columns + [F.lit(0).alias(f"m{i}") for i in range(2000, 2016)])
Full example:
df = spark.createDataFrame([["hello","world",1]],["a","b","c"])
df = df.select(df.columns + [F.lit(0).alias(f"m{i}") for i in range(2000, 2016)])
df.show()
[Out]:
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|    a|    b|  c|m2000|m2001|m2002|m2003|m2004|m2005|m2006|m2007|m2008|m2009|m2010|m2011|m2012|m2013|m2014|m2015|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|hello|world|  1|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
In pandas, you can do the following:
import pandas as pd
df = pd.Series({'a': 'Hello', 'b': 'World', 'c': 1}).to_frame().T
df[['m{}'.format(x) for x in range(2000, 2016)]] = 0
print(df)
I am not very familiar with Spark syntax, but the idea of building the column names with a list comprehension should carry over.
What is happening:
The term ['m{}'.format(x) for x in range(2000, 2016)] is a list-comprehension that creates the list of desired column names. We assign the value 0 to these columns. Since the columns do not yet exist, they are added.
Your code for generating the extra columns is almost right; you just need to wrap the value in the lit function. Use lit(0) rather than lit("0") if you want integer zeros instead of strings:
from pyspark.sql.functions import lit
df.select('*', *[lit(0).alias(f'm{i}') for i in range(2000, 2016)]).show()
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| a| b| c|m2000|m2001|m2002|m2003|m2004|m2005|m2006|m2007|m2008|m2009|m2010|m2011|m2012|m2013|m2014|m2015|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|hello|world| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
I would be cautious about calling the withColumn method repeatedly: every call adds a new projection to Spark's query execution plan, which can become computationally expensive. A single select is generally the better approach.
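An off-by-one is easy to hit here because Python's range excludes its stop value. The following plain-Python sketch, using the start and end variables from the question, builds the 16 names m2000 to m2015; each name could then be turned into a lit(0).alias(name) expression for a single select. The variable name new_columns is only illustrative.

```python
start = 2000
end = 2015

# range() excludes the stop value, so use end + 1 to include m2015
new_columns = [f"m{year}" for year in range(start, end + 1)]

print(len(new_columns))                  # 16
print(new_columns[0], new_columns[-1])   # m2000 m2015
```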
I have a PySpark data table that looks like the following
shouldMerge | number
true | 1
true | 1
true | 2
false | 3
false | 1
I want to combine all of the rows with shouldMerge as true and add up their numbers,
so the final output would look like
shouldMerge | number
true | 4
false | 3
false | 1
How can I select all the ones with shouldMerge == true, add up the numbers, and generate a new row in PySpark?
Edit: An alternate, slightly more complicated scenario closer to what I'm trying to solve, where we only aggregate rows with a positive mergeId:
mergeId | number
1 | 1
2 | 1
1 | 2
-1 | 3
-1 | 1
Expected output:
mergeId | number
1 | 3
2 | 1
-1 | 3
-1 | 1
IIUC, you want to do a groupBy but only on the positive mergeIds.
One way is to filter your DataFrame for the positive ids, group, aggregate, and union this back with the negative ids (similar to @shanmuga's answer).
Another way is to use when to dynamically create a grouping key. If the mergeId is positive, use the mergeId to group. Otherwise, use a monotonically_increasing_id to ensure that the row does not get aggregated.
Here is an example:
import pyspark.sql.functions as f
df.withColumn("uid", f.monotonically_increasing_id())\
.groupBy(
f.when(
f.col("mergeId") > 0,
f.col("mergeId")
).otherwise(f.col("uid")).alias("mergeKey"),
f.col("mergeId")
)\
.agg(f.sum("number").alias("number"))\
.drop("mergeKey")\
.show()
#+-------+------+
#|mergeId|number|
#+-------+------+
#| -1| 1.0|
#| 1| 3.0|
#| 2| 1.0|
#| -1| 3.0|
#+-------+------+
This can easily be generalized by changing the when condition (in this case it's f.col("mergeId") > 0) to match your specific requirements.
Explanation:
First we create a temporary column uid which is a unique ID for each row. Next, we call groupBy and if the mergeId is positive use the mergeId to group. Otherwise we use the uid as the mergeKey. I also passed in the mergeId as a second group by column as a way to keep that column for the output.
To demonstrate what is going on, take a look at the intermediate result:
df.withColumn("uid", f.monotonically_increasing_id())\
.withColumn(
"mergeKey",
f.when(
f.col("mergeId") > 0,
f.col("mergeId")
).otherwise(f.col("uid")).alias("mergeKey")
)\
.show()
#+-------+------+-----------+-----------+
#|mergeId|number| uid| mergeKey|
#+-------+------+-----------+-----------+
#| 1| 1| 0| 1|
#| 2| 1| 8589934592| 2|
#| 1| 2|17179869184| 1|
#| -1| 3|25769803776|25769803776|
#| -1| 1|25769803777|25769803777|
#+-------+------+-----------+-----------+
As you can see, the mergeKey remains the unique value for the negative mergeIds.
From this intermediate step, the desired result is just a trivial group by and sum, followed by dropping the mergeKey column.
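The grouping-key idea can also be illustrated in plain Python, which may help when checking the result on a handful of rows. This is a rough sketch rather than Spark code; the function name merge_positive and the ("merged", ...) / ("row", ...) key tags are invented for the example.

```python
def merge_positive(rows):
    """rows: list of (merge_id, number).

    Sums the numbers per positive merge_id; rows with a
    non-positive merge_id are kept as they are.
    """
    totals = {}
    for uid, (merge_id, number) in enumerate(rows):
        # Positive ids share one key; every other row gets a key unique to that row
        key = ("merged", merge_id) if merge_id > 0 else ("row", uid)
        previous = totals.get(key, (merge_id, 0))[1]
        totals[key] = (merge_id, previous + number)
    return list(totals.values())

rows = [(1, 1), (2, 1), (1, 2), (-1, 3), (-1, 1)]
print(merge_positive(rows))
# [(1, 3), (2, 1), (-1, 3), (-1, 1)]
```

Here the row index from enumerate plays the role of the uid column: it only has to be unique per row, not consecutive.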
You will have to filter for the rows where shouldMerge is true and aggregate them, then union the result with all the remaining rows.
import pyspark.sql.functions as functions
df = sqlContext.createDataFrame([
(True, 1),
(True, 1),
(True, 2),
(False, 3),
(False, 1),
], ("shouldMerge", "number"))
# First scenario: aggregate only the rows flagged shouldMerge = true
false_df = df.filter("shouldMerge = false")
true_df = df.filter("shouldMerge = true")
result = true_df.groupBy("shouldMerge")\
.agg(functions.sum("number").alias("number"))\
.unionAll(false_df)
result.show()
df = sqlContext.createDataFrame([
(1, 1),
(2, 1),
(1, 2),
(-1, 3),
(-1, 1),
], ("mergeId", "number"))
# Second scenario: merge rows whose mergeId is non-negative, keep the rest untouched
merge_condition = df["mergeId"] > -1
remaining = ~merge_condition
groupby_field = "mergeId"
false_df = df.filter(remaining)
true_df = df.filter(merge_condition)
result = true_df.groupBy(groupby_field)\
.agg(functions.sum("number").alias("number"))\
.unionAll(false_df)
result.show()
For the first problem posted by the OP:
# Create the DataFrame
valuesCol = [(True,1),(True,1),(True,2),(False,3),(False,1)]
df = sqlContext.createDataFrame(valuesCol,['shouldMerge','number'])
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| true| 1|
| true| 1|
| true| 2|
| false| 3|
| false| 1|
+-----------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view
df.registerTempTable('table_view')
df=sqlContext.sql(
'select shouldMerge, number, sum(number) over (partition by shouldMerge) as sum_number from table_view'
)
df = df.withColumn('number',when(col('shouldMerge')==True,col('sum_number')).otherwise(col('number')))
df.show()
+-----------+------+----------+
|shouldMerge|number|sum_number|
+-----------+------+----------+
| true| 4| 4|
| true| 4| 4|
| true| 4| 4|
| false| 3| 4|
| false| 1| 4|
+-----------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy().orderBy('shouldMerge')
df = df.withColumn('shouldMerge_lag', lag(col('shouldMerge'),1).over(my_window))
df.show()
+-----------+------+---------------+
|shouldMerge|number|shouldMerge_lag|
+-----------+------+---------------+
| false| 3| null|
| false| 1| false|
| true| 4| false|
| true| 4| true|
| true| 4| true|
+-----------+------+---------------+
df = df.where(~((col('shouldMerge')==True) & (col('shouldMerge_lag')==True))).drop('shouldMerge_lag')
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| false| 3|
| false| 1|
| true| 4|
+-----------+------+
For the second problem posted by the OP:
# Create the DataFrame
valuesCol = [(1,2),(1,1),(2,1),(1,2),(-1,3),(-1,1)]
df = sqlContext.createDataFrame(valuesCol,['mergeId','number'])
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 2|
| 1| 1|
| 2| 1|
| 1| 2|
| -1| 3|
| -1| 1|
+-------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view
df.registerTempTable('table_view')
df=sqlContext.sql(
'select mergeId, number, sum(number) over (partition by mergeId) as sum_number from table_view'
)
df = df.withColumn('number',when(col('mergeId') > 0,col('sum_number')).otherwise(col('number')))
df.show()
+-------+------+----------+
|mergeId|number|sum_number|
+-------+------+----------+
| 1| 5| 5|
| 1| 5| 5|
| 1| 5| 5|
| 2| 1| 1|
| -1| 3| 4|
| -1| 1| 4|
+-------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy('mergeId').orderBy('mergeId')
df = df.withColumn('mergeId_lag', lag(col('mergeId'),1).over(my_window))
df.show()
+-------+------+-----------+
|mergeId|number|mergeId_lag|
+-------+------+-----------+
| 1| 5| null|
| 1| 5| 1|
| 1| 5| 1|
| 2| 1| null|
| -1| 3| null|
| -1| 1| -1|
+-------+------+-----------+
df = df.where(~((col('mergeId') > 0) & (col('mergeId_lag').isNotNull()))).drop('mergeId_lag')
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 5|
| 2| 1|
| -1| 3|
| -1| 1|
+-------+------+
Documentation: lag() - Returns the value that is offset rows before the current row.
I have a dataframe grouped by 'id' and 'type':
+---+----+-----+
| id|type|count|
+---+----+-----+
| 0| A| 2|
| 0| B| 3|
| 0| C| 1|
| 0| D| 3|
| 0| G| 1|
| 1| A| 0|
| 1| C| 1|
| 1| D| 1|
| 1| G| 2|
+---+----+-----+
I would now like to group by 'id' and get the sum of the 3 largest values:
+---+-----+
| id|count|
+---+-----+
| 0| 8|
| 1| 4|
+---+-----+
How can I do it in pyspark, so that the computation is relatively efficient?
Found solution here
You can rank the rows within each id and sum only the top 3:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df = spark.createDataFrame([
(0, "A", 2),
(0,"B", 3),
(0,"C", 1),
(0,"D", 3),
(1,"A", 0),
(1,"C", 1),
(1,"D", 1),
(1,"G", 2)
], ("id", "type", "count"))
# Rank the rows within each id, largest count first
my_window = Window.partitionBy("id").orderBy(F.desc("count"))
df.withColumn("rn", F.row_number().over(my_window)).where(F.col("rn") <= 3).groupBy("id").agg(F.sum("count").alias("count")).show()
Output:
+---+-----+
| id|count|
+---+-----+
|  0|    8|
|  1|    4|
+---+-----+
Details: The window partitions the data by id and orders each partition by count, descending. row_number() then numbers the rows 1, 2, 3, ... within each id, so keeping only rn <= 3 leaves the three largest counts per id. For id=0 these are 3, 3 and 2, which sum to 8. Finally, perform the group operation and the sum. (Summing lead("count") over an ascending window instead only drops the smallest row of each group, so it matches the top-3 sum only when every group has exactly four rows.)
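For a quick sanity check against the question's full table (including the (0, "G", 1) row), the expected sums can be computed in plain Python. This is a minimal sketch outside Spark, assuming the data is small enough to hold in memory; sum_top_n is a hypothetical helper name.

```python
from collections import defaultdict
from heapq import nlargest

def sum_top_n(rows, n=3):
    """rows: iterable of (id, type, count). Returns {id: sum of the n largest counts}."""
    counts = defaultdict(list)
    for id_, _type, count in rows:
        counts[id_].append(count)
    return {id_: sum(nlargest(n, values)) for id_, values in counts.items()}

rows = [
    (0, "A", 2), (0, "B", 3), (0, "C", 1), (0, "D", 3), (0, "G", 1),
    (1, "A", 0), (1, "C", 1), (1, "D", 1), (1, "G", 2),
]
print(sum_top_n(rows))  # {0: 8, 1: 4}
```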
I have the following example Spark DataFrame:
rdd = sc.parallelize([(1,"19:00:00", "19:30:00", 30), (1,"19:30:00", "19:40:00", 10),(1,"19:40:00", "19:43:00", 3), (2,"20:00:00", "20:10:00", 10), (1,"20:05:00", "20:15:00", 10),(1,"20:15:00", "20:35:00", 20)])
df = spark.createDataFrame(rdd, ["user_id", "start_time", "end_time", "duration"])
df.show()
+-------+----------+--------+--------+
|user_id|start_time|end_time|duration|
+-------+----------+--------+--------+
| 1| 19:00:00|19:30:00| 30|
| 1| 19:30:00|19:40:00| 10|
| 1| 19:40:00|19:43:00| 3|
| 2| 20:00:00|20:10:00| 10|
| 1| 20:05:00|20:15:00| 10|
| 1| 20:15:00|20:35:00| 20|
+-------+----------+--------+--------+
I want to group consecutive rows based on the start and end times. For instance, for the same user_id, if a row's start time is the same as the previous row's end time, I want to group them together and sum the duration.
The desired result is:
+-------+----------+--------+--------+
|user_id|start_time|end_time|duration|
+-------+----------+--------+--------+
| 1| 19:00:00|19:43:00| 43|
| 2| 20:00:00|20:10:00| 10|
| 1| 20:05:00|20:35:00| 30|
+-------+----------+--------+--------+
The first three rows of the dataframe were grouped together because they all correspond to user_id 1 and the start times and end times form a continuous timeline.
This was my initial approach:
Use the lag function to get the next start time:
from pyspark.sql.functions import *
from pyspark.sql import Window
import sys
# compute next start time
window = Window.partitionBy('user_id').orderBy('start_time')
df = df.withColumn("next_start_time", lag(df.start_time, -1).over(window))
df.show()
+-------+----------+--------+--------+---------------+
|user_id|start_time|end_time|duration|next_start_time|
+-------+----------+--------+--------+---------------+
| 1| 19:00:00|19:30:00| 30| 19:30:00|
| 1| 19:30:00|19:40:00| 10| 19:40:00|
| 1| 19:40:00|19:43:00| 3| 20:05:00|
| 1| 20:05:00|20:15:00| 10| 20:15:00|
| 1| 20:15:00|20:35:00| 20| null|
| 2| 20:00:00|20:10:00| 10| null|
+-------+----------+--------+--------+---------------+
Get the difference between the current row's end time and the next row's start time:
time_fmt = "HH:mm:ss"
timeDiff = unix_timestamp('next_start_time', format=time_fmt) - unix_timestamp('end_time', format=time_fmt)
df = df.withColumn("difference", timeDiff)
df.show()
+-------+----------+--------+--------+---------------+----------+
|user_id|start_time|end_time|duration|next_start_time|difference|
+-------+----------+--------+--------+---------------+----------+
| 1| 19:00:00|19:30:00| 30| 19:30:00| 0|
| 1| 19:30:00|19:40:00| 10| 19:40:00| 0|
| 1| 19:40:00|19:43:00| 3| 20:05:00| 1320|
| 1| 20:05:00|20:15:00| 10| 20:15:00| 0|
| 1| 20:15:00|20:35:00| 20| null| null|
| 2| 20:00:00|20:10:00| 10| null| null|
+-------+----------+--------+--------+---------------+----------+
Now my idea was to use the sum function with a window to get the cumulative sum of duration and then do a groupBy. But my approach was flawed for many reasons.
Here's one approach:
Gather together rows into groups where a group is a set of rows with the same user_id that are consecutive (start_time matches previous end_time). Then you can use this group to do your aggregation.
A way to get here is by creating intermediate indicator columns to tell you if the user has changed or the time is not consecutive. Then perform a cumulative sum over the indicator column to create the group.
For example:
import pyspark.sql.functions as f
from pyspark.sql import Window
w1 = Window.orderBy("start_time")
df = df.withColumn(
"userChange",
(f.col("user_id") != f.lag("user_id").over(w1)).cast("int")
)\
.withColumn(
"timeChange",
(f.col("start_time") != f.lag("end_time").over(w1)).cast("int")
)\
.fillna(
0,
subset=["userChange", "timeChange"]
)\
.withColumn(
"indicator",
(~((f.col("userChange") == 0) & (f.col("timeChange")==0))).cast("int")
)\
.withColumn(
"group",
f.sum(f.col("indicator")).over(w1.rangeBetween(Window.unboundedPreceding, 0))
)
df.show()
#+-------+----------+--------+--------+----------+----------+---------+-----+
#|user_id|start_time|end_time|duration|userChange|timeChange|indicator|group|
#+-------+----------+--------+--------+----------+----------+---------+-----+
#| 1| 19:00:00|19:30:00| 30| 0| 0| 0| 0|
#| 1| 19:30:00|19:40:00| 10| 0| 0| 0| 0|
#| 1| 19:40:00|19:43:00| 3| 0| 0| 0| 0|
#| 2| 20:00:00|20:10:00| 10| 1| 1| 1| 1|
#| 1| 20:05:00|20:15:00| 10| 1| 1| 1| 2|
#| 1| 20:15:00|20:35:00| 20| 0| 0| 0| 2|
#+-------+----------+--------+--------+----------+----------+---------+-----+
Now that we have the group column, we can aggregate as follows to get the desired result:
df.groupBy("user_id", "group")\
.agg(
f.min("start_time").alias("start_time"),
f.max("end_time").alias("end_time"),
f.sum("duration").alias("duration")
)\
.drop("group")\
.show()
#+-------+----------+--------+--------+
#|user_id|start_time|end_time|duration|
#+-------+----------+--------+--------+
#| 1| 19:00:00|19:43:00| 43|
#| 1| 20:05:00|20:35:00| 30|
#| 2| 20:00:00|20:10:00| 10|
#+-------+----------+--------+--------+
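The indicator-plus-cumulative-sum idea can be mirrored in plain Python to see how the group numbers arise. This is an illustrative sketch only, not Spark code: the function name merge_sessions is made up, and times are compared as "HH:mm:ss" strings, which sort correctly within a single day.

```python
def merge_sessions(rows):
    """rows: list of (user_id, start_time, end_time, duration)."""
    ordered = sorted(rows, key=lambda r: r[1])  # global order by start_time
    merged = {}   # (user_id, group) -> [start, end, total duration]
    group = 0
    prev = None
    for user_id, start, end, duration in ordered:
        if prev is not None:
            user_change = user_id != prev[0]
            time_change = start != prev[2]
            # Running total of the indicator gives the group number
            group += int(user_change or time_change)
        key = (user_id, group)
        if key in merged:
            entry = merged[key]
            entry[0] = min(entry[0], start)
            entry[1] = max(entry[1], end)
            entry[2] += duration
        else:
            merged[key] = [start, end, duration]
        prev = (user_id, start, end)
    return [(user, s, e, d) for (user, _), (s, e, d) in merged.items()]

rows = [
    (1, "19:00:00", "19:30:00", 30), (1, "19:30:00", "19:40:00", 10),
    (1, "19:40:00", "19:43:00", 3), (2, "20:00:00", "20:10:00", 10),
    (1, "20:05:00", "20:15:00", 10), (1, "20:15:00", "20:35:00", 20),
]
for row in merge_sessions(rows):
    print(row)
```

Because the rows are ordered globally by start_time, a row from another user that falls between two consecutive rows of the same user would split that user's chain. Partitioning by user_id avoids this.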
Here is a working solution derived from Pault's answer:
Create the Dataframe:
rdd = sc.parallelize([(1,"19:00:00", "19:30:00", 30), (1,"19:30:00", "19:40:00", 10),(1,"19:40:00", "19:43:00", 3), (2,"20:00:00", "20:10:00", 10), (1,"20:05:00", "20:15:00", 10),(1,"20:15:00", "20:35:00", 20)])
df = spark.createDataFrame(rdd, ["user_id", "start_time", "end_time", "duration"])
df.show()
+-------+----------+--------+--------+
|user_id|start_time|end_time|duration|
+-------+----------+--------+--------+
| 1| 19:00:00|19:30:00| 30|
| 1| 19:30:00|19:40:00| 10|
| 1| 19:40:00|19:43:00| 3|
| 1| 20:05:00|20:15:00| 10|
| 1| 20:15:00|20:35:00| 20|
+-------+----------+--------+--------+
Create an indicator column that indicates whenever the time has changed, and use cumulative sum to give each group a unique id:
import pyspark.sql.functions as f
from pyspark.sql import Window
w1 = Window.partitionBy('user_id').orderBy('start_time')
df = df.withColumn(
"indicator",
(f.col("start_time") != f.lag("end_time").over(w1)).cast("int")
)\
.fillna(
0,
subset=[ "indicator"]
)\
.withColumn(
"group",
f.sum(f.col("indicator")).over(w1.rangeBetween(Window.unboundedPreceding, 0))
)
df.show()
+-------+----------+--------+--------+---------+-----+
|user_id|start_time|end_time|duration|indicator|group|
+-------+----------+--------+--------+---------+-----+
| 1| 19:00:00|19:30:00| 30| 0| 0|
| 1| 19:30:00|19:40:00| 10| 0| 0|
| 1| 19:40:00|19:43:00| 3| 0| 0|
| 1| 20:05:00|20:15:00| 10| 1| 1|
| 1| 20:15:00|20:35:00| 20| 0| 1|
+-------+----------+--------+--------+---------+-----+
Now group by user_id and the group column, aggregating with min(start_time), max(end_time) and sum(duration) as in Pault's answer:
+-------+----------+--------+--------+
|user_id|start_time|end_time|duration|
+-------+----------+--------+--------+
| 1| 19:00:00|19:43:00| 43|
| 1| 20:05:00|20:35:00| 30|
+-------+----------+--------+--------+
I have a PySpark DataFrame with a date column that also contains weekend dates. I want to change these dates to the previous or next working day.
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.getOrCreate()
columns = ['Date', 'id', 'dogs', 'cats']
vals = [('04-05-2018',1, 2, 0), ('05-05-2018',2, 0, 1), ('06-05-2018',2, 0, 1)]
df = spark.createDataFrame(vals, columns)
df.show()
The DataFrame looks like:
+----------+---+----+----+
| Date| id|dogs|cats|
+----------+---+----+----+
|04-05-2018| 1| 2| 0|
|05-05-2018| 2| 0| 1|
|06-05-2018| 2| 0| 1|
+----------+---+----+----+
Now I'm able to identify the day of the week in a separate column:
from pyspark.sql.functions import unix_timestamp, date_format
df = df.withColumn('Date', unix_timestamp(df['Date'].cast("string"), 'dd-MM-yyyy').cast("double").cast('timestamp').cast('date'))
df = df.select('Date', date_format('Date', 'u').alias('dow_number'), 'id', 'dogs', 'cats')
temp = df
temp.show()
+----------+----------+---+----+----+
| Date|dow_number| id|dogs|cats|
+----------+----------+---+----+----+
|2018-05-04| 5| 1| 2| 0|
|2018-05-05| 6| 2| 0| 1|
|2018-05-06| 7| 2| 0| 1|
+----------+----------+---+----+----+
Now I want to shift each date that falls on a weekend to the last working day or the next working day, so that my DataFrame looks like this:
+----------+----------+---+----+----+
| Date|dow_number| id|dogs|cats|
+----------+----------+---+----+----+
|2018-05-04| 5| 1| 2| 0|
|2018-05-04| 5| 2| 0| 1|
|2018-05-04| 5| 2| 0| 1|
+----------+----------+---+----+----+
Thanks in advance.
Using the generated dow_number, you can check for weekend days and subtract the right number of days with date_sub():
import pyspark.sql.functions as F
df = df.withColumn('Date1',F.when(df['dow_number'] == 6,F.date_sub(df.Date,1)).when(df['dow_number'] == 7,F.date_sub(df.Date,2)).otherwise(df.Date))
+----------+----------+---+----+----+----------+
| Date|dow_number| id|dogs|cats| Date1|
+----------+----------+---+----+----+----------+
|2018-05-04| 5| 1| 2| 0|2018-05-04|
|2018-05-05| 6| 2| 0| 1|2018-05-04|
|2018-05-06| 7| 2| 0| 1|2018-05-04|
+----------+----------+---+----+----+----------+
I believe you don't need to change dow_number as well. If you do, you can either apply date_format to the new date or use another condition like the one above. Hope this helps!
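The shifting rule itself (Saturday minus one day, Sunday minus two) can be checked with Python's standard datetime module. This is a small sketch outside Spark; the function name previous_working_day is hypothetical, and public holidays are not handled.

```python
from datetime import date, timedelta

def previous_working_day(d):
    """Move Saturday and Sunday back to the preceding Friday; leave weekdays unchanged."""
    # isoweekday(): Monday = 1 ... Saturday = 6, Sunday = 7
    shift = {6: 1, 7: 2}.get(d.isoweekday(), 0)
    return d - timedelta(days=shift)

for d in (date(2018, 5, 4), date(2018, 5, 5), date(2018, 5, 6)):
    print(d, "->", previous_working_day(d))
# 2018-05-04 -> 2018-05-04
# 2018-05-05 -> 2018-05-04
# 2018-05-06 -> 2018-05-04
```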