PySpark Convert Column<> to Value

PySpark Convert Column<> to Value - python

I'd like to know how to get the value of a calculation done using functions such as date_add, datediff, date_sub, etc. The actual value of it in a variable.
As an example:
start_date = '2022-03-06'
end_date = '2022-03-01'
date_lag = datediff(to_date(lit(start_date)), to_date(lit(end_date)))
If I run date_lag, the output is: Column<'datediff(to_date(2022-03-06), to_date(2022-03-01))'>.
The expected output would be 5.
I was told by a coworker, I'd have to create a dataframe, apply the column expression and then apply a collect to get the value, but I was hoping there would be a simpler way to do it.

You have used PySpark functions datediff, to_date, lit. They all return a column data type. Columns (also the results of your calculations) do not exist unless you add them to a dataframe AND return the dataframe in some way.
So, your colleague was correct telling that first you need to create a dataframe (which will hold your column) and then, since you want your value in a variable, you will have to tell which record from that column you want to take (this can be done using either collect, head, take, first,..)
Creating a dataframe with 3 records and adding your column to it:
from pyspark.sql import functions as F
start_date = '2022-03-06'
end_date = '2022-03-01'
date_lag = F.datediff(F.to_date(F.lit(start_date)), F.to_date(F.lit(end_date)))
df = spark.range(3).select(
date_lag.alias('column_name')
)
df.show()
# +-----------+
# |column_name|
# +-----------+
# | 5|
# | 5|
# | 5|
# +-----------+
Any of the following lines will write the top row's value of your column into a variable.
date_lag_var = df.head().column_name
date_lag_var = df.first().column_name
date_lag_var = df.take(1)[0].column_name
date_lag_var = df.limit(1).collect()[0].column_name

you can easily do it using python
>>> start_date = '2022-03-06'
>>> end_date = '2022-03-01'
>>> str_d1=start_date.split("-")[0]+"/"+start_date.split("-")[1]+"/"+start_date.split("-")[2]
>>> str_d1
'2022/03/06'
>>> str_d2=end_date.split("-")[0]+"/"+end_date.split("-")[1]+"/"+end_date.split("-")[2]
>>> str_d2
'2022/03/01'
>>> d1 = datetime.strptime(str_d1, "%Y/%m/%d")
>>> d2 = datetime.strptime(str_d2, "%Y/%m/%d")
>>> delta = d1-d2
>>> delta.days
5

Related

Optimization on for loop on columns in Pyspark

I don't know if my title is very clear. I have a table with a lot columns (more than a hundred). Some of my columns contains values with brackets and I need to explode them into several rows. Here is a reproducible example:
# Import libraries
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import *
import pandas as ps
# Create an example
columns = ["Name", "Age", "Activity", "Studies"]
data = [("Jame", 25, "[Painting,Yoga]", "[Math,Physics]"), ("Anne", 20, "[Garden,Cooking,Travel]", "[Communication,Marketing]"), ("Jane", 10, "[Gymnastique]", "[Basic School]")]
df = spark.createDataFrame(data=data,schema=columns)
df.show(truncate=False)
it shows the following table:
+----+---+-----------------------+-------------------------+
|Name|Age|Activity |Studies |
+----+---+-----------------------+-------------------------+
|Jame|25 |[Painting,Yoga] |[Math,Physics] |
|Anne|20 |[Garden,Cooking,Travel]|[Communication,Marketing]|
|Jane|10 |[Gymnastique] |[Basic School] |
+----+---+-----------------------+-------------------------+
I need to determine what columns contains brackets as value:
list_col = df.dtypes
df_array_col = spark.createDataFrame(list_col)\
.withColumnRenamed("_1", "Colname")\
.withColumnRenamed("_2", "TypeColumn")\
.filter(col("TypeColumn") == "string")\
.withColumn("IsBracket", lit(0))\
.toPandas()
# Function for determining what column contains brackets as a value
def func_isSquaredBracket(my_col):
A = df.select(first(col(my_col).rlike("\["), ignorenulls=True).alias(my_col))
val_IsBracket = A.select(col(my_col)).collect()[0][0]
return val_IsBracket
# For loop for applying the function
n_array = df_array_col.count()["Colname"]
for index, row in df_array_col.iterrows():
IsBracket_value = func_isSquaredBracket(df_array_col.at[index, "Colname"])
if IsBracket_value == True:
df_array_col.at[index, "IsBracket"] = 1
I succeed what columns have brackets as value. Now I can explode my table:
def func_extractStringInBracket_andSplit(my_col):
extract_string = regexp_extract(my_col, r'(?<=\[).+?(?=\])', 0).alias(my_col)
string_split = split(extract_string, "\||,").alias(my_col)
string_explode_array = explode_outer(string_split).alias(my_col)
return string_explode_array
df_explode_bracket = df
for index, row in df_array_bracket_col.iterrows():
colname = df_array_bracket_col["Colname"][index]
df_explode_bracket = df_explode_bracket.withColumn(colname, func_extractStringInBracket_andSplit(colname))
df_explode_bracket.show(truncate=False)
I obtain the result I want:
+----+---+-----------+-------------+
|Name|Age|Activity |Studies |
+----+---+-----------+-------------+
|Jame|25 |Painting |Math |
|Jame|25 |Painting |Physics |
|Jame|25 |Yoga |Math |
|Jame|25 |Yoga |Physics |
|Anne|20 |Garden |Communication|
|Anne|20 |Garden |Marketing |
|Anne|20 |Cooking |Communication|
|Anne|20 |Cooking |Marketing |
|Anne|20 |Travel |Communication|
|Anne|20 |Travel |Marketing |
|Jane|10 |Gymnastique|Basic School |
+----+---+-----------+-------------+
However, this solution is not optimized when I have more than 100 columns and it takes more than 6 minutes to get the result with the following message:
/opt/spark/python/lib/pyspark.zip/pyspark/sql/pandas/conversion.py:289: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
'JavaPackage' object is not callable
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
warnings.warn(msg)
I am pretty new to PySpark and I am not an expert in Python. My question is: How can I optimize the solution by using PySpark instead of Pandas? For loop is not ideal when you have the opportunity to use parallel processing.

It's actually pretty easy, use regexp_extract_all:
df = (
df.withColumn("Activity_list", F.expr(r"regexp_extract_all(Activity, '(\\w+)', 1)"))
.withColumn("Studies_list", F.expr(r"regexp_extract_all(Studies, '(\\w+)', 1)"))
)
df = (
df.drop("Activity", "Studies")
.withColumn("Activity", F.explode("Activity_list"))
.withColumn("Studies", F.explode("Studies_list"))
)
Edit: It even works with strings without brackets.

Pandas UDF in pyspark

I am trying to fill a series of observation on a spark dataframe. Basically I have a list of days and I should create the missing one for each group.
In pandas there is the reindex function, which is not available in pyspark.
I tried to implement a pandas UDF:
#pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def reindex_by_date(df):
df = df.set_index('dates')
dates = pd.date_range(df.index.min(), df.index.max())
return df.reindex(dates, fill_value=0).ffill()
This looks like should do what I need, however it fails with this message
AttributeError: Can only use .dt accessor with datetimelike values
. What am I doing wrong here?
Here the full code:
data = spark.createDataFrame(
[(1, "2020-01-01", 0),
(1, "2020-01-03", 42),
(2, "2020-01-01", -1),
(2, "2020-01-03", -2)],
('id', 'dates', 'value'))
data = data.withColumn('dates', col('dates').cast("date"))
schema = StructType([
StructField('id', IntegerType()),
StructField('dates', DateType()),
StructField('value', DoubleType())])
#pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def reindex_by_date(df):
df = df.set_index('dates')
dates = pd.date_range(df.index.min(), df.index.max())
return df.reindex(dates, fill_value=0).ffill()
data = data.groupby('id').apply(reindex_by_date)
Ideally I would like something like this:
+---+----------+-----+
| id| dates|value|
+---+----------+-----+
| 1|2020-01-01| 0|
| 1|2020-01-02| 0|
| 1|2020-01-03| 42|
| 2|2020-01-01| -1|
| 2|2020-01-02| 0|
| 2|2020-01-03| -2|
+---+----------+-----+

Case 1: Each ID has an individual date range.
I would try to reduce the content of the udf as much as possible. In this case I would only calculate the date range per ID in the udf. For the other parts I would use Spark native functions.
from pyspark.sql import types as T
from pyspark.sql import functions as F
# Get min and max date per ID
date_ranges = data.groupby('id').agg(F.min('dates').alias('date_min'), F.max('dates').alias('date_max'))
# Calculate the date range for each ID
#F.udf(returnType=T.ArrayType(T.DateType()))
def get_date_range(date_min, date_max):
return [t.date() for t in list(pd.date_range(date_min, date_max))]
# To get one row per potential date, we need to explode the UDF output
date_ranges = date_ranges.withColumn(
'dates',
F.explode(get_date_range(F.col('date_min'), F.col('date_max')))
)
date_ranges = date_ranges.drop('date_min', 'date_max')
# Add the value for existing entries and add 0 for others
result = date_ranges.join(
data,
['id', 'dates'],
'left'
)
result = result.fillna({'value': 0})
Case 2: All ids have the same date range
I think there is no need to use a UDF here. What you want to can be archived in a different way: First, you get all possible IDs and all necessary dates. Second, you crossJoin them, which will provide you with all possible combinations. Third, left join the original data onto the combinations. Fourth, replace the occurred null values with 0.
# Get all unique ids
ids_df = data.select('id').distinct()
# Get the date series
date_min, date_max = data.agg(F.min('dates'), F.max('dates')).collect()[0]
dates = [[t.date()] for t in list(pd.date_range(date_min, date_max))]
dates_df = spark.createDataFrame(data=dates, schema="dates:date")
# Calculate all combinations
all_comdinations = ids_df.crossJoin(dates_df)
# Add the value column
result = all_comdinations.join(
data,
['id', 'dates'],
'left'
)
# Replace all null values with 0
result = result.fillna({'value': 0})
Please be aware of the following limitiations with this solution:
crossJoins can be quite costly. One potential solution to cope with the issue can be found in this related question.
The collect statement and use of Pandas results in a not perfectly parallelised Spark transformation.
[EDIT] Split into two cases as I first thought all IDs have the same date range.

Extract a column in pyspark dataframe using udfs

I am using below code snippets to extract a portion of a dataframe column .
df.withColumn("chargemonth",getBookedMonth1(df['chargedate']))
def getBookedMonth1(chargedate):
booked_year=chargedate[0:3]
booked_month=chargedate[5:7]
return booked_year+"-"+booked_month
I have also used getBookedMonth for the same , but I am getting null value for the new column chargemonth in both cases.
from pyspark.sql.functions import substring
def getBookedMonth(chargedate):
booked_year=substring(chargedate, 1,4)
booked_month=substring(chargedate,5, 6)
return booked_year+"-"+booked_month
Is this correct way of extraction/substring of columns in pyspark ?

Please DON'T use udf for this! UDFs are known for bad performance.
I would suggest you tu use Spark builtin functions to manipulate dates. Here is an example:
# DF sample
data = [(1, "2019-12-05"), (2, "2019-12-06"), (3, "2019-12-07")]
df = spark.createDataFrame(data, ["id", "chargedate"])
# format dates as 'yyyy-MM'
df.withColumn("chargemonth", date_format(to_date(col("chargedate")), "yyyy-MM")).show()
+---+----------+-----------+
| id|chargedate|chargemonth|
+---+----------+-----------+
| 1|2019-12-05| 2019-12|
| 2|2019-12-06| 2019-12|
| 3|2019-12-07| 2019-12|
+---+----------+-----------+

You need to make a new function as Pyspark UDF.
>>> from pyspark.sql.functions import udf
>>> data = [
... {"chargedate":"2019-01-01"},
... {"chargedate":"2019-02-01"},
... {"chargedate":"2019-03-01"},
... {"chargedate":"2019-04-01"}
... ]
>>>
>>> booked_month = udf(lambda a:"{0}-{1}".format(a[0:4], a[5:7]))
>>>
>>> df = spark.createDataFrame(data)
>>> df = df.withColumn("chargemonth",booked_month(df['chargedate'])).drop('chargedate')
>>> df.show()
+-----------+
|chargemonth|
+-----------+
| 2019-01|
| 2019-02|
| 2019-03|
| 2019-04|
+-----------+
>>>
withColumn is a right way to add a column, drop is used to drop a column.

Dataframe performance issue while retrieving rows in hierarchy order in pyspark

Dataframe performance issue while retrieving rows in hierarchy order in pyspark.
dataframe performance issue while retrieving rows in hierarchy order in pyspark
I am trying to retrieve data in hierachy order using pyspark dataframe from a csv file but it is taking more than 3 hrs to retrieve 30k records in hierachy order.
is there any alternate way to solve this problem in pyspark dataframe?
can anyone please help me on this?
from datetime import datetime
from pyspark.sql.functions import lit
df = sc.read.csv(path/of/csv/file, **kargs)
df.cache()
df.show()
def get_child(pid, df, col_name):
df_child_s = df.selectExpr(col_name).where(col("pid") == pid)
return df_child_s
def all_data(pid, df, col_name):
df_child_exist = True
cnt = 0
df_o = get_child_str(pid, df, col_name)
df_o = df_o.withColumn("order_id", lit(cnt))
df_child_exist = len(df_o.take(1)) >= 1
if df_child_exist :
dst = df_o.selectExpr("child_id").first()[0]
while df_child_exist:
cnt += 1
df_o2 = get_child_str(dst, df, "*")
df_o2 = df_o2.withColumn("order_id", lit(cnt))
df_child_exist = len(df_o2.take(1)) >= 1
if df_child_exist :
dst = df_o2.selectExpr("childid_id").first()[0]
df_o = df_o.union(df_o2)
return df_o
pid = 0
start = datetime.now()
df_f_1 = all_data(pid, df, "*")
df_f_1.show()
end = datetime.now()
totalTime = end - start
print(f"total execution time :{totalTime}")
**csv file data**
childid parentid
248278 264543
251713 252689
252689 248278
258977 251713
264543 0
**expected output result:**
childId parentId
264543 0
248278 264543
252689 248278
251713 252689
OR
+------+------+-------+
| dst| src|level|
+------+------+-------+
|264543| 0| 0|
|248278|264543| 1|
|252689|248278| 2|
|251713|252689| 3|
|258977|251713| 4||
+------+------+-------+

Raj, Here is my graphFrame answer as requested.
I thought there was a simpler way to do this with GraphFrames. I didn't find a way to find all decedents in a trivial way. I provide two solutions.
from graphframes import GraphFrame
from pyspark.sql.functions import col
# initial dataframe
edgesDf = spark.createDataFrame([
(248278, 264543),
(251713, 252689),
(252689, 248278),
(258977, 251713),
(264543, 0)
],
["dst", "src"]
)
# get all ids as vertices
verticesDf = edgesDf.select(col("dst").alias("id")).union(edgesDf.select("src")).distinct()
# create graphFrame
graphGf = GraphFrame(verticesDf, edgesDf)
# for performance
sc.setCheckpointDir("/tmp/checkpoints")
graphGf.cache()
#### Motif approach
# note that this requires knowing the depth of the tree
fullPathDf = graphGf.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d); (d)-[de]->(e); (e)-[ef]->(f)")
# pivot
edgeDf = fullPathDf.select(col("ab").alias("edge")).union(fullPathDf.select("bc")).union(fullPathDf.select("cd")).union(fullPathDf.select("de")).union(fullPathDf.select("ef"))
# Result
edgeDf.select("edge.dst", "edge.src").show()
### Breadth First Search approach
#
# Does not require knowing the depth, but does require knowing the id of the leaf node
pathDf = graphGf.bfs("id = 0", "id = 258977", maxPathLength = 5)
# pivot
edgeDf = pathDf.select(col("e0").alias("edge")).union(pathDf.select("e1")).union(pathDf.select("e2")).union(pathDf.select("e3")).union(pathDf.select("e4")
#
edgeDf.select("edge.dst", "edge.src").show()

I suggest adding a dataframe checkpoint() to your code. This prevents the dataframe lineage from getting too long and causing performance issues. You code seems to have a number of dataframes, it is not clear to me why you are creating multiple dataframes so I'm not sure which dataframes would benefit from checkpointing. Add checkpoints to dataframes that you modify in every iteration. Here is a nice pyspark explaination of checkpointing

pyspark dataframe foreach to fill a list

I'm working in Spark 1.6.1 and Python 2.7 and I have this thing to solve:
Get a dataframe A with X rows
For each row in A, depending on a field, create one or more rows of a new dataframe B
Save that new dataframe B
The solution that I've come up right now, is to collect dataframe A, go over it, append to a list the row(s) of B and then create the dataframe B from that list.
With this solution i obviously lose all the perks of working with dataframes and I would like to use foreach, but I can't find a way to make this work. I've tried this so far:
Pass an empty list to the foreach function (this just ignores the foreach function and doesn't do anything)
Create a global variable to be use in the foreach function (complains that it can't find the list)
Does anyone has any ideas?
Thank you
----------------------EDIT:
Examples of the things I've tried:
def f(row, list):
if row.one:
list += [Row(type='one', field='ok')]
else:
list += [Row(type='one', field='ok')]
list += [Row(type='two', field='nok')]
list = []
dfA.foreach(lambda x : f(x, list))
As I mention, this does nothing, it doesn't execute the function
And I've also tried (which list defined at the beginning of the class):
global list
def f(row):
if row.one:
list += [Row(type='one', field='ok')]
else:
list += [Row(type='one', field='ok')]
list += [Row(type='two', field='nok')]
dfA.foreach(list)
---------EDIT 2:
What I'm doing right now is:
list = []
for row in dfA.collect():
string = re.search(a_regex, row['raw'])
if string:
dates = re.findall(date_regex, string.group())
for date in dates:
date_string = datetime.strptime(date, '%Y-%m-%d').date()
list += [Row(event_type='1', event_date=date_string)]
b_string = re.search(b_regex, row['raw'])
if b_string:
dates = re.findall(date_regex, b_string.group())
for date in dates:
scheduled_to = datetime.strptime(date, '%Y-%m-%d').date()
list += [Row(event_type='2', event_date= date_string)]
and then:
dfB = self._sql_context.createDataFrame(list)
dfA is given by other process, I can't change it and i know it's a very stupid way of using dataframes but I can't do anything about that
--------------------EDIT3:
dfA.raw sample:
{"new":[],"removed":[{"start":"2018-03-10","end":"2018-03-16","scheduled_by_system":null}]}
{"new":[{"start":"2018-03-10","end":"2018-03-16","scheduled_by_system":null}],"removed":[]}
{"new":[{"start":"2017-01-28","end":"2017-02-03"},{"start":"2017-02-04","end":"2017-02-10"},{"start":"2017-02-11","end":"2017-02-17"},{"start":"2017-02-18","end":"2017-02-24"},{"start":"2017-03-04","end":"2017-03-10"},{"start":"2017-03-11","end":"2017-03-17"},{"start":"2017-03-18","end":"2017-03-24"},{"start":"2017-09-02","end":"2017-09-08"},{"start":"2017-09-16","end":"2017-09-22"},{"start":"2017-09-23","end":"2017-09-29"},{"start":"2017-09-30","end":"2017-10-06"},{"start":"2017-10-07","end":"2017-10-13"},{"start":"2017-12-02","end":"2017-12-08"},{"start":"2017-12-09","end":"2017-12-15"},{"start":"2017-12-16","end":"2017-12-22"},{"start":"2017-12-23","end":"2017-12-29"},{"start":"2018-01-06","end":"2018-01-12"}],"removed":[{"start":"2017-02-04","end":"2017-02-10"},{"start":"2017-02-11","end":"2017-02-17"},{"start":"2017-02-18","end":"2017-02-24"},{"start":"2017-03-04","end":"2017-03-10"},{"start":"2017-03-11","end":"2017-03-17"},{"start":"2017-03-18","end":"2017-03-24"},{"start":"2017-01-28","end":"2017-02-03"},{"start":"2017-09-16","end":"2017-09-22"},{"start":"2017-09-02","end":"2017-09-08"},{"start":"2017-09-30","end":"2017-10-06"},{"start":"2017-10-07","end":"2017-10-13"},{"start":"2017-09-23","end":"2017-09-29"},{"start":"2017-12-16","end":"2017-12-22"},{"start":"2017-12-23","end":"2017-12-29"},{"start":"2018-01-06","end":"2018-01-12"},{"start":"2017-12-09","end":"2017-12-15"},{"start":"2017-12-02","end":"2017-12-08"},{"start":"2018-02-10","end":"2018-02-16"}]}|
and the regex:
a_regex = r'\"new\":{(.*?)}{2}|\"new\":\[(.*?)\]'
b_regex = r'\"removed\":{(.*?)}{2}|removed\":\[(.*?)\]'
date_regex = r'\"start\":\"(\d{4}-\d{2}-\d{2})\"'
dfA.select('raw').show(2,False)
+-------------------------------------------------------------------------------------------------------+
|raw |
+-------------------------------------------------------------------------------------------------------+
|{"new":[{"start":"2018-03-24","end":"2018-03-30","scheduled_by_system":null}],"removed":[]}|
|{"new":[{"start":"2018-03-10","end":"2018-03-16","scheduled_by_system":null}],"removed":[]}|
+-------------------------------------------------------------------------------------------------------+
only showing top 2 rows
df.select('raw').printSchema()
root
|-- raw: string (nullable = true)

You would need to write a udf function to return the event_type and event_date strings after you have selected the required raw column.
import re
def searchUdf(regex, dateRegex, x):
list_return = []
string = re.search(regex, x)
if string:
dates = re.findall(dateRegex, string.group())
for date in dates:
date_string = datetime.strptime(date, '%Y-%m-%d').date()
list_return.append(date_string)
return list_return
from pyspark.sql import functions as F
udfFunctionCall = F.udf(searchUdf, T.ArrayType(T.DateType()))
The udf function would parse the raw column string with the regex and dateRegex passed as arguments and return eventType and data_string as arrayType column
You should be calling the udf function defined and filter out the empty rows and then separate the columns as event_type and event_date columns
df = df.select("raw")
adf = df.select(F.lit(1).alias("event_type"), udfFunctionCall(F.lit(a_regex), F.lit(date_regex), df.raw).alias("event_date"))\
.filter(F.size(F.col("event_date")) > 0)
bdf = df.select(F.lit(2).alias("event_type"), udfFunctionCall(F.lit(a_regex), F.lit(date_regex), df.raw).alias("event_date")) \
.filter(F.size(F.col("event_date")) > 0)
The regex used are provided in the question as
a_regex = r'\"new\":{(.*?)}{2}|\"new\":\[(.*?)\]'
b_regex = r'\"removed\":{(.*?)}{2}|removed\":\[(.*?)\]'
date_regex = r'\"start\":\"(\d{4}-\d{2}-\d{2})\"'
Now that you have two dataframes for both event_type, final step is to merge them together
adf.unionAll(bdf)
And thats it. Your confusion is all solved.
With the following raw column
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|raw |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"new":[],"removed":[{"start":"2018-03-10","end":"2018-03-16","scheduled_by_system":null}]} |
|{"new":[{"start":"2018-03-10","end":"2018-03-16","scheduled_by_system":null}],"removed":[]} |
|{"new":[{"start":"2017-01-28","end":"2017-02-03"},{"start":"2017-02-04","end":"2017-02-10"},{"start":"2017-02-11","end":"2017-02-17"},{"start":"2017-02-18","end":"2017-02-24"},{"start":"2017-03-04","end":"2017-03-10"},{"start":"2017-03-11","end":"2017-03-17"},{"start":"2017-03-18","end":"2017-03-24"},{"start":"2017-09-02","end":"2017-09-08"},{"start":"2017-09-16","end":"2017-09-22"},{"start":"2017-09-23","end":"2017-09-29"},{"start":"2017-09-30","end":"2017-10-06"},{"start":"2017-10-07","end":"2017-10-13"},{"start":"2017-12-02","end":"2017-12-08"},{"start":"2017-12-09","end":"2017-12-15"},{"start":"2017-12-16","end":"2017-12-22"},{"start":"2017-12-23","end":"2017-12-29"},{"start":"2018-01-06","end":"2018-01-12"}],"removed":[{"start":"2017-02-04","end":"2017-02-10"},{"start":"2017-02-11","end":"2017-02-17"},{"start":"2017-02-18","end":"2017-02-24"},{"start":"2017-03-04","end":"2017-03-10"},{"start":"2017-03-11","end":"2017-03-17"},{"start":"2017-03-18","end":"2017-03-24"},{"start":"2017-01-28","end":"2017-02-03"},{"start":"2017-09-16","end":"2017-09-22"},{"start":"2017-09-02","end":"2017-09-08"},{"start":"2017-09-30","end":"2017-10-06"},{"start":"2017-10-07","end":"2017-10-13"},{"start":"2017-09-23","end":"2017-09-29"},{"start":"2017-12-16","end":"2017-12-22"},{"start":"2017-12-23","end":"2017-12-29"},{"start":"2018-01-06","end":"2018-01-12"},{"start":"2017-12-09","end":"2017-12-15"},{"start":"2017-12-02","end":"2017-12-08"},{"start":"2018-02-10","end":"2018-02-16"}]}|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
You should be getting
+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|event_type|event_date |
+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1 |[2018-03-10] |
|1 |[2017-01-28, 2017-02-04, 2017-02-11, 2017-02-18, 2017-03-04, 2017-03-11, 2017-03-18, 2017-09-02, 2017-09-16, 2017-09-23, 2017-09-30, 2017-10-07, 2017-12-02, 2017-12-09, 2017-12-16, 2017-12-23, 2018-01-06]|
|2 |[2018-03-10] |
|2 |[2017-01-28, 2017-02-04, 2017-02-11, 2017-02-18, 2017-03-04, 2017-03-11, 2017-03-18, 2017-09-02, 2017-09-16, 2017-09-23, 2017-09-30, 2017-10-07, 2017-12-02, 2017-12-09, 2017-12-16, 2017-12-23, 2018-01-06]|
+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

PySpark Convert Column<> to Value - python

Related

Optimization on for loop on columns in Pyspark

Pandas UDF in pyspark

Extract a column in pyspark dataframe using udfs

Dataframe performance issue while retrieving rows in hierarchy order in pyspark

pyspark dataframe foreach to fill a list

Categories

Resources