I have two dataframes, each one with a date column, e.g.:
+-----------+
| DEADLINES|
+-----------+
| 2023-07-15|
| 2018-08-10|
| 2022-03-28|
| 2021-06-22|
| 2021-12-18|
| 2021-10-11|
| 2021-11-13|
+-----------+
+----------+
| DT_DATE|
+----------+
|2021-04-02|
|2021-04-21|
|2021-05-01|
|2021-06-03|
|2021-09-07|
|2021-10-12|
|2021-11-02|
+----------+
I need to count how many dates of DT_DATE are between a given reference date and each one of DEADLINES dates.
For example: using 2021-03-31 as reference date should give the following result set.
+-----------+------------+
| DEADLINES| dt_count|
+-----------+------------+
| 2023-07-15| 7|
| 2018-08-10| 0|
| 2022-03-28| 7|
| 2021-06-22| 4|
| 2021-12-18| 7|
| 2021-10-11| 5|
| 2021-11-13| 7|
+-----------+------------+
I managed to make it work by iterating through each row of the deadlines dataframe, but with a larger dataset the performance became very poor.
Does anyone have a better solution?
Edit: that's my current solution:
def count_days(deadlines_df, dates_df, ref_date):
    for row in deadlines_df.collect():
        qtt = dates_df.filter(dates_df.DT_DATE.between(ref_date, row.DEADLINES)).count()
        yield row.DEADLINES, qtt

new_df = spark.createDataFrame(count_days(deadlines_df, dates_df, "2021-03-31"), ["DEADLINES", "dt_count"])
Both dataframes can be unioned with different weights, and a window function with a range from the start to the current row can be used (Scala):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val deadlines = Seq(
  ("2023-07-15"),
  ("2018-08-10"),
  ("2022-03-28"),
  ("2021-06-22"),
  ("2021-12-18"),
  ("2021-10-11"),
  ("2021-11-13")
).toDF("DEADLINES")

val dates = Seq(
  ("2021-04-02"),
  ("2021-04-21"),
  ("2021-05-01"),
  ("2021-06-03"),
  ("2021-09-07"),
  ("2021-10-12"),
  ("2021-11-02")
).toDF("DT_DATE")

val referenceDate = "2021-03-31"

// Deadlines get weight 0, dates at or after the reference date get weight 1.
val united = deadlines.withColumn("weight", lit(0))
  .unionAll(
    dates
      .where($"DT_DATE" >= referenceDate)
      .withColumn("weight", lit(1))
  )

// Running sum of weights from the start of the ordered dates up to the current row.
val fromStartToCurrentRowWindow = Window.orderBy("DEADLINES").rangeBetween(Window.unboundedPreceding, Window.currentRow)

val result = united
  .withColumn("dt_count", sum("weight").over(fromStartToCurrentRowWindow))
  .where($"weight" === lit(0))
  .drop("weight")
Output:
+----------+--------+
|DEADLINES |dt_count|
+----------+--------+
|2018-08-10|0 |
|2021-06-22|4 |
|2021-10-11|5 |
|2021-11-13|7 |
|2021-12-18|7 |
|2022-03-28|7 |
|2023-07-15|7 |
+----------+--------+
Note: the calculation will be executed in one partition, and Spark shows the following warning:
WARN Logging - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
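Since the question uses PySpark, a rough equivalent of the same union-and-window idea in Python could look like this (a sketch, assuming the deadlines_df, dates_df and reference date from the question; it has the same single-partition caveat):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

ref_date = "2021-03-31"

# Deadlines get weight 0, dates at or after the reference date get weight 1.
united = (deadlines_df.withColumn("weight", F.lit(0))
          .unionAll(dates_df
                    .where(F.col("DT_DATE") >= ref_date)
                    .withColumnRenamed("DT_DATE", "DEADLINES")
                    .withColumn("weight", F.lit(1))))

# Running sum of weights from the start up to (and including) the current row.
w = Window.orderBy("DEADLINES").rangeBetween(Window.unboundedPreceding, Window.currentRow)

result = (united
          .withColumn("dt_count", F.sum("weight").over(w))
          .where(F.col("weight") == 0)
          .drop("weight"))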
Another possible solution is joining the two dataframes by range, which leads to a Cartesian join.
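A minimal sketch of that join-based variant in PySpark (assuming the same deadlines_df, dates_df and reference date; Spark plans the non-equi condition as a broadcast nested loop or Cartesian join):
from pyspark.sql import functions as F

ref_date = "2021-03-31"

# The left join keeps deadlines with no matching dates; count() ignores the resulting nulls.
result = (deadlines_df
          .join(dates_df, F.col("DT_DATE").between(F.lit(ref_date), F.col("DEADLINES")), "left")
          .groupBy("DEADLINES")
          .agg(F.count("DT_DATE").alias("dt_count")))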
If you have a small number of deadline dates, you can:
add one column per deadline date to the dates_df dataframe, whose value is 1 when DT_DATE is between ref_date and the deadline date and 0 otherwise
then sum each deadline date column
finally transpose the result dataframe to obtain your desired dataframe
Let's go through it step by step.
Add one column per deadline date:
from pyspark.sql import functions as F
deadline_rows = deadlines_df.collect()
dates_with_deadlines = dates_df
for row in deadline_rows:
    dates_with_deadlines = dates_with_deadlines.withColumn(
        str(row.DEADLINES),
        F.when(
            dates_df.DT_DATE.between(ref_date, row.DEADLINES), F.lit(1)
        ).otherwise(
            F.lit(0)
        )
    )
And you get, with your example, the following dates_with_deadlines dataframe:
+----------+----------+----------+----------+----------+----------+----------+----------+
|DT_DATE |2023-07-15|2018-08-10|2022-03-28|2021-06-22|2021-12-18|2021-10-11|2021-11-13|
+----------+----------+----------+----------+----------+----------+----------+----------+
|2021-04-02|1 |0 |1 |1 |1 |1 |1 |
|2021-04-21|1 |0 |1 |1 |1 |1 |1 |
|2021-05-01|1 |0 |1 |1 |1 |1 |1 |
|2021-06-03|1 |0 |1 |1 |1 |1 |1 |
|2021-09-07|1 |0 |1 |0 |1 |1 |1 |
|2021-10-12|1 |0 |1 |0 |1 |0 |1 |
|2021-11-02|1 |0 |1 |0 |1 |0 |1 |
+----------+----------+----------+----------+----------+----------+----------+----------+
Sum deadlines
aggregated_df = dates_with_deadlines.agg(*[F.sum(str(x.DEADLINES)).alias(str(x.DEADLINES)) for x in deadline_rows])
After this step, you get the following aggregated_df dataframe:
+----------+----------+----------+----------+----------+----------+----------+
|2023-07-15|2018-08-10|2022-03-28|2021-06-22|2021-12-18|2021-10-11|2021-11-13|
+----------+----------+----------+----------+----------+----------+----------+
|7 |0 |7 |4 |7 |5 |7 |
+----------+----------+----------+----------+----------+----------+----------+
Transpose dataframe
result_df = aggregated_df \
.withColumn('merged', F.array(*[F.struct(F.lit(x.DEADLINES).alias('DEADLINES'), F.col(str(x.DEADLINES)).alias('dt_count')) for x in deadline_rows])) \
.drop(*[str(x.DEADLINES) for x in deadline_rows]) \
.withColumn('data', F.explode('merged')) \
.drop('merged') \
.withColumn('DEADLINES', F.col('data.DEADLINES')) \
.withColumn('dt_count', F.col('data.dt_count')) \
.drop('data')
And you have your expected result_df dataframe:
+----------+--------+
|DEADLINES |dt_count|
+----------+--------+
|2023-07-15|7 |
|2018-08-10|0 |
|2022-03-28|7 |
|2021-06-22|4 |
|2021-12-18|7 |
|2021-10-11|5 |
|2021-11-13|7 |
+----------+--------+
Complete Code
from pyspark.sql import functions as F
deadline_rows = deadlines_df.collect()
dates_with_deadlines = dates_df
for row in deadline_rows:
    dates_with_deadlines = dates_with_deadlines.withColumn(
        str(row.DEADLINES),
        F.when(
            dates_df.DT_DATE.between(ref_date, row.DEADLINES), F.lit(1)
        ).otherwise(
            F.lit(0)
        )
    )
aggregated_df = dates_with_deadlines.agg(*[F.sum(str(x.DEADLINES)).alias(str(x.DEADLINES)) for x in deadline_rows])
result_df = aggregated_df \
.withColumn('merged', F.array(*[F.struct(F.lit(x.DEADLINES).alias('DEADLINES'), F.col(str(x.DEADLINES)).alias('dt_count')) for x in deadline_rows])) \
.drop(*[str(x.DEADLINES) for x in deadline_rows]) \
.withColumn('data', F.explode('merged')) \
.drop('merged') \
.withColumn('DEADLINES', F.col('data.DEADLINES')) \
.withColumn('dt_count', F.col('data.dt_count')) \
.drop('data')
Advantages and limits of this solution
With this solution, the only step that cannot be done in a distributed way is the transpose step.
Moreover, unlike your current solution, the aggregation for each deadline column is performed in parallel rather than sequentially.
However, this solution works only if there are few deadline dates (hundreds, maybe thousands of deadline dates): first because we retrieve all those deadline dates in the Spark driver with .collect(), second because in the first step we create one column per deadline date, producing rows with a lot of data, and finally because the last step is also executed on only one executor.
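As a side note, the transpose step can also be written with Spark SQL's stack expression instead of array + explode; a sketch under the same assumptions (the deadline_rows and aggregated_df built above, deadline values rendered as strings):
n = len(deadline_rows)
pairs = ", ".join("'{0}', `{0}`".format(str(x.DEADLINES)) for x in deadline_rows)

# stack(n, label1, value1, label2, value2, ...) turns the single wide row into n rows.
result_df = aggregated_df.selectExpr(
    "stack({}, {}) as (DEADLINES, dt_count)".format(n, pairs)
)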
Related
I have a delta table which has a column with JSON data. I do not have a schema for it and need a way to convert the JSON data into columns:
|id | json_data
| 1 | {"name":"abc", "depts":["dep01", "dep02"]}
| 2 | {"name":"xyz", "depts":["dep03"],"sal":100}
| 3 | {"name":"pqr", "depts":["dep02"], "address":{"city":"SF"}}
Expected output
|id | name | depts | sal | address_city
| 1 | "abc" | ["dep01", "dep02"] | null| null
| 2 | "xyz" | ["dep03"] | 100 | null
| 3 | "pqr" | ["dep02"] | null| "SF"
Input Dataframe -
df = spark.createDataFrame(data = [(1 , """{"name":"abc", "depts":["dep01", "dep02"]}"""), (2 , """{"name":"xyz", "depts":["dep03"],"sal":100}"""), (3 , """{"name":"pqr", "depts":["dep02"], "address":{"city":"SF"}}""")], schema = ["id", "json_data"])
df.show(truncate=False)
+---+----------------------------------------------------------+
|id |json_data |
+---+----------------------------------------------------------+
|1 |{"name":"abc", "depts":["dep01", "dep02"]} |
|2 |{"name":"xyz", "depts":["dep03"],"sal":100} |
|3 |{"name":"pqr", "depts":["dep02"], "address":{"city":"SF"}}|
+---+----------------------------------------------------------+
Convert json_data column to MapType as below -
from pyspark.sql.functions import *
from pyspark.sql.types import *
df1 = df.withColumn("cols", from_json("json_data", MapType(StringType(), StringType()))).drop("json_data")
df1.show(truncate=False)
+---+-----------------------------------------------------------+
|id |cols |
+---+-----------------------------------------------------------+
|1 |{name -> abc, depts -> ["dep01","dep02"]} |
|2 |{name -> xyz, depts -> ["dep03"], sal -> 100} |
|3 |{name -> pqr, depts -> ["dep02"], address -> {"city":"SF"}}|
+---+-----------------------------------------------------------+
Now, column cols needs to be exploded as below -
df2 = df1.select("id",explode("cols").alias("col_columns", "col_rows"))
df2.show(truncate=False)
+---+-----------+-----------------+
|id |col_columns|col_rows |
+---+-----------+-----------------+
|1 |name |abc |
|1 |depts |["dep01","dep02"]|
|2 |name |xyz |
|2 |depts |["dep03"] |
|2 |sal |100 |
|3 |name |pqr |
|3 |depts |["dep02"] |
|3 |address |{"city":"SF"} |
+---+-----------+-----------------+
Once you have col_columns and col_rows as individual columns, all you need to do is pivot col_columns and aggregate it using the first corresponding col_rows, as below -
df3 = df2.groupBy("id").pivot("col_columns").agg(first("col_rows"))
df3.show(truncate=False)
+---+-------------+-----------------+----+----+
|id |address |depts |name|sal |
+---+-------------+-----------------+----+----+
|1 |null |["dep01","dep02"]|abc |null|
|2 |null |["dep03"] |xyz |100 |
|3 |{"city":"SF"}|["dep02"] |pqr |null|
+---+-------------+-----------------+----+----+
Finally, you need to repeat the above steps once more to bring address into a structured format, as below -
df4 = df3.withColumn("address", from_json("address", MapType(StringType(), StringType())))
df4.select("id", "depts", "name", "sal",explode_outer("address").alias("key", "address_city")).drop("key").show(truncate=False)
+---+-----------------+----+----+------------+
|id |depts |name|sal |address_city|
+---+-----------------+----+----+------------+
|1 |["dep01","dep02"]|abc |null|null |
|2 |["dep03"] |xyz |100 |null |
|3 |["dep02"] |pqr |null|SF |
+---+-----------------+----+----+------------+
In order to solve it you can use the split function, as in the code below.
The function takes 2 parameters: the first one is the column itself and the second is the pattern on which to split the elements of the column into an array.
More information and examples can be found here:
https://sparkbyexamples.com/pyspark/pyspark-convert-string-to-array-column/#:~:text=PySpark%20SQL%20provides%20split(),and%20converting%20it%20into%20ArrayType.
from pyspark.sql import functions as F
df.select(F.split(F.col('depts'), ','))
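For illustration, a minimal self-contained sketch (the sample dataframe below is made up):
from pyspark.sql import functions as F

sample = spark.createDataFrame([("dep01,dep02",), ("dep03",)], ["depts"])

# split turns the comma-separated string into an array column:
# rows become [dep01, dep02] and [dep03].
sample.select(F.split(F.col("depts"), ",").alias("depts_array")).show(truncate=False)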
To parse and promote the properties from a JSON string column without a known schema dynamically, I am afraid you cannot use PySpark; it can be done using Scala.
For example, when you have some Avro files produced by Kafka and you want to be able to parse the Value, which is a serialized JSON string, dynamically:
var df = spark.read.format("avro").load("abfss://abc#def.dfs.core.windows.net/xyz.avro").select("Value")
var df_parsed = spark.read.json(df.as[String])
display(df_parsed)
The key is spark.read.json(df.as[String]) in Scala. It basically:
converts that DF (it has only one column that we are interested in in this case; you can of course deal with multiple interesting columns similarly and union whatever you want) to a Dataset of String,
parses the JSON string using the standard Spark read option, which does not require a schema.
So far there are no equivalent methods exposed to PySpark, as far as I know.
I have a dataframe like so:
+----+-----+-----+
|id  |point|count|
+----+-----+-----+
|id_1|5    |9    |
|id_2|5    |1    |
|id_3|4    |3    |
|id_1|3    |3    |
|id_2|4    |3    |
+----+-----+-----+
The id-point pairs are unique.
I would like to group by id and create columns from the point column with values from the count column like so:
+----+-------+-------+-------+
|id  |point_3|point_4|point_5|
+----+-------+-------+-------+
|id_1|3      |0      |9      |
|id_2|0      |3      |1      |
|id_3|0      |3      |0      |
+----+-------+-------+-------+
If you can guide me on how to start this or in which direction to go, it would be much appreciated. I have been feeling stuck on this for a while.
We can use pivot to achieve the required result:
from pyspark.sql import *
from pyspark.sql.functions import *
spark = SparkSession.builder.master("local[*]").getOrCreate()
#sample dataframe
in_values = [("id_1", 5, 9), ("id_2", 5, 1), ("id_3", 4, 3), ("id_1", 3, 3), ("id_2", 4, 3)]
in_df = spark.createDataFrame(in_values, "id string, point int, count int")
out_df = in_df.groupby("id").pivot("point").agg(sum("count"))
# To replace null by 0
out_df = out_df.na.fill(0)
# To rename columns
columns_to_rename = out_df.columns
columns_to_rename.remove("id")
for col in columns_to_rename:
    out_df = out_df.withColumnRenamed(col, f"point_{col}")
out_df.show()
+----+-------+-------+-------+
| id|point_3|point_4|point_5|
+----+-------+-------+-------+
|id_2| 0| 3| 1|
|id_1| 3| 0| 9|
|id_3| 0| 3| 0|
+----+-------+-------+-------+
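If you prefer to avoid the rename loop, the renaming can also be done in a single select (a sketch against out_df right after the na.fill(0) step, in place of the loop above):
from pyspark.sql import functions as F

out_df = out_df.select(
    "id",
    *[F.col(c).alias(f"point_{c}") for c in out_df.columns if c != "id"]
)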
In Pyspark, let's say we are comparing values in 2 columns such as:
df = df.filter(F.col(1) > F.col(2))
If Col 1 has the value 5, and Col 2 has the value NULL, what will happen? Will it be filtered or not?
Does this evaluate to true or false or nothing?
The row will be filtered out. You are trying to compare a value with null, and in Spark SQL any comparison with null evaluates to null (neither true nor false); filter only keeps rows where the predicate is true.
To emulate the same:
Preparing data
case class Test(t1:Int,t2:Int)
var df = Seq(Test(1,1),Test(2,0),Test(3,3)).toDF
df.show(false)
+---+---+
|t1 |t2 |
+---+---+
|1 |1 |
|2 |0 |
|3 |3 |
+---+---+
Comparing not null data
df.filter($"t1">$"t2").show(false)
+---+---+
|t1 |t2 |
+---+---+
|2 |0 |
+---+---+
Adding a column with null
df=df.withColumn("t3",lit(null))
df.show(false)
+---+---+----+
|t1 |t2 |t3 |
+---+---+----+
|1 |1 |null|
|2 |0 |null|
|3 |3 |null|
+---+---+----+
Comparing with null
df.filter($"t1">$"t3").show(false)
+---+---+---+
|t1 |t2 |t3 |
+---+---+---+
+---+---+---+
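The same behaviour can be reproduced in PySpark; a minimal sketch (assuming an active spark session):
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, 1), (2, 0), (3, 3)], ["t1", "t2"])
df = df.withColumn("t3", F.lit(None).cast("int"))

# t1 > t3 evaluates to null for every row; filter only keeps rows where the
# predicate is true, so comparing against the null column returns no rows.
df.filter(F.col("t1") > F.col("t3")).show()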
I have a pyspark dataframe that contains id, timestamp and value column. I am trying to create a dataframe that first groups rows with the same id, then separate the ones that are, say longer than 2 weeks apart and finally concatenate their value into a list.
I have already tried to use the rangeBetween() window function, but it doesn't quite deliver what I want. I think the code below illustrates my question better:
My dataframe sdf:
+---+-------------------------+-----+
|id |tts |value|
+---+-------------------------+-----+
|0 |2019-01-01T00:00:00+00:00|a |
|0 |2019-01-02T00:00:00+00:00|b |
|0 |2019-01-20T00:00:00+00:00|c |
|0 |2019-01-25T00:00:00+00:00|d |
|1 |2019-01-02T00:00:00+00:00|a |
|1 |2019-01-29T00:00:00+00:00|b |
|2 |2019-01-01T00:00:00+00:00|a |
|2 |2019-01-30T00:00:00+00:00|b |
|2 |2019-02-02T00:00:00+00:00|c |
+---+-------------------------+-----+
My approach:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
DAY_SECS = 3600 * 24
w_spec = Window \
.partitionBy('id') \
.orderBy(F.col('tts').cast('timestamp').cast('long')) \
.rangeBetween((Window.currentRow)-(14*DAY_SECS), Window.currentRow)
out = sdf \
.withColumn('val_seq', F.collect_list('value').over(w_spec))
Output:
+---+-------------------------+-----+-------+
|id |tts |value|val_seq|
+---+-------------------------+-----+-------+
|0 |2019-01-01T00:00:00+00:00|a |[a] |
|0 |2019-01-02T00:00:00+00:00|b |[a, b] |
|0 |2019-01-20T00:00:00+00:00|c |[c] |
|0 |2019-01-25T00:00:00+00:00|d |[c, d] |
|1 |2019-01-02T00:00:00+00:00|a |[a] |
|1 |2019-01-29T00:00:00+00:00|b |[b] |
|2 |2019-01-01T00:00:00+00:00|a |[a] |
|2 |2019-01-30T00:00:00+00:00|b |[b] |
|2 |2019-02-02T00:00:00+00:00|c |[b, c] |
+---+-------------------------+-----+-------+
My desired output:
+---+-------------------------+---------+
|id |tts |val_seq|
+---+-------------------------+---------+
|0 |2019-01-02T00:00:00+00:00|[a, b] |
|0 |2019-01-25T00:00:00+00:00|[c, d] |
|1 |2019-01-02T00:00:00+00:00|[a] |
|1 |2019-01-29T00:00:00+00:00|[b] |
|2 |2019-01-30T00:00:00+00:00|[a] |
|2 |2019-02-02T00:00:00+00:00|[b, c] |
+---+-------------------------+---------+
To sum it up: I want to group rows in sdf with the same id, concatenate the value of rows that are no more than 2 weeks apart, and finally only show these rows.
I am really new to pyspark so any suggestions are appreciated!
Below code should work:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

DAY_SECS = 3600 * 24
w_spec = Window \
    .partitionBy('id') \
    .orderBy(F.col('tts').cast('timestamp').cast('long')) \
    .rangeBetween((Window.currentRow)-(14*DAY_SECS), Window.currentRow)
# collect values and their count per id within the trailing 14-day window
out1 = sdf.withColumn('val_seq', F.collect_list('value').over(w_spec)) \
    .withColumn('occurrences_in_window', F.count('tts').over(w_spec))
# keep, per id, the row(s) whose window contains the most values
w_spec2 = Window.partitionBy("id").orderBy(F.col("occurrences_in_window").desc())
out2 = out1.withColumn("rank", F.rank().over(w_spec2)).filter("rank==1")
I have a DataFrame as below
A B C
1 3 1
1 8 2
1 5 3
2 2 1
My output should be as follows, where column B is ordered based on the initial column B value:
A B
1 3,1/5,3/8,2
2 2,1
I wrote something like this in Scala:
df.groupBy("A").withColumn("B", collect_list(concat("B", lit(","), "C")))
But it didn't solve my problem.
Given that you have an input dataframe as
+---+---+---+
|A |B |C |
+---+---+---+
|1 |3 |1 |
|1 |8 |2 |
|1 |5 |3 |
|2 |2 |1 |
+---+---+---+
You can get the following output as
+---+---------------+
|A |B |
+---+---------------+
|1 |[3,1, 5,3, 8,2]|
|2 |[2,1] |
+---+---------------+
By doing a simple groupBy with aggregation functions:
df.orderBy("B").groupBy("A").agg(collect_list(concat_ws(",", col("B"), col("C"))) as "B")
You can use a udf function to get the final desired result as
def joinString = udf((b: mutable.WrappedArray[String]) => {
  b.mkString("/")
})
newdf.withColumn("B", joinString(col("B"))).show(false)
You should get
+---+-----------+
|A |B |
+---+-----------+
|1 |3,1/5,3/8,2|
|2 |2,1 |
+---+-----------+
Note: you would need import org.apache.spark.sql.functions._ and import scala.collection.mutable for all of the above to work.
Edited
Column B is ordered based on the initial column B value
For this you can just remove the orderBy part as
import org.apache.spark.sql.functions._
val newdf = df.groupBy("A").agg(collect_list(concat_ws(",", col("B"), col("C"))) as "B")
def joinString = udf((b: mutable.WrappedArray[String]) => {
  b.mkString("/")
})
newdf.withColumn("B", joinString(col("B"))).show(false)
and you should get output as
+---+-----------+
|A |B |
+---+-----------+
|1 |3,1/8,2/5,3|
|2 |2,1 |
+---+-----------+
This is what you can achieve by using the concat_ws function and then grouping by column A and collecting the list:
val df1 = spark.sparkContext.parallelize(Seq(
  (1, 3, 1),
  (1, 8, 2),
  (1, 5, 3),
  (2, 2, 1)
)).toDF("A", "B", "C")

val result = df1.withColumn("B", concat_ws("/", $"B", $"C"))
result.groupBy("A").agg(collect_list($"B").alias("B")).show
Output:
+---+---------------+
| A| B|
+---+---------------+
| 1|[3/1, 8/2, 5/3]|
| 2| [2/1]|
+---+---------------+
Edited:
Here is what you can do if you want to sort by column B:
val format = udf((value: Seq[String]) => {
  value.sortBy(x => x.split(",")(0)).mkString("/")
})

val result = df1.withColumn("B", concat_ws(",", $"B", $"C"))
  .groupBy($"A").agg(collect_list($"B").alias("B"))
  .withColumn("B", format($"B"))
result.show()
Output:
+---+-----------+
| A| B|
+---+-----------+
| 1|3,1/5,3/8,2|
| 2| 2,1|
+---+-----------+
Hope this was helpful!