I have a pyspark dataframe that contains id, timestamp (tts) and value columns. I am trying to create a dataframe that first groups rows with the same id, then splits each group wherever rows are more than, say, 2 weeks apart, and finally concatenates their value entries into a list.
I have already tried the rangeBetween() window function, but it doesn't quite deliver what I want. I think the code below illustrates my question better:
My dataframe sdf:
+---+-------------------------+-----+
|id |tts |value|
+---+-------------------------+-----+
|0 |2019-01-01T00:00:00+00:00|a |
|0 |2019-01-02T00:00:00+00:00|b |
|0 |2019-01-20T00:00:00+00:00|c |
|0 |2019-01-25T00:00:00+00:00|d |
|1 |2019-01-02T00:00:00+00:00|a |
|1 |2019-01-29T00:00:00+00:00|b |
|2 |2019-01-01T00:00:00+00:00|a |
|2 |2019-01-30T00:00:00+00:00|b |
|2 |2019-02-02T00:00:00+00:00|c |
+---+-------------------------+-----+
My approach:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
DAY_SECS = 3600 * 24
w_spec = Window \
    .partitionBy('id') \
    .orderBy(F.col('tts').cast('timestamp').cast('long')) \
    .rangeBetween((Window.currentRow)-(14*DAY_SECS), Window.currentRow)
out = sdf \
    .withColumn('val_seq', F.collect_list('value').over(w_spec))
Output:
+---+-------------------------+-----+-------+
|id |tts |value|val_seq|
+---+-------------------------+-----+-------+
|0 |2019-01-01T00:00:00+00:00|a |[a] |
|0 |2019-01-02T00:00:00+00:00|b |[a, b] |
|0 |2019-01-20T00:00:00+00:00|c |[c] |
|0 |2019-01-25T00:00:00+00:00|d |[c, d] |
|1 |2019-01-02T00:00:00+00:00|a |[a] |
|1 |2019-01-29T00:00:00+00:00|b |[b] |
|2 |2019-01-01T00:00:00+00:00|a |[a] |
|2 |2019-01-30T00:00:00+00:00|b |[b] |
|2 |2019-02-02T00:00:00+00:00|c |[b, c] |
+---+-------------------------+-----+-------+
My desired output:
+---+-------------------------+---------+
|id |tts |val_seq|
+---+-------------------------+---------+
|0 |2019-01-02T00:00:00+00:00|[a, b] |
|0 |2019-01-25T00:00:00+00:00|[c, d] |
|1 |2019-01-02T00:00:00+00:00|[a] |
|1 |2019-01-29T00:00:00+00:00|[b] |
|2 |2019-01-30T00:00:00+00:00|[a] |
|2 |2019-02-02T00:00:00+00:00|[b, c] |
+---+-------------------------+---------+
To sum it up: I want to group rows in sdf with the same id, concatenate the value of rows that are no more than 2 weeks apart, and finally show only these aggregated rows.
I am really new to pyspark so any suggestions are appreciated!
The code below should work: it collects values over a 14-day window per id, counts how many rows fall in each window, and keeps only the rows whose window count is the highest for that id:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

DAY_SECS = 3600 * 24
w_spec = Window \
    .partitionBy('id') \
    .orderBy(F.col('tts').cast('timestamp').cast('long')) \
    .rangeBetween(Window.currentRow - 14 * DAY_SECS, Window.currentRow)
out1 = sdf \
    .withColumn('val_seq', F.collect_list('value').over(w_spec)) \
    .withColumn('occurrences_in_window', F.count('tts').over(w_spec))
w_spec2 = Window.partitionBy('id').orderBy(F.col('occurrences_in_window').desc())
out2 = out1.withColumn('rank', F.rank().over(w_spec2)).filter('rank == 1')
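If the intent is gap-based groups (a new group starts whenever the next row for an id is more than 14 days after the previous one), a sessionization approach may match the desired output more directly. The sketch below reuses sdf and the column names from the question; it is an untested outline under that assumption, not a verified solution:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

DAY_SECS = 3600 * 24
ts = F.col('tts').cast('timestamp').cast('long')
w = Window.partitionBy('id').orderBy(ts)

grouped = (
    sdf
    .withColumn('gap', ts - F.lag(ts).over(w))                          # seconds since the previous row of this id
    .withColumn('new_grp', (F.col('gap') > 14 * DAY_SECS).cast('int'))  # 1 where a new group starts
    .fillna({'new_grp': 0})                                             # first row of each id stays in group 0
    .withColumn('grp', F.sum('new_grp').over(w))                        # running group index per id
)

out = (
    grouped
    .groupBy('id', 'grp')
    .agg(F.max('tts').alias('tts'), F.collect_list('value').alias('val_seq'))
    .drop('grp')
)
# Note: element order inside collect_list after a groupBy is not guaranteed.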
I have a delta table which has a column with JSON data. I do not have a schema for it and need a way to convert the JSON data into columns
|id | json_data
| 1 | {"name":"abc", "depts":["dep01", "dep02"]}
| 2 | {"name":"xyz", "depts":["dep03"],"sal":100}
| 3 | {"name":"pqr", "depts":["dep02"], "address":{"city":"SF"}}
Expected output
|id | name | depts | sal | address_city
| 1 | "abc" | ["dep01", "dep02"] | null| null
| 2 | "xyz" | ["dep03"] | 100 | null
| 3 | "pqr" | ["dep02"] | null| "SF"
Input Dataframe -
df = spark.createDataFrame(data = [(1 , """{"name":"abc", "depts":["dep01", "dep02"]}"""), (2 , """{"name":"xyz", "depts":["dep03"],"sal":100}"""), (3 , """{"name":"pqr", "depts":["dep02"], "address":{"city":"SF"}}""")], schema = ["id", "json_data"])
df.show(truncate=False)
+---+----------------------------------------------------------+
|id |json_data |
+---+----------------------------------------------------------+
|1 |{"name":"abc", "depts":["dep01", "dep02"]} |
|2 |{"name":"xyz", "depts":["dep03"],"sal":100} |
|3 |{"name":"pqr", "depts":["dep02"], "address":{"city":"SF"}}|
+---+----------------------------------------------------------+
Convert json_data column to MapType as below -
from pyspark.sql.functions import *
from pyspark.sql.types import *
df1 = df.withColumn("cols", from_json("json_data", MapType(StringType(), StringType()))).drop("json_data")
df1.show(truncate=False)
+---+-----------------------------------------------------------+
|id |cols |
+---+-----------------------------------------------------------+
|1 |{name -> abc, depts -> ["dep01","dep02"]} |
|2 |{name -> xyz, depts -> ["dep03"], sal -> 100} |
|3 |{name -> pqr, depts -> ["dep02"], address -> {"city":"SF"}}|
+---+-----------------------------------------------------------+
Now, column cols needs to be exploded as below -
df2 = df1.select("id",explode("cols").alias("col_columns", "col_rows"))
df2.show(truncate=False)
+---+-----------+-----------------+
|id |col_columns|col_rows |
+---+-----------+-----------------+
|1 |name |abc |
|1 |depts |["dep01","dep02"]|
|2 |name |xyz |
|2 |depts |["dep03"] |
|2 |sal |100 |
|3 |name |pqr |
|3 |depts |["dep02"] |
|3 |address |{"city":"SF"} |
+---+-----------+-----------------+
Once you have col_columns and col_rows as individual columns, all that is needed is to pivot col_columns and aggregate it using the first corresponding col_rows, as below -
df3 = df2.groupBy("id").pivot("col_columns").agg(first("col_rows"))
df3.show(truncate=False)
+---+-------------+-----------------+----+----+
|id |address |depts |name|sal |
+---+-------------+-----------------+----+----+
|1 |null |["dep01","dep02"]|abc |null|
|2 |null |["dep03"] |xyz |100 |
|3 |{"city":"SF"}|["dep02"] |pqr |null|
+---+-------------+-----------------+----+----+
Finally, you again need to repeat the above steps to bring address into a structured format, as below -
df4 = df3.withColumn("address", from_json("address", MapType(StringType(), StringType())))
df4.select("id", "depts", "name", "sal",explode_outer("address").alias("key", "address_city")).drop("key").show(truncate=False)
+---+-----------------+----+----+------------+
|id |depts |name|sal |address_city|
+---+-----------------+----+----+------------+
|1 |["dep01","dep02"]|abc |null|null |
|2 |["dep03"] |xyz |100 |null |
|3 |["dep02"] |pqr |null|SF |
+---+-----------------+----+----+------------+
To solve it you can use the split function, as in the code below.
The function takes 2 parameters: the first is the column itself and the second is the pattern on which to split the elements of the column into an array.
More information and examples can be found here:
https://sparkbyexamples.com/pyspark/pyspark-convert-string-to-array-column/#:~:text=PySpark%20SQL%20provides%20split(),and%20converting%20it%20into%20ArrayType.
from pyspark.sql import functions as F
df.select(F.split(F.col('depts'), ','))
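For illustration, a minimal, self-contained sketch of how split behaves on a comma-separated string column (the sample data below is made up for the example, not taken from the question):
from pyspark.sql import functions as F

sample = spark.createDataFrame([("dep01,dep02",), ("dep03",)], ["depts"])
sample.select(F.split(F.col("depts"), ",").alias("depts_array")).show(truncate=False)
# Each depts string becomes an array column, e.g. [dep01, dep02] in the show() output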
To parse and promote the properties from a JSON string column dynamically, without a known schema, I am afraid you cannot use pyspark; it can be done using Scala.
For example, when you have some Avro files produced by Kafka and you want to parse the Value column, which is a serialized JSON string, dynamically:
var df = spark.read.format("avro").load("abfss://abc#def.dfs.core.windows.net/xyz.avro").select("Value")
var df_parsed = spark.read.json(df.as[String])
display(df_parsed)
The key is spark.read.json(df.as[String]) in Scala. It basically:
converts that DF (it has only one column that we are interested in in this case; you can of course deal with multiple columns of interest similarly and union whatever you want) to a Dataset of strings, and
parses the JSON strings using the standard Spark read option, which does not require a schema.
So far, as far as I know, there is no equivalent method exposed to pyspark.
I am an intermediate learner and I have a pandas dataframe like below:
dfx=pd.DataFrame({'ID':['ID_1','ID_2','ID_3','ID_4'],'Extracts':[['QA,QB'], ['QB,QD'], ['QA,QD'], ['QC']],'QA':[0, 0, 0, 0],'QB':[0, 0, 0, 0],'QC':[0, 0, 0, 0],'QD':[0, 0, 0, 0]})
If any of the text in the 'Extracts' column matches one of the last four column names, I want the corresponding cells to be changed from 0 to 1, as shown in the following tables:
From this:
| ID | Extracts | QA | QB| QC|QD |
|----|:--------:|----|---|---|---|
|ID_1|['QA,QB'] |0 |0 |0 |0 |
|ID_2|['QB,QD'] |0 |0 |0 |0 |
|ID_3|['QA,QD'] |0 |0 |0 |0 |
|ID_4|['QC'] |0 |0 |0 |0 |
To this:
| ID | Extracts | QA | QB| QC|QD |
|----|:--------:|----|---|---|---|
|ID_1|['QA,QB'] |1 |1 |0 |0 |
|ID_2|['QB,QD'] |0 |1 |0 |1 |
|ID_3|['QA,QD'] |1 |0 |0 |1 |
|ID_4|['QC'] |0 |0 |1 |0 |
I have tried so far with the intent of looping through the columns:
for i in list(dfx.columns[2:6]):
    print(i)
    if dfx.Extracts.str.contains(i).any():
        dfx.i=1
But I cannot get this working.
I would appreciate it if someone could guide me through this.
Many thanks in advance.
We can use indexing with the str accessor to select the strings, then use get_dummies to create a dataframe of indicator variables, and finally update the original dataframe using the values from the indicator dataframe:
dfx.update(dfx['Extracts'].str[0].str.get_dummies(sep=','))
print(dfx)
ID Extracts QA QB QC QD
0 ID_1 [QA,QB] 1 1 0 0
1 ID_2 [QB,QD] 0 1 0 1
2 ID_3 [QA,QD] 1 0 0 1
3 ID_4 [QC] 0 0 1 0
I have two dataframes, each one with a date column. ie:
+-----------+
| DEADLINES|
+-----------+
| 2023-07-15|
| 2018-08-10|
| 2022-03-28|
| 2021-06-22|
| 2021-12-18|
| 2021-10-11|
| 2021-11-13|
+-----------+
+----------+
| DT_DATE|
+----------+
|2021-04-02|
|2021-04-21|
|2021-05-01|
|2021-06-03|
|2021-09-07|
|2021-10-12|
|2021-11-02|
+----------+
I need to count how many dates of DT_DATE are between a given reference date and each one of DEADLINES dates.
For example: using 2021-03-31 as reference date should give the following result set.
+-----------+------------+
| DEADLINES| dt_count|
+-----------+------------+
| 2023-07-15| 7|
| 2018-08-10| 0|
| 2022-03-28| 7|
| 2021-06-22| 4|
| 2021-12-18| 7|
| 2021-10-11| 5|
| 2021-11-13| 7|
+-----------+------------+
I managed to make it work by iterating through each row of the deadlines dataframe, but with a larger dataset the performance got very poor.
Does anyone have a better solution?
Edit: that's my current solution:
def count_days(deadlines_df, dates_df, ref_date):
    for row in deadlines_df.collect():
        qtt = dates_df.filter(dates_df.DT_DATE.between(ref_date, row.DEADLINES)).count()
        yield row.DEADLINES, qtt

new_df = spark.createDataFrame(count_days(deadlines_df, dates_df, "2021-03-31"), ["DEADLINES", "dt_count"])
Both dataframes can be unioned with different weights, and a Window function with a range from the start to the current row can be used (Scala):
val deadlines = Seq(
  ("2023-07-15"),
  ("2018-08-10"),
  ("2022-03-28"),
  ("2021-06-22"),
  ("2021-12-18"),
  ("2021-10-11"),
  ("2021-11-13")
).toDF("DEADLINES")

val dates = Seq(
  ("2021-04-02"),
  ("2021-04-21"),
  ("2021-05-01"),
  ("2021-06-03"),
  ("2021-09-07"),
  ("2021-10-12"),
  ("2021-11-02")
).toDF("DT_DATE")
val referenceDate = "2021-03-31"
val united = deadlines.withColumn("weight", lit(0))
  .unionAll(
    dates
      .where($"DT_DATE" >= referenceDate)
      .withColumn("weight", lit(1))
  )

val fromStartToCurrentRowWindow = Window.orderBy("DEADLINES").rangeBetween(Window.unboundedPreceding, Window.currentRow)

val result = united
  .withColumn("dt_count", sum("weight").over(fromStartToCurrentRowWindow))
  .where($"weight" === lit(0))
  .drop("weight")
Output:
+----------+--------+
|DEADLINES |dt_count|
+----------+--------+
|2018-08-10|0 |
|2021-06-22|4 |
|2021-10-11|5 |
|2021-11-13|7 |
|2021-12-18|7 |
|2022-03-28|7 |
|2023-07-15|7 |
+----------+--------+
Note: the calculation will be executed in one partition; Spark shows the following warning:
WARN Logging - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
Another possible solution is joining the two dataframes by range, which leads to a cartesian join (a rough sketch is shown below).
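For completeness, a rough pyspark sketch of that join-based approach; it assumes the deadlines_df, dates_df and reference date from the question, and the non-equi join condition forces a nested loop / cartesian-style join, so it can be expensive on large inputs:
from pyspark.sql import functions as F

ref_date = "2021-03-31"
result = (
    deadlines_df
    .join(dates_df, dates_df.DT_DATE.between(F.lit(ref_date), deadlines_df.DEADLINES), "left")
    .groupBy("DEADLINES")
    .agg(F.count("DT_DATE").alias("dt_count"))  # count() skips NULLs, so unmatched deadlines get 0
)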
If you have a small number of deadline dates, you can:
add one column per deadline date to the dates_df dataframe, whose value is 1 when DT_DATE is between ref_date and that deadline date, and 0 otherwise,
then sum each deadline date column,
and finally transpose the result dataframe to obtain your desired dataframe.
Let's see it step by step.
Add one column per deadline date:
from pyspark.sql import functions as F
deadline_rows = deadlines_df.collect()
dates_with_deadlines = dates_df
for row in deadline_rows:
    dates_with_deadlines = dates_with_deadlines.withColumn(
        str(row.DEADLINES),
        F.when(dates_df.DT_DATE.between(ref_date, row.DEADLINES), F.lit(1))
         .otherwise(F.lit(0))
    )
And you get, with your example, the following dates_with_deadlines dataframe:
+----------+----------+----------+----------+----------+----------+----------+----------+
|DT_DATE |2023-07-15|2018-08-10|2022-03-28|2021-06-22|2021-12-18|2021-10-11|2021-11-13|
+----------+----------+----------+----------+----------+----------+----------+----------+
|2021-04-02|1 |0 |1 |1 |1 |1 |1 |
|2021-04-21|1 |0 |1 |1 |1 |1 |1 |
|2021-05-01|1 |0 |1 |1 |1 |1 |1 |
|2021-06-03|1 |0 |1 |1 |1 |1 |1 |
|2021-09-07|1 |0 |1 |0 |1 |1 |1 |
|2021-10-12|1 |0 |1 |0 |1 |0 |1 |
|2021-11-02|1 |0 |1 |0 |1 |0 |1 |
+----------+----------+----------+----------+----------+----------+----------+----------+
Sum deadlines
aggregated_df = dates_with_deadlines.agg(*[F.sum(str(x.DEADLINES)).alias(str(x.DEADLINES)) for x in deadline_rows])
After this step, you get the following aggregated_df dataframe:
+----------+----------+----------+----------+----------+----------+----------+
|2023-07-15|2018-08-10|2022-03-28|2021-06-22|2021-12-18|2021-10-11|2021-11-13|
+----------+----------+----------+----------+----------+----------+----------+
|7 |0 |7 |4 |7 |5 |7 |
+----------+----------+----------+----------+----------+----------+----------+
Transpose dataframe
result_df = aggregated_df \
    .withColumn('merged', F.array(*[F.struct(F.lit(x.DEADLINES).alias('DEADLINES'), F.col(str(x.DEADLINES)).alias('dt_count')) for x in deadline_rows])) \
    .drop(*[str(x.DEADLINES) for x in deadline_rows]) \
    .withColumn('data', F.explode('merged')) \
    .drop('merged') \
    .withColumn('DEADLINES', F.col('data.DEADLINES')) \
    .withColumn('dt_count', F.col('data.dt_count')) \
    .drop('data')
And you have your expected result_df dataframe:
+----------+--------+
|DEADLINES |dt_count|
+----------+--------+
|2023-07-15|7 |
|2018-08-10|0 |
|2022-03-28|7 |
|2021-06-22|4 |
|2021-12-18|7 |
|2021-10-11|5 |
|2021-11-13|7 |
+----------+--------+
Complete Code
from pyspark.sql import functions as F
deadline_rows = deadlines_df.collect()
dates_with_deadlines = dates_df
for row in deadline_rows:
    dates_with_deadlines = dates_with_deadlines.withColumn(
        str(row.DEADLINES),
        F.when(dates_df.DT_DATE.between(ref_date, row.DEADLINES), F.lit(1))
         .otherwise(F.lit(0))
    )
aggregated_df = dates_with_deadlines.agg(*[F.sum(str(x.DEADLINES)).alias(str(x.DEADLINES)) for x in deadline_rows])
result_df = aggregated_df \
    .withColumn('merged', F.array(*[F.struct(F.lit(x.DEADLINES).alias('DEADLINES'), F.col(str(x.DEADLINES)).alias('dt_count')) for x in deadline_rows])) \
    .drop(*[str(x.DEADLINES) for x in deadline_rows]) \
    .withColumn('data', F.explode('merged')) \
    .drop('merged') \
    .withColumn('DEADLINES', F.col('data.DEADLINES')) \
    .withColumn('dt_count', F.col('data.dt_count')) \
    .drop('data')
Advantages and limits of this solution
With this solution, the only step that cannot be done using a distributed system is the transpose step.
Moreover, unlike your current solution, we perform the aggregations for all deadline columns in parallel, not sequentially.
However, this solution works only if there are few deadline dates (hundreds, maybe thousands): first because we retrieve all those deadline dates in the Spark driver with .collect(), second because in the first step we create one column per deadline date, producing rows with a lot of data, and finally because the last step is also executed on only one executor.
In Pyspark, let's say we are comparing values in 2 columns such as:
df = df.filter(F.col(1) > F.col(2))
If Col 1 has the value 5, and Col 2 has the value NULL, what will happen? Will it be filtered or not?
Does this evaluate to true or false or nothing?
The row will be filtered out. Comparing a value with NULL does not evaluate to true or false; it evaluates to NULL, and filter() keeps only rows where the predicate is true, so rows where the comparison involves NULL are dropped.
To emulate the same:
Preparing data
case class Test(t1:Int,t2:Int)
var df = Seq(Test(1,1),Test(2,0),Test(3,3)).toDF
df.show(false)
+---+---+
|t1 |t2 |
+---+---+
|1 |1 |
|2 |0 |
|3 |3 |
+---+---+
Comparing not null data
df.filter($"t1">$"t2").show(false)
+---+---+
|t1 |t2 |
+---+---+
|2 |0 |
+---+---+
Adding a column with null
df=df.withColumn("t3",lit(null))
df.show(false)
+---+---+----+
|t1 |t2 |t3 |
+---+---+----+
|1 |1 |null|
|2 |0 |null|
|3 |3 |null|
+---+---+----+
Comparing with null
df.filter($"t1">$"t3").show(false)
+---+---+---+
|t1 |t2 |t3 |
+---+---+---+
+---+---+---+
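For reference, the same behaviour in pyspark (a minimal sketch with made-up data):
from pyspark.sql import functions as F

df = spark.createDataFrame([(5, None), (5, 3)], ["col1", "col2"])
df.filter(F.col("col1") > F.col("col2")).show()
# +----+----+
# |col1|col2|
# +----+----+
# |   5|   3|
# +----+----+
# The (5, NULL) row is dropped: 5 > NULL evaluates to NULL, and filter keeps only rows where the predicate is true.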
I have a DataFrame as below
A B C
1 3 1
1 8 2
1 5 3
2 2 1
My output should be as below, with column B ordered based on the initial column B value:
A B
1 3,1/5,3/8,2
2 2,1
I wrote something like this in Scala:
df.groupBy("A").withColumn("B",collect_list(concat("B",lit(","),"C"))
But it didn't solve my problem.
Given that you have input dataframe as
+---+---+---+
|A |B |C |
+---+---+---+
|1 |3 |1 |
|1 |8 |2 |
|1 |5 |3 |
|2 |2 |1 |
+---+---+---+
You can get following output as
+---+---------------+
|A |B |
+---+---------------+
|1 |[3,1, 5,3, 8,2]|
|2 |[2,1] |
+---+---------------+
By doing a simple groupBy and aggregation with built-in functions:
df.orderBy("B").groupBy("A").agg(collect_list(concat_ws(",", col("B"), col("C"))) as "B")
You can use a udf function to get the final desired result:
def joinString = udf((b: mutable.WrappedArray[String]) => {
  b.mkString("/")
})
newdf.withColumn("B", joinString(col("B"))).show(false)
You should get
+---+-----------+
|A |B |
+---+-----------+
|1 |3,1/5,3/8,2|
|2 |2,1 |
+---+-----------+
Note you would need import org.apache.spark.sql.functions._ and import scala.collection.mutable for all of the above to work.
Edited
Column B is ordered based on the initial column B value.
For this you can just remove the orderBy part:
import org.apache.spark.sql.functions._
val newdf = df.groupBy("A").agg(collect_list(concat_ws(",", col("B"), col("C"))) as "B")
def joinString = udf((b: mutable.WrappedArray[String]) => {
  b.mkString("/")
})
newdf.withColumn("B", joinString(col("B"))).show(false)
and you should get output as
+---+-----------+
|A |B |
+---+-----------+
|1 |3,1/8,2/5,3|
|2 |2,1 |
+---+-----------+
This is what you can achieve by using the concat_ws function, then grouping by column A and collecting the list:
val df1 = spark.sparkContext.parallelize(Seq(
  (1, 3, 1),
  (1, 8, 2),
  (1, 5, 3),
  (2, 2, 1)
)).toDF("A", "B", "C")
val result = df1.withColumn("B", concat_ws("/", $"B", $"C"))
result.groupBy("A").agg(collect_list($"B").alias("B")).show
Output:
+---+---------------+
| A| B|
+---+---------------+
| 1|[3/1, 8/2, 5/3]|
| 2| [2/1]|
+---+---------------+
Edited:
Here is what you can do if you want to sort by column B:
val format = udf((value: Seq[String]) => {
  value.sortBy(x => x.split(",")(0)).mkString("/")
})

val result = df1.withColumn("B", concat_ws(",", $"B", $"C"))
  .groupBy($"A").agg(collect_list($"B").alias("B"))
  .withColumn("B", format($"B"))
result.show()
Output:
+---+-----------+
| A| B|
+---+-----------+
| 1|3,1/5,3/8,2|
| 2| 2,1|
+---+-----------+
Hope this was helpful!