I have a DataFrame as below
A B C
1 3 1
1 8 2
1 5 3
2 2 1
My output should be as below. Column B is ordered based on the initial column B value:
A B
1 3,1/5,3/8,2
2 2,1
I wrote something like this in Scala:
df.groupBy("A").withColumn("B",collect_list(concat("B",lit(","),"C"))
But it didn't solve my problem.
Given that you have an input dataframe as
+---+---+---+
|A |B |C |
+---+---+---+
|1 |3 |1 |
|1 |8 |2 |
|1 |5 |3 |
|2 |2 |1 |
+---+---+---+
You can get the following output
+---+---------------+
|A |B |
+---+---------------+
|1 |[3,1, 5,3, 8,2]|
|2 |[2,1] |
+---+---------------+
by doing a simple groupBy with aggregation and built-in functions:
df.orderBy("B").groupBy("A").agg(collect_list(concat_ws(",", col("B"), col("C"))) as "B")
You can then use a udf function to get the final desired result:
def joinString = udf((b: mutable.WrappedArray[String]) => {
  b.mkString("/")
})
newdf.withColumn("B", joinString(col("B"))).show(false)
You should get
+---+-----------+
|A |B |
+---+-----------+
|1 |3,1/5,3/8,2|
|2 |2,1 |
+---+-----------+
Note that you would need import org.apache.spark.sql.functions._ and import scala.collection.mutable for all of the above to work
Edited
Column B is ordered based on the initial column B value
For this you can just remove the orderBy part:
import org.apache.spark.sql.functions._
val newdf = df.groupBy("A").agg(collect_list(concat_ws(",", col("B"), col("C"))) as "B")
def joinString = udf((b: mutable.WrappedArray[String]) => {
  b.mkString("/")
})
newdf.withColumn("B", joinString(col("B"))).show(false)
and you should get the following output:
+---+-----------+
|A |B |
+---+-----------+
|1 |3,1/8,2/5,3|
|2 |2,1 |
+---+-----------+
You can achieve this by using the concat_ws function, then grouping by column A and collecting the list:
val df1 = spark.sparkContext.parallelize(Seq(
  (1, 3, 1),
  (1, 8, 2),
  (1, 5, 3),
  (2, 2, 1)
)).toDF("A", "B", "C")
val result = df1.withColumn("B", concat_ws("/", $"B", $"C"))
result.groupBy("A").agg(collect_list($"B").alias("B")).show
Output:
+---+---------------+
| A| B|
+---+---------------+
| 1|[3/1, 8/2, 5/3]|
| 2| [2/1]|
+---+---------------+
Edited:
Here is what you can do if you want to sort by column B:
val format = udf((value: Seq[String]) => {
  value.sortBy(x => x.split(",")(0)).mkString("/")
})
val result = df1.withColumn("B", concat_ws(",", $"B", $"C"))
  .groupBy($"A").agg(collect_list($"B").alias("B"))
  .withColumn("B", format($"B"))
result.show()
Output:
+---+-----------+
| A| B|
+---+-----------+
| 1|3,1/5,3/8,2|
| 2| 2,1|
+---+-----------+
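As a side note, the ordering done by the format udf can be sketched in plain Python (outside Spark) to make the logic concrete; format_group is an illustrative name, not part of any Spark API:

```python
def format_group(values):
    # Sort by the first comma-separated token (string comparison, mirroring
    # x.split(",")(0) in the Scala udf), then join the entries with "/"
    return "/".join(sorted(values, key=lambda v: v.split(",")[0]))

print(format_group(["3,1", "8,2", "5,3"]))  # 3,1/5,3/8,2
```

Note that the comparison is lexicographic, so a value such as 10 would sort before 2; cast the token to int in the key function if numeric ordering is needed.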
Hope this was helpful!
I have a Delta table which has a column with JSON data. I do not have a schema for it and need a way to convert the JSON data into columns:
|id | json_data
| 1 | {"name":"abc", "depts":["dep01", "dep02"]}
| 2 | {"name":"xyz", "depts":["dep03"],"sal":100}
| 3 | {"name":"pqr", "depts":["dep02"], "address":{"city":"SF"}}
Expected output
|id | name | depts | sal | address_city
| 1 | "abc" | ["dep01", "dep02"] | null| null
| 2 | "xyz" | ["dep03"] | 100 | null
| 3 | "pqr" | ["dep02"] | null| "SF"
Input Dataframe -
df = spark.createDataFrame(data = [(1 , """{"name":"abc", "depts":["dep01", "dep02"]}"""), (2 , """{"name":"xyz", "depts":["dep03"],"sal":100}"""), (3 , """{"name":"pqr", "depts":["dep02"], "address":{"city":"SF"}}""")], schema = ["id", "json_data"])
df.show(truncate=False)
+---+----------------------------------------------------------+
|id |json_data |
+---+----------------------------------------------------------+
|1 |{"name":"abc", "depts":["dep01", "dep02"]} |
|2 |{"name":"xyz", "depts":["dep03"],"sal":100} |
|3 |{"name":"pqr", "depts":["dep02"], "address":{"city":"SF"}}|
+---+----------------------------------------------------------+
Convert json_data column to MapType as below -
from pyspark.sql.functions import *
from pyspark.sql.types import *
df1 = df.withColumn("cols", from_json("json_data", MapType(StringType(), StringType()))).drop("json_data")
df1.show(truncate=False)
+---+-----------------------------------------------------------+
|id |cols |
+---+-----------------------------------------------------------+
|1 |{name -> abc, depts -> ["dep01","dep02"]} |
|2 |{name -> xyz, depts -> ["dep03"], sal -> 100} |
|3 |{name -> pqr, depts -> ["dep02"], address -> {"city":"SF"}}|
+---+-----------------------------------------------------------+
Now, column cols needs to be exploded as below -
df2 = df1.select("id",explode("cols").alias("col_columns", "col_rows"))
df2.show(truncate=False)
+---+-----------+-----------------+
|id |col_columns|col_rows |
+---+-----------+-----------------+
|1 |name |abc |
|1 |depts |["dep01","dep02"]|
|2 |name |xyz |
|2 |depts |["dep03"] |
|2 |sal |100 |
|3 |name |pqr |
|3 |depts |["dep02"] |
|3 |address |{"city":"SF"} |
+---+-----------+-----------------+
Once you have col_columns and col_rows as individual columns, all that is needed is to pivot col_columns and aggregate it using its corresponding first col_rows as below -
df3 = df2.groupBy("id").pivot("col_columns").agg(first("col_rows"))
df3.show(truncate=False)
+---+-------------+-----------------+----+----+
|id |address |depts |name|sal |
+---+-------------+-----------------+----+----+
|1 |null |["dep01","dep02"]|abc |null|
|2 |null |["dep03"] |xyz |100 |
|3 |{"city":"SF"}|["dep02"] |pqr |null|
+---+-------------+-----------------+----+----+
Finally, you again need to repeat the above steps to bring address into a structured format as below -
df4 = df3.withColumn("address", from_json("address", MapType(StringType(), StringType())))
df4.select("id", "depts", "name", "sal",explode_outer("address").alias("key", "address_city")).drop("key").show(truncate=False)
+---+-----------------+----+----+------------+
|id |depts |name|sal |address_city|
+---+-----------------+----+----+------------+
|1 |["dep01","dep02"]|abc |null|null |
|2 |["dep03"] |xyz |100 |null |
|3 |["dep02"] |pqr |null|SF |
+---+-----------------+----+----+------------+
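To make the overall approach concrete, here is a minimal plain-Python sketch of the same idea (parse each JSON string, flatten one nested level, take the union of keys across rows, and fill missing fields with null); flatten_records is an illustrative helper, not a Spark API:

```python
import json

def flatten_records(records):
    # records: (id, json_string) pairs
    rows = []
    for rid, raw in records:
        row = {"id": rid}
        for key, value in json.loads(raw).items():
            if isinstance(value, dict):
                # Flatten one nested level, e.g. address -> address_city
                for nk, nv in value.items():
                    row[f"{key}_{nk}"] = nv
            else:
                row[key] = value
        rows.append(row)
    # Give every row the same set of columns, filling gaps with None (null)
    cols = sorted({c for r in rows for c in r})
    return [{c: r.get(c) for c in cols} for r in rows]
```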
To solve this you can use the split function as in the code below.
The function takes 2 parameters: the first is the column itself and the second is the pattern to split the elements of the column on.
More information and examples can be found here:
https://sparkbyexamples.com/pyspark/pyspark-convert-string-to-array-column/#:~:text=PySpark%20SQL%20provides%20split(),and%20converting%20it%20into%20ArrayType.
from pyspark.sql import functions as F
df.select(F.split(F.col('depts'), ','))
To parse and promote the properties from a JSON string column without a known schema dynamically, I am afraid you cannot use PySpark; it can be done using Scala.
For example, when you have some Avro files produced by Kafka and you want to be able to parse the Value column, which is a serialized JSON string, dynamically:
var df = spark.read.format("avro").load("abfss://abc#def.dfs.core.windows.net/xyz.avro").select("Value")
var df_parsed = spark.read.json(df.as[String])
display(df_parsed)
The key is spark.read.json(df.as[String]) in Scala. It basically:
Converts that DF (it has only one column we are interested in in this case; you can of course deal with multiple columns of interest similarly and union whatever you want) to a Dataset of String.
Parses the JSON string using the standard Spark read option, which does not require a schema.
So far there are no equivalent methods exposed to PySpark, as far as I know.
I have a dataframe like so:
+----+-----+-----+
|id  |point|count|
+----+-----+-----+
|id_1|5    |9    |
|id_2|5    |1    |
|id_3|4    |3    |
|id_1|3    |3    |
|id_2|4    |3    |
+----+-----+-----+
The id-point pairs are unique.
I would like to group by id and create columns from the point column with values from the count column like so:
+----+-------+-------+-------+
|id  |point_3|point_4|point_5|
+----+-------+-------+-------+
|id_1|3      |0      |9      |
|id_2|0      |3      |1      |
|id_3|0      |3      |0      |
+----+-------+-------+-------+
If you can guide me on how to start this or in which direction to go, it would be much appreciated. I have felt stuck on this for a while.
We can use pivot to achieve the required result:
from pyspark.sql import *
from pyspark.sql.functions import *
spark = SparkSession.builder.master("local[*]").getOrCreate()
#sample dataframe
in_values = [("id_1", 5, 9), ("id_2", 5, 1), ("id_3", 4, 3), ("id_1", 3, 3), ("id_2", 4, 3)]
in_df = spark.createDataFrame(in_values, "id string, point int, count int")
out_df = in_df.groupby("id").pivot("point").agg(sum("count"))
# To replace null by 0
out_df = out_df.na.fill(0)
# To rename columns (use a plain loop variable to avoid shadowing the imported col function)
columns_to_rename = out_df.columns
columns_to_rename.remove("id")
for c in columns_to_rename:
    out_df = out_df.withColumnRenamed(c, f"point_{c}")
out_df.show()
+----+-------+-------+-------+
| id|point_3|point_4|point_5|
+----+-------+-------+-------+
|id_2| 0| 3| 1|
|id_1| 3| 0| 9|
|id_3| 0| 3| 0|
+----+-------+-------+-------+
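The pivot itself is just a reshaping of (id, point, count) triples; a plain-Python sketch of the same logic may help to see what pivot does (pivot_counts is an illustrative helper, assuming sum aggregation and 0 fill as above):

```python
from collections import defaultdict

def pivot_counts(rows):
    # rows: (id, point, count) triples; returns {id: {"point_<p>": total}}
    points = sorted({p for _, p, _ in rows})
    table = defaultdict(lambda: {f"point_{p}": 0 for p in points})
    for rid, p, c in rows:
        table[rid][f"point_{p}"] += c  # aggregate with sum, as in the pivot
    return dict(table)
```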
In PySpark, let's say we are comparing values in 2 columns, such as:
df = df.filter(F.col('col1') > F.col('col2'))
If Col 1 has the value 5, and Col 2 has the value NULL, what will happen? Will it be filtered or not?
Does this evaluate to true or false or nothing?
The comparison will evaluate to null, not false: in SQL semantics, comparing a value with null yields null (unknown). filter only keeps rows where the predicate is true, so the row will be filtered out.
To emulate the same:
Preparing data
case class Test(t1:Int,t2:Int)
var df = Seq(Test(1,1),Test(2,0),Test(3,3)).toDF
df.show(false)
+---+---+
|t1 |t2 |
+---+---+
|1 |1 |
|2 |0 |
|3 |3 |
+---+---+
Comparing not null data
df.filter($"t1">$"t2").show(false)
+---+---+
|t1 |t2 |
+---+---+
|2 |0 |
+---+---+
Adding a column with null
df=df.withColumn("t3",lit(null))
df.show(false)
+---+---+----+
|t1 |t2 |t3 |
+---+---+----+
|1 |1 |null|
|2 |0 |null|
|3 |3 |null|
+---+---+----+
Comparing with null
df.filter($"t1">$"t3").show(false)
+---+---+---+
|t1 |t2 |t3 |
+---+---+---+
+---+---+---+
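The same behaviour can be modelled in plain Python with SQL's three-valued logic: a comparison involving NULL yields NULL (None here), and the filter keeps a row only when the predicate is strictly true. sql_gt is an illustrative helper, not a Spark API:

```python
def sql_gt(a, b):
    # SQL three-valued logic: any comparison involving NULL yields NULL
    if a is None or b is None:
        return None
    return a > b

rows = [(5, None), (2, 0), (1, 3)]
# filter keeps a row only when the predicate is True (not None, not False)
kept = [r for r in rows if sql_gt(r[0], r[1]) is True]
print(kept)  # [(2, 0)]
```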
I have a PySpark dataframe that has a string column which contains a comma separated list of values (up to 5 values), like this:
+----+----------------------+
|col1|col2 |
+----+----------------------+
|1 | 'a1, b1, c1' |
|2 | 'a2, b2' |
|3 | 'a3, b3, c3, d3, e3' |
+----+----------------------+
I want to tokenize col2 and create 5 different columns out of col2, possibly with null values if the tokenization returns less than 5 values:
+----+----+----+----+----+----+
|col1|col3|col4|col5|col6|col7|
+----+----+----+----+----+----+
|1 |'a1'|'b1'|'c1'|null|null|
|2 |'a2'|'b2'|null|null|null|
|3 |'a3'|'b3'|'c3'|'d3'|'e3'|
+----+----+----+----+----+----+
Any help will be much appreciated.
Just split that column and select.
from pyspark.sql.functions import split, col

df.withColumn('col2', split('col2', ', ')) \
    .select(col('col1'), *[col('col2')[i].alias('col' + str(i + 3)) for i in range(0, 5)]) \
    .show()
+----+----+----+----+----+----+
|col1|col3|col4|col5|col6|col7|
+----+----+----+----+----+----+
| 1| a1| b1| c1|null|null|
| 2| a2| b2|null|null|null|
| 3| a3| b3| c3| d3| e3|
+----+----+----+----+----+----+
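The split-and-pad behaviour can be sketched in plain Python as well; tokenize is an illustrative name, and n=5 matches the maximum number of values stated in the question:

```python
def tokenize(s, n=5):
    # Split on comma, strip whitespace, and pad with None up to n tokens
    parts = [p.strip() for p in s.split(",")]
    return parts + [None] * (n - len(parts))

print(tokenize("a1, b1, c1"))  # ['a1', 'b1', 'c1', None, None]
```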
I have a pyspark dataframe that contains id, timestamp and value columns. I am trying to create a dataframe that first groups rows with the same id, then separates the ones that are, say, longer than 2 weeks apart, and finally concatenates their values into a list.
I have already tried to use the rangeBetween() Window function, but it doesn't quite deliver what I want. I think the code below illustrates my question better:
My dataframe sdf:
+---+-------------------------+-----+
|id |tts |value|
+---+-------------------------+-----+
|0 |2019-01-01T00:00:00+00:00|a |
|0 |2019-01-02T00:00:00+00:00|b |
|0 |2019-01-20T00:00:00+00:00|c |
|0 |2019-01-25T00:00:00+00:00|d |
|1 |2019-01-02T00:00:00+00:00|a |
|1 |2019-01-29T00:00:00+00:00|b |
|2 |2019-01-01T00:00:00+00:00|a |
|2 |2019-01-30T00:00:00+00:00|b |
|2 |2019-02-02T00:00:00+00:00|c |
+---+-------------------------+-----+
My approach:
from pyspark.sql.window import Window
from pyspark.sql import functions as F

DAY_SECS = 3600 * 24
w_spec = Window \
    .partitionBy('id') \
    .orderBy(F.col('tts').cast('timestamp').cast('long')) \
    .rangeBetween(Window.currentRow - 14 * DAY_SECS, Window.currentRow)
out = sdf \
    .withColumn('val_seq', F.collect_list('value').over(w_spec))
Output:
+---+-------------------------+-----+-------+
|id |tts |value|val_seq|
+---+-------------------------+-----+-------+
|0 |2019-01-01T00:00:00+00:00|a |[a] |
|0 |2019-01-02T00:00:00+00:00|b |[a, b] |
|0 |2019-01-20T00:00:00+00:00|c |[c] |
|0 |2019-01-25T00:00:00+00:00|d |[c, d] |
|1 |2019-01-02T00:00:00+00:00|a |[a] |
|1 |2019-01-29T00:00:00+00:00|b |[b] |
|2 |2019-01-01T00:00:00+00:00|a |[a] |
|2 |2019-01-30T00:00:00+00:00|b |[b] |
|2 |2019-02-02T00:00:00+00:00|c |[b, c] |
+---+-------------------------+-----+-------+
My desired output:
+---+-------------------------+---------+
|id |tts |val_seq|
+---+-------------------------+---------+
|0 |2019-01-02T00:00:00+00:00|[a, b] |
|0 |2019-01-25T00:00:00+00:00|[c, d] |
|1 |2019-01-02T00:00:00+00:00|[a] |
|1 |2019-01-29T00:00:00+00:00|[b] |
|2 |2019-01-30T00:00:00+00:00|[a] |
|2 |2019-02-02T00:00:00+00:00|[b, c] |
+---+-------------------------+---------+
To sum it up: I want to group rows in sdf with the same id, and further concatenate the value for rows that are not longer than 2 weeks apart and finally only show these rows.
I am really new to pyspark so any suggestions are appreciated!
The below code should work:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import col, rank

DAY_SECS = 3600 * 24
w_spec = Window \
    .partitionBy('id') \
    .orderBy(F.col('tts').cast('timestamp').cast('long')) \
    .rangeBetween(Window.currentRow - 14 * DAY_SECS, Window.currentRow)
out1 = sdf.withColumn('val_seq', F.collect_list('value').over(w_spec)) \
    .withColumn('occurrences_in_window', F.count('tts').over(w_spec))
w_spec2 = Window.partitionBy("id").orderBy(col("occurrences_in_window").desc())
out2 = out1.withColumn("rank", rank().over(w_spec2)).filter("rank == 1")
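The gap-based grouping the question asks for can also be sketched outside Spark, which may help to check intermediate results; sessionize is an illustrative helper that starts a new group whenever the gap to the previous row of the same id exceeds 14 days and reports each group's last timestamp:

```python
from datetime import datetime, timedelta
from itertools import groupby

GAP = timedelta(days=14)

def sessionize(rows):
    # rows: (id, "YYYY-MM-DD", value) triples. Start a new group whenever the
    # gap to the previous row of the same id exceeds 14 days; report each
    # group's last timestamp together with the collected values.
    result = []
    for key, grp in groupby(sorted(rows), key=lambda r: r[0]):
        session, prev, last_ts = [], None, None
        for _, ts, val in grp:
            t = datetime.strptime(ts, "%Y-%m-%d")
            if prev is not None and t - prev > GAP:
                result.append((key, last_ts, session))
                session = []
            session.append(val)
            prev, last_ts = t, ts
        result.append((key, last_ts, session))
    return result
```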