I'm trying to split comma-separated values in a string column into individual values and count how often each one occurs.
The data I have is formatted as such:
+--------------------+
| tags|
+--------------------+
|cult, horror, got...|
| violence|
| romantic|
|inspiring, romant...|
|cruelty, murder, ...|
|romantic, queer, ...|
|gothic, cruelty, ...|
|mystery, suspense...|
| violence|
|revenge, neo noir...|
+--------------------+
And I want the result to look like
+--------------------+-----+
| tags|count|
+--------------------+-----+
|cult | 4|
|horror | 10|
|goth | 4|
|violence | 30|
...
The code I've tried that hasn't worked is below:
data.select('tags').groupby('tags').count().show(10)
I also tried the countDistinct function, which did not work either.
I think I need a function that splits the values on commas and then lists them individually, but I'm not sure how to implement that.
You can use split() to split strings, then explode(). Finally, groupby and count:
import pyspark.sql.functions as F
df = spark.createDataFrame(data=[
["cult,horror"],
["cult,comedy"],
["romantic,comedy"],
["thriler,horror,comedy"],
], schema=["tags"])
df = df \
.withColumn("tags", F.split("tags", pattern=",")) \
.withColumn("tags", F.explode("tags"))
df = df.groupBy("tags").count()
[Out]:
+--------+-----+
|tags |count|
+--------+-----+
|romantic|1 |
|thriler |1 |
|horror |2 |
|cult |2 |
|comedy |3 |
+--------+-----+
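The question's data has a space after each comma ("cult, horror, ..."), so a plain split on "," would leave leading spaces in the tags. The following is a minimal plain-Python sketch of the same split, explode, and count logic with trimming added. It is only an illustration and not Spark code; `count_tags` is a hypothetical helper name. In Spark, a regex pattern such as `r",\s*"` passed to `F.split`, or `F.trim` applied after `explode`, should have a similar effect.

```python
from collections import Counter

def count_tags(rows, sep=","):
    """Split each string on `sep`, flatten the pieces, and count them."""
    counts = Counter()
    for row in rows:
        # "split" + "explode": one entry per tag, surrounding whitespace trimmed
        tags = [tag.strip() for tag in row.split(sep)]
        # "groupBy + count": tally each non-empty tag
        counts.update(tag for tag in tags if tag)
    return counts

# Assumed sample rows, formatted like the question's data (comma plus space)
rows = ["cult, horror", "cult, comedy", "romantic, comedy", "thriller, horror, comedy"]
tag_counts = count_tags(rows)
print(tag_counts.most_common())
```

Without the `strip()` call, "horror" and " horror" would be counted as two different tags.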
I have a delta table which has a column with JSON data. I do not have a schema for it and need a way to convert the JSON data into columns
|id | json_data
| 1 | {"name":"abc", "depts":["dep01", "dep02"]}
| 2 | {"name":"xyz", "depts":["dep03"],"sal":100}
| 3 | {"name":"pqr", "depts":["dep02"], "address":{"city":"SF"}}
Expected output
|id | name | depts | sal | address_city
| 1 | "abc" | ["dep01", "dep02"] | null| null
| 2 | "xyz" | ["dep03"] | 100 | null
| 3 | "pqr" | ["dep02"] | null| "SF"
Input Dataframe -
df = spark.createDataFrame(data = [(1 , """{"name":"abc", "depts":["dep01", "dep02"]}"""), (2 , """{"name":"xyz", "depts":["dep03"],"sal":100}"""), (3 , """{"name":"pqr", "depts":["dep02"], "address":{"city":"SF"}}""")], schema = ["id", "json_data"])
df.show(truncate=False)
+---+----------------------------------------------------------+
|id |json_data |
+---+----------------------------------------------------------+
|1 |{"name":"abc", "depts":["dep01", "dep02"]} |
|2 |{"name":"xyz", "depts":["dep03"],"sal":100} |
|3 |{"name":"pqr", "depts":["dep02"], "address":{"city":"SF"}}|
+---+----------------------------------------------------------+
Convert json_data column to MapType as below -
from pyspark.sql.functions import *
from pyspark.sql.types import *
df1 = df.withColumn("cols", from_json("json_data", MapType(StringType(), StringType()))).drop("json_data")
df1.show(truncate=False)
+---+-----------------------------------------------------------+
|id |cols |
+---+-----------------------------------------------------------+
|1 |{name -> abc, depts -> ["dep01","dep02"]} |
|2 |{name -> xyz, depts -> ["dep03"], sal -> 100} |
|3 |{name -> pqr, depts -> ["dep02"], address -> {"city":"SF"}}|
+---+-----------------------------------------------------------+
Now, column cols needs to be exploded as below -
df2 = df1.select("id",explode("cols").alias("col_columns", "col_rows"))
df2.show(truncate=False)
+---+-----------+-----------------+
|id |col_columns|col_rows |
+---+-----------+-----------------+
|1 |name |abc |
|1 |depts |["dep01","dep02"]|
|2 |name |xyz |
|2 |depts |["dep03"] |
|2 |sal |100 |
|3 |name |pqr |
|3 |depts |["dep02"] |
|3 |address |{"city":"SF"} |
+---+-----------+-----------------+
Once you have col_columns and col_rows as individual columns, all that remains is to pivot on col_columns and aggregate using the first corresponding col_rows value, as below -
df3 = df2.groupBy("id").pivot("col_columns").agg(first("col_rows"))
df3.show(truncate=False)
+---+-------------+-----------------+----+----+
|id |address |depts |name|sal |
+---+-------------+-----------------+----+----+
|1 |null |["dep01","dep02"]|abc |null|
|2 |null |["dep03"] |xyz |100 |
|3 |{"city":"SF"}|["dep02"] |pqr |null|
+---+-------------+-----------------+----+----+
Finally, repeat the steps above on the address column to bring it into a structured format, as below -
df4 = df3.withColumn("address", from_json("address", MapType(StringType(), StringType())))
df4.select("id", "depts", "name", "sal",explode_outer("address").alias("key", "address_city")).drop("key").show(truncate=False)
+---+-----------------+----+----+------------+
|id |depts |name|sal |address_city|
+---+-----------------+----+----+------------+
|1 |["dep01","dep02"]|abc |null|null |
|2 |["dep03"] |xyz |100 |null |
|3 |["dep02"] |pqr |null|SF |
+---+-----------------+----+----+------------+
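To make the idea behind the steps above easier to follow, here is a minimal plain-Python sketch of the same schema-free flattening: parse each JSON string, flatten nested objects into underscore-joined keys, and use the union of all keys as the column list. It is an illustration of the logic only, not Spark code; `flatten`, `records`, `columns`, and `table` are hypothetical names.

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict with underscore-joined keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested object: recurse, e.g. address.city -> address_city
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value  # lists and scalars are kept as-is
    return flat

records = [
    (1, '{"name":"abc", "depts":["dep01", "dep02"]}'),
    (2, '{"name":"xyz", "depts":["dep03"],"sal":100}'),
    (3, '{"name":"pqr", "depts":["dep02"], "address":{"city":"SF"}}'),
]

parsed = [(rid, flatten(json.loads(raw))) for rid, raw in records]
# The "pivot": the union of all keys becomes the column list
columns = sorted({key for _, row in parsed for key in row})
# Missing keys become None, like the nulls in the Spark output
table = [{"id": rid, **{c: row.get(c) for c in columns}} for rid, row in parsed]
for line in table:
    print(line)
```

Unlike the MapType approach, this sketch keeps the original JSON types (for example, sal stays a number), whereas a map of string to string turns every value into a string.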
To solve this, you can use the split function, as in the code below.
The function takes two parameters: the first is the column itself and the second is the pattern on which to split the string into array elements.
More information and examples can be found here:
https://sparkbyexamples.com/pyspark/pyspark-convert-string-to-array-column/#:~:text=PySpark%20SQL%20provides%20split(),and%20converting%20it%20into%20ArrayType.
from pyspark.sql import functions as F
df.select(F.split(F.col('depts'), ','))
To dynamically parse and promote the properties of a JSON string column without a known schema, I am afraid you cannot use PySpark; it can be done in Scala.
For example, say you have some Avro files produced by Kafka and want to dynamically parse the Value column, which is a serialized JSON string:
var df = spark.read.format("avro").load("abfss://abc#def.dfs.core.windows.net/xyz.avro").select("Value")
var df_parsed = spark.read.json(df.as[String])
display(df_parsed)
The key is spark.read.json(df.as[String]) in Scala. It basically does two things:
Converts that DataFrame to a Dataset of strings (it has only one column of interest in this case; you can of course handle multiple columns of interest similarly and union whatever you want).
Parses the JSON strings using the standard Spark read option, which does not require a schema.
As far as I know, no equivalent method is exposed to PySpark so far.
I have a pyspark dataframe like the input data below. I would like to split the values in the productname column on white space. I'd then like to create new columns with the first 3 values. I have example input and output data below. Can someone please suggest how to do this with pyspark?
input data:
+------+-------------------+
|id |productname |
+------+-------------------+
|235832|EXTREME BERRY Sweet|
|419736|BLUE CHASER SAUCE |
|124513|LAAVA C2L5 |
+------+-------------------+
output:
+------+-------------------+-------------+-------------+-------------+
|id |productname |product1 |product2 |product3 |
+------+-------------------+-------------+-------------+-------------+
|235832|EXTREME BERRY Sweet|EXTREME |BERRY |Sweet |
|419736|BLUE CHASER SAUCE |BLUE |CHASER |SAUCE |
|124513|LAAVA C2L5 |LAAVA |C2L5 | |
+------+-------------------+-------------+-------------+-------------+
Split the productname column, then create the new columns by index using element_at or .getItem().
from pyspark.sql.functions import split, col, element_at, coalesce, lit
# element_at uses 1-based indexes
df.withColumn("tmp",split(col("productname"),r"\s+")).\
withColumn("product1",element_at(col("tmp"),1)).\
withColumn("product2",element_at(col("tmp"),2)).\
withColumn("product3",coalesce(element_at(col("tmp"),3),lit(""))).drop("tmp").show()
# or, with getItem, which uses 0-based indexes
df.withColumn("tmp",split(col("productname"),r"\s+")).\
withColumn("product1",col("tmp").getItem(0)).\
withColumn("product2",col("tmp").getItem(1)).\
withColumn("product3",coalesce(col("tmp").getItem(2),lit(""))).drop("tmp").show()
#+------+-------------------+--------+--------+--------+
#| id| productname|product1|product2|product3|
#+------+-------------------+--------+--------+--------+
#|235832|EXTREME BERRY Sweet| EXTREME| BERRY| Sweet|
#| 4| BLUE CHASER SAUCE| BLUE| CHASER| SAUCE|
#| 1| LAAVA C2L5| LAAVA| C2L5| |
#+------+-------------------+--------+--------+--------+
For a more dynamic approach:
df.show()
#+------+-------------------+
#| id| productname|
#+------+-------------------+
#|235832|EXTREME BERRY Sweet|
#| 4| BLUE CHASER SAUCE|
#| 1| LAAVA C2L5|
#+------+-------------------+
from pyspark.sql.functions import size, split, col, desc, coalesce, lit
# calculate the maximum array size and store it in a variable
arr=int(df.select(size(split(col("productname"),r"\s+")).alias("size")).orderBy(desc("size")).collect()[0][0])
# loop over range(arr), adding one column per position and replacing null with ""
(df.withColumn('temp', split('productname', r'\s+')).select("*",*(coalesce(col('temp').getItem(i),lit("")).alias('product{}'.format(i+1)) for i in range(arr))).drop("temp").show())
#+------+-------------------+--------+--------+--------+
#| id| productname|product1|product2|product3|
#+------+-------------------+--------+--------+--------+
#|235832|EXTREME BERRY Sweet| EXTREME| BERRY| Sweet|
#| 4| BLUE CHASER SAUCE| BLUE| CHASER| SAUCE|
#| 1| LAAVA C2L5| LAAVA| C2L5| |
#+------+-------------------+--------+--------+--------+
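The dynamic approach boils down to two passes: find the widest split, then pad every row to that width. Here is a minimal plain-Python sketch of that logic, for illustration only and not Spark code; `split_to_columns` is a hypothetical helper, and it assumes at least one input row.

```python
import re

def split_to_columns(rows, prefix="product"):
    """Split each string on whitespace and pad to the widest row with ''."""
    parts = [re.split(r"\s+", row.strip()) for row in rows]
    # First pass: the longest split decides how many columns are needed
    width = max(len(p) for p in parts)
    # Second pass: one column per position, "" where a row is too short
    return [
        {f"{prefix}{i + 1}": (p[i] if i < len(p) else "") for i in range(width)}
        for p in parts
    ]

names = ["EXTREME BERRY Sweet", "BLUE CHASER SAUCE", "LAAVA C2L5"]
result = split_to_columns(names)
for row in result:
    print(row)
```

In Spark, the first pass costs an extra job because of the collect() call, so a fixed column count is cheaper when the maximum width is known in advance.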
You can use split, element_at, and a when/otherwise clause with array_union to pad short arrays with empty strings.
from pyspark.sql import functions as F
# array_union removes duplicates, so the one-word case pads with " " and "" rather than two empty strings
df.withColumn("array", F.split("productname", " "))\
.withColumn("array", F.when(F.size("array")==2, F.array_union(F.col("array"),F.array(F.lit(""))))\
.when(F.size("array")==1, F.array_union(F.col("array"),F.array(F.lit(" "),F.lit(""))))\
.otherwise(F.col("array")))\
.withColumn("product1", F.element_at("array",1))\
.withColumn("product2", F.element_at("array",2))\
.withColumn("product3", F.element_at("array",3)).drop("array")\
.show(truncate=False)
+------+-------------------+--------+--------+--------+
|id |productname |product1|product2|product3|
+------+-------------------+--------+--------+--------+
|235832|EXTREME BERRY Sweet|EXTREME |BERRY |Sweet |
|419736|BLUE CHASER SAUCE |BLUE |CHASER |SAUCE |
|124513|LAAVA C2L5 |LAAVA |C2L5 | |
|123455|LAVA |LAVA | | |
+------+-------------------+--------+--------+--------+
I have a data frame in pyspark like below.
+---+-------------+------------+
| id| device| model|
+---+-------------+------------+
| 3| mac pro| mac|
| 1| iphone| iphone5|
| 1|android phone| android|
| 1| windows pc| windows|
| 1| spy camera| spy camera|
| 2| | camera|
| 3| cctv| cctv|
| 2| iphone|apple iphone|
| 3| spy camera| |
+---+-------------+------------+
I want to create a column by concatenating the unique values in the device and model columns for each id.
This is what I have done so far.
First, I concatenated the device and model columns:
df1 = df.select(col("id"), concat(col("model"), lit(","), col("device")).alias('con'))
+---+--------------------+
| id| con|
+---+--------------------+
| 3| mac,mac pro|
| 1| iphone5,iphone|
| 1|android,android p...|
| 1| windows,windows pc|
| 1|spy camera,spy ca...|
| 2| camera,|
| 3| cctv,cctv|
| 2| apple iphone,iphone|
| 3| ,spy camera|
+---+--------------------+
Then I grouped by id:
df2 = df1.groupBy("id").agg(f.concat_ws(",", f.collect_set(df1.con)).alias('Group_con'))
+---+-----------------------------------------------------------------------------+
| id| Group_con|
+---+-----------------------------------------------------------------------------+
| 1|iphone5,iphone,android,android phone,windows,windows pc,spy camera,spy camera|
| 2| camera,,apple iphone,iphone|
| 3| mac,mac pro,cctv,cctv,,spy camera|
+---+-----------------------------------------------------------------------------+
But I am getting duplicate values in the result. How can I avoid duplicate values in the final data frame?
Use F.array(), F.explode() and F.collect_set():
from pyspark.sql import functions as F
df.withColumn('con', F.explode(F.array('device', 'model'))) \
.groupby('id').agg(F.collect_set('con').alias('Group_con')) \
.show(3,0)
# +---+--------------------------------------------------------------------------+
# |id |Group_con |
# +---+--------------------------------------------------------------------------+
# |3 |[cctv, mac pro, spy camera, mac] |
# |1 |[windows pc, iphone5, windows, iphone, android phone, spy camera, android]|
# |2 |[apple iphone, camera, iphone] |
# +---+--------------------------------------------------------------------------+
(tested on spark version 2.2.1)
You can remove the duplicates by using collect_set and a UDF, as follows:
from pyspark.sql import functions as f
from pyspark.sql import types as t
def uniqueStringUdf(device, model):
    return ','.join(set(filter(bool, device + model)))
uniqueStringUdfCall = f.udf(uniqueStringUdf, t.StringType())
df.groupBy("id").agg(uniqueStringUdfCall(f.collect_set("device"), f.collect_set("model")).alias("con")).show(truncate=False)
which should give you
+---+------------------------------------------------------------------+
|id |con |
+---+------------------------------------------------------------------+
|3 |spy camera,mac,mac pro,cctv |
|1 |spy camera,windows,iphone5,windows pc,iphone,android phone,android|
|2 |camera,iphone,apple iphone |
+---+------------------------------------------------------------------+
where,
device + model concatenates the two collected sets into one list
filter(bool, device + model) removes the empty strings from the concatenated list
set(filter(bool, device + model)) removes the duplicate strings from the concatenated list
','.join(set(filter(bool, device + model))) joins all the remaining elements into a comma-separated string.
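The function body can be tried outside Spark, since it is ordinary Python operating on two lists. Below is a minimal sketch with the steps separated out; `unique_string` is a hypothetical name, and the `sorted` call is an addition that is not in the answer, included only because set ordering is otherwise arbitrary.

```python
def unique_string(device, model):
    """Merge two lists, drop empty strings and duplicates, join with commas."""
    merged = device + model          # concatenate the two collected lists
    non_empty = filter(bool, merged) # drop empty strings
    unique = set(non_empty)          # drop duplicates
    # sorted() is an addition here: it makes the output order deterministic
    return ",".join(sorted(unique))

# Values for id 2 from the question: device has an empty string
result = unique_string(["", "iphone"], ["camera", "apple iphone"])
print(result)  # apple iphone,camera,iphone
```

Without the sort, the order of the joined values can differ between runs, which is why the answer's output lists them in no particular order.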
I hope the answer is helpful
I'm not sure how helpful this will be, but one solution I can think of is to check for duplicate values in the column and then delete them by their position/index.
Or
Split all the values on the comma "," into a list and remove the duplicates by comparing each value. Alternatively, count() the occurrences of each value and, if a value occurs more than once, delete all its duplicates other than the first one.
Sorry if this wasn't helpful. These are the two ways I could think of to solve your problem.
I have a DataFrame whose data I am pasting below:
+---------------+--------------+----------+------------+----------+
|name | DateTime| Seq|sessionCount|row_number|
+---------------+--------------+----------+------------+----------+
| abc| 1521572913344| 17| 5| 1|
| xyz| 1521572916109| 17| 5| 2|
| rafa| 1521572916118| 17| 5| 3|
| {}| 1521572916129| 17| 5| 4|
| experience| 1521572917816| 17| 5| 5|
+---------------+--------------+----------+------------+----------+
The column 'name' is of type string. I want a new column "effective_name" that contains the incremental values of "name", as shown below:
+---------------+--------------+----------+------------+----------+-------------------------+
|name | DateTime |sessionSeq|sessionCount|row_number |effective_name|
+---------------+--------------+----------+------------+----------+-------------------------+
|abc |1521572913344 |17 |5 |1 |abc |
|xyz |1521572916109 |17 |5 |2 |abcxyz |
|rafa |1521572916118 |17 |5 |3 |abcxyzrafa |
|{} |1521572916129 |17 |5 |4 |abcxyzrafa{} |
|experience |1521572917816 |17 |5 |5 |abcxyzrafa{}experience |
+---------------+--------------+----------+------------+----------+-------------------------+
The new column contains the incremental concatenation of its previous values of the name column.
You can achieve this by using a pyspark.sql.Window that orders by DateTime, together with pyspark.sql.functions.concat_ws and pyspark.sql.functions.collect_list:
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.orderBy("DateTime") # define Window for ordering
df.drop("Seq", "sessionCount", "row_number").select(
"*",
f.concat_ws(
"",
f.collect_list(f.col("name")).over(w)
).alias("effective_name")
).show(truncate=False)
#+---------------+--------------+-------------------------+
#|name | DateTime|effective_name |
#+---------------+--------------+-------------------------+
#|abc |1521572913344 |abc |
#|xyz |1521572916109 |abcxyz |
#|rafa |1521572916118 |abcxyzrafa |
#|{} |1521572916129 |abcxyzrafa{} |
#|experience |1521572917816 |abcxyzrafa{}experience |
#+---------------+--------------+-------------------------+
I dropped "Seq", "sessionCount", "row_number" to make the output display friendlier.
If you need to do this per group, you can add a partitionBy to the Window. Say in this case you want to group by the Seq column (shown as sessionSeq in the expected output); you can do the following:
w = Window.partitionBy("Seq").orderBy("DateTime")
df.drop("sessionCount", "row_number").select(
"*",
f.concat_ws(
"",
f.collect_list(f.col("name")).over(w)
).alias("effective_name")
).show(truncate=False)
#+---------------+--------------+----------+-------------------------+
#|name | DateTime|sessionSeq|effective_name |
#+---------------+--------------+----------+-------------------------+
#|abc |1521572913344 |17 |abc |
#|xyz |1521572916109 |17 |abcxyz |
#|rafa |1521572916118 |17 |abcxyzrafa |
#|{} |1521572916129 |17 |abcxyzrafa{} |
#|experience |1521572917816 |17 |abcxyzrafa{}experience |
#+---------------+--------------+----------+-------------------------+
If you prefer to use withColumn, the above is equivalent to:
df.drop("sessionCount", "row_number").withColumn(
"effective_name",
f.concat_ws(
"",
f.collect_list(f.col("name")).over(w)
)
).show(truncate=False)
Explanation
You want to apply a function over multiple rows, which is called an aggregation. With any aggregation, you need to define which rows to aggregate over and the order. We do this using a Window. In this case, w = Window.partitionBy("Seq").orderBy("DateTime") will partition the data by the Seq and sort by the DateTime.
We first apply the aggregate function collect_list("name") over the window. This gathers all of the values from the name column and puts them in a list. The order of insertion is defined by the Window's order.
For example, the intermediate output of this step would be:
df.select(
f.collect_list("name").over(w).alias("collected")
).show()
#+--------------------------------+
#|collected |
#+--------------------------------+
#|[abc] |
#|[abc, xyz] |
#|[abc, xyz, rafa] |
#|[abc, xyz, rafa, {}] |
#|[abc, xyz, rafa, {}, experience]|
#+--------------------------------+
Now that the appropriate values are in the list, we can concatenate them together with an empty string as the separator.
df.select(
f.concat_ws(
"",
f.collect_list("name").over(w)
).alias("concatenated")
).show()
#+-----------------------+
#|concatenated |
#+-----------------------+
#|abc |
#|abcxyz |
#|abcxyzrafa |
#|abcxyzrafa{} |
#|abcxyzrafa{}experience |
#+-----------------------+
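The running concatenation that the ordered window produces can also be sketched in plain Python, which may help when checking expected results by hand. This is an illustration only and not Spark code; `rows`, `ordered`, and `effective` are hypothetical names, and the rows are deliberately shuffled to show that the sort on DateTime is what fixes the order.

```python
from itertools import accumulate

# (name, DateTime) pairs from the question, deliberately out of order
rows = [
    ("rafa", 1521572916118),
    ("abc", 1521572913344),
    ("experience", 1521572917816),
    ("xyz", 1521572916109),
    ("{}", 1521572916129),
]

# Equivalent of Window.orderBy("DateTime")
ordered = sorted(rows, key=lambda row: row[1])
# Equivalent of concat_ws("", collect_list("name").over(w)):
# accumulate() adds by default, which for strings is concatenation
effective = list(accumulate(name for name, _ in ordered))

for (name, ts), eff in zip(ordered, effective):
    print(name, ts, eff)
```

The last element of `effective` is the full concatenation, matching the final row of the Spark output.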
Solution:
import pyspark.sql.functions as f
from pyspark.sql import Window
# partition by Seq and order by DateTime within each partition
w = Window.partitionBy("Seq").orderBy("DateTime")
df.select(
    "*",
    f.concat_ws(
        "",
        f.collect_set(f.col("name")).over(w)
    ).alias("cumulative_name")
).show()
Explanation
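collect_set() - Over this window, this function returns, for each row, an array of the distinct name values seen so far, for example [abc, xyz, rafa, {}, experience] for the last row.
concat_ws() - This function takes the output of collect_set() as input and joins its elements with the given separator; with "" as the separator, the last row becomes abcxyzrafa{}experience.
Note:
collect_set() drops duplicate values and does not guarantee element order, so it is only safe when name has no duplicates and you have checked the resulting order; otherwise use collect_list().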