Get distinct count of values in single row in Pyspark DataFrame - python

I'm trying to split the comma-separated values in a string column into individual values and count each individual value.
The data I have is formatted as such:
+--------------------+
| tags|
+--------------------+
|cult, horror, got...|
| violence|
| romantic|
|inspiring, romant...|
|cruelty, murder, ...|
|romantic, queer, ...|
|gothic, cruelty, ...|
|mystery, suspense...|
| violence|
|revenge, neo noir...|
+--------------------+
And I want the result to look like
+--------------------+-----+
| tags|count|
+--------------------+-----+
|cult | 4|
|horror | 10|
|goth | 4|
|violence | 30|
...
The code I've tried that hasn't worked is below:
data.select('tags').groupby('tags').count().show(10)
I also tried a countDistinct function, which also failed to work.
I feel like I need a function that separates the values by comma and then lists them, but I'm not sure how to implement that.

You can use split() to split the strings, then explode() to get one row per value. Finally, group by and count:
import pyspark.sql.functions as F

df = spark.createDataFrame(data=[
    ["cult,horror"],
    ["cult,comedy"],
    ["romantic,comedy"],
    ["thriler,horror,comedy"],
], schema=["tags"])

df = df \
    .withColumn("tags", F.split("tags", pattern=",")) \
    .withColumn("tags", F.explode("tags"))

df = df.groupBy("tags").count()
df.show(truncate=False)
[Out]:
+--------+-----+
|tags |count|
+--------+-----+
|romantic|1 |
|thriler |1 |
|horror |2 |
|cult |2 |
|comedy |3 |
+--------+-----+
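The question's tags are separated by a comma plus a space, so splitting on a literal "," would leave leading spaces in the values. A minimal sketch of the same approach on a small toy frame, splitting on the regex ",\s*" instead:
import pyspark.sql.functions as F

# toy data shaped like the question's column (comma + space separated)
df = spark.createDataFrame([["cult, horror, gothic"], ["violence"], ["romantic"]], ["tags"])

df.withColumn("tag", F.explode(F.split("tags", r",\s*"))) \
    .groupBy("tag") \
    .count() \
    .show(truncate=False)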

Related

Splitting string type dictionary in PySpark dataframe into individual columns [duplicate]

I have a Delta table which has a column containing JSON data. I do not have a schema for it and need a way to convert the JSON data into columns:
|id | json_data
| 1 | {"name":"abc", "depts":["dep01", "dep02"]}
| 2 | {"name":"xyz", "depts":["dep03"],"sal":100}
| 3 | {"name":"pqr", "depts":["dep02"], "address":{"city":"SF"}}
Expected output
|id | name | depts | sal | address_city
| 1 | "abc" | ["dep01", "dep02"] | null| null
| 2 | "xyz" | ["dep03"] | 100 | null
| 3 | "pqr" | ["dep02"] | null| "SF"
Input Dataframe -
df = spark.createDataFrame(data = [(1 , """{"name":"abc", "depts":["dep01", "dep02"]}"""), (2 , """{"name":"xyz", "depts":["dep03"],"sal":100}"""), (3 , """{"name":"pqr", "depts":["dep02"], "address":{"city":"SF"}}""")], schema = ["id", "json_data"])
df.show(truncate=False)
+---+----------------------------------------------------------+
|id |json_data |
+---+----------------------------------------------------------+
|1 |{"name":"abc", "depts":["dep01", "dep02"]} |
|2 |{"name":"xyz", "depts":["dep03"],"sal":100} |
|3 |{"name":"pqr", "depts":["dep02"], "address":{"city":"SF"}}|
+---+----------------------------------------------------------+
Convert json_data column to MapType as below -
from pyspark.sql.functions import *
from pyspark.sql.types import *
df1 = df.withColumn("cols", from_json("json_data", MapType(StringType(), StringType()))).drop("json_data")
df1.show(truncate=False)
+---+-----------------------------------------------------------+
|id |cols |
+---+-----------------------------------------------------------+
|1 |{name -> abc, depts -> ["dep01","dep02"]} |
|2 |{name -> xyz, depts -> ["dep03"], sal -> 100} |
|3 |{name -> pqr, depts -> ["dep02"], address -> {"city":"SF"}}|
+---+-----------------------------------------------------------+
Now, column cols needs to be exploded as below -
df2 = df1.select("id",explode("cols").alias("col_columns", "col_rows"))
df2.show(truncate=False)
+---+-----------+-----------------+
|id |col_columns|col_rows |
+---+-----------+-----------------+
|1 |name |abc |
|1 |depts |["dep01","dep02"]|
|2 |name |xyz |
|2 |depts |["dep03"] |
|2 |sal |100 |
|3 |name |pqr |
|3 |depts |["dep02"] |
|3 |address |{"city":"SF"} |
+---+-----------+-----------------+
Once you have col_columns and col_rows as individual columns, all that is needed is to pivot col_columns and aggregate it using the first of its corresponding col_rows as below -
df3 = df2.groupBy("id").pivot("col_columns").agg(first("col_rows"))
df3.show(truncate=False)
+---+-------------+-----------------+----+----+
|id |address |depts |name|sal |
+---+-------------+-----------------+----+----+
|1 |null |["dep01","dep02"]|abc |null|
|2 |null |["dep03"] |xyz |100 |
|3 |{"city":"SF"}|["dep02"] |pqr |null|
+---+-------------+-----------------+----+----+
Finally, you need to repeat the above steps once more to bring address into a structured format as below -
df4 = df3.withColumn("address", from_json("address", MapType(StringType(), StringType())))
df4.select("id", "depts", "name", "sal",explode_outer("address").alias("key", "address_city")).drop("key").show(truncate=False)
+---+-----------------+----+----+------------+
|id |depts |name|sal |address_city|
+---+-----------------+----+----+------------+
|1 |["dep01","dep02"]|abc |null|null |
|2 |["dep03"] |xyz |100 |null |
|3 |["dep02"] |pqr |null|SF |
+---+-----------------+----+----+------------+
To solve it you can use the split function as in the code below.
The function takes two parameters: the first one is the column itself and the second is the pattern used to split the column's elements.
More information and examples can be found here:
https://sparkbyexamples.com/pyspark/pyspark-convert-string-to-array-column/#:~:text=PySpark%20SQL%20provides%20split(),and%20converting%20it%20into%20ArrayType.
from pyspark.sql import functions as F
df.select(F.split(F.col('depts'), ','))
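A minimal sketch of how that call behaves, assuming a string column named depts holding comma-separated values (the toy frame below is illustrative, not from the question):
import pyspark.sql.functions as F
df = spark.createDataFrame([("dep01,dep02",), ("dep03",)], ["depts"])
df.withColumn("depts", F.split(F.col("depts"), ",")).show(truncate=False)
# prints an array column, roughly:
# +--------------+
# |depts         |
# +--------------+
# |[dep01, dep02]|
# |[dep03]       |
# +--------------+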
To dynamically parse and promote the properties from a JSON string column without a known schema, I am afraid you cannot use PySpark; it can be done using Scala.
For example, when you have some Avro files produced by Kafka and you want to parse the Value, which is a serialized JSON string, dynamically:
var df = spark.read.format("avro").load("abfss://abc#def.dfs.core.windows.net/xyz.avro").select("Value")
var df_parsed = spark.read.json(df.as[String])
display(df_parsed)
The key is spark.read.json(df.as[String]) in Scala. It basically:
Converts that DF (it has only one column of interest in this case; you can of course deal with multiple columns of interest similarly and union whatever you want) to a Dataset of String.
Parses the JSON strings using the standard spark read option, which does not require a schema.
So far there is no equivalent method exposed to PySpark, as far as I know.

pyspark aggregate while find the first value of the group

Suppose I have 5 TB of data with the following schema, and I am using Pyspark.
| id | date | Month | KPI_1 | ... | KPI_n
For 90% of the KPIs, I only need to know the sum/min/max value aggregated to the (id, Month) level. For the remaining 10%, I need to know the first value based on date.
One option for me is to use window. For example, I can do
from pyspark.sql import Window
import pyspark.sql.functions as F
w = Window.partitionBy("id", "Month").orderBy(F.desc("date"))
# for the 90% kpi
agg_df = df.withColumn("kpi_1", F.sum("kpi_1").over(w))
agg_df = agg_df.withColumn("kpi_2", F.max("kpi_2").over(w))
agg_df = agg_df.withColumn("kpi_3", F.min("kpi_3").over(w))
...
# Select last row for each window to get last accumulated sum for 90% kpis and last value for 10% kpi (which is equivalent to first value if ranked ascending).
# continue process agg_df with filters based on sum/max/min values of 90% KIPs.
But I am not sure how to select the last row of each window. Does anyone have any suggestions, or is there a better way to aggregate?
Let's assume we have this data
+---+----------+-------+-----+-----+
| id| date| month|kpi_1|kpi_2|
+---+----------+-------+-----+-----+
| 1|2000-01-01|2000-01| 1| 100|
| 1|2000-01-02|2000-01| 2| 200|
| 1|2000-01-03|2000-01| 3| 300|
| 1|2000-01-04|2000-01| 4| 400|
| 1|2000-01-05|2000-01| 5| 500|
| 1|2000-02-01|2000-02| 10| 11|
| 1|2000-02-02|2000-02| 20| 21|
| 1|2000-02-03|2000-02| 30| 31|
| 1|2000-02-04|2000-02| 40| 41|
+---+----------+-------+-----+-----+
and we want to calculate the min, max and sum for kpi_1 and get the last value of kpi_2 for each group.
Getting the min, max and sum of kpi_1 can be achieved by grouping the data by id and month. With Spark >= 3.0.0, max_by can be used to get the latest value of kpi_2:
df_avg = df \
    .groupBy("id", "month") \
    .agg(F.sum("kpi_1"), F.min("kpi_1"), F.max("kpi_1"), F.expr("max_by(kpi_2, date)"))
df_avg.show()
prints
+---+-------+----------+----------+----------+-------------------+
| id| month|sum(kpi_1)|min(kpi_1)|max(kpi_1)|max_by(kpi_2, date)|
+---+-------+----------+----------+----------+-------------------+
| 1|2000-02| 100| 10| 40| 41|
| 1|2000-01| 15| 1| 5| 500|
+---+-------+----------+----------+----------+-------------------+
For Spark versions < 3.0.0, max_by is not available, so getting the last value of kpi_2 for each group is more difficult. A first idea could be to use the aggregation function first() on a descending-ordered data frame. A simple test gave me the correct result, but unfortunately the documentation states "The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle".
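For concreteness, a sketch of that first() idea (kept here only to illustrate the caveat: the sort below is not guaranteed to survive the shuffle triggered by the groupBy):
import pyspark.sql.functions as F
df_naive = df.orderBy(F.desc("date")) \
    .groupBy("id", "month") \
    .agg(F.first("kpi_2").alias("last_kpi_2"))  # non-deterministic, per the docs quoted above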
A better approach to get the last value of kpi_2 is to use a window, as shown in the question. The window function row_number() would work:
w = Window.partitionBy("id", "Month").orderBy(F.desc("date"))
df_first = df.withColumn("row_number", F.row_number().over(w)) \
    .where("row_number = 1") \
    .drop("row_number") \
    .select("id", "month", "KPI_2")
df_first.show()
prints
+---+-------+-----+
| id| month|KPI_2|
+---+-------+-----+
| 1|2000-02| 41|
| 1|2000-01| 500|
+---+-------+-----+
Joining the first part (without the max_by column) and the second part gives the desired result:
df_result = df_avg.join(df_first, ['id', 'month'])
df_result.show()
prints
+---+-------+----------+----------+----------+-----+
| id| month|sum(kpi_1)|min(kpi_1)|max(kpi_1)|KPI_2|
+---+-------+----------+----------+----------+-----+
| 1|2000-02| 100| 10| 40| 41|
| 1|2000-01| 15| 1| 5| 500|
+---+-------+----------+----------+----------+-----+
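As an aside (not part of the answer above), another workaround often used on Spark < 3.0.0 is to take the max of a struct whose first field is the ordering column; structs compare field by field, so the kpi_2 paired with the latest date wins. A sketch with the same column names:
import pyspark.sql.functions as F
df_single_pass = df.groupBy("id", "month").agg(
    F.sum("kpi_1").alias("sum_kpi_1"),
    F.min("kpi_1").alias("min_kpi_1"),
    F.max("kpi_1").alias("max_kpi_1"),
    # max over (date, kpi_2) pairs; the kpi_2 of the winning pair is the latest value
    F.max(F.struct("date", "kpi_2")).getField("kpi_2").alias("last_kpi_2"),
)
df_single_pass.show()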

get first numeric values from pyspark dataframe string column into new column

I have a PySpark dataframe like the input data below. I would like to create a new column, product1_num, that holds the first numeric value parsed from each record in the productname column. I have example output data below. I'm not sure what's available in PySpark as far as string splitting and regex matching go. Can anyone suggest how to do this with PySpark?
input data:
+------+-------------------+
|id |productname |
+------+-------------------+
|234832|EXTREME BERRY SAUCE|
|419836|BLUE KOSHER SAUCE |
|350022|GUAVA (1G) |
|123213|GUAVA 1G |
+------+-------------------+
output:
+------+-------------------+-------------+
|id |productname |product1_num |
+------+-------------------+-------------+
|234832|EXTREME BERRY SAUCE| |
|419836|BLUE KOSHER SAUCE | |
|350022|GUAVA (1G) |1 |
|123213|GUAVA G5 |5 |
|125513|3GULA G5 |3 |
|127143|GUAVA G50 |50 |
|124513|LAAVA C2L5 |2 |
+------+-------------------+-------------+
You can use regexp_extract:
from pyspark.sql import functions as F
df.withColumn("product1_num", F.regexp_extract("productname", "([0-9]+)",1)).show()
+------+-------------------+------------+
| id| productname|product1_num|
+------+-------------------+------------+
|234832|EXTREME BERRY SAUCE| |
|419836| BLUE KOSHER SAUCE| |
|350022| GUAVA (1G)| 1|
|123213| GUAVA G5| 5|
|125513| 3GULA G5| 3|
|127143| GUAVA G50| 50|
|124513| LAAVA C2L5| 2|
+------+-------------------+------------+
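As a small follow-up sketch: regexp_extract returns an empty string when nothing matches, so casting the result to an integer would turn the no-match rows into nulls (the cast is an assumption about the desired output, not part of the question):
from pyspark.sql import functions as F
df.withColumn(
    "product1_num",
    F.regexp_extract("productname", "([0-9]+)", 1).cast("int")  # null where no digits are found
).show()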

get unique values when concatenating two columns pyspark data frame

I have a data frame in pyspark like below.
+---+-------------+------------+
| id| device| model|
+---+-------------+------------+
| 3| mac pro| mac|
| 1| iphone| iphone5|
| 1|android phone| android|
| 1| windows pc| windows|
| 1| spy camera| spy camera|
| 2| | camera|
| 3| cctv| cctv|
| 2| iphone|apple iphone|
| 3| spy camera| |
+---+-------------+------------+
I want to create a column by concatenating the unique values in the device and model columns for each id.
Here is what I have done.
First, I concatenated the device and model columns:
df1 = df.select(col("id"), concat(col("model"), lit(","), col("device")).alias('con'))
+---+--------------------+
| id| con|
+---+--------------------+
| 3| mac,mac pro|
| 1| iphone5,iphone|
| 1|android,android p...|
| 1| windows,windows pc|
| 1|spy camera,spy ca...|
| 2| camera,|
| 3| cctv,cctv|
| 2| apple iphone,iphone|
| 3| ,spy camera|
+---+--------------------+
Then I did a groupBy on id:
df2 = df1.groupBy("id").agg(f.concat_ws(",", f.collect_set(df1.con)).alias('Group_con'))
+---+-----------------------------------------------------------------------------+
| id| Group_con|
+---+-----------------------------------------------------------------------------+
| 1|iphone5,iphone,android,android phone,windows,windows pc,spy camera,spy camera|
| 2| camera,,apple iphone,iphone|
| 3| mac,mac pro,cctv,cctv,,spy camera|
+---+-----------------------------------------------------------------------------+
But I am getting duplicate values in the result. How can I avoid duplicate values in the final data frame?
Use F.array(), F.explode() and F.collect_set():
from pyspark.sql import functions as F
df.withColumn('con', F.explode(F.array('device', 'model'))) \
    .groupby('id').agg(F.collect_set('con').alias('Group_con')) \
    .show(3, 0)
# +---+--------------------------------------------------------------------------+
# |id |Group_con |
# +---+--------------------------------------------------------------------------+
# |3 |[cctv, mac pro, spy camera, mac] |
# |1 |[windows pc, iphone5, windows, iphone, android phone, spy camera, android]|
# |2 |[apple iphone, camera, iphone] |
# +---+--------------------------------------------------------------------------+
(tested on spark version 2.2.1)
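If a single comma-separated string (like the question's Group_con) is preferred over an array, a sketch that layers concat_ws on top of collect_set; the where() is an extra step, added here to drop the empty strings coming from blank device/model cells:
from pyspark.sql import functions as F
df.withColumn('con', F.explode(F.array('device', 'model'))) \
    .where(F.col('con') != '') \
    .groupby('id') \
    .agg(F.concat_ws(',', F.collect_set('con')).alias('Group_con')) \
    .show(truncate=False)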
You can remove the duplicates by using collect_set and a udf function as follows:
from pyspark.sql import functions as f
from pyspark.sql import types as t

def uniqueStringUdf(device, model):
    # concatenate the two collected lists, drop empty strings, de-duplicate, and re-join
    return ','.join(set(filter(bool, device + model)))

uniqueStringUdfCall = f.udf(uniqueStringUdf, t.StringType())

df.groupBy("id") \
    .agg(uniqueStringUdfCall(f.collect_set("device"), f.collect_set("model")).alias("con")) \
    .show(truncate=False)
which should give you
+---+------------------------------------------------------------------+
|id |con |
+---+------------------------------------------------------------------+
|3 |spy camera,mac,mac pro,cctv |
|1 |spy camera,windows,iphone5,windows pc,iphone,android phone,android|
|2 |camera,iphone,pple iphone |
+---+------------------------------------------------------------------+
where,
device + model is the concatenation of the two collected sets
filter(bool, device + model) removes empty strings from the concatenated list
set(filter(bool, device + model)) removes the duplicate strings in the concatenated list
','.join(set(filter(bool, device + model))) joins all the elements of the concatenated list into a comma-separated string.
I hope the answer is helpful
Not sure if this is going to be very helpful, but one solution I could think of is to check for duplicate values in the column and then delete them by using their position/index.
Or
Split all values at the comma into a list and remove the duplicates by comparing each value, or count() the occurrences of a value and, if it is more than 1, delete all the duplicates other than the first one (a rough sketch of this is below).
Sorry if this wasn't helpful. These are the two ways I could think of to solve your problem.
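A rough plain-Python sketch of that second idea, splitting at the comma and keeping only the first occurrence of each value (dict.fromkeys preserves insertion order):
def dedupe_csv(s):
    # split, drop empty strings, keep the first occurrence of each value, re-join
    return ",".join(dict.fromkeys(v for v in s.split(",") if v))

print(dedupe_csv("mac,mac pro,cctv,cctv,,spy camera"))  # mac,mac pro,cctv,spy camera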

How to concatenate the strings of one column of the dataframe and form another column that will have the incremental value of the original column

I have a DataFrame whose data I am pasting below:
+---------------+--------------+----------+------------+----------+
|name | DateTime| Seq|sessionCount|row_number|
+---------------+--------------+----------+------------+----------+
| abc| 1521572913344| 17| 5| 1|
| xyz| 1521572916109| 17| 5| 2|
| rafa| 1521572916118| 17| 5| 3|
| {}| 1521572916129| 17| 5| 4|
| experience| 1521572917816| 17| 5| 5|
+---------------+--------------+----------+------------+----------+
The column 'name' is of type string. I want a new column "effective_name" which will contain the incremental values of "name" like shown below:
+---------------+--------------+----------+------------+----------+-------------------------+
|name | DateTime |sessionSeq|sessionCount|row_number |effective_name|
+---------------+--------------+----------+------------+----------+-------------------------+
|abc |1521572913344 |17 |5 |1 |abc |
|xyz |1521572916109 |17 |5 |2 |abcxyz |
|rafa |1521572916118 |17 |5 |3 |abcxyzrafa |
|{} |1521572916129 |17 |5 |4 |abcxyzrafa{} |
|experience |1521572917816 |17 |5 |5 |abcxyzrafa{}experience |
+---------------+--------------+----------+------------+----------+-------------------------+
The new column contains the incremental concatenation of its previous values of the name column.
You can achieve this by using a pyspark.sql.Window that orders by DateTime, together with pyspark.sql.functions.concat_ws and pyspark.sql.functions.collect_list:
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.orderBy("DateTime") # define Window for ordering
df.drop("Seq", "sessionCount", "row_number").select(
"*",
f.concat_ws(
"",
f.collect_list(f.col("name")).over(w)
).alias("effective_name")
).show(truncate=False)
#+---------------+--------------+-------------------------+
#|name | DateTime|effective_name |
#+---------------+--------------+-------------------------+
#|abc |1521572913344 |abc |
#|xyz |1521572916109 |abcxyz |
#|rafa |1521572916118 |abcxyzrafa |
#|{} |1521572916129 |abcxyzrafa{} |
#|experience |1521572917816 |abcxyzrafa{}experience |
#+---------------+--------------+-------------------------+
I dropped "Seq", "sessionCount", "row_number" to make the output display friendlier.
If you needed to do this per group, you can add a partitionBy to the Window. Say in this case you want to group by sessionSeq, you can do the following:
w = Window.partitionBy("Seq").orderBy("DateTime")
df.drop("sessionCount", "row_number").select(
"*",
f.concat_ws(
"",
f.collect_list(f.col("name")).over(w)
).alias("effective_name")
).show(truncate=False)
#+---------------+--------------+----------+-------------------------+
#|name | DateTime|sessionSeq|effective_name |
#+---------------+--------------+----------+-------------------------+
#|abc |1521572913344 |17 |abc |
#|xyz |1521572916109 |17 |abcxyz |
#|rafa |1521572916118 |17 |abcxyzrafa |
#|{} |1521572916129 |17 |abcxyzrafa{} |
#|experience |1521572917816 |17 |abcxyzrafa{}experience |
#+---------------+--------------+----------+-------------------------+
If you prefer to use withColumn, the above is equivalent to:
df.drop("sessionCount", "row_number").withColumn(
"effective_name",
f.concat_ws(
"",
f.collect_list(f.col("name")).over(w)
)
).show(truncate=False)
Explanation
You want to apply a function over multiple rows, which is called an aggregation. With any aggregation, you need to define which rows to aggregate over and the order. We do this using a Window. In this case, w = Window.partitionBy("Seq").orderBy("DateTime") will partition the data by the Seq and sort by the DateTime.
We first apply the aggregate function collect_list("name") over the window. This gathers all of the values from the name column and puts them in a list. The order of insertion is defined by the Window's order.
For example, the intermediate output of this step would be:
df.select(
f.collect_list("name").over(w).alias("collected")
).show()
#+--------------------------------+
#|collected |
#+--------------------------------+
#|[abc] |
#|[abc, xyz] |
#|[abc, xyz, rafa] |
#|[abc, xyz, rafa, {}] |
#|[abc, xyz, rafa, {}, experience]|
#+--------------------------------+
Now that the appropriate values are in the list, we can concatenate them together with an empty string as the separator.
df.select(
f.concat_ws(
"",
f.collect_list("name").over(w)
).alias("concatenated")
).show()
#+-----------------------+
#|concatenated |
#+-----------------------+
#|abc |
#|abcxyz |
#|abcxyzrafa |
#|abcxyzrafa{} |
#|abcxyzrafa{}experience |
#+-----------------------+
Solution:
import pyspark.sql.functions as f
from pyspark.sql import Window

w = Window.partitionBy("Seq").orderBy("DateTime")
df.select(
    "*",
    f.concat_ws(
        "",
        f.collect_set(f.col("name")).over(w)
    ).alias("cumulative_name")
).show()
Explanation
collect_set() - This function returns a value like ["abc", "xyz", "rafa", "{}", "experience"].
concat_ws() - This function takes the output of collect_set() as input and, with the empty separator used above, joins it into abcxyzrafa{}experience.
Note:
Use collect_set() if you don't have duplicates; otherwise use collect_list().
