Pyspark dataframe explode string column - python

I am looking for an efficient way to explode the rows in the PySpark DataFrame df_input into columns. I don't understand the format '#{name...}' and don't know where to start in order to decode it. Thanks for your help!
df_input = sqlContext.createDataFrame(
[
(1, '#{name= Hans; age= 45}'),
(2, '#{name= Jeff; age= 15}'),
(3, '#{name= Elona; age= 23}')
],
('id', 'firstCol')
)
expected result:
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1| Hans| 45|
| 2| Jeff| 15|
| 3|Elona| 23|
+---+-----+---+

Convert the string into map type using str_to_map function, explode it then pivot the keys:
from pyspark.sql import functions as F
df = df_input.selectExpr(
"id",
"explode(str_to_map(regexp_replace(firstCol, '[#{}]', ''), ';', '='))"
).groupby("id").pivot("key").agg(F.first("value"))
df.show()
#+---+----+------+
#|id | age|name |
#+---+----+------+
#|1 | 45 | Hans |
#|2 | 15 | Jeff |
#|3 | 23 | Elona|
#+---+----+------+
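To see what the parsing step does to a single value, here is a minimal plain-Python sketch (no Spark required). The helper name parse_record is hypothetical and only mirrors the idea of the answer above: strip the '#{' and '}' wrapper, split on ';' into pairs, then split each pair on '='. It also trims whitespace around keys and values, which str_to_map does not do by itself, so in Spark you may want an extra trim step.

```python
def parse_record(raw):
    """Parse '#{key= value; key= value}' into a dict of trimmed strings."""
    # Drop the leading '#' and the surrounding braces.
    body = raw.strip().lstrip("#").strip("{}")
    # Split into 'key= value' items, then into (key, value) pairs.
    pairs = (item.split("=", 1) for item in body.split(";") if item.strip())
    return {key.strip(): value.strip() for key, value in pairs}


rows = [
    (1, "#{name= Hans; age= 45}"),
    (2, "#{name= Jeff; age= 15}"),
    (3, "#{name= Elona; age= 23}"),
]

# One dict per row, with the parsed keys promoted to columns.
parsed = [{"id": row_id, **parse_record(raw)} for row_id, raw in rows]
for record in parsed:
    print(record)
```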

from pyspark.sql.functions import regexp_extract
df_input.select(
df_input.id, #id
regexp_extract( #use regex
df_input.firstCol, #on firstCol
r'\s(.*);', #find a space character then capture a (group of text) until you find a ';'
1 # use capture group 1 as text
).alias("name"),
regexp_extract(
df_input.firstCol,
r'\s.*\s(.*)}', #find the last space then capture a (group of text) until you find a '}'
1 # use capture group 1 as text
).alias("age")
).show()
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1| Hans| 45|
| 2| Jeff| 15|
| 3|Elona| 23|
+---+-----+---+
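The two patterns can be checked outside Spark with Python's re module, since they only use constructs that behave the same in both regex flavors. This is a sketch for experimenting with the patterns, not the Spark call itself; the helper extract is hypothetical and returns an empty string when nothing matches, as regexp_extract does.

```python
import re

NAME_PATTERN = r"\s(.*);"      # first whitespace, then capture up to ';'
AGE_PATTERN = r"\s.*\s(.*)}"   # last whitespace, then capture up to '}'


def extract(pattern, text, group=1):
    """Return the requested capture group, or '' when there is no match."""
    match = re.search(pattern, text)
    return match.group(group) if match else ""


samples = [
    "#{name= Hans; age= 45}",
    "#{name= Jeff; age= 15}",
    "#{name= Elona; age= 23}",
]

extracted = [(extract(NAME_PATTERN, s), extract(AGE_PATTERN, s)) for s in samples]
print(extracted)
```

Both patterns depend on the exact spacing of the input, so a value containing an extra space or a second ';' would need a stricter pattern.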

Related

Get distinct count of values in single row in Pyspark DataFrame

I'm trying to split comma separated values in a string column to individual values and count each individual value.
The data I have is formatted as such:
+--------------------+
| tags|
+--------------------+
|cult, horror, got...|
| violence|
| romantic|
|inspiring, romant...|
|cruelty, murder, ...|
|romantic, queer, ...|
|gothic, cruelty, ...|
|mystery, suspense...|
| violence|
|revenge, neo noir...|
+--------------------+
And I want the result to look like
+--------------------+-----+
| tags|count|
+--------------------+-----+
|cult | 4|
|horror | 10|
|goth | 4|
|violence | 30|
...
The code I've tried that hasn't worked is below:
data.select('tags').groupby('tags').count().show(10)
I also used a countdistinct function which also failed to work.
I feel like I need to have a function that separates the values by comma and then lists them but not sure how to execute them.
You can use split() to split strings, then explode(). Finally, groupby and count:
import pyspark.sql.functions as F
df = spark.createDataFrame(data=[
["cult,horror"],
["cult,comedy"],
["romantic,comedy"],
["thriler,horror,comedy"],
], schema=["tags"])
df = df \
.withColumn("tags", F.split("tags", pattern=",")) \
.withColumn("tags", F.explode("tags"))
df = df.groupBy("tags").count()
df.show(truncate=False)
[Out]:
+--------+-----+
|tags |count|
+--------+-----+
|romantic|1 |
|thriler |1 |
|horror |2 |
|cult |2 |
|comedy |3 |
+--------+-----+
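Note that the tags in the question are separated by a comma followed by a space, so a plain split on "," would leave a leading space on most tags and count " horror" separately from "horror". The plain-Python sketch below (hypothetical sample rows, no Spark) shows the same split, explode and count logic with trimming added. In PySpark, trimming each exploded value or splitting on a pattern that also consumes the whitespace should have the same effect.

```python
from collections import Counter

# Hypothetical sample rows mixing ", " and "," separators.
rows = [
    "cult, horror",
    "cult,comedy",
    "romantic, comedy",
    "thriller,horror,comedy",
]

# Split each row, trim every tag, skip empty pieces, then count.
tag_counts = Counter(
    tag.strip()
    for row in rows
    for tag in row.split(",")
    if tag.strip()
)

for tag, count in tag_counts.most_common():
    print(tag, count)
```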

Compare two dataframes Pyspark

I'm trying to compare two data frames which have the same number of columns, i.e. 4 columns, with id as the key column in both data frames
df1 = spark.read.csv("/path/to/data1.csv")
df2 = spark.read.csv("/path/to/data2.csv")
Now I want to append a new column to df2, column_names, which lists the columns whose values differ from df1
df2.withColumn("column_names",udf())
DF1
+----+------+------+---------+
| id | name | sal  | Address |
+----+------+------+---------+
|  1 | ABC  | 5000 | US      |
|  2 | DEF  | 4000 | UK      |
|  3 | GHI  | 3000 | JPN     |
|  4 | JKL  | 4500 | CHN     |
+----+------+------+---------+
DF2:
+----+-------+------+---------+
| id | name  | sal  | Address |
+----+-------+------+---------+
|  1 | ABC   | 5000 | US      |
|  2 | DEF   | 4000 | CAN     |
|  3 | GHI   | 3500 | JPN     |
|  4 | JKL_M | 4800 | CHN     |
+----+-------+------+---------+
Now I want DF3
DF3:
+----+-------+------+---------+--------------+
| id | name  | sal  | Address | column_names |
+----+-------+------+---------+--------------+
|  1 | ABC   | 5000 | US      | []           |
|  2 | DEF   | 4000 | CAN     | [address]    |
|  3 | GHI   | 3500 | JPN     | [sal]        |
|  4 | JKL_M | 4800 | CHN     | [name,sal]   |
+----+-------+------+---------+--------------+
I saw this SO question, How to compare two dataframe and print columns that are different in scala. Tried that, however the result is different.
I'm thinking of going with a UDF function by passing row from each dataframe to udf and compare column by column and return column list. However for that both the data frames should be in sorted order so that same id rows will be sent to udf. Sorting is costly operation here. Any solution?
Assuming that we can use id to join these two datasets I don't think that there is a need for UDF. This could be solved just by using inner join, array and array_remove functions among others.
First let's create the two datasets:
df1 = spark.createDataFrame([
[1, "ABC", 5000, "US"],
[2, "DEF", 4000, "UK"],
[3, "GHI", 3000, "JPN"],
[4, "JKL", 4500, "CHN"]
], ["id", "name", "sal", "Address"])
df2 = spark.createDataFrame([
[1, "ABC", 5000, "US"],
[2, "DEF", 4000, "CAN"],
[3, "GHI", 3500, "JPN"],
[4, "JKL_M", 4800, "CHN"]
], ["id", "name", "sal", "Address"])
First we do an inner join between the two datasets, then we generate the condition df1[col] != df2[col] for each column except id. When the columns aren't equal we return the column name, otherwise an empty string. The conditions become the items of an array, from which we finally remove the empty items:
from pyspark.sql.functions import col, array, when, array_remove, lit
# get conditions for all columns except id
conditions_ = [when(df1[c]!=df2[c], lit(c)).otherwise("") for c in df1.columns if c != 'id']
select_expr =[
col("id"),
*[df2[c] for c in df2.columns if c != 'id'],
array_remove(array(*conditions_), "").alias("column_names")
]
df1.join(df2, "id").select(*select_expr).show()
# +---+-----+----+-------+------------+
# | id| name| sal|Address|column_names|
# +---+-----+----+-------+------------+
# | 1| ABC|5000| US| []|
# | 3| GHI|3500| JPN| [sal]|
# | 2| DEF|4000| CAN| [Address]|
# | 4|JKL_M|4800| CHN| [name, sal]|
# +---+-----+----+-------+------------+
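The per-row logic of the join above can be sketched in plain Python, which may help when checking the expected column_names by hand. The helper changed_columns is hypothetical; it plays the role of the when(...).otherwise("") conditions followed by array_remove. One caveat worth checking in Spark: if the columns can contain nulls, df1[c] != df2[c] evaluates to null rather than true, so such differences would not be flagged and a null-safe comparison may be needed.

```python
def changed_columns(left, right, key="id"):
    """Return the non-key column names whose values differ between two rows."""
    return [c for c in left if c != key and left[c] != right.get(c)]


df1_rows = [
    {"id": 1, "name": "ABC", "sal": 5000, "Address": "US"},
    {"id": 2, "name": "DEF", "sal": 4000, "Address": "UK"},
    {"id": 3, "name": "GHI", "sal": 3000, "Address": "JPN"},
    {"id": 4, "name": "JKL", "sal": 4500, "Address": "CHN"},
]
df2_rows = [
    {"id": 1, "name": "ABC", "sal": 5000, "Address": "US"},
    {"id": 2, "name": "DEF", "sal": 4000, "Address": "CAN"},
    {"id": 3, "name": "GHI", "sal": 3500, "Address": "JPN"},
    {"id": 4, "name": "JKL_M", "sal": 4800, "Address": "CHN"},
]

# Index the right side by id so the lookup behaves like an inner join on id.
right_by_id = {row["id"]: row for row in df2_rows}
diff = {
    row["id"]: changed_columns(row, right_by_id[row["id"]])
    for row in df1_rows
    if row["id"] in right_by_id
}
print(diff)
```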
Here is a solution with a UDF. I have renamed the columns of the first dataframe with an "x_" prefix so that they are not ambiguous during the check. Go through the code below and let me know if you have any concerns.
>>> from pyspark.sql.functions import *
>>> df.show()
+---+----+----+-------+
| id|name| sal|Address|
+---+----+----+-------+
| 1| ABC|5000| US|
| 2| DEF|4000| UK|
| 3| GHI|3000| JPN|
| 4| JKL|4500| CHN|
+---+----+----+-------+
>>> df1.show()
+---+----+----+-------+
| id|name| sal|Address|
+---+----+----+-------+
| 1| ABC|5000| US|
| 2| DEF|4000| CAN|
| 3| GHI|3500| JPN|
| 4|JKLM|4800| CHN|
+---+----+----+-------+
>>> df2 = df.select([col(c).alias("x_"+c) for c in df.columns])
>>> df3 = df1.join(df2, col("id") == col("x_id"), "left")
# udf declaration
>>> def CheckMatch(Column,r):
... check=''
... ColList=Column.split(",")
... for cc in ColList:
... if(r[cc] != r["x_" + cc]):
... check=check + "," + cc
... return check.replace(',','',1).split(",")
>>> CheckMatchUDF = udf(CheckMatch)
# final columns to select
>>> finalCol = df1.columns
>>> finalCol.insert(len(finalCol), "column_names")
>>> df3.withColumn("column_names", CheckMatchUDF(lit(','.join(df1.columns)),struct([df3[x] for x in df3.columns]))).select(finalCol).show()
+---+----+----+-------+------------+
| id|name| sal|Address|column_names|
+---+----+----+-------+------------+
| 1| ABC|5000| US| []|
| 2| DEF|4000| CAN| [Address]|
| 3| GHI|3500| JPN| [sal]|
| 4|JKLM|4800| CHN| [name, sal]|
+---+----+----+-------+------------+
Python: PySpark version of my previous scala code.
import pyspark.sql.functions as f
df1 = spark.read.option("header", "true").csv("test1.csv")
df2 = spark.read.option("header", "true").csv("test2.csv")
columns = df1.columns
df3 = df1.alias("d1").join(df2.alias("d2"), f.col("d1.id") == f.col("d2.id"), "left")
for name in columns:
df3 = df3.withColumn(name + "_temp", f.when(f.col("d1." + name) != f.col("d2." + name), f.lit(name)))
df3.withColumn("column_names", f.concat_ws(",", *map(lambda name: f.col(name + "_temp"), columns))).select("d1.*", "column_names").show()
Scala: Here is my best approach for your problem.
val df1 = spark.read.option("header", "true").csv("test1.csv")
val df2 = spark.read.option("header", "true").csv("test2.csv")
val columns = df1.columns
val df3 = df1.alias("d1").join(df2.alias("d2"), col("d1.id") === col("d2.id"), "left")
columns.foldLeft(df3) {(df, name) => df.withColumn(name + "_temp", when(col("d1." + name) =!= col("d2." + name), lit(name)))}
.withColumn("column_names", concat_ws(",", columns.map(name => col(name + "_temp")): _*))
.show(false)
First, I join the two dataframes into df3, using the columns from df1. Folding left over df3 adds one temp column per column name; each temp column holds the column name when df1 and df2 have the same id but a different value in that column, and null otherwise.
After that, concat_ws joins those temp columns. It skips nulls, so only the names of the differing columns are left.
+---+----+----+-------+------------+
|id |name|sal |Address|column_names|
+---+----+----+-------+------------+
|1 |ABC |5000|US | |
|2 |DEF |4000|UK |Address |
|3 |GHI |3000|JPN |sal |
|4 |JKL |4500|CHN |name,sal |
+---+----+----+-------+------------+
The only thing different from your expected result is that the output is not a list but string.
p.s. I forgot to use PySpark but this is the normal spark, sorry.
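The null-skipping behavior that this answer relies on can be illustrated with a small plain-Python stand-in. The function concat_ws_like is a hypothetical helper, not the Spark function; it only mimics the "join the non-null values" step. The second helper shows one way to turn the joined string back into a list, since the question asked for a list rather than a string.

```python
def concat_ws_like(sep, *values):
    """Join the values with sep, skipping None (a stand-in for null)."""
    return sep.join(v for v in values if v is not None)


def to_list(joined, sep=","):
    """Split the joined string back into a list; '' becomes an empty list."""
    return joined.split(sep) if joined else []


# Temp columns for one row, in the order name, sal, Address.
row_flags = (None, "sal", None)
joined = concat_ws_like(",", *row_flags)
print(joined, to_list(joined))
```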
You can get that query built for you in PySpark and Scala by the spark-extension package.
It provides the diff transformation that does exactly that.
from gresearch.spark.diff import *
options = DiffOptions().with_change_column('changes')
df1.diff_with_options(df2, options, 'id').show()
+----+-----------+---+---------+----------+--------+---------+------------+-------------+
|diff| changes| id|left_name|right_name|left_sal|right_sal|left_Address|right_Address|
+----+-----------+---+---------+----------+--------+---------+------------+-------------+
| N| []| 1| ABC| ABC| 5000| 5000| US| US|
| C| [Address]| 2| DEF| DEF| 4000| 4000| UK| CAN|
| C| [sal]| 3| GHI| GHI| 3000| 3500| JPN| JPN|
| C|[name, sal]| 4| JKL| JKL_M| 4500| 4800| CHN| CHN|
+----+-----------+---+---------+----------+--------+---------+------------+-------------+
While this is a simple example, diffing DataFrames can become complicated when wide schemas, insertions, deletions and null values are involved. That package is well-tested, so you don't have to worry about getting that query right yourself.
There is a wonderful package for pyspark that compares two dataframes. The name of the package is datacompy
https://capitalone.github.io/datacompy/
example code:
import datacompy as dc
comparison = dc.SparkCompare(spark, base_df=df1, compare_df=df2, join_columns=common_keys, match_rates=True)
comparison.report()
The above code will generate a summary report, and the one below it will give you the mismatches.
comparison.rows_both_mismatch.display()
There are also more features that you can explore.

How to split the column with same delimiter

My dataframe is this and I want to split my data frame by colon (:)
+------------------+
|Name:Roll_no:Class|
+------------------+
| #ab:cd#:23:C|
| #sd:ps#:34:A|
| #ra:kh#:14:H|
| #ku:pa#:36:S|
| #ra:sh#:50:P|
+------------------+
and I want my dataframe like:
+-----+-------+-----+
| Name|Roll_no|Class|
+-----+-------+-----+
|ab:cd| 23| C|
|sd:ps| 34| A|
|ra:kh| 14| H|
|ku:pa| 36| S|
|ra:sh| 50| P|
+-----+-------+-----+
If you need to split on the last 2 colons, use Series.str.rsplit, then set the column names by splitting the original column name, and finally remove the first and last # by indexing:
col = 'Name:Roll_no:Class'
df1 = df[col].str.rsplit(':', n=2, expand=True)
df1.columns = col.split(':')
df1['Name'] = df1['Name'].str[1:-1]
#if only first and last value
#df1['Name'] = df1['Name'].str.strip('#')
print (df1)
Name Roll_no Class
0 ab:cd 23 C
1 sd:ps 34 A
2 ra:kh 14 H
3 ku:pa 36 S
4 ra:sh 50 P
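The same right-to-left split can be tried on plain strings, which makes it easy to see why the colon inside the name survives. This sketch uses only str.rsplit and str.strip from the standard library; the variable names are illustrative.

```python
header = "Name:Roll_no:Class"
lines = [
    "#ab:cd#:23:C",
    "#sd:ps#:34:A",
    "#ra:kh#:14:H",
    "#ku:pa#:36:S",
    "#ra:sh#:50:P",
]

columns = header.split(":")

records = []
for line in lines:
    # Split at most twice, starting from the right, so the colon
    # inside the name stays untouched.
    name, roll_no, class_ = line.rsplit(":", 2)
    records.append(dict(zip(columns, (name.strip("#"), roll_no, class_))))

for record in records:
    print(record)
```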
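Use read_csv() with sep=':' and quotechar='#':
import io
import pandas as pd
data = """Name:Roll_no:Class
#ab:cd#:23:C
#sd:ps#:34:A
#ra:kh#:14:H
#ku:pa#:36:S
#ra:sh#:50:P"""
# '#' acts as the quote character, so colons inside #...# are not treated as separators
df = pd.read_csv(io.StringIO(data), sep=':', quotechar='#')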
>>> df
Name Roll_no Class
#0 ab:cd 23 C
#1 sd:ps 34 A
#2 ra:kh 14 H
#3 ku:pa 36 S
#4 ra:sh 50 P
This is how you could do this in pyspark:
Specify the separator and the quote on read
If you're reading the data from a file, you can use spark.read.csv with the following arguments:
df = spark.read.csv("path/to/file", sep=":", quote="#", header=True)
df.show()
#+-----+-------+-----+
#| Name|Roll_no|Class|
#+-----+-------+-----+
#|ab:cd| 23| C|
#|sd:ps| 34| A|
#|ra:kh| 14| H|
#|ku:pa| 36| S|
#|ra:sh| 50| P|
#+-----+-------+-----+
Use Regular Expressions
If you're unable to change the way the data is read and you're starting with the DataFrame shown in the question, you can use regular expressions to get the desired output.
First get the new column names by splitting the existing column name on ":"
new_columns = df.columns[0].split(":")
print(new_columns)
#['Name', 'Roll_no', 'Class']
For the Name column you need to extract the data between the #s. For the other two columns, you need to remove the strings between the #s (and the following ":") and use pyspark.sql.functions.split to extract the components
from pyspark.sql.functions import regexp_extract, regexp_replace, split
df.withColumn(new_columns[0], regexp_extract(df.columns[0], r"(?<=#).+(?=#)", 0))\
.withColumn(new_columns[1], split(regexp_replace(df.columns[0], "#.+#:", ""), ":")[0])\
.withColumn(new_columns[2], split(regexp_replace(df.columns[0], "#.+#:", ""), ":")[1])\
.select(*new_columns)\
.show()
#+-----+-------+-----+
#| Name|Roll_no|Class|
#+-----+-------+-----+
#|ab:cd| 23| C|
#|sd:ps| 34| A|
#|ra:kh| 14| H|
#|ku:pa| 36| S|
#|ra:sh| 50| P|
#+-----+-------+-----+
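The two regular expressions can be tested on a single string with Python's re module before running them in Spark. This is a sketch for checking the patterns, assuming every value contains exactly one #...# section; split_row is a hypothetical helper.

```python
import re


def split_row(raw):
    """Split '#name#:roll:class' into (name, roll_no, class)."""
    # Everything between the first and the last '#'.
    name = re.search(r"(?<=#).+(?=#)", raw).group(0)
    # Remove the quoted name and the colon after it, then split the rest.
    rest = re.sub(r"#.+#:", "", raw)
    roll_no, class_ = rest.split(":")
    return name, roll_no, class_


values = ["#ab:cd#:23:C", "#sd:ps#:34:A", "#ra:kh#:14:H"]
split_rows = [split_row(v) for v in values]
print(split_rows)
```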

Pyspark - Calculate number of null values in each dataframe column

I have a dataframe with many columns. My aim is to produce a dataframe that lists each column name, along with the number of null values in that column.
Example:
+-------------+-------------+
| Column_Name | NULL_Values |
+-------------+-------------+
| Column_1 | 15 |
| Column_2 | 56 |
| Column_3 | 18 |
| ... | ... |
+-------------+-------------+
I have managed to get the number of null values for ONE column like so:
df.agg(F.count(F.when(F.isnull(c), c)).alias('NULL_Count'))
where c is a column in the dataframe. However, it does not show the name of the column. The output is:
+------------+
| NULL_Count |
+------------+
| 15 |
+------------+
Any ideas?
You can use a list comprehension to loop over all of your columns in the agg, and use alias to rename the output column:
import pyspark.sql.functions as F
df_agg = df.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns])
However, this will return the results in one row as shown below:
df_agg.show()
#+--------+--------+--------+
#|Column_1|Column_2|Column_3|
#+--------+--------+--------+
#| 15| 56| 18|
#+--------+--------+--------+
If you wanted the results in one column instead, you could union each column from df_agg using functools.reduce as follows:
from functools import reduce
df_agg_col = reduce(
lambda a, b: a.union(b),
(
df_agg.select(F.lit(c).alias("Column_Name"), F.col(c).alias("NULL_Count"))
for c in df_agg.columns
)
)
df_agg_col.show()
#+-----------+----------+
#|Column_Name|NULL_Count|
#+-----------+----------+
#| Column_1| 15|
#| Column_2| 56|
#| Column_3| 18|
#+-----------+----------+
Or you can skip the intermediate step of creating df_agg and do:
df_agg_col = reduce(
lambda a, b: a.union(b),
(
df.agg(
F.count(F.when(F.isnull(c), c)).alias('NULL_Count')
).select(F.lit(c).alias("Column_Name"), "NULL_Count")
for c in df.columns
)
)
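For a small sample it can help to verify the expected counts without Spark. The sketch below uses hypothetical rows, with None standing in for null, and produces the same (Column_Name, NULL_Count) shape as the union above.

```python
# Hypothetical sample rows; None plays the role of a null value.
rows = [
    {"id": 1, "weight": 100, "age": 23, "gender": "Male"},
    {"id": 2, "weight": None, "age": 25, "gender": None},
    {"id": 3, "weight": None, "age": 33, "gender": "Female"},
]

columns = list(rows[0])

# One (column name, null count) pair per column, in column order.
null_counts = [
    (column, sum(1 for row in rows if row[column] is None))
    for column in columns
]

for column, count in null_counts:
    print(column, count)
```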
Scala alternative could be
case class Test(id:Int, weight:Option[Int], age:Int, gender: Option[String])
val df1 = Seq(Test(1, Some(100), 23, Some("Male")), Test(2, None, 25, None), Test(3, None, 33, Some("Female"))).toDF()
df1.show()
+---+------+---+------+
| id|weight|age|gender|
+---+------+---+------+
| 1| 100| 23| Male|
| 2| null| 25| null|
| 3| null| 33|Female|
+---+------+---+------+
val s = df1.columns.map(c => sum(col(c).isNull.cast("integer")).alias(c))
val df2 = df1.agg(s.head, s.tail:_*)
val t = df2.columns.map(c => df2.select(lit(c).alias("col_name"), col(c).alias("null_count")))
val df_agg_col = t.reduce((df1, df2) => df1.union(df2))
df_agg_col.show()

How to add strings of one columns of the dataframe and form another column that will have the incremental value of the original column

I have a DataFrame whose data I am pasting below:
+---------------+--------------+----------+------------+----------+
|name | DateTime| Seq|sessionCount|row_number|
+---------------+--------------+----------+------------+----------+
| abc| 1521572913344| 17| 5| 1|
| xyz| 1521572916109| 17| 5| 2|
| rafa| 1521572916118| 17| 5| 3|
| {}| 1521572916129| 17| 5| 4|
| experience| 1521572917816| 17| 5| 5|
+---------------+--------------+----------+------------+----------+
The column 'name' is of type string. I want a new column "effective_name" which will contain the incremental values of "name" like shown below:
+---------------+--------------+----------+------------+----------+-------------------------+
|name | DateTime |sessionSeq|sessionCount|row_number |effective_name|
+---------------+--------------+----------+------------+----------+-------------------------+
|abc |1521572913344 |17 |5 |1 |abc |
|xyz |1521572916109 |17 |5 |2 |abcxyz |
|rafa |1521572916118 |17 |5 |3 |abcxyzrafa |
|{} |1521572916129 |17 |5 |4 |abcxyzrafa{} |
|experience |1521572917816 |17 |5 |5 |abcxyzrafa{}experience |
+---------------+--------------+----------+------------+----------+-------------------------+
The new column contains the incremental concatenation of its previous values of the name column.
You can achieve this by using a pyspark.sql.Window, which orders by DateTime, together with pyspark.sql.functions.concat_ws and pyspark.sql.functions.collect_list:
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.orderBy("DateTime") # define Window for ordering
df.drop("Seq", "sessionCount", "row_number").select(
"*",
f.concat_ws(
"",
f.collect_list(f.col("name")).over(w)
).alias("effective_name")
).show(truncate=False)
#+---------------+--------------+-------------------------+
#|name | DateTime|effective_name |
#+---------------+--------------+-------------------------+
#|abc |1521572913344 |abc |
#|xyz |1521572916109 |abcxyz |
#|rafa |1521572916118 |abcxyzrafa |
#|{} |1521572916129 |abcxyzrafa{} |
#|experience |1521572917816 |abcxyzrafa{}experience |
#+---------------+--------------+-------------------------+
I dropped "Seq", "sessionCount", "row_number" to make the output display friendlier.
If you need to do this per group, you can add a partitionBy to the Window. Say you want to group by the Seq column (labeled sessionSeq in the expected output); you can do the following:
w = Window.partitionBy("Seq").orderBy("DateTime")
df.drop("sessionCount", "row_number").select(
"*",
f.concat_ws(
"",
f.collect_list(f.col("name")).over(w)
).alias("effective_name")
).show(truncate=False)
#+---------------+--------------+----------+-------------------------+
#|name | DateTime|sessionSeq|effective_name |
#+---------------+--------------+----------+-------------------------+
#|abc |1521572913344 |17 |abc |
#|xyz |1521572916109 |17 |abcxyz |
#|rafa |1521572916118 |17 |abcxyzrafa |
#|{} |1521572916129 |17 |abcxyzrafa{} |
#|experience |1521572917816 |17 |abcxyzrafa{}experience |
#+---------------+--------------+----------+-------------------------+
If you prefer to use withColumn, the above is equivalent to:
df.drop("sessionCount", "row_number").withColumn(
"effective_name",
f.concat_ws(
"",
f.collect_list(f.col("name")).over(w)
)
).show(truncate=False)
Explanation
You want to apply a function over multiple rows, which is called an aggregation. With any aggregation, you need to define which rows to aggregate over and the order. We do this using a Window. In this case, w = Window.partitionBy("Seq").orderBy("DateTime") will partition the data by the Seq and sort by the DateTime.
We first apply the aggregate function collect_list("name") over the window. This gathers all of the values from the name column and puts them in a list. The order of insertion is defined by the Window's order.
For example, the intermediate output of this step would be:
df.select(
f.collect_list("name").over(w).alias("collected")
).show()
#+--------------------------------+
#|collected |
#+--------------------------------+
#|[abc] |
#|[abc, xyz] |
#|[abc, xyz, rafa] |
#|[abc, xyz, rafa, {}] |
#|[abc, xyz, rafa, {}, experience]|
#+--------------------------------+
Now that the appropriate values are in the list, we can concatenate them together with an empty string as the separator.
df.select(
f.concat_ws(
"",
f.collect_list("name").over(w)
).alias("concatenated")
).show()
#+-----------------------+
#|concatenated |
#+-----------------------+
#|abc |
#|abcxyz |
#|abcxyzrafa |
#|abcxyzrafa{} |
#|abcxyzrafa{}experience |
#+-----------------------+
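The running concatenation is a prefix accumulation over the ordered names, which can be sketched in plain Python with itertools.accumulate. The events list below is illustrative and deliberately out of order, to show that the sort by DateTime is what fixes the result.

```python
from itertools import accumulate

# (name, DateTime) pairs, deliberately out of order.
events = [
    ("xyz", 1521572916109),
    ("abc", 1521572913344),
    ("{}", 1521572916129),
    ("rafa", 1521572916118),
    ("experience", 1521572917816),
]

# Order by DateTime, as the Window's orderBy does.
ordered = sorted(events, key=lambda event: event[1])

# accumulate() adds each name to the running total, i.e. concatenates strings.
effective_names = list(accumulate(name for name, _ in ordered))

for (name, _), effective in zip(ordered, effective_names):
    print(name, effective)
```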
Solution:
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.partitionBy("Seq").orderBy("DateTime")
df.select(
"*",
f.concat_ws(
"",
f.collect_set(f.col("name")).over(w)
).alias("cummuliative_name")
).show()
Explanation
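collect_set() - Over the window, this function returns the distinct name values seen so far, e.g. ["abc", "xyz", "rafa", "{}", "experience"] for the last row.
concat_ws() - This function takes the output of collect_set() as input and joins it with the given separator; with an empty separator the result for the last row is abcxyzrafa{}experience.
Note:
collect_set() drops duplicate values and does not guarantee element order, so use it only if name has no duplicates and the order does not matter; otherwise use collect_list().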
