Compare two dataframes Pyspark - python

I'm trying to compare two data frames that have the same number of columns (4 columns each), with id as the key column in both data frames.
df1 = spark.read.csv("/path/to/data1.csv")
df2 = spark.read.csv("/path/to/data2.csv")
Now I want to append a new column to DF2, column_names, which holds the list of columns whose values differ from df1:
df2.withColumn("column_names",udf())
DF1:
+----+-------+------+---------+
| id | name  | sal  | Address |
+----+-------+------+---------+
| 1  | ABC   | 5000 | US      |
| 2  | DEF   | 4000 | UK      |
| 3  | GHI   | 3000 | JPN     |
| 4  | JKL   | 4500 | CHN     |
+----+-------+------+---------+
DF2:
+----+-------+------+---------+
| id | name  | sal  | Address |
+----+-------+------+---------+
| 1  | ABC   | 5000 | US      |
| 2  | DEF   | 4000 | CAN     |
| 3  | GHI   | 3500 | JPN     |
| 4  | JKL_M | 4800 | CHN     |
+----+-------+------+---------+
Now I want DF3
DF3:
+----+-------+------+---------+--------------+
| id | name  | sal  | Address | column_names |
+----+-------+------+---------+--------------+
| 1  | ABC   | 5000 | US      | []           |
| 2  | DEF   | 4000 | CAN     | [address]    |
| 3  | GHI   | 3500 | JPN     | [sal]        |
| 4  | JKL_M | 4800 | CHN     | [name,sal]   |
+----+-------+------+---------+--------------+
I saw this SO question, How to compare two dataframe and print columns that are different in scala, and tried it; however, the result is different.
I'm thinking of going with a UDF: pass a row from each dataframe to the UDF, compare column by column, and return the list of columns. However, for that both data frames would have to be sorted so that rows with the same id are sent to the UDF together. Sorting is a costly operation here. Any solution?

Assuming that we can use id to join these two datasets I don't think that there is a need for UDF. This could be solved just by using inner join, array and array_remove functions among others.
First let's create the two datasets:
df1 = spark.createDataFrame([
[1, "ABC", 5000, "US"],
[2, "DEF", 4000, "UK"],
[3, "GHI", 3000, "JPN"],
[4, "JKL", 4500, "CHN"]
], ["id", "name", "sal", "Address"])
df2 = spark.createDataFrame([
[1, "ABC", 5000, "US"],
[2, "DEF", 4000, "CAN"],
[3, "GHI", 3500, "JPN"],
[4, "JKL_M", 4800, "CHN"]
], ["id", "name", "sal", "Address"])
First we do an inner join between the two datasets, then we generate the condition df1[col] != df2[col] for each column except id. When the columns aren't equal we return the column name, otherwise an empty string. The list of conditions becomes the items of an array, from which we finally remove the empty items:
from pyspark.sql.functions import col, array, when, array_remove, lit
# get conditions for all columns except id
conditions_ = [when(df1[c] != df2[c], lit(c)).otherwise("") for c in df1.columns if c != 'id']
select_expr = [
    col("id"),
    *[df2[c] for c in df2.columns if c != 'id'],
    array_remove(array(*conditions_), "").alias("column_names")
]
df1.join(df2, "id").select(*select_expr).show()
# +---+-----+----+-------+------------+
# | id| name| sal|Address|column_names|
# +---+-----+----+-------+------------+
# | 1| ABC|5000| US| []|
# | 3| GHI|3500| JPN| [sal]|
# | 2| DEF|4000| CAN| [Address]|
# | 4|JKL_M|4800| CHN| [name, sal]|
# +---+-----+----+-------+------------+
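To see what the join plus when/array_remove combination computes without starting a Spark session, here is a minimal plain-Python sketch of the same logic. The function name changed_columns and the tuple layout are my own assumptions for illustration; they are not part of PySpark.

```python
# Plain-Python model of the join + when/array_remove logic (no Spark needed).
# The sample rows mirror df1 and df2 from the answer above.
COLUMNS = ["id", "name", "sal", "Address"]

rows1 = [
    (1, "ABC", 5000, "US"),
    (2, "DEF", 4000, "UK"),
    (3, "GHI", 3000, "JPN"),
    (4, "JKL", 4500, "CHN"),
]
rows2 = [
    (1, "ABC", 5000, "US"),
    (2, "DEF", 4000, "CAN"),
    (3, "GHI", 3500, "JPN"),
    (4, "JKL_M", 4800, "CHN"),
]


def changed_columns(left_rows, right_rows, columns, key="id"):
    """Return {key: [names of differing columns]} for keys present on both sides."""
    key_idx = columns.index(key)
    left_by_key = {row[key_idx]: row for row in left_rows}
    result = {}
    for row in right_rows:
        left = left_by_key.get(row[key_idx])
        if left is None:
            continue  # inner join: skip keys missing from the left side
        # One flag per non-key column: the column name if values differ, else "".
        flags = [c if left[i] != row[i] else ""
                 for i, c in enumerate(columns) if c != key]
        # Counterpart of array_remove(..., ""): drop the empty markers.
        result[row[key_idx]] = [f for f in flags if f != ""]
    return result


diff = changed_columns(rows1, rows2, COLUMNS)
print(diff)
```

The sketch keeps the two steps separate on purpose (build one flag per column, then drop the empty ones) so that it lines up with conditions_ and array_remove in the Spark version.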

Here is a solution with a UDF. I renamed the columns of the first dataframe with an "x_" prefix so that they are not ambiguous after the join. Go through the code below and let me know if you have any concerns.
>>> from pyspark.sql.functions import *
>>> df.show()
+---+----+----+-------+
| id|name| sal|Address|
+---+----+----+-------+
| 1| ABC|5000| US|
| 2| DEF|4000| UK|
| 3| GHI|3000| JPN|
| 4| JKL|4500| CHN|
+---+----+----+-------+
>>> df1.show()
+---+----+----+-------+
| id|name| sal|Address|
+---+----+----+-------+
| 1| ABC|5000| US|
| 2| DEF|4000| CAN|
| 3| GHI|3500| JPN|
| 4|JKLM|4800| CHN|
+---+----+----+-------+
>>> df2 = df.select([col(c).alias("x_"+c) for c in df.columns])
>>> df3 = df1.join(df2, col("id") == col("x_id"), "left")
# udf declaration
>>> def CheckMatch(Column, r):
...     check = ''
...     ColList = Column.split(",")
...     for cc in ColList:
...         if r[cc] != r["x_" + cc]:
...             check = check + "," + cc
...     return check.replace(',', '', 1).split(",")
>>> CheckMatchUDF = udf(CheckMatch)
# final columns that need to be selected
>>> finalCol = df1.columns
>>> finalCol.insert(len(finalCol), "column_names")
>>> (df3.withColumn("column_names", CheckMatchUDF(lit(','.join(df1.columns)), struct([df3[x] for x in df3.columns])))
...     .select(finalCol)
...     .show())
+---+----+----+-------+------------+
| id|name| sal|Address|column_names|
+---+----+----+-------+------------+
| 1| ABC|5000| US| []|
| 2| DEF|4000| CAN| [Address]|
| 3| GHI|3500| JPN| [sal]|
| 4|JKLM|4800| CHN| [name, sal]|
+---+----+----+-------+------------+

Python: PySpark version of my previous scala code.
import pyspark.sql.functions as f
df1 = spark.read.option("header", "true").csv("test1.csv")
df2 = spark.read.option("header", "true").csv("test2.csv")
columns = df1.columns
df3 = df1.alias("d1").join(df2.alias("d2"), f.col("d1.id") == f.col("d2.id"), "left")
for name in columns:
    df3 = df3.withColumn(name + "_temp", f.when(f.col("d1." + name) != f.col("d2." + name), f.lit(name)))
df3.withColumn("column_names", f.concat_ws(",", *map(lambda name: f.col(name + "_temp"), columns))).select("d1.*", "column_names").show()
Scala: Here is my best approach for your problem.
val df1 = spark.read.option("header", "true").csv("test1.csv")
val df2 = spark.read.option("header", "true").csv("test2.csv")
val columns = df1.columns
val df3 = df1.alias("d1").join(df2.alias("d2"), col("d1.id") === col("d2.id"), "left")
columns.foldLeft(df3) {(df, name) => df.withColumn(name + "_temp", when(col("d1." + name) =!= col("d2." + name), lit(name)))}
.withColumn("column_names", concat_ws(",", columns.map(name => col(name + "_temp")): _*))
.show(false)
First, I join the two dataframes into df3 on id, taking the column list from df1. Folding left over df3 then adds one temp column per column; it holds the column name when df1 and df2 have the same id but different values in that column, and null otherwise.
After that, concat_ws joins those temp columns. It skips the nulls, so only the names of the changed columns are left.
+---+----+----+-------+------------+
|id |name|sal |Address|column_names|
+---+----+----+-------+------------+
|1 |ABC |5000|US | |
|2 |DEF |4000|UK |Address |
|3 |GHI |3000|JPN |sal |
|4 |JKL |4500|CHN |name,sal |
+---+----+----+-------+------------+
The only difference from your expected result is that the output is a string, not a list.
p.s. I forgot to use PySpark at first; the Scala version is plain Spark, sorry.
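Since this answer ends with a comma-separated string rather than a list, a small plain-Python sketch may help show why the nulls disappear and how the string maps back to a list. The helper names concat_ws and to_list here are hypothetical stand-ins written for this illustration (the first only imitates the null-skipping behaviour of Spark's concat_ws); no Spark is involved.

```python
def concat_ws(sep, *values):
    """Imitate Spark's concat_ws: join the non-null values with sep."""
    return sep.join(v for v in values if v is not None)


def to_list(joined, sep=","):
    """Turn the joined string back into a list; an empty string means no changes."""
    return joined.split(sep) if joined else []


# Per-column temp values for id 4: None where the column matched,
# the column name where it differed.
temp_columns = [None, "name", "sal", None]

joined = concat_ws(",", *temp_columns)
names = to_list(joined)
print(repr(joined), names)
```

The empty-string check in to_list matters: splitting an empty string in Python gives a list with one empty element, not an empty list.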

You can get that query built for you in PySpark and Scala by the spark-extension package.
It provides the diff transformation that does exactly that.
from gresearch.spark.diff import *
options = DiffOptions().with_change_column('changes')
df1.diff_with_options(df2, options, 'id').show()
+----+-----------+---+---------+----------+--------+---------+------------+-------------+
|diff| changes| id|left_name|right_name|left_sal|right_sal|left_Address|right_Address|
+----+-----------+---+---------+----------+--------+---------+------------+-------------+
| N| []| 1| ABC| ABC| 5000| 5000| US| US|
| C| [Address]| 2| DEF| DEF| 4000| 4000| UK| CAN|
| C| [sal]| 3| GHI| GHI| 3000| 3500| JPN| JPN|
| C|[name, sal]| 4| JKL| JKL_M| 4500| 4800| CHN| CHN|
+----+-----------+---+---------+----------+--------+---------+------------+-------------+
While this is a simple example, diffing DataFrames can become complicated when wide schemas, insertions, deletions and null values are involved. That package is well-tested, so you don't have to worry about getting that query right yourself.

There is a wonderful package for pyspark that compares two dataframes. The name of the package is datacompy
https://capitalone.github.io/datacompy/
example code:
import datacompy as dc
comparison = dc.SparkCompare(spark, base_df=df1, compare_df=df2, join_columns=common_keys, match_rates=True)
comparison.report()
The above code will generate a summary report, and the one below it will give you the mismatches.
comparison.rows_both_mismatch.display()
There are also more features that you can explore.

Related

How do I filter the column in pyspark?

I am new to pyspark. I want to compare two tables. If the value in one of the columns does not match, I want to print that column's name in a new column. Using the answer from "Compare two dataframes Pyspark", I am able to get that result. Now I want to filter the new table based on the newly created column.
df1 = spark.createDataFrame([
[1, "ABC", 5000, "US"],
[2, "DEF", 4000, "UK"],
[3, "GHI", 3000, "JPN"],
[4, "JKL", 4500, "CHN"]
], ["id", "name", "sal", "Address"])
df2 = spark.createDataFrame([
[1, "ABC", 5000, "US"],
[2, "DEF", 4000, "CAN"],
[3, "GHI", 3500, "JPN"],
[4, "JKL_M", 4800, "CHN"]
], ["id", "name", "sal", "Address"])
from pyspark.sql.functions import *
#from pyspark.sql.functions import col, array, when, array_remove
# get conditions for all columns except id
conditions_ = [when(df1[c]!=df2[c], lit(c)).otherwise("") for c in df1.columns if c != 'id']
select_expr =[
col("id"),
*[df2[c] for c in df2.columns if c != 'id'],
array_remove(array(*conditions_), "").alias("column_names")
]
df3 = df1.join(df2, "id").select(*select_expr)
df3.show()
DF3:
+----+-------+------+---------+--------------+
| id | name  | sal  | Address | column_names |
+----+-------+------+---------+--------------+
| 1  | ABC   | 5000 | US      | []           |
| 2  | DEF   | 4000 | CAN     | [address]    |
| 3  | GHI   | 3500 | JPN     | [sal]        |
| 4  | JKL_M | 4800 | CHN     | [name,sal]   |
+----+-------+------+---------+--------------+
This is the step where I am getting an error message.
df3.filter(df3.column_names!="")
Error: cannot resolve '(column_names = '')' due to data type mismatch: differing types in '(column_names = '')' (array<string> and string).
I want the following result
DF3:
+----+-------+------+---------+--------------+
| id | name  | sal  | Address | column_names |
+----+-------+------+---------+--------------+
| 2  | DEF   | 4000 | CAN     | [address]    |
| 3  | GHI   | 3500 | JPN     | [sal]        |
| 4  | JKL_M | 4800 | CHN     | [name,sal]   |
+----+-------+------+---------+--------------+
You are getting the error because you are comparing an array-type column to a string. Convert the column_names array to a string first, and then the comparison will work:
df3 = df3.withColumn('column_names',concat_ws(";",col("column_names")))
You can create a udf to filter and pass the relevant column name to it. I hope the code below helps.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType
# simple filter function
@udf(returnType=BooleanType())
def my_filter(col1):
    return len(col1) > 0
df3.filter(my_filter(col('column_names'))).show()
Another way
#Do an outer join
new = df1.join(df2.alias('df2'), how='outer', on=['id','name','sal','Address'])
#Count distinct values in each column per id
new1 = new.groupBy('id').agg(*[countDistinct(x).alias(x) for x in new.drop('id').columns])
#Using case when, where there is more than one distinct value, append the column name to the new column
new2 = new1.select('id', array_except(array(*[when(col(c) != 1, lit(c)) for c in new1.drop('id').columns]), array(lit(None).cast('string'))).alias('column_names'))
#Join back to df2
df2.join(new2, how='right', on='id').show()
+---+-----+----+-------+------------+
| id| name| sal|Address|column_names|
+---+-----+----+-------+------------+
| 1| ABC|5000| US| []|
| 2| DEF|4000| CAN| [Address]|
| 3| GHI|3500| JPN| [sal]|
| 4|JKL_M|4800| CHN| [name, sal]|
+---+-----+----+-------+------------+
Use filter('array_column != array()'). See below example that filters out the empty arrays.
spark.sparkContext.parallelize([([],), (['blah', 'bleh'],)]).toDF(['arrcol']). \
show()
# +------------+
# | arrcol|
# +------------+
# | []|
# |[blah, bleh]|
# +------------+
spark.sparkContext.parallelize([([],), (['blah', 'bleh'],)]).toDF(['arrcol']). \
filter('arrcol != array()'). \
show()
# +------------+
# | arrcol|
# +------------+
# |[blah, bleh]|
# +------------+
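For readers who want to check the intended result of the filter independently of Spark, here is a minimal plain-Python sketch. The rows list is an assumed stand-in for df3 with only the two relevant fields. It shows the same idea as the answers above: test whether the list is empty instead of comparing it to a string.

```python
# Assumed stand-in for df3: one dict per row, column_names is a list.
rows = [
    {"id": 1, "column_names": []},
    {"id": 2, "column_names": ["Address"]},
    {"id": 3, "column_names": ["sal"]},
    {"id": 4, "column_names": ["name", "sal"]},
]

# In Python a list never equals "", so `!= ""` would keep every row;
# Spark refuses the same comparison with a data type mismatch error.
kept_by_string_compare = [r for r in rows if r["column_names"] != ""]

# Testing for emptiness is what actually selects the changed rows.
kept = [r for r in rows if len(r["column_names"]) > 0]
kept_ids = [r["id"] for r in kept]
print(kept_ids)
```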

Compare two different columns from two different pyspark dataframe

I'm trying to compare two different columns which are in two different data frames, and if I found a match I'm returning value 1 else None -
df1 =
df2 =
df1 (Expected_Output) =
I have tried the below code -
def getImpact(row):
    match = df2.filter(df2.second_key == row)
    if match.count() > 0:
        return 1
    return None
udf_sol = udf(lambda x: getImpact(x), IntegerType())
df1 = df1.withColumn('impact', udf_sol(df1.first_key))
But getting below error -
TypeError: cannot pickle '_thread.RLock' object
Can anyone help me to achieve the expected output as shown above?
Thanks
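The error occurs because the UDF references df2, and a DataFrame cannot be pickled and shipped to the executors, so it cannot be used inside a UDF. Assuming first_key and second_key are unique, you can opt for a join across the dataframes instead -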
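from pyspark import SparkContext
from pyspark.sql import SQLContext
from functools import reduce
import pyspark.sql.functions as F
from pyspark.sql import Window
# Assumed setup: the snippet uses `sql` below without defining it
sc = SparkContext.getOrCreate()
sql = SQLContext(sc)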
data_list1 = [
("abcd","Key1")
,("jkasd","Key2")
,("oigoa","Key3")
,("ad","Key4")
,("bas","Key5")
,("lkalsjf","Key6")
,("bsawva","Key7")
]
data_list2 = [
("cashj","Key1",10)
,("ax","Key11",12)
,("safa","Key5",21)
,("safasf","Key6",78)
,("vasv","Key3",4)
,("wgaga","Key8",0)
,("saasfas","Key7",10)
]
sparkDF1 = sql.createDataFrame(data_list1,['data','first_key'])
sparkDF2 = sql.createDataFrame(data_list2,['temp_data','second_key','frinks'])
>>> sparkDF1.show()
+-------+---------+
| data|first_key|
+-------+---------+
| abcd| Key1|
| jkasd| Key2|
| oigoa| Key3|
| ad| Key4|
| bas| Key5|
|lkalsjf| Key6|
| bsawva| Key7|
+-------+---------+
>>> sparkDF2.show()
+---------+----------+------+
|temp_data|second_key|frinks|
+---------+----------+------+
| cashj| Key1| 10|
| ax| Key11| 12|
| safa| Key5| 21|
| safasf| Key6| 78|
| vasv| Key3| 4|
| wgaga| Key8| 0|
| saasfas| Key7| 10|
+---------+----------+------+
#### Joining the dataframes on common columns
finalDF = sparkDF1.join(
sparkDF2
,(sparkDF1['first_key'] == sparkDF2['second_key'])
,'left'
).select(sparkDF1['*'],sparkDF2['frinks']).orderBy('frinks')
### Identifying impact if the frinks value is Null or Not
finalDF = finalDF.withColumn('impact',F.when(F.col('frinks').isNull(),0).otherwise(1))
>>> finalDF.show()
+-------+---------+------+------+
| data|first_key|frinks|impact|
+-------+---------+------+------+
| jkasd| Key2| null| 0|
| ad| Key4| null| 0|
| oigoa| Key3| 4| 1|
| abcd| Key1| 10| 1|
| bsawva| Key7| 10| 1|
| bas| Key5| 21| 1|
|lkalsjf| Key6| 78| 1|
+-------+---------+------+------+
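As a quick sanity check of what the left join is doing, here is a plain-Python sketch of the same lookup using the sample keys from this answer. It is only a model of the idea (match first_key against the set of second_key values); the variable names are assumptions, and it returns None for a miss, as the question asked, where the Spark code above writes 0.

```python
# Keys taken from data_list1 and data_list2 in the answer above.
first_keys = ["Key1", "Key2", "Key3", "Key4", "Key5", "Key6", "Key7"]
second_keys = {"Key1", "Key11", "Key5", "Key6", "Key3", "Key8", "Key7"}

# Left-join semantics: every first_key is kept; impact is 1 when a matching
# second_key exists and None otherwise.
impact = {key: (1 if key in second_keys else None) for key in first_keys}

unmatched = [key for key, value in impact.items() if value is None]
print(impact)
print(unmatched)
```

The important difference from the UDF in the question is that the lookup data is combined with the rows through a join, so no DataFrame has to be referenced from inside a function that runs on the executors.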
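# Note: this works on pandas DataFrames, not PySpark ones, and compares the two columns row by row rather than looking keys up
import numpy as np
df1['final'] = np.where(df1['first_key'] == df2['second_key'], '1', 'None')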

PySpark: Compare columns of one df with the rows of a second df

I would like to compare two PySpark dataframes.
I have Df1 with hundreds of columns (Col1, Col2, ..., Col800) and Df2 with hundreds of corresponding rows.
Df2 describes the limit values for each of the 800 columns in Df1. If any value is too low or too high, I would like to flag it in Final_Df, where I create a column Problem that indicates whether any of the columns is out of limits.
I thought about transposing Df2 with pivot, but it requires an aggregate function, so I am not sure if it is a relevant solution.
I also don't see how I could join the two Dfs for the comparison, since they don't share any common column.
Df1:
| X | Y | Col1 | Col2 | Col3 |
+-----------+-----------+------+------+------+
| Value_X_1 | Value_Y_1 | 5000 | 250 | 500 |
+-----------+-----------+------+------+------+
| Value_X_2 | Value_Y_2 | 1000 | 30 | 300 |
+-----------+-----------+------+------+------+
| Value_X_3 | Value_Y_3 | 0 | 100 | 100 |
+-----------+-----------+------+------+------+
Df2:
+------+------+-----+
| name | max | min |
+------+------+-----+
| Col1 | 2500 | 0 |
+------+------+-----+
| Col2 | 120 | 0 |
+------+------+-----+
| Col3 | 400 | 0 |
+------+------+-----+
Final_Df (after comparison):
+-----------+-----------+------+------+------+---------+
| X | Y | Col1 | Col2 | Col3 | Problem |
+-----------+-----------+------+------+------+---------+
| Value_X_1 | Value_Y_1 | 5000 | 250 | 500 | Yes |
+-----------+-----------+------+------+------+---------+
| Value_X_2 | Value_Y_2 | 1000 | 30 | 300 | No |
+-----------+-----------+------+------+------+---------+
| Value_X_3 | Value_Y_3 | 0 | 100 | 100 | No |
+-----------+-----------+------+------+------+---------+
If df2 is not a big dataframe, you can convert it to a dictionary and then use list comprehension and when function to check the status, for example:
from pyspark.sql import functions as F
>>> df1.show()
+---------+---------+----+----+----+
| X| Y|Col1|Col2|Col3|
+---------+---------+----+----+----+
|Value_X_1|Value_Y_1|5000| 250| 500|
|Value_X_2|Value_Y_2|1000| 30| 300|
|Value_X_3|Value_Y_3| 0| 100| 100|
+---------+---------+----+----+----+
>>> df2.show()
+----+----+---+
|name| max|min|
+----+----+---+
|Col1|2500| 0|
|Col2| 120| 0|
|Col3| 400| 0|
+----+----+---+
# concerned columns
cols = df1.columns[2:]
>>> cols
['Col1', 'Col2', 'Col3']
Note: I assumed data types already set to integer for the above cols in df1 and df2.min, df2.max.
Create a map from df2:
map1 = { r.name:[r.min, r.max] for r in df2.collect() }
>>> map1
{u'Col1': [0, 2500], u'Col2': [0, 120], u'Col3': [0, 400]}
Add a new field 'Problem' based on two when() functions, using a list comprehension to iterate through all concerned columns:
F.when(df1[c].between(min, max), 0).otherwise(1)
F.when(sum(...) > 0, 'Yes').otherwise('No')
We set a flag (0 or 1) with the first when() function for each concerned column, and then take the sum of these flags. If it's greater than 0 then Problem = 'Yes', otherwise 'No':
df_new = df1.withColumn('Problem', F.when(sum([ F.when(df1[c].between(map1[c][0], map1[c][1]), 0).otherwise(1) for c in cols ]) > 0, 'Yes').otherwise('No'))
>>> df_new.show()
+---------+---------+----+----+----+-------+
| X| Y|Col1|Col2|Col3|Problem|
+---------+---------+----+----+----+-------+
|Value_X_1|Value_Y_1|5000| 250| 500| Yes|
|Value_X_2|Value_Y_2|1000| 30| 300| No|
|Value_X_3|Value_Y_3| 0| 100| 100| No|
+---------+---------+----+----+----+-------+
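The flag-and-sum idea can be checked on the sample data with a short plain-Python sketch. It assumes the same inclusive range check that between() performs; the function name problem and the dict-based rows are hypothetical and only serve this illustration.

```python
# Limits as (min, max) per column, mirroring map1 above.
limits = {"Col1": (0, 2500), "Col2": (0, 120), "Col3": (0, 400)}

rows = [
    {"X": "Value_X_1", "Y": "Value_Y_1", "Col1": 5000, "Col2": 250, "Col3": 500},
    {"X": "Value_X_2", "Y": "Value_Y_2", "Col1": 1000, "Col2": 30, "Col3": 300},
    {"X": "Value_X_3", "Y": "Value_Y_3", "Col1": 0, "Col2": 100, "Col3": 100},
]


def problem(row, limits):
    """Return 'Yes' if any limited column is outside its inclusive range."""
    # One flag per column: 0 when the value is within [min, max], 1 otherwise.
    flags = [0 if low <= row[name] <= high else 1
             for name, (low, high) in limits.items()]
    return "Yes" if sum(flags) > 0 else "No"


results = [problem(row, limits) for row in rows]
print(results)
```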
Using a UDF and a dictionary I was able to solve it. Let me know if it's helpful.
# Create a map like, name -> max#min
df = df.withColumn('name_max_min', F.create_map('name', F.concat(F.col('max'), F.lit("#"), F.col('min'))))
# HANDLE THE null
# you can try this ,not sure about this , but python has math.inf which
# supplies both infinities
positiveInf = float("inf")
negativeInf = float("-inf")
df = df.fillna({ 'max':999999999, 'min':-999999999 })
### df is :
+----+----+---+-------------------+
|name| max|min| name_max_min|
+----+----+---+-------------------+
|Col1|2500| 0|Map(Col1 -> 2500#0)|
|Col2| 120| 0| Map(Col2 -> 120#0)|
|Col3| 400| 0| Map(Col3 -> 400#0)|
+----+----+---+-------------------+
# Create a dictionary out of it
v = df.select('name_max_min').rdd.flatMap(lambda x: x).collect()
keys = []
values = []
for p in v:
    for r, s in p.items():
        keys.append(str(r).strip())
        values.append(str(s).strip().split('#'))
max_dict = dict(zip(keys,values))
# max_dict = {'Col1': ['2500', '0'], 'Col2': ['120', '0'], 'Col3': ['400', '0']}
# Create a UDF which can help you to assess the conditions.
from pyspark.sql.types import StringType
def problem_udf(c1):
    # GENERAL WAY
    # if the column names are diff
    # p = all([int(max_dict.get(r)[1]) <= int(c1[r]) <= int(max_dict.get(r)[0]) for r in c1.__fields__])
    p = all([int(max_dict.get("Col" + str(r))[1]) <= int(c1["Col" + str(r)]) <= int(max_dict.get("Col" + str(r))[0]) for r in range(1, len(c1) + 1)])
    if p:
        return "No"
    else:
        return "Yes"
callnewColsUdf = F.udf(problem_udf, StringType())
col_names = ['Col'+str(i) for i in range(1,4)]
# GENERAL WAY
# col_names = df1.schema.names
df1 = df1.withColumn('Problem', callnewColsUdf(F.struct(col_names)))
## Results in :
+---------+---------+----+----+----+-------+
| X| Y|Col1|Col2|Col3|Problem|
+---------+---------+----+----+----+-------+
|Value_X_1|Value_Y_1|5000| 250| 500| Yes|
|Value_X_2|Value_Y_2|1000| 30| 300| No|
|Value_X_3|Value_Y_3|   0| 100| 100|     No|
+---------+---------+----+----+----+-------+

Find column names of interconnected row values - Spark

I have a Spark dataframe that adheres to the following structure:
+------+-----------+-----------+-----------+------+
|ID | Name1 | Name2 | Name3 | Y |
+------+-----------+-----------+-----------+------+
| 1 | A,1 | B,1 | C,4 | B |
| 2 | D,2 | E,2 | F,8 | D |
| 3 | G,5 | H,2 | I,3 | H |
+------+-----------+-----------+-----------+------+
For every row I want to find in which column the value of Y is denoted as the first element. So, ideally I want to retrieve a list like: [Name2,Name1,Name2].
I am not sure whether it would work to convert to an RDD first, use a map function, and then convert the result back to a DataFrame, or how to do it.
Any ideas are welcome.
You can probably try this piece of code :
df.show()
+---+-----+-----+-----+---+
| ID|Name1|Name2|Name3| Y|
+---+-----+-----+-----+---+
| 1| A,1| B,1| C,4| B|
| 2| D,2| E,2| F,8| D|
| 3| G,5| H,2| I,3| H|
+---+-----+-----+-----+---+
from pyspark.sql import functions as F
name_cols = ["Name1", "Name2", "Name3"]
cond = F
for col in name_cols:
    cond = cond.when(F.split(F.col(col), ',').getItem(0) == F.col("Y"), col)
df.withColumn("whichName", cond).show()
+---+-----+-----+-----+---+---------+
| ID|Name1|Name2|Name3| Y|whichName|
+---+-----+-----+-----+---+---------+
| 1| A,1| B,1| C,4| B| Name2|
| 2| D,2| E,2| F,8| D| Name1|
| 3| G,5| H,2| I,3| H| Name2|
+---+-----+-----+-----+---+---------+
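The chained when() calls behave like a first-match search over the name columns. A minimal plain-Python sketch of that behaviour, with the rows written as assumed dicts and which_name as a hypothetical helper, looks like this:

```python
rows = [
    {"ID": 1, "Name1": "A,1", "Name2": "B,1", "Name3": "C,4", "Y": "B"},
    {"ID": 2, "Name1": "D,2", "Name2": "E,2", "Name3": "F,8", "Y": "D"},
    {"ID": 3, "Name1": "G,5", "Name2": "H,2", "Name3": "I,3", "Y": "H"},
]
name_cols = ["Name1", "Name2", "Name3"]


def which_name(row, name_cols):
    """Return the first column whose value starts with Y before the comma."""
    for col_name in name_cols:
        if row[col_name].split(",")[0] == row["Y"]:
            return col_name
    return None  # no column matched, like a when() chain without otherwise()


result = [which_name(row, name_cols) for row in rows]
print(result)
```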

Pyspark - Calculate number of null values in each dataframe column

I have a dataframe with many columns. My aim is to produce a dataframe that lists each column name, along with the number of null values in that column.
Example:
+-------------+-------------+
| Column_Name | NULL_Values |
+-------------+-------------+
| Column_1 | 15 |
| Column_2 | 56 |
| Column_3 | 18 |
| ... | ... |
+-------------+-------------+
I have managed to get the number of null values for ONE column like so:
df.agg(F.count(F.when(F.isnull(c), c)).alias('NULL_Count'))
where c is a column in the dataframe. However, it does not show the name of the column. The output is:
+------------+
| NULL_Count |
+------------+
| 15 |
+------------+
Any ideas?
You can use a list comprehension to loop over all of your columns in the agg, and use alias to rename the output column:
import pyspark.sql.functions as F
df_agg = df.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns])
However, this will return the results in one row as shown below:
df_agg.show()
#+--------+--------+--------+
#|Column_1|Column_2|Column_3|
#+--------+--------+--------+
#| 15| 56| 18|
#+--------+--------+--------+
If you wanted the results in one column instead, you could union each column from df_agg using functools.reduce as follows:
from functools import reduce
df_agg_col = reduce(
    lambda a, b: a.union(b),
    (
        df_agg.select(F.lit(c).alias("Column_Name"), F.col(c).alias("NULL_Count"))
        for c in df_agg.columns
    )
)
df_agg_col.show()
#+-----------+----------+
#|Column_Name|NULL_Count|
#+-----------+----------+
#| Column_1| 15|
#| Column_2| 56|
#| Column_3| 18|
#+-----------+----------+
Or you can skip the intermediate step of creating df_agg and do:
df_agg_col = reduce(
    lambda a, b: a.union(b),
    (
        df.agg(
            F.count(F.when(F.isnull(c), c)).alias('NULL_Count')
        ).select(F.lit(c).alias("Column_Name"), "NULL_Count")
        for c in df.columns
    )
)
Scala alternative could be
case class Test(id:Int, weight:Option[Int], age:Int, gender: Option[String])
val df1 = Seq(Test(1, Some(100), 23, Some("Male")), Test(2, None, 25, None), Test(3, None, 33, Some("Female"))).toDF()
df1.show()
+---+------+---+------+
| id|weight|age|gender|
+---+------+---+------+
| 1| 100| 23| Male|
| 2| null| 25| null|
| 3| null| 33|Female|
+---+------+---+------+
val s = df1.columns.map(c => sum(col(c).isNull.cast("integer")).alias(c))
val df2 = df1.agg(s.head, s.tail:_*)
val t = df2.columns.map(c => df2.select(lit(c).alias("col_name"), col(c).alias("null_count")))
val df_agg_col = t.reduce((df1, df2) => df1.union(df2))
df_agg_col.show()
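For reference, the expected shape of the result (one row per column with its null count) can be worked out on the small sample from the Scala answer with plain Python. This is only a sketch of the counting logic under the assumption that None plays the role of null; it does not use Spark.

```python
# Same three rows as the Scala sample above, with None standing in for null.
rows = [
    {"id": 1, "weight": 100, "age": 23, "gender": "Male"},
    {"id": 2, "weight": None, "age": 25, "gender": None},
    {"id": 3, "weight": None, "age": 33, "gender": "Female"},
]
columns = ["id", "weight", "age", "gender"]

# One (column name, null count) pair per column, i.e. the "long" layout
# that the union of per-column selects produces.
null_counts = [
    (name, sum(1 for row in rows if row[name] is None))
    for name in columns
]
print(null_counts)
```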
