How to get Weighted Average for a column in pyspark - python

Here I need to compute an exponential moving average (EMA) in a Spark dataframe:
Table :
ab = spark.createDataFrame(
[(1,"1/1/2020", 41.0,0.5, 0.5 ,1, '10.22'),
(1,"10/3/2020",24.0,0.3, 0.7 ,2, '' ),
(1,"21/5/2020",32.0,0.4, 0.6 ,3, '' ),
(2,"3/1/2020", 51.0,0.22, 0.78,1, '34.78'),
(2,"10/5/2020",14.56,0.333,0.66,2, '' ),
(2,"30/9/2020",17.0,0.66, 0.34,3, '' )],["CID","date","A","B","C","Row","SMA"] )
ab.show()
+---+---------+-----+-----+----+---+-----+
|CID| date| A| B| C| Row| SMA|
+---+---------+-----+-----+----+---+-----+
| 1| 1/1/2020| 41.0| 0.5| 0.5| 1|10.22|
| 1|10/3/2020| 24.0| 0.3| 0.7| 2| |
| 1|21/5/2020| 32.0| 0.4| 0.6| 3| |
| 2| 3/1/2020| 51.0| 0.22|0.78| 1|34.78|
| 2|10/5/2020|14.56|0.333|0.66| 2| |
| 2|30/9/2020| 17.0| 0.66|0.34| 3| |
+---+---------+-----+-----+----+---+-----+
Expected Output :
+---+---------+-----+-----+----+---+-----+----------+
|CID| date| A| B| C|Row| SMA| EMA|
+---+---------+-----+-----+----+---+-----+----------+
| 1| 1/1/2020| 41.0| 0.5| 0.5| 1|10.22| 10.22|
| 1|10/3/2020| 24.0| 0.3| 0.7| 2| | 14.354|
| 1|21/5/2020| 32.0| 0.4| 0.6| 3| | 21.4124|
| 2| 3/1/2020| 51.0| 0.22|0.78| 1|34.78| 34.78|
| 2|10/5/2020|14.56|0.333|0.66| 2| | 28.04674|
| 2|30/9/2020| 17.0| 0.66|0.34| 3| |20.7558916|
+---+---------+-----+-----+----+---+-----+----------+
Logic:
For every customer (CID):
    if Row == 1 then
        SMA as EMA
    else
        (C * LAG(EMA) + A * B) as EMA

The problem here is that a freshly calculated value of a previous row is used as input for the current row. That means that it is not possible to parallelize the calculations for a single customer.
For Spark 3.0+, the required result can be obtained with a grouped-map pandas UDF (applyInPandas):
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Cast SMA to double and parse the d/M/yyyy date strings into real dates
ab = spark.createDataFrame(
    [(1,"1/1/2020", 41.0,0.5, 0.5 ,1, '10.22'),
    (1,"10/3/2020",24.0,0.3, 0.7 ,2, '' ),
    (1,"21/5/2020",32.0,0.4, 0.6 ,3, '' ),
    (2,"3/1/2020", 51.0,0.22, 0.78,1, '34.78'),
    (2,"10/5/2020",14.56,0.333,0.66,2, '' ),
    (2,"30/9/2020",17.0,0.66, 0.34,3, '' )],\
    ["CID","date","A","B","C","Row","SMA"] ) \
    .withColumn("SMA", F.col('SMA').cast(T.DoubleType())) \
    .withColumn("date", F.to_date(F.col("date"), "d/M/yyyy"))

def calc(df: pd.DataFrame) -> pd.DataFrame:
    # df is a pandas.DataFrame holding all rows of one CID
    df = df.sort_values('date').reset_index(drop=True)
    # The first row seeds the EMA with the SMA
    df.loc[0, 'EMA'] = df.loc[0, 'SMA']
    for i in range(1, len(df)):
        # Each EMA depends on the EMA just computed for the previous row
        df.loc[i, 'EMA'] = df.loc[i, 'C'] * df.loc[i-1, 'EMA'] + \
            df.loc[i, 'A'] * df.loc[i, 'B']
    return df

ab.groupBy("CID").applyInPandas(calc,
    schema = "CID long, date date, A double, B double, C double, Row long, SMA double, EMA double")\
    .show()
Output:
+---+----------+-----+-----+----+---+-----+------------------+
|CID| date| A| B| C|Row| SMA| EMA|
+---+----------+-----+-----+----+---+-----+------------------+
| 1|2020-01-01| 41.0| 0.5| 0.5| 1|10.22| 10.22|
| 1|2020-03-10| 24.0| 0.3| 0.7| 2| null| 14.354|
| 1|2020-05-21| 32.0| 0.4| 0.6| 3| null|21.412399999999998|
| 2|2020-01-03| 51.0| 0.22|0.78| 1|34.78| 34.78|
| 2|2020-05-10|14.56|0.333|0.66| 2| null| 27.80328|
| 2|2020-09-30| 17.0| 0.66|0.34| 3| null| 20.6731152|
+---+----------+-----+-----+----+---+-----+------------------+
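The EMA values for CID 2 differ from the expected output in the question. A quick plain-Python check of the recurrence (no Spark involved; the helper name ema is only illustrative) suggests that the question's numbers were computed with C = 0.667 for the second row, whereas the table contains 0.66. This is an inference from the arithmetic, not something the question states.

```python
def ema(seed, rows):
    """Apply EMA_i = C_i * EMA_(i-1) + A_i * B_i, starting from the seed (the SMA of row 1)."""
    values = [seed]
    for a, b, c in rows:
        values.append(c * values[-1] + a * b)
    return values

# (A, B, C) for rows 2 and 3 of CID 2, exactly as given in the table
as_given = ema(34.78, [(14.56, 0.333, 0.66), (17.0, 0.66, 0.34)])
# Same rows, but with C = 0.667 in row 2 (assumption: the value the question used)
assumed = ema(34.78, [(14.56, 0.333, 0.667), (17.0, 0.66, 0.34)])

print([round(v, 7) for v in as_given])  # compare with the answer's output
print([round(v, 7) for v in assumed])   # compare with the question's expected output
```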
The idea is to use a Pandas dataframe for each group. This Pandas dataframe contains all rows of the current group and is sorted by date inside calc. While iterating over the Pandas dataframe we can access the EMA value of the previous row, which is not possible for a Spark dataframe.
There are some caveats:
All rows of one group (here, one CID) must fit into the memory of a single executor. Partial aggregation is not possible here.
Iterating over a Pandas dataframe row by row is discouraged because it is slow compared with vectorized operations.
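One way to sidestep the row-by-row df.loc loop is to pull the columns out as plain Python lists and fold over them with itertools.accumulate. The sketch below shows only that fold, on the CID 1 values from the table. The function name ema_fold is hypothetical, and inside calc the lists would presumably come from something like df['A'].tolist(). The initial argument of accumulate needs Python 3.8 or later.

```python
from itertools import accumulate

def ema_fold(sma, a, b, c):
    """Return the EMA series; a, b, c are the per-row columns, sma seeds row 1."""
    # Rows 2..n as (A, B, C) triples; row 1 only contributes the seed.
    rest = zip(a[1:], b[1:], c[1:])
    return list(accumulate(rest, lambda prev, r: r[2] * prev + r[0] * r[1], initial=sma))

# CID 1 from the table, already sorted by date
ema_values = ema_fold(10.22, [41.0, 24.0, 32.0], [0.5, 0.3, 0.4], [0.5, 0.7, 0.6])
print(ema_values)
```

The calculation is still sequential, so this does not remove the dependency on the previous row; it only avoids repeated label-based indexing into the dataframe.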

filter then count for many different threshold

I want to count the rows that satisfy a condition on a very large dataframe, which can be done with
df.filter(col("value") >= thresh).count()
I want the result for each threshold in the range [1, 10]. Enumerating the thresholds and running this action once per threshold scans the dataframe 10 times, which is slow.
Can I achieve this by scanning the df only once?
Create an indicator column for each threshold, then sum:
import random
import pyspark.sql.functions as F
from pyspark.sql import Row
df = spark.createDataFrame([Row(value=random.randint(0,10)) for _ in range(1_000_000)])
df.select([
    (F.col("value") >= thresh)
    .cast("int")
    .alias(f"ind_{thresh}")
    for thresh in range(1,11)
]).groupBy().sum().show()
# +----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+
# |sum(ind_1)|sum(ind_2)|sum(ind_3)|sum(ind_4)|sum(ind_5)|sum(ind_6)|sum(ind_7)|sum(ind_8)|sum(ind_9)|sum(ind_10)|
# +----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+
# | 908971| 818171| 727240| 636334| 545463| 454279| 363143| 272460| 181729| 90965|
# +----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+
Using conditional aggregation with when expressions should do the job.
Here's an example:
from pyspark.sql import functions as F
df = spark.createDataFrame([(1,), (2,), (3,), (4,), (4,), (6,), (7,)], ["value"])
count_expr = [
    F.count(F.when(F.col("value") >= th, 1)).alias(f"gte_{th}")
    for th in range(1, 11)
]
df.select(*count_expr).show()
#+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
#|gte_1|gte_2|gte_3|gte_4|gte_5|gte_6|gte_7|gte_8|gte_9|gte_10|
#+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
#| 7| 6| 5| 4| 2| 2| 1| 0| 0| 0|
#+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
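The reason this needs only one scan is that every row updates all ten counters as it passes by. A plain-Python model of what the ten count(when(...)) aggregates compute on the sample data (not Spark code; the variable names are illustrative):

```python
values = [1, 2, 3, 4, 4, 6, 7]
thresholds = range(1, 11)

# One counter per threshold, all updated during a single pass over the data
counts = {th: 0 for th in thresholds}
for v in values:
    for th in thresholds:
        if v >= th:
            counts[th] += 1

print(counts)
```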
Using a user-defined function (udf from pyspark.sql.functions) that assigns each value to the bin between two thresholds:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100, size=(20)), columns=['val'])
thres = [90, 80, 30] # these are the thresholds
thres.sort(reverse=True) # list needs to be sorted
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
spark = SparkSession.builder \
.master("local[2]") \
.appName("myApp") \
.getOrCreate()
sparkDF = spark.createDataFrame(df)
myUdf = udf(lambda x: 0 if x>thres[0] else 1 if x>thres[1] else 2 if x>thres[2] else 3)
sparkDF = sparkDF.withColumn("rank", myUdf(sparkDF.val))
sparkDF.show()
# +---+----+
# |val|rank|
# +---+----+
# | 28| 3|
# | 54| 2|
# | 19| 3|
# | 4| 3|
# | 74| 2|
# | 62| 2|
# | 95| 0|
# | 19| 3|
# | 55| 2|
# | 62| 2|
# | 33| 2|
# | 93| 0|
# | 81| 1|
# | 41| 2|
# | 80| 2|
# | 53| 2|
# | 14| 3|
# | 16| 3|
# | 30| 3|
# | 77| 2|
# +---+----+
sparkDF.groupby(['rank']).count().show()
# Out:
# +----+-----+
# |rank|count|
# +----+-----+
# | 3| 7|
# | 0| 2|
# | 1| 1|
# | 2| 10|
# +----+-----+
A value gets rank i if it is strictly greater than thres[i] but less than or equal to thres[i-1]. This should minimize the number of comparisons.
For thres = [90, 80, 30] the ranks are: 0 -> (90, max], 1 -> (80, 90], 2 -> (30, 80], 3 -> [min, 30].
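The chained conditional in the lambda grows with the number of thresholds. As a sketch of the same binning rule in plain Python, bisect_left on the ascending thresholds gives the rank directly, and a running sum over the per-rank counts gives the number of values strictly greater than each threshold. The helper name rank_of is hypothetical; the values are the ones shown in the output above.

```python
from bisect import bisect_left
from collections import Counter
from itertools import accumulate

thres = [90, 80, 30]          # sorted in descending order, as in the answer
ascending = sorted(thres)

def rank_of(x):
    # bisect_left counts the thresholds strictly below x,
    # so x > thres[i] holds exactly for i >= rank
    return len(thres) - bisect_left(ascending, x)

vals = [28, 54, 19, 4, 74, 62, 95, 19, 55, 62, 33, 93, 81, 41, 80, 53, 14, 16, 30, 77]
per_rank = Counter(rank_of(v) for v in vals)

# Values with rank <= i are exactly those greater than thres[i]
greater_than = dict(zip(thres, accumulate(per_rank[r] for r in range(len(thres)))))
print(per_rank, greater_than)
```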

Pyspark - Transform columns with maximum values into separate 1 and 0 entries

I have a working version for this problem in pandas, but I'm having trouble translating it to pyspark.
My input DataFrame looks like the following:
test_df = pd.DataFrame({
'id': [1],
'cat_1': [2],
'cat_2': [2],
'cat_3': [1]
})
test_df_spark = spark.createDataFrame(test_df)
test_df_spark.show()
+---+-----+-----+-----+
| id|cat_1|cat_2|cat_3|
+---+-----+-----+-----+
|  1|    2|    2|    1| <- non-maximum
+---+-----+-----+-----+
         ^     ^
         |     |
   maximum     maximum
I would like to:
Obtain columns (1 or more) with maximum value across cat_1, cat_2, cat_3. In the example, these would be cat_1 and cat_2.
These columns should have a 1 value. The rest of non-maximum columns will be set to 0.
Each column holding a 1 should then be split out into its own row.
The resulting DataFrame should then look like this:
+---+-----+-----+-----+
| id|cat_1|cat_2|cat_3|
+---+-----+-----+-----+
| 1| 1| 0| 0|
| 1| 0| 1| 0|
+---+-----+-----+-----+
Currently, the most I've been able to figure out is how to set columns to 1 or 0 according to their value (whether it's the maximum or not), but I'm still missing how to generate additional entries:
columns = ['cat_1', 'cat_2', 'cat_3']
(
    test_df_spark
    .withColumn(
        'max_value',
        F.greatest(
            *columns
        )
    )
    .select(
        'id',
        *[F.when(F.col(c) == F.col('max_value'), F.lit(1)).otherwise(F.lit(0)).alias(c) for c in columns]
    )
    .show()
)
+---+-----+-----+-----+
| id|cat_1|cat_2|cat_3|
+---+-----+-----+-----+
| 1| 1| 1| 0|
+---+-----+-----+-----+
Thanks in advance!
Assuming that your current result is df1:
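from pyspark.sql import functions as F

columns = ['cat_1', 'cat_2', 'cat_3']
# Flag every column that equals the row-wise maximum with 1, all others with 0
df1 = (
    test_df_spark
    .withColumn(
        'max_value',
        F.greatest(
            *columns
        )
    )
    .select(
        'id',
        *[F.when(F.col(c) == F.col('max_value'), F.lit(1)).otherwise(F.lit(0)).alias(c) for c in columns]
    )
)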
You can manipulate df1 to get your desired results by creating an array of structs and inline it:
df2 = df1.select(
    'id',
    F.array(*[
        F.when(
            F.col(c1) == 1,
            F.struct(*[
                F.lit(1).alias(c2) if i1 == i2 else F.lit(0).alias(c2)
                for i2, c2 in enumerate(columns)
            ])
        )
        for i1, c1 in enumerate(columns)
    ]).alias('cat')
).selectExpr(
    'id',
    'inline(filter(cat, x -> x is not null))'
)
df2.show()
+---+-----+-----+-----+
| id|cat_1|cat_2|cat_3|
+---+-----+-----+-----+
| 1| 1| 0| 0|
| 1| 0| 1| 0|
+---+-----+-----+-----+
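To make the array-of-structs step easier to follow, here is a plain-Python model of what it produces for a single row: one candidate row per column, kept only if that column holds the maximum. This is not Spark code, and the function name explode_max is made up for illustration.

```python
columns = ['cat_1', 'cat_2', 'cat_3']

def explode_max(row):
    """Return one one-hot row per column that ties for the maximum."""
    max_value = max(row[c] for c in columns)
    return [
        {'id': row['id'], **{c2: int(c1 == c2) for c2 in columns}}
        for c1 in columns
        if row[c1] == max_value   # plays the role of filter(cat, x -> x is not null)
    ]

print(explode_max({'id': 1, 'cat_1': 2, 'cat_2': 2, 'cat_3': 1}))
```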

How to concatenate to a null column in pyspark dataframe

I have the dataframe below and want to update its rows dynamically with some values.
input_frame.show()
+----------+----------+---------+
|student_id|name |timestamp|
+----------+----------+---------+
| s1|testuser | t1|
| s1|sampleuser| t2|
| s2|test123 | t1|
| s2|sample123 | t2|
+----------+----------+---------+
input_frame = input_frame.withColumn('test', sf.lit(None))
input_frame.show()
+----------+----------+---------+----+
|student_id| name|timestamp|test|
+----------+----------+---------+----+
| s1| testuser| t1|null|
| s1|sampleuser| t2|null|
| s2| test123| t1|null|
| s2| sample123| t2|null|
+----------+----------+---------+----+
input_frame = input_frame.withColumn('test', sf.concat(sf.col('test'),sf.lit('test')))
input_frame.show()
+----------+----------+---------+----+
|student_id| name|timestamp|test|
+----------+----------+---------+----+
| s1| testuser| t1|null|
| s1|sampleuser| t2|null|
| s2| test123| t1|null|
| s2| sample123| t2|null|
+----------+----------+---------+----+
I want to update the 'test' column with some values and then filter on partial matches in that column. But concatenating to a null column results in a null column again. How can we do this?
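Use concat_ws, like this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([["1", "2"], ["2", None], ["3", "4"], ["4", "5"], [None, "6"]]).toDF("a", "b")
# This won't work: concat returns null as soon as one input is null
df = df.withColumn("concat", concat(df.a, df.b))
# This won't work either: casting a null to string still gives null
df = df.withColumn("concat + cast", concat(df.a.cast('string'), df.b.cast('string')))
# Do it like this: concat_ws skips null inputs
df = df.withColumn("concat_ws", concat_ws("", df.a, df.b))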
df.show()
gives:
+----+----+------+-------------+---------+
| a| b|concat|concat + cast|concat_ws|
+----+----+------+-------------+---------+
| 1| 2| 12| 12| 12|
| 2|null| null| null| 2|
| 3| 4| 34| 34| 34|
| 4| 5| 45| 45| 45|
|null| 6| null| null| 6|
+----+----+------+-------------+---------+
Note specifically that casting a NULL column to string doesn't work as you might wish: the concatenated value is still NULL whenever any of the input columns is NULL.
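The difference between the two functions can be modelled in a few lines of plain Python, with None standing in for NULL. These helpers only imitate the behaviour shown in the table above; they are not the Spark implementations.

```python
def concat(*parts):
    # NULL-propagating: any None input makes the whole result None
    return None if any(p is None for p in parts) else ''.join(parts)

def concat_ws(sep, *parts):
    # NULL-skipping: None inputs are dropped before joining
    return sep.join(p for p in parts if p is not None)

rows = [("1", "2"), ("2", None), ("3", "4"), ("4", "5"), (None, "6")]
for a, b in rows:
    print(a, b, concat(a, b), concat_ws("", a, b))
```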
There's no nice way of dealing with more complicated scenarios, but note that you can use a when expression inside a concat if you're willing to put up with the verbosity, like this:
from pyspark.sql.functions import concat, lit, when
df.withColumn("concat_custom", concat(
    when(df.a.isNull(), lit('_')).otherwise(df.a),
    when(df.b.isNull(), lit('_')).otherwise(df.b))
)
To get, eg:
+----+----+-------------+
| a| b|concat_custom|
+----+----+-------------+
| 1| 2| 12|
| 2|null| 2_|
| 3| 4| 34|
| 4| 5| 45|
|null| 6| _6|
+----+----+-------------+
You can use the coalesce function, which returns the first of its arguments that is not null, and pass a literal as the second argument, which is used when the column is null.
from pyspark.sql.functions import coalesce, concat, lit
df = df.withColumn("concat", concat(coalesce(df.a, lit('')), coalesce(df.b, lit(''))))
You can fill null values with empty strings:
import pyspark.sql.functions as f
from pyspark.sql.types import *
data = spark.createDataFrame([('s1', 't1'), ('s2', 't2')], ['col1', 'col2'])
data = data.withColumn('test', f.lit(None).cast(StringType()))
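# display() is a Databricks notebook helper; outside Databricks, call .show() on the result instead
display(data.na.fill('').withColumn('test2', f.concat('col1', 'col2', 'test')))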
Is that what you were looking for?

Combine two rows in Pyspark if a condition is met

I have a PySpark data table that looks like the following
shouldMerge | number
true | 1
true | 1
true | 2
false | 3
false | 1
I want to combine all of the rows with shouldMerge as true and add up their numbers,
so the final output would look like
shouldMerge | number
true | 4
false | 3
false | 1
How can I select all the ones with shouldMerge == true, add up the numbers, and generate a new row in PySpark?
Edit: An alternate, slightly more complicated scenario closer to what I'm trying to solve, where we only aggregate rows with a positive mergeId:
mergeId | number
1 | 1
2 | 1
1 | 2
-1 | 3
-1 | 1
Expected output:
mergeId | number
1 | 3
2 | 1
-1 | 3
-1 | 1
IIUC, you want to do a groupBy, but only on the positive mergeIds.
One way is to filter your DataFrame for the positive ids, group, aggregate, and union this back with the negative ids (similar to @shanmuga's answer).
Another way is to use when to dynamically create a grouping key. If the mergeId is positive, use the mergeId to group. Otherwise, use a monotonically_increasing_id to ensure that the row does not get aggregated.
Here is an example:
import pyspark.sql.functions as f
df.withColumn("uid", f.monotonically_increasing_id())\
    .groupBy(
        f.when(
            f.col("mergeId") > 0,
            f.col("mergeId")
        ).otherwise(f.col("uid")).alias("mergeKey"),
        f.col("mergeId")
    )\
    .agg(f.sum("number").alias("number"))\
    .drop("mergeKey")\
    .show()
#+-------+------+
#|mergeId|number|
#+-------+------+
#| -1| 1.0|
#| 1| 3.0|
#| 2| 1.0|
#| -1| 3.0|
#+-------+------+
This can easily be generalized by changing the when condition (in this case it's f.col("mergeId") > 0) to match your specific requirements.
Explanation:
First we create a temporary column uid which is a unique ID for each row. Next, we call groupBy and if the mergeId is positive use the mergeId to group. Otherwise we use the uid as the mergeKey. I also passed in the mergeId as a second group by column as a way to keep that column for the output.
To demonstrate what is going on, take a look at the intermediate result:
df.withColumn("uid", f.monotonically_increasing_id())\
    .withColumn(
        "mergeKey",
        f.when(
            f.col("mergeId") > 0,
            f.col("mergeId")
        ).otherwise(f.col("uid")).alias("mergeKey")
    )\
    .show()
#+-------+------+-----------+-----------+
#|mergeId|number| uid| mergeKey|
#+-------+------+-----------+-----------+
#| 1| 1| 0| 1|
#| 2| 1| 8589934592| 2|
#| 1| 2|17179869184| 1|
#| -1| 3|25769803776|25769803776|
#| -1| 1|25769803777|25769803777|
#+-------+------+-----------+-----------+
As you can see, the mergeKey stays unique for each row with a negative mergeId.
From this intermediate step, the desired result is just a trivial group by and sum, followed by dropping the mergeKey column.
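The same grouping-key idea in plain Python, as a sketch: positive mergeIds share a key, and every other row gets a key of its own (here the row index stands in for monotonically_increasing_id). Names such as merge_key are illustrative only.

```python
rows = [(1, 1), (2, 1), (1, 2), (-1, 3), (-1, 1)]   # (mergeId, number)

totals = {}
for uid, (merge_id, number) in enumerate(rows):
    # Tag the key so that a row index can never collide with a real mergeId
    merge_key = ('id', merge_id) if merge_id > 0 else ('row', uid)
    _, running = totals.get(merge_key, (merge_id, 0))
    totals[merge_key] = (merge_id, running + number)

result = list(totals.values())
print(result)
```

In the Spark version, grouping on mergeId as a second column gives the same protection against a uid that happens to equal a positive mergeId.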
You have to select only the rows where shouldMerge is true, aggregate them, and then union the result with all the remaining rows.
import pyspark.sql.functions as functions
# First scenario: merge the rows flagged with shouldMerge
df = sqlContext.createDataFrame([
    (True, 1),
    (True, 1),
    (True, 2),
    (False, 3),
    (False, 1),
], ("shouldMerge", "number"))
false_df = df.filter("shouldMerge = false")
true_df = df.filter("shouldMerge = true")
result = true_df.groupBy("shouldMerge")\
    .agg(functions.sum("number").alias("number"))\
    .unionAll(false_df)
result.show()
# Second scenario: merge the rows that share a mergeId greater than -1
df = sqlContext.createDataFrame([
    (1, 1),
    (2, 1),
    (1, 2),
    (-1, 3),
    (-1, 1),
], ("mergeId", "number"))
merge_condition = df["mergeId"] > -1
remaining = ~merge_condition
groupby_field = "mergeId"
false_df = df.filter(remaining)
true_df = df.filter(merge_condition)
result = true_df.groupBy(groupby_field)\
    .agg(functions.sum("number").alias("number"))\
    .unionAll(false_df)
result.show()
For the first problem posted by the OP:
# Create the DataFrame
valuesCol = [(True,1),(True,1),(True,2),(False,3),(False,1)]
df = sqlContext.createDataFrame(valuesCol,['shouldMerge','number'])
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| true| 1|
| true| 1|
| true| 2|
| false| 3|
| false| 1|
+-----------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view (registerTempTable is deprecated since Spark 2.0)
df.createOrReplaceTempView('table_view')
df=sqlContext.sql(
'select shouldMerge, number, sum(number) over (partition by shouldMerge) as sum_number from table_view'
)
df = df.withColumn('number',when(col('shouldMerge')==True,col('sum_number')).otherwise(col('number')))
df.show()
+-----------+------+----------+
|shouldMerge|number|sum_number|
+-----------+------+----------+
| true| 4| 4|
| true| 4| 4|
| true| 4| 4|
| false| 3| 4|
| false| 1| 4|
+-----------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy().orderBy('shouldMerge')
df = df.withColumn('shouldMerge_lag', lag(col('shouldMerge'),1).over(my_window))
df.show()
+-----------+------+---------------+
|shouldMerge|number|shouldMerge_lag|
+-----------+------+---------------+
| false| 3| null|
| false| 1| false|
| true| 4| false|
| true| 4| true|
| true| 4| true|
+-----------+------+---------------+
df = df.where(~((col('shouldMerge')==True) & (col('shouldMerge_lag')==True))).drop('shouldMerge_lag')
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| false| 3|
| false| 1|
| true| 4|
+-----------+------+
For the second problem posted by the OP:
# Create the DataFrame
valuesCol = [(1,2),(1,1),(2,1),(1,2),(-1,3),(-1,1)]
df = sqlContext.createDataFrame(valuesCol,['mergeId','number'])
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 2|
| 1| 1|
| 2| 1|
| 1| 2|
| -1| 3|
| -1| 1|
+-------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view (registerTempTable is deprecated since Spark 2.0)
df.createOrReplaceTempView('table_view')
df=sqlContext.sql(
'select mergeId, number, sum(number) over (partition by mergeId) as sum_number from table_view'
)
df = df.withColumn('number',when(col('mergeId') > 0,col('sum_number')).otherwise(col('number')))
df.show()
+-------+------+----------+
|mergeId|number|sum_number|
+-------+------+----------+
| 1| 5| 5|
| 1| 5| 5|
| 1| 5| 5|
| 2| 1| 1|
| -1| 3| 4|
| -1| 1| 4|
+-------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy('mergeId').orderBy('mergeId')
df = df.withColumn('mergeId_lag', lag(col('mergeId'),1).over(my_window))
df.show()
+-------+------+-----------+
|mergeId|number|mergeId_lag|
+-------+------+-----------+
| 1| 5| null|
| 1| 5| 1|
| 1| 5| 1|
| 2| 1| null|
| -1| 3| null|
| -1| 1| -1|
+-------+------+-----------+
df = df.where(~((col('mergeId') > 0) & (col('mergeId_lag').isNotNull()))).drop('mergeId_lag')
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 5|
| 2| 1|
| -1| 3|
| -1| 1|
+-------+------+
Documentation: lag() - Returns the value that is offset rows before the current row.
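For readers unfamiliar with lag, a rough plain-Python equivalent over an already ordered list may help. It models the semantics within a single window partition and is not Spark code; the non-negative offset is an assumption of this sketch.

```python
def lag(values, offset=1, default=None):
    """Shift values down by `offset` positions, padding the start with `default`."""
    n = len(values)
    k = min(offset, n)          # never pad more than the list is long
    return [default] * k + list(values[:n - k])

should_merge = [False, False, True, True, True]   # ordered as in the first example
print(lag(should_merge))
```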

Pyspark - Calculate number of null values in each dataframe column

I have a dataframe with many columns. My aim is to produce a dataframe that lists each column name, along with the number of null values in that column.
Example:
+-------------+-------------+
| Column_Name | NULL_Values |
+-------------+-------------+
| Column_1 | 15 |
| Column_2 | 56 |
| Column_3 | 18 |
| ... | ... |
+-------------+-------------+
I have managed to get the number of null values for ONE column like so:
df.agg(F.count(F.when(F.isnull(c), c)).alias('NULL_Count'))
where c is a column in the dataframe. However, it does not show the name of the column. The output is:
+------------+
| NULL_Count |
+------------+
| 15 |
+------------+
Any ideas?
You can use a list comprehension to loop over all of your columns in the agg, and use alias to rename the output column:
import pyspark.sql.functions as F
df_agg = df.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns])
However, this will return the results in one row as shown below:
df_agg.show()
#+--------+--------+--------+
#|Column_1|Column_2|Column_3|
#+--------+--------+--------+
#| 15| 56| 18|
#+--------+--------+--------+
If you wanted the results in one column instead, you could union each column from df_agg using functools.reduce as follows:
from functools import reduce
df_agg_col = reduce(
    lambda a, b: a.union(b),
    (
        df_agg.select(F.lit(c).alias("Column_Name"), F.col(c).alias("NULL_Count"))
        for c in df_agg.columns
    )
)
df_agg_col.show()
#+-----------+----------+
#|Column_Name|NULL_Count|
#+-----------+----------+
#| Column_1| 15|
#| Column_2| 56|
#| Column_3| 18|
#+-----------+----------+
Or you can skip the intermediate step of creating df_agg and do:
df_agg_col = reduce(
    lambda a, b: a.union(b),
    (
        df.agg(
            F.count(F.when(F.isnull(c), c)).alias('NULL_Count')
        ).select(F.lit(c).alias("Column_Name"), "NULL_Count")
        for c in df.columns
    )
)
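What the aggregate-then-union pipeline computes can be stated in plain Python as a sketch: count the missing entries per column, then emit one (Column_Name, NULL_Count) pair per column. The sample rows below are made up for this sketch, with None standing in for NULL.

```python
rows = [
    {"id": 1, "weight": 100,  "gender": "Male"},
    {"id": 2, "weight": None, "gender": None},
    {"id": 3, "weight": None, "gender": "Female"},
]
columns = list(rows[0])

# Wide form: one count per column (the single-row df_agg)
null_counts = {c: sum(r[c] is None for r in rows) for c in columns}
# Long form: one (Column_Name, NULL_Count) pair per column (df_agg_col)
long_form = list(null_counts.items())
print(long_form)
```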
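A Scala alternative could be:
import org.apache.spark.sql.functions.{col, lit, sum}
import spark.implicits._  // needed for toDF() outside spark-shell
case class Test(id:Int, weight:Option[Int], age:Int, gender: Option[String])
val df1 = Seq(Test(1, Some(100), 23, Some("Male")), Test(2, None, 25, None), Test(3, None, 33, Some("Female"))).toDF()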
df1.show()
+---+------+---+------+
| id|weight|age|gender|
+---+------+---+------+
| 1| 100| 23| Male|
| 2| null| 25| null|
| 3| null| 33|Female|
+---+------+---+------+
val s = df1.columns.map(c => sum(col(c).isNull.cast("integer")).alias(c))
val df2 = df1.agg(s.head, s.tail:_*)
val t = df2.columns.map(c => df2.select(lit(c).alias("col_name"), col(c).alias("null_count")))
val df_agg_col = t.reduce((df1, df2) => df1.union(df2))
df_agg_col.show()
