Calculate mean and max value with VectorAssembler - python

I'm working with a data frame, something like:
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import *
schema = StructType([
    StructField("ClientId", IntegerType(), True),
    StructField("m_ant21", IntegerType(), True),
    StructField("m_ant22", IntegerType(), True),
    StructField("m_ant23", IntegerType(), True),
    StructField("m_ant24", IntegerType(), True)
])
df = sqlContext.createDataFrame(
    data=[(0, 5, 5, 4, 0),
          (1, 23, 13, 17, 99),
          (2, 0, 0, 0, 1),
          (3, 0, 4, 1, 0),
          (4, 2, 1, 30, 10),
          (5, 0, 0, 0, 0)],
    schema=schema)
I need to calculate the mean and max value per row, using the columns "m_ant21", "m_ant22", "m_ant23" and "m_ant24".
I'm trying to use VectorAssembler:
assembler = VectorAssembler(
    inputCols=["m_ant21", "m_ant22", "m_ant23", "m_ant24"],
    outputCol="muestra")
output = assembler.transform(df)
output.show()
Now I create a function to compute the mean; the input variable is a "DenseVector" called "dv":
dv = output.collect()[0].asDict()['muestra']
def mi_media( dv ) :
    return float( sum( dv ) / dv.size )

udf_media = udf( mi_media, DoubleType() )
output1 = output.withColumn( "mediaVec", udf_media ( output.muestra ) )
output1.show()
and the same for the max value:
def mi_Max( dv ) :
    return float(max( dv ) )

udf_max = udf( mi_Max, DoubleType() )
output2 = output.withColumn( "maxVec", udf_max ( output.muestra ) )
output2.show()
The problem is that both output1.show() and output2.show() fail with an error. It just doesn't work, and I don't know what is happening with the code.
What am I doing wrong?
Please help me.

I have tried it my way; check it out:
from pyspark.sql import functions as F
df.show()
+--------+-------+-------+-------+-------+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|
+--------+-------+-------+-------+-------+
| 0| 5| 5| 4| 0|
| 1| 23| 13| 17| 99|
| 2| 0| 0| 0| 1|
| 3| 0| 4| 1| 0|
| 4| 2| 1| 30| 10|
| 5| 0| 0| 0| 0|
+--------+-------+-------+-------+-------+
df1 = df.withColumn('mean',sum(df[c] for c in df.columns[1:])/len(df.columns[1:]))
df1 = df1.withColumn('max',F.greatest(*[F.coalesce(df[c],F.lit(0)) for c in df.columns[1:]]))
df1.show()
+--------+-------+-------+-------+-------+-----+---+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24| mean|max|
+--------+-------+-------+-------+-------+-----+---+
| 0| 5| 5| 4| 0| 3.5| 5|
| 1| 23| 13| 17| 99| 38.0| 99|
| 2| 0| 0| 0| 1| 0.25| 1|
| 3| 0| 4| 1| 0| 1.25| 4|
| 4| 2| 1| 30| 10|10.75| 30|
| 5| 0| 0| 0| 0| 0.0| 0|
+--------+-------+-------+-------+-------+-----+---+
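As a side note (my addition, not part of the answer above): the same column-wise pattern gives a row-wise minimum via F.least, which, like F.greatest, skips null values unless every input is null:
# Sketch: row-wise minimum over the same columns
df1 = df1.withColumn('min', F.least(*[df[c] for c in df.columns[1:]]))
df1.show()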

It is possible to do it with DenseVector but in RDD way:
output2 = output.rdd.map(lambda x: (x.ClientId,
                                    x.m_ant21,
                                    x.m_ant22,
                                    x.m_ant23,
                                    x.m_ant24,
                                    x.muestra,
                                    float(max(x.muestra))))
output2 = spark.createDataFrame(output2)
output2.show()
which gives:
+---+---+---+---+---+--------------------+----+
| _1| _2| _3| _4| _5| _6| _7|
+---+---+---+---+---+--------------------+----+
| 0| 5| 5| 4| 0| [5.0,5.0,4.0,0.0]| 5.0|
| 1| 23| 13| 17| 99|[23.0,13.0,17.0,9...|99.0|
| 2| 0| 0| 0| 1| (4,[3],[1.0])| 1.0|
| 3| 0| 4| 1| 0| [0.0,4.0,1.0,0.0]| 4.0|
| 4| 2| 1| 30| 10| [2.0,1.0,30.0,10.0]|30.0|
| 5| 0| 0| 0| 0| (4,[],[])| 0.0|
+---+---+---+---+---+--------------------+----+
Now all that remains is to rename the columns, for example with the withColumnRenamed function; a minimal sketch follows. The mean case is the same.
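As a sketch of that renaming step (my addition; the target names simply mirror the original columns plus "maxVec"):
# Rename the positional columns one at a time ...
output2 = output2.withColumnRenamed("_7", "maxVec")
# ... or all at once, assuming the column order shown above
# output2 = output2.toDF("ClientId", "m_ant21", "m_ant22", "m_ant23", "m_ant24", "muestra", "maxVec")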
It is also possible to do this with a SparseVector, but in that case it is necessary to access the vector's values attribute:
output2 = output.rdd.map(lambda x: (x.ClientId,
                                    x.m_ant21,
                                    x.m_ant22,
                                    x.m_ant23,
                                    x.m_ant24,
                                    x.muestra,
                                    float(max(x.muestra.values))))
output2 = spark.createDataFrame(output2)
This approach works better if your df has a lot of columns and it is not possible to calculate the max value before the VectorAssembler stage.

I found a solution to this problem:
import pyspark.sql.functions as f
import pyspark.sql.types as t
min_of_vector = f.udf(lambda vec: vec.toArray().min(), t.DoubleType())
max_of_vector = f.udf(lambda vec: vec.toArray().max(), t.DoubleType())
mean_of_vector = f.udf(lambda vec: vec.toArray().mean(), t.DoubleType())
final = output.withColumn('min', min_of_vector('muestra')) \
    .withColumn('max', max_of_vector('muestra')) \
    .withColumn('mean', mean_of_vector('muestra'))
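A hedged note on this solution: depending on the Spark version, a UDF declared as DoubleType may not accept the numpy scalars returned by min()/max()/mean(), so casting to a plain Python float is a safe variant:
# Defensive variant (assumption: some Spark versions reject numpy scalars returned from a UDF)
min_of_vector = f.udf(lambda vec: float(vec.toArray().min()), t.DoubleType())
max_of_vector = f.udf(lambda vec: float(vec.toArray().max()), t.DoubleType())
mean_of_vector = f.udf(lambda vec: float(vec.toArray().mean()), t.DoubleType())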

Related

pyspark - join with OR condition

I would like to join two pyspark dataframes if at least one of two conditions is satisfied.
Toy data:
df1 = spark.createDataFrame([
    (10, 1, 666),
    (20, 2, 777),
    (30, 1, 888),
    (40, 3, 999),
    (50, 1, 111),
    (60, 2, 222),
    (10, 4, 333),
    (50, None, 444),
    (10, 0, 555),
    (50, 0, 666)
    ],
    ['var1', 'var2', 'other_var']
)
df2 = spark.createDataFrame([
    (10, 1),
    (20, 2),
    (30, None),
    (30, 0)
    ],
    ['var1_', 'var2_']
)
I would like to maintain all those rows of df1 where var1 is present in the distinct values of df2.var1_ OR var2 is present in the distinct values of df2.var2_ (but not in the case where such value is 0).
So, the expected output would be
+----+----+---------+-----+-----+
|var1|var2|other_var|var1_|var2_|
+----+----+---------+-----+-----+
| 10| 1| 666| 10| 1| # join on both var1 and var2
| 20| 2| 777| 20| 2| # join on both var1 and var2
| 30| 1| 888| 10| 1| # join on both var1 and var2
| 50| 1| 111| 10| 1| # join on var2
| 60| 2| 222| 20| 2| # join on var2
| 10| 4| 333| 10| 1| # join on var1
| 10| 0| 555| 10| 1| # join on var1
+----+----+---------+-----+-----+
Among the other attempts, I tried
cond = [(df1.var1 == (df2.select('var1_').distinct()).var1_) | (df1.var2 == (df2.filter(F.col('var2_') != 0).select('var2_').distinct()).var2_)]
df1\
    .join(df2, how='inner', on=cond)\
    .show()
+----+----+---------+-----+-----+
|var1|var2|other_var|var1_|var2_|
+----+----+---------+-----+-----+
| 10| 1| 666| 10| 1|
| 20| 2| 777| 20| 2|
| 30| 1| 888| 10| 1|
| 50| 1| 111| 10| 1|
| 30| 1| 888| 30| null|
| 30| 1| 888| 30| 0|
| 60| 2| 222| 20| 2|
| 10| 4| 333| 10| 1|
| 10| 0| 555| 10| 1|
| 10| 0| 555| 30| 0|
| 50| 0| 666| 30| 0|
+----+----+---------+-----+-----+
but I obtained more rows than expected, and the rows where var2 == 0 were also preserved.
What am I doing wrong?
Note: I'm not using the .isin method because my actual df2 has around 20k rows and I've read here that this method with a large number of IDs could have a bad performance.
Try the condition below:
cond = (df2.var2_ != 0) & ((df1.var1 == df2.var1_) | (df1.var2 == df2.var2_))
df1\
    .join(df2, how='inner', on=cond)\
    .show()
+----+----+---------+-----+-----+
|var1|var2|other_var|var1_|var2_|
+----+----+---------+-----+-----+
| 10| 1| 666| 10| 1|
| 30| 1| 888| 10| 1|
| 20| 2| 777| 20| 2|
| 50| 1| 111| 10| 1|
| 60| 2| 222| 20| 2|
| 10| 4| 333| 10| 1|
| 10| 0| 555| 10| 1|
+----+----+---------+-----+-----+
The condition should only include columns from the two dataframes being joined. If you want to exclude var2_ = 0, you can put that in the join condition rather than applying it as a filter afterwards.
There is also no need to call distinct, because it does not affect the equality condition and only adds an unnecessary step.
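On the performance note in the question (df2 has roughly 20k rows): since an OR of equalities is not an equi-join, Spark cannot use a regular hash join here, and one optional tweak is a broadcast hint on the smaller dataframe. This is a sketch of my own, not part of the answer:
from pyspark.sql import functions as F

# Hint Spark to broadcast the small dataframe; the join logic itself is unchanged
cond = (df2.var2_ != 0) & ((df1.var1 == df2.var1_) | (df1.var2 == df2.var2_))
df1.join(F.broadcast(df2), on=cond, how='inner').show()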

Pyspark: Add new column from another pyspark dataframe

I have two dataframes as follows.
I want to add a new column to dataframe df_a, taken from column val_1 of dataframe df_b, based on the condition df_a.col_p == df_b.id.
df_a = sqlContext.createDataFrame(
    [(1412, 31, 1), (2422, 21, 1), (4223, 22, 2), (2244, 43, 1),
     (1235, 54, 1), (4126, 12, 5), (2342, 44, 1)],
    ["idx", "col_n", "col_p"])
df_a.show()
+----+-----+-----+
| idx|col_n|col_p|
+----+-----+-----+
|1412| 31| 1|
|2422| 21| 1|
|4223| 22| 2|
|2244| 43| 1|
|1235| 54| 1|
|4126| 12| 5|
|2342| 44| 1|
+----+-----+-----+
df_b = sqlContext.createDataFrame(
    [(1, 1, 1), (2, 1, 1), (3, 1, 2), (4, 1, 1), (5, 2, 1), (6, 2, 2)],
    ["id", "val_1", "val_2"])
df_b.show()
+---+-----+-----+
| id|val_1|val_2|
+---+-----+-----+
| 1| 1| 1|
| 2| 1| 1|
| 3| 1| 2|
| 4| 1| 1|
| 5| 2| 1|
| 6| 2| 2|
+---+-----+-----+
Expected output
+----+-----+-----+-----+
| idx|col_n|col_p|val_1|
+----+-----+-----+-----+
|1412| 31| 1| 1|
|2422| 21| 1| 1|
|4223| 22| 2| 1|
|2244| 43| 1| 1|
|1235| 54| 1| 1|
|4126| 12| 5| 2|
|2342| 44| 1| 1|
+----+-----+-----+-----+
My code
cond = (df_a.col_p == df_b.id)
df_a_new = df_a.join(df_b, cond, how ='full').withColumn('val_new', F.when(cond, df_b.val_1))
df_a_new = df_a_new.drop(*['id', 'val_1', 'val_2'])
df_a_new = df_a_new.filter(df_a_new.idx.isNotNull())
df_a_new.show()
How can I get the proper output, matching the expected result, with the correct row order?
You can assign an increasing index to df_a and sort by that index after joining. Also I'd suggest doing a left join rather than a full join.
import pyspark.sql.functions as F
df_a_new = df_a.withColumn('index', F.monotonically_increasing_id()) \
    .join(df_b, df_a.col_p == df_b.id, 'left') \
    .orderBy('index') \
    .select('idx', 'col_n', 'col_p', 'val_1')
df_a_new.show()
+----+-----+-----+-----+
| idx|col_n|col_p|val_1|
+----+-----+-----+-----+
|1412| 31| 1| 1|
|2422| 21| 1| 1|
|4223| 22| 2| 1|
|2244| 43| 1| 1|
|1235| 54| 1| 1|
|4126| 12| 5| 2|
|2342| 44| 1| 1|
+----+-----+-----+-----+
You need to create your own index (e.g. with monotonically_increasing_id) and sort on it again after joining.
But there is no way to join while preserving order in Spark, as the rows are partitioned before joining and lose their order before being combined.
refer: Can Dataframe joins in Spark preserve order?

grouping consecutive rows in PySpark Dataframe

I have the following example Spark DataFrame:
rdd = sc.parallelize([(1,"19:00:00", "19:30:00", 30), (1,"19:30:00", "19:40:00", 10),(1,"19:40:00", "19:43:00", 3), (2,"20:00:00", "20:10:00", 10), (1,"20:05:00", "20:15:00", 10),(1,"20:15:00", "20:35:00", 20)])
df = spark.createDataFrame(rdd, ["user_id", "start_time", "end_time", "duration"])
df.show()
+-------+----------+--------+--------+
|user_id|start_time|end_time|duration|
+-------+----------+--------+--------+
| 1| 19:00:00|19:30:00| 30|
| 1| 19:30:00|19:40:00| 10|
| 1| 19:40:00|19:43:00| 3|
| 2| 20:00:00|20:10:00| 10|
| 1| 20:05:00|20:15:00| 10|
| 1| 20:15:00|20:35:00| 20|
+-------+----------+--------+--------+
I want to group consecutive rows based on the start and end times. For instance, for the same user_id, if a row's start time is the same as the previous row's end time, I want to group them together and sum the duration.
The desired result is:
+-------+----------+--------+--------+
|user_id|start_time|end_time|duration|
+-------+----------+--------+--------+
| 1| 19:00:00|19:43:00| 43|
| 2| 20:00:00|20:10:00| 10|
| 1| 20:05:00|20:35:00| 30|
+-------+----------+--------+--------+
The first three rows of the dataframe were grouped together because they all correspond to user_id 1 and the start times and end times form a continuous timeline.
This was my initial approach:
Use the lag function to get the next start time:
from pyspark.sql.functions import *
from pyspark.sql import Window
import sys
# compute next start time
window = Window.partitionBy('user_id').orderBy('start_time')
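# note: lag with a negative offset (-1) looks one row ahead, i.e. it behaves like lead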
df = df.withColumn("next_start_time", lag(df.start_time, -1).over(window))
df.show()
+-------+----------+--------+--------+---------------+
|user_id|start_time|end_time|duration|next_start_time|
+-------+----------+--------+--------+---------------+
| 1| 19:00:00|19:30:00| 30| 19:30:00|
| 1| 19:30:00|19:40:00| 10| 19:40:00|
| 1| 19:40:00|19:43:00| 3| 20:05:00|
| 1| 20:05:00|20:15:00| 10| 20:15:00|
| 1| 20:15:00|20:35:00| 20| null|
| 2| 20:00:00|20:10:00| 10| null|
+-------+----------+--------+--------+---------------+
Get the difference between the current row's end time and the next row's start time:
time_fmt = "HH:mm:ss"
timeDiff = unix_timestamp('next_start_time', format=time_fmt) - unix_timestamp('end_time', format=time_fmt)
df = df.withColumn("difference", timeDiff)
df.show()
+-------+----------+--------+--------+---------------+----------+
|user_id|start_time|end_time|duration|next_start_time|difference|
+-------+----------+--------+--------+---------------+----------+
| 1| 19:00:00|19:30:00| 30| 19:30:00| 0|
| 1| 19:30:00|19:40:00| 10| 19:40:00| 0|
| 1| 19:40:00|19:43:00| 3| 20:05:00| 1320|
| 1| 20:05:00|20:15:00| 10| 20:15:00| 0|
| 1| 20:15:00|20:35:00| 20| null| null|
| 2| 20:00:00|20:10:00| 10| null| null|
+-------+----------+--------+--------+---------------+----------+
Now my idea was to use the sum function with a window to get the cumulative sum of duration and then do a groupBy. But my approach was flawed for many reasons.
Here's one approach:
Gather together rows into groups where a group is a set of rows with the same user_id that are consecutive (start_time matches previous end_time). Then you can use this group to do your aggregation.
A way to get here is by creating intermediate indicator columns to tell you if the user has changed or the time is not consecutive. Then perform a cumulative sum over the indicator column to create the group.
For example:
import pyspark.sql.functions as f
from pyspark.sql import Window
w1 = Window.orderBy("start_time")
df = df.withColumn(
        "userChange",
        (f.col("user_id") != f.lag("user_id").over(w1)).cast("int")
    )\
    .withColumn(
        "timeChange",
        (f.col("start_time") != f.lag("end_time").over(w1)).cast("int")
    )\
    .fillna(
        0,
        subset=["userChange", "timeChange"]
    )\
    .withColumn(
        "indicator",
        (~((f.col("userChange") == 0) & (f.col("timeChange") == 0))).cast("int")
    )\
    .withColumn(
        "group",
        f.sum(f.col("indicator")).over(w1.rangeBetween(Window.unboundedPreceding, 0))
    )
df.show()
#+-------+----------+--------+--------+----------+----------+---------+-----+
#|user_id|start_time|end_time|duration|userChange|timeChange|indicator|group|
#+-------+----------+--------+--------+----------+----------+---------+-----+
#| 1| 19:00:00|19:30:00| 30| 0| 0| 0| 0|
#| 1| 19:30:00|19:40:00| 10| 0| 0| 0| 0|
#| 1| 19:40:00|19:43:00| 3| 0| 0| 0| 0|
#| 2| 20:00:00|20:10:00| 10| 1| 1| 1| 1|
#| 1| 20:05:00|20:15:00| 10| 1| 1| 1| 2|
#| 1| 20:15:00|20:35:00| 20| 0| 0| 0| 2|
#+-------+----------+--------+--------+----------+----------+---------+-----+
Now that we have the group column, we can aggregate as follows to get the desired result:
df.groupBy("user_id", "group")\
.agg(
f.min("start_time").alias("start_time"),
f.max("end_time").alias("end_time"),
f.sum("duration").alias("duration")
)\
.drop("group")\
.show()
#+-------+----------+--------+--------+
#|user_id|start_time|end_time|duration|
#+-------+----------+--------+--------+
#| 1| 19:00:00|19:43:00| 43|
#| 1| 20:05:00|20:35:00| 30|
#| 2| 20:00:00|20:10:00| 10|
#+-------+----------+--------+--------+
Here is a working solution derived from Pault's answer:
Create the DataFrame:
rdd = sc.parallelize([(1, "19:00:00", "19:30:00", 30), (1, "19:30:00", "19:40:00", 10),
                      (1, "19:40:00", "19:43:00", 3), (2, "20:00:00", "20:10:00", 10),
                      (1, "20:05:00", "20:15:00", 10), (1, "20:15:00", "20:35:00", 20)])
df = spark.createDataFrame(rdd, ["user_id", "start_time", "end_time", "duration"])
df.show()
+-------+----------+--------+--------+
|user_id|start_time|end_time|duration|
+-------+----------+--------+--------+
| 1| 19:00:00|19:30:00| 30|
| 1| 19:30:00|19:40:00| 10|
| 1| 19:40:00|19:43:00| 3|
| 1| 20:05:00|20:15:00| 10|
| 1| 20:15:00|20:35:00| 20|
+-------+----------+--------+--------+
Create an indicator column that indicates whenever the time has changed, and use cumulative sum to give each group a unique id:
import pyspark.sql.functions as f
from pyspark.sql import Window
w1 = Window.partitionBy('user_id').orderBy('start_time')
df = df.withColumn(
        "indicator",
        (f.col("start_time") != f.lag("end_time").over(w1)).cast("int")
    )\
    .fillna(
        0,
        subset=["indicator"]
    )\
    .withColumn(
        "group",
        f.sum(f.col("indicator")).over(w1.rangeBetween(Window.unboundedPreceding, 0))
    )
df.show()
+-------+----------+--------+--------+---------+-----+
|user_id|start_time|end_time|duration|indicator|group|
+-------+----------+--------+--------+---------+-----+
| 1| 19:00:00|19:30:00| 30| 0| 0|
| 1| 19:30:00|19:40:00| 10| 0| 0|
| 1| 19:40:00|19:43:00| 3| 0| 0|
| 1| 20:05:00|20:15:00| 10| 1| 1|
| 1| 20:15:00|20:35:00| 20| 0| 1|
+-------+----------+--------+--------+---------+-----+
Now group by user_id and the group column.
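A sketch of that aggregation (my addition, mirroring the groupBy step in the answer above), whose output follows:
df.groupBy("user_id", "group")\
    .agg(
        f.min("start_time").alias("start_time"),
        f.max("end_time").alias("end_time"),
        f.sum("duration").alias("duration")
    )\
    .drop("group")\
    .show()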
+-------+----------+--------+--------+
|user_id|start_time|end_time|duration|
+-------+----------+--------+--------+
| 1| 19:00:00|19:43:00| 43|
| 1| 20:05:00|20:35:00| 30|
+-------+----------+--------+--------+

AnalysisException: "cannot resolve 'df2.*' give input columns Pyspark?

I have created two data frames like below.
df = spark.createDataFrame(
    [(1, 1, 2, 4), (1, 2, 9, 5), (2, 1, 2, 1), (2, 2, 1, 2),
     (4, 1, 5, 2), (4, 2, 6, 3), (5, 1, 3, 3), (5, 2, 8, 4)],
    ("sid", "cid", "Cr", "rank"))
df1 = spark.createDataFrame(
    [[1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3], [4, 1], [4, 2],
     [4, 3], [5, 2], [5, 3], [5, 3], [3, 4]],
    ["sid", "cid"])
Because of some requirement, I created a sqlContext and a temporary view, like below.
df.createOrReplaceTempView("temp")
df2=sqlContext.sql("select sid,cid,cr,rank from temp")
Then I am doing a left join based on some condition:
joined = (df2.alias("df")
.join(
df1.alias("df1"),
(col("df.sid") == col("df1.sid")) & (col("df.cid") == col("df1.cid")),
"left"))
joined.show()
+---+---+---+----+----+----+
|sid|cid| cr|rank| sid| cid|
+---+---+---+----+----+----+
| 5| 1| 3| 3|null|null|
| 1| 1| 2| 4| 1| 1|
| 4| 2| 6| 3| 4| 2|
| 5| 2| 8| 4| 5| 2|
| 2| 2| 1| 2| 2| 2|
| 4| 1| 5| 2| 4| 1|
| 1| 2| 9| 5| 1| 2|
| 2| 1| 2| 1| 2| 1|
+---+---+---+----+----+----+
Then finally I am executing the code below:
final = joined.select(
    col("df2.*"),
    col("df1.sid").isNull().cast("integer").alias("flag")
).orderBy("sid", "cid")
Then I am getting an error like below:
"AnalysisException: "cannot resolve 'df2.*' give input columns 'cr, sid, sid, cid, cid, rank';"
but my expected output should be:
+---+---+---+----+----+
|sid|cid| Cr|rank|flag|
+---+---+---+----+----+
| 1| 1| 2| 4| 0|
| 1| 2| 9| 5| 0|
| 2| 1| 2| 1| 0|
| 2| 2| 1| 2| 0|
| 4| 1| 5| 2| 0|
| 4| 2| 6| 3| 0|
| 5| 1| 3| 3| 1|
| 5| 2| 8| 4| 0|
+---+---+---+----+----+
The mistake is:
joined = (df2.alias("df")
.join(
df1.alias("df1"),
(col("df2.sid") == col("df1.sid")) & (col("df2.cid") == col("df1.cid")),
"left"))
joined.show()
Here we should use df2.alias("df2"), or alternatively keep the alias "df" and use joined.select(col("df.*"), ...).
The complete solution is:
joined = (df2.alias("df2")
.join(
df1.alias("df1"),
(col("df2.sid") == col("df1.sid")) & (col("df2.cid") == col("df1.cid")),
"left"))
joined.show()
final=joined.select(
col("df2.*"),
col("df1.sid").isNull().cast("integer").alias("flag")
).orderBy("sid", "cid")

How to count missing data per row in a data frame

I got this sample dataframe:
from pyspark.sql.types import *
schema = StructType([
    StructField("ClientId", IntegerType(), True),
    StructField("m_ant21", IntegerType(), True),
    StructField("m_ant22", IntegerType(), True),
    StructField("m_ant23", IntegerType(), True),
    StructField("m_ant24", IntegerType(), True)])
df = sqlContext.createDataFrame(
    data=[(0, None, None, None, None),
          (1, 23, 13, 17, 99),
          (2, 0, 0, 0, 1),
          (3, 0, None, 1, 0),
          (4, None, None, None, None)],
    schema=schema)
I have this data frame:
+--------+-------+-------+-------+-------+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|
+--------+-------+-------+-------+-------+
| 0| null| null| null| null|
| 1| 23| 13| 17| 99|
| 2| 0| 0| 0| 1|
| 3| 0| null| 1| 0|
| 4| null| null| null| null|
+--------+-------+-------+-------+-------+
And I need to solve this question:
I'd like to create a new variable which counts how many null values the data has per row. For example:
ClientId 0 should be 4
ClientId 1 should be 0
ClientId 3 should be 1
Note that df is a pyspark.sql.dataframe.DataFrame.
Here is one option:
from pyspark.sql import Row
# add the column schema to the original schema
schema.add(StructField("count_null", IntegerType(), True))
# convert data frame to rdd and append an element to each row to count the number of nulls
df.rdd.map(lambda row: row + Row(sum(x is None for x in row))).toDF(schema).show()
+--------+-------+-------+-------+-------+----------+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|count_null|
+--------+-------+-------+-------+-------+----------+
| 0| null| null| null| null| 4|
| 1| 23| 13| 17| 99| 0|
| 2| 0| 0| 0| 1| 0|
| 3| 0| null| 1| 0| 1|
| 4| null| null| null| null| 4|
+--------+-------+-------+-------+-------+----------+
If you don't want to deal with schema, here is another option:
from pyspark.sql.functions import col, when
df.withColumn("count_null", sum([when(col(x).isNull(),1).otherwise(0) for x in df.columns])).show()
+--------+-------+-------+-------+-------+----------+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|count_null|
+--------+-------+-------+-------+-------+----------+
| 0| null| null| null| null| 4|
| 1| 23| 13| 17| 99| 0|
| 2| 0| 0| 0| 1| 0|
| 3| 0| null| 1| 0| 1|
| 4| null| null| null| null| 4|
+--------+-------+-------+-------+-------+----------+
