How to count missing data per row in a data frame - python

I have this sample DataFrame:
from pyspark.sql.types import *
schema = StructType([
    StructField("ClientId", IntegerType(), True),
    StructField("m_ant21", IntegerType(), True),
    StructField("m_ant22", IntegerType(), True),
    StructField("m_ant23", IntegerType(), True),
    StructField("m_ant24", IntegerType(), True)])
df = sqlContext.createDataFrame(
    data=[(0, None, None, None, None),
          (1, 23, 13, 17, 99),
          (2, 0, 0, 0, 1),
          (3, 0, None, 1, 0),
          (4, None, None, None, None)],
    schema=schema)
which gives me this DataFrame:
+--------+-------+-------+-------+-------+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|
+--------+-------+-------+-------+-------+
| 0| null| null| null| null|
| 1| 23| 13| 17| 99|
| 2| 0| 0| 0| 1|
| 3| 0| null| 1| 0|
| 4| null| null| null| null|
+--------+-------+-------+-------+-------+
I need to create a new column that counts how many null values each row has. For example:
ClientId 0 should be 4
ClientId 1 should be 0
ClientId 3 should be 1
Note that df is a pyspark.sql.dataframe.DataFrame.

Here is one option:
from pyspark.sql import Row
# add the new count_null column to the original schema
schema.add(StructField("count_null", IntegerType(), True))
# convert data frame to rdd and append an element to each row to count the number of nulls
df.rdd.map(lambda row: row + Row(sum(x is None for x in row))).toDF(schema).show()
+--------+-------+-------+-------+-------+----------+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|count_null|
+--------+-------+-------+-------+-------+----------+
| 0| null| null| null| null| 4|
| 1| 23| 13| 17| 99| 0|
| 2| 0| 0| 0| 1| 0|
| 3| 0| null| 1| 0| 1|
| 4| null| null| null| null| 4|
+--------+-------+-------+-------+-------+----------+
If you don't want to deal with the schema, here is another option:
from pyspark.sql.functions import col, when
df.withColumn("count_null", sum([when(col(x).isNull(),1).otherwise(0) for x in df.columns])).show()
+--------+-------+-------+-------+-------+----------+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|count_null|
+--------+-------+-------+-------+-------+----------+
| 0| null| null| null| null| 4|
| 1| 23| 13| 17| 99| 0|
| 2| 0| 0| 0| 1| 0|
| 3| 0| null| 1| 0| 1|
| 4| null| null| null| null| 4|
+--------+-------+-------+-------+-------+----------+
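A note on the second option: since isNull() already yields a boolean, an equivalent (and arguably simpler) sketch is to cast it to an integer and let Python's builtin sum fold the columns into a single Column expression, assuming the same df as above:

from pyspark.sql.functions import col
df.withColumn("count_null", sum(col(c).isNull().cast("int") for c in df.columns)).show()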

Related

Re-label healthy (0) as failure (1) examples using PySpark

I want to re-label healthy examples (0) as failures (1) for the 2 days before the actual failure, for all serial numbers in the failure column. Here is my code:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('spark3.2show').getOrCreate()
print('Spark info :')
spark
url="https://gist.githubusercontent.com/JishanAhmed2019/e464ca4da5c871428ca9ed9264467aa0/raw/da3921c1953fefbc66dddc3ce238dac53142dba8/failure.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
df=spark.read.csv(SparkFiles.get("failure.csv"), header=True,sep='\t')
I want to re-label those 0 values (the ones within 2 days of a failure) as 1. Also, serial number C was mistakenly recorded in the database as healthy even after the actual failure.
I would recast the date column as a Timestamp because this will allow you to take the difference between any two Timestamps, which we will need to do.
Then you can create a new column called failure_dates that contains the date whenever a failure occurs, and is null otherwise.
Next, create a new column called 2_days_to_failure where you partition by serial_number and take the difference between the max value in the failure_dates column and each date inside the partition to get the number of days to failure, flagging rows that are 2 days or fewer from the failure.
Finally, we can create a column called failure_relabeled by combining the information from the columns 2_days_to_failure and the original failure column.
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window = Window.partitionBy("serial_number")

df.withColumn(
    'date', F.to_timestamp(F.col('date'), 'M/d/yyyy')
).withColumn(
    "failure_dates", F.when(F.col('failure') == 1, F.col('date'))
).withColumn(
    "2_days_to_failure", F.datediff(F.max(F.col('failure_dates')).over(window), F.col('date')) <= 2
).withColumn(
    "failure_relabeled", F.when((F.col('2_days_to_failure') | (F.col('failure') == 1)), F.lit(1)).otherwise(F.lit(0))
).orderBy('serial_number', 'date').show()
+-------------------+-------------+-------+-----------+-------------+-------------------+-----------------+-----------------+
| date|serial_number|failure|smart_5_raw|smart_187_raw| failure_dates|2_days_to_failure|failure_relabeled|
+-------------------+-------------+-------+-----------+-------------+-------------------+-----------------+-----------------+
|2014-01-01 00:00:00| A| 0| 0| 60| null| false| 0|
|2014-01-02 00:00:00| A| 0| 0| 180| null| false| 0|
|2014-01-03 00:00:00| A| 0| 0| 140| null| true| 1|
|2014-01-04 00:00:00| A| 0| 0| 280| null| true| 1|
|2014-01-05 00:00:00| A| 1| 0| 400|2014-01-05 00:00:00| true| 1|
|2014-01-01 00:00:00| B| 0| 0| 40| null| null| 0|
|2014-01-02 00:00:00| B| 0| 0| 160| null| null| 0|
|2014-01-03 00:00:00| B| 0| 0| 100| null| null| 0|
|2014-01-04 00:00:00| B| 0| 0| 320| null| null| 0|
|2014-01-05 00:00:00| B| 0| 0| 340| null| null| 0|
|2014-01-06 00:00:00| B| 0| 0| 400| null| null| 0|
|2014-01-01 00:00:00| C| 0| 0| 80| null| true| 1|
|2014-01-02 00:00:00| C| 0| 0| 200| null| true| 1|
|2014-01-03 00:00:00| C| 1| 0| 120|2014-01-03 00:00:00| true| 1|
|2014-01-04 00:00:00| D| 0| 0| 300| null| null| 0|
|2014-01-05 00:00:00| D| 0| 0| 360| null| null| 0|
+-------------------+-------------+-------+-----------+-------------+-------------------+-----------------+-----------------+

PySpark Multiple Lagged Variables for TimeSeries Data with Gaps

I am trying to generate multiple lags of the 'status' variable in the dataframe below. The data is a time series and may contain gaps. I want to generate the lags, but whenever there is a gap, the lagged value should be set to missing/null.
Input DF:
+---+----------+------+
| id| s_date|status|
+---+----------+------+
|123|2007-01-31| 1|
|123|2007-02-28| 1|
|123|2007-03-31| 2|
|123|2007-04-30| 2|
|123|2007-05-31| 1|
|123|2007-06-30| 1|
|123|2007-07-31| 2|
|123|2007-08-31| 2|
|345|2007-08-31| 3|
|123|2007-09-30| 2|
|345|2007-09-30| 2|
|123|2007-10-31| 1|
|345|2007-10-31| 1|
|123|2007-11-30| 1|
|345|2007-11-30| 2|
|123|2008-01-31| 3|
|345|2007-12-31| 2|
|567|2007-12-31| 3|
|123|2008-03-31| 4|
|345|2008-01-31| 2|
+---+----------+------+
from datetime import date

rdd = sc.parallelize([
    [123,date(2007,1,31),1],
    [123,date(2007,2,28),1],
    [123,date(2007,3,31),2],
    [123,date(2007,4,30),2],
    [123,date(2007,5,31),1],
    [123,date(2007,6,30),1],
    [123,date(2007,7,31),2],
    [123,date(2007,8,31),2],
    [345,date(2007,8,31),3],
    [123,date(2007,9,30),2],
    [345,date(2007,9,30),2],
    [123,date(2007,10,31),1],
    [345,date(2007,10,31),1],
    [123,date(2007,11,30),1],
    [345,date(2007,11,30),2],
    [123,date(2008,1,31),3],
    [345,date(2007,12,31),2],
    [567,date(2007,12,31),3],
    [123,date(2008,3,31),4],
    [345,date(2008,1,31),2],
    [567,date(2008,1,31),3],
    [123,date(2008,4,30),3],
    [123,date(2008,5,31),2],
    [123,date(2008,6,30),1]
])
df = rdd.toDF(['id', 's_date', 'status'])
df.show()
# Below is the code that works
import pyspark.sql.functions as fn
from pyspark.sql.window import Window

w = Window().partitionBy('id').orderBy('s_date')
for i in range(1, 25):
    df = df.withColumn(f"lag_{i}", fn.lag(fn.col('status'), i).over(w))\
           .withColumn(f"lag_month_{i}", fn.lag(fn.col('s_date'), i).over(w))\
           .withColumn(f"lag_status_{i}", fn.expr(
               f"case when {i} = 1 and (last_day(add_months(lag_month_{i}, 1)) = last_day(s_date)) "
               f"then lag_{i} else null end"))
In the code above, the column 'lag_status_1' is correctly populated for i = 1. This column has a value of null for Jan and March 2008, which is what I want for every lag. However, when I add the line below to handle the other lags (i.e. lag_2, lag_3, etc.), the code does not work.
.withColumn(f"lag_status_{i}", fn.expr("case when "f"{i} = 1 and (last_day(add_months("f"lag_month_{i}"",1)) = last_day(s_date)) then "f"lag_{i}"" " +
"when "f"{i} > 1 and (last_day(add_months("f"lag_month_{i}"",1)) = last_day("f"lag_month_{i-1})) then "f"lag_{i}"" else null end"))\
Output DF: lag_status_2 and lag_status_3 are null, but I would like to populate them with logic similar to lag_status_1 (each with respect to its own lag):
+---+----------+------+---------+---------+---------+
| id| s_date|status|lag_status_1|lag_status_2|lag_status_3|
+---+----------+------+---------+---------+---------+
|123|2007-01-31| 1| null| null| null|
|123|2007-02-28| 1| 1| null| null|
|123|2007-03-31| 2| 1| null| null|
|123|2007-04-30| 2| 2| null| null|
|123|2007-05-31| 1| 2| null| null|
|123|2007-06-30| 1| 1| null| null|
|123|2007-07-31| 2| 1| null| null|
|123|2007-08-31| 2| 2| null| null|
|123|2007-09-30| 2| 2| null| null|
|123|2007-10-31| 1| 2| null| null|
|123|2007-11-30| 1| 1| null| null|
|123|2008-01-31| 3| null| null| null|
|123|2008-03-31| 4| null| null| null|
|123|2008-04-30| 3| 4| null| null|
|123|2008-05-31| 2| 3| null| null|
|123|2008-06-30| 1| 2| null| null|
|345|2007-08-31| 3| null| null| null|
|345|2007-09-30| 2| 3| null| null|
|345|2007-10-31| 1| 2| null| null|
|345|2007-11-30| 2| 1| null| null|
+---+----------+------+---------+---------+---------+
Can you please guide me on how to resolve this? If there is a better or more efficient solution, please suggest it.
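One possible way to avoid chaining conditions across lags (a sketch, not from the original thread) is to note that lag i is only valid when the lagged row's month is exactly i months before the current row's month, which is a single check per lag:

import pyspark.sql.functions as fn
from pyspark.sql.window import Window

w = Window.partitionBy('id').orderBy('s_date')
for i in range(1, 25):
    lagged_status = fn.lag('status', i).over(w)
    lagged_month = fn.lag('s_date', i).over(w)
    # keep the lag only when the lagged row is exactly i months back, otherwise leave it null
    df = df.withColumn(
        f"lag_status_{i}",
        fn.when(fn.last_day(fn.add_months(lagged_month, i)) == fn.last_day(fn.col('s_date')),
                lagged_status)
    )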

How to join two pyspark dataframes in python on a condition while changing column value on match?

I have two dataframes like this:
df1 = spark.createDataFrame([(1, 11, 1999, 1999, None), (2, 22, 2000, 2000, 44), (3, 33, 2001, 2001,None)], ['id', 't', 'year','new_date','rev_t'])
df2 = spark.createDataFrame([(2, 44, 2022, 2022,None), (2, 55, 2001, 2001, 88)], ['id', 't', 'year','new_date','rev_t'])
df1.show()
df2.show()
+---+---+----+--------+-----+
| id| t|year|new_date|rev_t|
+---+---+----+--------+-----+
| 1| 11|1999| 1999| null|
| 2| 22|2000| 2000| 44|
| 3| 33|2001| 2001| null|
+---+---+----+--------+-----+
+---+---+----+--------+-----+
| id| t|year|new_date|rev_t|
+---+---+----+--------+-----+
| 2| 44|2022| 2022| null|
| 2| 55|2001| 2001| 88|
+---+---+----+--------+-----+
I want to join them in a way that if df2.t == df1.rev_t then update new_date to df2.year in the result dataframe.
So it should look like this:
+---+---+----+--------+-----+
| id| t|year|new_date|rev_t|
+---+---+----+--------+-----+
| 1| 11|1999| 1999| null|
| 2| 22|2000| 2022| 44|
| 2| 44|2022| 2022| null|
| 2| 55|2001| 2001| 88|
| 3| 33|2001| 2001| null|
+---+---+----+--------+-----+
To update a column in df1 from df2, you use a left join plus the coalesce function on the column you want to update, in this case new_date.
From your expected output, it appears you also want to add the rows from df2, so union the join result with df2:
from pyspark.sql import functions as F

result = (df1.join(df2.selectExpr("t as rev_t", "new_date as df2_new_date"), ["rev_t"], "left")
             .withColumn("new_date", F.coalesce("df2_new_date", "new_date"))
             .select(*df1.columns)
             .union(df2)
          )
result.show()
#+---+---+----+--------+-----+
#| id| t|year|new_date|rev_t|
#+---+---+----+--------+-----+
#| 1| 11|1999| 1999| null|
#| 3| 33|2001| 2001| null|
#| 2| 22|2000| 2022| 44|
#| 2| 44|2022| 2022| null|
#| 2| 55|2001| 2001| 88|
#+---+---+----+--------+-----+
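The rows come back in a different order than the expected output, since Spark does not guarantee row order after a join or union. If the order matters, an explicit sort can be appended, for example:

result.orderBy("id", "t").show()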

grouping consecutive rows in PySpark Dataframe

I have the following example Spark DataFrame:
rdd = sc.parallelize([(1,"19:00:00", "19:30:00", 30), (1,"19:30:00", "19:40:00", 10),(1,"19:40:00", "19:43:00", 3), (2,"20:00:00", "20:10:00", 10), (1,"20:05:00", "20:15:00", 10),(1,"20:15:00", "20:35:00", 20)])
df = spark.createDataFrame(rdd, ["user_id", "start_time", "end_time", "duration"])
df.show()
+-------+----------+--------+--------+
|user_id|start_time|end_time|duration|
+-------+----------+--------+--------+
| 1| 19:00:00|19:30:00| 30|
| 1| 19:30:00|19:40:00| 10|
| 1| 19:40:00|19:43:00| 3|
| 2| 20:00:00|20:10:00| 10|
| 1| 20:05:00|20:15:00| 10|
| 1| 20:15:00|20:35:00| 20|
+-------+----------+--------+--------+
I want to group consecutive rows based on the start and end times. For instance, for the same user_id, if a row's start time is the same as the previous row's end time, I want to group them together and sum the duration.
The desired result is:
+-------+----------+--------+--------+
|user_id|start_time|end_time|duration|
+-------+----------+--------+--------+
| 1| 19:00:00|19:43:00| 43|
| 2| 20:00:00|20:10:00| 10|
| 1| 20:05:00|20:35:00| 30|
+-------+----------+--------+--------+
The first three rows of the dataframe were grouped together because they all correspond to user_id 1 and the start times and end times form a continuous timeline.
This was my initial approach:
Use the lag function with a negative offset to get the next row's start time:
from pyspark.sql.functions import *
from pyspark.sql import Window
import sys
# compute next start time
window = Window.partitionBy('user_id').orderBy('start_time')
df = df.withColumn("next_start_time", lag(df.start_time, -1).over(window))
df.show()
+-------+----------+--------+--------+---------------+
|user_id|start_time|end_time|duration|next_start_time|
+-------+----------+--------+--------+---------------+
| 1| 19:00:00|19:30:00| 30| 19:30:00|
| 1| 19:30:00|19:40:00| 10| 19:40:00|
| 1| 19:40:00|19:43:00| 3| 20:05:00|
| 1| 20:05:00|20:15:00| 10| 20:15:00|
| 1| 20:15:00|20:35:00| 20| null|
| 2| 20:00:00|20:10:00| 10| null|
+-------+----------+--------+--------+---------------+
get the difference between the current row's end time and the next row's start time:
time_fmt = "HH:mm:ss"
timeDiff = unix_timestamp('next_start_time', format=time_fmt) - unix_timestamp('end_time', format=time_fmt)
df = df.withColumn("difference", timeDiff)
df.show()
+-------+----------+--------+--------+---------------+----------+
|user_id|start_time|end_time|duration|next_start_time|difference|
+-------+----------+--------+--------+---------------+----------+
| 1| 19:00:00|19:30:00| 30| 19:30:00| 0|
| 1| 19:30:00|19:40:00| 10| 19:40:00| 0|
| 1| 19:40:00|19:43:00| 3| 20:05:00| 1320|
| 1| 20:05:00|20:15:00| 10| 20:15:00| 0|
| 1| 20:15:00|20:35:00| 20| null| null|
| 2| 20:00:00|20:10:00| 10| null| null|
+-------+----------+--------+--------+---------------+----------+
Now my idea was to use the sum function with a window to get the cumulative sum of duration and then do a groupBy. But my approach was flawed for many reasons.
Here's one approach:
Gather together rows into groups where a group is a set of rows with the same user_id that are consecutive (start_time matches previous end_time). Then you can use this group to do your aggregation.
A way to get here is by creating intermediate indicator columns to tell you if the user has changed or the time is not consecutive. Then perform a cumulative sum over the indicator column to create the group.
For example:
import pyspark.sql.functions as f
from pyspark.sql import Window

w1 = Window.orderBy("start_time")
df = df.withColumn(
        "userChange",
        (f.col("user_id") != f.lag("user_id").over(w1)).cast("int")
    )\
    .withColumn(
        "timeChange",
        (f.col("start_time") != f.lag("end_time").over(w1)).cast("int")
    )\
    .fillna(
        0,
        subset=["userChange", "timeChange"]
    )\
    .withColumn(
        "indicator",
        (~((f.col("userChange") == 0) & (f.col("timeChange") == 0))).cast("int")
    )\
    .withColumn(
        "group",
        f.sum(f.col("indicator")).over(w1.rangeBetween(Window.unboundedPreceding, 0))
    )
df.show()
#+-------+----------+--------+--------+----------+----------+---------+-----+
#|user_id|start_time|end_time|duration|userChange|timeChange|indicator|group|
#+-------+----------+--------+--------+----------+----------+---------+-----+
#| 1| 19:00:00|19:30:00| 30| 0| 0| 0| 0|
#| 1| 19:30:00|19:40:00| 10| 0| 0| 0| 0|
#| 1| 19:40:00|19:43:00| 3| 0| 0| 0| 0|
#| 2| 20:00:00|20:10:00| 10| 1| 1| 1| 1|
#| 1| 20:05:00|20:15:00| 10| 1| 1| 1| 2|
#| 1| 20:15:00|20:35:00| 20| 0| 0| 0| 2|
#+-------+----------+--------+--------+----------+----------+---------+-----+
Now that we have the group column, we can aggregate as follows to get the desired result:
df.groupBy("user_id", "group")\
.agg(
f.min("start_time").alias("start_time"),
f.max("end_time").alias("end_time"),
f.sum("duration").alias("duration")
)\
.drop("group")\
.show()
#+-------+----------+--------+--------+
#|user_id|start_time|end_time|duration|
#+-------+----------+--------+--------+
#| 1| 19:00:00|19:43:00| 43|
#| 1| 20:05:00|20:35:00| 30|
#| 2| 20:00:00|20:10:00| 10|
#+-------+----------+--------+--------+
Here is a working solution derived from Pault's answer:
Create the Dataframe:
rdd = sc.parallelize([(1,"19:00:00", "19:30:00", 30), (1,"19:30:00", "19:40:00", 10),(1,"19:40:00", "19:43:00", 3), (2,"20:00:00", "20:10:00", 10), (1,"20:05:00", "20:15:00", 10),(1,"20:15:00", "20:35:00", 20)])
df = spark.createDataFrame(rdd, ["user_id", "start_time", "end_time", "duration"])
df.show()
+-------+----------+--------+--------+
|user_id|start_time|end_time|duration|
+-------+----------+--------+--------+
| 1| 19:00:00|19:30:00| 30|
| 1| 19:30:00|19:40:00| 10|
| 1| 19:40:00|19:43:00| 3|
| 1| 20:05:00|20:15:00| 10|
| 1| 20:15:00|20:35:00| 20|
+-------+----------+--------+--------+
Create an indicator column that indicates whenever the time has changed, and use cumulative sum to give each group a unique id:
import pyspark.sql.functions as f
from pyspark.sql import Window
w1 = Window.partitionBy('user_id').orderBy('start_time')
df = df.withColumn(
        "indicator",
        (f.col("start_time") != f.lag("end_time").over(w1)).cast("int")
    )\
    .fillna(
        0,
        subset=["indicator"]
    )\
    .withColumn(
        "group",
        f.sum(f.col("indicator")).over(w1.rangeBetween(Window.unboundedPreceding, 0))
    )
df.show()
+-------+----------+--------+--------+---------+-----+
|user_id|start_time|end_time|duration|indicator|group|
+-------+----------+--------+--------+---------+-----+
| 1| 19:00:00|19:30:00| 30| 0| 0|
| 1| 19:30:00|19:40:00| 10| 0| 0|
| 1| 19:40:00|19:43:00| 3| 0| 0|
| 1| 20:05:00|20:15:00| 10| 1| 1|
| 1| 20:15:00|20:35:00| 20| 0| 1|
+-------+----------+--------+--------+---------+-----+
Now group by user_id and the group column.
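The original post does not show this step; a minimal sketch of the aggregation, reusing the f alias imported above (the same aggregation as in the first answer), would be:

df.groupBy("user_id", "group")\
    .agg(
        f.min("start_time").alias("start_time"),
        f.max("end_time").alias("end_time"),
        f.sum("duration").alias("duration")
    )\
    .drop("group")\
    .show()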
+-------+----------+--------+--------+
|user_id|start_time|end_time|duration|
+-------+----------+--------+--------+
| 1| 19:00:00|19:43:00| 43|
| 1| 20:05:00|20:35:00| 30|
+-------+----------+--------+--------+

Calculate mean and max value with VectorAssembler

I'm working with a data frame, something like:
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import *
schema = StructType([
    StructField("ClientId", IntegerType(), True),
    StructField("m_ant21", IntegerType(), True),
    StructField("m_ant22", IntegerType(), True),
    StructField("m_ant23", IntegerType(), True),
    StructField("m_ant24", IntegerType(), True)
])
df = sqlContext.createDataFrame(
    data=[(0, 5, 5, 4, 0),
          (1, 23, 13, 17, 99),
          (2, 0, 0, 0, 1),
          (3, 0, 4, 1, 0),
          (4, 2, 1, 30, 10),
          (5, 0, 0, 0, 0)],
    schema=schema)
I need to calculate the mean value and the max value per row, using the columns "m_ant21", "m_ant22", "m_ant23", "m_ant24".
I am trying to use VectorAssembler:
assembler = VectorAssembler(
    inputCols=["m_ant21", "m_ant22", "m_ant23", "m_ant24"],
    outputCol="muestra")
output = assembler.transform(df)
output.show()
Now I create a function to compute the mean, but the input variable is a DenseVector called "dv":
from pyspark.sql.functions import udf

dv = output.collect()[0].asDict()['muestra']

def mi_media(dv):
    return float(sum(dv) / dv.size)

udf_media = udf(mi_media, DoubleType())
output1 = output.withColumn("mediaVec", udf_media(output.muestra))
output1.show()
and the same for the max value:
def mi_Max(dv):
    return float(max(dv))

udf_max = udf(mi_Max, DoubleType())
output2 = output.withColumn("maxVec", udf_max(output.muestra))
output2.show()
The problem is that output1.show() and output2.show() throw an error. It just isn't working, and I don't know what is happening with the code.
What am I doing wrong?
Please help me.
I have tried it my own way; check it out:
from pyspark.sql import functions as F
df.show()
+--------+-------+-------+-------+-------+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|
+--------+-------+-------+-------+-------+
| 0| 5| 5| 4| 0|
| 1| 23| 13| 17| 99|
| 2| 0| 0| 0| 1|
| 3| 0| 4| 1| 0|
| 4| 2| 1| 30| 10|
| 5| 0| 0| 0| 0|
+--------+-------+-------+-------+-------+
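# note: the builtin Python sum below folds the columns with +, producing a single Column expression (it is not an aggregation)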
df1 = df.withColumn('mean',sum(df[c] for c in df.columns[1:])/len(df.columns[1:]))
df1 = df1.withColumn('max',F.greatest(*[F.coalesce(df[c],F.lit(0)) for c in df.columns[1:]]))
df1.show()
+--------+-------+-------+-------+-------+-----+---+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24| mean|max|
+--------+-------+-------+-------+-------+-----+---+
| 0| 5| 5| 4| 0| 3.5| 5|
| 1| 23| 13| 17| 99| 38.0| 99|
| 2| 0| 0| 0| 1| 0.25| 1|
| 3| 0| 4| 1| 0| 1.25| 4|
| 4| 2| 1| 30| 10|10.75| 30|
| 5| 0| 0| 0| 0| 0.0| 0|
+--------+-------+-------+-------+-------+-----+---+
It is possible to do it with a DenseVector, but using the RDD API:
output2 = output.rdd.map(lambda x: (x.ClientId,
                                    x.m_ant21,
                                    x.m_ant22,
                                    x.m_ant23,
                                    x.m_ant24,
                                    x.muestra,
                                    float(max(x.muestra))))
output2 = spark.createDataFrame(output2)
output2.show()
which gives:
+---+---+---+---+---+--------------------+----+
| _1| _2| _3| _4| _5| _6| _7|
+---+---+---+---+---+--------------------+----+
| 0| 5| 5| 4| 0| [5.0,5.0,4.0,0.0]| 5.0|
| 1| 23| 13| 17| 99|[23.0,13.0,17.0,9...|99.0|
| 2| 0| 0| 0| 1| (4,[3],[1.0])| 1.0|
| 3| 0| 4| 1| 0| [0.0,4.0,1.0,0.0]| 4.0|
| 4| 2| 1| 30| 10| [2.0,1.0,30.0,10.0]|30.0|
| 5| 0| 0| 0| 0| (4,[],[])| 0.0|
+---+---+---+---+---+--------------------+----+
Now all that remains is to rename the columns, for example with the withColumnRenamed function. The mean case works the same way.
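For instance, a minimal sketch of the renaming (the positional name _7 is the one shown in the output above; 'max' is just an illustrative target name):

output2 = output2.withColumnRenamed('_7', 'max')

The mean can be computed inside the same map in a similar way, for example with float(x.muestra.toArray().mean()).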
It is also possible to do this when the assembled column contains SparseVectors, but in that case it is necessary to access the vector's values attribute:
output2 = output.rdd.map(lambda x: (x.ClientId,
                                    x.m_ant21,
                                    x.m_ant22,
                                    x.m_ant23,
                                    x.m_ant24,
                                    x.muestra,
                                    float(max(x.muestra.values))))
output2 = spark.createDataFrame(output2)
This way works better if your df has a lot of columns and it is not possible to calculate the max value before the VectorAssembler stage.
I found a solution to this problem:
import pyspark.sql.functions as f
import pyspark.sql.types as t
min_of_vector = f.udf(lambda vec: vec.toArray().min(), t.DoubleType())
max_of_vector = f.udf(lambda vec: vec.toArray().max(), t.DoubleType())
mean_of_vector = f.udf(lambda vec: vec.toArray().mean(), t.DoubleType())
final = output.withColumn('min', min_of_vector('muestra')) \
              .withColumn('max', max_of_vector('muestra')) \
              .withColumn('mean', mean_of_vector('muestra'))
