pyspark - join with OR condition - python

I would like to join two pyspark dataframes if at least one of two conditions is satisfied.
Toy data:
df1 = spark.createDataFrame([
    (10, 1, 666),
    (20, 2, 777),
    (30, 1, 888),
    (40, 3, 999),
    (50, 1, 111),
    (60, 2, 222),
    (10, 4, 333),
    (50, None, 444),
    (10, 0, 555),
    (50, 0, 666)
    ],
    ['var1', 'var2', 'other_var']
)

df2 = spark.createDataFrame([
    (10, 1),
    (20, 2),
    (30, None),
    (30, 0)
    ],
    ['var1_', 'var2_']
)
I would like to keep all rows of df1 where var1 appears among the distinct values of df2.var1_ OR var2 appears among the distinct values of df2.var2_ (except when that value is 0).
So, the expected output would be
+----+----+---------+-----+-----+
|var1|var2|other_var|var1_|var2_|
+----+----+---------+-----+-----+
| 10| 1| 666| 10| 1| # join on both var1 and var2
| 20| 2| 777| 20| 2| # join on both var1 and var2
| 30| 1| 888| 10| 1| # join on both var1 and var2
| 50| 1| 111| 10| 1| # join on var2
| 60| 2| 222| 20| 2| # join on var2
| 10| 4| 333| 10| 1| # join on var1
| 10| 0| 555| 10| 1| # join on var1
+----+----+---------+-----+-----+
Among the other attempts, I tried
cond = [(df1.var1 == (df2.select('var1_').distinct()).var1_) | (df1.var2 == (df2.filter(F.col('var2_') != 0).select('var2_').distinct()).var2_)]
df1\
    .join(df2, how='inner', on=cond)\
    .show()
+----+----+---------+-----+-----+
|var1|var2|other_var|var1_|var2_|
+----+----+---------+-----+-----+
| 10| 1| 666| 10| 1|
| 20| 2| 777| 20| 2|
| 30| 1| 888| 10| 1|
| 50| 1| 111| 10| 1|
| 30| 1| 888| 30| null|
| 30| 1| 888| 30| 0|
| 60| 2| 222| 20| 2|
| 10| 4| 333| 10| 1|
| 10| 0| 555| 10| 1|
| 10| 0| 555| 30| 0|
| 50| 0| 666| 30| 0|
+----+----+---------+-----+-----+
but I obtained more rows than expected, and the rows where var2 == 0 were also preserved.
What am I doing wrong?
Note: I'm not using the .isin method because my actual df2 has around 20k rows and I've read here that this method could perform poorly with a large number of IDs.

Try the condition below:
cond = (df2.var2_ != 0) & ((df1.var1 == df2.var1_) | (df1.var2 == df2.var2_))

df1\
    .join(df2, how='inner', on=cond)\
    .show()
+----+----+---------+-----+-----+
|var1|var2|other_var|var1_|var2_|
+----+----+---------+-----+-----+
| 10| 1| 666| 10| 1|
| 30| 1| 888| 10| 1|
| 20| 2| 777| 20| 2|
| 50| 1| 111| 10| 1|
| 60| 2| 222| 20| 2|
| 10| 4| 333| 10| 1|
| 10| 0| 555| 10| 1|
+----+----+---------+-----+-----+
The join condition should only reference columns from the two dataframes being joined. If you want to exclude the rows where var2_ = 0, put that check into the join condition rather than applying it as a filter on a separate dataframe.
There is also no need for distinct: it does not affect the equality comparison and only adds an unnecessary step.
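One more thing to keep in mind: with an OR condition a single df1 row can match several df2 rows, which multiplies rows in the output. If you only care about which df1 rows survive, a minimal sketch (assuming the df1 columns are all you need downstream) is to drop duplicates on the df1 columns after the join:
# sketch: keep each surviving df1 row only once, however many df2 rows it matched
df1.join(df2, how='inner', on=cond) \
    .dropDuplicates(['var1', 'var2', 'other_var']) \
    .show()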

Related

GroupBy a dataframe records and display all columns with PySpark

I have the following dataframe
dataframe - columnA, columnB, columnC, columnD, columnE
I want to groupBy columnC and then take the max value of columnE.
dataframe.select('*').groupBy('columnC').max('columnE')
expected output
dataframe - columnA, columnB, columnC, columnD, columnE
Real output
dataframe - columnC, columnE
Why aren't all columns of the dataframe displayed as expected?
For Spark version >= 3.0.0 you can use max_by to select the additional columns.
import random
from pyspark.sql import functions as F

# create some test data
df = spark.createDataFrame(
    [[random.randint(1, 3)] + random.sample(range(0, 30), 4) for _ in range(10)],
    schema=["columnC", "columnB", "columnA", "columnD", "columnE"]) \
    .select("columnA", "columnB", "columnC", "columnD", "columnE")

df.groupBy("columnC") \
    .agg(F.max("columnE"),
         F.expr("max_by(columnA, columnE) as columnA"),
         F.expr("max_by(columnB, columnE) as columnB"),
         F.expr("max_by(columnD, columnE) as columnD")) \
    .show()
For the testdata
+-------+-------+-------+-------+-------+
|columnA|columnB|columnC|columnD|columnE|
+-------+-------+-------+-------+-------+
| 25| 20| 2| 0| 2|
| 14| 2| 2| 24| 6|
| 26| 13| 3| 2| 1|
| 5| 24| 3| 19| 17|
| 22| 5| 3| 14| 21|
| 24| 5| 1| 8| 4|
| 7| 22| 3| 16| 20|
| 6| 17| 1| 5| 7|
| 24| 22| 2| 8| 3|
| 4| 14| 1| 16| 11|
+-------+-------+-------+-------+-------+
the result is
+-------+------------+-------+-------+-------+
|columnC|max(columnE)|columnA|columnB|columnD|
+-------+------------+-------+-------+-------+
| 1| 11| 4| 14| 16|
| 3| 21| 22| 5| 14|
| 2| 6| 14| 2| 24|
+-------+------------+-------+-------+-------+
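On Spark versions before 3.0.0, where max_by is not available, a common workaround (a sketch, not part of the original answer) is to take the max of a struct whose first field is columnE; the struct comparison is driven by its first field, so the remaining fields come along from the winning row:
# sketch for Spark < 3.0: max over a struct orders by its first field (columnE)
df.groupBy("columnC") \
    .agg(F.max(F.struct("columnE", "columnA", "columnB", "columnD")).alias("m")) \
    .select("columnC", "m.columnE", "m.columnA", "m.columnB", "m.columnD") \
    .show()
Ties on columnE are broken by the remaining struct fields, which may differ from how max_by resolves them.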
What you want to achieve can be done with a window function, not groupBy:
partition your data by columnC,
order the rows within each partition by columnE descending and rank them,
filter for the rows with rank 1.
from pyspark.sql.window import Window
from pyspark.sql.functions import rank
from pyspark.sql.functions import col

windowSpec = Window.partitionBy("columnC").orderBy(col("columnE").desc())
expectedDf = df.withColumn("rank", rank().over(windowSpec)) \
    .filter(col("rank") == 1)
You might want to restructure your question.

Pyspark: Add new column from another pyspark dataframe

I have two dataframes as follows.
I want to add a new column to dataframe df_a, taken from column val_1 of dataframe df_b, based on the condition df_a.col_p == df_b.id.
df_a = sqlContext.createDataFrame(
    [(1412, 31, 1), (2422, 21, 1), (4223, 22, 2), (2244, 43, 1),
     (1235, 54, 1), (4126, 12, 5), (2342, 44, 1)],
    ["idx", "col_n", "col_p"])
df_a.show()
+----+-----+-----+
| idx|col_n|col_p|
+----+-----+-----+
|1412| 31| 1|
|2422| 21| 1|
|4223| 22| 2|
|2244| 43| 1|
|1235| 54| 1|
|4126| 12| 5|
|2342| 44| 1|
+----+-----+-----+
df_b = sqlContext.createDataFrame(
    [(1, 1, 1), (2, 1, 1), (3, 1, 2), (4, 1, 1), (5, 2, 1), (6, 2, 2)],
    ["id", "val_1", "val_2"])
df_b.show()
+---+-----+-----+
| id|val_1|val_2|
+---+-----+-----+
| 1| 1| 1|
| 2| 1| 1|
| 3| 1| 2|
| 4| 1| 1|
| 5| 2| 1|
| 6| 2| 2|
+---+-----+-----+
Expected output
+----+-----+-----+-----+
| idx|col_n|col_p|val_1|
+----+-----+-----+-----+
|1412| 31| 1| 1|
|2422| 21| 1| 1|
|4223| 22| 2| 1|
|2244| 43| 1| 1|
|1235| 54| 1| 1|
|4126| 12| 5| 2|
|2342| 44| 1| 1|
+----+-----+-----+-----+
My code
cond = (df_a.col_p == df_b.id)
df_a_new = df_a.join(df_b, cond, how='full').withColumn('val_new', F.when(cond, df_b.val_1))
df_a_new = df_a_new.drop(*['id', 'val_1', 'val_2'])
df_a_new = df_a_new.filter(df_a_new.idx.isNotNull())
df_a_new.show()
How can I get the proper output, matching the expected result, with the correct row order?
You can assign an increasing index to df_a and sort by that index after joining. Also I'd suggest doing a left join rather than a full join.
import pyspark.sql.functions as F

df_a_new = df_a.withColumn('index', F.monotonically_increasing_id()) \
    .join(df_b, df_a.col_p == df_b.id, 'left') \
    .orderBy('index') \
    .select('idx', 'col_n', 'col_p', 'val_1')
df_a_new.show()
+----+-----+-----+-----+
| idx|col_n|col_p|val_1|
+----+-----+-----+-----+
|1412| 31| 1| 1|
|2422| 21| 1| 1|
|4223| 22| 2| 1|
|2244| 43| 1| 1|
|1235| 54| 1| 1|
|4126| 12| 5| 2|
|2342| 44| 1| 1|
+----+-----+-----+-----+
You need to create your own indexes (with monotonically_increasing_id) and sort on those indexes again after joining.
But there is no way to join while preserving order in Spark, because the rows are repartitioned for the join and lose their order before being combined.
Refer to: Can Dataframe joins in Spark preserve order?
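If you need a dense, strictly increasing 0..n-1 index instead of the sparse ids produced by monotonically_increasing_id, one option (a sketch, assuming the current row order of df_a is the order you want to preserve) is zipWithIndex on the underlying RDD:
# sketch: attach a dense index via the RDD, join, then restore the original order
indexed = df_a.rdd.zipWithIndex() \
    .map(lambda pair: pair[0] + (pair[1],)) \
    .toDF(df_a.columns + ['index'])
indexed.join(df_b, indexed.col_p == df_b.id, 'left') \
    .orderBy('index') \
    .select('idx', 'col_n', 'col_p', 'val_1') \
    .show()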

grouping consecutive rows in PySpark Dataframe

I have the following example Spark DataFrame:
rdd = sc.parallelize([
    (1, "19:00:00", "19:30:00", 30), (1, "19:30:00", "19:40:00", 10),
    (1, "19:40:00", "19:43:00", 3), (2, "20:00:00", "20:10:00", 10),
    (1, "20:05:00", "20:15:00", 10), (1, "20:15:00", "20:35:00", 20)])
df = spark.createDataFrame(rdd, ["user_id", "start_time", "end_time", "duration"])
df.show()
+-------+----------+--------+--------+
|user_id|start_time|end_time|duration|
+-------+----------+--------+--------+
| 1| 19:00:00|19:30:00| 30|
| 1| 19:30:00|19:40:00| 10|
| 1| 19:40:00|19:43:00| 3|
| 2| 20:00:00|20:10:00| 10|
| 1| 20:05:00|20:15:00| 10|
| 1| 20:15:00|20:35:00| 20|
+-------+----------+--------+--------+
I want to group consecutive rows based on the start and end times. For instance, for the same user_id, if a row's start time is the same as the previous row's end time, I want to group them together and sum the duration.
The desired result is:
+-------+----------+--------+--------+
|user_id|start_time|end_time|duration|
+-------+----------+--------+--------+
| 1| 19:00:00|19:43:00| 43|
| 2| 20:00:00|20:10:00| 10|
| 1| 20:05:00|20:35:00| 30|
+-------+----------+--------+--------+
The first three rows of the dataframe were grouped together because they all correspond to user_id 1 and the start times and end times form a continuous timeline.
This was my initial approach:
Use the lag function to get the next start time:
from pyspark.sql.functions import *
from pyspark.sql import Window
import sys
# compute next start time
window = Window.partitionBy('user_id').orderBy('start_time')
df = df.withColumn("next_start_time", lag(df.start_time, -1).over(window))
df.show()
+-------+----------+--------+--------+---------------+
|user_id|start_time|end_time|duration|next_start_time|
+-------+----------+--------+--------+---------------+
| 1| 19:00:00|19:30:00| 30| 19:30:00|
| 1| 19:30:00|19:40:00| 10| 19:40:00|
| 1| 19:40:00|19:43:00| 3| 20:05:00|
| 1| 20:05:00|20:15:00| 10| 20:15:00|
| 1| 20:15:00|20:35:00| 20| null|
| 2| 20:00:00|20:10:00| 10| null|
+-------+----------+--------+--------+---------------+
Get the difference between the current row's end time and the next row's start time:
time_fmt = "HH:mm:ss"
timeDiff = unix_timestamp('next_start_time', format=time_fmt) - unix_timestamp('end_time', format=time_fmt)
df = df.withColumn("difference", timeDiff)
df.show()
+-------+----------+--------+--------+---------------+----------+
|user_id|start_time|end_time|duration|next_start_time|difference|
+-------+----------+--------+--------+---------------+----------+
| 1| 19:00:00|19:30:00| 30| 19:30:00| 0|
| 1| 19:30:00|19:40:00| 10| 19:40:00| 0|
| 1| 19:40:00|19:43:00| 3| 20:05:00| 1320|
| 1| 20:05:00|20:15:00| 10| 20:15:00| 0|
| 1| 20:15:00|20:35:00| 20| null| null|
| 2| 20:00:00|20:10:00| 10| null| null|
+-------+----------+--------+--------+---------------+----------+
Now my idea was to use the sum function with a window to get the cumulative sum of duration and then do a groupBy. But my approach was flawed for many reasons.
Here's one approach:
Gather together rows into groups where a group is a set of rows with the same user_id that are consecutive (start_time matches previous end_time). Then you can use this group to do your aggregation.
A way to get here is by creating intermediate indicator columns to tell you if the user has changed or the time is not consecutive. Then perform a cumulative sum over the indicator column to create the group.
For example:
import pyspark.sql.functions as f
from pyspark.sql import Window

w1 = Window.orderBy("start_time")

df = df.withColumn(
        "userChange",
        (f.col("user_id") != f.lag("user_id").over(w1)).cast("int")
    )\
    .withColumn(
        "timeChange",
        (f.col("start_time") != f.lag("end_time").over(w1)).cast("int")
    )\
    .fillna(
        0,
        subset=["userChange", "timeChange"]
    )\
    .withColumn(
        "indicator",
        (~((f.col("userChange") == 0) & (f.col("timeChange") == 0))).cast("int")
    )\
    .withColumn(
        "group",
        f.sum(f.col("indicator")).over(w1.rangeBetween(Window.unboundedPreceding, 0))
    )
df.show()
#+-------+----------+--------+--------+----------+----------+---------+-----+
#|user_id|start_time|end_time|duration|userChange|timeChange|indicator|group|
#+-------+----------+--------+--------+----------+----------+---------+-----+
#| 1| 19:00:00|19:30:00| 30| 0| 0| 0| 0|
#| 1| 19:30:00|19:40:00| 10| 0| 0| 0| 0|
#| 1| 19:40:00|19:43:00| 3| 0| 0| 0| 0|
#| 2| 20:00:00|20:10:00| 10| 1| 1| 1| 1|
#| 1| 20:05:00|20:15:00| 10| 1| 1| 1| 2|
#| 1| 20:15:00|20:35:00| 20| 0| 0| 0| 2|
#+-------+----------+--------+--------+----------+----------+---------+-----+
Now that we have the group column, we can aggregate as follows to get the desired result:
df.groupBy("user_id", "group")\
.agg(
f.min("start_time").alias("start_time"),
f.max("end_time").alias("end_time"),
f.sum("duration").alias("duration")
)\
.drop("group")\
.show()
#+-------+----------+--------+--------+
#|user_id|start_time|end_time|duration|
#+-------+----------+--------+--------+
#| 1| 19:00:00|19:43:00| 43|
#| 1| 20:05:00|20:35:00| 30|
#| 2| 20:00:00|20:10:00| 10|
#+-------+----------+--------+--------+
Here is a working solution derived from Pault's answer:
Create the Dataframe:
rdd = sc.parallelize([
    (1, "19:00:00", "19:30:00", 30), (1, "19:30:00", "19:40:00", 10),
    (1, "19:40:00", "19:43:00", 3), (2, "20:00:00", "20:10:00", 10),
    (1, "20:05:00", "20:15:00", 10), (1, "20:15:00", "20:35:00", 20)])
df = spark.createDataFrame(rdd, ["user_id", "start_time", "end_time", "duration"])
df.show()
+-------+----------+--------+--------+
|user_id|start_time|end_time|duration|
+-------+----------+--------+--------+
| 1| 19:00:00|19:30:00| 30|
| 1| 19:30:00|19:40:00| 10|
| 1| 19:40:00|19:43:00| 3|
| 1| 20:05:00|20:15:00| 10|
| 1| 20:15:00|20:35:00| 20|
+-------+----------+--------+--------+
Create an indicator column that flags whenever the time is not consecutive, and use a cumulative sum to give each group a unique id:
import pyspark.sql.functions as f
from pyspark.sql import Window

w1 = Window.partitionBy('user_id').orderBy('start_time')

df = df.withColumn(
        "indicator",
        (f.col("start_time") != f.lag("end_time").over(w1)).cast("int")
    )\
    .fillna(
        0,
        subset=["indicator"]
    )\
    .withColumn(
        "group",
        f.sum(f.col("indicator")).over(w1.rangeBetween(Window.unboundedPreceding, 0))
    )
df.show()
+-------+----------+--------+--------+---------+-----+
|user_id|start_time|end_time|duration|indicator|group|
+-------+----------+--------+--------+---------+-----+
| 1| 19:00:00|19:30:00| 30| 0| 0|
| 1| 19:30:00|19:40:00| 10| 0| 0|
| 1| 19:40:00|19:43:00| 3| 0| 0|
| 1| 20:05:00|20:15:00| 10| 1| 1|
| 1| 20:15:00|20:35:00| 20| 0| 1|
+-------+----------+--------+--------+---------+-----+
Now GroupBy on user id and the group variable.
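Concretely, using the same aggregation as in the previous answer:
df.groupBy("user_id", "group")\
    .agg(
        f.min("start_time").alias("start_time"),
        f.max("end_time").alias("end_time"),
        f.sum("duration").alias("duration")
    )\
    .drop("group")\
    .show()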
+-------+----------+--------+--------+
|user_id|start_time|end_time|duration|
+-------+----------+--------+--------+
| 1| 19:00:00|19:43:00| 43|
| 1| 20:05:00|20:35:00| 30|
+-------+----------+--------+--------+

AnalysisException: "cannot resolve 'df2.*' give input columns Pyspark?

I have created two data frames like below.
df = spark.createDataFrame(
    [(1, 1, 2, 4), (1, 2, 9, 5), (2, 1, 2, 1), (2, 2, 1, 2),
     (4, 1, 5, 2), (4, 2, 6, 3), (5, 1, 3, 3), (5, 2, 8, 4)],
    ("sid", "cid", "Cr", "rank"))
df1 = spark.createDataFrame(
    [[1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3], [4, 1], [4, 2],
     [4, 3], [5, 2], [5, 3], [5, 3], [3, 4]],
    ["sid", "cid"])
Because of some requirement I created a sqlContext and a temporary view, like below.
df.createOrReplaceTempView("temp")
df2=sqlContext.sql("select sid,cid,cr,rank from temp")
Then I am doing a left join based on some condition.
joined = (df2.alias("df")
.join(
df1.alias("df1"),
(col("df.sid") == col("df1.sid")) & (col("df.cid") == col("df1.cid")),
"left"))
joined.show()
+---+---+---+----+----+----+
|sid|cid| cr|rank| sid| cid|
+---+---+---+----+----+----+
| 5| 1| 3| 3|null|null|
| 1| 1| 2| 4| 1| 1|
| 4| 2| 6| 3| 4| 2|
| 5| 2| 8| 4| 5| 2|
| 2| 2| 1| 2| 2| 2|
| 4| 1| 5| 2| 4| 1|
| 1| 2| 9| 5| 1| 2|
| 2| 1| 2| 1| 2| 1|
+---+---+---+----+----+----+
Then finally I am executing the code below:
final = joined.select(
    col("df2.*"),
    col("df1.sid").isNull().cast("integer").alias("flag")
).orderBy("sid", "cid")
Then I am getting an error like below:
AnalysisException: "cannot resolve 'df2.*' given input columns 'cr, sid, sid, cid, cid, rank';"
But my expected output should be:
+---+---+---+----+----+
|sid|cid| Cr|rank|flag|
+---+---+---+----+----+
| 1| 1| 2| 4| 0|
| 1| 2| 9| 5| 0|
| 2| 1| 2| 1| 0|
| 2| 2| 1| 2| 0|
| 4| 1| 5| 2| 0|
| 4| 2| 6| 3| 0|
| 5| 1| 3| 3| 1|
| 5| 2| 8| 4| 0|
+---+---+---+----+----+
The mistake is the alias:
joined = (df2.alias("df")
    .join(
        df1.alias("df1"),
        (col("df.sid") == col("df1.sid")) & (col("df.cid") == col("df1.cid")),
        "left"))
joined.show()
Here df2 is aliased as "df", but the final select refers to col("df2.*"). Either alias it as df2.alias("df2"), or keep the alias and use joined.select(col("df.*"), ...).
The complete solution is:
joined = (df2.alias("df2")
    .join(
        df1.alias("df1"),
        (col("df2.sid") == col("df1.sid")) & (col("df2.cid") == col("df1.cid")),
        "left"))
joined.show()

final = joined.select(
    col("df2.*"),
    col("df1.sid").isNull().cast("integer").alias("flag")
).orderBy("sid", "cid")
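Calling show on the result should then produce the table with the flag column given as the expected output above:
final.show()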

Calculate mean and max value with VectorAssembler

I'm working with a data frame, something like:
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import *

schema = StructType([
    StructField("ClientId", IntegerType(), True),
    StructField("m_ant21", IntegerType(), True),
    StructField("m_ant22", IntegerType(), True),
    StructField("m_ant23", IntegerType(), True),
    StructField("m_ant24", IntegerType(), True)
])

df = sqlContext.createDataFrame(
    data=[(0, 5, 5, 4, 0),
          (1, 23, 13, 17, 99),
          (2, 0, 0, 0, 1),
          (3, 0, 4, 1, 0),
          (4, 2, 1, 30, 10),
          (5, 0, 0, 0, 0)],
    schema=schema)
I need to calculate the mean value and the max value per row, using the columns "m_ant21", "m_ant22", "m_ant23", "m_ant24".
I'm trying to use VectorAssembler:
assembler = VectorAssembler(
    inputCols=["m_ant21", "m_ant22", "m_ant23", "m_ant24"],
    outputCol="muestra")
output = assembler.transform(df)
output.show()
Now I create a function to compute the mean; the input variable is a "DenseVector" called "dv":
dv = output.collect()[0].asDict()['muestra']

def mi_media(dv):
    return float(sum(dv) / dv.size)

udf_media = udf(mi_media, DoubleType())
output1 = output.withColumn("mediaVec", udf_media(output.muestra))
output1.show()
and the same for the max value:

def mi_Max(dv):
    return float(max(dv))

udf_max = udf(mi_Max, DoubleType())
output2 = output.withColumn("maxVec", udf_max(output.muestra))
output2.show()
The problem is that output1.show() and output2.show() raise an error. It's just not working and I don't know what is happening in the code.
What am i doing wrong?
Please help me.
I have tried it my way, check it:
from pyspark.sql import functions as F
df.show()
+--------+-------+-------+-------+-------+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|
+--------+-------+-------+-------+-------+
| 0| 5| 5| 4| 0|
| 1| 23| 13| 17| 99|
| 2| 0| 0| 0| 1|
| 3| 0| 4| 1| 0|
| 4| 2| 1| 30| 10|
| 5| 0| 0| 0| 0|
+--------+-------+-------+-------+-------+
df1 = df.withColumn('mean',sum(df[c] for c in df.columns[1:])/len(df.columns[1:]))
df1 = df1.withColumn('max',F.greatest(*[F.coalesce(df[c],F.lit(0)) for c in df.columns[1:]]))
df1.show()
+--------+-------+-------+-------+-------+-----+---+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24| mean|max|
+--------+-------+-------+-------+-------+-----+---+
| 0| 5| 5| 4| 0| 3.5| 5|
| 1| 23| 13| 17| 99| 38.0| 99|
| 2| 0| 0| 0| 1| 0.25| 1|
| 3| 0| 4| 1| 0| 1.25| 4|
| 4| 2| 1| 30| 10|10.75| 30|
| 5| 0| 0| 0| 0| 0.0| 0|
+--------+-------+-------+-------+-------+-----+---+
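Note that the sum-based mean above becomes null for a row as soon as any of the columns is null. If that matters, a hedged variant (assuming that treating nulls as zero is acceptable for your use case) is:
cols = df.columns[1:]
df1 = df.withColumn('mean', sum(F.coalesce(df[c], F.lit(0)) for c in cols) / len(cols))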
It is possible to do it with the DenseVector, but in an RDD way:
output2 = output.rdd.map(lambda x: (x.ClientId,
                                    x.m_ant21,
                                    x.m_ant22,
                                    x.m_ant23,
                                    x.m_ant24,
                                    x.muestra,
                                    float(max(x.muestra))))
output2 = spark.createDataFrame(output2)
output2.show()
which gives:
+---+---+---+---+---+--------------------+----+
| _1| _2| _3| _4| _5| _6| _7|
+---+---+---+---+---+--------------------+----+
| 0| 5| 5| 4| 0| [5.0,5.0,4.0,0.0]| 5.0|
| 1| 23| 13| 17| 99|[23.0,13.0,17.0,9...|99.0|
| 2| 0| 0| 0| 1| (4,[3],[1.0])| 1.0|
| 3| 0| 4| 1| 0| [0.0,4.0,1.0,0.0]| 4.0|
| 4| 2| 1| 30| 10| [2.0,1.0,30.0,10.0]|30.0|
| 5| 0| 0| 0| 0| (4,[],[])| 0.0|
+---+---+---+---+---+--------------------+----+
Now all that remains is to rename the columns, for example with the withColumnRenamed function. The mean case is handled the same way.
It's also possible to do that with a SparseVector but, in this case, it is necessary to access the vector's values attribute:
output2 = output.rdd.map(lambda x: (x.ClientId,
                                    x.m_ant21,
                                    x.m_ant22,
                                    x.m_ant23,
                                    x.m_ant24,
                                    x.muestra,
                                    float(max(x.muestra.values))))
output2 = spark.createDataFrame(output2)
This way works better if your df has a lot of columns and it is not possible to calculate the max value before the VectorAssembler stage.
I found a solution to this problem:
import pyspark.sql.functions as f
import pyspark.sql.types as t
min_of_vector = f.udf(lambda vec: vec.toArray().min(), t.DoubleType())
max_of_vector = f.udf(lambda vec: vec.toArray().max(), t.DoubleType())
mean_of_vector = f.udf(lambda vec: vec.toArray().mean(), t.DoubleType())
final = output.withColumn('min', min_of_vector('muestra')) \
    .withColumn('max', max_of_vector('muestra')) \
    .withColumn('mean', mean_of_vector('muestra'))
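On Spark 3.0+ the Python UDFs can be avoided altogether by converting the vector column to an array first (a sketch, assuming pyspark.ml.functions.vector_to_array is available in your version):
from pyspark.ml.functions import vector_to_array

# sketch: built-in array functions instead of UDFs
final = output.withColumn('arr', vector_to_array('muestra')) \
    .withColumn('min', f.array_min('arr')) \
    .withColumn('max', f.array_max('arr')) \
    .withColumn('mean', f.expr('aggregate(arr, 0D, (acc, x) -> acc + x) / size(arr)')) \
    .drop('arr')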
