With a dataframe like this,
rdd_2 = sc.parallelize([(0, 10, 223, "201601"), (0, 10, 83, "2016032"),
                        (1, 20, None, "201602"), (1, 20, 3003, "201601"), (1, 20, None, "201603"),
                        (2, 40, 2321, "201601"), (2, 30, 10, "201602"), (2, 61, None, "201601")])
df_data = sqlContext.createDataFrame(rdd_2, ["id", "type", "cost", "date"])
df_data.show()
+---+----+----+-------+
| id|type|cost|   date|
+---+----+----+-------+
|  0|  10| 223| 201601|
|  0|  10|  83|2016032|
|  1|  20|null| 201602|
|  1|  20|3003| 201601|
|  1|  20|null| 201603|
|  2|  40|2321| 201601|
|  2|  30|  10| 201602|
|  2|  61|null| 201601|
+---+----+----+-------+
I need to fill the null values with the average of the existing values, with the expected result being
+---+----+----+-------+
| id|type|cost|   date|
+---+----+----+-------+
|  0|  10| 223| 201601|
|  0|  10|  83|2016032|
|  1|  20|1128| 201602|
|  1|  20|3003| 201601|
|  1|  20|1128| 201603|
|  2|  40|2321| 201601|
|  2|  30|  10| 201602|
|  2|  61|1128| 201601|
+---+----+----+-------+
where 1128 is the average of the existing cost values ((223 + 83 + 3003 + 2321 + 10) / 5 = 1128). I need to do that for several columns.
My current approach is to use na.fill:
fill_values = {
    column: df_data.agg({column: "mean"}).flatMap(list).collect()[0]
    for column in df_data.columns if column not in ['date', 'id']
}
df_data = df_data.na.fill(fill_values)
+---+----+----+-------+
| id|type|cost|   date|
+---+----+----+-------+
|  0|  10| 223| 201601|
|  0|  10|  83|2016032|
|  1|  20|1128| 201602|
|  1|  20|3003| 201601|
|  1|  20|1128| 201603|
|  2|  40|2321| 201601|
|  2|  30|  10| 201602|
|  2|  61|1128| 201601|
+---+----+----+-------+
But this is very cumbersome. Any ideas?
Well, one way or another you have to:
compute statistics
fill the blanks
That pretty much limits what you can really improve here. Still, you can (see the small sketch after this list):
replace flatMap(list).collect()[0] with first()[0] or structure unpacking
compute all the statistics with a single action
use the built-in Row.asDict() method to extract a dictionary
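For example, the extraction step alone could look like this (a small sketch against df_data from above):
row = df_data.agg({"cost": "mean"}).first()  # a single Row, no flatMap/collect needed
mean_cost = row[0]                           # first()[0]
(mean_cost,) = row                           # ... or structure unpacking
stats = row.asDict()                         # built-in Row method -> {'avg(cost)': 1128.0}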
The final result could look like this:
from pyspark.sql.functions import avg

def fill_with_mean(df, exclude=set()):
    stats = df.agg(*(
        avg(c).alias(c) for c in df.columns if c not in exclude
    ))
    return df.na.fill(stats.first().asDict())

fill_with_mean(df_data, ["id", "date"])
In Spark 2.2 or later you can also use Imputer. See Replace missing values with mean - Spark Dataframe.
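For completeness, a rough sketch of the Imputer route (assumptions here: the cost column is cast to double first, since Imputer only accepts float/double input columns, and the result is written to a separate cost_imputed column):
from pyspark.ml.feature import Imputer

# Imputer treats nulls as missing, but it requires float/double input columns,
# so cast "cost" first (illustrative assumption, not from the original post).
df_double = df_data.withColumn("cost", df_data["cost"].cast("double"))

imputer = Imputer(strategy="mean", inputCols=["cost"], outputCols=["cost_imputed"])
imputer.fit(df_double).transform(df_double).show()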
I have two data frames, df1:
+---+---------+
| id| col_name|
+---+---------+
|  0|        a|
|  1|        b|
|  2|     null|
|  3|     null|
|  4|        e|
|  5|        f|
|  6|        g|
|  7|        h|
|  8|     null|
|  9|        j|
+---+---------+
and df2:
+---+---------+
| id| col_name|
+---+---------+
|  0|     null|
|  1|     null|
|  2|        c|
|  3|        d|
|  4|     null|
|  5|     null|
|  6|     null|
|  7|     null|
|  8|        i|
|  9|     null|
+---+---------+
and I want to merge them so I get
+---+---------+
| id| col_name|
+---+---------+
|  0|        a|
|  1|        b|
|  2|        c|
|  3|        d|
|  4|        e|
|  5|        f|
|  6|        g|
|  7|        h|
|  8|        i|
|  9|        j|
+---+---------+
I know for sure that they aren't overlapping (i.e. when the df2 entry is null the df1 entry isn't, and vice versa).
I know that if I use join I won't get them in the same column and will instead get two "col_name" columns. I just want it in one column. How do I do this? Thanks
Try this:
df1.alias("a").join(df2.alias("b"), "id").selectExpr("id", "coalesce(a.col_name, b.col_name) as col_name")
You could do this with pandas/NumPy (note: this assumes df1 and df2 are pandas DataFrames with a default integer index, and that the missing values are real NaN/None, not the string 'null'):
import numpy as np

mydf = df1.copy()                            # make a copy of the first frame
idx = np.where(df1['col_name'].isnull())[0]  # get positions where df1 is null
val = df2['col_name'].values[idx]            # get values from df2 at those positions
mydf.loc[idx, 'col_name'] = val              # assign those values in mydf
mydf                                         # print mydf
You should be able to utilize the coalesce function to achieve this:
from pyspark.sql.functions import coalesce

renamedDF1 = df1.withColumnRenamed("col_name", "col_name_a")
renamedDF2 = df2.withColumnRenamed("col_name", "col_name_b")
joinedDF = renamedDF1.join(renamedDF2, "id")
joinedDF = joinedDF.withColumn(
    "col_name",
    coalesce(joinedDF["col_name_a"], joinedDF["col_name_b"])
).drop("col_name_a", "col_name_b")
I have a dataframe that looks like this
+-----------+-----------+-----------+
|salesperson|     device|amount_sold|
+-----------+-----------+-----------+
|       john|   notebook|          2|
|       gary|   notebook|          3|
|       john|small_phone|          2|
|       mary|small_phone|          3|
|       john|large_phone|          3|
|       john|     camera|          3|
+-----------+-----------+-----------+
and I have transformed it using the pivot function into this, with a Total column:
+-----------+------+-----------+--------+-----------+-----+
|salesperson|camera|large_phone|notebook|small_phone|Total|
+-----------+------+-----------+--------+-----------+-----+
|       gary|     0|          0|       3|          0|    3|
|       mary|     0|          0|       0|          3|    3|
|       john|     3|          3|       2|          2|   10|
+-----------+------+-----------+--------+-----------+-----+
but I would like a dataframe with a row (Total) that would also contain a total for every column like below:
+-----------+------+-----------+--------+-----------+-----+
|salesperson|camera|large_phone|notebook|small_phone|Total|
+-----------+------+-----------+--------+-----------+-----+
|       gary|     0|          0|       3|          0|    3|
|       mary|     0|          0|       0|          3|    3|
|       john|     3|          3|       2|          2|   10|
|      Total|     3|          3|       5|          5|   16|
+-----------+------+-----------+--------+-----------+-----+
Is it possible to do this in Spark using Scala/Python (preferably Scala), and without using union if possible?
TIA
You can do something like below:
import org.apache.spark.sql.functions.{col, lit, sum}

// All columns except "salesperson"
val columns = df.columns.filter(_ != "salesperson").map(col)

// Use function `sum` on each column and union the result with the original DataFrame.
val withTotalAsRow = df.union(df.select(lit("Total").as("salesperson") +: columns.map(sum): _*))

// Append a row-wise "Total" column by adding up the value of each column.
// Note: your pivoted DataFrame already has a "Total" column; if you keep this step,
// exclude that column from `columns` first, otherwise it gets counted twice.
val withTotalAsColumn = withTotalAsRow.withColumn("Total", columns.reduce(_ plus _))
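Since the question allows Scala or Python, here is a rough PySpark equivalent for the Total row (a sketch; it assumes the pivoted DataFrame is called df, as above):
from pyspark.sql import functions as F

value_cols = [c for c in df.columns if c != "salesperson"]

# One global sum per column, labelled "Total", unioned back onto the original rows.
# The columns are selected in the same order as df.columns, so the union lines up.
total_row = df.select(F.lit("Total").alias("salesperson"),
                      *[F.sum(c).alias(c) for c in value_cols])
with_total_row = df.union(total_row)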
With Spark Scala, you can achieve this using the following snippet of code:
// Assuming the Spark session is available as a variable named 'spark'
import spark.implicits._

// Row-wise total: sum() is an aggregate function, so use + to add columns within a row
val resultDF = df.withColumn("Total", $"camera" + $"large_phone" + $"notebook" + $"small_phone")
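And a rough PySpark equivalent for that row-wise Total column (a sketch; same assumed df):
from functools import reduce
from pyspark.sql import functions as F

cols = ["camera", "large_phone", "notebook", "small_phone"]
with_total_col = df.withColumn("Total", reduce(lambda a, b: a + b, [F.col(c) for c in cols]))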
I need help getting conditional output from pyspark when using groupBy. I have the following input table:
+----+-----------+-------+
|time|auth_orient|success|
+----+-----------+-------+
|   1|      LogOn|Success|
|   1|     LogOff|Success|
|   1|     LogOff|Success|
|   1|      LogOn|Success|
|   1|      LogOn|   Fail|
|   1|      LogOn|Success|
|   2|     LogOff|Success|
|   2|      LogOn|Success|
|   2|      LogOn|Success|
|   2|     LogOff|Success|
|   2|      LogOn|Success|
|   2|      LogOn|   Fail|
|   2|     LogOff|Success|
|   2|      LogOn|Success|
|   2|      LogOn|Success|
|   2|     LogOff|Success|
|   2|      LogOn|   Fail|
|   2|      LogOn|Success|
|   2|      LogOn|Success|
|   2|      LogOn|Success|
+----+-----------+-------+
The table below shows what I want, which only displays the logon stats:
+----+-----------+-------+
|time|       Fail|success|
+----+-----------+-------+
|   1|          1|      3|
|   2|          2|      8|
+----+-----------+-------+
Overall I am trying to group on time and populate the new columns with counts; preferably I would like the code to populate the column names itself, as I will not always have a complete list.
I know a portion of what I am trying to do is capable with MultilabelBinarizer, but that is not currently available in pyspark from what I have seen.
Filter the data frame down to LogOn only first and then do groupBy.pivot:
import pyspark.sql.functions as F
df.filter(
    df.auth_orient == 'LogOn'
).groupBy('time').pivot('success').agg(F.count('*')).show()
+----+----+-------+
|time|Fail|Success|
+----+----+-------+
|   1|   1|      3|
|   2|   2|      8|
+----+----+-------+
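If you want the pivoted columns in a fixed order, or a 0 instead of null when a value never occurs for some time bucket, you can pass the pivot values explicitly and fill the gaps (a sketch, reusing F from above):
df.filter(
    df.auth_orient == 'LogOn'
).groupBy('time').pivot('success', ['Fail', 'Success']).agg(F.count('*')).na.fill(0).show()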
I have a data frame in pyspark like below. I want to do a groupBy and count on the category column of the data frame:
df.show()
+--------+----+
|category| val|
+--------+----+
|    cat1|  13|
|    cat2|  12|
|    cat2|  14|
|    cat3|  23|
|    cat1|  20|
|    cat1|  10|
|    cat2|  30|
|    cat3|  11|
|    cat1|   7|
|    cat1|   8|
+--------+----+
res = df.groupBy('category').count()
res.show()
+--------+-----+
|category|count|
+--------+-----+
|    cat2|    3|
|    cat3|    2|
|    cat1|    5|
+--------+-----+
I am getting my desired result. Now I want to calculate the average per category. The data frame has records for 3 days, and I want the average of the count over these 3 days.
The result I want is below; basically I want to do count / number of days:
+--------+-----+
|category|count|
+--------+-----+
|    cat2|    1|
|    cat3|    1|
|    cat1|    2|
+--------+-----+
How can I do that?
I believe what you want is
from pyspark.sql import functions as F
df.groupby('category').agg((F.count('val') / 3).alias('average'))
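If the number of days should not be hard-coded, and assuming the full data has a date column (hypothetical here, it is not in the sample shown), you could derive the divisor from the data (a sketch):
from pyspark.sql import functions as F

n_days = df.select(F.countDistinct('date')).first()[0]   # 'date' is a hypothetical column
df.groupby('category').agg((F.count('val') / n_days).alias('average'))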
I have a dataframe in pyspark.
Say it has some columns a, b, c...
I want to group the data into groups as the value of a column changes. Say:
A B
1 x
1 y
0 x
0 y
0 x
1 y
1 x
1 y
There will be 3 groups: (1x,1y), (0x,0y,0x), (1y,1x,1y), along with the corresponding row data.
If I understand correctly you want to create a distinct group every time column A changes values.
First we'll create a monotonically increasing id to keep the row order as it is:
import pyspark.sql.functions as psf
df = sc.parallelize([[1,'x'],[1,'y'],[0,'x'],[0,'y'],[0,'x'],[1,'y'],[1,'x'],[1,'y']])\
    .toDF(['A', 'B'])\
    .withColumn("rn", psf.monotonically_increasing_id())
df.show()
+---+---+----------+
|  A|  B|        rn|
+---+---+----------+
|  1|  x|         0|
|  1|  y|         1|
|  0|  x|         2|
|  0|  y|         3|
|  0|  x|8589934592|
|  1|  y|8589934593|
|  1|  x|8589934594|
|  1|  y|8589934595|
+---+---+----------+
Now we'll use a window function to create a column that contains 1 every time column A changes:
from pyspark.sql import Window
w = Window.orderBy('rn')
df = df.withColumn("changed", (df.A != psf.lag('A', 1, 0).over(w)).cast('int'))
+---+---+----------+-------+
|  A|  B|        rn|changed|
+---+---+----------+-------+
|  1|  x|         0|      1|
|  1|  y|         1|      0|
|  0|  x|         2|      1|
|  0|  y|         3|      0|
|  0|  x|8589934592|      0|
|  1|  y|8589934593|      1|
|  1|  x|8589934594|      0|
|  1|  y|8589934595|      0|
+---+---+----------+-------+
Finally we'll use another window function to allocate different numbers to each group:
df = df.withColumn("group_id", psf.sum("changed").over(w)).drop("rn").drop("changed")
+---+---+--------+
|  A|  B|group_id|
+---+---+--------+
|  1|  x|       1|
|  1|  y|       1|
|  0|  x|       2|
|  0|  y|       2|
|  0|  x|       2|
|  1|  y|       3|
|  1|  x|       3|
|  1|  y|       3|
+---+---+--------+
Now you can build your groups.
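For example, to materialize the groups you could collect each group's rows into a list (one possible sketch; note that collect_list does not guarantee ordering within a group):
groups = df.groupBy("group_id").agg(
    psf.collect_list(psf.struct("A", "B")).alias("rows")
)
groups.show(truncate=False)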