Pyspark groupby column while conditionally counting another column - python

I need help getting conditional output from pyspark when using groupBy. I have the following input table:
+----+-----------+-------+
|time|auth_orient|success|
+----+-----------+-------+
| 1| LogOn|Success|
| 1| LogOff|Success|
| 1| LogOff|Success|
| 1| LogOn|Success|
| 1| LogOn| Fail|
| 1| LogOn|Success|
| 2| LogOff|Success|
| 2| LogOn|Success|
| 2| LogOn|Success|
| 2| LogOff|Success|
| 2| LogOn|Success|
| 2| LogOn|Fail |
| 2| LogOff|Success|
| 2| LogOn|Success|
| 2| LogOn|Success|
| 2| LogOff|Success|
| 2| LogOn|Fail |
| 2| LogOn|Success|
| 2| LogOn|Success|
| 2| LogOn|Success|
+----+-----------+-------+
The table below shows what I want, which only displays the logon stats:
+----+----+-------+
|time|Fail|Success|
+----+----+-------+
|   1|   1|      3|
|   2|   2|      8|
+----+----+-------+
Overall I am trying to group on time and populate the new columns with counts. Preferably I would rather have the code populate the column names itself, since I will not always have a complete list of them.
I know a portion of what I am trying to do is capable with MultilabelBinarizer, but that is not currently available in pyspark from what I have seen.

Filter the data frame down to LogOn only first and then do groupBy.pivot:
import pyspark.sql.functions as F
df.filter(
    df.auth_orient == 'LogOn'
).groupBy('time').pivot('success').agg(F.count('*')).show()
+----+----+-------+
|time|Fail|Success|
+----+----+-------+
| 1| 1| 3|
| 2| 2| 8|
+----+----+-------+
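If the outcome labels are known in advance, the same result can also be obtained with conditional aggregation instead of a pivot. A minimal sketch, assuming only the two labels Fail and Success:
import pyspark.sql.functions as F

# Conditional counts: F.when() yields null for non-matching rows, and F.count() skips nulls.
df.groupBy('time').agg(
    F.count(F.when((F.col('auth_orient') == 'LogOn') & (F.col('success') == 'Fail'), 1)).alias('Fail'),
    F.count(F.when((F.col('auth_orient') == 'LogOn') & (F.col('success') == 'Success'), 1)).alias('Success')
).show()
The pivot version above scales better when the set of labels is not fixed, which matches the requirement of not always having a complete list.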

Related

how to group by only specific features using pyspark

I have this data frame
+---------+------+-----+-------------+-----+
| LCLid|KWH/hh|Acorn|Acorn_grouped|Month|
+---------+------+-----+-------------+-----+
|MAC000002| 0.0| 0| 0| 10|
|MAC000002| 0.0| 0| 0| 10|
|MAC000002| 0.0| 0| 0| 10|
I want to group by LCLid and Month and aggregate only the KWH/hh consumption, so that I get
+---------+-----+------------------+----------+------------------+
| LCLid|Month| sum(KWH/hh)|Acorn |Acorn_grouped |
+---------+-----+------------------+----------+------------------+
|MAC000003| 10| 904.9270009999999| 0 | 0 |
|MAC000022| 2|1672.5559999999978| 1 | 0 |
|MAC000019| 4| 368.4720001000007| 1 | 1 |
|MAC000022| 9|449.07699989999975| 0 | 1 |
|MAC000024| 8| 481.7160003000004| 1 | 0 |
but all I managed to do was use this code
dataset=dataset.groupBy("LCLid","Month").sum()
which gave me this result
+---------+-----+------------------+----------+------------------+----------+
| LCLid|Month| sum(KWH/hh)|sum(Acorn)|sum(Acorn_grouped)|sum(Month)|
+---------+-----+------------------+----------+------------------+----------+
|MAC000003| 10| 904.9270009999999| 2978| 2978| 29780|
|MAC000022| 2|1672.5559999999978| 12090| 4030| 8060|
|MAC000019| 4| 368.4720001000007| 20174| 2882| 11528|
|MAC000022| 9|449.07699989999975| 8646| 2882| 25938|
The problem is that the sum was also calculated on the Acorn and Acorn_grouped columns.
Do you have any idea how I could make the aggregation apply only to KWH/hh?
It depends on how you want to handle the other two columns. If you don't want to sum them and just want any value from each, you can do:
import pyspark.sql.functions as F
dataset = dataset.groupBy("LCLid", "Month").agg(
    F.sum("KWH/hh"),
    F.first("Acorn").alias("Acorn"),
    F.first("Acorn_grouped").alias("Acorn_grouped")
)
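Alternatively, if Acorn and Acorn_grouped never vary within an LCLid, they can simply be treated as extra grouping keys instead of being aggregated. A sketch under that assumption:
import pyspark.sql.functions as F

# Only KWH/hh is aggregated; the Acorn columns act as grouping keys.
dataset = dataset.groupBy("LCLid", "Month", "Acorn", "Acorn_grouped").agg(
    F.sum("KWH/hh").alias("sum(KWH/hh)")
)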

Spark create a row containing a sum for every column (like a grand total for every column)

I have a dataframe that looks like this
+-----------+-----------+-----------+
|salesperson| device|amount_sold|
+-----------+-----------+-----------+
| john| notebook| 2|
| gary| notebook| 3|
| john|small_phone| 2|
| mary|small_phone| 3|
| john|large_phone| 3|
| john| camera| 3|
+-----------+-----------+-----------+
and I have transformed it using the pivot function into the following, with a Total column
+-----------+------+-----------+--------+-----------+-----+
|salesperson|camera|large_phone|notebook|small_phone|Total|
+-----------+------+-----------+--------+-----------+-----+
| gary| 0| 0| 3| 0| 3|
| mary| 0| 0| 0| 3| 3|
| john| 3| 3| 2| 2| 10|
+-----------+------+-----------+--------+-----------+-----+
but I would like a dataframe with a row (Total) that would also contain a total for every column like below:
+-----------+------+-----------+--------+-----------+-----+
|salesperson|camera|large_phone|notebook|small_phone|Total|
+-----------+------+-----------+--------+-----------+-----+
| gary| 0| 0| 3| 0| 3|
| mary| 0| 0| 0| 3| 3|
| john| 3| 3| 2| 2| 10|
| Total| 3| 3| 5| 5| 16|
+-----------+------+-----------+--------+-----------+-----+
Is it possible to do this in Spark using Scala/Python (preferably Scala), and without using union if possible?
TIA
You can do something like below:
import org.apache.spark.sql.functions.{col, lit, sum}

// Every column except `salesperson` is a value column.
val columns = df.columns.filterNot(_ == "salesperson").map(col)

// Apply `sum` to each value column and union the resulting "Total" row with the original DataFrame.
val withTotalAsRow = df.union(df.select(lit("Total").as("salesperson") +: columns.map(sum): _*))

// The Total column already exists in the question's DataFrame; otherwise it can be
// appended by adding the value columns row by row.
val withTotalAsColumn = withTotalAsRow.withColumn("Total", columns.reduce(_ plus _))
With Spark Scala, you can add the row-wise Total column using the following snippet (sum is an aggregate function, so the per-row total is built with plain column addition instead):
// Assuming the SparkSession is available as a variable named 'spark'
import spark.implicits._

val resultDF = df.withColumn("Total", $"camera" + $"large_phone" + $"notebook" + $"small_phone")
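For completeness, a PySpark sketch of the same idea; the name pivoted is assumed for the pivoted DataFrame before the Total column is added, and like the Scala answer it still relies on a union for the total row:
import pyspark.sql.functions as F
from functools import reduce

value_cols = [c for c in pivoted.columns if c != "salesperson"]

# Row-wise total across the device columns.
with_total_col = pivoted.withColumn(
    "Total", reduce(lambda a, b: a + b, [F.col(c) for c in value_cols])
)

# Grand-total row appended at the bottom (union is positional, so the column order is kept).
total_row = with_total_col.select(
    F.lit("Total").alias("salesperson"),
    *[F.sum(c).alias(c) for c in value_cols + ["Total"]]
)
with_total_col.union(total_row).show()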

Groupby and divide count of grouped elements in pyspark data frame

I have a data frame in pyspark like below. I want to do a groupBy and count of the category column in the data frame.
df.show()
+--------+----+
|category| val|
+--------+----+
| cat1| 13|
| cat2| 12|
| cat2| 14|
| cat3| 23|
| cat1| 20|
| cat1| 10|
| cat2| 30|
| cat3| 11|
| cat1| 7|
| cat1| 8|
+--------+----+
res = df.groupBy('category').count()
res.show()
+--------+-----+
|category|count|
+--------+-----+
| cat2| 3|
| cat3| 2|
| cat1| 5|
+--------+-----+
I am getting my desired result. Now I want to calculate the average per category. The data frame has records for 3 days, and I want to average the counts over those 3 days.
The result I want is below; I basically want to do count / no. of days:
+--------+-----+
|category|count|
+--------+-----+
| cat2| 1|
| cat3| 1|
| cat1| 2|
+--------+-----+
How can I do that?
I believe what you want is
from pyspark.sql import functions as F
df.groupby('category').agg((F.count('val') / 3).alias('average'))
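If the number of days should not be hard-coded, it can be derived from the data itself. A sketch assuming a hypothetical day column that identifies the day of each record:
from pyspark.sql import functions as F

# Count the distinct days once, then divide the per-category counts by it.
n_days = df.select(F.countDistinct('day')).first()[0]
df.groupby('category').agg((F.count('val') / n_days).alias('average')).show()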

Partition pyspark dataframe based on the change in column value

I have a dataframe in pyspark.
Say it has some columns a, b, c...
I want to split the data into groups whenever the value of column A changes. Say
A B
1 x
1 y
0 x
0 y
0 x
1 y
1 x
1 y
There should be 3 groups: (1x, 1y), (0x, 0y, 0x), (1y, 1x, 1y), along with the corresponding row data.
If I understand correctly you want to create a distinct group every time column A changes values.
First we'll create a monotonically increasing id to keep the row order as it is:
import pyspark.sql.functions as psf
df = sc.parallelize([[1, 'x'], [1, 'y'], [0, 'x'], [0, 'y'], [0, 'x'], [1, 'y'], [1, 'x'], [1, 'y']])\
    .toDF(['A', 'B'])\
    .withColumn("rn", psf.monotonically_increasing_id())
df.show()
+---+---+----------+
| A| B| rn|
+---+---+----------+
| 1| x| 0|
| 1| y| 1|
| 0| x| 2|
| 0| y| 3|
| 0| x|8589934592|
| 1| y|8589934593|
| 1| x|8589934594|
| 1| y|8589934595|
+---+---+----------+
Now we'll use a window function to create a column that contains 1 every time column A changes:
from pyspark.sql import Window
w = Window.orderBy('rn')
df = df.withColumn("changed", (df.A != psf.lag('A', 1, 0).over(w)).cast('int'))
+---+---+----------+-------+
| A| B| rn|changed|
+---+---+----------+-------+
| 1| x| 0| 1|
| 1| y| 1| 0|
| 0| x| 2| 1|
| 0| y| 3| 0|
| 0| x|8589934592| 0|
| 1| y|8589934593| 1|
| 1| x|8589934594| 0|
| 1| y|8589934595| 0|
+---+---+----------+-------+
Finally we'll use another window function to allocate different numbers to each group:
df = df.withColumn("group_id", psf.sum("changed").over(w)).drop("rn").drop("changed")
+---+---+--------+
| A| B|group_id|
+---+---+--------+
| 1| x| 1|
| 1| y| 1|
| 0| x| 2|
| 0| y| 2|
| 0| x| 2|
| 1| y| 3|
| 1| x| 3|
| 1| y| 3|
+---+---+--------+
Now you can build your groups.
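For example, a minimal sketch that collects the rows of each group into a list, using the df built above:
# One row per group_id, holding the (A, B) pairs of that group.
df.groupBy("group_id").agg(
    psf.collect_list(psf.struct("A", "B")).alias("rows")
).show(truncate=False)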

Fill Pyspark dataframe column null values with average value from same column

With a dataframe like this,
rdd_2 = sc.parallelize([(0,10,223,"201601"), (0,10,83,"2016032"),(1,20,None,"201602"),(1,20,3003,"201601"), (1,20,None,"201603"), (2,40, 2321,"201601"), (2,30, 10,"201602"),(2,61, None,"201601")])
df_data = sqlContext.createDataFrame(rdd_2, ["id", "type", "cost", "date"])
df_data.show()
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|null| 201602|
| 1| 20|3003| 201601|
| 1| 20|null| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|null| 201601|
+---+----+----+-------+
I need to fill the null values with the average of the existing values, with the expected result being
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|1128| 201602|
| 1| 20|3003| 201601|
| 1| 20|1128| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|1128| 201601|
+---+----+----+-------+
where 1128 is the average of the existing values. I need to do that for several columns.
My current approach is to use na.fill:
fill_values = {column: df_data.agg({column:"mean"}).flatMap(list).collect()[0] for column in df_data.columns if column not in ['date','id']}
df_data = df_data.na.fill(fill_values)
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|1128| 201602|
| 1| 20|3003| 201601|
| 1| 20|1128| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|1128| 201601|
+---+----+----+-------+
But this is very cumbersome. Any ideas?
Well, one way or another you have to:
compute statistics
fill the blanks
It pretty much limits what you can really improve here, still:
replace flatMap(list).collect()[0] with first()[0] or structure unpacking
compute all stats with a single action
use built-in Row methods to extract dictionary
The final result could look like this:
from pyspark.sql.functions import avg

def fill_with_mean(df, exclude=set()):
    # Compute the mean of every non-excluded column in a single action.
    stats = df.agg(*(
        avg(c).alias(c) for c in df.columns if c not in exclude
    ))
    # Row.asDict() gives exactly the mapping that na.fill expects.
    return df.na.fill(stats.first().asDict())

fill_with_mean(df_data, ["id", "date"])
In Spark 2.2 or later you can also use Imputer. See Replace missing values with mean - Spark Dataframe.
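A minimal Imputer sketch under that approach (Imputer expects float/double input columns, so cost is cast first; the column names mirror the example above):
from pyspark.ml.feature import Imputer
from pyspark.sql import functions as F

# Imputer works on float/double columns, so cast cost before fitting.
df_num = df_data.withColumn("cost", F.col("cost").cast("double"))

# The mean of the existing cost values replaces the nulls in a new cost_imputed column.
imputer = Imputer(inputCols=["cost"], outputCols=["cost_imputed"], strategy="mean")
df_filled = imputer.fit(df_num).transform(df_num)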
