Chaining multiple groupBy in pyspark

My Data looks like this:
id | duration | action1 | action2 | ...
---------------------------------------------
1  | 10       | A       | D
1  | 10       | B       | E
2  | 25       | A       | E
1  | 7        | A       | G
I want to group it by ID (which works great!):
df.rdd.groupBy(lambda x: x['id']).mapValues(list).collect()
And now I would like to group values within each group by duration to get something like this:
[(id=1, [(duration=10, [(action1=A, action2=D), (action1=B, action2=E)]),
         (duration=7,  [(action1=A, action2=G)])]),
 (id=2, [(duration=25, [(action1=A, action2=E)])])]
And this is where I don't know how to do a nested group by. Any tips?

There is no need to convert to an RDD. Here's a generalized way to group by multiple columns and aggregate the remaining columns into lists without hard-coding them:
from pyspark.sql.functions import collect_list
grouping_cols = ["id", "duration"]
other_cols = [c for c in df.columns if c not in grouping_cols]
df.groupBy(grouping_cols).agg(*[collect_list(c).alias(c) for c in other_cols]).show()
#+---+--------+-------+-------+
#| id|duration|action1|action2|
#+---+--------+-------+-------+
#| 1| 10| [A, B]| [D, E]|
#| 2| 25| [A]| [E]|
#| 1| 7| [A]| [G]|
#+---+--------+-------+-------+
Update
If you need to preserve the order of the actions, the best way is to use a pyspark.sql.Window with an orderBy(). This is because there seems to be some ambiguity as to whether or not a groupBy() following an orderBy() maintains that order.
Suppose your timestamps are stored in a column "ts". You should be able to do the following:
from pyspark.sql import Window
# Use a full partition frame so each row sees the complete ordered list
# (with just orderBy, the default frame only extends up to the current row).
w = Window.partitionBy(grouping_cols).orderBy("ts")\
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
grouped_df = df.select(
    *(grouping_cols + [collect_list(c).over(w).alias(c) for c in other_cols])
).distinct()
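If you specifically want the nested shape from the question (actions grouped by duration, and durations grouped by id), a minimal sketch is to chain two groupBy/collect_list steps over structs (column names taken from the question; adjust to your schema):
from pyspark.sql.functions import collect_list, struct

nested = (
    df.groupBy("id", "duration")
      .agg(collect_list(struct("action1", "action2")).alias("actions"))
      .groupBy("id")
      .agg(collect_list(struct("duration", "actions")).alias("durations"))
)
nested.show(truncate=False)
Each row of nested then holds one id together with its list of (duration, actions) structs.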

Related

Groupby and Standardise values in Pyspark

So, I have a Pyspark dataframe of the type
+-----+-----+
|Group|Value|
+-----+-----+
|    A|   12|
|    B|   10|
|    A|    1|
|    B|    0|
|    B|    1|
|    A|    6|
+-----+-----+
and I'd like to perform an operation that generates a DataFrame holding the standardised value of each row with respect to its group.
In short, I should get:
+-----+-----------+
|Group|      Value|
+-----+-----------+
|    A| 1.26012384|
|    B|  1.4083737|
|    A|-1.18599891|
|    B|-0.81537425|
|    B|-0.59299945|
|    A|-0.07412493|
+-----+-----------+
I think this should be done with a groupBy and then some agg operation, but honestly I'm not really sure how to do it.
You can calculate the mean and stddev in each group using Window functions:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'Value',
    (F.col('Value') - F.mean('Value').over(Window.partitionBy('Group'))) /
    F.stddev_pop('Value').over(Window.partitionBy('Group'))
)
df2.show()
+-----+--------------------+
|Group| Value|
+-----+--------------------+
| B| 1.4083737016560922|
| B| -0.8153742483272112|
| B| -0.5929994533288808|
| A| 1.2601238383238722|
| A| -1.1859989066577619|
| A|-0.07412493166611006|
+-----+--------------------+
Note that the order of the results will be randomized because Spark dataframes do not have indices.
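If you prefer the groupBy/agg route hinted at in the question, a minimal sketch under the same column names (one aggregation plus a join, rather than Window functions) would be:
from pyspark.sql import functions as F

stats = df.groupBy('Group').agg(
    F.mean('Value').alias('mean'),
    F.stddev_pop('Value').alias('std')
)
df2 = (
    df.join(stats, 'Group')
      .withColumn('Value', (F.col('Value') - F.col('mean')) / F.col('std'))
      .drop('mean', 'std')
)
Both approaches give the same standardised values; the Window version simply avoids the explicit join.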

Combine two DataFrames in PySpark into matrix

I have two DataFrames in a PySpark script.
DF1 has this data:
+-----+--------------+
| id | keyword |
+-----+--------------+
| 1 | banana |
| 2 | apple |
| 3 | orange |
+-----+--------------+
DF2 has this data:
+----+---------------+
| id | tokens |
+----+---------------+
| 13 | ['abc', 'def']|
| 14 | ['ghi', 'jkl']|
| 15 | ['mno', 'pqr']|
+----+---------------+
I'm looking to build a third DataFrame by combining both of the DataFrames above and performing some complex calculation (the calculation itself is not important) between the keyword and the tokens, defined by a Python function:
def complex_calculation(keyword, tokens):
    # some various stuff that produces a numeric result between the keyword and the tokens
    # e.g. result = 0.7768756
    return result
The final result should look something like this:
+-------------+---------+--------+--------+
| keyword | 13 | 14 | 15 |
+-------------+---------+--------+--------+
| banana | 0.5345 | 0.4325 | 0.6543 |
| apple | 0.2435 | 0.7865 | 0.9123 |
| orange | 0.3765 | 0.6942 | 0.2765 |
+-------------+---------+--------+--------+
Your complex calculation function is actually quite important in this context, because what you're looking to do is the following:
Create a cartesian product of your two tables
import pyspark.sql.functions as f

table1 = spark.sparkContext.parallelize(
    [[1, "banana"], [2, "apple"], [3, "orange"]]
).toDF(["id", "keyword"])
table2 = spark.sparkContext.parallelize(
    [[13, ['abc', 'def']], [14, ['ghi', 'jkl']], [15, ['mno', 'pqr']]]
).toDF(["id", "token"])
Pivot with an aggregation function. Now this is where your function comes into play. As you can see, I am using f.count() as my aggregation function.
(
    table1.select("keyword")
    .crossJoin(table2)
    .groupBy('keyword')
    .pivot('id')
    .agg(f.count("token"))
).show()
+-------+---+---+---+
|keyword| 13| 14| 15|
+-------+---+---+---+
| orange| 1| 1| 1|
| apple| 1| 1| 1|
| banana| 1| 1| 1|
+-------+---+---+---+
If you want to use some custom, clever calculation, you really have two options. If you're comfortable in Scala, you can write a UDAF (user-defined aggregate function) and register the resulting jar with your Spark cluster. Alternatively, you can have a look at pandas UDFs with something such as:
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("struct<agg_key: string, parameter1: parameter1_type>", PandasUDFType.GROUPED_MAP)
def my_agg_function(df):
    df = pd.DataFrame(df.groupby(agg_key).apply(lambda x: (...)))
    df.reset_index(inplace=True, drop=False)
    return df
And then you use your pandas UDF such as:
spark_df.groupBy("keyword").pivot("id").apply(my_agg_function)
However, despite best attempts at being vectorized, pandas UDFs can still have significant performance impacts. Hope this helps. More on pandas UDFs here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf
Ideally, you should try to do your complex aggregations using Spark functions as much as you can, because Tungsten can then optimise this under the hood and give you the best performance possible.
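If the calculation only needs the keyword and the token array of a single cross-joined row (i.e. it is not an aggregation across rows), a minimal sketch with an ordinary row-level Python UDF plus first() as the pivot aggregate also works; the body of complex_calculation below is a placeholder, not the real scoring logic:
from pyspark.sql import functions as f
from pyspark.sql.types import DoubleType

@f.udf(DoubleType())
def complex_calculation(keyword, tokens):
    # placeholder score; replace with the real keyword-vs-tokens calculation
    return float(len(keyword)) / (1.0 + len(tokens))

result = (
    table1.select("keyword")
    .crossJoin(table2)
    .withColumn("score", complex_calculation(f.col("keyword"), f.col("token")))
    .groupBy("keyword")
    .pivot("id")
    .agg(f.first("score"))
)
result.show()
Since each (keyword, id) pair appears exactly once in the cross join, f.first("score") simply carries the computed score through the pivot.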

Groupby and UDF/UDAF in PySpark while maintaining DataFrame structure

I am new to PySpark and struggling with a simple dataframe manipulation. I have a dataframe similar to:
product  period  rating  product_Desc1  product_Desc2  ... more columns
a        1       60      foo            xx
a        2       70      foo            xx
a        3       59      foo            xx
b        1       50      bar            yy
b        2       55      bar            yy
c        1       90      foo bar        xy
c        2       100     foo bar        xy
I would like to groupBy product, add columns to calculate arithmetic, geometric and harmonic means of ratings while also maintaining the rest of the columns in the dataframe, which are all consistent across each product.
I have tried to do so with a combination of built in functions and UDF. For example:
from functools import reduce
from operator import mul
from pyspark.sql.functions import mean, udf
from pyspark.sql.types import FloatType

a_means = df.groupBy("product").agg(mean("rating").alias("a_mean"))
g_means = df.groupBy("product").agg(udf_gmean("rating").alias("g_mean"))
where:
def g_mean(x):
    gm = reduce(mul, x) ** (1 / len(x))
    return gm

udf_gmean = udf(g_mean, FloatType())
I would then join the a_means and g_means output with the original dataframe on product and drop duplicates. However, this method returns an error, for g_means, stating that "rating" is not involved in the groupBy nor is it a user defined aggregation function....
I have also tried using SciPy's gmean module but the error message I get states that the ufunc 'log' is not suitable for the input types, despite all of the rating column being integer type as far as I can see.
There are similar questions on the site but nothing that I can find that seems to fix this issue I have. I would really appreciate the help as it's driving me mad!
Thanks in advance and I should be able to provide any further info quickly today if I haven't provided enough.
It's worth noting that, for efficiency, I am unable to simply convert to Pandas and transform as I would with a Pandas dataframe...and I am using Spark 2.2 and unable to update!
How about something like this
from pyspark.sql.functions import avg

# accumulate (count, product of ratings) per product, then take the count-th root of the product
df1 = df.select("product", "rating").rdd.map(lambda x: (x[0], (1.0, x[1] * 1.0))) \
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] * y[1])).toDF(['product', 'g_mean'])
gdf = df1.select(df1['product'], pow(df1['g_mean._2'], 1.0 / df1['g_mean._1']).alias("rating_g_mean"))
display(gdf)
+-------+-----------------+
|product| rating_g_mean|
+-------+-----------------+
| a|62.81071936240795|
| b|52.44044240850758|
| c|94.86832980505137|
+-------+-----------------+
df1 = df.withColumn("h_mean", 1.0/df["rating"])
hdf = df1.groupBy("product").agg(avg(df1["rating"]).alias("rating_mean"), (1.0/avg(df1["h_mean"])).alias("rating_h_mean"))
sdf = hdf.join(gdf, ['product'])
display(sdf)
+-------+-----------+-----------------+-----------------+
|product|rating_mean| rating_h_mean| rating_g_mean|
+-------+-----------+-----------------+-----------------+
| a| 63.0|62.62847514743051|62.81071936240795|
| b| 52.5|52.38095238095239|52.44044240850758|
| c| 95.0|94.73684210526315|94.86832980505137|
+-------+-----------+-----------------+-----------------+
fdf = df.join(sdf, ['product'])
display(fdf.sort("product"))
+-------+------+------+-------------+-------------+-----------+-----------------+-----------------+
|product|period|rating|product_Desc1|product_Desc2|rating_mean| rating_h_mean| rating_g_mean|
+-------+------+------+-------------+-------------+-----------+-----------------+-----------------+
| a| 3| 59| foo| xx| 63.0|62.62847514743051|62.81071936240795|
| a| 2| 70| foo| xx| 63.0|62.62847514743051|62.81071936240795|
| a| 1| 60| foo| xx| 63.0|62.62847514743051|62.81071936240795|
| b| 2| 55| bar| yy| 52.5|52.38095238095239|52.44044240850758|
| b| 1| 50| bar| yy| 52.5|52.38095238095239|52.44044240850758|
| c| 2| 100| foo bar| xy| 95.0|94.73684210526315|94.86832980505137|
| c| 1| 90| foo bar| xy| 95.0|94.73684210526315|94.86832980505137|
+-------+------+------+-------------+-------------+-----------+-----------------+-----------------+
A slightly easier way than above using gapply:
from spark_sklearn.group_apply import gapply
from scipy.stats.mstats import gmean
from pyspark.sql.types import StructType, FloatType
import pandas as pd

def g_mean(_, vals):
    gm = gmean(vals["rating"])
    return pd.DataFrame(data=[gm])

geoSchema = StructType().add("geo_mean", FloatType())
gMeans = gapply(df.groupby("product"), g_mean, geoSchema)
This returns a dataframe which can then be sorted and joined onto the original using:
df_withGeo = df.join(gMeans, ["product"])
And repeat the process for any other aggregation-type columns you want to add to the original DataFrame...
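If you'd rather stay entirely within DataFrame built-ins (no RDDs and no UDFs), a minimal sketch, assuming all ratings are strictly positive, computes all three means using log/exp and reciprocal identities:
from pyspark.sql import functions as F

stats = df.groupBy("product").agg(
    F.avg("rating").alias("rating_mean"),
    # geometric mean = exp(mean(log(x)))
    F.exp(F.avg(F.log(df["rating"]))).alias("rating_g_mean"),
    # harmonic mean = n / sum(1/x)
    (F.count("rating") / F.sum(1.0 / df["rating"])).alias("rating_h_mean")
)
fdf = df.join(stats, ["product"])
This keeps everything in Catalyst-optimised expressions and works on Spark 2.2.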

Pyspark - Calculate number of null values in each dataframe column

I have a dataframe with many columns. My aim is to produce a dataframe that lists each column name, along with the number of null values in that column.
Example:
+-------------+-------------+
| Column_Name | NULL_Values |
+-------------+-------------+
| Column_1 | 15 |
| Column_2 | 56 |
| Column_3 | 18 |
| ... | ... |
+-------------+-------------+
I have managed to get the number of null values for ONE column like so:
df.agg(F.count(F.when(F.isnull(c), c)).alias('NULL_Count'))
where c is a column in the dataframe. However, it does not show the name of the column. The output is:
+------------+
| NULL_Count |
+------------+
| 15 |
+------------+
Any ideas?
You can use a list comprehension to loop over all of your columns in the agg, and use alias to rename the output column:
import pyspark.sql.functions as F
df_agg = df.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns])
However, this will return the results in one row as shown below:
df_agg.show()
#+--------+--------+--------+
#|Column_1|Column_2|Column_3|
#+--------+--------+--------+
#| 15| 56| 18|
#+--------+--------+--------+
If you wanted the results in one column instead, you could union each column from df_agg using functools.reduce as follows:
from functools import reduce
df_agg_col = reduce(
    lambda a, b: a.union(b),
    (
        df_agg.select(F.lit(c).alias("Column_Name"), F.col(c).alias("NULL_Count"))
        for c in df_agg.columns
    )
)
df_agg_col.show()
#+-----------+----------+
#|Column_Name|NULL_Count|
#+-----------+----------+
#| Column_1| 15|
#| Column_2| 56|
#| Column_3| 18|
#+-----------+----------+
Or you can skip the intermediate step of creating df_agg and do:
df_agg_col = reduce(
    lambda a, b: a.union(b),
    (
        df.agg(
            F.count(F.when(F.isnull(c), c)).alias('NULL_Count')
        ).select(F.lit(c).alias("Column_Name"), "NULL_Count")
        for c in df.columns
    )
)
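Another hedged alternative sketch is to unpivot the single-row df_agg with the SQL stack() generator instead of unioning one select per column:
stack_expr = "stack({n}, {args}) as (Column_Name, NULL_Count)".format(
    n=len(df_agg.columns),
    args=", ".join("'{0}', `{0}`".format(c) for c in df_agg.columns)
)
df_agg.selectExpr(stack_expr).show()
This produces the same (Column_Name, NULL_Count) layout with a single select instead of a chain of unions.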
A Scala alternative could be:
import org.apache.spark.sql.functions.{col, lit, sum}

case class Test(id: Int, weight: Option[Int], age: Int, gender: Option[String])
val df1 = Seq(Test(1, Some(100), 23, Some("Male")), Test(2, None, 25, None), Test(3, None, 33, Some("Female"))).toDF()
df1.show()
+---+------+---+------+
| id|weight|age|gender|
+---+------+---+------+
| 1| 100| 23| Male|
| 2| null| 25| null|
| 3| null| 33|Female|
+---+------+---+------+
val s = df1.columns.map(c => sum(col(c).isNull.cast("integer")).alias(c))
val df2 = df1.agg(s.head, s.tail:_*)
val t = df2.columns.map(c => df2.select(lit(c).alias("col_name"), col(c).alias("null_count")))
val df_agg_col = t.reduce((df1, df2) => df1.union(df2))
df_agg_col.show()

Pyspark - Select the distinct values from each column

I am trying to find all of the distinct values in each column of a dataframe and show them in one table.
Example data:
|-----------|-----------|-----------|
| COL_1 | COL_2 | COL_3 |
|-----------|-----------|-----------|
| A | C | D |
| A | C | D |
| A | C | E |
| B | C | E |
| B | C | F |
| B | C | F |
|-----------|-----------|-----------|
Example output:
|-----------|-----------|-----------|
| COL_1 | COL_2 | COL_3 |
|-----------|-----------|-----------|
| A | C | D |
| B | | E |
| | | F |
|-----------|-----------|-----------|
Is this even possible? I have been able to do it in separate tables, but it would be much better all in one table.
Any ideas?
The simplest thing here would be to use pyspark.sql.functions.collect_set on all of the columns:
import pyspark.sql.functions as f
df.select(*[f.collect_set(c).alias(c) for c in df.columns]).show()
#+------+-----+---------+
#| COL_1|COL_2| COL_3|
#+------+-----+---------+
#|[B, A]| [C]|[F, E, D]|
#+------+-----+---------+
Obviously, this returns the data as one row.
If instead you want the output as you wrote in your question (one row per unique value for each column), it's doable but requires quite a bit of pyspark gymnastics (and any solution will likely be much less efficient).
Nevertheless, I present you some options:
Option 1: Explode and Join
You can use pyspark.sql.functions.posexplode to explode the elements in the set of values for each column along with the index in the array. Do this for each column separately and then outer join the resulting list of DataFrames together using functools.reduce:
from functools import reduce

unique_row = df.select(*[f.collect_set(c).alias(c) for c in df.columns])
final_df = reduce(
    lambda a, b: a.join(b, how="outer", on="pos"),
    (unique_row.select(f.posexplode(c).alias("pos", c)) for c in unique_row.columns)
).drop("pos")
final_df.show()
#+-----+-----+-----+
#|COL_1|COL_2|COL_3|
#+-----+-----+-----+
#| A| null| E|
#| null| null| D|
#| B| C| F|
#+-----+-----+-----+
Option 2: Select by position
First compute the size of the maximum array and store this in a new column max_length. Then select elements from each array if a value exists at that index.
Once again we use pyspark.sql.functions.posexplode but this time it's just to create a column to represent the index in each array to extract.
Finally we use an expr() trick that allows a column value (max_length) to be used as a parameter inside a SQL function.
final_df = (
    df.select(*[f.collect_set(c).alias(c) for c in df.columns])
    .withColumn("max_length", f.greatest(*[f.size(c) for c in df.columns]))
    .select("*", f.expr("posexplode(split(repeat(',', max_length-1), ','))"))
    .select(*[
        f.expr("case when size({c}) > pos then {c}[pos] else null end AS {c}".format(c=c))
        for c in df.columns
    ])
)
final_df.show()
#+-----+-----+-----+
#|COL_1|COL_2|COL_3|
#+-----+-----+-----+
#| B| C| F|
#| A| null| E|
#| null| null| D|
#+-----+-----+-----+
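If the distinct sets are small enough to bring to the driver, a hedged driver-side sketch is to collect the single-row unique_row from Option 1, pad the per-column sets to equal length with None using itertools.zip_longest, and rebuild a DataFrame (assumes an active spark session):
from itertools import zip_longest

row = unique_row.first()
padded = list(zip_longest(*[row[c] for c in unique_row.columns]))
spark.createDataFrame(padded, unique_row.columns).show()
This avoids the joins and explodes entirely, at the cost of materialising the distinct values locally.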
