Groupby and Standardise values in Pyspark - python

So, I have a Pyspark dataframe of the type:

+-----+-----+
|Group|Value|
+-----+-----+
|    A|   12|
|    B|   10|
|    A|    1|
|    B|    0|
|    B|    1|
|    A|    6|
+-----+-----+
and I'd like to perform an operation that generates a DataFrame in which each value is standardised with respect to its group.
In short, I should get:
+-----+-----------+
|Group|      Value|
+-----+-----------+
|    A| 1.26012384|
|    B|  1.4083737|
|    A|-1.18599891|
|    B|-0.81537425|
|    B|-0.59299945|
|    A|-0.07412493|
+-----+-----------+
I think this should be done with a groupBy followed by some agg operation, but honestly I'm not really sure how to do it.

You can calculate the mean and stddev in each group using Window functions:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'Value',
    (F.col('Value') - F.mean('Value').over(Window.partitionBy('Group'))) /
    F.stddev_pop('Value').over(Window.partitionBy('Group'))
)
df2.show()
+-----+--------------------+
|Group| Value|
+-----+--------------------+
| B| 1.4083737016560922|
| B| -0.8153742483272112|
| B| -0.5929994533288808|
| A| 1.2601238383238722|
| A| -1.1859989066577619|
| A|-0.07412493166611006|
+-----+--------------------+
Note that the order of the output rows is not guaranteed, because Spark DataFrames have no index and the rows come back in whatever order the partitions are processed.
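Since the question mentions groupBy and agg: the same result can be obtained by aggregating the per-group statistics and joining them back. This is just a sketch of that alternative, not part of the answer above; the grp_mean / grp_stddev aliases are illustrative names:
from pyspark.sql import functions as F

# Per-group mean and population stddev, joined back onto the original rows.
stats = df.groupBy('Group').agg(
    F.mean('Value').alias('grp_mean'),
    F.stddev_pop('Value').alias('grp_stddev')
)

df2 = (
    df.join(stats, on='Group')
      .withColumn('Value', (F.col('Value') - F.col('grp_mean')) / F.col('grp_stddev'))
      .drop('grp_mean', 'grp_stddev')
)
df2.show()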

Related

Groupby and UDF/UDAF in PySpark while maintaining DataFrame structure

I am new to PySpark and struggling with a simple dataframe manipulation. I have a dataframe similar to:
product  period  rating  product_Desc1  product_Desc2  ... more columns
a        1       60      foo            xx
a        2       70      foo            xx
a        3       59      foo            xx
b        1       50      bar            yy
b        2       55      bar            yy
c        1       90      foo bar        xy
c        2       100     foo bar        xy
I would like to groupBy product, add columns to calculate arithmetic, geometric and harmonic means of ratings while also maintaining the rest of the columns in the dataframe, which are all consistent across each product.
I have tried to do so with a combination of built in functions and UDF. For example:
a_means = df.groupBy("product").agg(mean("rating").alias("a_mean"))
g_means = df.groupBy("product").agg(udf_gmean("rating").alias("g_mean"))
where:
from functools import reduce
from operator import mul

def g_mean(x):
    return reduce(mul, x) ** (1 / len(x))

udf_gmean = udf(g_mean, FloatType())
I would then join the a_means and g_means output with the original dataframe on product and drop duplicates. However, this method returns an error, for g_means, stating that "rating" is not involved in the groupBy nor is it a user defined aggregation function....
I have also tried using SciPy's gmean module but the error message I get states that the ufunc 'log' is not suitable for the input types, despite all of the rating column being integer type as far as I can see.
There are similar questions on the site but nothing that I can find that seems to fix this issue I have. I would really appreciate the help as it's driving me mad!
Thanks in advance and I should be able to provide any further info quickly today if I haven't provided enough.
It's worth noting that, for efficiency, I am unable to simply convert to Pandas and transform as I would with a Pandas dataframe...and I am using Spark 2.2 and unable to update!
How about something like this:
from pyspark.sql.functions import avg, pow
# Geometric mean: reduce each group to (count, product of the ratings), then take the nth root.
df1 = (df.select("product", "rating").rdd
         .map(lambda x: (x[0], (1.0, x[1] * 1.0)))
         .reduceByKey(lambda x, y: (x[0] + y[0], x[1] * y[1]))
         .toDF(['product', 'g_mean']))
gdf = df1.select(df1['product'], pow(df1['g_mean._2'], 1.0 / df1['g_mean._1']).alias("rating_g_mean"))
display(gdf)
+-------+-----------------+
|product| rating_g_mean|
+-------+-----------------+
| a|62.81071936240795|
| b|52.44044240850758|
| c|94.86832980505137|
+-------+-----------------+
df1 = df.withColumn("h_mean", 1.0 / df["rating"])  # per-row reciprocal, for the harmonic mean
hdf = df1.groupBy("product").agg(avg(df1["rating"]).alias("rating_mean"),
                                 (1.0 / avg(df1["h_mean"])).alias("rating_h_mean"))
sdf = hdf.join(gdf, ['product'])
display(sdf)
+-------+-----------+-----------------+-----------------+
|product|rating_mean| rating_h_mean| rating_g_mean|
+-------+-----------+-----------------+-----------------+
| a| 63.0|62.62847514743051|62.81071936240795|
| b| 52.5|52.38095238095239|52.44044240850758|
| c| 95.0|94.73684210526315|94.86832980505137|
+-------+-----------+-----------------+-----------------+
fdf = df.join(sdf, ['product'])
display(fdf.sort("product"))
+-------+------+------+-------------+-------------+-----------+-----------------+-----------------+
|product|period|rating|product_Desc1|product_Desc2|rating_mean| rating_h_mean| rating_g_mean|
+-------+------+------+-------------+-------------+-----------+-----------------+-----------------+
| a| 3| 59| foo| xx| 63.0|62.62847514743051|62.81071936240795|
| a| 2| 70| foo| xx| 63.0|62.62847514743051|62.81071936240795|
| a| 1| 60| foo| xx| 63.0|62.62847514743051|62.81071936240795|
| b| 2| 55| bar| yy| 52.5|52.38095238095239|52.44044240850758|
| b| 1| 50| bar| yy| 52.5|52.38095238095239|52.44044240850758|
| c| 2| 100| foo bar| xy| 95.0|94.73684210526315|94.86832980505137|
| c| 1| 90| foo bar| xy| 95.0|94.73684210526315|94.86832980505137|
+-------+------+------+-------------+-------------+-----------+-----------------+-----------------+
A slightly easier way than above using gapply:
from spark_sklearn.group_apply import gapply
from scipy.stats.mstats import gmean
from pyspark.sql.types import StructType, FloatType
import pandas as pd

def g_mean(_, vals):
    gm = gmean(vals["rating"])
    return pd.DataFrame(data=[gm])

geoSchema = StructType().add("geo_mean", FloatType())
gMeans = gapply(df.groupby("product"), g_mean, geoSchema)
This returns a dataframe which can then be sorted and joined onto the original using:
df_withGeo = df.join(gMeans, ["product"])
And repeat the process for any aggregation type function columns to be added to the original DataFrame...
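For completeness, all three means can also be computed with built-in column functions only, which sidesteps both the UDF-as-aggregate limitation and the RDD round trip. This is a sketch, not part of either answer above; it assumes all ratings are strictly positive (required for the log) and the alias names are illustrative:
from pyspark.sql import functions as F

means = df.groupBy("product").agg(
    F.avg("rating").alias("rating_mean"),                   # arithmetic mean
    F.exp(F.avg(F.log("rating"))).alias("rating_g_mean"),   # geometric mean = exp(mean(log x)), x > 0
    (F.count("rating") / F.sum(1.0 / F.col("rating"))).alias("rating_h_mean"),  # harmonic mean
)

df_with_means = df.join(means, ["product"])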

How to update a pyspark dataframe with new values from another dataframe?

I have two spark dataframes:
Dataframe A:
|col_1 | col_2 | ... | col_n |
|val_1 | val_2 | ... | val_n |
and dataframe B:
|col_1 | col_2 | ... | col_m |
|val_1 | val_2 | ... | val_m |
Dataframe B can contain duplicate, updated and new rows from dataframe A. I want to write an operation in spark where I can create a new dataframe containing the rows from dataframe A and the updated and new rows from dataframe B.
I started by creating a hash column containing only the columns that are not updatable. This is the unique id. So let's say col1 and col2 can change value (can be updated), but col3,..,coln are unique. I have created a hash function as hash(col3,..,coln):
from pyspark.sql.functions import col, hash

A = A.withColumn("hash", hash(*[col(colname) for colname in unique_cols_A]))
B = B.withColumn("hash", hash(*[col(colname) for colname in unique_cols_B]))
Now I want to write some spark code that basically selects the rows from B that have the hash not in A (so new rows and updated rows) and join them into a new dataframe together with the rows from A. How can I achieve this in pyspark?
Edit:
Dataframe B can have extra columns compared to dataframe A, so a union is not possible.
Sample example
Dataframe A:
+-----+-----+
|col_1|col_2|
+-----+-----+
| a| www|
| b| eee|
| c| rrr|
+-----+-----+
Dataframe B:
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
| a| wew| 1|
| d| yyy| 2|
| c| rer| 3|
+-----+-----+-----+
Result:
Dataframe C:
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
| a| wew| 1|
| b| eee| null|
| c| rer| 3|
| d| yyy| 2|
+-----+-----+-----+
This is closely related to update a dataframe column with new values, except that you also want to add the rows from DataFrame B. One approach would be to first do what is outlined in the linked question and then union the result with DataFrame B and drop duplicates.
For example:
import pyspark.sql.functions as f

dfA.alias('a').join(dfB.alias('b'), on=['col_1'], how='left')\
    .select(
        'col_1',
        f.when(
            ~f.isnull(f.col('b.col_2')),
            f.col('b.col_2')
        ).otherwise(f.col('a.col_2')).alias('col_2'),
        'b.col_3'
    )\
    .union(dfB)\
    .dropDuplicates()\
    .sort('col_1')\
    .show()
#+-----+-----+-----+
#|col_1|col_2|col_3|
#+-----+-----+-----+
#| a| wew| 1|
#| b| eee| null|
#| c| rer| 3|
#| d| yyy| 2|
#+-----+-----+-----+
Or more generically using a list comprehension if you have a lot of columns to replace and you don't want to hard code them all:
cols_to_update = ['col_2']
dfA.alias('a').join(dfB.alias('b'), on=['col_1'], how='left')\
    .select(
        *[
            ['col_1'] +
            [
                f.when(
                    ~f.isnull(f.col('b.{}'.format(c))),
                    f.col('b.{}'.format(c))
                ).otherwise(f.col('a.{}'.format(c))).alias(c)
                for c in cols_to_update
            ] +
            ['b.col_3']
        ]
    )\
    .union(dfB)\
    .dropDuplicates()\
    .sort('col_1')\
    .show()
I would opt for a different solution, which I believe is less verbose, more generic, and does not involve listing columns. I would first identify the subset of dfA that will be updated (replaceDf) by performing an inner join based on keyCols (a list of key column names). Then I would subtract this replaceDf from dfA and union the result with dfB.
replaceDf = dfA.alias('a').join(dfB.alias('b'), on=keyCols, how='inner').select('a.*')
resultDf = dfA.subtract(replaceDf).union(dfB)
resultDf.show()
Even though there will be different columns in dfA and dfB, you can still overcome this by obtaining the list of columns from both DataFrames and taking their union. Then, instead of select('a.*'), I would build a select that lists the columns of dfA that exist in dfB, plus "null as colname" for those that do not exist in dfB, as sketched below.
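A minimal sketch of that padding idea (not part of the original answer): take dfB's schema as the target, add typed null columns to dfA for anything it lacks, then reuse the subtract/union approach. The names b_types and dfA_padded are illustrative, and keyCols is assumed to be defined as above:
from pyspark.sql import functions as F

b_types = dict(dfB.dtypes)  # column name -> type string, taken from dfB
missing_in_A = [c for c in dfB.columns if c not in dfA.columns]

# Pad dfA with typed nulls for the columns it lacks, then order columns like dfB
# so that subtract and union line up by position.
dfA_padded = dfA.select(
    '*', *[F.lit(None).cast(b_types[c]).alias(c) for c in missing_in_A]
).select(*dfB.columns)

replaceDf = dfA_padded.alias('a').join(dfB.alias('b'), on=keyCols, how='inner').select('a.*')
resultDf = dfA_padded.subtract(replaceDf).union(dfB)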
If you want to keep only unique values, and require strictly correct results, then union followed by dropDuplicates should do the trick:
columns_which_dont_change = [...]
old_df.union(new_df).dropDuplicates(subset=columns_which_dont_change)

Chaining multiple groupBy in pyspark

My Data looks like this:
id | duration | action1 | action2 | ...
---------------------------------------------
1 | 10 | A | D
1 | 10 | B | E
2 | 25 | A | E
1 | 7 | A | G
I want to group it by ID (which works great!):
df.rdd.groupBy(lambda x: x['id']).mapValues(list).collect()
And now I would like to group values within each group by duration to get something like this:
[(id=1,
  [(duration=10, [(action1=A, action2=D), (action1=B, action2=E)]),
   (duration=7,  [(action1=A, action2=G)])]),
 (id=2,
  [(duration=25, [(action1=A, action2=E)])])]
And here is where I don't know how to do a nested group by. Any tips?
There is no need to serialize to rdd. Here's a generalized way to group by multiple columns and aggregate the rest of the columns into lists without hard-coding all of them:
from pyspark.sql.functions import collect_list
grouping_cols = ["id", "duration"]
other_cols = [c for c in df.columns if c not in grouping_cols]
df.groupBy(grouping_cols).agg(*[collect_list(c).alias(c) for c in other_cols]).show()
#+---+--------+-------+-------+
#| id|duration|action1|action2|
#+---+--------+-------+-------+
#| 1| 10| [A, B]| [D, E]|
#| 2| 25| [A]| [E]|
#| 1| 7| [A]| [G]|
#+---+--------+-------+-------+
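If you would rather keep action1 and action2 paired per original row, closer to the nested shape sketched in the question, one option (not from the original answer) is to collect a list of structs instead of separate lists:
from pyspark.sql.functions import collect_list, struct

nested = df.groupBy("id", "duration").agg(
    collect_list(struct("action1", "action2")).alias("actions")
)
nested.show(truncate=False)
# Each row now carries a list of (action1, action2) structs for its (id, duration) pair.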
Update
If you need to preserve the order of the actions, the best way is to use a pyspark.sql.Window with an orderBy(). This is because there seems to be some ambiguity as to whether or not a groupBy() following an orderBy() maintains that order.
Suppose your timestamps are stored in a column "ts". You should be able to do the following:
from pyspark.sql import Window

# Extend the frame to the whole partition: with only an orderBy, the default frame
# stops at the current row and collect_list would return growing, cumulative lists.
w = (Window.partitionBy(grouping_cols).orderBy("ts")
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

grouped_df = df.select(
    *(grouping_cols + [collect_list(c).over(w).alias(c) for c in other_cols])
).distinct()
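An alternative that avoids the window entirely (again assuming the ts column from above) is to collect structs keyed by the timestamp and sort each list afterwards; struct comparison goes field by field, so putting ts first yields a time-ordered list. The event_cols name is just illustrative:
from pyspark.sql.functions import collect_list, sort_array, struct

event_cols = [c for c in other_cols if c != "ts"]  # everything except the ordering key
ordered = df.groupBy(grouping_cols).agg(
    sort_array(collect_list(struct("ts", *event_cols))).alias("events")
)
ordered.show(truncate=False)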

PySpark aggregation function for "any value"

I have a PySpark Dataframe with an A field, few B fields that dependent on A (A->B) and C fields that I want to aggregate per each A. For example:
A | B | C
----------
A | 1 | 6
A | 1 | 7
B | 2 | 8
B | 2 | 4
I wish to group by A, present any value of B, and run an aggregation (let's say SUM) on C.
The expected result would be:
A | B | C
----------
A | 1 | 13
B | 2 | 12
SQL-wise I would do:
SELECT A, COALESCE(B) as B, SUM(C) as C
FROM T
GROUP BY A
What is the PySpark way to do that?
I can group by A and B together or select MIN(B) per each A, for example:
df.groupBy('A').agg(F.min('B').alias('B'),F.sum('C').alias('C'))
or
df.groupBy(['A','B']).agg(F.sum('C').alias('C'))
but that seems inefficient. Is there anything similar to SQL's coalesce in PySpark?
Thanks
You'll just need to use first instead:
from pyspark.sql.functions import first, sum, col
from pyspark.sql import Row
array = [Row(A="A", B=1, C=6),
         Row(A="A", B=1, C=7),
         Row(A="B", B=2, C=8),
         Row(A="B", B=2, C=4)]
df = sqlContext.createDataFrame(sc.parallelize(array))
results = df.groupBy(col("A")).agg(first(col("B")).alias("B"), sum(col("C")).alias("C"))
Let's now check the results:
results.show()
# +---+---+---+
# | A| B| C|
# +---+---+---+
# | B| 2| 12|
# | A| 1| 13|
# +---+---+---+
From the comments:
Is first here computationally equivalent to any?
groupBy causes a shuffle, so non-deterministic behaviour is to be expected.
Which is confirmed in the documentation of first:
Aggregate function: returns the first value in a group.
The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
Note: the function is non-deterministic because its results depend on the order of the rows, which may be non-deterministic after a shuffle.
So yes, computationally they are the same, and that's one of the reasons you need to use sorting if you need deterministic behaviour.
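As a sketch of what "use sorting" can look like in practice (not part of the original answer), you can take B from each group under an explicit window ordering, so the picked value is reproducible:
from pyspark.sql import Window
from pyspark.sql.functions import first, sum

w = Window.partitionBy("A").orderBy("B")
deterministic = (
    df.withColumn("B", first("B").over(w))  # B from the first row of each group in sorted order
      .groupBy("A", "B")
      .agg(sum("C").alias("C"))
)
deterministic.show()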
I hope this helps!

How to assign ranks to records in a spark dataframe based on some conditions?

Given a dataframe :
+-------+-------+
| A | B |
+-------+-------+
| a| 1|
+-------+-------+
| b| 2|
+-------+-------+
| c| 5|
+-------+-------+
| d| 7|
+-------+-------+
| e| 11|
+-------+-------+
I want to assign ranks to records based on conditions :
Start rank with 1
Assign rank = rank of previous record if ( B of current record - B of previous record ) is <= 2
Increment rank when ( B of current record - B of previous record ) is > 2
So I want result to be like this :
+-------+-------+------+
| A | B | rank |
+-------+-------+------+
| a| 1| 1|
+-------+-------+------+
| b| 2| 1|
+-------+-------+------+
| c| 5| 2|
+-------+-------+------+
| d| 7| 2|
+-------+-------+------+
| e| 11| 3|
+-------+-------+------+
Inbuilt functions in Spark like rowNumber, rank, and dense_rank don't provide any functionality to achieve this.
I tried doing it by using a global variable rank and fetching previous record values with the lag function, but it does not give consistent results due to distributed processing in Spark, unlike in SQL.
One more method I tried was passing lag values of records to a UDF while generating a new column and applying conditions in UDF. But the problem I am facing is I can get lag values for columns A as well as B but not for column rank.
This gives an error as it cannot resolve the column name rank:
HiveContext.sql("SELECT df.*, LAG(df.rank, 1, 0) OVER (ORDER BY B) AS rank_lag, udfGetVisitNo(B, rank_lag) as rank FROM df")
I cannot get lag value of a column which I am currently adding.
Also, I don't want methods which require using df.collect(), as this dataframe is quite large in size and collecting it on a single worker node results in memory errors.
Any other method by which I can achieve the same?
I would like to know a solution with time complexity O(n), n being the number of records.
A SQL solution would be
select a, b, 1 + sum(col) over (order by a) as rnk
from
(
    select t.*,
           case when b - lag(b, 1, b) over (order by a) <= 2 then 0 else 1 end as col
    from t
) x
The solution assumes the ordering is based on column a.
SQL Server example
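Since the question asks for PySpark, here is a sketch of the same logic translated to the DataFrame API. It is not part of the original answer; it mirrors the SQL above, assumes the question's dataframe is available as df, and uses the question's column names A and B:
from pyspark.sql import Window, functions as F

# Single global ordering, like "order by a" in the SQL; this pulls all rows into one
# partition, which is the usual trade-off for a strictly sequential ranking.
w = Window.orderBy("A")

ranked = (
    df.withColumn(
        "step",
        F.when(F.col("B") - F.lag("B", 1).over(w) > 2, 1).otherwise(0)  # 1 where the gap to the previous B exceeds 2
    )
    .withColumn("rank", F.lit(1) + F.sum("step").over(w))
    .drop("step")
)
ranked.show()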
