pyspark groupBy with multiple aggregates (like pandas) - python

I'm very new to pyspark and I'm attempting to transition my pandas code to pyspark. One thing I'm having issues with is aggregating my groupby.
Here is the pandas code:
df_trx_m = train1.groupby('CUSTOMER_NUMBER')['trx'].agg(['mean', 'var'])
I saw this example on AnalyticsVidhya but I'm not sure how to apply that to the code above:
train.groupby('Age').agg({'Purchase': 'mean'}).show()
Output:
+-----+-----------------+
|  Age|    avg(Purchase)|
+-----+-----------------+
|51-55|9534.808030960236|
|46-50|9208.625697468327|
| 0-17|8933.464640444974|
|36-45|9331.350694917874|
|26-35|9252.690632869888|
|  55+|9336.280459449405|
|18-25|9169.663606261289|
+-----+-----------------+
Any help would be much appreciated.
EDIT:
Here's another attempt:
from pyspark.sql.functions import avg, variance
train1.groupby("CUSTOMER_NUMBER")\
    .agg(
        avg('repatha_trx').alias("repatha_trx_avg"),
        variance('repatha_trx').alias("repatha_trx_Var")
    )\
    .show(100)
But that is just giving me an empty dataframe.

You can import pyspark functions to perform aggregation.
# load function
from pyspark.sql import functions as F
# aggregate data
df_trx_m = train1.groupby('CUSTOMER_NUMBER').agg(
    F.avg(F.col('repatha_trx')).alias('repatha_trx_avg'),
    F.variance(F.col('repatha_trx')).alias('repatha_trx_var')
)
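Note that pyspark.sql.functions.variance() is an alias for pyspark.sql.functions.var_samp() and returns the unbiased sample variance, which is also what pandas computes for 'var' by default. Use pyspark.sql.functions.var_pop() if you need the population variance instead.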
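If it helps to see the difference between the sample and the population variance on a small example, here is a minimal sketch in plain Python using only the standard library. The numbers are made up for illustration and are not from the question's data.

```python
import statistics

# Hypothetical transaction values for a single customer (illustrative only).
trx = [2.0, 4.0, 4.0, 6.0]

# Sample variance divides by n - 1 (what var_samp and pandas 'var' compute).
sample_var = statistics.variance(trx)

# Population variance divides by n (what var_pop computes).
population_var = statistics.pvariance(trx)

print(sample_var, population_var)
```

Comparing a handful of groups this way against the Spark output is a quick check that the aggregation uses the variance definition you expect.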

Related

Pandas to Pyspark environment

newlist = []
for column in new_columns:
    count12 = new_df.loc[new_df[column].diff() == 1]
new_df2=new_df2.groupby(['my_id','friend_id','family_id','colleage_id']).apply(len)
I can't find an option in pyspark for getting the length of each group.
How can we achieve the same thing in pyspark?
Thanks in advance..
Here, apply(len) is just an aggregation that counts the rows in each group produced by groupby. You can do the very same thing in basic PySpark syntax:
import pyspark.sql.functions as F
(df
    .groupBy('my_id','friend_id','family_id','colleage_id')
    .agg(F.count('*'))
    .show()
)
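As a rough illustration of what both groupby(...).apply(len) and groupBy(...).agg(F.count('*')) compute, here is a plain-Python sketch. The rows are hypothetical; only the column order (my_id, friend_id, family_id, colleage_id) follows the question.

```python
from collections import Counter

# Hypothetical rows: (my_id, friend_id, family_id, colleage_id)
rows = [
    (1, 10, 100, 1000),
    (1, 10, 100, 1000),
    (2, 20, 200, 2000),
]

# Counting how often each distinct key tuple occurs gives the size of each group.
group_sizes = Counter(rows)

for key, size in group_sizes.items():
    print(key, size)
```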

PySpark Pandas UDF applied with Series changes Data Type from Float to Int

This is what I am trying to do:
I am still learning PySpark and right now I am exploring Pandas UDFs. The picture embedded above is self-explanatory. I have the column long with values such as -122.23. Then I define a UDF where I simply want to turn that into a positive number. But, even though I tried also forcing the pd.Series to be cast as float, or multiplying the series with -1.0, the results I get are rounded to the nearest integer.
The same thing happens when I create a UDF which adds or subtracts. I am losing the floating point precision here, and I couldn't find anything on the web or work out how to search for an answer to this issue.
You are using long as the type in pandas_udf. To represent floating-point numbers, use DoubleType (or FloatType, depending on the precision you need).
UDF example:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
@F.pandas_udf(DoubleType())
def neg_to_pos(x):
    # x is a pandas Series; abs() is applied element-wise
    return abs(x)
spark = SparkSession.builder.getOrCreate()
data = [(-1.2331,), (1.0,), (234.23,), (-0.233,)]
df = spark.createDataFrame(data, ["a"])
df = df.select(neg_to_pos(df.a))
Where input DF is:
+-------+
|a      |
+-------+
|-1.2331|
|1.0    |
|234.23 |
|-0.233 |
+-------+
and result is:
+-------------+
|neg_to_pos(a)|
+-------------+
|1.2331       |
|1.0          |
|234.23       |
|0.233        |
+-------------+
However, the same can be done without UDF, using just one line:
df = df.select(F.abs("a"))

Databricks: Python pivot table in spark dataframe

Can anyone give me some guidance on pivot tables using a Spark dataframe in Python?
I am getting the following error: Column is not iterable
Does anyone have an idea?
The pivot function pivots a column of the current DataFrame and performs the specified aggregation. There are two versions of pivot: one that requires the caller to specify the list of distinct values to pivot on, and one that does not.
With column values specified: df.groupBy("year").pivot("course", ["dotNET", "Java"]).sum("earnings")
Without column values specified (more concise but less efficient): df.groupBy("year").pivot("course").sum("earnings")
You are heading in the right direction. Sample working code (Python 2):
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.functions import col
>>> spark = SparkSession.builder.master('local').appName('app').getOrCreate()
>>> df = spark.read.option('header', 'true').csv('pivot.csv')
>>> df = df.withColumn('value1', col('value1').cast("int"))
>>> pdf = df.groupBy('thisyear').pivot('month', ['JAN','FEB']).sum('value1')
>>> pdf.show(10)
+--------+---+---+
|thisyear|JAN|FEB|
+--------+---+---+
|    2019|  3|  2|
+--------+---+---+
//pivot.csv
thisyear,month,value1
2019,JAN,1
2019,JAN,1
2019,FEB,1
2019,JAN,1
2019,FEB,1
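
To make the pivot result above easier to verify, here is a minimal plain-Python sketch of the same computation: group by thisyear, turn each month into a column, and sum value1. It uses the rows of pivot.csv shown above and the standard library only. It is an illustration of the logic, not how Spark implements pivot.

```python
from collections import defaultdict

# Rows copied from pivot.csv: (thisyear, month, value1)
rows = [
    (2019, "JAN", 1),
    (2019, "JAN", 1),
    (2019, "FEB", 1),
    (2019, "JAN", 1),
    (2019, "FEB", 1),
]

# One inner dict per group key (thisyear); each month acts as a pivoted column.
pivoted = defaultdict(lambda: defaultdict(int))
for year, month, value in rows:
    pivoted[year][month] += value

result = {year: dict(months) for year, months in pivoted.items()}
print(result)
```

The sums per month agree with the JAN and FEB columns in the Spark output.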

How to zip two array columns in Spark SQL

I have a Pandas dataframe. I have tried to join two columns containing string values into a list first and then using zip, I joined each element of the list with '_'. My data set is like below:
df['column_1']: 'abc, def, ghi'
df['column_2']: '1.0, 2.0, 3.0'
I wanted to join these two columns in a third column like below for each row of my dataframe.
df['column_3']: [abc_1.0, def_2.0, ghi_3.0]
I have successfully done so in python using the code below but the dataframe is quite large and it takes a very long time to run it for the whole dataframe. I want to do the same thing in PySpark for efficiency. I have read the data in spark dataframe successfully but I'm having a hard time determining how to replicate Pandas functions with PySpark equivalent functions. How can I get my desired result in PySpark?
df['column_3'] = df['column_2']
for index, row in df.iterrows():
    while index < 3:
        if isinstance(row['column_1'], str):
            row['column_1'] = list(row['column_1'].split(','))
            row['column_2'] = list(row['column_2'].split(','))
            row['column_3'] = ['_'.join(map(str, i)) for i in zip(list(row['column_1']), list(row['column_2']))]
I have converted the two columns to arrays in PySpark by using the below code
from pyspark.sql.types import ArrayType, IntegerType, StringType
from pyspark.sql.functions import col, split
crash.withColumn("column_1",
    split(col("column_1"), r",\s*").cast(ArrayType(StringType())).alias("column_1")
)
crash.withColumn("column_2",
    split(col("column_2"), r",\s*").cast(ArrayType(StringType())).alias("column_2")
)
Now all I need is to zip each element of the arrays in the two columns using '_'. How can I use zip with this? Any help is appreciated.
A Spark SQL equivalent of Python's zip would be pyspark.sql.functions.arrays_zip:
pyspark.sql.functions.arrays_zip(*cols)
Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
So if you already have two arrays:
from pyspark.sql.functions import split
df = (spark
    .createDataFrame([('abc, def, ghi', '1.0, 2.0, 3.0')])
    .toDF("column_1", "column_2")
    .withColumn("column_1", split("column_1", r"\s*,\s*"))
    .withColumn("column_2", split("column_2", r"\s*,\s*")))
You can just apply it on the result
from pyspark.sql.functions import arrays_zip
df_zipped = df.withColumn(
    "zipped", arrays_zip("column_1", "column_2")
)
df_zipped.select("zipped").show(truncate=False)
+------------------------------------+
|zipped                              |
+------------------------------------+
|[[abc, 1.0], [def, 2.0], [ghi, 3.0]]|
+------------------------------------+
Now, to combine the results, you can use transform (see "How to use transform higher-order function?" and "TypeError: Column is not iterable - How to iterate over ArrayType()?"):
from pyspark.sql.functions import expr
df_zipped_concat = df_zipped.withColumn(
    "zipped_concat",
    expr("transform(zipped, x -> concat_ws('_', x.column_1, x.column_2))")
)
df_zipped_concat.select("zipped_concat").show(truncate=False)
+---------------------------+
|zipped_concat              |
+---------------------------+
|[abc_1.0, def_2.0, ghi_3.0]|
+---------------------------+
Note:
The higher-order function transform and the arrays_zip function were introduced in Apache Spark 2.4.
For Spark 2.4+, this can be done using only the zip_with function, which zips and concatenates at the same time:
from pyspark.sql.functions import expr
df.withColumn("column_3", expr("zip_with(column_1, column_2, (x, y) -> concat(x, '_', y))"))
The higher-order function zip_with merges the 2 arrays element-wise using the lambda function (x, y) -> concat(x, '_', y).
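For intuition, the element-wise merge that zip_with performs on one row can be sketched in plain Python. The zip_with helper below is a hypothetical stand-in written for this illustration, not the Spark function, and it assumes both arrays have the same length (Spark pads the shorter array with nulls, while Python's zip would silently truncate).

```python
# Values of column_1 and column_2 for one row, after splitting on commas.
column_1 = ["abc", "def", "ghi"]
column_2 = ["1.0", "2.0", "3.0"]


def zip_with(left, right, func):
    """Hypothetical helper: apply func to each pair of elements (equal lengths assumed)."""
    return [func(x, y) for x, y in zip(left, right)]


# Same lambda shape as (x, y) -> concat(x, '_', y) in the SQL expression.
column_3 = zip_with(column_1, column_2, lambda x, y: x + "_" + y)
print(column_3)
```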
You can also use a UDF to zip the split array columns:
df = spark.createDataFrame([('abc,def,ghi','1.0,2.0,3.0')], ['col1','col2'])
+-----------+-----------+
|col1       |col2       |
+-----------+-----------+
|abc,def,ghi|1.0,2.0,3.0|
+-----------+-----------+ ## Hope this is how your dataframe is
from pyspark.sql import functions as F
from pyspark.sql.types import *
def concat_udf(*args):
    return ['_'.join(x) for x in zip(*args)]
udf1 = F.udf(concat_udf,ArrayType(StringType()))
df = df.withColumn('col3',udf1(F.split(df.col1,','),F.split(df.col2,',')))
df.show(1,False)
+-----------+-----------+---------------------------+
|col1       |col2       |col3                       |
+-----------+-----------+---------------------------+
|abc,def,ghi|1.0,2.0,3.0|[abc_1.0, def_2.0, ghi_3.0]|
+-----------+-----------+---------------------------+
For Spark 3.1+, pyspark.sql.functions.zip_with() is available and accepts a Python lambda function, so it can be done like this:
import pyspark.sql.functions as F
df = df.withColumn("column_3", F.zip_with("column_1", "column_2", lambda x,y: F.concat_ws("_", x, y)))

Transform string column to vector column Spark DataFrames

I have a Spark dataframe that looks as follows:
+-----------+-------------------+
|    ID     |     features      |
+-----------+-------------------+
|   18156431|(5,[0,1,4],[1,1,1])|
|   20260831|(5,[0,4,5],[2,1,1])|
|   91859831|(5,[0,1],[1,3])    |
|  206186631|(5,[3,4,5],[1,5])  |
|  223134831|(5,[2,3,5],[1,1,1])|
+-----------+-------------------+
In this dataframe the features column is a sparse vector. In my scripts I have to save this DF as a file on disk. When doing this, the features column is saved as a text column, for example "(5,[0,1,4],[1,1,1])".
When importing again in Spark the column stays string, as you could expect. How can I convert the column back to (sparse) vector format?
This is not particularly efficient due to UDF overhead (it would be a better idea to use a file format that preserves types), but you can do something like this:
from pyspark.mllib.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf
df = sc.parallelize([
    (18156431, "(5,[0,1,4],[1,1,1])")
]).toDF(["id", "features"])
parse = udf(lambda s: Vectors.parse(s), VectorUDT())
df.select(parse("features"))
Please note that this doesn't port directly to Spark 2.0.0+ and ML Vector. Since ML vectors don't provide a parse method, you'd have to parse to MLlib and use asML:
from pyspark.ml.linalg import VectorUDT  # ML (not MLlib) UDT for the return type
parse = udf(lambda s: Vectors.parse(s).asML(), VectorUDT())
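In case it is unclear what the text form "(size,[indices],[values])" encodes, here is a small plain-Python sketch that splits the string into its parts and expands it into a dense list. The parse_sparse helper is hypothetical and only illustrates the encoding. It is not how Vectors.parse is implemented and is not a replacement for it.

```python
import ast


def parse_sparse(text):
    """Hypothetical helper: split '(size,[indices],[values])' into its three parts."""
    size, indices, values = ast.literal_eval(text)
    return size, list(indices), [float(v) for v in values]


size, indices, values = parse_sparse("(5,[0,1,4],[1,1,1])")

# Expand to a dense list: positions not listed in indices stay at 0.0.
dense = [0.0] * size
for i, v in zip(indices, values):
    dense[i] = v

print(size, indices, values, dense)
```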
