Spark reduceByKey on several different values - python

I have a table stored as an RDD of lists, on which I want to perform something akin to a groupby in SQL or pandas, taking the sum or average for every variable.
The way I currently do it is this (untested code):
from operator import add

l = [(3, add), (4, add)]  # (column index, aggregation function)
results = {}
for i, (col_idx, agg_func) in enumerate(l):
    pairs = RDD.map(lambda x, c=col_idx: (x[6], float(x[c])))
    results[i] = pairs.reduceByKey(agg_func)
Then I'll need to join all the RDDs in results.
This isn't very efficient though. Is there a better way?

If you are using Spark >= 1.3, you could look at the DataFrame API.
In the pyspark shell:
import numpy as np
# create a DataFrame (this can also be from an RDD)
df = sqlCtx.createDataFrame([[float(v) for v in row] for row in np.random.rand(50, 3)])
df.agg({col: "mean" for col in df.columns}).collect()
This outputs:
[Row(AVG(_3#1456)=0.5547187588389414, AVG(_1#1454)=0.5149476209374797, AVG(_2#1455)=0.5022967093047612)]
The available aggregate methods are "avg"/"mean", "max", "min", "sum", "count".
To get several aggregations for the same column, you can call agg with a list of explicitly constructed aggregations rather than with a dictionary:
from pyspark.sql import functions as F
df.agg(*[F.min(col) for col in df.columns] + [F.avg(col) for col in df.columns]).collect()
Or for your case:
df.agg(F.count(df.var3), F.max(df.var3))  # etc.
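Since the original RDD code groups by x[6], the DataFrame equivalent is a groupBy followed by agg. A minimal sketch, assuming hypothetical column names ("key" standing in for x[6], "var3" and "var4" for the columns being aggregated):
from pyspark.sql import functions as F

result = (df
    .groupBy("key")
    .agg(F.sum("var3").alias("sum_var3"),
         F.avg("var4").alias("avg_var4")))
result.show()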

Related

PySpark: Fastest way of counting values in multiple columns

I need to count a value in several columns, and I want all those individual counts for each column in a list.
Is there a faster/better way of doing this? Because my solution takes quite some time.
from pyspark.sql.functions import col

dataframe.cache()
counts = [dataframe.filter(col(str(i)) == "value").count() for i in range(150)]
You can do a conditional count aggregation:
import pyspark.sql.functions as F
df2 = df.agg(*[
    F.count(F.when(F.col(str(i)) == "value", 1)).alias(str(i))
    for i in range(150)
])
result = df2.toPandas().transpose()[0].tolist()
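This works because when() without an otherwise() yields null for non-matching rows, and count() skips nulls. If you prefer, an equivalent formulation sums a 0/1 indicator instead (a sketch reusing the same F import):
df2_alt = df.agg(*[
    F.sum(F.when(F.col(str(i)) == "value", 1).otherwise(0)).alias(str(i))
    for i in range(150)
])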
You can try the following approach/design:
Write a map function for each row of the data frame, like this:
VALUE = 'value'

def row_mapper(df_row):
    return [each == VALUE for each in df_row]
Write a reduce function for the data frame that takes two rows as input:
def reduce_rows(df_row1, df_row2):
    return [x + y for x, y in zip(df_row1, df_row2)]
Note: these are plain Python functions to help you understand the idea, not UDFs you can apply directly in PySpark.
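For context, a minimal sketch of how these two functions could be wired together on the DataFrame's underlying RDD (assuming, as in the question, 150 string columns named "0" through "149", in that order):
counts = (dataframe.rdd
          .map(row_mapper)       # each Row -> list of booleans, one per column
          .reduce(reduce_rows))  # element-wise sums across all rows
# counts[i] is then the number of rows where column str(i) == "value"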

pandas apply function to each group (output is not really an aggregation)

I have a list of time series (each a pandas DataFrame) and want to calculate the matrix profile for each time series (of a device).
One option is to iterate over all the devices, which seems to be slow.
A second option would be to group by the devices and apply a UDF. The problem is that the UDF returns rows 1:1, i.e. not a single scalar value per group but the same number of rows as the input.
Is it still possible to somehow vectorize this calculation for each group when 1:1 (or at least non-scalar) values are returned?
import pandas as pd

df = pd.DataFrame({
    'foo': [1, 2, 3], 'baz': [1.1, 0.5, 4], 'bar': [1, 2, 1]
})
display(df)
print('***************************')

# slow version retaining all the rows
for g in df.bar.unique():
    print(g)
    this_group = df[df.bar == g].copy()
    # perform a UDF which needs to have all the values per group,
    # i.e. in reality I want to calculate the matrix profile for each time series of a device
    this_group['result'] = this_group.baz.apply(lambda x: 1)
    display(this_group)
    print('***************************')
def my_non_scalar1_1_agg_function(x):
    display(pd.DataFrame(x))
    return x
# neatly vectorized application of a non_scalar function
# but this fails as: Must produce aggregated value
df = df.groupby(['bar']).baz.agg(my_non_scalar1_1_agg_function)
display(df)
For non-aggregating functions applied to each distinct group, i.e. functions that do not return a single scalar value per group, you need to iterate the method across groups and then compile the results together.
Therefore, consider a list or dict comprehension using groupby(), followed by concat. Be sure the method takes and returns a full data frame, Series, or ndarray.
# LIST COMPREHENSION
df_list = [ myfunction(sub) for index, sub in df.groupby(['group_column']) ]
final_df = pd.concat(df_list)
# DICT COMPREHENSION
df_dict = { index: myfunction(sub) for index, sub in df.groupby(['group_column']) }
final_df = pd.concat(df_dict, ignore_index=True)
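For illustration, a hypothetical myfunction that returns one row per input row (a stand-in for the real matrix profile computation), applied to the question's df with 'bar' as the group column:
def myfunction(sub):
    # stand-in for the matrix profile: any per-group, row-preserving computation
    sub = sub.copy()
    sub['result'] = sub['baz'].rolling(2, min_periods=1).mean()
    return sub

final_df = pd.concat([myfunction(sub) for _, sub in df.groupby('bar')],
                     ignore_index=True)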
Indeed, this (see also the link above in the comments) is a way to get it to work in a faster, more desirable way. Perhaps there is an even better alternative:
import pandas as pd

df = pd.DataFrame({
    'foo': [1, 2, 3], 'baz': [1.1, 0.5, 4], 'bar': [1, 2, 1]
})
display(df)

grouped_df = df.groupby(['bar'])
altered = []
for index, subframe in grouped_df:
    display(subframe)
    subframe = subframe  # obviously we need to apply the UDF here, not this idempotent operation (= doing nothing)
    altered.append(subframe)
    print(index)
    # print(subframe)
pd.concat(altered, ignore_index=True)
# pd.DataFrame(altered)

Filtering counts in Pandas .agg

I am attempting to create a new Pandas Dataframe with specific counts from an existing dataframe (grouping by date and department).
I have read the documentation here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html
and have constructed the following:
new_values = df.groupby(['department',pd.to_datetime(df.date).dt.strftime('%m/%Y')]).agg({'id':"count", 'confirmed':"count"})
I am having difficulty with the syntax, as I want the second count to count only where 'confirmed'=='1'.
I can do this using a second dataframe
filtered_values = df[df['confirmed']==1]
taking another count, and then merging them back together. However, is there a way to do it within the aggregation listed above?
If the values are just 0 and 1, you can sum directly:
new_values = df.groupby(['department',pd.to_datetime(df.date).dt.strftime('%m/%Y')]).agg({'id':"count", 'confirmed': lambda x: x.sum()})
In any other case (more distinct values), you can do:
new_values = df.groupby(['department',pd.to_datetime(df.date).dt.strftime('%m/%Y')]).agg({'id':"count", 'confirmed': lambda x: x.eq(1).sum()})
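A minimal runnable sketch on hypothetical data (column names taken from the question, values invented):
import pandas as pd

df = pd.DataFrame({
    'department': ['a', 'a', 'b'],
    'date': ['2021-01-05', '2021-01-20', '2021-02-01'],
    'id': [1, 2, 3],
    'confirmed': [1, 0, 1],
})
new_values = (df
    .groupby(['department', pd.to_datetime(df.date).dt.strftime('%m/%Y')])
    .agg({'id': 'count', 'confirmed': lambda x: x.eq(1).sum()}))
print(new_values)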

How to zip two array columns in Spark SQL

I have a Pandas dataframe. I first split the two string-valued columns into lists and then, using zip, joined the corresponding elements of the lists with '_'. My data set is like the one below:
df['column_1']: 'abc, def, ghi'
df['column_2']: '1.0, 2.0, 3.0'
I wanted to join these two columns in a third column like below for each row of my dataframe.
df['column_3']: [abc_1.0, def_2.0, ghi_3.0]
I have successfully done so in python using the code below but the dataframe is quite large and it takes a very long time to run it for the whole dataframe. I want to do the same thing in PySpark for efficiency. I have read the data in spark dataframe successfully but I'm having a hard time determining how to replicate Pandas functions with PySpark equivalent functions. How can I get my desired result in PySpark?
df['column_3'] = df['column_2']
for index, row in df.iterrows():
    if index < 3:
        if isinstance(row['column_1'], str):
            row['column_1'] = list(row['column_1'].split(','))
            row['column_2'] = list(row['column_2'].split(','))
            row['column_3'] = ['_'.join(map(str, i)) for i in zip(list(row['column_1']), list(row['column_2']))]
I have converted the two columns to arrays in PySpark by using the below code
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import col, split

crash = crash.withColumn(
    "column_1", split(col("column_1"), r",\s*").cast(ArrayType(StringType()))
)
crash = crash.withColumn(
    "column_2", split(col("column_2"), r",\s*").cast(ArrayType(StringType()))
)
Now all I need is to zip each element of the arrays in the two columns using '_'. How can I use zip with this? Any help is appreciated.
A Spark SQL equivalent of Python's zip would be pyspark.sql.functions.arrays_zip:
pyspark.sql.functions.arrays_zip(*cols)
Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
So if you already have two arrays:
from pyspark.sql.functions import split

df = (spark
    .createDataFrame([('abc, def, ghi', '1.0, 2.0, 3.0')])
    .toDF("column_1", "column_2")
    .withColumn("column_1", split("column_1", r"\s*,\s*"))
    .withColumn("column_2", split("column_2", r"\s*,\s*")))
You can just apply it to the result:
from pyspark.sql.functions import arrays_zip
df_zipped = df.withColumn(
"zipped", arrays_zip("column_1", "column_2")
)
df_zipped.select("zipped").show(truncate=False)
+------------------------------------+
|zipped |
+------------------------------------+
|[[abc, 1.0], [def, 2.0], [ghi, 3.0]]|
+------------------------------------+
Now, to combine the results, you can use transform (see How to use transform higher-order function? and TypeError: Column is not iterable - How to iterate over ArrayType()?):
from pyspark.sql.functions import expr

df_zipped_concat = df_zipped.withColumn(
    "zipped_concat",
    expr("transform(zipped, x -> concat_ws('_', x.column_1, x.column_2))")
)
df_zipped_concat.select("zipped_concat").show(truncate=False)
+---------------------------+
|zipped_concat |
+---------------------------+
|[abc_1.0, def_2.0, ghi_3.0]|
+---------------------------+
Note:
The higher-order function transform and arrays_zip were introduced in Apache Spark 2.4.
For Spark 2.4+, this can be done using only the zip_with function, which zips and concatenates at the same time:
from pyspark.sql.functions import expr

df = df.withColumn("column_3", expr("zip_with(column_1, column_2, (x, y) -> concat(x, '_', y))"))
The higher-order function takes two arrays to merge element-wise, using the lambda function (x, y) -> concat(x, '_', y).
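Putting it together, a minimal end-to-end sketch (assuming the same sample data as above and Spark 2.4+):
from pyspark.sql.functions import expr, split

df = (spark.createDataFrame([('abc, def, ghi', '1.0, 2.0, 3.0')],
                            ['column_1', 'column_2'])
      .withColumn("column_1", split("column_1", r"\s*,\s*"))
      .withColumn("column_2", split("column_2", r"\s*,\s*"))
      .withColumn("column_3",
                  expr("zip_with(column_1, column_2, (x, y) -> concat(x, '_', y))")))
df.select("column_3").show(truncate=False)
# +---------------------------+
# |column_3                   |
# +---------------------------+
# |[abc_1.0, def_2.0, ghi_3.0]|
# +---------------------------+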
You can also use a UDF to zip the split array columns:
df = spark.createDataFrame([('abc,def,ghi','1.0,2.0,3.0')], ['col1','col2'])
+-----------+-----------+
|col1 |col2 |
+-----------+-----------+
|abc,def,ghi|1.0,2.0,3.0|
+-----------+-----------+ ## Hope this is how your dataframe is
from pyspark.sql import functions as F
from pyspark.sql.types import *
def concat_udf(*args):
    return ['_'.join(x) for x in zip(*args)]
udf1 = F.udf(concat_udf,ArrayType(StringType()))
df = df.withColumn('col3',udf1(F.split(df.col1,','),F.split(df.col2,',')))
df.show(1,False)
+-----------+-----------+---------------------------+
|col1 |col2 |col3 |
+-----------+-----------+---------------------------+
|abc,def,ghi|1.0,2.0,3.0|[abc_1.0, def_2.0, ghi_3.0]|
+-----------+-----------+---------------------------+
For Spark 3.1+, pyspark.sql.functions.zip_with() now accepts a Python lambda function, so it can be done like this:
import pyspark.sql.functions as F
df = df.withColumn("column_3", F.zip_with("column_1", "column_2", lambda x,y: F.concat_ws("_", x, y)))

Dynamically add columns to dataframe via apply

The following code applies a function f to a dataframe column data_df["c"] and concatenates the results to the original dataframe, i.e. it adds 1024 columns to the dataframe data_df.
data_df = apply_and_concat(data_df, "c", lambda x: f(x, y), [y + "-dim" + str(i) for i in range(0,1024)])
def apply_and_concat(df, field, func, column_names):
    return pd.concat((
        df,
        df[field].apply(
            lambda cell: pd.Series(func(cell), index=column_names))), axis=1)
The problem is that I want to execute this dynamically, meaning that I don't know how many columns it returns. f returns a list. Is there any better or easier way to add these columns without needing to specify the number of columns beforehand?
Your use of pd.concat((df, df[field].apply(...)), axis=1) already solves the main task well. It seems like your main question really boils down to "how do I name an unknown number of columns", where you're happy to use a name based on sequential integers. For that, use itertools.count():
import itertools
f_modified = lambda x: dict(zip(
    ('{}-dim{}'.format(y, i) for i in itertools.count()),
    f(x, y)
))
Then use f_modified instead of f. That way, you get a dictionary instead of a list, with an arbitrary number of dynamically generated names as keys. When converting this dictionary to a Series, you'll end up with the keys being used as the index, so you don't need to provide an explicit list as the index, and hence don't need to know the number of columns in advance.
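For example, the modified function could then be used like this (a sketch reusing the data_df, "c", f, and y names from the question):
# the dict keys become the Series index, i.e. the new column names,
# so the number of returned columns never needs to be known up front
expanded = data_df["c"].apply(lambda cell: pd.Series(f_modified(cell)))
data_df = pd.concat((data_df, expanded), axis=1)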
