I have a PySpark DataFrame with an A field, a few B fields that depend on A (A -> B), and C fields that I want to aggregate per each A. For example:
A | B | C
----------
A | 1 | 6
A | 1 | 7
B | 2 | 8
B | 2 | 4
I wish to group by A, present any value of B, and run an aggregation (let's say SUM) on C.
The expected result would be:
A | B | C
----------
A | 1 | 13
B | 2 | 12
SQL-wise I would do:
SELECT A, COALESCE(B) as B, SUM(C) as C
FROM T
GROUP BY A
What is the PySpark way to do that?
I can group by A and B together or select MIN(B) per each A, for example:
df.groupBy('A').agg(F.min('B').alias('B'),F.sum('C').alias('C'))
or
df.groupBy(['A','B']).agg(F.sum('C').alias('C'))
but that seems inefficient. Is there anything similar to SQL's COALESCE in PySpark?
Thanks
You'll just need to use first instead:
from pyspark.sql.functions import first, sum, col
from pyspark.sql import Row
array = [Row(A="A", B=1, C=6),
Row(A="A", B=1, C=7),
Row(A="B", B=2, C=8),
Row(A="B", B=2, C=4)]
df = sqlContext.createDataFrame(sc.parallelize(array))
results = df.groupBy(col("A")).agg(first(col("B")).alias("B"), sum(col("C")).alias("C"))
Let's now check the results:
results.show()
# +---+---+---+
# | A| B| C|
# +---+---+---+
# | B| 2| 12|
# | A| 1| 13|
# +---+---+---+
From the comments:
Is first here computationally equivalent to any?
groupBy causes a shuffle, so non-deterministic behaviour is to be expected.
Which is confirmed in the documentation of first:
Aggregate function: returns the first value in a group.
The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
note:: The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle.
So yes, computationally they are the same, and that's one of the reasons you need to use sorting if you need deterministic behaviour.
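For instance, here is a minimal sketch of a deterministic variant, assuming you are willing to pick B according to an explicit ordering (C is used below purely as an example ordering column):
from pyspark.sql import functions as F, Window

# Take B from the first row of each group under a fixed ordering,
# then aggregate; max over identical values is deterministic.
w = Window.partitionBy("A").orderBy("C")
deterministic = (
    df.withColumn("B", F.first("B").over(w))
      .groupBy("A")
      .agg(F.max("B").alias("B"), F.sum("C").alias("C"))
)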
I hope this helps!
I have a PySpark dataframe like this:
A | B
--------------------
1 | abc_value
2 | abc_value
3 | some_other_value
4 | anything_else
I have a mapping dictionary:
d = {
    "abc": "X",
    "some_other": "Y",
    "anything": "Z"
}
I need to create a new column in my original DataFrame which should look like this:
A | B                | C
------------------------
1 | abc_value        | X
2 | abc_value        | X
3 | some_other_value | Y
4 | anything_else    | Z
I tried mapping like this:
mapping_expr = f.create_map([f.lit(x) for x in chain(*d.items())])
and then applying it with withColumn. However, it does exact matching, whereas I need partial (regex) matching, as you can see.
How to accomplish this, please?
I'm afraid in PySpark there's no implemented function that extracts substrings according to a defined dictionary; you probably need to resort to tricks.
In this case, you can first create a search string which includes all your dictionary keys to be searched:
keys = list(d.keys())
keys_expr = '|'.join(keys)
keys_expr
# 'abc|some_other|anything'
Then you can use regexp_extract to extract the first key from keys_expr that we encounter in column B, if present (that's the reason for the | operator).
Finally, you can use dictionary d to replace the values in the new column.
import pyspark.sql.functions as F
df = df\
    .withColumn('C', F.regexp_extract('B', keys_expr, 0))\
    .replace(d, subset=['C'])
df.show()
+---+----------------+---+
| A| B| C|
+---+----------------+---+
| 1| abc_value| X|
| 2| abc_value| X|
| 3|some_other_value| Y|
| 4| anything_else| Z|
+---+----------------+---+
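One caveat worth adding (a general property of regex alternation, not something from the answer above): if some dictionary keys are prefixes of other keys, regexp_extract will return the leftmost alternative that matches, so you may want to build keys_expr with the longer keys first, for example:
# Hedged tweak: try longer keys before shorter ones in the pattern.
keys_expr = '|'.join(sorted(d.keys(), key=len, reverse=True))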
So, I have a PySpark dataframe of this type:
Group | Value
-------------
A     | 12
B     | 10
A     | 1
B     | 0
B     | 1
A     | 6
and I'd like to perform an operation that generates a DataFrame with each value standardised with respect to its group.
In short, I should have:
Group | Value
--------------------
A     | 1.26012384
B     | 1.4083737
A     | -1.18599891
B     | -0.81537425
B     | -0.59299945
A     | -0.07412493
I think this should be done using a groupBy and then some agg operation, but honestly I'm not really sure how to do it.
You can calculate the mean and stddev in each group using Window functions:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'Value',
    (F.col('Value') - F.mean('Value').over(Window.partitionBy('Group'))) /
    F.stddev_pop('Value').over(Window.partitionBy('Group'))
)
df2.show()
+-----+--------------------+
|Group| Value|
+-----+--------------------+
| B| 1.4083737016560922|
| B| -0.8153742483272112|
| B| -0.5929994533288808|
| A| 1.2601238383238722|
| A| -1.1859989066577619|
| A|-0.07412493166611006|
+-----+--------------------+
Note that the order of the results will be randomized because Spark dataframes do not have indices.
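If you prefer the groupBy/agg route hinted at in the question, a minimal alternative sketch (same column names, Group and Value) is to compute the per-group statistics and join them back:
from pyspark.sql import functions as F

# Per-group mean and population standard deviation, joined back onto the rows.
stats = df.groupBy('Group').agg(
    F.mean('Value').alias('mean_value'),
    F.stddev_pop('Value').alias('stddev_value')
)
df2 = (
    df.join(stats, on='Group')
      .withColumn('Value', (F.col('Value') - F.col('mean_value')) / F.col('stddev_value'))
      .drop('mean_value', 'stddev_value')
)
Both versions compute the same values; the window version simply avoids the explicit join.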
I would like to use Pandas' groupby with multiple aggregation functions, but also include conditional statements per aggregation. Imagine having this data as an example:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': ['a', 'a', 'a', 'b', 'b'],
    'type': ['in_scope', 'in_scope', 'exclude', 'in_scope', 'exclude'],
    'value': [5, 5, 99, 20, 99]
})
INPUT DATA:
| id | in_scope | value |
|----|----------|-------|
| a | True | 5 |
| a | True | 5 |
| a | False | 99 |
| b | True | 20 |
| b | False | 99 |
And I want to do a Pandas groupby like this:
df.groupby('id').agg(
    num_records=('id', 'size'),
    sum_value=('value', np.sum)
)
OUTPUT OF SIMPLE GROUPBY:
| id | num_records | sum_value |
|----|-------------|-----------|
| a | 3 | 109 |
| b | 2 | 119 |
However, I would like to do the sum depending on a condition, namely that only the records marked as in scope (True in the in_scope column) should be used. Note that the first aggregation should still use the entire table. In short, this is the desired output:
DESIRED OUTPUT OF GROUPBY:
| id | num_records | sum_value_in_scope |
|----|-------------|--------------------|
| a | 3 | 10 |
| b | 2 | 20 |
I was thinking about passing two arguments to a lambda function, but I did not succeed. Of course, it can be solved by performing two separate groupbys on filtered and unfiltered data and combining them afterwards, but I was hoping there was a shorter and more elegant way.
Unfortunately, you cannot do this with aggregate; however, you can do it in one step with apply and a custom function:
def f(x):
    d = {}
    d['num_records'] = len(x)
    d['sum_value_in_scope'] = x[x.in_scope].value.sum()
    return pd.Series(d, index=['num_records', 'sum_value_in_scope'])

df.groupby('id').apply(f)
Since the column df.in_scope is already boolean, you can use it as a mask directly to filter the values which are summed. If the column you are working with is not boolean, it is better to use df.query('<your query here>') to get the subset of the data (there are optimizations under the hood which make it faster than most other methods).
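For example, here is a hedged variant of f for the string-typed type column actually defined in the question's df, using query() as suggested:
def f(x):
    # Count all rows, but sum value only for rows whose type is "in_scope".
    return pd.Series({
        'num_records': len(x),
        'sum_value_in_scope': x.query('type == "in_scope"')['value'].sum(),
    })

df.groupby('id').apply(f)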
Updated answer: Create a temporary column that contains values only when type is in_scope, then aggregate:
(
    df.assign(temp=np.where(df["type"] == "in_scope", df["value"], None))
      .groupby("id", as_index=False)
      .agg(num_records=("type", "size"), sum_value=("temp", "sum"))
)
id num_records sum_value
a 3 10
b 2 20
I have a dataframe with a very large number of columns (>30000).
I'm filling it with 1 and 0 based on the first column like this:
for column in list_of_column_names:
    df = df.withColumn(column, when(array_contains(df['list_column'], column), 1).otherwise(0))
However this process takes a lot of time. Is there a way to do this more efficiently? Something tells me that column processing can be parallelized.
Edit:
Sample input data
+----------------+-----+-----+-----+
| list_column    | Foo | Bar | Baz |
+----------------+-----+-----+-----+
| ['Foo', 'Bak'] |     |     |     |
| ['Bar', 'Baz'] |     |     |     |
| ['Foo']        |     |     |     |
+----------------+-----+-----+-----+
There is nothing specifically wrong with your code, other than very wide data:
for column in list_of_column_names:
    df = df.withColumn(...)
only generates the execution plan.
Actual data processing will be concurrent and parallelized once the result is evaluated.
It is, however, an expensive process, as it requires O(NMK) operations with N rows, M columns, and K values in the list.
Additionally, execution plans on very wide data are very expensive to compute (though the cost is constant in terms of the number of records). If this becomes a limiting factor, you might be better off with RDDs, as sketched after the steps below:
Sort column array using sort_array function.
Convert data to RDD.
Apply search for each column using binary search.
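A rough, purely illustrative sketch of those steps (it assumes list_of_column_names holds the target column names, as in the other answers):
from bisect import bisect_left
from pyspark.sql import functions as F, Row

def contains(sorted_vals, name):
    # Binary search for name in the already-sorted array of values.
    i = bisect_left(sorted_vals, name)
    return 1 if i < len(sorted_vals) and sorted_vals[i] == name else 0

# Sort the array column once, then build the 0/1 columns row by row on the RDD.
sorted_df = df.withColumn('list_column', F.sort_array('list_column'))
result = sorted_df.rdd.map(
    lambda row: Row(list_column=row['list_column'],
                    **{c: contains(row['list_column'], c) for c in list_of_column_names})
).toDF()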
You might approach it like this:
import pyspark.sql.functions as F
exprs = [F.when(F.array_contains(F.col('list_column'), column), 1).otherwise(0).alias(column)
         for column in list_of_column_names]
df = df.select(['list_column'] + exprs)
withColumn is already distributed, so a faster approach would be difficult to get other than what you already have. You can try defining a udf function as follows:
from pyspark.sql import functions as f
from pyspark.sql import types as t

def containsUdf(listColumn):
    # Build one 0/1 entry per column name, depending on membership in the list.
    row = {}
    for column in list_of_column_names:
        if column in listColumn:
            row.update({column: 1})
        else:
            row.update({column: 0})
    return row

# Declare the struct fields as IntegerType, since the udf returns ints, not strings.
callContainsUdf = f.udf(containsUdf, t.StructType([t.StructField(x, t.IntegerType(), True) for x in list_of_column_names]))

df.withColumn('struct', callContainsUdf(df['list_column']))\
    .select(f.col('list_column'), f.col('struct.*'))\
    .show(truncate=False)
which should give you
+-----------+---+---+---+
|list_column|Foo|Bar|Baz|
+-----------+---+---+---+
|[Foo, Bak] |1 |0 |0 |
|[Bar, Baz] |0 |1 |1 |
|[Foo] |1 |0 |0 |
+-----------+---+---+---+
Note: list_of_column_names = ["Foo","Bar","Baz"]
My Data looks like this:
id | duration | action1 | action2 | ...
---------------------------------------------
1 | 10 | A | D
1 | 10 | B | E
2 | 25 | A | E
1 | 7 | A | G
I want to group it by ID (which works great!):
df.rdd.groupBy(lambda x: x['id']).mapValues(list).collect()
And now I would like to group values within each group by duration to get something like this:
[(id=1,
  [(duration=10, [(action1=A, action2=D), (action1=B, action2=E)]),
   (duration=7,  [(action1=A, action2=G)])]),
 (id=2,
  [(duration=25, [(action1=A, action2=E)])])]
And here is where I don't know how to do a nested group by. Any tips?
There is no need to serialize to rdd. Here's a generalized way to group by multiple columns and aggregate the rest of the columns into lists without hard-coding all of them:
from pyspark.sql.functions import collect_list
grouping_cols = ["id", "duration"]
other_cols = [c for c in df.columns if c not in grouping_cols]
df.groupBy(grouping_cols).agg(*[collect_list(c).alias(c) for c in other_cols]).show()
#+---+--------+-------+-------+
#| id|duration|action1|action2|
#+---+--------+-------+-------+
#| 1| 10| [A, B]| [D, E]|
#| 2| 25| [A]| [E]|
#| 1| 7| [A]| [G]|
#+---+--------+-------+-------+
Update
If you need to preserve the order of the actions, the best way is to use a pyspark.sql.Window with an orderBy(). This is because there seems to be some ambiguity as to whether or not a groupBy() following an orderBy() maintains that order.
Suppose your timestamps are stored in a column "ts". You should be able to do the following:
from pyspark.sql import Window
w = Window.partitionBy(grouping_cols).orderBy("ts")
grouped_df = df.select(
    *(grouping_cols + [collect_list(c).over(w).alias(c) for c in other_cols])
).distinct()
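Finally, if you want the fully nested shape sketched in the question (actions grouped per duration, then per id) rather than flat lists, a hedged sketch along the same lines would be:
from pyspark.sql import functions as F

# Collect the action columns into structs per (id, duration),
# then collect those (duration, actions) structs per id.
nested = (
    df.groupBy("id", "duration")
      .agg(F.collect_list(F.struct("action1", "action2")).alias("actions"))
      .groupBy("id")
      .agg(F.collect_list(F.struct("duration", "actions")).alias("durations"))
)
Each row of nested then holds one id and an array of (duration, actions) structs.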