PySpark: iterate inside small groups in DataFrame - python

I am trying to understand how I can do operations inside small groups in a PySpark DataFrame. Suppose I have a DataFrame with the following schema:
root
|-- first_id: string (nullable = true)
|-- second_id_struct: struct (nullable = true)
| |-- s_id: string (nullable = true)
| |-- s_id_2: int (nullable = true)
|-- depth_from: float (nullable = true)
|-- depth_to: float (nullable = true)
|-- total_depth: float (nullable = true)
I would like to:
group data by first_id
inside each group, order it by s_id_2 in ascending order
append an extra column layer (to either the struct or the root DataFrame) that indicates the order of this s_id_2 within the group.
For example:
first_id | second_id | second_id_order
---------|-----------|----------------
A1       | [B, 10]   | 1
A1       | [B, 14]   | 2
A1       | [B, 22]   | 3
A5       | [A, 1]    | 1
A5       | [A, 7]    | 2
A7       | null      | 1
Once grouped, each first_id will have at most 4 second_id_struct rows. How do I approach this kind of problem?
I am particularly interested in how to perform iterative operations inside small groups (1-40 rows) of a DataFrame in general, where the order of rows within a group matters.
Thanks!

create a DataFrame
d = [{'first_id': 'A1', 'second_id': ['B', 10]},
     {'first_id': 'A1', 'second_id': ['B', 14]},
     {'first_id': 'A1', 'second_id': ['B', 22]},
     {'first_id': 'A5', 'second_id': ['A', 1]},
     {'first_id': 'A5', 'second_id': ['A', 7]}]
df = sqlContext.createDataFrame(d)
And you can see the structure
df.printSchema()
root
|-- first_id: string (nullable = true)
|-- second_id: array (nullable = true)
| |-- element: string (containsNull = true)
df.show()
+--------+----------+
|first_id|second_id |
+--------+----------+
| A1| [B, 10]|
| A1| [B, 14]|
| A1| [B, 22]|
| A5| [A, 1]|
| A5| [A, 7]|
+--------+----------+
Then you can use dense_rank with a Window function to show the order within each subgroup. It is the same as OVER (PARTITION BY ...) in SQL.
For an introduction to window functions, see Introducing Window Functions in Spark SQL.
Code here:
from pyspark.sql import Window
from pyspark.sql.functions import dense_rank

# set up a window spec: partition by first_id, order by the numeric part of second_id
windowSpec = Window.partitionBy('first_id').orderBy(df.second_id[1])

# apply dense_rank over the window spec
df.select(df.first_id, df.second_id,
          dense_rank().over(windowSpec).alias("second_id_order")).show()
Result:
+--------+---------+---------------+
|first_id|second_id|second_id_order|
+--------+---------+---------------+
| A1| [B, 10]| 1|
| A1| [B, 14]| 2|
| A1| [B, 22]| 3|
| A5| [A, 1]| 1|
| A5| [A, 7]| 2|
+--------+---------+---------------+
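Note that dense_rank assigns the same rank to rows that tie on the ordering value. If you want a strictly increasing position within each group regardless of ties, row_number can be swapped in; a minimal sketch using the same df and window spec:
from pyspark.sql import Window
from pyspark.sql.functions import row_number

windowSpec = Window.partitionBy('first_id').orderBy(df.second_id[1])
df.select(df.first_id, df.second_id,
          row_number().over(windowSpec).alias("second_id_order")).show()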

Related

what's the easiest way to explode/flatten deeply nested struct using pyspark?

I have an example dataset:
+---+------------------------------+
|id |example_field |
+---+------------------------------+
|1 |{[{[{111, AAA}, {222, BBB}]}]}|
+---+------------------------------+
The data types of the two fields are:
[('id', 'int'),
('example_field',
'struct<xxx:array<struct<nested_field:array<struct<field_1:int,field_2:string>>>>>')]
My question is whether there's a way/function to flatten the field example_field using PySpark.
My expected output is something like this:
id  field_1  field_2
1   111      AAA
1   222      BBB
The following code should do the trick:
from pyspark.sql import functions as F

(
    df
    .withColumn('_temp_ef', F.explode('example_field.xxx'))
    .withColumn('_temp_nf', F.explode('_temp_ef.nested_field'))
    .select(
        'id',
        F.col('_temp_nf.*')
    )
)
The explode function creates a row for each element of an array, while the final select turns the fields of the nested_field struct into columns.
The result is:
+---+-------+-------+
|id |field_1|field_2|
+---+-------+-------+
|1 |111 |AAA |
|1 |222 |BBB |
+---+-------+-------+
Note: I assumed that your DataFrame is something like this:
root
|-- id: integer (nullable = true)
|-- example_field: struct (nullable = true)
| |-- xxx: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- nested_field: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- field_1: integer (nullable = true)
| | | | | |-- field_2: string (nullable = true)
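As an aside, roughly the same flattening can be written with the SQL inline generator function, which explodes an array of structs and expands its fields in one step; a sketch under the same schema assumption:
# inline() explodes an array of structs and turns the struct fields into columns in one go
(
    df
    .selectExpr('id', 'inline(example_field.xxx)')   # -> id, nested_field
    .selectExpr('id', 'inline(nested_field)')        # -> id, field_1, field_2
    .show(truncate=False)
)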

convert python dictionary into pyspark dataframe

I have a json file which contains a dictionary in the following format:
{"a1":{"b1":["c1","c2"], "b2":["c4","c3"]}, "a2":{"b3":["c1","c4"]}}
Is it possible to convert this dictionary into a PySpark DataFrame like the following?
col1 | col2 | col3
-----|------|-----
a1   | b1   | c1
a1   | b1   | c2
a1   | b2   | c4
a1   | b2   | c3
a2   | b3   | c1
a2   | b3   | c4
I have seen the standard way of converting JSON to a PySpark DataFrame (example in this link), but was wondering about nested dictionaries that contain lists as well.
Interesting problem! The main difficulty I see here is that when reading from JSON, your schema likely has struct types, which makes this harder to solve because a1 essentially has a different type than a2.
My idea is to convert the struct types to map types, stack them together, and then apply a few explodes:
This is your df
+----------------------------------+
|data |
+----------------------------------+
|{{[c1, c2], [c4, c3]}, {[c1, c4]}}|
+----------------------------------+
root
|-- data: struct (nullable = true)
| |-- a1: struct (nullable = true)
| | |-- b1: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- b2: array (nullable = true)
| | | |-- element: string (containsNull = true)
| |-- a2: struct (nullable = true)
| | |-- b3: array (nullable = true)
| | | |-- element: string (containsNull = true)
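The steps below assume the nested JSON has already been loaded into a single data struct column as shown above; a hypothetical way to get there (the file name nested.json and the wrapping step are assumptions, not part of the original question):
from pyspark.sql import functions as F

# read the one-object JSON file, then wrap its top-level keys under a single `data` struct
raw = spark.read.json('nested.json', multiLine=True)    # columns: a1, a2 (structs)
df = raw.select(F.struct(*raw.columns).alias('data'))
df.printSchema()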
Create a temporary df to handle JSON's first level
first_level_df = df.select('data.*')
first_level_df.show()
first_level_cols = first_level_df.columns # ['a1', 'a2']
+--------------------+----------+
| a1| a2|
+--------------------+----------+
|{[c1, c2], [c4, c3]}|{[c1, c4]}|
+--------------------+----------+
Some helper variables
from pyspark.sql import functions as F, types as T

map_cols = [F.from_json(F.to_json(c), T.MapType(T.StringType(), T.StringType())).alias(c)
            for c in first_level_cols]
# [Column<'entries AS a1'>, Column<'entries AS a2'>]
stack_cols = ', '.join([f"'{c}', {c}" for c in first_level_cols])
# 'a1', a1, 'a2', a2
Main transformation
(first_level_df
.select(map_cols)
.select(F.expr(f'stack(2, {stack_cols})').alias('AA', 'temp'))
.select('AA', F.explode('temp').alias('BB', 'temp'))
.select('AA', 'BB', F.explode(F.from_json('temp', T.ArrayType(T.StringType()))).alias('CC'))
.show(10, False)
)
+---+---+---+
|AA |BB |CC |
+---+---+---+
|a1 |b1 |c1 |
|a1 |b1 |c2 |
|a1 |b2 |c4 |
|a1 |b2 |c3 |
|a2 |b3 |c1 |
|a2 |b3 |c4 |
+---+---+---+
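If the number of first-level keys is not known in advance, the hard-coded 2 in the stack call can be derived from the column list; a small tweak to the pipeline above:
# same stack as above, with the arity taken from however many first-level columns were found
stack_expr = f'stack({len(first_level_cols)}, {stack_cols})'
stacked = first_level_df.select(map_cols).select(F.expr(stack_expr).alias('AA', 'temp'))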

Spark: How to transpose and explode columns with dynamic nested arrays

I applied the algorithm from the question Spark: How to transpose and explode columns with nested arrays to transpose and explode a nested Spark DataFrame with dynamic arrays.
I have added the row """{"id":3,"c":[{"date":3,"val":3, "val_dynamic":3}]}""" to the DataFrame, with a new column c whose array has a new val_dynamic field that can appear on a random basis.
I'm looking for required output 2 (transpose and explode), but even an example of required output 1 (transpose) would be very useful.
Input df:
+------------------+--------+-----------+---+
| a| b| c| id|
+------------------+--------+-----------+---+
|[{1, 1}, {11, 11}]| null| null| 1|
| null|[{2, 2}]| null| 2|
| null| null|[{3, 3, 3}]| 3| !!! NOTE: Added `val_dynamic`
+------------------+--------+-----------+---+
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
| | |-- val_dynamic: long (nullable = true) !!! NOTE: Added `val_dynamic`
|-- id: long (nullable = true)
Required output 1 (transpose_df):
+---+------+-------------------+
| id| cols | arrays |
+---+------+-------------------+
| 1| a | [{1, 1}, {11, 11}]|
| 2| b | [{2, 2}] |
| 3| c | [{3, 3, 3}] | !!! NOTE: Added `val_dynamic`
+---+------+-------------------+
Required output 2 (explode_df):
+---+----+----+---+-----------+
| id|cols|date|val|val_dynamic|
+---+----+----+---+-----------+
| 1| a| 1| 1| null |
| 1| a| 11| 11| null |
| 2| b| 2| 2| null |
| 3| c| 3| 3| 3 | !!! NOTE: Added `val_dynamic`
+---+----+----+---+-----------+
Current code:
import pyspark.sql.functions as f
df = spark.read.json(sc.parallelize([
    """{"id":1,"a":[{"date":1,"val":1},{"date":11,"val":11}]}""",
    """{"id":2,"b":[{"date":2,"val":2}]}""",
    """{"id":3,"c":[{"date":3,"val":3, "val_dynamic":3}]}"""
]))
df.show()
cols = ['a', 'b', 'c']
# expr = stack(3, 'a', a, 'b', b, 'c', c)
expr = f"stack({len(cols)}," + \
       ",".join([f"'{c}',{c}" for c in cols]) + \
       ")"
transpose_df = df.selectExpr("id", expr) \
    .withColumnRenamed("col0", "cols") \
    .withColumnRenamed("col1", "arrays") \
    .filter("not arrays is null")
transpose_df.show()
explode_df = transpose_df.selectExpr('id', 'cols', 'inline(arrays)')
explode_df.show()
Current outcome
AnalysisException: cannot resolve 'stack(3, 'a', `a`, 'b', `b`, 'c', `c`)' due to data type mismatch: Argument 2 (array<struct<date:bigint,val:bigint>>) != Argument 6 (array<struct<date:bigint,val:bigint,val_dynamic:bigint>>); line 1 pos 0;
'Project [id#2304L, unresolvedalias(stack(3, a, a#2301, b, b#2302, c, c#2303), Some(org.apache.spark.sql.Column$$Lambda$2580/0x00000008411d3040#4d9eefd0))]
+- LogicalRDD [a#2301, b#2302, c#2303, id#2304L], false
ref : Transpose column to row with Spark
stack requires that all stacked columns have the same type. The problem here is that the structs inside the arrays have different members. One approach would be to add the missing members to all structs so that the approach from my previous answer works again.
cols = ['a', 'b', 'c']

# create a map containing all struct fields per column
existing_fields = {c: list(map(lambda field: field.name, df.schema.fields[i].dataType.elementType.fields))
                   for i, c in enumerate(df.columns) if c in cols}

# get a (unique) set of all fields that exist across the columns
all_fields = set(sum(existing_fields.values(), []))

# create a list of transform expressions to fill up the structs with null fields
transform_exprs = [f"transform({c}, e -> named_struct(" +
                   ",".join([f"'{f}', {('e.' + f) if f in existing_fields[c] else 'cast(null as long)'}" for f in all_fields])
                   + f")) as {c}" for c in cols]

# create a df where all columns contain arrays with the same struct
full_struct_df = df.selectExpr("id", *transform_exprs)
full_struct_df now has the schema:
root
|-- id: long (nullable = true)
|-- a: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- val: long (nullable = true)
| | |-- val_dynamic: long (nullable = true)
| | |-- date: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- val: long (nullable = true)
| | |-- val_dynamic: long (nullable = true)
| | |-- date: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- val: long (nullable = true)
| | |-- val_dynamic: long (nullable = true)
| | |-- date: long (nullable = true)
From here the logic works as before:
stack_expr = f"stack({len(cols)}," + \
             ",".join([f"'{c}',{c}" for c in cols]) + \
             ")"

transpose_df = full_struct_df.selectExpr("id", stack_expr) \
    .withColumnRenamed("col0", "cols") \
    .withColumnRenamed("col1", "arrays") \
    .filter("not arrays is null")

explode_df = transpose_df.selectExpr('id', 'cols', 'inline(arrays)')
The first part of this answer requires that each column mentioned in cols is an array of structs, and that all members of all structs are longs. The reason for the latter restriction is the cast(null as long) used when creating the transform expression.
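If you are on Spark 3.1+, roughly the same padding can be expressed with the Python higher-order function API instead of building SQL strings; a minimal sketch (the field lists are hard-coded here for brevity, and it assumes that appending the missing fields leaves every column with the same struct field order):
from pyspark.sql import functions as F

all_fields = ['date', 'val', 'val_dynamic']

def pad_missing(col_name, existing):
    # append each missing field as a null long so every array ends up with the same struct type
    def add_fields(e):
        for f in all_fields:
            if f not in existing:
                e = e.withField(f, F.lit(None).cast('long'))
        return e
    return F.transform(F.col(col_name), add_fields).alias(col_name)

full_struct_df = df.select(
    'id',
    pad_missing('a', ['date', 'val']),
    pad_missing('b', ['date', 'val']),
    pad_missing('c', ['date', 'val', 'val_dynamic']),
)
The stack/inline steps above can then be applied to this full_struct_df unchanged.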

how can i unpack a column of type list in pyspark

I have a DataFrame in PySpark with a column of type array<string>. I need to generate a new column containing the head of the list, and another column with the concatenation of the tail of the list.
This is my original dataframe:
pyspark> df.show()
+---+------------+
| id| lst_col|
+---+------------+
| 1|[a, b, c, d]|
+---+------------+
pyspark> df.printSchema()
root
|-- id: integer (nullable = false)
|-- lst_col: array (nullable = true)
| |-- element: string (containsNull = true)
And I need to generate something like this:
pyspark> df2.show()
+---+--------+---------------+
| id|lst_head|lst_concat_tail|
+---+--------+---------------+
| 1| a| b,c,d|
+---+--------+---------------+
For Spark 2.4+, you can use element_at, slice and size functions for arrays:
from pyspark.sql.functions import element_at, expr

df.select("id",
          element_at("lst_col", 1).alias("lst_head"),
          expr("slice(lst_col, 2, size(lst_col))").alias("lst_concat_tail")
         ).show()
Gives:
+---+--------+---------------+
| id|lst_head|lst_concat_tail|
+---+--------+---------------+
| 1| a| [b, c, d]|
+---+--------+---------------+
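To get the comma-separated string from the expected output rather than an array, the tail can be wrapped in array_join (also available since Spark 2.4); a small variation of the same select:
from pyspark.sql.functions import element_at, expr

df.select("id",
          element_at("lst_col", 1).alias("lst_head"),
          expr("array_join(slice(lst_col, 2, size(lst_col)), ',')").alias("lst_concat_tail")
         ).show()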

PySpark Dataframe.groupBy MapType column

I have a dataframe with a MapType column where the key is an id and the value is another StructType with two numbers, a counter and a revenue.
It looks like that:
+--------------------------------------+
| myMapColumn |
+--------------------------------------+
| Map(1 -> [1, 4.0], 2 -> [1, 1.5]) |
| Map() |
| Map(1 -> [3, 5.5]) |
| Map(1 -> [4, 0.1], 2 -> [6, 101.56]) |
+--------------------------------------+
Now I need to sum up these two values per id and the result would be:
+----------------------+
| id | count | revenue |
+----------------------+
| 1 | 8 | 9.6 |
| 2 | 7 | 103.06 |
+----------------------+
I actually have no idea how to do that and could not find any documentation for this special case. I tried using DataFrame.groupBy but could not make it work :(
Any ideas ?
I'm using Spark 1.5.2 with Python 2.6.6
Assuming that the schema is equivalent to this:
root
|-- myMapColumn: map (nullable = true)
| |-- key: integer
| |-- value: struct (valueContainsNull = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: double (nullable = false)
all you need is explode and a simple aggregation:
from pyspark.sql.functions import col, explode, sum as sum_

(df
 .select(explode(col("myMapColumn")))
 .groupBy(col("key").alias("id"))
 .agg(sum_("value._1").alias("count"), sum_("value._2").alias("revenue")))
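For completeness, a hypothetical way to build a df with exactly that schema for testing (spelling the schema out explicitly so it behaves the same on the Spark 1.5 sqlContext API mentioned in the question and on newer versions):
from pyspark.sql.types import (MapType, IntegerType, DoubleType,
                               StructType, StructField)

schema = StructType([
    StructField("myMapColumn", MapType(
        IntegerType(),
        StructType([
            StructField("_1", IntegerType(), False),
            StructField("_2", DoubleType(), False),
        ])), True),
])

rows = [
    ({1: (1, 4.0), 2: (1, 1.5)},),
    ({},),
    ({1: (3, 5.5)},),
    ({1: (4, 0.1), 2: (6, 101.56)},),
]
df = sqlContext.createDataFrame(rows, schema)
# running the aggregation above on this df should match the expected result in the question:
# id=1 -> count 8, revenue 9.6; id=2 -> count 7, revenue 103.06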
