PySpark DataFrame.groupBy on a MapType column

I have a dataframe with a MapType column where the key is an id and the value is another StructType with two numbers, a counter and a revenue.
It looks like this:
+--------------------------------------+
| myMapColumn |
+--------------------------------------+
| Map(1 -> [1, 4.0], 2 -> [1, 1.5]) |
| Map() |
| Map(1 -> [3, 5.5]) |
| Map(1 -> [4, 0.1], 2 -> [6, 101.56]) |
+--------------------------------------+
Now I need to sum up these two values per id and the result would be:
+----+-------+---------+
| id | count | revenue |
+----+-------+---------+
| 1  | 8     | 9.6     |
| 2  | 7     | 103.06  |
+----+-------+---------+
I have no idea how to do that and could not find any documentation for this special case. I tried using DataFrame.groupBy but could not make it work :(
Any ideas?
I'm using Spark 1.5.2 with Python 2.6.6

Assuming that the schema is equivalent to this:
root
|-- myMapColumn: map (nullable = true)
| |-- key: integer
| |-- value: struct (valueContainsNull = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: double (nullable = false)
all you need is explode and a simple aggregation:
from pyspark.sql.functions import col, explode, sum as sum_
(df
.select(explode(col("myMapColumn")))
.groupBy(col("key").alias("id"))
.agg(sum_("value._1").alias("count"), sum_("value._2").alias("revenue")))
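To sanity-check the logic without a Spark session, here is a minimal plain-Python sketch of what explode followed by groupBy/sum computes. It is only a model of the semantics, not PySpark code; the names rows, exploded and totals are made up for illustration, and the data is copied from the question.

```python
from collections import defaultdict

# Rows of myMapColumn from the question: {id: (count, revenue)}
rows = [
    {1: (1, 4.0), 2: (1, 1.5)},
    {},
    {1: (3, 5.5)},
    {1: (4, 0.1), 2: (6, 101.56)},
]

# Model of explode: one (key, value) record per map entry.
# Empty maps contribute no records, just like explode drops them.
exploded = [(key, value) for row in rows for key, value in row.items()]

# Model of groupBy("key").agg(sum, sum): accumulate both struct fields per key.
totals = defaultdict(lambda: [0, 0.0])
for key, (count, revenue) in exploded:
    totals[key][0] += count
    totals[key][1] += revenue

# Round the revenue only to hide floating-point noise when printing.
result = {key: (count, round(revenue, 2)) for key, (count, revenue) in sorted(totals.items())}
print(result)
```

The totals should match the expected table in the question (8 / 9.6 for id 1 and 7 / 103.06 for id 2).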

Related

What's the easiest way to explode/flatten a deeply nested struct using PySpark?

I have an example dataset:
+---+------------------------------+
|id |example_field |
+---+------------------------------+
|1 |{[{[{111, AAA}, {222, BBB}]}]}|
+---+------------------------------+
The data types of the two fields are:
[('id', 'int'),
('example_field',
'struct<xxx:array<struct<nested_field:array<struct<field_1:int,field_2:string>>>>>')]
Is there a way or function to flatten the field example_field using PySpark?
My expected output is something like this:
id field_1 field_2
1 111 AAA
1 222 BBB
The following code should do the trick:
from pyspark.sql import functions as F
(
df
.withColumn('_temp_ef', F.explode('example_field.xxx'))
.withColumn('_temp_nf', F.explode('_temp_ef.nested_field'))
.select(
'id',
F.col('_temp_nf.*')
)
)
The function explode creates a row for each element in an array, while the final select turns the fields of the nested_field struct into columns.
The result is:
+---+-------+-------+
|id |field_1|field_2|
+---+-------+-------+
|1 |111 |AAA |
|1 |222 |BBB |
+---+-------+-------+
Note: I assumed that your DataFrame is something like this:
root
|-- id: integer (nullable = true)
|-- example_field: struct (nullable = true)
| |-- xxx: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- nested_field: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- field_1: integer (nullable = true)
| | | | | |-- field_2: string (nullable = true)
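If it helps to see the two explode steps spelled out, here is a small plain-Python sketch of the same flattening. It is only a model of the row multiplication (nested loops instead of explode), assuming one input row shaped like the schema above; the variable names are illustrative.

```python
# One row shaped like the schema above (values taken from the question).
rows = [
    {
        "id": 1,
        "example_field": {
            "xxx": [
                {
                    "nested_field": [
                        {"field_1": 111, "field_2": "AAA"},
                        {"field_1": 222, "field_2": "BBB"},
                    ]
                },
            ]
        },
    },
]

flat = [
    (row["id"], leaf["field_1"], leaf["field_2"])
    for row in rows
    for outer in row["example_field"]["xxx"]  # models the first explode
    for leaf in outer["nested_field"]         # models the second explode
]
print(flat)  # [(1, 111, 'AAA'), (1, 222, 'BBB')]
```

Each level of array nesting needs its own explode (its own loop here); struct levels are just field accesses.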

Handle spark DataFrame structure

Let's suppose that we have the following two tables:
+---------+--------+
|AUTHOR_ID| NAME   |
+---------+--------+
| 102     |Camus   |
| 103     |Hugo    |
+---------+--------+
+---------+---------+-----------+
|AUTHOR_ID| BOOK_ID | BOOK_NAME |
+---------+---------+-----------+
| 1       |Camus    | Etranger  |
| 1       |Hugo     | Mesirable |
+---------+---------+-----------+
I want to join the two tables in order to get a DataFrame with the following schema:
root
|-- AUTHORID: integer
|-- NAME: string
|-- BOOK_LIST: array
| |-- BOOK_ID: integer
| |-- BOOK_NAME: string
I'm using PySpark. Thanks in advance.
A simple join + group by should do the job:
from pyspark.sql import functions as F
result = (df_authors.join(df_books, on=["AUTHOR_ID"], how="left")
          .groupBy("AUTHOR_ID", "NAME")
          # name the aggregated column so it matches the requested schema
          .agg(F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")).alias("BOOK_LIST"))
)
In the aggregation we use collect_list to create the array of structs.
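As a rough illustration of the shape this produces, here is a plain-Python sketch of join + groupBy + collect_list. It is a model only, not PySpark. The author rows come from the question, but the book rows (ids and titles) are hypothetical sample data, and the sketch does not model what a left join does for an author with no books.

```python
from collections import defaultdict

# (AUTHOR_ID, NAME), as in the question
authors = [(102, "Camus"), (103, "Hugo")]

# Hypothetical book rows for illustration: (AUTHOR_ID, BOOK_ID, BOOK_NAME)
books = [
    (102, 1, "Etranger"),
    (102, 2, "La Peste"),
    (103, 3, "Notre-Dame de Paris"),
]

# Model of collect_list(struct(BOOK_ID, BOOK_NAME)) grouped by author
books_by_author = defaultdict(list)
for author_id, book_id, book_name in books:
    books_by_author[author_id].append((book_id, book_name))

# Model of the join: attach each author's list of (BOOK_ID, BOOK_NAME) pairs
result = [(author_id, name, books_by_author[author_id]) for author_id, name in authors]
for row in result:
    print(row)
```

Each output row has one array column holding (BOOK_ID, BOOK_NAME) pairs, which is the BOOK_LIST shape asked for.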

Spark: How to load data from multiple nested XML files with attributes into a DataFrame

How can I transform the values below from multiple XML files into a Spark DataFrame?
attribute Id0 from Level_0
Date/Value from Level_4
Required output:
+----------------+-------------+---------+
|Id0 |Date |Value |
+----------------+-------------+---------+
|Id0_value_file_1| 2021-01-01 | 4_1 |
|Id0_value_file_1| 2021-01-02 | 4_2 |
|Id0_value_file_2| 2021-01-01 | 4_1 |
|Id0_value_file_2| 2021-01-02 | 4_2 |
+----------------+-------------+---------+
file_1.xml:
<Level_0 Id0="Id0_value_file1">
<Level_1 Id1_1 ="Id3_value" Id_2="Id2_value">
<Level_2_A>A</Level_2_A>
<Level_2>
<Level_3>
<Level_4>
<Date>2021-01-01</Date>
<Value>4_1</Value>
</Level_4>
<Level_4>
<Date>2021-01-02</Date>
<Value>4_2</Value>
</Level_4>
</Level_3>
</Level_2>
</Level_1>
</Level_0>
file_2.xml:
<Level_0 Id0="Id0_value_file2">
<Level_1 Id1_1 ="Id3_value" Id_2="Id2_value">
<Level_2_A>A</Level_2_A>
<Level_2>
<Level_3>
<Level_4>
<Date>2021-01-01</Date>
<Value>4_1</Value>
</Level_4>
<Level_4>
<Date>2021-01-02</Date>
<Value>4_2</Value>
</Level_4>
</Level_3>
</Level_2>
</Level_1>
</Level_0>
Current code example:
files_list = ["file_1.xml", "file_2.xml"]
df = (spark.read.format('xml')
      .options(rowTag="Level_4")
      .load(','.join(files_list)))
Current output (the Id0 attribute column is missing):
+-------------+---------+
|Date |Value |
+-------------+---------+
| 2021-01-01 | 4_1 |
| 2021-01-02 | 4_2 |
| 2021-01-01 | 4_1 |
| 2021-01-02 | 4_2 |
+-------------+---------+
There are some examples, but none of them solves the problem:
- I'm using Databricks spark-xml: https://github.com/databricks/spark-xml
- There are examples, but none that read attributes: "Read XML in spark", "Extracting tag attributes from xml using sparkxml".
EDIT:
As @mck correctly pointed out, <Level_2>A</Level_2> is not correct XML format. That was a mistake in my example (the XML file is now corrected); it should be <Level_2_A>A</Level_2_A>. With that fix, the proposed solution works, even on multiple files.
NOTE: To speed up loading a large number of XML files, define a schema. If no schema is defined, Spark reads every file when creating the DataFrame in order to infer the schema.
For more info: https://szczeles.github.io/Reading-JSON-CSV-and-XML-files-efficiently-in-Apache-Spark/
STEP 1):
files_list = ["file_1.xml", "file_2.xml"]
# schema must be defined beforehand, see the NOTE above
df = (spark.read.format('xml')
.options(rowTag="Level_0")
.load(','.join(files_list),schema=schema))
df.printSchema()
root
|-- Level_1: struct (nullable = true)
| |-- Level_2: struct (nullable = true)
| | |-- Level_3: struct (nullable = true)
| | | |-- Level_4: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- Date: string (nullable = true)
| | | | | |-- Value: string (nullable = true)
| |-- Level_2_A: string (nullable = true)
| |-- _Id1_1: string (nullable = true)
| |-- _Id_2: string (nullable = true)
|-- _Id0: string (nullable = true)
STEP 2) see @mck's solution below:
You can use Level_0 as the rowTag, and explode the relevant arrays/structs:
import pyspark.sql.functions as F
df = spark.read.format('xml').options(rowTag="Level_0").load('line_removed.xml')
df2 = df.select(
'_Id0',
F.explode_outer('Level_1.Level_2.Level_3.Level_4').alias('Level_4')
).select(
'_Id0',
'Level_4.*'
)
df2.show()
+---------------+----------+-----+
| _Id0| Date|Value|
+---------------+----------+-----+
|Id0_value_file1|2021-01-01| 4_1|
|Id0_value_file1|2021-01-02| 4_2|
+---------------+----------+-----+
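To double-check locally which rows a single file should yield, here is a small sketch using Python's standard xml.etree.ElementTree instead of spark-xml. This is a different technique, meant only as a cross-check of the expected output, not as a replacement for the Spark job; the helper name rows_from_xml is made up, and the XML is the corrected file_1.xml from the question.

```python
import xml.etree.ElementTree as ET

# Corrected file_1.xml from the question, inlined as a string
xml_text = """<Level_0 Id0="Id0_value_file1">
  <Level_1 Id1_1="Id3_value" Id_2="Id2_value">
    <Level_2_A>A</Level_2_A>
    <Level_2>
      <Level_3>
        <Level_4>
          <Date>2021-01-01</Date>
          <Value>4_1</Value>
        </Level_4>
        <Level_4>
          <Date>2021-01-02</Date>
          <Value>4_2</Value>
        </Level_4>
      </Level_3>
    </Level_2>
  </Level_1>
</Level_0>"""


def rows_from_xml(text):
    """Return one (Id0, Date, Value) tuple per Level_4 element."""
    root = ET.fromstring(text)
    id0 = root.get("Id0")  # attribute on the root Level_0 element
    return [
        (id0, level_4.findtext("Date"), level_4.findtext("Value"))
        for level_4 in root.iter("Level_4")
    ]


rows = rows_from_xml(xml_text)
print(rows)
```

The tuples line up with the _Id0, Date and Value columns shown in df2.show() above.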

Adding a value into a DenseVector in PySpark

I have a DataFrame that I have processed to be like:
+---------+-------+
| inputs | temp |
+---------+-------+
| [1,0,0] | 12 |
+---------+-------+
| [0,1,0] | 10 |
+---------+-------+
...
inputs is a column of DenseVectors. temp is a column of values. I want to append these values to the DenseVector and create one column, but I am not sure how to start. Any tips for this desired output:
+---------------+
| inputsMerged |
+---------------+
| [1,0,0,12] |
+---------------+
| [0,1,0,10] |
+---------------+
...
EDIT: I am trying to use the VectorAssembler method but my resulting array is not as intended.
You might do something like this:
df.show()
+-------------+----+
| inputs|temp|
+-------------+----+
|[1.0,0.0,0.0]| 12|
|[0.0,1.0,0.0]| 10|
+-------------+----+
df.printSchema()
root
|-- inputs: vector (nullable = true)
|-- temp: long (nullable = true)
Import:
import pyspark.sql.functions as F
from pyspark.ml.linalg import Vectors, VectorUDT
Create the udf to merge the Vector and element:
concat = F.udf(lambda v, e: Vectors.dense(list(v) + [e]), VectorUDT())
Apply udf to inputs and temp columns:
merged_df = df.select(concat(df.inputs, df.temp).alias('inputsMerged'))
merged_df.show()
+------------------+
| inputsMerged|
+------------------+
|[1.0,0.0,0.0,12.0]|
|[0.0,1.0,0.0,10.0]|
+------------------+
merged_df.printSchema()
root
|-- inputsMerged: vector (nullable = true)

PySpark: iterate inside small groups in DataFrame

I am trying to understand how I can do operations inside small groups in a PySpark DataFrame. Suppose I have a DataFrame with the following schema:
root
|-- first_id: string (nullable = true)
|-- second_id_struct: struct (nullable = true)
| |-- s_id: string (nullable = true)
| |-- s_id_2: int (nullable = true)
|-- depth_from: float (nullable = true)
|-- depth_to: float (nullable = true)
|-- total_depth: float (nullable = true)
So data might look something like this:
I would like to:
group data by first_id
inside each group, order it by s_id_2 in ascending order
append extra column layer to either struct or root DataFrame that would indicate order of this s_id_2 in a group.
For example:
first_id | second_id | second_id_order
---------| --------- | ---------------
A1 | [B, 10] | 1
---------| --------- | ---------------
A1 | [B, 14] | 2
---------| --------- | ---------------
A1 | [B, 22] | 3
---------| --------- | ---------------
A5 | [A, 1] | 1
---------| --------- | ---------------
A5 | [A, 7] | 2
---------| --------- | ---------------
A7 | null | 1
---------| --------- | ---------------
Once grouped, each first_id will have at most 4 second_id_struct values. How do I approach this kind of problem?
I am particularly interested in how to make iterative operations inside small groups (1-40 rows) of DataFrames in general, where order of columns inside a group matters.
Thanks!
Create a DataFrame:
d = [{'first_id': 'A1', 'second_id': ['B',10]}, {'first_id': 'A1', 'second_id': ['B',14]},{'first_id': 'A1', 'second_id': ['B',22]},{'first_id': 'A5', 'second_id': ['A',1]},{'first_id': 'A5', 'second_id': ['A',7]}]
df = sqlContext.createDataFrame(d)
And you can see the structure:
df.printSchema()
root
|-- first_id: string (nullable = true)
|-- second_id: array (nullable = true)
| |-- element: string (containsNull = true)
df.show()
+--------+----------+
|first_id|second_id |
+--------+----------+
| A1| [B, 10]|
| A1| [B, 14]|
| A1| [B, 22]|
| A5| [A, 1]|
| A5| [A, 7]|
+--------+----------+
Then you can use dense_rank with a Window specification to compute the order within each subgroup. It is the same as OVER (PARTITION BY ... ORDER BY ...) in SQL.
For an introduction to window functions, see: Introducing Window Functions in Spark SQL
Code here:
from pyspark.sql.functions import dense_rank
from pyspark.sql.window import Window
# set a window spec; second_id holds strings, so cast to int for numeric ordering
windowSpec = Window.partitionBy('first_id').orderBy(df.second_id[1].cast('int'))
# apply dense_rank to the window spec
df.select(df.first_id, df.second_id, dense_rank().over(windowSpec).alias("second_id_order")).show()
Result:
+--------+---------+---------------+
|first_id|second_id|second_id_order|
+--------+---------+---------------+
| A1| [B, 10]| 1|
| A1| [B, 14]| 2|
| A1| [B, 22]| 3|
| A5| [A, 1]| 1|
| A5| [A, 7]| 2|
+--------+---------+---------------+
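For intuition about what the window computes, here is a plain-Python sketch of dense_rank per partition. It is a model of the semantics only, not PySpark; the sample rows mirror the table above with the second element stored as an int, and the names partitions, rank_of and ranked are illustrative.

```python
from collections import defaultdict

# (first_id, second_id) rows mirroring the example data
rows = [
    ("A1", ("B", 10)),
    ("A1", ("B", 14)),
    ("A1", ("B", 22)),
    ("A5", ("A", 1)),
    ("A5", ("A", 7)),
]

# Model of partitionBy('first_id')
partitions = defaultdict(list)
for first_id, second_id in rows:
    partitions[first_id].append(second_id)

ranked = []
for first_id in sorted(partitions):
    members = partitions[first_id]
    # dense_rank: equal order keys share a rank and no rank numbers are skipped
    distinct_keys = sorted({second_id[1] for second_id in members})
    rank_of = {key: rank for rank, key in enumerate(distinct_keys, start=1)}
    # Model of orderBy(second_id[1]) inside the partition
    for second_id in sorted(members, key=lambda s: s[1]):
        ranked.append((first_id, second_id, rank_of[second_id[1]]))

for row in ranked:
    print(row)
```

For groups this small (1 to 40 rows), the window function does this per-partition sort for you, so there is usually no need to iterate over groups by hand.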
