Handle spark DataFrame structure - python

Let's suppose that we have the following two tables:
+---------+-----+
|AUTHOR_ID| NAME|
+---------+-----+
|      102|Camus|
|      103| Hugo|
+---------+-----+

+---------+-------+---------+
|AUTHOR_ID|BOOK_ID|BOOK_NAME|
+---------+-------+---------+
|        1|  Camus| Etranger|
|        1|   Hugo|Mesirable|
+---------+-------+---------+
I want to join the two tables in order to get a DataFrame with the following schema:
root
 |-- AUTHOR_ID: integer
 |-- NAME: string
 |-- BOOK_LIST: array
 |    |-- BOOK_ID: integer
 |    |-- BOOK_NAME: string
I'm using PySpark. Thanks in advance.

Simple join + group by should do the job:
from pyspark.sql import functions as F

result = (df_authors.join(df_books, on=["AUTHOR_ID"], how="left")
          .groupBy("AUTHOR_ID", "NAME")
          .agg(F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")).alias("BOOK_LIST"))
          )
In the aggregation we use collect_list to create the array of structs.
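For reference, a minimal end-to-end sketch. The sample DataFrames are assumptions based on the question (matching AUTHOR_ID keys and numeric BOOK_IDs, so the join actually produces rows):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data, loosely reconstructed from the question
df_authors = spark.createDataFrame(
    [(102, "Camus"), (103, "Hugo")],
    "AUTHOR_ID int, NAME string")
df_books = spark.createDataFrame(
    [(102, 1, "Etranger"), (103, 2, "Mesirable")],
    "AUTHOR_ID int, BOOK_ID int, BOOK_NAME string")

result = (df_authors.join(df_books, on=["AUTHOR_ID"], how="left")
          .groupBy("AUTHOR_ID", "NAME")
          .agg(F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")).alias("BOOK_LIST")))
result.printSchema()  # BOOK_LIST: array of struct<BOOK_ID, BOOK_NAME>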

Related

what's the easiest way to explode/flatten deeply nested struct using pyspark?

I have an example dataset:
+---+------------------------------+
|id |example_field |
+---+------------------------------+
|1 |{[{[{111, AAA}, {222, BBB}]}]}|
+---+------------------------------+
The data types of the two fields are:
[('id', 'int'),
('example_field',
'struct<xxx:array<struct<nested_field:array<struct<field_1:int,field_2:string>>>>>')]
My question is whether there's a way/function to flatten the field example_field using PySpark.
My expected output is something like this:
id  field_1  field_2
1   111      AAA
1   222      BBB
The following code should do the trick:
from pyspark.sql import functions as F

(
    df
    .withColumn('_temp_ef', F.explode('example_field.xxx'))
    .withColumn('_temp_nf', F.explode('_temp_ef.nested_field'))
    .select(
        'id',
        F.col('_temp_nf.*')
    )
)
The function explode creates a row for each element in an array, while the final select turns the fields of the nested_field struct into columns.
The result is:
+---+-------+-------+
|id |field_1|field_2|
+---+-------+-------+
|1  |111    |AAA    |
|1  |222    |BBB    |
+---+-------+-------+
Note: I assumed that your DataFrame is something like this:
root
 |-- id: integer (nullable = true)
 |-- example_field: struct (nullable = true)
 |    |-- xxx: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- nested_field: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- field_1: integer (nullable = true)
 |    |    |    |    |    |-- field_2: string (nullable = true)
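If you want to reproduce this, a minimal sketch of how such a DataFrame could be built (the literal values come from the example row; the construction itself is my assumption):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row shaped like the example: struct -> array -> struct -> array -> struct
data = [(1, ([([(111, 'AAA'), (222, 'BBB')],)],))]
df = spark.createDataFrame(
    data,
    'id int, example_field struct<xxx: array<struct<'
    'nested_field: array<struct<field_1: int, field_2: string>>>>>')
df.printSchema()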

Spark: How to transform to Data Frame data from multiple nested XML files with attributes

How can I transform the values below from multiple XML files into a Spark data frame:
- attribute Id0 from Level_0
- Date/Value from Level_4
Required output:
+----------------+-------------+---------+
|Id0             |Date         |Value    |
+----------------+-------------+---------+
|Id0_value_file_1| 2021-01-01  | 4_1     |
|Id0_value_file_1| 2021-01-02  | 4_2     |
|Id0_value_file_2| 2021-01-01  | 4_1     |
|Id0_value_file_2| 2021-01-02  | 4_2     |
+----------------+-------------+---------+
file_1.xml:
<Level_0 Id0="Id0_value_file1">
    <Level_1 Id1_1="Id3_value" Id_2="Id2_value">
        <Level_2_A>A</Level_2_A>
        <Level_2>
            <Level_3>
                <Level_4>
                    <Date>2021-01-01</Date>
                    <Value>4_1</Value>
                </Level_4>
                <Level_4>
                    <Date>2021-01-02</Date>
                    <Value>4_2</Value>
                </Level_4>
            </Level_3>
        </Level_2>
    </Level_1>
</Level_0>
file_2.xml:
<Level_0 Id0="Id0_value_file2">
    <Level_1 Id1_1="Id3_value" Id_2="Id2_value">
        <Level_2_A>A</Level_2_A>
        <Level_2>
            <Level_3>
                <Level_4>
                    <Date>2021-01-01</Date>
                    <Value>4_1</Value>
                </Level_4>
                <Level_4>
                    <Date>2021-01-02</Date>
                    <Value>4_2</Value>
                </Level_4>
            </Level_3>
        </Level_2>
    </Level_1>
</Level_0>
Current Code Example:
files_list = ["file_1.xml", "file_2.xml"]
df = (spark.read.format('xml')
      .options(rowTag="Level_4")
      .load(','.join(files_list)))
Current output (the Id0 attribute column is missing):
+-------------+---------+
|Date         |Value    |
+-------------+---------+
| 2021-01-01  | 4_1     |
| 2021-01-02  | 4_2     |
| 2021-01-01  | 4_1     |
| 2021-01-02  | 4_2     |
+-------------+---------+
There are some examples, but none of them solve the problem:
- I'm using Databricks spark-xml - https://github.com/databricks/spark-xml
- There is an example, but not with attribute reading: Read XML in spark, Extracting tag attributes from xml using sparkxml.
EDIT:
As #mck correctly pointed out, <Level_2>A</Level_2> is not valid XML. I had a mistake in my example (the xml file is now corrected); it should be <Level_2_A>A</Level_2_A>. After that, the proposed solution works even on multiple files.
NOTE: To speed up loading of a large number of XML files, define a schema. If no schema is defined, Spark reads each file while creating the DataFrame in order to infer the schema...
For more info: https://szczeles.github.io/Reading-JSON-CSV-and-XML-files-efficiently-in-Apache-Spark/
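The answer does not spell out the schema definition itself; one possible explicit schema, sketched to match the printSchema() output in STEP 1 below (an assumption, not part of the original answer):
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

# Explicit schema so Spark does not have to scan every file to infer it
schema = StructType([
    StructField("Level_1", StructType([
        StructField("Level_2", StructType([
            StructField("Level_3", StructType([
                StructField("Level_4", ArrayType(StructType([
                    StructField("Date", StringType()),
                    StructField("Value", StringType()),
                ]))),
            ])),
        ])),
        StructField("Level_2_A", StringType()),
        StructField("_Id1_1", StringType()),
        StructField("_Id_2", StringType()),
    ])),
    StructField("_Id0", StringType()),
])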
STEP 1):
files_list = ["file_1.xml", "file_2.xml"]
# for the schema, see the NOTE and the sketch above
df = (spark.read.format('xml')
      .options(rowTag="Level_0")
      .load(','.join(files_list), schema=schema))
df.printSchema()
root
 |-- Level_1: struct (nullable = true)
 |    |-- Level_2: struct (nullable = true)
 |    |    |-- Level_3: struct (nullable = true)
 |    |    |    |-- Level_4: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- Date: string (nullable = true)
 |    |    |    |    |    |-- Value: string (nullable = true)
 |    |-- Level_2_A: string (nullable = true)
 |    |-- _Id1_1: string (nullable = true)
 |    |-- _Id_2: string (nullable = true)
 |-- _Id0: string (nullable = true)
STEP 2) see #mck's solution below:
You can use Level_0 as the rowTag, and explode the relevant arrays/structs:
import pyspark.sql.functions as F
df = spark.read.format('xml').options(rowTag="Level_0").load('line_removed.xml')
df2 = df.select(
    '_Id0',
    F.explode_outer('Level_1.Level_2.Level_3.Level_4').alias('Level_4')
).select(
    '_Id0',
    'Level_4.*'
)
df2.show()
+---------------+----------+-----+
| _Id0| Date|Value|
+---------------+----------+-----+
|Id0_value_file1|2021-01-01| 4_1|
|Id0_value_file1|2021-01-02| 4_2|
+---------------+----------+-----+
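Applying the same select to the multi-file DataFrame from STEP 1 should yield the required output for both files (the EDIT above notes the approach works on multiple files); a sketch reusing files_list, schema and F from the steps above:
df_all = (spark.read.format('xml')
          .options(rowTag="Level_0")
          .load(','.join(files_list), schema=schema))
result = df_all.select(
    '_Id0',
    F.explode_outer('Level_1.Level_2.Level_3.Level_4').alias('Level_4')
).select('_Id0', 'Level_4.*')
result.show()  # rows for both files' Id0 values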

Is there any function which helps me convert date and string format in PySpark

Currently I am working in PySpark and have little knowledge of this technology. My data frame looks like:
id  dob         var1
1   13-02-1976  aab#dfsfs
2   01-04-2000  bb#NAm
3   28-11-1979  adam11#kjfd
4   30-01-1955  rehan42#ggg
My desired output looks like:
id  dob         var1         age  var2
1   13-02-1976  aab#dfsfs    43   aab
2   01-04-2000  bb#NAm       19   bb
3   28-11-1979  adam11#kjfd  39   adam11
4   30-01-1955  rehan42#ggg  64   rehan42
What I have done so far -
df = df.select(df.id.cast('int').alias('id'),
               df.dob.cast('date').alias('dob'),
               df.var1.cast('string').alias('var1'))
But I think dob is not converted properly.
df = df.withColumn('age', F.datediff(F.current_date(), df.dob))
As you said, the casting of the dob column is not correct. Please try this:
from pyspark.sql.functions import col, unix_timestamp, to_date
import pyspark.sql.functions as F

df2 = df.withColumn('date_in_dateFormat',
                    to_date(unix_timestamp(F.col('dob'), 'dd-MM-yyyy').cast("timestamp")))
df2.show()
+---+----------+-----------+------------------+
| id| dob| var1|date_in_dateFormat|
+---+----------+-----------+------------------+
| 1|13-02-1976| aab#dfsfs| 1976-02-13|
| 2|01-04-2000| bb#NAm| 2000-04-01|
| 3|28-11-1979|adam11#kjfd| 1979-11-28|
| 4|30-01-1955|rehan42#ggg| 1955-01-30|
+---+----------+-----------+------------------+
df2.printSchema()
root
 |-- id: integer (nullable = true)
 |-- dob: string (nullable = true)
 |-- var1: string (nullable = true)
 |-- date_in_dateFormat: date (nullable = true)
df3 = df2.withColumn('age', F.datediff(F.current_date(), df2.date_in_dateFormat))
df3.show()
+---+----------+-----------+------------------+-----+
| id| dob| var1|date_in_dateFormat| age|
+---+----------+-----------+------------------+-----+
| 1|13-02-1976| aab#dfsfs| 1976-02-13|15789|
| 2|01-04-2000| bb#NAm| 2000-04-01| 6975|
| 3|28-11-1979|adam11#kjfd| 1979-11-28|14405|
| 4|30-01-1955|rehan42#ggg| 1955-01-30|23473|
+---+----------+-----------+------------------+-----+
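Note that datediff returns the age in days, while the expected output in the question shows years; a possible adjustment (my addition, not part of the original answer) is to use months_between:
import pyspark.sql.functions as F

# Assumed variant: age in whole years rather than days, based on df2 above
df3_years = df2.withColumn(
    'age',
    F.floor(F.months_between(F.current_date(), df2.date_in_dateFormat) / 12).cast('int'))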
split_col = F.split(df['var1'], '#')
df4 = df3.withColumn('Var2', split_col.getItem(0))
df4.show()
+---+----------+-----------+------------------+-----+-------+
| id| dob| var1|date_in_dateFormat| age| Var2|
+---+----------+-----------+------------------+-----+-------+
| 1|13-02-1976| aab#dfsfs| 1976-02-13|15789| aab|
| 2|01-04-2000| bb#NAm| 2000-04-01| 6975| bb|
| 3|28-11-1979|adam11#kjfd| 1979-11-28|14405| adam11|
| 4|30-01-1955|rehan42#ggg| 1955-01-30|23473|rehan42|
+---+----------+-----------+------------------+-----+-------+

Adding a value into a DenseVector in PySpark

I have a DataFrame that I have processed to be like:
+---------+-------+
| inputs | temp |
+---------+-------+
| [1,0,0] | 12 |
+---------+-------+
| [0,1,0] | 10 |
+---------+-------+
...
inputs is a column of DenseVectors. temp is a column of values. I want to append these values to the DenseVectors to create a single column, but I am not sure how to start. Any tips for getting this desired output:
+---------------+
| inputsMerged |
+---------------+
| [1,0,0,12] |
+---------------+
| [0,1,0,10] |
+---------------+
...
EDIT: I am trying to use the VectorAssembler method but my resulting array is not as intended.
You might do something like this:
df.show()
+-------------+----+
| inputs|temp|
+-------------+----+
|[1.0,0.0,0.0]| 12|
|[0.0,1.0,0.0]| 10|
+-------------+----+
df.printSchema()
root
 |-- inputs: vector (nullable = true)
 |-- temp: long (nullable = true)
Import:
import pyspark.sql.functions as F
from pyspark.ml.linalg import Vectors, VectorUDT
Create the udf to merge the Vector and element:
concat = F.udf(lambda v, e: Vectors.dense(list(v) + [e]), VectorUDT())
Apply udf to inputs and temp columns:
merged_df = df.select(concat(df.inputs, df.temp).alias('inputsMerged'))
merged_df.show()
+------------------+
| inputsMerged|
+------------------+
|[1.0,0.0,0.0,12.0]|
|[0.0,1.0,0.0,10.0]|
+------------------+
merged_df.printSchema()
root
 |-- inputsMerged: vector (nullable = true)
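Since the EDIT mentions VectorAssembler, here is a sketch of that route as well (an alternative, not the answer above): VectorAssembler can merge a vector column with a numeric column directly. Depending on the values, the assembled vectors may be displayed in sparse form, which might be what made the earlier attempt look wrong.
from pyspark.ml.feature import VectorAssembler

# Assemble the existing vector column and the numeric column into one vector
assembler = VectorAssembler(inputCols=['inputs', 'temp'], outputCol='inputsMerged')
assembled_df = assembler.transform(df).select('inputsMerged')
assembled_df.show(truncate=False)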

PySpark Dataframe.groupBy MapType column

I have a dataframe with a MapType column where the key is an id and the value is another StructType with two numbers, a counter and a revenue.
It looks like that:
+--------------------------------------+
| myMapColumn |
+--------------------------------------+
| Map(1 -> [1, 4.0], 2 -> [1, 1.5]) |
| Map() |
| Map(1 -> [3, 5.5]) |
| Map(1 -> [4, 0.1], 2 -> [6, 101.56]) |
+--------------------------------------+
Now I need to sum up these two values per id and the result would be:
+----------------------+
| id | count | revenue |
+----------------------+
| 1 | 8 | 9.6 |
| 2 | 7 | 103.06 |
+----------------------+
I actually have no idea how to do that and could not find documentation for this special case. I tried using DataFrame.groupBy but could not make it work :(
Any ideas ?
I'm using Spark 1.5.2 with Python 2.6.6
Assuming that the schema is equivalent to this:
root
 |-- myMapColumn: map (nullable = true)
 |    |-- key: integer
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- _1: integer (nullable = false)
 |    |    |-- _2: double (nullable = false)
all you need is explode and a simple aggregation:
from pyspark.sql.functions import col, explode, sum as sum_

(df
    .select(explode(col("myMapColumn")))
    .groupBy(col("key").alias("id"))
    .agg(sum_("value._1").alias("count"), sum_("value._2").alias("revenue")))
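For completeness, a sketch of how the example data could be reconstructed for testing (the dict/tuple encoding and the builder call are my assumptions; on Spark 1.x use sqlContext, on 2.x+ use spark):
from pyspark.sql.types import (MapType, StructType, StructField,
                               IntegerType, DoubleType)

# Map of id -> (counter, revenue), matching the schema shown above
map_schema = StructType([
    StructField("myMapColumn", MapType(IntegerType(), StructType([
        StructField("_1", IntegerType()),
        StructField("_2", DoubleType()),
    ])))
])
data = [
    ({1: (1, 4.0), 2: (1, 1.5)},),
    ({},),
    ({1: (3, 5.5)},),
    ({1: (4, 0.1), 2: (6, 101.56)},),
]
df = sqlContext.createDataFrame(data, map_schema)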
