I have a json file which contains a dictionary in the following format:
{"a1":{"b1":["c1","c2"], "b2":["c4","c3"]}, "a2":{"b3":["c1","c4"]}}
Is it possible to convert this dictionary into a PySpark dataframe like the following?
col1 | col2 | col3
----------------------
| a1 | b1 | c1 |
----------------------
| a1 | b1 | c2 |
----------------------
| a1 | b2 | c4 |
----------------------
| a1 | b2 | c3 |
----------------------
| a2 | b3 | c1 |
----------------------
| a2 | b3 | c4 |
I have seen the standard way of converting JSON to a PySpark dataframe (example in this link), but was wondering about nested dictionaries that contain lists as well.
Interesting problem! The main difficulty I ran into is that when reading from JSON, your schema most likely has a struct type, which makes this harder to solve because a1 basically has a different type than a2.
My idea is to somehow convert the struct type to a map type, stack the columns together, then apply a few explodes:
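A sketch of how the df below might be created (the file name, the multiLine option, and wrapping everything into a single data struct column are assumptions):
from pyspark.sql import functions as F, types as T

# Read the whole JSON object as one record, then wrap all top-level keys into one struct column
df = (spark.read.json('nested_dict.json', multiLine=True)
      .select(F.struct('*').alias('data')))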
This is your df
+----------------------------------+
|data |
+----------------------------------+
|{{[c1, c2], [c4, c3]}, {[c1, c4]}}|
+----------------------------------+
root
|-- data: struct (nullable = true)
| |-- a1: struct (nullable = true)
| | |-- b1: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- b2: array (nullable = true)
| | | |-- element: string (containsNull = true)
| |-- a2: struct (nullable = true)
| | |-- b3: array (nullable = true)
| | | |-- element: string (containsNull = true)
Create a temporary df to handle JSON's first level
first_level_df = df.select('data.*')
first_level_df.show()
first_level_cols = first_level_df.columns # ['a1', 'a2']
+--------------------+----------+
| a1| a2|
+--------------------+----------+
|{[c1, c2], [c4, c3]}|{[c1, c4]}|
+--------------------+----------+
Some helper variables
map_cols = [F.from_json(F.to_json(c), T.MapType(T.StringType(), T.StringType())).alias(c) for c in first_level_cols]
# [Column<'entries AS a1'>, Column<'entries AS a2'>]
stack_cols = ', '.join([f"'{c}', {c}" for c in first_level_cols])
# 'a1', a1, 'a2', a2
Main transformation
(first_level_df
.select(map_cols)
.select(F.expr(f'stack(2, {stack_cols})').alias('AA', 'temp'))
.select('AA', F.explode('temp').alias('BB', 'temp'))
.select('AA', 'BB', F.explode(F.from_json('temp', T.ArrayType(T.StringType()))).alias('CC'))
.show(10, False)
)
+---+---+---+
|AA |BB |CC |
+---+---+---+
|a1 |b1 |c1 |
|a1 |b1 |c2 |
|a1 |b2 |c4 |
|a1 |b2 |c3 |
|a2 |b3 |c1 |
|a2 |b3 |c4 |
+---+---+---+
Related
I have an example dataset:
+---+------------------------------+
|id |example_field |
+---+------------------------------+
|1 |{[{[{111, AAA}, {222, BBB}]}]}|
+---+------------------------------+
The data types of the two fields are:
[('id', 'int'),
('example_field',
'struct<xxx:array<struct<nested_field:array<struct<field_1:int,field_2:string>>>>>')]
My question is: is there a way/function to flatten the field example_field using PySpark?
My expected output is something like this:
id field_1 field_2
1 111 AAA
1 222 BBB
The following code should do the trick:
from pyspark.sql import functions as F
(
df
.withColumn('_temp_ef', F.explode('example_field.xxx'))
.withColumn('_temp_nf', F.explode('_temp_ef.nested_field'))
.select(
'id',
F.col('_temp_nf.*')
)
)
The function explode creates a row for each element in an array, while the select turns the fields of the nested_field struct into columns.
The result is:
+---+-------+-------+
|id |field_1|field_2|
+---+-------+-------+
|1 |111 |AAA |
|1 |222 |BBB |
+---+-------+-------+
Note: I assumed that your DataFrame is something like this:
root
|-- id: integer (nullable = true)
|-- example_field: struct (nullable = true)
| |-- xxx: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- nested_field: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- field_1: integer (nullable = true)
| | | | | |-- field_2: string (nullable = true)
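For reference, a minimal, hypothetical reconstruction of that assumed DataFrame (the literal values are made up to match the example):
# Build the sample DataFrame from a DDL schema string; nested tuples map to structs
data = [(1, ([([(111, 'AAA'), (222, 'BBB')],)],))]
schema = 'id int, example_field struct<xxx:array<struct<nested_field:array<struct<field_1:int,field_2:string>>>>>'
df = spark.createDataFrame(data, schema)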
I wrote Spark code to fill a list with all the possible date / flight-price combinations.
I use a large parquet file that stores over a million flight searches to fill the list with prices.
My source df without any changes looks like this:
+--------------------+--------------------+----+-----+---+
| request| response|year|month|day|
+--------------------+--------------------+----+-----+---+
|[[[KTW, ROM, 2022...| [[]]|2022| 5| 25|
|[[[TLV, OTP, 2022...|[[[false, [185.74...|2022| 5| 25|
|[[[TLV, LJU, 2022...| [[]]|2022| 5| 25|
|[[[TLV, NYC, 2022...|[[[false, [509.35...|2022| 5| 25|
|[[[IST, DSA, 2022...| [[]]|2022| 5| 25|
+--------------------+--------------------+----+-----+---+
Schema is:
root
|-- request: struct (nullable = true)
| |-- Segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Origin: string (nullable = true)
| | | |-- Destination: string (nullable = true)
| | | |-- FlightTime: string (nullable = true)
| | | |-- Cabin: integer (nullable = true)
| |-- MarketId: string (nullable = true)
|-- response: struct (nullable = true)
| |-- results: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- IsTwoOneWay: boolean (nullable = true)
| | | |-- PriceInfo: struct (nullable = true)
| | | | |-- Price: double (nullable = true)
| | | | |-- Currency: string (nullable = true)
| | | |-- Segments: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- Duration: integer (nullable = true)
| | | | | |-- Flights: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- Origin: string (nullable = true)
| | | | | | | |-- Destination: string (nullable = true)
| | | | | | | |-- Carrier: string (nullable = true)
| | | | | | | |-- Departure: string (nullable = true)
| | | | | | | |-- Arrival: string (nullable = true)
| | | | | | | |-- Duration: integer (nullable = true)
| | | | | | | |-- FlightNumber: string (nullable = true)
| | | | | | | |-- DestinationCityCode: string (nullable = true)
| | | | | | | |-- OriginCityCode: string (nullable = true)
|-- year: integer (nullable = true)
|-- month: integer (nullable = true)
|-- day: integer (nullable = true)
I have a function that returns a list with every possible date combination in a month.
I iterate over the dataframe (after filtering the origin-destination) for every date combination.
The code is:
import findspark
findspark.init()
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from itertools import combinations
import datetime
spark = SparkSession.builder.appName("Practice").master("local[*]").config("spark.executor.memory", "70g").config("spark.driver.memory", "50g").config("spark.memory.offHeap.enabled",True).config("spark.memory.offHeap.size","16g").getOrCreate()
df = spark.read.parquet('spark-big-data\parquet_example.parquet')
df = df.withColumn('fs_origin',df.request.Segments.getItem(0)['Origin'])
df = df.withColumn('fs_destination',df.request.Segments.getItem(0)['Destination'])
df = df.withColumn('fs_date',df.request.Segments.getItem(0)['FlightTime'])
df = df.withColumn('ss_origin',df.request.Segments.getItem(1)['Origin'])
df = df.withColumn('ss_destination',df.request.Segments.getItem(1)['Destination'])
df = df.withColumn('ss_date',df.request.Segments.getItem(1)['FlightTime'])
df = df.filter((df["fs_origin"] == 'TLV') & (df["fs_destination"] == 'NYC') & (df["ss_origin"] == 'NYC') & (df['ss_destination']=='TLV'))
df = df.withColumn('full_date',F.concat_ws('-', df.year,df.month,df.day)).persist()
df = df.filter(F.size('response.results') > 0)
def get_all_dates_in_month():
start = datetime.datetime.strptime("2022-06-01", "%Y-%m-%d")
end = datetime.datetime.strptime("2022-07-10", "%Y-%m-%d")
date_generated = [start + datetime.timedelta(days=x) for x in range(0, (end-start).days+1)]
date_generated = [date_obj.strftime('%Y-%m-%d') for date_obj in date_generated]
combinations_res = combinations(date_generated, 2)
return list(combinations_res)
dates_list = get_all_dates_in_month()
res =[]
for date in dates_list:
df_date = df.filter((df['fs_date']==date[0]+'T00:00:00') & (df['ss_date']==date[1]+'T00:00:00'))
if df_date.count()==0:
res.append([date, 0])
else:
df_date = df_date.sort(F.unix_timestamp("full_date", "yyyy-M-d").desc())
latest_day = df_date.collect()[0]['full_date']
df_date = df_date.filter(df_date['full_date']==latest_day)
df_date = df_date.withColumn("exploded_data", F.explode("response.results"))
df_date = df_date.withColumn(
"price",
F.col("exploded_data").getItem('PriceInfo').getItem('Price') # either get by name or by index e.g. getItem(0) etc
)
res.append([date, df_date.sort(df_date.price.asc()).collect()[0]['price']])
df.unpersist()
spark.catalog.clearCache()
print(*res,sep='\n')
The response is like so:
[('2022-06-01', '2022-06-02'), 1210.58]
[('2022-06-01', '2022-06-03'), 856.38]
[('2022-06-01', '2022-06-04'), 746.58]
[('2022-06-01', '2022-06-05'), 638.28]
[('2022-06-01', '2022-06-06'), 687.28]
[('2022-06-01', '2022-06-07'), 654.78]
[('2022-06-01', '2022-06-08'), 687.28]
[('2022-06-01', '2022-06-09'), 687.28]
[('2022-06-01', '2022-06-10'), 1241.63]
[('2022-06-01', '2022-06-11'), 773.18]
[('2022-06-01', '2022-06-12'), 773.18]
[('2022-06-01', '2022-06-13'), 697.98]
...
...
The problem I have is that it takes around 4 minutes to produce the response, and I'm sure there is a much more efficient way of doing it than iterating over the 400 possible date pairs in the list.
I need a way to build the list such that, while going over the dataframe, once a matching flight is found (for any of the possible date pairs) it is appended to the list, so that I don't need to go through the dataframe 400 times.
I don't know how to do this or how to make the code more efficient, if that's possible.
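For illustration, a rough sketch of that single-pass idea: aggregate the minimum price per (fs_date, ss_date) pair once and join the date combinations to it (column names as above; the "latest search day" filter from the loop is left out here):
# Turn the ~400 date combinations into a DataFrame
dates_df = spark.createDataFrame(dates_list, ['out_date', 'in_date'])
# Aggregate the cheapest price per date pair in a single pass over df
min_prices = (df
    .withColumn('result', F.explode('response.results'))
    .groupBy('fs_date', 'ss_date')
    .agg(F.min('result.PriceInfo.Price').alias('price')))
# Left-join the combinations to the aggregated prices; unmatched pairs get price 0
res_df = (dates_df
    .join(min_prices,
          (F.concat(dates_df.out_date, F.lit('T00:00:00')) == min_prices.fs_date) &
          (F.concat(dates_df.in_date, F.lit('T00:00:00')) == min_prices.ss_date),
          'left')
    .fillna({'price': 0}))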
Thanks!
I applied an algorithm from the question Spark: How to transpose and explode columns with nested arrays to transpose and explode a nested Spark dataframe with dynamic arrays.
I added """{"id":3,"c":[{"date":3,"val":3, "val_dynamic":3}]}}""" to the dataframe, with a new column c whose array has a new val_dynamic field that can appear on a random basis.
I'm looking for required output 2 (transpose and explode), but even an example of required output 1 (transpose) would be very useful.
Input df:
+------------------+--------+-----------+---+
| a| b| c| id|
+------------------+--------+-----------+---+
|[{1, 1}, {11, 11}]| null| null| 1|
| null|[{2, 2}]| null| 2|
| null| null|[{3, 3, 3}]| 3| !!! NOTE: Added `val_dynamic`
+------------------+--------+-----------+---+
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
| | |-- val_dynamic: long (nullable = true) !!! NOTE: Added `val_dynamic`
|-- id: long (nullable = true)
Required output 1 (transpose_df):
+---+------+-------------------+
| id| cols | arrays |
+---+------+-------------------+
| 1| a | [{1, 1}, {11, 11}]|
| 2| b | [{2, 2}] |
| 3| c | [{3, 3, 3}] | !!! NOTE: Added `val_dynamic`
+---+------+-------------------+
Required output 2 (explode_df):
+---+----+----+---+-----------+
| id|cols|date|val|val_dynamic|
+---+----+----+---+-----------+
| 1| a| 1| 1| null |
| 1| a| 11| 11| null |
| 2| b| 2| 2| null |
| 3| c| 3| 3| 3 | !!! NOTE: Added `val_dynamic`
+---+----+----+---+-----------+
Current code:
import pyspark.sql.functions as f
df = spark.read.json(sc.parallelize([
"""{"id":1,"a":[{"date":1,"val":1},{"date":11,"val":11}]}""",
"""{"id":2,"b":[{"date":2,"val":2}]}}""",
"""{"id":3,"c":[{"date":3,"val":3, "val_dynamic":3}]}}"""
]))
df.show()
cols = [ 'a', 'b', 'c']
#expr = stack(2,'a',a,'b',b,'c',c )
expr = f"stack({len(cols)}," + \
",".join([f"'{c}',{c}" for c in cols]) + \
")"
transpose_df = df.selectExpr("id", expr) \
.withColumnRenamed("col0", "cols") \
.withColumnRenamed("col1", "arrays") \
.filter("not arrays is null")
transpose_df.show()
explode_df = transpose_df.selectExpr('id', 'cols', 'inline(arrays)')
explode_df.show()
Current outcome
AnalysisException: cannot resolve 'stack(3, 'a', `a`, 'b', `b`, 'c', `c`)' due to data type mismatch: Argument 2 (array<struct<date:bigint,val:bigint>>) != Argument 6 (array<struct<date:bigint,val:bigint,val_dynamic:bigint>>); line 1 pos 0;
'Project [id#2304L, unresolvedalias(stack(3, a, a#2301, b, b#2302, c, c#2303), Some(org.apache.spark.sql.Column$$Lambda$2580/0x00000008411d3040#4d9eefd0))]
+- LogicalRDD [a#2301, b#2302, c#2303, id#2304L], false
ref : Transpose column to row with Spark
stack requires that all stacked columns have the same type. The problem here is that the structs inside the arrays have different members. One approach would be to add the missing members to all structs so that the approach from my previous answer works again.
cols = ['a', 'b', 'c']
#create a map containing all struct fields per column
existing_fields = {c:list(map(lambda field: field.name, df.schema.fields[i].dataType.elementType.fields))
for i,c in enumerate(df.columns) if c in cols}
#get a (unique) set of all fields that exist in all columns
all_fields = set(sum(existing_fields.values(),[]))
#create a list of transform expressions to fill up the structs with null fields
transform_exprs = [f"transform({c}, e -> named_struct(" +
",".join([f"'{f}', {('e.'+f) if f in existing_fields[c] else 'cast(null as long)'}" for f in all_fields])
+ f")) as {c}" for c in cols]
#create a df where all columns contain arrays with the same struct
full_struct_df = df.selectExpr("id", *transform_exprs)
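For column b, for example, the generated expression looks roughly like this (the exact field order may vary, since all_fields is a set):
transform(b, e -> named_struct('val', e.val, 'val_dynamic', cast(null as long), 'date', e.date)) as b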
full_struct_df now has the schema
root
|-- id: long (nullable = true)
|-- a: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- val: long (nullable = true)
| | |-- val_dynamic: long (nullable = true)
| | |-- date: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- val: long (nullable = true)
| | |-- val_dynamic: long (nullable = true)
| | |-- date: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- val: long (nullable = true)
| | |-- val_dynamic: long (nullable = true)
| | |-- date: long (nullable = true)
From here the logic works as before:
stack_expr = f"stack({len(cols)}," + \
",".join([f"'{c}',{c}" for c in cols]) + \
")"
transpose_df = full_struct_df.selectExpr("id", stack_expr) \
.withColumnRenamed("col0", "cols") \
.withColumnRenamed("col1", "arrays") \
.filter("not arrays is null")
explode_df = transpose_df.selectExpr('id', 'cols', 'inline(arrays)')
The first part of this answer requires that
each column mentioned in cols is an array of structs
all members of all structs are longs. The reason for this restriction is the cast(null as long) when creating the transform expression.
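If the structs contain other types as well, one possible tweak (a sketch, assuming each field name has a consistent type wherever it appears) is to look up the type from the schema instead of hard-coding long:
# Map each struct field name to its Spark SQL type, taken from whichever column defines it
field_types = {}
for c in cols:
    for field in df.schema[c].dataType.elementType.fields:
        field_types[field.name] = field.dataType.simpleString()
# Same transform expressions as above, but casting nulls to the looked-up type
transform_exprs = [f"transform({c}, e -> named_struct(" +
    ",".join([f"'{f}', {('e.'+f) if f in existing_fields[c] else f'cast(null as {field_types[f]})'}" for f in all_fields])
    + f")) as {c}" for c in cols]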
How can I transform the values below from multiple XML files into a Spark dataframe:
attribute Id0 from Level_0
Date/Value from Level_4
Required output:
+----------------+-------------+---------+
|Id0 |Date |Value |
+----------------+-------------+---------+
|Id0_value_file_1| 2021-01-01 | 4_1 |
|Id0_value_file_1| 2021-01-02 | 4_2 |
|Id0_value_file_2| 2021-01-01 | 4_1 |
|Id0_value_file_2| 2021-01-02 | 4_2 |
+----------------+-------------+---------+
file_1.xml:
<Level_0 Id0="Id0_value_file1">
<Level_1 Id1_1 ="Id3_value" Id_2="Id2_value">
<Level_2_A>A</Level_2_A>
<Level_2>
<Level_3>
<Level_4>
<Date>2021-01-01</Date>
<Value>4_1</Value>
</Level_4>
<Level_4>
<Date>2021-01-02</Date>
<Value>4_2</Value>
</Level_4>
</Level_3>
</Level_2>
</Level_1>
</Level_0>
file_2.xml:
<Level_0 Id0="Id0_value_file2">
<Level_1 Id1_1 ="Id3_value" Id_2="Id2_value">
<Level_2_A>A</Level_2_A>
<Level_2>
<Level_3>
<Level_4>
<Date>2021-01-01</Date>
<Value>4_1</Value>
</Level_4>
<Level_4>
<Date>2021-01-02</Date>
<Value>4_2</Value>
</Level_4>
</Level_3>
</Level_2>
</Level_1>
</Level_0>
Current Code Example:
files_list = ["file_1.xml", "file_2.xml"]
df = (spark.read.format('xml')
    .options(rowTag="Level_4")
    .load(','.join(files_list)))
Current output (the Id0 attribute column is missing):
+-------------+---------+
|Date |Value |
+-------------+---------+
| 2021-01-01 | 4_1 |
| 2021-01-02 | 4_2 |
| 2021-01-01 | 4_1 |
| 2021-01-02 | 4_2 |
+-------------+---------+
There are some examples, but none of them solve the problem:
- I'm using Databricks spark-xml - https://github.com/databricks/spark-xml
- There is an example, but not with attribute reading: Read XML in spark, Extracting tag attributes from xml using sparkxml.
EDIT:
As #mck correctly pointed out, <Level_2>A</Level_2> is not valid XML. I had a mistake in my example (the xml files are now corrected); it should be <Level_2_A>A</Level_2_A>. After that, the proposed solution works even on multiple files.
NOTE: To speed up loading of a large number of XML files, define a schema. If no schema is defined, Spark reads each file while creating the dataframe in order to infer the schema...
for more info: https://szczeles.github.io/Reading-JSON-CSV-and-XML-files-efficiently-in-Apache-Spark/
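For example, a possible explicit schema matching the printSchema() output shown under STEP 1 (field names starting with _ are XML attributes as produced by spark-xml):
from pyspark.sql import types as T

schema = T.StructType([
    T.StructField('Level_1', T.StructType([
        T.StructField('Level_2', T.StructType([
            T.StructField('Level_3', T.StructType([
                T.StructField('Level_4', T.ArrayType(T.StructType([
                    T.StructField('Date', T.StringType()),
                    T.StructField('Value', T.StringType()),
                ]))),
            ])),
        ])),
        T.StructField('Level_2_A', T.StringType()),
        T.StructField('_Id1_1', T.StringType()),
        T.StructField('_Id_2', T.StringType()),
    ])),
    T.StructField('_Id0', T.StringType()),
])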
STEP 1):
files_list = ["file_1.xml", "file_2.xml"]
# for the schema, see the NOTE and the example above
df = (spark.read.format('xml')
.options(rowTag="Level_0")
.load(','.join(files_list),schema=schema))
df.printSchema()
root
|-- Level_1: struct (nullable = true)
| |-- Level_2: struct (nullable = true)
| | |-- Level_3: struct (nullable = true)
| | | |-- Level_4: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- Date: string (nullable = true)
| | | | | |-- Value: string (nullable = true)
| |-- Level_2_A: string (nullable = true)
| |-- _Id1_1: string (nullable = true)
| |-- _Id_2: string (nullable = true)
|-- _Id0: string (nullable = true)
STEP 2) see #mck's solution below:
You can use Level_0 as the rowTag, and explode the relevant arrays/structs:
import pyspark.sql.functions as F
df = spark.read.format('xml').options(rowTag="Level_0").load('line_removed.xml')
df2 = df.select(
'_Id0',
F.explode_outer('Level_1.Level_2.Level_3.Level_4').alias('Level_4')
).select(
'_Id0',
'Level_4.*'
)
df2.show()
+---------------+----------+-----+
| _Id0| Date|Value|
+---------------+----------+-----+
|Id0_value_file1|2021-01-01| 4_1|
|Id0_value_file1|2021-01-02| 4_2|
+---------------+----------+-----+
I am trying to understand how I can do operations inside small groups in a PySpark DataFrame. Suppose I have a DF with the following schema:
root
|-- first_id: string (nullable = true)
|-- second_id_struct: struct (nullable = true)
| |-- s_id: string (nullable = true)
| |-- s_id_2: int (nullable = true)
|-- depth_from: float (nullable = true)
|-- depth_to: float (nullable = true)
|-- total_depth: float (nullable = true)
So data might look something like this:
I would like to:
group data by first_id
inside each group, order it by s_id_2 in ascending order
append an extra column layer to either the struct or the root DataFrame that would indicate the order of this s_id_2 within its group.
For example:
first_id | second_id | second_id_order
---------| --------- | ---------------
A1 | [B, 10] | 1
---------| --------- | ---------------
A1 | [B, 14] | 2
---------| --------- | ---------------
A1 | [B, 22] | 3
---------| --------- | ---------------
A5 | [A, 1] | 1
---------| --------- | ---------------
A5 | [A, 7] | 2
---------| --------- | ---------------
A7 | null | 1
---------| --------- | ---------------
Once grouped, each first_id will have at most 4 second_id_struct. How do I approach these kinds of problems?
I am particularly interested in how to make iterative operations inside small groups (1-40 rows) of DataFrames in general, where order of columns inside a group matters.
Thanks!
create a DataFrame
d = [{'first_id': 'A1', 'second_id': ['B',10]}, {'first_id': 'A1', 'second_id': ['B',14]},{'first_id': 'A1', 'second_id': ['B',22]},{'first_id': 'A5', 'second_id': ['A',1]},{'first_id': 'A5', 'second_id': ['A',7]}]
df = sqlContext.createDataFrame(d)
And you can see the structure
df.printSchema()
|-- first_id: string (nullable = true)
|-- second_id: array (nullable = true)
|    |-- element: string (containsNull = true)
df.show()
+--------+----------+
|first_id|second_id |
+--------+----------+
| A1| [B, 10]|
| A1| [B, 14]|
| A1| [B, 22]|
| A5| [A, 1]|
| A5| [A, 7]|
+--------+----------+
Then you can use dense_rank and a Window function to show the order within the subgroup. It is the same as OVER (PARTITION BY ...) in SQL.
An introduction to window functions: Introducing Window Functions in Spark SQL
Code here:
from pyspark.sql import Window
from pyspark.sql.functions import dense_rank

# set a window spec partitioned by first_id and ordered by the second element of second_id
windowSpec = Window.partitionBy('first_id').orderBy(df.second_id[1])
# apply dense_rank over the window spec
df.select(df.first_id, df.second_id, dense_rank().over(windowSpec).alias("second_id_order")).show()
Result:
+--------+---------+---------------+
|first_id|second_id|second_id_order|
+--------+---------+---------------+
| A1| [B, 10]| 1|
| A1| [B, 14]| 2|
| A1| [B, 22]| 3|
| A5| [A, 1]| 1|
| A5| [A, 7]| 2|
+--------+---------+---------------+
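If the order should live inside a struct rather than as a top-level column (as asked above), one option (a sketch reusing the same windowSpec) is to rebuild the column as a struct carrying both the original value and its rank:
from pyspark.sql.functions import dense_rank, struct

# Add the rank, then bundle it together with second_id into a single struct column
df_with_order = (df
    .withColumn('second_id_order', dense_rank().over(windowSpec))
    .withColumn('second_id_struct', struct('second_id', 'second_id_order')))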