I applied the algorithm from the question "Spark: How to transpose and explode columns with nested arrays" to transpose and explode a nested Spark dataframe with dynamic arrays.
I added the row """{"id":3,"c":[{"date":3,"val":3, "val_dynamic":3}]}""" to the dataframe. It introduces a new column c whose array elements carry an extra val_dynamic field, which may or may not be present.
I'm looking for required output 2 (transpose and explode), but even an example of required output 1 (transpose) would be very useful.
Input df:
+------------------+--------+-----------+---+
| a| b| c| id|
+------------------+--------+-----------+---+
|[{1, 1}, {11, 11}]| null| null| 1|
| null|[{2, 2}]| null| 2|
| null| null|[{3, 3, 3}]| 3| !!! NOTE: Added `val_dynamic`
+------------------+--------+-----------+---+
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
| | |-- val_dynamic: long (nullable = true) !!! NOTE: Added `val_dynamic`
|-- id: long (nullable = true)
Required output 1 (transpose_df):
+---+------+-------------------+
| id| cols | arrays |
+---+------+-------------------+
| 1| a | [{1, 1}, {11, 11}]|
| 2| b | [{2, 2}] |
| 3| c | [{3, 3, 3}] | !!! NOTE: Added `val_dynamic`
+---+------+-------------------+
Required output 2 (explode_df):
+---+----+----+---+-----------+
| id|cols|date|val|val_dynamic|
+---+----+----+---+-----------+
| 1| a| 1| 1| null |
| 1| a| 11| 11| null |
| 2| b| 2| 2| null |
| 3| c| 3| 3| 3 | !!! NOTE: Added `val_dynamic`
+---+----+----+---+-----------+
Current code:
import pyspark.sql.functions as f
df = spark.read.json(sc.parallelize([
"""{"id":1,"a":[{"date":1,"val":1},{"date":11,"val":11}]}""",
"""{"id":2,"b":[{"date":2,"val":2}]}""",
"""{"id":3,"c":[{"date":3,"val":3, "val_dynamic":3}]}"""
]))
df.show()
cols = [ 'a', 'b', 'c']
# expr = stack(3, 'a', a, 'b', b, 'c', c)
expr = f"stack({len(cols)}," + \
",".join([f"'{c}',{c}" for c in cols]) + \
")"
transpose_df = df.selectExpr("id", expr) \
.withColumnRenamed("col0", "cols") \
.withColumnRenamed("col1", "arrays") \
.filter("not arrays is null")
transpose_df.show()
explode_df = transpose_df.selectExpr('id', 'cols', 'inline(arrays)')
explode_df.show()
Current outcome
AnalysisException: cannot resolve 'stack(3, 'a', `a`, 'b', `b`, 'c', `c`)' due to data type mismatch: Argument 2 (array<struct<date:bigint,val:bigint>>) != Argument 6 (array<struct<date:bigint,val:bigint,val_dynamic:bigint>>); line 1 pos 0;
'Project [id#2304L, unresolvedalias(stack(3, a, a#2301, b, b#2302, c, c#2303), Some(org.apache.spark.sql.Column$$Lambda$2580/0x00000008411d3040#4d9eefd0))]
+- LogicalRDD [a#2301, b#2302, c#2303, id#2304L], false
ref : Transpose column to row with Spark
stack requires that all stacked columns have the same type. The problem here is that the structs inside the arrays have different members. One approach is to add the missing members to all structs, so that the approach from my previous answer works again.
cols = ['a', 'b', 'c']
#create a map containing all struct fields per column
existing_fields = {c:list(map(lambda field: field.name, df.schema.fields[i].dataType.elementType.fields))
for i,c in enumerate(df.columns) if c in cols}
# get the (unique) set of all fields that exist in any of the columns
all_fields = set(sum(existing_fields.values(),[]))
# create a list of transform expressions that fill up the structs with null fields
transform_exprs = [f"transform({c}, e -> named_struct(" +
",".join([f"'{f}', {('e.'+f) if f in existing_fields[c] else 'cast(null as long)'}" for f in all_fields])
+ f")) as {c}" for c in cols]
#create a df where all columns contain arrays with the same struct
full_struct_df = df.selectExpr("id", *transform_exprs)
full_struct_df now has the schema:
root
|-- id: long (nullable = true)
|-- a: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- val: long (nullable = true)
| | |-- val_dynamic: long (nullable = true)
| | |-- date: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- val: long (nullable = true)
| | |-- val_dynamic: long (nullable = true)
| | |-- date: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- val: long (nullable = true)
| | |-- val_dynamic: long (nullable = true)
| | |-- date: long (nullable = true)
From here the logic works as before:
stack_expr = f"stack({len(cols)}," + \
",".join([f"'{c}',{c}" for c in cols]) + \
")"
transpose_df = full_struct_df.selectExpr("id", stack_expr) \
.withColumnRenamed("col0", "cols") \
.withColumnRenamed("col1", "arrays") \
.filter("not arrays is null")
explode_df = transpose_df.selectExpr('id', 'cols', 'inline(arrays)')
The first part of this answer requires that:
each column mentioned in cols is an array of structs, and
all members of all structs are longs. The reason for this restriction is the cast(null as long) used when creating the transform expressions.
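The long-only restriction could be lifted by carrying each field's type into the null cast. The following is a minimal sketch, not part of the original answer: build_transform_exprs is a hypothetical helper, and it assumes that the field types are supplied as Spark SQL type names and that a field name has the same type wherever it occurs. It only builds the expression strings, so it runs without a Spark session.

```python
def build_transform_exprs(field_types_by_col):
    """Build one Spark SQL `transform` expression per array column.

    field_types_by_col maps a column name to {struct field name: SQL type name}.
    (Hypothetical helper, written for this sketch.)
    """
    # Union of every field seen in any column, together with its type.
    all_fields = {}
    for fields in field_types_by_col.values():
        for name, dtype in fields.items():
            # Assumption: a field name has the same type in every column.
            if all_fields.setdefault(name, dtype) != dtype:
                raise ValueError(f"conflicting types for field {name!r}")

    exprs = []
    for col, fields in field_types_by_col.items():
        # Existing fields are copied, missing ones become typed nulls.
        # Sorting keeps the member order identical for every column.
        members = ", ".join(
            f"'{name}', " + (f"e.{name}" if name in fields else f"cast(null as {dtype})")
            for name, dtype in sorted(all_fields.items())
        )
        exprs.append(f"transform({col}, e -> named_struct({members})) as {col}")
    return exprs


exprs = build_transform_exprs({
    "a": {"date": "bigint", "val": "bigint"},
    "b": {"date": "bigint", "val": "bigint"},
    "c": {"date": "bigint", "val": "bigint", "val_dynamic": "bigint"},
})
for e in exprs:
    print(e)
```

The resulting strings could be passed to df.selectExpr("id", *exprs) in place of transform_exprs. The type names could be read from the schema with field.dataType.simpleString() instead of being written by hand.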
Related
I have a pyspark dataframe created from XML. Because of the way XML is structured I have an extra, unnecessary level of nesting in the schema of the dataframe.
The schema of my current dataframe:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- movies: struct (nullable = true)
| | | |-- movie: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- b: string (nullable = true)
| | | | | |-- c: string (nullable = true)
| | | | | |-- d: integer (nullable = true)
| | | | | |-- e: string (nullable = true)
| | |-- f: string (nullable = true)
| | |-- g: string (nullable = true)
I'm trying to replace the movies struct with the movie array underneath it as follows:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- movies: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- b: string (nullable = true)
| | | | |-- c: string (nullable = true)
| | | | |-- d: integer (nullable = true)
| | | | |-- e: string (nullable = true)
| | |-- f: string (nullable = true)
| | |-- g: string (nullable = true)
The closest I've gotten was using:
from pyspark.sql import functions as F
df.withColumn("a", F.transform('a', lambda x: x.withField("movies_new", F.col("a.movies.movie"))))
which results in the following schema:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- movies: struct (nullable = true)
| | | |-- movie: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- b: string (nullable = true)
| | | | | |-- c: string (nullable = true)
| | | | | |-- d: integer (nullable = true)
| | | | | |-- e: string (nullable = true)
| | |-- f: string (nullable = true)
| | |-- g: string (nullable = true)
| | |-- movies_new: array (nullable = true)
| | | |-- element: array (containsNull = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- b: string (nullable = true)
| | | | | |-- c: string (nullable = true)
| | | | | |-- d: integer (nullable = true)
| | | | | |-- e: string (nullable = true)
I understand why this is happening, but I thought that if I never extracted the nested array out of 'a', it might not become an array of arrays.
Any suggestions?
The logic is:
1. Explode array "a".
2. Recompute the new struct as (movies.movie, f, g).
3. Collect "a" back as an array.
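# One row per element of "a".
df = df.withColumn("a", F.explode("a"))
# Rebuild the struct, promoting movies.movie to movies.
df = df.withColumn("a", F.struct(
    df.a.movies.getField("movie").alias("movies"),
    df.a.f.alias("f"),
    df.a.g.alias("g")))
# Collect the structs back into an array. Without a groupBy this aggregates over
# the whole DataFrame, so group by a row key if there is more than one row.
df = df.select(F.collect_list("a").alias("a"))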
The full working code:
import pyspark.sql.functions as F
df = spark.createDataFrame(data=[
[[(([("b1", "c1", "d1", "e1")],), "f1", "g1")]]
], schema="a array<struct<movies struct<movie array<struct<b string, c string, d string, e string>>>, f string, g string>>")
df.printSchema()
# df.show(truncate=False)
df = df.withColumn("a", F.explode("a"))
df = df.withColumn("a", F.struct( \
df.a.movies.getField("movie").alias("movies"), \
df.a.f.alias("f"), \
df.a.g.alias("g")))
df = df.select(F.collect_list("a").alias("a"))
df.printSchema()
# df.show(truncate=False)
Output schema before:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- movies: struct (nullable = true)
| | | |-- movie: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- b: string (nullable = true)
| | | | | |-- c: string (nullable = true)
| | | | | |-- d: string (nullable = true)
| | | | | |-- e: string (nullable = true)
| | |-- f: string (nullable = true)
| | |-- g: string (nullable = true)
Output schema after:
root
|-- a: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- movies: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- b: string (nullable = true)
| | | | |-- c: string (nullable = true)
| | | | |-- d: string (nullable = true)
| | | | |-- e: string (nullable = true)
| | |-- f: string (nullable = true)
| | |-- g: string (nullable = true)
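An alternative that avoids the explode/collect_list round trip is to rebuild each element in place with the SQL higher-order function transform (available since Spark 2.4). This is a minimal sketch that assumes the field names from the question; lift_nested_array_expr is a hypothetical helper that only builds the expression string, so the block itself runs without Spark.

```python
def lift_nested_array_expr(array_col, nested, inner, keep):
    """Build a Spark SQL expression that rebuilds every struct in `array_col`,
    replacing the struct field `nested` by its inner array `nested.inner`
    and copying the fields listed in `keep` unchanged.
    (Hypothetical helper, written for this sketch.)
    """
    members = [f"'{nested}', x.{nested}.{inner}"]
    members += [f"'{name}', x.{name}" for name in keep]
    return f"transform({array_col}, x -> named_struct({', '.join(members)}))"


expr = lift_nested_array_expr("a", "movies", "movie", ["f", "g"])
print(expr)
```

With a DataFrame at hand, the expression would be applied as df.withColumn("a", F.expr(expr)). Because no aggregation is involved, each row keeps its own array.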
I have an example dataset:
+---+------------------------------+
|id |example_field |
+---+------------------------------+
|1 |{[{[{111, AAA}, {222, BBB}]}]}|
+---+------------------------------+
The data types of the two fields are:
[('id', 'int'),
('example_field',
'struct<xxx:array<struct<nested_field:array<struct<field_1:int,field_2:string>>>>>')]
My question is whether there is a way or function to flatten the field example_field using pyspark.
My expected output is something like this:
id field_1 field_2
1 111 AAA
1 222 BBB
The following code should do the trick:
from pyspark.sql import functions as F
(
    df
    .withColumn('_temp_ef', F.explode('example_field.xxx'))
    .withColumn('_temp_nf', F.explode('_temp_ef.nested_field'))
    .select(
        'id',
        F.col('_temp_nf.*')
    )
)
The function explode creates one row for each element of an array, while selecting '_temp_nf.*' turns the fields of the nested_field struct into columns.
The result is:
+---+-------+-------+
|id |field_1|field_2|
+---+-------+-------+
|1 |111 |AAA |
|1 |222 |BBB |
+---+-------+-------+
Note: I assumed that your DataFrame is something like this:
root
|-- id: integer (nullable = true)
|-- example_field: struct (nullable = true)
| |-- xxx: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- nested_field: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- field_1: integer (nullable = true)
| | | | | |-- field_2: string (nullable = true)
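To make the row multiplication concrete, here is a plain-Python model of the two explode steps on the sample row. It only illustrates the semantics: the explode function below is a stand-in written for this sketch, not the Spark API, and the record starts at the xxx array (skipping the example_field wrapper) for brevity.

```python
def explode(rows, key):
    """Plain-Python stand-in for F.explode: one output row per element of row[key]."""
    out = []
    for row in rows:
        # A null or empty array produces no output rows.
        for element in row[key] or []:
            out.append({**row, key: element})
    return out


rows = [{"id": 1, "xxx": [{"nested_field": [
    {"field_1": 111, "field_2": "AAA"},
    {"field_1": 222, "field_2": "BBB"},
]}]}]

# First explode: one row per element of xxx.
step1 = explode(rows, "xxx")
# Second explode: one row per element of nested_field.
step2 = explode(
    [{"id": r["id"], "nf": r["xxx"]["nested_field"]} for r in step1], "nf"
)
# Equivalent of select('id', '_temp_nf.*'): struct fields become columns.
result = [{"id": r["id"], **r["nf"]} for r in step2]
print(result)
```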
I wrote Spark code that fills a list with the flight price for every possible combination of dates.
I use a large parquet file that stores over a million flight searches to fill the list with prices.
My source df without any changes looks like this:
+--------------------+--------------------+----+-----+---+
| request| response|year|month|day|
+--------------------+--------------------+----+-----+---+
|[[[KTW, ROM, 2022...| [[]]|2022| 5| 25|
|[[[TLV, OTP, 2022...|[[[false, [185.74...|2022| 5| 25|
|[[[TLV, LJU, 2022...| [[]]|2022| 5| 25|
|[[[TLV, NYC, 2022...|[[[false, [509.35...|2022| 5| 25|
|[[[IST, DSA, 2022...| [[]]|2022| 5| 25|
+--------------------+--------------------+----+-----+---+
Schema is:
root
|-- request: struct (nullable = true)
| |-- Segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Origin: string (nullable = true)
| | | |-- Destination: string (nullable = true)
| | | |-- FlightTime: string (nullable = true)
| | | |-- Cabin: integer (nullable = true)
| |-- MarketId: string (nullable = true)
|-- response: struct (nullable = true)
| |-- results: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- IsTwoOneWay: boolean (nullable = true)
| | | |-- PriceInfo: struct (nullable = true)
| | | | |-- Price: double (nullable = true)
| | | | |-- Currency: string (nullable = true)
| | | |-- Segments: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- Duration: integer (nullable = true)
| | | | | |-- Flights: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- Origin: string (nullable = true)
| | | | | | | |-- Destination: string (nullable = true)
| | | | | | | |-- Carrier: string (nullable = true)
| | | | | | | |-- Departure: string (nullable = true)
| | | | | | | |-- Arrival: string (nullable = true)
| | | | | | | |-- Duration: integer (nullable = true)
| | | | | | | |-- FlightNumber: string (nullable = true)
| | | | | | | |-- DestinationCityCode: string (nullable = true)
| | | | | | | |-- OriginCityCode: string (nullable = true)
|-- year: integer (nullable = true)
|-- month: integer (nullable = true)
|-- day: integer (nullable = true)
I have a function that returns a list with every possible date combination in a month.
I iterate over the dataframe (after filtering the origin-destination) for every date combination.
The code is:
import findspark
findspark.init()
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from itertools import combinations
import datetime
spark = SparkSession.builder.appName("Practice").master("local[*]").config("spark.executor.memory", "70g").config("spark.driver.memory", "50g").config("spark.memory.offHeap.enabled",True).config("spark.memory.offHeap.size","16g").getOrCreate()
df = spark.read.parquet(r'spark-big-data\parquet_example.parquet')
df = df.withColumn('fs_origin',df.request.Segments.getItem(0)['Origin'])
df = df.withColumn('fs_destination',df.request.Segments.getItem(0)['Destination'])
df = df.withColumn('fs_date',df.request.Segments.getItem(0)['FlightTime'])
df = df.withColumn('ss_origin',df.request.Segments.getItem(1)['Origin'])
df = df.withColumn('ss_destination',df.request.Segments.getItem(1)['Destination'])
df = df.withColumn('ss_date',df.request.Segments.getItem(1)['FlightTime'])
df = df.filter((df["fs_origin"] == 'TLV') & (df["fs_destination"] == 'NYC') & (df["ss_origin"] == 'NYC') & (df['ss_destination']=='TLV'))
df = df.withColumn('full_date',F.concat_ws('-', df.year,df.month,df.day)).persist()
df =df.filter(F.size('response.results') > 0)
def get_all_dates_in_month():
    start = datetime.datetime.strptime("2022-06-01", "%Y-%m-%d")
    end = datetime.datetime.strptime("2022-07-10", "%Y-%m-%d")
    date_generated = [start + datetime.timedelta(days=x) for x in range(0, (end-start).days+1)]
    date_generated = [date_obj.strftime('%Y-%m-%d') for date_obj in date_generated]
    combinations_res = combinations(date_generated, 2)
    return list(combinations_res)
dates_list = get_all_dates_in_month()
res = []
for date in dates_list:
    df_date = df.filter((df['fs_date']==date[0]+'T00:00:00') & (df['ss_date']==date[1]+'T00:00:00'))
    if df_date.count()==0:
        res.append([date, 0])
    else:
        df_date = df_date.sort(F.unix_timestamp("full_date", "yyyy-M-d").desc())
        latest_day = df_date.collect()[0]['full_date']
        df_date = df_date.filter(df_date['full_date']==latest_day)
        df_date = df_date.withColumn("exploded_data", F.explode("response.results"))
        df_date = df_date.withColumn(
            "price",
            F.col("exploded_data").getItem('PriceInfo').getItem('Price') # either get by name or by index e.g. getItem(0) etc
        )
        res.append([date, df_date.sort(df_date.price.asc()).collect()[0]['price']])
df.unpersist()
spark.catalog.clearCache()
print(*res,sep='\n')
The response is like so:
[('2022-06-01', '2022-06-02'), 1210.58]
[('2022-06-01', '2022-06-03'), 856.38]
[('2022-06-01', '2022-06-04'), 746.58]
[('2022-06-01', '2022-06-05'), 638.28]
[('2022-06-01', '2022-06-06'), 687.28]
[('2022-06-01', '2022-06-07'), 654.78]
[('2022-06-01', '2022-06-08'), 687.28]
[('2022-06-01', '2022-06-09'), 687.28]
[('2022-06-01', '2022-06-10'), 1241.63]
[('2022-06-01', '2022-06-11'), 773.18]
[('2022-06-01', '2022-06-12'), 773.18]
[('2022-06-01', '2022-06-13'), 697.98]
...
...
The problem I have is that it takes around 4 minutes to produce the response, and I'm sure there is a much more efficient way of doing it than iterating over the 400 possible dates in the list.
I need a way to build the list in a single pass: while going through the dataframe, whenever I find a matching flight (for any of the possible dates), its price is appended to the list, so that I don't need to go through the dataframe 400 times.
I don't know how to do this, or how to make the code more efficient if that is possible.
Thanks!
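The single-pass idea described above can be modelled in plain Python. This is only a sketch of the logic, under stated assumptions: the record layout (the search_day and prices keys) and the sample records are made up for illustration, and cheapest_latest_price is a hypothetical helper. In Spark, the same shape would be one grouped aggregation keyed on (fs_date, ss_date), for example with groupBy and F.min, rather than one filter, count and collect per date pair.

```python
from datetime import date


def cheapest_latest_price(searches):
    """For each (fs_date, ss_date) pair, keep the minimum price among the
    searches made on the latest search day, visiting every record once."""
    best = {}  # (fs_date, ss_date) -> (search_day, min_price)
    for s in searches:
        if not s["prices"]:
            continue  # mirrors the size(response.results) > 0 filter
        key = (s["fs_date"], s["ss_date"])
        day, price = s["search_day"], min(s["prices"])
        current = best.get(key)
        if current is None or day > current[0]:
            best[key] = (day, price)          # a newer search day replaces older ones
        elif day == current[0]:
            best[key] = (day, min(price, current[1]))
    return {key: price for key, (_, price) in best.items()}


# Made-up sample records, one per search.
searches = [
    {"fs_date": "2022-06-01", "ss_date": "2022-06-02", "search_day": date(2022, 5, 24), "prices": [900.0]},
    {"fs_date": "2022-06-01", "ss_date": "2022-06-02", "search_day": date(2022, 5, 25), "prices": [1300.5, 1210.58]},
    {"fs_date": "2022-06-01", "ss_date": "2022-06-03", "search_day": date(2022, 5, 25), "prices": [856.38]},
    {"fs_date": "2022-06-01", "ss_date": "2022-06-03", "search_day": date(2022, 5, 25), "prices": []},
]
result = cheapest_latest_price(searches)
print(result)
```

Date pairs without any search simply do not appear in the result; result.get(pair, 0) would reproduce the 0 placeholder used in the loop version.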
I have around 30 JSON files with nested attributes; a sample is shown below. I would like to drop the "questionnaire" column from each file and then union all the files.
Could you please suggest how to achieve this using Python?
Sample file:
|-- profileEntity: struct (nullable = true)
| |-- consent: string (nullable = true)
| |-- documents: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- documentProperties: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- field: string (nullable = true)
| | | | | |-- value: string (nullable = true)
| | | |-- documentType: string (nullable = true)
| | | |-- files: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- edmFileIndex: long (nullable = true)
| | | | | |-- edmFilename: string (nullable = true)
| |-- questionnaire: struct (nullable = true)
| | |-- deposit3rdParties: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- deposit3rdPartiesOther: string (nullable = true)
The reason is that the "questionnaire" column does not have the same structure in every file, so the UNION fails.
Error:
Py4JJavaError: An error occurred while calling o72.union.
: org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types.
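One way to do this in plain Python, before Spark ever sees the data, is to strip the key from each parsed JSON record. This is a minimal sketch that assumes one JSON object per line; drop_path is a hypothetical helper and the sample record is made up. Inside Spark, Column.dropFields (Spark 3.1+) on the profileEntity struct followed by unionByName would be the closer equivalent, but that is not shown here.

```python
import json


def drop_path(record, path):
    """Remove the nested key addressed by `path` (a list of keys), if present.
    The record is modified in place and returned for convenience."""
    node = record
    for key in path[:-1]:
        node = node.get(key)
        if not isinstance(node, dict):
            return record  # the path does not exist in this record
    node.pop(path[-1], None)
    return record


# Made-up sample line in the shape of the schema above.
line = '{"profileEntity": {"consent": "Y", "questionnaire": {"deposit3rdPartiesOther": "x"}}}'
cleaned = json.dumps(drop_path(json.loads(line), ["profileEntity", "questionnaire"]))
print(cleaned)
```

The cleaned lines could be written back to disk and read with spark.read.json, after which all files share the same schema for the remaining columns.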
I'm quite new to pyspark and I have a dataframe that currently looks like below.
| col1 | col2 |
+---------------------------------+-------------------+
| [(a, 0)], [(b,0)].....[(z,1)] | [0, 0, ... 1] |
| [(b, 0)], [(b,1)].....[(z,0)] | [0, 1, ... 0] |
| [(a, 0)], [(c, 1)].....[(z,0)] | [0, 1, ... 0] |
I extracted the values of col1.QueryNum into col2. When I print the schema, col2 is an array containing the list of numbers from col1.QueryNum.
Ultimately, my goal is to convert the list values in col2 into a struct inside pyspark (refer to the desired schema).
Current Schema
|-- col1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- types: string (nullable = true)
| | |-- QueryNum: integer (nullable = true)
|-- col2: array (nullable = true)
| |-- element: integer (containsNull = true)
Desired Schema
|-- col2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- val1: integer (nullable = true)
| | |-- val2: integer (nullable = true)
.
.
.
| | |-- val80: integer (nullable = true)
I tried using from_json and it's not really working.
If you have a fixed array size, you can create the struct using a list comprehension:
from pyspark.sql import functions as F
df1 = df.withColumn(
    "col2",
    F.array(
        F.struct(*[
            F.col("col1")[i]["QueryNum"].alias(f"val{i+1}") for i in range(2)
        ])
    )
)
df1.show()
#+----------------+--------+
#|col1 |col2 |
#+----------------+--------+
#|[[0, a], [0, b]]|[[0, 0]]|
#|[[0, b], [1, b]]|[[0, 1]]|
#|[[0, a], [1, c]]|[[0, 1]]|
#+----------------+--------+
df1.printSchema()
#root
#|-- col1: array (nullable = true)
#| |-- element: struct (containsNull = true)
#| | |-- QueryNum: long (nullable = true)
#| | |-- types: string (nullable = true)
#|-- col2: array (nullable = false)
#| |-- element: struct (containsNull = false)
#| | |-- val1: long (nullable = true)
#| | |-- val2: long (nullable = true)
Note, however, that there is no need for an array in this case, as it will always contain exactly one struct. Just use a simple struct:
df1 = df.withColumn(
    "col2",
    F.struct(*[
        F.col("col1")[i]["QueryNum"].alias(f"val{i+1}") for i in range(2)
    ])
)
Or if you prefer a map type:
df1 = df.withColumn(
    "col2",
    F.map_from_entries(
        F.expr("transform(col1, (x,i) -> struct('val' || (i+1) as name, x.QueryNum as value))")
    )
)