Flatten a nested array of array & structs in Pyspark - python

I have a schema of this form from a JSON file:
root
|-- fruit_id: string (nullable = true)
|-- fruit_type: array (nullable = true)
| |-- name: string (nullable = true)
| |-- info: struct (nullable = true)
| | |-- fruit_quality: array (nullable = true)
| | | |-- quality: string (nullable = true)
| | |-- likes: string (nullable = true)
| |-- finance: struct (nullable = true)
| | |-- last_year_price: string (nullable = true)
| | |-- current_price: string (nullable = true)
| |-- shops: struct (nullable = true)
| | |-- shop1: string (nullable = true)
| | |-- shop2: string (nullable = true)
|-- season: string (nullable = true)
How can I get it into this form?
root
|-- fruit_id: string (nullable = true)
|-- fruit_type_name: string (nullable = true)
|-- fruit_type_info_fruit_quality_quality: string (nullable = true)
|-- fruit_type_info_likes: string (nullable = true)
|-- fruit_type_finance_last_year_price: string (nullable = true)
|-- fruit_type_finance_current_price: string (nullable = true)
|-- fruit_type_shops_shop1: string (nullable = true)
|-- fruit_type_shops_shop2: string (nullable = true)
|-- season: string (nullable = true)
This is for the case of fruits. How would I flatten it in a similar way if I receive a file with info on vegetables?
I am facing an issue while flattening the array part. I am able to flatten structs inside structs; for that I followed this: link
I also added this piece of code to the code from the above link, to see if this approach would work:
import pyspark.sql.functions as F

# pick out the top-level array columns
array_cols = [c[0] for c in df.dtypes if c[1][:6] == 'array']

# try to expand the struct fields inside each array column
df = df.select(
    [F.col(nc + '.' + c).alias(nc + '_' + c)
     for nc in array_cols
     for c in df.select(nc + '.*').columns]
)
But it's not working.
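A quick check (my own sketch, not from the linked post) shows why: selecting a struct field through an array column just returns an array of that field, so the rows never get flattened.

import pyspark.sql.functions as F

# fruit_type is an array of structs, so fruit_type.name comes back as
# array<string> rather than one flat string column per row
df.select(F.col("fruit_type.name")).printSchema()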
I then checked this link as well: link
But here the issue is that while flattening the JSON file of fruits is possible, if I then send a JSON file of vegetables with a similar schema, I'll have to redefine the code.
Another approach I tried was converting the array to a struct so that I could then flatten the nested structs, but that wasn't helpful.
Lastly, I checked this link as well: link
But this approach threw an error saying flattening was not possible, since I have an array of structs and not an array of arrays.
So how can I solve this?
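One generic way to handle this (a sketch of my own, not taken from any of the linked answers) is to walk the schema and repeatedly explode array columns and expand struct columns until no complex types remain. The helper below assumes parent and child names should be joined with underscores, as in the desired output above:

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StructType

def flatten(df, sep="_"):
    """Explode arrays and expand structs until only flat columns remain."""
    while True:
        complex_fields = [f for f in df.schema.fields
                          if isinstance(f.dataType, (ArrayType, StructType))]
        if not complex_fields:
            return df
        field = complex_fields[0]
        if isinstance(field.dataType, StructType):
            # pull every struct field up one level, prefixed with the parent name
            expanded = [F.col(field.name + "." + c).alias(field.name + sep + c)
                        for c in df.select(field.name + ".*").columns]
            df = df.select("*", *expanded).drop(field.name)
        else:
            # one row per array element; the column keeps its name, so a struct
            # element will be expanded on the next pass of the loop
            df = df.withColumn(field.name, F.explode_outer(field.name))

flat_df = flatten(df)
flat_df.printSchema()

Because it is driven purely by the schema, the same call should work unchanged for a vegetables file with a similar structure; just note that every explode produces one output row per array element, and explode_outer keeps rows whose arrays are null or empty.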

Related

How to merge two dataframes in pyspark with different columns inside struct or array?

Let's say there are two dataframes: a reference dataframe and a target dataframe.
The reference DF provides the reference schema.
Schema for reference DF (r_df)
r_df.printSchema()
root
|-- _id: string (nullable = true)
|-- notificationsSend: struct (nullable = true)
| |-- mail: boolean (nullable = true)
| |-- sms: boolean (nullable = true)
|-- recordingDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- channelName: string (nullable = true)
| | |-- fileLink: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- recorderId: string (nullable = true)
| | |-- resourceId: string (nullable = true)
However, the target dataframe's schema is dynamic in nature.
Schema for target DF (t_df)
t_df.printSchema()
root
|-- _id: string (nullable = true)
|-- notificationsSend: struct (nullable = true)
| |-- sms: string (nullable = true)
|-- recordingDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- channelName: string (nullable = true)
| | |-- fileLink: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- recorderId: string (nullable = true)
| | |-- resourceId: string (nullable = true)
| | |-- createdBy: string (nullable = true)
So we observe multiple changes in the target's schema.
Structs or arrays inside t_df can have more or fewer columns.
Column datatypes can change too, so type casting is required (e.g. the sms column is boolean in r_df but string in t_df).
I was able to add/remove columns of non-struct datatypes. However, structs and arrays are a real pain for me. Since there are 50+ columns, I need an optimised solution that works for all of them.
Any solution, opinion, or workaround would be really helpful.
Expected output
I want to make my t_df's schema exactly the same as my r_df's schema.
The code below is untested but should show how to do it (written from memory, without testing).
There may be a better way to get the fields out of a struct, but I'm not aware of one, so I'm interested to hear others' ideas.
Extract the struct column names and types.
Find the columns that need to be dropped.
Drop the columns.
Rebuild the structs according to r_df.
from pyspark.sql.functions import col, lit, struct

# use a list comprehension to create a list of the struct columns in r_df
structs_in_r_df = [field.name for field in r_df.schema.fields
                   if str(field.dataType).startswith("Struct")]

# get a list of the fields inside each struct
struct_columns = []
for structs in structs_in_r_df:
    struct_columns.append(r_df.select(structs + ".*").columns)
missing_columns = list(set(r_df.columns) - set(t_df.columns))               # columns only in r_df
similar_columns = list(set(r_df.columns).intersection(set(t_df.columns)))   # columns in both

# remove struct columns from both lists so you don't represent them twice.
# you need to repeat the above intersection/missing logic for the structs and then rebuild them,
# but the above gives you the idea of how to get the fields out.
# you can use string interpolation, e.g. col(f"{structs}.{field}"), to get the values out of the fields.
# using list comprehensions and passing the result as varargs to select pulls out
# exactly the values you need, completely dynamically.
result = r_df.unionByName(   # unionByName matches columns by name rather than by position
    t_df.select(*(
        [lit(None).cast(dict(r_df.dtypes)[column]).alias(column) for column in missing_columns] +
        [col(column).cast(dict(r_df.dtypes)[column]).alias(column) for column in similar_columns]
    ))
)
Once you have the union, here's a way to rebuild the struct:
result = result.select(
    col("_id"),
    struct(col("sms").alias("sms")).alias("notificationsSend"),
    struct(*[col(column).alias(column) for column in struct_columns]  # pass varargs to struct() with the columns
           ).alias("recordingDetails")  # reconstitute the struct
)

Union for Nested Spark Data Frames

Suppose we have two data frames df1 and df2 with the following schema:
A
|-- B: struct (nullable = true)
| |-- b1: string (nullable = true)
| |-- b2: string (nullable = true)
| |-- b3: string (nullable = true)
| |-- C: array (nullable = true)
| | |-- D: struct (containsNull = true)
| | | |-- d1: string (nullable = true)
| | | |-- d2: string (nullable = true)
Would df1.union(df2) work for these nested data frames if you wanted to add a new record, or would you have to flatten them first?
This should work; here is a knowledge base article by Databricks:
https://kb.databricks.com/data/append-a-row-to-rdd-or-dataframe.html
and you won't need to flatten your struct fields.
PS: Please ensure your columns are in the same order in both dataframes.
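As a small hedge against the ordering concern, unionByName (available since Spark 2.3) matches columns by name instead of position, assuming both frames share the same column names:

# columns are matched by name, so their order in df1 and df2 no longer matters
combined = df1.unionByName(df2)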

pySpark: How can I get all element names in structType in arrayType column in a dataframe?

I have a dataframe that looks something like this:
|-- name: string (nullable = true)
|-- age: string (nullable = true)
|-- job: string (nullable = true)
|-- hobbies: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- favorite: string (nullable = true)
| | |-- non-favorite: string (nullable = true)
And I'm trying to get this information:
['favorite', 'non-favorite']
However, the closest solution I found was using the explode function with withColumn, but it was based on the assumption that I already know the names of the elements. What I want to do is get the element names knowing only the column name, in this case 'hobbies'.
Is there a good way to get all the element names in any given column?
For a given dataframe with this schema:
df.printSchema()
root
|-- hobbies: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- favorite: string (nullable = false)
| | |-- non-favorite: string (nullable = false)
You can select the field names of the struct as:
struct_fields = df.schema['hobbies'].dataType.elementType.fieldNames()
# output: ['favorite', 'non-favorite']
pyspark.sql.types.StructType.fieldNames should get you what you want.
fieldNames()
Returns all field names in a list.
>>> struct = StructType([StructField("f1", StringType(), True)])
>>> struct.fieldNames()
['f1']
So in your case, something like:
dataframe.schema['hobbies'].dataType.elementType.fieldNames()
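If you prefer working through a query rather than the schema object, a roughly equivalent sketch (assuming the dataframe is called df, as in the question) is:

import pyspark.sql.functions as F

# explode the array and expand the struct; .columns only needs the analyzed
# plan, so no Spark job actually runs
element_names = df.select(F.explode("hobbies").alias("h")).select("h.*").columns
# ['favorite', 'non-favorite']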

How to add an empty array using when and otherwise in pyspark

How can I add an empty array when using df.withColumn with when() and otherwise(***empty_array***)?
The new column type is T.ArrayType(T.StringType()) from a UDF.
I want to avoid ending up with NaN values.
Simply use array(lit(None)):
from pyspark.sql.functions import when, col, array, lit

df.select(when(col('target_bool') == 'true', array(lit(1))).otherwise(array(lit(None)))).show()
Try the below: create a column with a None value and cast it to an array type.
import pyspark.sql.functions as F
import pyspark.sql.types as T

df_b = (
    df_b
    .withColumn("empty_array", F.when(F.col("rn") == F.lit("1"), None))
    .withColumn("empty_array", F.col("empty_array").cast(T.ArrayType(T.StringType())))
)
df_b.printSchema()
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- rn: integer (nullable = true)
|-- case_condition: integer (nullable = true)
|-- empty_array: array (nullable = true)
| |-- element: string (containsNull = true)
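Note that both answers above put a null (or an array containing a null) in the otherwise branch. If you want a genuinely empty array instead, one pattern (a sketch; the element type of a zero-argument array() varies between Spark versions, hence the explicit cast, and the "then" value here is just a placeholder) is:

import pyspark.sql.functions as F
import pyspark.sql.types as T

empty_string_array = F.array().cast(T.ArrayType(T.StringType()))

df_b = df_b.withColumn(
    "empty_array",
    F.when(F.col("rn") == F.lit("1"), F.array(F.lit("x")))  # placeholder "then" value
     .otherwise(empty_string_array)
)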

Convert PySpark dataframe column type to string and replace the square brackets

I need to convert a PySpark df column type from array to string and also remove the square brackets. This is the schema for the dataframe; the columns that need to be processed are CurrencyCode and TicketAmount.
>>> plan_queryDF.printSchema()
root
|-- event_type: string (nullable = true)
|-- publishedDate: string (nullable = true)
|-- plannedCustomerChoiceID: string (nullable = true)
|-- assortedCustomerChoiceID: string (nullable = true)
|-- CurrencyCode: array (nullable = true)
| |-- element: string (containsNull = true)
|-- TicketAmount: array (nullable = true)
| |-- element: string (containsNull = true)
|-- currentPlan: boolean (nullable = true)
|-- originalPlan: boolean (nullable = true)
|-- globalId: string (nullable = true)
|-- PlanJsonData: string (nullable = true)
Sample data from the dataframe:
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
| event_type| publishedDate|plannedCustomerChoiceID|assortedCustomerChoiceID|CurrencyCode|TicketAmount|currentPlan|originalPlan| globalId| PlanJsonData|
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|PlannedCustomerCh...|2016-08-23T04:46:...| 087d1ff1-5f3a-496...| 2539cc4a-37e5-4f3...| [GBP]| [0]| false| false|000576015000|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T04:30:...| 0a1af217-d1e8-4ab...| 61bc5fda-0160-484...| [CNY]| [329]| true| false|000189668017|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T05:49:...| 1028b477-f93e-47f...| c6d5b761-94f2-454...| [JPY]| [3400]| true| false|000576058003|{"httpStatus":200...|
How can I do it? Currently I am casting to string and then replacing the square brackets with regexp_replace, but this approach fails when I process a huge amount of data.
Is there any other way I can do it?
This is what I want.
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
| event_type| publishedDate|plannedCustomerChoiceID|assortedCustomerChoiceID|CurrencyCode|TicketAmount|currentPlan|originalPlan| globalId| PlanJsonData|
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|PlannedCustomerCh...|2016-08-23T04:46:...| 087d1ff1-5f3a-496...| 2539cc4a-37e5-4f3...| GBP| 0| false| false|000576015000|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T04:30:...| 0a1af217-d1e8-4ab...| 61bc5fda-0160-484...| CNY| 329| true| false|000189668017|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T05:49:...| 1028b477-f93e-47f...| c6d5b761-94f2-454...| JPY| 3400| true| false|000576058003|{"httpStatus":200...|
You can try getItem(0):
df \
.withColumn("CurrencyCode", df["CurrencyCode"].getItem(0).cast("string")) \
.withColumn("TicketAmount", df["TicketAmount"].getItem(0).cast("string"))
The final cast to string is optional.
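If the arrays could ever hold more than one element, a related sketch (not from the answer above) is concat_ws, which joins all elements into one string and also drops the square brackets:

import pyspark.sql.functions as F

df = (
    df
    .withColumn("CurrencyCode", F.concat_ws(",", "CurrencyCode"))
    .withColumn("TicketAmount", F.concat_ws(",", "TicketAmount"))
)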
