How to add an empty array using when and otherwise in pyspark - python

How can I add an empty array when using df.withColumn with when() and otherwise(***empty_array***)?
The new column's type is T.ArrayType(T.StringType()), produced by a UDF.
I want to avoid ending up with NaN values.

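Simply use array(lit(None)). Note that this produces an array holding a single null element rather than a zero-length array; for a truly empty array of strings, array().cast("array<string>") should work in its place.
from pyspark.sql.functions import array, col, lit, when
df.select(when(col('target_bool') == 'true', array(lit(1))).otherwise(array(lit(None)))).show()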

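Try the below: create a column with a None value and cast it to an array type. Note that the resulting column holds nulls of type array<string>, not zero-length arrays.
from pyspark.sql import functions as F, types as T
df_b = (
    df_b
    .withColumn("empty_array", F.when(F.col("rn") == F.lit("1"), None))
    .withColumn("empty_array", F.col("empty_array").cast(T.ArrayType(T.StringType())))
)
df_b.printSchema()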
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- rn: integer (nullable = true)
|-- case_condition: integer (nullable = true)
|-- empty_array: array (nullable = true)
| |-- element: string (containsNull = true)

Related

Sum up columns that are nested structs to get another column of the same structure in pyspark

I have a large data frame with several columns which have similar names and are nested structs. I want to sum all of them up and get a final column with the same structure.
root
|-- employee1: struct (nullable = true)
| |-- first_week_salary: Decimal (nullable = true)
| |-- second_week_salary: Decimal (nullable = true)
| |-- third_week_salary: struct (nullable = true)
| | |-- day1_salary: Decimal (nullable = true)
| | |-- day2_salary: Decimal (nullable = true)
|-- employee2: struct (nullable = true)
| |-- first_week_salary: Decimal (nullable = true)
| |-- second_week_salary: Decimal (nullable = true)
| |-- third_week_salary: struct (nullable = true)
| | |-- day1_salary: Decimal (nullable = true)
| | |-- day2_salary: Decimal (nullable = true)
There are several more employee columns that start with the word 'employee' and have the same structure.
I want to add up all the employee columns to get a total column that has the same structure as the employee columns, where each value is the sum of the corresponding values across all employees.
So far I have been able to add one level by doing:
field_list = ['first_week_salary', 'second_week_salary', 'third_week_salary']
df = df.withColumn('total', f.struct(*[(df.employee1[field] + df.employee2[field]).alias(field) for field in field_list]))
However, this approach can only sum columns one level deep, and since we have many more employee columns, adding them this way seems very bulky.
Is there a way to do this?
Any help is appreciated.
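One possible direction, sketched under assumptions: build a Spark SQL expression string recursively and hand it to f.expr. The helper name sum_struct_expr and the layout of the spec dict (a field name mapped to None for a numeric leaf, or to a nested dict for a nested struct) are hypothetical, and the sketch has not been run against a cluster. It relies on Spark SQL's named_struct function and dotted field access.

```python
def sum_struct_expr(columns, spec, path=""):
    """Build a Spark SQL expression that sums the same nested fields across columns.

    columns: names of the struct columns to add up (e.g. every 'employee*' column).
    spec:    hypothetical layout description; maps a field name to None for a
             numeric leaf, or to a nested dict for a nested struct.
    """
    parts = []
    for name, sub in spec.items():
        field_path = f"{path}.{name}" if path else name
        if sub is None:
            # Leaf: employee1.<path> + employee2.<path> + ...
            value = " + ".join(f"{c}.{field_path}" for c in columns)
        else:
            # Nested struct: recurse and rebuild the same structure
            value = sum_struct_expr(columns, sub, field_path)
        parts.append(f"'{name}', {value}")
    return "named_struct(" + ", ".join(parts) + ")"


# Layout taken from the schema in the question
spec = {
    "first_week_salary": None,
    "second_week_salary": None,
    "third_week_salary": {"day1_salary": None, "day2_salary": None},
}
total_expr = sum_struct_expr(["employee1", "employee2"], spec)
print(total_expr)
```

With a dataframe at hand, this would presumably be used as df.withColumn('total', f.expr(total_expr)), with the column list taken from [c for c in df.columns if c.startswith('employee')]. Keep in mind that + yields null as soon as one operand is null, so the operands may need wrapping in coalesce.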

How to merge two dataframes in pyspark with different columns inside struct or array?

Let's say there are two dataframes: a reference dataframe and a target dataframe.
The reference DF serves as the reference schema.
Schema for reference DF (r_df)
r_df.printSchema()
root
|-- _id: string (nullable = true)
|-- notificationsSend: struct (nullable = true)
| |-- mail: boolean (nullable = true)
| |-- sms: boolean (nullable = true)
|-- recordingDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- channelName: string (nullable = true)
| | |-- fileLink: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- recorderId: string (nullable = true)
| | |-- resourceId: string (nullable = true)
However, target data-frame schema is dynamic in nature.
Schema for target DF (t_df)
t_df.printSchema()
root
|-- _id: string (nullable = true)
|-- notificationsSend: struct (nullable = true)
| |-- sms: string (nullable = true)
|-- recordingDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- channelName: string (nullable = true)
| | |-- fileLink: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- recorderId: string (nullable = true)
| | |-- resourceId: string (nullable = true)
| | |-- createdBy: string (nullable = true)
So we observe multiple changes in the target's schema.
Structs or arrays in t_df can have more or fewer fields.
The datatype of a column can change too, so type casting is required (e.g. the sms column is boolean in r_df but string in t_df).
I was able to add/remove columns of non-struct datatypes. However, structs and arrays are a real pain for me. Since there are 50+ columns, I need an optimised solution that works for all of them.
Any solution, opinion, or workaround will be really helpful.
Expected output
I want to make t_df's schema exactly the same as r_df's schema.
The code below is untested (written from memory), but it should show how to do it.
There may be a direct way to get the fields of a struct that I'm not aware of, so I'm interested to hear other ideas.
Extract the struct column names and types.
Find the columns that need to be dropped.
Drop the columns.
Rebuild the structs according to r_df.
from pyspark.sql.functions import col, lit
from pyspark.sql.types import StructType

# use a list comprehension to collect the names of the struct columns in r_df
structs_in_r_df = [field.name for field in r_df.schema.fields if isinstance(field.dataType, StructType)]
struct_columns = []
for structs in structs_in_r_df:  # get a list of the fields in each struct
    struct_columns.append(r_df.select(structs + ".*").columns)
missing_columns = set(r_df.columns) - set(t_df.columns)  # columns missing from t_df
similar_columns = set(r_df.columns) & set(t_df.columns)  # columns present in both
# Remove the struct columns from both sets so you don't represent them twice.
# You need to repeat the above intersection/missing logic for the structs and then rebuild them, but the above gives you the idea of how to get the fields out.
# You can use col(struct_name + "." + field_name) to get the values out of the fields.
r_types = dict(r_df.dtypes)
result = r_df.union(
    t_df.select(*[
        # iterate in r_df's column order, because union matches columns by position
        (col(column) if column in similar_columns else lit(None)).cast(r_types[column]).alias(column)
        for column in r_df.columns
    ])  # building the list dynamically and passing it as varargs to select pulls out exactly the values you need
)
Once you have the union, here's a way to put the structs back together:
from pyspark.sql.functions import col, struct

result = result.select(
    col("_id"),
    struct(col("sms").alias("sms")).alias("notificationsSend"),
    # pass varargs to struct; struct_columns is assumed here to be the flat list of field names belonging to this struct
    struct(*[col(column).alias(column) for column in struct_columns]).alias("recordingDetails"),  # reconstitute the struct
)
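As an alternative to handling each struct by hand, the reshaping can be derived from the two schemas. The following is only a sketch under assumptions: it works on the dict form of a schema (the shape that r_df.schema.jsonValue() returns, trimmed here to the name and type keys), it emits Spark SQL expression strings rather than touching a dataframe, and the helper names type_name, conform and conform_columns are made up for illustration. It relies on the Spark SQL functions named_struct, transform (Spark 2.4+) and cast, does not quote field names that would need backticks, and has not been run against a cluster.

```python
def type_name(t):
    """Render a schema-JSON type (a str, or a dict for array/struct) as a SQL type."""
    if isinstance(t, str):
        return t
    if t["type"] == "array":
        return f"array<{type_name(t['elementType'])}>"
    if t["type"] == "struct":
        inner = ",".join(f"{f['name']}:{type_name(f['type'])}" for f in t["fields"])
        return f"struct<{inner}>"
    raise ValueError(f"unsupported type: {t!r}")


def is_kind(t, kind):
    return isinstance(t, dict) and t["type"] == kind


def conform(expr, ref, tgt, depth=0):
    """SQL expression that reshapes `expr` (of type tgt) into the reference type ref."""
    if is_kind(ref, "struct"):
        if not is_kind(tgt, "struct"):
            return f"cast(null as {type_name(ref)})"
        tgt_fields = {f["name"]: f["type"] for f in tgt["fields"]}
        parts = []
        for f in ref["fields"]:  # only reference fields survive; extras are dropped
            if f["name"] in tgt_fields:
                value = conform(f"{expr}.{f['name']}", f["type"], tgt_fields[f["name"]], depth)
            else:
                value = f"cast(null as {type_name(f['type'])})"
            parts.append(f"'{f['name']}', {value}")
        return "named_struct(" + ", ".join(parts) + ")"
    if is_kind(ref, "array"):
        if not is_kind(tgt, "array"):
            return f"cast(null as {type_name(ref)})"
        var = f"x{depth}"  # distinct lambda variable per nesting level
        inner = conform(var, ref["elementType"], tgt["elementType"], depth + 1)
        return f"transform({expr}, {var} -> {inner})"
    return f"cast({expr} as {type_name(ref)})"


def conform_columns(ref_schema, tgt_schema):
    """One 'expr as name' string per reference column, in reference order."""
    tgt_fields = {f["name"]: f["type"] for f in tgt_schema["fields"]}
    exprs = []
    for f in ref_schema["fields"]:
        if f["name"] in tgt_fields:
            value = conform(f["name"], f["type"], tgt_fields[f["name"]])
        else:
            value = f"cast(null as {type_name(f['type'])})"
        exprs.append(f"{value} as {f['name']}")
    return exprs


# Trimmed-down versions of the schemas from the question
ref_schema = {"type": "struct", "fields": [
    {"name": "_id", "type": "string"},
    {"name": "notificationsSend", "type": {"type": "struct", "fields": [
        {"name": "mail", "type": "boolean"},
        {"name": "sms", "type": "boolean"},
    ]}},
    {"name": "recordingDetails", "type": {"type": "array", "elementType": {
        "type": "struct", "fields": [
            {"name": "channelName", "type": "string"},
            {"name": "id", "type": "string"},
        ]}}},
]}
tgt_schema = {"type": "struct", "fields": [
    {"name": "_id", "type": "string"},
    {"name": "notificationsSend", "type": {"type": "struct", "fields": [
        {"name": "sms", "type": "string"},
    ]}},
    {"name": "recordingDetails", "type": {"type": "array", "elementType": {
        "type": "struct", "fields": [
            {"name": "channelName", "type": "string"},
            {"name": "id", "type": "string"},
            {"name": "createdBy", "type": "string"},
        ]}}},
]}

exprs = conform_columns(ref_schema, tgt_schema)
for e in exprs:
    print(e)
```

The resulting strings would then be passed along the lines of t_df.selectExpr(*exprs). One caveat of this design: named_struct turns a null struct into a struct of null fields, so a null check would have to be added if that distinction matters.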

Flatten a nested array of array & structs in Pyspark

I have a schema of this form from a json file:
root
|-- fruit_id: string (nullable = true)
|-- fruit_type: array (nullable = true)
| |-- name: string (nullable = true)
| |-- info: struct (nullable = true)
| |-- fruit_quality: array (nullable = true)
| | |-- quality: string (nullable = true)
| |-- likes: string (containsNull = true)
| |-- finance: struct (nullable = true)
| | |-- last_year_price: string (nullable = true)
| | |-- current_price: string (nullable = true)
| |-- shops: struct (nullable = true)
| | |-- shop1: string (nullable = true)
| | |-- shop2: string (nullable = true)
|-- season: string (nullable = true)
How can I get it into this form?
root
|-- fruit_id: string (nullable = true)
|-- fruit_type_name: string (nullable = true)
|-- fruit_type_info_fruit_quality_quality: string (nullable = true)
|-- fruit_type_info_likes: string (nullable = true)
|-- fruit_type_finance_last_year_price: string (nullable = true)
|-- fruit_type_finance_current_price: string (nullable = true)
|-- fruit_type_shops_shop1: string (nullable = true)
|-- fruit_type_shops_shop2: string (nullable = true)
|-- season: string (nullable = true)
This is for the case of fruits. How would I flatten it in a similar way if I receive a file with info on vegetables?
I am facing an issue while flattening the array part. I am able to flatten structs inside structs; I followed this: link
I also added this piece of code to the code from the above link, to see if this approach would work:
import pyspark.sql.functions as F
array_cols = [c[0] for c in df.dtypes if c[1][:5] == 'array']
df = df.select(
    [F.col(nc + '.' + c).alias(nc + '_' + c)
     for nc in array_cols
     for c in df.select(nc + '.*').columns])
But it's not working.
I then checked this link as well: link
But the issue here is that, while it is possible to flatten the JSON file of fruits, I'll have to redefine the code if I send a JSON file of vegetables with a similar schema.
Another approach I tried was converting the array to a struct so that I could then flatten the nested structs, but that wasn't helpful.
Lastly, I checked this link as well: link
But this approach threw an error saying that flattening is not possible, since I have an array of structs and not an array of arrays.
So how can I solve this?
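The naming half of the problem does not depend on whether the file describes fruits or vegetables: walking the schema generically yields the target column names. Below is a minimal sketch of that walk, assuming the schema is available in dict form (the shape that df.schema.jsonValue() returns, trimmed here to name and type). The helper names struct, array and flat_names are made up for illustration, and the example schema is a guess reconstructed from the desired output above. The sketch only computes names; in Spark each array level would still need something like explode_outer before its struct fields can be selected.

```python
def struct(*fields):
    """Hypothetical helper: build a schema-JSON struct from (name, type) pairs."""
    return {"type": "struct", "fields": [{"name": n, "type": t} for n, t in fields]}


def array(element_type):
    """Hypothetical helper: build a schema-JSON array type."""
    return {"type": "array", "elementType": element_type}


def flat_names(schema, prefix=""):
    """Return the underscore-joined leaf names of a nested schema, in order."""
    names = []
    for field in schema["fields"]:
        name = f"{prefix}_{field['name']}" if prefix else field["name"]
        ftype = field["type"]
        # Look through any number of array levels to the element type
        while isinstance(ftype, dict) and ftype["type"] == "array":
            ftype = ftype["elementType"]
        if isinstance(ftype, dict) and ftype["type"] == "struct":
            names.extend(flat_names(ftype, name))  # descend into the struct
        else:
            names.append(name)                     # leaf column
    return names


# Assumed shape of the fruit schema, inferred from the desired output
fruit_schema = struct(
    ("fruit_id", "string"),
    ("fruit_type", array(struct(
        ("name", "string"),
        ("info", struct(
            ("fruit_quality", array(struct(("quality", "string")))),
            ("likes", "string"),
        )),
        ("finance", struct(("last_year_price", "string"), ("current_price", "string"))),
        ("shops", struct(("shop1", "string"), ("shop2", "string"))),
    ))),
    ("season", "string"),
)

names = flat_names(fruit_schema)
print(names)
```

Because nothing in flat_names mentions a field by name, the same function would apply unchanged to a vegetables file with a similar layout.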

pySpark: How can I get all element names in structType in arrayType column in a dataframe?

I have a dataframe that looks something like this:
|-- name: string (nullable = true)
|-- age: string (nullable = true)
|-- job: string (nullable = true)
|-- hobbies: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- favorite: string (nullable = true)
| | |-- non-favorite: string (nullable = true)
And I'm trying to get this information:
['favorite', 'non-favorite']
However, the closest solution I found used the explode function with withColumn, and it assumed that I already know the names of the elements. What I want is to get the element names without knowing them in advance, using only the column name, in this case 'hobbies'.
Is there a good way to get all the element names in any given column?
For a given dataframe with this schema:
df.printSchema()
root
|-- hobbies: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- favorite: string (nullable = false)
| | |-- non-favorite: string (nullable = false)
You can select the field names of the struct as:
struct_fields = df.schema['hobbies'].dataType.elementType.fieldNames()
# output: ['favorite', 'non-favorite']
pyspark.sql.types.StructType.fieldNames should get you what you want.
fieldNames()
Returns all field names in a list.
>>> struct = StructType([StructField("f1", StringType(), True)])
>>> struct.fieldNames()
['f1']
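So in your case, go through the schema rather than the column (a Column object has no fieldNames method), something like
dataframe.schema['hobbies'].dataType.elementType.fieldNames()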

Convert PySpark dataframe column type to string and replace the square brackets

I need to convert a PySpark df column type from array to string and also remove the square brackets. This is the schema for the dataframe. The columns that need to be processed are CurrencyCode and TicketAmount.
>>> plan_queryDF.printSchema()
root
|-- event_type: string (nullable = true)
|-- publishedDate: string (nullable = true)
|-- plannedCustomerChoiceID: string (nullable = true)
|-- assortedCustomerChoiceID: string (nullable = true)
|-- CurrencyCode: array (nullable = true)
| |-- element: string (containsNull = true)
|-- TicketAmount: array (nullable = true)
| |-- element: string (containsNull = true)
|-- currentPlan: boolean (nullable = true)
|-- originalPlan: boolean (nullable = true)
|-- globalId: string (nullable = true)
|-- PlanJsonData: string (nullable = true)
Sample data from the dataframe:
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
| event_type| publishedDate|plannedCustomerChoiceID|assortedCustomerChoiceID|CurrencyCode|TicketAmount|currentPlan|originalPlan| globalId| PlanJsonData|
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|PlannedCustomerCh...|2016-08-23T04:46:...| 087d1ff1-5f3a-496...| 2539cc4a-37e5-4f3...| [GBP]| [0]| false| false|000576015000|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T04:30:...| 0a1af217-d1e8-4ab...| 61bc5fda-0160-484...| [CNY]| [329]| true| false|000189668017|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T05:49:...| 1028b477-f93e-47f...| c6d5b761-94f2-454...| [JPY]| [3400]| true| false|000576058003|{"httpStatus":200...|
How can I do it? Currently I am casting to string and then replacing the square brackets with regexp_replace, but this approach fails when I process a huge amount of data.
Is there any other way I can do it?
This is what I want:
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
| event_type| publishedDate|plannedCustomerChoiceID|assortedCustomerChoiceID|CurrencyCode|TicketAmount|currentPlan|originalPlan| globalId| PlanJsonData|
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|PlannedCustomerCh...|2016-08-23T04:46:...| 087d1ff1-5f3a-496...| 2539cc4a-37e5-4f3...| GBP| 0| false| false|000576015000|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T04:30:...| 0a1af217-d1e8-4ab...| 61bc5fda-0160-484...| CNY| 329| true| false|000189668017|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T05:49:...| 1028b477-f93e-47f...| c6d5b761-94f2-454...| JPY| 3400| true| false|000576058003|{"httpStatus":200...|
You can try getItem(0):
df \
.withColumn("CurrencyCode", df["CurrencyCode"].getItem(0).cast("string")) \
.withColumn("TicketAmount", df["TicketAmount"].getItem(0).cast("string"))
The final cast to string is optional.
