Pyspark: cast array element with nested struct - python

I have a PySpark dataframe with a column named received, with the schema below.
How can I access the "size" element, which is a string, and convert it to a float using PySpark?
root
|-- title: string (nullable = true)
|-- received: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- date: string (nullable = true)
| | |-- size: string (nullable = true)
|-- urls: struct (nullable = true)
| |-- body: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- scheme: string (nullable = true)
| | | |-- url: string (nullable = true)
|-- ...
|-- ...
I tried the following, but without success:
df.withColumn("received", SF.col("received").withField("size", SF.col("received.size").cast("float")))
Could someone guide me on how to do this?

I managed to solve it like this:
df = df.withColumn(
    "received",
    SF.expr("""transform(
        received,
        x -> struct(x.id, x.date, float(x.size) as size))""")
)
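In hindsight, on Spark 3.1+ there is a tidier option that avoids re-listing every struct field: combine transform with Column.withField so that only size is replaced. A minimal sketch, assuming the schema above:

from pyspark.sql import functions as SF

# transform() maps over each array element; withField() replaces only
# the "size" field, leaving id, date and any other fields untouched.
df = df.withColumn(
    "received",
    SF.transform("received", lambda x: x.withField("size", x["size"].cast("float")))
)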

Related

How to convert array of struct of struct into string in pyspark

root
|-- id: long (nullable = true)
|-- person: struct (nullable = true)
| |-- resource: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- alias: string (nullable = true)
| |-- id: string (nullable = true)
|-- school: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- teacher: struct (nullable = true)
| | | |-- sys_id: string (nullable = true)
| | | |-- ip: string (nullable = true)
| | |-- Partition: string (nullable = true)
to
root
|-- id: long (nullable = true)
|-- person: struct (nullable = true)
| |-- resource: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- alias: string (nullable = true)
| |-- id: string (nullable = true)
|-- school: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- teacher: string (nullable = true)
| | |-- Partition: string (nullable = true)
I want to convert teacher into a string in PySpark.
I tried using functions.transform and then withField on the teacher struct, but I always get the error below:
AnalysisException: cannot resolve 'update_fields(school, WithField(concat_ws(',', 'teacher.*')))'
due to data type mismatch: struct argument should be struct type, got:
array<struct<teacher:struct<sys_id:string,ip:string>,Partition:string>>;
df1 = df1.withColumn("school",
functions.transform(functions.col("school").withField("teacher", functions.expr("concat_ws(',', 'teacher.*')")),lambda x: x.cast("string")))
df1 = df1.withColumn("school", functions.transform(functions.col("school"),
lambda x: x.withField("teacher",x['teacher'].cast('string'))))
worked for me
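Note that casting a struct to string gives Spark's default rendering, e.g. {some_sys_id, some_ip}. If a plain comma-separated value is wanted instead, a sketch (field names taken from the schema above) is to build the string explicitly inside the transform:

from pyspark.sql import functions as F

# Replace the teacher struct with "sys_id,ip" instead of the default
# "{sys_id, ip}" rendering produced by cast("string").
df1 = df1.withColumn(
    "school",
    F.transform(
        "school",
        lambda x: x.withField(
            "teacher",
            F.concat_ws(",", x["teacher"]["sys_id"], x["teacher"]["ip"]),
        ),
    ),
)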

How to iterate through an array struct and return the element I want in pyspark

Here is my example json file:
{"data":"example1","data2":"example2","register":[{"name":"John","last_name":"Travolta","age":68},{"name":"Nicolas","last_name":"Cage","age":58}], "data3":"example3","data4":"example4"}
And I have a data schema similar to this (totally illustrative):
root
|-- register: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- age: long (nullable = true)
| | |-- last_name: string (nullable = true)
| | |-- name: string (nullable = true)
|-- data: string (nullable = true)
|-- data2: string (nullable = true)
|-- data3: string (nullable = true)
|-- data4: string (nullable = true)
What I want is to iterate over register, check whether the name field equals e.g. John Travolta, and create a new struct new_register (for example) with all the fields at the same index as that name.
I tried some of Spark's own functions, like filter, when, and contains, but none of them gave me the desired result.
I also tried to implement a UDF, but I couldn't find a way to apply the function to the field I want.
How do I solve this problem?
First explode the array field, access the struct fields with dot notation, and filter on the required values. Here is the code:
from pyspark.sql.functions import col, explode

df.printSchema()
df.show(10, False)
df1 = df.withColumn("new_struct", explode("register")) \
    .filter((col("new_struct.last_name") == 'Travolta') & (col("new_struct.name") == 'John'))
df1.show(10, False)
df1.printSchema()
root
|-- data: string (nullable = true)
|-- data2: string (nullable = true)
|-- data3: string (nullable = true)
|-- data4: string (nullable = true)
|-- register: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- age: long (nullable = true)
| | |-- last_name: string (nullable = true)
| | |-- name: string (nullable = true)
+--------+--------+--------+--------+-------------------------------------------+
|data |data2 |data3 |data4 |register |
+--------+--------+--------+--------+-------------------------------------------+
|example1|example2|example3|example4|[{68, Travolta, John}, {58, Cage, Nicolas}]|
+--------+--------+--------+--------+-------------------------------------------+
+--------+--------+--------+--------+-------------------------------------------+--------------------+
|data |data2 |data3 |data4 |register |new_struct |
+--------+--------+--------+--------+-------------------------------------------+--------------------+
|example1|example2|example3|example4|[{68, Travolta, John}, {58, Cage, Nicolas}]|{68, Travolta, John}|
+--------+--------+--------+--------+-------------------------------------------+--------------------+
root
|-- data: string (nullable = true)
|-- data2: string (nullable = true)
|-- data3: string (nullable = true)
|-- data4: string (nullable = true)
|-- register: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- age: long (nullable = true)
| | |-- last_name: string (nullable = true)
| | |-- name: string (nullable = true)
|-- new_struct: struct (nullable = true)
| |-- age: long (nullable = true)
| |-- last_name: string (nullable = true)
| |-- name: string (nullable = true)
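If duplicating rows with explode is undesirable, a sketch using the higher-order filter function (pyspark.sql.functions.filter, Spark 3.1+) keeps the work inside the array; element_at then picks the first match. The new_register name is just illustrative:

from pyspark.sql import functions as F

# filter() selects matching elements without changing the row count;
# element_at(..., 1) takes the first match (null if nothing matched).
df2 = df.withColumn(
    "new_register",
    F.element_at(
        F.filter(
            "register",
            lambda x: (x["name"] == "John") & (x["last_name"] == "Travolta"),
        ),
        1,
    ),
)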

how to select column from array of struct?

root
|-- InvoiceNo: string (nullable = true)
|-- Quantity: integer (nullable = true)
|-- CustomerID: integer (nullable = true)
|-- InvoiceDate: string (nullable = true)
|-- collect_list(items): array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- StockCode: string (nullable = true)
| | |-- Description: string (nullable = true)
| | |-- UnitPrice: double (nullable = true)
| | |-- Country: string (nullable = true)
Here is my schema. I am trying to create a new column TotalPrice:
.withColumn('TotalPrice', col('Quantity') * col('UnitPrice'))\
but I cannot take UnitPrice out of the array of structs. How do I do that?
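No answer is attached here, but a hedged sketch: since UnitPrice lives inside each array element, transform can compute a per-item price (the backticks are needed because the column name contains parentheses); TotalPrice then comes out as an array of doubles, one per element:

from pyspark.sql import functions as F

# Multiply the row-level Quantity by each element's UnitPrice.
# TotalPrice becomes array<double>, one value per item.
items = F.col("`collect_list(items)`")
df = df.withColumn(
    "TotalPrice",
    F.transform(items, lambda x: F.col("Quantity") * x["UnitPrice"]),
)

Alternatively, explode the array first and multiply on the exploded rows.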

How to get the schema of a column in a dataframe (not the whole schema)?

I have a dataframe after a flatten operation, and I want to get back to the original dataframe.
For example:
Df:
|-- delivery: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- load_delivery_intervals: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- from_time: string (nullable = true)
| | | | |-- to_time: string (nullable = true)
| | |-- delivery_start_date_time: string (nullable = true)
| | |-- delivery_end_date_time: string (nullable = true)
| | |-- duration: string (nullable = true)
| | |-- week_days: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- delivery_capacity_quantity: string (nullable = true)
| | |-- quantity_unit: string (nullable = true)
After flattening, I have a dataframe like:
flat_df_new:
delivery_from_time: string (nullable = true)
delivery_to_time: string (nullable = true)
delivery_delivery_start_date_time: string (nullable = true)
delivery_delivery_end_date_time: string (nullable = true)
delivery_duration: string (nullable = true)
delivery_delivery_capacity_quantity: string (nullable = true)
delivery_quantity_unit: string (nullable = true)
flat_df_new is the flattened dataframe (all struct types exploded), after some operations on it.
parentList is the list of array-of-struct columns that were exploded from the original df.
for parent in parentList:
    df_temp = df.select(parent).schema  # <-- get the struct type schema
    flat_df_new = flat_df_new.withColumn(parent, ...)  # <-- here I want to add a column named after the parent variable, with the schema of df_temp, filled from the columns of flat_df_new
Thanks
Regards
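No answer is given here, but a rough sketch, assuming each flattened column is named "<parent>_<field>" (as in the example above) and ignoring the doubly nested load_delivery_intervals level: rebuild each parent by collecting its prefixed columns, stripping the prefix, and wrapping the result in array(struct(...)):

from pyspark.sql import functions as F

for parent in parentList:
    # Columns that were flattened out of this parent, e.g. "delivery_duration".
    fields = [c for c in flat_df_new.columns if c.startswith(parent + "_")]
    # Strip the "<parent>_" prefix to recover the original field names.
    struct_col = F.struct(*[F.col(c).alias(c[len(parent) + 1:]) for c in fields])
    # Wrap in array() to restore the original array<struct<...>> shape.
    flat_df_new = flat_df_new.withColumn(parent, F.array(struct_col))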

Convert a PySpark dataframe column from list/array entries to double (nested) list/array entries

I would like to convert a PySpark dataframe with the following schema:
root
|-- top: long (nullable = true)
|-- inner: struct (nullable = true)
| |-- inner1: long (nullable = true)
| |-- inner2: long (nullable = true)
| |-- inner3: date (nullable = true)
| |-- inner4: date (nullable = true)
to:
root
|-- top: long (nullable = true)
|-- inner: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- inner1: long (nullable = true)
| | |-- inner2: long (nullable = true)
| | |-- inner3: date (nullable = true)
| | |-- inner4: date (nullable = true)
This is basically changing
top | [ inner1, inner2, inner3, inner4]
to
top | [[inner1, inner2, inner3, inner4]]
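No answer is attached here, but a minimal sketch: wrapping the struct column in array() produces a one-element array of structs, which matches the target schema:

from pyspark.sql import functions as F

# array() over a single struct column yields array<struct<...>> of length 1.
df = df.withColumn("inner", F.array(F.col("inner")))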
