How to get the schema of a single column in a DataFrame (not the whole schema)? - python

I have a dataframe after a flatten operation.
I want to return to the original dataframe.
For example:
Df:
|-- delivery: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- load_delivery_intervals: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- from_time: string (nullable = true)
| | | | |-- to_time: string (nullable = true)
| | |-- delivery_start_date_time: string (nullable = true)
| | |-- delivery_end_date_time: string (nullable = true)
| | |-- duration: string (nullable = true)
| | |-- week_days: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- delivery_capacity_quantity: string (nullable = true)
| | |-- quantity_unit: string (nullable = true)
I have a flattened dataframe like:
flat_df_new:
delivery_from_time: string (nullable = true)
delivery_to_time: string (nullable = true)
delivery_delivery_start_date_time: string (nullable = true)
delivery_delivery_end_date_time: string (nullable = true)
delivery_duration: string (nullable = true)
delivery_delivery_capacity_quantity: string (nullable = true)
delivery_quantity_unit: string (nullable = true)
flat_df_new is the flattened dataframe (all struct types exploded) with some operations applied to it.
parentList is the list of the array-of-struct columns that were exploded in the original df.
for parent in parentList:
    df_temp = df.select(parent).schema   # get the struct type schema of this one column
    flat_df_new = flat_df_new.withColumn(parent, ....)  # here I want to add a column named after the parent variable, with the schema from df_temp and the values taken from the corresponding columns of flat_df_new
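For the schema lookup itself, one way to get just the type of a single column instead of the whole schema (a minimal sketch, assuming the original df above):

# Sketch: df.schema can be indexed by column name, giving the StructField and,
# via dataType, the ArrayType/StructType of that one column.
delivery_field = df.schema["delivery"]                 # StructField
delivery_type = delivery_field.dataType                # ArrayType(StructType([...]))

# Equivalent, via a one-column projection as in the loop above:
delivery_type = df.select("delivery").schema["delivery"].dataType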
Thanks
Regards

Related

Pyspark: cast array element with nested struct

I have a pyspark dataframe with a column named received.
How do I access the "size" element, which is stored as a string, and convert it into a float using pyspark?
root
|-- title: string (nullable = true)
|-- received: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- date: string (nullable = true)
| | |-- size: string (nullable = true)
|-- urls: struct (nullable = true)
| |-- body: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- scheme: string (nullable = true)
| | | |-- url: string (nullable = true)
|-- ...
|-- ...
I tried the following, but without success:
df.withColumn("received", SF.col("received").withField("delay", SF.col("received.delay").cast("float")))
Could someone guide me on how to do this?
I managed to solve it like this:
df = df.withColumn(
    "received",
    SF.expr("""transform(
        received,
        x -> struct(x.col1, x.col2, x.col3, x.col4, float(x.delay) as delay, x.col6))""")
)
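A more concise variant that avoids listing every field by hand, assuming Spark 3.1+ (where transform accepts a Python lambda and Column.withField exists), might look like this; the field name size is taken from the schema above:

from pyspark.sql import functions as SF

# Sketch: rewrite each struct in the array, replacing only the string field
# with its float cast and leaving the other fields untouched.
df = df.withColumn(
    "received",
    SF.transform("received", lambda x: x.withField("size", x["size"].cast("float")))
)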

How to convert array of struct of struct into string in pyspark

root
|-- id: long (nullable = true)
|-- person: struct (nullable = true)
| |-- resource: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- alias: string (nullable = true)
| |-- id: string (nullable = true)
|-- school: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- teacher: struct (nullable = true)
| | | |-- sys_id: string (nullable = true)
| | | |-- ip: string (nullable = true)
| | |-- Partition: string (nullable = true)
to
root
|-- id: long (nullable = true)
|-- person: struct (nullable = true)
| |-- resource: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- alias: string (nullable = true)
| |-- id: string (nullable = true)
|-- school: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- teacher: string (nullable = true)
| | |-- Partition: string (nullable = true)
I want to convert teacher into a string in pyspark.
I tried using functions.transform and then a withField on the teacher struct, but I always get the error below:
AnalysisException: cannot resolve 'update_fields(school, WithField(concat_ws(',', 'teacher.*')))'
due to data type mismatch: struct argument should be struct type, got:
array<struct<teacher:struct<sys_id:string,ip:string>,Partition:string>>;
df1 = df1.withColumn("school",
    functions.transform(
        functions.col("school").withField("teacher", functions.expr("concat_ws(',', 'teacher.*')")),
        lambda x: x.cast("string")))
df1 = df1.withColumn("school", functions.transform(functions.col("school"),
lambda x: x.withField("teacher",x['teacher'].cast('string'))))
worked for me
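If the goal is the comma-joined string that the original concat_ws attempt was aiming for (rather than the struct's default string form), a sketch along the same lines, assuming the schema above, could be:

from pyspark.sql import functions as F

# Sketch: join the teacher struct's fields with a comma inside each array element.
df1 = df1.withColumn(
    "school",
    F.transform(
        F.col("school"),
        lambda x: x.withField("teacher", F.concat_ws(",", x["teacher"]["sys_id"], x["teacher"]["ip"]))))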

how to select column from array of struct?

root
|-- InvoiceNo: string (nullable = true)
|-- Quantity: integer (nullable = true)
|-- CustomerID: integer (nullable = true)
|-- InvoiceDate: string (nullable = true)
|-- collect_list(items): array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- StockCode: string (nullable = true)
| | |-- Description: string (nullable = true)
| | |-- UnitPrice: double (nullable = true)
| | |-- Country: string (nullable = true)
Here is my schema. I am trying to create a new column TotalPrice like this:
.withColumn('TotalPrice', col('Quantity') * col('UnitPrice'))\
but I cannot take UnitPrice from the array of structs. How can I do that?
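One possible approach, sketched under the assumption that each row should get one TotalPrice per item, is to rename the aggregated column and explode it so the struct fields become reachable with dot notation:

from pyspark.sql.functions import col, explode

# Sketch: give the aggregated column a plain name, explode the array so each
# struct becomes its own row, then multiply Quantity by that element's UnitPrice.
result = (df
    .withColumnRenamed("collect_list(items)", "items")
    .withColumn("item", explode("items"))
    .withColumn("TotalPrice", col("Quantity") * col("item.UnitPrice")))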

Convert PySpark dataframe column a list/array entries to double list/array entries

I would like to convert a Pyspark dataframe having the following schema structure.
root
|-- top: long (nullable = true)
|-- inner: struct (nullable = true)
| |-- inner1: long (nullable = true)
| |-- inner2: long (nullable = true)
| |-- inner3: date (nullable = true)
| |-- inner4: date (nullable = true)
to:
root
|-- top: long (nullable = true)
|-- inner: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- inner1: long (nullable = true)
| | |-- inner2: long (nullable = true)
| | |-- inner3: date (nullable = true)
| | |-- inner4: date (nullable = true)
This is basically changing
top | [ inner1, inner2, inner3, inner4]
to
top | [[inner1, inner2, inner3, inner4]]
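A minimal sketch of one way to do this, assuming the goal is simply to wrap the existing inner struct into a one-element array:

from pyspark.sql import functions as F

# Sketch: array() around the struct column turns struct<...> into array<struct<...>>.
df = df.withColumn("inner", F.array(F.col("inner")))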

converting all fields in a structtype to array

I have this structtype with over 1000 fields; every field's type is string.
root
|-- mac: string (nullable = true)
|-- kv: struct (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_FEAT_B64: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_FEAT_CODE: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_HELP_B64: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_HELP_CODE: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_SYST_B64: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_SYST_CODE: string (nullable = true)
| |-- FTP_SERVER_HELLO_B64: string (nullable = true)
| |-- FTP_STATUS_HELLO_CODE: string (nullable = true)
| |-- HTML_LOGIN_FORM_ACTION_0: string (nullable = true)
| |-- HTML_LOGIN_FORM_DETECTION_0: string (nullable = true)
| |-- HTML_LOGIN_FORM_INPUT_PASSWORD_NAME_0: string (nullable = true)
| |-- HTML_LOGIN_FORM_INPUT_TEXT_NAME_0: string (nullable = true)
| |-- HTML_LOGIN_FORM_METHOD_0: string (nullable = true)
| |-- HTML_REDIRECT_TYPE_0: string (nullable = true)
I want to select only the fields which are non-null, together with some identifier of which fields those are. Is there any way to convert this struct to an array without explicitly referring to each element?
I'd use a udf:
from pyspark.sql.types import *
from pyspark.sql.functions import udf

# Keep only the non-null values of the struct as an array of strings.
as_array = udf(
    lambda arr: [x for x in arr if x is not None],
    ArrayType(StringType()))

df.withColumn("arr", as_array(df["kv"]))
