how to select column from array of struct? - python

root
|-- InvoiceNo: string (nullable = true)
|-- Quantity: integer (nullable = true)
|-- CustomerID: integer (nullable = true)
|-- InvoiceDate: string (nullable = true)
|-- collect_list(items): array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- StockCode: string (nullable = true)
| | |-- Description: string (nullable = true)
| | |-- UnitPrice: double (nullable = true)
| | |-- Country: string (nullable = true)
Here is my schema. I am trying to create a new column TotalPrice:
.withColumn('TotalPrice', col('Quantity') * col('UnitPrice')) \
like that, but I cannot take UnitPrice out of the array of structs. How do I do that?

Related

How to convert array of struct of struct into string in pyspark

root
|-- id: long (nullable = true)
|-- person: struct (nullable = true)
| |-- resource: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- alias: string (nullable = true)
| |-- id: string (nullable = true)
|-- school: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- teacher: struct (nullable = true)
| | | |-- sys_id: string (nullable = true)
| | | |-- ip: string (nullable = true)
| | |-- Partition: string (nullable = true)
to
root
|-- id: long (nullable = true)
|-- person: struct (nullable = true)
| |-- resource: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- alias: string (nullable = true)
| |-- id: string (nullable = true)
|-- school: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- teacher: string (nullable = true)
| | |-- Partition: string (nullable = true)
I want to convert teacher into a string in PySpark.
I tried using functions.transform and then withField on the teacher struct, but I always get the error below:
AnalysisException: cannot resolve 'update_fields(school, WithField(concat_ws(',', 'teacher.*')))'
due to data type mismatch: struct argument should be struct type, got:
array<struct<teacher:struct<sys_id:string,ip:string>,Partition:string>>;
df1 = df1.withColumn("school",
    functions.transform(
        functions.col("school").withField("teacher", functions.expr("concat_ws(',', 'teacher.*')")),
        lambda x: x.cast("string")))
df1 = df1.withColumn("school", functions.transform(functions.col("school"),
    lambda x: x.withField("teacher", x['teacher'].cast('string'))))
worked for me

How to iterate through an array struct and return the element I want in pyspark

Here is my example json file:
{"data":"example1","data2":"example2","register":[{"name":"John","last_name":"Travolta","age":68},{"name":"Nicolas","last_name":"Cage","age":58}], "data3":"example3","data4":"example4"}
And I have a data schema similar to this (totally illustrative):
root
|-- register: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- age: long (nullable = true)
| | |-- last_name: string (nullable = true)
| | |-- name: string (nullable = true)
|-- data: string (nullable = true)
|-- data2: string (nullable = true)
|-- data3: string (nullable = true)
|-- data4: string (nullable = true)
What I want is to iterate over register, check whether the name field equals e.g. John Travolta, and build a new struct new_register (for example) with all the fields at the same index as that name.
I tried some of Spark's own functions, like filter, when and contains, but none of them gave me the desired result.
I also tried to implement a UDF, but I couldn't find a way to apply the function to the field I want.
How do I solve this?
First explode the array field, access the struct fields with dot notation, and filter for the required values. Here is the code.
from pyspark.sql.functions import col, explode

df.printSchema()
df.show(10, False)
df1 = df.withColumn("new_struct", explode("register")) \
    .filter((col("new_struct.last_name") == 'Travolta') & (col("new_struct.name") == 'John'))
df1.show(10, False)
df1.printSchema()
root
|-- data: string (nullable = true)
|-- data2: string (nullable = true)
|-- data3: string (nullable = true)
|-- data4: string (nullable = true)
|-- register: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- age: long (nullable = true)
| | |-- last_name: string (nullable = true)
| | |-- name: string (nullable = true)
+--------+--------+--------+--------+-------------------------------------------+
|data |data2 |data3 |data4 |register |
+--------+--------+--------+--------+-------------------------------------------+
|example1|example2|example3|example4|[{68, Travolta, John}, {58, Cage, Nicolas}]|
+--------+--------+--------+--------+-------------------------------------------+
+--------+--------+--------+--------+-------------------------------------------+--------------------+
|data |data2 |data3 |data4 |register |new_struct |
+--------+--------+--------+--------+-------------------------------------------+--------------------+
|example1|example2|example3|example4|[{68, Travolta, John}, {58, Cage, Nicolas}]|{68, Travolta, John}|
+--------+--------+--------+--------+-------------------------------------------+--------------------+
root
|-- data: string (nullable = true)
|-- data2: string (nullable = true)
|-- data3: string (nullable = true)
|-- data4: string (nullable = true)
|-- register: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- age: long (nullable = true)
| | |-- last_name: string (nullable = true)
| | |-- name: string (nullable = true)
|-- new_struct: struct (nullable = true)
| |-- age: long (nullable = true)
| |-- last_name: string (nullable = true)
| |-- name: string (nullable = true)

How to change JSON structure on pyspark?

I have two JSON files that are read from Kafka, and this is their printSchema():
JSON1 printSchema:
root
|-- _id: string (nullable = true)
|-- Data: string (nullable = true)
|-- NomeAzienda: string (nullable = true)
|-- Valori_Di_Borsa: struct (nullable = false)
| |-- PrezzoUltimoContratto: double (nullable = true)
| |-- Var%: double (nullable = true)
| |-- VarAssoluta: double (nullable = true)
| |-- OraUltimoContratto: string (nullable = true)
| |-- QuantitaUltimo: double (nullable = true)
| |-- QuantitaAcquisto: double (nullable = true)
| |-- QuantitaVendita: double (nullable = true)
| |-- QuantitaTotale: double (nullable = true)
| |-- NumeroContratti: double (nullable = true)
| |-- MaxOggi: double (nullable = true)
| |-- MinOggi: double (nullable = true)
JSON2 printSchema():
root
|-- _id: string (nullable = true)
|-- News: struct (nullable = false)
| |-- TitoloNews: string (nullable = true)
| |-- TestoNews: string (nullable = true)
| |-- DataNews: string (nullable = true)
| |-- OraNews: long (nullable = true)
| |-- SoggettoNews: string (nullable = true)
Joining the two JSONs, I get this printSchema():
root
|-- _id: string (nullable = true)
|-- Data: string (nullable = true)
|-- NomeAzienda: string (nullable = true)
|-- Valori_Di_Borsa: struct (nullable = false)
| |-- PrezzoUltimoContratto: double (nullable = true)
| |-- Var%: double (nullable = true)
| |-- VarAssoluta: double (nullable = true)
| |-- OraUltimoContratto: string (nullable = true)
| |-- QuantitaUltimo: double (nullable = true)
| |-- QuantitaAcquisto: double (nullable = true)
| |-- QuantitaVendita: double (nullable = true)
| |-- QuantitaTotale: double (nullable = true)
| |-- NumeroContratti: double (nullable = true)
| |-- MaxOggi: double (nullable = true)
| |-- MinOggi: double (nullable = true)
|-- _id: string (nullable = true)
|-- News: struct (nullable = false)
| |-- TitoloNews: string (nullable = true)
| |-- TestoNews: string (nullable = true)
| |-- DataNews: string (nullable = true)
| |-- OraNews: long (nullable = true)
| |-- SoggettoNews: string (nullable = true)
But the result I would like to have is this:
Updated root:
-- _id: string (nullable = true)
-- Data: string (nullable = true)
-- NomeAzienda: string (nullable = true)
-- Valori_Di_Borsa: struct (nullable = false)
|-- PrezzoUltimoContratto: double (nullable = true)
|-- Var%: double (nullable = true)
|-- VarAssoluta: double (nullable = true)
|-- OraUltimoContratto: string (nullable = true)
|-- QuantitaUltimo: double (nullable = true)
|-- QuantitaAcquisto: double (nullable = true)
|-- QuantitaVendita: double (nullable = true)
|-- QuantitaTotale: double (nullable = true)
|-- NumeroContratti: double (nullable = true)
|-- MaxOggi: double (nullable = true)
|-- MinOggi: double (nullable = true)
|-- News: struct (nullable = false)
|-- id: string (nullable = true)
|-- TitoloNews: string (nullable = true)
|-- TestoNews: string (nullable = true)
|-- DataNews: string (nullable = true)
|-- OraNews: long (nullable = true)
|-- SoggettoNews: string (nullable = true)
How can I do it using pyspark?
This is my code:
df_borsa = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", kafka_broker) \
    .option("startingOffsets", "latest") \
    .option("subscribe", "Be_borsa") \
    .load() \
    .selectExpr("CAST(value AS STRING)")
df_news = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", kafka_broker) \
    .option("startingOffsets", "latest") \
    .option("subscribe", "Ita_news") \
    .load() \
    .selectExpr("CAST(value AS STRING)")
df_borsa = df_borsa.withColumn("Valori_Di_Borsa", F.struct(
    F.col("PrezzoUltimoContratto"), F.col("Var%"), F.col("VarAssoluta"),
    F.col("OraUltimoContratto"), F.col("QuantitaUltimo"), F.col("QuantitaAcquisto"),
    F.col("QuantitaVendita"), F.col("QuantitaTotale"), F.col("NumeroContratti"),
    F.col("MaxOggi"), F.col("MinOggi")))
df_borsa.printSchema()
df_news = df_news.withColumn("News", F.struct(
    F.col("TitoloNews"), F.col("TestoNews"), F.col("DataNews"),
    F.col("OraNews"), F.col("SoggettoNews")))
df_news.printSchema()
df_join = df_borsa.join(df_news)
df_join.printSchema()
Check the code below.
Extract the struct Valori_Di_Borsa column, add the News column, and reconstruct the struct:
df_join = df_borsa.join(df_news) \
    .withColumn("Valori_Di_Borsa", F.struct(F.col("Valori_Di_Borsa.*"), F.col("News")))

Convert PySpark dataframe column a list/array entries to double list/array entries

I would like to convert a PySpark dataframe having the following schema structure.
root
|-- top: long (nullable = true)
|-- inner: struct (nullable = true)
| |-- inner1: long (nullable = true)
| |-- inner2: long (nullable = true)
| |-- inner3: date (nullable = true)
| |-- inner4: date (nullable = true)
to:
root
|-- top: long (nullable = true)
|-- inner: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- inner1: long (nullable = true)
| | |-- inner2: long (nullable = true)
| | |-- inner3: date (nullable = true)
| | |-- inner4: date (nullable = true)
This is basically changing
top | [ inner1, inner2, inner3, inner4]
to
top | [[inner1, inner2, inner3, inner4]]

converting all fields in a structtype to array

I have this StructType with over 1000 fields; every field's type is string.
root
|-- mac: string (nullable = true)
|-- kv: struct (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_FEAT_B64: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_FEAT_CODE: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_HELP_B64: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_HELP_CODE: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_SYST_B64: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_SYST_CODE: string (nullable = true)
| |-- FTP_SERVER_HELLO_B64: string (nullable = true)
| |-- FTP_STATUS_HELLO_CODE: string (nullable = true)
| |-- HTML_LOGIN_FORM_ACTION_0: string (nullable = true)
| |-- HTML_LOGIN_FORM_DETECTION_0: string (nullable = true)
| |-- HTML_LOGIN_FORM_INPUT_PASSWORD_NAME_0: string (nullable = true)
| |-- HTML_LOGIN_FORM_INPUT_TEXT_NAME_0: string (nullable = true)
| |-- HTML_LOGIN_FORM_METHOD_0: string (nullable = true)
| |-- HTML_REDIRECT_TYPE_0: string (nullable = true)
I want to select only the fields that are non-null, plus some identifier of which fields are non-null. Is there any way to convert this struct to an array without explicitly referring to each field?
I'd use a UDF:
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import udf

as_array = udf(
    lambda row: [x for x in row if x is not None],
    ArrayType(StringType()))

df.withColumn("arr", as_array(df["kv"]))
