Remove or Replace a value in Struct column in Pyspark - python

I have a struct column in pyspark with below schema:
root
|-- transformedJSON: struct (nullable = true)
| |-- _class: string (nullable = true)
| |-- _id: struct (nullable = true)
| | |-- $oid: string (nullable = true)
| |-- email: string (nullable = true)
| |-- password: string (nullable = true)
| |-- token: string (nullable = true)
| |-- token_expire_in: string (nullable = true)
| |-- token_generation: string (nullable = true)
| |-- uid: string (nullable = true)
|-- Op: string (nullable = true)
And the data looks something like this:
_class: null
_id: {"$oid": "6090c1566264e14261d02c23"}
email: "sandra#arias.com"
password: "2Qu3qe/f"
token: "{\"refresh_token\":\"8rR6mOQhDBSmg1S6sfEP1dYGzYGOlffrgop0OcCL\",\"token_type\":\"Bearer\",\"expires_in\":43200,\"access_token\":\"KWVyQME7bsuCpVgEJfuzjsCHphGkfexrZyB9w5Xc\"}"
token_expire_in: "1620178814909"
token_generation: "1620139214909"
uid: "81YoCIzxHRQJesGdvnEPfC5NdXr2"
The token value is not parsed correctly because it contains backslash-escaped quotes.
Now I want to remove the backslashes from the token field only, but I cannot figure out how to do so.
The only thing I found was withColumn with regexp_replace, but that operates on a whole top-level column, and I need to target a single field inside the struct column.

You can parse the JSON string in the token field into a struct column as follows:
import pyspark.sql.functions as f
from pyspark.sql.types import StructType, StringType, IntegerType

df = spark.read.json('test.json')

token_schema = StructType() \
    .add("refresh_token", StringType(), True) \
    .add("token_type", StringType(), True) \
    .add("expires_in", IntegerType(), True) \
    .add("access_token", StringType(), True)

df = df.withColumn("token", f.from_json("token", schema=token_schema))
df.printSchema()
df.show(truncate=False)
root
|-- _class: string (nullable = true)
|-- _id: struct (nullable = true)
| |-- $oid: string (nullable = true)
|-- email: string (nullable = true)
|-- password: string (nullable = true)
|-- token: struct (nullable = true)
| |-- refresh_token: string (nullable = true)
| |-- token_type: string (nullable = true)
| |-- expires_in: integer (nullable = true)
| |-- access_token: string (nullable = true)
|-- token_expire_in: string (nullable = true)
|-- token_generation: string (nullable = true)
|-- uid: string (nullable = true)
+------+--------------------------+----------------+--------+---------------------------------------------------------------------------------------------------+---------------+----------------+----------------------------+
|_class|_id |email |password|token |token_expire_in|token_generation|uid |
+------+--------------------------+----------------+--------+---------------------------------------------------------------------------------------------------+---------------+----------------+----------------------------+
|null |{6090c1566264e14261d02c23}|sandra#arias.com|2Qu3qe/f|{8rR6mOQhDBSmg1S6sfEP1dYGzYGOlffrgop0OcCL, Bearer, 43200, KWVyQME7bsuCpVgEJfuzjsCHphGkfexrZyB9w5Xc}|1620178814909 |1620139214909 |81YoCIzxHRQJesGdvnEPfC5NdXr2|
+------+--------------------------+----------------+--------+---------------------------------------------------------------------------------------------------+---------------+----------------+----------------------------+

Related

How to iterate through an array struct and return the element I want in pyspark

Here is my example json file:
{"data":"example1","data2":"example2","register":[{"name":"John","last_name":"Travolta","age":68},{"name":"Nicolas","last_name":"Cage","age":58}], "data3":"example3","data4":"example4"}
And I have a data schema similar to this (totally illustrative):
root
|-- register: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- age: long (nullable = true)
| | |-- last_name: string (nullable = true)
| | |-- name: string (nullable = true)
|-- data: string (nullable = true)
|-- data2: string (nullable = true)
|-- data3: string (nullable = true)
|-- data4: string (nullable = true)
What I want is to iterate over register, check whether the name field matches e.g. John Travolta, and create a new struct column new_register (for example) with all the fields of the matching element.
I tried some of Spark's own functions, like filter, when, and contains, but none of them gave me the desired result.
I also tried to implement a UDF, but I couldn't find a way to apply the function to the field I want.
How do I resolve this?
First explode the array field, then access the struct fields with dot notation and filter on the required values. Here is the code:
from pyspark.sql.functions import explode, col

df.printSchema()
df.show(10, False)

df1 = df.withColumn("new_struct", explode("register")) \
    .filter((col("new_struct.last_name") == 'Travolta') & (col("new_struct.name") == 'John'))
df1.show(10, False)
df1.printSchema()
root
|-- data: string (nullable = true)
|-- data2: string (nullable = true)
|-- data3: string (nullable = true)
|-- data4: string (nullable = true)
|-- register: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- age: long (nullable = true)
| | |-- last_name: string (nullable = true)
| | |-- name: string (nullable = true)
+--------+--------+--------+--------+-------------------------------------------+
|data |data2 |data3 |data4 |register |
+--------+--------+--------+--------+-------------------------------------------+
|example1|example2|example3|example4|[{68, Travolta, John}, {58, Cage, Nicolas}]|
+--------+--------+--------+--------+-------------------------------------------+
+--------+--------+--------+--------+-------------------------------------------+--------------------+
|data |data2 |data3 |data4 |register |new_struct |
+--------+--------+--------+--------+-------------------------------------------+--------------------+
|example1|example2|example3|example4|[{68, Travolta, John}, {58, Cage, Nicolas}]|{68, Travolta, John}|
+--------+--------+--------+--------+-------------------------------------------+--------------------+
root
|-- data: string (nullable = true)
|-- data2: string (nullable = true)
|-- data3: string (nullable = true)
|-- data4: string (nullable = true)
|-- register: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- age: long (nullable = true)
| | |-- last_name: string (nullable = true)
| | |-- name: string (nullable = true)
|-- new_struct: struct (nullable = true)
| |-- age: long (nullable = true)
| |-- last_name: string (nullable = true)
| |-- name: string (nullable = true)

how to select column from array of struct?

root
|-- InvoiceNo: string (nullable = true)
|-- Quantity: integer (nullable = true)
|-- CustomerID: integer (nullable = true)
|-- InvoiceDate: string (nullable = true)
|-- collect_list(items): array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- StockCode: string (nullable = true)
| | |-- Description: string (nullable = true)
| | |-- UnitPrice: double (nullable = true)
| | |-- Country: string (nullable = true)
Here is my schema. I am trying to create a new column TotalPrice like this:
.withColumn('TotalPrice', col('Quantity') * col('UnitPrice'))
but I cannot take UnitPrice out of the array of structs. How can I do that?

How to change JSON structure on pyspark?

I have two JSON streams read from Kafka, and this is their printSchema() output:
JSON1 printSchema:
root
|-- _id: string (nullable = true)
|-- Data: string (nullable = true)
|-- NomeAzienda: string (nullable = true)
|-- Valori_Di_Borsa: struct (nullable = false)
| |-- PrezzoUltimoContratto: double (nullable = true)
| |-- Var%: double (nullable = true)
| |-- VarAssoluta: double (nullable = true)
| |-- OraUltimoContratto: string (nullable = true)
| |-- QuantitaUltimo: double (nullable = true)
| |-- QuantitaAcquisto: double (nullable = true)
| |-- QuantitaVendita: double (nullable = true)
| |-- QuantitaTotale: double (nullable = true)
| |-- NumeroContratti: double (nullable = true)
| |-- MaxOggi: double (nullable = true)
| |-- MinOggi: double (nullable = true)
JSON2 printSchema():
root
|-- _id: string (nullable = true)
|-- News: struct (nullable = false)
| |-- TitoloNews: string (nullable = true)
| |-- TestoNews: string (nullable = true)
| |-- DataNews: string (nullable = true)
| |-- OraNews: long (nullable = true)
| |-- SoggettoNews: string (nullable = true)
Joining the two JSONs, I get this printSchema():
root
|-- _id: string (nullable = true)
|-- Data: string (nullable = true)
|-- NomeAzienda: string (nullable = true)
|-- Valori_Di_Borsa: struct (nullable = false)
| |-- PrezzoUltimoContratto: double (nullable = true)
| |-- Var%: double (nullable = true)
| |-- VarAssoluta: double (nullable = true)
| |-- OraUltimoContratto: string (nullable = true)
| |-- QuantitaUltimo: double (nullable = true)
| |-- QuantitaAcquisto: double (nullable = true)
| |-- QuantitaVendita: double (nullable = true)
| |-- QuantitaTotale: double (nullable = true)
| |-- NumeroContratti: double (nullable = true)
| |-- MaxOggi: double (nullable = true)
| |-- MinOggi: double (nullable = true)
|-- _id: string (nullable = true)
|-- News: struct (nullable = false)
| |-- TitoloNews: string (nullable = true)
| |-- TestoNews: string (nullable = true)
| |-- DataNews: string (nullable = true)
| |-- OraNews: long (nullable = true)
| |-- SoggettoNews: string (nullable = true)
But the result I would like to have is this:
Update Root:
-- _id: string (nullable = true)
-- Data: string (nullable = true)
-- NomeAzienda: string (nullable = true)
-- Valori_Di_Borsa: struct (nullable = false)
|-- PrezzoUltimoContratto: double (nullable = true)
|-- Var%: double (nullable = true)
|-- VarAssoluta: double (nullable = true)
|-- OraUltimoContratto: string (nullable = true)
|-- QuantitaUltimo: double (nullable = true)
|-- QuantitaAcquisto: double (nullable = true)
|-- QuantitaVendita: double (nullable = true)
|-- QuantitaTotale: double (nullable = true)
|-- NumeroContratti: double (nullable = true)
|-- MaxOggi: double (nullable = true)
|-- MinOggi: double (nullable = true)
|-- News: struct (nullable = false)
|-- id: string (nullable = true)
|-- TitoloNews: string (nullable = true)
|-- TestoNews: string (nullable = true)
|-- DataNews: string (nullable = true)
|-- OraNews: long (nullable = true)
|-- SoggettoNews: string (nullable = true)
How can I do it using pyspark?
This is my code:
df_borsa = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", kafka_broker) \
    .option("startingOffsets", "latest") \
    .option("subscribe", "Be_borsa") \
    .load() \
    .selectExpr("CAST(value AS STRING)")

df_news = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", kafka_broker) \
    .option("startingOffsets", "latest") \
    .option("subscribe", "Ita_news") \
    .load() \
    .selectExpr("CAST(value AS STRING)")

df_borsa = df_borsa.withColumn("Valori_Di_Borsa", F.struct(
    F.col("PrezzoUltimoContratto"), F.col("Var%"), F.col("VarAssoluta"),
    F.col("OraUltimoContratto"), F.col("QuantitaUltimo"), F.col("QuantitaAcquisto"),
    F.col("QuantitaVendita"), F.col("QuantitaTotale"), F.col("NumeroContratti"),
    F.col("MaxOggi"), F.col("MinOggi")))
df_borsa.printSchema()

df_news = df_news.withColumn("News", F.struct(
    F.col("TitoloNews"), F.col("TestoNews"), F.col("DataNews"),
    F.col("OraNews"), F.col("SoggettoNews")))
df_news.printSchema()

df_join = df_borsa.join(df_news)
df_join.printSchema()
Check the code below.
Extract the fields of the Valori_Di_Borsa struct column, add the News column, and rebuild the struct:
df_join = df_borsa.join(df_news) \
    .withColumn("Valori_Di_Borsa", F.struct(F.col("Valori_Di_Borsa.*"), F.col("News")))

How to transform nested dataframe schema in PySpark

I have a dataframe with the following schema:
root
|-- _1: struct (nullable = true)
| |-- key: string (nullable = true)
|-- _2: struct (nullable = true)
| |-- value: long (nullable = true)
I want to transform dataframe to the following schema:
root
|-- _1: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- value: long (nullable = true)
Use struct:
pyspark.sql.functions.struct(*cols)
Creates a new struct column.
from pyspark.sql.functions import struct, col
from pyspark.sql import Row
df = spark.createDataFrame([Row(_1=Row(key="a"), _2=Row(value=1))])
result = df.select(struct(col("_1.key"), col("_2.value")).alias("_1"))
which gives:
result.printSchema()
# root
# |-- _1: struct (nullable = false)
# | |-- key: string (nullable = true)
# | |-- value: long (nullable = true)
and
result.show()
# +-----+
# | _1|
# +-----+
# |[a,1]|
# +-----+
If your dataframe is with following schema
root
|-- _1: struct (nullable = true)
| |-- key: string (nullable = true)
|-- _2: struct (nullable = true)
| |-- value: long (nullable = true)
Then you can use * to expand the struct columns into their individual fields, and the struct built-in function to combine them back into a single struct field:
from pyspark.sql import functions as F
df.select(F.struct("_1.*", "_2.*").alias("_1"))
which gives the desired schema:
root
|-- _1: struct (nullable = false)
| |-- key: string (nullable = true)
| |-- value: long (nullable = true)
Updated
A more generalized form of the code above, for when all the columns in the original dataframe are structs:
df.select(F.struct(["{}.*".format(x) for x in df.columns]).alias("_1"))

converting all fields in a structtype to array

I have this structtype with over a 1000 fields, every field type is a string.
root
|-- mac: string (nullable = true)
|-- kv: struct (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_FEAT_B64: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_FEAT_CODE: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_HELP_B64: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_HELP_CODE: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_SYST_B64: string (nullable = true)
| |-- FTP_SERVER_ANAUTHORIZED_SYST_CODE: string (nullable = true)
| |-- FTP_SERVER_HELLO_B64: string (nullable = true)
| |-- FTP_STATUS_HELLO_CODE: string (nullable = true)
| |-- HTML_LOGIN_FORM_ACTION_0: string (nullable = true)
| |-- HTML_LOGIN_FORM_DETECTION_0: string (nullable = true)
| |-- HTML_LOGIN_FORM_INPUT_PASSWORD_NAME_0: string (nullable = true)
| |-- HTML_LOGIN_FORM_INPUT_TEXT_NAME_0: string (nullable = true)
| |-- HTML_LOGIN_FORM_METHOD_0: string (nullable = true)
| |-- HTML_REDIRECT_TYPE_0: string (nullable = true)
I want to select only the fields which are non-null, plus some identifier of which fields were non-null. Is there any way to convert this struct to an array without explicitly referring to each field?
I'd use a UDF:
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import udf

as_array = udf(
    lambda arr: [x for x in arr if x is not None],
    ArrayType(StringType()))

df.withColumn("arr", as_array(df["kv"]))
