I want to convert a nested JSON to TSV in a Databricks notebook using PySpark.
Below is the JSON structure; the columns can change.
{"tables":[{"name":"Result","columns":[{"name":"JobTime","type":"datetime"},{"name":"Status","type":"string"}]
,"rows":[
["2020-04-19T13:45:12.528Z","Failed"]
,["2020-04-19T14:05:40.098Z","Failed"]
,["2020-04-19T13:46:31.655Z","Failed"]
,["2020-04-19T14:01:16.275Z","Failed"],
["2020-04-19T14:03:16.073Z","Failed"],
["2020-04-19T14:01:16.672Z","Failed"],
["2020-04-19T14:02:13.958Z","Failed"],
["2020-04-19T14:04:41.099Z","Failed"],
["2020-04-19T14:04:41.16Z","Failed"],
["2020-04-19T14:05:14.462Z","Failed"]
]}
]}
I am new to Databricks. Please help.
You have two ways to deal with this problem: either you do some preprocessing in Python with the json library (or equivalent, see the sketch at the end of this answer), or you load the string directly into PySpark and work from there, such as:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
spark = SparkSession.builder.getOrCreate()
# your json
so_json = """
{"tables":[{"name":"Result","columns":[{"name":"JobTime","type":"datetime"},{"name":"Status","type":"string"}]
,"rows":[
["2020-04-19T13:45:12.528Z","Failed"]
,["2020-04-19T14:05:40.098Z","Failed"]
,["2020-04-19T13:46:31.655Z","Failed"]
,["2020-04-19T14:01:16.275Z","Failed"],
["2020-04-19T14:03:16.073Z","Failed"],
["2020-04-19T14:01:16.672Z","Failed"],
["2020-04-19T14:02:13.958Z","Failed"],
["2020-04-19T14:04:41.099Z","Failed"],
["2020-04-19T14:04:41.16Z","Failed"],
["2020-04-19T14:05:14.462Z","Failed"]
]}
]}
"""
# load in directly using read.json(), you'll see that this becomes
# a nested ArrayType/StructType wombo combo
json_df = spark.read.json(spark.sparkContext.parallelize([so_json]))
json_df.printSchema()
root
|-- tables: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- columns: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- name: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- rows: array (nullable = true)
| | | |-- element: array (containsNull = true)
| | | | |-- element: string (containsNull = true)
# select the nested "rows" field of the first table and explode it
array_df = json_df.select(f.explode(f.col('tables')['rows'][0]))
Exploding takes the rows field, which is an ArrayType, and splits it into actual DataFrame rows.
Then you can subselect the elements either by dot or bracket notation:
array_df.printSchema()
root
|-- col: array (nullable = true)
| |-- element: string (containsNull = true)
tabular_df = array_df.select(
array_df.col[0].alias("JobTime"),
array_df.col[1].alias("Status")
)
tabular_df.show()
+--------------------+------+
| JobTime|Status|
+--------------------+------+
|2020-04-19T13:45:...|Failed|
|2020-04-19T14:05:...|Failed|
|2020-04-19T13:46:...|Failed|
|2020-04-19T14:01:...|Failed|
|2020-04-19T14:03:...|Failed|
|2020-04-19T14:01:...|Failed|
|2020-04-19T14:02:...|Failed|
|2020-04-19T14:04:...|Failed|
|2020-04-19T14:04:...|Failed|
|2020-04-19T14:05:...|Failed|
+--------------------+------+
Finally, you want to save as CSV with a custom separator (\t). Hence:
tabular_df.write.csv("path/to/file.tsv", sep="\t")
NB: You may need to manually control for types, such as converting JobTime to TimestampType, but I'll leave that up to you.
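For example, a rough sketch of that cast plus the TSV write (the output path is a placeholder, and note that write.csv produces a directory of part files rather than a single file):
import pyspark.sql.functions as f

# cast the ISO-8601 strings to proper timestamps before writing
typed_df = tabular_df.withColumn("JobTime", f.col("JobTime").cast("timestamp"))

# write tab-separated, with a header row
typed_df.write.csv("path/to/output_dir", sep="\t", header=True)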
Hope this helps.
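For completeness, the other option mentioned at the top (pure-Python preprocessing with the standard-library json and csv modules) could look roughly like this; the output path is a placeholder (on Databricks you would typically write under /dbfs/):
import csv
import json

data = json.loads(so_json)
table = data["tables"][0]
header = [c["name"] for c in table["columns"]]

# write a single TSV file using the csv module with a tab delimiter
with open("/dbfs/tmp/result.tsv", "w", newline="") as fh:
    writer = csv.writer(fh, delimiter="\t")
    writer.writerow(header)
    writer.writerows(table["rows"])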
Let's say there are two DataFrames: a reference DataFrame and a target DataFrame.
The reference DF carries the reference schema.
Schema for the reference DF (r_df):
r_df.printSchema()
root
|-- _id: string (nullable = true)
|-- notificationsSend: struct (nullable = true)
| |-- mail: boolean (nullable = true)
| |-- sms: boolean (nullable = true)
|-- recordingDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- channelName: string (nullable = true)
| | |-- fileLink: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- recorderId: string (nullable = true)
| | |-- resourceId: string (nullable = true)
However, the target DataFrame's schema is dynamic in nature.
Schema for the target DF (t_df):
t_df.printSchema()
root
|-- _id: string (nullable = true)
|-- notificationsSend: struct (nullable = true)
| |-- sms: string (nullable = true)
|-- recordingDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- channelName: string (nullable = true)
| | |-- fileLink: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- recorderId: string (nullable = true)
| | |-- resourceId: string (nullable = true)
| | |-- createdBy: string (nullable = true)
So we observe multiple changes in the target's schema.
Structs or arrays inside t_df can have more or fewer columns.
Datatypes of columns can change too, so type casting is required (e.g. the sms column is boolean in r_df but string in t_df).
I was able to add/remove columns of non-struct datatypes. However, structs and arrays are a real pain for me. Since there are 50+ columns, I need an optimised solution which works for all of them.
Any solution, opinion, or workaround will be really helpful.
Expected output
I want to make my t_df's schema exactly same as my r_df's schema.
The code below is untested but should show how to do it (written from memory, without testing).
There may be a cleaner way to get the fields out of a struct, but I'm not aware of one, so I'm interested to hear others' ideas. The steps are:
Extract the struct column names and types.
Find the columns that need to be dropped.
Drop those columns.
Rebuild the structs according to r_df.
from pyspark.sql.functions import col, lit, struct

# use a list comprehension to create a list of the struct columns in r_df
structs_in_r_df = [field.name for field in r_df.schema.fields if str(field.dataType).startswith("Struct")]

# get a list of the fields inside each of those structs
struct_columns = []
for s in structs_in_r_df:
    struct_columns.append(r_df.select(f"{s}.*").columns)

missing_columns = list(set(r_df.columns) - set(t_df.columns))               # columns r_df has but t_df lacks
similar_columns = list(set(r_df.columns).intersection(set(t_df.columns)))   # columns present in both

# remove struct columns from both lists so you don't represent them twice.
# you need to repeat the above intersection/missing logic for the structs and then rebuild them,
# but the above gives you the idea of how to get the fields out.
# you can use f-strings such as col(f"{struct_name}.{field_name}") to get values out of individual fields.
result = r_df.union(
    t_df.select(*(
        [lit(None).cast(dict(r_df.dtypes)[column]).alias(column) for column in missing_columns]
        + [col(column).cast(dict(r_df.dtypes)[column]).alias(column) for column in similar_columns]
    ))
    # building the column lists dynamically and passing them as varargs to select pulls out exactly what you need.
    # note: union() matches columns by position, so make sure the final order lines up with r_df.columns.
)
Here's a way, once you have the union, to rebuild the structs:
result = result.select(
    col("_id"),
    struct(col("sms").alias("sms")).alias("notificationsSend"),
    struct(
        *[col(column).alias(column) for column in struct_columns]  # pass the columns as varargs to struct
    ).alias("recordingDetails")  # reconstitute the struct from its fields
)
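On the point about getting fields out of a struct: you can also read them straight off the schema object, which avoids the extra select. A small sketch against the r_df schema shown in the question:
# field names of the notificationsSend struct column
notif_fields = r_df.schema["notificationsSend"].dataType.fieldNames()
# ['mail', 'sms']

# field names of the struct inside the recordingDetails array
rec_fields = r_df.schema["recordingDetails"].dataType.elementType.fieldNames()
# ['channelName', 'fileLink', 'id', 'recorderId', 'resourceId']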
I have a dataframe with the following schema using pyspark:
|-- suborders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- trackingStatusHistory: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- trackingStatusUpdatedAt: string (nullable = true)
| | | | |-- trackingStatus: string (nullable = true)
What I want to do is create a new deliveredat element for each element of the suborders array, based on a condition.
I need to find the date within the trackingStatusHistory array where trackingStatusHistory.trackingStatus = 'delivered'. If this trackingStatus exists, the new deliveredat element should receive the date in trackingStatusHistory.trackingStatusUpdatedAt. If it doesn't exist, it should receive null.
How can I do this using pyspark?
You can do that using the higher-order functions transform + filter on arrays. For each struct element of the suborders array, you add a new field by filtering the sub-array trackingStatusHistory and getting the delivery date, like this:
import pyspark.sql.functions as F
df = df.withColumn(
    "suborders",
    F.expr("""
        transform(
            suborders,
            x -> struct(
                filter(x.trackingStatusHistory, y -> y.trackingStatus = 'delivered')[0].trackingStatusUpdatedAt as deliveredAt,
                x.trackingStatusHistory as trackingStatusHistory
            )
        )
    """)
)
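For a quick sanity check, here's a minimal self-contained sketch with made-up data shaped like the schema in the question (the timestamps and statuses are hypothetical):
from pyspark.sql import Row, SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data shaped like the suborders/trackingStatusHistory schema
df = spark.createDataFrame([
    Row(suborders=[
        Row(trackingStatusHistory=[
            Row(trackingStatusUpdatedAt="2021-03-01T10:00:00Z", trackingStatus="shipped"),
            Row(trackingStatusUpdatedAt="2021-03-04T09:30:00Z", trackingStatus="delivered"),
        ]),
        Row(trackingStatusHistory=[
            Row(trackingStatusUpdatedAt="2021-03-02T11:00:00Z", trackingStatus="shipped"),
        ]),
    ])
])

result = df.withColumn(
    "suborders",
    F.expr("""transform(suborders, x -> struct(
        filter(x.trackingStatusHistory, y -> y.trackingStatus = 'delivered')[0].trackingStatusUpdatedAt as deliveredAt,
        x.trackingStatusHistory as trackingStatusHistory))""")
)

# the first suborder gets the 'delivered' timestamp, the second gets null
result.select(F.col("suborders.deliveredAt")).show(truncate=False)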
Suppose we have two data frames df1 and df2 with the following schema:
A
|-- B: struct (nullable = true)
| |-- b1: string (nullable = true)
| |-- b2: string (nullable = true)
| |-- b3: string (nullable = true)
| |-- C: array (nullable = true)
| | |-- D: struct (containsNull = true)
| | | |-- d1: string (nullable = true)
| | | |-- d2: string (nullable = true)
Would df1.union(df2) work for these nested data frames if you wanted to add a new record, or would you have to flatten them first?
This should work; here is a knowledge article by Databricks:
https://kb.databricks.com/data/append-a-row-to-rdd-or-dataframe.html
and you won't need to flatten your struct fields.
PS: Please ensure your columns are in the same order in both DataFrames.
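As a rough sketch with the two DataFrames from the question:
# positional union: works with nested struct/array columns as long as
# both DataFrames have the same columns in the same order
combined = df1.union(df2)

# alternative that matches columns by name rather than position (Spark 2.3+)
combined_by_name = df1.unionByName(df2)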
I have the following schema for a pyspark dataframe
root
|-- maindata: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- label: string (nullable = true)
| | | |-- value: string (nullable = true)
| | | |-- unit: string (nullable = true)
| | | |-- dateTime: string (nullable = true)
Here is some data for a particular row, which I obtained via df.select(F.col("maindata")).show(1, False):
|[[[a1, 43.24, km/h, 2019-04-06T13:02:08.020], [TripCount, 135, , 2019-04-06T13:02:08.790],["t2", 0, , 2019-04-06T13:02:08.040], [t4, 0, , 2019-04-06T13:02:08.050], [t09, 0, , 2019-04-06T13:02:08.050], [t3, 1, , 2019-04-06T13:02:08.050], [t7, 0, , 2019-04-06T13:02:08.050],[TripCount, ,136, 2019-04-06T13:02:08.790]]
I want to access the TripCount values inside this, e.g. TripCount -> 136, 135, etc. What is the best way to access this data? TripCount is present multiple times.
Also, is there any way to access only the label data, e.g. maindata.label?
I would suggest doing explode multiple times to convert the array elements into individual rows, and then either converting the struct into individual columns, or working with the nested elements using dot syntax. For example:
from pyspark.sql.functions import col, explode
df = spark.createDataFrame([[[[('k1', 'v1', 'v2')]]]], ['d'])
df2 = df.select(explode(col('d')).alias('d')).select(explode(col('d')).alias('data'))
>>> df2.printSchema()
root
|-- data: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: string (nullable = true)
| |-- _3: string (nullable = true)
>>> df2.filter(col("data._1") == "k1").show()
+------------+
| data|
+------------+
|[k1, v1, v2]|
+------------+
or you can extract members of the struct as individual columns:
from pyspark.sql.functions import col, explode
df = spark.createDataFrame([[[[('k1', 'v1', 'v2')]]]], ['d'])
df2 = (df.select(explode(col('d')).alias('d'))
         .select(explode(col('d')).alias('d'))
         .select("d.*"))
>>> df2.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: string (nullable = true)
|-- _3: string (nullable = true)
>>> df2.filter(col("_1") == "k1").show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| k1| v1| v2|
+---+---+---+
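Applying the same pattern to the schema in the question (df here stands for the asker's DataFrame with the maindata column, so this is just a sketch), you can pull out the TripCount readings or only the labels:
from pyspark.sql.functions import col, explode

# explode the outer and inner arrays, then flatten the struct into columns
flat = (df
        .select(explode(col("maindata")).alias("m"))
        .select(explode(col("m")).alias("m"))
        .select("m.*"))

# only the TripCount readings
flat.filter(col("label") == "TripCount").select("label", "value", "dateTime").show(truncate=False)

# only the label values, which also covers the maindata.label part of the question
flat.select("label").show()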
I know that I've asked a similar question here but that was for row filtering. This time I am trying to drop columns instead. I tried to implement Higher Order Functions such as FILTER and others for a while but could not get it to work. I think what I need is a SELECT Higher Order Function but it doesn't seem to exist. Thank you for your help!
I am using pyspark and I have a dataframe object df; this is what the output of df.printSchema() looks like:
root
|-- M_MRN: string (nullable = true)
|-- measurements: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Observation_ID: string (nullable = true)
| | |-- Observation_Name: string (nullable = true)
| | |-- Observation_Result: string (nullable = true)
I would like to keep only the 'Observation_ID' and 'Observation_Result' fields in 'measurements'. So currently when I run df.select('measurements').take(2) I get
[Row(measurements=[Row(Observation_ID='5', Observation_Name='ABC', Observation_Result='108/72'),
Row(Observation_ID='11', Observation_Name='ABC', Observation_Result='70'),
Row(Observation_ID='10', Observation_Name='ABC', Observation_Result='73.029'),
Row(Observation_ID='14', Observation_Name='XYZ', Observation_Result='23.1')]),
Row(measurements=[Row(Observation_ID='2', Observation_Name='ZZZ', Observation_Result='3/4'),
Row(Observation_ID='5', Observation_Name='ABC', Observation_Result='7')])]
I would like that, after doing the above filtering, running df.select('measurements').take(2) gives
[Row(measurements=[Row(Observation_ID='5', Observation_Result='108/72'),
Row(Observation_ID='11', Observation_Result='70'),
Row(Observation_ID='10', Observation_Result='73.029'),
Row(Observation_ID='14', Observation_Result='23.1')]),
Row(measurements=[Row(Observation_ID='2', Observation_Result='3/4'),
Row(Observation_ID='5', Observation_Result='7')])]
Is there a way to do this in pyspark? Thank you in anticipation for your help!
You could use the higher-order function transform to select your desired fields and put them in a struct.
from pyspark.sql import functions as F
df.withColumn("measurements",F.expr("""transform(measurements\
,x-> struct(x.Observation_ID as Observation_ID,\
x.Observation_Result as Observation_Result))""")).printSchema()
#root
#|-- measurements: array (nullable = true)
#| |-- element: struct (containsNull = false)
#| | |-- Observation_ID: string (nullable = true)
#| | |-- Observation_Result: string (nullable = true)
For anyone looking for an answer that works with older PySpark versions, here is one using UDFs:
import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

_measurement_type = ArrayType(StructType([
    StructField('Observation_ID', StringType(), True),
    StructField('Observation_Result', StringType(), True)
]))

@f.udf(returnType=_measurement_type)
def higher_order_select(measurements):
    # keep only the two fields of interest from each struct in the array
    return [(m.Observation_ID, m.Observation_Result) for m in measurements]
df.select(higher_order_select('measurements').alias('measurements')).printSchema()
prints
root
|-- measurements: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Observation_ID: string (nullable = true)
| | |-- Observation_Result: string (nullable = true)