I want to convert a nested JSON to TSV in a Databricks notebook using PySpark.
Below is the JSON structure; the columns can change.
{"tables":[{"name":"Result","columns":[{"name":"JobTime","type":"datetime"},{"name":"Status","type":"string"}]
,"rows":[
["2020-04-19T13:45:12.528Z","Failed"]
,["2020-04-19T14:05:40.098Z","Failed"]
,["2020-04-19T13:46:31.655Z","Failed"]
,["2020-04-19T14:01:16.275Z","Failed"],
["2020-04-19T14:03:16.073Z","Failed"],
["2020-04-19T14:01:16.672Z","Failed"],
["2020-04-19T14:02:13.958Z","Failed"],
["2020-04-19T14:04:41.099Z","Failed"],
["2020-04-19T14:04:41.16Z","Failed"],
["2020-04-19T14:05:14.462Z","Failed"]
]}
]}
I am new to Databricks. Please help.
You have two ways to deal with this problem: either you do some preprocessing in Python with the json library (or equivalent, see the sketch at the end of this answer), or you load the string directly into PySpark and work from there, such as:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
spark = SparkSession.builder.getOrCreate()
# your json
so_json = """
{"tables":[{"name":"Result","columns":[{"name":"JobTime","type":"datetime"},{"name":"Status","type":"string"}]
,"rows":[
["2020-04-19T13:45:12.528Z","Failed"]
,["2020-04-19T14:05:40.098Z","Failed"]
,["2020-04-19T13:46:31.655Z","Failed"]
,["2020-04-19T14:01:16.275Z","Failed"],
["2020-04-19T14:03:16.073Z","Failed"],
["2020-04-19T14:01:16.672Z","Failed"],
["2020-04-19T14:02:13.958Z","Failed"],
["2020-04-19T14:04:41.099Z","Failed"],
["2020-04-19T14:04:41.16Z","Failed"],
["2020-04-19T14:05:14.462Z","Failed"]
]}
]}
"""
# load in directly using read.json(), you'll see that this becomes
# a nested ArrayType/StructType wombo combo
json_df = spark.read.json(spark.sparkContext.parallelize([so_json]))
json_df.printSchema()
root
|-- tables: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- columns: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- name: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- rows: array (nullable = true)
| | | |-- element: array (containsNull = true)
| | | | |-- element: string (containsNull = true)
# select the nested "rows" field of the first table and explode it
array_df = json_df.select(f.explode(f.col('tables')['rows'][0]))
Exploding takes the rows field, which is an ArrayType, and splits it into actual DataFrame rows.
Then you can subselect the elements either by dot or bracket notation:
array_df.printSchema()
root
|-- col: array (nullable = true)
| |-- element: string (containsNull = true)
tabular_df = array_df.select(
array_df.col[0].alias("JobTime"),
array_df.col[1].alias("Status")
)
tabular_df.show()
+--------------------+------+
| JobTime|Status|
+--------------------+------+
|2020-04-19T13:45:...|Failed|
|2020-04-19T14:05:...|Failed|
|2020-04-19T13:46:...|Failed|
|2020-04-19T14:01:...|Failed|
|2020-04-19T14:03:...|Failed|
|2020-04-19T14:01:...|Failed|
|2020-04-19T14:02:...|Failed|
|2020-04-19T14:04:...|Failed|
|2020-04-19T14:04:...|Failed|
|2020-04-19T14:05:...|Failed|
+--------------------+------+
Finally, you want to save as CSV with a custom separator (\t). Hence:
tabular_df.write.csv("path/to/file.tsv", sep="\t")
NB: You may need to manually control for types, such as converting JobTime to TimestampType, but I'll leave that up to you.
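For example, a rough sketch of that cast plus the TSV write (the output path is a placeholder, and note that write.csv produces a directory of part files rather than a single file):
import pyspark.sql.functions as f

# cast the ISO-8601 strings to proper timestamps before writing
typed_df = tabular_df.withColumn("JobTime", f.col("JobTime").cast("timestamp"))

# write tab-separated, with a header row
typed_df.write.csv("path/to/output_dir", sep="\t", header=True)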
Hope this helps.
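For completeness, the other option mentioned at the top (pure-Python preprocessing with the standard-library json and csv modules) could look roughly like this; the output path is a placeholder (on Databricks you would typically write under /dbfs/):
import csv
import json

data = json.loads(so_json)
table = data["tables"][0]
header = [c["name"] for c in table["columns"]]

# write a single TSV file using the csv module with a tab delimiter
with open("/dbfs/tmp/result.tsv", "w", newline="") as fh:
    writer = csv.writer(fh, delimiter="\t")
    writer.writerow(header)
    writer.writerows(table["rows"])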
Let's say there are two DataFrames: a reference DataFrame and a target DataFrame.
The reference DF carries the reference schema.
Schema for the reference DF (r_df):
r_df.printSchema()
root
|-- _id: string (nullable = true)
|-- notificationsSend: struct (nullable = true)
| |-- mail: boolean (nullable = true)
| |-- sms: boolean (nullable = true)
|-- recordingDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- channelName: string (nullable = true)
| | |-- fileLink: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- recorderId: string (nullable = true)
| | |-- resourceId: string (nullable = true)
However, the target DataFrame's schema is dynamic in nature.
Schema for the target DF (t_df):
t_df.printSchema()
root
|-- _id: string (nullable = true)
|-- notificationsSend: struct (nullable = true)
| |-- sms: string (nullable = true)
|-- recordingDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- channelName: string (nullable = true)
| | |-- fileLink: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- recorderId: string (nullable = true)
| | |-- resourceId: string (nullable = true)
| | |-- createdBy: string (nullable = true)
So we observe multiple changes in the target's schema.
Structs or arrays inside t_df can have more or fewer columns.
Datatypes of columns can change too, so type casting is required (e.g. the sms column is boolean in r_df but string in t_df).
I was able to add/remove columns of non-struct datatypes. However, structs and arrays are a real pain for me. Since there are 50+ columns, I need an optimised solution which works for all of them.
Any solution, opinion, or workaround will be really helpful.
Expected output
I want to make my t_df's schema exactly same as my r_df's schema.
The code below is untested but should show how to do it (written from memory, without testing).
There may be a cleaner way to get the fields out of a struct, but I'm not aware of one, so I'm interested to hear others' ideas. The steps are:
Extract the struct column names and types.
Find the columns that need to be dropped.
Drop those columns.
Rebuild the structs according to r_df.
from pyspark.sql.functions import col, lit, struct

# use a list comprehension to create a list of the struct columns in r_df
structs_in_r_df = [field.name for field in r_df.schema.fields if str(field.dataType).startswith("Struct")]

# get a list of the fields inside each of those structs
struct_columns = []
for s in structs_in_r_df:
    struct_columns.append(r_df.select(f"{s}.*").columns)

missing_columns = list(set(r_df.columns) - set(t_df.columns))               # columns r_df has but t_df lacks
similar_columns = list(set(r_df.columns).intersection(set(t_df.columns)))   # columns present in both

# remove struct columns from both lists so you don't represent them twice.
# you need to repeat the above intersection/missing logic for the structs and then rebuild them,
# but the above gives you the idea of how to get the fields out.
# you can use f-strings such as col(f"{struct_name}.{field_name}") to get values out of individual fields.
result = r_df.union(
    t_df.select(*(
        [lit(None).cast(dict(r_df.dtypes)[column]).alias(column) for column in missing_columns]
        + [col(column).cast(dict(r_df.dtypes)[column]).alias(column) for column in similar_columns]
    ))
    # building the column lists dynamically and passing them as varargs to select pulls out exactly what you need.
    # note: union() matches columns by position, so make sure the final order lines up with r_df.columns.
)
Here's a way, once you have the union, to rebuild the structs:
result = result.select(
    col("_id"),
    struct(col("sms").alias("sms")).alias("notificationsSend"),
    struct(
        *[col(column).alias(column) for column in struct_columns]  # pass the columns as varargs to struct
    ).alias("recordingDetails")  # reconstitute the struct from its fields
)
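On the point about getting fields out of a struct: you can also read them straight off the schema object, which avoids the extra select. A small sketch against the r_df schema shown in the question:
# field names of the notificationsSend struct column
notif_fields = r_df.schema["notificationsSend"].dataType.fieldNames()
# ['mail', 'sms']

# field names of the struct inside the recordingDetails array
rec_fields = r_df.schema["recordingDetails"].dataType.elementType.fieldNames()
# ['channelName', 'fileLink', 'id', 'recorderId', 'resourceId']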
I have a dataframe with the following schema using pyspark:
|-- suborders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- trackingStatusHistory: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- trackingStatusUpdatedAt: string (nullable = true)
| | | | |-- trackingStatus: string (nullable = true)
What I want to do is create a new deliveredat element for each element of the suborders array, based on a condition.
I need to find the date within the trackingStatusHistory array where trackingStatusHistory.trackingStatus = 'delivered'. If this trackingStatus exists, the new deliveredat element should receive the date in trackingStatusHistory.trackingStatusUpdatedAt. If it doesn't exist, it should receive null.
How can I do this using pyspark?
You can do that using the higher-order functions transform + filter on arrays. For each struct element of the suborders array, you add a new field by filtering the sub-array trackingStatusHistory and getting the delivery date, like this:
import pyspark.sql.functions as F
df = df.withColumn(
    "suborders",
    F.expr("""
        transform(
            suborders,
            x -> struct(
                filter(x.trackingStatusHistory, y -> y.trackingStatus = 'delivered')[0].trackingStatusUpdatedAt as deliveredAt,
                x.trackingStatusHistory as trackingStatusHistory
            )
        )
    """)
)
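For a quick sanity check, here's a minimal self-contained sketch with made-up data shaped like the schema in the question (the timestamps and statuses are hypothetical):
from pyspark.sql import Row, SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data shaped like the suborders/trackingStatusHistory schema
df = spark.createDataFrame([
    Row(suborders=[
        Row(trackingStatusHistory=[
            Row(trackingStatusUpdatedAt="2021-03-01T10:00:00Z", trackingStatus="shipped"),
            Row(trackingStatusUpdatedAt="2021-03-04T09:30:00Z", trackingStatus="delivered"),
        ]),
        Row(trackingStatusHistory=[
            Row(trackingStatusUpdatedAt="2021-03-02T11:00:00Z", trackingStatus="shipped"),
        ]),
    ])
])

result = df.withColumn(
    "suborders",
    F.expr("""transform(suborders, x -> struct(
        filter(x.trackingStatusHistory, y -> y.trackingStatus = 'delivered')[0].trackingStatusUpdatedAt as deliveredAt,
        x.trackingStatusHistory as trackingStatusHistory))""")
)

# the first suborder gets the 'delivered' timestamp, the second gets null
result.select(F.col("suborders.deliveredAt")).show(truncate=False)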
Suppose we have two data frames df1 and df2 with the following schema:
A
|-- B: struct (nullable = true)
| |-- b1: string (nullable = true)
| |-- b2: string (nullable = true)
| |-- b3: string (nullable = true)
| |-- C: array (nullable = true)
| | |-- D: struct (containsNull = true)
| | | |-- d1: string (nullable = true)
| | | |-- d2: string (nullable = true)
Would df1.union(df2) work for these nested data frames if you wanted to add a new record, or would you have to flatten them first?
This should work; here is a knowledge article by Databricks:
https://kb.databricks.com/data/append-a-row-to-rdd-or-dataframe.html
and you won't need to flatten your struct fields.
PS: Please ensure your columns are in the same order in both DataFrames.
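As a rough sketch with the two DataFrames from the question:
# positional union: works with nested struct/array columns as long as
# both DataFrames have the same columns in the same order
combined = df1.union(df2)

# alternative that matches columns by name rather than position (Spark 2.3+)
combined_by_name = df1.unionByName(df2)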
I have the following schema for a pyspark dataframe
root
|-- maindata: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- label: string (nullable = true)
| | | |-- value: string (nullable = true)
| | | |-- unit: string (nullable = true)
| | | |-- dateTime: string (nullable = true)
Here is some data for a particular row, which I obtained via df.select(F.col("maindata")).show(1, False):
|[[[a1, 43.24, km/h, 2019-04-06T13:02:08.020], [TripCount, 135, , 2019-04-06T13:02:08.790],["t2", 0, , 2019-04-06T13:02:08.040], [t4, 0, , 2019-04-06T13:02:08.050], [t09, 0, , 2019-04-06T13:02:08.050], [t3, 1, , 2019-04-06T13:02:08.050], [t7, 0, , 2019-04-06T13:02:08.050],[TripCount, ,136, 2019-04-06T13:02:08.790]]
I want to access the TripCount values inside this, e.g. TripCount -> 136, 135, etc. What is the best way to access this data? TripCount is present multiple times.
Also, is there any way to access only the label data, e.g. maindata.label?
I would suggest doing explode multiple times to convert the array elements into individual rows, and then either converting the struct into individual columns, or working with the nested elements using dot syntax. For example:
from pyspark.sql.functions import col, explode
df = spark.createDataFrame([[[[('k1', 'v1', 'v2')]]]], ['d'])
df2 = df.select(explode(col('d')).alias('d')).select(explode(col('d')).alias('data'))
>>> df2.printSchema()
root
|-- data: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: string (nullable = true)
| |-- _3: string (nullable = true)
>>> df2.filter(col("data._1") == "k1").show()
+------------+
| data|
+------------+
|[k1, v1, v2]|
+------------+
or you can extract members of the struct as individual columns:
from pyspark.sql.functions import col, explode
df = spark.createDataFrame([[[[('k1', 'v1', 'v2')]]]], ['d'])
df2 = (df.select(explode(col('d')).alias('d'))
         .select(explode(col('d')).alias('d'))
         .select("d.*"))
>>> df2.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: string (nullable = true)
|-- _3: string (nullable = true)
>>> df2.filter(col("_1") == "k1").show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| k1| v1| v2|
+---+---+---+
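Applying the same pattern to the schema in the question (df here stands for the asker's DataFrame with the maindata column, so this is just a sketch), you can pull out the TripCount readings or only the labels:
from pyspark.sql.functions import col, explode

# explode the outer and inner arrays, then flatten the struct into columns
flat = (df
        .select(explode(col("maindata")).alias("m"))
        .select(explode(col("m")).alias("m"))
        .select("m.*"))

# only the TripCount readings
flat.filter(col("label") == "TripCount").select("label", "value", "dateTime").show(truncate=False)

# only the label values, which also covers the maindata.label part of the question
flat.select("label").show()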
I know that I've asked a similar question here but that was for row filtering. This time I am trying to drop columns instead. I tried to implement Higher Order Functions such as FILTER and others for a while but could not get it to work. I think what I need is a SELECT Higher Order Function but it doesn't seem to exist. Thank you for your help!
I am using pyspark and I have a dataframe object df; this is what the output of df.printSchema() looks like:
root
|-- M_MRN: string (nullable = true)
|-- measurements: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Observation_ID: string (nullable = true)
| | |-- Observation_Name: string (nullable = true)
| | |-- Observation_Result: string (nullable = true)
I would like to keep only the 'Observation_ID' and 'Observation_Result' fields in 'measurements'. So currently when I run df.select('measurements').take(2) I get
[Row(measurements=[Row(Observation_ID='5', Observation_Name='ABC', Observation_Result='108/72'),
Row(Observation_ID='11', Observation_Name='ABC', Observation_Result='70'),
Row(Observation_ID='10', Observation_Name='ABC', Observation_Result='73.029'),
Row(Observation_ID='14', Observation_Name='XYZ', Observation_Result='23.1')]),
Row(measurements=[Row(Observation_ID='2', Observation_Name='ZZZ', Observation_Result='3/4'),
Row(Observation_ID='5', Observation_Name='ABC', Observation_Result='7')])]
I would like that, after doing the above filtering, running df.select('measurements').take(2) gives
[Row(measurements=[Row(Observation_ID='5', Observation_Result='108/72'),
Row(Observation_ID='11', Observation_Result='70'),
Row(Observation_ID='10', Observation_Result='73.029'),
Row(Observation_ID='14', Observation_Result='23.1')]),
Row(measurements=[Row(Observation_ID='2', Observation_Result='3/4'),
Row(Observation_ID='5', Observation_Result='7')])]
Is there a way to do this in pyspark? Thank you in anticipation for your help!
You could use the higher-order function transform to select your desired fields and put them in a struct.
from pyspark.sql import functions as F
df.withColumn("measurements",F.expr("""transform(measurements\
,x-> struct(x.Observation_ID as Observation_ID,\
x.Observation_Result as Observation_Result))""")).printSchema()
#root
#|-- measurements: array (nullable = true)
#| |-- element: struct (containsNull = false)
#| | |-- Observation_ID: string (nullable = true)
#| | |-- Observation_Result: string (nullable = true)
For anyone looking for an answer that works with older PySpark versions, here is one using UDFs:
import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

_measurement_type = ArrayType(StructType([
    StructField('Observation_ID', StringType(), True),
    StructField('Observation_Result', StringType(), True)
]))

@f.udf(returnType=_measurement_type)
def higher_order_select(measurements):
    # keep only the two fields of interest from each struct in the array
    return [(m.Observation_ID, m.Observation_Result) for m in measurements]
df.select(higher_order_select('measurements').alias('measurements')).printSchema()
prints
root
|-- measurements: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Observation_ID: string (nullable = true)
| | |-- Observation_Result: string (nullable = true)