Identifying and deleting duplicate columns in a PySpark nested JSON dataframe - python

I have a dataframe with the below schema:
|-- nlucontexttrail: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- agentid: string (nullable = true)
| | |-- intent: struct (nullable = true)
| | | |-- confidence: double (nullable = true)
| | | |-- entities: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- id: string (nullable = true)
| | | | | |-- values: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- literal: string (nullable = true)
| | | | | | | |-- value: string (nullable = true)
| | | |-- intentname: string (nullable = true)
| | | |-- name: string (nullable = true)
| | |-- intentcandidates: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- confidence: double (nullable = true)
| | | | |-- entities: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- id: string (nullable = true)
| | | | | | |-- values: array (nullable = true)
| | | | | | | |-- element: struct (containsNull = true)
| | | | | | | | |-- literal: string (nullable = true)
| | | | | | | | |-- value: string (nullable = true)
| | | | |-- intentname: string (nullable = true)
| | | | |-- name: string (nullable = true)
| | |-- modelid: string (nullable = true)
| | |-- modelversion: long (nullable = true)
| | |-- nlusessionid: string (nullable = true)
| | |-- usednluengine: string (nullable = true)
| | |-- usednluengine: string (nullable = true)
As you can see in the highlighted duplicate columns ("usednluengine"), one of them has the value 'None' and the other has the expected value. I want to delete the column that has the 'None' value. I am sharing the data below as well; please go through it.
[{"agentid":"dispatcher","intent":{"confidence":0.8822699,"entities":[{"id":"duration","values":[{"literal":"2 Sekunden","value":"PT2S"}]},{"id":"date","values":[{"literal":"eins","value":"T23:00:00Z"},{"literal":"eins","value":"T23:00:00Z"}]},{"id":"number","values":[{"literal":"eins","value":"1"},{"literal":"2","value":"2"},{"literal":"eins","value":"1"}]},{"id":"station","values":[{"literal":"eins","value":"eins"},{"literal":"eins","value":"eins"}]},{"id":"number_values","values":[{"literal":"eins","value":"1"},{"literal":"eins","value":"1"}]},{"id":"percentage_values","values":[{"literal":"höchsten","value":"100"}]}],"intentname":null,"name":"TV"},"intentcandidates":[{"confidence":0.8822699,"entities":[{"id":"duration","values":[{"literal":"2 Sekunden","value":"PT2S"}]},{"id":"date","values":[{"literal":"eins","value":"T23:00:00Z"},{"literal":"eins","value":"T23:00:00Z"}]},{"id":"number","values":[{"literal":"eins","value":"1"},{"literal":"2","value":"2"},{"literal":"eins","value":"1"}]},{"id":"station","values":[{"literal":"eins","value":"eins"},{"literal":"eins","value":"eins"}]},{"id":"number_values","values":[{"literal":"eins","value":"1"},{"literal":"eins","value":"1"}]},{"id":"percentage_values","values":[{"literal":"höchsten","value":"100"}]}],"intentname":null,"name":"TV"}],"modelid":"SVH_STAGING__DISPATCHER","modelversion":13,"nlusessionid":null,"usednluengine":"luis"},{"agentid":"dispatcher","intent":{"confidence":0.140685484,"entities":[{"id":"duration","values":[{"literal":"2 Sekunden","value":"PT2S"}]},{"id":"date","values":[{"literal":"eins","value":"T23:00:00Z"},{"literal":"eins","value":"T23:00:00Z"}]},{"id":"number","values":[{"literal":"eins","value":"1"},{"literal":"2","value":"2"},{"literal":"eins","value":"1"}]},{"id":"number_values","values":[{"literal":"eins","value":"1"},{"literal":"eins","value":"1"}]},{"id":"percentage_values","values":[{"literal":"höchsten","value":"100"}]}],"intentname":null,"name":"TV__SWITCH_CHANNEL"},"intentcandidates":[{"confidence":0.140685484,"entities":[{"id":"duration","values":[{"literal":"2 Sekunden","value":"PT2S"}]},{"id":"date","values":[{"literal":"eins","value":"T23:00:00Z"},{"literal":"eins","value":"T23:00:00Z"}]},{"id":"number","values":[{"literal":"eins","value":"1"},{"literal":"2","value":"2"},{"literal":"eins","value":"1"}]},{"id":"number_values","values":[{"literal":"eins","value":"1"},{"literal":"eins","value":"1"}]},{"id":"percentage_values","values":[{"literal":"höchsten","value":"100"}]}],"intentname":null,"name":"TV__SWITCH_CHANNEL"}],"modelid":"SVH_STAGING__TV","modelversion":13,"nlusessionid":null,"usednluengine":"luis"}]
You can paste the above data into the link below to see it in a readable format:
http://jsonviewer.stack.hu/
The point to note is that the duplicate column with the 'None' value is not visible in the data, but it does show up in df.printSchema(). I want to delete all duplicate columns/nested columns (those inside structs) from the schema and keep the column that has a value. In other words: no change to the data, only to the schema.
I hope my question is clear. If not, please comment below for further discussion.
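One way to approach this (a minimal sketch, not a definitive fix: it assumes the duplicates are exact name collisions, that only one of the colliding keys actually carries data, and that the source JSON is still available to re-read) is to walk df.schema recursively, keep one StructField per name, and re-read the JSON with the deduplicated schema:
from pyspark.sql.types import ArrayType, StructField, StructType

def dedupe_schema(dt):
    # Recursively rebuild a DataType, keeping only the first field per name.
    if isinstance(dt, StructType):
        seen, fields = set(), []
        for f in dt.fields:
            if f.name not in seen:
                seen.add(f.name)
                fields.append(StructField(f.name, dedupe_schema(f.dataType), f.nullable))
        return StructType(fields)
    if isinstance(dt, ArrayType):
        return ArrayType(dedupe_schema(dt.elementType), dt.containsNull)
    return dt

clean_schema = dedupe_schema(df.schema)
# Re-reading with the deduplicated schema means Spark never materializes the
# duplicate field; "/path/to/source.json" is a placeholder for your input path.
clean_df = spark.read.schema(clean_schema).json("/path/to/source.json")
clean_df.printSchema()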

Related

Move deeply nested fields one level up in pyspark dataframe

I have a pyspark dataframe created from XML. Because of the way XML is structured I have an extra, unnecessary level of nesting in the schema of the dataframe.
The schema of my current dataframe:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- movies: struct (nullable = true)
| | | |-- movie: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- b: string (nullable = true)
| | | | | |-- c: string (nullable = true)
| | | | | |-- d: integer (nullable = true)
| | | | | |-- e: string (nullable = true)
| | |-- f: string (nullable = true)
| | |-- g: string (nullable = true)
I'm trying to replace the movies struct with the movie array underneath it as follows:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- movies: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- b: string (nullable = true)
| | | | |-- c: string (nullable = true)
| | | | |-- d: integer (nullable = true)
| | | | |-- e: string (nullable = true)
| | |-- f: string (nullable = true)
| | |-- g: string (nullable = true)
The closest I've gotten was using:
from pyspark.sql import functions as F
df.withColumn("a", F.transform('a', lambda x: x.withField("movies_new", F.col("a.movies.movie"))))
which results in the following schema:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- movies: struct (nullable = true)
| | | |-- movie: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- b: string (nullable = true)
| | | | | |-- c: string (nullable = true)
| | | | | |-- d: integer (nullable = true)
| | | | | |-- e: string (nullable = true)
| | |-- f: string (nullable = true)
| | |-- g: string (nullable = true)
| | |-- movies_new: array (nullable = true)
| | | |-- element: array (containsNull = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- b: string (nullable = true)
| | | | | |-- c: string (nullable = true)
| | | | | |-- d: integer (nullable = true)
| | | | | |-- e: string (nullable = true)
I understand why this is happening, but I thought that if I never extracted the nested array out of 'a', it might not become an array of arrays.
Any suggestions?
The logic is:
1. Explode array "a".
2. Recompute the new struct as (movies.movie, f, g).
3. Collect "a" back into an array.
df = df.withColumn("a", F.explode("a"))
df = df.withColumn("a", F.struct(
    df.a.movies.getField("movie").alias("movies"),
    df.a.f.alias("f"),
    df.a.g.alias("g")))
df = df.select(F.collect_list("a").alias("a"))
The full working code:
import pyspark.sql.functions as F

df = spark.createDataFrame(data=[
    [[(([("b1", "c1", "d1", "e1")],), "f1", "g1")]]
], schema="a array<struct<movies struct<movie array<struct<b string, c string, d string, e string>>>, f string, g string>>")
df.printSchema()
# df.show(truncate=False)

df = df.withColumn("a", F.explode("a"))
df = df.withColumn("a", F.struct(
    df.a.movies.getField("movie").alias("movies"),
    df.a.f.alias("f"),
    df.a.g.alias("g")))
df = df.select(F.collect_list("a").alias("a"))
df.printSchema()
# df.show(truncate=False)
Output schema before:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- movies: struct (nullable = true)
| | | |-- movie: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- b: string (nullable = true)
| | | | | |-- c: string (nullable = true)
| | | | | |-- d: string (nullable = true)
| | | | | |-- e: string (nullable = true)
| | |-- f: string (nullable = true)
| | |-- g: string (nullable = true)
Output schema after:
root
|-- a: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- movies: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- b: string (nullable = true)
| | | | |-- c: string (nullable = true)
| | | | |-- d: string (nullable = true)
| | | | |-- e: string (nullable = true)
| | |-- f: string (nullable = true)
| | |-- g: string (nullable = true)
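As a side note (not part of the original answer; a sketch assuming Spark 3.1+, where Column.withField and Python lambdas in F.transform are available), the asker's transform attempt can be repaired by referencing the lambda element x instead of F.col. This replaces movies in place and also preserves row boundaries, which a whole-frame collect_list would collapse:
from pyspark.sql import functions as F

# For each element of "a", replace the struct field `movies` with the
# `movie` array nested inside it, without exploding the outer array.
df2 = df.withColumn(
    "a",
    F.transform("a", lambda x: x.withField("movies", x["movies"]["movie"])))
df2.printSchema()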

How to access a particular column in this json field element?

This is the schema of the JSON file that I've loaded into the dataframe. I want to filter out the rows where this column doesn't contain any elements.
root
|-- children: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- children: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- children: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- created_date: string (nullable = true)
| | | | | | |-- description: string (nullable = true)
| | | | | | |-- id: long (nullable = true)
| | | | | | |-- last_modified_date: string (nullable = true)
| | | | | | |-- links: array (nullable = true)
| | | | | | | |-- element: struct (containsNull = true)
| | | | | | | | |-- href: string (nullable = true)
| | | | | | | | |-- rel: string (nullable = true)
| | | | | | |-- name: string (nullable = true)
| | | | | | |-- order: long (nullable = true)
| | | | | | |-- parent_id: long (nullable = true)
| | | | | | |-- pid: string (nullable = true)
| | | | | | |-- recursive: boolean (nullable = true)
| | | | | | |-- shared: boolean (nullable = true)
| | | | |-- created_date: string (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- id: long (nullable = true)
| | | | |-- last_modified_date: string (nullable = true)
| | | | |-- links: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- href: string (nullable = true)
| | | | | | |-- rel: string (nullable = true)
| | | | |-- name: string (nullable = true)
| | | | |-- order: long (nullable = true)
| | | | |-- parent_id: long (nullable = true)
| | | | |-- pid: string (nullable = true)
| | | | |-- recursive: boolean (nullable = true)
| | | | |-- shared: boolean (nullable = true)
| | |-- created_date: string (nullable = true)
| | |-- description: string (nullable = true)
| | |-- id: long (nullable = true)
| | |-- last_modified_date: string (nullable = true)
| | |-- links: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- href: string (nullable = true)
| | | | |-- rel: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- order: long (nullable = true)
| | |-- parent_id: long (nullable = true)
| | |-- pid: string (nullable = true)
| | |-- recursive: boolean (nullable = true)
| | |-- shared: boolean (nullable = true)
|-- created_date: string (nullable = true)
|-- description: string (nullable = true)
|-- id: long (nullable = true)
|-- last_modified_date: string (nullable = true)
|-- links: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- href: string (nullable = true)
| | |-- rel: string (nullable = true)
|-- name: string (nullable = true)
|-- order: long (nullable = true)
|-- parent_id: long (nullable = true)
|-- pid: string (nullable = true)
|-- project_id: long (nullable = true)
|-- recursive: boolean (nullable = true)
I tried accessing the data column using
df.filter("children = array('')").show()
but was getting the following error
cannot resolve '(children = array(''))' due to data type mismatch: differing types in '(children = array(''))' (array<struct<children:array<struct<children:array<struct<created_date:string,description:string,id:bigint,last_modified_date:string,links:array<struct<href:string,rel:string>>,name:string,order:bigint,parent_id:bigint,pid:string,recursive:boolean,shared:boolean>>,created_date:string,description:string,id:bigint,last_modified_date:string,links:array<struct<href:string,rel:string>>,name:string,order:bigint,parent_id:bigint,pid:string,recursive:boolean,shared:boolean>>,created_date:string,description:string,id:bigint,last_modified_date:string,links:array<struct<href:string,rel:string>>,name:string,order:bigint,parent_id:bigint,pid:string,recursive:boolean,shared:boolean>> and array<string>).;
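As a sketch of one way to express this filter: an emptiness check on an array column is usually written with size() rather than a comparison against a literal array:
from pyspark.sql import functions as F

# Rows whose `children` array is empty. Note size() returns -1 for NULL
# arrays, so add an isNull check if NULL should count as "no elements" too.
df.filter(F.size("children") == 0).show()
# Or keep only the rows that do have elements:
df.filter(F.size("children") > 0).show()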

Transform aws glue get-tables output from json to PySpark Dataframe

I'm trying to transform the JSON output of the aws glue get-tables command into a PySpark dataframe.
After reading the JSON output with this command:
df = spark.read.option("inferSchema", "true") \
    .option("multiline", "true") \
    .json("tmp/my_json.json")
I get the following from printSchema:
root
|-- TableList: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- CatalogId: string (nullable = true)
| | |-- CreateTime: string (nullable = true)
| | |-- CreatedBy: string (nullable = true)
| | |-- DatabaseName: string (nullable = true)
| | |-- IsRegisteredWithLakeFormation: boolean (nullable = true)
| | |-- LastAccessTime: string (nullable = true)
| | |-- Name: string (nullable = true)
| | |-- Owner: string (nullable = true)
| | |-- Parameters: struct (nullable = true)
| | | |-- CrawlerSchemaDeserializerVersion: string (nullable = true)
| | | |-- CrawlerSchemaSerializerVersion: string (nullable = true)
| | | |-- UPDATED_BY_CRAWLER: string (nullable = true)
| | | |-- averageRecordSize: string (nullable = true)
| | | |-- classification: string (nullable = true)
| | | |-- compressionType: string (nullable = true)
| | | |-- objectCount: string (nullable = true)
| | | |-- recordCount: string (nullable = true)
| | | |-- sizeKey: string (nullable = true)
| | | |-- spark.sql.create.version: string (nullable = true)
| | | |-- spark.sql.sources.schema.numPartCols: string (nullable = true)
| | | |-- spark.sql.sources.schema.numParts: string (nullable = true)
| | | |-- spark.sql.sources.schema.part.0: string (nullable = true)
| | | |-- spark.sql.sources.schema.part.1: string (nullable = true)
| | | |-- spark.sql.sources.schema.partCol.0: string (nullable = true)
| | | |-- spark.sql.sources.schema.partCol.1: string (nullable = true)
| | | |-- typeOfData: string (nullable = true)
| | |-- PartitionKeys: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Name: string (nullable = true)
| | | | |-- Type: string (nullable = true)
| | |-- Retention: long (nullable = true)
| | |-- StorageDescriptor: struct (nullable = true)
| | | |-- BucketColumns: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- Columns: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- Name: string (nullable = true)
| | | | | |-- Type: string (nullable = true)
| | | |-- Compressed: boolean (nullable = true)
| | | |-- InputFormat: string (nullable = true)
| | | |-- Location: string (nullable = true)
| | | |-- NumberOfBuckets: long (nullable = true)
| | | |-- OutputFormat: string (nullable = true)
| | | |-- Parameters: struct (nullable = true)
| | | | |-- CrawlerSchemaDeserializerVersion: string (nullable = true)
| | | | |-- CrawlerSchemaSerializerVersion: string (nullable = true)
| | | | |-- UPDATED_BY_CRAWLER: string (nullable = true)
| | | | |-- averageRecordSize: string (nullable = true)
| | | | |-- classification: string (nullable = true)
| | | | |-- compressionType: string (nullable = true)
| | | | |-- objectCount: string (nullable = true)
| | | | |-- recordCount: string (nullable = true)
| | | | |-- sizeKey: string (nullable = true)
| | | | |-- spark.sql.create.version: string (nullable = true)
| | | | |-- spark.sql.sources.schema.numPartCols: string (nullable = true)
| | | | |-- spark.sql.sources.schema.numParts: string (nullable = true)
| | | | |-- spark.sql.sources.schema.part.0: string (nullable = true)
| | | | |-- spark.sql.sources.schema.part.1: string (nullable = true)
| | | | |-- spark.sql.sources.schema.partCol.0: string (nullable = true)
| | | | |-- spark.sql.sources.schema.partCol.1: string (nullable = true)
| | | | |-- typeOfData: string (nullable = true)
| | | |-- SerdeInfo: struct (nullable = true)
| | | | |-- Parameters: struct (nullable = true)
| | | | | |-- serialization.format: string (nullable = true)
| | | | |-- SerializationLibrary: string (nullable = true)
| | | |-- SortColumns: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- StoredAsSubDirectories: boolean (nullable = true)
| | |-- TableType: string (nullable = true)
| | |-- UpdateTime: string (nullable = true)
But only one column containing the whole JSON is created in df:
+--------------------+
| TableList|
+--------------------+
|[[903342277921, 2...|
+--------------------+
Is there a way to programmatically (and dynamically) create the dataframe in the same way that is referenced in printSchema?
Thanks in advance!
You can use the explode() function to turn the elements of an array into separate rows:
from pyspark.sql.functions import explode

df = df.select(explode(df["TableList"]).alias("col")).select("col.*")
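Each Glue table is now its own row, and nested structs come out with dotted paths, e.g. (a sketch assuming the schema shown above):
# Pull a few nested fields out of the flattened table rows.
df.select("DatabaseName", "Name", "StorageDescriptor.Location").show(truncate=False)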

Request a JSON with pyspark

I'm trying to parse a JSON file (from the Google Maps API) with a complex structure to get all lat and lng values. Please find the JSON schema here:
root
|-- geocoded_waypoints: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- geocoder_status: string (nullable = true)
| | |-- place_id: string (nullable = true)
| | |-- types: array (nullable = true)
| | | |-- element: string (containsNull = true)
|-- routes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- bounds: struct (nullable = true)
| | | |-- northeast: struct (nullable = true)
| | | | |-- lat: double (nullable = true)
| | | | |-- lng: double (nullable = true)
| | | |-- southwest: struct (nullable = true)
| | | | |-- lat: double (nullable = true)
| | | | |-- lng: double (nullable = true)
| | |-- copyrights: string (nullable = true)
| | |-- legs: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- distance: struct (nullable = true)
| | | | | |-- text: string (nullable = true)
| | | | | |-- value: long (nullable = true)
| | | | |-- duration: struct (nullable = true)
| | | | | |-- text: string (nullable = true)
| | | | | |-- value: long (nullable = true)
| | | | |-- end_address: string (nullable = true)
| | | | |-- end_location: struct (nullable = true)
| | | | | |-- lat: double (nullable = true)
| | | | | |-- lng: double (nullable = true)
| | | | |-- start_address: string (nullable = true)
| | | | |-- start_location: struct (nullable = true)
| | | | | |-- lat: double (nullable = true)
| | | | | |-- lng: double (nullable = true)
| | | | |-- steps: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- distance: struct (nullable = true)
| | | | | | | |-- text: string (nullable = true)
| | | | | | | |-- value: long (nullable = true)
| | | | | | |-- duration: struct (nullable = true)
| | | | | | | |-- text: string (nullable = true)
| | | | | | | |-- value: long (nullable = true)
| | | | | | |-- end_location: struct (nullable = true)
| | | | | | | |-- lat: double (nullable = true)
| | | | | | | |-- lng: double (nullable = true)
| | | | | | |-- html_instructions: string (nullable = true)
| | | | | | |-- maneuver: string (nullable = true)
| | | | | | |-- polyline: struct (nullable = true)
| | | | | | | |-- points: string (nullable = true)
| | | | | | |-- start_location: struct (nullable = true)
| | | | | | | |-- lat: double (nullable = true)
| | | | | | | |-- lng: double (nullable = true)
| | | | | | |-- travel_mode: string (nullable = true)
| | | | |-- traffic_speed_entry: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
| | | | |-- via_waypoint: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- location: struct (nullable = true)
| | | | | | | |-- lat: double (nullable = true)
| | | | | | | |-- lng: double (nullable = true)
| | | | | | |-- step_index: long (nullable = true)
| | | | | | |-- step_interpolation: double (nullable = true)
| | |-- overview_polyline: struct (nullable = true)
| | | |-- points: string (nullable = true)
| | |-- summary: string (nullable = true)
| | |-- warnings: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- waypoint_order: array (nullable = true)
| | | |-- element: string (containsNull = true)
|-- status: string (nullable = true)
Here is my function to get lat and lng datas :
def getTraceGps(json_file, spark):
    # Read the route file
    sqlContext = SQLContext(spark)
    df = sqlContext.read.json(json_file, multiLine=True)
    df.printSchema()
    df.createOrReplaceTempView("Maps")
    df.select(df["routes.bounds.northeast.lat"], df["routes.bounds.northeast.lng"]).show()  # IT WORKS
    results = df.select(df["routes.legs.steps.end_location.lat"], df["routes.legs.steps.end_location.lng"])  # WRONG
    results.show()
Here is the LOG :
py4j.protocol.Py4JJavaError: An error occurred while calling o53.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`routes`.`legs`['steps']' due to data type mismatch: argument 2 requires integral type, however, ''steps'' is of string type.;;
'Project [routes#1.legs[steps].end_location.lat AS lat#19, routes#1.legs[steps].end_location.lng AS lng#20]
+- AnalysisBarrier
+- Relation[geocoded_waypoints#0,routes#1,status#2] json
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
... (remaining Catalyst and Py4J stack frames omitted)
I don't understand why the first df.select works but not the second one. Maybe it's because steps contains several objects. I tried a lot of queries before, but they were all wrong.
Where does the problem come from?
Thank you in advance.
The error message is kind of cryptic, but notice that legs is an array type. Because it's an array, you must choose a specific element using bracket notation (like legs[1]).
I haven't seen IntegralType in any documentation, but it is part of the spark.sql internals: it's an abstract type covering the integer-like types (byte, short, int, long), which is the analyzer's way of saying an array index must be an integer. See
https://github.com/apache/spark/blob/cba69aeb453d2489830f3e6e0473a64dee81989e/sql/catalyst/src/main/scala/org/apache/spark/sql/types/AbstractDataType.scala
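To collect every step's coordinates rather than indexing a single element, each array level can be exploded before descending into it; a sketch against the schema above:
from pyspark.sql import functions as F

# Explode routes -> legs -> steps, then select the coordinates.
steps = (df
    .select(F.explode("routes").alias("route"))
    .select(F.explode("route.legs").alias("leg"))
    .select(F.explode("leg.steps").alias("step")))
steps.select("step.end_location.lat", "step.end_location.lng").show()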

Json file to pyspark dataframe

I'm trying to work with a JSON file in a Spark (PySpark) environment.
Problem: unable to convert the JSON to the expected format in a PySpark dataframe.
1st Input data set:
https://health.data.ny.gov/api/views/cnih-y5dw/rows.json
In this file, metadata is defined at the start of the file under the tag "meta", followed by the data under the tag "data".
FYI, the steps taken to get the data from the web: I downloaded the file to my local drive, then pushed it to HDFS, and from there I'm reading it into the Spark environment.
df = sqlContext.read.json("/user/train/ny.json", multiLine=True)
df.count()
out[5]: 1
df.show()
df.printSchema()
root
|-- data: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- meta: struct (nullable = true)
| |-- view: struct (nullable = true)
| | |-- attribution: string (nullable = true)
| | |-- attributionLink: string (nullable = true)
| | |-- averageRating: long (nullable = true)
| | |-- category: string (nullable = true)
| | |-- columns: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- cachedContents: struct (nullable = true)
| | | | | |-- average: string (nullable = true)
| | | | | |-- largest: string (nullable = true)
| | | | | |-- non_null: long (nullable = true)
| | | | | |-- null: long (nullable = true)
| | | | | |-- smallest: string (nullable = true)
| | | | | |-- sum: string (nullable = true)
| | | | | |-- top: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- count: long (nullable = true)
| | | | | | | |-- item: string (nullable = true)
| | | | |-- dataTypeName: string (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- fieldName: string (nullable = true)
| | | | |-- flags: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
| | | | |-- format: struct (nullable = true)
| | | | | |-- align: string (nullable = true)
| | | | | |-- mask: string (nullable = true)
| | | | | |-- noCommas: string (nullable = true)
| | | | | |-- precisionStyle: string (nullable = true)
| | | | |-- id: long (nullable = true)
| | | | |-- name: string (nullable = true)
| | | | |-- position: long (nullable = true)
| | | | |-- renderTypeName: string (nullable = true)
| | | | |-- tableColumnId: long (nullable = true)
| | | | |-- width: long (nullable = true)
| | |-- createdAt: long (nullable = true)
| | |-- description: string (nullable = true)
| | |-- displayType: string (nullable = true)
| | |-- downloadCount: long (nullable = true)
| | |-- flags: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- grants: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- flags: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
| | | | |-- inherited: boolean (nullable = true)
| | | | |-- type: string (nullable = true)
| | |-- hideFromCatalog: boolean (nullable = true)
| | |-- hideFromDataJson: boolean (nullable = true)
| | |-- id: string (nullable = true)
| | |-- indexUpdatedAt: long (nullable = true)
| | |-- metadata: struct (nullable = true)
| | | |-- attachments: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- assetId: string (nullable = true)
| | | | | |-- blobId: string (nullable = true)
| | | | | |-- filename: string (nullable = true)
| | | | | |-- name: string (nullable = true)
| | | |-- availableDisplayTypes: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- custom_fields: struct (nullable = true)
| | | | |-- Additional Resources: struct (nullable = true)
| | | | | |-- See Also: string (nullable = true)
| | | | |-- Dataset Information: struct (nullable = true)
| | | | | |-- Agency: string (nullable = true)
| | | | |-- Dataset Summary: struct (nullable = true)
| | | | | |-- Contact Information: string (nullable = true)
| | | | | |-- Coverage: string (nullable = true)
| | | | | |-- Data Frequency: string (nullable = true)
| | | | | |-- Dataset Owner: string (nullable = true)
| | | | | |-- Granularity: string (nullable = true)
| | | | | |-- Organization: string (nullable = true)
| | | | | |-- Posting Frequency: string (nullable = true)
| | | | | |-- Time Period: string (nullable = true)
| | | | | |-- Units: string (nullable = true)
| | | | |-- Disclaimers: struct (nullable = true)
| | | | | |-- Limitations: string (nullable = true)
| | | | |-- Local Data: struct (nullable = true)
| | | | | |-- County Filter: string (nullable = true)
| | | | | |-- County_Column: string (nullable = true)
| | | |-- filterCondition: struct (nullable = true)
| | | | |-- children: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- metadata: struct (nullable = true)
| | | | | | | |-- includeAuto: long (nullable = true)
| | | | | | | |-- multiSelect: boolean (nullable = true)
| | | | | | | |-- operator: string (nullable = true)
| | | | | | | |-- tableColumnId: struct (nullable = true)
| | | | | | | | |-- 583607: long (nullable = true)
| | | | | | |-- type: string (nullable = true)
| | | | | | |-- value: string (nullable = true)
| | | | |-- metadata: struct (nullable = true)
| | | | | |-- advanced: boolean (nullable = true)
| | | | | |-- unifiedVersion: long (nullable = true)
| | | | |-- type: string (nullable = true)
| | | | |-- value: string (nullable = true)
| | | |-- jsonQuery: struct (nullable = true)
| | | | |-- order: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- ascending: boolean (nullable = true)
| | | | | | |-- columnFieldName: string (nullable = true)
| | | |-- rdfSubject: string (nullable = true)
| | | |-- renderTypeConfig: struct (nullable = true)
| | | | |-- visible: struct (nullable = true)
| | | | | |-- table: boolean (nullable = true)
| | | |-- rowLabel: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- newBackend: boolean (nullable = true)
| | |-- numberOfComments: long (nullable = true)
| | |-- oid: long (nullable = true)
| | |-- owner: struct (nullable = true)
| | | |-- displayName: string (nullable = true)
| | | |-- flags: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- id: string (nullable = true)
| | | |-- profileImageUrlLarge: string (nullable = true)
| | | |-- profileImageUrlMedium: string (nullable = true)
| | | |-- profileImageUrlSmall: string (nullable = true)
| | | |-- screenName: string (nullable = true)
| | | |-- type: string (nullable = true)
| | |-- provenance: string (nullable = true)
| | |-- publicationAppendEnabled: boolean (nullable = true)
| | |-- publicationDate: long (nullable = true)
| | |-- publicationGroup: long (nullable = true)
| | |-- publicationStage: string (nullable = true)
| | |-- query: struct (nullable = true)
| | | |-- orderBys: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- ascending: boolean (nullable = true)
| | | | | |-- expression: struct (nullable = true)
| | | | | | |-- columnId: long (nullable = true)
| | | | | | |-- type: string (nullable = true)
| | |-- rights: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- rowsUpdatedAt: long (nullable = true)
| | |-- rowsUpdatedBy: string (nullable = true)
| | |-- tableAuthor: struct (nullable = true)
| | | |-- displayName: string (nullable = true)
| | | |-- flags: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- id: string (nullable = true)
| | | |-- profileImageUrlLarge: string (nullable = true)
| | | |-- profileImageUrlMedium: string (nullable = true)
| | | |-- profileImageUrlSmall: string (nullable = true)
| | | |-- screenName: string (nullable = true)
| | | |-- type: string (nullable = true)
| | |-- tableId: long (nullable = true)
| | |-- tags: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- totalTimesRated: long (nullable = true)
| | |-- viewCount: long (nullable = true)
| | |-- viewLastModified: long (nullable = true)
| | |-- viewType: string (nullable = true)
Problem: all records get wrapped up in a single row with two columns, meta and data. Also, Spark's native JSON reader infers the schema (the metadata) automatically, so my expectation is that the metadata shouldn't appear explicitly as a separate column in the dataframe.
Expected Output
The JSON data set has the following list of columns. They should appear in tabular format in the dataframe, where I can query them:
FACILITY, ADDRESS, LAST INSPECTED, VIOLATIONS,TOTAL CRITICAL VIOLATIONS, TOTAL CRIT. NOT CORRECTED, TOTAL NONCRITICAL VIOLATIONS, DESCRIPTION, LOCAL HEALTH DEPARTMENT, COUNTY, FACILITY ADDRESS, CITY, ZIP CODE, NYSDOH GAZETTEER (1980), MUNICIPALITY, OPERATION NAME, PERMIT EXPIRATION DATE, PERMITTED (D/B/A), PERMITTED CORP. NAME,PERM. OPERATOR LAST NAME, PERM. OPERATOR LAST NAME, PERM. OPERATOR FIRST NAME, NYS HEALTH OPERATION ID, INSPECTION TYPE, INSPECTION COMMENTS, FOOD SERVICE FACILITY STATE, Location1
2nd Input DataSet:
On the site, it's the first data set, about projects funded by the World Bank:
http://jsonstudio.com/resources/
It works all fine.
df = sqlContext.read.json("/user/train/wb.json")
df.count()
500
The 2nd input data set works fine, but the 1st one does not. My observation is that the metadata is defined differently in the two JSON files: in the 1st, the metadata is defined first and then the data, whereas in the 2nd file the metadata is available alongside the data on every line.
Can you please guide me on the 1st input JSON file's format, and on how to handle this situation while converting it into a PySpark dataframe?
Updated outcome: after initial analysis we found that the format seems wrong, but a community member helped with an alternative way to read it. Marking the answer as accepted and closing this thread.
Let me know if you need any further details, thanks in advance.
Check out my notebook on databricks
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/753971180331031/2636709537891264/8469274372247319/latest.html
The first dataset is corrupt, i.e. it's not valid line-delimited JSON, so Spark can't read it directly (this was with Spark 2.2.1).
This is especially confusing because of the way this JSON file is organized: the data is stored as a list of lists
df = spark.read.json("rows.json", multiLine=True)
data = df.select("data").collect()[0]["data"]
And the column names are stored separately
column_names = list(map(lambda x: x["fieldName"], df.select("meta").collect()[0][0][0][4]))
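From there, the two pieces can be stitched back into a queryable dataframe; a sketch (the fieldName list includes Socrata's internal meta columns such as :sid, which you may want to drop first, and type inference over mixed rows may need coaxing):
# Build a tabular dataframe from the raw rows and the recovered column names.
rows_df = spark.createDataFrame(data, schema=column_names)
rows_df.show(5)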
