I have a pyspark dataframe created from XML. Because of the way the XML is structured, I have an extra, unnecessary level of nesting in the schema of the dataframe.
The schema of my current dataframe:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- movies: struct (nullable = true)
| | | |-- movie: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- b: string (nullable = true)
| | | | | |-- c: string (nullable = true)
| | | | | |-- d: integer (nullable = true)
| | | | | |-- e: string (nullable = true)
| | |-- f: string (nullable = true)
| | |-- g: string (nullable = true)
I'm trying to replace the movies struct with the movie array underneath it as follows:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- movies: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- b: string (nullable = true)
| | | | |-- c: string (nullable = true)
| | | | |-- d: integer (nullable = true)
| | | | |-- e: string (nullable = true)
| | |-- f: string (nullable = true)
| | |-- g: string (nullable = true)
The closest I've gotten was using:
from pyspark.sql import functions as F
df.withColumn("a", F.transform('a', lambda x: x.withField("movies_new", F.col("a.movies.movie"))))
which results in the following schema:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- movies: struct (nullable = true)
| | | |-- movie: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- b: string (nullable = true)
| | | | | |-- c: string (nullable = true)
| | | | | |-- d: integer (nullable = true)
| | | | | |-- e: string (nullable = true)
| | |-- f: string (nullable = true)
| | |-- g: string (nullable = true)
| | |-- movies_new: array (nullable = true)
| | | |-- element: array (containsNull = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- b: string (nullable = true)
| | | | | |-- c: string (nullable = true)
| | | | | |-- d: integer (nullable = true)
| | | | | |-- e: string (nullable = true)
I understand why this is happening, but I thought that if I never extracted the nested array out of 'a', it might not end up as an array of arrays.
Any suggestions?
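One way to keep the transform-based attempt from producing an array of arrays (a minimal sketch, assuming Spark 3.1+, where F.transform and Column.withField are available) is to reference the lambda element x itself rather than F.col("a.movies.movie"), so the lookup stays scoped to the current array element:
from pyspark.sql import functions as F

# Replace the "movies" struct in each element of "a" with the "movie" array it wraps.
# x is the current element of "a"; x["movies"]["movie"] is resolved per element,
# so no extra array level is introduced.
df = df.withColumn(
    "a",
    F.transform("a", lambda x: x.withField("movies", x["movies"]["movie"]))
)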
The logic is:
Explode array "a".
Recompute new struct as (movies.movie, f, g)
Collect "a" back as array.
df = df.withColumn("a", F.explode("a"))
df = df.withColumn("a", F.struct( \
df.a.movies.getField("movie").alias("movies"), \
df.a.f.alias("f"), \
df.a.g.alias("g")))
df = df.select(F.collect_list("a").alias("a"))
The full working code:
import pyspark.sql.functions as F
df = spark.createDataFrame(data=[
[[(([("b1", "c1", "d1", "e1")],), "f1", "g1")]]
], schema="a array<struct<movies struct<movie array<struct<b string, c string, d string, e string>>>, f string, g string>>")
df.printSchema()
# df.show(truncate=False)
df = df.withColumn("a", F.explode("a"))
df = df.withColumn("a", F.struct( \
df.a.movies.getField("movie").alias("movies"), \
df.a.f.alias("f"), \
df.a.g.alias("g")))
df = df.select(F.collect_list("a").alias("a"))
df.printSchema()
# df.show(truncate=False)
Output schema before:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- movies: struct (nullable = true)
| | | |-- movie: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- b: string (nullable = true)
| | | | | |-- c: string (nullable = true)
| | | | | |-- d: string (nullable = true)
| | | | | |-- e: string (nullable = true)
| | |-- f: string (nullable = true)
| | |-- g: string (nullable = true)
Output schema after:
root
|-- a: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- movies: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- b: string (nullable = true)
| | | | |-- c: string (nullable = true)
| | | | |-- d: string (nullable = true)
| | | | |-- e: string (nullable = true)
| | |-- f: string (nullable = true)
| | |-- g: string (nullable = true)
This is the schema of the JSON file that I've loaded into the dataframe. I want to filter out the rows where this column doesn't contain any elements.
root
|-- children: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- children: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- children: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- created_date: string (nullable = true)
| | | | | | |-- description: string (nullable = true)
| | | | | | |-- id: long (nullable = true)
| | | | | | |-- last_modified_date: string (nullable = true)
| | | | | | |-- links: array (nullable = true)
| | | | | | | |-- element: struct (containsNull = true)
| | | | | | | | |-- href: string (nullable = true)
| | | | | | | | |-- rel: string (nullable = true)
| | | | | | |-- name: string (nullable = true)
| | | | | | |-- order: long (nullable = true)
| | | | | | |-- parent_id: long (nullable = true)
| | | | | | |-- pid: string (nullable = true)
| | | | | | |-- recursive: boolean (nullable = true)
| | | | | | |-- shared: boolean (nullable = true)
| | | | |-- created_date: string (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- id: long (nullable = true)
| | | | |-- last_modified_date: string (nullable = true)
| | | | |-- links: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- href: string (nullable = true)
| | | | | | |-- rel: string (nullable = true)
| | | | |-- name: string (nullable = true)
| | | | |-- order: long (nullable = true)
| | | | |-- parent_id: long (nullable = true)
| | | | |-- pid: string (nullable = true)
| | | | |-- recursive: boolean (nullable = true)
| | | | |-- shared: boolean (nullable = true)
| | |-- created_date: string (nullable = true)
| | |-- description: string (nullable = true)
| | |-- id: long (nullable = true)
| | |-- last_modified_date: string (nullable = true)
| | |-- links: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- href: string (nullable = true)
| | | | |-- rel: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- order: long (nullable = true)
| | |-- parent_id: long (nullable = true)
| | |-- pid: string (nullable = true)
| | |-- recursive: boolean (nullable = true)
| | |-- shared: boolean (nullable = true)
|-- created_date: string (nullable = true)
|-- description: string (nullable = true)
|-- id: long (nullable = true)
|-- last_modified_date: string (nullable = true)
|-- links: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- href: string (nullable = true)
| | |-- rel: string (nullable = true)
|-- name: string (nullable = true)
|-- order: long (nullable = true)
|-- parent_id: long (nullable = true)
|-- pid: string (nullable = true)
|-- project_id: long (nullable = true)
|-- recursive: boolean (nullable = true)
I tried accessing the data column using
df.filter("children = array('')").show()
but I got the following error:
cannot resolve '(children = array(''))' due to data type mismatch: differing types in '(children = array(''))' (array<struct<children:array<struct<children:array<struct<created_date:string,description:string,id:bigint,last_modified_date:string,links:array<struct<href:string,rel:string>>,name:string,order:bigint,parent_id:bigint,pid:string,recursive:boolean,shared:boolean>>,created_date:string,description:string,id:bigint,last_modified_date:string,links:array<struct<href:string,rel:string>>,name:string,order:bigint,parent_id:bigint,pid:string,recursive:boolean,shared:boolean>>,created_date:string,description:string,id:bigint,last_modified_date:string,links:array<struct<href:string,rel:string>>,name:string,order:bigint,parent_id:bigint,pid:string,recursive:boolean,shared:boolean>> and array<string>).;
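One way to express the filter without comparing against a literal array (a minimal sketch, assuming the schema above) is to test the size of the array instead:
from pyspark.sql import functions as F

# Rows whose "children" array has at least one element
# (with default settings F.size returns -1 for a null array, so nulls are excluded too).
df_nonempty = df.filter(F.size("children") > 0)

# Rows whose "children" array is empty or null.
df_empty = df.filter(F.col("children").isNull() | (F.size("children") == 0))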
I'm trying to transform the JSON output of the aws glue get-tables command into a PySpark dataframe.
After reading the json output with this command:
df = spark.read.option("inferSchema", "true") \
.option("multiline", "true") \
.json("tmp/my_json.json")
I get the following from printSchema:
root
|-- TableList: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- CatalogId: string (nullable = true)
| | |-- CreateTime: string (nullable = true)
| | |-- CreatedBy: string (nullable = true)
| | |-- DatabaseName: string (nullable = true)
| | |-- IsRegisteredWithLakeFormation: boolean (nullable = true)
| | |-- LastAccessTime: string (nullable = true)
| | |-- Name: string (nullable = true)
| | |-- Owner: string (nullable = true)
| | |-- Parameters: struct (nullable = true)
| | | |-- CrawlerSchemaDeserializerVersion: string (nullable = true)
| | | |-- CrawlerSchemaSerializerVersion: string (nullable = true)
| | | |-- UPDATED_BY_CRAWLER: string (nullable = true)
| | | |-- averageRecordSize: string (nullable = true)
| | | |-- classification: string (nullable = true)
| | | |-- compressionType: string (nullable = true)
| | | |-- objectCount: string (nullable = true)
| | | |-- recordCount: string (nullable = true)
| | | |-- sizeKey: string (nullable = true)
| | | |-- spark.sql.create.version: string (nullable = true)
| | | |-- spark.sql.sources.schema.numPartCols: string (nullable = true)
| | | |-- spark.sql.sources.schema.numParts: string (nullable = true)
| | | |-- spark.sql.sources.schema.part.0: string (nullable = true)
| | | |-- spark.sql.sources.schema.part.1: string (nullable = true)
| | | |-- spark.sql.sources.schema.partCol.0: string (nullable = true)
| | | |-- spark.sql.sources.schema.partCol.1: string (nullable = true)
| | | |-- typeOfData: string (nullable = true)
| | |-- PartitionKeys: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Name: string (nullable = true)
| | | | |-- Type: string (nullable = true)
| | |-- Retention: long (nullable = true)
| | |-- StorageDescriptor: struct (nullable = true)
| | | |-- BucketColumns: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- Columns: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- Name: string (nullable = true)
| | | | | |-- Type: string (nullable = true)
| | | |-- Compressed: boolean (nullable = true)
| | | |-- InputFormat: string (nullable = true)
| | | |-- Location: string (nullable = true)
| | | |-- NumberOfBuckets: long (nullable = true)
| | | |-- OutputFormat: string (nullable = true)
| | | |-- Parameters: struct (nullable = true)
| | | | |-- CrawlerSchemaDeserializerVersion: string (nullable = true)
| | | | |-- CrawlerSchemaSerializerVersion: string (nullable = true)
| | | | |-- UPDATED_BY_CRAWLER: string (nullable = true)
| | | | |-- averageRecordSize: string (nullable = true)
| | | | |-- classification: string (nullable = true)
| | | | |-- compressionType: string (nullable = true)
| | | | |-- objectCount: string (nullable = true)
| | | | |-- recordCount: string (nullable = true)
| | | | |-- sizeKey: string (nullable = true)
| | | | |-- spark.sql.create.version: string (nullable = true)
| | | | |-- spark.sql.sources.schema.numPartCols: string (nullable = true)
| | | | |-- spark.sql.sources.schema.numParts: string (nullable = true)
| | | | |-- spark.sql.sources.schema.part.0: string (nullable = true)
| | | | |-- spark.sql.sources.schema.part.1: string (nullable = true)
| | | | |-- spark.sql.sources.schema.partCol.0: string (nullable = true)
| | | | |-- spark.sql.sources.schema.partCol.1: string (nullable = true)
| | | | |-- typeOfData: string (nullable = true)
| | | |-- SerdeInfo: struct (nullable = true)
| | | | |-- Parameters: struct (nullable = true)
| | | | | |-- serialization.format: string (nullable = true)
| | | | |-- SerializationLibrary: string (nullable = true)
| | | |-- SortColumns: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- StoredAsSubDirectories: boolean (nullable = true)
| | |-- TableType: string (nullable = true)
| | |-- UpdateTime: string (nullable = true)
But df ends up with just one column containing the whole JSON:
+--------------------+
| TableList|
+--------------------+
|[[903342277921, 2...|
+--------------------+
Is there a way to programmatically (and dynamically) create the dataframe with the same structure that printSchema shows?
Thanks in advance!
You can use the explode() function to turn the elements of an array into separate rows, then expand the resulting struct with 'col.*':
from pyspark.sql.functions import explode

df = df.select(explode(df['TableList']).alias('col')).select('col.*')
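Since the question also asks for something programmatic and dynamic, here is a rough sketch (not specific to this schema) that keeps expanding struct columns into prefixed top-level columns until none remain; arrays other than TableList are left as-is and can be exploded separately where needed:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType

def flatten_structs(df):
    # Repeatedly expand one level of struct columns into prefixed top-level columns.
    # Field names are backtick-quoted because some Glue parameter keys contain dots.
    while True:
        struct_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StructType)]
        if not struct_cols:
            return df
        kept = [F.col(f"`{c}`") for c in df.columns if c not in struct_cols]
        expanded = [
            F.col(f"`{sc}`.`{nested.name}`").alias(f"{sc}_{nested.name}".replace(".", "_"))
            for sc in struct_cols
            for nested in df.schema[sc].dataType.fields
        ]
        df = df.select(kept + expanded)

# Example usage: one row per table, then flatten the nested structs.
tables = flatten_structs(df.select(F.explode("TableList").alias("col")).select("col.*"))
tables.printSchema()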
I have a DataFrame with the below schema:
|-- nlucontexttrail: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- agentid: string (nullable = true)
| | |-- intent: struct (nullable = true)
| | | |-- confidence: double (nullable = true)
| | | |-- entities: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- id: string (nullable = true)
| | | | | |-- values: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- literal: string (nullable = true)
| | | | | | | |-- value: string (nullable = true)
| | | |-- intentname: string (nullable = true)
| | | |-- name: string (nullable = true)
| | |-- intentcandidates: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- confidence: double (nullable = true)
| | | | |-- entities: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- id: string (nullable = true)
| | | | | | |-- values: array (nullable = true)
| | | | | | | |-- element: struct (containsNull = true)
| | | | | | | | |-- literal: string (nullable = true)
| | | | | | | | |-- value: string (nullable = true)
| | | | |-- intentname: string (nullable = true)
| | | | |-- name: string (nullable = true)
| | |-- modelid: string (nullable = true)
| | |-- modelversion: long (nullable = true)
| | |-- nlusessionid: string (nullable = true)
| | |-- usednluengine: string (nullable = true)
| | |-- usednluengine: string (nullable = true)
As you can see from the highlighted duplicate columns ("usednluengine"), one of them has the value 'None' and the other has the expected value. I want to delete the column that has the 'None' value. I am sharing the data below as well, please go through it.
[{"agentid":"dispatcher","intent":{"confidence":0.8822699,"entities":[{"id":"duration","values":[{"literal":"2 Sekunden","value":"PT2S"}]},{"id":"date","values":[{"literal":"eins","value":"T23:00:00Z"},{"literal":"eins","value":"T23:00:00Z"}]},{"id":"number","values":[{"literal":"eins","value":"1"},{"literal":"2","value":"2"},{"literal":"eins","value":"1"}]},{"id":"station","values":[{"literal":"eins","value":"eins"},{"literal":"eins","value":"eins"}]},{"id":"number_values","values":[{"literal":"eins","value":"1"},{"literal":"eins","value":"1"}]},{"id":"percentage_values","values":[{"literal":"höchsten","value":"100"}]}],"intentname":null,"name":"TV"},"intentcandidates":[{"confidence":0.8822699,"entities":[{"id":"duration","values":[{"literal":"2 Sekunden","value":"PT2S"}]},{"id":"date","values":[{"literal":"eins","value":"T23:00:00Z"},{"literal":"eins","value":"T23:00:00Z"}]},{"id":"number","values":[{"literal":"eins","value":"1"},{"literal":"2","value":"2"},{"literal":"eins","value":"1"}]},{"id":"station","values":[{"literal":"eins","value":"eins"},{"literal":"eins","value":"eins"}]},{"id":"number_values","values":[{"literal":"eins","value":"1"},{"literal":"eins","value":"1"}]},{"id":"percentage_values","values":[{"literal":"höchsten","value":"100"}]}],"intentname":null,"name":"TV"}],"modelid":"SVH_STAGING__DISPATCHER","modelversion":13,"nlusessionid":null,"usednluengine":"luis"},{"agentid":"dispatcher","intent":{"confidence":0.140685484,"entities":[{"id":"duration","values":[{"literal":"2 Sekunden","value":"PT2S"}]},{"id":"date","values":[{"literal":"eins","value":"T23:00:00Z"},{"literal":"eins","value":"T23:00:00Z"}]},{"id":"number","values":[{"literal":"eins","value":"1"},{"literal":"2","value":"2"},{"literal":"eins","value":"1"}]},{"id":"number_values","values":[{"literal":"eins","value":"1"},{"literal":"eins","value":"1"}]},{"id":"percentage_values","values":[{"literal":"höchsten","value":"100"}]}],"intentname":null,"name":"TV__SWITCH_CHANNEL"},"intentcandidates":[{"confidence":0.140685484,"entities":[{"id":"duration","values":[{"literal":"2 Sekunden","value":"PT2S"}]},{"id":"date","values":[{"literal":"eins","value":"T23:00:00Z"},{"literal":"eins","value":"T23:00:00Z"}]},{"id":"number","values":[{"literal":"eins","value":"1"},{"literal":"2","value":"2"},{"literal":"eins","value":"1"}]},{"id":"number_values","values":[{"literal":"eins","value":"1"},{"literal":"eins","value":"1"}]},{"id":"percentage_values","values":[{"literal":"höchsten","value":"100"}]}],"intentname":null,"name":"TV__SWITCH_CHANNEL"}],"modelid":"SVH_STAGING__TV","modelversion":13,"nlusessionid":null,"usednluengine":"luis"}]
You can paste the above data into the link below to see it in a proper format:
http://jsonviewer.stack.hu/
Note that the duplicate column whose value is 'None' is not visible in the data, but it does show up in df.printSchema(). I want to delete all duplicate columns/nested columns (the ones inside structs) from the schema and keep the column that has a value; in other words, no change to the data, only to the schema.
I hope that makes my question clear. If not, please comment below for further discussion.
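As a first step, a quick sketch (assuming the dataframe is called df) to confirm that the duplicate really lives in the schema of the array elements rather than in the data:
from collections import Counter

# Inspect the struct type of the nlucontexttrail elements and report duplicated field names.
elem_type = df.schema["nlucontexttrail"].dataType.elementType
counts = Counter(field.name for field in elem_type.fields)
print([name for name, n in counts.items() if n > 1])  # expected here: ['usednluengine']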
I'm trying to parse a JSON file (from the Google Maps API) with a complex structure to get all the lat and lng values. Here is the JSON schema:
root
|-- geocoded_waypoints: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- geocoder_status: string (nullable = true)
| | |-- place_id: string (nullable = true)
| | |-- types: array (nullable = true)
| | | |-- element: string (containsNull = true)
|-- routes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- bounds: struct (nullable = true)
| | | |-- northeast: struct (nullable = true)
| | | | |-- lat: double (nullable = true)
| | | | |-- lng: double (nullable = true)
| | | |-- southwest: struct (nullable = true)
| | | | |-- lat: double (nullable = true)
| | | | |-- lng: double (nullable = true)
| | |-- copyrights: string (nullable = true)
| | |-- legs: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- distance: struct (nullable = true)
| | | | | |-- text: string (nullable = true)
| | | | | |-- value: long (nullable = true)
| | | | |-- duration: struct (nullable = true)
| | | | | |-- text: string (nullable = true)
| | | | | |-- value: long (nullable = true)
| | | | |-- end_address: string (nullable = true)
| | | | |-- end_location: struct (nullable = true)
| | | | | |-- lat: double (nullable = true)
| | | | | |-- lng: double (nullable = true)
| | | | |-- start_address: string (nullable = true)
| | | | |-- start_location: struct (nullable = true)
| | | | | |-- lat: double (nullable = true)
| | | | | |-- lng: double (nullable = true)
| | | | |-- steps: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- distance: struct (nullable = true)
| | | | | | | |-- text: string (nullable = true)
| | | | | | | |-- value: long (nullable = true)
| | | | | | |-- duration: struct (nullable = true)
| | | | | | | |-- text: string (nullable = true)
| | | | | | | |-- value: long (nullable = true)
| | | | | | |-- end_location: struct (nullable = true)
| | | | | | | |-- lat: double (nullable = true)
| | | | | | | |-- lng: double (nullable = true)
| | | | | | |-- html_instructions: string (nullable = true)
| | | | | | |-- maneuver: string (nullable = true)
| | | | | | |-- polyline: struct (nullable = true)
| | | | | | | |-- points: string (nullable = true)
| | | | | | |-- start_location: struct (nullable = true)
| | | | | | | |-- lat: double (nullable = true)
| | | | | | | |-- lng: double (nullable = true)
| | | | | | |-- travel_mode: string (nullable = true)
| | | | |-- traffic_speed_entry: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
| | | | |-- via_waypoint: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- location: struct (nullable = true)
| | | | | | | |-- lat: double (nullable = true)
| | | | | | | |-- lng: double (nullable = true)
| | | | | | |-- step_index: long (nullable = true)
| | | | | | |-- step_interpolation: double (nullable = true)
| | |-- overview_polyline: struct (nullable = true)
| | | |-- points: string (nullable = true)
| | |-- summary: string (nullable = true)
| | |-- warnings: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- waypoint_order: array (nullable = true)
| | | |-- element: string (containsNull = true)
|-- status: string (nullable = true)
Here is my function to get the lat and lng data:
from pyspark.sql import SQLContext

def getTraceGps(json_file, spark):
    # Read the route file
    sqlContext = SQLContext(spark)
    df = sqlContext.read.json(json_file, multiLine=True)
    df.printSchema()
    df.createOrReplaceTempView("Maps")
    df.select(df["routes.bounds.northeast.lat"], df["routes.bounds.northeast.lng"]).show()  # THIS WORKS
    results = df.select(df["routes.legs.steps.end_location.lat"], df["routes.legs.steps.end_location.lng"])  # THIS FAILS
    results.show()
Here is the LOG :
py4j.protocol.Py4JJavaError: An error occurred while calling o53.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`routes`.`legs`['steps']' due to data type mismatch: argument 2 requires integral type, however, ''steps'' is of string type.;;
'Project [routes#1.legs[steps].end_location.lat AS lat#19, routes#1.legs[steps].end_location.lng AS lng#20]
+- AnalysisBarrier
+- Relation[geocoded_waypoints#0,routes#1,status#2] json
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:120)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:120)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:125)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:125)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:95)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3295)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1307)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
I don't understand why the first df.select works but not the second one.
Maybe it's because steps contains several objects.
I tried a lot of other queries before this, but none of them worked.
Where does the problem come from?
Thank you in advance.
The error message is kind of cryptic, but notice that legs is an array type. Because it's an array, you must choose a specific element using bracket notation (like legs[1]), or explode the array first.
I hadn't seen IntegralType in any documentation, but it is part of the spark.sql internals. It's the internal abstract type covering the integer-like types (byte, short, int, long), which is why an array index has to be an integer; see
https://github.com/apache/spark/blob/cba69aeb453d2489830f3e6e0473a64dee81989e/sql/catalyst/src/main/scala/org/apache/spark/sql/types/AbstractDataType.scala
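To collect every lat/lng rather than indexing a single element, one approach (a minimal sketch, assuming the schema shown in the question) is to explode each array level and then select the struct fields:
from pyspark.sql import functions as F

# Explode routes -> legs -> steps; after that, end_location can be addressed directly.
steps = (
    df.select(F.explode("routes").alias("route"))
      .select(F.explode("route.legs").alias("leg"))
      .select(F.explode("leg.steps").alias("step"))
)
steps.select("step.end_location.lat", "step.end_location.lng").show()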