Having a dataframe df in Spark:
|-- array_field: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: string (nullable = true)
| | |-- b: long (nullable = true)
| | |-- c: long (nullable = true)
How to rename field array_field.a to array_field.a_renamed?
[Update]:
.withColumnRenamed() does not work with nested fields, so I tried this hacky and unsafe method:
# First alter the schema:
schema = df.schema
schema['array_field'].dataType.elementType['a'].name = 'a_renamed'
ind = schema['array_field'].dataType.elementType.names.index('a')
schema['array_field'].dataType.elementType.names[ind] = 'a_renamed'
# Then set dataframe's schema with altered schema
df._schema = schema
I know that setting a private attribute is not good practice, but I don't know of another way to set the schema for df.
I think I am on the right track, but df.printSchema() still shows the old name for array_field.a, even though df.schema == schema is True.
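For reference, withColumnRenamed only matches top-level column names and silently no-ops otherwise, so the following leaves the schema unchanged (a quick sketch):
# No-op: "array_field.a" is not a top-level column name
df.withColumnRenamed("array_field.a", "array_field.a_renamed").printSchema()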
Python
It is not possible to modify a single nested field in place. You have to recreate the whole structure. In this particular case the simplest solution is to use cast.
First a bunch of imports:
from collections import namedtuple
from pyspark.sql.functions import col
from pyspark.sql.types import (
    ArrayType, LongType, StringType, StructField, StructType)
and example data:
Record = namedtuple("Record", ["a", "b", "c"])
df = sc.parallelize([([Record("foo", 1, 3)], )]).toDF(["array_field"])
Let's confirm that the schema is the same as in your case:
df.printSchema()
root
|-- array_field: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: string (nullable = true)
| | |-- b: long (nullable = true)
| | |-- c: long (nullable = true)
You can define a new schema for example as a string:
str_schema = "array<struct<a_renamed:string,b:bigint,c:bigint>>"
df.select(col("array_field").cast(str_schema)).printSchema()
root
|-- array_field: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a_renamed: string (nullable = true)
| | |-- b: long (nullable = true)
| | |-- c: long (nullable = true)
or a DataType:
struct_schema = ArrayType(StructType([
    StructField("a_renamed", StringType()),
    StructField("b", LongType()),
    StructField("c", LongType())
]))
df.select(col("array_field").cast(struct_schema)).printSchema()
root
|-- array_field: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a_renamed: string (nullable = true)
| | |-- b: long (nullable = true)
| | |-- c: long (nullable = true)
Scala
The same techniques can be used in Scala:
case class Record(a: String, b: Long, c: Long)
val df = Seq(Tuple1(Seq(Record("foo", 1, 3)))).toDF("array_field")
val strSchema = "array<struct<a_renamed:string,b:bigint,c:bigint>>"
df.select($"array_field".cast(strSchema))
or
import org.apache.spark.sql.types._
val structSchema = ArrayType(StructType(Seq(
  StructField("a_renamed", StringType),
  StructField("b", LongType),
  StructField("c", LongType)
)))
df.select($"array_field".cast(structSchema))
Possible improvements:
If you use an expressive data-manipulation or JSON-processing library, it can be easier to dump data types to a dict or a JSON string and take it from there. For example (Python / toolz):
from toolz.curried import pipe, assoc_in, update_in, map
from operator import attrgetter
# Update name to "a_updated" if name is "a"
rename_field = update_in(
    keys=["name"], func=lambda x: "a_updated" if x == "a" else x)
updated_schema = pipe(
    # Get schema of the field as a dict
    df.schema["array_field"].jsonValue(),
    # Update fields with rename
    update_in(
        keys=["type", "elementType", "fields"],
        func=lambda x: pipe(x, map(rename_field), list)),
    # Load schema from dict
    StructField.fromJson,
    # Get data type
    attrgetter("dataType"))
df.select(col("array_field").cast(updated_schema)).printSchema()
You can recurse over the data frame's schema to create a new schema with the required changes.
A schema in PySpark is a StructType, which holds a list of StructFields, and each StructField can hold either a primitive type or another StructType.
This means that we can decide if we want to recurse based on whether the type is a StructType or not.
Below is an annotated sample implementation that shows you how you can implement the above idea.
# Some imports
from copy import copy
from pyspark.sql import DataFrame
from pyspark.sql.types import DataType, StructField, StructType, ArrayType
# We take a dataframe and return a new one with the required changes
def cleanDataFrame(df: DataFrame) -> DataFrame:
    # Returns a new sanitized field name (this function can be anything really)
    def sanitizeFieldName(s: str) -> str:
        return s.replace("-", "_").replace("&", "_").replace("\"", "_")\
            .replace("[", "_").replace("]", "_").replace(".", "_")
    # We call this on all fields to create a copy and to perform any
    # changes we might want to do to the field.
    def sanitizeField(field: StructField) -> StructField:
        field = copy(field)
        field.name = sanitizeFieldName(field.name)
        # We recursively call cleanSchema on all types
        field.dataType = cleanSchema(field.dataType)
        return field
    def cleanSchema(dataType: DataType) -> DataType:
        dataType = copy(dataType)
        # If the type is a StructType we need to recurse, otherwise
        # we can return since we've reached a leaf node
        if isinstance(dataType, StructType):
            # We call our sanitizer for all top level fields
            dataType.fields = [sanitizeField(f) for f in dataType.fields]
        elif isinstance(dataType, ArrayType):
            dataType.elementType = cleanSchema(dataType.elementType)
        return dataType
    # Now that we have the new schema we can create a new DataFrame
    # by using the old frame's RDD as data and the new schema as the
    # schema for the data
    return spark.createDataFrame(df.rdd, cleanSchema(df.schema))
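A minimal usage sketch, assuming an active SparkSession named spark and a dataframe df whose (possibly nested) field names contain characters such as '-':
# All occurrences of '-', '&', '"', '[', ']' and '.' in field names become '_'
cleaned_df = cleanDataFrame(df)
cleaned_df.printSchema()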
I found a much easier way than the one provided by @zero323, along the lines of @MaxPY:
PySpark 2.4:
from pyspark.sql.types import StructField
# Get the schema from the dataframe df
schema = df.schema
# Override `fields` with a list of new StructFields, equal to the previous ones except for the names
schema.fields = list(map(lambda field:
    StructField(field.name + "_renamed", field.dataType), schema.fields))
# Override `names` with the same mechanism
schema.names = list(map(lambda name: name + "_renamed", schema.names))
Now df.schema will show all the renamed names.
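Note that mutating the schema object by itself does not change df (that is exactly what the question's update ran into); one way to actually apply it, in line with the other answers here, is to rebuild the dataframe; a sketch:
# Rebuild the dataframe so the renamed schema takes effect
df_renamed = spark.createDataFrame(df.rdd, schema)
df_renamed.printSchema()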
Another, much easier solution, if it works for you as it did for me, is to flatten the structure and then rename:
Using Scala:
val df_flat = df.selectExpr("array_field.*")
Now the rename works
val df_renamed = df_flat.withColumnRenamed("a", "a_renamed")
Of course this only works if you don't need the hierarchy (although I suppose it can be recreated again if needed).
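For reference, a rough PySpark equivalent of the same flatten-then-rename idea (a sketch; like the Scala above, it assumes array_field can be star-expanded, i.e. that it is a struct rather than an array column):
df_flat = df.selectExpr("array_field.*")
df_renamed = df_flat.withColumnRenamed("a", "a_renamed")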
Using the answer provided by Leo C in https://stackoverflow.com/a/55363153/5475506, I have built what I consider a more human-friendly/pythonic script:
import pyspark.sql.types as sql_types
from pyspark.sql.types import StructType
path_table = "<PATH_TO_DATA>"
table_name = "<TABLE_NAME>"
def recur_rename(schema: StructType, old_char, new_char):
    schema_new = []
    for struct_field in schema:
        if type(struct_field.dataType) == sql_types.StructType:
            schema_new.append(sql_types.StructField(struct_field.name.replace(old_char, new_char), sql_types.StructType(recur_rename(struct_field.dataType, old_char, new_char)), struct_field.nullable, struct_field.metadata))
        elif type(struct_field.dataType) == sql_types.ArrayType:
            if type(struct_field.dataType.elementType) == sql_types.StructType:
                schema_new.append(sql_types.StructField(struct_field.name.replace(old_char, new_char), sql_types.ArrayType(sql_types.StructType(recur_rename(struct_field.dataType.elementType, old_char, new_char)), True), struct_field.nullable, struct_field.metadata)) # Recursive call to loop over all Array elements
            else:
                schema_new.append(sql_types.StructField(struct_field.name.replace(old_char, new_char), struct_field.dataType.elementType, struct_field.nullable, struct_field.metadata)) # If the ArrayType does not wrap a struct there is nothing nested to rename, so the Array is dropped and only its element type is kept
        else:
            schema_new.append(sql_types.StructField(struct_field.name.replace(old_char, new_char), struct_field.dataType, struct_field.nullable, struct_field.metadata))
    return schema_new
def rename_columns(schema: StructType, old_char, new_char):
    return sql_types.StructType(recur_rename(schema, old_char, new_char))
df = spark.read.format("json").load(path_table) # Read the data whose schema has to be changed.
newSchema = rename_columns(df.schema, ":", "_") # Replace special characters in the schema (other special characters not allowed in the Spark/Hive metastore: ':', ',', ';')
df2 = spark.read.format("json").schema(newSchema).load(path_table) # Read the data with the new schema.
I consider the code self-explanatory (furthermore, it has comments), but what it does is recursively loop over all the fields in the schema, replacing "old_char" with "new_char" in each of them. If a field's type is a nested one (StructType or ArrayType), new recursive calls are made.
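The same helper can be chained to strip the other characters the metastore rejects; a sketch:
# Chain the replacements for each disallowed character
newSchema = rename_columns(df.schema, ":", "_")
newSchema = rename_columns(newSchema, ",", "_")
newSchema = rename_columns(newSchema, ";", "_")
df2 = spark.read.format("json").schema(newSchema).load(path_table)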
Related
I am trying to lower the case of all columns names of PySpark Dataframe schema, including complex type columns' element names.
Example:
original_df
|-- USER_ID: long (nullable = true)
|-- COMPLEX_COL_ARRAY: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- KEY: timestamp (nullable = true)
| | |-- VALUE: integer (nullable = true)
target_df
|-- user_id: long (nullable = true)
|-- complex_col_array: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: timestamp (nullable = true)
| | |-- value: integer (nullable = true)
However, I've only been able to lower the case of column names using the script below:
from pyspark.sql.types import StructField
schema = df.schema
schema.fields = list(map(lambda field: StructField(field.name.lower(), field.dataType), schema.fields))
I know I can access the field names of nested elements using this syntax:
for f in schema.fields:
    if hasattr(f.dataType, 'elementType') and hasattr(f.dataType.elementType, 'fieldNames'):
        print(f.dataType.elementType.fieldNames())
But how can I modify the case of these field names?
Thanks for your help!
Suggesting an answer to my own question, inspired by this question here: Rename nested field in spark dataframe
from pyspark.sql.types import StructField
# Read parquet file
path = "/path/to/data"
df = spark.read.parquet(path)
schema = df.schema
# Lower the case of all fields that are not nested
schema.fields = list(map(lambda field: StructField(field.name.lower(), field.dataType), schema.fields))
for f in schema.fields:
    # if field is nested and has named elements, lower the case of all element names
    if hasattr(f.dataType, 'elementType') and hasattr(f.dataType.elementType, 'fieldNames'):
        for e in f.dataType.elementType.fieldNames():
            schema[f.name].dataType.elementType[e].name = schema[f.name].dataType.elementType[e].name.lower()
            ind = schema[f.name].dataType.elementType.names.index(e)
            schema[f.name].dataType.elementType.names[ind] = e.lower()
# Recreate dataframe with lowercase schema
df_lowercase = spark.createDataFrame(df.rdd, schema)
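If mutating the schema in place feels fragile, a non-mutating alternative in the spirit of the cast-based answers above is to build a lowercased copy of each column's type recursively and cast to it; a sketch (lower_schema is my own helper name):
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructField, StructType

def lower_schema(dt):
    # Recursively lowercase field names inside structs (including structs nested in arrays)
    if isinstance(dt, StructType):
        return StructType([StructField(f.name.lower(), lower_schema(f.dataType), f.nullable)
                           for f in dt.fields])
    if isinstance(dt, ArrayType):
        return ArrayType(lower_schema(dt.elementType), dt.containsNull)
    return dt

df_lower = df.select(
    [F.col(c).cast(lower_schema(df.schema[c].dataType)).alias(c.lower()) for c in df.columns])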
I have packed my nested JSON as string columns in my PySpark dataframe and I am trying to perform an UPSERT on some columns based on a groupBy.
Input:
from pyspark.sql.functions import *
from pyspark.sql.types import *
input_json = """[{
"candidate_email": "cust1#email.com",
"transactions":"[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}]"
},
{
"candidate_email": "cust1#email.com",
"transactions":"[{'transaction_id':'12', 'transaction_amount':'$23.43'}]"
}
]
"""
input_df = spark.read.json(sc.parallelize([input_json]), multiLine=True)
input_df.printSchema()
# root
# |-- candidate_email: string (nullable = true)
# |-- transactions: string (nullable = true)
Transformation & current output:
output_df = input_df.groupBy("candidate_email").agg(collect_list(col("transactions")).alias("transactions"))
output_df.printSchema()
output_df.collect()
# root
# |-- candidate_email: string (nullable = true)
# |-- transactions: array (nullable = true)
# | |-- element: string (containsNull = true)
# Out[161]:
# [Row(candidate_email='cust1#email.com', transactions=["[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}]", "[{'transaction_id':'12', 'transaction_amount':'$23.43'}]"])]
But what changes should I make in above code to get this output:
desired output:
output_json = """[{
"candidate_email": "cust1#email.com",
"transactions":"[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}, {'transaction_id':'12', 'transaction_amount':'$23.43'}]"
}]"""
output_df = spark.read.json(sc.parallelize([output_json]), multiLine=True)
output_df.printSchema()
# root
# |-- candidate_email: string (nullable = true)
# |-- transactions: string (nullable = true)
Basically, I am trying to get a clean merge by having one list instead of multiple.
Thanks!
As the transactions column is of string type, we need to convert it to an array type; then, by doing explode and groupBy + collect_list, we can achieve the expected result.
Example:
df.show(10,False)
#+---------------+----------------------------------------------------------------------------------------------------------------+
#|candidate_email|transactions |
#+---------------+----------------------------------------------------------------------------------------------------------------+
#|cust1#email.com|[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}]|
#|cust1#email.com|[{'transaction_id':'12', 'transaction_amount':'$23.43'}] |
#+---------------+----------------------------------------------------------------------------------------------------------------+
#To make a proper array we first replace (},) with (}},), then remove ("[|]") and split on (},), which results in an array; finally we explode the array.
df1=df.selectExpr("candidate_email","""explode(split(regexp_replace(regexp_replace(transactions,'(\\\},)','}},'),'(\\\[|\\\])',''),"},")) as transactions""")
df1.show(10,False)
#+---------------+------------------------------------------------------+
#|candidate_email|transactions |
#+---------------+------------------------------------------------------+
#|cust1#email.com|{'transaction_id':'10', 'transaction_amount':'$55.46'}|
#|cust1#email.com|{'transaction_id':'11','transaction_amount':'$545.46'}|
#|cust1#email.com|{'transaction_id':'12', 'transaction_amount':'$23.43'}|
#+---------------+------------------------------------------------------+
#groupBy and then create json object
df2=df1.groupBy("candidate_email").\
agg(collect_list(col("transactions")).alias("transactions"))
df2.show(10,False)
#+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|candidate_email|transactions |
#+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|cust1#email.com|[{'transaction_id':'10', 'transaction_amount':'$55.46'}, {'transaction_id':'11','transaction_amount':'$545.46'}, {'transaction_id':'12', 'transaction_amount':'$23.43'}]|
#+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#creating json object column in dataframe
df2.selectExpr("to_json(struct(candidate_email,transactions)) as json").show(10,False)
#+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|json |
#+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|{"candidate_email":"cust1#email.com","transactions":["{'transaction_id':'10', 'transaction_amount':'$55.46'}","{'transaction_id':'11','transaction_amount':'$545.46'}","{'transaction_id':'12', 'transaction_amount':'$23.43'}"]}|
#+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#To write the output to a json file
df2.groupBy("candidate_email").agg(collect_list(col("transactions")).alias("transactions")).write.mode("overwrite").json("<path>")
#content of file
#{"candidate_email":"cust1#email.com","transactions":["{'transaction_id':'10', 'transaction_amount':'$55.46'}","{'transaction_id':'11','transaction_amount':'$545.46'}","{'transaction_id':'12', 'transaction_amount':'$23.43'}"]}
#converting to json by using toJSON
df2.groupBy("candidate_email").agg(collect_list(col("transactions")).alias("transactions")).toJSON().collect()
#[u'{"candidate_email":"cust1#email.com","transactions":["{\'transaction_id\':\'10\', \'transaction_amount\':\'$55.46\'}","{\'transaction_id\':\'11\',\'transaction_amount\':\'$545.46\'}","{\'transaction_id\':\'12\', \'transaction_amount\':\'$23.43\'}"]}']
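As a possibly cleaner alternative to the regexp juggling, on Spark 2.4+ the transaction strings can be parsed with from_json (single quotes are accepted by the JSON parser by default), exploded, and re-serialized with to_json; a sketch, where txn_schema is my guess at the transaction layout, and the output uses standard double-quoted JSON rather than the single-quote style of the input:
from pyspark.sql.functions import col, explode, from_json, collect_list, to_json
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

txn_schema = ArrayType(StructType([
    StructField("transaction_id", StringType()),
    StructField("transaction_amount", StringType())]))

merged = (input_df
          .select("candidate_email",
                  explode(from_json(col("transactions"), txn_schema)).alias("t"))
          .groupBy("candidate_email")
          .agg(to_json(collect_list("t")).alias("transactions")))
merged.show(truncate=False)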
The dataframe below is registered as a temp table named 'table_name'.
How would you use spark.sql() to give a prefix to all columns?
root
|-- MAIN_COL: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: string (nullable = true)
| |-- c: string (nullable = true)
| |-- d: string (nullable = true)
| |-- f: long (nullable = true)
| |-- g: long (nullable = true)
| |-- h: long (nullable = true)
| |-- j: long (nullable = true)
The query below
spark.sql("select MAIN_COL.* from table_name")
gives back columns named a, b, c..., but how can I make them all look like e.g. pre_a, pre_b, pre_c?
I want to avoid selecting and aliasing them one by one. What if I have 30 columns?
I was hoping a custom UDF usable from SQL could solve it, but I'm really not sure how to handle this.
# Generate a pandas DataFrame
import pandas as pd
a_dict={
    'a':[1,2,3,4,5],
    'b':[1,2,3,4,5],
    'c':[1,2,3,4,5],
    'e':list('abcde'),
    'f':list('abcde'),
    'g':list('abcde')
}
pandas_df=pd.DataFrame(a_dict)
# Create a Spark DataFrame from a pandas DataFrame using Arrow
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df = spark.createDataFrame(pandas_df)
#struct
from pyspark.sql.functions import struct
main=df.select(struct(df.columns).alias("MAIN_COL"))
Here is one way to go through the fields and modify their names dynamically. First use main.schema.fields[0].dataType.fields to access the target fields. Next use Python's map to prepend pre_ to each field:
from pyspark.sql.types import *
from pyspark.sql.functions import col
inner_fields = main.schema.fields[0].dataType.fields
# [StructField(a,LongType,true),
# StructField(b,LongType,true),
# StructField(c,LongType,true),
# StructField(e,StringType,true),
# StructField(f,StringType,true),
# StructField(g,StringType,true)]
pre_cols = list(map(lambda sf: StructField(f"pre_{sf.name}", sf.dataType, sf.nullable), inner_fields))
new_schema = StructType(pre_cols)
main.select(col("MAIN_COL").cast(new_schema)).printSchema()
# root
# |-- MAIN_COL: struct (nullable = false)
# | |-- pre_a: long (nullable = true)
# | |-- pre_b: long (nullable = true)
# | |-- pre_c: long (nullable = true)
# | |-- pre_e: string (nullable = true)
# | |-- pre_f: string (nullable = true)
# | |-- pre_g: string (nullable = true)
Finally, you can use cast with the new schema, as @Mahesh already mentioned.
The beauty of Spark is that you can programmatically manipulate metadata.
This is an example that continues the original code snippet:
main.createOrReplaceTempView("table_name")
new_cols_select = ", ".join(["MAIN_COL." + col + " as pre_" + col for col in spark.sql("select MAIN_COL.* from table_name").columns])
new_df = spark.sql(f"select {new_cols_select} from table_name")
Due to Spark's laziness, and because all the manipulations are metadata only, this code has almost no performance cost and will work the same for 10 columns or 500 columns (we are actually doing something similar on 1k columns).
It is also possible to get the original column names in a more elegant way from the df.schema object.
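A sketch of that, reading the inner field names from the schema instead of running an extra query (this assumes main is the dataframe registered as table_name, as in the snippet above):
inner_names = [f.name for f in main.schema["MAIN_COL"].dataType.fields]
new_cols_select = ", ".join([f"MAIN_COL.{c} as pre_{c}" for c in inner_names])
new_df = spark.sql(f"select {new_cols_select} from table_name")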
You can try this: add all the columns, as per your requirements, to schema2:
val schema2 = new StructType()
  .add("pre_a", StringType)
  .add("pre_b", StringType)
  .add("pre_c", StringType)
Now select the column using cast, like this:
df.select(col("MAIN_COL").cast(schema2)).show()
It will give you all the updated column names.
The following expands all struct columns, adding the parent column name as a prefix.
struct_cols = [c for c, t in df.dtypes if t.startswith('struct')]
for c in struct_cols:
    schema = T.StructType([T.StructField(f"{c}_{f.name}", f.dataType, f.nullable) for f in df.schema[c].dataType.fields])
    df = df.withColumn(c, F.col(c).cast(schema))
df = df.select([f"{c}.*" if c in struct_cols else c for c in df.columns])
Test input:
from pyspark.sql import functions as F
from pyspark.sql import types as T
df = spark.createDataFrame([((1, 2), 5)], 'c1:struct<f1:int,f2:int>, c2:int')
print(df.dtypes)
# [('c1', 'struct<f1:int,f2:int>'), ('c2', 'int')]
Result:
struct_cols = [c for c, t in df.dtypes if t.startswith('struct')]
for c in struct_cols:
    schema = T.StructType([T.StructField(f"{c}_{f.name}", f.dataType, f.nullable) for f in df.schema[c].dataType.fields])
    df = df.withColumn(c, F.col(c).cast(schema))
df = df.select([f"{c}.*" if c in struct_cols else c for c in df.columns])
print(df.dtypes)
# [('c1_f1', 'int'), ('c1_f2', 'int'), ('c2', 'int')]
You can also do this with PySpark:
df.select([col(col_name).alias('prefix' + col_name) for col_name in df.columns])
In PySpark, whenever I read a JSON file with an empty object element, the entire element is ignored in the resulting DataFrame. How can I ask Spark to include it instead of ignoring it?
I am using spark 2.4.2 and Python 3.7.3
I tried using df.fillna('Null'). This didn't work because, by the time the DataFrame was created, the element was already gone.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
people = ['{"name":{},"address":{"city":"Columbus","state":"Ohio"}}']
otherPeopleRDD = spark.sparkContext.parallelize(people)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.printSchema()
root
|-- address: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
As we can see, the empty element (name) is not part of the DataFrame.
Is there a way to have the name element included?
Let me know if that helps:
people = ['{"name":{},"address":{"city":"Columbus","state":"Ohio"}}']
otherPeopleRDD = spark.sparkContext.parallelize(people)
otherPeople = spark.read.json(otherPeopleRDD)
# lit with None
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType
otherPeople = otherPeople.withColumn('name', lit(None).cast(StringType()))
otherPeople.printSchema()
root
|-- address: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
|-- name: string (nullable = true)
+---------------+----+
| address|name|
+---------------+----+
|[Columbus,Ohio]|null|
+---------------+----+
EDIT
If the JSON is not too complex, this will work:
# Change the dictionary itself rather than changing it at the df level
import json
d = json.loads(people[0])
# Takes care of any column which has an empty dictionary value
for k, v in d.items():
    if (v is None) or (len(v) == 0):  # You can add any conditions to detect an empty set
        d[k] = "nan"  # I prefer d[k] = None, and then fillna
people[0] = str(json.dumps(d))
otherPeopleRDD = spark.sparkContext.parallelize(people)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.printSchema()
otherPeople.show()
root
|-- address: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
|-- name: string (nullable = true)
+---------------+----+
| address|name|
+---------------+----+
|[Columbus,Ohio]|nan |
+---------------+----+
After a couple of hours struggling with a similar problem, here is what I found:
If the data has at least one row with a nonempty "name" field, then it won't be ignored. If it doesn't, then we need to add a "name" column with an initial value, like @Preetham said, so we can check whether "name" exists in the result's schema or not.
Another solution would be to add a sample row, with all fields filled with data, to the JSON file/string and then ignore or remove it from the result.
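A sketch of that check, using the otherPeople dataframe from the answer above:
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

# If the source JSON had no non-empty "name" values, the column is missing from the schema; add it
if "name" not in otherPeople.columns:
    otherPeople = otherPeople.withColumn("name", lit(None).cast(StringType()))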
Say I have a PySpark dataframe df:
>>> df.printSchema()
root
|-- a: struct
| |-- alpha: integer
| |-- beta: string
| |-- gamma: boolean
|-- b: string
|-- c: struct
| |-- delta: string
| |-- epsilon: struct
| | |-- omega: string
| | |-- psi: boolean
I know I can flatten the dataframe:
select_col_list = [col.replace("a", "a.*").replace("c", "c.*") for col in df.columns]
flat_df = df.select(*select_col_list)
This results in a schema like this:
root
|-- alpha: integer
|-- beta: string
|-- gamma: boolean
|-- b: string
|-- delta: string
|-- epsilon: struct
| |-- omega: string
| |-- psi: boolean
But I also want to prefix the subcolumns with their parent column's name when I flatten, so I want the resulting schema to be like this:
root
|-- a_alpha: integer
|-- a_beta: string
|-- a_gamma: boolean
|-- b: string
|-- c_delta: string
|-- c_epsilon: struct
| |-- omega: string
| |-- psi: boolean
How do I do this?
I don't think there's a straightforward way to do it, but here's a hacky solution that I came up with.
Define a list of the columns to be expanded and create a temporary id column using pyspark.sql.functions.monotonically_increasing_id().
Loop over all the columns in the dataframe and create a temporary dataframe for each one.
If the column is in cols_to_expand: Use .* to expand the column. Then rename all fields (except id) in the resulting (temporary) dataframe with the corresponding prefix using alias().
If the column is not in cols_to_expand: Select that column and id and store it in a temporary dataframe.
Store temp_df in a list.
Join all the dataframes in the list using id and drop the id column.
Code:
from functools import reduce
from pyspark.sql import functions as f
df = df.withColumn('id', f.monotonically_increasing_id())
cols_to_expand = ['a', 'c']
flat_dfs = []
for col in df.columns:
    if col == 'id':
        # skip the temporary join key itself
        continue
    if col in cols_to_expand:
        temp_df = df.select('id', col + ".*")
        temp_df = temp_df.select(
            [
                f.col(c).alias(col + "_" + c if c != 'id' else c) for c in temp_df.columns
            ]
        )
    else:
        temp_df = df.select('id', col)
    flat_dfs.append(temp_df)
flat_df = reduce(lambda x, y: x.join(y, on='id'), flat_dfs)
flat_df = flat_df.drop('id')
flat_df.printSchema()
flat_df.printSchema()
The resulting schema:
flat_df.printSchema()
#root
# |-- a_alpha: integer (nullable = true)
# |-- a_beta: string (nullable = true)
# |-- a_gamma: boolean (nullable = true)
# |-- b: string (nullable = true)
# |-- c_delta: string (nullable = true)
# |-- c_epsilon: struct (nullable = true)
# | |-- omega: string (nullable = true)
# | |-- psi: boolean (nullable = true)
I actually found a way to do this today. Start with the beautiful auto-flattening PySpark function by Evan V. Combine it with the rather brilliant mass-renaming solution from proggeo, and you can basically build up a list of names down the full tree of columns and alias them all as you select.
In my case I took the result of the flatten function and replaced all the "." characters with "_" in the renaming. The result is as follows:
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, ArrayType
def flatten(schema, prefix=None):
    fields = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        dtype = field.dataType
        if isinstance(dtype, ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, StructType):
            fields += flatten(dtype, prefix=name)
        else:
            fields.append(name)
    return fields
# Get actual field names, with nested '.' structure, and create equivalents with '_'
fields = flatten(df.schema)
fields_renamed = [field.replace(".", "_") for field in fields]
# Select while aliasing all fields
df = df.select(*[col(field).alias(new_field) for field, new_field in zip(fields, fields_renamed)])