Write a pyspark.sql.dataframe.DataFrame without losing information - python

I am trying to save a pyspark.sql.dataframe.DataFrame in CSV format (it could also be another format, as long as it is easily readable).
So far, I have found a couple of examples for saving the DataFrame. However, it loses information every time I write it.
Dataset example:
# Create an example Pyspark DataFrame
from pyspark.sql import Row
Employee = Row("firstName", "lastName", "email", "salary")
employee1 = Employee('A', 'AA', 'mail1', 100000)
employee2 = Employee('B', 'BB', 'mail2', 120000)
employee3 = Employee('C', None, 'mail3', 140000)
employee4 = Employee('D', 'DD', 'mail4', 160000)
employee5 = Employee('E', 'EE', 'mail5', 160000 )
department1 = Row(id='123', name='HR')
department2 = Row(id='456', name='OPS')
department3 = Row(id='789', name='FN')
department4 = Row(id='101112', name='DEV')
departmentWithEmployees1 = Row(department=department1, employees=[employee1, employee2, employee5])
departmentWithEmployees2 = Row(department=department2, employees=[employee3, employee4])
departmentWithEmployees3 = Row(department=department3, employees=[employee1, employee4, employee3])
departmentWithEmployees4 = Row(department=department4, employees=[employee2, employee3])
departmentsWithEmployees_Seq = [departmentWithEmployees1, departmentWithEmployees2]
dframe = spark.createDataFrame(departmentsWithEmployees_Seq)
In order to save this file as CSV, I first tried this solution:
type(dframe)
Out[]: pyspark.sql.dataframe.DataFrame
dframe.write.csv('junk_mycsv.csv')
Unfortunately, that results in this error:
org.apache.spark.sql.AnalysisException: CSV data source does not support struct<id:string,name:string> data type.;
That is why I tried another possibility: converting the Spark DataFrame into a pandas DataFrame and saving that instead, as mentioned in this example.
pandas_df = dframe.toPandas()
That works! However, when I show my data, information is missing:
print(pandas_df.head())
department employees
0 (123, HR) [(A, AA, mail1, 100000), (B, BB, mail2, 120000...
1 (456, OPS) [(C, None, mail3, 140000), (D, DD, mail4, 1600...
As you can see in the snapshot below, information is missing, because the data should look like this:
department employees
0 id:123, name:HR firstName: A, lastName: AA, email: mail1, salary: 100000
# Info is missing, like 'id', 'name', 'firstName', 'lastName', 'email', etc.
# For the complete expected example, see the screenshot below.
Just for information: I am working in Databricks, with Python.
Therefore, how can I write my data (dframe from the example above) without losing information?
Many thanks in advance!
Edit
Adding a picture for Pault, to show the format of the csv (and the headers).
Edit2
Replacing the picture with example csv output:
After running Pault's code:
from pyspark.sql.functions import to_json
dframe.select(*[to_json(c).alias(c) for c in dframe.columns])\
.repartition(1).write.csv("junk_mycsv.csv", header=True)
The output is not tidy, since most column headers are empty (due to the nested format?). Copying only the first row:
department employees (empty ColName) (empty ColName) (and so on)
{\id\":\"123\" \"name\":\"HR\"}" [{\firstName\":\"A\" \"lastName\":\"AA\" (...)

Your dataframe has the following schema:
dframe.printSchema()
#root
# |-- department: struct (nullable = true)
# | |-- id: string (nullable = true)
# | |-- name: string (nullable = true)
# |-- employees: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- firstName: string (nullable = true)
# | | |-- lastName: string (nullable = true)
# | | |-- email: string (nullable = true)
# | | |-- salary: long (nullable = true)
So the department column is a StructType with two named fields, and the employees column is an array of structs with four named fields. It appears that what you want is to write the data in a format that saves both the key and the value for each record.
One option is to write the file in JSON format instead of CSV:
dframe.write.json("junk.json")
Which produces the following output:
{"department":{"id":"123","name":"HR"},"employees":[{"firstName":"A","lastName":"AA","email":"mail1","salary":100000},{"firstName":"B","lastName":"BB","email":"mail2","salary":120000},{"firstName":"E","lastName":"EE","email":"mail5","salary":160000}]}
{"department":{"id":"456","name":"OPS"},"employees":[{"firstName":"C","email":"mail3","salary":140000},{"firstName":"D","lastName":"DD","email":"mail4","salary":160000}]}
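As a quick sanity check (assuming the same Spark session and path), reading the JSON back re-infers the nested schema, so nothing is lost; note that inferred struct fields may come back in alphabetical order:
# Round-trip sketch: the nested structure survives the write/read cycle.
restored = spark.read.json("junk.json")
restored.printSchema()
restored.show(truncate=False)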
Or, if you want to keep the CSV format, you can use to_json to convert each column to JSON before writing the CSV:
# looping over all columns
# but you can also just limit this to the columns you want to convert
from pyspark.sql.functions import to_json
dframe.select(*[to_json(c).alias(c) for c in dframe.columns])\
.write.csv("junk_mycsv.csv")
This produces the following output:
"{\"id\":\"123\",\"name\":\"HR\"}","[{\"firstName\":\"A\",\"lastName\":\"AA\",\"email\":\"mail1\",\"salary\":100000},{\"firstName\":\"B\",\"lastName\":\"BB\",\"email\":\"mail2\",\"salary\":120000},{\"firstName\":\"E\",\"lastName\":\"EE\",\"email\":\"mail5\",\"salary\":160000}]"
"{\"id\":\"456\",\"name\":\"OPS\"}","[{\"firstName\":\"C\",\"email\":\"mail3\",\"salary\":140000},{\"firstName\":\"D\",\"lastName\":\"DD\",\"email\":\"mail4\",\"salary\":160000}]"
Note that the double-quotes are escaped.
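If you ever need the structs back from that CSV, from_json can reverse the conversion. Here is a sketch, assuming Spark 2.3+ (so the schema can be passed as a DDL string); the schema strings mirror the printSchema output above:
from pyspark.sql.functions import from_json
# Read the CSV back (columns default to _c0, _c1) and re-parse the JSON strings.
csv_df = spark.read.csv("junk_mycsv.csv")
restored = csv_df.select(
    from_json("_c0", "struct<id:string,name:string>").alias("department"),
    from_json("_c1", "array<struct<firstName:string,lastName:string,email:string,salary:bigint>>").alias("employees"))
restored.printSchema()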

Related

How to convert the type of a column from String to Date

I have a pretty simple question that I couldn't find a simple answer to: I want to convert the type of a column in my PySpark DataFrame from a String to a Date. How can I do this?
I tried the following:
df.withColumn('dates', datetime.strptime(col('date'), '%Y%m%d'))
and
df.withColumn('dates', datetime.strptime(df.date, '%Y%m%d'))
but in each case I get the following error: TypeError: strptime() argument 1 must be str, not Column.
So obviously col('date') and df.date are interpreted as a Column rather than the string it holds. How can I fix this?
You can convert the string column to a date using the cast function if the format is "yyyy-MM-dd", or you can use the to_date function, which is more general and also lets you specify the input format.
Here is the sample code.
Code
# Create DataFrame
from pyspark.sql.functions import to_date
df = spark.createDataFrame([(1, "2020-06-03", "2020/06/03"), (2, "2020-05-01", "2020/05/01")], ["id", "date_fmt_1", "date_fmt_2"])
# Convert the string columns to date.
df1 = (df
    # Approach 1: use the cast function if the format is "yyyy-MM-dd"
    .withColumn("date_1", df["date_fmt_1"].cast("date"))
    # Approach 2: use the to_date function and specify the input format, "yyyy/MM/dd" in our case
    .withColumn("date_2", to_date("date_fmt_2", "yyyy/MM/dd"))
    # If you don't specify a format, Spark uses its default, "yyyy-MM-dd"
    .withColumn("date_3", to_date(df["date_fmt_1"])))
# Print the schema
df1.printSchema()
Schema Output
root
|-- id: long (nullable = true)
|-- date_fmt_1: string (nullable = true)
|-- date_fmt_2: string (nullable = true)
|-- date_1: date (nullable = true)
|-- date_2: date (nullable = true)
|-- date_3: date (nullable = true)
Display the data:
df1.show()
DataFrame Output
+---+----------+----------+----------+----------+----------+
| id|date_fmt_1|date_fmt_2| date_1| date_2| date_3|
+---+----------+----------+----------+----------+----------+
| 1|2020-06-03|2020/06/03|2020-06-03|2020-06-03|2020-06-03|
| 2|2020-05-01|2020/05/01|2020-05-01|2020-05-01|2020-05-01|
+---+----------+----------+----------+----------+----------+
For more information about Spark's datetime functions, you can visit this blog: https://medium.com/expedia-group-tech/deep-dive-into-apache-spark-datetime-functions-b66de737950a
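For the reverse direction (formatting a date back into a string), date_format works; a quick sketch on the frame above, with an example pattern:
from pyspark.sql.functions import date_format
# Format a date column back into a string with a custom pattern.
df1.select("date_1", date_format("date_1", "dd/MM/yyyy").alias("date_str")).show()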
I hope this helps.

PySpark: how to aggregate json columns in a clean way

I have packed my nested json as string columns in my PySpark DataFrame, and I am trying to perform an UPSERT on some columns based on a groupBy.
Input:
from pyspark.sql.functions import *
from pyspark.sql.types import *
input_json = """[{
"candidate_email": "cust1@email.com",
"transactions":"[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}]"
},
{
"candidate_email": "cust1@email.com",
"transactions":"[{'transaction_id':'12', 'transaction_amount':'$23.43'}]"
}
]
"""
input_df = spark.read.json(sc.parallelize([input_json]), multiLine=True)
input_df.printSchema()
# root
# |-- candidate_email: string (nullable = true)
# |-- transactions: string (nullable = true)
Transformation & current output:
output_df = input_df.groupBy("candidate_email").agg(collect_list(col("transactions")).alias("transactions"))
output_df.printSchema()
output_df.collect()
# root
# |-- candidate_email: string (nullable = true)
# |-- transactions: array (nullable = true)
# | |-- element: string (containsNull = true)
# Out[161]:
# [Row(candidate_email='cust1@email.com', transactions=["[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}]", "[{'transaction_id':'12', 'transaction_amount':'$23.43'}]"])]
But what changes should I make in the above code to get this output?
Desired output:
output_json = """[{
"candidate_email": "cust1@email.com",
"transactions":"[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}, {'transaction_id':'12', 'transaction_amount':'$23.43'}]"
}]"""
output_df = spark.read.json(sc.parallelize([output_json]), multiLine=True)
output_df.printSchema()
# root
# |-- candidate_email: string (nullable = true)
# |-- transactions: string (nullable = true)
Basically, I am trying to get a clean merge by having one list instead of multiple.
Thanks!
Since the transactions column is a string type, we first need to convert it to an array type; then, with explode and groupBy + collect_list, we can achieve the expected result.
Example:
df.show(10,False)
#+---------------+----------------------------------------------------------------------------------------------------------------+
#|candidate_email|transactions |
#+---------------+----------------------------------------------------------------------------------------------------------------+
#|cust1@email.com|[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}]|
#|cust1@email.com|[{'transaction_id':'12', 'transaction_amount':'$23.43'}]                                                         |
#+---------------+----------------------------------------------------------------------------------------------------------------+
# To make a proper array we first replace "}," with "}},", then strip the "[" and "]" brackets, and split on "},", which yields an array; finally we explode that array.
df1 = df.selectExpr("candidate_email","""explode(split(regexp_replace(regexp_replace(transactions,'(\\\},)','}},'),'(\\\[|\\\])',''),"},")) as transactions""")
df1.show(10,False)
#+---------------+------------------------------------------------------+
#|candidate_email|transactions |
#+---------------+------------------------------------------------------+
#|cust1@email.com|{'transaction_id':'10', 'transaction_amount':'$55.46'}|
#|cust1@email.com|{'transaction_id':'11','transaction_amount':'$545.46'}|
#|cust1@email.com|{'transaction_id':'12', 'transaction_amount':'$23.43'}|
#+---------------+------------------------------------------------------+
# groupBy and then collect the json strings into one array
df2=df1.groupBy("candidate_email").\
agg(collect_list(col("transactions")).alias("transactions"))
df2.show(10,False)
#+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|candidate_email|transactions |
#+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|cust1@email.com|[{'transaction_id':'10', 'transaction_amount':'$55.46'}, {'transaction_id':'11','transaction_amount':'$545.46'}, {'transaction_id':'12', 'transaction_amount':'$23.43'}]|
#+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
# Create a json object column in the dataframe
df2.selectExpr("to_json(struct(candidate_email,transactions)) as json").show(10,False)
#+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|json |
#+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|{"candidate_email":"cust1@email.com","transactions":["{'transaction_id':'10', 'transaction_amount':'$55.46'}","{'transaction_id':'11','transaction_amount':'$545.46'}","{'transaction_id':'12', 'transaction_amount':'$23.43'}"]}|
#+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
# To write the output to a json file
df2.write.mode("overwrite").json("<path>")
#content of file
#{"candidate_email":"cust1@email.com","transactions":["{'transaction_id':'10', 'transaction_amount':'$55.46'}","{'transaction_id':'11','transaction_amount':'$545.46'}","{'transaction_id':'12', 'transaction_amount':'$23.43'}"]}
# Convert to json strings by using toJSON
df2.toJSON().collect()
#[u'{"candidate_email":"cust1@email.com","transactions":["{\'transaction_id\':\'10\', \'transaction_amount\':\'$55.46\'}","{\'transaction_id\':\'11\',\'transaction_amount\':\'$545.46\'}","{\'transaction_id\':\'12\', \'transaction_amount\':\'$23.43\'}"]}']
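For what it's worth, here is an alternative sketch that parses the strings with from_json instead of regex surgery (assuming Spark 2.4+, so from_json accepts a DDL schema string and to_json handles arrays of structs); note the result is proper double-quoted JSON rather than the original single-quoted strings:
from pyspark.sql.functions import from_json, explode, collect_list, to_json
# Parse each transactions string into an array of structs; Spark's JSON
# parser accepts the single-quoted keys here by default (allowSingleQuotes).
tx_schema = "array<struct<transaction_id:string,transaction_amount:string>>"
exploded = input_df.select(
    "candidate_email",
    explode(from_json("transactions", tx_schema)).alias("tx"))
# Re-aggregate into a single array per email and render it back to JSON.
merged = (exploded.groupBy("candidate_email")
    .agg(to_json(collect_list("tx")).alias("transactions")))
merged.show(truncate=False)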

Give prefix to all columns when selecting with 'struct_name.*'

The dataframe below is a temp_table named: 'table_name'.
How would you use spark.sql() to give a prefix to all columns?
root
|-- MAIN_COL: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: string (nullable = true)
| |-- c: string (nullable = true)
| |-- d: string (nullable = true)
| |-- f: long (nullable = true)
| |-- g: long (nullable = true)
| |-- h: long (nullable = true)
| |-- j: long (nullable = true)
The below query
spark.sql("select MAIN_COL.* from table_name")
gives back columns named a,b,c..., but how to make them all look like e.g. pre_a, pre_b, pre_c?
I want to avoid selecting and aliasing them one by one. What if I have 30 columns?
I hoped a custom UDF used in SQL could solve it, but I'm really not sure how to handle this.
# Generate a pandas DataFrame
import pandas as pd
a_dict={
'a':[1,2,3,4,5],
'b':[1,2,3,4,5],
'c':[1,2,3,4,5],
'e':list('abcde'),
'f':list('abcde'),
'g':list('abcde')
}
pandas_df=pd.DataFrame(a_dict)
# Create a Spark DataFrame from a pandas DataFrame using Arrow
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df = spark.createDataFrame(pandas_df)
#struct
from pyspark.sql.functions import struct
main=df.select(struct(df.columns).alias("MAIN_COL"))
Here is one way to go through the fields and modify their names dynamically. First, use main.schema.fields[0].dataType.fields to access the target fields. Next, use Python's map to prepend pre_ to each field:
from pyspark.sql.types import *
from pyspark.sql.functions import col
inner_fields = main.schema.fields[0].dataType.fields
# [StructField(a,LongType,true),
# StructField(b,LongType,true),
# StructField(c,LongType,true),
# StructField(e,StringType,true),
# StructField(f,StringType,true),
# StructField(g,StringType,true)]
pre_cols = list(map(lambda sf: StructField(f"pre_{sf.name}", sf.dataType, sf.nullable), inner_fields))
new_schema = StructType(pre_cols)
main.select(col("MAIN_COL").cast(new_schema)).printSchema()
# root
# |-- MAIN_COL: struct (nullable = false)
# | |-- pre_a: long (nullable = true)
# | |-- pre_b: long (nullable = true)
# | |-- pre_c: long (nullable = true)
# | |-- pre_e: string (nullable = true)
# | |-- pre_f: string (nullable = true)
# | |-- pre_g: string (nullable = true)
Finally, you can use cast with the new schema, as @Mahesh already mentioned.
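If the goal is ultimately to expose the renamed fields as top-level columns (as in the original spark.sql query), a star-select after the cast should do it; a small sketch continuing the snippet above:
# cast keeps the column name, so a star-select surfaces the renamed fields.
main.select(col("MAIN_COL").cast(new_schema)).select("MAIN_COL.*").printSchema()
# root
#  |-- pre_a: long (nullable = true)
#  |-- pre_b: long (nullable = true)
#  ... and so on for pre_c, pre_e, pre_f and pre_g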
The beauty of Spark is that you can programmatically manipulate metadata. This is an example that continues the original code snippet:
main.createOrReplaceTempView("table_name")
new_cols_select = ", ".join(["MAIN_COL." + col + " as pre_" + col for col in spark.sql("select MAIN_COL.* from table_name").columns])
new_df = spark.sql(f"select {new_cols_select} from table_name")
Due to Spark's laziness, and because all the manipulations are metadata-only, this code has almost no performance cost and will work the same for 10 columns or 500 columns (we actually do something similar on 1k columns).
It is also possible to get the original column names in a more elegant way via the df.schema object.
You can try this: add all the columns, as per your requirements, to schema2:
val schema2 = new StructType()
.add("pre_a",StringType)
.add("pre_b",StringType)
.add("pre_c",StringType)
Now select the column using a cast to that schema:
df.select(col("MAIN_COL").cast(schema2)).show()
It will give you all the updated column names.
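For reference, a rough PySpark equivalent of that Scala snippet, sketched against the example main dataframe from above (one .add() per struct field, matching each field's type):
from pyspark.sql.types import StructType, StringType, LongType
from pyspark.sql.functions import col
# Build the renamed schema field by field, then cast the struct to it.
schema2 = (StructType()
    .add("pre_a", LongType())
    .add("pre_b", LongType())
    .add("pre_c", LongType())
    .add("pre_e", StringType())
    .add("pre_f", StringType())
    .add("pre_g", StringType()))
main.select(col("MAIN_COL").cast(schema2)).show()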
The following expands all struct columns, adding the parent column name as a prefix.
struct_cols = [c for c, t in df.dtypes if t.startswith('struct')]
for c in struct_cols:
    schema = T.StructType([T.StructField(f"{c}_{f.name}", f.dataType, f.nullable) for f in df.schema[c].dataType.fields])
    df = df.withColumn(c, F.col(c).cast(schema))
df = df.select([f"{c}.*" if c in struct_cols else c for c in df.columns])
Test input:
from pyspark.sql import functions as F
from pyspark.sql import types as T
df = spark.createDataFrame([((1, 2), 5)], 'c1:struct<f1:int,f2:int>, c2:int')
print(df.dtypes)
# [('c1', 'struct<f1:int,f2:int>'), ('c2', 'int')]
Result:
print(df.dtypes)
# [('c1_f1', 'int'), ('c1_f2', 'int'), ('c2', 'int')]
You can also do this in PySpark:
df.select([col(col_name).alias('prefix' + col_name) for col_name in df.columns])

PySpark: Concatenate Two Columns with Datatype of 'Struct' --> Error: Cannot Resolve Due to Datatype Mismatch

I have a data table in PySpark that contains two columns with a data type of 'struct'.
Please see sample data frame below:
word_verb word_noun
{_1=cook, _2=VB} {_1=chicken, _2=NN}
{_1=pack, _2=VBN} {_1=lunch, _2=NN}
{_1=reconnected, _2=VBN} {_1=wifi, _2=NN}
I want to concatenate the two columns together so I can do a frequency count of the concatenated verb and noun chunk.
I tried the code below:
df = df.withColumn('word_chunk_final', F.concat(F.col('word_verb'), F.col('word_noun')))
But I get the following error:
AnalysisException: u"cannot resolve 'concat(`word_verb`, `word_noun`)' due to data type mismatch: input to function concat should have been string, binary or array, but it's [struct<_1:string,_2:string>, struct<_1:string,_2:string>]
My desired output table is as follows. The concatenated new field would have datatype of string:
word_verb word_noun word_chunk_final
{_1=cook, _2=VB} {_1=chicken, _2=NN} cook chicken
{_1=pack, _2=VBN} {_1=lunch, _2=NN} pack lunch
{_1=reconnected, _2=VBN} {_1=wifi, _2=NN} reconnected wifi
Your code is almost there.
Assuming your schema is as follows:
df.printSchema()
#root
# |-- word_verb: struct (nullable = true)
# | |-- _1: string (nullable = true)
# | |-- _2: string (nullable = true)
# |-- word_noun: struct (nullable = true)
# | |-- _1: string (nullable = true)
# | |-- _2: string (nullable = true)
You just need to access the value of the _1 field for each column:
import pyspark.sql.functions as F
df.withColumn(
"word_chunk_final",
F.concat_ws(' ', F.col('word_verb')['_1'], F.col('word_noun')['_1'])
).show()
#+-----------------+------------+----------------+
#| word_verb| word_noun|word_chunk_final|
#+-----------------+------------+----------------+
#| [cook,VB]|[chicken,NN]| cook chicken|
#| [pack,VBN]| [lunch,NN]| pack lunch|
#|[reconnected,VBN]| [wifi,NN]|reconnected wifi|
#+-----------------+------------+----------------+
Also, you should use concat_ws ("concatenate with separator") instead of concat, to join the strings together with a space between them. It's similar to how str.join works in Python.
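And since the stated goal was a frequency count of the verb-noun chunks, that is one groupBy away; a quick sketch on top of the result above:
# Count how often each concatenated chunk occurs, most frequent first.
counts = (df.withColumn(
        "word_chunk_final",
        F.concat_ws(' ', F.col('word_verb')['_1'], F.col('word_noun')['_1']))
    .groupBy("word_chunk_final")
    .count()
    .orderBy("count", ascending=False))
counts.show()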

Rename nested field in spark dataframe

Having a dataframe df in Spark:
|-- array_field: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: string (nullable = true)
| | |-- b: long (nullable = true)
| | |-- c: long (nullable = true)
How to rename field array_field.a to array_field.a_renamed?
[Update]:
.withColumnRenamed() does not work with nested fields so I tried this hacky and unsafe method:
# First alter the schema:
schema = df.schema
schema['array_field'].dataType.elementType['a'].name = 'a_renamed'
ind = schema['array_field'].dataType.elementType.names.index('a')
schema['array_field'].dataType.elementType.names[ind] = 'a_renamed'
# Then set dataframe's schema with altered schema
df._schema = schema
I know that setting a private attribute is not good practice, but I don't know any other way to set the schema of df.
I think I am on the right track, but df.printSchema() still shows the old name for array_field.a, even though df.schema == schema is True.
Python
It is not possible to modify a single nested field. You have to recreate the whole structure. In this particular case, the simplest solution is to use cast.
First a bunch of imports:
from collections import namedtuple
from pyspark.sql.functions import col
from pyspark.sql.types import (
ArrayType, LongType, StringType, StructField, StructType)
and example data:
Record = namedtuple("Record", ["a", "b", "c"])
df = sc.parallelize([([Record("foo", 1, 3)], )]).toDF(["array_field"])
Let's confirm that the schema is the same as in your case:
df.printSchema()
root
|-- array_field: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: string (nullable = true)
| | |-- b: long (nullable = true)
| | |-- c: long (nullable = true)
You can define a new schema for example as a string:
str_schema = "array<struct<a_renamed:string,b:bigint,c:bigint>>"
df.select(col("array_field").cast(str_schema)).printSchema()
root
|-- array_field: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a_renamed: string (nullable = true)
| | |-- b: long (nullable = true)
| | |-- c: long (nullable = true)
or a DataType:
struct_schema = ArrayType(StructType([
StructField("a_renamed", StringType()),
StructField("b", LongType()),
StructField("c", LongType())
]))
df.select(col("array_field").cast(struct_schema)).printSchema()
root
|-- array_field: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a_renamed: string (nullable = true)
| | |-- b: long (nullable = true)
| | |-- c: long (nullable = true)
Scala
The same techniques can be used in Scala:
case class Record(a: String, b: Long, c: Long)
val df = Seq(Tuple1(Seq(Record("foo", 1, 3)))).toDF("array_field")
val strSchema = "array<struct<a_renamed:string,b:bigint,c:bigint>>"
df.select($"array_field".cast(strSchema))
or
import org.apache.spark.sql.types._
val structSchema = ArrayType(StructType(Seq(
StructField("a_renamed", StringType),
StructField("b", LongType),
StructField("c", LongType)
)))
df.select($"array_field".cast(structSchema))
Possible improvements:
If you use an expressive data manipulation or JSON processing library, it can be easier to dump data types to a dict or a JSON string and take it from there, for example (Python / toolz):
from toolz.curried import pipe, assoc_in, update_in, map
from operator import attrgetter
# Update name to "a_updated" if name is "a"
rename_field = update_in(
keys=["name"], func=lambda x: "a_updated" if x == "a" else x)
updated_schema = pipe(
# Get schema of the field as a dict
df.schema["array_field"].jsonValue(),
# Update fields with rename
update_in(
keys=["type", "elementType", "fields"],
func=lambda x: pipe(x, map(rename_field), list)),
# Load schema from dict
StructField.fromJson,
# Get data type
attrgetter("dataType"))
df.select(col("array_field").cast(updated_schema)).printSchema()
You can recurse over the data frame's schema to create a new schema with the required changes.
A schema in PySpark is a StructType, which holds a list of StructFields, and each StructField can hold some primitive type or another StructType.
This means that we can decide if we want to recurse based on whether the type is a StructType or not.
Below is an annotated sample implementation that shows you how you can implement the above idea.
# Some imports
from pyspark.sql import DataFrame
from pyspark.sql.types import DataType, StructType, StructField, ArrayType
from copy import copy

# We take a dataframe and return a new one with the required changes
def cleanDataFrame(df: DataFrame) -> DataFrame:
    # Returns a new sanitized field name (this function can be anything really)
    def sanitizeFieldName(s: str) -> str:
        return s.replace("-", "_").replace("&", "_").replace("\"", "_")\
            .replace("[", "_").replace("]", "_").replace(".", "_")
    # We call this on all fields to create a copy and to perform any
    # changes we might want to do to the field.
    def sanitizeField(field: StructField) -> StructField:
        field = copy(field)
        field.name = sanitizeFieldName(field.name)
        # We recursively call cleanSchema on all types
        field.dataType = cleanSchema(field.dataType)
        return field
    def cleanSchema(dataType: DataType) -> DataType:
        dataType = copy(dataType)
        # If the type is a StructType we need to recurse, otherwise
        # we can return since we've reached the leaf node
        if isinstance(dataType, StructType):
            # We call our sanitizer for all top level fields
            dataType.fields = [sanitizeField(f) for f in dataType.fields]
        elif isinstance(dataType, ArrayType):
            dataType.elementType = cleanSchema(dataType.elementType)
        return dataType
    # Now since we have the new schema we can create a new DataFrame
    # by using the old Frame's RDD as data and the new schema as the
    # schema for the data
    return spark.createDataFrame(df.rdd, cleanSchema(df.schema))
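A small hypothetical usage sketch (the nested field name "a.b" below is made up just to trigger the sanitizer):
from pyspark.sql.types import IntegerType
# A struct column containing a field whose name has a forbidden dot.
dirty_schema = StructType([
    StructField("s", StructType([
        StructField("a.b", IntegerType()),
        StructField("c", IntegerType())]))])
df_dirty = spark.createDataFrame([((1, 2),)], dirty_schema)
cleanDataFrame(df_dirty).printSchema()
# root
#  |-- s: struct (nullable = true)
#  |    |-- a_b: integer (nullable = true)
#  |    |-- c: integer (nullable = true)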
I found a much easier way than the one provided by @zero323, along the lines of @MaxPY:
PySpark 2.4:
from pyspark.sql.types import StructField
# Get the schema from the dataframe df
schema = df.schema
# Override `fields` with a list of new StructFields, equal to the previous ones except for the names
schema.fields = list(map(lambda field:
    StructField(field.name + "_renamed", field.dataType), schema.fields))
# Override `names` with the same mechanism
schema.names = list(map(lambda name: name + "_renamed", schema.names))
Now df.schema will print all the renamed names.
Another, much easier solution, if it works for you like it did for me, is to flatten the structure and then rename:
Using Scala:
val df_flat = df.selectExpr("array_field.*")
Now the rename works
val df_renamed = df_flat.withColumnRenamed("a", "a_renamed")
Of course, this only works if you don't need the hierarchy (although I suppose it could be recreated again if needed).
Using the answer provided by Leo C in https://stackoverflow.com/a/55363153/5475506, I have built what I consider a more human-friendly/Pythonic script:
import pyspark.sql.types as sql_types
path_table = "<PATH_TO_DATA>"
table_name = "<TABLE_NAME>"
def recur_rename(schema: sql_types.StructType, old_char, new_char):
    schema_new = []
    for struct_field in schema:
        if type(struct_field.dataType) == sql_types.StructType:
            schema_new.append(sql_types.StructField(struct_field.name.replace(old_char, new_char), sql_types.StructType(recur_rename(struct_field.dataType, old_char, new_char)), struct_field.nullable, struct_field.metadata))
        elif type(struct_field.dataType) == sql_types.ArrayType:
            if type(struct_field.dataType.elementType) == sql_types.StructType:
                schema_new.append(sql_types.StructField(struct_field.name.replace(old_char, new_char), sql_types.ArrayType(sql_types.StructType(recur_rename(struct_field.dataType.elementType, old_char, new_char)), True), struct_field.nullable, struct_field.metadata))  # Recursive call to loop over all Array elements
            else:
                schema_new.append(sql_types.StructField(struct_field.name.replace(old_char, new_char), struct_field.dataType.elementType, struct_field.nullable, struct_field.metadata))  # If the Array element is not a struct, there is no sense in keeping the Array here, so the element type is used directly
        else:
            schema_new.append(sql_types.StructField(struct_field.name.replace(old_char, new_char), struct_field.dataType, struct_field.nullable, struct_field.metadata))
    return schema_new

def rename_columns(schema: sql_types.StructType, old_char, new_char):
    return sql_types.StructType(recur_rename(schema, old_char, new_char))
df = spark.read.format("json").load(path_table) # Read data whose schema has to be changed.
newSchema = rename_columns(df.schema, ":", "_") # Replace special characters in the schema (other special characters not allowed in the Spark/Hive metastore: ':', ',', ';')
df2 = spark.read.format("json").schema(newSchema).load(path_table) # Read the data with the new schema.
I consider the code self-explanatory (furthermore, it has comments), but what it does is recursively loop over all the fields in the schema, replacing "old_char" with "new_char" in each of them. If the field type is a nested one (StructType or ArrayType), new recursive calls are made.
