Unify schema across multiple rows of json strings in Spark Dataframe - python

I have a difficult issue regarding rows in a PySpark DataFrame that contain a series of JSON strings.
The problem is that each row might have a different schema from the others, so when I want to transform those rows into a subscriptable datatype in PySpark, I need a "unified" schema.
For example, consider this dataframe
import pandas as pd
json_1 = '{"a": 10, "b": 100}'
json_2 = '{"a": 20, "c": 2000}'
json_3 = '{"c": 300, "b": "3000", "d": 100.0, "f": {"some_other": {"A": 10}, "maybe_this": 10}}'
df = spark.createDataFrame(pd.DataFrame({'A': [1, 2, 3], 'B': [json_1, json_2, json_3]}))
Notice that each row contains a different version of the JSON string. To handle this, I apply the following transforms:
import json
import pyspark.sql.functions as fcn
from pyspark.sql import Row
from collections import OrderedDict
from pyspark.sql import DataFrame as SparkDataFrame


def convert_to_row(d: dict) -> Row:
    """Convert a dictionary to a SparkRow.

    Parameters
    ----------
    d : dict
        Dictionary to convert.

    Returns
    -------
    Row
    """
    return Row(**OrderedDict(sorted(d.items())))


def get_schema_from_dictionary(the_dict: dict):
    """Create a schema from a dictionary.

    Parameters
    ----------
    the_dict : dict

    Returns
    -------
    schema
        Schema understood by PySpark.
    """
    return spark.read.json(sc.parallelize([json.dumps(the_dict)])).schema


def get_universal_schema(df: SparkDataFrame, column: str):
    """Given a dataframe, retrieve the "global" schema for the column.

    NOTE: It does this by merging across all the rows, so this will
    take a long time for larger dataframes.

    Parameters
    ----------
    df : SparkDataFrame
        Dataframe containing the column
    column : str
        Column to parse.

    Returns
    -------
    schema
        Schema understood by PySpark.
    """
    col_values = [json.loads(getattr(item, column)) for item in df.select(column).collect()]
    mega_dict = {}
    for value in col_values:
        mega_dict = {**mega_dict, **value}
    return get_schema_from_dictionary(mega_dict)


def get_sample_schema(df, column):
    """Given a dataframe, sample a single value to convert.

    NOTE: This assumes that the dataframe has the same schema
    over all rows.

    Parameters
    ----------
    df : SparkDataFrame
        Dataframe containing the column
    column : str
        Column to parse.

    Returns
    -------
    schema
        Schema understood by PySpark.
    """
    return get_universal_schema(df.limit(1), column)


def from_json(df: SparkDataFrame, column: str, manual_schema=None, merge: bool = False) -> SparkDataFrame:
    """Convert json-string column to a subscriptable object.

    Parameters
    ----------
    df : SparkDataFrame
        Dataframe containing the column
    column : str
        Column to parse.
    manual_schema : PysparkSchema, optional
        Schema understood by PySpark, by default None
    merge : bool, optional
        Parse the whole dataframe to extract a global schema, by default False

    Returns
    -------
    SparkDataFrame
    """
    if manual_schema is None or manual_schema == {}:
        if merge:
            schema = get_universal_schema(df, column)
        else:
            schema = get_sample_schema(df, column)
    else:
        schema = manual_schema

    return df.withColumn(column, fcn.from_json(column, schema))
Then, I can simply do the following, to get a new dataframe, which has a unified schema
df = from_json(df, column='B', merge=True)
df.printSchema()
root
|-- A: long (nullable = true)
|-- B: struct (nullable = true)
| |-- a: long (nullable = true)
| |-- b: string (nullable = true)
| |-- c: long (nullable = true)
| |-- d: double (nullable = true)
| |-- f: struct (nullable = true)
| | |-- maybe_this: long (nullable = true)
| | |-- some_other: struct (nullable = true)
| | | |-- A: long (nullable = true)
Now we come to the crux of the issue. Since I'm doing col_values = [json.loads(getattr(item, column)) for item in df.select(column).collect()], I'm limited by the amount of memory on the master node.
How can I do a similar procedure such that more of the work is distributed to the workers, instead of collecting everything to the master node first?

If I understand your question correctly: an RDD can be passed as the path parameter of the spark.read.json() method, and since an RDD is distributed, this avoids the potential OOM issue that calling collect() on a large dataset would cause. So you can try adjusting the function get_universal_schema to the following:
def get_universal_schema(df: SparkDataFrame, column: str):
    return spark.read.json(df.select(column).rdd.map(lambda x: x[0])).schema
and keep the two functions get_sample_schema() and from_json() as-is.
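As a quick sanity check, a minimal sketch (assuming the example df and the from_json() helper from the question are still in scope); the rest of the pipeline should work unchanged, with schema inference now happening on the executors:
# Schema inference is now done by spark.read.json on a distributed RDD,
# so the JSON strings are never collected to the driver.
schema = get_universal_schema(df, column='B')
print(schema.simpleString())
# expected to be along the lines of:
# struct<a:bigint,b:string,c:bigint,d:double,f:struct<maybe_this:bigint,some_other:struct<A:bigint>>>

df = from_json(df, column='B', merge=True)
df.printSchema()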

Spark DataFrames are designed to work with data that has a schema. The DataFrame API exposes methods that are useful on data with a defined schema, such as grouping by a column or running aggregation functions over columns.
Given the requirements presented in the question, it appears to me that there is no fixed schema in the input data, and that you won't benefit from the DataFrame API. In fact, it will likely add more constraints.
I think it is better to consider this data "schemaless" and use a lower-level API: RDDs. RDDs are distributed across the cluster by definition. So, using the RDD API, you can first pre-process the data (consuming it as text) and then convert it to a DataFrame, as sketched below.
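For illustration, a minimal sketch of that idea, assuming the example df and column 'B' from the question; the normalise() function is a hypothetical placeholder for whatever pre-processing you need:
import json

def normalise(record: dict) -> dict:
    # hypothetical placeholder: flatten keys, coerce types, drop fields, ...
    return record

# Parse and pre-process the JSON strings on the workers (as text), ...
json_rdd = (df.select('B').rdd
              .map(lambda row: json.loads(row[0]))   # distributed JSON parsing
              .map(normalise)                        # distributed pre-processing
              .map(json.dumps))                      # back to JSON strings

# ... and only then convert to a DataFrame, letting Spark merge one schema
# across the whole pre-processed dataset.
parsed_df = spark.read.json(json_rdd)
parsed_df.printSchema()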

Related

Write a pyspark.sql.dataframe.DataFrame without losing information

I am trying to save a pyspark.sql.dataframe.DataFrame in CSV format (it could also be another format, as long as it is easily readable).
So far, I have found a couple of examples to save the DataFrame. However, it loses information every time I write it.
Dataset example:
# Create an example Pyspark DataFrame
from pyspark.sql import Row
Employee = Row("firstName", "lastName", "email", "salary")
employee1 = Employee('A', 'AA', 'mail1', 100000)
employee2 = Employee('B', 'BB', 'mail2', 120000)
employee3 = Employee('C', None, 'mail3', 140000)
employee4 = Employee('D', 'DD', 'mail4', 160000)
employee5 = Employee('E', 'EE', 'mail5', 160000)
department1 = Row(id='123', name='HR')
department2 = Row(id='456', name='OPS')
department3 = Row(id='789', name='FN')
department4 = Row(id='101112', name='DEV')
departmentWithEmployees1 = Row(department=department1, employees=[employee1, employee2, employee5])
departmentWithEmployees2 = Row(department=department2, employees=[employee3, employee4])
departmentWithEmployees3 = Row(department=department3, employees=[employee1, employee4, employee3])
departmentWithEmployees4 = Row(department=department4, employees=[employee2, employee3])
departmentsWithEmployees_Seq = [departmentWithEmployees1, departmentWithEmployees2]
dframe = spark.createDataFrame(departmentsWithEmployees_Seq)
In order to save this file as CSV, I firstly tried this solution:
type(dframe)
Out[]: pyspark.sql.dataframe.DataFrame
dframe.write.csv('junk_mycsv.csv')
Unfortunately, that results in this error:
org.apache.spark.sql.AnalysisException: CSV data source does not support struct<id:string,name:string> data type.;
That is the reason why I tried another possibility, to convert the spark dataframe into a pandas dataframe, and save it then. As mentioned in this example.
pandas_df = dframe.toPandas()
Works fine! However, if I show my data, it is missing information:
print(pandas_df.head())
department employees
0 (123, HR) [(A, AA, mail1, 100000), (B, BB, mail2, 120000...
1 (456, OPS) [(C, None, mail3, 140000), (D, DD, mail4, 1600...
As you can see, we are missing information, because the data should be like this:
department employees
0 id:123, name:HR firstName: A, lastName: AA, email: mail1, salary: 100000
# Info is missing like 'id', 'name', 'firstName', 'lastName', 'email' etc.
# For the complete expected example, see the screenshot below.
Just for information: I am working in Databricks, with Python.
Therefore, how can I write my data (dframe from the example above) without losing information?
Many thanks in advance!
Edit
Adding a picture for Pault, to show the format of the csv (and the headers).
Edit2
Replacing the picture for example csv output:
After running Pault's code:
from pyspark.sql.functions import to_json
dframe.select(*[to_json(c).alias(c) for c in dframe.columns])\
    .repartition(1).write.csv("junk_mycsv.csv", header=True)
The output is not tidy, since most column headers are empty (due to the nested format?). Only copying the first row:
department employees (empty ColName) (empty ColName) (and so on)
{\id\":\"123\" \"name\":\"HR\"}" [{\firstName\":\"A\" \"lastName\":\"AA\" (...)
Your dataframe has the following schema:
dframe.printSchema()
#root
# |-- department: struct (nullable = true)
# | |-- id: string (nullable = true)
# | |-- name: string (nullable = true)
# |-- employees: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- firstName: string (nullable = true)
# | | |-- lastName: string (nullable = true)
# | | |-- email: string (nullable = true)
# | | |-- salary: long (nullable = true)
So the department column is a StructType with two named fields and the employees column is an array of structs with four named fields. It appears what you want is to write the data in a format that saves both the key and the value for each record.
One option is to write the file in JSON format instead of CSV:
dframe.write.json("junk.json")
Which produces the following output:
{"department":{"id":"123","name":"HR"},"employees":[{"firstName":"A","lastName":"AA","email":"mail1","salary":100000},{"firstName":"B","lastName":"BB","email":"mail2","salary":120000},{"firstName":"E","lastName":"EE","email":"mail5","salary":160000}]}
{"department":{"id":"456","name":"OPS"},"employees":[{"firstName":"C","email":"mail3","salary":140000},{"firstName":"D","lastName":"DD","email":"mail4","salary":160000}]}
Or if you wanted to keep it in CSV format, you can use to_json to convert each column to JSON before writing the CSV.
# looping over all columns
# but you can also just limit this to the columns you want to convert
from pyspark.sql.functions import to_json
dframe.select(*[to_json(c).alias(c) for c in dframe.columns])\
    .write.csv("junk_mycsv.csv")
This produces the following output:
"{\"id\":\"123\",\"name\":\"HR\"}","[{\"firstName\":\"A\",\"lastName\":\"AA\",\"email\":\"mail1\",\"salary\":100000},{\"firstName\":\"B\",\"lastName\":\"BB\",\"email\":\"mail2\",\"salary\":120000},{\"firstName\":\"E\",\"lastName\":\"EE\",\"email\":\"mail5\",\"salary\":160000}]"
"{\"id\":\"456\",\"name\":\"OPS\"}","[{\"firstName\":\"C\",\"email\":\"mail3\",\"salary\":140000},{\"firstName\":\"D\",\"lastName\":\"DD\",\"email\":\"mail4\",\"salary\":160000}]"
Note that the double-quotes are escaped.
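If you later need that CSV back in structured form, one option (a minimal sketch, assuming the file and schema shown above; the column names passed to toDF() are just for readability, since the CSV was written without a header) is to parse the JSON strings again with from_json:
from pyspark.sql.functions import from_json

# DDL strings matching the schema printed above
dept_schema = "struct<id:string,name:string>"
emp_schema = "array<struct<firstName:string,lastName:string,email:string,salary:bigint>>"

read_back = (spark.read.csv("junk_mycsv.csv")
                  .toDF("department", "employees")
                  .withColumn("department", from_json("department", dept_schema))
                  .withColumn("employees", from_json("employees", emp_schema)))
read_back.printSchema()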

Pre-define datatype of dataframe while reading json

I have one JSON file with 100 columns, and I want to read all the columns along with a predefined datatype for two of them.
I know that I could do this with the schema option:
struct1 = StructType([StructField("npi", StringType(), True), StructField("NCPDP", StringType(), True)])
spark.read.json(path="abc.json", schema=struct1)
However, this code reads only two columns:
>>> df.printSchema()
root
|-- npi: string (nullable = true)
|-- NCPDP: string (nullable = true)
To use the above code I would have to give the data type of all 100 columns. How can I solve this?
According to the official documentation, schema can be either a StructType or a String.
I can advise two solutions:
1 - You use the schema of a dummy file
If you have one small file with the same schema (i.e. one line with the same structure), you can read it as a DataFrame and then use its schema for your other JSON files:
df = spark.read.json("/path/to/dummy/file.json")
schm = df.schema
df = spark.read.json(path="abc.json", schema=schm)
2 - You generate the schema
This approach requires you to provide the column names (and possibly the types too).
Let's assume col is a dict with (key, value) as (column name, column type).
col_list = ['{col_name} {col_type}'.format(
    col_name=col_name,
    col_type=col_type,
) for col_name, col_type in col.items()]
schema_string = ', '.join(col_list)
df = spark.read.json(path="abc.json", schema=schema_string)
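For instance, a minimal hypothetical example of that dict, covering only the two columns from the question (the remaining 98 name/type pairs would be added the same way):
# Hypothetical contents: just the two columns from the question
col = {"npi": "string", "NCPDP": "string"}

col_list = ['{col_name} {col_type}'.format(col_name=col_name, col_type=col_type)
            for col_name, col_type in col.items()]
schema_string = ', '.join(col_list)
print(schema_string)  # npi string, NCPDP string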
You can read all the data first and then convert the two columns in question:
df = spark.read.json(path="abc.json")
df = df.withColumn("npi", df["npi"].cast("string"))\
       .withColumn("NCPDP", df["NCPDP"].cast("string"))

PySpark: How to judge column type of dataframe

Suppose we have a dataframe called df. I know there is a way of using df.dtypes. However, I prefer something similar to
type(123) == int # note here the int is not a string
I wonder if there is something like:
type(df.select(<column_name>).collect()[0][1]) == IntegerType
Basically I want to know the way to directly get the object of the class like IntegerType, StringType from the dataframe and then judge it.
Thanks!
TL;DR Use external data types (plain Python types) to test values, internal data types (DataType subclasses) to test schema.
First and foremost, you should never use
type(123) == int
The correct way to check types in Python, which handles inheritance, is
isinstance(123, int)
Having done this, let's talk about
Basically I want to know the way to directly get the object of the class like IntegerType, StringType from the dataframe and then judge it.
This is not how it works. DataTypes describe the schema (internal representation), not the values. An external type is a plain Python object, so if the internal type is IntegerType, then the external type is int, and so on, according to the rules defined in the Spark SQL Programming Guide.
The only place where IntegerType (or other DataType) instances exist is your schema:
from pyspark.sql.types import *
df = spark.createDataFrame([(1, "foo")])
isinstance(df.schema["_1"].dataType, LongType)
# True
isinstance(df.schema["_2"].dataType, StringType)
# True
_1, _2 = df.first()
isinstance(_1, int)
# True
isinstance(_2, str)
# True
What about trying:
df.printSchema()
This will return something like:
root
|-- id: integer (nullable = true)
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: integer (nullable = true)
|-- col4: date (nullable = true)
|-- col5: long (nullable = true)
If there's a need to check the detailed structure under an ArrayType or StructType schema, I'd still prefer using df.dtypes, and then using XXXType.simpleString() on the type object to verify the complex schema more easily.
For example,
import pyspark.sql.types as T
df_dtypes = dict(df.dtypes)
# {'column1': 'array<string>',
# 'column2': 'array<struct<fieldA:string,fieldB:bigint>>'}
### if want to verify the complex type schema
column1_require_type = T.ArrayType(T.StringType())
column2_require_type = T.ArrayType(T.StructType([
    T.StructField("fieldA", T.StringType()),
    T.StructField("fieldB", T.LongType()),
]))
column1_type_string = column1_require_type.simpleString() # array<string>
column2_type_string = column2_require_type.simpleString() # array<struct<fieldA:string,fieldB:bigint>>
# easy verification for complex structure
assert df_dtypes['column1'] == column1_type_string # True
assert df_dtypes['column2'] == column2_type_string # True
I think it's helpful if you need to verify a complex schema. This works for me (I'm using PySpark 3.2).

Extract data out of ORC using Pyspark

I have an ORC file that I am able to read into a DataFrame using PySpark 2.2.0:
from pyspark.context import SparkContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
df = spark.read.orc("s3://leadid-sandbox/krish/lead_test/")
The above df has a schema like below
root
|-- item: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Sample data looks like this (just a sample, not the entire dataset):
item
{http_Accept-Language={"s":"en-US"}, Win64={"n":"1"},
geoip_region={"s":"FL"}, Platform={"s":"Win7"}, geoip_postal_code=
{"s":"33432"}, JavaApplets={"n":"1"}, http_Accept={"s":"*/*"},
Version={"s":"11.0"}, Cookies={"n":"1"}, Platform_Version=
{"s":"6.1"}, http_Content-Type={"s":"application/x-www-form-
urlencoded"}}
{http_Accept-Language={"s":"en-US"}, Win64={"n":"1"}, IFrames=
{"n":"1"}, geoip_region={"s":"CA"}, Platform={"s":"Win7"}, Parent=
{"s":"IE 11.0"}, http_Dnt={"n":"1"}}
So I exploded "item" like below
expDf = df.select(explode("item"))
The above DataFrame has the schema below, and show(2) gives the following details:
root
|-- key: string (nullable = false)
|-- value: string (nullable = true)
+------------+-----------+
|         key|      value|
+------------+-----------+
|geoip_region|{"s": "FL"}|
|      Tables| {"n": "1"}|
+------------+-----------+
How can I select data out of this DataFrame? I have tried different ways, but to no avail.
I would need 'geoip_region' with the value 'FL', and so on.
Any help is appreciated.
I am not sure about your full use case, but if it is just about having access to the keys and values inside "item", you can do it using the following sample code:
row = df.select(df.item).collect()
The above line will give you a list of Row objects like [Row(item={http_Accept-Language={"s":"en-US"}, Win64={"n":"1"}, ...})]
Then, to select all the values inside the row, you can do: row_item = row[0]['item']
row_item['http_Accept'] will give you access to u'{"s":"*/*"}'
and eval(row_item['http_Accept']) will give you a dictionary from which you can get its keys and values.
I have just outlined the process; it can be written with loops to iterate over all the keys/values, as sketched below.
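A minimal sketch of such a loop (assuming the df from the question; json.loads is used here instead of eval, which is safer for parsing the JSON values):
import json

rows = df.select(df.item).collect()   # note: this collects to the driver

for row in rows:
    for key, raw_value in row['item'].items():
        value = json.loads(raw_value)             # e.g. '{"s": "FL"}' -> {'s': 'FL'}
        print(key, value.get('s', value.get('n')))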
Thanks Joshi for your response. For some reason I was getting a row[0] not found error in my code; I was running this in an AWS Glue environment, which may be the reason.
I got what I wanted using the code below.
# Creating a DataFrame of the raw file
df = spark.read.orc("s3://leadid-sandbox/krish/lead_test/")

# Creating a temp view called "leads" for the above DataFrame
df.createOrReplaceTempView("leads")

# Extracting the data using normal SQL from the above created temp view
tblSel = spark.sql("SELECT get_json_object(item['token'], '$.s') as token, "
                   "get_json_object(item['account_code'], '$.s') as account_code "
                   "from leads").show()

Rename nested field in spark dataframe

Having a dataframe df in Spark:
|-- array_field: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: string (nullable = true)
| | |-- b: long (nullable = true)
| | |-- c: long (nullable = true)
How to rename field array_field.a to array_field.a_renamed?
[Update]:
.withColumnRenamed() does not work with nested fields, so I tried this hacky and unsafe method:
# First alter the schema:
schema = df.schema
schema['array_field'].dataType.elementType['a'].name = 'a_renamed'
ind = schema['array_field'].dataType.elementType.names.index('a')
schema['array_field'].dataType.elementType.names[ind] = 'a_renamed'
# Then set dataframe's schema with altered schema
df._schema = schema
I know that setting a private attribute is not good practice, but I don't know another way to set the schema for df.
I think I am on the right track, but df.printSchema() still shows the old name for array_field.a, even though df.schema == schema is True.
Python
It is not possible to modify a single nested field. You have to recreate the whole structure. In this particular case, the simplest solution is to use cast.
First a bunch of imports:
from collections import namedtuple
from pyspark.sql.functions import col
from pyspark.sql.types import (
ArrayType, LongType, StringType, StructField, StructType)
and example data:
Record = namedtuple("Record", ["a", "b", "c"])
df = sc.parallelize([([Record("foo", 1, 3)], )]).toDF(["array_field"])
Let's confirm that the schema is the same as in your case:
df.printSchema()
root
|-- array_field: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: string (nullable = true)
| | |-- b: long (nullable = true)
| | |-- c: long (nullable = true)
You can define a new schema for example as a string:
str_schema = "array<struct<a_renamed:string,b:bigint,c:bigint>>"
df.select(col("array_field").cast(str_schema)).printSchema()
root
|-- array_field: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a_renamed: string (nullable = true)
| | |-- b: long (nullable = true)
| | |-- c: long (nullable = true)
or a DataType:
struct_schema = ArrayType(StructType([
    StructField("a_renamed", StringType()),
    StructField("b", LongType()),
    StructField("c", LongType())
]))
df.select(col("array_field").cast(struct_schema)).printSchema()
root
|-- array_field: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a_renamed: string (nullable = true)
| | |-- b: long (nullable = true)
| | |-- c: long (nullable = true)
Scala
The same techniques can be used in Scala:
case class Record(a: String, b: Long, c: Long)
val df = Seq(Tuple1(Seq(Record("foo", 1, 3)))).toDF("array_field")
val strSchema = "array<struct<a_renamed:string,b:bigint,c:bigint>>"
df.select($"array_field".cast(strSchema))
or
import org.apache.spark.sql.types._
val structSchema = ArrayType(StructType(Seq(
  StructField("a_renamed", StringType),
  StructField("b", LongType),
  StructField("c", LongType)
)))
df.select($"array_field".cast(structSchema))
Possible improvements:
If you use an expressive data manipulation or JSON processing library, it could be easier to dump the data types to a dict or a JSON string and take it from there, for example (Python / toolz):
from toolz.curried import pipe, assoc_in, update_in, map
from operator import attrgetter

# Update name to "a_updated" if name is "a"
rename_field = update_in(
    keys=["name"], func=lambda x: "a_updated" if x == "a" else x)

updated_schema = pipe(
    # Get schema of the field as a dict
    df.schema["array_field"].jsonValue(),
    # Update fields with rename
    update_in(
        keys=["type", "elementType", "fields"],
        func=lambda x: pipe(x, map(rename_field), list)),
    # Load schema from dict
    StructField.fromJson,
    # Get data type
    attrgetter("dataType"))

df.select(col("array_field").cast(updated_schema)).printSchema()
You can recurse over the data frame's schema to create a new schema with the required changes.
A schema in PySpark is a StructType, which holds a list of StructFields, and each StructField can hold some primitive type or another StructType.
This means that we can decide if we want to recurse based on whether the type is a StructType or not.
Below is an annotated sample implementation that shows you how you can implement the above idea.
# Some imports
from copy import copy

from pyspark.sql import DataFrame
from pyspark.sql.types import DataType, StructField, StructType, ArrayType


# We take a dataframe and return a new one with required changes
def cleanDataFrame(df: DataFrame) -> DataFrame:
    # Returns a new sanitized field name (this function can be anything really)
    def sanitizeFieldName(s: str) -> str:
        return s.replace("-", "_").replace("&", "_").replace("\"", "_")\
            .replace("[", "_").replace("]", "_").replace(".", "_")

    # We call this on all fields to create a copy and to perform any
    # changes we might want to do to the field.
    def sanitizeField(field: StructField) -> StructField:
        field = copy(field)
        field.name = sanitizeFieldName(field.name)
        # We recursively call cleanSchema on all types
        field.dataType = cleanSchema(field.dataType)
        return field

    def cleanSchema(dataType: DataType) -> DataType:
        dataType = copy(dataType)
        # If the type is a StructType we need to recurse, otherwise
        # we can return since we've reached the leaf node
        if isinstance(dataType, StructType):
            # We call our sanitizer for all top level fields
            dataType.fields = [sanitizeField(f) for f in dataType.fields]
        elif isinstance(dataType, ArrayType):
            dataType.elementType = cleanSchema(dataType.elementType)
        return dataType

    # Now since we have the new schema we can create a new DataFrame
    # by using the old Frame's RDD as data and the new schema as the
    # schema for the data
    return spark.createDataFrame(df.rdd, cleanSchema(df.schema))
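Usage is then a one-liner; a minimal sketch, reusing the df with the nested array_field from earlier in this thread (only names containing the replaced special characters will actually change):
# Rebuild the DataFrame with sanitized field names at every nesting level
cleaned = cleanDataFrame(df)
cleaned.printSchema()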
I found a much easier way than the one provided by @zero323, along the lines of @MaxPY:
PySpark 2.4:
from pyspark.sql.types import StructField

# Get the schema from the dataframe df
schema = df.schema

# Override `fields` with a list of new StructFields, equal to the previous ones except for the names
schema.fields = list(map(lambda field:
    StructField(field.name + "_renamed", field.dataType), schema.fields))

# Also override `names` with the same mechanism
schema.names = list(map(lambda name: name + "_renamed", schema.names))
Now df.schema will show all the renamed names.
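If you then need a DataFrame whose data actually carries that renamed schema, one option (a sketch along the same lines as the recursive answer above) is to rebuild it from the original RDD:
# Recreate the DataFrame with the mutated schema applied to the same data
df_renamed = spark.createDataFrame(df.rdd, schema)
df_renamed.printSchema()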
Another, much easier solution, if it works for you like it works for me, is to flatten the structure and then rename:
Using Scala:
val df_flat = df.selectExpr("array_field.*")
Now the rename works
val df_renamed = df_flat.withColumnRenamed("a", "a_renamed")
Of course, this only works if you don't need the hierarchy (although I suppose it can be recreated again if needed).
Using the answer provided by Leo C in https://stackoverflow.com/a/55363153/5475506, I have built what I consider a more human-friendly/pythonic script:
import pyspark.sql.types as sql_types
from pyspark.sql.types import StructType

path_table = "<PATH_TO_DATA>"
table_name = "<TABLE_NAME>"

def recur_rename(schema: StructType, old_char, new_char):
    schema_new = []
    for struct_field in schema:
        if type(struct_field.dataType) == sql_types.StructType:
            schema_new.append(sql_types.StructField(struct_field.name.replace(old_char, new_char), sql_types.StructType(recur_rename(struct_field.dataType, old_char, new_char)), struct_field.nullable, struct_field.metadata))
        elif type(struct_field.dataType) == sql_types.ArrayType:
            if type(struct_field.dataType.elementType) == sql_types.StructType:
                schema_new.append(sql_types.StructField(struct_field.name.replace(old_char, new_char), sql_types.ArrayType(sql_types.StructType(recur_rename(struct_field.dataType.elementType, old_char, new_char)), True), struct_field.nullable, struct_field.metadata))  # Recursive call to loop over all Array elements
            else:
                schema_new.append(sql_types.StructField(struct_field.name.replace(old_char, new_char), struct_field.dataType.elementType, struct_field.nullable, struct_field.metadata))  # If the ArrayType does not wrap a struct, the Array wrapper is dropped and the element type is used directly
        else:
            schema_new.append(sql_types.StructField(struct_field.name.replace(old_char, new_char), struct_field.dataType, struct_field.nullable, struct_field.metadata))
    return schema_new

def rename_columns(schema: StructType, old_char, new_char):
    return sql_types.StructType(recur_rename(schema, old_char, new_char))

df = spark.read.format("json").load(path_table)  # Read data whose schema has to be changed.
newSchema = rename_columns(df.schema, ":", "_")  # Replace special characters in the schema (more special characters not allowed in the Spark/Hive metastore: ':', ',', ';')
df2 = spark.read.format("json").schema(newSchema).load(path_table)  # Read data with the new schema.
I consider the code self-explanatory (furthermore, it has comments), but what it does is recursively loop over all the fields in the schema, replacing "old_char" with "new_char" in each of them. If a field's type is a nested one (StructType or ArrayType), new recursive calls are made.
