Let's say there are two dataframes: a reference dataframe and a target dataframe.
The reference DF defines the reference schema.
Schema for reference DF (r_df)
r_df.printSchema()
root
|-- _id: string (nullable = true)
|-- notificationsSend: struct (nullable = true)
| |-- mail: boolean (nullable = true)
| |-- sms: boolean (nullable = true)
|-- recordingDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- channelName: string (nullable = true)
| | |-- fileLink: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- recorderId: string (nullable = true)
| | |-- resourceId: string (nullable = true)
However, the target dataframe's schema is dynamic in nature.
Schema for target DF (t_df)
t_df.printSchema()
root
|-- _id: string (nullable = true)
|-- notificationsSend: struct (nullable = true)
| |-- sms: string (nullable = true)
|-- recordingDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- channelName: string (nullable = true)
| | |-- fileLink: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- recorderId: string (nullable = true)
| | |-- resourceId: string (nullable = true)
| | |-- createdBy: string (nullable = true)
So we observe multiple changes in the target's schema:
Struct or array columns in t_df can have more or fewer nested fields.
The datatype of a column can change too, so type casting is required. (Ex. the sms column is boolean in r_df but string in t_df.)
I was able to add/remove columns of non-struct datatypes. However, structs and arrays are a real pain for me. Since there are 50+ columns, I need an optimised solution that works for all of them.
Any solution/opinion/workaround will be really helpful.
Expected output
I want to make my t_df's schema exactly same as my r_df's schema.
The code below is untested but should show how to do it (written from memory, without testing).
There may be a way to get the fields from a struct, but I'm not aware how, so I'm interested to hear others' ideas.
Extract struct column names and types.
Find columns that need to be dropped.
Drop those columns.
Rebuild structs according to r_df.
from pyspark.sql.functions import col, lit

structs_in_r_df = [field.name for field in r_df.schema.fields if str(field.dataType).startswith("Struct")]  # list comprehension to collect the struct columns
struct_columns = []
for struct_name in structs_in_r_df:  # get a list of the fields in each struct
    struct_columns.append(r_df.select(f"{struct_name}.*").columns)

missing_columns = list(set(r_df.columns) - set(t_df.columns))  # columns present in r_df but missing from t_df
similar_columns = list(set(r_df.columns).intersection(set(t_df.columns)))  # columns present in both
# remove struct columns from both lists so you don't represent them twice.
# you need to repeat the above intersection/missing logic for the structs and then rebuild them,
# but the above gives you the idea of how to get the fields out.
# you can use variable substitution, e.g. col(f"{struct_name}.{field}"), to get the values out of the fields.
result = r_df.unionByName(  # unionByName matches columns by name, so the select order below doesn't matter
    t_df.select(*(
        [lit(None).cast(dict(r_df.dtypes)[column]).alias(column) for column in missing_columns] +
        [col(column).cast(dict(r_df.dtypes)[column]).alias(column) for column in similar_columns]
    ))  # list comprehensions concatenated and passed as varargs to select dynamically pull out the values you need
)
Here's a way, once you have the union, to rebuild the structs:
result = result\
    .select(
        col("_id"),
        struct(col("sms").alias("sms")).alias("notificationsSend"),
        struct(*[col(column).alias(column) for column in struct_columns]  # pass varargs to struct() with the columns
        ).alias("recordingDetails")  # reconstitute the struct
    )
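On the "get fields from a struct" point above: the nested field names and types are available directly from the DataFrame's schema object, without a select. A minimal sketch (my own illustration, using the r_df from the question; StructType exposes .fields, and ArrayType wraps its element type in .elementType):
from pyspark.sql.types import StructType, ArrayType

def nested_fields(schema, prefix=""):
    """Recursively collect (qualified name, dataType) pairs for nested fields."""
    out = []
    for field in schema.fields:
        name = f"{prefix}{field.name}"
        dtype = field.dataType
        # unwrap arrays so array<struct<...>> is handled like a struct
        while isinstance(dtype, ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, StructType):
            out.extend(nested_fields(dtype, prefix=f"{name}."))
        else:
            out.append((name, dtype))
    return out

# e.g. [('_id', StringType()), ('notificationsSend.mail', BooleanType()), ...]
print(nested_fields(r_df.schema))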
I am trying to lower the case of all column names in a PySpark DataFrame schema, including the element names of complex-type columns.
Example:
original_df
|-- USER_ID: long (nullable = true)
|-- COMPLEX_COL_ARRAY: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- KEY: timestamp (nullable = true)
| | |-- VALUE: integer (nullable = true)
target_df
|-- user_id: long (nullable = true)
|-- complex_col_array: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: timestamp (nullable = true)
| | |-- value: integer (nullable = true)
However, I've only been able to lower the case of the top-level column names using the script below:
from pyspark.sql.types import StructField
schema = df.schema
schema.fields = list(map(lambda field: StructField(field.name.lower(), field.dataType, field.nullable), schema.fields))
I know I can access the field names of nested elements using this syntax:
for f in schema.fields:
    if hasattr(f.dataType, 'elementType') and hasattr(f.dataType.elementType, 'fieldNames'):
        print(f.dataType.elementType.fieldNames())
But how can I modify the case of these field names?
Thanks for your help!
Suggesting an answer to my own question, inspired by this question here: Rename nested field in spark dataframe
from pyspark.sql.types import StructField
# Read parquet file
path = "/path/to/data"
df = spark.read.parquet(path)
schema = df.schema
# Lower the case of all fields that are not nested
schema.fields = list(map(lambda field: StructField(field.name.lower(), field.dataType, field.nullable), schema.fields))
for f in schema.fields:
    # if the field is nested and has named elements, lower the case of all element names
    if hasattr(f.dataType, 'elementType') and hasattr(f.dataType.elementType, 'fieldNames'):
        for e in f.dataType.elementType.fieldNames():
            schema[f.name].dataType.elementType[e].name = schema[f.name].dataType.elementType[e].name.lower()
            ind = schema[f.name].dataType.elementType.names.index(e)
            schema[f.name].dataType.elementType.names[ind] = e.lower()
# Recreate dataframe with lowercase schema
df_lowercase = spark.createDataFrame(df.rdd, schema)
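An alternative sketch (not part of the original answer) that avoids mutating the schema in place: rebuild the schema recursively with lowercased names and cast each column to the rebuilt type. Spark casts struct to struct field by field by position, so when only the names change the data itself is untouched (only StructType and ArrayType are handled here; MapType is omitted for brevity):
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, ArrayType

def lowercase_type(dtype):
    """Return a copy of dtype with every nested field name lowercased."""
    if isinstance(dtype, StructType):
        return StructType([
            StructField(f.name.lower(), lowercase_type(f.dataType), f.nullable)
            for f in dtype.fields
        ])
    if isinstance(dtype, ArrayType):
        return ArrayType(lowercase_type(dtype.elementType), dtype.containsNull)
    return dtype

# cast each column to its lowercased type and rename the top-level column
df_lowercase = df.select([
    col(f.name).cast(lowercase_type(f.dataType)).alias(f.name.lower())
    for f in df.schema.fields
])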
I have a use case where I groupBy on a set of columns and collect the values of another column with collect_set(col):
df.groupBy("col1","col2","col3").agg(collect_set(col("col4")))
When I do this, the schema of df comes out as below:
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: string (nullable = true)
|-- col4: array (nullable = true)
| |-- *element*: string (containsNull = true)
When a downstream job (a Java program) picks up this file (which I wrote to S3 in Parquet format) and converts it to JSON, the col4 values come out like this:
"col4": [
{
"element": "val1"
},
{
"element": "val2"
},
{
"element": "val3"
}
]
Any idea why this comes out as key:value pairs here, even though I collected the values into a list?
What is the significance of 'element' here?
How to effectively use StructType fields in PySpark?
I have packed my nested JSON as string columns in my PySpark dataframe, and I am trying to perform an UPSERT on some columns based on a groupBy.
Input:
from pyspark.sql.functions import *
from pyspark.sql.types import *
input_json = """[{
"candidate_email": "cust1#email.com",
"transactions":"[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}]"
},
{
"candidate_email": "cust1#email.com",
"transactions":"[{'transaction_id':'12', 'transaction_amount':'$23.43'}]"
}
]
"""
input_df = spark.read.json(sc.parallelize([input_json]), multiLine=True)
input_df.printSchema()
# root
# |-- candidate_email: string (nullable = true)
# |-- transactions: string (nullable = true)
Transformation & current output:
output_df = input_df.groupBy("candidate_email").agg(collect_list(col("transactions")).alias("transactions"))
output_df.printSchema()
output_df.collect()
# root
# |-- candidate_email: string (nullable = true)
# |-- transactions: array (nullable = true)
# | |-- element: string (containsNull = true)
# Out[161]:
# [Row(candidate_email='cust1#email.com', transactions=["[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}]", "[{'transaction_id':'12', 'transaction_amount':'$23.43'}]"])]
But what changes should I make in above code to get this output:
desired output:
output_json = """[{
"candidate_email": "cust1#email.com",
"transactions":"[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}, {'transaction_id':'12', 'transaction_amount':'$23.43'}]"
}]"""
output_df = spark.read.json(sc.parallelize([output_json]), multiLine=True)
output_df.printSchema()
# root
# |-- candidate_email: string (nullable = true)
# |-- transactions: string (nullable = true)
Basically, I am trying to get a clean merge by having one list instead of multiple.
Thanks!
Since the transactions column is of string type, we need to convert it to an array type; then, by doing explode and groupBy + collect_list, we can achieve the expected result.
Example:
df.show(10,False)
#+---------------+----------------------------------------------------------------------------------------------------------------+
#|candidate_email|transactions |
#+---------------+----------------------------------------------------------------------------------------------------------------+
#|cust1#email.com|[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}]|
#|cust1#email.com|[{'transaction_id':'12', 'transaction_amount':'$23.43'}] |
#+---------------+----------------------------------------------------------------------------------------------------------------+
#to make a proper array we first replace (},) with (}},), then remove the brackets ("[" and "]"), then split on (},) which yields an array, and finally explode that array.
df1=df.selectExpr("candidate_email","""explode(split(regexp_replace(regexp_replace(transactions,'(\\\},)','}},'),'(\\\[|\\\])',''),"},")) as transactions""")
df1.show(10,False)
#+---------------+------------------------------------------------------+
#|candidate_email|transactions |
#+---------------+------------------------------------------------------+
#|cust1#email.com|{'transaction_id':'10', 'transaction_amount':'$55.46'}|
#|cust1#email.com|{'transaction_id':'11','transaction_amount':'$545.46'}|
#|cust1#email.com|{'transaction_id':'12', 'transaction_amount':'$23.43'}|
#+---------------+------------------------------------------------------+
#groupBy and collect the exploded transactions back into a single list
df2=df1.groupBy("candidate_email").\
agg(collect_list(col("transactions")).alias("transactions"))
df2.show(10,False)
#+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|candidate_email|transactions |
#+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|cust1#email.com|[{'transaction_id':'10', 'transaction_amount':'$55.46'}, {'transaction_id':'11','transaction_amount':'$545.46'}, {'transaction_id':'12', 'transaction_amount':'$23.43'}]|
#+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#creating json object column in dataframe
df2.selectExpr("to_json(struct(candidate_email,transactions)) as json").show(10,False)
#+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|json |
#+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|{"candidate_email":"cust1#email.com","transactions":["{'transaction_id':'10', 'transaction_amount':'$55.46'}","{'transaction_id':'11','transaction_amount':'$545.46'}","{'transaction_id':'12', 'transaction_amount':'$23.43'}"]}|
#+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#To write the output to a json file
df2.groupBy("candidate_email").agg(collect_list(col("transactions")).alias("transactions")).write.mode("overwrite").json("<path>")
#content of file
#{"candidate_email":"cust1#email.com","transactions":["{'transaction_id':'10', 'transaction_amount':'$55.46'}","{'transaction_id':'11','transaction_amount':'$545.46'}","{'transaction_id':'12', 'transaction_amount':'$23.43'}"]}
#converting to json by using toJSON
df2.groupBy("candidate_email").agg(collect_list(col("transactions")).alias("transactions")).toJSON().collect()
#[u'{"candidate_email":"cust1#email.com","transactions":["{\'transaction_id\':\'10\', \'transaction_amount\':\'$55.46\'}","{\'transaction_id\':\'11\',\'transaction_amount\':\'$545.46\'}","{\'transaction_id\':\'12\', \'transaction_amount\':\'$23.43\'}"]}']
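An alternative to the regex approach (my own sketch, not part of the original answer): parse the transactions string with from_json using an explicit element schema, flatten the collected arrays, and serialize back to a string with to_json. This assumes the embedded objects only have the transaction_id and transaction_amount keys shown in the question; also note the result uses standard double-quoted JSON rather than the single-quoted form, and if from_json returns nulls on the single-quoted input you may need to normalise the quotes first.
from pyspark.sql.functions import col, from_json, flatten, collect_list, to_json
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# schema of the array of transaction objects embedded in the string column
txn_schema = ArrayType(StructType([
    StructField("transaction_id", StringType()),
    StructField("transaction_amount", StringType()),
]))

merged = (input_df
    .withColumn("transactions", from_json(col("transactions"), txn_schema))
    .groupBy("candidate_email")
    .agg(flatten(collect_list(col("transactions"))).alias("transactions"))
    .withColumn("transactions", to_json(col("transactions"))))

merged.printSchema()
# root
#  |-- candidate_email: string (nullable = true)
#  |-- transactions: string (nullable = true)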
In PySpark, whenever I read a JSON file with an empty object element, the entire element is ignored in the resulting DataFrame. How can I ask Spark to keep it instead of ignoring it?
I am using Spark 2.4.2 and Python 3.7.3.
I tried using df.fillna('Null'). This didn't work because, by the time the DataFrame got created, the element was not there.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
people = ['{"name":{},"address":{"city":"Columbus","state":"Ohio"}}']
otherPeopleRDD = spark.sparkContext.parallelize(people)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.printSchema()
root
|-- address: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
As we can see, the empty object element (name) is not part of the DataFrame.
Is there a way to have the name element considered?
Let me know if this helps:
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

people = ['{"name":{},"address":{"city":"Columbus","state":"Ohio"}}']
otherPeopleRDD = spark.sparkContext.parallelize(people)
otherPeople = spark.read.json(otherPeopleRDD)
# add the missing column explicitly with a null literal
otherPeople = otherPeople.withColumn('name', lit(None).cast(StringType()))
otherPeople.printSchema()
root
|-- address: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
|-- name: string (nullable = true)
+---------------+----+
| address|name|
+---------------+----+
|[Columbus,Ohio]|null|
+---------------+----+
EDIT
If the JSON is not too complex, this will work.
# Change the dictionary itself than changing it at df level
import json
d = json.loads(people[0])
# Takes care of any column which has an empty dictionary value
for k, v in d.items():
    if (v is None) or (len(v) == 0):  # you can add any conditions to detect an empty object
        d[k] = "nan"  # I prefer d[k] = None, and then fillna
people[0] = str(json.dumps(d))
otherPeopleRDD = spark.sparkContext.parallelize(people)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.printSchema()
otherPeople.show()
root
|-- address: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
|-- name: string (nullable = true)
+---------------+----+
| address|name|
+---------------+----+
|[Columbus,Ohio]|nan |
+---------------+----+
After a couple of hours of struggling with a similar problem, here is what I found:
If the data has at least one row with a nonempty "name" field, then it won't be ignored. If it doesn't, then we need to add a "name" column with an initial value, like #Preetham said; so we can check whether "name" exists in the result's schema or not, as sketched below.
Another solution would be adding a sample row with all fields filled with data to the JSON file/string and then ignoring or removing it from the result.
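A minimal sketch of that schema check (my own illustration, building on the lit(None) approach from the earlier answer):
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

otherPeople = spark.read.json(otherPeopleRDD)

# if every row had an empty "name" object, the column is missing from the schema; add it back as null
if "name" not in otherPeople.columns:
    otherPeople = otherPeople.withColumn("name", lit(None).cast(StringType()))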