pyspark json read ignores empty set - python

In PySpark, whenever I read a JSON file containing an element whose value is an empty object ({}), the entire element is dropped from the resulting DataFrame. How can I tell Spark to keep it instead of ignoring it?
I am using Spark 2.4.2 and Python 3.7.3.
I tried df.fillna('Null'). This didn't work because the element is already missing by the time the DataFrame is created.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
people = ['{"name":{},"address":{"city":"Columbus","state":"Ohio"}}']
otherPeopleRDD = spark.sparkContext.parallelize(people)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.printSchema()
root
|-- address: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
As we can see, the empty element (name) is not part of the DataFrame.
Is there a way to have the name element included?

Let me know if that helps:
people = ['{"name":{},"address":{"city":"Columbus","state":"Ohio"}}']
otherPeopleRDD = spark.sparkContext.parallelize(people)
otherPeople = spark.read.json(otherPeopleRDD)
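# Add the missing column as a typed null literal
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType
otherPeople = otherPeople.withColumn('name', lit(None).cast(StringType()))
otherPeople.printSchema()
otherPeople.show()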
root
|-- address: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
|-- name: string (nullable = true)
+---------------+----+
| address|name|
+---------------+----+
|[Columbus,Ohio]|null|
+---------------+----+
EDIT
If the JSON is not too complex, this will work.
# Change the dictionary itself rather than changing it at the DataFrame level
import json
d = json.loads(people[0])
# Takes care of any key whose value is null or an empty dict/list
for k, v in d.items():
    if v is None or (isinstance(v, (dict, list)) and len(v) == 0):  # add any other conditions that detect an empty element
        d[k] = "nan"  # I prefer d[k] = None, followed by fillna
people[0] = json.dumps(d)
otherPeopleRDD = spark.sparkContext.parallelize(people)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.printSchema()
otherPeople.show()
root
|-- address: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
|-- name: string (nullable = true)
+---------------+----+
| address|name|
+---------------+----+
|[Columbus,Ohio]|nan |
+---------------+----+
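The loop above only inspects top-level keys. A minimal sketch of the same idea applied at any nesting depth follows; the helper name `blank_empty_objects` and the extra `geo` key are hypothetical, added only for illustration. It replaces every empty object with null before the strings are handed to spark.read.json. How Spark then types a field that is null in every row may depend on the Spark version and reader options, so treat that part as an assumption to verify.

```python
import json

def blank_empty_objects(value):
    """Recursively replace empty JSON objects ({}) with None (JSON null)."""
    if isinstance(value, dict):
        if not value:
            return None
        return {k: blank_empty_objects(v) for k, v in value.items()}
    if isinstance(value, list):
        return [blank_empty_objects(v) for v in value]
    return value  # strings, numbers, booleans and None pass through unchanged

# "geo" is a made-up nested key, used to show that nested empty objects are handled too
people = ['{"name":{},"address":{"city":"Columbus","state":"Ohio","geo":{}}}']
cleaned = [json.dumps(blank_empty_objects(json.loads(line))) for line in people]
print(cleaned[0])
```

The cleaned list can then be passed through spark.sparkContext.parallelize(cleaned) and spark.read.json(...) exactly as in the answer above.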

After a couple of hours struggling with a similar problem, here is what I found:
If the data has at least one row with a non-empty "name" field, the field won't be ignored. If it doesn't, we need to add the "name" column with an initial value, as @Preetham said. So we can check whether "name" exists in the result's schema and add it only when it is missing.
Another solution would be to add a sample row with all fields filled in to the JSON file/string, and then ignore or remove it from the result.

Related

How to add empty map<string,string> type column to DataFrame in PySpark?

I tried the code below, but it does not work:
df=df.withColumn("cars", typedLit(Map.empty[String, String]))
It gives the error: NameError: name 'typedLit' is not defined
Create an empty column and cast it to the type you need.
from pyspark.sql import functions as F, types as T
df = df.withColumn("cars", F.lit(None).cast(T.MapType(T.StringType(), T.StringType())))
df.select("cars").printSchema()
root
|-- cars: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Perhaps you can use pyspark.sql.functions.expr:
>>> from pyspark.sql.functions import *
>>> df.withColumn("cars",expr("map()")).printSchema()
root
|-- col1: string (nullable = true)
|-- cars: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = false)
EDIT:
If you'd like your map to have keys and/or values of a non-trivial type (not map<string,string> as your question's title says), some casting becomes unavoidable, I'm afraid. For example:
>>> from pyspark.sql.types import IntegerType, DoubleType
>>> df.withColumn("cars",create_map(lit(None).cast(IntegerType()),lit(None).cast(DoubleType()))).printSchema()
root
|-- col1: string (nullable = true)
|-- cars: map (nullable = false)
| |-- key: integer
| |-- value: double (valueContainsNull = true)
...in addition to the other options suggested by @blackbishop and @Steven.
And just beware of the consequences :) Maps can't have null keys!

How to merge two dataframes in pyspark with different columns inside struct or array?

Let's say there are two DataFrames: a reference DataFrame and a target DataFrame.
The reference DF defines the reference schema.
Schema for reference DF (r_df)
r_df.printSchema()
root
|-- _id: string (nullable = true)
|-- notificationsSend: struct (nullable = true)
| |-- mail: boolean (nullable = true)
| |-- sms: boolean (nullable = true)
|-- recordingDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- channelName: string (nullable = true)
| | |-- fileLink: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- recorderId: string (nullable = true)
| | |-- resourceId: string (nullable = true)
However, target data-frame schema is dynamic in nature.
Schema for target DF (t_df)
t_df.printSchema()
root
|-- _id: string (nullable = true)
|-- notificationsSend: struct (nullable = true)
| |-- sms: string (nullable = true)
|-- recordingDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- channelName: string (nullable = true)
| | |-- fileLink: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- recorderId: string (nullable = true)
| | |-- resourceId: string (nullable = true)
| | |-- createdBy: string (nullable = true)
So we observe several changes in the target's schema:
Structs or arrays in t_df can have more or fewer fields.
The datatype of a field can change too, so type casting is required (e.g. the sms field is boolean in r_df but string in t_df).
I was able to add/remove columns of non-struct datatypes. However, structs and arrays are a real pain for me. Since there are 50+ columns, I need an optimised solution that works for all of them.
Any solution, opinion, or workaround would be really helpful.
Expected output
I want to make t_df's schema exactly the same as r_df's schema.
The code below is untested (written from memory) but should outline how to do it.
There may be a better way to get the fields out of a struct that I'm not aware of, so I'm interested to hear others' ideas.
Extract struct column names and types.
Find columns that need to be dropped.
Drop columns.
Rebuild structs according to r_df.
from pyspark.sql.functions import col, lit, struct
from pyspark.sql.types import StructType
# use a list comprehension to collect the names of the struct columns
structs_in_r_df = [field.name for field in r_df.schema.fields if isinstance(field.dataType, StructType)]
# map each struct column to the list of fields inside it
struct_columns = {}
for struct_name in structs_in_r_df:
    struct_columns[struct_name] = r_df.select(struct_name + ".*").columns
missing_columns = list(set(r_df.columns) - set(t_df.columns))  # find missing columns
similar_columns = list(set(r_df.columns).intersection(set(t_df.columns)))  # columns present in both
# remove struct columns from both lists so you don't represent them twice.
# you need to repeat the above intersection/missing for the structs and then rebuild them, but really the above gives you the idea of how to get the fields out.
# you can use col(struct_name + "." + field_name) to get the values out of the fields.
r_types = dict(r_df.dtypes)
# build the select list with a list comprehension and pass it as varargs, so the columns are pulled out dynamically
select_list = [
    (col(column) if column in similar_columns else lit(None)).cast(r_types[column]).alias(column)
    for column in r_df.columns  # union matches columns by position, so follow r_df's order
]
result = r_df.union(t_df.select(*select_list))
Once you have the union, here's a way to rebuild a struct (this assumes its fields were flattened into top-level columns first):
result = result.select(
    col("_id"),
    # pass the struct's field columns as varargs to struct()
    struct(*[col(column).alias(column) for column in struct_columns["notificationsSend"]]).alias("notificationsSend"),
    col("recordingDetails"),  # an array of structs; its elements need the same field-by-field rebuild
)
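One way to handle the nested part is to generate a Spark SQL expression per reference column and pass the list to t_df.selectExpr. The sketch below is untested against a cluster and works only on the dict form that StructType.jsonValue() returns, so it runs without Spark. The helper names (`type_string`, `align_expr`, `align_select_exprs`, `fld`) are hypothetical, and it assumes Spark 2.4+ for the `transform` higher-order function, plain field names that need no quoting, and only struct, array, and primitive types. The example schemas are shortened versions of the ones in the question.

```python
def type_string(t):
    """Render a jsonValue()-style type as a Spark SQL type string."""
    if isinstance(t, str):
        return t  # primitive name such as "string" or "boolean"
    if t["type"] == "struct":
        inner = ",".join(f["name"] + ":" + type_string(f["type"]) for f in t["fields"])
        return "struct<" + inner + ">"
    if t["type"] == "array":
        return "array<" + type_string(t["elementType"]) + ">"
    raise ValueError("unsupported type in this sketch: %r" % (t,))

def is_kind(t, kind):
    return isinstance(t, dict) and t.get("type") == kind

def align_expr(path, ref, tgt, depth=0):
    """Expression that reshapes `path` (typed `tgt`) into the reference type `ref`."""
    if tgt is None:  # field missing in the target: emit a typed NULL
        return "CAST(NULL AS " + type_string(ref) + ")"
    if is_kind(ref, "struct") and is_kind(tgt, "struct"):
        tgt_fields = {f["name"]: f["type"] for f in tgt["fields"]}
        parts = []
        for f in ref["fields"]:  # extra target fields are dropped by omission
            parts.append("'" + f["name"] + "'")
            parts.append(align_expr(path + "." + f["name"], f["type"],
                                    tgt_fields.get(f["name"]), depth))
        return "named_struct(" + ", ".join(parts) + ")"
    if is_kind(ref, "array") and is_kind(tgt, "array"):
        var = "x" + str(depth)  # distinct lambda variable per nesting level
        body = align_expr(var, ref["elementType"], tgt["elementType"], depth + 1)
        return "transform(" + path + ", " + var + " -> " + body + ")"
    return "CAST(" + path + " AS " + type_string(ref) + ")"

def align_select_exprs(ref_schema, tgt_schema):
    tgt_fields = {f["name"]: f["type"] for f in tgt_schema["fields"]}
    return [align_expr(f["name"], f["type"], tgt_fields.get(f["name"])) + " AS " + f["name"]
            for f in ref_schema["fields"]]

def fld(name, dtype):
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}

ref_schema = {"type": "struct", "fields": [
    fld("_id", "string"),
    fld("notificationsSend", {"type": "struct", "fields": [
        fld("mail", "boolean"), fld("sms", "boolean")]}),
    fld("recordingDetails", {"type": "array", "containsNull": True, "elementType": {
        "type": "struct", "fields": [fld("channelName", "string"), fld("id", "string")]}}),
]}
tgt_schema = {"type": "struct", "fields": [
    fld("_id", "string"),
    fld("notificationsSend", {"type": "struct", "fields": [fld("sms", "string")]}),
    fld("recordingDetails", {"type": "array", "containsNull": True, "elementType": {
        "type": "struct", "fields": [fld("channelName", "string"), fld("id", "string"),
                                     fld("createdBy", "string")]}}),
]}
exprs = align_select_exprs(ref_schema, tgt_schema)
for e in exprs:
    print(e)
```

With real DataFrames the intended call would be t_df.selectExpr(*align_select_exprs(r_df.schema.jsonValue(), t_df.schema.jsonValue())). One thing to check before relying on it: named_struct builds a struct of nulls where the source struct itself was null, so nullability may differ from r_df.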

How to lower the case of element names in ArrayType or MapType columns in PySpark?

I am trying to lowercase all column names in a PySpark DataFrame schema, including the element names of complex type columns.
Example:
original_df
|-- USER_ID: long (nullable = true)
|-- COMPLEX_COL_ARRAY: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- KEY: timestamp (nullable = true)
| | |-- VALUE: integer (nullable = true)
target_df
|-- user_id: long (nullable = true)
|-- complex_col_array: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: timestamp (nullable = true)
| | |-- value: integer (nullable = true)
However, I've only been able to lower the case of column names using the script below:
from pyspark.sql.types import StructField
schema = df.schema
schema.fields = list(map(lambda field: StructField(field.name.lower(), field.dataType), schema.fields))
I know I can access the field names of nested elements using this syntax:
for f in schema.fields:
    if hasattr(f.dataType, 'elementType') and hasattr(f.dataType.elementType, 'fieldNames'):
        print(f.dataType.elementType.fieldNames())
But how can I modify the case of these field names?
Thanks for your help!
Suggesting an answer to my own question, inspired by this question here: Rename nested field in spark dataframe
from pyspark.sql.types import StructField
# Read parquet file
path = "/path/to/data"
df = spark.read.parquet(path)
schema = df.schema
# Lower the case of all fields that are not nested
schema.fields = list(map(lambda field: StructField(field.name.lower(), field.dataType), schema.fields))
for f in schema.fields:
    # if field is nested and has named elements, lower the case of all element names
    if hasattr(f.dataType, 'elementType') and hasattr(f.dataType.elementType, 'fieldNames'):
        for e in f.dataType.elementType.fieldNames():
            schema[f.name].dataType.elementType[e].name = schema[f.name].dataType.elementType[e].name.lower()
            ind = schema[f.name].dataType.elementType.names.index(e)
            schema[f.name].dataType.elementType.names[ind] = e.lower()
# Recreate dataframe with lowercase schema
df_lowercase = spark.createDataFrame(df.rdd, schema)
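The loop above handles one level of array-of-struct nesting. A possible generalisation, sketched here on the plain dict that StructType.jsonValue() returns so that it runs without Spark, walks structs, arrays, and maps to any depth and leaves the input untouched. The function name `rename_schema_fields` is hypothetical, and the example dict is hand-written to mirror original_df from the question.

```python
def rename_schema_fields(node, rename):
    """Return a copy of a jsonValue()-style schema dict with `rename`
    applied to every struct field name, at any nesting depth."""
    if not isinstance(node, dict):
        return node  # primitive type name such as "long" or "timestamp"
    kind = node.get("type")
    if kind == "struct":
        return {**node, "fields": [
            {**f, "name": rename(f["name"]),
             "type": rename_schema_fields(f["type"], rename)}
            for f in node["fields"]]}
    if kind == "array":
        return {**node, "elementType": rename_schema_fields(node["elementType"], rename)}
    if kind == "map":
        return {**node,
                "keyType": rename_schema_fields(node["keyType"], rename),
                "valueType": rename_schema_fields(node["valueType"], rename)}
    return node  # anything else is returned unchanged

original = {"type": "struct", "fields": [
    {"name": "USER_ID", "type": "long", "nullable": True, "metadata": {}},
    {"name": "COMPLEX_COL_ARRAY", "nullable": True, "metadata": {},
     "type": {"type": "array", "containsNull": True,
              "elementType": {"type": "struct", "fields": [
                  {"name": "KEY", "type": "timestamp", "nullable": True, "metadata": {}},
                  {"name": "VALUE", "type": "integer", "nullable": True, "metadata": {}},
              ]}}},
]}
lowered = rename_schema_fields(original, str.lower)
print([f["name"] for f in lowered["fields"]])
```

In PySpark the result could be turned back into a schema with StructType.fromJson(rename_schema_fields(df.schema.jsonValue(), str.lower)) and applied with spark.createDataFrame(df.rdd, new_schema), as in the answer above.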

pySpark: How can I get all element names in structType in arrayType column in a dataframe?

I have a dataframe that looks something like this:
|-- name: string (nullable = true)
|-- age: string (nullable = true)
|-- job: string (nullable = true)
|-- hobbies: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- favorite: string (nullable = true)
| | |-- non-favorite: string (nullable = true)
And I'm trying to get this information:
['favorite', 'non-favorite']
However, the closest solution I found used the explode function with withColumn, and it assumed that I already know the names of the elements. What I want is to get the element names without knowing them in advance, given only the column name, in this case 'hobbies'.
Is there a good way to get all the element names in any given column?
For a given dataframe with this schema:
df.printSchema()
root
|-- hobbies: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- favorite: string (nullable = false)
| | |-- non-favorite: string (nullable = false)
You can select the field names of the struct as:
struct_fields = df.schema['hobbies'].dataType.elementType.fieldNames()
# output: ['favorite', 'non-favorite']
pyspark.sql.types.StructType.fieldNames should get you what you want.
fieldNames()
Returns all field names in a list.
>>> struct = StructType([StructField("f1", StringType(), True)])
>>> struct.fieldNames()
['f1']
So in your case, something like this (fieldNames() lives on the StructType in the schema, not on a Column):
dataframe.schema["hobbies"].dataType.elementType.fieldNames()
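If it is not known in advance whether a column is a struct, an array of structs, or something else, a small guard helps. The sketch below is a hypothetical helper (`element_field_names`) that works on the dict form returned by df.schema.jsonValue(), so it runs without Spark. The example dict is hand-written to match the schema in the question.

```python
def element_field_names(schema_json, column):
    """Field names of a struct column, or of the struct inside an array column.
    Returns [] when the column holds no struct; raises KeyError if it is absent."""
    for field in schema_json["fields"]:
        if field["name"] != column:
            continue
        dtype = field["type"]
        # unwrap arrays (including arrays of arrays) down to the element type
        while isinstance(dtype, dict) and dtype.get("type") == "array":
            dtype = dtype["elementType"]
        if isinstance(dtype, dict) and dtype.get("type") == "struct":
            return [f["name"] for f in dtype["fields"]]
        return []
    raise KeyError(column)

schema_json = {"type": "struct", "fields": [
    {"name": "name", "type": "string", "nullable": True, "metadata": {}},
    {"name": "hobbies", "nullable": True, "metadata": {},
     "type": {"type": "array", "containsNull": True,
              "elementType": {"type": "struct", "fields": [
                  {"name": "favorite", "type": "string", "nullable": True, "metadata": {}},
                  {"name": "non-favorite", "type": "string", "nullable": True, "metadata": {}},
              ]}}},
]}
print(element_field_names(schema_json, "hobbies"))  # ['favorite', 'non-favorite']
```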

Rename nested field in spark dataframe

Having a dataframe df in Spark:
|-- array_field: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: string (nullable = true)
| | |-- b: long (nullable = true)
| | |-- c: long (nullable = true)
How to rename field array_field.a to array_field.a_renamed?
[Update]:
.withColumnRenamed() does not work with nested fields so I tried this hacky and unsafe method:
# First alter the schema:
schema = df.schema
schema['array_field'].dataType.elementType['a'].name = 'a_renamed'
ind = schema['array_field'].dataType.elementType.names.index('a')
schema['array_field'].dataType.elementType.names[ind] = 'a_renamed'
# Then set dataframe's schema with altered schema
df._schema = schema
I know that setting a private attribute is not good practice, but I don't know another way to set the schema for df.
I think I am on the right track, but df.printSchema() still shows the old name for array_field.a, even though df.schema == schema is True.
Python
It is not possible to modify a single nested field. You have to recreate a whole structure. In this particular case the simplest solution is to use cast.
First a bunch of imports:
from collections import namedtuple
from pyspark.sql.functions import col
from pyspark.sql.types import (
    ArrayType, LongType, StringType, StructField, StructType)
and example data:
Record = namedtuple("Record", ["a", "b", "c"])
df = sc.parallelize([([Record("foo", 1, 3)], )]).toDF(["array_field"])
Let's confirm that the schema is the same as in your case:
df.printSchema()
root
|-- array_field: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: string (nullable = true)
| | |-- b: long (nullable = true)
| | |-- c: long (nullable = true)
You can define a new schema for example as a string:
str_schema = "array<struct<a_renamed:string,b:bigint,c:bigint>>"
df.select(col("array_field").cast(str_schema)).printSchema()
root
|-- array_field: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a_renamed: string (nullable = true)
| | |-- b: long (nullable = true)
| | |-- c: long (nullable = true)
or a DataType:
struct_schema = ArrayType(StructType([
    StructField("a_renamed", StringType()),
    StructField("b", LongType()),
    StructField("c", LongType())
]))
df.select(col("array_field").cast(struct_schema)).printSchema()
root
|-- array_field: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a_renamed: string (nullable = true)
| | |-- b: long (nullable = true)
| | |-- c: long (nullable = true)
Scala
The same techniques can be used in Scala:
case class Record(a: String, b: Long, c: Long)
val df = Seq(Tuple1(Seq(Record("foo", 1, 3)))).toDF("array_field")
val strSchema = "array<struct<a_renamed:string,b:bigint,c:bigint>>"
df.select($"array_field".cast(strSchema))
or
import org.apache.spark.sql.types._
val structSchema = ArrayType(StructType(Seq(
  StructField("a_renamed", StringType),
  StructField("b", LongType),
  StructField("c", LongType)
)))
df.select($"array_field".cast(structSchema))
Possible improvements:
If you use an expressive data manipulation or JSON processing library, it could be easier to dump the data types to a dict or JSON string and take it from there, for example (Python / toolz):
from toolz.curried import pipe, assoc_in, update_in, map
from operator import attrgetter
# Update name to "a_updated" if name is "a"
rename_field = update_in(
    keys=["name"], func=lambda x: "a_updated" if x == "a" else x)
updated_schema = pipe(
    # Get schema of the field as a dict
    df.schema["array_field"].jsonValue(),
    # Update fields with rename
    update_in(
        keys=["type", "elementType", "fields"],
        func=lambda x: pipe(x, map(rename_field), list)),
    # Load schema from dict
    StructField.fromJson,
    # Get data type
    attrgetter("dataType"))
df.select(col("array_field").cast(updated_schema)).printSchema()
You can recurse over the data frame's schema to create a new schema with the required changes.
A schema in PySpark is a StructType which holds a list of StructFields, and each StructField can hold some primitive type or another StructType.
This means that we can decide if we want to recurse based on whether the type is a StructType or not.
Below is an annotated sample implementation that shows you how you can implement the above idea.
# Some imports
from copy import copy
from pyspark.sql import DataFrame
from pyspark.sql.types import ArrayType, DataType, StructField, StructType
# We take a dataframe and return a new one with required changes
def cleanDataFrame(df: DataFrame) -> DataFrame:
    # Returns a new sanitized field name (this function can be anything really)
    def sanitizeFieldName(s: str) -> str:
        return s.replace("-", "_").replace("&", "_").replace("\"", "_")\
            .replace("[", "_").replace("]", "_").replace(".", "_")
    # We call this on all fields to create a copy and to perform any
    # changes we might want to do to the field.
    def sanitizeField(field: StructField) -> StructField:
        field = copy(field)
        field.name = sanitizeFieldName(field.name)
        # We recursively call cleanSchema on all types
        field.dataType = cleanSchema(field.dataType)
        return field
    def cleanSchema(dataType: DataType) -> DataType:
        dataType = copy(dataType)
        # If the type is a StructType we need to recurse otherwise
        # we can return since we've reached the leaf node
        if isinstance(dataType, StructType):
            # We call our sanitizer for all top level fields
            dataType.fields = [sanitizeField(f) for f in dataType.fields]
            # Keep the cached list of names in sync with the renamed fields
            dataType.names = [f.name for f in dataType.fields]
        elif isinstance(dataType, ArrayType):
            dataType.elementType = cleanSchema(dataType.elementType)
        return dataType
    # Now since we have the new schema we can create a new DataFrame
    # by using the old Frame's RDD as data and the new schema as the
    # schema for the data (`spark` is the active SparkSession)
    return spark.createDataFrame(df.rdd, cleanSchema(df.schema))
I found a much easier way than the one provided by @zero323, along the lines of @MaxPY:
PySpark 2.4:
from pyspark.sql.types import StructField
# Get the schema from the dataframe df
schema = df.schema
# Override `fields` with a list of new StructFields, equal to the previous ones except for the names
schema.fields = list(map(lambda field:
    StructField(field.name + "_renamed", field.dataType), schema.fields))
# Also override `names` with the same mechanism
schema.names = list(map(lambda name: name + "_renamed", schema.names))
Now df.schema will print all the renamed names.
Another much easier solution if it works for you like it works for me is to flatten the structure and then rename:
Using Scala:
val df_flat = df.selectExpr("array_field.*")
Now the rename works
val df_renamed = df_flat.withColumnRenamed("a", "a_renamed")
Of course, this only works for you if you don't need the hierarchy (although I suppose it can be recreated if needed).
Using the answer provided by Leo C in https://stackoverflow.com/a/55363153/5475506, I have built what I consider a more human-friendly, Pythonic script:
import pyspark.sql.types as sql_types
path_table = "<PATH_TO_DATA>"
table_name = "<TABLE_NAME>"
def recur_rename(schema: sql_types.StructType, old_char, new_char):
    schema_new = []
    for struct_field in schema:
        new_name = struct_field.name.replace(old_char, new_char)
        data_type = struct_field.dataType
        if type(data_type) == sql_types.StructType:
            # Nested struct: rename its fields recursively
            data_type = sql_types.StructType(recur_rename(data_type, old_char, new_char))
        elif type(data_type) == sql_types.ArrayType and type(data_type.elementType) == sql_types.StructType:
            # Array of structs: recursive call to rename the fields of the element struct
            data_type = sql_types.ArrayType(sql_types.StructType(recur_rename(data_type.elementType, old_char, new_char)), data_type.containsNull)
        # Any other type (including arrays of non-struct elements) has nothing nested to rename and is kept as-is
        schema_new.append(sql_types.StructField(new_name, data_type, struct_field.nullable, struct_field.metadata))
    return schema_new
def rename_columns(schema: sql_types.StructType, old_char, new_char):
    return sql_types.StructType(recur_rename(schema, old_char, new_char))
df = spark.read.format("json").load(path_table) # Read data whose schema has to be changed.
newSchema = rename_columns(df.schema, ":", "_") # Replace special characters in schema (more special characters not allowed in the Spark/Hive metastore: ':', ',', ';')
df2 = spark.createDataFrame(df.rdd, newSchema) # Apply the new schema to the existing rows (the JSON reader matches fields by name, so re-reading the file with renamed fields would not pick up the data).
I consider the code self-explanatory (furthermore, it has comments), but what it does is recursively loop over all the fields in the schema, replacing "old_char" with "new_char" in each of them. If the field type is a nested one (StructType or ArrayType), new recursive calls are made.
