PySpark flattening dataframe while appending supercolumn names - python

Say I have a PySpark dataframe df:
>>> df.printSchema()
root
|-- a: struct
| |-- alpha: integer
| |-- beta: string
| |-- gamma: boolean
|-- b: string
|-- c: struct
| |-- delta: string
| |-- epsilon: struct
| | |-- omega: string
| | |-- psi: boolean
I know I can flatten the dataframe:
select_col_list = [col.replace("a", "a.*").replace("c", "c.*") for col in df.columns]
flat_df = df.select(*select_col_list)
This results in a schema like this:
root
|-- alpha: integer
|-- beta: string
|-- gamma: boolean
|-- b: string
|-- delta: string
|-- epsilon: struct
| |-- omega: string
| |-- psi: boolean
But I want to append the supercolumn's name to subcolumns when I flatten too, so I want the resulting schema to be like this:
root
|-- a_alpha: integer
|-- a_beta: string
|-- a_gamma: boolean
|-- b: string
|-- c_delta: string
|-- c_epsilon: struct
| |-- omega: string
| |-- psi: boolean
How do I do this?

I don't think there's a straightforward way to do it, but here's a hacky solution that I came up with.
Define a list of the columns to be expanded and create a temporary id column using pyspark.sql.functions.monotonically_increasing_id().
Loop over all the columns in the dataframe and create a temporary dataframe for each one:
If the column is in cols_to_expand: use .* to expand the column, then rename every field (except id) in the resulting temporary dataframe with the corresponding prefix using alias().
If the column is not in cols_to_expand: select that column together with id and store it in a temporary dataframe.
Append each temp_df to a list.
Finally, join all the dataframes in the list on id and drop the id column.
Code:
from functools import reduce
from pyspark.sql import functions as f

df = df.withColumn('id', f.monotonically_increasing_id())
cols_to_expand = ['a', 'c']
flat_dfs = []
for col in df.columns:
    if col == 'id':  # skip the temporary id column itself
        continue
    if col in cols_to_expand:
        temp_df = df.select('id', col + ".*")
        temp_df = temp_df.select(
            [
                f.col(c).alias(col + "_" + c if c != 'id' else c) for c in temp_df.columns
            ]
        )
    else:
        temp_df = df.select('id', col)
    flat_dfs.append(temp_df)
flat_df = reduce(lambda x, y: x.join(y, on='id'), flat_dfs)
flat_df = flat_df.drop('id')
flat_df.printSchema()
The resulting schema:
#root
# |-- a_alpha: integer (nullable = true)
# |-- a_beta: string (nullable = true)
# |-- a_gamma: boolean (nullable = true)
# |-- b: string (nullable = true)
# |-- c_delta: string (nullable = true)
# |-- c_epsilon: struct (nullable = true)
# | |-- omega: string (nullable = true)
# | |-- psi: boolean (nullable = true)

I actually found a way to do this today. Start with the beautiful auto-flattening PySpark function by Evan V. Combine it with the rather brilliant mass-renaming solution from proggeo, and you can build up a list of names down the full tree of columns and alias them all as you select.
In my case I took the result of the flatten function and replaced every "." character with "_" when renaming. The result is as follows:
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, ArrayType

def flatten(schema, prefix=None):
    fields = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        dtype = field.dataType
        if isinstance(dtype, ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, StructType):
            fields += flatten(dtype, prefix=name)
        else:
            fields.append(name)
    return fields

# Get actual field names, with nested '.' structure, and create equivalents with '_'
fields = flatten(df.schema)
fields_renamed = [field.replace(".", "_") for field in fields]

# Select while aliasing all fields
df = df.select(*[col(field).alias(new_field) for field, new_field in zip(fields, fields_renamed)])
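Applied to the df from the question, this should produce a schema along the following lines (note that, unlike the join-based approach above, flatten also recurses into the inner epsilon struct, so its fields get the full c_epsilon_ prefix):
df.printSchema()
root
|-- a_alpha: integer
|-- a_beta: string
|-- a_gamma: boolean
|-- b: string
|-- c_delta: string
|-- c_epsilon_omega: string
|-- c_epsilon_psi: boolean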

Related

How to add empty map<string,string> type column to DataFrame in PySpark?

I tried the below code but it's not working:
df=df.withColumn("cars", typedLit(Map.empty[String, String]))
Gives the error: NameError: name 'typedLit' is not defined
Create an empty column and cast it to the type you need.
from pyspark.sql import functions as F, types as T
df = df.withColumn("cars", F.lit(None).cast(T.MapType(T.StringType(), T.StringType())))
df.select("cars").printSchema()
root
|-- cars: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Perhaps you can use pyspark.sql.functions.expr:
>>> from pyspark.sql.functions import *
>>> df.withColumn("cars",expr("map()")).printSchema()
root
|-- col1: string (nullable = true)
|-- cars: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = false)
EDIT:
If you'd like your map to have keys and/or values of a non-trivial type (not map<string,string> as your question's title says), some casting becomes unavoidable, I'm afraid. For example:
>>> df.withColumn("cars",create_map(lit(None).cast(IntegerType()),lit(None).cast(DoubleType()))).printSchema()
root
|-- col1: string (nullable = true)
|-- cars: map (nullable = false)
| |-- key: integer
| |-- value: double (valueContainsNull = true)
...in addition to the other options suggested by @blackbishop and @Steven.
Just beware of the consequences: maps can't have null keys!

How to merge two dataframes in pyspark with different columns inside struct or array?

Let's say there are two dataframes: a reference dataframe and a target dataframe.
The reference DF holds the reference schema.
Schema for reference DF (r_df)
r_df.printSchema()
root
|-- _id: string (nullable = true)
|-- notificationsSend: struct (nullable = true)
| |-- mail: boolean (nullable = true)
| |-- sms: boolean (nullable = true)
|-- recordingDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- channelName: string (nullable = true)
| | |-- fileLink: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- recorderId: string (nullable = true)
| | |-- resourceId: string (nullable = true)
However, the target dataframe's schema is dynamic in nature.
Schema for target DF (t_df)
t_df.printSchema()
root
|-- _id: string (nullable = true)
|-- notificationsSend: struct (nullable = true)
| |-- sms: string (nullable = true)
|-- recordingDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- channelName: string (nullable = true)
| | |-- fileLink: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- recorderId: string (nullable = true)
| | |-- resourceId: string (nullable = true)
| | |-- createdBy: string (nullable = true)
So we observe multiple changes in the target's schema.
Structs or arrays inside t_df can have more or fewer columns.
Datatypes of columns can change too, so type casting is required (e.g. the sms column is boolean in r_df but string in t_df).
I was able to add/remove columns of non-struct datatypes. However, structs and arrays are a real pain for me. Since there are 50+ columns, I need an optimised solution that works for all of them.
Any solution/opinion/workaround will be really helpful.
Expected output
I want to make my t_df's schema exactly the same as my r_df's schema.
The below code is untested but should describe how to do it (written from memory, without testing).
There may be a way to get fields out of a struct directly, but I'm not aware of one, so I'm interested to hear others' ideas.
Extract struct column names and types.
Find the columns that need to be dropped.
Drop those columns.
Rebuild the structs according to r_df.
from pyspark.sql.functions import col, lit

# use a list comprehension to create a list of struct columns
structs_in_r_df = [field.name for field in r_df.schema.fields if str(field.dataType).startswith("Struct")]
struct_columns = []
for s in structs_in_r_df:  # get a list of fields in each struct
    struct_columns.append(r_df.select(f"{s}.*").columns)
missingColumns = list(set(r_df.columns) - set(t_df.columns))  # find missing columns
similar_Columns = list(set(r_df.columns).intersection(set(t_df.columns)))
# remove struct columns from both lists so you don't represent them twice.
# you need to repeat the above intersection/missing logic for the structs and then rebuild them,
# but the above gives you the idea of how to get the fields out.
# you can use f-strings such as col(f"{struct}.{field}") to get the values out of the fields.
# note: union matches columns by position, so make sure the select order lines up with r_df.columns.
result = r_df.union(
    t_df.select(*(
        [lit(None).cast(dict(r_df.dtypes)[column]).alias(column) for column in missingColumns] +
        [col(column).cast(dict(r_df.dtypes)[column]).alias(column) for column in similar_Columns]
        # building the column lists with comprehensions and passing them as varargs to select
        # dynamically pulls out the values you need.
    ))
)
Here's a way, once you have the union, to pull the structs back together:
from pyspark.sql.functions import col, struct

# struct_columns here is assumed to be the list of field names for the struct being rebuilt
result = result\
    .select(
        col("_id"),
        struct(col("sms").alias("sms")).alias("notificationsSend"),
        struct(
            *[col(column).alias(column) for column in struct_columns]  # pass varargs to struct() with the field columns
        ).alias("recordingDetails")  # reconstitute the struct
    )

How to lower the case of element names in ArrayType or MapType columns in PySpark?

I am trying to lowercase all column names of a PySpark DataFrame schema, including the element names of complex-type columns.
Example:
original_df
|-- USER_ID: long (nullable = true)
|-- COMPLEX_COL_ARRAY: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- KEY: timestamp (nullable = true)
| | |-- VALUE: integer (nullable = true)
target_df
|-- user_id: long (nullable = true)
|-- complex_col_array: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: timestamp (nullable = true)
| | |-- value: integer (nullable = true)
However, I've only been able to lower the case of column names using the script below:
from pyspark.sql.types import StructField
schema = df.schema
schema.fields = list(map(lambda field: StructField(field.name.lower(), field.dataType), schema.fields))
I know I can access the field names of nested elements using this syntax:
for f in schema.fields:
    if hasattr(f.dataType, 'elementType') and hasattr(f.dataType.elementType, 'fieldNames'):
        print(f.dataType.elementType.fieldNames())
But how can I modify the case of these field names?
Thanks for your help!
Suggesting an answer to my own question, inspired by this question here: Rename nested field in spark dataframe
from pyspark.sql.types import StructField
# Read parquet file
path = "/path/to/data"
df = spark.read.parquet(path)
schema = df.schema
# Lower the case of all fields that are not nested
schema.fields = list(map(lambda field: StructField(field.name.lower(), field.dataType), schema.fields))
for f in schema.fields:
    # if the field is nested and has named elements, lowercase all element names
    if hasattr(f.dataType, 'elementType') and hasattr(f.dataType.elementType, 'fieldNames'):
        for e in f.dataType.elementType.fieldNames():
            schema[f.name].dataType.elementType[e].name = schema[f.name].dataType.elementType[e].name.lower()
            ind = schema[f.name].dataType.elementType.names.index(e)
            schema[f.name].dataType.elementType.names[ind] = e.lower()
# Recreate dataframe with lowercase schema
df_lowercase = spark.createDataFrame(df.rdd, schema)
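Assuming the lowercasing worked as intended, printing the schema of df_lowercase should now match the target_df from the question:
df_lowercase.printSchema()
root
|-- user_id: long (nullable = true)
|-- complex_col_array: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: timestamp (nullable = true)
| | |-- value: integer (nullable = true)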

Give prefix to all columns when selecting with 'struct_name.*'

The dataframe below is a temp table named 'table_name'.
How would you use spark.sql() to give a prefix to all columns?
root
|-- MAIN_COL: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: string (nullable = true)
| |-- c: string (nullable = true)
| |-- d: string (nullable = true)
| |-- f: long (nullable = true)
| |-- g: long (nullable = true)
| |-- h: long (nullable = true)
| |-- j: long (nullable = true)
The below query
spark.sql("select MAIN_COL.* from table_name")
gives back columns named a, b, c, ..., but how can I make them all look like e.g. pre_a, pre_b, pre_c?
I want to avoid selecting and aliasing them one by one. What if I have 30 columns?
I hope a custom UDF used in SQL can solve it, but I'm really not sure how to handle this.
# Generate a pandas DataFrame
import pandas as pd
a_dict = {
    'a': [1, 2, 3, 4, 5],
    'b': [1, 2, 3, 4, 5],
    'c': [1, 2, 3, 4, 5],
    'e': list('abcde'),
    'f': list('abcde'),
    'g': list('abcde')
}
pandas_df = pd.DataFrame(a_dict)
# Create a Spark DataFrame from a pandas DataFrame using Arrow
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df = spark.createDataFrame(pandas_df)
# Pack all columns into a struct
from pyspark.sql.functions import struct
main = df.select(struct(df.columns).alias("MAIN_COL"))
Here is one way to go through the fields and modify their names dynamically. First use main.schema.fields[0].dataType.fields to access the target fields. Next use python map to prepend pre_ to each field:
from pyspark.sql.types import *
from pyspark.sql.functions import col
inner_fields = main.schema.fields[0].dataType.fields
# [StructField(a,LongType,true),
# StructField(b,LongType,true),
# StructField(c,LongType,true),
# StructField(e,StringType,true),
# StructField(f,StringType,true),
# StructField(g,StringType,true)]
pre_cols = list(map(lambda sf: StructField(f"pre_{sf.name}", sf.dataType, sf.nullable), inner_fields))
new_schema = StructType(pre_cols)
main.select(col("MAIN_COL").cast(new_schema)).printSchema()
# root
# |-- MAIN_COL: struct (nullable = false)
# | |-- pre_a: long (nullable = true)
# | |-- pre_b: long (nullable = true)
# | |-- pre_c: long (nullable = true)
# | |-- pre_e: string (nullable = true)
# | |-- pre_f: string (nullable = true)
# | |-- pre_g: string (nullable = true)
Finally, you can use cast with the new schema, as @Mahesh already mentioned.
The beauty of Spark is that you can programmatically manipulate metadata.
This is an example that continues the original code snippet:
main.createOrReplaceTempView("table_name")
new_cols_select = ", ".join(["MAIN_COL." + col + " as pre_" + col for col in spark.sql("select MAIN_COL.* from table_name").columns])
new_df = spark.sql(f"select {new_cols_select} from table_name")
Due to Spark's laziness, and because all the manipulations are metadata only, this code has almost no performance cost and works the same for 10 columns or 500 columns (we actually do something similar on 1k columns).
It is also possible to get the original column names in a more elegant way with the df.schema object, as sketched below.
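For illustration, a minimal sketch of that idea, reusing main and the table_name view from the snippet above (inner_names is a name introduced here, not part of the original answer):
# get the inner field names straight from the schema of MAIN_COL,
# instead of querying the temp view for them
inner_names = [f.name for f in main.schema["MAIN_COL"].dataType.fields]
new_cols_select = ", ".join([f"MAIN_COL.{name} as pre_{name}" for name in inner_names])
new_df = spark.sql(f"select {new_cols_select} from table_name")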
You can try this: add all the required columns to schema2 (this snippet is Scala):
val schema2 = new StructType()
.add("pre_a",StringType)
.add("pre_b",StringType)
.add("pre_c",StringType)
Now select the column like this:
df.select(col("MAIN_COL").cast(schema2)).show()
It will give you all the updated column names. A PySpark equivalent is sketched below.
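For reference, a rough PySpark equivalent of the Scala snippet above (a sketch only; as in the Scala version, schema2 must list one entry per field of MAIN_COL, in order, for the cast to succeed):
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StringType

# NOTE: only three example fields are shown, as in the Scala snippet;
# list every field of MAIN_COL with its desired pre_ name and type
schema2 = StructType() \
    .add("pre_a", StringType()) \
    .add("pre_b", StringType()) \
    .add("pre_c", StringType())

df.select(col("MAIN_COL").cast(schema2)).show()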
The following expands all the struct columns, adding the parent column name as a prefix.
struct_cols = [c for c, t in df.dtypes if t.startswith('struct')]
for c in struct_cols:
    schema = T.StructType([T.StructField(f"{c}_{f.name}", f.dataType, f.nullable) for f in df.schema[c].dataType.fields])
    df = df.withColumn(c, F.col(c).cast(schema))
df = df.select([f"{c}.*" if c in struct_cols else c for c in df.columns])
Test input:
from pyspark.sql import functions as F
from pyspark.sql import types as T
df = spark.createDataFrame([((1, 2), 5)], 'c1:struct<f1:int,f2:int>, c2:int')
print(df.dtypes)
# [('c1', 'struct<f1:int,f2:int>'), ('c2', 'int')]
Result:
struct_cols = [c for c, t in df.dtypes if t.startswith('struct')]
for c in struct_cols:
    schema = T.StructType([T.StructField(f"{c}_{f.name}", f.dataType, f.nullable) for f in df.schema[c].dataType.fields])
    df = df.withColumn(c, F.col(c).cast(schema))
df = df.select([f"{c}.*" if c in struct_cols else c for c in df.columns])
print(df.dtypes)
# [('c1_f1', 'int'), ('c1_f2', 'int'), ('c2', 'int')]
You can also do this with PySpark:
df.select([col(col_name).alias('prefix' + col_name) for col_name in df.columns])
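If the goal is the pre_ naming from the question, the same one-liner can be applied after expanding the struct; a small sketch combining it with the question's query:
from pyspark.sql.functions import col

expanded = spark.sql("select MAIN_COL.* from table_name")
prefixed = expanded.select([col(c).alias('pre_' + c) for c in expanded.columns])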

pyspark json read ignores empty set

In PySpark, whenever I read a JSON file with an empty set element, the entire element is ignored in the resulting DataFrame. How can I ask Spark to keep it instead of ignoring it?
I am using Spark 2.4.2 and Python 3.7.3.
I tried using df.fillna('Null'). This didn't work because by the time the DataFrame got created, the element was not there.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
people = ['{"name":{},"address":{"city":"Columbus","state":"Ohio"}}']
otherPeopleRDD = spark.sparkContext.parallelize(people)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.printSchema()
root
|-- address: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
As we can see, the empty set element (name) is not part of the DataFrame.
Is there a way to have the name element considered?
Let me know if that helps:
people = ['{"name":{},"address":{"city":"Columbus","state":"Ohio"}}']
otherPeopleRDD = spark.sparkContext.parallelize(people)
otherPeople = spark.read.json(otherPeopleRDD)
# lit with None
otherPeople = otherPeople.withColumn('name', lit(None).cast(StringType()))
otherPeople.printSchema()
root
|-- address: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
|-- name: string (nullable = true)
+---------------+----+
| address|name|
+---------------+----+
|[Columbus,Ohio]|null|
+---------------+----+
EDIT
If the JSON is not too complex, this will work.
# Change the dictionary itself rather than changing it at the df level
import json
d = json.loads(people[0])
# Takes care of any column which has an empty dictionary value
for k, v in d.items():
    if (v is None) or (len(v) == 0):  # you can add any conditions to detect an empty set
        d[k] = "nan"  # I prefer d[k] = None, and then fillna
people[0] = str(json.dumps(d))
otherPeopleRDD = spark.sparkContext.parallelize(people)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.printSchema()
otherPeople.show()
root
|-- address: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
|-- name: string (nullable = true)
+---------------+----+
| address|name|
+---------------+----+
|[Columbus,Ohio]|nan |
+---------------+----+
After a couple of hours struggling with a similar problem, here is what I found:
If the data has at least one row with a nonempty "name" field, then it won't be ignored. If it doesn't, then we need to add a "name" column with an initial value, as @Preetham said. So we can check whether "name" exists in the result's schema or not, as sketched below.
Another solution would be adding a sample row with all fields filled with data to the JSON file/string and then ignoring or removing it from the result.
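A minimal sketch of that check, building on the otherPeople example above (the column name and string type are assumptions; adjust them to your data):
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

# add the "name" column only if schema inference dropped it
if "name" not in otherPeople.columns:
    otherPeople = otherPeople.withColumn("name", lit(None).cast(StringType()))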
