I have packed my nested JSON as string columns in my PySpark DataFrame and I am trying to perform an UPSERT on some columns based on a groupBy.
Input:
from pyspark.sql.functions import *
from pyspark.sql.types import *
input_json = """[{
"candidate_email": "cust1#email.com",
"transactions":"[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}]"
},
{
"candidate_email": "cust1#email.com",
"transactions":"[{'transaction_id':'12', 'transaction_amount':'$23.43'}]"
}
]
"""
input_df = spark.read.json(sc.parallelize([input_json]), multiLine=True)
input_df.printSchema()
# root
# |-- candidate_email: string (nullable = true)
# |-- transactions: string (nullable = true)
Transformation & current output:
output_df = input_df.groupBy("candidate_email").agg(collect_list(col("transactions")).alias("transactions"))
output_df.printSchema()
output_df.collect()
# root
# |-- candidate_email: string (nullable = true)
# |-- transactions: array (nullable = true)
# | |-- element: string (containsNull = true)
# Out[161]:
# [Row(candidate_email='cust1#email.com', transactions=["[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}]", "[{'transaction_id':'12', 'transaction_amount':'$23.43'}]"])]
But what changes should I make in the above code to get this output:
desired output:
output_json = """[{
"candidate_email": "cust1#email.com",
"transactions":"[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}, {'transaction_id':'12', 'transaction_amount':'$23.43'}]"
}]"""
output_df = spark.read.json(sc.parallelize([output_json]), multiLine=True)
output_df.printSchema()
# root
# |-- candidate_email: string (nullable = true)
# |-- transactions: string (nullable = true)
Basically, I am trying to get a clean merge by having one list instead of multiple.
Thanks!
Since the transactions column is of string type, we need to convert it to an array type; then, by using explode and groupBy + collect_list, we can achieve the expected result.
Example:
df.show(10,False)
#+---------------+----------------------------------------------------------------------------------------------------------------+
#|candidate_email|transactions |
#+---------------+----------------------------------------------------------------------------------------------------------------+
#|cust1#email.com|[{'transaction_id':'10', 'transaction_amount':'$55.46'},{'transaction_id':'11', 'transaction_amount':'$545.46'}]|
#|cust1#email.com|[{'transaction_id':'12', 'transaction_amount':'$23.43'}] |
#+---------------+----------------------------------------------------------------------------------------------------------------+
#to make a proper array we first replace (},) with (}},), then remove the ([ and ]) brackets and split on (},), which results in an array; finally we explode the array.
df1=df.selectExpr("candidate_email","""explode(split(regexp_replace(regexp_replace(transactions,'(\\\},)','}},'),'(\\\[|\\\])',''),"},")) as transactions""")
df1.show(10,False)
#+---------------+------------------------------------------------------+
#|candidate_email|transactions |
#+---------------+------------------------------------------------------+
#|cust1#email.com|{'transaction_id':'10', 'transaction_amount':'$55.46'}|
#|cust1#email.com|{'transaction_id':'11','transaction_amount':'$545.46'}|
#|cust1#email.com|{'transaction_id':'12', 'transaction_amount':'$23.43'}|
#+---------------+------------------------------------------------------+
#groupBy and then create json object
df2=df1.groupBy("candidate_email").\
agg(collect_list(col("transactions")).alias("transactions"))
df2.show(10,False)
#+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|candidate_email|transactions |
#+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|cust1#email.com|[{'transaction_id':'10', 'transaction_amount':'$55.46'}, {'transaction_id':'11','transaction_amount':'$545.46'}, {'transaction_id':'12', 'transaction_amount':'$23.43'}]|
#+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#creating json object column in dataframe
df2.selectExpr("to_json(struct(candidate_email,transactions)) as json").show(10,False)
#+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|json |
#+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|{"candidate_email":"cust1#email.com","transactions":["{'transaction_id':'10', 'transaction_amount':'$55.46'}","{'transaction_id':'11','transaction_amount':'$545.46'}","{'transaction_id':'12', 'transaction_amount':'$23.43'}"]}|
#+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#To write the output to a json file
df2.write.mode("overwrite").json("<path>")
#content of file
#{"candidate_email":"cust1#email.com","transactions":["{'transaction_id':'10', 'transaction_amount':'$55.46'}","{'transaction_id':'11','transaction_amount':'$545.46'}","{'transaction_id':'12', 'transaction_amount':'$23.43'}"]}
#converting to json by using toJSON
df2.toJSON().collect()
#[u'{"candidate_email":"cust1#email.com","transactions":["{\'transaction_id\':\'10\', \'transaction_amount\':\'$55.46\'}","{\'transaction_id\':\'11\',\'transaction_amount\':\'$545.46\'}","{\'transaction_id\':\'12\', \'transaction_amount\':\'$23.43\'}"]}']
I am trying to lowercase all the column names of a PySpark DataFrame schema, including the element names of complex type columns.
Example:
original_df
|-- USER_ID: long (nullable = true)
|-- COMPLEX_COL_ARRAY: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- KEY: timestamp (nullable = true)
| | |-- VALUE: integer (nullable = true)
target_df
|-- user_id: long (nullable = true)
|-- complex_col_array: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: timestamp (nullable = true)
| | |-- value: integer (nullable = true)
However, I've only been able to lower the case of column names using the script below:
from pyspark.sql.types import StructField
schema = df.schema
schema.fields = list(map(lambda field: StructField(field.name.lower(), field.dataType), schema.fields))
I know I can access the field names of nested elements using this syntax:
for f in schema.fields:
    if hasattr(f.dataType, 'elementType') and hasattr(f.dataType.elementType, 'fieldNames'):
        print(f.dataType.elementType.fieldNames())
But how can I modify the case of these field names?
Thanks for your help!
Suggesting an answer to my own question, inspired by this question here: Rename nested field in spark dataframe
from pyspark.sql.types import StructField
# Read parquet file
path = "/path/to/data"
df = spark.read.parquet(path)
schema = df.schema
# Lower the case of all fields that are not nested
schema.fields = list(map(lambda field: StructField(field.name.lower(), field.dataType), schema.fields))
for f in schema.fields:
    # if field is nested and has named elements, lower the case of all element names
    if hasattr(f.dataType, 'elementType') and hasattr(f.dataType.elementType, 'fieldNames'):
        for e in f.dataType.elementType.fieldNames():
            schema[f.name].dataType.elementType[e].name = schema[f.name].dataType.elementType[e].name.lower()
            ind = schema[f.name].dataType.elementType.names.index(e)
            schema[f.name].dataType.elementType.names[ind] = e.lower()
# Recreate dataframe with lowercase schema
df_lowercase = spark.createDataFrame(df.rdd, schema)
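A more general sketch (my own variant, not from the answer above): rebuild the schema recursively instead of mutating the schema objects in place, so that arbitrarily deep structs and arrays get lowercased as well.
from pyspark.sql.types import StructType, StructField, ArrayType

def lowercase_schema(dtype):
    # sketch: recursively lowercase field names inside structs and arrays
    if isinstance(dtype, StructType):
        return StructType([StructField(f.name.lower(), lowercase_schema(f.dataType), f.nullable)
                           for f in dtype.fields])
    if isinstance(dtype, ArrayType):
        return ArrayType(lowercase_schema(dtype.elementType), dtype.containsNull)
    return dtype

df_lowercase = spark.createDataFrame(df.rdd, lowercase_schema(df.schema))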
The dataframe below is registered as a temp table named 'table_name'.
How would you use spark.sql() to give a prefix to all columns?
root
|-- MAIN_COL: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: string (nullable = true)
| |-- c: string (nullable = true)
| |-- d: string (nullable = true)
| |-- f: long (nullable = true)
| |-- g: long (nullable = true)
| |-- h: long (nullable = true)
| |-- j: long (nullable = true)
The below query
spark.sql("select MAIN_COL.* from table_name")
gives back columns named a, b, c..., but how can I make them all look like e.g. pre_a, pre_b, pre_c?
I want to avoid selecting them and giving them an alias one by one. What if I have 30 columns?
I was hoping a custom UDF used in SQL could solve it, but I'm really not sure how to handle this.
# Generate a pandas DataFrame
import pandas as pd
a_dict={
'a':[1,2,3,4,5],
'b':[1,2,3,4,5],
'c':[1,2,3,4,5],
'e':list('abcde'),
'f':list('abcde'),
'g':list('abcde')
}
pandas_df=pd.DataFrame(a_dict)
# Create a Spark DataFrame from a pandas DataFrame using Arrow
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df = spark.createDataFrame(pandas_df)
#struct
from pyspark.sql.functions import struct
main=df.select(struct(df.columns).alias("MAIN_COL"))
Here is one way to go through the fields and modify their names dynamically. First use main.schema.fields[0].dataType.fields to access the target fields. Next use python map to prepend pre_ to each field:
from pyspark.sql.types import *
from pyspark.sql.functions import col
inner_fields = main.schema.fields[0].dataType.fields
# [StructField(a,LongType,true),
# StructField(b,LongType,true),
# StructField(c,LongType,true),
# StructField(e,StringType,true),
# StructField(f,StringType,true),
# StructField(g,StringType,true)]
pre_cols = list(map(lambda sf: StructField(f"pre_{sf.name}", sf.dataType, sf.nullable), inner_fields))
new_schema = StructType(pre_cols)
main.select(col("MAIN_COL").cast(new_schema)).printSchema()
# root
# |-- MAIN_COL: struct (nullable = false)
# | |-- pre_a: long (nullable = true)
# | |-- pre_b: long (nullable = true)
# | |-- pre_c: long (nullable = true)
# | |-- pre_e: string (nullable = true)
# | |-- pre_f: string (nullable = true)
# | |-- pre_g: string (nullable = true)
Finally, you can use cast with the new schema as #Mahesh already mentioned.
The beauty of Spark is that you can programmatically manipulate metadata.
This is an example that continues the original code snippet:
main.createOrReplaceTempView("table_name")
new_cols_select = ", ".join(["MAIN_COL." + col + " as pre_" + col for col in spark.sql("select MAIN_COL.* from table_name").columns])
new_df = spark.sql(f"select {new_cols_select} from table_name")
Due to Spark's laziness, and because all the manipulations are metadata-only, this code has almost no performance cost and works the same for 10 columns or 500 columns (we are actually doing something similar on about 1k columns).
It is also possible to get the original column names in a more elegant way with the df.schema object.
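For example, a small sketch of that variant (assuming the same main/table_name setup as above):
# read the inner field names from the schema instead of running a probe query
inner_names = [f.name for f in main.schema["MAIN_COL"].dataType.fields]
new_cols_select = ", ".join(["MAIN_COL." + c + " as pre_" + c for c in inner_names])
new_df = spark.sql(f"select {new_cols_select} from table_name")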
You can try this: add all the columns, as per your requirements, to schema2 (note this snippet is Scala):
val schema2 = new StructType()
.add("pre_a",StringType)
.add("pre_b",StringType)
.add("pre_c",StringType)
Now select the column using:
df.select(col("MAIN_COL").cast(schema2)).show()
It will give you all the updated column names.
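Since that snippet is Scala, here is a rough PySpark equivalent as a sketch (my translation, assuming the same MAIN_COL struct; as the answer says, add an entry for every field you need in schema2):
from pyspark.sql.types import StructType, StringType
from pyspark.sql.functions import col

# sketch of the same idea in PySpark; extend with one .add(...) per field of MAIN_COL
schema2 = StructType().add("pre_a", StringType()).add("pre_b", StringType()).add("pre_c", StringType())
df.select(col("MAIN_COL").cast(schema2)).show()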
The following expands all the struct columns, adding the parent column name as a prefix.
struct_cols = [c for c, t in df.dtypes if t.startswith('struct')]
for c in struct_cols:
    schema = T.StructType([T.StructField(f"{c}_{f.name}", f.dataType, f.nullable) for f in df.schema[c].dataType.fields])
    df = df.withColumn(c, F.col(c).cast(schema))
df = df.select([f"{c}.*" if c in struct_cols else c for c in df.columns])
Test input:
from pyspark.sql import functions as F
from pyspark.sql import types as T
df = spark.createDataFrame([((1, 2), 5)], 'c1:struct<f1:int,f2:int>, c2:int')
print(df.dtypes)
# [('c1', 'struct<f1:int,f2:int>'), ('c2', 'int')]
Result:
print(df.dtypes)
# [('c1_f1', 'int'), ('c1_f2', 'int'), ('c2', 'int')]
You can also do this with PySpark:
df.select([col(c).alias('prefix' + c) for c in df.columns])
In PySpark, whenever I read a JSON file with an empty object element, the entire element is ignored in the resulting DataFrame. How can I ask Spark to include it instead of ignoring it?
I am using Spark 2.4.2 and Python 3.7.3.
I tried using df.fillna('Null'). This didn't work because by the time the DataFrame is created, the element is already gone.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
people = ['{"name":{},"address":{"city":"Columbus","state":"Ohio"}}']
otherPeopleRDD = spark.sparkContext.parallelize(people)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.printSchema()
root
|-- address: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
As we can see, the empty object element (name) is not part of the DataFrame.
Is there a way to have the name element considered?
Let me know if that helps:
people = ['{"name":{},"address":{"city":"Columbus","state":"Ohio"}}']
otherPeopleRDD = spark.sparkContext.parallelize(people)
otherPeople = spark.read.json(otherPeopleRDD)
# lit with None
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

otherPeople = otherPeople.withColumn('name', lit(None).cast(StringType()))
otherPeople.printSchema()
root
|-- address: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
|-- name: string (nullable = true)
+---------------+----+
| address|name|
+---------------+----+
|[Columbus,Ohio]|null|
+---------------+----+
EDIT
If the JSON is not too complex, this will work:
# Change the dictionary itself than changing it at df level
import json
d = json.loads(people[0])
# Takes care of any column which has empty dictionary value
for k, v in d.items():
    if v is None or len(v) == 0:  # You can add any conditions to detect an empty set
        d[k] = "nan"  # i prefer d[k] = None, and then fillna
people[0] = json.dumps(d)
otherPeopleRDD = spark.sparkContext.parallelize(people)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.printSchema()
otherPeople.show()
root
|-- address: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
|-- name: string (nullable = true)
+---------------+----+
| address|name|
+---------------+----+
|[Columbus,Ohio]|nan |
+---------------+----+
After a couple of hours struggling with a similar problem, here is what I found:
If the data has at least one row with a nonempty "name" field, then it won't be ignored. If it doesn't, then we need to add the "name" column with an initial value, like @Preetham said, so we can check whether "name" exists in the result's schema or not.
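A minimal sketch of that check (my own illustration, not from the answer above): add the column only when the JSON reader dropped it.
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

# add a null 'name' column only if the reader did not pick it up
if "name" not in otherPeople.schema.fieldNames():
    otherPeople = otherPeople.withColumn("name", lit(None).cast(StringType()))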
Another solution would be to add a sample row, with all fields filled with data, to the JSON file/string and then ignore or remove it from the result.
I have a data table in PySpark that contains two columns with a data type of 'struct'.
Please see sample data frame below:
word_verb word_noun
{_1=cook, _2=VB} {_1=chicken, _2=NN}
{_1=pack, _2=VBN} {_1=lunch, _2=NN}
{_1=reconnected, _2=VBN} {_1=wifi, _2=NN}
I want to concatenate the two columns together so I can do a frequency count of the concatenated verb and noun chunk.
I tried the code below:
df = df.withColumn('word_chunk_final', F.concat(F.col('word_verb'), F.col('word_noun')))
But I get the following error:
AnalysisException: u"cannot resolve 'concat(`word_verb`, `word_noun`)' due to data type mismatch: input to function concat should have been string, binary or array, but it's [struct<_1:string,_2:string>, struct<_1:string,_2:string>]
My desired output table is as follows. The concatenated new field would have datatype of string:
word_verb word_noun word_chunk_final
{_1=cook, _2=VB} {_1=chicken, _2=NN} cook chicken
{_1=pack, _2=VBN} {_1=lunch, _2=NN} pack lunch
{_1=reconnected, _2=VBN} {_1=wifi, _2=NN} reconnected wifi
Your code is almost there.
Assuming your schema is as follows:
df.printSchema()
#root
# |-- word_verb: struct (nullable = true)
# | |-- _1: string (nullable = true)
# | |-- _2: string (nullable = true)
# |-- word_noun: struct (nullable = true)
# | |-- _1: string (nullable = true)
# | |-- _2: string (nullable = true)
You just need to access the value of the _1 field for each column:
import pyspark.sql.functions as F
df.withColumn(
"word_chunk_final",
F.concat_ws(' ', F.col('word_verb')['_1'], F.col('word_noun')['_1'])
).show()
#+-----------------+------------+----------------+
#| word_verb| word_noun|word_chunk_final|
#+-----------------+------------+----------------+
#| [cook,VB]|[chicken,NN]| cook chicken|
#| [pack,VBN]| [lunch,NN]| pack lunch|
#|[reconnected,VBN]| [wifi,NN]|reconnected wifi|
#+-----------------+------------+----------------+
Also, you should use concat_ws ("concatenate with separator") instead of concat to add the strings together with a space between them. It's similar to how str.join works in Python.
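The frequency count mentioned in the question could then run on the new column; a small sketch using the same df and F alias (df_chunks is just a hypothetical name for the intermediate result):
df_chunks = df.withColumn(
    "word_chunk_final",
    F.concat_ws(' ', F.col('word_verb')['_1'], F.col('word_noun')['_1'])
)
# count how often each verb + noun chunk appears, most frequent first
df_chunks.groupBy("word_chunk_final").count().orderBy(F.desc("count")).show()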
Say I have a PySpark dataframe df:
>>> df.printSchema()
root
|-- a: struct
| |-- alpha: integer
| |-- beta: string
| |-- gamma: boolean
|-- b: string
|-- c: struct
| |-- delta: string
| |-- epsilon: struct
| | |-- omega: string
| | |-- psi: boolean
I know I can flatten the dataframe:
select_col_list = [col.replace("a", "a.*").replace("c", "c.*") for col in df.columns]
flat_df = df.select(*select_col_list)
This results in a schema like this:
root
|-- alpha: integer
|-- beta: string
|-- gamma: boolean
|-- b: string
|-- delta: string
|-- epsilon: struct
| |-- omega: string
| |-- psi: boolean
But when I flatten, I also want to prepend the parent column's name to its subcolumns, so I want the resulting schema to be like this:
root
|-- a_alpha: integer
|-- a_beta: string
|-- a_gamma: boolean
|-- b: string
|-- c_delta: string
|-- c_epsilon: struct
| |-- omega: string
| |-- psi: boolean
How do I do this?
I don't think there's a straightforward way to do it, but here's a hacky solution that I came up with.
Define a list of the columns to be expanded and create a temporary id column using pyspark.sql.functions.monotonically_increasing_id().
Loop over all the columns in the dataframe and create a temporary dataframe for each one.
If the column is in cols_to_expand: use .* to expand the column, then rename all fields (except id) in the resulting temporary dataframe with the corresponding prefix using alias().
If the column is not in cols_to_expand: Select that column and id and store it in a temporary dataframe.
Store temp_df in a list.
Join all the dataframes in the list using id and drop the id column.
Code:
from functools import reduce
from pyspark.sql import functions as f

df = df.withColumn('id', f.monotonically_increasing_id())
cols_to_expand = ['a', 'c']
flat_dfs = []
for col in df.columns:
    if col in cols_to_expand:
        temp_df = df.select('id', col + ".*")
        temp_df = temp_df.select(
            [
                f.col(c).alias(col + "_" + c if c != 'id' else c) for c in temp_df.columns
            ]
        )
    else:
        temp_df = df.select('id', col)
    flat_dfs.append(temp_df)
flat_df = reduce(lambda x, y: x.join(y, on='id'), flat_dfs)
flat_df = flat_df.drop('id')
flat_df.printSchema()
The resulting schema:
flat_df.printSchema()
#root
# |-- a_alpha: integer (nullable = true)
# |-- a_beta: string (nullable = true)
# |-- a_gamma: boolean (nullable = true)
# |-- b: string (nullable = true)
# |-- c_delta: string (nullable = true)
# |-- c_epsilon: struct (nullable = true)
# | |-- omega: string (nullable = true)
# | |-- psi: boolean (nullable = true)
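For what it's worth, a simpler single-select sketch (an alternative I'm adding, not part of the original answer) achieves the same one-level prefixed flattening without the id/join machinery:
from pyspark.sql import functions as f

cols_to_expand = ['a', 'c']
select_exprs = []
for c in df.columns:
    if c in cols_to_expand:
        # expand one level, prefixing each field with the parent column name
        select_exprs += [f.col(c + "." + fld.name).alias(c + "_" + fld.name)
                         for fld in df.schema[c].dataType.fields]
    else:
        select_exprs.append(f.col(c))
flat_df = df.select(select_exprs)
flat_df.printSchema()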
I actually found a way to do this today. First, use the beautiful auto-flattening PySpark function by Evan V. Combine this with the rather brilliant mass-renaming solution from proggeo, and you can basically build up a list of names down the full tree of columns and alias them all as you select.
In my case I took the result of the flatten function and replaced all the "." characters with "_" when renaming. Note that this flattens every level, so unlike the desired output above, c_epsilon's subfields would also become c_epsilon_omega and c_epsilon_psi. The result is as follows:
from pyspark.sql.types import StructType, ArrayType
from pyspark.sql.functions import col

def flatten(schema, prefix=None):
    fields = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        dtype = field.dataType
        if isinstance(dtype, ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, StructType):
            fields += flatten(dtype, prefix=name)
        else:
            fields.append(name)
    return fields

# Get actual field names, with nested '.' structure, and create equivalents with '_'
fields = flatten(df.schema)
fields_renamed = [field.replace(".", "_") for field in fields]

# Select while aliasing all fields
df = df.select(*[col(field).alias(new_field) for field, new_field in zip(fields, fields_renamed)])