I have dataframe_1:
+-------------+----+---------+
| Name| Age| Salary|
+-------------+----+---------+
|Avery Bradley|25.0|7730337.0|
| Jae Crowder|25.0|6796117.0|
+-------------+----+---------+
and want to transform it to dataframe_2:
+----------------------------------------------------------------------------------------------------------------------+
| json_data |
+----------------------------------------------------------------------------------------------------------------------+
|[{"Name": "Avery Bradley", "Age": 25.0, "Salary" 7730337.0}, {"Name": "Jae Crowder", "Age": 25.0, "Salary" 6796117.0}]|
+----------------------------------------------------------------------------------------------------------------------+
I can do dataframe_1.toPandas().to_dict(orient="records"), but that would be a dataframe-to-dict (JSON object) transformation, and I need a dataframe-to-dataframe transformation.
A solution in PySpark, if possible, would be appreciated.
You can do a collect_list of json:
import pyspark.sql.functions as F
df2 = df.agg(F.collect_list(F.to_json(F.struct('*'))).alias('json_data'))
df2.show(truncate=False)
+--------------------------------------------------------------------------------------------------------------+
|json_data |
+--------------------------------------------------------------------------------------------------------------+
|[{"Name":"Avery Bradley","Age":25.0,"Salary":7730337.0}, {"Name":"Jae Crowder","Age":25.0,"Salary":6796117.0}]|
+--------------------------------------------------------------------------------------------------------------+
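If you need json_data to be a single JSON string (one array) rather than an array of JSON strings, a small variant should also work, assuming Spark 2.4 or later, where to_json accepts an array of structs:
import pyspark.sql.functions as F

# collect all rows as an array of structs, then serialize the whole array in one go
df2 = df.agg(F.to_json(F.collect_list(F.struct('*'))).alias('json_data'))
df2.show(truncate=False)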
You can merge the columns into a map and then create JSON out of it. Note that create_map coerces all values to a common type (strings here), which is why the numbers appear as strings in the output below:
(df
 .withColumn('json', F.to_json(F.create_map(
     F.lit('name'), F.col('name'),
     F.lit('age'), F.col('age'),
     F.lit('salary'), F.col('salary'),
 )))
 .agg(F.collect_list('json').alias('json_value'))
)
+----------------------------------------------------------------------------------------------------------------------+
|json_value |
+----------------------------------------------------------------------------------------------------------------------+
|[{"name":"Avery Bradley","age":"25.0","salary":"7730337.0"}, {"name":"Jae Crowder","age":"25.0","salary":"6796117.0"}]|
+----------------------------------------------------------------------------------------------------------------------+
I'm new to python & pandas and it took me awhile to get the results I wanted using below. Basically, I am trying to flatten a nested JSON. While json_normalize works, there are columns where it contains a list of objects(key/value). I want to breakdown that list and add them as separate columns. Below sample code I wrote worked out fine but was wondering if this could be simplified or improved further or an alternative? Most of the articles I've found relies on actually naming the columns and such (codes that I cant pick up on just yet), but I would rather have this as a function and name the columns dynamically. The output csv will be used for Power BI.
import json
import pandas as pd

with open('json.json', 'r') as json_file:
    jsondata = json.load(json_file)

df = pd.json_normalize(jsondata['results'], errors='ignore')

# columns whose cells hold lists of objects
y = df.columns[df.applymap(type).eq(list).any()]
for x in y:
    df = df.explode(x).reset_index(drop=True)
    df_exploded = pd.json_normalize(df.pop(x))
    for i in df_exploded.columns:
        df_exploded = df_exploded.rename(columns={i: x + '_' + i})
    df = df.join(df_exploded)

df.to_csv('json.csv')
Sample JSON Format (Not including the large JSON I was working on):
data = {
    'results': [
        {
            'recordType': 'recordType',
            'id': 'id',
            'values': {
                'orderType': [{'value': 'value', 'text': 'text'}],
                'type': [{'value': 'value', 'text': 'text'}],
                'entity': [{'value': 'value', 'text': 'text'}],
                'trandate': 'trandate'
            }
        }
    ]
}
The values part, when json_normalized, doesn't get flattened and requires explode and join.
You can use something like this:
df = pd.json_normalize(jsondata['results'], meta=['recordType', 'id'])[['recordType', 'id', 'values.trandate']]
record_paths = [['values', 'orderType'], ['values', 'type'], ['values', 'entity']]
for i in record_paths:
    df = pd.concat([df, pd.json_normalize(jsondata['results'], record_path=i, record_prefix=i[1])], axis=1)
df.to_csv('json.csv')
Or (much faster):
df = pd.DataFrame({
    'recordType': [i['recordType'] for i in jsondata['results']],
    'id': [i['id'] for i in jsondata['results']],
    'values.trandate': [i['values']['trandate'] for i in jsondata['results']],
    'orderTypevalue': [i['values']['orderType'][0]['value'] for i in jsondata['results']],
    'orderTypetext': [i['values']['orderType'][0]['text'] for i in jsondata['results']],
    'typevalue': [i['values']['type'][0]['value'] for i in jsondata['results']],
    'typetext': [i['values']['type'][0]['text'] for i in jsondata['results']],
    'entityvalue': [i['values']['entity'][0]['value'] for i in jsondata['results']],
    'entitytext': [i['values']['entity'][0]['text'] for i in jsondata['results']]
})
df.to_csv('json.csv')
Output:
| | recordType | id | values.trandate | orderTypevalue | orderTypetext | typevalue | typetext | entityvalue | entitytext |
|---:|:-------------|:-----|:------------------|:-----------------|:----------------|:------------|:-----------|:--------------|:-------------|
| 0 | recordType | id | trandate | value | text | value | text | value | text |
I have a CSV/EXCEL file. The sample data is shown below
+-----------+------------+------------+
| Number | start_date | end_date |
+-----------+------------+------------+
| 987654321 | 2021-07-15 | 2021-08-15 |
| 999999999 | 2021-07-15 | 2021-08-15 |
| 888888888 | 2021-07-15 | 2021-08-15 |
| 777777777 | 2021-07-15 | 2021-09-15 |
+-----------+------------+------------+
I need to convert it into dictionaries (JSON) with some conditions applied and then pass those dictionaries into DB rows. This means the CSV can produce n dictionaries.
Conditions to be applied:
Numbers that have the same start date and end date should go in the same dictionary.
All numbers in a dictionary should be concatenated into a single comma-separated string.
Expected dictionaries from the above input
dict1 = {
    "request": [
        {
            "key": "AMI_LIST",
            "value": "987654321,999999999,888888888"
        },
        {
            "key": "START_DATE",
            "value": "2021-07-15"
        },
        {
            "key": "END_DATE",
            "value": "2021-08-15"
        }
    ]
}
dict2 = {
    "request": [
        {
            "key": "AMI_LIST",
            "value": "777777777"
        },
        {
            "key": "START_DATE",
            "value": "2021-07-15"
        },
        {
            "key": "END_DATE",
            "value": "2021-09-15"
        }
    ]
}
All these dictionaries will be stored as model objects and then passed on to the DB. I am not creating a separate variable for each dict; they will be handled in a loop. dict1 and dict2 are just notation used to explain the expected output.
NOTE: A file will contain at most 500 rows.
I have tried using a for loop, but that increases the complexity. Is there any other way to approach this problem?
Thanks in advance for your help.
Yep, pandas is a really good option; you can do something like this:
import pandas as pd
import json
df = pd.read_csv("table.csv")
dfgrp = df.groupby(['end_date', 'start_date'], as_index = False).agg({"Number": list})
dfgrp.to_json()
which gives you:
{
'end_date': {'0': '2021-08-15', '1': '2021-09-15'},
'start_date': {'0': '2021-07-15', '1': '2021-07-15'},
'Number': {'0': [987654321, 999999999, 888888888], '1': [777777777]}
}
And you're almost there !
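From there, one possible way to finish (a sketch; it assumes the grouped frame dfgrp from above and the key names from the question) is to loop over the grouped rows and build one request dict per group:
requests = []
for _, row in dfgrp.iterrows():
    requests.append({
        "request": [
            {"key": "AMI_LIST", "value": ",".join(str(n) for n in row["Number"])},
            {"key": "START_DATE", "value": row["start_date"]},
            {"key": "END_DATE", "value": row["end_date"]},
        ]
    })
# requests[0] and requests[1] correspond to dict1 and dict2 from the question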
How about using pandas to do the job?
It should be capable of reading your Excel files in as a DataFrame (see here); then you can apply your conditions and use DataFrame.to_json (see here) to export your DataFrames to JSON files.
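A rough sketch of that idea (the file names are placeholders; the column names come from the sample above):
import pandas as pd

df = pd.read_excel("table.xlsx")  # or pd.read_csv("table.csv") for the CSV variant
grouped = df.groupby(["start_date", "end_date"], as_index=False).agg({"Number": list})
grouped.to_json("requests.json", orient="records")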
I have this dataframe:
data = [{"name": "test", "sentiment":'positive', "avg":13.65, "stddev":15.24},
{"name": "test", "sentiment":'neutral', "avg":338.74, "stddev":187.27},
{"name": "test", "sentiment":'negative', "avg":54.58, "stddev":50.19}]
df = spark.createDataFrame(data).select("name", "sentiment", "avg", "stddev")
df.show()
+----+---------+------+------+
|name|sentiment| avg|stddev|
+----+---------+------+------+
|test| positive| 13.65| 15.24|
|test| neutral|338.74|187.27|
|test| negative| 54.58| 50.19|
+----+---------+------+------+
I'd like to create a dataframe with this structure:
+----+------------+-----------+------------+------------+-----------+------------+
|name|avg_positive|avg_neutral|avg_negative|std_positive|std_neutral|std_negative|
+----+------------+-----------+------------+------------+-----------+------------+
|test| 13.65| 338.74| 54.58| 15.24| 187.27| 50.19|
+----+------------+-----------+------------+------------+-----------+------------+
I also don't know the name of this operation; feel free to suggest a proper title.
Thanks!
Use groupBy() and pivot():
df_grp = df.groupBy("name").pivot("sentiment").agg((F.first("avg").alias("avg")),(F.first("stddev").alias("stddev")) )
df_grp.show()
+----+------------+---------------+-----------+--------------+------------+---------------+
|name|negative_avg|negative_stddev|neutral_avg|neutral_stddev|positive_avg|positive_stddev|
+----+------------+---------------+-----------+--------------+------------+---------------+
|test| 54.58| 50.19| 338.74| 187.27| 13.65| 15.24|
+----+------------+---------------+-----------+--------------+------------+---------------+
Rename the columns if you really want the exact layout from the question, as sketched below.
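A minimal renaming sketch (it assumes the pivoted df_grp above; the target names are the ones requested in the question):
df_renamed = (df_grp
    .withColumnRenamed("positive_avg", "avg_positive")
    .withColumnRenamed("neutral_avg", "avg_neutral")
    .withColumnRenamed("negative_avg", "avg_negative")
    .withColumnRenamed("positive_stddev", "std_positive")
    .withColumnRenamed("neutral_stddev", "std_neutral")
    .withColumnRenamed("negative_stddev", "std_negative"))
df_renamed.show()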
I have a pyspark dataframe which looks like the following:
+----+--------------------+
| ID| Email|
+----+--------------------+
| 1| sample#example.org|
| 2| sample2#example.org|
| 3| sampleexample.org|
| 4| sample#exampleorg|
+----+--------------------+
What I need to do is to split it into chunks and then convert those chunks to dictionaries like:
chunk1
[{'ID': 1, 'Email': 'sample#example.org'}, {'ID': 2, 'Email': 'sample2#example.org'}]
chunk2
[{'ID': 3, 'Email': 'sampleexample.org'}, {'ID': 4, 'Email': 'sample#exampleorg'}]
I've found this post on SO, but I figured it would not make sense to first convert the chunks to a pandas dataframe and from there to a dictionary when I might be able to do it directly. Using the idea in that post, I've got the following code, but I'm not sure if this is the best way of doing it:
columns = spark_df.schema.fieldNames()
chunks = spark_df.repartition(num_chunks).rdd.mapPartitions(lambda iterator: [iterator.to_dict('records')]).toLocalIterator()
for list_of_dicts in chunks:
    # do work locally on list_of_dicts
You can return [[x.asDict() for x in iterator]] in the mapPartitions function (no need for Pandas). [x.asDict() for x in iterator] creates a list of dicts containing all rows in the same partition; we then enclose it in another list so that it is treated as a single item by toLocalIterator():
from json import dumps
num_chunks = 2
chunks = spark_df.repartition(num_chunks).rdd.mapPartitions(lambda iterator: [[x.asDict() for x in iterator]]).toLocalIterator()
for list_of_dicts in chunks:
    print(dumps(list_of_dicts))
#[{"ID": "2", "Email": "sample2#example.org"}, {"ID": "1", "Email": "sample#example.org"}]
#[{"ID": "4", "Email": "sample#exampleorg"}, {"ID": "3", "Email": "sampleexample.org"}]
I have a Map column in a Spark DF and would like to filter this column on a particular key (i.e. keep the row if a key in the map matches the desired value).
For example, my schema is defined as:
df_schema = StructType(
    [StructField('id', StringType()),
     StructField('rank', MapType(StringType(), IntegerType()))]
)
My sample data is:
{ "id": "0981850006", "rank": {"a": 1} }
Is there any way to filter my df on rows where "a" is in "rank" without using explode()?
Is there a better schema representation for the given json than what I have defined?
Accessing the key with rank.key would mean rank is a StructType(). Although explode is probably the best solution, let's build a UDF to assess whether or not k is a key of rank.
First let's create our dataframe:
from pyspark.sql.types import *

df_schema = StructType(
    [StructField('id', StringType()),
     StructField('rank', MapType(StringType(), IntegerType()))]
)
df = spark.createDataFrame([
    ["0981850006", {"a": 1}],
    ["0981850006", {"b": 2, "c": 3}],
], df_schema)
Now our UDF:
import pyspark.sql.functions as psf

def isKey(k, d):
    return k in d.keys()

isKey_udf = lambda k: psf.udf(lambda d: isKey(k, d), BooleanType())
Which gives:
df.withColumn(
    "is_key",
    isKey_udf('a')(df.rank)
).show()
+----------+-------------------+------+
| id| rank|is_key|
+----------+-------------------+------+
|0981850006| Map(a -> 1)| true|
|0981850006|Map(b -> 2, c -> 3)| false|
+----------+-------------------+------+
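To actually filter rather than just flag the rows, the same UDF can be reused as the predicate, for example:
# keep only the rows whose rank map contains the key 'a'
df.filter(isKey_udf('a')(df.rank)).show()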