I'm trying to create a JSON structure from a PySpark dataframe. I have the below columns in my dataframe: batch_id, batch_run_id, table_name, column_name, column_datatype, last_refresh_time, refresh_frequency, owner.
I want it in the below JSON structure:
{
  "GeneralInfo": {
    "DataSetID": "xxxx1234Abcsd",
    "Owner": ["test1@email.com", "test2@email.com", "test3@email.com"],
    "Description": "",
    "BusinessFunction": "",
    "Source": "",
    "RefreshRate": "Weekly",
    "LastUpdate": "2020/10/15",
    "InfoSource": "TemplateInfo"
  },
  "Tables": [
    {
      "TableName": "Employee",
      "Columns": [
        {
          "ColumnName": "EmployeeID",
          "ColumnDataType": "int"
        },
        {
          "ColumnName": "EmployeeName",
          "ColumnDataType": "string"
        }
      ]
    }
  ]
}
I'm trying to assign the values in the JSON string through dataframe column indexes, but it gives me the error "Object of type Column is not JSON serializable". I have used it like below:
{
  "GeneralInfo": {
    "DataSetID": df["batch_id"],
    "Owner": list(df["owner"]),
    "Description": "",
    "BusinessFunction": "",
    "Source": "",
    "RefreshRate": df["refresh_frequency"],
    "LastUpdate": df["last_update_time"],
    "InfoSource": "TemplateInfo"
  },
  "Tables": [
    {
      "TableName": df["table_name"],
      "Columns": [
        {
          "ColumnName": df["table_name"]["column_name"],
          "ColumnDataType": df["table_name"]["column_datatype"]
        }
      ]
    }
  ]
}
Sample Data - (table omitted from the original post)
Please help me on this, I have newly started coding in Pyspark.
I tried generating the JSON from the sample data you provided; the output format does not match your expected structure exactly, but you can refine the code below further.
We can use the toJSON function to convert a dataframe to JSON format. Before calling toJSON we need to use the struct() (and, where needed, array()) functions, passing the required columns, so the shape matches the required JSON.
from pyspark.sql import *
from pyspark.sql.functions import *
spark = SparkSession.builder.master('local[*]').getOrCreate()
in_values = [
    (123, '123abc', 'Employee', 'Employee_id', 'int', '21/05/15', 'Weekly',
     ['test1@gmail.com', 'test1@gmail.com', 'test3@gmail.com']),
    (123, '123abc', 'Employee', 'Employee_name', 'string', '21/05/15', 'Weekly',
     ['test1@gmail.com', 'test1@gmail.com', 'test3@gmail.com'])
]
cols = ["batch_id", "batch_run_id", "table_name", "column_name", "column_datatype",
"last_update_time", "refresh_frequency", "Owner"]
df = spark.createDataFrame(in_values).toDF(*cols)\
.selectExpr("*","'' Description", "'' BusinessFunction", "'TemplateInfo' InfoSource", "'' Source")
list1 = [df["batch_id"].alias("DataSetID"), df["Owner"], df["refresh_frequency"].alias("RefreshRate"),
df["last_update_time"].alias("LastUpdate"), "Description", "BusinessFunction","InfoSource", "Source"]
list2 = [df["table_name"].alias("TableName"),df["column_name"].alias("ColumnName"),
df["column_datatype"].alias("ColumnDataType")]
df.groupBy("batch_id") \
.agg(collect_set(struct(*list1))[0].alias("GeneralInfo"),
collect_list(struct(*list2)).alias("Tables")).drop("batch_id") \
.toJSON().foreach(print)
# outputs JSON --->
'''
{
  "GeneralInfo": {
    "DataSetID": 123,
    "Owner": [
      "test1@gmail.com",
      "test1@gmail.com",
      "test3@gmail.com"
    ],
    "RefreshRate": "Weekly",
    "LastUpdate": "21/05/15",
    "Description": "",
    "BusinessFunction": "",
    "InfoSource": "TemplateInfo",
    "Source": ""
  },
  "Tables": [
    {
      "TableName": "Employee",
      "ColumnName": "Employee_id",
      "ColumnDataType": "int"
    },
    {
      "TableName": "Employee",
      "ColumnName": "Employee_name",
      "ColumnDataType": "string"
    }
  ]
}
'''
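The output above still has one flat entry per column under "Tables", rather than the nested "Columns" array the question asks for. One option is to post-process the JSON strings on the driver with plain Python; this is a sketch assuming the aggregated rows were collected with .toJSON().collect() (the sample string below stands in for that result):

```python
import json
from itertools import groupby

# One JSON string per aggregated row, e.g. from df_agg.toJSON().collect()
rows = ['{"GeneralInfo": {"DataSetID": 123}, "Tables": ['
        '{"TableName": "Employee", "ColumnName": "Employee_id", "ColumnDataType": "int"},'
        '{"TableName": "Employee", "ColumnName": "Employee_name", "ColumnDataType": "string"}]}']

def regroup(row_json):
    doc = json.loads(row_json)
    tables = []
    # Group the flat (TableName, ColumnName, ColumnDataType) entries by table name
    for name, cols in groupby(sorted(doc["Tables"], key=lambda t: t["TableName"]),
                              key=lambda t: t["TableName"]):
        tables.append({
            "TableName": name,
            "Columns": [{"ColumnName": c["ColumnName"],
                         "ColumnDataType": c["ColumnDataType"]} for c in cols],
        })
    doc["Tables"] = tables
    return doc

print(json.dumps(regroup(rows[0]), indent=2))
```

Since sorted() is stable, columns keep their original order within each table.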
I have the below JSON string loaded into a dataframe. Now I want to filter the records based on ossId.
The condition I have is giving an error message. What is the correct way to filter by ossId?
import pandas as pd
data = """
{
"components": [
{
"ossId": 3946,
"project": "OALX",
"licenses": [
{
"name": "BSD 3",
"status": "APPROVED"
}
]
},
{
"ossId": 3946,
"project": "OALX",
"version": "OALX.client.ALL",
"licenses": [
{
"name": "GNU Lesser General Public License v2.1 or later",
"status": "APPROVED"
}
]
},
{
"ossId": 2550,
"project": "OALX",
"version": "OALX.webservice.ALL" ,
"licenses": [
{
"name": "MIT License",
"status": "APPROVED"
}
]
}
]
}
"""
df = pd.read_json(data)
print(df)
df1 = df[df["components"]["ossId"] == 2550]
I think your issue is due to the JSON structure. You are actually loading into df a single column whose rows are the dicts in the components list.
You should instead pass the list of records to the dataframe. Something like:
import json

json_data = json.loads(data)
df = pd.DataFrame(json_data["components"])
filtered_data = df[df["ossId"] == 2550]
You need to go into the cell's data and get the correct key:
df[df['components'].apply(lambda x: x.get('ossId')==2550)]
Use str
df[df.components.str['ossId']==2550]
Out[89]:
components
2 {'ossId': 2550, 'project': 'OALX', 'version': ...
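Another alternative is pd.json_normalize, which flattens the component dicts into proper columns so that ordinary boolean indexing works on ossId. A minimal sketch (assuming pandas 1.0 or later, where json_normalize is a top-level function; the inline data is a shortened version of the question's JSON):

```python
import json
import pandas as pd

data = """{"components": [
  {"ossId": 3946, "project": "OALX",
   "licenses": [{"name": "BSD 3", "status": "APPROVED"}]},
  {"ossId": 2550, "project": "OALX", "version": "OALX.webservice.ALL",
   "licenses": [{"name": "MIT License", "status": "APPROVED"}]}
]}"""

# Flatten the list of dicts into columns: ossId, project, licenses, version
df = pd.json_normalize(json.loads(data)["components"])
# ossId is now a regular integer column, so plain boolean indexing works
filtered = df[df["ossId"] == 2550]
```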
I've been wrestling with this for many days now and would appreciate any help.
I'm importing an Excel file into a Pandas data frame, resulting in the following dataframe [record]:
account_id  name    timestamp            value
A0001C      Fund_1  1588618800000000000  1
B0001B      Dev_2   1601578800000000000  1
I'm looking to produce nested JSON output (which will be used to submit data to an API), including adding "records" and "metrics" labels for the arrays.
Here is the output I'm looking for:
{
"records": [
{
"name": "Fund_1",
"account_id": "A0001C",
"metrics": [
{
"timestamp": 1588618800000000000,
"value": 1
}
]
},
{
"name": "Dev_2",
"account_id": "B0001B",
"metrics": [
{
"timestamp": 1601578800000000000,
"value": 1
}
]
}
]
}
I've gotten output as a non-nested JSON data set, but I am not able to split out the timestamp and value to add the metrics part.
for record in df.to_dict(orient='records'):
    record_data = {'records': [record]}
    payload_json = json.dumps(record_data)
    print(payload_json)
I get the following output:
{"records": [{"account_id": "A0001C", "name": "Fund_1", "Date Completed": 1588618800000000000, "Count": "1"}]}
{"records": [{"account_id": "B0001B", "name": "Dev_2", "Date Completed": 1601578800000000000, "Count": "1"}]}
Any help on how I can modify my code to add the metrics label and nest the data would be appreciated.
Thanks in advance.
One approach is through the use of pd.apply. This allows you to apply a function to series (either column- or row-wise) in your dataframe.
In your particular case, you want to apply the function row-by-row, so you have to use apply with axis=1:
records = list(df.apply(lambda row: {"name": row["name"],
"account_id": row["account_id"],
"metrics": [{
"timestamp": row["timestamp"],
"value": row["value"]}]
},
axis=1).values)
payload = {"records": records}
Alternatively, you could introduce an auxiliary column "metrics" in which you store your metrics (subsequently applying to_dict and json.dumps):
df["metrics"] = df.apply(lambda e: [{"timestamp": e.timestamp,
"value": e.value}],
axis=1)
records = df[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
Here's a full example applying option 2:
import io
import json
import pandas as pd
data = io.StringIO("""account_id name timestamp value
A0001C Fund_1 1588618800000000000 1
B0001B Dev_2 1601578800000000000 1""")
df = pd.read_csv(data, sep="\t")
df["metrics"] = df.apply(lambda e: [{"timestamp": e.timestamp,
"value": e.value}],
axis=1)
records = df[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
print(json.dumps(payload, indent=4))
Output:
{
"records": [
{
"account_id": "A0001C",
"name": "Fund_1",
"metrics": [
{
"timestamp": 1588618800000000000,
"value": 1
}
]
},
{
"account_id": "B0001B",
"name": "Dev_2",
"metrics": [
{
"timestamp": 1601578800000000000,
"value": 1
}
]
}
]
}
Edit: The second approach also makes grouping by accounts (in case you want to do that) rather easy. Below is a small example and output:
import io
import json
import pandas as pd
data = io.StringIO("""account_id name timestamp value
A0001C Fund_1 1588618800000000000 1
A0001C Fund_1 1588618900000000000 2
B0001B Dev_2 1601578800000000000 1""")
df = pd.read_csv(data, sep="\t")
# adding the metrics column as above
df["metrics"] = df.apply(lambda e: {"timestamp": e.timestamp,
"value": e.value},
axis=1)
# group metrics by account
df_grouped = df.groupby(by=["name", "account_id"]).metrics.agg(list).reset_index()
records = df_grouped[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
print(json.dumps(payload, indent=4))
Output:
{
"records": [
{
"account_id": "B0001B",
"name": "Dev_2",
"metrics": [
{
"timestamp": 1601578800000000000,
"value": 1
}
]
},
{
"account_id": "A0001C",
"name": "Fund_1",
"metrics": [
{
"timestamp": 1588618800000000000,
"value": 1
},
{
"timestamp": 1588618900000000000,
"value": 2
}
]
}
]
}
I have a CSV file loaded into a DataFrame (sample data omitted from the original post).
I want to write the data into the following JSON format using Python. I looked at a couple of links but got lost in the nested part. The links I checked:
How to convert pandas dataframe to uniquely structured nested json
convert dataframe to nested json
{
  "PHI": 2,
  "firstname": "john",
  "medicalHistory": {
    "allergies": "egg",
    "event": {
      "inPatient": {
        "hospitalized": {
          "visit": "7-20-20",
          "noofdays": "5",
          "test": {
            "modality": "xray"
          },
          "vitalSign": {
            "temperature": "32",
            "heartRate": "80"
          },
          "patientcondition": {
            "headache": "1",
            "cough": "0"
          }
        },
        "icu": {
          "visit": "",
          "noofdays": ""
        }
      },
      "outpatient": {
        "visit": "5-20-20",
        "vitalSign": {
          "temperature": "32",
          "heartRate": "80"
        },
        "patientcondition": {
          "headache": "1",
          "cough": "1"
        },
        "test": {
          "modality": "blood"
        }
      }
    }
  }
}
If anyone can help me with the nested array, that will be really helpful.
You need one or more helper functions to unpack data shaped like this. Write a main helper function that accepts two arguments: the df and a schema. The schema is used to unpack each row of the df into a nested structure. The schema below covers a subset of the logic you describe; although it is not exactly what you specified in the example, it should be enough of a hint for you to complete the rest of the task on your own.
from operator import itemgetter
groupby_idx = ['PHI', 'firstName']
groups = df.groupby(groupby_idx, as_index=False, dropna=False)  # note: groupby takes dropna, not drop
schema = {
"event": {
"eventType": itemgetter('event'),
"visit": itemgetter('visit'),
"noOfDays": itemgetter('noofdays'),
"test": {
"modality": itemgetter('test')
},
"vitalSign": {
"temperature": itemgetter('temperature'),
"heartRate": itemgetter('heartRate')
},
"patientCondition": {
"headache": itemgetter('headache'),
"cough": itemgetter('cough')
}
}
}
def unpack(obj, schema):
    tmp = {}
    for k, v in schema.items():
        if isinstance(v, dict):
            tmp[k] = unpack(obj, v)
        elif callable(v):
            tmp[k] = v(obj)
    return tmp

def apply_unpack(groups, schema):
    results = {}
    for gidx, df in groups:
        events = []
        for ridx, obj in df.iterrows():
            d = unpack(obj, schema)
            events.append(d)
        results[gidx] = events
    return results
unpacked = apply_unpack(groups, schema)
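To see how the recursive unpack step works on its own, independent of pandas, here is a minimal self-contained run against a plain dict standing in for one row (the row keys and schema here are made-up illustrations, not the question's actual columns):

```python
from operator import itemgetter

def unpack(obj, schema):
    # Walk the schema: nested dicts recurse, callables extract from the row
    tmp = {}
    for k, v in schema.items():
        if isinstance(v, dict):
            tmp[k] = unpack(obj, v)
        elif callable(v):
            tmp[k] = v(obj)
    return tmp

row = {"visit": "7-20-20", "noofdays": "5", "test": "xray",
       "temperature": "32", "heartRate": "80"}
schema = {"visit": itemgetter("visit"),
          "noOfDays": itemgetter("noofdays"),
          "test": {"modality": itemgetter("test")},
          "vitalSign": {"temperature": itemgetter("temperature"),
                        "heartRate": itemgetter("heartRate")}}
nested = unpack(row, schema)
```

Each itemgetter pulls a flat value out of the row, while the dict nesting in the schema dictates the nesting of the output.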
I need to convert a data.csv file to an "ExpectedJsonFile.json" file using the Python script below, but I am failing to achieve this. The Python script "csvjs.py" is as follows.
import pandas as pd
from itertools import groupby
from collections import OrderedDict
import json
df = pd.read_csv('data.csv', dtype={
    "Source": str,
    "Template": str,
    "ConfigurationSetName": str,
})
results = []
for (Source, Template, ConfigurationSetName), bag in df.groupby(["Source", "Template", "ConfigurationSetName"]):
    contents_df = bag.drop(["Source", "Template", "ConfigurationSetName"], axis=1)
    Destinations = [OrderedDict(row) for i, row in contents_df.iterrows()]
    results.append(OrderedDict([("Source", Source),
                                ("Template", Template),
                                ("ConfigurationSetName", ConfigurationSetName),
                                ("Destinations", Destinations)]))

print(json.dumps(results[0], indent=4))

with open('ExpectedJsonFile.json', 'w') as outfile:
    outfile.write(json.dumps(results[0], indent=4))
data in data.csv look like below.
Source,Template,ConfigurationSetName,ToAddresses,ReplacementTemplateData
demo@example.com,MyTemplate,noreply,customer1@gmail.com,customer1
demo@example.com,MyTemplate,noreply,customer2@gmail.com,customer2
The output produced when I run "python csvjs.py" is as below:
{
  "Source": "demo@example.com",
  "Template": "MyTemplate",
  "ConfigurationSetName": "noreply",
  "Destinations": [
    {
      "ToAddresses": "customer1@gmail.com",
      "ReplacementTemplateData": "customer1"
    },
    {
      "ToAddresses": "customer2@gmail.com",
      "ReplacementTemplateData": "customer2"
    }
  ]
}
But my expected output is as below:
{
  "Source": "demo@example.com",
  "Template": "MyTemplate",
  "ConfigurationSetName": "noreply",
  "Destinations": [
    {
      "Destination": {
        "ToAddresses": [
          "customer1@gmail.com"
        ]
      },
      "ReplacementTemplateData": "{ \"name\":\"customer1\" }"
    },
    {
      "Destination": {
        "ToAddresses": [
          "customer2@gmail.com"
        ]
      },
      "ReplacementTemplateData": "{ \"name\":\"customer2\" }"
    },
    {
      "Destination": {
        "ToAddresses": [
          "customer3@gmail.com"
        ]
      },
      "ReplacementTemplateData": "{}"
    }
  ],
  "DefaultTemplateData": "{ \"name\":\"friend\" }"
}
My template looks like below
{
"Template": {
"TemplateName": "MyTemplate",
"SubjectPart": "Greetings, {{Name}}!",
"HtmlPart": "<h1>Hello {{Name}},</h1><p>Your favorite animal is cat.</p>",
"TextPart": "Dear {{Name}},\r\nYour favorite animal is cat."
}
}
I partially succeeded by changing this line of code:
contents_df = bag.drop(["Source", "Template", "ConfigurationSetName", "ToAddresses", "ReplacementTemplateData"], axis=1)
and the output produced now looks like this:
{
  "Source": "demo@example.com",
  "Template": "MyTemplate",
  "ConfigurationSetName": "noreply",
  "Destinations": [
    {
      "Destination": {
        "ToAddresses": "customer1@gmail.com"
      }
    },
    {
      "Destination": {
        "ToAddresses": "customer2@gmail.com"
      }
    }
  ]
}
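The remaining gaps are that ToAddresses must be a list inside each Destination object and ReplacementTemplateData must itself be a JSON-encoded string, plus the top-level DefaultTemplateData key. A sketch of building those pieces explicitly, using only the standard library (the CSV text is inlined here; adapt the per-row handling to your pandas loop, and note json.dumps spaces the inner strings slightly differently from the expected output, which is harmless since they parse identically):

```python
import csv
import io
import json

csv_text = """Source,Template,ConfigurationSetName,ToAddresses,ReplacementTemplateData
demo@example.com,MyTemplate,noreply,customer1@gmail.com,customer1
demo@example.com,MyTemplate,noreply,customer2@gmail.com,customer2
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
result = {
    "Source": rows[0]["Source"],
    "Template": rows[0]["Template"],
    "ConfigurationSetName": rows[0]["ConfigurationSetName"],
    "Destinations": [
        {
            # ToAddresses must be a list, wrapped in a Destination object
            "Destination": {"ToAddresses": [r["ToAddresses"]]},
            # ReplacementTemplateData is itself a JSON-encoded string
            "ReplacementTemplateData": json.dumps({"name": r["ReplacementTemplateData"]}),
        }
        for r in rows
    ],
    "DefaultTemplateData": json.dumps({"name": "friend"}),
}
print(json.dumps(result, indent=4))
```

The customer3 entry in the expected output has no corresponding CSV row, so it is not produced here; the same per-row pattern would cover it if the row existed.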
I would like to produce the following JSON output with Python:
data={
"timestamp": "1462868427",
"sites": [
{
"name": "SiteA",
"zone": 1
},
{
"name": "SiteB",
"zone": 7
}
]
}
But I cannot manage to get the 'outer' data key there.
So far I got this output without the data key:
{
"timestamp": "1462868427",
"sites": [
{
"name": "SiteA",
"zone": 1
},
{
"name": "SiteB",
"zone": 7
}
]
}
I have tried with this python code:
import json

sites = [
{
"name":"nameA",
"zone":123
},
{
"name":"nameB",
"zone":324
}
]
data = {
"timestamp": 123456567,
"sites": sites
}
print(json.dumps(data, indent = 4))
But how do I manage to get the outer 'data' key there?
Once you have your data ready, you can simply do this :
data = {'data': data}
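Putting it together, a complete minimal sketch of wrapping and dumping in one go:

```python
import json

sites = [{"name": "SiteA", "zone": 1},
         {"name": "SiteB", "zone": 7}]
data = {"timestamp": "1462868427", "sites": sites}

# Wrap the existing dict under an outer "data" key before dumping
payload = {"data": data}
print(json.dumps(payload, indent=4))
```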
JSON doesn't have =, it's all key:value.
What you're looking for is
data = {
"data": {
"timestamp": 123456567,
"sites": sites
}
}
json.dumps(data)
json.dumps() doesn't care about the name you give to the data object in Python; you have to specify the key manually inside the object, as a string.