Pandas json normalize an object property containing an array of objects - python

If I have json data formatted like this:
{
    "result": [
        {
            "id": 878787,
            "name": "Testing",
            "schema": {
                "id": 3463463,
                "smartElements": [
                    {
                        "svKey": "Model",
                        "value": {
                            "type": "type1",
                            "value": "ThisValue"
                        }
                    },
                    {
                        "svKey": "SecondKey",
                        "value": {
                            "type": "example",
                            "value": "ThisValue2"
                        }
                    }
                ]
            }
        },
        {
            "id": 333,
            "name": "NameName",
            "schema": {
                "id": 1111,
                "smartElements": [
                    {
                        "svKey": "Model",
                        "value": {
                            "type": "type1",
                            "value": "NewValue"
                        }
                    },
                    {
                        "svKey": "SecondKey",
                        "value": {
                            "type": "example",
                            "value": "ValueIs"
                        }
                    }
                ]
            }
        }
    ]
}
is there a way to normalize it so I end up with records:
name      Model      SecondKey
Testing   ThisValue  ThisValue2
NameName  NewValue   ValueIs
I can get smartElements into a pandas Series, but I can't figure out how to break out smartElements[x].svKey into a column header and smartElements[x].value.value into the value for that column, and/or merge them.

I'd skip trying to use a pre-baked solution and just navigate the json yourself.
import json
import pandas as pd

data = json.load(open('my.json'))
records = []
for d in data['result']:
    record = {}
    record['name'] = d['name']
    for ele in d['schema']['smartElements']:
        record[ele['svKey']] = ele['value']['value']
    records.append(record)
pd.DataFrame(records)
       name      Model   SecondKey
0   Testing  ThisValue  ThisValue2
1  NameName   NewValue     ValueIs

My solution
import pandas as pd
import json
with open('test.json') as f:
    a = json.load(f)

d = pd.json_normalize(data=a['result'], errors='ignore', record_path=['schema', 'smartElements'], meta=['name'])
print(d)
produces
       svKey value.type value.value      name
0      Model      type1   ThisValue   Testing
1  SecondKey    example  ThisValue2   Testing
2      Model      type1    NewValue  NameName
3  SecondKey    example     ValueIs  NameName
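To get from this long format to the wide table the question asks for, one option is to pivot on svKey. A sketch (the DataFrame below is reconstructed by hand to match the json_normalize output above):

```python
import pandas as pd

# Reconstructed by hand to match the json_normalize output shown above
d = pd.DataFrame({
    "svKey": ["Model", "SecondKey", "Model", "SecondKey"],
    "value.value": ["ThisValue", "ThisValue2", "NewValue", "ValueIs"],
    "name": ["Testing", "Testing", "NameName", "NameName"],
})

# Turn each svKey into its own column, with value.value as the cell value
wide = (d.pivot(index="name", columns="svKey", values="value.value")
          .reset_index()
          .rename_axis(columns=None))
print(wide)
```

This works because each (name, svKey) pair occurs exactly once; with duplicates you would need pivot_table and an aggregation function instead.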

Related

Flattening Multi-Level Nested Object to DataFrame

I am trying to convert an object/dictionary to a Python DataFrame using the following code:
sr = pd.Series(object)
df = pd.DataFrame(sr.values.tolist())
display(df)
It works well, but some of the output columns are of object/dictionary type, and I would like to break them up into multiple columns. For example, if column "Items" produces the following value in a cell:
obj = {
    "item1": {
        "id": "item1",
        "relatedItems": [
            {
                "id": "1111",
                "category": "electronics"
            },
            {
                "id": "9999",
                "category": "electronics",
                "subcategory": "computers"
            },
            {
                "id": "2222",
                "category": "electronics",
                "subcategory": "computers",
                "additionalData": {
                    "createdBy": "Doron",
                    "inventory": 100
                }
            }
        ]
    },
    "item2": {
        "id": "item2",
        "relatedItems": [
            {
                "id": "4444",
                "category": "furniture",
                "subcategory": "sofas"
            },
            {
                "id": "5555",
                "category": "books"
            },
            {
                "id": "6666",
                "category": "electronics",
                "subcategory": "computers",
                "additionalData": {
                    "createdBy": "Joe",
                    "inventory": 5,
                    "condition": {
                        "name": "new",
                        "inspectedBy": "Doron"
                    }
                }
            }
        ]
    }
}
The desired output is:
I tried using df.explode, but it multiplies the row into multiple rows. I am looking for a way to achieve the same result but split into columns, retaining a single row.
Any suggestions?
You can use the pd.json_normalize function to flatten the nested dictionary into multiple columns, with the keys joined with a dot (.).
Note that pd.json_normalize expects a dict or a list of records, so convert the Series first with .to_dict():
sr = pd.Series({
    'Items': {
        'item_name': 'name',
        'item_value': 'value'
    }
})
df = pd.json_normalize(sr.to_dict(), sep='.')
display(df)
This will give you the following df:
  Items.item_name Items.item_value
0            name            value
You can also restrict normalization by passing the record_path parameter to pd.json_normalize; note that record_path must point to a list of records, so it applies when 'Items' holds a list rather than a single dict:
df = pd.json_normalize(sr.to_dict(), record_path='Items', sep='.')
display(df)
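For completeness, record_path is designed for list-valued keys. A minimal sketch with hypothetical data where 'Items' holds a list of records:

```python
import pandas as pd

# Hypothetical data: 'Items' holds a list of records, which is what record_path expects
obj = {
    "Items": [
        {"item_name": "name", "item_value": "value"},
        {"item_name": "name2", "item_value": "value2"},
    ]
}
# Each element of the 'Items' list becomes one row in the result
df = pd.json_normalize(obj, record_path="Items", sep=".")
print(df)
```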
Seems like you're looking for pandas.json_normalize, which has a sep parameter:
obj = {
    'name': 'Doron Barel',
    'items': {
        'item_name': 'name',
        'item_value': 'value',
        'another_item_prop': [
            {
                'subitem1_name': 'just_another_name',
                'subitem1_value': 'just_another_value',
            },
            {
                'subitem2_name': 'one_more_name',
                'subitem2_value': 'one_more_value',
            }
        ]
    }
}

df = pd.json_normalize(obj, sep='.')

ser = df.pop('items.another_item_prop').explode()

out = (df.join(pd.DataFrame(ser.tolist(), index=ser.index)
                 .rename(columns=lambda x: ser.name + "." + x))
         .groupby("name", as_index=False).first()
      )
Output:
print(out)

          name items.item_name items.item_value items.another_item_prop.subitem1_name items.another_item_prop.subitem1_value items.another_item_prop.subitem2_name items.another_item_prop.subitem2_value
0  Doron Barel            name            value                      just_another_name                     just_another_value                         one_more_name                         one_more_value

How to convert a list of dictionaries to a list?

How to convert a list of dictionaries to a list?
Here is what I have:
{
    "sources": [
        {
            "ID": "6953",
            "VALUE": "https://address-jbr.ofp.ae"
        },
        {
            "ID": "6967",
            "VALUE": "https://plots.ae"
        },
        {
            "ID": "6970",
            "VALUE": "https://dubai-creek-harbour.ofp.ae"
        }
    ]
}
Here is what I want it to look like:
({'6953':'https://address-jbr.ofp.ae','6967':'https://plots.ae','6970':'https://dubai-creek-harbour.ofp.ae'})
This is indeed very straightforward:
data = {
    "sources": [
        {
            "ID": "6953",
            "VALUE": "https://address-jbr.ofp.ae"
        },
        {
            "ID": "6967",
            "VALUE": "https://plots.ae"
        },
        {
            "ID": "6970",
            "VALUE": "https://dubai-creek-harbour.ofp.ae"
        }
    ]
}
Then:
data_list = [{x["ID"]: x["VALUE"]} for x in data["sources"]]
Which is the same as:
data_list = []
for x in data["sources"]:
    data_list.append({
        x["ID"]: x["VALUE"]
    })
EDIT: You said convert to a "list" in the question and that confused me. Then this is what you want:
data_dict = {x["ID"]: x["VALUE"] for x in data["sources"]}
Which is the same as:
data_dict = {}
for x in data["sources"]:
    data_dict[x["ID"]] = x["VALUE"]
P.S. Seems like you're asking for answers to your course assignments or something here, which is not what this place is for.
A solution using pandas
import pandas as pd
data = {
    "sources": [
        {"ID": "6953", "VALUE": "https://address-jbr.ofp.ae"},
        {"ID": "6967", "VALUE": "https://plots.ae"},
        {"ID": "6970", "VALUE": "https://dubai-creek-harbour.ofp.ae"},
    ]
}
a = pd.DataFrame.from_dict(data["sources"])
print(a.set_index("ID").T.to_dict(orient="records"))
outputs to:
[{'6953': 'https://address-jbr.ofp.ae', '6967': 'https://plots.ae', '6970': 'https://dubai-creek-harbour.ofp.ae'}]
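If a plain dict (rather than a one-element list) is acceptable, the same DataFrame collapses with zip; a sketch:

```python
import pandas as pd

data = {
    "sources": [
        {"ID": "6953", "VALUE": "https://address-jbr.ofp.ae"},
        {"ID": "6967", "VALUE": "https://plots.ae"},
        {"ID": "6970", "VALUE": "https://dubai-creek-harbour.ofp.ae"},
    ]
}
a = pd.DataFrame.from_dict(data["sources"])
# Pair each ID with its VALUE and build a single dict
mapping = dict(zip(a["ID"], a["VALUE"]))
print(mapping)
```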
This should work:
Dict = {
    "sources": [
        {
            "ID": "6953",
            "VALUE": "https://address-jbr.ofp.ae"
        },
        {
            "ID": "6967",
            "VALUE": "https://plots.ae"
        },
        {
            "ID": "6970",
            "VALUE": "https://dubai-creek-harbour.ofp.ae"
        }
    ]
}

# Store all the values here
value_LIST = []
for item_of_list in Dict["sources"]:
    for key, value in item_of_list.items():
        value_LIST.append(value)

Pandas to JSON not respecting DataFrame format

I have a Pandas DataFrame which I need to transform into a JSON object. I thought grouping it would achieve this, but it does not seem to yield the correct results. Furthermore, I wouldn't know how to name the sub-group.
My data frame is as follows:

parent  name  age
nick    stef   10
nick    rob    12
And I do a groupby, as I would like all children together under one parent in the JSON:
df = df.groupby(['parent', 'name'])['age'].min()
And I would like it to yield the following:
{
    "parent": "Nick",
    "children": [
        {
            "name": "Rob",
            "age": 10
        },
        {
            "name": "Stef",
            "age": 15
        },
        ...
    ]
}
When I do .to_json() it seems to regroup everything on age etc.
df.groupby(['parent'])[['name', 'age']].apply(list).to_json()
Given I wanted to add some styling, I ended up solving it as follows:
import json

df_grouped = df.groupby('parent')
new = []
for group_name, df_group in df_grouped:
    base = {}
    base['parent'] = group_name
    children = []
    for row_index, row in df_group.iterrows():
        temp = {}
        temp['name'] = row['name']
        temp['age'] = row['age']
        children.append(temp)
    base['children'] = children
    new.append(base)
json_format = json.dumps(new)
print(new)
print(new)
Which yielded the following results:
[
    {
        "parent": "fee",
        "children": [
            {
                "name": "bob",
                "age": 9
            },
            {
                "name": "stef",
                "age": 10
            }
        ]
    },
    {
        "parent": "nick",
        "children": [
            {
                "name": "stef",
                "age": 10
            },
            {
                "name": "tobi",
                "age": 2
            },
            {
                "name": "ralf",
                "age": 12
            }
        ]
    },
    {
        "parent": "patrick",
        "children": [
            {
                "name": "marion",
                "age": 10
            }
        ]
    }
]
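The loop above can also be condensed into a comprehension over the groupby; a sketch using a small hand-built frame matching the question's data:

```python
import json
import pandas as pd

# Small frame matching the question's example data
df = pd.DataFrame({
    "parent": ["nick", "nick"],
    "name": ["stef", "rob"],
    "age": [10, 12],
})

# One dict per parent, with the children as a list of records
records = [
    {"parent": parent, "children": group[["name", "age"]].to_dict("records")}
    for parent, group in df.groupby("parent")
]
# default=int guards against numpy integer types leaking into json.dumps
print(json.dumps(records, default=int, indent=2))
```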

Convert PANDAS dataframe to nested JSON + add array name

I've been wrestling with this for many days now and would appreciate any help.
I'm importing an Excel file into a Pandas data frame, resulting in the following dataframe [record]:

account_id  name    timestamp            value
A0001C      Fund_1  1588618800000000000  1
B0001B      Dev_2   1601578800000000000  1
I'm looking to produce a nested JSON output (it will be used to submit data to an API), including adding "records" and "metrics" labels for the arrays.
Here is the output I'm looking for:
{
    "records": [
        {
            "name": "Fund_1",
            "account_id": "A0001C",
            "metrics": [
                {
                    "timestamp": 1588618800000000000,
                    "value": 1
                }
            ]
        },
        {
            "name": "Dev_2",
            "account_id": "B0001B",
            "metrics": [
                {
                    "timestamp": 1601578800000000000,
                    "value": 1
                }
            ]
        }
    ]
}
I've gotten a non-nested JSON data set as output, but I'm not able to split out the timestamp and value to add the metrics part.
for record in df.to_dict(orient='records'):
    record_data = {'records': [record]}
    payload_json = json.dumps(record_data)
    print(payload_json)
I get the following output:
{"records": [{"account_id": "A0001C", "name": "Fund_1", "Date Completed": 1588618800000000000, "Count": "1"}]}
{"records": [{"account_id": "B0001B", "name": "Dev_2", "Date Completed": 1601578800000000000, "Count": "1"}]}
Any help on how I can modify my code to add the metrics label and nest the data would be appreciated.
Thanks in advance.
One approach is through the use of pd.apply. This allows you to apply a function to series (either column- or row-wise) in your dataframe.
In your particular case, you want to apply the function row-by-row, so you have to use apply with axis=1:
records = list(df.apply(lambda row: {"name": row["name"],
                                     "account_id": row["account_id"],
                                     "metrics": [{
                                         "timestamp": row["timestamp"],
                                         "value": row["value"]}]
                                     },
                        axis=1).values)
payload = {"records": records}
Alternatively, you could introduce an auxiliary column "metrics" in which you store your metrics (subsequently applying to_dict):
df["metrics"] = df.apply(lambda e: [{"timestamp": e.timestamp,
                                     "value": e.value}],
                         axis=1)
records = df[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
Here's a full example applying option 2:
import io
import json
import pandas as pd
data = io.StringIO("""account_id name timestamp value
A0001C Fund_1 1588618800000000000 1
B0001B Dev_2 1601578800000000000 1""")
df = pd.read_csv(data, sep="\t")
df["metrics"] = df.apply(lambda e: [{"timestamp": e.timestamp,
                                     "value": e.value}],
                         axis=1)
records = df[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
print(json.dumps(payload, indent=4))
Output:
{
    "records": [
        {
            "account_id": "A0001C",
            "name": "Fund_1",
            "metrics": [
                {
                    "timestamp": 1588618800000000000,
                    "value": 1
                }
            ]
        },
        {
            "account_id": "B0001B",
            "name": "Dev_2",
            "metrics": [
                {
                    "timestamp": 1601578800000000000,
                    "value": 1
                }
            ]
        }
    ]
}
Edit: The second approach also makes grouping by accounts (in case you want to do that) rather easy. Below is a small example and output:
import io
import json
import pandas as pd
data = io.StringIO("""account_id name timestamp value
A0001C Fund_1 1588618800000000000 1
A0001C Fund_1 1588618900000000000 2
B0001B Dev_2 1601578800000000000 1""")
df = pd.read_csv(data, sep="\t")
# adding the metrics column as above
df["metrics"] = df.apply(lambda e: {"timestamp": e.timestamp,
                                    "value": e.value},
                         axis=1)
# group metrics by account
df_grouped = df.groupby(by=["name", "account_id"]).metrics.agg(list).reset_index()
records = df_grouped[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
print(json.dumps(payload, indent=4))
Output:
{
    "records": [
        {
            "account_id": "B0001B",
            "name": "Dev_2",
            "metrics": [
                {
                    "timestamp": 1601578800000000000,
                    "value": 1
                }
            ]
        },
        {
            "account_id": "A0001C",
            "name": "Fund_1",
            "metrics": [
                {
                    "timestamp": 1588618800000000000,
                    "value": 1
                },
                {
                    "timestamp": 1588618900000000000,
                    "value": 2
                }
            ]
        }
    ]
}

How to add square brackets in JSON object with python

I just need contexts to be an array, i.e., 'contexts': [{}] instead of 'contexts': {}.
Below is my python code which helps in converting python data-frame to required JSON format
This is the sample df for one row:

name     type   aim       context
xxx xxx  specs  67646546  United States of America
data = {'entities': []}
for key, grp in df.groupby('name'):
    for idx, row in grp.iterrows():
        temp_dict_alpha = {'name': key, 'type': row['type'],
                           'data': {'contexts': {'attributes': {}, 'context': {'dcountry': row['dcountry']}}}}
        attr_row = row[~row.index.isin(['name', 'type'])]
        for idx2, row2 in attr_row.iteritems():
            dict_temp = {}
            dict_temp[idx2] = {'values': []}
            dict_temp[idx2]['values'].append({'value': row2, 'source': 'internal', 'locale': 'en_Us'})
            temp_dict_alpha['data']['contexts']['attributes'].update(dict_temp)
        data['entities'].append(temp_dict_alpha)
print(json.dumps(data, indent=4))
Desired output:
{
    "entities": [{
        "name": "XXX XXX",
        "type": "specs",
        "data": {
            "contexts": [{
                "attributes": {
                    "aim": {
                        "values": [{
                            "value": 67646546,
                            "source": "internal",
                            "locale": "en_Us"
                        }]
                    }
                },
                "context": {
                    "country": "United States of America"
                }
            }]
        }
    }]
}
However I am getting below output
{
    "entities": [{
        "name": "XXX XXX",
        "type": "specs",
        "data": {
            "contexts": {
                "attributes": {
                    "aim": {
                        "values": [{
                            "value": 67646546,
                            "source": "internal",
                            "locale": "en_Us"
                        }]
                    }
                },
                "context": {
                    "country": "United States of America"
                }
            }
        }
    }]
}
Can any one please suggest ways for solving this problem using Python.
I think this does it:
import pandas as pd
import json

df = pd.DataFrame([['xxx xxx', 'specs', '67646546', 'United States of America']],
                  columns=['name', 'type', 'aim', 'context'])

data = {'entities': []}
for key, grp in df.groupby('name'):
    for idx, row in grp.iterrows():
        # 'contexts' is now a one-element list, so it serializes as a JSON array
        temp_dict_alpha = {'name': key, 'type': row['type'],
                           'data': {'contexts': [{'attributes': {}, 'context': {'country': row['context']}}]}}
        attr_row = row[~row.index.isin(['name', 'type'])]
        for idx2, row2 in attr_row.iteritems():  # use .items() on pandas >= 2.0
            if idx2 != 'aim':
                continue
            dict_temp = {}
            dict_temp[idx2] = {'values': []}
            dict_temp[idx2]['values'].append({'value': row2, 'source': 'internal', 'locale': 'en_Us'})
            temp_dict_alpha['data']['contexts'][0]['attributes'].update(dict_temp)
        data['entities'].append(temp_dict_alpha)
print(json.dumps(data, indent=4))
Output:
{
    "entities": [
        {
            "name": "xxx xxx",
            "type": "specs",
            "data": {
                "contexts": [
                    {
                        "attributes": {
                            "aim": {
                                "values": [
                                    {
                                        "value": "67646546",
                                        "source": "internal",
                                        "locale": "en_Us"
                                    }
                                ]
                            }
                        },
                        "context": {
                            "country": "United States of America"
                        }
                    }
                ]
            }
        }
    ]
}
The problem is in the following line:
temp_dict_alpha = {'name': key, 'type': row['type'], 'data': {'contexts': {'attributes': {}, 'context': {'dcountry': row['dcountry']}}}}
As you can see, you are creating contexts as a dict and assigning values to it. What you could do is something like this:
contextList = []
for idx, row in grp.iterrows():
    contextObj = {'attributes': {}, 'context': {'dcountry': row['dcountry']}}
    attr_row = row[~row.index.isin(['name', 'type'])]
    for idx2, row2 in attr_row.iteritems():
        dict_temp = {}
        dict_temp[idx2] = {'values': []}
        dict_temp[idx2]['values'].append({'value': row2, 'source': 'internal', 'locale': 'en_Us'})
        contextObj['attributes'].update(dict_temp)
    contextList.append(contextObj)
Please note - this code may have logical errors and might not run as-is (it is difficult for me to follow the full logic behind it), but here is what you need to do:
You need to create a list of objects, which is not what you are doing. You are manipulating a single object, so when it is JSON-dumped you get an object back instead of a list. What you need is a list: create a context object on each iteration and keep appending it to the local contextList created earlier.
Once the for loop terminates, update your original object using contextList, and you will have a list of objects instead of the single object you have now.
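The underlying rule can be shown in two lines: json.dumps serializes a Python dict as a JSON object and a list as a JSON array, so wrapping the contexts dict in a one-element list is all it takes (hypothetical minimal payload):

```python
import json

# A dict serializes as a JSON object; a one-element list of dicts as a JSON array
as_object = {"contexts": {"attributes": {}}}
as_array = {"contexts": [{"attributes": {}}]}

print(json.dumps(as_object))  # {"contexts": {"attributes": {}}}
print(json.dumps(as_array))   # {"contexts": [{"attributes": {}}]}
```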
