Pandas DataFrame to Nested JSON Without Changing Data Structure


I have a pandas.DataFrame:
import pandas as pd
import json
df = pd.DataFrame([['2016-04-30T20:02:25.693Z', 'vmPowerOn', 'vmName'],['2016-04-07T22:35:41.145Z','vmPowerOff','hostName']],
columns=['date', 'event', 'object'])
                       date       event    object
0  2016-04-30T20:02:25.693Z   vmPowerOn    vmName
1  2016-04-07T22:35:41.145Z  vmPowerOff  hostName
I want to convert that dataframe into the following format:
{
    "name":"Alarm/Error",
    "data":[
        {"date": "2016-04-30T20:02:25.693Z", "details": {"event": "vmPowerOn", "object": "vmName"}},
        {"date": "2016-04-07T22:35:41.145Z", "details": {"event": "vmPowerOff", "object": "hostName"}}
    ]
}
So far, I've tried this:
df = df.to_dict(orient='records')
j = {"name":"Alarm/Error", "data":df}
json.dumps(j)
'{"name": "Alarm/Error",
"data": [{"date": "2016-04-30T20:02:25.693Z", "event": "vmPowerOn", "object": "vmName"},
{"date": "2016-04-07T22:35:41.145Z", "event": "vmPowerOff", "object": "hostName"}
]
}'
However, this obviously does not put the detail columns in their own dictionary.
How would I efficiently split the df date column and all other columns into separate parts of the JSON?

With a list and dict comprehension, you can do that like:
Code:
[{'date': x['date'], 'details': {k: v for k, v in x.items() if k != 'date'}}
for x in df.to_dict('records')]
Test Code:
df = pd.DataFrame([['2016-04-30T20:02:25.693Z', 'vmPowerOn', 'vmName'],
['2016-04-07T22:35:41.145Z', 'vmPowerOff', 'hostName']],
columns=['date', 'event', 'object'])
print([{'date': x['date'],
'details': {k: v for k, v in x.items() if k != 'date'}}
for x in df.to_dict('records')])
Results:
[{'date': '2016-04-30T20:02:25.693Z', 'details': {'event': 'vmPowerOn', 'object': 'vmName'}},
{'date': '2016-04-07T22:35:41.145Z', 'details': {'event': 'vmPowerOff', 'object': 'hostName'}}
]
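The comprehension above yields only the "data" list, so it still has to be wrapped to match the target format from the question. The following is a minimal sketch of that last step, not a definitive implementation: the helper name nest_records is hypothetical, and the records literal is a stand-in for what df.to_dict('records') returns for the question's DataFrame, so the example runs without pandas.

```python
import json

# Stand-in for df.to_dict('records') on the question's DataFrame (assumption).
records = [
    {"date": "2016-04-30T20:02:25.693Z", "event": "vmPowerOn", "object": "vmName"},
    {"date": "2016-04-07T22:35:41.145Z", "event": "vmPowerOff", "object": "hostName"},
]


def nest_records(records, key="date"):
    """Keep `key` at the top level and move every other field under 'details'."""
    return [
        {key: row[key], "details": {k: v for k, v in row.items() if k != key}}
        for row in records
    ]


# Wrap the nested rows in the outer object the question asks for.
payload = {"name": "Alarm/Error", "data": nest_records(records)}
print(json.dumps(payload, indent=2))
```

With a real DataFrame, passing df.to_dict('records') to the helper in place of the literal list should give the same structure.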

Related

Changing value of a value in a dictionary within a list within a dictionary

I have a json like:
pd = {
"RP": [
{
"Name": "PD",
"Value": "qwe"
},
{
"Name": "qwe",
"Value": "change"
}
],
"RFN": [
"All"
],
"RIT": [
{
"ID": "All",
"IDT": "All"
}
]
}
I am trying to change the value "change" to "changed". It sits in a dictionary within a list, which is itself within another dictionary. Is there a better, more efficient, or more Pythonic way to do this than what I did below?
for key, value in pd.items():
    ls = pd[key]
    for d in ls:
        if type(d) == dict:
            for k,v in d.items():
                if v == 'change':
                    pd[key][ls.index(d)][k] = "changed"
This seems pretty inefficient because of the number of times I iterate over the data.

String replacement could work if you don't want to write depth/breadth-first search.
>>> import json
>>> json.loads(json.dumps(pd).replace('"Value": "change"', '"Value": "changed"'))
{'RP': [{'Name': 'PD', 'Value': 'qwe'}, {'Name': 'qwe', 'Value': 'changed'}],
'RFN': ['All'],
'RIT': [{'ID': 'All', 'IDT': 'All'}]}
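The string replacement relies on the exact text that json.dumps produces, and it would also rewrite any other "Value": "change" pair in the document. For comparison, here is a rough sketch of the depth-first walk mentioned above. The function name replace_value and the variable name doc are hypothetical (the question calls its dictionary pd, which would shadow the usual pandas alias), and the walk replaces every value equal to the old one, including bare list items, which is slightly broader than the loop in the question.

```python
def replace_value(node, old, new):
    """Recursively replace every value equal to `old` with `new`, in place."""
    if isinstance(node, dict):
        for key, value in node.items():
            if value == old:
                node[key] = new  # reassigning an existing key is safe while iterating
            else:
                replace_value(value, old, new)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            if value == old:
                node[index] = new
            else:
                replace_value(value, old, new)
    return node


doc = {
    "RP": [{"Name": "PD", "Value": "qwe"}, {"Name": "qwe", "Value": "change"}],
    "RFN": ["All"],
    "RIT": [{"ID": "All", "IDT": "All"}],
}
replace_value(doc, "change", "changed")
print(doc["RP"][1])  # {'Name': 'qwe', 'Value': 'changed'}
```

Each element is visited once, so no list.index() lookups are needed.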

python dictionary to json

The output of a file comes as a list of dictionaries with 5 columns. Because of the 5th column (INVOICE), the first 4 are duplicated. My goal is to output it as JSON without duplicates, in the format shown below.
Sample input:
test_dict = [
{'ID':"A", 'ID_A':"A1",'ID_B':"A2",'ID_C':"A3",'INVOICE':"123"},
{'ID':"A", 'ID_A':"A1",'ID_B':"A2",'ID_C':"A3",'INVOICE':"345"}
]
Previously there were no duplicates so it was easy to transform to json as below:
from collections import defaultdict
result = defaultdict(set)
for i in test_dict:
    id = i.get('ID')
    if id:
        # key the set by the ID string; the row dict itself is unhashable
        result[id].add(i.get('ID_A'))
        result[id].add(i.get('ID_B'))
        result[id].add(i.get('ID_C'))
output = []
for id, details in result.items():
    output.append(
        {
            "ID": id,
            "otherDetails": {
                "IDs": [
                    {"id": ref} for ref in details
                ]
            },
        }
    )
How could I add INVOICE to this without duplicating the rows? The output would look like this:
[{'ID': 'A',
  'OtherDetails': {'IDs': [{'id': 'A1'},
                           {'id': 'A2'},
                           {'id': 'A3'}],
                   'INVOICE': [{'id': '123'},
                               {'id': '345'}]}}]
Thanks! (python 3.9)

Basically, you can just do the same as for the IDs, using a second defaultdict (or similar) for the invoice IDs. Afterwards, use a nested list/dict comprehension to build the final result.
from collections import defaultdict
id_to_ids = defaultdict(set)
id_to_inv = defaultdict(set)
for d in test_dict:
    id_to_ids[d["ID"]] |= {d[k] for k in ["ID_A", "ID_B", "ID_C"]}
    id_to_inv[d["ID"]] |= {d["INVOICE"]}
result = [{
    'ID': k,
    'OtherDetails': {
        'IDs': [{'id': v} for v in id_to_ids[k]],
        'INVOICE': [{'id': v} for v in id_to_inv[k]]
    }} for k in id_to_ids]
Note, though, that with this format you lose the information about which of the "other" IDs was which, and which invoice ID each of them was associated with.
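One further caveat with sets is that their iteration order is arbitrary, so the IDs and invoices may come out in a different order than they appeared in the input. A possible variation, sketched here under the assumption that first-seen order matters, uses plain dicts as ordered sets (dicts keep insertion order in Python 3.7+). The names grouped and entry are illustrative only.

```python
test_dict = [
    {"ID": "A", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "123"},
    {"ID": "A", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "345"},
]

grouped = {}
for row in test_dict:
    # One entry per ID; the inner dicts act as insertion-ordered sets.
    entry = grouped.setdefault(row["ID"], {"IDs": {}, "INVOICE": {}})
    for column in ("ID_A", "ID_B", "ID_C"):
        entry["IDs"][row[column]] = None
    entry["INVOICE"][row["INVOICE"]] = None

result = [
    {
        "ID": key,
        "OtherDetails": {
            "IDs": [{"id": v} for v in entry["IDs"]],
            "INVOICE": [{"id": v} for v in entry["INVOICE"]],
        },
    }
    for key, entry in grouped.items()
]
print(result)
```

Duplicates are still dropped, exactly as with the sets; only the ordering becomes deterministic.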

You were pretty close. I would make the intermediate dictionary a little more straightforward: just a dictionary keyed by ID, holding two lists.
When walking the original data, you only need to append INVOICE if there is already an entry for the ID. Then, when you create the "JSON" format (a list with one dictionary per ID), all you have to do is use the lists you have already generated. Here is the structure I propose.
from collections import defaultdict
test_dict = [
    {'ID':"A", 'ID_A':"A1",'ID_B':"A2",'ID_C':"A3",'INVOICE':"123"},
    {'ID':"A", 'ID_A':"A1",'ID_B':"A2",'ID_C':"A3",'INVOICE':"345"}
]
result = {}
for i in test_dict:
    id = i.get('ID')
    if not id:
        continue
    if id in result:
        # just add INVOICE
        result[id]['INVOICE'].append(i.get('INVOICE'))
    else:
        # ID not in result dictionary, so populate it
        result[id] = {'IDs': [ i.get('ID_A'), i.get('ID_B'), i.get('ID_C')],
                      'INVOICE' : [i.get('INVOICE')]
                     }
output = []
for id, details in result.items():
    output.append(
        {
            "ID": id,
            "otherDetails": {
                "IDs": details['IDs'],
                'INVOICE': details['INVOICE']
            }
        }
    )
Duplicate IDs are handled by the if id in result check, which only appends the invoice to the existing list of invoices. I will also add that, since we are using a lot of dict.get() calls rather than plain dict[] lookups, we are potentially adding a bunch of Nones to these lists.
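To make the point about None concrete, here is a small sketch with a hypothetical row that lacks the ID_B column. It compares collecting values with dict.get() against keeping only the keys that are actually present.

```python
# Hypothetical row with the ID_B column missing.
row = {"ID": "A", "ID_A": "A1", "ID_C": "A3", "INVOICE": "123"}
columns = ("ID_A", "ID_B", "ID_C")

# dict.get() silently substitutes None for the missing key.
with_get = [row.get(k) for k in columns]

# Checking membership first keeps only the values that exist.
filtered = [row[k] for k in columns if k in row]

print(with_get)  # ['A1', None, 'A3']
print(filtered)  # ['A1', 'A3']
```

Which behaviour is preferable depends on whether a missing column should be visible in the output or simply skipped.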

I like the answer from @tobias_k, but it does not handle duplicate values for any of the ID_* or invoice columns. His answer is the simplest if order and repetition are not important.
Check this out if they are important.
import pandas as pd
def create_item(df: pd.DataFrame):
    output = list()
    groups = df.groupby(["ID", "ID_A", "ID_B", "ID_C"])[["INVOICE"]]
    for group, gdf in groups:
        row = dict()
        row["ID"] = group[0]
        row["OtherDetails"] = dict()
        row["OtherDetails"]["IDS"] = [{"id": x} for x in group[1:]]
        row["OtherDetails"]["INVOICE"] = [{"id": x} for x in gdf["INVOICE"]]
        output.append(row)
    return output
test_dict = [
    {"ID": "A", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "123"},
    {"ID": "A", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "345"},
    {"ID": "B", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "123"},
    {"ID": "B", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "345"},
    {"ID": "B", "ID_A": "A1", "ID_B": "A2", "ID_C": "A3", "INVOICE": "123"},
]
test_df = pd.DataFrame(test_dict)
create_item(test_df)
Which will return
[{'ID': 'A',
'OtherDetails': {'IDS': [{'id': 'A1'}, {'id': 'A2'}, {'id': 'A3'}],
'INVOICE': [{'id': '123'}, {'id': '345'}]}},
{'ID': 'B',
'OtherDetails': {'IDS': [{'id': 'A1'}, {'id': 'A2'}, {'id': 'A3'}],
'INVOICE': [{'id': '123'}, {'id': '345'}, {'id': '123'}]}}]

python generator to pandas dataframe

I have a generator being returned from:
data = public_client.get_product_trades(product_id='BTC-USD', limit=10)
How do I turn the data into a pandas DataFrame?
The method's docstring reads:
"""{"Returns": [{
"time": "2014-11-07T22:19:28.578544Z",
"trade_id": 74,
"price": "10.00000000",
"size": "0.01000000",
"side": "buy"
}, {
"time": "2014-11-07T01:08:43.642366Z",
"trade_id": 73,
"price": "100.00000000",
"size": "0.01000000",
"side": "sell"
}]}"""
I have tried:
df = [x for x in data]
df = pd.DataFrame.from_records(df)
but it does not work, as I get the error:
AttributeError: 'str' object has no attribute 'keys'
When I print the result of "x for x in data" I see the list of dicts, but the end looks strange. Could this be why?
print(list(data))
[{'time': '2020-12-30T13:04:14.385Z', 'trade_id': 116918468, 'price': '27853.82000000', 'size': '0.00171515', 'side': 'sell'},{'time': '2020-12-30T12:31:24.185Z', 'trade_id': 116915675, 'price': '27683.70000000', 'size': '0.01683711', 'side': 'sell'}, 'message']
It looks to be a list of dicts but the end value is a single string 'message'.

Based on the updated question:
df = pd.DataFrame(list(data)[:-1])
Or, more cleanly:
df = pd.DataFrame([x for x in data if isinstance(x, dict)])
print(df)
                       time   trade_id           price        size  side
0  2020-12-30T13:04:14.385Z  116918468  27853.82000000  0.00171515  sell
1  2020-12-30T12:31:24.185Z  116915675  27683.70000000  0.01683711  sell
Oh, and BTW, you'll still need to change those strings into something usable...
So e.g.:
df['time'] = pd.to_datetime(df['time'])
for k in ['price', 'size']:
    df[k] = pd.to_numeric(df[k])
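One thing to watch with the snippets above: a generator can be iterated only once, so after list(data) or a comprehension has run, a second pass over data yields nothing. The sketch below reproduces the situation with a hypothetical stand-in generator, fake_trades, which mimics the stray 'message' string at the end; the real client object is not used here.

```python
def fake_trades():
    """Hypothetical stand-in for the client's generator, ending in a stray string."""
    yield {"time": "2020-12-30T13:04:14.385Z", "trade_id": 116918468,
           "price": "27853.82000000", "size": "0.00171515", "side": "sell"}
    yield {"time": "2020-12-30T12:31:24.185Z", "trade_id": 116915675,
           "price": "27683.70000000", "size": "0.01683711", "side": "sell"}
    yield "message"


data = fake_trades()
items = list(data)     # consume the generator exactly once
leftover = list(data)  # a second pass finds it already exhausted

# Keep only the dict rows; the trailing string is dropped.
rows = [x for x in items if isinstance(x, dict)]

print(len(items), len(leftover), len(rows))  # 3 0 2
```

The filtered rows list can then be handed to pd.DataFrame(rows), as in the answer above.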

You could access the values in the dictionary and build a dataframe from it (although not particularly clean):
dict_of_data = [{
"time": "2014-11-07T22:19:28.578544Z",
"trade_id": 74,
"price": "10.00000000",
"size": "0.01000000",
"side": "buy"
}, {
"time": "2014-11-07T01:08:43.642366Z",
"trade_id": 73,
"price": "100.00000000",
"size": "0.01000000",
"side": "sell"
}]
import pandas as pd
list_of_data = [list(dict_of_data[0].values()),list(dict_of_data[1].values())]
pd.DataFrame(list_of_data, columns=list(dict_of_data[0].keys())).set_index('time')

It's straightforward: just use the pd.DataFrame constructor:
#list_of_dicts = [{
# "time": "2014-11-07T22:19:28.578544Z",
# "trade_id": 74,
# "price": "10.00000000",
# "size": "0.01000000",
# "side": "buy"
# }, {
# "time": "2014-11-07T01:08:43.642366Z",
# "trade_id": 73,
# "price": "100.00000000",
# "size": "0.01000000",
# "side": "sell"
#}]
# or if you take it from 'data' (a generator cannot be sliced, so materialize it first)
list_of_dicts = list(data)[:-1]
df = pd.DataFrame(list_of_dicts)
df
Out[4]:
                          time  trade_id         price        size  side
0  2014-11-07T22:19:28.578544Z        74   10.00000000  0.01000000   buy
1  2014-11-07T01:08:43.642366Z        73  100.00000000  0.01000000  sell
UPDATE
According to the question update, it seems you have JSON data that is still a string, so parse it first:
import json
data = json.loads(data)
data = data['Returns']
pd.DataFrame(data)
                          time  trade_id         price        size  side
0  2014-11-07T22:19:28.578544Z        74   10.00000000  0.01000000   buy
1  2014-11-07T01:08:43.642366Z        73  100.00000000  0.01000000  sell

Converting CSV to Hierarchical JSON output

I am trying to convert a CSV file into a hierarchical JSON file. The CSV input is as follows; it contains two columns, gene and disease.
gene,disease
A1BG,Adenocarcinoma
A1BG,apnea
A1BG,Athritis
A2M,Asthma
A2M,Astrocytoma
A2M,Diabetes
NAT1,polyps
NAT1,lymphoma
NAT1,neoplasms
The expected output should be in the following format:
{
    "name": "A1BG",
    "children": [
        {"name": "Adenocarcinoma"},
        {"name": "apnea"},
        {"name": "Athritis"}
    ]
},
{
    "name": "A2M",
    "children": [
        {"name": "Asthma"},
        {"name": "Astrocytoma"},
        {"name": "Diabetes"}
    ]
},
{
    "name": "NAT1",
    "children": [
        {"name": "polyps"},
        {"name": "lymphoma"},
        {"name": "neoplasms"}
    ]
}
The Python code I have written is below. Let me know what I need to change to get the desired output.
import json
finalList = []
finalDict = {}
grouped = df.groupby(['gene'])
for key, value in grouped:
    dictionary = {}
    dictList = []
    anotherDict = {}
    j = grouped.get_group(key).reset_index(drop=True)
    dictionary['name'] = j.at[0, 'gene']
    for i in j.index:
        anotherDict['disease'] = j.at[i, 'disease']
        dictList.append(anotherDict)
    dictionary['children'] = dictList
    finalList.append(dictionary)
with open('outputresult3.json', "w") as out:
    json.dump(finalList,out)

import json
json_data = []
# group the data by each unique gene; a single column name (not a list)
# keeps `gene` a plain string rather than a 1-tuple in recent pandas versions
for gene, data in df.groupby("gene"):
    # obtain a list of diseases for the current gene
    diseases = data["disease"].tolist()
    # create a new list of dictionaries to satisfy json requirements
    children = [{"name": disease} for disease in diseases]
    entry = {"name": gene, "children": children}
    json_data.append(entry)
with open('outputresult3.json', "w") as out:
    json.dump(json_data, out)

Use DataFrame.groupby with a custom lambda function to convert the values to dictionaries with DataFrame.to_dict:
L = (df.rename(columns={'disease':'name'})
.groupby('gene')
.apply(lambda x: x[['name']].to_dict('records'))
.reset_index(name='children')
.rename(columns={'gene':'name'})
.to_dict('records')
)
print (L)
[{'name': 'A1BG', 'children': [{'name': 'Adenocarcinoma'},
{'name': 'apnea'},
{'name': 'Athritis'}]},
{'name': 'A2M', 'children': [{'name': 'Asthma'},
{'name': 'Astrocytoma'},
{'name': 'Diabetes'}]},
{'name': 'NAT1', 'children': [{'name': 'polyps'},
{'name': 'lymphoma'},
{'name': 'neoplasms'}]}]
with open('outputresult3.json', "w") as out:
    json.dump(L,out)
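For comparison, the same grouping can be sketched without pandas, using only the standard library. This is an illustrative alternative rather than what the answers above use: the CSV is read from an in-memory string (csv_text, an assumption standing in for the real file), and the names children_by_gene and tree are made up for the example.

```python
import csv
import io
import json
from collections import defaultdict

# In-memory copy of the question's CSV; a real script would open the file instead.
csv_text = """gene,disease
A1BG,Adenocarcinoma
A1BG,apnea
A1BG,Athritis
A2M,Asthma
A2M,Astrocytoma
A2M,Diabetes
NAT1,polyps
NAT1,lymphoma
NAT1,neoplasms
"""

# Collect one {"name": disease} child per row under its gene, in file order.
children_by_gene = defaultdict(list)
for row in csv.DictReader(io.StringIO(csv_text)):
    children_by_gene[row["gene"]].append({"name": row["disease"]})

tree = [
    {"name": gene, "children": children}
    for gene, children in children_by_gene.items()
]
print(json.dumps(tree, indent=2))
```

Unlike groupby, this keeps genes in the order they first appear in the file instead of sorting them.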

Appending value to dict key if dict is "similar" to another one

First of all, sorry for the title, I couldn't think of a good one to be honest.
I have a list of dictionaries like this
data=[
{'identifier': 'ID', 'resource': 'resource1' , 'name': 'name1'},
{'identifier': 'ID', 'resource': 'resource2' , 'name': 'name1'},
{'identifier': 'ID', 'resource': 'resource3' , 'name': 'name1'},
{'identifier': 'ID', 'resource': 'resource1' , 'name': 'name2'},
{'identifier': 'ID', 'resource': 'resource2' , 'name': 'name2'},
{'identifier': 'ID', 'resource': 'resource3' , 'name': 'name2'}
]
Basically, I want a dict that contains the name and every resource with that name, something like this
final = [
{
'name': 'name1',
'resources': ['resource1','resource2','resource3']
},
{
'name': 'name2',
'resources': ['resource1','resource2','resource3']
}
]
I have tried some approaches, like iterating over the first list and checking whether the key-value pair already exists in the second one, so that I can then append the resource to the key I want, but clearly that's not the correct way to do it.
I'm sure there's an easy way to do this, but I can't wrap my head around how to achieve it. Any ideas on how this can be done?

Group with a collections.defaultdict, with name as the grouping key and the resources for each name appended to a list:
from collections import defaultdict
data = [
{"identifier": "ID", "resource": "resource1", "name": "name1"},
{"identifier": "ID", "resource": "resource2", "name": "name1"},
{"identifier": "ID", "resource": "resource3", "name": "name1"},
{"identifier": "ID", "resource": "resource1", "name": "name2"},
{"identifier": "ID", "resource": "resource2", "name": "name2"},
{"identifier": "ID", "resource": "resource3", "name": "name2"},
]
d = defaultdict(list)
for x in data:
    d[x["name"]].append(x["resource"])
result = [{"name": k, "resources": v} for k, v in d.items()]
print(result)
However, since your names are ordered, we can also get away with using itertools.groupby:
from itertools import groupby
from operator import itemgetter
result = [
{"name": k, "resources": [x["resource"] for x in g]}
for k, g in groupby(data, key=itemgetter("name"))
]
print(result)
If your names are not ordered, then we will need to sort data by name:
result = [
{"name": k, "resources": [x["resource"] for x in g]}
for k, g in groupby(sorted(data, key=itemgetter("name")), key=itemgetter("name"))
]
print(result)
Output:
[{'name': 'name1', 'resources': ['resource1', 'resource2', 'resource3']}, {'name': 'name2', 'resources': ['resource1', 'resource2', 'resource3']}]
Note: I would probably just stick with the first defaultdict solution in most cases, because it doesn't care about order.
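To illustrate why the sort matters, here is a small sketch with deliberately interleaved names (the unsorted_data list is made up for the example). itertools.groupby only merges consecutive items with the same key, so without sorting a name can show up as several separate groups.

```python
from itertools import groupby
from operator import itemgetter

# Made-up input where the two "name1" rows are not adjacent.
unsorted_data = [
    {"resource": "resource1", "name": "name1"},
    {"resource": "resource1", "name": "name2"},
    {"resource": "resource2", "name": "name1"},
]
by_name = itemgetter("name")

# Without sorting, "name1" is split into two groups.
naive = [
    (k, [x["resource"] for x in g])
    for k, g in groupby(unsorted_data, key=by_name)
]

# Sorting by the same key first makes equal names adjacent.
fixed = [
    (k, [x["resource"] for x in g])
    for k, g in groupby(sorted(unsorted_data, key=by_name), key=by_name)
]

print(naive)  # [('name1', ['resource1']), ('name2', ['resource1']), ('name1', ['resource2'])]
print(fixed)  # [('name1', ['resource1', 'resource2']), ('name2', ['resource1'])]
```

The defaultdict version does not have this pitfall, which is one reason to prefer it.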
