Convert pandas DataFrame to 2-layer nested JSON using groupby - python

Assume that I have a pandas dataframe called df similar to:
source  tables
src1    table1
src1    table2
src1    table3
src2    table1
src2    table2
I'm currently able to output a JSON file that iterates through the various sources, creating an object for each, with the code below:
all_data = []
for src in df['source']:
    source_data = {
        src: {}
    }
    all_data.append(source_data)

with open('data.json', 'w') as f:
    json.dump(all_data, f, indent=2)
This yields the following output:
[
  {
    "src1": {}
  },
  {
    "src2": {}
  }
]
Essentially, what I want to do is also iterate through that list of sources and add the table objects corresponding to each source. My desired output would look like the following:
[
  {
    "src1": {
      "table1": {},
      "table2": {},
      "table3": {}
    }
  },
  {
    "src2": {
      "table1": {},
      "table2": {}
    }
  }
]
Any assistance on how I can modify my code to also iterate through the tables column and append that to the respective source values would be greatly appreciated. Thanks in advance.

Is this what you're looking for?
data = [
    {k: v}
    for k, v in df.groupby('source')['tables'].agg(
        lambda x: {v: {} for v in x}).items()
]
with open('data.json', 'w') as f:
    json.dump(data, f, indent=2)
There are two layers to the answer here. The inner dict comprehension, applied through groupby and agg, groups the tables by source; the outer list comprehension then assembles those groups into a list of single-key objects in this specific format:
[
  {
    "src1": {
      "table1": {},
      "table2": {},
      "table3": {}
    }
  },
  {
    "src2": {
      "table1": {},
      "table2": {}
    }
  }
]
Example using .apply with arbitrary data
df['tables2'] = 'abc'

def func(g):
    return {x: y for x, y in zip(g['tables'], g['tables2'])}

data = [{k: v} for k, v in df.groupby('source').apply(func).items()]
data
# [{'src1': {'table1': 'abc', 'table2': 'abc', 'table3': 'abc'}},
#  {'src2': {'table1': 'abc', 'table2': 'abc'}}]
Note that this will not work with pandas 1.0 (probably because of a bug)
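On pandas versions where the agg-of-dicts call misbehaves, iterating over the groups directly is a dependable alternative. A minimal sketch, reconstructing the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "source": ["src1", "src1", "src1", "src2", "src2"],
    "tables": ["table1", "table2", "table3", "table1", "table2"],
})

# Each groupby iteration yields (group key, sub-frame); build the
# inner dict per source without relying on agg() returning dicts.
data = [
    {src: {t: {} for t in group["tables"]}}
    for src, group in df.groupby("source")
]
# data == [{'src1': {'table1': {}, 'table2': {}, 'table3': {}}},
#          {'src2': {'table1': {}, 'table2': {}}}]
```

json.dump(data, f, indent=2) then writes the desired nested JSON.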

Related

Flatten json dynamically using python, all nested keys in a column and value in another

I have a requirement to flatten JSON into keys and values using pyspark/python, so that all the nested keys go into one column and the corresponding values go into another column.
Note also that the input JSON is dynamic, so in the sample below there could be multiple subkeys and child keys. I'd appreciate any help with this.
sample json Input:
{
  "key1": {
    "subkey1": "1.1",
    "subkey2": "1.2"
  },
  "key2": {
    "subkey1": "2.1",
    "subkey2": "2.2",
    "subkey3": {"child3": {"subchild3": "2.3.3.3"}}
  },
  "key3": {
    "subkey1": "3.1",
    "subkey2": "3.2"
  }
}
Expected Output: to flatten only key2 from the nested keys

ID  key                            value
1   key2.subkey1                   2.1
2   key2.subkey2                   2.2
3   key2.subkey3.child3.subchild3  2.3.3.3
The following code provides you with all you need to accomplish what you want to achieve:
data = {
    "key1": {
        "subkey1": "1.1",
        "subkey2": "1.2"
    },
    "key2": {
        "subkey1": "2.1",
        "subkey2": "2.2",
        "subkey3": {
            "child3": {
                "subchild3": "2.3.3.3"
            }
        }
    }
}
print(data)

ID = 0
lstRows = []

def getTableRow(data, key):
    global lstRows, ID
    for k, v in data.items():
        #print('for k,v:', k, v)
        if isinstance(v, dict):
            #print('dict:', v)
            if key == '':
                getTableRow(v, k)
            else:
                getTableRow(v, key + '.' + k)
        else:
            #print('lstRows.append()')
            ID += 1
            lstRows.append({"ID": ID, "key": key + '.' + k, "value": v})

getTableRow(data, '')
print(lstRows)

dctTable = {"ID": [], "key": [], "value": []}
for dct in lstRows:
    dctTable["ID"].append(dct["ID"])
    dctTable["key"].append(dct["key"])
    dctTable["value"].append(dct["value"])
print(dctTable)

import pandas as pd
df = pd.DataFrame.from_dict(dctTable)
# df = pd.DataFrame(lstRows)   # equivalent to .from_dict() above
# df = pd.DataFrame(dctTable)  # equivalent to .from_dict() above
print(df)
prints
{'key1': {'subkey1': '1.1', 'subkey2': '1.2'}, 'key2': {'subkey1': '2.1', 'subkey2': '2.2', 'subkey3': {'child3': {'subchild3': '2.3.3.3'}}}}
[{'ID': 1, 'key': 'key1.subkey1', 'value': '1.1'}, {'ID': 2, 'key': 'key1.subkey2', 'value': '1.2'}, {'ID': 3, 'key': 'key2.subkey1', 'value': '2.1'}, {'ID': 4, 'key': 'key2.subkey2', 'value': '2.2'}, {'ID': 5, 'key': 'key2.subkey3.child3.subchild3', 'value': '2.3.3.3'}]
{'ID': [1, 2, 3, 4, 5], 'key': ['key1.subkey1', 'key1.subkey2', 'key2.subkey1', 'key2.subkey2', 'key2.subkey3.child3.subchild3'], 'value': ['1.1', '1.2', '2.1', '2.2', '2.3.3.3']}
ID key value
0 1 key1.subkey1 1.1
1 2 key1.subkey2 1.2
2 3 key2.subkey1 2.1
3 4 key2.subkey2 2.2
4 5 key2.subkey3.child3.subchild3 2.3.3.3
It uses a recursive function call to create the rows of the resulting table.
As I don't use pyspark, the table shown was created using Pandas.
See also "Flatten nested dictionaries, compressing keys" for a general and flexible way of flattening a nested dictionary that also handles values being lists.
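For reference, the same idea can be condensed into a small recursive helper; this `flatten` function is a hypothetical sketch, separate from the code above:

```python
def flatten(d, parent_key=""):
    """Flatten a nested dict into {'a.b.c': value} form."""
    items = {}
    for k, v in d.items():
        new_key = f"{parent_key}.{k}" if parent_key else k
        if isinstance(v, dict):
            # Recurse into sub-dicts, extending the dotted key path
            items.update(flatten(v, new_key))
        else:
            items[new_key] = v
    return items

sample = {"key2": {"subkey1": "2.1",
                   "subkey3": {"child3": {"subchild3": "2.3.3.3"}}}}
flatten(sample)
# {'key2.subkey1': '2.1', 'key2.subkey3.child3.subchild3': '2.3.3.3'}
```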
See below code for further instructions and explanations requested in the comments:
# ======================================================================
# You can read the json file content directly into dct_data using the
# Python json.load(fp) function ( fp=open(filename) ).
# This code starts with the json file content stored in str_data:
str_data = """
{ "key1": {
    "subkey1": "1.1",
    "subkey2": "1.2"
  },
  "key2": {
    "subkey1": "2.1",
    "subkey2": "2.2",
    "subkey3": {
      "child3": {
        "subchild3": "2.3.3.3"
      }
    }
  }
}"""
# print(str_data)

# Let's create a Python dictionary from the json data string:
import json
dct_data = json.loads(str_data)  # or = json.load(open(filename))
print(dct_data)

# Here the function for flattening the dictionary dct_data, returning
# a dictionary with the flattened dct_data content. lstRows defaults
# to None rather than [] because a mutable default argument would be
# shared across separate top-level calls:
def flattenNestedDictionary(dct_data, key='', lstRows=None):
    if lstRows is None:
        lstRows = []
    for k, v in dct_data.items():
        if isinstance(v, dict):
            if key == '':
                flattenNestedDictionary(v, k, lstRows)
            else:
                flattenNestedDictionary(v, key + '.' + k, lstRows)
        else:
            # Deriving ID from the rows collected so far keeps it
            # counting correctly across the recursive calls:
            lstRows.append({"ID": len(lstRows) + 1,
                            "key": key + '.' + k, "value": v})
    # Only the top-level call (key == '') assembles and returns the
    # flattened dictionary; lstRows now has all the required content:
    if key == '':
        print('lstRows:', lstRows)
        dct_flattened_json = {"ID": [], "key": [], "value": []}
        for dct in lstRows:
            dct_flattened_json["ID"].append(dct["ID"])
            dct_flattened_json["key"].append(dct["key"])
            dct_flattened_json["value"].append(dct["value"])
        print('#', dct_flattened_json)
        return dct_flattened_json

dct_flattened_json = flattenNestedDictionary(dct_data)

# Let's create a valid json data string out of the dictionary:
str_flattened_json = json.dumps(dct_flattened_json)
print('>', str_flattened_json)
# You can now write the str_flattened_json string to a file and load
# the new json file with flattened data into a spark DataFrame,
# or load the string str_flattened_json into a spark DataFrame directly.

Python: building complex nested lists within a dictionary

I am looking at building lists of lists within a dictionary from an Excel spreadsheet.
My spreadsheet looks like this:
source_item_id  target_item_id  find_sting  replace_sting
source_id1      target_id1      abcd1       efgh1
source_id1      target_id1      ijkl1       mnop1
source_id1      target_id2      abcd2       efgh2
source_id1      target_id2      ijkl2       mnop2
source_id2      target_id3      qrst        uvwx
source_id2      target_id3      yzab        cdef
source_id2      target_id4      ghij        klmn
source_id2      target_id4      opqr        stuv
My output dictionary should look like this:
{
  "source_id1": [
    {
      "target_id1": [
        {"find_string": "abcd1", "replace_string": "efgh1"},
        {"find_string": "ijkl1", "replace_string": "mnop1"}
      ]
    },
    {
      "target_id2": [
        {"find_string": "abcd2", "replace_string": "efgh2"},
        {"find_string": "ijkl2", "replace_string": "mnop2"}
      ]
    }
  ],
  "source_id2": [
    {
      "target_id3": [
        {"find_string": "qrst", "replace_string": "uvwx"},
        {"find_string": "yzab", "replace_string": "cdef"}
      ]
    },
    {
      "target_id4": [
        {"find_string": "ghij", "replace_string": "klmn"},
        {"find_string": "opqr", "replace_string": "stuv"}
      ]
    }
  ]
}
With the following code I only get the last values in each of the lists:
import xlrd

xls_path = r"C:\data\ItemContent.xlsx"
book = xlrd.open_workbook(xls_path)
sheet_find_replace = book.sheet_by_index(1)

find_replace_dict = dict()
for line in range(1, sheet_find_replace.nrows):
    source_item_id = sheet_find_replace.cell(line, 0).value
    target_item_id = sheet_find_replace.cell(line, 1).value
    find_string = sheet_find_replace.cell(line, 2).value
    replace_sting = sheet_find_replace.cell(line, 3).value
    find_replace_list = [{"find_string": find_string, "replace_sting": replace_sting}]
    find_replace_dict[source_item_id] = [target_item_id]
    find_replace_dict[source_item_id].append(find_replace_list)
print(find_replace_dict)
--> result
{
  "source_id1": ["target_id2", [{
    "find_string": "ijkl2",
    "replace_sting": "mnop2"
  }]],
  "source_id2": ["target_id4", [{
    "find_string": "opqr",
    "replace_sting": "stuv"
  }]]
}
Your problem is complicated by the fact that you have a list of single-key dictionaries as the value of your source ids, but you can follow a pattern of parsing each line for the relevant items and then using those to decide where to append or, alternatively, create new lists:
from typing import List, Tuple

def process_line(line) -> Tuple[str, str, dict]:
    source_item_id = sheet_find_replace.cell(line, 0).value
    target_item_id = sheet_find_replace.cell(line, 1).value
    find_string = sheet_find_replace.cell(line, 2).value
    replace_string = sheet_find_replace.cell(line, 3).value
    return source_item_id, target_item_id, {
        "find_string": find_string,
        "replace_string": replace_string
    }

def find_target(target: str, ls: List[dict]) -> int:
    # Find the index of the target id in the list
    for i in range(len(ls)):
        if ls[i].get(target):
            return i
    return -1  # Or some other marker

import xlrd

xls_path = r"C:\data\ItemContent.xlsx"
book = xlrd.open_workbook(xls_path)
sheet_find_replace = book.sheet_by_index(1)

result_dict = dict()
for line in range(1, sheet_find_replace.nrows):
    source, target, replacer = process_line(line)
    # You can check here that the above three are correct
    source_list = result_dict.get(source, [])  # Leverage the default value of the get function
    target_idx = find_target(target, source_list)
    target_dict = source_list[target_idx] if target_idx >= 0 else {}
    replace_list = target_dict.get(target, [])
    replace_list.append(replacer)
    target_dict[target] = replace_list
    if target_idx >= 0:
        source_list[target_idx] = target_dict
    else:
        source_list.append(target_dict)
    result_dict[source] = source_list
print(result_dict)
I would note that if source_id pointed to a dictionary rather than a list, this could be radically simplified, since we wouldn't need to search through the list for a potentially already-existing list item and then awkwardly replace or append as needed. If you can change this constraint (remember, you can always convert a dictionary to a list downstream), I might consider doing that.
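That simplification might look like the following sketch; the `rows` tuples are hypothetical stand-ins for the spreadsheet lines, since the point is the data structure rather than the xlrd calls:

```python
from collections import defaultdict

# Hypothetical rows standing in for (source, target, find, replace)
rows = [
    ("source_id1", "target_id1", "abcd1", "efgh1"),
    ("source_id1", "target_id1", "ijkl1", "mnop1"),
    ("source_id1", "target_id2", "abcd2", "efgh2"),
]

# Nested defaultdicts remove the need to search for existing entries
nested = defaultdict(lambda: defaultdict(list))
for source, target, find, replace in rows:
    nested[source][target].append(
        {"find_string": find, "replace_string": replace})

# Reshape into the list-of-single-key-dicts format downstream
result = {src: [{tgt: pairs} for tgt, pairs in targets.items()]
          for src, targets in nested.items()}
```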

unable to update JSON using python

I am trying to update transaction ID from the following json:
{
  "locationId": "5115",
  "transactions": [
    {
      "transactionId": "1603804404-5650",
      "source": "WEB"
    }
  ]
}
I have written the following code for this, but it does not update the transaction id; instead it inserts a new transaction id at the end of the block:
try:
    session = requests.Session()
    with open("sales.json", "r") as read_file:
        payload = json.load(read_file)
    payload["transactionId"] = random.randint(0, 5)
    with open("sales.json", "w") as read_file:
        json.dump(payload, read_file)
Output:-
{
  "locationId": "5115",
  "transactions": [
    {
      "transactionId": "1603804404-5650",
      "source": "WEB"
    }
  ],
  "transactionId": 1
}
Expected Output:
{
  "locationId": "5115",
  "transactions": [
    {
      "transactionId": "1",
      "source": "WEB"
    }
  ]
}
This would do it, but only in your specific case:
payload["transactions"][0]["transactionId"] = xxx
There should be error handling for cases where the "transactions" key is not in the dict, there are no records, or there is more than one.
Also, you will need to assign str(your_random_number), not the int, if you want the record to be of type string as the desired output suggests.
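Such error handling might be sketched as follows; `set_transaction_ids` is a hypothetical helper name, not part of the asker's code:

```python
import random

def set_transaction_ids(payload, new_id):
    """Update transactionId in each record, tolerating a missing or
    empty 'transactions' list; returns how many records were updated."""
    updated = 0
    for txn in payload.get("transactions", []):
        if "transactionId" in txn:
            txn["transactionId"] = str(new_id)  # str, per the desired output
            updated += 1
    return updated

payload = {
    "locationId": "5115",
    "transactions": [{"transactionId": "1603804404-5650", "source": "WEB"}],
}
set_transaction_ids(payload, random.randint(0, 5))
```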
If you just want to find the transactionId key and you don't know exactly where it may exist, you can do:
from collections.abc import Mapping

def update_key(key, new_value, jsondict):
    new_dict = {}
    for k, v in jsondict.items():
        if isinstance(v, Mapping):
            # Recursively traverse if the value is a dict
            new_dict[k] = update_key(key, new_value, v)
        elif isinstance(v, list):
            # Traverse through all values of the list, recursing
            # into any element that is a dict
            new_dict[k] = [update_key(key, new_value, innerv)
                           if isinstance(innerv, Mapping) else innerv
                           for innerv in v]
        elif k == key:
            # This is the key to replace with the new value
            new_dict[k] = new_value
        else:
            # Just a regular value, assign to the new dict
            new_dict[k] = v
    return new_dict
Given a dict-
{
  "locationId": "5115",
  "transactions": [
    {
      "transactionId": "1603804404-5650",
      "source": "WEB"
    }
  ]
}
You can do-
>>> update_key('transactionId', 5, d)
{'locationId': '5115', 'transactions': [{'transactionId': 5, 'source': 'WEB'}]}
Yes, because transactionId is inside the transactions node, so your code should be:
payload["transactions"][0]["transactionId"] = random.randint(0, 5)

Get json object with value with python for loop

When I use:
for reports in raw_data:
    for names in reports["names"]:
        report_name = json.dumps(names).strip('"')
        report_names.append(report_name)
I get the key/object name: 'report1', ...
When I use:
for reports in raw_data:
    for names in reports["names"].values():
        report_name = json.dumps(names).strip('"')
        report_names.append(report_name)
I get the value of the object: 'name1', ...
How do I get the object and value together, for example: 'report1': 'name1', ...
The json:
[
  {
    "names": {
      "report1": "name1",
      "report2": "name2"
    }
  },
  {
    "names": {
      "report3": "name3",
      "report4": "name4"
    }
  }
]
You need to loop over each dictionary in the object, then extract each key: value pair from items():
data = [
    {
        "names": {
            "report1": "name1",
            "report2": "name2"
        }
    },
    {
        "names": {
            "report3": "name3",
            "report4": "name4"
        }
    }
]

for d in data:
    for k, v in d["names"].items():
        print(k, v)
Result:
report1 name1
report2 name2
report3 name3
report4 name4
Or, if you just want to print out the tuple pairs:
for d in data:
    for pair in d["names"].items():
        print(pair)
# ('report1', 'name1')
# ('report2', 'name2')
# ('report3', 'name3')
# ('report4', 'name4')
If you want all of the pairs in a list, use a list comprehension:
[pair for d in data for pair in d["names"].items()]
# [('report1', 'name1'), ('report2', 'name2'), ('report3', 'name3'), ('report4', 'name4')]
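And if the goal is literally the 'report1': 'name1' pairing from the question, the same items() traversal can format each pair as a string; a small sketch with the sample data:

```python
data = [
    {"names": {"report1": "name1", "report2": "name2"}},
    {"names": {"report3": "name3", "report4": "name4"}},
]

# Format each key/value pair as "key: value"
report_names = [f"{k}: {v}" for d in data for k, v in d["names"].items()]
# ['report1: name1', 'report2: name2', 'report3: name3', 'report4: name4']
```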
Try something like this:
import json

with open(r'jsonfile.json', 'r') as f:
    qe = json.load(f)

for item in qe:
    for k, v in item["names"].items():
        if v == 'name1':
            print(k, v)

Merge list of dict into one dict in python

I have the following list of dicts in python.
[
  {
    "US": {
      "Intial0": 12.515
    }
  },
  {
    "GE": {
      "Intial0": 11.861
    }
  },
  {
    "US": {
      "Final0": 81.159
    }
  },
  {
    "GE": {
      "Final0": 12.9835
    }
  }
]
I want the final list of dicts as
[{"US": {"Initial0":12.515, "Final0": 81.159}}, {"GE": {"Initial0": 11.861, "Final0": 12.9835}}]
I have been struggling with this for quite some time. Any help?
You could use Python's defaultdict as follows:
from collections import defaultdict

lod = [
    {"US": {"Intial0": 12.515}},
    {"GE": {"Intial0": 11.861}},
    {"US": {"Final0": 81.159}},
    {"GE": {"Final0": 12.9835}}]

output = defaultdict(dict)
for d in lod:
    key = next(iter(d))  # the single key of each one-entry dict
    output[key].update(d[key])
print(output)
For the data given, this would display the following:
defaultdict(<class 'dict'>, {'US': {'Intial0': 12.515, 'Final0': 81.159}, 'GE': {'Intial0': 11.861, 'Final0': 12.9835}})
Or you could convert it back to a standard Python dictionary with print(dict(output)), giving:
{'US': {'Intial0': 12.515, 'Final0': 81.159}, 'GE': {'Intial0': 11.861, 'Final0': 12.9835}}
list1 = [{"US": {"Intial0": 12.515}}, {"GE": {"Intial0": 11.861}},
         {"US": {"Final0": 81.159}}, {"GE": {"Final0": 12.9835}}]

dict_US = {}
dict_GE = {}
for dict_x in list1:
    if 'US' in dict_x:
        dict_US.update(dict_x["US"])
    if 'GE' in dict_x:
        dict_GE.update(dict_x["GE"])
list2 = [{"US": dict_US}, {"GE": dict_GE}]
print(list2)
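Since the question asks for a list of single-key dicts rather than one merged dict, the two answers above can also be generalized without hard-coding the country codes; a sketch:

```python
from collections import defaultdict

lod = [
    {"US": {"Intial0": 12.515}},
    {"GE": {"Intial0": 11.861}},
    {"US": {"Final0": 81.159}},
    {"GE": {"Final0": 12.9835}},
]

# Merge all inner dicts that share the same outer key
merged = defaultdict(dict)
for d in lod:
    for country, values in d.items():
        merged[country].update(values)

# Reshape into the requested list of single-key dicts
result = [{country: values} for country, values in merged.items()]
# [{'US': {'Intial0': 12.515, 'Final0': 81.159}},
#  {'GE': {'Intial0': 11.861, 'Final0': 12.9835}}]
```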
