Merge deep JSON files in Python


I have two JSON files. One contains a fully defined object with multiple levels of nesting; the other contains a stripped-back version of the same object that lists only the elements that need to be changed.
File 1 example
{
"toplevel": {
"value": {
"settings": [
{
"name": "A Default Value",
"region": "US",
"inner": {
"name": "Another Default",
"setting": "help"
}
}
]
}
}
}
File 2 example
{
"toplevel": {
"value": {
"settings": [
{
"name": "A Real Value",
"inner": {
"name": "Another Real Value",
}
}
]
}
}
}
I want to merge the updates from file 2 into file 1.
My output should look like this:
{
"toplevel": {
"value": {
"settings": [
{
"name": "A Real Value",
"region": "US",
"inner": {
"name": "Another Real Value",
"setting": "help"
}
}
]
}
}
}
so far I've tried
f1 = json_load(file1)
f2 = json_load(file2)
f1['toplevel']['value']['settings'][0].update(f2['toplevel']['value']['settings'][0].items())
It works perfectly for the top-level items, but it overwrites the whole "inner" object, removing the "setting" key inside it.
Is there a way to traverse the whole tree and replace only the non-dictionary values? I don't have access to external libraries other than json and collections (for the ordered dict)

It depends slightly on what you want
Solution 1
If you simply want to replace all values by the new dictionary, you can use the following options:
result = {**file_1, **file_2}
from pprint import pprint
pprint(result)
This will result in:
{'toplevel': {'value': {'settings': [{'inner': {'name': 'Another Real Value'},
'name': 'A Real Value'}]}}}
Alternatively you can use
file_1.update(file_2)
pprint(file_1)
Which will lead to the same outcome, but will update file_1 in place.
Solution 2
If you only want to update the specific key in the nesting, and leave all other values intact, you can do this using recursion. In your example you are using dict, list and str values. So I will build the recursion using the same types.
def update_dict(original, update):
    for key, value in update.items():
        # Add new key values
        if key not in original:
            original[key] = update[key]
            continue
        # Update the old key values with the new key values
        if key in original:
            if isinstance(value, dict):
                update_dict(original[key], update[key])
            if isinstance(value, list):
                update_list(original[key], update[key])
            if isinstance(value, (str, int, float)):
                original[key] = update[key]
    return original

def update_list(original, update):
    # Make sure the order is equal, otherwise it is hard to compare the items.
    assert len(original) == len(update), "Can only handle equal length lists."
    for idx, (val_original, val_update) in enumerate(zip(original, update)):
        if not isinstance(val_original, type(val_update)):
            raise ValueError(f"Different types! {type(val_original)}, {type(val_update)}")
        if isinstance(val_original, dict):
            original[idx] = update_dict(original[idx], update[idx])
        if isinstance(val_original, (tuple, list)):
            original[idx] = update_list(original[idx], update[idx])
        if isinstance(val_original, (str, int, float)):
            original[idx] = val_update
    return original
The above might be a bit harder to understand, but I will try to explain it.
There are two methods, one which will merge two dictionaries and one that tries to merge two lists.
Merging dictionaries
In order to merge the two dictionaries I go over all the keys and values of the update dictionary, because this will probably be the smaller of the two.
The first block puts new keys in the original dictionary, this is updating values that weren't in the original dictionary at the start.
The second block is updating the nested values. There I distinguish three cases:
If the value is another dict, run the dictionary merge again, but one level deeper.
If the value is a list (or tuple), run the list merge function.
If the value is a str (or int, float), replace the original value with the updated value.
Merging lists
This is a bit trickier than dictionaries, because list items do not have keys that I can match on. Therefore I have to make a heavy assumption: the update list always has the same length and element order as the original list. See the limitations for how to handle lists with more than one element.
Since the lists are of the same length, I can assume that the indices of the lists match. For each pair of values at the same index, I do the following:
Make sure that the value types are the same; otherwise I throw an error, since I am not sure how to handle that case.
If the values are dictionaries, use the dictionary merge.
If the values are lists (or tuples), use the list merge.
If the values are str (or int, float), override the original in place.
Result
using:
from pprint import pprint
pprint(update_dict(file_1, file_2))
The final result will be:
{'toplevel': {'value': {'settings': [{'inner': {'name': 'Another Real Value',
'setting': 'help'},
'name': 'A Real Value',
'region': 'US'}]}}}
Note that, in contrast with the first solution, the values 'setting': 'help' and 'region': 'US' are still in the original dictionary.
Limitations
Due to the same length constraint, if you do not want to update an element in the list you have to pass the same element type, but empty.
Example on how to ignore a list update:
... {'settings': [
    {},                      # do not update the first element.
    {'name': 'A new name'}   # update second element.
]
}
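For reference, here is a compact end-to-end sketch of the same recursive idea that uses only the json module. The name deep_merge is hypothetical, and it makes one assumption that differs from the answer above: list items are matched by position, and an update list may be shorter than the original (extra original items are kept, extra update items are appended).

```python
import json


def deep_merge(original, update):
    """Recursively merge `update` into `original` in place and return the result."""
    if isinstance(original, dict) and isinstance(update, dict):
        for key, value in update.items():
            if key in original:
                original[key] = deep_merge(original[key], value)
            else:
                original[key] = value
        return original
    if isinstance(original, list) and isinstance(update, list):
        # Assumption: list items correspond to each other by position.
        for idx, value in enumerate(update):
            if idx < len(original):
                original[idx] = deep_merge(original[idx], value)
            else:
                original.append(value)
        return original
    # Scalars, or containers of different types: the update wins.
    return update


file_1 = json.loads("""
{"toplevel": {"value": {"settings": [
    {"name": "A Default Value", "region": "US",
     "inner": {"name": "Another Default", "setting": "help"}}
]}}}
""")
file_2 = json.loads("""
{"toplevel": {"value": {"settings": [
    {"name": "A Real Value", "inner": {"name": "Another Real Value"}}
]}}}
""")

merged = deep_merge(file_1, file_2)
print(json.dumps(merged, indent=2))
```

As with the limitation described above, an empty dict in the update list leaves the element at that position unchanged.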

Related

Looking up a value in a nested dictionary to get a value in a separate dictionary but same parent dictionary

The response that comes back from an API request is a list of nested dictionaries, for example:
{
"data": [
{
"3": {
"value": 1
},
"6": {
"value": "Conversion"
},
"7": {
"value": "HVAC"
}
},
I can easily get past the first dictionary using r['data']. At this point, each list item is a record in a database. r['data'][0] gives me a dictionary that maps each field (such as '3') to a dictionary holding its value (such as {'value': 'Conversion'}).
I want to be able to look up a value like 'Conversion' and have it tell me what the value is for field '3'. Is there any way to do this using Python?

Your description doesn't quite fit with reality. Let's have a complete structure:
r = {
"data": [
{
"3": {
"value": 1
},
"6": {
"value": "Conversion"
},
"7": {
"value": "HVAC"
}
}
]
}
r is a dictionary. It contains a single key ('data') whose associated value is a list which, in this example, contains one dictionary. That dictionary has 3 keys - "3", "6" and "7". Each of those keys has a value which itself is a dictionary comprised of a single key ('value') and, obviously, an associated value.
You can assert as follows:
assert r['data'][0]['6']['value'] == 'Conversion'
Hopefully that shows how you can access the lower level value(s)
What's unclear from your question is why would you be searching for 'Conversion' when you want the value from key '3' which would be:
r['data'][0]['3']['value']
EDIT:
def get_value(r, from_key, ref_key, value):
    for d in r['data']:
        if d.get(from_key, {}).get('value') == value:
            return d.get(ref_key, {}).get('value')

print(get_value(r, '6', '3', 'Conversion'))
Doing it this way offers the flexibility of specifying the relevant keys and value

I have to make an assumption that what you are trying to say is that you want the index based on the value.
r = {...} # your data
def lookup(val, r_data):
    for i in range(len(r_data['data'])):  # loops through list based on index
        for k, v in r_data['data'][i].items():  # loops dict with key/value
            if v['value'] == val:  # checks if value matches needle
                return "[{}][{}]".format(i, k)
This returns '[0][6]'. I don't know if your data is supposed to be formatted that way... a list within 'data' and then only one dict inside. But, given this format, this will give you the index.

Iterate through a nested python dict

I have a JSON file that looks like this:
{
  "returnCode": 200,
  "message": "OK",
  "people": [
    {
      "details": {
        "first": "joe",
        "last": "doe",
        "id": 1234567
      },
      "otheDetails": {
        "employeeNum": "0000111222",
        "res": "USA",
        "address": "123 main street"
      },
      "moreDetails": {
        "family": "yes",
        "siblings": "no",
        "home": "USA"
      }
    },
    {
      "details": {
        "first": "jane",
        "last": "doe",
        "id": 987654321
      },
      "otheDetails": {
        "employeeNum": "222333444",
        "res": "UK",
        "address": "321 nottingham dr"
      },
      "moreDetails": {
        "family": "yes",
        "siblings": "yes",
        "home": "UK"
      }
    }
  ]
}
This shows two entries, but really there are hundreds or more. I do not know the number of entries at the time the code is run.
My goal is to iterate through each entry and get the 'id' under "details". I load the JSON into a python dict named 'data' and am able to get the first 'id' by:
data['people'][0]['details']['id']
I can then get the second 'id' by incrementing the '0' to '1'. I know I can set i = 0 and then increment i, but since I do not know the number of entries, this does not work. Is there a better way?

Less Pythonic than a list comprehension, but a simple for loop will work here.
You can first calculate the number of people in the people list and then loop over the indices, pulling out each id at each iteration:
id_list = []
for i in range(len(data['people'])):
    id_list.append(data['people'][i]['details']['id'])

You can use the dict.get method in a list comprehension to avoid getting a KeyError on id. This way, entries without an id produce None:
ids = [dct['details'].get('id') for dct in data['people']]
If you still get a KeyError, that probably means some dicts in data['people'] don't have a details key. In that case, it might be better to wrap the lookup in try/except. You may also want to identify which dicts lack the details key; the error_dct list collects them (uncomment the two lines below).
ids = []
#error_dct = []
for dct in data['people']:
    try:
        ids.append(dct['details']['id'])
    except KeyError:
        ids.append(None)
        #error_dct.append(dct)
Output:
1234567
987654321
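Because data['people'] is a list, it can also be iterated directly, so the number of entries never needs to be known in advance. The sketch below uses a small made-up `data` dict (the third entry, without "details", is hypothetical) and chains dict.get so that a missing "details" or "id" yields None instead of raising.

```python
data = {
    "people": [
        {"details": {"first": "joe", "last": "doe", "id": 1234567}},
        {"details": {"first": "jane", "last": "doe", "id": 987654321}},
        {"otheDetails": {"res": "USA"}},  # hypothetical entry with no "details"
    ]
}

# Iterate over the list items themselves rather than over indices.
ids = [person.get("details", {}).get("id") for person in data["people"]]
print(ids)  # [1234567, 987654321, None]

# enumerate() supplies the position when it is needed, e.g. to report
# which entries are incomplete.
missing = [i for i, person in enumerate(data["people"]) if "details" not in person]
print(missing)  # [2]
```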

Assigning python dictionary's nested value without mentioning the immediate key

I have dozens of lines that update values in a nested dictionary like this:
dictionary["parent-key"]["child-key"] = [whatever]
Each line uses a different parent-key, but always the same child-key.
Also, the [whatever] part is written differently on each line, so simple recursion isn't an option here. (Although one could build a separate list of values and assign them to each dictionary entry later on.)
Is there a shorter way to do the same thing and avoid the duplicated part of the code?
I'd be happy if it could be written something like this:
update_child_val("parent-key") = [whatever]
By the way, the [whatever] part that I'm assigning will be long and complicated code, so I'd rather not use a function such as this:
def update_child_val(parent_key, child_val):
    dictionary[parent_key]["child-key"] = child_val

update_child_val("parent-key", [whatever])
Specific Use Case:
I'm making ETL to convert database's table into CSV, and this is the part of the process. I wrote some bits of example below.
single_item_template = {
# Unique values will be assigned in place of `None` later
"name": {
"id": "name",
"name": "Product Name",
"val": None
},
"price": {
"id": "price",
"name": "Product Price (pre-tax)",
"val": None
},
"tax": {
"id": "tax",
"name": "Sales Tax",
"val": 10
},
"another column id": {
"id": "another column id",
"name": "another 'name' for this column",
"val": "another 'val' for this column"
},
..
}
A separate part of the code assigns values to a copy of single_item_template for each row of the source database table:
for table_row in table:
    item = Item(table_row)
The Item class returns a copy of single_item_template with updated values assigned to item[column]["val"]. Each val needs its own processing in a setter within the class, such as
self._item["name"]["val"] = table_row["prod_name"].replace('_', ' ')
self._item["price"]["val"] = int(table_row["price_0"].replace(',', ''))
..
etcetera, etcetera.
In the example above, self._item can easily be shortened by assigning it to a variable, but I was wondering whether I could also save the trailing ["val"].
(..or putting the last logic part in a string and eval-ing it later, which I really, really do not want to do.)
(So basically all I'm saying is that I'm too lazy to type out ["val"], although I don't really mind doing it. I'm still curious whether such a thing exists, here or in programming in general.)

While you can't get away from doing the work, you can abstract it away in a couple of different ways.
Let's say you have a mapping of parent IDs to intended value:
values = {
'name': None,
'price': None,
'tax': 10,
'[another column id]': "[another 'val' for this column]"
}
Setting all of these at once is only two lines of code:
for parent, val in values.items():
dictionary[parent]['val'] = val
Unfortunately there isn't an easy or legible way to transform this into a dict comprehension. You can easily put this into a utility function that will turn it into a one-line call:
def set_children(d, parents, values, child='val'):
    # Pair each parent key with its value and set the child entry.
    for parent, value in zip(parents, values):
        d[parent][child] = value

set_children(dictionary, values.keys(), values.values())
In this case, your values mapping will encode the transformations you want to perform:
values = {
'name': table_row["prod_name"].replace('_', ' '),
'price': int(table_row["price_0"].replace(',', '')),
...
}
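If the goal is mainly to avoid typing the child key on every line, another option is a thin wrapper whose item assignment is forwarded to the nested entry. This is only a sketch: the ChildView class, the sample item and the table_row values are all hypothetical, and the wrapper writes into the wrapped dictionary rather than into a copy.

```python
class ChildView:
    """Hypothetical helper: view[parent] reads or writes d[parent][child]."""

    def __init__(self, d, child="val"):
        self._d = d
        self._child = child

    def __getitem__(self, parent):
        return self._d[parent][self._child]

    def __setitem__(self, parent, value):
        self._d[parent][self._child] = value


item = {
    "name": {"id": "name", "name": "Product Name", "val": None},
    "price": {"id": "price", "name": "Product Price (pre-tax)", "val": None},
}
table_row = {"prod_name": "garden_hose", "price_0": "1,250"}  # made-up row

val = ChildView(item)
# Each right-hand side can still be arbitrary, line-specific code.
val["name"] = table_row["prod_name"].replace("_", " ")
val["price"] = int(table_row["price_0"].replace(",", ""))

print(item["name"]["val"], item["price"]["val"])  # garden hose 1250
```

Python does not allow assigning to a function call, so update_child_val("parent-key") = ... cannot work as written; subscript assignment through __setitem__ is the closest equivalent.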

How do I extract nested json names and convert to dot notation string list in python?

I need to pull data in from elasticsearch, do some cleaning/munging and export as table/rds.
To do this I have a long list of variable names required to pull from elasticsearch. This list of variables is required for the pull, but the issue is that not all fields may be represented within a given pull, meaning that I need to add the fields after the fact. I can do this using a schema (in nested json format) of the same list of variable names.
To try and [slightly] future proof this work I would ideally like to only maintain the list/schema in one place, and convert from list to schema (or vice-versa).
Is there a way to do this in python? Please see example below of input and desired output.
Small part of schema:
{
"_source": {
"filters": {"group": {"filter_value": 0}},
"user": {
"email": "",
"uid": ""
},
"status": {
"date": "",
"active": True
}
}
}
Desired string list output:
[
"_source.filters.group.filter_value",
"_source.user.email",
"_source.user.uid",
"_source.status.date",
"_source.status.active"
]
I believe that schema -> list might be an easier transformation than list -> schema, though am happy for it to be the other way round if that is simpler (though need to ensure the schema variables have the correct type, i.e. str, bool, float).
I have explored the following answers which come close, but I am struggling to understand since none appear to be in python:
Convert dot notation to JSON
Convert string with dot notation to JSON

Where d is your json as a dictionary,
def full_search(d):
    arr = []
    def dfs(d, curr):
        if not type(d) == dict or curr[-1] not in d or type(d[curr[-1]]) != dict:
            arr.append(curr)
            return
        for key in d[curr[-1]].keys():
            dfs(d[curr[-1]], curr + [key])
    for key in d.keys():
        dfs(d, [key])
    return ['.'.join(x) for x in arr]
If d is in json form, use
import json
res = full_search(json.loads(d))
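The question also asks about the opposite direction (list to schema). A plain list of dotted names does not carry the default values or types, so the sketch below keeps a mapping from dotted path to leaf value instead; its keys are the string list, and the nested schema can be rebuilt from it. The function names flatten and unflatten are hypothetical, and the sketch assumes that no key contains a dot.

```python
def flatten(schema, prefix=""):
    """Return a {dotted_path: leaf_value} mapping for a nested dict."""
    flat = {}
    for key, value in schema.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            flat.update(flatten(value, path))
        else:
            flat[path] = value  # leaf (an empty dict also counts as a leaf)
    return flat


def unflatten(flat):
    """Rebuild the nested dict from a {dotted_path: leaf_value} mapping."""
    schema = {}
    for path, value in flat.items():
        *parents, leaf = path.split(".")
        node = schema
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return schema


schema = {
    "_source": {
        "filters": {"group": {"filter_value": 0}},
        "user": {"email": "", "uid": ""},
        "status": {"date": "", "active": True},
    }
}

flat = flatten(schema)
print(list(flat))          # the dotted names, in schema order
print(unflatten(flat) == schema)
```

Maintaining only the flat mapping (or only the schema) and deriving the other from it keeps the variable list in one place.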

How to properly keep structure when removing keys in JSON using python?

I'm using this as a reference: Elegant way to remove fields from nested dictionaries
I have a large number of JSON-formatted data here and we've determined a list of unnecessary keys (and all their underlying values) that we can remove.
I'm a bit new to working with JSON and Python specifically (mostly did sysadmin work) and initially thought it was just a plain dictionary of dictionaries. While some of the data looks like that, several more pieces of data consists of dictionaries of lists, which can furthermore contain more lists or dictionaries with no specific pattern.
The idea is to keep the data identical EXCEPT for the specified keys and associated values.
Test Data:
to_be_removed = ['leecher_here']
easy_modo = {
    'hello_wold': 'konnichiwa sekai',
    'leeching_forbidden': 'wanpan kinshi',
    'leecher_here': 'nushiyowa'
}
lunatic_modo = {
    'hello_wold': {
        'leecher_here': 'nushiyowa', 'goodbye_world': 'aokigahara'
    },
    'leeching_forbidden': 'wanpan kinshi',
    'leecher_here': 'nushiyowa',
    'something_inside': {
        'hello_wold': 'konnichiwa sekai',
        'leeching_forbidden': 'wanpan kinshi',
        'leecher_here': 'nushiyowa'
    },
    'list_o_dicts': [
        {
            'hello_wold': 'konnichiwa sekai',
            'leeching_forbidden': 'wanpan kinshi',
            'leecher_here': 'nushiyowa'
        }
    ]
}
Obviously, the original question posted there isn't accounting for lists.
My code, modified appropriately to work with my requirements.
from copy import deepcopy
def remove_key(json,trash):
    """
    <snip>
    """
    keys_set = set(trash)
    modified_dict = {}
    if isinstance(json,dict):
        for key, value in json.items():
            if key not in keys_set:
                if isinstance(value, dict):
                    modified_dict[key] = remove_key(value, keys_set)
                elif isinstance(value,list):
                    for ele in value:
                        modified_dict[key] = remove_key(ele,trash)
                else:
                    modified_dict[key] = deepcopy(value)
    return modified_dict
Something is messing with the structure, because the function doesn't pass the test I wrote, in which the expected data is exactly the same minus the removed keys. The test shows that the keys are removed properly, but where the data should contain a list of dictionaries, only a single dictionary is returned, which will have unfortunate implications down the line.
I'm sure it's because the function returns a dictionary, but I don't know how to proceed from here in order to maintain the structure.
At this point, I need help finding what I have overlooked.

When you go through your json file, you only need to determine whether it is a list, a dict or neither. Here is a recursive way to modify your input dict in place:
def remove_key(d, trash=None):
    if not trash:
        trash = []
    if isinstance(d, dict):
        keys = [k for k in d]
        for key in keys:
            if any(key == s for s in trash):
                del d[key]
        for value in d.values():
            remove_key(value, trash)
    elif isinstance(d, list):
        for value in d:
            remove_key(value, trash)

remove_key(lunatic_modo, to_be_removed)
remove_key(easy_modo, to_be_removed)
Result:
{
"hello_wold": {
"goodbye_world": "aokigahara"
},
"leeching_forbidden": "wanpan kinshi",
"something_inside": {
"hello_wold": "konnichiwa sekai",
"leeching_forbidden": "wanpan kinshi"
},
"list_o_dicts": [
{
"hello_wold": "konnichiwa sekai",
"leeching_forbidden": "wanpan kinshi"
}
]
}
{
"hello_wold": "konnichiwa sekai",
"leeching_forbidden": "wanpan kinshi"
}
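If the original data must stay untouched, as in the question's copy-based attempt, the same recursion can build a new structure instead of deleting in place. This is a sketch under that assumption; the name without_keys and the shortened sample are hypothetical. Lists are rebuilt as lists, which is the part the question's version lost.

```python
def without_keys(data, trash):
    """Return a copy of `data` with every key in `trash` removed at any depth."""
    trash = set(trash)
    if isinstance(data, dict):
        return {k: without_keys(v, trash) for k, v in data.items() if k not in trash}
    if isinstance(data, list):
        # Rebuild the list so that a list of dicts stays a list of dicts.
        return [without_keys(item, trash) for item in data]
    return data  # scalars (str, int, float, bool, None) are returned as-is


sample = {
    "hello_wold": {"leecher_here": "nushiyowa", "goodbye_world": "aokigahara"},
    "leecher_here": "nushiyowa",
    "list_o_dicts": [
        {"hello_wold": "konnichiwa sekai", "leecher_here": "nushiyowa"}
    ],
}

cleaned = without_keys(sample, ["leecher_here"])
print(cleaned)
print("leecher_here" in sample)  # the input is left unmodified
```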
