I'm trying to parse this JSON object into multiple events, because that is the expected input for an ETL tool. I know this is straightforward with loops, if statements, and explicitly defined search fields for each event. That approach is not feasible here because I have multiple heavily nested JSON objects, and I would prefer to let recursion in Python handle the heavy lifting. The following is a sample object, which consists of strings, lists, and dicts (this covers most use cases in my data).
{
"event_name": "restaurants",
"properties": {
"_id": "5a9909384309cf90b5739342",
"name": "Mangal Kebab Turkish Restaurant",
"restaurant_id": "41009112",
"borough": "Queens",
"cuisine": "Turkish",
"address": {
"building": "4620",
"coord": {
"0": -73.9180155,
"1": 40.7427742
},
"street": "Queens Boulevard",
"zipcode": "11104"
},
"grades": [
{
"date": 1414540800000,
"grade": "A",
"score": 12
},
{
"date": 1397692800000,
"grade": "A",
"score": 10
},
{
"date": 1381276800000,
"grade": "A",
"score": 12
}
]
}
}
I want to convert it to the following list of dictionaries:
[
{
"event_name": "restaurants",
"properties": {
"restaurant_id": "41009112",
"name": "Mangal Kebab Turkish Restaurant",
"cuisine": "Turkish",
"_id": "5a9909384309cf90b5739342",
"borough": "Queens"
}
},
{
"event_name": "restaurant_address",
"properties": {
"zipcode": "11104",
"ref_id": "41009112",
"street": "Queens Boulevard",
"building": "4620"
}
},
{
"event_name": "restaurant_address_coord",
"ref_id": "41009112",
"0": -73.9180155,
"1": 40.7427742
},
{
"event_name": "restaurant_grades",
"properties": {
"date": 1414540800000,
"ref_id": "41009112",
"score": 12,
"grade": "A",
"index": "0"
}
},
{
"event_name": "restaurant_grades",
"properties": {
"date": 1397692800000,
"ref_id": "41009112",
"score": 10,
"grade": "A",
"index": "1"
}
},
{
"event_name": "restaurant_grades",
"properties": {
"date": 1381276800000,
"ref_id": "41009112",
"score": 12,
"grade": "A",
"index": "2"
}
}
]
Most importantly, these events will be broken up into independent structured tables. To join them, we need primary keys / unique identifiers, so each nested dictionary should carry its parent's ID as a ref_id field. In this case, ref_id = restaurant_id from the parent dictionary.
Most of the examples on the internet flatten the whole object and normalize it into a DataFrame, but to use this ETL tool to its full potential it would be ideal to solve this with recursion and output a list of dictionaries.
This is what one might call a brute force method. Create a translator function to move each item into the correct part of the new structure (like a schema).
# input dict
d = {
"event_name": "demo",
"properties": {
"_id": "5a9909384309cf90b5739342",
"name": "Mangal Kebab Turkish Restaurant",
"restaurant_id": "41009112",
"borough": "Queens",
"cuisine": "Turkish",
"address": {
"building": "4620",
"coord": {
"0": -73.9180155,
"1": 40.7427742
},
"street": "Queens Boulevard",
"zipcode": "11104"
},
"grades": [
{
"date": 1414540800000,
"grade": "A",
"score": 12
},
{
"date": 1397692800000,
"grade": "A",
"score": 10
},
{
"date": 1381276800000,
"grade": "A",
"score": 12
}
]
}
}
def convert_structure(d: dict):
    ''' function to convert to new structure'''
    # the new dict
    e = {}
    e['event_name'] = d['event_name']
    e['properties'] = {}
    e['properties']['restaurant_id'] = d['properties']['restaurant_id']
    # and so forth...
    # keep building the new structure / template
    # return a list
    return [e]
# run & print
x = convert_structure(d)
print(x)
The result (for the part done) looks like this:
[{'event_name': 'demo', 'properties': {'restaurant_id': '41009112'}}]
If a pattern is identified, then the above could be improved...
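One pattern that does show up in the sample is that every nested dict, and every dict inside a list, becomes its own event that carries the parent's ID as ref_id. The following is a minimal recursive sketch of that idea, not a definitive implementation. The function name split_events, the id_key parameter, and the rule of naming child events as parent name plus underscore plus key are all assumptions made for illustration, and the sample is a trimmed version of the object from the question.

```python
def split_events(name, props, ref_id=None, id_key="restaurant_id"):
    """Split a nested dict into a list of flat events, parent first."""
    # A node that carries its own id passes it down; otherwise the
    # inherited ref_id is passed on unchanged.
    own_id = props.get(id_key, ref_id)
    flat, children = {}, []
    for key, value in props.items():
        child_name = f"{name}_{key}"  # assumed naming rule
        if isinstance(value, dict):
            children.extend(split_events(child_name, value, own_id, id_key))
        elif isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
            for index, item in enumerate(value):
                sub = split_events(child_name, item, own_id, id_key)
                # sub[0] is the event built from the list item itself
                sub[0]["properties"]["index"] = str(index)
                children.extend(sub)
        else:
            flat[key] = value  # scalars (and lists of scalars) stay in place
    if ref_id is not None:
        flat["ref_id"] = ref_id
    return [{"event_name": name, "properties": flat}] + children


# Trimmed sample; the event name "restaurant" is chosen for the illustration.
sample = {
    "event_name": "restaurant",
    "properties": {
        "restaurant_id": "41009112",
        "name": "Mangal Kebab Turkish Restaurant",
        "address": {"building": "4620", "coord": {"0": -73.9180155, "1": 40.7427742}},
        "grades": [{"grade": "A", "score": 12}, {"grade": "A", "score": 10}],
    },
}

events = split_events(sample["event_name"], sample["properties"])
for event in events:
    print(event)
```

Unlike the expected output in the question, this sketch wraps every event, including the coord one, in a properties dict; adjust that if the ETL tool expects the flatter shape.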
I have two objects:
d1 = [ { "id": 3, "name": "test", "components": [ { "id": 1, "name": "test" }, { "id": 2, "name": "test2" } ] } ]
d2 = [ { "id": 4, "name": "test", "components": [ { "id": 2, "name": "test" }, { "id": 3, "name": "test2" } ] } ]
As you can see, everything stays the same except the id property, which changes both on the root object and inside components.
I'm using DeepDiff to compare d1 and d2 and want to ignore the id properties in the comparison. However, I'm not sure how to achieve this. I tried the following, which didn't seem to work.
excluded_paths = "root[\d+\]['id']"
diff = DeepDiff(d1, d2, exclude_paths=excluded_paths)
You can try using exclude_obj_callback:
from deepdiff import DeepDiff
def exclude_obj_callback(obj, path):
    # skip every object whose path contains the substring "id"
    return "id" in path
d1 = [ { "id": 3, "name": "test", "components": [ { "id": 1, "name": "test" }, { "id": 2, "name": "test2" } ] } ]
d2 = [ { "id": 4, "name": "test", "components": [ { "id": 2, "name": "test" }, { "id": 3, "name": "test2" } ] } ]
print(DeepDiff(d1, d2, exclude_obj_callback=exclude_obj_callback))
The callback returns True for every object whose path contains the string "id", and DeepDiff excludes those objects from the comparison. Be careful with this, because it is a substring match: any key that merely contains "id" (for example "width") is excluded too. A way around this is to use less generic key names, for example "component_id", and match on those.
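Another way to avoid the substring problem is to remove the id keys before comparing at all. The sketch below uses only the standard library; strip_keys is a hypothetical helper, not part of DeepDiff, and it matches whole key names rather than substrings. The cleaned copies could then be passed to DeepDiff in place of the originals. DeepDiff's documentation also describes an exclude_regex_paths parameter, which may be worth checking as an alternative.

```python
def strip_keys(obj, keys=("id",)):
    """Return a copy of obj with the given dict keys removed at every depth."""
    if isinstance(obj, dict):
        return {k: strip_keys(v, keys) for k, v in obj.items() if k not in keys}
    if isinstance(obj, list):
        return [strip_keys(item, keys) for item in obj]
    return obj  # scalars are returned unchanged


d1 = [{"id": 3, "name": "test",
       "components": [{"id": 1, "name": "test"}, {"id": 2, "name": "test2"}]}]
d2 = [{"id": 4, "name": "test",
       "components": [{"id": 2, "name": "test"}, {"id": 3, "name": "test2"}]}]

# Only the id values differ, so the stripped copies compare equal.
same = strip_keys(d1) == strip_keys(d2)
print(same)
```

Because strip_keys builds new containers, d1 and d2 themselves are left untouched.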
I have a nested JSON file and would like to remove duplicates based on two keys.
JSON example:
"books": [
{
"id": "1",
"story": {
"title": "Lonely lion"
},
"description": [
{
"release": false,
"author": [
{
"name": "John",
"main": 1
},
{
"name": "Jeroge",
"main": 0
},
{
"name": "Peter",
"main": 0
}
]
}
]
},
{
"id": "2",
"story": {
"title": "Lonely lion"
},
"description": [
{
"release": false,
"author": [
{
"name": "Jeroge",
"main": 1
},
{
"name": "Peter",
"main": 0
},
{
"name": "John",
"main": 0
}
]
}
]
},
{
"id": "3",
"story": {
"title": "Lonely lion"
},
"description": [
{
"release": false,
"author": [
{
"name": "John",
"main": 1
},
{
"name": "Jeroge",
"main": 0
}
]
}
]
}
]
Here I try to match on the title and the author names. For example, id 1 and id 2 are duplicates, because the title is the same and the author names are the same (the order of the authors doesn't matter, and the main attribute can be ignored). So in the output JSON only one of id 1 or id 2 will remain, along with id 3. In the final output I need two files.
Output_JSON:
"books": [
{
"id": "1",
"story": {
"title": "Lonely lion"
},
"description": [
{
"release": false,
"author": [
{
"name": "John",
"main": 1
},
{
"name": "Jeroge",
"main": 0
},
{
"name": "Peter",
"main": 0
}
]
}
]
},
{
"id": "3",
"story": {
"title": "Lonely lion"
},
"description": [
{
"release": false,
"author": [
{
"name": "John",
"main": 1
},
{
"name": "Jeroge",
"main": 0
}
]
}
]
}
]
duplicatedID.csv:
1-2
I tried the following, but it does not give correct results:
list= []
duplicate_Id = []
for data in (json_data['books'])[:]:
    elements= []
    id = data['id']
    title = data['story']['title']
    elements.append(title)
    for i in (data['description'][0]['author']):
        name = (i['name'])
        elements.append(name)
    if not list:
        list.append(elements)
    else:
        for j in list:
            if set(elements) == set(j):
                duplicate_Id.append(id)
                elements = []
The general idea is to:
Get the groups identified by some function that collects duplicates.
Then return the first entry of each group, ensuring no duplicates.
Define the key function as the sorted list of author names, since the set of authors is by definition the unique key but the names may appear in any order.
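import json
from itertools import groupby
# "books.json" is a placeholder name for the file that holds the JSON above
with open("books.json") as f:
    j = json.load(f)
def getAuthors(book):
    authors = book['description'][0]['author']
    return sorted(author['name'] for author in authors)
def transform(books):
    # groupby only merges adjacent items, so sort by the same key first
    books = sorted(books, key=getAuthors)
    groups = [list(group) for _, group in groupby(books, key=getAuthors)]
    return [group[0] for group in groups]
print(transform(j['books']))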
If we wanted to get the duplicates, then we do the same computation, but return any sublist with length > 1 as this is by our definition duplicated data.
def transform(books):
    books = sorted(books, key=getAuthors)  # groupby only merges adjacent items
    groups = [list(group) for _, group in groupby(books, key=getAuthors)]
    return [group for group in groups if len(group) > 1]
Where j['books'] is the JSON you gave enclosed in an object.
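The question asks for a match on both the title and the author names, and for a list of the duplicated ids. A possible sketch of that, grouping with a plain dict instead of groupby so that no sorting is needed, is shown below. The names dedupe_books and make_book are hypothetical helpers for this illustration, the sample is a trimmed version of the data in the question, and the "1-2" format is taken from the duplicatedID.csv example.

```python
def dedupe_books(books):
    """Group books by (title, set of author names); keep the first of each group."""
    groups = {}
    for book in books:
        # frozenset ignores author order (and the "main" attribute is never read)
        names = frozenset(a["name"] for a in book["description"][0]["author"])
        key = (book["story"]["title"], names)
        groups.setdefault(key, []).append(book)
    unique = [group[0] for group in groups.values()]
    duplicate_ids = ["-".join(b["id"] for b in group)
                     for group in groups.values() if len(group) > 1]
    return unique, duplicate_ids


def make_book(book_id, title, names):
    """Build a book in the same shape as the JSON in the question."""
    return {"id": book_id, "story": {"title": title},
            "description": [{"release": False,
                             "author": [{"name": n} for n in names]}]}


books = [
    make_book("1", "Lonely lion", ["John", "Jeroge", "Peter"]),
    make_book("2", "Lonely lion", ["Jeroge", "Peter", "John"]),
    make_book("3", "Lonely lion", ["John", "Jeroge"]),
]

unique, duplicate_ids = dedupe_books(books)
print([b["id"] for b in unique])  # ['1', '3']
print(duplicate_ids)              # ['1-2']
```

The unique list can be written back with json.dump and the duplicate_ids list written one entry per line to the CSV file.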
I have searched a nested dict for certain keys and can locate the ones I am looking for, but I am not sure how to add a key/value pair at the location where the matching key was found. Is there a way to tell Python to append the data entry at the location it is currently looking at?
Code:
import os
import json
import shutil
import re
import fileinput
from collections import OrderedDict
#Finds and lists the folders that have been provided
d='.'
folders = list(filter (lambda x: os.path.isdir(os.path.join(d, x)), os.listdir(d)))
print("Folders found: ")
print(folders)
print("\n")
def processModelFolder(inFolder):
    #Creating the file names
    fileName = os.path.join(d, inFolder, inFolder + ".mdl")
    fileNameTwo = os.path.join(d, inFolder, inFolder + ".vg2.json")
    fileNameThree = os.path.join(d, inFolder, inFolder + "APPENDED.vg2.json")
    #copying the json file so the new copy can be appended
    shutil.copyfile(fileNameTwo, fileNameThree)
    #assigning IDs and properties to search for in the mdl file
    IDs = ["7f034e5c-24df-4145-bab8-601f49b43b50"]
    Properties = ["IDSU_FX[0]"]
    #Basic check to see if IDs and Properties are valid
    for i in IDs:
        if len(i) != 36:
            print("ID may not have been valid and might not return the results you expect, check to ensure the characters are correct: ")
            print(i)
            print("\n")
    if len(IDs) == 0:
        print("No IDs were given!")
    elif len(Properties) == 0:
        print("No Properties were given!")
    #Reads code until an ID is found
    else:
        with open(fileName , "r") as in_file:
            IDCO = None
            for n, line in enumerate(in_file, 1):
                if line.startswith('IDCO_IDENTIFICATION'):
                    #Checks if the second part of each line is a ID tag in IDs
                    if line.split('"')[1] in IDs:
                        #If ID found it is stored as IDCO
                        IDCO = line.split('"')[1]
                    else:
                        if IDCO:
                            pass
                        IDCO = None
                #Checks if the first part of each line is a Prop in Properties
                elif IDCO and line.split(' ')[0] in Properties:
                    print('Found! ID:{} Prop:{} Value: {}'.format(IDCO, line.split('=')[0][:-1], line.split('=')[1][:-1]))
                    print("\n")
                    #Stores the property name and value
                    name = str(line.split(' ')[0])
                    value = str(line.split(' ')[2])
                    #creates the entry to be appended to the dict
                    #json file editing
                    with open(fileNameThree , "r+") as json_data:
                        python_obj = json.load(json_data)
                        #calling recursive search
                        get_recursively(python_obj, IDCO, name, value)
                    with open(fileNameThree , "w") as json_data:
                        json.dump(python_obj, json_data, indent = 1)
            print('Processed {} lines in file: {}'.format(n , fileName))
def get_recursively(search_dict, IDCO, name, value):
    """
    Takes a dict with nested lists and dicts,
    and searches all dicts for a key of the field
    provided, when key "id" is found it checks to,
    see if its value is the current IDCO tag, if so it appends the new data.
    """
    fields_found = []
    for key, value in search_dict.iteritems():
        if key == "id":
            if value == IDCO:
                print("FOUND IDCO IN JSON: " + value +"\n")
        elif isinstance(value, dict):
            results = get_recursively(value, IDCO, name, value)
            for result in results:
                x = 1
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    more_results = get_recursively(item, IDCO, name, value)
                    for another_result in more_results:
                        x=1
    return fields_found
for modelFolder in folders:
    processModelFolder(modelFolder)
In short, once it finds a key/id value pair that I want, can I tell it to append name/value to that location directly and then continue?
nested dict:
{
"id": "79cb20b0-02be-42c7-9b45-96407c888dc2",
"tenantId": "00000000-0000-0000-0000-000000000000",
"name": "2-stufiges Stirnradgetriebe",
"description": null,
"visibility": "None",
"method": "IDM_CALCULATE_GEAR_COUPLED",
"created": "2018-10-16T10:25:20.874Z",
"createdBy": "00000000-0000-0000-0000-000000000000",
"lastModified": "2018-10-16T10:25:28.226Z",
"lastModifiedBy": "00000000-0000-0000-0000-000000000000",
"client": "STRING_BEARINX_ONLINE",
"project": {
"id": "10c37dcc-0e4e-4c4d-a6d6-12cf65cceaf9",
"name": "proj 2",
"isBookmarked": false
},
"rootObject": {
"id": "6ff0010c-00fe-485b-b695-4ddd6aca4dcd",
"type": "IDO_GEAR",
"children": [
{
"id": "1dd94d1a-e52d-40b3-a82b-6db02a8fbbab",
"type": "IDO_SYSTEM_LOADCASE",
"children": [],
"childList": "SYSTEMLOADCASE",
"properties": [
{
"name": "IDCO_IDENTIFICATION",
"value": "1dd94d1a-e52d-40b3-a82b-6db02a8fbbab"
},
{
"name": "IDCO_DESIGNATION",
"value": "Lastfall 1"
},
{
"name": "IDSLC_TIME_PORTION",
"value": 100
},
{
"name": "IDSLC_DISTANCE_PORTION",
"value": 100
},
{
"name": "IDSLC_OPERATING_TIME_IN_HOURS",
"value": 1
},
{
"name": "IDSLC_OPERATING_TIME_IN_SECONDS",
"value": 3600
},
{
"name": "IDSLC_OPERATING_REVOLUTIONS",
"value": 1
},
{
"name": "IDSLC_OPERATING_DISTANCE",
"value": 1
},
{
"name": "IDSLC_ACCELERATION",
"value": 9.81
},
{
"name": "IDSLC_EPSILON_X",
"value": 0
},
{
"name": "IDSLC_EPSILON_Y",
"value": 0
},
{
"name": "IDSLC_EPSILON_Z",
"value": 0
},
{
"name": "IDSLC_CALCULATION_WITH_OWN_WEIGHT",
"value": "CO_CALCULATION_WITHOUT_OWN_WEIGHT"
},
{
"name": "IDSLC_CALCULATION_WITH_TEMPERATURE",
"value": "CO_CALCULATION_WITH_TEMPERATURE"
},
{
"name": "IDSLC_FLAG_FOR_LOADCASE_CALCULATION",
"value": "LB_CALCULATE_LOADCASE"
},
{
"name": "IDSLC_STATUS_OF_LOADCASE_CALCULATION",
"value": false
}
],
"position": 1,
"order": 1,
"support_vector": {
"x": 0,
"y": 0,
"z": 0
},
"u_axis_vector": {
"x": 1,
"y": 0,
"z": 0
},
"w_axis_vector": {
"x": 0,
"y": 0,
"z": 1
},
"role": "_none_"
},
{
"id": "ab7fbf37-17bb-4e60-a543-634571a0fd73",
"type": "IDO_SHAFT_SYSTEM",
"children": [
{
"id": "7f034e5c-24df-4145-bab8-601f49b43b50",
"type": "IDO_RADIAL_ROLLER_BEARING",
"children": [
{
"id": "0b3e695b-6028-43af-874d-4826ab60dd3f",
"type": "IDO_RADIAL_BEARING_INNER_RING",
"children": [
{
"id": "330aa09d-60fb-40d7-a190-64264b3d44b7",
"type": "IDO_LOADCONTAINER",
"children": [
{
"id": "03036040-fc1a-4e52-8a69-d658e18a8d4a",
"type": "IDO_DISPLACEMENT",
"children": [],
"childList": "DISPLACEMENT",
"properties": [
{
"name": "IDCO_IDENTIFICATION",
"value": "03036040-fc1a-4e52-8a69-d658e18a8d4a"
},
{
"name": "IDCO_DESIGNATION",
"value": "Displacement 1"
}
],
"position": 1,
"order": 1,
"support_vector": {
"x": -201.3,
"y": 0,
"z": -229.8
},
"u_axis_vector": {
"x": 1,
"y": 0,
"z": 0
},
"w_axis_vector": {
"x": 0,
"y": 0,
"z": 1
},
"shaftSystemId": "ab7fbf37-17bb-4e60-a543-634571a0fd73",
"role": "_none_"
},
{
"id": "485f5bf4-fb97-415b-8b42-b46e9be080da",
"type": "IDO_CUMULATED_LOAD",
"children": [],
"childList": "CUMULATEDLOAD",
"properties": [
{
"name": "IDCO_IDENTIFICATION",
"value": "485f5bf4-fb97-415b-8b42-b46e9be080da"
},
{
"name": "IDCO_DESIGNATION",
"value": "Cumulated load 1"
},
{
"name": "IDCO_X",
"value": 0
},
{
"name": "IDCO_Y",
"value": 0
},
{
"name": "IDCO_Z",
"value": 0
}
],
"position": 2,
"order": 1,
"support_vector": {
"x": -201.3,
"y": 0,
"z": -229.8
},
"u_axis_vector": {
"x": 1,
"y": 0,
"z": 0
},
"w_axis_vector": {
"x": 0,
"y": 0,
"z": 1
},
"shaftSystemId": "ab7fbf37-17bb-4e60-a543-634571a0fd73",
"role": "_none_"
}
],
"childList": "LOADCONTAINER",
"properties": [
{
"name": "IDCO_IDENTIFICATION",
"value": "330aa09d-60fb-40d7-a190-64264b3d44b7"
},
{
"name": "IDCO_DESIGNATION",
"value": "Load container 1"
},
{
"name": "IDLC_LOAD_DISPLACEMENT_COMBINATION",
"value": "LOAD_MOMENT"
},
{
"name": "IDLC_TYPE_OF_MOVEMENT",
"value": "LB_ROTATING"
},
{
"name": "IDLC_NUMBER_OF_ARRAY_ELEMENTS",
"value": 20
}
],
"position": 1,
"order": 1,
"support_vector": {
"x": -201.3,
"y": 0,
"z": -229.8
},
"u_axis_vector": {
"x": 1,
"y": 0,
"z": 0
},
"w_axis_vector": {
"x": 0,
"y": 0,
"z": 1
},
"shaftSystemId": "ab7fbf37-17bb-4e60-a543-634571a0fd73",
"role": "_none_"
},
{
"id": "3258d217-e6e4-4a5c-8677-ae1fca26f21e",
"type": "IDO_RACEWAY",
"children": [],
"childList": "RACEWAY",
"properties": [
{
"name": "IDCO_IDENTIFICATION",
"value": "3258d217-e6e4-4a5c-8677-ae1fca26f21e"
},
{
"name": "IDCO_DESIGNATION",
"value": "Raceway 1"
},
{
"name": "IDRCW_UPPER_DEVIATION_RACEWAY_DIAMETER",
"value": 0
},
{
"name": "IDRCW_LOWER_DEVIATION_RACEWAY_DIAMETER",
"value": 0
},
{
"name": "IDRCW_PROFILE_OFFSET",
"value": 0
},
{
"name": "IDRCW_PROFILE_ANGLE",
"value": 0
},
{
"name": "IDRCW_PROFILE_CURVATURE_RADIUS",
"value": 0
},
{
"name": "IDRCW_PROFILE_CENTER_POINT_OFFSET",
"value": 0
},
{
"name": "IDRCW_PROFILE_NUMBER_OF_WAVES",
"value": 0
},
{
"name": "IDRCW_PROFILE_AMPLITUDE",
"value": 0
},
{
"name": "IDRCW_PROFILE_POSITION_OF_FIRST_WAVE",
"value": 0
},
Bug
First of all, rename the loop variable value, because value is already a function argument and the loop over the dictionary rebinds the same name:
for key, value in search_dict.iteritems(): # <-- RENAME value TO SOMETHING ELSE, LIKE val
Otherwise you will have bugs, because the loop overwrites the value you want to insert with each value from the dictionary. If you iterate with for key, val in ... instead, the outer value argument stays available inside the loop.
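To see the effect in isolation, here is a small sketch with two toy functions. The names shadowed and renamed are made up for this illustration and are not part of the code in the question.

```python
def shadowed(mapping, value):
    # The loop rebinds "value", so the argument is lost after the first item.
    for key, value in mapping.items():
        pass
    return value


def renamed(mapping, value):
    # With a different loop variable the argument is still intact.
    for key, val in mapping.items():
        pass
    return value


data = {"id": "abc"}
print(shadowed(data, "new"))  # abc
print(renamed(data, "new"))   # new
```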
Adding The Value Pair
It seems id is a key inside your search_dict, but reading your JSON file your search_dict may have several nested lists like properties and/or children, so it depends on where you want to add the new pair.
If you want to add it to the same dictionary where your id is:
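for key, val in list(search_dict.items()):  # iterate over a snapshot: adding a new key while iterating the dict itself raises RuntimeError
    if key == "id":
        if val == IDCO:
            print("FOUND IDCO IN JSON: " + val + "\n")
            search_dict[name] = value  # value is the function argument, val the loop variable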
Result:
{
"id": "3258d217-e6e4-4a5c-8677-ae1fca26f21e",
"type": "IDO_RACEWAY",
"children": [],
"childList": "RACEWAY",
"<new name>": "<new value>",
"properties": [
{
"name": "IDCO_IDENTIFICATION",
"value": "3258d217-e6e4-4a5c-8677-ae1fca26f21e"
},
If you want to add it to the children or properties list inside the dictionary where id is:
if key == "id":
    if val == IDCO:  # val is the renamed loop variable, value the function argument
        print("FOUND IDCO IN JSON: " + val + "\n")
        if "properties" in search_dict:  # you can swap "properties" for "children", depending on your use case
            search_dict["properties"].append({"name": name, "value": value})  # a new dictionary with 'name' and 'value' keys
Result:
{
"id": "3258d217-e6e4-4a5c-8677-ae1fca26f21e",
"type": "IDO_RACEWAY",
"children": [],
"childList": "RACEWAY",
"properties": [
{
"name": "IDCO_IDENTIFICATION",
"value": "3258d217-e6e4-4a5c-8677-ae1fca26f21e"
},
{
"name": "<new name>",
"value": "<new value>"
},
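Putting the two points together, the whole search can be written as one short recursive function. This is a minimal Python 3 sketch under a few assumptions: add_property is a hypothetical name, the new pair is appended to the properties list (which is created if it is missing), the tree is a heavily trimmed stand-in for the real file, and "1.5" is a placeholder value.

```python
def add_property(node, target_id, name, value):
    """Append {"name": ..., "value": ...} to the properties of every dict
    whose "id" equals target_id. Returns the number of dicts updated."""
    count = 0
    if isinstance(node, dict):
        if node.get("id") == target_id:
            node.setdefault("properties", []).append({"name": name, "value": value})
            count += 1
        # Snapshot the values so the recursion never iterates a dict being resized.
        for child in list(node.values()):
            count += add_property(child, target_id, name, value)
    elif isinstance(node, list):
        for item in node:
            count += add_property(item, target_id, name, value)
    return count


# Trimmed stand-in for the JSON file from the question.
tree = {
    "id": "6ff0010c-00fe-485b-b695-4ddd6aca4dcd",
    "children": [
        {
            "id": "7f034e5c-24df-4145-bab8-601f49b43b50",
            "children": [],
            "properties": [
                {"name": "IDCO_IDENTIFICATION",
                 "value": "7f034e5c-24df-4145-bab8-601f49b43b50"}
            ],
        }
    ],
}

hits = add_property(tree, "7f034e5c-24df-4145-bab8-601f49b43b50", "IDSU_FX[0]", "1.5")
print(hits)
print(tree["children"][0]["properties"][-1])
```

The function modifies the loaded object in place, so the result can be written back with json.dump as in the original script.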