Create / re-create a list of dictionaries from a dictionary via a recursive Python function

I'm trying to split this JSON object into multiple events, because that is the input an ETL tool expects. I know this is fairly straightforward with loops, if statements, and explicitly defined search fields for each event. That approach is not feasible here because I have many heavily nested JSON objects, and I would prefer to let recursion do the heavy lifting. The following sample object consists of strings, a list, and dicts (which covers most of the use cases in my data).
{
"event_name": "restaurants",
"properties": {
"_id": "5a9909384309cf90b5739342",
"name": "Mangal Kebab Turkish Restaurant",
"restaurant_id": "41009112",
"borough": "Queens",
"cuisine": "Turkish",
"address": {
"building": "4620",
"coord": {
"0": -73.9180155,
"1": 40.7427742
},
"street": "Queens Boulevard",
"zipcode": "11104"
},
"grades": [
{
"date": 1414540800000,
"grade": "A",
"score": 12
},
{
"date": 1397692800000,
"grade": "A",
"score": 10
},
{
"date": 1381276800000,
"grade": "A",
"score": 12
}
]
}
}
I want to convert it to the following list of dictionaries:
[
{
"event_name": "restaurants",
"properties": {
"restaurant_id": "41009112",
"name": "Mangal Kebab Turkish Restaurant",
"cuisine": "Turkish",
"_id": "5a9909384309cf90b5739342",
"borough": "Queens"
}
},
{
"event_name": "restaurant_address",
"properties": {
"zipcode": "11104",
"ref_id": "41009112",
"street": "Queens Boulevard",
"building": "4620"
}
},
{
"event_name": "restaurant_address_coord",
"ref_id": "41009112",
"0": -73.9180155,
"1": 40.7427742
},
{
"event_name": "restaurant_grades",
"properties": {
"date": 1414540800000,
"ref_id": "41009112",
"score": 12,
"grade": "A",
"index": "0"
}
},
{
"event_name": "restaurant_grades",
"properties": {
"date": 1397692800000,
"ref_id": "41009112",
"score": 10,
"grade": "A",
"index": "1"
}
},
{
"event_name": "restaurant_grades",
"properties": {
"date": 1381276800000,
"ref_id": "41009112",
"score": 12,
"grade": "A",
"index": "2"
}
}
]
Most importantly, these events will be split into independent structured tables, and to join them we need primary keys / unique identifiers. So each deeply nested dictionary should carry its parent's id as ref_id. In this case ref_id = restaurant_id from the parent dictionary.
Most examples on the internet flatten the whole object so it can be normalized into a dataframe, but to use this ETL tool to its full potential it would be ideal to solve this with recursion and output a list of dictionaries.

This is what one might call a brute force method. Create a translator function to move each item into the correct part of the new structure (like a schema).
# input dict
d = {
"event_name": "demo",
"properties": {
"_id": "5a9909384309cf90b5739342",
"name": "Mangal Kebab Turkish Restaurant",
"restaurant_id": "41009112",
"borough": "Queens",
"cuisine": "Turkish",
"address": {
"building": "4620",
"coord": {
"0": -73.9180155,
"1": 40.7427742
},
"street": "Queens Boulevard",
"zipcode": "11104"
},
"grades": [
{
"date": 1414540800000,
"grade": "A",
"score": 12
},
{
"date": 1397692800000,
"grade": "A",
"score": 10
},
{
"date": 1381276800000,
"grade": "A",
"score": 12
}
]
}
}
def convert_structure(d: dict):
    '''Convert the input dict to the new structure.'''
    # the new dict
    e = {}
    e['event_name'] = d['event_name']
    e['properties'] = {}
    e['properties']['restaurant_id'] = d['properties']['restaurant_id']
    # and so forth...
    # keep building the new structure / template
    # return a list
    return [e]

# run & print
x = convert_structure(d)
print(x)
The result (for the part done so far) looks like this:
[{'event_name': 'demo', 'properties': {'restaurant_id': '41009112'}}]
If a pattern is identified, then the above could be improved...
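One such pattern, sketched under stated assumptions: every nested dict (or list of dicts) becomes its own event, the child event name is the parent name joined to the key with an underscore, and every child carries the root's id as ref_id. The function name split_events and its parameters are illustrative and not part of the answer above. Wrapping non-dict list items as {"value": item} is also an assumption, since the sample contains none.

```python
def split_events(name, props, id_key, ref_id=None, index=None):
    """Recursively split a nested dict into a flat list of events."""
    flat = {}     # scalar fields stay on this event
    nested = []   # dict / list fields become child events
    for key, value in props.items():
        if isinstance(value, (dict, list)):
            nested.append((key, value))
        else:
            flat[key] = value
    if ref_id is not None:
        flat["ref_id"] = ref_id
    if index is not None:
        flat["index"] = str(index)
    events = [{"event_name": name, "properties": flat}]

    # Children of the root reference the root's id; deeper levels reuse it.
    child_ref = ref_id if ref_id is not None else props.get(id_key)
    for key, value in nested:
        child_name = f"{name}_{key}"
        if isinstance(value, dict):
            events += split_events(child_name, value, id_key, child_ref)
        else:
            for i, item in enumerate(value):
                # Assumption: non-dict list items are wrapped in {"value": ...}.
                item = item if isinstance(item, dict) else {"value": item}
                events += split_events(child_name, item, id_key, child_ref, i)
    return events


# Shortened version of the sample object from the question.
doc = {
    "event_name": "restaurants",
    "properties": {
        "restaurant_id": "41009112",
        "name": "Mangal Kebab Turkish Restaurant",
        "address": {
            "building": "4620",
            "coord": {"0": -73.9180155, "1": 40.7427742},
        },
        "grades": [
            {"date": 1414540800000, "grade": "A", "score": 12},
            {"date": 1397692800000, "grade": "A", "score": 10},
        ],
    },
}

events = split_events("restaurant", doc["properties"], "restaurant_id")
for event in events:
    print(event)
```

Here the root event is named "restaurant" so that the children come out as "restaurant_address", "restaurant_address_coord" and "restaurant_grades". If the root must keep the name "restaurants", it can be renamed on the first element of the returned list.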

Related

How can I find a specific key from a python dict and then get a value from that key in Python

I have a python dictionary that looks something like this:
[
{
"timestamp": 1621559698154,
"user": {
"uri": "spotify:user:xxxxxxxxxxxxxxxxxxxx",
"name": "Panda",
"imageUrl": "https://i.scdn.co/image/ab67757000003b82b54c68ed19f1047912529ef4"
},
"track": {
"uri": "spotify:track:6SJSOont5dooK2IXQoolNQ",
"name": "Dirty",
"imageUrl": "http://i.scdn.co/image/ab67616d0000b273a36e3d46e406deebdd5eafb0",
"album": {
"uri": "spotify:album:0NMpswZbEcswI3OIe6ml3Y",
"name": "Dirty (Live)"
},
"artist": {
"uri": "spotify:artist:4ZgQDCtRqZlhLswVS6MHN4",
"name": "grandson"
},
"context": {
"uri": "spotify:artist:4ZgQDCtRqZlhLswVS6MHN4",
"name": "grandson",
"index": 0
}
}
},
{
"timestamp": 1621816159299,
"user": {
"uri": "spotify:user:xxxxxxxxxxxxxxxxxxxxxxxx",
"name": "maja",
"imageUrl": "https://i.scdn.co/image/ab67757000003b8286459151d5426f5a9e77cfee"
},
"track": {
"uri": "spotify:track:172rW45GEnGoJUuWfm1drt",
"name": "Your Best American Girl",
"imageUrl": "http://i.scdn.co/image/ab67616d0000b27351630f0f26aff5bbf9e10835",
"album": {
"uri": "spotify:album:16i5KnBjWgUtwOO7sVMnJB",
"name": "Puberty 2"
},
"artist": {
"uri": "spotify:artist:2uYWxilOVlUdk4oV9DvwqK",
"name": "Mitski"
},
"context": {
"uri": "spotify:playlist:0tol7yRYYfiPJ17BuJQKu2",
"name": "I Bet on Losing Dogs",
"index": 0
}
}
}
]
How can I get, for example, the group of values for user.name "Panda" and then get that specific "track" list? I can't parse through the list by index because the list order changes randomly.
If you are only looking for "Panda", then you can just loop over the list, check whether the name is "Panda", and then retrieve the track list accordingly.
Otherwise, that would be inefficient if you want to do that for many different users. I would first make a dict that maps user to its index in the list, and then use that for each user (I am assuming that the list does not get modified while you execute the code, although it can be modified between executions.)
user_to_id = {data[i]['user']['name']: i for i in range(len(data))} # {'Panda': 0, 'maja': 1}
def get_track(user):
    return data[user_to_id[user]]['track']
print(get_track('maja'))
print(get_track('Panda'))
where data is the list you provided.
Or, perhaps just make a dictionary of tracks directly:
tracks = {item['user']['name']: item['track'] for item in data}
print(tracks['Panda'])
If you want to get list of tracks for user Panda:
tracks = [entry['track'] for entry in data if entry['user']['name'] == 'Panda']
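Note that the dictionary-based versions keep only one entry per user name. If the same user can appear in several entries, grouping the tracks per user is one option. The following is a minimal sketch; the data list is a shortened stand-in with the same shape as the question's list, and "Another Track" is a placeholder value.

```python
from collections import defaultdict

# Shortened stand-in for the list in the question (assumed shape).
data = [
    {"timestamp": 1, "user": {"name": "Panda"}, "track": {"name": "Dirty"}},
    {"timestamp": 2, "user": {"name": "maja"}, "track": {"name": "Your Best American Girl"}},
    {"timestamp": 3, "user": {"name": "Panda"}, "track": {"name": "Another Track"}},
]

# Map each user name to the list of all tracks seen for that user.
tracks_by_user = defaultdict(list)
for entry in data:
    tracks_by_user[entry["user"]["name"]].append(entry["track"])

print(tracks_by_user["Panda"])
```

The lookup no longer depends on the order of the list, and building the mapping takes a single pass over the data.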

Create new key value in JSON data using Python / Pandas?

I'm trying to work with the Campaign Monitor API, posting JSON data through the API to update subscriber lists. I'm currently one change away from being able to send data.
Right now, my JSON data looks like this:
{
"EmailAddress": "subscriber1@example.com",
"Name": "New Subscriber One",
"CustomFields": [
{
"Key": "website",
"Value": "http://example.com"
},
{
"Key": "interests",
"Value": "magic"
},
{
"Key": "interests",
"Value": "romantic walks"
},
{
"Key": "age",
"Value": "",
"Clear": true
}
],
},
{
"EmailAddress": "subscriber2@example.com",
"Name": "New Subscriber Two",
},
{
"EmailAddress": "subscriber3@example.com",
"Name": "New Subscriber Three",
}
}
I still need to add a new key at the beginning of the JSON payload, i.e. 'Subscribers': my_json_data. How would I go about adding the Subscribers key and placing my full, current JSON data into a list under it?
The final result should look like this:
{
'Subscribers' : [
{
"EmailAddress": "subscriber1@example.com",
"Name": "New Subscriber One",
"CustomFields": [
{
"Key": "website",
"Value": "http://example.com"
},
{
"Key": "interests",
"Value": "magic"
},
{
"Key": "interests",
"Value": "romantic walks"
},
{
"Key": "age",
"Value": "",
"Clear": true
}
],
},
{
"EmailAddress": "subscriber2@example.com",
"Name": "New Subscriber Two",
},
{
"EmailAddress": "subscriber3@example.com",
"Name": "New Subscriber Three",
}
}
]
}
I've tried to approach this by creating a new dictionary, but when I convert that back to JSON I run into more issues and headaches. Is there an easy way to keep everything as a JSON-formatted dataset and add the leading 'Subscribers' key?
This should do it, assuming you've got valid JSON:
your_new_json = {}
your_new_json['Subscribers'] = [your_current_json]
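For the round trip back to JSON text, a sketch with the standard json module might look like the following. It assumes the current payload is JSON text that already holds a list of subscriber objects, in which case the list is assigned directly instead of being wrapped in another list. The variable names are illustrative.

```python
import json

# Assumed input: JSON text containing a list of subscriber objects.
current_json_text = """
[
  {"EmailAddress": "subscriber2@example.com", "Name": "New Subscriber Two"},
  {"EmailAddress": "subscriber3@example.com", "Name": "New Subscriber Three"}
]
"""

subscribers = json.loads(current_json_text)   # JSON text -> Python list
payload = {"Subscribers": subscribers}        # add the leading key
payload_text = json.dumps(payload, indent=2)  # Python dict -> JSON text
print(payload_text)
```

Working on the parsed Python objects and serializing once at the end avoids editing the JSON string by hand.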

Get "path" of parent keys and indices in dictionary of nested dictionaries and lists

I am receiving a large json from Google Assistant and I want to retrieve some specific details from it. The json is the following:
{
"responseId": "************************",
"queryResult": {
"queryText": "actions_intent_DELIVERY_ADDRESS",
"action": "delivery",
"parameters": {},
"allRequiredParamsPresent": true,
"fulfillmentMessages": [
{
"text": {
"text": [
""
]
}
}
],
"outputContexts": [
{
"name": "************************/agent/sessions/1527070836044/contexts/actions_capability_screen_output"
},
{
"name": "************************/agent/sessions/1527070836044/contexts/more",
"parameters": {
"polar": "no",
"polar.original": "No",
"cardinal": 2,
"cardinal.original": "2"
}
},
{
"name": "************************/agent/sessions/1527070836044/contexts/actions_capability_audio_output"
},
{
"name": "************************/agent/sessions/1527070836044/contexts/actions_capability_media_response_audio"
},
{
"name": "************************/agent/sessions/1527070836044/contexts/actions_intent_delivery_address",
"parameters": {
"DELIVERY_ADDRESS_VALUE": {
"userDecision": "ACCEPTED",
"@type": "type.googleapis.com/google.actions.v2.DeliveryAddressValue",
"location": {
"postalAddress": {
"regionCode": "US",
"recipients": [
"Amazon"
],
"postalCode": "NY 10001",
"locality": "New York",
"addressLines": [
"450 West 33rd Street"
]
},
"phoneNumber": "+1 206-266-2992"
}
}
}
},
{
"name": "************************/agent/sessions/1527070836044/contexts/actions_capability_web_browser"
}
],
"intent": {
"name": "************************/agent/intents/86fb2293-7ae9-4bed-adeb-6dfe8797e5ff",
"displayName": "Delivery"
},
"intentDetectionConfidence": 1,
"diagnosticInfo": {},
"languageCode": "en-gb"
},
"originalDetectIntentRequest": {
"source": "google",
"version": "2",
"payload": {
"isInSandbox": true,
"surface": {
"capabilities": [
{
"name": "actions.capability.MEDIA_RESPONSE_AUDIO"
},
{
"name": "actions.capability.SCREEN_OUTPUT"
},
{
"name": "actions.capability.AUDIO_OUTPUT"
},
{
"name": "actions.capability.WEB_BROWSER"
}
]
},
"inputs": [
{
"rawInputs": [
{
"query": "450 West 33rd Street"
}
],
"arguments": [
{
"extension": {
"userDecision": "ACCEPTED",
"@type": "type.googleapis.com/google.actions.v2.DeliveryAddressValue",
"location": {
"postalAddress": {
"regionCode": "US",
"recipients": [
"Amazon"
],
"postalCode": "NY 10001",
"locality": "New York",
"addressLines": [
"450 West 33rd Street"
]
},
"phoneNumber": "+1 206-266-2992"
}
},
"name": "DELIVERY_ADDRESS_VALUE"
}
],
"intent": "actions.intent.DELIVERY_ADDRESS"
}
],
"user": {
"lastSeen": "2018-05-23T10:20:25Z",
"locale": "en-GB",
"userId": "************************"
},
"conversation": {
"conversationId": "************************",
"type": "ACTIVE",
"conversationToken": "[\"more\"]"
},
"availableSurfaces": [
{
"capabilities": [
{
"name": "actions.capability.SCREEN_OUTPUT"
},
{
"name": "actions.capability.AUDIO_OUTPUT"
},
{
"name": "actions.capability.WEB_BROWSER"
}
]
}
]
}
},
"session": "************************/agent/sessions/1527070836044"
}
Among other things, this large JSON gives my back end the user's delivery address details (here I use Amazon's NY location as an example). I want to retrieve the location dictionary near the end of this JSON. The location details also appear near the start, but I specifically want the second location dictionary, the one near the end.
To do so, I had to read through the JSON myself and manually test possible "paths" to the location dictionary before finding that the following line retrieves the second location dictionary:
location = json['originalDetectIntentRequest']['payload']['inputs'][0]['arguments'][0]['extension']['location']
Therefore, my question is the following: is there any concise way to retrieve automatically the "path" of the parent keys and indices of the second location dictionary within this large json?
Hence, I expect that the general format of the output from a function which does this for all the occurrences of the location dictionary in any json will be the following:
[["path" of first `location` dictionary], ["path" of second `location` dictionary], ["path" of third `location` dictionary], ...]
where for the json above it will be
[["path" of first `location` dictionary], ["path" of second `location` dictionary]]
as there are two occurrences of the location dictionary with
["path" of second `location` dictionary] = ['originalDetectIntentRequest', 'payload', 'inputs', 0, 'arguments', 0, 'extension', 'location']
I am aware of related posts on StackOverflow (Python--Finding Parent Keys for a specific value in a nested dictionary), but I am not sure they apply exactly to my problem: they deal with parent keys in nested dictionaries, whereas here I need the parent keys and indices in a dictionary that contains both nested dictionaries and lists.
I solved this by using a recursive search:
from copy import copy

# result and path should be outside of the scope of find_path to persist values during recursive calls to the function
result = []
path = []

# i is the index of the list that dict_obj is part of
def find_path(dict_obj, key, i=None):
    for k, v in dict_obj.items():
        # add key to path
        path.append(k)
        if isinstance(v, dict):
            # continue searching
            find_path(v, key, i)
        if isinstance(v, list):
            # search through list of dictionaries
            for i, item in enumerate(v):
                # add the index of list that item dict is part of, to path
                path.append(i)
                if isinstance(item, dict):
                    # continue searching in item dict
                    find_path(item, key, i)
                # if reached here, the last added index was incorrect, so removed
                path.pop()
        if k == key:
            # add path to our result
            result.append(copy(path))
        # remove the key added in the first line
        if path != []:
            path.pop()

# di is the parsed JSON dict from the question
# default starting index is set to None
find_path(di, "location")
print(result)
# [['queryResult', 'outputContexts', 4, 'parameters', 'DELIVERY_ADDRESS_VALUE', 'location'], ['originalDetectIntentRequest', 'payload', 'inputs', 0, 'arguments', 0, 'extension', 'location']]
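A variant that avoids the module-level result and path lists is to pass the path down as an argument and yield matches from a generator. This is only a sketch: find_paths and the small sample dict are hypothetical stand-ins, not the code or data from the question. It also descends into lists nested directly inside lists.

```python
from functools import reduce
from operator import getitem


def find_paths(obj, key, path=()):
    """Yield the path (keys and list indices) to every occurrence of key."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield list(path) + [k]
            yield from find_paths(v, key, path + (k,))
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            yield from find_paths(item, key, path + (i,))


# Small stand-in with the same kind of nesting as the JSON in the question.
sample = {
    "queryResult": {
        "outputContexts": [
            {"name": "a"},
            {"parameters": {"location": {"x": 1}}},
        ]
    },
    "payload": {
        "inputs": [{"arguments": [{"extension": {"location": {"x": 2}}}]}]
    },
}

paths = list(find_paths(sample, "location"))
print(paths)

# Follow a path back into the structure to fetch the value it points to.
second = reduce(getitem, paths[1], sample)
print(second)
```

Because no state is shared between calls, the function can be run several times, or on several documents, without resetting anything.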

Python parse large JSON nests and lists - string indices must be integers

NIST recently released all CVE data in JSON format, and I am trying to parse it out to add to a MySQL database so I can compare my security findings to what NIST shows.
The data is very confusing to parse because there is a lot of nesting, with some lists included.
Here is a snippet of the JSON.
{
"CVE_data_type": "CVE",
"CVE_data_format": "MITRE",
"CVE_data_version": "4.0",
"CVE_data_numberOfCVEs": "600",
"CVE_data_timestamp": "Fri Apr 28 16:00:10 EDT 2017",
"CVE_Items": [
{
"CVE_data_meta": {
"CVE_ID": "CVE-2007-6761"
},
"CVE_affects": {
"CVE_vendor": {
"CVE_data_version": "4.0",
"CVE_vendor_data": [
{
"CVE_vendor_name": "linux",
"CVE_product": {
"CVE_product_data": [
{
"CVE_data_version": "4.0",
"CVE_product_name": "linux_kernel",
"CVE_version": {
"CVE_version_data": [
{
"CVE_version_value": "2.6.23",
"CVE_version_affected": "<="
}
]
}
}
]
}
}
]
}
},
"CVE_configurations": {
"CVE_data_version": "4.0",
"CVE_configuration_data": [
{
"operator": "OR",
"cpe": [
{
"vulnerable": true,
"previousVersions": true,
"cpeMatchString": "cpe:/o:linux:linux_kernel:2.6.23",
"cpe23Uri": "cpe:2.3:o:linux:linux_kernel:2.6.23:*:*:*:*:*:*:*"
}
]
}
]
},
"CVE_description": {
"CVE_data_version": "4.0",
"CVE_description_data": [
{
"lang": "en",
"value": "drivers/media/video/videobuf-vmalloc.c in the Linux kernel before 2.6.24 does not initialize videobuf_mapping data structures, which allows local users to trigger an incorrect count value and videobuf leak via unspecified vectors, a different vulnerability than CVE-2010-5321."
}
]
},
"CVE_references": {
"CVE_data_version": "4.0",
"CVE_reference_data": [
{
"url": "http://www.linuxgrill.com/anonymous/kernel/v2.6/ChangeLog-2.6.24",
"name": "CONFIRM",
"publish_date": "04/24/2017"
},
{
"url": "http://www.securityfocus.com/bid/98001",
"name": "BID",
"publish_date": "04/26/2017"
},
{
"url": "https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=827340",
"name": "MISC",
"publish_date": "04/24/2017"
},
{
"url": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0b29669c065f60501e7289e1950fa2a618962358",
"name": "CONFIRM",
"publish_date": "04/24/2017"
},
{
"url": "https://github.com/torvalds/linux/commit/0b29669c065f60501e7289e1950fa2a618962358",
"name": "CONFIRM",
"publish_date": "04/24/2017"
}
]
},
"CVE_impact": {
"CVE_impact_cvssv2": {
"bm": {
"av": "LOCAL",
"ac": "LOW",
"au": "NONE",
"c": "PARTIAL",
"i": "PARTIAL",
"a": "PARTIAL",
"score": "4.6"
}
},
"CVE_impact_cvssv3": {
"bm": {
"av": "LOCAL",
"ac": "LOW",
"pr": "LOW",
"ui": "NONE",
"scope": "UNCHANGED",
"c": "HIGH",
"i": "HIGH",
"a": "HIGH",
"score": "7.8"
}
}
},
"CVE_problemtype": {
"CVE_data_version": "4.0",
"CVE_problemtype_data": [
{
"description": [
{
"lang": "en",
"value": "CWE-119"
}
]
}
]
}
}
]
}
When I try to parse it to get the info I want, I run into errors. Here is my test code.
import json

with open('/tmp/nvdcve-1.0-recent.json') as data_file:
    cve_data = json.load(data_file)

product_list = []
for data_list in cve_data["CVE_Items"]:
    for cve_tag,cve_id in data_list["CVE_data_meta"].items():
        cve = str(cve_id)

    for vendor_data in data_list["CVE_affects"]["CVE_vendor"]["CVE_vendor_data"]["CVE_product"]:
        for data_version,product_name,version_set in vendor_data["CVE_product_data"].items():
            print(product_name)
The Error
TypeError Traceback (most recent call last)
<ipython-input-10-81b0239327c1> in <module>()
10 cve = str(cve_id)
11
---> 12 for vendor_data in data_list["CVE_affects"]["CVE_vendor"]["CVE_vendor_data"]["CVE_product"]:
13 for data_version,product_name,version_set in vendor_data["CVE_product_data"].items():
14 print data_version
TypeError: list indices must be integers, not str
This is confusing to me because there are nests within nests, and lists within these nests. I am having a hard time figuring out how to get at some of this deeply nested info.
I feel your pain, but on closer inspection "CVE_vendor_data" is not a dictionary; it is a list of dictionaries. Notice the "[" after the colon. That is why it needs integers to index the list. The same goes for "CVE_product_data": it is also a list of dictionaries.
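Put together, the loop could iterate over each of those lists instead of indexing them with a string. The following is a sketch, using a trimmed stand-in for the parsed file with the same nesting as the snippet in the question; the tuple layout collected in product_list is just one possible choice.

```python
# Trimmed stand-in for the parsed NVD file (same nesting as the question's snippet).
cve_data = {
    "CVE_Items": [
        {
            "CVE_data_meta": {"CVE_ID": "CVE-2007-6761"},
            "CVE_affects": {"CVE_vendor": {"CVE_vendor_data": [
                {
                    "CVE_vendor_name": "linux",
                    "CVE_product": {"CVE_product_data": [
                        {
                            "CVE_product_name": "linux_kernel",
                            "CVE_version": {"CVE_version_data": [
                                {"CVE_version_value": "2.6.23",
                                 "CVE_version_affected": "<="}
                            ]},
                        }
                    ]},
                }
            ]}},
        }
    ]
}

product_list = []
for item in cve_data["CVE_Items"]:
    cve_id = item["CVE_data_meta"]["CVE_ID"]
    # Each "*_data" key holds a list, so loop over it instead of indexing by name.
    for vendor in item["CVE_affects"]["CVE_vendor"]["CVE_vendor_data"]:
        for product in vendor["CVE_product"]["CVE_product_data"]:
            for version in product["CVE_version"]["CVE_version_data"]:
                product_list.append((
                    cve_id,
                    vendor["CVE_vendor_name"],
                    product["CVE_product_name"],
                    version["CVE_version_value"],
                ))

print(product_list)
```

Each row in product_list is a flat tuple, which maps naturally onto one row of a database table.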

Python - Find value anywhere within JSON and return location

In Python I'm currently working with a very large JSON file with some deep dictionaries and arrays. My issue is that its structure is not consistent. The example below is essentially countries, with regions/states, cities, and suburbs. If there is only one suburb, it is returned as a dictionary, but if there is more than one, it is an array of dictionaries, so I have to add another line of code to go deeper. Sure, I can if/else and for-loop around it, but this is only a very small portion of the inconsistency, and it doesn't feel right to use if/else everywhere.
What I'd like to do is simply search anything within Belgium for the dictionary entry "code": "8400" and return its location within the JSON file. What would be my best approach to do something like this? Thanks!
***SNIP***
{
"code": "BE",
"name": "Belgium",
"regions": {
"region": [
{
"code": "45",
"name": "Flanders",
"places": {
"place": [
{
"code": "1790",
"name": "Affligem"
},
{
"code": "8570",
"name": "Anzegem"
},
{
"code": "8630",
"name": "Diksmuide"
},
{
"code": "9600",
"name": "Ronse"
}
]
},
"subregions": {
"subregion": [
{
"code": "46",
"name": "Coast",
"places": {
"place": [
{
"code": "8300",
"name": "Knokke-Heist"
},
{
"code": "8400",
"name": "Oostende",
"subplaces": {
"subplace": {
"code": "8450",
"name": "Bredene"
}
}
},
{
"code": "8420",
"name": "De Haan"
},
{
"code": "8430",
"name": "Middelkerke"
},
{
"code": "8434",
"name": "Westende-Bad"
},
{
"code": "8490",
"name": "Jabbeke"
},
{
"code": "8660",
"name": "De Panne"
},
{
"code": "8670",
"name": "Oostduinkerke"
}
]
}
},
{
"code": "47",
"name": "Cities",
"places": {
"place": [
{
"code": "1000",
"name": "Brussels"
},
{
"code": "2000",
"name": "Antwerp"
},
{
"code": "8000",
"name": "Bruges"
},
{
"code": "8340",
"name": "Damme"
},
{
"code": "9000",
"name": "Gent"
}
]
}
},
{
"code": "48",
"name": "Interior",
"places": {
"place": [
{
"code": "2260",
"name": "Westerlo"
},
{
"code": "2400",
"name": "Mol"
},
{
"code": "2590",
"name": "Berlaar"
},
{
"code": "8500",
"name": "Kortrijk",
"subplaces": {
"subplace": {
"code": "8940",
"name": "Wervik"
}
}
},
{
"code": "8610",
"name": "Handzame"
},
{
"code": "8755",
"name": "Ruiselede"
},
{
"code": "8900",
"name": "Ieper"
},
{
"code": "8970",
"name": "Poperinge"
}
]
}
},
EDIT:
I was asked to show how I'm currently getting through this JSON file. root is a dictionary containing the codes of the city/suburb I'm trying to search for. It doesn't define beforehand whether it is a city or a suburb. Below is my lazily coded search, written while I was learning how to dig through this JSON file, until I realized how complicated it was getting and got a bit stuck.
SNIP
for k in dataDict['countries']['country']:
    if k['code'] == root['country']:
        for y in k['regions']['region']['places']['place']:
            if y['code'] == root['place']:
                city = y['name']
            else:
                try:
                    for p in y['subplaces']['subplace']:
                        if p['code'] == root['place']:
                            city = p['name']
                except:
                    pass
If I understand correctly, each dictionary has the following structure:
{"code": # some code, e.g. "8400"
 "name": # some str
 none / "country" / "place" / whatever # some dict or list
}
You can write a recursive function that handles one and only one dict:
def foo(my_dict):
    if my_dict['code'] == root['place']:
        city = my_dict['name']
    elif "country" in my_dict:
        city = foo(my_dict['country'])
    elif "place" in my_dict:
        city = foo(my_dict['place'])
    # and so on for the other keys...
    else:
        city = None
    return city
Hope this example will help you.
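To cope with values that are sometimes a dict and sometimes a list of dicts, the recursion can treat both cases uniformly and record the path as it descends. The following is a sketch only: find_code is a hypothetical helper, and belgium is a heavily trimmed stand-in for the data in the question.

```python
def find_code(node, code, path=()):
    """Return (path, dict) for the first dict whose "code" equals code, else None."""
    if isinstance(node, dict):
        if node.get("code") == code:
            return list(path), node
        children = node.items()
    elif isinstance(node, list):
        children = enumerate(node)
    else:
        return None  # scalars cannot contain a match
    for key, child in children:
        found = find_code(child, code, path + (key,))
        if found is not None:
            return found
    return None


# Trimmed stand-in: "place" is a list, "subplace" is a single dict.
belgium = {
    "code": "BE",
    "name": "Belgium",
    "regions": {"region": [{
        "code": "45",
        "name": "Flanders",
        "places": {"place": [
            {"code": "8400", "name": "Oostende",
             "subplaces": {"subplace": {"code": "8450", "name": "Bredene"}}},
        ]},
    }]},
}

path, place = find_code(belgium, "8400")
print(path, place["name"])

sub_path, subplace = find_code(belgium, "8450")
print(sub_path, subplace["name"])
```

The same call finds a city or a suburb, so the caller does not need to know in advance at which level the code lives.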
