Extracting values from deeply nested JSON structures

This is a structure I'm getting from elsewhere, that is, a list of deeply nested dictionaries:
{
"foo_code": 404,
"foo_rbody": {
"query": {
"info": {
"acme_no": "444444",
"road_runner": "123"
},
"error": "no_lunch",
"message": "runner problem."
}
},
"acme_no": "444444",
"road_runner": "123",
"xyzzy_code": 200,
"xyzzy_rbody": {
"api": {
"items": [
{
"desc": "OK",
"id": 198,
"acme_no": "789",
"road_runner": "123",
"params": {
"bicycle": "2wheel",
"willie": "hungry",
"height": "1",
"coyote_id": "1511111"
},
"activity": "TRAP",
"state": "active",
"status": 200,
"type": "chase"
}
]
}
}
}
{
"foo_code": 200,
"foo_rbody": {
"query": {
"result": {
"acme_no": "260060730303258",
"road_runner": "123",
"abyss": "26843545600"
}
}
},
"acme_no": "260060730303258",
"road_runner": "123",
"xyzzy_code": 200,
"xyzzy_rbody": {
"api": {
"items": [
{
"desc": "OK",
"id": 198,
"acme_no": "789",
"road_runner": "123",
"params": {
"bicycle": "2wheel",
"willie": "hungry",
"height": "1",
"coyote_id": "1511111"
},
"activity": "TRAP",
"state": "active",
"status": 200,
"type": "chase"
}
]
}
}
}
Asking for different structures is out of the question (legacy APIs, etc.).
So I'm wondering if there's some clever way of extracting selected values from such a structure.
The candidates I was thinking of:
Flatten particular dictionaries, building composite keys, something like:
{
"foo_rbody.query.info.acme_no": "444444",
"foo_rbody.query.info.road_runner": "123",
...
}
Pro: every value is available with a single lookup, and if an expected key is missing, it means the corresponding structure was not there (as you might have noticed, the dictionaries have different structures depending on whether the operation succeeded, an error occurred, etc.).
Con: what to do with lists?
Use a recursive function that does successive key lookups, say by "foo_rbody", then by "query", "info", etc.
Any better candidates?
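The first candidate (flattening with composite keys) can be sketched as follows. This is only an illustration, not a definitive implementation: the function name `flatten` and the `[index]` notation for list positions are assumptions, chosen as one possible answer to the "what to do with lists?" concern.

```python
def flatten(node, prefix=""):
    """Flatten nested dicts and lists into one dict with composite keys.

    Dict keys are joined with '.', list positions are written as '[i]'.
    Empty dicts and lists produce no entries at all.
    """
    if isinstance(node, dict):
        children = (
            (f"{prefix}.{key}" if prefix else str(key), value)
            for key, value in node.items()
        )
    elif isinstance(node, list):
        children = ((f"{prefix}[{i}]", value) for i, value in enumerate(node))
    else:
        # Leaf value: store it under the path built so far.
        return {prefix: node}

    flat = {}
    for path, value in children:
        flat.update(flatten(value, path))
    return flat


# Trimmed-down version of the structure from the question.
data = {
    "foo_code": 404,
    "foo_rbody": {"query": {"info": {"acme_no": "444444"}}},
    "xyzzy_rbody": {"api": {"items": [{"params": {"bicycle": "2wheel"}}]}},
}
flat = flatten(data)
print(flat["foo_rbody.query.info.acme_no"])        # 444444
print(flat.get("foo_rbody.query.result.acme_no"))  # None, structure absent
```

Using `dict.get` on the flattened result gives the "missing key means missing structure" behaviour described above without raising an exception.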

You can try this rather trivial function to access nested properties:
import re

def get_path(dct, path):
    for i, p in re.findall(r'(\d+)|(\w+)', path):
        dct = dct[p or int(i)]
    return dct
Usage:
value = get_path(data, "xyzzy_rbody.api.items[0].params.bicycle")

Maybe the function byPath in my answer to this post might help you.

You could create your own path mechanism and then query the complicated dict with paths. Example:
/ : get the root object
/key: get the value of root_object['key'], e.g. /foo_code --> 404
/key/key: nesting: /foo_rbody/query/info/acme_no -> 444444
/key[i]: get ith element of that list, e.g. /xyzzy_rbody/api/items[0]/desc --> "OK"
The path can also return a dict which you then run more queries on, etc.
It would be fairly easy to implement recursively.
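A minimal sketch of such a path mechanism, under a few assumptions: the names `query` and `_STEP` are hypothetical, the lookup is written iteratively rather than recursively, malformed paths are not validated, and a `default` is returned when any step of the path is missing (which suits structures that vary between success and error responses).

```python
import re

# One step is either a key (anything except '/', '[' and ']') or an index '[i]'.
_STEP = re.compile(r"([^/\[\]]+)|\[(\d+)\]")


def query(obj, path, default=None):
    """Resolve a path such as '/xyzzy_rbody/api/items[0]/desc' against obj."""
    try:
        for key, index in _STEP.findall(path):
            obj = obj[int(index)] if index else obj[key]
    except (KeyError, IndexError, TypeError):
        # A key, list position, or intermediate container was missing.
        return default
    return obj


data = {
    "foo_code": 404,
    "foo_rbody": {"query": {"info": {"acme_no": "444444"}}},
    "xyzzy_rbody": {"api": {"items": [{"desc": "OK"}]}},
}
print(query(data, "/foo_code"))                         # 404
print(query(data, "/foo_rbody/query/info/acme_no"))     # 444444
print(query(data, "/xyzzy_rbody/api/items[0]/desc"))    # OK
print(query(data, "/foo_rbody/query/result/acme_no"))   # None
```

Because a path can also resolve to a dict or list, the result of one call can be passed back into `query` for further lookups.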

I can think of two more solutions:
You can try the Pynq package, described here: a structured query language for JSON (in Python). As far as I understand, it's a kind of LINQ for Python.
You could also convert your JSON to XML and then use the XQuery language to get data from it (XQuery library under Python).

Related

Python function to extract specific values from complex JSON logs data

I am trying to write a Python function (for use in a Google Cloud Function) that extracts specific values from JSON logs data. Ordinarily, I do this using the standard method of sorting through keys:
my_data['key1'], etc.
This JSON data, however, is quite different, since it appears to have the data I need as lists inside of dictionaries. Here is a sample of the logs data:
{
"insertId": "-mgv16adfcja",
"logName": "projects/my_project/logs/cloudaudit.googleapis.com%2Factivity",
"protoPayload": {
"@type": "type.googleapis.com/google.cloud.audit.AuditLog",
"authenticationInfo": {
"principalEmail": "email@email.com"
},
"authorizationInfo": [{
"granted": true,
"permission": "resourcemanager.projects.setIamPolicy",
"resource": "projects/my_project",
"resourceAttributes": {
"name": "projects/my_project",
"service": "cloudresourcemanager.googleapis.com",
"type": "cloudresourcemanager.googleapis.com/Project"
}
},
{
"granted": true,
"permission": "resourcemanager.projects.setIamPolicy",
"resource": "projects/my_project",
"resourceAttributes": {
"name": "projects/my_project",
"service": "cloudresourcemanager.googleapis.com",
"type": "cloudresourcemanager.googleapis.com/Project"
}
}
],
"methodName": "SetIamPolicy",
"request": {
"@type": "type.SetIamPolicyRequest",
"policy": {
"bindings": [{
"members": [
"serviceAccount:my-test-sa@my_project.iam.gserviceaccount.com"
],
"role": "projects/my_project/roles/PubBuckets"
},
{
"members": [
"serviceAccount:my-test-sa-2@my_project.iam.gserviceaccount.com"
],
"role": "roles/owner"
},
{
"members": [
"serviceAccount:my-test-sa-3@my_project.iam.gserviceaccount.com",
"serviceAccount:my-test-sa-4@my_project.iam.gserviceaccount.com"
]
}
My goal with this data is to extract the "role":"roles/editor" and the associated "members." So in this case, I would like to extract service accounts my-test-sa-3, 4, and 5, and print them.
When the JSON enters my cloud function I do the following:
pubsub_message = base64.b64decode(event['data']).decode('utf-8')
msg = json.loads(pubsub_message)
print(msg)
And I can get to other data that I need, e.g., the project ID:
proj_id = msg['resource']['labels']['project_id']
But I cannot get into the lists within the dictionaries effectively. The deepest I can currently get is to the 'bindings' key.
I have additionally tried restructuring and flattening output as a list:
policy_request = credentials.projects().getIamPolicy(resource=proj_id, body={})
policy_response = policy_request.execute()
my_bindings = policy_response['bindings']
flat_list = []
for element in my_bindings:
    if type(element) is list:
        for item in element:
            flat_list.append(item)
    else:
        flat_list.append(element)
print('Here is flat_list: ', flat_list)
I then use an if statement to search the list, which returns nothing. I can't use indices, because the output changes constantly, so I need a solution that can extract the values with a key/value approach if at all possible.
Expected Output:
Role: roles/editor
Members:
sa-1@gcloud.com
sa2@gcloud.com
sa3@gcloud.com
and so on
Appreciate any help.
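One way to avoid indices is to loop over the bindings and compare each binding's "role" value. The sketch below assumes the decoded message has the shape shown in the sample (protoPayload, then request, then policy, then bindings) and that each binding is a dict with optional "role" and "members" keys; the function name `members_for_role` and the shortened sample values are hypothetical.

```python
def members_for_role(msg, role):
    """Collect the members of every binding whose role equals `role`."""
    bindings = (
        msg.get("protoPayload", {})
        .get("request", {})
        .get("policy", {})
        .get("bindings", [])
    )
    members = []
    for binding in bindings:
        # .get() tolerates bindings that lack a "role" or "members" key.
        if binding.get("role") == role:
            members.extend(binding.get("members", []))
    return members


# Shortened stand-in for the decoded Pub/Sub message.
msg = {
    "protoPayload": {
        "request": {
            "policy": {
                "bindings": [
                    {"role": "roles/owner", "members": ["serviceAccount:sa-2"]},
                    {"role": "roles/editor",
                     "members": ["serviceAccount:sa-3", "serviceAccount:sa-4"]},
                ]
            }
        }
    }
}

print("Role: roles/editor")
print("Members:")
for member in members_for_role(msg, "roles/editor"):
    print(member)
```

Matching on the role value rather than on a list position keeps the lookup stable when the order or number of bindings changes between messages.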

Validation Json Schema

I am trying to validate JSON for required fields using Python. I am currently doing it manually, by iterating through the JSON and reading it. However, I am looking for a library or a more generic solution that handles all scenarios.
For example, I want to check whether a particular attribute is present in every item of a list.
Here is the sample json which I am trying to validate.
{
"service": {
"refNumber": "abc",
"item": [{
"itemnumber": "1",
"itemloc": "uk"
}, {
"itemnumber": "2",
"itemloc": "us"
}]
}
}
I want to validate if I have refNumber and itemnumber in all the list items.
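For reference, the manual check described above might look roughly like the sketch below. It assumes the document has already been parsed into Python objects (for example with `json.loads`) and follows the sample's layout; the function name `missing_fields` and the path strings it reports are hypothetical.

```python
def missing_fields(doc):
    """Return readable paths of required fields that are absent from doc."""
    missing = []
    service = doc.get("service", {})
    if "refNumber" not in service:
        missing.append("service.refNumber")
    # Every entry of the "item" list must carry an "itemnumber".
    for position, item in enumerate(service.get("item", [])):
        if "itemnumber" not in item:
            missing.append(f"service.item[{position}].itemnumber")
    return missing


sample = {
    "service": {
        "refNumber": "abc",
        "item": [
            {"itemnumber": "1", "itemloc": "uk"},
            {"itemloc": "us"},  # itemnumber deliberately left out
        ],
    }
}
print(missing_fields(sample))  # ['service.item[1].itemnumber']
```

A hand-written check like this has to be extended for every new field, which is the motivation for a schema-based approach.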

A JSON Schema is a way to define the structure of JSON.
There are accompanying Python packages that can validate JSON against a JSON Schema (for example, jsonschema).
The JSON Schema for your example would look approximately like this:
{
"type": "object",
"properties": {
"service": {
"type": "object",
"properties": {
"refNumber": {
"type": "string"
},
"item": {
"type": "array",
"items": {
"type": "object",
"properties": {
"itemnumber": {
"type": "string"
},
"itemloc": {
"type": "string"
}
}
}
}
}
}
}
}
i.e., an object containing service, which itself contains a refNumber and a list of items.

Since I don't have enough rep to add a comment, I will post this as an answer.
First, I have to say that I don't program in Python.
According to my Google search, a jsonschema module is available for Python.
from jsonschema import validate

schema = {
    "type": "object",
    "properties": {
        "service": {
            "type": "object",
            "properties": {
                "refNumber": {"type": "string"},
                "item": {
                    "type": "array",
                    # every list item must contain itemnumber
                    "items": {"type": "object", "required": ["itemnumber"]},
                },
            },
            "required": ["refNumber"],
        },
    },
    "required": ["service"],
}
# Raises jsonschema.ValidationError if a required field is missing.
validate(instance=yourJSON, schema=schema)
This example is not tested, but it should give you the idea.
Link to jsonschema docs

How to restructure a collection in MongoDB

I'm looking to restructure my MongoDB collection and haven't been able to do so. I'm quite new to it and looking for some help. I'm struggling to access and move the data within the "itemsList" field.
My collection documents are currently structured like this:
{
"_id": 1,
"pageName": "List of Fruit",
"itemsList":[
{
"myID": 101,
"itemName": "Apple"
},
{
"myID": 102,
"itemName": "Orange"
}
]
},
{
"_id": 2,
"pageName": "List of Computers",
"itemsList":[
{
"myID": 201,
"itemName": "MacBook"
},
{
"myID": 202,
"itemName": "Desktop"
}
]
}
The end result
But I would like the data to be restructured so that each "itemName" value becomes its own document.
I would also like to change the name of "myID" to "itemID".
And save the new documents to another collection.
{
"_id": 1,
"itemName": "Apple",
"itemID": 101,
"pageName": "List of Fruit"
},
{
"_id": 2,
"itemName": "Orange",
"itemID": 102,
"pageName": "List of Fruit"
},
{
"_id": 3,
"itemName": "MacBook",
"itemID": 201,
"pageName": "List of Computers"
},
{
"_id": 4,
"itemName": "Desktop",
"itemID": 202,
"pageName": "List of Computers"
}
What I've tried
I have tried using MongoDB's aggregate functionality, but because there are multiple "itemName" fields in each document, it will add both of them to one Array - instead of one in each document.
db.collection.aggregate([
{$Project:{
itemName: "$itemsList.itemName",
itemID: "$itemsList.otherID",
pageName: "$pageName"
}},
{$out: "myNewCollection"}
])
I've also tried using PyMongo 3.x to loop through the document's fields and save as a new document, but haven't been successful.
Ways to implement it
I'm open to using MongoDB's aggregate functionality, if it can move these items to their own documents, or a Python script (3.x) - or any other means you think can help.
Thanks in advance for your help!

You just need a $unwind to "break" the array apart. Then you can do some data wrangling and output to your collection.
Note that you didn't specify the exact requirement for the _id, so you might need some extra handling. The demonstration below uses the native _id generation, which automatically assigns ObjectIds.
db.collection.aggregate([
{
"$unwind": "$itemsList"
},
{
"$project": {
"_id": 0,
"itemName": "$itemsList.itemName",
"itemID": "$itemsList.myID",
"pageName": "$pageName"
}
},
{
$out: "myNewCollection"
}
])
Here is the Mongo playground for your reference.
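Since a Python script was mentioned as an alternative, the same reshaping can be sketched in plain Python on documents that have already been loaded as dicts. This mirrors what the $unwind and $project stages do; it is not the aggregation pipeline itself, and the function name `unwind_items` is hypothetical.

```python
def unwind_items(pages):
    """Yield one flat document per entry of each page's itemsList."""
    for page in pages:
        for item in page.get("itemsList", []):
            yield {
                "itemName": item["itemName"],
                "itemID": item["myID"],        # myID renamed to itemID
                "pageName": page["pageName"],  # copied from the parent page
            }


pages = [
    {"_id": 1, "pageName": "List of Fruit",
     "itemsList": [{"myID": 101, "itemName": "Apple"},
                   {"myID": 102, "itemName": "Orange"}]},
    {"_id": 2, "pageName": "List of Computers",
     "itemsList": [{"myID": 201, "itemName": "MacBook"},
                   {"myID": 202, "itemName": "Desktop"}]},
]

new_docs = list(unwind_items(pages))
for doc in new_docs:
    print(doc)
```

As in the pipeline above, no _id is set here, so inserting `new_docs` into the new collection (for example with PyMongo's `insert_many`) would let the server assign ObjectIds.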

Dereference JSON from multiple files

I am using python to dereference JSON from two/more files.
Something like below,
content of file1 (Primary.json):
{
"$schema": "http://json-schema.org/draft-07/schema#",
"$id": "https://hcmdevblobsa.blob.core.windows.net/order/v2/order.json",
"title": "LCBO order schema",
"description": "Canonical order structure describing various order types",
"definitions": {
"addressDetail": {
"description": "Address Information",
"type": "object",
"properties": {
"name": {
"type": ["string","null"],
"minLength": 1,
"maxLength": 64
},
"age": {
"$ref": "secondary.json#/properties/age"
}
}
}
}
}
And the content of file2 (Secondary.json):
{
"$schema": "http://json-schema.org/draft-07/schema#",
"$id": "https://hcmdevblobsa.blob.core.windows.net/enumerations.json",
"title": "Enumerations used for JSON schemas",
"description": "Catalog of allowed values for schema properties",
"properties": {
"age": {
"type": "integer"
}
}
}
My idea is to use jsonref. I tried this library from the answer to
json reference extraction in python
but in that case the reference points into the same file, like this:
json_str = """{"real": [1, 2, 3, 4], "ref": {"$ref": "#/real"}}"""
data = jsonref.loads(json_str)
but in my case, the reference is in another file. So I tried to merge the two files with jsonmerge and then use jsonref, with the code below:
import jsonmerge
import jsonref
import pprint
head = open('AltPrimary.json')
tail = open('Secondary.json')
result = jsonmerge.merge(head,tail)
final = jsonref.loads(data3)
This errors out in jsonref.loads because it doesn't know that the second part (after merging) comes from 'Secondary.json'. So it fails while reading:
"$ref": "secondary.json#/properties/age"
I also tried wrapping the content of 'Secondary.json' under a 'Secondary' key, like:
"Secondary" :{
"$schema": "http://json-schema.org/draft-07/schema#",
"$id": "https://hcmdevblobsa.blob.core.windows.net/enumerations.json",
"title": "Enumerations used for JSON schemas",
"description": "Catalog of allowed values for schema properties",
"properties": {
"age": {
"type": "integer"
}
}
}
but it failed during validation. In the real world I may need to dereference from multiple files, like below:
"person": {
"$ref": "schemas/people/Bruce-Wayne.json"
},
"place": {
"$ref": "schemas/places.yaml#/definitions/Gotham-City"
},
It would be helpful if anyone has any thoughts on this. Many thanks.
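One library-free direction is to resolve the references by hand: split each "$ref" at "#", load the named file relative to the referring file's directory, and follow the JSON Pointer part. The sketch below uses only the standard library and makes several assumptions: the names `resolve_pointer` and `dereference` are hypothetical, only local JSON files are handled (no URLs, no YAML), sibling keys next to "$ref" are dropped, and circular references are not detected. File names are matched exactly as written, so "secondary.json" and "Secondary.json" differ on case-sensitive file systems.

```python
import json
from pathlib import Path


def resolve_pointer(doc, pointer):
    """Follow a JSON Pointer fragment such as '/properties/age' inside doc."""
    for token in pointer.split("/"):
        if not token:
            continue  # skip empty pieces, e.g. the one before the leading '/'
        token = token.replace("~1", "/").replace("~0", "~")
        doc = doc[int(token)] if isinstance(doc, list) else doc[token]
    return doc


def dereference(node, base_dir, root, cache=None):
    """Return a copy of node with each {"$ref": ...} replaced by its target.

    base_dir: directory used to resolve relative file names in "$ref".
    root: the document that a bare "#/..." reference points into.
    """
    cache = {} if cache is None else cache
    if isinstance(node, list):
        return [dereference(item, base_dir, root, cache) for item in node]
    if not isinstance(node, dict):
        return node
    ref = node.get("$ref")
    if isinstance(ref, str):
        file_part, _, pointer = ref.partition("#")
        if file_part:
            # Reference into another file: load it once and switch context.
            path = Path(base_dir) / file_part
            if path not in cache:
                cache[path] = json.loads(path.read_text(encoding="utf-8"))
            root, base_dir = cache[path], path.parent
        target = resolve_pointer(root, pointer)
        # The target may itself contain references, so resolve it as well.
        return dereference(target, base_dir, root, cache)
    return {k: dereference(v, base_dir, root, cache) for k, v in node.items()}


# Possible usage, assuming both files sit in the current directory:
# primary = json.loads(Path("Primary.json").read_text(encoding="utf-8"))
# resolved = dereference(primary, ".", primary)
```

Keeping the referring file's directory as `base_dir` is what lets nested references such as "schemas/people/..." resolve relative to the file that mentions them.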

Issues decoding Collections+JSON in Python

I've been trying to decode a JSON response in Collections+JSON format using Python for a while now but I can't seem to overcome a small issue.
First of all, here is the JSON response:
{
"collection": {
"href": "http://localhost:8000/social/messages-api/",
"items": [
{
"data": [
{
"name": "messageID",
"value": 19
},
{
"name": "author",
"value": "mike"
},
{
"name": "recipient",
"value": "dan"
},
{
"name": "pm",
"value": "0"
},
{
"name": "time",
"value": "2015-03-31T15:04:01.165060Z"
},
{
"name": "text",
"value": "first message"
}
]
}
],
"version": "1.0",
"links": []
}
}
And here is how I am attempting to extract data:
response = urllib2.urlopen('myurl')
responseData = response.read()
jsonData = json.loads(responseData)
test = jsonData['collection']['items']['data']
When I run this code I get the error:
list indices must be integers, not str
If I use an integer, e.g. 0, instead of a string it merely shows 'data' instead of any useful information, unlike if I were to simply output 'items'. Similarly, I can't seem to access the data within a data child, for example:
test = jsonData['collection']['items'][0]['name']
This will argue that there is no element called 'name'.
What is the proper method of accessing JSON data in this situation? I would also like to iterate over the collection, if that helps.
I'm aware of a package that can be used to simplify working with Collections+JSON in Python, collection-json, but I'd rather be able to do this without using such a package.
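Both errors come from the shape of the response: "items" is a list, and each item's "data" is again a list of name/value objects, so an integer index (or a loop) is needed at both levels, e.g. jsonData['collection']['items'][0]['data'][0]['value']. A minimal sketch of iterating over the collection without extra packages follows; the helper names `item_to_dict` and `collection_items` are hypothetical, and the sample is a shortened copy of the response above.

```python
def item_to_dict(item):
    """Turn one Collection+JSON item into a plain {name: value} dict."""
    return {field["name"]: field.get("value") for field in item["data"]}


def collection_items(json_data):
    """Return every item of the collection as a {name: value} dict."""
    return [item_to_dict(item) for item in json_data["collection"]["items"]]


json_data = {
    "collection": {
        "items": [
            {"data": [
                {"name": "messageID", "value": 19},
                {"name": "author", "value": "mike"},
                {"name": "text", "value": "first message"},
            ]}
        ],
        "version": "1.0",
        "links": [],
    }
}

for message in collection_items(json_data):
    print(message["author"], message["text"])  # mike first message
```

Converting each "data" list into a dict once makes later lookups by field name straightforward, instead of scanning the name/value pairs every time.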
