Python function to extract specific values from complex JSON logs data

Python function to extract specific values from complex JSON logs data - python

I am trying to write a Python function (for use in a Google Cloud Function) that extracts specific values from JSON logs data. Ordinarily, I do this using the standard method of sorting through keys:
my_data['key1'], etc.
This JSON data, however is quite different, since it appears to have the data I need as lists inside of dictionaries. Here is a sample of the logs data:
{
"insertId": "-mgv16adfcja",
"logName": "projects/my_project/logs/cloudaudit.googleapis.com%2Factivity",
"protoPayload": {
"#type": "type.googleapis.com/google.cloud.audit.AuditLog",
"authenticationInfo": {
"principalEmail": "email#email.com"
},
"authorizationInfo": [{
"granted": true,
"permission": "resourcemanager.projects.setIamPolicy",
"resource": "projects/my_project",
"resourceAttributes": {
"name": "projects/my_project",
"service": "cloudresourcemanager.googleapis.com",
"type": "cloudresourcemanager.googleapis.com/Project"
}
},
{
"granted": true,
"permission": "resourcemanager.projects.setIamPolicy",
"resource": "projects/my_project",
"resourceAttributes": {
"name": "projects/my_project",
"service": "cloudresourcemanager.googleapis.com",
"type": "cloudresourcemanager.googleapis.com/Project"
}
}
],
"methodName": "SetIamPolicy",
"request": {
"#type": "type.SetIamPolicyRequest",
"policy": {
"bindings": [{
"members": [
"serviceAccount:my-test-
sa #my_project.iam.gserviceaccount.com "
],
"role": "projects/my_project/roles/PubBuckets"
},
{
"members": [
"serviceAccount:my-test-sa-
2 #my_project.iam.gserviceaccount.com "
],
"role": "roles/owner"
},
{
"members": [
"serviceAccount:my-test-sa-3#my_project.iam.gserviceaccount.com",
"serviceAccount:my-test-sa-4#my_project.iam.gserviceaccount.com"
]
}
My goal with this data is to extract the "role":"roles/editor" and the associated "members." So in this case, I would like to extract service accounts my-test-sa-3, 4, and 5, and print them.
When the JSON enters my cloud function I do the following:
pubsub_message = base64.b64decode(event['data']).decode('utf-8')
msg = json.loads(pubsub_message)
print(msg)
And I can get to other data that I need, e.g., project id-
proj_id = msg['resource']['labels']['project_id']
But I cannot get into the lists within the dictionaries effectively. The deepest I can currently get is to the 'bindings' key.
I have additionally tried restructuring and flattening output as a list:
policy_request =credentials.projects().getIamPolicy(resource=proj_id, body={})
policy_response = policy_request.execute()
my_bindings = policy_response['bindings']
flat_list = []
for element in my_bindings:
if type(element) is list:
for item in element:
flat_list.append(item)
else:
flat_list.append(element)
print('Here is flat_list: ', flat_list)
I then use an if statement to search the list, which returns nothing. I can't use indices, because the output will change consistently, so I need a solution that can extract the values by a key, value approach if at all possible.
Expected Output:
Role: roles/editor
Members:
sa-1#gcloud.com
sa2#gcloud.com
sa3#gcloud.com
and so on
Appreciate any help.

Related

check if a property contains a specific value in a document with pymongo

I have a collection of documents that looks like this
{
"_id": "4",
"contacts": [
{
"email": "mail#mail.com",
"name": "A1",
"phone": "00",
"crashNotificationEnabled": false,
"locationShared": true,
"creationDate": ISODate("2020-10-19T15:19:04.498Z")
},
{
"email": "mail#mail.com",
"name": "AG2",
"phone": "00",
"crashNotificationEnabled": false,
"locationShared": false,
"creationDate": ISODate("2020-10-19T15:19:04.498Z")
}
],
"creationDate": ISODate("2020-10-19T15:19:04.498Z"),
"_class": ".model.UserContacts"
}
And i would like to iterate through all documents to check if either crashNotificationEnabled or locationShared is true and add +1 to a counter if its the case, im quite new to python and mongosql so i actually have a hard time trying to do that, i tried a lot of things but there is my last try :
def users_with_guardian_angel(mongoclient):
try:
mydb = mongoclient["main"]
userContacts = mydb["userContacts"]
users = userContacts.find()
for user in users:
result = userContacts.find_one({contacts : { $in: [true]}})
if result:
count_users = count_users + 1
print(f"{count_users} have at least one notificiation enabled")
But the result variable stays empty all the time, so if somebody could help me to accomplish what i want to do and tell what i did wrong here ?
Thanks !

Here's one way you could do it by letting the MongoDB server do all the work.
N.B.: This doesn't consider the possibility of multiple entries of the same user.
db.userContacts.aggregate([
{
"$unwind": "$contacts"
},
{
"$match": {
"$expr": {
"$or": [
"$contacts.crashNotificationEnabled",
"$contacts.locationShared"
]
}
}
},
{
"$count": "userCountWithNotificationsEnabled"
}
])
Try it on mongoplayground.net.
Example output:
[
{
"userCountWithNotificationsEnabled": 436
}
]

Get "path" of parent keys and indices in dictionary of nested dictionaries and lists

I am receiving a large json from Google Assistant and I want to retrieve some specific details from it. The json is the following:
{
"responseId": "************************",
"queryResult": {
"queryText": "actions_intent_DELIVERY_ADDRESS",
"action": "delivery",
"parameters": {},
"allRequiredParamsPresent": true,
"fulfillmentMessages": [
{
"text": {
"text": [
""
]
}
}
],
"outputContexts": [
{
"name": "************************/agent/sessions/1527070836044/contexts/actions_capability_screen_output"
},
{
"name": "************************/agent/sessions/1527070836044/contexts/more",
"parameters": {
"polar": "no",
"polar.original": "No",
"cardinal": 2,
"cardinal.original": "2"
}
},
{
"name": "************************/agent/sessions/1527070836044/contexts/actions_capability_audio_output"
},
{
"name": "************************/agent/sessions/1527070836044/contexts/actions_capability_media_response_audio"
},
{
"name": "************************/agent/sessions/1527070836044/contexts/actions_intent_delivery_address",
"parameters": {
"DELIVERY_ADDRESS_VALUE": {
"userDecision": "ACCEPTED",
"#type": "type.googleapis.com/google.actions.v2.DeliveryAddressValue",
"location": {
"postalAddress": {
"regionCode": "US",
"recipients": [
"Amazon"
],
"postalCode": "NY 10001",
"locality": "New York",
"addressLines": [
"450 West 33rd Street"
]
},
"phoneNumber": "+1 206-266-2992"
}
}
}
},
{
"name": "************************/agent/sessions/1527070836044/contexts/actions_capability_web_browser"
}
],
"intent": {
"name": "************************/agent/intents/86fb2293-7ae9-4bed-adeb-6dfe8797e5ff",
"displayName": "Delivery"
},
"intentDetectionConfidence": 1,
"diagnosticInfo": {},
"languageCode": "en-gb"
},
"originalDetectIntentRequest": {
"source": "google",
"version": "2",
"payload": {
"isInSandbox": true,
"surface": {
"capabilities": [
{
"name": "actions.capability.MEDIA_RESPONSE_AUDIO"
},
{
"name": "actions.capability.SCREEN_OUTPUT"
},
{
"name": "actions.capability.AUDIO_OUTPUT"
},
{
"name": "actions.capability.WEB_BROWSER"
}
]
},
"inputs": [
{
"rawInputs": [
{
"query": "450 West 33rd Street"
}
],
"arguments": [
{
"extension": {
"userDecision": "ACCEPTED",
"#type": "type.googleapis.com/google.actions.v2.DeliveryAddressValue",
"location": {
"postalAddress": {
"regionCode": "US",
"recipients": [
"Amazon"
],
"postalCode": "NY 10001",
"locality": "New York",
"addressLines": [
"450 West 33rd Street"
]
},
"phoneNumber": "+1 206-266-2992"
}
},
"name": "DELIVERY_ADDRESS_VALUE"
}
],
"intent": "actions.intent.DELIVERY_ADDRESS"
}
],
"user": {
"lastSeen": "2018-05-23T10:20:25Z",
"locale": "en-GB",
"userId": "************************"
},
"conversation": {
"conversationId": "************************",
"type": "ACTIVE",
"conversationToken": "[\"more\"]"
},
"availableSurfaces": [
{
"capabilities": [
{
"name": "actions.capability.SCREEN_OUTPUT"
},
{
"name": "actions.capability.AUDIO_OUTPUT"
},
{
"name": "actions.capability.WEB_BROWSER"
}
]
}
]
}
},
"session": "************************/agent/sessions/1527070836044"
}
This large json returns amongst other things to my back-end the delivery address details of the user (here I use Amazon's NY locations details as an example). Therefore, I want to retrieve the location dictionary which is near the end of this large json. The location details appear also near the start of this json but I want to retrieve specifically the second location dictionary which is near the end of this large json.
For this reason, I had to read through this json by myself and manually test some possible "paths" of the location dictionary within this large json to find out finally that I had to write the following line to retrieve the second location dictionary:
location = json['originalDetectIntentRequest']['payload']['inputs'][0]['arguments'][0]['extension']['location']
Therefore, my question is the following: is there any concise way to retrieve automatically the "path" of the parent keys and indices of the second location dictionary within this large json?
Hence, I expect that the general format of the output from a function which does this for all the occurrences of the location dictionary in any json will be the following:
[["path" of first `location` dictionary], ["path" of second `location` dictionary], ["path" of third `location` dictionary], ...]
where for the json above it will be
[["path" of first `location` dictionary], ["path" of second `location` dictionary]]
as there are two occurrences of the location dictionary with
["path" of second `location` dictionary] = ['originalDetectIntentRequest', 'payload', 'inputs', 0, 'arguments', 0, 'extension', 'location']
I have in my mind relevant posts on StackOverflow (Python--Finding Parent Keys for a specific value in a nested dictionary) but I am not sure that these apply exactly to my problem since these are for parent keys in nested dictionaries whereas here I am talking about the parent keys and indices in dictionary with nested dictionaries and lists.

I solved this by using recursive search
# result and path should be outside of the scope of find_path to persist values during recursive calls to the function
result = []
path = []
from copy import copy
# i is the index of the list that dict_obj is part of
def find_path(dict_obj,key,i=None):
for k,v in dict_obj.items():
# add key to path
path.append(k)
if isinstance(v,dict):
# continue searching
find_path(v, key,i)
if isinstance(v,list):
# search through list of dictionaries
for i,item in enumerate(v):
# add the index of list that item dict is part of, to path
path.append(i)
if isinstance(item,dict):
# continue searching in item dict
find_path(item, key,i)
# if reached here, the last added index was incorrect, so removed
path.pop()
if k == key:
# add path to our result
result.append(copy(path))
# remove the key added in the first line
if path != []:
path.pop()
# default starting index is set to None
find_path(di,"location")
print(result)
# [['queryResult', 'outputContexts', 4, 'parameters', 'DELIVERY_ADDRESS_VALUE', 'location'], ['originalDetectIntentRequest', 'payload', 'inputs', 0, 'arguments', 0, 'extension', 'location']]

Elastic Search: including #/hashtags in search results

Using elastic search's query DSL this is how I am currently constructing my query:
elastic_sort = [
{ "timestamp": {"order": "desc" }},
"_score",
{ "name": { "order": "desc" }},
{ "channel": { "order": "desc" }},
]
elastic_query = {
"fuzzy_like_this" : {
"fields" : [ "msgs.channel", "msgs.msg", "msgs.name" ],
"like_text" : search_string,
"max_query_terms" : 10,
"fuzziness": 0.7,
}
}
res = self.es.search(index="chat", body={
"from" : from_result, "size" : results_per_page,
"track_scores": True,
"query": elastic_query,
"sort": elastic_sort,
})
I've been trying to implement a filter or an analyzer that will allow the inclusion of "#" in searches (I want a search for "#thing" to return results that include "#thing"), but I am coming up short. The error messages I am getting are not helpful and just telling me that my query is malformed.
I attempted to incorporate the method found here : http://www.fullscale.co/blog/2013/03/04/preserving_specific_characters_during_tokenizing_in_elasticsearch.html but it doesn't make any sense to me in context.
Does anyone have a clue how I can do this?

Did you create a mapping for you index? You can specify within your mapping to not analyze certain fields.
For example, a tweet mapping can be something like:
"tweet": {
"properties": {
"id": {
"type": "long"
},
"msg": {
"type": "string"
},
"hashtags": {
"type": "string",
"index": "not_analyzed"
}
}
}
You can then perform a term query on "hashtags" for an exact string match, including "#" character.
If you want "hashtags" to be tokenized as well, you can always create a multi-field for "hashtags".

Issues decoding Collections+JSON in Python

I've been trying to decode a JSON response in Collections+JSON format using Python for a while now but I can't seem to overcome a small issue.
First of all, here is the JSON response:
{
"collection": {
"href": "http://localhost:8000/social/messages-api/",
"items": [
{
"data": [
{
"name": "messageID",
"value": 19
},
{
"name": "author",
"value": "mike"
},
{
"name": "recipient",
"value": "dan"
},
{
"name": "pm",
"value": "0"
},
{
"name": "time",
"value": "2015-03-31T15:04:01.165060Z"
},
{
"name": "text",
"value": "first message"
}
]
}
],
"version": "1.0",
"links": []
}
}
And here is how I am attempting to extract data:
response = urllib2.urlopen('myurl')
responseData = response.read()
jsonData = json.loads(responseData)
test = jsonData['collection']['items']['data']
When I run this code I get the error:
list indices must be integers, not str
If I use an integer, e.g. 0, instead of a string it merely shows 'data' instead of any useful information, unlike if I were to simply output 'items'. Similarly, I can't seem to access the data within a data child, for example:
test = jsonData['collection']['items'][0]['name']
This will argue that there is no element called 'name'.
What is the proper method of accessing JSON data in this situation? I would also like to iterate over the collection, if that helps.
I'm aware of a package that can be used to simplify working with Collections+JSON in Python, collection-json, but I'd rather be able to do this without using such a package.

Extracting values from deeply nested JSON structures

This is a structure I'm getting from elsewhere, that is, a list of deeply nested dictionaries:
{
"foo_code": 404,
"foo_rbody": {
"query": {
"info": {
"acme_no": "444444",
"road_runner": "123"
},
"error": "no_lunch",
"message": "runner problem."
}
},
"acme_no": "444444",
"road_runner": "123",
"xyzzy_code": 200,
"xyzzy_rbody": {
"api": {
"items": [
{
"desc": "OK",
"id": 198,
"acme_no": "789",
"road_runner": "123",
"params": {
"bicycle": "2wheel",
"willie": "hungry",
"height": "1",
"coyote_id": "1511111"
},
"activity": "TRAP",
"state": "active",
"status": 200,
"type": "chase"
}
]
}
}
}
{
"foo_code": 200,
"foo_rbody": {
"query": {
"result": {
"acme_no": "260060730303258",
"road_runner": "123",
"abyss": "26843545600"
}
}
},
"acme_no": "260060730303258",
"road_runner": "123",
"xyzzy_code": 200,
"xyzzy_rbody": {
"api": {
"items": [
{
"desc": "OK",
"id": 198,
"acme_no": "789",
"road_runner": "123",
"params": {
"bicycle": "2wheel",
"willie": "hungry",
"height": "1",
"coyote_id": "1511111"
},
"activity": "TRAP",
"state": "active",
"status": 200,
"type": "chase"
}
]
}
}
}
Asking for different structures is out of question (legacy apis etc).
So I'm wondering if there's some clever way of extracting selected values from such a structure.
The candidates I was thinking of:
flatten particular dictionaries, building composite keys, smth like:
{
"foo_rbody.query.info.acme_no": "444444",
"foo_rbody.query.info.road_runner": "123",
...
}
Pro: getting every value with one access and if predictable key is not there, it means that the structure was not there (as you might have noticed, dictionaries may have different structures depending on whether it was successful operation, error happened, etc).
Con: what to do with lists?
Use some recursive function that would do successive key lookups, say by "foo_rbody", then by "query", "info", etc.
Any better candidates?

You can try this rather trivial function to access nested properties:
import re
def get_path(dct, path):
for i, p in re.findall(r'(\d+)|(\w+)', path):
dct = dct[p or int(i)]
return dct
Usage:
value = get_path(data, "xyzzy_rbody.api.items[0].params.bicycle")

Maybe the function byPath in my answer to this post might help you.

You could create your own path mechanism and then query the complicated dict with paths. Example:
/ : get the root object
/key: get the value of root_object['key'], e.g. /foo_code --> 404
/key/key: nesting: /foo_rbody/query/info/acme_no -> 444444
/key[i]: get ith element of that list, e.g. /xyzzy_rbody/api/items[0]/desc --> "OK"
The path can also return a dict which you then run more queries on, etc.
It would be fairly easy to implement recursively.

I think about two more solutions:
You can try package Pynq, described here - structured query language for JSON (in Python). As far as a I understand, it's some kind of LINQ for python.
You may also try to convert your JSON to XML and then use Xquery language to get data from it - XQuery library under Python

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python function to extract specific values from complex JSON logs data - python

Related

check if a property contains a specific value in a document with pymongo

Get "path" of parent keys and indices in dictionary of nested dictionaries and lists

Elastic Search: including #/hashtags in search results

Issues decoding Collections+JSON in Python

Extracting values from deeply nested JSON structures

Categories

Resources