Extract data from JSON in python

Can I access the link inside next in the JSON data below? This is what I am doing:
data = json.loads(html.decode('utf-8'))
for i in data['comments']:
    for h in i['paging']:
        print(h)
comments is the main object, and inside it there are three sub-objects: data, paging and summary. The code above is meant to go inside comments and, because paging is itself an object made of several other objects, loop over it and print each one. It gives this error:
for h in i['paging']:
TypeError: string indices must be integers
The JSON data:
{
"comments": {
"data": [
{
"created_time": "2016-05-22T14:57:04+0000",
"from": {
"id": "908005352638687",
"name": "Marianela Ferrer"
},
"id": "101536081674319615443",
"message": "I love the way you talk! I can not get enough of your voice. I'm going to buy Seveneves! I'm going to read it this week. Thanks again se\u00f1or Gates. I hope you have a beautiful day:-)"
}
],
"paging": {
"cursors": {
"after": "NzQ0",
"before": "NzQ0"
},
"next": "https://graph.facebook.com/v2.7/10153608167431961/comments?access_token=xECxPrxuXbaRqcippFcrwZDZD&summary=true&limit=1&after=NzQ0"
},
"summary": {
"can_comment": true,
"order": "ranked",
"total_count": 744
}
},
"id": "10153608167431961"
}

You're iterating over data['comments'], which is a dictionary, and looping over a dictionary yields its keys: the strings "data", "paging" and "summary". All you want is paging, but your first for-loop makes you visit the others as well.
So on the first pass i is a string such as "data", and i['paging'] tries to index that string with another string, which raises TypeError: string indices must be integers.
You want to access paging directly:
print(data['comments']['paging']['next'])
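Here is a minimal sketch of that direct lookup, using a trimmed copy of the JSON from the question. The shortened URL and the use of .get() to tolerate a missing "next" key are assumptions for illustration, not part of the original post.

```python
import json

# Trimmed copy of the response from the question; the URL is shortened for illustration.
raw = """
{
  "comments": {
    "data": [{"id": "101536081674319615443", "message": "I love the way you talk!"}],
    "paging": {
      "cursors": {"after": "NzQ0", "before": "NzQ0"},
      "next": "https://graph.facebook.com/v2.7/10153608167431961/comments?limit=1&after=NzQ0"
    },
    "summary": {"can_comment": true, "order": "ranked", "total_count": 744}
  },
  "id": "10153608167431961"
}
"""

data = json.loads(raw)

# Iterating over a dict yields its keys, which are plain strings.
keys = list(data["comments"])

# Index straight into the nested dicts; .get() returns None if "next" is absent.
next_link = data["comments"]["paging"].get("next")

print(keys)
print(next_link)
```

Printing keys shows why the original loop failed: each i is a string, not a nested object.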

Related

Python function to extract specific values from complex JSON logs data

I am trying to write a Python function (for use in a Google Cloud Function) that extracts specific values from JSON logs data. Ordinarily, I do this using the standard method of sorting through keys:
my_data['key1'], etc.
This JSON data, however, is quite different, since the data I need appears to sit in lists inside dictionaries. Here is a sample of the logs data:
{
"insertId": "-mgv16adfcja",
"logName": "projects/my_project/logs/cloudaudit.googleapis.com%2Factivity",
"protoPayload": {
"@type": "type.googleapis.com/google.cloud.audit.AuditLog",
"authenticationInfo": {
"principalEmail": "email@email.com"
},
"authorizationInfo": [{
"granted": true,
"permission": "resourcemanager.projects.setIamPolicy",
"resource": "projects/my_project",
"resourceAttributes": {
"name": "projects/my_project",
"service": "cloudresourcemanager.googleapis.com",
"type": "cloudresourcemanager.googleapis.com/Project"
}
},
{
"granted": true,
"permission": "resourcemanager.projects.setIamPolicy",
"resource": "projects/my_project",
"resourceAttributes": {
"name": "projects/my_project",
"service": "cloudresourcemanager.googleapis.com",
"type": "cloudresourcemanager.googleapis.com/Project"
}
}
],
"methodName": "SetIamPolicy",
"request": {
"@type": "type.SetIamPolicyRequest",
"policy": {
"bindings": [{
"members": [
"serviceAccount:my-test-sa@my_project.iam.gserviceaccount.com"
],
"role": "projects/my_project/roles/PubBuckets"
},
{
"members": [
"serviceAccount:my-test-sa-2@my_project.iam.gserviceaccount.com"
],
"role": "roles/owner"
},
{
"members": [
"serviceAccount:my-test-sa-3@my_project.iam.gserviceaccount.com",
"serviceAccount:my-test-sa-4@my_project.iam.gserviceaccount.com"
]
}
My goal with this data is to extract the "role":"roles/editor" and the associated "members." So in this case, I would like to extract service accounts my-test-sa-3, 4, and 5, and print them.
When the JSON enters my cloud function I do the following:
pubsub_message = base64.b64decode(event['data']).decode('utf-8')
msg = json.loads(pubsub_message)
print(msg)
And I can get to other data that I need, e.g. the project id:
proj_id = msg['resource']['labels']['project_id']
But I cannot get into the lists within the dictionaries effectively. The deepest I can currently get is to the 'bindings' key.
I have additionally tried restructuring and flattening output as a list:
policy_request = credentials.projects().getIamPolicy(resource=proj_id, body={})
policy_response = policy_request.execute()
my_bindings = policy_response['bindings']
flat_list = []
for element in my_bindings:
    if type(element) is list:
        for item in element:
            flat_list.append(item)
    else:
        flat_list.append(element)
print('Here is flat_list: ', flat_list)
I then use an if statement to search the list, which returns nothing. I can't use indices, because the output changes constantly, so I need a solution that extracts the values by key and value if at all possible.
Expected Output:
Role: roles/editor
Members:
sa-1@gcloud.com
sa2@gcloud.com
sa3@gcloud.com
and so on
Appreciate any help.
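One possible approach, sketched under the assumption that the bindings sit at msg['protoPayload']['request']['policy']['bindings'] as in the sample above. Each binding is a dictionary, so there is nothing to flatten: loop over the list and compare the "role" value. The helper name members_for_role and the sample_bindings data are hypothetical.

```python
def members_for_role(bindings, role):
    """Collect the members of every binding whose "role" equals `role`."""
    members = []
    for binding in bindings:
        # .get() tolerates bindings that carry no "role" or "members" key.
        if binding.get("role") == role:
            members.extend(binding.get("members", []))
    return members


# Hypothetical stand-in for msg["protoPayload"]["request"]["policy"]["bindings"].
sample_bindings = [
    {"members": ["serviceAccount:sa-1@example.com"], "role": "roles/owner"},
    {
        "members": [
            "serviceAccount:sa-2@example.com",
            "serviceAccount:sa-3@example.com",
        ],
        "role": "roles/editor",
    },
]

editors = members_for_role(sample_bindings, "roles/editor")
print("Role: roles/editor")
print("Members:")
for member in editors:
    print(member)
```

Because the lookup goes by key and value rather than by position, it keeps working when the order or number of bindings changes.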

How do I output specific data from a json response?

I am fairly new to using APIs in python and I am trying to create a system that outputs data from previous motorsport races. I have sent requests to an API, but I am struggling to get it to just output one specific piece of data (eg. time, location). I get this when I just print the raw JSON data sent.
{
"MRData": {
"RaceTable": {
"Races": [
{
"Circuit": {
"Location": {
"country": "Spain",
"lat": "41.57",
"locality": "Montmeló",
"long": "2.26111"
},
"circuitId": "catalunya",
"circuitName": "Circuit de Barcelona-Catalunya",
"url": "http://en.wikipedia.org/wiki/Circuit_de_Barcelona-Catalunya"
},
"date": "2020-08-16",
"raceName": "Spanish Grand Prix",
"round": "6",
"season": "2020",
"time": "13:10:00Z",
"url": "https://en.wikipedia.org/wiki/2020_Spanish_Grand_Prix"
}
],
"round": "6",
"season": "2020"
},
"limit": "30",
"offset": "0",
"series": "f1",
"total": "1",
"url": "http://ergast.com/api/f1/2020/6.json",
"xmlns": "http://ergast.com/mrd/1.4"
}
}
Just to get to grips with APIs I am simply trying to output a simple piece of data of a specific race, and once I can do that, I'll be able to scale it up and output all sorts of data. I'd assumed it would just be as simple as typing print(data['time']) (as seen below) but I get an error message saying this:
KeyError: 'time'
My source code:
import requests
response = requests.get("http://ergast.com/api/f1/2020/6.json")
data = response.json()
print(data["time"])
Any help is appreciated!

Like this...
import json
data = """{
"MRData":{
"xmlns":"http://ergast.com/mrd/1.4",
"series":"f1",
"url":"http://ergast.com/api/f1/2020/6.json",
"limit":"30",
"offset":"0",
"total":"1",
"RaceTable":{
"season":"2020",
"round":"6",
"Races":[
{
"season":"2020",
"round":"6",
"url":"https://en.wikipedia.org/wiki/2020_Spanish_Grand_Prix",
"raceName":"Spanish Grand Prix",
"Circuit":{
"circuitId":"catalunya",
"url":"http://en.wikipedia.org/wiki/Circuit_de_Barcelona-Catalunya",
"circuitName":"Circuit de Barcelona-Catalunya",
"Location":{
"lat":"41.57",
"long":"2.26111",
"locality":"Montmeló",
"country":"Spain"
}
},
"date":"2020-08-16",
"time":"13:10:00Z"
}
]
}
}
}"""
jsonData = json.loads(data)
Races is an array. In this case there is only one race, so you would designate it as ["Races"][0]:
print(jsonData["MRData"]["RaceTable"]["Races"][0]["time"])

data['time'] would work if you had a flat dictionary, but you have a nested dicts/list structure, so:
data["MRData"]["RaceTable"]["Races"][0]["time"]
data["MRData"] returns another dict, which has a key "RaceTable". The value of this key is again a dictionary which has a key "Races". The value of this is a list of races, of which you only have one. The races are again dicts which have the key time.
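To scale this up to several fields or several races, a loop over the Races list avoids hard-coding [0]. The following is only a sketch: it uses a trimmed, hard-coded copy of the response from the question in place of response.json(), and the summaries list is a name made up for illustration.

```python
# Trimmed copy of the response from the question (normally response.json()).
data = {
    "MRData": {
        "RaceTable": {
            "Races": [
                {
                    "raceName": "Spanish Grand Prix",
                    "date": "2020-08-16",
                    "time": "13:10:00Z",
                    "Circuit": {"Location": {"locality": "Montmeló", "country": "Spain"}},
                }
            ],
            "round": "6",
            "season": "2020",
        }
    }
}

races = data["MRData"]["RaceTable"]["Races"]

# Loop over the list instead of hard-coding [0], so zero or many races both work.
summaries = []
for race in races:
    location = race["Circuit"]["Location"]
    summaries.append(
        f'{race["raceName"]}: {race["date"]} {race["time"]}, '
        f'{location["locality"]}, {location["country"]}'
    )

for line in summaries:
    print(line)
```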

How to paginate terms aggregation results in Elasticsearch

I've been trying to figure out a way to paginate the results of a terms aggregation in Elasticsearch and so far I have not been able to achieve the desired result.
Here's the problem I am trying to solve. In my index, I have a bunch of documents that have a score (separate to the ES _score) that is calculated based on the values of the other fields in the document. Each document "belongs" to a customer, referenced by the customer_id field. The document also has an id, referenced by the doc_id field, and is the same as the ES meta-field _id. Here is an example.
{
'_id': '1',
'doc_id': '1',
'doc_score': '85',
'customer_id': '123'
}
For each customer_id there are multiple documents, all with different document ids and different scores. What I want to be able to do is, given a list of customer ids, return the top document for each customer_id (only 1 per customer) and be able to paginate those results similar to the size, from method in the regular ES search API. The field that I want to use for the document score is the doc_score field.
So far, in my current Python script, I've tried a nested aggregation with a "top hits" sub-aggregation to get only the top document for each customer.
{
"size": 0,
"query": {
"bool": {
"must": [
{
"match_all": {}
},
{
"terms": {
"customer_id": customer_ids # a list of the customer ids I want documents for
}
},
{
"exists": {
"field": "score" # sometimes it's possible a document does not have a score
}
}
]
}
},
"aggs": {
"customers": {
"terms": {"field": "customer_id", "min_doc_count": 1},
"aggs": {
"top_documents": {
"top_hits": {
"sort": [
{"score": {"order": "desc"}}
],
"size": 1
}
}
}
}
}
}
I then "paginate" by going through each customer bucket, appending the top document blob to a list and then sorting the list based on the value of the score field and finally taking a slice documents_list[from:from+size].
The issue with this is that, say I have 500 customers in the list but I only want the 2nd 20 documents, i.e. size = 20, from=20. So each time I call the function I have to first get the list for each of the 500 customers and then slice. This sounds very inefficient and is also a speed issue, since I need that function to be as fast as I can possibly make it.
Ideally, I could just get the 2nd 20 directly from ES without having to do any slicing in my function.
I have looked into Composite aggregations that ES offers, but it looks to me like I would not be able to use it in my case, since I need to get the entire doc, i.e. everything in the _source field in the regular search API response.
I would greatly appreciate any suggestions.

The best way to do this would be to use partitions, which split the terms into groups that you can request one at a time.
The example from the documentation:
GET /_search
{
"size": 0,
"aggs": {
"expired_sessions": {
"terms": {
"field": "account_id",
"include": {
"partition": 1,
"num_partitions": 25
},
"size": 20,
"order": {
"last_access": "asc"
}
},
"aggs": {
"last_access": {
"max": {
"field": "access_date"
}
}
}
}
}
}
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-aggregations-bucket-terms-aggregation.html#_filtering_values_with_partitions
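Applied to the question, the request body might be built along these lines. This is only a sketch: the function name partitioned_top_docs_body and the parameters bucket_size and score_field are made up for illustration, the field names come from the question, and nothing here talks to Elasticsearch. Keep in mind that terms are assigned to partitions by hash, so partitions are roughly but not exactly equal in size, and walking through them does not give a global ordering by score.

```python
def partitioned_top_docs_body(customer_ids, partition, num_partitions,
                              bucket_size, score_field="doc_score"):
    """Build a search body returning one partition of customer buckets."""
    return {
        "size": 0,
        "query": {"terms": {"customer_id": customer_ids}},
        "aggs": {
            "customers": {
                "terms": {
                    "field": "customer_id",
                    # Only terms that hash into this partition are returned.
                    "include": {
                        "partition": partition,
                        "num_partitions": num_partitions,
                    },
                    "size": bucket_size,
                },
                "aggs": {
                    "top_documents": {
                        "top_hits": {
                            "sort": [{score_field: {"order": "desc"}}],
                            "size": 1,
                        }
                    }
                },
            }
        },
    }


# Hypothetical usage: 500 customers split into 25 partitions.
customer_ids = [str(n) for n in range(500)]
bodies = [
    partitioned_top_docs_body(customer_ids, p, 25, bucket_size=40)
    for p in range(25)
]
```

The bucket_size is set above the average partition size (500 / 25 = 20) so that a partition that happens to receive more terms is not cut off.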

how to read specific values using Json to dict?

"Instances": [{
"nlu_classification": {
"Domain": "UDE",
"Intention": "Unspecified"
},
"nlu_interpretation_index": 1,
"nlu_slot_details": {
"Name": {
"literal": "ConnectedDrive"
},
"Search-phrase": {
"literal": "connecteddrive"
}
},
"interpretation_confidence": 5484
}],
"type": "nlu_results",
"api_version": "1.0"
}],
"nlps_version": "nlps(z):6.1.100.12.2-B359;Version: nlps-base-Zeppelin-6.1.100-B124-GMT20151130193521;"
}
},
"final_response": 1,
"prompt": "",
"result_format": "appserver_post_results"
}
I am getting the above as a reply from the server and storing it in the variable NLU_RESULT. Later I use json.loads to convert the JSON into a dict and look up a specific value in it, as below.
parsed_json = json.loads(NLU_RESULT)
print(parsed_json["Instances"]["nlu_classification"]["Domain"])
When I run the above code, it does not print the value of Domain. Can someone tell me what the mistake is?

UPDATE
It should be something like:
parsed_json['appserver_results']['payload']['actions'][0]['Instances'][0]['nlu_classification']['Domain']
The JSON you posted has Instances as an array, so it should be something like:
print(parsed_json["Instances"][0]["nlu_classification"]["Domain"])
Also, the JSON is a bit broken: it closes some arrays that were never opened.
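A small runnable sketch of the same idea. The JSON below is a hypothetical, well-formed excerpt: the outer keys follow the path given in the answer, but the real server response may differ.

```python
import json

# Hypothetical, well-formed excerpt; the outer keys follow the path in the answer.
NLU_RESULT = """
{
  "appserver_results": {
    "payload": {
      "actions": [
        {
          "Instances": [
            {
              "nlu_classification": {"Domain": "UDE", "Intention": "Unspecified"},
              "nlu_interpretation_index": 1
            }
          ],
          "type": "nlu_results"
        }
      ]
    }
  },
  "final_response": 1
}
"""

parsed_json = json.loads(NLU_RESULT)

# "actions" and "Instances" are lists, so pick an element before using a key.
action = parsed_json["appserver_results"]["payload"]["actions"][0]
domains = [inst["nlu_classification"]["Domain"] for inst in action["Instances"]]
print(domains)
```

Collecting the value from every instance, rather than only index 0, also covers responses that return more than one interpretation.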

Extracting values from deeply nested JSON structures

This is a structure I'm getting from elsewhere, that is, a list of deeply nested dictionaries:
{
"foo_code": 404,
"foo_rbody": {
"query": {
"info": {
"acme_no": "444444",
"road_runner": "123"
},
"error": "no_lunch",
"message": "runner problem."
}
},
"acme_no": "444444",
"road_runner": "123",
"xyzzy_code": 200,
"xyzzy_rbody": {
"api": {
"items": [
{
"desc": "OK",
"id": 198,
"acme_no": "789",
"road_runner": "123",
"params": {
"bicycle": "2wheel",
"willie": "hungry",
"height": "1",
"coyote_id": "1511111"
},
"activity": "TRAP",
"state": "active",
"status": 200,
"type": "chase"
}
]
}
}
}
{
"foo_code": 200,
"foo_rbody": {
"query": {
"result": {
"acme_no": "260060730303258",
"road_runner": "123",
"abyss": "26843545600"
}
}
},
"acme_no": "260060730303258",
"road_runner": "123",
"xyzzy_code": 200,
"xyzzy_rbody": {
"api": {
"items": [
{
"desc": "OK",
"id": 198,
"acme_no": "789",
"road_runner": "123",
"params": {
"bicycle": "2wheel",
"willie": "hungry",
"height": "1",
"coyote_id": "1511111"
},
"activity": "TRAP",
"state": "active",
"status": 200,
"type": "chase"
}
]
}
}
}
Asking for different structures is out of the question (legacy APIs etc.).
So I'm wondering if there's some clever way of extracting selected values from such a structure.
The candidates I was thinking of:
Flatten particular dictionaries, building composite keys, something like:
{
"foo_rbody.query.info.acme_no": "444444",
"foo_rbody.query.info.road_runner": "123",
...
}
Pro: every value is reachable with one access, and if a predictable key is not there, it means that the structure was not there (as you might have noticed, the dictionaries may have different structures depending on whether the operation succeeded, an error happened, etc.).
Con: what to do with lists?
Use some recursive function that would do successive key lookups, say by "foo_rbody", then by "query", "info", etc.
Any better candidates?
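For the first candidate, one way to deal with lists is to put the list position into the composite key. A minimal sketch, where the function name flatten and the "items[0]" key syntax are assumptions for illustration:

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into a single dict with composite keys."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            flat.update(flatten(value, child))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            # Lists contribute their position, e.g. "items[0]".
            flat.update(flatten(value, f"{prefix}[{index}]"))
    else:
        flat[prefix] = obj
    return flat


record = {
    "foo_code": 404,
    "foo_rbody": {"query": {"info": {"acme_no": "444444"}, "error": "no_lunch"}},
    "xyzzy_rbody": {"api": {"items": [{"desc": "OK", "params": {"bicycle": "2wheel"}}]}},
}

flat = flatten(record)
print(flat.get("foo_rbody.query.info.acme_no"))
print(flat.get("xyzzy_rbody.api.items[0].params.bicycle"))
print(flat.get("foo_rbody.query.result.acme_no"))
```

A missing structure simply shows up as a missing key, so flat.get() returns None instead of raising. Note that empty dicts and empty lists leave no trace in the flattened result.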

You can try this rather trivial function to access nested properties:
import re
def get_path(dct, path):
    for i, p in re.findall(r'(\d+)|(\w+)', path):
        dct = dct[p or int(i)]
    return dct
Usage:
value = get_path(data, "xyzzy_rbody.api.items[0].params.bicycle")

Maybe the function byPath in my answer to this post might help you.

You could create your own path mechanism and then query the complicated dict with paths. Example:
/ : get the root object
/key: get the value of root_object['key'], e.g. /foo_code --> 404
/key/key: nesting: /foo_rbody/query/info/acme_no -> 444444
/key[i]: get ith element of that list, e.g. /xyzzy_rbody/api/items[0]/desc --> "OK"
The path can also return a dict which you then run more queries on, etc.
It would be fairly easy to implement recursively.
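A rough sketch of such a path mechanism, written as a loop rather than recursively. The function name query and the exact segment syntax are assumptions based on the examples above.

```python
import re

# One path segment: a key, optionally followed by indices such as [0][1].
_SEGMENT = re.compile(r"^([^\[\]]+)((?:\[\d+\])*)$")


def query(root, path):
    """Resolve a path like /xyzzy_rbody/api/items[0]/desc against root."""
    current = root
    for segment in path.strip("/").split("/"):
        if not segment:  # the path "/" returns the root object itself
            continue
        match = _SEGMENT.match(segment)
        if match is None:
            raise ValueError(f"bad path segment: {segment!r}")
        key, indices = match.groups()
        current = current[key]
        for index in re.findall(r"\[(\d+)\]", indices):
            current = current[int(index)]
    return current


data = {
    "foo_code": 404,
    "foo_rbody": {"query": {"info": {"acme_no": "444444"}}},
    "xyzzy_rbody": {"api": {"items": [{"desc": "OK"}]}},
}

print(query(data, "/foo_code"))
print(query(data, "/foo_rbody/query/info/acme_no"))
print(query(data, "/xyzzy_rbody/api/items[0]/desc"))
```

A missing key or index raises the usual KeyError or IndexError, which the caller can catch to detect that a structure is absent.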

I can think of two more solutions:
You can try the package Pynq, described here: structured query language for JSON (in Python). As far as I understand, it's some kind of LINQ for Python.
You may also try to convert your JSON to XML and then use the XQuery language to get data from it: XQuery library under Python.
