How to read specific values using JSON to dict? - Python

"Instances": [{
"nlu_classification": {
"Domain": "UDE",
"Intention": "Unspecified"
},
"nlu_interpretation_index": 1,
"nlu_slot_details": {
"Name": {
"literal": "ConnectedDrive"
},
"Search-phrase": {
"literal": "connecteddrive"
}
},
"interpretation_confidence": 5484
}],
"type": "nlu_results",
"api_version": "1.0"
}],
"nlps_version": "nlps(z):6.1.100.12.2-B359;Version: nlps-base-Zeppelin-6.1.100-B124-GMT20151130193521;"
}
},
"final_response": 1,
"prompt": "",
"result_format": "appserver_post_results"
}
I am getting the above response from the server. I am storing that result in the variable NLU_RESULT. Later I use json.loads to convert the JSON into a dict and check for a specific value within it, as below:
parsed_json = json.loads(NLU_RESULT)
print(parsed_json["Instances"]["nlu_classification"]["Domain"])
When I use the above code, it does not print the value of Domain. Can someone tell me what the mistake is here?

UPDATE
It should be something like
parsed['appserver_results']['payload']['actions'][0]['Instances'][0]['nlu_classification']['Domain']
The JSON you posted has Instances as an array, so it should be something like
print(parsed_json["Instances"][0]["nlu_classification"]["Domain"])
Also, the JSON is a bit broken: it contains some closing array brackets without matching opening ones.
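For illustration, here is a minimal, self-contained sketch using a trimmed, well-formed stand-in for the response (the real reply nests Instances more deeply, as shown above), demonstrating why the [0] index is needed:
import json

# Trimmed, well-formed stand-in for the server reply.
NLU_RESULT = '{"Instances": [{"nlu_classification": {"Domain": "UDE"}}]}'

parsed_json = json.loads(NLU_RESULT)
# "Instances" maps to a list, so index its first element before
# descending into the nested dicts.
print(parsed_json["Instances"][0]["nlu_classification"]["Domain"])  # UDE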

Related

Why does my query using a MinHash analyzer fail to retrieve duplicates?

I am trying to query an Elasticsearch index for near-duplicates using its MinHash implementation.
I use the Python client running in containers to index and perform the search.
My corpus is a JSONL file a bit like this:
{"id":1, "text":"I'd just like to interject for a moment"}
{"id":2, "text":"I come up here for perception and clarity"}
...
I create an Elasticsearch index successfully, trying to use custom settings and analyzer, taking inspiration from the official examples and MinHash docs:
def create_index(client):
    client.indices.create(
        index="documents",
        body={
            "settings": {
                "analysis": {
                    "filter": {
                        "my_shingle_filter": {
                            "type": "shingle",
                            "min_shingle_size": 5,
                            "max_shingle_size": 5,
                            "output_unigrams": False
                        },
                        "my_minhash_filter": {
                            "type": "min_hash",
                            "hash_count": 10,
                            "bucket_count": 512,
                            "hash_set_size": 1,
                            "with_rotation": True
                        }
                    },
                    "analyzer": {
                        "my_analyzer": {
                            "tokenizer": "standard",
                            "filter": [
                                "my_shingle_filter",
                                "my_minhash_filter"
                            ]
                        }
                    }
                }
            },
            "mappings": {
                "properties": {
                    "name": {"type": "text", "analyzer": "my_analyzer"}
                }
            },
        },
        ignore=400,
    )
I verify via Kibana that index creation has no big problems, and by visiting http://localhost:9200/documents/_settings I get something that seems in order.
However, querying the index with:
def get_duplicate_documents(body, K, es):
    doc = {
        '_source': ['_id', 'body'],
        'size': K,
        'query': {
            "match": {
                "body": {
                    "query": body,
                    "analyzer": "my_analyzer"
                }
            }
        }
    }
    res = es.search(index='documents', body=doc)
    top_matches = [hit['_source']['_id'] for hit in res['hits']['hits']]
my res['hits'] is consistently empty, even if I set body to match exactly the text of one of the entries in my corpus. In other words, I don't get any results if I try values for body such as
"I come up here for perception and clarity"
or substrings like
"I come up here for perception"
while ideally, I'd like the procedure to return near-duplicates, with a score being an approximation of the Jaccard similarity of the query and the near-duplicates, obtained via MinHash.
Is there something wrong in my query and/or the way I index into Elasticsearch? Am I missing something else entirely?
P.S.: You can have a look at https://github.com/davidefiocco/dockerized-elasticsearch-duplicate-finder/tree/ea0974363b945bf5f85d52a781463fba76f4f987 for a non-functional, but hopefully reproducible example (I will also update the repo as I find a solution!)
Here are some things that you should double-check as they are likely culprits:
When you create your mapping, you should change "name" to "text" inside the body param of your client.indices.create call, because your JSON documents have a field called text:
"mappings": {
"properties": {
"text": {"type": "text", "analyzer": "my_analyzer"}
}
In the indexing phase you could also rework your generate_actions() method following the documentation, with something like:
def generate_actions(corpus):
    for elem in corpus:
        yield {
            "_op_type": "index",
            "_index": "documents",
            "_id": elem["id"],
            # _source must be the document body as a dict, not a bare string
            "_source": {"text": elem["text"]},
        }
Incidentally, if you are indexing pandas dataframes, you may want to check the experimental official library eland.
Also, according to your mapping you are using a min_hash token filter, so Lucene will transform the text inside the text field into hashes. That means queries against this field work on hashes, not on a plain string as you have done in your example "I come up here for perception and clarity".
So the best way to use it is to retrieve the content of the text field and then query Elasticsearch for that same retrieved value. Also, the _id metafield is not inside the _source metafield, so you should change your get_duplicate_documents() method to:
def get_duplicate_documents(body, K, es):
    doc = {
        '_source': ['text'],
        'size': K,
        'query': {
            "match": {
                "text": {  # I changed this line!
                    "query": body
                }
            }
        }
    }
    res = es.search(index='documents', body=doc)
    # also changed the list comprehension!
    top_matches = [(hit['_id'], hit['_source']) for hit in res['hits']['hits']]
    return top_matches
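For completeness, a hypothetical usage sketch; the client URL and the K value here are assumptions, and it presumes the indexing fixes above have been applied:
from elasticsearch import Elasticsearch

# Hypothetical usage: search with the exact text of an indexed document
# and print the IDs and sources of the near-duplicate matches.
es = Elasticsearch("http://localhost:9200")  # assumed local instance
matches = get_duplicate_documents("I come up here for perception and clarity", K=5, es=es)
for doc_id, source in matches:
    print(doc_id, source["text"])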

Flattening an array in a JSON object

I have a JSON object which I want to flatten before exporting it to CSV. I'd like to use the flatten_json module for this.
My JSON input looks like this:
{
    "responseStatus": "SUCCESS",
    "responseDetails": {
        "total": 5754
    },
    "data": [
        {
            "id": 1324651
        },
        {
            "id": 5686131
        },
        {
            "id": 2165735
        },
        {
            "id": 2133256
        }
    ]
}
Easy so far, even for a beginner like me, but what I'm interested in exporting is only the data array. So, I would think of this:
data_json = json["data"]
flat_json = flatten_json.flatten(data_json)
Which doesn't work, since data is an array, stored as a list in Python, not as a dictionary:
[
    {
        "id": 1324651
    },
    {
        "id": 5686131
    },
    {
        "id": 2165735
    },
    {
        "id": 2133256
    }
]
How should I proceed to feed the content of the data array into the flatten_json function?
Thanks!
R.
This function expects a dictionary, so let's pass one:
flat_json = flatten_json.flatten({'data': data_json})
Output:
{'data_0_id': 1324651, 'data_1_id': 5686131, 'data_2_id': 2165735, 'data_3_id': 2133256}
You can choose the keys you want to ignore when you call the flatten method. For example, in your case, you can do the following:
flatten_json.flatten(dic, root_keys_to_ignore={'responseStatus', 'responseDetails'})
where dic is the original JSON input.
This will give as output:
{'data_0_id': 1324651, 'data_1_id': 5686131, 'data_2_id': 2165735, 'data_3_id': 2133256}
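Putting both answers together, here is a self-contained sketch; the input is abbreviated to two ids for brevity:
import json
from flatten_json import flatten  # pip install flatten_json

raw = '''{
    "responseStatus": "SUCCESS",
    "responseDetails": {"total": 5754},
    "data": [{"id": 1324651}, {"id": 5686131}]
}'''
dic = json.loads(raw)

# Option 1: wrap the list in a dictionary before flattening.
print(flatten({'data': dic['data']}))
# Option 2: flatten the whole object but skip the unwanted root keys.
print(flatten(dic, root_keys_to_ignore={'responseStatus', 'responseDetails'}))
# Both print: {'data_0_id': 1324651, 'data_1_id': 5686131}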

Extract data from JSON in Python

Can I access the link inside next from the below JSON data? I am doing it this way:
data = json.loads(html.decode('utf-8'))
for i in data['comments']:
    for h in i['paging']:
        print(h)
comments is the main object. Inside comments there are three sub-objects: data, paging and summary. The above code goes inside comments and, because paging is an object of multiple other objects, loops over it to print it. It is giving the error:
for h in i['paging']:
TypeError: string indices must be integers
The JSON data is:
"comments": {
"data": [
{
"created_time": "2016-05-22T14:57:04+0000",
"from": {
"id": "908005352638687",
"name": "Marianela Ferrer"
},
"id": "101536081674319615443",
"message": "I love the way you talk! I can not get enough of your voice. I'm going to buy Seveneves! I'm going to read it this week. Thanks again se\u00f1or Gates. I hope you have a beautiful day:-)"
}
],
"paging": {
"cursors": {
"after": "NzQ0",
"before": "NzQ0"
},
"next": "https://graph.facebook.com/v2.7/10153608167431961/comments?access_token=xECxPrxuXbaRqcippFcrwZDZD&summary=true&limit=1&after=NzQ0"
},
"summary": {
"can_comment": true,
"order": "ranked",
"total_count": 744
}
},
"id": "10153608167431961"
}
You're iterating through "comments", which yields its three keys: data, paging, and summary. All you want is paging, but your first for-loop walks through all the others as well.
Thus, when it starts with data, i is just the string "data", not the nested object, so i['paging'] fails with "string indices must be integers".
You want to immediately access paging:
print(data['comments']['paging']['next'])
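A minimal sketch of why the original loop fails, using an inline stand-in (the example.com URL is a placeholder) for the parsed response:
# Iterating over a dict yields its keys as strings, not its values.
comments = {"data": [], "paging": {"next": "https://example.com"}, "summary": {}}
for i in comments:
    print(type(i), i)  # <class 'str'> data / paging / summary
# i['paging'] would raise: TypeError: string indices must be integers

# Index directly instead:
print(comments["paging"]["next"])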

Issues decoding Collections+JSON in Python

I've been trying to decode a JSON response in Collections+JSON format using Python for a while now but I can't seem to overcome a small issue.
First of all, here is the JSON response:
{
    "collection": {
        "href": "http://localhost:8000/social/messages-api/",
        "items": [
            {
                "data": [
                    {
                        "name": "messageID",
                        "value": 19
                    },
                    {
                        "name": "author",
                        "value": "mike"
                    },
                    {
                        "name": "recipient",
                        "value": "dan"
                    },
                    {
                        "name": "pm",
                        "value": "0"
                    },
                    {
                        "name": "time",
                        "value": "2015-03-31T15:04:01.165060Z"
                    },
                    {
                        "name": "text",
                        "value": "first message"
                    }
                ]
            }
        ],
        "version": "1.0",
        "links": []
    }
}
And here is how I am attempting to extract data:
response = urllib2.urlopen('myurl')
responseData = response.read()
jsonData = json.loads(responseData)
test = jsonData['collection']['items']['data']
When I run this code I get the error:
list indices must be integers, not str
If I use an integer, e.g. 0, instead of a string it merely shows 'data' instead of any useful information, unlike if I were to simply output 'items'. Similarly, I can't seem to access the data within a data child, for example:
test = jsonData['collection']['items'][0]['name']
This will argue that there is no element called 'name'.
What is the proper method of accessing JSON data in this situation? I would also like to iterate over the collection, if that helps.
I'm aware of a package that can be used to simplify working with Collections+JSON in Python, collection-json, but I'd rather be able to do this without using such a package.
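Given the structure above, items is a list and each item's data is a list of name/value pairs, so a minimal sketch (reusing the responseData variable from the question) could look like this:
import json

jsonData = json.loads(responseData)

# "items" is a list; each item's "data" is a list of {"name": ..., "value": ...} dicts.
for item in jsonData['collection']['items']:
    fields = {d['name']: d['value'] for d in item['data']}
    print(fields['author'], fields['text'])  # mike first message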

Grab values in JSON

I have the following JSON in this format:
{
    "HATg": {
        "id": "208-2",
        "code": "225a"
        "state" : True
    },
    "PROPEMPTY": {
        "id": "208-3",
        "code": "225b"
        "state" False
    }
}
I was wondering how I can access/grab both the id and the code as I iterate over each item in the file in a Pythonic way, like for i in items...
By the way, the contents of the JSON file differ, as they are manipulated by users adding different content. Apologies in advance if I am not using the right terms; I am not sure what they are called.
Assuming your "JSON" looks more like this:
{
    "HATg": {
        "id": "208-2",
        "code": "225a",
        "state": true
    },
    "PROPEMPTY": {
        "id": "208-3",
        "code": "225b",
        "state": false
    }
}
and that you have successfully parsed it into a Python object (for example, by using j = json.load(jsonfile)), then it's trivial to iterate through it in Python (assuming Python 3):
>>> for key, value in j.items():
... print("{}: {}, {}".format(key, value['id'], value['code']))
...
PROPEMPTY: 208-3, 225b
HATg: 208-2, 225a
What you have here is a Python dictionary, not JSON. You can iterate over it like this:
a = {
    "HATg": {
        "id": "208-2",
        "code": "225a",
        "state": True
    },
    "PROPEMPTY": {
        "id": "208-3",
        "code": "225b",
        "state": False
    }
}
for i in a:
    print(i)
    print(a[i]['id'], a[i]['code'])
This gives the output:
PROPEMPTY
208-3 225b
HATg
208-2 225a
