Xpath like query for nested python dictionaries - python

Is there a way to define an XPath-style query for nested Python dictionaries?
Something like this:
foo = {
'spam':'eggs',
'morefoo': {
'bar':'soap',
'morebar': {'bacon' : 'foobar'}
}
}
print( foo.select("/morefoo/morebar") )
>> {'bacon' : 'foobar'}
I also needed to select nested lists ;)
This can be done easily with @jellybean's solution:
def xpath_get(mydict, path):
    elem = mydict
    try:
        for x in path.strip("/").split("/"):
            try:
                x = int(x)
                elem = elem[x]
            except ValueError:
                elem = elem.get(x)
    except:
        pass
    return elem
foo = {
'spam':'eggs',
'morefoo': [{
'bar':'soap',
'morebar': {
'bacon' : {
'bla':'balbla'
}
}
},
'bla'
]
}
print(xpath_get(foo, "/morefoo/0/morebar/bacon"))
[EDIT 2016] This question and the accepted answer are ancient. The newer answers may do the job better than the original answer. However I did not test them so I won't change the accepted answer.

One of the best libraries I've been able to identify, which, in addition, is very actively developed, is an extracted project from boto: JMESPath. It has a very powerful syntax of doing things that would normally take pages of code to express.
Here are some examples:
search('foo | bar', {"foo": {"bar": "baz"}}) -> "baz"
search('foo[*].bar | [0]', {
"foo": [{"bar": ["first1", "second1"]},
{"bar": ["first2", "second2"]}]}) -> ["first1", "second1"]
search('foo | [0]', {"foo": [0, 1, 2]}) -> 0

There is an easier way to do this now.
http://github.com/akesterson/dpath-python
$ easy_install dpath
>>> dpath.util.search(YOUR_DICTIONARY, "morefoo/morebar")
... done. Or if you don't like getting your results back in a view (merged dictionary that retains the paths), yield them instead:
$ easy_install dpath
>>> for (path, value) in dpath.util.search(YOUR_DICTIONARY, "morefoo/morebar", yielded=True):
... and done. 'value' will hold {'bacon': 'foobar'} in that case.

Not exactly beautiful, but you might use something like
def xpath_get(mydict, path):
    elem = mydict
    try:
        for x in path.strip("/").split("/"):
            elem = elem.get(x)
    except:
        pass
    return elem
This doesn't support xpath stuff like indices, of course ... not to mention the / key trap unutbu indicated.

There is the newer jsonpath-rw library supporting a JSONPATH syntax but for python dictionaries and arrays, as you wished.
So your 1st example becomes:
from jsonpath_rw import parse
print( parse('$.morefoo.morebar').find(foo) )
And the 2nd:
print( parse("$.morefoo[0].morebar.bacon").find(foo) )
PS: An alternative simpler library also supporting dictionaries is python-json-pointer with a more XPath-like syntax.

dict > jmespath
You can use JMESPath which is a query language for JSON, and which has a python implementation.
import jmespath # pip install jmespath
data = {'root': {'section': {'item1': 'value1', 'item2': 'value2'}}}
jmespath.search('root.section.item2', data)
Out[42]: 'value2'
The jmespath query syntax and live examples: http://jmespath.org/tutorial.html
dict > xml > xpath
Another option would be converting your dictionaries to XML using something like dicttoxml and then use regular XPath expressions e.g. via lxml or whatever other library you prefer.
from dicttoxml import dicttoxml # pip install dicttoxml
from lxml import etree # pip install lxml
data = {'root': {'section': {'item1': 'value1', 'item2': 'value2'}}}
xml_data = dicttoxml(data, attr_type=False)
Out[43]: b'<?xml version="1.0" encoding="UTF-8" ?><root><root><section><item1>value1</item1><item2>value2</item2></section></root></root>'
tree = etree.fromstring(xml_data)
tree.xpath('//item2/text()')
Out[44]: ['value2']
Json Pointer
Yet another option is Json Pointer which is an IETF spec that has a python implementation:
https://github.com/stefankoegl/python-json-pointer
From the jsonpointer-python tutorial:
from jsonpointer import resolve_pointer
obj = {"foo": {"anArray": [ {"prop": 44}], "another prop": {"baz": "A string" }}}
resolve_pointer(obj, '') == obj
# True
resolve_pointer(obj, '/foo/another%20prop/baz') == obj['foo']['another prop']['baz']
# True
resolve_pointer(obj, '/foo/anArray/0') == obj['foo']['anArray'][0]
# True
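When a third-party dependency is not wanted, the core of JSON Pointer resolution is small enough to sketch by hand. The following is only a minimal sketch under stated assumptions, not the python-json-pointer implementation: the name `resolve_pointer_sketch` is made up for illustration, and bad paths raise the underlying `KeyError`, `IndexError` or `ValueError` rather than returning a default. It applies the RFC 6901 escapes (`~1` for `/`, `~0` for `~`), so keys that contain a slash stay addressable.

```python
def resolve_pointer_sketch(doc, pointer):
    """Resolve an RFC 6901 style pointer against nested dicts and lists.

    Hypothetical helper for illustration only.
    """
    if pointer == "":
        return doc  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty pointer must start with '/'")
    current = doc
    for token in pointer[1:].split("/"):
        # Unescape in this order: '~1' -> '/', then '~0' -> '~'.
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]  # list items are addressed by index
        else:
            current = current[token]
    return current


obj = {"foo": {"anArray": [{"prop": 44}], "a/b": {"baz": "A string"}}}
print(resolve_pointer_sketch(obj, "/foo/anArray/0/prop"))
print(resolve_pointer_sketch(obj, "/foo/a~1b/baz"))
```

Escaping the separator inside a key is what keeps the path string unambiguous even when a key itself contains `/`.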

If terseness is your fancy:
from functools import reduce  # reduce is not a builtin in Python 3

def xpath(root, path, sch='/'):
    return reduce(lambda acc, nxt: acc[nxt],
                  [int(x) if x.isdigit() else x for x in path.split(sch)],
                  root)
Of course, if you only have dicts, then it's simpler:
def xpath(root, path, sch='/'):
    return reduce(lambda acc, nxt: acc[nxt],
                  path.split(sch),
                  root)
Good luck finding any errors in your path spec tho ;-)
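If finding errors in the path spec matters more than terseness, the same walk can be written as a loop that remembers how far it got. This is a rough sketch rather than a drop-in replacement: `xpath_strict` is a hypothetical name, and it assumes a segment should be treated as an index only when the current node is a list.

```python
def xpath_strict(root, path, sep="/"):
    """Walk nested dicts/lists; on failure, report how far the walk got."""
    current = root
    walked = []
    for part in path.strip(sep).split(sep):
        # Only convert to int when we are actually indexing into a list.
        key = int(part) if part.isdigit() and isinstance(current, list) else part
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError) as exc:
            where = sep + sep.join(walked)
            raise LookupError(
                "cannot resolve %r after %r" % (part, where)
            ) from exc
        walked.append(part)
    return current


foo = {'spam': 'eggs', 'morefoo': [{'bar': 'soap'}, 'bla']}
print(xpath_strict(foo, "/morefoo/0/bar"))
```

A bad segment then raises a `LookupError` naming both the failing segment and the prefix that resolved successfully.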

Another alternative (besides that suggested by jellybean) is this:
def querydict(d, q):
    keys = q.split('/')
    nd = d
    for k in keys:
        if k == '':
            continue
        if k in nd:
            nd = nd[k]
        else:
            return None
    return nd
foo = {
'spam':'eggs',
'morefoo': {
'bar':'soap',
'morebar': {'bacon' : 'foobar'}
}
}
print(querydict(foo, "/morefoo/morebar"))

More work would have to be put into how the XPath-like selector would work.
'/' is a valid dictionary key, so how would
foo={'/':{'/':'eggs'},'//':'ham'}
be handled?
foo.select("///")
would be ambiguous.
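One way around the ambiguity, sketched here as an assumption rather than a recommendation from the thread, is to skip the path string entirely and pass the keys as a sequence. The `select` function below is hypothetical and does no error handling.

```python
def select(mapping, keys):
    """Follow an explicit sequence of keys; no string splitting involved."""
    current = mapping
    for key in keys:
        current = current[key]
    return current


foo = {'/': {'/': 'eggs'}, '//': 'ham'}
print(select(foo, ['/', '/']))  # the value nested under '/' then '/'
print(select(foo, ['//']))      # the value under the top-level '//' key
```

Because each key is its own list element, no separator character needs to be reserved or escaped.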

Is there any reason you need to query it with an XPath-like pattern? As the commenter on your question suggested, it is just a dictionary, so you can access the elements in a nested manner. Also, if the data is in the form of JSON, you can use the simplejson module to load it and access the elements too.
There is also the JSONPath project, which tries to help people do the opposite of what you intend to do (given an XPath, make it easily accessible via Python objects), which seems more useful.

def Dict(var, *arg, **kwarg):
    """ Return the value of a (nested) dictionary if all fields exist, else return "" unless "default=new_value" is specified as the last argument
    Avoids TypeError: argument of type 'NoneType' is not iterable
    Ex: Dict(variable_dict, 'field1', 'field2', default = 0)
    """
    for key in arg:
        if isinstance(var, dict) and key and key in var: var = var[key]
        else: return kwarg['default'] if kwarg and 'default' in kwarg else "" # Allow Dict(var, tvdbid).isdigit() for example
    return kwarg['default'] if var in (None, '', 'N/A', 'null') and kwarg and 'default' in kwarg else "" if var in (None, '', 'N/A', 'null') else var
foo = {
'spam':'eggs',
'morefoo': {
'bar':'soap',
'morebar': {'bacon' : 'foobar'}
}
}
print(Dict(foo, 'morefoo', 'morebar'))
print(Dict(foo, 'morefoo', 'morebar', default=None))
Have a SaveDict(value, var, *arg) function that can even append to lists in dict...

I referenced this link.
The following code implements XPath-style parsing of JSON in Python:
import json
import logging

import xmltodict

# Parse the json string
class jsonprase(object):
    def __init__(self, json_value):
        try:
            self.json_value = json.loads(json_value)
        except Exception:
            raise ValueError('must be a json str value')

    def find_json_node_by_xpath(self, xpath):
        elem = self.json_value
        nodes = xpath.strip("/").split("/")
        for x in range(len(nodes)):
            try:
                elem = elem.get(nodes[x])
            except AttributeError:
                # elem is a list here: collect the key from every item
                elem = [y.get(nodes[x]) for y in elem]
        return elem

    def datalength(self, xpath="/"):
        return len(self.find_json_node_by_xpath(xpath))

    @property
    def json_to_xml(self):
        try:
            root = {"root": self.json_value}
            xml = xmltodict.unparse(root, pretty=True)
        except Exception as e:
            # log the failure, then re-raise so xml is never returned unbound
            logging.getLogger(__name__).error(e)
            raise
        return xml
Test Json :
{
"responseHeader": {
"zkConnected": true,
"status": 0,
"QTime": 2675,
"params": {
"q": "TxnInitTime:[2021-11-01T00:00:00Z TO 2021-11-30T23:59:59Z] AND Status:6",
"stats": "on",
"stats.facet": "CountryCode",
"rows": "0",
"wt": "json",
"stats.field": "ItemPrice"
}
},
"response": {
"numFound": 15162439,
"start": 0,
"maxScore": 1.8660598,
"docs": []
}
}
Test Code to read the values from above input json.
numFound = jsonprase(ABOVE_INPUT_JSON).find_json_node_by_xpath('/response/numFound')
print(numFound)

Related

Is there a package that converts xmltodict dictionaries to lxml trees?

The problem I have is this. I've started the XML creation using the dictionary structure used by xmltodict Python package so I can use the unparse method to create the XML. But I think I reached a point where xmltodict can't help me. I have actions in this dictionary format, highly nested each, something like this, just much more complex:
action = {
    "@id": 1,
    "some-nested-stuff":
        {"@attr1": "string value", "child": True}
}
Now I need to group some actions similar to this:
<action id=1>...</action>
<action-group groupId=1>
<action id=2>...</action>
<action id=3>...</action>
</action-group>
<action id=4>...</action>
And yes, the first action needs to go before the action group and the fourth action after it. It seems impossible to do it with just xmltodict. I was thinking that I could create each action's XML tree as an lxml object from these dictionaries, and then merge those objects into a whole XML document. I think that it wouldn't be a big task, but there might be a ready package for that. Is there one?
The alternative solution — that I try to avoid if possible — is to rewrite the project from scratch using just lxml. Or is there a way to create that XML using just xmltodict but not the xml/lxml packages?
It seems that there is no such package. So far I have this solution. It doesn't handle #text keys, and there can be problems with namespaces.
"""
Converts the dictionary used by xmltodict package to represent XMLs
to lxml.
"""
from typing import Any, Dict, Tuple
from lxml import etree

XmlDictType = Dict[str, Any]
element = etree.Element("for-creating-types")
ElementType = type(element)
ElementTreeType = type(etree.ElementTree(element))


def convert(xml_dict: XmlDictType) -> ElementType:
    root_name = list(xml_dict)[0]
    inside_dict = xml_dict[root_name]
    attrs, children = split_attrs_and_children(inside_dict)
    root = etree.Element(root_name, **attrs)
    convert_children(root, children)
    return root


def split_attrs_and_children(xml_dict: XmlDictType) -> Tuple[Dict[str, Any], XmlDictType]:
    """Split the categories and fix the types"""
    def fix_types(v):
        # bool must be tested before int: bool is a subclass of int
        if isinstance(v, bool):
            return {True: "true", False: "false"}[v]
        elif isinstance(v, (int, float)):
            return str(v)
        else:
            return v
    attrs = {k[1:]: fix_types(v) for k, v in xml_dict.items() if k.startswith("@")}
    children = {k: fix_types(v) for k, v in xml_dict.items() if not (k.startswith("@") or k.startswith("#"))}
    return attrs, children


def convert_children(parent: ElementType, children: XmlDictType) -> ElementType:
    for child_name, value in children.items():
        if isinstance(value, dict):
            attrs, grandchildren = split_attrs_and_children(value)
            child = etree.SubElement(parent, child_name, **attrs)
            convert_children(child, grandchildren)
        elif isinstance(value, list):
            for v in value:
                etree.SubElement(parent, child_name).text = v
        else:
            etree.SubElement(parent, child_name).text = value
    return parent
You can convert for example this dictionary:
xml_dict = {
    "mydocument": {
        "@has": "an attribute",
        "and": {
            "many": [
                "elements",
                "more elements"
            ]
        },
        "plus": {
            "@a": "complex",
            "#text": "element as well"
        }
    }
}
Note that the #text line is not included yet.
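For comparison, here is a rough sketch of the same idea that also handles `#text` and shows the grouping step from the question. It is written under several assumptions: it uses the standard library's `xml.etree.ElementTree` instead of lxml (the lxml.etree API is largely compatible), it follows xmltodict's usual conventions of an `@` prefix for attributes and a `#text` key for element text, it ignores namespaces, and the names `to_text`, `build` and the `actions` root tag are invented for illustration.

```python
import xml.etree.ElementTree as ET


def to_text(value):
    """Render a scalar the way XML attributes and text usually expect."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    return str(value)


def build(tag, content):
    """Build one element from an xmltodict-style value (hypothetical helper)."""
    elem = ET.Element(tag)
    if content is None:
        return elem  # empty element
    if not isinstance(content, dict):
        elem.text = to_text(content)
        return elem
    for key, value in content.items():
        if key.startswith("@"):
            elem.set(key[1:], to_text(value))
        elif key == "#text":
            elem.text = to_text(value)
        else:
            # A list means repeated child elements with the same tag.
            for item in value if isinstance(value, list) else [value]:
                elem.append(build(key, item))
    return elem


# Build four actions separately, then place two of them inside a group.
actions = [build("action", {"@id": i, "child": True}) for i in range(1, 5)]
root = ET.Element("actions")
root.append(actions[0])
group = ET.SubElement(root, "action-group", groupId="1")
group.extend(actions[1:3])
root.append(actions[3])
print(ET.tostring(root, encoding="unicode"))
```

Building each action as its own element first is what makes the ordering controllable: the elements can then be appended to the root or to the group in whatever sequence is needed.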

rename duplicate key in json file python

I have json file which has duplicate keys.
Example
{
"data":"abc",
"data":"xyz"
}
I want to make this as
{
"data1":"abc",
"data2":"xyz"
}
I tried using object_pairs_hook with json.loads, but it is not working. Could anyone help me with a Python solution for the above problem?
You can pass the load method a keyword parameter to handle pairing, there you can check for duplicates like this:
import json
from collections import Counter

raw_text_data = """{
    "data":"abc",
    "data":"xyz",
    "data":"xyz22"
}"""

def manage_duplicates(pairs):
    d = {}
    k_counter = Counter()
    for k, v in pairs:
        d[k+str(k_counter[k])] = v
        k_counter[k] += 1
    return d

print(json.loads(raw_text_data, object_pairs_hook=manage_duplicates))
I used Counter to count each key. Every key is saved as k + str(k_counter[k]), so it is stored with a trailing number.
P.S
If you have control on the input, I would highly recommend to change your json structure to:
{"data": ["abc", "xyz"]}
The rfc 4627 for application/json media type recommends unique keys but it doesn't forbid them explicitly:
The names within an object SHOULD be unique.
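If the input cannot be changed, the list-based shape suggested in the P.S. can also be produced at load time. The following is a small sketch under that assumption; `collect_duplicates` is a hypothetical hook name, and it only groups duplicates within a single JSON object.

```python
import json


def collect_duplicates(pairs):
    """object_pairs_hook that turns repeated keys into a list of values."""
    result = {}
    repeated = set()  # keys that have already been converted to a list
    for key, value in pairs:
        if key not in result:
            result[key] = value
        elif key in repeated:
            result[key].append(value)
        else:
            result[key] = [result[key], value]
            repeated.add(key)
    return result


raw = '{"data": "abc", "data": "xyz", "other": 1}'
print(json.loads(raw, object_pairs_hook=collect_duplicates))
```

Tracking repeated keys in a separate set keeps a value that was already a list in the input from being mistaken for a collection of duplicates.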
A quick and dirty solution using re.
import re
s = '{ "data":"abc", "data":"xyz", "test":"one", "test":"two", "no":"numbering" }'
def find_dupes(s):
    keys = re.findall(r'"(\w+)":', s)
    return list(set(filter(lambda w: keys.count(w) > 1, keys)))
for key in find_dupes(s):
    for i in range(1, len(re.findall(r'"{}":'.format(key), s)) + 1):
        s = re.sub(r'"{}":'.format(key), r'"{}{}":'.format(key, i), s, count=1)
print(s)
Prints this string:
{
"data1":"abc",
"data2":"xyz",
"test1":"one",
"test2":"two",
"no":"numbering"
}

Accessing json type data without knowing layout of data?

I have a file with JSON data I am loading using json.load.
Suppose I want to put a variable in the json data, which references another data field. How can I process this reference in python?
eg:
{
"dictionary" : {
"list_1" : [
"item_1"
],
"list_2" : [
"$dictionary.list_1"
]
}
}
when I come across $, I then want list_2 to grab the data from: dictionary.list_1
and extend list_2, as if I had written in my python code:
jsonData["dictionary"]["list_2"].extend(jsonData["dictionary"]["list_1"])
As far as I know, there is nothing in the JSON standard for doing references. My first suggestion would be to use YAML which does have references in the form of Node Anchors. Python has a good implementation of YAML which supports those.
That being said, if you're set on using JSON, you'll have to roll your own implementation.
One possible example is below. It does not extend the current array with the referenced array, because that is ambiguous in the case of dicts; instead it replaces the reference with the value it refers to. Note that it doesn't handle malformed references, so you'll have to add the error checking yourself or guarantee that there are none. If you want it to extend instead of replace, you can change it; you know your use case better than I do. This is meant to give you a starting point.
def resolve_references(structure, sub_structure=None):
    if sub_structure is None:
        return resolve_references(structure, structure)
    if isinstance(sub_structure, list):
        tmp = []
        for item in sub_structure:
            tmp.append(resolve_references(structure, item))
        return tmp
    if isinstance(sub_structure, dict):
        tmp = {}
        for key, value in sub_structure.items():
            tmp[key] = resolve_references(structure, value)
        return tmp
    if isinstance(sub_structure, str):  # on Python 2, also check unicode
        if not sub_structure.startswith("$"):
            return sub_structure
        keys = sub_structure[1:].split(".")
        def get_value(obj, key):
            if isinstance(obj, dict):
                return obj[key]
            if isinstance(obj, list):
                return obj[int(key)]
            return value
        value = get_value(structure, keys[0])
        for key in keys[1:]:
            value = get_value(value, key)
        return value
    return sub_structure
Example usage:
>>> import json
>>> json_str = """
... {
... "dictionary" : {
... "list_1" : [
... "item_1"
... ],
...
... "list_2" : "$dictionary.list_1"
... }
... }
... """
>>> obj = json.loads(json_str)
>>> resolve_references(obj)
{u'dictionary': {u'list_2': [u'item_1'], u'list_1': [u'item_1']}}
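The question asked for the referenced list to be spliced into `list_2` rather than substituted for it. A variant along those lines might look like the sketch below. It rests on a few assumptions: any string starting with `$` is a reference, a reference found inside a list that points at another list is spliced in place, referenced values are not themselves re-expanded, and there is no cycle detection. The names `lookup`, `is_reference` and `expand` are invented for this example.

```python
import json


def lookup(root, reference):
    """Follow a '$a.b.0' style reference from the top of the structure."""
    current = root
    for key in reference[1:].split("."):
        if isinstance(current, list):
            current = current[int(key)]
        else:
            current = current[key]
    return current


def is_reference(value):
    return isinstance(value, str) and value.startswith("$")


def expand(root):
    """Return a copy of root with references replaced or spliced into lists."""
    def walk(node):
        if isinstance(node, dict):
            return {key: walk(value) for key, value in node.items()}
        if isinstance(node, list):
            out = []
            for item in node:
                if is_reference(item):
                    target = lookup(root, item)
                    if isinstance(target, list):
                        out.extend(target)  # splice the referenced list in
                    else:
                        out.append(target)
                else:
                    out.append(walk(item))
            return out
        if is_reference(node):
            return lookup(root, node)
        return node
    return walk(root)


data = json.loads(
    '{"dictionary": {"list_1": ["item_1"], "list_2": ["$dictionary.list_1"]}}'
)
print(expand(data))
```

Using an inner `walk` function instead of a `None` default for the current node means JSON `null` values in the data are passed through rather than being confused with "start at the root".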

Parsing json file with changeable structure in Python

I'm using Yahoo Placemaker API which gives different structure of json depending on input.
Simple json file looks like this:
{
    'document': {
        'itemDetails': {
            'id': '0',
            'prop1': '1',
            'prop2': '2'
        },
        'other': {
            'propA': 'A',
            'propB': 'B'
        }
    }
}
When I want to access itemDetails I simply write json_file['document']['itemDetails'].
But when I get more complicated response, such as
{
    'document': {
        '1': {
            'itemDetails': {
                'id': '1',
                'prop1': '1',
                'prop2': '2'
            }
        },
        '0': {
            'itemDetails': {
                'id': '0',
                'prop1': '1',
                'prop2': '2'
            }
        },
        '2': {
            'itemDetails': {
                'id': '1',
                'prop1': '1',
                'prop2': '2'
            }
        },
        'other': {
            'propA': 'A',
            'propB': 'B'
        }
    }
}
the solution obviously does not work.
I use id, prop1 and prop2 to create objects.
What would be the best approach to automatically access itemDetails in the second case without writing json_file['document']['0']['itemDetails'] ?
If I understand correctly, you want to loop through all of json_file['document']['0']['itemDetails'], json_file['document']['1']['itemDetails'], ...
If that's the case, then:
item_details = {}
for key, value in json_file['document'].items():
    item_details[key] = value['itemDetails']
Or, a one-liner:
item_details = {k: v['itemDetails'] for k, v in json_file['document'].items()}
Then, you would access them as item_details['0'], item_details['1'], ...
Note: You can use the integers 0 and 1 as keys instead of the strings '0' and '1' by using int(key) or int(k).
Edit:
If you want to access both cases seamlessly (whether there is one result or many), you could check:
if 'itemDetails' in json_file['document']:
    item_details = {'0': json_file['document']['itemDetails']}
else:
    item_details = {k: v['itemDetails'] for k, v in json_file['document'].items() if k != 'other'}
Then loop through the item_details dict.
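When the nesting depth is not known in advance, another option is to search for the key recursively instead of testing for each shape. This is only a sketch: `find_all` is a hypothetical helper, and it assumes that an `itemDetails` value never contains another `itemDetails` that should also be reported.

```python
def find_all(node, wanted):
    """Yield every value stored under the key `wanted`, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == wanted:
                yield value
            else:
                yield from find_all(value, wanted)
    elif isinstance(node, list):
        for item in node:
            yield from find_all(item, wanted)


simple = {'document': {'itemDetails': {'id': '0'}, 'other': {'propA': 'A'}}}
multi = {'document': {'1': {'itemDetails': {'id': '1'}},
                      '0': {'itemDetails': {'id': '0'}},
                      'other': {'propA': 'A'}}}
for details in find_all(multi, 'itemDetails'):
    print(details['id'])
```

The same call works on both response shapes, so the calling code does not need to know whether one result or many came back.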

Using Python simplejson to return pregenerated json

I have a GeoDjango model object that I want to serialize to json. I do this in my view:
lat = float(request.GET.get('lat'))
lng = float(request.GET.get('lng'))
a = Authority.objects.get(area__contains=Point(lng, lat))
if a:
    return HttpResponse(simplejson.dumps({'name': a.name,
                                          'area': a.area.geojson,
                                          'id': a.id}),
                        mimetype='application/json')
The problem is that simplejson considers the a.area.geojson as a simple string, even though it is beautiful pre-generated json. This is easily fixed in the client by eval()'ing the area-string, but I would like to do it proper. Can I tell simplejson that a particular string is already json and should be used as-is (and not returned as a simple string)? Or is there another workaround?
UPDATE
Just to clarify, this is the json currently returned:
{
"id": 95,
"name": "Roskilde",
"area": "{ \"type\": \"MultiPolygon\", \"coordinates\": [ [ [ [ 12.078701, 55.649927 ], ... ] ] ] }"
}
The challenge is to have "area" be a json dictionary instead of a simple string.
I think the clean way to do this is by extending JSONEncoder and creating an encoder that detects whether the given object is already JSON. If it is, it just returns it. If it's not, it uses the ordinary JSONEncoder to encode it.
class SkipJSONEncoder(simplejson.JSONEncoder):
    # Caveat: default() is only called for objects the encoder cannot
    # serialize natively, so plain str values do not reach this hook.
    def default(self, obj):
        if isinstance(obj, str) and (obj[0]=='{') and (obj[-1]=='}'):
            return obj
        return simplejson.JSONEncoder.default(self, obj)
and in your view, you use:
simplejson.dumps(..., cls=SkipJSONEncoder)
If you have a cleaner way to test that something is already JSON, please use it (my way - looking for strings that start in '{' and end in '}' is ugly).
EDITED after author's edit:
Can you do something like this:
lat = float(request.GET.get('lat'))
lng = float(request.GET.get('lng'))
a = Authority.objects.get(area__contains=Point(lng, lat))
if a:
    json = simplejson.dumps({'name': a.name,
                             'area': "{replaceme}",
                             'id': a.id})
    return HttpResponse(json.replace('"{replaceme}"', a.area.geojson),
                        mimetype='application/json')
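A simpler workaround, at the cost of one extra parse, is to decode the pre-generated string back into Python objects before serializing the whole response. The sketch below uses the standard library `json` module, whose `loads`/`dumps` calls mirror simplejson's; `build_payload` is a hypothetical helper, and the sample values stand in for `a.name`, `a.id` and `a.area.geojson`.

```python
import json


def build_payload(name, ident, area_geojson):
    """Embed an already-serialized JSON string as a real nested object."""
    return json.dumps({
        "name": name,
        "area": json.loads(area_geojson),  # parse it so it is not re-quoted
        "id": ident,
    })


area = '{ "type": "MultiPolygon", "coordinates": [ [ [ [ 12.078701, 55.649927 ] ] ] ] }'
print(build_payload("Roskilde", 95, area))
```

This avoids string substitution on the serialized output, so it cannot be broken by a placeholder that happens to appear elsewhere in the data.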
