Extract path for each terminal node

I have a python nested dictionary structure that looks like the below.
This is a small example but I have larger examples that can have varying levels of nesting.
From this, I need to extract a list with:
One record for each terminal 'leaf' node
A string, list, or object representing the logical path leading up to that node
(e.g. 'nodeid_3: X < 0.500007 and X < 0.279907')
I've spent the larger part of this weekend trying to get something working and am realizing just how bad I am with recursion.
# Extract json string
json_string = booster.get_dump(with_stats=True, dump_format='json')[0]
# Convert to python dictionary
json.loads(json_string)
{u'children': [{u'children': [
{u'cover': 2291, u'leaf': -0.0611795, u'nodeid': 3},
{u'cover': 1779, u'leaf': -0.00965727, u'nodeid': 4}],
u'cover': 4070,
u'depth': 1,
u'gain': 265.811,
u'missing': 3,
u'no': 4,
u'nodeid': 1,
u'split': u'X',
u'split_condition': 0.279907,
u'yes': 3},
{u'cover': 3930, u'leaf': -0.0611946, u'nodeid': 2}],
u'cover': 8000,
u'depth': 0,
u'gain': 101.245,
u'missing': 1,
u'no': 2,
u'nodeid': 0,
u'split': u'X',
u'split_condition': 0.500007,
u'yes': 1}

Your data structure is recursive. If a node has a children key, it is not terminal.
To analyze your data, you need a recursive function that keeps track of the ancestors (the path).
I would implement it like this:
import pprint

def find_path(obj, path=None):
    path = path or []
    if 'children' in obj:
        # Non-terminal node: record it and recurse into its children.
        child_obj = {k: v for k, v in obj.items()
                     if k in ['nodeid', 'split_condition']}
        child_path = path + [child_obj]
        children = obj['children']
        for child in children:
            find_path(child, child_path)
    else:
        # Terminal (leaf) node: print it with the path leading to it.
        pprint.pprint((obj, path))
If you call:
find_path(data)
You get 3 results:
({'cover': 2291, 'leaf': -0.0611795, 'nodeid': 3},
[{'nodeid': 0, 'split_condition': 0.500007},
{'nodeid': 1, 'split_condition': 0.279907}])
({'cover': 1779, 'leaf': -0.00965727, 'nodeid': 4},
[{'nodeid': 0, 'split_condition': 0.500007},
{'nodeid': 1, 'split_condition': 0.279907}])
({'cover': 3930, 'leaf': -0.0611946, 'nodeid': 2},
[{'nodeid': 0, 'split_condition': 0.500007}])
Of course, you can replace the call to pprint.pprint() with a yield to turn this function into a generator:
def iter_path(obj, path=None):
    path = path or []
    if 'children' in obj:
        child_obj = {k: v for k, v in obj.items()
                     if k in ['nodeid', 'split_condition']}
        child_path = path + [child_obj]
        children = obj['children']
        for child in children:
            # Equivalent without "yield from":
            # for o, p in iter_path(child, child_path):
            #     yield o, p
            yield from iter_path(child, child_path)
    else:
        yield obj, path
Note the usage of yield from for the recursive call. You use this generator like below:
for obj, path in iter_path(data):
    pprint.pprint((obj, path))
You can also change the way the child_obj object is built to match your needs.
To keep the order of objects: reverse the if condition: if 'children' not in obj: ….
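The question asks for a rule string such as 'nodeid_3: X < 0.500007 and X < 0.279907'. A minimal sketch of that follows. It assumes that the child whose nodeid equals the parent's 'yes' value is the branch where split < split_condition holds, and that every other child is the >= branch. It ignores the 'missing' field. The name leaf_rules is hypothetical.

```python
def leaf_rules(node, conditions=()):
    """Yield (leaf_nodeid, rule_string) for every leaf below node."""
    if 'children' not in node:
        # Leaf: join the conditions collected on the way down.
        yield node['nodeid'], ' and '.join(conditions)
        return
    for child in node['children']:
        # Assumption: the 'yes' child is the "split < split_condition" branch.
        op = '<' if child['nodeid'] == node['yes'] else '>='
        cond = '{} {} {}'.format(node['split'], op, node['split_condition'])
        yield from leaf_rules(child, conditions + (cond,))


# The tree from the question, reduced to the keys the sketch reads.
data = {
    'nodeid': 0, 'split': 'X', 'split_condition': 0.500007, 'yes': 1, 'no': 2,
    'children': [
        {'nodeid': 1, 'split': 'X', 'split_condition': 0.279907,
         'yes': 3, 'no': 4,
         'children': [{'nodeid': 3, 'leaf': -0.0611795},
                      {'nodeid': 4, 'leaf': -0.00965727}]},
        {'nodeid': 2, 'leaf': -0.0611946},
    ],
}

rules = dict(leaf_rules(data))
for nodeid, rule in sorted(rules.items()):
    print('nodeid_{}: {}'.format(nodeid, rule))
```

Passing the conditions down as an immutable tuple means each branch gets its own copy of the path, so sibling branches cannot overwrite each other.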

Related

Cannot get ordered result

I'm working with Python 3.5.2 and I'm trying to get a dictionary ordered by key by using OrderedDict.
Here is what I'm trying:
import re
from collections import OrderedDict
BRACKETS_PATTERN = re.compile(r"(?P<info>.*)?\((?P<bracket_info>.*?)\)")
def transform_vertical(vertical, trans=True):
    # elearning & Massive Online Open Courses (MOOCs) => ELEARNING_AND_MASSIVE_ONLINE_OPEN_COURSES
    # Repair & Maintenance (SMB) => SMB_REPAIR_AND_MAINTENANCE
    # Digital Advertising & Marketing/Untagged Agencies => DIGITAL_ADVERTISING_AND_MARKETING_OR_UNTAGGED_AGENCIES
    if not trans:
        return vertical
    else:
        v = vertical.replace(" & ", "_AND_").replace(", ", "_AND_").replace("/", "_OR_")
        brackets_search_result = BRACKETS_PATTERN.search(v)
        result = v
        if brackets_search_result:
            bracket_info = brackets_search_result.group("bracket_info")
            info = brackets_search_result.group("info")
            if bracket_info.upper() in ("SMB", "CBV"):  # todo more prefix
                result = bracket_info.upper() + "_" + info
            else:
                result = info
        result = result.replace(" ", "_").upper().strip("_")
        return result

VERTICAL_MAP = OrderedDict({
    "GAMING": OrderedDict({
        "MOBILE_GAMING": 1,
        "AR_OR_VR_GAMING": 1,
        "CONSOLE_AND_CROSS_PLATFORM_GAMING": 1,
        "ESPORTS": 1,
        "PC_GAMING": 1,
        "REAL_MONEY_GAMING": 1,
    }),
    "TRAVEL": OrderedDict({
        "AUTO_RENTAL": 1,
        "RAILROADS": 1,
        "HOTEL_AND_ACCOMODATION": 1,
        "RIDE_SHARING_OR_TAXI_SERVICES": 1,
        "TOURISM_AND_TRAVEL_SERVICES": 1,
        "TOURISM_BOARD": 1,
        "AIR": 1,
        "TRAVEL_AGENCIES_AND_GUIDES_AND_OTAS": 1,
        "CRUISES_AND_MARINE": 1,
    })
})
s = list(VERTICAL_MAP[transform_vertical("Gaming")].keys())
print(s)
But I get an unordered result like:
['REAL_MONEY_GAMING', 'AR_OR_VR_GAMING', 'MOBILE_GAMING', 'CONSOLE_AND_CROSS_PLATFORM_GAMING', 'ESPORTS', 'PC_GAMING']
Expected result:
[ 'MOBILE_GAMING', 'AR_OR_VR_GAMING','CONSOLE_AND_CROSS_PLATFORM_GAMING', 'ESPORTS', 'PC_GAMING', 'REAL_MONEY_GAMING']
What's wrong with my code, and how can I get an ordered result?

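Regular dictionaries are not insertion ordered in Python 3.5 (the language only guarantees that from Python 3.7 on).
You are instantiating the ordered dicts from regular dict literals, whose order is already arbitrary by the time OrderedDict receives them. Construct each of the ordered dicts from a list of (key, value) tuples instead.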
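As a minimal sketch of that suggestion, here is the GAMING entry from the question rebuilt from lists of (key, value) tuples. The TRAVEL entry would follow the same pattern.

```python
from collections import OrderedDict

# A list of tuples has a defined order, so OrderedDict can preserve it.
VERTICAL_MAP = OrderedDict([
    ("GAMING", OrderedDict([
        ("MOBILE_GAMING", 1),
        ("AR_OR_VR_GAMING", 1),
        ("CONSOLE_AND_CROSS_PLATFORM_GAMING", 1),
        ("ESPORTS", 1),
        ("PC_GAMING", 1),
        ("REAL_MONEY_GAMING", 1),
    ])),
])

gaming_keys = list(VERTICAL_MAP["GAMING"].keys())
print(gaming_keys)
```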

Dictionary: How to list every key path that contains a certain value?

Let's say I've got a nested dictionary of the form:
{'geo': {'bgcolor': 'white','lakecolor': 'white','caxis': {'gridcolor': 'white', 'linecolor': 'white',}},
'title': {'x': 0.05},
'yaxis': {'automargin': True,'linecolor': 'white','zerolinecolor': 'white','zerolinewidth': 2}
}
How can you work your way through that dict and make a list of each complete key path that contains the value 'white'?
A function defined by user jfs in the post "Search for a value in a nested dictionary python" lets you check whether 'white' occurs at least once, and it returns the path to the first occurrence:
# dictionary
d={'geo': {'bgcolor': 'white','lakecolor': 'white','caxis': {'gridcolor': 'white', 'linecolor': 'white',}},
'title': {'x': 0.05},
'yaxis': {'automargin': True,'linecolor': 'white','ticks': '','zerolinecolor': 'white','zerolinewidth': 2}
}
# function:
def getpath(nested_dict, value, prepath=()):
    for k, v in nested_dict.items():
        path = prepath + (k,)
        if v == value:  # found value
            return path
        elif hasattr(v, 'items'):  # v is a dict
            p = getpath(v, value, path)  # recursive call
            if p is not None:
                return p

getpath(d,'white')
# out:
('geo', 'bgcolor')
But 'white' occurs in other places too, for example:
1. d['geo']['lakecolor']
2. d['geo']['caxis']['gridcolor']
3. d['yaxis']['linecolor']
How can I make sure that the function finds all paths?
I've tried applying the function above until it returns none while eliminating found paths one by one, but that quickly turned into an ugly mess.
Thank you for any suggestions!

This is a perfect use case to write a generator:
def find_paths(haystack, needle):
    if haystack == needle:
        yield ()
    if not isinstance(haystack, dict):
        return
    for key, val in haystack.items():
        for subpath in find_paths(val, needle):
            yield (key, *subpath)
You can use it as follows:
d = {
'geo': {'bgcolor': 'white','lakecolor': 'white','caxis': {'gridcolor': 'white', 'linecolor': 'white',}},
'title': {'x': 0.05},
'yaxis': {'automargin': True,'linecolor': 'white','ticks': '','zerolinecolor': 'white','zerolinewidth': 2}
}
# you can iterate over the paths directly...
for path in find_paths(d, 'white'):
    print('found at path: ', path)

# ...or you can collect them into a list:
paths = list(find_paths(d, 'white'))
print('found at paths: ' + repr(paths))
The generator approach has the advantage that it doesn't need to create an object to keep all paths in memory at once; they can be processed one by one and immediately discarded. In this case, the memory savings would be rather modest, but in others they may be significant. Also, if a loop iterating over a generator is terminated early, the generator is not going to keep searching for more paths that would be later discarded anyway.
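To illustrate the early-termination point, here is a small sketch that asks the generator for only the first match with next(). The result shown in the comment assumes a Python version where dicts keep insertion order (3.7+). The function is repeated so the snippet runs on its own.

```python
def find_paths(haystack, needle):
    if haystack == needle:
        yield ()
    if not isinstance(haystack, dict):
        return
    for key, val in haystack.items():
        for subpath in find_paths(val, needle):
            yield (key, *subpath)


d = {
    'geo': {'bgcolor': 'white', 'lakecolor': 'white',
            'caxis': {'gridcolor': 'white', 'linecolor': 'white'}},
    'title': {'x': 0.05},
    'yaxis': {'automargin': True, 'linecolor': 'white', 'ticks': '',
              'zerolinecolor': 'white', 'zerolinewidth': 2},
}

# next() pulls a single path; the rest of the dict is never searched.
first = next(find_paths(d, 'white'), None)
print(first)  # ('geo', 'bgcolor') when dicts keep insertion order

# The second argument to next() is the default if nothing matches.
missing = next(find_paths(d, 'purple'), None)
```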

Just transform your function so that it returns a list and does not return as soon as something is found. Instead, append to or extend the list:
def getpath(nested_dict, value, prepath=()):
    p = []
    for k, v in nested_dict.items():
        path = prepath + (k,)
        if v == value:  # found value
            p.append(path)
        elif hasattr(v, 'items'):  # v is a dict
            p += getpath(v, value, path)  # recursive call
    return p
With your input data, this produces (order may vary on Python versions where dictionaries are unordered):
[('yaxis', 'linecolor'), ('yaxis', 'zerolinecolor'), ('geo', 'lakecolor'),
('geo', 'caxis', 'linecolor'), ('geo', 'caxis', 'gridcolor'), ('geo', 'bgcolor')]

Returning is what makes the result incomplete. Instead of returning, use a separate list to track your paths. I'm using the list cur_list here and printing it once the traversal has finished:
d = {
'geo': {'bgcolor': 'white',
'caxis': {'gridcolor': 'white', 'linecolor': 'white'},
'lakecolor': 'white'},
'title': {'x': 0.05},
'yaxis': {'automargin': True,
'linecolor': 'white',
'ticks': '',
'zerolinecolor': 'white',
'zerolinewidth': 2}
}
cur_list = []

def getpath(nested_dict, value, prepath=()):
    for k, v in nested_dict.items():
        path = prepath + (k,)
        if v == value:  # found value
            cur_list.append(path)
        elif isinstance(v, dict):  # v is a dict
            # The recursive call appends its matches to cur_list itself.
            getpath(v, value, path)

getpath(d,'white')
print(cur_list)
# RESULT:
# [('geo', 'bgcolor'), ('geo', 'caxis', 'gridcolor'), ('geo', 'caxis', 'linecolor'), ('geo', 'lakecolor'), ('yaxis', 'linecolor'), ('yaxis', 'zerolinecolor')]

I needed this functionality for traversing HDF files with h5py. This code is a slight alteration of the answer by user114332: it looks for keys instead of values and additionally yields the needle in the result. I am posting it in case it is useful to someone else.
import h5py

def find_paths(haystack, needle):
    if not isinstance(haystack, h5py.Group) and not isinstance(haystack, dict):
        return
    if needle in haystack:
        yield (needle,)
    for key, val in haystack.items():
        for subpath in find_paths(val, needle):
            yield (key, *subpath)
Execution:
sf = h5py.File("file.h5py", mode = "w")
g = sf.create_group("my group")
h = g.create_group("my2")
k = sf.create_group("two group")
l = k.create_group("my2")
a = l.create_group("my2")
for path in find_paths(sf, "my2"):
    print('found at path: ', path)
which prints the following
found at path: ('my group', 'my2')
found at path: ('two group', 'my2')
found at path: ('two group', 'my2', 'my2')

How do I skip to next entry if a key doesn't exist in dict

I have a dict of dicts, but a given entry might not exist. For example, I have the following dict where the entry for c is missing:
g = {
'a': {'w': 14, 'x': 7, 'y': 9},
'b': {'w': 9, 'c': 6}, # <- c is not in dict
'w': {'a': 14, 'b': 9, 'y': 2},
'x': {'a': 7, 'y': 10, 'z': 15},
'y': {'a': 9, 'w': 2, 'x': 10, 'z': 11},
'z': {'b': 6, 'x': 15, 'y': 11}
}
My current code
import heapq

start = 'a'
end = 'z'
queue, seen = [(0, start, [])], set()
while True:
    (distance, vertex, path) = heapq.heappop(queue)
    if vertex not in seen:
        path = path + [vertex]
        seen.add(vertex)
        if vertex == end:
            print(distance, path)
            break  # new line, based on solutions below
        # new line
        if vertex not in graph:  # new line
            continue  # new line
        for (next_v, d) in graph[vertex].items():
            heapq.heappush(queue, (distance + d, next_v, path))
Right now I am getting the error:
for (next_v, d) in graph[vertex].items():
KeyError: 'c'
EDIT 1
If key is not found in dict skip ahead.
EDIT 2
Even with the newly added code I get an error, this time:
(distance, vertex, path) = heapq.heappop(queue)
IndexError: index out of range
Here is the data file I use
https://s3-eu-west-1.amazonaws.com/citymapper-assets/citymapper-coding-test-graph.dat
Here is the file format:
<number of nodes>
<OSM id of node>
...
<OSM id of node>
<number of edges>
<from node OSM id> <to node OSM id> <length in meters>
...
<from node OSM id> <to node OSM id> <length in meters>
And here is the code to create the graph
from itertools import groupby, islice
from operator import itemgetter

with open(filename, 'r') as reader:
    num_nodes = int(reader.readline())
    edges = []
    # Skip the node-id lines and the edge-count line, then read the edges.
    for line in islice(reader, num_nodes + 1, None):
        values = line.split()
        values[2] = int(values[2])
        edges.append(tuple(values))
graph = {k: dict(x[1:] for x in grp) for k, grp in groupby(sorted(edges), itemgetter(0))}
Change start and end to:
start = '876500321'
end = '1524235806'
Any help/advice is highly appreciated.
Thanks

Before accessing graph[vertex], make sure it is in the dict:
if vertex not in graph:
    continue
for (next_v, d) in graph[vertex].items():
    heapq.heappush(queue, (distance + d, next_v, path))

You can check whether the vertex is in the graph before executing that final for loop:
if vertex in graph:
    for (next_v, d) in graph[vertex].items():
        heapq.heappush(queue, (distance + d, next_v, path))

You could also use .get with an empty {} as the default in case the key is not there, so that .items() won't break:
for (next_v, d) in graph.get(vertex, {}).items():
    heapq.heappush(queue, (distance + d, next_v, path))
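The error in EDIT 2 is what heapq.heappop raises on an empty list, which happens with while True once every reachable vertex has been visited without reaching end. A minimal sketch follows that combines the .get guard with a loop that stops when the queue is empty. The wrapper name shortest_path and the choice to return None for an unreachable target are assumptions made for illustration.

```python
import heapq


def shortest_path(graph, start, end):
    """Return (distance, path) from start to end, or None if unreachable."""
    queue, seen = [(0, start, [])], set()
    while queue:  # an empty queue ends the search instead of raising IndexError
        distance, vertex, path = heapq.heappop(queue)
        if vertex in seen:
            continue
        seen.add(vertex)
        path = path + [vertex]
        if vertex == end:
            return distance, path
        # Vertices with no outgoing edges (like 'c') simply contribute nothing.
        for next_v, d in graph.get(vertex, {}).items():
            heapq.heappush(queue, (distance + d, next_v, path))
    return None


g = {
    'a': {'w': 14, 'x': 7, 'y': 9},
    'b': {'w': 9, 'c': 6},
    'w': {'a': 14, 'b': 9, 'y': 2},
    'x': {'a': 7, 'y': 10, 'z': 15},
    'y': {'a': 9, 'w': 2, 'x': 10, 'z': 11},
    'z': {'b': 6, 'x': 15, 'y': 11},
}

result = shortest_path(g, 'a', 'z')
print(result)
```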

Function that generates dictionary with optional fields

I'm trying to use a function to generate a dictionary with some variable fields according to the arguments that I give to the function. The idea is to try multiple configurations and obtain different dictionaries.
I have a function already but it looks non pythonic and it looks very hardcoded.
def build_entry(prefix=None,
                field_a=None,
                field_b=None,
                quantity_a=None,
                quantity_b=None,
                ):
    fields = {}
    if prefix is not None:
        fields['prefix'] = prefix
    if field_a is not None:
        fields['field_a'] = field_a
    if field_b is not None:
        fields['field_b'] = field_b
    if quantity_a is not None:
        fields['quantity_a'] = quantity_a
    if quantity_b is not None:
        fields['quantity_b'] = quantity_b
    return fields
The idea is to call the function like this:
fields = build_entry(*config)
Input: [26, 0, None, None, 20]
Output: {'prefix': 26, 'field_a': 0, 'quantity_b': 20}
Input: [20, 5, None, None, None]
Output: {'prefix': 20, 'field_a':5}
Input: [None, None, 0, 5, None]
Output: {'field_b': 0, 'quantity_a':5}
Any idea how to make this function better or more pythonic? Or is there a function that already does this?
I'm using Python 2.7.

def build_entry(*values):
    keys = ['prefix', 'field_a', 'field_b', 'quantity_a', 'quantity_b']
    return {k: v for k, v in zip(keys, values) if v is not None}
And then called the same way:
In [1]: build_entry(*[26, 0, None, None, 20])
Out[1]: {'prefix': 26, 'field_a': 0, 'quantity_b': 20}

I think that you want something like this:
def build_entry(**kwargs):
    return kwargs

if __name__ == '__main__':
    print(build_entry(prefix=1, field_a='a'))
Outputs:
{'prefix': 1, 'field_a': 'a'}
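Note that returning kwargs unchanged keeps any field that is explicitly passed as None and accepts arbitrary field names. A possible variant that drops None values and rejects unknown names is sketched below. The ALLOWED tuple and the decision to raise TypeError are assumptions for illustration, not part of the answer above.

```python
# Hypothetical whitelist of field names, taken from the question.
ALLOWED = ('prefix', 'field_a', 'field_b', 'quantity_a', 'quantity_b')


def build_entry(**kwargs):
    unknown = set(kwargs) - set(ALLOWED)
    if unknown:
        raise TypeError('unexpected fields: {}'.format(sorted(unknown)))
    # Keep only the fields that were actually given a value.
    return {k: v for k, v in kwargs.items() if v is not None}


entry = build_entry(prefix=26, field_a=0, field_b=None, quantity_b=20)
print(entry)
```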

Python CFFI convert structure to dictionary

There is a way to initialize structure with dictionary:
fooData= {'y': 1, 'x': 2}
fooStruct = ffi.new("foo_t*", fooData)
fooBuffer = ffi.buffer(fooStruct)
Is there some ready function to do the conversion?
fooStruct = ffi.new("foo_t*")
(ffi.buffer(fooStruct))[:] = fooBuffer
fooData= convert_to_python( fooStruct[0] )
Do I have to use ffi.typeof("foo_t").fields by myself?
I came up with this code so far:
def __convert_struct_field( s, fields ):
    for field,fieldtype in fields:
        if fieldtype.type.kind == 'primitive':
            yield (field,getattr( s, field ))
        else:
            yield (field, convert_to_python( getattr( s, field ) ))

def convert_to_python(s):
    type=ffi.typeof(s)
    if type.kind == 'struct':
        return dict(__convert_struct_field( s, type.fields ) )
    elif type.kind == 'array':
        if type.item.kind == 'primitive':
            return [ s[i] for i in range(type.length) ]
        else:
            return [ convert_to_python(s[i]) for i in range(type.length) ]
    elif type.kind == 'primitive':
        return int(s)
Is there a faster way?

Arpegius' solution works fine for me, and is quite elegant. I implemented a solution based on Selso's suggestion to use inspect. dir() can be used in place of inspect.
from inspect import getmembers
from cffi import FFI
ffi = FFI()
from pprint import pprint

def cdata_dict(cd):
    if isinstance(cd, ffi.CData):
        try:
            return ffi.string(cd)
        except TypeError:
            try:
                return [cdata_dict(x) for x in cd]
            except TypeError:
                return {k: cdata_dict(v) for k, v in getmembers(cd)}
    else:
        return cd

foo = ffi.new("""
struct Foo {
    char name[6];
    struct {
        int a, b[3];
    } item;
} *""",{
    'name': b"Foo",
    'item': {'a': 3, 'b': [1, 2, 3]}
})
pprint(cdata_dict(foo))
Output:
{'item': {'a': 3, 'b': [1, 2, 3]}, 'name': b'Foo'}

Unfortunately, this code does not work for me: some struct members are "pointer" types, which leads to storing "none" in the dict.
I am a Python noob, but maybe the inspect module would be another starting point, and a shorter way to print "simple" data. We would then iterate over the result in order to unroll the data structure.
For example, with the following struct:
struct foo {
int a;
char b[10];
};
Using inspect.getmembers( obj ) I have the following result :
[('a', 10), ('b', <cdata 'char[10]' 0x7f0be10e2824>)]

Your code is fine.
Even if there was a built-in way in CFFI, it would not be what you need here. Indeed, you can say ffi.new("foo_t*", {'p': p1}) where p1 is another cdata, but you cannot recursively pass a dictionary containing more dictionaries. The same would be true in the opposite direction: you would get a dictionary that maps field names to "values", but the values themselves would be more cdata objects anyway, and not recursively more dictionaries.
