I have a giant dict with a lot of nested dicts -- like a giant tree, and the depth is unknown.
I need a function, something like find_value(), that takes a dict and a value (as a string) and returns a list of lists, each of which is a "path" (the sequential chain of keys from the first key to the key, or the key's value, where the sought value was found). If nothing is found, it returns an empty list.
I wrote this code:
def find_value(dict, sought_value, current_path, result):
    for key,value in dict.items():
        current_path.pop()
        current_path.append(key)
        if sought_value in key:
            result.append(current_path)
        if type(value) == type(''):
            if sought_value in value:
                result.append(current_path+[value])
        else:
            current_path.append(key)
            result = find_value(value, sought_value, current_path, result)
    current_path.pop()
    return result
I call this function to test:
result = find_value(self.dump, sought_value, ['START_KEY_FOR_DELETE'], [])
if not len(result):
    print "forgive me, mylord, i'm afraid we didn't find him.."
elif len(result) == 1:
    print "bless gods, for all that we have one match, mylord!"
For some inexplicable reason, my implementation of this function fails some of my tests. I started to debug and found out that even if current_path prints correct things (it always does, I checked!), the result is inexplicably corrupted. Maybe it is because of recursion magic?
Can anyone help me with this problem? Maybe there is a simple solution for my tasks?
When you write result.append(current_path), you're not copying current_path, which continues to mutate. Change it to result.append(current_path[:]).
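To illustrate the copying point, here is a minimal sketch of the search that builds a fresh path list for every key instead of sharing one mutable list. It is written for Python 3 and assumes all keys are strings; the parameter name tree and the default arguments are my own choices, not part of the original code.

```python
def find_value(tree, sought_value, current_path=(), result=None):
    """Collect every key path whose key or string value contains sought_value."""
    if result is None:
        result = []
    for key, value in tree.items():
        # A new list per key, so paths already stored in result never change
        path = list(current_path) + [key]
        if sought_value in key:
            result.append(path)
        if isinstance(value, dict):
            find_value(value, sought_value, path, result)
        elif isinstance(value, str) and sought_value in value:
            result.append(path + [value])
    return result


data = {'foo': {'sub1': 'blah'}, 'bar': {'sub2': 'whatever'}, 'baz': 'blah'}
paths = find_value(data, 'blah')
print(paths)
```

Because no list is ever shared between iterations, there is nothing left to pop and nothing that can be corrupted after it has been appended.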
I doubt you can do much to optimize a recursive search like that. Assuming there are many lookups on the same dictionary, and the dictionary doesn't change once loaded, then you can index it to get O(1) lookups...
def build_index(src, dest, path=[]):
    for k, v in src.iteritems():
        fk = path+[k]
        if isinstance(v, dict):
            build_index(v, dest, fk)
        else:
            try:
                dest[v].append(fk)
            except KeyError:
                dest[v] = [fk]
>>> data = {'foo': {'sub1': 'blah'}, 'bar': {'sub2': 'whatever'}, 'baz': 'blah'}
>>> index = {}
>>> build_index(data, index)
>>> index
{'blah': [['baz'], ['foo', 'sub1']], 'whatever': [['bar', 'sub2']]}
>>> index['blah']
[['baz'], ['foo', 'sub1']]
I have this nested python dictionary
dictionary = {'a':'1', 'b':{'c':'2', 'd':{'z':'5', 'e':{'f':'13', 'g':'14'}}}}
So the recommended output will be:
output = ['a:1', 'b:c:2', 'b:d:z:5', 'b:d:e:f:13', 'b:d:e:g:13']
I'd like to do this both with a recursive function and without one.
In cases like this, I always like to try and solve the easy part first.
def flatten_dict(dictionary):
    output = []
    for key, item in dictionary.items():
        if isinstance(item, dict):
            output.append(f'{key}:???')  # Hm, here is the difficult part
        else:
            output.append(f'{key}:{item}')
    return output
Trying flatten_dict(dictionary) now prints ['a:1', 'b:???'] which is obviously not good enough. For one thing, the list has three items too few.
First, I'd like to switch to using generator functions. This is more complicated for now, but will pay off later.
def flatten_dict(dictionary):
    return list(flatten_dict_impl(dictionary))

def flatten_dict_impl(dictionary):
    for key, item in dictionary.items():
        if isinstance(item, dict):
            yield f'{key}:???'
        else:
            yield f'{key}:{item}'
No change in the output yet. Time to go recursive.
You want the output to be a flat list, so that means we have to yield multiple things in the case item is a dictionary. Only, what things? Let's try plugging in a recursive call to flatten_dict_impl on this subdictionary, that seems the most straightforward way to go.
# flatten_dict is unchanged
def flatten_dict_impl(dictionary):
    for key, item in dictionary.items():
        if isinstance(item, dict):
            for value in flatten_dict_impl(item):
                yield f'{key}:{value}'
        else:
            yield f'{key}:{item}'
The output is now ['a:1', 'b:c:2', 'b:d:z:5', 'b:d:e:f:13', 'b:d:e:g:14'], which is the output you wanted, except the final 14, but I think that's a typo on your part.
Now the non-recursive route. For that we need to manage some state ourselves, because we need to know how deep we are.
def flatten_dict_nonrecursive(dictionary):
    return list(flatten_dict_nonrecursive_impl(dictionary))

def flatten_dict_nonrecursive_impl(dictionary):
    dictionaries = [iter(dictionary.items())]
    keys = []
    while dictionaries:
        try:
            key, value = next(dictionaries[-1])
        except StopIteration:
            dictionaries.pop()
            if keys:
                keys.pop()
        else:
            if isinstance(value, dict):
                keys.append(key)
                dictionaries.append(iter(value.items()))
            else:
                yield ':'.join(keys + [key, value])
Now this gives the right output but is a lot less easy to understand, and a lot longer. It took a lot longer for me to get right too. There may be shorter and more obvious ways to do it that I missed, but in general recursive problems are easier to solve with recursive functions.
Such an approach can still be useful: if your dictionaries are nested hundreds or thousands of levels deep, then trying to do it recursively will likely overflow the stack.
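As a rough illustration of that point, here is a shorter stack-based sketch (the function name flatten_iterative is mine, not from the code above) run against a dictionary nested 5000 levels deep, which is well past Python's usual default recursion limit. Note that this variant emits the leaves of a level before descending into its sub-dictionaries, so the order of the output can differ from the recursive version on other inputs.

```python
def flatten_iterative(dictionary):
    """Flatten nested dicts into 'k1:k2:value' strings using an explicit stack."""
    output = []
    stack = [((), dictionary)]  # pairs of (keys so far, dict still to visit)
    while stack:
        keys, current = stack.pop()
        for key, value in current.items():
            if isinstance(value, dict):
                stack.append((keys + (key,), value))
            else:
                output.append(':'.join(keys + (key, str(value))))
    return output


# Build a chain nested 5000 levels deep without using recursion
deep = {'leaf': 'x'}
for _ in range(5000):
    deep = {'k': deep}

result = flatten_iterative(deep)
print(len(result), result[0][-6:])
```

The nesting depth only affects how long the keys tuple grows, not the depth of the Python call stack.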
I hope this helps. Let me know if I need to go into more detail or something.
You can use a NestedDict. First install ndicts
pip install ndicts
Then:
from ndicts.ndicts import NestedDict
dictionary = {'a': '1', 'b': {'c' :'2', 'd': {'z': '5', 'e': {'f': '13', 'g': '14'}}}}
nd = NestedDict(dictionary)
output = [":".join((*key, value)) for key, value in nd.items()]
I haven't found out whether there is a way to do this.
Let's say I receive a JSON object like this:
{'1_data':{'4_data':[{'5_data':'hooray'}, {'3_data':'hooray2'}], '2_data':[]}}
It's hard to say at a glance how I should get the value of the 3_data key: data['1_data']['4_data'][1]['3_data']
I know about pprint, it helps to understand structure a bit.
But sometimes data is huge, and it takes time
Are there any methods that may help me with that?
Here is a family of recursive generators that can be used to search through an object composed of dicts and lists. find_key yields a tuple containing a list of the dictionary keys and list indices that lead to the key that you pass in; the tuple also contains the value associated with that key. Because it's a generator, it will find all matching keys if the object contains multiple matching keys, if desired.
def find_key(obj, key):
    if isinstance(obj, dict):
        yield from iter_dict(obj, key, [])
    elif isinstance(obj, list):
        yield from iter_list(obj, key, [])

def iter_dict(d, key, indices):
    for k, v in d.items():
        if k == key:
            yield indices + [k], v
        if isinstance(v, dict):
            yield from iter_dict(v, key, indices + [k])
        elif isinstance(v, list):
            yield from iter_list(v, key, indices + [k])

def iter_list(seq, key, indices):
    for k, v in enumerate(seq):
        if isinstance(v, dict):
            yield from iter_dict(v, key, indices + [k])
        elif isinstance(v, list):
            yield from iter_list(v, key, indices + [k])

# test
data = {
    '1_data': {
        '4_data': [
            {'5_data': 'hooray'},
            {'3_data': 'hooray2'}
        ],
        '2_data': []
    }
}

for t in find_key(data, '3_data'):
    print(t)
output
(['1_data', '4_data', 1, '3_data'], 'hooray2')
To get a single key list you can pass find_key to the next function. And if you want to use a key list to fetch the associated value you can use a simple for loop.
seq, val = next(find_key(data, '3_data'))
print('seq:', seq, 'val:', val)
obj = data
for k in seq:
    obj = obj[k]
print('obj:', obj, obj == val)
output
seq: ['1_data', '4_data', 1, '3_data'] val: hooray2
obj: hooray2 True
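If you fetch values by key list often, the for loop above can be folded into a small helper. This is only a sketch: get_by_path is a hypothetical name of mine, built on the standard functools.reduce and operator.getitem.

```python
from functools import reduce
from operator import getitem


def get_by_path(obj, path):
    """Follow a list of dict keys and list indices into a nested object."""
    # reduce applies obj = obj[k] for each k in path, starting from obj
    return reduce(getitem, path, obj)


data = {'1_data': {'4_data': [{'5_data': 'hooray'}, {'3_data': 'hooray2'}], '2_data': []}}
print(get_by_path(data, ['1_data', '4_data', 1, '3_data']))
```

An empty path simply returns the object itself, and a wrong key or index raises the usual KeyError or IndexError.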
If the key may be missing, then give next an appropriate default tuple. Eg:
seq, val = next(find_key(data, '6_data'), ([], None))
print('seq:', seq, 'val:', val)
if seq:
    obj = data
    for k in seq:
        obj = obj[k]
    print('obj:', obj, obj == val)
output
seq: [] val: None
Note that this code is for Python 3. To run it on Python 2 you need to replace all the yield from statements, eg replace
yield from iter_dict(obj, key, [])
with
for u in iter_dict(obj, key, []):
    yield u
How it works
To understand how this code works you need to be familiar with recursion and with Python generators. You may also find this page helpful: Understanding Generators in Python; there are also various Python generators tutorials available online.
The Python object returned by json.load or json.loads is generally a dict, but it can also be a list. We pass that object to the find_key generator as the obj arg, along with the key string that we want to locate. find_key then calls either iter_dict or iter_list, as appropriate, passing them the object, the key, and an empty list indices, which is used to collect the dict keys and list indices that lead to the key we want.
iter_dict iterates over each (k, v) pair at the top level of its d dict arg. If k matches the key we're looking for then the current indices list is yielded with k appended to it, along with the associated value. Because iter_dict is recursive the yielded (indices list, value) pairs get passed up to the previous level of recursion, eventually making their way up to find_key and then to the code that called find_key. Note that this is the "base case" of our recursion: it's the part of the code that determines whether this recursion path leads to the key we want. If a recursion path never finds a key matching the key we're looking for then that recursion path won't add anything to indices and it will terminate without yielding anything.
If the current v is a dict, then we need to examine all the (key, value) pairs it contains. We do that by making a recursive call to iter_dict, passing that v as its starting object and the current indices list. If the current v is a list we instead call iter_list, passing it the same args.
iter_list works similarly to iter_dict except that a list doesn't have any keys, it only contains values, so we don't perform the k == key test, we just recurse into any dicts or lists that the original list contains.
The end result of this process is that when we iterate over find_key we get pairs of (indices, value) where each indices list is the sequence of dict keys and list indices that successfully terminate in a dict item with our desired key, and value is the value associated with that particular key.
If you'd like to see some other examples of this code in use please see how to modify the key of a nested Json and How can I select deeply nested key:values from dictionary in python.
Also take a look at my new, more streamlined show_indices function.
I have to convert a bunch of strings into numbers, process the numbers and convert back.
I thought of a map where I will add 2 keys when I've provided string:
Key1: (string, number);
Key2: (number, string).
But this is not optimal in terms of memory.
What I need to achieve, as an example:
my_cool_class.get('string') # outputs 1
my_cool_class.get(1) # outputs 'string'
Is there a better way to do this in Python?
Thanks in advance!
You can implement your own twoway dict like
class TwoWayDict(dict):
    def __len__(self):
        # each pair is stored twice; // keeps the result an int on Python 3
        return dict.__len__(self) // 2
    def __setitem__(self, key, value):
        dict.__setitem__(self, key, value)
        dict.__setitem__(self, value, key)

my_cool_class = TwoWayDict()
my_cool_class[1] = 'string'
print(my_cool_class[1])  # 'string'
print(my_cool_class['string'])  # 1
Instead of allocating more memory for a second dict, you can look up the key from the value; keep in mind that this costs you run time, since the lookup scans the values linearly.
mydict = {'george':16,'amber':19}
print (mydict.keys()[mydict.values().index(16)])
>>> 'george'
EDIT:
Notice that in Python 3, dict.values() (along with dict.keys() and dict.items()) returns a view, rather than a list. You therefore need to wrap your call to dict.values() in a call to list like so:
mydict = {'george':16,'amber':19}
print (list(mydict.keys())[list(mydict.values()).index(16)])
If optimal memory usage is an issue, you may not want to use Python in the first place. To solve your immediate problem, just add both the string and the number as keys to the dictionary. Remember that only a reference to the original objects will be stored. Additional copies will not be made:
d = {}
s = '123'
n = int(s)
d[s] = n
d[n] = s
Now you can access the value by the opposite key just like you wanted. This method has the advantage of O(1) lookup time.
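If you would rather keep the two directions apart, for example so that a string key can never shadow a number key, one option is a small wrapper around two dicts. The class name StringNumberMap and its add method are hypothetical, a sketch rather than something from the answers here; lookups stay O(1) and, as noted above, only references to the keys and values are stored.

```python
class StringNumberMap:
    """Hypothetical helper: maps strings to numbers and numbers back to strings."""

    def __init__(self):
        self._by_string = {}
        self._by_number = {}

    def add(self, string, number):
        # Store each direction in its own dict
        self._by_string[string] = number
        self._by_number[number] = string

    def get(self, key):
        # Pick the table from the type of the key being looked up
        table = self._by_string if isinstance(key, str) else self._by_number
        return table[key]


my_cool_class = StringNumberMap()
my_cool_class.add('string', 1)
print(my_cool_class.get('string'))
print(my_cool_class.get(1))
```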
You can create a dictionary of tuples; this way you just need to check the type of the argument to decide which element to return.
Example:
class your_cool_class(object):
    def __init__(self):
        # example of dictionary
        self.your_dictionary = {'3': ('3', 3), '4': ('4', 4)}
    def get(self, number):
        is_string = isinstance(number, str)
        number = str(number)
        n = self.your_dictionary.get(number)
        if n is not None:
            # a string argument returns the number, a number returns the string
            return n[1] if is_string else n[0]

>>> my_cool_class = your_cool_class()
>>> my_cool_class.get(3)
'3'
>>> my_cool_class.get('3')
3
I'm using Python 2.7 with plistlib to import a .plist in a nested dict/array form, then look for a particular key and delete it wherever I see it.
When it comes to the actual files we're working with in the office, I already know where to find the values -- but I wrote my script with the idea that I didn't, in the hopes that I wouldn't have to make changes in the future if the file structure changes or we need to do likewise to other similar files.
Unfortunately I seem to be trying to modify a dict while iterating over it, but I'm not certain how that's actually happening, since I'm using iteritems() and enumerate() to get generators and work with those instead of the object I'm actually working with.
def scrub(someobject, badvalue='_default'):  ##_default isn't the real variable
    """Walks the structure of a plistlib-created dict and finds all the badvalues and viciously eliminates them.
    Can optionally be passed a different key to search for."""
    count = 0
    try:
        iterator = someobject.iteritems()
    except AttributeError:
        iterator = enumerate(someobject)
    for key, value in iterator:
        try:
            scrub(value)
        except:
            pass
        if key == badvalue:
            del someobject[key]
            count += 1
    return "Removed {count} instances of {badvalue} from {file}.".format(count=count, badvalue=badvalue, file=file)
Unfortunately, when I run this on my test .plist file, I get the following error:
Traceback (most recent call last):
  File "formscrub.py", line 45, in <module>
    scrub(loadedplist)
  File "formscrub.py", line 19, in scrub
    for key, value in iterator:
RuntimeError: dictionary changed size during iteration
So the problem might be the recursive call to itself, but even then shouldn't it just be removing from the original object? I'm not sure how to avoid recursion (or if that's the right strategy) but since it's a .plist, I do need to be able to identify when things are dicts or lists and iterate over them in search of either (a) more dicts to search, or (b) the actual key-value pair in the imported .plist that I need to delete.
Ultimately, this is a partial non-issue, in that the files I'll be working with on a regular basis have a known structure. However, I was really hoping to create something that doesn't care about the nesting or order of the object it's working with, as long as it's a Python dict with arrays in it.
Adding or removing items to/from a sequence while iterating over this sequence is tricky at best, and just illegal (as you just discovered) with dicts. The right way to remove entries from a dict while iterating over it is to iterate on a snapshot of the keys. In Python 2.x, dict.keys() provides such a snapshot. So for dicts the solution is:
for key in mydict.keys():
    if key == bad_value:
        del mydict[key]
As mentioned by cpizza in a comment, for Python 3 you'll need to explicitly create the snapshot using list():
for key in list(mydict.keys()):
    if key == bad_value:
        del mydict[key]
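A small Python 3 sketch of the difference, using made-up sample data: deleting while iterating over the live dict raises the RuntimeError from the question, while iterating over a snapshot of the keys works. Building a new dict with a comprehension is a third option, not mentioned above, for when you don't need to modify the dict in place.

```python
original = {'keep': 1, '_default': 2, 'also_keep': 3}

# 1. Deleting while iterating over the live dict fails
broken = dict(original)
error_message = None
try:
    for key in broken:
        if key == '_default':
            del broken[key]
except RuntimeError as exc:
    error_message = str(exc)
print(error_message)

# 2. Iterating over a snapshot of the keys is safe
snapshot_cleaned = dict(original)
for key in list(snapshot_cleaned):
    if key == '_default':
        del snapshot_cleaned[key]

# 3. Alternative: build a new dict and leave the original untouched
rebuilt = {k: v for k, v in original.items() if k != '_default'}
print(snapshot_cleaned, rebuilt)
```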
For lists, trying to iterate on a snapshot of the indexes (ie for i in range(len(thelist)):) would result in an IndexError as soon as anything is removed (obviously, since at least the last index will no longer exist), and even if not, you might skip one or more items (since the removal of an item makes the sequence of indexes out of sync with the list itself). enumerate is safe against IndexError (since the iteration will stop by itself when there's no more 'next' item in the list), but you'll still skip items:
>>> mylist = list("aabbccddeeffgghhii")
>>> for x, v in enumerate(mylist):
...     if v in "bdfh":
...         del mylist[x]
>>> print mylist
['a', 'a', 'b', 'c', 'c', 'd', 'e', 'e', 'f', 'g', 'g', 'h', 'i', 'i']
Not quite a success, as you can see.
The known solution here is to iterate on reversed indexes, ie:
>>> mylist = list("aabbccddeeffgghhii")
>>> for x in reversed(range(len(mylist))):
...     if mylist[x] in "bdfh":
...         del mylist[x]
>>> print mylist
['a', 'a', 'c', 'c', 'e', 'e', 'g', 'g', 'i', 'i']
This works with reversed enumeration too, but we don't really care.
So to summarize: you need two different code paths for dicts and lists - and you also need to take care of "not container" values (values which are neither lists nor dicts), something you do not take care of in your current code.
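When the removal test only looks at the item itself, another common idiom (not used in this answer, shown here only as a sketch) is to rebuild the contents with a list comprehension and assign them back through a full slice, which modifies the same list object in place:

```python
mylist = list("aabbccddeeffgghhii")
alias = mylist  # a second reference to the same list object

# Slice assignment replaces the contents without creating a new list object
mylist[:] = [v for v in mylist if v not in "bdfh"]

print(mylist)
print(alias is mylist)
```

Because the list object is the same, any other reference to it (such as a parent dict holding it as a value) sees the filtered contents too.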
def scrub(obj, bad_key="_this_is_bad"):
    if isinstance(obj, dict):
        # the call to `list` is useless for py2 but makes
        # the code py2/py3 compatible
        for key in list(obj.keys()):
            if key == bad_key:
                del obj[key]
            else:
                scrub(obj[key], bad_key)
    elif isinstance(obj, list):
        for i in reversed(range(len(obj))):
            if obj[i] == bad_key:
                del obj[i]
            else:
                scrub(obj[i], bad_key)
    else:
        # neither a dict nor a list, do nothing
        pass
As a side note: never write a bare except clause. Never ever. This should be illegal syntax, really.
Here is a generalized version of bruno desthuilliers' answer, with a callable to test against the keys.
def clean_dict(obj, func):
    """
    This method scrolls the entire 'obj' to delete every key for which the 'callable' returns
    True
    :param obj: a dictionary or a list of dictionaries to clean
    :param func: a callable that takes a key in argument and return True for each key to delete
    """
    if isinstance(obj, dict):
        # the call to `list` is useless for py2 but makes
        # the code py2/py3 compatible
        for key in list(obj.keys()):
            if func(key):
                del obj[key]
            else:
                clean_dict(obj[key], func)
    elif isinstance(obj, list):
        for i in reversed(range(len(obj))):
            if func(obj[i]):
                del obj[i]
            else:
                clean_dict(obj[i], func)
    else:
        # neither a dict nor a list, do nothing
        pass
And an example with a regex callable:
import re
func = lambda key: re.match(r"^<div>", key)
clean_dict(obj, func)
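One thing to watch with a regex callable: clean_dict also calls func on list elements, which may be dicts or numbers, and re.match raises a TypeError for those. A hedged sketch with a type guard follows; starts_with_div and the sample data are made up for illustration, and clean_dict is repeated from above (without its docstring) so the block runs on its own.

```python
import re


def clean_dict(obj, func):
    # Same logic as the function above, repeated so this block is self-contained
    if isinstance(obj, dict):
        for key in list(obj.keys()):
            if func(key):
                del obj[key]
            else:
                clean_dict(obj[key], func)
    elif isinstance(obj, list):
        for i in reversed(range(len(obj))):
            if func(obj[i]):
                del obj[i]
            else:
                clean_dict(obj[i], func)


def starts_with_div(item):
    # Only strings are matched; dicts, lists and numbers are never deleted
    return isinstance(item, str) and re.match(r"<div>", item) is not None


obj = {'<div>x': 1, 'ok': [{'<div>y': 2, 'z': 3}, '<div>text', 'plain']}
clean_dict(obj, starts_with_div)
print(obj)
```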
def walk(d, badvalue, answer=None, sofar=None):
    if sofar is None:
        sofar = []
    if answer is None:
        answer = []
    for k,v in d.iteritems():
        if k == badvalue:
            answer.append(sofar + [k])
        if isinstance(v, dict):
            walk(v, badvalue, answer, sofar+[k])
    return answer

def delKeys(d, badvalue):
    for path in walk(d, badvalue):
        dd = d
        while len(path) > 1:
            dd = dd[path[0]]
            path.pop(0)
        dd.pop(path[0])
Output
In [30]: d = {1:{2:3}, 2:{3:4}, 5:{6:{2:3}, 7:{1:2, 2:3}}, 3:4}
In [31]: delKeys(d, 2)
In [32]: d
Out[32]: {1: {}, 3: 4, 5: {6: {}, 7: {1: 2}}}