I would like to index a very large number of strings (mapping each string to a numeric value) but also be able to retrieve each string from its numeric index.
Using hash tables or a Python dict is not an option because of memory issues, so I decided to use a radix trie to store the strings: I can retrieve the index of any string very quickly and handle a very large number of strings.
My problem is that I also need to retrieve the strings from their numeric indexes, and if I maintain a "reverse index" list [string1, string2, ..., stringn] I'll lose the memory benefit of the trie.
I thought the "reverse index" could perhaps be a list of pointers to the last node of a kind-of-trie structure, but first, there are no pointers in Python, and second, I'm not sure I can get "node-level" access to the trie structure I'm currently using.
Does this kind of data structure already exist? And if not, how would you do this in Python?
As per What data structure to use to have O(log n) key AND value lookup?, you need two synchronized data structures for key and value lookups, each holding references to the other's leaf nodes.
The structure for the ID lookup can be anything with sufficient efficiency -- a balanced tree, a hash table, another trie.
To be able to extract the value from a leaf node reference, a trie needs to allow 1) leaf node references themselves (not necessarily a real Python reference; anything its API can use); 2) walking up the trie to extract the word from that reference.
Note that a reference is effectively a unique integer, so if your IDs are not larger than an integer, it makes sense to reuse something as IDs -- e.g. the trie node references themselves. Then, if the trie API can validate such a reference (i.e. tell whether it has a used node with that reference), this will act as the ID lookup and you don't need the 2nd structure at all! The IDs will be non-persistent, though, because reference values (effectively memory addresses) change between processes and runs.
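As a minimal sketch of the two-structure idea (assumption: a plain dict stands in for the trie here, purely to show how the two sides stay synchronized; all names are mine):

```python
class BiIndex:
    """Two synchronized lookups: key -> id and id -> key."""

    def __init__(self):
        self._key_to_id = {}   # stand-in for the radix trie
        self._id_to_key = []   # the "reverse index", holding key references

    def add(self, key):
        """Register key and return its id (the existing id if already present)."""
        if key in self._key_to_id:
            return self._key_to_id[key]
        idx = len(self._id_to_key)
        self._key_to_id[key] = idx
        self._id_to_key.append(key)
        return idx

    def key_of(self, idx):
        return self._id_to_key[idx]

    def id_of(self, key):
        return self._key_to_id[key]
```

With a real trie on the left-hand side, `_id_to_key` would hold node references instead of the strings themselves, which is what keeps the memory cost down.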
I'm answering my own question because I finally ended up creating my own data structure, perfectly suited to the word-to-index-to-word problem I had, using only Python 3 built-ins.
I tried to make it clean and efficient, but there's obviously room for improvement, and a C binding would be better.
So the final result is an indexedtrie class that looks like a Python dict (or a defaultdict if you invoke it with a default_factory parameter) but can also be queried like a list, because a kind of "reverse index" is automatically maintained.
The keys, which are stored in an internal radix trie, can be any subscriptable objects (bytes, strings, tuples, lists), and the values can be anything you want.
Also, the indexedtrie class is picklable, and you benefit from the advantages of radix tries regarding "prefix search" and that kind of thing!
Each key in the trie is associated with a unique integer index; you can retrieve the key from the index or the index from the key, and the whole thing is fast and memory safe, so I personally think it's one of the best data structures in the world and that it should be integrated into the Python standard library :).
Enough talking, here is the code, feel free to adapt and use it:
"""
A Python3 indexed trie class.
An indexed trie's key can be any subscriptable object.
Keys of the indexed trie are stored using a "radix trie", a space-optimized data-structure which has many advantages (see https://en.wikipedia.org/wiki/Radix_tree).
Also, each key in the indexed trie is associated with a unique index, which is built dynamically.
An indexed trie is used like a Python dictionary (or even a collections.defaultdict if you want), but its values can also be accessed or updated (but not created) like a list!
Example:
>>> t = indexedtrie()
>>> t["abc"] = "hello"
>>> t[0]
'hello'
>>> t["abc"]
'hello'
>>> t.index2key(0)
'abc'
>>> t.key2index("abc")
0
>>> t[:]
['hello']
>>> print(t)
{(0, abc): hello}
"""
__author__ = "#fbparis"
_SENTINEL = object()
class _Node(object):
"""
A single node in the trie.
"""
__slots__ = "_children", "_parent", "_index", "_key"
def __init__(self, key, parent, index=None):
self._children = set()
self._key = key
self._parent = parent
self._index = index
self._parent._children.add(self)
class IndexedtrieKey(object):
"""
A pair (index, key) acting as an indexedtrie's key
"""
__slots__ = "index", "key"
def __init__(self, index, key):
self.index = index
self.key = key
def __repr__(self):
return "(%d, %s)" % (self.index, self.key)
class indexedtrie(object):
"""
The indexed trie data-structure.
"""
__slots__ = "_children", "_indexes", "_values", "_nodescount", "_default_factory"
def __init__(self, items=None, default_factory=_SENTINEL):
"""
A list of items can be passed to initialize the indexed trie.
"""
self._children = set()
self.setdefault(default_factory)
self._indexes = []
self._values = []
self._nodescount = 0 # keeping track of nodes count is purely informational
if items is not None:
for k, v in items:
if isinstance(k, IndexedtrieKey):
self.__setitem__(k.key, v)
else:
self.__setitem__(k, v)
@classmethod
def fromkeys(cls, keys, value=_SENTINEL, default_factory=_SENTINEL):
"""
Build a new indexedtrie from a list of keys.
"""
obj = cls(default_factory=default_factory)
for key in keys:
if value is _SENTINEL:
if default_factory is not _SENTINEL:
obj[key] = obj._default_factory()
else:
obj[key] = None
else:
obj[key] = value
return obj
@classmethod
def fromsplit(cls, keys, value=_SENTINEL, default_factory=_SENTINEL):
"""
Build a new indexedtrie from a splitable object.
"""
obj = cls(default_factory=default_factory)
for key in keys.split():
if value is _SENTINEL:
if default_factory is not _SENTINEL:
obj[key] = obj._default_factory()
else:
obj[key] = None
else:
obj[key] = value
return obj
def setdefault(self, factory=_SENTINEL):
"""
Set the default factory (or a plain default value) used for missing keys.
"""
if factory is not _SENTINEL:
# indexed trie will act like a collections.defaultdict except in some cases because the __missing__
# method is not implemented here (on purpose).
# That means that a simple lookup on a non-existing key will return a default value without adding
# the key, which is the more logical behavior.
# Also means that if your default_factory is for example "list", you won't be able to create new
# items with "append" or "extend" methods which are updating the list itself.
# Instead you have to do something like trie["newkey"] += [...]
try:
_ = factory()
except TypeError:
# a default value is also accepted as default_factory, even "None"
self._default_factory = lambda: factory
else:
self._default_factory = factory
else:
self._default_factory = _SENTINEL
def copy(self):
"""
Return a pseudo-shallow copy of the indexedtrie.
Keys and nodes are deep-copied, but if you store referenced objects in values, only the references will be copied.
"""
return self.__class__(self.items(), default_factory=self._default_factory)
def __len__(self):
return len(self._indexes)
def __repr__(self):
if self._default_factory is not _SENTINEL:
default = ", default_value=%s" % self._default_factory()
else:
default = ""
return "<%s object at %s: %d items, %d nodes%s>" % (self.__class__.__name__, hex(id(self)), len(self), self._nodescount, default)
def __str__(self):
ret = ["%s: %s" % (k, v) for k, v in self.items()]
return "{%s}" % ", ".join(ret)
def __iter__(self):
return self.keys()
def __contains__(self, key_or_index):
"""
Return True if the key or index exists in the indexed trie.
"""
if isinstance(key_or_index, IndexedtrieKey):
return key_or_index.index >= 0 and key_or_index.index < len(self)
if isinstance(key_or_index, int):
return key_or_index >= 0 and key_or_index < len(self)
if self._seems_valid_key(key_or_index):
try:
node = self._get_node(key_or_index)
except KeyError:
return False
else:
return node._index is not None
raise TypeError("invalid key type")
def __getitem__(self, key_or_index):
"""
"""
if isinstance(key_or_index, IndexedtrieKey):
return self._values[key_or_index.index]
if isinstance(key_or_index, int) or isinstance(key_or_index, slice):
return self._values[key_or_index]
if self._seems_valid_key(key_or_index):
try:
node = self._get_node(key_or_index)
except KeyError:
if self._default_factory is _SENTINEL:
raise
else:
return self._default_factory()
else:
if node._index is None:
if self._default_factory is _SENTINEL:
raise KeyError
else:
return self._default_factory()
else:
return self._values[node._index]
raise TypeError("invalid key type")
def __setitem__(self, key_or_index, value):
"""
"""
if isinstance(key_or_index, IndexedtrieKey):
self._values[key_or_index.index] = value
elif isinstance(key_or_index, int):
self._values[key_or_index] = value
elif isinstance(key_or_index, slice):
raise NotImplementedError
elif self._seems_valid_key(key_or_index):
try:
node = self._get_node(key_or_index)
except KeyError:
# create a new node
self._add_node(key_or_index, value)
else:
if node._index is None:
# if node exists but not indexed, we index it and update the value
self._add_to_index(node, value)
else:
# else we update its value
self._values[node._index] = value
else:
raise TypeError("invalid key type")
def __delitem__(self, key_or_index):
"""
"""
if isinstance(key_or_index, IndexedtrieKey):
node = self._indexes[key_or_index.index]
elif isinstance(key_or_index, int):
node = self._indexes[key_or_index]
elif isinstance(key_or_index, slice):
raise NotImplementedError
elif self._seems_valid_key(key_or_index):
node = self._get_node(key_or_index)
if node._index is None:
raise KeyError
else:
raise TypeError("invalid key type")
# switch last index with deleted index (except if deleted index is last index)
last_node, last_value = self._indexes.pop(), self._values.pop()
if node._index != last_node._index:
last_node._index = node._index
self._indexes[node._index] = last_node
self._values[node._index] = last_value
if len(node._children) > 1:
#case 1: node has more than 1 child, only turn index off
node._index = None
elif len(node._children) == 1:
# case 2: node has 1 child
child = node._children.pop()
child._key = node._key + child._key
child._parent = node._parent
node._parent._children.add(child)
node._parent._children.remove(node)
del(node)
self._nodescount -= 1
else:
# case 3: node has no child, check the parent node
parent = node._parent
parent._children.remove(node)
del(node)
self._nodescount -= 1
if hasattr(parent, "_index"):
if parent._index is None and len(parent._children) == 1:
node = parent._children.pop()
node._key = parent._key + node._key
node._parent = parent._parent
parent._parent._children.add(node)
parent._parent._children.remove(parent)
del(parent)
self._nodescount -= 1
@staticmethod
def _seems_valid_key(key):
"""
Return True if "key" can be a valid key (must be subscriptable).
"""
try:
_ = key[:0]
except TypeError:
return False
return True
def keys(self, prefix=None):
"""
Yield the keys stored in the indexedtrie, where each key is an IndexedtrieKey object.
If prefix is given, yield only keys of items with key matching the prefix.
"""
if prefix is None:
for i, node in enumerate(self._indexes):
yield IndexedtrieKey(i, self._get_key(node))
else:
if self._seems_valid_key(prefix):
empty = prefix[:0]
children = [(empty, prefix, child) for child in self._children]
while len(children):
_children = []
for key, prefix, child in children:
if prefix == child._key[:len(prefix)]:
_key = key + child._key
_children.extend([(_key, empty, _child) for _child in child._children])
if child._index is not None:
yield IndexedtrieKey(child._index, _key)
elif prefix[:len(child._key)] == child._key:
_prefix = prefix[len(child._key):]
_key = key + prefix[:len(child._key)]
_children.extend([(_key, _prefix, _child) for _child in child._children])
children = _children
else:
raise ValueError("invalid prefix type")
def values(self, prefix=None):
"""
Yield values stored in the indexedtrie.
If prefix is given, yield only values of items with key matching the prefix.
"""
if prefix is None:
for value in self._values:
yield value
else:
for key in self.keys(prefix):
yield self._values[key.index]
def items(self, prefix=None):
"""
Yield (key, value) pairs stored in the indexedtrie, where key is an IndexedtrieKey object.
If prefix is given, yield only (key, value) pairs of items with key matching the prefix.
"""
for key in self.keys(prefix):
yield key, self._values[key.index]
def show_tree(self, node=None, level=0):
"""
Pretty print the internal trie (recursive function).
"""
if node is None:
node = self
for child in node._children:
print("-" * level + "<key=%s, index=%s>" % (child._key, child._index))
if len(child._children):
self.show_tree(child, level + 1)
def _get_node(self, key):
"""
Return the node associated with key or raise a KeyError.
"""
children = self._children
while len(children):
notfound = True
for child in children:
if key == child._key:
return child
if child._key == key[:len(child._key)]:
children = child._children
key = key[len(child._key):]
notfound = False
break
if notfound:
break
raise KeyError
def _add_node(self, key, value):
"""
Add a new key to the trie and update indexes and values.
"""
children = self._children
parent = self
moved = None
done = len(children) == 0
# we want to insert key="abc"
while not done:
done = True
for child in children:
# assert child._key != key # uncomment if you don't trust me
if child._key == key[:len(child._key)]:
# case 1: child's key is "ab", insert "c" in child's children
parent = child
children = child._children
key = key[len(child._key):]
done = len(children) == 0
break
elif key == child._key[:len(key)]:
# case 2: child's key is "abcd", we insert "abc" in place of the child
# child's parent will be the inserted node and child's key is now "d"
parent = child._parent
moved = child
parent._children.remove(moved)
moved._key = moved._key[len(key):]
break
elif type(key) is type(child._key): # don't mess it up
# find longest common prefix
prefix = key[:0]
for i, c in enumerate(key):
if child._key[i] != c:
prefix = key[:i]
break
if prefix:
# case 3: child's key is abd, we spawn a new node with key "ab"
# to replace child ; child's key is now "d" and child's parent is
# the new created node.
# the new node will also be inserted as a child of this node
# with key "c"
node = _Node(prefix, child._parent)
self._nodescount += 1
child._parent._children.remove(child)
child._key = child._key[len(prefix):]
child._parent = node
node._children.add(child)
key = key[len(prefix):]
parent = node
break
# create the new node
node = _Node(key, parent)
self._nodescount += 1
if moved is not None:
# if we have moved an existing node, update it
moved._parent = node
node._children.add(moved)
self._add_to_index(node, value)
def _get_key(self, node):
"""
Rebuild key from a terminal node.
"""
key = node._key
while node._parent is not self:
node = node._parent
key = node._key + key
return key
def _add_to_index(self, node, value):
"""
Add a new node to the index.
Also record its value.
"""
node._index = len(self)
self._indexes.append(node)
self._values.append(value)
def key2index(self, key):
"""
key -> index
"""
if self._seems_valid_key(key):
node = self._get_node(key)
if node._index is not None:
return node._index
raise KeyError
raise TypeError("invalid key type")
def index2key(self, index):
"""
index or IndexedtrieKey -> key.
"""
if isinstance(index, IndexedtrieKey):
index = index.index
elif not isinstance(index, int):
raise TypeError("index must be an int")
if index < 0 or index >= len(self._indexes):
raise IndexError
return self._get_key(self._indexes[index])
I want to insert an item into an OrderedDict at a certain position.
Using the gist from this SO answer, I have the problem that it doesn't work on Python 3.
This is the implementation used:
from collections import OrderedDict
class ListDict(OrderedDict):
def __init__(self, *args, **kwargs):
super(ListDict, self).__init__(*args, **kwargs)
def __insertion(self, link_prev, key_value):
key, value = key_value
if link_prev[2] != key:
if key in self:
del self[key]
link_next = link_prev[1]
self._OrderedDict__map[key] = link_prev[1] = link_next[0] = [link_prev, link_next, key]
dict.__setitem__(self, key, value)
def insert_after(self, existing_key, key_value):
self.__insertion(self._OrderedDict__map[existing_key], key_value)
def insert_before(self, existing_key, key_value):
self.__insertion(self._OrderedDict__map[existing_key][0], key_value)
Using it like
ld = ListDict([(1,1), (2,2), (3,3)])
ld.insert_before(2, (1.5, 1.5))
gives
File "...", line 35, in insert_before
self.__insertion(self._OrderedDict__map[existing_key][0], key_value)
AttributeError: 'ListDict' object has no attribute '_OrderedDict__map'
It works with Python 2.7. What is the reason that it fails in Python 3?
Checking the source code of the OrderedDict implementation shows that self.__map is used instead of self._OrderedDict__map. Changing the code to use self.__map gives
AttributeError: 'ListDict' object has no attribute '_ListDict__map'
How come? And how can I make this work in Python 3? OrderedDict uses the internal __map attribute to store a doubly linked list. So how can I access this attribute properly?
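The error message points at Python's name mangling: inside a class body, self.__map is rewritten to self._<ClassName>__map, so code pasted into ListDict methods looks up _ListDict__map. On top of that, in CPython 3.5+ OrderedDict is reimplemented in C and keeps its linked list in internal state that is not exposed as any Python attribute, so even the correctly mangled name is absent. A minimal illustration of the mangling itself:

```python
class Base:
    def __init__(self):
        self.__secret = 1  # actually stored on the instance as _Base__secret

class Child(Base):
    def read(self):
        # Writing self.__secret here would mangle to _Child__secret and fail;
        # the attribute created in Base is _Base__secret.
        return self._Base__secret

c = Child()
assert c.read() == 1
assert "_Base__secret" in vars(c)
```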
I'm not sure you wouldn't be better served by just keeping a separate list and dict in your code, but here is a stab at a pure-Python implementation of such an object. It will be an order of magnitude slower than the actual OrderedDict in Python 3.5, which, as I pointed out in my comment, has been rewritten in C.
"""
A list/dict hybrid; like OrderedDict with insert_before and insert_after
"""
import collections.abc
class MutableOrderingDict(collections.abc.MutableMapping):
def __init__(self, iterable_or_mapping=None, **kw):
# This mimics dict's initialization and accepts the same arguments
# Of course, you have to pass an ordered iterable or mapping unless you
# want the order to be arbitrary. Garbage in, garbage out and all :)
self.__data = {}
self.__keys = []
if iterable_or_mapping is not None:
try:
iterable = iterable_or_mapping.items()
except AttributeError:
iterable = iterable_or_mapping
for key, value in iterable:
self.__keys.append(key)
self.__data[key] = value
for key, value in kw.items():
self.__keys.append(key)
self.__data[key] = value
def insert_before(self, key, new_key, value):
try:
self.__keys.insert(self.__keys.index(key), new_key)
except ValueError:
raise KeyError(key) from ValueError
else:
self.__data[new_key] = value
def insert_after(self, key, new_key, value):
try:
self.__keys.insert(self.__keys.index(key) + 1, new_key)
except ValueError:
raise KeyError(key) from ValueError
else:
self.__data[new_key] = value
def __getitem__(self, key):
return self.__data[key]
def __setitem__(self, key, value):
self.__keys.append(key)
self.__data[key] = value
def __delitem__(self, key):
del self.__data[key]
self.__keys.remove(key)
def __iter__(self):
return iter(self.__keys)
def __len__(self):
return len(self.__keys)
def __contains__(self, key):
return key in self.__keys
def __eq__(self, other):
try:
return (self.__data == dict(other.items()) and
self.__keys == list(other.keys()))
except AttributeError:
return False
def keys(self):
for key in self.__keys:
yield key
def items(self):
for key in self.__keys:
yield key, self.__data[key]
def values(self):
for key in self.__keys:
yield self.__data[key]
def get(self, key, default=None):
try:
return self.__data[key]
except KeyError:
return default
def pop(self, key, default=None):
value = self.get(key, default)
self.__delitem__(key)
return value
def popitem(self):
try:
return self.__data.pop(self.__keys.pop())
except IndexError:
raise KeyError('%s is empty' % self.__class__.__name__)
def clear(self):
self.__keys = []
self.__data = {}
def update(self, mapping):
for key, value in mapping.items():
self.__keys.append(key)
self.__data[key] = value
def setdefault(self, key, default):
try:
return self[key]
except KeyError:
self[key] = default
return self[key]
def __repr__(self):
return 'MutableOrderingDict(%s)' % ', '.join(('%r: %r' % (k, v)
for k, v in self.items()))
I ended up implementing the whole collections.abc.MutableMapping contract because none of the methods were very long, but you probably won't use all of them. In particular, __eq__ and popitem are a little arbitrary. I changed your signature on the insert_* methods to a 4-argument one that feels a little more natural to me. Final note: Only tested on Python 3.5. Certainly will not work on Python 2 without some (minor) changes.
Trying out the new dict object in 3.7, I thought I'd implement what Two-Bit Alchemist had done in his answer, but just overriding the native dict class, because in 3.7 dicts are ordered.
''' Script that extends python3.7 dictionary to include insert_before and insert_after methods. '''
from sys import exit as sExit
class MutableDict(dict):
''' Class that extends python3.7 dictionary to include insert_before and insert_after methods. '''
def insert_before(self, key, newKey, val):
''' Insert newKey:value into dict before key'''
try:
__keys = list(self.keys())
__vals = list(self.values())
insertAt = __keys.index(key)
__keys.insert(insertAt, newKey)
__vals.insert(insertAt, val)
self.clear()
self.update({x: __vals[i] for i, x in enumerate(__keys)})
except ValueError as e:
sExit(e)
def insert_after(self, key, newKey, val):
''' Insert newKey:value into dict after key'''
try:
__keys = list(self.keys())
__vals = list(self.values())
insertAt = __keys.index(key) + 1
if __keys[-1] != key:
__keys.insert(insertAt, newKey)
__vals.insert(insertAt, val)
self.clear()
self.update({x: __vals[i] for i, x in enumerate(__keys)})
else:
self.update({newKey: val})
except ValueError as e:
sExit(e)
A little testing:
In: v = MutableDict([('a', 1), ('b', 2), ('c', 3)])
Out: {'a': 1, 'b': 2, 'c': 3}
In: v.insert_before('a', 'g', 5)
Out: {'g': 5, 'a': 1, 'b': 2, 'c': 3}
In: v.insert_after('b', 't', 5)
Out: {'g': 5, 'a': 1, 'b': 2, 't': 5, 'c': 3}
Edit: I decided to run a little benchmark to see what kind of performance hit this takes. I will use from timeit import timeit.
Get a baseline. Create a dict with arbitrary values.
In: timeit('{x: ord(x) for x in string.ascii_lowercase[:27]}', setup='import string', number=1000000)
Out: 1.8214202160015702
See how much longer it takes to initialize the MutableDict with the same arbitrary values as before.
In: timeit('MD({x: ord(x) for x in string.ascii_lowercase[:27]})', setup='import string; from MutableDict import MutableDict as MD', number=1000000)
Out: 2.382507269998314
1.82 / 2.38 = 0.76. So if I'm thinking about this right, MutableDict is about 24% slower on creation.
Let's see how long it takes to do an insert. For this test I'll use the insert_after method, as it is slightly bigger, and look for a key close to the end for the insertion: 't' in this case.
In: timeit('v.insert_after("t", "zzrr", ord("z"))', setup='import string; from MutableDict import MutableDict as MD; v = MD({x: ord(x) for x in string.ascii_lowercase[:27]})' ,number=1000000)
Out: 3.9161406760104
2.38 / 3.91 = 0.60, so insert_after is about 40% slower than initialization. Not bad for a small test of 1 million loops. For a sense of the time scales involved, we'll test this:
In: timeit('"-".join(map(str, range(100)))', number=1000000)
Out: 10.342204540997045
Not quite an apples-to-apples comparison, but I hope these tests will aid you (the reader, not necessarily the OP) in deciding whether to use this class in your 3.7 projects.
Since Python 3.2, move_to_end can be used to move items around in an OrderedDict. The following code will implement the insert functionality by moving all items after the provided index to the end.
Note that this isn't very efficient and should be used sparingly (if at all).
def ordered_dict_insert(ordered_dict, index, key, value):
if key in ordered_dict:
raise KeyError("Key already exists")
if index < 0 or index > len(ordered_dict):
raise IndexError("Index out of range")
keys = list(ordered_dict.keys())[index:]
ordered_dict[key] = value
for k in keys:
ordered_dict.move_to_end(k)
There are obvious optimizations and improvements that could be made, but that's the general idea.
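A quick self-contained demonstration of the same rotation idea, inlined so it runs on its own:

```python
from collections import OrderedDict

od = OrderedDict([('a', 1), ('b', 2), ('d', 4)])
od['c'] = 3                      # the new item lands at the end
for k in list(od.keys())[2:-1]:  # every key from the target index on,
    od.move_to_end(k)            # except the new one, moves behind it
# od is now OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4)])
```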
from collections import OrderedDict
od1 = OrderedDict([
('a', 1),
('b', 2),
('d', 4),
])
items = list(od1.items())  # list() is required in Python 3: dict views have no insert()
items.insert(2, ('c', 3))
od2 = OrderedDict(items)
print(od2) # OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4)])
Suppose I have d = {'dogs': 3}. Using:
d['cats'] = 2
would create the key 'cats' and give it the value 2.
If I really intend to update a dict with a new key and value, I would use d.update(cats=2) because it feels more explicit.
Having keys created automatically feels error-prone (especially in larger programs), e.g.:
# I decide to make a change to my dict.
d = {'puppies': 4, 'big_dogs': 2}
# Lots and lots of code.
# ....
def change_my_dogs_to_maximum_room_capacity():
# But I forgot to change this as well and there is no error to inform me.
# Instead a bug was created.
d['dogs'] = 1
Question:
Is there a way to disable the automatic creation of a key that doesn't exist through d[key] = value, and instead raise a KeyError?
Everything else should keep working though:
d = new_dict() # Works
d = new_dict(hi=1) # Works
d.update(c=5, x=2) # Works
d.setdefault('9', 'something') # Works
d['a_new_key'] = 1 # Raises KeyError
You could create a child of dict with a special __setitem__ method that refuses keys that didn't exist when it was initially created:
class StrictDict(dict):
def __setitem__(self, key, value):
if key not in self:
raise KeyError("{} is not a legal key of this StrictDict".format(repr(key)))
dict.__setitem__(self, key, value)
x = StrictDict({'puppies': 4, 'big_dogs': 2})
x["puppies"] = 23 #this works
x["dogs"] = 42 #this raises an exception
It's not totally bulletproof (it will allow x.update({"cats": 99}) without complaint, for example), but it prevents the most likely case.
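If that loophole matters, one option (a sketch, not the only way) is to build on collections.abc.MutableMapping instead: its mixin update() is implemented in terms of __setitem__, so the key check applies there too.

```python
from collections.abc import MutableMapping

class StrictMap(MutableMapping):
    """Refuses keys that were not present at construction time."""

    def __init__(self, data):
        self._data = dict(data)

    def __setitem__(self, key, value):
        if key not in self._data:
            raise KeyError("%r is not a legal key of this StrictMap" % (key,))
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

x = StrictMap({'puppies': 4, 'big_dogs': 2})
x['puppies'] = 23          # works
x.update({'puppies': 5})   # works, routed through __setitem__
# x.update({'cats': 99})   # now raises KeyError as well
```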
Inherit from dict and override __setitem__ to suit your needs. Try this:
class mydict(dict):
def __init__(self, *args, **kwargs):
self.update(*args, **kwargs)
def __setitem__(self, key, value):
raise KeyError(key)
>>> d = mydict({'a': 3})
>>> d
{'a': 3}
>>> d['a']
3
>>> d['b'] = 4
KeyError: 'b'
This will only allow new keys to be added with key=value using update:
class MyDict(dict):
def __init__(self, d):
dict.__init__(self)
self.instant = False
self.update(d)
def update(self, other=None, **kwargs):
if other is not None:
if isinstance(other, dict):
for k, v in other.items():
self[k] = v
else:
for k, v in other:
self[k] = v
else:
dict.update(self, kwargs)
self.instant = True
def __setitem__(self, key, value):
if self.instant and key not in self:
raise KeyError(key)
dict.__setitem__(self, key, value)
x = MyDict({1:2,2:3})
x[1] = 100 # works
x.update(cat=1) # works
x.update({2:200}) # works
x["bar"] = 3 # error
x.update({"foo":2}) # error
x.update([(5,2),(3,4)]) # error
I'm trying to write a very simple function to recursively search through a possibly nested (in the most extreme cases ten levels deep) Python dictionary and return the first value it finds from the given key.
I cannot understand why my code doesn't work for nested dictionaries.
def _finditem(obj, key):
if key in obj: return obj[key]
for k, v in obj.items():
if isinstance(v,dict):
_finditem(v, key)
print _finditem({"B":{"A":2}},"A")
It returns None.
It does work, however, for _finditem({"B":1,"A":2},"A"), returning 2.
I'm sure it's a simple mistake, but I cannot find it. I feel like there might already be something for this in the standard library or collections, but I can't find that either.
If you are looking for a general explanation of what is wrong with code like this, the canonical is Why does my recursive function return None?. The answers here are mostly specific to the task of searching in a nested dictionary.
When you recurse, you need to return the result of _finditem:
def _finditem(obj, key):
if key in obj: return obj[key]
for k, v in obj.items():
if isinstance(v,dict):
return _finditem(v, key) #added return statement
To fix the actual algorithm, you need to realize that _finditem returns None if it didn't find anything, so you need to check that explicitly to prevent an early return:
def _finditem(obj, key):
if key in obj: return obj[key]
for k, v in obj.items():
if isinstance(v,dict):
item = _finditem(v, key)
if item is not None:
return item
Of course, that will fail if you have None values in any of your dictionaries. In that case, you could set up a sentinel object() for this function and return that in the case that you don't find anything -- Then you can check against the sentinel to know if you found something or not.
Here's a function that searches a dictionary containing both nested dictionaries and lists. It builds a list of all the matching values.
def get_recursively(search_dict, field):
"""
Takes a dict with nested lists and dicts,
and searches all dicts for a key of the field
provided.
"""
fields_found = []
for key, value in search_dict.items():
if key == field:
fields_found.append(value)
elif isinstance(value, dict):
results = get_recursively(value, field)
for result in results:
fields_found.append(result)
elif isinstance(value, list):
for item in value:
if isinstance(item, dict):
more_results = get_recursively(item, field)
for another_result in more_results:
fields_found.append(another_result)
return fields_found
Here is a way to do this using a "stack" and the "stack of iterators" pattern (credits to Gareth Rees):
def search(d, key, default=None):
"""Return a value corresponding to the specified key in the (possibly
nested) dictionary d. If there is no item with that key, return
default.
"""
stack = [iter(d.items())]
while stack:
for k, v in stack[-1]:
if isinstance(v, dict):
stack.append(iter(v.items()))
break
elif k == key:
return v
else:
stack.pop()
return default
The print(search({"B": {"A": 2}}, "A")) would print 2.
Just trying to make it shorter:
def get_recursively(search_dict, field):
if isinstance(search_dict, dict):
if field in search_dict:
return search_dict[field]
for key in search_dict:
item = get_recursively(search_dict[key], field)
if item is not None:
return item
elif isinstance(search_dict, list):
for element in search_dict:
item = get_recursively(element, field)
if item is not None:
return item
return None
Here's a Python 3.3+ solution which can handle lists of lists of dicts.
It also uses duck typing, so it can handle any iterable, or object implementing the 'items' method.
from typing import Iterator
def deep_key_search(obj, key: str) -> Iterator:
""" Do a deep search of {obj} and return the values of all {key} attributes found.
:param obj: Either a dict type object or an iterator.
:return: Iterator of all {key} values found"""
if isinstance(obj, str):
# When duck-typing iterators recursively, we must exclude strings
return
try:
# Assume obj is a like a dict and look for the key
for k, v in obj.items():
if k == key:
yield v
else:
yield from deep_key_search(v, key)
except AttributeError:
# Not a dict type object. Is it iterable like a list?
try:
for v in obj:
yield from deep_key_search(v, key)
except TypeError:
pass # Not iterable either.
Pytest:
#pytest.mark.parametrize(
"data, expected, dscr", [
({}, [], "Empty dict"),
({'Foo': 1, 'Bar': 2}, [1], "Plain dict"),
([{}, {'Foo': 1, 'Bar': 2}], [1], "List[dict]"),
([[[{'Baz': 3, 'Foo': 'a'}]], {'Foo': 1, 'Bar': 2}], ['a', 1], "Deep list"),
({'Foo': 1, 'Bar': {'Foo': 'c'}}, [1, 'c'], "Dict of Dict"),
(
{'Foo': 1, 'Bar': {'Foo': 'c', 'Bar': 'abcdef'}},
[1, 'c'], "Contains a non-selected string value"
),
])
def test_deep_key_search(data, expected, dscr):
assert list(deep_key_search(data, 'Foo')) == expected
I couldn't add a comment to the accepted solution proposed by @mgilston because of my lack of reputation. That solution doesn't work if the key being searched for is inside a list.
Looping through the elements of lists and calling the recursive function should extend the functionality to find elements inside nested lists:
def _finditem(obj, key):
    if key in obj:
        return obj[key]
    for k, v in obj.items():
        if isinstance(v, dict):
            item = _finditem(v, key)
            if item is not None:
                return item
        elif isinstance(v, list):
            for list_item in v:
                item = _finditem(list_item, key)
                if item is not None:
                    return item

print(_finditem({"C": {"B": [{"A": 2}]}}, "A"))
I had to create a general-case version that finds a uniquely-specified key (a minimal dictionary specifying the path to the desired value) in a dictionary containing multiple nested dictionaries and lists.
For the example below, a target dictionary is created to search, and the query is built with the wildcard "???". When run, it returns the value "D":
from typing import Dict, List

def lfind(query_list: List, target_list: List, targ_str: str = "???"):
    for tval in target_list:
        # print("lfind: tval = {}, query_list[0] = {}".format(tval, query_list[0]))
        if isinstance(tval, dict):
            val = dfind(query_list[0], tval, targ_str)
            if val:
                return val
        elif tval == query_list[0]:
            return tval

def dfind(query_dict: Dict, target_dict: Dict, targ_str: str = "???"):
    for key, qval in query_dict.items():
        tval = target_dict[key]
        # print("dfind: key = {}, qval = {}, tval = {}".format(key, qval, tval))
        if isinstance(qval, dict):
            val = dfind(qval, tval, targ_str)
            if val:
                return val
        elif isinstance(qval, list):
            return lfind(qval, tval, targ_str)
        else:
            if qval == targ_str:
                return tval
            if qval != tval:
                break

def find(target_dict: Dict, query_dict: Dict):
    return dfind(query_dict, target_dict)

target_dict = {"A": [
    {"key1": "A", "key2": {"key3": "B"}},
    {"key1": "C", "key2": {"key3": "D"}}]
}
query_dict = {"A": [{"key1": "C", "key2": {"key3": "???"}}]}
result = find(target_dict, query_dict)
print("result = {}".format(result))
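The wildcard can sit on other keys too. Here is a second, made-up query against the same target_dict that asks for the key1 of the entry whose key3 is "B". Note that dfind walks query keys in insertion order, so the constraint should come before the wildcard:

```python
from typing import Dict, List

def lfind(query_list: List, target_list: List, targ_str: str = "???"):
    for tval in target_list:
        if isinstance(tval, dict):
            val = dfind(query_list[0], tval, targ_str)
            if val:
                return val
        elif tval == query_list[0]:
            return tval

def dfind(query_dict: Dict, target_dict: Dict, targ_str: str = "???"):
    for key, qval in query_dict.items():
        tval = target_dict[key]
        if isinstance(qval, dict):
            val = dfind(qval, tval, targ_str)
            if val:
                return val
        elif isinstance(qval, list):
            return lfind(qval, tval, targ_str)
        else:
            if qval == targ_str:
                return tval
            if qval != tval:
                break

target_dict = {"A": [
    {"key1": "A", "key2": {"key3": "B"}},
    {"key1": "C", "key2": {"key3": "D"}}]
}
# constraint (key3 == "B") listed before the wildcard on key1
query_dict = {"A": [{"key2": {"key3": "B"}, "key1": "???"}]}
print(dfind(query_dict, target_dict))  # A
```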
Thought I'd throw my hat in the ring: this allows recursive requests on anything that implements a __getitem__ method.
def _get_recursive(obj, args, default=None):
    """Apply successive requests to an obj that implements __getitem__ and
    return the result if something is found, else return default"""
    if not args:
        return obj
    try:
        key, *args = args
        _obj = obj[key]  # dispatches to the object's own __getitem__
        return _get_recursive(_obj, args, default=default)
    except (KeyError, IndexError, TypeError):
        return default
I have a dict subclass whose job is to dynamically add a nested dict key if it does not exist, and to append to a list when append is called:
class PowerDict(dict):
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value

    def append(self, item):
        if type(self) != list:
            self = list()
        self.append(item)
so
a = PowerDict()
a['1']['2'] = 3
produces the output:
a = {'1': {'2': 3}}
However, sometimes I need to do something like this:
b = PowerDict()
b['1']['2'].append(3)
b['1']['2'].append(4)
which should produce the output:
b = {'1': {'2': [3, 4]}}
but the above code produces:
{'1': {'2': {}}}
What am I missing?
class PowerDict(dict):
    # http://stackoverflow.com/a/3405143/190597 (gnibbler)
    def __init__(self, parent=None, key=None):
        self.parent = parent
        self.key = key

    def __missing__(self, key):
        self[key] = PowerDict(self, key)
        return self[key]

    def append(self, item):
        self.parent[self.key] = [item]

    def __setitem__(self, key, val):
        dict.__setitem__(self, key, val)
        try:
            val.parent = self
            val.key = key
        except AttributeError:
            pass

a = PowerDict()
a['1']['2'] = 3
print(a)

b = PowerDict()
b['1']['2'].append(3)
b['1']['2'].append(4)
print(b)

a['1']['2'] = b
a['1']['2'].append(5)
print(a['1']['2'])
yields
{'1': {'2': 3}}
{'1': {'2': [3, 4]}}
[5]
Your append() method never works: by doing self = list() you're just rebinding the local name self to a new list, which is then thrown away.
And I don't understand what you're trying to do: in __getitem__ you create new dictionaries on the fly when a key is missing, so how would you mix list behaviour into that?
One of your problems is reassigning self; however, that's not the only one. Try printing out the value of self in the append method, and you can see another problem: the call enters infinite recursion, because you're calling append on a PowerDict from within your append method!
This should solve your problem without rewriting the append method, but I strongly suggest you rewrite it anyway to avoid the above-mentioned problem:
b['1']['2'] = [3]
b['1']['2'].append(4)
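Put together with the asker's original class (minus the broken append method), this workaround behaves as required:

```python
class PowerDict(dict):
    # auto-creates nested PowerDicts on missing keys
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value

b = PowerDict()
b['1']['2'] = [3]      # create the list explicitly instead of calling append()
b['1']['2'].append(4)  # now it's a plain list, so list.append just works
print(b)  # {'1': {'2': [3, 4]}}
```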