i need a function remove() that removes characters from a string.
This was my first approach:
def remove(self, string, index):
return string[0:index] + string[index + 1:]
def remove_indexes(self, string, indexes):
for index in indexes:
string = self.remove(string, index)
return string
Where I pass the indexes I want to remove in an array, but once I remove a character, the whole indexes change.
Is there a more pythonic whay to do this. it would be more preffarable to implement it like that:
"hello".remove([1, 2])
I dont know about a "pythonic" way, but you can achieve this. If you can ensure that in remove_indexes the indexes are always sorted, then you may do this
def remove_indexes(self, string, indexes):
for index in indexes.reverse():
string = self.remove(string, index)
return string
If you cant ensure that then just do
def remove_indexes(self, string, indexes):
for index in indexes.sort(reverse=True):
string = self.remove(string, index)
return string
I think below code will work for you.It removes the indexes(that you want to remove from string) and returns joined string formed with remaining indexes.
def remove_indexes(string,indexes):
return "".join([string[i] for i in range(len(string)) if i not in indexes])
remove_indexes("hello",[1,2])
The most pythonic way would be to use regular expressions. The danger with your indexing approach is that the string you are passing in may have variable length, and therefore you would be removing parts of the string unintentionally.
Lets say you wanted to remove all numbers from a string
import re
s = "This is a string with s0m3 numb3rs in it1 !"
num_reg = re.compile(r"\d+") # catches all digits 0-9
re.sub(num_reg , "**", s) # substitute numbers in `s` with "**"
>>> "This is a string with s**m** numb**rs in it** !"
This way, you define an general expression that may appear regularly in a string (a "regular expression" or regex), and you can quickly and reliably replace all instances of that regex in the string.
You cannot add attribute to built-in types you will have an error like this:
TypeError: can't set attributes of built-in/extension type 'str'
You can create a class that inherit the str and add this method:
class String(str):
def remove(self, index):
if isinstance(index, list):
# order the index to remove the biggest first
for i in sorted(index, reverse=True):
self = self.remove(i)
return self
return String(self[0:index] + self[index + 1:])
s = String("hello")
print(s.remove([0, 1]))
You want change in place you need to create a new type for example:
class String:
def __init__(self, value):
self._str = value
def __getattr__(self, item):
""" delegate to str"""
return getattr(self._str, item)
def __getitem__(self, item):
""" support slicing"""
return String(self._str[item])
def remove(self, indexex):
indexes = indexex if isinstance(indexex, list) else [indexex]
# order the index to remove the biggest first
for i in sorted(indexes, reverse=True):
self._str = self._str[0:i] + self._str[i + 1:]
# change in place should return None
return None
def __str__(self):
return str(self._str)
def __repr__(self):
return repr(self._str)
s = String("hello")
s.remove([0, 1])
print(s.upper()) # delegate to str class
print(s[:1]) # support slicing
print(list(x for x in s)) # it's iterable
But still missing a other magic method to act like a real str class. like __add__ , __mult___, .....
If you want a class like str but have a remove method that changes the instance itself you need to create your own mutable type, str are primitive immutable type and self = self.remove(i) will not really change the variable because it's just changing the reference of self argument to another object, but the reference s is still pointing to the same object created by String("hello").
My question comes from this page, while I would like to create a pointer(-like thing) for an list element. The element is a primitive value (string) so I have to create a FooWrapper class as that page says.
I know that by setting __repr__ one can directly access this value.
class FooWrapper(object):
def __init__(self, value):
self.value = value
def __repr__(self):
return repr(self.value)
>>> bar=FooWrapper('ABC')
>>> bar
'ABC'
>>> bar=FooWrapper(3)
>>> bar
3
Now I can use it as an reference of string:
>>> L=[3,5,6,9]
>>> L[1]=FooWrapper('ABC')
>>> L
[3,'ABC',6,9]
>>> this=L[1]
>>> this.value='BCD'
>>> print(L)
[3,'BCD',6,9]
So now I have a pointer-like this for the list element L[1].
However it is still inconvenient since I must use this.value='BCD' to change its value. While there exists a __repr__ method to make this directly return this.value, is there any similar method to make this='BCD' to do this.value='BCD' ? I know this changes the rule of binding.. but anyway, is it possible?
I would also appreciate if there is a better solution for a list element pointer.
Thank you in advance:)
I'm not sure exactly what you are trying to do, but you could do something like:
class FooWrapper(object):
def __init__(self, value):
self.value = value
def __repr__(self):
return 'FooWrapper(' + repr(self.value) + ')'
def __str__(self):
return str(self.value)
def __call__(self,value):
self.value = value
Here I got rid of your idea of using __repr__ to hide FooWrapper since I think it a bad idea to hide from the programmer what is happening at the REPL. Instead -- I used __str__ so that when you print the object you will print the wrapped value. The __call__ functions as a default method, which doesn't change the meaning of = but is sort of what you want:
>>> vals = [1,2,3]
>>> vals[1] = FooWrapper("Bob")
>>> vals
[1, FooWrapper('Bob'), 3]
>>> for x in vals: print(x)
1
Bob
3
>>> this = vals[1]
>>> this(10)
>>> vals
[1, FooWrapper(10), 3]
However, I think it misleading to refer to this as a pointer. It is just a wrapper object, and is almost certain to make dealing with the wrapped object inconvenient.
On Edit: The following is more of a pointer to a list. It allows you to create something like a pointer object with __call__ used to dereference the pointer (when no argument is passed to it) or to mutate the list (when a value is passed to __call__). It also implements a form of p++ called (pp) with wrap-around (though the wrap-around part could of course be dropped):
class ListPointer(object):
def __init__(self, myList,i=0):
self.myList = myList
self.i = i % len(self.myList)
def __repr__(self):
return 'ListPointer(' + repr(self.myList) + ',' + str(self.i) + ')'
def __str__(self):
return str(self.myList[self.i])
def __call__(self,*value):
if len(value) == 0:
return self.myList[self.i]
else:
self.myList[self.i] = value[0]
def pp(self):
self.i = (self.i + 1) % len(self.myList)
Used like this:
>>> vals = ['a','b','c']
>>> this = ListPointer(vals)
>>> this()
'a'
>>> this('d')
>>> vals
['d', 'b', 'c']
>>> this.pp()
>>> this()
'b'
>>> print(this)
b
I think that this is a more transparent way of getting something which acts like a list pointer. It doesn't require the thing pointed to to be wrapped in anything.
The __repr__ method can get a string however it wants to. Let's say it says return repr(self.value) + 'here'. If you say this = '4here', what should be affected? Should self.value be assigned to 4 or 4here? What if this had another attribute called key and __repr__ did return repr(self.key) + repr(self.value)? When you did this = '4here', would it assign self.key to the whole string, assign self.value to the whole string, or assign self.key to 4 and self.value to here? What if the string is completely made up by the method? If it says return 'here', what should this = '4here' do?
In short, you can't.
Self-Answer based on John Coleman's idea:
class ListPointer(object):
def __init__(self,list,index):
self.value=list[index]
list[index]=self
def __repr__(self):
return self.value
def __call__(self,value):
self.value=value
>>> foo=[2,3,4,5]
>>> this=ListPointer(foo,2)
>>> this
4
>>> foo
[2,3,4,5]
>>> this('ABC')
>>> foo
[2,3,'ABC',5]
>>> type(foo(2))
<class '__main__.ListPointer'>
ListPointer object accepts a list and an index, stores the list[index] in self.value and then substitutes the list element with self. Similarly a pp method can also be achieved, while the former element should be restored and the next element be substituted with self object. By directly referring foo[2] one also gets this object, which is what I want. (Perhaps this should be called as a reference rather than a pointer..)
Is it possible to have a list be evaluated lazily in Python?
For example
a = 1
list = [a]
print list
#[1]
a = 2
print list
#[1]
If the list was set to evaluate lazily then the final line would be [2]
The concept of "lazy" evaluation normally comes with functional languages -- but in those you could not reassign two different values to the same identifier, so, not even there could your example be reproduced.
The point is not about laziness at all -- it is that using an identifier is guaranteed to be identical to getting a reference to the same value that identifier is referencing, and re-assigning an identifier, a bare name, to a different value, is guaranteed to make the identifier refer to a different value from them on. The reference to the first value (object) is not lost.
Consider a similar example where re-assignment to a bare name is not in play, but rather any other kind of mutation (for a mutable object, of course -- numbers and strings are immutable), including an assignment to something else than a bare name:
>>> a = [1]
>>> list = [a]
>>> print list
[[1]]
>>> a[:] = [2]
>>> print list
[[2]]
Since there is no a - ... that reassigns the bare name a, but rather an a[:] = ... that reassigns a's contents, it's trivially easy to make Python as "lazy" as you wish (and indeed it would take some effort to make it "eager"!-)... if laziness vs eagerness had anything to do with either of these cases (which it doesn't;-).
Just be aware of the perfectly simple semantics of "assigning to a bare name" (vs assigning to anything else, which can be variously tweaked and controlled by using your own types appropriately), and the optical illusion of "lazy vs eager" might hopefully vanish;-)
Came across this post when looking for a genuine lazy list implementation, but it sounded like a fun thing to try and work out.
The following implementation does basically what was originally asked for:
from collections import Sequence
class LazyClosureSequence(Sequence):
def __init__(self, get_items):
self._get_items = get_items
def __getitem__(self, i):
return self._get_items()[i]
def __len__(self):
return len(self._get_items())
def __repr__(self):
return repr(self._get_items())
You use it like this:
>>> a = 1
>>> l = LazyClosureSequence(lambda: [a])
>>> print l
[1]
>>> a = 2
>>> print l
[2]
This is obviously horrible.
Python is not really very lazy in general.
You can use generators to emulate lazy data structures (like infinite lists, et cetera), but as far as things like using normal list syntax, et cetera, you're not going to have laziness.
That is a read-only lazy list where it only needs a pre-defined length and a cache-update function:
import copy
import operations
from collections.abc import Sequence
from functools import partialmethod
from typing import Dict, Union
def _cmp_list(a: list, b: list, op, if_eq: bool, if_long_a: bool) -> bool:
"""utility to implement gt|ge|lt|le class operators"""
if a is b:
return if_eq
for ia, ib in zip(a, b):
if ia == ib:
continue
return op(ia, ib)
la, lb = len(a), len(b)
if la == lb:
return if_eq
if la > lb:
return if_long_a
return not if_long_a
class LazyListView(Sequence):
def __init__(self, length):
self._range = range(length)
self._cache: Dict[int, Value] = {}
def __len__(self) -> int:
return len(self._range)
def __getitem__(self, ix: Union[int, slice]) -> Value:
length = len(self)
if isinstance(ix, slice):
clone = copy.copy(self)
clone._range = self._range[slice(*ix.indices(length))] # slicing
return clone
else:
if ix < 0:
ix += len(self) # negative indices count from the end
if not (0 <= ix < length):
raise IndexError(f"list index {ix} out of range [0, {length})")
if ix not in self._cache:
... # update cache
return self._cache[ix]
def __iter__(self) -> dict:
for i, _row_ix in enumerate(self._range):
yield self[i]
__eq__ = _eq_list
__gt__ = partialmethod(_cmp_list, op=operator.gt, if_eq=False, if_long_a=True)
__ge__ = partialmethod(_cmp_list, op=operator.ge, if_eq=True, if_long_a=True)
__le__ = partialmethod(_cmp_list, op=operator.le, if_eq=True, if_long_a=False)
__lt__ = partialmethod(_cmp_list, op=operator.lt, if_eq=False, if_long_a=False)
def __add__(self, other):
"""BREAKS laziness and returns a plain-list"""
return list(self) + other
def __mul__(self, factor):
"""BREAKS laziness and returns a plain-list"""
return list(self) * factor
__radd__ = __add__
__rmul__ = __mul__
Note that this class is discussed also in this SO.
As an exercise, and mostly for my own amusement, I'm implementing a backtracking packrat parser. The inspiration for this is i'd like to have a better idea about how hygenic macros would work in an algol-like language (as apposed to the syntax free lisp dialects you normally find them in). Because of this, different passes through the input might see different grammars, so cached parse results are invalid, unless I also store the current version of the grammar along with the cached parse results. (EDIT: a consequence of this use of key-value collections is that they should be immutable, but I don't intend to expose the interface to allow them to be changed, so either mutable or immutable collections are fine)
The problem is that python dicts cannot appear as keys to other dicts. Even using a tuple (as I'd be doing anyways) doesn't help.
>>> cache = {}
>>> rule = {"foo":"bar"}
>>> cache[(rule, "baz")] = "quux"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
>>>
I guess it has to be tuples all the way down. Now the python standard library provides approximately what i'd need, collections.namedtuple has a very different syntax, but can be used as a key. continuing from above session:
>>> from collections import namedtuple
>>> Rule = namedtuple("Rule",rule.keys())
>>> cache[(Rule(**rule), "baz")] = "quux"
>>> cache
{(Rule(foo='bar'), 'baz'): 'quux'}
Ok. But I have to make a class for each possible combination of keys in the rule I would want to use, which isn't so bad, because each parse rule knows exactly what parameters it uses, so that class can be defined at the same time as the function that parses the rule.
Edit: An additional problem with namedtuples is that they are strictly positional. Two tuples that look like they should be different can in fact be the same:
>>> you = namedtuple("foo",["bar","baz"])
>>> me = namedtuple("foo",["bar","quux"])
>>> you(bar=1,baz=2) == me(bar=1,quux=2)
True
>>> bob = namedtuple("foo",["baz","bar"])
>>> you(bar=1,baz=2) == bob(bar=1,baz=2)
False
tl'dr: How do I get dicts that can be used as keys to other dicts?
Having hacked a bit on the answers, here's the more complete solution I'm using. Note that this does a bit extra work to make the resulting dicts vaguely immutable for practical purposes. Of course it's still quite easy to hack around it by calling dict.__setitem__(instance, key, value) but we're all adults here.
class hashdict(dict):
"""
hashable dict implementation, suitable for use as a key into
other dicts.
>>> h1 = hashdict({"apples": 1, "bananas":2})
>>> h2 = hashdict({"bananas": 3, "mangoes": 5})
>>> h1+h2
hashdict(apples=1, bananas=3, mangoes=5)
>>> d1 = {}
>>> d1[h1] = "salad"
>>> d1[h1]
'salad'
>>> d1[h2]
Traceback (most recent call last):
...
KeyError: hashdict(bananas=3, mangoes=5)
based on answers from
http://stackoverflow.com/questions/1151658/python-hashable-dicts
"""
def __key(self):
return tuple(sorted(self.items()))
def __repr__(self):
return "{0}({1})".format(self.__class__.__name__,
", ".join("{0}={1}".format(
str(i[0]),repr(i[1])) for i in self.__key()))
def __hash__(self):
return hash(self.__key())
def __setitem__(self, key, value):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def __delitem__(self, key):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def clear(self):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def pop(self, *args, **kwargs):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def popitem(self, *args, **kwargs):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def setdefault(self, *args, **kwargs):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
def update(self, *args, **kwargs):
raise TypeError("{0} does not support item assignment"
.format(self.__class__.__name__))
# update is not ok because it mutates the object
# __add__ is ok because it creates a new object
# while the new object is under construction, it's ok to mutate it
def __add__(self, right):
result = hashdict(self)
dict.update(result, right)
return result
if __name__ == "__main__":
import doctest
doctest.testmod()
Here is the easy way to make a hashable dictionary. Just remember not to mutate them after embedding in another dictionary for obvious reasons.
class hashabledict(dict):
def __hash__(self):
return hash(tuple(sorted(self.items())))
Hashables should be immutable -- not enforcing this but TRUSTING you not to mutate a dict after its first use as a key, the following approach would work:
class hashabledict(dict):
def __key(self):
return tuple((k,self[k]) for k in sorted(self))
def __hash__(self):
return hash(self.__key())
def __eq__(self, other):
return self.__key() == other.__key()
If you DO need to mutate your dicts and STILL want to use them as keys, complexity explodes hundredfolds -- not to say it can't be done, but I'll wait until a VERY specific indication before I get into THAT incredible morass!-)
All that is needed to make dictionaries usable for your purpose is to add a __hash__ method:
class Hashabledict(dict):
def __hash__(self):
return hash(frozenset(self))
Note, the frozenset conversion will work for all dictionaries (i.e. it doesn't require the keys to be sortable). Likewise, there is no restriction on the dictionary values.
If there are many dictionaries with identical keys but with distinct values, it is necessary to have the hash take the values into account. The fastest way to do that is:
class Hashabledict(dict):
def __hash__(self):
return hash((frozenset(self), frozenset(self.itervalues())))
This is quicker than frozenset(self.iteritems()) for two reasons. First, the frozenset(self) step reuses the hash values stored in the dictionary, saving unnecessary calls to hash(key). Second, using itervalues will access the values directly and avoid the many memory allocator calls using by items to form new many key/value tuples in memory every time you do a lookup.
The given answers are okay, but they could be improved by using frozenset(...) instead of tuple(sorted(...)) to generate the hash:
>>> import timeit
>>> timeit.timeit('hash(tuple(sorted(d.iteritems())))', "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')")
4.7758948802947998
>>> timeit.timeit('hash(frozenset(d.iteritems()))', "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')")
1.8153600692749023
The performance advantage depends on the content of the dictionary, but in most cases I've tested, hashing with frozenset is at least 2 times faster (mainly because it does not need to sort).
A reasonably clean, straightforward implementation is
import collections
class FrozenDict(collections.Mapping):
"""Don't forget the docstrings!!"""
def __init__(self, *args, **kwargs):
self._d = dict(*args, **kwargs)
def __iter__(self):
return iter(self._d)
def __len__(self):
return len(self._d)
def __getitem__(self, key):
return self._d[key]
def __hash__(self):
return hash(tuple(sorted(self._d.iteritems())))
I keep coming back to this topic... Here's another variation. I'm uneasy with subclassing dict to add a __hash__ method; There's virtually no escape from the problem that dict's are mutable, and trusting that they won't change seems like a weak idea. So I've instead looked at building a mapping based on a builtin type that is itself immutable. although tuple is an obvious choice, accessing values in it implies a sort and a bisect; not a problem, but it doesn't seem to be leveraging much of the power of the type it's built on.
What if you jam key, value pairs into a frozenset? What would that require, how would it work?
Part 1, you need a way of encoding the 'item's in such a way that a frozenset will treat them mainly by their keys; I'll make a little subclass for that.
import collections
class pair(collections.namedtuple('pair_base', 'key value')):
def __hash__(self):
return hash((self.key, None))
def __eq__(self, other):
if type(self) != type(other):
return NotImplemented
return self.key == other.key
def __repr__(self):
return repr((self.key, self.value))
That alone puts you in spitting distance of an immutable mapping:
>>> frozenset(pair(k, v) for k, v in enumerate('abcd'))
frozenset([(0, 'a'), (2, 'c'), (1, 'b'), (3, 'd')])
>>> pairs = frozenset(pair(k, v) for k, v in enumerate('abcd'))
>>> pair(2, None) in pairs
True
>>> pair(5, None) in pairs
False
>>> goal = frozenset((pair(2, None),))
>>> pairs & goal
frozenset([(2, None)])
D'oh! Unfortunately, when you use the set operators and the elements are equal but not the same object; which one ends up in the return value is undefined, we'll have to go through some more gyrations.
>>> pairs - (pairs - goal)
frozenset([(2, 'c')])
>>> iter(pairs - (pairs - goal)).next().value
'c'
However, looking values up in this way is cumbersome, and worse, creates lots of intermediate sets; that won't do! We'll create a 'fake' key-value pair to get around it:
class Thief(object):
def __init__(self, key):
self.key = key
def __hash__(self):
return hash(pair(self.key, None))
def __eq__(self, other):
self.value = other.value
return pair(self.key, None) == other
Which results in the less problematic:
>>> thief = Thief(2)
>>> thief in pairs
True
>>> thief.value
'c'
That's all the deep magic; the rest is wrapping it all up into something that has an interface like a dict. Since we're subclassing from frozenset, which has a very different interface, there's quite a lot of methods; we get a little help from collections.Mapping, but most of the work is overriding the frozenset methods for versions that work like dicts, instead:
class FrozenDict(frozenset, collections.Mapping):
def __new__(cls, seq=()):
return frozenset.__new__(cls, (pair(k, v) for k, v in seq))
def __getitem__(self, key):
thief = Thief(key)
if frozenset.__contains__(self, thief):
return thief.value
raise KeyError(key)
def __eq__(self, other):
if not isinstance(other, FrozenDict):
return dict(self.iteritems()) == other
if len(self) != len(other):
return False
for key, value in self.iteritems():
try:
if value != other[key]:
return False
except KeyError:
return False
return True
def __hash__(self):
return hash(frozenset(self.iteritems()))
def get(self, key, default=None):
thief = Thief(key)
if frozenset.__contains__(self, thief):
return thief.value
return default
def __iter__(self):
for item in frozenset.__iter__(self):
yield item.key
def iteritems(self):
for item in frozenset.__iter__(self):
yield (item.key, item.value)
def iterkeys(self):
for item in frozenset.__iter__(self):
yield item.key
def itervalues(self):
for item in frozenset.__iter__(self):
yield item.value
def __contains__(self, key):
return frozenset.__contains__(self, pair(key, None))
has_key = __contains__
def __repr__(self):
return type(self).__name__ + (', '.join(repr(item) for item in self.iteritems())).join('()')
#classmethod
def fromkeys(cls, keys, value=None):
return cls((key, value) for key in keys)
which, ultimately, does answer my own question:
>>> myDict = {}
>>> myDict[FrozenDict(enumerate('ab'))] = 5
>>> FrozenDict(enumerate('ab')) in myDict
True
>>> FrozenDict(enumerate('bc')) in myDict
False
>>> FrozenDict(enumerate('ab', 3)) in myDict
False
>>> myDict[FrozenDict(enumerate('ab'))]
5
The accepted answer by #Unknown, as well as the answer by #AlexMartelli work perfectly fine, but only under the following constraints:
The dictionary's values must be hashable. For example, hash(hashabledict({'a':[1,2]})) will raise TypeError.
Keys must support comparison operation. For example, hash(hashabledict({'a':'a', 1:1})) will raise TypeError.
The comparison operator on keys imposes total ordering. For example, if the two keys in a dictionary are frozenset((1,2,3)) and frozenset((4,5,6)), they compare unequal in both directions. Therefore, sorting the items of a dictionary with such keys can result in an arbitrary order, and therefore will violate the rule that equal objects must have the same hash value.
The much faster answer by #ObenSonne lifts the constraints 2 and 3, but is still bound by constraint 1 (values must be hashable).
The faster yet answer by #RaymondHettinger lifts all 3 constraints because it does not include .values() in the hash calculation. However, its performance is good only if:
Most of the (non-equal) dictionaries that need to be hashed have do not identical .keys().
If this condition isn't satisfied, the hash function will still be valid, but may cause too many collisions. For example, in the extreme case where all the dictionaries are generated from a website template (field names as keys, user input as values), the keys will always be the same, and the hash function will return the same value for all the inputs. As a result, a hashtable that relies on such a hash function will become as slow as a list when retrieving an item (O(N) instead of O(1)).
I think the following solution will work reasonably well even if all 4 constraints I listed above are violated. It has an additional advantage that it can hash not only dictionaries, but any containers, even if they have nested mutable containers.
I'd much appreciate any feedback on this, since I only tested this lightly so far.
# python 3.4
import collections
import operator
import sys
import itertools
import reprlib
# a wrapper to make an object hashable, while preserving equality
class AutoHash:
# for each known container type, we can optionally provide a tuple
# specifying: type, transform, aggregator
# even immutable types need to be included, since their items
# may make them unhashable
# transformation may be used to enforce the desired iteration
# the result of a transformation must be an iterable
# default: no change; for dictionaries, we use .items() to see values
# usually transformation choice only affects efficiency, not correctness
# aggregator is the function that combines all items into one object
# default: frozenset; for ordered containers, we can use tuple
# aggregator choice affects both efficiency and correctness
# e.g., using a tuple aggregator for a set is incorrect,
# since identical sets may end up with different hash values
# frozenset is safe since at worst it just causes more collisions
# unfortunately, no collections.ABC class is available that helps
# distinguish ordered from unordered containers
# so we need to just list them out manually as needed
type_info = collections.namedtuple(
'type_info',
'type transformation aggregator')
ident = lambda x: x
# order matters; first match is used to handle a datatype
known_types = (
# dict also handles defaultdict
type_info(dict, lambda d: d.items(), frozenset),
# no need to include set and frozenset, since they are fine with defaults
type_info(collections.OrderedDict, ident, tuple),
type_info(list, ident, tuple),
type_info(tuple, ident, tuple),
type_info(collections.deque, ident, tuple),
type_info(collections.Iterable, ident, frozenset) # other iterables
)
# hash_func can be set to replace the built-in hash function
# cache can be turned on; if it is, cycles will be detected,
# otherwise cycles in a data structure will cause failure
def __init__(self, data, hash_func=hash, cache=False, verbose=False):
self._data=data
self.hash_func=hash_func
self.verbose=verbose
self.cache=cache
# cache objects' hashes for performance and to deal with cycles
if self.cache:
self.seen={}
def hash_ex(self, o):
# note: isinstance(o, Hashable) won't check inner types
try:
if self.verbose:
print(type(o),
reprlib.repr(o),
self.hash_func(o),
file=sys.stderr)
return self.hash_func(o)
except TypeError:
pass
# we let built-in hash decide if the hash value is worth caching
# so we don't cache the built-in hash results
if self.cache and id(o) in self.seen:
return self.seen[id(o)][0] # found in cache
# check if o can be handled by decomposing it into components
for typ, transformation, aggregator in AutoHash.known_types:
if isinstance(o, typ):
# another option is:
# result = reduce(operator.xor, map(_hash_ex, handler(o)))
# but collisions are more likely with xor than with frozenset
# e.g. hash_ex([1,2,3,4])==0 with xor
try:
# try to frozenset the actual components, it's faster
h = self.hash_func(aggregator(transformation(o)))
except TypeError:
# components not hashable with built-in;
# apply our extended hash function to them
h = self.hash_func(aggregator(map(self.hash_ex, transformation(o))))
if self.cache:
# storing the object too, otherwise memory location will be reused
self.seen[id(o)] = (h, o)
if self.verbose:
print(type(o), reprlib.repr(o), h, file=sys.stderr)
return h
raise TypeError('Object {} of type {} not hashable'.format(repr(o), type(o)))
def __hash__(self):
return self.hash_ex(self._data)
def __eq__(self, other):
# short circuit to save time
if self is other:
return True
# 1) type(self) a proper subclass of type(other) => self.__eq__ will be called first
# 2) any other situation => lhs.__eq__ will be called first
# case 1. one side is a subclass of the other, and AutoHash.__eq__ is not overridden in either
# => the subclass instance's __eq__ is called first, and we should compare self._data and other._data
# case 2. neither side is a subclass of the other; self is lhs
# => we can't compare to another type; we should let the other side decide what to do, return NotImplemented
# case 3. neither side is a subclass of the other; self is rhs
# => we can't compare to another type, and the other side already tried and failed;
# we should return False, but NotImplemented will have the same effect
# any other case: we won't reach the __eq__ code in this class, no need to worry about it
if isinstance(self, type(other)): # identifies case 1
return self._data == other._data
else: # identifies cases 2 and 3
return NotImplemented
d1 = {'a':[1,2], 2:{3:4}}
print(hash(AutoHash(d1, cache=True, verbose=True)))
d = AutoHash(dict(a=1, b=2, c=3, d=[4,5,6,7], e='a string of chars'),cache=True, verbose=True)
print(hash(d))
You might also want to add these two methods to get the v2 pickling protocol work with hashdict instances. Otherwise cPickle will try to use hashdict.____setitem____ resulting in a TypeError. Interestingly, with the other two versions of the protocol your code works just fine.
def __setstate__(self, objstate):
for k,v in objstate.items():
dict.__setitem__(self,k,v)
def __reduce__(self):
return (hashdict, (), dict(self),)
serialize the dict as string with json package:
d = {'a': 1, 'b': 2}
s = json.dumps(d)
restore the dict when you need:
d2 = json.loads(s)
If you don't put numbers in the dictionary and you never lose the variables containing your dictionaries, you can do this:
cache[id(rule)] = "whatever"
since id() is unique for every dictionary
EDIT:
Oh sorry, yeah in that case what the other guys said would be better. I think you could also serialize your dictionaries as a string, like
cache[ 'foo:bar' ] = 'baz'
If you need to recover your dictionaries from the keys though, then you'd have to do something uglier like
cache[ 'foo:bar' ] = ( {'foo':'bar'}, 'baz' )
I guess the advantage of this is that you wouldn't have to write as much code.