Python consistent hash replacement - python

As many have noted, Python's hash is no longer consistent across runs (as of version 3.3), because a random PYTHONHASHSEED is now used by default (to address security concerns, as explained in this excellent answer).
However, I have noticed that the hashes of some objects are still consistent (as of Python 3.7, anyway): that includes int, float, tuple(x), and frozenset(x) (as long as x yields a consistent hash). For example:
assert hash(10) == 10
assert hash((10, 11)) == 3713074054246420356
assert hash(frozenset([(0, 1, 2), 3, (4, 5, 6)])) == -8046488914261726427
Is that always true and guaranteed? If so, is that expected to stay that way? Is the PYTHONHASHSEED only applied to salt the hash of strings and byte arrays?
Why am I asking?
I have a system that relies on hashing to remember whether or not we have seen a given dict (in any order): {key: tuple(ints)}. In that system, the keys are a collection of filenames and the tuples a subset of os.stat_result, e.g. (size, mtime) associated with them. This system is used to make update/sync decisions based on detecting differences.
In my application, I have on the order of 100K such dicts, and each can represent several thousands of files and their state, so the compactness of the cache is important.
I can tolerate the small false positive rate (< 10^-19 for 64-bit hashes) coming from possible hash collisions (see also birthday paradox).
One compact representation is the following for each such dict "fsd":
def fsd_hash(fsd: dict):
    return hash(frozenset(fsd.items()))
It is very fast and yields a single int to represent an entire dict (with order-invariance). If anything in the fsd dict changes, with high probability the hash will be different.
Unfortunately, hash is only consistent within a single Python instance, rendering it useless for hosts to compare their respective hashes. Persisting the full cache ({location_name: fsd_hash}) to disk to be reloaded on restart is also useless.
I cannot expect the larger system that uses that module to have been invoked with PYTHONHASHSEED=0, and, to my knowledge, there is no way to change this once the Python instance has started.
Things I have tried
I could use hashlib.sha1 or similar to calculate consistent hashes. This is slower, and I can't directly use the frozenset trick: I have to iterate through the dict in a consistent order (e.g. by sorting on keys, which is slow) while updating the hasher. In my tests on real data, I see over a 50x slow-down.
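For illustration, a minimal sketch of that sort-then-hash approach might look like this (the helper name, and the assumption that keys are strings and values are tuples of integers fitting in 64 bits, are mine, not from the question):

import hashlib
import struct

def fsd_digest(fsd: dict) -> bytes:
    # Consistent, order-independent digest: sort the items, then feed one hasher.
    h = hashlib.sha1()
    for key, values in sorted(fsd.items()):   # sorting gives a canonical order (the slow part)
        h.update(key.encode('utf-8'))
        for v in values:
            h.update(struct.pack('<q', v))    # assumes each value fits in a signed 64-bit int
    return h.digest()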
I could try applying an order-invariant hashing algorithm on consistent hashes obtained for each item (also slow, as starting a fresh hasher for each item is time-consuming).
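A hedged sketch of that second idea: hash each item with a fresh hasher, then combine the per-item values with a commutative operation so order doesn't matter (repr() here is only illustrative and assumes the values have a stable, unambiguous repr):

import hashlib

def item_digest(key, values):
    # One fresh hasher per item (this is the slow part mentioned above).
    h = hashlib.sha1(key.encode('utf-8'))
    h.update(repr(values).encode('utf-8'))
    return int.from_bytes(h.digest()[:8], 'little')

def fsd_hash_order_invariant(fsd: dict) -> int:
    # Summing modulo 2**64 is commutative, so the iteration order is irrelevant.
    return sum(item_digest(k, v) for k, v in fsd.items()) % (2 ** 64)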
I can try transforming everything into ints or tuples of ints and then frozensets of such tuples. At the moment, it seems that all int, tuple(int) and frozenset(tuple(int)) yield consistent hashes, but: is that guaranteed, and if so, how long can I expect this to be the case?
Additional question: more generally, what would be a good way to write a consistent hash replacement for hash(frozenset(some_dict.items())) when the dict contains various types and classes? I can implement a custom __hash__ (a consistent one) for the classes I own, but I cannot override str's hash for example. One thing I came up with is:
def const_hash(x):
    if isinstance(x, (int, float, bool)):
        pass
    elif isinstance(x, frozenset):
        x = frozenset([const_hash(v) for v in x])
    elif isinstance(x, str):
        x = tuple([ord(e) for e in x])
    elif isinstance(x, bytes):
        x = tuple(x)
    elif isinstance(x, dict):
        # Note: a tuple of items is order-sensitive; wrap it in frozenset(...)
        # if dicts should hash independently of insertion order.
        x = tuple([(const_hash(k), const_hash(v)) for k, v in x.items()])
    elif isinstance(x, (list, tuple)):
        x = tuple([const_hash(e) for e in x])
    else:
        try:
            return x.const_hash()
        except AttributeError:
            raise TypeError(f'no known const_hash implementation for {type(x)}')
    return hash(x)
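For reference, on the filesystem-state dicts described above it would be called like this (file names and stat values are made up):

fsd = {'a.txt': (1024, 1700000000), 'b.txt': (2048, 1700000321)}
print(const_hash(frozenset(fsd.items())))  # order-invariant, like fsd_hash above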

Short answer to the broad question: there are no explicit guarantees about hashing stability aside from the overall guarantee that x == y requires hash(x) == hash(y). The implication is that x and y are both defined in the same run of the program (you obviously can't perform x == y where one of them doesn't exist in that program, so no guarantees are needed about the hash across runs).
Longer answers to specific questions:
Is [your belief that int, float, tuple(x), frozenset(x) (for x with consistent hash) have consistent hashes across separate runs] always true and guaranteed?
It's true of numeric types, with the mechanism being officially documented, but the mechanism is only guaranteed for a particular interpreter for a particular build. sys.hash_info provides the various constants, and they'll be consistent on that interpreter, but on a different interpreter (CPython vs. PyPy, 64 bit build vs. 32 bit build, even 3.n vs. 3.n+1) they can differ (documented to differ in the case of 64 vs. 32 bit CPython), so the hashes won't be portable across machines with different interpreters.
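For reference, those per-build constants can be inspected at run time; a quick look on a 64-bit CPython build (the values in the comments are typical for that build, not guaranteed elsewhere):

import sys

print(sys.hash_info.width)      # 64 on a 64-bit build, 32 on a 32-bit build
print(sys.hash_info.modulus)    # 2**61 - 1 on 64-bit CPython, 2**31 - 1 on 32-bit
print(sys.hash_info.inf)        # the hash of float('inf'), e.g. 314159
print(sys.hash_info.algorithm)  # the str/bytes algorithm, e.g. 'siphash24'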
No guarantees on algorithm are made for tuple and frozenset. I can't think of any reason they'd change it between runs (if the underlying types are seeded, tuple and frozenset benefit from it without needing any changes), but they can and do change the implementation between releases of CPython (e.g. in late 2018 a change was made to reduce the number of hash collisions in short tuples of ints and floats). So if you store the hashes of tuples from, say, 3.7, and then compute the hashes of the same tuples on 3.8+, they won't match (even though they would match between runs on 3.7, or between runs on 3.8).
If so, is that expected to stay that way?
Expected to, yes. Guaranteed, no. I could easily see seeded hashes for ints (and by extension, for all numeric types to preserve the numeric hash/equality guarantees) for the same reason they seeded hashes for str/bytes, etc. The main hurdles would be:
It would almost certainly be slower than the current, very simple algorithm.
By documenting the numeric hashing algorithm explicitly, they'd need a long period of deprecation before they could change it.
It's not strictly necessary (if web apps need seeded hashes for DoS protection, they can always just convert ints to str before using them as keys).
Is the PYTHONHASHSEED only applied to salt the hash of strings and byte arrays?
Beyond str and bytes, it applies to a number of other things that implement their own hashing in terms of the hash of str or bytes, often because they're already naturally convertible to raw bytes and are commonly used as keys in dicts populated by web-facing frontends. The ones I know of off-hand include the various classes of the datetime module (datetime, date, time, though this isn't actually documented in the module itself), and read-only memoryviews with byte-sized formats (which hash equivalently to hashing the result of the view's .tobytes() method).
What would be a good way to write a consistent hash replacement for hash(frozenset(some_dict.items())) when the dict contains various types and classes?
The simplest/most composable solution would probably be to define your const_hash as a single-dispatch function (functools.singledispatch), using it the same way you do hash itself. This avoids having one single function defined in a single place that must handle all types: you can have the const_hash default implementation (which just relies on hash for those things with known consistent hashes) in a central location, and provide additional definitions there for the built-in types you know aren't consistent (or which might contain inconsistent stuff), while still allowing people to extend the set of things it covers seamlessly by importing your const_hash and decorating the implementation for their type with @const_hash.register. It's not significantly different in effect from your proposed const_hash, but it's a lot more manageable.
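A minimal sketch of that single-dispatch arrangement, using functools.singledispatch (the registered implementations here are illustrative choices, not a complete or vetted set):

from functools import singledispatch
import hashlib

@singledispatch
def const_hash(x):
    # Default: fall back to built-in hash() for types assumed to be unsalted (e.g. int).
    return hash(x)

@const_hash.register
def _(x: str):
    # str hashes are salted per process, so derive a deterministic value instead.
    return int.from_bytes(hashlib.sha1(x.encode('utf-8')).digest()[:8], 'little')

@const_hash.register
def _(x: frozenset):
    # Combine element hashes with a commutative operation to stay order-invariant.
    return sum(const_hash(v) for v in x) % (2 ** 64)

Third parties can then extend it from their own modules with @const_hash.register on their types.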

Related

Is python's hash() persistent?

Is the hash() function in python guaranteed to always be the same for a given input, regardless of when/where it's entered? So far -- from trial-and-error only -- the answer seems to be yes, but it would be nice to understand the internals of how this works. For example, in a test:
$ python
>>> from ingest.tpr import *
>>> d=DailyPriceObj(date="2014-01-01")
>>> hash(d)
5440882306090652359
>>> ^D
$ python
>>> from ingest.tpr import *
>>> d=DailyPriceObj(date="2014-01-01")
>>> hash(d)
5440882306090652359
The contract for the __hash__ method requires that it be consistent within a given run of Python. There is no guarantee that it be consistent across different runs of Python, and in fact, for the built-in str, bytes-like types, and datetime.datetime objects (possibly others), the hash is salted with a per-run value so that it's almost never the same for the same input in different runs of Python.
No, it's dependent on the process. If you need a persistent hash, see Persistent Hashing of Strings in Python.
Truncation depending on platform, from the documentation of __hash__:
hash() truncates the value returned from an object’s custom __hash__() method to the size of a Py_ssize_t. This is typically 8 bytes on 64-bit builds and 4 bytes on 32-bit builds.
Salted hashes, from the same documentation (ShadowRanger's answer):
By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.
This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.
A necessary condition for hashability is that for equivalent objects this value is always the same (inside one run of the interpreter).
Of course, nothing prevents you from ignoring this requirement. But if you suddenly want to store your objects in a dictionary or a set, then problems may arise.
When you implement your own class you can define the methods __eq__ and __hash__. I have used a polynomial hash function for strings and a hash function from a universal family of hash functions.
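For illustration, a polynomial string hash of the kind mentioned here might look like this (the base and modulus are arbitrary choices for the sketch):

def poly_hash(s: str, base: int = 257, mod: int = (1 << 61) - 1) -> int:
    # Horner's scheme: h = s[0]*base**(n-1) + ... + s[n-1], everything modulo `mod`.
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h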
In general, the hash value of a specific object is not guaranteed to stay the same from one interpreter run to the next, although for many data types it does. One of the reasons for this randomized implementation is that it makes it harder to construct an anti-hash test (inputs deliberately chosen to collide).
For example:
For numeric types, the hash of a number x is based on the reduction of x modulo the prime P = 2**_PyHASH_BITS - 1. It's designed so that hash(x) == hash(y) whenever x and y are numerically equal, even if x and y have different types.
a = 123456789
b = 7  # any non-negative integer
hash(a) == 123456789                      # True
hash(a + b * (2 ** 61 - 1)) == 123456789  # True on a 64-bit build, where P = 2**61 - 1

Why does the **kwargs mapping compare equal with a differently ordered OrderedDict?

According to PEP 468:
Starting in version 3.6 Python will preserve the order of keyword arguments as passed to a function. To accomplish this the collected kwargs will now be an ordered mapping. Note that this does not necessarily mean OrderedDict.
In that case, why does this ordered mapping fail to respect equality comparison with Python's canonical ordered mapping type, the collections.OrderedDict:
>>> from collections import OrderedDict
>>> data = OrderedDict(zip('xy', 'xy'))
>>> def foo(**kwargs):
...     return kwargs == data
...
>>> foo(x='x', y='y') # expected result: True
True
>>> foo(y='y', x='x') # expected result: False
True
Although iteration order is now preserved, kwargs seems to be behaving just like a normal dict for the comparisons. Python has a C implemented ordered dict since 3.5, so it could conceivably have been used directly (or, if performance was still a concern, a faster implementation using a thin subclass of the 3.6 compact dict).
Why doesn't the ordered mapping received by a function respect ordering in equality comparisons?
Regardless of what an “ordered mapping” means, as long as kwargs isn't necessarily an OrderedDict, OrderedDict's == won't take its order into account when comparing against it. From the docs:
Equality tests between OrderedDict objects are order-sensitive and are implemented as list(od1.items())==list(od2.items()). Equality tests between OrderedDict objects and other Mapping objects are order-insensitive like regular dictionaries. This allows OrderedDict objects to be substituted anywhere a regular dictionary is used.
"Ordered mapping" only means the mapping has to preserve order. It doesn't mean that order has to be part of the mapping's == relation.
The purpose of PEP 468 is just to preserve the ordering information. Having order be part of == would produce backward incompatibility without any real benefit to any of the use cases that motivated PEP 468. Using OrderedDict would also be more expensive (since OrderedDict still maintains its own separate linked list to track order, and it can't abandon that linked list without sacrificing big-O efficiency in popitem and move_to_end).
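A quick illustration of the documented behaviour (order only matters when both operands are OrderedDict):

>>> from collections import OrderedDict
>>> od1 = OrderedDict([('x', 1), ('y', 2)])
>>> od2 = OrderedDict([('y', 2), ('x', 1)])
>>> od1 == od2        # OrderedDict vs OrderedDict: order-sensitive
False
>>> od1 == dict(od2)  # OrderedDict vs plain Mapping: order-insensitive
True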
The answer to your first 'why' is that this feature is implemented using a plain dict in CPython. As @Ryan's answer points out, this means that comparisons won't be order-sensitive.
The second 'why' here is why this doesn't use an OrderedDict.
Using an OrderedDict was the initial plan, as stated in the first draft of PEP 468. The idea, as stated in this reply, was to collect some perf data to show the effect of plugging in the OrderedDict, since this was a point of contention when the idea was floated around before. The author of the PEP even alluded to the order-preserving dict being another option in the final reply on that thread.
After that, the conversation on the topic seems to have died down until Python 3.6 came along. When the new dict arrived, it had the nice side effect of implementing PEP 468 out of the box (as this Python-dev thread states). The specific message in that thread also states how the author wanted the term OrderedDict to be changed to Ordered Mapping. (This is also when a new commit on PEP 468, after the initial one, was made.)
As far as I can tell, this rewording was done in order to allow other implementations to provide this feature as they see fit. CPython and PyPy already had a dict that easily implemented PEP 468, other implementations might opt for an OrderedDict, others could go for another form of an ordered mapping.
That does open the door for a problem, though. It does mean that, theoretically, in an implementation of Python 3.6 with an OrderedDict as the structure implementing this feature, the comparison would be order-sensitive while in others (CPython) it wouldn't. (In Python 3.7, all dicts are required to be insertion-ordered so this point is probably moot since all implementations would use it for **kwargs)
Though it does seem like an issue, it really isn't. As @user2357112 pointed out, there's no guarantee on ==; PEP 468 only guarantees order. As far as I can tell, == is basically implementation-defined.
In short, it compares equal in CPython because kwargs in CPython is a dict and it's a dict because after 3.6 the whole thing just worked.
Just to add: if you do want to make this check order-sensitive (without relying on an implementation detail, which even then won't give you order-sensitivity in Python 3.7), just do
>>> from collections import OrderedDict
>>> data = OrderedDict(zip('xy', 'xy'))
>>> def foo(**kwargs):
...     return OrderedDict(kwargs) == data
since comparison between two OrderedDict objects is guaranteed to be order-sensitive.

Why are mutable strings slower than immutable strings?

Why are mutable strings slower than immutable strings?
EDIT:
>>> import UserString
... def test():
...     s = UserString.MutableString('Python')
...     for i in range(3):
...         s[0] = 'a'
...
... if __name__=='__main__':
...     from timeit import Timer
...     t = Timer("test()", "from __main__ import test")
...     print t.timeit()
13.5236170292
>>> import UserString
... def test():
...     s = UserString.MutableString('Python')
...     s = 'abcd'
...     for i in range(3):
...         s = 'a' + s[1:]
...
... if __name__=='__main__':
...     from timeit import Timer
...     t = Timer("test()", "from __main__ import test")
...     print t.timeit()
6.24725079536
>>> import UserString
... def test():
...     s = UserString.MutableString('Python')
...     for i in range(3):
...         s = 'a' + s[1:]
...
... if __name__=='__main__':
...     from timeit import Timer
...     t = Timer("test()", "from __main__ import test")
...     print t.timeit()
38.6385951042
I think it is obvious why I put s = UserString.MutableString('Python') in the second test.
In a hypothetical language that offers both mutable and immutable, otherwise equivalent, string types (I can't really think of one offhand -- e.g., Python and Java both have immutable strings only, and other ways to make one through mutation which add indirectness and therefore can of course slow things down a bit;-), there's no real reason for any performance difference -- for example, in C++, interchangeably using a std::string or a const std::string I would expect to cause no performance difference (admittedly a compiler might be able to optimize code using the latter better by counting on the immutability, but I don't know any real-world ones that do perform such theoretically possible optimizations;-).
Having immutable strings may and does in fact allow very substantial optimizations in Java and Python. For example, if the strings get hashed, the hash can be cached, and will never have to be recomputed (since the string can't change) -- that's especially important in Python, which uses hashed strings (for look-ups in sets and dictionaries) so lavishly and even "behind the scenes". Fresh copies never need to be made "just in case" the previous one has changed in the meantime -- references to a single copy can always be handed out systematically whenever that string is required. Python also copiously uses "interning" of (some) strings, potentially allowing constant-time comparisons and many other similarly fast operations -- think of it as one more way, a more advanced one to be sure, to take advantage of strings' immutability to cache more of the results of operations often performed on them.
That's not to say that a given compiler is going to take advantage of all possible optimizations, of course. For example, when a slice of a string is requested, there is no real need to make a new object and copy the data over -- the new slice might refer to the old one with an offset (and an independently stored length), potentially a great optimization for big strings out of which many slices are taken. Python doesn't do that because, unless particular care is taken in memory management, this might easily result in the "big" string being all kept in memory when only a small slice of it is actually needed -- but it's a tradeoff that a different implementation might definitely choose to perform (with that burden of extra memory management, to be sure -- more complex, harder-to-debug compiler and runtime code for the hypothetical language in question).
I'm just scratching the surface here -- and many of these advantages would be hard to keep if otherwise interchangeable string types could exist in both mutable and immutable versions (which I suspect is why, to the best of my current knowledge at least, C++ compilers actually don't bother with such optimizations, despite being generally very performance-conscious). But by offering only immutable strings as the primitive, fundamental data type (and thus implicitly accepting some disadvantage when you'd really need a mutable one;-), languages such as Java and Python can clearly gain all sorts of advantages -- performance issues being only one group of them (Python's choice to allow only immutable primitive types to be hashable, for example, is not a performance-centered design decision -- it's more about clarity and predictability of behavior for sets and dictionaries!-).
I don't know if they are really a lot slower, but they often make reasoning about a program easier, because the state of the object/string can't change. That's the most important property of immutability to me.
Furthermore, you might assume that immutable strings are faster because they have less state (which could change), which might mean lower memory consumption and fewer CPU cycles.
I also found this interesting article while googling which I would like to quote:
knowing that a string is immutable makes it easy to lay it out at construction time — fixed and unchanging storage requirements
with an immutable string, python can intern it and refer to it internally by its address in memory. This means that to compare two strings, it only has to compare their addresses in memory (unless one of them isn't interned). Also, keep in mind that not all strings are interned; I've seen examples of constructed strings that are not interned.
with mutable strings, string comparison would involve comparing them character by character and would also require either storing identical strings in different locations (malloc is not free) or adding logic to keep track of how many times a given string is referred to and making a copy for every mutation if there were more than one referrer.
It seems like python is optimized for string comparison. This makes sense because even string manipulation involves string comparison in most cases so for most use cases, it's the lowest common denominator.
Another advantage of immutable strings is that it makes it possible for them to be hashable which is a requirement for using them for dictionary keys. imagine a scenario where they were mutable:
s = 'a'
d = {s : 1}
s = s + 'b'
d[s] = ?
I suppose python could keep track of which dicts have which strings as keys and update all of their hashtables when a string was modified, but that's just adding more overhead to dict insertion. It's not too far off the mark to say that you can't do anything in python without a dict insertion/lookup, so that would be very, very bad. It also adds overhead to string manipulation.
The obvious answer to your question is that normal strings are implemented in C, while MutableString is implemented in Python.
Not only does every operation on a mutable string have the overhead of going through one or more Python function calls, but the implementation is essentially a wrapper round an immutable string - when you modify the string it creates a new immutable string and throws the old one away. You can read the source in the UserString.py file in your Python lib directory.
To quote the Python docs:
Note: This UserString class from this module is available for backward compatibility only. If you are writing code that does not need to work with versions of Python earlier than Python 2.2, please consider subclassing directly from the built-in str type instead of using UserString (there is no built-in equivalent to MutableString).
This module defines a class that acts as a wrapper around string objects. It is a useful base class for your own string-like classes, which can inherit from them and override existing methods or add new ones. In this way one can add new behaviors to strings.
It should be noted that these classes are highly inefficient compared to real string or Unicode objects; this is especially the case for MutableString.
(Emphasis added).

Is python's "set" stable?

The question arose when answering to another SO question (there).
When I iterate several times over a python set (without changing it between calls), can I assume it will always return elements in the same order? And if not, what is the rationale of changing the order ? Is it deterministic, or random? Or implementation defined?
And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?
The underlying question is if python set iteration order only depends on the algorithm used to implement sets, or also on the execution context?
There's no formal guarantee about the stability of sets. However, in the CPython implementation, as long as nothing changes the set, the items will be produced in the same order. Sets are implemented as open-addressing hashtables (with a prime probe), so inserting or removing items can completely change the order (in particular, when that triggers a resize, which reorganizes how the items are laid out in memory.) You can also have two identical sets that nonetheless produce the items in different order, for example:
>>> s1 = {-1, -2}
>>> s2 = {-2, -1}
>>> s1 == s2
True
>>> list(s1), list(s2)
([-1, -2], [-2, -1])
Unless you're very certain you have the same set and nothing touched it in between the two iterations, it's best not to rely on it staying the same. Making seemingly irrelevant changes to, say, functions you call in between could produce very hard-to-find bugs.
A set or frozenset is inherently an unordered collection. Internally, sets are based on a hash table, and the order of keys depends both on the insertion order and on the hash algorithm. In CPython (aka standard Python), integers less than the machine word size (32 bit or 64 bit) hash to themselves, but text strings, byte strings, and datetime objects hash to integers that vary randomly; you can control that by setting the PYTHONHASHSEED environment variable.
From the __hash__ docs:
Note
By default, the __hash__() values of str, bytes and datetime
objects are “salted” with an unpredictable random value. Although they
remain constant within an individual Python process, they are not
predictable between repeated invocations of Python.
This is intended to provide protection against a denial-of-service
caused by carefully-chosen inputs that exploit the worst case
performance of a dict insertion, O(n^2) complexity. See
http://www.ocert.org/advisories/ocert-2011-003.html for details.
Changing hash values affects the iteration order of dicts, sets and
other mappings. Python has never made guarantees about this ordering
(and it typically varies between 32-bit and 64-bit builds).
See also PYTHONHASHSEED.
The results of hashing objects of other classes depend on the details of the class's __hash__ method.
The upshot of all this is that you can have two sets containing identical strings but when you convert them to lists they can compare unequal. Or they may not. ;) Here's some code that demonstrates this. On some runs, it will just loop, not printing anything, but on other runs it will quickly find a set that uses a different order to the original.
from random import seed, shuffle
seed(42)
data = list('abcdefgh')
a = frozenset(data)
la = list(a)
print(''.join(la), a)
while True:
    shuffle(data)
    lb = list(frozenset(data))
    if lb != la:
        print(''.join(data), ''.join(lb))
        break
typical output
dachbgef frozenset({'d', 'a', 'c', 'h', 'b', 'g', 'e', 'f'})
deghcfab dahcbgef
And when I call the same python
program repeatedly (not random, not
input dependent), will I get the same
ordering for sets?
I can answer this part of the question now after a quick experiment. Using the following code:
class Foo(object):
    def __init__(self, val):
        self.val = val
    def __repr__(self):
        return str(self.val)

x = set()
for y in range(500):
    x.add(Foo(y))

print list(x)[-10:]
I can trigger the behaviour that I was asking about in the other question. If I run this repeatedly then the output changes, but not on every run. It seems to be "weakly random" in that it changes slowly. This is certainly implementation-dependent, so I should say that I'm running the MacPorts Python 2.6 on Snow Leopard. While the program will output the same answer for long stretches of time, doing something that affects the system entropy pool (writing to the disk mostly works) will sometimes kick it into a different output.
The class Foo is just a simple int wrapper as experiments show that this doesn't happen with sets of ints. I think that the problem is caused by the lack of __eq__ and __hash__ members for the object, although I would dearly love to know the underlying explanation / ways to avoid it. Also useful would be some way to reproduce / repeat a "bad" run. Does anyone know what seed it uses, or how I could set that seed?
It’s definitely implementation defined. The specification of a set says only that
Being an unordered collection, sets do not record element position or order of insertion.
Why not use OrderedDict to create your own OrderedSet class?
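A minimal sketch of that suggestion (far from a full set replacement; it only shows the idea of keeping members as OrderedDict keys):

from collections import OrderedDict

class OrderedSet:
    def __init__(self, iterable=()):
        self._d = OrderedDict.fromkeys(iterable)   # members live as keys, values unused
    def add(self, item):
        self._d[item] = None
    def discard(self, item):
        self._d.pop(item, None)
    def __contains__(self, item):
        return item in self._d
    def __iter__(self):
        return iter(self._d)                       # iteration order == insertion order
    def __len__(self):
        return len(self._d)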
The answer is simply a NO.
Python's set iteration order is NOT stable across runs.
I did a simple experiment to show this.
The code:
import random
random.seed(1)
x=[]
class aaa(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b

for i in range(5):
    x.append(aaa(random.choice('asf'), random.randint(1, 4000)))

for j in x:
    print(j.a, j.b)

print('====')

for j in set(x):
    print(j.a, j.b)
Run this twice and you will get:
First time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
a 2030
a 2332
f 1555
a 1045
s 1935
Process finished with exit code 0
Second time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
s 1935
a 2332
a 1045
f 1555
a 2030
Process finished with exit code 0
The reason is explained in comments in this answer.
However, there are some ways to make it stable:
set PYTHONHASHSEED to 0, see details here, here and here.
Use OrderedDict instead.
As pointed out, this is strictly an implementation detail.
But as long as you don’t change the structure between calls, there should be no reason for a read-only operation (= iteration) to change with time: no sane implementation does that. Even randomized (= non-deterministic) data structures that can be used to implement sets (e.g. skip lists) don’t change the reading order when no changes occur.
So, being rational, you can safely rely on this behaviour.
(I’m aware that certain GCs may reorder memory in a background thread but even this reordering will not be noticeable on the level of data structures, unless a bug occurs.)
The definition of a set is unordered, unique elements ("Unordered collections of unique elements"). You should care only about the interface, not the implementation. If you want an ordered enumeration, you should probably put it into a list and sort it.
There are many different implementations of Python. Don't rely on undocumented behaviour, as your code could break on different Python implementations.

I need to free up RAM by storing a Python dictionary on the hard drive, not in RAM. Is it possible?

In my case, I have a dictionary of about 6000 instantiated classes, where each class has 1000 attributed variables all of type string or list of strings. As I build this dictionary up, my RAM goes up super high. Is there a way to write the dictionary as it is being built to the harddrive rather than the RAM so that I can save some memory? I've heard of something called "pickle" but I don't know if this is a feasible method for what I am doing.
Thanks for your help!
Maybe you should be using a database, but check out the shelve module
If shelve isn't powerful enough for you, there is always the industrial strength ZODB
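A minimal shelve sketch on Python 3 (the file name and contents are made up); values are pickled to disk instead of living in a plain in-memory dict:

import shelve

with shelve.open('state.shelf') as db:   # creates the shelf file(s) on disk
    db['a.txt'] = ['some', 'strings']    # keys must be str; values are pickled
    print(db['a.txt'])                   # read back transparently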
shelve, as @gnibbler recommends, is what I would no doubt be using, but watch out for two traps: a simple one (all keys must be strings) and a subtle one (as the values don't normally exist in memory, calling mutators on them may not work as you expect).
For the simple problem, it's normally easy to find a workaround (and you do get a clear exception if you forget and try e.g. using an int or whatever as the key, so it's not hard to remember that you do need one).
For the subtle problem, consider for example:
x = d['foo']
x.amutatingmethod()
...much later...
y = d['foo']
# is y "mutated" or not now?
the answer to the question in the last comment depends on whether d is a real dict (in which case y will be mutated, and in fact exactly the same object as x) or a shelf (in which case y will be a distinct object from x, and in exactly the state you last saved to d['foo']!).
To get your mutations to persist, you need to "save them to disk" by doing
d['foo'] = x
after calling any mutators you want on x (so in particular you cannot just do
d['foo'].mutator()
and expect the mutation to "stick", as you would if d were a dict).
shelve does have an option to cache all fetched items in memory, but of course that can fill up the memory again, and result in long delays when you finally close the shelf object (since all the cached items must be saved back to disk then, just in case they had been mutated). That option was something I originally pushed for (as a Python core committer), but I've since changed my mind and I now apologize for getting it in (ah well, at least it's not the default!-), since the cases it should be used in are rare, and it can often trap the unwary user... sorry.
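That option is the writeback flag; a small sketch of the trade-off just described (file name made up):

import shelve

d = shelve.open('cache.shelf', writeback=True)  # fetched/assigned values are cached in memory
d['foo'] = []
d['foo'].append('bar')   # with writeback=True this mutation is kept in the cache...
d.close()                # ...and only written to disk on sync()/close()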
BTW, in case you don't know what a mutator, or "mutating method", is, it's any method that alters the state of the object you call it on -- e.g. .append if the object is a list, .pop if the object is any kind of container, and so on. No need to worry if the object is immutable, of course (numbers, strings, tuples, frozensets, ...), since it doesn't have mutating methods in that case;-).
Pickling an entire hash over and over again is bound to run into the same memory pressures that you're facing now -- maybe even worse, with all the data marshaling back and forth.
Instead, using an on-disk database that acts like a hash is probably the best bet; see this page for a quick introduction to using dbm-style databases in your program: http://docs.python.org/library/dbm
They act enough like hashes that it should be a simple transition for you.
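A small dbm sketch on Python 3 (dbm stores only string/bytes keys and values, so richer values have to be serialized first; names here are made up):

import dbm
import json

with dbm.open('cache', 'c') as db:                # 'c' creates the file if needed
    db['a.txt'] = json.dumps([1024, 1700000000])  # serialize non-string values
    size, mtime = json.loads(db['a.txt'])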
"""I have a dictionary of about 6000 instantiated classes, where each class has 1000 attributed variables all of type string or list of strings""" ... I presume that you mean: """I have a class with about 1000 attributes all of type str or list of str. I have a dictionary mapping about 6000 keys of unspecified type to corresponding instances of that class.""" If that's not a reasonable translation, please correct it.
For a start, 1000 attributes in a class is mindboggling. You must be treating the vast majority generically using value = getattr(obj, attr_name) and setattr(obj, attr_name, value). Consider using a dict instead of an instance: value = obj[attr_name] and obj[attr_name] = value.
Secondly, what percentage of those 6 million attributes are ""? If sufficiently high, you might like to consider implementing a sparse dict which doesn't physically have entries for those attributes, using the __missing__ hook -- docs here.
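A sketch of that __missing__ idea, assuming the empty string is the default you want to avoid storing (the class name is mine):

class SparseAttrs(dict):
    def __missing__(self, key):
        return ''                    # attributes never stored read back as ""
    def __setitem__(self, key, value):
        if value == '':
            self.pop(key, None)      # don't physically store the default value
        else:
            super().__setitem__(key, value)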
