Is the hash() function in Python guaranteed to always return the same value for a given input, regardless of when/where it's computed? So far -- from trial and error only -- the answer seems to be yes, but it would be nice to understand the internals of how this works. For example, in a test:
$ python
>>> from ingest.tpr import *
>>> d=DailyPriceObj(date="2014-01-01")
>>> hash(d)
5440882306090652359
>>> ^D
$ python
>>> from ingest.tpr import *
>>> d=DailyPriceObj(date="2014-01-01")
>>> hash(d)
5440882306090652359
The contract for the __hash__ method requires that it be consistent within a given run of Python. There is no guarantee that it be consistent across different runs of Python, and in fact, for the built-in str, bytes-like types, and datetime.datetime objects (possibly others), the hash is salted with a per-run value so that it's almost never the same for the same input in different runs of Python.
No, it's dependent on the process. If you need a persistent hash, see Persistent Hashing of Strings in Python.
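A quick way to observe the process dependence for str (a sketch; assumes Python 3.7+ for subprocess.run's capture_output):
import subprocess
import sys

# Run the same hash() expression in two fresh interpreters and compare.
cmd = [sys.executable, '-c', 'print(hash("test"))']
first = subprocess.run(cmd, capture_output=True, text=True).stdout
second = subprocess.run(cmd, capture_output=True, text=True).stdout
print(first == second)  # almost always False: str hashes are salted per process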
Truncation depending on platform, from the documentation of __hash__:
hash() truncates the value returned from an object’s custom __hash__() method to the size of a Py_ssize_t. This is typically 8 bytes on 64-bit builds and 4 bytes on 32-bit builds.
Salted hashes, from the same documentation (ShadowRanger's answer):
By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.
This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.
A necessary condition for hashability is that equivalent objects always have the same hash value (within one run of the interpreter).
Of course, nothing prevents you from ignoring this requirement. But if you suddenly want to store your objects in a dictionary or a set, then problems may arise.
When you implement your own class you can define the methods __eq__ and __hash__. I have used a polynomial hash function for strings and a hash function from a universal family of hash functions.
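For example, a minimal sketch of such a class (the base B and modulus M are illustrative choices, not the ones I actually used):
class Label:
    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        return isinstance(other, Label) and self.text == other.text

    def __hash__(self):
        # Polynomial hash of the string: sum of ord(c) * B**i modulo a prime M.
        # Equal texts always produce equal hashes, as the contract requires.
        h = 0
        B, M = 131, (1 << 61) - 1
        for c in self.text:
            h = (h * B + ord(c)) % M
        return h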
In general, you shouldn't expect the hash value for a specific object to stay the same from one interpreter run to the next, although for many data types it actually does. One of the reasons for randomizing the hash is that it makes it more difficult to construct an "anti-hash" test (a set of inputs deliberately chosen to produce collisions).
For example:
For numeric types, the hash of a number x is based on the reduction of x modulo the prime P = 2**_PyHASH_BITS - 1. It's designed so that hash(x) == hash(y) whenever x and y are numerically equal, even if x and y have different types.
a = 123456789
b = 10  # any non-negative integer
hash(a) == 123456789                       # True
hash(a + b * (2 ** 61 - 1)) == 123456789   # True on 64-bit builds
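These invariants are easy to check against sys.hash_info (a sketch; the printed values assume a 64-bit CPython build):
import sys

P = sys.hash_info.modulus                     # 2**61 - 1 on 64-bit CPython
print(hash(P), hash(P + 1))                   # 0 1: values reduce modulo P
print(hash(7) == hash(7.0) == hash(7 + 0j))   # True: equal numbers hash equal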
Context: building a consistent hashing algorithm.
The official documentation for Python's hash() function states:
Return the hash value of the object (if it has one). Hash values are integers.
However, it does not explicitly state whether the function maps to an integer range (with a minimum and a maximum) or not.
Coming from other languages where values for primitive types are bounded (e.g. C#'s/Java's Int.MaxValue), I know that Python likes to think in "unbounded" terms – i.e. switching from int to long in the background.
Am I to assume that the hash() function also is unbounded? Or is it bounded, for example mapping to what Python assigns to the max/min values of the "int-proper" – i.e. between -2147483648 through 2147483647?
As others pointed out, there is a misplaced[1] Note in the documentation that reads:
hash() truncates the value returned from an object’s custom __hash__() method to the size of a Py_ssize_t.
To answer the question, we need to know the size of this Py_ssize_t. After some research, it seems that its maximum value is stored in sys.maxsize, although I'd appreciate some feedback here.
The solution I eventually adopted was:
import sys

bits = sys.hash_info.width   # in my case, 64
print(sys.maxsize)           # in my case, 9223372036854775807

# Therefore:
hash_maxValue = 2**(bits - 1) - 1   # 9223372036854775807, or +sys.maxsize
hash_minValue = -hash_maxValue      # -9223372036854775807, or -sys.maxsize
Happy to receive comments/feedback on this – until proven wrong, this is the accepted answer.
[1] The note is included in the section dedicated to __hash__() instead of the one dedicated to hash().
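One related CPython detail that affects the range: hash() never returns -1, because -1 signals an error at the C level, so a __hash__ that produces -1 is silently remapped to -2:
class H:
    def __hash__(self):
        return -1

print(hash(H()))   # -2: CPython remaps -1, which is reserved for errors
print(hash(-1))    # -2 for the same reason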
From the documentation
hash() truncates the value returned from an object’s custom __hash__() method to the size of a Py_ssize_t. This is typically 8 bytes on 64-bit builds and 4 bytes on 32-bit builds. If an object’s __hash__() must interoperate on builds of different bit sizes, be sure to check the width on all supported builds. An easy way to do this is with python -c "import sys; print(sys.hash_info.width)".
More details can be found here https://docs.python.org/3/reference/datamodel.html#object.__hash__
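For a quick look at the constants involved (the commented values assume a 64-bit CPython build):
import sys

print(sys.hash_info.width)    # 64
print(sys.hash_info.modulus)  # 2305843009213693951 == 2**61 - 1
print(sys.maxsize)            # 9223372036854775807 == 2**63 - 1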
As noted by many, Python's hash is not consistent anymore (as of version 3.3), as a random PYTHONHASHSEED is now used by default (to address security concerns, as explained in this excellent answer).
However, I have noticed that the hash of some objects are still consistent (as of Python 3.7 anyway): that includes int, float, tuple(x), frozenset(x) (as long as x yields consistent hash). For example:
assert hash(10) == 10
assert hash((10, 11)) == 3713074054246420356
assert hash(frozenset([(0, 1, 2), 3, (4, 5, 6)])) == -8046488914261726427
Is that always true and guaranteed? If so, is that expected to stay that way? Is the PYTHONHASHSEED only applied to salt the hash of strings and byte arrays?
Why am I asking?
I have a system that relies on hashing to remember whether or not we have seen a given dict (in any order): {key: tuple(ints)}. In that system, the keys are a collection of filenames and the tuples a subset of os.stat_result, e.g. (size, mtime) associated with them. This system is used to make update/sync decisions based on detecting differences.
In my application, I have on the order of 100K such dicts, and each can represent several thousands of files and their state, so the compactness of the cache is important.
I can tolerate the small false positive rate (< 10^-19 for 64-bit hashes) coming from possible hash collisions (see also birthday paradox).
One compact representation is the following for each such dict "fsd":
def fsd_hash(fsd: dict):
    return hash(frozenset(fsd.items()))
It is very fast and yields a single int to represent an entire dict (with order-invariance). If anything in the fsd dict changes, with high probability the hash will be different.
Unfortunately, hash is only consistent within a single Python instance, rendering it useless for hosts to compare their respective hashes. Persisting the full cache ({location_name: fsd_hash}) to disk to be reloaded on restart is also useless.
I cannot expect the larger system that uses that module to have been invoked with PYTHONHASHSEED=0, and, to my knowledge, there is no way to change this once the Python instance has started.
Things I have tried
I may use hashlib.sha1 or similar to calculate consistent hashes. This is slower and I can't directly use the frozenset trick: I have to iterate through the dict in a consistent order (e.g. by sorting on keys, slow) while updating the hasher. In my tests on real data, I see over 50x slow-down.
I could try applying an order-invariant hashing algorithm to consistent hashes obtained for each item (also slow, as starting a fresh hasher for each item is time-consuming); a sketch of this approach follows this list.
I can try transforming everything into ints or tuples of ints and then frozensets of such tuples. At the moment, it seems that all int, tuple(int) and frozenset(tuple(int)) yield consistent hashes, but: is that guaranteed, and if so, how long can I expect this to be the case?
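For reference, a sketch of the order-invariant combination mentioned above, assuming repr() output is stable for the key/value types involved (true for str keys and tuples of ints):
import hashlib

def fsd_hash_consistent(fsd: dict) -> int:
    # Hash each (key, value) pair independently, then combine with XOR;
    # XOR is commutative, so dict iteration order doesn't matter and no
    # sorting is needed. Keys are unique, so items can't cancel each other.
    acc = 0
    for key, value in fsd.items():
        item = repr((key, value)).encode('utf-8')
        digest = hashlib.blake2b(item, digest_size=8).digest()
        acc ^= int.from_bytes(digest, 'big')
    return acc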
Additional question: more generally, what would be a good way to write a consistent hash replacement for hash(frozenset(some_dict.items())) when the dict contains various types and classes? I can implement a custom __hash__ (a consistent one) for the classes I own, but I cannot override str's hash for example. One thing I came up with is:
def const_hash(x):
    if isinstance(x, (int, float, bool)):
        pass
    elif isinstance(x, frozenset):
        x = frozenset([const_hash(v) for v in x])
    elif isinstance(x, str):
        x = tuple([ord(e) for e in x])
    elif isinstance(x, bytes):
        x = tuple(x)
    elif isinstance(x, dict):
        x = tuple([(const_hash(k), const_hash(v)) for k, v in x.items()])
    elif isinstance(x, (list, tuple)):
        x = tuple([const_hash(e) for e in x])
    else:
        try:
            return x.const_hash()
        except AttributeError:
            raise TypeError(f'no known const_hash implementation for {type(x)}')
    return hash(x)
Short answer to broad question: There are no explicit guarantees made about hashing stability aside from the overall guarantee that x == y requires that hash(x) == hash(y). There is an implication that x and y are both defined in the same run of the program (you can't perform x == y where one of them doesn't exist in that program obviously, so no guarantees are needed about the hash across runs).
Longer answers to specific questions:
Is [your belief that int, float, tuple(x), frozenset(x) (for x with consistent hash) have consistent hashes across separate runs] always true and guaranteed?
It's true of numeric types, with the mechanism being officially documented, but the mechanism is only guaranteed for a particular interpreter for a particular build. sys.hash_info provides the various constants, and they'll be consistent on that interpreter, but on a different interpreter (CPython vs. PyPy, 64 bit build vs. 32 bit build, even 3.n vs. 3.n+1) they can differ (documented to differ in the case of 64 vs. 32 bit CPython), so the hashes won't be portable across machines with different interpreters.
No guarantees on algorithm are made for tuple and frozenset; I can't think of any reason they'd change it between runs (if the underlying types are seeded, the tuple and frozenset benefit from it without needing any changes), but they can and do change the implementation between releases of CPython (e.g. in late 2018 they made a change to reduce the number of hash collisions in short tuples of ints and floats), so if you store off the hashes of tuples from say, 3.7, and then compute hashes of the same tuples in 3.8+, they won't match (even though they'd match between runs on 3.7 or between runs on 3.8).
If so, is that expected to stay that way?
Expected to, yes. Guaranteed, no. I could easily see seeded hashes for ints (and by extension, for all numeric types to preserve the numeric hash/equality guarantees) for the same reason they seeded hashes for str/bytes, etc. The main hurdles would be:
It would almost certainly be slower than the current, very simple algorithm.
By documenting the numeric hashing algorithm explicitly, they'd need a long period of deprecation before they could change it.
It's not strictly necessary (if web apps need seeded hashes for DoS protection, they can always just convert ints to str before using them as keys).
Is the PYTHONHASHSEED only applied to salt the hash of strings and byte arrays?
Beyond str and bytes, it applies to a number of other things that implement their own hashing in terms of the hash of str or bytes, often because they're already naturally convertible to raw bytes and are commonly used as keys in dicts populated by web-facing frontends. The ones I know of off-hand include the various classes of the datetime module (datetime, date, time, though this isn't actually documented in the module itself), and read-only memoryviews with byte-sized formats (which hash equivalently to hashing the result of the view's .tobytes() method).
What would be a good way to write a consistent hash replacement for hash(frozenset(some_dict.items())) when the dict contains various types and classes?
The simplest/most composable solution would probably be to define your const_hash as a single-dispatch function (functools.singledispatch), using it the same way you do hash itself. This avoids having one single function defined in a single place that must handle all types; you can have the const_hash default implementation (which just relies on hash for those things with known consistent hashes) in a central location, and provide additional definitions for the built-in types you know aren't consistent (or which might contain inconsistent stuff) there, while still allowing people to extend the set of things it covers seamlessly by importing your const_hash and decorating the implementation for their type with @const_hash.register. It's not significantly different in effect from your proposed const_hash, but it's a lot more manageable.
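To illustrate, a minimal single-dispatch sketch under those assumptions (Python 3.7+ for annotation-based register; the registered implementations mirror the asker's const_hash and are examples, not a complete set):
from functools import singledispatch

@singledispatch
def const_hash(x):
    # Fallback: rely on hash() for types whose hashes are assumed consistent
    # across runs (ints, floats, bools, ...).
    return hash(x)

@const_hash.register
def _(x: str):
    # Bypass the per-process salt by hashing the raw encoded bytes instead.
    return hash(tuple(x.encode('utf-8')))

@const_hash.register
def _(x: bytes):
    return hash(tuple(x))

@const_hash.register
def _(x: frozenset):
    return hash(frozenset(const_hash(v) for v in x))

@const_hash.register
def _(x: dict):
    return hash(frozenset((const_hash(k), const_hash(v)) for k, v in x.items()))
Third parties then extend it from their own modules by decorating their implementation with @const_hash.register, exactly as described above.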
I'm trying to find the Go equivalent to python's hash function:
hash("test")
I've found this post, which describes a very similar function in the sense that it returns an integer; however, it uses FNV, which appears to be a different hashing method from the Python version.
What I'm trying to do is pass a string to the hash function whereby it returns exactly the same integer in both languages for the same string.
By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.
You will get different numbers between different invocations of the Python script. So I don't think what you want is even possible.
Source: https://docs.python.org/3.5/reference/datamodel.html#object.__hash__
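If the real goal is just an integer that matches across the two languages, one workable sketch is to sidestep hash() entirely and derive the integer from a standardized digest on both sides (the function name here is illustrative; the Go side would hash the same UTF-8 bytes with crypto/sha256 and take the same 8 bytes):
import hashlib

def stable_hash64(s: str) -> int:
    # SHA-256 is specified identically everywhere, so any language hashing
    # the same UTF-8 bytes and taking the first 8 bytes gets the same integer.
    digest = hashlib.sha256(s.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big', signed=True)

print(stable_hash64('test'))  # same value on every run, machine, and language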
Why are mutable strings slower than immutable strings?
EDIT:
import UserString

def test():
    s = UserString.MutableString('Python')
    for i in range(3):
        s[0] = 'a'

if __name__ == '__main__':
    from timeit import Timer
    t = Timer("test()", "from __main__ import test")
    print t.timeit()
13.5236170292
import UserString

def test():
    s = UserString.MutableString('Python')
    s = 'abcd'
    for i in range(3):
        s = 'a' + s[1:]

if __name__ == '__main__':
    from timeit import Timer
    t = Timer("test()", "from __main__ import test")
    print t.timeit()
6.24725079536
import UserString

def test():
    s = UserString.MutableString('Python')
    for i in range(3):
        s = 'a' + s[1:]

if __name__ == '__main__':
    from timeit import Timer
    t = Timer("test()", "from __main__ import test")
    print t.timeit()
38.6385951042
I think it is obvious why I put s = UserString.MutableString('Python') in the second test.
In a hypothetical language that offers both mutable and immutable, otherwise equivalent, string types, there's no real reason for any performance difference. I can't really think of such a language offhand -- e.g., Python and Java both have immutable strings only, plus other ways to make one through mutation, which add indirection and therefore can of course slow things down a bit;-). For example, in C++, interchangeably using a std::string or a const std::string I would expect to cause no performance difference (admittedly a compiler might be able to optimize code using the latter better by counting on the immutability, but I don't know any real-world ones that do perform such theoretically possible optimizations;-).
Having immutable strings may and does in fact allow very substantial optimizations in Java and Python. For example, if the strings get hashed, the hash can be cached, and will never have to be recomputed (since the string can't change) -- that's especially important in Python, which uses hashed strings (for look-ups in sets and dictionaries) so lavishly and even "behind the scenes". Fresh copies never need to be made "just in case" the previous one has changed in the meantime -- references to a single copy can always be handed out systematically whenever that string is required. Python also copiously uses "interning" of (some) strings, potentially allowing constant-time comparisons and many other similarly fast operations -- think of it as one more way, a more advanced one to be sure, to take advantage of strings' immutability to cache more of the results of operations often performed on them.
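The hash caching is easy to observe in CPython (a rough timing sketch; absolute numbers will vary):
import time

s = 'x' * 10**7  # a large string, so the first hash is measurably slow

t0 = time.perf_counter(); hash(s); t1 = time.perf_counter()
t2 = time.perf_counter(); hash(s); t3 = time.perf_counter()

print('first: %.6fs' % (t1 - t0))   # walks all 10**7 characters
print('second: %.6fs' % (t3 - t2))  # near-zero: returns the cached value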
That's not to say that a given compiler is going to take advantage of all possible optimizations, of course. For example, when a slice of a string is requested, there is no real need to make a new object and copy the data over -- the new slice might refer to the old one with an offset (and an independently stored length), potentially a great optimization for big strings out of which many slices are taken. Python doesn't do that because, unless particular care is taken in memory management, this might easily result in the "big" string being all kept in memory when only a small slice of it is actually needed -- but it's a tradeoff that a different implementation might definitely choose to perform (with that burden of extra memory management, to be sure -- more complex, harder-to-debug compiler and runtime code for the hypothetical language in question).
I'm just scratching the surface here -- and many of these advantages would be hard to keep if otherwise interchangeable string types could exist in both mutable and immutable versions (which I suspect is why, to the best of my current knowledge at least, C++ compilers actually don't bother with such optimizations, despite being generally very performance-conscious). But by offering only immutable strings as the primitive, fundamental data type (and thus implicitly accepting some disadvantage when you'd really need a mutable one;-), languages such as Java and Python can clearly gain all sorts of advantages -- performance issues being only one group of them (Python's choice to allow only immutable primitive types to be hashable, for example, is not a performance-centered design decision -- it's more about clarity and predictability of behavior for sets and dictionaries!-).
I don't know if they are really a lot slower, but they often make thinking about a program easier, because the state of the object/string can't change. To me, that's the most important property of immutability.
Furthermore, you might assume that immutable strings are faster because they have less state (which can change), which might mean lower memory consumption and fewer CPU cycles.
I also found this interesting article while googling which I would like to quote:
knowing that a string is immutable makes it easy to lay it out at construction time — fixed and unchanging storage requirements
With an immutable string, Python can intern it and refer to it internally by its address in memory. This means that to compare two strings, it only has to compare their addresses in memory (unless one of them isn't interned). Also, keep in mind that not all strings are interned; I've seen examples of constructed strings that are not interned.
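A small demo of interning in CPython (an implementation behavior, not a language guarantee):
import sys

a = sys.intern(''.join(['hello', ' ', 'world']))  # built at runtime
b = sys.intern('hello world')
print(a is b)   # True: sys.intern returns the single canonical object
print(a == b)   # equality can short-circuit on identity for interned strings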
With mutable strings, string comparison would involve comparing them character by character, and would also require either storing identical strings in different locations (malloc is not free) or adding logic to keep track of how many times a given string is referred to and making a copy for every mutation if there were more than one referrer.
It seems like Python is optimized for string comparison. This makes sense because even string manipulation involves string comparison in most cases, so for most use cases it's the lowest common denominator.
Another advantage of immutable strings is that it makes it possible for them to be hashable, which is a requirement for using them as dictionary keys. Imagine a scenario where they were mutable:
s = 'a'
d = {s : 1}
s = s + 'b'
d[s] = ?
I suppose Python could keep track of which dicts have which strings as keys and update all of their hash tables when a string was modified, but that would just add more overhead to dict insertion. It's not too far off the mark to say that you can't do anything in Python without a dict insertion/lookup, so that would be very, very bad. It also adds overhead to string manipulation.
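In real Python the scenario above can't arise, because the mutable built-in containers simply don't implement __hash__:
d = {}
try:
    d[['a']] = 1        # a list stands in for the hypothetical mutable string
except TypeError as err:
    print(err)          # unhashable type: 'list'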
The obvious answer to your question is that normal strings are implemented in C, while MutableString is implemented in Python.
Not only does every operation on a mutable string have the overhead of going through one or more Python function calls, but the implementation is essentially a wrapper round an immutable string - when you modify the string it creates a new immutable string and throws the old one away. You can read the source in the UserString.py file in your Python lib directory.
To quote the Python docs:
Note: This UserString class from this module is available for backward compatibility only. If you are writing code that does not need to work with versions of Python earlier than Python 2.2, please consider subclassing directly from the built-in str type instead of using UserString (there is no built-in equivalent to MutableString).
This module defines a class that acts as a wrapper around string objects. It is a useful base class for your own string-like classes, which can inherit from them and override existing methods or add new ones. In this way one can add new behaviors to strings. It should be noted that these classes are highly inefficient compared to real string or Unicode objects; this is especially the case for MutableString.
(Emphasis added).
The question arose when answering another SO question (there).
When I iterate several times over a python set (without changing it between calls), can I assume it will always return elements in the same order? And if not, what is the rationale of changing the order ? Is it deterministic, or random? Or implementation defined?
And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?
The underlying question is if python set iteration order only depends on the algorithm used to implement sets, or also on the execution context?
There's no formal guarantee about the stability of sets. However, in the CPython implementation, as long as nothing changes the set, the items will be produced in the same order. Sets are implemented as open-addressing hash tables (with a perturbed probe sequence), so inserting or removing items can completely change the order (in particular, when that triggers a resize, which reorganizes how the items are laid out in memory). You can also have two identical sets that nonetheless produce the items in different order, for example:
>>> s1 = {-1, -2}
>>> s2 = {-2, -1}
>>> s1 == s2
True
>>> list(s1), list(s2)
([-1, -2], [-2, -1])
Unless you're very certain you have the same set and nothing touched it in between the two iterations, it's best not to rely on it staying the same. Making seemingly irrelevant changes to, say, functions you call in between could produce very hard to find bugs.
A set or frozenset is inherently an unordered collection. Internally, sets are based on a hash table, and the order of keys depends both on the insertion order and on the hash algorithm. In CPython (aka standard Python), integers smaller than the machine word size (32 bit or 64 bit) hash to themselves, but text strings, bytes strings, and datetime objects hash to integers that vary randomly; you can control that by setting the PYTHONHASHSEED environment variable.
From the __hash__ docs:
Note: By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.
This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.
Changing hash values affects the iteration order of dicts, sets and other mappings. Python has never made guarantees about this ordering (and it typically varies between 32-bit and 64-bit builds).
See also PYTHONHASHSEED.
The results of hashing objects of other classes depend on the details of the class's __hash__ method.
The upshot of all this is that you can have two sets containing identical strings but when you convert them to lists they can compare unequal. Or they may not. ;) Here's some code that demonstrates this. On some runs, it will just loop, not printing anything, but on other runs it will quickly find a set that uses a different order to the original.
from random import seed, shuffle
seed(42)
data = list('abcdefgh')
a = frozenset(data)
la = list(a)
print(''.join(la), a)
while True:
    shuffle(data)
    lb = list(frozenset(data))
    if lb != la:
        print(''.join(data), ''.join(lb))
        break
typical output
dachbgef frozenset({'d', 'a', 'c', 'h', 'b', 'g', 'e', 'f'})
deghcfab dahcbgef
And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?
I can answer this part of the question now after a quick experiment. Using the following code:
class Foo(object):
    def __init__(self, val):
        self.val = val
    def __repr__(self):
        return str(self.val)

x = set()
for y in range(500):
    x.add(Foo(y))
print list(x)[-10:]
I can trigger the behaviour that I was asking about in the other question. If I run this repeatedly then the output changes, but not on every run. It seems to be "weakly random" in that it changes slowly. This is certainly implementation dependent so I should say that I'm running the MacPorts Python 2.6 on Snow Leopard. While the program will output the same answer for long runs of time, doing something that affects the system entropy pool (writing to the disk mostly works) will sometimes kick it into a different output.
The class Foo is just a simple int wrapper, as experiments show that this doesn't happen with sets of ints. I think the problem is caused by the lack of __eq__ and __hash__ members on the object (the default __hash__ is derived from the object's memory address), although I would dearly love to know the underlying explanation / ways to avoid it. Also useful would be some way to reproduce / repeat a "bad" run. Does anyone know what seed it uses, or how I could set that seed?
It’s definitely implementation defined. The specification of a set says only that
Being an unordered collection, sets do not record element position or order of insertion.
Why not use OrderedDict to create your own OrderedSet class?
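A minimal sketch of that idea, using a plain dict (insertion-ordered in CPython 3.6+, guaranteed in 3.7+); an OrderedDict works identically on older versions:
class OrderedSet:
    def __init__(self, iterable=()):
        self._items = dict.fromkeys(iterable)

    def add(self, item):
        self._items[item] = None

    def discard(self, item):
        self._items.pop(item, None)

    def __contains__(self, item):
        return item in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

print(list(OrderedSet('abracadabra')))  # ['a', 'b', 'r', 'c', 'd']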
The answer is simply NO.
Python's set iteration order is NOT stable.
I did a simple experiment to show this.
The code:
import random

random.seed(1)
x = []

class aaa(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b

for i in range(5):
    x.append(aaa(random.choice('asf'), random.randint(1, 4000)))

for j in x:
    print(j.a, j.b)

print('====')

for j in set(x):
    print(j.a, j.b)
Run this twice and you will get this:
First time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
a 2030
a 2332
f 1555
a 1045
s 1935
Process finished with exit code 0
Second time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
s 1935
a 2332
a 1045
f 1555
a 2030
Process finished with exit code 0
The reason is explained in comments in this answer.
However, there are some ways to make it stable:
Set PYTHONHASHSEED to 0, as sketched below; see details here, here and here.
Use OrderedDict instead.
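A sketch of the first option: PYTHONHASHSEED must be set before the interpreter starts, so it is demonstrated here with child processes (assumes Python 3.7+ for subprocess.run's capture_output):
import os
import subprocess
import sys

cmd = [sys.executable, '-c', 'print(hash("test"))']
env = dict(os.environ, PYTHONHASHSEED='0')

# Three child interpreters with the same fixed seed all agree.
runs = {subprocess.run(cmd, env=env, capture_output=True, text=True).stdout
        for _ in range(3)}
print(len(runs) == 1)  # True: with a fixed seed, str hashes are reproducible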
As pointed out, this is strictly an implementation detail.
But as long as you don’t change the structure between calls, there should be no reason for a read-only operation (= iteration) to change with time: no sane implementation does that. Even randomized (= non-deterministic) data structures that can be used to implement sets (e.g. skip lists) don’t change the reading order when no changes occur.
So, being rational, you can safely rely on this behaviour.
(I’m aware that certain GCs may reorder memory in a background thread but even this reordering will not be noticeable on the level of data structures, unless a bug occurs.)
The definition of a set is unordered, unique elements ("Unordered collections of unique elements"). You should care only about the interface, not the implementation. If you want an ordered enumeration, you should probably put it into a list and sort it.
There are many different implementations of Python. Don't rely on undocumented behaviour, as your code could break on different Python implementations.