Python hash function equivalent - python

I'm trying to find the Go equivalent to Python's hash function:
hash("test")
I've found this post, which describes a similar function in the sense that it returns an integer; however, it uses FNV, which appears to be a different hashing method from the Python version.
What I'm trying to do is pass a string to the hash function so that it returns exactly the same integer in both languages for the same string.

By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.
You will get different numbers between different invocations of the Python script. So I don't think what you want is even possible.
Source: https://docs.python.org/3.5/reference/datamodel.html#object.__hash__
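Given the salting described above, CPython's built-in hash() cannot serve as a cross-language hash. A minimal sketch of a deterministic alternative, assuming you are free to choose the algorithm on both sides, is to derive an integer from a standard digest such as SHA-256 (the function name stable_hash and the 64-bit truncation are choices made here for illustration, not part of the original question):

```python
import hashlib

def stable_hash(s: str) -> int:
    """Deterministic 64-bit hash of a string: identical across runs,
    machines, and any language implementing the same scheme."""
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as a big-endian
    # unsigned 64-bit integer.
    return int.from_bytes(digest[:8], "big")

print(stable_hash("test"))
```

The same value can be reproduced in Go with crypto/sha256 and binary.BigEndian.Uint64 over the first 8 digest bytes.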

Related

What is a "SupportsIndex"? [duplicate]

The Data Model section of the Python 3.2 documentation provides the following descriptions for the __int__ and __index__ methods:
object.__int__(self)
Called to implement the built-in [function int()]. Should return [an integer].
object.__index__(self)
Called to implement operator.index(). Also called whenever Python needs an integer object (such as in slicing, or in the built-in bin(), hex() and oct() functions). Must return an integer.
I understand that they're used for different purposes, but I've been unable to figure out why two different methods are necessary. What is the difference between these methods? Is it safe to just alias __index__ = __int__ in my classes?
See PEP 357: Allowing Any Object to be Used for Slicing.
The nb_int method is used for coercion and so means something fundamentally different than what is requested here. This PEP proposes a method for something that can already be thought of as an integer communicate that information to Python when it needs an integer. The biggest example of why using nb_int would be a bad thing is that float objects already define the nb_int method, but float objects should not be used as indexes in a sequence.
Edit: It seems that it was implemented in Python 2.5.
I believe you'll find the answer in PEP 357, which has this abstract:
This PEP proposes adding an nb_index slot in PyNumberMethods and an __index__ special method so that arbitrary objects can be used whenever integers are explicitly needed in Python, such as in slice syntax (from which the slot gets its name).
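The distinction can be seen with a small class: an object that defines __index__ is accepted anywhere Python needs a true integer, such as slicing or bin(). This is a sketch; the class name MyIndex is made up for illustration:

```python
class MyIndex:
    """Wraps a value that can legitimately act as an integer index."""
    def __init__(self, value: int):
        self.value = value

    def __index__(self) -> int:
        return self.value

data = ["a", "b", "c", "d", "e"]
print(data[MyIndex(1)])             # → 'b' (single-item indexing)
print(data[MyIndex(1):MyIndex(4)])  # → ['b', 'c', 'd'] (slicing)
print(bin(MyIndex(5)))              # → '0b101' (bin()/hex()/oct() also use __index__)
```

A float, by contrast, defines __int__ but not __index__, which is why data[1.5] raises a TypeError even though int(1.5) works.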

Maximum/minimum value returned by Python's hash() function

Context: building a consistent hashing algorithm.
The official documentation for Python's hash() function states:
Return the hash value of the object (if it has one). Hash values are integers.
However, it does not explicitly state whether the function maps to an integer range (with a minimum and a maximum) or not.
Coming from other languages where values for primitive types are bounded (e.g. C#'s/Java's Int.MaxValue), I know that Python likes to think in "unbounded" terms – i.e. switching from int to long in the background.
Am I to assume that the hash() function is also unbounded? Or is it bounded, for example mapping to the max/min values Python assigns to the "int-proper" – i.e. between -2147483648 and 2147483647?
As others pointed out, there is a misplaced[1] note in the documentation that reads:
hash() truncates the value returned from an object’s custom __hash__() method to the size of a Py_ssize_t.
To answer the question, we need to know the size of this Py_ssize_t. After some research, it seems that its maximum value is stored in sys.maxsize, although I'd appreciate some feedback here.
The solution that I adopted eventually was then:
import sys
bits = sys.hash_info.width # in my case, 64
print(sys.maxsize) # in my case, 9223372036854775807
# Therefore, using integer arithmetic to avoid float rounding:
hash_maxValue = 2**(bits - 1) - 1 # 9223372036854775807, or +sys.maxsize
hash_minValue = -hash_maxValue # -9223372036854775807, or -sys.maxsize
Happy to receive comments/feedback on this – until proven wrong, this is the accepted answer.
[1] The note is included in the section dedicated to __hash__() instead of the one dedicated to hash().
From the documentation:
hash() truncates the value returned from an object’s custom __hash__() method to the size of a Py_ssize_t. This is typically 8 bytes on 64-bit builds and 4 bytes on 32-bit builds. If an object’s __hash__() must interoperate on builds of different bit sizes, be sure to check the width on all supported builds. An easy way to do this is with python -c "import sys; print(sys.hash_info.width)".
More details can be found here: https://docs.python.org/3/reference/datamodel.html#object.__hash__
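The truncation described above can be observed directly. In the sketch below (the class name Wide is made up), __hash__() returns a value far larger than a Py_ssize_t, and the built-in hash() folds it back into range:

```python
import sys

class Wide:
    def __hash__(self):
        # Deliberately return a value far wider than Py_ssize_t.
        return 2 ** 100 + 7

bits = sys.hash_info.width  # 64 on typical 64-bit builds
h = hash(Wide())
# The built-in hash() truncates the custom result to Py_ssize_t.
print(h, -(2 ** (bits - 1)) <= h < 2 ** (bits - 1))
```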

Is python's hash() persistent?

Is the hash() function in python guaranteed to always be the same for a given input, regardless of when/where it's entered? So far -- from trial-and-error only -- the answer seems to be yes, but it would be nice to understand the internals of how this works. For example, in a test:
$ python
>>> from ingest.tpr import *
>>> d=DailyPriceObj(date="2014-01-01")
>>> hash(d)
5440882306090652359
>>> ^D
$ python
>>> from ingest.tpr import *
>>> d=DailyPriceObj(date="2014-01-01")
>>> hash(d)
5440882306090652359
The contract for the __hash__ method requires that it be consistent within a given run of Python. There is no guarantee that it be consistent across different runs of Python, and in fact, for the built-in str, bytes-like types, and datetime.datetime objects (possibly others), the hash is salted with a per-run value so that it's almost never the same for the same input in different runs of Python.
No, it's dependent on the process. If you need a persistent hash, see Persistent Hashing of Strings in Python.
Truncation depending on platform, from the documentation of __hash__:
hash() truncates the value returned from an object’s custom __hash__() method to the size of a Py_ssize_t. This is typically 8 bytes on 64-bit builds and 4 bytes on 32-bit builds.
Salted hashes, from the same documentation (ShadowRanger's answer):
By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.
This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.
A necessary condition for hashability is that equivalent objects always hash to the same value (within one run of the interpreter).
Of course, nothing prevents you from ignoring this requirement. But if you later want to store your objects in a dictionary or a set, problems may arise.
When you implement your own class you can define the __eq__ and __hash__ methods yourself. For strings I have used a polynomial hash function, as well as a hash function drawn from a universal family of hash functions.
In general, hash values for a specific object should not be relied on to stay the same from one interpreter run to the next, although for many data types they in fact do. One reason for the randomized implementation is that it makes it harder to construct inputs that deliberately collide (an anti-hash test).
For example:
For numeric types, the hash of a number x is based on the reduction
of x modulo the prime P = 2**_PyHASH_BITS - 1. It's designed so that
hash(x) == hash(y) whenever x and y are numerically equal, even if
x and y have different types.
a = 123456789
b = 3 # any integer multiplier works
hash(a) == 123456789 # True
hash(a + b * (2 ** 61 - 1)) == 123456789 # True on 64-bit builds, where _PyHASH_BITS == 61
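The earlier point about defining __eq__ and __hash__ together can be sketched with a small class (Point is illustrative, not from the question): objects that compare equal must hash equal, or dict/set lookups break.

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Delegate to the tuple hash so that equal points hash equally.
        return hash((self.x, self.y))

s = {Point(1, 2)}
print(Point(1, 2) in s)  # True: a distinct but equal object is found
```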

What does random.Random (not random.random) from the random module actually do in Python?

I would like to get a lucid explanation on what the random.Random function/class actually does. This is what Python's random module has to say about it.
Random number generator base class used by bound module functions.
Used to instantiate instances of Random to get generators that don't share state. Especially useful for multi-threaded programs, creating a different instance of Random for each thread, and using the jumpahead() method to ensure that the generated sequences seen by each thread don't overlap.
Class Random can also be subclassed if you want to use a different basic generator of your own devising: in that case, override the following methods: random(), seed(), getstate(), setstate() and jumpahead(). Optionally, implement a getrandbits() method so that randrange() can cover arbitrarily large ranges.
I do not understand this because I am still very much a beginner at Python. I do know a bit about base and derived classes, and this clearly seems to have something to do with that.
I tried to play around with the random.Random() function/class in Python's IDLE and found out the following.
It only seems to accept one argument (string, int, float).
It doesn't seem to accept lists or dictionaries as an argument; it states they are unhashable. (What does 'unhashable' mean?)
On repeated invocation it only seems to return two values alternately, regardless of the argument passed to it, the two values being 'random.Random object at 0x03F24E40' and 'random.Random object at 0x03F26B60'.
I hope I can get a simple explanation of what random.Random does and also an explanation as to why it only returns two values. (I am a beginner so forgive my ignorance on the subject!)
Any explanation on how functions like seed(), getstate(), setstate() and jumpahead() work or references to any documents/books that explain so are welcome.
In simple terms, random.Random() creates a pseudorandom number generator: an object that generates a sequence of numbers that appear random (are pseudorandom).
random.Random() accepts one object that can be a str, an int, or a float (a number such as 3.2 or 888.0). This object is called a seed, and it determines the specific sequence of pseudorandom numbers the generator will produce. For example, you can call—
random.Random(57),
random.Random(888.6),
random.Random("Hello World"), or
random.Random(99898989),
to get a generator of a specific sequence of pseudorandom numbers. However, you should specify a seed only if you need repeatable "randomness" in your program.
You can then use this generator to extract pseudorandom numbers from that sequence:
# Create a generator without
# a seed, so that the pseudorandom
# sequence will almost surely differ
# from run to run
randomGen = random.Random()
# Generate a number in [0, 1)
number = randomGen.random()
# Generate an integer in [0, 5]
number = randomGen.randint(0, 5)
Note that the example assigns the generator from random.Random to a variable named randomGen; creating a generator without keeping a reference to it is generally not useful on its own.
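The role of the seed, and why the question saw alternating 'random.Random object at 0x...' values (those are just object reprs printed by the REPL, not random numbers), can be illustrated with two generators sharing a seed (a sketch):

```python
import random

# Two independent generators with the same seed produce identical
# sequences: this is repeatable "randomness".
gen1 = random.Random(57)
gen2 = random.Random(57)
seq1 = [gen1.randint(0, 100) for _ in range(5)]
seq2 = [gen2.randint(0, 100) for _ in range(5)]
print(seq1 == seq2)  # True

# An unseeded generator's sequence almost surely differs from run to run.
gen3 = random.Random()
print(gen3.random())  # some float in [0, 1)
```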

Hashing a Python function

def funcA(x):
return x
Is funcA.__code__.__hash__() a suitable way to check whether funcA has changed?
I know that funcA.__hash__() won't work, as it is the same as id(funcA) / 16. I checked, and this isn't true for __code__.__hash__(). I also tested the behaviour in an IPython terminal and it seemed to hold. But is this guaranteed to work?
Why
I would like to have a way of comparing an old version of function to a new version of the same function.
I'm trying to create a decorator for disk-based/long-term caching. Thus I need a way to identify if a function has changed. I also need to look at the call graph to check that none of the called functions have changed but that is not part of this question.
Requirements:
Needs to be stable over multiple calls and machines. [1] says that in Python 3.3 hash() is randomized on each start of a new instance, although it also says that "HASH RANDOMIZATION IS DISABLED BY DEFAULT". Ideally, I'd like a function that is stable even with randomization enabled.
Ideally, it would yield the same hash for def funcA(): pass and def funcB(): pass, i.e. when only the name of the function changes. Probably not necessary.
I only care about Python 3.
One alternative would be to hash the text inside the file that contains the given function.
Yes, it seems that func_a.__code__.__hash__() is unique to the specific functionality of the code. I could not find where __code__.__hash__() is implemented or defined.
A better way would be to use func_a.__code__.co_code.__hash__(), because co_code holds the bytecode as a bytes string. Note that in this case the function name is not part of the hash, so two functions with the same code but names func_a and func_b will have the same hash.
hash(func_a.__code__.co_code)
Source.
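Note that hash() of a bytes object such as co_code is itself salted per interpreter run, so it does not meet the "stable over multiple calls and machines" requirement. A run-stable alternative (a sketch; the name code_digest is made up here) is to take a cryptographic digest of the bytecode instead:

```python
import hashlib

def code_digest(func) -> str:
    """Hex digest of a function's bytecode: stable across runs and
    machines on the same Python version, and independent of the
    function's name."""
    return hashlib.sha256(func.__code__.co_code).hexdigest()

def func_a(x):
    return x

def func_b(x):
    return x

print(code_digest(func_a) == code_digest(func_b))  # True: same bytecode
```

Caveats: bytecode differs between Python versions, and co_code excludes constants and default arguments (co_consts, __defaults__), so this is only a sketch of the idea, not a complete change detector.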
