one-to-one hashes in python - python

If I have two strings that are identical in value, is it guaranteed that hash(s1) == hash(s2) without using hashlib? Also, what is the upper bound on the number of digits in the hash?
Is there an alternative to hash that is invertable? I understand hash functions are not meant to be used like this. But a 1-1 mapping from strings to short hexadecimal strings that can be inverted and is guaranteed to be different for each string?
Will this work:
import zlib
# compress
zlib.compress("foo")
zlib.decompress(zlib.compress("foo")) == "foo" # always true?
Thanks.

YES.
>>>help(hash)
Help on built-in function hash in module builtins:
hash(...)
hash(object) -> integer
Return a hash value for the object. Two objects with the same value have
the same hash value. The reverse is not necessarily true, but likely.

Related

Assign a unique value to a string s based on its lexicographic order

I want to come up with a function that assigns unique values to a string based on it's lexicographic order. For instance if my function is labelled as get_key(s), the function should take as input a string s and return a unique integer which will allow me to compare two strings based on those unique integers that I get , in O(1) time.
Some code for clarity:
get_key('aaa')
#Returns some integer
get_key('b')
#Returns another integer > output of get_key('aaa') since 'b' > 'aaa'
Any help would be highly appreciated.
Note: Cannot use python built in function id()
It's impossible.
Why? No matter what number you return for a string, I can always find a new string that's in between those two.
You would need an unlimited number of values, because there's an infinite amount of strings.
If I understand your problem clearly, one idea I come to is to convert the input to hex then from hex to int, this I believe would solve the problem, however, I guess it is impossible to solve it in O(1). The solution I provided (and every possible solution in my mind) needs O(n) since you don't have any specification on the input length and the function will operate depending on the length of the input.

Constructing an object that's greater than any string

In Python 3, I have a list of strings and would find it useful to be able to append a sentinel that would compare greater than all elements in the list.
Is there a straightforward way to construct such an object?
I can define a class, possibly subclassing str, but it feels like there ought to be a simpler way.
For this to be useful in simplifying my algorithm, I need to do this ahead of time, before I know what the strings contained in the list are going to be (and so it can't be a function of those strings).
This is kind of a naïve answer, but when you're dealing with numbers and need a sentinel value for comparison purposes, it's not uncommon to use the largest (or smallest) number that a specific number type can hold.
Python strings are compared lexicographically, so to create a "max string", you'd simply need to create a long string of the "max char":
# 1114111 is the highest value that chr seems to accept
MAX_CHAR = chr(1114111)
# One million is entirely arbitary here.
# It should ideally be 1 + the length of the longest possible string that you'll compare against
MAX_STRING = MAX_CHAR * int(1e6)
Unless there's weird corner cases that I'm not aware of, MAX_STRING should now be considered greater than any other string (other than itself); providing that it's long enough.

Hex digest for Python's built-in hash function

I need to create an identifier token from a set of nested configuration values.
The token can be part of a URL, so – to make processing easier – it should contain only hexadecimal digits (or something similar).
The config values are nested tuples with elements of hashable types like int, bool, str etc.
My idea was to use the built-in hash() function, as this will continue to work even if the config structure changes.
This is my first attempt:
def token(config):
h = hash(config)
return '{:X}'.format(h)
This will produce tokens of variable length, but that doesn't matter.
What bothers me, though, is that the token might contain a leading minus sign, since the return value of hash() is a signed integer.
As a way to avoid the sign, I thought of the following work-around, which is adding a constant to the hash value.
This constant should be half the size of the range the value of hash() can take (which is platform-dependent, eg. different for 32-/64-bit systems):
HALF_HASH_RANGE = 2**(sys.hash_info.width-1)
Is this a sane and portable solution?
Or will I shoot myself in the foot with this?
I also saw suggestions for using struct.pack() (which returns a bytes object, on which one can call the .hex() method), but it also requires knowing the range of the hash value in advance (for the choice of the right format character).
Addendum:
Encryption strength or collisions by chance are not an issue.
The drawback of the hashlib library in this scenario is that it requires writing a converter that traverses the input structure and converts everything into a bytes representation, which is cumbersome.
You can use any of hash functions for getting unique string. Right now python support out of the box many algorithms, like: md5, sha1, sha224, sha256, sha384, sha512. You can read more about it here - https://docs.python.org/2/library/hashlib.html
This example shows how to use library hashlib. (Python 3)
>>> import hashlib
>>> sha = hashlib.sha256()
>>> sha.update('somestring'.encode())
>>> sha.hexdigest()
>>> '63f6fe797026d794e0dc3e2bd279aee19dd2f8db67488172a644bb68792a570c'
Also you can try library hashids. But note that it's not a hash algorithm and you (and anyone who knows salt) can decrypt data.
$ pip install hashids
Basic usage:
>>> from hashids import Hashids
>>> hashids = Hashids()
>>> hashids.encode(123)
'Mj3'
>>> hashids.decode('Mj3')
123
I need to create an identifier token from a set of nested configuration values
I came across this question while trying to solve the same problem, and realizing that some of the calls to hash return negative integers.
Here's how I would implement your token function:
import sys
def token(config) -> str:
"""Generates a hex token that identifies a hashable config."""
# `sign_mask` is used to make `hash` return unsigned values
sign_mask = (1 << sys.hash_info.width) - 1
# Get the hash as a positive hex value with consistent padding without '0x'
return f'{hash(config) & sign_mask:#0{sys.hash_info.width//4}x}'[2:]
In my case I needed it to work with a broad range of inputs for the config. It did not need to be particularly performant (it was not on a hot path), and it was acceptable if it occasionally had collisions (more than what would normally be expected from hash). All it really needed to do is produce short (e.g. 16 chars long) consistent outputs for consistent inputs. So for my case I used the above function with a small modification to ensure the provided config is hashable, at the cost of increased collision risk and processing time:
import sys
def token(config) -> str:
"""Generates a hex token that identifies a config."""
# `sign_mask` is used to make `hash` return unsigned values
sign_mask = (1 << sys.hash_info.width) - 1
# Use `json.dumps` with `repr` to ensure the config is hashable
json_config = json.dumps(config, default=repr)
# Get the hash as a positive hex value with consistent padding without '0x'
return f'{hash(json_config) & sign_mask:#0{sys.hash_info.width//4}x}'[2:]
I'd reccomend using hashlib
cast the token to a string, and then cast the hexdigest to a hex integer. Bellow is an example with the sha256 algorithm but you can use any hashing algorithm hashlib supports
import hashlib as hl
def shasum(token):
return int(hl.sha256(str(token).encode('utf-8')).hexdigest(), 16)

Why does Python's hash() generate a collision between these two close numbers? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
When is a python object's hash computed and why is the hash of -1 different?
Why do -1 and -2 both hash to the same number if Python?
Since they do, how does Python tell these two numbers apart?
>>> -1 is -2
False
>>> hash(-1) is hash(-2)
True
>>> hash(-1)
-2
>>> hash(-2)
-2
-1 is a reserved value at the C level of CPython which prevents hash functions from being able to produce a hash value of -1. As noted by DSM, the same is not true in IronPython and PyPy where hash(-1) != hash(-2).
See this Quora answer:
If you write a type in a C extension module and provide a tp_hash
method, you have to avoid -1 — if you return -1, Python will assume
you meant to throw an error.
If you write a class in pure Python and provide a __hash__ method,
there's no such requirement, thankfully. But that's because the C code
that invokes your __hash__ method does that for you — if your
__hash__ returns -1, then hash() applied to your object will actually return -2.
Which really just repackages the information from effbot:
The hash value -1 is reserved (it’s used to flag errors in the C
implementation). If the hash algorithm generates this value, we simply
use -2 instead.
You can also see this in the source. For example for Python 3’s int object, this is at the end of the hash implementation:
if (x == (Py_uhash_t)-1)
x = (Py_uhash_t)-2;
return (Py_hash_t)x;
Since they do, how does Python tell these two numbers apart?
Since all hash functions map a large input space to a smaller input space, collisions are always expected, no matter how good the hash function is. Think of hashing strings, for example. If hash codes are 32-bit integers, you have 2^32 (a little more than 4 billion) hash codes. If you consider all ASCII strings of length 6, you have (2^7)^6 (just under 4.4 trillion) different items in your input space. With only this set, you are guaranteed to have many, many collisions no matter how good you are. Add Unicode characters and strings of unlimited length to that!
Therefore, the hash code only hints at the location of an object, an equality test follows to test candidate keys. To implement a membership test in a hash-table set, the hash code gives you "bucket" number in which to search for the value. However, all set items with the same hash code are in the bucket. For this, you also need an equality test to distinguish between all candidates in the bucket.
This hash code and equality duality is hinted at in the CPython documentation on hashable objects. In other languages/frameworks, there is a guideline/rule that if you provide a custom hash code function, you must also provide a custom equality test (performed on the same fields as the hash code function).
Indeed, the Python release today address exactly this, with a security patch that addresses the efficiency issue when this (identical hash values, but on a massive scale) is used as a denial of service attack - http://mail.python.org/pipermail/python-list/2012-April/1290792.html

Exploits in Python - manipulating hex strings

I'm quite new to python and trying to port a simple exploit I've written for a stack overflow (just a nop sled, shell code and return address). This isn't for nefarious purposes but rather for a security lecture at a university.
Given a hex string (deadbeef), what are the best ways to:
represent it as a series of bytes
add or subtract a value
reverse the order (for x86 memory layout, i.e. efbeadde)
Any tips and tricks regarding common tasks in exploit writing in python are also greatly appreciated.
In Python 2.6 and above, you can use the built-in bytearray class.
To create your bytearray object:
b = bytearray.fromhex('deadbeef')
To alter a byte, you can reference it using array notation:
b[2] += 7
To reverse the bytearray in place, use b.reverse(). To create an iterator that iterates over it in reverse order, you can use the reversed function: reversed(b).
You may also be interested in the new bytes class in Python 3, which is like bytearray but immutable.
Not sure if this is the best way...
hex_str = "deadbeef"
bytes = "".join(chr(int(hex_str[i:i+2],16)) for i in xrange(0,len(hex_str),2))
rev_bytes = bytes[::-1]
Or might be simpler:
bytes = "\xde\xad\xbe\xef"
rev_bytes = bytes[::-1]
In Python 2.x, regular str values are binary-safe. You can use the binascii module's b2a_hex and a2b_hex functions to convert to and from hexadecimal.
You can use ordinary string methods to reverse or otherwise rearrange your bytes. However, doing any kind of arithmetic would require you to use the ord function to get numeric values for individual bytes, then chr to convert the result back, followed by concatenation to reassemble the modified string.
For mutable sequences with easier arithmetic, use the array module with type code 'B'. These can be initialized from the results of a2b_hex if you're starting from hexadecimal.

Categories

Resources