I am attempting to implement a CARP hash in Python as described in the following IETF draft:
https://datatracker.ietf.org/doc/html/draft-vinod-carp-v1-03#section-3.1
Specifically:
3.1. Hash Function
The hash function outputs a 32 bit unsigned integer based on a
zero-terminated ASCII input string. The machine name and domain
names of the URL, the protocol, and the machine names of each member
proxy should be evaluated in lower case since that portion of the
URL is case insensitive.
Because irreversibility and strong cryptographic features are
unnecessary for this application, a very simple and fast hash
function based on the bitwise left rotate operator is used.
For (each char in URL):
    URL_Hash += _rotl(URL_Hash, 19) + char ;
Member proxy hashes are computed in a similar manner:
For (each char in MemberProxyName):
    MemberProxy_Hash += _rotl(MemberProxy_Hash, 19) + char ;
Because member names are often similar to each other, their hash
values are further spread across hash space via the following
additional operations:
MemberProxy_Hash += MemberProxy_Hash * 0x62531965 ;
MemberProxy_Hash = _rotl (MemberProxy_Hash, 21) ;
3.2. Hash Combination
Hashes are combined by first exclusive or-ing (XOR) the URL hash by
the machine name and then multiplying by a constant and performing
a bitwise rotation.
All final and intermediate values are 32 bit unsigned integers.
Combined_Hash = (URL_hash ^ MemberProxy_Hash) ;
Combined_Hash += Combined_Hash * 0x62531965 ;
Combined_Hash = _rotl(Combined_Hash, 21) ;
I've tried to use numpy to create 32-bit unsigned integers. The first problem arises when the left bit shift is implemented: NumPy automatically promotes the result to a 64-bit integer, and the same happens for any arithmetic that would overflow 32 bits.
For example:
from numpy import uint32

def key_hash(data):
    # hash should be a 32-bit unsigned integer
    hashed = uint32()
    for char in data:
        hashed += hashed << 19 + ord(char)
    return hashed
x = key_hash("testkey")
print type(x)
Returns:
<type 'numpy.int64'>
Any tips on how to constrain all of this to 32-bit space? Also, I am a bit confused by the spec: how can an operation like "MemberProxy_Hash += MemberProxy_Hash * 0x62531965" ever fit in 32 bits while the hash is being calculated?
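To make the wrap-around concrete: the spec's arithmetic is implicitly modulo 2**32, as in C, so intermediate results are simply allowed to overflow and only the low 32 bits survive. In Python the same effect can be had by masking after each operation, e.g.:

print (0xFFFFFFFF + 1) & 0xFFFFFFFF           # 0: the addition wrapped around
print (0xDEADBEEF * 0x62531965) & 0xFFFFFFFF  # low 32 bits of the product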
EDIT:
Based upon feedback, it sounds like the right solution would be:
def key_hash(data):
    # hash should be a 32-bit unsigned integer
    hashed = 0
    for char in data:
        hashed += ((hashed << 19) + (hashed >> 13) + ord(char)) & 0xFFFFFFFF
    return hashed

def server_hash(data):
    # hash should be a 32-bit unsigned integer
    hashed = 0
    for char in data:
        hashed += ((hashed << 19) + (hashed >> 13) + ord(char)) & 0xFFFFFFFF
    hashed += (hashed * 0x62531965) & 0xFFFFFFFF
    hashed = ((hashed << 21) + (hashed >> 11)) & 0xFFFFFFFF
    return hashed

def hash_combination(key_hash, server_hash):
    # hash should be a 32-bit unsigned integer
    combined_hash = (key_hash ^ server_hash) & 0xFFFFFFFF
    combined_hash += (combined_hash * 0x62531965) & 0xFFFFFFFF
    return combined_hash
EDIT #2:
Another fixed version.
def rotate_left(x, n, maxbit=32):
    # assumes 32 bit
    x = x & (2 ** maxbit - 1)
    return ((x << n) | (x >> (maxbit - n)))

def key_hash(data):
    # hash should be a 32-bit unsigned integer
    hashed = 0
    for char in data:
        hashed = (hashed + rotate_left(hashed, 19) + ord(char))
    return hashed

def server_hash(data):
    # hash should be a 32-bit unsigned integer
    hashed = 0
    for char in data:
        hashed = (hashed + rotate_left(hashed, 19) + ord(char))
    hashed = hashed + hashed * 0x62531965
    hashed = rotate_left(hashed, 21)
    return hashed

def hash_combination(key_hash, server_hash):
    # hash should be a 32-bit unsigned integer
    combined_hash = key_hash ^ server_hash
    combined_hash = combined_hash + combined_hash * 0x62531965
    return combined_hash & 0xFFFFFFFF
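One detail worth noting against the quoted spec: section 3.2 ends with Combined_Hash = _rotl(Combined_Hash, 21), which hash_combination above leaves out. A sketch of the combination step that follows the quoted text literally, reusing rotate_left from above (the masks are needed because rotate_left does not mask its result):

def hash_combination(key_hash, server_hash):
    # section 3.2: XOR, multiply-and-add, then rotate left by 21 bits
    combined_hash = key_hash ^ server_hash
    combined_hash = (combined_hash + combined_hash * 0x62531965) & 0xFFFFFFFF
    return rotate_left(combined_hash, 21) & 0xFFFFFFFF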
Don't bother with numpy uint32. Just use standard Python int. Constrain the result of operations as necessary by doing result &= 0xFFFFFFFF to remove unwanted high-order bits.
def key_hash(data):
    # hash should be a 32-bit unsigned integer
    hashed = 0
    for char in data:
        # hashed += ((hashed << 19) + ord(char)) & 0xFFFFFFFF
        # the above is wrong; it's not masking the final addition.
        hashed = (hashed + (hashed << 19) + ord(char)) & 0xFFFFFFFF
    return hashed
You could do just one final masking, but that would be rather slow on long input, as the intermediate hashed would become a rather large number.
By the way, the above would not be a very good hash function. The rot in rotl means rotate, not shift.
You need
    # hashed += ((hashed << 19) + (hashed >> 13) + ord(char)) & 0xFFFFFFFF
    # the above is wrong; it's not masking the final addition.
    hashed = (hashed + (hashed << 19) + (hashed >> 13) + ord(char)) & 0xFFFFFFFF
Edit ... a comparison; this code:
def rotate_left(x, n, maxbit=32):
    # assumes 32 bit
    x = x & (2 ** maxbit - 1)
    return ((x << n) | (x >> (maxbit - n)))

def key_hash(data):
    # hash should be a 32-bit unsigned integer
    hashed = 0
    for char in data:
        hashed = (hashed + rotate_left(hashed, 19) + ord(char))
    return hashed

def khash(data):
    h = 0
    for c in data:
        assert 0 <= h <= 0xFFFFFFFF
        h = (h + (h << 19) + (h >> 13) + ord(c)) & 0xFFFFFFFF
        assert 0 <= h <= 0xFFFFFFFF
    return h

guff = "twas brillig and the slithy toves did whatever"
print "yours: %08X" % key_hash(guff)
print "mine : %08X" % khash(guff)
produces:
yours: A20352DB4214FD
mine : DB4214FD
The following works for me, though maybe a little unpythonic:
from numpy import uint32

def key_hash(data):
    # hash should be a 32-bit unsigned integer
    hashed = uint32()
    for char in data:
        # note the parentheses: + binds tighter than <<, so the
        # unparenthesized form would shift by (19 + ord(char))
        hashed += (hashed << uint32(19)) + uint32(ord(char))
    return hashed
x = key_hash("testkey")
print type(x)
The problem is that numbers are coerced towards more bits rather than less.
Related
I'm wondering if there's a way to do two's-complement sign extension in Python as you would in C/C++, using standard libraries (preferably on a bitarray).
C/C++:
// Example program
#include <iostream>
#include <string>

int main()
{
    int x = 0xFF;
    x <<= (32 - 8);
    x >>= (32 - 8);
    std::cout << x;
    return 0;
}
And here's a Python function I've written which (in my testing) accomplishes the same thing. I'm simply wondering if there's a built-in (or just faster) way of doing it:
def sign_extend(value, bits):
    highest_bit_mask = 1 << (bits - 1)
    remainder = 0
    for i in xrange(bits - 1):
        remainder = (remainder << 1) + 1
    if value & highest_bit_mask == highest_bit_mask:
        value = (value & remainder) - highest_bit_mask
    else:
        value = value & remainder
    return value
The following code delivers the same results as your function, but is a bit shorter. Also, obviously, if you are going to apply this to a lot of data, you can pre-calculate both the masks.
def sign_extend(value, bits):
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)
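For example, replicating the C snippet above (0xFF sign-extended from 8 bits):

print sign_extend(0xFF, 8)  # -1
print sign_extend(0x7F, 8)  # 127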
I am trying to convert this C function into Python:
typedef unsigned long var;

/* Bit rotate rightwards */
var ror(var v, unsigned int bits) {
    return (v >> bits) | (v << (8 * sizeof(var) - bits));
}
I have tried Googling for some solutions, but I can't seem to get any of them to give the same results as the one here.
This is one solution I have found in another program:
def mask1(n):
    """Return a bitmask of length n (suitable for masking against an
    int to coerce the size to a given length)
    """
    if n >= 0:
        return 2**n - 1
    else:
        return 0

def ror(n, rotations=1, width=8):
    """Return a given number of bitwise right rotations of an integer n,
    for a given bit field width.
    """
    rotations %= width
    if rotations < 1:
        return n
    n &= mask1(width)
    return (n >> rotations) | ((n << (8 * width - rotations)))
I am trying to bitshift key = 0xf0f0f0f0f123456. The C code gives 000000000f0f0f12 when it is called with ror(key, 8 << 1), but Python gives 0x0f0f0f0f0f123456 (the original input!).
Your C output doesn't match the function that you provided. That is presumably because you are not printing it correctly. This program:
#include <stdio.h>
#include <stdint.h>

uint64_t ror(uint64_t v, unsigned int bits)
{
    return (v >> bits) | (v << (8 * sizeof(uint64_t) - bits));
}

int main(void)
{
    printf("%llx\n", ror(0x0123456789abcdef, 4));
    printf("%llx\n", ror(0x0123456789abcdef, 8));
    printf("%llx\n", ror(0x0123456789abcdef, 12));
    printf("%llx\n", ror(0x0123456789abcdef, 16));
    return 0;
}
produces the following output:
f0123456789abcde
ef0123456789abcd
def0123456789abc
cdef0123456789ab
To produce an ror function in Python I refer you to this excellent article: http://www.falatic.com/index.php/108/python-and-bitwise-rotation
This Python 2 code produces the same output as the C program above:
ror = lambda val, r_bits, max_bits: \
    ((val & (2**max_bits - 1)) >> r_bits % max_bits) | \
    (val << (max_bits - (r_bits % max_bits)) & (2**max_bits - 1))

print "%x" % ror(0x0123456789abcdef, 4, 64)
print "%x" % ror(0x0123456789abcdef, 8, 64)
print "%x" % ror(0x0123456789abcdef, 12, 64)
print "%x" % ror(0x0123456789abcdef, 16, 64)
The shortest way I've found in Python:
(note this works only with integers as inputs)
def ror(n, rotations, width):
    return (2**width - 1) & (n >> rotations | n << (width - rotations))
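For instance, this reproduces the second line of output from the C program above:

print "%x" % ror(0x0123456789abcdef, 8, 64)  # ef0123456789abcd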
There are different problems in your question.
C part:
You use a key that is a 64-bit value (0x0f0f0f0f0f123456), but the output shows that for your compiler unsigned long is only 32 bits wide. So what the C code does is rotate the 32-bit value 0x0f123456 by 16 bits, giving 0x34560f12.
If you had used unsigned long long (assuming it is 64 bits on your architecture, as it is on mine), you would have got 0x34560f0f0f0f0f12 (a 16-bit rotation of a 64-bit value).
Python part:
The definition of width is not consistent between mask1 and ror: mask1 takes a width in bits, whereas ror takes a width in bytes (one byte = 8 bits).
The ror function should be:
def ror(n, rotations=1, width=8):
    """Return a given number of bitwise right rotations of an integer n,
    for a given bit field width.
    """
    rotations %= width * 8  # width bytes give 8*width bits
    if rotations < 1:
        return n
    mask = mask1(8 * width)  # store the mask
    n &= mask
    return (n >> rotations) | ((n << (8 * width - rotations)) & mask)  # apply the mask to the result
That way, with key = 0x0f0f0f0f0f123456, you get:
>>> hex(ror(key, 16))
'0x34560f0f0f0f0f12L'
>>> hex(ror(key, 16, 4))
'0x34560f12L'
exactly the same as the C output.
I know it's nearly 6 years old, but I always find it easier to use string slices than bitwise operations.
def rotate_left(x, n):
    return int(f"{x:032b}"[n:] + f"{x:032b}"[:n], 2)

def rotate_right(x, n):
    return int(f"{x:032b}"[-n:] + f"{x:032b}"[:-n], 2)
def rotation_value(value, rotations, width=32):
    """Return a given number of bitwise rotations of an integer value,
    for a given bit field width.
    Positive rotation counts rotate right; negative counts rotate left.
    """
    if int(rotations) != abs(int(rotations)):
        rotations = width + int(rotations)
    return (int(value) << (width - (rotations % width)) | (int(value) >> (rotations % width))) & ((1 << width) - 1)
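A couple of quick checks, using the default 32-bit width (positive counts rotate right, negative counts rotate left):

print rotation_value(1, 1)   # 2147483648, i.e. 0x80000000
print rotation_value(1, -1)  # 2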
I need to calculate a checksum for a hex serial word string using XOR. To my (limited) knowledge this has to be performed using the bitwise operator ^. Also, the data has to be converted to binary integer form. Below is my rudimentary code, but the checksum it calculates is 1000831. It should be 01001110, or 47 hex. I think the error may be due to missing the leading zeros, but all the formatting I've tried for adding the leading zeros turns the binary integers back into strings. I appreciate any suggestions.
word = ('010900004f')

# divide word into 5 separate bytes
wd1 = word[0:2]
wd2 = word[2:4]
wd3 = word[4:6]
wd4 = word[6:8]
wd5 = word[8:10]

# this converts a hex string to a binary string
wd1bs = bin(int(wd1, 16))[2:]
wd2bs = bin(int(wd2, 16))[2:]
wd3bs = bin(int(wd3, 16))[2:]
wd4bs = bin(int(wd4, 16))[2:]
wd5bs = bin(int(wd5, 16))[2:]

# this converts binary string to binary integer
wd1i = int(wd1bs)
wd2i = int(wd2bs)
wd3i = int(wd3bs)
wd4i = int(wd4bs)
wd5i = int(wd5bs)

# now that I have binary integers, I can use the XOR bitwise operator to calc cksum
checksum = (wd1i ^ wd2i ^ wd3i ^ wd4i ^ wd5i)

# I should get 47 hex as the checksum
print (checksum, type(checksum))
Why use all these conversions and the costly string functions?
(I will answer the X part of your XY-Problem, not the Y part.)
def checksum(s):
    v = int(s, 16)
    checksum = 0
    while v:
        checksum ^= v & 0xff
        v >>= 8
    return checksum

cs = checksum('010900004f')
print (cs, bin(cs), hex(cs))
Result is 0x47 as expected. Btw, 0x47 is 0b1000111, not 0b1001110 as stated.
s = '010900004f'
b = int(s, 16)
print reduce(lambda x, y: x ^ y, ((b >> 8*i) & 0xff for i in range(0, len(s)/2)), 0)
Just modify it like this.
before:
wd1i = int(wd1bs)
wd2i = int(wd2bs)
wd3i = int(wd3bs)
wd4i = int(wd4bs)
wd5i = int(wd5bs)
after:
wd1i = int(wd1bs, 2)
wd2i = int(wd2bs, 2)
wd3i = int(wd3bs, 2)
wd4i = int(wd4bs, 2)
wd5i = int(wd5bs, 2)
Why doesn't your code work?
Because you are misunderstanding the behavior of int(wd1bs).
Python's int() treats its argument as base 10 by default (see the documentation for int()), but you expect it to treat its argument as base 2.
So you need to write int(wd1bs, 2)
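A quick illustration of the difference:

print int('1001')     # 1001, parsed as base 10
print int('1001', 2)  # 9, parsed as base 2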
Or you can rewrite your entire code like this, so you don't need the bin function at all. This code is basically the same as @Hyperboreus's answer. :)
w = int('010900004f', 16)
w1 = (0xff00000000 & w) >> 4*8
w2 = (0x00ff000000 & w) >> 3*8
w3 = (0x0000ff0000 & w) >> 2*8
w4 = (0x000000ff00 & w) >> 1*8
w5 = (0x00000000ff & w)
checksum = w1 ^ w2 ^ w3 ^ w4 ^ w5
print hex(checksum)
# '0x47'
And this is an even shorter one.
import binascii
word = '010900004f'
print hex(reduce(lambda a, b: a ^ b, (ord(i) for i in binascii.unhexlify(word))))
#0x47
I want to generate 64-bit integers to serve as unique IDs for documents.
One idea is to combine the user's ID, which is a 32-bit int, with the Unix timestamp, which is another 32-bit int, to form a unique 64-bit integer.
A scaled-down example would be:
Combine two 4-bit numbers 0010 and 0101 to form the 8-bit number 00100101.
Does this scheme make sense?
If it does, how do I do the "concatenation" of numbers in Python?
Left shift the first number by the number of bits in the second number, then add (or bitwise OR - replace + with | in the following examples) the second number.
result = (user_id << 32) + timestamp
With respect to your scaled-down example,
>>> x = 0b0010
>>> y = 0b0101
>>> (x << 4) + y
37
>>> 0b00100101
37
>>>
foo = <some int>
bar = <some int>
foobar = (foo << 32) + bar
This should do it:
(x << 32) + y
For the next guy (which in this case was me), here is one way to do it in general (for the scaled-down example):
def combineBytes(*args):
    """
    given the bytes of a multi byte number combine into one
    pass them in least to most significant
    """
    ans = 0
    for i, val in enumerate(args):
        ans += (val << i*4)
    return ans
For other sizes, change the 4 to 32 or whatever.
>>> bin(combineBytes(0b0101, 0b0010))
'0b100101'
None of the answers before this cover both merging and splitting the numbers. Splitting can be as much a necessity as merging.
NUM_BITS_PER_INT = 4  # Replace with 32, 48, 64, etc. as needed.
MAXINT = (1 << NUM_BITS_PER_INT) - 1

def merge(a, b):
    c = (a << NUM_BITS_PER_INT) | b
    return c

def split(c):
    a = (c >> NUM_BITS_PER_INT) & MAXINT
    b = c & MAXINT
    return a, b

# Test
EXPECTED_MAX_NUM_BITS = NUM_BITS_PER_INT * 2
for a in range(MAXINT + 1):
    for b in range(MAXINT + 1):
        c = merge(a, b)
        assert c.bit_length() <= EXPECTED_MAX_NUM_BITS
        assert (a, b) == split(c)
How would you convert an arbitrary string into a unique integer, which would be the same across Python sessions and platforms? For example hash('my string') wouldn't work because a different value is returned for each Python session and platform.
Use a hash algorithm such as MD5 or SHA1, then convert the hexdigest via int():
>>> import hashlib
>>> int(hashlib.md5('Hello, world!').hexdigest(), 16)
144653930895353261282233826065192032313L
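If you need the result bounded to a fixed width, say 64 bits for a document ID, one option is to mask the digest-derived integer down to that width, at the cost of some collision risk. A minimal sketch (the helper name is mine):

import hashlib

def string_to_int64(s):
    # keep only the low 64 bits of the MD5-derived integer
    return int(hashlib.md5(s).hexdigest(), 16) & 0xFFFFFFFFFFFFFFFF

print string_to_int64('my string')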
If a hash function really won't work for you, you can turn the string into a number.
my_string = 'my string'

def string_to_int(s):
    ord3 = lambda x: '%.3d' % ord(x)
    return int(''.join(map(ord3, s)))

In [10]: string_to_int(my_string)
Out[10]: 109121032115116114105110103L
This is invertible, by mapping each triplet through chr.
def int_to_string(n):
    s = str(n)
    return ''.join([chr(int(s[i:i+3])) for i in range(0, len(s), 3)])

In [12]: int_to_string(109121032115116114105110103L)
Out[12]: 'my string'
Here are my Python 2.7 implementations of the algorithms listed here: http://www.cse.yorku.ca/~oz/hash.html.
No idea whether they are efficient or not.
from ctypes import c_ulong

def ulong(i): return c_ulong(i).value  # numpy would be better if available

def djb2(L):
    """
    h = 5381
    for c in L:
        h = ((h << 5) + h) + ord(c)  # h * 33 + c
    return h
    """
    return reduce(lambda h, c: ord(c) + ((h << 5) + h), L, 5381)

def djb2_l(L):
    return reduce(lambda h, c: ulong(ord(c) + ((h << 5) + h)), L, 5381)

def sdbm(L):
    """
    h = 0
    for c in L:
        h = ord(c) + (h << 6) + (h << 16) - h
    return h
    """
    return reduce(lambda h, c: ord(c) + (h << 6) + (h << 16) - h, L, 0)

def sdbm_l(L):
    return reduce(lambda h, c: ulong(ord(c) + (h << 6) + (h << 16) - h), L, 0)

def loselose(L):
    """
    h = 0
    for c in L:
        h += ord(c)
    return h
    """
    return sum(ord(c) for c in L)

def loselose_l(L):
    return reduce(lambda h, c: ulong(ord(c) + h), L, 0)
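A quick usage check: the plain versions return unbounded Python ints (so results are platform-independent), while the _l variants wrap at whatever width c_ulong happens to have on your platform:

print djb2("hello")               # 210714636441
print djb2("hello") & 0xFFFFFFFF  # 261238937, the 32-bit-wrapped value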
First off, you probably don't really want the integers to be actually unique; if you do, your numbers might be unlimited in size. If that really is what you want, you could use a bignum library and interpret the bits of the string as the representation of a (potentially very large) integer. If your strings can include the \0 character, you should prepend a 1 so you can distinguish e.g. "\0\0" from "\0".
Now, if you prefer bounded-size numbers, you'll be using some form of hashing. MD5 will work, but it's overkill for the stated purpose. I recommend using sdbm instead; it works very well. In C it looks like this:
static unsigned long sdbm(unsigned char *str)
{
    unsigned long hash = 0;
    int c;

    while (c = *str++)
        hash = c + (hash << 6) + (hash << 16) - hash;

    return hash;
}
The source, http://www.cse.yorku.ca/~oz/hash.html, also presents a few other hash functions.
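A direct Python port of the C above is straightforward; this is a sketch that masks to 32 bits to mimic a 32-bit unsigned long (adjust the mask if your platform's unsigned long is 64 bits):

def sdbm_hash(s):
    h = 0
    for c in s:
        h = (ord(c) + (h << 6) + (h << 16) - h) & 0xFFFFFFFF
    return h

print "%x" % sdbm_hash("my string")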
Here's another option, quite crude (probably has many collisions) and not very legible.
It worked for the purpose of generating an int (and later on, a random color) for different strings:
aString = "don't panic"
reduce(lambda x, y: x + y, map(lambda x: ord(x[0]) * x[1], zip(aString, range(1, len(aString)))))