Persistent Hashing of Strings in Python - python

How would you convert an arbitrary string into a unique integer, which would be the same across Python sessions and platforms? For example hash('my string') wouldn't work because a different value is returned for each Python session and platform.

Use a hash algorithm such as MD5 or SHA-1, then convert the hexdigest via int(). On Python 3 the input must be bytes, so encode the string first:
>>> import hashlib
>>> int(hashlib.md5('Hello, world!'.encode('utf-8')).hexdigest(), 16)
144653930895353261282233826065192032313
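If a bounded integer is wanted rather than the full 128-bit digest, the same idea can be wrapped in a small helper. The following is a minimal sketch, not a definitive implementation: the name stable_hash, the choice of SHA-256, the UTF-8 encoding and the bits parameter are all assumptions made for illustration.

```python
import hashlib

def stable_hash(text, bits=64):
    """Map a string to an integer in [0, 2**bits) that is stable across runs.

    Hypothetical helper: SHA-256 and UTF-8 are illustrative choices.
    """
    if not 0 < bits <= 256:
        raise ValueError("bits must be between 1 and 256")
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Interpret the digest as a big-endian integer and keep the top `bits` bits.
    return int.from_bytes(digest, "big") >> (256 - bits)
```

Because the digest is truncated, two different strings can still map to the same integer, so the result is stable but not strictly unique.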

If a hash function really won't work for you, you can turn the string into a number by concatenating the zero-padded, three-digit code of each character.
my_string = 'my string'
def string_to_int(s):
    ord3 = lambda x: '%.3d' % ord(x)
    return int(''.join(map(ord3, s)))
In [10]: string_to_int(my_string)
Out[10]: 109121032115116114105110103
This is invertible, by mapping each triplet through chr.
def int_to_string(n):
    s = str(n)
    return ''.join([chr(int(s[i:i+3])) for i in range(0, len(s), 3)])
In [11]: int_to_string(109121032115116114105110103)
Out[11]: 'my string'
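One caveat with the inverse above: int() drops leading zeros, so a string whose first character has a code below 100 (for example 'A', code 65) yields a digit string whose length is not a multiple of three, and the triplets no longer line up. A possible fix, sketched here under the assumption that every character code is below 1000, is to pad the digits back to a multiple of three before splitting. The name int_to_string_padded is made up for this illustration.

```python
def string_to_int(s):
    # Concatenate the zero-padded, three-digit code of each character.
    return int(''.join('%03d' % ord(ch) for ch in s))

def int_to_string_padded(n):
    digits = str(n)
    # Restore leading zeros that int() dropped so the length is a multiple of 3.
    digits = digits.zfill(-(-len(digits) // 3) * 3)
    return ''.join(chr(int(digits[i:i + 3])) for i in range(0, len(digits), 3))
```

Strings that start with the NUL character still cannot be round-tripped this way, because a whole leading triplet of zeros disappears.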

Here are my Python 2.7 implementations of the algorithms listed here: http://www.cse.yorku.ca/~oz/hash.html.
No idea if they are efficient or not.
from ctypes import c_ulong
from functools import reduce  # built in on Python 2, must be imported on Python 3
def ulong(i): return c_ulong(i).value # numpy would be better if available
def djb2(L):
    """
    h = 5381
    for c in L:
        h = ((h << 5) + h) + ord(c) # h * 33 + c
    return h
    """
    return reduce(lambda h,c: ord(c) + ((h << 5) + h), L, 5381)
def djb2_l(L):
    return reduce(lambda h,c: ulong(ord(c) + ((h << 5) + h)), L, 5381)
def sdbm(L):
    """
    h = 0
    for c in L:
        h = ord(c) + (h << 6) + (h << 16) - h
    return h
    """
    return reduce(lambda h,c: ord(c) + (h << 6) + (h << 16) - h, L, 0)
def sdbm_l(L):
    return reduce(lambda h,c: ulong(ord(c) + (h << 6) + (h << 16) - h), L, 0)
def loselose(L):
    """
    h = 0
    for c in L:
        h += ord(c)
    return h
    """
    return sum(ord(c) for c in L)
def loselose_l(L):
    return reduce(lambda h,c: ulong(ord(c) + h), L, 0)

First off, you probably don't really want the integers to be actually unique. If you do then your numbers might be unlimited in size. If that really is what you want then you could use a bignum library and interpret the bits of the string as the representation of a (potentially very large) integer. If your strings can include the \0 character then you should prepend a 1, so you can distinguish e.g. "\0\0" from "\0".
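The "interpret the bits as an integer, with a leading 1" idea can be sketched with Python's built-in arbitrary-precision integers, so no separate bignum library is needed. The function names and the UTF-8 encoding below are assumptions made for illustration.

```python
def text_to_int(text):
    data = text.encode("utf-8")
    # Prepend a 0x01 byte so leading NUL bytes are not lost.
    return int.from_bytes(b"\x01" + data, "big")

def int_to_text(number):
    raw = number.to_bytes((number.bit_length() + 7) // 8, "big")
    # Drop the 0x01 sentinel byte added by text_to_int.
    return raw[1:].decode("utf-8")
```

With the sentinel in place, "\0" and "\0\0" map to different integers, and the mapping can be reversed exactly.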
Now, if you prefer bounded-size numbers, you'll be using some form of hashing. MD5 will work, but it's overkill for the stated purpose. I recommend using sdbm instead; it works very well. In C it looks like this:
static unsigned long sdbm(unsigned char *str)
{
    unsigned long hash = 0;
    int c;
    while (c = *str++)
        hash = c + (hash << 6) + (hash << 16) - hash;
    return hash;
}
The source, http://www.cse.yorku.ca/~oz/hash.html, also presents a few other hash functions.

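Here's another option, quite crude (it probably has many collisions) and not very legible.
It worked for the purpose of generating an int (and later on, a random color) for different strings:
from functools import reduce  # needed on Python 3
aString = "don't panic"
# weight each character code by its 1-based position; zip() stops at the
# shorter sequence, so the last character is ignored
reduce(lambda x, y: x + y, map(lambda x: ord(x[0]) * x[1], zip(aString, range(1, len(aString)))))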

Related

SymPy division doesn't cancel what it can when using symbolic denominator

I have some code using sympy.solvers.solve() that basically leads to the following:
>>> k, u, p, q = sympy.symbols('k u p q')
>>> solution = (k*u + p*u + q)/(k+p)
>>> solution.simplify()
(k*u + p*u + q)/(k + p)
Now, my problem is that it is not simplified enough/correctly. It should be giving the following:
q/(k + p) + u
From the original equation q = (k + p)*(m - u) this is more obvious (when you solve it manually, which my students will be doing).
I have tried many combinations of sol.simplify(), sol.cancel(), sol.collect(u) but I haven't found what can make it work (btw, the collect I can't really use, as I won't know beforehand which symbol will have to be collected, unless you can make something that collects all the symbols in the solution).
I am working with BookWidgets, which automatically corrects the answers that students give, which is why it's important that I have an output which will match what the students will enter.
First things first:
- There is no "standard" output to a simplification step.
- If the output of a simplification step doesn't suit your need, you might want to manipulate the expression with simplify, expand, collect, ...
- Two or more sequences of operations (simplify, expand, collect, ...) might lead to different results, or might lead to the same result. It depends on the expression being manipulated.
Let me show you with your example:
from sympy import symbols, fraction, Add
k, u, p, q = symbols('k u p q')
solution = (k*u + p*u + q)/(k+p)
# out1: (k*u + p*u + q)/(k + p)
solution = solution.collect(u)
# out2: (q + u*(k + p))/(k + p)
num, den = fraction(solution)
# use the linearity of addition: divide each term of the numerator by the denominator
solution = Add(*[t / den for t in num.args])
# out3: q/(k + p) + u
In the above code, out1, out2, out3 are mathematically equivalent.
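Written out by hand, the step from out2 to out3 is just splitting the fraction over the common denominator (assuming k + p is nonzero):

```latex
\frac{ku + pu + q}{k + p}
  = \frac{u\,(k + p) + q}{k + p}
  = \frac{u\,(k + p)}{k + p} + \frac{q}{k + p}
  = u + \frac{q}{k + p}
```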
Instead of spending time to simplify outputs, I would test for mathematical equivalence with the equals method. For example:
verified_solution = (k*u + p*u + q)/(k+p)
num, den = fraction(verified_solution)
first_student_solution = Add(*[t / den for t in num.args])
print(verified_solution.equals(first_student_solution))
# True
second_student_solution = q/(k + p) + u
print(verified_solution.equals(second_student_solution))
# True
third_student_solution = q/(k + p) + u + 2
print(verified_solution.equals(third_student_solution))
# False
It looks like you want the expression in quotient/remainder form:
>>> n, d = solution.as_numer_denom()
>>> div(n, d)
(u, q)
>>> _[0] + _[1]/d
q/(k + p) + u
But that SymPy function may give unexpected results when the symbol names are changed, as described here. Here is an alternative (for which I did not find an existing function in SymPy) that attempts something closer to a synthetic-division result:
def sdiv(p, q):
    """return w, r if p = w*q + r else 0, p
    Examples
    ========
    >>> from sympy.abc import x, y, z
    >>> sdiv(x, x)
    (1, 0)
    >>> sdiv(x, y)
    (0, x)
    >>> sdiv(2*x + 3, x)
    (2, 3)
    >>> a, b=x + 2*y + z, x + y
    >>> sdiv(a, b)
    (1, y + z)
    >>> sdiv(a, -b)
    (-1, y + z)
    >>> sdiv(-a, -b)
    (1, -y - z)
    >>> sdiv(-a, b)
    (-1, -y - z)
    """
    from sympy import S, fraction, signsimp
    from sympy.core.function import _mexpand
    P, Q = map(lambda i: _mexpand(i, recursive=True), (p, q))
    r, wq = P.as_independent(*Q.free_symbols, as_Add=True)
    # quick exit if no full division possible
    if Q.is_Add and not wq.is_Add:
        return S.Zero, P
    # check multiplicative cancellation
    w, bot = fraction((wq/Q).cancel())
    if bot != 1 and wq.is_Add and Q.is_Add:
        # try maximal additive extraction
        s1 = s2 = 1
        if signsimp(Q, evaluate=False).is_Mul:
            wq = -wq
            r = -r
            Q = -Q
            s1 = -1
        if signsimp(wq, evaluate=False).is_Mul:
            wq = -wq
            s2 = -1
        xa = wq.extract_additively(Q)
        if xa:
            was = wq.as_coefficients_dict()
            now = xa.as_coefficients_dict()
            dif = {k: was[k] - now.get(k, 0) for k in was}
            n = min(was[k]//dif[k] for k in dif)
            dr = wq - n*Q
            w = s2*n
            r = s1*(r + s2*dr)
            assert _mexpand(p - (w*q + r)) == 0
            bot = 1
    return (w, r) if bot == 1 else (S.Zero, p)
The more general suggestion from Davide_sd about using equals is good if you are only testing the equality of two expressions in different forms.

RSA Python Issue

I am having an issue with getting my python program to decrypt a message with an RSA problem. For some reason my Python program is stalling, really just not outputting anything. Anyone got an idea as to why?
n = 23952937352643527451379227516428377705004894508566304313177880191662177061878993798938496818120987817049538365206671401938265663712351239785237507341311858383628932183083145614696585411921662992078376103990806989257289472590902167457302888198293135333083734504191910953238278860923153746261500759411620299864395158783509535039259714359526738924736952759753503357614939203434092075676169179112452620687731670534906069845965633455748606649062394293289967059348143206600765820021392608270528856238306849191113241355842396325210132358046616312901337987464473799040762271876389031455051640937681745409057246190498795697239
p = 153143042272527868798412612417204434156935146874282990942386694020462861918068684561281763577034706600608387699148071015194725533394126069826857182428660427818277378724977554365910231524827258160904493774748749088477328204812171935987088715261127321911849092207070653272176072509933245978935455542420691737433
c = 18031488536864379496089550017272599246134435121343229164236671388038630752847645738968455413067773166115234039247540029174331743781203512108626594601293283737392240326020888417252388602914051828980913478927759934805755030493894728974208520271926698905550119698686762813722190657005740866343113838228101687566611695952746931293926696289378849403873881699852860519784750763227733530168282209363348322874740823803639617797763626570478847423136936562441423318948695084910283653593619962163665200322516949205854709192890808315604698217238383629613355109164122397545332736734824591444665706810731112586202816816647839648399
e = 65537
q = 156408916769576372285319235535320446340733908943564048157238512311891352879208957302116527435165097143521156600690562005797819820759620198602417583539668686152735534648541252847927334505648478214810780526425005943955838623325525300844493280040860604499838598837599791480284496210333200247148213274376422459183
phi = (q-1)*(p-1)
d = pow(e,-1,phi)
m = pow(c,d)%n
print(m)
I apologize for the weird code formatting. Thanks in advance.
Assuming the math is correct (I didn't check), you definitely want to change this:
m = pow(c,d)%n
to this:
m = pow(c, d, n)
The first spelling computes c**d to full precision before dividing by n to find the remainder. That can be enormously expensive. The second way keeps reducing intermediate results, under the covers, mod n all along, and never needs to do arithmetic in integers larger than about n**2.
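To illustrate why the three-argument form stays cheap, here is a sketch of right-to-left square-and-multiply exponentiation that reduces modulo n after every multiplication. This only illustrates the idea; CPython's actual implementation of pow differs in its details, and the name modpow is made up for this example.

```python
def modpow(base, exponent, modulus):
    """Compute base**exponent % modulus without building huge intermediates."""
    if exponent < 0 or modulus <= 0:
        raise ValueError("expects exponent >= 0 and modulus > 0")
    result = 1 % modulus
    base %= modulus
    while exponent:
        if exponent & 1:
            # Multiply in the current power of base, reducing immediately.
            result = (result * base) % modulus
        # Square for the next bit of the exponent, again reducing immediately.
        base = (base * base) % modulus
        exponent >>= 1
    return result
```

Every intermediate product here is smaller than modulus**2, and the loop runs once per bit of the exponent rather than once per unit of it.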
So, replacing the last line of your code and continuing:
>>> m = pow(c, d, n) # less than an eyeblink
>>> m
14311663942709674867122208214901970650496788151239520971623411712977120586163535880168563325
>>> pow(m, e, n) == c
True
So the original "message" (c) is recovered by doing modular exponentiation to powers d and e in turn.
As already answered by @TimPeters, the main issue is pow(c,d)%n, which should be replaced with pow(c, d, n) for a huge performance improvement.
Since your question was already answered, I decided to dig a bit further. Inspired by your question, I implemented most of the RSA mathematics from scratch according to the Wikipedia article. Maybe it is a bit off-topic (not what you asked), but I'm sure the following code will be a useful demo for somebody who wants to try RSA in plain Python, and it may be helpful to you too.
The code below names all variables as in the Wikipedia article, and the formulas are also taken from there. Important: one thing is missing from my code. For simplicity I didn't implement padding (just to show the classical RSA math). It is very important to have correct padding (e.g. OAEP) in your system; without it there exist attacks on RSA. Also, I used just 512 bits for the prime parts of the modulus; real systems should have thousands of bits to be secure. Finally, I don't do any splitting of the message; long messages should be split into sub-messages and padded to fit the modulus bit size.
import random
def fermat_prp(n):
    # https://en.wikipedia.org/wiki/Fermat_primality_test
    assert n >= 4, n
    for i in range(24):
        a = (3, 5, 7)[i] if n >= 9 and i < 3 else random.randint(2, n - 2)
        if pow(a, n - 1, n) != 1:
            return False
    return True
def gen_prime(bits):
    assert bits >= 3, bits
    while True:
        n = random.randrange(1 << (bits - 1), 1 << bits)
        if fermat_prp(n):
            return n
def gcd(a, b):
    while b != 0:
        a, b = b, a % b
    return a
def lcm(a, b):
    return a * b // gcd(a, b)
def egcd(a, b):
    # https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm
    ro, r, so, s, to, t = a, b, 1, 0, 0, 1
    while r != 0:
        q = ro // r
        ro, r = r, ro - q * r
        so, s = s, so - q * s
        to, t = t, to - q * t
    return ro, so, to
def demo():
    # https://en.wikipedia.org/wiki/RSA_(cryptosystem)
    bits = 512
    p, q = gen_prime(bits), gen_prime(bits)
    n = p * q
    ln = lcm(p - 1, q - 1)
    e = 65537
    print('PublicKey: e =', e, 'n =', n)
    d = egcd(e, ln)[1] % ln
    mtext = 'Hello, World!'
    print('Plain:', mtext)
    m = int.from_bytes(mtext.encode('utf-8'), 'little')
    c = pow(m, e, n)
    print('Encrypted:', c)
    md = pow(c, d, n)
    mdtext = md.to_bytes((md.bit_length() + 7) // 8, 'little').decode('utf-8')
    print('Decrypted:', mdtext)
if __name__ == '__main__':
    demo()
Output:
PublicKey: e = 65537 n = 110799663895649286762656294752173883884148615506062673584673343016070245791505883867301519267702723384430131035038547340921850290913097297607190494504060280758901448419479350528305305851775098631904614278162314251019568026506239421634950337278112960925116975344093575400871044570868887447462560168862887909233
Plain: Hello, World!
Encrypted: 51626387443589883457155394323971044262931599278626885275220384098221412582734630381413609428210758734789774315702921245355044370166117558802434906927834933002999816979504781510321118769252529439999715937013823223670924340787833496790181098038607416880371509879507193070745708801500713956266209367343820073123
Decrypted: Hello, World!
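For readers who want to follow the arithmetic by hand, the same steps can be run with the small textbook numbers used in the Wikipedia article (p = 61, q = 53). This sketch assumes Python 3.8 or later, where pow(e, -1, m) computes a modular inverse; primes this small are of course completely insecure.

```python
from math import gcd

p, q = 61, 53
n = p * q                                     # modulus: 3233
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)  # lcm(60, 52) = 780
e = 17                                        # public exponent, coprime to lam
d = pow(e, -1, lam)                           # private exponent: 413

m = 65                                        # plaintext, must be smaller than n
c = pow(m, e, n)                              # encrypt
assert pow(c, d, n) == m                      # decrypting recovers the plaintext
```

The private exponent satisfies d*e % lam == 1, which is the same relation the egcd call in the demo establishes.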

How to keep leading zeros in binary integer (python)?

I need to calculate a checksum for a hex serial word string using XOR. To my (limited) knowledge this has to be performed using the bitwise operator ^. Also, the data has to be converted to binary integer form. Below is my rudimentary code - but the checksum it calculates is 1000831. It should be 01001110 or 47hex. I think the error may be due to missing the leading zeros. All the formatting I've tried to add the leading zeros turns the binary integers back into strings. I appreciate any suggestions.
word = ('010900004f')
#divide word into 5 separate bytes
wd1 = word[0:2]
wd2 = word[2:4]
wd3 = word[4:6]
wd4 = word[6:8]
wd5 = word[8:10]
#this converts a hex string to a binary string
wd1bs = bin(int(wd1, 16))[2:]
wd2bs = bin(int(wd2, 16))[2:]
wd3bs = bin(int(wd3, 16))[2:]
wd4bs = bin(int(wd4, 16))[2:]
wd5bs = bin(int(wd5, 16))[2:]
#this converts binary string to binary integer
wd1i = int(wd1bs)
wd2i = int(wd2bs)
wd3i = int(wd3bs)
wd4i = int(wd4bs)
wd5i = int(wd5bs)
#now that I have binary integers, I can use the XOR bitwise operator to cal cksum
checksum = (wd1i ^ wd2i ^ wd3i ^ wd4i ^ wd5i)
#I should get 47 hex as the checksum
print (checksum, type(checksum))
Why use all these conversions and the costly string functions?
(I will answer the X part of your XY-Problem, not the Y part.)
def checksum(s):
    v = int(s, 16)
    checksum = 0
    while v:
        checksum ^= v & 0xff
        v >>= 8
    return checksum
cs = checksum('010900004f')
print(cs, bin(cs), hex(cs))
Result is 0x47 as expected. Btw 0x47 is 0b1000111 and not as stated 0b1001110.
from functools import reduce
s = '010900004f'
b = int(s, 16)
print(reduce(lambda x, y: x ^ y, ((b >> 8*i) & 0xff for i in range(0, len(s)//2)), 0))
Just modify like this.
before:
wd1i = int(wd1bs)
wd2i = int(wd2bs)
wd3i = int(wd3bs)
wd4i = int(wd4bs)
wd5i = int(wd5bs)
after:
wd1i = int(wd1bs, 2)
wd2i = int(wd2bs, 2)
wd3i = int(wd3bs, 2)
wd4i = int(wd4bs, 2)
wd5i = int(wd5bs, 2)
Why doesn't your code work?
Because you are misunderstanding how int(wd1bs) behaves.
See the documentation for int: by default it parses its argument as base 10.
But you expect int to treat its argument as base 2.
So you need to write int(wd1bs, 2).
Alternatively, you can rewrite your entire code like this, so that you don't need the bin function at all. This code is basically the same as @Hyperboreus's answer. :)
w = int('010900004f', 16)
w1 = (0xff00000000 & w) >> 4*8
w2 = (0x00ff000000 & w) >> 3*8
w3 = (0x0000ff0000 & w) >> 2*8
w4 = (0x000000ff00 & w) >> 1*8
w5 = (0x00000000ff & w)
checksum = w1 ^ w2 ^ w3 ^ w4 ^ w5
print(hex(checksum))
#'0x47'
And this is a shorter one.
import binascii
from functools import reduce
word = '010900004f'
# iterating over bytes yields integers on Python 3, so ord() is not needed
print(hex(reduce(lambda a, b: a ^ b, binascii.unhexlify(word))))
#0x47
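On the title question: an int has no notion of leading zeros, so they only matter when the value is formatted for display. A minimal sketch, with the function name xor_checksum chosen for this example, computes the checksum on integers and pads only when printing:

```python
from functools import reduce
from operator import xor

def xor_checksum(hex_word):
    # bytes.fromhex turns '010900004f' into the bytes 01 09 00 00 4f.
    return reduce(xor, bytes.fromhex(hex_word), 0)

cs = xor_checksum('010900004f')
print(format(cs, '08b'))  # 8 binary digits, zero-padded
print(format(cs, '02x'))  # 2 hex digits, zero-padded
```

The format specs '08b' and '02x' return strings, which is unavoidable: the padding exists only in the textual representation, never in the integer itself.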

CARP hash in Python

I am attempting to implement a CARP hash in Python as described in the following IETF draft:
https://datatracker.ietf.org/doc/html/draft-vinod-carp-v1-03#section-3.1
Specifically:
3.1. Hash Function
The hash function outputs a 32 bit unsigned integers based on a
zero-terminated ASCII input string. The machine name and domain
names of the URL, the protocol, and the machine names of each member
proxy should be evaluated in lower case since that portion of the
URL is case insensitive.
Because irreversibility and strong cryptographic features are
unnecessary for this application, a very simple and fast hash
function based on the bitwise left rotate operator is used.
For (each char in URL):
URL_Hash += _rotl(URL_Hash, 19) + char ;
Member proxy hashes are computed in a similar manner:
For (each char in MemberProxyName):
MemberProxy_Hash += _rotl(MemberProxy_Hash, 19) + char ;
Becaues member names are often similar to each other, their hash
values are further spread across hash space via the following
additional operations:
MemberProxy_Hash += MemberProxy_Hash * 0x62531965 ;
MemberProxy_Hash = _rotl (MemberProxy_Hash, 21) ;
3.2. Hash Combination
Hashes are combined by first exclusive or-ing (XOR) the URL hash by
the machine name and then multiplying by a constant and performing
a bitwise rotation.
All final and intermediate values are 32 bit unsigned integers.
Combined_Hash = (URL_hash ^ MemberProxy_Hash) ;
Combined_Hash += Combined_Hash * 0x62531965 ;
Combined_Hash = _rotl(Combined_Hash, 21) ;
I've tried to use numpy to create 32-bit unsigned integers. The first problem arises when the left bit shift is implemented: numpy automatically recasts the result as a 64-bit integer. The same happens for any arithmetic that would overflow 32 bits.
For example:
from numpy import uint32
def key_hash(data):
    # hash should be a 32-bit unsigned integer
    hashed = uint32()
    for char in data:
        hashed += hashed << 19 + ord(char)
    return hashed
x = key_hash("testkey")
print type(x)
Returns:
<type 'numpy.int64'>
Any tips of how I constrain this all to 32 bit space? Also, I am a bit confused by the spec in how performing some of these operations like "MemberProxy_Hash += MemberProxy_Hash * 0x62531965" will ever fit in 32 bits as it is calculating the hash.
EDIT:
Based upon feedback, it sounds like the right solution would be:
def key_hash(data):
    # hash should be a 32-bit unsigned integer
    hashed = 0
    for char in data:
        hashed += ((hashed << 19) + (hashed >> 13) + ord(char)) & 0xFFFFFFFF
    return hashed
def server_hash(data):
    # hash should be a 32-bit unsigned integer
    hashed = 0
    for char in data:
        hashed += ((hashed << 19) + (hashed >> 13) + ord(char)) & 0xFFFFFFFF
    hashed += (hashed * 0x62531965) & 0xFFFFFFFF
    hashed = ((hashed << 21) + (hashed >> 11)) & 0xFFFFFFFF
    return hashed
def hash_combination(key_hash, server_hash):
    # hash should be a 32-bit unsigned integer
    combined_hash = (key_hash ^ server_hash) & 0xFFFFFFFF
    combined_hash += (combined_hash * 0x62531965) & 0xFFFFFFFF
    return combined_hash
EDIT #2:
Another fixed version.
def rotate_left(x, n, maxbit=32):
    # assumes 32 bit
    x = x & (2 ** maxbit - 1)
    return ((x << n) | (x >> (maxbit - n)))
def key_hash(data):
    # hash should be a 32-bit unsigned integer
    hashed = 0
    for char in data:
        hashed = (hashed + rotate_left(hashed, 19) + ord(char))
    return hashed
def server_hash(data):
    # hash should be a 32-bit unsigned integer
    hashed = 0
    for char in data:
        hashed = (hashed + rotate_left(hashed, 19) + ord(char))
    hashed = hashed + hashed * 0x62531965
    hashed = rotate_left(hashed, 21)
    return hashed
def hash_combination(key_hash, server_hash):
    # hash should be a 32-bit unsigned integer
    combined_hash = key_hash ^ server_hash
    combined_hash = combined_hash + combined_hash * 0x62531965
    return combined_hash & 0xFFFFFFFF
Don't bother with numpy uint32. Just use standard Python int. Constrain the result of operations as necessary by doing result &= 0xFFFFFFFF to remove unwanted high-order bits.
def key_hash(data):
    # hash should be a 32-bit unsigned integer
    hashed = 0
    for char in data:
        # hashed += ((hashed << 19) + ord(char)) & 0xFFFFFFFF
        # the above is wrong; it's not masking the final addition.
        hashed = (hashed + (hashed << 19) + ord(char)) & 0xFFFFFFFF
    return hashed
You could do just one final masking but that would be rather slow on long input as the intermediate hashed would be a rather large number.
By the way, the above would not be a very good hash function. The rot in rotl means rotate, not shift.
You need
# hashed += ((hashed << 19) + (hashed >> 13) + ord(char)) & 0xFFFFFFFF
# the above is wrong; it's not masking the final addition.
hashed = (hashed + (hashed << 19) + (hashed >> 13) + ord(char)) & 0xFFFFFFFF
Edit ... a comparison; this code:
def rotate_left(x, n, maxbit=32):
    # assumes 32 bit
    x = x & (2 ** maxbit - 1)
    return ((x << n) | (x >> (maxbit - n)))
def key_hash(data):
    # hash should be a 32-bit unsigned integer
    hashed = 0
    for char in data:
        hashed = (hashed + rotate_left(hashed, 19) + ord(char))
    return hashed
def khash(data):
    h = 0
    for c in data:
        assert 0 <= h <= 0xFFFFFFFF
        h = (h + (h << 19) + (h >> 13) + ord(c)) & 0xFFFFFFFF
        assert 0 <= h <= 0xFFFFFFFF
    return h
guff = "twas brillig and the slithy toves did whatever"
print "yours: %08X" % key_hash(guff)
print "mine : %08X" % khash(guff)
produces:
yours: A20352DB4214FD
mine : DB4214FD
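Putting the corrections together, the three steps from the draft could be sketched as follows, masking after every operation so that all intermediate values stay within 32 bits. The function names are made up for this example, the final rotation in combine_hash follows section 3.2 as quoted above, and the sketch has not been checked against a reference implementation. Lowercasing the host parts of the URL, as the draft requires, is left to the caller.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit value left by n bits (0 < n < 32)."""
    x &= MASK32
    return ((x << n) | (x >> (32 - n))) & MASK32

def url_hash(url):
    h = 0
    for ch in url:
        h = (h + rotl32(h, 19) + ord(ch)) & MASK32
    return h

def member_hash(name):
    h = url_hash(name)                 # same loop as for the URL
    h = (h + h * 0x62531965) & MASK32  # extra spreading for similar names
    return rotl32(h, 21)

def combine_hash(url_h, member_h):
    h = (url_h ^ member_h) & MASK32
    h = (h + h * 0x62531965) & MASK32
    return rotl32(h, 21)
```

The multiplication by 0x62531965 does overflow 32 bits in Python; the mask simply discards the overflow, which is what unsigned 32-bit arithmetic in C does implicitly.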
The following works for me, though maybe a little unpythonic:
from numpy import uint32
def key_hash(data):
    # hash should be a 32-bit unsigned integer
    hashed = uint32()
    for char in data:
        hashed += hashed << uint32(19) + uint32(ord(char))
    return hashed
x = key_hash("testkey")
print type(x)
The problem is that numbers are coerced towards more bits rather than less.

Concatenate two 32 bit int to get a 64 bit long in Python

I want to generate 64-bit integers to serve as unique IDs for documents.
One idea is to combine the user's ID, which is a 32-bit int, with the Unix timestamp, which is another 32-bit int, to form a unique 64-bit integer.
A scaled-down example would be:
Combine two 4-bit numbers 0010 and 0101 to form the 8-bit number 00100101.
Does this scheme make sense?
If it does, how do I do the "concatenation" of numbers in Python?
Left shift the first number by the number of bits in the second number, then add (or bitwise OR - replace + with | in the following examples) the second number.
result = (user_id << 32) + timestamp
With respect to your scaled-down example,
>>> x = 0b0010
>>> y = 0b0101
>>> (x << 4) + y
37
>>> 0b00100101
37
>>>
foo = <some int>
bar = <some int>
foobar = (foo << 32) + bar
This should do it:
(x << 32) + y
For the next guy (which in this case was me), here is one way to do it in general (for the scaled-down example):
def combineBytes(*args):
    """
    given the bytes of a multi byte number combine into one
    pass them in least to most significant
    """
    ans = 0
    for i, val in enumerate(args):
        ans += (val << i*4)
    return ans
For other sizes, change the 4 to a 32 or whatever.
>>> bin(combineBytes(0b0101, 0b0010))
'0b100101'
None of the answers before this cover both merging and splitting the numbers. Splitting can be as much a necessity as merging.
NUM_BITS_PER_INT = 4 # Replace with 32, 48, 64, etc. as needed.
MAXINT = (1 << NUM_BITS_PER_INT) - 1
def merge(a, b):
    c = (a << NUM_BITS_PER_INT) | b
    return c
def split(c):
    a = (c >> NUM_BITS_PER_INT) & MAXINT
    b = c & MAXINT
    return a, b
# Test
EXPECTED_MAX_NUM_BITS = NUM_BITS_PER_INT * 2
for a in range(MAXINT + 1):
    for b in range(MAXINT + 1):
        c = merge(a, b)
        assert c.bit_length() <= EXPECTED_MAX_NUM_BITS
        assert (a, b) == split(c)
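Applied to the original use case, a sketch with range checks might look like this. The names make_doc_id and split_doc_id are hypothetical; the checks guard against a user ID or timestamp that does not fit in 32 bits, which would otherwise silently corrupt the other half.

```python
MASK32 = (1 << 32) - 1

def make_doc_id(user_id, timestamp):
    for value in (user_id, timestamp):
        if not 0 <= value <= MASK32:
            raise ValueError("value does not fit in 32 unsigned bits")
    # High 32 bits: user ID; low 32 bits: timestamp.
    return (user_id << 32) | timestamp

def split_doc_id(doc_id):
    return doc_id >> 32, doc_id & MASK32
```

As for whether the scheme makes sense: the ID is only unique if a user never creates two documents within the same second, and an unsigned 32-bit Unix timestamp runs out in the year 2106.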
