Does some standard Python module contain a function to compute modular multiplicative inverse of a number, i.e. a number y = invmod(x, p) such that x*y == 1 (mod p)? Google doesn't seem to give any good hints on this.
Of course, one can come up with home-brewed 10-liner of extended Euclidean algorithm, but why reinvent the wheel.
For example, Java's BigInteger has modInverse method. Doesn't Python have something similar?
Python 3.8+
y = pow(x, -1, p)
Python 3.7 and earlier
Maybe someone will find this useful (from wikibooks):
def egcd(a, b):
if a == 0:
return (b, 0, 1)
else:
g, y, x = egcd(b % a, a)
return (g, x - (b // a) * y, y)
def modinv(a, m):
g, x, y = egcd(a, m)
if g != 1:
raise Exception('modular inverse does not exist')
else:
return x % m
If your modulus is prime (you call it p) then you may simply compute:
y = x**(p-2) mod p # Pseudocode
Or in Python proper:
y = pow(x, p-2, p)
Here is someone who has implemented some number theory capabilities in Python: http://www.math.umbc.edu/~campbell/Computers/Python/numbthy.html
Here is an example done at the prompt:
m = 1000000007
x = 1234567
y = pow(x,m-2,m)
y
989145189L
x*y
1221166008548163L
x*y % m
1L
You might also want to look at the gmpy module. It is an interface between Python and the GMP multiple-precision library. gmpy provides an invert function that does exactly what you need:
>>> import gmpy
>>> gmpy.invert(1234567, 1000000007)
mpz(989145189)
Updated answer
As noted by #hyh , the gmpy.invert() returns 0 if the inverse does not exist. That matches the behavior of GMP's mpz_invert() function. gmpy.divm(a, b, m) provides a general solution to a=bx (mod m).
>>> gmpy.divm(1, 1234567, 1000000007)
mpz(989145189)
>>> gmpy.divm(1, 0, 5)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ZeroDivisionError: not invertible
>>> gmpy.divm(1, 4, 8)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ZeroDivisionError: not invertible
>>> gmpy.divm(1, 4, 9)
mpz(7)
divm() will return a solution when gcd(b,m) == 1 and raises an exception when the multiplicative inverse does not exist.
Disclaimer: I'm the current maintainer of the gmpy library.
Updated answer 2
gmpy2 now properly raises an exception when the inverse does not exists:
>>> import gmpy2
>>> gmpy2.invert(0,5)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ZeroDivisionError: invert() no inverse exists
As of 3.8 pythons pow() function can take a modulus and a negative integer. See here. Their case for how to use it is
>>> pow(38, -1, 97)
23
>>> 23 * 38 % 97 == 1
True
Here is a one-liner for CodeFights; it is one of the shortest solutions:
MMI = lambda A, n,s=1,t=0,N=0: (n < 2 and t%N or MMI(n, A%n, t, s-A//n*t, N or n),-1)[n<1]
It will return -1 if A has no multiplicative inverse in n.
Usage:
MMI(23, 99) # returns 56
MMI(18, 24) # return -1
The solution uses the Extended Euclidean Algorithm.
Sympy, a python module for symbolic mathematics, has a built-in modular inverse function if you don't want to implement your own (or if you're using Sympy already):
from sympy import mod_inverse
mod_inverse(11, 35) # returns 16
mod_inverse(15, 35) # raises ValueError: 'inverse of 15 (mod 35) does not exist'
This doesn't seem to be documented on the Sympy website, but here's the docstring: Sympy mod_inverse docstring on Github
Here is a concise 1-liner that does it, without using any external libraries.
# Given 0<a<b, returns the unique c such that 0<c<b and a*c == gcd(a,b) (mod b).
# In particular, if a,b are relatively prime, returns the inverse of a modulo b.
def invmod(a,b): return 0 if a==0 else 1 if b%a==0 else b - invmod(b%a,a)*b//a
Note that this is really just egcd, streamlined to return only the single coefficient of interest.
I try different solutions from this thread and in the end I use this one:
def egcd(a, b):
lastremainder, remainder = abs(a), abs(b)
x, lastx, y, lasty = 0, 1, 1, 0
while remainder:
lastremainder, (quotient, remainder) = remainder, divmod(lastremainder, remainder)
x, lastx = lastx - quotient*x, x
y, lasty = lasty - quotient*y, y
return lastremainder, lastx * (-1 if a < 0 else 1), lasty * (-1 if b < 0 else 1)
def modinv(a, m):
g, x, y = self.egcd(a, m)
if g != 1:
raise ValueError('modinv for {} does not exist'.format(a))
return x % m
Modular_inverse in Python
Here is my code, it might be sloppy but it seems to work for me anyway.
# a is the number you want the inverse for
# b is the modulus
def mod_inverse(a, b):
r = -1
B = b
A = a
eq_set = []
full_set = []
mod_set = []
#euclid's algorithm
while r!=1 and r!=0:
r = b%a
q = b//a
eq_set = [r, b, a, q*-1]
b = a
a = r
full_set.append(eq_set)
for i in range(0, 4):
mod_set.append(full_set[-1][i])
mod_set.insert(2, 1)
counter = 0
#extended euclid's algorithm
for i in range(1, len(full_set)):
if counter%2 == 0:
mod_set[2] = full_set[-1*(i+1)][3]*mod_set[4]+mod_set[2]
mod_set[3] = full_set[-1*(i+1)][1]
elif counter%2 != 0:
mod_set[4] = full_set[-1*(i+1)][3]*mod_set[2]+mod_set[4]
mod_set[1] = full_set[-1*(i+1)][1]
counter += 1
if mod_set[3] == B:
return mod_set[2]%B
return mod_set[4]%B
The code above will not run in python3 and is less efficient compared to the GCD variants. However, this code is very transparent. It triggered me to create a more compact version:
def imod(a, n):
c = 1
while (c % a > 0):
c += n
return c // a
from the cpython implementation source code:
def invmod(a, n):
b, c = 1, 0
while n:
q, r = divmod(a, n)
a, b, c, n = n, c, b - q*c, r
# at this point a is the gcd of the original inputs
if a == 1:
return b
raise ValueError("Not invertible")
according to the comment above this code, it can return small negative values, so you could potentially check if negative and add n when negative before returning b.
To figure out the modular multiplicative inverse I recommend using the Extended Euclidean Algorithm like this:
def multiplicative_inverse(a, b):
origA = a
X = 0
prevX = 1
Y = 1
prevY = 0
while b != 0:
temp = b
quotient = a/b
b = a%b
a = temp
temp = X
a = prevX - quotient * X
prevX = temp
temp = Y
Y = prevY - quotient * Y
prevY = temp
return origA + prevY
Well, here's a function in C which you can easily convert to python. In the below c function extended euclidian algorithm is used to calculate inverse mod.
int imod(int a,int n){
int c,i=1;
while(1){
c = n * i + 1;
if(c%a==0){
c = c/a;
break;
}
i++;
}
return c;}
Translates to Python Function
def imod(a,n):
i=1
while True:
c = n * i + 1;
if(c%a==0):
c = c/a
break;
i = i+1
return c
Reference to the above C function is taken from the following link C program to find Modular Multiplicative Inverse of two Relatively Prime Numbers
Related
I am having an issue with getting my python program to decrypt a message with an RSA problem. For some reason my Python program is stalling, really just not outputting anything. Anyone got an idea as to why?
n = 23952937352643527451379227516428377705004894508566304313177880191662177061878993798938496818120987817049538365206671401938265663712351239785237507341311858383628932183083145614696585411921662992078376103990806989257289472590902167457302888198293135333083734504191910953238278860923153746261500759411620299864395158783509535039259714359526738924736952759753503357614939203434092075676169179112452620687731670534906069845965633455748606649062394293289967059348143206600765820021392608270528856238306849191113241355842396325210132358046616312901337987464473799040762271876389031455051640937681745409057246190498795697239
p = 153143042272527868798412612417204434156935146874282990942386694020462861918068684561281763577034706600608387699148071015194725533394126069826857182428660427818277378724977554365910231524827258160904493774748749088477328204812171935987088715261127321911849092207070653272176072509933245978935455542420691737433
c = 18031488536864379496089550017272599246134435121343229164236671388038630752847645738968455413067773166115234039247540029174331743781203512108626594601293283737392240326020888417252388602914051828980913478927759934805755030493894728974208520271926698905550119698686762813722190657005740866343113838228101687566611695952746931293926696289378849403873881699852860519784750763227733530168282209363348322874740823803639617797763626570478847423136936562441423318948695084910283653593619962163665200322516949205854709192890808315604698217238383629613355109164122397545332736734824591444665706810731112586202816816647839648399
e = 65537
q = 156408916769576372285319235535320446340733908943564048157238512311891352879208957302116527435165097143521156600690562005797819820759620198602417583539668686152735534648541252847927334505648478214810780526425005943955838623325525300844493280040860604499838598837599791480284496210333200247148213274376422459183
phi = (q-1)*(p-1)
d = pow(e,-1,phi)
m = pow(c,d)%n
print(m)
I apologize for the weird code formatting. Thanks in advance.
Assuming the math is correct (I didn't check), you definitely want to change this:
m = pow(c,d)%n
to this:
m = pow(c, d, n)
The first spelling computes c**d to full precision before dividing by n to find the remainder. That can be enormously expensive. The second way keeps reducing intermediate results, under the covers, mod n all along, and never needs to do arithmetic in integers larger than about n**2.
So, replacing the last line of your code and continuing:
>>> m = pow(c, d, n) # less than an eyeblink
>>> m
14311663942709674867122208214901970650496788151239520971623411712977120586163535880168563325
>>> pow(m, e, n) == c
True
So the original "message" (c) is recovered by doing modular exponentiation to powers d and e in turn.
As already answered by #TimPeters main issue you have is pow(c,d)%n which should be replaced with pow(c, d, n) for huge performance improvement.
So as your question was already answered, I decided to dig a bit further. Inspired by your question I decided to implement most of RSA mathematics from scratch according to WikiPedia article. Maybe it is a bit offtopic (not what you asked) but I'm sure next code will be useful demo for somebody who wants to try RSA in plain Python, and may be helpful to you too.
Next code has all variables named same as in wikipedia, formulas are also taken from there. Important!, one thing is missing in my code, I didn't implement padding for simplicity (just to show classical RSA math), it is very important to have correct (e.g. OAEP) padding in your system, without it there exist attacks on RSA. Also I used just 512 bits for prime parts of modulus, real systems shoud have thousands of bits to be secure. Also I don't do any splitting of message, long messages should be split into sub-messages and padded to fit modulus bitsize.
Try it online!
import random
def fermat_prp(n):
# https://en.wikipedia.org/wiki/Fermat_primality_test
assert n >= 4, n
for i in range(24):
a = (3, 5, 7)[i] if n >= 9 and i < 3 else random.randint(2, n - 2)
if pow(a, n - 1, n) != 1:
return False
return True
def gen_prime(bits):
assert bits >= 3, bits
while True:
n = random.randrange(1 << (bits - 1), 1 << bits)
if fermat_prp(n):
return n
def gcd(a, b):
while b != 0:
a, b = b, a % b
return a
def lcm(a, b):
return a * b // gcd(a, b)
def egcd(a, b):
# https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm
ro, r, so, s, to, t = a, b, 1, 0, 0, 1
while r != 0:
q = ro // r
ro, r = r, ro - q * r
so, s = s, so - q * s
to, t = t, to - q * t
return ro, so, to
def demo():
# https://en.wikipedia.org/wiki/RSA_(cryptosystem)
bits = 512
p, q = gen_prime(bits), gen_prime(bits)
n = p * q
ln = lcm(p - 1, q - 1)
e = 65537
print('PublicKey: e =', e, 'n =', n)
d = egcd(e, ln)[1] % ln
mtext = 'Hello, World!'
print('Plain:', mtext)
m = int.from_bytes(mtext.encode('utf-8'), 'little')
c = pow(m, e, n)
print('Encrypted:', c)
md = pow(c, d, n)
mdtext = md.to_bytes((md.bit_length() + 7) // 8, 'little').decode('utf-8')
print('Decrypted:', mdtext)
if __name__ == '__main__':
demo()
Output:
PublicKey: e = 65537 n = 110799663895649286762656294752173883884148615506062673584673343016070245791505883867301519267702723384430131035038547340921850290913097297607190494504060280758901448419479350528305305851775098631904614278162314251019568026506239421634950337278112960925116975344093575400871044570868887447462560168862887909233
Plain: Hello, World!
Encrypted: 51626387443589883457155394323971044262931599278626885275220384098221412582734630381413609428210758734789774315702921245355044370166117558802434906927834933002999816979504781510321118769252529439999715937013823223670924340787833496790181098038607416880371509879507193070745708801500713956266209367343820073123
Decrypted: Hello, World!
I am currently working on a project replicating RSA key generation and testing using euclidean algorithm, extended euclidean algorithm to find the modular inverse of the value.
I used the Miller-Rabin test to choose two prime numbers, p and q.
After running the code, I am able to obtain Kpub and e, however Kpr returns as nan.
Please help!
#Euclidean Algorithm func
def EucAlgo(a, b):
if a==0:
return b
return EucAlgo(b % a,a)
def ExEucAlgo(a,b):
if a==0:
return b,0,1
gcd, s1, t1 = ExEucAlgo(b%a,a)
#gcd of a,b
s = t1 - (b/a) * s1
t = s1
return gcd, s, t
def ExEucAlgo_modInverse(a,b):
gcd, s, t = ExEucAlgo(b,a)
if (gcd == 1):
i = t % a
elif (gcd !=1):
print("There is no inverse modulo for the input.")
return i
def SqMul_ModularExpo(b, exp, n):
bin_exp = bin(exp)
base = b
for i in range (3, len(bin_exp)):
base = (base ** 2) % n
if(bin_exp[i]=='1'):
i+=1
base = (base * b) %n
return base
#RSA Key generation
p=9054583561027584891319616491815785011595937977633787663340258672121877196627062461308487615739189212918799813327175451021729047602129396754172486202100997
q=10115395220079214686776355235686624745626962891667413288473649946208213820942557513105240135405981494333016032659525466362014175268953946332375459648688023
n= p * q
phi_n= (p-1) * (q-1)
e= randint(1, phi_n - 1)
while((EucAlgo(e,phi_n)) !=1):
e = randint(1, (phi_n-1))
d = ExEucAlgo_modInverse(e,phi_n)
print(f"\nKpr={d}")
print(f"\nKpub=(n={n})\n \ne={e}")
The problem is that you are using float point division which will result in returning float a point which when dealing with large int can result in very large floats which python can't handle so the solution is to use integer division which means 5//2=2 not 2.5. The problem is that Now encrypting and decrypting data would result in wrong decryption. (You wont get 2 again) because of some bugs in your functions.
FIRST: use public exponent pf 65537(prime number) which is the default for all RSA implementations(see your browser certificates) rather than finding a random one. Then after calculating the extended Euclidean algorithm which is used to find modulo inverse you dont have to make any more calculations(just return this value if GCD is 1 otherwise raise an error or whatever).
Here is the complete code that works after removing some unneeded (functions, imports, and random public exponent) READ comments.
def EucAlgo(a, b):
if a == 0:
return b
return EucAlgo(b % a, a)
def ExEucAlgo(a,b):
if a==0:
return b, 0, 1
gcd, s1, t1 = ExEucAlgo(b%a, a)
# You dont use / use // to make integer division
s = t1 - (b//a) * s1
t = s1
return gcd, s, t
def ExEucAlgo_modInverse(a,b):
gcd, s, t = ExEucAlgo(a, b)
if (gcd == 1):
# Just return s which is the inverse of public exponent
return s
elif (gcd != 1):
# I think it's better to raise an error but it's up to you
print("There is no inverse modulo for the input.")
#RSA Key generation
p = 9054583561027584891319616491815785011595937977633787663340258672121877196627062461308487615739189212918799813327175451021729047602129396754172486202100997
q = 10115395220079214686776355235686624745626962891667413288473649946208213820942557513105240135405981494333016032659525466362014175268953946332375459648688023
n = p * q
phi_n = (p-1) * (q-1)
# Just use fixed prime public exponent rather than trying fixed ones
e = 65537
d = ExEucAlgo_modInverse(e, phi_n)
print(f"\nKpr={d}")
print(f"\nKpub=(n={n})\n \ne={e}")
# Try to encrypt and decrypt 36
ciphertext = pow(36, e, n)
print("Encrypted data {}".format(ciphertext))
print("Decrypted data is {}".format(pow(ciphertext, d, n)))
I stumbled upon a problem at Project Euler, https://projecteuler.net/problem=15
. I solved this by combinatorics but was left wondering if there is a dynamic programming solution to this problem or these kinds of problems overall. And say some squares of the grid are taken off - is that possible to navigate? I am using Python. How should I do that? Any tips are appreciated. Thanks in advance.
You can do a simple backtrack and explore an implicit graph like this: (comments explain most of it)
def explore(r, c, n, memo):
"""
explore right and down from position (r,c)
report a rout once position (n,n) is reached
memo is a matrix which saves how many routes exists from each position to (n,n)
"""
if r == n and c == n:
# one path has been found
return 1
elif r > n or c > n:
# crossing the border, go back
return 0
if memo[r][c] is not None:
return memo[r][c]
a= explore(r+1, c, n, memo) #move down
b= explore(r, c+1, n, memo) #move right
# return total paths found from this (r,c) position
memo[r][c]= a + b
return a+b
if __name__ == '__main__':
n= 20
memo = [[None] * (n+1) for _ in range(n+1)]
paths = explore(0, 0, n, memo)
print(paths)
Most straight-forwardly with python's built-in memoization util functools.lru_cache. You can encode missing squares as a frozenset (hashable) of missing grid points (pairs):
from functools import lru_cache
#lru_cache(None)
def paths(m, n, missing=None):
missing = missing or frozenset()
if (m, n) in missing:
return 0
if (m, n) == (0, 0):
return 1
over = paths(m, n-1, missing=missing) if n else 0
down = paths(m-1, n, missing=missing) if m else 0
return over + down
>>> paths(2, 2)
6
# middle grid point missing: only two paths
>>> paths(2, 2, frozenset([(1, 1)]))
2
>>> paths(20, 20)
137846528820
There is also a mathematical solution (which is probably what you used):
def factorial(n):
result = 1
for i in range(1, n + 1):
result *= i
return result
def paths(w, h):
return factorial(w + h) / (factorial(w) * factorial(h))
This works because the number of paths is the same as the number of ways to choose to go right or down over w + h steps, where you go right w times, which is equal to w + h choose w, or (w + h)! / (w! * h!).
With missing grid squares, I think there is a combinatoric solution, but it's very slow if there are many missing squares, so dynamic programming would probably be better there.
For example, the following should work:
missing = [
[0, 1],
[0, 0],
[0, 0],
]
def paths_helper(x, y, path_grid, missing):
if path_grid[x][y] is not None:
return path_grid[x][y]
if missing[x][y]:
path_grid[x][y] = 0
return 0
elif x < 0 or y < 0:
return 0
else:
path_count = (paths_helper(x - 1, y, path_grid, missing) +
paths_helper(x, y - 1, path_grid, missing))
path_grid[x][y] = path_count
return path_count
def paths(missing):
arr = [[None] * w for _ in range(h)]
w = len(missing[0])
h = len(missing)
return paths_helper(w, h, arr, missing)
print paths()
I was wondering if there was a function built into Python that can determine the distance between two rational numbers but without me telling it which number is larger.
e.g.
>>>distance(6,3)
3
>>>distance(3,6)
3
Obviously I could write a simple definition to calculate which is larger and then just do a simple subtraction:
def distance(x, y):
if x >= y:
result = x - y
else:
result = y - x
return result
but I'd rather not have to call a custom function like this.
From my limited experience I've often found Python has a built in function or a module that does exactly what you want and quicker than your code does it. Hopefully someone can tell me there is a built in function that can do this.
abs(x-y) will do exactly what you're looking for:
In [1]: abs(1-2)
Out[1]: 1
In [2]: abs(2-1)
Out[2]: 1
Although abs(x - y) and equivalently abs(y - x) work, the following one-liners also work:
math.dist((x,), (y,)) (available in Python ≥3.8)
math.fabs(x - y)
max(x - y, y - x)
-min(x - y, y - x)
max(x, y) - min(x, y)
(x - y) * math.copysign(1, x - y), or equivalently (d := x - y) * math.copysign(1, d) in Python ≥3.8
functools.reduce(operator.sub, sorted([x, y], reverse=True))
All of these return the euclidean distance(x, y).
Just use abs(x - y). This'll return the net difference between the two as a positive value, regardless of which value is larger.
If you have an array, you can also use numpy.diff:
import numpy as np
a = [1,5,6,8]
np.diff(a)
Out: array([4, 1, 2])
So simple just use abs((a) - (b)).
will work seamless without any additional care in signs(positive , negative)
def get_distance(p1,p2):
return abs((p1) - (p2))
get_distance(0,2)
2
get_distance(0,2)
2
get_distance(-2,0)
2
get_distance(2,-1)
3
get_distance(-2,-1)
1
use this function.
its the same convention you wanted.
using the simple abs feature of python.
also - sometimes the answers are so simple we miss them, its okay :)
>>> def distance(x,y):
return abs(x-y)
This does not address the original question, but I thought I would expand on the answer zinturs gave. If you would like to determine the appropriately-signed distance between any two numbers, you could use a custom function like this:
import math
def distance(a, b):
if (a == b):
return 0
elif (a < 0) and (b < 0) or (a > 0) and (b > 0):
if (a < b):
return (abs(abs(a) - abs(b)))
else:
return -(abs(abs(a) - abs(b)))
else:
return math.copysign((abs(a) + abs(b)),b)
print(distance(3,-5)) # -8
print(distance(-3,5)) # 8
print(distance(-3,-5)) # 2
print(distance(5,3)) # -2
print(distance(5,5)) # 0
print(distance(-5,3)) # 8
print(distance(5,-3)) # -8
Please share simpler or more pythonic approaches, if you have one.
If you plan to use the signed distance calculation snippet posted by phi (like I did) and your b might have value 0, you probably want to fix the code as described below:
import math
def distance(a, b):
if (a == b):
return 0
elif (a < 0) and (b < 0) or (a > 0) and (b >= 0): # fix: b >= 0 to cover case b == 0
if (a < b):
return (abs(abs(a) - abs(b)))
else:
return -(abs(abs(a) - abs(b)))
else:
return math.copysign((abs(a) + abs(b)),b)
The original snippet does not work correctly regarding sign when a > 0 and b == 0.
abs function is definitely not what you need as it is not calculating the distance. Try abs (-25+15) to see that it's not working. A distance between the numbers is 40 but the output will be 10. Because it's doing the math and then removing "minus" in front. I am using this custom function:
def distance(a, b):
if (a < 0) and (b < 0) or (a > 0) and (b > 0):
return abs( abs(a) - abs(b) )
if (a < 0) and (b > 0) or (a > 0) and (b < 0):
return abs( abs(a) + abs(b) )
print distance(-25, -15)
print distance(25, -15)
print distance(-25, 15)
print distance(25, 15)
You can try:
a=[0,1,2,3,4,5,6,7,8,9];
[abs(x[1]-x[0]) for x in zip(a[1:],a[:-1])]
I am looking for a hash functions family generator that could generate a family of hash functions given a set of parameters. I haven't found any such generator so far.
Is there a way to do that with the hashlib package ?
For example I'd like to do something like :
h1 = hash_function(1)
h2 = hash_function(2)
...
and h1 and h2 would be different hash functions.
For those of you who might know about it, I am trying to implement a min-hashing algorithm on a very large dataset.
Basically, I have a very large set of features (100 millions to 1 billion) for a given document, and I need to create 1000 to 10000 different random permutations for this set of features.
I do NOT want to build the random permutations explicitly so the technique I would like to use in the following :
generate a hash function h and consider that for two indices r and s
r appears before s in the permutation if h(r) < h(s) and do that for 100 to 1000 different hash functions.
Are there any known libraries that I might have missed ? Or any standard way of generating families of hash functions with python that you might be aware of ?
I'd just do something like (if you don't need thread-safety -- not hard to alter if you DO need thread safety -- and assuming a 32-bit Python version):
import random
_memomask = {}
def hash_function(n):
mask = _memomask.get(n)
if mask is None:
random.seed(n)
mask = _memomask[n] = random.getrandbits(32)
def myhash(x):
return hash(x) ^ mask
return myhash
As mentioned above, you can use universal hashing for minhash.
For example:
import random
def minhash():
d1 = set(random.randint(0, 2000) for _ in range(1000))
d2 = set(random.randint(0, 2000) for _ in range(1000))
jacc_sim = len(d1.intersection(d2)) / len(d1.union(d2))
print("jaccard similarity: {}".format(jacc_sim))
N_HASHES = 200
hash_funcs = []
for i in range(N_HASHES):
hash_funcs.append(universal_hashing())
m1 = [min([h(e) for e in d1]) for h in hash_funcs]
m2 = [min([h(e) for e in d2]) for h in hash_funcs]
minhash_sim = sum(int(m1[i] == m2[i]) for i in range(N_HASHES)) / N_HASHES
print("min-hash similarity: {}".format(minhash_sim))
def universal_hashing():
def rand_prime():
while True:
p = random.randrange(2 ** 32, 2 ** 34, 2)
if all(p % n != 0 for n in range(3, int((p ** 0.5) + 1), 2)):
return p
m = 2 ** 32 - 1
p = rand_prime()
a = random.randint(0, p)
if a % 2 == 0:
a += 1
b = random.randint(0, p)
def h(x):
return ((a * x + b) % p) % m
return h
Reference
#alex's answer is great and concise, but the hash functions it generates are not "very different from each other".
Let's look at the Pearson correlation between 10000 samples of 10000 hashes that put the results in 100 bins
%%time # 1min 14s
n=10000
hashes = [hash_function(i) for i in range(n)]
median_pvalue(hashes, n=n)
# 1.1614081043690444e-06
I.e. the median p_value is 1e-06 which is far from random. Here's an example if it were truly random :
%%time # 4min 15s
hashes = [lambda _ : random.randint(0,100) for _ in range(n)]
median_pvalue(hashes, n=n)
# 0.4979718236429698
Using Carter and Wegman method you could get:
%%time # 1min 43s
hashes = HashFamily(100).draw_hashes(n)
median_pvalue(hashes, n=n)
# 0.841929288037321
Code to reproduce :
from scipy.stats.stats import pearsonr
import numpy as np
import random
_memomask = {}
def hash_function(n):
mask = _memomask.get(n)
if mask is None:
random.seed(n)
mask = _memomask[n] = random.getrandbits(32)
def myhash(x):
return hash(x) ^ mask
return myhash
class HashFamily():
r"""Universal hash family as proposed by Carter and Wegman.
.. math::
\begin{array}{ll}
h_{{a,b}}(x)=((ax+b)~{\bmod ~}p)~{\bmod ~}m \ \mid p > m\\
\end{array}
Args:
bins (int): Number of bins to hash to. Better if a prime number.
moduler (int,optional): Temporary hashing. Has to be a prime number.
"""
def __init__(self, bins, moduler=None):
if moduler and moduler <= bins:
raise ValueError("p (moduler) should be >> m (buckets)")
self.bins = bins
self.moduler = moduler if moduler else self._next_prime(np.random.randint(self.bins + 1, 2**32))
# do not allow same a and b, as it could mean shifted hashes
self.sampled_a = set()
self.sampled_b = set()
def _is_prime(self, x):
"""Naive is prime test."""
for i in range(2, int(np.sqrt(x))):
if x % i == 0:
return False
return True
def _next_prime(self, n):
"""Naively gets the next prime larger than n."""
while not self._is_prime(n):
n += 1
return n
def draw_hash(self, a=None, b=None):
"""Draws a single hash function from the family."""
if a is None:
while a is None or a in self.sampled_a:
a = np.random.randint(1, self.moduler - 1)
assert len(self.sampled_a) < self.moduler - 2, "please give a bigger moduler"
self.sampled_a.add(a)
if b is None:
while b is None or b in self.sampled_b:
b = np.random.randint(0, self.moduler - 1)
assert len(self.sampled_b) < self.moduler - 1, "please give a bigger moduler"
self.sampled_b.add(b)
return lambda x: ((a * x + b) % self.moduler) % self.bins
def draw_hashes(self, n, **kwargs):
"""Draws n hash function from the family."""
return [self.draw_hash() for i in range(n)]
def median_pvalue(hashes, buckets=100, n=1000):
p_values = []
for j in range(n-1):
a = [hashes[j](i) % buckets for i in range(n)]
b = [hashes[j+1](i) % buckets for i in range(n)]
p_values.append(pearsonr(a,b)[1])
return np.median(p_values)
Note that my implementation is of Carter and Wegman is very naive (e.g. generation of prime numbers). It could be made shorter and quicker.
You should consider using universal hashing. My answer and code can be found here: https://stackoverflow.com/a/25104050/207661
The universal hash family is a set of hash functions H of size m, such that any two (district) inputs collide with probability at most 1/m when the hash function h is drawn randomly from set H.
Based on the formulation in Wikipedia, use can use the following code:
import random
def is_prime(n):
if n==2 or n==3: return True
if n%2==0 or n<2: return False
for i in range(3, int(n**0.5)+1, 2):
if n%i==0:
return False
return True
# universal hash functions
class UniversalHashFamily:
def __init__(self, number_of_hash_functions, number_of_buckets, min_value_for_prime_number=2, bucket_value_offset=0):
self.number_of_buckets = number_of_buckets
self.bucket_value_offset = bucket_value_offset
primes = []
number_to_check = min_value_for_prime_number
while len(primes) < number_of_hash_functions:
if is_prime(number_to_check):
primes.append(number_to_check)
number_to_check += random.randint(1, 1000)
self.hash_function_attrs = []
for i in range(number_of_hash_functions):
p = primes[i]
a = random.randint(1, p)
b = random.randint(0, p)
self.hash_function_attrs.append((a, b, p))
def __call__(self, function_index, input_integer):
a, b, p = self.hash_function_attrs[function_index]
return (((a*input_integer + b)%p)%self.number_of_buckets) + self.bucket_value_offset
Example usage:
We can create a hash family consists of 20 hash functions, each one map the input to 100 buckets.
hash_family = UniversalHashFamily(20, 100)
And get the hashed values like:
input_integer = 1234567890 # sample input
hash_family(0, input_integer) # the output of the first hash function, i.e. h0(input_integer)
hash_family(1, input_integer) # the output of the second hash function, i.e. h1(input_integer)
# ...
hash_family(19, input_integer) # the output of the last hash function, i.e. h19(input_integer)
If you are interested in the universal hash family for string inputs, you can use the following code. But please note that this code may not be the optimized solution for string hashing.
class UniversalStringHashFamily:
def __init__(self, number_of_hash_functions, number_of_buckets, min_value_for_prime_number=2, bucket_value_offset=0):
self.number_of_buckets = number_of_buckets
self.bucket_value_offset = bucket_value_offset
primes = []
number_to_check = max(min_value_for_prime_number, number_of_buckets)
while len(primes) < number_of_hash_functions:
if is_prime(number_to_check):
primes.append(number_to_check)
number_to_check += random.randint(1, 1000)
self.hash_function_attrs = []
for i in range(number_of_hash_functions):
p = primes[i]
a = random.randint(1, p)
a2 = random.randint(1, p)
b = random.randint(0, p)
self.hash_function_attrs.append((a, b, p, a2))
def hash_int(self, int_to_hash, a, b, p):
return (((a*int_to_hash + b)%p)%self.number_of_buckets) + self.bucket_value_offset
def hash_str(self, str_to_hash, a, b, p, a2):
str_to_hash = "1" + str_to_hash # this will ensure that universality is not affected, see wikipedia for more detail
l = len(str_to_hash)-1
int_to_hash = 0
for i in range(l+1):
int_to_hash += ord(str_to_hash[i]) * (a2 ** (l-i))
int_to_hash = int_to_hash % p
return self.hash_int(int_to_hash, a, b, p)
def __call__(self, function_index, str_to_hash):
a, b, p, a2 = self.hash_function_attrs[function_index]
return self.hash_str(str_to_hash, a, b, p, a2)