I am looking for a hash-function family generator that can produce a family of hash functions given a set of parameters. I haven't found any such generator so far.
Is there a way to do that with the hashlib package?
For example, I'd like to do something like:
h1 = hash_function(1)
h2 = hash_function(2)
...
and h1 and h2 would be different hash functions.
For those of you who might know about it: I am trying to implement a min-hashing algorithm on a very large dataset.
Basically, I have a very large set of features (100 million to 1 billion) for a given document, and I need to create 1000 to 10000 different random permutations of this set of features.
I do NOT want to build the random permutations explicitly, so the technique I would like to use is the following:
generate a hash function h and consider that, for two indices r and s,
r appears before s in the permutation if h(r) < h(s); do that for 100 to 1000 different hash functions.
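Something like the sketch below is what I have in mind (implied_order is just a made-up name to illustrate the idea):

def implied_order(indices, h):
    # r appears before s in the implicit permutation iff h(r) < h(s)
    return sorted(indices, key=h)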
Are there any known libraries that I might have missed? Or any standard way of generating families of hash functions in Python that you might be aware of?
I'd just do something like this (if you don't need thread-safety -- not hard to alter if you DO need thread safety -- and assuming a 32-bit Python version):
import random

_memomask = {}

def hash_function(n):
    mask = _memomask.get(n)
    if mask is None:
        random.seed(n)
        mask = _memomask[n] = random.getrandbits(32)
    def myhash(x):
        return hash(x) ^ mask
    return myhash
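For example, a quick usage sketch (the printed values will vary across runs, since Python 3 randomizes hash() for strings per process):

h1 = hash_function(1)
h2 = hash_function(2)
print(h1("abc"), h2("abc"))    # same input, two different masked hash values
print(h1("abc") == h1("abc"))  # True: each generated function is deterministic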
As mentioned above, you can use universal hashing for minhash.
For example:
import random

def minhash():
    d1 = set(random.randint(0, 2000) for _ in range(1000))
    d2 = set(random.randint(0, 2000) for _ in range(1000))
    jacc_sim = len(d1.intersection(d2)) / len(d1.union(d2))
    print("jaccard similarity: {}".format(jacc_sim))
    N_HASHES = 200
    hash_funcs = []
    for i in range(N_HASHES):
        hash_funcs.append(universal_hashing())
    m1 = [min([h(e) for e in d1]) for h in hash_funcs]
    m2 = [min([h(e) for e in d2]) for h in hash_funcs]
    minhash_sim = sum(int(m1[i] == m2[i]) for i in range(N_HASHES)) / N_HASHES
    print("min-hash similarity: {}".format(minhash_sim))

def universal_hashing():
    def rand_prime():
        while True:
            # start from an odd lower bound: with an even start, randrange
            # with step 2 would only ever yield even (non-prime) candidates
            p = random.randrange(2 ** 32 + 1, 2 ** 34, 2)
            if all(p % n != 0 for n in range(3, int((p ** 0.5) + 1), 2)):
                return p
    m = 2 ** 32 - 1
    p = rand_prime()
    a = random.randint(1, p - 1)
    if a % 2 == 0:
        a += 1
    b = random.randint(0, p - 1)
    def h(x):
        return ((a * x + b) % p) % m
    return h
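To run the demo (the seed is my addition, purely for reproducibility; any seed works):

random.seed(0)
minhash()
# prints the exact Jaccard similarity followed by the min-hash estimate;
# with N_HASHES = 200 the two values should be close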
@alex's answer is great and concise, but the hash functions it generates are not "very different from each other".
Let's look at the Pearson correlation between 10000 hash functions, each evaluated on 10000 samples with the results put into 100 bins:
%%time  # 1min 14s
n = 10000
hashes = [hash_function(i) for i in range(n)]
median_pvalue(hashes, n=n)
# 1.1614081043690444e-06
I.e. the median p-value is 1e-06, which is far from random. Here's an example if it were truly random:
%%time  # 4min 15s
hashes = [lambda _: random.randint(0, 100) for _ in range(n)]
median_pvalue(hashes, n=n)
# 0.4979718236429698
Using the Carter and Wegman method you could get:
%%time  # 1min 43s
hashes = HashFamily(100).draw_hashes(n)
median_pvalue(hashes, n=n)
# 0.841929288037321
Code to reproduce:
from scipy.stats import pearsonr
import numpy as np
import random

_memomask = {}

def hash_function(n):
    mask = _memomask.get(n)
    if mask is None:
        random.seed(n)
        mask = _memomask[n] = random.getrandbits(32)
    def myhash(x):
        return hash(x) ^ mask
    return myhash
class HashFamily():
    r"""Universal hash family as proposed by Carter and Wegman.

    .. math::

        \begin{array}{ll}
        h_{{a,b}}(x)=((ax+b)~{\bmod ~}p)~{\bmod ~}m \ \mid p > m\\
        \end{array}

    Args:
        bins (int): Number of bins to hash to. Better if a prime number.
        moduler (int, optional): Temporary hashing modulus. Has to be a prime number.
    """
    def __init__(self, bins, moduler=None):
        if moduler and moduler <= bins:
            raise ValueError("p (moduler) should be >> m (buckets)")
        self.bins = bins
        self.moduler = moduler if moduler else self._next_prime(np.random.randint(self.bins + 1, 2**32))
        # do not allow the same a and b, as that could mean shifted hashes
        self.sampled_a = set()
        self.sampled_b = set()

    def _is_prime(self, x):
        """Naive primality test."""
        # include the square root itself in the range of candidate divisors
        for i in range(2, int(np.sqrt(x)) + 1):
            if x % i == 0:
                return False
        return True

    def _next_prime(self, n):
        """Naively gets the next prime larger than n."""
        while not self._is_prime(n):
            n += 1
        return n

    def draw_hash(self, a=None, b=None):
        """Draws a single hash function from the family."""
        if a is None:
            # check up front that unsampled values remain, to avoid looping forever
            assert len(self.sampled_a) < self.moduler - 2, "please give a bigger moduler"
            while a is None or a in self.sampled_a:
                a = np.random.randint(1, self.moduler - 1)
            self.sampled_a.add(a)
        if b is None:
            assert len(self.sampled_b) < self.moduler - 1, "please give a bigger moduler"
            while b is None or b in self.sampled_b:
                b = np.random.randint(0, self.moduler - 1)
            self.sampled_b.add(b)
        return lambda x: ((a * x + b) % self.moduler) % self.bins

    def draw_hashes(self, n, **kwargs):
        """Draws n hash functions from the family."""
        return [self.draw_hash() for i in range(n)]
def median_pvalue(hashes, buckets=100, n=1000):
    p_values = []
    for j in range(n - 1):
        a = [hashes[j](i) % buckets for i in range(n)]
        b = [hashes[j + 1](i) % buckets for i in range(n)]
        p_values.append(pearsonr(a, b)[1])
    return np.median(p_values)
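For instance, to reproduce the experiment at a smaller scale (n is reduced here only to keep the runtime short):

n = 1000
hashes = HashFamily(100).draw_hashes(n)
print(median_pvalue(hashes, n=n))  # should be far from 0 (the run above reports ~0.84 at n=10000)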
Note that my implementation of Carter and Wegman is very naive (e.g., the generation of prime numbers). It could be made shorter and faster.
You should consider using universal hashing. My answer and code can be found here: https://stackoverflow.com/a/25104050/207661
A universal hash family is a set of hash functions H mapping into m buckets, such that any two distinct inputs collide with probability at most 1/m when the hash function h is drawn uniformly at random from H.
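In symbols, H is universal when, for every pair of distinct inputs x and y,

\Pr_{h \sim H}\left[h(x) = h(y)\right] \le \frac{1}{m}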
Based on the formulation in Wikipedia, you can use the following code:
import random

def is_prime(n):
    if n == 2 or n == 3: return True
    if n % 2 == 0 or n < 2: return False
    for i in range(3, int(n**0.5) + 1, 2):
        if n % i == 0:
            return False
    return True

# universal hash functions
class UniversalHashFamily:
    def __init__(self, number_of_hash_functions, number_of_buckets, min_value_for_prime_number=2, bucket_value_offset=0):
        self.number_of_buckets = number_of_buckets
        self.bucket_value_offset = bucket_value_offset
        primes = []
        # p must exceed the number of buckets for ((ax+b) % p) % m to spread evenly
        number_to_check = max(min_value_for_prime_number, number_of_buckets)
        while len(primes) < number_of_hash_functions:
            if is_prime(number_to_check):
                primes.append(number_to_check)
            number_to_check += random.randint(1, 1000)
        self.hash_function_attrs = []
        for i in range(number_of_hash_functions):
            p = primes[i]
            a = random.randint(1, p - 1)
            b = random.randint(0, p - 1)
            self.hash_function_attrs.append((a, b, p))

    def __call__(self, function_index, input_integer):
        a, b, p = self.hash_function_attrs[function_index]
        return (((a*input_integer + b) % p) % self.number_of_buckets) + self.bucket_value_offset
Example usage:
We can create a hash family consisting of 20 hash functions, each one mapping its input to 100 buckets.
hash_family = UniversalHashFamily(20, 100)
And get the hashed values like:
input_integer = 1234567890 # sample input
hash_family(0, input_integer) # the output of the first hash function, i.e. h0(input_integer)
hash_family(1, input_integer) # the output of the second hash function, i.e. h1(input_integer)
# ...
hash_family(19, input_integer) # the output of the last hash function, i.e. h19(input_integer)
If you are interested in a universal hash family for string inputs, you can use the following code. But please note that this code may not be an optimized solution for string hashing.
class UniversalStringHashFamily:
    def __init__(self, number_of_hash_functions, number_of_buckets, min_value_for_prime_number=2, bucket_value_offset=0):
        self.number_of_buckets = number_of_buckets
        self.bucket_value_offset = bucket_value_offset
        primes = []
        number_to_check = max(min_value_for_prime_number, number_of_buckets)
        while len(primes) < number_of_hash_functions:
            if is_prime(number_to_check):
                primes.append(number_to_check)
            number_to_check += random.randint(1, 1000)
        self.hash_function_attrs = []
        for i in range(number_of_hash_functions):
            p = primes[i]
            a = random.randint(1, p - 1)
            a2 = random.randint(1, p - 1)
            b = random.randint(0, p - 1)
            self.hash_function_attrs.append((a, b, p, a2))

    def hash_int(self, int_to_hash, a, b, p):
        return (((a*int_to_hash + b) % p) % self.number_of_buckets) + self.bucket_value_offset

    def hash_str(self, str_to_hash, a, b, p, a2):
        str_to_hash = "1" + str_to_hash  # this ensures that universality is not affected, see Wikipedia for more detail
        # Horner's method: evaluate the polynomial in a2 modulo p as we go,
        # so intermediate values stay small
        int_to_hash = 0
        for char in str_to_hash:
            int_to_hash = (int_to_hash * a2 + ord(char)) % p
        return self.hash_int(int_to_hash, a, b, p)

    def __call__(self, function_index, str_to_hash):
        a, b, p, a2 = self.hash_function_attrs[function_index]
        return self.hash_str(str_to_hash, a, b, p, a2)
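Usage mirrors the integer family. For example (the variable names here are my own):

string_hash_family = UniversalStringHashFamily(5, 100)
print(string_hash_family(0, "hello"))  # a bucket index in [0, 100)
print(string_hash_family(1, "hello"))  # usually a different bucket for a different function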
from math import sqrt

S1 = [1,0,0,0,1,0,0,2]
S3 = [0,1,1,2,0,1,2,0]
sum = 0
sums1 = 0
sums3 = 0
for i, j in zip(S1, S3):
    sums1 += i*i
    sums3 += j*j
    sum += i*j
    cosine_similarity = sum / ((sqrt(sums1)) * (sqrt(sums3)))
    print(cosine_similarity)
Please, how can I remove this error from the code? I want to find the cosine similarity of the vectors.
The error is due to the indentation level of the last two lines (as mentioned in the comments by j1-lee):
    # ...
    sum += i*j

# dedented: compute the similarity once, after the loop finishes
cosine_similarity = sum / ((sqrt(sums1)) * (sqrt(sums3)))
print(cosine_similarity)
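Putting it together, a corrected version of the full script (I also renamed sum to dot so the built-in sum isn't shadowed):

from math import sqrt

S1 = [1,0,0,0,1,0,0,2]
S3 = [0,1,1,2,0,1,2,0]

dot = 0  # renamed from `sum` to avoid shadowing the built-in
sums1 = 0
sums3 = 0
for i, j in zip(S1, S3):
    sums1 += i*i
    sums3 += j*j
    dot += i*j

cosine_similarity = dot / (sqrt(sums1) * sqrt(sums3))
print(cosine_similarity)  # 0.0 for these vectors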
Here is another implementation, decomposing the definition of cosine similarity into smaller operations:
def scalar_product(a, b):
    return sum(a_i*b_i for a_i, b_i in zip(a, b))

def norm(a):
    return sum(a_i**2 for a_i in a)**.5

def cosine_similarity(a, b):
    return scalar_product(a, b) / (norm(a)*norm(b))

S1 = [1,0,0,0,1,0,0,2]
S3 = [0,1,1,2,0,1,2,0]

cs = cosine_similarity(S1, S3)
print(cs)
# 0.0 # orthogonality
cs = cosine_similarity(S1, S1)
print(cs)
# 1.0... # parallel vectors
I am currently working on a project replicating RSA key generation and testing, using the Euclidean algorithm and the extended Euclidean algorithm to find the modular inverse.
I used the Miller-Rabin test to choose two prime numbers, p and q.
After running the code, I am able to obtain Kpub and e; however, Kpr returns as nan.
Please help!
from random import randint

#Euclidean Algorithm func
def EucAlgo(a, b):
    if a == 0:
        return b
    return EucAlgo(b % a, a)

def ExEucAlgo(a, b):
    if a == 0:
        return b, 0, 1
    gcd, s1, t1 = ExEucAlgo(b % a, a)
    # gcd of a,b
    s = t1 - (b/a) * s1
    t = s1
    return gcd, s, t

def ExEucAlgo_modInverse(a, b):
    gcd, s, t = ExEucAlgo(b, a)
    if (gcd == 1):
        i = t % a
    elif (gcd != 1):
        print("There is no inverse modulo for the input.")
    return i

def SqMul_ModularExpo(b, exp, n):
    bin_exp = bin(exp)
    base = b
    for i in range(3, len(bin_exp)):
        base = (base ** 2) % n
        if (bin_exp[i] == '1'):
            i += 1
            base = (base * b) % n
    return base

#RSA Key generation
p = 9054583561027584891319616491815785011595937977633787663340258672121877196627062461308487615739189212918799813327175451021729047602129396754172486202100997
q = 10115395220079214686776355235686624745626962891667413288473649946208213820942557513105240135405981494333016032659525466362014175268953946332375459648688023
n = p * q
phi_n = (p-1) * (q-1)
e = randint(1, phi_n - 1)
while (EucAlgo(e, phi_n)) != 1:
    e = randint(1, phi_n - 1)
d = ExEucAlgo_modInverse(e, phi_n)
print(f"\nKpr={d}")
print(f"\nKpub=(n={n})\n \ne={e}")
The problem is that you are using floating-point division, which returns a float; with integers as large as these, the result is beyond what a Python float can represent, which is why you end up with nan. The solution is integer division, i.e. 5//2 == 2, not 2.5. Even after that fix, encrypting and then decrypting data would give the wrong result (you wouldn't get your plaintext back) because of some bugs in your functions.
FIRST: use a public exponent of 65537 (a prime number), which is the default in all RSA implementations (see your browser certificates), rather than searching for a random one. Then, once the extended Euclidean algorithm (which is what finds the modular inverse) has run, you don't have to make any more calculations: just return that value if the GCD is 1, otherwise raise an error or whatever.
Here is the complete code that works, after removing some unneeded functions, imports, and the random public exponent. Read the comments.
def EucAlgo(a, b):
    if a == 0:
        return b
    return EucAlgo(b % a, a)

def ExEucAlgo(a, b):
    if a == 0:
        return b, 0, 1
    gcd, s1, t1 = ExEucAlgo(b % a, a)
    # Don't use /; use // to get integer division
    s = t1 - (b//a) * s1
    t = s1
    return gcd, s, t

def ExEucAlgo_modInverse(a, b):
    gcd, s, t = ExEucAlgo(a, b)
    if gcd == 1:
        # s is the inverse of the public exponent; reduce it modulo b
        # so the returned value is the positive representative
        return s % b
    elif gcd != 1:
        # It's better to raise an error here, but that's up to you
        print("There is no inverse modulo for the input.")

#RSA Key generation
p = 9054583561027584891319616491815785011595937977633787663340258672121877196627062461308487615739189212918799813327175451021729047602129396754172486202100997
q = 10115395220079214686776355235686624745626962891667413288473649946208213820942557513105240135405981494333016032659525466362014175268953946332375459648688023
n = p * q
phi_n = (p-1) * (q-1)
# Just use the standard prime public exponent rather than trying random ones
e = 65537
d = ExEucAlgo_modInverse(e, phi_n)
print(f"\nKpr={d}")
print(f"\nKpub=(n={n})\n \ne={e}")

# Try to encrypt and decrypt 36
ciphertext = pow(36, e, n)
print("Encrypted data {}".format(ciphertext))
print("Decrypted data is {}".format(pow(ciphertext, d, n)))
I have binary polynomials, which I represent as binary numbers. For example:
a = 0b10011
b = 0b101
a is x^4+x+1 and b is x^2+1, so I want:
a % b == 2  # 0b10, i.e. the polynomial x
I would like to ask how I can do this. I think the standard % operation on the two numbers will not work.
Here's a simple idea: given a normal polynomial division routine, you could create a custom class to represent binary polynomials and then just override the % operator, maybe something like this:
from math import fabs

def poly_div(p1, p2):
    def degree(poly):
        while poly and poly[-1] == 0:
            poly.pop()
        return len(poly) - 1

    p2_degree = degree(p2)
    p1_degree = degree(p1)

    if p2_degree < 0:
        raise ZeroDivisionError

    if p1_degree >= p2_degree:
        q = [0] * p1_degree
        while p1_degree >= p2_degree:
            d = [0]*(p1_degree - p2_degree) + p2
            mult = q[p1_degree - p2_degree] = p1[-1] / float(d[-1])
            d = [coeff*mult for coeff in d]
            # for 0/1 coefficients, abs(difference) behaves like XOR
            p1 = [fabs(p1_c - p2_c) for p1_c, p2_c in zip(p1, d)]
            p1_degree = degree(p1)
        r = p1
    else:
        q = [0]
        r = p1

    return q, r

class BinPoly:
    def __init__(self, poly):
        # store coefficients in little-endian order (index == power),
        # so the string '10011' (x^4+x+1) becomes [1, 1, 0, 0, 1]
        self.poly = [int(bit) for bit in reversed(poly)]

    def __mod__(self, other):
        # pass copies, since poly_div mutates its arguments
        return poly_div(self.poly[:], other.poly[:])

if __name__ == '__main__':
    a = BinPoly('10011')
    b = BinPoly('101')
    print(a % b)
As you can see, you're constructing the polynomials from strings; tweaking the class to use binary numbers instead shouldn't be too hard, left as an exercise to the reader ;)
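If you do go the binary-number route, a minimal sketch of the GF(2) remainder using only integer bit operations could look like this (my sketch, not from the answer above; it assumes b != 0):

def gf2_mod(a, b):
    # polynomial remainder over GF(2): subtraction is XOR, so repeatedly
    # XOR a left-shifted copy of b into a while deg(a) >= deg(b)
    db = b.bit_length() - 1
    while a.bit_length() - 1 >= db:
        a ^= b << (a.bit_length() - 1 - db)
    return a

print(gf2_mod(0b10011, 0b101))  # 2, i.e. 0b10, the polynomial x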
I stumbled upon a problem at Project Euler, https://projecteuler.net/problem=15. I solved it by combinatorics, but was left wondering if there is a dynamic programming solution to this problem, or to these kinds of problems overall. And say some squares of the grid are taken off: is it still possible to count the paths? I am using Python. How should I do that? Any tips are appreciated. Thanks in advance.
You can do a simple backtrack and explore an implicit graph, like this (comments explain most of it):
def explore(r, c, n, memo):
    """
    Explore right and down from position (r,c).
    Report a route once position (n,n) is reached.
    memo is a matrix which saves how many routes exist from each position to (n,n).
    """
    if r == n and c == n:
        # one path has been found
        return 1
    elif r > n or c > n:
        # crossing the border, go back
        return 0
    if memo[r][c] is not None:
        return memo[r][c]

    a = explore(r+1, c, n, memo)  # move down
    b = explore(r, c+1, n, memo)  # move right

    # return the total paths found from this (r,c) position
    memo[r][c] = a + b
    return a + b

if __name__ == '__main__':
    n = 20
    memo = [[None] * (n+1) for _ in range(n+1)]
    paths = explore(0, 0, n, memo)
    print(paths)  # 137846528820
Most straightforwardly, use Python's built-in memoization util functools.lru_cache. You can encode missing squares as a frozenset (hashable) of missing grid points (pairs):
from functools import lru_cache

@lru_cache(None)
def paths(m, n, missing=None):
    missing = missing or frozenset()
    if (m, n) in missing:
        return 0
    if (m, n) == (0, 0):
        return 1
    over = paths(m, n-1, missing=missing) if n else 0
    down = paths(m-1, n, missing=missing) if m else 0
    return over + down
>>> paths(2, 2)
6
# middle grid point missing: only two paths
>>> paths(2, 2, frozenset([(1, 1)]))
2
>>> paths(20, 20)
137846528820
There is also a mathematical solution (which is probably what you used):
def factorial(n):
    result = 1
    for i in range(1, n + 1):
        result *= i
    return result

def paths(w, h):
    # integer division: the result is always a whole number, and // keeps
    # full precision where / would round through an imprecise float
    return factorial(w + h) // (factorial(w) * factorial(h))
This works because the number of paths is the same as the number of ways to choose to go right or down over w + h steps, where you go right w times, which is equal to w + h choose w, or (w + h)! / (w! * h!).
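As a side note, recent Python versions expose this binomial coefficient directly as math.comb; for a plain 20x20 grid (a convenience check, not part of the original answer):

from math import comb
print(comb(40, 20))  # 137846528820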
With missing grid squares, I think there is a combinatoric solution, but it's very slow if there are many missing squares, so dynamic programming would probably be better there.
For example, the following should work:
missing = [
    [0, 1],
    [0, 0],
    [0, 0],
]

def paths_helper(x, y, path_grid, missing):
    # out of the grid: no path through here
    if x < 0 or y < 0:
        return 0
    if path_grid[x][y] is not None:
        return path_grid[x][y]
    if missing[x][y]:
        path_grid[x][y] = 0
        return 0
    if x == 0 and y == 0:
        # base case: one way to be at the start
        path_grid[x][y] = 1
        return 1
    path_count = (paths_helper(x - 1, y, path_grid, missing) +
                  paths_helper(x, y - 1, path_grid, missing))
    path_grid[x][y] = path_count
    return path_count

def paths(missing):
    h = len(missing)
    w = len(missing[0])
    path_grid = [[None] * w for _ in range(h)]
    return paths_helper(h - 1, w - 1, path_grid, missing)

print(paths(missing))  # 2
I am constructing a pseudo random number generator for hashing. The algorithm I need to use is as follows:
Initialize an integer R to be equal to 1 every time the tabling routine is called
On each successive call for a random number, set R = R*5
Mask all but the lower order n+2 bits of the product and place the result in R
Set P = R/4 and return
This is what I have so far which works for a table of size 2^n, but how can I change it so it can take in a table of any size?
import math

def rand(size, i):
    n = math.log(size, 2)
    r = 1
    random_list = []
    mask = (1 << 2 + int(n)) - 1
    for n in range(1, size + 1):
        r = r * 5
        r &= mask
        p = r // 4
        random_list = random_list + [p]
    if i == 0: return random_list
    else: return random_list[i-1]
I didn't really understand how your code relates to your algorithm (what is random_list?) or how the code should be structured, but I assume it is something similar to this:
class Rand:
    def __init__(self, n):
        # Initialize an integer R to be equal to 1 every time the tabling routine is called
        self.r = 1
        self.n = n

    def rand(self):
        # On each successive call for a random number, set R = R*5
        self.r *= 5
        # Mask all but the lower order n+2 bits of the product and place the result in R
        self.r = self.r & (pow(2, self.n) - 1)
        # Set P = R/4 and return (integer division)
        self.p = self.r // 4
        return self.p
In which case, to make it work with a table of any size, the class becomes this:
class Rand2:
    def __init__(self, tableSize):
        # Initialize an integer R to be equal to 1 every time the tabling routine is called
        self.r = 1
        self.tableSize = tableSize

    def rand(self):
        # On each successive call for a random number, set R = R*5
        self.r *= 5
        # A bit mask is essentially a modulus operation, which is what we do instead
        self.r = self.r % self.tableSize
        # Set P = R/4 and return (integer division)
        self.p = self.r // 4
        return self.p
A simple test proves the outcome to be the same when the table sizes are identical:
rnd = Rand(10)
for i in range(0, 10):
    print(rnd.rand())

rnd2 = Rand2(pow(2, 10))
for i in range(0, 10):
    print(rnd2.rand())
But, like I said, I didn't really understand how your code related to your algorithm. I guess the tl;dr here is use the modulus operator instead of a bit mask.
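A quick check of that equivalence for a power-of-two table size (my addition, just a sanity test):

for x in range(10000):
    assert x & (2**10 - 1) == x % 2**10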