Least Common Multiple of 2 numbers by prime factors of number - python

In this code, I am trying to compute the LCM from the prime factors of both numbers. I store the factors in Counters, but I cannot work out how to combine the keys (primes) and values (exponents) correctly.
I'm stuck at the Counter step; can anyone help me?
from collections import Counter
def q2_factor_lcm(a, b):  # function for lcm
    fa = factor_list(a)  # factor list for a
    fb = factor_list(b)  # factor list for b
    c = Counter(fa)  # variable to save counter for a
    d = Counter(fb)  # variable to save counter for b
    r = c | d
    r.keys()
    for key, value in sorted(r.items()):  # for loop for getting counter subtraction
        l = pow(key, value)
        result = []  # I am getting confused what to do now
        for item in l:
            result.append(l)
    return result  # will return result
def factor_list(n):  # it is to generate prime numbers
    factors = []  # to save list
    iprimes = iter(primes_list(n))  # loop
    while n > 1:
        p = next(iprimes)
        while n % p == 0:  # python calculation
            n = n // p
            factors.append(p)
    return factors  # it will return factors

First, this method is not an efficient way to find an LCM. Since there are clean, fast algorithms for the GCD, it is easier to get the LCM of a and b as lcm = a * b / gcd(a, b) (*).
Second, be careful with pow: the built-in pow(key, value) and key ** value are exact for integers, but math.pow converts its arguments to floats, and floating-point arithmetic is not accurate for large values.
Now for your question. A plain dict update on the two counters would not be what you want, because you would lose one of the values when a key is present in both dicts. (For Counter objects, c | d does keep the larger count for each key.) The explicit way is to take the union of the key sets and then use the max of both values, where a missing key counts as exponent 0:
...
    # use a true dict to be able to later use the get method with a default
    c = dict(Counter(fa))  # variable to save counter for a
    d = dict(Counter(fb))  # variable to save counter for b
    result = []
    for key in sorted(set(c.keys()).union(set(d.keys()))):
        exp = max(c.get(key, 0), d.get(key, 0))
        for i in range(exp):
            result.append(key)
    return result
(*) The trick is that when a > b, GCD(a, b) is GCD(b, a mod b). In Python this gives immediately:
def gcd(a, b):
    if b > a:
        return gcd(b, a)
    if b == 1:
        return b
    m = a % b
    return b if m == 0 else gcd(b, m)
def lcm(a, b):
    # integer division keeps the result an exact int in Python 3
    return a * b // gcd(a, b)
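As a quick cross-check of the two approaches discussed in this thread, the sketch below computes the LCM once from merged prime-factor Counters and once from the GCD. The names prime_factors, lcm_by_factors and lcm_by_gcd are made up for this illustration, and the trial-division factoring is only a simple stand-in for the question's factor_list (whose primes_list helper is not shown).

```python
from collections import Counter
from math import gcd


def prime_factors(n):
    """Return the prime factors of n, with repetition, by trial division."""
    factors = []
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:  # whatever is left is itself prime
        factors.append(n)
    return factors


def lcm_by_factors(a, b):
    """LCM from prime factors: keep the larger exponent of every prime."""
    merged = Counter(prime_factors(a)) | Counter(prime_factors(b))
    result = 1
    for prime, exponent in merged.items():
        result *= prime ** exponent  # exact integer power
    return result


def lcm_by_gcd(a, b):
    """LCM from the GCD, using integer division."""
    return a * b // gcd(a, b)


print(lcm_by_factors(12, 18), lcm_by_gcd(12, 18))
```

Both functions should agree for positive integers; the GCD version is the one to prefer when the factors themselves are not needed.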

recursively calculate if x is power of b

The assignment is to write a recursive function that receives two non-negative whole numbers b and x, and returns True if there is a natural number n such that b**n == x, and False otherwise. I'm not allowed to use any math operators or loops, except % to determine whether a number is even or odd.
I do have external functions that I can use, which add two numbers, multiply two numbers, and divide a number by 2. I can also write helper functions to use in the main function.
This is what I have so far, but it only works if b is of the form 2^y (2, 4, 8, 16, etc.):
def is_power(b, x):
    if b == x:
        return True
    if b > x:
        return False
    return is_power(add(b, b), x)  # the func 'add' just adds 2 numbers
Furthermore, the complexity needs to be O(logb * logx)
Thank you.
You can essentially keep multiplying by b until you reach, or pass, x.
A recursive implementation of this, using a helper function, could look something like this:
def is_power(b, x):
    if b == 1:  # Check special case
        return x == 1
    return helper(1, b, x)
def helper(counter, b, x):
    if counter == x:
        return True
    elif counter > x:
        return False
    else:
        return helper(mul(counter, b), b, x)  # mul is our special multiplication function
Use the function you say you can use to multiply 2 numbers, like this (assuming b > 1):
power = False
result = b
while result < x:
    # multiply the running product by b, not b by itself
    result = yourMultiplyFunction(result, b)
if result == x:
    power = True
print(power)
The question was edited (loops are not allowed):
def powerOf(x, b, b1=-1):
    if b1 == -1:
        b1 = b
    if b == x:  # also covers b == 1 and x == 1
        return True
    if (b == 1) or (x == 1):
        return False
    if b*b1 < x:
        return powerOf(x, b*b1, b1)
    elif b*b1 > x:
        return False
    return True
print(powerOf(625, 25))
A solution that is O(logb * logx) would be slower than a naive sequential search
You can get O(logx / logb) by simply doing this:
def is_power(b,x,bn=1):
    if bn == x: return True
    if bn > x: return False
    return is_power(b,x,bn*b)
I suspect that the objective is to go faster than O(logx/logb) and that the complexity requirement should be something like O(log(logx/logb)^2) which is equivalent to O(log(n)*log(n)).
To get a O(log(n)*log(n)) solution, you can convert the problem into a binary search by implementing a helper function to raise a number to a given power in O(log(n)) time and use it in the O(log(n)) search logic.
def raise_power(b,n):  # recursive b^n O(logN)
    if not n: return 1  # b^0 = 1
    if n%2: return b*raise_power(b*b,n//2)  # binary decomposition
    return raise_power(b*b,n//2)  # of power over base
def find_power(b,x,minp,maxp):  # binary search
    if minp>maxp: return False  # no matching power
    n = (minp+maxp)//2  # middle of exponent range
    bp = raise_power(b,n)  # compute power
    if bp == x: return True  # match found
    if bp > x: return find_power(b,x,minp,n-1)  # look in lower sub-range
    return find_power(b,x,n+1,maxp)  # look in upper sub-range
def max_power(b,x):
    return 2*max_power(b*b,x) if b<x else 1  # double n until b^n > x
def is_power(b,x):
    maxp = max_power(b,x)  # determine upper bound
    return find_power(b,x,0,maxp)  # use binary search
Note that you will need to convert the *, + and //2 operations to their equivalent external functions in order to meet the requirements of your assignment
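One way that conversion might look is sketched below. The helpers add, mul and half are stand-ins written here only so the example runs; in the assignment they would be the provided external functions, whose real names are not given in the question. The sketch also assumes that n = 0 counts as a valid exponent (as the binary search above does) and handles b < 2 separately, since squaring 0 or 1 never grows.

```python
# Stand-ins for the assignment's external functions (names are assumptions).
def add(a, b):
    return a + b


def mul(a, b):
    return a * b


def half(a):
    return a // 2


def raise_power(b, n):
    """b**n by repeated squaring, using only mul, half and % 2."""
    if n == 0:
        return 1
    squared = raise_power(mul(b, b), half(n))
    return mul(b, squared) if n % 2 else squared


def max_power(b, x):
    """Upper bound for the exponent: keep squaring b until it is no longer below x."""
    return mul(2, max_power(mul(b, b), x)) if b < x else 1


def find_power(b, x, lo, hi):
    """Recursive binary search for an exponent n in [lo, hi] with b**n == x."""
    if lo > hi:
        return False
    mid = half(add(lo, hi))
    bp = raise_power(b, mid)
    if bp == x:
        return True
    if bp > x:
        return find_power(b, x, lo, add(mid, -1))
    return find_power(b, x, add(mid, 1), hi)


def is_power(b, x):
    if b < 2:  # 0 and 1 only ever produce themselves (or 1 for n = 0)
        return x == b or x == 1
    return find_power(b, x, 0, max_power(b, x))


print(is_power(3, 81), is_power(2, 10))
```

Comparisons and % 2 are kept as plain operators here because the question allows % for parity and does not rule out comparisons.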

What Python function could i use to find the greatest common divisor of three numbers? [duplicate]

So I'm writing a program in Python to get the GCD of any amount of numbers.
def GCD(numbers):
    if numbers[-1] == 0:
        return numbers[0]
    # i'm stuck here, this is wrong
    for i in range(len(numbers)-1):
        print GCD([numbers[i+1], numbers[i] % numbers[i+1]])
print GCD(30, 40, 36)
The function takes a list of numbers.
This should print 2. However, I don't understand how to use the algorithm recursively so it can handle multiple numbers. Can someone explain?
updated, still not working:
def GCD(numbers):
    if numbers[-1] == 0:
        return numbers[0]
    gcd = 0
    for i in range(len(numbers)):
        gcd = GCD([numbers[i+1], numbers[i] % numbers[i+1]])
        gcdtemp = GCD([gcd, numbers[i+2]])
        gcd = gcdtemp
    return gcd
Ok, solved it
def GCD(a, b):
    if b == 0:
        return a
    else:
        return GCD(b, a % b)
and then use reduce, like
reduce(GCD, (30, 40, 36))
Since GCD is associative, GCD(a,b,c,d) is the same as GCD(GCD(GCD(a,b),c),d). In this case, Python's reduce function is a good candidate for reducing the cases where len(numbers) > 2 to a simple 2-number computation. The code would look something like this:
    if len(numbers) > 2:
        return reduce(lambda x,y: GCD([x,y]), numbers)
reduce applies the given function cumulatively to the elements of the list, so that something like
gcd = reduce(lambda x,y:GCD([x,y]),[a,b,c,d])
is the same as doing
gcd = GCD([a,b])
gcd = GCD([gcd,c])
gcd = GCD([gcd,d])
Now the only thing left is to code for when len(numbers) <= 2. Passing only two arguments to GCD in reduce ensures that your function goes through the reduce branch at most once (since len(numbers) > 2 only in the original call), which has the additional benefit of keeping the recursion shallow.
You can use reduce:
>>> from fractions import gcd
>>> reduce(gcd,(30,40,60))
10
which is equivalent to;
>>> lis = (30,40,60,70)
>>> res = gcd(*lis[:2]) #get the gcd of first two numbers
>>> for x in lis[2:]: #now iterate over the list starting from the 3rd element
...     res = gcd(res,x)
...
>>> res
10
help on reduce:
>>> reduce?
Type: builtin_function_or_method
reduce(function, sequence[, initial]) -> value
Apply a function of two arguments cumulatively to the items of a sequence,
from left to right, so as to reduce the sequence to a single value.
For example, reduce(lambda x, y: x+y, [1, 2, 3, 4, 5]) calculates
((((1+2)+3)+4)+5). If initial is present, it is placed before the items
of the sequence in the calculation, and serves as a default when the
sequence is empty.
Python 3.9 introduced multiple arguments version of math.gcd, so you can use:
import math
math.gcd(30, 40, 36)
3.5 <= Python <= 3.8.x:
import functools
import math
functools.reduce(math.gcd, (30, 40, 36))
3 <= Python < 3.5:
import fractions
import functools
functools.reduce(fractions.gcd, (30, 40, 36))
A solution for finding the LCM of more than two numbers in Python is as follows:
from functools import reduce  # reduce is not a builtin in Python 3
# finding LCM (Least Common Multiple) of a series of numbers
def GCD(a, b):
    # Gives greatest common divisor using Euclid's Algorithm.
    while b:
        a, b = b, a % b
    return a
def LCM(a, b):
    # gives lowest common multiple of two numbers
    return a * b // GCD(a, b)
def LCMM(*args):
    # gives LCM of a list of numbers passed as argument
    return reduce(LCM, args)
Here I've added +1 to the last argument of range() because range(start, stop) runs from start up to stop-1:
print("LCM of numbers (1 to 5) : " + str(LCMM(*range(1, 5+1))))
print("LCM of numbers (1 to 10) : " + str(LCMM(*range(1, 10+1))))
print(reduce(LCMM, (1, 2, 3, 4, 5)))
Those who are new to Python can read more about reduce() in the functools documentation.
The GCD operator is commutative and associative. This means that
gcd(a,b,c) = gcd(gcd(a,b),c) = gcd(a,gcd(b,c))
So once you know how to do it for 2 numbers, you can do it for any number
To do it for two numbers, you simply need to implement Euclid's algorithm, which is simply:
def euclid(a, b):
    # a and b are non-negative integers; if a < b the first pass swaps them
    while b > 0:
        a, b = b, a % b
    return a
With that function defined as euclid(a, b), you can define gcd(nums) as:
def gcd(nums):
    if len(nums) == 1:
        return nums[0]
    return euclid(nums[0], gcd(nums[1:]))
This uses the associative property of gcd() to compute the answer.
Try calling the GCD() as follows,
i = 0
temp = numbers[i]
for i in range(len(numbers)-1):
    temp = GCD(numbers[i+1], temp)
My way of solving it in Python. Hope it helps.
def find_gcd(arr):
    if len(arr) <= 1:
        return arr
    else:
        for i in range(len(arr)-1):
            a = arr[i]
            b = arr[i+1]
            while b:
                a, b = b, a%b
            arr[i+1] = a
        return a
def main(array):
    print(find_gcd(array))
main(array=[8, 18, 22, 24])  # 2
main(array=[8, 24])  # 8
main(array=[5])  # [5]
main(array=[])  # []
How the algorithm proceeds, as I understand it:
ex. [8, 18] -> [18, 8] -> [8, 2] -> [2, 0]
18 = 8x + 2 = (2y)x + 2 = 2z where z = xy + 1
ex. [18, 22] -> [22, 18] -> [18, 4] -> [4, 2] -> [2, 0]
22 = 18w + 4 = (4x + 2)w + 4 = ((2y)x + 2)w + 2y = 2z
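The pair sequences written out above can be reproduced with a small tracing helper. This is only an illustrative sketch; euclid_steps is a name invented here, not part of the answer's code.

```python
def euclid_steps(a, b):
    """Return every (a, b) pair visited by Euclid's algorithm, starting pair included."""
    steps = [(a, b)]
    while b:
        a, b = b, a % b
        steps.append((a, b))
    return steps


print(euclid_steps(8, 18))   # the last pair is (gcd, 0)
print(euclid_steps(18, 22))
```

The first element of the final pair is the GCD, which is what the while loop in find_gcd leaves in a.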
As of python 3.9 beta 4, it has got built-in support for finding gcd over a list of numbers.
Python 3.9.0b4 (v3.9.0b4:69dec9c8d2, Jul 2 2020, 18:41:53)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import math
>>> A = [30, 40, 36]
>>> print(math.gcd(*A))
2
One of the issues is that many of these calculations only work with numbers greater than 1. I modified an existing solution so that it accepts numbers smaller than 1. Basically, we can rescale the array by its minimum value and then use that to calculate the GCD of numbers smaller than 1.
# GCD of more than two (or array) numbers - allows floating point numbers
# Function implements the Euclidean algorithm to find H.C.F. of two numbers
def find_gcd(x, y):
    while(y):
        x, y = y, x % y
    return x
# Driver Code
l_org = [60e-6, 20e-6, 30e-6]
min_val = min(l_org)
l = [item/min_val for item in l_org]
num1 = l[0]
num2 = l[1]
gcd = find_gcd(num1, num2)
for i in range(2, len(l)):
    gcd = find_gcd(gcd, l[i])
gcd = gcd * min_val
print(gcd)
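Because the rescaling above relies on floating-point division and %, rounding can in principle disturb the result. A possible alternative, which is not what the answer above does, is to work with exact rationals from the standard fractions module. The helper name fraction_gcd is invented for this sketch, and the inputs are written as strings so that they are parsed exactly.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def fraction_gcd(x, y):
    """GCD of two rationals: gcd(a/b, c/d) = gcd(a*d, c*b) / (b*d)."""
    numerator = gcd(x.numerator * y.denominator, y.numerator * x.denominator)
    return Fraction(numerator, x.denominator * y.denominator)


# Same values as l_org above, parsed exactly from strings.
values = [Fraction("60e-6"), Fraction("20e-6"), Fraction("30e-6")]
result = reduce(fraction_gcd, values)
print(result, float(result))
```

Every input is an integer multiple of the returned fraction, and the final float conversion is the only place where rounding can occur.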
Here is a simple method to find the GCD of 2 numbers:
a = int(input("Enter the value of first number:"))
b = int(input("Enter the value of second number:"))
c,d = a,b
while a!=0:
    b,a=a,b%a
print("GCD of ",c,"and",d,"is",b)
As you said, you need a program that takes any amount of numbers and prints their HCF.
With this code you enter the numbers separated by spaces and press Enter to get the GCD:
num = list(map(int, input().split()))  # takes input
def print_factors(x):  # returns the list of all factors of x
    return [i for i in range(1, x + 1) if x % i == 0]
p = [print_factors(numbers) for numbers in num]
result = set(p[0])
for s in p[1:]:  # keeps only the factors common to every input
    result.intersection_update(s)
# the largest common factor is the GCD
print('HCF', max(result))
Hope it helped

Python-remove highly similar string from dataset

I have a genomic dataset contained base messages, like this:
Position samp1 samp2 samp2 samp3 samp4 samp5 samp6 ...
posA T T T T T T T ...
posB G A A G G A A ...
posC G G G G G G G ...
...
This file has 100000+ lines; each line contains 200 bases, one for each of 200 samples.
Now I want to remove positions that have highly similar bases across all samples. If two positions are 100% the same, I will remove one of them.
We define the similarity ratio as (number of matching bases) / (sequence length):
posH C C C C C C C C
posI A C C C A C C C
The similarity of posH and posI is 6 / 8 = 75%.
A similarity ratio above 99% is regarded as highly similar, and one of the similar positions should be removed.
How can I do this efficiently in Python?
Thank you.
With a similarity of 6/8 between posH and posI, it looks like you want the complement of the normalized Hamming distance (i.e. 1 - d).
You can compute it for two sequences using:
def inverse_hamming_distance(a,b):
    z = list(zip(a, b))
    return sum(e[0]==e[1] for e in z) / len(z)
and it gives:
>>> inverse_hamming_distance('CCCCCCCC', 'ACCCACCC')
0.75
However, you can save some CPU cycles by detecting early that two lines are not similar. Given the minimum similarity threshold t, once you observe more than int(0.5+(1-t)*len(z)) dissimilar items, you don't need to go to the end: you can already tell the items are not similar.
def similar(a,b,t=0.99):
    l = min(len(a), len(b))
    t = int(0.5 + l*(1 - t))
    n = 0
    for a1, b1 in zip(a, b):
        if a1 != b1:
            n += 1
            if n > t:
                return False
    return True
test:
>>> similar('CCCCCCCC', 'ACCCACCC', 0.75)
True
>>> similar('CCCCCCCC', 'ACCCACCC', 0.9)
False
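To connect this back to the question, here is a rough sketch of how similar could drive the actual filtering: each position is kept only if it is not highly similar to a position that has already been kept. The function drop_similar_positions and the (name, bases) row layout are assumptions made for this illustration. The greedy pass is quadratic in the number of kept positions in the worst case, so for 100000+ lines it is a starting point rather than a tuned solution.

```python
def similar(a, b, t=0.99):
    """True when a and b match on at least a fraction t of positions (early exit)."""
    length = min(len(a), len(b))
    allowed = int(0.5 + length * (1 - t))  # mismatches tolerated
    mismatches = 0
    for a1, b1 in zip(a, b):
        if a1 != b1:
            mismatches += 1
            if mismatches > allowed:
                return False
    return True


def drop_similar_positions(rows, t=0.99):
    """Greedy filter: keep a row only if no already-kept row is similar to it."""
    kept = []
    for name, bases in rows:
        if not any(similar(bases, kept_bases, t) for _, kept_bases in kept):
            kept.append((name, bases))
    return kept


# Tiny made-up dataset: posJ duplicates posH and should be dropped.
rows = [
    ("posH", "CCCCCCCC"),
    ("posI", "ACCCACCC"),
    ("posJ", "CCCCCCCC"),
    ("posK", "GGGGAAAA"),
]
kept_names = [name for name, _ in drop_similar_positions(rows, t=0.99)]
print(kept_names)
```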
First, to speed this up a lot, start by storing all the data as lists of integers or binary values before comparing. Either would dramatically reduce the memory required for the comparison. An enumeration would be a good fit. When you do this I would also split each dictionary value into a list with one item per sample: basedict = { 'posA' : [samp1, samp2,...] , ... }.
from enum import Enum
Base = Enum('Base', 'A C T G')
@mescalinum's answer has a good description of how to use a function to calculate whether two lines are similar:
def similar(a,b,t=0.99):
    l = min(len(a), len(b))
    t = int(0.5 + l*(1 - t))
    n = 0
    for a1, b1 in zip(a, b):
        if a1 != b1:
            n += 1
            if n > t:
                return False
    return True
All that's left is to make a loop that works for your dataset. similarpositions collects the key of every position deemed 'similar'.
similarpositions = []
for key in basedict:
    # number of comparisons between samples needed
    samplecomps = (len(basedict[key]) * (len(basedict[key]) - 1)) / 2
    dissimilar = 0
    for item1 in basedict[key]:
        for item2 in basedict[key]:
            if similar(item1, item2, 0.99) == False:
                dissimilar += 1
            # break once we know too many results are dissimilar, to save unneeded comparisons
            if dissimilar > 0.01 * samplecomps:
                break
        if dissimilar > 0.01 * samplecomps:
            break
    if dissimilar <= 0.01 * samplecomps:
        similarpositions.append(key)

Python programming beginner difficulties

I am trying to write a program in Python, but I am stuck in this piece of code:
def function():
    a=[3,4,5,2,4]
    b=1
    c=0
    for x in range(5):
        if a[x-1]>b:
            c=c+1
        return c
print(function())
It gives me value 1 instead of 5. Actually the function I am trying to write is a little bit more complicated, but the problem is actually the same, it doesn't give me the right result.
def result():
    r=[0]*len(y)
    a=2
    b=an_integer
    while b>0:
        for x in range(len(y)) :
            if y[x-1] > 1/a and b>0:
                r[x-1]=r[x-1]+1
                b=b-1
        a=a+1
        return r
print(result())
v is a list of values smaller than 1 and b has an integer as value. If some values x in v are bigger than 1/a then the values x in r should get 1 bigger, then it should repeat a=a+1 until b becomes 0. I want this function to give a result of the type for ex. [7,6,5,4,3] where the sum of the elements in this list is equal to b.
Sometimes it gives me the right value, sometimes not and when the elements in v are equal for example v=[0.33333,0.33333,0.33333] it gets stuck and doesn't give me a result.
I don't know what I am doing wrong !
Your return statements are incorrectly indented. You want to return after the loop ends, not inside the loop.
def function():
    a = [3, 4, 5, 2, 4]
    b = 1
    c = 0
    for x in range(5):
        if a[x-1] > b:
            c = c + 1
    return c
Also, a couple of optimizations to the code:
def function(a, b):
    c = 0
    for x in a:
        if x > b:
            c += 1
    return c
or further:
def function(a, b):
    return sum(x > b for x in a)
Use return only once, at the end of the function, and name the variable v.

hash functions family generator in python

I am looking for a hash functions family generator that could generate a family of hash functions given a set of parameters. I haven't found any such generator so far.
Is there a way to do that with the hashlib package ?
For example I'd like to do something like :
h1 = hash_function(1)
h2 = hash_function(2)
...
and h1 and h2 would be different hash functions.
For those of you who might know about it, I am trying to implement a min-hashing algorithm on a very large dataset.
Basically, I have a very large set of features (100 millions to 1 billion) for a given document, and I need to create 1000 to 10000 different random permutations for this set of features.
I do NOT want to build the random permutations explicitly, so the technique I would like to use is the following:
generate a hash function h and consider that, for two indices r and s,
r appears before s in the permutation if h(r) < h(s); do that for 100 to 1000 different hash functions.
Are there any known libraries that I might have missed ? Or any standard way of generating families of hash functions with python that you might be aware of ?
I'd just do something like (if you don't need thread-safety -- not hard to alter if you DO need thread safety -- and assuming a 32-bit Python version):
import random
_memomask = {}
def hash_function(n):
    mask = _memomask.get(n)
    if mask is None:
        random.seed(n)
        mask = _memomask[n] = random.getrandbits(32)
    def myhash(x):
        return hash(x) ^ mask
    return myhash
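Since the question asks specifically about hashlib, here is one possible alternative, which is not the approach of the answer above: hashlib.blake2b accepts a salt argument, so a different salt per index gives a different member of the family. The function name hash_function mirrors the question's wording; encoding features via str(x) and using an 8-byte digest are assumptions chosen for this sketch.

```python
import hashlib


def hash_function(n):
    """Return the n-th hash of a salted BLAKE2b family (n must fit in 8 bytes)."""
    salt = n.to_bytes(8, "little")  # blake2b salts may be up to 16 bytes

    def h(x):
        data = str(x).encode("utf-8")  # assumes str(x) identifies the feature
        digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
        return int.from_bytes(digest, "little")  # integer in [0, 2**64)

    return h


h1 = hash_function(1)
h2 = hash_function(2)

# Implicit permutations of the feature indices: r comes before s when h(r) < h(s).
features = range(10)
order1 = sorted(features, key=h1)
order2 = sorted(features, key=h2)
print(order1)
print(order2)
```

For min-hashing the permutations never need to be materialised: the signature entry for a hash h is simply min(h(f) for f in the document's features).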
As mentioned above, you can use universal hashing for minhash.
For example:
import random
def minhash():
    d1 = set(random.randint(0, 2000) for _ in range(1000))
    d2 = set(random.randint(0, 2000) for _ in range(1000))
    jacc_sim = len(d1.intersection(d2)) / len(d1.union(d2))
    print("jaccard similarity: {}".format(jacc_sim))
    N_HASHES = 200
    hash_funcs = []
    for i in range(N_HASHES):
        hash_funcs.append(universal_hashing())
    m1 = [min([h(e) for e in d1]) for h in hash_funcs]
    m2 = [min([h(e) for e in d2]) for h in hash_funcs]
    minhash_sim = sum(int(m1[i] == m2[i]) for i in range(N_HASHES)) / N_HASHES
    print("min-hash similarity: {}".format(minhash_sim))
def universal_hashing():
    def rand_prime():
        while True:
            # start from an odd number so that every candidate is odd
            p = random.randrange(2 ** 32 + 1, 2 ** 34, 2)
            if all(p % n != 0 for n in range(3, int((p ** 0.5) + 1), 2)):
                return p
    m = 2 ** 32 - 1
    p = rand_prime()
    a = random.randint(0, p)
    if a % 2 == 0:
        a += 1
    b = random.randint(0, p)
    def h(x):
        return ((a * x + b) % p) % m
    return h
@alex's answer is great and concise, but the hash functions it generates are not "very different from each other".
Let's look at the Pearson correlation between 10000 samples of 10000 hashes that put the results in 100 bins:
%%time # 1min 14s
n=10000
hashes = [hash_function(i) for i in range(n)]
median_pvalue(hashes, n=n)
# 1.1614081043690444e-06
I.e. the median p-value is about 1e-06, which is far from random. Here is what it looks like when the hashes are truly random:
%%time # 4min 15s
hashes = [lambda _ : random.randint(0,100) for _ in range(n)]
median_pvalue(hashes, n=n)
# 0.4979718236429698
Using the Carter and Wegman method you get:
%%time # 1min 43s
hashes = HashFamily(100).draw_hashes(n)
median_pvalue(hashes, n=n)
# 0.841929288037321
Code to reproduce :
from scipy.stats import pearsonr
import numpy as np
import random
_memomask = {}
def hash_function(n):
    mask = _memomask.get(n)
    if mask is None:
        random.seed(n)
        mask = _memomask[n] = random.getrandbits(32)
    def myhash(x):
        return hash(x) ^ mask
    return myhash
class HashFamily():
    r"""Universal hash family as proposed by Carter and Wegman.
    .. math::
        \begin{array}{ll}
        h_{{a,b}}(x)=((ax+b)~{\bmod ~}p)~{\bmod ~}m \ \mid p > m\\
        \end{array}
    Args:
        bins (int): Number of bins to hash to. Better if a prime number.
        moduler (int,optional): Temporary hashing. Has to be a prime number.
    """
    def __init__(self, bins, moduler=None):
        if moduler and moduler <= bins:
            raise ValueError("p (moduler) should be >> m (buckets)")
        self.bins = bins
        self.moduler = moduler if moduler else self._next_prime(np.random.randint(self.bins + 1, 2**32))
        # do not allow same a and b, as it could mean shifted hashes
        self.sampled_a = set()
        self.sampled_b = set()
    def _is_prime(self, x):
        """Naive is prime test."""
        # + 1 so that the integer square root itself is tested (e.g. 9, 25)
        for i in range(2, int(np.sqrt(x)) + 1):
            if x % i == 0:
                return False
        return True
    def _next_prime(self, n):
        """Naively gets the next prime larger than n."""
        while not self._is_prime(n):
            n += 1
        return n
    def draw_hash(self, a=None, b=None):
        """Draws a single hash function from the family."""
        if a is None:
            while a is None or a in self.sampled_a:
                a = np.random.randint(1, self.moduler - 1)
                assert len(self.sampled_a) < self.moduler - 2, "please give a bigger moduler"
            self.sampled_a.add(a)
        if b is None:
            while b is None or b in self.sampled_b:
                b = np.random.randint(0, self.moduler - 1)
                assert len(self.sampled_b) < self.moduler - 1, "please give a bigger moduler"
            self.sampled_b.add(b)
        return lambda x: ((a * x + b) % self.moduler) % self.bins
    def draw_hashes(self, n, **kwargs):
        """Draws n hash function from the family."""
        return [self.draw_hash() for i in range(n)]
def median_pvalue(hashes, buckets=100, n=1000):
    p_values = []
    for j in range(n-1):
        a = [hashes[j](i) % buckets for i in range(n)]
        b = [hashes[j+1](i) % buckets for i in range(n)]
        p_values.append(pearsonr(a,b)[1])
    return np.median(p_values)
Note that my implementation of Carter and Wegman is very naive (e.g. the generation of prime numbers). It could be made shorter and quicker.
You should consider using universal hashing. My answer and code can be found here: https://stackoverflow.com/a/25104050/207661
A universal hash family is a set H of hash functions mapping into m buckets, such that any two distinct inputs collide with probability at most 1/m when the hash function h is drawn randomly from H.
Based on the formulation in Wikipedia, you can use the following code:
import random
def is_prime(n):
    if n==2 or n==3: return True
    if n%2==0 or n<2: return False
    for i in range(3, int(n**0.5)+1, 2):
        if n%i==0:
            return False
    return True
# universal hash functions
class UniversalHashFamily:
    def __init__(self, number_of_hash_functions, number_of_buckets, min_value_for_prime_number=2, bucket_value_offset=0):
        self.number_of_buckets = number_of_buckets
        self.bucket_value_offset = bucket_value_offset
        primes = []
        # start the search at number_of_buckets so that p is not smaller than the bucket count
        number_to_check = max(min_value_for_prime_number, number_of_buckets)
        while len(primes) < number_of_hash_functions:
            if is_prime(number_to_check):
                primes.append(number_to_check)
            number_to_check += random.randint(1, 1000)
        self.hash_function_attrs = []
        for i in range(number_of_hash_functions):
            p = primes[i]
            a = random.randint(1, p - 1)  # a in [1, p-1]: randint is inclusive
            b = random.randint(0, p - 1)  # b in [0, p-1]
            self.hash_function_attrs.append((a, b, p))
    def __call__(self, function_index, input_integer):
        a, b, p = self.hash_function_attrs[function_index]
        return (((a*input_integer + b)%p)%self.number_of_buckets) + self.bucket_value_offset
Example usage:
We can create a hash family consisting of 20 hash functions, each one mapping the input to 100 buckets.
hash_family = UniversalHashFamily(20, 100)
And get the hashed values like:
input_integer = 1234567890 # sample input
hash_family(0, input_integer) # the output of the first hash function, i.e. h0(input_integer)
hash_family(1, input_integer) # the output of the second hash function, i.e. h1(input_integer)
# ...
hash_family(19, input_integer) # the output of the last hash function, i.e. h19(input_integer)
If you are interested in the universal hash family for string inputs, you can use the following code. But please note that this code may not be the optimized solution for string hashing.
class UniversalStringHashFamily:
    def __init__(self, number_of_hash_functions, number_of_buckets, min_value_for_prime_number=2, bucket_value_offset=0):
        self.number_of_buckets = number_of_buckets
        self.bucket_value_offset = bucket_value_offset
        primes = []
        number_to_check = max(min_value_for_prime_number, number_of_buckets)
        while len(primes) < number_of_hash_functions:
            if is_prime(number_to_check):
                primes.append(number_to_check)
            number_to_check += random.randint(1, 1000)
        self.hash_function_attrs = []
        for i in range(number_of_hash_functions):
            p = primes[i]
            a = random.randint(1, p - 1)  # a in [1, p-1]: randint is inclusive
            a2 = random.randint(1, p - 1)
            b = random.randint(0, p - 1)  # b in [0, p-1]
            self.hash_function_attrs.append((a, b, p, a2))
    def hash_int(self, int_to_hash, a, b, p):
        return (((a*int_to_hash + b)%p)%self.number_of_buckets) + self.bucket_value_offset
    def hash_str(self, str_to_hash, a, b, p, a2):
        str_to_hash = "1" + str_to_hash  # this will ensure that universality is not affected, see wikipedia for more detail
        l = len(str_to_hash)-1
        int_to_hash = 0
        for i in range(l+1):
            int_to_hash += ord(str_to_hash[i]) * (a2 ** (l-i))
        int_to_hash = int_to_hash % p
        return self.hash_int(int_to_hash, a, b, p)
    def __call__(self, function_index, str_to_hash):
        a, b, p, a2 = self.hash_function_attrs[function_index]
        return self.hash_str(str_to_hash, a, b, p, a2)
