How can I optimize this edit distance code, i.e. finding the number of bits changed between two values? e.g.
word1 = '010000001000011111101000001001000110001'
word2 = '010000001000011111101000001011111111111'
When I try to run it on Hadoop it takes ages to complete. How can I reduce the for loops and comparisons?
#!/usr/bin/python
import os, re, string, sys
from numpy import zeros

def calculateDistance(word1, word2):
    x = zeros( (len(word1)+1, len(word2)+1) )
    for i in range(0, len(word1)+1):
        x[i, 0] = i
    for i in range(0, len(word2)+1):
        x[0, i] = i
    for j in range(1, len(word2)+1):
        for i in range(1, len(word1)+1):
            if word1[i-1] == word2[j-1]:
                x[i, j] = x[i-1, j-1]
            else:
                minimum = x[i-1, j] + 1
                if minimum > x[i, j-1] + 1:
                    minimum = x[i, j-1] + 1
                if minimum > x[i-1, j-1] + 1:
                    minimum = x[i-1, j-1] + 1
                x[i, j] = minimum
    return x[len(word1), len(word2)]
I looked for a bit counting algorithm online, and I found this page, which has several good algorithms. My favorite there is a one-line function which claims to work for Python 2.6 / 3.0:
return sum( b == '1' for b in bin(word1 ^ word2)[2:] )
I don't have Python, so I can't test, but if this one doesn't work, try one of the others. The key is to count the number of 1's in the bitwise XOR of your two words, because there will be a 1 for each difference.
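Since your word1 and word2 are strings of '0'/'1' characters rather than integers, you would need to convert them first. A minimal sketch of that (assuming both strings are valid binary):

word1 = '010000001000011111101000001001000110001'
word2 = '010000001000011111101000001011111111111'

# parse the bit strings as integers, XOR them, and count the 1 bits
diff = int(word1, 2) ^ int(word2, 2)
print(bin(diff).count('1'))  # 7 positions differ between these two words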
You are calculating the Hamming distance, right?
EDIT: I'm trying to understand your algorithm, and the way you're manipulating the inputs, it looks like they are actually arrays, and not just binary numbers. So I would expect that your code should look more like:
return sum( a != b for a, b in zip(word1, word2) )
EDIT2: I've figured out what your code does, and it's not the Hamming distance at all! It's actually the Levenshtein distance, which counts how many additions, deletions, or substitutions are needed to turn one string into another (the Hamming distance only counts substitutions, and so is only suitable for equal-length strings of digits). Looking at the Wikipedia page, your algorithm is more or less a straight port of the pseudocode they have there. As they point out, the time and space complexity of a comparison of strings of length m and n is O(mn), which is pretty bad. They have a few suggestions for optimizations depending on your needs, but I don't know what you use this function for, so I can't say what would be best for you. If the Hamming distance is good enough for you, the code above should suffice (time complexity O(n)), but note that the two distances can differ even on equal-length strings: '0101010101' and '1010101010' have Hamming distance 10 (flip every bit) but Levenshtein distance 2 (remove the leading 0 and append a 0 at the end).
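To make that difference concrete, here is a quick check (a sketch only; hamming is just the zip one-liner from above wrapped in a function, and calculateDistance is the code from the question):

def hamming(a, b):
    # number of positions at which two equal-length strings differ
    return sum(x != y for x, y in zip(a, b))

print(hamming('0101010101', '1010101010'))            # 10: every bit differs
print(calculateDistance('0101010101', '1010101010'))  # 2.0: delete the leading 0, append a 0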
Since you haven't specified what edit distance you're using yet, I'm going to go out on a limb and assume it's the Levenshtein distance. In which case, you can shave off some operations here and there:
def levenshtein(a, b):
    "Calculates the Levenshtein distance between a and b."
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space.
        # Not really important to the algorithm anyway.
        a, b = b, a
        n, m = m, n
    current = range(n+1)
    for i in range(1, m+1):
        previous, current = current, [i]+[0]*n
        for j in range(1, n+1):
            add, delete = previous[j]+1, current[j-1]+1
            change = previous[j-1]
            if a[j-1] != b[i-1]:
                change = change + 1
            current[j] = min(add, delete, change)
    return current[n]
Edit: also, you make no mention of your dataset. Depending on its characteristics, the implementation might change to take advantage of it.
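For instance, on the two bit strings from the question (a quick sanity check against the function above):

word1 = '010000001000011111101000001001000110001'
word2 = '010000001000011111101000001011111111111'
print(levenshtein(word1, word2))  # 7, the same as the Hamming distance for this pair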
Your algorithm seems to do a lot of work. It compares every bit to all bits in the opposite bit vector, meaning you get an algorithmic complexity of O(m*n). That is unnecessary if you are computing Hamming distance, so I assume you're not.
Your loop builds an x[i,j] matrix looking like this:
0 1 0 0 0 0 0 0 1 0 0 ... (word1)
0 0 1 0 0 0 0 0 0 1
1 1 0 1 1 1 1 1 1 0
0 0 1 0 1 1 1 1 1 1
0 0 1 1 0 1 1 1 1 2
0 0 1 1 1 0 1 1 1 2
0 0 1 1 1 1 0 1 1 2
1
1
...
(example word2)
This may be useful for detecting certain types of edits, but without knowing what edit distance algorithm you are trying to implement, I really can't tell you how to optimize it.
My goal is to speed up the creation of a list of combinations by using my GPU. How can I accomplish this?
By way of example, the following code creates a list of 260 text strings ranging from "aa" through "jz". We then use itertools combinations_with_replacement() to create all possible combinations of R elements of this list. The use of timeit shows that, beyond 3 elements, extracting a list of these combinations slows exponentially. I suspect this can be done with numba cuda, but I don't know how.
import timeit

timeit.timeit('''
from itertools import combinations_with_replacement

combo_count = 2
alphabet = 'a'
alpha_list = []
item_list = []
for i in range(0, 26):
    alpha_list.append(alphabet)
    alphabet = chr(ord(alphabet) + 1)
for first_letter in alpha_list[0:10]:
    for second_letter in alpha_list:
        item_list.append(first_letter + second_letter)
print("Length of item list:", len(item_list))
combos = combinations_with_replacement(item_list, combo_count)
cmb_lst = [bla for bla in combos]
print("Length of list of all {} combinations: {}".format(combo_count, len(cmb_lst)))
''', number=1)
As mentioned in the comments, there is no way to "vectorize" the combinations_with_replacement() call from the itertools library directly (with Numba CUDA). Numba CUDA doesn't work that way.
However, I believe it should be possible to generate an equivalent result dataset, using Numba CUDA, in a way that seems to run faster than the itertools library function for certain cases. I imagine there are probably a number of ways to accomplish this, and I make no claims that the method I describe is in any way optimal. It certainly is not, and could certainly be made to run faster. However, according to my testing, even this not-very-optimized approach can run a particular test case about 10x faster than python itertools on a V100 GPU.
As background, I consider this and this (or equivalent material) to be essential reading.
From the above, the formula for the number of combinations of n items with k choices, with replacement, is given by:
(n-1+k)!
-------- (Formula 1)
(n-1)!k!
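This is the same quantity as the binomial coefficient C(n+k-1, k); a quick sanity check of formula 1 in Python (math.comb requires Python 3.8+):

from math import comb, factorial

n, k = 3, 3
by_formula = factorial(n - 1 + k) // (factorial(n - 1) * factorial(k))
print(by_formula, comb(n + k - 1, k))  # both print 10 for n = 3, k = 3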
In the code below, I have encapsulated the above calculation in count_comb_with_repl (device) and host_count_comb_with_repl (host) functions. It turns out we can use this one basic calculation, with a cascading sequence of smaller values for n and k, to drive the entire calculation process and compute a combination given only an index into the final result array. To visualize what we are doing, it helps to have a simple example. Let's take the case of 3 items and 3 choices. Indexing items from zero, the array of possibilities looks like this:
n = 3, k = 3

index   choices   first digit calculation
  0     0,0,0     -----------------
  1     0,0,1
  2     0,0,2
  3     0,1,1     equivalent to n = 3, k = 2
  4     0,1,2
  5     0,2,2     -----------------
  6     1,1,1     -----------------
  7     1,1,2     equivalent to n = 2, k = 2
  8     1,2,2     -----------------
  9     2,2,2     equivalent to n = 1, k = 2
The length of the above list is given by plugging the values n = 3 and k = 3 into formula 1. The key observation for understanding the method I present is this: to compute the first digit of the choices result given only the index, we can compute the dividing point between first digits 0 and 1, for example, by observing that among the results whose first choice index is 0, the length of that range is given by plugging n = 3 and k = 2 into formula 1. Therefore, if our given index is less than this value (6), we know the first digit is 0. If it is greater than or equal to this value, we know the first digit is 1 or 2, and with suitable offsetting we can compute the next range (corresponding to a first digit of 1) and see if our index falls within it.
Once we know the first digit, we can repeat the process (with suitable list reduction and offsetting) to find the next digit, and the next digit, etc.
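Before looking at the CUDA version, here is a pure-Python sketch of that index-to-combination decoding (my own illustrative code, not part of the answer's kernel; index_to_combination is a made-up name), checked against itertools for the small case above:

from math import comb
from itertools import combinations_with_replacement

def count_comb_with_repl(n, k):
    # formula 1: combinations of n items taken k at a time, with replacement
    return comb(n + k - 1, k)

def index_to_combination(idx, n, k):
    # decode a lexicographic index into a non-decreasing k-tuple of item indices
    result = []
    base, items_left, choices_left = 0, n, k
    for _ in range(k):
        d = 0  # offset of this digit above the smallest still-available item
        block = count_comb_with_repl(items_left - d, choices_left - 1)
        while idx >= block:            # skip whole blocks until idx falls inside one
            idx -= block
            d += 1
            block = count_comb_with_repl(items_left - d, choices_left - 1)
        result.append(base + d)
        base, items_left, choices_left = base + d, items_left - d, choices_left - 1
    return tuple(result)

n, k = 3, 3
expected = list(combinations_with_replacement(range(n), k))
decoded = [index_to_combination(i, n, k) for i in range(count_comb_with_repl(n, k))]
assert decoded == expected  # reproduces the 10 rows of the table above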
Here is Python code that implements the above method. As I mentioned, for a test case of n=260 and k=4 this runs in less than 3 seconds on my V100.
$ cat t2.py
from numba import cuda, jit
import numpy as np

@cuda.jit(device=True)
def get_next_count_comb_with_repl(n, k, prev):
    return int(round((prev*(n))/(n+k)))

@cuda.jit(device=True)
def count_comb_with_repl(n, k):
    mymax = max(n-1, k)
    ans = 1.0
    cnt = 1
    for i in range(mymax+1, n+k):
        ans = ans*i/cnt
        cnt += 1
    return int(round(ans))

# intended to be identical to the previous function
# I just need a version I can call from host code
def host_count_comb_with_repl(n, k):
    mymax = max(n-1, k)
    ans = 1.0
    cnt = 1
    for i in range(mymax+1, n+k):
        ans = ans*i/cnt
        cnt += 1
    return int(round(ans))

@cuda.jit(device=True)
def find_first_digit(n, k, i):
    psum = 0
    count = count_comb_with_repl(n, k-1)
    if (i-psum) < count:
        return 0, psum
    psum += count
    for j in range(1, n):
        count = get_next_count_comb_with_repl(n-j, k-1, count)
        if (i-psum) < count:
            return j, psum
        psum += count
    return -1, 0  # error

@cuda.jit
def kernel_count_comb_with_repl(n, k, l, r):
    for i in range(cuda.grid(1), l, cuda.gridsize(1)):
        new_ll = n
        new_cc = k
        new_i = i
        new_digit = 0
        for j in range(k):
            digit, psum = find_first_digit(new_ll, new_cc, new_i)
            new_digit += digit
            new_ll -= digit
            new_cc -= 1
            new_i -= psum
            r[i+j*l] = new_digit

combo_count = 4
ll = 260
cl = host_count_comb_with_repl(ll, combo_count)
print(cl)
# bug if cl > 2G
if cl < 2**31:
    my_dtype = np.uint8
    if ll > 255:
        my_dtype = np.uint16
    r = np.empty(cl*combo_count, dtype=my_dtype)
    d_r = cuda.device_array_like(r)
    block = 256
    grid = (cl//block)+1
    #grid = 640
    kernel_count_comb_with_repl[grid, block](ll, combo_count, cl, d_r)
    r = d_r.copy_to_host()
    print(r.reshape(combo_count, cl))
$ time python t2.py
194831715
[[ 0 0 0 ... 258 258 259]
[ 0 0 0 ... 258 259 259]
[ 0 0 0 ... 259 259 259]
[ 0 1 2 ... 259 259 259]]
real 0m2.212s
user 0m1.110s
sys 0m1.077s
$
(The above test case, n = 260, k = 4, takes ~30s on my system using the OP's code.)
This should be considered to be a sketch of an idea. I make no claims that it is defect free. This type of problem can quickly exhaust the memory on a GPU (for large enough choices of n and/or k), and your only indication of that would probably be a crude out of memory error from numba.
Yes, the above code does not produce concatenations of aa through jz, but that is just an indexing exercise on the result. You would use the result indices to index into your array of items, as needed to convert a result like 0,0,0,1 into a result like aa,aa,aa,ab.
This isn't a performance win across the board. The python method is still faster for smaller test cases, and larger test cases (e.g. n = 260, k = 5) will exceed available memory on the GPU.
I implemented the shuffling algorithm as:
import random

a = list(range(1, n+1))  # a contains the elements 1 to n
for i in range(n):
    j = random.randint(0, n-1)
    a[i], a[j] = a[j], a[i]
This algorithm is biased. I just wanted to know: for any n (n ≤ 17), is it possible to find which permutation has the highest probability of occurring and which permutation has the lowest probability out of all possible n! permutations? If yes, what are those permutations?
For example n=3:
a = [1,2,3]
There are 3^3 = 27 possible shuffles.
Number of occurrences of each permutation:
1 2 3 = 4
3 1 2 = 4
3 2 1 = 4
1 3 2 = 5
2 1 3 = 5
2 3 1 = 5
P.S. I am not so good with maths.
This is not a proof by any means, but you can quickly come up with the distribution of placement probabilities by running the biased algorithm a million times. It will look like the naive-shuffle bias matrix pictured on Wikipedia's Fisher-Yates shuffle page (the example there uses 7 elements).
An unbiased distribution would have 14.3% in every field.
To get the most likely distribution, I think it's safe to just pick the highest percentage for each index. This means it's most likely that the entire array is moved down by one and the first element will become the last.
Edit: I ran some simulations and this result is most likely wrong. I'll leave this answer up until I can come up with something better.
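For small n you don't even need simulation: since each of the n draws has n equally likely outcomes, you can enumerate all n^n draw sequences exactly. A hedged sketch (only feasible for tiny n, but it reproduces the n = 3 counts from the question):

from itertools import product
from collections import Counter

def biased_shuffle_counts(n):
    # tally which permutation results from every possible sequence of randint(0, n-1) draws
    counts = Counter()
    for draws in product(range(n), repeat=n):
        a = list(range(1, n + 1))
        for i, j in zip(range(n), draws):
            a[i], a[j] = a[j], a[i]
        counts[tuple(a)] += 1
    return counts

counts = biased_shuffle_counts(3)
for perm in sorted(counts):
    print(perm, counts[perm])           # (1,2,3): 4 ... (2,3,1): 5, matching the question
print(max(counts, key=counts.get))      # one of the most likely permutations (ties possible)
print(min(counts, key=counts.get))      # one of the least likely permutations (ties possible)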
I decided to learn simulated annealing as a new method to attack this problem with. It essentially asks how to fill a grid with -1, 0, or 1 so that each row and column sum is unique. As a test case, I used a 6x6 grid, for which there is definitely an optimal solution given by Neil:
 1  1  1  1  1  1    6
 1  1  1  1  1 -1    4
 1  1  1  1 -1 -1    2
 1  1  0 -1 -1 -1   -1
 1  0 -1 -1 -1 -1   -3
 0 -1 -1 -1 -1 -1   -5
 5  3  1  0 -2 -4
My code fails to reach the optimal case in the majority of runs, and it even reports the wrong grid cost at the end (old_cost should match count_conflict(grid)). Are my parameters set incorrectly, have I implemented it incorrectly, or is simulated annealing perhaps not a viable method here?
import random
from math import exp

G_SIZE = 6
grid = [[1]*G_SIZE for i in range(G_SIZE)]

def count_conflict(grid):
    cnt = [0]*(2*G_SIZE+1)
    conflicts = 0
    for row in grid:
        cnt[sum(row)] += 1
    for col in zip(*grid):
        cnt[sum(col)] += 1
    #print(cnt)
    for c in cnt:
        if c == 0: conflicts += 1
        if c > 1: conflicts += c-1
    return conflicts

def neighbor(grid):
    new_grid = grid[:]
    i = random.choice(range(G_SIZE))
    j = random.choice(range(G_SIZE))
    new_cells = [-1, 0, 1]
    new_cells.remove(new_grid[i][j])
    new_grid[i][j] = random.choice(new_cells)
    return new_grid

def acceptance_probability(old_cost, new_cost, T):
    if new_cost < old_cost: return 1.0
    return exp(-(new_cost - old_cost) / T)

# Initial guess
for i in range(1, G_SIZE):
    for j in range(0, i):
        grid[i][j] = -1
#print(grid)

old_cost = count_conflict(grid)
T = 10.0
T_min = 0.1
alpha = 0.99
while T > T_min:
    for i in range(1000):
        new_sol = neighbor(grid)
        new_cost = count_conflict(new_sol)
        ap = acceptance_probability(old_cost, new_cost, T)
        print(old_cost, new_cost, ap, T)
        if ap > random.random():
            grid = new_sol
            old_cost = new_cost
    T *= alpha

for row in grid:
    print(row)
print(count_conflict(grid))
A few things to do first, which might quickly lead you to a working solution without having to do anything else (e.g., swap the heuristic):

- Add a line near the top, outside of your main iterative loop, to calculate the cost of your t0 state (i.e., your starting configuration).

- Inside the main loop, insert a single print statement just after the line that calculates the cost for the current iteration, which writes to a file the value returned by the cost function for that iteration; just below that, add a line that prints that value every 20 iterations or something like that (e.g., about once each second is about as fast as we can comprehend scrolling data):

  if n % 10 == 0: print(what_cost_fn_returned_this_iteration)

- Don't call acceptance_probability; there is no natural convergence criterion in combinatorial optimization problems. The usual practice is to break out of the main loop when any of these happen:
  - the max iteration count has been reached;
  - the current minimum value of the cost function over the past __ iterations has changed less than __%; for instance, if over the last 100 iterations the cost (comparing a min and max using a moving window) varies less than 1%;
  - after reaching a minimum during iteration, the cost is now consistently increasing with iteration count.

(A sketch of a main loop with these diagnostics and stopping rules follows the observations below.)

A few other observations:

- With the diagnostics in place (see above) you will be able to determine: from some initial cost, what is my solver doing? Is it moving in a more-or-less direct path to lower and lower values? Is it oscillating? Is it increasing? (If the latter, the fix is usually that you have a sign backwards.)

- A 6 x 6 matrix is very, very small; that doesn't leave a lot for the cost function to work with.

- Re-write your cost function so that a "perfect" solution returns a zero cost, and all others have a higher value.

- 1000 iterations is not a lot; try increasing that to 50,000.
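A rough sketch of what such a main loop might look like (illustrative only: it keeps the OP's count_conflict and neighbor functions, assumes neighbor returns an independent copy of the grid — see the shallow-copy point in the next answer — and the window and iteration counts are arbitrary choices):

max_iter = 50000
window = 100                       # moving window for the "cost stopped changing" test
costs = []
best = grid                        # t0 state
best_cost = count_conflict(best)   # cost of the starting configuration

for n in range(max_iter):
    candidate = neighbor(best)
    cost = count_conflict(candidate)
    costs.append(cost)
    if cost <= best_cost:          # simple greedy acceptance instead of acceptance_probability
        best, best_cost = candidate, cost
    if n % 10 == 0:
        print(n, cost, best_cost)  # watch whether the solver is descending, oscillating, or rising
    if n >= window and max(costs[-window:]) == min(costs[-window:]):
        break                      # cost has not changed at all over the last `window` iterations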
new_grid = grid[:] makes a shallow copy: the inner row lists are shared, so modifying new_grid[i][j] also modifies grid. A deep copy, or modifying the grid in place and reverting to the original, solves the issue.
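A minimal sketch of the copy-based fix (copying each row is enough here, since the cells are plain integers):

def neighbor(grid):
    new_grid = [row[:] for row in grid]   # copy every row so the candidate is independent
    i = random.randrange(G_SIZE)
    j = random.randrange(G_SIZE)
    new_cells = [-1, 0, 1]
    new_cells.remove(new_grid[i][j])
    new_grid[i][j] = random.choice(new_cells)
    return new_grid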
So I was attacking a Project Euler problem that seemed pretty simple on a small scale, but as soon as I bump it up to the number I'm supposed to use, the code takes forever to run. This is the question:
The sum of the primes below 10 is 2 + 3 + 5 + 7 = 17.
Find the sum of all the primes below two million.
I did it in Python. I could wait a few hours for the code to run, but I'd rather find a more efficient way to go about this. Here's my code in Python:
x = 1
total = 0
while x <= 2000000:
    y = 1
    z = 0
    while x >= y:
        if x % y == 0:
            z += 1
        y += 1
    if z == 2:
        total += x
    x += 1
print total
Like mentioned in the comments, implementing the Sieve of Eratosthenes would be a far better choice. It takes up O(n) extra space, which is an array of length ~2 million in this case. It also runs in O(n log log n), which is astronomically faster than your implementation, which runs in roughly O(n²).
I originally wrote this in JavaScript, so bear with my python:
max = 2000000 # we only need to check the first 2 million numbers
numbers = []
sum = 0
for i in range(2, max): # 0 and 1 are not primes
numbers.append(i) # fill our blank list
for p in range(2, max):
if numbers[p - 2] != -1: # if p (our array stays at 2, not 0) is not -1
# it is prime, so add it to our sum
sum += numbers[p - 2]
# now, we need to mark every multiple of p as composite, starting at 2p
c = 2 * p
while c < max:
# we'll mark composite numbers as -1
numbers[c - 2] = -1
# increment the count to 3p, 4p, 5p, ... np
c += p
print(sum)
The only confusing part here might be why I used numbers[p - 2]. That's because I skipped 0 and 1, meaning 2 is at index 0. In other words, everything's shifted to the side by 2 indices.
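If the index shifting feels error-prone, a variant (a sketch, not the answer's original code) uses a boolean array indexed directly by the number itself, at the cost of two unused slots:

LIMIT = 2000000
is_prime = [True] * LIMIT
is_prime[0] = is_prime[1] = False       # 0 and 1 are not prime

for p in range(2, LIMIT):
    if is_prime[p]:
        for c in range(2 * p, LIMIT, p):  # mark 2p, 3p, ... as composite
            is_prime[c] = False

print(sum(p for p in range(LIMIT) if is_prime[p]))  # sum of the primes below two million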
Clearly the long pole in this tent is computing the list of primes in the first place. For an artificial situation like this you could get someone else's list (say, this one), parse it, and add up the numbers in seconds.
But that's unsporting, in my view. In that case, try the Sieve of Atkin, as noted in this SO answer.
Please consider the below algorithm:
for (j1 = n upto 0)
    for (j2 = n-j1 upto 0)
        for (j3 = n-j1-j2 upto 0)
            .
            .
            for (jmax = n - j1 - j2 - j_(max-1))
            {
                count++;
                product.append(j1 * j2 ... jmax); // just an example
            }
As you can see, some relevant points about the algo snippet above:
I have listed an algorithm with a variable number of for loops.
The result that I calculate in each innermost loop iteration is appended to a list. This list will grow to a length of 'count'.
Is this problem a suitable candidate for recursion? If yes, I am really not sure how to break the problem up. I am trying to code this up in Python, and I do not expect any code from you guys. Just some pointers or examples in the right direction. Thank you.
Here is an initial try for a sample case http://pastebin.com/PiLNTWED
Your algorithm is finding all the m-tuples (m being the max subscript of j from your pseudocode) of non-negative integers that add up to n or less. In Python, the most natural way of expressing that would be with a recursive generator:
def gen_tuples(m, n):
    if m == 0:
        yield ()
    else:
        for x in range(n, -1, -1):
            for sub_result in gen_tuples(m-1, n-x):
                yield (x,) + sub_result
Example output:
>>> for x, y, z in gen_tuples(3, 3):
...     print(x, y, z)
3 0 0
2 1 0
2 0 1
2 0 0
1 2 0
1 1 1
1 1 0
1 0 2
1 0 1
1 0 0
0 3 0
0 2 1
0 2 0
0 1 2
0 1 1
0 1 0
0 0 3
0 0 2
0 0 1
0 0 0
You could also consider using permutations, combinations or product from the itertools module.
If you want all the possible combinations of i, j, k, ... (i.e. nested for loops)
you can use:
from itertools import product

for p in product(range(n), repeat=depth):
    j1, j2, j3, ... = p  # the same as nested for loops
    # do stuff here
But beware, the number of iterations in the loop grows exponentially!
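If the goal is specifically the tuples from the question, one hedged option is to filter the product by its sum, at the cost of iterating over all (n+1)^depth candidates:

from itertools import product

n, depth = 3, 3
# "sum at most n" reading, matching gen_tuples(3, 3) above
at_most = [p for p in product(range(n + 1), repeat=depth) if sum(p) <= n]
print(len(at_most))   # 20
# "sum exactly n" reading, if the innermost variable is meant to take the remainder
exactly = [p for p in product(range(n + 1), repeat=depth) if sum(p) == n]
print(len(exactly))   # 10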
The toy example will translate into a kind of tail recursion, so, personally, I wouldn't expect a recursive version to be more insightful for code review and maintenance.
However, to get acquainted with the principle, try to factor out the invariant parts / common terms from the individual loops and try to identify a pattern (and, ideally, prove it afterwards!). You should then be able to fix a signature for the recursive procedure to be written. Flesh it out with the parts inherent to the loop body/ies (and don't forget the termination condition).
Typically, if you want to transform for loops into recursive calls, you will need to replace the for statements with if statements. For nested loops, you will transform these into function calls.
For practice, start with a dumb translation of the code that works and then attempt to see where you can optimize later.
To give you an idea to try to apply to your situation, I would translate something like this:
results = []
for i in range(n):
    results.append(do_stuff(i, n))
to something like this:
results = []

def loop(n, results, i=0):
    if i >= n:
        return results
    results.append(do_stuff(i, n))
    i += 1
    return loop(n, results, i)
there are different ways to handle returning the results list, but you can adapt to your needs.
As a response to the excellent listing by Blckgnht, consider here the case of n = 2 and max = 3:
def simpletest():
    '''
    I am going to just test the algo listing with the assumption
    degree n = 2
    max = dim(m_p(n-1)) = 3,
    so j1, j2 and up to j3 are required for every entry into m_p(degree 2).
    Let's just print j1, j2, j3 to verify that the function
    works in the other, general version where the number of for loops is not known.
    '''
    n = 2
    count = 0
    for j1 in range(n, -1, -1):
        for j2 in range(n - j1, -1, -1):
            j3 = (n - (j1 + j2))
            count = count + 1
            print 'To calculate m_p(%d)[%d], j1,j2,j3 = ' % (n, count), j1, j2, j3
    assert(count == 6)  # just a checkpoint. See P.169 for a proof
    print 'No. of entries =', count
The output of this code (and it is correct):
In [54]: %run _myCode/Python/invariant_hack.py
To calculate m_p(2)[1], j1,j2,j3 = 2 0 0
To calculate m_p(2)[2], j1,j2,j3 = 1 1 0
To calculate m_p(2)[3], j1,j2,j3 = 1 0 1
To calculate m_p(2)[4], j1,j2,j3 = 0 2 0
To calculate m_p(2)[5], j1,j2,j3 = 0 1 1
To calculate m_p(2)[6], j1,j2,j3 = 0 0 2
No. of entries = 6