Levenshtein distance in python giving only 1 as edit distance - python

I have a Python program that reads two lists (one with errors and one with correct data). Every element in the list with errors needs to be compared with every element in the correct list. After comparing, I get the edit distance for every compared pair. I can then find the smallest edit distance for a given erroneous entry and use it to pick the correct data.
I am trying to use Levenshtein distance to calculate the edit distance, but it returns 1 for every pair, which is wrong.
This suggests that the way I calculate the Levenshtein distance is not correct. I am struggling to find a fix for this. Help!
My code:
import csv
def lev(a, b):
    if not a: return len(b)
    if not b: return len(a)
    return min(lev(a[1:], b[1:])+(a[0] != b[0]), lev(a[1:], b)+1, lev(a, b[1:])+1)
if __name__ == "__main__":
    with open("all_correct_promo.csv","rb") as file1:
        reader1 = csv.reader(file1)
        correctPromoList = list(reader1)
        #print correctPromoList
    with open("all_extracted_promo.csv","rb") as file2:
        reader2 = csv.reader(file2)
        extractedPromoList = list(reader2)
        #print extractedPromoList
    incorrectPromo = []
    count = 0
    for extracted in extractedPromoList:
        if(extracted not in correctPromoList):
            incorrectPromo.append(extracted)
        else:
            count = count + 1
    #print incorrectPromo
    for promos in incorrectPromo:
        for correctPromo in correctPromoList:
            distance = lev(promos,correctPromo)
            print promos, correctPromo , distance

There is a Python package that implements the Levenshtein distance: python-levenshtein
To install it:
pip install python-levenshtein
To use it:
>>> import Levenshtein
>>> string1 = 'dsfjksdjs'
>>> string2 = 'dsfiksjsd'
>>> print Levenshtein.distance(string1, string2)
3

The implementation is correct. I tested this:
def lev(a, b):
    if not a: return len(b)
    if not b: return len(a)
    return min(lev(a[1:], b[1:])+(a[0] != b[0]), lev(a[1:], b)+1, lev(a, b[1:])+1)
print lev('abcde','bc') # prints 3, which is correct
print lev('abc','bc') # prints 1, which is correct
Your problem, judging by your comments, is probably in how you call the function:
a = ['NSP-212690']
b = ['FE SV X']
print lev(a,b) # prints 1, which is incorrect because you are comparing lists, not strings
print lev(a[0],b[0]) # prints 10, which is correct
Each row returned by csv.reader is a list, so lev sees two one-element lists whose single elements differ, and the distance between them is 1.
So, what you can do is extract the first element of each list before the call to lev(a, b):
def lev(a, b):
    if not a: return len(b)
    if not b: return len(a)
    return min(lev(a[1:], b[1:])+(a[0] != b[0]), lev(a[1:], b)+1, lev(a, b[1:])+1)
a = ['NSP-212690']
b = ['FE SV X']
a = a[0] # this is the key part
b = b[0] # and this
print lev(a,b) # prints 10, which is correct
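To make the list-versus-string point concrete for rows read with csv.reader, here is a minimal Python 3 sketch. The helper name closest and the sample rows are made up for illustration, and it assumes each CSV row holds a single promo code in its first column.

```python
def lev(a, b):
    # Same recursive definition as above; it works on any sliceable sequence,
    # which is why passing lists does not raise an error.
    if not a: return len(b)
    if not b: return len(a)
    return min(lev(a[1:], b[1:]) + (a[0] != b[0]),
               lev(a[1:], b) + 1,
               lev(a, b[1:]) + 1)

def closest(row, correct_rows):
    """Return (best_match, distance) for one extracted row (hypothetical helper)."""
    word = row[0]  # csv.reader yields lists, so take the string out first
    return min(((c[0], lev(word, c[0])) for c in correct_rows),
               key=lambda pair: pair[1])

# Made-up rows shaped like the output of list(csv.reader(...))
correct_rows = [["SAVE10"], ["FREE"]]
print(lev(["SAVE1O"], ["SAVE10"]))        # 1: two one-element lists that differ
print(closest(["SAVE1O"], correct_rows))  # ('SAVE10', 1)
print(closest(["FRE"], correct_rows))     # ('FREE', 1)
```

Taking row[0] inside the helper keeps the comparison on strings even though the surrounding data stays in the list-of-rows shape that csv.reader produces.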
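Anyway, I would not recommend that recursive implementation, because its performance is very poor: it recomputes the same subproblems over and over, so the running time grows exponentially with the length of the strings.
I would recommend this implementation instead (source: wikipedia-levenshtein)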
def lev(seq1, seq2):
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    for x in xrange(len(seq1)):
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
    return thisrow[len(seq2) - 1]
or maybe this slightly modified version, which returns early when either sequence is empty:
def lev(seq1, seq2):
    if not seq1: return len(seq2)
    if not seq2: return len(seq1)
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    for x in xrange(len(seq1)):
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
    return thisrow[len(seq2) - 1]
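Note that both iterative listings above are Python 2 code: they rely on xrange and on range(...) + [0], which fails on Python 3 because range no longer returns a list. A minimal Python 3 sketch of the same row-by-row idea, written with two explicit rows instead of the index -1 trick, could look like this (the function name levenshtein is just illustrative):

```python
def levenshtein(s, t):
    """Edit distance between two strings, keeping only two rows in memory."""
    if len(s) < len(t):
        s, t = t, s  # iterate over the longer string, keep the rows short
    previous = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # delete cs
                current[j - 1] + 1,            # insert ct
                previous[j - 1] + (cs != ct),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]

print(levenshtein("NSP-212690", "FE SV X"))  # 10
print(levenshtein("kitten", "sitting"))      # 3
```

It runs in time proportional to len(s) * len(t), so it stays usable on inputs where the recursive version becomes impractically slow.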

Related

how do I identify sequence equation Python

I am able to identify the type of a sequence, but not its formula.
Here is the whole code:
def analyse_sequence_type(y:list[int]):
    if len(y) >= 5:
        res = {"linear":[],"quadratic":[],"exponential":[],"cubic":[]}
        for i in reversed(range(len(y))):
            if i-2>=0 and (y[i] + y[i-2] == 2*y[i-1]): res["linear"].append(True)
            elif i-3>=0 and (y[i] - 2*y[i-1] + y[i-2] == y[i-1] - 2*y[i-2] + y[i-3]): res["quadratic"].append(True)
        for k, v in res.items():
            if v:
                if k == "linear" and len(v)+2 == len(y): return k
                elif k == "quadratic" and len(v)+3 == len(y): return k
        return
    print(f"A relation cannot be made with just {len(y)} values.\nPlease enter a minimum of 5 values!")
    return
I can identify linear and quadratic sequences, but how do I write a function that produces the formula?
So, firstly, we need to create two functions, one for linear and one for quadratic sequences (formulae attached below).
def linear(y):
    """
    Returns the equation as a string in the form
    y = mx + c
    """
    d = y[1]-y[0] # common difference, i.e. the slope m
    c = f"{y[0]-d:+}" # intercept c = f(1) - m, formatted with an explicit sign
    if d == 0: c = y[0] # constant sequence: only the constant remains once "0x " is stripped
    return f"f(x) = {d}x {c} ; f(1) = {y[0]}".replace("0x ","").replace("1x","x").replace(" +0","")
We apply a similar logic for quadratic:
def quadratic(y):
    """
    Returns the equation as a string in the form
    y = ax² + bx + c
    """
    a = logic_round((y[2] - 2*y[1] + y[0])/2) # get a
    b = logic_round(y[1] - y[0] - 3*a) # get b
    c = logic_round(y[0]-a-b) # get c
    return f"f(x) = {a}x² {b:+}x {c:+} ; f(1) = {y[0]}".replace('1x²','x²').replace('1x','x').replace(' +0x','').replace(' +0','')
If a coefficient comes out as a float such as 5.0, the output will contain, for example, 5.0x + 4. To avoid that, try:
def logic_round(num):
    splitted = str(num).split('.') # split at the decimal point
    if len(splitted)>1 and len(set(splitted[-1])) == 1 and splitted[-1].startswith('0'): return int(splitted[0]) # fractional part is all zeros (int.0 or similar)
    elif len(splitted)>1: return float(num) # otherwise keep it as a float
    return int(num)
The above functions work provided that y is a list of sequence values whose domain is [1, ∞), i.e. y[0] is f(1).
Hope this helps :) Also give cubic a try.
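For the cubic case, and for classifying sequences more generally, one common approach is the method of finite differences: a sequence generated by a degree-n polynomial has constant n-th differences. The sketch below only illustrates that idea; the name polynomial_degree is made up here and is not part of the code above.

```python
def polynomial_degree(y):
    """Return the order at which the differences of y become constant,
    or None if they never do within the available values."""
    diffs = list(y)
    degree = 0
    # Keep differencing until all remaining values are equal.
    while len(diffs) > 1 and len(set(diffs)) > 1:
        diffs = [b - a for a, b in zip(diffs, diffs[1:])]
        degree += 1
    # A single leftover value proves nothing, so report "unknown".
    return degree if len(diffs) > 1 else None

print(polynomial_degree([1, 3, 5, 7, 9]))       # 1 (linear)
print(polynomial_degree([1, 4, 9, 16, 25]))     # 2 (quadratic)
print(polynomial_degree([1, 8, 27, 64, 125]))   # 3 (cubic)
print(polynomial_degree([1, 2, 4, 8, 16]))      # None (not polynomial here)
```

With only five values, a cubic is confirmed by just two equal third differences, so more input values give a more trustworthy answer.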

Check the difference between string i and p

How would I go about checking the difference between strings p and i, so that the second line can be made equal to the first line?
t=int(input())
print(t)
for i in range(t):
    print(i)
    i=input()
    p=input()
    print(i,p)
    print('Case #'+(str(i+1))+': ')
    if len(i)==0:
        #print(len(p))
    else:
        #print((len(p)-len(i)))
Help Barbara find out how many extra letters she needs to remove in order to obtain I or if I cannot be obtained from P by removing letters then output IMPOSSIBLE.
input:
2
aaaa
aaaaa
bbbbb
bbbbc
output:
Case #1: 1
Case #2: IMPOSSIBLE
You can use Levenshtein distance to calculate the difference and then decide for yourself what is possible and what is impossible.
You can find more resources on YouTube to understand the concept better, e.g. https://www.youtube.com/watch?v=We3YDTzNXEk
I have provided a version of the code for your convenience as well. Note that it charges 2 for a substitution.
import numpy as np
def calculate_edit_distance(source, target):
    '''Calculate the edit distance from source to target
    [In] source="ab" target="bc"
    [Out] return 2
    '''
    num_row = len(target) + 1
    num_col = len(source) + 1
    distance_table = np.array([[0] * num_col for _ in range(num_row)])
    # column 0: building target[0...i] from an empty source string requires i insertions
    distance_table[:, 0] = [i for i in range(num_row)]
    # row 0: reducing source[0...j] to an empty target string requires j deletions
    distance_table[0] = [i for i in range(num_col)]
    # loop through all the characters and calculate their respective distances
    for i in range(num_row - 1):
        for j in range(num_col - 1):
            insert = distance_table[i + 1, j]
            delete = distance_table[i, j + 1]
            substitute = distance_table[i, j]
            # if target char and source char are the same,
            # just copy the diagonal value
            if target[i] == source[j]:
                distance_table[i + 1, j + 1] = substitute
            else:
                operations = [delete, insert, substitute]
                best_operation = np.argmin(operations)
                if best_operation == 2: # +2 if the operation is to substitute
                    distance_table[i + 1, j + 1] = substitute + 2
                else: # same formula for both delete and insert operation
                    distance_table[i + 1, j + 1] = operations[best_operation] + 1
    return distance_table[num_row - 1, num_col - 1]
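Since the task only allows removing letters from P, a more direct check than a full edit distance may be enough: I can be obtained from P exactly when I is a subsequence of P, and in that case the number of letters to remove is len(P) - len(I). Here is a minimal sketch under that reading of the problem; the function name extra_letters is made up for illustration.

```python
def extra_letters(i_str, p_str):
    """Return how many letters must be removed from p_str to obtain i_str,
    or "IMPOSSIBLE" if i_str is not a subsequence of p_str."""
    remaining = iter(p_str)
    # `ch in remaining` consumes the iterator up to and including the first
    # match, so the characters of i_str must appear in p_str in order.
    if all(ch in remaining for ch in i_str):
        return len(p_str) - len(i_str)
    return "IMPOSSIBLE"

# The two sample cases from the question (first line I, second line P)
print("Case #1:", extra_letters("aaaa", "aaaaa"))   # Case #1: 1
print("Case #2:", extra_letters("bbbbb", "bbbbc"))  # Case #2: IMPOSSIBLE
```

This runs in a single pass over P and needs no distance table.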

Check the elements of the lists

a = ["000000001111111110101010","111111110000111111000011"]
What I need to do is go through my list a:
for item in a:
    for elements in range(len(a[item])):
        if "0" in a or "1" in a:
and randomly change one element of a[item] (a 0 becomes 1, or a 1 becomes 0), just one element. How can I do this?
If all elements were changed, the result should be:
a = ["111111110000000001010101","000000001111000000111100"]
If just one element is changed, the result could be, for example:
a =["000000001111101110101010","111111110000111111001011"]
That is, just randomly pick one 0 or 1 and flip it.
The source code below works for me:
a = ['000000001111111110101010',"111111110000111111000011"]
print(a)
ret = []
for item in a:
    r = ''
    for i in item:
        b = int(i, base=2)
        c = str(int(not b))
        r = r + c
    ret.append(r)
print(ret)
And the output is:
['000000001111111110101010', '111111110000111111000011']
['111111110000000001010101', '000000001111000000111100']
Here are both, flipping all digits and flipping one random digit:
import random
def flip_all(s):
    s = list(s)
    return ''.join([str(1 - int(c)) for c in s])
def flip_one(s):
    s = list(s)
    rand_i = random.randint(0, len(s)-1)
    s[rand_i] = str(1 - int(s[rand_i]))
    return ''.join(s)
a = ["000000001111111110101010","111111110000111111000011"]
print("a: ", a)
print("flip all: ", [flip_all(word) for word in a])
print("flip one: ", [flip_one(word) for word in a])
Output:
a: ['000000001111111110101010', '111111110000111111000011']
flip all: ['111111110000000001010101', '000000001111000000111100']
flip one: ['000000001101111110101010', '111111110000111110000011']

Levenshtein distance in a file

The assignment says:
Modify the above program so that, given the pattern GGCCTTGCCATTGG, it reports for each of the first 10 lines of the previous file:
· The edit distance of the substring of that line that is most similar to the pattern.
· The substrings of that line that are at that minimum edit distance.
The above program is this:
import time
def levenshtein_distance (first, second):
    if len(first) > len(second):
        first, second = second, first
    if len(second) == 0:
        return len(first)
    first_length = len(first) + 1
    second_length = len(second) + 1
    distance_matrix = [[0]*second_length for x in range(first_length)]
    for i in range(first_length): distance_matrix[i][0] = i
    for j in range(second_length): distance_matrix[0][j] = j
    for i in xrange(1, first_length):
        for j in range(1, second_length):
            deletion = distance_matrix[i-1][j] + 1
            insertion = distance_matrix[i][j-1] + 2
            substitution = distance_matrix[i-1][j-1] + 1
            if first[i-1] != second[j-1]:
                substitution += 1
            distance_matrix[i][j] = min(insertion, deletion, substitution)
    return distance_matrix[first_length-1][second_length-1]
def dna(patro):
    t1 = time.clock()
    f = open("HUMAN-DNA.txt")
    text = f.readlines()
    f.close()
    distanciaMin = 100000000
    distanciaPosicion = 0
    distanciaLinea = 0
    distanciaSubstring = ""
    numeroLinea = 0
    for line in text:
        numeroLinea = numeroLinea + 1
        for i in range(len(line)-len(patro)):
            cadena = line[i:i+len(patro)]
            distancia = levenshtein_distance(cadena, patro)
            if distancia < distanciaMin:
                distanciaMin = distancia
                distanciaPosicion = 1
                distanciaLinea = numeroLinea
                distanciaSubstring = cadena
    t2 = time.clock()
Now I call it with the new pattern:
dna("GGCCTTGCCATTGG")
I have the edit distance in distanciaMin, but I'm not sure about the result in distanciaSubstring, which should hold the substrings of that line (the second point of the assignment). My question is: how can I process only the first ten lines of the text?
Part of the file is:
CCCATCTCTTTCTCATTCCTTGGTTGAGAACACGAACTTCAGGACTTGCCTCACACTAGGGCCCATTCTT
TGTTTCCCAGAAAGAAGAGGCTCTCCACACAGAGTCCCATGTACACCAGGCTGTCAACAAACATGAATTG
AATGAAGGAGTGGATGGTTGGGTGGAAGTGATTTAAGAAATCCTAACTGGGGAATTTCACTGGAAACTTA
GGAAATTCAATTTATATAAAGTCTATGAATCGTCCATTTTTGTGTCCGCACATTCAAATGCTGTAGCTAA
TTTCCTGCTAAACAGTAGAAATTCAGTAAGTGTTCATGTTGAAAGGATGAAATTTGAGTGCTCTTGCATC
CTCAAAGAACTCTAGTAAAATAGAAATAAAGCTTTATTTGGAAGATTAAGTCATGAGCATAATTATGAGA
AGGCGGTCATTCTAATAATAGTGTCTTCACAAGTAGATGCTACATGCTGTGTAATATTTTGACTAAAAAA
AGTTCCTCTCAACATTTCTGAAGTGAGATAATGTACAACGATCCATGTTTTTAGCTACCTTGATAAGTTT
AGTGCATCCAGGGCTCCTTTCTTACCTGCTAACCGCCGAGTTTCAAATGCTAAGAAATTCTTCATTTCCT
AACACAAATATTCAATATAATTGCTGGTTGTTTGGGAGAAGAAAAATTTAGAATTCAGAAAGAAATACAG
AATGAAATGTTCTAATCAATCGAAAAAGGATTCTATAGACTTCGACGTTGTCTGGTTTACAAAGCAGTCT
I couldn't understand your full question, but I will try to answer "How can I count the first ten lines in the text?". You can use filehandler.readlines(). It loads the file into memory as a list with one entry per line.
Then you can read 10 lines at a time from the list. You can try something like this:
>>> a = [0,1,2,3,4,5,6,7,8,9] # stands in for the list of lines read from the file
>>> def line(a, jump=2): # use jump = 10 for your requirement
...     lines = len(a)
...     i = 0
...     while i < lines: # stop once every element has been yielded
...         yield a[i:i+jump]
...         i += jump
...
>>> foo = line(a)
>>> foo.next()
[0, 1]
>>> foo.next()
[2, 3]
>>> foo.next()
[4, 5]
For your code it would be:
foo = line(text, 10)
foo.next() # returns the next 10 lines on each call; the first call gives lines 1 to 10
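If only the first ten lines are needed, itertools.islice can stop reading after them without loading the whole file. The following Python 3 sketch combines that with the per-line report the assignment asks for. It is an illustration under assumptions: it uses a standard unit-cost Levenshtein distance rather than the weighted costs in the question's function, it only considers windows of the same length as the pattern (as the question's loop does), and the names best_substrings and report are made up here.

```python
import io
from itertools import islice

def levenshtein(s, t):
    """Standard unit-cost edit distance, two rows at a time."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (cs != ct)))
        previous = current
    return previous[-1]

def best_substrings(line, pattern):
    """Return (min_distance, substrings) over all windows of len(pattern)."""
    n = len(pattern)
    windows = [line[k:k + n] for k in range(len(line) - n + 1)]
    if not windows:  # line shorter than the pattern
        return None, []
    scored = [(levenshtein(w, pattern), w) for w in windows]
    best = min(d for d, _ in scored)
    return best, [w for d, w in scored if d == best]

def report(handle, pattern, limit=10):
    # islice stops after `limit` lines, however long the file is
    for number, line in enumerate(islice(handle, limit), start=1):
        distance, substrings = best_substrings(line.strip(), pattern)
        print(number, distance, substrings)

# io.StringIO stands in for open("HUMAN-DNA.txt") so the sketch is runnable
sample = io.StringIO("AAGGCCTT\nTTTTTTTT\nGGCATTGG\n")
report(sample, "GGCC", limit=2)
```

With the real file, the call would be report(open("HUMAN-DNA.txt"), "GGCCTTGCCATTGG"), which prints one line of results for each of the first ten lines.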

What's wrong with my Extended Euclidean Algorithm (python)?

My algorithm to find the HCF of two numbers, with displayed justification in the form r = a*aqr + b*bqr, is only partially working, even though I'm pretty sure that I have entered all the correct formulae. It can and will find the HCF, but I am also trying to provide a demonstration of Bezout's Lemma, so I need the displayed justification to be correct as well. The program:
# twonumbers.py
inp = 0
a = 0
b = 0
mul = 0
s = 1
r = 1
q = 0
res = 0
aqc = 1
bqc = 0
aqd = 0
bqd = 1
aqr = 0
bqr = 0
res = 0
temp = 0
fin_hcf = 0
fin_lcd = 0
seq = []
inp = input('Please enter the first number, "a":\n')
a = inp
inp = input('Please enter the second number, "b":\n')
b = inp
mul = a * b # Will come in handy later!
if a < b:
    print 'As you have entered the first number as smaller than the second, the program will swap a and b before proceeding.'
    temp = a
    a = b
    b = temp
else:
    print 'As the inputted value a is larger than or equal to b, the program has not swapped the values a and b.'
print 'Thank you. The program will now compute the HCF and simultaneously demonstrate Bezout\'s Lemma.'
print `a`+' = ('+`aqc`+' x '+`a`+') + ('+`bqc`+' x '+`b`+').'
print `b`+' = ('+`aqd`+' x '+`a`+') + ('+`bqd`+' x '+`b`+').'
seq.append(a)
seq.append(b)
c = a
d = b
while r != 0:
    if s != 1:
        c = seq[s-1]
        d = seq[s]
    res = divmod(c,d)
    q = res[0]
    r = res[1]
    aqr = aqc - (q * aqd)#These two lines are the main part of the justification
    bqr = bqc - (q * aqd)#-/
    print `r`+' = ('+`aqr`+' x '+`a`+') + ('+`bqr`+' x '+`b`+').'
    aqd = aqr
    bqd = bqr
    aqc = aqd
    bqc = bqd
    s = s + 1
    seq.append(r)
fin_hcf = seq[-2] # Finally, the HCF.
fin_lcd = mul / fin_hcf
print 'Using Euclid\'s Algorithm, we have now found the HCF of '+`a`+' and '+`b`+': it is '+`fin_hcf`+'.'
print 'We can now also find the LCD (LCM) of '+`a`+' and '+`b`+' using the following method:'
print `a`+' x '+`b`+' = '+`mul`+';'
print `mul`+' / '+`fin_hcf`+' (the HCF) = '+`fin_lcd`+'.'
print 'So, to conclude, the HCF of '+`a`+' and '+`b`+' is '+`fin_hcf`+' and the LCD (LCM) of '+`a`+' and '+`b`+' is '+`fin_lcd`+'.'
I would greatly appreciate it if you could help me to find out what is going wrong with this.
Hmm, your program is rather verbose and hence hard to read. For example, you don't need to initialise lots of those variables in the first few lines. There is no need to assign to the inp variable and then copy that into a and then b. And you don't really need the seq list or the s variable at all.
Anyway that's not the problem. There are two bugs. I think that if you had compared the printed intermediate answers to a hand-worked example you should have found the problems.
The first problem is that you have a typo in the second line here:
aqr = aqc - (q * aqd)#These two lines are the main part of the justification
bqr = bqc - (q * aqd)#-/
In the second line, aqd should be bqd.
The second problem is in this bit of code:
aqd = aqr
bqd = bqr
aqc = aqd
bqc = bqd
you make aqd be aqr and then aqc be aqd. So aqc and aqd end up the same. Whereas you actually want the assignments in the other order:
aqc = aqd
bqc = bqd
aqd = aqr
bqd = bqr
Then the code works. But I would prefer to see it written more like this which is I think a lot clearer. I have left out the prints but I'm sure you can add them back:
a = input('Please enter the first number, "a":\n')
b = input('Please enter the second number, "b":\n')
if a < b:
    a,b = b,a
r1,r2 = a,b
s1,s2 = 1,0
t1,t2 = 0,1
while r2 > 0:
    q,r = divmod(r1,r2)
    r1,r2 = r2,r
    s1,s2 = s2,s1 - q * s2
    t1,t2 = t2,t1 - q * t2
print r1,s1,t1
Finally, it might be worth looking at a recursive version which expresses the structure of the solution even more clearly, I think.
Hope this helps.
Here is a simple function that computes Bezout's identity; given a and b, it returns x, y, and g = gcd(a, b) such that a*x + b*y = g:
function bezout(a, b)
    if b == 0
        return 1, 0, a
    else
        q, r := divide(a, b)
        x, y, g := bezout(b, r)
        return y, x - q * y, g
The divide function returns both the quotient and remainder.
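The pseudocode above translates fairly directly to Python 3, with the built-in divmod playing the role of divide. This is a minimal sketch; the example values 240 and 46 are arbitrary.

```python
def bezout(a, b):
    """Return (x, y, g) such that a*x + b*y == g == gcd(a, b)."""
    if b == 0:
        return 1, 0, a
    q, r = divmod(a, b)       # a == q*b + r
    x, y, g = bezout(b, r)    # b*x + r*y == g
    # Substitute r = a - q*b to express g in terms of a and b.
    return y, x - q * y, g

x, y, g = bezout(240, 46)
print(f"{g} = ({x} x 240) + ({y} x 46)")  # 2 = (-9 x 240) + (47 x 46)
```

Each recursive call corresponds to one line of the hand-worked Euclidean algorithm, which is why the coefficients can be rebuilt on the way back up.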
The Python program that does what you want (please note that the extended Euclidean algorithm gives only one pair of Bezout coefficients) might be:
import sys
def egcd(a, b):
    if a == 0:
        return (b, 0, 1)
    g, y, x = egcd(b % a, a)
    return (g, x - (b // a) * y, y)
def main():
    if len(sys.argv) != 3:
        print '''%s program calculates HCF, LCM and Bezout identity of two integers
usage %s a b''' % (sys.argv[0], sys.argv[0])
        sys.exit(1)
    a = int(sys.argv[1])
    b = int(sys.argv[2])
    g, x, y = egcd(a, b)
    print 'HCF =', g
    print 'LCM =', a*b/g
    print 'Bezout identity: %i * (%i) + %i * (%i) = %i' % (a, x, b, y, g)
main()
