Levenshtein distance in a file

The assignment says:
Modify the above program so that, given the pattern GGCCTTGCCATTGG, it reports for each of the first 10 lines of the previous file:
· The edit distance of the substring of that line most similar to the pattern.
· The substrings of that line that are at that minimum edit distance.
The above program is this:
import time
def levenshtein_distance(first, second):
    # make `first` the shorter string
    if len(first) > len(second):
        first, second = second, first
    if len(second) == 0:
        return len(first)
    first_length = len(first) + 1
    second_length = len(second) + 1
    distance_matrix = [[0]*second_length for x in range(first_length)]
    for i in range(first_length):
        distance_matrix[i][0] = i
    for j in range(second_length):
        distance_matrix[0][j] = j
    for i in range(1, first_length):
        for j in range(1, second_length):
            # every edit operation costs 1
            deletion = distance_matrix[i-1][j] + 1
            insertion = distance_matrix[i][j-1] + 1
            substitution = distance_matrix[i-1][j-1]
            if first[i-1] != second[j-1]:
                substitution += 1
            distance_matrix[i][j] = min(insertion, deletion, substitution)
    return distance_matrix[first_length-1][second_length-1]
def dna(patro):
    # time.clock() was removed in Python 3.8; time.time() works everywhere
    t1 = time.time()
    f = open("HUMAN-DNA.txt")
    text = f.readlines()
    f.close()
    distanciaMin = 100000000
    distanciaPosicion = 0
    distanciaLinea = 0
    distanciaSubstring = ""
    numeroLinea = 0
    for line in text:
        numeroLinea = numeroLinea + 1
        for i in range(len(line)-len(patro)):
            cadena = line[i:i+len(patro)]
            distancia = levenshtein_distance(cadena, patro)
            if distancia < distanciaMin:
                distanciaMin = distancia
                distanciaPosicion = i
                distanciaLinea = numeroLinea
                distanciaSubstring = cadena
    t2 = time.time()
Now I call it with the new pattern:
dna("GGCCTTGCCATTGG")
I have the edit distance, which is distanciaMin, but I'm not sure about distanciaSubstring, which should hold the substrings of that line (the second point of the assignment). My question is: how can I restrict this to the first ten lines of the text?
A part of the file is:
CCCATCTCTTTCTCATTCCTTGGTTGAGAACACGAACTTCAGGACTTGCCTCACACTAGGGCCCATTCTT
TGTTTCCCAGAAAGAAGAGGCTCTCCACACAGAGTCCCATGTACACCAGGCTGTCAACAAACATGAATTG
AATGAAGGAGTGGATGGTTGGGTGGAAGTGATTTAAGAAATCCTAACTGGGGAATTTCACTGGAAACTTA
GGAAATTCAATTTATATAAAGTCTATGAATCGTCCATTTTTGTGTCCGCACATTCAAATGCTGTAGCTAA
TTTCCTGCTAAACAGTAGAAATTCAGTAAGTGTTCATGTTGAAAGGATGAAATTTGAGTGCTCTTGCATC
CTCAAAGAACTCTAGTAAAATAGAAATAAAGCTTTATTTGGAAGATTAAGTCATGAGCATAATTATGAGA
AGGCGGTCATTCTAATAATAGTGTCTTCACAAGTAGATGCTACATGCTGTGTAATATTTTGACTAAAAAA
AGTTCCTCTCAACATTTCTGAAGTGAGATAATGTACAACGATCCATGTTTTTAGCTACCTTGATAAGTTT
AGTGCATCCAGGGCTCCTTTCTTACCTGCTAACCGCCGAGTTTCAAATGCTAAGAAATTCTTCATTTCCT
AACACAAATATTCAATATAATTGCTGGTTGTTTGGGAGAAGAAAAATTTAGAATTCAGAAAGAAATACAG
AATGAAATGTTCTAATCAATCGAAAAAGGATTCTATAGACTTCGACGTTGTCTGGTTTACAAAGCAGTCT

I couldn't understand your full question, but I'll try to answer "How can I count the first ten lines in the text?". You can use filehandler.readlines(), which loads the file into memory as a list with one element per line.
Then you can read 10 lines at a time from the list. You can try something like this:
>>> a = [0,1,2,3,4,5,6,7,8,9] # read file as a list of lines (a)
>>> def line(a, jump=2): # use jump = 10 for your requirement.
...     lines = len(a)
...     i = 0
...     while i < lines:
...         yield a[i:i+jump]
...         i += jump
...
>>> foo = line(a)
>>> next(foo)
[0, 1]
>>> next(foo)
[2, 3]
>>> next(foo)
[4, 5]
For your code it will be:
foo = line(text, 10)
next(foo) # returns the next 10 elements on each call
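To cover both points of the assignment, the per-line search can be separated from the file reading. The following is a minimal sketch, not the asker's program: the names levenshtein, best_matches and report_first_lines are made up for illustration, and it assumes (as the question's code does) that only substrings with the same length as the pattern are compared.

```python
from itertools import islice

def levenshtein(a, b):
    """Edit distance with unit cost for insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def best_matches(line, pattern):
    """Return (minimum distance, substrings at that distance) for one line."""
    line = line.strip()
    best, matches = None, []
    for i in range(len(line) - len(pattern) + 1):
        window = line[i:i + len(pattern)]
        d = levenshtein(window, pattern)
        if best is None or d < best:
            best, matches = d, [window]
        elif d == best and window not in matches:
            matches.append(window)
    return best, matches

def report_first_lines(path, pattern, count=10):
    """Print line number, minimum distance and matching substrings."""
    with open(path) as handle:
        # islice stops after `count` lines without reading the whole file
        for number, line in enumerate(islice(handle, count), 1):
            print(number, *best_matches(line, pattern))
```

Calling report_first_lines("HUMAN-DNA.txt", "GGCCTTGCCATTGG") would then print one result per line for the first ten lines only.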

Problem with reading data from file in Python

EDIT:
Thanks for fixing it! Unfortunately, it messed up the logic. I'll explain what this program does. It's a solution to a task about a playing card trick. There are N cards on the table. First and Second are the numbers on the front and back of each card. The trick can only be done if the visible numbers are in non-decreasing order. Someone from the audience can come and swap the places of two cards. M is the number of swaps, and A and B are the positions of the cards to swap. The magician can flip any number of cards to show the other side. The program must tell whether the magician can do the trick.
from collections import namedtuple
Pair = namedtuple("Pair", ["first", "second"])
pairs = []
with open('data.txt', 'r') as data, open('results.txt', 'w') as results:
    n = data.readline()
    n = int(n)
    for _ in range(n):
        first, second = (int(x) for x in data.readline().split(':'))
        first, second = sorted((first, second))
        pairs.append(Pair(first, second)) # add to the list by appending
    m = data.readline()
    m = int(m)
    for _ in range(m):
        a, b = (int(x) for x in data.readline().split('-'))
        a -= 1
        b -= 1
        temp = pairs[a]
        pairs[a] = pairs[b]
        pairs[b] = temp
        p = -1e-9
        ok = True
        for k in range(0, n):
            if pairs[k].first >= p:
                p = pairs[k].first
            elif pairs[k].second >= p:
                p = pairs[k].second
            else:
                ok = False
                break
            if ok:
                results.write("YES\n")
            else:
                results.write("NO\n")
data:
4
2:5
3:4
6:3
2:7
2
3-4
1-3
results:
YES
YES
YES
YES
YES
YES
YES
What should be in results:
NO
YES

The code is full of bugs: you should write and test it incrementally instead of all at once. It seems that you started using readlines (which is a good way of managing this kind of work) but kept the rest of the code in a read-one-line-at-a-time style. If you use readlines, the line for i, line in enumerate(data): should be changed to for i, line in enumerate(lines):.
Anyway, here is a corrected version with some explanation. I hope I did not break the logic.
from collections import namedtuple
Pair = namedtuple("Pair", ["first", "second"])
# The following line created a huge list of "Pairs" types, not instances
# pairs = [Pair] * (2*200*1000+1)
pairs = []
with open('data.txt', 'r') as data, open('results.txt', 'w') as results:
    n = data.readline()
    n = int(n)
    # removing the reading of all data...
    # lines = data.readlines()
    # m = lines[n]
    # removed bad for: for i, line in enumerate(data):
    for _ in range(n): # you don't need the index
        first, second = (int(x) for x in data.readline().split(':'))
        # removed unnecessary recasting to int
        # first = int(first)
        # second = int(second)
        # changed the swapping to a more elegant way
        first, second = sorted((first, second))
        pairs.append(Pair(first, second)) # we add to the list by appending
    # removed unnecessary for: once you read all the first and seconds,
    # you reached M
    m = data.readline()
    m = int(m)
    # you don't need the index... indeed you don't need to count (you can read
    # to the end of file, unless it is malformed)
    for _ in range(m):
        a, b = (int(x) for x in data.readline().split('-'))
        # removed unnecessary recasting to int
        # a = int(a)
        # b = int(b)
        a -= 1
        b -= 1
        temp = pairs[a]
        pairs[a] = pairs[b]
        pairs[b] = temp
        p = -1e-9
        ok = True
        for k in range(0, n):
            if pairs[k].first >= p:
                p = pairs[k].first
            elif pairs[k].second >= p:
                p = pairs[k].second
            else:
                ok = False
                break
        # write one answer per swap, after the whole row has been checked
        if ok:
            results.write("YES\n")
        else:
            results.write("NO\n")
Answer before the edit
range(1, 1) is empty, so this part of the code:
for i in range (1, 1):
    n = data.readline()
    n = int(n)
never runs and does not define n, so when execution gets to line 12 you get an error.
You can remove the for statement, changing those three lines to:
n = data.readline()
n = int(n)

How can I fix this error for popping a word in a list/string? (Python 3.x)

I'm not exactly the kind of guy you call "good" at coding. In this particular scenario, on line 13, I'm trying to pop the first word in the list until I'm done, but it keeps giving me the 'str' object can not be interpreted as an integer issue.
What am I doing wrong here?
n = n.split(" ")
N = n[0]
K = n[1]
f1 = input()
f1 = f1.split(" ")
f1 = list(f1)
current = 0
for x in f1:
    while current <= 7:
        print(x)
        f1 = list(f1.pop()[0])
        current = current + len(x)
        if current > 7:
            print("\n")
            current = 0

According to your comments, this program will split the text into lines containing at most K characters (spaces are not counted):
K = 7
s = "hello my name is Bessie and this is my essay"
out, cnt = [], 0
for word in s.split():
    l = len(word)
    if cnt + l <= K:
        cnt += l
        if not out:
            out.append([word])
        else:
            out[-1].append(word)
    else:
        cnt = l
        out.append([word])
print("\n".join(" ".join(line) for line in out))
Prints:
hello my
name is
Bessie
and this
is my
essay

You could try splitting the string on the index, and inserting a newline there. Each time you do this your string gets one character longer, so we can use enumerate (which starts counting at zero) to add a number to our slice indexes.
s = 'Thanks for helping me'
new_line_index = [7, 11, 19]
for i, x in enumerate(new_line_index):
    s = s[:x+i] + '\n' + s[x+i:]
print(s)
Output
Thanks
for
helping
me
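On the error message itself: list.pop takes an optional integer index, so handing it a word instead of a position raises "'str' object cannot be interpreted as an integer". A minimal sketch of popping words from the front, assuming the same rule as the first answer (at most 7 letters per line, spaces not counted); wrap_words is a made-up name, and collections.deque is used only because popleft() is cheap:

```python
from collections import deque

def wrap_words(text, limit):
    """Group words into lines holding at most `limit` letters (spaces not counted)."""
    words = deque(text.split())
    lines, current, used = [], [], 0
    while words:
        word = words.popleft()  # take the first word; no index argument needed
        if current and used + len(word) > limit:
            lines.append(" ".join(current))
            current, used = [], 0
        current.append(word)
        used += len(word)
    if current:
        lines.append(" ".join(current))
    return lines

print("\n".join(wrap_words("hello my name is Bessie and this is my essay", 7)))
```

With a plain list the equivalent call would be words.pop(0), which removes and returns the first element.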

Print data between positions within a loop

I have one file, File1, which has 3 tab-separated columns.
File1:
2 4 Apple
6 7 Samsung
Say I run a loop of 10 iterations. If the iteration value lies between column 1 and column 2 of a line in File1, print the corresponding 3rd column from File1; otherwise print "0".
The lines may or may not be sorted, but the 2nd column is always greater than the 1st. The ranges of values in the two columns do not overlap between lines.
The output should look like this.
Result:
0
Apple
Apple
Apple
0
Samsung
Samsung
0
0
0
My program in python is here:
chr5_1 = [[]]
for line in file:
    line = line.rstrip()
    line = line.split("\t")
    chr5_1.append([line[0],line[1],line[2]])
# Here I store all position information in chr5_1 list in list
chr5_1.pop(0)
for i in range (1,10):
    for listo in chr5_1:
        L1 = " ".join(str(x) for x in listo[:1])
        L2 = " ".join(str(x) for x in listo[1:2])
        L3 = " ".join(str(x) for x in listo[2:3])
        if int(L1) <= i and int(L2) >= i:
            print(L3)
            break
        else:
            print ("0")
            break
I am confused about the loop iteration and where it breaks.

Try this:
chr5_1 = dict()
for line in file:
    line = line.rstrip()
    _from, _to, value = line.split("\t")
    for i in range(int(_from), int(_to) + 1):
        chr5_1[i] = value
# range(1, 11) covers the 10 iterations 1..10
for i in range(1, 11):
    print(chr5_1.get(i, "0"))

I think this is a job for the else clause of a for loop:
position_information = []
with open('file1') as f:
    for line in f:
        position_information.append(line.strip().split('\t'))
for i in range(1, 11):
    for start, through, value in position_information:
        if i >= int(start) and i <= int(through):
            print(value)
            # No need to continue searching for something to print on this line
            break
    else:
        # We never found anything to print on this line, so print 0 instead
        print(0)
This gives the result you're looking for:
0
Apple
Apple
Apple
0
Samsung
Samsung
0
0
0

Setup:
import io
s = '''2 4 Apple
6 7 Samsung'''
# Python 2.x
f = io.BytesIO(s)
# Python 3.x
#f = io.StringIO(s)
If the lines of the file are not sorted by the first column:
import csv
reader = csv.reader(f, delimiter = ' ', skipinitialspace = True)
f = list(reader)
# sort numerically on the first column (the fields are strings)
f.sort(key = lambda row: int(row[0]))
Read each line; do some math to figure out what to print and how many of them to print; print stuff; iterate
def print_stuff(thing, n):
    while n > 0:
        print(thing)
        n -= 1
limit = 10
next_pos = 1 # first position that has not been printed yet
for line in f:
    # if iterating over a file, separate the columns
    begin, end, text = line.strip().split()
    # if iterating over the sorted list of lines
    #begin, end, text = line
    begin, end = map(int, (begin, end))
    # don't exceed the limit
    if begin > limit:
        break
    # how many zeros?
    gap = begin - next_pos
    print_stuff('0', gap)
    # don't exceed the limit
    end = end if end < limit else limit
    # how many words?
    span = (end - begin) + 1
    print_stuff(text, span)
    next_pos = end + 1
# any more zeros?
gap = limit - next_pos + 1
print_stuff('0', gap)

How python [internally] retrieves elements from array and finds minimum

For this question http://www.spoj.com/problems/ACPC10D/ on SPOJ, I wrote a python solution as below:
count = 1
while True:
    no_rows = int(raw_input())
    if no_rows == 0:
        break
    grid = [[None for x in range(3)] for y in range(2)]
    input_arr = map(int, raw_input().split())
    grid[0][0] = 10000000
    grid[0][1] = input_arr[1]
    grid[0][2] = input_arr[1] + input_arr[2]
    r = 1
    for i in range(0, no_rows-1):
        input_arr = map(int, raw_input().split())
        _r = r ^ 1
        grid[r][0] = input_arr[0] + min(grid[_r][0], grid[_r][1])
        grid[r][1] = input_arr[1] + min(min(grid[_r][0], grid[r][0]), min(grid[_r][1], grid[_r][2]))
        grid[r][2] = input_arr[2] + min(min(grid[_r][1], grid[r][1]), grid[_r][2])
        r = _r
    print str(count) + ". " + str(grid[(no_rows -1) & 1][1])
    count += 1
The above code exceeds time limit. However, when I change the line
grid[r][2] = input_arr[2] + min(min(grid[_r][1], grid[r][1]), grid[_r][2])
to
grid[r][2] = input_arr[2] + min(min(grid[_r][1], grid[_r][2]), grid[r][1])
the solution is accepted. The difference is that the first line takes the minimum of grid[_r][1] and grid[r][1] first (i.e. the row numbers are different), while the second takes the minimum of grid[_r][1] and grid[_r][2] first (i.e. the row numbers are the same).
This behaviour is consistent. I want to understand how Python processes those two lines, so that one exceeds the time limit while the other is fine.
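One way to check whether the two orderings really differ in cost is to time them in isolation. The following is a minimal sketch with made-up grid values (the real values come from the judge's input); it only shows what happens on one's own machine and cannot reproduce the judge's environment or explain its verdict.

```python
import timeit

# Hypothetical sample values standing in for two rows of the grid.
grid = [[5, 3, 8], [2, 7, 4]]
r, _r = 1, 0

def variant_a():
    # ordering from the version that exceeded the time limit
    return min(min(grid[_r][1], grid[r][1]), grid[_r][2])

def variant_b():
    # ordering from the accepted version
    return min(min(grid[_r][1], grid[_r][2]), grid[r][1])

# Both orderings compute the minimum of the same three values.
assert variant_a() == variant_b()

for name, func in (("a", variant_a), ("b", variant_b)):
    seconds = timeit.timeit(func, number=200000)
    print(name, round(seconds, 4))
```

If the two timings are within noise of each other, the difference in verdicts is more likely due to run-to-run variation near the limit than to the expression itself, though that remains an assumption without access to the judge.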

Efficient algorithm for counting unique elements in "suffixes" of an array

I was doing 368B on CodeForces with Python 3, which basically asks you to print the numbers of unique elements in a series of "suffixes" of a given array. Here's my solution (with some additional redirection code for testing):
import sys
if __name__ == "__main__":
    f_in = open('b.in', 'r')
    original_stdin = sys.stdin
    sys.stdin = f_in
    n, m = [int(i) for i in sys.stdin.readline().rstrip().split(' ')]
    a = [int(i) for i in sys.stdin.readline().rstrip().split(' ')]
    l = [None] * m
    for i in range(m):
        l[i] = int(sys.stdin.readline().rstrip())
    l_sorted = sorted(l)
    l_order = sorted(range(m), key=lambda k: l[k])
    # the ranks of elements in l
    l_rank = sorted(range(m), key=lambda k: l_order[k])
    # unique_elem[i] = non-duplicated elements between l_sorted[i] and l_sorted[i+1]
    unique_elem = [None] * m
    for i in range(m):
        unique_elem[i] = set(a[(l_sorted[i] - 1): (l_sorted[i + 1] - 1)]) if i < m - 1 else set(a[(l_sorted[i] - 1): n])
    # unique_elem_cumulative[i] = non-duplicated elements between l_sorted[i] and a's end
    unique_elem_cumulative = unique_elem[-1]
    # unique_elem_cumulative_count[i] = #unique_elem_cumulative[i]
    unique_elem_cumulative_count = [None] * m
    unique_elem_cumulative_count[-1] = len(unique_elem[-1])
    for i in range(m - 1):
        i_rev = m - i - 2
        unique_elem_cumulative = unique_elem[i_rev] | unique_elem_cumulative
        unique_elem_cumulative_count[i_rev] = len(unique_elem_cumulative)
    with open('b.out', 'w') as f_out:
        for i in range(m):
            idx = l_rank[i]
            f_out.write('%d\n' % unique_elem_cumulative_count[idx])
    sys.stdin = original_stdin
    f_in.close()
The code shows correct results except for the possibly last big test, with n = 81220 and m = 48576 (a simulated input file is here, and an expected output created by a naive solution is here). The time limit is 1 sec, within which I can't solve the problem. So is it possible to solve it within 1 sec with Python 3? Thank you.
UPDATE: an "expected" output file is added, which is created by the following code:
import sys
if __name__ == "__main__":
    f_in = open('b.in', 'r')
    original_stdin = sys.stdin
    sys.stdin = f_in
    n, m = [int(i) for i in sys.stdin.readline().rstrip().split(' ')]
    a = [int(i) for i in sys.stdin.readline().rstrip().split(' ')]
    with open('b_naive.out', 'w') as f_out:
        for i in range(m):
            l_i = int(sys.stdin.readline().rstrip())
            f_out.write('%d\n' % len(set(a[l_i - 1:])))
    sys.stdin = original_stdin
    f_in.close()

You'll be cutting it close, I think. On my admittedly rather old machine, the I/O alone takes 0.9 seconds per run.
An efficient algorithm, I think, will be to iterate backwards through the array, keeping track of which distinct elements you've found. When you find a new element, add its index to a list. This will therefore be a descending sorted list.
Then for each li, the index of li in this list will be the answer.
For the small sample dataset
10 10
1 2 3 4 1 2 3 4 100000 99999
1
2
3
4
5
6
7
8
9
10
The list would contain [10, 9, 8, 7, 6, 5] since, when reading from the right, the first distinct value occurs at index 10, the second at index 9, and so on.
So if li = 5, it is the 6th entry of the generated list, so 6 distinct values are found at indices >= li. The answer is 6.
If li = 8, it is the 3rd entry of the generated list, so 3 distinct values are found at indices >= li. The answer is 3.
It's a little fiddly that the exercise uses 1-based indices while Python counts from 0.
To find this position quickly using existing library functions, I've reversed the list and then used bisect.
import timeit
from bisect import bisect_left
def doit():
    f_in = open('b.in', 'r')
    n, m = [int(i) for i in f_in.readline().rstrip().split(' ')]
    a = [int(i) for i in f_in.readline().rstrip().split(' ')]
    found = {}
    indices = []
    # walk from the last element down to index 0 inclusive
    for i in range(n - 1, -1, -1):
        if not a[i] in found:
            indices.append(i+1)
            found[a[i]] = True
    indices.reverse()
    length = len(indices)
    for i in range(m):
        l = int(f_in.readline().rstrip())
        index = bisect_left(indices, l)
        print(length - index)
    f_in.close()
if __name__ == "__main__":
    print (timeit.timeit('doit()', setup="from bisect import bisect_left;from __main__ import doit", number=10))
On my machine this outputs 12 seconds for 10 runs. Still too slow.
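A variation on the same backwards scan, not taken from the answer above: instead of storing the positions of new values and searching them with bisect, store the running count of distinct values at every position, so each query becomes a single list lookup. This is a minimal sketch with made-up function names; whether it fits the 1-second limit has not been measured here.

```python
def suffix_distinct_counts(a):
    """counts[i] = number of distinct values in a[i:] (0-based)."""
    seen = set()
    counts = [0] * len(a)
    for i in range(len(a) - 1, -1, -1):
        seen.add(a[i])
        counts[i] = len(seen)
    return counts

def answer_queries(a, queries):
    """Queries are 1-based start positions, as in the problem statement."""
    counts = suffix_distinct_counts(a)
    return [counts[q - 1] for q in queries]

sample = [1, 2, 3, 4, 1, 2, 3, 4, 100000, 99999]
# Build the whole output first and print it once instead of once per query.
print("\n".join(map(str, answer_queries(sample, range(1, 11)))))
```

For the small sample this gives 6 for li = 5 and 3 for li = 8, matching the worked example. Since the I/O dominates the run time, collecting the answers and writing them in one call may matter more than the lookup itself.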
