Counting differences between two strings

I'm trying to count the number of differences between two imported strings (seq1 and seq2, import code not listed), but am getting no result when running the program. I want the output to read something like "2 differences." Not sure where I'm going wrong...
def difference (seq1, seq2):
    count = 0
    for i in seq1:
        if seq1[i] != seq2[i]:
            count += 1
        return (count)
    print (count, "differences")

You could do this pretty flatly with a generator expression
count = sum(1 for a, b in zip(seq1, seq2) if a != b)
If the sequences are of a different length, then you may consider the difference in length to be difference in content (I would). In that case, tag on an extra piece to account for it
count = sum(1 for a, b in zip(seq1, seq2) if a != b) + abs(len(seq1) - len(seq2))
Another weirdish way to write that which takes advantage of True being 1 and False being 0 is:
sum(a != b for a, b in zip(seq1, seq2))+ abs(len(seq1) - len(seq2))
zip is a Python builtin that lets you iterate over two sequences at once. It stops at the end of the shorter sequence, observe:
>>> seq1 = 'hi'
>>> seq2 = 'world'
>>> for a, b in zip(seq1, seq2):
...     print('a =', a, '| b =', b)
...
a = h | b = w
a = i | b = o
The sum over the generator expression evaluates much like sum([1, 1, 1]), where each 1 represents a difference between the two sequences. The if a != b filter causes the generator to produce a value only when a and b differ.
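As a minimal sketch, the one-liner above can be wrapped in a function. The name count_differences is an assumption made for illustration, and counting the extra length as differences follows the choice described in this answer.

```python
def count_differences(seq1, seq2):
    """Count differing positions; any extra length counts as differences too."""
    # zip() stops at the shorter sequence, so only overlapping positions are compared.
    mismatches = sum(1 for a, b in zip(seq1, seq2) if a != b)
    # Characters beyond the shorter sequence have no partner, so each counts once.
    return mismatches + abs(len(seq1) - len(seq2))


print(count_differences("hi", "world"), "differences")
```

If a length mismatch should be treated as an error instead, the abs(...) term could be replaced with an explicit length check.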

When you say for i in seq1 you are iterating over the characters, not the indexes. You can use enumerate by saying for i, ch in enumerate(seq1) instead.
Or even better, use the standard function zip to go through both sequences at once.
You also have a problem because you return before you print. Probably your return needs to be moved down and unindented.
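A short sketch of the difference between iterating over characters and iterating with enumerate. The sample strings are made up for illustration, and the sketch assumes both strings have the same length.

```python
seq1 = "GATTACA"
seq2 = "GACTATA"

# Iterating over a string yields its characters, which cannot serve as indexes.
chars = [ch for ch in seq1]

# enumerate() yields (index, character) pairs, so the index can be used on seq2.
mismatch_positions = [i for i, ch in enumerate(seq1) if ch != seq2[i]]

print(chars)
print(mismatch_positions)
```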

There are two mistakes in your script:
"i" should be an integer index, not a character
"return" should be at the function's top level (the same level as print), not inside the "for" loop
Also, try not to use "print" this way inside functions.
Here is a working version:
def difference (seq1, seq2):
    count = 0
    for i in range(len(seq1)):
        if seq1[i] != seq2[i]:
            count += 1
    return (count)

I had to do the same thing you are asking about and came up with a very simple solution. Mine is a little different because I first check which string is longer and assign the strings to the right variables for the comparison later. All done with vanilla Python:
#Declare Variables
a='Here is my first string'
b='Here is my second string'
notTheSame=0
count=0
#Check which string is bigger and put the bigger string in C and smaller string in D
if len(a) >= len(b):
    c=a
    d=b
if len(b) > len(a):
    d=a
    c=b
#While the counter is less than the length of the longest string, compare each letter.
while count < len(c):
    if count == len(d):
        break
    if c[count] != d[count]:
        print(c[count] + " not equal to " + d[count])
        notTheSame = notTheSame + 1
    else:
        print(c[count] + " is equal to " + d[count])
    count=count+1
#the below output is a count of all the differences + the difference between the 2 strings
print("Number of Differences: " + str(len(c)-len(d)+notTheSame))

Correct code would be:
def difference(seq1, seq2):
    count = 0
    for i in range(len(seq1)):
        if seq1[i] != seq2[i]:
            count += 1
    return count
First, the return statement belongs at the end of the function. It should not be inside the for loop, or the loop would run only once.
Second, the for loop wasn't correct because you weren't giving it integers to index with. The fix is to iterate over a range the length of seq1:
for i in range(len(seq1)):

Related

Find Longest Alphabetically Ordered Substring - Efficiently

The goal of a piece of code I wrote is to find the longest alphabetically ordered substring within a string.
"""
Find longest alphabetically ordered substring in string s.
"""
s = 'zabcabcd' # Test string.
alphabetical_str, temp_str = s[0], s[0]
for i in range(len(s) - 1): # Loop through string.
    if s[i] <= s[i + 1]: # Check if next character is alphabetically next.
        temp_str += s[i + 1] # Add character to temporary string.
        if len(temp_str) > len(alphabetical_str): # Check is temporary string is the longest string.
            alphabetical_str = temp_str # Assign longest string.
    else:
        temp_str = s[i + 1] # Assign last checked character to temporary string.
print(alphabetical_str)
I get an output of abcd.
But the instructor says there is a PEP 8 compliant way of writing this code in 7-8 lines, and a more computationally efficient way that takes ~16 lines. Apparently there is also a way of writing it in a single line of only 75 characters!
Can anyone provide some insight on what the code would look like if it was 7-8 lines or what the most work appropriate way of writing this code would be? Also any PEP 8 compliance critique would be appreciated.

Linear time:
s = 'zabcabcd'
longest = current = []
for c in s:
    if [c] < current[-1:]:
        current = []
    current += c
    longest = max(longest, current, key=len)
print(''.join(longest))
The PEP 8 issues I see in your code:
"Limit all lines to a maximum of 79 characters." You have two lines longer than that.
"do not rely on CPython’s efficient implementation of in-place string concatenation for statements in the form a += b [...] the ''.join() form should be used instead". You do that repeated string concatenation.
Also, yours crashes if the input string is empty.
1 line 72 characters:
s='zabcabcd';print(max([t:='']+[t:=t*(c>=t[-1:])+c for c in s],key=len))
Optimized linear time (I might add benchmarks tomorrow):
def Kelly_fast(s):
    maxstart = maxlength = start = length = 0
    prev = ''
    for c in s:
        if c >= prev:
            length += 1
        else:
            if length > maxlength:
                maxstart = start
                maxlength = length
            start += length
            length = 1
        prev = c
    if length > maxlength:
        maxstart = start
        maxlength = length
    return s[maxstart : maxstart+maxlength]

Depending on how you choose to count, this is only 6-7 lines and PEP 8 compliant:
def longest_alphabetical_substring(s):
    sub = ''
    for i in range(len(s)):
        j = i + len(sub) + 1
        while list(s[i:j]) == sorted(s[i:j]) and j <= len(s):
            sub, j = s[i:j], j+1
    return sub
print(longest_alphabetical_substring('zabcabcd'))
Your own code was PEP 8 compliant as far as I can tell, although it would make sense to capture code like this in a function, for easy reuse and logical grouping for improved readability.
The solution I provided here is not very efficient, as it keeps extracting copies of the best result so far. A slightly longer solution that avoids this:
def longest_alphabetical_substring(s):
    n = m = 0
    for i in range(len(s)):
        for j in range(i+1, len(s)+1):
            if j == len(s) or s[j] < s[j-1]:
                if j-i > m-n:
                    n, m = i, j
                break
    return s[n:m]
print(longest_alphabetical_substring('zabcabcd'))
There may be more efficient ways of doing this; for example you could detect that there's no need to keep looking because there is not enough room left in the string to find longer strings, and exit the outer loop sooner.
User @kellybundy is correct: a truly efficient solution would be linear in time. Something like:
def las_efficient(s):
    t = s[0]
    return max([(t := c) if c < t[-1] else (t := t + c) for c in s[1:]], key=len)
print(las_efficient('zabcabcd'))
No points for readability here, but PEP 8 otherwise, and very brief.
And for an even more efficient solution:
def las_very_efficient(s):
    m, lm, t, ls = '', 0, s[0], len(s)
    for n, c in enumerate(s[1:]):
        if c < t[-1]:
            t = c
        else:
            t += c
        if len(t) > lm:
            m, lm = t, len(t)
        if n + lm > ls:
            break
    return m

You can keep appending characters from the input string to a candidate list, but clear the list when the current character is lexicographically smaller than the last character in the list, and set the candidate list as the output list if it's longer than the current output list. Join the list into a string for the final output:
s = 'zabcabcdabc'
candidate = longest = []
for c in s:
    if candidate and c < candidate[-1]:
        candidate = []
    candidate.append(c)
    if len(candidate) > len(longest):
        longest = candidate
print(''.join(longest))
This outputs:
abcd
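The same single-pass idea can also be written with indexes instead of a candidate list, so that only one slice is taken at the end. This is a sketch under stated assumptions: the function name longest_ordered_substring is made up for illustration, ties are resolved in favour of the earliest run, and an empty input returns an empty string.

```python
def longest_ordered_substring(s):
    """Return the first longest run of non-decreasing characters in s."""
    best_start = best_len = 0
    start = 0
    for i in range(1, len(s) + 1):
        # A run ends at the end of the string or where the order drops.
        if i == len(s) or s[i] < s[i - 1]:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = i
    return s[best_start:best_start + best_len]


print(longest_ordered_substring('zabcabcdabc'))
```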

If, else return else value even when the condition is true, inside a for loop

Here is the function I defined:
def count_longest(field, data):
    l = len(field)
    count = 0
    final = 0
    n = len(data)
    for i in range(n):
        count = 0
        if data[i:i + l] is field:
            while data[i - l: i] == data[i:i + l]:
                count = count + 1
                i = i + 1
        else:
            print("OK")
        if final == 0 or count >= final:
            final = count
    return final
a = input("Enter the field - ")
b = input("Enter the data - ")
print(count_longest(a, b))
It works in some cases and gives incorrect output in most cases. I checked by printing the strings being compared, and even when they match, the loop prints "OK", which should only be printed when the condition is not true! I don't get it! Taking the simplest example, if I enter 'as' when prompted for field and 'asdf' when prompted for data, I should get count = 1, since the longest run of the substring 'as' in the string 'asdf' is one. But I still get final as 0 at the end of the program. I added the else statement just to check whether the condition was being satisfied, but the program printed 'OK', meaning the if condition was not satisfied, even though at the very beginning data[0 : 0 + 2] is equal to 'as', 2 being the length of the field.

There are a few things I notice when looking at your code.
First, use == rather than is to test for equality. The is operator checks if the left and right are referring to the very same object, whereas you want to properly compare them.
The following code shows that even numerical results that are equal might not be one and the same Python object:
print(2 ** 31 is 2 ** 30 + 2 ** 30) # <- False
print(2 ** 31 == 2 ** 30 + 2 ** 30) # <- True
(note: the first expression could either be False or True—depending on your Python interpreter).
Second, the while-loop looks rather suspicious. If you know you have found your sequence "as" at position i, you are repeating the while-loop as long as it is the same as in position i-1—which is probably something else, though. So, a better way to do the while-loop might be like so:
while data[i: i + l] == field:
    count = count + 1
    i = i + l # <- increase by l (length of field) !
Finally, something that might be surprising: changing the variable i inside the while-loop has no lasting effect on the for-loop. That is, in the following example, the output will still be 0, 1, 2, 3, ..., 9, although it looks like it should skip every other element.
for i in range(10):
    print(i)
    i += 1
It does not affect the outcome of the function, but when debugging you might observe that the function seems to go backward after having found a run and go through parts of it again, resulting in additional "OK"s printed out.
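A small sketch of that behaviour, with a while loop shown as one way to actually skip elements. The list names visited and skipped are made up for illustration.

```python
visited = []
for i in range(10):
    visited.append(i)
    i += 1  # discarded: the for-loop rebinds i to the next value from range()

# A while loop keeps whatever value the body assigns to the index.
skipped = []
i = 0
while i < 10:
    skipped.append(i)
    i += 2

print(visited)
print(skipped)
```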
UPDATE: Here is the complete function according to my remarks above:
def count_longest(field, data):
    l = len(field)
    count = 0
    final = 0
    n = len(data)
    for i in range(n):
        count = 0
        while data[i: i + l] == field:
            count = count + 1
            i = i + l
        if count >= final:
            final = count
    return final
Note that I made two additional simplifications. With my changes, you end up with an if and while that share the same condition, i.e.:
if data[i:i+l] == field:
    while data[i:i+l] == field:
        ...
In that case, the if is superfluous since it is already included in the condition of while.
Secondly, the condition if final == 0 or count >= final: can be simplified to just if count >= final:.

Why is the code not working for strings with same length?

The following is code I have written that counts the number of substrings of length 2 that are common to both input strings. The substrings must also be at the same position in both strings.
def string_match(a, b):
    count=0
    shorter=min(len(a),len(b))
    for i in range(shorter):
        if(a[i:i+2]==b[i:i+2]):
            count=count+1
        else:
            continue
    return count
The code runs fine for strings of different lengths but gives the wrong answer for strings of the same length. For example, 'abc' and 'abc' should return 2 but it returns 3, and 'abc' and 'axc' should return 0 but it returns 1.
The above problem can be solved by changing range(shorter) to range(shorter-1), but I am not understanding why?
Also if possible suggest me changes in the above code that can count same substrings regardless of the positions in the two strings.
Thank you in advance!

Some good old print debugging should make things clearer:
#!/usr/bin/env python2
#coding=utf8
def string_match(a, b):
    count=0
    shorter=min(len(a),len(b))
    print 'comparing', a, b
    for i in range(shorter):
        x = a[i:i+2]
        y = b[i:i+2]
        print 'checking substrings at %d: ' % i, x, y
        if x == y:
            count=count+1
        else:
            continue
    return count
for a, b in (('abc', 'abc'), ('abc', 'axc')):
    count = string_match(a,b)
    print a, b, count
And the output:
so$ ./test.py
comparing abc abc
checking substrings at 0: ab ab
checking substrings at 1: bc bc
checking substrings at 2: c c
abc abc 3
comparing abc axc
checking substrings at 0: ab ax
checking substrings at 1: bc xc
checking substrings at 2: c c
abc axc 1
See the problem? You're always comparing a substring of length 1 at the end. This is because 'abc'[2:4] will give you just 'c'.
So, you'd need to end one step earlier (or, more generally, n-1 steps earlier when you're comparing substrings of length n). This is exactly what your -1 change would do, which is why it helps.
With the -1 change:
#!/usr/bin/env python2
#coding=utf8
def string_match(a, b):
    count=0
    shorter=min(len(a),len(b))
    print 'comparing', a, b
    for i in range(shorter-1):
        x = a[i:i+2]
        y = b[i:i+2]
        print 'checking substrings at %d: ' % i, x, y
        if x == y:
            count=count+1
        else:
            continue
    return count
for a, b in (('abc', 'abc'), ('abc', 'axc')):
    count = string_match(a,b)
    print a, b, count
And the new output:
so$ ./test.py
comparing abc abc
checking substrings at 0: ab ab
checking substrings at 1: bc bc
abc abc 2
comparing abc axc
checking substrings at 0: ab ax
checking substrings at 1: bc xc
abc axc 0
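The general rule mentioned above (stop n-1 steps earlier when comparing substrings of length n) could be sketched as follows in Python 3. The function name count_matching_substrings and the parameter n are assumptions introduced for illustration.

```python
def count_matching_substrings(a, b, n=2):
    """Count positions where a and b hold the same substring of length n."""
    shorter = min(len(a), len(b))
    # The last full window of length n starts at index shorter - n,
    # so the loop runs over range(shorter - n + 1), i.e. n-1 steps fewer.
    return sum(a[i:i + n] == b[i:i + n] for i in range(shorter - n + 1))


print(count_matching_substrings('abc', 'abc'))
print(count_matching_substrings('abc', 'axc'))
```

When either string is shorter than n, the range is empty and the function returns 0.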

Examine your for loop
for i in range(shorter):
    if a[i:i+2]==b[i:i+2]:
        count=count+1
    else:
        continue
range(n) by default goes from 0 to n-1. So what happens in the case of n-1? Your loop is attempting to access the n-1th to n+1th characters. But the smaller string only has n characters. So Python simply returns that letter instead of two letters, and so two strings of equal length with the same last character would give a false positive. This is why range(shorter - 1) is necessary.
Also, the use of continue is redundant, as the loop continues by default anyway.
To find substrings of length 2 anywhere in the strings, this should suffice:
def string_match(string1, string2):
    string1subs = [string1[i:i+2] for i in range(len(string1) - 1)]
    count = 0
    for i in range(len(string2) - 1):
        if string2[i:i+2] in string1subs: count += 1
    return count
This creates a list string1subs that contains all substrings of length 2 in string1. It then loops through all substrings of length 2 in string2 and checks whether each one is a substring of string1. If you prefer a more concise version:
def string_match(string1, string2):
    string1subs = [string1[i:i+2] for i in range(len(string1) - 1)]
    return sum(string2[i:i+2] in string1subs for i in range(len(string2) - 1))
Exact same version using sum and the fact that in Python, True is equal to 1.
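As a possible variation, the list of pairs can be replaced with a set, so each membership test is a hash lookup rather than a scan of the list. This is only a sketch; the name count_shared_pairs is made up for illustration and the counting rule is the same as in the versions above.

```python
def count_shared_pairs(string1, string2):
    """Count length-2 substrings of string2 that occur anywhere in string1."""
    # A set makes each `in` test a hash lookup instead of a list scan.
    pairs1 = {string1[i:i + 2] for i in range(len(string1) - 1)}
    return sum(string2[i:i + 2] in pairs1 for i in range(len(string2) - 1))


print(count_shared_pairs('abc', 'xbcab'))
```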

Best way is not to use any index access at all:
def string_match(a, b):
    count = 0
    equal = False
    for c, d in zip(a,b):
        count += equal and c == d
        equal = c == d
    return count
or with generator expression:
from itertools import islice
def string_match(a, b):
    return sum(a1 == b1 and a2 == b2
               for a1, a2, b1, b2 in zip(a, islice(a,1,None), b, islice(b,1,None)))

Python if statement is only executed once

I am trying to write a simple spell-checking program that takes two strings and adapts the first to the second. If the strings have the same length my code works fine, but if they differ, the problems start. The if statements execute only once and then stop. If I remove the break statements, I get an IndexError: list index out of range.
Here is my code:
#!python
# -*- coding: utf-8 -*-
def edit_operations(first,second):
    a = list(first)
    b = list(second)
    counter = 0
    l_a = len(a)
    l_b = len(b)
    while True:
        if a == b:
            break
        if l_a > l_b:
            if a[counter] != b[counter]:
                a[counter] = ""
                c = "".join(a)
                print "delete", counter+1, b[counter], c
                counter += 1
                l_a -= 1
                break
        if l_a < l_b:
            if a[counter] != b[counter]:
                c = "".join(a)
                c = c[:counter] + b[counter] + c[counter:]
                print "insert", counter+1, b[counter], c
                counter += 1
                l_a += 1
                break
        if a[counter] != b[counter]:
            a[counter] = b[counter]
            c = "".join(a)
            print "replace", counter+1, b[counter], c
            counter += 1
        else:
            counter += 1
if __name__ == "__main__":
    edit_operations("Reperatur","Reparatur")
    edit_operations("Singel","Single")
    edit_operations("Krach","Stall")
    edit_operations("wiederspiegeln","widerspiegeln")
    edit_operations("wiederspiglen","widerspiegeln")
    edit_operations("Babies","Babys")
    edit_operations("Babs","Babys")
    edit_operations("Babeeees","Babys")
This is the output I get:
replace 4 a Reparatur
replace 5 l Singll
replace 6 e Single
replace 1 S Srach
replace 2 t Stach
replace 4 l Stalh
replace 5 l Stall
delete 3 d widerspiegeln
replace 3 d widderspiglen
replace 4 e wideerspiglen
replace 5 r widerrspiglen
replace 6 s widersspiglen
replace 7 p widersppiglen
replace 8 i widerspiiglen
replace 9 e widerspieglen
replace 11 e widerspiegeen
replace 12 l widerspiegeln
delete 4 y Babes
insert 4 y Babys
delete 4 y Babeees
By the last 3 lines you can see my problem and I'm kinda desperate right now.
Hopefully someone could give me a hint what is wrong with it

The answer to the question in the title (the if statement executing only once) is already in a comment to your question: there is a break in each of the two if blocks, if l_a > l_b: and if l_a < l_b:.
In general, a break statement exits the closest enclosing loop, no matter how deeply nested the block containing the break is.
However, other problems do appear in your code:
The size of the list a is kept the same, yet the same counter is used to index the letters of both strings. When the two strings differ in length, this eventually leads to IndexError: list index out of range, because the only condition that exits the loop is the two lists being equal. Also, when l_a > l_b, the character of b that mismatched should next be compared with the character after the deleted one, but this does not happen because the same counter is used.
When l_a < l_b, the list a is not modified; only a new string c is created with the additional letter. Please look at the list documentation.
counter is not updated correctly: when the two strings differ in length, it is incremented only if the letters differ. This leads to an infinite loop.
In general, consider using a debugger in order to figure out the issues (look at the debuggers available in python https://wiki.python.org/moin/PythonDebuggingTools). It is possible to find online or in a bookstore many resources to learn how to debug code.
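The point about break can be seen in a tiny sketch: a break placed inside nested if blocks still ends the surrounding while loop. The variable name counter mirrors the question; the stop value 3 is arbitrary.

```python
counter = 0
while True:
    if counter >= 0:
        if counter == 3:
            break  # exits the while loop, not just the enclosing if blocks
    counter += 1

print(counter)
```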

You should make use of the list.insert() function to insert a character into the list, the del operator to remove a single character from the list, and move the a==b comparison into the while loop conditional. The variable counter should indicate the index of the next character to be compared, and should not be incremented if the characters are not equal. Like this:
#! python3
def edit_operations(first,second):
    a = list(first)
    b = list(second)
    counter = 0
    while a != b:
        if a[counter] != b[counter]:
            if len(a) > len(b):
                print("delete", counter + 1, a[counter])
                del a[counter]
            elif len(b) > len(a):
                print("insert", counter + 1, b[counter])
                a.insert(counter, b[counter])
            else:
                print("replace", counter + 1, b[counter])
                a[counter] = b[counter]
        else:
            counter += 1
    print("".join(a))
if __name__ == "__main__":
    edit_operations("Reperatur","Reparatur")
    edit_operations("Singel","Single")
    edit_operations("Krach","Stall")
    edit_operations("wiederspiegeln","widerspiegeln")
    edit_operations("wiederspiglen","widerspiegeln")
    edit_operations("Babies","Babys")
    edit_operations("Babs","Babys")
    edit_operations("Babeeees","Babys")
I've changed the print statements a little.
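As a different technique from the loop above, the standard library's difflib.SequenceMatcher can also report replace, delete and insert operations between two strings. The sketch below is not the method used in this answer, and the helper names edit_ops and apply_ops are made up for illustration. The operations it reports are not guaranteed to be a minimal edit sequence.

```python
import difflib


def edit_ops(first, second):
    """Return the non-equal opcodes that turn `first` into `second`."""
    matcher = difflib.SequenceMatcher(None, first, second)
    # Each opcode is (tag, i1, i2, j1, j2); tag is one of
    # 'replace', 'delete', 'insert' or 'equal'.
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


def apply_ops(first, second):
    """Rebuild `second` from `first` using the opcodes, as a sanity check."""
    matcher = difflib.SequenceMatcher(None, first, second)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'equal' keeps text from first; the other tags take second[j1:j2],
        # which is empty for 'delete'.
        out.append(first[i1:i2] if tag == "equal" else second[j1:j2])
    return "".join(out)


for tag, i1, i2, j1, j2 in edit_ops("Babies", "Babys"):
    print(tag, repr("Babies"[i1:i2]), "->", repr("Babys"[j1:j2]))
```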

I really don't understand what your question is but if you need a spelling checker just use this library

Finding the length of longest repeating?

I have tried plenty of different methods to achieve this, and I don't know what I'm doing wrong.
reps=[]
len_charac=0
def longest_charac(strng)
    for i in range(len(strng)):
        if strng[i] == strng[i+1]:
            if strng[i] in reps:
                reps.append(strng[i])
                len_charac=len(reps)
    return len_charac

Remember in Python counting loops and indexing strings aren't usually needed. There is also a builtin max function:
def longest(s):
    maximum = count = 0
    current = ''
    for c in s:
        if c == current:
            count += 1
        else:
            count = 1
            current = c
        maximum = max(count,maximum)
    return maximum
Output:
>>> longest('')
0
>>> longest('aab')
2
>>> longest('a')
1
>>> longest('abb')
2
>>> longest('aabccdddeffh')
3
>>> longest('aaabcaaddddefgh')
4

Simple solution:
def longest_substring(strng):
    len_substring=0
    longest=0
    for i in range(len(strng)):
        if i > 0:
            if strng[i] != strng[i-1]:
                len_substring = 0
        len_substring += 1
        if len_substring > longest:
            longest = len_substring
    return longest
Iterates through the characters in the string and checks against the previous one. If they are different then the count of repeating characters is reset to zero, then the count is incremented. If the current count beats the current record (stored in longest) then it becomes the new longest.

Compare two things and there is one relation between them:
'a' == 'a'
    True
Compare three things, and there are two relations:
'a' == 'a' == 'b'
    True   False
Combine these ideas - repeatedly compare things with the things next to them, and the chain gets shorter each time:
'a' == 'a' == 'b'
    True == False
        False
It takes one reduction for the 'b' comparison to be False, because there was one 'b'; two reductions for the 'a' comparison to be False because there were two 'a'. Keep repeating until the relations are all False, and that is how many consecutive equal characters there were.
def f(s):
    repetitions = 0
    while any(s):
        repetitions += 1
        s = [ s[i] and s[i] == s[i+1] for i in range(len(s)-1) ]
    return repetitions
>>> f('aaabcaaddddefgh')
4
NB. matching characters at the start become True, only care about comparing the Trues with anything, and stop when all the Trues are gone and the list is all Falses.
It can also be squished into a recursive version, passing the depth in as an optional parameter:
def f(s, depth=1):
    s = [ s[i] and s[i]==s[i+1] for i in range(len(s)-1) ]
    return f(s, depth+1) if any(s) else depth
>>> f('aaabcaaddddefgh')
4
I stumbled on this while trying for something else, but it's quite pleasing.

You can use itertools.groupby to solve this pretty quickly, it will group characters together, and then you can sort the resulting list by length and get the last entry in the list as follows:
from itertools import groupby
print(sorted([list(g) for k, g in groupby('aaabcaaddddefgh')],key=len)[-1])
This should give you:
['d', 'd', 'd', 'd']
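If only the length is needed, as the question asks, a sketch along the same lines can take the maximum group size directly instead of sorting. The function name longest_run_length is an assumption made for illustration.

```python
from itertools import groupby


def longest_run_length(s):
    """Return the length of the longest run of identical consecutive characters."""
    # groupby() yields one group per run; default=0 covers the empty string.
    return max((sum(1 for _ in group) for _, group in groupby(s)), default=0)


print(longest_run_length('aaabcaaddddefgh'))
```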

This works:
def longestRun(s):
    if len(s) == 0: return 0
    runs = ''.join('*' if x == y else ' ' for x,y in zip(s,s[1:]))
    starStrings = runs.split()
    if len(starStrings) == 0: return 1
    return 1 + max(len(stars) for stars in starStrings)
Output:
>>> longestRun("aaabcaaddddefgh")
4

First off, Python is not my primary language, but I can still try to help.
1) It looks like you are exceeding the bounds of the string. On the last iteration, you compare the last character against the character beyond the last character. In Python this raises an IndexError.
2) You start off with an empty reps list and check whether each character is already in it. That check fails every time, and your append is inside that if statement, so nothing is ever appended.

def longest_charac(string):
    longest = 0
    if string:
        flag = string[0]
        tmp_len = 0
        for item in string:
            if item == flag:
                tmp_len += 1
            else:
                flag = item
                tmp_len = 1
            if tmp_len > longest:
                longest = tmp_len
    return longest
This is my solution. Maybe it will help you.

Just for context, here is a recursive approach that avoids dealing with loops:
def max_rep(prev, text, reps, rep=1):
    """Recursively consume all characters in text and find longest repetition.

    Args
        prev: string of previous character
        text: string of remaining text
        reps: list of ints of all repetitions observed
        rep: int of current repetition observed
    """
    if text == '': return max(reps)
    if prev == text[0]:
        rep += 1
    else:
        rep = 1
    return max_rep(text[0], text[1:], reps + [rep], rep)
Tests:
>>> max_rep('', 'aaabcaaddddefgh', [])
4
>>> max_rep('', 'aaaaaabcaadddddefggghhhhhhh', [])
7
