Python find similar sequences in string - python

I want code that returns the total length of all similar sequences shared by two strings. I wrote the following code, but it only counts one of them:
from difflib import SequenceMatcher
a='Apple Banana'
b='Banana Apple'
def similar(a,b):
    c = SequenceMatcher(None,a.lower(),b.lower()).get_matching_blocks()
    return sum( [c[i].size if c[i].size>1 else 0 for i in range(0,len(c)) ] )
print(similar(a,b))
and the output will be
6
I expect it to be: 11

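get_matching_blocks() returns matching blocks in the order they appear in both strings, so each block must come after the previous one in a and in b. SequenceMatcher first finds the longest contiguous match, which here is 'banana' (length 6), and then only looks for further matches to the left and to the right of it. 'apple' is to the left of 'banana' in the first string but to the right of it in the second, so it can never be matched. Hence it is returning 6.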
Try this instead:
def similar(a,b):
    c = 'something' # Initialize this to anything to make the while loop condition pass for the first time
    sum = 0
    while(len(c) != 1):
        c = SequenceMatcher(lambda x: x == ' ',a.lower(),b.lower()).get_matching_blocks()
        sizes = [i.size for i in c]
        i = sizes.index(max(sizes))
        sum += max(sizes)
        a = a[0:c[i].a] + a[c[i].a + c[i].size:]
        b = b[0:c[i].b] + b[c[i].b + c[i].size:]
    return sum
This "subtracts" the matching part of the strings, and matches them again, until len(c) is 1, which would happen when there are no more matches left.
However, this script doesn't ignore spaces. In order to do that, I used the suggestion from this other SO answer: just preprocess the strings before you pass them to the function like so:
a = 'Apple Banana'.replace(' ', '')
b = 'Banana Apple'.replace(' ', '')
You can include this part inside the function too.
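As a minimal sketch of the same "subtract and match again" idea in Python 3, the version below strips the spaces inside the function and calls find_longest_match() directly. The name similar_total and the min_size cutoff (which mirrors the size > 1 filter in the question) are assumptions for illustration, not part of the answers here.

```python
from difflib import SequenceMatcher

def similar_total(a, b, min_size=2):
    """Sum the lengths of common blocks, removing each block once counted."""
    a = a.replace(' ', '').lower()
    b = b.replace(' ', '').lower()
    total = 0
    while True:
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        m = matcher.find_longest_match(0, len(a), 0, len(b))
        if m.size < min_size:
            break  # nothing long enough is left to match
        total += m.size
        # cut the matched block out of both strings and try again
        a = a[:m.a] + a[m.a + m.size:]
        b = b[:m.b] + b[m.b + m.size:]
    return total

print(similar_total('Apple Banana', 'Banana Apple'))  # 11
```

Removing a block joins the text on either side of it, so a later match can span that seam; whether that is acceptable depends on what "similar" should mean for your data.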

When we edit your code like this, it shows where the 6 is coming from:
from difflib import SequenceMatcher
a='Apple Banana'
b='Banana Apple'
def similar(a,b):
    c = SequenceMatcher(None,a.lower(),b.lower()).get_matching_blocks()
    for block in c:
        print("a[%d] and b[%d] match for %d elements" % block)
similar(a,b)
Output:
a[6] and b[0] match for 6 elements
a[12] and b[12] match for 0 elements

I made a small change to your code and it is working like a charm, thanks @Antimony
def similar(a,b):
    a=a.replace(' ', '')
    b=b.replace(' ', '')
    c = 'something' # Initialize this to anything to make the while loop condition pass for the first time
    sum = 0
    while(len(c) != 1):
        c = SequenceMatcher(lambda x: x == ' ',a.lower(),b.lower()).get_matching_blocks()
        sizes = [i.size for i in c]
        i = sizes.index(max(sizes))
        sum += max(sizes)
        a = a[0:c[i].a] + a[c[i].a + c[i].size:]
        b = b[0:c[i].b] + b[c[i].b + c[i].size:]
    return sum

Related

Longest Common Prefix from list elements in Python

I have a list as below:
strs = ["flowers", "flow", "flight"]
Now, I want to find the longest prefix of the elements from the list. If there is no match then it should return "". I am trying to use the 'Divide and Conquer' rule for solving the problem. Below is my code:
strs = ["flowers", "flow", "flight"]
firstHalf = ""
secondHalf = ""
def longestCommonPrefix(strs) -> str:
    minValue = min(len(i) for i in strs)
    length = len(strs)
    middle_index = length // 2
    firstHalf = strs[:middle_index]
    secondHalf = strs[middle_index:]
    minSecondHalfValue = min(len(i) for i in secondHalf)
    matchingString=[] #Creating a stack to append the matching characters
    for i in range(minSecondHalfValue):
        secondHalf[0][i] == secondHalf[1][i]
    return secondHalf
print(longestCommonPrefix(strs))
I was able to find the middle and divide the list into two parts. Now I am trying to use the second half to get the longest prefix, but I am unable to do so. I have created a stack to which I would add the consecutive matching characters, and then I would compare it with firstHalf. But how can I get the consecutive matching characters from the start?
Expected output:
"fl"
Just a suggestion would also help. I can give it a try.
No matter what, you need to look at each character from each string in turn (until you find a set of corresponding characters that doesn't match), so there's no benefit to splitting the list up. Just iterate through and break when the common prefix stops being common:
def common_prefix(strs) -> str:
    prefix = ""
    for chars in zip(*strs):
        if len(set(chars)) > 1:
            break
        prefix += chars[0]
    return prefix
print(common_prefix(["flowers", "flow", "flight"])) # fl
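For a quick cross-check, the standard library already ships a character-wise version of this: os.path.commonprefix. Despite living in os.path, it compares plain strings character by character rather than path components. The variable names below are only illustrative.

```python
import os.path

words = ["flowers", "flow", "flight"]
prefix = os.path.commonprefix(words)
print(prefix)  # fl

# No shared first character, and an empty list, both give an empty string.
no_match = os.path.commonprefix(["flowers", "xxx"])
empty = os.path.commonprefix([])
```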
Even if this problem has already found its solution, I would like to post my approach (I considered the problem interesting, so started playing around with it).
A divide-and-conquer solution splits one big task into many smaller subtasks, whose results are combined by further small tasks, and so on until you reach the final solution. The typical example is a sum of numbers (say 1 to 8), which can be done sequentially (1 + 2 = 3, then 3 + 3 = 6, then 6 + 4 = 10... until the end) or by splitting the problem (1 + 2 = 3, 3 + 4 = 7, 5 + 6 = 11, 7 + 8 = 15, then 3 + 7 = 10 and 11 + 15 = 26...). The second approach has the clear advantage that it can be parallelized, which can improve run time considerably in the right setup; that is why it generally goes hand in hand with topics like multithreading.
So my approach:
import math
def run(lst):
    if len(lst) > 1:
        lst_split = [lst[2 * (i-1) : min(len(lst) + 1, 2 * i)] for i in range(1, math.ceil(len(lst)/2.0) + 1)]
        lst = [Processor().process(*x) for x in lst_split]
        if any([len(x) == 0 for x in lst]):
            return ''
        return run(lst)
    else:
        return lst[0]
class Processor:
    def process(self, w1, w2 = None):
        if w2 != None:
            zipped = list(zip(w1, w2))
            for i, (x, y) in enumerate(zipped):
                if x != y:
                    return w1[:i]
                if i + 1 == len(zipped):
                    return w1[:i+1]
        else:
            return w1
        return ''
lst = ["flowers", "flow", "flight", "flask", "flock"]
print(run(lst))
OUTPUT
fl
If you look at the run method, the passed lst gets split in couples, which then get processed (this is where you could start multiple threads, but let's not focus on that). The resulting list gets reprocessed until the end.
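The pairing step can also be written with a stride of 2, which may be easier to read than the index arithmetic in lst_split. This is a sketch with a hypothetical helper name, pairs; it should produce the same couples, with a single leftover word when the list has odd length.

```python
def pairs(lst):
    """Split lst into consecutive couples; the last chunk may hold one item."""
    return [lst[i:i + 2] for i in range(0, len(lst), 2)]

words = ["flowers", "flow", "flight", "flask", "flock"]
print(pairs(words))
# [['flowers', 'flow'], ['flight', 'flask'], ['flock']]
```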
An interesting aspect of this problem is: if, after a pass, you get one empty match (two words with no common start), you can stop the reduction, given that you know the solution already! Hence the introduction of
if any([len(x) == 0 for x in lst]):
return ''
I don't think the functools.reduce offers the possibility of stopping the iteration in case a specific condition is met.
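If the early stop matters, a plain loop can do what reduce does and still break out as soon as the prefix becomes empty. The sketch below assumes two hypothetical helpers, pair_prefix and common_prefix_early_exit; it is one way to get the behaviour, not the only one.

```python
def pair_prefix(s1, s2):
    """Return the common prefix of two strings."""
    n = 0
    for x, y in zip(s1, s2):
        if x != y:
            break
        n += 1
    return s1[:n]

def common_prefix_early_exit(words):
    """Fold pair_prefix over words, stopping once nothing is left to match."""
    if not words:
        return ''
    prefix = words[0]
    for word in words[1:]:
        prefix = pair_prefix(prefix, word)
        if not prefix:
            break  # an empty prefix can never grow again
    return prefix

print(common_prefix_early_exit(["flowers", "flow", "flight", "flask", "flock"]))  # fl
```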
Out of curiosity: another solution could take advantage of regex:
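import re
from functools import reduce
# ^ anchors the match at the start so only a true prefix is captured; \w limits this to word characters
pattern = re.compile(r"^(\w+)\w* \1\w*")
def find(x, y):
    v = pattern.findall(f'{x} {y}')
    return v[0] if len(v) else ''
reduce(find, lst)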
OUTPUT
'fl'
Sort of "divide and conquer" :
solve for 2 strings
solve for the other strings
def common_prefix2_(s1: str, s2: str)-> str:
    if not s1 or not s2: return ""
    for i, z in enumerate(zip(s1,s2)):
        if z[0] != z[1]:
            break
    else:
        i += 1
    return s1[:i]
from functools import reduce
def common_prefix(l:list):
    return reduce(common_prefix2_, l[1:], l[0]) if len(l) else ''
Tests
for l in [["flowers", "flow", "flight"],
          ["flowers", "flow", ""],
          ["flowers", "flow"],
          ["flowers", "xxx"],
          ["flowers" ],
          []]:
    print(f"{l if l else '[]'}: '{common_prefix(l)}'")
# output
['flowers', 'flow', 'flight']: 'fl'
['flowers', 'flow', '']: ''
['flowers', 'flow']: 'flow'
['flowers', 'xxx']: ''
['flowers']: 'flowers'
[]: ''

How can you delete similar characters at the same positions in 2 strings

I need to figure out a way to delete common characters from two strings if the common characters are in the same position. This is what I have tried so far; it works for some strings, but as soon as the second string is longer than the first, it stops working. EDIT: I also need a way to store the result in a variable before printing it, as I need to use it in another function.
Example :
ABCDEF and ABLDKG would result in the "ABD" parts of both strings to be deleted, but the rest of the string would remain the same
CEF and LKG would be the output
def compare(input1,input2):
    if len(input1) < len(input2):
        for i in input1:
            posi = int(input1.find(i))
            if input1[num] == input2[num]:
                x = input1.replace(i,"" )
                y = input2.replace(i,"" )
            num = num+1
        print(x)
        print(y)
    else:
        for i in input2:
            num = 0
            posi = int(input2.find(i))
            if input2[num] == input1[num]:
                input1 = input1[0:num] + input1[num+1:(len(input1)+ 1 )] # input1.replace(i,"" )
                input2 = input2[0:num] + input2[num+1:(len(input1) + 1)]
                x = input1
                y = input2
            num = num + 1
        print(str(x))
        print(str(y))
You could use:
from itertools import zip_longest
a,b = "ABCDEF","ABLDKG"
[''.join(k) for k in zip(*[i for i in zip_longest(a, b, fillvalue = "") if i[0]!=i[1]])]
['CEF', 'LKG']
You can wrap this in a function:
def compare(a, b):
    s = zip(*[i for i in zip_longest(a, b, fillvalue = "") if i[0]!=i[1]])
    return [''.join(k) for k in s]
compare("ABCDEF","ABLDKG")
['CEF', 'LKG']
compare('asdfq', 'aqdexyz')
['sfq', 'qexyz']
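One edge case worth knowing: when the two strings are identical, nothing survives the filter, zip(*[]) yields nothing, and the function above returns an empty list rather than two empty strings. Since the question asks for results that can be stored in variables, here is a sketch that always returns a pair; the name compare_pair is an assumption for illustration.

```python
from itertools import zip_longest

def compare_pair(a, b):
    """Drop characters that are equal at the same index; always return two strings."""
    rest_a, rest_b = [], []
    for x, y in zip_longest(a, b, fillvalue=''):
        if x != y:
            rest_a.append(x)
            rest_b.append(y)
    return ''.join(rest_a), ''.join(rest_b)

# The two results can be unpacked into variables and reused elsewhere.
left, right = compare_pair("ABCDEF", "ABLDKG")
print(left, right)  # CEF LKG
```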
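Another approach counts how often each character occurs across both strings and keeps only the characters that occur once:
strlist = ["ABCDEF","ABLDKG"]
char_dict = dict()
for item in strlist:
    for char in item:
        char_dict[char] = char_dict.get(char,0) + 1
new_strlist = []
for item in strlist:
    new_strlist.append(''.join([char for char in item if char_dict[char] < 2]))
Note that this ignores positions: any character that occurs more than once overall is removed, even if the occurrences sit at different indices. It also converts strings that have only duplicates into empty strings rather than removing them altogether.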

Return a new string where every xth character (starting at 0) is now followed by *?

I want to write a function that takes a string, s, and an int, x. It should return a new string where every xth character (starting from zero) is now followed by an '*'.
So far I've tried this code:
def string_chunks(string, x):
    """
    >>> string_chunks("Once upon a time, in a land far, far away", 5)
    'O*nce u*pon a* time*, in *a lan*d far*, far* away*'
    """
    for ch in string:
        return ch + "*"
but I am very stuck and am unable to make it work.
I would appreciate any help. If you provide an answer, it would be nice if you could comment the code also.
Turn it into a list, append a '*' to every xth element, then join it back into a string.
def string_chunks(string, x):
    string = list(string)
    # step through indices 0, x, 2x, ... up to and including the last index
    for i in range(0, len(string), x):
        string[i] += '*'
    return ''.join(string)
Using a new string instead of a list
I thought it could be easier to use a new string (ns) instead of a list to be joined: add each character of the original string s, followed by a '*' whenever its index is a multiple of the interval x (checked with multiple_of_x). To check whether n (the index of the character in s) is a multiple, I used n % x == 0, which is true only for multiples of x (e.g. 0, 5, 10, 15, because 5 % 5 = 0, 15 % 5 = 0, and so on). If n % x is not 0, only the character is added, without the '*'.
def string_chunks(s,x=5):
    ns = ""
    for n,ch in enumerate(s):
        multiple_of_x = (n % x == 0)
        ns += ch + "*" if multiple_of_x else ch
    return ns
text = "Once upon a time, in a land far, far away"
print(string_chunks(text))
Using a list
It can be done this way too.
def string_chunks(s,x=5):
    ns = []
    for n,ch in enumerate(s):
        multiple_of_interval = (n % x == 0)
        ns.append(ch + "*") if multiple_of_interval else ns.append(ch)
    ns = "".join(ns)
    return ns
text = "Once upon a time, in a land far, far away"
print(string_chunks(text))
Output
O*nce u*pon a* time*, in *a lan*d far*, far* away*
Currently you do this:
for ch in string:
    return ch + "*"
This immediately exits the function. Instead, you want to build up the whole string by doing something like this:
chunked_text = chunked_text + ch + "*"
and only after iterating over the whole string you want to return it.
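Putting that advice together, a minimal sketch might look like the following. It assumes the '*' belongs after the characters at indices 0, x, 2x, and so on, which is what the docstring in the question shows; chunked_text is the accumulator name used above.

```python
def string_chunks(string, x):
    """Return string with a '*' added after every xth character, counting from 0."""
    chunked_text = ""
    for index, ch in enumerate(string):
        chunked_text += ch
        if index % x == 0:
            chunked_text += "*"
    # return only after the whole string has been processed
    return chunked_text

print(string_chunks("Once upon a time, in a land far, far away", 5))
```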
Try this:
def string_chunks(string, x):
    """
    >>>
    'O*nce u*pon a* time*, in *a lan*d far*, far* away*'
    """
    count = 0
    newstring = []
    for ch in string:
        count = count + 1
        if count == x:
            newstring.append("*")
            newstring.append(ch)
            count = 0
        else:
            newstring.append(ch)
    return("".join(str(x) for x in newstring))
output_s = string_chunks("Once upon a time, in a land far, far away", 5)
print(output_s)
Output (note that this puts the '*' before every xth character, which is not the placement shown in the docstring):
Once* upon* a ti*me, i*n a l*and f*ar, f*ar aw*ay

Python: how to replace characters from i-th to j-th matches?

For example, if I have:
"+----+----+---+---+--+"
is it possible to replace from second to fourth + to -?
If I have
"+----+----+---+---+--+"
and I want to have
"+-----------------+--+"
I have to replace from 2-nd to 4-th + to -. Is it possible to achieve this by regex? and how?
If you can assume the first character is always a +:
import re
string = '+' + re.sub(r'\+', r'-', string[1:], count=3)
Lop off the first character of your string and sub() the first three + characters, then add the initial + back on.
If you can't assume the first + is the first character of the string, find it first:
prefix = string.index('+') + 1
string = string[:prefix] + re.sub(r'\+', r'-', string[prefix:], count=3)
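To generalize this to arbitrary i and j, re.sub accepts a function as the replacement, so the matches can be counted as they are visited. The sketch below uses a hypothetical helper, replace_matches, with 1-based inclusive bounds first and last; these names are assumptions for illustration.

```python
import re
from itertools import count

def replace_matches(text, pattern, repl, first, last):
    """Replace only the first-th through last-th matches (1-based, inclusive)."""
    counter = count(1)

    def swap(match):
        n = next(counter)  # running number of the current match
        return repl if first <= n <= last else match.group(0)

    return re.sub(pattern, swap, text)

print(replace_matches("+----+----+---+---+--+", r"\+", "-", 2, 4))
```

Matches outside the range are put back unchanged, so the same helper works whether or not the string starts with the character being replaced.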
I would rather iterate over the string, and then replace the pluses according to what I found.
secondIndex = 0
fourthIndex = 0
count = 0
for i, c in enumerate(string):
    if c == '+':
        count += 1
        if count == 2 and secondIndex == 0:
            secondIndex = i
        elif count == 4 and fourthIndex == 0:
            fourthIndex = i
string = string[:secondIndex] + '-'*(fourthIndex-secondIndex+1) + string[fourthIndex+1:]
Test:
+----+----+---+---+--+
+-----------------+--+
I split the string into an array of strings using the character to replace as the separator.
Then rejoin the array, in sections, using the required separators.
example_str="+----+----+---+---+--+"
swap_char="+"
repl_char='-'
ith_match=2
jth_match=4
list_of_strings = example_str.split(swap_char)
new_string = ( swap_char.join(list_of_strings[0:ith_match]) + repl_char +
               repl_char.join(list_of_strings[ith_match:jth_match + 1]) +
               swap_char + swap_char.join(list_of_strings[jth_match + 1:]) )
print (example_str)
print (new_string)
Running it gives:
$ python ./python_example.py
+----+----+---+---+--+
+-----------------+--+
with regex? Yes, that's possible.
^(\+-+){1}((?:\+[^+]+){3})
explanation:
^
(\+-+){1} # read + and some -'s until 2nd +
( # group 2 start
(?:\+[^+]+){3} # read +, followed by non-plus'es, in total 3 times
) # group 2 end
testing:
$ cat test.py
import re
pattern = r"^(\+-+){1}((?:\+[^+]+){3})"
tests = ["+----+----+---+---+--+"]
for test in tests:
    m = re.search(pattern, test)
    if m:
        print (test[0:m.start(2)] +
               "-" * (m.end(2) - m.start(2)) +
               test[m.end(2):])
Adjusting is simple:
^(\+-+){1}((?:\+[^+]+){3})
        ^              ^
the '1' indicates that you're reading up to the 2nd '+'
the '3' indicates that you're reading up to the 4th '+'
These are the only 2 changes you need to make; the group number stays the same.
Run it:
$ python test.py
+-----------------+--+
This is pythonic.
import re
s = "+----+----+---+---+--+"
idx = [ i.start() for i in re.finditer(r'\+', s) ][1:-2]
''.join([ j if i not in idx else '-' for i,j in enumerate(s) ])
However, if your string is constant and you want something simple:
print (s)
print ('+' + re.sub(r'\+---', '----', s)[1:])
Output:
+----+----+---+---+--+
+-----------------+--+
Using only list comprehensions:
s1="+----+----+---+---+--+"
indexes = [i for i,x in enumerate(s1) if x=='+'][1:4]
s2 = ''.join([e if i not in indexes else '-' for i,e in enumerate(s1)])
print(s2)
+-----------------+--+
I saw you already found a solution but I do not like regex so much, so maybe this will help another! :-)

Counting longest occurrence of repeated sequence in Python

What's the easiest way to count the longest consecutive repeat of a certain character in a string? For example, the longest consecutive repeat of "b" in the following string:
my_str = "abcdefgfaabbbffbbbbbbfgbb"
would be 6, since other consecutive repeats are shorter (3 and 2, respectively.) How can I do this in Python?
How about a regex example:
import re
my_str = "abcdefgfaabbbffbbbbbbfgbb"
# note: b+b only matches runs of two or more consecutive b's
len(max(re.compile("(b+b)*").findall(my_str))) #changed the regex from (b+b) to (b+b)*
# max([len(i) for i in re.compile("(b+b)").findall(my_str)]) also works
Edit: mine vs. interjay's:
import timeit
x=timeit.Timer(stmt='import itertools;my_str = "abcdefgfaabbbffbbbbbbfgbb";max(len(list(y)) for (c,y) in itertools.groupby(my_str) if c=="b")')
x.timeit()
22.759046077728271
x=timeit.Timer(stmt='import re;my_str = "abcdefgfaabbbffbbbbbbfgbb";len(max(re.compile("(b+b)").findall(my_str)))')
x.timeit()
8.4770550727844238
Here is a one-liner:
max(len(list(y)) for (c,y) in itertools.groupby(my_str) if c=='b')
Explanation:
itertools.groupby will return groups of consecutive identical characters, along with an iterator for all items in that group. For each such iterator, len(list(y)) will give the number of items in the group. Taking the maximum of that (for the given character) will give the required result.
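Wrapped in a function, the same idea might look like the sketch below. The name longest_run is an assumption for illustration, and default=0 is added so that a character that never occurs gives 0 instead of raising ValueError on an empty sequence.

```python
from itertools import groupby

def longest_run(text, char):
    """Length of the longest consecutive run of char in text (0 if absent)."""
    return max(
        (sum(1 for _ in group) for key, group in groupby(text) if key == char),
        default=0,
    )

my_str = "abcdefgfaabbbffbbbbbbfgbb"
print(longest_run(my_str, "b"))  # 6
```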
Here's my really boring, inefficient, straightforward counting method (interjay's is much better). Note, I wrote this in this little text field, which doesn't have an interpreter, so I haven't tested it, and I may have made a really dumb mistake that a proof-read didn't catch.
my_str = "abcdefgfaabbbffbbbbbbfgbb"
last_char = ""
current_seq_len = 0
max_seq_len = 0
# tracks the longest run of any character, not only "b"
for c in my_str:
    if c == last_char:
        current_seq_len += 1
    else:
        current_seq_len = 1
        last_char = c
    # checked on every character so that runs of length 1 are counted too
    if current_seq_len > max_seq_len:
        max_seq_len = current_seq_len
print(max_seq_len)
Using run-length encoding:
import numpy as NP
signal = NP.array([4,5,6,7,3,4,3,5,5,5,5,3,4,2,8,9,0,1,2,8,8,8,0,9,1,3])
px, = NP.where(NP.ediff1d(signal) != 0)
px = NP.r_[(0, px+1, [len(signal)])]
# collect the run-lengths for each unique item in the signal
rx = [ (m, n, signal[m]) for (m, n) in zip(px[:-1], px[1:]) if (n - m) > 1 ]
# get longest:
rx2 = [ (b-a, c) for (a, b, c) in rx ]
rx2.sort(reverse=True)
# returns: [(4, 5), (3, 8)], ie, '5' occurs 4 times consecutively, '8' occurs 3 times consecutively
Here is my code. It is not that efficient, but it seems to work:
def LongCons(mystring):
    dictionary = {}
    CurrentCount = 0
    latestchar = ''
    for i in mystring:
        if i == latestchar:
            CurrentCount += 1
            if i in dictionary:
                if CurrentCount > dictionary[i]:
                    dictionary[i]=CurrentCount
        else:
            CurrentCount = 1
            # setdefault keeps an earlier, longer run instead of resetting it to 1
            dictionary.setdefault(i, CurrentCount)
            latestchar = i
    k = max(dictionary, key=dictionary.get)
    print(k, dictionary[k])
    return
