How to find the longest intersection between two strings in Python?

I'm trying to write a program that finds the longest intersection between two strings. The conditions are:
If there is no common character, the program returns an empty string.
If there are multiple substrings of common characters with the same length, it should return whichever is the largest; for example, for "bbaacc" and "aabb" the repeated substrings are "aa" and "bb", but since "bb" > "aa" the program must return only "bb".
Finally, the program should return the longest common substring; for instance, for "programme" and "grammaire" the return should be "gramm", not "gramme".
My code has a problem with this last condition. How could I change it so it works as expected?
def intersection(v, w):
    if not v or not w:
        return ""
    x, xs, y, ys = v[0], v[1:], w[0], w[1:]
    if x == y:
        return x + intersection(xs, ys)
    else:
        return max(intersection(v, ys), intersection(xs, w), key=len)
Driver:
print(intersection('programme', 'grammaire'))

Can't find the issue with your code, but I solved it like this:
def longest_str_intersection(a: str, b: str):
    # identify all possible character sequences from str a
    seqs = []
    for pos1 in range(len(a)):
        for pos2 in range(len(a)):
            seqs.append(a[pos1:pos2+1])

    # remove empty sequences
    seqs = [seq for seq in seqs if seq != '']

    # find segments in str b
    max_len_match = 0
    max_match_sequence = ''
    for seq in seqs:
        if seq in b:
            if len(seq) > max_len_match:
                max_len_match = len(seq)
                max_match_sequence = seq

    return max_match_sequence
longest_str_intersection('programme', 'grammaire')
-> 'gramm'
Also interested to see if you found a more elegant solution!
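For what it's worth, here is one possible tweak to the original recursion (a sketch, not the poster's code; the name intersection2 and the common_prefix helper are mine). Instead of gluing matching characters across gaps, which produces a common subsequence such as "gramme", it only extends a contiguous run via a common-prefix check, and the (len(s), s) key also breaks length ties towards the larger string. Like the original, it is exponential without memoization, so it is only a sketch for short inputs:

def intersection2(v, w):
    # best of: the common prefix at the current offsets, or the best result
    # after dropping one character from either string; equal-length ties are
    # broken towards the lexicographically larger string
    def common_prefix(a, b):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return a[:n]

    if not v or not w:
        return ""
    return max(common_prefix(v, w),
               intersection2(v, w[1:]),
               intersection2(v[1:], w),
               key=lambda s: (len(s), s))

print(intersection2('programme', 'grammaire'))  # gramm
print(intersection2('bbaacc', 'aabb'))          # bb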


Longest common substring with rolling hash

I am implementing in Python 3 an algorithm to find the longest common substring of two strings s and t. Given s and t, I need to return (a, b, l), where l is the length of the longest common substring, a is the position in s where it starts, and b is the position in t where it starts. I have a working version of the algorithm, but it is quite slow and I am not sure why; it is frustrating because I have found other Python implementations using pretty much the same logic that are many times faster. I am self-learning, so any help would be greatly appreciated.
The approach is based on comparing hash values rather than directly comparing substrings, and on using binary search to find the maximal length of a common substring. Here is the code for my hash function (m is a big prime and x is just some constant):
def polynomial_hash(my_string, m, x):
    str_len = len(my_string)
    result = 0
    for i in range(str_len):
        result = (result + ord(my_string[i]) * power_mod_p(x, i, m)) % m
    return result
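(power_mod_p is not shown in the question; presumably it is just modular exponentiation, which Python's built-in pow already provides. A minimal assumed stand-in so the snippets here run:)

def power_mod_p(x, i, m):
    # assumed helper, not from the question: x**i mod m
    return pow(x, i, m)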
Given two strings s and t, I first find which string is shorter; without loss of generality, let s be the shorter string. First I need to find the hash values of all length-k substrings of a string. I use the following function, implemented as a generator:
def all_length_k_hashes(my_string, k, m, x):
    current_position = len(my_string) - k
    x_to_the_k = power_mod_p(x, k, m)
    hash_value = polynomial_hash(my_string[current_position:], m, x)
    yield (hash_value, current_position)
    while current_position > 0:
        current_position = current_position - 1
        hash_value = ((hash_value * x) + ord(my_string[current_position]) - x_to_the_k*ord(my_string[current_position + k])) % m
        yield (hash_value, current_position)
This function is simple: its first yield is the hash value of the final length-k substring of the string, and each subsequent iteration yields the hash value of the next length-k substring to its left (we move left by one position; for example, for k=3 in "abcdefghi" we go from "ghi" to "fgh", then from "fgh" to "efg"). This should be able to calculate the hash values of all length-k substrings of my_string in O(|my_string|).
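A quick sanity check of that claim (a small sketch, assuming the power_mod_p stand-in above): every rolling hash yielded by the generator should equal the directly computed polynomial hash of the corresponding window.

m, x, k = 1000000007, 31, 3
s = "abcdefghi"
for hash_value, pos in all_length_k_hashes(s, k, m, x):
    # each rolling hash must match the hash computed from scratch for that window
    assert hash_value == polynomial_hash(s[pos:pos+k], m, x)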
Now, to find out whether s and t have a length-k substring in common, I use the following function:
def common_sub_string_length_k(shorter_str, longer_str, k, m, x):
    short_str_dict = dict()
    for hash_and_index in all_length_k_hashes(shorter_str, k, m, x):
        short_str_dict.update({hash_and_index[0]: hash_and_index[1]})
    hash_generator_longer_str = all_length_k_hashes(longer_str, k, m, x)
    for hash_and_index in hash_generator_longer_str:
        if hash_and_index[0] in short_str_dict:
            return (short_str_dict[hash_and_index[0]], hash_and_index[1])
    return False
What is happening in this function is: I create an empty Python dictionary and fill it with key-value pairs such that each key is the hash value of a length-k substring of the shorter string and its value is that substring's starting index; I call this short_str_dict.
Then, using all_length_k_hashes, I create a generator of hash values of length-k substrings of the longer string and iterate through it, checking whether any hash value is in short_str_dict. If there is one, then the two strings have a substring of length k in common (assuming no hash collisions). This whole process should take O(|shorter_string| + |longer_string|) time.
Finally, the following function repeatedly uses the previous process to find the maximal k, using a binary search technique:
from random import randint  # needed for randint below

def longest_common_substring(str_1, str_2):
    m_1 = 309000599
    m_2 = 988017827
    x = randint(1, 10 ** 6)
    len_str_1 = len(str_1)
    len_str_2 = len(str_2)
    if len_str_1 <= len_str_2:
        short_str = str_1
        long_str = str_2
        switched = False
    else:
        short_str = str_2
        long_str = str_1
        switched = True
    len_short_str = len(short_str)
    len_long_str = len(long_str)
    low = 0
    high = len_short_str
    mid = 0
    longest_so_far = 0
    longest_indices = (0, 0)
    while low <= high:
        mid = (high + low) // 2
        m1_result = common_sub_string_length_k(short_str, long_str, mid, m_1, x)
        m2_result = common_sub_string_length_k(short_str, long_str, mid, m_2, x)
        if m1_result is False or m2_result is False:
            high = mid - 1
        else:
            longest_so_far = mid
            longest_indices = m1_result
            low = mid + 1
    if switched:
        return (longest_indices[1], longest_indices[0], longest_so_far)
    else:
        return (longest_indices[0], longest_indices[1], longest_so_far)
Two different hashes are used to reduce the probability of a collision. So in total, assuming no collisions, this whole process should take O(log|shorter_string| * (|shorter_string| + |longer_string|)).
Have I made any error? Is it slow because of the use of Python dictionaries? I really want to understand my mistake. Any help is greatly appreciated.

Merging overlapping string sequences in a list

I am trying to figure out how to merge overlapping strings in a list together, for example for
['aacc','accb','ccbe']
I would get
['aaccbe']
This following code works for the example above, however it does not provide me with the desired result in the following case:
s = ['TGT','GTT','TTC','TCC','CCC','CCT','CCT','CTG','TGA','GAA','AAG','AGC','GCG','CGT','TGC','GCT','CTC','TCT','CTT','TTT','TTT','TTC','TCA','CAT','ATG','TGG','GGA','GAT','ATC','TCT','CTA','TAT','ATG','TGA','GAT','ATT','TTC']
a = s[0]
b = s[-1]
final_s = a[:a.index(b[0])]+b
print(final_s)
>>>TTC
My output is clearly not right, and I don't know why it doesn't work in this case. Note that I have already organized the list with the overlapping strings next to each other.
You can use a trie to store the running substrings and more efficiently determine overlap. When the possibility of an overlap occurs (i.e. for an input string, there exists a string in the trie with a letter that starts or ends the input string), a breadth-first search finds the largest possible overlap, and then the remaining bits of string are added to the trie:
from collections import deque

# trie node (which stores a single letter) class definition
class Node:
    def __init__(self, e, p = None):
        self.e, self.p, self.c = e, p, []
    def add_s(self, s):
        if s:
            self.c.append(self.__class__(s[0], self).add_s(s[1:]))
        return self

class Trie:
    def __init__(self):
        self.c = []
    def last_node(self, n):
        return n if not n.c else self.last_node(n.c[0])
    def get_s(self, c, ls):
        # for an input string, find a letter in the trie that the string starts or ends with.
        for i in c:
            if i.e in ls:
                yield i
            yield from self.get_s(i.c, ls)
    def add_string(self, s):
        q, d = deque([j for i in self.get_s(self.c, (s[0], s[-1])) for j in [(s, i, 0), (s, i, -1)]]), []
        while q:
            if (w:=q.popleft())[1] is None:
                d.append((w[0] if not w[0] else w[0][1:], w[2], w[-1]))
            elif w[0] and w[1].e == w[0][w[-1]]:
                if not w[-1]:
                    if not w[1].c:
                        d.append((w[0][1:], w[1], w[-1]))
                    else:
                        q.extend([(w[0][1:], i, 0) for i in w[1].c])
                else:
                    q.append((w[0][:-1], w[1].p, w[1], -1))
        if not (d:={a:b for a, *b in d}):
            self.c.append(Node(s[0]).add_s(s[1:]))
        elif (m:=min(d, key=len)):
            if not d[m][-1]:
                d[m][0].add_s(m)
            else:
                t = Node(m[0]).add_s(m)
                d[m][0].p = self.last_node(t)
Putting it all together:
t = Trie()
for i in ['aacc', 'accb', 'ccbe']:
    t.add_string(i)

def overlaps(trie, c = ''):
    if not trie.c:
        yield c + trie.e
    else:
        yield from [j for k in trie.c for j in overlaps(k, c + trie.e)]

r = [j for k in t.c for j in overlaps(k)]
Output:
['aaccbe']
Use difflib's find_longest_match to find the overlap and concatenate appropriately, then use reduce to apply this over the entire list.
import difflib
from functools import reduce

def overlap(s1, s2):
    # https://stackoverflow.com/a/14128905/4001592
    s = difflib.SequenceMatcher(None, s1, s2)
    pos_a, pos_b, size = s.find_longest_match(0, len(s1), 0, len(s2))
    return s1[:pos_a] + s2[pos_b:]

s = ['aacc', 'accb', 'ccbe']
result = reduce(overlap, s, "")
print(result)
Output
aaccbe

Find the minimum number of words (distance) between repeated occurrences of a search string in the input string

Here are test cases for the code:
string - 'Tim had been saying that he had been there'
search - 'had'
expected output - 4
string - 'he got what he got and what he wanted'
search - 'he'
expected output - 2
def return_distance(input, search):
    words = input.split()
    distance = None
    indx = []
    if not input or not search:
        return None
    else:
        if words.count(search) > 1:
            indx = [index for index, word in enumerate(words) if word == search]
            distance = indx[1] - indx[0]
            for i in range(len(indx) - 1):
                distance = min(distance, indx[i+1] - indx[i]) - 1
    return distance
I am thinking how to optimize the code. I admit it is poorly written.
How about
def min_distance_between_words(sentence, word):
    idxes = [i for i, e in enumerate(sentence.split()) if e == word]
    return min([y - x - 1 for x, y in zip(idxes, idxes[1:])])
This splits the input sentence, makes a list of every index that matches the target word, then iterates over this list to compute the differences between each index and returns the minimum difference.
Since the behavior is unspecified when the sentence doesn't contain the word, it raises an error, but you can add a check for this and return the value of your choice using min's default parameter:
def min_distance_between_words(sentence, word):
    idxes = [i for i, e in enumerate(sentence.split()) if e == word]
    return min([y - x - 1 for x, y in zip(idxes, idxes[1:])], default=None)
As an aside, naming a variable input shadows a builtin, and return_distance is a rather ambiguous name for a function.
Adding a precondition check for None parameters, as done with if not input or not search:, is not typically done in Python (we assume the caller will always pass in a string and adhere to the function's contract).
If you want to generalize this further, move the split() duty to the domain of the caller which enables the function to operate on arbitrary iterables:
def min_distance_between_occurrences(it, target):
    idxes = [i for i, e in enumerate(it) if e == target]
    return min([y - x - 1 for x, y in zip(idxes, idxes[1:])], default=None)
Call with:
min_distance_between_occurrences("a b c a".split(), "a")
min_distance_between_occurrences([(1, 2), (1, 3), (1, 2)], (1, 2))
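For reference, these calls return 2 (the words "b c" sit between the two "a"s) and 1 (one element between the two (1, 2) tuples).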
Refactoring aside, as pointed out in the comments, the original code isn't correct. Issues include:
search_str does not exist. You probably meant search.
distance and min_dist don't really work together. Pick one or the other and use it for all minimum calculations.
min(min_dist, indx[i+1] - indx[i])-1 subtracts 1 in the wrong place, throwing off the count.
Here's a potential fix for these issues:
def return_distance(input, search):
    words = input.split()
    distance = None
    if words.count(search) > 1:
        indx = [index for index, word in enumerate(words) if word == search]
        distance = indx[1] - indx[0] - 1
        #                            ^^^^
        for i in range(len(indx) - 1):
            distance = min(distance, indx[i+1] - indx[i] - 1)
            #                                            ^^^^
    return distance
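A quick check against the test cases from the question:

print(return_distance('Tim had been saying that he had been there', 'had'))  # 4
print(return_distance('he got what he got and what he wanted', 'he'))  # 2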
One way is to use min with list comprehension on indx
min_dist = min([(indx[i+1] - indx[i]-1) for i in range(len(indx)-1) ])

Longest common sequence between many sub-sequences

Fancy title :)
I have a file that contains the following:
>sequence_40
ABCDABDCABCDBACDBACDBACDBACDABDCDC
ACDCCDCABDCADCADBCACBDCABD
>sequence_41
DCBACDBACDADCDCDCABCDCACBDCBDACBDC
BCDBABABBABACDCDBCACDBACDBACDBACDC
BCDB
...
Then I have a function that returns a dictionary (called dict) with the sequence names as keys and the strings (combined onto one line) as values. The sequence numbers range from 40 to 59.
I want to take a dictionary of sequences and return the longest common sub-sequence found in ALL the sequences. I managed to find some help here on Stack Overflow and made code that only compares the LAST TWO strings in that dictionary, not all of them :).
This is the code:
def longest_common_sequence(s1, s2):
    m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]

for i in range(40, 59):
    s1 = str(dictionar['sequence_' + str(i)])
    s2 = str(dictionar['sequence_' + str(i+1)])
    longest_common_sequence(s1, s2)
How can I modify it to get the common subsequence among ALL sequences in dictionary? Thanks!
EDIT: As @lmcarreiro pointed out, there is a relevant difference between substrings (or subarrays or sublists) and subsequences. To my understanding we are all talking about substrings here, so I will use this term in my answer.
Guillaume's answer can be improved:
def eachPossibleSubstring(string):
    for size in range(len(string) + 1, 0, -1):
        for start in range(len(string) - size + 1):
            yield string[start:start+size]

def findLongestCommonSubstring(strings):
    shortestString = min(strings, key=len)
    for substring in eachPossibleSubstring(shortestString):
        if all(substring in string
               for string in strings if string != shortestString):
            return substring

print(findLongestCommonSubstring([
    'ABCDABDCABCDBACDBACDBACDBACDABDCDCACDCCDCABDCADCADBCACBDCABD',
    'DCBACDBACDADCDCDCABCDCACBDCBDACBDCBCDBABABBABACDCDBCACDBACDBACDBACDCBCDB',
]))
This prints:
ACDBACDBACDBACD
This is faster because it returns the first match found and searches from longest to shortest.
The basic idea is this: Take each possible substring of the shortest of your strings (in the order from the longest to the shortest) and see if this substring can be found in all other strings. If so, return it, otherwise try the next substring.
You need to understand generators. Try, e.g., this:
for substring in eachPossibleSubstring('abcd'):
    print(substring)
or
print(list(eachPossibleSubstring('abcd')))
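For 'abcd' that yields the substrings from longest to shortest:
['abcd', 'abc', 'bcd', 'ab', 'bc', 'cd', 'a', 'b', 'c', 'd']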
I'd start by defining a function to return all possible subsequences of a given sequence:
from itertools import combinations_with_replacement

def subsequences(sequence):
    "returns all possible subsequences of a given sequence"
    for start, stop in combinations_with_replacement(range(len(sequence)), 2):
        if start < stop:
            yield sequence[start:stop]
then I'd make another method to check if a given subsequence is present in all given sequences:
def is_common_subsequence(sub, sequences):
    "returns True if <sub> is a common subsequence in all <sequences>"
    return all(sub in sequence for sequence in sequences)
Then, using the two methods above, it is pretty easy to get all common subsequences in a given set of sequences:
def common_sequences(sequences):
    "return all subsequences common in sequences"
    shortest_seq = min(sequences, key=len)
    return set(subsequence for subsequence in subsequences(shortest_seq)
               if is_common_subsequence(subsequence, sequences))
... and extracting the longest sequence:
def longuest_common_subsequence(sequences):
    "returns the longest subsequence in sequences"
    return max(common_sequences(sequences), key=len)
Result:
sequences = {
    41: 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
    42: '123ABCDEFGHIJKLMNOPQRSTUVW',
    43: '123456ABCDEFGHIJKLMNOPQRST'
}
sequences2 = {
    0: 'ABCDEFGHIJ',
    1: 'DHSABCDFKDDSA',
    2: 'SGABCEIDEFJRNF'
}
print(longuest_common_subsequence(sequences.values()))
>>> ABCDEFGHIJKLMNOPQRST
print(longuest_common_subsequence(sequences2.values()))
>>> ABC
Here you have a possible approach. First, let's define a function that returns the longest common substring between two strings:
def longest_substring(s1, s2):
    t = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
    l, xl = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x-1] == s2[y-1]:
                t[x][y] = t[x-1][y-1] + 1
                if t[x][y] > l:
                    l = t[x][y]
                    xl = x
            else:
                t[x][y] = 0
    return s1[xl-l: xl]
Now I'll create a random dict of sequences for the example:
import random
import string

d = {i: ''.join(random.choice(string.ascii_uppercase) for _ in range(50)) for i in range(10)}
print(d)
{0: 'ASCUCEVJNIGWVMWMBBQQBZYBBNGQAJRYXACGFEIFWHMBCNYRGL', 1: 'HKUKZOJJUCRTSBLNZXCIBARLPNAPAABRBZEVGVILJAFCGWGQVV', 2: 'MMHCYPKECRJFEWTGYITMHZSNHAFEZVFYDAVILRYRKIDDBEFRVX', 3: 'DGBULRFJINFZEELDASRFBIRSADWMRAYMGCDAOJDKQIMXIRLTEI', 4: 'VDUFWZSXLRGOIMAHOAMZAIWDPTHDVDXUACRBASJMCUHREDORRH', 5: 'RFGAVHOWNKRZMYMSFSSNUGCKEWUNVETCDWJXSPBJHKSTPFNSJO', 6: 'HFMLMHCFSOEXBXWFAROIRGJNPRTKRWCEPLFOKGMXNUPCPWREWX', 7: 'CNPGSHGVIRLDXAADXUVWCTJCXUHQLALBUOJMXQBKXWHKGSJHEH', 8: 'UWDXXTRCFNCBUBEYGYTDWTPLNTRHYQWKTHPRVCBAWIMNGHULDC', 9: 'OOCJRXBZKJIGHZEJOOIKWKMQKIEQVPEDTFPJQAUQKJQVLOMGJB'}
Finally, we need to find the longest subsequence between all sequences:
import itertools
max([longest_substring(i,j) for i,j in itertools.combinations(d.values(), 2)], key=len)
Output:
'VIL'

Concatenate strings if they have an overlapping region

I am trying to write a script that will find strings that share an overlapping region of 5 letters at the beginning or end of each string (shown in example below).
facgakfjeakfjekfzpgghi
pgghiaewkfjaekfjkjakjfkj
kjfkjaejfaefkajewf
I am trying to create a new string which concatenates all three, so the output would be:
facgakfjeakfjekfzpgghiaewkfjaekfjkjakjfkjaejfaefkajewf
Edit:
This is the input:
x = ('facgakfjeakfjekfzpgghi', 'kjfkjaejfaefkajewf', 'pgghiaewkfjaekfjkjakjfkj')
(the list is not ordered)
What I've written so far (but it is not correct):
def findOverlap(seq)
    i = 0
    while i < len(seq):
        for x[i]:
            # check if x[0:5] == [:5] elsewhere
x = ('facgakfjeakfjekfzpgghi', 'kjfkjaejfaefkajewf', 'pgghiaewkfjaekfjkjakjfkj')
findOverlap(x)
Create a dictionary mapping the first 5 characters of each string to its tail
strings = {s[:5]: s[5:] for s in x}
and a set of all the suffixes:
suffixes = set(s[-5:] for s in x)
Now find the string whose prefix does not match any suffix:
prefix = next(p for p in strings if p not in suffixes)
Now we can follow the chain of strings:
result = [prefix]
while prefix in strings:
    result.append(strings[prefix])
    prefix = strings[prefix][-5:]
print("".join(result))
A brute-force approach: generate all permutations and return the first one in which consecutive strings link up:
def solution(x):
    from itertools import permutations
    for perm in permutations(x):
        linked = [perm[i][:-5] for i in range(len(perm)-1)
                  if perm[i][-5:] == perm[i+1][:5]]
        if len(perm)-1 == len(linked):
            return "".join(linked) + perm[-1]
    return None

x = ('facgakfjeakfjekfzpgghi', 'kjfkjaejfaefkajewf', 'pgghiaewkfjaekfjkjakjfkj')
print(solution(x))
Loop over each pair of candidates, reverse the second string and use the answer from here
