Longest common sequence between many sub-sequences - python

Fancy title :)
I have a file that contains the following:
>sequence_40
ABCDABDCABCDBACDBACDBACDBACDABDCDC
ACDCCDCABDCADCADBCACBDCABD
>sequence_41
DCBACDBACDADCDCDCABCDCACBDCBDACBDC
BCDBABABBABACDCDBCACDBACDBACDBACDC
BCDB
...
Then I have a function that returns a dictionary (called dictionar in the code below) whose keys are the sequence names and whose values are the corresponding strings, joined into a single line. The sequences range from 40 to 59.
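For readers who want to reproduce the setup, here is a minimal sketch of how such a dictionary could be built. The function name read_sequences and the inlined sample are assumptions made for illustration, not the asker's actual code; the sketch assumes every header line starts with '>' and that the lines below it belong to that header.

```python
def read_sequences(lines):
    """Map each '>name' header to its sequence lines joined into one string."""
    sequences = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header starts a new entry; drop the leading '>'.
            name = line[1:]
            sequences[name] = ""
        elif name is not None:
            # Continuation line: append to the current entry.
            sequences[name] += line
    return sequences


# Hypothetical sample mirroring the file excerpt in the question.
sample = """>sequence_40
ABCDABDCABCDBACDBACDBACDBACDABDCDC
ACDCCDCABDCADCADBCACBDCABD
>sequence_41
DCBACDBACDADCDCDCABCDCACBDCBDACBDC
BCDBABABBABACDCDBCACDBACDBACDBACDC
BCDB
""".splitlines()

dictionar = read_sequences(sample)
print(sorted(dictionar))
```

With a real file, passing the open file object instead of the sample list should work the same way, since the function only iterates over lines.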
I want to take a dictionary of sequences and return the longest common sub-sequence found in ALL the sequences. I managed to find some help here on Stack Overflow and wrote code that only compares the LAST TWO strings in that dictionary, not all of them :).
This is the code:
def longest_common_sequence(s1, s2):
    m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]

for i in range(40,59):
    s1=str(dictionar['sequence_'+str(i)])
    s2=str(dictionar['sequence_'+str(i+1)])
    longest_common_sequence(s1,s2)
How can I modify it to get the common subsequence among ALL sequences in dictionary? Thanks!

EDIT: As @lmcarreiro pointed out, there is a relevant difference between substrings (or subarrays or sublists) and subsequences. To my understanding we are all talking about substrings here, so I will use this term in my answer.
Guillaume's answer can be improved:
def eachPossibleSubstring(string):
    # Yield substrings from longest to shortest.
    for size in range(len(string), 0, -1):
        for start in range(len(string) - size + 1):
            yield string[start:start+size]

def findLongestCommonSubstring(strings):
    shortestString = min(strings, key=len)
    for substring in eachPossibleSubstring(shortestString):
        if all(substring in string
               for string in strings if string != shortestString):
            return substring

print(findLongestCommonSubstring([
    'ABCDABDCABCDBACDBACDBACDBACDABDCDCACDCCDCABDCADCADBCACBDCABD',
    'DCBACDBACDADCDCDCABCDCACBDCBDACBDCBCDBABABBABACDCDBCACDBACDBACDBACDCBCDB',
]))
This prints:
ACDBACDBACDBACD
This is faster because I return the first found and search from longest to shortest.
The basic idea is this: Take each possible substring of the shortest of your strings (in the order from the longest to the shortest) and see if this substring can be found in all other strings. If so, return it, otherwise try the next substring.
To follow this, you need to understand generators. Try this, for example:
for substring in eachPossibleSubstring('abcd'):
    print(substring)
or
print(list(eachPossibleSubstring('abcd')))

I'd start by defining a function to return all possible subsequences of a given sequence:
from itertools import combinations_with_replacement

def subsequences(sequence):
    "returns all possible subsequences of a given sequence"
    # range(len(sequence) + 1) lets stop reach the end, so slices can include the last element
    for start, stop in combinations_with_replacement(range(len(sequence) + 1), 2):
        if start < stop:
            yield sequence[start:stop]
Then I'd make another function to check whether a given subsequence is present in all given sequences:
def is_common_subsequence(sub, sequences):
    "returns True if <sub> is a common subsequence in all <sequences>"
    return all(sub in sequence for sequence in sequences)
Then, using the two functions above, it is pretty easy to get all common subsequences in a given set of sequences:
def common_sequences(sequences):
    "return all subsequences common in sequences"
    shortest_seq = min(sequences, key=len)
    return set(subsequence for subsequence in subsequences(shortest_seq)
               if is_common_subsequence(subsequence, sequences))
... and extracting the longest sequence:
def longuest_common_subsequence(sequences):
    "returns the longest subsequence in sequences"
    return max(common_sequences(sequences), key=len)
Result:
sequences = {
    41: 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
    42: '123ABCDEFGHIJKLMNOPQRSTUVW',
    43: '123456ABCDEFGHIJKLMNOPQRST'
}
sequences2 = {
    0: 'ABCDEFGHIJ',
    1: 'DHSABCDFKDDSA',
    2: 'SGABCEIDEFJRNF'
}
print(longuest_common_subsequence(sequences.values()))
>>> ABCDEFGHIJKLMNOPQRST
print(longuest_common_subsequence(sequences2.values()))
>>> ABC

Here is a possible approach. First, let's define a function that returns the longest common substring of two strings:
def longest_substring(s1, s2):
    t = [[0]*(1+len(s2)) for i in range(1+len(s1))]
    l, xl = 0, 0
    for x in range(1,1+len(s1)):
        for y in range(1,1+len(s2)):
            if s1[x-1] == s2[y-1]:
                t[x][y] = t[x-1][y-1] + 1
                if t[x][y]>l:
                    l = t[x][y]
                    xl = x
            else:
                t[x][y] = 0
    return s1[xl-l: xl]
Now I'll create a random dict of sequences for the example:
import random
import string
d = {i : ''.join(random.choice(string.ascii_uppercase) for _ in range(50)) for i in range(10)}
print(d)
{0: 'ASCUCEVJNIGWVMWMBBQQBZYBBNGQAJRYXACGFEIFWHMBCNYRGL', 1: 'HKUKZOJJUCRTSBLNZXCIBARLPNAPAABRBZEVGVILJAFCGWGQVV', 2: 'MMHCYPKECRJFEWTGYITMHZSNHAFEZVFYDAVILRYRKIDDBEFRVX', 3: 'DGBULRFJINFZEELDASRFBIRSADWMRAYMGCDAOJDKQIMXIRLTEI', 4: 'VDUFWZSXLRGOIMAHOAMZAIWDPTHDVDXUACRBASJMCUHREDORRH', 5: 'RFGAVHOWNKRZMYMSFSSNUGCKEWUNVETCDWJXSPBJHKSTPFNSJO', 6: 'HFMLMHCFSOEXBXWFAROIRGJNPRTKRWCEPLFOKGMXNUPCPWREWX', 7: 'CNPGSHGVIRLDXAADXUVWCTJCXUHQLALBUOJMXQBKXWHKGSJHEH', 8: 'UWDXXTRCFNCBUBEYGYTDWTPLNTRHYQWKTHPRVCBAWIMNGHULDC', 9: 'OOCJRXBZKJIGHZEJOOIKWKMQKIEQVPEDTFPJQAUQKJQVLOMGJB'}
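Finally, we compare every pair of sequences and keep the longest substring found. Note that this gives the longest substring shared by some pair of sequences, which is not necessarily present in all of them: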
import itertools
max([longest_substring(i,j) for i,j in itertools.combinations(d.values(), 2)], key=len)
Output:
'VIL'

Related

Longest common substring with rolling hash

I am implementing an algorithm in Python 3 to find the longest common substring of two strings s and t. Given s and t, I need to return (a,b,l) where l is the length of the longest common substring, a is the position in s where the longest common substring starts, and b is the position in t where it starts. I have a working version of the algorithm, but it is quite slow and I am not sure why. It is frustrating because I have found other implementations in Python using pretty much the same logic that are many times faster. I am self-learning, so any help would be greatly appreciated.
The approach is based on comparing hash values rather than directly comparing substrings and using binary search to find maximal length of common substrings. Here is the code for my hash function (m is a big prime and x is just some constant):
def polynomial_hash(my_string, m, x):
    str_len = len(my_string)
    result = 0
    for i in range(str_len):
        result = (result + ord(my_string[i]) * power_mod_p(x, i, m)) % m
    return result
Given two strings s and t, I first find which one is shorter; without loss of generality, let s be the shorter string. I then need the hash values of all length-k substrings of a string. I use the following function, implemented as a generator:
def all_length_k_hashes(my_string, k, m, x):
    current_position = len(my_string) - k
    x_to_the_k = power_mod_p(x, k, m)
    hash_value = polynomial_hash(my_string[current_position:], m, x)
    yield (hash_value, current_position)
    while current_position > 0:
        current_position = current_position - 1
        hash_value = ((hash_value * x) + ord(my_string[current_position]) - x_to_the_k*ord(my_string[current_position + k])) % m
        yield (hash_value, current_position)
This function is simple: its first yield is the hash value of the final length-k substring of the string, and each later iteration yields the hash value of the next length-k substring to its left. We move left by one position at a time; for example, for k=3 in abcdefghi the window moves from ghi to fgh, and then from fgh to efg. This should compute the hash values of all length-k substrings of my_string in O(|my_string|).
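To make the leftward update concrete, here is a self-contained sketch of the same rolling step. It assumes that power_mod_p(x, k, m) behaves like Python's built-in pow(x, k, m), which the question does not show; the name rolling_hashes and the values m = 1000003 and x = 31 are illustrative choices, not the asker's.

```python
def rolling_hashes(text, k, m, x):
    """Yield (hash, start) for every length-k window of text, right to left.

    The hash of a window w is sum(ord(w[i]) * x**i) % m, the same
    polynomial form as polynomial_hash in the question.
    """
    n = len(text)
    if k > n:
        return  # no window of that length exists
    x_to_the_k = pow(x, k, m)  # built-in modular exponentiation
    start = n - k
    # Hash the rightmost window directly, carrying the power of x along
    # instead of recomputing it for every character.
    h, power = 0, 1
    for ch in text[start:]:
        h = (h + ord(ch) * power) % m
        power = (power * x) % m
    yield h, start
    while start > 0:
        start -= 1
        # Multiply by x to shift every term up one power, add the new
        # leftmost character, and subtract the character that left the
        # window on the right (it now carries a factor of x**k).
        h = (h * x + ord(text[start]) - x_to_the_k * ord(text[start + k])) % m
        yield h, start


for h, start in rolling_hashes("abcdefghi", 3, 1000003, 31):
    print(start, "abcdefghi"[start:start + 3], h)
```

Carrying the running power in the initial loop avoids one modular exponentiation per character, which may matter when this is called repeatedly during a binary search.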
To find out whether s and t have a length-k substring in common, I use the following function:
def common_sub_string_length_k(shorter_str, longer_str, k, m, x):
    short_str_dict = dict()
    for hash_and_index in all_length_k_hashes(shorter_str, k, m, x):
        short_str_dict.update({hash_and_index[0]: hash_and_index[1]})
    hash_generator_longer_str = all_length_k_hashes(longer_str, k, m, x)
    for hash_and_index in hash_generator_longer_str:
        if hash_and_index[0] in short_str_dict:
            return (short_str_dict[hash_and_index[0]], hash_and_index[1])
    return False
What happens in this function is: I create an empty Python dictionary and fill it with key: value pairs such that each key is the hash value of a length-k substring of the shorter string and its value is that substring's starting index. I call this 'short_str_dict'.
Then, using all_length_k_hashes, I create a generator of hash values of the length-k substrings of the longer string and iterate through it to check whether any hash value is in 'short_str_dict'. If there is one, the two strings have a substring of length k in common (assuming no hash collisions). This whole process should take time O(|shorter_string| + |longer_string|).
Finally, the following function repeatedly uses the previous process to find the maximal k, using a binary search technique:
from random import randint

def longest_common_substring(str_1, str_2):
    m_1 = 309000599
    m_2 = 988017827
    x = randint(1, 10 ** 6)
    len_str_1 = len(str_1)
    len_str_2 = len(str_2)
    if len_str_1 <= len_str_2:
        short_str = str_1
        long_str = str_2
        switched = False
    else:
        short_str = str_2
        long_str = str_1
        switched = True
    len_short_str = len(short_str)
    len_long_str = len(long_str)
    low = 0
    high = len_short_str
    mid = 0
    longest_so_far = 0
    longest_indices = (0,0)
    while low <= high:
        mid = (high + low) // 2
        m1_result = common_sub_string_length_k(short_str, long_str, mid, m_1, x)
        m2_result = common_sub_string_length_k(short_str, long_str, mid, m_2, x)
        if m1_result is False or m2_result is False:
            high = mid - 1
        else:
            longest_so_far = mid
            longest_indices = m1_result
            low = mid + 1
    if switched:
        return (longest_indices[1], longest_indices[0], longest_so_far)
    else:
        return (longest_indices[0], longest_indices[1], longest_so_far)
Two different hashes are used to reduce the probability of a collision. So in total, assuming no collisions, this whole process should take O(log|shorter_string|) * O(|shorter_string| + |longer_string|).
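The binary search is valid because a common substring of length k implies a common substring of every shorter length. Below is a minimal sketch of just that search, with the hash comparison swapped for a plain dictionary of substrings so that it stays short and collision-free. The names common_at_length and longest_common are hypothetical, and this is not the asker's hashing code: each step here costs more than the hashed version because it builds and compares real substrings.

```python
def common_at_length(s, t, k):
    """Return (i, j) with s[i:i+k] == t[j:j+k], or None if no such pair exists."""
    if k == 0:
        return (0, 0)  # the empty string is always common
    first_seen = {}
    for i in range(len(s) - k + 1):
        # Remember the first start index of each length-k substring of s.
        first_seen.setdefault(s[i:i + k], i)
    for j in range(len(t) - k + 1):
        i = first_seen.get(t[j:j + k])
        if i is not None:
            return (i, j)
    return None


def longest_common(s, t):
    """Return (a, b, l): start in s, start in t, and length of a longest common substring."""
    low, high = 0, min(len(s), len(t))
    best = (0, 0, 0)
    while low <= high:
        mid = (low + high) // 2
        hit = common_at_length(s, t, mid)
        if hit is None:
            high = mid - 1   # nothing of this length, so nothing longer either
        else:
            best = (hit[0], hit[1], mid)
            low = mid + 1    # try to find something longer
    return best


print(longest_common("programme", "grammaire"))
```

A version like this can also serve as a reference to check the hashed implementation against on small inputs.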
Have I made any error? Is it slow because of the use of Python dictionaries? I really want to understand my mistake. Any help is greatly appreciated.

Permutations without itertools for two values (using recursion!)

Stackoverflow, I am once again asking for your help.
I'm aware there are other threads about this but I'll explain what makes my assignment different.
Basically, my function gets a string of 0s and 1s and returns all the possible orderings of that string. For example, for "0111" we get "0111", "1011", "1101", "1110".
Here's my code:
def permutations(string):
    if len(string) == 1:
        return [string]
    lst = []
    for j in range(len(string)):
        remaining_elements = ''.join([string[i] for i in range(len(string)) if i != j])
        mini_perm = permutations(remaining_elements)
        for perm in mini_perm:
            new_str = string[j] + perm
            if new_str not in lst:
                lst.append(new_str)
    return lst
The problem is that when I run a string like "000000000011" it takes a very long time to process. There is supposed to be a more efficient way to do it because there are just two distinct values. So should I not be using the indexes?
Please help me if you can figure out a more efficient way to do this.
(I am allowed to use loops; I just have to use recursion as well!)
Here is an example for creating permutations with recursion that is more efficient:
def permute(string):
    string = list(string)
    n = len(string)
    # Base conditions
    # If length is 0 or 1, there is only 1 permutation
    if n in [0, 1]:
        return [string]
    # If length is 2, then there are only two permutations
    # Example: [1,2] and [2,1]
    if n == 2:
        return [string, string[::-1]]
    res = []
    # For every number in array, choose 1 number and permute the remaining
    # by calling permute recursively
    for i in range(n):
        permutations = permute(string[:i] + string[i+1:])
        for p in permutations:
            res.append([''.join(str(n) for n in [string[i]] + p)])
    return res
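In principle this also handles permute('000000000011'), but because it generates every ordering, duplicates included, that call has to build 12! results and is very slow. Hope it helps!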
You can also use collections.Counter with a recursive generator function:
from collections import Counter

def permute(d):
    counts = Counter(d)
    def get_permuations(c, s = []):
        if len(s) == sum(counts.values()):
            yield ''.join(s)
        else:
            for a, b in c.items():
                for i in range(1, b+1):
                    yield from get_permuations({**c, a:b - i}, s+([a]*i))
    return list(set(get_permuations(counts)))

print(permute("0111"))
print(permute("000000000011"))
Output:
['0111', '1110', '1101', '1011']
['010000100000', '100000000001', '010000001000', '000000100001', '011000000000', '100000000010', '001001000000', '000000011000', '100000001000', '100000100000', '100001000000', '001000100000', '100010000000', '000000001100', '000100000100', '010010000000', '000000000011', '000000100010', '101000000000', '110000000000', '100000010000', '000100001000', '000001001000', '000000000101', '000000100100', '010000000001', '001000000100', '001000000010', '000110000000', '000011000000', '000001100000', '000000110000', '001000000001', '000010001000', '000100100000', '000001000001', '000010000001', '001100000000', '000100000001', '001000001000', '010000000100', '010000010000', '000000010001', '001000010000', '010001000000', '100000000100', '100100000000', '000000001001', '010100000000', '000010100000', '010000000010', '000000001010', '000010000100', '001010000000', '000000010010', '000001000010', '000100000010', '000101000000', '000000010100', '000100010000', '000000000110', '000001000100', '000010010000', '000000101000', '000001010000', '000010000010']
Posting an answer someone gave me. Thanks for your responses!
def permutations(zeroes, ones, lst, perm):
    if zeroes == 0 and ones == 0:
        lst.append(perm)
        return
    elif zeroes < 0 or ones < 0:
        return
    permutations(zeroes - 1, ones, lst, perm + '0')
    permutations(zeroes, ones - 1, lst, perm + '1')
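The function above fills a list passed in by the caller instead of returning one, so it may help to see how it could be called. The sketch below repeats the function so that it runs on its own; the wrapper name binary_permutations is hypothetical and not part of the posted answer.

```python
def permutations(zeroes, ones, lst, perm):
    # All characters placed: record the finished arrangement.
    if zeroes == 0 and ones == 0:
        lst.append(perm)
        return
    # Used more of one character than is available: dead end.
    elif zeroes < 0 or ones < 0:
        return
    permutations(zeroes - 1, ones, lst, perm + '0')
    permutations(zeroes, ones - 1, lst, perm + '1')


def binary_permutations(string):
    """Collect every distinct ordering of a string made of '0' and '1'."""
    result = []
    permutations(string.count('0'), string.count('1'), result, '')
    return result


print(binary_permutations("0111"))  # ['0111', '1011', '1101', '1110']
print(len(binary_permutations("000000000011")))  # 66
```

Because each call decides only whether the next character is a 0 or a 1, every distinct arrangement is produced exactly once and no duplicate check is needed.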

How to find longest intersection between two strings in python?

I'm trying to write a program that finds the longest intersection between two strings. The conditions are:
If there is no common character, the program returns an empty string.
If there are multiple common substrings of the same length, it should return whichever is the largest. For example, for "bbaacc" and "aabb" the common substrings are "aa" and "bb", but as "bb" > "aa", the program must return only "bb".
Finally, the program should return the longest common substring. For instance, for "programme" and "grammaire" the result should be "gramm", not "gramme".
My code has a problem with this last condition. How could I change it so it works as expected?
def intersection(v, w):
    if not v or not w:
        return ""
    x, xs, y, ys = v[0], v[1:], w[0], w[1:]
    if x == y:
        return x + intersection(xs, ys)
    else:
        return max(intersection(v, ys), intersection(xs, w), key=len)
Driver:
print(intersection('programme', 'grammaire'))
I can't find the issue with your code, but I solved it like this:
def longest_str_intersection(a: str, b: str):
    # identify all possible character sequences from str a
    seqs = []
    for pos1 in range(len(a)):
        for pos2 in range(len(a)):
            seqs.append(a[pos1:pos2+1])
    # remove empty sequences
    seqs = [seq for seq in seqs if seq != '']
    # find segments in str b
    max_len_match = 0
    max_match_sequence = ''
    for seq in seqs:
        if seq in b:
            if len(seq) > max_len_match:
                max_len_match = len(seq)
                max_match_sequence = seq
    return max_match_sequence

longest_str_intersection('programme', 'grammaire')
-> 'gramm'
I'm also interested to see if you find a more elegant solution!
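The answer above keeps the first longest match it finds, so it does not apply the tie-breaking rule from the question. A minimal sketch that compares candidates by length first and by string order second is shown below; the name longest_intersection is hypothetical, and this is one possible reading of the requirements rather than a reference solution.

```python
def longest_intersection(a, b):
    """Longest common substring of a and b; ties go to the larger string."""
    best = ""
    for start in range(len(a)):
        for end in range(start + 1, len(a) + 1):
            candidate = a[start:end]
            if candidate not in b:
                # Extending a non-matching candidate cannot make it match.
                break
            # Compare by length first, then lexicographically.
            if (len(candidate), candidate) > (len(best), best):
                best = candidate
    return best


print(longest_intersection("programme", "grammaire"))  # gramm
print(longest_intersection("bbaacc", "aabb"))          # bb
```

Comparing (length, string) tuples lets a single comparison handle both rules, and an empty result falls out naturally when the strings share no character.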

Finding the closest sub-string by Hamming distance

I need to find the substring of s that is closest to a string p by Hamming distance, and return a tuple of the index of the closest substring, the Hamming distance of the closest substring to p, and the closest substring itself.
I have this code so far:
def ham_dist(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("Undefined")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
But I am confused on how I would figure this out:
Your function should return (1,2,'bcef') because the closest substring is 'bcef', it begins at index 1 in s, and its Hamming distance to p is 2.
In your function, you should use your ham_dist function from part (a). If there is more than one substring with the same minimum distance to p, return any of them.
You can run through the source string and compute the Hamming distance between your search string and the substring of the same length starting at the current index. You save the index, Hamming distance and substring if it is smaller than what you had before. This way you will get the minimal value.
source_string = "pGpEusuCSWEaPOJmamlFAnIBgAJGtcJaMPFTLfUfkQKXeymydQsdWCTyEFjFgbSmknAmKYFHopWceEyCSumTyAFwhrLqQXbWnXSn"
search_string = "tyraM"

def ham_dist(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("Undefined")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

def search_min_dist(source,search):
    l = len(search)
    index = 0
    min_dist = l
    min_substring = source[:l]
    for i in range(len(source)-l+1):
        d = ham_dist(search, source[i:i+l])
        if d<min_dist:
            min_dist = d
            index = i
            min_substring = source[i:i+l]
    return (index,min_dist,min_substring)

print(search_min_dist(source_string,search_string))
Output
(28, 2, 'tcJaM')
The answer from Hugo Delahaye is a good one and does a better job of answering your question directly, but a different way to think about problems like this is to let Python's min() function figure out the answer. Under this type of data-centric programming (see Rule 5), your goal is to organize the data to make that possible.
s = 'abcefgh'
p = 'cdef'
N = len(p)
substrings = [
    s[i : i + N]
    for i in range(0, len(s) - N + 1)
]
result = min(
    (ham_dist(p, sub), sub, i)
    for i, sub in enumerate(substrings)
)
print(substrings) # ['abce', 'bcef', 'cefg', 'efgh']
print(result) # (2, 'bcef', 1)

Find all Occurences of Every Substring in String

I am trying to find all occurrences of sub-strings in a main string (of all lengths). My function takes one string and then returns a dictionary of every sub-string (which occurs more than once, of course) and how many times it occurs (format of the dictionary: {substring: # of occurrences, ...}). I am using collections.Counter(s) to help me with it.
Here is my function:
from collections import Counter
def patternFind(s):
    patterns = {}
    for index in range(1, len(s)+1)[::-1]:
        d = nChunks(s, step=index)
        parts = dict(Counter(d))
        patterns.update({elem: parts[elem] for elem in parts.keys() if parts[elem] > 1})
    return patterns

def nChunks(iterable, start=0, step=1):
    return [iterable[i:i+step] for i in range(start, len(iterable), step)]
I have a string, data with about 2500 random letters (in a random order). However, there are 2 strings inserted into it (random points). Say this string is 'TEST'. data.count('TEST') returns 2. However, patternFind(data)['TEST'] gives me a KeyError. Therefore, my program does not detect the two strings in it.
What have I done wrong? Thanks!
Edit: My method of creating testing-instances:
def createNewTest():
    n = randint(500, 2500)
    x, y = randint(500, n), randint(500, n)
    s = ''
    for i in range(n):
        s += choice(uppercase)
        if i == x or i == y: s += "TEST"
    return s
Using Regular Expressions
Apart from the count() method you described, regex is an obvious alternative:
import re
needle = r'TEST'
haystack = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklagh'
pattern = re.compile(needle)
print(len(re.findall(pattern, haystack)))
Short Cut
If you need to build a dictionary of substrings, you may be able to do it with only a subset of them. Assuming you know the needle you are looking for in the data, you only need the dictionary of substrings of data that have the same length as the needle. This is very fast.
from collections import Counter
needle = "TEST"

def gen_sub(s, len_chunk):
    for start in range(0, len(s)-len_chunk+1):
        yield s[start:start+len_chunk]

data = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklaghTESz'
parts = Counter([sub for sub in gen_sub(data, len(needle))])
print(parts[needle])
Brute Force: building dictionary of all substrings
If you need a count of all possible substrings, this works, but it is very slow:
from collections import Counter

def gen_sub(s):
    for start in range(0, len(s)):
        for end in range(start+1, len(s)+1):
            yield s[start:end]

data = 'khjkzahklahjTESTkahklaghTESTjklajhkhz'
parts = Counter([sub for sub in gen_sub(data)])
print(parts['TEST'])
Substring generator adapted from this: https://stackoverflow.com/a/8305463/1290420
While jurgenreza has explained why your program didn't work, the solution is still quite slow. If you only examine substrings s for which you know that s[:-1] repeats, you get a much faster solution (typically a hundred times faster and more):
from collections import defaultdict

def pfind(prefix, sequences):
    collector = defaultdict(list)
    for sequence in sequences:
        collector[sequence[0]].append(sequence)
    for item, matching_sequences in collector.items():
        if len(matching_sequences) >= 2:
            new_prefix = prefix + item
            yield (new_prefix, len(matching_sequences))
            for r in pfind(new_prefix, [sequence[1:] for sequence in matching_sequences]):
                yield r

def find_repeated_substrings(s):
    s0 = s + " "
    return pfind("", [s0[i:] for i in range(len(s))])
If you want a dict, you call it like this:
result = dict(find_repeated_substrings(s))
On my machine, for a run with 2247 elements, it took 0.02 sec, while the original (corrected) solution took 12.72 sec.
(Note that this is a rather naive implementation; using indexes instead of substrings should be even faster.)
Edit: The following variant works with other sequence types (not only strings). Also, it doesn't need a sentinel element.
from collections import defaultdict
def pfind(s, length, ends):
    collector = defaultdict(list)
    if ends[-1] >= len(s):
        del ends[-1]
    for end in ends:
        if end < len(s):
            collector[s[end]].append(end)
    for key, matching_ends in collector.items():
        if len(matching_ends) >= 2:
            end = matching_ends[0]
            yield (s[end - length: end + 1], len(matching_ends))
            for r in pfind(s, length + 1, [end + 1 for end in matching_ends if end < len(s)]):
                yield r

def find_repeated_substrings(s):
    return pfind(s, 0, list(range(len(s))))
This still has the problem that very long substrings will exceed recursion depth. You might want to catch the exception.
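One way to sidestep the recursion limit altogether is to replace the recursion with an explicit stack. The following is a sketch of that idea rather than the answerer's code: the name find_repeated_substrings_iter is made up here, and it groups start positions instead of end positions, which is an implementation choice of this sketch.

```python
from collections import defaultdict


def find_repeated_substrings_iter(s):
    """Yield (substring, count) for every substring that starts at two or more positions."""
    # Each stack entry holds a prefix length and the start positions
    # whose substrings of that length are all identical.
    stack = [(0, list(range(len(s))))]
    while stack:
        length, starts = stack.pop()
        groups = defaultdict(list)
        for start in starts:
            end = start + length
            if end < len(s):
                # Group the starts by the next element after the shared prefix.
                groups[s[end]].append(start)
        for matching in groups.values():
            if len(matching) >= 2:
                first = matching[0]
                yield s[first:first + length + 1], len(matching)
                stack.append((length + 1, matching))


result = dict(find_repeated_substrings_iter('1test2345test'))
print(result['test'], result['t'])  # 2 4
```

The stack grows with the number of pending groups instead of the length of the longest repeat, so very long repeated substrings no longer run into Python's recursion depth limit.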
The problem is in your nChunks function. It does not give you all the chunks that are necessary.
Let's consider a test string:
s='1test2345test'
For chunks of size 4, your nChunks function gives this output:
>>> nChunks(s, step=4)
['1tes', 't234', '5tes', 't']
But what you really want is:
>>> def nChunks(iterable, start=0, step=1):
...     return [iterable[i:i+step] for i in range(len(iterable)-step+1)]
...
>>> nChunks(s, step=4)
['1tes', 'test', 'est2', 'st23', 't234', '2345', '345t', '45te', '5tes', 'test']
You can see that this way there are two 'test' chunks and your patternFind(s) will work like a charm:
>>> patternFind(s)
{'tes': 2, 'st': 2, 'te': 2, 'e': 2, 't': 4, 'es': 2, 'est': 2, 'test': 2, 's': 2}
Here you can find a solution that uses a recursive wrapper around string.find() to search for all the occurrences of a substring in a main string.
The collectallchuncks() function returns a defaultdict with all the substrings as keys and, for each substring, a list of all the indexes where the substring is found in the main string.
import collections

# Minimum substring size, may be 1
MINSIZE = 3

# Recursive wrapper
def recfind(p, data, pos, acc):
    res = data.find(p, pos)
    if res == -1:
        return acc
    else:
        acc.append(res)
        return recfind(p, data, res+1, acc)

def collectallchuncks(data):
    # Values are lists of indexes, so missing keys default to an empty list
    res = collections.defaultdict(list)
    size = len(data)
    for base in range(size):
        for seg in range(MINSIZE, size-base+1):
            chunk = data[base:base+seg]
            if data.count(chunk) > 1:
                res[chunk] = recfind(chunk, data, 0, [])
    return res

if __name__ == "__main__":
    data = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklaghTESz'
    allchuncks = collectallchuncks(data)
    print('TEST', allchuncks['TEST'])
    print('hklag', allchuncks['hklag'])
EDIT: If you just need the number of occurrences of each substring in the main string, you can easily obtain it by getting rid of the recursive function:
import collections

MINSIZE = 3

def collectallchuncks2(data):
    # Values are counts, so missing keys default to 0
    res = collections.defaultdict(int)
    size = len(data)
    for base in range(size):
        for seg in range(MINSIZE, size-base+1):
            chunk = data[base:base+seg]
            cnt = data.count(chunk)
            if cnt > 1:
                res[chunk] = cnt
    return res

if __name__ == "__main__":
    data = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklaghTESz'
    allchuncks = collectallchuncks2(data)
    print('TEST', allchuncks['TEST'])
    print('hklag', allchuncks['hklag'])
