I'm writing a Python function that, for every k consecutive strings in a list, combines them and returns the longest of the combined strings. The function takes two parameters, strarr and k.
Here is an example:
max_consec(["zone", "abigail", "theta", "form", "libe", "zas", "theta", "abigail"], 2) --> "abigailtheta"
Here's my code so far (my instinct is that I'm not passing k correctly within the function):
def max_consec(strarr, k):
    lenList = []
    for value in zip(strarr, strarr[k:]):
        consecStrings = ''.join(value)
        lenList.append(consecStrings)
    for word in lenList:
        if max(word):
            return word
Here is a test case that is not passing:
testing(longest_consec(["zone", "abigail", "theta", "form", "libe", "zas"], 2), "abigailtheta")
My output:
'zonetheta' should equal 'abigailtheta'
It's not quite clear to me what you mean by "every k consecutive strings", but if you mean taking k-length slices of the list and concatenating all the strings in each slice, for example
['a', 'bb', 'ccc', 'dddd'] # k = 3
becomes
['a', 'bb', 'ccc']
['bb', 'ccc', 'dddd']
then
'abbccc'
'bbcccdddd'
then this works ...
# for every item in strarr, take the k-length slice from that point and join the resulting strings
strs = [''.join(strarr[i:i + k]) for i in range(len(strarr) - k + 1)]
# find the largest by `len`gth
max(strs, key=len)
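Folded into the question's function signature, a complete version might look like this (a minimal sketch; the guard against a bad k is my addition, not part of the original answer):

def max_consec(strarr, k):
    # guard against non-positive k or k larger than the list (my assumption)
    if not 0 < k <= len(strarr):
        return None
    # join each k-length slice, then pick the longest result
    strs = [''.join(strarr[i:i + k]) for i in range(len(strarr) - k + 1)]
    return max(strs, key=len)

print(max_consec(["zone", "abigail", "theta", "form", "libe", "zas", "theta", "abigail"], 2))
# abigailtheta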
this post gives alternatives, though some of them are hard to read/verbose
Store the string lengths in an array. Now imagine a window of size k passing over this list; keep track of the sum inside the window and the window's starting point.
When the window reaches the end of the array, you will have the maximum sum and the index where that maximum occurs. Construct the result from the elements of that window.
Time complexity: O(size of array + sum of all strings sizes) ~ O(n)
Also add some corner case handling when k > array_size or k <= 0
def max_consec(strarr, k):
    size = len(strarr)
    # corner cases
    if k > size or k <= 0:
        return "None"  # or None
    # store lengths
    lenList = [len(word) for word in strarr]
    print(lenList)
    max_sum = sum(lenList[:k])  # window at index 0
    prev_sum = max_sum
    max_index = 0
    for i in range(1, size - k + 1):
        # window sum at the i'th index: subtract the length leaving the window, add the one entering it
        length = prev_sum - lenList[i-1] + lenList[i + k - 1]
        prev_sum = length
        if length > max_sum:
            max_sum = length
            max_index = i
    return "".join(strarr[max_index:max_index+k])  # join strings in the max sum window
word = max_consec(["zone", "abigail", "theta", "form", "libe", "zas", "theta", "abigail"], 2)
print(word)
def max_consec(strarr, k):
    n = -1
    result = ""
    for i in range(len(strarr)):
        s = ''.join(strarr[i:i+k])
        if len(s) > n:
            n = len(s)
            result = s
    return result
Iterate over the list of strings and, at each position, create a new string by concatenating the next k strings.
Check if the newly created string is the longest so far. If so, memorize it.
Repeat the above steps until the iteration completes.
Return the memorized string.
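A quick check against the question's example (my test call, not part of the original answer):

print(max_consec(["zone", "abigail", "theta", "form", "libe", "zas", "theta", "abigail"], 2))
# abigailtheta

Note that near the end of the list the slice strarr[i:i+k] is shorter than k; with non-empty strings those tail joins can never beat the full window starting at len(strarr) - k, so the result is unaffected.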
If I understand your question correctly, you have to eliminate duplicate values (in this case with a set), sort them by length, and concatenate the k longest words.
>>> def max_consec(words, k):
... words = sorted(set(words), key=len, reverse=True)
... return ''.join(words[:k])
...
>>> max_consec(["zone", "abigail", "theta", "form", "libe", "zas", "theta", "abigail"], 2)
'abigailtheta'
Update:
If the k elements should be consecutive, you can create tuples of consecutive words (in this case with zip) and return the longest string they form when joined.
>>> def max_consec(words, k):
... return max((''.join(pair) for pair in zip(*[words[i:] for i in range(k)])), key=len)
...
>>> max_consec(["zone", "abigail", "theta", "form", "libe", "zas", "theta", "abigail"], 2)
'abigailtheta'
I am implementing in Python 3 an algorithm to find the longest common substring of two strings s and t. Given s and t, I need to return (a, b, l), where l is the length of the longest common substring, a is the position in s where it starts, and b is the position in t where it starts. I have a working version of the algorithm, but it is quite slow and I am not sure why; it's frustrating because I have found other Python implementations using pretty much the same logic that are many times faster. I am self-learning, so any help would be greatly appreciated.
The approach is based on comparing hash values rather than directly comparing substrings and using binary search to find maximal length of common substrings. Here is the code for my hash function (m is a big prime and x is just some constant):
def polynomial_hash(my_string, m, x):
    str_len = len(my_string)
    result = 0
    for i in range(str_len):
        result = (result + ord(my_string[i]) * power_mod_p(x, i, m)) % m
    return result
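The helper power_mod_p isn't shown in the question; presumably it is modular exponentiation, equivalent to Python's built-in three-argument pow:

def power_mod_p(base, exponent, p):
    # assumed helper (not shown in the question): computes base**exponent mod p
    return pow(base, exponent, p)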
Given two strings s and t, I first find which string is shorter; without loss of generality, let s be the shorter one. To start, I need the hash values of all length-k substrings of a string. I use the following function, implemented as a generator:
def all_length_k_hashes(my_string, k, m, x):
    current_position = len(my_string) - k
    x_to_the_k = power_mod_p(x, k, m)
    hash_value = polynomial_hash(my_string[current_position:], m, x)
    yield (hash_value, current_position)
    while current_position > 0:
        current_position = current_position - 1
        hash_value = ((hash_value * x) + ord(my_string[current_position]) - x_to_the_k*ord(my_string[current_position + k])) % m
        yield (hash_value, current_position)
This function is simple: its first yield is the hash value of the final length-k substring of the string, and each subsequent iteration yields the hash value of the next length-k substring to its left (we move left by one position; for example, with k=3 on 'abcdefghi' we go from 'ghi' to 'fgh', then from 'fgh' to 'efg', and so on). This should calculate the hash values of all length-k substrings of my_string in O(|my_string|).
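As a quick sanity check (my test, with an arbitrary prime m and constant x), the rolled hash values agree with hashing each window directly:

m, x, k = 1000000007, 31, 3
s = "abcdefghi"
rolled = {i: h for h, i in all_length_k_hashes(s, k, m, x)}
direct = {i: polynomial_hash(s[i:i+k], m, x) for i in range(len(s) - k + 1)}
assert rolled == direct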
Now, to find out whether s and t have a length-k substring in common, I use the following function:
def common_sub_string_length_k(shorter_str, longer_str, k, m, x):
    short_str_dict = dict()
    for hash_and_index in all_length_k_hashes(shorter_str, k, m, x):
        short_str_dict.update({hash_and_index[0]: hash_and_index[1]})
    hash_generator_longer_str = all_length_k_hashes(longer_str, k, m, x)
    for hash_and_index in hash_generator_longer_str:
        if hash_and_index[0] in short_str_dict:
            return (short_str_dict[hash_and_index[0]], hash_and_index[1])
    return False
What is happening in this function is: I create an empty Python dictionary and fill it with (key: value) pairs such that each key is the hash value of a length-k substring of the shorter string and its value is that substring's starting index; I call this short_str_dict.
Then, using all_length_k_hashes, I create a generator of hash values of the length-k substrings of the longer string and iterate through it to check whether any hash value is in short_str_dict. If there is one, then the two strings have a substring of length k in common (assuming no hash collisions). This whole process should take O(|shorter_string| + |longer_string|) time.
Finally, the following function repeatedly uses the previous process to find the maximal k, using a binary search technique:
def longest_common_substring(str_1, str_2):
    m_1 = 309000599
    m_2 = 988017827
    x = randint(1, 10 ** 6)
    len_str_1 = len(str_1)
    len_str_2 = len(str_2)
    if len_str_1 <= len_str_2:
        short_str = str_1
        long_str = str_2
        switched = False
    else:
        short_str = str_2
        long_str = str_1
        switched = True
    len_short_str = len(short_str)
    len_long_str = len(long_str)
    low = 0
    high = len_short_str
    mid = 0
    longest_so_far = 0
    longest_indices = (0, 0)
    while low <= high:
        mid = (high + low) // 2
        m1_result = common_sub_string_length_k(short_str, long_str, mid, m_1, x)
        m2_result = common_sub_string_length_k(short_str, long_str, mid, m_2, x)
        if m1_result is False or m2_result is False:
            high = mid - 1
        else:
            longest_so_far = mid
            longest_indices = m1_result
            low = mid + 1
    if switched:
        return (longest_indices[1], longest_indices[0], longest_so_far)
    else:
        return (longest_indices[0], longest_indices[1], longest_so_far)
Two different hashes are used to reduce the probability of a collision. So in total, assuming no collisions, this whole process should take O(log|shorter_string| * (|shorter_string| + |longer_string|)) time.
Have I made any error? Is it slow because of the use of Python dictionaries? I really want to understand my mistake. Any help is greatly appreciated.
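For reference, running the code above also needs from random import randint, plus the power_mod_p helper sketched earlier. A minimal driver (my example; the expected result assumes no hash collisions):

from random import randint

print(longest_common_substring("xabcdey", "zzabcdqq"))
# (1, 2, 4): "abcd" starts at index 1 of s and index 2 of t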
I'm writing a function that returns the minimum number of words between two occurrences of a search word in a sentence. Here are test cases for the code:
string - 'Tim had been saying that he had been there'
search - 'had'
expected output - 4
string - 'he got what he got and what he wanted'
search - 'he'
expected output - 2
def return_distance(input, search):
    words = input.split()
    distance = None
    indx = []
    if not input or not search:
        return None
    else:
        if words.count(search) > 1:
            indx = [index for index, word in enumerate(words) if word == search]
            distance = indx[1] - indx[0]
            for i in range(len(indx)-1):
                distance = min(distance, indx[i+1] - indx[i])-1
    return distance
I am thinking about how to optimize the code; I admit it is poorly written.
How about
def min_distance_between_words(sentence, word):
    idxes = [i for i, e in enumerate(sentence.split()) if e == word]
    return min([y - x - 1 for x, y in zip(idxes, idxes[1:])])
This splits the input sentence, makes a list of every index that matches the target word, then iterates over this list to compute the differences between adjacent indices and returns the minimum difference.
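For the question's test cases this gives (my quick check):

print(min_distance_between_words('Tim had been saying that he had been there', 'had'))  # 4
print(min_distance_between_words('he got what he got and what he wanted', 'he'))  # 2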
Since behavior is unspecified when the sentence doesn't contain the word, this raises an error; you can add a check for that case and return the value of your choice, using min's default parameter:
def min_distance_between_words(sentence, word):
    idxes = [i for i, e in enumerate(sentence.split()) if e == word]
    return min([y - x - 1 for x, y in zip(idxes, idxes[1:])], default=None)
As an aside, naming a variable input shadows a builtin, and return_distance is a rather ambiguous name for a function.
Adding a precondition check for None, as done with if not input or not search:, is not typically done in Python (we assume the caller will always pass in a string and adhere to the function's contract).
If you want to generalize this further, move the split() duty to the domain of the caller which enables the function to operate on arbitrary iterables:
def min_distance_between_occurrences(it, target):
    idxes = [i for i, e in enumerate(it) if e == target]
    return min([y - x - 1 for x, y in zip(idxes, idxes[1:])], default=None)
Call with:
min_distance_between_occurrences("a b c a".split(), "a")
min_distance_between_occurrences([(1, 2), (1, 3), (1, 2)], (1, 2))
Refactoring aside, as pointed out in the comments, the original code isn't correct. Issues include:
search_str does not exist. You probably meant search.
distance and min_dist don't really work together. Pick one or the other and use it for all minimum calculations.
min(min_dist, indx[i+1] - indx[i])-1 subtracts 1 in the wrong place, throwing off the count.
Here's a potential fix for these issues:
def return_distance(input, search):
    words = input.split()
    distance = None
    if words.count(search) > 1:
        indx = [index for index, word in enumerate(words) if word == search]
        distance = indx[1] - indx[0] - 1
        #                             ^^^^
        for i in range(len(indx) - 1):
            distance = min(distance, indx[i+1] - indx[i] - 1)
            #                                            ^^^^
    return distance
One way is to use min with a list comprehension on indx:
min_dist = min([(indx[i+1] - indx[i]-1) for i in range(len(indx)-1) ])
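For reference, a self-contained version of that idea (the wrapper function and its name are mine, not the poster's):

def min_dist_between(sentence, search):
    # indices of every word equal to the search word
    indx = [i for i, word in enumerate(sentence.split()) if word == search]
    # number of words strictly between each adjacent pair of occurrences
    return min([(indx[i+1] - indx[i] - 1) for i in range(len(indx) - 1)])

print(min_dist_between('he got what he got and what he wanted', 'he'))  # 2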
I'm working on a simple bioinformatics problem. I have a working solution, but it is absurdly inefficient. How can I increase my efficiency?
Problem:
Find the most frequent patterns of length k in the string g, given that each k-mer can have up to d mismatches.
And these strings and patterns are all genomic--so our set of possible characters is {A, T, C, G}.
I'll call the function FrequentWordsMismatch(g, k, d).
So, here are a few helpful examples:
FrequentWordsMismatch('AAAAAAAAAA', 2, 1) → ['AA', 'CA', 'GA', 'TA', 'AC', 'AG', 'AT']
Here's a much longer example, if you implement this and want to test:
FrequentWordsMismatch('CACAGTAGGCGCCGGCACACACAGCCCCGGGCCCCGGGCCGCCCCGGGCCGGCGGCCGCCGGCGCCGGCACACCGGCACAGCCGTACCGGCACAGTAGTACCGGCCGGCCGGCACACCGGCACACCGGGTACACACCGGGGCGCACACACAGGCGGGCGCCGGGCCCCGGGCCGTACCGGGCCGCCGGCGGCCCACAGGCGCCGGCACAGTACCGGCACACACAGTAGCCCACACACAGGCGGGCGGTAGCCGGCGCACACACACACAGTAGGCGCACAGCCGCCCACACACACCGGCCGGCCGGCACAGGCGGGCGGGCGCACACACACCGGCACAGTAGTAGGCGGCCGGCGCACAGCC', 10, 2) → ['GCACACAGAC', 'GCGCACACAC']
With my naive solution, that second example could easily take ~60 seconds, though the first one is pretty quick.
Naive solution:
My idea was to, for every k-length segment in g, find every possible "neighbor" (i.e. other k-length strings with up to d mismatches) and add those neighbors as keys to a dictionary. I then count how many times each one of those neighbor k-mers shows up in the string g, and record those counts in the dictionary.
Obviously that's a kinda shitty way to do it, since the number of neighbors scales like crazy as k and d increase, and having to scan through the string with each of those neighbors makes this implementation terribly slow. But alas, that's why I'm asking for help.
I'll put my code below. There are definitely a lot of novice mistakes to unpack, so thanks for your time and attention.
def FrequentWordsMismatch(g, k, d):
    '''
    Finds the most frequent k-mer patterns in the string g, given that those
    patterns can mismatch amongst themselves up to d times
    g (String): Collection of {A, T, C, G} characters
    k (int): Length of desired pattern
    d (int): Number of allowed mismatches
    '''
    counts = {}
    answer = []
    for i in range(len(g) - k + 1):
        kmer = g[i:i+k]
        for neighborkmer in Neighbors(kmer, d):
            counts[neighborkmer] = Count(neighborkmer, g, d)
    maxVal = max(counts.values())
    for key in counts.keys():
        if counts[key] == maxVal:
            answer.append(key)
    return(answer)
def Neighbors(pattern, d):
    '''
    Find all strings with at most d mismatches to the given pattern
    pattern (String): Original pattern of characters
    d (int): Number of allowed mismatches
    '''
    if d == 0:
        return [pattern]
    if len(pattern) == 1:
        return ['A', 'C', 'G', 'T']
    answer = []
    suffixNeighbors = Neighbors(pattern[1:], d)
    for text in suffixNeighbors:
        if HammingDistance(pattern[1:], text) < d:
            for n in ['A', 'C', 'G', 'T']:
                answer.append(n + text)
        else:
            answer.append(pattern[0] + text)
    return(answer)
def HammingDistance(p, q):
    '''
    Find the hamming distance between two strings
    p (String): String to be compared to q
    q (String): String to be compared to p
    '''
    ham = 0 + abs(len(p)-len(q))
    for i in range(min(len(p), len(q))):
        if p[i] != q[i]:
            ham += 1
    return(ham)
def Count(pattern, g, d):
    '''
    Count the number of times that the pattern occurs in the string g,
    allowing for up to d mismatches
    pattern (String): Pattern of characters
    g (String): String in which we're looking for pattern
    d (int): Number of allowed mismatches
    '''
    return len(MatchWithMismatch(pattern, g, d))

def MatchWithMismatch(pattern, g, d):
    '''
    Find the indices at which the pattern occurs in the string g,
    allowing for up to d mismatches
    pattern (String): Pattern of characters
    g (String): String in which we're looking for pattern
    d (int): Number of allowed mismatches
    '''
    answer = []
    for i in range(len(g) - len(pattern) + 1):
        if(HammingDistance(g[i:i+len(pattern)], pattern) <= d):
            answer.append(i)
    return(answer)
More tests
FrequentWordsMismatch('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4, 1) → ['ATGC', 'ATGT', 'GATG']
FrequentWordsMismatch('AGTCAGTC', 4, 2) → ['TCTC', 'CGGC', 'AAGC', 'TGTG', 'GGCC', 'AGGT', 'ATCC', 'ACTG', 'ACAC', 'AGAG', 'ATTA', 'TGAC', 'AATT', 'CGTT', 'GTTC', 'GGTA', 'AGCA', 'CATC']
FrequentWordsMismatch('AATTAATTGGTAGGTAGGTA', 4, 0) → ["GGTA"]
FrequentWordsMismatch('ATA', 3, 1) → ['GTA', 'ACA', 'AAA', 'ATC', 'ATA', 'AGA', 'ATT', 'CTA', 'TTA', 'ATG']
FrequentWordsMismatch('AAT', 3, 0) → ['AAT']
FrequentWordsMismatch('TAGCG', 2, 1) → ['GG', 'TG']
The problem description is ambiguous in several ways, so I'm going by the examples. You seem to want all k-length strings from the alphabet {A, C, G, T} such that the number of matches to contiguous substrings of g is maximal - where "a match" means character-by-character equality with at most d character inequalities.
I'm ignoring that your HammingDistance() function makes something up even when inputs have different lengths, mostly because it doesn't make much sense to me ;-) , but partly because that isn't needed to get the results you want in any of the examples you gave.
The code below produces the results you want in all the examples, in the sense of producing permutations of the output lists you gave. If you want canonical outputs, I'd suggest sorting an output list before returning it.
The algorithm is pretty simple, but relies on itertools to do the heavy combinatorial lifting "at C speed". All the examples run in well under a second total.
For each length-k contiguous substring of g, consider all combinations(k, d) sets of d distinct index positions. There are 4**d ways to fill those index positions with letters from {A, C, G, T}, and each such way is "a pattern" that matches the substring with at most d discrepancies. Duplicates are weeded out by remembering the patterns already generated; this is faster than making heroic efforts to generate only unique patterns to begin with.
So, in all, the time requirement is O(len(g) * k**d * 4**d) = O(len(g) * (4*k)**d), where k**d is, for reasonably small values of k and d, an overstated stand-in for the binomial coefficient combinations(k, d). The important thing to note is that - unsurprisingly - it's exponential in d.
def fwm(g, k, d):
    from itertools import product, combinations
    from collections import defaultdict

    all_subs = list(product("ACGT", repeat=d))
    all_ixs = list(combinations(range(k), d))
    patcount = defaultdict(int)
    for starti in range(len(g)):
        base = g[starti : starti + k]
        if len(base) < k:
            break
        patcount[base] += 1
        seen = set([base])
        basea = list(base)
        for ixs in all_ixs:
            saved = [basea[i] for i in ixs]
            for newchars in all_subs:
                for i, newchar in zip(ixs, newchars):
                    basea[i] = newchar
                candidate = "".join(basea)
                if candidate not in seen:
                    seen.add(candidate)
                    patcount[candidate] += 1
            for i, ch in zip(ixs, saved):
                basea[i] = ch
    maxcount = max(patcount.values())
    return [p for p, c in patcount.items() if c == maxcount]
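A quick check against the question's first example (my call, not in the original answer; output order may differ, hence the sort):

print(sorted(fwm('AAAAAAAAAA', 2, 1)))
# ['AA', 'AC', 'AG', 'AT', 'CA', 'GA', 'TA']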
EDIT: Generating Patterns Uniquely
Rather than weed out duplicates by keeping a set of those seen so far, it's straightforward enough to prevent generating duplicates to begin with. In fact, the following code is shorter and simpler, although somewhat subtler. In return for less redundant work, there are layers of recursive calls to the inner() function. Which way is faster appears to depend on the specific inputs.
def fwm(g, k, d):
    from collections import defaultdict

    patcount = defaultdict(int)
    alphabet = "ACGT"
    allbut = {ch: tuple(c for c in alphabet if c != ch)
              for ch in alphabet}

    def inner(i, rd):
        if not rd or i == k:
            patcount["".join(base)] += 1
            return
        inner(i+1, rd)
        orig = base[i]
        for base[i] in allbut[orig]:
            inner(i+1, rd-1)
        base[i] = orig

    for i in range(len(g) - k + 1):
        base = list(g[i : i + k])
        inner(0, d)
    maxcount = max(patcount.values())
    return [p for p, c in patcount.items() if c == maxcount]
Going on your problem description alone and not your examples (for the reasons I explained in the comment), one approach would be:
s = "CACAGTAGGCGCCGGCACACACAGCCCCGGGCCCCGGGCCGCCCCGGGCCGGCGGCCGCCGGCGCCGGCACACCGGCACAGC"\
"CGTACCGGCACAGTAGTACCGGCCGGCCGGCACACCGGCACACCGGGTACACACCGGGGCGCACACACAGGCGGGCGCCGGG"\
"CCCCGGGCCGTACCGGGCCGCCGGCGGCCCACAGGCGCCGGCACAGTACCGGCACACACAGTAGCCCACACACAGGCGGGCG"\
"GTAGCCGGCGCACACACACACAGTAGGCGCACAGCCGCCCACACACACCGGCCGGCCGGCACAGGCGGGCGGGCGCACACAC"\
"ACCGGCACAGTAGTAGGCGGCCGGCGCACAGCC"
def frequent_words_mismatch(g, k, d):
    def num_misspellings(x, y):
        return sum(xx != yy for (xx, yy) in zip(x, y))

    seen = set()
    for i in range(len(g)-k+1):
        seen.add(g[i:i+k])
    # For each unique sequence, add a (key, bin) pair to the bins dictionary
    # (The bin is initialized to a list containing only the sequence, for now)
    bins = {seq: [seq,] for seq in seen}
    # Loop again through the unique sequences...
    for seq in seen:
        # Try to fit it in *all* already-existing bins (based on bin key)
        for bk in bins:
            # Don't re-add seq to its own bin
            if bk == seq: continue
            # Test bin keys, try to find all appropriate bins
            if num_misspellings(seq, bk) <= d:
                bins[bk].append(seq)
    # Get a list of the bin keys (one for each unique sequence) sorted in order of the
    # number of elements in the corresponding bins
    sorted_keys = sorted(bins, key=lambda k: len(bins[k]), reverse=True)
    # largest_bin_key will be the key of the largest bin (there may be ties, so in fact
    # this is *a* key of *one of the bins with the largest length*). That is, it'll
    # be the sequence (found in the string) that the most other sequences (also found
    # in the string) are at most d-distance from.
    largest_bin_key = sorted_keys[0]
    # You can return this bin, as your question description (but not examples) indicate:
    return bins[largest_bin_key]

largest_bin = frequent_words_mismatch(s, 10, 2)
print(len(largest_bin))  # 13
print(largest_bin)
This largest bin contains:
['CGGCCGCCGG', 'GGGCCGGCGG', 'CGGCCGGCGC', 'AGGCGGCCGG', 'CAGGCGCCGG',
'CGGCCGGCCG', 'CGGTAGCCGG', 'CGGCGGCCGC', 'CGGGCGCCGG', 'CCGGCGCCGG',
'CGGGCCCCGG', 'CCGCCGGCGG', 'GGGCCGCCGG']
It's O(n**2) where n is the number of unique sequences and completes on my computer in around 0.1 seconds.
I have to write a program that takes a list of numbers as input and returns the maximum sum over subsequences that start and end with the same number (the equal numbers at the beginning and end of the subsequence are included in the sum). It also has to return the positions of the start and end of that subsequence, that is, their index + 1. The problem is that my current code runs smoothly only while the list is fairly short; when the list length grows to 5000, the program does not give an answer.
The input is the following:
6
3 2 4 3 5 6
The first line is for the length of the list. The second line is the list itself, with list items separated by space. The output will be 12, 1, 4, because as you can see there is 1 equal pair of numbers (3): the first and fourth element, so the sum of elements between them is 3 + 2 + 4 + 3 = 12, and their placement is first and fourth.
Here is my code.
length = int(input())
mass = raw_input().split()
for i in range(length):
    mass[i] = int(mass[i])
value = -10000000000
b = 1
e = 1
for i in range(len(mass)):
    if mass[i:].count(mass[i]) != 1:
        for j in range(i, len(mass)):
            if mass[j] == mass[i]:
                f = mass[i:j+1]
                if sum(f) > value:
                    value = sum(f)
                    b = i+1
                    e = j+1
    else:
        if mass[i] > value:
            value = mass[i]
            b = i+1
            e = i+1
print value
print b, e
This should be faster than your current approach.
Rather than searching through mass looking for pairs of matching numbers, we pair each number in mass with its index and sort those pairs. We can then use groupby to find groups of equal numbers. If there are more than two of the same number we use the first and last, since they will have the greatest sum between them.
from operator import itemgetter
from itertools import groupby

raw = '3 5 6 3 5 4'
mass = [int(u) for u in raw.split()]
result = []
a = sorted((u, i) for i, u in enumerate(mass))
for _, g in groupby(a, itemgetter(0)):
    g = list(g)
    if len(g) > 1:
        u, v = g[0][1], g[-1][1]
        result.append((sum(mass[u:v+1]), u+1, v+1))
print(max(result))
output
(19, 2, 5)
Note that this code will not necessarily give the maximum sum between equal elements in the list if the list contains negative numbers. It will still work correctly with negative numbers if no group of equal numbers has more than two members. If that's not the case, we need to use a slower algorithm that tests every pair within a group of equal numbers.
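To make that caveat concrete, here is a small counterexample (my example, reusing the snippet's logic): with three 5s separated by negatives, pairing the first and last occurrence spans the whole list, although the two adjacent 5s would be better.

from operator import itemgetter
from itertools import groupby

mass = [5, -10, 5, -10, 5]
result = []
a = sorted((u, i) for i, u in enumerate(mass))
for _, g in groupby(a, itemgetter(0)):
    g = list(g)
    if len(g) > 1:
        u, v = g[0][1], g[-1][1]
        result.append((sum(mass[u:v+1]), u+1, v+1))
print(max(result))  # (-5, 1, 5), although (0, 1, 3) is the true maximum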
Here's a more efficient version. Instead of using the sum function, we build a list of the cumulative sums of the whole list. This doesn't make much of a difference for small lists, but it's much faster when the list size is large. E.g., for a list of 10,000 elements this approach is about 10 times faster. To test it, I create an array of random positive integers.
from operator import itemgetter
from itertools import groupby
from random import seed, randrange

seed(42)

def maxsum(seq):
    total = 0
    sums = [0]
    for u in seq:
        total += u
        sums.append(total)
    result = []
    a = sorted((u, i) for i, u in enumerate(seq))
    for _, g in groupby(a, itemgetter(0)):
        g = list(g)
        if len(g) > 1:
            u, v = g[0][1], g[-1][1]
            result.append((sums[v+1] - sums[u], u+1, v+1))
    return max(result)

num = 25000
hi = num // 2
mass = [randrange(1, hi) for _ in range(num)]
print(maxsum(mass))
output
(155821402, 21, 24831)
If you're using a recent version of Python you can use itertools.accumulate to build the list of cumulative sums. This is around 10% faster.
from operator import itemgetter
from itertools import accumulate, groupby

def maxsum(seq):
    sums = [0] + list(accumulate(seq))
    result = []
    a = sorted((u, i) for i, u in enumerate(seq))
    for _, g in groupby(a, itemgetter(0)):
        g = list(g)
        if len(g) > 1:
            u, v = g[0][1], g[-1][1]
            result.append((sums[v+1] - sums[u], u+1, v+1))
    return max(result)
Here's a faster version, derived from code by Stefan Pochmann, which uses a dict instead of sorting & groupby. Thanks, Stefan!
def maxsum(seq):
    total = 0
    sums = [0]
    for u in seq:
        total += u
        sums.append(total)
    where = {}
    for i, x in enumerate(seq, 1):
        where.setdefault(x, [i, i])[1] = i
    return max((sums[j] - sums[i - 1], i, j)
               for i, j in where.values())
If the list contains no duplicate items (and hence no subsequences bound by duplicate items) it returns the maximum item in the list.
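For instance, the dict version above reproduces the question's sample answer (my quick check, not part of the original post):

print(maxsum([3, 2, 4, 3, 5, 6]))  # (12, 1, 4), i.e. the expected 12, 1, 4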
Here are two more variations. These can handle negative items correctly, and if there are no duplicate items they return None. In Python 3 that could be handled elegantly by passing default=None to max, but that option isn't available in Python 2, so instead I catch the ValueError exception that's raised when you attempt to find the max of an empty iterable.
The first version, maxsum_combo, uses itertools.combinations to generate all combinations of a group of equal numbers and thence finds the combination that gives the maximum sum. The second version, maxsum_kadane, uses a variation of Kadane's algorithm to find the maximum subsequence within a group.
If there aren't many duplicates in the original sequence, so the average group size is small, maxsum_combo is generally faster. But if the groups are large, then maxsum_kadane is much faster than maxsum_combo. The code below tests these functions on random sequences of 15000 items, firstly on sequences with few duplicates (and hence small mean group size) and then on sequences with lots of duplicates. It verifies that both versions give the same results, and then it performs timeit tests.
from __future__ import print_function
from itertools import groupby, combinations
from random import seed, randrange
from timeit import Timer

seed(42)

def maxsum_combo(seq):
    total = 0
    sums = [0]
    for u in seq:
        total += u
        sums.append(total)
    where = {}
    for i, x in enumerate(seq, 1):
        where.setdefault(x, []).append(i)
    try:
        return max((sums[j] - sums[i - 1], i, j)
            for v in where.values() for i, j in combinations(v, 2))
    except ValueError:
        return None

def maxsum_kadane(seq):
    total = 0
    sums = [0]
    for u in seq:
        total += u
        sums.append(total)
    where = {}
    for i, x in enumerate(seq, 1):
        where.setdefault(x, []).append(i)
    try:
        return max(max_sublist([(sums[j] - sums[i-1], i, j)
            for i, j in zip(v, v[1:])], k)
            for k, v in where.items() if len(v) > 1)
    except ValueError:
        return None

# Kadane's Algorithm to find maximum sublist
# From https://en.wikipedia.org/wiki/Maximum_subarray_problem
def max_sublist(seq, k):
    max_ending_here = max_so_far = seq[0]
    for x in seq[1:]:
        y = max_ending_here[0] + x[0] - k, max_ending_here[1], x[2]
        max_ending_here = max(x, y)
        max_so_far = max(max_so_far, max_ending_here)
    return max_so_far

def test(num, hi, loops):
    print('\nnum = {0}, hi = {1}, loops = {2}'.format(num, hi, loops))
    print('Verifying...')
    for k in range(5):
        mass = [randrange(-hi // 2, hi) for _ in range(num)]
        a = maxsum_combo(mass)
        b = maxsum_kadane(mass)
        print(a, b, a == b)
    print('\nTiming...')
    for func in maxsum_combo, maxsum_kadane:
        t = Timer(lambda: func(mass))
        result = sorted(t.repeat(3, loops))
        result = ', '.join([format(u, '.5f') for u in result])
        print('{0:14} : {1}'.format(func.__name__, result))

loops = 20
num = 15000
hi = num // 4
test(num, hi, loops)

loops = 10
hi = num // 100
test(num, hi, loops)
output
num = 15000, hi = 3750, loops = 20
Verifying...
(13983131, 44, 14940) (13983131, 44, 14940) True
(13928837, 27, 14985) (13928837, 27, 14985) True
(14057416, 40, 14995) (14057416, 40, 14995) True
(13997395, 65, 14996) (13997395, 65, 14996) True
(14050007, 12, 14972) (14050007, 12, 14972) True
Timing...
maxsum_combo : 1.72903, 1.73780, 1.81138
maxsum_kadane : 2.17738, 2.22108, 2.22394
num = 15000, hi = 150, loops = 10
Verifying...
(553789, 21, 14996) (553789, 21, 14996) True
(550174, 1, 14992) (550174, 1, 14992) True
(551017, 13, 14991) (551017, 13, 14991) True
(554317, 2, 14986) (554317, 2, 14986) True
(558663, 15, 14988) (558663, 15, 14988) True
Timing...
maxsum_combo : 7.29226, 7.34213, 7.36688
maxsum_kadane : 1.07532, 1.07695, 1.10525
This code runs on both Python 2 and Python 3. The above results were generated on an old 32 bit 2GHz machine running Python 2.6.6 on a Debian derivative of Linux. The speeds for Python 3.6.0 are similar.
If you want to include groups that consist of a single non-repeated number, and also want to include the numbers that are in groups as a "subsequence" of length 1, you can use this version:
def maxsum_kadane(seq):
    if not seq:
        return None
    total = 0
    sums = [0]
    for u in seq:
        total += u
        sums.append(total)
    where = {}
    for i, x in enumerate(seq, 1):
        where.setdefault(x, []).append(i)
    # Find the maximum of the single items
    m_single = max((k, v[0], v[0]) for k, v in where.items())
    # Find the maximum of the subsequences
    try:
        m_subseq = max(max_sublist([(sums[j] - sums[i-1], i, j)
            for i, j in zip(v, v[1:])], k)
            for k, v in where.items() if len(v) > 1)
        return max(m_single, m_subseq)
    except ValueError:
        # No subsequences
        return m_single
I haven't tested it extensively, but it should work. ;)
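A quick check on the question's sample input (my call; it assumes max_sublist from the earlier block is in scope):

print(maxsum_kadane([3, 2, 4, 3, 5, 6]))  # (12, 1, 4)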
Fancy title :)
I have a file that contains the following:
>sequence_40
ABCDABDCABCDBACDBACDBACDBACDABDCDC
ACDCCDCABDCADCADBCACBDCABD
>sequence_41
DCBACDBACDADCDCDCABCDCACBDCBDACBDC
BCDBABABBABACDCDBCACDBACDBACDBACDC
BCDB
...
Then, I have a function that returns a dictionary (called dictionar in the code below) with the sequence names as keys and the strings (combined on one line) as values. The sequence numbers range from 40 to 59.
I want to take a dictionary of sequences and return the longest common sub-sequence found in ALL the sequences. I managed to find some help here on Stack Overflow and wrote code, but it only compares the LAST TWO strings in that dictionary, not all of them :).
This is the code
def longest_common_sequence(s1, s2):
    m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]

for i in range(40, 59):
    s1 = str(dictionar['sequence_' + str(i)])
    s2 = str(dictionar['sequence_' + str(i+1)])
    longest_common_sequence(s1, s2)
How can I modify it to get the common subsequence among ALL the sequences in the dictionary? Thanks!
EDIT: As #lmcarreiro pointed out, there is a relevant difference between substrings (or subarrays or sublists) and subsequences. To my understanding we are all talking about substrings here, so I will use this term in my answer.
Guillaume's answer can be improved:
def eachPossibleSubstring(string):
    for size in range(len(string) + 1, 0, -1):
        for start in range(len(string) - size + 1):
            yield string[start:start+size]

def findLongestCommonSubstring(strings):
    shortestString = min(strings, key=len)
    for substring in eachPossibleSubstring(shortestString):
        if all(substring in string
               for string in strings if string != shortestString):
            return substring

print findLongestCommonSubstring([
    'ABCDABDCABCDBACDBACDBACDBACDABDCDCACDCCDCABDCADCADBCACBDCABD',
    'DCBACDBACDADCDCDCABCDCACBDCBDACBDCBCDBABABBABACDCDBCACDBACDBACDBACDCBCDB',
])
This prints:
ACDBACDBACDBACD
This is faster because I return the first found and search from longest to shortest.
The basic idea is this: Take each possible substring of the shortest of your strings (in the order from the longest to the shortest) and see if this substring can be found in all other strings. If so, return it, otherwise try the next substring.
You need to understand generators. Try, e.g., this:
for substring in eachPossibleSubstring('abcd'):
    print substring
or
print list(eachPossibleSubstring('abcd'))
I'd start by defining a function to return all possible subsequences of a given sequence:
from itertools import combinations_with_replacement

def subsequences(sequence):
    "returns all possible subsequences of a given sequence"
    # range(len(sequence) + 1) so that slices can reach the end of the sequence
    for start, stop in combinations_with_replacement(range(len(sequence) + 1), 2):
        if start < stop:
            yield sequence[start:stop]
then I'd make another method to check if a given subsequence is present in all given sequences:
def is_common_subsequence(sub, sequences):
    "returns True if <sub> is a common subsequence in all <sequences>"
    return all(sub in sequence for sequence in sequences)
then using the 2 methods above it is pretty easy to get all common subsequences in a given set of sequences:
def common_sequences(sequences):
    "return all subsequences common in sequences"
    shortest_seq = min(sequences, key=len)
    return set(subsequence for subsequence in subsequences(shortest_seq)
               if is_common_subsequence(subsequence, sequences))
... and extracting the longest sequence:
def longuest_common_subsequence(sequences):
    "returns the longuest subsequence in sequences"
    return max(common_sequences(sequences), key=len)
Result:
sequences = {
    41: 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
    42: '123ABCDEFGHIJKLMNOPQRSTUVW',
    43: '123456ABCDEFGHIJKLMNOPQRST'
}
sequences2 = {
    0: 'ABCDEFGHIJ',
    1: 'DHSABCDFKDDSA',
    2: 'SGABCEIDEFJRNF'
}
print(longuest_common_subsequence(sequences.values()))
>>> ABCDEFGHIJKLMNOPQRST
print(longuest_common_subsequence(sequences2.values()))
>>> ABC
Here you have a possible approach. First let's define a function that returns the longest common substring between two strings:
def longest_substring(s1, s2):
    t = [[0]*(1+len(s2)) for i in range(1+len(s1))]
    l, xl = 0, 0
    for x in range(1, 1+len(s1)):
        for y in range(1, 1+len(s2)):
            if s1[x-1] == s2[y-1]:
                t[x][y] = t[x-1][y-1] + 1
                if t[x][y] > l:
                    l = t[x][y]
                    xl = x
            else:
                t[x][y] = 0
    return s1[xl-l: xl]
Now I'll create a random dict of sequences for the example:
import random
import string
d = {i : ''.join(random.choice(string.ascii_uppercase) for _ in range(50)) for i in range(10)}
print d
{0: 'ASCUCEVJNIGWVMWMBBQQBZYBBNGQAJRYXACGFEIFWHMBCNYRGL', 1: 'HKUKZOJJUCRTSBLNZXCIBARLPNAPAABRBZEVGVILJAFCGWGQVV', 2: 'MMHCYPKECRJFEWTGYITMHZSNHAFEZVFYDAVILRYRKIDDBEFRVX', 3: 'DGBULRFJINFZEELDASRFBIRSADWMRAYMGCDAOJDKQIMXIRLTEI', 4: 'VDUFWZSXLRGOIMAHOAMZAIWDPTHDVDXUACRBASJMCUHREDORRH', 5: 'RFGAVHOWNKRZMYMSFSSNUGCKEWUNVETCDWJXSPBJHKSTPFNSJO', 6: 'HFMLMHCFSOEXBXWFAROIRGJNPRTKRWCEPLFOKGMXNUPCPWREWX', 7: 'CNPGSHGVIRLDXAADXUVWCTJCXUHQLALBUOJMXQBKXWHKGSJHEH', 8: 'UWDXXTRCFNCBUBEYGYTDWTPLNTRHYQWKTHPRVCBAWIMNGHULDC', 9: 'OOCJRXBZKJIGHZEJOOIKWKMQKIEQVPEDTFPJQAUQKJQVLOMGJB'}
Finally, we need to find the longest common substring across all pairs of sequences:
import itertools
max([longest_substring(i,j) for i,j in itertools.combinations(d.values(), 2)], key=len)
Output:
'VIL'