Motif search with Gibbs sampler - python

I am a beginner in both programming and bioinformatics. So, I would appreciate your understanding. I tried to develop a python script for motif search using Gibbs sampling as explained in Coursera class, "Finding Hidden Messages in DNA". The pseudocode provided in the course is:
GIBBSSAMPLER(Dna, k, t, N)
randomly select k-mers Motifs = (Motif1, …, Motift) in each string
from Dna
BestMotifs ← Motifs
for j ← 1 to N
i ← Random(t)
Profile ← profile matrix constructed from all strings in Motifs
except for Motifi
Motifi ← Profile-randomly generated k-mer in the i-th sequence
if Score(Motifs) < Score(BestMotifs)
BestMotifs ← Motifs
return BestMotifs
Problem description:
CODE CHALLENGE: Implement GIBBSSAMPLER.
Input: Integers k, t, and N, followed by a collection of strings Dna.
Output: The strings BestMotifs resulting from running GIBBSSAMPLER(Dna, k, t, N) with
20 random starts. Remember to use pseudocounts!
Sample Input:
8 5 100
CGCCCCTCTCGGGGGTGTTCAGTAACCGGCCA
GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG
TAGTACCGAGACCGAAAGAAGTATACAGGCGT
TAGATCAAGTTTCAGGTGCACGTCGGTGAACC
AATCCACCAGCTCCACGTGCAATGTTGGCCTA
Sample Output:
TCTCGGGG
CCAAGGTG
TACAGGCG
TTCAGGTG
TCCACGTG
I followed the pseudocode to the best of my knowledge. Here is my code:
def BuildProfileMatrix(dnamatrix):
ProfileMatrix = [[1 for x in xrange(len(dnamatrix[0]))] for x in xrange(4)]
indices = {'A':0, 'C':1, 'G': 2, 'T':3}
for seq in dnamatrix:
for i in xrange(len(dnamatrix[0])):
ProfileMatrix[indices[seq[i]]][i] += 1
ProbMatrix = [[float(x)/sum(zip(*ProfileMatrix)[0]) for x in y] for y in ProfileMatrix]
return ProbMatrix
def ProfileRandomGenerator(profile, dna, k, i):
indices = {'A':0, 'C':1, 'G': 2, 'T':3}
score_list = []
for x in xrange(len(dna[i]) - k + 1):
probability = 1
window = dna[i][x : k + x]
for y in xrange(k):
probability *= profile[indices[window[y]]][y]
score_list.append(probability)
rnd = uniform(0, sum(score_list))
current = 0
for z, bias in enumerate(score_list):
current += bias
if rnd <= current:
return dna[i][z : k + z]
def score(motifs):
ProfileMatrix = [[0 for x in xrange(len(motifs[0]))] for x in xrange(4)]
indices = {'A':0, 'C':1, 'G': 2, 'T':3}
for seq in motifs:
for i in xrange(len(motifs[0])):
ProfileMatrix[indices[seq[i]]][i] += 1
score = len(motifs)*len(motifs[0]) - sum([max(x) for x in zip(*ProfileMatrix)])
return score
from random import randint, uniform
def GibbsSampler(k, t, N):
dna = ['CGCCCCTCTCGGGGGTGTTCAGTAACCGGCCA',
'GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG',
'TAGTACCGAGACCGAAAGAAGTATACAGGCGT',
'TAGATCAAGTTTCAGGTGCACGTCGGTGAACC',
'AATCCACCAGCTCCACGTGCAATGTTGGCCTA']
Motifs = []
for i in [randint(0, len(dna[0])-k) for x in range(len(dna))]:
j = 0
kmer = dna[j][i : k+i]
j += 1
Motifs.append(kmer)
BestMotifs = []
s_best = float('inf')
for i in xrange(N):
x = randint(0, t-1)
Motifs.pop(x)
profile = BuildProfileMatrix(Motifs)
Motif = ProfileRandomGenerator(profile, dna, k, x)
Motifs.append(Motif)
s_motifs = score(Motifs)
if s_motifs < s_best:
s_best = s_motifs
BestMotifs = Motifs
return [s_best, BestMotifs]
k, t, N =8, 5, 100
best_motifs = [float('inf'), None]
# Repeat the Gibbs sampler search 20 times.
for repeat in xrange(20):
current_motifs = GibbsSampler(k, t, N)
if current_motifs[0] < best_motifs[0]:
best_motifs = current_motifs
# Print and save the answer.
print '\n'.join(best_motifs[1])
Unfortunately, my code never gives the same output as the solved example. Besides, while trying to debug the code I found that I get weird scores that define the mismatches between motifs. However, when I tried to run the score function separately, it worked perfectly.
Each time I run the script, the output changes, but anyway here is an example of one of the outputs for the input present in the code:
Example output of my code
TATGTGTA
TATGTGTA
TATGTGTA
GGTGTTCA
TATACAGG
Could you please help me debug this code?!! I spent the whole day trying to find out what's wrong with it although I know it might be some silly mistake I made, but my eye failed to catch it.
Thank you all!!

Finally, I found out what was wrong in my code! It was in line 54:
Motifs.append(Motif)
After randomly removing one of the motifs, followed by building a profile out of these motifs then randomly selecting a new motif based on this profile, I should have added the selected motif in the same position before removal NOT appended to the end of the motif list.
Now, the correct code is:
Motifs.insert(x, Motif)
The new code worked as expected.

Related

Longest common substring with rolling hash

I am implementing in Python3 an algorithm to find the longest substring of two strings s and t. Given s and t, I need to return (a,b,l) where l is the length of the longest common substring, a is the position in s where the longest substring starts, and b is the position in t where the longest substring starts. I have a working version of the algorithm but it is quite slow and I am not sure why; it is frustrating because I have found other implementations in python using pretty much the same logic that are many times faster. I am self-learning so any help would be greatly appreciated.
The approach is based on comparing hash values rather than directly comparing substrings and using binary search to find maximal length of common substrings. Here is the code for my hash function (m is a big prime and x is just some constant):
def polynomial_hash(my_string, m, x):
str_len = len(my_string)
result = 0
for i in range(str_len):
result = (result + ord(my_string[i]) * power_mod_p(x, i, m)) % m
return result
Given two strings s and t, I first find which string is shorter, without loss of generality, let s be the shorter string. First I need to find the hash values of substrings of a string. I use the following function, implemented as a generator:
def all_length_k_hashes(my_string, k, m, x):
current_position = len(my_string) - k
x_to_the_k = power_mod_p(x, k, m)
hash_value = polynomial_hash(my_string[current_position:], m, x)
yield (hash_value, current_position)
while current_position > 0:
current_position = current_position - 1
hash_value = ((hash_value * x) + ord(my_string[current_position]) - x_to_the_k*ord(my_string[current_position + k])) % m
yield (hash_value, current_position)
This function is simple, its first yield is the hash value of the final length k substring of the string, after that each of its iteration is the hash value of the next length k substring to its left (we move left by one position, for example for k=3 from abcdefghi to abcdefghi then from abcdefghi to abcdefghi). This should be able to calculate all the hash values of all length k substrings of my_string in O(|my_string|).
Now I find out if s and t has a length k substring in common, I use the following function:
def common_sub_string_length_k(shorter_str, longer_str, k, m, x):
short_str_dict = dict()
for hash_and_index in all_length_k_hashes(shorter_str, k, m, x):
short_str_dict.update({hash_and_index[0]: hash_and_index[1]})
hash_generator_longer_str = all_length_k_hashes(longer_str, k, m, x)
for hash_and_index in hash_generator_longer_str:
if hash_and_index[0] in short_str_dict:
return (short_str_dict[hash_and_index[0]], hash_and_index[1])
return False
What is happening in this function is: I create a Python empty dictionary and fill it with (key:values) such that each key is the hash value of a length k substring of the shorter string and its value is that substring's starting index, I call this 'short_str_dict'
Then, using all_length_k_hashes, I create a generator of hash values of substrings of length k of the longer string, then I iterate through this generator to check if there is a hash value that's in the 'short_str_dict', if there is, then the two strings have a substring of length k in common (assuming no hash collisions). This whole process should take time O(|shorter_string| + |longer_string|)
Finally, the following function repeatedly uses the previous process to find the maximal k, using a binary search technique:
def longest_common_substring(str_1, str_2):
m_1 = 309000599
m_2 = 988017827
x = randint(1, 10 ** 6)
len_str_1 = len(str_1)
len_str_2 = len(str_2)
if len_str_1 <= len_str_2:
short_str = str_1
long_str = str_2
switched = False
else:
short_str = str_2
long_str = str_1
switched = True
len_short_str = len(short_str)
len_long_str = len(long_str)
low = 0
high = len_short_str
mid = 0
longest_so_far = 0
longest_indices = (0,0)
while low <= high:
mid = (high + low) // 2
m1_result = common_sub_string_length_k(short_str, long_str, mid, m_1, x)
m2_result = common_sub_string_length_k(short_str, long_str, mid, m_2, x)
if m1_result is False or m2_result is False:
high = mid - 1
else:
longest_so_far = mid
longest_indices = m1_result
low = mid + 1
if switched:
return (longest_indices[1], longest_indices[0], longest_so_far)
else:
return (longest_indices[0], longest_indices[1], longest_so_far)
Two different hashes are used to reduce the probability of a collision. So in total, assuming no collisions, this whole process should take
O(log|shorter_string|) * O(|shorter_string| + |longer_string|).
Have I made any error? Is it slow because of the use of Python dictionaries? I really want to understand my mistake. Any help is greatly appreciated.

Python Text processing project, Type Error: 'int' object is not iterable

Working on a project to decipher real text from gibberish, I found this code on Github and have made some slight edits to better fit my needs. When testing I keep getting a TypeError on line 17, return [c.lower() for c in line if c.lower() in accepted_chars].
import math
import pickle
accepted_chars = 'abcdefghijklmnopqrstuvwxyz '
pos = dict([(char, idx) for idx, char in enumerate(accepted_chars)])
def normalize(line):
"""Return only the subset of chars from accepted_chars.
This helps keep the model relatively small by ignoring punctuation,
infrequenty symbols, etc. """
return [c.lower() for c in line if c.lower() in accepted_chars]
def ngram(n, l):
"""Return all n grams from l after normalizing """
filtered = normalize(l)
for start in range(0, len(filtered) - n + 1):
yield ''.join(filtered[start:start + n])
def train():
""" Write a simple model as a pickle file"""
k = len(accepted_chars)
# Assume we have seen 10 of each character pair. This acts as a kind of
# prior or smoothing factor. This way, if we see a character transition
# live that we've never observed in the past, we won't assume the entire
# string has 0 probability.
counts = [[10 for i in range(k)] for i in range(k)]
# Count transitions from big text file, taken
# from http://norvig.com/spell-correct.html
for line in open('big.txt'):
for a,b in ngram(2, line):
counts[pos[a]][pos[b]] += 1
# Normalize the counts so that they become log probabilities.
# We use log probabilities rather than straight probabilities to avoid
# numeric underflow issues with long texts.
# This contains a justification:
# http://squarecog.wordpress.com/2009/01/10/dealing-with-underflow-in-joint-probability-calculations/
for i, row in enumerate(counts):
s = float(sum(row))
for j in range(len(row)):
row[j] = math.log(row[j] / s)
# Find the probability of generating a few arbitrarily choosen good and
# bad phrases.
good_probs = [avg_transition_prob(l, counts) for l in open('good.txt')]
bad_probs = [avg_transition_prob(l, counts) for l in open('bad.txt')]
# Assert that we actually are capable of detecting the junk.
assert min(good_probs) > max(bad_probs)
#And pick a threshhold halfway between the worst good and best bad inputs.
thresh = (min(good_probs) + max(bad_probs)) / 2
pickle.dump({'mat': counts, 'thresh': thresh}, open('gib.model.pki', 'wb'))
def avg_transition_prob(l, log_prob_mat):
""" Return the average transition prob from l through log_prob_mat """
log_prob = 0.0
transition_ct = 0
for a, b in ngram(2,1):
log_prob += log_prob_mat[pos[a]][pos[b]]
transition_ct += 1
return math.exp(log_prob / (transition_ct or 1))
if __name__ == '__main__':
train()
ngram(2,1) calls normalize with the second parameter
normalize then does this:
return [c.lower() for c in line if c.lower() in accepted_chars]
Thus, you can't do for c in 1
Maybe you meant to put l there instead?

Compute counts of set partitions with mulitplicity and without order

I do have a piece of code that compute partitions of a set of (potentialy duplicated) integers. But i am interested in the set of possible partition and there multiplicity.
You can for exemple launch the follwoing code :
import numpy as np
from collections import Counter
import pandas as pd
def _B(i):
# for a given multiindex i, we defined _B(i) as the set of integers containg i_j times the number j:
if len(i) != 1:
B = []
for j in range(len(i)):
B.extend(i[j]*[j])
else:
B = i*[0]
return B
def _partition(collection):
# from here: https://stackoverflow.com/a/62532969/8425270
if len(collection) == 1:
yield (collection,)
return
first = collection[0]
for smaller in _partition(collection[1:]):
# insert `first` in each of the subpartition's subsets
for n, subset in enumerate(smaller):
yield smaller[:n] + ((first,) + subset,) + smaller[n + 1 :]
# put `first` in its own subset
yield ((first,),) + smaller
def to_list(tpl):
# the final hierarchy is
return list(list(i) if isinstance(i, tuple) else i for i in tpl)
def _Pi(inst_B):
# inst_B must be a tuple
if type(inst_B) != tuple :
inst_B = tuple(inst_B)
pp = [tuple(sorted(p)) for p in _partition(inst_B)]
c = Counter(pp)
Pi = c.keys()
N = list()
for pi in Pi:
N.append(c[pi])
Pi = [to_list(pi) for pi in Pi]
return Pi, N
if __name__ == "__main__":
import cProfile
pr = cProfile.Profile()
pr.enable()
sh = (3, 3, 3)
rez = list()
rez_sorted= list()
rez_ref = list()
for idx in np.ndindex(sh):
if sum(idx) > 0:
print(idx)
Pi, N = _Pi(_B(idx))
print(pd.DataFrame({'Pi': Pi, 'N': N * np.array([np.math.factorial(len(pi) - 1) for pi in Pi])}))
pr.disable()
# after your program ends
pr.print_stats(sort="tottime")
This code computes, for several examples of tuples of integer numbers (generated by np.ndindex) the partitions and counts i need. Everything happens in the _partition and the _Pi functions, this is were you should look at.
If you look closely at how these two functions are working, you'll see that they comput eevery potential partition and THEN count up how many times they appeared. For small problems, this is fine, but if the size of the prolbme increase, this starts to take a looooot of time. Try setting sh = (5,5,5), you'll see what i mean;
So the problem is the following :
Is there a way to compute directly the partitions and there number of occurences instead ?
Edit: I cross-posted on mathoverflow there, and they propose a solution in this article, in corrolary 2.10 (page 10 of the pdf). The problem could be solved by implmenting the sets p(v,r) in this corrolary.
I was hoping, as in the univariate case, that those sets would have a nice recursive expression but i ould not find one yet.
More Edit : This problem is equivalent to finding all (multiset)-partitions of a multiset. If the solution for finding (set)-partitions of a set is given by Bell partial polynomials, here we need multivariate version of these polynomials.

Find longest adjacent repeating non-overlapping substring

(This question isn't about music but I'm using music as an example of
a use case.)
In music a common way to structure phrases is as a sequence of notes
where the middle part is repeated one or more times. Thus, the phrase
consists of an introduction, a looping part and an outro. Here is one
example:
[ E E E F G A F F G A F F G A F C D ]
We can "see" that the intro is [ E E E] the repeating part is [ F G A
F ] and the outro is [ C D ]. So the way to split the list would be
[ [ E E E ] 3 [ F G A F ] [ C D ] ]
where the first item is the intro, the second number of times the
repeating part is repeated and the third part the outro.
I need an algorithm to perform such a split.
But there is one caveat which is that there may be multiple way to
split the list. For example, the above list could be split into:
[ [ E E E F G A ] 2 [ F F G A ] [ F C D ] ]
But this is a worse split because the intro and outro is longer. So
the criteria for the algorithm is to find the split that maximizes the
length of the looping part and minimizes the combined length of the
intro and outro. That means that the correct split for
[ A C C C C C C C C C A ]
is
[ [ A ] 9 [ C ] [ A ] ]
because the combined length of the intro and outro is 2 and the length
of the looping part is 9.
Also, while the intro and outro can be empty, only "true" repeats are
allowed. So the following split would be disallowed:
[ [ ] 1 [ E E E F G A F F G A F F G A F C D ] [ ] ]
Think of it as finding the optimal "compression" for the
sequence. Note that there may not be any repeats in some sequences:
[ A B C D ]
For these degenerate cases, any sensible result is allowed.
Here is my implementation of the algorithm:
def find_longest_repeating_non_overlapping_subseq(seq):
candidates = []
for i in range(len(seq)):
candidate_max = len(seq[i + 1:]) // 2
for j in range(1, candidate_max + 1):
candidate, remaining = seq[i:i + j], seq[i + j:]
n_reps = 1
len_candidate = len(candidate)
while remaining[:len_candidate] == candidate:
n_reps += 1
remaining = remaining[len_candidate:]
if n_reps > 1:
candidates.append((seq[:i], n_reps,
candidate, remaining))
if not candidates:
return (type(seq)(), 1, seq, type(seq)())
def score_candidate(candidate):
intro, reps, loop, outro = candidate
return reps - len(intro) - len(outro)
return sorted(candidates, key = score_candidate)[-1]
I'm not sure it is correct, but it passes the simple tests I've
described. The problem with it is that it is way to slow. I've looked
at suffix trees but they don't seem to fit my use case because the
substrings I'm after should be non-overlapping and adjacent.
Here's a way that's clearly quadratic-time, but with a relatively low constant factor because it doesn't build any substring objects apart from those of length 1. The result is a 2-tuple,
bestlen, list_of_results
where bestlen is the length of the longest substring of repeated adjacent blocks, and each result is a 3-tuple,
start_index, width, numreps
meaning that the substring being repeated is
the_string[start_index : start_index + width]
and there are numreps of those adjacent. It will always be that
bestlen == width * numreps
The problem description leaves ambiguities. For example, consider this output:
>>> crunch2("aaaaaabababa")
(6, [(0, 1, 6), (0, 2, 3), (5, 2, 3), (6, 2, 3), (0, 3, 2)])
So it found 5 ways to view "the longest" stretch as being of length 6:
The initial "a" repeated 6 times.
The initial "aa" repeated 3 times.
The leftmost instance of "ab" repeated 3 times.
The leftmost instance of "ba" repeated 3 times.
The initial "aaa" repeated 2 times.
It doesn't return the intro or outro because those are trivial to deduce from what it does return:
The intro is the_string[: start_index].
The outro is the_string[start_index + bestlen :].
If there are no repeated adjacent blocks, it returns
(0, [])
Other examples (from your post):
>>> crunch2("EEEFGAFFGAFFGAFCD")
(12, [(3, 4, 3)])
>>> crunch2("ACCCCCCCCCA")
(9, [(1, 1, 9), (1, 3, 3)])
>>> crunch2("ABCD")
(0, [])
The key to how it works: suppose you have adjacent repeated blocks of width W each. Then consider what happens when you compare the original string to the string shifted left by W:
... block1 block2 ... blockN-1 blockN ...
... block2 block3 ... blockN ... ...
Then you get (N-1)*W consecutive equal characters at the same positions. But that also works in the other direction: if you shift left by W and find (N-1)*W consecutive equal characters, then you can deduce:
block1 == block2
block2 == block3
...
blockN-1 == blockN
so all N blocks must be repetitions of block1.
So the code repeatedly shifts (a copy of) the original string left by one character, then marches left to right over both identifying the longest stretches of equal characters. That only requires comparing a pair of characters at a time. To make "shift left" efficient (constant time), the copy of the string is stored in a collections.deque.
EDIT: update() did far too much futile work in many cases; replaced it.
def crunch2(s):
from collections import deque
# There are zcount equal characters starting
# at index starti.
def update(starti, zcount):
nonlocal bestlen
while zcount >= width:
numreps = 1 + zcount // width
count = width * numreps
if count >= bestlen:
if count > bestlen:
results.clear()
results.append((starti, width, numreps))
bestlen = count
else:
break
zcount -= 1
starti += 1
bestlen, results = 0, []
t = deque(s)
for width in range(1, len(s) // 2 + 1):
t.popleft()
zcount = 0
for i, (a, b) in enumerate(zip(s, t)):
if a == b:
if not zcount: # new run starts here
starti = i
zcount += 1
# else a != b, so equal run (if any) ended
elif zcount:
update(starti, zcount)
zcount = 0
if zcount:
update(starti, zcount)
return bestlen, results
Using regexps
[removed this due to size limit]
Using a suffix array
This is the fastest I've found so far, although can still be provoked into quadratic-time behavior.
Note that it doesn't much matter whether overlapping strings are found. As explained for the crunch2() program above (here elaborated on in minor ways):
Given string s with length n = len(s).
Given ints i and j with 0 <= i < j < n.
Then if w = j-i, and c is the number of leading characters in common between s[i:] and s[j:], the block s[i:j] (of length w) is repeated, starting at s[i], a total of 1 + c // w times.
The program below follows that directly to find all repeated adjacent blocks, and remembers those of maximal total length. Returns the same results as crunch2(), but sometimes in a different order.
A suffix array eases the search, but hardly eliminates it. A suffix array directly finds <i, j> pairs with maximal c, but only limits the search to maximize w * (1 + c // w). Worst cases are strings of the form letter * number, like "a" * 10000.
I'm not giving the code for the sa module below. It's long-winded and any implementation of suffix arrays will compute the same things. The outputs of suffix_array():
sa is the suffix array, the unique permutation of range(n) such that for all i in range(1, n), s[sa[i-1]:] < s[sa[i]:].
rank isn't used here.
For i in range(1, n), lcp[i] gives the length of the longest common prefix between the suffixes starting at sa[i-1] and sa[i].
Why does it win? In part because it never has to search for suffixes that start with the same letter (the suffix array, by construction, makes them adjacent), and checking for a repeated block, and for whether it's a new best, takes small constant time regardless of how large the block or how many times it's repeated. As above, that's just trivial arithmetic on c and w.
Disclaimer: suffix arrays/trees are like continued fractions for me: I can use them when I have to, and can marvel at the results, but they give me a headache. Touchy, touchy, touchy.
def crunch4(s):
from sa import suffix_array
sa, rank, lcp = suffix_array(s)
bestlen, results = 0, []
n = len(s)
for sai in range(n-1):
i = sa[sai]
c = n
for saj in range(sai + 1, n):
c = min(c, lcp[saj])
if not c:
break
j = sa[saj]
w = abs(i - j)
if c < w:
continue
numreps = 1 + c // w
assert numreps > 1
total = w * numreps
if total >= bestlen:
if total > bestlen:
results.clear()
bestlen = total
results.append((min(i, j), w, numreps))
return bestlen, results
Some timings
I read a modest file of English words into a string, xs. One word per line:
>>> len(xs)
209755
>>> xs.count('\n')
25481
So about 25K words in about 210K bytes. These are quadratic-time algorithms, so I didn't expect it to go fast, but crunch2() was still running after hours - and still running when I let it go overnight.
Which caused me to realize its update() function could do an enormous amount of futile work, making the algorithm more like cubic-time overall. So I repaired that. Then:
>>> crunch2(xs)
(44, [(63750, 22, 2)])
>>> xs[63750 : 63750+50]
'\nelectroencephalograph\nelectroencephalography\nelec'
That took about 38 minutes, which was in the ballpark of what I expected.
The regexp version crunch3() took less than a tenth of a second!
>>> crunch3(xs)
(8, [(19308, 4, 2), (47240, 4, 2)])
>>> xs[19308 : 19308+10]
'beriberi\nB'
>>> xs[47240 : 47240+10]
'couscous\nc'
As explained before, the regexp version may not find the best answer, but something else is at work here: by default, "." doesn't match a newline, so the code was actually doing many tiny searches. Each of the ~25K newlines in the file effectively ends the local search range. Compiling the regexp with the re.DOTALL flag instead (so newlines aren't treated specially):
>>> crunch3(xs) # with DOTALL
(44, [(63750, 22, 2)])
in a bit over 14 minutes.
Finally,
>>> crunch4(xs)
(44, [(63750, 22, 2)])
in a bit under 9 minutes. The time to build the suffix array was an insignificant part of that (less than a second). That's actually pretty impressive, since the not-always-right brute force regexp version is slower despite running almost entirely "at C speed".
But that's in a relative sense. In an absolute sense, all of these are still pig slow :-(
NOTE: the version in the next section cuts this to under 5 seconds(!).
Enormously faster
This one takes a completely different approach. For the largish dictionary example above, it gets the right answer in less than 5 seconds.
I'm rather proud of this ;-) It was unexpected, and I haven't seen this approach before. It doesn't do any string searching, just integer arithmetic on sets of indices.
It remains dreadfully slow for inputs of the form letter * largish_integer. As is, it keeps going up by 1 so long as at least two (not necessarily adjacent, or even non-overlapping!) copies of a substring (of the current length being considered) exist. So, for example, in
'x' * 1000000
it will try all substring sizes from 1 through 999999.
However, looks like that could be greatly improved by doubling the current size (instead of just adding 1) repeatedly, saving the classes as it goes along, doing a mixed form of binary search to locate the largest substring size for which a repetition exists.
Which I'll leave as a doubtless tedious exercise for the reader. My work here is done ;-)
def crunch5(text):
from collections import namedtuple, defaultdict
# For all integers i and j in IxSet x.s,
# text[i : i + x.w] == text[j : j + x.w].
# That is, it's the set of all indices at which a specific
# substring of length x.w is found.
# In general, we only care about repeated substrings here,
# so weed out those that would otherwise have len(x.s) == 1.
IxSet = namedtuple("IxSet", "s w")
bestlen, results = 0, []
# Compute sets of indices for repeated (not necessarily
# adjacent!) substrings of length xs[0].w + ys[0].w, by looking
# at the cross product of the index sets in xs and ys.
def combine(xs, ys):
xw, yw = xs[0].w, ys[0].w
neww = xw + yw
result = []
for y in ys:
shifted = set(i - xw for i in y.s if i >= xw)
for x in xs:
ok = shifted & x.s
if len(ok) > 1:
result.append(IxSet(ok, neww))
return result
# Check an index set for _adjacent_ repeated substrings.
def check(s):
nonlocal bestlen
x, w = s.s.copy(), s.w
while x:
current = start = x.pop()
count = 1
while current + w in x:
count += 1
current += w
x.remove(current)
while start - w in x:
count += 1
start -= w
x.remove(start)
if count > 1:
total = count * w
if total >= bestlen:
if total > bestlen:
results.clear()
bestlen = total
results.append((start, w, count))
ch2ixs = defaultdict(set)
for i, ch in enumerate(text):
ch2ixs[ch].add(i)
size1 = [IxSet(s, 1)
for s in ch2ixs.values()
if len(s) > 1]
del ch2ixs
for x in size1:
check(x)
current_class = size1
# Repeatedly increase size by 1 until current_class becomes
# empty. At that point, there are no repeated substrings at all
# (adjacent or not) of the then-current size (or larger).
while current_class:
current_class = combine(current_class, size1)
for x in current_class:
check(x)
return bestlen, results
And faster still
crunch6() drops the largish dictionary example to under 2 seconds on my box. It combines ideas from crunch4() (suffix and lcp arrays) and crunch5() (find all arithmetic progressions with a given stride in a set of indices).
Like crunch5(), this also loops around a number of times equal to one more than the length of the repeated longest substring (overlapping or not). For if there are no repeats of length n, there are none for any size greater than n either. That makes finding repeats without regard to overlap easier, because it's an exploitable limitation. When constraining "wins" to adjacent repeats, that breaks down. For example, there are no adjacent repeats of even length 1 in "abcabc", but there is one of length 3. That appears to make any form of direct binary search futile (the presence or absence of adjacent repeats of size n says nothing about the existence of adjacent repeats of any other size).
Inputs of the form 'x' * n remain miserable. There are repeats of all lengths from 1 through n-1.
Observation: all the programs I've given generate all possible ways of breaking up repeated adjacent chunks of maximal length. For example, for a string of 9 "x", it says it can be gotten by repeating "x" 9 times or by repeating "xxx" 3 times. So, surprisingly, they can all be used as factoring algorithms too ;-)
def crunch6(text):
from sa import suffix_array
sa, rank, lcp = suffix_array(text)
bestlen, results = 0, []
n = len(text)
# Generate maximal sets of indices s such that for all i and j
# in s the suffixes starting at s[i] and s[j] start with a
# common prefix of at least len minc.
def genixs(minc, sa=sa, lcp=lcp, n=n):
i = 1
while i < n:
c = lcp[i]
if c < minc:
i += 1
continue
ixs = {sa[i-1], sa[i]}
i += 1
while i < n:
c = min(c, lcp[i])
if c < minc:
yield ixs
i += 1
break
else:
ixs.add(sa[i])
i += 1
else: # ran off the end of lcp
yield ixs
# Check an index set for _adjacent_ repeated substrings
# w apart. CAUTION: this empties s.
def check(s, w):
nonlocal bestlen
while s:
current = start = s.pop()
count = 1
while current + w in s:
count += 1
current += w
s.remove(current)
while start - w in s:
count += 1
start -= w
s.remove(start)
if count > 1:
total = count * w
if total >= bestlen:
if total > bestlen:
results.clear()
bestlen = total
results.append((start, w, count))
c = 0
found = True
while found:
c += 1
found = False
for s in genixs(c):
found = True
check(s, c)
return bestlen, results
Always fast, and published, but sometimes wrong
In bioinformatics, turns out this is studied under the names "tandem repeats", "tandem arrays", and "simple sequence repeats" (SSR). You can search for those terms to find quite a few academic papers, some claiming worst-case linear-time algorithms.
But those seem to fall into two camps:
Linear-time algorithms of the kind to be described, which are actually wrong :-(
Algorithms so complicated it would take dedication to even try to turn them into functioning code :-(
In the first camp, there are several papers that boil down to crunch4() above, but without its inner loop. I'll follow this with code for that, crunch4a(). Here's an example:
"SA-SSR: a suffix array-based algorithm for exhaustive and efficient SSR discovery in large genetic sequences."
Pickett et alia
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5013907/
crunch4a() is always fast, but sometimes wrong. In fact it finds at least one maximal repeated stretch for every example that appeared here, solves the largish dictionary example in a fraction of a second, and has no problem with strings of the form 'x' * 1000000. The bulk of the time is spent building the suffix and lcp arrays. But it can fail:
>>> x = "bcdabcdbcd"
>>> crunch4(x) # finds repeated bcd at end
(6, [(4, 3, 2)])
>>> crunch4a(x) # finds nothing
(0, [])
The problem is that there's no guarantee that the relevant suffixes are adjacent in the suffix array. The suffixes that start with "b" are ordered like so:
bcd
bcdabcdbcd
bcdbcd
To find the trailing repeated block by this approach requires comparing the first with the third. That's why crunch4() has an inner loop, to try all pairs starting with a common letter. The relevant pair can be separated by an arbitrary number of other suffixes in a suffix array. But that also makes the algorithm quadratic time.
# only look at adjacent entries - fast, but sometimes wrong
def crunch4a(s):
from sa import suffix_array
sa, rank, lcp = suffix_array(s)
bestlen, results = 0, []
n = len(s)
for sai in range(1, n):
i, j = sa[sai - 1], sa[sai]
c = lcp[sai]
w = abs(i - j)
if c >= w:
numreps = 1 + c // w
total = w * numreps
if total >= bestlen:
if total > bestlen:
results.clear()
bestlen = total
results.append((min(i, j), w, numreps))
return bestlen, results
O(n log n)
This paper looks right to me, although I haven't coded it:
"Simple and Flexible Detection of Contiguous Repeats Using a Suffix Tree"
Jens Stoye, Dan Gusfield
https://csiflabs.cs.ucdavis.edu/~gusfield/tcs.pdf
Getting to a sub-quadratic algorithm requires making some compromises, though. For example, "x" * n has n-1 substrings of the form "x"*2, n-2 of the form "x"*3, ..., so there are O(n**2) of those alone. So any algorithm that finds all of them is necessarily also at best quadratic time.
Read the paper for details ;-) One concept you're looking for is "primitive": I believe you only want repeats of the form S*n where S cannot itself be expressed as a repetition of shorter strings. So, e.g., "x" * 10 is primitive, but "xx" * 5 is not.
One step on the way to O(n log n)
crunch9() is an implementation of the "brute force" algorithm I mentioned in the comments, from:
"The enhanced suffix array and its applications to genome analysis"
Ibrahim et alia
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.2217&rep=rep1&type=pdf
The implementation sketch there only finds "branching tandem" repeats, and I added code here to deduce repeats of any number of repetitions, and to include non-branching repeats too. While it's still O(n**2) worst case, it's much faster than anything else here for the seq string you pointed to in the comments. As is, it reproduces (except for order) the same exhaustive account as most of the other programs here.
The paper goes on to fight hard to cut the worst case to O(n log n), but that slows it down a lot. So then it fights harder. I confess I lost interest ;-)
# Generate lcp intervals from the lcp array.
def genlcpi(lcp):
lcp.append(0)
stack = [(0, 0)]
for i in range(1, len(lcp)):
c = lcp[i]
lb = i - 1
while c < stack[-1][0]:
i_c, lb = stack.pop()
interval = i_c, lb, i - 1
yield interval
if c > stack[-1][0]:
stack.append((c, lb))
lcp.pop()
def crunch9(text):
from sa import suffix_array
sa, rank, lcp = suffix_array(text)
bestlen, results = 0, []
n = len(text)
# generate branching tandem repeats
def gen_btr(text=text, n=n, sa=sa):
for c, lb, rb in genlcpi(lcp):
i = sa[lb]
basic = text[i : i + c]
# Binary searches to find subrange beginning with
# basic+basic. A more gonzo implementation would do this
# character by character, never materialzing the common
# prefix in `basic`.
rb += 1
hi = rb
while lb < hi: # like bisect.bisect_left
mid = (lb + hi) // 2
i = sa[mid] + c
if text[i : i + c] < basic:
lb = mid + 1
else:
hi = mid
lo = lb
while lo < rb: # like bisect.bisect_right
mid = (lo + rb) // 2
i = sa[mid] + c
if basic < text[i : i + c]:
rb = mid
else:
lo = mid + 1
lead = basic[0]
for sai in range(lb, rb):
i = sa[sai]
j = i + 2*c
assert j <= n
if j < n and text[j] == lead:
continue # it's non-branching
yield (i, c, 2)
for start, c, _ in gen_btr():
# extend left
numreps = 2
for i in range(start - c, -1, -c):
if all(text[i+k] == text[start+k] for k in range(c)):
start = i
numreps += 1
else:
break
totallen = c * numreps
if totallen < bestlen:
continue
if totallen > bestlen:
bestlen = totallen
results.clear()
results.append((start, c, numreps))
# add non-branches
while start:
if text[start - 1] == text[start + c - 1]:
start -= 1
results.append((start, c, numreps))
else:
break
return bestlen, results
Earning the bonus points ;-)
For some technical meaning ;-) crunch11() is worst-case O(n log n). Besides the suffix and lcp arrays, this also needs the rank array, sa's inverse:
assert all(rank[sa[i]] == sa[rank[i]] == i for i in range(len(sa)))
As code comments note, it also relies on Python 3 for speed (range() behavior). That's shallow but would be tedious to rewrite.
Papers describing this have several errors, so don't flip out if this code doesn't exactly match what you read about. Implement exactly what they say instead, and it will fail.
That said, the code is getting uncomfortably complex, and I can't guarantee there aren't bugs. It works on everything I've tried.
Inputs of the form 'x' * 1000000 still aren't speedy, but clearly no longer quadratic-time. For example, a string repeating the same letter a million times completes in close to 30 seconds. Most other programs here would never end ;-)
EDIT: changed genlcpi() to use semi-open Python ranges; made mostly cosmetic changes to crunch11(); added "early out" that saves about a third the time in worst (like 'x' * 1000000) cases.
# Generate lcp intervals from the lcp array.
def genlcpi(lcp):
lcp.append(0)
stack = [(0, 0)]
for i in range(1, len(lcp)):
c = lcp[i]
lb = i - 1
while c < stack[-1][0]:
i_c, lb = stack.pop()
yield (i_c, lb, i)
if c > stack[-1][0]:
stack.append((c, lb))
lcp.pop()
def crunch11(text):
from sa import suffix_array
sa, rank, lcp = suffix_array(text)
bestlen, results = 0, []
n = len(text)
# Generate branching tandem repeats.
# (i, c, 2) is branching tandem iff
# i+c in interval with prefix text[i : i+c], and
# i+c not in subinterval with prefix text[i : i+c + 1]
# Caution: this pragmatically relies on that, in Python 3,
# `range()` returns a tiny object with O(1) membership testing.
# In Python 2 it returns a list - ahould still work, but very
# much slower.
def gen_btr(text=text, n=n, sa=sa, rank=rank):
from itertools import chain
for c, lb, rb in genlcpi(lcp):
origlb, origrb = lb, rb
origrange = range(lb, rb)
i = sa[lb]
lead = text[i]
# Binary searches to find subrange beginning with
# text[i : i+c+1]. Note we take slices of length 1
# rather than just index to avoid special-casing for
# i >= n.
# A more elaborate traversal of the lcp array could also
# give us a list of child intervals, and then we'd just
# need to pick the right one. But that would be even
# more hairy code, and unclear to me it would actually
# help the worst cases (yes, the interval can be large,
# but so can a list of child intervals).
hi = rb
while lb < hi: # like bisect.bisect_left
mid = (lb + hi) // 2
i = sa[mid] + c
if text[i : i+1] < lead:
lb = mid + 1
else:
hi = mid
lo = lb
while lo < rb: # like bisect.bisect_right
mid = (lo + rb) // 2
i = sa[mid] + c
if lead < text[i : i+1]:
rb = mid
else:
lo = mid + 1
subrange = range(lb, rb)
if 2 * len(subrange) <= len(origrange):
# Subrange is at most half the size.
# Iterate over it to find candidates i, starting
# with wa. If i+c is also in origrange, but not
# in subrange, good: then i is of the form wwx.
for sai in subrange:
i = sa[sai]
ic = i + c
if ic < n:
r = rank[ic]
if r in origrange and r not in subrange:
yield (i, c, 2, subrange)
else:
# Iterate over the parts outside subrange instead.
# Candidates i are then the trailing wx in the
# hoped-for wwx. We win if i-c is in subrange too
# (or, for that matter, if it's in origrange).
for sai in chain(range(origlb, lb),
range(rb, origrb)):
ic = sa[sai] - c
if ic >= 0 and rank[ic] in subrange:
yield (ic, c, 2, subrange)
for start, c, numreps, irange in gen_btr():
# extend left
crange = range(start - c, -1, -c)
if (numreps + len(crange)) * c < bestlen:
continue
for i in crange:
if rank[i] in irange:
start = i
numreps += 1
else:
break
# check for best
totallen = c * numreps
if totallen < bestlen:
continue
if totallen > bestlen:
bestlen = totallen
results.clear()
results.append((start, c, numreps))
# add non-branches
while start and text[start - 1] == text[start + c - 1]:
start -= 1
results.append((start, c, numreps))
return bestlen, results
Here's my implementation of what you're talking about. It's pretty similar to yours, but it skips over substrings which have been checked as repetitions of previous substrings.
from collections import namedtuple
SubSequence = namedtuple('SubSequence', ['start', 'length', 'reps'])
def longest_repeating_subseq(original: str):
winner = SubSequence(start=0, length=0, reps=0)
checked = set()
subsequences = ( # Evaluates lazily during iteration
SubSequence(start=start, length=length, reps=1)
for start in range(len(original))
for length in range(1, len(original) - start)
if (start, length) not in checked)
for s in subsequences:
subseq = original[s.start : s.start + s.length]
for reps, next_start in enumerate(
range(s.start + s.length, len(original), s.length),
start=1):
if subseq != original[next_start : next_start + s.length]:
break
else:
checked.add((next_start, s.length))
s = s._replace(reps=reps)
if s.reps > 1 and (
(s.length * s.reps > winner.length * winner.reps)
or ( # When total lengths are equal, prefer the shorter substring
s.length * s.reps == winner.length * winner.reps
and s.reps > winner.reps)):
winner = s
# Check for default case with no repetitions
if winner.reps == 0:
winner = SubSequence(start=0, length=len(original), reps=1)
return (
original[ : winner.start],
winner.reps,
original[winner.start : winner.start + winner.length],
original[winner.start + winner.length * winner.reps : ])
def test(seq, *, expect):
print(f'Testing longest_repeating_subseq for {seq}')
result = longest_repeating_subseq(seq)
print(f'Expected {expect}, got {result}')
print(f'Test {"passed" if result == expect else "failed"}')
print()
if __name__ == '__main__':
test('EEEFGAFFGAFFGAFCD', expect=('EEE', 3, 'FGAF', 'CD'))
test('ACCCCCCCCCA', expect=('A', 9, 'C', 'A'))
test('ABCD', expect=('', 1, 'ABCD', ''))
Passes all three of your examples for me. This seems like the sort of thing that could have a lot of weird edge cases, but given that it's an optimized brute force, it would probably be more a matter of updating the spec rather than fixing a bug in the code itself.
It looks like what you are trying to do is pretty much the LZ77 compression algorithm. You can check your code against the reference implementation in the Wikipedia article I linked to.

Finding the subsequence that starts and ends with the same number with the maximum sum

I have to make a program that takes as input a list of numbers and returns the sum of the subsequence that starts and ends with the same number which has the maximum sum (including the equal numbers in the beginning and end of the subsequence in the sum). It also has to return the placement of the start and end of the subsequence, that is, their index+1. The problem is that my current code runs smoothly only while the length of list is not that long. When the list length extends to 5000 the program does not give an answer.
The input is the following:
6
3 2 4 3 5 6
The first line is for the length of the list. The second line is the list itself, with list items separated by space. The output will be 12, 1, 4, because as you can see there is 1 equal pair of numbers (3): the first and fourth element, so the sum of elements between them is 3 + 2 + 4 + 3 = 12, and their placement is first and fourth.
Here is my code.
length = int(input())
mass = raw_input().split()
for i in range(length):
mass[i]=int(mass[i])
value=-10000000000
b = 1
e = 1
for i in range(len(mass)):
if mass[i:].count(mass[i])!=1:
for j in range(i,len(mass)):
if mass[j]==mass[i]:
f = mass[i:j+1]
if sum(f)>value:
value = sum(f)
b = i+1
e = j+1
else:
if mass[i]>value:
value = mass[i]
b = i+1
e = i+1
print value
print b,e
This should be faster than your current approach.
Rather than searching through mass looking for pairs of matching numbers we pair each number in mass with its index and sort those pairs. We can then use groupby to find groups of equal numbers. If there are more than 2 of the same number we use the first and last, since they will have the greatest sum between them.
from operator import itemgetter
from itertools import groupby
raw = '3 5 6 3 5 4'
mass = [int(u) for u in raw.split()]
result = []
a = sorted((u, i) for i, u in enumerate(mass))
for _, g in groupby(a, itemgetter(0)):
g = list(g)
if len(g) > 1:
u, v = g[0][1], g[-1][1]
result.append((sum(mass[u:v+1]), u+1, v+1))
print(max(result))
output
(19, 2, 5)
Note that this code will not necessarily give the maximum sum between equal elements in the list if the list contains negative numbers. It will still work correctly with negative numbers if no group of equal numbers has more than two members. If that's not the case, we need to use a slower algorithm that tests every pair within a group of equal numbers.
Here's a more efficient version. Instead of using the sum function we build a list of the cumulative sums of the whole list. This doesn't make much of a difference for small lists, but it's much faster when the list size is large. Eg, for a list of 10,000 elements this approach is about 10 times faster. To test it, I create an array of random positive integers.
from operator import itemgetter
from itertools import groupby
from random import seed, randrange
seed(42)
def maxsum(seq):
total = 0
sums = [0]
for u in seq:
total += u
sums.append(total)
result = []
a = sorted((u, i) for i, u in enumerate(seq))
for _, g in groupby(a, itemgetter(0)):
g = list(g)
if len(g) > 1:
u, v = g[0][1], g[-1][1]
result.append((sums[v+1] - sums[u], u+1, v+1))
return max(result)
num = 25000
hi = num // 2
mass = [randrange(1, hi) for _ in range(num)]
print(maxsum(mass))
output
(155821402, 21, 24831)
If you're using a recent version of Python you can use itertools.accumulate to build the list of cumulative sums. This is around 10% faster.
from itertools import accumulate
def maxsum(seq):
sums = [0] + list(accumulate(seq))
result = []
a = sorted((u, i) for i, u in enumerate(seq))
for _, g in groupby(a, itemgetter(0)):
g = list(g)
if len(g) > 1:
u, v = g[0][1], g[-1][1]
result.append((sums[v+1] - sums[u], u+1, v+1))
return max(result)
Here's a faster version, derived from code by Stefan Pochmann, which uses a dict, instead of sorting & groupby. Thanks, Stefan!
def maxsum(seq):
total = 0
sums = [0]
for u in seq:
total += u
sums.append(total)
where = {}
for i, x in enumerate(seq, 1):
where.setdefault(x, [i, i])[1] = i
return max((sums[j] - sums[i - 1], i, j)
for i, j in where.values())
If the list contains no duplicate items (and hence no subsequences bound by duplicate items) it returns the maximum item in the list.
Here are two more variations. These can handle negative items correctly, and if there are no duplicate items they return None. In Python 3 that could be handled elegantly by passing default=None to max, but that option isn't available in Python 2, so instead I catch the ValueError exception that's raised when you attempt to find the max of an empty iterable.
The first version, maxsum_combo, uses itertools.combinations to generate all combinations of a group of equal numbers and thence finds the combination that gives the maximum sum. The second version, maxsum_kadane uses a variation of Kadane's algorithm to find the maximum subsequence within a group.
If there aren't many duplicates in the original sequence, so the average group size is small, maxsum_combo is generally faster. But if the groups are large, then maxsum_kadane is much faster than maxsum_combo. The code below tests these functions on random sequences of 15000 items, firstly on sequences with few duplicates (and hence small mean group size) and then on sequences with lots of duplicates. It verifies that both versions give the same results, and then it performs timeit tests.
from __future__ import print_function
from itertools import groupby, combinations
from random import seed, randrange
from timeit import Timer
seed(42)
def maxsum_combo(seq):
total = 0
sums = [0]
for u in seq:
total += u
sums.append(total)
where = {}
for i, x in enumerate(seq, 1):
where.setdefault(x, []).append(i)
try:
return max((sums[j] - sums[i - 1], i, j)
for v in where.values() for i, j in combinations(v, 2))
except ValueError:
return None
def maxsum_kadane(seq):
total = 0
sums = [0]
for u in seq:
total += u
sums.append(total)
where = {}
for i, x in enumerate(seq, 1):
where.setdefault(x, []).append(i)
try:
return max(max_sublist([(sums[j] - sums[i-1], i, j)
for i, j in zip(v, v[1:])], k)
for k, v in where.items() if len(v) > 1)
except ValueError:
return None
# Kadane's Algorithm to find maximum sublist
# From https://en.wikipedia.org/wiki/Maximum_subarray_problem
def max_sublist(seq, k):
max_ending_here = max_so_far = seq[0]
for x in seq[1:]:
y = max_ending_here[0] + x[0] - k, max_ending_here[1], x[2]
max_ending_here = max(x, y)
max_so_far = max(max_so_far, max_ending_here)
return max_so_far
def test(num, hi, loops):
print('\nnum = {0}, hi = {1}, loops = {2}'.format(num, hi, loops))
print('Verifying...')
for k in range(5):
mass = [randrange(-hi // 2, hi) for _ in range(num)]
a = maxsum_combo(mass)
b = maxsum_kadane(mass)
print(a, b, a==b)
print('\nTiming...')
for func in maxsum_combo, maxsum_kadane:
t = Timer(lambda: func(mass))
result = sorted(t.repeat(3, loops))
result = ', '.join([format(u, '.5f') for u in result])
print('{0:14} : {1}'.format(func.__name__, result))
loops = 20
num = 15000
hi = num // 4
test(num, hi, loops)
loops = 10
hi = num // 100
test(num, hi, loops)
output
num = 15000, hi = 3750, loops = 20
Verifying...
(13983131, 44, 14940) (13983131, 44, 14940) True
(13928837, 27, 14985) (13928837, 27, 14985) True
(14057416, 40, 14995) (14057416, 40, 14995) True
(13997395, 65, 14996) (13997395, 65, 14996) True
(14050007, 12, 14972) (14050007, 12, 14972) True
Timing...
maxsum_combo : 1.72903, 1.73780, 1.81138
maxsum_kadane : 2.17738, 2.22108, 2.22394
num = 15000, hi = 150, loops = 10
Verifying...
(553789, 21, 14996) (553789, 21, 14996) True
(550174, 1, 14992) (550174, 1, 14992) True
(551017, 13, 14991) (551017, 13, 14991) True
(554317, 2, 14986) (554317, 2, 14986) True
(558663, 15, 14988) (558663, 15, 14988) True
Timing...
maxsum_combo : 7.29226, 7.34213, 7.36688
maxsum_kadane : 1.07532, 1.07695, 1.10525
This code runs on both Python 2 and Python 3. The above results were generated on an old 32 bit 2GHz machine running Python 2.6.6 on a Debian derivative of Linux. The speeds for Python 3.6.0 are similar.
If you want to include groups that consist of a single non-repeated number, and also want to include the numbers that are in groups as a "subsequence" of length 1, you can use this version:
def maxsum_kadane(seq):
if not seq:
return None
total = 0
sums = [0]
for u in seq:
total += u
sums.append(total)
where = {}
for i, x in enumerate(seq, 1):
where.setdefault(x, []).append(i)
# Find the maximum of the single items
m_single = max((k, v[0], v[0]) for k, v in where.items())
# Find the maximum of the subsequences
try:
m_subseq = max(max_sublist([(sums[j] - sums[i-1], i, j)
for i, j in zip(v, v[1:])], k)
for k, v in where.items() if len(v) > 1)
return max(m_single, m_subseq)
except ValueError:
# No subsequences
return m_single
I haven't tested it extensively, but it should work. ;)

Categories

Resources