(This question isn't about music but I'm using music as an example of
a use case.)
In music a common way to structure phrases is as a sequence of notes
where the middle part is repeated one or more times. Thus, the phrase
consists of an introduction, a looping part and an outro. Here is one
example:
[ E E E F G A F F G A F F G A F C D ]
We can "see" that the intro is [ E E E] the repeating part is [ F G A
F ] and the outro is [ C D ]. So the way to split the list would be
[ [ E E E ] 3 [ F G A F ] [ C D ] ]
where the first item is the intro, the second number of times the
repeating part is repeated and the third part the outro.
I need an algorithm to perform such a split.
But there is one caveat which is that there may be multiple way to
split the list. For example, the above list could be split into:
[ [ E E E F G A ] 2 [ F F G A ] [ F C D ] ]
But this is a worse split because the intro and outro is longer. So
the criteria for the algorithm is to find the split that maximizes the
length of the looping part and minimizes the combined length of the
intro and outro. That means that the correct split for
[ A C C C C C C C C C A ]
is
[ [ A ] 9 [ C ] [ A ] ]
because the combined length of the intro and outro is 2 and the length
of the looping part is 9.
Also, while the intro and outro can be empty, only "true" repeats are
allowed. So the following split would be disallowed:
[ [ ] 1 [ E E E F G A F F G A F F G A F C D ] [ ] ]
Think of it as finding the optimal "compression" for the
sequence. Note that there may not be any repeats in some sequences:
[ A B C D ]
For these degenerate cases, any sensible result is allowed.
Here is my implementation of the algorithm:
def find_longest_repeating_non_overlapping_subseq(seq):
candidates = []
for i in range(len(seq)):
candidate_max = len(seq[i + 1:]) // 2
for j in range(1, candidate_max + 1):
candidate, remaining = seq[i:i + j], seq[i + j:]
n_reps = 1
len_candidate = len(candidate)
while remaining[:len_candidate] == candidate:
n_reps += 1
remaining = remaining[len_candidate:]
if n_reps > 1:
candidates.append((seq[:i], n_reps,
candidate, remaining))
if not candidates:
return (type(seq)(), 1, seq, type(seq)())
def score_candidate(candidate):
intro, reps, loop, outro = candidate
return reps - len(intro) - len(outro)
return sorted(candidates, key = score_candidate)[-1]
I'm not sure it is correct, but it passes the simple tests I've
described. The problem with it is that it is way to slow. I've looked
at suffix trees but they don't seem to fit my use case because the
substrings I'm after should be non-overlapping and adjacent.
Here's a way that's clearly quadratic-time, but with a relatively low constant factor because it doesn't build any substring objects apart from those of length 1. The result is a 2-tuple,
bestlen, list_of_results
where bestlen is the length of the longest substring of repeated adjacent blocks, and each result is a 3-tuple,
start_index, width, numreps
meaning that the substring being repeated is
the_string[start_index : start_index + width]
and there are numreps of those adjacent. It will always be that
bestlen == width * numreps
The problem description leaves ambiguities. For example, consider this output:
>>> crunch2("aaaaaabababa")
(6, [(0, 1, 6), (0, 2, 3), (5, 2, 3), (6, 2, 3), (0, 3, 2)])
So it found 5 ways to view "the longest" stretch as being of length 6:
The initial "a" repeated 6 times.
The initial "aa" repeated 3 times.
The leftmost instance of "ab" repeated 3 times.
The leftmost instance of "ba" repeated 3 times.
The initial "aaa" repeated 2 times.
It doesn't return the intro or outro because those are trivial to deduce from what it does return:
The intro is the_string[: start_index].
The outro is the_string[start_index + bestlen :].
If there are no repeated adjacent blocks, it returns
(0, [])
Other examples (from your post):
>>> crunch2("EEEFGAFFGAFFGAFCD")
(12, [(3, 4, 3)])
>>> crunch2("ACCCCCCCCCA")
(9, [(1, 1, 9), (1, 3, 3)])
>>> crunch2("ABCD")
(0, [])
The key to how it works: suppose you have adjacent repeated blocks of width W each. Then consider what happens when you compare the original string to the string shifted left by W:
... block1 block2 ... blockN-1 blockN ...
... block2 block3 ... blockN ... ...
Then you get (N-1)*W consecutive equal characters at the same positions. But that also works in the other direction: if you shift left by W and find (N-1)*W consecutive equal characters, then you can deduce:
block1 == block2
block2 == block3
...
blockN-1 == blockN
so all N blocks must be repetitions of block1.
So the code repeatedly shifts (a copy of) the original string left by one character, then marches left to right over both identifying the longest stretches of equal characters. That only requires comparing a pair of characters at a time. To make "shift left" efficient (constant time), the copy of the string is stored in a collections.deque.
EDIT: update() did far too much futile work in many cases; replaced it.
def crunch2(s):
from collections import deque
# There are zcount equal characters starting
# at index starti.
def update(starti, zcount):
nonlocal bestlen
while zcount >= width:
numreps = 1 + zcount // width
count = width * numreps
if count >= bestlen:
if count > bestlen:
results.clear()
results.append((starti, width, numreps))
bestlen = count
else:
break
zcount -= 1
starti += 1
bestlen, results = 0, []
t = deque(s)
for width in range(1, len(s) // 2 + 1):
t.popleft()
zcount = 0
for i, (a, b) in enumerate(zip(s, t)):
if a == b:
if not zcount: # new run starts here
starti = i
zcount += 1
# else a != b, so equal run (if any) ended
elif zcount:
update(starti, zcount)
zcount = 0
if zcount:
update(starti, zcount)
return bestlen, results
Using regexps
[removed this due to size limit]
Using a suffix array
This is the fastest I've found so far, although can still be provoked into quadratic-time behavior.
Note that it doesn't much matter whether overlapping strings are found. As explained for the crunch2() program above (here elaborated on in minor ways):
Given string s with length n = len(s).
Given ints i and j with 0 <= i < j < n.
Then if w = j-i, and c is the number of leading characters in common between s[i:] and s[j:], the block s[i:j] (of length w) is repeated, starting at s[i], a total of 1 + c // w times.
The program below follows that directly to find all repeated adjacent blocks, and remembers those of maximal total length. Returns the same results as crunch2(), but sometimes in a different order.
A suffix array eases the search, but hardly eliminates it. A suffix array directly finds <i, j> pairs with maximal c, but only limits the search to maximize w * (1 + c // w). Worst cases are strings of the form letter * number, like "a" * 10000.
I'm not giving the code for the sa module below. It's long-winded and any implementation of suffix arrays will compute the same things. The outputs of suffix_array():
sa is the suffix array, the unique permutation of range(n) such that for all i in range(1, n), s[sa[i-1]:] < s[sa[i]:].
rank isn't used here.
For i in range(1, n), lcp[i] gives the length of the longest common prefix between the suffixes starting at sa[i-1] and sa[i].
Why does it win? In part because it never has to search for suffixes that start with the same letter (the suffix array, by construction, makes them adjacent), and checking for a repeated block, and for whether it's a new best, takes small constant time regardless of how large the block or how many times it's repeated. As above, that's just trivial arithmetic on c and w.
Disclaimer: suffix arrays/trees are like continued fractions for me: I can use them when I have to, and can marvel at the results, but they give me a headache. Touchy, touchy, touchy.
def crunch4(s):
from sa import suffix_array
sa, rank, lcp = suffix_array(s)
bestlen, results = 0, []
n = len(s)
for sai in range(n-1):
i = sa[sai]
c = n
for saj in range(sai + 1, n):
c = min(c, lcp[saj])
if not c:
break
j = sa[saj]
w = abs(i - j)
if c < w:
continue
numreps = 1 + c // w
assert numreps > 1
total = w * numreps
if total >= bestlen:
if total > bestlen:
results.clear()
bestlen = total
results.append((min(i, j), w, numreps))
return bestlen, results
Some timings
I read a modest file of English words into a string, xs. One word per line:
>>> len(xs)
209755
>>> xs.count('\n')
25481
So about 25K words in about 210K bytes. These are quadratic-time algorithms, so I didn't expect it to go fast, but crunch2() was still running after hours - and still running when I let it go overnight.
Which caused me to realize its update() function could do an enormous amount of futile work, making the algorithm more like cubic-time overall. So I repaired that. Then:
>>> crunch2(xs)
(44, [(63750, 22, 2)])
>>> xs[63750 : 63750+50]
'\nelectroencephalograph\nelectroencephalography\nelec'
That took about 38 minutes, which was in the ballpark of what I expected.
The regexp version crunch3() took less than a tenth of a second!
>>> crunch3(xs)
(8, [(19308, 4, 2), (47240, 4, 2)])
>>> xs[19308 : 19308+10]
'beriberi\nB'
>>> xs[47240 : 47240+10]
'couscous\nc'
As explained before, the regexp version may not find the best answer, but something else is at work here: by default, "." doesn't match a newline, so the code was actually doing many tiny searches. Each of the ~25K newlines in the file effectively ends the local search range. Compiling the regexp with the re.DOTALL flag instead (so newlines aren't treated specially):
>>> crunch3(xs) # with DOTALL
(44, [(63750, 22, 2)])
in a bit over 14 minutes.
Finally,
>>> crunch4(xs)
(44, [(63750, 22, 2)])
in a bit under 9 minutes. The time to build the suffix array was an insignificant part of that (less than a second). That's actually pretty impressive, since the not-always-right brute force regexp version is slower despite running almost entirely "at C speed".
But that's in a relative sense. In an absolute sense, all of these are still pig slow :-(
NOTE: the version in the next section cuts this to under 5 seconds(!).
Enormously faster
This one takes a completely different approach. For the largish dictionary example above, it gets the right answer in less than 5 seconds.
I'm rather proud of this ;-) It was unexpected, and I haven't seen this approach before. It doesn't do any string searching, just integer arithmetic on sets of indices.
It remains dreadfully slow for inputs of the form letter * largish_integer. As is, it keeps going up by 1 so long as at least two (not necessarily adjacent, or even non-overlapping!) copies of a substring (of the current length being considered) exist. So, for example, in
'x' * 1000000
it will try all substring sizes from 1 through 999999.
However, looks like that could be greatly improved by doubling the current size (instead of just adding 1) repeatedly, saving the classes as it goes along, doing a mixed form of binary search to locate the largest substring size for which a repetition exists.
Which I'll leave as a doubtless tedious exercise for the reader. My work here is done ;-)
def crunch5(text):
from collections import namedtuple, defaultdict
# For all integers i and j in IxSet x.s,
# text[i : i + x.w] == text[j : j + x.w].
# That is, it's the set of all indices at which a specific
# substring of length x.w is found.
# In general, we only care about repeated substrings here,
# so weed out those that would otherwise have len(x.s) == 1.
IxSet = namedtuple("IxSet", "s w")
bestlen, results = 0, []
# Compute sets of indices for repeated (not necessarily
# adjacent!) substrings of length xs[0].w + ys[0].w, by looking
# at the cross product of the index sets in xs and ys.
def combine(xs, ys):
xw, yw = xs[0].w, ys[0].w
neww = xw + yw
result = []
for y in ys:
shifted = set(i - xw for i in y.s if i >= xw)
for x in xs:
ok = shifted & x.s
if len(ok) > 1:
result.append(IxSet(ok, neww))
return result
# Check an index set for _adjacent_ repeated substrings.
def check(s):
nonlocal bestlen
x, w = s.s.copy(), s.w
while x:
current = start = x.pop()
count = 1
while current + w in x:
count += 1
current += w
x.remove(current)
while start - w in x:
count += 1
start -= w
x.remove(start)
if count > 1:
total = count * w
if total >= bestlen:
if total > bestlen:
results.clear()
bestlen = total
results.append((start, w, count))
ch2ixs = defaultdict(set)
for i, ch in enumerate(text):
ch2ixs[ch].add(i)
size1 = [IxSet(s, 1)
for s in ch2ixs.values()
if len(s) > 1]
del ch2ixs
for x in size1:
check(x)
current_class = size1
# Repeatedly increase size by 1 until current_class becomes
# empty. At that point, there are no repeated substrings at all
# (adjacent or not) of the then-current size (or larger).
while current_class:
current_class = combine(current_class, size1)
for x in current_class:
check(x)
return bestlen, results
And faster still
crunch6() drops the largish dictionary example to under 2 seconds on my box. It combines ideas from crunch4() (suffix and lcp arrays) and crunch5() (find all arithmetic progressions with a given stride in a set of indices).
Like crunch5(), this also loops around a number of times equal to one more than the length of the repeated longest substring (overlapping or not). For if there are no repeats of length n, there are none for any size greater than n either. That makes finding repeats without regard to overlap easier, because it's an exploitable limitation. When constraining "wins" to adjacent repeats, that breaks down. For example, there are no adjacent repeats of even length 1 in "abcabc", but there is one of length 3. That appears to make any form of direct binary search futile (the presence or absence of adjacent repeats of size n says nothing about the existence of adjacent repeats of any other size).
Inputs of the form 'x' * n remain miserable. There are repeats of all lengths from 1 through n-1.
Observation: all the programs I've given generate all possible ways of breaking up repeated adjacent chunks of maximal length. For example, for a string of 9 "x", it says it can be gotten by repeating "x" 9 times or by repeating "xxx" 3 times. So, surprisingly, they can all be used as factoring algorithms too ;-)
def crunch6(text):
from sa import suffix_array
sa, rank, lcp = suffix_array(text)
bestlen, results = 0, []
n = len(text)
# Generate maximal sets of indices s such that for all i and j
# in s the suffixes starting at s[i] and s[j] start with a
# common prefix of at least len minc.
def genixs(minc, sa=sa, lcp=lcp, n=n):
i = 1
while i < n:
c = lcp[i]
if c < minc:
i += 1
continue
ixs = {sa[i-1], sa[i]}
i += 1
while i < n:
c = min(c, lcp[i])
if c < minc:
yield ixs
i += 1
break
else:
ixs.add(sa[i])
i += 1
else: # ran off the end of lcp
yield ixs
# Check an index set for _adjacent_ repeated substrings
# w apart. CAUTION: this empties s.
def check(s, w):
nonlocal bestlen
while s:
current = start = s.pop()
count = 1
while current + w in s:
count += 1
current += w
s.remove(current)
while start - w in s:
count += 1
start -= w
s.remove(start)
if count > 1:
total = count * w
if total >= bestlen:
if total > bestlen:
results.clear()
bestlen = total
results.append((start, w, count))
c = 0
found = True
while found:
c += 1
found = False
for s in genixs(c):
found = True
check(s, c)
return bestlen, results
Always fast, and published, but sometimes wrong
In bioinformatics, turns out this is studied under the names "tandem repeats", "tandem arrays", and "simple sequence repeats" (SSR). You can search for those terms to find quite a few academic papers, some claiming worst-case linear-time algorithms.
But those seem to fall into two camps:
Linear-time algorithms of the kind to be described, which are actually wrong :-(
Algorithms so complicated it would take dedication to even try to turn them into functioning code :-(
In the first camp, there are several papers that boil down to crunch4() above, but without its inner loop. I'll follow this with code for that, crunch4a(). Here's an example:
"SA-SSR: a suffix array-based algorithm for exhaustive and efficient SSR discovery in large genetic sequences."
Pickett et alia
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5013907/
crunch4a() is always fast, but sometimes wrong. In fact it finds at least one maximal repeated stretch for every example that appeared here, solves the largish dictionary example in a fraction of a second, and has no problem with strings of the form 'x' * 1000000. The bulk of the time is spent building the suffix and lcp arrays. But it can fail:
>>> x = "bcdabcdbcd"
>>> crunch4(x) # finds repeated bcd at end
(6, [(4, 3, 2)])
>>> crunch4a(x) # finds nothing
(0, [])
The problem is that there's no guarantee that the relevant suffixes are adjacent in the suffix array. The suffixes that start with "b" are ordered like so:
bcd
bcdabcdbcd
bcdbcd
To find the trailing repeated block by this approach requires comparing the first with the third. That's why crunch4() has an inner loop, to try all pairs starting with a common letter. The relevant pair can be separated by an arbitrary number of other suffixes in a suffix array. But that also makes the algorithm quadratic time.
# only look at adjacent entries - fast, but sometimes wrong
def crunch4a(s):
from sa import suffix_array
sa, rank, lcp = suffix_array(s)
bestlen, results = 0, []
n = len(s)
for sai in range(1, n):
i, j = sa[sai - 1], sa[sai]
c = lcp[sai]
w = abs(i - j)
if c >= w:
numreps = 1 + c // w
total = w * numreps
if total >= bestlen:
if total > bestlen:
results.clear()
bestlen = total
results.append((min(i, j), w, numreps))
return bestlen, results
O(n log n)
This paper looks right to me, although I haven't coded it:
"Simple and Flexible Detection of Contiguous Repeats Using a Suffix Tree"
Jens Stoye, Dan Gusfield
https://csiflabs.cs.ucdavis.edu/~gusfield/tcs.pdf
Getting to a sub-quadratic algorithm requires making some compromises, though. For example, "x" * n has n-1 substrings of the form "x"*2, n-2 of the form "x"*3, ..., so there are O(n**2) of those alone. So any algorithm that finds all of them is necessarily also at best quadratic time.
Read the paper for details ;-) One concept you're looking for is "primitive": I believe you only want repeats of the form S*n where S cannot itself be expressed as a repetition of shorter strings. So, e.g., "x" * 10 is primitive, but "xx" * 5 is not.
One step on the way to O(n log n)
crunch9() is an implementation of the "brute force" algorithm I mentioned in the comments, from:
"The enhanced suffix array and its applications to genome analysis"
Ibrahim et alia
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.2217&rep=rep1&type=pdf
The implementation sketch there only finds "branching tandem" repeats, and I added code here to deduce repeats of any number of repetitions, and to include non-branching repeats too. While it's still O(n**2) worst case, it's much faster than anything else here for the seq string you pointed to in the comments. As is, it reproduces (except for order) the same exhaustive account as most of the other programs here.
The paper goes on to fight hard to cut the worst case to O(n log n), but that slows it down a lot. So then it fights harder. I confess I lost interest ;-)
# Generate lcp intervals from the lcp array.
def genlcpi(lcp):
lcp.append(0)
stack = [(0, 0)]
for i in range(1, len(lcp)):
c = lcp[i]
lb = i - 1
while c < stack[-1][0]:
i_c, lb = stack.pop()
interval = i_c, lb, i - 1
yield interval
if c > stack[-1][0]:
stack.append((c, lb))
lcp.pop()
def crunch9(text):
from sa import suffix_array
sa, rank, lcp = suffix_array(text)
bestlen, results = 0, []
n = len(text)
# generate branching tandem repeats
def gen_btr(text=text, n=n, sa=sa):
for c, lb, rb in genlcpi(lcp):
i = sa[lb]
basic = text[i : i + c]
# Binary searches to find subrange beginning with
# basic+basic. A more gonzo implementation would do this
# character by character, never materialzing the common
# prefix in `basic`.
rb += 1
hi = rb
while lb < hi: # like bisect.bisect_left
mid = (lb + hi) // 2
i = sa[mid] + c
if text[i : i + c] < basic:
lb = mid + 1
else:
hi = mid
lo = lb
while lo < rb: # like bisect.bisect_right
mid = (lo + rb) // 2
i = sa[mid] + c
if basic < text[i : i + c]:
rb = mid
else:
lo = mid + 1
lead = basic[0]
for sai in range(lb, rb):
i = sa[sai]
j = i + 2*c
assert j <= n
if j < n and text[j] == lead:
continue # it's non-branching
yield (i, c, 2)
for start, c, _ in gen_btr():
# extend left
numreps = 2
for i in range(start - c, -1, -c):
if all(text[i+k] == text[start+k] for k in range(c)):
start = i
numreps += 1
else:
break
totallen = c * numreps
if totallen < bestlen:
continue
if totallen > bestlen:
bestlen = totallen
results.clear()
results.append((start, c, numreps))
# add non-branches
while start:
if text[start - 1] == text[start + c - 1]:
start -= 1
results.append((start, c, numreps))
else:
break
return bestlen, results
Earning the bonus points ;-)
For some technical meaning ;-) crunch11() is worst-case O(n log n). Besides the suffix and lcp arrays, this also needs the rank array, sa's inverse:
assert all(rank[sa[i]] == sa[rank[i]] == i for i in range(len(sa)))
As code comments note, it also relies on Python 3 for speed (range() behavior). That's shallow but would be tedious to rewrite.
Papers describing this have several errors, so don't flip out if this code doesn't exactly match what you read about. Implement exactly what they say instead, and it will fail.
That said, the code is getting uncomfortably complex, and I can't guarantee there aren't bugs. It works on everything I've tried.
Inputs of the form 'x' * 1000000 still aren't speedy, but clearly no longer quadratic-time. For example, a string repeating the same letter a million times completes in close to 30 seconds. Most other programs here would never end ;-)
EDIT: changed genlcpi() to use semi-open Python ranges; made mostly cosmetic changes to crunch11(); added "early out" that saves about a third the time in worst (like 'x' * 1000000) cases.
# Generate lcp intervals from the lcp array.
def genlcpi(lcp):
lcp.append(0)
stack = [(0, 0)]
for i in range(1, len(lcp)):
c = lcp[i]
lb = i - 1
while c < stack[-1][0]:
i_c, lb = stack.pop()
yield (i_c, lb, i)
if c > stack[-1][0]:
stack.append((c, lb))
lcp.pop()
def crunch11(text):
from sa import suffix_array
sa, rank, lcp = suffix_array(text)
bestlen, results = 0, []
n = len(text)
# Generate branching tandem repeats.
# (i, c, 2) is branching tandem iff
# i+c in interval with prefix text[i : i+c], and
# i+c not in subinterval with prefix text[i : i+c + 1]
# Caution: this pragmatically relies on that, in Python 3,
# `range()` returns a tiny object with O(1) membership testing.
# In Python 2 it returns a list - ahould still work, but very
# much slower.
def gen_btr(text=text, n=n, sa=sa, rank=rank):
from itertools import chain
for c, lb, rb in genlcpi(lcp):
origlb, origrb = lb, rb
origrange = range(lb, rb)
i = sa[lb]
lead = text[i]
# Binary searches to find subrange beginning with
# text[i : i+c+1]. Note we take slices of length 1
# rather than just index to avoid special-casing for
# i >= n.
# A more elaborate traversal of the lcp array could also
# give us a list of child intervals, and then we'd just
# need to pick the right one. But that would be even
# more hairy code, and unclear to me it would actually
# help the worst cases (yes, the interval can be large,
# but so can a list of child intervals).
hi = rb
while lb < hi: # like bisect.bisect_left
mid = (lb + hi) // 2
i = sa[mid] + c
if text[i : i+1] < lead:
lb = mid + 1
else:
hi = mid
lo = lb
while lo < rb: # like bisect.bisect_right
mid = (lo + rb) // 2
i = sa[mid] + c
if lead < text[i : i+1]:
rb = mid
else:
lo = mid + 1
subrange = range(lb, rb)
if 2 * len(subrange) <= len(origrange):
# Subrange is at most half the size.
# Iterate over it to find candidates i, starting
# with wa. If i+c is also in origrange, but not
# in subrange, good: then i is of the form wwx.
for sai in subrange:
i = sa[sai]
ic = i + c
if ic < n:
r = rank[ic]
if r in origrange and r not in subrange:
yield (i, c, 2, subrange)
else:
# Iterate over the parts outside subrange instead.
# Candidates i are then the trailing wx in the
# hoped-for wwx. We win if i-c is in subrange too
# (or, for that matter, if it's in origrange).
for sai in chain(range(origlb, lb),
range(rb, origrb)):
ic = sa[sai] - c
if ic >= 0 and rank[ic] in subrange:
yield (ic, c, 2, subrange)
for start, c, numreps, irange in gen_btr():
# extend left
crange = range(start - c, -1, -c)
if (numreps + len(crange)) * c < bestlen:
continue
for i in crange:
if rank[i] in irange:
start = i
numreps += 1
else:
break
# check for best
totallen = c * numreps
if totallen < bestlen:
continue
if totallen > bestlen:
bestlen = totallen
results.clear()
results.append((start, c, numreps))
# add non-branches
while start and text[start - 1] == text[start + c - 1]:
start -= 1
results.append((start, c, numreps))
return bestlen, results
Here's my implementation of what you're talking about. It's pretty similar to yours, but it skips over substrings which have been checked as repetitions of previous substrings.
from collections import namedtuple
SubSequence = namedtuple('SubSequence', ['start', 'length', 'reps'])
def longest_repeating_subseq(original: str):
winner = SubSequence(start=0, length=0, reps=0)
checked = set()
subsequences = ( # Evaluates lazily during iteration
SubSequence(start=start, length=length, reps=1)
for start in range(len(original))
for length in range(1, len(original) - start)
if (start, length) not in checked)
for s in subsequences:
subseq = original[s.start : s.start + s.length]
for reps, next_start in enumerate(
range(s.start + s.length, len(original), s.length),
start=1):
if subseq != original[next_start : next_start + s.length]:
break
else:
checked.add((next_start, s.length))
s = s._replace(reps=reps)
if s.reps > 1 and (
(s.length * s.reps > winner.length * winner.reps)
or ( # When total lengths are equal, prefer the shorter substring
s.length * s.reps == winner.length * winner.reps
and s.reps > winner.reps)):
winner = s
# Check for default case with no repetitions
if winner.reps == 0:
winner = SubSequence(start=0, length=len(original), reps=1)
return (
original[ : winner.start],
winner.reps,
original[winner.start : winner.start + winner.length],
original[winner.start + winner.length * winner.reps : ])
def test(seq, *, expect):
print(f'Testing longest_repeating_subseq for {seq}')
result = longest_repeating_subseq(seq)
print(f'Expected {expect}, got {result}')
print(f'Test {"passed" if result == expect else "failed"}')
print()
if __name__ == '__main__':
test('EEEFGAFFGAFFGAFCD', expect=('EEE', 3, 'FGAF', 'CD'))
test('ACCCCCCCCCA', expect=('A', 9, 'C', 'A'))
test('ABCD', expect=('', 1, 'ABCD', ''))
Passes all three of your examples for me. This seems like the sort of thing that could have a lot of weird edge cases, but given that it's an optimized brute force, it would probably be more a matter of updating the spec rather than fixing a bug in the code itself.
It looks like what you are trying to do is pretty much the LZ77 compression algorithm. You can check your code against the reference implementation in the Wikipedia article I linked to.
I am trying to find a way to solve a maze. My teacher said I have to use BFS as a way to learn. So I made the algorithm itself, but I don't understand how to get the shortest path out of it. I have looked at others their codes and they said that backtracking is the way to do it. How does this backtracking work and what do you backtrack?
I will give my code just because I like some feedback to it and maybe I made some mistake:
def main(self, r, c):
running = True
self.queue.append((r, c))
while running:
if len(self.queue) > 0:
self.current = self.queue[0]
if self.maze[self.current[0] - 1][self.current[1]] == ' ' and not (self.current[0] - 1, self.current[1])\
in self.visited and not (self.current[0] - 1, self.current[1]) in self.queue:
self.queue.append((self.current[0] - 1, self.current[1]))
elif self.maze[self.current[0] - 1][self.current[1]] == 'G':
return self.path
if self.maze[self.current[0]][self.current[1] + 1] == ' ' and not (self.current[0], self.current[1] + 1) in self.visited\
and not (self.current[0], self.current[1] + 1) in self.queue:
self.queue.append((self.current[0], self.current[1] + 1))
elif self.maze[self.current[0]][self.current[1] + 1] == 'G':
return self.path
if self.maze[self.current[0] + 1][self.current[1]] == ' ' and not (self.current[0] + 1, self.current[1]) in self.visited\
and not (self.current[0] + 1, self.current[1]) in self.queue:
self.queue.append((self.current[0] + 1, self.current[1]))
elif self.maze[self.current[0] + 1][self.current[1]] == 'G':
return self.path
if self.maze[self.current[0]][self.current[1] - 1] == ' ' and not (self.current[0], self.current[1] - 1) in self.visited\
and not (self.current[0], self.current[1] - 1) in self.queue:
self.queue.append((self.current[0], self.current[1] - 1))
elif self.maze[self.current[0]][self.current[1] - 1] == 'G':
return self.path
self.visited.append((self.current[0], self.current[1]))
del self.queue[0]
self.path.append(self.queue[0])
As maze I use something like this:
############
# S #
##### ######
# #
######## ###
# #
## ##### ###
# G#
############
Which is stored in a matrix
What I eventually want is just the shortest path inside a list as output.
Since this is a coding assignment I'll leave the code to you and simply explain the general algorithm here.
You have a n by m grid. I am assuming this is is provided to you. You can store this in a two dimensional array.
Step 1) Create a new two dimensional array the same size as the grid and populate each entry with an invalid coordinate (up to you, maybe use None or another value you can use to indicate that a path to that coordinate has not yet been discovered). I will refer to this two dimensional array as your path matrix and the maze as your grid.
Step 2) Enqueue the starting coordinate and update the path matrix at that position (for example, update matrix[1,1] if coordinate (1,1) is your starting position).
Step 3) If not at the final coordinate, dequeue an element from the queue. For each possible direction from the dequeued coordinate, check if it is valid (no walls AND the coordinate does not exist in the matrix yet), and enqueue all valid coordinates.
Step 4) Repeat Step 3.
If there is a path to your final coordinate, you will not only find it with this algorithm but it will also be a shortest path. To backtrack, check your matrix at the location of your final coordinate. This should lead you to another coordinate. Continue this process and backtrack until you arrive at the starting coordinate. If you store this list of backtracked coordinates then you will have a path in reverse.
The main problem in your code is this line:
self.path.append(self.queue[0])
This will just keep adding to the path while you go in all possible directions in a BFS way. This path will end up getting all coordinates that you visit, which is not really a "path", because with BFS you continually switch to a different branch in the search, and so you end up collecting positions that are quite unrelated.
You need to build the path in a different way. A memory efficient way of doing this is to track where you come from when visiting a node. You can use the visited variable for that, but then make it a dictionary, which for each r,c pair stores the r,c pair from which the cell was visited. It is like building a linked list. From each newly visited cell you'll be able to find back where you came from, all the way back to the starting cell. So when you find the target, you can build the path from this linked list.
Some other less important problems in your code:
You don't check whether a coordinate is valid. If the grid is bounded completely by # characters, this is not really a problem, but if you would have a gap at the border, you'd get an exception
There is code repetition for each of the four directions. Try to avoid such repetition, and store recurrent expressions like self.current[1] - 1 in a variable, and create a loop over the four possible directions.
The variable running makes no sense: it never becomes False. Instead make your loop condition what currently is your next if condition. As long as the queue is not empty, continue. If the queue becomes empty then that means there is no path to the target.
You store every bit of information in self properties. You should only do that for information that is still relevant after the search. I would instead just create local variables for queue, visited, current, ...etc.
Here is how the code could look:
class Maze():
def __init__(self, str):
self.maze = str.splitlines()
def get_start(self):
row = next(i for i, line in enumerate(self.maze) if "S" in line)
col = self.maze[row].index("S")
return row, col
def main(self, r, c):
queue = [] # use a local variable, not a member
visited = {} # use a dict, key = coordinate-tuples, value = previous location
visited[(r, c)] = (-1, -1)
queue.append((r, c))
while len(queue) > 0: # don't use running as variable
# no need to use current; just reuse r and c:
r, c = queue.pop(0) # you can remove immediately from queue
if self.maze[r][c] == 'G':
# build path from walking backwards through the visited information
path = []
while r != -1:
path.append((r, c))
r, c = visited[(r, c)]
path.reverse()
return path
# avoid repetition of code: make a loop
for dx, dy in ((-1, 0), (0, -1), (1, 0), (0, 1)):
new_r = r + dy
new_c = c + dx
if (0 <= new_r < len(self.maze) and
0 <= new_c < len(self.maze[0]) and
not (new_r, new_c) in visited and
self.maze[new_r][new_c] != '#'):
visited[(new_r, new_c)] = (r, c)
queue.append((new_r, new_c))
maze = Maze("""############
# S #
##### ######
# #
######## ###
# #
## ##### ###
# G#
############""")
path = maze.main(*maze.get_start())
print(path)
See it run on repl.it
I'm trying to write a program that mimics a word game where, from a given set of words, it will find the longest possible sequence of words. No word can be used twice.
I can do the matching letters and words up, and storing them into lists, but I'm having trouble getting my head around how to handle the potentially exponential number of possibilities of words in lists. If word 1 matches word 2 and then I go down that route, how do I then back up to see if words 3 or 4 match up with word one and then start their own routes, all stemming from the first word?
I was thinking some way of calling the function inside itself maybe?
I know it's nowhere near doing what I need it to do, but it's a start. Thanks in advance for any help!
g = "audino bagon baltoy banette bidoof braviary bronzor carracosta charmeleon cresselia croagunk darmanitan deino emboar emolga exeggcute gabite girafarig gulpin haxorus"
def pokemon():
count = 1
names = g.split()
first = names[count]
master = []
for i in names:
print (i, first, i[0], first[-1])
if i[0] == first[-1] and i not in master:
master.append(i)
count += 1
first = i
print ("success", master)
if len(master) == 0:
return "Pokemon", first, "does not work"
count += 1
first = names[count]
pokemon()
Your idea of calling a function inside of itself is a good one. We can solve this with recursion:
def get_neighbors(word, choices):
return set(x for x in choices if x[0] == word[-1])
def longest_path_from(word, choices):
choices = choices - set([word])
neighbors = get_neighbors(word, choices)
if neighbors:
paths = (longest_path_from(w, choices) for w in neighbors)
max_path = max(paths, key=len)
else:
max_path = []
return [word] + max_path
def longest_path(choices):
return max((longest_path_from(w, choices) for w in choices), key=len)
Now we just define our word list:
words = ("audino bagon baltoy banette bidoof braviary bronzor carracosta "
"charmeleon cresselia croagunk darmanitan deino emboar emolga "
"exeggcute gabite girafarig gulpin haxorus")
words = frozenset(words.split())
Call longest_path with a set of words:
>>> longest_path(words)
['girafarig', 'gabite', 'exeggcute', 'emolga', 'audino']
A couple of things to know: as you point out, this has exponential complexity, so beware! Also, know that python has a recursion limit!
Using some black magic and graph theory I found a partial solution that might be good (not thoroughly tested).
The idea is to map your problem into a graph problem rather than a simple iterative problem (although it might work too!). So I defined the nodes of the graph to be the first letters and last letters of your words. I can only create edges between nodes of type first and last. I cannot map node first number X to node last number X (a word cannot be followed by it self). And from that your problem is just the same as the Longest path problem which tends to be NP-hard for general case :)
By taking some information here: stackoverflow-17985202 I managed to write this:
g = "audino bagon baltoy banette bidoof braviary bronzor carracosta charmeleon cresselia croagunk darmanitan deino emboar emolga exeggcute gabite girafarig gulpin haxorus"
words = g.split()
begin = [w[0] for w in words] # Nodes first
end = [w[-1] for w in words] # Nodes last
links = []
for i, l in enumerate(end): # Construct edges
ok = True
offset = 0
while ok:
try:
bl = begin.index(l, offset)
if i != bl: # Cannot map to self
links.append((i, bl))
offset = bl + 1 # next possible edge
except ValueError: # no more possible edge for this last node, Next!
ok = False
# Great function shamelessly taken from stackoverflow (link provided above)
import networkx as nx
def longest_path(G):
dist = {} # stores [node, distance] pair
for node in nx.topological_sort(G):
# pairs of dist,node for all incoming edges
pairs = [(dist[v][0]+1,v) for v in G.pred[node]]
if pairs:
dist[node] = max(pairs)
else:
dist[node] = (0, node)
node,(length,_) = max(dist.items(), key=lambda x:x[1])
path = []
while length > 0:
path.append(node)
length,node = dist[node]
return list(reversed(path))
# Construct graph
G = nx.DiGraph()
G.add_edges_from(links)
# TADAAAA!
print(longest_path(G))
Although it looks nice, there is a big drawback. You example works because there is no cycle in the resulting graph of input words, however, this solution fails on cyclic graphs.
A way around that is to detect cycles and break them. Detection can be done this way:
if nx.recursive_simple_cycles(G):
print("CYCLES!!! /o\")
Breaking the cycle can be done by just dropping a random edge in the cycle and then you will randomly find the optimal solution for your problem (imagine a cycle with a tail, you should cut the cycle on the node having 3 edges), thus I suggest brute-forcing this part by trying all possible cycle breaks, computing longest path and taking the longest of the longest path. If you have multiple cycles it becomes a bit more explosive in number of possibilities... but hey it's NP-hard, at least the way I see it and I didn't plan to solve that now :)
Hope it helps
Here's a solution that doesn't require recursion. It uses the itertools permutation function to look at all possible orderings of the words, and find the one with the longest length. To save time, as soon as an ordering hits a word that doesn't work, it stops checking that ordering and moves on.
>>> g = 'girafarig eudino exeggcute omolga gabite'
... p = itertools.permutations(g.split())
... longestword = ""
... for words in p:
... thistry = words[0]
... # Concatenates words until the next word doesn't link with this one.
... for i in range(len(words) - 1):
... if words[i][-1] != words[i+1][0]:
... break
... thistry += words[i+1]
... i += 1
... if len(thistry) > len(longestword):
... longestword = thistry
... print(longestword)
... print("Final answer is {}".format(longestword))
girafarig
girafariggabiteeudino
girafariggabiteeudinoomolga
girafariggabiteexeggcuteeudinoomolga
Final answer is girafariggabiteexeggcuteeudinoomolga
First, let's see what the problem looks like:
from collections import defaultdict
import pydot
words = (
"audino bagon baltoy banette bidoof braviary bronzor carracosta "
"charmeleon cresselia croagunk darmanitan deino emboar emolga "
"exeggcute gabite girafarig gulpin haxorus"
).split()
def main():
# get first -> last letter transitions
nodes = set()
arcs = defaultdict(lambda: defaultdict(list))
for word in words:
first = word[0]
last = word[-1]
nodes.add(first)
nodes.add(last)
arcs[first][last].append(word)
# create a graph
graph = pydot.Dot("Word_combinations", graph_type="digraph")
# use letters as nodes
for node in sorted(nodes):
n = pydot.Node(node, shape="circle")
graph.add_node(n)
# use first-last as directed edges
for first, sub in arcs.items():
for last, wordlist in sub.items():
count = len(wordlist)
label = str(count) if count > 1 else ""
e = pydot.Edge(first, last, label=label)
graph.add_edge(e)
# save result
graph.write_jpg("g:/temp/wordgraph.png", prog="dot")
if __name__=="__main__":
main()
results in
which makes the solution fairly obvious (path shown in red), but only because the graph is acyclic (with the exception of two trivial self-loops).
Now I have read the other stackoverflow Game of Life questions and also Googled voraciously.I know what to do for my Python implementation of the Game Of Life.I want to keep track of the active cells in the grid.The problem is I'm stuck at how should I code it.
Here's what I thought up but I was kinda at my wit's end beyond that:
Maintain a ActiveCell list consisting of cell co-ordinates tuples which are active
dead or alive.
When computing next generation , just iterate over the ActiveCell list,compute cell
state and check whether state changes or not.
If state changes , add all of the present cells neighbours to the list
If not , remove that cell from the list
Now the problem is : (" . "--> other cell)
B C D
. A .
. . .
If A satisfies 3) then it adds B,C,D
then if B also returns true for 3) ,which means it will add A,C again
(Duplication)
I considered using OrderedSet or something to take care of the order and avoid duplication.But still these I hit these issues.I just need a direction.
don't know if it will help you, but here's a quick sketch of Game of Life, with activecells dictionary:
from itertools import product
def show(board):
for row in board:
print " ".join(row)
def init(N):
board = []
for x in range(N):
board.append([])
for y in range(N):
board[x].append(".");
return board
def create_plane(board):
board[2][0] = "x"
board[2][1] = "x"
board[2][2] = "x"
board[1][2] = "x"
board[0][1] = "x"
def neighbors(i, j, N):
g1 = {x for x in product([1, 0, -1], repeat=2) if x != (0, 0)}
g2 = {(i + di, j + dj) for di, dj in g1}
return [(x, y) for x, y in g2 if x >= 0 and x < N and y >= 0 and y < N]
def live(board):
N = len(board)
acells = {}
for i in range(N):
for j in range(N):
if board[i][j] == "x":
for (x, y) in neighbors(i, j, N):
if (x, y) not in acells: acells[(x, y)] = board[x][y]
while True:
print "-" * 2 * N, len(acells), "cells to check"
show(board)
raw_input("Press any key...")
for c in acells.keys():
a = len([x for x in neighbors(c[0], c[1], N) if board[x[0]][x[1]] == "x"])
cur = board[c[0]][c[1]]
if a == 0:
del acells[c] # if no live cells around, remove from active
elif cur == "x" and a not in (2, 3):
acells[c] = "." # if alive and not 2 or 3 neighbors - dead
elif cur == "." and a == 3:
acells[c] = "x" # if dead and 3 neighbors - alive
for x in neighbors(c[0], c[1], N): # add all neighbors of new born
if x not in acells: acells[x] = board[x[0]][x[1]]
for c in acells:
board[c[0]][c[1]] = acells[c]
N = 7
board = init(N)
create_plane(board)
live(board)
You have two lists, I'll name them currentState, and newChanges. Here will be the workflow:
Iterate over currentState, figuring out which are newly born cells, and which ones are going to die. Do NOT add these changes to your currentState. If there is a cell to be born or a death, add it to the newChanges list. When you are finished with this step, currentState should look exactly the same as it did at the beginning.
Once you have finished all calculations in step 1 for every cell, then iterate over newChanges. For each pair in newChanges, change it in currentState from dead to alive or vice versa.
Example:
currentState has {0,0} {0,1} {0,2}. (Three dots in a line)
newChanges is calculated to be {0,0} {-1,1} {1,1} {0,2} (The two end dots die, and the spot above and below the middle are born)
currentState recieves the changes, and becomes {-1,1} {0,1} {1 ,1}, and newChanges is cleared.
Did you consider using an ordered dictionary and just set the values to None?
You didn't state that you have a restriction to implement the game in a specific way. So, the main question is: how big a grid do you want to be able to handle?
For example, if you are starting with a small fixed-size grid, the simplest representation is just a [[bool]] or [[int]] containing whether each cell is alive or dead. So, each round, you can make a new grid from the old one, e.g. assuming that all cells outside the grid are dead. Example:
[
[False, True, False],
[True, False, True],
[False, True, False],
]
If you want a very large dynamic-sized grid, there's the HashLife algorithm, which is much faster, but more complicated.
I implemented Game of Life in Python for fun and what I did was having board dict, with tuples of coordinates. Value is a state of cell. You can look at the code here https://raw.github.com/tdi/pycello/master/pycello.py. I know this is not very fast implementation and the project is abandoned due to lack of time.
board = {}
board[(x,y)] = value