Longest common substring with rolling hash

Longest common substring with rolling hash - python

I am implementing in Python3 an algorithm to find the longest substring of two strings s and t. Given s and t, I need to return (a,b,l) where l is the length of the longest common substring, a is the position in s where the longest substring starts, and b is the position in t where the longest substring starts. I have a working version of the algorithm but it is quite slow and I am not sure why; it is frustrating because I have found other implementations in python using pretty much the same logic that are many times faster. I am self-learning so any help would be greatly appreciated.
The approach is based on comparing hash values rather than directly comparing substrings and using binary search to find maximal length of common substrings. Here is the code for my hash function (m is a big prime and x is just some constant):
def polynomial_hash(my_string, m, x):
str_len = len(my_string)
result = 0
for i in range(str_len):
result = (result + ord(my_string[i]) * power_mod_p(x, i, m)) % m
return result
Given two strings s and t, I first find which string is shorter, without loss of generality, let s be the shorter string. First I need to find the hash values of substrings of a string. I use the following function, implemented as a generator:
def all_length_k_hashes(my_string, k, m, x):
current_position = len(my_string) - k
x_to_the_k = power_mod_p(x, k, m)
hash_value = polynomial_hash(my_string[current_position:], m, x)
yield (hash_value, current_position)
while current_position > 0:
current_position = current_position - 1
hash_value = ((hash_value * x) + ord(my_string[current_position]) - x_to_the_k*ord(my_string[current_position + k])) % m
yield (hash_value, current_position)
This function is simple, its first yield is the hash value of the final length k substring of the string, after that each of its iteration is the hash value of the next length k substring to its left (we move left by one position, for example for k=3 from abcdefghi to abcdefghi then from abcdefghi to abcdefghi). This should be able to calculate all the hash values of all length k substrings of my_string in O(|my_string|).
Now I find out if s and t has a length k substring in common, I use the following function:
def common_sub_string_length_k(shorter_str, longer_str, k, m, x):
short_str_dict = dict()
for hash_and_index in all_length_k_hashes(shorter_str, k, m, x):
short_str_dict.update({hash_and_index[0]: hash_and_index[1]})
hash_generator_longer_str = all_length_k_hashes(longer_str, k, m, x)
for hash_and_index in hash_generator_longer_str:
if hash_and_index[0] in short_str_dict:
return (short_str_dict[hash_and_index[0]], hash_and_index[1])
return False
What is happening in this function is: I create a Python empty dictionary and fill it with (key:values) such that each key is the hash value of a length k substring of the shorter string and its value is that substring's starting index, I call this 'short_str_dict'
Then, using all_length_k_hashes, I create a generator of hash values of substrings of length k of the longer string, then I iterate through this generator to check if there is a hash value that's in the 'short_str_dict', if there is, then the two strings have a substring of length k in common (assuming no hash collisions). This whole process should take time O(|shorter_string| + |longer_string|)
Finally, the following function repeatedly uses the previous process to find the maximal k, using a binary search technique:
def longest_common_substring(str_1, str_2):
m_1 = 309000599
m_2 = 988017827
x = randint(1, 10 ** 6)
len_str_1 = len(str_1)
len_str_2 = len(str_2)
if len_str_1 <= len_str_2:
short_str = str_1
long_str = str_2
switched = False
else:
short_str = str_2
long_str = str_1
switched = True
len_short_str = len(short_str)
len_long_str = len(long_str)
low = 0
high = len_short_str
mid = 0
longest_so_far = 0
longest_indices = (0,0)
while low <= high:
mid = (high + low) // 2
m1_result = common_sub_string_length_k(short_str, long_str, mid, m_1, x)
m2_result = common_sub_string_length_k(short_str, long_str, mid, m_2, x)
if m1_result is False or m2_result is False:
high = mid - 1
else:
longest_so_far = mid
longest_indices = m1_result
low = mid + 1
if switched:
return (longest_indices[1], longest_indices[0], longest_so_far)
else:
return (longest_indices[0], longest_indices[1], longest_so_far)
Two different hashes are used to reduce the probability of a collision. So in total, assuming no collisions, this whole process should take
O(log|shorter_string|) * O(|shorter_string| + |longer_string|).
Have I made any error? Is it slow because of the use of Python dictionaries? I really want to understand my mistake. Any help is greatly appreciated.

Related

Python - long execution time

I created my first Python program and I suspect something is wrong. The execution time of the testovanie() method was 2 hour. In Java same code was time 10 min.
The implementation must be in two classes. And the implementation of each algorithm must be as written (if there is no problem).
Can you help me fix the execution time?
First Class
class Algoritmy:
"""
The Algorithms class creates an array of Integer numbers. Contains methods for working with the field (sorting).
"""
def __init__(self, velkostPola):
"""
Constructor that initializes attributes.
:param velkostPola: array size.
"""
self.velkostPola = velkostPola
self.poleCisel = []
def nacitajZoSuboru(self, nazov):
"""
The method reads integer values from a file and writes them to the field.
:param nazov: a string that contains the name of the file from which the values are read into the field.
:type nazov: string
"""
f = open(nazov, 'r+')
self.poleCisel = f.readlines()
self.poleCisel = [int(i) for i in self.poleCisel]
f.close()
def toString(self):
"""
A method that serves as a text representation (listing) of the entire array of Numbers fields.
"""
for x in self.poleCisel:
print(x)
def bubbleSort(self):
"""
The method sorts arrays according to a bubble algorithm with complexity n ^ 2.
Compares adjacent values if the first value is greater than the second value
replace them. The algorithm is repeated until it sorts the entire field from the smallest to the largest.
"""
n = len(self.poleCisel)
for i in range(0, n):
for j in range(0, n - i - 1):
if self.poleCisel[j] > self.poleCisel[j + 1]:
self.vymena(j,j+1)
def insertionSort(self):
"""
The method classifies the array according to the insertion algorithm with complexity n ^ 2.
Compares the next value with the previous one and places it after
a value that is less than. Repeats until it sorts the entire field from
smallest to largest.
"""
n = len(self.poleCisel)
for i in range(n):
pom = self.poleCisel[i]
j = i - 1
while (j >= 0) and (pom < self.poleCisel[j]):
self.poleCisel[j + 1] = self.poleCisel[j]
j -= 1
self.poleCisel[j + 1] = pom
def quickSort(self, najm, najv):
"""
The method sorts arrays according to a quick algorithm with complexity n * (log2 n).
The algorithm chooses the pivot. The elements are so smaller that there are smaller and on the left side
larger elements on the right.
:param najm: najmI celociselna lowest index
:type najm: integer
:param najv: najvI highest index
:type najv: integer
"""
i = najm
j = najv
pivot = self.poleCisel[najm + (najv - najm) // 2]
while i <= j:
while self.poleCisel[i] < pivot:
i += 1
while self.poleCisel[j] > pivot:
j -= 1
if i <= j:
self.vymena(i, j)
i += 1
j -= 1
if najm < j:
self.quickSort(najm, j)
if i < najv:
self.quickSort(i, najv)
def vymena(self, i, j):
"""
An auxiliary procedure that ensures the exchange of element i for element j in the array.
:param i: jeden prvok pola
:type i: integer
:param j: druhy prvok pola
:type j: integer
"""
pom = self.poleCisel[i]
self.poleCisel[i] = self.poleCisel[j]
self.poleCisel[j] = pom
def selectionSort(self):
"""
The method classifies the array according to a selection algorithm with complexity n ^ 2.
The algorithm finds the largest value and exchanges it with the last element. He will find
always the highest value among unsorted elements and exchanges it with
the last unsorted element.
"""
for i in reversed(range(0, len(self.poleCisel))):
prvy = 0
for j in range(0, i):
if self.poleCisel[j] > self.poleCisel[prvy]:
prvy = j
self.vymena(prvy,i)
def shellSort(self, n):
"""
The method classifies the array according to a shell algorithm with complexity n ^ 2
Gradually, the elements distant from each other are compared by a space - at the beginning there is a space = n / 2,
where n is the size of the field we are sorting. If the left element being compared is larger than the right one being compared,
so for replacement. Then the gap is reduced and the procedure is repeated.
:param n: size of array
:type n: integer
"""
medzera = n // 2
while medzera > 0:
i = medzera
for i in range(0, n):
pom = self.poleCisel[i]
j = i
while (j >= medzera) and (self.poleCisel[j - medzera] > pom):
self.poleCisel[j] = self.poleCisel[j - medzera]
j = j - medzera
self.poleCisel[j] = pom
medzera = medzera // 2
def heapSort(self):
"""
The method sorts arrays according to the heap algorithm with complexity n * (log n).
The algorithm adds elements to the heap where it stores them at the end of the heap. Unless
the previous element is larger, then the elements are replaced until it is
predecessor smaller. This is repeated until a sorted field is created.
"""
n = len(self.poleCisel)
for k in reversed(range(1, n // 2)):
self.maxHeapify(k, n)
while True:
self.vymena(0,n-1)
n = n - 1
self.maxHeapify(1, n)
if (n < 1):
break
def maxHeapify(self, otecI, n):
"""
This method serves to preserve the properties of Heap.
:param otecI: index otca
:type otecI: integer
:param n: nastavenie vacsich prvkov
:type n: integer
"""
otec = self.poleCisel[otecI - 1]
while otecI <= n // 2:
lavySyn = otecI + otecI
if (lavySyn < n) and (self.poleCisel[lavySyn - 1] < self.poleCisel[lavySyn]):
lavySyn += 1
if otec >= self.poleCisel[lavySyn - 1]:
break
else:
self.poleCisel[otecI - 1] = self.poleCisel[lavySyn - 1]
otecI = lavySyn
self.poleCisel[otecI - 1] = otec
Second Class
from Algoritmy import Algoritmy
import time
class Praca:
def __init__(self):
self.casB = []
self.casQ = []
self.casS = []
self.casI = []
self.casSh = []
self.casH = []
def vypisPriemer(self):
"""
A method that calculates and prints the averages of the algorithm duration from the time field.
"""
sumB = 0;sumQ = 0;sumS = 0;sumI = 0;sumSh = 0;sumH = 0
for j in range(0, 200):
sumB += self.casB[j]
sumQ += self.casQ[j]
sumS += self.casS[j]
sumI += self.casI[j]
sumSh += self.casSh[j]
sumH += self.casH[j]
priemerB = sumB / 200
priemerQ = sumQ / 200
priemerS = sumS / 200
priemerI = sumI / 200
priemerSh = sumSh / 200
priemerH = sumH / 200
print("Bubble Sort alg. priemer: %10.9f" %priemerB)
print("Quick Sort alg. priemer: %10.9f"%priemerQ)
print("Selection Sort alg. priemer: %10.9f"%priemerS)
print("Insertion Sort alg. priemer: %10.9f"%priemerI)
print("Shell Sort alg. priemer: %10.9f"%priemerSh)
print("Heap Sort alg. priemer: %10.9f"%priemerH)
def replikacie(self,velkost, nazovS):
"""
The method is aimed at performing 200 replications for each single algorithm.
Collects and stores the execution time of the algorithm in the field.
:param velkost: array size
:type velkost: integer
:param nazovS: file name
:type nazovS: string
"""
self.casB.clear()
self.casQ.clear()
self.casS.clear()
self.casI.clear()
self.casSh.clear()
self.casH.clear()
praca = Algoritmy(velkost)
for i in range(0, 200):
praca.nacitajZoSuboru(nazovS)
zaciatok=time.time()
praca.bubbleSort()
self.casB.append(time.time() - zaciatok)
praca.nacitajZoSuboru(nazovS)
zaciatok=time.time()
praca.quickSort(0, praca.velkostPola-1)
self.casQ.append(time.time() - zaciatok)
praca.nacitajZoSuboru(nazovS)
zaciatok=time.time()
praca.selectionSort()
self.casS.append(time.time() - zaciatok)
praca.nacitajZoSuboru(nazovS)
zaciatok=time.time()
praca.insertionSort()
self.casI.append(time.time() - zaciatok)
praca.nacitajZoSuboru(nazovS)
zaciatok=time.time()
praca.shellSort(praca.velkostPola)
self.casSh.append(time.time() - zaciatok)
praca.nacitajZoSuboru(nazovS)
zaciatok=time.time()
praca.heapSort()
self.casH.append(time.time() - zaciatok)
def testovanie(self):
"""
Testing
"""
self.replikacie(10000,"neutr10000.txt")
print("Neutriedene 10000")
self.vypisPriemer()
def main(self):
zaciatok = time.time()
self.testovanie()
print(time.time() - zaciatok)
"""
Run
"""
if __name__ == '__main__':
praca = Praca()
praca.main()
If you have any improvements, don't be shy to tell me, if I said it's my first Python program. Be nice to me :)

A more condensed MRE would make it easier to comment on the specific statements, but my guess is that your example just illustrates that Python is slow for certain use cases.
This kind of number crunching in pure-Python loops is the nightmare scenario for Python, at least for the most popular CPython implementation.
There are, however, different ways you could speed this up if you diverge a bit from pure CPython:
Use PyPy JIT to run your program instead of CPython. PyPy usually speeds your code ~3-5x, but for numeric stuff like yours you can get an even more impressive speed bump.
Use numeric libraries to vectorize your code and/or offload common operations to optimized routines (written in C, Fortran or even assembly). Numpy is a popular choice.
Rewrite your program, or at least the "hottest" code paths, in Cython cdef functions and classes, see, e.g., https://cython.readthedocs.io/en/latest/src/tutorial/cython_tutorial.html.
You may want to check out Numba, but I have no experience with it.

find the minimum number of words(distance) between repeated occurrences of a search string in the input string

Here are test cases for the code:
string - 'Tim had been saying that he had been there'
search - 'had'
expected output - 4
string - 'he got what he got and what he wanted'
search - 'he'
expected out - 2
def return_distance(input, search):
words = input.split()
distance = None
indx = []
if not input or not search:
return None
else:
if words.count(search) >1:
indx = [ index for index, word in enumerate(words) if word == search]
distance = indx[1] - indx[0]
for i in range(len(indx)-1):
distance = min(distance, indx[i+1] - indx[i])-1
return distance
I am thinking how to optimize the code. I admit it is poorly written.

How about
def min_distance_between_words(sentence, word):
idxes = [i for i, e in enumerate(sentence.split()) if e == word]
return min([y - x - 1 for x, y in zip(idxes, idxes[1:])])
This splits the input sentence, makes a list of every index that matches the target word, then iterates over this list to compute the differences between each index and returns the minimum difference.
Since behavior is unspecified when the sentence doesn't have a word, it raises an error but you can add a check for this and return the value of your choice if desired using min's default parameter:
def min_distance_between_words(sentence, word):
idxes = [i for i, e in enumerate(sentence.split()) if e == word]
return min([y - x - 1 for x, y in zip(idxes, idxes[1:])], default=None)
As an aside, naming a variable input overwrites a builtin and return_distance is a rather ambiguous name for a function.
Adding a precondition for parameters for None as done with if not input or not search: is not typically done in Python (we assume caller will always pass in a string and adhere to the function's contract).
If you want to generalize this further, move the split() duty to the domain of the caller which enables the function to operate on arbitrary iterables:
def min_distance_between_occurrences(it, target):
idxes = [i for i, e in enumerate(it) if e == target]
return min([y - x - 1 for x, y in zip(idxes, idxes[1:])], default=None)
Call with:
min_distance_between_occurrences("a b c a".split(), "a")
min_distance_between_occurrences([(1, 2), (1, 3), (1, 2)], (1, 2))
Refactoring aside, as pointed out in the comments, the original code isn't correct. Issues include:
search_str does not exist. You probably meant search.
distance and min_dist don't really work together. Pick one or the other and use it for all minimum calculations.
min(min_dist, indx[i+1] - indx[i])-1 subtracts 1 in the wrong place, throwing off the count.
Here's a potential fix for these issues:
def return_distance(input, search):
words = input.split()
distance = None
if words.count(search) > 1:
indx = [index for index, word in enumerate(words) if word == search]
distance = indx[1] - indx[0] - 1
# ^^^^
for i in range(len(indx) - 1):
distance = min(distance, indx[i+1] - indx[i] - 1)
# ^^^^
return distance

One way is to use min with list comprehension on indx
min_dist = min([(indx[i+1] - indx[i]-1) for i in range(len(indx)-1) ])

Find longest adjacent repeating non-overlapping substring

(This question isn't about music but I'm using music as an example of
a use case.)
In music a common way to structure phrases is as a sequence of notes
where the middle part is repeated one or more times. Thus, the phrase
consists of an introduction, a looping part and an outro. Here is one
example:
[ E E E F G A F F G A F F G A F C D ]
We can "see" that the intro is [ E E E] the repeating part is [ F G A
F ] and the outro is [ C D ]. So the way to split the list would be
[ [ E E E ] 3 [ F G A F ] [ C D ] ]
where the first item is the intro, the second number of times the
repeating part is repeated and the third part the outro.
I need an algorithm to perform such a split.
But there is one caveat which is that there may be multiple way to
split the list. For example, the above list could be split into:
[ [ E E E F G A ] 2 [ F F G A ] [ F C D ] ]
But this is a worse split because the intro and outro is longer. So
the criteria for the algorithm is to find the split that maximizes the
length of the looping part and minimizes the combined length of the
intro and outro. That means that the correct split for
[ A C C C C C C C C C A ]
is
[ [ A ] 9 [ C ] [ A ] ]
because the combined length of the intro and outro is 2 and the length
of the looping part is 9.
Also, while the intro and outro can be empty, only "true" repeats are
allowed. So the following split would be disallowed:
[ [ ] 1 [ E E E F G A F F G A F F G A F C D ] [ ] ]
Think of it as finding the optimal "compression" for the
sequence. Note that there may not be any repeats in some sequences:
[ A B C D ]
For these degenerate cases, any sensible result is allowed.
Here is my implementation of the algorithm:
def find_longest_repeating_non_overlapping_subseq(seq):
candidates = []
for i in range(len(seq)):
candidate_max = len(seq[i + 1:]) // 2
for j in range(1, candidate_max + 1):
candidate, remaining = seq[i:i + j], seq[i + j:]
n_reps = 1
len_candidate = len(candidate)
while remaining[:len_candidate] == candidate:
n_reps += 1
remaining = remaining[len_candidate:]
if n_reps > 1:
candidates.append((seq[:i], n_reps,
candidate, remaining))
if not candidates:
return (type(seq)(), 1, seq, type(seq)())
def score_candidate(candidate):
intro, reps, loop, outro = candidate
return reps - len(intro) - len(outro)
return sorted(candidates, key = score_candidate)[-1]
I'm not sure it is correct, but it passes the simple tests I've
described. The problem with it is that it is way to slow. I've looked
at suffix trees but they don't seem to fit my use case because the
substrings I'm after should be non-overlapping and adjacent.

Here's a way that's clearly quadratic-time, but with a relatively low constant factor because it doesn't build any substring objects apart from those of length 1. The result is a 2-tuple,
bestlen, list_of_results
where bestlen is the length of the longest substring of repeated adjacent blocks, and each result is a 3-tuple,
start_index, width, numreps
meaning that the substring being repeated is
the_string[start_index : start_index + width]
and there are numreps of those adjacent. It will always be that
bestlen == width * numreps
The problem description leaves ambiguities. For example, consider this output:
>>> crunch2("aaaaaabababa")
(6, [(0, 1, 6), (0, 2, 3), (5, 2, 3), (6, 2, 3), (0, 3, 2)])
So it found 5 ways to view "the longest" stretch as being of length 6:
The initial "a" repeated 6 times.
The initial "aa" repeated 3 times.
The leftmost instance of "ab" repeated 3 times.
The leftmost instance of "ba" repeated 3 times.
The initial "aaa" repeated 2 times.
It doesn't return the intro or outro because those are trivial to deduce from what it does return:
The intro is the_string[: start_index].
The outro is the_string[start_index + bestlen :].
If there are no repeated adjacent blocks, it returns
(0, [])
Other examples (from your post):
>>> crunch2("EEEFGAFFGAFFGAFCD")
(12, [(3, 4, 3)])
>>> crunch2("ACCCCCCCCCA")
(9, [(1, 1, 9), (1, 3, 3)])
>>> crunch2("ABCD")
(0, [])
The key to how it works: suppose you have adjacent repeated blocks of width W each. Then consider what happens when you compare the original string to the string shifted left by W:
... block1 block2 ... blockN-1 blockN ...
... block2 block3 ... blockN ... ...
Then you get (N-1)*W consecutive equal characters at the same positions. But that also works in the other direction: if you shift left by W and find (N-1)*W consecutive equal characters, then you can deduce:
block1 == block2
block2 == block3
...
blockN-1 == blockN
so all N blocks must be repetitions of block1.
So the code repeatedly shifts (a copy of) the original string left by one character, then marches left to right over both identifying the longest stretches of equal characters. That only requires comparing a pair of characters at a time. To make "shift left" efficient (constant time), the copy of the string is stored in a collections.deque.
EDIT: update() did far too much futile work in many cases; replaced it.
def crunch2(s):
from collections import deque
# There are zcount equal characters starting
# at index starti.
def update(starti, zcount):
nonlocal bestlen
while zcount >= width:
numreps = 1 + zcount // width
count = width * numreps
if count >= bestlen:
if count > bestlen:
results.clear()
results.append((starti, width, numreps))
bestlen = count
else:
break
zcount -= 1
starti += 1
bestlen, results = 0, []
t = deque(s)
for width in range(1, len(s) // 2 + 1):
t.popleft()
zcount = 0
for i, (a, b) in enumerate(zip(s, t)):
if a == b:
if not zcount: # new run starts here
starti = i
zcount += 1
# else a != b, so equal run (if any) ended
elif zcount:
update(starti, zcount)
zcount = 0
if zcount:
update(starti, zcount)
return bestlen, results
Using regexps
[removed this due to size limit]
Using a suffix array
This is the fastest I've found so far, although can still be provoked into quadratic-time behavior.
Note that it doesn't much matter whether overlapping strings are found. As explained for the crunch2() program above (here elaborated on in minor ways):
Given string s with length n = len(s).
Given ints i and j with 0 <= i < j < n.
Then if w = j-i, and c is the number of leading characters in common between s[i:] and s[j:], the block s[i:j] (of length w) is repeated, starting at s[i], a total of 1 + c // w times.
The program below follows that directly to find all repeated adjacent blocks, and remembers those of maximal total length. Returns the same results as crunch2(), but sometimes in a different order.
A suffix array eases the search, but hardly eliminates it. A suffix array directly finds <i, j> pairs with maximal c, but only limits the search to maximize w * (1 + c // w). Worst cases are strings of the form letter * number, like "a" * 10000.
I'm not giving the code for the sa module below. It's long-winded and any implementation of suffix arrays will compute the same things. The outputs of suffix_array():
sa is the suffix array, the unique permutation of range(n) such that for all i in range(1, n), s[sa[i-1]:] < s[sa[i]:].
rank isn't used here.
For i in range(1, n), lcp[i] gives the length of the longest common prefix between the suffixes starting at sa[i-1] and sa[i].
Why does it win? In part because it never has to search for suffixes that start with the same letter (the suffix array, by construction, makes them adjacent), and checking for a repeated block, and for whether it's a new best, takes small constant time regardless of how large the block or how many times it's repeated. As above, that's just trivial arithmetic on c and w.
Disclaimer: suffix arrays/trees are like continued fractions for me: I can use them when I have to, and can marvel at the results, but they give me a headache. Touchy, touchy, touchy.
def crunch4(s):
from sa import suffix_array
sa, rank, lcp = suffix_array(s)
bestlen, results = 0, []
n = len(s)
for sai in range(n-1):
i = sa[sai]
c = n
for saj in range(sai + 1, n):
c = min(c, lcp[saj])
if not c:
break
j = sa[saj]
w = abs(i - j)
if c < w:
continue
numreps = 1 + c // w
assert numreps > 1
total = w * numreps
if total >= bestlen:
if total > bestlen:
results.clear()
bestlen = total
results.append((min(i, j), w, numreps))
return bestlen, results
Some timings
I read a modest file of English words into a string, xs. One word per line:
>>> len(xs)
209755
>>> xs.count('\n')
25481
So about 25K words in about 210K bytes. These are quadratic-time algorithms, so I didn't expect it to go fast, but crunch2() was still running after hours - and still running when I let it go overnight.
Which caused me to realize its update() function could do an enormous amount of futile work, making the algorithm more like cubic-time overall. So I repaired that. Then:
>>> crunch2(xs)
(44, [(63750, 22, 2)])
>>> xs[63750 : 63750+50]
'\nelectroencephalograph\nelectroencephalography\nelec'
That took about 38 minutes, which was in the ballpark of what I expected.
The regexp version crunch3() took less than a tenth of a second!
>>> crunch3(xs)
(8, [(19308, 4, 2), (47240, 4, 2)])
>>> xs[19308 : 19308+10]
'beriberi\nB'
>>> xs[47240 : 47240+10]
'couscous\nc'
As explained before, the regexp version may not find the best answer, but something else is at work here: by default, "." doesn't match a newline, so the code was actually doing many tiny searches. Each of the ~25K newlines in the file effectively ends the local search range. Compiling the regexp with the re.DOTALL flag instead (so newlines aren't treated specially):
>>> crunch3(xs) # with DOTALL
(44, [(63750, 22, 2)])
in a bit over 14 minutes.
Finally,
>>> crunch4(xs)
(44, [(63750, 22, 2)])
in a bit under 9 minutes. The time to build the suffix array was an insignificant part of that (less than a second). That's actually pretty impressive, since the not-always-right brute force regexp version is slower despite running almost entirely "at C speed".
But that's in a relative sense. In an absolute sense, all of these are still pig slow :-(
NOTE: the version in the next section cuts this to under 5 seconds(!).
Enormously faster
This one takes a completely different approach. For the largish dictionary example above, it gets the right answer in less than 5 seconds.
I'm rather proud of this ;-) It was unexpected, and I haven't seen this approach before. It doesn't do any string searching, just integer arithmetic on sets of indices.
It remains dreadfully slow for inputs of the form letter * largish_integer. As is, it keeps going up by 1 so long as at least two (not necessarily adjacent, or even non-overlapping!) copies of a substring (of the current length being considered) exist. So, for example, in
'x' * 1000000
it will try all substring sizes from 1 through 999999.
However, looks like that could be greatly improved by doubling the current size (instead of just adding 1) repeatedly, saving the classes as it goes along, doing a mixed form of binary search to locate the largest substring size for which a repetition exists.
Which I'll leave as a doubtless tedious exercise for the reader. My work here is done ;-)
def crunch5(text):
from collections import namedtuple, defaultdict
# For all integers i and j in IxSet x.s,
# text[i : i + x.w] == text[j : j + x.w].
# That is, it's the set of all indices at which a specific
# substring of length x.w is found.
# In general, we only care about repeated substrings here,
# so weed out those that would otherwise have len(x.s) == 1.
IxSet = namedtuple("IxSet", "s w")
bestlen, results = 0, []
# Compute sets of indices for repeated (not necessarily
# adjacent!) substrings of length xs[0].w + ys[0].w, by looking
# at the cross product of the index sets in xs and ys.
def combine(xs, ys):
xw, yw = xs[0].w, ys[0].w
neww = xw + yw
result = []
for y in ys:
shifted = set(i - xw for i in y.s if i >= xw)
for x in xs:
ok = shifted & x.s
if len(ok) > 1:
result.append(IxSet(ok, neww))
return result
# Check an index set for _adjacent_ repeated substrings.
def check(s):
nonlocal bestlen
x, w = s.s.copy(), s.w
while x:
current = start = x.pop()
count = 1
while current + w in x:
count += 1
current += w
x.remove(current)
while start - w in x:
count += 1
start -= w
x.remove(start)
if count > 1:
total = count * w
if total >= bestlen:
if total > bestlen:
results.clear()
bestlen = total
results.append((start, w, count))
ch2ixs = defaultdict(set)
for i, ch in enumerate(text):
ch2ixs[ch].add(i)
size1 = [IxSet(s, 1)
for s in ch2ixs.values()
if len(s) > 1]
del ch2ixs
for x in size1:
check(x)
current_class = size1
# Repeatedly increase size by 1 until current_class becomes
# empty. At that point, there are no repeated substrings at all
# (adjacent or not) of the then-current size (or larger).
while current_class:
current_class = combine(current_class, size1)
for x in current_class:
check(x)
return bestlen, results
And faster still
crunch6() drops the largish dictionary example to under 2 seconds on my box. It combines ideas from crunch4() (suffix and lcp arrays) and crunch5() (find all arithmetic progressions with a given stride in a set of indices).
Like crunch5(), this also loops around a number of times equal to one more than the length of the repeated longest substring (overlapping or not). For if there are no repeats of length n, there are none for any size greater than n either. That makes finding repeats without regard to overlap easier, because it's an exploitable limitation. When constraining "wins" to adjacent repeats, that breaks down. For example, there are no adjacent repeats of even length 1 in "abcabc", but there is one of length 3. That appears to make any form of direct binary search futile (the presence or absence of adjacent repeats of size n says nothing about the existence of adjacent repeats of any other size).
Inputs of the form 'x' * n remain miserable. There are repeats of all lengths from 1 through n-1.
Observation: all the programs I've given generate all possible ways of breaking up repeated adjacent chunks of maximal length. For example, for a string of 9 "x", it says it can be gotten by repeating "x" 9 times or by repeating "xxx" 3 times. So, surprisingly, they can all be used as factoring algorithms too ;-)
def crunch6(text):
from sa import suffix_array
sa, rank, lcp = suffix_array(text)
bestlen, results = 0, []
n = len(text)
# Generate maximal sets of indices s such that for all i and j
# in s the suffixes starting at s[i] and s[j] start with a
# common prefix of at least len minc.
def genixs(minc, sa=sa, lcp=lcp, n=n):
i = 1
while i < n:
c = lcp[i]
if c < minc:
i += 1
continue
ixs = {sa[i-1], sa[i]}
i += 1
while i < n:
c = min(c, lcp[i])
if c < minc:
yield ixs
i += 1
break
else:
ixs.add(sa[i])
i += 1
else: # ran off the end of lcp
yield ixs
# Check an index set for _adjacent_ repeated substrings
# w apart. CAUTION: this empties s.
def check(s, w):
nonlocal bestlen
while s:
current = start = s.pop()
count = 1
while current + w in s:
count += 1
current += w
s.remove(current)
while start - w in s:
count += 1
start -= w
s.remove(start)
if count > 1:
total = count * w
if total >= bestlen:
if total > bestlen:
results.clear()
bestlen = total
results.append((start, w, count))
c = 0
found = True
while found:
c += 1
found = False
for s in genixs(c):
found = True
check(s, c)
return bestlen, results
Always fast, and published, but sometimes wrong
In bioinformatics, turns out this is studied under the names "tandem repeats", "tandem arrays", and "simple sequence repeats" (SSR). You can search for those terms to find quite a few academic papers, some claiming worst-case linear-time algorithms.
But those seem to fall into two camps:
Linear-time algorithms of the kind to be described, which are actually wrong :-(
Algorithms so complicated it would take dedication to even try to turn them into functioning code :-(
In the first camp, there are several papers that boil down to crunch4() above, but without its inner loop. I'll follow this with code for that, crunch4a(). Here's an example:
"SA-SSR: a suffix array-based algorithm for exhaustive and efficient SSR discovery in large genetic sequences."
Pickett et alia
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5013907/
crunch4a() is always fast, but sometimes wrong. In fact it finds at least one maximal repeated stretch for every example that appeared here, solves the largish dictionary example in a fraction of a second, and has no problem with strings of the form 'x' * 1000000. The bulk of the time is spent building the suffix and lcp arrays. But it can fail:
>>> x = "bcdabcdbcd"
>>> crunch4(x) # finds repeated bcd at end
(6, [(4, 3, 2)])
>>> crunch4a(x) # finds nothing
(0, [])
The problem is that there's no guarantee that the relevant suffixes are adjacent in the suffix array. The suffixes that start with "b" are ordered like so:
bcd
bcdabcdbcd
bcdbcd
To find the trailing repeated block by this approach requires comparing the first with the third. That's why crunch4() has an inner loop, to try all pairs starting with a common letter. The relevant pair can be separated by an arbitrary number of other suffixes in a suffix array. But that also makes the algorithm quadratic time.
# only look at adjacent entries - fast, but sometimes wrong
def crunch4a(s):
from sa import suffix_array
sa, rank, lcp = suffix_array(s)
bestlen, results = 0, []
n = len(s)
for sai in range(1, n):
i, j = sa[sai - 1], sa[sai]
c = lcp[sai]
w = abs(i - j)
if c >= w:
numreps = 1 + c // w
total = w * numreps
if total >= bestlen:
if total > bestlen:
results.clear()
bestlen = total
results.append((min(i, j), w, numreps))
return bestlen, results
O(n log n)
This paper looks right to me, although I haven't coded it:
"Simple and Flexible Detection of Contiguous Repeats Using a Suffix Tree"
Jens Stoye, Dan Gusfield
https://csiflabs.cs.ucdavis.edu/~gusfield/tcs.pdf
Getting to a sub-quadratic algorithm requires making some compromises, though. For example, "x" * n has n-1 substrings of the form "x"*2, n-2 of the form "x"*3, ..., so there are O(n**2) of those alone. So any algorithm that finds all of them is necessarily also at best quadratic time.
Read the paper for details ;-) One concept you're looking for is "primitive": I believe you only want repeats of the form S*n where S cannot itself be expressed as a repetition of shorter strings. So, e.g., "x" * 10 is primitive, but "xx" * 5 is not.
One step on the way to O(n log n)
crunch9() is an implementation of the "brute force" algorithm I mentioned in the comments, from:
"The enhanced suffix array and its applications to genome analysis"
Ibrahim et alia
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.2217&rep=rep1&type=pdf
The implementation sketch there only finds "branching tandem" repeats, and I added code here to deduce repeats of any number of repetitions, and to include non-branching repeats too. While it's still O(n**2) worst case, it's much faster than anything else here for the seq string you pointed to in the comments. As is, it reproduces (except for order) the same exhaustive account as most of the other programs here.
The paper goes on to fight hard to cut the worst case to O(n log n), but that slows it down a lot. So then it fights harder. I confess I lost interest ;-)
# Generate lcp intervals from the lcp array.
def genlcpi(lcp):
lcp.append(0)
stack = [(0, 0)]
for i in range(1, len(lcp)):
c = lcp[i]
lb = i - 1
while c < stack[-1][0]:
i_c, lb = stack.pop()
interval = i_c, lb, i - 1
yield interval
if c > stack[-1][0]:
stack.append((c, lb))
lcp.pop()
def crunch9(text):
from sa import suffix_array
sa, rank, lcp = suffix_array(text)
bestlen, results = 0, []
n = len(text)
# generate branching tandem repeats
def gen_btr(text=text, n=n, sa=sa):
for c, lb, rb in genlcpi(lcp):
i = sa[lb]
basic = text[i : i + c]
# Binary searches to find subrange beginning with
# basic+basic. A more gonzo implementation would do this
# character by character, never materialzing the common
# prefix in `basic`.
rb += 1
hi = rb
while lb < hi: # like bisect.bisect_left
mid = (lb + hi) // 2
i = sa[mid] + c
if text[i : i + c] < basic:
lb = mid + 1
else:
hi = mid
lo = lb
while lo < rb: # like bisect.bisect_right
mid = (lo + rb) // 2
i = sa[mid] + c
if basic < text[i : i + c]:
rb = mid
else:
lo = mid + 1
lead = basic[0]
for sai in range(lb, rb):
i = sa[sai]
j = i + 2*c
assert j <= n
if j < n and text[j] == lead:
continue # it's non-branching
yield (i, c, 2)
for start, c, _ in gen_btr():
# extend left
numreps = 2
for i in range(start - c, -1, -c):
if all(text[i+k] == text[start+k] for k in range(c)):
start = i
numreps += 1
else:
break
totallen = c * numreps
if totallen < bestlen:
continue
if totallen > bestlen:
bestlen = totallen
results.clear()
results.append((start, c, numreps))
# add non-branches
while start:
if text[start - 1] == text[start + c - 1]:
start -= 1
results.append((start, c, numreps))
else:
break
return bestlen, results
Earning the bonus points ;-)
For some technical meaning ;-) crunch11() is worst-case O(n log n). Besides the suffix and lcp arrays, this also needs the rank array, sa's inverse:
assert all(rank[sa[i]] == sa[rank[i]] == i for i in range(len(sa)))
As code comments note, it also relies on Python 3 for speed (range() behavior). That's shallow but would be tedious to rewrite.
Papers describing this have several errors, so don't flip out if this code doesn't exactly match what you read about. Implement exactly what they say instead, and it will fail.
That said, the code is getting uncomfortably complex, and I can't guarantee there aren't bugs. It works on everything I've tried.
Inputs of the form 'x' * 1000000 still aren't speedy, but clearly no longer quadratic-time. For example, a string repeating the same letter a million times completes in close to 30 seconds. Most other programs here would never end ;-)
EDIT: changed genlcpi() to use semi-open Python ranges; made mostly cosmetic changes to crunch11(); added "early out" that saves about a third the time in worst (like 'x' * 1000000) cases.
# Generate lcp intervals from the lcp array.
def genlcpi(lcp):
lcp.append(0)
stack = [(0, 0)]
for i in range(1, len(lcp)):
c = lcp[i]
lb = i - 1
while c < stack[-1][0]:
i_c, lb = stack.pop()
yield (i_c, lb, i)
if c > stack[-1][0]:
stack.append((c, lb))
lcp.pop()
def crunch11(text):
from sa import suffix_array
sa, rank, lcp = suffix_array(text)
bestlen, results = 0, []
n = len(text)
# Generate branching tandem repeats.
# (i, c, 2) is branching tandem iff
# i+c in interval with prefix text[i : i+c], and
# i+c not in subinterval with prefix text[i : i+c + 1]
# Caution: this pragmatically relies on that, in Python 3,
# `range()` returns a tiny object with O(1) membership testing.
# In Python 2 it returns a list - ahould still work, but very
# much slower.
def gen_btr(text=text, n=n, sa=sa, rank=rank):
from itertools import chain
for c, lb, rb in genlcpi(lcp):
origlb, origrb = lb, rb
origrange = range(lb, rb)
i = sa[lb]
lead = text[i]
# Binary searches to find subrange beginning with
# text[i : i+c+1]. Note we take slices of length 1
# rather than just index to avoid special-casing for
# i >= n.
# A more elaborate traversal of the lcp array could also
# give us a list of child intervals, and then we'd just
# need to pick the right one. But that would be even
# more hairy code, and unclear to me it would actually
# help the worst cases (yes, the interval can be large,
# but so can a list of child intervals).
hi = rb
while lb < hi: # like bisect.bisect_left
mid = (lb + hi) // 2
i = sa[mid] + c
if text[i : i+1] < lead:
lb = mid + 1
else:
hi = mid
lo = lb
while lo < rb: # like bisect.bisect_right
mid = (lo + rb) // 2
i = sa[mid] + c
if lead < text[i : i+1]:
rb = mid
else:
lo = mid + 1
subrange = range(lb, rb)
if 2 * len(subrange) <= len(origrange):
# Subrange is at most half the size.
# Iterate over it to find candidates i, starting
# with wa. If i+c is also in origrange, but not
# in subrange, good: then i is of the form wwx.
for sai in subrange:
i = sa[sai]
ic = i + c
if ic < n:
r = rank[ic]
if r in origrange and r not in subrange:
yield (i, c, 2, subrange)
else:
# Iterate over the parts outside subrange instead.
# Candidates i are then the trailing wx in the
# hoped-for wwx. We win if i-c is in subrange too
# (or, for that matter, if it's in origrange).
for sai in chain(range(origlb, lb),
range(rb, origrb)):
ic = sa[sai] - c
if ic >= 0 and rank[ic] in subrange:
yield (ic, c, 2, subrange)
for start, c, numreps, irange in gen_btr():
# extend left
crange = range(start - c, -1, -c)
if (numreps + len(crange)) * c < bestlen:
continue
for i in crange:
if rank[i] in irange:
start = i
numreps += 1
else:
break
# check for best
totallen = c * numreps
if totallen < bestlen:
continue
if totallen > bestlen:
bestlen = totallen
results.clear()
results.append((start, c, numreps))
# add non-branches
while start and text[start - 1] == text[start + c - 1]:
start -= 1
results.append((start, c, numreps))
return bestlen, results

Here's my implementation of what you're talking about. It's pretty similar to yours, but it skips over substrings which have been checked as repetitions of previous substrings.
from collections import namedtuple
SubSequence = namedtuple('SubSequence', ['start', 'length', 'reps'])
def longest_repeating_subseq(original: str):
winner = SubSequence(start=0, length=0, reps=0)
checked = set()
subsequences = ( # Evaluates lazily during iteration
SubSequence(start=start, length=length, reps=1)
for start in range(len(original))
for length in range(1, len(original) - start)
if (start, length) not in checked)
for s in subsequences:
subseq = original[s.start : s.start + s.length]
for reps, next_start in enumerate(
range(s.start + s.length, len(original), s.length),
start=1):
if subseq != original[next_start : next_start + s.length]:
break
else:
checked.add((next_start, s.length))
s = s._replace(reps=reps)
if s.reps > 1 and (
(s.length * s.reps > winner.length * winner.reps)
or ( # When total lengths are equal, prefer the shorter substring
s.length * s.reps == winner.length * winner.reps
and s.reps > winner.reps)):
winner = s
# Check for default case with no repetitions
if winner.reps == 0:
winner = SubSequence(start=0, length=len(original), reps=1)
return (
original[ : winner.start],
winner.reps,
original[winner.start : winner.start + winner.length],
original[winner.start + winner.length * winner.reps : ])
def test(seq, *, expect):
print(f'Testing longest_repeating_subseq for {seq}')
result = longest_repeating_subseq(seq)
print(f'Expected {expect}, got {result}')
print(f'Test {"passed" if result == expect else "failed"}')
print()
if __name__ == '__main__':
test('EEEFGAFFGAFFGAFCD', expect=('EEE', 3, 'FGAF', 'CD'))
test('ACCCCCCCCCA', expect=('A', 9, 'C', 'A'))
test('ABCD', expect=('', 1, 'ABCD', ''))
Passes all three of your examples for me. This seems like the sort of thing that could have a lot of weird edge cases, but given that it's an optimized brute force, it would probably be more a matter of updating the spec rather than fixing a bug in the code itself.

It looks like what you are trying to do is pretty much the LZ77 compression algorithm. You can check your code against the reference implementation in the Wikipedia article I linked to.

Efficiency of finding mismatched patterns

I'm working on a simple bioinformatics problem. I have a working solution, but it is absurdly inefficient. How can I increase my efficiency?
Problem:
Find patterns of length k in the string g, given that the k-mer can have up to d mismatches.
And these strings and patterns are all genomic--so our set of possible characters is {A, T, C, G}.
I'll call the function FrequentWordsMismatch(g, k, d).
So, here are a few helpful examples:
FrequentWordsMismatch('AAAAAAAAAA', 2, 1) → ['AA', 'CA', 'GA', 'TA', 'AC', 'AG', 'AT']
Here's a much longer example, if you implement this and want to test:
FrequentWordsMisMatch('CACAGTAGGCGCCGGCACACACAGCCCCGGGCCCCGGGCCGCCCCGGGCCGGCGGCCGCCGGCGCCGGCACACCGGCACAGCCGTACCGGCACAGTAGTACCGGCCGGCCGGCACACCGGCACACCGGGTACACACCGGGGCGCACACACAGGCGGGCGCCGGGCCCCGGGCCGTACCGGGCCGCCGGCGGCCCACAGGCGCCGGCACAGTACCGGCACACACAGTAGCCCACACACAGGCGGGCGGTAGCCGGCGCACACACACACAGTAGGCGCACAGCCGCCCACACACACCGGCCGGCCGGCACAGGCGGGCGGGCGCACACACACCGGCACAGTAGTAGGCGGCCGGCGCACAGCC', 10, 2) → ['GCACACAGAC', 'GCGCACACAC']
With my naive solution, that second example could easily take ~60 seconds, though the first one is pretty quick.
Naive solution:
My idea was to, for every k-length segment in g, find every possible "neighbor" (e.g. other k-length segments with up to d mismatches) and add those neighbors as keys to a dictionary. I then count how many times each one of those neighbor kmers show up in the string g, and record those in the dictionary.
Obviously that's a kinda shitty way to do that, since the amount of neighbors scales like crazy as k and d increase, and having to scan through the strings with each of those neighbors makes this implementation terribly slow. But alas, that's why I'm asking for help.
I'll put my code below. There're definitely a lot of novice mistakes to unpack, so thanks for your time and attention.
def FrequentWordsMismatch(g, k, d):
'''
Finds the most frequent k-mer patterns in the string g, given that those
patterns can mismatch amongst themselves up to d times
g (String): Collection of {A, T, C, G} characters
k (int): Length of desired pattern
d (int): Number of allowed mismatches
'''
counts = {}
answer = []
for i in range(len(g) - k + 1):
kmer = g[i:i+k]
for neighborkmer in Neighbors(kmer, d):
counts[neighborkmer] = Count(neighborkmer, g, d)
maxVal = max(counts.values())
for key in counts.keys():
if counts[key] == maxVal:
answer.append(key)
return(answer)
def Neighbors(pattern, d):
'''
Find all strings with at most d mismatches to the given pattern
pattern (String): Original pattern of characters
d (int): Number of allowed mismatches
'''
if d == 0:
return [pattern]
if len(pattern) == 1:
return ['A', 'C', 'G', 'T']
answer = []
suffixNeighbors = Neighbors(pattern[1:], d)
for text in suffixNeighbors:
if HammingDistance(pattern[1:], text) < d:
for n in ['A', 'C', 'G', 'T']:
answer.append(n + text)
else:
answer.append(pattern[0] + text)
return(answer)
def HammingDistance(p, q):
'''
Find the hamming distance between two strings
p (String): String to be compared to q
q (String): String to be compared to p
'''
ham = 0 + abs(len(p)-len(q))
for i in range(min(len(p), len(q))):
if p[i] != q[i]:
ham += 1
return(ham)
def Count(pattern, g, d):
'''
Count the number of times that the pattern occurs in the string g,
allowing for up to d mismatches
pattern (String): Pattern of characters
g (String): String in which we're looking for pattern
d (int): Number of allowed mismatches
'''
return len(MatchWithMismatch(pattern, g, d))
def MatchWithMismatch(pattern, g, d):
'''
Find the indicies at which the pattern occurs in the string g,
allowing for up to d mismatches
pattern (String): Pattern of characters
g (String): String in which we're looking for pattern
d (int): Number of allowed mismatches
'''
answer = []
for i in range(len(g) - len(pattern) + 1):
if(HammingDistance(g[i:i+len(pattern)], pattern) <= d):
answer.append(i)
return(answer)
More tests
FrequentWordsMismatch('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4, 1) → ['ATGC', 'ATGT', 'GATG']
FrequentWordsMismatch('AGTCAGTC', 4, 2) → ['TCTC', 'CGGC', 'AAGC', 'TGTG', 'GGCC', 'AGGT', 'ATCC', 'ACTG', 'ACAC', 'AGAG', 'ATTA', 'TGAC', 'AATT', 'CGTT', 'GTTC', 'GGTA', 'AGCA', 'CATC']
FrequentWordsMismatch('AATTAATTGGTAGGTAGGTA', 4, 0) → ["GGTA"]
FrequentWordsMismatch('ATA', 3, 1) → ['GTA', 'ACA', 'AAA', 'ATC', 'ATA', 'AGA', 'ATT', 'CTA', 'TTA', 'ATG']
FrequentWordsMismatch('AAT', 3, 0) → ['AAT']
FrequentWordsMismatch('TAGCG', 2, 1) → ['GG', 'TG']

The problem description is ambiguous in several ways, so I'm going by the examples. You seem to want all k-length strings from the alphabet (A, C, G, T} such that the number of matches to contiguous substrings of g is maximal - where "a match" means character-by-character equality with at most d character inequalities.
I'm ignoring that your HammingDistance() function makes something up even when inputs have different lengths, mostly because it doesn't make much sense to me ;-) , but partly because that isn't needed to get the results you want in any of the examples you gave.
The code below produces the results you want in all the examples, in the sense of producing permutations of the output lists you gave. If you want canonical outputs, I'd suggest sorting an output list before returning it.
The algorithm is pretty simple, but relies on itertools to do the heavy combinatorial lifting "at C speed". All the examples run in well under a second total.
For each length-k contiguous substring of g, consider all combinations(k, d) sets of d distinct index positions. There are 4**d ways to fill those index positions with letters from {A, C, G, T}, and each such way is "a pattern" that matches the substring with at most d discrepancies. Duplicates are weeded out by remembering the patterns already generated; this is faster than making heroic efforts to generate only unique patterns to begin with.
So, in all, the time requirement is O(len(g) * k**d * 4**d) = O(len(g) * (4*k)**d, where k**d is, for reasonably small values of k and d, an overstated standin for the binomial coefficent combinations(k, d). The important thing to note is that - unsurprisingly - it's exponential in d.
def fwm(g, k, d):
from itertools import product, combinations
from collections import defaultdict
all_subs = list(product("ACGT", repeat=d))
all_ixs = list(combinations(range(k), d))
patcount = defaultdict(int)
for starti in range(len(g)):
base = g[starti : starti + k]
if len(base) < k:
break
patcount[base] += 1
seen = set([base])
basea = list(base)
for ixs in all_ixs:
saved = [basea[i] for i in ixs]
for newchars in all_subs:
for i, newchar in zip(ixs, newchars):
basea[i] = newchar
candidate = "".join(basea)
if candidate not in seen:
seen.add(candidate)
patcount[candidate] += 1
for i, ch in zip(ixs, saved):
basea[i] = ch
maxcount = max(patcount.values())
return [p for p, c in patcount.items() if c == maxcount]
EDIT: Generating Patterns Uniquely
Rather than weed out duplicates by keeping a set of those seen so far, it's straightforward enough to prevent generating duplicates to begin with. In fact, the following code is shorter and simpler, although somewhat subtler. In return for less redundant work, there are layers of recursive calls to the inner() function. Which way is faster appears to depend on the specific inputs.
def fwm(g, k, d):
from collections import defaultdict
patcount = defaultdict(int)
alphabet = "ACGT"
allbut = {ch: tuple(c for c in alphabet if c != ch)
for ch in alphabet}
def inner(i, rd):
if not rd or i == k:
patcount["".join(base)] += 1
return
inner(i+1, rd)
orig = base[i]
for base[i] in allbut[orig]:
inner(i+1, rd-1)
base[i] = orig
for i in range(len(g) - k + 1):
base = list(g[i : i + k])
inner(0, d)
maxcount = max(patcount.values())
return [p for p, c in patcount.items() if c == maxcount]

Going on your problem description alone and not your examples (for the reasons I explained in the comment), one approach would be:
s = "CACAGTAGGCGCCGGCACACACAGCCCCGGGCCCCGGGCCGCCCCGGGCCGGCGGCCGCCGGCGCCGGCACACCGGCACAGC"\
"CGTACCGGCACAGTAGTACCGGCCGGCCGGCACACCGGCACACCGGGTACACACCGGGGCGCACACACAGGCGGGCGCCGGG"\
"CCCCGGGCCGTACCGGGCCGCCGGCGGCCCACAGGCGCCGGCACAGTACCGGCACACACAGTAGCCCACACACAGGCGGGCG"\
"GTAGCCGGCGCACACACACACAGTAGGCGCACAGCCGCCCACACACACCGGCCGGCCGGCACAGGCGGGCGGGCGCACACAC"\
"ACCGGCACAGTAGTAGGCGGCCGGCGCACAGCC"
def frequent_words_mismatch(g,k,d):
def num_misspellings(x,y):
return sum(xx != yy for (xx,yy) in zip(x,y))
seen = set()
for i in range(len(g)-k+1):
seen.add(g[i:i+k])
# For each unique sequence, add a (key,bin) pair to the bins dictionary
# (The bin is initialized to a list containing only the sequence, for now)
bins = {seq:[seq,] for seq in seen}
# Loop again through the unique sequences...
for seq in seen:
# Try to fit it in *all* already-existing bins (based on bin key)
for bk in bins:
# Don't re-add seq to it's own bin
if bk == seq: continue
# Test bin keys, try to find all appropriate bins
if num_misspellings(seq, bk) <= d:
bins[bk].append(seq)
# Get a list of the bin keys (one for each unique sequence) sorted in order of the
# number of elements in the corresponding bins
sorted_keys = sorted(bins, key= lambda k:len(bins[k]), reverse=True)
# largest_bin_key will be the key of the largest bin (there may be ties, so in fact
# this is *a* key of *one of the bins with the largest length*). That is, it'll
# be the sequence (found in the string) that the most other sequences (also found
# in the string) are at most d-distance from.
largest_bin_key = sorted_keys[0]
# You can return this bin, as your question description (but not examples) indicate:
return bins[largest_bin_key]
largest_bin = frequent_words_mismatch(s,10,2)
print(len(largest_bin)) # 13
print(largest_bin)
The (this) largest bin contains:
['CGGCCGCCGG', 'GGGCCGGCGG', 'CGGCCGGCGC', 'AGGCGGCCGG', 'CAGGCGCCGG',
'CGGCCGGCCG', 'CGGTAGCCGG', 'CGGCGGCCGC', 'CGGGCGCCGG', 'CCGGCGCCGG',
'CGGGCCCCGG', 'CCGCCGGCGG', 'GGGCCGCCGG']
It's O(n**2) where n is the number of unique sequences and completes on my computer in around 0.1 seconds.

Query long lists

I would like to query the value of an exponentially weighted moving average at particular points. An inefficient way to do this is as follows. l is the list of times of events and queries has the times at which I want the value of this average.
a=0.01
l = [3,7,10,20,200]
y = [0]*1000
for item in l:
y[int(item)]=1
s = [0]*1000
for i in xrange(1,1000):
s[i] = a*y[i-1]+(1-a)*s[i-1]
queries = [23,68,103]
for q in queries:
print s[q]
Outputs:
0.0355271185019
0.0226018371526
0.0158992102478
In practice l will be very large and the range of values in l will also be huge. How can you find the values at the times in queries more efficiently, and especially without computing the potentially huge lists y and s explicitly. I need it to be in pure python so I can use pypy.
Is it possible to solve the problem in time proportional to len(l)
and not max(l) (assuming len(queries) < len(l))?

Here is my code for doing this:
def ewma(l, queries, a=0.01):
def decay(t0, x, t1, a):
from math import pow
return pow((1-a), (t1-t0))*x
assert l == sorted(l)
assert queries == sorted(queries)
samples = []
try:
t0, x0 = (0.0, 0.0)
it = iter(queries)
q = it.next()-1.0
for t1 in l:
# new value is decayed previous value, plus a
x1 = decay(t0, x0, t1, a) + a
# take care of all queries between t0 and t1
while q < t1:
samples.append(decay(t0, x0, q, a))
q = it.next()-1.0
# take care of all queries equal to t1
while q == t1:
samples.append(x1)
q = it.next()-1.0
# update t0, x0
t0, x0 = t1, x1
# take care of any remaining queries
while True:
samples.append(decay(t0, x0, q, a))
q = it.next()-1.0
except StopIteration:
return samples
I've also uploaded a fuller version of this code with unit tests and some comments to pastebin: http://pastebin.com/shhaz710
EDIT: Note that this does the same thing as what Chris Pak suggests in his answer, which he must have posted as I was typing this. I haven't gone through the details of his code, but I think mine is a bit more general. This code supports non-integer values in l and queries. It also works for any kind of iterables, not just lists since I don't do any indexing.

I think you could do it in ln(l) time, if l is sorted. The basic idea is that the non recursive form of EMA is a*s_i + (1-a)^1 * s_(i-1) + (1-a)^2 * s_(i-2) ....
This means for query k, you find the greatest number in l less than k, and for a estimation limit, use the following, where v is the index in l, l[v] is the value
(1-a)^(k-v) *l[v] + ....
Then, you spend lg(len(l)) time in search + a constant multiple for the depth of your estimation. I'll provide a code sample in a little bit (after work) if you want it, just wanted to get my idea out there while I was thinking about it
here's the code -
v is the dictionary of values at a given time; replace with 1 if it's just a 1 every time...
import math
from bisect import bisect_right
a = .01
limit = 1000
l = [1,5,14,29...]
def find_nearest_lt(l, time):
i = bisect_right(a, x)
if i:
return i-1
raise ValueError
def find_ema(l, time):
i = find_nearest_lt(l, time)
if l[i] == time:
result = a * v[l[i]
i -= 1
else:
result = 0
while (time-l[i]) < limit:
result += math.pow(1-a, time-l[i]) * v[l[i]]
i -= 1
return result
if I'm thinking correctly, the find nearest is l(n), then the while loop is <= 1000 iterations, guaranteed, so it's technically a constant (though a kind of large one). find_nearest was stolen from the page on bisect - http://docs.python.org/2/library/bisect.html

It appears that y is a binary value -- either 0 or 1 -- depending on the values of l. Why not use y = set(int(item) for item in l)? That's the most efficient way to store and look up a list of numbers.
Your code will cause an error the first time through this loop:
s = [0]*1000
for i in xrange(1000):
s[i] = a*y[i-1]+(1-a)*s[i-1]
because i-1 is -1 when i=0 (first pass of loop) and both y[-1] and s[-1] are the last element of the list, not the previous. Maybe you want xrange(1,1000)?
How about this code:
a=0.01
l = [3.0,7.0,10.0,20.0,200.0]
y = set(int(item) for item in l)
queries = [23,68,103]
ewma = []
x = 1 if (0 in y) else 0
for i in xrange(1, queries[-1]):
x = (1-a)*x
if i in y:
x += a
if i == queries[0]:
ewma.append(x)
queries.pop(0)
When it's done, ewma should have the moving averages for each query point.
Edited to include SchighSchagh's improvements.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Longest common substring with rolling hash - python

Related

Python - long execution time

find the minimum number of words(distance) between repeated occurrences of a search string in the input string

Find longest adjacent repeating non-overlapping substring

Efficiency of finding mismatched patterns

Query long lists

Categories

Resources