Given a long string, find the longest repeated sub-string.
The brute-force approach, of course, is to compare every substring against the rest of the string, but the string(s) in question have millions of characters (like a DNA sequence, AGGCTAGCT etc.) and I'd like something that finishes before the universe collapses in on itself.
I've tried a number of approaches, and I have one solution that works quite fast on strings of up to several million characters, but takes literally forever (6+ hours) on larger strings, particularly when the length of the repeated sequence gets really long.
from collections import defaultdict

def find_lrs(text, cntr=2):
    sol = (0, 0, 0)
    del_list = ['01', '01', '01']  # dummy value so the while loop runs at least once
    while len(del_list) != 0:
        d = defaultdict(list)
        for i in range(len(text)):
            d[text[i:i + cntr]].append(i)
        # keep only substrings that occur at least twice
        del_list = [(item, d[item]) for item in d if len(d[item]) > 1]
        # if the list is empty, we're done
        if len(del_list) == 0:
            return sol
        else:
            sol = (del_list[0][1][0], del_list[0][1][1], len(del_list[0][0]))
        cntr += 1
    return sol
I know it's ugly, but hey, I'm a beginner, and I'm just happy I got something to work. The idea is to go through the string, starting with length-2 substrings as the keys and the indices where each substring occurs as the values. If the text was, say, 'BANANA', after the first pass through, the dict would look like this:
{'BA': [0], 'AN': [1, 3], 'NA': [2, 4], 'A': [5]}
BA shows up only once, starting at index 0. AN and NA show up twice, at indices 1/3 and 2/4, respectively.
I then create a list that only includes keys that showed up at least twice. In the example above, we can remove BA, since it only showed up once - if there's no repeated substring of length 2 starting with 'BA', there can't be a repeated substring of length 3 starting with 'BA' either.
So after the first pass, the pruned list is:
[('AN', [1, 3]), ('NA', [2, 4])]
Since there are at least two candidates left, we save the longest substring and indices found so far and increment the substring length to 3. We continue until no substring is repeated.
As noted, this works on strings of up to 10 million characters in about 2 minutes, which apparently is reasonable - BUT that's with the longest repeated sequence being fairly short. On a shorter string with a longer repeated sequence, it takes -hours- to run. I suspect it has something to do with how big the dictionary gets, but I'm not quite sure why.
What I'd like to do of course is keep the dictionary short by removing the substrings that clearly aren't repeated, but I can't delete items from the dict while iterating over it. I know there are suffix tree approaches and such that - for now - are outside my ken.
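For what it's worth, the standard workaround is to build a new, pruned dict rather than deleting entries mid-iteration - a minimal sketch (mine, not part of the original post):

# rebuild the dict, keeping only substrings that occur at least twice;
# this sidesteps mutating d while iterating over it
d = {key: positions for key, positions in d.items() if len(positions) > 1}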
Could simply be that this is beyond my current knowledge, which of course is fine, but I can't help shaking the idea that there is a solution here.
I forgot to update this. After going over my code again, away from my PC - literally writing out little diagrams on my iPad - I realized that the code above wasn't doing what I thought it was doing.
As noted above, my plan of attack was to go through the string starting with length-2 substrings as the keys and the indices where each substring occurs as the values, create a list that captures only the length-2 substrings that occurred at least twice, and then only look at those locations.
All well and good - but look closely and you'll realize that I'm never actually updating the default dictionary to only have locations with two or more repeats! //bangs head against table.
I ultimately came up with two solutions. The first solution used a slightly different approach, the 'sorted suffixes' approach. This gets all the suffixes of the word, then sorts them in alphabetical order. For example, the suffixes of "BANANA", sorted, would be:
A
ANA
ANANA
BANANA
NA
NANA
We then look at each pair of adjacent suffixes and count how many letters they have in common at the start. A and ANA have only 'A' in common. ANA and ANANA have "ANA" in common, so we have length 3 as the longest repeated substring so far. ANANA and BANANA have nothing in common at the start; ditto BANANA and NA. NA and NANA have "NA" in common. So 'ANA', length 3, is the longest repeated substring.
I made a little helper function to do the actual comparing. The code looks like this:
def longest_prefix(suf1, suf2):
    # compare characters until the first mismatch
    min_len = min(len(suf1), len(suf2))
    for i in range(min_len):
        if suf1[i] != suf2[i]:
            return (suf1[:i], i)
    # no mismatch: one suffix is a prefix of the other
    return (suf1[:min_len], min_len)
def longest_repeat(txt):
    # sort all suffixes of the text alphabetically
    lst = sorted([txt[i:] for i in range(len(txt))])
    mxLen = 0
    mx_string = ""
    for x in range(len(lst) - 1):
        temp = longest_prefix(lst[x], lst[x + 1])
        if temp[1] > mxLen:
            mxLen = temp[1]
            mx_string = temp[0]
    first = txt.find(mx_string)
    last = txt.rfind(mx_string)
    return (first, last, mxLen)
This works. I then went back and looked again at my original code and confirmed that I wasn't resetting the dictionary. The key is that after each pass I update the dictionary to look at -only- the repeat candidates.
def longest_repeat(text):
    # create the initial dictionary with all length-2 substrings
    cntr = 2  # size of the initial substring length we look for
    d = defaultdict(list)
    for i in range(len(text)):
        d[text[i:i + cntr]].append(i)
    # keep only the index lists of substrings that appear at least twice
    del_list = [d[item] for item in d if len(d[item]) > 1]
    sol = (0, 0, 0)
    # keep looking as long as del_list isn't empty
    while len(del_list) > 0:
        d = defaultdict(list)  # reset dictionary
        cntr += 1  # increment search length
        for item in del_list:
            for i in item:
                d[text[i:i + cntr]].append(i)
        # filter as above
        del_list = [d[item] for item in d if len(d[item]) > 1]
        # if not empty, update the solution
        if len(del_list) != 0:
            sol = (del_list[0][0], del_list[0][1], cntr)
    return sol
This was quite fast, and I think it's easier to follow.
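As a quick sanity check (my own demo, not part of the original post), running the final version on the BANANA example from earlier gives:

print(longest_repeat("BANANA"))  # (1, 3, 3): 'ANA' repeats at indices 1 and 3, length 3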
Related
I've got a binary list returned from a k-means classification with k = 2, and I am trying to 1) count the number of 0,0,0,... sublists of a given minimum length - say 3 - and 2) identify the start and end locations of these sublists. So for the list L = [1,1,0,0,0,0,0,1,1,1,0,0,1,0,0,0], the outputs would ideally be number = 2 and start_end_locations = [[2,6],[13,15]].
The lists I'm working with are tens of thousands of elements long, so I need to find a computationally fast way of performing this operation. I've seen many posts using groupby from itertools, but I can't find a way to apply them to my task.
Thanks in advance for your suggestions!
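For reference, the itertools.groupby approach mentioned in the question could look something like the sketch below (zero_runs_groupby is an illustrative name, not taken from any answer):

from itertools import groupby

def zero_runs_groupby(L, min_size=3):
    runs, pos = [], 0
    for value, group in groupby(L):
        n = len(list(group))  # length of this run of equal values
        if value == 0 and n >= min_size:
            runs.append([pos, pos + n - 1])
        pos += n
    return runs

L = [1,1,0,0,0,0,0,1,1,1,0,0,1,0,0,0]
print(len(zero_runs_groupby(L)), zero_runs_groupby(L))  # 2 [[2, 6], [13, 15]]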
craft a regular expression that matches your pattern: three or more zeros
concatenate the list items to a string
using re.finditer and match object start() and end() methods construct a list of indices
Concatenating the lists to a string could be the most expensive part - you won't know till you try; finditer should be pretty quick. Requires more than one pass through the data but probably low effort to code.
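A minimal sketch of that recipe (zero_runs_regex is an illustrative name):

import re

def zero_runs_regex(L, min_size=3):
    s = "".join(map(str, L))  # e.g. "1100000111001000"
    # one match per run of at least min_size zeros; start()/end() give the indices
    return [[m.start(), m.end() - 1] for m in re.finditer("0{%d,}" % min_size, s)]

L = [1,1,0,0,0,0,0,1,1,1,0,0,1,0,0,0]
print(zero_runs_regex(L))  # [[2, 6], [13, 15]]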
This will probably be better - a single pass through the list but you need to pay attention to the logic - more effort to code.
iterate over the list using enumerate
when you find a zero
capture its index and
set a flag indicating you are tracking zeros
when you find a one
if you are tracking zeros
capture the index
if the length of consecutive zeros meets your criteria capture the start and end indices for that run of zeros
reset flags and intermediate variables as necessary
A bit different from the word version:
def g(a):
    y = []
    criteria = 3
    start, end = 0, 0
    prev = 1
    for i, n in enumerate(a):
        if not n:  # n is zero
            end = i
            if prev:  # previous item was one
                start = i
        else:
            if not prev and end - start + 1 >= criteria:
                y.append((start, end))
        prev = n
    # don't lose a qualifying run of zeros that ends the list
    if not prev and end - start + 1 >= criteria:
        y.append((start, end))
    return y
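Checking it against the example list from the question (my own demo):

L = [1,1,0,0,0,0,0,1,1,1,0,0,1,0,0,0]
print(g(L))  # [(2, 6), (13, 15)]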
You can use zip() to detect indexes of the 1,0 and 0,1 breaks in sequence. Then use zip() on the break indexes to form ranges and extract the ones that start with a zero and span at least 3 positions.
def getZeroStreaks(L, minSize=3):
    breaks = [i for i, (a, b) in enumerate(zip(L, L[1:]), 1) if a != b]
    return [[s, e-1] for s, e in zip([0]+breaks, breaks+[len(L)])
            if e-s >= minSize and not L[s]]
output:
L = [1,1,0,0,0,0,0,1,1,1,0,0,1,0,0,0]
print(getZeroStreaks(L))
[[2, 6], [13, 15]]
from timeit import timeit
t = timeit(lambda:getZeroStreaks(L*1000),number=100)/100
print(t) # 0.0018 sec for 16,000 elements
The function can be generalized to find streaks of any value in a list:
def getStreaks(L, N=0, minSize=3):
    breaks = [i for i, (a, b) in enumerate(zip(L, L[1:]), 1) if (a == N) != (b == N)]
    return [[s, e-1] for s, e in zip([0]+breaks, breaks+[len(L)])
            if e-s >= minSize and L[s] == N]
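For instance (a made-up example to illustrate the generalization):

print(getStreaks([2, 2, 5, 5, 5, 2], N=5))  # [[2, 4]]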
I have some strings,
['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
These strings partially overlap each other. If you manually overlapped them you would get:
SGALWDVPSPV
I want a way to go from the list of overlapping strings to the final compressed string in python. I feel like this must be a problem that someone has solved already and am trying to avoid reinventing the wheel. The methods I can imagine now are either brute force or involve getting more complicated by using biopython and sequence aligners than I would like. I have some simple short strings and just want to properly merge them in a simple way.
Does anyone have any advice on a nice way to do this in python? Thanks!
Here is a quick sorting solution:
s = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
new_s = sorted(s, key=lambda x:s[0].index(x[0]))
a = new_s[0]
b = new_s[-1]
final_s = a[:a.index(b[0])]+b
Output:
'SGALWDVPSPV'
This program sorts s by the value of the index of the first character of each element, in an attempt to find the string that will maximize the overlap distance between the first element and the desired output.
My proposed solution with a more challenging test list:
#strFrag = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
strFrag = ['ALWDVPS', 'SGALWDV', 'LWDVPSP', 'WDVPSPV', 'GALWDVP', 'LWDVPSP', 'ALWDVPS']
for repeat in range(0, len(strFrag)-1):
    bestMatch = [2, '', '']  # overlap score (minimum value 3), otherStr index, assembled str portion
    for otherStr in strFrag[1:]:
        for x in range(0, len(otherStr)):
            if otherStr[x:] == strFrag[0][:len(otherStr[x:])]:
                if len(otherStr)-x > bestMatch[0]:
                    bestMatch = [len(otherStr)-x, strFrag.index(otherStr), otherStr[:x]+strFrag[0]]
            if otherStr[:-x] == strFrag[0][-len(otherStr[x:]):]:
                if x > bestMatch[0]:
                    bestMatch = [x, strFrag.index(otherStr), strFrag[0]+otherStr[-x:]]
    if bestMatch[0] > 2:
        strFrag[0] = bestMatch[2]
        strFrag = strFrag[:bestMatch[1]]+strFrag[bestMatch[1]+1:]

print(strFrag)
print(strFrag[0])
Basically, the code compares every string/fragment to the first in the list and finds the best match (most overlap). It consolidates the list progressively, merging the best matches and removing the individual strings. The code assumes there are no unfillable gaps between fragments (otherwise the answer may not be the longest possible assembly; this can be addressed by randomizing the starting fragment). It also assumes the reverse complement is not present (a poor assumption for contig assembly), which would produce nonsense/unmatchable fragments. I've included a way to restrict the minimum match requirement (by changing the initial bestMatch[0] value) to prevent false matches. The last assumption is that all matches are exact; enabling flexibility to permit mismatches when assembling the sequence makes the problem considerably more complex. I can provide a solution for assembling with mismatches upon request.
To determine the overlap of two strings a and b, you can check if any prefix of b is a suffix of a. You can then use that check in a simple loop, aggregating the result and slicing the next string in the list according to the overlap.
lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']

def overlap(a, b):
    return max(i for i in range(len(b)+1) if a.endswith(b[:i]))

res = lst[0]
for s in lst[1:]:
    o = overlap(res, s)
    res += s[o:]
print(res)  # SGALWDVPSPV
Or using reduce:
from functools import reduce # Python 3
print(reduce(lambda a, b: a + b[overlap(a,b):], lst))
This is probably not super-efficient, with complexity of about O(n k), with n being the number of strings in the list and k the average length per string. You can make it a bit more efficient by only testing whether the last char of the presumed overlap of b is the last character of a, thus reducing the amount of string slicing and function calls in the generator expression:
def overlap(a, b):
    # only slice and call endswith() when the last characters already match;
    # default=0 covers the case of no overlap at all
    return max((i for i in range(1, len(b)+1)
                if b[i-1] == a[-1] and a.endswith(b[:i])), default=0)
Here's my solution which borders on brute force from the OP's perspective. It's not bothered by order (threw in a random shuffle to confirm that) and there can be non-matching elements in the list, as well as other independent matches. Assumes overlap means not a proper subset but independent strings with elements in common at the start and end:
from collections import defaultdict
from random import choice, shuffle

def overlap(a, b):
    """ get the maximum overlap of a & b plus where the overlap starts """
    overlaps = []
    for i in range(len(b)):
        for j in range(len(a)):
            if a.endswith(b[:i + 1], j):
                overlaps.append((i, j))
    return max(overlaps) if overlaps else (0, -1)

lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV', 'NONSEQUITUR']
shuffle(lst)  # to verify order doesn't matter

overlaps = defaultdict(list)

while len(lst) > 1:
    overlaps.clear()
    for a in lst:
        for b in lst:
            if a == b:
                continue
            amount, start = overlap(a, b)
            overlaps[amount].append((start, a, b))
    maximum = max(overlaps)
    if maximum == 0:
        break
    start, a, b = choice(overlaps[maximum])  # pick one among equals
    lst.remove(a)
    lst.remove(b)
    lst.append(a[:start] + b)

print(*lst)
OUTPUT
% python3 test.py
NONSEQUITUR SGALWDVPSPV
%
Computes all the overlaps and combines the largest overlap into a single element, replacing the original two, and starts process over again until we're down to a single element or no overlaps.
The overlap() function is horribly inefficient and likely can be improved but that doesn't matter if this isn't the type of matching the OP desires.
Once the peptides grow to 20 amino acids, cdlane's code chokes and spams multiple incorrect answers of various amino-acid lengths.
Try adding the AA sequence 'VPSGALWDVPS', with or without the 'D', and the code starts to fail at its task because the N- and C-termini grow and no longer reflect what Adam Price is asking for. The output is 'SGALWDVPSGALWDVPSPV' and thus 100% incorrect despite the effort.
To be honest, in my opinion there is only one 100% answer, and that is to use BLAST - either its protein search page or BLAST in the BioPython package - or to adapt cdlane's code to account for AA gaps, substitutions and additions.
Dredging up an old thread, but had to solve this myself today.
For this specific case, where the fragments are already in order and each is offset from the next by the same amount (in this case 1), the following fairly simple concatenation works, though it might not be the world's most robust solution:
lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
reference = "SGALWDVPSPV"
string = "".join([i[0] for i in lst] + [lst[-1][1:]])
reference == string
True
I am aware that set() in python doesn't have an order as it is implemented as a hash table. However, I was a little surprised to solve a question which involved order using set.intersection().
So I am given two lists whose order matters - say, denoting some ranking or sequence of occurrence. I have to find the element that is common to both lists and has the highest combined order (occurs earliest) in the two lists. For example,
List1 = ['j','s','l','p','t']
List2 = ['b','d','q','s','y','j']
should output 's', as it is the second best in List1 and occurs first in List2.
If you convert each of the lists into sets and take an intersection Set1.intersection(Set2), you get a set set(['s', 'j']). In my case I could convert this into a list and spit out the first element and this was approximately O(n1 + n2).
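To see why this is luck rather than a guarantee (my own sketch): the iteration order of the resulting set is an implementation detail, so the "first element" of the intersection is not defined by either list's order:

List1 = ['j','s','l','p','t']
List2 = ['b','d','q','s','y','j']
common = set(List1).intersection(List2)  # {'s', 'j'} in some arbitrary order
print(list(common)[0])  # happened to be 's' here, but nothing guarantees it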
I was happy to solve this interview question (all tests passed), but I am amazed that I could pull off such an order-based problem using Python sets.
Does anyone have a clue how this is working? What's a possible case, that this could breakdown?
EDIT: This seems to be a stroke-of-luck case, so if you have a good solution for this problem, it will also be appreciated.
I found a O(n1+n2) approach. Commented code follows. The trick is to create a lookup table (not a dictionary, a simple array) to index the minimum position of the letters in both lists, and then find the minimum of the sum of those positions and the associated letter.
List1 = ['j','s','l','p','t']
List2 = ['b','d','q','s','y','j']

# unreachable max value to initialize the slots
maxlen = max(len(List1), len(List2)) + 1000

# create the slot_array, initialized to values higher than the last letter
# position in both lists (i.e. greater than any "valid" position);
# first pos (index 0) is for letter "a", last pos is for letter "z"
slot_array = [[maxlen, maxlen] for x in range(ord('a'), ord('z')+1)]

# scan both lists, and update the position if lower than the one in slot_array
for list_index, the_list in enumerate((List1, List2)):
    for letter_index, letter in enumerate(the_list):
        slot = slot_array[ord(letter)-ord('a')]
        if slot[list_index] > letter_index:
            slot[list_index] = letter_index

# now compute the minimum of the sum of both minimum positions for each letter
min_value = maxlen*2
for i, (a, b) in enumerate(slot_array):
    sab = a + b
    if sab < min_value:
        min_value = sab
        min_index = i

# the result is the letter with the minimal sum
print(chr(min_index + ord('a')))
A list comprehension could do the job:
Set2 = set(List2)
[x for x in List1 if x in Set2]
This maintains the order of List1; you could do the same for List2 too.
You can then call next on the list comprehension (or a generator to be more efficient) to get the first match.
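A minimal sketch of that suggestion (a generator expression stops scanning at the first common element; names are from the snippet above):

Set2 = set(List2)
first_common = next((x for x in List1 if x in Set2), None)
print(first_common)  # 'j' - the first element of List1 that also appears in List2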
How would you go about comparing two adjacent elements in a list in Python? How would you save or store the value of an item while going through a for loop? I'm trying not to use the zip method, just an ordinary for loop.
comparing_two_elements = ['Hi','Hello','Goodbye','Does it really work finding longest length of string','Jet','Yes it really does work']
longer_string = ''
for i in range(len(comparing_two_elements)-1):
    if len(prior_string) < len(comparing_two_elements[i + 1]):
        longer_string = comparing_two_elements[i+1]
print(longer_string)
The below works simply by 'saving' the first element of your list as the longest element, as it will be the first time you loop over your list, and then on subsequent iterations it will compare the length of that item to the length of the next item in the list.
longest_element = None
for element in comparing_two_elements:
    if not longest_element:
        longest_element = element
        continue
    if len(longest_element) < len(element):
        longest_element = element
If you want to go the "interesting" route, you could do it with combination of other functions, for eg
length_map = list(map(len, comparing_two_elements))  # list() so .index() works in Python 3
longest_index = length_map.index(max(length_map))
longest_element = comparing_two_elements[longest_index]
Use the third, optional step argument to range - and don't subtract 1 from len(...)! Also, your logic is incomplete: what if the first of a pair of strings is longer? You don't do anything in that case.
It's not clear what you're trying to do. This for loop runs through i = 0, 2, 4, ... up to but excluding len(comparing_two_elements) (assumed to be even!), and prints the longer of each adjacent pair:
for i in range(0, len(comparing_two_elements), 2):
    if len(comparing_two_elements[i]) < len(comparing_two_elements[i + 1]):
        idx = i
    else:
        idx = i + 1
    print(comparing_two_elements[idx])
This may not do exactly what you want, but as several people have observed, it's unclear just what that is. At least it's something you can reason about and adapt.
If you just want the longest string in a sequence seq, the whole adjacent pairs rigamarole is pointless; simply use:
longest_string = max(seq, key=len)
I am working to produce a Python script that can find the (longest possible) length of all n-word-length substrings shared by two strings, disregarding trailing punctuation. Given two strings:
"this is a sample string"
"this is also a sample string"
I want the script to identify that these strings have a sequence of 2 words in common ("this is") followed by a sequence of 3 words in common ("a sample string"). Here is my current approach:
a = "this is a sample string"
b = "this is also a sample string"
aWords = a.split()
bWords = b.split()
#create counters to keep track of position in string
currentA = 0
currentB = 0
#create counter to keep track of longest sequence of matching words
matchStreak = 0
#create a list that contains all of the matchstreaks found
matchStreakList = []
#create binary switch to control the use of while loop
continueWhileLoop = 1
for word in aWords:
currentA += 1
if word == bWords[currentB]:
matchStreak += 1
#to avoid index errors, check to make sure we can move forward one unit in the b string before doing so
if currentB + 1 < len(bWords):
currentB += 1
#in case we have two identical strings, check to see if we're at the end of string a. If we are, append value of match streak to list of match streaks
if currentA == len(aWords):
matchStreakList.append(matchStreak)
elif word != bWords[currentB]:
#because the streak is broken, check to see if the streak is >= 1. If it is, append the streak counter to out list of streaks and then reset the counter
if matchStreak >= 1:
matchStreakList.append(matchStreak)
matchStreak = 0
while word != bWords[currentB]:
#the two words don't match. If you can move b forward one word, do so, then check for another match
if currentB + 1 < len(bWords):
currentB += 1
#if you have advanced b all the way to the end of string b, then rewind to the beginning of string b and advance a, looking for more matches
elif currentB + 1 == len(bWords):
currentB = 0
break
if word == bWords[currentB]:
matchStreak += 1
#now that you have a match, check to see if you can advance b. If you can, do so. Else, rewind b to the beginning
if currentB + 1 < len(bWords):
currentB += 1
elif currentB + 1 == len(bWords):
#we're at the end of string b. If we are also at the end of string a, check to see if the value of matchStreak >= 1. If so, add matchStreak to matchStreakList
if currentA == len(aWords):
matchStreakList.append(matchStreak)
currentB = 0
break
print matchStreakList
This script correctly outputs the (maximum) lengths of the common word-length substrings (2, 3), and has done so for all tests so far. My question is: Is there a pair of two strings for which the approach above will not work? More to the point: Are there extant Python libraries or well-known approaches that can be used to find the maximum length of all n-word-length substrings that two strings share?
[This question is distinct from the longest common substring problem, which is only a special case of what I'm looking for (as I want to find all common substrings, not just the longest common substring). This SO post suggests that methods such as 1) cluster analysis, 2) edit distance routines, and 3) longest common sequence algorithms might be suitable approaches, but I didn't find any working solutions, and my problem is perhaps slightly easier than the one mentioned in the link because I'm dealing with words bounded by whitespace.]
EDIT:
I'm starting a bounty on this question. In case it helps others, I wanted to clarify a few quick points. First, the helpful answer suggested below by @DhruvPathak does not find all maximally long n-word-length substrings shared by two strings. For example, suppose the two strings we are analyzing are:
"They all are white a sheet of spotless paper when they first are born
but they are to be scrawled upon and blotted by every goose quill"
and
"You are all white, a sheet of lovely, spotless paper, when you first
are born; but you are to be scrawled and blotted by every goose's
quill"
In this case, the list of maximally long n-word-length substrings (disregarding trailing punctuation) is:
all
are
white a sheet of
spotless paper when
first are born but
are to be scrawled
and blotted by every
Using the following routine:
#import required packages
import difflib

#define function we'll use to identify matches
def matches(first_string, second_string):
    s = difflib.SequenceMatcher(None, first_string, second_string)
    match = [first_string[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
    return match

a = "They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill"
b = "You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill"

a = a.replace(",", "").replace(":", "").replace("!", "").replace("'", "").replace(";", "").lower()
b = b.replace(",", "").replace(":", "").replace("!", "").replace("'", "").replace(";", "").lower()

print(matches(a, b))
One gets output:
['e', ' all', ' white a sheet of', ' spotless paper when ', 'y', ' first are born but ', 'y', ' are to be scrawled', ' and blotted by every goose', ' quill']
In the first place, I am not sure how one could select from this list the substrings that contain only whole words. In the second place, this list does not include "are", one of the desired maximally-long common n-word-length substrings. Is there a method that will find all of the maximally long n-word-long substrings shared by these two strings ("You are all..." and "They all are...")?
There are still ambiguities here, and I don't want to spend time arguing about them. But I think I can add something helpful anyway ;-)
I wrote Python's difflib.SequenceMatcher, and spent a lot of time finding expected-case fast ways to find longest common substrings. In theory, that should be done with "suffix trees", or the related "suffix arrays" augmented with "longest common prefix arrays" (the phrases in quotes are search terms if you want to Google for more). Those can solve the problem in worst-case linear time. But, as is sometimes the case, the worst-case linear-time algorithms are excruciatingly complex and delicate, and suffer large constant factors - they can still pay off hugely if a given corpus is going to be searched many times, but that's not the typical case for Python's difflib and it doesn't look like your case either.
Anyway, my contribution here is to rewrite SequenceMatcher's find_longest_match() method to return all the (locally) maximum matches it finds along the way. Notes:
I'm going to use the to_words() function Raymond Hettinger gave you, but without the conversion to lower case. Converting to lower case leads to output that isn't exactly like what you said you wanted.
Nevertheless, as I noted in a comment already, this does output "quill", which isn't in your list of desired outputs. I have no idea why it isn't, since "quill" does appear in both inputs.
Here's the code:
import re

def to_words(text):
    'Break text into a list of words without punctuation'
    return re.findall(r"[a-zA-Z']+", text)

def match(a, b):
    # Make b the longer list.
    if len(a) > len(b):
        a, b = b, a
    # Map each word of b to a list of indices it occupies.
    b2j = {}
    for j, word in enumerate(b):
        b2j.setdefault(word, []).append(j)
    j2len = {}
    nothing = []
    unique = set()  # set of all results

    def local_max_at_j(j):
        # maximum match ends with b[j], with length j2len[j]
        length = j2len[j]
        unique.add(" ".join(b[j-length+1: j+1]))

    # during an iteration of the loop, j2len[j] = length of longest
    # match ending with b[j] and the previous word in a
    for word in a:
        # look at all instances of word in b
        j2lenget = j2len.get
        newj2len = {}
        for j in b2j.get(word, nothing):
            newj2len[j] = j2lenget(j-1, 0) + 1
        # which indices have not been extended?  those are
        # (local) maximums
        for j in j2len:
            if j+1 not in newj2len:
                local_max_at_j(j)
        j2len = newj2len
    # and we may also have local maximums ending at the last word
    for j in j2len:
        local_max_at_j(j)
    return unique
Then:
a = "They all are white a sheet of spotless paper " \
"when they first are born but they are to be " \
"scrawled upon and blotted by every goose quill"
b = "You are all white, a sheet of lovely, spotless " \
"paper, when you first are born; but you are to " \
"be scrawled and blotted by every goose's quill"
print match(to_words(a), to_words(b))
displays:
{'all',
 'and blotted by every',
 'first are born but',
 'are to be scrawled',
 'are',
 'spotless paper when',
 'white a sheet of',
 'quill'}
EDIT - how it works
A great many sequence matching and alignment algorithms are best understood as working on a 2-dimensional matrix, with rules for computing the matrix entries and later interpreting the entries' meaning.
For input sequences a and b, picture a matrix M with len(a) rows and len(b) columns. In this application, we want M[i, j] to contain the length of the longest common contiguous subsequence ending with a[i] and b[j], and the computational rules are very easy:
M[i, j] = 0 if a[i] != b[j].
M[i, j] = M[i-1, j-1] + 1 if a[i] == b[j] (where we assume an out-of-bounds matrix reference silently returns 0).
Interpretation is also very easy in this case: there is a locally maximum non-empty match ending at a[i] and b[j], of length M[i, j], if and only if M[i, j] is non-zero but M[i+1, j+1] is either 0 or out-of-bounds.
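As a tiny worked illustration of these rules (my own example, not from the original answer), take a = ["b", "a"] and b = ["a", "b", "a"]. The matrix is:

          "a"  "b"  "a"
    "b"    0    1    0
    "a"    1    0    2

M[0, 1] = 1 is extended by M[1, 2] = 2, so it is not a local maximum; M[1, 0] = 1 and M[1, 2] = 2 are never extended, so the locally maximum matches are "a" and "b a".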
You can use those rules to write very simple & compact code, with two loops, that computes M correctly for this problem. The downside is that the code will take (best, average and worst cases) O(len(a) * len(b)) time and space.
While it may be baffling at first, the code I posted is doing exactly the above. The connection is obscured because the code is heavily optimized, in several ways, for expected cases:
Instead of doing one pass to compute M, then another pass to interpret the results, computation and interpretation are interleaved in a single pass over a.
Because of that, the whole matrix doesn't need to be stored. Instead only the current row (newj2len) and the previous row (j2len) are simultaneously present.
And because the matrix in this problem is usually mostly zeroes, a row here is represented sparsely, via a dict mapping column indices to non-zero values. Zero entries are "free", in that they're never stored explicitly.
When processing a row, there's no need to iterate over each column: the precomputed b2j dict tells us exactly the interesting column indices in the current row (those columns that match the current word from a).
Finally, and partly by accident, all the preceding optimizations conspire in such a way that there's never a need to know the current row's index, so we don't have to bother computing that either.
EDIT - the dirt simple version
Here's code that implements the 2D matrix directly, with no attempts at optimization (other than that a Counter can often avoid explicitly storing 0 entries). It's extremely simple, short and easy:
def match(a, b):
    from collections import Counter
    M = Counter()
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] == b[j]:
                M[i, j] = M[i-1, j-1] + 1
    unique = set()
    for i in range(len(a)):
        for j in range(len(b)):
            if M[i, j] and not M[i+1, j+1]:
                length = M[i, j]
                unique.add(" ".join(a[i+1-length: i+1]))
    return unique
Of course ;-) that returns the same results as the optimized match() I posted at first.
EDIT - and another without a dict
Just for fun :-) If you have the matrix model down pat, this code will be easy to follow. A remarkable thing about this particular problem is that a matrix cell's value only depends on the values along the diagonal to the cell's northwest. So it's "good enough" just to traverse all the main diagonals, proceeding southeast from all the cells on the west and north borders. This way only small constant space is needed, regardless of the inputs' lengths:
def match(a, b):
    from itertools import chain
    m, n = len(a), len(b)
    unique = set()
    # start a diagonal from every cell on the west and north borders
    for i, j in chain(((i, 0) for i in range(m)),
                      ((0, j) for j in range(1, n))):
        k = 0
        while i < m and j < n:
            if a[i] == b[j]:
                k += 1
            elif k:
                unique.add(" ".join(a[i-k: i]))
                k = 0
            i += 1
            j += 1
        if k:
            unique.add(" ".join(a[i-k: i]))
    return unique
There are really four questions embedded in your post.
1) How do you split text into words?
There are many ways to do this depending on what you count as a word, whether you care about case, whether contractions are allowed, etc. A regular expression lets you implement your choice of word-splitting rules. The one I typically use is r"[a-z'\-]+". It catches contractions like don't and allows hyphenated words like mother-in-law.
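A quick illustration of that pattern (my own demo):

import re
print(re.findall(r"[a-z'\-]+", "don't tell my mother-in-law"))
# ["don't", 'tell', 'my', 'mother-in-law']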
2) What data structure can speed-up the search for common subsequences?
Create a location map showing the positions where each word occurs. For example, for the sentence you should do what you like, the location map entry for the word you is {"you": [0, 4]} because it appears twice, once at position zero and once at position four.
With a location map in hand, it is a simple matter to loop over the starting points to compare n-length subsequences.
3) How do I find common n-length subsequences?
Loop over all the words in one of the sentences. For each such word, find the places where it occurs in the other sequence (using the location map) and test whether the two n-length slices are equal.
4) How do I find the longest common subsequence?
The max() function finds a maximum value. It takes a key-function such as len() to determine the basis for comparison.
Here is some working code that you can customize to your own interpretation of the problem:
import re

def to_words(text):
    'Break text into a list of lowercase words without punctuation'
    return re.findall(r"[a-z']+", text.lower())

def starting_points(wordlist):
    'Map each word to a list of indices where the word appears'
    d = {}
    for i, word in enumerate(wordlist):
        d.setdefault(word, []).append(i)
    return d

def sequences_in_common(wordlist1, wordlist2, n=1):
    'Generate all n-length word groups shared by two word lists'
    starts = starting_points(wordlist2)
    for i, word in enumerate(wordlist1):
        seq1 = wordlist1[i: i+n]
        for j in starts.get(word, []):
            seq2 = wordlist2[j: j+n]
            if seq1 == seq2 and len(seq1) == n:
                yield ' '.join(seq1)
if __name__ == '__main__':
    t1 = "They all are white a sheet of spotless paper when they first are " \
         "born but they are to be scrawled upon and blotted by every goose quill"
    t2 = "You are all white, a sheet of lovely, spotless paper, when you first " \
         "are born; but you are to be scrawled and blotted by every goose's quill"
    w1 = to_words(t1)
    w2 = to_words(t2)
    for n in range(1, 10):
        matches = list(sequences_in_common(w1, w2, n))
        if matches:
            print(n, '-->', max(matches, key=len))
The difflib module would be a good candidate for this case; see get_matching_blocks:
import difflib

def matches(first_string, second_string):
    s = difflib.SequenceMatcher(None, first_string, second_string)
    match = [first_string[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
    return match

first_string = "this is a sample string"
second_string = "this is also a sample string"

print(matches(second_string, first_string))
demo: http://ideone.com/Ca3h8Z
A slight modification, matching words rather than chars, should do:
def match_words(first_string, second_string):
    l1 = first_string.split()
    l2 = second_string.split()
    s = difflib.SequenceMatcher(None, l1, l2)
    match = [l1[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
    return match
Demo:
>>> print('\n'.join(map(' '.join, match_words(a, b))))
all
white a sheet of
spotless paper when
first are born but
are to be scrawled
and blotted by every
quill