I built a function which finds the longest common sub-string of two text files in ascending order based on Rabin–Karp algorithm.
the main function is "find_longest" and the inner functions are: "make_hashtable","extend_fingerprints" and "has_match".
I'm having trouble analyzing the average case complexity of has_match.
Denote n1,n2 as text1,text2 and l as the size of the currunt "window".
fingers are the hash table of the substring.
def has_match(text1,text2,fingers1,fingers2,l,r):
h = make_hashtable(fingers2,r)
for i in range(len(fingers1)):
for j in h[fingers1[i]]:
if text1[i:i+l] == text2[j:j+l]:
return text1[i:i+l]
return None
this is "make_hashtable", here I'm pretty sure that the complexcity is O(n2-l+1):
def make_hashtable(fingers, table_size):
hash_table=[[] for i in range(table_size)]
count=0
for f in fingers:
hash_table[f].append(count)
count+=1
return hash_table
this is "find_longest", im adding this function despite the fact that i dont need it for the complexity analyzing.
def find_longest(text1,text2,basis=2**8,r=2**17-1):
match = ''
l = 0 #initial "window" size
#fingerprints of "windows" of size 0 - all are 0
fingers1 = [0]*(len(text1)+1)
fingers2 = [0]*(len(text2)+1)
while match != None: #there was a common substring of len l
l += 1
extend_fingerprints(text1, fingers1, l, basis, r)
extend_fingerprints(text2, fingers2, l, basis, r)
match = has_match(text1,text2,fingers1,fingers2,l,r)
print(match)
return l-1
and this is "extend_fingerprints":
def extend_fingerprints(text, fingers, length, basis=2**8, r=2**17-1):
count=0
for f in fingers:
if count==len(fingers)-1:
fingers.pop(len(fingers)-1)
break
fingers[count]=(f*basis+ord(text[length-1+count]))%r
count+=1
I'm having doubts between this two options:
1.O(n_2-l+1)+O(n_1-l+1)*O(l)
Refer to r as a constant number while n1,n2 are very large therefore a lot of collisions would be made at the hash table (let's say O(1) items at every 'cell', yet, always some "false-positives")
2.O(n_2-l+1)+O(n_1-l+1)+O(l)
Refer to r as optimal for a decent hash function, therefore almost no collisions which means that if two texts are the same cell at the hash table we may assume they are actually the same text?
Personally I lean towards the Bold statement.
tnx.
I think the answer is
O((n_2-l) + l*(n_1-l))
.
(n_2-l) represents the complexity of make_hashtable for the second text.
l*(n_1-l) represents the two nested loops who go through every item in the finger prints of the first text and perform 1 comparison operation (for l length slice), for some constant 'm' if there are some items of the same index in the hash table.
Related
I'm trying to build a heuristic for the simplest feasible Golomb Ruler as possible. From 0 to n, find n numbers such that all the differences between them are different. This heuristic consists of incrementing the ruler by 1 every time. If a difference already exists on a list, jump to the next integer. So the ruler starts with [0,1] and the list of differences = [ 1 ]. Then we try to add 2 to the ruler [0,1,2], but it's not feasible, since the difference (2-1 = 1) already exists in the list of differences. Then we try to add 3 to the ruler [0,1,3] and it is feasible, and thus the list of differences becomes [1,2,3] and so on. Here's what I've come to so far:
n = 5
positions = list(range(1,n+1))
Pos = []
Dist = []
difs = []
i = 0
while (i < len(positions)):
if len(Pos)==0:
Pos.append(0)
Dist.append(0)
elif len(Pos)==1:
Pos.append(1)
Dist.append(1)
else:
postest = Pos + [i] #check feasibility to enter the ruler
difs = [a-b for a in postest for b in postest if a > b]
if any (d in difs for d in Dist)==True:
pass
else:
for d in difs:
Dist.append(d)
Pos.append(i)
i += 1
However I can't make the differences check to work. Any suggestions?
For efficiency I would tend to use a set to store the differences, because they are good for inclusion testing, and you don't care about the ordering (possibly until you actually print them out, at which point you can use sorted).
You can use a temporary set to store the differences between the number that you are testing and the numbers you currently have, and then either add it to the existing set, or else discard it if you find any matches. (Note else block on for loop, that will execute if break was not encountered.)
n = 5
i = 0
vals = []
diffs = set()
while len(vals) < n:
diffs1 = set()
for j in reversed(vals):
diff = i - j
if diff in diffs:
break
diffs1.add(diff)
else:
vals.append(i)
diffs.update(diffs1)
i += 1
print(vals, sorted(diffs))
The explicit loop over values (rather than the use of any) is to avoid unnecessarily calculating the differences between the candidate number and all the existing values, when most candidate numbers are not successful and the loop can be aborted early after finding the first match.
It would work for vals also to be a set and use add instead of append (although similarly, you would probably want to use sorted when printing it). In this case a list is used, and although it does not matter in principle in which order you iterate over it, this code is iterating in reverse order to test the smaller differences first, because the likelihood is that unusable candidates are rejected more quickly this way. Testing it with n=200, the code ran in about 0.2 seconds with reversed and about 2.1 without reversed; the effect is progressively more noticeable as n increases. With n=400, it took 1.7 versus 27 seconds with and without the reversed.
I have some strings,
['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
These strings partially overlap each other. If you manually overlapped them you would get:
SGALWDVPSPV
I want a way to go from the list of overlapping strings to the final compressed string in python. I feel like this must be a problem that someone has solved already and am trying to avoid reinventing the wheel. The methods I can imagine now are either brute force or involve getting more complicated by using biopython and sequence aligners than I would like. I have some simple short strings and just want to properly merge them in a simple way.
Does anyone have any advice on a nice way to do this in python? Thanks!
Here is a quick sorting solution:
s = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
new_s = sorted(s, key=lambda x:s[0].index(x[0]))
a = new_s[0]
b = new_s[-1]
final_s = a[:a.index(b[0])]+b
Output:
'SGALWDVPSPV'
This program sorts s by the value of the index of the first character of each element, in an attempt to find the string that will maximize the overlap distance between the first element and the desired output.
My proposed solution with a more challenging test list:
#strFrag = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
strFrag = ['ALWDVPS', 'SGALWDV', 'LWDVPSP', 'WDVPSPV', 'GALWDVP', 'LWDVPSP', 'ALWDVPS']
for repeat in range(0, len(strFrag)-1):
bestMatch = [2, '', ''] #overlap score (minimum value 3), otherStr index, assembled str portion
for otherStr in strFrag[1:]:
for x in range(0,len(otherStr)):
if otherStr[x:] == strFrag[0][:len(otherStr[x:])]:
if len(otherStr)-x > bestMatch[0]:
bestMatch = [len(otherStr)-x, strFrag.index(otherStr), otherStr[:x]+strFrag[0]]
if otherStr[:-x] == strFrag[0][-len(otherStr[x:]):]:
if x > bestMatch[0]:
bestMatch = [x, strFrag.index(otherStr), strFrag[0]+otherStr[-x:]]
if bestMatch[0] > 2:
strFrag[0] = bestMatch[2]
strFrag = strFrag[:bestMatch[1]]+strFrag[bestMatch[1]+1:]
print(strFrag)
print(strFrag[0])
Basically the code compares every string/fragment to the first in list and finds the best match (most overlap). It consolidates the list progressively, merging the best matches and removing the individual strings. Code assumes that there are no unfillable gaps between strings/fragments (Otherwise answer may not result in longest possible assembly. Can be solved by randomizing the starting string/fragment). Also assumes that the reverse complement is not present (poor assumption with contig assembly), which would result in nonsense/unmatchable strings/fragments. I've included a way to restrict the minimum match requirements (changing bestMatch[0] value) to prevent false matches. Last assumption is that all matches are exact. To enable flexibility in permitting mismatches when assembling the sequence makes the problem considerably more complex. I can provide a solution for assembling with mismatches upon request.
To determine the overlap of two strings a and b, you can check if any prefix of b is a suffix of a. You can then use that check in a simple loop, aggregating the result and slicing the next string in the list according to the overlap.
lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
def overlap(a, b):
return max(i for i in range(len(b)+1) if a.endswith(b[:i]))
res = lst[0]
for s in lst[1:]:
o = overlap(res, s)
res += s[o:]
print(res) # SGALWDVPSPV
Or using reduce:
from functools import reduce # Python 3
print(reduce(lambda a, b: a + b[overlap(a,b):], lst))
This is probably not super-efficient, with complexity of about O(n k), with n being the number of strings in the list and k the average length per string. You can make it a bit more efficient by only testing whether the last char of the presumed overlap of b is the last character of a, thus reducing the amount of string slicing and function calls in the generator expression:
def overlap(a, b):
return max(i for i in range(len(b)) if b[i-1] == a[-1] and a.endswith(b[:i]))
Here's my solution which borders on brute force from the OP's perspective. It's not bothered by order (threw in a random shuffle to confirm that) and there can be non-matching elements in the list, as well as other independent matches. Assumes overlap means not a proper subset but independent strings with elements in common at the start and end:
from collections import defaultdict
from random import choice, shuffle
def overlap(a, b):
""" get the maximum overlap of a & b plus where the overlap starts """
overlaps = []
for i in range(len(b)):
for j in range(len(a)):
if a.endswith(b[:i + 1], j):
overlaps.append((i, j))
return max(overlaps) if overlaps else (0, -1)
lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV', 'NONSEQUITUR']
shuffle(lst) # to verify order doesn't matter
overlaps = defaultdict(list)
while len(lst) > 1:
overlaps.clear()
for a in lst:
for b in lst:
if a == b:
continue
amount, start = overlap(a, b)
overlaps[amount].append((start, a, b))
maximum = max(overlaps)
if maximum == 0:
break
start, a, b = choice(overlaps[maximum]) # pick one among equals
lst.remove(a)
lst.remove(b)
lst.append(a[:start] + b)
print(*lst)
OUTPUT
% python3 test.py
NONSEQUITUR SGALWDVPSPV
%
Computes all the overlaps and combines the largest overlap into a single element, replacing the original two, and starts process over again until we're down to a single element or no overlaps.
The overlap() function is horribly inefficient and likely can be improved but that doesn't matter if this isn't the type of matching the OP desires.
Once the peptides start to grow to 20 aminoacids cdlane's code chokes and spams (multiple) incorrect answer(s) with various amino acid lengths.
Try to add and use AA sequence 'VPSGALWDVPS' with or without 'D' and the code starts to fail its task because the N-and C-terminus grow and do not reflect what Adam Price is asking for. The output is: 'SGALWDVPSGALWDVPSPV' and thus 100% incorrect despite the effort.
Tbh imo there is only one 100% answer and that is to use BLAST and its protein search page or BLAST in the BioPython package. Or adapt cdlane's code to reflect AA gaps, substitutions and AA additions.
Dredging up an old thread, but had to solve this myself today.
For this specific case, where the fragments are already in order, and each overlap by the same amount (in this case 1), the following fairly simply concatenation works, though might not be the worlds most robust solution:
lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
reference = "SGALWDVPSPV"
string = "".join([i[0] for i in lst] + [lst[-1][1:]])
reference == string
True
Here I had to remove the most frequent alphabet of a string(if frequency of two alphabets is same, then in alphabetical order) and put it into new string.
Input:
abbcccdddd
Output:
dcdbcdabcd
The code I wrote is:
s = list(sorted(<the input string>))
a = []
for c in range(len(s)):
freq =[0 for _ in range(26)]
for x in s:
freq[ord(x)-ord('a')] += 1
m = max(freq)
allindices = [p for p,q in enumerate(freq) if q == m]
r = chr(97+allindices[0])
a.append(r)
s.remove(r)
print''.join(a)
But it passed the allowed runtime limit maybe due to too many loops.(There's another for loop which seperates the strings from user input)
I was hoping if someone could suggest a more optimised version of it using less memory space.
Your solution involves 26 linear scans of the string and a bunch of unnecessary
conversions to count the frequencies. You can save some work by replacing all those linear scans with a linear count step, another linear repetition generation, then a sort to order your letters and a final linear pass to strip counts:
from collections import Counter # For unsorted input
from itertools import groupby # For already sorted input
from operator import itemgetter
def makenewstring(inp):
# When inp not guaranteed to be sorted:
counts = Counter(inp).iteritems()
# Alternative if inp is guaranteed to be sorted:
counts = ((let, len(list(g))) for let, g in groupby(inp))
# Create appropriate number of repetitions of each letter tagged with a count
# and sort to put each repetition of a letter in correct order
# Use negative n's so much more common letters appear repeatedly at start, not end
repeats = sorted((n, let) for let, cnt in counts for n in range(0, -cnt, -1))
# Remove counts and join letters
return ''.join(map(itemgetter(1), repeats))
Updated: It occurred to me that my original solution could be made much more concise, a one-liner actually (excluding required imports), that minimizes temporaries, in favor of a single sort-by-key operation that uses a trick to sort each letter by the count of that letter seen so far:
from collections import defaultdict
from itertools import count
def makenewstring(inp):
return ''.join(sorted(inp, key=lambda c, d=defaultdict(count): (-next(d[c]), c)))
This is actually the same basic logic as the original answer, it just accomplishes it by having sorted perform the decoration and undecoration of the values implicitly instead of doing it ourselves explicitly (implicit decorate/undecorate is the whole point of sorted's key argument; it's doing the Schwartzian transform for you).
Performance-wise, both approaches are similar; they both (in practice) scale linearly for smaller inputs (the one-liner up to inputs around 150 characters long, the longer code, using Counter, up to inputs in the len 2000 range), and while the growth is super-linear above that point, it's always below the theoretical O(n log_2 n) (likely due to the data being not entirely random thanks to the counts and limited alphabet, ensuring Python's TimSort has some existing ordering to take advantage of). The one-liner is somewhat faster for smaller strings (len 100 or less), the longer code is somewhat faster for larger strings (I'm guessing it has something to do with the longer code creating some ordering by grouping runs of counts for each letter). Really though, it hardly matters unless the input strings are expected to be huge.
Since the alphabet will always be a constant 26 characters,
this will work in O(N) and only takes a constant amount of space of 26
from collections import Counter
from string import ascii_lowercase
def sorted_alphabet(text):
freq = Counter(text)
alphabet = filter(freq.get, ascii_lowercase) # alphabet filtered with freq >= 1
top_freq = max(freq.values()) if text else 0 # handle empty text eg. ''
for top_freq in range(top_freq, 0, -1): # from top_freq to 1
for letter in alphabet:
if freq[letter] >= top_freq:
yield letter
print ''.join(sorted_alphabet('abbcccdddd'))
print ''.join(sorted_alphabet('dbdd'))
print ''.join(sorted_alphabet(''))
print ''.join(sorted_alphabet('xxxxaaax'))
dcdbcdabcd
ddbd
xxaxaxax
What about this?
I am making use of in-built python functions to eliminate loops and improve efficiency.
test_str = 'abbcccdddd'
remaining_letters = [1] # dummy initialisation
# sort alphabetically
unique_letters = sorted(set(test_str))
frequencies = [test_str.count(letter) for letter in unique_letters]
out = []
while(remaining_letters):
# in case of ties, index takes the first occurence, so the alphabetical order is preserved
max_idx = frequencies.index(max(frequencies))
out.append(unique_letters[max_idx])
#directly update frequencies instead of calculating them again
frequencies[max_idx] -= 1
remaining_letters = [idx for idx, freq in enumerate(frequencies) if freq>0]
print''.join(out) #dcdbcdabcd
I am working to produce a Python script that can find the (longest possible) length of all n-word-length substrings shared by two strings, disregarding trailing punctuation. Given two strings:
"this is a sample string"
"this is also a sample string"
I want the script to identify that these strings have a sequence of 2 words in common ("this is") followed by a sequence of 3 words in common ("a sample string"). Here is my current approach:
a = "this is a sample string"
b = "this is also a sample string"
aWords = a.split()
bWords = b.split()
#create counters to keep track of position in string
currentA = 0
currentB = 0
#create counter to keep track of longest sequence of matching words
matchStreak = 0
#create a list that contains all of the matchstreaks found
matchStreakList = []
#create binary switch to control the use of while loop
continueWhileLoop = 1
for word in aWords:
currentA += 1
if word == bWords[currentB]:
matchStreak += 1
#to avoid index errors, check to make sure we can move forward one unit in the b string before doing so
if currentB + 1 < len(bWords):
currentB += 1
#in case we have two identical strings, check to see if we're at the end of string a. If we are, append value of match streak to list of match streaks
if currentA == len(aWords):
matchStreakList.append(matchStreak)
elif word != bWords[currentB]:
#because the streak is broken, check to see if the streak is >= 1. If it is, append the streak counter to out list of streaks and then reset the counter
if matchStreak >= 1:
matchStreakList.append(matchStreak)
matchStreak = 0
while word != bWords[currentB]:
#the two words don't match. If you can move b forward one word, do so, then check for another match
if currentB + 1 < len(bWords):
currentB += 1
#if you have advanced b all the way to the end of string b, then rewind to the beginning of string b and advance a, looking for more matches
elif currentB + 1 == len(bWords):
currentB = 0
break
if word == bWords[currentB]:
matchStreak += 1
#now that you have a match, check to see if you can advance b. If you can, do so. Else, rewind b to the beginning
if currentB + 1 < len(bWords):
currentB += 1
elif currentB + 1 == len(bWords):
#we're at the end of string b. If we are also at the end of string a, check to see if the value of matchStreak >= 1. If so, add matchStreak to matchStreakList
if currentA == len(aWords):
matchStreakList.append(matchStreak)
currentB = 0
break
print matchStreakList
This script correctly outputs the (maximum) lengths of the common word-length substrings (2, 3), and has done so for all tests so far. My question is: Is there a pair of two strings for which the approach above will not work? More to the point: Are there extant Python libraries or well-known approaches that can be used to find the maximum length of all n-word-length substrings that two strings share?
[This question is distinct from the longest common substring problem, which is only a special case of what I'm looking for (as I want to find all common substrings, not just the longest common substring). This SO post suggests that methods such as 1) cluster analysis, 2) edit distance routines, and 3) longest common sequence algorithms might be suitable approaches, but I didn't find any working solutions, and my problem is perhaps slightly easier that that mentioned in the link because I'm dealing with words bounded by whitespace.]
EDIT:
I'm starting a bounty on this question. In case it will help others, I wanted to clarify a few quick points. First, the helpful answer suggested below by #DhruvPathak does not find all maximally-long n-word-length substrings shared by two strings. For example, suppose the two strings we are analyzing are:
"They all are white a sheet of spotless paper when they first are born
but they are to be scrawled upon and blotted by every goose quill"
and
"You are all white, a sheet of lovely, spotless paper, when you first
are born; but you are to be scrawled and blotted by every goose's
quill"
In this case, the list of maximally long n-word-length substrings (disregarding trailing punctuation) is:
all
are
white a sheet of
spotless paper when
first are born but
are to be scrawled
and blotted by every
Using the following routine:
#import required packages
import difflib
#define function we'll use to identify matches
def matches(first_string,second_string):
s = difflib.SequenceMatcher(None, first_string,second_string)
match = [first_string[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
return match
a = "They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill"
b = "You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill"
a = a.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()
b = b.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()
print matches(a,b)
One gets output:
['e', ' all', ' white a sheet of', ' spotless paper when ', 'y', ' first are born but ', 'y', ' are to be scrawled', ' and blotted by every goose', ' quill']
In the first place, I am not sure how one could select from this list the substrings that contain only whole words. In the second place, this list does not include "are", one of the desired maximally-long common n-word-length substrings. Is there a method that will find all of the maximally long n-word-long substrings shared by these two strings ("You are all..." and "They all are...")?
There are still ambiguities here, and I don't want to spend time arguing about them. But I think I can add something helpful anyway ;-)
I wrote Python's difflib.SequenceMatcher, and spent a lot of time finding expected-case fast ways to find longest common substrings. In theory, that should be done with "suffix trees", or the related "suffix arrays" augmented with "longest common prefix arrays" (the phrases in quotes are search terms if you want to Google for more). Those can solve the problem in worst-case linear time. But, as is sometimes the case, the worst-case linear-time algorithms are excruciatingly complex and delicate, and suffer large constant factors - they can still pay off hugely if a given corpus is going to be searched many times, but that's not the typical case for Python's difflib and it doesn't look like your case either.
Anyway, my contribution here is to rewrite SequenceMatcher's find_longest_match() method to return all the (locally) maximum matches it finds along the way. Notes:
I'm going to use the to_words() function Raymond Hettinger gave you, but without the conversion to lower case. Converting to lower case leads to output that isn't exactly like what you said you wanted.
Nevertheless, as I noted in a comment already, this does output "quill", which isn't in your list of desired outputs. I have no idea why it isn't, since "quill" does appear in both inputs.
Here's the code:
import re
def to_words(text):
'Break text into a list of words without punctuation'
return re.findall(r"[a-zA-Z']+", text)
def match(a, b):
# Make b the longer list.
if len(a) > len(b):
a, b = b, a
# Map each word of b to a list of indices it occupies.
b2j = {}
for j, word in enumerate(b):
b2j.setdefault(word, []).append(j)
j2len = {}
nothing = []
unique = set() # set of all results
def local_max_at_j(j):
# maximum match ends with b[j], with length j2len[j]
length = j2len[j]
unique.add(" ".join(b[j-length+1: j+1]))
# during an iteration of the loop, j2len[j] = length of longest
# match ending with b[j] and the previous word in a
for word in a:
# look at all instances of word in b
j2lenget = j2len.get
newj2len = {}
for j in b2j.get(word, nothing):
newj2len[j] = j2lenget(j-1, 0) + 1
# which indices have not been extended? those are
# (local) maximums
for j in j2len:
if j+1 not in newj2len:
local_max_at_j(j)
j2len = newj2len
# and we may also have local maximums ending at the last word
for j in j2len:
local_max_at_j(j)
return unique
Then:
a = "They all are white a sheet of spotless paper " \
"when they first are born but they are to be " \
"scrawled upon and blotted by every goose quill"
b = "You are all white, a sheet of lovely, spotless " \
"paper, when you first are born; but you are to " \
"be scrawled and blotted by every goose's quill"
print match(to_words(a), to_words(b))
displays:
set(['all',
'and blotted by every',
'first are born but',
'are to be scrawled',
'are',
'spotless paper when',
'white a sheet of',
'quill'])
EDIT - how it works
A great many sequence matching and alignment algorithms are best understood as working on a 2-dimensional matrix, with rules for computing the matrix entries and later interpreting the entries' meaning.
For input sequences a and b, picture a matrix M with len(a) rows and len(b) columns. In this application, we want M[i, j] to contain the length of the longest common contiguous subsequence ending with a[i] and b[j], and the computational rules are very easy:
M[i, j] = 0 if a[i] != b[j].
M[i, j] = M[i-1, j-1] + 1 if a[i] == b[j] (where we assume an out-of-bounds matrix reference silently returns 0).
Interpretation is also very easy in this case: there is a locally maximum non-empty match ending at a[i] and b[j], of length M[i, j], if and only if M[i, j] is non-zero but M[i+1, j+1] is either 0 or out-of-bounds.
You can use those rules to write very simple & compact code, with two loops, that computes M correctly for this problem. The downside is that the code will take (best, average and worst cases) O(len(a) * len(b)) time and space.
While it may be baffling at first, the code I posted is doing exactly the above. The connection is obscured because the code is heavily optimized, in several ways, for expected cases:
Instead of doing one pass to compute M, then another pass to interpret the results, computation and interpretation are interleaved in a single pass over a.
Because of that, the whole matrix doesn't need to be stored. Instead only the current row (newj2len) and the previous row (j2len) are simultaneously present.
And because the matrix in this problem is usually mostly zeroes, a row here is represented sparsely, via a dict mapping column indices to non-zero values. Zero entries are "free", in that they're never stored explicitly.
When processing a row, there's no need to iterate over each column: the precomputed b2j dict tells us exactly the interesting column indices in the current row (those columns that match the current word from a).
Finally, and partly by accident, all the preceeding optimizations conspire in such a way that there's never a need to know the current row's index, so we don't have to bother computing that either.
EDIT - the dirt simple version
Here's code that implements the 2D matrix directly, with no attempts at optimization (other than that a Counter can often avoid explicitly storing 0 entries). It's extremely simple, short and easy:
def match(a, b):
from collections import Counter
M = Counter()
for i in range(len(a)):
for j in range(len(b)):
if a[i] == b[j]:
M[i, j] = M[i-1, j-1] + 1
unique = set()
for i in range(len(a)):
for j in range(len(b)):
if M[i, j] and not M[i+1, j+1]:
length = M[i, j]
unique.add(" ".join(a[i+1-length: i+1]))
return unique
Of course ;-) that returns the same results as the optimized match() I posted at first.
EDIT - and another without a dict
Just for fun :-) If you have the matrix model down pat, this code will be easy to follow. A remarkable thing about this particular problem is that a matrix cell's value only depends on the values along the diagonal to the cell's northwest. So it's "good enough" just to traverse all the main diagonals, proceeding southeast from all the cells on the west and north borders. This way only small constant space is needed, regardless of the inputs' lengths:
def match(a, b):
from itertools import chain
m, n = len(a), len(b)
unique = set()
for i, j in chain(((i, 0) for i in xrange(m)),
((0, j) for j in xrange(1, n))):
k = 0
while i < m and j < n:
if a[i] == b[j]:
k += 1
elif k:
unique.add(" ".join(a[i-k: i]))
k = 0
i += 1
j += 1
if k:
unique.add(" ".join(a[i-k: i]))
return unique
There are really four questions embedded in your post.
1) How do you split text into words?
There are many ways to do this depending on what you count as a word, whether you care about case, whether contractions are allowed, etc. A regular expression lets you implement your choice of word splitting rules. The one I typically use is r"[a-z'\-]+". The catches contractions like don't and allow hyphenated words like mother-in-law.
2) What data structure can speed-up the search for common subsequences?
Create a location map showing for each word. For example, in the sentence you should do what you like the mapping for for you is {"you": [0, 4]} because it appears twice, once at position zero and once at position four.
With a location map in hand, it is a simple matter to loop-over the starting points to compare n-length subsequences.
3) How do I find common n-length subsequences?
Loop over all words in the one of the sentences. For each such word, find the places where it occurs in the other sequence (using the location map) and test whether the two n-length slices are equal.
4) How do I find the longest common subsequence?
The max() function finds a maximum value. It takes a key-function such as len() to determine the basis for comparison.
Here is some working code that you can customize to your own interpretation of the problem:
import re
def to_words(text):
'Break text into a list of lowercase words without punctuation'
return re.findall(r"[a-z']+", text.lower())
def starting_points(wordlist):
'Map each word to a list of indicies where the word appears'
d = {}
for i, word in enumerate(wordlist):
d.setdefault(word, []).append(i)
return d
def sequences_in_common(wordlist1, wordlist2, n=1):
'Generate all n-length word groups shared by two word lists'
starts = starting_points(wordlist2)
for i, word in enumerate(wordlist1):
seq1 = wordlist1[i: i+n]
for j in starts.get(word, []):
seq2 = wordlist2[j: j+n]
if seq1 == seq2 and len(seq1) == n:
yield ' '.join(seq1)
if __name__ == '__main__':
t1 = "They all are white a sheet of spotless paper when they first are " \
"born but they are to be scrawled upon and blotted by every goose quill"
t2 = "You are all white, a sheet of lovely, spotless paper, when you first " \
"are born; but you are to be scrawled and blotted by every goose's quill"
w1 = to_words(t1)
w2 = to_words(t2)
for n in range(1,10):
matches = list(sequences_in_common(w1, w2, n))
if matches:
print(n, '-->', max(matches, key=len))
difflib module would be a good candidate for this case , see get_matching_blocks :
import difflib
def matches(first_string,second_string):
s = difflib.SequenceMatcher(None, first_string,second_string)
match = [first_string[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
return match
first_string = "this is a sample string"
second_string = "this is also a sample string"
print matches(second_string, first_string )
demo: http://ideone.com/Ca3h8Z
A slight modification, with matching not chars but words, I suppose will do:
def matche_words(first_string,second_string):
l1 = first_string.split()
l2 = second_string.split()
s = difflib.SequenceMatcher(None, l1, l2)
match = [l1[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
return match
Demo:
>>> print '\n'.join(map(' '.join, matches(a,b)))
all
white a sheet of
spotless paper when
first are born but
are to be scrawled
and blotted by every
quill
Given a number, I have to find out all possible index-pairs in a given array whose sum equals that number. I am currently using the following algo:
def myfunc(array,num):
dic = {}
for x in xrange(len(array)): # if 6 is the current key,
if dic.has_key(num-array[x]): #look at whether num-x is there in dic
for y in dic[num-array[x]]: #if yes, print all key-pair values
print (x,y),
if dic.has_key(array[x]): #check whether the current keyed value exists
dic[array[x]].append(x) #if so, append the index to the list of indexes for that keyed value
else:
dic[array[x]] = [x] #else create a new array
Will this run in O(N) time? If not, then what should be done to make it so? And in any case, will it be possible to make it run in O(N) time without using any auxiliary data structure?
Will this run in O(N) time?
Yes and no. The complexity is actually O(N + M) where M is the output size.
Unfortunately, the output size is in O(N^2) worst case, for example the array [3,3,3,3,3,...,3] and number == 6 - it will result in quadric number of elements needed to be produced.
However - asymptotically speaking - it cannot be done better then this, because it is linear in the input size and output size.
Very, very simple solution that actually does run in O(N) time by using array references. If you want to enumerate all the output pairs, then of course (as amit notes) it must take O(N^2) in the worst case.
from collections import defaultdict
def findpairs(arr, target):
flip = defaultdict(list)
for i, j in enumerate(arr):
flip[j].append(i)
for i, j in enumerate(arr):
if target-j in flip:
yield i, flip[target-j]
Postprocessing to get all of the output values (and filter out (i,i) answers):
def allpairs(arr, target):
for i, js in findpairs(arr, target):
for j in js:
if i < j: yield (i, j)
This might help - Optimal Algorithm needed for finding pairs divisible by a given integer k
(With a slight modification, there we are seeing for all pairs divisible by given number and not necessarily just equal to given number)