Finding all common, non-overlapping substrings

Finding all common, non-overlapping substrings - python

Given two strings, I would like to identify all common sub-strings from longest to shortest.
I want to remove any "sub-"sub-strings. As an example, any substrings of '1234' would not be included in the match between '12345' and '51234'.
string1 = '51234'
string2 = '12345'
result = ['1234', '5']
I was thinking of finding the longest common substring, then recursively finding the longest substring(s) to the left/right. However, I do not want to remove a common substring after found. For example, the result below shares a 6 in the middle:
string1 = '12345623456'
string2 = '623456'
result = ['623456', '23456']
Lastly, I need to check one string against a fixed list of thousands of strings. I am unsure if there is a smart step I could take in hashing out all the substrings in these strings.
Previous Answers:
In this thread, a dynamic programming solution is found that takes O(nm) time, where n and m are the lengths of the strings. I am interested in a more efficient approach, which would use suffix trees.
Background:
I am composing song melodies from snippets of melodies. Sometimes, a combination manages to generate a melody matching too many notes in a row of an existing one.
I can use a string similarity measure, such as Edit Distance, but believe that tunes with very small differences to melodies are unique and interesting. Unfortunately, these tunes would have similar levels of similarity to songs that copy many notes of a melody in a row.

Let's start with the Tree
from collections import defaultdict
def identity(x):
return x
class TreeReprMixin(object):
def __repr__(self):
base = dict(self)
return repr(base)
class PrefixTree(TreeReprMixin, defaultdict):
'''
A hash-based Prefix or Suffix Tree for testing for
sequence inclusion. This implementation works for any
slice-able sequence of hashable objects, not just strings.
'''
def __init__(self):
defaultdict.__init__(self, PrefixTree)
self.labels = set()
def add(self, sequence, label=None):
layer = self
if label is None:
label = sequence
if label:
layer.labels.add(label)
for i in range(len(sequence)):
layer = layer[sequence[i]]
if label:
layer.labels.add(label)
return self
def add_ngram(self, sequence, label=None):
if label is None:
label = sequence
for i in range(1, len(sequence) + 1):
self.add(sequence[:i], label)
def __contains__(self, sequence):
layer = self
j = 0
for i in sequence:
j += 1
if not dict.__contains__(layer, i):
break
layer = layer[i]
return len(sequence) == j
def depth_in(self, sequence):
layer = self
count = 0
for i in sequence:
if not dict.__contains__(layer, i):
print "Breaking"
break
else:
layer = layer[i]
count += 1
return count
def subsequences_of(self, sequence):
layer = self
for i in sequence:
layer = layer[i]
return layer.labels
def __iter__(self):
return iter(self.labels)
class SuffixTree(PrefixTree):
'''
A hash-based Prefix or Suffix Tree for testing for
sequence inclusion. This implementation works for any
slice-able sequence of hashable objects, not just strings.
'''
def __init__(self):
defaultdict.__init__(self, SuffixTree)
self.labels = set()
def add_ngram(self, sequence, label=None):
if label is None:
label = sequence
for i in range(len(sequence)):
self.add(sequence[i:], label=label)
To populate the tree, you'd use the .add_ngram method.
The next part is a little trickier since you're looking for a concurrent traversal of strings whilst keeping track of tree coordinates. To pull all this off, we need some functions which operate on the tree and a query string
def overlapping_substrings(string, tree, solved=None):
if solved is None:
solved = PrefixTree()
i = 1
last = 0
matching = True
solutions = []
while i < len(string) + 1:
if string[last:i] in tree:
if not matching:
matching = True
else:
i += 1
continue
else:
if matching:
matching = False
solutions.append(string[last:i - 1])
last = i - 1
i -= 1
i += 1
if matching:
solutions.append(string[last:i])
for solution in solutions:
if solution in solved:
continue
else:
solved.add_ngram(solution)
yield solution
def slide_start(string):
for i in range(len(string)):
yield string[i:]
def seek_subtree(tree, sequence):
# Find the node of the search tree which
# is found by this sequence of items
node = tree
for i in sequence:
if i in node:
node = node[i]
else:
raise KeyError(i)
return node
def find_all_common_spans(string, tree):
# We can keep track of solutions to avoid duplicates
# and incomplete prefixes using a Prefix Tree
seen = PrefixTree()
for substring in slide_start(string):
# Drive generator forward
list(overlapping_substrings(substring, tree, seen))
# Some substrings are suffixes of other substrings which you do not
# want
compress = SuffixTree()
for solution in sorted(seen.labels, key=len, reverse=True):
# A substrings may be a suffix of another substrings, but that substrings
# is actually a repeating pattern. If a solution is
# a repeating pattern, `not solution in seek_subtree(tree, solution)` will tell us.
# Otherwise, discard the solution
if solution in compress and not solution in seek_subtree(tree, solution):
continue
else:
compress.add_ngram(solution)
return compress.labels
def search(query, corpus):
tree = SuffixTree()
if isinstance(corpus, SuffixTree):
tree = corpus
else:
for elem in corpus:
tree.add_ngram(elem)
return list(find_all_common_spans(query, tree))
So now to do the thing you wanted, do this:
search("12345", ["51234"])
search("623456", ["12345623456"])
If something is unclear, please let me know, and I'll try to clarify.

Related

Concatenate strings with a common substring in python?

Input
ONESTRING
STRINGTHREE
THREEFOUR
FOURFIVE
Output
ONESTRINGTHREEFOURFIVE
in python??
I think first i concatenate with 2 string then run a loop but this gives an error I don't know why can anyone help in in python?

WARNING
This solution is for a list of strings in arbitrary order. This means that EVERY possible pair of words must be checked for a common substring, which may require an enormous amount of memory if your list of strings is large.
Solution 1, allows for words with no common substrings to be concatenated if desired
import itertools
from typing import Set, Tuple, Dict, List
def get_match(pair: Tuple[str, str], min_overlap: int = 3) -> str:
a, b = pair
for i in range(min_overlap, min(map(len, pair)) + 1):
if a[-i:] == b[:i]:
return b[:i]
return ""
def links_joiners(strings: List[str]) -> Tuple[Dict[str, str], Set[str]]:
links, joiners = dict(), set()
for pair in itertools.permutations(strings, 2):
if (match := get_match(pair)):
joiners.add(match)
links.update((pair,))
return links, joiners
def get_ordered_strings(strings: List[str], links: Dict[str, str]) -> List[str]:
def find_order(node: str) -> int:
return 0 if node not in links else 1 + find_order(links[node])
return sorted(strings, key=find_order, reverse=True)
def join_strings(strings: List[str], joiners: Set[str]) -> str:
s = "".join(strings)
for j in joiners:
s = s.replace(j, "", 1)
return s
Usage:
strings = ["THREEFOUR",
"ONESTRING",
"STRINGTHREE",
"FOURFIVE"]
links, joiners = get_links_and_joiners(strings)
ordered_strings = get_ordered_strings(strings, links)
join_strings(ordered_strings, joiners)
Output:
'ONESTRINGTHREEFOURFIVE'
Explanation
First, itertools is part of the standard library; no need to install any third party packages for this solution.
Now, the links_joiners() function will take a list of strings and find all the pairs of strings with matching suffix-prefix pairs, putting those pairs into a links dictionary which looks like this:
{'ONESTRING': 'STRINGTHREE',
'THREEFOUR': 'FOURFIVE',
'STRINGTHREE': 'THREEFOUR'}
Notice these are not in order. This is because for an arbitrary list of strings we can't be sure the strings were in order in the first place, so we have to iterate over every permutation of strings exhaustively in order to ensure that we've covered all pairings.
Now, notice there's also a function called get_ordered_strings() with an inner function find_order(). The function get_ordered_strings() forms what is known as a closure, but that's not particularly important to understand right now. The find_order() function is recursive, here's how it works:
Given a node, if the node is not a key in the links dictionary we've reached the base case and return zero. Otherwise, move to step 2.
If node is present, add one to a recursive call to find_order on that new node.
So given a key, say "ONESTRING", the find_order() function will look at the value associated with that key, and if that value is also a key in the dictionary, look at its value, and so on until it reaches a value that isn't a key in the dictionary.
Here's the code for find_order() again:
def find_order(node: str) -> int:
if node not in links:
return 0
return 1 + find_order(links[node])
And here's what links looks like after calling links_joiners():
{'ONESTRING': 'STRINGTHREE',
'THREEFOUR': 'FOURFIVE',
'STRINGTHREE': 'THREEFOUR'}
Now trace an example call to find_order("ONESTRING"):
find_order("ONESTRING") = 1 + find_order("STRINGTHREE")
= 1 + (1 + find_order("THREEFOUR"))
= 1 + (1 + (1 + find_order("FOURFIVE"))) # Base case
= 1 + (1 + (1 + 0))
= 3
What this function is doing is finding how many pairwise connections can be made from a given starting string. Another way to think of it is that links is actually representing adjacencies in a (special case of a) DAG.
Essentially what we want to do is take the nodes THREEFOUR, ONESTRING, STRINGTHREE, FOURFIVE and construct the longest possible singly-linked list (a type of a DAG) from them:
ONESTRING -> STRINGTHREE -> THREEFOUR -> FOURFIVE
By passing a given "node" of this graph to find_order(), it will follow the graph all the way to the end. So ONESTRING travels a distance of 3 to get to the end, whereas THREEFOUR travels only a distance of 1.
Node: ONESTRING -> STRINGTHREE -> THREEFOUR -> FOURFIVE
Dist: 3 2 1 0
Now, by passing find_order to the built-in sorted() function, we can tell Python how we want our strings to be sorted, which, in this case is in reverse order, by distance. The result is this:
>>> strings = ['THREEFOUR', 'ONESTRING', 'STRINGTHREE', 'FOURFIVE']
>>> ordered_strings = get_ordered_strings(strings, links)
>>> ordered_strings
['ONESTRING', 'STRINGTHREE', 'THREEFOUR', 'FOURFIVE']
Now, by joining each string by their common substrings, we are constructing the longest possible string where the constraint is that each pair of strings must have a common substring in the correct position. In other words, ordered_strings represents the longest path in the DAG. Or more accurately, we've designed a DAG which will have the longest path, by using all the provided nodes, and putting them in the correct order.
From here, we join each string:
>>> s = "".join(ordered_strings)
>>> s
'ONESTRINGSTRINGTHREETHREEFOURFOURFIVE'
Then we remove one instance of each of the joiners:
for j in joiners:
s = s.replace(j, "", 1)
Solution 2, only concatenates overlapping strings
This solution reuses join_strings() and get_match() from above. It also uses the walrus operator := (Python 3.8+) but can easily be written without it.
def join_overlapping_pairs(strings: List[str]) -> str:
if len(strings) == 1:
return strings.pop()
matches = set()
for pair in itertools.permutations(strings, 2):
if (match := get_match(pair)):
matches.add(join_strings(pair, (match,)))
return join_overlapping_pairs(matches)

Here is generic solution according your provided example. Sequence must be ordered, otherwise it will not work.
from functools import reduce
s = [
"ONESTRING",
"STRINGTHREE",
"THREEFOUR",
"FOURFIVE",
]
def join_f(first, add):
i = 1
while add[:i] in first:
i += 1
return first + add[i-1:]
print(reduce(join_f, s))

May use difflib library, sample code for your reference
from difflib import SequenceMatcher
str1 = "ONESTRING"
str2 = "STRINGTHREE"
match = SequenceMatcher(None, str1, str2).find_longest_match(0, len(str1), 0, len(str2))
#match[a]=3, match[b]=0, match[size]=6

Assuming the words are in the connecting order:
words = ['ONESTRING',
'STRINGTHREE',
'THREEFOUR',
'FOURFIVE']
S = words[0]
for w in words[1:]:
S += w[next(i for i in range(1,len(w)) if S.endswith(w[:-i])):]
print(S)
'ONESTRINGTHREEFOURFIVE'
If the words are not in connecting order, a recursive approach can do it:
def combine(words,S=""):
if not words: return S
result = "" # result is shortest combo
for i,w in enumerate(words): # p is max overlap (end/start)
p = next((i for i in range(1,len(w)) if S.endswith(w[:-i])),0)
if result and not p: continue # check if can combine
combo = combine(words[:i]+words[i+1:],S+w[p:]) # candidate combo
if not result or len(combo)<len(result): # keep if shortest
result = combo or result
return result
Output:
words = ['ONESTRING',
'FOURFIVE',
'THREEFOUR',
'STRINGTHREE'
]
result = combine(words)
print(result)
'ONESTRINGTHREEFOURFIVE

How do I check if the next item in a string is the alphabetical successor of the one before? + Inverse

I'm trying to compress a string in a way that any sequence of letters in strict alphabetical order is swapped with the first letter plus the length of the sequence.
For example, the string "abcdefxylmno", would become: "a6xyl4"
Single letters that aren't in order with the one before or after just stay the way they are.
How do I check that two letters are successors (a,b) and not simply in alphabetical order (a,c)? And how do I keep iterating on the string until I find a letter that doesn't meet this requirement?
I'm also trying to do this in a way that makes it easier to write an inverse function (that given the result string gives me back the original one).
EDIT :
I've managed to get the function working, thanks to your suggestion of using the alphabet string as comparison; now I'm very much stuck on the inverse function: given "a6xyl4" expand it back into "abcdefxylmno".
After quite some time I managed to split the string every time there's a number and I made a function that expands a 2 char string, but it fails to work when I use it on a longer string:
from string import ascii_lowercase as abc
def subString(start,n):
L=[]
ind = abc.index(start)
newAbc = abc[ind:]
for i in range(len(newAbc)):
while i < n:
L.append(newAbc[i])
i+=1
res = ''.join(L)
return res
def unpack(S):
for i in range(len(S)-1):
if S[i] in abc and S[i+1] not in abc:
lett = str(S[i])
num = int(S[i+1])
return subString(lett,num)
def separate(S):
lst = []
for i in S:
lst.append(i)
for el in lst:
if el.isnumeric():
ind = lst.index(el)
lst.insert(ind+1,"-")
a = ''.join(lst)
L = a.split("-")
if S[-1].isnumeric():
L.remove(L[-1])
return L
else:
return L
def inverse(S):
L = separate(S)
for i in L:
return unpack(i)
Each of these functions work singularly, but inverse(S) doesn't output anything. What's the mistake?

You can use the ord() function which returns an integer representing the Unicode character. Sequential letters in alphabetical order differ by 1. Thus said you can implement a simple funtion:
def is_successor(a,b):
# check for marginal cases if we dont ensure
# input restriction somewhere else
if ord(a) not in range(ord('a'), ord('z')) and ord(a) not in range(ord('A'),ord('Z')):
return False
if ord(b) not in range(ord('a'), ord('z')) and ord(b) not in range(ord('A'),ord('Z')):
return False
# returns true if they are sequential
return ((ord(b) - ord(a)) == 1)
You can use chr(int) method for your reversing stage as it returns a string representing a character whose Unicode code point is an integer given as argument.

This builds on the idea that acceptable subsequences will be substrings of the ABC:
from string import ascii_lowercase as abc # 'abcdefg...'
text = 'abcdefxylmno'
stack = []
cache = ''
# collect subsequences
for char in text:
if cache + char in abc:
cache += char
else:
stack.append(cache)
cache = char
# if present, append the last sequence
if cache:
stack.append(cache)
# stack is now ['abcdef', 'xy', 'lmno']
# Build the final string 'a6x2l4'
result = ''.join(f'{s[0]}{len(s)}' if len(s) > 1 else s for s in stack)

Python algorithm in list

In a list of N strings, implement an algorithm that outputs the largest n if the entire string is the same as the preceding n strings. (i.e., print out how many characters in front of all given strings match).
My code:
def solution(a):
import numpy as np
for index in range(0,a):
if np.equal(a[index], a[index-1]) == True:
i += 1
return solution
else:
break
return 0
# Test code
print(solution(['abcd', 'abce', 'abchg', 'abcfwqw', 'abcdfg'])) # 3
print(solution(['abcd', 'gbce', 'abchg', 'abcfwqw', 'abcdfg'])) # 0

Some comments on your code:
There is no need to use numpy if it is only used for string comparison
i is undefined when i += 1 is about to be executed, so that will not run. There is no actual use of i in your code.
index-1 is an invalid value for a list index in the first iteration of the loop
solution is your function, so return solution will return a function object. You need to return a number.
The if condition is only comparing complete words, so there is no attempt to only compare a prefix.
A possible way to do this, is to be optimistic and assume that the first word is a prefix of all other words. Then as you detect a word where this is not the case, reduce the size of the prefix until it is again a valid prefix of that word. Continue like that until all words have been processed. If at any moment you find the prefix is reduced to an empty string, you can actually exit and return 0, as it cannot get any less than that.
Here is how you could code it:
def solution(words):
prefix = words[0] # if there was only one word, this would be the prefix
for word in words:
while not word.startswith(prefix):
prefix = prefix[:-1] # reduce the size of the prefix
if not prefix: # is there any sense in continuing?
return 0 # ...: no.
return len(prefix)

The description is somewhat convoluted but it does seem that you're looking for the length of the longest common prefix.
You can get the length of the common prefix between two strings using the next() function. It can find the first index where characters differ which will correspond to the length of the common prefix:
def maxCommon(S):
cp = S[0] if S else "" # first string is common prefix (cp)
for s in S[1:]: # go through other strings (s)
cs = next((i for i,(a,b) in enumerate(zip(s,cp)) if a!=b),len(cp))
cp = cp[:cs] # truncate to new common size (cs)
return len(cp) # return length of common prefix
output:
print(maxCommon(['abcd', 'abce', 'abchg', 'abcfwqw', 'abcdfg'])) # 3
print(maxCommon(['abcd', 'gbce', 'abchg', 'abcfwqw', 'abcdfg'])) # 0

How do I reverse the strings contained in each pair of matching parentheses, starting from the innermost pair? CodeFights

So far I have done this. I am stuck on recursion. I have no idea how to move forward, joining and reversing etc.
def callrecursion(s):
a=s.index('(')
z=len(s) - string[::-1].index(')') -1
newStr=s[a+1:z]
# Something is missing here i cant figure it out
print(newStr)
return newStr
def reverseParentheses(s):
if '(' in s:
return reverseParentheses(callrecursion(s))
print('wabba labba dub dub')
else:
return s
string='a(bcdefghijkl(mno)p)q'
reverseParentheses(string)
EXPECTED OUTPUT : "apmnolkjihgfedcbq"

def reverseParentheses(s):
if '(' in s:
posopen=s.find('(')
s=s[:posopen]+reverseParentheses(s[posopen+1:])
posclose=s.find(')',posopen+1)
s=s[:posopen]+s[posopen:posclose][::-1]+s[posclose+1:]
return s
string='a(bcdefghijkl(mno)p)q'
print(string)
print(reverseParentheses(string))
print('apmnolkjihgfedcbq') # your test
string='a(bc)(ef)g'
print(string)
print(reverseParentheses(string))
The idea is to go 'inward' as long as possible (where 'inward' does not even mean 'nesting', it goes as long as there are any opening parentheses), so the innermost pairs are flipped first, and then the rest as the recursion returns. This way 'parallel' parentheses seem to work too, while simple pairing of "first opening parentheses" with "last closing ones" do not handle them well. Or at least that is what I think.
Btw: recursion is just a convoluted replacement for rfind here:
def reverseParentheses(s):
while '(' in s:
posopen=s.rfind('(')
posclose=s.find(')',posopen+1)
s=s[:posopen]+s[posopen+1:posclose][::-1]+s[posclose+1:]
return s;
(... TBH: now I tried, and the recursive magic dies on empty parentheses () placed in the string, while this one works)

I've come up with tho following logic (assuming the parentheses are properly nested).
The base case is the absence of parentheses in s, so it is returned unchanged.
Otherwise we locate indices of leftmost and rightmost opening and closing parentheses
(taking care of possible string reversal, so ')' might appear opening and '(' -- as closing).
Having obtained beg and end the remaining job is quite simple: one has to pass the reversed substring contained between beg and end to the subsequent recursive call.
def reverseParentheses(s):
if s.find('(') == -1:
return s
if s.find('(') < s.find(')'):
beg, end = s.find('('), s.rfind(')')
else:
beg, end = s.find(')'), s.rfind('(')
return s[:beg] + reverseParentheses(s[beg + 1:end][::-1]) + s[end + 1:]

Assuming that number of opening and closing brackets always match, this might be the one of the simplest method to reverse words in parenthesis:
def reverse_parentheses(st: str) -> str:
while True:
split1 = st.split('(')
split2 = split1[-1].split(')')[0]
st = st.replace(f'({split2})', f'{split2[::-1]}')
if '(' not in st and ')' not in st:
return st
# s = "(abcd)"
# s = "(ed(et)el)"
# s = "(ed(et(oc))el)"
# s = "(u(love)i)"
s= "((ng)ipm(ca))"
reversed = reverse_parentheses(s)
print(reversed)

You have a few issues in your code, and much of the logic missing. This adapts your code and produces the desired output:
def callrecursion(s):
a=s.index('(')
# 's' not 'string'
z=len(s) - s[::-1].index(')') -1
newStr=s[a+1:z][::-1]
# Need to consider swapped parentheses
newStr=newStr.replace('(', "$") # Placeholder for other swap
newStr=newStr.replace(')', "(")
newStr=newStr.replace('$', ")")
#Need to recombine initial and trailing portions of original string
newStr = s[:a] + newStr + s[z+1:]
return newStr
def reverseParentheses(s):
if '(' in s:
return reverseParentheses(callrecursion(s))
print('wabba labba dub dub')
else:
return s
string='a(bcdefghijkl(mno)p)q'
print(reverseParentheses(string))
>>>apmnolkjihgfedcbq

While the existing O(n^2) solutions were sufficient here, this problem is solvable in O(n) time, and the solution is pretty fun.
The idea is to build a k-ary tree to represent our string, and traverse it with DFS. Each 'level' of the tree represents one layer of nested parentheses. There is one node for each set of parentheses, and one node for each letter, so there are only O(n) nodes in the tree.
For example, the tree-nodes at the top level are either:
A letter that is not contained in parentheses
A tree-node representing a pair of parentheses at the outermost layer of our string, which may have child tree-nodes
To get the effect of reversals, we can traverse the tree in a depth-first way recursively. Besides knowing our current node, we just need to know if we're in 'reverse mode': a boolean to tell us whether to visit our node's children from left to right, or right to left.
Every time we go down a level in our tree, whether we're in 'reverse mode' or not is flipped.
Python code:
class TreeNode:
def __init__(self, parent=None):
self.parent = parent
self.children = []
def reverseParentheses(s: str) -> str:
root_node = TreeNode()
curr_node = root_node
# Build the tree
for let in s:
# Go down a level-- new child
if let == '(':
new_child = TreeNode(parent=curr_node)
curr_node.children.append(new_child)
curr_node = new_child
# Go back to our parent
elif let == ')':
curr_node = curr_node.parent
else:
curr_node.children.append(let)
answer = []
def dfs(node, is_reversed: bool):
nonlocal answer
num_children = len(node.children)
if is_reversed:
range_start, range_end, range_step = num_children-1, -1, -1
else:
range_start, range_end, range_step = 0, num_children, 1
for i in range(range_start, range_end, range_step):
if isinstance(node.children[i], str):
answer.append(node.children[i])
else:
dfs(node.children[i], not is_reversed)
dfs(root_node, False)
return ''.join(answer)

Here is the correct version for your callrecursion function:
def callrecursion(text):
print(text)
a = text.find('(') + 1
z = text.rfind(')') + 1
newStr = text[:a - 1] + text[a:z-1][::-1].replace('(', ']').replace(')', '[').replace(']', ')').replace('[', '(') + text[z:]
return newStr
You probably have to take into account if the parethesis is the first/last character.

Suffix search - Python

Here's a the problem, provided a list of strings and a document find the shortest substring that contains all the strings in the list.
Thus for:
document = "many google employees can program because google is a technology company that can program"
searchTerms = ['google', 'program', 'can']
the output should be:
"can program because google" # 27 chars
and not:
"google employees can program" # 29 chars
"google is a technology company that can program" # 48 chars
Here's my approach,
Split the document into suffix tree,
check for all strings in each suffix
return the one of the shortest length,
Here's my code
def snippetSearch(document, searchTerms):
doc = document.split()
suffix_array = create_suffix_array(doc)
current = None
current_len = sys.maxsize
for suffix in suffix_array:
if check_for_terms_in_array(suffix, searchTerms):
if len(suffix) < current_len:
current_len = len(suffix)
current = suffix
return ' '.join(map(str, current))
def create_suffix_array(document):
suffix_array = []
for i in range(len(document)):
sub = document[i:]
suffix_array.append(sub)
return suffix_array
def check_for_terms_in_array(arr, terms):
for term in terms:
if term not in arr:
return False
return True
This is an online submission and it's not passing one test case. I have no idea what the test case is though. My question is, is there anything logically incorrect with the code. Also is there a more efficient way of doing this.

You can break this into two parts. First, finding the shortest substring that matches some property. We'll pretend we already have a function that tests for the property:
def find_shortest_ss(document, some_property):
# First level of looping gradually increases substring length
for x in range(len(document)):
# Second level of looping tests current length at valid positions
for y in range(max(len(document), len(document)-x)):
if some_property(document[y:x+y]):
return document[y:x+y]
# How to handle the case of no match is undefined
raise ValueError('No matching value found')
Now the property we want to test for itself:
def contains_all_terms(terms):
return (lambda s: all(term in s for term in terms))
This lambda expression takes some terms and will return a function which, when evaluated on a string, returns true if and only if all the terms are in the string. This is basically a more terse version of a nested function definition which you could write like this:
def contains_all_terms(terms):
def string_contains_them(s):
return all(term in s for term in terms)
return string_contains_them
So we're actually just returning the handle of the function we create dynamically inside of our contains_all_terms function
To piece this together we do like so:
>>> find_shortest_ss(document, contains_all_terms(searchTerms))
'program can google'
Some efficiency advantages which this code has:
The any builtin function has short-circuit evaluation, meaning that it will return False as soon as it finds a non-contained substring
It starts by checking all the shortest substrings, then proceeds to increase substring length one extra character length at a time. If it ever finds a satisfying substring it will exit and return that value. So you can guarantee the returned value will never be longer than necessary. It won't even be doing any operations on substrings longer than necessary.
8 lines of code, not bad I think

Well, brute force is O(n³), so why not:
from itertools import product
def find_shortest(doc, terms):
doc = document.split()
substrings = (
doc[i:j]
for i, j in product(range(0, len(doc)), range(0, len(doc)))
if all(search_term in doc[i:j] for search_term in search_terms)
)
shortest = doc
for candidate in substrings:
if len(candidate) < len(shortest):
shortest = candidate
return shortest.
document = 'many google employees can program can google employees because google is a technology company that writes program'
search_terms = ['google', 'program', 'can']
print find_shortest(document, search_terms)
>>>> ['program', 'can', 'google']
You can probably do this a lot faster, though. For example, any relevant substring can only end with one of the keywords

Instead of brute forcing all possible sub-strings, I brute forced all possible matching word positions... It should be a bit faster..
import numpy as np
from itertools import product
document = 'many google employees can program can google employees because google is a technology company that writes program'
searchTerms = ['google', 'program']
word_lists = []
for word in searchTerms:
word_positions = []
start = 0 #starting index of str.find()
while 1:
start = document.find(word, start, -1)
if start == -1: #no more instances
break
word_positions.append([start, start+len(word)]) #beginning and ending index of search term
start += 1 #increment starting search postion
word_lists.append(word_positions) #add all search term positions to list of all search terms
minLen = len(document)
lower = 0
upper = len(document)
for p in product(*word_lists): #unpack word_lists into word_positions
indexes = np.array(p).flatten() #take all indices into flat list
lowerI = np.min(indexes)
upperI = np.max(indexes)
indexRange = upperI - lowerI #determine length of substring
if indexRange < minLen:
minLen = indexRange
lower = lowerI
upper = upperI
print document[lower:upper]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Finding all common, non-overlapping substrings - python

Related

Concatenate strings with a common substring in python?

How do I check if the next item in a string is the alphabetical successor of the one before? + Inverse

Python algorithm in list

How do I reverse the strings contained in each pair of matching parentheses, starting from the innermost pair? CodeFights

Suffix search - Python

Categories

Resources