Suffix search - Python - python

Here's the problem: given a list of strings and a document, find the shortest substring of the document that contains all the strings in the list.
Thus for:
document = "many google employees can program because google is a technology company that can program"
searchTerms = ['google', 'program', 'can']
the output should be:
"can program because google" # 26 chars
and not:
"google employees can program" # 28 chars
"google is a technology company that can program" # 47 chars
Here's my approach:
1. Split the document into words and build every suffix of that word list.
2. Check each suffix for all the search terms.
3. Return the shortest suffix that contains them all.
Here's my code:
import sys

def snippetSearch(document, searchTerms):
    doc = document.split()
    suffix_array = create_suffix_array(doc)
    current = None
    current_len = sys.maxsize
    for suffix in suffix_array:
        if check_for_terms_in_array(suffix, searchTerms):
            if len(suffix) < current_len:
                current_len = len(suffix)
                current = suffix
    return ' '.join(map(str, current))

def create_suffix_array(document):
    # every suffix of the word list: document[0:], document[1:], ...
    suffix_array = []
    for i in range(len(document)):
        sub = document[i:]
        suffix_array.append(sub)
    return suffix_array

def check_for_terms_in_array(arr, terms):
    for term in terms:
        if term not in arr:
            return False
    return True
This is an online submission and it fails one test case, though I have no idea what that test case is. Is there anything logically incorrect with the code? Also, is there a more efficient way of doing this?

You can break this into two parts. First, finding the shortest substring that matches some property. We'll pretend we already have a function that tests for the property:
def find_shortest_ss(document, some_property):
    # First level of looping gradually increases substring length
    for length in range(len(document) + 1):
        # Second level of looping tests current length at valid positions
        for start in range(len(document) - length + 1):
            if some_property(document[start:start + length]):
                return document[start:start + length]
    # How to handle the case of no match is undefined
    raise ValueError('No matching value found')
Now the property we want to test for itself:
def contains_all_terms(terms):
    return (lambda s: all(term in s for term in terms))
This lambda expression takes some terms and will return a function which, when evaluated on a string, returns true if and only if all the terms are in the string. This is basically a more terse version of a nested function definition which you could write like this:
def contains_all_terms(terms):
    def string_contains_them(s):
        return all(term in s for term in terms)
    return string_contains_them
So we're actually just returning the handle of the function we create dynamically inside of our contains_all_terms function.
To piece this together:
>>> find_shortest_ss(document, contains_all_terms(searchTerms))
'can program because google'
Some efficiency advantages of this code:
The all builtin function has short-circuit evaluation, meaning that it will return False as soon as it finds a term that is not contained in the substring.
It starts by checking all the shortest substrings, then increases the substring length one character at a time. If it ever finds a satisfying substring it will exit and return that value. So you can guarantee the returned value will never be longer than necessary. It won't even be doing any operations on substrings longer than necessary.
8 lines of code, not bad I think

Well, brute force is O(n³), so why not:
from itertools import product

def find_shortest(document, search_terms):
    doc = document.split()
    # every word slice doc[i:j] that contains all the search terms
    substrings = (
        doc[i:j]
        for i, j in product(range(len(doc)), range(len(doc) + 1))
        if all(search_term in doc[i:j] for search_term in search_terms)
    )
    shortest = doc
    for candidate in substrings:
        if len(candidate) < len(shortest):
            shortest = candidate
    return shortest

document = 'many google employees can program can google employees because google is a technology company that writes program'
search_terms = ['google', 'program', 'can']
print(find_shortest(document, search_terms))
# ['program', 'can', 'google']
You can probably do this a lot faster, though. For example, any relevant substring can only end with one of the keywords.
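A sliding-window pass is one way to act on that observation. The following is a minimal sketch rather than a definitive implementation. The function name shortest_window is hypothetical, and the sketch assumes that terms must match whole space-separated words and that length is measured in characters of the space-joined result:

```python
from collections import Counter

def shortest_window(document, search_terms):
    """Return the shortest run of whole words containing every search term."""
    words = document.split()
    needed = set(search_terms)
    if not needed:
        return ''
    # Character offset of each word inside ' '.join(words)
    offsets = []
    pos = 0
    for word in words:
        offsets.append(pos)
        pos += len(word) + 1
    counts = Counter()   # occurrences of each needed term inside the window
    covered = 0          # number of distinct terms the window currently holds
    best = None          # (width in chars, start index, end index exclusive)
    left = 0
    for right, word in enumerate(words):
        if word in needed:
            counts[word] += 1
            if counts[word] == 1:
                covered += 1
        # Shrink from the left while the window still covers every term
        while covered == len(needed):
            width = offsets[right] + len(word) - offsets[left]
            if best is None or width < best[0]:
                best = (width, left, right + 1)
            dropped = words[left]
            if dropped in needed:
                counts[dropped] -= 1
                if counts[dropped] == 0:
                    covered -= 1
            left += 1
    return None if best is None else ' '.join(words[best[1]:best[2]])
```

Each word enters and leaves the window at most once, so after splitting the document the work is linear in the number of words. The function returns None when some term never appears.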

Instead of brute forcing all possible substrings, I brute forced all possible combinations of matching word positions. It should be a bit faster.
import numpy as np
from itertools import product

document = 'many google employees can program can google employees because google is a technology company that writes program'
searchTerms = ['google', 'program']

word_lists = []
for word in searchTerms:
    word_positions = []
    start = 0  # starting index of str.find()
    while True:
        start = document.find(word, start)
        if start == -1:  # no more instances
            break
        word_positions.append([start, start + len(word)])  # beginning and ending index of search term
        start += 1  # increment starting search position
    word_lists.append(word_positions)  # add all search term positions to list of all search terms

minLen = len(document)
lower = 0
upper = len(document)
for p in product(*word_lists):  # one position per search term
    indexes = np.array(p).flatten()  # take all indices into flat list
    lowerI = np.min(indexes)
    upperI = np.max(indexes)
    indexRange = upperI - lowerI  # determine length of substring
    if indexRange < minLen:
        minLen = indexRange
        lower = lowerI
        upper = upperI

print(document[lower:upper])

Related

Longest Common Suffix from the listed words

I'm trying to iterate backward through the words to find the common suffix of a given list of words, say:
LongestCommonSuffix(['celebration', 'opinion', 'decision', 'revision'])
to get "ion" as output.
The code below gives me the longest common prefix, but I need to change the loop so that it does the same from the end of each word in the list, without using binary manipulation, just looping:
def fun(strs):
    res = ''
    for i in range(len(strs[0])):
        for s in strs:
            if i == len(s) or s[i] != strs[0][i]:
                return res
        res += strs[0][i]
    return res
You could create a variable common_suffix, make that equal to the first word, and then for each next word check if that word ends with that common suffix. If it doesn't, the common suffix is invalid, so try to shorten it until it does.
In code:
def LongestCommonSuffix(strs):
    common_suffix = strs[0]
    for next_word in strs[1:]:
        # drop leading characters until the suffix fits this word too
        while not next_word.endswith(common_suffix):
            common_suffix = common_suffix[1:]
    return common_suffix

print(LongestCommonSuffix(['celebration', 'opinion', 'decision', 'revision']))
This prints ion.
Note that this code works because common_suffix would be an empty string if all words are completely different. And each word ends with an empty string (for example 'test'.endswith('') is True), so the while loop will always quit.
You could add additional logic to break from the loop earlier, but if performance is not critical, I would stick with the simple code :-)
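If you would rather keep the shape of the original prefix loop, a possible adaptation is to index from the end with negative indices. This is only a sketch; the name longest_common_suffix is made up for illustration, and returning an empty string for an empty list is an assumption:

```python
def longest_common_suffix(strs):
    if not strs:
        return ''
    first = strs[0]
    res = ''
    # i counts characters from the end: 1 is the last character, 2 the one before it, ...
    for i in range(1, len(first) + 1):
        for s in strs:
            # stop when a word is too short or its i-th character from the end differs
            if i > len(s) or s[-i] != first[-i]:
                return res
        res = first[-i] + res  # prepend, because we are walking backwards
    return res

print(longest_common_suffix(['celebration', 'opinion', 'decision', 'revision']))  # ion
```

The only changes from the prefix version are the negative index and prepending to res instead of appending.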

How do i make the program print specific letters in this specific format i give to it?

I need to write a program which, for example, prints "aaabb" when given the input 3[a]2[b], or "abababcc" when given 3[ab]2[c] (basically it prints each bracketed string the given number of times, in the given order). I tried using a for loop to iterate over the input and detect the "[" characters so it would know what to print repeatedly, but I don't know how to make it also understand where that string ends.
This is as far as I got, which probably isn't too useful:
string = input()
string = string[::-1]
bulundu = 6
lst = []
for i in string:
    if i != "]":
        if i != "[":
            lst.append(i)
    if i == "[":
        break
The approach I took is to remove the brackets, split the items into a list, then walk the list, and if the item is a number, add that many repeats of the next item to the result for output:
import re
data = "3[a]2[b]"
# Remove brackets and convert to a list
data = re.sub(r'[\[\]]', ' ', data).split()
result = []
for i, item in enumerate(data):
    # If item is a number, print that many of the next item
    if item.isdigit():
        result.append(data[i+1] * int(item))
print(''.join(result))
# aaabb
A different approach, inspired by Subbu's use of re.findall. This approach finds all 'pairs' of numbers and letters using match groups, then multiplies them to produce the required text:
import re
data = "3[a]2[b]"
matches = re.findall(r'(\d+)\[([a-zA-Z]+)\]', data)
# [('3', 'a'), ('2', 'b')]
for x in matches:
    print(x[1] * int(x[0]), end='')
# aaabb
Lengthy and documented version using no regex, only simple string and list manipulation:
first split the input into parts that are numbers and texts,
then recombine them.
I opted to document with inline comments.
This could be done like so:
# testcases are tuples of input and correct result
testcases = [ ("3[a]2[b]","aaabb"),
              ("3[ab]2[c]","abababcc"),
              ("5[12]6[c]","1212121212cccccc"),
              ("22[a]","a"*22)]

# now we use our algo for all those testcases
for inp,res in testcases:
    split_inp = []   # list that takes the split values of the input
    num = 0          # accumulator variable for more-than-1-digit numbers
    in_text = False  # bool that tells us if we are currently collecting letters

    # go over all letters : O(n)
    for c in inp:
        # when a [ is reached our num is complete and we need to store it
        # we collect all further letters until next ] in a list that we
        # add at the end of your split_inp
        if c == "[":
            split_inp.append(num)   # add the completed number
            num = 0                 # and reset it to 0
            in_text = True          # now in text
            split_inp.append([])    # add a list to collect letters
        # done collecting letters
        elif c == "]":
            in_text = False         # no longer collecting, convert letters
            split_inp[-1] = ''.join(split_inp[-1])  # to text
        # between [ and ] ... simply add letter to list at end
        elif in_text:
            split_inp[-1].append(c)  # add letter
        # currently collecting numbers
        else:
            num *= 10       # increase current number by factor 10
            num += int(c)   # add newest number

    print(repr(inp), split_inp, sep="\n")  # debugging output for parsing part

    # now we need to build the string from our parsed data
    amount = 0
    result = []  # intermediate list to join ['aaa','bb']
    # iterate the list, if int remember it, if text, build composite
    for part in split_inp:
        if isinstance(part, int):
            amount = part
        else:
            result.append(part*amount)
    # join the parts
    result = ''.join(result)

    # check if all worked out
    if result == res:
        print("CORRECT: ", result + "\n")
    else:
        print(f"INCORRECT: should be '{res}' but is '{result}'\n")
Result:
'3[a]2[b]'
[3, 'a', 2, 'b']
CORRECT: aaabb
'3[ab]2[c]'
[3, 'ab', 2, 'c']
CORRECT: abababcc
'5[12]6[c]'
[5, '12', 6, 'c']
CORRECT: 1212121212cccccc
'22[a]'
[22, 'a']
CORRECT: aaaaaaaaaaaaaaaaaaaaaa
This will also handle cases like '5[12]', which some of the other solutions won't.
You can capture both the number of repetitions n and the pattern to repeat v in one go using the pattern below. It matches any sequence of digits (the first group we need to capture, which is why \d+ is between parentheses), followed by a [, followed by anything (the second group of interest, hence it is also between parentheses), which is then followed by a ].
findall will find all these matches in the passed line; the first group of each match, the number, is then cast to an int and used as a multiplier for the string pattern. The resulting strings v * int(n) are then joined with an empty string. Malformed patterns may throw exceptions or return nothing.
Anyway, in code:
import re
pattern = re.compile(r"(\d+)\[(.*?)\]")

def func(x):
    return "".join([v * int(n) for n, v in pattern.findall(x)])

print(func("3[a]2[b]"))
print(func("3[ab]2[c]"))
OUTPUT
aaabb
abababcc
FOLLOW UP
Another solution which achieves the same result, without using regular expression (ok, not nice at all, I get it...):
def func(s): return "".join([int(x[0])*x[1] for x in map(lambda x:x.split("["), s.split("]")) if len(x) == 2])
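As a variation on the regex idea, re.sub with a replacement function expands each count[text] group in place and leaves any text outside the brackets untouched. This is a sketch under the assumption that groups are never nested; the name expand_groups is hypothetical:

```python
import re

# one or more digits, then a bracketed payload (non-greedy, so it stops at the first "]")
GROUP = re.compile(r"(\d+)\[(.*?)\]")

def expand_groups(text):
    # m.group(1) is the repeat count, m.group(2) is the string to repeat
    return GROUP.sub(lambda m: m.group(2) * int(m.group(1)), text)

print(expand_groups("3[a]2[b]"))   # aaabb
print(expand_groups("3[ab]2[c]"))  # abababcc
```

Unlike the findall versions, an input such as "x3[ab]y" keeps the surrounding x and y.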
I am not much more than a beginner and looking at the other answers, I thought understanding regex might be a challenge for a new contributor such as yourself since I myself haven't really dealt with regex.
The beginner friendly way to do this might be to loop through the input string and use string functions like isnumeric() and isalpha()
data = "3[a]2[b]"
chars = []
nums = []
substrings = []
for i, char in enumerate(data):
    if char.isnumeric():
        nums.append(char)
    if char.isalpha():
        chars.append(char)
for i, char in enumerate(chars):
    substrings.append(char * int(nums[i]))
string = "".join(substrings)
print(string)
OUTPUT:
aaabb
And on trying different values for data:
data = "0[a]2[b]3[p]"
OUTPUT bbppp
data = "1[a]1[a]2[a]"
OUTPUT aaaa
NOTE: In case you're not familiar with the above functions, they are string methods, which are fairly self-explanatory. They are used as <your_string_here>.isalpha(), which returns True if and only if the string is non-empty and consists only of alphabetic characters (whitespace, numerics, and symbols return False).
And, similarly for isnumeric()
For example,
"]".isnumeric() and "]".isalpha() return False
"a".isalpha() returns True
Note that this simple approach assumes single-digit counts and a single letter between each pair of brackets.
If you need any clarification on a function used, please do not hesitate to leave a comment.

How do I check if the next item in a string is the alphabetical successor of the one before? + Inverse

I'm trying to compress a string in a way that any sequence of letters in strict alphabetical order is swapped with the first letter plus the length of the sequence.
For example, the string "abcdefxylmno", would become: "a6xyl4"
Single letters that aren't in order with the one before or after just stay the way they are.
How do I check that two letters are successors (a,b) and not simply in alphabetical order (a,c)? And how do I keep iterating on the string until I find a letter that doesn't meet this requirement?
I'm also trying to do this in a way that makes it easier to write an inverse function (that given the result string gives me back the original one).
EDIT :
I've managed to get the function working, thanks to your suggestion of using the alphabet string as comparison; now I'm very much stuck on the inverse function: given "a6xyl4" expand it back into "abcdefxylmno".
After quite some time I managed to split the string every time there's a number and I made a function that expands a 2 char string, but it fails to work when I use it on a longer string:
from string import ascii_lowercase as abc

def subString(start,n):
    L=[]
    ind = abc.index(start)
    newAbc = abc[ind:]
    for i in range(len(newAbc)):
        while i < n:
            L.append(newAbc[i])
            i+=1
        res = ''.join(L)
        return res

def unpack(S):
    for i in range(len(S)-1):
        if S[i] in abc and S[i+1] not in abc:
            lett = str(S[i])
            num = int(S[i+1])
            return subString(lett,num)

def separate(S):
    lst = []
    for i in S:
        lst.append(i)
    for el in lst:
        if el.isnumeric():
            ind = lst.index(el)
            lst.insert(ind+1,"-")
    a = ''.join(lst)
    L = a.split("-")
    if S[-1].isnumeric():
        L.remove(L[-1])
        return L
    else:
        return L

def inverse(S):
    L = separate(S)
    for i in L:
        return unpack(i)
Each of these functions work singularly, but inverse(S) doesn't output anything. What's the mistake?
You can use the ord() function, which returns the integer Unicode code point of a character. Consecutive letters of the alphabet differ by 1. With that, you can implement a simple function:
def is_successor(a, b):
    # check for marginal cases if we don't ensure
    # input restriction somewhere else
    if ord(a) not in range(ord('a'), ord('z') + 1) and ord(a) not in range(ord('A'), ord('Z') + 1):
        return False
    if ord(b) not in range(ord('a'), ord('z') + 1) and ord(b) not in range(ord('A'), ord('Z') + 1):
        return False
    # returns True if they are sequential
    return (ord(b) - ord(a)) == 1
You can use the chr(int) function for your reversing stage, as it returns the character whose Unicode code point is the integer given as argument.
This builds on the idea that acceptable subsequences will be substrings of the ABC:
from string import ascii_lowercase as abc  # 'abcdefg...'
text = 'abcdefxylmno'
stack = []
cache = ''
# collect subsequences
for char in text:
    if cache + char in abc:
        cache += char
    else:
        stack.append(cache)
        cache = char
# if present, append the last sequence
if cache:
    stack.append(cache)
# stack is now ['abcdef', 'xy', 'lmno']
# Build the final string 'a6x2l4'
result = ''.join(f'{s[0]}{len(s)}' if len(s) > 1 else s for s in stack)
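For the inverse direction asked about in the question, one possible sketch reads the compressed string as a series of "letter plus optional run length" tokens and slices the alphabet for each one. The name expand is hypothetical, and the sketch assumes lowercase input only; a run that would extend past 'z' is silently cut off at 'z':

```python
import re
from string import ascii_lowercase as abc

def expand(packed):
    parts = []
    # each token is one lowercase letter optionally followed by a run length
    for letter, count in re.findall(r'([a-z])(\d*)', packed):
        start = abc.index(letter)
        length = int(count) if count else 1  # a bare letter stands for itself
        parts.append(abc[start:start + length])
    return ''.join(parts)

print(expand('a6x2l4'))  # abcdefxylmno
print(expand('a6xyl4'))  # abcdefxylmno
```

Because a bare letter counts as a run of length 1, both 'a6x2l4' and the question's 'a6xyl4' expand to the same string.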

Python algorithm in list

Given a list of N strings, implement an algorithm that outputs the largest n such that the first n characters of every string are the same (i.e., print how many leading characters all the given strings have in common).
My code:
def solution(a):
    import numpy as np
    for index in range(0,a):
        if np.equal(a[index], a[index-1]) == True:
            i += 1
            return solution
        else:
            break
    return 0
# Test code
print(solution(['abcd', 'abce', 'abchg', 'abcfwqw', 'abcdfg'])) # 3
print(solution(['abcd', 'gbce', 'abchg', 'abcfwqw', 'abcdfg'])) # 0
Some comments on your code:
There is no need to use numpy if it is only used for string comparison
i is undefined when i += 1 is about to be executed, so that will not run. There is no actual use of i in your code.
index-1 is -1 in the first iteration of the loop, which in Python refers to the last element of the list rather than to a preceding one
solution is your function, so return solution will return a function object. You need to return a number.
The if condition is only comparing complete words, so there is no attempt to only compare a prefix.
A possible way to do this, is to be optimistic and assume that the first word is a prefix of all other words. Then as you detect a word where this is not the case, reduce the size of the prefix until it is again a valid prefix of that word. Continue like that until all words have been processed. If at any moment you find the prefix is reduced to an empty string, you can actually exit and return 0, as it cannot get any less than that.
Here is how you could code it:
def solution(words):
    prefix = words[0]  # if there was only one word, this would be the prefix
    for word in words:
        while not word.startswith(prefix):
            prefix = prefix[:-1]  # reduce the size of the prefix
            if not prefix:  # is there any sense in continuing?
                return 0  # ...: no.
    return len(prefix)
The description is somewhat convoluted but it does seem that you're looking for the length of the longest common prefix.
You can get the length of the common prefix between two strings using the next() function. It can find the first index where characters differ which will correspond to the length of the common prefix:
def maxCommon(S):
    cp = S[0] if S else ""      # first string is common prefix (cp)
    for s in S[1:]:             # go through other strings (s)
        # index of the first mismatch, or the shorter length if there is none
        cs = next((i for i,(a,b) in enumerate(zip(s,cp)) if a!=b), min(len(s),len(cp)))
        cp = cp[:cs]            # truncate to new common size (cs)
    return len(cp)              # return length of common prefix
output:
print(maxCommon(['abcd', 'abce', 'abchg', 'abcfwqw', 'abcdfg'])) # 3
print(maxCommon(['abcd', 'gbce', 'abchg', 'abcfwqw', 'abcdfg'])) # 0
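The standard library also covers this case: os.path.commonprefix compares its arguments character by character (despite living in os.path, it is not path-aware), so it can be used on ordinary strings. A short sketch, with common_prefix_length as a made-up wrapper name:

```python
import os.path

def common_prefix_length(words):
    # commonprefix returns '' for an empty list, so the length is 0 in that case
    return len(os.path.commonprefix(words))

print(common_prefix_length(['abcd', 'abce', 'abchg', 'abcfwqw', 'abcdfg']))  # 3
print(common_prefix_length(['abcd', 'gbce', 'abchg', 'abcfwqw', 'abcdfg']))  # 0
```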

Python 2.7 finding if some anagram of one string is a substring of another [duplicate]

EDIT: Posting my final solution because this was a very helpful thread and I want to add some finality to it. Using the advice from both answers below I was able to craft a solution. I added a helper function in which I defined an anagram. Here is my final solution:
def anagram(s1, s2):
    s1 = list(s1)
    s2 = list(s2)
    s1.sort()
    s2.sort()
    return s1 == s2

def Question1(t, s):
    t_len = len(t)
    s_len = len(s)
    t_sort = sorted(t)
    for start in range(s_len - t_len + 1):
        if anagram(s[start: start+t_len], t):
            return True
    return False

print Question1("app", "paple")
I am working on some practice technical interview questions and I'm stuck on the following question:
Find whether an anagram of string t is a substring of s
I have worked out the following two variants of my code, and a solution to this I believe lies in a cross between the two. The problem I am having is that the first code always prints False., regardless of input. The second variation works to some degree. However, it cannot sort individual letters. For example t=jks s=jksd will print True! however t=kjs s=jksd will print False.
def Question1():
    # Define strings as raw user input.
    t = raw_input("Enter phrase t:")
    s = raw_input("Enter phrase s:")
    # Use the sorted function to find if t in s
    if sorted(t.lower()) in sorted(s.lower()):
        print("True!")
    else:
        print("False.")

Question1()
Working variant:
def Question1():
    # Define strings as raw user input.
    t = raw_input("Enter phrase t:")
    s = raw_input("Enter phrase s:")
    # use a loop to find if t is in s.
    if t.lower() in s.lower():
        print("True!")
    else:
        print("False.")

Question1()
I believe there is a solution that lies between these two, but I'm having trouble figuring out how to use sorted in this situation.
You're very much on the right track. First, please note that there is no loop in your second attempt.
The problem is that you can't simply sort all of s and then look for sorted(t) in that. Rather, you have to consider each len(t) sized substring of s, and check that against the sorted t. Consider the trivial example:
t = "abd"
s = "abdc"
s trivially contains t. However, when you sort them, you get the strings abd and abcd, and the in comparison fails. The sorting gets other letters in the way.
Instead, you need to step through s in chunks the size of t.
t_len = len(t)
s_len = len(s)
t_sort = sorted(t)
found = False
for start in range(s_len - t_len + 1):
    chunk = s[start:start+t_len]
    if t_sort == sorted(chunk):
        # SUCCESS!!
        found = True
        break
I think your problem lies in the "substring" requirement. If you sort, you destroy order. Which means that while you can determine that an anagram of string1 is an anagram of a substring of string2, until you actually deal with string2 in order, you won't have a correct answer.
I'd suggest iterating over all the substrings of length len(s1) in s2. This is a straightforward for loop. Once you have the substrings, you can compare them (sorted vs sorted) with s1 to decide if there is any rearrangement of s1 that yields a contiguous substring of s2.
Viz:
s1 = "jks"
s2 = "aksjd"
print('s1=',s1, ' s2=', s2)
for offset in range(len(s2) - len(s1) + 1):
    ss2 = s2[offset:offset+len(s1)]
    if sorted(ss2) == sorted(s1):
        print('{} is an anagram of {} at offset {} in {}'.format(ss2, s1, offset, s2))
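Sorting every chunk repeats a lot of work. A common refinement is to keep letter counts for the current window and update them as the window slides one character at a time. The following is a sketch of that idea, not part of the answers above; the name has_anagram_substring is hypothetical and the comparison is case-sensitive:

```python
from collections import Counter

def has_anagram_substring(t, s):
    n = len(t)
    if n > len(s):
        return False
    target = Counter(t)
    window = Counter(s[:n])  # letter counts of the first len(t) characters
    if window == target:
        return True
    for i in range(n, len(s)):
        window[s[i]] += 1        # character entering the window
        old = s[i - n]           # character leaving the window
        window[old] -= 1
        if window[old] == 0:
            del window[old]      # drop zero counts so the comparison stays exact
        if window == target:
            return True
    return False

print(has_anagram_substring("app", "paple"))  # True
print(has_anagram_substring("kjs", "jksd"))   # True
```

Each step touches only two counters, so the scan no longer re-sorts a chunk at every offset.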
