Concatenate strings with a common substring in python? - python

Input
ONESTRING
STRINGTHREE
THREEFOUR
FOURFIVE
Output
ONESTRINGTHREEFOURFIVE
How can I do this in Python?
I thought I would first concatenate two of the strings and then loop over the rest, but this gives an error and I don't know why. Can anyone help?

WARNING
This solution is for a list of strings in arbitrary order. This means that EVERY possible ordered pair of words must be checked for a common substring, that is n * (n - 1) pairs for n strings, so it can become very slow (and the collected links can take a lot of memory) if your list of strings is large.
Solution 1, allows for words with no common substrings to be concatenated if desired
import itertools
from typing import Set, Tuple, Dict, List

def get_match(pair: Tuple[str, str], min_overlap: int = 3) -> str:
    a, b = pair
    for i in range(min_overlap, min(map(len, pair)) + 1):
        if a[-i:] == b[:i]:
            return b[:i]
    return ""

def links_joiners(strings: List[str]) -> Tuple[Dict[str, str], Set[str]]:
    links, joiners = dict(), set()
    for pair in itertools.permutations(strings, 2):
        if (match := get_match(pair)):
            joiners.add(match)
            links.update((pair,))
    return links, joiners

def get_ordered_strings(strings: List[str], links: Dict[str, str]) -> List[str]:
    def find_order(node: str) -> int:
        return 0 if node not in links else 1 + find_order(links[node])
    return sorted(strings, key=find_order, reverse=True)

def join_strings(strings: List[str], joiners: Set[str]) -> str:
    s = "".join(strings)
    for j in joiners:
        s = s.replace(j, "", 1)
    return s
Usage:
strings = ["THREEFOUR",
           "ONESTRING",
           "STRINGTHREE",
           "FOURFIVE"]
links, joiners = links_joiners(strings)
ordered_strings = get_ordered_strings(strings, links)
join_strings(ordered_strings, joiners)
Output:
'ONESTRINGTHREEFOURFIVE'
Explanation
First, itertools is part of the standard library; no need to install any third party packages for this solution.
Now, the links_joiners() function will take a list of strings and find all the pairs of strings with matching suffix-prefix pairs, putting those pairs into a links dictionary which looks like this:
{'ONESTRING': 'STRINGTHREE',
'THREEFOUR': 'FOURFIVE',
'STRINGTHREE': 'THREEFOUR'}
Notice these are not in order. This is because for an arbitrary list of strings we can't be sure the strings were in order in the first place, so we have to iterate over every permutation of strings exhaustively in order to ensure that we've covered all pairings.
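To make the suffix-prefix test concrete, here is a small self-contained sketch that repeats get_match() from above and calls it on one pair in both directions. The sample pair comes from the question; the variable names forward and backward are only illustrative.

```python
from typing import Tuple

def get_match(pair: Tuple[str, str], min_overlap: int = 3) -> str:
    """Return the shortest overlap (>= min_overlap) between the end of a and the start of b."""
    a, b = pair
    for i in range(min_overlap, min(map(len, pair)) + 1):
        if a[-i:] == b[:i]:
            return b[:i]
    return ""

# The suffix of the first string equals the prefix of the second one.
forward = get_match(("ONESTRING", "STRINGTHREE"))
# There is no overlap in the opposite direction, so the result is empty.
backward = get_match(("STRINGTHREE", "ONESTRING"))
print(repr(forward), repr(backward))
```

Because the order inside a pair matters, links_joiners() has to look at permutations rather than combinations.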
Now, notice there's also a function called get_ordered_strings() with an inner function find_order(). The function get_ordered_strings() forms what is known as a closure, but that's not particularly important to understand right now. The find_order() function is recursive, here's how it works:
1. Given a node, if the node is not a key in the links dictionary we've reached the base case and return zero. Otherwise, move to step 2.
2. If node is present, add one to a recursive call to find_order on that new node.
So given a key, say "ONESTRING", the find_order() function will look at the value associated with that key, and if that value is also a key in the dictionary, look at its value, and so on until it reaches a value that isn't a key in the dictionary.
Here's the code for find_order() again:
def find_order(node: str) -> int:
    if node not in links:
        return 0
    return 1 + find_order(links[node])
And here's what links looks like after calling links_joiners():
{'ONESTRING': 'STRINGTHREE',
'THREEFOUR': 'FOURFIVE',
'STRINGTHREE': 'THREEFOUR'}
Now trace an example call to find_order("ONESTRING"):
find_order("ONESTRING") = 1 + find_order("STRINGTHREE")
                        = 1 + (1 + find_order("THREEFOUR"))
                        = 1 + (1 + (1 + find_order("FOURFIVE")))  # Base case
                        = 1 + (1 + (1 + 0))
                        = 3
What this function is doing is finding how many pairwise connections can be made from a given starting string. Another way to think of it is that links is actually representing adjacencies in a (special case of a) DAG.
Essentially what we want to do is take the nodes THREEFOUR, ONESTRING, STRINGTHREE, FOURFIVE and construct the longest possible singly-linked list (a type of a DAG) from them:
ONESTRING -> STRINGTHREE -> THREEFOUR -> FOURFIVE
By passing a given "node" of this graph to find_order(), it will follow the graph all the way to the end. So ONESTRING travels a distance of 3 to get to the end, whereas THREEFOUR travels only a distance of 1.
Node: ONESTRING -> STRINGTHREE -> THREEFOUR -> FOURFIVE
Dist:     3            2             1            0
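The distances in this table can be reproduced with a short standalone sketch. It hard-codes the links dictionary shown earlier instead of calling links_joiners(), and it assumes the links contain no cycle (a cycle would make find_order() recurse until Python raises RecursionError).

```python
links = {
    "ONESTRING": "STRINGTHREE",
    "THREEFOUR": "FOURFIVE",
    "STRINGTHREE": "THREEFOUR",
}

def find_order(node: str) -> int:
    # Number of links that can be followed from node before the chain ends.
    return 0 if node not in links else 1 + find_order(links[node])

strings = ["THREEFOUR", "ONESTRING", "STRINGTHREE", "FOURFIVE"]
distances = {s: find_order(s) for s in strings}
ordered = sorted(strings, key=find_order, reverse=True)
print(distances)
print(ordered)
```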
Now, by passing find_order to the built-in sorted() function, we can tell Python how we want our strings to be sorted, which, in this case is in reverse order, by distance. The result is this:
>>> strings = ['THREEFOUR', 'ONESTRING', 'STRINGTHREE', 'FOURFIVE']
>>> ordered_strings = get_ordered_strings(strings, links)
>>> ordered_strings
['ONESTRING', 'STRINGTHREE', 'THREEFOUR', 'FOURFIVE']
Now, by joining each string by their common substrings, we are constructing the longest possible string where the constraint is that each pair of strings must have a common substring in the correct position. In other words, ordered_strings represents the longest path in the DAG. Or more accurately, we've designed a DAG which will have the longest path, by using all the provided nodes, and putting them in the correct order.
From here, we join each string:
>>> s = "".join(ordered_strings)
>>> s
'ONESTRINGSTRINGTHREETHREEFOURFOURFIVE'
Then we remove one instance of each of the joiners:
for j in joiners:
    s = s.replace(j, "", 1)
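As a quick check of this last step, the sketch below applies the same loop to the joined string from the example, with the three joiners written out by hand. Note that str.replace() with a count of 1 removes the first occurrence anywhere in the string, so this relies on the assumption that the first occurrence of each joiner is one of the duplicated ones, which holds for this input but is not guaranteed in general.

```python
joiners = {"STRING", "THREE", "FOUR"}
s = "ONESTRINGSTRINGTHREETHREEFOURFOURFIVE"

for j in joiners:
    # count=1: drop only the first occurrence of this joiner
    s = s.replace(j, "", 1)

print(s)
```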
Solution 2, only concatenates overlapping strings
This solution reuses join_strings() and get_match() from above. It also uses the walrus operator := (Python 3.8+) but can easily be written without it.
def join_overlapping_pairs(strings: List[str]) -> str:
    if len(strings) == 1:
        return strings.pop()
    matches = set()
    for pair in itertools.permutations(strings, 2):
        if (match := get_match(pair)):
            matches.add(join_strings(pair, (match,)))
    return join_overlapping_pairs(matches)
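For Python versions before 3.8, a self-contained variant without the walrus operator might look like the sketch below. It repeats get_match() and join_strings() so that it runs on its own. The ValueError guard is an addition that is not part of the original answer: without it, an input with no overlapping pair would recurse until Python raises RecursionError.

```python
import itertools
from typing import Iterable, Tuple

def get_match(pair: Tuple[str, str], min_overlap: int = 3) -> str:
    a, b = pair
    for i in range(min_overlap, min(map(len, pair)) + 1):
        if a[-i:] == b[:i]:
            return b[:i]
    return ""

def join_strings(strings: Iterable[str], joiners: Iterable[str]) -> str:
    s = "".join(strings)
    for j in joiners:
        s = s.replace(j, "", 1)
    return s

def join_overlapping_pairs(strings: Iterable[str]) -> str:
    strings = list(strings)
    if len(strings) == 1:
        return strings[0]
    matches = set()
    for pair in itertools.permutations(strings, 2):
        match = get_match(pair)  # plain assignment instead of :=
        if match:
            matches.add(join_strings(pair, (match,)))
    if not matches:
        # Added guard (assumption): stop instead of recursing forever.
        raise ValueError("no overlapping pair found")
    return join_overlapping_pairs(matches)

result = join_overlapping_pairs(["THREEFOUR", "ONESTRING", "STRINGTHREE", "FOURFIVE"])
print(result)
```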

Here is a generic solution based on your example. The sequence must already be in order, otherwise it will not work.
from functools import reduce
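s = [
    "ONESTRING",
    "STRINGTHREE",
    "THREEFOUR",
    "FOURFIVE",
]
def join_f(first, add):
    i = 1
    # grow the prefix of add while it still occurs in first
    # (the length check stops the loop when all of add occurs in first)
    while i <= len(add) and add[:i] in first:
        i += 1
    return first + add[i-1:]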
print(reduce(join_f, s))

You can use the difflib library; sample code for your reference:
from difflib import SequenceMatcher
str1 = "ONESTRING"
str2 = "STRINGTHREE"
match = SequenceMatcher(None, str1, str2).find_longest_match(0, len(str1), 0, len(str2))
# match.a == 3, match.b == 0, match.size == 6
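One way the match could then be used to merge the two strings is sketched below. The helper name merge_on_overlap and the fallback to plain concatenation are assumptions, not part of the answer. The sketch only merges when the longest common block is a suffix of the left string and a prefix of the right one; a shorter suffix-prefix overlap that is not the longest common block would be missed.

```python
from difflib import SequenceMatcher

def merge_on_overlap(left: str, right: str) -> str:
    """Hypothetical helper: merge two strings on their longest common block."""
    m = SequenceMatcher(None, left, right).find_longest_match(
        0, len(left), 0, len(right)
    )
    # Only merge when the block ends `left` and starts `right`.
    if m.size and m.a + m.size == len(left) and m.b == 0:
        return left + right[m.size:]
    return left + right  # assumption: fall back to plain concatenation

merged = merge_on_overlap("ONESTRING", "STRINGTHREE")
unrelated = merge_on_overlap("ABC", "XYZ")
print(merged, unrelated)
```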

Assuming the words are in the connecting order:
words = ['ONESTRING',
'STRINGTHREE',
'THREEFOUR',
'FOURFIVE']
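S = words[0]
for w in words[1:]:
    # i = number of trailing characters of w left over after its longest overlap
    # with the end of S (defaults to len(w), i.e. the whole word, if no overlap)
    i = next((i for i in range(1,len(w)) if S.endswith(w[:-i])), len(w))
    S += w[-i:]
print(S)
'ONESTRINGTHREEFOURFIVE'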
If the words are not in connecting order, a recursive approach can do it:
def combine(words,S=""):
    if not words: return S
    result = ""                                    # result is shortest combo
    for i,w in enumerate(words):
        # p = characters of w left after the max overlap (end/start); 0 = no overlap
        p = next((i for i in range(1,len(w)) if S.endswith(w[:-i])),0)
        if result and not p: continue              # check if can combine
        combo = combine(words[:i]+words[i+1:],S+w[-p:])  # w[-0:] is the whole word
        if not result or len(combo)<len(result):   # keep if shortest
            result = combo or result
    return result
Output:
words = ['ONESTRING',
'FOURFIVE',
'THREEFOUR',
'STRINGTHREE'
]
result = combine(words)
print(result)
'ONESTRINGTHREEFOURFIVE'

Related

How do I check if the next item in a string is the alphabetical successor of the one before? + Inverse

I'm trying to compress a string in a way that any sequence of letters in strict alphabetical order is swapped with the first letter plus the length of the sequence.
For example, the string "abcdefxylmno", would become: "a6xyl4"
Single letters that aren't in order with the one before or after just stay the way they are.
How do I check that two letters are successors (a,b) and not simply in alphabetical order (a,c)? And how do I keep iterating on the string until I find a letter that doesn't meet this requirement?
I'm also trying to do this in a way that makes it easier to write an inverse function (that given the result string gives me back the original one).
EDIT :
I've managed to get the function working, thanks to your suggestion of using the alphabet string as comparison; now I'm very much stuck on the inverse function: given "a6xyl4" expand it back into "abcdefxylmno".
After quite some time I managed to split the string every time there's a number and I made a function that expands a 2 char string, but it fails to work when I use it on a longer string:
from string import ascii_lowercase as abc
def subString(start,n):
    L=[]
    ind = abc.index(start)
    newAbc = abc[ind:]
    for i in range(len(newAbc)):
        while i < n:
            L.append(newAbc[i])
            i+=1
        res = ''.join(L)
        return res
def unpack(S):
    for i in range(len(S)-1):
        if S[i] in abc and S[i+1] not in abc:
            lett = str(S[i])
            num = int(S[i+1])
            return subString(lett,num)
def separate(S):
    lst = []
    for i in S:
        lst.append(i)
    for el in lst:
        if el.isnumeric():
            ind = lst.index(el)
            lst.insert(ind+1,"-")
    a = ''.join(lst)
    L = a.split("-")
    if S[-1].isnumeric():
        L.remove(L[-1])
        return L
    else:
        return L
def inverse(S):
    L = separate(S)
    for i in L:
        return unpack(i)
Each of these functions works on its own, but inverse(S) doesn't output anything. What's the mistake?
You can use the ord() function, which returns an integer representing the Unicode code point of a character. Sequential letters in alphabetical order differ by 1. With that, you can implement a simple function:
def is_successor(a,b):
    # check for marginal cases if we dont ensure
    # input restriction somewhere else
    # (range() excludes its end, so add 1 to include 'z' and 'Z')
    if ord(a) not in range(ord('a'), ord('z') + 1) and ord(a) not in range(ord('A'), ord('Z') + 1):
        return False
    if ord(b) not in range(ord('a'), ord('z') + 1) and ord(b) not in range(ord('A'), ord('Z') + 1):
        return False
    # returns true if they are sequential
    return ((ord(b) - ord(a)) == 1)
You can use the chr() function for your reversing stage, as it returns the character whose Unicode code point is the integer given as its argument.
This builds on the idea that acceptable subsequences will be substrings of the ABC:
from string import ascii_lowercase as abc # 'abcdefg...'
text = 'abcdefxylmno'
stack = []
cache = ''
# collect subsequences
for char in text:
    if cache + char in abc:
        cache += char
    else:
        stack.append(cache)
        cache = char
# if present, append the last sequence
if cache:
    stack.append(cache)
# stack is now ['abcdef', 'xy', 'lmno']
# Build the final string 'a6x2l4'
result = ''.join(f'{s[0]}{len(s)}' if len(s) > 1 else s for s in stack)
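The question's edit also asks for the inverse. A minimal sketch is shown below, assuming lowercase ASCII input and that every number directly follows the letter it belongs to; the function name expand is made up for this example. It uses re.sub() with a replacement function, so letters without a trailing number are left as they are.

```python
import re
from string import ascii_lowercase as abc

def expand(packed: str) -> str:
    """Hypothetical inverse: turn 'a6' back into 'abcdef', leave bare letters alone."""
    def unroll(m: re.Match) -> str:
        start = abc.index(m.group(1))
        # Slice the run out of the alphabet (truncated if it would pass 'z').
        return abc[start:start + int(m.group(2))]
    return re.sub(r"([a-z])([0-9]+)", unroll, packed)

print(expand("a6x2l4"))
print(expand("a6xyl4"))
```

Both the form produced by the code above ('a6x2l4') and the form given in the question ('a6xyl4') expand to the same original string.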

Python algorithm in list

Given a list of N strings, implement an algorithm that outputs the largest n such that the first n characters of every string are the same (i.e., print how many characters at the front of all the given strings match).
My code:
def solution(a):
    import numpy as np
    for index in range(0,a):
        if np.equal(a[index], a[index-1]) == True:
            i += 1
            return solution
        else:
            break
    return 0
# Test code
print(solution(['abcd', 'abce', 'abchg', 'abcfwqw', 'abcdfg'])) # 3
print(solution(['abcd', 'gbce', 'abchg', 'abcfwqw', 'abcdfg'])) # 0
Some comments on your code:
There is no need to use numpy if it is only used for string comparison
i is undefined when i += 1 is about to be executed, so that will not run. There is no actual use of i in your code.
index-1 is -1 in the first iteration of the loop, which in Python refers to the last element of the list rather than to a preceding one
solution is your function, so return solution will return a function object. You need to return a number.
The if condition is only comparing complete words, so there is no attempt to only compare a prefix.
A possible way to do this, is to be optimistic and assume that the first word is a prefix of all other words. Then as you detect a word where this is not the case, reduce the size of the prefix until it is again a valid prefix of that word. Continue like that until all words have been processed. If at any moment you find the prefix is reduced to an empty string, you can actually exit and return 0, as it cannot get any less than that.
Here is how you could code it:
def solution(words):
    prefix = words[0]  # if there was only one word, this would be the prefix
    for word in words:
        while not word.startswith(prefix):
            prefix = prefix[:-1]  # reduce the size of the prefix
            if not prefix:  # is there any sense in continuing?
                return 0  # ...: no.
    return len(prefix)
The description is somewhat convoluted but it does seem that you're looking for the length of the longest common prefix.
You can get the length of the common prefix between two strings using the next() function. It can find the first index where characters differ which will correspond to the length of the common prefix:
def maxCommon(S):
    cp = S[0] if S else ""              # first string is common prefix (cp)
    for s in S[1:]:                     # go through other strings (s)
        # index of first difference; if none, the shorter of the two lengths
        cs = next((i for i,(a,b) in enumerate(zip(s,cp)) if a!=b),min(len(s),len(cp)))
        cp = cp[:cs]                    # truncate to new common size (cs)
    return len(cp)                      # return length of common prefix
output:
print(maxCommon(['abcd', 'abce', 'abchg', 'abcfwqw', 'abcdfg'])) # 3
print(maxCommon(['abcd', 'gbce', 'abchg', 'abcfwqw', 'abcdfg'])) # 0

How to search an infinite list of words in sorted order for an index corresponding to a word as input

I am stumped by a hard problem involving a stream of words I am dealing with: I need to do better than linear time, O(n).
Search an infinite list of words in sorted order for an index corresponding to a word as input
given an infinite list ["apple", "banana", "cat", "dog", ...]
we have a class A where A.get(2) # => "cat"
write a function to return the index for the word given as input to the function like so:
A.get_index("cat") # => 2
You can use A.get(), but not Python's .index() method for sequences.
IIUC, you can do it in O(log n) with a modification of binary search.
import bisect
def find(endless_haystack, needle):
    if endless_haystack[0] == needle:
        return 0
    i = 1
    hay = endless_haystack[i]
    while hay < needle:  # this is O(log n) where n is the index of the element
        i = 2 * i
        hay = endless_haystack[i]
    # from the loop before the element is between i and i // 2
    return bisect.bisect_left(endless_haystack, needle, i // 2, i)
Be aware that the above code is only a sketch of an actual solution; you need to check for some edge cases.
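A version that uses only a get() method, as the question requires, could look like the following sketch. The class InfiniteWords is a hypothetical stand-in for the question's class A: it produces zero-padded strings so that lexicographic order matches index order. The search assumes that some element greater than or equal to the word exists; otherwise the doubling loop never ends.

```python
class InfiniteWords:
    """Hypothetical stand-in for class A: an unbounded, sorted sequence of words."""
    def get(self, i: int) -> str:
        # Fixed-width, zero-padded so that string order equals numeric order.
        return f"w{i:012d}"

def get_index(source, word: str) -> int:
    # Phase 1: double the upper bound until source.get(hi) >= word.
    lo, hi = 0, 1
    while source.get(hi) < word:
        lo, hi = hi, hi * 2
    # Phase 2: ordinary binary search for the leftmost position in [lo, hi].
    while lo < hi:
        mid = (lo + hi) // 2
        if source.get(mid) < word:
            lo = mid + 1
        else:
            hi = mid
    # Return -1 when the word at the insertion point is a different word.
    return lo if source.get(lo) == word else -1

words = InfiniteWords()
print(get_index(words, words.get(1000)))
```

Both phases call get() O(log n) times, where n is the index of the word.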
You can use the built-in function enumerate, which gives you both the index and the element:
def get_index(word, my_infinite_list):
    return next(i for i, e in enumerate(my_infinite_list) if e == word)
The built-in function next will keep iterating over your list until it finds the wanted word.
Simply iterate over the list by incrementing a counter until you hit a corresponding value:
class A:
    [...]
    def get_index(self, item):
        i = 0
        while self.get(i) != item:
            i += 1
        return i
Note: this is not very safe code. Because we assume the list is infinite, you don't risk an index-out-of-range error, but if the word is not in the list the loop never ends.

Python: Search a sorted list of tuples

Useful information:
For information on how to sort a list of various data types see:
How to sort (list/tuple) of lists/tuples?
.. and for information on how to perform a binary search on a sorted list see: Binary search (bisection) in Python
My question:
How can you neatly apply binary search (or another log(n) search algorithm) to a list of some data type, where the key is an inner component of the data type itself? To keep the question simple we can use a list of tuples as an example:
x = [("a", 1), ("b",2), ("c",3)]
binary_search(x, "b") # search for "b", should return 1
# note how we are NOT searching for ("b",2) yet we want ("b",2) returned anyways
To simplify even further: we only need to return a single search result, not multiple if for example ("b",2) and ("b",3) both existed.
Better yet:
How can we modify the following simple code to perform the above operation?
from bisect import bisect_left
def binary_search(a, x, lo=0, hi=None):  # can't use a to specify default for hi
    hi = hi if hi is not None else len(a)  # hi defaults to len(a)
    pos = bisect_left(a, x, lo, hi)  # find insertion position
    return (pos if pos != hi and a[pos] == x else -1)  # don't walk off the end
PLEASE NOTE: I am not looking for the complete algorithm itself. Rather, I am looking for the application of some of Python's standard(ish) libraries, and/or Python's other functionalities so that I can easily search a sorted list of some arbitrary data type at any time.
Thanks
Take advantage of how lexicographic ordering deals with tuples of unequal length:
import bisect
# bisect_right would also work
index = bisect.bisect_left(x, ('b',))
It may sometimes be convenient to feed a custom sequence type to bisect:
class KeyList(object):
    # bisect doesn't accept a key function (before Python 3.10), so we build the key into our sequence.
    def __init__(self, l, key):
        self.l = l
        self.key = key
    def __len__(self):
        return len(self.l)
    def __getitem__(self, index):
        return self.key(self.l[index])
import operator
# bisect_right would *not* work for this one.
index = bisect.bisect_left(KeyList(x, operator.itemgetter(0)), 'b')
What about converting the list of tuples to a dict?
>>> d = dict([("a", 1), ("b",2), ("c",3)])
>>> d['b'] # 2

Contradictory outputs in simple recursive function

Note: Goal of the function is to remove duplicate(repeated) characters.
Now for the same given recursive function, different output pops out for different argument:
def rd(x):
    if x[0]==x[-1]:
        return x
    elif x[0]==x[1]:
        return rd(x[1: ])
    else:
        return x[0]+rd(x[1: ])
print("Enter a sentence")
r=raw_input()
print("simplified: "+rd(r))
This function works well only if the duplicate character is within the first six characters of the string, for example:
if r=abcdeeeeeeefghijk or if r=abcdeffffffghijk
but if the duplicate character comes after the first six characters, then the output is the same as the input, i.e., output=input. That means that with the value of "r" given below, the function doesn't work:
if r=abcdefggggggggghijkde (repeating characters are after the first six characters)
The reason your function doesn't work properly is your first test, if x[0]==x[-1]. There you compare the first and last characters of the current substring, but that lets many cases slip through, such as affffffa or asdkkkkkk. Let's see why:
example 1: 'affffffa'
Here it is obvious: the first and last characters are both 'a', so the string is returned unchanged.
example 2: 'asdkkkkkk'
Here we hit case 3 of your function, and then again:
'a' +rd('sdkkkkkk')
'a'+'s' +rd('dkkkkkk')
'a'+'s'+'d' +rd('kkkkkk')
and when we reach 'kkkkkk' it stops, because the first and last characters are the same.
example 3: 'asdfhhhhf'
This is the same as example 2: in the recursion chain we arrive at 'fhhhhf', where the first and last characters are the same, so it is left untouched.
How to fix it? Simple: as others have already shown, check the length of the string first.
def rd(x):
    if len(x)<2:  # if my string is 1 or less character long leave it untouched
        return x
    elif x[0]==x[1]:
        return rd(x[1: ])
    else:
        return x[0]+rd(x[1: ])
Here is an alternative, iterative way of doing the same: you can use the unique_justseen recipe from the itertools recipes
from itertools import groupby
from operator import itemgetter
def unique_justseen(iterable, key=None):
    "List unique elements, preserving order. Remember only the element just seen."
    # unique_justseen('AAAABBBCCDAABBB') --> A B C D A B
    # unique_justseen('ABBCcAD', str.lower) --> A B C A D
    return map(next, map(itemgetter(1), groupby(iterable, key)))
def clean(text):
    return "".join(unique_justseen(text))
Test:
>>> clean("abcdefggggggggghijk")
'abcdefghijk'
>>> clean("abcdefghijkkkkkkkk")
'abcdefghijk'
>>> clean("abcdeffffffghijk")
'abcdefghijk'
>>>
And if you don't want to import anything, here is another way:
def clean(text):
    result=""
    last=""
    for c in text:
        if c!=last:
            last = c
            result += c
    return result
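Another compact option, shown here only as a sketch, is a regular expression with a backreference. The function name collapse_runs is made up for this example; the pattern captures one character and replaces it together with its immediate repeats by a single copy.

```python
import re

def collapse_runs(text: str) -> str:
    # (.) captures one character, \1+ matches one or more immediate repeats of it.
    # re.DOTALL lets "." match newlines too, so repeated newlines are collapsed as well.
    return re.sub(r"(.)\1+", r"\1", text, flags=re.DOTALL)

print(collapse_runs("abcdefggggggggghijkde"))
```

Unlike the original recursive function, this also handles the empty string and strings that start and end with the same character.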
The only issue I found with your code was the first if statement. I assume you used it to make sure that the string was at least 2 characters long. That can be done using the built-in len() function. In fact the whole function can be rewritten, but we will leave it recursive for the OP's sake.
def rd(x):
    if len(x) < 2:  # Modified to return if len < 2. accomplishes same as original code and more
        return x
    elif x[0]==x[1]:
        return rd(x[1: ])
    else:
        return x[0]+rd(x[1: ])
r=raw_input("Enter a sentence: ")
print("simplified: "+rd(r))
I would however recommend not making the function recursive and instead building a new string as follows. Note that this removes every repeated character, not only adjacent repeats, so 'abca' becomes 'abc'.
from collections import OrderedDict
def rd(string):
    # assuming order does matter we will use OrderedDict, no longer recursive
    # OrderedDict.fromkeys(string) creates an ordered dict, e.g. {'a': None, ...};
    # duplicate keys are removed because it is a dict, and the keys keep their order
    # "".join(...) joins all the keys with '', which gives a string again
    return "".join(OrderedDict.fromkeys(string))
r=raw_input("Enter a sentence: ")
print("simplified: "+rd(r))
Your function is almost correct, but to handle the last letter properly the base case must check the length of the string:
def rd(x):
    if len(x)<=1:  # <= 1 also covers an empty input
        return x
    elif x[0]==x[1]:
        return rd(x[1: ])
    else:
        return x[0]+rd(x[1: ])
print("Enter a sentence")
r=raw_input()
print("simplified: "+rd(r))
