Concatenate strings if they have an overlapping region - python

I am trying to write a script that will find strings that share an overlapping region of 5 letters at the beginning or end of each string (shown in example below).
facgakfjeakfjekfzpgghi
pgghiaewkfjaekfjkjakjfkj
kjfkjaejfaefkajewf
I am trying to create a new string which concatenates all three, so the output would be:
facgakfjeakfjekfzpgghiaewkfjaekfjkjakjfkjaejfaefkajewf
Edit:
This is the input:
x = ('facgakfjeakfjekfzpgghi', 'kjfkjaejfaefkajewf', 'pgghiaewkfjaekfjkjakjfkj')
Note: the list is not ordered.
What I've written so far, which is not correct:
def findOverlap(seq)
    i = 0
    while i < len(seq):
        for x[i]:
            #check if x[0:5] == [:5] elsewhere
x = ('facgakfjeakfjekfzpgghi', 'kjfkjaejfaefkajewf', 'pgghiaewkfjaekfjkjakjfkj')
findOverlap(x)

Create a dictionary mapping the first 5 characters of each string to its tail (the rest of the string):
strings = {s[:5]: s[5:] for s in x}
and a set of all the suffixes:
suffixes = set(s[-5:] for s in x)
Now find the string whose prefix does not match any suffix:
prefix = next(p for p in strings if p not in suffixes)
Now we can follow the chain of strings:
result = [prefix]
while prefix in strings:
    result.append(strings[prefix])
    prefix = strings[prefix][-5:]
print("".join(result))
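For reuse, the steps above can be wrapped in a function. This is only a sketch, not the answerer's code: the name chain_overlapping and the parameter k are made up for illustration, and it assumes the strings form a single unbranched chain, that each string has at least k characters, and that no two strings share the same prefix. Unlike the snippet above, it maps each prefix to the whole string, so short tails are not a problem.

```python
def chain_overlapping(strings, k=5):
    """Join strings whose k-character suffix equals another string's k-character prefix."""
    by_prefix = {s[:k]: s for s in strings}
    suffixes = {s[-k:] for s in strings}
    # The chain starts at the string whose prefix is not the suffix of any string.
    current = next(s for s in strings if s[:k] not in suffixes)
    parts = [current]
    # Bound the loop by the number of strings so unexpected input cannot loop forever.
    for _ in range(len(strings) - 1):
        nxt = by_prefix.get(current[-k:])
        if nxt is None:
            break
        parts.append(nxt[k:])  # drop the overlapping prefix
        current = nxt
    return "".join(parts)


x = ('facgakfjeakfjekfzpgghi', 'kjfkjaejfaefkajewf', 'pgghiaewkfjaekfjkjakjfkj')
print(chain_overlapping(x))
```

If no string qualifies as a starting point (for example, when the strings form a cycle), next() raises StopIteration, which the caller would need to handle.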

A brute-force approach: try every ordering and return the first one in which each string's last 5 characters match the next string's first 5:
def solution(x):
    from itertools import permutations
    for perm in permutations(x):
        # keep each string, minus its 5-character suffix, when it links to the next one
        linked = [perm[i][:-5] for i in range(len(perm)-1)
                  if perm[i][-5:]==perm[i+1][:5]]
        # every adjacent pair links, so this ordering is a valid chain
        if len(perm)-1==len(linked):
            return "".join(linked)+perm[-1]
    return None
x = ('facgakfjeakfjekfzpgghi', 'kjfkjaejfaefkajewf', 'pgghiaewkfjaekfjkjakjfkj')
print(solution(x))

Loop over each pair of candidates, reverse the second string and use the answer from here

Related

Python detect if string contains specific length substring

I am given a string and need to find the first substring of a given length that consists of a single repeated character.
For example, given the string 'abaadddefggg':
for length = 3 I should get the output 'ddd'
for length = 2 I should get 'aa', and so on.
Any ideas?
You could iterate over the string's indexes and produce all the substrings of the requested length. If any of these substrings is made up of a single character, that's the substring you're looking for:
def sequence(s, length):
    # + 1 so that a match at the very end of the string is also checked
    for i in range(len(s) - length + 1):
        candidate = s[i:i+length]
        if len(set(candidate)) == 1:
            return candidate
One approach in Python 3.8+ using itertools.groupby combined with the walrus operator:
from itertools import groupby
string = 'abaadddefggg'
k = 3
res = next(s for _, group in groupby(string) if len(s := "".join(group)) == k)
print(res)
Output
ddd
An alternative general approach:
from itertools import groupby
def find_substring(string, k):
    for _, group in groupby(string):
        s = "".join(group)
        if len(s) == k:
            return s
res = find_substring('abaadddefggg', 3)
print(res)
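The groupby versions above only accept a run of exactly k identical characters. If a longer run should also count, a regular expression with a backreference is one option. This is a sketch with a made-up function name (first_run), not something taken from the answers:

```python
import re


def first_run(s, k):
    """Return k identical characters from the first run of at least k of them, or None."""
    # (.) captures one character, \1{k-1} requires k-1 more copies of it.
    match = re.search(r"(.)\1{%d}" % (k - 1), s)
    return match.group(0) if match else None


print(first_run('abaadddefggg', 3))
print(first_run('abaadddefggg', 2))
```

Note that "." does not match a newline by default, so this assumes single-line input.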

create string pattern automatically

I need to create a string based on a given pattern.
If the pattern is 222243243, the string to create is "2{4,6}[43]+2{1,3}[43]+".
The logic is: find each run of 2s in the pattern, count its length n, and emit 2{n,n+2}. Here there are two runs of 2s: the first contains four 2s and the second contains one, so the first becomes 2{4,6} (4 to 4+2) and the second becomes 2{1,3} (1 to 1+2). Wherever there are 3s or 4s, [43]+ is added.
My attempt so far:
import re
data='222243243'
TwosStart=[] # start positions of the runs of 2s
TwosEnd=[] # end positions of the runs of 2s
TwoLength=[] # lengths of the runs of 2s
for match in re.finditer('2+', data):
    s = match.start() # start position of this run
    e = match.end() # end position of this run
    d=e-s
    print(s,e,d)
    TwosStart.append(s)
    TwosEnd.append(e)
    TwoLength.append(d)
Using the above code I know how many runs of 2s are in a given pattern, along with their start and end positions, but I have no idea how to build the string automatically from that information.
Examples:
if the pattern is '222243243' the string should be "2{4,6}[43]+2{1,3}[43]+"
if the pattern is '222432243' the string should be "2{3,5}[43]+2{2,4}[43]+"
if the pattern is '22432432243' the string should be "2{2,4}[43]+2{1,3}[43]+2{2,4}[43]+"
One approach is to use itertools.groupby:
from itertools import groupby
s = "222243243"
result = []
for key, group in groupby(s, key=lambda c: c == "2"):
    if key:
        size = (sum(1 for _ in group))
        result.append(f"2{{{size},{size+2}}}")
    else:
        result.append("[43]+")
pattern = "".join(result)
print(pattern)
Output
2{4,6}[43]+2{1,3}[43]+
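To check the groupby approach against all three examples from the question, it can be wrapped in a function. The name build_pattern and the extra parameter are illustrative assumptions, not part of the answer:

```python
from itertools import groupby


def build_pattern(digits, extra=2):
    """Turn each run of 2s of length n into 2{n,n+extra} and every other run into [43]+."""
    parts = []
    for is_two, group in groupby(digits, key=lambda c: c == "2"):
        if is_two:
            n = sum(1 for _ in group)
            parts.append(f"2{{{n},{n + extra}}}")
        else:
            parts.append("[43]+")
    return "".join(parts)


for sample in ("222243243", "222432243", "22432432243"):
    print(sample, "->", build_pattern(sample))
```

Like the answer it is based on, this treats every character other than 2 as part of a [43]+ run, so it assumes the input contains only the digits 2, 3 and 4.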
Using your base code:
import re
data='222243243'
cpy=data
offset=0 # each 'cpy' modification offsets the addition
for match in re.finditer('2+', data):
    s = match.start() # 2's start position
    e = match.end() # 2's end position
    d = e-s
    regex = "]+2{" + str(d) + "," + str(d+2) + "}["
    cpy = cpy[:s+offset] + regex + cpy[e+offset:]
    offset+=len(regex)-d
# sometimes the borders can have wrong characters
if cpy[0]==']':
    cpy=cpy[2:] # remove "]+"
else:
    cpy='['+cpy
if cpy[len(cpy)-1]=='[':
    cpy=cpy[:-1]
else:
    cpy+="]+"
print(cpy)
Output
2{4,6}[43]+2{1,3}[43]+

Struggling with Regex for adjacent letters differing by case

I am looking to recursively remove adjacent letters in a string that differ only in their case, e.g. if s = AaBbccDd I would want to remove Aa, Bb and Dd but leave cc.
I can do this recursively using lists (shown further down).
I think it ought to be possible with a regex, but I am struggling:
with the test string 'fffAaaABbe' the answer should be 'fffe', but the regex I am using gives 'fe':
def test(line):
    res = re.compile(r'(.)\1{1}', re.IGNORECASE)
    #print(res.search(line))
    while res.search(line):
        line = res.sub('', line, 1)
    print(line)
The way that works is:
def test(line):
    result =''
    chr = list(line)
    cnt = 0
    i = len(chr) - 1
    while i > 0:
        if ord(chr[i]) == ord(chr[i - 1]) + 32 or ord(chr[i]) == ord(chr[i - 1]) - 32:
            cnt += 1
            chr.pop(i)
            chr.pop(i - 1)
            i -= 2
        else:
            i -= 1
    if cnt > 0: # until we can't find any duplicates.
        return test(''.join(chr))
    result = ''.join(chr)
    print(result)
Is it possible to do this using a regex?
re.IGNORECASE is not the way to solve this problem, as it treats aa, Aa, aA and AA the same way. Technically it is possible using re.sub, in the following way:
import re
txt = 'fffAaaABbe'
after_sub = re.sub(r'Aa|aA|Bb|bB|Cc|cC|Dd|dD|Ee|eE|Ff|fF|Gg|gG|Hh|hH|Ii|iI|Jj|jJ|Kk|kK|Ll|lL|Mm|mM|Nn|nN|Oo|oO|Pp|pP|Qq|qQ|Rr|rR|Ss|sS|Tt|tT|Uu|uU|Vv|vV|Ww|wW|Xx|xX|Yy|yY|Zz|zZ', '', txt)
print(after_sub) # fffe
Note that I explicitly listed all possible letter pairs because, as far as I know, there is no way to say "the same letter in the opposite case" using just an re pattern. Maybe another user will be able to provide a more concise re-based solution.
I suggest a different approach which uses groupby to group adjacent similar letters:
from itertools import groupby
def test(line):
    res = []
    for k, g in groupby(line, key=lambda x: x.lower()):
        g = list(g)
        if all(x == x.lower() for x in g):
            res.append(''.join(g))
    print(''.join(res))
Sample run:
>>> test('AaBbccDd')
cc
>>> test('fffAaaABbe')
fffe
r'(.)\1{1}' is wrong because it will match any character that is repeated twice, including non-letter characters. If you want to stick to letters, you can't use this.
However, even restricting it to letters, as in r'([A-Za-z])\1{1}' with re.IGNORECASE, would still be wrong, because it would also match xx and XX. As you said in the original question, you don't want to match consecutive identical characters whose case matches.
There is no shorthand that does this conveniently, but it is still possible. You could also write a small function that builds the pattern for you.
Building on @Daweo's answer, you can generate the regex pattern needed to match pairs of the same letter with non-matching case, giving the final pattern aA|Aa|bB|Bb|cC|Cc|dD|Dd|eE|Ee|fF|Ff|gG|Gg|hH|Hh|iI|Ii|jJ|Jj|kK|Kk|lL|Ll|mM|Mm|nN|Nn|oO|Oo|pP|Pp|qQ|Qq|rR|Rr|sS|Ss|tT|Tt|uU|Uu|vV|Vv|wW|Ww|xX|Xx|yY|Yy|zZ|Zz:
import re
import string
def consecutiveLettersNonMatchingCase():
    # Get all 'xX|Xx' with a list comprehension
    # and join them with '|'
    return '|'.join(['{0}{1}|{1}{0}'.format(s, t)
                     # Iterate through the upper/lowercase characters
                     # in lock-step
                     for s, t in zip(
                         string.ascii_lowercase,
                         string.ascii_uppercase)])
def test(line):
    res = re.compile(consecutiveLettersNonMatchingCase())
    print(res.search(line))
    while res.search(line):
        line = res.sub('', line, 1)
    print(line)
print(consecutiveLettersNonMatchingCase())
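For comparison, the same reduction can be done without a regex in a single pass using a stack. This is a swapped-in technique rather than anything proposed in the answers, and the function name remove_case_pairs is made up for illustration:

```python
def remove_case_pairs(line):
    """Repeatedly drop adjacent pairs that are the same letter in opposite case."""
    stack = []
    for ch in line:
        # A pair cancels when the two characters differ but become equal
        # once the case of one of them is swapped (e.g. 'a' and 'A').
        if stack and stack[-1] != ch and stack[-1].swapcase() == ch:
            stack.pop()
        else:
            stack.append(ch)
    return "".join(stack)


print(remove_case_pairs('AaBbccDd'))
print(remove_case_pairs('fffAaaABbe'))
```

Because a removed pair exposes the character before it on top of the stack, cascading cases such as 'abBA' are handled without rescanning the string.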

Working with a permuted list (making a password list)

I want to create a password list. Let's assume I have created a permuted list, for example:
##
##
##
##
I then want to add other characters to it (for example a and b). In this code a and b are called special chars and ## are called added chars.
The list I finally want is:
ab## , ab##,ab##,ab## , ba##, .... a##b,...,b##a , ... , ba##
Note: I don't want any special character to be duplicated. For example, I don't want aa## or bb## (a and b can't be duplicated because they are special chars; # can be duplicated because it is an added char).
Code:
import itertools
master_list=[]
l=[]
l= list(itertools.combinations_with_replacement('##',2)) # gives me this list: [(#,#),(#,#),(#,#),(#,#)]
for i in l:
    i = i+tuple(s) # add the special chars (1 in this example) to each created tuple
    master_list.append(i)
print (master_list) # now I have this list: [(#,#,1),(#,#,1),....(#,#,1)]
Now if I could get all permutations of master_list my problem would be solved, but I can't do that.
I solved my problem. My idea: first generate all possible permutations of the added chars (#,#) and save them to a list, then create another list holding the special chars (a,b). Now we have two lists; we just need to merge them and finally apply the permute_unique function:
def permute_unique(nums):
    perms = [[]]
    for n in nums:
        new_perm = []
        for perm in perms:
            for i in range(len(perm) + 1):
                new_perm.append(perm[:i] + [n] + perm[i:])
                # handle duplication
                if i < len(perm) and perm[i] == n: break
        perms = new_perm
    return perms
# `algorithm` (the added chars) and `s` (the special chars) are not shown in the post
l= list(itertools.combinations_with_replacement(algorithm,3))
for i in l:
    i = i+tuple(s) # merge
    master_list.append(i)
for item in master_list:
    print(permute_unique(item))
You can just combine the combinations_with_replacement of the "added" chars with all the permutations of those combinations and the "special" characters:
>>> import itertools
>>> special = "ab"
>>> added = "##"
>>> [''.join(p)
...  for a in itertools.combinations_with_replacement(added, 2)
...  for p in itertools.permutations(a + tuple(special))]
['##ab',
'##ba',
'#a#b',
...
'a#b#',
'ab##',
...
'##ab',
'##ba',
...
'ba##',
'ba##']
If you want to prevent duplicates, pass the inner permutations through a set:
>>> [''.join(p)
...  for a in itertools.combinations_with_replacement(added, 2)
...  for p in set(itertools.permutations(a + tuple(special)))]
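As a self-contained sketch of the same idea: the function name password_candidates and the sample characters '#' and '$' are assumptions chosen for illustration (the question's actual added characters appear only as "##"). A single set across all combinations removes duplicates both within and between combinations.

```python
import itertools


def password_candidates(added, special, n_added):
    """Combine n_added 'added' chars (repeats allowed) with each special char exactly once."""
    seen = set()
    for combo in itertools.combinations_with_replacement(added, n_added):
        for perm in itertools.permutations(combo + tuple(special)):
            seen.add("".join(perm))
    return sorted(seen)


result = password_candidates("#$", "ab", 2)
print(len(result))
print(result[:5])
```

Since the special characters are appended once to every combination, strings such as aa## can never be produced, which is the constraint stated in the question.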

Python: Replace elements of one list with those of another if condition is met

I have two lists, one called src with each element in this format:
['SOURCE: filename.dc : 1 : a/path/: description','...]
And one called base with each element in this format:
['BASE: 1: another/path','...]
I am trying to compare the base element's number (in this case it's 1) with the source element's number (in this case it's 1).
If they match, I want to replace the source element's number with the base element's path.
Right now I can extract the source element's number with a for loop like this:
for w in source_list:
    src_no = [s.strip() for s in w.split(':')][2]
And I can extract the base element's path and number with a for loop like this:
for r in basepaths:
    base_no = [s.strip() for s in r.split(':')][1]
    base_path = [s.strip() for s in r.split(':')][2]
I expect the new list to look like this (based on the two example elements above):
['SOURCE: filename.dc : another/path : a/path/: description','...]
The src list is a large list with many elements; the base list is usually three or four elements long and is only used to translate into the new list.
I hacked something together for you, which I think should do what you want:
base_list = ['BASE: 1: another/path']
base_dict = dict()
# First map the base numbers to the paths
for entry in base_list:
    n, p = map(lambda s: s.strip(), entry.split(':')[1:])
    base_dict[n] = p
source_list = ['SOURCE: filename.dc : 1 : a/path/: description']
# Loop over all source entries and replace the number with the base path if the numbers match
for i, entry in enumerate(source_list):
    n = entry.split(':')[2].strip()
    if n in base_dict:
        new_entry = entry.split(':')
        new_entry[2] = base_dict[n]
        source_list[i] = ':'.join(new_entry)
Be aware that this is a hacky solution; I think you should use regular expressions (look into the re module) to extract the numbers and paths and to replace the number.
This code also modifies the list while iterating over it, which may not be the most Pythonic thing to do.
Something like this:
for i in range(len(source_list)):
    for b in basepaths:
        if source_list[i].split(":")[2].strip() == b.split(":")[1].strip():
            # replace field 2 (the number) with the base path, keeping the other fields
            source_list[i] = ":".join(source_list[i].split(":")[:2] + [b.split(":")[2]] + source_list[i].split(":")[3:])
Just get rid of the [] part of the splits so you keep the whole list (in Python 3, build a list, since a map object cannot be indexed):
src = [s.strip() for s in w.split(':')]
base = [s.strip() for s in r.split(':')]
>>> src
['SOURCE', 'filename.dc', '1', 'a/path/', 'description']
base will similarly be a simple list.
Now just replace the proper element:
src[2] = base[2]
then put the elements back together if necessary:
src = ' : '.join(src)
def separate(x, separator = ':'):
    return tuple(x.split(separator))
sourceList = map(separate, source)
baseDict = {}
for b in map(separate, base):
    baseDict[int(b[1])] = b[2]
def tryFind(number):
    try:
        return baseDict[number]
    except KeyError:
        return number
# keep every field after the number (path and description), not only s[3]
result = [(s[0], s[1], tryFind(int(s[2]))) + s[3:] for s in sourceList]
This worked for me; it's not the best, but it's a start.
So there is one large list that will be traversed sequentially and a shorter one. I would turn the short one into a mapping, so that for each item of the large list you can tell immediately whether there is a match:
base = {}
for r in basepaths:
    fields = [s.strip() for s in r.split(':')]
    base_no = fields[1]
    base_path = fields[2]
    base[base_no] = base_path
for i, w in enumerate(source_list):
    src_no = [s.strip() for s in w.split(':')][2]
    if src_no in base:
        path = base[src_no]
        # stuff...
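Putting the mapping idea together end to end, a minimal sketch might look like the following. The function name replace_numbers and the second sample entry are assumptions added for illustration; the sketch also assumes that no field in a source entry contains a colon, and it normalises the separators to " : " on output.

```python
def replace_numbers(source_list, base_list):
    """Return a new list where each source number is replaced by its base path, if any."""
    # Build the lookup once: base number -> base path.
    base = {}
    for entry in base_list:
        _, number, path = (part.strip() for part in entry.split(":", 2))
        base[number] = path

    result = []
    for entry in source_list:
        parts = [part.strip() for part in entry.split(":")]
        # Leave the number untouched when there is no matching base entry.
        parts[2] = base.get(parts[2], parts[2])
        result.append(" : ".join(parts))
    return result


source_list = [
    'SOURCE: filename.dc : 1 : a/path/: description',
    'SOURCE: other.dc : 7 : b/path/: unmatched example',  # hypothetical entry
]
base_list = ['BASE: 1: another/path']
for line in replace_numbers(source_list, base_list):
    print(line)
```

Building a new list instead of assigning into source_list sidesteps the concern about modifying a list while looping over it.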
