How to merge strings with overlapping characters in python?

I'm working on a Python project which reads in a URL-encoded list of overlapping strings. Each string is 15 characters long and overlaps with the next string in the sequence by at least 3 characters and at most 15 characters (identical).
The goal of the program is to go from a list of overlapping strings, either ordered or unordered, to a single compressed URL-encoded string.
My current method fails at duplicate segments in the overlapping strings. For example, my program is incorrectly combining:
StrList1 = [ 'd+%7B%0A++++public+', 'public+static+v','program%0Apublic+', 'ublic+class+Hel', 'lass+HelloWorld', 'elloWorld+%7B%0A+++', '%2F%2F+Sample+progr', 'program%0Apublic+']
to output:
output = ['ublic+class+HelloWorld+%7B%0A++++public+', '%2F%2F+Sample+program%0Apublic+static+v']
when correct output is:
output = ['%2F%2F+Sample+program%0Apublic+class+HelloWorld+%7B%0A++++public+static+v']
I am using simple python, not biopython or sequence aligners, though perhaps I should be?
Would greatly appreciate any advice on the matter or suggestions of a nice way to do this in python!
Thanks!

You can start with one of the strings in the list (stored as string), and for each of the remaining strings in the list (stored as candidate) where:
candidate is part of string,
candidate contains string,
candidate's tail matches the head of string,
or, candidate's head matches the tail of string,
assemble the two strings according to how they overlap, and then recursively repeat the procedure with the overlapping string removed from the remaining strings and the assembled string appended, until there is only one string left in the list, at which point it is a valid fully assembled string that can be added to the final output.
Since there can potentially be multiple ways several strings can overlap with each other, some of which can result in the same assembled strings, you should make output a set of strings instead:
def assemble(str_list, min=3, max=15):
    if len(str_list) < 2:
        return set(str_list)
    output = set()
    string = str_list.pop()
    for i, candidate in enumerate(str_list):
        matches = set()
        if candidate in string:
            matches.add(string)
        elif string in candidate:
            matches.add(candidate)
        for n in range(min, max + 1):
            if candidate[:n] == string[-n:]:
                matches.add(string + candidate[n:])
            if candidate[-n:] == string[:n]:
                matches.add(candidate[:-n] + string)
        for match in matches:
            output.update(assemble(str_list[:i] + str_list[i + 1:] + [match]))
    return output
so that with your sample input:
StrList1 = ['d+%7B%0A++++public+', 'public+static+v','program%0Apublic+', 'ublic+class+Hel', 'lass+HelloWorld', 'elloWorld+%7B%0A+++', '%2F%2F+Sample+progr', 'program%0Apublic+']
assemble(StrList1) would return:
{'%2F%2F+Sample+program%0Apublic+class+HelloWorld+%7B%0A++++public+static+v'}
or as an example of an input with various overlapping possibilities (that the second string can match the first by being inside, having tail matching the head, and having head matching the tail):
assemble(['abcggggabcgggg', 'ggggabc'])
would return:
{'abcggggabcgggg', 'abcggggabcggggabc', 'abcggggabcgggggabc', 'ggggabcggggabcgggg'}
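As a rough illustration of the four overlap cases listed above, the pairwise step can be pulled out into its own helper. The name pair_merges and the min_overlap parameter are made up for this sketch and are not part of the code above; unlike that code, it bounds the overlap length by the shorter of the two strings rather than by a fixed maximum.

```python
def pair_merges(string, candidate, min_overlap=3):
    """Return every way `candidate` can be merged with `string` (hypothetical helper)."""
    merges = set()
    # Containment: one string already covers the other entirely.
    if candidate in string:
        merges.add(string)
    elif string in candidate:
        merges.add(candidate)
    # An overlap cannot be longer than the shorter of the two strings.
    longest = min(len(string), len(candidate))
    for n in range(min_overlap, longest + 1):
        # Head of candidate matches tail of string: candidate goes on the right.
        if string.endswith(candidate[:n]):
            merges.add(string + candidate[n:])
        # Tail of candidate matches head of string: candidate goes on the left.
        if string.startswith(candidate[-n:]):
            merges.add(candidate[:-n] + string)
    return merges


print(pair_merges('hello wor', 'world'))  # {'hello world'}
```

Called as pair_merges('ggggabc', 'abcggggabcgggg'), it should produce the same four strings as the two-element example above.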

Related

write RE to get string in between first alphabet to last alphabet?

I need to print the 2 missing strings KMJD23KN0008393 and KMJD23KN0008394, but what I am getting is KMJD23KN8393 and KMJD23KN8394. I need the leading zeros in the list as well.
import re
ll = ['KMJD23KN0008391','KMJD23KN0008392','KMJD23KN0008395','KMJD23KN0008396']
missList=[]
for i in ll:
    reList=re.findall(r"[^\W\d_]+|\d+", i)
    print(reList)

The issue can be decomposed into three parts:
Extracting the trailing number string
Interpreting that substring as the actual number, which should form a consecutive sequence
Re-formatting the missing items based on the surrounding items.
There are multiple assumptions implicit in these items. You need to be aware of these, and ideally make them explicit. In the following, I’ve worked with the following assumptions:
Anything that forms a number at the end of the string is considered. Everything before that is a prefix and is assumed to be identical throughout the series.
The trailing numbers in the series always have the same width.
The items in the list are sorted in ascending order by their trailing number.
The following code implements these assumptions:
import re

all_missing = []
# Start from the last item's number so the first iteration adds nothing.
last_num = int(re.search(r'\d+$', ll[-1])[0])
prefix = re.match(r'.*\D', ll[0])[0]
for item in ll:
    num_str = re.search(r'\d+$', item)[0]
    num = int(num_str)
    num_width = len(num_str)
    for missing in range(last_num + 1, num):
        all_missing.append(f'{prefix}{missing:0{num_width}}')
    last_num = num
print(all_missing)
Some notes here:
To extract the trailing number, a very simple regex is sufficient: \d+$. That is: one or more digits, until the end of the string.
Conversely, to extract the prefix, we search for any sequence of arbitrary characters where the last character is a non-digit. That is: .*\D.
To re-format the missing items, we concatenate the prefix with the missing number, and we pad the missing number with zeros (from left) until it is of the expected width. This is achieved by using Python’s f-strings with the format specifier '0{num_width}'.
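To see the three pieces from these notes in isolation, here is a small sketch applied to a single item. The variable names are only for illustration, and 8393 is simply one of the missing numbers from the question.

```python
import re

item = 'KMJD23KN0008391'

# Trailing number: one or more digits up to the end of the string.
num_str = re.search(r'\d+$', item)[0]

# Prefix: everything up to and including the last non-digit character.
prefix = re.match(r'.*\D', item)[0]

# Zero-pad a missing number to the same width as the trailing number.
width = len(num_str)
formatted = f'{prefix}{8393:0{width}}'

print(num_str, prefix, formatted)  # 0008391 KMJD23KN KMJD23KN0008393
```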

How can i find the Correct String or 3 Most Similar Strings from a List using a String?

I'm trying to find the 3 closest matches for a given string like "_e_ul", where the correct match from my list would be Mehul, but difflib.get_close_matches returns matches that look random and don't resemble my word at all.
If it helps, I have a list of all the possible strings, so the string I am given is guaranteed to be in the list. I need a library that can match the positions of the letters shown in the string against the entries in the list.
from difflib import get_close_matches

res = get_close_matches(hint, pokemons, n=3, cutoff=0.2)
if res == []:
    print("No Similar Names Found Unfortunately")
else:
    var = "\n".join(res)
    print(f"**Did you Mean...**\n`{var}`")
hint was _ar__n_e - answer supposed to be mareanie
pokemons is a list containing all pokemon names
The output came out to be
**Did you Mean...**
`Cacturne
Carnivine
Charmander`
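One possible approach, sketched here under the assumption that each underscore in the hint stands for exactly one unknown character, is to turn the hint into a regular expression rather than rely on similarity scoring. The function name match_hint and the short names list are made up for this example.

```python
import re


def match_hint(hint, names, blank="_"):
    """Return the names whose letters line up with the hint (hypothetical helper)."""
    # Each blank becomes '.', which matches any single character;
    # every other character is escaped so it is matched literally.
    pattern = "".join("." if ch == blank else re.escape(ch) for ch in hint)
    regex = re.compile(pattern, re.IGNORECASE)
    # fullmatch also enforces that the name has the same length as the hint.
    return [name for name in names if regex.fullmatch(name)]


names = ["Cacturne", "Carnivine", "Charmander", "Mareanie", "Mehul"]
print(match_hint("_ar__n_e", names))  # ['Mareanie']
```

Because the match is positional, names of a different length than the hint are never returned, which difflib's ratio-based scoring does not guarantee.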

Get all substrings between two different start and ending delimiters

In Python 3, I am trying to get a list of all substrings of a given string s which start after a delimiter x and end right before a delimiter y.
I have found solutions which only get me the first occurrence, but the result needs to be a list of all occurrences.
start = '>'
end = '</'
s = '<script>a=eval;b=alert;a(b(/XSS/.source));</script><script>a=eval;b=alert;a(b(/XSS/.source));</script>\'"><marquee><h1>XSS by Xylitol</h1></marquee>'
print((s.split(start))[1].split(end)[0])
The above example is what I've got so far, but I am looking for a more elegant and stable way to get all the occurrences.
So the expected return as list would contain the javascript code as following entries:
a=eval;b=alert;a(b(/XSS/.source));
a=eval;b=alert;a(b(/XSS/.source));

Looking for patterns in strings seems like a decent job for regular expressions.
This should return a list of anything between a pair of <script> and </script>:
import re
pattern = re.compile(r'<script>(.*?)</script>')
s = '<script>a=eval;b=alert;a(b(/XSS/.source));</script><script>a=eval;b=alert;a(b(/XSS/.source));</script>\'"><marquee><h1>XSS by Xylitol</h1></marquee>'
print(pattern.findall(s))
Result:
['a=eval;b=alert;a(b(/XSS/.source));', 'a=eval;b=alert;a(b(/XSS/.source));']
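If the delimiters are not fixed tags, the same idea can be generalized. The sketch below assumes arbitrary literal delimiters passed in as strings; the function name between is made up for this example. re.escape keeps characters such as [ or . in the delimiters from being read as regex syntax, and re.DOTALL lets the captured text span line breaks.

```python
import re


def between(s, start, end):
    """Return all shortest substrings of s enclosed by start and end (hypothetical helper)."""
    pattern = re.compile(re.escape(start) + r'(.*?)' + re.escape(end), re.DOTALL)
    return pattern.findall(s)


print(between('<script>a</script><script>b</script>', '<script>', '</script>'))
# ['a', 'b']
```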

How to get all possible incremental values between two mixed strings in Python?

I am new to Python (and programming, in general) and hoping to see if someone can help me. I am trying to automate a task that I am currently doing manually but is no longer feasible. I want to find and write all strings between two given strings. For example, if starting and ending strings are XYZ-DF 000010 and XYZ-DF 000014, the desired output should be XYZ-DF 000010; XYZ-DF 000011; XYZ-DF 000012; XYZ-DF 000013; XYZ-DF 000014. The prefix and numbers (and their padding) are not always the same. For example, next starting and ending strings in the list could be ABC_XY00000001 and ABC_XY00000123. The prefix and padding for any pair of starting and ending strings, though, will always be the same.
I think I need to separate the prefix (includes any alphabets, spaces, underscore, hyphen etc.) and numbers, remove padding from the numbers, increment the numbers by 1 from starting number to ending number for every starting and ending strings in a second loop, and then finally get the output by concatenation.
So far this is what I have:
First, I read the 2 columns that contain a list of starting and ending strings in a csv into lists using pandas:
import pandas as pd

columns = ['Beg', 'End']
data = pd.read_csv('C:/Downloads/test.csv', names=columns, header=None)
begs = data.Beg.tolist()
ends = data.End.tolist()
Next, I loop over "begs" and "ends" using the zip function.
for beg, end in zip(begs,ends):
Inside the loop, I want to iterate over each string in begs and ends (one pair at a time) and perform the following operations on them:
1) Use regex to separate the characters (including alphabets, spaces, underscore, hyphen etc.) from the numbers (including padding) for each of the strings one at a time.
start = re.match(r"([a-z-_ ]+)([0-9]+)", beg, re.I) #Let's assume first starting string in the begs list is "XYZ-DF 000010" from my example above
prefix = start.group(1) #Should yield "XYZ-DF "
start_num = start.group(2) #Should yield "000010"
padding = (len(start_num)) #Yields 6
start_num_stripped = int(start_num) #Yields 10
end = re.match(r"([a-z-_ ]+)([0-9]+)", end, re.I) #Let's assume first ending string in the ends list is "XYZ-DF 000014" from my example above
end_num = end.group(2) #Yields 000014
end_num_stripped = int(end_num) #Yields 14
2) After these operations, run a nested while loop from start_num_stripped until end_num_stripped
output_string = ""
while start_num_stripped <= end_num_stripped:
    output_string = output_string+prefix+str(start_num_stripped).zfill(padding)+"; "
    start_num_stripped += 1
Finally, how do I write the output_string for each pair of starting and ending strings to a csv file that contains 3 columns containing the starting string, ending string, and their output string? An example of an output in csv format is given below (newline after each row is for clarity and not needed in the output).
"Starting String", "Ending String", "Output String"
"ABCD-00001","ABCD-00003","ABCD-00001; ABCD-00002; ABCD-00003"
"XYZ-DF 000010","XYZ-DF 000012","XYZ-DF 000010; XYZ-DF 000011; XYZ-DF 000012"
"BBB_CC0000008","BBB_CC0000014","BBB_CC0000008; BBB_CC0000009; BBB_CC0000010; BBB_CC0000011; BBB_CC0000012; BBB_CC0000013; BBB_CC0000014"

You could find the longest trailing numeric suffix using a regular expression. Then simply iterate numbers from start to end appending them (with leading zeros) to the common prefix:
import re
startString = "XYZ-DF 000010"
endString = "XYZ-DF 000012"
suffixLen = len(re.findall("[0-9]*$",startString)[0])
start = int("1"+startString[-suffixLen:])
end = int("1"+endString[-suffixLen:])
result = [ startString[:-suffixLen]+str(n)[1:] for n in range(start,end+1) ]
csvLine = '"' + '","'.join([ startString,endString,";".join(result) ]) + '"'
print(csvLine) # "XYZ-DF 000010","XYZ-DF 000012","XYZ-DF 000010;XYZ-DF 000011;XYZ-DF 000012"
Note: using int("1" + suffix) causes numbers in the range to always have 1 more digit than the length of the suffix (1xxxxx). This makes it easy to get the leading zeroes by simply dropping the first character after turning them back into strings str(n)[1:]
Note2: I'm not familiar with pandas but I'm pretty sure it has a way to write a csv directly from the result list rather than formatting it manually as I did here in csvLine.
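As an alternative to formatting the CSV line by hand, here is a sketch using the standard library's csv module and str.zfill for the padding. The function name expand_range and the sample pairs are assumptions for this example, and an in-memory io.StringIO buffer stands in for a real output file.

```python
import csv
import io
import re


def expand_range(start, end):
    """List every string from start to end, keeping prefix and zero padding."""
    match = re.search(r'\d+$', start)
    prefix = start[:match.start()]
    width = len(match.group())
    first = int(match.group())
    last = int(re.search(r'\d+$', end).group())
    return [prefix + str(n).zfill(width) for n in range(first, last + 1)]


pairs = [("ABCD-00001", "ABCD-00003"), ("XYZ-DF 000010", "XYZ-DF 000012")]

buffer = io.StringIO()  # stand-in for open('out.csv', 'w', newline='')
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
writer.writerow(["Starting String", "Ending String", "Output String"])
for start, end in pairs:
    writer.writerow([start, end, "; ".join(expand_range(start, end))])

csv_text = buffer.getvalue()
print(csv_text)
```

Letting csv.writer handle the quoting avoids problems if a prefix ever contains a comma or a double quote.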

Python - finding words based on repeated characters specified in a string

Let's say I have a list of words:
resign
resins
redyed
resist
reeded
I also have a string ".10.10"
I need to iterate through the list and find the words where there are repeated characters in the same locations where there are numbers in the string.
For instance, the string ".10.10" would find the word 'redyed' since there are e's where there are 1's and there are d's where there are 0's.
Another string ".00.0." would find the word 'reeded' as there are e's in that position.
My attempts in python so far are not really worth printing. At the moment I look through the string, add all 0s to an array and the 1s to an array then try to find repeated characters in the array positions. But it's terribly clumsy and doesn't work properly.

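def matches(s, pattern):
    # Map each pattern digit to the first letter seen at that digit's position;
    # later positions with the same digit must carry the same letter.
    d = {}
    return all(cp == "." or d.setdefault(cp, cs) == cs
               for cs, cp in zip(s, pattern))

a = ["resign", "resins", "redyed", "resist", "reeded"]
print([s for s in a if matches(s, ".01.01")])
print([s for s in a if matches(s, ".00.0.")])
prints
['redyed']
['reeded']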
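The function above does not compare lengths (zip stops at the shorter argument) and lets two different digits stand for the same letter. If those cases should be rejected, which is an assumption about the requirements and not something the question states, a stricter variant could look like this sketch. The name matches_strict is made up for this example.

```python
def matches_strict(word, pattern):
    """Like matches(), but also checks length and digit/letter uniqueness."""
    if len(word) != len(pattern):
        return False
    seen = {}
    for letter, symbol in zip(word, pattern):
        if symbol == ".":
            continue  # wildcard position, any letter is fine
        # Remember the first letter seen for this digit; later ones must agree.
        if seen.setdefault(symbol, letter) != letter:
            return False
    # Assumed rule: different digits must map to different letters.
    return len(set(seen.values())) == len(seen)


words = ["resign", "resins", "redyed", "resist", "reeded"]
print([w for w in words if matches_strict(w, ".10.10")])  # ['redyed']
```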
