Permutate removal of defined substrings with varying length from strings

I am trying to generate all permutations from a list of strings where certain substrings of characters are removed. I have a list of certain chemical compositions and I want all compositions resulting from that list where one of those elements is removed. A short excerpt of this list looks like this:
AlCrHfMoNbN
AlCrHfMoTaN
AlCrHfMoTiN
AlCrHfMoVN
AlCrHfMoWN
...
What I am trying to get is
AlCrHfMoNbN --> CrHfMoNbN
                AlHfMoNbN
                AlCrMoNbN
                AlCrHfNbN
                AlCrHfMoN
AlCrHfMoTaN --> CrHfMoTaN
                AlHfMoTaN
                AlCrMoTaN
                AlCrHfTaN
                AlCrHfMoN
for each composition. I just need the right column. As you can see some of the resulting compositions are duplicates and this is intended. The list of elements that need to be removed is
Al, Cr, Hf, Mo, Nb, Ta, Ti, V, W, Zr
As you see some have a length of two characters and some of only one.
There is a question that asks about something very similar, however my problem is more complex:
Getting a list of strings with character removed in permutation
I tried adjusting the code to my needs:
def f(s, c, start):
    i = s.find(c, start)
    return [s] if i < 0 else f(s, c, i+1) + f(s[:i]+s[i+1:], c, i)

s = 'AlCrHfMoNbN'
print(f(s, 'Al', 0))
But this simple approach only leads to ['AlCrHfMoNbN', 'lCrHfMoNbN']. So only one character is removed whereas I need to remove a defined string of characters with a varying length. Also I am limited to a single input object s - instead of hundreds that I need to process - so cycling through by hand is not an option.
To sum it up, what I need is a change in the code that allows me to:
input a list of strings separated either by line breaks or by whitespace
remove substrings from that list which are defined by a second list (just like above)
write the resulting "reduced" items to a continuous list, preferably as a single column without any commas and such
Since I only have some experience with Python and Bash I strongly prefer a solution with these languages.

IIUC, all you need is str.replace:
input_list = ['AlCrHfMoNbN', 'AlCrHfMoTaN']
removals = ['Al', 'Cr', 'Hf', 'Mo', 'Nb', 'Ta', 'Ti', 'V', 'W', 'Zr']
result = {}
for i in input_list:
    result[i] = [i.replace(r,'') for r in removals if r in i]
Output:
{'AlCrHfMoNbN': ['CrHfMoNbN',
                 'AlHfMoNbN',
                 'AlCrMoNbN',
                 'AlCrHfNbN',
                 'AlCrHfMoN'],
 'AlCrHfMoTaN': ['CrHfMoTaN',
                 'AlHfMoTaN',
                 'AlCrMoTaN',
                 'AlCrHfTaN',
                 'AlCrHfMoN']}

If you have gawk, set FPAT to [A-Z][a-z]* so that each element is treated as a field, and use a simple loop to generate the permutations. Also set OFS to the empty string so there are no spaces in the output records.
$ gawk 'BEGIN{FPAT="[A-Z][a-z]*";OFS=""} {for(i=1;i<NF;++i){p=$i;$i="";print;$i=p}}' file
CrHfMoNbN
AlHfMoNbN
AlCrMoNbN
AlCrHfNbN
AlCrHfMoN
CrHfMoTaN
AlHfMoTaN
AlCrMoTaN
AlCrHfTaN
AlCrHfMoN
CrHfMoTiN
AlHfMoTiN
AlCrMoTiN
AlCrHfTiN
AlCrHfMoN
CrHfMoVN
AlHfMoVN
AlCrMoVN
AlCrHfVN
AlCrHfMoN
CrHfMoWN
AlHfMoWN
AlCrMoWN
AlCrHfWN
AlCrHfMoN
I've also written a portable one with extra spaces and explanatory comments:
awk '{
  # separate last element from others
  sub(/[A-Z][a-z]*$/, " &")
  # from the beginning of line
  # we will match each element and print a line where it is omitted
  for (i=0; match(substr($1,i), /[A-Z][a-z]*/); i+=RLENGTH)
    print substr($1,1,i) substr($1,i+RLENGTH+1) $2
    #     ^ before match ^ after match          ^ last element
}' file

This doesn't use your attempt, but it works when we assume that your elements always begin with an uppercase letter (and consist otherwise only of lowercase letters):
def f(s):
    # split string by elements
    import re
    elements = re.findall('[A-Z][^A-Z]*', s)
    # make a list of strings, where the first string has the first element removed, the second string the second, ...
    r = []
    for i in range(len(elements)):
        r.append(''.join(elements[:i]+elements[i+1:]))
    # return this list
    return r
Of course this still only works for one string. So if you have a list of strings l and you want to apply it to every string in it, just use a for loop like this:
# your list of strings
l = ["AlCrHfMoNbN", "AlCrHfMoTaN", "AlCrHfMoTiN", "AlCrHfMoVN", "AlCrHfMoWN"]
# iterate through your input list
for s in l:
    # call above function
    r = f(s)
    # print out the result if you want to
    for i in r:
        print(i)
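Note that f as written removes every element in turn, including the trailing N, so 'AlCrHfMoNbN' also yields 'AlCrHfMoNb', which the question's expected output does not contain. A possible variant, sketched under the assumption that only symbols from the question's removal list should be dropped, is shown here. The name remove_each_listed is hypothetical.

```python
import re

def remove_each_listed(composition, removals):
    """Return one string per listed element, each with that single element left out."""
    # Split into element symbols: one uppercase letter plus any lowercase letters.
    elements = re.findall(r'[A-Z][a-z]*', composition)
    return [
        ''.join(elements[:i] + elements[i + 1:])
        for i, element in enumerate(elements)
        if element in removals  # skip symbols such as N that are not in the list
    ]

removals = {'Al', 'Cr', 'Hf', 'Mo', 'Nb', 'Ta', 'Ti', 'V', 'W', 'Zr'}

for reduced in remove_each_listed('AlCrHfMoNbN', removals):
    print(reduced)
```

Because the string is tokenised into whole symbols first, a one-letter symbol can never match inside a two-letter one, which a plain substring search cannot guarantee in general.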

Related

Using python to counts occurrence of each adjacent character

Use Python to solve this question:
Define a function that takes a string and counts each run of adjacent occurrences of a character. This function should return a string with each character and its count. For example:
'jjjjeerrr' would return 'j4e2r3'

The zip() function is your friend when you need to compare elements of a string or list with their successor or predecessor. With that you can get the indexes of the first letter in each repeated sequence. These "break" positions can then be combined (using zip again) to form start/end ranges that will give you the size of the repetition:
def rle(S):
    breaks = [i for i,(a,b) in enumerate(zip(S,S[1:]),1) if a!=b]
    return "".join(f"{S[s]}{e-s}" for s,e in zip([0]+breaks,breaks+[len(S)]))
Output:
print(rle("jjjjeerrr")) # j4e2r3
print(rle("jjjjeerrrsssjj")) # j4e2r3s3j2
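For comparison, the same result can be sketched with itertools.groupby from the standard library, which groups consecutive equal items. This is a different technique from the zip-based one above, and the name rle_groupby is my own. Unlike rle, it returns an empty string for empty input instead of raising an IndexError.

```python
from itertools import groupby

def rle_groupby(s):
    """Encode each run of identical adjacent characters as <char><run length>."""
    # groupby yields (character, iterator over the run); count the run by consuming it.
    return "".join(f"{char}{sum(1 for _ in run)}" for char, run in groupby(s))

print(rle_groupby("jjjjeerrr"))       # j4e2r3
print(rle_groupby("jjjjeerrrsssjj"))  # j4e2r3s3j2
```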

How to merge strings with overlapping characters in python?

I'm working on a Python project which reads in a URL-encoded list of overlapping strings. Each string is 15 characters long and overlaps with the next string in the sequence by at least 3 characters and at most 15 characters (identical).
The goal of the program is to go from a list of overlapping strings - either ordered or unordered - to a compressed URL encoded string.
My current method fails at duplicate segments in the overlapping strings. For example, my program is incorrectly combining:
StrList1 = [ 'd+%7B%0A++++public+', 'public+static+v','program%0Apublic+', 'ublic+class+Hel', 'lass+HelloWorld', 'elloWorld+%7B%0A+++', '%2F%2F+Sample+progr', 'program%0Apublic+']
to output:
output = ['ublic+class+HelloWorld+%7B%0A++++public+', '%2F%2F+Sample+program%0Apublic+static+v']
when correct output is:
output = ['%2F%2F+Sample+program%0Apublic+class+HelloWorld+%7B%0A++++public+static+v']
I am using simple python, not biopython or sequence aligners, though perhaps I should be?
Would greatly appreciate any advice on the matter or suggestions of a nice way to do this in python!
Thanks!

You can start with one of the strings in the list (stored as string), and for each of the remaining strings in the list (stored as candidate) where:
candidate is part of string,
candidate contains string,
candidate's tail matches the head of string,
or, candidate's head matches the tail of string,
assemble the two strings according to how they overlap, and then recursively repeat the procedure with the overlapping string removed from the remaining strings and the assembled string appended, until there is only one string left in the list, at which point it is a valid fully assembled string that can be added to the final output.
Since there can potentially be multiple ways several strings can overlap with each other, some of which can result in the same assembled strings, you should make output a set of strings instead:
def assemble(str_list, min=3, max=15):
    if len(str_list) < 2:
        return set(str_list)
    output = set()
    str_list = list(str_list)  # work on a copy so the caller's list is not modified
    string = str_list.pop()
    for i, candidate in enumerate(str_list):
        matches = set()
        if candidate in string:
            matches.add(string)
        elif string in candidate:
            matches.add(candidate)
        for n in range(min, max + 1):
            if candidate[:n] == string[-n:]:
                matches.add(string + candidate[n:])
            if candidate[-n:] == string[:n]:
                matches.add(candidate[:-n] + string)
        for match in matches:
            # pass min and max along so custom limits also apply in the recursion
            output.update(assemble(str_list[:i] + str_list[i + 1:] + [match], min, max))
    return output
so that with your sample input:
StrList1 = ['d+%7B%0A++++public+', 'public+static+v','program%0Apublic+', 'ublic+class+Hel', 'lass+HelloWorld', 'elloWorld+%7B%0A+++', '%2F%2F+Sample+progr', 'program%0Apublic+']
assemble(StrList1) would return:
{'%2F%2F+Sample+program%0Apublic+class+HelloWorld+%7B%0A++++public+static+v'}
or as an example of an input with various overlapping possibilities (that the second string can match the first by being inside, having tail matching the head, and having head matching the tail):
assemble(['abcggggabcgggg', 'ggggabc'])
would return:
{'abcggggabcgggg', 'abcggggabcggggabc', 'abcggggabcgggggabc', 'ggggabcggggabcgggg'}

Python - Parse strings with variable repeating substring

I am trying to do something which I thought would be simple (and probably is), however I am hitting a wall. I have a string that contains document numbers. In most cases the format is ######-#-### however in some cases, where the single digit should be, there are multiple single digits separated by a comma (i.e. ######-#,#,#-###). The number of single digits separated by a comma is variable. Below is an example:
For the string below:
('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
I need to return:
['030421-1-001', '030421-2-001', '030421-1-002', '030421-1-002', '030421-2-002', '030421-3-002', '030421-1-003']
I have only gotten as far as returning the strings that match the ######-#-### pattern:
import re
p = re.compile(r'\d{6}-\d{1}-\d{3}')
m = p.findall('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
print(m)
Thanks in advance for any help!
Matt

Perhaps something like this:
>>> import re
>>> s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'
>>> it = re.finditer(r'(\b\d{6}-)(\d(?:,\d)*)(-\d{3})\b', s)
>>> for m in it:
...     a, b, c = m.groups()
...     for x in b.split(','):
...         print(a + x + c)
...
030421-1-001
030421-2-001
030421-1-002
030421-1-002
030421-2-002
030421-3-002
030421-1-003
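Or using a list comprehension (the loop above exhausts the iterator, so create it again first):
>>> it = re.finditer(r'(\b\d{6}-)(\d(?:,\d)*)(-\d{3})\b', s)
>>> [a+x+c for a, b, c in (m.groups() for m in it) for x in b.split(',')]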
['030421-1-001', '030421-2-001', '030421-1-002', '030421-1-002', '030421-2-002', '030421-3-002', '030421-1-003']

Use '\d{6}-\d(,\d)*-\d{3}'.
* means "as many as you want (0 included)".
It is applied to the previous element, here '(,\d)'.
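One caveat if this pattern is passed to re.findall: when a pattern contains a capturing group, findall returns the group's contents rather than the whole match. A small sketch of the difference follows, using the question's sample string. The variable names captured and whole are mine, and the (?:...) non-capturing form is a swap-in rather than what the answer wrote.

```python
import re

s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'

# Capturing group: findall returns only the last repetition of (,\d), or '' if it never matched.
captured = re.findall(r'\d{6}-\d(,\d)*-\d{3}', s)

# Non-capturing group: findall returns each complete document number.
whole = re.findall(r'\d{6}-\d(?:,\d)*-\d{3}', s)

print(captured)
print(whole)
```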

I wouldn't use a single regular expression to try and parse this. Since it is essentially a list of strings, you might find it easier to replace the "&" with a comma globally in the string and then use split() to put the elements into a list.
Looping over the list will allow you to write a single function to parse and fix each string; you can then push the result onto a new list and display your string.
string = string.replace('&', ',')
initialList = string.split(',')
newList = []
for item in initialList:
    newItem = myfunction(item)  # myfunction is a placeholder for your own parsing function
    newList.append(newItem)
newstring = ','.join(newList)

(\d{6}-)((?:\d,?)+)(-\d{3})
We use 3 capturing groups. We match the first part and the last part the easy way. The center part is a digit optionally followed by a ',', repeated one or more times. A repeated capturing group only keeps its last repetition, so the inner group is made non-capturing with ?: and the outer group captures the whole run. What we're left with is the following result:
>>> p = re.compile(r'(\d{6}-)((?:\d,?)+)(-\d{3})')
>>> m = p.findall('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
>>> m
[('030421-', '1,2', '-001'), ('030421-', '1', '-002'), ('030421-', '1,2,3', '-002'), ('030421-', '1', '-003')]
You'll have to manually process the 2nd term to split them up and join them, but a list comprehension should be able to do that.
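The manual processing mentioned above could look roughly like this. It is a sketch that assumes the middle group never ends with a stray comma (true for the sample string); the names pattern and expanded are illustrative.

```python
import re

s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'
pattern = re.compile(r'(\d{6}-)((?:\d,?)+)(-\d{3})')

# findall returns (prefix, digits, suffix) tuples; emit one number per digit in the middle part.
expanded = [
    prefix + digit + suffix
    for prefix, digits, suffix in pattern.findall(s)
    for digit in digits.split(',')
]
print(expanded)
```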

Python: comparing list to a string

I want to know how to compare a string to a list.
For example
I have the string 'abcdab' and a list ['ab','bcd','da']. Is there any way to compare all possible list combinations to the string and avoid overlapping elements, so that the output will be a list of tuples like
[('ab','da'),('bcd'),('bcd','ab'),('ab','ab'),('ab'),('da')].
The output should avoid combinations such as ('bcd', 'da'), as the character 'd' is repeated in the tuple while it appears only once in the string.
As pointed out in the answer, the characters in the string and in the list elements must not be rearranged.
One way I tried was to split the string into all possible combinations and compare, which gave 2^(n-1) combinations, n being the number of characters. It was very time-consuming.
I am new to Python programming.
Thanks in advance.

all possible list combinations to string, and avoiding overlaping
elements
Is a combination one or more complete items, in their exact, current order in the list, that match a pattern or subpattern of the string? I believe one of the requirements is to not rearrange the items in the list (ab doesn't get substituted for ba). I believe another requirement is to not rearrange the characters in the string. If the subpattern appears twice, then you want the combinations to reflect two individual copies of the subpattern by themselves, as well as a list with both copies of the subpattern alongside other subpatterns that match too. You want multiple permutations of the matches.

This little recursive function should do the job:
def matches(string, words, start=-1):
    result = []
    for word in words:  # for each word
        pos = start
        while True:
            pos = string.find(word, pos+1)  # find the next occurrence of the word
            if pos == -1:  # if there are no more occurrences, continue with the next word
                break
            if [word] not in result:  # add the word to the result
                result.append([word])
            # recursively scan the rest of the string
            for match in matches(string, words, pos+len(word)-1):
                match = [word] + match
                if match not in result:
                    result.append(match)
    return result
output:
>>> print(matches('abcdab', ['ab','bcd','da']))
[['ab'], ['ab', 'ab'], ['ab', 'da'], ['bcd'], ['bcd', 'ab'], ['da']]

Oops! I somehow missed Rawing's answer. Oh well. :)
Here's another recursive solution.
#! /usr/bin/env python

def find_matches(template, target, output, matches=None):
    if matches is None:
        matches = []

    for s in template:
        newmatches = matches[:]
        if s in target:
            newmatches.append(s)
            # Replace matched string with a null byte so it can't get re-matched
            # and recurse to find more matches.
            find_matches(template, target.replace(s, '\0', 1), output, newmatches)
        else:
            # No (more) matches found; save current matches
            if newmatches:
                output.append(tuple(newmatches))
            return

def main():
    target = 'abcdab'
    template = ['ab','bcd','da']

    print(template)
    print(target)
    output = []
    find_matches(template, target, output)
    print(output)

if __name__ == '__main__':
    main()
output
['ab', 'bcd', 'da']
abcdab
[('ab', 'ab'), ('ab',), ('bcd', 'ab'), ('bcd',), ('da', 'ab'), ('da',)]

Split a string using a list of strings as a pattern

Consider an input string :
mystr = "just some stupid string to illustrate my question"
and a list of strings indicating where to split the input string:
splitters = ["some", "illustrate"]
The output should look like
result = ["just ", "some stupid string to ", "illustrate my question"]
I wrote some code which implements the following approach. For each of the strings in splitters, I find its occurrences in the input string, and insert something which I know for sure would not be a part of my input string (for example, this '!!'). Then I split the string using the substring that I just inserted.
for s in splitters:
    mystr = re.sub(r'(%s)'%s,r'!!\1', mystr)
result = re.split('!!', mystr)
This solution seems ugly, is there a nicer way of doing it?

Splitting with re.split will always remove the matched string from the output (NB, this is not quite true, see the edit below). Therefore, you must use positive lookahead expressions ((?=...)) to match without removing the match. However, re.split ignores empty matches (in Python versions before 3.7), so simply using a lookahead expression doesn't work there. Instead, you will lose one character at each split at minimum (even trying to trick re with "boundary" matches (\b) does not work). If you don't care about losing one whitespace / non-word character at the end of each item (assuming you only split at non-word characters), you can use something like
re.split(r"\W(?=some|illustrate)", mystr)
which would give
["just", "some stupid string to", "illustrate my question"]
(note that the spaces after just and to are missing). You could then programmatically generate these regexes using str.join. Note that each of the split markers is escaped with re.escape so that special characters in the items of splitters do not affect the meaning of the regular expression in any undesired ways (imagine, e.g., a ) in one of the strings, which would otherwise lead to a regex syntax error).
the_regex = r"\W(?={})".format("|".join(re.escape(s) for s in splitters))
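As a side note, since Python 3.7 re.split does split on empty matches, so on those versions a pure lookahead is enough and no character is lost. A minimal sketch, assuming Python 3.7 or later and the question's sample data (the names lookahead and pieces are mine):

```python
import re

mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]

# The lookahead matches the empty string just before each splitter,
# so nothing is consumed and the splitter stays at the front of the next piece.
lookahead = "(?={})".format("|".join(re.escape(s) for s in splitters))
pieces = re.split(lookahead, mystr)
print(pieces)
```

If the string itself starts with one of the splitters, the result begins with an empty string, which may need to be filtered out.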
Edit (HT to @Arkadiy): Grouping the actual match, i.e. using (\W) instead of \W, returns the non-word characters inserted into the list as separate items. Joining every two subsequent items would then produce the list as desired as well. Then, you can also drop the requirement of having a non-word character by using (.) instead of \W:
the_new_regex = r"(.)(?={})".format("|".join(re.escape(s) for s in splitters))
the_split = re.split(the_new_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest(the_split[::2], the_split[1::2], fillvalue='')]
Because normal text and auxiliary characters alternate, the_split[::2] contains the normal split text and the_split[1::2] the auxiliary characters. Then, itertools.izip_longest is used to combine each text item with the corresponding removed character, and the last item (which has no counterpart among the removed characters) with fillvalue, i.e. ''. Then, each of these tuples is joined using "".join(x). Note that this requires itertools to be imported (you could of course do this in a simple loop, but itertools provides very clean solutions to these things). Also note that itertools.izip_longest is called itertools.zip_longest in Python 3.
This leads to further simplification of the regular expression, because instead of using auxiliary characters, the lookahead can be replaced with a simple matching group ((some|illustrate) instead of (.)(?=some|illustrate)):
the_newest_regex = "({})".format("|".join(re.escape(s) for s in splitters))
the_raw_split = re.split(the_newest_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest([""] + the_raw_split[1::2], the_raw_split[::2], fillvalue='')]
Here, the slice indices on the_raw_split have swapped, because now the even-numbered items must be added to item afterwards instead of in front. Also note the [""] + part, which is necessary to pair the first item with "" to fix the order.
(end of edit)
Alternatively, you can (if you want) use string.replace instead of re.sub for each splitter (I think that is a matter of preference in your case, but in general it is probably more efficient)
for s in splitters:
mystr = mystr.replace(s, "!!" + s)
Also, if you use a fixed token to indicate where to split, you do not need re.split, but can use string.split instead:
result = mystr.split("!!")
What you could also do (instead of relying on the replacement token not to be in the string anywhere else or relying on every split position being preceded by a non-word character) is finding the split strings in the input using string.find and using string slicing to extract the pieces:
def split(string, splitters):
    while True:
        # Get the positions to split at for all splitters still in the string
        # that are not at the very front of the string
        split_positions = [i for i in (string.find(s) for s in splitters) if i > 0]
        if len(split_positions) > 0:
            # There is still somewhere to split
            next_split = min(split_positions)
            yield string[:next_split]  # Yield everything before that position
            string = string[next_split:]  # Retain the rest of the string
        else:
            yield string  # Yield the rest of the string
            break  # Done.
Here, [i for i in (string.find(s) for s in splitters) if i > 0] generates a list of positions where the splitters can be found, for all splitters that are in the string (for this, i < 0 is excluded) and not right at the beginning (where we (possibly) just split, so i == 0 is excluded as well). If there are any left in the string, we yield (this is a generator function) everything up to (excluding) the first splitter (at min(split_positions)) and replace the string with the remaining part. If there are none left, we yield the last part of the string and exit the function. Because this uses yield, it is a generator function, so you need to use list to turn it into an actual list.
Note that you could also replace yield whatever with a call to some_list.append (provided you defined some_list earlier) and return some_list at the very end, I do not consider that to be very good code style, though.
TL;DR
If you are OK with using regular expressions, use
the_newest_regex = "({})".format("|".join(re.escape(s) for s in splitters))
the_raw_split = re.split(the_newest_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest([""] + the_raw_split[1::2], the_raw_split[::2], fillvalue='')]
else, the same can also be achieved using string.find with the following split function:
def split(string, splitters):
    while True:
        # Get the positions to split at for all splitters still in the string
        # that are not at the very front of the string
        split_positions = [i for i in (string.find(s) for s in splitters) if i > 0]
        if len(split_positions) > 0:
            # There is still somewhere to split
            next_split = min(split_positions)
            yield string[:next_split]  # Yield everything before that position
            string = string[next_split:]  # Retain the rest of the string
        else:
            yield string  # Yield the rest of the string
            break  # Done.

Not especially elegant but avoiding regex:
mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]
indexes = [0] + [mystr.index(s) for s in splitters] + [len(mystr)]
indexes = sorted(list(set(indexes)))
print([mystr[i:j] for i, j in zip(indexes[:-1], indexes[1:])])
# ['just ', 'some stupid string to ', 'illustrate my question']
I should acknowledge here that a little more work is needed if a word in splitters occurs more than once because str.index finds only the location of the first occurrence of the word...
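The extra work for repeated words might be sketched as follows, still without regex: collect every occurrence by calling str.find repeatedly with an advancing start position. The helpers find_all and split_before and the sample sentence are my own; it also tolerates splitters that do not occur at all, where str.index would raise a ValueError.

```python
def find_all(text, word):
    """Yield the start index of every occurrence of word in text."""
    start = text.find(word)
    while start != -1:
        yield start
        start = text.find(word, start + 1)

def split_before(text, splitters):
    """Cut text in front of every occurrence of every splitter."""
    cuts = {0, len(text)}  # a set, so duplicate positions collapse
    for word in splitters:
        cuts.update(find_all(text, word))
    cuts = sorted(cuts)
    return [text[i:j] for i, j in zip(cuts[:-1], cuts[1:])]

sample = "some text, some more text to illustrate"
parts = split_before(sample, ["some", "illustrate"])
print(parts)
```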
