Python: comparing a list to a string

I want to know how to compare a string to a list.
For example, I have the string 'abcdab' and the list ['ab','bcd','da']. Is there any way to compare all possible combinations of list elements to the string while avoiding overlapping elements, so that the output is a list of tuples like
[('ab','da'),('bcd'),('bcd','ab'),('ab','ab'),('ab'),('da')]?
The output should avoid combinations such as ('bcd', 'da'), because the character 'd' is repeated in the tuple while it appears only once in the string.
As pointed out in the answer, the characters in the string and in the list elements must not be rearranged.
One approach I tried was to split the string into all possible combinations of pieces and compare them. That gives 2^(n-1) splits, n being the number of characters, and it was very time-consuming.
I am new to Python programming.
Thanks in advance.

all possible list combinations to string, and avoiding overlapping
elements
Is a combination one or more complete items, in their exact current order in the list, that match a pattern or subpattern of the string? I believe one requirement is not to rearrange the items in the list (ab doesn't get substituted for ba), and another is not to rearrange the characters in the string. If a subpattern appears twice, then you want the combinations to reflect two individual copies of the subpattern by themselves, as well as a list with both copies of the subpattern together with other subpatterns that match too. You want multiple permutations of the matches.

This little recursive function should do the job:
def matches(string, words, start=-1):
    result = []
    for word in words:  # for each word
        pos = start
        while True:
            pos = string.find(word, pos+1)  # find the next occurrence of the word
            if pos == -1:  # if there are no more occurrences, continue with the next word
                break
            if [word] not in result:  # add the word to the result
                result.append([word])
            # recursively scan the rest of the string
            for match in matches(string, words, pos+len(word)-1):
                match = [word] + match
                if match not in result:
                    result.append(match)
    return result
Output:
>>> print(matches('abcdab', ['ab','bcd','da']))
[['ab'], ['ab', 'ab'], ['ab', 'da'], ['bcd'], ['bcd', 'ab'], ['da']]
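Since the question asks for tuples, here is a minimal sketch of the same left-to-right search written as a generator. The name ordered_matches is made up for illustration, and the sketch assumes that a combination only counts when its words appear in the string in that order without overlapping.

```python
def ordered_matches(string, words, start=0):
    """Yield tuples of words that occur left to right in string without overlapping."""
    for word in words:
        pos = string.find(word, start)
        while pos != -1:
            # the word on its own is a valid combination
            yield (word,)
            # extend it with every combination found after the end of this word
            for rest in ordered_matches(string, words, pos + len(word)):
                yield (word,) + rest
            # look for a later occurrence of the same word
            pos = string.find(word, pos + 1)

# a set removes combinations that are reachable through several occurrences
result = sorted(set(ordered_matches('abcdab', ['ab', 'bcd', 'da'])))
print(result)
# [('ab',), ('ab', 'ab'), ('ab', 'da'), ('bcd',), ('bcd', 'ab'), ('da',)]
```

The search is still exponential in the worst case, so this is only suitable for short strings and small word lists.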

Oops! I somehow missed Rawing's answer. Oh well. :)
Here's another recursive solution.
#! /usr/bin/env python
def find_matches(template, target, output, matches=None):
    if matches is None:
        matches = []
    for s in template:
        newmatches = matches[:]
        if s in target:
            newmatches.append(s)
            # Replace matched string with a null byte so it can't get re-matched
            # and recurse to find more matches.
            find_matches(template, target.replace(s, '\0', 1), output, newmatches)
        else:
            # No (more) matches found; save current matches
            if newmatches:
                output.append(tuple(newmatches))
            return
def main():
    target = 'abcdab'
    template = ['ab','bcd','da']
    print(template)
    print(target)
    output = []
    find_matches(template, target, output)
    print(output)
if __name__ == '__main__':
    main()
output
['ab', 'bcd', 'da']
abcdab
[('ab', 'ab'), ('ab',), ('bcd', 'ab'), ('bcd',), ('da', 'ab'), ('da',)]

Related

exhaustive search over a list of complex strings without modifying original input

I am attempting to create a minimal algorithm that exhaustively searches a list of strings for duplicates and removes them by index, without changing the case of words (and therefore their meanings).
The caveat is that the list contains words such as Blood, blood, DNA, ACTN4, 34-methyl-O-carboxy, Brain, brain-facing-mouse, BLOOD and so on.
I only want to remove the duplicate 'blood' words, keep the first occurrence with the first letter capitalized, and not modify the case of any other words. Any suggestions on how I should proceed?
Here is my code
def remove_duplicates(list_of_strings):
    """ function that takes input of a list of strings,
    uses index to iterate over each string lowers each string
    and returns a list of strings with no duplicates, does not modify the original strings
    an exhaustive search to remove duplicates using index of list and list of string"""
    list_of_strings_copy = list_of_strings
    try:
        for i in range(len(list_of_strings)):
            list_of_strings_copy[i] = list_of_strings_copy[i].lower()
            word = list_of_strings_copy[i]
            for j in range(len(list_of_strings_copy)):
                if word == list_of_strings_copy[j]:
                    list_of_strings.pop(i)
                    j+=1
    except Exception as e:
        print(e)
    return list_of_strings
Make a dictionary, {text.lower():text,...}, use the keys for comparison and save the first instance of the text in the values.
d = {}
for item in list_of_strings:
    if item.lower() not in d:
        d[item.lower()] = item
d.values() should be what you want.
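As a minimal sketch, the dictionary idea can be wrapped in a function. The name remove_case_duplicates is made up for illustration, and the result order relies on dictionaries preserving insertion order (Python 3.7 and later).

```python
def remove_case_duplicates(strings):
    """Drop case-insensitive duplicates, keeping each first occurrence unchanged."""
    first_seen = {}
    for item in strings:
        # setdefault stores the item only if its lowercased key is not present yet
        first_seen.setdefault(item.lower(), item)
    return list(first_seen.values())

words = ["Blood", "blood", "DNA", "ACTN4", "34-methyl-O-carboxy",
         "Brain", "brain-facing-mouse", "BLOOD"]
print(remove_case_duplicates(words))
# ['Blood', 'DNA', 'ACTN4', '34-methyl-O-carboxy', 'Brain', 'brain-facing-mouse']
```

The input list is not modified, and words such as "DNA" keep their original case because only the dictionary keys are lowercased.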
I think something like the following would do what you need:
def remove_duplicates(list_of_strings):
    new_list = []  # create empty return list
    for string in list_of_strings:  # iterate through list of strings
        # capitalize the first letter and lowercase the rest
        # (note: this also changes e.g. "DNA" to "Dna")
        string = string[0].capitalize() + string[1:].lower()
        if string not in new_list:  # check string is not a duplicate in the returned list
            new_list.append(string)  # if string not in list, append to returned list
    return new_list  # return end list
strings = ["Blood", "blood", "DNA", "ACTN4", "34-methyl-O-carboxy", "Brain", "brain-facing-mouse", "BLOOD"]
returned_strings = remove_duplicates(strings)
print(returned_strings)
(For reference this was written in Python 3.10)

Using Python to count occurrences of each adjacent character

Use Python to solve this question:
Define a function that takes a string and counts each run of adjacent occurrences of a character. The function should return a string with each character followed by its count. For example:
'jjjjeerrr' would return 'j4e2r3'
The zip() function is your friend when you need to compare elements of a string or list with their successor or predecessor. With that you can get the indexes of the first letter in each repeated sequence. These "break" positions can then be combined (using zip again) to form start/end ranges that will give you the size of the repetition:
def rle(S):
    breaks = [i for i,(a,b) in enumerate(zip(S,S[1:]),1) if a!=b]
    return "".join(f"{S[s]}{e-s}" for s,e in zip([0]+breaks,breaks+[len(S)]))
Output:
print(rle("jjjjeerrr")) # j4e2r3
print(rle("jjjjeerrrsssjj")) # j4e2r3s3j2
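For comparison, here is a minimal sketch of the same run-length encoding built on itertools.groupby from the standard library, which groups consecutive equal characters. This is a different technique from the zip() approach above, and the name rle_groupby is made up for illustration.

```python
from itertools import groupby

def rle_groupby(s):
    """Encode each run of equal adjacent characters as the character plus the run length."""
    # groupby yields (character, iterator over the run); counting the iterator gives the length
    return "".join(f"{char}{sum(1 for _ in run)}" for char, run in groupby(s))

print(rle_groupby("jjjjeerrr"))       # j4e2r3
print(rle_groupby("jjjjeerrrsssjj"))  # j4e2r3s3j2
```

Unlike the zip() version, this variant also returns an empty string for empty input instead of raising an IndexError.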

Permutate removal of defined substrings with varying length from strings

I am trying to generate all permutations from a list of strings where certain substrings of characters are removed. I have a list of certain chemical compositions and I want all compositions resulting from that list where one of those elements is removed. A short excerpt of this list looks like this:
AlCrHfMoNbN
AlCrHfMoTaN
AlCrHfMoTiN
AlCrHfMoVN
AlCrHfMoWN
...
What I am trying to get is
AlCrHfMoNbN --> CrHfMoNbN
                AlHfMoNbN
                AlCrMoNbN
                AlCrHfNbN
                AlCrHfMoN
AlCrHfMoTaN --> CrHfMoTaN
                AlHfMoTaN
                AlCrMoTaN
                AlCrHfTaN
                AlCrHfMoN
for each composition. I just need the right column. As you can see some of the resulting compositions are duplicates and this is intended. The list of elements that need to be removed is
Al, Cr, Hf, Mo, Nb, Ta, Ti, V, W, Zr
As you see some have a length of two characters and some of only one.
There is a question that asks about something very similar, however my problem is more complex:
Getting a list of strings with character removed in permutation
I tried adjusting the code to my needs:
def f(s, c, start):
    i = s.find(c, start)
    return [s] if i < 0 else f(s, c, i+1) + f(s[:i]+s[i+1:], c, i)
s = 'AlCrHfMoNbN'
print(f(s, 'Al', 0))
But this simple approach only leads to ['AlCrHfMoNbN', 'lCrHfMoNbN']. So only one character is removed whereas I need to remove a defined string of characters with a varying length. Also I am limited to a single input object s - instead of hundreds that I need to process - so cycling through by hand is not an option.
To sum up, I need a change in the code that makes it possible to:
input a list of strings separated by either line breaks or whitespace
remove substrings from that list which are defined by a second list (just like above)
write the resulting "reduced" items to a continuous list, preferably as a single column without commas and such
Since I only have some experience with Python and Bash I strongly prefer a solution with these languages.
IIUC, all you need is str.replace:
input_list = ['AlCrHfMoNbN', 'AlCrHfMoTaN']
removals = ['Al', 'Cr', 'Hf', 'Mo', 'Nb', 'Ta', 'Ti', 'V', 'W', 'Zr']
result = {}
for i in input_list:
    result[i] = [i.replace(r,'') for r in removals if r in i]
Output:
{'AlCrHfMoNbN': ['CrHfMoNbN',
                 'AlHfMoNbN',
                 'AlCrMoNbN',
                 'AlCrHfNbN',
                 'AlCrHfMoN'],
 'AlCrHfMoTaN': ['CrHfMoTaN',
                 'AlHfMoTaN',
                 'AlCrMoTaN',
                 'AlCrHfTaN',
                 'AlCrHfMoN']}
If you have gawk, set FPAT to [A-Z][a-z]* so that each element is treated as a field, and use a simple loop to generate the permutations. Also set OFS to the empty string so there are no spaces in the output records.
$ gawk 'BEGIN{FPAT="[A-Z][a-z]*";OFS=""} {for(i=1;i<NF;++i){p=$i;$i="";print;$i=p}}' file
CrHfMoNbN
AlHfMoNbN
AlCrMoNbN
AlCrHfNbN
AlCrHfMoN
CrHfMoTaN
AlHfMoTaN
AlCrMoTaN
AlCrHfTaN
AlCrHfMoN
CrHfMoTiN
AlHfMoTiN
AlCrMoTiN
AlCrHfTiN
AlCrHfMoN
CrHfMoVN
AlHfMoVN
AlCrMoVN
AlCrHfVN
AlCrHfMoN
CrHfMoWN
AlHfMoWN
AlCrMoWN
AlCrHfWN
AlCrHfMoN
I've also written a portable one with extra spaces and explanatory comments:
awk '{
    # separate last element from others
    sub(/[A-Z][a-z]*$/, " &")
    # from the beginning of line
    # we will match each element and print a line where it is omitted
    for (i=0; match(substr($1,i), /[A-Z][a-z]*/); i+=RLENGTH)
        print substr($1,1,i) substr($1,i+RLENGTH+1) $2
    #         ^ before match ^ after match          ^ last element
}' file
This doesn't use your attempt, but it works when we assume that your elements always begin with an uppercase letter (and consist otherwise only of lowercase letters):
import re

def f(s):
    # split string by elements
    elements = re.findall('[A-Z][^A-Z]*', s)
    # make a list of strings, where the first string has the first element removed, the second string the second, ...
    r = []
    for i in range(len(elements)):
        r.append(''.join(elements[:i]+elements[i+1:]))
    # return this list
    return r
Of course this still only works for one string. So if you have a list of strings l and you want to apply it to every string in it, just use a for loop like this:
# your list of strings
l = ["AlCrHfMoNbN", "AlCrHfMoTaN", "AlCrHfMoTiN", "AlCrHfMoVN", "AlCrHfMoWN"]
# iterate through your input list
for s in l:
    # call above function
    r = f(s)
    # print out the result if you want to
    for i in r:
        print(i)
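To cover the three requirements from the question in one place, here is a minimal sketch that combines the regex split with the removal list. The names reduced_compositions, removable and text are made up for illustration, and the sketch assumes every element symbol is one uppercase letter followed by optional lowercase letters.

```python
import re

def reduced_compositions(formula, removable):
    """Return the variants of formula with exactly one removable element left out."""
    # split into element symbols, e.g. 'AlCrN' -> ['Al', 'Cr', 'N']
    elements = re.findall(r"[A-Z][a-z]*", formula)
    return [
        "".join(elements[:i] + elements[i + 1:])
        for i, symbol in enumerate(elements)
        if symbol in removable  # symbols outside the list (such as N) are never dropped
    ]

removable = {"Al", "Cr", "Hf", "Mo", "Nb", "Ta", "Ti", "V", "W", "Zr"}
# input separated by whitespace or line breaks (sample text, not read from a real file)
text = "AlCrHfMoNbN AlCrHfMoTaN\nAlCrHfMoVN"
column = [r for formula in text.split() for r in reduced_compositions(formula, removable)]
print("\n".join(column))  # one reduced composition per line, no commas
```

Splitting into symbols first avoids removing a one-letter symbol such as V from inside a longer match, and text.split() handles both spaces and line breaks.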

Getting substring between nth and mth occurrence of some sequence

I want to search a string that I know contains several occurrences of a particular character sequence and retrieve what's between two specific, numbered occurrences, preferably numbered from the end. I also want to do this as compactly as possible, since it goes inside a list comprehension.
Let's say I have the following string:
s = "foofoo\tBergen, Norway\tAffluent\tDonkey"
I want to retrieve the substring of s that is situated between the last occurrence of "\t" and the penultimate occurrence.
So in this very example: "Affluent"
Here is the comprehension I am currently using (without having pruned the string):
data = [(entries[i], entries[i+1]) for i in range(0, len(entries), 3)]
It's the string entries[i] in each entry of data that I want to prune.
rsplit splits the string starting from the right side:
a = "foofoo\tBergen, Norway\tAffluent\tDonkey"
word = a.rsplit('\t', 2)
if len(word) > 2:
    print(word[-2])
# output: Affluent
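The rsplit idea can be generalized to any position counted from the end. Below is a minimal sketch; the name field_from_end and its parameter k are made up for illustration, with k=1 meaning the text between the last and the penultimate delimiter.

```python
def field_from_end(s, delim, k=1):
    """Return the text between the k-th and (k+1)-th occurrence of delim, counted from the end."""
    # splitting at most k+1 times from the right leaves the wanted piece at index -(k+1)
    parts = s.rsplit(delim, k + 1)
    if len(parts) < k + 2:
        raise ValueError("delimiter occurs fewer than %d times" % (k + 1))
    return parts[-(k + 1)]

s = "foofoo\tBergen, Norway\tAffluent\tDonkey"
print(field_from_end(s, "\t"))     # Affluent
print(field_from_end(s, "\t", 2))  # Bergen, Norway
```

Because it is a single expression inside a function, a call such as field_from_end(entries[i], "\t") fits directly into a list comprehension.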
Assuming that the beginning of your string is treated as the 0th occurrence of the delimiter symbol:
def concat_strings(strs):
    result = ""
    for substr in strs:
        result = result + substr
    return result
def find_section(s, delim, n, m):
    tokens = s.split(delim)
    return concat_strings(tokens[n:m])
You could split the string by your character sequence and join the desired occurrences back together, using your character sequence as the joining string.
Update: For the example cited:
"\t".join(s.split("\t")[-2:-1])

Python Regex Findall Lookahead

I've created a function which searches a protein string for an open reading frame. Here it is:
import re
from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML

def orf_finder(seq,format):
    record = SeqIO.read(seq,format) #Reads in the sequence and tells biopython what format it is.
    string = [] #creates an empty list
    for i in range(3):
        string.append(record.seq[i:]) #creates a list of three lists, each holding a different reading frame.
    protein_string = [] #creates an empty list
    protein_string.append([str(i.translate()) for i in string]) #translates each list in 'string' and combines them into one long list
    regex = re.compile('M''[A-Z]'+r'*') #compiles a regular expression pattern: methionine, followed by any amino acid and ending with a stop codon.
    res = max(regex.findall(str(protein_string)), key=len) #res is a string of the longest translated orf in the sequence.
    print("The longest ORF (translated) is:\n\n",res,"\n")
    print("The first blast result for this protein is:\n")
    blast_records = NCBIXML.parse(NCBIWWW.qblast("blastp", "nr", res)) #blasts the sequence and puts the results into a 'record object'.
    blast_record = next(blast_records)
    counter = 0 #the counter is a method for outputting the first blast record. After it is printed, the counter equals '1' and therefore the loop stops.
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if counter < 1: #mechanism for stopping loop
                print('Sequence:', alignment.title)
                print('Length:', alignment.length)
                print('E value:', hsp.expect)
                print('Query:',hsp.query[0:])
                print('Match:',hsp.match[0:])
                counter = 1
The only issue is, I don't think that my regex, re.compile('M''[A-Z]'+r'*'), finds overlapping matches. I know that a lookahead clause, ?=, might solve my problem, but I can't seem to implement it without getting an error.
Does anyone know how I can get it to work?
The code above uses Biopython to read in the DNA sequence, translates it and then searches for a protein reading frame: a sequence starting with 'M' and ending with '*'.
re.compile(r"M[A-Z]+\*")
This assumes that the string you are searching for starts with 'M', followed by one or more uppercase letters 'A-Z', and ends with a '*'.
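For the overlapping matches asked about in the question, one common approach is to wrap the pattern in a lookahead with a capture group. The following is a minimal sketch under that assumption; the name overlapping_orfs and the sample sequence are made up for illustration.

```python
import re

# Zero-width lookahead with a capture group: the match itself is empty, so the
# scan advances one character at a time and every 'M' can start a candidate.
ORF_PATTERN = re.compile(r"(?=(M[A-Z]+\*))")

def overlapping_orfs(protein):
    """Return all M...* stretches, including ones nested inside another match."""
    # with one capture group, findall returns the captured text of each match
    return ORF_PATTERN.findall(protein)

sample = "MAMK*QQMT*"  # invented sequence for illustration
candidates = overlapping_orfs(sample)
print(candidates)                # ['MAMK*', 'MK*', 'MT*']
print(max(candidates, key=len))  # MAMK*
```

Without the lookahead, re.findall(r"M[A-Z]+\*", sample) returns only ['MAMK*', 'MT*'], because the nested 'MK*' lies inside text that has already been consumed.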
