Combinations of a string with specific variable characters - python

How can I collect the combinations of a string, in which certain characters (but not all) are variable?
In other words, I have an input string and a character map. The character map specifies which characters are variable, and what they could be replaced with. The function then yields all possible combinations.
To put this in context, I'm trying to collect possible variations for an OCR output string that could have been misinterpreted by the OCR engine.
Example input:
"ABCD"
Example character map:
dict(
B=("X", "Z"),
D=("E")
)
Intended output:
[
"ABCD",
"ABCE",
"AXCD",
"AXCE",
"AZCD",
"AZCE"
]

You can use itertools.product:
>>> from itertools import product
>>> s = "ABCD"
>>> d = {"B": ["X", "Z"], "D": ["E"]}
>>> poss = [[c]+d.get(c,[]) for c in s]
>>> poss
[['A'], ['B', 'X', 'Z'], ['C'], ['D', 'E']]
>>> [''.join(p) for p in product(*poss)]
['ABCD', 'ABCE', 'AXCD', 'AXCE', 'AZCD', 'AZCE']
Note that I made d["D"] a list rather than simply a string for consistency.
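If you want to keep the character map exactly as written in the question, where a value may be a tuple or a bare one-character string such as "E", the same idea can be wrapped in a generator. This is only a sketch: the name variations is made up for illustration, and it assumes that every replacement is a single character.

```python
from itertools import product


def variations(text, character_map):
    """Yield every variant of text allowed by character_map (hypothetical helper)."""
    # Each position offers the original character plus its replacements.
    # list() accepts a tuple, a list, or a one-character string such as "E".
    choices = [[ch] + list(character_map.get(ch, ())) for ch in text]
    for combo in product(*choices):
        yield "".join(combo)


print(list(variations("ABCD", {"B": ("X", "Z"), "D": "E"})))
# ['ABCD', 'ABCE', 'AXCD', 'AXCE', 'AZCD', 'AZCE']
```

Because it is a generator, long inputs do not have to be materialised as a full list. Note that a multi-character string value such as "EF" would be treated as two separate replacements, 'E' and 'F'.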

My own solution was very ugly and non-Pythonic, but here goes:
def fuzzy_search(string, character_map):
    all_variations = []
    for i, character in enumerate(string):
        if character in character_map:
            character_variations = list(character_map[character])
            character_variations.insert(0, character)
            if i == len(string) - 1:
                return [string[:-1] + variation for variation in character_variations]
            for variation in character_variations:
                sub_variations = fuzzy_search(string[i + 1:], character_map)
                for sub_variation in sub_variations:
                    all_variations.append(string[:i] + variation + sub_variation)
            return all_variations
    # No variable characters left: the string itself is the only variation
    return [string]

character_map = dict(
    B=("X", "Z"),
    D=("E")
)
print(fuzzy_search("ABCD", character_map))
Outputs:
['ABCD', 'ABCE', 'AXCD', 'AXCE', 'AZCD', 'AZCE']
I figured there should be way more elegant solutions than a recursive function with multiple loops.

Related

How to get items from a list that contain all specific characters from another one

I have two lists, and I would like to create a third one that contains only the items of the first that include all characters from the second.
I tried a few ideas with range(), for, len(), etc., but had no success.
e.g.
all_types = ['T','L','R','B','TL','TR','TB','LR','LB','BR','TLR','TLB','TRB','LRB','TBLR']
chars = ['R', 'B']
To
selected_types = ['BR', 'TRB', 'LRB', 'TBLR']
selected_types = [t for t in all_types if all(char in t for char in chars)]
You could use a set for chars and use its issubset() method to filter elements of your list:
all_types = ['T','L','R','B','TL','TR','TB','LR','LB','BR','TLR','TLB','TRB','LRB','TBLR']
chars = {'R', 'B'}
selected_types = [ t for t in all_types if chars.issubset(t) ]
# ['BR', 'TRB', 'LRB', 'TBLR']
If you can't change the type of the chars variable to a set for some reason, you could use a filter with a temporary set built on the fly:
selected_types = [*filter(set(chars).issubset, all_types)]
all_types = ['T','L','R','B','TL','TR','TB','LR','LB','BR','TLR','TLB','TRB','LRB','TBLR']
chars = ['R', 'B']
selected_types = []
for t in all_types:
    if all([c in t for c in chars]):
        selected_types.append(t)

Find the Letters Occurring Odd Number of Times

I came across a fun question, and I am wondering whether we can solve it.
The Background
In O(n) time, can we find the letters that occur an odd number of times? The output should be a list of those letters, in the same order as in the original string.
If a letter has several occurrences to choose from, take the last occurrence as the unpaired character.
Here is an example:
# note we should keep the order of letters
findodd('Hello World') == ["H", "e", " ", "W", "r", "l", "d"] # it is good
findodd('Hello World') == ["H", "l", " ", "W", "r", "e", "d"] # it is wrong
My attempt
def findodd(s):
    hash_map = {}
    # This step is a bit strange. I will show an example:
    # If I have a string 'abc', I will convert string to list = ['a','b','c'].
    # Just because we can not use dict.get(a) to lookup dict. However, dict.get('a') works well.
    s = list(s)
    res = []
    for i in range(len(s)):
        if hash_map.get(s[i]) == 1:
            hash_map[s[i]] = 0
            res.remove(s[i])
        else:
            hash_map[s[i]] = 1
            res.append(s[i])
    return res

findodd('Hello World')
Out:
["H", "e", " ", "W", "r", "l", "d"]
However, since I use list.remove, the time complexity is above O(n) in my solution.
My Question:
Can anyone give some advice about O(n) solution?
If I don't wanna use s = list(s), how to iterate over a string 'abc' to lookup the value of key = 'a' in a dict? dict.get('a') works but dict.get(a) won't work.
Source
Here are two web pages I looked at; however, they did not take the order of the letters into account and did not provide an O(n) solution.
find even time number, stack overflow
find odd time number, geeks for geeks
From Python 3.7 on, dictionaries keep their keys in insertion order. Use collections.OrderedDict for older Python versions.
Go through your word: add the letter to the dict if it is not in it yet, otherwise delete the key from the dict.
The solution is the dict.keys() collection:
t = "Hello World"
d = {}
for c in t:
    if c in d:  # even number of occurrences: delete key
        del d[c]
    else:
        d[c] = None  # odd number of occurrences: add key
print(d.keys())
Output:
dict_keys(['H', 'e', ' ', 'W', 'r', 'l', 'd'])
It's O(n) because you touch each letter in your input exactly once, and a dict lookup is O(1).
Adding and deleting keys has some overhead. If that bothers you, use a counter instead and filter the keys() collection for those with an odd count. That makes it O(2*n); 2 is a constant, so it is still O(n).
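For Python versions where a plain dict does not guarantee insertion order, the same toggle idea can be sketched with collections.OrderedDict. The function name find_odd is only illustrative.

```python
from collections import OrderedDict


def find_odd(text):
    """Return the letters that occur an odd number of times, keeping the last unpaired one."""
    seen = OrderedDict()
    for ch in text:
        if ch in seen:
            del seen[ch]      # even occurrence: drop the key
        else:
            seen[ch] = None   # odd occurrence: (re)insert at the end
    return list(seen)


print(find_odd("Hello World"))
# ['H', 'e', ' ', 'W', 'r', 'l', 'd']
```

Deleting a key and inserting it again later moves it to the end, which is what makes the last unpaired occurrence determine the position in the result.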
Here is an attempt (keys are ordered in python 3.6 dict):
from collections import defaultdict

def find_odd(s):
    counter = defaultdict(int)
    for x in s:
        counter[x] += 1
    return [l for l, c in counter.items() if c%2 != 0]
The complexity of this algorithm is less than 2n, which is O(n)!
Example
>>> s = "hello world"
>>> find_odd(s)
['h', 'e', 'l', ' ', 'w', 'r', 'd']
You could use the hash map to store the index at which a character occurs, and toggle it when it already has a value.
And then you just iterate the string again and only keep those letters that occur at the index that you have in the hash map:
from collections import defaultdict

def findodd(s):
    hash_map = defaultdict(int)
    for i, c in enumerate(s):
        hash_map[c] = 0 if hash_map[c] else i+1
    return [c for i, c in enumerate(s) if hash_map[c] == i+1]
My solution from scratch
It actually uses the feature that a dict in Python 3.6 is key-ordered.
def odd_one_out(s):
    hash_map = {}
    # reverse the original string to capture the last occurrence
    s = list(reversed(s))
    res = []
    for i in range(len(s)):
        if hash_map.get(s[i]):
            hash_map[s[i]] += 1
        else:
            hash_map[s[i]] = 1
    for k,v in hash_map.items():
        if v % 2 != 0:
            res.append(k)
    return res[::-1]
Crazy super short solution
# from user FArekkusu on Codewars
from collections import Counter

def find_odd(s):
    d = Counter(reversed(s))
    return [x for x in d if d[x] % 2][::-1]
Using Counter from collections will give you an O(n) solution. And since the Counter object is a dictionary (which keeps the occurrence order), your result can simply be a filter on the counts:
from collections import Counter
text = 'Hello World'
oddLetters = [ char for char,count in Counter(text).items() if count&1 ]
print(oddLetters) # ['H', 'e', 'l', ' ', 'W', 'r', 'd']

Get sequences from a file and store them into a list in python

Here is the code (I took it from the discussion "Translation DNA to Protein", but here I'm using an RNA file instead of a DNA file):
from itertools import takewhile

def translate_rna(sequence, d, stop_codons=('UAA', 'UGA', 'UAG')):
    start = sequence.find('AUG')
    # Take sequence from the first start codon
    trimmed_sequence = sequence[start:]
    # Split it into triplets
    codons = [trimmed_sequence[i:i + 3] for i in range(0, len(trimmed_sequence), 3)]
    # Take all codons until first stop codon
    coding_sequence = takewhile(lambda x: x not in stop_codons and len(x) == 3, codons)
    # Translate with the codon table passed in as d and join into string
    protein_sequence = ''.join([d[codon] for codon in coding_sequence])
    # This line assumes there is always stop codon in the sequence
    return "{0}".format(protein_sequence)
Calling the translate_rna function:
sequence = ''
for line in open("to_rna", "r"):
    sequence += line.strip()
translate_rna(sequence, d)
My to_rna file looks like:
CCGCCCCUCUGCCCCAGUCACUGAGCCGCCGCCGAGGAUUCAGCAGCCUCCCCCUUGAGCCCCCUCGCUU
CCCGACGUUCCGUUCCCCCCUGCCCGCCUUCUCCCGCCACCGCCGCCGCCGCCUUCCGCAGGCCGUUUCC
ACCGAGGAAAAGGAAUCGUAUCGUAUGUCCGCUAUCCAG.........
The function translates only the first protein (from the first AUG to the first stop codon).
I think the problem is in this line:
# Take all codons until first stop codon
coding_sequence = takewhile(lambda x: x not in stop_codons and len(x) == 3 , codons)
My question is: how can I tell Python, after it finds the first AUG and stores the coding sequence in a list, to search for the next AUG in the RNA file and store that sequence in the next position?
As a result, I want to have a list like this:
['here_is_the_1st_coding_sequence', 'here_is_the_2nd_coding_sequence', ...]
PS: This is homework, so I can't use Biopython.
EDIT:
A simple way to describe the problem:
From this code:
from itertools import takewhile
lst = ['N', 'A', 'B', 'Z', 'C', 'A', 'V', 'V', 'Z', 'X']
ch = ''.join(lst)
stop = 'Z'
start = ch.find('A')
seq = takewhile(lambda x: x not in stop, ch)
I want to get this:
['AB', 'AVV']
EDIT 2:
For instance, from this string:
UUUAUGCGCCGCUAACCCAUGGUUCCCUAGUGGUCCUGACGCAUGUGA
I should get as result:
['AUGCGCCGC', 'AUGGUUCCC', 'AUG']
Looking at your basic code (I couldn't quite follow your main code), it looks like you just want to take the substring that starts at the first index of one string and split it on all occurrences of another string. If that is wrong, please tell me and I can update accordingly.
To achieve this, Python has a built-in str.split(sub), which splits a string at every occurrence of sub, and str.index(sub), which returns the first index of sub. Example:
>>> ch = 'NABZCAVZX'
>>> ch[ch.index('A'):].split('Z')
['AB', 'CAV', 'X']
You can also specify substrings that are longer than one character:
>>> ch = 'NACBABQZCVEZTZCGE'
>>> ch[ch.index('AB'):].split('ZC')
['ABQ', 'VEZT', 'GE']
Using multiple delimiters:
>>> import re
>>> stop_codons = ['UAA','UGA','UAG']
>>> delim = re.compile('|'.join(stop_codons))
>>> ch = 'CCHAUAABEGTAUAAVEGTUGAVKEGUAABEGEUGABRLVBUAGCGGA'
>>> delim.split(ch)
['CCHA', 'BEGTA', 'VEGT', 'VKEG', 'BEGE', 'BRLVB', 'CGGA']
Note that there is no order preference to the split, i.e. if there is a UGA ahead of a UAA, it will still split on the UGA. I am not sure if that's what you want.
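If the goal is the output from EDIT 2 of the question, one possible approach is to scan for each AUG, read codons in frame until a stop codon, and then resume the search after that stop codon. This is a sketch under those assumptions, not a full translator; the function name coding_sequences is made up for illustration, and it uses only built-in string methods.

```python
def coding_sequences(rna, start="AUG", stop_codons=("UAA", "UGA", "UAG")):
    """Collect each stretch from a start codon up to (not including) the next in-frame stop codon."""
    sequences = []
    pos = rna.find(start)
    while pos != -1:
        end = pos
        # Read complete in-frame codons until a stop codon or the end of the string
        while end + 3 <= len(rna) and rna[end:end + 3] not in stop_codons:
            end += 3
        sequences.append(rna[pos:end])
        # Resume the search for the next start codon after the stop codon
        pos = rna.find(start, end + 3)
    return sequences


rna = "UUUAUGCGCCGCUAACCCAUGGUUCCCUAGUGGUCCUGACGCAUGUGA"
print(coding_sequences(rna))
# ['AUGCGCCGC', 'AUGGUUCCC', 'AUG']
```

Each entry of the returned list could then be translated codon by codon with the codon table from the question.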

How to split a string into parts that each part contains only same characters in python

I want to take a DNA sequence as a string and split it into the parts of a list. Each part must contain only identical characters, and the final output must be a list that follows the order of the original sequence. I am using Python 3.4.
Ex: infected = "AATTTGCCAAA"
I need to get the following output:
Modified = ['AA', 'TTT', 'G', 'CC', 'AAA']
That's what itertools.groupby is for:
>>> from itertools import groupby
>>> infected ="AATTTGCCAAA"
>>>
>>> [''.join(g) for _,g in groupby(infected)]
['AA', 'TTT', 'G', 'CC', 'AAA']
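To make it clearer what groupby hands back: it yields one (key, group) pair per run of consecutive equal items, where group is an iterator over that run. A small sketch, with the helper names split_runs and run_lengths chosen only for illustration:

```python
from itertools import groupby


def split_runs(sequence):
    """Split a string into runs of identical consecutive characters."""
    return ["".join(group) for _key, group in groupby(sequence)]


def run_lengths(sequence):
    """Return (character, run length) pairs in order of appearance."""
    return [(key, len(list(group))) for key, group in groupby(sequence)]


infected = "AATTTGCCAAA"
print(split_runs(infected))   # ['AA', 'TTT', 'G', 'CC', 'AAA']
print(run_lengths(infected))  # [('A', 2), ('T', 3), ('G', 1), ('C', 2), ('A', 3)]
```

Since groupby only merges neighbouring items, the two separate runs of 'A' stay separate, which is what the question asks for.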
def fchar(ch, mi):
    fc = ch
    li = ""
    # collect the run of characters equal to fc, starting at index mi
    for c in infected[mi:]:
        if fc == c:
            li += fc
            mi = mi + 1
        else:
            break
    if mi < len(infected):
        return li + " " + fchar(infected[mi], mi)
    else:
        return li

infected = input("Enter DNA sequence\n")  # e.g. "AAATTTTTTTTGCCCCCCA"
x = fchar(infected[0], 0)
newSet = x.split(' ')
print(newSet)

all combination of a complicated list

I want to find all possible combinations of the following list:
data = ['a','b','c','d']
I know it looks like a straightforward task that can be achieved with something like the following code:
comb = [c for i in range(1, len(data)+1) for c in combinations(data, i)]
but what I actually want is a way to give each element of the list data two possibilities ('a' or '-a').
Examples of such combinations are ['a','b'], ['-a','b'], ['a','b','-c'], etc.,
but of course not something like ['-a','a'].
You could write a generator function that takes a sequence and yields each possible combination of negations. Like this:
import itertools

def negations(seq):
    for prefixes in itertools.product(["", "-"], repeat=len(seq)):
        yield [prefix + value for prefix, value in zip(prefixes, seq)]

print(list(negations(["a", "b", "c"])))
Result (whitespace modified for clarity):
[
[ 'a', 'b', 'c'],
[ 'a', 'b', '-c'],
[ 'a', '-b', 'c'],
[ 'a', '-b', '-c'],
['-a', 'b', 'c'],
['-a', 'b', '-c'],
['-a', '-b', 'c'],
['-a', '-b', '-c']
]
You can integrate this into your existing code with something like
comb = [x for i in range(1, len(data)+1) for c in combinations(data, i) for x in negations(c)]
Once you have the regular combinations generated, you can do a second pass to generate the ones with "negation." I'd think of it like a binary number, with the number of elements in your list being the number of bits. Count from 0b0000 to 0b1111 via 0b0001, 0b0010, etc., and wherever a bit is set, negate that element in the result. This will produce 2^n combinations for each input combination of length n.
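The binary-counter idea described above can be sketched roughly like this. The helper name with_negations is made up for illustration, and the two-element data list is just a small example.

```python
from itertools import combinations


def with_negations(combo):
    """Yield every sign assignment for one combination, driven by a bit mask."""
    n = len(combo)
    for mask in range(2 ** n):  # counts 0b00..0, 0b00..1, ..., 0b11..1
        yield tuple(
            # bit i of mask set means: negate element i
            ("-" if mask & (1 << i) else "") + item
            for i, item in enumerate(combo)
        )


data = ["a", "b"]
result = [
    signed
    for size in range(1, len(data) + 1)
    for combo in combinations(data, size)
    for signed in with_negations(combo)
]
print(result)
```

For the two-element list this gives 2 + 2 + 4 = 8 tuples, and a pair such as ('-a', 'a') can never appear because each element is used at most once per combination.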
Here is a one-liner, but it can be hard to follow:
from itertools import product
comb = [sum(t, []) for t in product(*[([x], ['-' + x], []) for x in data])]
First, map each element of data to the list of forms it can take in a result (itself, its negation, or absent). Then take the product to get all possibilities. Finally, flatten each combination with sum.
My solution basically has the same idea as John Zwinck's answer. After you have produced the list of all combinations
comb = [c for i in range(1, len(data)+1) for c in combinations(data, i)]
you generate all possible positive/negative combinations for each element of comb. I do this by iterating through the total number of sign combinations, 2**N for a combination of N elements, and treating the counter as a binary number, where each binary digit stands for the sign of one element. (E.g. a two-element list would have 4 possible combinations, 0 to 3, represented by 0b00 => (+,+), 0b01 => (-,+), 0b10 => (+,-) and 0b11 => (-,-).)
def twocombinations(it):
    sign = lambda c, i: "-" if c & 2**i else ""
    l = list(it)
    if len(l) < 1:
        return
    # for each possible combination, make a tuple with the appropriate
    # sign before each element
    for c in range(2**len(l)):
        yield tuple(sign(c, i) + el for i, el in enumerate(l))
Now we apply this function to every element of comb and flatten the resulting nested iterator:
l = itertools.chain.from_iterable(map(twocombinations, comb))
