So I have a DNA sequence
DNA = "TANNNT"
where N = ["A", "G", "C", "T"]
I want to have all possible output of TAAAAT, TAAAGT, TAAACT, TAAATT..... and so on.
Right now from online I found solution of permutations where I can do
perms = [''.join(p) for p in permutations(N, 3)]
then just iterate my DNA sequence as
TA + perms + T
but I wonder if there is easier way to do this, because I have a lot more DNA sequences and make take a lot more time to hard code it.
Edit:
The hard code part will be as in I would have to state
N1 = [''.join(p) for p in permutations(N, 1)]
N2 = [''.join(p) for p in permutations(N, 2)]
N3 = [''.join(p) for p in permutations(N, 3)]
then do for i in N3:
key = "TA" + N3[i] + "T"
Since my sequence is quite long, I don't want count how many consecutive N I have in the sequence and want to see if there is better way to do this.
You can use your permutation results to format a string like:
Code:
import itertools as it
import re
def convert_sequence(base_string, target_letter, perms):
REGEX = re.compile('(%s+)' % target_letter)
match = REGEX.search(base_string).group(0)
pattern = REGEX.sub('%s', base_string)
return [pattern % ''.join(p) for p in it.permutations(perms, len(match))]
Test Code:
print(convert_sequence('TANNNT', 'N', ['A', 'G', 'C', 'T']))
Results:
['TAAGCT', 'TAAGTT', 'TAACGT', 'TAACTT', 'TAATGT',
'TAATCT', 'TAGACT', 'TAGATT', 'TAGCAT', 'TAGCTT',
'TAGTAT', 'TAGTCT', 'TACAGT', 'TACATT', 'TACGAT',
'TACGTT', 'TACTAT', 'TACTGT', 'TATAGT', 'TATACT',
'TATGAT', 'TATGCT', 'TATCAT', 'TATCGT']
Related
So I have a string like this:
"abc"
I would need:
"abc"
"acb"
"bca"
"bac"
"cab"
"cba"
I tried:
string = "abc"
combinations = []
for i in range(len(string)):
acc = string[i]
for ii in range(i+1,i+len(string)):
acc += string[ii%len(string)]
combinations.append(acc)
combinations.append(acc[::-1])
print(combinations)
If works for string of size 3, but I believe it is very inefficient and also doesn't work for "abcd". Is there a better approach?
Update: I would like to solve by providing an algorithm. Actually currently working in a recursive way to solve it. Would prefer a solution that is not a python function solving the problem for me
For permutations:
Using Itertools Library:
from itertools import permutations
ini_str = "abc"
print("Initial string", ini_str)
permutation = [''.join(p) for p in permutations(ini_str)]
print("Resultant List", str(permutation))
Initial string abc
Resultant List ['abc', 'acb', 'bac', 'bca', 'cab', 'cba']
Recursive Method:
def permutations(remaining, candidate=""):
if len(remaining) == 0:
print(candidate)
for i in range(len(remaining)):
newCandidate = candidate + remaining[i]
newRemaining = remaining[0:i] + remaining[i+1:]
permutations(newRemaining, newCandidate)
if __name__ == '__main__':
s = "ABC"
permutations(s)
Iterative Method:
def permutations(s):
partial = []
partial.append(s[0])
for i in range(1, len(s)):
for j in reversed(range(len(partial))):
# remove current partial permutation from the list
curr = partial.pop(j)
for k in range(len(curr) + 1):
partial.append(curr[:k] + s[i] + curr[k:])
print(partial, end='')
if __name__ == '__main__':
s = "ABC"
permutations(s)
Try itertools.permutations:
from itertools import permutations
print('\n'.join(map(''.join, list(permutations("abc", 3)))))
Output:
abc
acb
bac
bca
cab
cba
edit:
For an algorithm:
string = "abc"
def combinations(head, tail=''):
if len(head) == 0:
print(tail)
else:
for i in range(len(head)):
combinations(head[:i] + head[i+1:], tail + head[i])
combinations(string)
Output:
abc
acb
bac
bca
cab
cba
Your logic is correct, but you want permutation, not combination. Try:
from itertools import permutations
perms = permutations("abc", r=3)
# perms is a python generator and not a list.
You can easily create a list from that generator if you want.
1. Use permutations from itertools
from itertools import permutations
s = 'abc'
permutation = [''.join(p) for p in permutations(s)]
print(permutation)
# ['abc', 'acb', 'bac', 'bca', 'cab', 'cba']
2. Implement the algorithm
s = 'abc'
result = []
def permutations(string, step = 0):
if step == len(string):
result.append("".join(string))
for i in range(step, len(string)):
string_copy = [character for character in string]
string_copy[step], string_copy[i] = string_copy[i], string_copy[step]
permutations(string_copy, step + 1)
permutations(s)
print(result)
# ['abc', 'acb', 'bac', 'bca', 'cba', 'cab']
I am trying to write a python code that finds restriction enzyme sites within a sequence of DNA. Restriction enzymes cut at specific DNA sequences, however some are not so strict, for example XmnI cuts this sequence:
GAANNNNTTC
Where N can be any nucleotide (A, C, G, or T). If my math is right thats 4^4 = 256 unique sequences that it can cut. I want to make a list of these 256 short sequences, then check each one against a (longer) input DNA sequence. However, I'm having a hard time generating the 256 sequences. Here's what I have so far:
cutsequencequery = "GAANNNNTTC"
Nseq = ["A", "C", "G", "T"]
querylist = []
if "N" in cutsequencequery:
Nlist = [cutsequencequery.replace("N", t) for t in Nseq]
for j in list(Nlist):
querylist.append(j)
for i in querylist:
print(i)
print(len(querylist))
and here is the output:
GAAAAAATTC
GAACCCCTTC
GAAGGGGTTC
GAATTTTTTC
4
So it's switching each N to either A, C, G, and T, but I think I need another loop (or 3?) to generate all 256 combinations. Is there an efficient way to do this that I'm not seeing?
Maybe you should take a look into python's itertools library, which include product which creates an iterable with every combination of iterables, therefore:
from itertools import product
cutsequencequery = "GAANNNNTTC"
nseq = ["A", "C", "G", "T"]
size = cutsequencequery.count('N')
possibilities = product(*[nseq for i in range(size)])
# = ('A', 'A', 'A', 'A'), ... , ('T', 'T', 'T', 'T')
# len(list(possibilities)) = 256 = 4^4, as expected
s = set()
for n in possibilities:
print(''.join(n)) # = 'AAAA', ..., 'TTTT'
new_sequence = cutsequencequery.replace('N' * size, ''.join(n))
s.add(new_sequence)
print(new_sequence) # = 'GAAAAAATTC', ..., 'GAATTTTTTC'
print(len(s)) # 256 unique sequences
Here is the code (i took it from this discussion Translation DNA to Protein, but here i'm using RNA instead of DNA file):
from itertools import takewhile
def translate_rna(sequence, d, stop_codons=('UAA', 'UGA', 'UAG')):
start = sequence.find('AUG')
# Take sequence from the first start codon
trimmed_sequence = sequence[start:]
# Split it into triplets
codons = [trimmed_sequence[i:i + 3] for i in range(0, len(trimmed_sequence), 3)]
# Take all codons until first stop codon
coding_sequence = takewhile(lambda x: x not in stop_codons and len(x) == 3, codons)
# Translate and join into string
protein_sequence = ''.join([codontable[codon] for codon in coding_sequence])
# This line assumes there is always stop codon in the sequence
return "{0}".format(protein_sequence)
Calling the translate_rna function:
sequence = ''
for line in open("to_rna", "r"):
sequence += line.strip()
translate_rna(sequence, d)
My to_rna file looks like:
CCGCCCCUCUGCCCCAGUCACUGAGCCGCCGCCGAGGAUUCAGCAGCCUCCCCCUUGAGCCCCCUCGCUU
CCCGACGUUCCGUUCCCCCCUGCCCGCCUUCUCCCGCCACCGCCGCCGCCGCCUUCCGCAGGCCGUUUCC
ACCGAGGAAAAGGAAUCGUAUCGUAUGUCCGCUAUCCAG.........
The function translate only the first proteine (from the first AUG to the first stop_codon)
I think the problem is in this line:
# Take all codons until first stop codon
coding_sequence = takewhile(lambda x: x not in stop_codons and len(x) == 3 , codons)
My question is : How can i tell python (after finding the first AUG and store it into coding_sequence as a list) to search again the next AUG in the RNA file and sotre it in the next position.
As a result, i wanna have a list like that:
['here_is_the_1st_coding_sequence', 'here_is_the_2nd_coding_sequence', ...]
PS : This is a homework, so i can't use Biopython.
EDIT:
A simple way to describe the problem:
From this code:
from itertools import takewhile
lst = ['N', 'A', 'B', 'Z', 'C', 'A', 'V', 'V' 'Z', 'X']
ch = ''.join(lst)
stop = 'Z'
start = ch.find('A')
seq = takewhile(lambda x: x not in stop, ch)
I want to get this:
['AB', 'AVV']
EDIT 2:
For instance, from this string:
UUUAUGCGCCGCUAACCCAUGGUUCCCUAGUGGUCCUGACGCAUGUGA
I should get as result:
['AUGCGCCGC', 'AUGGUUCCC', 'AUG']
looking at your basic code, because I couldn't quite follow your main stuff, it looks like you just want to split your string on all occurences of another string, and substring the string starting from the index of another string. If that is wrong, please tell me and I can update accordingly.
To achieve this, python has a builtin str.split(sub) which splits a string at every occurence of sub. Also, it has a str.index(sub) which returns the first index of sub. Example:
>>> ch = 'NABZCAVZX'
>>> ch[ch.index('A'):].split('Z')
['AB', 'CAV', 'X']
you can also specify sub strings that aren't just one char:
>>> ch = 'NACBABQZCVEZTZCGE'
>>> ch[ch.index('AB'):].split('ZC')
['ABQ', 'VEZT', 'GE']
Using multiple delimiters:
>>> import re
>>> stop_codons = ['UAA','UGA','UAG']
>>> re.compile('|'.join(stop_codons))\
>>> delim = re.compile('|'.join(stop_codons))
>>> ch = 'CCHAUAABEGTAUAAVEGTUGAVKEGUAABEGEUGABRLVBUAGCGGA'
>>> delim.split(ch)
['CCHA', 'BEGTA', 'VEGT', 'VKEG', 'BEGE', 'BRLVB', 'CGGA']
note that there is no order preferance to the split, ie if there is a UGA string ahead of a UAA, it will still split on the UGA. I am not sure if thats what you want but thats it.
How can I collect the combinations of a string, in which certain characters (but not all) are variable?
In other words, I have an input string and a character map. The character map specifies which characters are variable, and what they could be replaced with. The function then yields all possible combinations.
To put this in context, I'm trying to collect possible variations for an OCR output string that could have been misinterpreted by the OCR engine.
Example input:
"ABCD"
Example character map:
dict(
B=("X", "Z"),
D=("E")
)
Intended output:
[
"ABCD",
"ABCE",
"AXCD",
"AXCE",
"AZCD",
"AZCE"
]
You can use itertools.product:
>>> from itertools import product
>>> s = "ABCD"
>>> d = {"B": ["X", "Z"], "D": ["E"]}
>>> poss = [[c]+d.get(c,[]) for c in s]
>>> poss
[['A'], ['B', 'X', 'Z'], ['C'], ['D', 'E']]
>>> [''.join(p) for p in product(*poss)]
['ABCD', 'ABCE', 'AXCD', 'AXCE', 'AZCD', 'AZCE']
Note that I made d["D"] a list rather than simply a string for consistency.
My own solution was very ugly and non-Pythonic, but here goes:
def fuzzy_search(string, character_map):
all_variations = []
for i, character in enumerate(string):
if character in character_map:
character_variations = list(character_map[character])
character_variations.insert(0, character)
if i == len(string) - 1:
return [string[:-1] + variation for variation in character_variations]
for variation in character_variations:
sub_variations = fuzzy_search(string[i + 1:], character_map)
for sub_variation in sub_variations:
all_variations.append(string[:i] + variation + sub_variation)
return all_variations
return all_variations
map = dict(
B=("X", "Z"),
D=("E")
)
print fuzzy_search("ABCD", map)
Outputs:
['ABCD', 'ABCE', 'AXCD', 'AXCE', 'AZCD', 'AZCE']
I figured there should be way more elegant solutions than a recursive function with multiple loops.
I want to find all possible combination of the following list:
data = ['a','b','c','d']
I know it looks a straightforward task and it can be achieved by something like the following code:
comb = [c for i in range(1, len(data)+1) for c in combinations(data, i)]
but what I want is actually a way to give each element of the list data two possibilities ('a' or '-a').
An example of the combinations can be ['a','b'] , ['-a','b'], ['a','b','-c'], etc.
without something like the following case of course ['-a','a'].
You could write a generator function that takes a sequence and yields each possible combination of negations. Like this:
import itertools
def negations(seq):
for prefixes in itertools.product(["", "-"], repeat=len(seq)):
yield [prefix + value for prefix, value in zip(prefixes, seq)]
print list(negations(["a", "b", "c"]))
Result (whitespace modified for clarity):
[
[ 'a', 'b', 'c'],
[ 'a', 'b', '-c'],
[ 'a', '-b', 'c'],
[ 'a', '-b', '-c'],
['-a', 'b', 'c'],
['-a', 'b', '-c'],
['-a', '-b', 'c'],
['-a', '-b', '-c']
]
You can integrate this into your existing code with something like
comb = [x for i in range(1, len(data)+1) for c in combinations(data, i) for x in negations(c)]
Once you have the regular combinations generated, you can do a second pass to generate the ones with "negation." I'd think of it like a binary number, with the number of elements in your list being the number of bits. Count from 0b0000 to 0b1111 via 0b0001, 0b0010, etc., and wherever a bit is set, negate that element in the result. This will produce 2^n combinations for each input combination of length n.
Here is one-liner, but it can be hard to follow:
from itertools import product
comb = [sum(t, []) for t in product(*[([x], ['-' + x], []) for x in data])]
First map data to lists of what they can become in results. Then take product* to get all possibilities. Finally, flatten each combination with sum.
My solution basically has the same idea as John Zwinck's answer. After you have produced the list of all combinations
comb = [c for i in range(1, len(data)+1) for c in combinations(data, i)]
you generate all possible positive/negative combinations for each element of comb. I do this by iterating though the total number of combinations, 2**(N-1), and treating it as a binary number, where each binary digit stands for the sign of one element. (E.g. a two-element list would have 4 possible combinations, 0 to 3, represented by 0b00 => (+,+), 0b01 => (-,+), 0b10 => (+,-) and 0b11 => (-,-).)
def twocombinations(it):
sign = lambda c, i: "-" if c & 2**i else ""
l = list(it)
if len(l) < 1:
return
# for each possible combination, make a tuple with the appropriate
# sign before each element
for c in range(2**(len(l) - 1)):
yield tuple(sign(c, i) + el for i, el in enumerate(l))
Now we apply this function to every element of comb and flatten the resulting nested iterator:
l = itertools.chain.from_iterable(map(twocombinations, comb))