I am trying to write Python code that finds restriction enzyme sites within a DNA sequence. Restriction enzymes cut at specific DNA sequences, but some are less strict. For example, XmnI cuts this sequence:
GAANNNNTTC
where N can be any nucleotide (A, C, G, or T). If my math is right, that's 4^4 = 256 unique sequences that it can cut. I want to make a list of these 256 short sequences, then check each one against a (longer) input DNA sequence. However, I'm having a hard time generating the 256 sequences. Here's what I have so far:
cutsequencequery = "GAANNNNTTC"
Nseq = ["A", "C", "G", "T"]
querylist = []
if "N" in cutsequencequery:
    Nlist = [cutsequencequery.replace("N", t) for t in Nseq]
    for j in list(Nlist):
        querylist.append(j)
for i in querylist:
    print(i)
print(len(querylist))
and here is the output:
GAAAAAATTC
GAACCCCTTC
GAAGGGGTTC
GAATTTTTTC
4
So it's switching every N to A, C, G, or T all at once, but I think I need another loop (or 3?) to generate all 256 combinations. Is there an efficient way to do this that I'm not seeing?
Take a look at Python's itertools library. It includes product, which creates an iterable over every combination of elements drawn from the input iterables (their Cartesian product):
from itertools import product
cutsequencequery = "GAANNNNTTC"
nseq = ["A", "C", "G", "T"]
size = cutsequencequery.count('N')
possibilities = product(nseq, repeat=size)
# = ('A', 'A', 'A', 'A'), ... , ('T', 'T', 'T', 'T')
# len(list(possibilities)) = 256 = 4^4, as expected
s = set()
for n in possibilities:
    print(''.join(n)) # = 'AAAA', ..., 'TTTT'
    # replacing 'N' * size assumes the Ns form one contiguous run
    new_sequence = cutsequencequery.replace('N' * size, ''.join(n))
    s.add(new_sequence)
    print(new_sequence) # = 'GAAAAAATTC', ..., 'GAATTTTTTC'
print(len(s)) # 256 unique sequences
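If the end goal is locating cut sites rather than listing all 256 variants, a regular expression can stand in for the expansion. The sketch below is an alternative technique, not the approach from the answer above. The helper name find_sites is hypothetical, and the code assumes the DNA string contains only upper-case A, C, G, and T.

```python
import re

def find_sites(dna, site):
    """Return the start index of every occurrence of `site` in `dna`.

    An `N` in `site` matches any single base; other characters match literally.
    """
    pattern = "".join("[ACGT]" if base == "N" else re.escape(base) for base in site)
    # The lookahead (?=...) consumes no characters, so overlapping sites are reported too.
    return [match.start() for match in re.finditer(f"(?={pattern})", dna)]

print(find_sites("TTGAAACGTTTCAA", "GAANNNNTTC"))  # [2]
```

This avoids building the list of variants at all, which may matter for recognition sites with many ambiguous positions.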
I just started to use list comprehensions and I'm struggling with them. In this case, I need the index n in each list (sequence_0 and sequence_1) that the iteration is at each time. How can I do that?
The idea is to get the longest sequence of equal nucleotides (a motif) between the two sequences. Once a pair is found, the program should continue to the next nucleotides of the sequences, checking whether they are also equal and, if so, elongating the motif with them. The final output should be a list of all the motifs found.
The problem is that, to continue to the next nucleotides once a pair is found, I need the position of the pair in both sequences. The index function does not work in this case, and that's why I need enumerate.
Also, I don't understand exactly the reason for the x and y between (); it would be good to understand that too :)
Just to explain, the content of the lists is DNA sequences, so it's basically something like:
sequence_1 = ['A', 'T', 'C', 'A', 'C']
def find_shared_motif(arq):
data = fastaread(arq)
seqs = [list(sequence) for sequence in data.values()]
motifs = [[]]
i = 0
sequence_0, sequence_1 = seqs[0], seqs[1] # just to simplify
for x, y in [(x, y) for x in zip(sequence_0[::], sequence_0[1::]) for y in zip(sequence_1[::], sequence_1[1::])]:
print(f'Pairs {"".join(x)} and {"".join(y)} being analyzed...')
if x == y:
print(f'Pairs {"".join(x)} and {"".join(y)} match!')
motifs[i].append(x[0]), motifs[i].append(x[1])
k = sequence_0.index(x[0]) + 2 # IS NOT RETURNING THE RIGHT NUMBER
u = sequence_1.index(y[0]) + 2
print(k, u)
# Determines if the rest of the sequence is compatible
print(f'Starting to elongate the motif {x}...')
for j, m in enumerate(sequence_1[u::]):
try:
# Checks if the nucleotide is equal for both of the sequences
print(f'Analyzing the pair {sequence_0[k + j]}, {m}')
if m == sequence_0[k + j]:
motifs[i].append(m)
print(f'The pair {sequence_0[k + j]}, {m} is equal!')
# Stop in the first nonequal residue
else:
print(f'The pair {sequence_0[k + j]}, {m} is not equal.')
break
except IndexError:
print('IndexError, end of the string')
else:
i += 1
motifs.append([])
return motifs
...
One way to go with it is to start zipping both lists:
a = ['A', 'T', 'C', 'A', 'C']
b = ['A', 'T', 'C', 'C', 'T']
c = list(zip(a,b))
In that case, c will have the list of tuples below
c = [('A','A'), ('T','T'), ('C','C'), ('A','C'), ('C','T')]
Then, you can go with list comprehension and enumerate:
d = [(i, t) for i, t in enumerate(c)]
This will give you something like this:
d = [(0, ('A','A')), (1, ('T','T')), (2, ('C','C')), ...]
Of course you can go for a one-liner, if you want:
d = [(i, t) for i, t in enumerate(zip(a,b))]
>>> [(0, ('A','A')), (1, ('T','T')), (2, ('C','C')), ...]
Now, you have to deal with the nested tuples. Focus on the internal ones. It is obvious that what you want is to compare the first element of the tuples with the second ones. But, also, you will need the position where the difference resides (that lies outside). So, let's build a function for it. Inside the function, i will capture the positions, and t will capture the inner tuples:
def compare(a, b):
    d = [(i, t) for i, t in enumerate(zip(a,b))]
    for i, t in d:
        if t[0] != t[1]:
            return i
    return -1
In that way, if you get -1 at the end, it means that all elements in both lists are equal, side by side. Otherwise, you will get the position of the first difference between them.
Note that when the two lists differ in length, zip stops at the end of the shorter one, so the result has as many tuples as the shorter list. The extra elements of the longer list are ignored.
Ex.
list(zip([1,2], [3,4,5]))
>>> [(1,3), (2,4)]
You can use the function compare with your code to get the positions where the lists differ, and use that to build your motifs.
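As one possible way to put this together, the sketch below uses enumerate to supply the start positions in both sequences and zip to extend a match until the first difference. The function names shared_run_length and find_shared_motifs and the min_length parameter are hypothetical, and the brute-force scan over all start pairs is an assumption made for clarity, not a tuned algorithm.

```python
def shared_run_length(seq_a, seq_b, start_a, start_b):
    """Count how many consecutive items match from the given start positions."""
    length = 0
    for x, y in zip(seq_a[start_a:], seq_b[start_b:]):
        if x != y:
            break  # stop at the first non-equal residue
        length += 1
    return length

def find_shared_motifs(seq_a, seq_b, min_length=2):
    """Return (index_in_a, index_in_b, motif) for every shared run of at least min_length."""
    motifs = []
    for i, _ in enumerate(seq_a):
        for j, _ in enumerate(seq_b):
            n = shared_run_length(seq_a, seq_b, i, j)
            if n >= min_length:
                motifs.append((i, j, "".join(seq_a[i:i + n])))
    return motifs

a = ['A', 'T', 'C', 'A', 'C']
b = ['A', 'T', 'C', 'C', 'T']
print(find_shared_motifs(a, b))  # [(0, 0, 'ATC'), (1, 1, 'TC')]
```

The result also contains motifs nested inside longer ones ('TC' inside 'ATC'); filtering those out is left as a design choice.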
I have a string S that is composed of 20 characters:
S='ARNDCEQGHILKMFPSTWYV'
I need to generate all possible k-mer combinations from a given input k.
When k == 3, then there are 8000 combinations (20*20*20) and the output list looks like this:
output = ['AAA', 'AAR', ..., 'AVV', ..., 'VVV'] #len(output)=8000
When k == 2, then there are 400 combinations (20*20) and the output list looks like this:
output = ['AA', 'AR', 'AN', ..., 'VV'] #len(output)=400
When k == 1, then there are only 20 combinations:
output =['A', 'R', 'N', ..., 'Y', 'V'] #len(output)=20
I know how to do this if the number k is fixed, like if k == 3, then I can do this:
output = []
for a in S:
    for b in S:
        for c in S:
            output.append(a+b+c)
#then len(output)=8000
But the number k is chosen randomly.
I tried to use permutations, but it does not give me strings with repeated letters like 'AAA'. Maybe it can and I'm just doing it wrong.
What you are looking for is itertools.product(). Pass k as the repeat argument to take the product of the string with itself k times.
from itertools import product
...
list(product('ARNDCEQGHILKMFPSTWYV', repeat=2)) # len = 400
list(product('ARNDCEQGHILKMFPSTWYV', repeat=3)) # len = 8000
Bear in mind that it returns tuples of characters by default. If you want strings instead, you can join them with a list comprehension as below:
[''.join(c) for c in product('ARNDCEQGHILKMFPSTWYV', repeat=3)]
# ['AAA', 'AAR', ..., 'AVV', ..., 'VVV']
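A minimal sketch that wraps this in a reusable generator, so k can be supplied at run time. The function name kmers is hypothetical; the counts in the comment follow from 20**k.

```python
from itertools import product

def kmers(alphabet, k):
    """Yield every length-k string over `alphabet`, in itertools.product order."""
    for letters in product(alphabet, repeat=k):
        yield "".join(letters)

S = "ARNDCEQGHILKMFPSTWYV"
for k in (1, 2, 3):
    print(k, sum(1 for _ in kmers(S, k)))  # counts are 20, 400, 8000
```

Using a generator instead of a list keeps memory flat for larger k, where the number of k-mers grows as 20**k.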
You can use itertools.product and generate the random value for k:
import itertools
import random
S = 'ARNDCEQGHILKMFPSTWYV'
final_results = map(''.join, itertools.product(*[S]*random.randint(1, 10)))
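To pick a single random k-mer, generate a random integer V in the range 0..L^k-1, where L is the length of the alphabet string A and k is the length of the k-mer.
Then build the corresponding combination by reading V as a base-L number:
import random
V = random.randrange(L**k)
C = [''] * k
for i in range(k):
    C[i] = A[V % L]  # i-th letter, using integer modulo
    V = V // L  # integer division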
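The decoding step can be sketched as a self-contained function. The names kmer_from_index and random_kmer are hypothetical; the letters come out least-significant digit first, which is an arbitrary choice that does not affect uniformity.

```python
import random

def kmer_from_index(index, alphabet, k):
    """Decode an integer in range(len(alphabet) ** k) into a k-mer (base-L digits)."""
    size = len(alphabet)
    letters = []
    for _ in range(k):
        letters.append(alphabet[index % size])  # lowest base-L digit picks a letter
        index //= size                          # drop that digit
    return "".join(letters)

def random_kmer(alphabet, k):
    """Draw one k-mer uniformly at random without enumerating all of them."""
    return kmer_from_index(random.randrange(len(alphabet) ** k), alphabet, k)

print(kmer_from_index(1, "AB", 3))  # BAA
print(random_kmer("ARNDCEQGHILKMFPSTWYV", 3))
```

Because every integer in the range maps to a distinct k-mer, each k-mer is equally likely.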
So I have a DNA sequence
DNA = "TANNNT"
where N = ["A", "G", "C", "T"]
I want to get all possible outputs: TAAAAT, TAAAGT, TAAACT, TAAATT, and so on.
Right now I have a solution I found online that uses permutations, where I can do
perms = [''.join(p) for p in permutations(N, 3)]
then just build my DNA sequence as
TA + perms + T
but I wonder if there is an easier way to do this, because I have a lot more DNA sequences and it may take a lot more time to hard-code it.
Edit:
By hard-coding I mean that I would have to state
N1 = [''.join(p) for p in permutations(N, 1)]
N2 = [''.join(p) for p in permutations(N, 2)]
N3 = [''.join(p) for p in permutations(N, 3)]
then do
for i in N3:
    key = "TA" + i + "T"
Since my sequence is quite long, I don't want to count how many consecutive Ns I have in the sequence, and I want to see if there is a better way to do this.
You can use your permutation results to format a string like:
Code:
import itertools as it
import re
def convert_sequence(base_string, target_letter, perms):
    REGEX = re.compile('(%s+)' % target_letter)
    match = REGEX.search(base_string).group(0)
    pattern = REGEX.sub('%s', base_string)
    return [pattern % ''.join(p) for p in it.permutations(perms, len(match))]
Test Code:
print(convert_sequence('TANNNT', 'N', ['A', 'G', 'C', 'T']))
Results:
['TAAGCT', 'TAAGTT', 'TAACGT', 'TAACTT', 'TAATGT',
'TAATCT', 'TAGACT', 'TAGATT', 'TAGCAT', 'TAGCTT',
'TAGTAT', 'TAGTCT', 'TACAGT', 'TACATT', 'TACGAT',
'TACGTT', 'TACTAT', 'TACTGT', 'TATAGT', 'TATACT',
'TATGAT', 'TATGCT', 'TATCAT', 'TATCGT']
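Note that permutations never reuses a letter, so the 24 results above leave out sequences such as TAAAAT that the question asks for. A minimal sketch using itertools.product instead, which allows repeats; the function name expand_wildcards and its default arguments are hypothetical.

```python
from itertools import product

def expand_wildcards(base_string, wildcard="N", letters="AGCT"):
    """Replace every wildcard position with each letter, in all combinations."""
    # One pool of choices per position: all letters for a wildcard, else the fixed character.
    pools = [letters if ch == wildcard else ch for ch in base_string]
    return ["".join(combo) for combo in product(*pools)]

results = expand_wildcards("TANNNT")
print(len(results))   # 64, i.e. 4**3
print(results[:4])    # ['TAAAAT', 'TAAAGT', 'TAAACT', 'TAAATT']
```

Because each position gets its own pool, there is no need to count the Ns, and they do not have to be consecutive.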
How can I collect the combinations of a string, in which certain characters (but not all) are variable?
In other words, I have an input string and a character map. The character map specifies which characters are variable, and what they could be replaced with. The function then yields all possible combinations.
To put this in context, I'm trying to collect possible variations for an OCR output string that could have been misinterpreted by the OCR engine.
Example input:
"ABCD"
Example character map:
dict(
B=("X", "Z"),
D=("E")
)
Intended output:
[
"ABCD",
"ABCE",
"AXCD",
"AXCE",
"AZCD",
"AZCE"
]
You can use itertools.product:
>>> from itertools import product
>>> s = "ABCD"
>>> d = {"B": ["X", "Z"], "D": ["E"]}
>>> poss = [[c]+d.get(c,[]) for c in s]
>>> poss
[['A'], ['B', 'X', 'Z'], ['C'], ['D', 'E']]
>>> [''.join(p) for p in product(*poss)]
['ABCD', 'ABCE', 'AXCD', 'AXCE', 'AZCD', 'AZCE']
Note that I made d["D"] a list rather than simply a string for consistency.
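The same idea can be sketched as a function. The name variations is hypothetical; the map values are assumed to be tuples or lists of single characters (a bare one-character string such as "E" also works, since it unpacks to that one character).

```python
from itertools import product

def variations(text, char_map):
    """Return every string obtained by optionally substituting mapped characters."""
    # Each position offers the original character plus any alternatives from the map.
    pools = [[ch, *char_map.get(ch, ())] for ch in text]
    return ["".join(combo) for combo in product(*pools)]

print(variations("ABCD", {"B": ("X", "Z"), "D": ("E",)}))
# ['ABCD', 'ABCE', 'AXCD', 'AXCE', 'AZCD', 'AZCE']
```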
My own solution was very ugly and non-Pythonic, but here goes:
def fuzzy_search(string, character_map):
    all_variations = []
    for i, character in enumerate(string):
        if character in character_map:
            character_variations = list(character_map[character])
            character_variations.insert(0, character)
            if i == len(string) - 1:
                return [string[:-1] + variation for variation in character_variations]
            for variation in character_variations:
                sub_variations = fuzzy_search(string[i + 1:], character_map)
                for sub_variation in sub_variations:
                    all_variations.append(string[:i] + variation + sub_variation)
            return all_variations
    # no variable characters left: the string itself is the only variation
    return [string]
char_map = dict(
    B=("X", "Z"),
    D=("E")
)
print(fuzzy_search("ABCD", char_map))
Outputs:
['ABCD', 'ABCE', 'AXCD', 'AXCE', 'AZCD', 'AZCE']
I figured there should be way more elegant solutions than a recursive function with multiple loops.
I want to find all possible combinations of the following list:
data = ['a','b','c','d']
I know it looks like a straightforward task, and it can be achieved with something like the following code:
comb = [c for i in range(1, len(data)+1) for c in combinations(data, i)]
but what I want is actually a way to give each element of the list data two possibilities ('a' or '-a').
An example of the combinations can be ['a','b'] , ['-a','b'], ['a','b','-c'], etc.
without something like the following case of course ['-a','a'].
You could write a generator function that takes a sequence and yields each possible combination of negations. Like this:
import itertools
def negations(seq):
    for prefixes in itertools.product(["", "-"], repeat=len(seq)):
        yield [prefix + value for prefix, value in zip(prefixes, seq)]
print(list(negations(["a", "b", "c"])))
Result (whitespace modified for clarity):
[
[ 'a', 'b', 'c'],
[ 'a', 'b', '-c'],
[ 'a', '-b', 'c'],
[ 'a', '-b', '-c'],
['-a', 'b', 'c'],
['-a', 'b', '-c'],
['-a', '-b', 'c'],
['-a', '-b', '-c']
]
You can integrate this into your existing code with something like
comb = [x for i in range(1, len(data)+1) for c in combinations(data, i) for x in negations(c)]
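Put together as a runnable sketch, with the imports included and the question's four-element list as input. The expected count is 3**4 - 1 = 80, since each element is either absent, positive, or negated, and the empty combination is excluded.

```python
from itertools import combinations, product

def negations(seq):
    """Yield every way of prefixing each item of seq with '' or '-'."""
    for prefixes in product(["", "-"], repeat=len(seq)):
        yield [prefix + value for prefix, value in zip(prefixes, seq)]

data = ["a", "b", "c", "d"]
comb = [x
        for i in range(1, len(data) + 1)
        for c in combinations(data, i)
        for x in negations(c)]

print(len(comb))   # 80
print(comb[:2])    # [['a'], ['-a']]
```

Since combinations never picks the same element twice, a result such as ['-a', 'a'] cannot occur.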
Once you have the regular combinations generated, you can do a second pass to generate the ones with "negation." I'd think of it like a binary number, with the number of elements in your list being the number of bits. Count from 0b0000 to 0b1111 via 0b0001, 0b0010, etc., and wherever a bit is set, negate that element in the result. This will produce 2^n combinations for each input combination of length n.
Here is a one-liner, but it can be hard to follow:
from itertools import product
comb = [sum(t, []) for t in product(*[([x], ['-' + x], []) for x in data])]
First, map each element of data to the lists it can become in a result: itself, its negation, or nothing. Then take the product (unpacking the choices with *) to get all possibilities. Finally, flatten each combination with sum.
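A small worked example on a two-element list may make the one-liner easier to follow. Note that the last result is the empty list (every element chose "nothing"); filtering it out, as done with the hypothetical name non_empty below, is an assumption about what the question wants.

```python
from itertools import product

data = ["a", "b"]
# Each element contributes one of three choices: [x], ['-x'], or [] (absent).
choices = [([x], ["-" + x], []) for x in data]
comb = [sum(t, []) for t in product(*choices)]
print(comb)
# [['a', 'b'], ['a', '-b'], ['a'], ['-a', 'b'], ['-a', '-b'], ['-a'], ['b'], ['-b'], []]

non_empty = [c for c in comb if c]  # drop the empty combination
print(len(non_empty))  # 8
```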
My solution basically has the same idea as John Zwinck's answer. After you have produced the list of all combinations
comb = [c for i in range(1, len(data)+1) for c in combinations(data, i)]
you generate all possible positive/negative combinations for each element of comb. I do this by iterating through the total number of sign combinations, 2**N for an element of length N, and treating each count as a binary number, where each binary digit stands for the sign of one element. (E.g. a two-element list would have 4 possible combinations, 0 to 3, represented by 0b00 => (+,+), 0b01 => (-,+), 0b10 => (+,-) and 0b11 => (-,-).)
def twocombinations(it):
    sign = lambda c, i: "-" if c & 2**i else ""
    l = list(it)
    if len(l) < 1:
        return
    # for each possible combination, make a tuple with the appropriate
    # sign before each element
    for c in range(2**len(l)):
        yield tuple(sign(c, i) + el for i, el in enumerate(l))
Now we apply this function to every element of comb and flatten the resulting nested iterator:
l = itertools.chain.from_iterable(map(twocombinations, comb))