Introducing mutations in a DNA string in Python

Given a DNA string, for example AGC, I am trying to generate all possible unique strings allowing up to n (a user-defined number) mismatches in the given string.
I can do this for one mismatch in the following way, but I have not been able to implement a recursive solution that generates all possible combinations from the number of mismatches n, the DNA string, and the mutation set (AGCTN).
temp_dict = {}
sequence = 'AGC'
for x in xrange(len(sequence)):
    prefix = sequence[:x]
    suffix = sequence[x+1:]
    temp_dict.update([ (prefix+base+suffix,1) for base in 'ACGTN'])
print temp_dict
An example:
For a given sample string, ACG, the following are the 13 unique sequences allowing up to one mismatch:
{'ACC': 1, 'ATG': 1, 'AAG': 1, 'ANG': 1, 'ACG': 1, 'GCG': 1, 'AGG': 1,
'ACA': 1, 'ACN': 1, 'ACT': 1, 'TCG': 1, 'CCG': 1, 'NCG': 1}
I want to generalize this so that the program can take a 100-character DNA string and return a list/dict of unique strings allowing a user-defined number of mismatches.
Thanks!
-Abhi

Assuming I understand you, I think you can use the itertools module. The basic idea is to choose locations where there's going to be a mismatch using combinations and then construct all satisfying lists using product:
import itertools
def mismatch(word, letters, num_mismatches):
    for locs in itertools.combinations(range(len(word)), num_mismatches):
        this_word = [[char] for char in word]
        for loc in locs:
            orig_char = word[loc]
            this_word[loc] = [l for l in letters if l != orig_char]
        for poss in itertools.product(*this_word):
            yield ''.join(poss)
For your example case:
>>> mismatch("ACG", "ACGTN", 0)
<generator object mismatch at 0x1004bfaa0>
>>> list(mismatch("ACG", "ACGTN", 0))
['ACG']
>>> list(mismatch("ACG", "ACGTN", 1))
['CCG', 'GCG', 'TCG', 'NCG', 'AAG', 'AGG', 'ATG', 'ANG', 'ACA', 'ACC', 'ACT', 'ACN']

I believe the accepted answer only gives exactly N mismatches, not up to N. A slight modification to the accepted answer should correct this, I think:
from itertools import combinations,product
def mismatch(word, i = 2):
    for d in range(i+1):
        for locs in combinations(range(len(word)), d):
            thisWord = [[char] for char in word]
            for loc in locs:
                origChar = word[loc]
                thisWord[loc] = [l for l in "ACGT" if l != origChar]
            for poss in product(*thisWord):
                yield "".join(poss)
kMerList = list(mismatch("AAAA",3))
print kMerList
I am completely new to programming, so please correct me if I'm wrong.
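For reference, here is a minimal Python 3 sketch of the same idea that takes the alphabet as a parameter and also computes how many strings to expect. The names mismatches_up_to and count_up_to are illustrative and do not come from the answers above, and math.comb assumes Python 3.8 or later. Because every chosen position is forced to differ from the original character, each string is produced exactly once, so the count is the sum over d of C(len(word), d) * (len(letters) - 1)^d, assuming every character of the word belongs to the alphabet.

```python
from itertools import combinations, product
from math import comb

def mismatches_up_to(word, letters, max_mismatches):
    """Yield each string that differs from word in at most max_mismatches positions."""
    for d in range(max_mismatches + 1):
        for locs in combinations(range(len(word)), d):
            choices = [[char] for char in word]
            for loc in locs:
                # Force a real change at every chosen position
                choices[loc] = [l for l in letters if l != word[loc]]
            for poss in product(*choices):
                yield ''.join(poss)

def count_up_to(length, alphabet_size, max_mismatches):
    """Number of strings yielded when word only uses letters from the alphabet."""
    return sum(comb(length, d) * (alphabet_size - 1) ** d
               for d in range(max_mismatches + 1))

print(len(set(mismatches_up_to("ACG", "ACGTN", 1))))  # 13, as in the question
print(count_up_to(100, 5, 3))
```

For a 100-character string over ACGTN with up to 3 mismatches the count is already roughly 10.4 million, so iterating over the generator is likely a better fit than building a full list or dict.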

Related

Create string combination based on replacement

Given a word and a dictionary of replacement characters, I need to form every combination of the word's characters based on the replacements.
Input
word = 'accompanying'
substitutions={'c':['$'], 'a': ['4'], 'g': ['9']}
Output
{'a$$ompanyin9', 'ac$ompanyin9','a$companyin9','4ccomp4nying', '4$$omp4nying',
'4$comp4nying','4c$omp4nying', '4ccomp4nyin9', 'a$$ompanying', 'a$companying', 'ac$ompanying',
'accompanyin9', 'accompanying', '4$$omp4nyin9', '4$comp4nyin9', '4c$omp4nyin9','etc.,'}
I wrote some code, but it does not give me all the combinations I am expecting.
Sample Code
from itertools import product
substitutions={'c':['$'], 'a': ['4'], 'g': ['9']}
for key in substitutions.keys():
    if key not in substitutions[key]:
        substitutions[key].append(key)
wordPossibilities = []
word = 'accompanying'
for substitute in [zip(substitutions.keys(),ch) for ch in product(*substitutions.values())]:
    temp=word
    for replacement in substitute:
        temp=temp.replace(*replacement)
    wordPossibilities.append(temp)
print(set(wordPossibilities))
My Output
{'4$$omp4nyin9', 'a$$ompanyin9', 'a$$ompanying', 'accompanyin9',
'accompanying', '4ccomp4nyin9', '4$$omp4nying', '4ccomp4nying'}
My code replaces every occurrence of a character in the string whenever a replacement exists. How do I make replacements by index to find all possible combinations?
It is clean and straightforward to use a generator with recursion:
word = 'accompanying'
subs={'c':['$'], 'a': ['4'], 'g': ['9']}
def get_subs(d, c = []):
    if not d:
        yield ''.join(c)
    else:
        for i in [d[0], *subs.get(d[0], [])]:
            yield from get_subs(d[1:], c+[i])
print(list(get_subs(word)))
Output:
['accompanying', 'accompanyin9', 'accomp4nying', 'accomp4nyin9', 'ac$ompanying', 'ac$ompanyin9', 'ac$omp4nying', 'ac$omp4nyin9', 'a$companying', 'a$companyin9', 'a$comp4nying', 'a$comp4nyin9', 'a$$ompanying', 'a$$ompanyin9', 'a$$omp4nying', 'a$$omp4nyin9', '4ccompanying', '4ccompanyin9', '4ccomp4nying', '4ccomp4nyin9', '4c$ompanying', '4c$ompanyin9', '4c$omp4nying', '4c$omp4nyin9', '4$companying', '4$companyin9', '4$comp4nying', '4$comp4nyin9', '4$$ompanying', '4$$ompanyin9', '4$$omp4nying', '4$$omp4nyin9']
However, itertools.product can be used for a shorter solution:
from itertools import product as prod
s = ''.join('{}' if i in subs else i for i in word)
result = [s.format(*i) for i in prod(*[[i, *subs[i]] for i in word if i in subs])]
Output:
['accompanying', 'accompanyin9', 'accomp4nying', 'accomp4nyin9', 'ac$ompanying', 'ac$ompanyin9', 'ac$omp4nying', 'ac$omp4nyin9', 'a$companying', 'a$companyin9', 'a$comp4nying', 'a$comp4nyin9', 'a$$ompanying', 'a$$ompanyin9', 'a$$omp4nying', 'a$$omp4nyin9', '4ccompanying', '4ccompanyin9', '4ccomp4nying', '4ccomp4nyin9', '4c$ompanying', '4c$ompanyin9', '4c$omp4nying', '4c$omp4nyin9', '4$companying', '4$companyin9', '4$comp4nying', '4$comp4nyin9', '4$$ompanying', '4$$ompanyin9', '4$$omp4nying', '4$$omp4nyin9']
Obviously, you need to rewrite your logic to consider individual instances of the desired letters, rather than each unique letter. Find all occurrences of desired letters; use itertools to get the power set; make the indicated substitutions for each element of the power set. power_set comes from this SO answer. I've left the code "exploded" in some places to show the logic more readily. You will likely want to wrap the final loop into a one-line return expression.
from itertools import chain, combinations
def power_set(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))
substitutions={'c':['$'], 'a': ['4', 'a'], 'g': ['9']}
word = 'accordingly'
# Get index of each desired letter and its possible substitutions
sub_idx = [(pos, letter, sub_letter) for pos, letter in enumerate(word)
           if letter in list(substitutions.keys()) for sub_letter in substitutions[letter]]
print("Replacement set", sub_idx)
for possibility in power_set(sub_idx):
    # Make each of the substitutions indicated in the power set
    new_word = list(word)
    for pos, _, sub_letter in possibility:
        new_word[pos] = sub_letter
    print(''.join(new_word))
Output:
Replacement set [(0, 'a', '4'), (0, 'a', 'a'), (1, 'c', '$'), (2, 'c', '$'), (8, 'g', '9')]
accordingly
4ccordingly
accordingly
a$cordingly
ac$ordingly
accordin9ly
accordingly
4$cordingly
4c$ordingly
4ccordin9ly
a$cordingly
ac$ordingly
accordin9ly
a$$ordingly
a$cordin9ly
ac$ordin9ly
a$cordingly
ac$ordingly
accordin9ly
4$$ordingly
4$cordin9ly
4c$ordin9ly
a$$ordingly
a$cordin9ly
ac$ordin9ly
a$$ordin9ly
a$$ordingly
a$cordin9ly
ac$ordin9ly
4$$ordin9ly
a$$ordin9ly
a$$ordin9ly
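The listing above repeats several words (for example, accordingly appears more than once) because 'a' is listed as its own replacement and because two substitutions can target the same position. A minimal sketch of one way to get each variant exactly once, assuming it is acceptable to build a per-position list of choices; substituted_words is a hypothetical helper name, not part of the answer:

```python
from itertools import product

def substituted_words(word, substitutions):
    """Return each distinct variant of word, substituting position by position."""
    # For every position: the original letter plus its replacements,
    # deduplicated with dict.fromkeys while keeping their order
    choices = [list(dict.fromkeys([ch, *substitutions.get(ch, [])])) for ch in word]
    return [''.join(combo) for combo in product(*choices)]

substitutions = {'c': ['$'], 'a': ['4', 'a'], 'g': ['9']}
variants = substituted_words('accordingly', substitutions)
print(len(variants))  # 16: two choices at each of the four substitutable positions
```

Since the choices at each position are distinct, the product cannot emit the same word twice, so no set is needed afterwards.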

Generating all possible k-mers (string combinations) from a given list

I have a string S that is composed of 20 characters:
S='ARNDCEQGHILKMFPSTWYV'
I need to generate all possible k-mer combinations from a given input k.
When k == 3, then there are 8000 combinations (20*20*20) and the output list looks like this:
output = ['AAA', 'AAR', ..., 'AVV', ..., 'VVV'] #len(output)=8000
When k == 2, then there are 400 combinations (20*20) and the output list looks like this:
output = ['AA', 'AR', 'AN', ..., 'VV'] #len(output)=400
When k == 1, then there are only 20 combinations:
output =['A', 'R', 'N', ..., 'Y', 'V'] #len(output)=20
I know how to do this if the number k is fixed, like if k == 3, then I can do this:
for a in S:
    for b in S:
        for c in S:
            output.append(a+b+c)
#then len(output)=8000
But the number k is chosen randomly.
I tried to use permutations, but it does not give me strings with repeated letters like 'AAA'. Maybe it can and I'm just doing it wrong.
What you are looking for is itertools.product(). Use its repeat argument to set k.
from itertools import product
...
list(product('ARNDCEQGHILKMFPSTWYV', repeat=2)) # len = 400
list(product('ARNDCEQGHILKMFPSTWYV', repeat=3)) # len = 8000
Bear in mind that it returns tuples of characters by default. If you want strings instead, you can join them using a list comprehension as below:
[''.join(c) for c in product('ARNDCEQGHILKMFPSTWYV', repeat=3)]
# ['AAA', 'AAR', ..., 'AVV', ..., 'VVV']
You can use itertools.product and generate the random value for k:
import itertools
import random
S = 'ARNDCEQGHILKMFPSTWYV'
final_results = map(''.join, itertools.product(*[S]*random.randint(1, 10)))
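Just generate a random integer V in the range 0..L^k-1, where L is the length of the string and k is the length of the k-mer.
Then build the corresponding combination:
import random
A, k = 'ARNDCEQGHILKMFPSTWYV', 3  # alphabet and an example value for k
L = len(A)
V = random.randrange(L**k)  # random integer in 0..L**k - 1
C = [''] * k
for i in range(k):
    C[i] = A[V % L]  # i-th letter using integer modulo
    V = V // L  # integer division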
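The integer-to-k-mer mapping described above can also be written as a small function, which makes it easy to check against itertools.product. This is a sketch under the assumption that k-mers are numbered in the same order product emits them (last letter varying fastest); index_to_kmer is an illustrative name.

```python
from itertools import product

def index_to_kmer(index, alphabet, k):
    """Return the k-mer at position index in itertools.product order."""
    letters = []
    for _ in range(k):
        # Peel off the least significant base-len(alphabet) digit
        index, remainder = divmod(index, len(alphabet))
        letters.append(alphabet[remainder])
    # Digits were collected last letter first, so reverse them
    return ''.join(reversed(letters))

S = 'ARNDCEQGHILKMFPSTWYV'
print(index_to_kmer(0, S, 3))     # AAA
print(index_to_kmer(7999, S, 3))  # VVV
```

Drawing the index with random.randrange(len(S) ** k) then gives a uniformly random k-mer without materialising the whole list.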

Get sequences from a file and store them into a list in python

Here is the code (I took it from this discussion, Translation DNA to Protein, but here I'm using an RNA file instead of a DNA file):
from itertools import takewhile
def translate_rna(sequence, d, stop_codons=('UAA', 'UGA', 'UAG')):
    start = sequence.find('AUG')
    # Take sequence from the first start codon
    trimmed_sequence = sequence[start:]
    # Split it into triplets
    codons = [trimmed_sequence[i:i + 3] for i in range(0, len(trimmed_sequence), 3)]
    # Take all codons until first stop codon
    coding_sequence = takewhile(lambda x: x not in stop_codons and len(x) == 3, codons)
    # Translate and join into string
    protein_sequence = ''.join([codontable[codon] for codon in coding_sequence])
    # This line assumes there is always stop codon in the sequence
    return "{0}".format(protein_sequence)
Calling the translate_rna function:
sequence = ''
for line in open("to_rna", "r"):
    sequence += line.strip()
translate_rna(sequence, d)
My to_rna file looks like:
CCGCCCCUCUGCCCCAGUCACUGAGCCGCCGCCGAGGAUUCAGCAGCCUCCCCCUUGAGCCCCCUCGCUU
CCCGACGUUCCGUUCCCCCCUGCCCGCCUUCUCCCGCCACCGCCGCCGCCGCCUUCCGCAGGCCGUUUCC
ACCGAGGAAAAGGAAUCGUAUCGUAUGUCCGCUAUCCAG.........
The function translates only the first protein (from the first AUG to the first stop codon).
I think the problem is in this line:
# Take all codons until first stop codon
coding_sequence = takewhile(lambda x: x not in stop_codons and len(x) == 3 , codons)
My question is: how can I tell Python, after it finds the first AUG and stores that coding sequence in a list, to search for the next AUG in the RNA file and store its coding sequence in the next position?
As a result, I want a list like this:
['here_is_the_1st_coding_sequence', 'here_is_the_2nd_coding_sequence', ...]
PS: This is homework, so I can't use Biopython.
EDIT:
A simple way to describe the problem:
From this code:
from itertools import takewhile
lst = ['N', 'A', 'B', 'Z', 'C', 'A', 'V', 'V', 'Z', 'X']
ch = ''.join(lst)
stop = 'Z'
start = ch.find('A')
seq = takewhile(lambda x: x not in stop, ch)
I want to get this:
['AB', 'AVV']
EDIT 2:
For instance, from this string:
UUUAUGCGCCGCUAACCCAUGGUUCCCUAGUGGUCCUGACGCAUGUGA
I should get as result:
['AUGCGCCGC', 'AUGGUUCCC', 'AUG']
Looking at your basic code (I couldn't quite follow your main code), it looks like you want to take the substring that starts at the first index of one string and split it on every occurrence of another string. If that is wrong, please tell me and I can update accordingly.
To achieve this, Python has a built-in str.split(sub), which splits a string at every occurrence of sub. It also has str.index(sub), which returns the first index of sub. Example:
>>> ch = 'NABZCAVZX'
>>> ch[ch.index('A'):].split('Z')
['AB', 'CAV', 'X']
You can also specify substrings that are longer than one character:
>>> ch = 'NACBABQZCVEZTZCGE'
>>> ch[ch.index('AB'):].split('ZC')
['ABQ', 'VEZT', 'GE']
Using multiple delimiters:
>>> import re
>>> stop_codons = ['UAA','UGA','UAG']
>>> delim = re.compile('|'.join(stop_codons))
>>> ch = 'CCHAUAABEGTAUAAVEGTUGAVKEGUAABEGEUGABRLVBUAGCGGA'
>>> delim.split(ch)
['CCHA', 'BEGTA', 'VEGT', 'VKEG', 'BEGE', 'BRLVB', 'CGGA']
Note that there is no order preference to the split, i.e. if there is a UGA string ahead of a UAA, it will still split on the UGA. I am not sure if that's what you want, but that's how it behaves.
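The question's EDIT 2 asks for every coding sequence, each starting at an AUG and read codon by codon up to a stop codon. A minimal sketch of one way to do that with str.find, assuming the search for the next AUG resumes right after the previous stop codon and that a sequence with no stop codon simply runs to the last complete codon; find_coding_sequences is a hypothetical name:

```python
def find_coding_sequences(rna, start="AUG", stop_codons=("UAA", "UGA", "UAG")):
    """Return the nucleotide string of each start-to-stop region, in order."""
    sequences = []
    pos = rna.find(start)
    while pos != -1:
        end = pos
        # Read in-frame codons until a stop codon or the end of the string
        while end + 3 <= len(rna) and rna[end:end + 3] not in stop_codons:
            end += 3
        sequences.append(rna[pos:end])
        # Resume the search after the stop codon that ended this region
        pos = rna.find(start, end + 3)
    return sequences

rna = "UUUAUGCGCCGCUAACCCAUGGUUCCCUAGUGGUCCUGACGCAUGUGA"
print(find_coding_sequences(rna))  # ['AUGCGCCGC', 'AUGGUUCCC', 'AUG']
```

Each returned string could then be split into triplets and translated with the codon table, as translate_rna does for the first region.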

Combinations of a string with specific variable characters

How can I collect the combinations of a string, in which certain characters (but not all) are variable?
In other words, I have an input string and a character map. The character map specifies which characters are variable, and what they could be replaced with. The function then yields all possible combinations.
To put this in context, I'm trying to collect possible variations for an OCR output string that could have been misinterpreted by the OCR engine.
Example input:
"ABCD"
Example character map:
dict(
    B=("X", "Z"),
    D=("E")
)
Intended output:
[
"ABCD",
"ABCE",
"AXCD",
"AXCE",
"AZCD",
"AZCE"
]
You can use itertools.product:
>>> from itertools import product
>>> s = "ABCD"
>>> d = {"B": ["X", "Z"], "D": ["E"]}
>>> poss = [[c]+d.get(c,[]) for c in s]
>>> poss
[['A'], ['B', 'X', 'Z'], ['C'], ['D', 'E']]
>>> [''.join(p) for p in product(*poss)]
['ABCD', 'ABCE', 'AXCD', 'AXCE', 'AZCD', 'AZCE']
Note that I made d["D"] a list rather than simply a string for consistency.
My own solution was very ugly and non-Pythonic, but here goes:
def fuzzy_search(string, character_map):
    all_variations = []
    for i, character in enumerate(string):
        if character in character_map:
            character_variations = list(character_map[character])
            character_variations.insert(0, character)
            if i == len(string) - 1:
                return [string[:-1] + variation for variation in character_variations]
            for variation in character_variations:
                sub_variations = fuzzy_search(string[i + 1:], character_map)
                for sub_variation in sub_variations:
                    all_variations.append(string[:i] + variation + sub_variation)
            return all_variations
    return all_variations
map = dict(
    B=("X", "Z"),
    D=("E")
)
print fuzzy_search("ABCD", map)
Outputs:
['ABCD', 'ABCE', 'AXCD', 'AXCE', 'AZCD', 'AZCE']
I figured there should be way more elegant solutions than a recursive function with multiple loops.

How to generate a list of all possible alphabetical combinations based on an input of numbers

I have just come across an interesting interview style type of question which I couldn't get my head around.
Basically, given a number to alphabet mapping such that [1:A, 2:B, 3:C ...], print out all possible combinations.
For instance "123" will generate [ABC, LC, AW] since it can be separated into 12,3 and 1,23.
I'm thinking it has to be some type of recursive function where it checks with windows of size 1 and 2 and appending to a previous result if it's a valid letter mapping.
If anyone can formulate some pseudo/python code that'd be much appreciated.
So I managed to hack together an answer, it's not as pythonic as I'd like and there may be some redundancies, but it works with the 123 example to output ABC,AW, and LC.
I'll probably clean it up tomorrow (or if someone wants to clean it up), just posting it in case someone is also working on it and is wondering.
import string
def num_to_alphabet(numbers, ans = ""):
    if not numbers:
        print ans
    numbers = str(numbers)
    window = numbers[:2]
    alph = string.uppercase
    ans = ans[:]
    ans2 = ans[:]
    window_val = ""
    try:
        if window[0]:
            val = int(numbers[0])-1
            if alph[val]:
                ans += alph[val]
                num_to_alphabet(numbers[1:], ans)
        if window[1]:
            val = int(window) -1
            if alph[val]:
                ans2 += alph[val]
                if len(window) > 1:
                    num_to_alphabet(numbers[2:],ans2)
                else:
                    num_to_alphabet(numbers[1:],ans2)
    except IndexError:
        pass
It is as simple as building a tree.
Suppose you are given "1261".
Construct a tree with it as the root.
Define each node as (left, right), where left is always the direct mapping of one digit and right is the combined mapping of two digits.
For the given number 1261:
1261 ->
(1(261) ,12(61)) -> 1 is left-node(direct map -> a) 12 is right node(combo-map1,2->L)
(A(261) , L(61)) ->
(A(2(61),26(1))) ,L(6(1)) ->
(A(B(6(1)),Z(1)) ,L(F(1))) ->
(A(B(F(1)),Z(A)) ,L(F(A))) ->
(A(B(F(A)),Z(A)) ,L(F(A)))
So now you have all the leaf nodes.
Just print all paths from the root to each leaf node; this gives you all possible combinations.
In this case:
ABFA, AZA, LFA
So once you have constructed the tree, print all paths from the root to the leaves, which is what you need.
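from string import ascii_uppercase
# Map '1'..'26' to 'A'..'Z', as described in the question
charMap = {str(i): c for i, c in enumerate(ascii_uppercase, 1)}
def getNodes(s):
    # Return every way to split s into chunks that are keys of charMap
    if len(s) == 0:
        return [[]]
    results = []
    for size in (1, 2):
        chunk = s[:size]
        if len(chunk) == size and chunk in charMap:
            results.extend([chunk] + rest for rest in getNodes(s[size:]))
    return results
def mapout(nodes):
    # Translate each list of chunks into its letter string
    cArray = []
    for x in nodes:
        cx = ''
        for y in x:
            cx = cx + charMap.get(y)
        cArray.append(cx)
    return cArray
res = getNodes('12345')
print(mapout(res))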
Untested, but I believe this is along the lines of what you're looking for.
The following answer recursively tries all possibilities at the current position (there are more than two!) and goes on with the remainder of the string. That's it.
from string import ascii_uppercase
def alpha_combinations(s):
    if len(s) == 0:
        yield ""
        return
    for size in range(1, len(s) + 1):
        v = int(s[:size])
        if v > 26:
            break
        if v > 0:
            c = ascii_uppercase[v - 1]
            for ac in alpha_combinations(s[size:]):
                yield c + ac
print(list(alpha_combinations(input())))
It expects a number as a string. It gives correct output for 101010 (['AAJ', 'AJJ', 'JAJ', 'JJJ']). (I think some of the other solutions don't handle zeroes correctly.)
So, I wanted to tackle this as well, since it’s actually a cool problem. So here goes my solution:
If we ignore the translations to strings for now, we are essentially looking for partitions of a set. So for the input 123 we have a set {1, 2, 3} and are looking for partitions. But of those partitions, only those are interesting which maintain the original order of the input. So we are actually not talking about a set in the end (where order doesn’t matter).
Anyway, I called this “ordered partition”—I don’t know if there actually exists a term for it. And we can generate those ordered partitions easily using recursion:
def orderedPartitions(s):
    if len(s) == 0:
        yield []
        return
    for i in range(1, len(s)+1):
        for p in orderedPartitions(s[i:]):
            yield [s[:i]] + p
For a string input '123', this gives us the following partitions, which is exactly what we are looking for:
['1', '2', '3']
['1', '23']
['12', '3']
['123']
Now, to get back to the original problem which is asking for translations to strings, all we need to do is check each of those partitions, if they contain only valid numbers, i.e. 1 to 26. And if that is the case, translate those numbers and return the resulting string.
import string
def alphaCombinations(s):
    for partition in orderedPartitions(str(s)):
        # get the numbers
        p = list(map(int, partition))
        # skip invalid numbers
        if list(filter(lambda x: x < 1 or x > 26, p)):
            continue
        # yield translated string
        yield ''.join(map(lambda i: string.ascii_uppercase[i - 1], p))
And it works:
>>> list(alphaCombinations(123))
['ABC', 'AW', 'LC']
>>> list(alphaCombinations(1234))
['ABCD', 'AWD', 'LCD']
>>> list(alphaCombinations(4567))
['DEFG']
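Since no valid number has more than two digits, most of the partitions generated above are discarded. A minimal sketch of a pruned variant that only tries chunks of one or two digits follows; decode_all is an illustrative name, and it assumes a stricter reading in which a chunk with a leading zero (such as "01") is never valid, which is a design choice rather than something the question specifies.

```python
from string import ascii_uppercase

def decode_all(digits):
    """Yield every letter string the digit string can encode (1=A ... 26=Z)."""
    if not digits:
        yield ""
        return
    for size in (1, 2):
        chunk = digits[:size]
        # Skip a two-digit chunk when one digit is left, and any leading zero
        if len(chunk) < size or chunk[0] == "0":
            continue
        value = int(chunk)
        if value <= 26:
            for rest in decode_all(digits[size:]):
                yield ascii_uppercase[value - 1] + rest

print(list(decode_all("123")))     # ['ABC', 'AW', 'LC']
print(list(decode_all("101010")))  # ['JJJ'] under the no-leading-zero assumption
```

Under this reading a zero can only appear as the second digit of 10 or 20, so inputs such as "101010" yield fewer results than an approach that accepts "01" as 1.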
I still am not sure of the description, but this Python script first partitions the num into its 'breaks' then tries each break member as a whole as an index into its corresponding character; then converts each digit of the member into letters of a word. Both contributions are shown before showing the sum total of all conversions to letters/words for the num "123"
>>> import string
>>> mapping ={str(n):ch for n,ch in zip(range(1,27), string.ascii_uppercase)}
>>> num = '123'
>>> [[num[:i], num[i:]] for i in range(len(num)+1)]
[['', '123'], ['1', '23'], ['12', '3'], ['123', '']]
>>> breaks = set(part for part in sum(([num[:i], num[i:]] for i in range(len(num)+1)), []) if part)
>>> breaks
{'123', '12', '3', '1', '23'}
>>> as_a_whole = [mapping[p] for p in breaks if p in mapping]
>>> as_a_whole
['L', 'C', 'A', 'W']
>>> by_char = [''.join(mapping[n] for n in p) for p in breaks]
>>> by_char
['ABC', 'AB', 'C', 'A', 'BC']
>>> everything = sorted(set(as_a_whole + by_char))
>>> everything
['A', 'AB', 'ABC', 'BC', 'C', 'L', 'W']
>>>
