Reverse complement of DNA strand using Python

I have a DNA sequence and would like to get its reverse complement using Python. The sequence is in one of the columns of a CSV file, and I'd like to write the reverse complement to another column in the same file. The tricky part is that a few cells contain something other than A, T, G and C. I was able to get the reverse complement with this piece of code:
def complement(seq):
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    bases = list(seq)
    bases = [complement[base] for base in bases]
    return ''.join(bases)
def reverse_complement(s):
    return complement(s[::-1])
print "Reverse Complement:"
print(reverse_complement("TCGGGCCC"))
However, when I try to find the item which is not present in the complement dictionary, using the code below, I just get the complement of the last base. It doesn't iterate. I'd like to know how I can fix it.
def complement(seq):
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    bases = list(seq)
    for element in bases:
        if element not in complement:
            print element
    letters = [complement[base] for base in element]
    return ''.join(letters)
def reverse_complement(seq):
    return complement(seq[::-1])
print "Reverse Complement:"
print(reverse_complement("TCGGGCCCCX"))

The other answers are perfectly fine, but if you plan to deal with real DNA sequences I suggest using Biopython. What if you encounter a character like "-", "*" or an ambiguous base? What if you want to do further manipulations of your sequences? Do you want to create a parser for each file format out there?
The code you ask for is as easy as:
from Bio.Seq import Seq
seq = Seq("TCGGGCCC")
print seq.reverse_complement()
# GGGCCCGA
Now if you want to do other transformations:
print seq.complement()
print seq.transcribe()
print seq.translate()
Outputs
AGCCCGGG
UCGGGCCC
SG
And if you run into strange chars, no need to keep adding code to your program. Biopython deals with it:
seq = Seq("TCGGGCCCX")
print seq.reverse_complement()
# XGGGCCCGA

In general, a generator expression is simpler than the original code and avoids creating extra list objects. If there can be multiple-character insertions go with the other answers.
complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
seq = "TCGGGCCC"
reverse_complement = "".join(complement.get(base, base) for base in reversed(seq))
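Since the question also asks which characters are missing from the dictionary, here is a minimal sketch (Python 3) that returns both the reverse complement and the unexpected characters. The function name reverse_complement_report is made up for this illustration; unknown characters are passed through unchanged, as with complement.get(base, base) above.

```python
COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

def reverse_complement_report(seq):
    """Return (reverse complement, sorted list of characters not in COMPLEMENT)."""
    # Characters that appear in the sequence but have no complement defined.
    unknown = sorted(set(seq) - set(COMPLEMENT))
    # Unknown characters are kept as they are instead of raising KeyError.
    revcomp = "".join(COMPLEMENT.get(base, base) for base in reversed(seq))
    return revcomp, unknown

revcomp, unknown = reverse_complement_report("TCGGGCCCCX")
print(revcomp)  # XGGGGCCCGA
print(unknown)  # ['X']
```

Collecting the unknown characters separately keeps the conversion a single pass and leaves it to the caller to decide whether an unexpected base is an error.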

import string
old_chars = "ACGT"
replace_chars = "TGCA"
tab = string.maketrans(old_chars,replace_chars)
print "AAAACCCGGT".translate(tab)[::-1]
That will give you the reverse complement: ACCGGGTTTT
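The snippet above is Python 2; string.maketrans no longer exists in Python 3. A rough Python 3 equivalent, assuming plain str input, uses str.maketrans instead. The names TABLE and reverse_complement are chosen only for this sketch.

```python
# Translation table built once; characters that are not listed pass through unchanged.
TABLE = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    # Complement every base with the table, then reverse the result.
    return seq.translate(TABLE)[::-1]

print(reverse_complement("AAAACCCGGT"))  # ACCGGGTTTT
```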

The get method of a dictionary allows you to specify a default value if the key is not in the dictionary. As a preprocessing step I would map all your non-'ATGC' bases to single letters (or punctuation or numbers or anything that won't show up in your sequence), then reverse the sequence, then replace the single-letter alternates with their originals. Alternatively, you could reverse it first and then search and replace things like sni with ins.
alt_map = {'ins':'0'}
complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
def reverse_complement(seq):
    for k,v in alt_map.iteritems():
        seq = seq.replace(k,v)
    bases = list(seq)
    bases = reversed([complement.get(base,base) for base in bases])
    bases = ''.join(bases)
    for k,v in alt_map.iteritems():
        bases = bases.replace(v,k)
    return bases
>>> seq = "TCGGinsGCCC"
>>> print "Reverse Complement:"
>>> print(reverse_complement(seq))
GGGCinsCCGA

The fastest one liner for reverse complement is the following:
def rev_compl(st):
    nn = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    return "".join(nn[n] for n in reversed(st))

def ReverseComplement(Pattern):
    revcomp = []
    x = len(Pattern)
    for i in Pattern:
        x = x - 1
        revcomp.append(Pattern[x])
    return ''.join(revcomp)
# this is for the complement
def compliment(Nucleotide):
    comp = []
    for i in Nucleotide:
        if i == "T":
            comp.append("A")
        if i == "A":
            comp.append("T")
        if i == "G":
            comp.append("C")
        if i == "C":
            comp.append("G")
    return ''.join(comp)

Try the code below:
complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
seq = "TCGGGCCC"
reverse_complement = "".join(complement.get(base, base) for base in reversed(seq))
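To apply this to a CSV column as the question describes, something along these lines might work (Python 3). The column names "sequence" and "reverse_complement" and the helper add_revcomp_column are assumptions made for this sketch, and io.StringIO stands in for real files opened with open().

```python
import csv
import io

COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

def reverse_complement(seq):
    # Characters outside A/C/G/T are kept unchanged.
    return "".join(COMPLEMENT.get(base, base) for base in reversed(seq))

def add_revcomp_column(in_file, out_file, seq_col="sequence",
                       out_col="reverse_complement"):
    """Copy a CSV, appending a column with the reverse complement of seq_col."""
    reader = csv.DictReader(in_file)
    fieldnames = list(reader.fieldnames) + [out_col]
    writer = csv.DictWriter(out_file, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[out_col] = reverse_complement(row[seq_col])
        writer.writerow(row)

# In-memory stand-ins for the input and output files.
source = io.StringIO("id,sequence\n1,TCGGGCCC\n2,TCGGGCCCX\n")
target = io.StringIO()
add_revcomp_column(source, target)
print(target.getvalue())
```

Writing to a second file and replacing the original afterwards is usually safer than rewriting the input file in place.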

Considering also degenerate bases:
def rev_compl(seq):
    BASES ='NRWSMBDACGTHVKSWY'
    return ''.join([BASES[-j] for j in [BASES.find(i) for i in seq][::-1]])
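One thing to watch with the index trick above: str.find returns -1 for a character that is not in BASES, so an unexpected character is silently turned into 'R'. A more explicit sketch (Python 3) spells out the IUPAC pairs in a translation table and rejects anything else. The names IUPAC_CODES, IUPAC_TABLE and rev_compl_iupac are made up here, and raising ValueError is just one possible policy.

```python
IUPAC_CODES = "ACGTRYSWKMBDHVN"
# Each code is paired with its complement: R<->Y, K<->M, B<->V, D<->H; S, W, N map to themselves.
IUPAC_TABLE = str.maketrans(IUPAC_CODES, "TGCAYRSWMKVHDBN")

def rev_compl_iupac(seq):
    """Reverse complement for uppercase IUPAC nucleotide codes."""
    unknown = set(seq) - set(IUPAC_CODES)
    if unknown:
        raise ValueError("unsupported characters: " + "".join(sorted(unknown)))
    return seq.translate(IUPAC_TABLE)[::-1]

print(rev_compl_iupac("ACGTRYN"))  # NRYACGT
```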

This may be the quickest way to compute a reverse complement:
def complement(seq):
    complementary = { 'A':'T', 'T':'A', 'G':'C','C':'G' }
    return ''.join(reversed([complementary[i] for i in seq]))

Using the timeit module for speed profiling, this is the fastest algorithm my coworkers and I came up with for sequences shorter than 200 nucleotides:
(sequence
    .replace('A', '*')  # Temporary symbol
    .replace('T', 'A')
    .replace('*', 'T')
    .replace('C', '&')  # Temporary symbol
    .replace('G', 'C')
    .replace('&', 'G')[::-1])
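Speed claims like this depend on the Python version and the sequence length, so it is worth measuring on your own data. Below is a minimal timeit sketch (Python 3) comparing three of the approaches from this page. The function names and the 160-nucleotide test sequence are arbitrary choices for the illustration, and no particular ranking is implied.

```python
import timeit

TABLE = str.maketrans("ACGT", "TGCA")
COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

def with_replace(sequence):
    # Chained replace using '*' and '&' as temporary symbols.
    return (sequence.replace('A', '*').replace('T', 'A').replace('*', 'T')
                    .replace('C', '&').replace('G', 'C').replace('&', 'G')[::-1])

def with_translate(sequence):
    return sequence.translate(TABLE)[::-1]

def with_dict(sequence):
    return "".join(COMPLEMENT[base] for base in reversed(sequence))

sequence = "ACGT" * 40  # 160 nucleotides
candidates = (with_replace, with_translate, with_dict)

# Sanity check: all variants must produce the same result before timing them.
assert len({func(sequence) for func in candidates}) == 1

for func in candidates:
    seconds = timeit.timeit(lambda: func(sequence), number=2000)
    print(f"{func.__name__}: {seconds:.4f} s for 2000 calls")
```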

Related

How do I convert this iterative function to a recursive one?

This function maps each character of the input string through a dictionary and outputs the result. Any idea how this can be approached recursively?
def dna(seq):
    hashtable = {'A': 'U', 'G': 'C', 'T': 'A', 'C': 'G'}
    ans = ''
    for i in range(len(seq)):
        ans += hashtable[seq[i]]
    return ans
print(dna('AGCTGACGTA'))
Thanks.

You could do:
def dna(seq):
    if not seq:
        return ''
    return {'A': 'U', 'G': 'C', 'T': 'A', 'C': 'G'}[seq[0]] + dna(seq[1:])
However, this is almost certainly slower, uses more memory, and will hit Python's recursion limit on long sequences. The recommended approach for almost all use cases is iterative; modify your code to use Python's built-in string join:
def dna(seq):
    hashtable = {'A': 'U', 'G': 'C', 'T': 'A', 'C': 'G'}
    ans = []
    for elem in seq:
        ans.append(hashtable[elem])
    return ''.join(ans)

You should understand that recursion is not always the answer.
Python has a maximum recursion depth, which you can change, but there will still be a limit. See: https://stackoverflow.com/a/3323013/2681662
The maximum recursion depth allowed:
import sys
print(sys.getrecursionlimit())
So an iterative approach is better in your case.
Still, let's see what the recursive version would look like.
For a recursive function you have to follow two simple rules:
Create an exit condition
Call yourself (the function) again.
def dna_r(seq):
    hashy = {'A': 'U', 'G': 'C', 'T': 'A', 'C': 'G'}
    if not seq:  # exit condition for an empty string
        return ''
    if len(seq) == 1:
        return hashy[seq]
    return dna_r(seq[0]) + dna_r(seq[1:])
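To make the recursion limit concrete, here is a small sketch (Python 3) that runs a recursive and an iterative version on a sequence slightly longer than the current limit. The names dna_recursive and dna_iterative are chosen for this example only.

```python
import sys

HASHTABLE = {'A': 'U', 'G': 'C', 'T': 'A', 'C': 'G'}

def dna_recursive(seq):
    # One stack frame per character, so long inputs exhaust the recursion limit.
    if not seq:
        return ''
    return HASHTABLE[seq[0]] + dna_recursive(seq[1:])

def dna_iterative(seq):
    return ''.join(HASHTABLE[base] for base in seq)

short = 'AGCTGACGTA'
assert dna_recursive(short) == dna_iterative(short)

# A sequence a little longer than the recursion limit.
long_seq = 'A' * (sys.getrecursionlimit() + 100)
try:
    dna_recursive(long_seq)
    recursive_ok = True
except RecursionError:
    recursive_ok = False

print(recursive_ok)                  # False with the limit left unchanged
print(len(dna_iterative(long_seq)))  # the iterative version handles it
```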

Loop over letters in a string that contains the alphabet to determine which are missing from a dictionary

I am very new to Python and trying to find the solution to this for a class.
I need the function missing_letters to take a list, check the letters using histogram and then loop over the letters in alphabet to determine which are missing from the input parameter. Finally I need to print the letters that are missing, in a string.
alphabet = "abcdefghijklmnopqrstuvwxyz"
test = ["one","two","three"]
def histogram(s):
    d = dict()
    for c in s:
        if c not in d:
            d[c] = 1
        else:
            d[c] += 1
    return d
def missing_letter(s):
    for i in s:
        checked = (histogram(i))
As you can see I haven't gotten very far, at the moment missing_letters returns
{'o': 1, 'n': 1, 'e': 1}
{'t': 1, 'w': 1, 'o': 1}
{'t': 1, 'h': 1, 'r': 1, 'e': 2}
I now need to loop over alphabet to check which characters are missing and print. Any help and direction will be much appreciated. Many thanks!

You can use set operations in Python, which are fast and efficient:
alphabet = set('abcdefghijklmnopqrstuvwxyz')
s1 = 'one'
s2 = 'two'
s3 = 'three'
list_of_missing_letters = set(alphabet) - set(s1) - set(s2) - set(s3)
print(list_of_missing_letters)
Or like this:
from functools import reduce
alphabet = set('abcdefghijklmnopqrstuvwxyz')
list_of_strings = ['one', 'two', 'three']
list_of_missing_letters = set(alphabet) - \
    reduce(lambda x, y: set(x).union(set(y)), list_of_strings)
print(list_of_missing_letters)
Or using your own histogram function:
alphabet = "abcdefghijklmnopqrstuvwxyz"
test = ["one", "two", "three"]
def histogram(s):
    d = dict()
    for c in s:
        if c not in d:
            d[c] = 1
        else:
            d[c] += 1
    return d
def missing_letter(t):
    test_string = ''.join(t)
    result = []
    for l in alphabet:
        if l not in histogram(test_string).keys():
            result.append(l)
    return result
print(missing_letter(test))
Output:
['a', 'b', 'c', 'd', 'f', 'g', 'i', 'j', 'k', 'l', 'm', 'p', 'q', 's', 'u', 'v', 'x', 'y', 'z']
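The set-based variants above print a set, whose display order is arbitrary, while the question asks for a string. A short sketch (Python 3) of a function that returns the missing letters as a sorted string could look like this; the name missing_letters follows the question, and everything else is an assumption.

```python
from string import ascii_lowercase

def missing_letters(words):
    """Return the lowercase letters that appear in none of the words, as a string."""
    # Set difference finds the unused letters; sorted() restores alphabetical order.
    missing = set(ascii_lowercase) - set("".join(words).lower())
    return "".join(sorted(missing))

print(missing_letters(["one", "two", "three"]))  # abcdfgijklmpqsuvxyz
```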

from string import ascii_lowercase
words = ["one","two","three"]
letters = [l.lower() for w in words for l in w]
# alphabet letters that do not appear in any of the words
letter_str = "".join(x for x in ascii_lowercase if x not in letters)
Output:
'abcdfgijklmpqsuvxyz'

It is not the easiest question to understand, but from what I can gather you require all the letters of the alphabet not in the input to be returned in console.
So a plain loop, as opposed to the functions already shown, would be:
def output():
    output = ""
    for i in alphabet:
        # keep only the letters that never appear as a key in the histogram
        if i not in checked:
            output += i
    print(output)
Sidenote: Please either make checked a global variable or define it outside of the function so this function can use it.

Python Function that receives a letter and rotates that letter 13 places to the right

I'm trying to create a Python function that uses the Caesar cipher to encrypt a message.
So far, the code I have is
letter = input("Enter a letter: ")
def alphabet_position(letter):
    alphabet_pos = {'A':0, 'a':0, 'B':1, 'b':1, 'C':2, 'c':2, 'D':3,
                    'd':3, 'E':4, 'e':4, 'F':5, 'f':5, 'G':6, 'g':6,
                    'H':7, 'h':7, 'I':8, 'i':8, 'J':9, 'j':9, 'K':10,
                    'k':10, 'L':11, 'l':11, 'M':12, 'm':12, 'N': 13,
                    'n':13, 'O':14, 'o':14, 'P':15, 'p':15, 'Q':16,
                    'q':16, 'R':17, 'r':17, 'S':18, 's':18, 'T':19,
                    't':19, 'U':20, 'u':20, 'V':21, 'v':21, 'W':22,
                    'w':22, 'X':23, 'x':23, 'Y':24, 'y':24, 'Z':25, 'z':25 }
    pos = alphabet_pos[letter]
    return pos
When I try to run my code, it will ask for the letter but it doesn't return anything after that
Please help if you have any suggestions.

You would need to access your dictionary in a different way:
    pos = alphabet_pos.get(letter)
    return pos
Then you can finally call the function and print its result:
print(alphabet_position(letter))

You can define two dictionaries, one the reverse of the other. You need to be careful on a few aspects:
Whether case is important. If it's not, use str.casefold as below.
What happens when you roll off the end of the alphabet, e.g. 13th letter after "z". Below we assume you start from the beginning again.
Don't type out the alphabet manually. You can use the string module.
Here's a demo:
letter = input("Enter a letter: ")
from string import ascii_lowercase
def get_next(letter, n):
    pos_alpha = dict(enumerate(ascii_lowercase))
    alpha_pos = {v: k for k, v in pos_alpha.items()}
    # Parenthesize the sum so the wrap-around applies to the final position.
    return pos_alpha[(alpha_pos[letter.casefold()] + n) % 26]
get_next(letter, 13)
Enter a letter: a
'n'

If you need an entirely new encoded dict:
import string
letters = string.ascii_uppercase
d=dict(zip(list(letters),range(0,len(letters))))
encoded_dic={}
def get_caesar_value(v, by=13):
    return (v+by)%26
for k,v in d.items():
    encoded_dic[k]=chr(65+get_caesar_value(v))
print(encoded_dic)
Output:
{'A': 'N', 'C': 'P', 'B': 'O', 'E': 'R', 'D': 'Q', 'G': 'T', 'F': 'S', 'I': 'V', 'H': 'U', 'K': 'X', 'J': 'W', 'M': 'Z', 'L': 'Y', 'O': 'B', 'N': 'A', 'Q': 'D', 'P': 'C', 'S': 'F', 'R': 'E', 'U': 'H', 'T': 'G', 'W': 'J', 'V': 'I', 'Y': 'L', 'X': 'K', 'Z': 'M'}

The code you have only maps letters to a position. We'll rewrite it and make a rotate function.
Code
import string
import itertools as it
LOOKUP = {
    **{x:i for i, x in enumerate(string.ascii_lowercase)},
    **{x:i for i, x in enumerate(string.ascii_uppercase)}
}
def abc_position(letter):
    """Return the alpha position of a letter."""
    return LOOKUP[letter]
def rotate(letter, shift=13):
    """Return a letter shifted some positions to the right; recycle at the end."""
    iterable = it.cycle(string.ascii_lowercase)
    start = it.dropwhile(lambda x: x != letter.casefold(), iterable)
    # Advance the iterator
    for i, x in zip(range(shift+1), start):
        res = x
    if letter.isupper():
        return res.upper()
    return res
Tests
func = abc_position
assert func("a") == 0
assert func("A") == 0
assert func("c") == 2
assert func("z") == 25
func = rotate
assert func("h") == "u"
assert func("a", 0) == "a"
assert func("A", 0) == "A"
assert func("a", 2) == "c"
assert func("c", 3) == "f"
assert func("A", 2) == "C"
assert func("a", 26) == "a"
# Restart after "z"
assert func("z", 1) == "a"
assert func("Z", 1) == "A"
Demo
>>> letter = input("Enter a letter: ")
Enter a letter: h
>>> rot = rotate(letter, 13)
>>> rot
'u'
>>> abc_position(rot)
20
Here we rotated the letter "h" 13 positions, got a letter and then determined the position of this resultant letter in the normal string of abc's.
Details
abc_position()
This function was rewritten to look up the position of a letter. It merges two dictionaries:
one that enumerates the lowercase ASCII letters
one that enumerates the uppercase ASCII letters
The string module has these letters already.
rotate()
This function only rotates lowercase letters; uppercase letters are translated from the lowercase position. The string of letters is rotated by making an infinite cycle (an iterator) of lowercase letters.
The cycle is first advanced to start at the desired letter. This is done by dropping all letters that don't look like the one passed in.
Then it is advanced in a loop some number of times equal to shift. The loop is just one way to consume or move the iterator ahead. We only care about the last letter, not the ones in between. This letter is returned, either lower or uppercase.
Since a letter is returned (not a position), you can now use your abc_position() function to find its normal position.
Alternatives
Other rotation functions can substitute rotate():
import codecs
def rot13(letter):
    return codecs.encode(letter, "rot13")
def rot13(letter):
    table = str.maketrans(
        "ABCDEFGHIJKLMabcdefghijklmNOPQRSTUVWXYZnopqrstuvwxyz",
        "NOPQRSTUVWXYZnopqrstuvwxyzABCDEFGHIJKLMabcdefghijklm")
    return str.translate(letter, table)
However, these options are constrained to rot13, while rotate() can be shifted by any number. Note: rot26 will cycle back to the beginning, e.g. rotate("a", 26) -> a.
See also this post on how to make a true rot13 cipher.
See also docs on itertools.cycle and itertools.dropwhile.

You can do it with quick calculations from ord and chr functions instead:
def encrypt(letter):
    return chr((ord(letter.lower()) - ord('a') + 13) % 26 + ord('a'))
so that:
print(encrypt('a'))
print(encrypt('o'))
outputs:
n
b
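Since the goal in the question is to encrypt a whole message, the same ord/chr arithmetic can be extended to keep the case of each letter and leave other characters alone. The following is only a sketch (Python 3.7+ because of str.isascii); the names rotate_letter and caesar are invented for it.

```python
def rotate_letter(letter, shift=13):
    """Rotate one ASCII letter by shift positions, keeping its case."""
    if not (letter.isascii() and letter.isalpha()):
        return letter  # digits, spaces and punctuation are left as they are
    base = ord('A') if letter.isupper() else ord('a')
    return chr((ord(letter) - base + shift) % 26 + base)

def caesar(message, shift=13):
    return "".join(rotate_letter(ch, shift) for ch in message)

print(caesar("Hello, World!"))          # Uryyb, Jbeyq!
print(caesar(caesar("Hello, World!")))  # Hello, World!
```

With a shift of 13, applying the function twice returns the original text, because 13 + 13 = 26 is a full turn of the alphabet.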

Replacing Multiple Letters in a String with Each Other in Python [duplicate]

This question already has answers here:
Replace multiple elements in string with str methods
(2 answers)
Closed 8 years ago.
So I understand how to use str.replace() to replace single letters in a string, and I also know how to use the following replace_all function:
def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i,j)
    return text
But I am trying to replace letters with each other. For example replace each A with T and each T with A, each C with G and each G with C, but I end up getting a string composed of only two letters, either A and G or C and T, for example, and I know the output should be composed of four letters. Here is the code I have tried (I'd rather avoid built in functions):
d={'A': 'T', 'C': 'G', 'A': 'T', 'G': 'C'}
DNA_String = open('rosalind_rna.txt', 'r')
DNA_String = DNA_String.read()
reverse = str(DNA_String[::-1])
def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i,j)
    return text
complement = replace_all(reverse, d)
print complement
I also tried using:
complement = str.replace(reverse, 'A', 'T')
complement = str.replace(reverse, 'T', 'A')
complement = str.replace(reverse, 'G', 'C')
complement = str.replace(reverse, 'C', 'G')
But I end up getting a string that is four times as long as it should be.
I've also tried:
complement = str.replace(reverse, 'A', 'T').replace(reverse, 'T', 'A').replace(reverse, 'G', 'C')str.replace(reverse, 'C', 'G')
But I get an error message that an integer input is needed.

You can map each letter to another letter.
>>> M = {'A':'T', 'T':'A', 'C':'G', 'G':'C'}
>>> STR = 'CGAATT'
>>> S = "".join([M.get(c,c) for c in STR])
>>> S
'GCTTAA'

You should probably use str.translate for this. Use string.maketrans to create the corresponding translation table.
>>> import string
>>> d ={'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
>>> s = "ACTG"
>>> _from, to = map(lambda t: ''.join(t), zip(*d.items()))
>>> t = string.maketrans(_from, to)
>>> s.translate(t)
'TGAC'
By the way, the error you get with this line
complement = str.replace(reverse, 'A', 'T').replace(reverse, 'T', 'A')...
is that you are explicitly passing the self argument when it is already passed implicitly. Doing str.replace(reverse, 'A', 'T') is equivalent to reverse.replace('A', 'T'). Accordingly, str.replace(...).replace(reverse, 'T', 'A') is equivalent to str.replace(str.replace(...), reverse, 'T', 'A'): the result of the first replace is inserted as self in the second replace, the other parameters shift by one position, and the 'A' is interpreted as the count parameter, which has to be an int.
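The argument shift described above can be reproduced in a few lines. This is a sketch for Python 3, where the mistake surfaces as a TypeError; the exact message differs from the Python 2 one quoted in the question. The sample string is arbitrary.

```python
reverse = "GATTACA"

# Calling through the class with the string as first argument is the same
# as calling the method on the string itself.
assert str.replace(reverse, 'A', 'T') == reverse.replace('A', 'T')

error = None
try:
    # Here `reverse` becomes `old`, 'T' becomes `new`, and 'A' lands in `count`.
    reverse.replace('A', 'T').replace(reverse, 'T', 'A')
except TypeError as exc:
    error = exc

print(type(error).__name__)  # TypeError
```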

I think this is happening because you're replacing all the As with Ts and then replacing all those Ts (as well as those in the original string) with As. Try replacing with lower-case letters and then converting the whole string with upper():
dic = {'A': 't', 'T': 'a', 'C': 'g', 'G': 'c'}
text = 'GATTCCACCGT'
for i, j in dic.iteritems():
    text = text.replace(i,j)
text = text.upper()
gives:
'CTAAGGTGGCA'

Weird behavior in Python

I wrote a simple program to translate DNA to RNA. Basically, you input a string, it separates the string into characters and sends them to a list, shifts the letter and returns a string from the resulting list. This program correctly translates a to u, and t to a, but does not change g to c and c to g.
This is the program:
def trad(x):
    h=[]
    for letter in x:
        h.append(letter)
    for letter in h:
        if letter=="a":
            h[h.index(letter)]="u"
            continue
        if letter=="t":
            h[h.index(letter)]="a"
            continue
        if letter=="g":
            h[h.index(letter)]="c"
            continue
        if letter=="c":
            h[h.index(letter)]="g"
            continue
    ret=""
    for letter in h:
        ret+=letter
    return ret
while True:
    stry=raw_input("String?")
    print trad(stry)
Now, just altering the program by not iterating over elements, but on positions, it works as expected. This is the resulting code:
def trad(x):
    h=[]
    for letter in x:
        h.append(letter)
    for letter in xrange (0, len(h)):
        if h[letter]=="a":
            h[letter]="u"
            continue
        if h[letter]=="t":
            h[letter]="a"
            continue
        if h[letter]=="g":
            h[letter]="c"
            continue
        if h[letter]=="c":
            h[letter]="g"
            continue
    ret=""
    for letter in h:
        ret+=letter
    return ret
while True:
    stry=raw_input("String?")
    print trad(stry)
Why does this strange behaviour occur, and how can I resolve it?

You are going about this a much harder way than is necessary. This could easily be done using str.translate(), a method on str instances that translates one character to another, which is exactly what you want:
import string
replacements = string.maketrans("atgc", "uacg")
while True:
    stry=raw_input("String?")
    print stry.translate(replacements)
This is an answer for 2.x; in 3.x, use str.maketrans() instead.
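For reference, a Python 3 version of the same idea might look like the sketch below. The interactive loop is left out so the snippet can run unattended, and the names REPLACEMENTS and trad are chosen for the example.

```python
# Python 3: maketrans is a static method on str rather than a function in the string module.
REPLACEMENTS = str.maketrans("atgc", "uacg")

def trad(strand):
    # Every character is translated independently, so g->c and c->g cannot interfere.
    return strand.translate(REPLACEMENTS)

print(trad("atgc"))  # uacg
print(trad("ggcc"))  # ccgg
```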

I'm not sure what type of issue you are having, but here's a simple way to do it, using a dictionary.
def trad(coding_strand):
    mRNA_parts = {'a': 'u', 't': 'a', 'g': 'c', 'c': 'g'}
    mRNA = ''
    for nucleotide in coding_strand:
        mRNA += mRNA_parts[nucleotide.lower()]  # look up the lowercase form
    return mRNA.upper()  # returns it as uppercase
I have it returned as uppercase because, generally, nucleotides in DNA/RNA are written in uppercase.
I also revised your method... It's better to iterate through the indices themselves; then you don't have to do l.index(elem).
def trad(coding_strand):
    mRNA = []
    for index in range(len(coding_strand)):
        nucleotide = coding_strand[index].upper()
        if nucleotide == 'A':
            mRNA.append('U')
        elif nucleotide == 'T':
            mRNA.append('A')
        elif nucleotide == 'C':
            mRNA.append('G')
        elif nucleotide == 'G':
            mRNA.append('C')
    ret = ''
    for letter in mRNA:
        ret += letter  # append each translated letter, not the whole list
    print ret
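As for why the original version misbehaves: list.index(letter) returns the first position holding that value, which is not necessarily the position the loop is currently at. Once a g has been turned into a c, a later c finds that earlier element and flips it back. A small sketch (Python 3) that reproduces this and shows an enumerate-based fix; the function names and the MAPPING dictionary are made up for the illustration.

```python
MAPPING = {"a": "u", "t": "a", "g": "c", "c": "g"}

def trad_with_index(strand):
    h = list(strand)
    for letter in h:
        if letter in MAPPING:
            # index() finds the FIRST matching element, possibly one already translated.
            h[h.index(letter)] = MAPPING[letter]
    return "".join(h)

def trad_with_enumerate(strand):
    h = list(strand)
    for position, letter in enumerate(h):
        if letter in MAPPING:
            # Write to the position currently being visited.
            h[position] = MAPPING[letter]
    return "".join(h)

print(trad_with_index("gc"))      # gc  (the first element is flipped twice)
print(trad_with_enumerate("gc"))  # cg
```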
I don't suggest using a string and adding on to it nor using a list; a list comprehension is much more effective.
Here's a semi-one-liner, courtesy of BurhanKhalid:
def trad(coding_strand):
    mRNA_parts = {'A': 'U', 'T': 'A', 'G': 'C', 'C': 'G'}
    return ''.join([mRNA_parts[nucleotide] for nucleotide in coding_strand.upper()])
A complete one-liner:
def trade(coding_strand, key={'A': 'U', 'T': 'A', 'G': 'C', 'C': 'G'}): return ''.join([key[i] for i in coding_strand.upper()])
Some references:
Dictionaries
List Comprehensions
