Translate a DNA sequence into its amino acids in Python - python

I'm stuck on an exercise in Python where I need to convert a DNA sequence into its corresponding amino acids. So far, I have:
seq1 = "AATAGGCATAACTTCCTGTTCTGAACAGTTTGA"
for i in range(0, len(seq1), 3):
    print(seq1[i:i+3])
I need to do this without using dictionaries. I was going to use replace, but it seems that's not advisable either. How can I achieve this?
It's supposed to give something like this, for example:
>seq1_1_+
TQSLIVHLIY
>seq1_2_+
LNRSFTDSST
>seq1_3_+
SIADRSLTHLL
Update 2: OK, so I had to resort to functions and, as suggested, I have gotten the output I wanted. Now I have a series of functions that return a series of amino acid sequences, and I want to get an output file that looks like this, for example:
>seq1_1_+
iyyslrs-las-smrlssiv-m
>seq1_2_+
fiirydrs-ladrcgshrssk
>seq1_3_+
llfativas-lidaalidrl
>seq1_1_-
frrsmraasis-lativannkm
>seq1_2_-
lddr-ephrsas-lrs-riin
>seq1_3_-
-tidesridqlasydrse--m
For that, I'm using this:
for x in f1:
    x = x.strip()
    if x.count("seq"):
        f2.write(x + "_1_+\n")
        f2.write(x + "_2_+\n")
        f2.write(x + "_3_+\n")
        f2.write(x + "_1_-\n")
        f2.write(x + "_2_-\n")
        f2.write(x + "_3_-\n")
    else:
        f2.write(translate1(x) + "\n")
        f2.write(translate2(x) + "\n")
        f2.write(translate3(x) + "\n")
        f2.write(translate1neg(x) + "\n")
        f2.write(translate2neg(x) + "\n")
        f2.write(translate3neg(x) + "\n")
But instead of the expected output file, I get this:
>seq1_1_+
>seq1_2_+
>seq1_3_+
>seq1_1_-
>seq1_2_-
>seq1_3_-
iyyslrs-las-smrlssiv-m
fiirydrs-ladrcgshrssk
llfativas-lidaalidrl
frrsmraasis-lativannkm
lddr-ephrsas-lrs-riin
-tidesridqlasydrse--m
So it's pretty much writing all the seq headers first and all the function results afterwards. I need to interleave them; the problem is how.

To translate you need a table of codons, so doing it without a dictionary or some other data structure seems strange.
Maybe you can look into Biopython and see how they manage it.
You can also translate directly from the coding strand DNA sequence:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
You may take a look into

You cannot practically do this without either a function or a dictionary. Part 1, converting the sequence into three-character codons, is easy enough as you have already done it.
But Part 2, to convert these into amino acids, you will need to define a mapping, either:
mapping = {"NNN": "X", ...}
or
def mapping(codon):
    if codon in ("AGA", "AGG", "CGA", "CGC", "CGG", "CGT"):
        return "R"
    ...
or
for codon, acid in [("CAA", "Q"), ("CAG", "Q"), ...]:
I would favour the second of these as it has the least duplication (and therefore potential for error).
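The third option (a list of pairs searched in a loop) can be sketched as below. This is only an illustration: CODON_GROUPS, amino_acid, and translate are hypothetical names, the table is deliberately incomplete, and returning "X" for any codon not listed is an assumption made for the example, not part of the standard genetic code.

```python
# Partial codon table as a list of (codons, amino acid) pairs; no dictionary involved.
# Only a few groups are listed here; a real table needs all 64 codons.
CODON_GROUPS = [
    (("GCT", "GCC", "GCA", "GCG"), "A"),
    (("TGT", "TGC"), "C"),
    (("AGA", "AGG", "CGA", "CGC", "CGG", "CGT"), "R"),
    (("CAA", "CAG"), "Q"),
    (("ATG",), "M"),
    (("TAA", "TAG", "TGA"), "*"),
]


def amino_acid(codon):
    """Return the one-letter code for a codon, or "X" if it is not in the table."""
    for codons, acid in CODON_GROUPS:
        if codon in codons:
            return acid
    return "X"  # assumption: placeholder for codons missing from this partial table


def translate(seq):
    """Translate complete codons from the start of seq; trailing bases are ignored."""
    return "".join(amino_acid(seq[i:i + 3]) for i in range(0, len(seq) - 2, 3))


print(translate("ATGGCCTGTTAA"))
```

Grouping the codons per amino acid keeps each letter in one place, which is the same property that makes the function-based version attractive.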

You got the amino acid output for the first codon only because you used 'return' inside the 'for loop'. Once the first amino acid is returned, the loop terminates, hence the second codon won't be tested at all.
You can create an empty list to keep the results for the translation of each codon, e.g.
aa = []
then, instead of using return, append the output to the list:
for x in range(0, len(seq1), 3):
    nuc2 = seq1[x:x+3]
    if nuc2 in ('GCT', 'GCC', 'GCA', 'GCG'):
        aa.append("a")
    elif nuc2 in ('TGT', 'TGC'):
        aa.append("c")
    ....
and finally, join the letters in the list and return the string from the function:
return "".join(aa)
or simply print it:
print("".join(aa))

You can convert the nucleotide bases into numbers (base 4) and then translate by indexing into a string of amino acids ordered by codon value:
def translate(seq, frame):
    BASES = 'ACGT'
    # standard code
    AA = 'KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF'
    # convert the DNA sequence to digits: A=0; C=1; G=2; T=3
    seqn = [str(BASES.find(i)) for i in seq.upper()]
    # list of all codons in all forward frames (i.e. 3-digit numbers in base 4)
    allframes = [''.join(seqn[x:x+3]) for x in range(len(seqn)) if x <= (len(seqn)-3)]
    # translate the codons of the requested frame: each base-4 string becomes a base-10 index into AA
    return ''.join([AA[int(i,4)] for i in allframes[(frame-1)::3]])
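The question also asks for the three "-" frames and for each header to be followed directly by its sequence. A minimal sketch of one way to do that is below. It reuses the base-4 lookup idea, and it assumes that the "-" frames are read from the reverse complement of the input and that the sequence contains only A, C, G and T (anything else raises ValueError). The names reverse_complement and six_frames are hypothetical.

```python
BASES = "ACGT"
# Standard genetic code, ordered by the base-4 value of each codon (A=0, C=1, G=2, T=3)
AA = "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF"


def translate(seq, frame):
    """Translate seq in forward frame 1, 2 or 3; incomplete trailing codons are dropped."""
    seq = seq.upper()
    protein = []
    for i in range(frame - 1, len(seq) - 2, 3):
        index = 0
        for base in seq[i:i + 3]:
            index = index * 4 + BASES.index(base)  # ValueError on non-ACGT input
        protein.append(AA[index])
    return "".join(protein)


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(name, seq):
    """Build FASTA-style text with each header immediately followed by its translation."""
    lines = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in (1, 2, 3):
            lines.append(">%s_%d_%s" % (name, frame, strand))
            lines.append(translate(strand_seq, frame))
    return "\n".join(lines)


print(six_frames("seq1", "AATAGGCATAACTTCCTGTTCTGAACAGTTTGA"))
```

Writing the header and its translation in the same loop iteration is what keeps them interleaved, rather than emitting all headers first and all sequences afterwards.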

Related

How to replace all T with U in an input string of DNA?

So, the task is quite simple. I just need to replace all "T"s with "U"s in an input string of DNA. I have written the following code:
def transcribe_dna_to_rna(s):
    base_change = {"t":"U", "T":"U"}
    replace = "".join([base_change(n,n) for n in s])
    return replace.upper()
and for some reason, I get the following error code:
'dict' object is not callable
Why is it that my dictionary is not callable? What should I change in my code?
Thanks for any tips in advance!
To correctly convert DNA to RNA nucleotides in string s, use a combination of str.maketrans and str.translate, which replaces thymine with uracil while preserving the case. For example:
s = 'ACTGactgACTG'
s = s.translate(str.maketrans("tT", "uU"))
print(s)
# ACUGacugACUG
Note that in bioinformatics, case (lower or upper) is often important and should be preserved, so keeping both t -> u and T -> U is important. See, for example:
Uppercase vs lowercase letters in reference genome
SEE ALSO:
Character Translation using Python (like the tr command)
Note that there are specialized bioinformatics tools specifically for handling biological sequences.
For example, BioPython offers transcribe:
from Bio.Seq import Seq
my_seq = Seq('ACTGactgACTG')
my_seq = my_seq.transcribe()
print(my_seq)
# ACUGacugACUG
To install BioPython, use conda install biopython or conda create --name biopython biopython.
The error (a TypeError, not a syntax error) tells you that base_change(n,n) looks like you are trying to use base_change as the name of a function, when in fact it is a dictionary.
I guess what you wanted to say was
def transcribe_dna_to_rna(s):
    base_change = {"t":"U", "T":"U"}
    replace = "".join([base_change.get(n, n) for n in s])
    return replace.upper()
where the function is the .get(x, y) method of the dictionary, which returns the value for the key in x if it is present, and otherwise y (so in this case, return the original n if it's not in the dictionary).
But this is overcomplicating things; Python very easily lets you replace characters in strings.
def transcribe_dna_to_rna(s):
    return s.upper().replace("T", "U")
(Stole the reordering to put the .upper() first from @norie's answer; thanks!)
If your real dictionary was much larger, your original attempt might make more sense, as long chains of .replace().replace().replace()... are unattractive and eventually inefficient when you have a lot of them.
In python 3, use str.translate:
dna = "ACTG"
rna = dna.translate(str.maketrans("T", "U")) # "ACUG"
Change s to upper and then do the replacement.
def transcribe_dna_to_rna(s):
    return s.upper().replace("T", "U")

How to implement the concept of cellular automata in python

I am fairly new to Python (and programming in general; I just started 2 months ago). I have been tasked with creating a program that takes a user's starting string (e.g. "11001100") and prints each generation based on a set of rules. It then stops when it repeats the user's starting string. However, I am clueless as to where to even begin. I only vaguely understand the concept of cellular automata and am therefore at a loss as to how to implement it in a script.
Ideally, it would take the user's input string "11001100" (gen0), look at the rule set I created, and convert it so that "11001100" becomes "00110011" (gen1), then convert it again (gen2) and again (gen3) until it is back to the original input the user provided (gen0). My rule set is below:
print("What is your starting string?")
SS = input()
gen = [SS]
while 1:
for i in range(len(SS)):
if gen[-1] in gen[:-2]:
break
for g in gen:
print(g)
newstate = {
#this is used to convert the string. we break up the users string into threes. i.e if user enters 11001100, we start with the left most digit "1" and look at its neighbors (x-1 and x+1) or in this case "0" and "1". Using these three numbers we compare it to the chart below:
'000': 1 ,
'001': 1 ,
'010': 0 ,
'011': 0 ,
'100': 1 ,
'101': 1 ,
'110': 0 ,
'111': 0 ,
}
I would greatly appreciate any help or further explanation/dummy proof explanation of how to get this working.
Assuming that newstate is a valid dict whose key/value pairs correspond to your state replacement (if you want 100 to convert to 011, newstate would have newstate['100'] == '011'), you can use a generator expression over the characters of the string:
changed = ''.join(newstate[c] for c in prev)
where prev is your previous state string, i.e.:
>>> newstate = {'1':'0','0':'1'}
>>> ''.join(newstate[c] for c in '0100101')
'1011010'
You can then update a string by feeding it back through the same expression and assigning the result to the same variable:
>>> changed = '1010101'
>>> changed = ''.join(newstate[c] for c in changed)
>>> changed
'0101010'
You have the basic flow down in your original code; you just need to refine it. The pseudocode would look something like:
newstate = dict with key/value mapping pairs
original = input
changed = original->after changing
while changed != original:
    changed = changed->after changing
    print changed
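The pseudocode above can be turned into a runnable sketch that uses the three-cell neighbourhoods from the question. Several things here are assumptions rather than part of the original posts: the rule values are written as strings so they can be joined, the ends of the string wrap around (the first cell's left neighbour is the last cell), and max_steps is a hypothetical safety limit in case a rule set never returns to the starting string.

```python
# Rule table from the question, with string values so results can be joined directly
RULES = {
    "000": "1", "001": "1", "010": "0", "011": "0",
    "100": "1", "101": "1", "110": "0", "111": "0",
}


def next_generation(state, rules=RULES):
    """Apply the rules to every cell, using its left and right neighbours (wrapping around)."""
    n = len(state)
    return "".join(
        rules[state[(i - 1) % n] + state[i] + state[(i + 1) % n]]
        for i in range(n)
    )


def run_until_repeat(start, rules=RULES, max_steps=1000):
    """Collect generations, starting with the input, until the input comes back."""
    generations = [start]
    current = next_generation(start, rules)
    while current != start and len(generations) < max_steps:
        generations.append(current)
        current = next_generation(current, rules)
    return generations


for generation in run_until_repeat("11001100"):
    print(generation)
```

With this particular rule set each new cell is simply the inverse of the old centre cell, so "11001100" returns to itself after two steps.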
The easiest way to do this would be with the re.sub() function in Python's regex module, re.
import re
def replace_rule(string, new, pattern):
    return re.sub(pattern, new, string)
def replace_example(string):
    pattern = r"100"
    replace_with = "1"
    return re.sub(pattern, replace_with, string)
replace_example("1009")
=> '19'
replace_example("1009100")
=> '191'
Regex is a way to match strings to certain regular patterns, and do certain operations on them, like sub, which finds and replaces patterns in strings. Here is a link: https://docs.python.org/3/library/re.html

eliminate '\n' in dictionary

I have a dictionary that looks like this; the DNA sequences are the keys and the quality strings are the values:
{'TTTGTTCTTTTTGTAATGGGGCCAGATGTCACTCATTCCACATGTAGTATCCAGATTGAAATGAAATGAGGTAGAACTGACCCAGGCTGGACAAGGAAGG\n':
'eeeecdddddaaa`]eceeeddY\\cQ]V[F\\\\TZT_b^[^]Z_Z]ac_ccd^\\dcbc\\TaYcbTTZSb]Y]X_bZ\\a^^\\S[T\\aaacccBBBBBBBBBB\n',
'ACTTATATTATGTTGACACTCAAAAATTTCAGAATTTGGAGTATTTTGAATTTCAGATTTTCTGATTAGGGATGTACCTGTACTTTTTTTTTTTTTTTTT\n':
'dddddd\\cdddcdddcYdddd`d`dcd^dccdT`cddddddd^dddddddddd^ddadddadcd\\cda`Y`Y`b`````adcddd`ddd_dddadW`db_\n',
'CTGCCAGCACGCTGTCACCTCTCAATAACAGTGAGTGTAATGGCCATACTCTTGATTTGGTTTTTGCCTTATGAATCAGTGGCTAAAAATATTATTTAAT\n':
'deeee`bbcddddad\\bbbbeee\\ecYZcc^dd^ddd\\\\`]``L`ccabaVJ`MZ^aaYMbbb__PYWY]RWNUUab`Y`BBBBBBBBBBBBBBBBBBBB\n'}
I want to write a function so that if I query a DNA sequence, it returns a tuple of this DNA sequence and its corresponding quality value
I wrote the following function, but it gives me an error message that says list indices must be integers, not str
def query_sequence_id(self, dna_seq=''):
    """Overrides the query_sequence_id so that it optionally returns both the sequence and the quality values.
    If DNA sequence does not exist in the class, return a string error message"""
    list_dna = []
    for t in self.__fastqdict.keys():
        list_dna.append(t.rstrip('\n'))
    self.dna_seq = dna_seq
    if self.dna_seq in list_dna:
        return (self.dna_seq,self.__fastqdict.values()[self.dna_seq + "\n"])
    else:
        return "This DNA sequence does not exist"
so I want something like if I print
query_sequence_id("TTTGTTCTTTTTGTAATGGGGCCAGATGTCACTCATTCCACATGTAGTATCCAGATTGAAATGAAATGAGGTAGAACTGACCCAGGCTGGACAAGGAAGG"),
I would get
('TTTGTTCTTTTTGTAATGGGGCCAGATGTCACTCATTCCACATGTAGTATCCAGATTGAAATGAAATGAGGTAGAACTGACCCAGGCTGGACAAGGAAGG',
'eeeecdddddaaa`]eceeeddY\\cQ]V[F\\\\TZT_b^[^]Z_Z]ac_ccd^\\dcbc\\TaYcbTTZSb]Y]X_bZ\\a^^\\S[T\\aaacccBBBBBBBBBB')
I want to get rid of "\n" for both keys and values, but my code failed. Can anyone help me fix my code?
The newline characters aren't your problem, though they are messy. You're trying to index the result of dict.values() with a string. In Python 2 that result is a list (hence "list indices must be integers, not str"), and in Python 3 it is a view; neither is a mapping like the dict itself. Indexing it is not only not what you want, it also defeats the whole purpose of using the dictionary in the first place. Just look up the value in the dictionary, the normal way:
return (self.dna_seq, self.__fastqdict[self.dna_seq + "\n"])
As for the newlines, why not just take them out when you build the dictionary in the first place?
To modify the dictionary you can just do the following:
myNewDict = {}
for var in myDict:
    myNewDict[var.strip()] = myDict[var].strip()
You can remove those pesky newlines from your dictionary's keys and values like this (assuming your dictionary is stored in a variable named dna; on Python 3, use dna.items() instead of dna.iteritems()):
dna = {k.rstrip(): v.rstrip() for k, v in dna.iteritems()}
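Putting the two suggestions together (strip the newlines once, then do a plain dictionary lookup) might look like the sketch below. It is written for Python 3 and uses standalone functions instead of the class from the question; clean_records and query_sequence are hypothetical names, and the sample data is shortened for illustration.

```python
def clean_records(raw):
    """Return a new dict with the trailing newlines removed from keys and values."""
    return {seq.rstrip("\n"): qual.rstrip("\n") for seq, qual in raw.items()}


def query_sequence(records, dna_seq):
    """Return (sequence, quality) if the sequence is known, else an error string."""
    if dna_seq in records:
        return (dna_seq, records[dna_seq])
    return "This DNA sequence does not exist"


# Shortened, made-up records in the same shape as the question's dictionary
raw = {"ACGT\n": "dddd\n", "TTGA\n": "eeec\n"}
records = clean_records(raw)
print(query_sequence(records, "ACGT"))
print(query_sequence(records, "GGGG"))
```

Testing membership with `in` on the dictionary itself avoids building a separate list of keys for every query.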

Python: argument conversion during string format error /w dictionary/list reads

I'm new to these boards; I understand there is protocol, and any critique is appreciated. I began Python programming a few days ago and am trying to play catch-up. The basis of the program is to read a file and convert each occurrence of a specific string into a dictionary of positions within the document. Issues abound; I'll take all responses.
Here is my code:
f = open('C:\CodeDoc\Mm9\sampleCpG.txt', 'r')
cpglist = f.read()
def buildcpg(cpg):
    return "\t".join(["%d" % (k) for k in cpg.items()])
lookingFor = 'CG'
i = 0
index = 0
cpgdic = {}
try:
    while i < len(cpglist):
        index = cpglist.index(lookingFor, i)
        i = index + 1
        for index in range(len(cpglist)):
            if index not in cpgdic:
                cpgdic[index] = index
        print (buildcpg(cpgdic))
except ValueError:
    pass
f.close()
The cpgdic is supposed to act as a dictionary of the position reference obtained in the index. Each read of index should be entering cpgdic as a new value, and the print (buildcpg(cpgdic)) is my hunch of where the logic fails. I believe(??) it is passing cpgdic into the buildcpg function, where it should be returned as an output of all the positions of 'CG', however the error "TypeError:not all arguments converted during string formatting" shows up. Your turn!
ps. this destroys my 2GB memory; I need to improve with much more reading
cpg.items is yielding tuples. As such, k is a tuple (length 2) and then you're trying to format that as a single integer.
As a side note, you'll probably be a bit more memory efficient if you leave off the [ and ] in the join line. This will turn your list comprehension to a generator expression which is a bit nicer. If you're on python2.x, you could use cpg.iteritems() instead of cpg.items() as well to save a little memory.
It also makes little sense to store a dictionary where the keys and the values are the same. In this case, a simple list is probably more elegant. I would probably write the code this way:
with open(r'C:\CodeDoc\Mm9\sampleCpG.txt') as fin:
    cpgtxt = fin.read()
indices = [i for i,_ in enumerate(cpgtxt) if cpgtxt[i:i+2] == 'CG']
# indices holds ints, so convert each one to str before joining
print('\t'.join(str(i) for i in indices))
Here it is in action:
>>> s = "CGFOOCGBARCGBAZ"
>>> indices = [i for i,_ in enumerate(s) if s[i:i+2] == 'CG']
>>> print indices
[0, 5, 10]
Note that
i for i,_ in enumerate(s)
is roughly the same thing as
i for i in range(len(s))
except that I don't like range(len(s)) and the former version will work with any iterable -- Not just sequences.
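Since the original attempt was built around searching from the last hit, here is a hedged sketch of that approach using str.find, which returns -1 instead of raising when nothing more is found. The name find_all is hypothetical; the sketch reports overlapping matches as well, which makes no difference for 'CG'.

```python
def find_all(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # resume the search one character after the previous hit
        start = text.find(pattern, start + 1)
    return positions


indices = find_all("CGFOOCGBARCGBAZ", "CG")
print("\t".join(str(i) for i in indices))
```

Only the positions are stored, so memory use grows with the number of hits, not with the length of the file.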

PYTHON problem with negative decimals

I have a list of negative floats. I want to make a histogram with them. As far as I know, Python can't do operations with negative numbers. Is this correct? The list is like [-0.2923998, -1.2394875, -0.23086493, etc.]. I'm trying to find the maximum and minimum number so I can find out what the range is. My code is giving an error:
setrange = float(maxv) - float(minv)
TypeError: float() argument must be a string or a number
And this is the code:
f = open('clusters_scores.out','r')
#first, extract all of the sim values
val = []
for line in f:
    lineval = line.split()
    print lineval
    val.append(lineval)
print val
#val = map(float,val)
maxv = max(val)
minv = min(val)
setrange = float(maxv) - float(minv)
All the values that are being put into the 'val' list are negative decimals. What is the error referring to, and how do I fix it?
The input file looks like:
-0.0783532095182 -0.99415440702 -0.692972552716 -0.639273674023 -0.733029194040.765257900121 -0.755438339963
-0.144140594077 -1.06533353638 -0.366278118372 -0.746931508538 -1.02549039392 -0.296715961215
-0.0915937502791 -1.68680560936 -0.955147543358
-0.0488457137771 -0.0943080192383 -0.747534412969 -1.00491121699
-1.43973471463
-0.0642611118901 -0.0910684525497
-1.19327387414 -0.0794696449245
-1.00791366035 -0.0509749096549
-1.08046507281 -0.957339914505 -0.861495748259
The results of split() are a list of split values, which is probably why you are getting that error.
For example, if you do '-0.2'.split(), you get back a list with a single value ['-0.2'].
EDIT: Aha! With your input file provided, it looks like this is the problem: -0.733029194040.765257900121. I think you mean to make that two separate floats?
Assuming a corrected file like this:
-0.0783532095182 -0.99415440702 -0.692972552716 -0.639273674023 -0.733029194040 -0.765257900121 -0.755438339963
-0.144140594077 -1.06533353638 -0.366278118372 -0.746931508538 -1.02549039392 -0.296715961215
-0.0915937502791 -1.68680560936 -0.955147543358
-0.0488457137771 -0.0943080192383 -0.747534412969 -1.00491121699
-1.43973471463
-0.0642611118901 -0.0910684525497
-1.19327387414 -0.0794696449245
-1.00791366035 -0.0509749096549
-1.08046507281 -0.957339914505 -0.861495748259
The following code will no longer throw that exception:
f = open('clusters_scores.out','r')
#first, extract all of the sim values
val = []
for line in f:
    linevals = line.split()
    print(linevals)
    val += linevals
print(val)
# list(...) so the result can be traversed twice on Python 3, where map() returns an iterator
val = list(map(float, val))
maxv = max(val)
minv = min(val)
setrange = float(maxv) - float(minv)
I have changed it to take the list result from split() and concatenate it to the list, rather than append it, which will work provided there are valid inputs in your file.
"All the values that are being put into the 'val' list are negative decimals."
No, they aren't; they're lists of strings that represent negative decimals, since the .split() call produces a list. maxv and minv are lists of strings, which can't be fed to float().
"What is the error referring to, and how do I fix it?"
It's referring to the fact that the contents of val aren't what you think they are. The first step in debugging is to verify your assumptions. If you try this code out at the REPL, then you could inspect the contents of maxv and minv and notice that you have lists of strings rather than the expected strings.
I assume you want to put all the lists of strings (from each line of the file) together into a single list of strings. Use val.extend(lineval) rather than val.append(lineval).
That said, you'll still want to map the strings into floats before calling max or min, because otherwise you will be comparing the strings as strings rather than floats. (String comparison is lexicographic, so '-1.6' sorts above '-0.09'; it can give the wrong answer for numbers.)
Simpler yet, just read the entire file at once and split it; .split() without arguments splits on whitespace, and a newline is whitespace. You can also do the mapping at the same point as the reading, with careful application of a list comprehension. I would write:
with open('clusters_scores.out') as f:
    val = [float(x) for x in f.read().split()]
result = max(val) - min(val)
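Once the values are floats, the range can be used to bin them for the histogram mentioned in the question. The sketch below is one simple way to count values per bin with equal-width bins; the function name histogram and the default of 5 bins are assumptions for illustration, and the sample values are taken from the question's input file.

```python
def histogram(values, bins=5):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    # fall back to a width of 1.0 when all values are identical
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        # the maximum value would land one past the end, so clamp it into the last bin
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


val = [-0.0783532095182, -0.99415440702, -0.692972552716, -1.43973471463]
print(histogram(val, bins=3))
```

Negative values need no special handling here, because each value is shifted by the minimum before it is binned.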
