DNA translation with a Python script

I'm trying to build a tool that translates DNA sequences and then compares them with each other so that duplicates can be removed.
I used this script to read my FASTQ file:
from Bio import SeqIO

def sequence_cleaner(fastq_file, min_length=0, por_n=100):
    # Create our hash table to add the sequences
    sequences = {}
    # Using the Biopython fastq parse we can read our fastq input
    for seq_record in SeqIO.parse(fastq_file, "fastq"):
        # Take the current sequence
        sequence = str(seq_record.seq).upper()
        # Check if the current sequence is according to the user parameters
        if (len(sequence) >= min_length and
                (float(sequence.count("N"))/float(len(sequence)))*100 <= por_n):
            # If the sequence passed in the test "is it clean?" and it isn't in the
            # hash table, the sequence and its id are going to be in the hash
            if sequence not in sequences:
                sequences[sequence] = seq_record.id
            # If it is already in the hash table, we're just gonna concatenate the ID
            # of the current sequence to another one that is already in the hash table
            else:
                sequences[sequence] += "_" + seq_record.id
        print(sequence)
        trans = translate(sequence)
    # Write the clean sequences
    # Create a file in the same directory where you ran this script
    output_file = open("clear_" + fastq_file, "w+")
    # Just read the hash table and write on the file as a fasta format
    for sequence in sequences:
        output_file.write("#" + sequences[sequence] + "\n" + sequence + "\n" + trans + "\n")
    output_file.close()
    print("\n YOUR SEQUENCES ARE CLEAN!!!\nPlease check clear_" + fastq_file + " on the same repository than " + rep + "\n")
and I used this one to translate it into amino acid sequences:
def translate( sequ ):
    """Return the translated protein from 'sequence' assuming +1 reading frame"""
    gencode = {
        'ATA':'Ile', 'ATC':'Ile', 'ATT':'Ile', 'ATG':'Met',
        'ACA':'Thr', 'ACC':'Thr', 'ACG':'Thr', 'ACT':'Thr',
        'AAC':'Asn', 'AAT':'Asn', 'AAA':'Lys', 'AAG':'Lys',
        'AGC':'Ser', 'AGT':'Ser', 'AGA':'Arg', 'AGG':'Arg',
        'CTA':'Leu', 'CTC':'Leu', 'CTG':'Leu', 'CTT':'Leu',
        'CCA':'Pro', 'CCC':'Pro', 'CCG':'Pro', 'CCT':'Pro',
        'CAC':'His', 'CAT':'His', 'CAA':'Gln', 'CAG':'Gln',
        'CGA':'Arg', 'CGC':'Arg', 'CGG':'Arg', 'CGT':'Arg',
        'GTA':'Val', 'GTC':'Val', 'GTG':'Val', 'GTT':'Val',
        'GCA':'Ala', 'GCC':'Ala', 'GCG':'Ala', 'GCT':'Ala',
        'GAC':'Asp', 'GAT':'Asp', 'GAA':'Glu', 'GAG':'Glu',
        'GGA':'Gly', 'GGC':'Gly', 'GGG':'Gly', 'GGT':'Gly',
        'TCA':'Ser', 'TCC':'Ser', 'TCG':'Ser', 'TCT':'Ser',
        'TTC':'Phe', 'TTT':'Phe', 'TTA':'Leu', 'TTG':'Leu',
        'TAC':'Tyr', 'TAT':'Tyr', 'TAA':'STOP', 'TAG':'STOP',
        'TGC':'Cys', 'TGT':'Cys', 'TGA':'STOP', 'TGG':'Trp'}
    return ''.join(gencode.get(sequ[3*i:3*i+3],'X') for i in range(len(sequ)//3))
The result is not what I expected:
First, the sequence IDs are not in order from 1 to 4 as in the original file. Second, the translation of the 4th sequence is repeated for the other three sequences!

To answer your two questions:
"the sequences id are not sorted from 1 to 4 like on the original file"
You are using a dictionary, which is unordered (in Python versions before 3.7). As the documentation puts it:
Regular Python dictionaries iterate over key/value pairs in arbitrary order.
You could sort your dictionary by values (see the question "Sort a Python dictionary by value" for a suggestion), or use an ordered dictionary such as collections.OrderedDict, which remembers insertion order.
"it repeats the same 4th id translation for the three other sequences"
You assign trans = translate(sequence) for each sequence, but you never store trans in a dictionary or list keyed to that sequence's ID. When the output loop runs, trans still holds the translation of the last sequence read, so that value is written for every entry. Try using a separate dictionary which stores the translated sequence together with the sequence ID.
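A minimal sketch of that suggestion, under a few assumptions: the records arrive as plain (id, sequence) pairs instead of Biopython objects, the names dedupe_and_translate and CODON_TABLE are hypothetical, and the table uses one-letter amino acid symbols rather than the three-letter codes in the question. Each unique sequence keeps its own translation and its own list of IDs, and an OrderedDict preserves the order of the input file.

```python
from collections import OrderedDict

# Standard genetic code, one-letter amino acid symbols, '*' for stop codons.
# (Assumption: one-letter symbols instead of the question's three-letter codes.)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES),
    AMINO_ACIDS,
))


def translate(seq):
    """Translate in the +1 reading frame; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def dedupe_and_translate(records):
    """records: iterable of (id, sequence) pairs, in file order."""
    unique = OrderedDict()  # sequence -> {"ids": [...], "protein": str}
    for seq_id, sequence in records:
        sequence = sequence.upper()
        if sequence not in unique:
            # Translate once and keep the result next to this sequence's IDs.
            unique[sequence] = {"ids": [], "protein": translate(sequence)}
        unique[sequence]["ids"].append(seq_id)
    return unique


# Hypothetical sample data: seq3 duplicates seq1 (case-insensitively).
records = [("seq1", "ATGGCC"), ("seq2", "TTTTAA"), ("seq3", "atggcc")]
result = dedupe_and_translate(records)
for sequence, entry in result.items():
    print(">" + "_".join(entry["ids"]))
    print(sequence)
    print(entry["protein"])
```

Because the translation is stored per sequence, the output loop no longer depends on whatever value a loop variable happened to hold last.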


How to speed up this word-tuple finding algorithm?

I am trying to create a simple model to predict the next word in a sentence. I have a big .txt file that contains sentences separated by '\n'. I also have a vocabulary file which lists every unique word in my .txt file along with a unique ID. I used the vocabulary file to convert the words in my corpus to their corresponding IDs. Now I want to make a simple model which reads the IDs from the .txt file, finds the word pairs, and counts how many times each pair was seen in the corpus. I have managed to write the code below:
tuples = [[]]  # array for word tuples to be stored in
data = []  # array for tuple frequencies to be stored in
data.append(0)  # tuples array starts with an empty element at the beginning for some reason.
# Adding zero to the beginning of the frequency array levels the indexes of the two arrays
with open("markovData.txt") as f:
    contentData = f.readlines()
contentData = [x.strip() for x in contentData]
lineIndex = 0
for line in contentData:
    tmpArray = line.split()  # split line to array of words
    tupleIndex = 0
    tmpArrayIndex = 0
    for tmpArrayIndex in range(len(tmpArray) - 1):  # do this for every word except the last one since the last word has no word after it.
        if [tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] in tuples:  # if the word pair was seen before
            data[tuples.index([tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]])] += 1  # increment the frequency of said pair
        else:  # if the word pair was never seen before
            tuples.append([tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]])
            data.append(1)  # add the pair to list and set frequency to 1.
    # print every 1000th line to check the progress
    lineIndex += 1
    if ((lineIndex % 1000) == 0):
        print(lineIndex)
with open("markovWindowSize1.txt", 'a', encoding="utf8") as markovWindowSize1File:
    # write tuples to txt file
    for pair in tuples:
        if (len(pair) > 0):  # if tuple is not empty
            markovWindowSize1File.write(str(pair[0]) + "," + str(pair[1]) + " ")
    # blank spaces between two data
    # write frequencies of the tuples to txt file
    for element in data:
        markovWindowSize1File.write(str(element) + " ")
This code seems to work well for the first couple of thousand lines. Then things get slower, because the tuple list keeps growing and I have to search the whole list to check whether the next word pair was seen before. I managed to process 50k lines in 30 minutes, but I have much bigger corpora with millions of lines. Is there a way to store and search for the word pairs more efficiently? Matrices would probably work a lot faster, but my vocabulary has about 300,000 unique words, which means I would have to create a 300k*300k matrix of integers. Even after taking advantage of symmetric matrices, it would require a lot more memory than I have.
I tried using memmap from numpy to store the matrix on disk rather than in memory, but it required about 500 GB of free disk space.
Then I studied sparse matrices and found out that I can store just the non-zero values and their corresponding row and column numbers, which is what I did in my code.
Right now this model works, but it is very bad at guessing the next word correctly (about 8% success rate). I need to train with bigger corpora to get better results. What can I do to make this word-pair finding code more efficient?
Edit: Thanks to everyone who answered, I am now able to process my corpus of ~500k lines in about 15 seconds. I am adding the final version of the code below for people with similar problems:
import time
start = time.time()
myDict = {}  # empty dict
with open("markovData.txt") as f:
    contentData = f.readlines()
contentData = [x.strip() for x in contentData]
lineIndex = 0
for line in contentData:
    tmpArray = line.split()  # split line to array of words
    tmpArrayIndex = 0
    for tmpArrayIndex in range(len(tmpArray) - 1):  # do this for every word except the last one since the last word has no word after it.
        if (tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]) in myDict:  # if the word pair was seen before
            myDict[tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] += 1  # increment the frequency of said pair
        else:  # if the word pair was never seen before
            myDict[tmpArray[tmpArrayIndex], tmpArray[tmpArrayIndex + 1]] = 1  # add the pair and set frequency to 1.
    # print every 1000th line to check the progress
    lineIndex += 1
    if ((lineIndex % 1000) == 0):
        print(lineIndex)
end = time.time()
print(end - start)
keyText = ""
valueText = ""
for key1, key2 in myDict:
    keyText += (str(key1) + "," + str(key2) + " ")
    valueText += (str(myDict[key1, key2]) + " ")
with open("markovPairs.txt", 'a', encoding="utf8") as markovPairsFile:
    markovPairsFile.write(keyText)
with open("markovFrequency.txt", 'a', encoding="utf8") as markovFrequencyFile:
    markovFrequencyFile.write(valueText)
As I understand it, you are trying to build a Markov model from the frequencies of n-grams (word tuples of length n). Try a more efficiently searchable data structure, for example a nested dictionary. It could be of the form
{ID_word1:{ID_word1:x1,... ID_wordk:y1}, ...ID_wordk:{ID_word1:xn, ...ID_wordk:yn}}.
This means you have at most k**2 dictionary entries for tuples of 2 words (Google uses up to 5 for automatic translation), where k is the cardinality of V, your (finite) vocabulary. In practice only the pairs that actually occur need to be stored. This should boost your performance, since a dictionary lookup takes roughly constant time and you no longer have to search a growing list of tuples. x and y stand for the occurrence counts, which you should increment when encountering a tuple. (Never use the built-in count() method for this: it rescans the whole list on every call.)
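A minimal sketch of the nested-dictionary idea, assuming the corpus is available as an iterable of whitespace-separated ID strings. The function names count_pairs and most_likely_next and the tiny corpus are hypothetical, chosen only for illustration. Only pairs that actually occur get an entry, so the structure stays sparse.

```python
from collections import defaultdict


def count_pairs(lines):
    """Count adjacent word pairs as {first_word: {next_word: count}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        words = line.split()
        # zip pairs each word with its successor; the last word has none.
        for first, second in zip(words, words[1:]):
            counts[first][second] += 1
    return counts


def most_likely_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word)
    if not followers:
        return None
    return max(followers, key=followers.get)


# Hypothetical corpus of word IDs, one sentence per entry.
corpus = ["1 2 3", "1 2 4", "1 2 3"]
pair_counts = count_pairs(corpus)
print(most_likely_next(pair_counts, "2"))
```

Keying the outer dictionary by the first word also makes prediction cheap: all candidates for the next word sit in one inner dictionary.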
I would also look into collections.Counter, a data structure made for your task. A Counter object is like a dictionary but counts the occurrences of a key entry. You could use this by simply incrementing a word pair as you encounter it:
from collections import Counter
word_counts = Counter()
with open("markovData.txt", "r") as f:
    for line in f:
        words = line.split()
        # iterate over adjacent word pairs
        for word1, word2 in zip(words, words[1:]):
            word_counts[(word1, word2)] += 1
Alternatively, you can construct the tuple list as you have and simply pass this into a Counter as an object to compute the frequencies at the end:
word_counts = Counter(word_tuple_list)
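As a rough illustration of the second variant, the pairs do not have to be materialised in a list first: a Counter accepts any iterable, including a generator. The helper iter_pairs and the sample lines below are assumptions made for this sketch.

```python
from collections import Counter


def iter_pairs(lines):
    """Yield (word, next_word) tuples for every adjacent pair in every line."""
    for line in lines:
        words = line.split()
        for pair in zip(words, words[1:]):
            yield pair


# Hypothetical sample data standing in for the lines of markovData.txt.
lines = ["7 8 9", "7 8", "8 9 7"]
word_counts = Counter(iter_pairs(lines))

# most_common(n) returns the n most frequent pairs with their counts.
top_pairs = word_counts.most_common(2)
print(top_pairs)
```

Tuples are used as keys because they are hashable; the lists in the question's original code cannot be dictionary keys.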

Rename file by joining two search results

I am learning Python (and also English). I am writing code that reads a TXT file, finds a sequence of numbers in it, and renames the file with the sequence found. Besides looking for this sequence of numbers, I also need to look for a set of words. For example, if I find the words Apple, Watermelon and Pineapple, and do not find Pumpkin, the TXT is classified as "fruits", and the file is renamed with the sequence of digits plus an "f" for fruit. This is what I have so far:
import os
import re

name_files2 = os.listdir(path_txt)
for TXT in name_files2:
    with open(path_txt + '\\' + TXT, "r") as content:
        search = re.search(r'(\d{5})\-(\d{2})\-(\d{4})\.(\d)\.(\d{2})\.(\d{4})|'
                           r'(\d{7})\-(\d{2})\-(\d{4})\.(\d)\.(\d{4})', content.read())
    if search is not None:
        name2 = search.group(0)
        name2 = re.sub(r"\D", "", name2)  # keep only the digits
        fp = os.path.join("18_digitos", name2 + "_%d.txt")
        postfix = 0
        # find the first increment that is not taken yet
        while os.path.exists(fp % postfix):
            postfix += 1
        os.rename(
            os.path.join(path_txt, TXT),
            fp % postfix
        )
I can find the words in the text this way, but I cannot do both at the same time:
if text_complete.find("apple") >= 0 and text_complete.find("watermelon") >= 0 and \
        text_complete.find("pineapple") >= 0 and text_complete.find("pumpkin") < 0:
    print("Find Fruit")
Basically, I need to make the two pieces of code work together: find the 18-digit sequence, identify the keywords and classify the file (as fruits, for example), and rename the file with the sequence found + the keyword classification + an increment. Example: 12345678901234567_f_0 , 12345678901234567_f_1.
Currently it only concatenates the sequence and the increment, for example: 12345678901234567_0 , 12345678901234567_1. I added the increment to differentiate files that contain the same sequence of numbers.
EDIT: What I cannot work out is how to join the sequence and the fruit classification that were extracted from the same text. The same number may have the classification fruits or vegetables, for example, so I need to know which sequence each fruit or vegetable classification came from in order to rename the file.
If I understand correctly, you want to check the contents of the file twice: once to extract a sequence of numbers and once to check if it contains "fruit" words.
In order to look at the text more than once you should store the contents of the file in its own variable.
You can change the code in the with block to:
with open(path_txt + '\\' + TXT, "r") as content:
    text_complete = content.read()
Later you can check for your number sequence:
search = re.search(r'...', text_complete) # ... is your long regular expression
And you can also run your if statement to check for "fruit" words:
if text_complete.find("apple") >= 0 and ... : # ... is the rest of your condition
    found_fruit = True
By storing the contents of the file as a string in the text_complete variable, you can refer to it multiple times, checking for something different each time.
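Put together, the two checks could look roughly like the sketch below. It works on the text alone so it can be tried without touching any files. The names classify and new_name, the taken argument (standing in for the os.path.exists check), and the lowercasing of the text are assumptions made for this illustration, not part of the original code.

```python
import re

# The two 18-digit layouts from the question, without the capture groups.
PATTERN = re.compile(
    r"\d{5}-\d{2}-\d{4}\.\d\.\d{2}\.\d{4}|\d{7}-\d{2}-\d{4}\.\d\.\d{4}"
)


def classify(text):
    """Return 'f' when the fruit words are present and 'pumpkin' is not."""
    lowered = text.lower()  # assumption: matching should ignore case
    has_fruit = all(w in lowered for w in ("apple", "watermelon", "pineapple"))
    if has_fruit and "pumpkin" not in lowered:
        return "f"
    return None


def new_name(text, taken=()):
    """Build '<digits>[_f]_<n>.txt' from one text, or None if no sequence."""
    match = PATTERN.search(text)
    if match is None:
        return None
    digits = re.sub(r"\D", "", match.group(0))
    label = classify(text)  # same text, so sequence and label belong together
    stem = digits + ("_" + label if label else "")
    postfix = 0
    while "%s_%d.txt" % (stem, postfix) in taken:
        postfix += 1
    return "%s_%d.txt" % (stem, postfix)


sample = "Case 12345-67-8901.2.34.5678: apple, watermelon, pineapple"
print(new_name(sample))
```

Since both the sequence and the classification are derived from the same text_complete string inside one function call, they cannot get mixed up between files. Note that "apple" also matches inside "pineapple", exactly as with find() in the question.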

Python dictionary attack only solves the last hash

I'm trying to build a dictionary attack script (for educational purposes) using Python, and it only ever solves the last SHA-256 hash in my file.
The logic is as follows:
Read a file containing words
Store the hashed value of the word along with the word as a key-value pair in a dictionary
Scan the lines of a text file containing SHA-256 hashes (1 hashed value per line)
Iterate over the items in the dictionary and print the key if a value matches the hash
It works perfectly for the very last item in my file, but says a match was not found for all my others.
My hash file looks like:
Containing the hashed values for "test" and "password".
My word file contains over 70,000 words and I've made sure both words are in the file, and when I debug, they both have values in the dictionary if I call the expression.
Here's where I iterate over the hashes in my file:
with open(hashFile) as f:
    for c in f:
        findMatch(str(c).lower(),wordMap)
And the function I wrote to compare a hashed value to every value in the dictionary:
def findMatch(hv,m):
    #k is the key, m is the dictionary
    for k in m:
        if(m[k].lower() == hv):
            print("Match was found: " + k )
            return
    print("Match was not found, searched through " + str(len(wordMap)) + " words")
Any help is appreciated, thanks!
In findMatch(str(c).lower(),wordMap), there is no need to call str() (because c is already a string), but there is a need to strip off the trailing newline character: findMatch(c.strip().lower(),wordMap). Otherwise the newline is part of the string being compared, so it never equals any of the stored hash values. Apparently the last line of your file does not have a trailing newline, which is why it is the only one recognized correctly.
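A small sketch of the corrected flow, with one change that goes beyond the answer and is labeled as such: the dictionary is keyed by hash instead of by word, so each lookup is a single dictionary access rather than a loop over all words. The names build_hash_map and crack and the in-memory sample lines are hypothetical.

```python
import hashlib


def build_hash_map(words):
    """Map SHA-256 hex digest -> word (assumption: keyed by hash, not word)."""
    return {hashlib.sha256(w.encode("utf-8")).hexdigest(): w for w in words}


def crack(hash_lines, hash_map):
    """Look up every hash line; returns {digest: word or None}."""
    results = {}
    for line in hash_lines:
        digest = line.strip().lower()  # strip() removes the trailing newline
        if digest:
            results[digest] = hash_map.get(digest)
    return results


# Hypothetical data: lines as they would come from iterating over a file,
# where every line except possibly the last ends in "\n".
words = ["test", "password", "other"]
hash_lines = [
    hashlib.sha256(b"test").hexdigest() + "\n",
    hashlib.sha256(b"password").hexdigest(),
]
found = crack(hash_lines, build_hash_map(words))
for digest, word in found.items():
    print(digest[:12], "->", word)
```

Without the strip() call, only the second line (the one lacking a newline) would be found, which is the behaviour described in the question.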

Rosalind Profile and Consensus: Writing long strings to one line in Python (Formatting)

I'm trying to tackle a problem on Rosalind where, given a FASTA file of at most 10 sequences at 1kb, I need to give the consensus sequence and profile (how many of each base do all the sequences have in common at each nucleotide). In the context of formatting my response, what I have as my code works for small sequences (verified).
However, I have issues in formatting my response when it comes to large sequences.
What I expect to return, regardless of length, is:
"consensus sequence"
"A: one line string of numbers without commas"
"C: one line string """" "
"G: one line string """" "
"T: one line string """" "
All of these should be aligned with each other, each on its own line, or at least formatted in a way that lets me carry the output onward as a unit without losing the alignment.
But when I run my code on a large sequence, each string below the consensus sequence is broken up by newlines, presumably because the string itself is too long. I've been struggling to think of ways around the issue, and my searches have been fruitless. I'm thinking about some iterative writing algorithm that writes the whole of the expected output above, but in chunks. Any help would be greatly appreciated. I have attached the entirety of my code below for the sake of completeness, with block comments as needed.
def cons(file):
#returns consensus sequence and profile of a FASTA file
import os
path = os.path.abspath(os.path.expanduser(file))
with open(path,"r") as D:
#initialize list of sequences, list of all strings, and a temporary storage
#list, respectively
#get a list of strings from the file, stripping the newline character
for x in F:
#if the string in question is a nucleotide sequence (without ">")
#i'll store that string into a temporary variable until I run into a string
#with a ">", in which case I'll join all the strings in my temporary
#sequence list and append to my list of sequences SEQS
for i in range(1,len(mystrings)):
if ">" not in mystrings[i]:
#set up list of nucleotide counts for A,C,G and T, in that order
ACGT= [[0 for i in range(0,len(SEQS[0]))],
[0 for i in range(0,len(SEQS[0]))],
[0 for i in range(0,len(SEQS[0]))],
[0 for i in range(0,len(SEQS[0]))]]
#assumed to be equal length sequences. Counting amount of shared nucleotides
#in each column
for i in range(0,len(SEQS[0])-1):
for j in range(0, len(SEQS)):
if SEQS[j][i]=="A":
elif SEQS[j][i]=="C":
elif SEQS[j][i]=="G":
elif SEQS[j][i]=="T":
acgt=["A: ","C: ","G: ","T: "]
for i in range(0,len(TR_ACGT)-1):
if comp.index(max(comp))==0:
elif comp.index(max(comp))==1:
elif comp.index(max(comp))==2:
elif comp.index(max(comp))==3:
writing to file... trying to get it to write as
consensus sequence
A: blah(1line)
C: blah(1line)
G: blah(1line)
T: blah(line)
which works for small sequences. but for larger sequences
python keeps adding newlines if the string in question is very long...
writing_strings=[acgt[i]+' '.join(str(n) for n in ACGT[i] for i in range(0,len(ACGT))) for i in range(0,len(acgt))]
with open(myfile,'w') as D:
for i in range(0,len(writing_strings)):
Your code is totally fine except for this line:
writing_strings=[acgt[i]+' '.join(str(n) for n in ACGT[i] for i in range(0,len(ACGT))) for i in range(0,len(acgt))]
You accidentally replicate your data: the extra for i in range(0,len(ACGT)) inside the join repeats every count once per row of ACGT. Try replacing it with:
writing_strings=[acgt[i] + str(ACGT[i]) for i in range(0,len(ACGT))]
and then write it to your output file as follows:
That's a lazy way to get rid of the brackets from your list.
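For comparison, here is a sketch that builds each profile line with a single join over one row, so no brackets or commas appear in the first place. It is not the asker's code: the function names profile_and_consensus and format_report are hypothetical, the sequences are assumed to be of equal length and to contain only A, C, G and T, and ties in the consensus are broken in A, C, G, T order.

```python
def profile_and_consensus(seqs):
    """Count bases per column and derive a consensus string."""
    length = len(seqs[0])
    counts = {base: [0] * length for base in "ACGT"}
    for seq in seqs:
        for pos, base in enumerate(seq):
            counts[base][pos] += 1
    # For each column pick the base with the highest count.
    consensus = "".join(
        max("ACGT", key=lambda b: counts[b][pos]) for pos in range(length)
    )
    return consensus, counts


def format_report(consensus, counts):
    """One line for the consensus, then one line per base: 'A: 5 1 0 ...'."""
    lines = [consensus]
    for base in "ACGT":
        lines.append("%s: %s" % (base, " ".join(str(n) for n in counts[base])))
    return "\n".join(lines) + "\n"


# Hypothetical input: a handful of short, equal-length sequences.
seqs = ["ATCCAGCT", "GGGCAACT", "ATGGATCT", "AAGCAACC",
        "TTGGAACT", "ATGCCATT", "ATGGCACT"]
consensus, counts = profile_and_consensus(seqs)
report = format_report(consensus, counts)
print(report, end="")
```

Each line is written as one string regardless of its length; any wrapping seen afterwards would come from the viewer or editor, not from the file contents.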

best way to compare sequence of letters inside file?

I have a file that contains lots of sequences of letters.
Some of these sequences might be equal, so I would like to compare them all against each other.
I'm doing something like this, but it isn't exactly what I wanted:
for line in fl:
    line = line.split()
    for elem in line:
        if '>' in elem:
            for el in line:
                if elem == el:
                    print(elem, el)
example of the file:
So what I want is to know whether any sequence is exactly equal to sequence 1, or to sequence 2, and so on.
If the goal is to simply group like sequences together, then simply sorting the data will do the trick. Here is a solution that uses BioPython to parse the input FASTA file, sorts the collection of sequences, uses the standard Python itertools.groupby function to merge ids for equal sequences, and outputs a new FASTA file:
from itertools import groupby
from Bio import SeqIO
records = list(SeqIO.parse('spoo.fa', 'fasta'))
def seq_getter(s): return str(s.seq)
# groupby only merges adjacent items, so sort by sequence first
records.sort(key=seq_getter)
for seq, equal in groupby(records, seq_getter):
    ids = ','.join(s.id for s in equal)
    print('>%s' % ids)
    print(seq)
In general for this type of work you may want to investigate Biopython which has lots of functionality for parsing and otherwise dealing with sequences.
However, your particular problem can be solved using a dict, an example of which Manoj has given you.
Comparing long sequences of letters is going to be pretty inefficient. It will be quicker to compare the hash of the sequences. Python offers two built in data types that use hash: set and dict. It's best to use dict here as we can store the line numbers of all the matches.
I've assumed the file has identifiers and sequences on alternate lines, so if we split the file text on newlines we can take one line as the id and the next as the sequence to match.
We then use a dict with the sequence as a key. The corresponding value is a list of ids which have this sequence. By using defaultdict from collections we can easily handle the case of a sequence not being in the dict; if the key hasn't been used before, defaultdict will automatically create a value for us, in this case an empty list.
So when we've finished working through the file the values of the dict will effectively be a list of lists, each entry containing the ids which share a sequence. We can then use a list comprehension to pull out the interesting values, i.e. entries where more than one id is used by a sequence.
from collections import defaultdict
lines = filetext.split("\n")
sequences = defaultdict(list)
while len(lines) >= 2:  # stop before a trailing empty line
    id = lines.pop(0)
    data = lines.pop(0)
    sequences[data].append(id)
results = [match for match in sequences.values() if len(match) > 1]
print(results)
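To try the approach above without a file, the same idea can be wrapped in a function and run on an in-memory string. This is only a sketch: the name duplicate_groups and the sample text are hypothetical, and it assumes the same layout of identifier and sequence on alternate lines.

```python
from collections import defaultdict


def duplicate_groups(filetext):
    """Return lists of ids that share an identical sequence."""
    # Drop blank lines so a trailing newline does not shift the pairing.
    lines = [ln.strip() for ln in filetext.splitlines() if ln.strip()]
    ids_by_sequence = defaultdict(list)
    # Walk the lines two at a time: identifier, then its sequence.
    for identifier, sequence in zip(lines[0::2], lines[1::2]):
        ids_by_sequence[sequence].append(identifier.lstrip(">"))
    return [ids for ids in ids_by_sequence.values() if len(ids) > 1]


# Hypothetical sample: sequences 1 and 3 match, as do 2 and 4.
sample = ">1\nACGT\n>2\nTTGA\n>3\nACGT\n>4\nTTGA\n>5\nCCCC\n"
groups = duplicate_groups(sample)
print(groups)
```

Each sequence is hashed once when it is used as a dictionary key, so the work grows with the number of lines rather than with the number of pairwise comparisons.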
The following script will return a count of sequences. It returns a dictionary with the distinct sequences as keys and, as values, the list of line numbers where each sequence occurs.
import sys
from collections import defaultdict
def count_sequences(filename):
    result = defaultdict(list)
    with open(filename) as f:
        for index, line in enumerate(f):
            sequence = line.replace('\n', '')
            line_number = index + 1
            result[sequence].append(line_number)
    return result
if __name__ == '__main__':
    filename = sys.argv[1]
    for sequence, occurrences in count_sequences(filename).items():
        print("%s: %s, found in %s" % (sequence, len(occurrences), occurrences))
Sample output:
etc@etc:~$ python ./fasta.py /path/to/my/file
Update 1
Changed code to use defaultdict and a for loop. Thanks @KennyTM.
Update 2
Changed code to use append rather than +. Thanks @Dave Webb.

