I am making a .fasta parser script. (I know some parsers already exist for .fasta files, but I need practice dealing with large files and thought this was a good start.)
The goal of the program: take a very large .fasta file with multiple sequences and write the reverse complement of each sequence to a new file.
I have a script that reads it one line at a time, but each line is only about 50 bytes in a regular .fasta file, so buffering only one line at a time is not necessary. Is there a way I can set the number of lines to buffer at a time?
For people who don't know what FASTA is: a .fasta file is a text file of DNA/RNA/protein sequences, each preceded by a header line that starts with '>'. Example:
>Sequence_1
ATG
TATA
>Sequence_2
TATA
GACT
ATG
Overview of my code:
Read the first byte of each line to map the location of the sequences in the file by finding the '>'s.
Use this map to read each sequence backwards, line by line (this is the part I want to change).
Reverse-complement the base pairs in the string (i.e. G->C).
Write this line to a new file.
Here is what should be all the relevant code so you can see the functions the lines I am trying to change are calling: (can probably skip)
from linecache import getline  # assumed: the 1-indexed getline used below matches linecache.getline

def get_fasta_seqs(BIGFILE): #Returns an object containing all of the seq locations
    f = open(BIGFILE)
    cnt = 0
    seq_name = []
    seq_start = []
    seq_end = []
    seqcount = 0
    #print(line)
    #for loop skips first line check for seq name
    if f.readline(1) == '>':
        seq_name.append(cnt)
        seq_start.append(cnt+1)
        seqcount += 1
    for line in f:
        cnt += 1
        if f.readline(1) == '>':
            seq_name.append(cnt)
            seq_start.append(cnt+1)
            seqcount += 1
            if seqcount > 1:
                seq_end.append(cnt-1)
    seq_end.append(cnt-1) #add location of final line
    seqs = fileseq(seq_name,seq_start,seq_end,seqcount) #This class only has a __init__ function for these lists
    return seqs

def fasta_rev_compliment(fasta_read,fasta_write = "default",NTtype = "DNA"):
    if fasta_write == 'default':
        fasta_write = fasta_read[:-6] + "_RC.fasta"
    seq_map = get_fasta_seqs(fasta_read)
    print(seq_map.seq_name)
    f = open(fasta_write,'a')
    for i in range(seq_map.seqcount): #THIS IS WHAT I WANT TO CHANGE
        line = getline(fasta_read,seq_map.seq_name[i]+1) #getline is reading it as 1 indexed?
        f.write(line)
        my_fasta_seqs = get_fasta_seqs(fasta_read)
        for seqline in reversed(range(seq_map.seq_start[i],seq_map.seq_end[i]+1)):
            seq = getline(fasta_read,seqline+1)
            seq = seq.replace('\n','')
            seq = reverse_compliment(seq,NTtype = NTtype) #this function just returns the reverse complement for that line.
            seq = seq + '\n'
            f.write(seq)
    f.close()

fasta_rev_compliment('BIGFILE.fasta')
The main bit of code I want to change is here:
for i in range(seq_map.seqcount): #THIS IS WHAT I WANT TO CHANGE
    line = getline(fasta_read,seq_map.seq_name[i]+1) #getline is reading it as 1 indexed?
    f.write(line)
    my_fasta_seqs = get_fasta_seqs(fasta_read)
    for seqline in reversed(range(seq_map.seq_start[i],seq_map.seq_end[i]+1)):
        seq = getline(fasta_read,seqline+1)
I want something like this:
def fasta_rev_compliment(fasta_read,fasta_write = "default",NTtype = "DNA",lines_to_record_before_flushing = 5):
    ###MORE CODE###
    #i want something like this
    for i in range(seq_map.seqcount): #THIS IS WHAT I WANT TO CHANGE
        #is there a way to load
        line = getline(fasta_read,seq_map.seq_name[i]+1) #getline is reading it as 1 indexed?
        f.write(line)
        my_fasta_seqs = get_fasta_seqs(fasta_read)
        for seqline in reversed(range(seq_map.seq_start[i],seq_map.seq_end[i]+1)):
            seq = getline(fasta_read,seqline+1)
            #Repeat n = 5 (or other specified number) times until flushing ram.
The problem I am running into is that I need to read the file backwards. None of the methods I can find work well when applied to reading the file backwards. Is there something that can read a file in chunks, but backwards?
Or: anything else that can make this more efficient for a low-memory setup. Right now it uses hardly any memory, but takes 21 seconds for a 100 kB file with about 12,000 lines, whereas the file.readlines() method processes the file instantly.
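One way to read a file backwards in chunks is to seek from the end in binary mode and split each chunk on newlines. The following is only a minimal sketch: the function name reverse_lines and the chunk_size parameter are made up for illustration, and it assumes an ASCII file with '\n' (or '\r\n') line endings.

```python
import os


def reverse_lines(path, chunk_size=4096):
    """Yield the lines of a file from last to first, reading fixed-size chunks.

    Hypothetical helper: assumes ASCII content and newline-terminated lines.
    """
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        position = handle.tell()
        if position == 0:
            return  # empty file: nothing to yield
        tail = b""      # partial line carried over to the next (earlier) chunk
        at_end = True   # True only while handling the final bytes of the file
        while position > 0:
            read_size = min(chunk_size, position)
            position -= read_size
            handle.seek(position)
            chunk = handle.read(read_size) + tail
            lines = chunk.split(b"\n")
            # The first piece may be the end of a line that starts in an
            # earlier chunk, so keep it back until that chunk has been read.
            tail = lines.pop(0)
            if at_end and lines and lines[-1] == b"":
                lines.pop()  # file ended with a newline: no empty last line
            at_end = False
            for line in reversed(lines):
                yield line.rstrip(b"\r").decode("ascii")
        yield tail.rstrip(b"\r").decode("ascii")
```

Only one chunk plus one partial line is held in memory at a time, so chunk_size plays the role of "how much to buffer". Each yielded line could then be passed to the per-line reverse complement function instead of calling getline for every line number.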
Here is an example of obtaining the reverse complement of a fasta file. Perhaps you can use some ideas from this.
import re
file = """\
>Sequence_1
ATG
TATA
>Sequence_2
TATA
GACT
ATG""".splitlines()
s = ''
for line in file:
    line = line.rstrip()
    if line.startswith('>'):
        if len(s):
            # complement the sequence: 'TAGC' becomes 'ATCG'
            # T to A, A to T, G to C, C to G
            s = s.translate(str.maketrans('TAGC', 'ATCG'))
            # reverse the string, 's[::-1]'
            # Also, print up to 50 bases per line to the end of the sequence
            s = re.sub(r'(.{1,50})', r'\1\n', s[::-1])
            print(s, end='')
            s = ''
        print(line)
    else:
        s += line
# print last sequence
s = s.translate(str.maketrans('TAGC', 'ATCG'))
s = re.sub(r'(.{1,50})', r'\1\n', s[::-1])
print(s, end='')
Prints:
>Sequence_1
TATACAT
>Sequence_2
CATAGTCTATA
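For a large file, the same idea can be sketched as a generator that keeps only one record in memory at a time. The names read_fasta and reverse_complement_lines are hypothetical, and the sketch assumes plain A/C/G/T sequences (upper or lower case).

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line, []
        else:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def reverse_complement_lines(lines, width=50):
    """Yield each header followed by its reverse complement, wrapped to width."""
    for header, seq in read_fasta(lines):
        yield header
        rc = seq.translate(COMPLEMENT)[::-1]
        for start in range(0, len(rc), width):
            yield rc[start:start + width]


example = """\
>Sequence_1
ATG
TATA
>Sequence_2
TATA
GACT
ATG""".splitlines()

for out_line in reverse_complement_lines(example):
    print(out_line)
```

Because read_fasta accepts any iterable of lines, an open file object can be passed in directly, and the yielded lines can be written to the output file with a trailing '\n' each.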
This is my first encounter with Python :)
I have a question; my code is below.
I am trying to split an input file into 3 files (program info / program core / tool list).
I can match and write the first part the way I want (breaking out of the loop when a string is found in a line).
How can I tell the loop to continue from that found string and write to the second list/file?
Or how can I collect all lines between two strings into a list and write them to a file afterwards?
Thanks a lot, guys. Merry Christmas, and I will be happy for your help.
import os
filename = "D327971_fc1.i" # current program name
file = open(filename, 'r') # read current program
if os.stat(filename).st_size == 0: # check if size of file is null
    print('File is empty')
    file.close()
else:
    read = file.readlines()
    programdef = []
    toollist = []
    core = []
    line_num = -1
    for line in read:
        start_line_point = "Zacatek" in line
        end_line_point = "Konec" in line
        toollist_point = "nastroj" in line
        programdef.append(line.strip())
        if start_line_point: break
        core.append(line.strip())
        if end_line_point:
            toollist.append(line.strip())
    with open('0progdef.txt', 'w') as f:
        f.write(str(programdef))
    with open('1core.txt', 'w') as f:
        f.write(str(core))
    with open('2toollist.txt', 'w') as f:
        f.write(str(toollist))
Goal: divide the input file into 3 lists, using marker strings found in the lines, and then export these lists to 3 files.
If I understood correctly, what you want is to split the file into 3 different files: the first one includes all lines before "Zacatek", the second one includes all lines between "Zacatek" and "Konec", and the third one includes all lines between "Konec" and "nastroj".
You could change your for loop to something like:
keywords = {0:'Zacatek', 1:'Konec', 2:'nastroj'}
index = 0
for line in read:
    if index == 3:
        break
    if keywords[index] in line:
        index += 1
        continue
    if index == 0:
        programdef.append(line.strip())
    elif index == 1:
        core.append(line.strip())
    elif index == 2 :
        toollist.append(line.strip())
Combined with the existing code that writes the lists out, this will create the three expected files containing lists of the lines in the original file.
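The same marker-driven loop can be sketched as a small reusable function. The name split_sections is hypothetical, and the sketch assumes the markers appear in the file in the given order; marker lines themselves are dropped, as in the loop above.

```python
def split_sections(lines, markers):
    """Split lines into len(markers) groups.

    Group i collects the lines that come before markers[i] and after the
    previous marker. Lines after the last marker are ignored.
    """
    sections = [[] for _ in markers]
    index = 0
    for line in lines:
        if index == len(markers):
            break  # everything after the last marker is ignored
        if markers[index] in line:
            index += 1  # move on to the next section, skip the marker line
            continue
        sections[index].append(line.strip())
    return sections


sample = ["header", "N10 Zacatek", "G1 X0", "G1 X5", "Konec", "T1 D10", "nastroj", "rest"]
programdef, core, toollist = split_sections(sample, ["Zacatek", "Konec", "nastroj"])
print(programdef, core, toollist)
```

Each list can then be written with something like f.write("\n".join(core)) if one line per row is preferred over the str(list) form used in the question.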
This is my first coding class and I'm having trouble getting the counter to increase every time one of the given sequences appears in the DNA sequence.
My code so far:
agat_Counter = 0
aatg_Counter= 0
tatc_Counter= 0
DNAsample = open('DNA SEQUENCE FILE.txt', 'r');
for lines in DNAsample:
    if lines in DNAsample=='AGAT':
        agat_Counter+=1
    else:
        agat_Counter+=0
print(agat_Counter)
for lines in DNAsample:
    if lines in DNAsample=='AATG':
        aatg_Counter+=1
    else:
        aatg_Counter+=0
print(aatg_Counter)
for lines in DNAsample:
    if lines in DNAsample=='TATC':
        tatc_Counter+=0
    else:
        tatc_Counter+=0
print(tatc_Counter)
You can do this in many ways. One of the simpler ones is the following:
DNAsample = open('DNA SEQUENCE FILE.txt', 'r').read()
agat_Counter = DNAsample.count('AGAT')
aatg_Counter= DNAsample.count('AATG')
tatc_Counter= DNAsample.count('TATC')
This should work. The issue is with your if statements; in addition, once you have iterated through the file once, the file pointer is at the end, so later loops won't go back through it. The code below iterates through the file one line at a time and compares each line to the 4-character sequences. Note that .strip() removes the trailing \n and/or \r characters from the line variable as the file is iterated.
In general, when opening files it is best to use with open(filename, mode) as var:, as shown below. This handles closing the file once it is done and eliminates the risk of unclosed file handles.
Based on the original code, the assumption is that the DNA SEQUENCE FILE.txt file is organized as such:
AGAT
AATG
...
agat_Counter = 0
aatg_Counter= 0
tatc_Counter= 0
with open('DNA SEQUENCE FILE.txt', 'r') as DNAsample:
    for line in DNAsample:
        strippedLine = line.strip()
        if strippedLine == 'AGAT':
            agat_Counter += 1
        elif strippedLine == 'AATG':
            aatg_Counter += 1
        elif strippedLine == 'TATC':
            tatc_Counter += 1
print(agat_Counter)
print(aatg_Counter)
print(tatc_Counter)
I have a text file I wish to analyze. I'm trying to find every line that contains certain characters (ex: "#") and then print the line located 3 lines before it (ex: if line 5 contains "#", I would like to print line 2)
This is what I got so far:
file = open('new_file.txt', 'r')
a = list()
x = 0
for line in file:
    x = x + 1
    if '#' in line:
        a.append(x)
        continue
x = 0
for index, item in enumerate(a):
    for line in file:
        x = x + 1
        d = a[index]
        if x == d - 3:
            print line
            continue
It won't work (it prints nothing when I feed it a file that has lines containing "#"), any ideas?
First, you are going through the file multiple times without re-opening it (or rewinding it) in between. That means all subsequent attempts to iterate over the file will terminate immediately without reading anything.
Second, your indexing logic is a little convoluted. Assuming your files are not huge relative to your memory size, it is much easier to simply read the whole file into memory (as a list) and manipulate it there.
with open('new_file.txt', 'r') as myfile:
    a = myfile.readlines()
for index, item in enumerate(a):
    if '#' in item and index - 3 >= 0:
        print(a[index - 3].strip())
This has been tested on the following input:
PrintMe
PrintMe As Well
Foo
#Foo
Bar#
hello world will print
null
null
##
OK, the issue is that you have already iterated completely through the file object file in line 4 when you try again in line 11, so line 11 will be an empty loop. Maybe it would be a better idea to iterate over the file only once and remember the last few lines...
file = open('new_file.txt', 'r')
a = ["","",""]
for line in file:
    if "#" in line:
        print(a[0], end="")
    a.append(line)
    a = a[1:]
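The "remember the last few lines" idea can also be sketched with collections.deque, which drops the oldest entry automatically once maxlen is reached. The function name lines_before_marker and its parameters are made up for this illustration.

```python
from collections import deque


def lines_before_marker(lines, marker="#", distance=3):
    """Return the line found `distance` lines before each line containing marker."""
    window = deque(maxlen=distance)  # holds at most the previous `distance` lines
    found = []
    for line in lines:
        # Only report a hit once a full window of earlier lines exists.
        if marker in line and len(window) == distance:
            found.append(window[0])
        window.append(line)
    return found


sample = [
    "PrintMe",
    "PrintMe As Well",
    "Foo",
    "#Foo",
    "Bar#",
    "hello world will print",
    "null",
    "null",
    "##",
]
for hit in lines_before_marker(sample):
    print(hit)
```

The file is read only once and only `distance` lines are held at any time, so this also works for files too large to load with readlines().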
For file I/O, it is usually most efficient, in both programmer time and runtime, to use regular expressions to match patterns, in combination with iterating through the lines in the file. Your problem really isn't a problem.
import re
file = open('new_file.txt', 'r')
document = file.read()
lines = document.split("\n")
LinesOfInterest = []
for lineNumber,line in enumerate(lines):
    WhereItsAt = re.search( r'#', line)
    if(lineNumber>2 and WhereItsAt):
        LinesOfInterest.append(lineNumber-3)
print(LinesOfInterest)
for lineNumber in LinesOfInterest:
    print(lines[lineNumber])
Lines of Interest is now a list of line numbers matching your criteria
I used
line1,0
line2,0
line3,0
#
line1,1
line2,1
line3,1
#
line1,2
line2,2
line3,2
#
line1,3
line2,3
line3,3
#
as input yielding
[0, 4, 8, 12]
line1,0
line1,1
line1,2
line1,3
I am new to Python and I am trying to figure out how to read a FASTA file with multiple sequences and then create a new FASTA file containing the reverse complement of the sequences. The file will look something like:
>homo_sapiens
ACGTCAGTACGTACGTCATGACGTACGTACTGACTGACTGACTGACGTACTGACTGACTGACGTACGTACGTACGTACGTACGTACTG
>Canis_lupus
CAGTCATGCATGCATGCAGTCATGACGTCAGTCAGTACTGCATGCATGCATGCATGCATGACTGCAGTACTGACGTACTGACGTCATGCATGCAGTCATG
>Pan_troglodytus
CATGCATACTGCATGCATGCATCATGCATGCATGCATGCATGCATGCATCATGACTGCAGTCATGCAGTCAGTCATGCATGCATCAT
I am trying to learn how to use for and while loops, so a solution that incorporates one of them would be preferred.
So far I have managed to do it in a very inelegant manner, as follows:
file1 = open('/path/to/file', 'r')
for line in file1:
    if line[0] == '>':
        print line.strip() #to capture the title line
    else:
        import re
        seq = line.strip()
        line = re.sub(r'T', r'P', seq)
        seq = line
        line = re.sub(r'A',r'T', seq)
        seq = line
        line = re.sub(r'G', r'R', seq)
        seq = line
        line = re.sub(r'C', r'G', seq)
        seq = line
        line = re.sub(r'P', r'A', seq)
        seq = line
        line = re.sub(r'R', r'C', seq)
        print line[::-1]
file1.close()
This worked but I know there is a better way to iterate through that end part. Any better solutions?
I know you consider this an exercise for yourself, but in case you are interested in using existing facilities, have a look at the Biopython package. Especially if you are going to do more sequence work.
That would allow you to instantiate a sequence with e.g. seq = Seq('GATTACA'). Then, seq.reverse_complement() will give you the reverse complement.
Note that the reverse complement is more than just string reversal, the nucleotide bases need to be replaced with their complementary letter as well.
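To make the difference concrete, here is a small pure-Python sketch (no Biopython) that uses a for loop, as requested in the question. The names PAIRS and reverse_complement are chosen for illustration, and only the four upper-case DNA bases are handled.

```python
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Walk the sequence backwards and swap each base for its partner."""
    result = []
    for base in reversed(seq):
        result.append(PAIRS[base])
    return "".join(result)


print("GATTACA"[::-1])                # plain reversal: ACATTAG
print(reverse_complement("GATTACA"))  # reverse complement: TGTAATC
```

Any character outside A, C, G, T raises a KeyError here, which may or may not be the behaviour you want for ambiguous bases such as N.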
Assuming I got you right, would the code below work for you? You could just add the exchanges you want to the dictionary.
d = {'A':'T','C':'G','T':'A','G':'C'}
with open("seqs.fasta", 'r') as in_file:
    for line in in_file:
        if line != '\n': # skip empty lines
            line = line.strip() # Remove new line character (I'm working on windows)
            if line.startswith('>'):
                head = line
            else:
                print(head)
                print(''.join(d[nuc] for nuc in line[::-1]))
Output:
>homo_sapiens
CAGTACGTACGTACGTACGTACGTACGTCAGTCAGTCAGTACGTCAGTCAGTCAGTCAGTACGTACGTCATGACGTACGT
ACTGACGT
>Canis_lupus
CATGACTGCATGCATGACGTCAGTACGTCAGTACTGCAGTCATGCATGCATGCATGCATGCAGTACTGACTGACGTCATG
ACTGCATGCATGCATGACTG
>Pan_troglodytus
ATGATGCATGCATGACTGACTGCATGACTGCAGTCATGATGCATGCATGCATGCATGCATGCATGATGCATGCATGCAGT
ATGCATG
Here is a simple example of a string reversal.
Python Code
string = input("Enter a string:")
reverse_string = ""
print("our string is %s" % string)
print("our range will be %s\n" % list(range(0, len(string))))
for num in range(0, len(string)):
    offset = len(string) - 1
    reverse_string += string[offset - num]
    print("the num is currently: %d" % num)
    print("the offset is currently: %d" % offset)
    print("the index is currently: %d" % int(offset - num))
    print("the new string is currently: %s" % reverse_string)
    print("-------------------------------")
print("\nOur reverse string is: %s" % reverse_string)
Added print commands to show you what is happening in the script.
Run it in python and see what happens.
Usually, to iterate over lines in a text file you use a for loop, because "open" returns a file object which is iterable
>>> f = open('workfile', 'w')
>>> print f
<open file 'workfile', mode 'w' at 80a0960>
There is more about this here
You can also use context manager "with" to open a file. This key statement will close the file object for you, so you will never forget it.
I decided not to include a "for line in f:" statement because you have to read several lines to process one sequence (title, sequence and blank line). If you try to use a for loop with "readline()" you will end up with a ValueError (try :)
So I would use str.translate. This script opens a file named "test" with your example in it:
if __name__ == "__main__":
    file_name = "test"
    # map each base directly to its complement: T->A, A->T, G->C, C->G
    translator = str.maketrans("TAGC", "ATCG")
    with open(file_name, "r") as f:
        while True:
            title = f.readline().strip()
            if not title: # end of file
                break
            rev_seq = f.readline().strip().translate(translator)[::-1]
            f.readline() # blank line
            print(title)
            print(rev_seq)
Output (with your example):
>homo_sapiens
CAGTACGTACGTACGTACGTACGTACGTCAGTCAGTCAGTACGTCAGTCAGTCAGTCAGTACGTACGTCATGACGTACGTACTGACGT
>Canis_lupus
CATGACTGCATGCATGACGTCAGTACGTCAGTACTGCAGTCATGCATGCATGCATGCATGCAGTACTGACTGACGTCATGACTGCATGCATGCATGACTG
>Pan_troglodytus
ATGATGCATGCATGACTGACTGCATGACTGCAGTCATGATGCATGCATGCATGCATGCATGCATGATGCATGCATGCAGTATGCATG
I'm trying to learn Python and I'm working through a problem from a book, but I'm stuck on one question. It asks me to read a file in which each line contains an 'a' or an 's', starting from a total of 500. If the line contains an 'a', the amount next to it is added to the total (for example, "a 20" adds 20), and for 's' the amount is subtracted. In the end I'm supposed to return the total after all the changes have been made. So far I have:
def NumFile(file):
    infile = open(file,'r')
    content = infile.readlines()
    infile.close()
    add = ('a','A')
    subtract = ('s','S')
After that I'm completely lost as to how to start this.
You need to iterate over the lines of the file. Here is a skeleton implementation:
# ...
with open(filename) as f:
    for line in f:
        tok = line.split()
        op = tok[0]
        qty = int(tok[1])
        # ...
# ...
This places every operation and quantity into op and qty respectively.
I leave it to you to fill in the blanks (# ...).
A variation might be:
f = open('myfile.txt','r')
lines = f.readlines()
for i in lines:
    i = i.strip() # removes new line characters
    i = i.split() # splits a string by spaces and stores as a list
    key = i[0] # an 'a' or an 's'
    val = int( i[1] ) # an integer, which you can now add to some other variable
Try adding print statements to see what's going on. The cool thing about Python is that you can chain multiple calls in a single line. Here is equivalent code:
for i in open('myfile.txt','r').readlines():
    i = i.strip().split()
    key = i[0]
    val = int (i[1])
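Putting the pieces together, one possible way to fill in the blanks is sketched below. The function name apply_operations is hypothetical; the sketch assumes every non-blank line has the form "a 20" or "s 5" and treats the letters case-insensitively, as the add and subtract tuples in the question suggest.

```python
def apply_operations(lines, total=500):
    """Add the amount for 'a' lines and subtract it for 's' lines."""
    for line in lines:
        tokens = line.split()
        if len(tokens) != 2:
            continue  # skip blank or malformed lines
        op = tokens[0].lower()
        qty = int(tokens[1])
        if op == "a":
            total += qty
        elif op == "s":
            total -= qty
    return total


print(apply_operations(["a 20\n", "s 5\n", "A 100\n"]))  # 500 + 20 - 5 + 100 = 615
```

Since an open file is an iterable of lines, the same function could be called as apply_operations(f) inside a with open(filename) as f: block.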