Extract specific fasta sequences from a big fasta file - python

I want to extract specific fasta sequences from a big fasta file using the following script, but the output is empty.
The transcripts.txt file contains the list of transcript IDs that I want to export (both the IDs and the sequences) from assembly.fasta to selected_transcripts.fasta.
For example:
transcripts.txt:
Transcript_00004|5601
Transcript_00005|5352
assembly.fasta:
>Transcript_00004|5601
GATCTGGCGCTGAGCTGGGTGCTGATCGACCCGGCGTCCGGCCGCTCCGTGAACGCCTCGAGTCGGCGCCCGGTGTGCGTTGACCGGAGATCGCGATCTGGGGAGACCGTCGTGCGGTT
>Transcript_00004|5360
CGATCTGGCGCTGAGCTGGGTGCTGATCGACCCGGCGTCCGGCCGCTCCGTGAACGCCTCGAGTCGGCGCCCGGTGTGCGTTGACCGGAGATCGCGATCTGGGGAGACCGTCGTGCGGTT
The IDs are preceded by the > symbol: >Transcript_00004|5601.
I have to read the assembly.fasta file; if a transcript ID in assembly.fasta is the same as one written in transcripts.txt, I have to write that transcript ID and its sequence to selected_transcripts.fasta. So, in the example above, I have to write only the first transcript.
Any suggestions?
Thanks.
from Bio import SeqIO
my_list = [line.split(',') for line in open("/home/universita/transcripts.txt")]
fin = open('/home/universita/assembly.fasta', 'r')
fout = open('/home/universita/selected_transcripts.fasta', 'w')
for record in SeqIO.parse(fin, 'fasta'):
    for item in my_list:
        if item == record.id:
            fout.write(">" + record.id + "\n")
            fout.write(record.seq + "\n")
fin.close()
fout.close()

Based on your examples there are a couple of small problems which may explain why you don't get anything. Your transcripts.txt has multiple entries in one line, therefore my_list will have all the items of the first line in my_list[0]; in your loop you iterate through my_list by lines, so your first item will be
['Transcript_00004|5601', 'Transcript_00005|5352']
Also, if assembly.fasta has no > in the header lines you won't get back any records with IDs and sequences. The following code should take care of those problems, assuming you added > to the headers; the split function now uses a space and not a comma.
from Bio import SeqIO
my_list = []
with open("transcripts.txt") as transcripts:
    for line in transcripts:
        my_list.extend(line.split(' '))
fin = open('assembly.fasta', 'r')
fout = open('selected_transcripts.fasta', 'w')
for record in SeqIO.parse(fin, 'fasta'):
    for item in my_list:
        if item.strip() == record.id:
            fout.write(">" + record.id + "\n")
            fout.write(str(record.seq) + "\n")  # str() because file.write() needs a string, not a Seq object
fin.close()
fout.close()
Reading of the transcripts was changed so that all IDs are appended to my_list separately. Each item is also stripped of whitespace, to avoid line breaks in the string when it is compared to record.id.
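As a side note, membership tests get much faster with a set than with a list when there are many IDs, and the selection itself doesn't strictly need Biopython. A minimal sketch of that idea, using io.StringIO with made-up inline data standing in for the real transcripts.txt and assembly.fasta:

```python
import io

# Inline stand-ins for transcripts.txt and assembly.fasta (hypothetical data)
transcripts = io.StringIO("Transcript_00004|5601\nTranscript_00005|5352\n")
assembly = io.StringIO(
    ">Transcript_00004|5601\nGATCTGGCG\n"
    ">Transcript_00004|5360\nCGATCTGGC\n"
)

wanted = {line.strip() for line in transcripts if line.strip()}  # set: O(1) lookups

selected = []                    # (header, sequence) pairs that match
header, seq_parts = None, []
for line in assembly:
    line = line.rstrip("\n")
    if line.startswith(">"):
        if header in wanted:     # flush the previous record if it was wanted
            selected.append((header, "".join(seq_parts)))
        header, seq_parts = line[1:], []
    else:
        seq_parts.append(line)
if header in wanted:             # flush the last record
    selected.append((header, "".join(seq_parts)))

print(selected)  # [('Transcript_00004|5601', 'GATCTGGCG')]
```

With real files you would replace the StringIO objects with open() calls; the matching logic stays the same.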


How to append current lines to the previous line when it starts with 'id'

My problem is the following:
I have one text file containing more than 1000 rows, and I want to read it line by line.
I am trying this code, but I am not getting the expected output.
my source file:
uuid;UserGroup;Name;Description;Owner;Visibility;Members ----> header of the file
id:;group1;raji;xyzabc;ramya;public;
abc
def
geh
id:group2;raji;rtyui;ramya;private
cvb
nmh
poi
import csv
output = []
temp = []
fo = open('usergroups.csv', 'r')
for line in fo:
    #next(uuid)
    line = line.strip()
    if not line:
        continue  # ignore empty lines
    #temp.append(line)
    if not line.startswith('id:') and not None:
        temp.append(line)
        print(line)
    else:
        if temp:
            line += ";" + ",".join(temp)
            temp.clear()
        output.append(line)
print("\n".join(output))
with open('new.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerows(output)
I am getting this output:
id;group1;raji;xyzabc;ramya;public;uuid;UserGroup;Name;Description;Owner;Visibility;Members
id:group2;raji;rtyui;ramya;private;abc,def,geh
So whenever a line does not start with 'id' it should be appended to the previous line.
my desired output:
uuid;UserGroup;Name;Description;Owner;Visibility;Members ----> header of the file
id;group1;raji;xyzabc;ramya;public;abc,def,geh
id:group2;raji;rtyui;ramya;private;cvb,nmh,poi
There are a few mistakes. I'll only show the relevant corrections:
Use
if not line.startswith('id'):
No 'id:', since you also have a line starting with 'id;', plus you state yourself that a line has to start with "id" (no ":" there). The "and not None" part is unnecessary, because it's always true.
The other part:
output.append(line.split(';'))
because writerows needs an iterable (list) of "row" objects, and a row object is a list of strings. So you need a list of lists, which the above is, thanks to the extra split.
(Of course, now the line print("\n".join(output)) fails, but writer.writerows(output) works.)
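To make the difference concrete, here is a tiny illustration (with made-up row data) of what csv.writer does with a plain string versus a list of fields:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)

# Passing a plain string: csv.writer iterates it, so each CHARACTER becomes a field.
writer.writerow("id;group1")
# Passing a list of fields: one well-formed row.
writer.writerow("id;group1;raji".split(';'))

print(buf.getvalue())
```

The first row comes out as i,d,;,g,r,o,u,p,1 while the second is id,group1,raji, which is why the split before appending to output matters.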
I don't know if it will help you, but with regex this problem is solved in a very simple way. I'll leave the code here in case you are interested.
import re  # the standard re module is sufficient here
input_text = """uuid;UserGroup;Name;Description;Owner;Visibility;Members ----> header of the file
id;group1;raji;xyzabc;ramya;public;
abc
def
geh
id:group2;raji;rtyui;ramya;private
cvb
nmh
poi"""
formatted = re.sub(r"\n(?!(id|\n))", "", input_text)
print(formatted)
uuid;UserGroup;Name;Description;Owner;Visibility;Members ----> header of the file
id;group1;raji;xyzabc;ramya;public;abcdefgeh
id:group2;raji;rtyui;ramya;privatecvbnmhpoi
This code just replaces the regular expression \n(?!(id|\n)) with the empty string: every line break that is not followed by "id" or by another line break is removed (so a blank line between two id blocks would be kept).
Writing to a file has not been included here, but the formatted string is available to work with, as in your original code.
Note: this is not really an answer to your question, as it is a solution to your problem
The structure is by and large the same, with a few changes for readability.
Readable code is easier to get right
import csv
output = []
temp = []
currIdLine = ""
with open('usergroups.csv', 'r') as f:
    for dirtyline in f.readlines():
        line = dirtyline.strip()
        if not line:
            print("Skipping empty line")
            continue
        if line.startswith('uuid'):  # append the header to the output
            output.append(line)
            continue
        if line.startswith('id'):
            if temp:
                print(temp)
                output.append(currIdLine + ";" + ','.join(temp))  # based on current input, there is a bug here where the output will contain two sequential ';' characters
                temp.clear()
            currIdLine = line
        else:
            temp.append(line)
    output.append(currIdLine + ";" + ','.join(temp))
print(output)
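The double-';' issue flagged in the code comment can be avoided by stripping any trailing ';' from the id line before joining. A small sketch of that fix (the helper name is made up):

```python
# Sketch: join an id line with its collected member lines, avoiding ';;'
def join_members(id_line, members):
    base = id_line.rstrip(';')      # drop a trailing ';' if present
    return base + ';' + ','.join(members)

print(join_members('id;group1;raji;xyzabc;ramya;public;', ['abc', 'def', 'geh']))
# id;group1;raji;xyzabc;ramya;public;abc,def,geh
```

This produces exactly the desired output line from the question regardless of whether the id line already ends in ';'.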

nested file read doesn't loop through all of the primary loop

I have two files.
One file has two columns (let's call it db), and the other has one column (let's call it in).
The second column of db has the same type as the column in in, and both files are sorted by this column.
db for example:
RPL24P3 NG_002525
RPLP1P1 NG_002526
RPL26P4 NG_002527
VN2R11P NG_006060
VN2R12P NG_006061
VN2R13P NG_006062
VN2R14P NG_006063
in for example:
NG_002527
NG_006062
I want to read through these files and get the output as follows:
NG_002527: RPL26P4
NG_006062: VN2R13P
Meaning that I'm iterating on in lines and trying to find the matching line in db.
The code I have written for that is:
with open(db_file, 'r') as db, open(sortIn, 'r') as inF, open(out_file, 'w') as outF:
    for line in inF:
        for dbline in db:
            if len(dbline) > 1:
                dbline = dbline.split('\t')
                if line.rstrip('\n') == dbline[db_specifications[0]]:
                    outF.write(dbline[db_specifications[0]] + ': ' + dbline[db_specifications[1]] + '\n')
                    break
*db_specifications isn't relevant for this problem, hence I didn't copy the related code.
The current code finds a match and writes it as planned for the first line in in, but won't find any matches for the remaining lines. I suspect it has to do with break, but I can't figure out what to change.
Since the data in the db_file is sorted by second column, you can use this code to read the file.
with open("xyz.txt", "r") as db_file, open("abc.txt", "r") as sortIn, open("out.txt", 'w') as outF:
    # first read the sortIn file as a list
    i_list = [line.strip() for line in sortIn.readlines()]
    # for each record read from the file, split the values into key and value
    for line in db_file:
        t_key, t_val = line.strip().split(' ')
        # if the value is in i_list, then write it to the output file
        if t_val in i_list:
            outF.write(t_val + ': ' + t_key + '\n')
        # if the value has reached the max value in the sort list,
        # then you don't need to read the db_file anymore
        if t_val == i_list[-1]:
            break
The output file will have the following items:
NG_002527: RPL26P4
NG_006062: VN2R13P
In the above code, we read the sortIn list first, then read each line of the db_file. i_list[-1] holds the max value of the sortIn file, since the sortIn file is also sorted in ascending order.
The above code will perform fewer I/O operations than the one below.
===========
previous answer submission:
Based on how the data has been stored in the db_file, it looks like we have to read the entire file to check against the sortIn file. If the values in the db_file was sorted by the second column, we could have stopped reading the file once the last item in sortIn was found.
With the assumption that we need to read all records from the files, see if the below code works for you.
with open("xyz.txt", "r") as db_file, open("abc.txt", "r") as sortIn, open("out.txt", 'w') as outF:
    # read the db_file and convert it into a dictionary
    d_list = dict([line.strip().split(' ') for line in db_file.readlines()])
    # read the sortIn file as a list
    i_list = [line.strip() for line in sortIn.readlines()]
    # check if the value of each entry in d_list is one of the items in i_list
    out_list = [v + ': ' + k for k, v in d_list.items() if v in i_list]
    # out_list is your final list that needs to be written into a file
    # now read out_list and write each item into the file
    for i in out_list:
        outF.write(i + '\n')
The output file will have the following items:
NG_002527: RPL26P4
NG_006062: VN2R13P
To help you, I have also printed the contents of d_list, i_list, and out_list.
The contents in d_list will look like this:
{'RPL24P3': 'NG_002525', 'RPLP1P1': 'NG_002526', 'RPL26P4': 'NG_002527', 'VN2R11P': 'NG_006060', 'VN2R12P': 'NG_006061', 'VN2R13P': 'NG_006062', 'VN2R14P': 'NG_006063'}
The contents in i_list will look like this:
['NG_002527', 'NG_006062']
The contents that get written into the outF file from out_list will look like this:
['NG_002527: RPL26P4', 'NG_006062: VN2R13P']
I was able to solve the problem by inserting the following line:
line = next(inF)
before the break statement.
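The root cause, for the record, is that db is a file iterator: once the inner loop has consumed it (partly or fully), the next outer iteration resumes where the previous one stopped, so later queries only see the remaining lines. Rewinding with db.seek(0) before each query is the minimal fix. A self-contained sketch, with io.StringIO (which also supports seek) standing in for the files and tab-separated data matching the question's split('\t'):

```python
import io

# Stand-ins for db_file and sortIn (made-up, tab-separated as in the question)
db = io.StringIO("RPL26P4\tNG_002527\nVN2R13P\tNG_006062\n")
inF = io.StringIO("NG_002527\nNG_006062\n")

results = []
for line in inF:
    db.seek(0)                       # rewind the db "file" for every query
    target = line.rstrip('\n')
    for dbline in db:
        name, acc = dbline.rstrip('\n').split('\t')
        if target == acc:
            results.append(acc + ': ' + name)
            break

print(results)  # ['NG_002527: RPL26P4', 'NG_006062: VN2R13P']
```

Note that this rescans db for every query, so for large sorted files the single-pass approach in the accepted answer is preferable.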

Why is for loop iterated only once?

I'm rather new to coding, but as I have to write a few letters I wanted a script that changes the name within each letter automatically.
I've got a text file with placeholders for the name and a CSV file where the names are stored in the following format:
Surname;Firstname
Doe;John
Norris;Chuck
...
Now I've conjured up this script:
import csv
import re
letterPATH = "Brief.txt"
tablePATH = "Liste.csv"
with open(letterPATH, "r") as letter, open(tablePATH, "r") as table:
    table = csv.reader(table, delimiter=";")
    rows = list(table)
    rows = rows[1::]
    print(rows)

    for (surname, firstname) in rows:
        #Check if first- and surname have correct output
        #print(firstname)
        #print(surname)
        for lines in letter:
            new_content = ""
            print(lines)
            lines = re.sub(r"\<Nachname\>", surname, lines)
            print(lines)
            lines = re.sub(r"\<Vorname\>", firstname, lines)
            print(lines)
            new_content += lines
        with open(surname + firstname + ".txt", "w") as new_letter:
            new_letter.writelines(new_content)
I've got the following problem now:
A text file is created for each entry, as it should be (JohnDoe.txt, ChuckNorris.txt and so on); however, only the first file has the correct content, while the others are empty.
While debugging I've seen that the for loop in line 18 is only iterated once, while the with statement is iterated multiple times as it should.
I simply don't understand why the for loop isn't iterating.
Cheers and thanks for your help! :)
letter is a file. A file keeps track of how much you've read and where the next read should be. So if you've read two lines, then the next read will be on the third line, and so on.
Since you read through the whole file the first time, the following iterations won't read any more lines, because you've already read them all.
The solution could be to reset the file pointer (the thing pointing to where in the file you've currently read to) to the beginning with the letter.seek(0) method. Or you could simply store the file content in a list and iterate over the list:
import csv
import re

letterPATH = "Brief.txt"
tablePATH = "Liste.csv"

with open(letterPATH, "r") as letter_file, open(tablePATH, "r") as table:
    table = csv.reader(table, delimiter=";")
    letter = list(letter_file)  # Add all content to a list instead.
    rows = list(table)
    rows = rows[1::]
    print(rows)

    for (surname, firstname) in rows:
        #Check if first- and surname have correct output
        #print(firstname)
        #print(surname)
        new_content = ""  # initialize once per person, BEFORE the loop over lines
        for lines in letter:
            print(lines)
            lines = re.sub(r"\<Nachname\>", surname, lines)
            print(lines)
            lines = re.sub(r"\<Vorname\>", firstname, lines)
            print(lines)
            new_content += lines
        with open(surname + firstname + ".txt", "w") as new_letter:
            new_letter.writelines(new_content)
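If you'd rather not hold the whole letter in a list, rewinding the file handle with seek(0) at the top of each outer iteration also works. A minimal sketch, with io.StringIO standing in for Brief.txt and made-up names:

```python
import io

# Stand-in for the letter template file (hypothetical content)
letter = io.StringIO("Dear <Vorname> <Nachname>,\nBest regards\n")
rows = [("Doe", "John"), ("Norris", "Chuck")]

letters = {}
for surname, firstname in rows:
    letter.seek(0)               # rewind before re-reading the template
    content = ""
    for line in letter:
        content += line.replace("<Nachname>", surname).replace("<Vorname>", firstname)
    letters[surname + firstname] = content

print(letters["NorrisChuck"])
# Dear Chuck Norris,
# Best regards
```

With a real file handle the seek(0) call behaves the same way; only the open() differs.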

Changing the contents of a text file and making a new file with same format

I have a big text file with a lot of parts. Every part has 4 lines, and the next part starts immediately after the last one.
The first line of each part starts with #, the 2nd line is a sequence of characters, the 3rd line is a +, and the 4th line is again a sequence of characters.
Small example:
#M00872:462:000000000-D47VR:1:1101:15294:1338 1:N:0:ACATCG
TGCTCGGTGTATGTAAACTTCCGACTTCAACTGTATAGGGATCCAATTTTGACAAAATATTAACGCTTATCGATAAAATTTTGAATTTTGTAACTTGTTTTTGTAATTCTTTAGTTTGTATGTCTGTTGCTATTATGTCTACTATTCTTTCCCCTGCACTGTACCCCCCAATCCCCCCTTTTCTTTTAAAAGTTAACCGATACCGTCGAGATCCGTTCACTAATCGAACGGATCTGTCTCTGTCTCTCTC
+
BAABBADBBBFFGGGGGGGGGGGGGGGHHGHHGH55FB3A3GGH3ADG5FAAFEGHHFFEFHD5AEG1EF511F1?GFH3#BFADGD55F?#GFHFGGFCGG/GHGHHHHHHHDBG4E?FB?BGHHHHHHHHHHHHHHHHHFHHHHHHHHHGHGHGHHHHHFHHHHHGGGGHHHHGGGGHHHHHHHGHGHHHHHHFGHCFGGGHGGGGGGGGFGGEGBFGGGGGGGGGFGGGGFFB9/BFFFFFFFFFF/
I want to change the 2nd and the 4th line of each part and make a new file with a similar structure (4 lines for each part). In fact I want to keep the first 65 characters (in lines 2 and 4) and remove the rest of the characters. The expected output for the small example would look like this:
#M00872:462:000000000-D47VR:1:1101:15294:1338 1:N:0:ACATCG
TGCTCGGTGTATGTAAACTTCCGACTTCAACTGTATAGGGATCCAATTTTGACAAAATATTAACG
+
BAABBADBBBFFGGGGGGGGGGGGGGGHHGHHGH55FB3A3GGH3ADG5FAAFEGHHFFEFHD5A
I wrote the following code:
infile = open("file.fastq", "r")
new_line = []
for line_number in len(infile.readlines()):
    if line_number == 2 or line_number == 4:
        new_line.append(infile[line_number])
with open('out_file.fastq', 'w') as f:
    for item in new_line:
        f.write("%s\n" % item)
but it does not return what I want. How to fix it to get the expected output?
This code will achieve what you want -
from itertools import islice

with open('bio.txt', 'r') as infile:
    while True:
        lines_gen = list(islice(infile, 4))
        if not lines_gen:
            break
        a, b, c, d = lines_gen
        b = b[0:65] + '\n'
        d = d[0:65] + '\n'
        with open('mod_bio.txt', 'a+') as f:
            f.write(a + b + c + d)
How it works?
We first make a generator that yields 4 lines at a time, as you mention.
Then we unpack those 4 lines into the individual variables a, b, c, d and perform string slicing on the 2nd and 4th. Eventually we join the strings and write them to a new file.
I think some itertools.cycle could be nice here:
import itertools

with open("transformed.file.fastq", "w+") as output_file:
    with open("file.fastq", "r") as input_file:
        for i in itertools.cycle((1, 2, 3, 4)):
            line = input_file.readline().strip()
            if not line:
                break
            if i in (2, 4):
                line = line[:65]
            output_file.write("{}\n".format(line))
readlines() returns a list of the lines in your file. You don't need to prepare the list new_line. Iterate directly over the index-value pairs of the list; then you can modify the values at the desired positions.
Modifying your code, try this:
infile = open("file.fastq", "r")
new_lines = infile.readlines()
for i, t in enumerate(new_lines):
    if i % 4 == 1 or i % 4 == 3:  # % 4 so every part is handled, not just the first
        new_lines[i] = new_lines[i][:65] + '\n'  # re-add the newline lost by slicing
with open('out_file.fastq', 'w') as f:
    for item in new_lines:
        f.write("%s" % item)
infile.close()
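Yet another way to group a file into fixed-size 4-line records is to zip the same iterator with itself four times. A sketch on inline stand-in data (KEEP is 4 here instead of 65 so the toy lines fit; io.StringIO stands in for the real file):

```python
import io

# Stand-in for file.fastq: two 4-line parts (made-up data)
fastq = io.StringIO(
    "#read1\nAAAACCCCGGGGTTTT\n+\nIIIIIIIIIIIIIIII\n"
    "#read2\nTTTTGGGGCCCCAAAA\n+\nJJJJJJJJJJJJJJJJ\n"
)

KEEP = 4  # the question keeps 65 characters; 4 here so the toy data fits

out = []
it = iter(fastq)                         # one iterator, consumed 4 lines at a time
for header, seq, plus, qual in zip(it, it, it, it):
    out.append(header.rstrip('\n'))
    out.append(seq.rstrip('\n')[:KEEP])  # truncate the sequence line
    out.append(plus.rstrip('\n'))
    out.append(qual.rstrip('\n')[:KEEP])  # truncate the quality line

print('\n'.join(out))
```

Because all four arguments to zip are the same iterator, each tuple advances the file by exactly four lines; an incomplete trailing part would be silently dropped, which may or may not be what you want.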

Adding each item in list to end of specific lines in FASTA file

I solved this in the comments below.
So essentially what I am trying to do is add each element of a list of strings to the end of specific lines in a different file.
Hard to explain, but essentially I want to parse a FASTA file, and every time it reaches a header (line.startswith('>')) I want it to replace parts of that header with an element from a list I've already made.
For example:
File1:
>seq1 unwanted here
AATATTATA
ATATATATA
>seq2 unwanted stuff here
GTGTGTGTG
GTGTGTGTG
>seq3 more stuff I don't want
ACACACACAC
ACACACACAC
I want it to keep ">seq#" but replace everything after with the next item in the list below:
List:
mylist = ['things1', '', 'things3', 'things4', '', 'things6', 'things7']
Result (modified file1):
>seq1 things1
AATATTATA
ATATATATA
>seq2 # adds nothing here due to mylist[1] = ''
GTGTGTGTG
GTGTGTGTG
>seq3 things3
ACACACACAC
ACACACACAC
As you can see I want it to add even the blank items in the list.
So once again, I want it to parse this FASTA file, and every time it gets to a header (there are thousands), I want it to replace everything after the first word with the next item in the separate list I have made.
What you have will work, but there are a few unnecessary lines, so I've edited it down a bit. Also, an important note: you don't close your file handles. This could result in errors, specifically when writing to a file; either way it's bad practice. Code:
#!/usr/bin/python
import sys

# gets list of annotations
def get_annos(infile):
    with open(infile, 'r') as fh:  # makes sure the file is closed properly
        annos = []
        for line in fh:
            annos.append(line.split('\t')[5])  # tab as separator
    return annos

# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
    annos = get_annos(infile1)  # contains list of annos
    with open(infile2, 'r') as f2, open(outfile, 'w') as output:
        for line in f2:
            if line.startswith('>'):
                line_split = [line.split()[0]]  # a one-item list holding the id; note: NOT list(...), which would split the string into characters
                line_split.append(annos.pop(0))  # append data of interest to current id line
                output.write(' '.join(line_split) + '\n')  # join and write to file with a newline character
            else:
                output.write(line)

anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]

add_annos(anno, seq, out)
This is not perfect, but it cleans things up a bit. I might veer away from using pop() to associate the annotation data with the sequence IDs unless you are certain the files are in the same order every time.
There is a great library in Python for FASTA and other DNA file parsing. It is very helpful in bioinformatics, and you can manipulate the data however you need.
Here is a simple example taken from the library's website:
from Bio import SeqIO

for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))
You should get something like this on your screen:
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592
***********EDIT*********
I solved this before anyone could help. This is my code; can anyone tell me if I have any bad practices? Is there a way to do this without writing everything to a new file? It seems like it would take a long time / use a lot of memory.
#!/usr/bin/python
# Script takes an unedited FASTA file, removes seq length and
# other header info, adds annotation after sequence name
# run as: $ python addanno.py testanno.out testseq.fasta out.txt
import sys

# gets list of annotations
def get_annos(infile):
    f = open(infile)
    list2 = []
    for line in f:
        columns = line.strip().split('\t')
        list2.append(columns[5])
    f.close()
    return list2

# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
    mylist = get_annos(infile1)  # contains list of annos
    f2 = open(infile2, 'r')
    output = open(outfile, 'w')  # use the outfile parameter, not the global 'out'
    for line in f2:
        if line.startswith('>'):
            l = line.partition(" ")
            list3 = list(l)
            del list3[1:]
            list3.append(' ')
            list3.append(mylist.pop(0))
            final = ''.join(list3)
            line = line.replace(line, final)
            output.write(line)
            output.write('\n')
        else:
            output.write(line)
    f2.close()
    output.close()

anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]

add_annos(anno, seq, out)
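For what it's worth, the header rewrite can also be done without pop() by consuming the annotations through an iterator as headers are encountered. A minimal sketch on inline stand-in data (the sequences and annotations are made up, and an empty annotation leaves just the bare id, matching the desired output above):

```python
import io

# Stand-in for the FASTA file (hypothetical data)
fasta = io.StringIO(
    ">seq1 unwanted here\nAATATTATA\n"
    ">seq2 unwanted stuff here\nGTGTGTGTG\n"
)
annos = iter(['things1', ''])   # one annotation per header, in file order

out_lines = []
for line in fasta:
    line = line.rstrip('\n')
    if line.startswith('>'):
        seq_id = line.split()[0]           # keep only '>seqN'
        anno = next(annos)                 # matching annotation, by position
        out_lines.append((seq_id + ' ' + anno).rstrip())  # rstrip drops the space when anno is ''
    else:
        out_lines.append(line)

print('\n'.join(out_lines))
```

Like pop(), this relies on the annotation file and the FASTA file being in the same order; if they aren't, a dict keyed by sequence id would be safer.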
