I'm trying to tackle a problem on Rosalind where, given a FASTA file of at most 10 sequences of at most 1 kbp each, I need to return the consensus sequence and profile (a count of each base at each position across all the sequences). My code works for small sequences (verified), but I run into trouble formatting the output for large sequences.
What I expect the output to be, regardless of length, is:

consensus sequence
A: one-line string of numbers without commas
C: one-line string of numbers without commas
G: one-line string of numbers without commas
T: one-line string of numbers without commas

all on their own respective lines and aligned with each other, or at least in some format that keeps this block together as a unit so the alignment is preserved.
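For instance, with three made-up length-6 sequences ATGCAT, ATGGAT and ACGCTT (a toy example, not the Rosalind sample), I would want:

ATGCAT
A: 3 0 0 0 2 0
C: 0 1 0 2 0 0
G: 0 0 3 1 0 0
T: 0 2 0 0 1 3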
But when I run my code on a large sequence, each profile string below the consensus sequence gets broken up by newlines, presumably because the string itself is too long. I've been struggling to think of ways around this, and my searches have been fruitless. I'm considering some iterative writing routine that writes the expected output above in chunks. Any help would be greatly appreciated. I have attached my code in its entirety below for completeness, with block comments as needed; the main section of concern is the file-writing part at the end.
def cons(file):
    #returns consensus sequence and profile of a FASTA file
    import os
    path = os.path.abspath(os.path.expanduser(file))
    with open(path,"r") as D:
        F=D.readlines()
    #initialize list of sequences, list of all strings, and a temporary storage
    #list, respectively
    SEQS=[]
    mystrings=[]
    temp_seq=[]
    #get a list of strings from the file, stripping the newline character
    for x in F:
        mystrings.append(x.strip("\n"))
    #if the string in question is a nucleotide sequence (without ">")
    #i'll store that string into a temporary variable until I run into a string
    #with a ">", in which case I'll join all the strings in my temporary
    #sequence list and append to my list of sequences SEQS
    for i in range(1,len(mystrings)):
        if ">" not in mystrings[i]:
            temp_seq.append(mystrings[i])
        else:
            SEQS.append(("").join(temp_seq))
            temp_seq=[]
    SEQS.append(("").join(temp_seq))
    #set up list of nucleotide counts for A,C,G and T, in that order
    ACGT= [[0 for i in range(0,len(SEQS[0]))],
           [0 for i in range(0,len(SEQS[0]))],
           [0 for i in range(0,len(SEQS[0]))],
           [0 for i in range(0,len(SEQS[0]))]]
    #assumed to be equal length sequences. Counting amount of shared nucleotides
    #in each column
    for i in range(0,len(SEQS[0])-1):
        for j in range(0, len(SEQS)):
            if SEQS[j][i]=="A":
                ACGT[0][i]+=1
            elif SEQS[j][i]=="C":
                ACGT[1][i]+=1
            elif SEQS[j][i]=="G":
                ACGT[2][i]+=1
            elif SEQS[j][i]=="T":
                ACGT[3][i]+=1
    ancstr=""
    TR_ACGT=list(zip(*ACGT))
    acgt=["A: ","C: ","G: ","T: "]
    for i in range(0,len(TR_ACGT)-1):
        comp=TR_ACGT[i]
        if comp.index(max(comp))==0:
            ancstr+=("A")
        elif comp.index(max(comp))==1:
            ancstr+=("C")
        elif comp.index(max(comp))==2:
            ancstr+=("G")
        elif comp.index(max(comp))==3:
            ancstr+=("T")
    '''
    writing to file... trying to get it to write as
    consensus sequence
    A: blah(1line)
    C: blah(1line)
    G: blah(1line)
    T: blah(1line)
    which works for small sequences. but for larger sequences
    python keeps adding newlines if the string in question is very long...
    '''
    myfile="myconsensus.txt"
    writing_strings=[acgt[i]+' '.join(str(n) for n in ACGT[i] for i in range(0,len(ACGT))) for i in range(0,len(acgt))]
    with open(myfile,'w') as D:
        D.writelines(ancstr)
        D.writelines("\n")
        for i in range(0,len(writing_strings)):
            D.writelines(writing_strings[i])
            D.writelines("\n")

cons("rosalind_cons.txt")
Your code is totally fine except for this line:
writing_strings=[acgt[i]+' '.join(str(n) for n in ACGT[i] for i in range(0,len(ACGT))) for i in range(0,len(acgt))]
You accidentally replicate your data. Try replacing it with:
writing_strings=[acgt[i]+str(ACGT[i])[1:-1] for i in range(0,len(ACGT))]
and then write it to your output file as follows:
D.write(writing_strings[i])
The str(...)[1:-1] slice is a lazy way to get rid of the brackets from the list (the counts will still be comma-separated this way).
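If you'd rather have space-separated counts with no commas or brackets at all, a minimal sketch of the whole writing block (same variable names as your code) could look like this:

myfile = "myconsensus.txt"
with open(myfile, 'w') as D:
    D.write(ancstr + "\n")
    for label, counts in zip(acgt, ACGT):
        # ' '.join keeps each profile line as one long physical line;
        # write() itself never inserts newlines, however long the string is
        D.write(label + ' '.join(str(n) for n in counts) + "\n")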
Given the following input, we would like to keep the rows whose "APP_ID" column (the 4th column) is the same and whose "Category" column (the 18th column) is one "Cell" and one "Biochemical", or one "Cell" and one "Enzyme".
A , APPID , C , APP_ID , D , E , F , G , H , I , J , K , L , M , O , P , Q , Category , S , T
,,, APP-1 ,,,,,,,,,,,,,, Cell ,,
,,, APP-1 ,,,,,,,,,,,,,, Enzyme ,,
,,, APP-2 ,,,,,,,,,,,,,, Cell ,,
,,, APP-3 ,,,,,,,,,,,,,, Cell ,,
,,, APP-3 ,,,,,,,,,,,,,, Biochemical ,,
The ideal output will be
A , APPID , C , APP_ID , D , E , F , G , H , I , J , K , L , M , O , P , Q , Category , S , T
,,, APP-1 ,,,,,,,,,,,,,, Enzyme ,,
,,, APP-3 ,,,,,,,,,,,,,, Biochemical ,,
,,, APP-1 ,,,,,,,,,,,,,, Cell ,,
,,, APP-3 ,,,,,,,,,,,,,, Cell ,,
"APP-1" is kept because their column 3 are the same and their Category are one "Cell" and the other one is "Enzyme". The same thing for "APP-3", which has one "Cell" and the other one is "Biochemical" in its "Category" column.
The following attempt could do the trick:
import os

App=["1"]
for a in App:
    outname="App_"+a+"_target_overlap.csv"
    out=open(outname,'w')
    ticker=0
    cell_comp_id=[]
    final_comp_id=[]
    # make compound with cell activity (to a target) list first
    filename="App_"+a+"_target_Detail_average.csv"
    if os.path.exists(filename):
        file = open (filename)
        line=file.readlines()
        if(ticker==0): # Deal with the title
            out.write(line[0])
            ticker=ticker+1
        for c in line[1:]:
            c=c.split(',')
            if(c[17]==" Cell "):
                cell_comp_id.append(c[3])
            else:
                cell_comp_id=list(set(cell_comp_id))
    # while we have a list of compounds with cell activity, now we search the Bio and Enz and make one final compound list
    if os.path.exists(filename):
        for c in line[1:]:
            temporary_line=c #for output_temp
            c=c.split(',')
            for comp in cell_comp_id:
                if (c[3]==comp and c[17]==" Biochemical "):
                    final_comp_id.append(comp)
                    out.write(str(temporary_line))
                elif (c[3]==comp and c[17]==" Enzyme "):
                    final_comp_id.append(comp)
                    out.write(str(temporary_line))
                else:
                    final_comp_id=list(set(final_comp_id))
    # After we obtain a final compound list for target a, we go through the whole csv again to output the cell data
    filename="App_"+a+"_target_Detail_average.csv"
    if os.path.exists(filename):
        for c in line[1:]:
            temporary_line=c #for output_temp
            c=c.split(',')
            for final in final_comp_id:
                if (c[3]==final and c[17]==" Cell "):
                    out.write(str(temporary_line))
    out.close()
When the input file is small (tens of thousands of lines), this script finishes in reasonable time. However, when the input files grow to millions or even billions of lines, it takes forever (days...). I think the issue is that we first build a list of APP_IDs with "Cell" in the 18th column, and then compare that "Cell" list (maybe half a million entries) against the whole file (say a million lines): whenever an APP_ID from the Cell list matches a row whose 18th column is "Enzyme" or "Biochemical", we keep that row. This step seems to be very time consuming.
I am thinking that preparing "Cell", "Enzyme" and "Biochemical" dictionaries and comparing those might be faster. Does anyone have a better way to process this? Any example/comment would be helpful. Thanks.
We use Python 2.7.6.
reading the file(s) efficiently
One big problem is that you're reading the file all in one go using readlines. This requires loading ALL of it into memory at once, and I doubt you have that much memory available.
Try:
with open(filename) as fh:
    out.write(fh.readline()) # ticker
    for line in fh: # iterate through lines 'lazily', reading as you go.
        c = line.split(',')
style code to start with. This should help a lot. Here, in context:
# make compound with cell activity (to a target) list first
if os.path.exists(filename):
    with open(filename) as fh:
        out.write(fh.readline()) # ticker
        for line in fh:
            cols = line.split(',')
            if cols[17] == " Cell ":
                cell_comp_id.append(cols[3])
the with open(...) as syntax is a very common python idiom which automatically handles closing the file when you finish the with block, or if there is an error. Very useful.
sets
Next thing is, as you suggest, using sets a little better.
You don't need to recreate the set each time; you can just update it to add items. Here's some example set code (written in the python interpreter style, >>> at the beginning means it's a line of stuff to type - don't actually type the >>> bit!):
>>> my_set = set()
>>> my_set
set([])
>>> my_set.update([1,2,3])
>>> my_set
set([1,2,3])
>>> my_set.update(["this","is","stuff"])
>>> my_set
set([1,2,3,"this","is","stuff"])
>>> my_set.add('apricot')
>>> my_set
set([1,2,3,"this","is","stuff","apricot"])
>>> my_set.remove("is")
>>> my_set
set([1,2,3,"this","stuff","apricot"])
so you can add items and remove them from a set without creating a new set from scratch (which you are doing each time with the cell_comp_id=list(set(cell_comp_id)) bit).
You can also get differences, intersections, etc:
>>> set(['a','b','c','d']) & set(['c','d','e','f'])
set(['c','d'])
>>> set([1,2,3]) | set([3,4,5])
set([1,2,3,4,5])
See the docs for more info.
So let's try something like:
cells = set()
enzymes = set()
biochemicals = set()

with open(filename) as fh:
    out.write(fh.readline()) #ticker
    for line in fh:
        cols = line.split(',')
        row_id = cols[3]
        row_category = cols[17]
        if row_category == ' Cell ':
            cells.add(row_id)
        elif row_category == ' Biochemical ':
            biochemicals.add(row_id)
        elif row_category == ' Enzyme ':
            enzymes.add(row_id)
Now you have sets of cells, biochemicals and enzymes. You only want the intersections of these, so:
cells_and_enzymes = cells & enzymes
cells_and_biochemicals = cells & biochemicals
You can then go through all the files again and simply check if row_id (or c[3]) is in either of those sets, and if so, print it.
You can actually combine those two sets even further:
cells_with_enz_or_bio = cells_and_enzymes | cells_and_biochemicals
which would be the cells which have enzymes or biochemicals.
So then when you run through the files the second time, you can do:
if row_id in cells_with_enz_or_bio:
    out.write(line)
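In context, that second pass might look something like this (a sketch, reusing the filename and out handle from above):

with open(filename) as fh:
    fh.readline() # skip the header; it was already written during the first pass
    for line in fh:
        row_id = line.split(',')[3]
        if row_id in cells_with_enz_or_bio:
            out.write(line)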
after all that?
Just using those suggestions might be enough to get you by. You still are storing in memory the entire sets of cells, biochemicals and enzymes, though. And you're still running through the files twice.
So there are two ways we could potentially speed it up, while still staying with a single python process. I don't know how much memory you have available. If you run out of memory, then it might possibly slow things down slightly.
reducing sets as we go.
If you do have a million records, and 800000 of them are pairs (have a cell record and a biochemical record) then by the time you get to the end of the list, you're storing 800000 IDs in sets. To reduce memory usage, once we've established that we do want to output a record, we could save that information (that we want to print the record) to a file on disk, and stop storing it in memory. Then we could read that list back later to figure out which records to print.
Since this does increase disk IO, it could be slower. But if you are running out of memory, it could reduce swapping, and thus end up faster. It's hard to tell.
with open('to_output.tmp','a') as to_output:
    for a in App:
        # ... do your reading thing into the sets ...
        if row_id in cells and (row_id in biochemicals or row_id in enzymes):
            to_output.write('%s,' % row_id)
            # discard() rather than remove(), so there's no KeyError when the
            # id is only in one of biochemicals/enzymes
            cells.discard(row_id)
            biochemicals.discard(row_id)
            enzymes.discard(row_id)
once you've read through all the files, you now have a file (to_output.tmp) which contains all the ids that you want to keep. So you can read that back into python:
with open('to_output.tmp') as ids_file:
    ids_to_keep = set(ids_file.read().split(','))
which means you can then on your second run through the files simply say:
if row_id in ids_to_keep:
    out.write(line)
using dict instead of sets:
If you have plenty of memory, you could bypass all of that and use dicts for storing the data, which would let you run through the files only once, rather than using sets at all.
cells = {}
enzymes = {}
biochemicals = {}

with open(filename) as fh:
    out.write(fh.readline()) #ticker
    for line in fh:
        cols = line.split(',')
        row_id = cols[3]
        row_category = cols[17]
        if row_category == ' Cell ':
            cells[row_id] = line
        elif row_category == ' Biochemical ':
            biochemicals[row_id] = line
        elif row_category == ' Enzyme ':
            enzymes[row_id] = line
        if row_id in cells and row_id in biochemicals:
            out.write(cells[row_id])
            out.write(biochemicals[row_id])
            if row_id in enzymes:
                out.write(enzymes[row_id])
        elif row_id in cells and row_id in enzymes:
            out.write(cells[row_id])
            out.write(enzymes[row_id])
The problem with this method is that if any rows are duplicated, it will get confused.
If you are sure that the input records are unique, and that they either have enzyme or biochemical records, but not both, then you could easily add del cells[row_id] and the appropriate others to remove rows from the dicts once you've printed them, which would reduce memory usage.
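For instance, a sketch of how those del statements could slot into the writing block above (assuming each id really does appear at most once per category):

if row_id in cells and row_id in biochemicals:
    out.write(cells[row_id])
    out.write(biochemicals[row_id])
    if row_id in enzymes:
        out.write(enzymes[row_id])
        del enzymes[row_id]
    # drop entries we've already printed to free memory
    del cells[row_id]
    del biochemicals[row_id]
elif row_id in cells and row_id in enzymes:
    out.write(cells[row_id])
    out.write(enzymes[row_id])
    del cells[row_id]
    del enzymes[row_id]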
I hope this helps :-)
A technique I have used to deal with massive files quickly in Python is to use the multiprocessing library to split the file into large chunks, and process those chunks in parallel in worker subprocesses.
Here's the general algorithm:
Based on the amount of memory you have available on the system that will run this script, decide how much of the file you can afford to read into memory at once. The goal is to make the chunks as large as you can without causing thrashing.
Pass the file name and chunk beginning/end positions to subprocesses, which will each open the file, read in and process their sections of the file, and return their results.
Specifically, I like to use a multiprocessing pool, then create a list of chunk start/stop positions, then use the pool.map() function. This will block until everyone has completed, and the results from each subprocess will be available if you catch the return value from the map call.
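For illustration, here is a rough sketch of that driver side. The names chunk_bounds and process_chunk are mine, not from any library: process_chunk stands in for the per-chunk worker shown below, and a real chunker would nudge each boundary forward to the next newline so no row is split across chunks.

import os
from multiprocessing import Pool, cpu_count

def process_chunk(args):
    # placeholder for the per-chunk worker sketched below
    fi_name, chunk_start, chunk_end = args
    return {}

def chunk_bounds(fname, n_chunks):
    # naive byte ranges; a real version would align each boundary to a newline
    size = os.path.getsize(fname)
    step = size // n_chunks + 1
    return [(start, min(start + step, size)) for start in range(0, size, step)]

if __name__ == '__main__':
    fname = "App_1_target_Detail_average.csv" # input name taken from the question
    pool = Pool()
    # map blocks until every worker is done; results is one dict per chunk
    results = pool.map(process_chunk,
                       [(fname, s, e) for s, e in chunk_bounds(fname, cpu_count())])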
For example, you could do something like this in your subprocesses:
import operator

# assume we have passed in a byte position to start and end at, and a file name:
with open(fi_name, 'r') as fi:
    fi.seek(chunk_start)
    chunk = fi.readlines(chunk_end - chunk_start)

retriever = operator.itemgetter(3, 17) # extracts only the elements we want
APPIDs = {}

for line in chunk:
    # split on commas and strip the padding spaces around each field
    ID, category = (field.strip() for field in retriever(line.split(',')))
    try:
        APPIDs[ID].append(category) # we've seen this ID before, add category to its list
    except KeyError:
        APPIDs[ID] = [category] # we haven't seen this ID before - make an entry

# APPIDs entries will look like this:
#
# <APPID> : [list of categories]

return APPIDs
In your main process, you would retrieve all the returned dictionaries and resolve duplicates or overlaps, then output something like this:
for ID, categories in APPIDs.iteritems():
    if ('Cell' in categories) and ('Biochemical' in categories or 'Enzyme' in categories):
        pass # print or whatever
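If each worker returns its own APPIDs dict, one way to do that "resolve duplicates or overlaps" step is to merge them first (a sketch; results is the list returned by pool.map above):

merged = {}
for chunk_dict in results:
    for ID, categories in chunk_dict.iteritems():
        merged.setdefault(ID, []).extend(categories)

for ID, categories in merged.iteritems():
    if 'Cell' in categories and ('Biochemical' in categories or 'Enzyme' in categories):
        print ID # or look the matching rows up and write them out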
A couple of notes/caveats:
Pay attention to the load on your hard disk/SSD/wherever your data is located. If your current method is already maxing out its throughput, you probably won't see any performance improvements from this. You can try implementing the same algorithm with threading instead.
If you do get a heavy hard disk load that's NOT due to memory thrashing, you can also reduce the number of simultaneous subprocesses you're allowing in the pool. This will result in fewer read requests to the drive, while still taking advantage of truly parallel processing.
Look for patterns in your input data you can exploit. For example, if you can rely on matching APPIDs being next to each other, you can actually do all of your comparisons in the subprocesses and let your main process hang out until it's time to combine the subprocess data structures (a sketch of this idea follows below).
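For example, a sketch of that last idea. This assumes rows sharing an APPID are adjacent and never straddle a chunk boundary, since itertools.groupby only groups neighbouring rows:

import itertools

def matches_in_chunk(lines):
    # lines: raw CSV rows from one chunk, ordered by their APP_ID column
    keep = []
    key = lambda l: l.split(',')[3]
    for app_id, rows in itertools.groupby(lines, key=key):
        rows = list(rows)
        categories = [r.split(',')[17].strip() for r in rows]
        if 'Cell' in categories and ('Biochemical' in categories or 'Enzyme' in categories):
            keep.extend(rows)
    return keep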
TL;DR
Break your file up into chunks and process them in parallel with the multiprocessing library.