I am getting corrupt files after parsing through them - python

I have many output files from a molecular dynamics code (LAMMPS) and a script that parses through the files and calculates a number of parameters from them. Each file is between 4 and 17 MB, and the script parses 768 of these files in around 3 minutes. The files are plain text, not binary.
Here is the problem:
The script stops after processing 4 to 5 of these folders (4-5 x 768 files), complaining about a reshape error. But the real reason is that a bunch of "special characters" have been inserted into the text for no good reason. I am certain that the file was not corrupt before the script parsed it, and the only thing that could have inserted these characters is running the script.
The script is written in Python and I use the 'r' mode in the command below to make sure that the script does not have write access to the file.
fid = open(file_path, 'r')
I re-wrote the same script in MATLAB and the same problem exists there too, making me believe that it is an I/O issue related to my hardware rather than a coding issue.
I run this script on an Ubuntu workstation with ext4-formatted hard drives and 64 GB of RAM.
Interestingly, when I run the same script on a cluster I am not able to replicate the problem. Even the parallel version of the script runs fine on the cluster, but not on my local machine.
Here is the part of the script which does all the reading from the opened file; the rest of the script does the calculations. In the end I write the results to a separate, very small file (50 KB).
for file_name in list_files:
    if 'dump' in file_name:
        print file_name
        gb_name = dir_list[j]
        gb_num = int(file_name[file_name.find('.')+1:len(file_name)])
        i += 1
        file_path = read_path + '/' + file_name
        fid = open(file_path, 'r', 1)
        junk = fid.readline(); junk = fid.readline(); junk = fid.readline();
        num_lines = int(fid.readline())
        junk = fid.readline()
        if len(junk) > 26:
            x_lims = map(float, fid.readline().split())
            y_lims = map(float, fid.readline().split())
            z_lims = map(float, fid.readline().split())
            a_tri = abs(x_lims[1] - (x_lims[0]-y_lims[2]))
            c_tri = abs(z_lims[1] - z_lims[0])
            gb_area = a_tri * c_tri
            junk = fid.readline()
        else:
            x_lims = map(float, fid.readline().split())
            y_lims = map(float, fid.readline().split())
            z_lims = map(float, fid.readline().split())
            gb_area = abs(x_lims[1] - x_lims[0]) * abs(z_lims[1] - z_lims[0])
            junk = fid.readline()
        tmp = fid.readlines()
        fid.close()
        if len(tmp) == num_lines:
            coords = np.array(map(float, ''.join(tmp).split())).reshape(num_lines, 7)
            coords[::, 0:2].astype(int)
        else:
            raise Exception('Number of lines is not consistent with number of data points.')
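As an aside, the same per-atom block can also be read in a single call with numpy.loadtxt, which may make it easier to see exactly where a file stops being parseable. This is only a sketch, assuming the usual 9-line dump header and 7 columns per atom, as in the code above:
import numpy as np

def read_dump_atoms(file_path, header_lines=9, n_cols=7):
    # Read the per-atom section of a LAMMPS text dump into an (N, n_cols) array.
    data = np.loadtxt(file_path, skiprows=header_lines)
    if data.ndim != 2 or data.shape[1] != n_cols:
        raise ValueError('Unexpected shape %s in %s' % (str(data.shape), file_path))
    return data
numpy.loadtxt raises a ValueError as soon as it hits a value it cannot convert, which helps localise the first malformed entry.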
Here is an example of the corrupt file. Note the special characters:
99463 2 51.5597 211.814 41.7614k4.26088e-13 -3.35999
99881 2 52.1696 212.526 39.0575 4.91923e-/ef3 -3.35998
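To check whether the corruption already exists on disk or only shows up during the run, the files can be scanned for unexpected bytes before and after parsing. A minimal sketch (the glob pattern is just a placeholder for your dump file names):
import glob

def find_suspect_bytes(pattern='dump.*'):
    # Report any byte outside printable ASCII, tab, CR or LF in each file.
    for path in glob.glob(pattern):
        with open(path, 'rb') as fh:
            data = bytearray(fh.read())
        bad = [i for i, byte in enumerate(data)
               if not (32 <= byte <= 126 or byte in (9, 10, 13))]
        if bad:
            print('%s: %d suspect byte(s), first at offset %d' % (path, len(bad), bad[0]))
Comparing checksums of the files (e.g. with md5sum) before and after a run is another quick way to confirm whether anything on disk actually changes.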
Please leave a comment here if you have had similar experiences.

Related

Python: Having trouble replacing lines from file

I'm trying to build a subtitle translator using DeepL, but it isn't running perfectly. I managed to translate the subtitles; the part I'm having problems with is replacing the lines. I can see that the lines are translated because it prints them, but it doesn't replace them in the file. Whenever I run the program, the output is the same as the original file.
This is the code responsible for it:
def translate(input, output, languagef, languaget):
    file = open(input, 'r').read()
    fileresp = open(output, 'r+')
    subs = list(srt.parse(file))
    for sub in subs:
        try:
            linefromsub = sub.content
            translationSentence = pydeepl.translate(linefromsub, languaget.upper(), languagef.upper())
            print(str(sub.index) + ' ' + translationSentence)
            for line in fileresp.readlines():
                newline = fileresp.write(line.replace(linefromsub, translationSentence))
        except IndexError:
            print("Error parsing data from deepl")
This is how the file looks:
1
00:00:02,470 --> 00:00:04,570
- Yes, I do.
- (laughs)
2
00:00:04,605 --> 00:00:07,906
My mom doesn't want
to babysit everyday
3
00:00:07,942 --> 00:00:09,274
or any day.
4
00:00:09,310 --> 00:00:11,977
But I need
my mom's help sometimes.
5
00:00:12,013 --> 00:00:14,046
She's just gonna
have to be grandma today.
Help will be appreciated :)
Thanks.
You are opening fileresp with r+ mode. When you call readlines(), the file's position will be set to the end of the file. Subsequent calls to write() will then append to the file. If you want to overwrite the original contents as opposed to append, you should try this instead:
allLines = fileresp.readlines()
fileresp.seek(0)      # Set position to the beginning
fileresp.truncate()   # Delete the contents
for line in allLines:
    fileresp.write(...)
Update
It's difficult to see what you're trying to accomplish with r+ mode here, but it seems you have two separate input and output files. If that's the case, consider:
def translate(input, output, languagef, languaget):
    file = open(input, 'r').read()
    fileresp = open(output, 'w')  # Use w mode instead
    subs = list(srt.parse(file))
    for sub in subs:
        try:
            linefromsub = sub.content
            translationSentence = pydeepl.translate(linefromsub, languaget.upper(), languagef.upper())
            print(str(sub.index) + ' ' + translationSentence)
            fileresp.write(translationSentence)  # Write the translated sentence
        except IndexError:
            print("Error parsing data from deepl")
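Note that writing only the translated sentences produces plain text rather than a valid subtitle file. If the goal is a translated .srt, one option is to update each sub.content and serialise the whole list at the end. This is a sketch that assumes the srt package's compose function (which rebuilds the index and timestamp blocks) and keeps the pydeepl call from the question:
import srt
import pydeepl

def translate(input, output, languagef, languaget):
    with open(input, 'r') as infile:
        subs = list(srt.parse(infile.read()))
    for sub in subs:
        try:
            # Replace the content in place; indices and timings are preserved.
            sub.content = pydeepl.translate(sub.content, languaget.upper(), languagef.upper())
        except IndexError:
            print("Error parsing data from deepl")
    with open(output, 'w') as outfile:
        outfile.write(srt.compose(subs))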

Concatenate multiple text files of DNA sequences in Python or R?

I was wondering how to concatenate exon/DNA fasta files using Python or R.
So far I have really liked using the R ape package's cbind method, solely because of the fill.with.gaps=TRUE option. I really need gaps inserted when a species is missing an exon.
My code:
ex1 <- read.dna("exon1.txt", format="fasta")
ex2 <- read.dna("exon2.txt", format="fasta")
output <- cbind(ex1, ex2, fill.with.gaps=TRUE)
write.dna(output, "Output.txt", format="fasta")
Example:
exon1.txt
>sp1
AAAA
>sp2
CCCC
exon2.txt
>sp1
AGG-G
>sp2
CTGAT
>sp3
CTTTT
Output file:
>sp1
AAAAAGG-G
>sp2
CCCCCTGAT
>sp3
----CTTTT
So far I am having trouble applying this technique when I have multiple exon files (I'm trying to figure out a loop that opens and cbinds all files ending with .fa in the directory), and sometimes the exons in the files are not all identical in length, hence DNAbin stops working.
So far I have:
file_list <- list.files(pattern=".fa")
myFunc <- function(x) {
    for (file in file_list) {
        x <- read.dna(file, format="fasta")
        out <- cbind(x, fill.with.gaps=TRUE)
        write.dna(out, "Output.txt", format="fasta")
    }
}
However, when I run this and check my output text file, it is missing many exons, and I think that is because not all files have the same exon length... or my script is failing somewhere and I can't figure out where :(
Any ideas? I can also try Python.
If you prefer using Linux one-liners, you have:
cat exon1.txt exon2.txt > outfile
If you want only the unique records from the outfile, use:
awk '/^>/{f=!d[$1];d[$1]=1}f' outfile > sorted_outfile
I just came up with this answer in Python 3:
def read_fasta(fasta):  # Function that reads the files
    output = {}
    for line in fasta.split("\n"):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            active_sequence_name = line[1:]
            if active_sequence_name not in output:
                output[active_sequence_name] = []
            continue
        sequence = line
        output[active_sequence_name].append(sequence)
    return output

with open("exon1.txt", 'r') as file:  # read exon1.txt
    file1 = read_fasta(file.read())
with open("exon2.txt", 'r') as file:  # read exon2.txt
    file2 = read_fasta(file.read())

finaldict = {}  # Concatenate the contents of both files
for i in list(file1.keys()) + list(file2.keys()):
    if i not in file1.keys():
        file1[i] = ["-" * len(file2[i][0])]
    if i not in file2.keys():
        file2[i] = ["-" * len(file1[i][0])]
    finaldict[i] = file1[i] + file2[i]

with open("output.txt", 'w') as file:  # write the result to output.txt
    for k, i in finaldict.items():
        file.write(">{}\n{}\n".format(k, "".join(i)))  # proper formatting
It's pretty hard to comment and explain it completely, and it might not help you, but this is better than nothing :P
I used Łukasz Rogalski's code from the answer to Reading a fasta file format into Python dict.
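If it helps, the same idea extends to any number of exon files. A sketch using the read_fasta helper above (it assumes the files match exon*.txt and that, as in the two-file version, all sequences within one file have the same length so the gap padding lines up):
import glob

files = [read_fasta(open(path).read()) for path in sorted(glob.glob("exon*.txt"))]
species = set()
for fasta in files:
    species.update(fasta.keys())

with open("output.txt", "w") as out:
    for sp in sorted(species):
        parts = []
        for fasta in files:
            if sp in fasta:
                parts.append("".join(fasta[sp]))
            else:
                # Pad with gaps matching this exon's length when the species is missing.
                exon_length = len("".join(next(iter(fasta.values()))))
                parts.append("-" * exon_length)
        out.write(">{}\n{}\n".format(sp, "".join(parts)))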

Writing list to text file not making new lines python

I'm having trouble writing to a text file. Here's my code snippet.
ram_array = map(str, ram_value)
cpu_array = map(str, cpu_value)
iperf_ba_array = map(str, iperf_ba)
iperf_tr_array = map(str, iperf_tr)

#with open(ram, 'w') as f:
#    for s in ram_array:
#        f.write(s + '\n')
#with open(cpu, 'w') as f:
#    for s in cpu_array:
#        f.write(s + '\n')

with open(iperf_b, 'w') as f:
    for s in iperf_ba_array:
        f.write(s + '\n')
f.close()
with open(iperf_t, 'w') as f:
    for s in iperf_tr_array:
        f.write(s + '\n')
f.close()
The ram and cpu files both come out flawlessly; however, the files written for iperf_ba and iperf_tr always come out looking like this:
[45947383.0, 47097609.0, 46576113.0, 47041787.0, 47297394.0]
Instead of
1
2
3
They're both read from global lists. The cpu and ram lists have values appended one by one, but otherwise they look exactly the same before processing.
Here's how they're made:
filename = "iperfLog_2015_03_12_20:45:18_123_____tag_33120L06.csv"
write_location = self.tempLocation()
location = str(write_location) + str(filename)
df = pd.read_csv(location, names=list('abcdefghi'))
transfer = df.h
transfer = transfer[~transfer.isnull()]  # uses pandas to remove NaN
transfer = transfer.tolist()
length = int(len(transfer))
extra = length - 1
del transfer[extra]
bandwidth = df.i
bandwidth = bandwidth[~bandwidth.isnull()]
bandwidth = bandwidth.tolist()
del bandwidth[extra]
iperf_tran.append(transfer)
iperf_band.append(bandwidth)
[from comment]
You need to use .extend(list) if you want to add a list's elements to another list; .append(list) adds the whole list as a single element. And don't worry: we all spend hours debugging/chasing our own silly mistakes sometimes ;)
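To illustrate the difference (a minimal standalone sketch, not tied to the variables above):
values = [45947383.0, 47097609.0]

appended = []
appended.append(values)   # adds the whole list as one element
extended = []
extended.extend(values)   # adds the numbers individually

print(appended)  # [[45947383.0, 47097609.0]]; map(str, appended) yields the bracketed line seen in the file
print(extended)  # [45947383.0, 47097609.0]; map(str, extended) yields one number per write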

Change two lines in text

I have a Python script mostly written for a project I'm currently working on, and I have hit a roadblock. I essentially run a program that spits out the following output file (called big.dmp):
)O+_05 Big-body initial data (WARNING: Do not delete this line!!)
) Lines beginning with `)' are ignored.
)---------------------------------------------------------------------
style (Cartesian, Asteroidal, Cometary) = Cartesian
epoch (in days) = 1365250.
)---------------------------------------------------------------------
COMPSTAR r=5.00000E-01 d=3.00000E+00 m= 0.160000000000000E+01
4.570923967127310E-01 1.841433531828977E+01 0.000000000000000E+00
-6.207379670518027E-03 1.540861575481520E-04 0.000000000000000E+00
0.000000000000000E+00 0.000000000000000E+00 0.000000000000000E+00
Now, with this file I need to edit both the epoch line and the line beginning with COMPSTAR, while keeping the rest of the information constant from integration to integration, since the last 3 lines contain the Cartesian coordinates of my object and are essentially what the program is outputting.
I know how to use f = open('big.dmp', 'w') and f.write('text here') to create the initial file, but how would one go about carrying these final three lines over into a new big.dmp file for the next integration?
Something like this perhaps?
infile = open('big1.dmp')
outfile = open('big2.dmp', 'w')
for line in infile:
    if line.startswith(')'):
        # ignore comments
        pass
    elif 'epoch' in line:
        # do something with line
        line = line.replace('epoch', 'EPOCH')
    elif line.startswith('COMPSTAR'):
        # do something with line
        line = line.replace('COMPSTAR', 'comparison star')
    outfile.write(line)
Here is a somewhat more change-tolerant version:
import re

reg_num = r'\d+'
reg_sci = r'[-+]?\d*\.?\d+([eE][+-]?\d+)?'

def update_config(s, finds=None, replaces=None, **kwargs):
    if finds is None: finds = update_config.finds
    if replaces is None: replaces = update_config.replaces
    for name, value in kwargs.iteritems():
        s = re.sub(finds[name], replaces[name].format(value), s)
    return s

update_config.finds = {
    'epoch': r'epoch \(in days\) =\s*' + reg_num + r'\.',
    'r': r' r\s*=\s*' + reg_sci,
    'd': r' d\s*=\s*' + reg_sci,
    'm': r' m\s*=\s*' + reg_sci
}
update_config.replaces = {
    'epoch': 'epoch (in days) ={:>11d}.',
    'r': ' r={:1.5E}',
    'd': ' d={:1.5E}',
    'm': ' m= {:1.15E}'
}

def main():
    with open('big.dmp') as inf:
        s = inf.read()
    s = update_config(s, epoch=1365252, r=0.51, d=2.99, m=1.1)
    with open('big.dmp', 'w') as outf:
        outf.write(s)

if __name__ == "__main__":
    main()
On the off-chance that the format of your file is fixed with regard to line numbers, this solution will change only the two lines:
with open('big.dmp') as inf, open('out.txt', 'w') as outf:
    data = inf.readlines()
    data[4] = ' epoch (in days) = 9999.\n'       # line with epoch
    data[6] = 'COMPSTAR r=2201 d=3330 m= 12\n'   # line with COMPSTAR
    outf.writelines(data)
resulting in this output file:
)O+_05 Big-body initial data (WARNING: Do not delete this line!!)
) Lines beginning with `)' are ignored.
)---------------------------------------------------------------------
style (Cartesian, Asteroidal, Cometary) = Cartesian
epoch (in days) = 9999.
)---------------------------------------------------------------------
COMPSTAR r=2201 d=3330 m= 12
4.570923967127310E-01 1.841433531828977E+01 0.000000000000000E+00
-6.207379670518027E-03 1.540861575481520E-04 0.000000000000000E+00
0.000000000000000E+00 0.000000000000000E+00 0.000000000000000E+00
Clearly this will not work if the line numbers aren't consistent, but I thought I'd offer it up just in case your data format is consistent in terms of line numbers.
Also, since it reads the whole file into memory at once, it won't be an ideal solution for truly huge files.
The advantage of opening files using with is that they are automatically closed for you when you are done with them, or if you encounter an exception.
There are more flexible solutions (searching for the strings, processing the file line by line), but if your data is fixed and small, there's no downside to taking advantage of those facts. Somebody smart once said "Simple is better than complex." (The Zen of Python)
It's a little hard to understand what you want, but assuming that you only want to remove the lines starting with ):
text = open(filename).read()
lines = text.split("\n")
result = [line for line in lines if not line.startswith(")")]
or, the one-liner:
[line for line in open(file_name).read().split("\n") if not line.startswith(")")]
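Putting the pieces together for the original question, here is a minimal sketch of one integration-to-integration update. new_epoch and new_compstar_line are placeholders for whatever the next run needs; it assumes the epoch and COMPSTAR lines each appear once, and that everything else should pass through unchanged:
def write_next_big_dmp(old_path, new_path, new_epoch, new_compstar_line):
    with open(old_path) as inf:
        lines = inf.readlines()
    with open(new_path, 'w') as outf:
        for line in lines:
            if 'epoch' in line and not line.startswith(')'):
                outf.write(' epoch (in days) = %s\n' % new_epoch)
            elif line.startswith('COMPSTAR'):
                outf.write(new_compstar_line.rstrip('\n') + '\n')
            else:
                outf.write(line)  # comments, the style line and the coordinates are kept as-is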

(BioPython) How do I stop MemoryError: Out of Memory exception?

I have a program where I take a pair of very large multiple-sequence files (>77,000 sequences, each averaging about 1000 bp long), calculate the alignment score between each paired element, and write that number to an output file (which I will load into an Excel file later).
My code works for small multiple-sequence files, but my large master file throws the following traceback after analyzing the 16th pair.
Traceback (most recent call last):
File "C:\Users\Harry\Documents\cgigas\BioPython Programs\Score Create Program\scoreCreate", line 109, in <module>
cycle(f,k,binLen)
File "C:\Users\Harry\Documents\cgigas\BioPython Programs\Score Create Program\scoreCreate", line 85, in cycle
a = pairwise2.align.localxx(currentSubject.seq, currentQuery.seq, score_only=True)
File "C:\Python26\lib\site-packages\Bio\pairwise2.py", line 301, in __call__
return _align(**keywds)
File "C:\Python26\lib\site-packages\Bio\pairwise2.py", line 322, in _align
score_only)
MemoryError: Out of memory
I have tried many things to work around this (as you may see from the code), all to no avail. I have tried splitting the large master file into smaller batches to be fed into the score-calculating method. I have tried del-ing objects after I am done using them, and I have tried running under Ubuntu 11.11 in an Oracle virtual machine (I typically work in 64-bit Windows 7). Am I being too ambitious; is this computationally feasible in Biopython? Below is my code. I have no experience in memory debugging, which is the clear culprit of this problem. Any assistance is greatly appreciated; I am becoming very frustrated with this problem.
Best,
Harry
##Open reference file
##a.)Upload subjectList
##b.)Upload query list (a and b are pairwise data)
## Cycle through each paired FASTA and get alignment score of each(Large file)
from Bio import SeqIO
from Bio import pairwise2
import gc
##BATCH ITERATOR METHOD (not my code)
def batch_iterator(iterator, batch_size):
    entry = True  # Make sure we loop once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = iterator.next()
            except StopIteration:
                entry = None
            if entry is None:
                # End of file
                break
            batch.append(entry)
        if batch:
            yield batch

def split(subject, query):
    ##Query Iterator and Batch Subject Iterator
    query_iterator = SeqIO.parse(query, "fasta")
    record_iter = SeqIO.parse(subject, "fasta")
    ##Writes both large files into many small files
    print "Splitting Subject File..."
    binLen = 2
    for j, batch1 in enumerate(batch_iterator(record_iter, binLen)):
        filename1 = "groupA_%i.fasta" % (j+1)
        handle1 = open(filename1, "w")
        count1 = SeqIO.write(batch1, handle1, "fasta")
        handle1.close()
    print "Done splitting Subject file"
    print "Splitting Query File..."
    for k, batch2 in enumerate(batch_iterator(query_iterator, binLen)):
        filename2 = "groupB_%i.fasta" % (k+1)
        handle2 = open(filename2, "w")
        count2 = SeqIO.write(batch2, handle2, "fasta")
        handle2.close()
    print "Done splitting both FASTA files"
    print " "
    return [k, binLen]

##This file will hold the alignment scores in tab-delimited text
f = open("C:\\Users\\Harry\\Documents\\cgigas\\alignScore.txt", 'w')

def cycle(f, k, binLen):
    i = 1
    m = 1
    while i <= k+1:
        ##Open the first small file
        subjectFile = open("C:\\Users\\Harry\\Documents\\cgigas\\BioPython Programs\\groupA_" + str(i) + ".fasta", "rU")
        queryFile = open("C:\\Users\\Harry\\Documents\\cgigas\\BioPython Programs\\groupB_" + str(i) + ".fasta", "rU")
        i = i + 1
        j = 0
        ##Make small file iterators
        smallQuery = SeqIO.parse(queryFile, "fasta")
        smallSubject = SeqIO.parse(subjectFile, "fasta")
        ##Cycles through both sets of FASTA files
        while j < binLen:
            j = j + 1
            currentQuery = smallQuery.next()
            currentSubject = smallSubject.next()
            ##Verify every pair is correct
            print " "
            print "Pair: " + str(m)
            print "Subject: " + currentSubject.id
            print "Query: " + currentQuery.id
            gc.collect()
            a = pairwise2.align.localxx(currentSubject.seq, currentQuery.seq, score_only=True)
            gc.collect()
            currentQuery = None
            currentSubject = None
            score = str(a)
            a = None
            print "Score: " + score
            f.write("1" + "\n")
            m = m + 1
        smallQuery.close()
        smallSubject.close()
        subjectFile.close()
        queryFile.close()
        gc.collect()
        print "New file"

##MAIN PROGRAM
##Here is our paired list of FASTA files
##subject = open("C:\\Users\\Harry\\Documents\\cgigas\\subjectFASTA.fasta", "rU")
##query = open("C:\\Users\\Harry\\Documents\\cgigas\\queryFASTA.fasta", "rU")
##[k, binLen] = split(subject, query)
k = 272
binLen = 2
cycle(f, k, binLen)
P.S. Be kind; I am aware there are probably some goofy things in the code that I put in there while trying to get around this problem.
See also this very similar question on BioStars, http://www.biostars.org/post/show/45893/trying-to-get-around-memoryerror-out-of-memory-exception-in-biopython-program/
There I suggested trying existing tools for this kind of thing, e.g. EMBOSS needleall http://emboss.open-bio.org/wiki/Appdoc:Needleall (you can parse the EMBOSS alignment output with Biopython)
The pairwise2 module was updated in a recent version of Biopython (1.68) to be faster and consume less memory.
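As a sketch of that route: assuming needleall has already been run separately and its pairwise output saved in EMBOSS srspair format as needleall.out (the file names here are placeholders), the alignments can be walked one at a time with Bio.AlignIO, which keeps the memory used per pair small:
from Bio import AlignIO

# needleall.out is assumed to be srspair-format output from a separate EMBOSS needleall run.
with open("needleall.out") as handle, open("alignScore.txt", "w") as out:
    for alignment in AlignIO.parse(handle, "emboss"):
        subject, query = alignment[0], alignment[1]
        out.write("%s\t%s\t%i\n" % (subject.id, query.id, alignment.get_alignment_length()))
If only the raw score is needed, the lines beginning with "# Score" in the srspair output can also be pulled out with a simple text scan.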
