Reading specific chunks pandas / not reading all chunks in pandas - python

I am trying to use accordingly to this question and answer reading a large csv file by chunks and processing it. Since I'm not native with python I got an optimization problem and looking for a better solution here.
What my code does:
I read in the line count of my csv with
with open(file) as f:
row_count = sum(1 for line in f)
afterwards I "slice" my data in 30 equal sized chunks and process it accordingly to the linked answer with a for loop and pd.read_csv(file, chunksize). Since plotting 30 graphs in one is pretty unclear, I plot it every 5 steps with modulo (which may be variated). For this I use an external counter.
chunksize = row_count // 30
counter = 0
for chunk in pd.read_csv(file, chunksize=chunksize):
df = chunk
print(counter)
if ((counter % 5) == 0 | (counter == 0):
plt.plot(df["Variable"])
counter = counter +1
plt.show()
Now to my question:
It seems like, this loop reads the chunk size in before processing the loop, which is reasonable. I can see this, since the print(counter) steps are also fairly slow. Since I read a few million rows of a csv, it takes some time every step. Is there a way to skip the not wanted chunks in the for loop, before reading it in? I was trying out something like:
wanted_plts <- [1,5,10,15,20,25,30]
for i in wanted_plts:
for chunk[i] in pd.read_csv(file, chunksize=chunksize):
.
.
I think I have understanding issues how I can manipulate this syntax of the for loop range. There should be an elegant way to fix this.
Also: i found the .get_chunk(x) by pandas but this seems to create just one chunk of size x.
Another attempt by me is trying to subset the reader object of pd.read_csv like pd.read_csv()[0,1,2] but it seems that's not possible too.
Amendment: I'm aware plotting a lot of data in matplotlib is slow. I preprocess it earlier, but for making this code readable I removed all unnecessary parts.

You are wasting a lot of resources when parsing CSV into DataFrame without using it. To avoid this you can create line index during the first pass:
fp = open(file_name)
row_count = 0
pos = {0: 0}
line = fp.readline()
while line:
row_count += 1
pos[row_count] = fp.tell()
line = fp.readline()
Do not dispose the file handle yet! Because read_csv() accepts streams, you can move your file pointer as you want:
chunksize = row_count // 30
wanted_plts = [1,5,10,15,20,25,30]
for i in wanted_plts:
fp.seek(pos[i*chunksize]) # this will bring you to the first line of the desired chunk
obj = pd.read_csv(fp, chunksize=chunksize) # read your chunk lazily
df = obj.get_chunk() # convert to DataFrame object
plt.plot(df["Variable"]) # do something
fp.close() # Don't forget to close the file when finished.
And finally a warning: when reading CSV this way you will lose column names. So make an adjustment:
obj = pd.read_csv(fp, chunksize=chunksize, names=[!!<column names you have>!!])
P.S. file is a reserved word, avoid using it to prevent undesired side effects. You can use file_ or file_name instead.

I've toyed with your setup, trying to find a way to skip chunks, using another rendering library like pyqtgraph or using matplotlib.pyplot subroutines instead of plot(), all to no avail.
So the only fair advice I can give you is to limit the scope of read_csv to only the data you're interested in by passing the usecols parameter.
Instead of:
for chunk in pd.read_csv(file, chunksize=chunksize):
plt.plot(chunk['Variable'])
Use:
for chunk in pd.read_csv(file, usecols=['Variable'], chunksize=chunksize):
plt.plot(chunk)
And, if you haven't already, definitely limit the number of iterations by going for the biggest chunksize you possibly can (so in your case the lowest row_count divider).
I haven't quantified their respective weight but you will gain on both the csv_read() and the plot() method overheads, even ever so slightly due to the fact that your current chunks are already quite big.
With my test data, quadrupling the chunksize cuts down processing time in half:
chunksize=1000 => executed in 12.7s
chunksize=2000 => executed in 9.06s
chunksize=3000 => executed in 7.68s
chunksize=4000 => executed in 6.94s
And specifying usecols at read time also cuts down processing time in half again:
chunksize=1000 + usecols=['Variable'] => executed in 8.33s
chunksize=2000 + usecols=['Variable'] => executed in 5.27s
chunksize=3000 + usecols=['Variable'] => executed in 4.39s
chunksize=4000 + usecols=['Variable'] => executed in 3.54s

As far as I know, pandas does not provide any support for skipping chunks of file. At least I never found anything about it in the documentation.
In general, skipping lines from file (not reading them at all) is difficult unless you know in advance how many lines you want to skip and how many characters you have in each of those lines. In this case you can try to play with IO and seek to move the stream position to the exact place you need the next iteration.
But it does not seem your case.
I think the best thing you can do to improve efficiency is to read the lines using standard IO, and convert to a dataframe only the lines you need / want to plot.
Consider for example the following custom iterator.
When instantiated, it saves the header (first line). Each iteration it reads a chunk of lines from the file and then skip the following n*chunksize lines. It returns the header line followed by the read lines, wrapped in a io.StringIO object (so it's a stream and can be fed directly to pandas.read_csv).
import io
from itertools import islice
class DfReaderChunks:
def __init__(self, filename, chunksize, n):
self.fo = open(filename)
self.chs = chunksize
self.skiplines = self.chs * n
self.header = next(self.fo)
def getchunk(self):
ll = list(islice(self.fo, self.chs))
if len(ll) == 0:
raise StopIteration
dd = list(islice(self.fo, self.skiplines))
return self.header + ''.join(ll)
def __iter__(self):
return self
def __next__(self):
return io.StringIO(self.getchunk())
def close(self):
self.fo.close()
def __del__(self):
self.fo.close()
Using this class, your can read from your file:
reader = DfReaderChunks(file, chunksize, 4)
for dfst in reader:
df = pd.read_csv(dfst)
print(df) #here I print to stdout, you can plot
reader.close()
which is "equivalent" to your setup:
for chunk in pd.read_csv(file, chunksize=chunksize):
df = chunk
if (counter % 5 == 0):
print(df) #again I print, you can plot
counter += 1
I tested the time used by both the above snippets using a dataframe of 39 Mb (100000 rows or random numbers).
On my machine, the former takes 0.458 seconds, the latter 0.821 seconds.
The only drawback is that the former snippet loses track of the row index (it's a new dataframe each time, so index always start from 0) but the printed chunks are the same.

Related

How to append chunks of 2D numpy array to binary file as the chunks are created?

I have a large input file which consists of data frames (a data series (complex64), with an identifying header in each frame). It is larger than my available memory. The headers repeat, but are randomly ordered, so for example the input file could look like:
<FRAME header={0}, data={**first** 500 numbers...}>,
<FRAME header={18}, data={first 500 numbers...}>,
<FRAME header={4}, data={first 500 numbers...}>,
<FRAME header={0}, data={**next** 500 numbers...}>
...
I want to order the data into a new file that is a numpy array of shape (len(headers), len(data_series)). It has to build the output file as it reads the frames, because I can't fit it all in memory.
I've looked at numpy.savetxt and the python csv package but for disk size, precision, and speed reasons I would prefer for the output file to be binary. numpy.save is good except that I can't figure out how to make it append to an unknown array size.
I have to work in Python2.7 because of some dependencies needed to read these frames. What I have done so far is made a function able to write all of the frames with a matching header to a single binary file:
input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)
with open("singleFrameHeader", 'ab') as f:
current_data = input_data.readFrame() # This loads the next frame in the file
if current_data.header == 0:
float_arr = np.array(current_data.data).view(float)
float_arr.tofile(f)
This works great, but what I need to extend it to be two dimensional. I'm starting to look at h5py as an option, but was hoping there is a simpler solution.
What would be great is something like
input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)
with open("bigMatrix", 'ab') as f:
current_data = input_data.readFrame() # This loads the next frame in the file
index = current_data.header
float_arr = np.array(current_data.data).view(float)
float_arr.tofile(f, index)
Any help is appreciated. I thought this would be a more common use-case to read and write to a 2D binary file in append mode.
You have two problems: one is that a file contains sequential data, and the other is that numpy binary files don't store shape information.
A simple way to start solving this would be to carry through with your initial idea of converting the data into files by header, then combining all the binary files into one large product (if you still feel the need to do so).
You could maintain a map of the headers you've found so far to their output files, data size, etc. This will allow you to combine the data more intelligently, if for example, there are missing chunks or headers or something.
from contextlib import ExitStack
from os import remove
from tempfile import NamedTemporaryFile
from shutil import copyfileobj
import sys
class Header:
__slots__ = ('id', 'count', 'file', 'name')
def __init__(self, id):
self.id = id
self.count = 0
self.file = NamedTemporaryFile(delete=False)
self.name = self.file.name
def write_frame(self, frame):
data = np.array(frame.data).view(float)
self.count += data.size
data.tofile(self.file)
input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)
file_map = {}
with ExitStack() as stack:
while True:
frame = input_data.next_frame()
if frame is None:
break # recast this loop as necessary
if frame.header not in file_map:
header = Header(frame.header)
stack.enter_context(header.file)
file_map[frame.header] = header
else:
header = file_map[frame.header]
header.write_frame(frame)
max_header = max(file_map)
max_count = max(h.count for h in file_map)
with open('singleFrameHeader', 'wb') as output:
output.write(max_header.to_bytes(8, sys.byteorder))
output.write(max_count.to_bytes(8, sys.byteorder))
for i in range max_header:
if i in file_map:
h = file_map[i]
with open(h.name, 'rb') as input:
copyfileobj(input, output)
remove(h.name)
if h.count < max_count:
np.full(max_count - h.count, np.nan, dtype=np.float).tofile(output)
else:
np.full(max_count, np.nan, dtype=np.float).tofile(output)
The first 16 bytes will be the int64 number of headers and number of elements per header, respectively. Keep in mind that the file is in native byte order, whatever that may be, and is therefore not portable.
Alternative
If (and only if) you know the exact size of a header dataset ahead of time, you can do this in one pass, with no temporary files. It also helps if the headers are contiguous. Otherwise, missing swaths will be zero-filled. You will still need to maintain a dictionary of your current position within a header, but you will no longer have to keep a separate file pointer around for each one. All-in-all, this is a much better alternative than the original solution, if your use-case allows it:
header_size = 500 * N # You must know this up front
input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)
header_map = {}
with open('singleFrameHeader', 'wb') as output:
output.write(max_header.to_bytes(8, sys.byteorder))
output.write(max_count.to_bytes(8, sys.byteorder))
while True:
frame = input_data.next__frame()
if frame is None:
break
if frame.header not in header_map:
header_map[frame.header] = 0
data = np.array(frame.data).view(float)
output.seek(16 + frame.header * header_size + header_map[frame.header])
data.tofile(output)
header_map[frame.header] += data.size * data.dtype.itemsize
I asked a question regarding this sort of out-of-order write pattern as a consequence of this answer: What happens when you seek past the end of a file opened for writing?

Only read certain rows in a csv file with python

I want to read only a certain amount of rows starting from a certain row in a csv file without iterating over the whole csv file to reach this certain point.
Lets say i have a csv file with 100 rows and i want to read only row 50 to 60. I dont want to iterate from row 1 to 49 to reach row 50 to start reading. Can i somehow achieve this with seek()?
For example:
Seek to row 50
read from 50 to 60
next time:
seek to row 27
read 27 to 34
and so on
So not only seeking continuesly forward through the file but also backwards.
Thank you a lot
An option would be to use Pandas. For example:
import pandas as pd
# Select file
infile = r'path/file'
# Use skiprows to choose starting point and nrows to choose number of rows
data = pd.read_csv(infile, skiprows = 50, nrows=10)
You can use chunksize
import pandas as pd
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
process(chunk)
If the # of columns/line lengths are variable, it isn't possible to find the line you want without "reading" (ie, processing) every character of the file that comes before that, and counting the line terminators. And the fastest way to process them in python, is to use iteration.
As to the fastest way to do that with a large file, I do not know whether it is faster to iterate by line this way:
with open(file_name) as f:
for line,_ in zip(f, range(50)):
pass
lines = [line for line,_ in zip(f, range(10))]
...or to read a character at a time using seek, and count new line characters. But it is certainly MUCH more convenient to do the first.
However if the file gets read a lot, iterating over the lines will be slow over time. If the file contents do not change, you could instead accomplish this by reading the whole thing once and building a dict of the line lengths ahead of time:
from itertools import accumulate
with open(file_name) as f:
cum_lens = dict(enumerate(accumulate(len(line) for line in f), 1))
This would allow you to seek to any line number in the file without processing the whole thing ever again:
def seek_line(path, line_num, cum_lens):
with open(path) as f:
f.seek(cum_lens[line_num], 0)
return f.readline()
class LineX:
"""A file reading object that can quickly obtain any line number."""
def __init__(self, path, cum_lens):
self.cum_lens = cum_lens
self.path = path
def __getitem__(self, i):
return seek_line(self.path, i, self.cum_lens)
linex = LineX(file_name, cum_lens)
line50 = linex[50]
But at this point, you might be better off loading the file contents into some kind of database. It depends on what you're trying to do, and what kind of data the file contains.
As others are saying the most obvious solution is to use pandas read csv !
The method has a parameter called skiprows:
from the doc there is what is said :
skiprows : list-like, int or callable, optional
Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
You can have something like this :
import pandas as pd
data = pd.read_csv('path/to/your/file', skiprows =lambda x: x not in range(50, 60))
Since you specify that the memory is your problem you can use the chunksize parameter as said in this tutorial
he said :
The parameter essentially means the number of rows to be read into a
dataframe at any single time in order to fit into the local memory.
Since the data consists of more than 70 millions of rows, I specified
the chunksize as 1 million rows each time that broke the large data
set into many smaller pieces.
df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)
You can try this and iterate over the chunk to retrieve only the rows you are looking for.
The function should return true if the row number is in the specified list
its that easy:
with open("file.csv", "r") as file:
print(file.readlines()[50:60])

"pygame.error: Out of memory" when loading level with a large area [duplicate]

Is there a limit to memory for python? I've been using a python script to calculate the average values from a file which is a minimum of 150mb big.
Depending on the size of the file I sometimes encounter a MemoryError.
Can more memory be assigned to the python so I don't encounter the error?
EDIT: Code now below
NOTE: The file sizes can vary greatly (up to 20GB) the minimum size of the a file is 150mb
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")
file_write = open ("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")
files = [file_A2_B2,file_A2_B2,file_A1_B2,file_A2_B1]
for u in files:
line = u.readlines()
list_of_lines = []
for i in line:
values = i.split('\t')
list_of_lines.append(values)
count = 0
for j in list_of_lines:
count +=1
for k in range(0,count):
list_of_lines[k].remove('\n')
length = len(list_of_lines[0])
print_counter = 4
for o in range(0,length):
total = 0
for p in range(0,count):
number = float(list_of_lines[p][o])
total = total + number
average = total/count
print average
if print_counter == 4:
file_write.write(str(average)+'\n')
print_counter = 0
print_counter +=1
file_write.write('\n')
(This is my third answer because I misunderstood what your code was doing in my original, and then made a small but crucial mistake in my second—hopefully three's a charm.
Edits: Since this seems to be a popular answer, I've made a few modifications to improve its implementation over the years—most not too major. This is so if folks use it as template, it will provide an even better basis.
As others have pointed out, your MemoryError problem is most likely because you're attempting to read the entire contents of huge files into memory and then, on top of that, effectively doubling the amount of memory needed by creating a list of lists of the string values from each line.
Python's memory limits are determined by how much physical ram and virtual memory disk space your computer and operating system have available. Even if you don't use it all up and your program "works", using it may be impractical because it takes too long.
Anyway, the most obvious way to avoid that is to process each file a single line at a time, which means you have to do the processing incrementally.
To accomplish this, a list of running totals for each of the fields is kept. When that is finished, the average value of each field can be calculated by dividing the corresponding total value by the count of total lines read. Once that is done, these averages can be printed out and some written to one of the output files. I've also made a conscious effort to use very descriptive variable names to try to make it understandable.
try:
from itertools import izip_longest
except ImportError: # Python 3
from itertools import zip_longest as izip_longest
GROUP_SIZE = 4
input_file_names = ["A1_B1_100000.txt", "A2_B2_100000.txt", "A1_B2_100000.txt",
"A2_B1_100000.txt"]
file_write = open("average_generations.txt", 'w')
mutation_average = open("mutation_average", 'w') # left in, but nothing written
for file_name in input_file_names:
with open(file_name, 'r') as input_file:
print('processing file: {}'.format(file_name))
totals = []
for count, fields in enumerate((line.split('\t') for line in input_file), 1):
totals = [sum(values) for values in
izip_longest(totals, map(float, fields), fillvalue=0)]
averages = [total/count for total in totals]
for print_counter, average in enumerate(averages):
print(' {:9.4f}'.format(average))
if print_counter % GROUP_SIZE == 0:
file_write.write(str(average)+'\n')
file_write.write('\n')
file_write.close()
mutation_average.close()
You're reading the entire file into memory (line = u.readlines()) which will fail of course if the file is too large (and you say that some are up to 20 GB), so that's your problem right there.
Better iterate over each line:
for current_line in u:
do_something_with(current_line)
is the recommended approach.
Later in your script, you're doing some very strange things like first counting all the items in a list, then constructing a for loop over the range of that count. Why not iterate over the list directly? What is the purpose of your script? I have the impression that this could be done much easier.
This is one of the advantages of high-level languages like Python (as opposed to C where you do have to do these housekeeping tasks yourself): Allow Python to handle iteration for you, and only collect in memory what you actually need to have in memory at any given time.
Also, as it seems that you're processing TSV files (tabulator-separated values), you should take a look at the csv module which will handle all the splitting, removing of \ns etc. for you.
Python can use all memory available to its environment. My simple "memory test" crashes on ActiveState Python 2.6 after using about
1959167 [MiB]
On jython 2.5 it crashes earlier:
239000 [MiB]
probably I can configure Jython to use more memory (it uses limits from JVM)
Test app:
import sys
sl = []
i = 0
# some magic 1024 - overhead of string object
fill_size = 1024
if sys.version.startswith('2.7'):
fill_size = 1003
if sys.version.startswith('3'):
fill_size = 497
print(fill_size)
MiB = 0
while True:
s = str(i).zfill(fill_size)
sl.append(s)
if i == 0:
try:
sys.stderr.write('size of one string %d\n' % (sys.getsizeof(s)))
except AttributeError:
pass
i += 1
if i % 1024 == 0:
MiB += 1
if MiB % 25 == 0:
sys.stderr.write('%d [MiB]\n' % (MiB))
In your app you read whole file at once. For such big files you should read the line by line.
No, there's no Python-specific limit on the memory usage of a Python application. I regularly work with Python applications that may use several gigabytes of memory. Most likely, your script actually uses more memory than available on the machine you're running on.
In that case, the solution is to rewrite the script to be more memory efficient, or to add more physical memory if the script is already optimized to minimize memory usage.
Edit:
Your script reads the entire contents of your files into memory at once (line = u.readlines()). Since you're processing files up to 20 GB in size, you're going to get memory errors with that approach unless you have huge amounts of memory in your machine.
A better approach would be to read the files one line at a time:
for u in files:
for line in u: # This will iterate over each line in the file
# Read values from the line, do necessary calculations
Not only are you reading the whole of each file into memory, but also you laboriously replicate the information in a table called list_of_lines.
You have a secondary problem: your choices of variable names severely obfuscate what you are doing.
Here is your script rewritten with the readlines() caper removed and with meaningful names:
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")
file_write = open ("average_generations.txt", "w")
mutation_average = open("mutation_average", "w") # not used
files = [file_A2_B2,file_A2_B2,file_A1_B2,file_A2_B1]
for afile in files:
table = []
for aline in afile:
values = aline.split('\t')
values.remove('\n') # why?
table.append(values)
row_count = len(table)
row0length = len(table[0])
print_counter = 4
for column_index in range(row0length):
column_total = 0
for row_index in range(row_count):
number = float(table[row_index][column_index])
column_total = column_total + number
column_average = column_total/row_count
print column_average
if print_counter == 4:
file_write.write(str(column_average)+'\n')
print_counter = 0
print_counter +=1
file_write.write('\n')
It rapidly becomes apparent that (1) you are calculating column averages (2) the obfuscation led some others to think you were calculating row averages.
As you are calculating column averages, no output is required until the end of each file, and the amount of extra memory actually required is proportional to the number of columns.
Here is a revised version of the outer loop code:
for afile in files:
for row_count, aline in enumerate(afile, start=1):
values = aline.split('\t')
values.remove('\n') # why?
fvalues = map(float, values)
if row_count == 1:
row0length = len(fvalues)
column_index_range = range(row0length)
column_totals = fvalues
else:
assert len(fvalues) == row0length
for column_index in column_index_range:
column_totals[column_index] += fvalues[column_index]
print_counter = 4
for column_index in column_index_range:
column_average = column_totals[column_index] / row_count
print column_average
if print_counter == 4:
file_write.write(str(column_average)+'\n')
print_counter = 0
print_counter +=1

What is the optimal way to process a very large (over 30GB) text file and also show progress

[newbie question]
Hi,
I'm working on a huge text file which is well over 30GB.
I have to do some processing on each line and then write it to a db in JSON format. When I read the file and loop using "for" my computer crashes and displays blue screen after about 10% of processing data.
Im currently using this:
f = open(file_path,'r')
for one_line in f.readlines():
do_some_processing(one_line)
f.close()
Also how can I show overall progress of how much data has been crunched so far ?
Thank you all very much.
File handles are iterable, and you should probably use a context manager. Try this:
with open(file_path, 'r') as fh:
for line in fh:
process(line)
That might be enough.
I use a function like this for a similiar problem. You can wrap up any iterable with it.
Change this
for one_line in f.readlines():
You just need to change your code to
# don't use readlines, it creates a big list of all data in memory rather than
# iterating one line at a time.
for one_line in in progress_meter(f, 10000):
You might want to pick a smaller or larger value depending on how much time you want to waste printing status messages.
def progress_meter(iterable, chunksize):
""" Prints progress through iterable at chunksize intervals."""
scan_start = time.time()
since_last = time.time()
for idx, val in enumerate(iterable):
if idx % chunksize == 0 and idx > 0:
print idx
print 'avg rate', idx / (time.time() - scan_start)
print 'inst rate', chunksize / (time.time() - since_last)
since_last = time.time()
print
yield val
Using readline imposes to find the end of each line in your file. If some lines are very long, it might lead your interpreter to crash (not enough memory to buffer the full line).
In order to show progress you can check the file size for example using:
import os
f = open(file_path, 'r')
fsize = os.fstat(f).st_size
The progress of your task can then be the number of bytes processed divided by the file size times 100 to have a percentage.

Most efficient way to concatenate and rearrange files

I am reading from several files, each file is divided into 2 pieces, first a header section of a few thousand lines followed by a body of a few thousand. My problem is I need to concatenate these files into one file where all the headers are on the top followed by the body.
Currently I am using two loops: one to pull out all the headers and write them, and the second to write the body of each file (I also include a tmp_count variable to limit the number of lines to be loading into memory before dumping to file).
This is pretty slow - about 6min for 13gb file. Can anyone tell me how to optimize this or if there is a faster way to do this in python ?
Thanks!
Here is my code:
def cat_files_sam(final_file_name,work_directory_master,file_count):
final_file = open(final_file_name,"w")
if len(file_count) > 1:
file_count=sort_output_files(file_count)
# only for # headers
for bowtie_file in file_count:
#print bowtie_file
tmp_list = []
tmp_count = 0
for line in open(os.path.join(work_directory_master,bowtie_file)):
if line.startswith("#"):
if tmp_count == 1000000:
final_file.writelines(tmp_list)
tmp_list = []
tmp_count = 0
tmp_list.append(line)
tmp_count += 1
else:
final_file.writelines(tmp_list)
break
for bowtie_file in file_count:
#print bowtie_file
tmp_list = []
tmp_count = 0
for line in open(os.path.join(work_directory_master,bowtie_file)):
if line.startswith("#"):
continue
if tmp_count == 1000000:
final_file.writelines(tmp_list)
tmp_list = []
tmp_count = 0
tmp_list.append(line)
tmp_count += 1
final_file.writelines(tmp_list)
final_file.close()
How fast would you expect it to be to move 13Gb of data around? This problem is I/O bound and not a problem with Python. To make it faster, do less I/O. Which means that you are either (a) stuck with the speed you've got or (b) should retool later elements of your toolchain to handle the files in-place rather than requiring one giant 13 Gb file.
You can save the time it takes the 2nd time to skip the headers, as long as you have a reasonable amount of spare disk space: as well as the final file, also open (for 'w+') a temporary file temp_file, and do:
import shutil
hdr_list = []
bod_list = []
dispatch = {True: (hdr_list, final_file),
False: (bod_list, temp_file)}
for bowtie_file in file_count:
with open(os.path.join(work_directory_master,bowtie_file)) as f:
for line in f:
L, fou = dispatch[line[0]=='#']
L.append(f)
if len(L) == 1000000:
fou.writelines(L)
del L[:]
# write final parts, if any
for L, fou in dispatch.items():
if L: fou.writelines(L)
temp_file.seek(0)
shutil.copyfileobj(temp_file, final_file)
This should enhance your program's performance. Fine-tuning that now-hard-coded 1000000, or even completely doing away with the lists and writing each line directly to the appropriate file (final or temporary), are other options you should benchmark (but if you have unbounded amounts of memory, then I expect that they won't matter much -- however, intuitions about performance are often misleading, so it's best to try and measure!-).
There are two gross inefficiencies in the code you meant to write (which is not the code presented):
You are building up huge lists of header lines in the first major for block instead of just writing them out.
You are skipping the headers of the files again in the second major for block line by line when you've already determined where the headers end in (1). See file.seek and file.tell

Categories

Resources