Why is loading this file taking so much memory? - python

Trying to load a file into Python. It's a very big file (1.5 GB), but I have the memory available and I only need to do this once (hence the use of Python; I just need to sort the file one time, so Python was an easy choice).
My issue is that loading this file results in way too much memory usage. When I've loaded about 10% of the lines into memory, Python is already using 700 MB, which is clearly too much. At around 50% the script hangs, using 3.03 GB of real memory (and slowly rising).
I know this isn't the most memory-efficient way of sorting a file, but I just want it to work so I can move on to more important problems :D So, what is wrong with the following Python code that's causing the massive memory usage:
print 'Loading file into memory'
input_file = open(input_file_name, 'r')
input_file.readline() # Toss out the header
lines = []
totalLines = 31164015.0
currentLine = 0.0
printEvery100000 = 0
for line in input_file:
    currentLine += 1.0
    lined = line.split('\t')
    printEvery100000 += 1
    if printEvery100000 == 100000:
        print str(currentLine / totalLines)
        printEvery100000 = 0
    lines.append((lined[timestamp_pos].strip(), lined[personID_pos].strip(), lined[x_pos].strip(), lined[y_pos].strip()))
input_file.close()
print 'Done loading file into memory'
EDIT: In case anyone is unsure, the general consensus seems to be that every small object allocated adds its own per-object overhead, which is what eats the memory. I "fixed" it in this case by 1) calling readlines(), which still loads all the data, but only has one string-object overhead per line. This loads the entire file using about 1.7 GB. Then, when I call lines.sort(), I pass a function to key that splits on tabs and returns the right column value, converted to an int. This is slow computationally, and memory-intensive overall, but it works. Learned a ton about object allocation overhead today :D
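For reference, a minimal sketch of that workaround (the file names and the timestamp_pos column index are placeholders, not from the original script):
input_file = open('input.tsv', 'r')
input_file.readline()           # toss out the header
lines = input_file.readlines()  # one string object per line, no per-column objects
input_file.close()

timestamp_pos = 0  # hypothetical column index

# split lazily inside the key function instead of storing tuples of columns
lines.sort(key=lambda line: int(line.split('\t')[timestamp_pos]))

with open('sorted.tsv', 'w') as out:
    out.writelines(lines)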

Here is a rough estimate of the memory needed, based on the constants derived from your example. At a minimum you have to figure the Python internal object overhead for each split line, plus the overhead for each string.
It estimates 9.1 GB to store the file in memory, assuming the following constants, which are off by a bit, since you're only using part of each line:
1.5 GB file size
31,164,015 total lines
each line split into a list with 4 pieces
Code:
import sys

def sizeof(lst):
    return sys.getsizeof(lst) + sum(sys.getsizeof(v) for v in lst)

GIG = 1024**3
file_size = 1.5 * GIG
lines = 31164015
num_cols = 4
avg_line_len = int(file_size / float(lines))

val = 'a' * (avg_line_len / num_cols)
lst = [val] * num_cols
line_size = sizeof(lst)

print 'avg line size: %d bytes' % line_size
print 'approx. memory needed: %.1f GB' % ((line_size * lines) / float(GIG))
Returns:
avg line size: 312 bytes
approx. memory needed: 9.1 GB

I don't know about the analysis of the memory usage, but you might try this to get it to work without running out of memory. You'll sort into a new file which is accessed through a memory mapping (I've been led to believe this works efficiently in terms of memory). mmap has some OS-specific behaviour; I tested this on Linux (on a very small scale).
This is the basic code. To make it run with decent time efficiency you'd probably want to do a binary search on the sorted file to find where to insert each line; otherwise it will probably take a long time. (A rough sketch of that idea follows the code below.)
You can find a file-seeking binary search algorithm in this question.
Hopefully a memory-efficient way of sorting a massive file by line:
import os
from mmap import mmap

input_file = open('unsorted.txt', 'r')
output_file = open('sorted.txt', 'w+')

# need to provide something in order to be able to mmap the file
# so we'll just copy the first line over
output_file.write(input_file.readline())
output_file.flush()
mm = mmap(output_file.fileno(), os.stat(output_file.name).st_size)
cur_size = mm.size()

for line in input_file:
    mm.seek(0)
    tup = line.split("\t")
    while True:
        cur_loc = mm.tell()
        o_line = mm.readline()
        o_tup = o_line.split("\t")
        if o_line == '' or tup[0] < o_tup[0]:  # EOF or we found our spot
            mm.resize(cur_size + len(line))
            mm[cur_loc + len(line):] = mm[cur_loc:cur_size]
            mm[cur_loc:cur_loc + len(line)] = line
            cur_size += len(line)
            break
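As a rough idea of the binary search mentioned above, something along these lines could replace the linear scan (a sketch only, assuming every line in the mapped region ends with '\n' and that lines are compared by their first tab-separated field, as in the loop above):
def find_insert_offset(mm, size, key):
    # return the byte offset of the first line whose key is >= key
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        start = mm.rfind('\n', 0, mid) + 1   # start of the line containing mid
        end = mm.find('\n', start)
        if end == -1:
            end = size
        line_key = mm[start:end].split('\t')[0]
        if line_key < key:
            lo = end + 1                     # insertion point is after this line
        else:
            hi = start                       # insertion point is at or before this line
    return lo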

Related

How can I handle a file with 161 million lines?

I am trying to handle a big file, "mydata.dat" (3 GB, 161,991,000 lines). The code calculates the distance between pairs of points using DensityPeakCluster; the number of points is 18000.
A sample of the file looks like this:
1 2 26.23
1 3 44.49
1 4 47.17
and so on until
1 18000 23.5
then
2 3 25.2
2 4 15.2
and so on up to 2 18000 0.25, continuing until 17999 18000 0.25
The first block of the code is:
class Graph(defaultdict):
    def __init__(self, input_file, sep=" ", header=False, undirect=True):
        super(Graph, self).__init__(dict)
        self.edges_num = 0
        with open(input_file) as f:
            if header:
                f.readline()
            for line in f:
                line = line.strip().split(sep)
                self[line[0]][line[1]] = float(line[2])
                self.edges_num += 1
                if undirect:
                    self[line[1]][line[0]] = float(line[2])
                    self.edges_num += 1

    def edges(self):
        edges_list = []
        for node1 in self:
            for node2 in self[node1]:
                edges_list.append((node1, node2))
        return edges_list
The second block of the code (the full code is too long to post here):
    def edges_weight(self):
        weight_list = []
        for edge in self.edges():
            node1, node2 = edge
            weight_list.append([node1, node2, self[node1][node2]])
        weight_list = sorted(weight_list, key=lambda x: x[2])
        return weight_list

    def get_weight(self, node1, node2):
        return self[node1][node2]

    def get_weights(self):
        weights = []
        for edge in self.edges():
            weights.append(self.get_weight(edge[0], edge[1]))
        return weights

if __name__ == "__main__":
    input_file = "./data/mydata.dat"
    percent = 2.0
    output_file = "./data/results"
    G = Graph(input_file)
    position = round(G.number_of_edges() * percent / 100)
    dc = G.edges_weight()[position][2]
    print("average percentage of neighbours (hard coded): {}".format(percent))
    print("Computing Rho with gaussian kernel of radius: {}".format(dc))
    nodes = G.nodes()
    for i in range(G.number_of_nodes() - 1):
        for j in range(i + 1, G.number_of_nodes()):
            node_i = nodes[i]
            node_j = nodes[j]
            dist_ij = G.get_weight(node_i, node_j)
What happened to me:
1- The process got killed, so I tried to change the file reading to:
bigfile = open(input_file, 'r')
tmp_lines = bigfile.readlines(1024*1024)
for line in tmp_lines:
    line = line.strip().split(sep)
    self[line[0]][line[1]] = float(line[2])
    self.edges_num += 1
    if undirect:
        self[line[1]][line[0]] = float(line[2])
        self.edges_num += 1
2- but then I got:
    dist_ij = G.get_weight(node_i, node_j)
  ... in get_weight
    return self[node1][node2]
KeyError: '6336'
3- I tried to use Google Colab, but that didn't work either: its 12 GB of RAM wasn't enough for me. I asked about buying more RAM, but the underlying problem remained that I couldn't manage the code well enough to keep the memory use down. I'm stuck on this problem and don't know what I should do.
**1- My problem is how to deal with a file as big as the one I have. What approach should I use to handle this size?
2- If I use NumPy to load the file, will that decrease the memory usage?**
The most straightforward answer is to not load the whole file at once. This can even be done one line at a time. For example, suppose you wanted the sum of the third column:
filename = 'file.dat'
lines = (float(line.split(' ')[2]) for line in open(filename))
print(sum(lines))
Here we did not load all the lines into memory. We instead opened a file and set up a Python generator. The generator holds the expression "float(line.split(' ')[2])" and only evaluates it as each line is requested. The requests are driven by sum(), which pulls one line at a time as needed, never loading more than one line into memory at once. Hence, as that statement executes, we add up the values from the generator while keeping only a running total. The point is that the code uses essentially no RAM beyond the current line (aside from interpreter overhead).
This could also be done a piece at a time. For example, load only the lines whose first or second field is '0':
filename = 'file.dat'
lines = (line.split(' ') for line in open(filename))
zeros = (line for line in lines if line[0]=='0' or line[1]=='0')
print(sum(float(c) for a,b,c in zeros))
This can of course be slower than loading some or all of the file into memory. Moreover, you have to consider how many times you want to iterate over the file like this. It is preferable to iterate over the lines only a few times, gathering all the calculations you want in as few passes as possible (a sketch of one such combined pass follows below). You then probably want to save those answers, because re-iterating over the file takes more time.
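A minimal sketch of gathering several statistics in a single pass, assuming the same three-column layout as above (the particular statistics are just illustrative):
filename = 'file.dat'   # placeholder name

count = 0
total = 0.0
maximum = float('-inf')

with open(filename) as f:
    for line in f:
        value = float(line.split(' ')[2])   # third column, as in the examples above
        count += 1
        total += value
        if value > maximum:
            maximum = value

print(count, total / count, maximum)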
In considering loading the file into memory, you need to double check what exactly you want to load and how. For example, do you want to load the values 1 2 in the line 1 2 26.23? If not, then strip those out to take up less memory. For example
import numpy as np
filename = 'file.dat'
values = (float(line.split(' ')[2]) for line in open(filename))
X = np.fromiter(values,dtype='float32',count=161991000)
By specifying the count we told Python exactly how much memory to allocate in advance (instead of having it re-grow the array every time it needs more room). With a count of that size and a dtype of float32, we know this data will take up about 648 MB in RAM. So be careful not to write any operations that duplicate this data; if you write something that makes 5 copies of it, that will eat up RAM quickly.
I think this gives you an idea of how to manage memory. :-)
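One further sketch, based on an assumption about the file's pair ordering (every pair i < j listed in the order shown in the question): with the distances in the flat array X, a pair's distance can be looked up by index arithmetic instead of a nested dict, which keeps memory at roughly that 648 MB:
N = 18000  # number of points, from the question

def pair_index(i, j, n=N):
    # index of the 1-based pair (i, j), i < j, in the row order of the file
    if i > j:
        i, j = j, i
    return (i - 1) * n - (i - 1) * i // 2 + (j - i - 1)

# e.g. the distance between points 2 and 4 would be X[pair_index(2, 4)]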

"pygame.error: Out of memory" when loading level with a large area [duplicate]

Is there a limit to memory for Python? I've been using a Python script to calculate the average values from a file which is a minimum of 150 MB in size.
Depending on the size of the file I sometimes encounter a MemoryError.
Can more memory be assigned to Python so I don't encounter the error?
EDIT: Code now below
NOTE: The file sizes can vary greatly (up to 20 GB); the minimum size of a file is 150 MB.
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")
file_write = open("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")
files = [file_A2_B2, file_A2_B2, file_A1_B2, file_A2_B1]
for u in files:
    line = u.readlines()
    list_of_lines = []
    for i in line:
        values = i.split('\t')
        list_of_lines.append(values)
    count = 0
    for j in list_of_lines:
        count += 1
    for k in range(0, count):
        list_of_lines[k].remove('\n')
    length = len(list_of_lines[0])
    print_counter = 4
    for o in range(0, length):
        total = 0
        for p in range(0, count):
            number = float(list_of_lines[p][o])
            total = total + number
        average = total / count
        print average
        if print_counter == 4:
            file_write.write(str(average) + '\n')
            print_counter = 0
        print_counter += 1
    file_write.write('\n')
(This is my third answer because I misunderstood what your code was doing in my original, and then made a small but crucial mistake in my second. Hopefully the third time's a charm.)
Edits: Since this seems to be a popular answer, I've made a few modifications over the years to improve its implementation, most of them not too major, so that if folks use it as a template it will provide an even better basis.
As others have pointed out, your MemoryError problem is most likely because you're attempting to read the entire contents of huge files into memory and then, on top of that, effectively doubling the amount of memory needed by creating a list of lists of the string values from each line.
Python's memory limits are determined by how much physical ram and virtual memory disk space your computer and operating system have available. Even if you don't use it all up and your program "works", using it may be impractical because it takes too long.
Anyway, the most obvious way to avoid that is to process each file a single line at a time, which means you have to do the processing incrementally.
To accomplish this, a list of running totals for each of the fields is kept. When that is finished, the average value of each field can be calculated by dividing the corresponding total value by the count of total lines read. Once that is done, these averages can be printed out and some written to one of the output files. I've also made a conscious effort to use very descriptive variable names to try to make it understandable.
try:
    from itertools import izip_longest
except ImportError:  # Python 3
    from itertools import zip_longest as izip_longest

GROUP_SIZE = 4
input_file_names = ["A1_B1_100000.txt", "A2_B2_100000.txt", "A1_B2_100000.txt",
                    "A2_B1_100000.txt"]
file_write = open("average_generations.txt", 'w')
mutation_average = open("mutation_average", 'w')  # left in, but nothing written

for file_name in input_file_names:
    with open(file_name, 'r') as input_file:
        print('processing file: {}'.format(file_name))
        totals = []
        for count, fields in enumerate((line.split('\t') for line in input_file), 1):
            totals = [sum(values) for values in
                      izip_longest(totals, map(float, fields), fillvalue=0)]
        averages = [total/count for total in totals]
        for print_counter, average in enumerate(averages):
            print('    {:9.4f}'.format(average))
            if print_counter % GROUP_SIZE == 0:
                file_write.write(str(average)+'\n')
    file_write.write('\n')

file_write.close()
mutation_average.close()
You're reading the entire file into memory (line = u.readlines()) which will fail of course if the file is too large (and you say that some are up to 20 GB), so that's your problem right there.
Better iterate over each line:
for current_line in u:
    do_something_with(current_line)
is the recommended approach.
Later in your script, you're doing some very strange things like first counting all the items in a list, then constructing a for loop over the range of that count. Why not iterate over the list directly? What is the purpose of your script? I have the impression that this could be done much easier.
This is one of the advantages of high-level languages like Python (as opposed to C where you do have to do these housekeeping tasks yourself): Allow Python to handle iteration for you, and only collect in memory what you actually need to have in memory at any given time.
Also, as it seems that you're processing TSV files (tabulator-separated values), you should take a look at the csv module which will handle all the splitting, removing of \ns etc. for you.
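A minimal sketch of that csv suggestion (the file name is a placeholder; under Python 2 the csv module wants the file opened in binary mode):
import csv

with open("A1_B1_100000.txt", "rb") as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        values = [float(field) for field in row if field != '']
        # ... update running totals here instead of keeping every row ...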
Python can use all memory available to its environment. My simple "memory test" crashes on ActiveState Python 2.6 after using about
1959167 [MiB]
On jython 2.5 it crashes earlier:
239000 [MiB]
Probably I could configure Jython to use more memory (it uses the limits from the JVM).
Test app:
import sys

sl = []
i = 0
# some magic 1024 - overhead of string object
fill_size = 1024
if sys.version.startswith('2.7'):
    fill_size = 1003
if sys.version.startswith('3'):
    fill_size = 497
print(fill_size)
MiB = 0
while True:
    s = str(i).zfill(fill_size)
    sl.append(s)
    if i == 0:
        try:
            sys.stderr.write('size of one string %d\n' % (sys.getsizeof(s)))
        except AttributeError:
            pass
    i += 1
    if i % 1024 == 0:
        MiB += 1
        if MiB % 25 == 0:
            sys.stderr.write('%d [MiB]\n' % (MiB))
In your app you read the whole file at once. For such big files you should read it line by line.
No, there's no Python-specific limit on the memory usage of a Python application. I regularly work with Python applications that may use several gigabytes of memory. Most likely, your script actually uses more memory than available on the machine you're running on.
In that case, the solution is to rewrite the script to be more memory efficient, or to add more physical memory if the script is already optimized to minimize memory usage.
Edit:
Your script reads the entire contents of your files into memory at once (line = u.readlines()). Since you're processing files up to 20 GB in size, you're going to get memory errors with that approach unless you have huge amounts of memory in your machine.
A better approach would be to read the files one line at a time:
for u in files:
    for line in u:  # This will iterate over each line in the file
        # Read values from the line, do necessary calculations
Not only are you reading the whole of each file into memory, but also you laboriously replicate the information in a table called list_of_lines.
You have a secondary problem: your choices of variable names severely obfuscate what you are doing.
Here is your script rewritten with the readlines() caper removed and with meaningful names:
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")
file_write = open("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")  # not used
files = [file_A2_B2, file_A2_B2, file_A1_B2, file_A2_B1]

for afile in files:
    table = []
    for aline in afile:
        values = aline.split('\t')
        values.remove('\n')  # why?
        table.append(values)

    row_count = len(table)
    row0length = len(table[0])
    print_counter = 4
    for column_index in range(row0length):
        column_total = 0
        for row_index in range(row_count):
            number = float(table[row_index][column_index])
            column_total = column_total + number
        column_average = column_total / row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average) + '\n')
            print_counter = 0
        print_counter += 1
    file_write.write('\n')
It rapidly becomes apparent that (1) you are calculating column averages (2) the obfuscation led some others to think you were calculating row averages.
As you are calculating column averages, no output is required until the end of each file, and the amount of extra memory actually required is proportional to the number of columns.
Here is a revised version of the outer loop code:
for afile in files:
    for row_count, aline in enumerate(afile, start=1):
        values = aline.split('\t')
        values.remove('\n')  # why?
        fvalues = map(float, values)
        if row_count == 1:
            row0length = len(fvalues)
            column_index_range = range(row0length)
            column_totals = fvalues
        else:
            assert len(fvalues) == row0length
            for column_index in column_index_range:
                column_totals[column_index] += fvalues[column_index]
    print_counter = 4
    for column_index in column_index_range:
        column_average = column_totals[column_index] / row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average) + '\n')
            print_counter = 0
        print_counter += 1

Python: numpy.corrcoef Memory Error

I was trying to calculate the correlation for a large set of data read from a text file. For an extremely large data set the program gives a memory error. Can anyone please tell me how to correct this problem? Thanks
The following is my code:
import numpy
from numpy import *
from array import *
from decimal import *
import sys

Threshold = 0.8
TopMostData = 10

FileName = sys.argv[1]
File = open(FileName, 'r')

SignalData = numpy.empty((1, 128))
SignalData[:][:] = 0

for line in File:
    TempLine = line.split()
    TempInt = [float(i) for i in TempLine]
    SignalData = vstack((SignalData, TempInt))
    del TempLine
    del TempInt

File.close()

TempData = SignalData
SignalData = SignalData[1:, :]
SignalData = SignalData[:, 65:128]
print "File Read | Data Stored" + " | Total Lines: " + str(len(SignalData))

CorrelationData = numpy.corrcoef(SignalData)
The following is the error:
Traceback (most recent call last):
  File "Corelation.py", line 36, in <module>
    CorrelationData = numpy.corrcoef(SignalData)
  File "/usr/lib/python2.7/dist-packages/numpy/lib/function_base.py", line 1824, in corrcoef
    return c/sqrt(multiply.outer(d, d))
MemoryError
You run out of memory as the comments show. If that happens because you are using 32-bit Python, even the method below will fail. But for the 64-bit Python and not-so-much-RAM situation we can do a lot as calculating the correlations is easily done piecewise, as you only need two lines in the memory simultaneously.
So, you may split your input into, say, 1000 row chunks, and then the resulting 1000 x 1000 matrices are easy to keep in memory. Then you can assemble your result into the big output matrix which is not necessarily in the RAM. I recommend this approach even if you have a lot of RAM, because this is much more memory-friendly. Correlation coefficient calculation is not an operation where fast random accesses would help a lot if the input can be kept in RAM.
Unfortunately, the numpy.corrcoef does not do this automatically, and we'll have to roll our own correlation coefficient calculation. Fortunately, that is not as hard as it sounds.
Something along these lines:
import numpy as np

# number of rows in one chunk
SPLITROWS = 1000

# the big table, which is usually bigger
bigdata = np.random.random((27000, 128))

numrows = bigdata.shape[0]

# subtract means from the input data
bigdata -= np.mean(bigdata, axis=1)[:, None]

# normalize the data
bigdata /= np.sqrt(np.sum(bigdata*bigdata, axis=1))[:, None]

# reserve the resulting table onto HDD
res = np.memmap("/tmp/mydata.dat", 'float64', mode='w+', shape=(numrows, numrows))

for r in range(0, numrows, SPLITROWS):
    for c in range(0, numrows, SPLITROWS):
        r1 = r + SPLITROWS
        c1 = c + SPLITROWS
        chunk1 = bigdata[r:r1]
        chunk2 = bigdata[c:c1]
        res[r:r1, c:c1] = np.dot(chunk1, chunk2.T)
Some notes:
the code above was tested against np.corrcoef(bigdata)
if you have complex values, you'll need to create a complex output array res and take the complex conjugate of chunk2.T
the code modifies bigdata in place to maintain performance and minimize memory use; if you need to preserve it, make a copy
The above code takes about 85 s to run on my machine, but the data mostly fits in RAM, and I have an SSD. The loops are ordered to avoid overly random access to the HDD, i.e. the access is reasonably sequential. In comparison, the standard non-memmapped version is not significantly faster even if you have a lot of memory. (Actually, it took a lot more time in my case, but I suspect I ran out of my 16 GiB and then there was a lot of swapping going on.)
You can make the actual calculations faster by omitting half of the matrix, because res.T == res. In practice, you can omit all blocks where c > r and then mirror them later on. On the other hand, the performance is most likely limited by the HDD performance, so other optimizations do not necessarily bring much more speed.
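A rough sketch of that symmetry trick (an untested variation on the chunked loop above): compute only the blocks with c >= r and mirror them into the lower triangle.
for r in range(0, numrows, SPLITROWS):
    for c in range(r, numrows, SPLITROWS):
        r1 = min(r + SPLITROWS, numrows)
        c1 = min(c + SPLITROWS, numrows)
        block = np.dot(bigdata[r:r1], bigdata[c:c1].T)
        res[r:r1, c:c1] = block
        if r != c:
            res[c:c1, r:r1] = block.T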
Of course, this approach is easy to make parallel, as the chunk calculations are completely independent. Also the memmapped array can be shared between threads rather easily.

What is the optimal way to process a very large (over 30GB) text file and also show progress

[newbie question]
Hi,
I'm working on a huge text file which is well over 30GB.
I have to do some processing on each line and then write it to a db in JSON format. When I read the file and loop using "for" my computer crashes and displays blue screen after about 10% of processing data.
I'm currently using this:
f = open(file_path, 'r')
for one_line in f.readlines():
    do_some_processing(one_line)
f.close()
Also how can I show overall progress of how much data has been crunched so far ?
Thank you all very much.
File handles are iterable, and you should probably use a context manager. Try this:
with open(file_path, 'r') as fh:
    for line in fh:
        process(line)
That might be enough.
I use a function like this for a similar problem. You can wrap any iterable with it.
Change this:
for one_line in f.readlines():
to:
# don't use readlines, it creates a big list of all data in memory rather than
# iterating one line at a time.
for one_line in progress_meter(f, 10000):
You might want to pick a smaller or larger value depending on how much time you want to waste printing status messages.
import time

def progress_meter(iterable, chunksize):
    """Prints progress through iterable at chunksize intervals."""
    scan_start = time.time()
    since_last = time.time()
    for idx, val in enumerate(iterable):
        if idx % chunksize == 0 and idx > 0:
            print idx
            print 'avg rate', idx / (time.time() - scan_start)
            print 'inst rate', chunksize / (time.time() - since_last)
            since_last = time.time()
            print
        yield val
Using readlines() requires finding the end of every line in the file up front. If some lines are very long, it can even lead your interpreter to crash (not enough memory to buffer a full line).
In order to show progress you can check the file size, for example with:
import os
f = open(file_path, 'r')
fsize = os.fstat(f.fileno()).st_size
The progress of your task can then be the number of bytes processed divided by the file size times 100 to have a percentage.
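A rough sketch combining the two ideas (the file name and do_some_processing are placeholders carried over from the question):
import os

file_path = 'huge_file.txt'
total_bytes = os.path.getsize(file_path)
bytes_read = 0

with open(file_path, 'r') as f:
    for line_number, line in enumerate(f, 1):
        bytes_read += len(line)
        do_some_processing(line)
        if line_number % 100000 == 0:
            print '%.1f%% done' % (100.0 * bytes_read / total_bytes)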

Most efficient way to concatenate and rearrange files

I am reading from several files, each file is divided into 2 pieces, first a header section of a few thousand lines followed by a body of a few thousand. My problem is I need to concatenate these files into one file where all the headers are on the top followed by the body.
Currently I am using two loops: one to pull out all the headers and write them, and the second to write the body of each file (I also include a tmp_count variable to limit the number of lines to be loading into memory before dumping to file).
This is pretty slow: about 6 minutes for a 13 GB file. Can anyone tell me how to optimize this, or if there is a faster way to do this in Python?
Thanks!
Here is my code:
def cat_files_sam(final_file_name, work_directory_master, file_count):
    final_file = open(final_file_name, "w")
    if len(file_count) > 1:
        file_count = sort_output_files(file_count)
    # only for # headers
    for bowtie_file in file_count:
        #print bowtie_file
        tmp_list = []
        tmp_count = 0
        for line in open(os.path.join(work_directory_master, bowtie_file)):
            if line.startswith("#"):
                if tmp_count == 1000000:
                    final_file.writelines(tmp_list)
                    tmp_list = []
                    tmp_count = 0
                tmp_list.append(line)
                tmp_count += 1
            else:
                final_file.writelines(tmp_list)
                break
    for bowtie_file in file_count:
        #print bowtie_file
        tmp_list = []
        tmp_count = 0
        for line in open(os.path.join(work_directory_master, bowtie_file)):
            if line.startswith("#"):
                continue
            if tmp_count == 1000000:
                final_file.writelines(tmp_list)
                tmp_list = []
                tmp_count = 0
            tmp_list.append(line)
            tmp_count += 1
        final_file.writelines(tmp_list)
    final_file.close()
How fast would you expect it to be to move 13 GB of data around? This problem is I/O bound, not a Python problem. To make it faster, do less I/O. Which means that you are either (a) stuck with the speed you've got, or (b) should retool later elements of your toolchain to handle the files in place rather than requiring one giant 13 GB file.
You can save the time it takes the 2nd time to skip the headers, as long as you have a reasonable amount of spare disk space: as well as the final file, also open (for 'w+') a temporary file temp_file, and do:
import shutil

hdr_list = []
bod_list = []
dispatch = {True: (hdr_list, final_file),
            False: (bod_list, temp_file)}

for bowtie_file in file_count:
    with open(os.path.join(work_directory_master, bowtie_file)) as f:
        for line in f:
            L, fou = dispatch[line[0] == '#']
            L.append(line)
            if len(L) == 1000000:
                fou.writelines(L)
                del L[:]

# write final parts, if any
for L, fou in dispatch.values():
    if L: fou.writelines(L)

temp_file.seek(0)
shutil.copyfileobj(temp_file, final_file)
This should enhance your program's performance. Fine-tuning that now-hard-coded 1000000, or even completely doing away with the lists and writing each line directly to the appropriate file (final or temporary), are other options you should benchmark (but if you have unbounded amounts of memory, then I expect that they won't matter much -- however, intuitions about performance are often misleading, so it's best to try and measure!-).
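For comparison, here is an untested sketch of the "write each line directly" option, reusing the same final_file / temp_file setup as above and letting stdio buffering do the batching:
dispatch = {True: final_file, False: temp_file}

for bowtie_file in file_count:
    with open(os.path.join(work_directory_master, bowtie_file)) as f:
        for line in f:
            dispatch[line[0] == '#'].write(line)

temp_file.seek(0)
shutil.copyfileobj(temp_file, final_file)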
There are two gross inefficiencies in the code you meant to write (which is not the code presented):
You are building up huge lists of header lines in the first major for block instead of just writing them out.
You are skipping the headers of the files again in the second major for block line by line when you've already determined where the headers end in (1). See file.seek and file.tell
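An untested sketch of that seek/tell idea: record where the header section of each file ends during the first pass, then jump straight past it on the second pass (final_file, file_count and work_directory_master are the names from the original function):
import os
import shutil

header_end = {}

# first pass: copy headers, remembering where they stop in each file
for bowtie_file in file_count:
    path = os.path.join(work_directory_master, bowtie_file)
    with open(path) as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line or not line.startswith("#"):
                break
            final_file.write(line)
        header_end[bowtie_file] = pos

# second pass: seek past the headers and copy the bodies wholesale
for bowtie_file in file_count:
    path = os.path.join(work_directory_master, bowtie_file)
    with open(path) as f:
        f.seek(header_end[bowtie_file])
        shutil.copyfileobj(f, final_file)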
