I'm running 64-bit Python 3 on Linux, and I have a code that generates lists with about 20,000 elements. A memory error occurred when my code tried to write a list of ~20,000 2D arrays to a binary file via the pickle module, but it generated all of these arrays and appended them to this list without a problem. I know this must take up a lot of memory, but the machine I'm using has about 100GB available (from the command free -m). The line with the error:
with open('all_data.data', 'wb') as f:
    pickle.dump(data, f)
>>> MemoryError
where data is my list of ~20,000 numpy arrays. Also, previously I was trying to run this code with about 55,000 elements, but while it was 40% of the way through with appending all the arrays to the data list, it just output Killed by itself. So now I'm trying to break it into segments, but this time I get a MemoryError. How can I bypass this? I was also informed that I have access to multiple CPUs, but I have no idea how to take advantage of these (I don't yet understand multiprocessing).
Pickle has to walk through all of your data, and it likely builds intermediate representations of it before everything is written to disk. So if your data already uses about half of your available memory, the dump will blow up.
Since your data is already on a list, an easy workaround there is to pickle each array, and store it, instead of trying to serialize the 20000 arrays in a single go:
with open('all_data.data', 'wb') as f:
    for item in data:
        pickle.dump(item, f)
Then, to read it back, just keep unpickling objects from your file and appending them to a list, until the file is exhausted:
data = []
with open('all_data.data', 'rb') as f:
    while True:
        try:
            data.append(pickle.load(f))
        except EOFError:
            break
This works because unpickling from a file is quite well behaved: the file pointer stays exactly at the point where a pickled object stored in the file ends, so further reads start at the beginning of the next object.
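A minimal round-trip sketch of the approach above, using small lists as a stand-in for the NumPy arrays. The helper names dump_items and load_items and the temporary file location are made up for illustration:

```python
import os
import pickle
import tempfile

def dump_items(items, path):
    """Write each item as its own pickle, back to back, in one file."""
    with open(path, "wb") as f:
        for item in items:
            pickle.dump(item, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_items(path):
    """Yield the pickled objects one at a time until the file is exhausted."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

path = os.path.join(tempfile.mkdtemp(), "all_data.data")
data = [[i, i + 1] for i in range(5)]   # stand-in for the list of arrays
dump_items(data, path)
restored = list(load_items(path))
```

Because load_items is a generator, you can also process the objects one by one instead of collecting them all into a list again.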
I have used Python for years. I have used pickle extensively. I cannot figure out what this is doing:
with codecs.open("huge_picklefile.pc", "rb") as f:
    data = pickle.load(f)
    print(len(data))
    data = pickle.load(f)
    print(len(data))
    data = pickle.load(f)
    print(len(data))
This returns to me:
335
59
12
I am beyond confused. I am used to pickle loading the massive file into memory. The object itself is a massive array of arrays (I assume). Could it be comprised of multiple pickle objects? Unfortunately, I didn't create the pickle object and I don't have access to who did.
I cannot figure out why pickle is splitting up my file into chunks, which isn't the default, and I am not telling it to. What does reloading the same file do? I honestly never tried or even came across a use case until now.
I spent a good 5 hours trying to figure out how to even ask this question on Google. Unsurprisingly, trying "multiple pickle loads on the same document" doesn't yield anything too useful. The Python 3.7 pickle docs do not describe this behavior. I can't figure out how repeatedly loading a pickle document doesn't (a) crash or (b) load the entire thing into memory and then just reference itself. In my 15 years of using Python I have never run into this problem... so I am taking a leap of faith that this is just weird and we should probably just use a database instead.
This file is not quite a pickle file. Someone has dumped multiple pickles into the same file, resulting in the file contents being a concatenation of multiple pickles. When you call pickle.load(f), pickle will read the file from the current file position until it finds a pickle end, so each pickle.load call will load the next pickle.
You can create such a file yourself by calling pickle.dump repeatedly:
with open('demofile', 'wb') as f:
    pickle.dump([1, 2, 3], f)
    pickle.dump([10, 20], f)
    pickle.dump([0, 0, 0], f)
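To see the "reads from the current position until the pickle ends" behaviour directly, here is a small sketch that records the stream position after every load. It uses an in-memory io.BytesIO buffer as a stand-in for the file on disk, which is an assumption made only to keep the example self-contained:

```python
import io
import pickle

buf = io.BytesIO()   # in-memory stand-in for the file on disk
for obj in ([1, 2, 3], [10, 20], [0, 0, 0]):
    pickle.dump(obj, buf)

total_size = len(buf.getvalue())
buf.seek(0)

loaded = []
offsets = []
while buf.tell() < total_size:
    loaded.append(pickle.load(buf))   # reads exactly one pickle
    offsets.append(buf.tell())        # position just past that pickle
```

Each entry in offsets is larger than the previous one, and the last one equals the total size: every load consumes one pickle and leaves the position at the start of the next.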
I know the title probably isn't the best way of wording my question, so please feel free to change it if you can come up with something better!
I have a large list of strings (somewhere between 50k-100k) that I would like to iterate through, and within each iteration, get some information about the file, then write the item with its information to a file.
My initial implementation had a second list, and each iteration would append a dict with the item and its information. Then, once the list had been iterated through, the second list (of dicts) would be written to a json file. However, I did not consider the fact that all of this would be stored in memory and due to the size of the list, there was the possibility that it would run out of memory before finishing, and I'd have to restart.
Original implementation:
results = []
for f in long_list:
    results.append({"item": f, "otherdata": some_function(f)})
    print(results[-1])
with open("results.json", "w") as fp:
    json.dump(results, fp)
So my second implementation instead writes to a file (I'm fine if it's not JSON, I can convert it later) every iteration, so running out of memory won't be a problem (unless it will, please correct me if I'm wrong). But with this implementation, I'm not sure if the with open("file_name.txt", "a") as f should go in the for loop or outside the for loop.
So here are my questions:
1. Will my script run out of memory or is the list too small for that to happen (32GB of RAM on the computer I'm using)?
2. If running out of memory is not going to be a problem, should I stick with the first implementation or should I still opt for the second implementation (write to file each iteration)?
3. If I should opt for the second implementation, should I put the with open inside the loop or outside? What would be more efficient and less likely to cause issues?
4. Is there a better way to do this? Using CSV files instead?
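For reference, one way the second implementation could look, as a sketch rather than a definitive answer: the file is opened once, outside the loop, and each result is written as one JSON object per line, so only one result is held in memory at a time. The some_function body, the sample list and the results.jsonl name are hypothetical placeholders:

```python
import json
import os
import tempfile

def some_function(item):
    # Hypothetical stand-in for the real per-item lookup.
    return len(item)

def write_results(items, path):
    """Append one JSON object per line; only one result is in memory at a time."""
    with open(path, "a", encoding="utf-8") as fp:   # opened once, outside the loop
        for item in items:
            record = {"item": item, "otherdata": some_function(item)}
            fp.write(json.dumps(record) + "\n")

def read_results(path):
    """Read the records back lazily, one line at a time."""
    with open(path, encoding="utf-8") as fp:
        for line in fp:
            yield json.loads(line)

long_list = ["alpha", "beta", "gamma"]
out_path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
write_results(long_list, out_path)
records = list(read_results(out_path))
```

If the script is interrupted, the lines written so far are still valid on their own, which is not the case for a single large JSON array.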
So I have some fairly gigantic .gz files - we're talking 10 to 20 gb each when decompressed.
I need to loop through each line of them, so I'm using the standard:
import gzip
f = gzip.open(path+myFile, 'r')
for line in f.readlines():
    #(yadda yadda)
f.close()
However, both the open() and close() commands take AGES, using up 98% of the memory+CPU. So much so that the program exits and prints Killed to the terminal. Maybe it is loading the entire extracted file into memory?
I'm now using something like:
from subprocess import call
f = open(path+'myfile.txt', 'w')
call(['gunzip', '-c', path+myfile], stdout=f)
#do some looping through the file
f.close()
#then delete extracted file
This works. But is there a cleaner way?
I'm 99% sure that your problem is not in the gzip.open(), but in the readlines().
As the documentation explains:
f.readlines() returns a list containing all the lines of data in the file.
Obviously, that requires reading and decompressing the entire file, and building up an absolutely gigantic list.
Most likely, it's actually the malloc calls to allocate all that memory that are taking forever. And then, at the end of this scope (assuming you're using CPython), it has to GC that whole gigantic list, which will also take forever.
You almost never want to use readlines. Unless you're using a very old Python, just do this:
for line in f:
A file is an iterable full of lines, just like the list returned by readlines—except that it's not actually a list, it generates more lines on the fly by reading out of a buffer. So, at any given time, you'll only have one line and a couple of buffers on the order of 10MB each, instead of a 25GB list. And the reading and decompressing will be spread out over the lifetime of the loop, instead of done all at once.
From a quick test, with a 3.5GB gzip file, gzip.open() is effectively instant, for line in f: pass takes a few seconds, and f.close() is effectively instant. But if I do for line in f.readlines(): pass, it takes… well, I'm not sure how long, because after about a minute my system went into swap thrashing hell and I had to force-kill the interpreter to get it to respond to anything…
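A minimal sketch of the line-by-line approach, assuming Python 3 and a text file compressed with gzip. The count_lines helper and the small demo file are made up for illustration:

```python
import gzip
import os
import tempfile

def count_lines(path):
    """Iterate a gzip file line by line without building a list of all lines."""
    n = 0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:          # decompressed lazily, a buffer at a time
            n += 1
    return n

# Tiny demo file so the sketch runs on its own.
demo = os.path.join(tempfile.mkdtemp(), "demo.txt.gz")
with gzip.open(demo, "wt", encoding="utf-8") as f:
    f.writelines(f"line {i}\n" for i in range(1000))

line_count = count_lines(demo)
```

The with statement also takes care of closing the file, so there is no separate close() call to forget.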
Since this has come up a dozen more times since this answer, I wrote this blog post which explains a bit more.
Have a look at pandas, in particular IO tools. They support gzip compression when reading files and you can read files in chunks. Besides, pandas is very fast and memory efficient.
As I have never tried it, I don't know how well compression and reading in chunks work together, but it might be worth a try.
I'm trying to find out the best way to read/process lines for super large file.
Here I just try
for line in f:
Part of my script is as below:
o=gzip.open(file2,'w')
LIST=[]
f=gzip.open(file1,'r')
for i,line in enumerate(f):
    if i%4!=3:
        LIST.append(line)
    else:
        LIST.append(line)
        b1=[ord(x) for x in line]
        ave1=(sum(b1)-10)/float(len(line)-1)
        if (ave1 < 84):
            del LIST[-4:]
output1=o.writelines(LIST)
My file1 is around 10GB, and when I run the script, the memory usage just keeps increasing to something like 15GB without any output. That means the computer is still trying to read the whole file into memory first, right? This really is no different from using readlines().
However in the post:
Different ways to read large data in python
Srika told me:
The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.
But obviously I still need to worry about large files. I'm really confused.
thx
edit:
Every 4 lines form a group in my data.
The purpose is to do some calculations on every 4th line and, based on that calculation, decide whether we need to append those 4 lines. So writing lines is my purpose.
The reason the memory keeps increasing even after you use enumerate is that you are using LIST.append(line). That accumulates all the lines of the file in a list, and obviously it's all sitting in memory. You need to find a way to not accumulate lines like this: read, process, and move on to the next.
Another option is to read your file in chunks (reading one line at a time qualifies, with 1 chunk == 1 line), i.e. read a small part of the file, process it, then read the next chunk, and so on. I still maintain that this is the best way to read files in Python, large or small.
with open(...) as f:
    for line in f:
        <do something with line>
The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.
It looks like at the end of this function, you're taking all of the lines you've read into memory, and then immediately writing them to a file. Maybe you can try this process:
1. Read the lines you need into memory (the first 3 lines).
2. On the 4th line, append the line & perform your calculation.
3. If your calculation is what you're looking for, flush the values in your collection to the file.
4. Regardless of what follows, create a new collection instance.
I haven't tried this out, but it could maybe look something like this:
o=gzip.open(file2,'w')
f=gzip.open(file1,'r')
LIST=[]
for i,line in enumerate(f):
    if i % 4 != 3:
        LIST.append(line)
    else:
        LIST.append(line)
        b1 = [ord(x) for x in line]
        ave1 = (sum(b1) - 10) / float(len(line) - 1)
        # If we've found what we want, save them to the file
        if (ave1 >= 84):
            o.writelines(LIST)
        # Release the values in the list by starting a clean list to work with
        LIST = []
EDIT: As a thought though, since your file is so large, this may not be the best technique because of all the lines you would have to write to file, but it may be worth investigating regardless.
Since you add all the lines to the list LIST and only sometimes remove some lines from it, LIST will become longer and longer. All those lines that you store in LIST will take up memory. Don't keep all the lines around in a list if you don't want them to take up memory.
Also your script doesn't seem to produce any output anywhere, so the point of it all isn't very clear.
Ok, you know what your problem is already from the other comments/answers, but let me simply state it.
You are only reading a single line at a time into memory, but you are storing a significant portion of these in memory by appending to a list.
In order to avoid this you need to store something in the filesystem or a database (on the disk) for later look up if your algorithm is complicated enough.
From what I see it seems you can easily write the output incrementally. ie. You are currently using a list to store valid lines to write to output as well as temporary lines you may delete at some point. To be efficient with memory you want to write the lines from your temporary list as soon as you know these are valid output.
In summary, use your list to store only temporary data you need to do your calculations based off of, and once you have some valid data ready for output you can simply write it to disk and delete it from your main memory (in python this would mean you should no longer have any references to it.)
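A sketch of that idea for the 4-line groups in the question, assuming Python 3 and text-mode gzip files. The function name filter_groups, the sample records and the temporary file paths are hypothetical; the threshold of 84 comes from the question, and the average is taken over the 4th line without its trailing newline, which is what the question's "subtract 10, divide by len - 1" appears to be doing:

```python
import gzip
import os
import tempfile
from itertools import islice

def filter_groups(src_path, dst_path, threshold=84, group_size=4):
    """Copy only those 4-line groups whose last line has a high enough
    average character code. At most one group is held in memory."""
    kept = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        while True:
            group = list(islice(src, group_size))   # next 4 lines, or fewer at EOF
            if len(group) < group_size:
                break
            last = group[-1].rstrip("\n")
            if last and sum(map(ord, last)) / len(last) >= threshold:
                dst.writelines(group)                # written out immediately
                kept += 1
    return kept

# Tiny demo input: the first group passes ('h' is 104), the second fails ('5' is 53).
workdir = tempfile.mkdtemp()
src_file = os.path.join(workdir, "in.gz")
dst_file = os.path.join(workdir, "out.gz")
with gzip.open(src_file, "wt", encoding="utf-8") as f:
    f.write("id1\nAAAA\n+\nhhhh\nid2\nCCCC\n+\n5555\n")

kept_groups = filter_groups(src_file, dst_file)
with gzip.open(dst_file, "rt", encoding="utf-8") as f:
    written = f.read()
```

Nothing accumulates across groups here, so memory use should stay flat regardless of the input size.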
If you do not use the with statement, you must close the file handles yourself:
o.close()
f.close()
I have a number of large (~100 Mb) files which I'm regularly processing. While I'm trying to delete unneeded data structures during processing, memory consumption is a bit too high. I was wondering if there is a way to efficiently manipulate large data, e.g.:
def read(self, filename):
    fc = read_100_mb_file(filename)
    self.process(fc)
def process(self, content):
    # do some processing of file content
Is there a duplication of data structures? Isn't it more memory efficient to use a class-wide attribute like self.fc?
When should I use garbage collection? I know about the gc module, but do I call it after I del fc for example?
update
P.S. 100 Mb is not a problem in itself, but float conversion and further processing add significantly more to both working set and virtual size (I'm on Windows).
I'd suggest looking at the presentation by David Beazley on using generators in Python. This technique allows you to handle a lot of data, and do complex processing, quickly and without blowing up your memory use. IMO, the trick isn't holding a huge amount of data in memory as efficiently as possible; the trick is avoiding loading a huge amount of data into memory at the same time.
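As a rough illustration of the generator style (not taken from the presentation itself), here is a small pipeline in which each stage pulls one item at a time from the previous one. The stage names, the sample data and the file name are invented for the example:

```python
import os
import tempfile

def read_lines(path):
    """Yield lines one at a time; the file is never fully loaded."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line

def parse_floats(lines):
    """Turn each whitespace-separated field into a float, lazily."""
    for line in lines:
        for field in line.split():
            yield float(field)

def keep_positive(values):
    """Drop values that are not needed further down the pipeline."""
    for v in values:
        if v > 0:
            yield v

# Tiny demo file so the sketch runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "values.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("1.5 -2.0\n3.0\n-1.0 4.5\n")

# Only one line and one float exist at any moment while this runs.
total = sum(keep_positive(parse_floats(read_lines(demo_path))))
```

The float conversion happens per value as it flows through, so the converted data never exists in memory as one large structure.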
Before you start tearing your hair out over the garbage collector, you might be able to avoid that 100mb hit of loading the entire file into memory by using a memory-mapped file object. See the mmap module.
Don't read the entire 100 meg file in at a time. Use streams to process a little bit at a time. Check out this blog post that talks about handling large csv and xml files. http://lethain.com/entry/2009/jan/22/handling-very-large-csv-and-xml-files-in-python/
Here is a sample of the code from the article.
from __future__ import with_statement # for python 2.5
with open('data.in','r') as fin:
    with open('data.out','w') as fout:
        for line in fin:
            fout.write(','.join(line.split(' ')))
So, from your comments I assume that your file looks something like this:
item1,item2,item3,item4,item5,item6,item7,...,itemn
all of which you reduce to a single value by repeated application of some combination function. As a solution, only read a single value at a time:
def read_values(f):
    buf = []
    while True:
        c = f.read(1)
        if c == ",":
            yield parse("".join(buf))
            buf = []
        elif c == "":
            yield parse("".join(buf))
            return
        else:
            buf.append(c)
with open("some_file", "r") as f:
    agg = initial
    for v in read_values(f):
        agg = combine(agg, v)
This way, memory consumption stays constant, unless agg grows over time.
You will need to provide appropriate implementations of initial, parse and combine.
For real use, don't read the file byte-by-byte; read into a fixed-size buffer, parse from the buffer, and read more as you need it.
This is basically what the reduce function does (a builtin in Python 2, functools.reduce in Python 3), but I've used an explicit for loop here for clarity. Here's the same thing using reduce:
from functools import reduce  # needed on Python 3
with open("some_file", "r") as f:
    agg = reduce(combine, read_values(f), initial)
I hope I interpreted your problem correctly.
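The buffered variant mentioned above (read a fixed-size block, parse from it, read more as needed) could be sketched roughly like this. The name read_values_buffered, the default parse=float and the 64 KiB block size are assumptions for the example, and unlike the byte-by-byte version it skips an empty trailing value instead of passing it to parse:

```python
import io

def read_values_buffered(f, parse=float, sep=",", bufsize=64 * 1024):
    """Yield parsed values from a separator-delimited stream,
    reading bufsize characters at a time instead of one."""
    tail = ""                      # incomplete value left over from the last block
    while True:
        block = f.read(bufsize)
        if not block:              # end of file
            break
        parts = (tail + block).split(sep)
        tail = parts.pop()         # the last piece may continue in the next block
        for part in parts:
            yield parse(part)
    if tail.strip():               # final value after the last separator
        yield parse(tail)

# Demo with a tiny buffer so that values get split across blocks.
values = list(read_values_buffered(io.StringIO("1,2,3.5,4\n"), bufsize=3))
```

Memory use is bounded by the block size plus the longest single value, so it stays constant for well-formed input.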
First of all, don't touch the garbage collector. That's not the problem, nor the solution.
It sounds like the real problem you're having is not with the file reading at all, but with the data structures that you're allocating as you process the files.
Consider using del to remove structures that you no longer need during processing. Also, you might consider using marshal to dump some of the processed data to disk while you work through the next 100mb of input files.
For file reading, you have basically two options: unix-style files as streams, or memory mapped files. For streams-based files, the default python file object is already buffered, so the simplest code is also probably the most efficient:
with open("filename", "r") as f:
    for line in f:
        # do something with a line of the files
Alternately, you can use f.read([size]) to read blocks of the file. However, usually you do this to gain CPU performance, by multithreading the processing part of your script, so that you can read and process at the same time. But it doesn't help with memory usage; in fact, it uses more memory.
The other option is mmap, which looks like this:
import mmap
with open("filename", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)
    line = mm.readline()
    while line:  # readline() returns empty bytes at end of file
        # process a line
        line = mm.readline()
    mm.close()
This sometimes outperforms streams, but it also won't improve memory usage.
In your example code, data is being stored in the fc variable. If you don't keep a reference to fc around, your entire file contents will be removed from memory when the read method ends.
If it is not, then you are keeping a reference somewhere. Maybe the reference is being created in read_100_mb_file, maybe in process. If there is no reference, the CPython implementation will deallocate it almost immediately.
There are some tools to help you find where this reference is, guppy, dowser, pysizer...
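To check the "no reference means immediate deallocation" behaviour yourself, here is a small CPython-specific sketch using the standard weakref module. The FileContent class and both functions are hypothetical stand-ins for the code in the question:

```python
import weakref

class FileContent:
    """Stand-in for the large object returned by read_100_mb_file."""
    def __init__(self, size):
        self.data = bytearray(size)

def read():
    fc = FileContent(1024 * 1024)
    return weakref.ref(fc)        # a weak reference does not keep fc alive

kept = []

def read_and_keep():
    fc = FileContent(1024)
    kept.append(fc)               # a lingering strong reference
    return weakref.ref(fc)

ref_dropped = read()
ref_kept = read_and_keep()

# On CPython, reference counting frees the first object as soon as read() returns.
dropped_is_gone = ref_dropped() is None
kept_is_alive = ref_kept() is not None
```

On other Python implementations the first object may only disappear after a garbage collection run, so treat this as a CPython-only check.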