Let's say I need to crunch some data and then write the results to some file.
Would it be better to open the file first, then process the data, then write to the file?
with open('file', 'w') as f:
summary = process_data()
f.write(summary)
Or would it be better to open the file just before writing to it?
summary = process_data()
with open('file', 'w') as f:
f.write(summary)
My intuition tells me that if process_data() requires a lot of memory and if file is large, there may be some issues with the first approach.
Edit:
To clarify from some of the responses, what are the pros and cons of each approach?
Python doesn’t have c-like scopes, only scoping constructs are def and class blocks, so summary is not cleaned up after with block has ended in the second example.
I can think of only one difference: opening file in a write mode clears it, so if the process_data takes a long time inside the with block - it leaves file in an empty state for longer.
If that’s not a concern this is 2+3 vs 3+2.
define better...
I can think of a few aspects:
Amount of time the file is takes - if this file is needed by other users, it would be best to keep it open for your usage as little as possible --> process before
Clean coding - the with statement is neater than the open + close --> process inside
My intuition tells me that if process_data() requires a lot of memory and if file is large, there may be some issues with the first approach.
The size of file shouldn't matter, as you don't read it, you only open it for writing...
Related
I am new to Python, but I didn't know this til yet.
I have a basic program inside a for loop, that requests data from a site and saves it to a text file
But when I checked inside my task manager I saw that the memory usage only increase? This might be a problem for me when running this for a long time.
Is it standard for Python to do this or can you change it?
Here is a what the program basically is
savefile = open("file.txt", "r+")
for i in savefile:
#My code goes here
savefile.write(i)
#end of loop
savefile.close()
Python does not write to file until you call .close() or .flush() or until it hits a specified buffer size. This question might help you: How often does python flush to a file?
As #Almog said, Python does not write to the file immediately. Because of this, every line you write to the file gets stored into RAM until you use savefile.close(), which flushes the internal buffer and writes everything to the file. This would explain the extra memory usage.
Try changing the loop to this:
savefile = open('file.txt', 'r+')
for i in savefile:
savefile.write(i)
savefile.flush() #flushes buffer, saving RAM
savefile.close()
There is a better Solution, in pythonic way, to this:
with open("your_file.txt", "write_mode") as file_variable_name:
for line in file_name:
file_name.write(line)
file_name.flush()
This code flushes the File for each line and after it's execution it closes the File thanks to the with-Statement
As a disclaimer, I'm hardly a computer scientist, but I've been reading everything I can on the subject of efficient file i/o to try and tackle this facet of a project I'm working on.
I have a very large (10 - 100 GB) log file of comma-separated values that I need to parse through. The first value labels it as "A" or "B"; for every "A" line, I need to examine the line before it and the line after it, and if either line before or after it meets a criterion, I want to store it in memory or write it to a file. The lines are not uniform in size.
That is my specific problem: I can't seem to locate an efficient way to do this in a non-binary file. With a binary file, I'd simply iterate over the file once and rewind to and fro with a logical check. I've investigated memory mapping, but it seems structured for binary files; my current code is Pythonic and takes weeks to run [see disclaimer].
My other question would be-- how easily could parallelism be invoked to help here? I have a notion of how -- map the file out three lines at a time and send each chunk to each node [lines 1,2,3 go to one node; lines 3,4,5 go to another ...], but I have no idea how to go about implementing this.
Any help would be very much appreciated.
Just read the lines in a loop. Keep track of the previous line in memory and examine it when needed.
Pseudocode:
for each line:
previousLine := currentLine
read currentLine from file
do processing...
This is efficient assuming you're already reading every line into memory anyway, and if you use a proper buffering scheme for reading the file (read large chunks at a time into memory).
I don't think parallelism will help in this situation. If properly written, the bottleneck of the program should be disk I/O, and multiple threads/processes can't read from disk any faster than a single thread. Parallelism only improves CPU-bound problems.
For what it's worth you can "seek" in ASCII files the same way you can with binary files. You would just keep track of the file offset each time you begin to read a line, and store that offset so you know where to seek back to later. Depending on how this is implemented this will never perform better than the above, though, and sometimes worse (you would want the file data to be buffered in memory so that the "seeking" is a memory operation and not a disk operation; you definitely want to read the file contents sequentially to maximize cache-ahead benefits).
Here's a first pass. Assumes properly formatted lines of text.
from itertools import chain
with open('your-file') as f:
prev_line = None
cur_line = f.readline()
for next_line in chain(f, [None]):
pieces = cur_line.split(',')
if pieces[0] == 'A':
check_against_criterion_if_not_none(prev_line)
check_against_criterion_if_not_none(next_line)
prev_line, cur_line = cur_line, next_line
A nifty trick is tacking on that extra 'None' at the end of the file, using itertools.chain, so that code properly checks the last line of the file against the 2nd to last line.
I used to read files like this:
f = [i.strip("\n") for i in open("filename.txt")]
which works just fine. I prefer this way because it is cleaner and shorter than traditional file reading code samples available on the web (e.g. f = open(...) , for line in f.readlines() , f.close()).
However, I wonder if there can be any drawback for reading files like this, e.g. since I don't close the file, does Python interpreter handles this itself? Is there anything I should be careful of using this approach?
This is the recommended way:
with open("filename.txt") as f:
lines = [line.strip("\n") for line in f]
The other way may not close the input file for a long time. This may not matter for your application.
The with statement takes care of closing the file for you. In CPython, just letting the file handle object be garbage-collected should close the file for you, but in other flavors of Python (Jython, IronPython, PyPy) you definitely can't count on this. Also, the with statement makes your intentions very clear, and conforms with common practice.
From the docs:
When you’re done with a file, call f.close() to close it and free up any system resources taken up by the open file.
You should always close a file after working with it. Python will not automatically do it for you. If you want a cleaner and shorter way, use a with statement:
with open("filename.txt") as myfile:
lines = [i.strip("\n") for i in myfile]
This has two advantages:
It automatically closes the file after the with block
If an exception is raised, the file is closed regardless.
It might be fine in a limited number of cases, e.g. a temporary test.
Python will only close the file handle after it finishes the execution.
Therefore this approach is a no-go for a proper application.
When we write onto a file using any of the write functions. Python holds everything to write in the file in a buffer and pushes it onto the actual file on the storage device either at the end of the python file or if it encounters a close() function.
So if the file terminates in between then the data is not stored in the file. So I would suggest two options:
use with because as soon as you get out of the block or encounter any exception it closes the file,
with open(filename , file_mode) as file_object:
do the file manipulations........
or you can use the flush() function if you want to force python to write contents of buffer onto storage without closing the file.
file_object.flush()
For Reference: https://lerner.co.il/2015/01/18/dont-use-python-close-files-answer-depends/
So I have some fairly gigantic .gz files - we're talking 10 to 20 gb each when decompressed.
I need to loop through each line of them, so I'm using the standard:
import gzip
f = gzip.open(path+myFile, 'r')
for line in f.readlines():
#(yadda yadda)
f.close()
However, both the open() and close() commands take AGES, using up 98% of the memory+CPU. So much so that the program exits and prints Killed to the terminal. Maybe it is loading the entire extracted file into memory?
I'm now using something like:
from subprocess import call
f = open(path+'myfile.txt', 'w')
call(['gunzip', '-c', path+myfile], stdout=f)
#do some looping through the file
f.close()
#then delete extracted file
This works. But is there a cleaner way?
I'm 99% sure that your problem is not in the gzip.open(), but in the readlines().
As the documentation explains:
f.readlines() returns a list containing all the lines of data in the file.
Obviously, that requires reading reading and decompressing the entire file, and building up an absolutely gigantic list.
Most likely, it's actually the malloc calls to allocate all that memory that are taking forever. And then, at the end of this scope (assuming you're using CPython), it has to GC that whole gigantic list, which will also take forever.
You almost never want to use readlines. Unless you're using a very old Python, just do this:
for line in f:
A file is an iterable full of lines, just like the list returned by readlines—except that it's not actually a list, it generates more lines on the fly by reading out of a buffer. So, at any given time, you'll only have one line and a couple of buffers on the order of 10MB each, instead of a 25GB list. And the reading and decompressing will be spread out over the lifetime of the loop, instead of done all at once.
From a quick test, with a 3.5GB gzip file, gzip.open() is effectively instant, for line in f: pass takes a few seconds, gzip.close() is effectively instant. But if I do for line in f.readlines(): pass, it takes… well, I'm not sure how long, because after about a minute my system went into swap thrashing hell and I had to force-kill the interpreter to get it to respond to anything…
Since this has come up a dozen more times since this answer, I wrote this blog post which explains a bit more.
Have a look at pandas, in particular IO tools. They support gzip compression when reading files and you can read files in chunks. Besides, pandas is very fast and memory efficient.
As I never tried, I don't know how well the compression and reading in chunks live together, but it might be worth giving a try
I have a number of large (~100 Mb) files which I'm regularly processing. While I'm trying to delete unneeded data structures during processing, memory consumption is a bit too high. I was wondering if there is a way to efficiently manipulate large data, e.g.:
def read(self, filename):
fc = read_100_mb_file(filename)
self.process(fc)
def process(self, content):
# do some processing of file content
Is there a duplication of data structures? Isn't it more memory efficient to use a class-wide attribute like self.fc?
When should I use garbage collection? I know about the gc module, but do I call it after I del fc for example?
update
p.s. 100 Mb is not a problem in itself. but float conversion, further processing add significantly more to both working set and virtual size (I'm on Windows).
I'd suggest looking at the presentation by David Beazley on using generators in Python. This technique allows you to handle a lot of data, and do complex processing, quickly and without blowing up your memory use. IMO, the trick isn't holding a huge amount of data in memory as efficiently as possible; the trick is avoiding loading a huge amount of data into memory at the same time.
Before you start tearing your hair out over the garbage collector, you might be able to avoid that 100mb hit of loading the entire file into memory by using a memory-mapped file object. See the mmap module.
Don't read the entire 100 meg file in at a time. Use streams to process a little bit at a time. Check out this blog post that talks about handling large csv and xml files. http://lethain.com/entry/2009/jan/22/handling-very-large-csv-and-xml-files-in-python/
Here is a sample of the code from the article.
from __future__ import with_statement # for python 2.5
with open('data.in','r') as fin:
with open('data.out','w') as fout:
for line in fin:
fout.write(','.join(line.split(' ')))
So, from your comments I assume that your file looks something like this:
item1,item2,item3,item4,item5,item6,item7,...,itemn
which you all reduce to a single value by repeated application of some combination function. As a solution, only read a single value at a time:
def read_values(f):
buf = []
while True:
c = f.read(1)
if c == ",":
yield parse("".join(buf))
buf = []
elif c == "":
yield parse("".join(buf))
return
else:
buf.append(c)
with open("some_file", "r") as f:
agg = initial
for v in read_values(f):
agg = combine(agg, v)
This way, memory consumption stays constant, unless agg grows in time.
Provide appropriate implementations of initial, parse and combine
Don't read the file byte-by-byte, but read in a fixed buffer, parse from the buffer and read more as you need it
This is basically what the builtin reduce function does, but I've used an explicit for loop here for clarity. Here's the same thing using reduce:
with open("some_file", "r") as f:
agg = reduce(combine, read_values(f), initial)
I hope I interpreted your problem correctly.
First of all, don't touch the garbage collector. That's not the problem, nor the solution.
It sounds like the real problem you're having is not with the file reading at all, but with the data structures that you're allocating as you process the files.
Condering using del to remove structures that you no longer need during processing. Also, you might consider using marshal to dump some of the processed data to disk while you work through the next 100mb of input files.
For file reading, you have basically two options: unix-style files as streams, or memory mapped files. For streams-based files, the default python file object is already buffered, so the simplest code is also probably the most efficient:
with open("filename", "r") as f:
for line in f:
# do something with a line of the files
Alternately, you can use f.read([size]) to read blocks of the file. However, usually you do this to gain CPU performance, by multithreading the processing part of your script, so that you can read and process at the same time. But it doesn't help with memory usage; in fact, it uses more memory.
The other option is mmap, which looks like this:
with open("filename", "r+") as f:
map = mmap.mmap(f.fileno(), 0)
line = map.readline()
while line != '':
# process a line
line = map.readline()
This sometimes outperforms streams, but it also won't improve memory usage.
In your example code, data is being stored in the fc variable. If you don't keep a reference to fc around, your entire file contents will be removed from memory when the read method ends.
If they are not, then you are keeping a reference somewhere. Maybe the reference is being created in read_100_mb_file, maybe in process. If there is no reference, CPython implementation will deallocate it almost immediatelly.
There are some tools to help you find where this reference is, guppy, dowser, pysizer...