I'm currently writing a Python script that processes very large (>10 GB) files. Since loading the whole file into memory is not an option, right now I'm reading and processing it line by line:
for line in f:
    ...
Once the script is finished it will run fairly often, so I'm starting to think about what impact that sort of reading will have on my disk's lifespan.
Will the script actually read line by line or is there some kind of OS-powered buffering happening? If not, should I implement some kind of intermediary buffer myself? Is hitting the disk that often actually harmful? I remember reading something about BitTorrent wearing out disks quickly exactly because of that kind of bitwise reading/writing rather than operating with larger chunks of data.
I'm using both a HDD and an SSD in my test environment, so answers would be interesting for both systems.
Both your OS and Python use buffers to read data in larger chunks, for performance reasons. Your disk will not be materially impacted by reading a file line by line from Python.
Specifically, Python cannot give you individual lines without scanning ahead to find the line separators, so it'll read chunks, parse out individual lines, and each iteration will take lines from the buffer until another chunk must be read to find the next set of lines. The OS uses a buffer cache to help speed up I/O in general.
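For illustration, here is a minimal sketch (the filename and the process() call are placeholders, not from your script) showing that the line iterator already reads in buffered chunks and that the chunk size can be tuned via the buffering argument of open():

import io

# Python's default chunk size for buffered I/O (typically 8 KiB).
print(io.DEFAULT_BUFFER_SIZE)

# Ask for a larger buffer if you prefer fewer, bigger reads from disk;
# 1 MiB here is just an example value, not a recommendation.
with open("bigfile.txt", "r", buffering=1024 * 1024) as f:
    for line in f:
        process(line)  # placeholder for your per-line work

Either way, the disk sees relatively few large reads, not one read per line.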
Related
I have a JSON file with several million rows, compressed to .gz. When I read the full file I run out of memory.
So I split the file into multiple files of 100k rows each, compress all of them to .gz, and read them in a loop. No memory problems.
Now I think that in both cases I read exactly the same number of rows into memory, but the single-file approach runs out of memory.
Could somebody elaborate why?
Because when you have a single file, you open it and hold its full content in RAM at once, whereas with the loop you process the files one by one. If done properly (as it looks like you do), your program closes each file and frees the allocated memory before opening the next one, reducing the RAM needed to run the program and avoiding running out of memory.
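If you'd rather keep the single compressed file, a hedged sketch along these lines streams it line by line instead of decompressing everything at once (the filename and handle_record() are placeholders):

import gzip
import json

def handle_record(record):
    pass  # placeholder: whatever you do with each row

# gzip.open decompresses on the fly; iterating yields one line at a time,
# so only a small buffer lives in memory regardless of the total file size.
with gzip.open("rows.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        handle_record(json.loads(line))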
My intuition was that writing a file happens line by line in Python, the same way that reading it does. I was wrong. Consider the following code.
with open("source_file.csv","r") as source, open("output_file.csv","w") as dest:
while 1:
buf = source.read(16*1024)
if not buf:
break
dest.write(buf)
I observed how the output file is written via tail -f output_file.csv. Data is displayed as it is being written. I interrupted the Python script and noticed that none of that data had actually been written to disk.
Here's my understanding of this so far:
The read operation is indeed done piece by piece here, without loading the whole file into memory. The write operation, on the other hand, accumulates in memory before it is serialized and written to disk.
I thought data could be written to disk chunk by chunk, to be memory efficient. Otherwise, any big file could not be copied (memory would eventually be insufficient), which is (probably) what happens in my previous post.
My questions:
Is my analysis correct? If not, I would like additional information.
How do we write chunk-by-chunk on disk?
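For what it's worth, the copy loop above already writes chunk by chunk into Python's write buffer; to make each chunk reach the disk promptly you can flush (and, if you really need it, fsync) after each write. A hedged sketch with the same 16 KiB chunks:

import os

with open("source_file.csv", "r") as source, open("output_file.csv", "w") as dest:
    while True:
        buf = source.read(16 * 1024)
        if not buf:
            break
        dest.write(buf)
        dest.flush()              # push Python's buffer to the OS
        os.fsync(dest.fileno())   # optionally ask the OS to commit it to the disk

Flushing after every chunk defeats part of the buffering that makes the copy fast, so you would only do this when you genuinely need the data on disk immediately.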
I am going to read a text file of about 7 GB.
Whenever I try to read this file, it takes longer than I expected.
For example, suppose I have a 350 MB text file and my laptop takes about a minute or less to read it. If I read 7 GB, it should ideally take about 20 minutes or less, shouldn't it? Mine takes much longer than that, and I want to shorten the time it takes to read and process my data.
I am using the following code for reading:
import json

list = []  # note: `list` shadows the built-in name; kept as in the original
for line in open(filename, 'r'):
    try:
        list.append(json.loads(line))
    except:
        pass  # skip lines that are not valid JSON
After reading the file, I process the list to filter out unnecessary data by building another list and deleting the previous one.
If you have any suggestion, just let me know.
The 7 GB file is likely taking significantly longer than 20 times the 350 MB file because you don't have enough memory to hold all the data at once. This means that, at some point, your operating system will start swapping some of the data out (writing it from memory onto disk) so that the memory can be re-used.
This is slow because your hard disk is significantly slower than RAM, and at 7 GB there will be a lot of data being read from your hard disk, put into RAM, then moved back to your page file (the file on disk your operating system uses to store data that has been copied out of RAM).
My suggestion would be to re-work your program so that it only needs to store a small portion of the file in memory at a time. Depending on your problem, you can likely do this by moving some of the logic into the loop that reads the file. For example, if your program is trying to find and print all the lines which contain "ERROR", you could re-write it from:
import json

lines = []
for line in open("myfile"):
    lines.append(json.loads(line))

for line in lines:
    if "ERROR" in line:
        print(line)
To:
for line_str in open("myfile"):
    line_obj = json.loads(line_str)
    if "ERROR" in line_obj:
        print(line_obj)
I am trying to teach myself Python by reading documentation. I am trying to understand what it means to flush a file buffer. According to documentation, "file.flush" does the following.
Flush the internal buffer, like stdio's fflush().
This may be a no-op on some file-like objects.
I don't know what "internal buffer" and "no-op" mean, but I think it says that flush writes data from some buffer to a file.
Hence, I ran this script, toggling the comment sign (#) on the line in the middle.
with open("myFile.txt", "w+") as file:
file.write("foo")
file.write("bar")
# file.flush()
file.write("baz")
file.write("quux")
However, I seem to get the same myFile.txt with and without the call to file.flush(). What effect does file.flush() have?
Python buffers writes to files. That is, file.write returns before the data is actually written to your hard drive. The main motivation of this is that a few large writes are much faster than many tiny writes, so by saving up the output of file.write until a bit has accumulated, Python can maintain good writing speeds.
file.flush forces the data to be written out at that moment. This is handy when you know that it might be a while before you have more data to write out, but you want other processes to be able to view the data you've already written. Imagine a log file that grows slowly. You don't want to have to wait ages before enough entries have built up to cause the data to be written out in one big chunk.
In either case, file.close causes the remaining data to be flushed, so "quux" in your code will be written out as soon as file (which is a really bad name as it shadows the builtin file constructor) falls out of scope of the with context manager and gets closed.
Note: your OS does some buffering of its own. file.flush pushes Python's buffer out to the operating system; if you need the data forced all the way onto the physical drive, you also need something like os.fsync. Someone please correct me if I'm wrong.
By the way, "no-op" means "no operation", as in it won't actually do anything. For example, StringIO objects manipulate strings in memory, not files on your hard drive. StringIO.flush probably just immediately returns because there's not really anything for it to do.
Buffered content may be held in memory to improve performance; flush makes sure that the content is actually written out, avoiding data loss. It is also useful when, for example, you want a prompt asking for user input printed completely on screen before the next file operation takes place.
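To see the difference flush makes, a small experiment like this (checking the on-disk size while the file is still open) may help; the exact size before the flush depends on buffering, so take the numbers as illustrative:

import os

with open("myFile.txt", "w") as f:
    f.write("foo")
    f.write("bar")
    # The six bytes are likely still sitting in Python's buffer,
    # so the file on disk will probably report a size of 0 here.
    print("before flush:", os.path.getsize("myFile.txt"))

    f.flush()
    # Now the data has been handed to the OS and is visible
    # to anything else that looks at the file.
    print("after flush:", os.path.getsize("myFile.txt"))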
I'm reading from certain offsets in several hundred, and possibly thousands, of files. Because I only need certain data from certain offsets at that particular time, I must either keep the file handles open for later use OR write the parts I need into separate files.
I figured that keeping all these file handles open, rather than doing a significant amount of writing of new temporary files to disk, is the lesser of two evils. I was just worried about the efficiency of having so many file handles open.
Typically, I'll open a file, seek to an offset, read some data, then 5 seconds later do the same thing at another offset, and do all this on thousands of files within a 2-minute timeframe.
Is that going to be a problem?
A follow-up: really, I'm asking which is better: to leave these thousands of file handles open, or to constantly close them and re-open them just when I need them.
Some systems may limit the number of file descriptors that a single process can have open simultaneously. 1024 is a common default, so if you need "thousands" open at once, you might want to err on the side of portability and design your application to use a smaller pool of open file descriptors.
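On Unix-like systems you can check your process's limit yourself; a small sketch using the standard resource module (not available on Windows):

import resource

# RLIMIT_NOFILE is the maximum number of file descriptors this process
# may have open at once, reported as (soft limit, hard limit).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft, "hard limit:", hard)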
I recommend that you take a look at Storage.py in BitTorrent. It includes an implementation of a pool of file handles.
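If you just want the general shape of the idea rather than BitTorrent's code, here is a minimal sketch of such a pool: it keeps at most max_open handles and transparently reopens files on demand, evicting the least recently used one. All names here are illustrative, not taken from Storage.py:

from collections import OrderedDict

class FileHandlePool:
    """Keep at most max_open read handles open; evict the least recently used."""

    def __init__(self, max_open=512):
        self.max_open = max_open
        self._handles = OrderedDict()  # path -> open file object

    def get(self, path):
        handle = self._handles.get(path)
        if handle is None:
            if len(self._handles) >= self.max_open:
                _, oldest = self._handles.popitem(last=False)  # least recently used
                oldest.close()
            handle = open(path, "rb")
            self._handles[path] = handle
        else:
            self._handles.move_to_end(path)  # mark as recently used
        return handle

    def read_at(self, path, offset, size):
        f = self.get(path)
        f.seek(offset)
        return f.read(size)

    def close_all(self):
        for f in self._handles.values():
            f.close()
        self._handles.clear()

With something like pool = FileHandlePool(max_open=512) you can call pool.read_at(path, offset, size) as often as you like and never exceed the descriptor limit, at the cost of an occasional reopen when a cold file has been evicted.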