I am going to read a text file of about 7 GB.
Whenever I try to read this file, it takes longer than I expected.
For example, reading a 350 MB text file takes my laptop about a minute or less. If I scale that up to 7 GB, it should ideally take 20 minutes or less, shouldn't it? Mine takes much longer than that, and I want to shorten the time it takes to read and process my data.
I am using the following code for reading:
import json

data = []
for line in open(filename, 'r'):
    try:
        data.append(json.loads(line))
    except ValueError:
        # skip lines that are not valid JSON
        pass
After reading the file, I process the data to filter out what I don't need by building another list and discarding the previous one.
If you have any suggestion, just let me know.
The 7 GB file is likely taking significantly longer than 20 times the 350 MB file because you don't have enough memory to hold all of the data at once. This means that, at some point, your operating system will start swapping some of the data out (writing it from memory onto disk) so that the memory can be re-used.
This is slow because your hard disk is significantly slower than RAM, and at 7GB there will be a lot of data being read from your hard disk, put into RAM, then moved back to your page file (the file on disk your operating system uses to store data that has been copied out of RAM).
My suggestion would be to re-work your program so that it only needs to store a small portion of the file in memory at a time. Depending on your problem, you can likely do this by moving some of the logic into the loop that reads the file. For example, if your program is trying to find and print all the lines which contain "ERROR", you could re-write it from:
lines = []
for line in open("myfile"):
    lines.append(json.loads(line))

for line in lines:
    if "ERROR" in line:
        print(line)
To:
for line_str in open("myfile"):
    line_obj = json.loads(line_str)
    if "ERROR" in line_obj:
        print(line_obj)
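If your real filtering is more involved than a substring check, the same streaming idea still applies; here is a rough sketch for the JSON case, where keep_record is a hypothetical stand-in for whatever test you apply after loading:

import json

wanted = []
for line in open(filename, 'r'):
    try:
        record = json.loads(line)
    except ValueError:
        continue  # skip malformed lines
    if keep_record(record):  # hypothetical filter: keep only what you actually need
        wanted.append(record)

This way only the records that survive the filter are ever kept in memory, instead of the full 7 GB worth of parsed objects.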
Related
I have a JSON file with several million rows, compressed to .gz. When I read the full file I run out of memory.
So I split the file into multiple files of 100k rows each, compressed them all to .gz, and read them in a loop. No memory problems.
Now I think that in both cases I read exactly the same number of rows into memory, yet the single-file approach runs out of memory.
Could somebody elaborate why?
Because when you have a single file, you open it and store its full content in RAM, whereas with a loop you process the files one by one. If done properly (as it looks like you do), your program will close each file and free the allocated memory before opening the next one, reducing the RAM needed to run the program and avoiding running out of memory.
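As a rough illustration of that pattern, a per-file loop might look something like this (the chunks/part-*.json.gz layout and the process_rows step are assumptions, not your actual code):

import glob
import gzip
import json

for path in sorted(glob.glob("chunks/part-*.json.gz")):  # hypothetical file layout
    with gzip.open(path, "rt") as f:
        rows = [json.loads(line) for line in f]
    process_rows(rows)  # hypothetical per-chunk processing
    # rows is rebound on the next iteration, so only one chunk's
    # worth of data is held in memory at any time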
My intuition was that a file is written line by line in Python, in the same way that it is read. I was wrong. Consider the following code.
with open("source_file.csv","r") as source, open("output_file.csv","w") as dest:
while 1:
buf = source.read(16*1024)
if not buf:
break
dest.write(buf)
I observed how the output file is written via $ tail -f output_file.csv. The data was not displayed as it was being written: when I interrupted the Python script, I noticed that none of that data had been written to disk.
Here's my understanding of this so far:
The read operation is indeed done chunk by chunk here, without loading the whole file into memory. The write operation, on the other hand, accumulates the data in memory before it is actually written to disk.
I thought data could be written to disk chunk by chunk, to be memory efficient. Otherwise, a big file could never be copied (memory would eventually run out), which is (probably) what happens in my previous post.
My questions:
Is my analysis correct? If not, I would like additional information.
How do we write chunk-by-chunk on disk?
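A minimal sketch of what explicit chunk-by-chunk flushing could look like, assuming the same copy loop as above (whether the fsync call is needed depends on how strict you want to be about data reaching the disk):

import os

with open("source_file.csv", "rb") as source, open("output_file.csv", "wb") as dest:
    while True:
        buf = source.read(16*1024)
        if not buf:
            break
        dest.write(buf)
        dest.flush()             # push this chunk out of Python's buffer
        os.fsync(dest.fileno())  # optionally ask the OS to write it to disk now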
I'm currently writing a python script that processes very large (> 10GB) files. As loading the whole file into memory is not an option, I'm right now reading and processing it line by line:
for line in f:
    ....
Once the script is finished it will run fairly often, so I'm starting to think about what impact that sort of reading will have on my disk's lifespan.
Will the script actually read line by line or is there some kind of OS-powered buffering happening? If not, should I implement some kind of intermediary buffer myself? Is hitting the disk that often actually harmful? I remember reading something about BitTorrent wearing out disks quickly exactly because of that kind of bitwise reading/writing rather than operating with larger chunks of data.
I'm using both a HDD and an SSD in my test environment, so answers would be interesting for both systems.
Both your OS and Python use buffers to read data in larger chunks, for performance reasons. Your disk will not be materially impacted by reading a file line by line from Python.
Specifically, Python cannot give you individual lines without scanning ahead to find the line separators, so it'll read chunks, parse out individual lines, and each iteration will take lines from the buffer until another chunk must be read to find the next set of lines. The OS uses a buffer cache to help speed up I/O in general.
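If you want to see or tune that buffering from Python, here is a small sketch (the file name and buffer size are just examples):

import io

print(io.DEFAULT_BUFFER_SIZE)  # size of Python's read buffer, commonly 8192 bytes

# open() already returns a buffered, line-iterable stream; the buffering
# argument lets you experiment with a larger chunk size if you want
with open("big.log", "r", buffering=1024*1024) as f:  # hypothetical file name
    for line in f:
        ...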
baseline - I have CSV data with 10,000 entries. I save this as 1 csv file and load it all at once.
alternative - I have CSV data with 10,000 entries. I save this as 10,000 CSV files and load it individually.
Approximately how much more inefficient is this, computationally? I'm not hugely interested in memory concerns. The purpose of the alternative method is that I frequently need to access subsets of the data and don't want to have to read the entire array.
I'm using python.
Edit: I can use other file formats if needed.
Edit1: SQLite wins. Amazingly easy and efficient compared to what I was doing before.
SQLite is an ideal solution for your application.
Simply import your CSV file into an SQLite database table (it will be a single file), then add indexes as necessary.
To access your data, use Python's sqlite3 library. You can follow this tutorial on how to use it.
Compared to many other solutions, SQLite will be the fastest way to select partial data sets locally, certainly much, much faster than accessing 10,000 files. Also read this answer, which explains why SQLite is so good.
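A minimal sketch of the import-and-query workflow with the standard csv and sqlite3 modules (the table name, column names, and file names are made up; adapt them to your data):

import csv
import sqlite3

conn = sqlite3.connect("data.db")  # single file holding everything
conn.execute("CREATE TABLE IF NOT EXISTS entries (id TEXT, value REAL)")  # hypothetical schema

with open("data.csv", newline="") as f:
    rows = [(r["id"], float(r["value"])) for r in csv.DictReader(f)]
conn.executemany("INSERT INTO entries VALUES (?, ?)", rows)
conn.execute("CREATE INDEX IF NOT EXISTS idx_entries_id ON entries(id)")  # index the lookup column
conn.commit()

# later: pull just the subset you need instead of reading everything
subset = conn.execute("SELECT * FROM entries WHERE id = ?", ("some_id",)).fetchall()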
I would write all the lines to one file. For 10,000 lines it's probably not worthwhile, but you can pad all the lines to the same length - say 1000 bytes.
Then it's easy to seek to the nth line: just multiply n by the line length.
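A sketch of what the write side might look like with that padding (rows and the output file name are placeholders; the 1000-byte record length is just the figure above):

RECORD_LEN = 1000  # fixed length per line, including the newline

with open("padded.csv", "w", newline="") as f:  # newline="" so "\n" stays one byte
    for row in rows:  # rows: whatever produces your CSV rows (hypothetical)
        line = ",".join(row)
        assert len(line) < RECORD_LEN  # each line must fit inside the fixed record
        f.write(line.ljust(RECORD_LEN - 1) + "\n")  # pad with spaces to a fixed width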
10,000 files is going to be slower to load and access than one file, if only because the files' data will likely be fragmented around your disk drive, so accessing it will require a much larger number of seeks than accessing the contents of a single file, which will generally be stored as sequentially as possible. Seek times are a big slowdown on spinning media, since your program has to wait while the drive heads are physically repositioned, which can take milliseconds. (Slow seek times aren't an issue for SSDs, but even then there is still the overhead of 10,000 files' worth of metadata for the operating system to deal with.)

Also, with a single file the OS can speed things up for you by doing read-ahead buffering (it can reasonably assume that if you read one part of the file, you will likely want to read the next part soon). With multiple files, the OS can't do that.
My suggestion (if you don't want to go the SQLite route) would be to use a single CSV file, and (if possible) pad all of the lines of your CSV file out with spaces so that they all have the same length. For example, say you make sure when writing out the CSV file that every line is exactly 80 bytes long. Then reading the nth line of the file becomes relatively fast and easy (open the file in binary mode so the offsets are counted in bytes):
myFileObject.seek(n*80)
theLine = myFileObject.read(80)
I have a big file transfer (say 4 GB or so), and rather than using shutil, I'm just opening and writing it with normal file operations so I can include a progress percentage as it moves along.
It then occurred to me to try to attempt to resume the file write, if for some reason it borked out during the process. I haven't had any luck though. I presumed it would be some clever combination of offsetting the read of the source file and using seek, but I haven't had any luck so far. Any ideas?
Additionally, is there some sort of dynamic way to figure out what block size to use when reading and writing files? I'm fairly new to that area, and have just read that you should use a larger size for larger files (I'm using 65536 at the moment). Is there a smart way to do it, or does one simply guess? Thanks guys.
Here is the code snippet of the appending file transfer:
import os

newsrc = open(src, 'rb')
dest_size = os.stat(destFile).st_size
print('Dest file exists, resuming at block %s' % dest_size)
newsrc.seek(dest_size)
newdest = open(destFile, 'a')
cur_block_pos = dest_size
# Start copying file
while True:
    cur_block = newsrc.read(131072)
    cur_block_pos += 131072
    if not cur_block:
        break
    else:
        newdest.write(cur_block)
It does append and start writing, but it then writes dest_size more data at the end than it should, for reasons that are probably obvious to the rest of you. Any ideas?
For the second part of your question, data is typically read from and written to a hard drive in blocks of 512 bytes. So using a block size that is a multiple of that should give the most efficient transfer. Other than that, it doesn't matter much. Just keep in mind that whatever block size you specify is the amount of data that the I/O operation stores in memory at any given time, so don't choose something so large that it uses up a lot of your RAM. I think 8K (8192) is a common choice, but 64K should be fine. (I don't think the size of the file being transferred matters much when you're choosing the best block size)
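As an illustration of that point, here is a rough sketch of a copy loop that takes the block size as a parameter; the 64 KiB default is just one reasonable choice, not a measured optimum:

def copy_in_blocks(src_path, dest_path, block_size=64 * 1024):
    """Copy src_path to dest_path, holding at most block_size bytes in memory."""
    copied = 0
    with open(src_path, 'rb') as src, open(dest_path, 'wb') as dest:
        while True:
            block = src.read(block_size)
            if not block:
                break
            dest.write(block)
            copied += len(block)  # running total, handy for a progress percentage
    return copied

Since block_size only controls how much data sits in memory per iteration, changing it from 8K to 64K to 128K mostly trades a few extra loop iterations against a little more RAM; the disk's own 512-byte granularity is satisfied by any of those values.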