Resuming a large file write in Python

I have a big file transfer (say 4 GB or so) and rather than using shutil, I'm just opening and writing it the normal file way so I can include a progress percentage as it moves along.
It then occurred to me to try to resume the file write if, for some reason, it borked out during the process. I presumed it would be some clever combination of offsetting the read of the source file and using seek, but I haven't had any luck so far. Any ideas?
Additionally, is there some sort of dynamic way to figure out what block size to use when reading and writing files? I'm fairly novice to that area, and just read that you should use a larger size for larger files (I'm using 65536 at the moment). Is there a smart way to do it, or does one simply guess? Thanks guys.
Here is the code snippet of the appending file transfer:
newsrc = open(src, 'rb')
dest_size = os.stat(destFile).st_size
print 'Dest file exists, resuming at block %s' % dest_size
newsrc.seek(dest_size)
newdest = open(destFile, 'a')
cur_block_pos = dest_size
# Start copying file
while True:
    cur_block = newsrc.read(131072)
    cur_block_pos += 131072
    if not cur_block:
        break
    else:
        newdest.write(cur_block)
It does append and start writing, but it then writes dest_size more data at the end than it should, for reasons that are probably obvious to the rest of you. Any ideas?
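For what it's worth, here is a minimal sketch of one way the resume could be structured, assuming the partially written destination is intact and both files are handled in binary mode; resume_copy and the 64 KiB block size are illustrative choices, not a canonical answer:

import os

BLOCK_SIZE = 65536  # a guess; any multiple of 512 should behave similarly

def resume_copy(src, destFile, block_size=BLOCK_SIZE):
    total = os.stat(src).st_size
    # How much has already been copied, if anything.
    done = os.stat(destFile).st_size if os.path.exists(destFile) else 0
    with open(src, 'rb') as newsrc, open(destFile, 'ab') as newdest:
        newsrc.seek(done)  # skip the part that is already in destFile
        while True:
            block = newsrc.read(block_size)
            if not block:
                break
            newdest.write(block)
            done += len(block)  # count bytes actually read, not block_size
            print('%.1f%% copied' % (100.0 * done / total))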

For the second part of your question, data is typically read from and written to a hard drive in blocks of 512 bytes. So using a block size that is a multiple of that should give the most efficient transfer. Other than that, it doesn't matter much. Just keep in mind that whatever block size you specify is the amount of data that the I/O operation stores in memory at any given time, so don't choose something so large that it uses up a lot of your RAM. I think 8K (8192) is a common choice, but 64K should be fine. (I don't think the size of the file being transferred matters much when you're choosing the best block size)

Related

Rewrite a file in-place with Python

It might depend on each OS and also be hardware-dependent, but is there a way in Python to ask a write operation on a file to happen "in-place", i.e. at the same place as the original file, if possible on the same sectors on disk?
Example: let's say sensitivedata.raw, a 4 KB file, has to be encrypted:
with open('sensitivedata.raw', 'r+') as f:  # read/write mode
    s = f.read()
    cipher = encryption_function(s)  # same exact length as input s
    f.seek(0)
    f.write(cipher)  # how to ask this write operation to overwrite the original bytes?
Example 2: replace a file's content with null bytes of the same size, to prevent an undelete tool from recovering it (of course, to do it properly we need several passes, with random data and not only null bytes, but here it's just to give an idea):
with open('sensitivedata.raw', 'r+') as f:
    s = f.read()
    f.seek(0)
    f.write(len(s) * '\x00')  # totally inefficient but just to get the idea
os.remove('sensitivedata.raw')
PS: if it really depends a lot on the OS, I'm primarily interested in the Windows case.
Side-question: if it's not possible in the case of an SSD, does this mean that if you once in your life wrote sensitive data as plaintext on an SSD (for example: a password in plaintext, a crypto private key, or anything else), then there is no way to be sure that this data is really erased? i.e. the only solution is to 100% wipe the disk and fill it with many passes of random bytes? Is that correct?
That's an impossible requirement to impose. While on most spinning disk drives, this will happen automatically (there's no reason to write the new data elsewhere when it could just overwrite the existing data directly), SSDs can't do this (when they claim to do so, they're lying to the OS).
SSDs can't rewrite blocks; they can only erase a block, or write to an empty block. The implementation of a "rewrite" is to write to a new block (reading from the original block to fill out the block if there isn't enough new data), then (eventually, because it's relatively expensive) erase the old block to make it available for a future write.
Update addressing the side-question: The only truly secure solution is to run your drive through a woodchipper, then crush the remains with a millstone. :-) Really, in most cases, the window of vulnerability on an SSD should be relatively short; erasing sectors is expensive, so even SSDs that don't honor TRIM typically do it in the background to ensure future (cheap) write operations aren't held up by (expensive) erase operations. This isn't really so bad when you think about it: sure, the data is visible for a period of time after you logically erased it, but it was visible for a period of time before you erased it too, so all this does is extend the window of vulnerability by seconds, minutes, hours, or days, depending on the drive. The mistake was storing sensitive data to permanent storage in the first place; even with extreme (woodchipper + millstone) solutions, someone else could have snuck in and copied the data before you thought to encrypt/destroy it.
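For completeness, here is a sketch of what you can do from Python to at least push such an overwrite out of the OS caches; flush() plus os.fsync() only guarantees the data reaches the drive, not that the drive reuses the same physical sectors (on an SSD it almost certainly won't). The overwrite_in_place helper is just an illustrative name:

import os

def overwrite_in_place(path, transform):
    # Rewrite a file's bytes as transform(data); transform must preserve the length.
    with open(path, 'r+b') as f:  # read/write binary, no truncation
        data = f.read()
        new_data = transform(data)
        assert len(new_data) == len(data)  # same length, so no resizing
        f.seek(0)
        f.write(new_data)
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to push it to the device

# e.g. the null-byte pass from the question:
# overwrite_in_place('sensitivedata.raw', lambda b: b'\x00' * len(b))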

Reading a file line by line -- impact on disk?

I'm currently writing a python script that processes very large (> 10GB) files. As loading the whole file into memory is not an option, I'm right now reading and processing it line by line:
for line in f:
    ....
Once the script is finished it will run fairly often, so I'm starting to think about what impact that sort of reading will have on my disk's lifespan.
Will the script actually read line by line or is there some kind of OS-powered buffering happening? If not, should I implement some kind of intermediary buffer myself? Is hitting the disk that often actually harmful? I remember reading something about BitTorrent wearing out disks quickly exactly because of that kind of bitwise reading/writing rather than operating with larger chunks of data.
I'm using both a HDD and an SSD in my test environment, so answers would be interesting for both systems.
Both your OS and Python use buffers to read data in larger chunks, for performance reasons. Your disk will not be materially impacted by reading a file line by line from Python.
Specifically, Python cannot give you individual lines without scanning ahead to find the line separators, so it'll read chunks, parse out individual lines, and each iteration will take lines from the buffer until another chunk must be read to find the next set of lines. The OS uses a buffer cache to help speed up I/O in general.
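If you want to see or tune that buffering from Python, the buffering argument to open() controls the chunk size; a small sketch (Python 3 shown, and the 1 MiB figure is arbitrary):

import io

print(io.DEFAULT_BUFFER_SIZE)  # the default chunk size, typically 8192 bytes

# Line iteration still yields one line at a time, but the underlying
# reads now pull roughly 1 MiB from the OS per call.
with open('bigfile.txt', 'r', buffering=1024 * 1024) as f:
    for line in f:
        pass  # process the line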

Reading a big text file and memory

I am going to read a text file of about 7 GB.
Whenever I try to read this file, it takes longer than I expected.
For example, a 350 MB text file takes my laptop about a minute or less, so reading 7 GB should ideally take about 20 minutes or less, shouldn't it? Mine takes much longer than that, and I want to shorten the time spent reading and processing my data.
I am using the following code for reading:
for line in open(filename, 'r'):
    try:
        list.append(json.loads(line))
    except:
        pass
After reading the file, I process it to filter out unnecessary data by building another list and discarding the previous one.
If you have any suggestion, just let me know.
The 7 GB file is likely taking significantly longer than 20 × a 350 MB file because you don't have enough memory to hold all the data at once. This means that, at some point, your operating system will start swapping out some of the data (writing it from memory onto disk) so that the memory can be re-used.
This is slow because your hard disk is significantly slower than RAM, and at 7 GB there will be a lot of data being read from your hard disk, put into RAM, then moved back to your page file (the file on disk your operating system uses to store data that has been copied out of RAM).
My suggestion would be to re-work your program so that it only needs to store a small portion of the file in memory at a time. Depending on your problem, you can likely do this by moving some of the logic into the loop that reads the file. For example, if your program is trying to find and print all the lines which contain "ERROR", you could re-write it from:
lines = []
for line in open("myfile"):
    lines.append(json.loads(line))
for line in lines:
    if "ERROR" in line:
        print line
To:
for line_str in open("myfile"):
    line_obj = json.loads(line_str)
    if "ERROR" in line_obj:
        print line_obj

Efficient writing to a Compact Flash in python

I'm writing a GUI to perform a glorified 'dd'.
I could just subprocess to 'dd' but I thought I might as well use python's open()/read()/write() if I can as it'll let me display progress much more easily.
Prompted by this link here I have:
input = open('filename.img', 'rb')
output = open("/dev/sdc", 'wb')
while True:
    buffer = input.read(1024)
    if buffer:
        output.write(buffer)
    else:
        break
input.close()
output.close()
...however it is horribly slow. Or at least far slower than dd. (around 4-5x slower)
I had a play and noticed that altering the number of bytes 'buffered' had a huge effect on the speed of completion. Raising it to 2048, for example, seems to halve the time taken. Perhaps going OT for SO here, but I guess the flash has an optimum number of bytes to be written at once? Could anyone suggest how I discover this?
The image & card are 1 GB, so I would very much like to get back to the ~5 minutes dd took, if possible. I appreciate that in all likelihood I won't match it.
Rather than trial and error, would anyone be able to suggest a way to optimise the above code, and reasoning as to why it works? Especially, what value should I pass to input.read(), for example?
One restriction: python 2.4.3 on linux (centos5) (please don't hurt me)
Speed depending on buffer size is unrelated to the specific characteristics of compact flash; it is inherent to all I/O with (relatively) slow devices, and indeed to system calls in general. You should make the buffer size as large as possible without exhausting memory - 2 MiB should be enough for a Flash drive.
You should use the time and strace utilities to determine why your program is slower. If time shows a large user/real (large meaning greater than 0.1), you can optimize your Python interpreter - cpython 2.4 is pretty slow, and you're creating new objects all the time instead of writing into a preallocated buffer. If there is a significant difference in sys timings, analyze the syscalls made by both programs (with strace) and try to emit the ones dd does.
Also note that you must call fsync (or execute the sync program) afterwards to measure the real time it took writing the file to disk (or open the output file with O_DIRECT). Otherwise, the operating system will let your program exit and just keep all the written data in buffers that are then continually written out to the actual disk. To test that you're doing it right, remove the disk immediately after your program is finished. Note that the speed difference can be staggering. This effect is less noticeable if your disk (CF card) is way larger than the available physical memory.
So with a little help, I've removed the 'buffer' bit completely and added an os.fsync().
import os

input = open('filename.img', 'rb')
output = open("/dev/sdc", 'wb')
output.write(input.read())
output.flush()
os.fsync(output.fileno())  # flush OS buffers to the device before closing
input.close()
output.close()
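A chunked variant of the same idea, closer to what the answer above suggests, might look like the sketch below; the 2 MiB buffer is the answer's suggestion, and reading in chunks keeps memory use flat even for images larger than RAM (plain try/finally, so it should still run on Python 2.4):

import os

BUF_SIZE = 2 * 1024 * 1024  # 2 MiB, per the suggestion above

src = open('filename.img', 'rb')
dst = open('/dev/sdc', 'wb')
try:
    while True:
        buf = src.read(BUF_SIZE)
        if not buf:
            break
        dst.write(buf)
    dst.flush()
    os.fsync(dst.fileno())  # make sure the data has actually reached the card
finally:
    src.close()
    dst.close()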

Having several file pointers open simultaneously alright?

I'm reading from certain offsets in several hundred, and possibly thousands of, files. Because I need only certain data from certain offsets at that particular time, I must either keep the file handles open for later use OR write the parts I need into separate files.
I figured keeping all these file handles open rather than doing a significant amount of writing to the disk of new temporary files is the lesser of two evils. I was just worried about the efficiency of having so many file handles open.
Typically, I'll open a file, seek to an offset, read some data, then 5 seconds later do the same thing but at another offset, and do all this on thousands of files within a 2-minute timeframe.
Is that going to be a problem?
A followup: really, I'm asking which is better: to leave these thousands of file handles open, or to constantly close them and re-open them only at the instant I need them.
Some systems may limit the number of file descriptors that a single process can have open simultaneously. 1024 is a common default, so if you need "thousands" open at once, you might want to err on the side of portability and design your application to use a smaller pool of open file descriptors.
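On Unix-like systems you can check that limit from Python with the resource module (a quick sketch; this module is not available on Windows):

import resource

# The soft limit is what actually applies to this process; 1024 is a common default.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('This process may open up to %d file descriptors at once' % soft)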
I recommend that you take a look at Storage.py in BitTorrent. It includes an implementation of a pool of file handles.
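If you would rather not pull in BitTorrent's code, a small LRU-style pool gives the same effect: keep the most recently used handles open and close the oldest when a cap is reached. A sketch (the FilePool name and the 512-handle cap are made up for illustration):

import collections

class FilePool(object):
    # Keep at most max_open files open, closing the least recently used.
    def __init__(self, max_open=512):
        self.max_open = max_open
        self._open = collections.OrderedDict()  # path -> file object

    def get(self, path):
        f = self._open.pop(path, None)
        if f is None:
            f = open(path, 'rb')
            if len(self._open) >= self.max_open:
                _, oldest = self._open.popitem(last=False)  # evict the LRU handle
                oldest.close()
        self._open[path] = f  # (re)insert as most recently used
        return f

    def read_at(self, path, offset, size):
        f = self.get(path)
        f.seek(offset)
        return f.read(size)

    def close_all(self):
        for f in self._open.values():
            f.close()
        self._open.clear()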
