Rewrite a file in-place with Python

It might depend on the OS and even on the hardware, but is there a way in Python to ask a write operation on a file to happen "in-place", i.e. at the same location as the original file and, if possible, on the same sectors on disk?
Example: let's say sensitivedata.raw, a 4KB file, has to be encrypted:
with open('sensitivedata.raw', 'r+') as f:  # read write mode
    s = f.read()
    cipher = encryption_function(s)  # same exact length as input s
    f.seek(0)
    f.write(cipher)  # how to ask this write operation to overwrite the original bytes?
Example 2: replace a file with null-byte content of the same size, to prevent an undelete tool from recovering it (of course, to do it properly you would need several passes with random data rather than only null bytes, but this is just to give the idea):
import os

with open('sensitivedata.raw', 'r+') as f:
    s = f.read()
    f.seek(0)
    f.write(len(s) * '\x00')  # totally inefficient but just to get the idea
os.remove('sensitivedata.raw')
PS: if it really depends a lot on the OS, I'm primarily interested in the Windows case.
Side-question: if it's not possible in the case of an SSD, does this mean that if you ever wrote sensitive data as plaintext on an SSD (for example a password, a crypto private key, or anything else), there is no way to be sure that this data is really erased? i.e. is the only solution to wipe the whole disk with many passes of random bytes? Is that correct?

That's an impossible requirement to impose. While on most spinning disk drives, this will happen automatically (there's no reason to write the new data elsewhere when it could just overwrite the existing data directly), SSDs can't do this (when they claim to do so, they're lying to the OS).
SSDs can't rewrite blocks; they can only erase a block or write to an empty block. The implementation of a "rewrite" is to write to a new block (reading from the original block to fill out the block if there isn't enough new data), then (eventually, because it's relatively expensive) erase the old block to make it available for a future write.
Update addressing side-question: The only truly secure solution is to run your drive through a woodchipper, then crush the remains with a millstone. :-) Really, in most cases the window of vulnerability on an SSD should be relatively short; erasing sectors is expensive, so even SSDs that don't honor TRIM typically do it in the background to ensure future (cheap) write operations aren't held up by (expensive) erase operations.

This isn't really so bad when you think about it: sure, the data is visible for a period of time after you logically erased it. But it was visible for a period of time before you erased it too, so all this does is extend the window of vulnerability by seconds, minutes, hours or days, depending on the drive; the mistake was storing sensitive data to permanent storage in the first place. Even with extreme (woodchipper+millstone) solutions, someone else could have snuck in and copied the data before you thought to encrypt/destroy it.
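For what it's worth, here is a best-effort sketch of the question's second example, assuming a spinning disk where r+ writes do land on the same sectors (on an SSD, as explained above, there is no such guarantee); the function name and pass count are illustrative:

import os

def best_effort_overwrite(path, passes=3):
    # Overwrite the file's bytes in place, then delete it.
    # Best effort only: an SSD controller may still remap blocks.
    size = os.path.getsize(path)
    with open(path, 'r+b') as f:
        for _ in range(passes):
            f.seek(0)
            f.write(os.urandom(size))   # random data, not just null bytes
            f.flush()
            os.fsync(f.fileno())        # push the bytes past the OS cache
    os.remove(path)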

Related

Efficient writing to a Compact Flash in python

I'm writing a GUI to perform a glorified 'dd'.
I could just subprocess to 'dd' but I thought I might as well use python's open()/read()/write() if I can as it'll let me display progress much more easily.
Prompted by this link here I have:
input = open('filename.img', 'rb')
output = open("/dev/sdc", 'wb')
while True:
    buffer = input.read(1024)
    if buffer:
        output.write(buffer)
    else:
        break
input.close()
output.close()
...however it is horribly slow. Or at least far slower than dd. (around 4-5x slower)
I had a play and noticed that altering the number of bytes 'buffered' had a huge effect on the speed of completion. Raising it to 2048, for example, seems to halve the time taken. Perhaps going OT for SO here, but I guess the flash has an optimum number of bytes to be written at once? Could anyone suggest how I discover this?
The image & card are 1Gb so I would very much like to return to the ~5 minutes dd took if possible. I appreciate that in all likelihood I won't match it.
Rather than trial and error, would anyone be able to suggest a way to optimise the above code, and the reasoning as to why it works? In particular, what value to use for input.read()?
One restriction: python 2.4.3 on linux (centos5) (please don't hurt me)
Speed depending on buffer size is not specific to compact flash; it is inherent to all I/O with (relatively) slow devices, and indeed to anything that involves a system call per chunk. You should make the buffer size as large as possible without exhausting memory; 2 MiB should be plenty for a flash drive.
You should use the time and strace utilities to determine why your program is slower. If time shows a large user time relative to real time (say, a ratio greater than 0.1), you can optimize your Python code: CPython 2.4 is pretty slow, and you're creating new objects all the time instead of writing into a preallocated buffer. If there is a significant difference in the sys timings, analyze the syscalls made by both programs (with strace) and try to issue the same ones dd does.
Also note that you must call fsync (or run the sync program) afterwards to measure the real time it took to write the file to disk (or open the output file with O_DIRECT). Otherwise, the operating system will let your program exit and just keep the written data in buffers that are then continually written out to the actual disk. To check that you're doing it right, remove the disk immediately after your program finishes. Note that the speed difference can be staggering. This effect is less noticeable if the amount of data you write is much larger than the available physical memory.
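As an illustration of the advice above, here is a sketch of the copy loop with a 2 MiB buffer, a progress hook, and an fsync at the end (the paths and the progress reporting are placeholders):

import os

BUF_SIZE = 2 * 1024 * 1024  # 2 MiB, as suggested above

src = open('filename.img', 'rb')
dst = open('/dev/sdc', 'wb')
copied = 0
while True:
    buf = src.read(BUF_SIZE)
    if not buf:
        break
    dst.write(buf)
    copied += len(buf)
    # update the GUI progress bar here using 'copied'
dst.flush()
os.fsync(dst.fileno())   # make sure the data has actually reached the device
src.close()
dst.close()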
So with a little help, I've removed the 'buffer' bit completely and added an os.fsync().
import os

input = open('filename.img', 'rb')
output = open("/dev/sdc", 'wb')
output.write(input.read())
input.close()
output.flush()
os.fsync(output.fileno())
output.close()

Python - HardDrive access when opening files

If you open a file for reading, read from it, close it, and then repeat the process (in a loop), does Python continually access the hard drive? Because it doesn't seem to from my experimentation, and I'd like to understand why.
An (extremely) simple example:
import time

while True:
    file = open('/var/log/messages', 'r')
    stuff = file.read()
    file.close()
    time.sleep(2)
If I run this, my hard-drive access LED lights up once and then the hard-drive remains dormant. How is this possible? Is python somehow keeping the contents of the file stored in RAM? Because logic would tell me that it should be accessing the hard-drive every two seconds.
The answer depends on your operating system and the type of hard drive you have. Most of the time, when you access something off the drive, the information is cached in main memory in case you need it again soon. Depending on the replacement strategy used by your OS, the data may stay in main memory for a while or be replaced relatively soon.
In some cases, your hard drive will cache frequently accessed information in its own memory, and then while the drive might be accessed, it will retrieve the information faster and send it to the processor faster than if it had to search the drive platters.
Likely your operating system or file-system is smart enough to serve the file from the OS cache if the file has not changed in between.
Python does not cache, the operating system does. You can find out the size of these caches with top. In the third line, it will say something like:
Mem: 5923332k total, 5672832k used, 250500k free, 43392k buffers
That means about 43 MB is being used by the OS to cache data recently written to or read from the hard disk. You can't turn this caching off, but you can drop the current caches by writing 2 or 3 to /proc/sys/vm/drop_caches (the OS will start caching again right away).
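A small experiment you could run (on Linux, as root, since writing to drop_caches requires it) to see the cache at work; the path and timings are purely illustrative:

import os
import time

def timed_read(path):
    start = time.time()
    with open(path, 'rb') as f:
        f.read()
    return time.time() - start

print("first read:  %.3fs" % timed_read('/var/log/messages'))
print("cached read: %.3fs" % timed_read('/var/log/messages'))  # served from RAM

# Drop the page cache and read again -- the disk LED should light up
# and the read should take noticeably longer.
os.system('sync')
with open('/proc/sys/vm/drop_caches', 'w') as f:
    f.write('3\n')
print("uncached read: %.3fs" % timed_read('/var/log/messages'))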

Any write functions in Python that have the same safety as ACID does in databases

The title could probably have been put better, but anyway: I was wondering if there are any functions for writing to files that offer something like what the ACID properties are for databases. The reason is, I would like to make sure that the file writes I am doing won't mess up and corrupt the file if the power goes out.
Depending on what exactly you're doing with your files and the platform there are a couple options:
If you're serializing a blob from memory to disk repeatedly to maintain state (for example, a dhcp leases file):
if you're on a POSIX system you can write your data to a temporary file and rename the temporary file onto your target. On POSIX-compliant systems this is guaranteed to be an atomic operation; it shouldn't even matter whether the filesystem is journaled or not. If you're on a Windows system, there's a native function named MoveFileTransacted that you might be able to utilize via bindings. The key concept here is that the temporary file protects your data: if the system reboots, the worst case is that your file contains the last good refresh of data. This option requires that you write the entire file out every time you want a change recorded. In the case of a dhcp.leases file this isn't a big performance hit; larger files might prove to be more cumbersome.
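A minimal sketch of that write-temp-then-rename pattern, assuming POSIX rename semantics (the function name and error handling are illustrative):

import os
import tempfile

def atomic_write(path, data):
    # Readers see either the old contents or the new contents,
    # never a partially written file.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dirname)  # temp file on the same filesystem
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # make sure the bytes are on disk first
        os.rename(tmp_path, path)  # atomic on POSIX; on Windows use MoveFileTransacted-style APIs
    except Exception:
        os.unlink(tmp_path)
        raise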
If you're reading and writing bits of data constantly, sqlite3 is the way to go -- it supports atomic commits for groups of queries and has its own internal journal. One thing to watch out for here is that atomic commits will be slower due to the overhead of locking the database, waiting for the data to flush, etc.
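And a minimal sqlite3 sketch of grouping writes into one atomic transaction (the table and file names are made up):

import sqlite3

conn = sqlite3.connect('state.db')
conn.execute('CREATE TABLE IF NOT EXISTS leases (mac TEXT PRIMARY KEY, ip TEXT)')

# Either both rows are on disk after the commit, or neither is.
with conn:  # commits on success, rolls back on exception
    conn.execute('INSERT OR REPLACE INTO leases VALUES (?, ?)',
                 ('aa:bb:cc:dd:ee:ff', '192.168.0.10'))
    conn.execute('INSERT OR REPLACE INTO leases VALUES (?, ?)',
                 ('11:22:33:44:55:66', '192.168.0.11'))
conn.close()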
A couple other things to consider -- if your filesystem is mounted async, writes will appear to be complete because the write() returns, but it might not be flushed to disk yet. Rename protects you in this case, sqlite3 does as well.
If your filesystem is mounted async, it might be possible to write data and move it before the data is written. So if you're on a unix system it might be safest to mount sync. That's on the level of 'people might die if this fails' paranoia though. But if it's an embedded system and it dies 'I might lose my job if this fails' is also a good rationalization for the extra protection.
The ZODB is an ACID-compliant database storage written (mostly) in Python, so in a sense the answer is yes. But I can imagine this is a bit overkill :)
Either the OS has to provide this for you, or you'll need to implement your own ACID compliance. For example, by defining 'records' in the file you write and, when opening/reading, verifying which records have been written (which may mean you need to throw away some not-fully-written data). ZODB, for example, implements this by ending each record with the size of the record itself; if you can read this size and it matches, you know the record has been fully written.
And, of course, you always need to append records and not rewrite the entire file.
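A rough sketch of that record idea, using a length prefix plus a CRC; this is an illustrative framing, not ZODB's actual format:

import struct
import zlib

def append_record(f, payload):
    # Frame the payload with its length and a CRC32 checksum.
    header = struct.pack('<II', len(payload), zlib.crc32(payload) & 0xffffffff)
    f.write(header + payload)
    f.flush()
    # add os.fsync(f.fileno()) here if it must be durable before returning

def read_records(f):
    # Yield complete records; stop at the first truncated or corrupt one.
    while True:
        header = f.read(8)
        if len(header) < 8:
            return                  # clean end of file, or a torn header
        length, crc = struct.unpack('<II', header)
        payload = f.read(length)
        if len(payload) < length or zlib.crc32(payload) & 0xffffffff != crc:
            return                  # torn or corrupt tail -- discard it
        yield payload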
It looks to me that your main goal is to ensure the integrity of written files in case of power failures and system crashes. There are a couple of things to consider when doing this:
Ensure that your data is written to disk when you close a file. Even after you close it, some of the data may sit in the OS cache for several seconds waiting to be written to the disk. You can force writing to disk with f.flush(), followed by os.fsync(f.fileno()).
Don't modify existing data before you are certain that the updated data is safely on the disk. This part can be quite tricky (and OS/filesystem dependent).
Use file format that helps you to verify the integrity of data (e.g. use checksums).
Another alternative is to use sqlite3.
EDIT: Regarding my second point, I highly recommend this presentation: http://www.flamingspork.com/talks/2007/06/eat_my_data.odp. This also covers issues with "atomic rename".

Deleting an arbitrary chunk of a file

What is the most efficient way to delete an arbitrary chunk of a file, given the start and end offsets? I'd prefer to use Python, but I can fall back to C if I have to.
Say the file is this
..............xxxxxxxx----------------
I want to remove a chunk of it:
..............[xxxxxxxx]----------------
After the operation it should become:
..............----------------
Reading the whole thing into memory and manipulating it in memory is not a feasible option.
The best performance will almost invariably be obtained by writing a new version of the file and then atomically renaming it over the old version, because filesystems are strongly optimized for such sequential access, and so is the underlying hardware (with the possible exception of some of the newest SSDs, but even then it's an iffy proposition). In addition, this avoids destroying data in case of a system crash at any time: you're left with either the old version of the file intact, or the new one in its place. Since every system could always crash at any time (and by Murphy's Law it will choose the most unfortunate moment;-), integrity of data is generally considered very important (often the data is more valuable than the system on which it's kept -- hence "mirroring" RAID solutions to ensure against disk crashes losing precious data;-).
If you accept this sane approach, the general idea is: open the old file for reading, the new one for writing (creation); copy N1 bytes over from the old file to the new one; then skip N2 bytes of the old file; then copy the rest over; close both files; atomically rename new to old. (Windows apparently has no "atomic rename" system call usable from Python -- to keep integrity in that case, instead of the atomic rename you'd do three steps: rename the old file to a backup name, rename the new file to the old name, delete the backup-named file. In case of a system crash during the second of these three very fast operations, one rename is all it will take to restore data integrity.)
N1 and N2, of course, are the two parameters saying where the deleted piece starts, and how long it is. For opening the files, with open('old.dat', 'rb') as oldf: and with open('NEWold.dat', 'wb') as newf: statements, nested into each other, are clearly best (the rest of the code, up to the rename step, must of course be nested inside both).
For the "copy the rest over" step, shutil.copyfileobj is best (be sure to specify a buffer length that's comfortably going to fit in your available RAM, but a large one will tend to give better performance). The "skip" step is clearly just a seek on the oldf open-for-reading file object. For copying exactly N1 bytes from oldf to newf, there is no direct support in Python's standard library, so you have to write your own, e.g:
def copyN1(oldf, newf, N1, buflen=1024*1024):
    while N1 > 0:
        data = oldf.read(min(N1, buflen))
        if not data:
            break               # source ended early
        newf.write(data)
        N1 -= len(data)
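Putting the pieces together, a sketch of the whole operation (the delete_chunk name and the .tmp suffix are illustrative):

import os
import shutil

def delete_chunk(path, start, length, buflen=1024*1024):
    # Remove 'length' bytes beginning at offset 'start' by rewriting the file.
    tmp_path = path + '.tmp'
    with open(path, 'rb') as oldf:
        with open(tmp_path, 'wb') as newf:
            copyN1(oldf, newf, start, buflen)       # bytes before the chunk
            oldf.seek(length, 1)                    # skip the chunk itself
            shutil.copyfileobj(oldf, newf, buflen)  # the rest of the file
            newf.flush()
            os.fsync(newf.fileno())
    os.rename(tmp_path, path)  # atomic on POSIX; use the three-step dance on Windows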
I'd suggest memory mapping. Though it does manipulate the file through memory, it is more efficient than plain reading of the whole file into memory.
Well, you have to manipulate the file contents in memory one way or another, as there's no system call for such an operation in either *nix or Windows (at least none that I'm aware of).
Try mmaping the file. This won't necessarily read it all into memory at once.
If you really want to do it by hand, choose some chunk size and do back-and-forth reads and writes. But the seeks are going to kill you...
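A sketch of the mmap suggestion (the function name is illustrative; note that this edits the file in place, so a crash partway through leaves it inconsistent, unlike the copy-and-rename approach above):

import mmap
import os

def delete_chunk_inplace(path, start, length):
    # Shift the tail of the file left over the deleted chunk, then truncate.
    size = os.path.getsize(path)
    with open(path, 'r+b') as f:
        mm = mmap.mmap(f.fileno(), size)
        # move(dest, src, count): copy the tail down over the chunk
        mm.move(start, start + length, size - (start + length))
        mm.flush()
        mm.close()
        f.truncate(size - length)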

Data Carving loop improvements

I currently have a Python app I am developing which will data-carve a block device for JPEG files. Let's just say that it sometimes works and sometimes doesn't. I have created it so that I read the block device until I find an ffd8, then I keep the stream open and loop looking for the ffd9 closure. I always need to take into account every ffd9 closure, not just the first, so it tends to be a really intensive operation. Given a device with, say, 25 JPEGs as well as lots of other data, the looping is pretty dramatic and it runs through a lot.
The program is not the slowest thing in the world, but I think it could be much faster and much more efficient. I am looking for a better way to search the block device and extract the data in a more efficient manner. I also don't want to kill the HDD or the drive holding the image of the block device.
So does anybody know of a better way to systematically handle the searching and extraction of the data?
The trouble with reading the block device directly is that there is no guarantee that the blocks of any given file are contiguous. That means that even if you find your magic marker bytes 0xFFD8 in block 13, say, there is no guarantee that block 14 belongs to the same file, whether or not it contains the 0xFFD9 end marker. (Most files will start on a block boundary; the end of a file may be anywhere, possibly even straddling a block boundary.)
What's the better way to deal with it? Well, it depends what you're after. If you're looking only at currently allocated blocks, then scan the file system using the Python analogue of the POSIX C function ftw (nftw), i.e. os.walk, and read each file in turn. This won't find evidence of deleted JPEG files in the free list; if that's what you are after, then you'll need to do more or less what you are doing, and correlate that information with what you find in the file system proper. Mapping those blocks will (at best) be hard.
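If you do stay with the raw scan, reading the device sequentially in large chunks and searching each chunk for both markers keeps the drive access pattern friendly. A sketch (the 4 MiB chunk size and function name are arbitrary, and, as noted above, a marker pair does not prove the blocks belong to one file):

SOI = b'\xff\xd8'          # JPEG start-of-image marker
EOI = b'\xff\xd9'          # JPEG end-of-image marker
CHUNK = 4 * 1024 * 1024    # 4 MiB reads keep the disk access sequential

def find_markers(device_path):
    # Yield (absolute_offset, marker) for every SOI/EOI on the device.
    tail = b''             # carry one byte over so a split marker isn't missed
    pos = 0                # absolute offset of the start of the current chunk
    with open(device_path, 'rb') as dev:
        while True:
            chunk = dev.read(CHUNK)
            if not chunk:
                break
            data = tail + chunk
            base = pos - len(tail)
            hits = []
            for marker in (SOI, EOI):
                i = data.find(marker)
                while i != -1:
                    hits.append((base + i, marker))
                    i = data.find(marker, i + 1)
            for hit in sorted(hits):
                yield hit
            tail = data[-1:]
            pos += len(chunk)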
