Efficient writing to a Compact Flash in Python

I'm writing a GUI to perform a glorified 'dd'.
I could just subprocess out to 'dd', but I thought I might as well use Python's open()/read()/write() if I can, as it'll let me display progress much more easily.
Prompted by this link here I have:
input = open('filename.img', 'rb')
output = open("/dev/sdc", 'wb')
while True:
    buffer = input.read(1024)
    if buffer:
        output.write(buffer)
    else:
        break
input.close()
output.close()
...however it is horribly slow, or at least far slower than dd (around 4-5x slower).
I had a play and noticed that altering the number of bytes read ('buffered') per iteration had a huge effect on the speed of completion. Raising it to 2048, for example, seems to halve the time taken. Perhaps going OT for SO here, but I guess the flash has an optimum number of bytes to be written at once? Could anyone suggest how I discover this?
The image and card are 1 GB, so I would very much like to get back to the ~5 minutes dd took, if possible. I appreciate that in all likelihood I won't match it.
Rather than trial and error, would anyone be able to suggest a way to optimise the above code, and the reasoning as to why it works? In particular, what value should I pass to input.read()?
One restriction: Python 2.4.3 on Linux (CentOS 5) (please don't hurt me).

The dependence of speed on buffer size is not specific to Compact Flash; it is inherent to all I/O with (relatively) slow devices, and indeed to system calls in general. You should make the buffer size as large as possible without exhausting memory; 2 MiB should be plenty for a flash drive.
You should use the time and strace utilities to determine why your program is slower. If time shows a large user/real ratio (say, greater than 0.1), the time is going into the Python side and you can optimise there; CPython 2.4 is pretty slow, and you're creating new objects all the time instead of writing into a preallocated buffer. If there is a significant difference in the sys timings, analyse the syscalls made by both programs (with strace) and try to issue the same ones dd does.
Also note that you must call fsync (or run the sync program) afterwards to measure the real time it took to write the file to disk (or open the output file with O_DIRECT). Otherwise the operating system will let your program exit while keeping all the written data in buffers that are only gradually written out to the actual disk. To test that you're doing it right, remove the disk immediately after your program finishes. Note that the speed difference can be staggering. This effect is less noticeable if your disk (CF card) is much larger than the available physical memory.
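For illustration, here is a minimal sketch of that kind of chunked copy (the 2 MiB buffer and the update_progress callback are placeholders to adapt, not measured or tested values); it avoids the with statement so it still runs on Python 2.4:
import os

BUF_SIZE = 2 * 1024 * 1024  # 2 MiB, as suggested above; tune for your card

src = open('filename.img', 'rb')
dst = open('/dev/sdc', 'wb')
copied = 0
while True:
    chunk = src.read(BUF_SIZE)
    if not chunk:
        break
    dst.write(chunk)
    copied += len(chunk)
    # update_progress(copied)  # hypothetical GUI progress hook
dst.flush()
os.fsync(dst.fileno())  # force the data out of the OS cache before timing/ejecting
src.close()
dst.close()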

So with a little help, I've removed the 'buffer' bit completely and added an os.fsync().
import os

input = open('filename.img', 'rb')
output = open("/dev/sdc", 'wb')
output.write(input.read())
output.flush()
os.fsync(output.fileno())  # make sure the data actually reaches the card
input.close()
output.close()

Related

Is there a good way to test partial write failures to files?

Is there a good way to test partial write failures to files? I'm particularly interested in simulating a full disk.
I have some code which modifies a file. For some failures there's nothing the code can do, eg: if the disk is unplugged while writing. But for other predictable failures, such as disk full, my code should (and can) catch the exception and undo all changes since the most recent modification began.
I think my code does this well, but am struggling to find a way to exhaustively unit test it. It's difficult to write a unit test that limits a real file system¹. I don't see any way to limit a BytesIO. I'm not aware of any mock packages for this.
Are there any standard tools/techniques for this before I write my own?
¹ Limiting a real file system is hard for a few reasons. The biggest difficulty is that file systems are usually limited in blocks of a few KiB, not bytes, which makes it hard to exercise all the unhappy paths. That is, a good test would be repeated with limits of different sizes so that every individual file.write(...) call fails in at least one test, but achieving that with block sizes of, say, 4 KiB is going to be difficult.
Disclaimer: I'm a contributor to pyfakefs.
This may be overkill for you, but you could simulate the whole file system using pyfakefs. This allows you to set the file system size beforehand. Here is a trivial example using pytest:
import os
import pytest

def test_disk_full(fs):  # fs is the file system fixture
    fs.set_disk_usage(100)  # sets the file system size in bytes
    os.makedirs('/foo')
    with open('/foo/bar.txt', 'w') as f:
        with pytest.raises(OSError):
            f.write('a' * 200)
            f.flush()

Rewrite a file in-place with Python

It might depend on the OS and also be hardware-dependent, but is there a way in Python to ask a write operation on a file to happen "in-place", i.e. at the same place as the original file, i.e. if possible on the same sectors on disk?
Example: let's say sensitivedata.raw, a 4 KB file, has to be encrypted:
with open('sensitivedata.raw', 'r+') as f:  # read/write mode
    s = f.read()
    cipher = encryption_function(s)  # same exact length as input s
    f.seek(0)
    f.write(cipher)  # how to ask this write operation to overwrite the original bytes?
Example 2: replace a file's contents with null bytes of the same size, so that an undelete tool cannot recover it (of course, to do it properly we would need several passes with random data rather than only null bytes, but this is just to give the idea):
import os

with open('sensitivedata.raw', 'r+') as f:
    s = f.read()
    f.seek(0)
    f.write(len(s) * '\x00')  # totally inefficient but just to get the idea
os.remove('sensitivedata.raw')
PS: if it really depends a lot on the OS, I'm primarily interested in the Windows case.
Side-question: if it's not possible in the case of an SSD, does this mean that if you once in your life wrote sensitive data as plaintext to an SSD (for example a password in plaintext, a crypto private key, or anything else), then there is no way to be sure that this data is really erased? i.e. the only solution is to wipe 100% of the disk with many passes of random bytes? Is that correct?
That's an impossible requirement to impose. While on most spinning disk drives, this will happen automatically (there's no reason to write the new data elsewhere when it could just overwrite the existing data directly), SSDs can't do this (when they claim to do so, they're lying to the OS).
SSDs can't rewrite blocks; they can only erase a block, or write to an empty block. The implementation of a "rewrite" is to write to a new block (reading from the original block to fill out the block if there isn't enough new data), then (eventually, cause it's relatively expensive) erase the old block to make it available for a future write.
Update addressing the side-question: The only truly secure solution is to run your drive through a woodchipper, then crush the remains with a millstone. :-) Really, in most cases the window of vulnerability on an SSD should be relatively short; erasing sectors is expensive, so even SSDs that don't honor TRIM typically do it in the background to ensure future (cheap) write operations aren't held up by (expensive) erase operations. This isn't really so bad when you think about it: sure, the data is visible for a period of time after you logically erased it, but it was visible for a period of time before you erased it too. All this does is extend the window of vulnerability by seconds, minutes, hours or days, depending on the drive; the mistake was storing sensitive data to permanent storage in the first place. Even with extreme (woodchipper + millstone) solutions, someone else could have snuck in and copied the data before you thought to encrypt/destroy it.
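For the spinning-disk case described above, a rough sketch of forcing the overwrite through the OS cache before unlinking might look like this; it assumes the filesystem really does rewrite the same blocks, which, as noted, no SSD guarantees:
import os

def overwrite_then_remove(path):
    # Best-effort single zero pass; only meaningful where the filesystem
    # actually reuses the original blocks (typical for spinning disks, not SSDs).
    size = os.path.getsize(path)
    f = open(path, 'r+b')
    f.write(b'\x00' * size)
    f.flush()
    os.fsync(f.fileno())  # push the zeros out of the OS cache to the device
    f.close()
    os.remove(path)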

Best Approach for I/O Bound Problems?

I am currently running code on an HPC cluster that writes several 16 MB files to disk (in the same directory) for a short period of time and then deletes them. They are written to disk and then deleted sequentially. However, the total number of I/O operations exceeds 20,000 * 12,000.
I am using the joblib module in Python 2.7 to run my code on several cores. It's basically a nested loop problem, with the outer loop parallelised by joblib and the inner loop run sequentially inside the function. In total it's a 20,000 * 12,000 loop.
The basic skeleton of my code is the following.
from joblib import Parallel, delayed
import subprocess

def f(a, b, c, d):
    cmds = 'path/to/a/bash_script_on_disk with arguments from a,b > save_file_to_disk'
    subprocess.check_output(cmds, shell=True)
    cmds1 = 'path/to/a/second_bash_script_on_disk > save_file_to_disk'
    subprocess.check_output(cmds1, shell=True)
    # The structure above is repeated several times.
    # However, I do delete the files as soon as I can using:
    cmds2 = 'rm -rf files'
    subprocess.check_output(cmds2, shell=True)
    # This is followed by the second/inner loop.
    for i in range(12000):
        # Do some computation, create and delete files in each iteration.
        pass

if __name__ == '__main__':
    num_cores = 48
    # range(20000) is batched by a wrapper script that sends no more
    # than 48 jobs per node (max. cores available).
    Parallel(n_jobs=num_cores)(delayed(f)(a, b, c, d) for i in range(20000))
This code is extremely slow and the bottleneck is the I/O time. Is this a good use case for temporarily writing the files to /dev/shm/? I have 34 GB of space available as tmpfs on /dev/shm/.
Things I already tested:
I tried to set up the same code on a smaller scale on my laptop which has 8 cores. However, writing to /dev/shm/ ran slower than writing to disk.
Side note: the inner loop can be parallelised too; however, the number of cores I have available is far fewer than 20,000, which is why I am sticking to this configuration. Please let me know if there are better ways to do this.
First, do not talk about the total number of I/O operations; that is meaningless. Instead, talk about IOPS and throughput.
Second, it is almost impossible for writing to /dev/shm/ to be slower than writing to disk. Please provide more information. You can test write performance using fio, example command: sudo fio --name fio_test_file --rw=read --direct=1 --bs=4k --size=50M --numjobs=16 --group_reporting, and my test result was: bw=428901KB/s, iops=107225.
Third, you are really writing too many files; you should rethink your structure.
It depends on your temporary data size.
If you have much more memory than you're using for the data, then yes - shm will be a good place for it. If you're going to write almost as much as you've got available, then you're likely going to start swapping - which would kill the performance of everything.
If you can fit your data in memory, then tmpfs by definition will always be faster than writing to a physical disk. If it isn't, then there are more factors impacting your environment. Running your code under a profiler would be a good idea in this case.
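If you do decide to try tmpfs, a minimal sketch along these lines might help; the /dev/shm path check and the run_step structure are assumptions standing in for your real bash-script pipeline:
import os
import tempfile

# Use tmpfs for the short-lived intermediate files if it exists,
# otherwise fall back to the default temporary directory.
SCRATCH_DIR = '/dev/shm' if os.path.isdir('/dev/shm') else None

def run_step(data):
    # data is expected to be bytes; this stands in for the scripts'
    # "> save_file_to_disk" redirection in the original skeleton.
    tmp = tempfile.NamedTemporaryFile(dir=SCRATCH_DIR, suffix='.dat', delete=False)
    try:
        tmp.write(data)
        tmp.close()
        # ... hand tmp.name to the next command here ...
    finally:
        os.unlink(tmp.name)  # delete as soon as the file is no longer needed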

Using more memory than available

I have written a program that expands a database of prime numbers. This program is written in python and runs on windows 10 (x64) with 8GB RAM.
The program stores all primes it has found in a list of integers for further calculations and uses approximately 6-7 GB of RAM while running. During some runs, however, this figure has dropped to below 100 MB. The memory usage then stays low for the duration of the run, although it increases as expected as more numbers are added to the prime array. Note that not all runs result in a memory drop.
(Memory usage was measured with Task Manager.)
These seemingly random drops have led me to the following theories:
There's a bug in my code, making it drop critical data and messing up the results (most likely but not supported by the results)
Python just happens to optimize my code extremely well after a while.
Python or Windows is compensating for my over-usage of the RAM by cleaning out portions of my prime-number array that aren't used that much. (eventually resulting in incorrect calculations)
Python or Windows is compensating for my over-usage of the RAM by allocating disk space instead of ram.
Questions
What could be the reason(s) for this memory drop?
How does python handle programs that use more than available RAM?
How does Windows handle programs that use more than available RAM?
1, 2, and 3 are incorrect theories.
4 is correct. Windows (not Python) is moving some of your process memory to swap space. This is almost totally transparent to your application - you don't need to do anything special to respond to or handle this situation. The only thing you will notice is your application may get slower as information is written to and read from disk. But it all happens transparently. See https://en.wikipedia.org/wiki/Virtual_memory for more information.
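If you want to see this from inside the program rather than in Task Manager, here is a small sketch using the third-party psutil package (an assumption; your code doesn't already use it). It logs both the resident working set and the committed size; the committed size stays large even while Windows pages the prime list out:
import time
import psutil  # third-party: pip install psutil

proc = psutil.Process()
for _ in range(10):
    mem = proc.memory_info()
    # rss is roughly the "working set" Task Manager shows; vms is the
    # committed size, which stays large even when pages are swapped out.
    print('working set: %d MB, committed: %d MB'
          % (mem.rss // 2**20, mem.vms // 2**20))
    time.sleep(10)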
Have you heard of paging? Windows dumps some RAM (that hasn't been used in a while) to your hard drive to keep your computer from running out of RAM and ultimately crashing.
Windows, not Python, deals with memory management here. And if you use Windows 10, it will also compress memory, somewhat like a zip file.

Python IMAP search, search results exhaust all memory

I'm trying to fetch all auto-response emails from a specific address in Python using imaplib. Everything worked fine for weeks, but now each time I run my program all my RAM is consumed (several GB!) and the script ends up being killed by the OOM killer.
Here is the code I'm currently using:
import datetime
import imaplib

M = imaplib.IMAP4_SSL('server')
M.login('user', 'pass')
M.select()
date = (datetime.date.today() - datetime.timedelta(1)).strftime("%d-%b-%Y")
result, data = M.uid('search', None, '(SENTON %s HEADER FROM "auto#site.com" NOT SUBJECT "RE:")' % date)
...
I'm sure that fewer than 100 emails of a few kilobytes each should be returned. What could be the matter here? Or is there a way to limit the number of emails returned?
Thx!
There's no way to know for sure what the cause is, without being able to reproduce the problem (and certainly not without seeing the complete program which triggers the problem, and knowing the version of all dependencies you're using).
However, here's my best guess. Several versions of Python include a very memory-wasteful implementation of imaplib. The problem is particularly evident on Windows, but not limited to that platform.
The core of the problem is the way strings are allocated when read from a socket, and the way imaplib reads strings from sockets.
When reading from a socket, Python first allocates a buffer large enough to handle as many bytes as the application asks for. This may be something reasonable sounding, perhaps 16 kB. Then data is read into that buffer and the buffer is resized down to fit the number of bytes actually read.
The efficiency of this operation depends on the quality of the platform re-allocation implementation. Resizing a buffer may end up moving it to a more suitable location, where the smaller size avoids wasting much memory. Or it may just mark the tail part of the memory, no longer allocated as part of that region, as re-usable (and it may even be able to re-use it in practice). Or it might end up wasting that technically unallocated memory.
Imagine the cumulative effects of that memory being wasted if you have to read a few dozen kB of data, and the data arrives from the network a few dozen bytes at a time. Worse, imagine if the data is really trickling, and you only get a few bytes at a time. Or if you're reading a very "large" response of several hundred kB.
The amount of memory wasted - effectively allocated by the process, but not usable in any meaningful way - can be huge. 100 kB of data, read 5 bytes at a time, requires 20,480 buffers. If each buffer starts off as 16 kB and is unsuccessfully shrunk, so that it stays at 16 kB, then you've allocated at least 320 MB of memory to hold that 100 kB of data.
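A quick sanity check of that arithmetic in Python:
data_size = 100 * 1024           # 100 kB of response data
reads = data_size // 5           # arriving 5 bytes at a time -> 20480 reads
wasted = reads * 16 * 1024       # each read's buffer pinned at 16 kB
print(reads, wasted // 2**20)    # 20480 buffers, ~320 MB held for 100 kB of data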
Some versions of imaplib exacerbated this problem by introducing multiple layers of buffering and copying. A very old version (hopefully not one you're actually using) even read 1 byte at a time (which would result in 1.6 GB of memory usage in the above scenario).
Of course, this problem usually doesn't show up on Linux, where the re-allocator is not so bad. And at various points in previous Python releases (previous to the most recent 2.x release), the bug was "fixed", so I wouldn't expect to see it show up these days. And this doesn't explain why your program ran fine for a while before failing this way.
But it is my best guess.
