Is there a good way to test partial write failures to files? - python

Is there a good way to test partial write failures to files? I'm particularly interested in simulating a full disk.
I have some code which modifies a file. For some failures there's nothing the code can do, eg: if the disk is unplugged while writing. But for other predictable failures, such as disk full, my code should (and can) catch the exception and undo all changes since the most recent modification began.
I think my code does this well, but I am struggling to find a way to exhaustively unit test it. It's difficult to write a unit test to limit a real file system[1]. I don't see any way to limit a BytesIO. I'm not aware of any mock packages for this.
Are there any standard tools/techniques for this before I write my own?
[1] Limiting a real file system is hard for a few reasons. The biggest difficulty is that file systems are usually limited in blocks of a few KiB, not bytes. That makes it hard to test all the unhappy paths. That is, a good test would be repeated with limits of different lengths to ensure that every individual file.write(...) call errors in at least one test, but achieving this with block sizes of, say, 4 KiB is going to be difficult.
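For concreteness, "writing my own" would presumably look something like the hypothetical wrapper below, which raises ENOSPC once a byte budget is exceeded; being byte-accurate, the test could be repeated with every possible limit:

import errno
import io

class QuotaBytesIO(io.BytesIO):
    # Hypothetical sketch: behaves like BytesIO but refuses any write that
    # would exceed a byte budget, simulating a disk that fills up at an
    # exact byte boundary.
    def __init__(self, limit):
        super().__init__()
        self.limit = limit

    def write(self, data):
        if self.tell() + len(data) > self.limit:
            raise OSError(errno.ENOSPC, 'No space left on device')
        return super().write(data)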

Disclaimer: I'm a contributor to pyfakefs.
This may be overkill for you, but you could simulate the whole file system using pyfakefs. This allows you to set the file system size beforehand. Here is a trivial example using pytest:
import os
import pytest

def test_disk_full(fs):  # fs is the pyfakefs file system fixture
    fs.set_disk_usage(100)  # sets the file system size in bytes
    os.makedirs('/foo')
    with open('/foo/bar.txt', 'w') as f:
        with pytest.raises(OSError):
            f.write('a' * 200)
            f.flush()

Related

ffmpeg in python - extracting meta data

I use ffmpeg for Python to extract meta data from video files. I think the official documentation is available here: https://kkroening.github.io/ffmpeg-python/
To extract meta data (duration, resolution, frames per second, etc.) I use the function "ffmpeg.probe" as provided. Sadly, when running it on a large amount of video files, it is rather inefficient as it seems to (obviously?) load the whole file into memory each time to read just a small amount of data.
If this is not what it does, maybe someone could explain what the cause might be for the rather extensive runtime.
Otherwise, is there any way to retrieve meta data in a more efficient way using ffmpeg or some other library?
Any feedback or help is very much appreciated.
Edit: For clarity I added my code here:
import os
import multiprocessing as mp
import ffmpeg

pool = mp.Pool()
videos = []
for file in os.listdir(directory):
    pool.apply_async(ffmpeg.probe, args=[os.path.join(directory, file)], callback=videos.append)
pool.close()
pool.join()
The definition of the directory path is missing, but it should suffice to understand what is going on.
running it on a large amount of video files
This is where multithreading/multiprocessing could potentially be helpful IF the slowdown comes from subprocess spawning (and not from actual I/O). This may not help as file I/O in general takes time compared to virtually everything else.
load the whole file into memory each time to read just a small amount of data
This is an incorrect assertion IMO. It should only read the relevant headers/packets to retrieve the metadata. You are likely paying the subprocess tax more than anything else.
a way to retrieve meta data
(1) Adding to what @Peter Hassaballeh said above, ffprobe has options to limit what it looks up. If you only need the container (format) level info, or only info for a particular stream, you can specify exactly what you need (to an extent); a sketch follows this list. This could save some time.
(2) You can try MediaInfo (another free tool like ffprobe) which you should be able to call from Python as well.
(3) If you are dealing with a particular file format, the fastest way is to decode it yourself in Python and read only the bytes that matter to you. Depending on where the current bottleneck is, it may not be that drastic an improvement, though.
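As a sketch of point (1), a limited probe via subprocess might look like the following; the helper name is made up, and the ffprobe flags restrict the JSON output to container-level fields:

import json
import subprocess

def probe_format(path):
    # Hypothetical helper: ask ffprobe for container-level metadata only,
    # instead of having it report every stream in full.
    cmd = [
        'ffprobe', '-v', 'error',
        '-show_entries', 'format=duration,size,bit_rate',
        '-of', 'json',
        path,
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(out.stdout)['format']

This drops straight into the mp.Pool pattern from the question in place of ffmpeg.probe.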
I suggest using ffprobe directly. Unfortunately ffmpeg can be CPU-expensive sometimes, but it all depends on your hardware specs.

How to perform a pickling so that it is robust against crashing?

I routinely use pickle.dump() to save large files in Python 2.7. In my code, I have one .pickle file that I continually update with each iteration of my code, overwriting the same file each time.
However, I occasionally encounter crashes (e.g. from server issues). This may happen in the middle of the pickle dump, rendering the pickle incomplete and the pickle file unreadable, and I lose all my data from the past iterations.
I guess one way I could do it is to save one .pickle file for each iteration, and combine all of them later. Are there any other recommended methods, or best practices for writing to disk that are robust to crashing?
You're effectively doing backups, as your goal is the same: disaster recovery, lose as little work as possible.
In backups, there are these standard practices, so choose whatever fits you best:
backing up
full backup (save everything each time)
incremental backup (save only what changed since the last backup)
differential backup (save only what changed since the last full backup)
dealing with old backups
circular buffer/rotating copies (delete or overwrite backups older than X days/iterations, optionally change indices in others' names)
consolidating old incremental/differential copies into the preceding full backup (as a failsafe, consolidate into a new file and only then delete the old ones)
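As an illustrative sketch of the full-backup-per-iteration plus rotating-copies combination (the helper name and keep count are made up, and the atomic rename assumes a POSIX filesystem):

import os
import pickle
import tempfile

def dump_with_rotation(obj, path, keep=3):
    # Write the new pickle to a temp file in the same directory, fsync it,
    # rotate the older copies (path.1, path.2, ...), then atomically rename
    # the temp file over the target so a crash never corrupts an existing copy.
    dir_ = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_)
    try:
        with os.fdopen(fd, 'wb') as f:
            pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
            f.flush()
            os.fsync(f.fileno())
        for i in range(keep, 0, -1):  # path -> path.1 -> path.2 -> ...
            src = path if i == 1 else '%s.%d' % (path, i - 1)
            if os.path.exists(src):
                os.rename(src, '%s.%d' % (path, i))
        os.rename(tmp, path)
    except Exception:
        os.remove(tmp)
        raise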
Are there any other recommended methods, or best practices ?
Let me mention Mike McKerns' cool dill package
import dill as pickle

pickle.dump_session(aRotatingIndexFileNAME)  # saves the Python session state
                                             # (a session may be worth many
                                             #  CPU-core hours of computing)
if needed, just
import dill as pickle; pickle.load_session( aLastSessionStateFileNAME )
Using rotating filenames is a common best practice for archiving up to some depth of roll-back capability, so it is not worth repeating the details here.
dill has literally saved me on this, and it deliberately keeps the same call signatures as pickle, for easy substitution into Python projects.

About the speed of random file read (Python)

Please take a look at the following code (kind of pseudo code):
index = db.open()
fh = open('somefile.txt', 'rb')
for i in range(1000):
    x = random_integer(1, 5000)
    pos, length = index[x]
    fh.seek(pos)
    buffer = fh.read(length)
    doSomeThingWith(buffer)
fh.close()
db.close()
I used a database to index the positions and lengths of text segments in a .txt file for random retrieval.
No wonder, if the above code is run repeatedly, the execution takes less and less time.
1) What is responsible for this speed-up? Is it because of things staying in the memory or the "caching" or something?
2) Is there any way to control it?
3) I've compared with other methods where the text segments are stored in Berkeley DB and so on. When at its fastest, the above code is faster than retrieval from Berkeley DB. How do I judge the performance of my database+file solution? I mean, is it safe to judge it as at least "fast enough"?
what is behind and responsible for this speed-up?
It could be the operating system's disk cache. http://en.wikipedia.org/wiki/Page_cache
Once you've read a chunk of a file from disk once, it will hang around in RAM for a while. RAM is orders of magnitude faster than disk, so you'll see a lot of variability in the time it takes to read random pieces of a large file.
Or, depending on what "db" is, the database implementation could be doing its own caching.
Is there any way to control it?
If it's the disk cache:
It depends on the operating system, but it's typically a pretty coarse-grained control; for example, you may be forced to disable caching for an entire volume, which would affect other processes on the system reading from that volume, and would affect every other file that lived on that volume. It would also probably require root/admin access.
See this similar question about disabling caching on Linux: Linux : Disabling File cache for a process?
Depending on what you're trying to do, you can force-flush the disk cache. This can be useful in situations where you want to run a test with a cold cache, letting you get an idea of the worst-case performance. (This also depends on your OS and may require root/admin access.)
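For example, on Linux the whole page cache can be dropped as root via /proc/sys/vm/drop_caches, or (with Python 3.3+) you can advise the kernel to drop its cached pages for just the one file you are benchmarking. A rough sketch:

import os

# Rough sketch (Linux, Python 3.3+): ask the kernel to drop the cached
# pages of one file before a cold-cache timing run.
with open('somefile.txt', 'rb') as fh:
    os.posix_fadvise(fh.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)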
If it's the database:
Depends on the database. If it's a local database, you may just be seeing disk cache effects, or the database library could be doing its own caching. If you're talking to a remote database, the caching could be happening locally or remotely (or both).
There may be configuration options to disable or control caching at either of these layers.

Python - Tailing a logfile - sleep() versus inotify?

I'm writing a Python script that needs to tail -f a logfile.
The operating system is RHEL, running Linux 2.6.18.
The normal approach I believe is to use an infinite loop with sleep, to continually poll the file.
However, since we're on Linux, I'm thinking I can also use something like pyinotify (https://github.com/seb-m/pyinotify) or Watchdog (https://github.com/gorakhargosh/watchdog) instead?
What are the pros/cons of this?
I've heard that using sleep() you can miss events if the file is growing quickly - is that possible? I thought GNU tail uses sleep as well anyhow?
Cheers,
Victor
The cleanest solution would be inotify in many ways - this is more or less exactly what it's intended for, after all. If the log file was changing extremely rapidly then you could potentially risk being woken up almost constantly, which wouldn't necessarily be particularly efficient - however, you could always mitigate this by adding a short delay of your own after the inotify filehandle returns an event. In practice I doubt this would be an issue on most systems, but I thought it worth mentioning in case your system is very tight on CPU resources.
I can't see how the sleep() approach would miss file updates except in cases where the file is truncated or rotated (i.e. renamed and another file of the same name created). These are tricky cases to handle however you do things, and you can use tricks like periodically re-opening the file by name to check for rotation. Read the tail man page because it handles many such cases, and they're going to be quite common for log files in particular (log rotation being widely considered to be good practice).
The downside of sleep() is of course that you'd end up batching up your reads with delays in between, and also that you have the overhead of constantly waking up and polling the file even when it's not changing. If you did this, say, once per second, however, the overhead probably isn't noticeable on most systems.
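For reference, a minimal sketch of the polling approach (ignoring rotation and truncation, which are discussed below):

import os
import time

def tail_f(path, interval=1.0):
    # Minimal polling tail: yield lines appended since the last poll,
    # sleeping in between when there is nothing new to read.
    with open(path, 'r') as fh:
        fh.seek(0, os.SEEK_END)  # start at the end of the file, like tail -f
        while True:
            line = fh.readline()
            if line:
                yield line
            else:
                time.sleep(interval)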
I'd say inotify is the best choice unless you want to remain compatible, in which case the simple fallback using sleep() is still quite reasonable.
EDIT:
I just realised I forgot to mention - an easy way to check for a file being renamed is to perform an os.fstat(fd.fileno()) on your open filehandle and an os.stat() on the filename you opened, and compare the results. If the os.stat() fails then the error will tell you if the file's been deleted, and if not then comparing the st_ino (the inode number) fields will tell you if the file's been deleted and then replaced with a new one of the same name.
Detecting truncation is harder - effectively your read pointer remains at the same offset in the file and reading will return nothing until the file content size gets back to where you were - then the file will read from that point as normal. If you call os.stat() frequently you could check for the file size going backwards - alternatively you could use fd.tell() to record your current position in the file and then perform an explicit seek to the end of the file and call fd.tell() again. If the value is lower, then the file's been truncated under you. This is a safe operation as long as you keep the original file position around because you can always seek back to it after the check.
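Putting those two checks together, a hypothetical helper might look like this:

import os

def check_rotated_or_truncated(fh, path):
    # Sketch of the checks described above. Returns 'rotated' if the name
    # now refers to a different inode (or is gone), 'truncated' if the file
    # shrank below our current read position, otherwise None.
    try:
        name_stat = os.stat(path)
    except OSError:
        return 'rotated'    # deleted (and perhaps not yet recreated)
    if name_stat.st_ino != os.fstat(fh.fileno()).st_ino:
        return 'rotated'    # deleted and replaced by a new file of the same name
    if name_stat.st_size < fh.tell():
        return 'truncated'  # file size fell below our current offset
    return None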
Alternatively if you're using inotify anyway, you could just watch the parent directory for changes.
Note that files can be truncated to non-zero sizes, but I doubt that's likely to happen to a log file - the common cases will be being deleted and replaced, or truncated to zero. Also, I don't know how you'd detect the case that the file was truncated and then immediately filled back up to beyond your current position, except by remembering the most recent N characters and comparing them, but that's a pretty grotty thing to do. I think inotify will just tell you the file has been modified in that case.

Any write functions In python that have the same safety as ACID does in databases

The title could probably have been put better, but anyway. I was wondering if there are any functions for writing to files that are like what the ACID properties are for databases. The reason is, I would like to make sure that the file writes I am doing won't mess up and corrupt the file if the power goes out.
Depending on what exactly you're doing with your files and the platform there are a couple options:
If you're serializing a blob from memory to disk repeatedly to maintain state (example: dhcp leases file),
if you're on a Posix system you can write your data to a temporary file and 'rename' the temporary file to your target. On Posix compliant systems this is guaranteed to be an atomic operation, shouldn't even matter if the filesystem is journaled or not. If you're on a Windows system, there's a native function named MoveFileTransacted that you might be able to utilize via bindings. But the key concept here is, the temporary file protects your data, if the system reboots the worst case is that your file contains the last good refresh of data. This option requires that you write the entire file out every time you want a change to be recorded. In the case of dhcp.leases file this isn't a big performance hit, larger files might prove to be more cumbersome.
If you're reading and writing bits of data constantly, sqlite3 is the way to go -- it supports atomic commits for groups of queries and has its own internal journal. One thing to watch out for here is that atomic commits will be slower due to the overhead of locking the database, waiting for the data to flush, etc.
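For instance, a minimal sketch of grouping related writes into one atomic transaction with the standard sqlite3 module (the table name and values are made up for illustration):

import sqlite3

conn = sqlite3.connect('state.db')
try:
    with conn:  # commits on success, rolls back if anything inside raises
        conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)')
        conn.execute('INSERT OR REPLACE INTO kv VALUES (?, ?)', ('lease-1', '10.0.0.5'))
        conn.execute('INSERT OR REPLACE INTO kv VALUES (?, ?)', ('lease-2', '10.0.0.6'))
finally:
    conn.close()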
A couple other things to consider -- if your filesystem is mounted async, writes will appear to be complete because the write() returns, but it might not be flushed to disk yet. Rename protects you in this case, sqlite3 does as well.
If your filesystem is mounted async, it might be possible to write data and move it before the data is written. So if you're on a unix system it might be safest to mount sync. That's on the level of 'people might die if this fails' paranoia though. But if it's an embedded system and it dies 'I might lose my job if this fails' is also a good rationalization for the extra protection.
The ZODB is an ACID-compliant database storage written (mostly) in Python, so in a sense the answer is yes. But I can imagine this is a bit overkill :)
Either the OS has to provide this for you, or you'll need to implement your own ACID compliance. For example, by defining 'records' in the file you write and, when opening/reading, verifying which records have been written (which may mean you need to throw away some not-fully-written data). ZODB, for example, implements this by ending a record with the size of the record itself; if you can read this size back and it matches, you know the record has been fully written.
And, of course, you always need to append records and not rewrite the entire file.
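As one variation on that idea (a length prefix rather than ZODB's trailing size; the helper names are made up), a sketch of an append-only record format that discards a partially written tail:

import struct

def append_record(f, payload):
    # Length-prefixed record, appended only; a record only counts if all
    # of its bytes made it to disk.
    f.write(struct.pack('<I', len(payload)))
    f.write(payload)
    f.flush()

def read_records(f):
    records = []
    while True:
        header = f.read(4)
        if len(header) < 4:
            break  # partial header from a crash: discard the tail
        (length,) = struct.unpack('<I', header)
        payload = f.read(length)
        if len(payload) < length:
            break  # partially written record from a crash: discard it
        records.append(payload)
    return records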
It looks to me that your main goal is to ensure the integrity of written files in case of power failures and system crashes. There a couple of things to be considered when doing this:
Ensure that your data is written to disk when you close a file. Even after you close it, some of the data may sit in the OS cache for several seconds waiting to be written to the disk. You can force writing to disk with f.flush(), followed by os.fsync(f.fileno()).
Don't modify existing data before you are certain that the updated data is safely on the disk. This part can be quite tricky (and OS/filesystem dependent).
Use a file format that helps you verify the integrity of the data (e.g. use checksums).
Another alternative is to use sqlite3.
EDIT: Regarding my second point, I highly recommend this presentation: http://www.flamingspork.com/talks/2007/06/eat_my_data.odp. This also covers issues with "atomic rename".
