I have some data in memory that I want to store in an HDF5 file.
My data are not huge (<100 MB, so they fit in memory very comfortably), so for performance it seems to make sense to keep them there. At the same time, I also want to store them on disk. It is not critical that the two copies are always exactly in sync, as long as they are both valid (i.e. not corrupted) and I can trigger a synchronization manually.
I could just keep my data in a separate container in memory, and shovel it into an HDF object on demand. If possible I would like to avoid writing this layer. It would require me to keep track of what parts have been changed, and selectively update those. I was hoping HDF would take care of that for me.
I know about the driver='core' with backing store functionality, but AFAICT it only syncs the backing store when closing the file. I can flush the file, but does that guarantee that the object is written to storage?
From looking at the HDF5 source code, it seems that the answer is yes. But I'd like to hear a confirmation.
Bonus question: Is driver='core' actually faster than normal filesystem back-ends? What do I need to look out for?
What the H5Fflush command does is ask the file system to transfer all buffers to the file.
The documentation has a specific note about it:
HDF5 does not possess full control over buffering. H5Fflush flushes
the internal HDF5 buffers then asks the operating system (the OS) to
flush the system buffers for the open files. After that, the OS is
responsible for ensuring that the data is actually flushed to disk.
In practice, I have noticed that most of the time I can read the data from an HDF5 file that has been flushed (even if the process was subsequently killed), but this is not guaranteed by HDF5: there is no safety in relying on the flush operation to produce a valid HDF5 file, as further operations (on the metadata, for instance) can corrupt the file if the process is then interrupted. You have to close the file completely to get this consistency.
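For illustration, a minimal h5py sketch of the pattern from the question (the file name and dataset are placeholders; the comments reflect the caveat above):

    import numpy as np
    import h5py

    # driver='core' keeps the whole file image in memory; backing_store=True
    # writes that image out to 'data.h5' as well.
    f = h5py.File("data.h5", "w", driver="core", backing_store=True)
    f.create_dataset("measurements", data=np.arange(1000))

    f.flush()   # flush HDF5's internal buffers and ask the OS to write them out
    # ... keep reading/writing f in memory ...
    f.close()   # only closing guarantees a fully consistent file on disk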
If you need consistency and want to avoid corrupted HDF5 files, you may like to (a rough sketch follows the list):
1) use a write-ahead log: append a log entry describing each add/update as it happens; there is no need to write to the HDF5 file at that moment.
2) periodically, or at the time you need to shut down, replay the logs and apply them one by one to the HDF5 file.
3) if your process goes down during 1), you won't lose data; after the next startup, just replay the logs and write them to the HDF5 file.
4) if your process goes down during 2), you won't lose data either; just remove the corrupted HDF5 file, replay the logs, and write it again.
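A rough sketch of that log-and-replay idea (the file names and the JSON record format are made up for the example):

    import json
    import os
    import h5py

    LOG_PATH = "updates.log"   # hypothetical paths
    HDF5_PATH = "data.h5"

    def log_update(name, values):
        """Step 1: append the change to the write-ahead log and fsync it."""
        record = json.dumps({"name": name, "values": list(values)})
        with open(LOG_PATH, "a") as log:
            log.write(record + "\n")
            log.flush()
            os.fsync(log.fileno())

    def replay_log():
        """Steps 2-4: rebuild/refresh the HDF5 file from the log."""
        if not os.path.exists(LOG_PATH):
            return
        with h5py.File(HDF5_PATH, "a") as f:
            with open(LOG_PATH) as log:
                for line in log:
                    rec = json.loads(line)
                    if rec["name"] in f:
                        del f[rec["name"]]
                    f.create_dataset(rec["name"], data=rec["values"])
        os.remove(LOG_PATH)   # drop the log only after the HDF5 file is closed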
Related
What I want to achieve has the following features:
A Python program (or process, thread, ...) creates a memory file that can be read and written.
As long as the program is alive, the file data exists only in memory (NOT on disk). Once the program is no longer alive, no data is left behind.
However, there is an interface on disk with a filename. This interface is linked to the memory file, and read and write operations on the interface are possible.
Why not use io?
The memory file will be an input to another program (not Python), so a file name is needed.
Why not use tempfile?
The major reason is security. The finalization of tempfile differs between operating systems (right?), and in some cases, such as OS interruptions, data may remain on disk. So having the program hold the data seems more secure (at least to an extent).
Anyway, I just want to try and see whether tempfile can be avoided.
You could consider using a named pipe (using mkfifo). Another option is to create an actual file which the two programs open. Once both open it, you can unlink it so that it's no longer accessible on disk.
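Both options in Python, roughly (the paths are made up; note that opening a FIFO for writing blocks until a reader opens the other end):

    import os

    # Option 1: a named pipe -- the name exists on disk, the data never does.
    fifo_path = "/tmp/exchange.fifo"      # hypothetical path
    os.mkfifo(fifo_path)
    with open(fifo_path, "wb") as fifo:   # blocks until the other program opens it for reading
        fifo.write(b"payload streamed straight to the reader")
    os.remove(fifo_path)

    # Option 2: a regular file that is unlinked once both programs have opened it.
    path = "/tmp/exchange.bin"            # hypothetical path
    f = open(path, "w+b")
    # ... wait until the other program has opened `path` ...
    os.unlink(path)                       # the name is gone; the data lives until the last handle closes
    f.write(b"only processes holding an open handle can still see this")
    f.close()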
Though I would imagine that append mode is "smart" enough to only insert the new bytes being appended, I want to make absolutely sure that Python doesn't handle it by re-writing the entire file along with the new bytes.
I am attempting to keep a running backup of a program log, and it could reach several thousand records in a CSV format.
Python file operations are convenience wrappers over operating system file operations. The operating system either implements these file system operations internally, forwards them to a loadable module (plugin), or hands them to an external server (NFS, SMB). Most operating systems going back to 1971 have been capable of appending data to an existing file, at least all the ones that claim to be even remotely POSIX compliant.
The POSIX append mode simply opens the file for writing and moves the file pointer to the end of the file before each write. This means that all write operations just write past the current end of the file.
There might be a few exceptions to that, for example some routine might use low level system calls to move the file pointer backwards. Or the underlying file system might be not POSIX compliant and use some form of object transactional storage like AWS S3. But for any standard scenario I wouldn't worry about such cases.
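As an illustration (the file name and record are placeholders), Python's append mode is a thin wrapper over O_APPEND:

    import os

    # High-level: "a" mode opens with O_APPEND; each write lands at the end of the file.
    with open("program_log.csv", "a") as f:
        f.write("2024-01-01T12:00:00,INFO,record appended\n")

    # Roughly the same thing with the low-level calls underneath.
    fd = os.open("program_log.csv", os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    os.write(fd, b"2024-01-01T12:00:30,INFO,another record\n")
    os.close(fd)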
However, since you mentioned backup as your use case, you need to be extra careful. Backups are not as easy as they seem on the surface. Things to worry about: the various caches that might hold data in memory before it is written to disk, and what will happen if the power goes out right after you appended new records. Also, what will happen if somebody starts several copies of your program?
And the last thing: unless you are running on a 1980s 8-bit computer, a few thousand CSV lines are nothing to modern hardware. Even if the file were loaded and written back in full, you wouldn't notice any difference.
I am working with some large audio files (~500 MB), with a lot of processing and conversion involved. One of the steps involves writing a file, sending it through a network, then reading the file on arrival, then saving the file based on some logic.
As the network part is irrelevant for me, I wonder what is faster or more efficient: reading and writing actual files, or an io file-like object.
Also, how significant is the performance difference, if at all.
My intuition would say io object would be more efficient, but I do not know how either process works.
The io file-like objects were created to avoid writing temporary files that you don't want to store, just to be able to pass them to other modules and "fool" them into believing that they're actual file handles (there are limitations, but for most usages it's okay).
So yes, using an io.BytesIO object will be faster; even with an SSD drive, reading/writing to RAM wins.
class io.BytesIO([initial_bytes])
A stream implementation using an in-memory bytes buffer.
Now if the data is very big, you're going to run out of memory or swapping will occur. So there's a limit to the amount of data you can store in memory (I remember that old audio editing software did "direct-to-disk" for that very reason: memory was limited at the time, and it was not possible to store several minutes of audio data in memory).
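For example, a sketch of keeping an intermediate audio file purely in memory (the WAV parameters here are arbitrary):

    import io
    import wave

    buf = io.BytesIO()                     # in-memory "file"
    with wave.open(buf, "wb") as w:        # wave accepts any file-like object
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(44100)
        w.writeframes(b"\x00\x00" * 44100) # one second of silence

    buf.seek(0)                            # rewind before the next step reads it
    wav_bytes = buf.read()                 # bytes that never touched the disk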
Background:
I am getting a temperature float from an Arduino via a serial connection. I need to cache this temperature reading every 30 seconds so that other applications (e.g. web, thermostat controller) can access it without overloading the serial connection.
Currently I cache this data to RAM as a file in /run (I'm trying to follow Linux convention). Then, other applications can poll the file for the temperature as often as they want all day long, with I/O now the only bottleneck (this runs on an RPi, so not a lot of enterprise-level need here).
Problem:
I think that when an app reads these files, it risks reading corrupt data. If a writer updates the file while a reader is reading it at the same time, can corrupt data be read, causing the thermostat to behave erratically?
Should I just use sqlite3 as an overkill solution, or use file locks (and does that risk something else not working perfectly)?
This is all taking place in multiple python processes. Is Linux able to handle this situation natively or do I need to apply somehow the principles mentioned here?
Calls to write(2) ought to be atomic under Linux.
Which means that as long as you are writing a single buffer, you can be certain that readers won't read an incomplete record. You might want to use os.write to make sure that no buffering/chunking happens that you are not aware of.
if a read is happening and a file is updated, will the read use the new data while in the middle of a file, or does it somehow know how to get data from the old file (how)?
If there is exactly one read(2) and one write(2), you are guaranteed to see a consistent result. If you split your write into two, it might happen that you write the first part, read and then write the second part which would be an atomicity violation. In case you need to write multiple buffers, either combine them yourself or use writev(2).
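A minimal sketch of that single-buffer pattern (the path and the fixed-width record format are made up for the example):

    import os

    PATH = "/run/thermostat/temperature"   # hypothetical location

    def write_temperature(value: float) -> None:
        payload = f"{value:8.2f}\n".encode("ascii")   # fixed width: the file size never changes
        fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)
        try:
            os.write(fd, payload)          # one buffer, one write(2) call
        finally:
            os.close(fd)

    def read_temperature() -> float:
        fd = os.open(PATH, os.O_RDONLY)
        try:
            data = os.read(fd, 64)         # one read(2) call
        finally:
            os.close(fd)
        return float(data.decode("ascii"))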
The title could probably have been put better, but anyway. I was wondering if there are any functions for writing to files that are like what the ACID properties are for databases. The reason is, I would like to make sure that the file writes I am doing won't mess up and corrupt the file if the power goes out.
Depending on what exactly you're doing with your files and the platform there are a couple options:
If you're serializing a blob from memory to disk repeatedly to maintain state (example: dhcp leases file),
if you're on a POSIX system you can write your data to a temporary file and 'rename' the temporary file to your target. On POSIX-compliant systems this is guaranteed to be an atomic operation; it shouldn't even matter whether the filesystem is journaled or not. If you're on a Windows system, there's a native function named MoveFileTransacted that you might be able to utilize via bindings. But the key concept here is that the temporary file protects your data: if the system reboots, the worst case is that your file contains the last good refresh of data. This option requires that you write the entire file out every time you want a change to be recorded. In the case of a dhcp.leases file this isn't a big performance hit; larger files might prove to be more cumbersome.
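A sketch of that temp-file-and-rename pattern in Python (POSIX-oriented; the target file name is just an example):

    import os
    import tempfile

    def atomic_write(path: str, data: bytes) -> None:
        """Write the whole blob to a temp file, fsync it, then rename it over the target."""
        directory = os.path.dirname(path) or "."
        fd, tmp = tempfile.mkstemp(dir=directory)   # same filesystem, so the rename stays atomic
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)                   # atomic rename on POSIX systems
        except BaseException:
            os.unlink(tmp)
            raise

    atomic_write("dhcp.leases", b"lease 10.0.0.5 { ... }\n")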
If you're reading and writing bits of data constantly, sqlite3 is the way to go -- it supports atomic commits for groups of queries and has its own internal journal. One thing to watch out for here is that atomic commits will be slower due to the overhead of locking the database, waiting for the data to flush, etc.
A couple other things to consider -- if your filesystem is mounted async, writes will appear to be complete because the write() returns, but the data might not be flushed to disk yet. Rename protects you in this case, and so does sqlite3.
If your filesystem is mounted async, it might be possible to write data and move it before the data is written. So if you're on a unix system it might be safest to mount sync. That's on the level of 'people might die if this fails' paranoia, though. But if it's an embedded system and it dies, 'I might lose my job if this fails' is also a good rationalization for the extra protection.
The ZODB is an ACID-compliant database storage written (mostly) in Python, so in a sense the answer is yes. But I can imagine this is a bit overkill :)
Either the OS has to provide this for you, or you'll need to implement your own ACID compliance. For example, by defining 'records' in the file you write and, when opening/reading, verifying which records have been written (which may mean you need to throw away some partially written data). ZODB, for example, implements this by ending a record with the size of the record itself; if you can read this size and it matches, you know the record has been fully written.
And, of course, you always need to append records and not rewrite the entire file.
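A rough sketch of that record-framing idea (the length-plus-checksum layout here is illustrative, not ZODB's actual on-disk format):

    import struct
    import zlib

    def append_record(path, payload: bytes):
        # Frame each record with its length and a CRC32 so a reader can tell
        # whether the last record made it to disk completely.
        header = struct.pack("<II", len(payload), zlib.crc32(payload))
        with open(path, "ab") as f:
            f.write(header + payload)

    def read_records(path):
        records = []
        with open(path, "rb") as f:
            while True:
                header = f.read(8)
                if len(header) < 8:
                    break                      # truncated header: stop here
                length, crc = struct.unpack("<II", header)
                payload = f.read(length)
                if len(payload) < length or zlib.crc32(payload) != crc:
                    break                      # partially written record: discard it
                records.append(payload)
        return records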
It looks to me that your main goal is to ensure the integrity of written files in case of power failures and system crashes. There are a couple of things to consider when doing this:
Ensure that your data is written to disk when you close a file. Even after you close it, some of the data may sit in the OS cache for several seconds waiting to be written to disk. You can force writing to disk with f.flush(), followed by os.fsync(f.fileno()); a minimal example follows this list.
Don't modify existing data before you are certain that the updated data is safely on the disk. This part can be quite tricky (and OS/filesystem dependent).
Use a file format that helps you verify the integrity of the data (e.g. checksums).
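For the first point, a minimal example (the file name is arbitrary):

    import os

    # Make sure the bytes actually reach the disk, not just the OS cache.
    with open("state.bin", "wb") as f:
        f.write(b"important data")
        f.flush()                 # push Python's buffer to the OS
        os.fsync(f.fileno())      # ask the OS to push its buffers to the disk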
Another alternative is to use sqlite3.
EDIT: Regarding my second point, I highly recommend this presentation: http://www.flamingspork.com/talks/2007/06/eat_my_data.odp. This also covers issues with "atomic rename".