How to safely write to a file? - python

Imagine you have a library for working with some sort of XML file or configuration file. The library reads the whole file into memory and provides methods for editing the content. When you are done manipulating the content, you can call a write method to save the content back to the file. The question is how to do this in a safe way.
Overwriting the existing file (starting to write to the original file) is obviously not safe. If the write method fails before it is done, you end up with a half-written file and you have lost data.
A better option would be to write to a temporary file somewhere, and when the write method has finished, you copy the temporary file to the original file.
Now, if the copy somehow fails, you still have correctly saved data in the temporary file. And if the copy succeeds, you can remove the temporary file.
On POSIX systems I guess you can use the rename system call which is an atomic operation. But how would you do this best on a Windows system? In particular, how do you handle this best using Python?
Also, is there another scheme for safely writing to files?

If you look at Python's documentation, it mentions that os.rename() is an atomic operation (on POSIX systems, provided it succeeds). So in your case, writing data to a temporary file and then renaming it to the original file would be quite safe.
Another way could work like this:
let original file be abc.xml
create abc.xml.tmp and write new data to it
rename abc.xml to abc.xml.bak
rename abc.xml.tmp to abc.xml
after new abc.xml is properly put in place, remove abc.xml.bak
As you can see, you still have abc.xml.bak, which you can use to restore the original if anything goes wrong with the .tmp file or while putting it in place.

If you want to be POSIXly correct and safe, you have to:
Write to temporary file
Flush and fsync the file (or fdatasync)
Rename over the original file
Note that calling fsync has unpredictable effects on performance -- Linux on ext3 may stall for disk I/O whole numbers of seconds as a result, depending on other outstanding I/O.
Notice that rename is not an atomic operation in POSIX -- at least not in relation to file data as you expect. However, most operating systems and filesystems will work this way. But it seems you missed the very large Linux discussion about ext4 and filesystem guarantees about atomicity. I don't know exactly where to link, but here is a start: ext4 and data loss.
Notice, however, that on many systems rename will be as safe in practice as you expect. Still, it is in a way not possible to get both -- performance and reliability -- across all possible Linux configurations!
With a write to a temporary file, then a rename of the temporary file, one would expect the operations are dependent and would be executed in order.
The issue, however, is that most, if not all, filesystems separate metadata and data. A rename is only metadata. It may sound horrible to you, but filesystems value metadata over data (take journaling in HFS+ or ext3/4, for example)! The reason is that metadata is lighter, and if the metadata is corrupt, the whole filesystem is corrupt -- the filesystem must of course preserve itself first, then preserve the user's data, in that order.
Ext4 did break the rename expectation when it first came out, but heuristics were added to resolve it. The issue is not a failed rename, but a successful one: ext4 might successfully register the rename but fail to write out the file data if a crash comes shortly thereafter. The result is then a 0-length file and neither the original nor the new data.
So in short, POSIX makes no such guarantee. Read the linked Ext4 article for more information!

In the Win32 API I found the quite nice function ReplaceFile, which does what the name suggests, even with an optional backup. There is also always the DeleteFile/MoveFile combination.
In general, what you want to do is really good, and I cannot think of any better write scheme.

A simplistic solution: use tempfile to create a temporary file, and if writing succeeds, just rename the file to your original configuration file.
Note that rename is not atomic across filesystems. You'll have to resort to a slight workaround (e.g. a tempfile on the target filesystem, followed by a rename) in order to be really safe.
For locking a file, see portalocker.

The standard solution is this.
Write a new file with a similar name. X.ext# for example.
When that file has been closed (and perhaps even read and checksummed), then you do two renames.
X.ext (the original) to X.ext~
X.ext# (the new one) to X.ext
(Only for the crazy paranoids) call the OS sync function to force dirty buffer writes.
At no time is anything lost or corruptible. The only glitch can happen during the renames, but even then you haven't lost anything or corrupted anything. The original is recoverable right up until the final rename.

Per RedGlyph's suggestion, I'm adding an implementation of ReplaceFile that uses ctypes to access the Windows APIs. I first added this to jaraco.windows.api.filesystem.
from ctypes import windll
from ctypes.wintypes import BOOL, DWORD, LPWSTR, LPVOID

ReplaceFile = windll.kernel32.ReplaceFileW
ReplaceFile.restype = BOOL
ReplaceFile.argtypes = [
    LPWSTR,
    LPWSTR,
    LPWSTR,
    DWORD,
    LPVOID,
    LPVOID,
]

REPLACEFILE_WRITE_THROUGH = 0x1
REPLACEFILE_IGNORE_MERGE_ERRORS = 0x2
REPLACEFILE_IGNORE_ACL_ERRORS = 0x4
I then tested the behavior using this script.
from jaraco.windows.api.filesystem import ReplaceFile
import os
open('orig-file', 'w').write('some content')
open('replacing-file', 'w').write('new content')
ReplaceFile('orig-file', 'replacing-file', 'orig-backup', 0, 0, 0)
assert open('orig-file').read() == 'new content'
assert open('orig-backup').read() == 'some content'
assert not os.path.exists('replacing-file')
While this only works in Windows, it appears to have a lot of nice features that other replace routines would lack. See the API docs for details.

There's now a codified, pure-Python, and I dare say Pythonic solution to this in the boltons utility library: boltons.fileutils.atomic_save.
Just pip install boltons, then:
from boltons.fileutils import atomic_save
with atomic_save('/path/to/file.txt') as f:
    f.write('this will only overwrite if it succeeds!\n')
There are a lot of practical options, all well-documented. Full disclosure, I am the author of boltons, but this particular part was built with a lot of community help. Don't hesitate to drop a note if something is unclear!

You could use the fileinput module to handle the backing-up and in-place writing for you:
import fileinput
for line in fileinput.input(filename, inplace=True, backup='.bak'):
    # inplace=True causes the original file to be moved to a backup
    # standard output is redirected to the original file.
    # backup='.bak' specifies the extension for the backup file.

    # manipulate line
    newline = process(line)
    print(newline)
If you need to read in the entire contents before you can write the new lines, you can read the file first and then print the entire new contents in a single pass:
# read the original contents first
with open(filename) as f:
    contents = f.read()

newcontents = process(contents)
for line in fileinput.input(filename, inplace=True, backup='.bak'):
    print(newcontents)
    break
If the script ends abruptly, you will still have the backup.

Related

Archive files directly from memory in Python

I'm writing a program where I get a number of files and zip them with encryption using pyzipper; I'm also using io.BytesIO() to write these files to it so I keep them in memory. So now, after some other additions, I want to get all of these in-memory files and zip them together in a single encrypted zip file using the same pyzipper.
The code looks something like this:
from io import BytesIO

import pyzipper

# Create the in-memory file object
in_memory = BytesIO()

# Create the zip file and open in write mode
with pyzipper.AESZipFile(in_memory, "w", compression=pyzipper.ZIP_LZMA, encryption=pyzipper.WZ_AES) as zip_file:
    # Set password
    zip_file.setpassword(b"password")
    # Save "data" with file_name
    zip_file.writestr(file_name, data)

# Go to the beginning
in_memory.seek(0)
# Read the zip file data
data = in_memory.read()
# Add the data to a list
files.append(data)
So, as you may guess the "files" list is an attribute from a class and the whole thing above is a function that does this a number of times and then you get the full files list. For simplicity's sake, I removed most of the irrelevant parts.
I get no errors for now, but when I try to write all files to a new zip file I get an error. Here's the code:
with pyzipper.AESZipFile(test_name, "w", compression=pyzipper.ZIP_LZMA, encryption=pyzipper.WZ_AES) as zfile:
    zfile.setpassword(b"pass")
    for file in files:
        zfile.write(file)
I get a ValueError because of os.stat:
File "C:\Users\vulka\AppData\Local\Programs\Python\Python310\lib\site-packages\pyzipper\zipfile.py", line 820, in from_file
st = os.stat(filename)
ValueError: stat: embedded null character in path
[WHAT I TRIED]
So, I tried using mmap for this purpose but I don't think this can help me and if it can - then I have no idea how to make it work.
I also tried using fs.memoryfs.MemoryFS to temporarily create a virtual filesystem in memory to store all the files, then get them back to zip everything together and save it to disk. Again - it failed. I got tons of different errors in my tests and, TBH, there's very little information out there on this fs method, and even if what I'm trying to do is possible, I couldn't figure it out.
P.S.: I don't know if pyzipper (almost 1:1 zipfile, with the addition of encryption) supports nested zip files at all. This could be the problem I'm facing, but if it doesn't, I'm open to any suggestions for a new approach to doing this. Also, I don't want to rely on 3rd-party software, even if it is open source! (I'm talking about the method of using 7zip to do all the archiving and encryption, even though it shouldn't even be possible to use it without saving the files to disk first, which is the main thing I'm trying to avoid.)

Python 3.4: Temporary files not being deleted automatically

If you're using the tempfile library, then you might have experienced the scenario where the temporary file is not automatically deleted even after your program has completed. This is quite crucial for programs that run many times, since their directory gets messy once the temporary files pile up.
This is a self-answered Q&A, so here's my solution.
Apparently, the way I was originally creating temporary files did not delete them automatically, but using tempfile.NamedTemporaryFile as a context manager (here with a prefix so the file is easy to identify) does clean the file up on exit:
import os
import tempfile

with tempfile.NamedTemporaryFile(prefix="anything_",
                                 dir=os.getcwd()) as tempf:
    '''put something'''
    tempf.seek(0)
Notes:
The os.getcwd() call makes the temporary file live in the current working directory.
Your temporary file will be named anything_ + (random value) (e.g. anything_23mem).
Hope it helps.

A safe, atomic file-copy operation

I need to copy a file from one location to another, and I need to throw an exception (or at least somehow recognise) if the file already exists at the destination (no overwriting).
I can check first with os.path.exists() but it's extremely important that the file cannot be created in the small amount of time between checking and copying.
Is there a built-in way of doing this, or is there a way to define an action as atomic?
There is in fact a way to do this, atomically and safely, provided all actors do it the same way. It's an adaptation of the lock-free whack-a-mole algorithm, and not entirely trivial, so feel free to go with "no" as the general answer ;)
What to do
Check whether the file already exists. Stop if it does.
Generate a unique ID
Copy the source file to the target folder with a temporary name, say, <target>.<UUID>.tmp.
Rename† the copy <target>-<UUID>.mole.tmp.
Look for any other files matching the pattern <target>-*.mole.tmp.
If their UUID compares greater than yours, attempt to delete it. (Don't worry if it's gone.)
If their UUID compares less than yours, attempt to delete your own. (Again, don't worry if it's gone.) From now on, treat their UUID as if it were your own.
Check again to see if the destination file already exists. If so, attempt to delete your temporary file. (Don't worry if it's gone. Remember your UUID may have changed in step 5.)
If you didn't already attempt to delete it in step 6, attempt to rename your temporary file to its final name, <target>. (Don't worry if it's gone, just jump back to step 5.)
You're done!
How it works
Imagine each candidate source file is a mole coming out of its hole. Half-way out, it pauses and whacks any competing moles back into the ground, before checking that no other mole has fully emerged. If you run this through in your head, you should see that only one mole will ever make it all the way out. To prevent this system from livelocking, we add a total ordering on which mole can whack which. Bam! A lock-free algorithm worthy of a PhD thesis.
† Step 4 may look unnecessary—why not just use that name in the first place? However, another process may "adopt" your mole file in step 5, and make it the winner in step 7, so it's very important that you're not still writing out the contents! Renames on the same file system are atomic, so step 4 is safe.
There is no way to do this; file copy operations are never atomic and there is no way to make them so.
But you can write the file under a random, temporary name and then rename it. Rename operations have to be atomic. If the file already exists, the rename will fail and you'll get an error.
[EDIT2] rename() is only atomic if you do it in the same file system. The safe way is to create the new file in the same folder as the destination.
[EDIT] There is a lot of discussion whether rename is always atomic or not and about the overwrite behavior. So I dug up some resources.
On Linux, if the destination exists and both source and destination are files, then the destination is silently overwritten (man page). So I was wrong there.
But rename(2) still guarantees that either the original file or the new file remains valid if something goes wrong, so the operation is atomic in the sense that it can't corrupt data. It's not atomic in the sense of preventing two processes from doing the same rename at the same time with a predictable result: one will win, but you can't tell which.
On Windows, if another process is currently writing the file, you get an error if you try to open it for writing, so one advantage for Windows, here.
If your computer crashes while the operation is being written to disk, the implementation of the filesystem decides how much data gets corrupted. There is nothing an application can do about this. So stop whining already :-)
There is also no other approach that works better or even just as well as this one.
You could use file locking instead. But that would just make everything more complex and yield no additional advantages (besides being more complicated which some people do see as a huge advantage for some reason). And you'd add a lot of nice corner cases when your file is on a network drive.
You could use open(2) with the flags O_CREAT and O_EXCL, which makes the call fail if the file already exists. But that wouldn't prevent a second process from deleting the file and writing its own copy.
Or you could create a lock directory since creating directories has to be atomic as well. But that would not buy you much, either. You'd have to write the locking code yourself and make absolutely, 100% sure that you really, really always delete the lock directory in case of disaster - which you can't.
A while back my team needed a mechanism for atomic writes in Python, and we came up with the following code (also available in a gist):
import os
import shutil
import stat
import tempfile


def copy_with_metadata(source, target):
    """Copy file with all its permissions and metadata.

    Lifted from https://stackoverflow.com/a/43761127/2860309

    :param source: source file name
    :param target: target file name
    """
    # copy content, stat-info (mode too), timestamps...
    shutil.copy2(source, target)
    # copy owner and group
    st = os.stat(source)
    os.chown(target, st[stat.ST_UID], st[stat.ST_GID])


def atomic_write(file_contents, target_file_path, mode="w"):
    """Write to a temporary file and rename it to avoid file corruption.

    Attribution: #therightstuff, #deichrenner, #hrudham

    :param file_contents: contents to be written to file
    :param target_file_path: the file to be created or replaced
    :param mode: the file mode defaults to "w", only "w" and "a" are supported
    """
    # Use the same directory as the destination file so that moving it across
    # file systems does not pose a problem.
    temp_file = tempfile.NamedTemporaryFile(
        delete=False,
        dir=os.path.dirname(target_file_path))
    try:
        # preserve file metadata if it already exists
        if os.path.exists(target_file_path):
            copy_with_metadata(target_file_path, temp_file.name)
        with open(temp_file.name, mode) as f:
            f.write(file_contents)
            f.flush()
            os.fsync(f.fileno())

        os.replace(temp_file.name, target_file_path)
    finally:
        if os.path.exists(temp_file.name):
            try:
                os.unlink(temp_file.name)
            except OSError:
                pass
With this code, copying a file atomically is as simple as reading it into a variable and then sending it to atomic_write.
The comments should provide a good idea of what's going on but I also wrote up this more complete explanation on Medium for anyone interested.

I want a clever algorithm for indexing a file directory...pointers?

I have a directory of music files (.mp3, .wav, etc.) on Ubuntu. This directory can have as many subdirectories as it needs, no limits. I want to be able to make a music library out of it - that is, return a list of songs based on filters of:
1) membership in a playlist
2) artist name
3) string search
4) name of song
etc, etc
However, if file names are changed, or files are moved or even added to my Music directory, I need to be able to reflect this in my music organization engine - quickly!
I originally thought to just monitor my directory with pyinotify, incron, or inotify. Unfortunately my directory is a Samba share and so monitoring file events failed. So my next guess was to simply recursively search the directory in python, and populate a SQL database. Then when updating, I would just look to see if anything has changed (scanning each subfolder to see if each song's name is in the database already, and if not adding it), and make UPDATEs accordingly. Unfortunately, this seems to be a terrible O(n^2) implementation - awful for a multi-terabyte music collection.
A slightly better one might involve creating a tree structure in SQL, thus narrowing the possible candidates to search for a match at any given subfolder step to the size of that subfolder. Still seems inelegant.
What design paradigms/packages can I use to help myself out? This will obviously involve lots of clever hash tables. I'm just looking for some pointers in the right direction for how to approach the problem. (Also, I'm a complete junkie for optimization.)
The hard part of this is the scanning of the directory, just because it can be expensive.
But that's a cruel reality since you can't use inotify et al.
In your database, simply create a node type record:
create table node (
    nodeKey integer not null primary key,
    parentNode integer references node(nodeKey),  -- allow null for the root, or have root point to itself, whatever
    fullPathName varchar(2048),
    nodeName varchar(2048),
    nodeType varchar(1)  -- d = directory, f = file, or whatever else you want
);
That's your node structure.
You can use the full path column to quickly find anything by the absolute path.
When a file moves, simply recalculate the path.
Finally, scan your music files. In Unix, you can do something like:
find . -type f | sort > sortedListOfFiles
Next, simply suck all of the path names out of the database.
select fullPathName from node where nodeType != 'd' order by fullPathName
Now you have two sorted lists of files.
Run them through DIFF (or comm), and you'll have a list of deleted and new files. You won't have a list of "moved" files. If you want to do some heuristic where you compare new and old files and they have the same endings (i.e. ..../album/song) to try and detect "moves" vs new and old, then fine, no big deal. Worth a shot.
But diff will give you your differential in a heartbeat.
If you have zillions of files, then, sorry, this is going to take some time -- but you already knew that when you lost the inotify capability. If you had that, it would just be incremental maintenance.
When a file moves, it's trivial to find its new absolute path, because you can ask its parent for its path and simply append your name to it. After that, you're not crawling a tree or anything, unless you want to. Works both ways.
Addenda:
If you want to track actual name changes, you can get a little more information.
You can do this:
find . -type f -print0 | xargs -0 ls -i | sort -n > sortedListOfFileWithInode
The -print0 and -0 are used to work with file names containing spaces. Quotes in the file names will wreck this, however. You might be better off running the raw list through Python and os.stat() to get the inode. There are different things you can do here.
What this does is rather than just having names, you also get the inode of the file. The inode is the "real" file, a directory links names to inodes. This is how you can have multiple names (hard links) in a unix file system to a single file, all of the names point to the same inode.
When a file is renamed, the inode will remain the same. In Unix there's a single command used for renaming and moving files, mv. When mv renames or moves the file, the inode stays the same AS LONG AS THE FILE IS ON THE SAME FILE SYSTEM.
So, using the inode as well as the file name will let you capture some more interesting information, like file moves.
It won't help if they delete the file and add a new one. But you WILL (likely) be able to tell that this happened, since it is unlikely that the old inode will be reused for the new file.
So if you have a list of files (sorted by file name):
1234 song1.mp3
1235 song2.mp3
1236 song3.mp3
and someone removes and adds back song 2, you'll have something like
1234 song1.mp3
1237 song2.mp3
1236 song3.mp3
But if you do this:
mv song1.mp3 song4.mp3
You'll get:
1237 song2.mp3
1236 song3.mp3
1234 song4.mp3
The other caveat is that if you lose the drive and restore it from backup, likely all of the inodes will change, forcing effectively a rebuild of your index.
If you're real adventurous you can try playing with extended file system attributes and assign other interesting meta data to files. Haven't done much with that, but it's got possibilities as well, and there are likely unseen dangers, but...
My aggregate_digup program reads an extended sha1sum.txt-format file produced by the digup program. This lets me locate a file based on its sha1sum. The digup program stores the mtime, size, hash and pathname in its output; by default it skips hashing a file if the mtime and size match. The index produced by my aggregate_digup is used by my modified version of the open-uri context menu gedit plugin, allowing one to option-click on sha1:b7d67986e54f852de25e2d803472f31fb53184d5 and it'll list the copies of the file it knows about so you can pick one and open it.
How this relates to the problem is that there are two parts: one, the playlists, and two, the files.
If we can assume that nothing the player does changes the files, then the hashes and sizes of the files are constant. So we should be able to use the size and hash of a file as a unique identifier.
For example, the key for the file mentioned: 222415:b7d67986e54f852de25e2d803472f31fb53184d5
I've found that in practice this has no collisions in any natural collection.
(This does mean that the ID3 metadata, which is appended or prepended to the mp3 data, can't change unless you choose to skip that metadata while hashing.)
So the playlist database would be something like this:
files(file_key, hash, size, mtime, path, flag)
tracks(file_key, title, artist)
playlists(playlistid, index, file_key)
To update the files table (pseudocode mixing Python and SQL):
import os
import stat

# add new files:
update files set flag=0
for path in filesystem:
    s = os.stat(path)
    if stat.S_ISREG(s.st_mode):
        fetch first row of select mtime, hash, size from files where path=path
        if row is not None:
            if s.st_mtime == mtime and s.st_size == size:
                update files set flag=1 where path=path
                continue
        hash = hash_file(path)
        file_key = "%s:%s" % (s.st_size, hash)  # size:hash, as described above
        insert or update files set file_key=file_key, size=s.st_size, mtime=s.st_mtime, hash=hash, flag=1 where path=path

# remove non-existent files:
delete from files where flag=0
The reality is, this is a hard problem. You're starting from a disadvantage as well: Python and MySQL aren't the fastest tools for this purpose.
Even iTunes gets complaints about the time it takes to import libraries and index new files. Can you imagine the man-hours that went into making iTunes as good as it is?
Your best bet is to look at the code of major open source music players such as
Miro, http://www.getmiro.com/,
Banshee, http://banshee.fm/, and
Songbird, http://getsongbird.com/
and try to adapt their algorithms to your purpose and to Python idioms.
import os
import re

# your other code here that initially sets up a dictionary containing which
# files you already have in your library (I called the dictionary archived_music)

music_directory = '/home/username/music'
music_type = re.compile(r'\.mp3$|\.wav$|\.etc$')
# -mtime -1 finds files modified within the last day
found_files = os.popen('find %s -type f -mtime -1 2>/dev/null' % music_directory)

for file in found_files:
    file = file.rstrip('\n')
    directory, filename = os.path.split(file)
    if music_type.search(filename):
        # found a music file, check if you already have it in the library
        if filename in archived_music:
            continue
        # if you have gotten to this point, the music was not found in the archived
        # music dictionary, so now perform whatever processing you would like to do
        # on the full path found in file.
You can use this code as a little function or whatever and call it at whatever time resolution you would like. It uses the find command to locate every file created or modified within the last day, then checks whether each one matches music_type; if it does, it checks the filename against whatever current database you have set up, and you can continue processing from there. This should get you started on updating newly added music or whatnot.
I've done something similar in the past, but ended up utilizing Amarok w/ MySQL. Amarok will create a mysql database for you and index all your files quite nicely - after that interfacing with the database should be relatively straightforward from python.
It was quite a time saver for me :)
HTH

Python os.rename if file system is full

I'm asking this because there's no way for me to try it myself (if there is one, please share it (: ).
I'm doing some file handling with Python os library, specifically file moving/renaming with os.rename().
The Python docs explain some of the exceptions this function might raise here, but don't say anything about the case of a full file system. My guess is that it raises an IOError - is this right?
Cheers.
In practice this should rarely come up, but if you want to test I'd recommend creating a small file system (I don't know what OS you are on, but this could be on a virtual partition, a RAM disk, a flash drive, etc.) and loading it up with garbage files to see what happens. Something like this maybe:
aBigNumber = 100000000  # 100 MB per file -- big enough to fill a small test filesystem quickly
counter = 0
while True:
    counter += 1
    anotherFile = open(str(counter) + ".txt", "w")
    anotherFile.write("0" * aBigNumber)
    anotherFile.close()
When you get an exception, you should be able to verify that the disk is full and then you'll know what kind of error to expect.
You can test it by filling up a small partition and then trying the file operations on the filled filesystem. On *nix systems you can mount a tmpfs; for Windows, maybe use a USB stick.
