With:
import os
for file in files:
    os.remove(file)
Will each call to os.remove() block until the file has been removed, like a synchronous function, or will the loop keep iterating while os.remove() is still working?
Generally, the os module provides wrappers around system calls of the operating system. For example, on Linux the os.remove/os.unlink functions correspond to the unlink system call. These functions wait until the system call has finished.
Whether this means that the high level operation intended by the program has finished depends on the use-case.
For example, unlink merely removes the path pointing to the file content; if there are other paths for the same file (i.e. hardlinks) or processes with a file handle on it, the file content remains. Only when all references are gone is the file content eligible for removal from the filesystem (similar to reference counting). The filesystem itself may arbitrarily delay removal of the content, and distributed filesystems may have additional consistency and synchronisation constraints.
As a rule of thumb, if there are no special requirements then it is fine to consider the os call to be prompt and synchronous. If there are specific requirements, such as file content being completely destroyed, read up on the specific behaviour of the involved components.
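As a POSIX-only illustration of the unlink behaviour described above (the file name is hypothetical): os.remove() returns only once the unlink(2) call has completed, but an open handle keeps the content alive.

import os

path = "example.txt"          # hypothetical file name
with open(path, "w") as f:
    f.write("hello")

handle = open(path)           # keep a second handle open on the file
os.remove(path)               # returns only after unlink(2) has completed

print(os.path.exists(path))   # False: the name is gone
print(handle.read())          # "hello": the content survives until the last handle closes
handle.close()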
What I want to realize has the following features:
A Python program (or say process, thread, ...) creates a memory file that can be read and written.
As long as the program is alive, the file data exists only in memory (NOT on disk). Once the program is no longer alive, no data is left behind.
However, there is an interface on disk with a filename. This interface is linked to the memory file, and read and write operations on the interface are possible.
Why not use in-memory IO?
The memory file will be the input of another program (not Python), so a file name is needed.
Why not use tempfile?
The major reason is security. The finalization of a tempfile differs between operating systems (right?), and in some cases, such as the OS being interrupted, data may remain on disk. So holding the data inside the program seems more secure (at least to an extent).
Anyway, I just want to try and see whether tempfile can be avoided.
You could consider using a named pipe (using mkfifo). Another option is to create an actual file which the two programs open. Once both open it, you can unlink it so that it's no longer accessible on disk.
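For the named-pipe option, a minimal POSIX-only sketch (the path is hypothetical) could look like this; the data passes through the kernel's pipe buffer and never touches the disk.

import os

fifo_path = "/tmp/my_fifo"            # hypothetical path
os.mkfifo(fifo_path, 0o600)           # create a named pipe; only the name exists on disk

# The other program can now open /tmp/my_fifo by name and read from it.
with open(fifo_path, "w") as pipe:    # blocks until a reader opens the pipe
    pipe.write("data that never touches the disk\n")

os.remove(fifo_path)                  # remove the pipe's name when done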
A Python process is writing to a file, and the file has been deleted/moved by an external process (a cron job in my case).
The Python process will continue to execute without any errors (expected, since the data is written to a buffer rather than straight to the file and is flushed on f.close()). Yet no new file will be created in this case, and the buffer will be silently discarded (correct me if I'm wrong).
Is there a pythonic way to handle this, instead of checking whether the file exists and creating it if not before every write operation?
There is no "pythonic" way to do this because the question isn't about a specific language. It's an operating system question. So the answer is going to be different for MS Windows than it is for a UNIX like OS such as Linux or macOS. To do this efficiently requires using a facility such as the Linux inotify API. A simpler approach that will work on any UNIX like OS is to open the file then call os.fstat() and remember the st_ino member of the returned object. Then periodically call os.stat() on the path name and compare its st_ino value to the one you saved earlier. If it changes, or the os.stat() call fails, then you know the file name you are writing to is no longer the same file.
Suppose I have a large number of Python processes launching at the same time from a common directory.
If a Python source file has been recently modified the interpreter will compile a .pyc file.
If there are multiple processes simultaneously trying to build a .pyc for the same Python source file, can this create a race condition or other issues?
Will Python (or cpython specifically) guarantee concurrency protection when compiling?
I'm aware of methods that could be used to avoid this, I'm only interested in understanding if this use case can be problematic.
Generally no. When CPython writes a bytecode cache file, it first writes to a temporary file and then moves it to the desired location with os.replace. os.replace uses the rename(2) system call underneath, and rename() is atomic as long as the OS/filesystem does not crash in the middle. As a result, the bytecode cache file write is atomic.
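The same write-to-a-temporary-file-then-os.replace pattern can be reused for any file you want to update atomically; a minimal sketch (file names are hypothetical):

import os
import tempfile

def atomic_write(path, data):
    """Write bytes to path so readers only ever see the old or the new content."""
    # Create the temporary file in the same directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_path, path)    # atomic rename on both POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)
        raise

atomic_write("cache.bin", b"new contents")    # hypothetical target file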
IMHO what you should worry about is the staleness check for the bytecode cache file. Python checks cache file freshness against the source file's stat.mtime (and file size). The caveat is that the mtime resolution used by Python is one second, so if one process modifies the source file while another process is writing the cache file within the same second, the bytecode cache file is left inconsistent with the source and will not get refreshed later. The good news is that PEP 552 has been accepted, which switches to hash-based cache files and takes care of this case.
The description of tempfile.NamedTemporaryFile() says:
If delete is true (the default), the file is deleted as soon as it
is closed.
In some circumstances, this means that the file is not deleted after the
Python interpreter ends. For example, when running the following test under
py.test, the temporary file remains:
from __future__ import division, print_function, absolute_import

import tempfile
import unittest2 as unittest

class cache_tests(unittest.TestCase):
    def setUp(self):
        self.dbfile = tempfile.NamedTemporaryFile()

    def test_get(self):
        self.assertEqual('foo', 'foo')
In some way this makes sense, because this program never explicitly
closes the file object. The only other way for the object to get closed
would presumably be in the __del__ destructor, but here the language
reference states that "It is not guaranteed that __del__() methods are
called for objects that still exist when the interpreter exits." So
everything is consistent with the documentation so far.
However, I'm confused about the implications of this. If it is not
guaranteed that file objects are closed on interpreter exit, can it
possibly happen that some data that was successfully written to a
(buffered) file object is lost even though the program exits gracefully,
because it was still in the file object's buffer, and the file object
never got closed?
Somehow that seems very unlikely and un-pythonic to me, and the open()
documentation doesn't contain any such warnings either. So I
(tentatively) conclude that file objects are, after all, guaranteed to
be closed.
But how does this magic happen, and why can't NamedTemporaryFile() use
the same magic to ensure that the file is deleted?
Edit: Note that I am not talking about file descriptors here (that are buffered by the OS and closed by the OS on program exit), but about Python file objects that may implement their own buffering.
On Windows, NamedTemporaryFile uses a Windows-specific extension (os.O_TEMPORARY) to ensure that the file is deleted when it is closed. This probably also works if the process is killed in any way. However there is no obvious equivalent on POSIX, most likely because on POSIX you can simply delete files that are still in use; it only deletes the name, and the file's content is only removed after it is closed (in any way). But indeed assuming that we want the file name to persist until the file is closed, like with NamedTemporaryFile, then we need "magic".
We cannot use the same magic as for flushing buffered files. What happens there is that the C library handles it (in Python 2): the files are FILE objects in C, and the C library guarantees that they are flushed on normal program exit (but not if the process is killed). In Python 3, custom C code achieves the same effect, but it is specific to this use case and not directly reusable.
That's why NamedTemporaryFile uses a custom __del__. And indeed, __del__ methods are not guaranteed to be called when the interpreter exits. (We can prove it with a global reference cycle that also references a NamedTemporaryFile instance, or by running PyPy instead of CPython.)
As a side note, NamedTemporaryFile could be implemented a bit more robustly, e.g. by registering itself with atexit to ensure that the file name is removed then. But you can call it yourself too: if your process doesn't use an unbounded number of NamedTemporaryFiles, it's simply atexit.register(my_named_temporary_file.close).
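For example, a minimal sketch of that atexit approach, assuming a single temporary file:

import atexit
import tempfile

tmp = tempfile.NamedTemporaryFile()

# Ensure the file is closed (and therefore deleted) on normal interpreter exit,
# even if __del__ never runs. This does not help if the process is killed.
atexit.register(tmp.close)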
On any version of *nix, all file descriptors are closed when a process finishes, and this is taken care of by the operating system. Windows is likely exactly the same in this respect. Without digging in the source code, I can't say with 100% authority what actually happens, but likely what happens is:
If delete is False, unlink() (or a function similar to it on other operating systems) is called. This means that the file will automatically be deleted when the process exits and there are no more open file descriptors. While the process is running, the file will still remain around.
If delete is True, likely the C function remove() is used. This will forcibly delete the file before the process exits.
The file buffering is handled by the operating system. If you do not close a file after you open it, you are assuming that the operating system will flush the buffer and close the file after the owner exits. This is not Python magic; this is your OS doing its thing. The __del__() method, by contrast, is a Python-level mechanism and is not guaranteed to be called.
I'm working on a python server which concurrently handles transactions on a number of databases, each storing performance data about a different application. Concurrency is accomplished via the Multiprocessing module, so each transaction thread starts in a new process, and shared-memory data protection schemes are not viable.
I am using SQLite as my DBMS and have opted to set up each application's DB in its own file. Unfortunately, this introduces a race condition on DB creation: if two processes attempt to create a DB for the same new application at the same time, both will create the file where the DB is to be stored. My research leads me to believe that one cannot lock a file before it is created; is there some other mechanism I can use to ensure that the file is not created and then written to concurrently?
Thanks in advance,
David
The usual Unix-style way of handling this for regular files is to just try to create the file and see if it fails. In Python's case, that would be:
import os

try:
    fd = os.open(filename, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
except OSError:  # FileExistsError in Python 3
    # Someone else created it already.
    pass
At the very least, you can use this method to try to create a "lock file" with a similar name to the database. If the lock file is created, you go ahead and make the database. If not, you do whatever you need to for the "database exists" case.
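A rough sketch of that lock-file pattern; create_database() is a hypothetical helper, and the waiting loop is just one way to handle the losing side:

import os
import time

def ensure_database(db_path):
    if os.path.exists(db_path):
        return                             # database already set up
    lock_path = db_path + ".lock"          # lock file with a similar name to the database
    try:
        fd = os.open(lock_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        # Another process is creating the database; wait for it to appear.
        while not os.path.exists(db_path):
            time.sleep(0.1)
        return
    try:
        create_database(db_path)           # hypothetical function that sets up the DB
    finally:
        os.close(fd)
        os.unlink(lock_path)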
Name your database files in such a way that they are guaranteed not to collide.
http://docs.python.org/library/tempfile.html
You could catch the error when trying to create the file and, in your exception handler, check whether the file exists and use the existing file instead of creating it.
You didn't mention the platform, but on Linux open(2), or os.open() in Python, takes a flags parameter which you can use. The O_CREAT flag creates the file if it does not exist, and the O_EXCL flag gives you an error if the file already exists. You'll also need O_RDONLY, O_WRONLY or O_RDWR to specify the access mode. You can find these constants in the os module.
For example: fd = os.open(filename, os.O_RDWR | os.O_CREAT | os.O_EXCL)
You can use the POSIX O_EXCL and O_CREAT flags to open(2) to guarantee that only a single process gets the file and thus the database; O_EXCL won't work over NFSv2 or earlier, and it'd be pretty shaky to rely on it for other network filesystems.
The liblockfile library implements a network-filesystem safe locking mechanism described in the open(2) manpage, which would be convenient; but I only see pre-made Ruby and Perl bindings. Depending upon your needs, maybe providing Python bindings would be useful, or perhaps just re-implementing the algorithm:
O_EXCL Ensure that this call creates the file: if this flag is
specified in conjunction with O_CREAT, and pathname
already exists, then open() will fail. The behavior of
O_EXCL is undefined if O_CREAT is not specified.
When these two flags are specified, symbolic links are not
followed: if pathname is a symbolic link, then open()
fails regardless of where the symbolic link points to.
O_EXCL is only supported on NFS when using NFSv3 or later
on kernel 2.6 or later. In environments where NFS O_EXCL
support is not provided, programs that rely on it for
performing locking tasks will contain a race condition.
Portable programs that want to perform atomic file locking
using a lockfile, and need to avoid reliance on NFS
support for O_EXCL, can create a unique file on the same
file system (e.g., incorporating hostname and PID), and
use link(2) to make a link to the lockfile. If link(2)
returns 0, the lock is successful. Otherwise, use stat(2)
on the unique file to check if its link count has
increased to 2, in which case the lock is also successful.
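A rough Python translation of that manpage algorithm, with hypothetical file names; releasing the lock is just os.unlink(lockfile):

import os
import socket

def acquire_lock(lockfile):
    """NFS-safe lock acquisition via link(2), as described in the open(2) manpage."""
    # Create a unique file on the same filesystem (hostname + PID make it unique).
    unique = "%s.%s.%d" % (lockfile, socket.gethostname(), os.getpid())
    fd = os.open(unique, os.O_WRONLY | os.O_CREAT, 0o644)
    os.close(fd)
    try:
        try:
            os.link(unique, lockfile)      # atomic, even over NFS
            return True
        except OSError:
            # link() failed; the lock still succeeded if our link count reached 2.
            return os.stat(unique).st_nlink == 2
    finally:
        os.unlink(unique)                  # the unique name is no longer needed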